diff --git a/.buildkite/sdist.yml b/.buildkite/sdist.yml deleted file mode 100644 index 9b94e3752..000000000 --- a/.buildkite/sdist.yml +++ /dev/null @@ -1,11 +0,0 @@ -steps: - - - command: "fab env clean make test sdist" - label: ":dizzy: :python:" - artifact_paths: "dist/*.tar.gz" - - wait - - trigger: "spacy-sdist-against-models" - label: ":dizzy: :hammer:" - build: - env: - SPACY_VERSION: "{$SPACY_VERSION}" diff --git a/.buildkite/train.yml b/.buildkite/train.yml deleted file mode 100644 index b257db87c..000000000 --- a/.buildkite/train.yml +++ /dev/null @@ -1,11 +0,0 @@ -steps: - - - command: "fab env clean make test wheel" - label: ":dizzy: :python:" - artifact_paths: "dist/*.whl" - - wait - - trigger: "spacy-train-from-wheel" - label: ":dizzy: :train:" - build: - env: - SPACY_VERSION: "{$SPACY_VERSION}" diff --git a/.github/contributors/tiangolo.md b/.github/contributors/tiangolo.md new file mode 100644 index 000000000..5fd253fe9 --- /dev/null +++ b/.github/contributors/tiangolo.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. 
This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [ ] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. 
+ +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Sebastián Ramírez | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 2020-07-01 | +| GitHub username | tiangolo | +| Website (optional) | | diff --git a/.gitignore b/.gitignore index eb6be73dd..4dbcd67f7 100644 --- a/.gitignore +++ b/.gitignore @@ -18,8 +18,7 @@ website/.npm website/logs *.log npm-debug.log* -website/www/ -website/_deploy.sh +quickstart-training-generator.js # Cython / C extensions cythonize.json @@ -44,12 +43,14 @@ __pycache__/ .env* .~env/ .venv +env3.6/ venv/ env3.*/ .dev .denv .pypyenv .pytest_cache/ +.mypy_cache/ # Distribution / packaging env/ @@ -119,3 +120,6 @@ Desktop.ini # Pycharm project files *.idea + +# IPython +.ipynb_checkpoints/ diff --git a/.travis.yml b/.travis.yml deleted file mode 100644 index e3ce53024..000000000 --- a/.travis.yml +++ /dev/null @@ -1,23 +0,0 @@ -language: python -sudo: false -cache: pip -dist: trusty -group: edge -python: - - "2.7" -os: - - linux -install: - - "pip install -r requirements.txt" - - "python setup.py build_ext --inplace" - - "pip install -e ." -script: - - "cat /proc/cpuinfo | grep flags | head -n 1" - - "python -m pytest --tb=native spacy" -branches: - except: - - spacy.io -notifications: - slack: - secure: F8GvqnweSdzImuLL64TpfG0i5rYl89liyr9tmFVsHl4c0DNiDuGhZivUz0M1broS8svE3OPOllLfQbACG/4KxD890qfF9MoHzvRDlp7U+RtwMV/YAkYn8MGWjPIbRbX0HpGdY7O2Rc9Qy4Kk0T8ZgiqXYIqAz2Eva9/9BlSmsJQ= - email: false diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 3c2b56cd3..70324d8fd 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -5,7 +5,7 @@ Thanks for your interest in contributing to spaCy 🎉 The project is maintained by [@honnibal](https://github.com/honnibal) and [@ines](https://github.com/ines), and we'll do our best to help you get started. This page will give you a quick -overview of how things are organised and most importantly, how to get involved. +overview of how things are organized and most importantly, how to get involved. ## Table of contents @@ -43,33 +43,33 @@ can also submit a [regression test](#fixing-bugs) straight away. When you're opening an issue to report the bug, simply refer to your pull request in the issue body. A few more tips: -- **Describing your issue:** Try to provide as many details as possible. What - exactly goes wrong? _How_ is it failing? Is there an error? - "XY doesn't work" usually isn't that helpful for tracking down problems. Always - remember to include the code you ran and if possible, extract only the relevant - parts and don't just dump your entire script. This will make it easier for us to - reproduce the error. +- **Describing your issue:** Try to provide as many details as possible. What + exactly goes wrong? _How_ is it failing? Is there an error? + "XY doesn't work" usually isn't that helpful for tracking down problems. Always + remember to include the code you ran and if possible, extract only the relevant + parts and don't just dump your entire script. This will make it easier for us to + reproduce the error. -- **Getting info about your spaCy installation and environment:** If you're - using spaCy v1.7+, you can use the command line interface to print details and - even format them as Markdown to copy-paste into GitHub issues: - `python -m spacy info --markdown`. 
+- **Getting info about your spaCy installation and environment:** If you're + using spaCy v1.7+, you can use the command line interface to print details and + even format them as Markdown to copy-paste into GitHub issues: + `python -m spacy info --markdown`. -- **Checking the model compatibility:** If you're having problems with a - [statistical model](https://spacy.io/models), it may be because the - model is incompatible with your spaCy installation. In spaCy v2.0+, you can check - this on the command line by running `python -m spacy validate`. +- **Checking the model compatibility:** If you're having problems with a + [statistical model](https://spacy.io/models), it may be because the + model is incompatible with your spaCy installation. In spaCy v2.0+, you can check + this on the command line by running `python -m spacy validate`. -- **Sharing a model's output, like dependencies and entities:** spaCy v2.0+ - comes with [built-in visualizers](https://spacy.io/usage/visualizers) that - you can run from within your script or a Jupyter notebook. For some issues, it's - helpful to **include a screenshot** of the visualization. You can simply drag and - drop the image into GitHub's editor and it will be uploaded and included. +- **Sharing a model's output, like dependencies and entities:** spaCy v2.0+ + comes with [built-in visualizers](https://spacy.io/usage/visualizers) that + you can run from within your script or a Jupyter notebook. For some issues, it's + helpful to **include a screenshot** of the visualization. You can simply drag and + drop the image into GitHub's editor and it will be uploaded and included. -- **Sharing long blocks of code or logs:** If you need to include long code, - logs or tracebacks, you can wrap them in `
` and `
`. This - [collapses the content](https://developer.mozilla.org/en/docs/Web/HTML/Element/details) - so it only becomes visible on click, making the issue easier to read and follow. +- **Sharing long blocks of code or logs:** If you need to include long code, + logs or tracebacks, you can wrap them in `
` and `
`. This + [collapses the content](https://developer.mozilla.org/en/docs/Web/HTML/Element/details) + so it only becomes visible on click, making the issue easier to read and follow. ### Issue labels @@ -94,39 +94,39 @@ shipped in the core library, and what could be provided in other packages. Our philosophy is to prefer a smaller core library. We generally ask the following questions: -- **What would this feature look like if implemented in a separate package?** - Some features would be very difficult to implement externally – for example, - changes to spaCy's built-in methods. In contrast, a library of word - alignment functions could easily live as a separate package that depended on - spaCy — there's little difference between writing `import word_aligner` and - `import spacy.word_aligner`. spaCy v2.0+ makes it easy to implement - [custom pipeline components](https://spacy.io/usage/processing-pipelines#custom-components), - and add your own attributes, properties and methods to the `Doc`, `Token` and - `Span`. If you're looking to implement a new spaCy feature, starting with a - custom component package is usually the best strategy. You won't have to worry - about spaCy's internals and you can test your module in an isolated - environment. And if it works well, we can always integrate it into the core - library later. +- **What would this feature look like if implemented in a separate package?** + Some features would be very difficult to implement externally – for example, + changes to spaCy's built-in methods. In contrast, a library of word + alignment functions could easily live as a separate package that depended on + spaCy — there's little difference between writing `import word_aligner` and + `import spacy.word_aligner`. spaCy v2.0+ makes it easy to implement + [custom pipeline components](https://spacy.io/usage/processing-pipelines#custom-components), + and add your own attributes, properties and methods to the `Doc`, `Token` and + `Span`. If you're looking to implement a new spaCy feature, starting with a + custom component package is usually the best strategy. You won't have to worry + about spaCy's internals and you can test your module in an isolated + environment. And if it works well, we can always integrate it into the core + library later. -- **Would the feature be easier to implement if it relied on "heavy" dependencies spaCy doesn't currently require?** - Python has a very rich ecosystem. Libraries like scikit-learn, SciPy, Gensim or - TensorFlow/Keras do lots of useful things — but we don't want to have them as - dependencies. If the feature requires functionality in one of these libraries, - it's probably better to break it out into a different package. +- **Would the feature be easier to implement if it relied on "heavy" dependencies spaCy doesn't currently require?** + Python has a very rich ecosystem. Libraries like scikit-learn, SciPy, Gensim or + TensorFlow/Keras do lots of useful things — but we don't want to have them as + dependencies. If the feature requires functionality in one of these libraries, + it's probably better to break it out into a different package. -- **Is the feature orthogonal to the current spaCy functionality, or overlapping?** - spaCy strongly prefers to avoid having 6 different ways of doing the same thing. - As better techniques are developed, we prefer to drop support for "the old way". - However, it's rare that one approach _entirely_ dominates another. It's very - common that there's still a use-case for the "obsolete" approach. 
For instance, - [WordNet](https://wordnet.princeton.edu/) is still very useful — but word - vectors are better for most use-cases, and the two approaches to lexical - semantics do a lot of the same things. spaCy therefore only supports word - vectors, and support for WordNet is currently left for other packages. +- **Is the feature orthogonal to the current spaCy functionality, or overlapping?** + spaCy strongly prefers to avoid having 6 different ways of doing the same thing. + As better techniques are developed, we prefer to drop support for "the old way". + However, it's rare that one approach _entirely_ dominates another. It's very + common that there's still a use-case for the "obsolete" approach. For instance, + [WordNet](https://wordnet.princeton.edu/) is still very useful — but word + vectors are better for most use-cases, and the two approaches to lexical + semantics do a lot of the same things. spaCy therefore only supports word + vectors, and support for WordNet is currently left for other packages. -- **Do you need the feature to get basic things done?** We do want spaCy to be - at least somewhat self-contained. If we keep needing some feature in our - recipes, that does provide some argument for bringing it "in house". +- **Do you need the feature to get basic things done?** We do want spaCy to be + at least somewhat self-contained. If we keep needing some feature in our + recipes, that does provide some argument for bringing it "in house". ### Getting started @@ -195,7 +195,7 @@ modules in `.py` files, not Cython modules in `.pyx` and `.pxd` files.** ### Code formatting [`black`](https://github.com/ambv/black) is an opinionated Python code -formatter, optimised to produce readable code and small diffs. You can run +formatter, optimized to produce readable code and small diffs. You can run `black` from the command-line, or via your code editor. For example, if you're using [Visual Studio Code](https://code.visualstudio.com/), you can add the following to your `settings.json` to use `black` for formatting and auto-format @@ -203,10 +203,10 @@ your files on save: ```json { - "python.formatting.provider": "black", - "[python]": { - "editor.formatOnSave": true - } + "python.formatting.provider": "black", + "[python]": { + "editor.formatOnSave": true + } } ``` @@ -216,7 +216,7 @@ list of available editor integrations. #### Disabling formatting There are a few cases where auto-formatting doesn't improve readability – for -example, in some of the the language data files like the `tag_map.py`, or in +example, in some of the language data files like the `tag_map.py`, or in the tests that construct `Doc` objects from lists of words and other labels. Wrapping a block in `# fmt: off` and `# fmt: on` lets you disable formatting for that particular code. Here's an example: @@ -224,7 +224,7 @@ for that particular code. Here's an example: ```python # fmt: off text = "I look forward to using Thingamajig. I've been told it will make my life easier..." -heads = [1, 0, -1, -2, -1, -1, -5, -1, 3, 2, 1, 0, 2, 1, -3, 1, 1, -3, -7] +heads = [1, 1, 1, 1, 3, 4, 1, 6, 11, 11, 11, 11, 14, 14, 11, 16, 17, 14, 11] deps = ["nsubj", "ROOT", "advmod", "prep", "pcomp", "dobj", "punct", "", "nsubjpass", "aux", "auxpass", "ROOT", "nsubj", "aux", "ccomp", "poss", "nsubj", "ccomp", "punct"] @@ -280,29 +280,13 @@ except: # noqa: E722 ### Python conventions -All Python code must be written in an **intersection of Python 2 and Python 3**. -This is easy in Cython, but somewhat ugly in Python. 
Logic that deals with -Python or platform compatibility should only live in -[`spacy.compat`](spacy/compat.py). To distinguish them from the builtin -functions, replacement functions are suffixed with an underscore, for example -`unicode_`. If you need to access the user's version or platform information, -for example to show more specific error messages, you can use the `is_config()` -helper function. - -```python -from .compat import unicode_, is_config - -compatible_unicode = unicode_('hello world') -if is_config(windows=True, python2=True): - print("You are using Python 2 on Windows.") -``` - +All Python code must be written **compatible with Python 3.6+**. Code that interacts with the file-system should accept objects that follow the `pathlib.Path` API, without assuming that the object inherits from `pathlib.Path`. If the function is user-facing and takes a path as an argument, it should check whether the path is provided as a string. Strings should be converted to `pathlib.Path` objects. Serialization and deserialization functions should always -accept **file-like objects**, as it makes the library io-agnostic. Working on +accept **file-like objects**, as it makes the library IO-agnostic. Working on buffers makes the code more general, easier to test, and compatible with Python 3's asynchronous IO. @@ -400,7 +384,7 @@ of Python and C++, with additional complexity and syntax from numpy. The many "traps for new players". Working in Cython is very rewarding once you're over the initial learning curve. As with C and C++, the first way you write something in Cython will often be the performance-optimal approach. In contrast, -Python optimisation generally requires a lot of experimentation. Is it faster to +Python optimization generally requires a lot of experimentation. Is it faster to have an `if item in my_dict` check, or to use `.get()`? What about `try`/`except`? Does this numpy operation create a copy? There's no way to guess the answers to these questions, and you'll usually be dissatisfied with your results — so @@ -413,10 +397,10 @@ Python. If it's not fast enough the first time, just switch to Cython. ### Resources to get you started -- [PEP 8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) (python.org) -- [Official Cython documentation](http://docs.cython.org/en/latest/) (cython.org) -- [Writing C in Cython](https://explosion.ai/blog/writing-c-in-cython) (explosion.ai) -- [Multi-threading spaCy’s parser and named entity recogniser](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai) +- [PEP 8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) (python.org) +- [Official Cython documentation](http://docs.cython.org/en/latest/) (cython.org) +- [Writing C in Cython](https://explosion.ai/blog/writing-c-in-cython) (explosion.ai) +- [Multi-threading spaCy’s parser and named entity recognizer](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai) ## Adding tests @@ -428,7 +412,7 @@ name. For example, tests for the `Tokenizer` can be found in all test files and test functions need to be prefixed with `test_`. When adding tests, make sure to use descriptive names, keep the code short and -concise and only test for one behaviour at a time. Try to `parametrize` test +concise and only test for one behavior at a time. Try to `parametrize` test cases wherever possible, use our pre-defined fixtures for spaCy components and avoid unnecessary imports. 
@@ -437,7 +421,7 @@ Tests that require the model to be loaded should be marked with `@pytest.mark.models`. Loading the models is expensive and not necessary if you're not actually testing the model performance. If all you need is a `Doc` object with annotations like heads, POS tags or the dependency parse, you can -use the `get_doc()` utility function to construct it manually. +use the `Doc` constructor to construct it manually. 📖 **For more guidelines and information on how to add tests, check out the [tests README](spacy/tests/README.md).** @@ -456,25 +440,25 @@ simply click on the "Suggest edits" button at the bottom of a page. We're very excited about all the new possibilities for **community extensions** and plugins in spaCy v2.0, and we can't wait to see what you build with it! -- An extension or plugin should add substantial functionality, be - **well-documented** and **open-source**. It should be available for users to download - and install as a Python package – for example via [PyPi](http://pypi.python.org). +- An extension or plugin should add substantial functionality, be + **well-documented** and **open-source**. It should be available for users to download + and install as a Python package – for example via [PyPi](http://pypi.python.org). -- Extensions that write to `Doc`, `Token` or `Span` attributes should be wrapped - as [pipeline components](https://spacy.io/usage/processing-pipelines#custom-components) - that users can **add to their processing pipeline** using `nlp.add_pipe()`. +- Extensions that write to `Doc`, `Token` or `Span` attributes should be wrapped + as [pipeline components](https://spacy.io/usage/processing-pipelines#custom-components) + that users can **add to their processing pipeline** using `nlp.add_pipe()`. -- When publishing your extension on GitHub, **tag it** with the topics - [`spacy`](https://github.com/topics/spacy?o=desc&s=stars) and - [`spacy-extensions`](https://github.com/topics/spacy-extension?o=desc&s=stars) - to make it easier to find. Those are also the topics we're linking to from the - spaCy website. If you're sharing your project on Twitter, feel free to tag - [@spacy_io](https://twitter.com/spacy_io) so we can check it out. +- When publishing your extension on GitHub, **tag it** with the topics + [`spacy`](https://github.com/topics/spacy?o=desc&s=stars) and + [`spacy-extensions`](https://github.com/topics/spacy-extension?o=desc&s=stars) + to make it easier to find. Those are also the topics we're linking to from the + spaCy website. If you're sharing your project on Twitter, feel free to tag + [@spacy_io](https://twitter.com/spacy_io) so we can check it out. -- Once your extension is published, you can open an issue on the - [issue tracker](https://github.com/explosion/spacy/issues) to suggest it for the - [resources directory](https://spacy.io/usage/resources#extensions) on the - website. +- Once your extension is published, you can open an issue on the + [issue tracker](https://github.com/explosion/spacy/issues) to suggest it for the + [resources directory](https://spacy.io/usage/resources#extensions) on the + website. 
📖 **For more tips and best practices, see the [checklist for developing spaCy extensions](https://spacy.io/usage/processing-pipelines#extensions).** diff --git a/MANIFEST.in b/MANIFEST.in index 9819c7b70..b4887cdb8 100644 --- a/MANIFEST.in +++ b/MANIFEST.in @@ -1,9 +1,9 @@ recursive-include include *.h -recursive-include spacy *.txt *.pyx *.pxd +recursive-include spacy *.pyx *.pxd *.txt *.cfg *.jinja include LICENSE include README.md -include bin/spacy include pyproject.toml recursive-exclude spacy/lang *.json recursive-include spacy/lang *.json.gz +recursive-include spacy/cli *.json *.yml recursive-include licenses * diff --git a/Makefile b/Makefile index 6c0a59ba8..3f10e79cc 100644 --- a/Makefile +++ b/Makefile @@ -1,29 +1,55 @@ SHELL := /bin/bash -PYVER := 3.6 + +ifndef SPACY_EXTRAS +override SPACY_EXTRAS = spacy-lookups-data==1.0.0rc0 jieba spacy-pkuseg==0.0.26 sudachipy sudachidict_core +endif + +ifndef PYVER +override PYVER = 3.6 +endif + VENV := ./env$(PYVER) version := $(shell "bin/get-version.sh") +package := $(shell "bin/get-package.sh") -dist/spacy-$(version).pex : wheelhouse/spacy-$(version).stamp - $(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m spacy -o $@ spacy==$(version) jsonschema spacy-lookups-data jieba pkuseg==0.0.25 sudachipy sudachidict_core +ifndef SPACY_BIN +override SPACY_BIN = $(package)-$(version).pex +endif + +ifndef WHEELHOUSE +override WHEELHOUSE = "./wheelhouse" +endif + + +dist/$(SPACY_BIN) : $(WHEELHOUSE)/spacy-$(PYVER)-$(version).stamp + $(VENV)/bin/pex \ + -f $(WHEELHOUSE) \ + --no-index \ + --disable-cache \ + -o $@ \ + $(package)==$(version) \ + $(SPACY_EXTRAS) chmod a+rx $@ cp $@ dist/spacy.pex -dist/pytest.pex : wheelhouse/pytest-*.whl - $(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m pytest -o $@ pytest pytest-timeout mock +dist/pytest.pex : $(WHEELHOUSE)/pytest-*.whl + $(VENV)/bin/pex -f $(WHEELHOUSE) --no-index --disable-cache -m pytest -o $@ pytest pytest-timeout mock chmod a+rx $@ -wheelhouse/spacy-$(version).stamp : $(VENV)/bin/pex setup.py spacy/*.py* spacy/*/*.py* - $(VENV)/bin/pip wheel . -w ./wheelhouse - $(VENV)/bin/pip wheel jsonschema spacy-lookups-data jieba pkuseg==0.0.25 sudachipy sudachidict_core -w ./wheelhouse +$(WHEELHOUSE)/spacy-$(PYVER)-$(version).stamp : $(VENV)/bin/pex setup.py spacy/*.py* spacy/*/*.py* + $(VENV)/bin/pip wheel . -w $(WHEELHOUSE) + $(VENV)/bin/pip wheel $(SPACY_EXTRAS) -w $(WHEELHOUSE) + touch $@ -wheelhouse/pytest-%.whl : $(VENV)/bin/pex - $(VENV)/bin/pip wheel pytest pytest-timeout mock -w ./wheelhouse +$(WHEELHOUSE)/pytest-%.whl : $(VENV)/bin/pex + $(VENV)/bin/pip wheel pytest pytest-timeout mock -w $(WHEELHOUSE) $(VENV)/bin/pex : python$(PYVER) -m venv $(VENV) $(VENV)/bin/pip install -U pip setuptools pex wheel + $(VENV)/bin/pip install numpy .PHONY : clean test @@ -33,6 +59,6 @@ test : dist/spacy-$(version).pex dist/pytest.pex clean : setup.py rm -rf dist/* - rm -rf ./wheelhouse + rm -rf $(WHEELHOUSE)/* rm -rf $(VENV) python setup.py clean --all diff --git a/README.md b/README.md index 4b5f3d0fa..55e4c6512 100644 --- a/README.md +++ b/README.md @@ -4,18 +4,19 @@ spaCy is a library for advanced Natural Language Processing in Python and Cython. It's built on the very latest research, and was designed from day one to -be used in real products. spaCy comes with -[pretrained statistical models](https://spacy.io/models) and word vectors, and +be used in real products. 
+ +spaCy comes with +[pretrained pipelines](https://spacy.io/models) and vectors, and currently supports tokenization for **60+ languages**. It features state-of-the-art speed, convolutional **neural network models** for tagging, -parsing and **named entity recognition** and easy **deep learning** integration. -It's commercial open-source software, released under the MIT license. +parsing, **named entity recognition**, **text classification** and more, multi-task learning with pretrained **transformers** like BERT, as well as a production-ready training system and easy model packaging, deployment and workflow management. +spaCy is commercial open-source software, released under the MIT license. -💫 **Version 2.3 out now!** +💫 **Version 3.0 (nightly) out now!** [Check out the release notes here.](https://github.com/explosion/spaCy/releases) -[![Azure Pipelines]()](https://dev.azure.com/explosion-ai/public/_build?definitionId=8) -[![Travis Build Status]()](https://travis-ci.org/explosion/spaCy) +[![Azure Pipelines](https://img.shields.io/azure-devops/build/explosion-ai/public/8/master.svg?logo=azure-pipelines&style=flat-square&label=build)](https://dev.azure.com/explosion-ai/public/_build?definitionId=8) [![Current Release Version](https://img.shields.io/github/release/explosion/spacy.svg?style=flat-square&logo=github)](https://github.com/explosion/spaCy/releases) [![pypi Version](https://img.shields.io/pypi/v/spacy.svg?style=flat-square&logo=pypi&logoColor=white)](https://pypi.org/project/spacy/) [![conda Version](https://img.shields.io/conda/vn/conda-forge/spacy.svg?style=flat-square&logo=conda-forge&logoColor=white)](https://anaconda.org/conda-forge/spacy) @@ -28,64 +29,60 @@ It's commercial open-source software, released under the MIT license. ## 📖 Documentation -| Documentation | | -| --------------- | -------------------------------------------------------------- | -| [spaCy 101] | New to spaCy? Here's everything you need to know! | -| [Usage Guides] | How to use spaCy and its features. | -| [New in v2.3] | New features, backwards incompatibilities and migration guide. | -| [API Reference] | The detailed reference for spaCy's API. | -| [Models] | Download statistical language models for spaCy. | -| [Universe] | Libraries, extensions, demos, books and courses. | -| [Changelog] | Changes and version history. | -| [Contribute] | How to contribute to the spaCy project and code base. | +| Documentation | | +| ------------------- | -------------------------------------------------------------- | +| [spaCy 101] | New to spaCy? Here's everything you need to know! | +| [Usage Guides] | How to use spaCy and its features. | +| [New in v3.0] | New features, backwards incompatibilities and migration guide. | +| [Project Templates] | End-to-end workflows you can clone, modify and run. | +| [API Reference] | The detailed reference for spaCy's API. | +| [Models] | Download statistical language models for spaCy. | +| [Universe] | Libraries, extensions, demos, books and courses. | +| [Changelog] | Changes and version history. | +| [Contribute] | How to contribute to the spaCy project and code base. 
| [spacy 101]: https://spacy.io/usage/spacy-101 -[new in v2.3]: https://spacy.io/usage/v2-3 +[new in v3.0]: https://spacy.io/usage/v3 [usage guides]: https://spacy.io/usage/ [api reference]: https://spacy.io/api/ [models]: https://spacy.io/models [universe]: https://spacy.io/universe +[project templates]: https://github.com/explosion/projects [changelog]: https://spacy.io/usage#changelog [contribute]: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md ## 💬 Where to ask questions -The spaCy project is maintained by [@honnibal](https://github.com/honnibal) and -[@ines](https://github.com/ines), along with core contributors -[@svlandeg](https://github.com/svlandeg) and +The spaCy project is maintained by [@honnibal](https://github.com/honnibal), +[@ines](https://github.com/ines), [@svlandeg](https://github.com/svlandeg) and [@adrianeboyd](https://github.com/adrianeboyd). Please understand that we won't be able to provide individual support via email. We also believe that help is much more valuable if it's shared publicly, so that more people can benefit from it. -| Type | Platforms | -| ------------------------ | ------------------------------------------------------ | -| 🚨 **Bug Reports** | [GitHub Issue Tracker] | -| 🎁 **Feature Requests** | [GitHub Issue Tracker] | -| 👩‍💻 **Usage Questions** | [Stack Overflow] · [Gitter Chat] · [Reddit User Group] | -| 🗯 **General Discussion** | [Gitter Chat] · [Reddit User Group] | +| Type | Platforms | +| ----------------------- | ---------------------- | +| 🚨 **Bug Reports** | [GitHub Issue Tracker] | +| 🎁 **Feature Requests** | [GitHub Issue Tracker] | +| 👩‍💻 **Usage Questions** | [Stack Overflow] | [github issue tracker]: https://github.com/explosion/spaCy/issues [stack overflow]: https://stackoverflow.com/questions/tagged/spacy -[gitter chat]: https://gitter.im/explosion/spaCy -[reddit user group]: https://www.reddit.com/r/spacynlp ## Features -- Non-destructive **tokenization** -- **Named entity** recognition -- Support for **50+ languages** -- pretrained [statistical models](https://spacy.io/models) and word vectors +- Support for **60+ languages** +- **Trained pipelines** +- Multi-task learning with pretrained **transformers** like BERT +- Pretrained **word vectors** - State-of-the-art speed -- Easy **deep learning** integration -- Part-of-speech tagging -- Labelled dependency parsing -- Syntax-driven sentence segmentation +- Production-ready **training system** +- Linguistically-motivated **tokenization** +- Components for named **entity recognition**, part-of-speech-tagging, dependency parsing, sentence segmentation, **text classification**, lemmatization, morphological analysis, entity linking and more +- Easily extensible with **custom components** and attributes +- Support for custom models in **PyTorch**, **TensorFlow** and other frameworks - Built in **visualizers** for syntax and NER -- Convenient string-to-hash mapping -- Export to numpy data arrays -- Efficient binary serialization -- Easy **model packaging** and deployment +- Easy **model packaging**, deployment and workflow management - Robust, rigorously evaluated accuracy 📖 **For more details, see the @@ -98,7 +95,7 @@ For detailed installation instructions, see the - **Operating system**: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual Studio) -- **Python version**: Python 2.7, 3.5+ (only 64 bit) +- **Python version**: Python 3.6+ (only 64 bit) - **Package managers**: [pip] · [conda] (via `conda-forge`) [pip]: https://pypi.org/project/spacy/ @@ -107,9 +104,11 @@ 
For detailed installation instructions, see the ### pip Using pip, spaCy releases are available as source packages and binary wheels (as -of `v2.0.13`). +of `v2.0.13`). Before you install spaCy and its dependencies, make sure that +your `pip`, `setuptools` and `wheel` are up to date. ```bash +pip install -U pip setuptools wheel pip install spacy ``` @@ -159,26 +158,26 @@ If you've trained your own models, keep in mind that your training and runtime inputs must match. After updating spaCy, we recommend **retraining your models** with the new version. -📖 **For details on upgrading from spaCy 1.x to spaCy 2.x, see the -[migration guide](https://spacy.io/usage/v2#migrating).** +📖 **For details on upgrading from spaCy 2.x to spaCy 3.x, see the +[migration guide](https://spacy.io/usage/v3#migrating).** ## Download models -As of v1.7.0, models for spaCy can be installed as **Python packages**. This +Trained pipelines for spaCy can be installed as **Python packages**. This means that they're a component of your application, just like any other module. Models can be installed using spaCy's `download` command, or manually by pointing pip to a path or URL. -| Documentation | | -| ---------------------- | ------------------------------------------------------------- | -| [Available Models] | Detailed model descriptions, accuracy figures and benchmarks. | -| [Models Documentation] | Detailed usage instructions. | +| Documentation | | +| ---------------------- | ---------------------------------------------------------------- | +| [Available Pipelines] | Detailed pipeline descriptions, accuracy figures and benchmarks. | +| [Models Documentation] | Detailed usage instructions. | -[available models]: https://spacy.io/models +[available pipelines]: https://spacy.io/models [models documentation]: https://spacy.io/docs/usage/models ```bash -# download best-matching version of specific model for your spaCy installation +# Download best-matching version of specific model for your spaCy installation python -m spacy download en_core_web_sm # pip install .tar.gz archive from path or URL @@ -188,7 +187,7 @@ pip install https://github.com/explosion/spacy-models/releases/download/en_core_ ### Loading and using models -To load a model, use `spacy.load()` with the model name, a shortcut link or a +To load a model, use `spacy.load()` with the model name or a path to the model data directory. ```python @@ -263,9 +262,7 @@ and git preinstalled. Install a version of the [Visual C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/) or [Visual Studio Express](https://visualstudio.microsoft.com/vs/express/) that -matches the version that was used to compile your Python interpreter. For -official distributions these are VS 2008 (Python 2.7), VS 2010 (Python 3.4) and -VS 2015 (Python 3.5). +matches the version that was used to compile your Python interpreter. 
## Run tests diff --git a/azure-pipelines.yml b/azure-pipelines.yml index 147d2e903..4dfb51296 100644 --- a/azure-pipelines.yml +++ b/azure-pipelines.yml @@ -27,7 +27,7 @@ jobs: inputs: versionSpec: '3.7' - script: | - pip install flake8 + pip install flake8==3.5.0 python -m flake8 spacy --count --select=E901,E999,F821,F822,F823 --show-source --statistics displayName: 'flake8' @@ -35,12 +35,6 @@ jobs: dependsOn: 'Validate' strategy: matrix: - Python35Linux: - imageName: 'ubuntu-16.04' - python.version: '3.5' - Python35Windows: - imageName: 'vs2017-win2016' - python.version: '3.5' Python36Linux: imageName: 'ubuntu-16.04' python.version: '3.6' @@ -58,7 +52,7 @@ jobs: # imageName: 'vs2017-win2016' # python.version: '3.7' # Python37Mac: - # imageName: 'macos-10.13' + # imageName: 'macos-10.14' # python.version: '3.7' Python38Linux: imageName: 'ubuntu-16.04' diff --git a/bin/cythonize.py b/bin/cythonize.py deleted file mode 100755 index 4814f8df0..000000000 --- a/bin/cythonize.py +++ /dev/null @@ -1,169 +0,0 @@ -#!/usr/bin/env python -""" cythonize.py - -Cythonize pyx files into C++ files as needed. - -Usage: cythonize.py [root] - -Checks pyx files to see if they have been changed relative to their -corresponding C++ files. If they have, then runs cython on these files to -recreate the C++ files. - -Additionally, checks pxd files and setup.py if they have been changed. If -they have, rebuilds everything. - -Change detection based on file hashes stored in JSON format. - -For now, this script should be run by developers when changing Cython files -and the resulting C++ files checked in, so that end-users (and Python-only -developers) do not get the Cython dependencies. - -Based upon: - -https://raw.github.com/dagss/private-scipy-refactor/cythonize/cythonize.py -https://raw.githubusercontent.com/numpy/numpy/master/tools/cythonize.py - -Note: this script does not check any of the dependent C++ libraries. -""" -from __future__ import print_function - -import os -import sys -import json -import hashlib -import subprocess -import argparse - - -HASH_FILE = "cythonize.json" - - -def process_pyx(fromfile, tofile, language_level="-2"): - print("Processing %s" % fromfile) - try: - from Cython.Compiler.Version import version as cython_version - from distutils.version import LooseVersion - - if LooseVersion(cython_version) < LooseVersion("0.19"): - raise Exception("Require Cython >= 0.19") - - except ImportError: - pass - - flags = ["--fast-fail", language_level] - if tofile.endswith(".cpp"): - flags += ["--cplus"] - - try: - try: - r = subprocess.call( - ["cython"] + flags + ["-o", tofile, fromfile], env=os.environ - ) # See Issue #791 - if r != 0: - raise Exception("Cython failed") - except OSError: - # There are ways of installing Cython that don't result in a cython - # executable on the path, see gh-2397. 
- r = subprocess.call( - [ - sys.executable, - "-c", - "import sys; from Cython.Compiler.Main import " - "setuptools_main as main; sys.exit(main())", - ] - + flags - + ["-o", tofile, fromfile] - ) - if r != 0: - raise Exception("Cython failed") - except OSError: - raise OSError("Cython needs to be installed") - - -def preserve_cwd(path, func, *args): - orig_cwd = os.getcwd() - try: - os.chdir(path) - func(*args) - finally: - os.chdir(orig_cwd) - - -def load_hashes(filename): - try: - return json.load(open(filename)) - except (ValueError, IOError): - return {} - - -def save_hashes(hash_db, filename): - with open(filename, "w") as f: - f.write(json.dumps(hash_db)) - - -def get_hash(path): - return hashlib.md5(open(path, "rb").read()).hexdigest() - - -def hash_changed(base, path, db): - full_path = os.path.normpath(os.path.join(base, path)) - return not get_hash(full_path) == db.get(full_path) - - -def hash_add(base, path, db): - full_path = os.path.normpath(os.path.join(base, path)) - db[full_path] = get_hash(full_path) - - -def process(base, filename, db): - root, ext = os.path.splitext(filename) - if ext in [".pyx", ".cpp"]: - if hash_changed(base, filename, db) or not os.path.isfile( - os.path.join(base, root + ".cpp") - ): - preserve_cwd(base, process_pyx, root + ".pyx", root + ".cpp") - hash_add(base, root + ".cpp", db) - hash_add(base, root + ".pyx", db) - - -def check_changes(root, db): - res = False - new_db = {} - - setup_filename = "setup.py" - hash_add(".", setup_filename, new_db) - if hash_changed(".", setup_filename, db): - res = True - - for base, _, files in os.walk(root): - for filename in files: - if filename.endswith(".pxd"): - hash_add(base, filename, new_db) - if hash_changed(base, filename, db): - res = True - - if res: - db.clear() - db.update(new_db) - return res - - -def run(root): - db = load_hashes(HASH_FILE) - - try: - check_changes(root, db) - for base, _, files in os.walk(root): - for filename in files: - process(base, filename, db) - finally: - save_hashes(db, HASH_FILE) - - -if __name__ == "__main__": - parser = argparse.ArgumentParser( - description="Cythonize pyx files into C++ files as needed" - ) - parser.add_argument("root", help="root directory") - args = parser.parse_args() - run(args.root) diff --git a/bin/get-package.sh b/bin/get-package.sh new file mode 100755 index 000000000..d60b930b4 --- /dev/null +++ b/bin/get-package.sh @@ -0,0 +1,12 @@ +#!/usr/bin/env bash + +set -e + +version=$(grep "__title__ = " spacy/about.py) +version=${version/__title__ = } +version=${version/\'/} +version=${version/\'/} +version=${version/\"/} +version=${version/\"/} + +echo $version diff --git a/bin/load_reddit.py b/bin/load_reddit.py deleted file mode 100644 index afddd3798..000000000 --- a/bin/load_reddit.py +++ /dev/null @@ -1,97 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import bz2 -import re -import srsly -import sys -import random -import datetime -import plac -from pathlib import Path - -_unset = object() - - -class Reddit(object): - """Stream cleaned comments from Reddit.""" - - pre_format_re = re.compile(r"^[`*~]") - post_format_re = re.compile(r"[`*~]$") - url_re = re.compile(r"\[([^]]+)\]\(%%URL\)") - link_re = re.compile(r"\[([^]]+)\]\(https?://[^\)]+\)") - - def __init__(self, file_path, meta_keys={"subreddit": "section"}): - """ - file_path (unicode / Path): Path to archive or directory of archives. - meta_keys (dict): Meta data key included in the Reddit corpus, mapped - to display name in Prodigy meta. 
- RETURNS (Reddit): The Reddit loader. - """ - self.meta = meta_keys - file_path = Path(file_path) - if not file_path.exists(): - raise IOError("Can't find file path: {}".format(file_path)) - if not file_path.is_dir(): - self.files = [file_path] - else: - self.files = list(file_path.iterdir()) - - def __iter__(self): - for file_path in self.iter_files(): - with bz2.open(str(file_path)) as f: - for line in f: - line = line.strip() - if not line: - continue - comment = srsly.json_loads(line) - if self.is_valid(comment): - text = self.strip_tags(comment["body"]) - yield {"text": text} - - def get_meta(self, item): - return {name: item.get(key, "n/a") for key, name in self.meta.items()} - - def iter_files(self): - for file_path in self.files: - yield file_path - - def strip_tags(self, text): - text = self.link_re.sub(r"\1", text) - text = text.replace(">", ">").replace("<", "<") - text = self.pre_format_re.sub("", text) - text = self.post_format_re.sub("", text) - text = re.sub(r"\s+", " ", text) - return text.strip() - - def is_valid(self, comment): - return ( - comment["body"] is not None - and comment["body"] != "[deleted]" - and comment["body"] != "[removed]" - ) - - -def main(path): - reddit = Reddit(path) - for comment in reddit: - print(srsly.json_dumps(comment)) - - -if __name__ == "__main__": - import socket - - try: - BrokenPipeError - except NameError: - BrokenPipeError = socket.error - try: - plac.call(main) - except BrokenPipeError: - import os, sys - - # Python flushes standard streams on exit; redirect remaining output - # to devnull to avoid another BrokenPipeError at shutdown - devnull = os.open(os.devnull, os.O_WRONLY) - os.dup2(devnull, sys.stdout.fileno()) - sys.exit(1) # Python exits with error code 1 on EPIPE diff --git a/bin/spacy b/bin/spacy deleted file mode 100644 index 11359669c..000000000 --- a/bin/spacy +++ /dev/null @@ -1,2 +0,0 @@ -#! /bin/sh -python -m spacy "$@" diff --git a/bin/train_word_vectors.py b/bin/train_word_vectors.py deleted file mode 100644 index 663ce060d..000000000 --- a/bin/train_word_vectors.py +++ /dev/null @@ -1,81 +0,0 @@ -#!/usr/bin/env python -from __future__ import print_function, unicode_literals, division - -import logging -from pathlib import Path -from collections import defaultdict -from gensim.models import Word2Vec -import plac -import spacy - -logger = logging.getLogger(__name__) - - -class Corpus(object): - def __init__(self, directory, nlp): - self.directory = directory - self.nlp = nlp - - def __iter__(self): - for text_loc in iter_dir(self.directory): - with text_loc.open("r", encoding="utf-8") as file_: - text = file_.read() - - # This is to keep the input to the blank model (which doesn't - # sentencize) from being too long. 
It works particularly well with - # the output of [WikiExtractor](https://github.com/attardi/wikiextractor) - paragraphs = text.split('\n\n') - for par in paragraphs: - yield [word.orth_ for word in self.nlp(par)] - - -def iter_dir(loc): - dir_path = Path(loc) - for fn_path in dir_path.iterdir(): - if fn_path.is_dir(): - for sub_path in fn_path.iterdir(): - yield sub_path - else: - yield fn_path - - -@plac.annotations( - lang=("ISO language code"), - in_dir=("Location of input directory"), - out_loc=("Location of output file"), - n_workers=("Number of workers", "option", "n", int), - size=("Dimension of the word vectors", "option", "d", int), - window=("Context window size", "option", "w", int), - min_count=("Min count", "option", "m", int), - negative=("Number of negative samples", "option", "g", int), - nr_iter=("Number of iterations", "option", "i", int), -) -def main( - lang, - in_dir, - out_loc, - negative=5, - n_workers=4, - window=5, - size=128, - min_count=10, - nr_iter=5, -): - logging.basicConfig( - format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO - ) - nlp = spacy.blank(lang) - corpus = Corpus(in_dir, nlp) - model = Word2Vec( - sentences=corpus, - size=size, - window=window, - min_count=min_count, - workers=n_workers, - sample=1e-5, - negative=negative, - ) - model.save(out_loc) - -if __name__ == "__main__": - plac.call(main) diff --git a/bin/ud/__init__.py b/bin/ud/__init__.py deleted file mode 100644 index 119c46ba4..000000000 --- a/bin/ud/__init__.py +++ /dev/null @@ -1,2 +0,0 @@ -from .conll17_ud_eval import main as ud_evaluate # noqa: F401 -from .ud_train import main as ud_train # noqa: F401 diff --git a/bin/ud/conll17_ud_eval.py b/bin/ud/conll17_ud_eval.py deleted file mode 100644 index 88acfabac..000000000 --- a/bin/ud/conll17_ud_eval.py +++ /dev/null @@ -1,614 +0,0 @@ -#!/usr/bin/env python -# flake8: noqa - -# CoNLL 2017 UD Parsing evaluation script. -# -# Compatible with Python 2.7 and 3.2+, can be used either as a module -# or a standalone executable. -# -# Copyright 2017 Institute of Formal and Applied Linguistics (UFAL), -# Faculty of Mathematics and Physics, Charles University, Czech Republic. 
-# -# Changelog: -# - [02 Jan 2017] Version 0.9: Initial release -# - [25 Jan 2017] Version 0.9.1: Fix bug in LCS alignment computation -# - [10 Mar 2017] Version 1.0: Add documentation and test -# Compare HEADs correctly using aligned words -# Allow evaluation with errorneous spaces in forms -# Compare forms in LCS case insensitively -# Detect cycles and multiple root nodes -# Compute AlignedAccuracy - -# Command line usage -# ------------------ -# conll17_ud_eval.py [-v] [-w weights_file] gold_conllu_file system_conllu_file -# -# - if no -v is given, only the CoNLL17 UD Shared Task evaluation LAS metrics -# is printed -# - if -v is given, several metrics are printed (as precision, recall, F1 score, -# and in case the metric is computed on aligned words also accuracy on these): -# - Tokens: how well do the gold tokens match system tokens -# - Sentences: how well do the gold sentences match system sentences -# - Words: how well can the gold words be aligned to system words -# - UPOS: using aligned words, how well does UPOS match -# - XPOS: using aligned words, how well does XPOS match -# - Feats: using aligned words, how well does FEATS match -# - AllTags: using aligned words, how well does UPOS+XPOS+FEATS match -# - Lemmas: using aligned words, how well does LEMMA match -# - UAS: using aligned words, how well does HEAD match -# - LAS: using aligned words, how well does HEAD+DEPREL(ignoring subtypes) match -# - if weights_file is given (with lines containing deprel-weight pairs), -# one more metric is shown: -# - WeightedLAS: as LAS, but each deprel (ignoring subtypes) has different weight - -# API usage -# --------- -# - load_conllu(file) -# - loads CoNLL-U file from given file object to an internal representation -# - the file object should return str on both Python 2 and Python 3 -# - raises UDError exception if the given file cannot be loaded -# - evaluate(gold_ud, system_ud) -# - evaluate the given gold and system CoNLL-U files (loaded with load_conllu) -# - raises UDError if the concatenated tokens of gold and system file do not match -# - returns a dictionary with the metrics described above, each metrics having -# four fields: precision, recall, f1 and aligned_accuracy (when using aligned -# words, otherwise this is None) - -# Description of token matching -# ----------------------------- -# In order to match tokens of gold file and system file, we consider the text -# resulting from concatenation of gold tokens and text resulting from -# concatenation of system tokens. These texts should match -- if they do not, -# the evaluation fails. -# -# If the texts do match, every token is represented as a range in this original -# text, and tokens are equal only if their range is the same. - -# Description of word matching -# ---------------------------- -# When matching words of gold file and system file, we first match the tokens. -# The words which are also tokens are matched as tokens, but words in multi-word -# tokens have to be handled differently. -# -# To handle multi-word tokens, we start by finding "multi-word spans". -# Multi-word span is a span in the original text such that -# - it contains at least one multi-word token -# - all multi-word tokens in the span (considering both gold and system ones) -# are completely inside the span (i.e., they do not "stick out") -# - the multi-word span is as small as possible -# -# For every multi-word span, we align the gold and system words completely -# inside this span using LCS on their FORMs. 
The words not intersecting -# (even partially) any multi-word span are then aligned as tokens. - - -from __future__ import division -from __future__ import print_function - -import argparse -import io -import sys -import unittest - -# CoNLL-U column names -ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC = range(10) - -# UD Error is used when raising exceptions in this module -class UDError(Exception): - pass - -# Load given CoNLL-U file into internal representation -def load_conllu(file, check_parse=True): - # Internal representation classes - class UDRepresentation: - def __init__(self): - # Characters of all the tokens in the whole file. - # Whitespace between tokens is not included. - self.characters = [] - # List of UDSpan instances with start&end indices into `characters`. - self.tokens = [] - # List of UDWord instances. - self.words = [] - # List of UDSpan instances with start&end indices into `characters`. - self.sentences = [] - class UDSpan: - def __init__(self, start, end, characters): - self.start = start - # Note that self.end marks the first position **after the end** of span, - # so we can use characters[start:end] or range(start, end). - self.end = end - self.characters = characters - - @property - def text(self): - return ''.join(self.characters[self.start:self.end]) - - def __str__(self): - return self.text - - def __repr__(self): - return self.text - class UDWord: - def __init__(self, span, columns, is_multiword): - # Span of this word (or MWT, see below) within ud_representation.characters. - self.span = span - # 10 columns of the CoNLL-U file: ID, FORM, LEMMA,... - self.columns = columns - # is_multiword==True means that this word is part of a multi-word token. - # In that case, self.span marks the span of the whole multi-word token. - self.is_multiword = is_multiword - # Reference to the UDWord instance representing the HEAD (or None if root). - self.parent = None - # Let's ignore language-specific deprel subtypes. 
- self.columns[DEPREL] = columns[DEPREL].split(':')[0] - - ud = UDRepresentation() - - # Load the CoNLL-U file - index, sentence_start = 0, None - linenum = 0 - while True: - line = file.readline() - linenum += 1 - if not line: - break - line = line.rstrip("\r\n") - - # Handle sentence start boundaries - if sentence_start is None: - # Skip comments - if line.startswith("#"): - continue - # Start a new sentence - ud.sentences.append(UDSpan(index, 0, ud.characters)) - sentence_start = len(ud.words) - if not line: - # Add parent UDWord links and check there are no cycles - def process_word(word): - if word.parent == "remapping": - raise UDError("There is a cycle in a sentence") - if word.parent is None: - head = int(word.columns[HEAD]) - if head > len(ud.words) - sentence_start: - raise UDError("Line {}: HEAD '{}' points outside of the sentence".format( - linenum, word.columns[HEAD])) - if head: - parent = ud.words[sentence_start + head - 1] - word.parent = "remapping" - process_word(parent) - word.parent = parent - - for word in ud.words[sentence_start:]: - process_word(word) - - # Check there is a single root node - if check_parse: - if len([word for word in ud.words[sentence_start:] if word.parent is None]) != 1: - raise UDError("There are multiple roots in a sentence") - - # End the sentence - ud.sentences[-1].end = index - sentence_start = None - continue - - # Read next token/word - columns = line.split("\t") - if len(columns) != 10: - raise UDError("The CoNLL-U line {} does not contain 10 tab-separated columns: '{}'".format(linenum, line)) - - # Skip empty nodes - if "." in columns[ID]: - continue - - # Delete spaces from FORM so gold.characters == system.characters - # even if one of them tokenizes the space. - columns[FORM] = columns[FORM].replace(" ", "") - if not columns[FORM]: - raise UDError("There is an empty FORM in the CoNLL-U file -- line %d" % linenum) - - # Save token - ud.characters.extend(columns[FORM]) - ud.tokens.append(UDSpan(index, index + len(columns[FORM]), ud.characters)) - index += len(columns[FORM]) - - # Handle multi-word tokens to save word(s) - if "-" in columns[ID]: - try: - start, end = map(int, columns[ID].split("-")) - except: - raise UDError("Cannot parse multi-word token ID '{}'".format(columns[ID])) - - for _ in range(start, end + 1): - word_line = file.readline().rstrip("\r\n") - word_columns = word_line.split("\t") - if len(word_columns) != 10: - print(columns) - raise UDError("The CoNLL-U line {} does not contain 10 tab-separated columns: '{}'".format(linenum, word_line)) - ud.words.append(UDWord(ud.tokens[-1], word_columns, is_multiword=True)) - # Basic tokens/words - else: - try: - word_id = int(columns[ID]) - except: - raise UDError("Cannot parse word ID '{}'".format(columns[ID])) - if word_id != len(ud.words) - sentence_start + 1: - raise UDError("Incorrect word ID '{}' for word '{}', expected '{}'".format(columns[ID], columns[FORM], len(ud.words) - sentence_start + 1)) - - try: - head_id = int(columns[HEAD]) - except: - raise UDError("Cannot parse HEAD '{}'".format(columns[HEAD])) - if head_id < 0: - raise UDError("HEAD cannot be negative") - - ud.words.append(UDWord(ud.tokens[-1], columns, is_multiword=False)) - - if sentence_start is not None: - raise UDError("The CoNLL-U file does not end with empty line") - - return ud - -# Evaluate the gold and system treebanks (loaded using load_conllu). 
-def evaluate(gold_ud, system_ud, deprel_weights=None, check_parse=True): - class Score: - def __init__(self, gold_total, system_total, correct, aligned_total=None, undersegmented=None, oversegmented=None): - self.precision = correct / system_total if system_total else 0.0 - self.recall = correct / gold_total if gold_total else 0.0 - self.f1 = 2 * correct / (system_total + gold_total) if system_total + gold_total else 0.0 - self.aligned_accuracy = correct / aligned_total if aligned_total else aligned_total - self.undersegmented = undersegmented - self.oversegmented = oversegmented - self.under_perc = len(undersegmented) / gold_total if gold_total and undersegmented else 0.0 - self.over_perc = len(oversegmented) / gold_total if gold_total and oversegmented else 0.0 - class AlignmentWord: - def __init__(self, gold_word, system_word): - self.gold_word = gold_word - self.system_word = system_word - self.gold_parent = None - self.system_parent_gold_aligned = None - class Alignment: - def __init__(self, gold_words, system_words): - self.gold_words = gold_words - self.system_words = system_words - self.matched_words = [] - self.matched_words_map = {} - def append_aligned_words(self, gold_word, system_word): - self.matched_words.append(AlignmentWord(gold_word, system_word)) - self.matched_words_map[system_word] = gold_word - def fill_parents(self): - # We represent root parents in both gold and system data by '0'. - # For gold data, we represent non-root parent by corresponding gold word. - # For system data, we represent non-root parent by either gold word aligned - # to parent system nodes, or by None if no gold words is aligned to the parent. - for words in self.matched_words: - words.gold_parent = words.gold_word.parent if words.gold_word.parent is not None else 0 - words.system_parent_gold_aligned = self.matched_words_map.get(words.system_word.parent, None) \ - if words.system_word.parent is not None else 0 - - def lower(text): - if sys.version_info < (3, 0) and isinstance(text, str): - return text.decode("utf-8").lower() - return text.lower() - - def spans_score(gold_spans, system_spans): - correct, gi, si = 0, 0, 0 - undersegmented = [] - oversegmented = [] - combo = 0 - previous_end_si_earlier = False - previous_end_gi_earlier = False - while gi < len(gold_spans) and si < len(system_spans): - previous_si = system_spans[si-1] if si > 0 else None - previous_gi = gold_spans[gi-1] if gi > 0 else None - if system_spans[si].start < gold_spans[gi].start: - # avoid counting the same mistake twice - if not previous_end_si_earlier: - combo += 1 - oversegmented.append(str(previous_gi).strip()) - si += 1 - elif gold_spans[gi].start < system_spans[si].start: - # avoid counting the same mistake twice - if not previous_end_gi_earlier: - combo += 1 - undersegmented.append(str(previous_si).strip()) - gi += 1 - else: - correct += gold_spans[gi].end == system_spans[si].end - if gold_spans[gi].end < system_spans[si].end: - undersegmented.append(str(system_spans[si]).strip()) - previous_end_gi_earlier = True - previous_end_si_earlier = False - elif gold_spans[gi].end > system_spans[si].end: - oversegmented.append(str(gold_spans[gi]).strip()) - previous_end_si_earlier = True - previous_end_gi_earlier = False - else: - previous_end_gi_earlier = False - previous_end_si_earlier = False - si += 1 - gi += 1 - - return Score(len(gold_spans), len(system_spans), correct, None, undersegmented, oversegmented) - - def alignment_score(alignment, key_fn, weight_fn=lambda w: 1): - gold, system, aligned, correct = 0, 0, 0, 0 
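        # gold and system accumulate the weighted counts of gold and system words;
        # aligned accumulates the weighted count of aligned word pairs; correct
        # counts aligned pairs whose key_fn values agree. They feed the Score
        # instances returned below.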
- - for word in alignment.gold_words: - gold += weight_fn(word) - - for word in alignment.system_words: - system += weight_fn(word) - - for words in alignment.matched_words: - aligned += weight_fn(words.gold_word) - - if key_fn is None: - # Return score for whole aligned words - return Score(gold, system, aligned) - - for words in alignment.matched_words: - if key_fn(words.gold_word, words.gold_parent) == key_fn(words.system_word, words.system_parent_gold_aligned): - correct += weight_fn(words.gold_word) - - return Score(gold, system, correct, aligned) - - def beyond_end(words, i, multiword_span_end): - if i >= len(words): - return True - if words[i].is_multiword: - return words[i].span.start >= multiword_span_end - return words[i].span.end > multiword_span_end - - def extend_end(word, multiword_span_end): - if word.is_multiword and word.span.end > multiword_span_end: - return word.span.end - return multiword_span_end - - def find_multiword_span(gold_words, system_words, gi, si): - # We know gold_words[gi].is_multiword or system_words[si].is_multiword. - # Find the start of the multiword span (gs, ss), so the multiword span is minimal. - # Initialize multiword_span_end characters index. - if gold_words[gi].is_multiword: - multiword_span_end = gold_words[gi].span.end - if not system_words[si].is_multiword and system_words[si].span.start < gold_words[gi].span.start: - si += 1 - else: # if system_words[si].is_multiword - multiword_span_end = system_words[si].span.end - if not gold_words[gi].is_multiword and gold_words[gi].span.start < system_words[si].span.start: - gi += 1 - gs, ss = gi, si - - # Find the end of the multiword span - # (so both gi and si are pointing to the word following the multiword span end). - while not beyond_end(gold_words, gi, multiword_span_end) or \ - not beyond_end(system_words, si, multiword_span_end): - if gi < len(gold_words) and (si >= len(system_words) or - gold_words[gi].span.start <= system_words[si].span.start): - multiword_span_end = extend_end(gold_words[gi], multiword_span_end) - gi += 1 - else: - multiword_span_end = extend_end(system_words[si], multiword_span_end) - si += 1 - return gs, ss, gi, si - - def compute_lcs(gold_words, system_words, gi, si, gs, ss): - lcs = [[0] * (si - ss) for i in range(gi - gs)] - for g in reversed(range(gi - gs)): - for s in reversed(range(si - ss)): - if lower(gold_words[gs + g].columns[FORM]) == lower(system_words[ss + s].columns[FORM]): - lcs[g][s] = 1 + (lcs[g+1][s+1] if g+1 < gi-gs and s+1 < si-ss else 0) - lcs[g][s] = max(lcs[g][s], lcs[g+1][s] if g+1 < gi-gs else 0) - lcs[g][s] = max(lcs[g][s], lcs[g][s+1] if s+1 < si-ss else 0) - return lcs - - def align_words(gold_words, system_words): - alignment = Alignment(gold_words, system_words) - - gi, si = 0, 0 - while gi < len(gold_words) and si < len(system_words): - if gold_words[gi].is_multiword or system_words[si].is_multiword: - # A: Multi-word tokens => align via LCS within the whole "multiword span". - gs, ss, gi, si = find_multiword_span(gold_words, system_words, gi, si) - - if si > ss and gi > gs: - lcs = compute_lcs(gold_words, system_words, gi, si, gs, ss) - - # Store aligned words - s, g = 0, 0 - while g < gi - gs and s < si - ss: - if lower(gold_words[gs + g].columns[FORM]) == lower(system_words[ss + s].columns[FORM]): - alignment.append_aligned_words(gold_words[gs+g], system_words[ss+s]) - g += 1 - s += 1 - elif lcs[g][s] == (lcs[g+1][s] if g+1 < gi-gs else 0): - g += 1 - else: - s += 1 - else: - # B: No multi-word token => align according to spans. 
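                # If the gold and system words cover exactly the same character
                # span they are aligned; otherwise the word that starts earlier is
                # skipped, so one pointer always advances and the loop terminates.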
- if (gold_words[gi].span.start, gold_words[gi].span.end) == (system_words[si].span.start, system_words[si].span.end): - alignment.append_aligned_words(gold_words[gi], system_words[si]) - gi += 1 - si += 1 - elif gold_words[gi].span.start <= system_words[si].span.start: - gi += 1 - else: - si += 1 - - alignment.fill_parents() - - return alignment - - # Check that underlying character sequences do match - if gold_ud.characters != system_ud.characters: - index = 0 - while gold_ud.characters[index] == system_ud.characters[index]: - index += 1 - - raise UDError( - "The concatenation of tokens in gold file and in system file differ!\n" + - "First 20 differing characters in gold file: '{}' and system file: '{}'".format( - "".join(gold_ud.characters[index:index + 20]), - "".join(system_ud.characters[index:index + 20]) - ) - ) - - # Align words - alignment = align_words(gold_ud.words, system_ud.words) - - # Compute the F1-scores - if check_parse: - result = { - "Tokens": spans_score(gold_ud.tokens, system_ud.tokens), - "Sentences": spans_score(gold_ud.sentences, system_ud.sentences), - "Words": alignment_score(alignment, None), - "UPOS": alignment_score(alignment, lambda w, parent: w.columns[UPOS]), - "XPOS": alignment_score(alignment, lambda w, parent: w.columns[XPOS]), - "Feats": alignment_score(alignment, lambda w, parent: w.columns[FEATS]), - "AllTags": alignment_score(alignment, lambda w, parent: (w.columns[UPOS], w.columns[XPOS], w.columns[FEATS])), - "Lemmas": alignment_score(alignment, lambda w, parent: w.columns[LEMMA]), - "UAS": alignment_score(alignment, lambda w, parent: parent), - "LAS": alignment_score(alignment, lambda w, parent: (parent, w.columns[DEPREL])), - } - else: - result = { - "Tokens": spans_score(gold_ud.tokens, system_ud.tokens), - "Sentences": spans_score(gold_ud.sentences, system_ud.sentences), - "Words": alignment_score(alignment, None), - "Feats": alignment_score(alignment, lambda w, parent: w.columns[FEATS]), - "Lemmas": alignment_score(alignment, lambda w, parent: w.columns[LEMMA]), - } - - - # Add WeightedLAS if weights are given - if deprel_weights is not None: - def weighted_las(word): - return deprel_weights.get(word.columns[DEPREL], 1.0) - result["WeightedLAS"] = alignment_score(alignment, lambda w, parent: (parent, w.columns[DEPREL]), weighted_las) - - return result - -def load_deprel_weights(weights_file): - if weights_file is None: - return None - - deprel_weights = {} - for line in weights_file: - # Ignore comments and empty lines - if line.startswith("#") or not line.strip(): - continue - - columns = line.rstrip("\r\n").split() - if len(columns) != 2: - raise ValueError("Expected two columns in the UD Relations weights file on line '{}'".format(line)) - - deprel_weights[columns[0]] = float(columns[1]) - - return deprel_weights - -def load_conllu_file(path): - _file = open(path, mode="r", **({"encoding": "utf-8"} if sys.version_info >= (3, 0) else {})) - return load_conllu(_file) - -def evaluate_wrapper(args): - # Load CoNLL-U files - gold_ud = load_conllu_file(args.gold_file) - system_ud = load_conllu_file(args.system_file) - - # Load weights if requested - deprel_weights = load_deprel_weights(args.weights) - - return evaluate(gold_ud, system_ud, deprel_weights) - -def main(): - # Parse arguments - parser = argparse.ArgumentParser() - parser.add_argument("gold_file", type=str, - help="Name of the CoNLL-U file with the gold data.") - parser.add_argument("system_file", type=str, - help="Name of the CoNLL-U file with the predicted data.") - 
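    # Hedged usage sketch (file names are placeholders): the scorer can also be
    # driven from Python without the CLI, mirroring evaluate_wrapper() above:
    #
    #     gold_ud = load_conllu_file("gold.conllu")
    #     system_ud = load_conllu_file("system.conllu")
    #     print("LAS F1: {:.2f}".format(100 * evaluate(gold_ud, system_ud)["LAS"].f1))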
parser.add_argument("--weights", "-w", type=argparse.FileType("r"), default=None, - metavar="deprel_weights_file", - help="Compute WeightedLAS using given weights for Universal Dependency Relations.") - parser.add_argument("--verbose", "-v", default=0, action="count", - help="Print all metrics.") - args = parser.parse_args() - - # Use verbose if weights are supplied - if args.weights is not None and not args.verbose: - args.verbose = 1 - - # Evaluate - evaluation = evaluate_wrapper(args) - - # Print the evaluation - if not args.verbose: - print("LAS F1 Score: {:.2f}".format(100 * evaluation["LAS"].f1)) - else: - metrics = ["Tokens", "Sentences", "Words", "UPOS", "XPOS", "Feats", "AllTags", "Lemmas", "UAS", "LAS"] - if args.weights is not None: - metrics.append("WeightedLAS") - - print("Metrics | Precision | Recall | F1 Score | AligndAcc") - print("-----------+-----------+-----------+-----------+-----------") - for metric in metrics: - print("{:11}|{:10.2f} |{:10.2f} |{:10.2f} |{}".format( - metric, - 100 * evaluation[metric].precision, - 100 * evaluation[metric].recall, - 100 * evaluation[metric].f1, - "{:10.2f}".format(100 * evaluation[metric].aligned_accuracy) if evaluation[metric].aligned_accuracy is not None else "" - )) - -if __name__ == "__main__": - main() - -# Tests, which can be executed with `python -m unittest conll17_ud_eval`. -class TestAlignment(unittest.TestCase): - @staticmethod - def _load_words(words): - """Prepare fake CoNLL-U files with fake HEAD to prevent multiple roots errors.""" - lines, num_words = [], 0 - for w in words: - parts = w.split(" ") - if len(parts) == 1: - num_words += 1 - lines.append("{}\t{}\t_\t_\t_\t_\t{}\t_\t_\t_".format(num_words, parts[0], int(num_words>1))) - else: - lines.append("{}-{}\t{}\t_\t_\t_\t_\t_\t_\t_\t_".format(num_words + 1, num_words + len(parts) - 1, parts[0])) - for part in parts[1:]: - num_words += 1 - lines.append("{}\t{}\t_\t_\t_\t_\t{}\t_\t_\t_".format(num_words, part, int(num_words>1))) - return load_conllu((io.StringIO if sys.version_info >= (3, 0) else io.BytesIO)("\n".join(lines+["\n"]))) - - def _test_exception(self, gold, system): - self.assertRaises(UDError, evaluate, self._load_words(gold), self._load_words(system)) - - def _test_ok(self, gold, system, correct): - metrics = evaluate(self._load_words(gold), self._load_words(system)) - gold_words = sum((max(1, len(word.split(" ")) - 1) for word in gold)) - system_words = sum((max(1, len(word.split(" ")) - 1) for word in system)) - self.assertEqual((metrics["Words"].precision, metrics["Words"].recall, metrics["Words"].f1), - (correct / system_words, correct / gold_words, 2 * correct / (gold_words + system_words))) - - def test_exception(self): - self._test_exception(["a"], ["b"]) - - def test_equal(self): - self._test_ok(["a"], ["a"], 1) - self._test_ok(["a", "b", "c"], ["a", "b", "c"], 3) - - def test_equal_with_multiword(self): - self._test_ok(["abc a b c"], ["a", "b", "c"], 3) - self._test_ok(["a", "bc b c", "d"], ["a", "b", "c", "d"], 4) - self._test_ok(["abcd a b c d"], ["ab a b", "cd c d"], 4) - self._test_ok(["abc a b c", "de d e"], ["a", "bcd b c d", "e"], 5) - - def test_alignment(self): - self._test_ok(["abcd"], ["a", "b", "c", "d"], 0) - self._test_ok(["abc", "d"], ["a", "b", "c", "d"], 1) - self._test_ok(["a", "bc", "d"], ["a", "b", "c", "d"], 2) - self._test_ok(["a", "bc b c", "d"], ["a", "b", "cd"], 2) - self._test_ok(["abc a BX c", "def d EX f"], ["ab a b", "cd c d", "ef e f"], 4) - self._test_ok(["ab a b", "cd bc d"], ["a", "bc", "d"], 2) - 
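        # Notation note: "bc b c" denotes a surface token "bc" that splits into the
        # words "b" and "c", while a bare entry like "d" is a single word. For gold
        # ["a", "bc b c", "d"] vs. system ["a", "b", "cd"] above, this gives 4 gold
        # words, 3 system words and 2 correctly aligned words, i.e. the expected
        # Words scores are precision = 2/3, recall = 2/4 and F1 = 2*2/(4+3).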
self._test_ok(["a", "bc b c", "d"], ["ab AX BX", "cd CX a"], 1) diff --git a/bin/ud/run_eval.py b/bin/ud/run_eval.py deleted file mode 100644 index 2da476721..000000000 --- a/bin/ud/run_eval.py +++ /dev/null @@ -1,293 +0,0 @@ -import spacy -import time -import re -import plac -import operator -import datetime -from pathlib import Path -import xml.etree.ElementTree as ET - -import conll17_ud_eval -from ud_train import write_conllu -from spacy.lang.lex_attrs import word_shape -from spacy.util import get_lang_class - -# All languages in spaCy - in UD format (note that Norwegian is 'no' instead of 'nb') -ALL_LANGUAGES = ("af, ar, bg, bn, ca, cs, da, de, el, en, es, et, fa, fi, fr," - "ga, he, hi, hr, hu, id, is, it, ja, kn, ko, lt, lv, mr, no," - "nl, pl, pt, ro, ru, si, sk, sl, sq, sr, sv, ta, te, th, tl," - "tr, tt, uk, ur, vi, zh") - -# Non-parsing tasks that will be evaluated (works for default models) -EVAL_NO_PARSE = ['Tokens', 'Words', 'Lemmas', 'Sentences', 'Feats'] - -# Tasks that will be evaluated if check_parse=True (does not work for default models) -EVAL_PARSE = ['Tokens', 'Words', 'Lemmas', 'Sentences', 'Feats', 'UPOS', 'XPOS', 'AllTags', 'UAS', 'LAS'] - -# Minimum frequency an error should have to be printed -PRINT_FREQ = 20 - -# Maximum number of errors printed per category -PRINT_TOTAL = 10 - -space_re = re.compile("\s+") - - -def load_model(modelname, add_sentencizer=False): - """ Load a specific spaCy model """ - loading_start = time.time() - nlp = spacy.load(modelname) - if add_sentencizer: - nlp.add_pipe(nlp.create_pipe('sentencizer')) - loading_end = time.time() - loading_time = loading_end - loading_start - if add_sentencizer: - return nlp, loading_time, modelname + '_sentencizer' - return nlp, loading_time, modelname - - -def load_default_model_sentencizer(lang): - """ Load a generic spaCy model and add the sentencizer for sentence tokenization""" - loading_start = time.time() - lang_class = get_lang_class(lang) - nlp = lang_class() - nlp.add_pipe(nlp.create_pipe('sentencizer')) - loading_end = time.time() - loading_time = loading_end - loading_start - return nlp, loading_time, lang + "_default_" + 'sentencizer' - - -def split_text(text): - return [space_re.sub(" ", par.strip()) for par in text.split("\n\n")] - - -def get_freq_tuples(my_list, print_total_threshold): - """ Turn a list of errors into frequency-sorted tuples thresholded by a certain total number """ - d = {} - for token in my_list: - d.setdefault(token, 0) - d[token] += 1 - return sorted(d.items(), key=operator.itemgetter(1), reverse=True)[:print_total_threshold] - - -def _contains_blinded_text(stats_xml): - """ Heuristic to determine whether the treebank has blinded texts or not """ - tree = ET.parse(stats_xml) - root = tree.getroot() - total_tokens = int(root.find('size/total/tokens').text) - unique_forms = int(root.find('forms').get('unique')) - - # assume the corpus is largely blinded when there are less than 1% unique tokens - return (unique_forms / total_tokens) < 0.01 - - -def fetch_all_treebanks(ud_dir, languages, corpus, best_per_language): - """" Fetch the txt files for all treebanks for a given set of languages """ - all_treebanks = dict() - treebank_size = dict() - for l in languages: - all_treebanks[l] = [] - treebank_size[l] = 0 - - for treebank_dir in ud_dir.iterdir(): - if treebank_dir.is_dir(): - for txt_path in treebank_dir.iterdir(): - if txt_path.name.endswith('-ud-' + corpus + '.txt'): - file_lang = txt_path.name.split('_')[0] - if file_lang in languages: - gold_path = treebank_dir / 
txt_path.name.replace('.txt', '.conllu') - stats_xml = treebank_dir / "stats.xml" - # ignore treebanks where the texts are not publicly available - if not _contains_blinded_text(stats_xml): - if not best_per_language: - all_treebanks[file_lang].append(txt_path) - # check the tokens in the gold annotation to keep only the biggest treebank per language - else: - with gold_path.open(mode='r', encoding='utf-8') as gold_file: - gold_ud = conll17_ud_eval.load_conllu(gold_file) - gold_tokens = len(gold_ud.tokens) - if treebank_size[file_lang] < gold_tokens: - all_treebanks[file_lang] = [txt_path] - treebank_size[file_lang] = gold_tokens - - return all_treebanks - - -def run_single_eval(nlp, loading_time, print_name, text_path, gold_ud, tmp_output_path, out_file, print_header, - check_parse, print_freq_tasks): - """" Run an evaluation of a model nlp on a certain specified treebank """ - with text_path.open(mode='r', encoding='utf-8') as f: - flat_text = f.read() - - # STEP 1: tokenize text - tokenization_start = time.time() - texts = split_text(flat_text) - docs = list(nlp.pipe(texts)) - tokenization_end = time.time() - tokenization_time = tokenization_end - tokenization_start - - # STEP 2: record stats and timings - tokens_per_s = int(len(gold_ud.tokens) / tokenization_time) - - print_header_1 = ['date', 'text_path', 'gold_tokens', 'model', 'loading_time', 'tokenization_time', 'tokens_per_s'] - print_string_1 = [str(datetime.date.today()), text_path.name, len(gold_ud.tokens), - print_name, "%.2f" % loading_time, "%.2f" % tokenization_time, tokens_per_s] - - # STEP 3: evaluate predicted tokens and features - with tmp_output_path.open(mode="w", encoding="utf8") as tmp_out_file: - write_conllu(docs, tmp_out_file) - with tmp_output_path.open(mode="r", encoding="utf8") as sys_file: - sys_ud = conll17_ud_eval.load_conllu(sys_file, check_parse=check_parse) - tmp_output_path.unlink() - scores = conll17_ud_eval.evaluate(gold_ud, sys_ud, check_parse=check_parse) - - # STEP 4: format the scoring results - eval_headers = EVAL_PARSE - if not check_parse: - eval_headers = EVAL_NO_PARSE - - for score_name in eval_headers: - score = scores[score_name] - print_string_1.extend(["%.2f" % score.precision, - "%.2f" % score.recall, - "%.2f" % score.f1]) - print_string_1.append("-" if score.aligned_accuracy is None else "%.2f" % score.aligned_accuracy) - print_string_1.append("-" if score.undersegmented is None else "%.4f" % score.under_perc) - print_string_1.append("-" if score.oversegmented is None else "%.4f" % score.over_perc) - - print_header_1.extend([score_name + '_p', score_name + '_r', score_name + '_F', score_name + '_acc', - score_name + '_under', score_name + '_over']) - - if score_name in print_freq_tasks: - print_header_1.extend([score_name + '_word_under_ex', score_name + '_shape_under_ex', - score_name + '_word_over_ex', score_name + '_shape_over_ex']) - - d_under_words = get_freq_tuples(score.undersegmented, PRINT_TOTAL) - d_under_shapes = get_freq_tuples([word_shape(x) for x in score.undersegmented], PRINT_TOTAL) - d_over_words = get_freq_tuples(score.oversegmented, PRINT_TOTAL) - d_over_shapes = get_freq_tuples([word_shape(x) for x in score.oversegmented], PRINT_TOTAL) - - # saving to CSV with ; seperator so blinding ; in the example output - print_string_1.append( - str({k: v for k, v in d_under_words if v > PRINT_FREQ}).replace(";", "*SEMICOLON*")) - print_string_1.append( - str({k: v for k, v in d_under_shapes if v > PRINT_FREQ}).replace(";", "*SEMICOLON*")) - print_string_1.append( - str({k: v 
for k, v in d_over_words if v > PRINT_FREQ}).replace(";", "*SEMICOLON*")) - print_string_1.append( - str({k: v for k, v in d_over_shapes if v > PRINT_FREQ}).replace(";", "*SEMICOLON*")) - - # STEP 5: print the formatted results to CSV - if print_header: - out_file.write(';'.join(map(str, print_header_1)) + '\n') - out_file.write(';'.join(map(str, print_string_1)) + '\n') - - -def run_all_evals(models, treebanks, out_file, check_parse, print_freq_tasks): - """" Run an evaluation for each language with its specified models and treebanks """ - print_header = True - - for tb_lang, treebank_list in treebanks.items(): - print() - print("Language", tb_lang) - for text_path in treebank_list: - print(" Evaluating on", text_path) - - gold_path = text_path.parent / (text_path.stem + '.conllu') - print(" Gold data from ", gold_path) - - # nested try blocks to ensure the code can continue with the next iteration after a failure - try: - with gold_path.open(mode='r', encoding='utf-8') as gold_file: - gold_ud = conll17_ud_eval.load_conllu(gold_file) - - for nlp, nlp_loading_time, nlp_name in models[tb_lang]: - try: - print(" Benchmarking", nlp_name) - tmp_output_path = text_path.parent / str('tmp_' + nlp_name + '.conllu') - run_single_eval(nlp, nlp_loading_time, nlp_name, text_path, gold_ud, tmp_output_path, out_file, - print_header, check_parse, print_freq_tasks) - print_header = False - except Exception as e: - print(" Ran into trouble: ", str(e)) - except Exception as e: - print(" Ran into trouble: ", str(e)) - - -@plac.annotations( - out_path=("Path to output CSV file", "positional", None, Path), - ud_dir=("Path to Universal Dependencies corpus", "positional", None, Path), - check_parse=("Set flag to evaluate parsing performance", "flag", "p", bool), - langs=("Enumeration of languages to evaluate (default: all)", "option", "l", str), - exclude_trained_models=("Set flag to exclude trained models", "flag", "t", bool), - exclude_multi=("Set flag to exclude the multi-language model as default baseline", "flag", "m", bool), - hide_freq=("Set flag to avoid printing out more detailed high-freq tokenization errors", "flag", "f", bool), - corpus=("Whether to run on train, dev or test", "option", "c", str), - best_per_language=("Set flag to only keep the largest treebank for each language", "flag", "b", bool) -) -def main(out_path, ud_dir, check_parse=False, langs=ALL_LANGUAGES, exclude_trained_models=False, exclude_multi=False, - hide_freq=False, corpus='train', best_per_language=False): - """" - Assemble all treebanks and models to run evaluations with. 
- When setting check_parse to True, the default models will not be evaluated as they don't have parsing functionality - """ - languages = [lang.strip() for lang in langs.split(",")] - - print_freq_tasks = [] - if not hide_freq: - print_freq_tasks = ['Tokens'] - - # fetching all relevant treebank from the directory - treebanks = fetch_all_treebanks(ud_dir, languages, corpus, best_per_language) - - print() - print("Loading all relevant models for", languages) - models = dict() - - # multi-lang model - multi = None - if not exclude_multi and not check_parse: - multi = load_model('xx_ent_wiki_sm', add_sentencizer=True) - - # initialize all models with the multi-lang model - for lang in languages: - models[lang] = [multi] if multi else [] - # add default models if we don't want to evaluate parsing info - if not check_parse: - # Norwegian is 'nb' in spaCy but 'no' in the UD corpora - if lang == 'no': - models['no'].append(load_default_model_sentencizer('nb')) - else: - models[lang].append(load_default_model_sentencizer(lang)) - - # language-specific trained models - if not exclude_trained_models: - if 'de' in models: - models['de'].append(load_model('de_core_news_sm')) - models['de'].append(load_model('de_core_news_md')) - if 'el' in models: - models['el'].append(load_model('el_core_news_sm')) - models['el'].append(load_model('el_core_news_md')) - if 'en' in models: - models['en'].append(load_model('en_core_web_sm')) - models['en'].append(load_model('en_core_web_md')) - models['en'].append(load_model('en_core_web_lg')) - if 'es' in models: - models['es'].append(load_model('es_core_news_sm')) - models['es'].append(load_model('es_core_news_md')) - if 'fr' in models: - models['fr'].append(load_model('fr_core_news_sm')) - models['fr'].append(load_model('fr_core_news_md')) - if 'it' in models: - models['it'].append(load_model('it_core_news_sm')) - if 'nl' in models: - models['nl'].append(load_model('nl_core_news_sm')) - if 'pt' in models: - models['pt'].append(load_model('pt_core_news_sm')) - - with out_path.open(mode='w', encoding='utf-8') as out_file: - run_all_evals(models, treebanks, out_file, check_parse, print_freq_tasks) - - -if __name__ == "__main__": - plac.call(main) diff --git a/bin/ud/ud_run_test.py b/bin/ud/ud_run_test.py deleted file mode 100644 index 7cb270d84..000000000 --- a/bin/ud/ud_run_test.py +++ /dev/null @@ -1,335 +0,0 @@ -# flake8: noqa -"""Train for CONLL 2017 UD treebank evaluation. Takes .conllu files, writes -.conllu format for development data, allowing the official scorer to be used. -""" -from __future__ import unicode_literals - -import plac -from pathlib import Path -import re -import sys -import srsly - -import spacy -import spacy.util -from spacy.tokens import Token, Doc -from spacy.gold import GoldParse -from spacy.util import compounding, minibatch_by_words -from spacy.syntax.nonproj import projectivize -from spacy.matcher import Matcher - -# from spacy.morphology import Fused_begin, Fused_inside -from spacy import displacy -from collections import defaultdict, Counter -from timeit import default_timer as timer - -Fused_begin = None -Fused_inside = None - -import itertools -import random -import numpy.random - -from . 
import conll17_ud_eval - -from spacy import lang -from spacy.lang import zh -from spacy.lang import ja -from spacy.lang import ru - - -################ -# Data reading # -################ - -space_re = re.compile(r"\s+") - - -def split_text(text): - return [space_re.sub(" ", par.strip()) for par in text.split("\n\n")] - - -############## -# Evaluation # -############## - - -def read_conllu(file_): - docs = [] - sent = [] - doc = [] - for line in file_: - if line.startswith("# newdoc"): - if doc: - docs.append(doc) - doc = [] - elif line.startswith("#"): - continue - elif not line.strip(): - if sent: - doc.append(sent) - sent = [] - else: - sent.append(list(line.strip().split("\t"))) - if len(sent[-1]) != 10: - print(repr(line)) - raise ValueError - if sent: - doc.append(sent) - if doc: - docs.append(doc) - return docs - - -def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None): - if text_loc.parts[-1].endswith(".conllu"): - docs = [] - with text_loc.open(encoding="utf8") as file_: - for conllu_doc in read_conllu(file_): - for conllu_sent in conllu_doc: - words = [line[1] for line in conllu_sent] - docs.append(Doc(nlp.vocab, words=words)) - for name, component in nlp.pipeline: - docs = list(component.pipe(docs)) - else: - with text_loc.open("r", encoding="utf8") as text_file: - texts = split_text(text_file.read()) - docs = list(nlp.pipe(texts)) - with sys_loc.open("w", encoding="utf8") as out_file: - write_conllu(docs, out_file) - with gold_loc.open("r", encoding="utf8") as gold_file: - gold_ud = conll17_ud_eval.load_conllu(gold_file) - with sys_loc.open("r", encoding="utf8") as sys_file: - sys_ud = conll17_ud_eval.load_conllu(sys_file) - scores = conll17_ud_eval.evaluate(gold_ud, sys_ud) - return docs, scores - - -def write_conllu(docs, file_): - merger = Matcher(docs[0].vocab) - merger.add("SUBTOK", None, [{"DEP": "subtok", "op": "+"}]) - for i, doc in enumerate(docs): - matches = [] - if doc.is_parsed: - matches = merger(doc) - spans = [doc[start : end + 1] for _, start, end in matches] - with doc.retokenize() as retokenizer: - for span in spans: - retokenizer.merge(span) - file_.write("# newdoc id = {i}\n".format(i=i)) - for j, sent in enumerate(doc.sents): - file_.write("# sent_id = {i}.{j}\n".format(i=i, j=j)) - file_.write("# text = {text}\n".format(text=sent.text)) - for k, token in enumerate(sent): - file_.write(_get_token_conllu(token, k, len(sent)) + "\n") - file_.write("\n") - for word in sent: - if word.head.i == word.i and word.dep_ == "ROOT": - break - else: - print("Rootless sentence!") - print(sent) - print(i) - for w in sent: - print(w.i, w.text, w.head.text, w.head.i, w.dep_) - raise ValueError - - -def _get_token_conllu(token, k, sent_len): - if token.check_morph(Fused_begin) and (k + 1 < sent_len): - n = 1 - text = [token.text] - while token.nbor(n).check_morph(Fused_inside): - text.append(token.nbor(n).text) - n += 1 - id_ = "%d-%d" % (k + 1, (k + n)) - fields = [id_, "".join(text)] + ["_"] * 8 - lines = ["\t".join(fields)] - else: - lines = [] - if token.head.i == token.i: - head = 0 - else: - head = k + (token.head.i - token.i) + 1 - fields = [ - str(k + 1), - token.text, - token.lemma_, - token.pos_, - token.tag_, - "_", - str(head), - token.dep_.lower(), - "_", - "_", - ] - if token.check_morph(Fused_begin) and (k + 1 < sent_len): - if k == 0: - fields[1] = token.norm_[0].upper() + token.norm_[1:] - else: - fields[1] = token.norm_ - elif token.check_morph(Fused_inside): - fields[1] = token.norm_ - elif token._.split_start is not None: - split_start = 
token._.split_start - split_end = token._.split_end - split_len = (split_end.i - split_start.i) + 1 - n_in_split = token.i - split_start.i - subtokens = guess_fused_orths(split_start.text, [""] * split_len) - fields[1] = subtokens[n_in_split] - - lines.append("\t".join(fields)) - return "\n".join(lines) - - -def guess_fused_orths(word, ud_forms): - """The UD data 'fused tokens' don't necessarily expand to keys that match - the form. We need orths that exact match the string. Here we make a best - effort to divide up the word.""" - if word == "".join(ud_forms): - # Happy case: we get a perfect split, with each letter accounted for. - return ud_forms - elif len(word) == sum(len(subtoken) for subtoken in ud_forms): - # Unideal, but at least lengths match. - output = [] - remain = word - for subtoken in ud_forms: - assert len(subtoken) >= 1 - output.append(remain[: len(subtoken)]) - remain = remain[len(subtoken) :] - assert len(remain) == 0, (word, ud_forms, remain) - return output - else: - # Let's say word is 6 long, and there are three subtokens. The orths - # *must* equal the original string. Arbitrarily, split [4, 1, 1] - first = word[: len(word) - (len(ud_forms) - 1)] - output = [first] - remain = word[len(first) :] - for i in range(1, len(ud_forms)): - assert remain - output.append(remain[:1]) - remain = remain[1:] - assert len(remain) == 0, (word, output, remain) - return output - - -def print_results(name, ud_scores): - fields = {} - if ud_scores is not None: - fields.update( - { - "words": ud_scores["Words"].f1 * 100, - "sents": ud_scores["Sentences"].f1 * 100, - "tags": ud_scores["XPOS"].f1 * 100, - "uas": ud_scores["UAS"].f1 * 100, - "las": ud_scores["LAS"].f1 * 100, - } - ) - else: - fields.update({"words": 0.0, "sents": 0.0, "tags": 0.0, "uas": 0.0, "las": 0.0}) - tpl = "\t".join( - (name, "{las:.1f}", "{uas:.1f}", "{tags:.1f}", "{sents:.1f}", "{words:.1f}") - ) - print(tpl.format(**fields)) - return fields - - -def get_token_split_start(token): - if token.text == "": - assert token.i != 0 - i = -1 - while token.nbor(i).text == "": - i -= 1 - return token.nbor(i) - elif (token.i + 1) < len(token.doc) and token.nbor(1).text == "": - return token - else: - return None - - -def get_token_split_end(token): - if (token.i + 1) == len(token.doc): - return token if token.text == "" else None - elif token.text != "" and token.nbor(1).text != "": - return None - i = 1 - while (token.i + i) < len(token.doc) and token.nbor(i).text == "": - i += 1 - return token.nbor(i - 1) - - -################## -# Initialization # -################## - - -def load_nlp(experiments_dir, corpus): - nlp = spacy.load(experiments_dir / corpus / "best-model") - return nlp - - -def initialize_pipeline(nlp, docs, golds, config, device): - nlp.add_pipe(nlp.create_pipe("parser")) - return nlp - - -@plac.annotations( - test_data_dir=( - "Path to Universal Dependencies test data", - "positional", - None, - Path, - ), - experiment_dir=("Parent directory with output model", "positional", None, Path), - corpus=( - "UD corpus to evaluate, e.g. 
UD_English, UD_Spanish, etc", - "positional", - None, - str, - ), -) -def main(test_data_dir, experiment_dir, corpus): - Token.set_extension("split_start", getter=get_token_split_start) - Token.set_extension("split_end", getter=get_token_split_end) - Token.set_extension("begins_fused", default=False) - Token.set_extension("inside_fused", default=False) - lang.zh.Chinese.Defaults.use_jieba = False - lang.ja.Japanese.Defaults.use_janome = False - lang.ru.Russian.Defaults.use_pymorphy2 = False - - nlp = load_nlp(experiment_dir, corpus) - - treebank_code = nlp.meta["treebank"] - for section in ("test", "dev"): - if section == "dev": - section_dir = "conll17-ud-development-2017-03-19" - else: - section_dir = "conll17-ud-test-2017-05-09" - text_path = test_data_dir / "input" / section_dir / (treebank_code + ".txt") - udpipe_path = ( - test_data_dir / "input" / section_dir / (treebank_code + "-udpipe.conllu") - ) - gold_path = test_data_dir / "gold" / section_dir / (treebank_code + ".conllu") - - header = [section, "LAS", "UAS", "TAG", "SENT", "WORD"] - print("\t".join(header)) - inputs = {"gold": gold_path, "udp": udpipe_path, "raw": text_path} - for input_type in ("udp", "raw"): - input_path = inputs[input_type] - output_path = ( - experiment_dir / corpus / "{section}.conllu".format(section=section) - ) - - parsed_docs, test_scores = evaluate(nlp, input_path, gold_path, output_path) - - accuracy = print_results(input_type, test_scores) - acc_path = ( - experiment_dir - / corpus - / "{section}-accuracy.json".format(section=section) - ) - srsly.write_json(acc_path, accuracy) - - -if __name__ == "__main__": - plac.call(main) diff --git a/bin/ud/ud_train.py b/bin/ud/ud_train.py deleted file mode 100644 index 6353bd6e7..000000000 --- a/bin/ud/ud_train.py +++ /dev/null @@ -1,570 +0,0 @@ -# flake8: noqa -"""Train for CONLL 2017 UD treebank evaluation. Takes .conllu files, writes -.conllu format for development data, allowing the official scorer to be used. -""" -from __future__ import unicode_literals - -import plac -from pathlib import Path -import re -import json -import tqdm - -import spacy -import spacy.util -from bin.ud import conll17_ud_eval -from spacy.tokens import Token, Doc -from spacy.gold import GoldParse -from spacy.util import compounding, minibatch, minibatch_by_words -from spacy.syntax.nonproj import projectivize -from spacy.matcher import Matcher -from spacy import displacy -from collections import defaultdict - -import random - -from spacy import lang -from spacy.lang import zh -from spacy.lang import ja - -try: - import torch -except ImportError: - torch = None - - -################ -# Data reading # -################ - -space_re = re.compile("\s+") - - -def split_text(text): - return [space_re.sub(" ", par.strip()) for par in text.split("\n\n")] - - -def read_data( - nlp, - conllu_file, - text_file, - raw_text=True, - oracle_segments=False, - max_doc_length=None, - limit=None, -): - """Read the CONLLU format into (Doc, GoldParse) tuples. If raw_text=True, - include Doc objects created using nlp.make_doc and then aligned against - the gold-standard sequences. If oracle_segments=True, include Doc objects - created from the gold-standard segments. 
At least one must be True.""" - if not raw_text and not oracle_segments: - raise ValueError("At least one of raw_text or oracle_segments must be True") - paragraphs = split_text(text_file.read()) - conllu = read_conllu(conllu_file) - # sd is spacy doc; cd is conllu doc - # cs is conllu sent, ct is conllu token - docs = [] - golds = [] - for doc_id, (text, cd) in enumerate(zip(paragraphs, conllu)): - sent_annots = [] - for cs in cd: - sent = defaultdict(list) - for id_, word, lemma, pos, tag, morph, head, dep, _, space_after in cs: - if "." in id_: - continue - if "-" in id_: - continue - id_ = int(id_) - 1 - head = int(head) - 1 if head != "0" else id_ - sent["words"].append(word) - sent["tags"].append(tag) - sent["morphology"].append(_parse_morph_string(morph)) - sent["morphology"][-1].add("POS_%s" % pos) - sent["heads"].append(head) - sent["deps"].append("ROOT" if dep == "root" else dep) - sent["spaces"].append(space_after == "_") - sent["entities"] = ["-"] * len(sent["words"]) - sent["heads"], sent["deps"] = projectivize(sent["heads"], sent["deps"]) - if oracle_segments: - docs.append(Doc(nlp.vocab, words=sent["words"], spaces=sent["spaces"])) - golds.append(GoldParse(docs[-1], **sent)) - assert golds[-1].morphology is not None - - sent_annots.append(sent) - if raw_text and max_doc_length and len(sent_annots) >= max_doc_length: - doc, gold = _make_gold(nlp, None, sent_annots) - assert gold.morphology is not None - sent_annots = [] - docs.append(doc) - golds.append(gold) - if limit and len(docs) >= limit: - return docs, golds - - if raw_text and sent_annots: - doc, gold = _make_gold(nlp, None, sent_annots) - docs.append(doc) - golds.append(gold) - if limit and len(docs) >= limit: - return docs, golds - return docs, golds - -def _parse_morph_string(morph_string): - if morph_string == '_': - return set() - output = [] - replacements = {'1': 'one', '2': 'two', '3': 'three'} - for feature in morph_string.split('|'): - key, value = feature.split('=') - value = replacements.get(value, value) - value = value.split(',')[0] - output.append('%s_%s' % (key, value.lower())) - return set(output) - -def read_conllu(file_): - docs = [] - sent = [] - doc = [] - for line in file_: - if line.startswith("# newdoc"): - if doc: - docs.append(doc) - doc = [] - elif line.startswith("#"): - continue - elif not line.strip(): - if sent: - doc.append(sent) - sent = [] - else: - sent.append(list(line.strip().split("\t"))) - if len(sent[-1]) != 10: - print(repr(line)) - raise ValueError - if sent: - doc.append(sent) - if doc: - docs.append(doc) - return docs - - -def _make_gold(nlp, text, sent_annots, drop_deps=0.0): - # Flatten the conll annotations, and adjust the head indices - flat = defaultdict(list) - sent_starts = [] - for sent in sent_annots: - flat["heads"].extend(len(flat["words"])+head for head in sent["heads"]) - for field in ["words", "tags", "deps", "morphology", "entities", "spaces"]: - flat[field].extend(sent[field]) - sent_starts.append(True) - sent_starts.extend([False] * (len(sent["words"]) - 1)) - # Construct text if necessary - assert len(flat["words"]) == len(flat["spaces"]) - if text is None: - text = "".join( - word + " " * space for word, space in zip(flat["words"], flat["spaces"]) - ) - doc = nlp.make_doc(text) - flat.pop("spaces") - gold = GoldParse(doc, **flat) - gold.sent_starts = sent_starts - for i in range(len(gold.heads)): - if random.random() < drop_deps: - gold.heads[i] = None - gold.labels[i] = None - - return doc, gold - - -############################# -# Data transforms for 
spaCy # -############################# - - -def golds_to_gold_tuples(docs, golds): - """Get out the annoying 'tuples' format used by begin_training, given the - GoldParse objects.""" - tuples = [] - for doc, gold in zip(docs, golds): - text = doc.text - ids, words, tags, heads, labels, iob = zip(*gold.orig_annot) - sents = [((ids, words, tags, heads, labels, iob), [])] - tuples.append((text, sents)) - return tuples - - -############## -# Evaluation # -############## - - -def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None): - if text_loc.parts[-1].endswith(".conllu"): - docs = [] - with text_loc.open(encoding="utf8") as file_: - for conllu_doc in read_conllu(file_): - for conllu_sent in conllu_doc: - words = [line[1] for line in conllu_sent] - docs.append(Doc(nlp.vocab, words=words)) - for name, component in nlp.pipeline: - docs = list(component.pipe(docs)) - else: - with text_loc.open("r", encoding="utf8") as text_file: - texts = split_text(text_file.read()) - docs = list(nlp.pipe(texts)) - with sys_loc.open("w", encoding="utf8") as out_file: - write_conllu(docs, out_file) - with gold_loc.open("r", encoding="utf8") as gold_file: - gold_ud = conll17_ud_eval.load_conllu(gold_file) - with sys_loc.open("r", encoding="utf8") as sys_file: - sys_ud = conll17_ud_eval.load_conllu(sys_file) - scores = conll17_ud_eval.evaluate(gold_ud, sys_ud) - return docs, scores - - -def write_conllu(docs, file_): - if not Token.has_extension("get_conllu_lines"): - Token.set_extension("get_conllu_lines", method=get_token_conllu) - if not Token.has_extension("begins_fused"): - Token.set_extension("begins_fused", default=False) - if not Token.has_extension("inside_fused"): - Token.set_extension("inside_fused", default=False) - - merger = Matcher(docs[0].vocab) - merger.add("SUBTOK", None, [{"DEP": "subtok", "op": "+"}]) - for i, doc in enumerate(docs): - matches = [] - if doc.is_parsed: - matches = merger(doc) - spans = [doc[start : end + 1] for _, start, end in matches] - seen_tokens = set() - with doc.retokenize() as retokenizer: - for span in spans: - span_tokens = set(range(span.start, span.end)) - if not span_tokens.intersection(seen_tokens): - retokenizer.merge(span) - seen_tokens.update(span_tokens) - - file_.write("# newdoc id = {i}\n".format(i=i)) - for j, sent in enumerate(doc.sents): - file_.write("# sent_id = {i}.{j}\n".format(i=i, j=j)) - file_.write("# text = {text}\n".format(text=sent.text)) - for k, token in enumerate(sent): - if token.head.i > sent[-1].i or token.head.i < sent[0].i: - for word in doc[sent[0].i - 10 : sent[0].i]: - print(word.i, word.head.i, word.text, word.dep_) - for word in sent: - print(word.i, word.head.i, word.text, word.dep_) - for word in doc[sent[-1].i : sent[-1].i + 10]: - print(word.i, word.head.i, word.text, word.dep_) - raise ValueError( - "Invalid parse: head outside sentence (%s)" % token.text - ) - file_.write(token._.get_conllu_lines(k) + "\n") - file_.write("\n") - - -def print_progress(itn, losses, ud_scores): - fields = { - "dep_loss": losses.get("parser", 0.0), - "morph_loss": losses.get("morphologizer", 0.0), - "tag_loss": losses.get("tagger", 0.0), - "words": ud_scores["Words"].f1 * 100, - "sents": ud_scores["Sentences"].f1 * 100, - "tags": ud_scores["XPOS"].f1 * 100, - "uas": ud_scores["UAS"].f1 * 100, - "las": ud_scores["LAS"].f1 * 100, - "morph": ud_scores["Feats"].f1 * 100, - } - header = ["Epoch", "P.Loss", "M.Loss", "LAS", "UAS", "TAG", "MORPH", "SENT", "WORD"] - if itn == 0: - print("\t".join(header)) - tpl = "\t".join(( - "{:d}", - 
"{dep_loss:.1f}", - "{morph_loss:.1f}", - "{las:.1f}", - "{uas:.1f}", - "{tags:.1f}", - "{morph:.1f}", - "{sents:.1f}", - "{words:.1f}", - )) - print(tpl.format(itn, **fields)) - - -# def get_sent_conllu(sent, sent_id): -# lines = ["# sent_id = {sent_id}".format(sent_id=sent_id)] - - -def get_token_conllu(token, i): - if token._.begins_fused: - n = 1 - while token.nbor(n)._.inside_fused: - n += 1 - id_ = "%d-%d" % (i, i + n) - lines = [id_, token.text, "_", "_", "_", "_", "_", "_", "_", "_"] - else: - lines = [] - if token.head.i == token.i: - head = 0 - else: - head = i + (token.head.i - token.i) + 1 - features = list(token.morph) - feat_str = [] - replacements = {"one": "1", "two": "2", "three": "3"} - for feat in features: - if not feat.startswith("begin") and not feat.startswith("end"): - key, value = feat.split("_", 1) - value = replacements.get(value, value) - feat_str.append("%s=%s" % (key, value.title())) - if not feat_str: - feat_str = "_" - else: - feat_str = "|".join(feat_str) - fields = [str(i+1), token.text, token.lemma_, token.pos_, token.tag_, feat_str, - str(head), token.dep_.lower(), "_", "_"] - lines.append("\t".join(fields)) - return "\n".join(lines) - - - -################## -# Initialization # -################## - - -def load_nlp(corpus, config, vectors=None): - lang = corpus.split("_")[0] - nlp = spacy.blank(lang) - if config.vectors: - if not vectors: - raise ValueError( - "config asks for vectors, but no vectors " - "directory set on command line (use -v)" - ) - if (Path(vectors) / corpus).exists(): - nlp.vocab.from_disk(Path(vectors) / corpus / "vocab") - nlp.meta["treebank"] = corpus - return nlp - - -def initialize_pipeline(nlp, docs, golds, config, device): - nlp.add_pipe(nlp.create_pipe("tagger", config={"set_morphology": False})) - nlp.add_pipe(nlp.create_pipe("morphologizer")) - nlp.add_pipe(nlp.create_pipe("parser")) - if config.multitask_tag: - nlp.parser.add_multitask_objective("tag") - if config.multitask_sent: - nlp.parser.add_multitask_objective("sent_start") - for gold in golds: - for tag in gold.tags: - if tag is not None: - nlp.tagger.add_label(tag) - if torch is not None and device != -1: - torch.set_default_tensor_type("torch.cuda.FloatTensor") - optimizer = nlp.begin_training( - lambda: golds_to_gold_tuples(docs, golds), - device=device, - subword_features=config.subword_features, - conv_depth=config.conv_depth, - bilstm_depth=config.bilstm_depth, - ) - if config.pretrained_tok2vec: - _load_pretrained_tok2vec(nlp, config.pretrained_tok2vec) - return optimizer - - -def _load_pretrained_tok2vec(nlp, loc): - """Load pretrained weights for the 'token-to-vector' part of the component - models, which is typically a CNN. See 'spacy pretrain'. Experimental. 
- """ - with Path(loc).open("rb", encoding="utf8") as file_: - weights_data = file_.read() - loaded = [] - for name, component in nlp.pipeline: - if hasattr(component, "model") and hasattr(component.model, "tok2vec"): - component.tok2vec.from_bytes(weights_data) - loaded.append(name) - return loaded - - -######################## -# Command line helpers # -######################## - - -class Config(object): - def __init__( - self, - vectors=None, - max_doc_length=10, - multitask_tag=False, - multitask_sent=False, - multitask_dep=False, - multitask_vectors=None, - bilstm_depth=0, - nr_epoch=30, - min_batch_size=100, - max_batch_size=1000, - batch_by_words=True, - dropout=0.2, - conv_depth=4, - subword_features=True, - vectors_dir=None, - pretrained_tok2vec=None, - ): - if vectors_dir is not None: - if vectors is None: - vectors = True - if multitask_vectors is None: - multitask_vectors = True - for key, value in locals().items(): - setattr(self, key, value) - - @classmethod - def load(cls, loc, vectors_dir=None): - with Path(loc).open("r", encoding="utf8") as file_: - cfg = json.load(file_) - if vectors_dir is not None: - cfg["vectors_dir"] = vectors_dir - return cls(**cfg) - - -class Dataset(object): - def __init__(self, path, section): - self.path = path - self.section = section - self.conllu = None - self.text = None - for file_path in self.path.iterdir(): - name = file_path.parts[-1] - if section in name and name.endswith("conllu"): - self.conllu = file_path - elif section in name and name.endswith("txt"): - self.text = file_path - if self.conllu is None: - msg = "Could not find .txt file in {path} for {section}" - raise IOError(msg.format(section=section, path=path)) - if self.text is None: - msg = "Could not find .txt file in {path} for {section}" - self.lang = self.conllu.parts[-1].split("-")[0].split("_")[0] - - -class TreebankPaths(object): - def __init__(self, ud_path, treebank, **cfg): - self.train = Dataset(ud_path / treebank, "train") - self.dev = Dataset(ud_path / treebank, "dev") - self.lang = self.train.lang - - -@plac.annotations( - ud_dir=("Path to Universal Dependencies corpus", "positional", None, Path), - parses_dir=("Directory to write the development parses", "positional", None, Path), - corpus=( - "UD corpus to train and evaluate on, e.g. UD_Spanish-AnCora", - "positional", - None, - str, - ), - config=("Path to json formatted config file", "option", "C", Path), - limit=("Size limit", "option", "n", int), - gpu_device=("Use GPU", "option", "g", int), - use_oracle_segments=("Use oracle segments", "flag", "G", int), - vectors_dir=( - "Path to directory with pretrained vectors, named e.g. 
en/", - "option", - "v", - Path, - ), -) -def main( - ud_dir, - parses_dir, - corpus, - config=None, - limit=0, - gpu_device=-1, - vectors_dir=None, - use_oracle_segments=False, -): - Token.set_extension("get_conllu_lines", method=get_token_conllu) - Token.set_extension("begins_fused", default=False) - Token.set_extension("inside_fused", default=False) - - spacy.util.fix_random_seed() - lang.zh.Chinese.Defaults.use_jieba = False - lang.ja.Japanese.Defaults.use_janome = False - - if config is not None: - config = Config.load(config, vectors_dir=vectors_dir) - else: - config = Config(vectors_dir=vectors_dir) - paths = TreebankPaths(ud_dir, corpus) - if not (parses_dir / corpus).exists(): - (parses_dir / corpus).mkdir() - print("Train and evaluate", corpus, "using lang", paths.lang) - nlp = load_nlp(paths.lang, config, vectors=vectors_dir) - - docs, golds = read_data( - nlp, - paths.train.conllu.open(encoding="utf8"), - paths.train.text.open(encoding="utf8"), - max_doc_length=config.max_doc_length, - limit=limit, - ) - - optimizer = initialize_pipeline(nlp, docs, golds, config, gpu_device) - - batch_sizes = compounding(config.min_batch_size, config.max_batch_size, 1.001) - beam_prob = compounding(0.2, 0.8, 1.001) - for i in range(config.nr_epoch): - docs, golds = read_data( - nlp, - paths.train.conllu.open(encoding="utf8"), - paths.train.text.open(encoding="utf8"), - max_doc_length=config.max_doc_length, - limit=limit, - oracle_segments=use_oracle_segments, - raw_text=not use_oracle_segments, - ) - Xs = list(zip(docs, golds)) - random.shuffle(Xs) - if config.batch_by_words: - batches = minibatch_by_words(Xs, size=batch_sizes) - else: - batches = minibatch(Xs, size=batch_sizes) - losses = {} - n_train_words = sum(len(doc) for doc in docs) - with tqdm.tqdm(total=n_train_words, leave=False) as pbar: - for batch in batches: - batch_docs, batch_gold = zip(*batch) - pbar.update(sum(len(doc) for doc in batch_docs)) - nlp.parser.cfg["beam_update_prob"] = next(beam_prob) - nlp.update( - batch_docs, - batch_gold, - sgd=optimizer, - drop=config.dropout, - losses=losses, - ) - - out_path = parses_dir / corpus / "epoch-{i}.conllu".format(i=i) - with nlp.use_params(optimizer.averages): - if use_oracle_segments: - parsed_docs, scores = evaluate(nlp, paths.dev.conllu, - paths.dev.conllu, out_path) - else: - parsed_docs, scores = evaluate(nlp, paths.dev.text, - paths.dev.conllu, out_path) - print_progress(i, losses, scores) - - -def _render_parses(i, to_render): - to_render[0].user_data["title"] = "Batch %d" % i - with Path("/tmp/parses.html").open("w", encoding="utf8") as file_: - html = displacy.render(to_render[:5], style="dep", page=True) - file_.write(html) - - -if __name__ == "__main__": - plac.call(main) diff --git a/examples/README.md b/examples/README.md deleted file mode 100644 index 869077531..000000000 --- a/examples/README.md +++ /dev/null @@ -1,19 +0,0 @@ - - -# spaCy examples - -The examples are Python scripts with well-behaved command line interfaces. For -more detailed usage guides, see the [documentation](https://spacy.io/usage/). - -To see the available arguments, you can use the `--help` or `-h` flag: - -```bash -$ python examples/training/train_ner.py --help -``` - -While we try to keep the examples up to date, they are not currently exercised -by the test suite, as some of them require significant data downloads or take -time to train. If you find that an example is no longer running, -[please tell us](https://github.com/explosion/spaCy/issues)! 
We know there's -nothing worse than trying to figure out what you're doing wrong, and it turns -out your code was never the problem. diff --git a/examples/deep_learning_keras.py b/examples/deep_learning_keras.py deleted file mode 100644 index 049cc0be4..000000000 --- a/examples/deep_learning_keras.py +++ /dev/null @@ -1,267 +0,0 @@ -""" -This example shows how to use an LSTM sentiment classification model trained -using Keras in spaCy. spaCy splits the document into sentences, and each -sentence is classified using the LSTM. The scores for the sentences are then -aggregated to give the document score. This kind of hierarchical model is quite -difficult in "pure" Keras or Tensorflow, but it's very effective. The Keras -example on this dataset performs quite poorly, because it cuts off the documents -so that they're a fixed size. This hurts review accuracy a lot, because people -often summarise their rating in the final sentence - -Prerequisites: -spacy download en_vectors_web_lg -pip install keras==2.0.9 - -Compatible with: spaCy v2.0.0+ -""" - -import plac -import random -import pathlib -import cytoolz -import numpy -from keras.models import Sequential, model_from_json -from keras.layers import LSTM, Dense, Embedding, Bidirectional -from keras.layers import TimeDistributed -from keras.optimizers import Adam -import thinc.extra.datasets -from spacy.compat import pickle -import spacy - - -class SentimentAnalyser(object): - @classmethod - def load(cls, path, nlp, max_length=100): - with (path / "config.json").open() as file_: - model = model_from_json(file_.read()) - with (path / "model").open("rb") as file_: - lstm_weights = pickle.load(file_) - embeddings = get_embeddings(nlp.vocab) - model.set_weights([embeddings] + lstm_weights) - return cls(model, max_length=max_length) - - def __init__(self, model, max_length=100): - self._model = model - self.max_length = max_length - - def __call__(self, doc): - X = get_features([doc], self.max_length) - y = self._model.predict(X) - self.set_sentiment(doc, y) - - def pipe(self, docs, batch_size=1000): - for minibatch in cytoolz.partition_all(batch_size, docs): - minibatch = list(minibatch) - sentences = [] - for doc in minibatch: - sentences.extend(doc.sents) - Xs = get_features(sentences, self.max_length) - ys = self._model.predict(Xs) - for sent, label in zip(sentences, ys): - sent.doc.sentiment += label - 0.5 - for doc in minibatch: - yield doc - - def set_sentiment(self, doc, y): - doc.sentiment = float(y[0]) - # Sentiment has a native slot for a single float. 
- # For arbitrary data storage, there's: - # doc.user_data['my_data'] = y - - -def get_labelled_sentences(docs, doc_labels): - labels = [] - sentences = [] - for doc, y in zip(docs, doc_labels): - for sent in doc.sents: - sentences.append(sent) - labels.append(y) - return sentences, numpy.asarray(labels, dtype="int32") - - -def get_features(docs, max_length): - docs = list(docs) - Xs = numpy.zeros((len(docs), max_length), dtype="int32") - for i, doc in enumerate(docs): - j = 0 - for token in doc: - vector_id = token.vocab.vectors.find(key=token.orth) - if vector_id >= 0: - Xs[i, j] = vector_id - else: - Xs[i, j] = 0 - j += 1 - if j >= max_length: - break - return Xs - - -def train( - train_texts, - train_labels, - dev_texts, - dev_labels, - lstm_shape, - lstm_settings, - lstm_optimizer, - batch_size=100, - nb_epoch=5, - by_sentence=True, -): - - print("Loading spaCy") - nlp = spacy.load("en_vectors_web_lg") - nlp.add_pipe(nlp.create_pipe("sentencizer")) - embeddings = get_embeddings(nlp.vocab) - model = compile_lstm(embeddings, lstm_shape, lstm_settings) - - print("Parsing texts...") - train_docs = list(nlp.pipe(train_texts)) - dev_docs = list(nlp.pipe(dev_texts)) - if by_sentence: - train_docs, train_labels = get_labelled_sentences(train_docs, train_labels) - dev_docs, dev_labels = get_labelled_sentences(dev_docs, dev_labels) - - train_X = get_features(train_docs, lstm_shape["max_length"]) - dev_X = get_features(dev_docs, lstm_shape["max_length"]) - model.fit( - train_X, - train_labels, - validation_data=(dev_X, dev_labels), - epochs=nb_epoch, - batch_size=batch_size, - ) - return model - - -def compile_lstm(embeddings, shape, settings): - model = Sequential() - model.add( - Embedding( - embeddings.shape[0], - embeddings.shape[1], - input_length=shape["max_length"], - trainable=False, - weights=[embeddings], - mask_zero=True, - ) - ) - model.add(TimeDistributed(Dense(shape["nr_hidden"], use_bias=False))) - model.add( - Bidirectional( - LSTM( - shape["nr_hidden"], - recurrent_dropout=settings["dropout"], - dropout=settings["dropout"], - ) - ) - ) - model.add(Dense(shape["nr_class"], activation="sigmoid")) - model.compile( - optimizer=Adam(lr=settings["lr"]), - loss="binary_crossentropy", - metrics=["accuracy"], - ) - return model - - -def get_embeddings(vocab): - return vocab.vectors.data - - -def evaluate(model_dir, texts, labels, max_length=100): - nlp = spacy.load("en_vectors_web_lg") - nlp.add_pipe(nlp.create_pipe("sentencizer")) - nlp.add_pipe(SentimentAnalyser.load(model_dir, nlp, max_length=max_length)) - - correct = 0 - i = 0 - for doc in nlp.pipe(texts, batch_size=1000): - correct += bool(doc.sentiment >= 0.5) == bool(labels[i]) - i += 1 - return float(correct) / i - - -def read_data(data_dir, limit=0): - examples = [] - for subdir, label in (("pos", 1), ("neg", 0)): - for filename in (data_dir / subdir).iterdir(): - with filename.open() as file_: - text = file_.read() - examples.append((text, label)) - random.shuffle(examples) - if limit >= 1: - examples = examples[:limit] - return zip(*examples) # Unzips into two lists - - -@plac.annotations( - train_dir=("Location of training file or directory"), - dev_dir=("Location of development file or directory"), - model_dir=("Location of output model directory",), - is_runtime=("Demonstrate run-time usage", "flag", "r", bool), - nr_hidden=("Number of hidden units", "option", "H", int), - max_length=("Maximum sentence length", "option", "L", int), - dropout=("Dropout", "option", "d", float), - learn_rate=("Learn rate", "option", "e", 
float), - nb_epoch=("Number of training epochs", "option", "i", int), - batch_size=("Size of minibatches for training LSTM", "option", "b", int), - nr_examples=("Limit to N examples", "option", "n", int), -) -def main( - model_dir=None, - train_dir=None, - dev_dir=None, - is_runtime=False, - nr_hidden=64, - max_length=100, # Shape - dropout=0.5, - learn_rate=0.001, # General NN config - nb_epoch=5, - batch_size=256, - nr_examples=-1, -): # Training params - if model_dir is not None: - model_dir = pathlib.Path(model_dir) - if train_dir is None or dev_dir is None: - imdb_data = thinc.extra.datasets.imdb() - if is_runtime: - if dev_dir is None: - dev_texts, dev_labels = zip(*imdb_data[1]) - else: - dev_texts, dev_labels = read_data(dev_dir) - acc = evaluate(model_dir, dev_texts, dev_labels, max_length=max_length) - print(acc) - else: - if train_dir is None: - train_texts, train_labels = zip(*imdb_data[0]) - else: - print("Read data") - train_texts, train_labels = read_data(train_dir, limit=nr_examples) - if dev_dir is None: - dev_texts, dev_labels = zip(*imdb_data[1]) - else: - dev_texts, dev_labels = read_data(dev_dir, limit=nr_examples) - train_labels = numpy.asarray(train_labels, dtype="int32") - dev_labels = numpy.asarray(dev_labels, dtype="int32") - lstm = train( - train_texts, - train_labels, - dev_texts, - dev_labels, - {"nr_hidden": nr_hidden, "max_length": max_length, "nr_class": 1}, - {"dropout": dropout, "lr": learn_rate}, - {}, - nb_epoch=nb_epoch, - batch_size=batch_size, - ) - weights = lstm.get_weights() - if model_dir is not None: - with (model_dir / "model").open("wb") as file_: - pickle.dump(weights[1:], file_) - with (model_dir / "config.json").open("w") as file_: - file_.write(lstm.to_json()) - - -if __name__ == "__main__": - plac.call(main) diff --git a/examples/information_extraction/entity_relations.py b/examples/information_extraction/entity_relations.py deleted file mode 100644 index c40a3c10d..000000000 --- a/examples/information_extraction/entity_relations.py +++ /dev/null @@ -1,82 +0,0 @@ -#!/usr/bin/env python -# coding: utf8 -"""A simple example of extracting relations between phrases and entities using -spaCy's named entity recognizer and the dependency parse. Here, we extract -money and currency values (entities labelled as MONEY) and then check the -dependency tree to find the noun phrase they are referring to – for example: -$9.4 million --> Net income.
- -Compatible with: spaCy v2.0.0+ -Last tested with: v2.2.1 -""" -from __future__ import unicode_literals, print_function - -import plac -import spacy - - -TEXTS = [ - "Net income was $9.4 million compared to the prior year of $2.7 million.", - "Revenue exceeded twelve billion dollars, with a loss of $1b.", -] - - -@plac.annotations( - model=("Model to load (needs parser and NER)", "positional", None, str) -) -def main(model="en_core_web_sm"): - nlp = spacy.load(model) - print("Loaded model '%s'" % model) - print("Processing %d texts" % len(TEXTS)) - - for text in TEXTS: - doc = nlp(text) - relations = extract_currency_relations(doc) - for r1, r2 in relations: - print("{:<10}\t{}\t{}".format(r1.text, r2.ent_type_, r2.text)) - - -def filter_spans(spans): - # Filter a sequence of spans so they don't contain overlaps - # For spaCy 2.1.4+: this function is available as spacy.util.filter_spans() - get_sort_key = lambda span: (span.end - span.start, -span.start) - sorted_spans = sorted(spans, key=get_sort_key, reverse=True) - result = [] - seen_tokens = set() - for span in sorted_spans: - # Check for end - 1 here because boundaries are inclusive - if span.start not in seen_tokens and span.end - 1 not in seen_tokens: - result.append(span) - seen_tokens.update(range(span.start, span.end)) - result = sorted(result, key=lambda span: span.start) - return result - - -def extract_currency_relations(doc): - # Merge entities and noun chunks into one token - spans = list(doc.ents) + list(doc.noun_chunks) - spans = filter_spans(spans) - with doc.retokenize() as retokenizer: - for span in spans: - retokenizer.merge(span) - - relations = [] - for money in filter(lambda w: w.ent_type_ == "MONEY", doc): - if money.dep_ in ("attr", "dobj"): - subject = [w for w in money.head.lefts if w.dep_ == "nsubj"] - if subject: - subject = subject[0] - relations.append((subject, money)) - elif money.dep_ == "pobj" and money.head.dep_ == "prep": - relations.append((money.head.head, money)) - return relations - - -if __name__ == "__main__": - plac.call(main) - - # Expected output: - # Net income MONEY $9.4 million - # the prior year MONEY $2.7 million - # Revenue MONEY twelve billion dollars - # a loss MONEY 1b diff --git a/examples/information_extraction/parse_subtrees.py b/examples/information_extraction/parse_subtrees.py deleted file mode 100644 index 2ca9da1ea..000000000 --- a/examples/information_extraction/parse_subtrees.py +++ /dev/null @@ -1,67 +0,0 @@ -#!/usr/bin/env python -# coding: utf8 -"""This example shows how to navigate the parse tree including subtrees -attached to a word. - -Based on issue #252: -"In the documents and tutorials the main thing I haven't found is -examples on how to break sentences down into small sub thoughts/chunks. The -noun_chunks is handy, but having examples on using the token.head to find small -(near-complete) sentence chunks would be neat. Lets take the example sentence: -"displaCy uses CSS and JavaScript to show you how computers understand language" - -This sentence has two main parts (XCOMP & CCOMP) according to the breakdown: -[displaCy] uses CSS and Javascript [to + show] -show you how computers understand [language] - -I'm assuming that we can use the token.head to build these groups." 
- -Compatible with: spaCy v2.0.0+ -Last tested with: v2.1.0 -""" -from __future__ import unicode_literals, print_function - -import plac -import spacy - - -@plac.annotations(model=("Model to load", "positional", None, str)) -def main(model="en_core_web_sm"): - nlp = spacy.load(model) - print("Loaded model '%s'" % model) - - doc = nlp( - "displaCy uses CSS and JavaScript to show you how computers " - "understand language" - ) - - # The easiest way is to find the head of the subtree you want, and then use - # the `.subtree`, `.children`, `.lefts` and `.rights` iterators. `.subtree` - # is the one that does what you're asking for most directly: - for word in doc: - if word.dep_ in ("xcomp", "ccomp"): - print("".join(w.text_with_ws for w in word.subtree)) - - # It'd probably be better for `word.subtree` to return a `Span` object - # instead of a generator over the tokens. If you want the `Span` you can - # get it via the `.right_edge` and `.left_edge` properties. The `Span` - # object is nice because you can easily get a vector, merge it, etc. - for word in doc: - if word.dep_ in ("xcomp", "ccomp"): - subtree_span = doc[word.left_edge.i : word.right_edge.i + 1] - print(subtree_span.text, "|", subtree_span.root.text) - - # You might also want to select a head, and then select a start and end - # position by walking along its children. You could then take the - # `.left_edge` and `.right_edge` of those tokens, and use it to calculate - # a span. - - -if __name__ == "__main__": - plac.call(main) - - # Expected output: - # to show you how computers understand language - # how computers understand language - # to show you how computers understand language | show - # how computers understand language | understand diff --git a/examples/information_extraction/phrase_matcher.py b/examples/information_extraction/phrase_matcher.py deleted file mode 100644 index f3622bfdd..000000000 --- a/examples/information_extraction/phrase_matcher.py +++ /dev/null @@ -1,112 +0,0 @@ -#!/usr/bin/env python -# coding: utf8 -"""Match a large set of multi-word expressions in O(1) time. - -The idea is to associate each word in the vocabulary with a tag, noting whether -they begin, end, or are inside at least one pattern. An additional tag is used -for single-word patterns. Complete patterns are also stored in a hash set. -When we process a document, we look up the words in the vocabulary, to -associate the words with the tags. We then search for tag-sequences that -correspond to valid candidates. Finally, we look up the candidates in the hash -set. - -For instance, to search for the phrases "Barack Hussein Obama" and "Hilary -Clinton", we would associate "Barack" and "Hilary" with the B tag, Hussein with -the I tag, and Obama and Clinton with the L tag. - -The document "Barack Clinton and Hilary Clinton" would have the tag sequence -[{B}, {L}, {}, {B}, {L}], so we'd get two matches. However, only the second -candidate is in the phrase dictionary, so only one is returned as a match. - -The algorithm is O(n) at run-time for document of length n because we're only -ever matching over the tag patterns. So no matter how many phrases we're -looking for, our pattern set stays very small (exact size depends on the -maximum length we're looking for, as the query language currently has no -quantifiers). 
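The tagging scheme described in that docstring is easiest to see in miniature. The sketch below is not part of the deleted example (which delegates all matching to spaCy's `PhraseMatcher`); it is a minimal pure-Python illustration of the B/I/L tags (plus a U tag for single-word patterns) and the hash-set check, using illustrative helper names `build_tag_index` and `match`:

```python
# Minimal sketch (not from the original example) of the tag-sequence idea:
# words are tagged B (begin), I (inside), L (last) or U (single-word pattern),
# candidate spans are read off the tag sequence, then confirmed against a
# hash set of complete phrases.

def build_tag_index(phrases):
    """Map each word to the set of positional tags it can carry."""
    tags = {}
    complete = set(tuple(p.split()) for p in phrases)
    for phrase in complete:
        if len(phrase) == 1:
            tags.setdefault(phrase[0], set()).add("U")
            continue
        tags.setdefault(phrase[0], set()).add("B")
        for word in phrase[1:-1]:
            tags.setdefault(word, set()).add("I")
        tags.setdefault(phrase[-1], set()).add("L")
    return tags, complete

def match(tokens, tags, complete):
    """Collect spans whose tag sequence is B I* L (or a lone U), then keep
    only those that are actually in the set of complete phrases."""
    matches = []
    for start, word in enumerate(tokens):
        if "U" in tags.get(word, ()) and (word,) in complete:
            matches.append((start, start + 1))
        if "B" not in tags.get(word, ()):
            continue
        for end in range(start + 1, len(tokens)):
            if "L" in tags.get(tokens[end], ()) and tuple(tokens[start:end + 1]) in complete:
                matches.append((start, end + 1))
            if "I" not in tags.get(tokens[end], ()):
                break
    return matches

tags, complete = build_tag_index(["Barack Hussein Obama", "Hilary Clinton"])
tokens = "Barack Clinton and Hilary Clinton".split()
print(match(tokens, tags, complete))  # [(3, 5)] -> only "Hilary Clinton" matches
```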
- -The example expects a .bz2 file from the Reddit corpus, and a patterns file, -formatted in jsonl as a sequence of entries like this: - -{"text":"Anchorage"} -{"text":"Angola"} -{"text":"Ann Arbor"} -{"text":"Annapolis"} -{"text":"Appalachia"} -{"text":"Argentina"} - -Reddit comments corpus: -* https://files.pushshift.io/reddit/ -* https://archive.org/details/2015_reddit_comments_corpus - -Compatible with: spaCy v2.0.0+ -""" -from __future__ import print_function, unicode_literals, division - -from bz2 import BZ2File -import time -import plac -import json - -from spacy.matcher import PhraseMatcher -import spacy - - -@plac.annotations( - patterns_loc=("Path to gazetteer", "positional", None, str), - text_loc=("Path to Reddit corpus file", "positional", None, str), - n=("Number of texts to read", "option", "n", int), - lang=("Language class to initialise", "option", "l", str), -) -def main(patterns_loc, text_loc, n=10000, lang="en"): - nlp = spacy.blank(lang) - nlp.vocab.lex_attr_getters = {} - phrases = read_gazetteer(nlp.tokenizer, patterns_loc) - count = 0 - t1 = time.time() - for ent_id, text in get_matches(nlp.tokenizer, phrases, read_text(text_loc, n=n)): - count += 1 - t2 = time.time() - print("%d docs in %.3f s. %d matches" % (n, (t2 - t1), count)) - - -def read_gazetteer(tokenizer, loc, n=-1): - for i, line in enumerate(open(loc)): - data = json.loads(line.strip()) - phrase = tokenizer(data["text"]) - for w in phrase: - _ = tokenizer.vocab[w.text] - if len(phrase) >= 2: - yield phrase - - -def read_text(bz2_loc, n=10000): - with BZ2File(bz2_loc) as file_: - for i, line in enumerate(file_): - data = json.loads(line) - yield data["body"] - if i >= n: - break - - -def get_matches(tokenizer, phrases, texts): - matcher = PhraseMatcher(tokenizer.vocab) - matcher.add("Phrase", None, *phrases) - for text in texts: - doc = tokenizer(text) - for w in doc: - _ = doc.vocab[w.text] - matches = matcher(doc) - for ent_id, start, end in matches: - yield (ent_id, doc[start:end].text) - - -if __name__ == "__main__": - if False: - import cProfile - import pstats - - cProfile.runctx("plac.call(main)", globals(), locals(), "Profile.prof") - s = pstats.Stats("Profile.prof") - s.strip_dirs().sort_stats("time").print_stats() - else: - plac.call(main) diff --git a/examples/keras_parikh_entailment/README.md b/examples/keras_parikh_entailment/README.md deleted file mode 100644 index 86ba50d9b..000000000 --- a/examples/keras_parikh_entailment/README.md +++ /dev/null @@ -1,114 +0,0 @@ - - -# A decomposable attention model for Natural Language Inference -**by Matthew Honnibal, [@honnibal](https://github.com/honnibal)** -**Updated for spaCy 2.0+ and Keras 2.2.2+ by John Stewart, [@free-variation](https://github.com/free-variation)** - -This directory contains an implementation of the entailment prediction model described -by [Parikh et al. (2016)](https://arxiv.org/pdf/1606.01933.pdf). The model is notable -for its competitive performance with very few parameters. - -The model is implemented using [Keras](https://keras.io/) and [spaCy](https://spacy.io). -Keras is used to build and train the network. spaCy is used to load -the [GloVe](http://nlp.stanford.edu/projects/glove/) vectors, perform the -feature extraction, and help you apply the model at run-time. 
The following -demo code shows how the entailment model can be used at runtime, once the -hook is installed to customise the `.similarity()` method of spaCy's `Doc` -and `Span` objects: - -```python -def demo(shape): - nlp = spacy.load('en_vectors_web_lg') - nlp.add_pipe(KerasSimilarityShim.load(nlp.path / 'similarity', nlp, shape[0])) - - doc1 = nlp(u'The king of France is bald.') - doc2 = nlp(u'France has no king.') - - print("Sentence 1:", doc1) - print("Sentence 2:", doc2) - - entailment_type, confidence = doc1.similarity(doc2) - print("Entailment type:", entailment_type, "(Confidence:", confidence, ")") -``` - -Which gives the output `Entailment type: contradiction (Confidence: 0.60604566)`, showing that -the system has definite opinions about Betrand Russell's [famous conundrum](https://users.drew.edu/jlenz/br-on-denoting.html)! - -I'm working on a blog post to explain Parikh et al.'s model in more detail. -A [notebook](https://github.com/free-variation/spaCy/blob/master/examples/notebooks/Decompositional%20Attention.ipynb) is available that briefly explains this implementation. -I think it is a very interesting example of the attention mechanism, which -I didn't understand very well before working through this paper. There are -lots of ways to extend the model. - -## What's where - -| File | Description | -| --- | --- | -| `__main__.py` | The script that will be executed. Defines the CLI, the data reading, etc — all the boring stuff. | -| `spacy_hook.py` | Provides a class `KerasSimilarityShim` that lets you use an arbitrary function to customize spaCy's `doc.similarity()` method. Instead of the default average-of-vectors algorithm, when you call `doc1.similarity(doc2)`, you'll get the result of `your_model(doc1, doc2)`. | -| `keras_decomposable_attention.py` | Defines the neural network model. | - -## Setting up - -First, install [Keras](https://keras.io/), [spaCy](https://spacy.io) and the spaCy -English models (about 1GB of data): - -```bash -pip install keras -pip install spacy -python -m spacy download en_vectors_web_lg -``` - -You'll also want to get Keras working on your GPU, and you will need a backend, such as TensorFlow or Theano. -This will depend on your set up, so you're mostly on your own for this step. If you're using AWS, try the -[NVidia AMI](https://aws.amazon.com/marketplace/pp/B00FYCDDTE). It made things pretty easy. - -Once you've installed the dependencies, you can run a small preliminary test of -the Keras model: - -```bash -py.test keras_parikh_entailment/keras_decomposable_attention.py -``` - -This compiles the model and fits it with some dummy data. You should see that -both tests passed. - -Finally, download the [Stanford Natural Language Inference corpus](http://nlp.stanford.edu/projects/snli/). - -## Running the example - -You can run the `keras_parikh_entailment/` directory as a script, which executes the file -[`keras_parikh_entailment/__main__.py`](__main__.py). If you run the script without arguments -the usage is shown. Running it with `-h` explains the command line arguments. - -The first thing you'll want to do is train the model: - -```bash -python keras_parikh_entailment/ train -t -s -``` - -Training takes about 300 epochs for full accuracy, and I haven't rerun the full -experiment since refactoring things to publish this example — please let me -know if I've broken something. You should get to at least 85% on the development data even after 10-15 epochs. - -The other two modes demonstrate run-time usage. 
I never like relying on the accuracy printed -by `.fit()` methods. I never really feel confident until I've run a new process that loads -the model and starts making predictions, without access to the gold labels. I've therefore -included an `evaluate` mode. - -```bash -python keras_parikh_entailment/ evaluate -s -``` - -Finally, there's also a little demo, which mostly exists to show -you how run-time usage will eventually look. - -```bash -python keras_parikh_entailment/ demo -``` - -## Getting updates - -We should have the blog post explaining the model ready before the end of the week. To get -notified when it's published, you can either follow me on [Twitter](https://twitter.com/honnibal) -or subscribe to our [mailing list](http://eepurl.com/ckUpQ5). diff --git a/examples/keras_parikh_entailment/__main__.py b/examples/keras_parikh_entailment/__main__.py deleted file mode 100644 index ad398dae3..000000000 --- a/examples/keras_parikh_entailment/__main__.py +++ /dev/null @@ -1,207 +0,0 @@ -import numpy as np -import json -from keras.utils import to_categorical -import plac -import sys - -from keras_decomposable_attention import build_model -from spacy_hook import get_embeddings, KerasSimilarityShim - -try: - import cPickle as pickle -except ImportError: - import pickle - -import spacy - -# workaround for keras/tensorflow bug -# see https://github.com/tensorflow/tensorflow/issues/3388 -import os -import importlib -from keras import backend as K - - -def set_keras_backend(backend): - if K.backend() != backend: - os.environ["KERAS_BACKEND"] = backend - importlib.reload(K) - assert K.backend() == backend - if backend == "tensorflow": - K.get_session().close() - cfg = K.tf.ConfigProto() - cfg.gpu_options.allow_growth = True - K.set_session(K.tf.Session(config=cfg)) - K.clear_session() - - -set_keras_backend("tensorflow") - - -def train(train_loc, dev_loc, shape, settings): - train_texts1, train_texts2, train_labels = read_snli(train_loc) - dev_texts1, dev_texts2, dev_labels = read_snli(dev_loc) - - print("Loading spaCy") - nlp = spacy.load("en_vectors_web_lg") - assert nlp.path is not None - print("Processing texts...") - train_X = create_dataset(nlp, train_texts1, train_texts2, 100, shape[0]) - dev_X = create_dataset(nlp, dev_texts1, dev_texts2, 100, shape[0]) - - print("Compiling network") - model = build_model(get_embeddings(nlp.vocab), shape, settings) - - print(settings) - model.fit( - train_X, - train_labels, - validation_data=(dev_X, dev_labels), - epochs=settings["nr_epoch"], - batch_size=settings["batch_size"], - ) - if not (nlp.path / "similarity").exists(): - (nlp.path / "similarity").mkdir() - print("Saving to", nlp.path / "similarity") - weights = model.get_weights() - # remove the embedding matrix. We can reconstruct it. 
- del weights[1] - with (nlp.path / "similarity" / "model").open("wb") as file_: - pickle.dump(weights, file_) - with (nlp.path / "similarity" / "config.json").open("w") as file_: - file_.write(model.to_json()) - - -def evaluate(dev_loc, shape): - dev_texts1, dev_texts2, dev_labels = read_snli(dev_loc) - nlp = spacy.load("en_vectors_web_lg") - nlp.add_pipe(KerasSimilarityShim.load(nlp.path / "similarity", nlp, shape[0])) - total = 0.0 - correct = 0.0 - for text1, text2, label in zip(dev_texts1, dev_texts2, dev_labels): - doc1 = nlp(text1) - doc2 = nlp(text2) - sim, _ = doc1.similarity(doc2) - if sim == KerasSimilarityShim.entailment_types[label.argmax()]: - correct += 1 - total += 1 - return correct, total - - -def demo(shape): - nlp = spacy.load("en_vectors_web_lg") - nlp.add_pipe(KerasSimilarityShim.load(nlp.path / "similarity", nlp, shape[0])) - - doc1 = nlp("The king of France is bald.") - doc2 = nlp("France has no king.") - - print("Sentence 1:", doc1) - print("Sentence 2:", doc2) - - entailment_type, confidence = doc1.similarity(doc2) - print("Entailment type:", entailment_type, "(Confidence:", confidence, ")") - - -LABELS = {"entailment": 0, "contradiction": 1, "neutral": 2} - - -def read_snli(path): - texts1 = [] - texts2 = [] - labels = [] - with open(path, "r") as file_: - for line in file_: - eg = json.loads(line) - label = eg["gold_label"] - if label == "-": # per Parikh, ignore - SNLI entries - continue - texts1.append(eg["sentence1"]) - texts2.append(eg["sentence2"]) - labels.append(LABELS[label]) - return texts1, texts2, to_categorical(np.asarray(labels, dtype="int32")) - - -def create_dataset(nlp, texts, hypotheses, num_unk, max_length): - sents = texts + hypotheses - sents_as_ids = [] - for sent in sents: - doc = nlp(sent) - word_ids = [] - for i, token in enumerate(doc): - # skip odd spaces from tokenizer - if token.has_vector and token.vector_norm == 0: - continue - - if i > max_length: - break - - if token.has_vector: - word_ids.append(token.rank + num_unk + 1) - else: - # if we don't have a vector, pick an OOV entry - word_ids.append(token.rank % num_unk + 1) - - # there must be a simpler way of generating padded arrays from lists... 
- word_id_vec = np.zeros((max_length), dtype="int") - clipped_len = min(max_length, len(word_ids)) - word_id_vec[:clipped_len] = word_ids[:clipped_len] - sents_as_ids.append(word_id_vec) - - return [np.array(sents_as_ids[: len(texts)]), np.array(sents_as_ids[len(texts) :])] - - -@plac.annotations( - mode=("Mode to execute", "positional", None, str, ["train", "evaluate", "demo"]), - train_loc=("Path to training data", "option", "t", str), - dev_loc=("Path to development or test data", "option", "s", str), - max_length=("Length to truncate sentences", "option", "L", int), - nr_hidden=("Number of hidden units", "option", "H", int), - dropout=("Dropout level", "option", "d", float), - learn_rate=("Learning rate", "option", "r", float), - batch_size=("Batch size for neural network training", "option", "b", int), - nr_epoch=("Number of training epochs", "option", "e", int), - entail_dir=( - "Direction of entailment", - "option", - "D", - str, - ["both", "left", "right"], - ), -) -def main( - mode, - train_loc, - dev_loc, - max_length=50, - nr_hidden=200, - dropout=0.2, - learn_rate=0.001, - batch_size=1024, - nr_epoch=10, - entail_dir="both", -): - shape = (max_length, nr_hidden, 3) - settings = { - "lr": learn_rate, - "dropout": dropout, - "batch_size": batch_size, - "nr_epoch": nr_epoch, - "entail_dir": entail_dir, - } - - if mode == "train": - if train_loc == None or dev_loc == None: - print("Train mode requires paths to training and development data sets.") - sys.exit(1) - train(train_loc, dev_loc, shape, settings) - elif mode == "evaluate": - if dev_loc == None: - print("Evaluate mode requires paths to test data set.") - sys.exit(1) - correct, total = evaluate(dev_loc, shape) - print(correct, "/", total, correct / total) - else: - demo(shape) - - -if __name__ == "__main__": - plac.call(main) diff --git a/examples/keras_parikh_entailment/keras_decomposable_attention.py b/examples/keras_parikh_entailment/keras_decomposable_attention.py deleted file mode 100644 index 2e17a11ee..000000000 --- a/examples/keras_parikh_entailment/keras_decomposable_attention.py +++ /dev/null @@ -1,152 +0,0 @@ -# Semantic entailment/similarity with decomposable attention (using spaCy and Keras) -# Practical state-of-the-art textual entailment with spaCy and Keras - -import numpy as np -from keras import layers, Model, models, optimizers -from keras import backend as K - - -def build_model(vectors, shape, settings): - max_length, nr_hidden, nr_class = shape - - input1 = layers.Input(shape=(max_length,), dtype="int32", name="words1") - input2 = layers.Input(shape=(max_length,), dtype="int32", name="words2") - - # embeddings (projected) - embed = create_embedding(vectors, max_length, nr_hidden) - - a = embed(input1) - b = embed(input2) - - # step 1: attend - F = create_feedforward(nr_hidden) - att_weights = layers.dot([F(a), F(b)], axes=-1) - - G = create_feedforward(nr_hidden) - - if settings["entail_dir"] == "both": - norm_weights_a = layers.Lambda(normalizer(1))(att_weights) - norm_weights_b = layers.Lambda(normalizer(2))(att_weights) - alpha = layers.dot([norm_weights_a, a], axes=1) - beta = layers.dot([norm_weights_b, b], axes=1) - - # step 2: compare - comp1 = layers.concatenate([a, beta]) - comp2 = layers.concatenate([b, alpha]) - v1 = layers.TimeDistributed(G)(comp1) - v2 = layers.TimeDistributed(G)(comp2) - - # step 3: aggregate - v1_sum = layers.Lambda(sum_word)(v1) - v2_sum = layers.Lambda(sum_word)(v2) - concat = layers.concatenate([v1_sum, v2_sum]) - - elif settings["entail_dir"] == "left": - 
norm_weights_a = layers.Lambda(normalizer(1))(att_weights) - alpha = layers.dot([norm_weights_a, a], axes=1) - comp2 = layers.concatenate([b, alpha]) - v2 = layers.TimeDistributed(G)(comp2) - v2_sum = layers.Lambda(sum_word)(v2) - concat = v2_sum - - else: - norm_weights_b = layers.Lambda(normalizer(2))(att_weights) - beta = layers.dot([norm_weights_b, b], axes=1) - comp1 = layers.concatenate([a, beta]) - v1 = layers.TimeDistributed(G)(comp1) - v1_sum = layers.Lambda(sum_word)(v1) - concat = v1_sum - - H = create_feedforward(nr_hidden) - out = H(concat) - out = layers.Dense(nr_class, activation="softmax")(out) - - model = Model([input1, input2], out) - - model.compile( - optimizer=optimizers.Adam(lr=settings["lr"]), - loss="categorical_crossentropy", - metrics=["accuracy"], - ) - - return model - - -def create_embedding(vectors, max_length, projected_dim): - return models.Sequential( - [ - layers.Embedding( - vectors.shape[0], - vectors.shape[1], - input_length=max_length, - weights=[vectors], - trainable=False, - ), - layers.TimeDistributed( - layers.Dense(projected_dim, activation=None, use_bias=False) - ), - ] - ) - - -def create_feedforward(num_units=200, activation="relu", dropout_rate=0.2): - return models.Sequential( - [ - layers.Dense(num_units, activation=activation), - layers.Dropout(dropout_rate), - layers.Dense(num_units, activation=activation), - layers.Dropout(dropout_rate), - ] - ) - - -def normalizer(axis): - def _normalize(att_weights): - exp_weights = K.exp(att_weights) - sum_weights = K.sum(exp_weights, axis=axis, keepdims=True) - return exp_weights / sum_weights - - return _normalize - - -def sum_word(x): - return K.sum(x, axis=1) - - -def test_build_model(): - vectors = np.ndarray((100, 8), dtype="float32") - shape = (10, 16, 3) - settings = {"lr": 0.001, "dropout": 0.2, "gru_encode": True, "entail_dir": "both"} - model = build_model(vectors, shape, settings) - - -def test_fit_model(): - def _generate_X(nr_example, length, nr_vector): - X1 = np.ndarray((nr_example, length), dtype="int32") - X1 *= X1 < nr_vector - X1 *= 0 <= X1 - X2 = np.ndarray((nr_example, length), dtype="int32") - X2 *= X2 < nr_vector - X2 *= 0 <= X2 - return [X1, X2] - - def _generate_Y(nr_example, nr_class): - ys = np.zeros((nr_example, nr_class), dtype="int32") - for i in range(nr_example): - ys[i, i % nr_class] = 1 - return ys - - vectors = np.ndarray((100, 8), dtype="float32") - shape = (10, 16, 3) - settings = {"lr": 0.001, "dropout": 0.2, "gru_encode": True, "entail_dir": "both"} - model = build_model(vectors, shape, settings) - - train_X = _generate_X(20, shape[0], vectors.shape[0]) - train_Y = _generate_Y(20, shape[2]) - dev_X = _generate_X(15, shape[0], vectors.shape[0]) - dev_Y = _generate_Y(15, shape[2]) - - model.fit(train_X, train_Y, validation_data=(dev_X, dev_Y), epochs=5, batch_size=4) - - -__all__ = [build_model] diff --git a/examples/keras_parikh_entailment/spacy_hook.py b/examples/keras_parikh_entailment/spacy_hook.py deleted file mode 100644 index 307669a70..000000000 --- a/examples/keras_parikh_entailment/spacy_hook.py +++ /dev/null @@ -1,77 +0,0 @@ -import numpy as np -from keras.models import model_from_json - -try: - import cPickle as pickle -except ImportError: - import pickle - - -class KerasSimilarityShim(object): - entailment_types = ["entailment", "contradiction", "neutral"] - - @classmethod - def load(cls, path, nlp, max_length=100, get_features=None): - - if get_features is None: - get_features = get_word_ids - - with (path / "config.json").open() as file_: - model = 
model_from_json(file_.read()) - with (path / "model").open("rb") as file_: - weights = pickle.load(file_) - - embeddings = get_embeddings(nlp.vocab) - weights.insert(1, embeddings) - model.set_weights(weights) - - return cls(model, get_features=get_features, max_length=max_length) - - def __init__(self, model, get_features=None, max_length=100): - self.model = model - self.get_features = get_features - self.max_length = max_length - - def __call__(self, doc): - doc.user_hooks["similarity"] = self.predict - doc.user_span_hooks["similarity"] = self.predict - - return doc - - def predict(self, doc1, doc2): - x1 = self.get_features([doc1], max_length=self.max_length) - x2 = self.get_features([doc2], max_length=self.max_length) - scores = self.model.predict([x1, x2]) - - return self.entailment_types[scores.argmax()], scores.max() - - -def get_embeddings(vocab, nr_unk=100): - # the extra +1 is for a zero vector representing sentence-final padding - num_vectors = max(lex.rank for lex in vocab) + 2 - - # create random vectors for OOV tokens - oov = np.random.normal(size=(nr_unk, vocab.vectors_length)) - oov = oov / oov.sum(axis=1, keepdims=True) - - vectors = np.zeros((num_vectors + nr_unk, vocab.vectors_length), dtype="float32") - vectors[1 : (nr_unk + 1),] = oov - for lex in vocab: - if lex.has_vector and lex.vector_norm > 0: - vectors[nr_unk + lex.rank + 1] = lex.vector / lex.vector_norm - - return vectors - - -def get_word_ids(docs, max_length=100, nr_unk=100): - Xs = np.zeros((len(docs), max_length), dtype="int32") - - for i, doc in enumerate(docs): - for j, token in enumerate(doc): - if j == max_length: - break - if token.has_vector: - Xs[i, j] = token.rank + nr_unk + 1 - else: - Xs[i, j] = token.rank % nr_unk + 1 - return Xs diff --git a/examples/load_from_docbin.py b/examples/load_from_docbin.py deleted file mode 100644 index f26e7fc49..000000000 --- a/examples/load_from_docbin.py +++ /dev/null @@ -1,45 +0,0 @@ -# coding: utf-8 -""" -Example of loading previously parsed text using spaCy's DocBin class. The example -performs an entity count to show that the annotations are available. 
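For context, a `.spacy` DocBin file like the one this script loads can be produced with just a few lines. The snippet below is not part of the deleted example; it is a minimal sketch with a placeholder output path and placeholder texts:

```python
# Minimal sketch (not from the deleted example) of producing a DocBin file
# that load_from_docbin.py could read. The path and texts are placeholders.
import spacy
from spacy.tokens import DocBin

nlp = spacy.load("en_core_web_lg")
texts = ["Net income was $9.4 million.", "Revenue exceeded twelve billion dollars."]

doc_bin = DocBin()  # the default attribute set keeps the entity annotations
for doc in nlp.pipe(texts):
    doc_bin.add(doc)

with open("example_parses.spacy", "wb") as file_:
    file_.write(doc_bin.to_bytes())  # read back with DocBin().from_bytes(...)
```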
-For more details, see https://spacy.io/usage/saving-loading#docs -Installation: -python -m spacy download en_core_web_lg -Usage: -python examples/load_from_docbin.py en_core_web_lg RC_2015-03-9.spacy -""" -from __future__ import unicode_literals - -import spacy -from spacy.tokens import DocBin -from timeit import default_timer as timer -from collections import Counter - -EXAMPLE_PARSES_PATH = "RC_2015-03-9.spacy" - - -def main(model="en_core_web_lg", docbin_path=EXAMPLE_PARSES_PATH): - nlp = spacy.load(model) - print("Reading data from {}".format(docbin_path)) - with open(docbin_path, "rb") as file_: - bytes_data = file_.read() - nr_word = 0 - start_time = timer() - entities = Counter() - docbin = DocBin().from_bytes(bytes_data) - for doc in docbin.get_docs(nlp.vocab): - nr_word += len(doc) - entities.update((e.label_, e.text) for e in doc.ents) - end_time = timer() - msg = "Loaded {nr_word} words in {seconds} seconds ({wps} words per second)" - wps = nr_word / (end_time - start_time) - print(msg.format(nr_word=nr_word, seconds=end_time - start_time, wps=wps)) - print("Most common entities:") - for (label, entity), freq in entities.most_common(30): - print(freq, entity, label) - - -if __name__ == "__main__": - import plac - - plac.call(main) diff --git a/examples/notebooks/Decompositional Attention.ipynb b/examples/notebooks/Decompositional Attention.ipynb deleted file mode 100644 index 8baaf7d33..000000000 --- a/examples/notebooks/Decompositional Attention.ipynb +++ /dev/null @@ -1,955 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Natural language inference using spaCy and Keras" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Introduction" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This notebook details an implementation of the natural language inference model presented in [(Parikh et al, 2016)](https://arxiv.org/abs/1606.01933). The model is notable for the small number of paramaters *and hyperparameters* it specifices, while still yielding good performance." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Constructing the dataset" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [], - "source": [ - "import spacy\n", - "import numpy as np" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We only need the GloVe vectors from spaCy, not a full NLP pipeline." - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "nlp = spacy.load('en_vectors_web_lg')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Function to load the SNLI dataset. The categories are converted to one-shot representation. The function comes from an example in spaCy." - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/home/jds/tensorflow-gpu/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. 
In future, it will be treated as `np.float64 == np.dtype(float).type`.\n", - " from ._conv import register_converters as _register_converters\n", - "Using TensorFlow backend.\n" - ] - } - ], - "source": [ - "import json\n", - "from keras.utils import to_categorical\n", - "\n", - "LABELS = {'entailment': 0, 'contradiction': 1, 'neutral': 2}\n", - "def read_snli(path):\n", - " texts1 = []\n", - " texts2 = []\n", - " labels = []\n", - " with open(path, 'r') as file_:\n", - " for line in file_:\n", - " eg = json.loads(line)\n", - " label = eg['gold_label']\n", - " if label == '-': # per Parikh, ignore - SNLI entries\n", - " continue\n", - " texts1.append(eg['sentence1'])\n", - " texts2.append(eg['sentence2'])\n", - " labels.append(LABELS[label])\n", - " return texts1, texts2, to_categorical(np.asarray(labels, dtype='int32'))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Because Keras can do the train/test split for us, we'll load *all* SNLI triples from one file." - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [], - "source": [ - "texts,hypotheses,labels = read_snli('snli/snli_1.0_train.jsonl')" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [], - "source": [ - "def create_dataset(nlp, texts, hypotheses, num_oov, max_length, norm_vectors = True):\n", - " sents = texts + hypotheses\n", - " \n", - " # the extra +1 is for a zero vector represting NULL for padding\n", - " num_vectors = max(lex.rank for lex in nlp.vocab) + 2 \n", - " \n", - " # create random vectors for OOV tokens\n", - " oov = np.random.normal(size=(num_oov, nlp.vocab.vectors_length))\n", - " oov = oov / oov.sum(axis=1, keepdims=True)\n", - " \n", - " vectors = np.zeros((num_vectors + num_oov, nlp.vocab.vectors_length), dtype='float32')\n", - " vectors[num_vectors:, ] = oov\n", - " for lex in nlp.vocab:\n", - " if lex.has_vector and lex.vector_norm > 0:\n", - " vectors[lex.rank + 1] = lex.vector / lex.vector_norm if norm_vectors == True else lex.vector\n", - " \n", - " sents_as_ids = []\n", - " for sent in sents:\n", - " doc = nlp(sent)\n", - " word_ids = []\n", - " \n", - " for i, token in enumerate(doc):\n", - " # skip odd spaces from tokenizer\n", - " if token.has_vector and token.vector_norm == 0:\n", - " continue\n", - " \n", - " if i > max_length:\n", - " break\n", - " \n", - " if token.has_vector:\n", - " word_ids.append(token.rank + 1)\n", - " else:\n", - " # if we don't have a vector, pick an OOV entry\n", - " word_ids.append(token.rank % num_oov + num_vectors) \n", - " \n", - " # there must be a simpler way of generating padded arrays from lists...\n", - " word_id_vec = np.zeros((max_length), dtype='int')\n", - " clipped_len = min(max_length, len(word_ids))\n", - " word_id_vec[:clipped_len] = word_ids[:clipped_len]\n", - " sents_as_ids.append(word_id_vec)\n", - " \n", - " \n", - " return vectors, np.array(sents_as_ids[:len(texts)]), np.array(sents_as_ids[len(texts):])" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [], - "source": [ - "sem_vectors, text_vectors, hypothesis_vectors = create_dataset(nlp, texts, hypotheses, 100, 50, True)" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [], - "source": [ - "texts_test,hypotheses_test,labels_test = read_snli('snli/snli_1.0_test.jsonl')" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [], - "source": [ - "_, text_vectors_test, 
hypothesis_vectors_test = create_dataset(nlp, texts_test, hypotheses_test, 100, 50, True)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We use spaCy to tokenize the sentences and return, when available, a semantic vector for each token. \n", - "\n", - "OOV terms (tokens for which no semantic vector is available) are assigned to one of a set of randomly-generated OOV vectors, per (Parikh et al, 2016).\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Note that we will clip sentences to 50 words maximum." - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [], - "source": [ - "from keras import layers, Model, models\n", - "from keras import backend as K" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Building the model" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The embedding layer copies the 300-dimensional GloVe vectors into GPU memory. Per (Parikh et al, 2016), the vectors, which are not adapted during training, are projected down to lower-dimensional vectors using a trained projection matrix." - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [], - "source": [ - "def create_embedding(vectors, max_length, projected_dim):\n", - " return models.Sequential([\n", - " layers.Embedding(\n", - " vectors.shape[0],\n", - " vectors.shape[1],\n", - " input_length=max_length,\n", - " weights=[vectors],\n", - " trainable=False),\n", - " \n", - " layers.TimeDistributed(\n", - " layers.Dense(projected_dim,\n", - " activation=None,\n", - " use_bias=False))\n", - " ])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The Parikh model makes use of three feedforward blocks that construct nonlinear combinations of their input. Each block contains two ReLU layers and two dropout layers." - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [], - "source": [ - "def create_feedforward(num_units=200, activation='relu', dropout_rate=0.2):\n", - " return models.Sequential([\n", - " layers.Dense(num_units, activation=activation),\n", - " layers.Dropout(dropout_rate),\n", - " layers.Dense(num_units, activation=activation),\n", - " layers.Dropout(dropout_rate)\n", - " ])" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The basic idea of the (Parikh et al, 2016) model is to:\n", - "\n", - "1. *Align*: Construct an alignment of subphrases in the text and hypothesis using an attention-like mechanism, called \"decompositional\" because the layer is applied to each of the two sentences individually rather than to their product. The dot product of the nonlinear transformations of the inputs is then normalized vertically and horizontally to yield a pair of \"soft\" alignment structures, from text->hypothesis and hypothesis->text. Concretely, for each word in one sentence, a multinomial distribution is computed over the words of the other sentence, by learning a multinomial logistic with softmax target.\n", - "2. *Compare*: Each word is now compared to its aligned phrase using a function modeled as a two-layer feedforward ReLU network. The output is a high-dimensional representation of the strength of association between word and aligned phrase.\n", - "3. *Aggregate*: The comparison vectors are summed, separately, for the text and the hypothesis. 
The result is two vectors: one that describes the degree of association of the text to the hypothesis, and the second, of the hypothesis to the text.\n", - "4. Finally, these two vectors are processed by a dense layer followed by a softmax classifier, as usual.\n", - "\n", - "Note that because in entailment the truth conditions of the consequent must be a subset of those of the antecedent, it is not obvious that we need both vectors in step (3). Entailment is not symmetric. It may be enough to just use the hypothesis->text vector. We will explore this possibility later." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We need a couple of little functions for Lambda layers to normalize and aggregate weights:" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [], - "source": [ - "def normalizer(axis):\n", - " def _normalize(att_weights):\n", - " exp_weights = K.exp(att_weights)\n", - " sum_weights = K.sum(exp_weights, axis=axis, keepdims=True)\n", - " return exp_weights/sum_weights\n", - " return _normalize\n", - "\n", - "def sum_word(x):\n", - " return K.sum(x, axis=1)\n" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": {}, - "outputs": [], - "source": [ - "def build_model(vectors, max_length, num_hidden, num_classes, projected_dim, entail_dir='both'):\n", - " input1 = layers.Input(shape=(max_length,), dtype='int32', name='words1')\n", - " input2 = layers.Input(shape=(max_length,), dtype='int32', name='words2')\n", - " \n", - " # embeddings (projected)\n", - " embed = create_embedding(vectors, max_length, projected_dim)\n", - " \n", - " a = embed(input1)\n", - " b = embed(input2)\n", - " \n", - " # step 1: attend\n", - " F = create_feedforward(num_hidden)\n", - " att_weights = layers.dot([F(a), F(b)], axes=-1)\n", - " \n", - " G = create_feedforward(num_hidden)\n", - " \n", - " if entail_dir == 'both':\n", - " norm_weights_a = layers.Lambda(normalizer(1))(att_weights)\n", - " norm_weights_b = layers.Lambda(normalizer(2))(att_weights)\n", - " alpha = layers.dot([norm_weights_a, a], axes=1)\n", - " beta = layers.dot([norm_weights_b, b], axes=1)\n", - "\n", - " # step 2: compare\n", - " comp1 = layers.concatenate([a, beta])\n", - " comp2 = layers.concatenate([b, alpha])\n", - " v1 = layers.TimeDistributed(G)(comp1)\n", - " v2 = layers.TimeDistributed(G)(comp2)\n", - "\n", - " # step 3: aggregate\n", - " v1_sum = layers.Lambda(sum_word)(v1)\n", - " v2_sum = layers.Lambda(sum_word)(v2)\n", - " concat = layers.concatenate([v1_sum, v2_sum])\n", - " elif entail_dir == 'left':\n", - " norm_weights_a = layers.Lambda(normalizer(1))(att_weights)\n", - " alpha = layers.dot([norm_weights_a, a], axes=1)\n", - " comp2 = layers.concatenate([b, alpha])\n", - " v2 = layers.TimeDistributed(G)(comp2)\n", - " v2_sum = layers.Lambda(sum_word)(v2)\n", - " concat = v2_sum\n", - " else:\n", - " norm_weights_b = layers.Lambda(normalizer(2))(att_weights)\n", - " beta = layers.dot([norm_weights_b, b], axes=1)\n", - " comp1 = layers.concatenate([a, beta])\n", - " v1 = layers.TimeDistributed(G)(comp1)\n", - " v1_sum = layers.Lambda(sum_word)(v1)\n", - " concat = v1_sum\n", - " \n", - " H = create_feedforward(num_hidden)\n", - " out = H(concat)\n", - " out = layers.Dense(num_classes, activation='softmax')(out)\n", - " \n", - " model = Model([input1, input2], out)\n", - " \n", - " model.compile(optimizer='adam',\n", - " loss='categorical_crossentropy',\n", - " metrics=['accuracy'])\n", - " return model\n", - " \n", - " \n", - " " 
- ] - }, - { - "cell_type": "code", - "execution_count": 18, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "__________________________________________________________________________________________________\n", - "Layer (type) Output Shape Param # Connected to \n", - "==================================================================================================\n", - "words1 (InputLayer) (None, 50) 0 \n", - "__________________________________________________________________________________________________\n", - "words2 (InputLayer) (None, 50) 0 \n", - "__________________________________________________________________________________________________\n", - "sequential_1 (Sequential) (None, 50, 200) 321381600 words1[0][0] \n", - " words2[0][0] \n", - "__________________________________________________________________________________________________\n", - "sequential_2 (Sequential) (None, 50, 200) 80400 sequential_1[1][0] \n", - " sequential_1[2][0] \n", - "__________________________________________________________________________________________________\n", - "dot_1 (Dot) (None, 50, 50) 0 sequential_2[1][0] \n", - " sequential_2[2][0] \n", - "__________________________________________________________________________________________________\n", - "lambda_2 (Lambda) (None, 50, 50) 0 dot_1[0][0] \n", - "__________________________________________________________________________________________________\n", - "lambda_1 (Lambda) (None, 50, 50) 0 dot_1[0][0] \n", - "__________________________________________________________________________________________________\n", - "dot_3 (Dot) (None, 50, 200) 0 lambda_2[0][0] \n", - " sequential_1[2][0] \n", - "__________________________________________________________________________________________________\n", - "dot_2 (Dot) (None, 50, 200) 0 lambda_1[0][0] \n", - " sequential_1[1][0] \n", - "__________________________________________________________________________________________________\n", - "concatenate_1 (Concatenate) (None, 50, 400) 0 sequential_1[1][0] \n", - " dot_3[0][0] \n", - "__________________________________________________________________________________________________\n", - "concatenate_2 (Concatenate) (None, 50, 400) 0 sequential_1[2][0] \n", - " dot_2[0][0] \n", - "__________________________________________________________________________________________________\n", - "time_distributed_2 (TimeDistrib (None, 50, 200) 120400 concatenate_1[0][0] \n", - "__________________________________________________________________________________________________\n", - "time_distributed_3 (TimeDistrib (None, 50, 200) 120400 concatenate_2[0][0] \n", - "__________________________________________________________________________________________________\n", - "lambda_3 (Lambda) (None, 200) 0 time_distributed_2[0][0] \n", - "__________________________________________________________________________________________________\n", - "lambda_4 (Lambda) (None, 200) 0 time_distributed_3[0][0] \n", - "__________________________________________________________________________________________________\n", - "concatenate_3 (Concatenate) (None, 400) 0 lambda_3[0][0] \n", - " lambda_4[0][0] \n", - "__________________________________________________________________________________________________\n", - "sequential_4 (Sequential) (None, 200) 120400 concatenate_3[0][0] \n", - "__________________________________________________________________________________________________\n", - "dense_8 (Dense) (None, 3) 603 
sequential_4[1][0] \n", - "==================================================================================================\n", - "Total params: 321,703,403\n", - "Trainable params: 381,803\n", - "Non-trainable params: 321,321,600\n", - "__________________________________________________________________________________________________\n" - ] - } - ], - "source": [ - "K.clear_session()\n", - "m = build_model(sem_vectors, 50, 200, 3, 200)\n", - "m.summary()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The number of trainable parameters, ~381k, is the number given by Parikh et al, so we're on the right track." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Training the model" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Parikh et al use tiny batches of 4, training for 50MM batches, which amounts to around 500 epochs. Here we'll use large batches to better use the GPU, and train for fewer epochs -- for purposes of this experiment." - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "metadata": { - "scrolled": true - }, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Train on 549367 samples, validate on 9824 samples\n", - "Epoch 1/50\n", - "549367/549367 [==============================] - 34s 62us/step - loss: 0.7599 - acc: 0.6617 - val_loss: 0.5396 - val_acc: 0.7861\n", - "Epoch 2/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.5611 - acc: 0.7763 - val_loss: 0.4892 - val_acc: 0.8085\n", - "Epoch 3/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.5212 - acc: 0.7948 - val_loss: 0.4574 - val_acc: 0.8261\n", - "Epoch 4/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.4986 - acc: 0.8045 - val_loss: 0.4410 - val_acc: 0.8274\n", - "Epoch 5/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.4819 - acc: 0.8114 - val_loss: 0.4224 - val_acc: 0.8383\n", - "Epoch 6/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.4714 - acc: 0.8166 - val_loss: 0.4200 - val_acc: 0.8379\n", - "Epoch 7/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.4633 - acc: 0.8203 - val_loss: 0.4098 - val_acc: 0.8457\n", - "Epoch 8/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.4558 - acc: 0.8232 - val_loss: 0.4114 - val_acc: 0.8415\n", - "Epoch 9/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.4508 - acc: 0.8250 - val_loss: 0.4062 - val_acc: 0.8477\n", - "Epoch 10/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.4433 - acc: 0.8286 - val_loss: 0.3982 - val_acc: 0.8486\n", - "Epoch 11/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.4388 - acc: 0.8307 - val_loss: 0.3953 - val_acc: 0.8497\n", - "Epoch 12/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.4351 - acc: 0.8321 - val_loss: 0.3973 - val_acc: 0.8522\n", - "Epoch 13/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.4309 - acc: 0.8342 - val_loss: 0.3939 - val_acc: 0.8539\n", - "Epoch 14/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.4269 - acc: 0.8355 - val_loss: 0.3932 - val_acc: 0.8517\n", - "Epoch 15/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.4247 - acc: 0.8369 - 
val_loss: 0.3938 - val_acc: 0.8515\n", - "Epoch 16/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.4208 - acc: 0.8379 - val_loss: 0.3936 - val_acc: 0.8504\n", - "Epoch 17/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.4194 - acc: 0.8390 - val_loss: 0.3885 - val_acc: 0.8560\n", - "Epoch 18/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.4162 - acc: 0.8402 - val_loss: 0.3874 - val_acc: 0.8561\n", - "Epoch 19/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.4140 - acc: 0.8409 - val_loss: 0.3889 - val_acc: 0.8545\n", - "Epoch 20/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.4114 - acc: 0.8426 - val_loss: 0.3864 - val_acc: 0.8583\n", - "Epoch 21/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.4092 - acc: 0.8430 - val_loss: 0.3870 - val_acc: 0.8561\n", - "Epoch 22/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.4062 - acc: 0.8442 - val_loss: 0.3852 - val_acc: 0.8577\n", - "Epoch 23/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.4050 - acc: 0.8450 - val_loss: 0.3850 - val_acc: 0.8578\n", - "Epoch 24/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.4035 - acc: 0.8455 - val_loss: 0.3825 - val_acc: 0.8555\n", - "Epoch 25/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.4018 - acc: 0.8460 - val_loss: 0.3837 - val_acc: 0.8573\n", - "Epoch 26/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.3989 - acc: 0.8476 - val_loss: 0.3843 - val_acc: 0.8599\n", - "Epoch 27/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.3979 - acc: 0.8481 - val_loss: 0.3841 - val_acc: 0.8589\n", - "Epoch 28/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.3967 - acc: 0.8484 - val_loss: 0.3811 - val_acc: 0.8575\n", - "Epoch 29/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.3956 - acc: 0.8492 - val_loss: 0.3829 - val_acc: 0.8589\n", - "Epoch 30/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.3938 - acc: 0.8499 - val_loss: 0.3859 - val_acc: 0.8562\n", - "Epoch 31/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.3925 - acc: 0.8500 - val_loss: 0.3798 - val_acc: 0.8587\n", - "Epoch 32/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.3906 - acc: 0.8509 - val_loss: 0.3834 - val_acc: 0.8569\n", - "Epoch 33/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.3893 - acc: 0.8511 - val_loss: 0.3806 - val_acc: 0.8588\n", - "Epoch 34/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.3885 - acc: 0.8515 - val_loss: 0.3828 - val_acc: 0.8603\n", - "Epoch 35/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.3879 - acc: 0.8520 - val_loss: 0.3800 - val_acc: 0.8594\n", - "Epoch 36/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.3860 - acc: 0.8530 - val_loss: 0.3796 - val_acc: 0.8577\n", - "Epoch 37/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.3856 - acc: 0.8532 - val_loss: 0.3857 - val_acc: 0.8591\n", - "Epoch 38/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.3838 - 
acc: 0.8535 - val_loss: 0.3835 - val_acc: 0.8603\n", - "Epoch 39/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.3830 - acc: 0.8543 - val_loss: 0.3830 - val_acc: 0.8599\n", - "Epoch 40/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.3818 - acc: 0.8548 - val_loss: 0.3832 - val_acc: 0.8559\n", - "Epoch 41/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.3806 - acc: 0.8551 - val_loss: 0.3845 - val_acc: 0.8553\n", - "Epoch 42/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.3803 - acc: 0.8550 - val_loss: 0.3789 - val_acc: 0.8617\n", - "Epoch 43/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.3791 - acc: 0.8556 - val_loss: 0.3835 - val_acc: 0.8580\n", - "Epoch 44/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.3778 - acc: 0.8565 - val_loss: 0.3799 - val_acc: 0.8580\n", - "Epoch 45/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.3766 - acc: 0.8571 - val_loss: 0.3790 - val_acc: 0.8625\n", - "Epoch 46/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.3770 - acc: 0.8569 - val_loss: 0.3820 - val_acc: 0.8590\n", - "Epoch 47/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.3761 - acc: 0.8573 - val_loss: 0.3831 - val_acc: 0.8581\n", - "Epoch 48/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.3739 - acc: 0.8579 - val_loss: 0.3828 - val_acc: 0.8599\n", - "Epoch 49/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.3738 - acc: 0.8577 - val_loss: 0.3785 - val_acc: 0.8590\n", - "Epoch 50/50\n", - "549367/549367 [==============================] - 33s 60us/step - loss: 0.3726 - acc: 0.8580 - val_loss: 0.3820 - val_acc: 0.8585\n" - ] - }, - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 19, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "m.fit([text_vectors, hypothesis_vectors], labels, batch_size=1024, epochs=50,validation_data=([text_vectors_test, hypothesis_vectors_test], labels_test))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The result is broadly in the region reported by Parikh et al: ~86 vs 86.3%. The small difference might be accounted by differences in `max_length` (here set at 50), in the training regime, and that here we use Keras' built-in validation splitting rather than the SNLI test set." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Experiment: the asymmetric model" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "It was suggested earlier that, based on the semantics of entailment, the vector representing the strength of association between the hypothesis to the text is all that is needed for classifying the entailment.\n", - "\n", - "The following model removes consideration of the complementary vector (text to hypothesis) from the computation. This will decrease the paramater count slightly, because the final dense layers will be smaller, and speed up the forward pass when predicting, because fewer calculations will be needed." 
- ] - }, - { - "cell_type": "code", - "execution_count": 20, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "__________________________________________________________________________________________________\n", - "Layer (type) Output Shape Param # Connected to \n", - "==================================================================================================\n", - "words2 (InputLayer) (None, 50) 0 \n", - "__________________________________________________________________________________________________\n", - "words1 (InputLayer) (None, 50) 0 \n", - "__________________________________________________________________________________________________\n", - "sequential_5 (Sequential) (None, 50, 200) 321381600 words1[0][0] \n", - " words2[0][0] \n", - "__________________________________________________________________________________________________\n", - "sequential_6 (Sequential) (None, 50, 200) 80400 sequential_5[1][0] \n", - " sequential_5[2][0] \n", - "__________________________________________________________________________________________________\n", - "dot_4 (Dot) (None, 50, 50) 0 sequential_6[1][0] \n", - " sequential_6[2][0] \n", - "__________________________________________________________________________________________________\n", - "lambda_5 (Lambda) (None, 50, 50) 0 dot_4[0][0] \n", - "__________________________________________________________________________________________________\n", - "dot_5 (Dot) (None, 50, 200) 0 lambda_5[0][0] \n", - " sequential_5[1][0] \n", - "__________________________________________________________________________________________________\n", - "concatenate_4 (Concatenate) (None, 50, 400) 0 sequential_5[2][0] \n", - " dot_5[0][0] \n", - "__________________________________________________________________________________________________\n", - "time_distributed_5 (TimeDistrib (None, 50, 200) 120400 concatenate_4[0][0] \n", - "__________________________________________________________________________________________________\n", - "lambda_6 (Lambda) (None, 200) 0 time_distributed_5[0][0] \n", - "__________________________________________________________________________________________________\n", - "sequential_8 (Sequential) (None, 200) 80400 lambda_6[0][0] \n", - "__________________________________________________________________________________________________\n", - "dense_16 (Dense) (None, 3) 603 sequential_8[1][0] \n", - "==================================================================================================\n", - "Total params: 321,663,403\n", - "Trainable params: 341,803\n", - "Non-trainable params: 321,321,600\n", - "__________________________________________________________________________________________________\n" - ] - } - ], - "source": [ - "m1 = build_model(sem_vectors, 50, 200, 3, 200, 'left')\n", - "m1.summary()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The parameter count has indeed decreased by 40,000, corresponding to the 200x200 smaller H function." 
- ] - }, - { - "cell_type": "code", - "execution_count": 21, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Train on 549367 samples, validate on 9824 samples\n", - "Epoch 1/50\n", - "549367/549367 [==============================] - 25s 46us/step - loss: 0.7331 - acc: 0.6770 - val_loss: 0.5257 - val_acc: 0.7936\n", - "Epoch 2/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.5518 - acc: 0.7799 - val_loss: 0.4717 - val_acc: 0.8159\n", - "Epoch 3/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.5147 - acc: 0.7967 - val_loss: 0.4449 - val_acc: 0.8278\n", - "Epoch 4/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.4948 - acc: 0.8060 - val_loss: 0.4326 - val_acc: 0.8344\n", - "Epoch 5/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.4814 - acc: 0.8122 - val_loss: 0.4247 - val_acc: 0.8359\n", - "Epoch 6/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.4712 - acc: 0.8162 - val_loss: 0.4143 - val_acc: 0.8430\n", - "Epoch 7/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.4635 - acc: 0.8205 - val_loss: 0.4172 - val_acc: 0.8401\n", - "Epoch 8/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.4570 - acc: 0.8223 - val_loss: 0.4106 - val_acc: 0.8422\n", - "Epoch 9/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.4505 - acc: 0.8259 - val_loss: 0.4043 - val_acc: 0.8451\n", - "Epoch 10/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.4459 - acc: 0.8280 - val_loss: 0.4050 - val_acc: 0.8467\n", - "Epoch 11/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.4405 - acc: 0.8300 - val_loss: 0.3975 - val_acc: 0.8481\n", - "Epoch 12/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.4360 - acc: 0.8324 - val_loss: 0.4026 - val_acc: 0.8496\n", - "Epoch 13/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.4327 - acc: 0.8334 - val_loss: 0.4024 - val_acc: 0.8471\n", - "Epoch 14/50\n", - "549367/549367 [==============================] - 24s 45us/step - loss: 0.4293 - acc: 0.8350 - val_loss: 0.3955 - val_acc: 0.8496\n", - "Epoch 15/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.4263 - acc: 0.8369 - val_loss: 0.3980 - val_acc: 0.8490\n", - "Epoch 16/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.4236 - acc: 0.8377 - val_loss: 0.3958 - val_acc: 0.8496\n", - "Epoch 17/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.4213 - acc: 0.8384 - val_loss: 0.3954 - val_acc: 0.8496\n", - "Epoch 18/50\n", - "549367/549367 [==============================] - 24s 45us/step - loss: 0.4187 - acc: 0.8394 - val_loss: 0.3929 - val_acc: 0.8514\n", - "Epoch 19/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.4157 - acc: 0.8409 - val_loss: 0.3939 - val_acc: 0.8507\n", - "Epoch 20/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.4135 - acc: 0.8417 - val_loss: 0.3953 - val_acc: 0.8522\n", - "Epoch 21/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.4122 - acc: 0.8424 - val_loss: 0.3974 - val_acc: 0.8506\n", - "Epoch 22/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 
0.4099 - acc: 0.8435 - val_loss: 0.3918 - val_acc: 0.8522\n", - "Epoch 23/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.4075 - acc: 0.8443 - val_loss: 0.3901 - val_acc: 0.8513\n", - "Epoch 24/50\n", - "549367/549367 [==============================] - 24s 44us/step - loss: 0.4067 - acc: 0.8447 - val_loss: 0.3885 - val_acc: 0.8543\n", - "Epoch 25/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.4047 - acc: 0.8454 - val_loss: 0.3846 - val_acc: 0.8531\n", - "Epoch 26/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.4031 - acc: 0.8461 - val_loss: 0.3864 - val_acc: 0.8562\n", - "Epoch 27/50\n", - "549367/549367 [==============================] - 24s 45us/step - loss: 0.4020 - acc: 0.8467 - val_loss: 0.3874 - val_acc: 0.8546\n", - "Epoch 28/50\n", - "549367/549367 [==============================] - 24s 45us/step - loss: 0.4001 - acc: 0.8473 - val_loss: 0.3848 - val_acc: 0.8534\n", - "Epoch 29/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.3991 - acc: 0.8479 - val_loss: 0.3865 - val_acc: 0.8562\n", - "Epoch 30/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.3976 - acc: 0.8484 - val_loss: 0.3833 - val_acc: 0.8574\n", - "Epoch 31/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.3961 - acc: 0.8487 - val_loss: 0.3846 - val_acc: 0.8585\n", - "Epoch 32/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.3942 - acc: 0.8498 - val_loss: 0.3805 - val_acc: 0.8573\n", - "Epoch 33/50\n", - "549367/549367 [==============================] - 24s 44us/step - loss: 0.3935 - acc: 0.8503 - val_loss: 0.3856 - val_acc: 0.8579\n", - "Epoch 34/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.3923 - acc: 0.8507 - val_loss: 0.3829 - val_acc: 0.8560\n", - "Epoch 35/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.3920 - acc: 0.8508 - val_loss: 0.3864 - val_acc: 0.8575\n", - "Epoch 36/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.3907 - acc: 0.8516 - val_loss: 0.3873 - val_acc: 0.8563\n", - "Epoch 37/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.3891 - acc: 0.8519 - val_loss: 0.3850 - val_acc: 0.8570\n", - "Epoch 38/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.3872 - acc: 0.8522 - val_loss: 0.3815 - val_acc: 0.8591\n", - "Epoch 39/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.3887 - acc: 0.8520 - val_loss: 0.3829 - val_acc: 0.8590\n", - "Epoch 40/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.3868 - acc: 0.8531 - val_loss: 0.3807 - val_acc: 0.8600\n", - "Epoch 41/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.3859 - acc: 0.8537 - val_loss: 0.3832 - val_acc: 0.8574\n", - "Epoch 42/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.3849 - acc: 0.8537 - val_loss: 0.3850 - val_acc: 0.8576\n", - "Epoch 43/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.3834 - acc: 0.8541 - val_loss: 0.3825 - val_acc: 0.8563\n", - "Epoch 44/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.3829 - acc: 0.8548 - val_loss: 0.3844 - val_acc: 0.8540\n", - "Epoch 45/50\n", - "549367/549367 [==============================] - 25s 
45us/step - loss: 0.3816 - acc: 0.8552 - val_loss: 0.3841 - val_acc: 0.8559\n", - "Epoch 46/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.3816 - acc: 0.8549 - val_loss: 0.3880 - val_acc: 0.8567\n", - "Epoch 47/50\n", - "549367/549367 [==============================] - 24s 45us/step - loss: 0.3799 - acc: 0.8559 - val_loss: 0.3767 - val_acc: 0.8635\n", - "Epoch 48/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.3800 - acc: 0.8560 - val_loss: 0.3786 - val_acc: 0.8563\n", - "Epoch 49/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.3781 - acc: 0.8563 - val_loss: 0.3812 - val_acc: 0.8596\n", - "Epoch 50/50\n", - "549367/549367 [==============================] - 25s 45us/step - loss: 0.3788 - acc: 0.8560 - val_loss: 0.3782 - val_acc: 0.8601\n" - ] - }, - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 21, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "m1.fit([text_vectors, hypothesis_vectors], labels, batch_size=1024, epochs=50,validation_data=([text_vectors_test, hypothesis_vectors_test], labels_test))" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This model performs the same as the slightly more complex model that evaluates alignments in both directions. Note also that processing time is improved, from 64 down to 48 microseconds per step. \n", - "\n", - "Let's now look at an asymmetric model that evaluates text to hypothesis comparisons. The prediction is that such a model will correctly classify a decent proportion of the exemplars, but not as accurately as the previous two.\n", - "\n", - "We'll just use 10 epochs for expediency." - ] - }, - { - "cell_type": "code", - "execution_count": 96, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "__________________________________________________________________________________________________\n", - "Layer (type) Output Shape Param # Connected to \n", - "==================================================================================================\n", - "words1 (InputLayer) (None, 50) 0 \n", - "__________________________________________________________________________________________________\n", - "words2 (InputLayer) (None, 50) 0 \n", - "__________________________________________________________________________________________________\n", - "sequential_13 (Sequential) (None, 50, 200) 321381600 words1[0][0] \n", - " words2[0][0] \n", - "__________________________________________________________________________________________________\n", - "sequential_14 (Sequential) (None, 50, 200) 80400 sequential_13[1][0] \n", - " sequential_13[2][0] \n", - "__________________________________________________________________________________________________\n", - "dot_8 (Dot) (None, 50, 50) 0 sequential_14[1][0] \n", - " sequential_14[2][0] \n", - "__________________________________________________________________________________________________\n", - "lambda_9 (Lambda) (None, 50, 50) 0 dot_8[0][0] \n", - "__________________________________________________________________________________________________\n", - "dot_9 (Dot) (None, 50, 200) 0 lambda_9[0][0] \n", - " sequential_13[2][0] \n", - "__________________________________________________________________________________________________\n", - "concatenate_6 (Concatenate) (None, 50, 400) 0 sequential_13[1][0] \n", - " dot_9[0][0] \n", - 
"__________________________________________________________________________________________________\n", - "time_distributed_9 (TimeDistrib (None, 50, 200) 120400 concatenate_6[0][0] \n", - "__________________________________________________________________________________________________\n", - "lambda_10 (Lambda) (None, 200) 0 time_distributed_9[0][0] \n", - "__________________________________________________________________________________________________\n", - "sequential_16 (Sequential) (None, 200) 80400 lambda_10[0][0] \n", - "__________________________________________________________________________________________________\n", - "dense_32 (Dense) (None, 3) 603 sequential_16[1][0] \n", - "==================================================================================================\n", - "Total params: 321,663,403\n", - "Trainable params: 341,803\n", - "Non-trainable params: 321,321,600\n", - "__________________________________________________________________________________________________\n" - ] - } - ], - "source": [ - "m2 = build_model(sem_vectors, 50, 200, 3, 200, 'right')\n", - "m2.summary()" - ] - }, - { - "cell_type": "code", - "execution_count": 97, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Train on 455226 samples, validate on 113807 samples\n", - "Epoch 1/10\n", - "455226/455226 [==============================] - 22s 49us/step - loss: 0.8920 - acc: 0.5771 - val_loss: 0.8001 - val_acc: 0.6435\n", - "Epoch 2/10\n", - "455226/455226 [==============================] - 22s 47us/step - loss: 0.7808 - acc: 0.6553 - val_loss: 0.7267 - val_acc: 0.6855\n", - "Epoch 3/10\n", - "455226/455226 [==============================] - 22s 47us/step - loss: 0.7329 - acc: 0.6825 - val_loss: 0.6966 - val_acc: 0.7006\n", - "Epoch 4/10\n", - "455226/455226 [==============================] - 22s 47us/step - loss: 0.7055 - acc: 0.6978 - val_loss: 0.6713 - val_acc: 0.7150\n", - "Epoch 5/10\n", - "455226/455226 [==============================] - 22s 47us/step - loss: 0.6862 - acc: 0.7081 - val_loss: 0.6533 - val_acc: 0.7253\n", - "Epoch 6/10\n", - "455226/455226 [==============================] - 21s 47us/step - loss: 0.6694 - acc: 0.7179 - val_loss: 0.6472 - val_acc: 0.7277\n", - "Epoch 7/10\n", - "455226/455226 [==============================] - 22s 47us/step - loss: 0.6555 - acc: 0.7252 - val_loss: 0.6338 - val_acc: 0.7347\n", - "Epoch 8/10\n", - "455226/455226 [==============================] - 22s 48us/step - loss: 0.6434 - acc: 0.7310 - val_loss: 0.6246 - val_acc: 0.7385\n", - "Epoch 9/10\n", - "455226/455226 [==============================] - 22s 47us/step - loss: 0.6325 - acc: 0.7367 - val_loss: 0.6164 - val_acc: 0.7424\n", - "Epoch 10/10\n", - "455226/455226 [==============================] - 22s 47us/step - loss: 0.6216 - acc: 0.7426 - val_loss: 0.6082 - val_acc: 0.7478\n" - ] - }, - { - "data": { - "text/plain": [ - "" - ] - }, - "execution_count": 97, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "m2.fit([text_vectors, hypothesis_vectors], labels, batch_size=1024, epochs=10,validation_split=.2)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Comparing this fit to the validation accuracy of the previous two models after 10 epochs, we observe that its accuracy is roughly 10% lower.\n", - "\n", - "It is reassuring that the neural modeling here reproduces what we know from the semantics of natural language!" 
- ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.5.2" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/examples/pipeline/custom_attr_methods.py b/examples/pipeline/custom_attr_methods.py deleted file mode 100644 index 7f97bc1c3..000000000 --- a/examples/pipeline/custom_attr_methods.py +++ /dev/null @@ -1,78 +0,0 @@ -#!/usr/bin/env python -# coding: utf-8 -"""This example contains several snippets of methods that can be set via custom -Doc, Token or Span attributes in spaCy v2.0. Attribute methods act like -they're "bound" to the object and are partially applied – i.e. the object -they're called on is passed in as the first argument. - -* Custom pipeline components: https://spacy.io//usage/processing-pipelines#custom-components - -Compatible with: spaCy v2.0.0+ -Last tested with: v2.1.0 -""" -from __future__ import unicode_literals, print_function - -import plac -from spacy.lang.en import English -from spacy.tokens import Doc, Span -from spacy import displacy -from pathlib import Path - - -@plac.annotations( - output_dir=("Output directory for saved HTML", "positional", None, Path) -) -def main(output_dir=None): - nlp = English() # start off with blank English class - - Doc.set_extension("overlap", method=overlap_tokens) - doc1 = nlp("Peach emoji is where it has always been.") - doc2 = nlp("Peach is the superior emoji.") - print("Text 1:", doc1.text) - print("Text 2:", doc2.text) - print("Overlapping tokens:", doc1._.overlap(doc2)) - - Doc.set_extension("to_html", method=to_html) - doc = nlp("This is a sentence about Apple.") - # add entity manually for demo purposes, to make it work without a model - doc.ents = [Span(doc, 5, 6, label=nlp.vocab.strings["ORG"])] - print("Text:", doc.text) - doc._.to_html(output=output_dir, style="ent") - - -def to_html(doc, output="/tmp", style="dep"): - """Doc method extension for saving the current state as a displaCy - visualization. - """ - # generate filename from first six non-punct tokens - file_name = "-".join([w.text for w in doc[:6] if not w.is_punct]) + ".html" - html = displacy.render(doc, style=style, page=True) # render markup - if output is not None: - output_path = Path(output) - if not output_path.exists(): - output_path.mkdir() - output_file = Path(output) / file_name - output_file.open("w", encoding="utf-8").write(html) # save to file - print("Saved HTML to {}".format(output_file)) - else: - print(html) - - -def overlap_tokens(doc, other_doc): - """Get the tokens from the original Doc that are also in the comparison Doc. - """ - overlap = [] - other_tokens = [token.text for token in other_doc] - for token in doc: - if token.text in other_tokens: - overlap.append(token) - return overlap - - -if __name__ == "__main__": - plac.call(main) - - # Expected output: - # Text 1: Peach emoji is where it has always been. - # Text 2: Peach is the superior emoji. - # Overlapping tokens: [Peach, emoji, is, .] 
diff --git a/examples/pipeline/custom_component_countries_api.py b/examples/pipeline/custom_component_countries_api.py deleted file mode 100644 index 241c0af37..000000000 --- a/examples/pipeline/custom_component_countries_api.py +++ /dev/null @@ -1,130 +0,0 @@ -#!/usr/bin/env python -# coding: utf8 -"""Example of a spaCy v2.0 pipeline component that requests all countries via -the REST Countries API, merges country names into one token, assigns entity -labels and sets attributes on country tokens, e.g. the capital and lat/lng -coordinates. Can be extended with more details from the API. - -* REST Countries API: https://restcountries.eu (Mozilla Public License MPL 2.0) -* Custom pipeline components: https://spacy.io//usage/processing-pipelines#custom-components - -Compatible with: spaCy v2.0.0+ -Last tested with: v2.1.0 -Prerequisites: pip install requests -""" -from __future__ import unicode_literals, print_function - -import requests -import plac -from spacy.lang.en import English -from spacy.matcher import PhraseMatcher -from spacy.tokens import Doc, Span, Token - - -def main(): - # For simplicity, we start off with only the blank English Language class - # and no model or pre-defined pipeline loaded. - nlp = English() - rest_countries = RESTCountriesComponent(nlp) # initialise component - nlp.add_pipe(rest_countries) # add it to the pipeline - doc = nlp("Some text about Colombia and the Czech Republic") - print("Pipeline", nlp.pipe_names) # pipeline contains component name - print("Doc has countries", doc._.has_country) # Doc contains countries - for token in doc: - if token._.is_country: - print( - token.text, - token._.country_capital, - token._.country_latlng, - token._.country_flag, - ) # country data - print("Entities", [(e.text, e.label_) for e in doc.ents]) # entities - - -class RESTCountriesComponent(object): - """spaCy v2.0 pipeline component that requests all countries via - the REST Countries API, merges country names into one token, assigns entity - labels and sets attributes on country tokens. - """ - - name = "rest_countries" # component name, will show up in the pipeline - - def __init__(self, nlp, label="GPE"): - """Initialise the pipeline component. The shared nlp instance is used - to initialise the matcher with the shared vocab, get the label ID and - generate Doc objects as phrase match patterns. - """ - # Make request once on initialisation and store the data - r = requests.get("https://restcountries.eu/rest/v2/all") - r.raise_for_status() # make sure requests raises an error if it fails - countries = r.json() - - # Convert API response to dict keyed by country name for easy lookup - # This could also be extended using the alternative and foreign language - # names provided by the API - self.countries = {c["name"]: c for c in countries} - self.label = nlp.vocab.strings[label] # get entity label ID - - # Set up the PhraseMatcher with Doc patterns for each country name - patterns = [nlp(c) for c in self.countries.keys()] - self.matcher = PhraseMatcher(nlp.vocab) - self.matcher.add("COUNTRIES", None, *patterns) - - # Register attribute on the Token. We'll be overwriting this based on - # the matches, so we're only setting a default value, not a getter. - # If no default value is set, it defaults to None. 
- Token.set_extension("is_country", default=False) - Token.set_extension("country_capital", default=False) - Token.set_extension("country_latlng", default=False) - Token.set_extension("country_flag", default=False) - - # Register attributes on Doc and Span via a getter that checks if one of - # the contained tokens is set to is_country == True. - Doc.set_extension("has_country", getter=self.has_country) - Span.set_extension("has_country", getter=self.has_country) - - def __call__(self, doc): - """Apply the pipeline component on a Doc object and modify it if matches - are found. Return the Doc, so it can be processed by the next component - in the pipeline, if available. - """ - matches = self.matcher(doc) - spans = [] # keep the spans for later so we can merge them afterwards - for _, start, end in matches: - # Generate Span representing the entity & set label - entity = Span(doc, start, end, label=self.label) - spans.append(entity) - # Set custom attribute on each token of the entity - # Can be extended with other data returned by the API, like - # currencies, country code, flag, calling code etc. - for token in entity: - token._.set("is_country", True) - token._.set("country_capital", self.countries[entity.text]["capital"]) - token._.set("country_latlng", self.countries[entity.text]["latlng"]) - token._.set("country_flag", self.countries[entity.text]["flag"]) - # Overwrite doc.ents and add entity – be careful not to replace! - doc.ents = list(doc.ents) + [entity] - for span in spans: - # Iterate over all spans and merge them into one token. This is done - # after setting the entities – otherwise, it would cause mismatched - # indices! - span.merge() - return doc # don't forget to return the Doc! - - def has_country(self, tokens): - """Getter for Doc and Span attributes. Returns True if one of the tokens - is a country. Since the getter is only called when we access the - attribute, we can refer to the Token's 'is_country' attribute here, - which is already set in the processing step.""" - return any([t._.get("is_country") for t in tokens]) - - -if __name__ == "__main__": - plac.call(main) - - # Expected output: - # Pipeline ['rest_countries'] - # Doc has countries True - # Colombia Bogotá [4.0, -72.0] https://restcountries.eu/data/col.svg - # Czech Republic Prague [49.75, 15.5] https://restcountries.eu/data/cze.svg - # Entities [('Colombia', 'GPE'), ('Czech Republic', 'GPE')] diff --git a/examples/pipeline/custom_component_entities.py b/examples/pipeline/custom_component_entities.py deleted file mode 100644 index a53b688b0..000000000 --- a/examples/pipeline/custom_component_entities.py +++ /dev/null @@ -1,115 +0,0 @@ -#!/usr/bin/env python -# coding: utf8 -"""Example of a spaCy v2.0 pipeline component that sets entity annotations -based on list of single or multiple-word company names. Companies are -labelled as ORG and their spans are merged into one token. Additionally, -._.has_tech_org and ._.is_tech_org is set on the Doc/Span and Token -respectively. - -* Custom pipeline components: https://spacy.io//usage/processing-pipelines#custom-components - -Compatible with: spaCy v2.0.0+ -Last tested with: v2.1.0 -""" -from __future__ import unicode_literals, print_function - -import plac -from spacy.lang.en import English -from spacy.matcher import PhraseMatcher -from spacy.tokens import Doc, Span, Token - - -@plac.annotations( - text=("Text to process", "positional", None, str), - companies=("Names of technology companies", "positional", None, str), -) -def main(text="Alphabet Inc. 
is the company behind Google.", *companies): - # For simplicity, we start off with only the blank English Language class - # and no model or pre-defined pipeline loaded. - nlp = English() - if not companies: # set default companies if none are set via args - companies = ["Alphabet Inc.", "Google", "Netflix", "Apple"] # etc. - component = TechCompanyRecognizer(nlp, companies) # initialise component - nlp.add_pipe(component, last=True) # add last to the pipeline - - doc = nlp(text) - print("Pipeline", nlp.pipe_names) # pipeline contains component name - print("Tokens", [t.text for t in doc]) # company names from the list are merged - print("Doc has_tech_org", doc._.has_tech_org) # Doc contains tech orgs - print("Token 0 is_tech_org", doc[0]._.is_tech_org) # "Alphabet Inc." is a tech org - print("Token 1 is_tech_org", doc[1]._.is_tech_org) # "is" is not - print("Entities", [(e.text, e.label_) for e in doc.ents]) # all orgs are entities - - -class TechCompanyRecognizer(object): - """Example of a spaCy v2.0 pipeline component that sets entity annotations - based on list of single or multiple-word company names. Companies are - labelled as ORG and their spans are merged into one token. Additionally, - ._.has_tech_org and ._.is_tech_org is set on the Doc/Span and Token - respectively.""" - - name = "tech_companies" # component name, will show up in the pipeline - - def __init__(self, nlp, companies=tuple(), label="ORG"): - """Initialise the pipeline component. The shared nlp instance is used - to initialise the matcher with the shared vocab, get the label ID and - generate Doc objects as phrase match patterns. - """ - self.label = nlp.vocab.strings[label] # get entity label ID - - # Set up the PhraseMatcher – it can now take Doc objects as patterns, - # so even if the list of companies is long, it's very efficient - patterns = [nlp(org) for org in companies] - self.matcher = PhraseMatcher(nlp.vocab) - self.matcher.add("TECH_ORGS", None, *patterns) - - # Register attribute on the Token. We'll be overwriting this based on - # the matches, so we're only setting a default value, not a getter. - Token.set_extension("is_tech_org", default=False) - - # Register attributes on Doc and Span via a getter that checks if one of - # the contained tokens is set to is_tech_org == True. - Doc.set_extension("has_tech_org", getter=self.has_tech_org) - Span.set_extension("has_tech_org", getter=self.has_tech_org) - - def __call__(self, doc): - """Apply the pipeline component on a Doc object and modify it if matches - are found. Return the Doc, so it can be processed by the next component - in the pipeline, if available. - """ - matches = self.matcher(doc) - spans = [] # keep the spans for later so we can merge them afterwards - for _, start, end in matches: - # Generate Span representing the entity & set label - entity = Span(doc, start, end, label=self.label) - spans.append(entity) - # Set custom attribute on each token of the entity - for token in entity: - token._.set("is_tech_org", True) - # Overwrite doc.ents and add entity – be careful not to replace! - doc.ents = list(doc.ents) + [entity] - for span in spans: - # Iterate over all spans and merge them into one token. This is done - # after setting the entities – otherwise, it would cause mismatched - # indices! - span.merge() - return doc # don't forget to return the Doc! - - def has_tech_org(self, tokens): - """Getter for Doc and Span attributes. Returns True if one of the tokens - is a tech org. 
Since the getter is only called when we access the - attribute, we can refer to the Token's 'is_tech_org' attribute here, - which is already set in the processing step.""" - return any([t._.get("is_tech_org") for t in tokens]) - - -if __name__ == "__main__": - plac.call(main) - - # Expected output: - # Pipeline ['tech_companies'] - # Tokens ['Alphabet Inc.', 'is', 'the', 'company', 'behind', 'Google', '.'] - # Doc has_tech_org True - # Token 0 is_tech_org True - # Token 1 is_tech_org False - # Entities [('Alphabet Inc.', 'ORG'), ('Google', 'ORG')] diff --git a/examples/pipeline/custom_sentence_segmentation.py b/examples/pipeline/custom_sentence_segmentation.py deleted file mode 100644 index ff59ab187..000000000 --- a/examples/pipeline/custom_sentence_segmentation.py +++ /dev/null @@ -1,61 +0,0 @@ -"""Example of adding a pipeline component to prohibit sentence boundaries -before certain tokens. - -What we do is write to the token.is_sent_start attribute, which -takes values in {True, False, None}. The default value None allows the parser -to predict sentence segments. The value False prohibits the parser from inserting -a sentence boundary before that token. Note that fixing the sentence segmentation -should also improve the parse quality. - -The specific example here is drawn from https://github.com/explosion/spaCy/issues/2627 -Other versions of the model may not make the original mistake, so the specific -example might not be apt for future versions. - -Compatible with: spaCy v2.0.0+ -Last tested with: v2.1.0 -""" -import plac -import spacy - - -def prevent_sentence_boundaries(doc): - for token in doc: - if not can_be_sentence_start(token): - token.is_sent_start = False - return doc - - -def can_be_sentence_start(token): - if token.i == 0: - return True - # We're not checking for is_title here to ignore arbitrary titlecased - # tokens within sentences - # elif token.is_title: - # return True - elif token.nbor(-1).is_punct: - return True - elif token.nbor(-1).is_space: - return True - else: - return False - - -@plac.annotations( - text=("The raw text to process", "positional", None, str), - spacy_model=("spaCy model to use (with a parser)", "option", "m", str), -) -def main(text="Been here And I'm loving it.", spacy_model="en_core_web_lg"): - print("Using spaCy model '{}'".format(spacy_model)) - print("Processing text '{}'".format(text)) - nlp = spacy.load(spacy_model) - doc = nlp(text) - sentences = [sent.text.strip() for sent in doc.sents] - print("Before:", sentences) - nlp.add_pipe(prevent_sentence_boundaries, before="parser") - doc = nlp(text) - sentences = [sent.text.strip() for sent in doc.sents] - print("After:", sentences) - - -if __name__ == "__main__": - plac.call(main) diff --git a/examples/pipeline/fix_space_entities.py b/examples/pipeline/fix_space_entities.py deleted file mode 100644 index 686253eca..000000000 --- a/examples/pipeline/fix_space_entities.py +++ /dev/null @@ -1,37 +0,0 @@ -#!/usr/bin/env python -# coding: utf8 -"""Demonstrate adding a rule-based component that forces some tokens to not -be entities, before the NER tagger is applied. This is used to hotfix the issue -in https://github.com/explosion/spaCy/issues/2870, present as of spaCy v2.0.16. 
- -Compatible with: spaCy v2.0.0+ -Last tested with: v2.1.0 -""" -from __future__ import unicode_literals - -import spacy -from spacy.attrs import ENT_IOB - - -def fix_space_tags(doc): - ent_iobs = doc.to_array([ENT_IOB]) - for i, token in enumerate(doc): - if token.is_space: - # Sets 'O' tag (0 is None, so I is 1, O is 2) - ent_iobs[i] = 2 - doc.from_array([ENT_IOB], ent_iobs.reshape((len(doc), 1))) - return doc - - -def main(): - nlp = spacy.load("en_core_web_sm") - text = "This is some crazy test where I dont need an Apple Watch to make things bug" - doc = nlp(text) - print("Before", doc.ents) - nlp.add_pipe(fix_space_tags, name="fix-ner", before="ner") - doc = nlp(text) - print("After", doc.ents) - - -if __name__ == "__main__": - main() diff --git a/examples/pipeline/multi_processing.py b/examples/pipeline/multi_processing.py deleted file mode 100644 index f0e437acf..000000000 --- a/examples/pipeline/multi_processing.py +++ /dev/null @@ -1,84 +0,0 @@ -#!/usr/bin/env python -# coding: utf8 -"""Example of multi-processing with Joblib. Here, we're exporting -part-of-speech-tagged, true-cased, (very roughly) sentence-separated text, with -each "sentence" on a newline, and spaces between tokens. Data is loaded from -the IMDB movie reviews dataset and will be loaded automatically via Thinc's -built-in dataset loader. - -Compatible with: spaCy v2.0.0+ -Last tested with: v2.1.0 -Prerequisites: pip install joblib -""" -from __future__ import print_function, unicode_literals - -from pathlib import Path -from joblib import Parallel, delayed -from functools import partial -import thinc.extra.datasets -import plac -import spacy -from spacy.util import minibatch - - -@plac.annotations( - output_dir=("Output directory", "positional", None, Path), - model=("Model name (needs tagger)", "positional", None, str), - n_jobs=("Number of workers", "option", "n", int), - batch_size=("Batch-size for each process", "option", "b", int), - limit=("Limit of entries from the dataset", "option", "l", int), -) -def main(output_dir, model="en_core_web_sm", n_jobs=4, batch_size=1000, limit=10000): - nlp = spacy.load(model) # load spaCy model - print("Loaded model '%s'" % model) - if not output_dir.exists(): - output_dir.mkdir() - # load and pre-process the IMBD dataset - print("Loading IMDB data...") - data, _ = thinc.extra.datasets.imdb() - texts, _ = zip(*data[-limit:]) - print("Processing texts...") - partitions = minibatch(texts, size=batch_size) - executor = Parallel(n_jobs=n_jobs, backend="multiprocessing", prefer="processes") - do = delayed(partial(transform_texts, nlp)) - tasks = (do(i, batch, output_dir) for i, batch in enumerate(partitions)) - executor(tasks) - - -def transform_texts(nlp, batch_id, texts, output_dir): - print(nlp.pipe_names) - out_path = Path(output_dir) / ("%d.txt" % batch_id) - if out_path.exists(): # return None in case same batch is called again - return None - print("Processing batch", batch_id) - with out_path.open("w", encoding="utf8") as f: - for doc in nlp.pipe(texts): - f.write(" ".join(represent_word(w) for w in doc if not w.is_space)) - f.write("\n") - print("Saved {} texts to {}.txt".format(len(texts), batch_id)) - - -def represent_word(word): - text = word.text - # True-case, i.e. try to normalize sentence-initial capitals. - # Only do this if the lower-cased form is more probable. 
- if ( - text.istitle() - and is_sent_begin(word) - and word.prob < word.doc.vocab[text.lower()].prob - ): - text = text.lower() - return text + "|" + word.tag_ - - -def is_sent_begin(word): - if word.i == 0: - return True - elif word.i >= 2 and word.nbor(-1).text in (".", "!", "?", "..."): - return True - else: - return False - - -if __name__ == "__main__": - plac.call(main) diff --git a/examples/streamlit_spacy.py b/examples/streamlit_spacy.py deleted file mode 100644 index a2da123c2..000000000 --- a/examples/streamlit_spacy.py +++ /dev/null @@ -1,153 +0,0 @@ -# coding: utf-8 -""" -Example of a Streamlit app for an interactive spaCy model visualizer. You can -either download the script, or point streamlit run to the raw URL of this -file. For more details, see https://streamlit.io. - -Installation: -pip install streamlit -python -m spacy download en_core_web_sm -python -m spacy download en_core_web_md -python -m spacy download de_core_news_sm - -Usage: -streamlit run streamlit_spacy.py -""" -from __future__ import unicode_literals - -import streamlit as st -import spacy -from spacy import displacy -import pandas as pd - - -SPACY_MODEL_NAMES = ["en_core_web_sm", "en_core_web_md", "de_core_news_sm"] -DEFAULT_TEXT = "Mark Zuckerberg is the CEO of Facebook." -HTML_WRAPPER = """
<div style="overflow-x: auto; border: 1px solid #e6e9ef; border-radius: 0.25rem; padding: 1rem; margin-bottom: 2.5rem">{}</div>
""" - - -@st.cache(allow_output_mutation=True) -def load_model(name): - return spacy.load(name) - - -@st.cache(allow_output_mutation=True) -def process_text(model_name, text): - nlp = load_model(model_name) - return nlp(text) - - -st.sidebar.title("Interactive spaCy visualizer") -st.sidebar.markdown( - """ -Process text with [spaCy](https://spacy.io) models and visualize named entities, -dependencies and more. Uses spaCy's built-in -[displaCy](http://spacy.io/usage/visualizers) visualizer under the hood. -""" -) - -spacy_model = st.sidebar.selectbox("Model name", SPACY_MODEL_NAMES) -model_load_state = st.info(f"Loading model '{spacy_model}'...") -nlp = load_model(spacy_model) -model_load_state.empty() - -text = st.text_area("Text to analyze", DEFAULT_TEXT) -doc = process_text(spacy_model, text) - -if "parser" in nlp.pipe_names: - st.header("Dependency Parse & Part-of-speech tags") - st.sidebar.header("Dependency Parse") - split_sents = st.sidebar.checkbox("Split sentences", value=True) - collapse_punct = st.sidebar.checkbox("Collapse punctuation", value=True) - collapse_phrases = st.sidebar.checkbox("Collapse phrases") - compact = st.sidebar.checkbox("Compact mode") - options = { - "collapse_punct": collapse_punct, - "collapse_phrases": collapse_phrases, - "compact": compact, - } - docs = [span.as_doc() for span in doc.sents] if split_sents else [doc] - for sent in docs: - html = displacy.render(sent, options=options) - # Double newlines seem to mess with the rendering - html = html.replace("\n\n", "\n") - if split_sents and len(docs) > 1: - st.markdown(f"> {sent.text}") - st.write(HTML_WRAPPER.format(html), unsafe_allow_html=True) - -if "ner" in nlp.pipe_names: - st.header("Named Entities") - st.sidebar.header("Named Entities") - label_set = nlp.get_pipe("ner").labels - labels = st.sidebar.multiselect( - "Entity labels", options=label_set, default=list(label_set) - ) - html = displacy.render(doc, style="ent", options={"ents": labels}) - # Newlines seem to mess with the rendering - html = html.replace("\n", " ") - st.write(HTML_WRAPPER.format(html), unsafe_allow_html=True) - attrs = ["text", "label_", "start", "end", "start_char", "end_char"] - if "entity_linker" in nlp.pipe_names: - attrs.append("kb_id_") - data = [ - [str(getattr(ent, attr)) for attr in attrs] - for ent in doc.ents - if ent.label_ in labels - ] - df = pd.DataFrame(data, columns=attrs) - st.dataframe(df) - - -if "textcat" in nlp.pipe_names: - st.header("Text Classification") - st.markdown(f"> {text}") - df = pd.DataFrame(doc.cats.items(), columns=("Label", "Score")) - st.dataframe(df) - - -vector_size = nlp.meta.get("vectors", {}).get("width", 0) -if vector_size: - st.header("Vectors & Similarity") - st.code(nlp.meta["vectors"]) - text1 = st.text_input("Text or word 1", "apple") - text2 = st.text_input("Text or word 2", "orange") - doc1 = process_text(spacy_model, text1) - doc2 = process_text(spacy_model, text2) - similarity = doc1.similarity(doc2) - if similarity > 0.5: - st.success(similarity) - else: - st.error(similarity) - -st.header("Token attributes") - -if st.button("Show token attributes"): - attrs = [ - "idx", - "text", - "lemma_", - "pos_", - "tag_", - "dep_", - "head", - "ent_type_", - "ent_iob_", - "shape_", - "is_alpha", - "is_ascii", - "is_digit", - "is_punct", - "like_num", - ] - data = [[str(getattr(token, attr)) for attr in attrs] for token in doc] - df = pd.DataFrame(data, columns=attrs) - st.dataframe(df) - - -st.header("JSON Doc") -if st.button("Show JSON Doc"): - st.json(doc.to_json()) - 
-st.header("JSON model meta") -if st.button("Show JSON model meta"): - st.json(nlp.meta) diff --git a/examples/training/conllu-config.json b/examples/training/conllu-config.json deleted file mode 100644 index 9a11dd96b..000000000 --- a/examples/training/conllu-config.json +++ /dev/null @@ -1 +0,0 @@ -{"nr_epoch": 3, "batch_size": 24, "dropout": 0.001, "vectors": 0, "multitask_tag": 0, "multitask_sent": 0} diff --git a/examples/training/conllu.py b/examples/training/conllu.py deleted file mode 100644 index 1c65f4a72..000000000 --- a/examples/training/conllu.py +++ /dev/null @@ -1,434 +0,0 @@ -"""Train for CONLL 2017 UD treebank evaluation. Takes .conllu files, writes -.conllu format for development data, allowing the official scorer to be used. -""" -from __future__ import unicode_literals -import plac -import attr -from pathlib import Path -import re -import json -import tqdm - -import spacy -import spacy.util -from spacy.tokens import Token, Doc -from spacy.gold import GoldParse -from spacy.syntax.nonproj import projectivize -from collections import defaultdict -from spacy.matcher import Matcher - -import itertools -import random -import numpy.random - -from bin.ud import conll17_ud_eval - -import spacy.lang.zh -import spacy.lang.ja - -spacy.lang.zh.Chinese.Defaults.use_jieba = False -spacy.lang.ja.Japanese.Defaults.use_janome = False - -random.seed(0) -numpy.random.seed(0) - - -def minibatch_by_words(items, size=5000): - random.shuffle(items) - if isinstance(size, int): - size_ = itertools.repeat(size) - else: - size_ = size - items = iter(items) - while True: - batch_size = next(size_) - batch = [] - while batch_size >= 0: - try: - doc, gold = next(items) - except StopIteration: - if batch: - yield batch - return - batch_size -= len(doc) - batch.append((doc, gold)) - if batch: - yield batch - else: - break - - -################ -# Data reading # -################ - -space_re = re.compile("\s+") - - -def split_text(text): - return [space_re.sub(" ", par.strip()) for par in text.split("\n\n")] - - -def read_data( - nlp, - conllu_file, - text_file, - raw_text=True, - oracle_segments=False, - max_doc_length=None, - limit=None, -): - """Read the CONLLU format into (Doc, GoldParse) tuples. If raw_text=True, - include Doc objects created using nlp.make_doc and then aligned against - the gold-standard sequences. If oracle_segments=True, include Doc objects - created from the gold-standard segments. At least one must be True.""" - if not raw_text and not oracle_segments: - raise ValueError("At least one of raw_text or oracle_segments must be True") - paragraphs = split_text(text_file.read()) - conllu = read_conllu(conllu_file) - # sd is spacy doc; cd is conllu doc - # cs is conllu sent, ct is conllu token - docs = [] - golds = [] - for doc_id, (text, cd) in enumerate(zip(paragraphs, conllu)): - sent_annots = [] - for cs in cd: - sent = defaultdict(list) - for id_, word, lemma, pos, tag, morph, head, dep, _, space_after in cs: - if "." 
in id_: - continue - if "-" in id_: - continue - id_ = int(id_) - 1 - head = int(head) - 1 if head != "0" else id_ - sent["words"].append(word) - sent["tags"].append(tag) - sent["heads"].append(head) - sent["deps"].append("ROOT" if dep == "root" else dep) - sent["spaces"].append(space_after == "_") - sent["entities"] = ["-"] * len(sent["words"]) - sent["heads"], sent["deps"] = projectivize(sent["heads"], sent["deps"]) - if oracle_segments: - docs.append(Doc(nlp.vocab, words=sent["words"], spaces=sent["spaces"])) - golds.append(GoldParse(docs[-1], **sent)) - - sent_annots.append(sent) - if raw_text and max_doc_length and len(sent_annots) >= max_doc_length: - doc, gold = _make_gold(nlp, None, sent_annots) - sent_annots = [] - docs.append(doc) - golds.append(gold) - if limit and len(docs) >= limit: - return docs, golds - - if raw_text and sent_annots: - doc, gold = _make_gold(nlp, None, sent_annots) - docs.append(doc) - golds.append(gold) - if limit and len(docs) >= limit: - return docs, golds - return docs, golds - - -def read_conllu(file_): - docs = [] - sent = [] - doc = [] - for line in file_: - if line.startswith("# newdoc"): - if doc: - docs.append(doc) - doc = [] - elif line.startswith("#"): - continue - elif not line.strip(): - if sent: - doc.append(sent) - sent = [] - else: - sent.append(list(line.strip().split("\t"))) - if len(sent[-1]) != 10: - print(repr(line)) - raise ValueError - if sent: - doc.append(sent) - if doc: - docs.append(doc) - return docs - - -def _make_gold(nlp, text, sent_annots): - # Flatten the conll annotations, and adjust the head indices - flat = defaultdict(list) - for sent in sent_annots: - flat["heads"].extend(len(flat["words"]) + head for head in sent["heads"]) - for field in ["words", "tags", "deps", "entities", "spaces"]: - flat[field].extend(sent[field]) - # Construct text if necessary - assert len(flat["words"]) == len(flat["spaces"]) - if text is None: - text = "".join( - word + " " * space for word, space in zip(flat["words"], flat["spaces"]) - ) - doc = nlp.make_doc(text) - flat.pop("spaces") - gold = GoldParse(doc, **flat) - return doc, gold - - -############################# -# Data transforms for spaCy # -############################# - - -def golds_to_gold_tuples(docs, golds): - """Get out the annoying 'tuples' format used by begin_training, given the - GoldParse objects.""" - tuples = [] - for doc, gold in zip(docs, golds): - text = doc.text - ids, words, tags, heads, labels, iob = zip(*gold.orig_annot) - sents = [((ids, words, tags, heads, labels, iob), [])] - tuples.append((text, sents)) - return tuples - - -############## -# Evaluation # -############## - - -def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None): - with text_loc.open("r", encoding="utf8") as text_file: - texts = split_text(text_file.read()) - docs = list(nlp.pipe(texts)) - with sys_loc.open("w", encoding="utf8") as out_file: - write_conllu(docs, out_file) - with gold_loc.open("r", encoding="utf8") as gold_file: - gold_ud = conll17_ud_eval.load_conllu(gold_file) - with sys_loc.open("r", encoding="utf8") as sys_file: - sys_ud = conll17_ud_eval.load_conllu(sys_file) - scores = conll17_ud_eval.evaluate(gold_ud, sys_ud) - return scores - - -def write_conllu(docs, file_): - merger = Matcher(docs[0].vocab) - merger.add("SUBTOK", None, [{"DEP": "subtok", "op": "+"}]) - for i, doc in enumerate(docs): - matches = merger(doc) - spans = [doc[start : end + 1] for _, start, end in matches] - offsets = [(span.start_char, span.end_char) for span in spans] - for start_char, end_char in 
offsets: - doc.merge(start_char, end_char) - file_.write("# newdoc id = {i}\n".format(i=i)) - for j, sent in enumerate(doc.sents): - file_.write("# sent_id = {i}.{j}\n".format(i=i, j=j)) - file_.write("# text = {text}\n".format(text=sent.text)) - for k, token in enumerate(sent): - file_.write(token._.get_conllu_lines(k) + "\n") - file_.write("\n") - - -def print_progress(itn, losses, ud_scores): - fields = { - "dep_loss": losses.get("parser", 0.0), - "tag_loss": losses.get("tagger", 0.0), - "words": ud_scores["Words"].f1 * 100, - "sents": ud_scores["Sentences"].f1 * 100, - "tags": ud_scores["XPOS"].f1 * 100, - "uas": ud_scores["UAS"].f1 * 100, - "las": ud_scores["LAS"].f1 * 100, - } - header = ["Epoch", "Loss", "LAS", "UAS", "TAG", "SENT", "WORD"] - if itn == 0: - print("\t".join(header)) - tpl = "\t".join( - ( - "{:d}", - "{dep_loss:.1f}", - "{las:.1f}", - "{uas:.1f}", - "{tags:.1f}", - "{sents:.1f}", - "{words:.1f}", - ) - ) - print(tpl.format(itn, **fields)) - - -# def get_sent_conllu(sent, sent_id): -# lines = ["# sent_id = {sent_id}".format(sent_id=sent_id)] - - -def get_token_conllu(token, i): - if token._.begins_fused: - n = 1 - while token.nbor(n)._.inside_fused: - n += 1 - id_ = "%d-%d" % (i, i + n) - lines = [id_, token.text, "_", "_", "_", "_", "_", "_", "_", "_"] - else: - lines = [] - if token.head.i == token.i: - head = 0 - else: - head = i + (token.head.i - token.i) + 1 - fields = [ - str(i + 1), - token.text, - token.lemma_, - token.pos_, - token.tag_, - "_", - str(head), - token.dep_.lower(), - "_", - "_", - ] - lines.append("\t".join(fields)) - return "\n".join(lines) - - -################## -# Initialization # -################## - - -def load_nlp(corpus, config): - lang = corpus.split("_")[0] - nlp = spacy.blank(lang) - if config.vectors: - nlp.vocab.from_disk(config.vectors / "vocab") - return nlp - - -def initialize_pipeline(nlp, docs, golds, config): - nlp.add_pipe(nlp.create_pipe("parser")) - if config.multitask_tag: - nlp.parser.add_multitask_objective("tag") - if config.multitask_sent: - nlp.parser.add_multitask_objective("sent_start") - nlp.parser.moves.add_action(2, "subtok") - nlp.add_pipe(nlp.create_pipe("tagger")) - for gold in golds: - for tag in gold.tags: - if tag is not None: - nlp.tagger.add_label(tag) - # Replace labels that didn't make the frequency cutoff - actions = set(nlp.parser.labels) - label_set = set([act.split("-")[1] for act in actions if "-" in act]) - for gold in golds: - for i, label in enumerate(gold.labels): - if label is not None and label not in label_set: - gold.labels[i] = label.split("||")[0] - return nlp.begin_training(lambda: golds_to_gold_tuples(docs, golds)) - - -######################## -# Command line helpers # -######################## - - -@attr.s -class Config(object): - vectors = attr.ib(default=None) - max_doc_length = attr.ib(default=10) - multitask_tag = attr.ib(default=True) - multitask_sent = attr.ib(default=True) - nr_epoch = attr.ib(default=30) - batch_size = attr.ib(default=1000) - dropout = attr.ib(default=0.2) - - @classmethod - def load(cls, loc): - with Path(loc).open("r", encoding="utf8") as file_: - cfg = json.load(file_) - return cls(**cfg) - - -class Dataset(object): - def __init__(self, path, section): - self.path = path - self.section = section - self.conllu = None - self.text = None - for file_path in self.path.iterdir(): - name = file_path.parts[-1] - if section in name and name.endswith("conllu"): - self.conllu = file_path - elif section in name and name.endswith("txt"): - self.text = file_path - if 
self.conllu is None: - msg = "Could not find .txt file in {path} for {section}" - raise IOError(msg.format(section=section, path=path)) - if self.text is None: - msg = "Could not find .txt file in {path} for {section}" - self.lang = self.conllu.parts[-1].split("-")[0].split("_")[0] - - -class TreebankPaths(object): - def __init__(self, ud_path, treebank, **cfg): - self.train = Dataset(ud_path / treebank, "train") - self.dev = Dataset(ud_path / treebank, "dev") - self.lang = self.train.lang - - -@plac.annotations( - ud_dir=("Path to Universal Dependencies corpus", "positional", None, Path), - parses_dir=("Directory to write the development parses", "positional", None, Path), - config=("Path to json formatted config file", "positional", None, Config.load), - corpus=( - "UD corpus to train and evaluate on, e.g. UD_Spanish-AnCora", - "positional", - None, - str, - ), - limit=("Size limit", "option", "n", int), -) -def main(ud_dir, parses_dir, config, corpus, limit=0): - Token.set_extension("get_conllu_lines", method=get_token_conllu) - Token.set_extension("begins_fused", default=False) - Token.set_extension("inside_fused", default=False) - - paths = TreebankPaths(ud_dir, corpus) - if not (parses_dir / corpus).exists(): - (parses_dir / corpus).mkdir() - print("Train and evaluate", corpus, "using lang", paths.lang) - nlp = load_nlp(paths.lang, config) - - docs, golds = read_data( - nlp, - paths.train.conllu.open(encoding="utf8"), - paths.train.text.open(encoding="utf8"), - max_doc_length=config.max_doc_length, - limit=limit, - ) - - optimizer = initialize_pipeline(nlp, docs, golds, config) - - for i in range(config.nr_epoch): - docs = [nlp.make_doc(doc.text) for doc in docs] - batches = minibatch_by_words(list(zip(docs, golds)), size=config.batch_size) - losses = {} - n_train_words = sum(len(doc) for doc in docs) - with tqdm.tqdm(total=n_train_words, leave=False) as pbar: - for batch in batches: - batch_docs, batch_gold = zip(*batch) - pbar.update(sum(len(doc) for doc in batch_docs)) - nlp.update( - batch_docs, - batch_gold, - sgd=optimizer, - drop=config.dropout, - losses=losses, - ) - - out_path = parses_dir / corpus / "epoch-{i}.conllu".format(i=i) - with nlp.use_params(optimizer.averages): - scores = evaluate(nlp, paths.dev.text, paths.dev.conllu, out_path) - print_progress(i, losses, scores) - - -if __name__ == "__main__": - plac.call(main) diff --git a/examples/training/create_kb.py b/examples/training/create_kb.py deleted file mode 100644 index cbdb5c05b..000000000 --- a/examples/training/create_kb.py +++ /dev/null @@ -1,114 +0,0 @@ -#!/usr/bin/env python -# coding: utf8 - -"""Example of defining a knowledge base in spaCy, -which is needed to implement entity linking functionality. 
- -For more details, see the documentation: -* Knowledge base: https://spacy.io/api/kb -* Entity Linking: https://spacy.io/usage/linguistic-features#entity-linking - -Compatible with: spaCy v2.2.4 -Last tested with: v2.2.4 -""" -from __future__ import unicode_literals, print_function - -import plac -from pathlib import Path - -from spacy.vocab import Vocab -import spacy -from spacy.kb import KnowledgeBase - - -# Q2146908 (Russ Cochran): American golfer -# Q7381115 (Russ Cochran): publisher -ENTITIES = {"Q2146908": ("American golfer", 342), "Q7381115": ("publisher", 17)} - - -@plac.annotations( - model=("Model name, should have pretrained word embeddings", "positional", None, str), - output_dir=("Optional output directory", "option", "o", Path), -) -def main(model=None, output_dir=None): - """Load the model and create the KB with pre-defined entity encodings. - If an output_dir is provided, the KB will be stored there in a file 'kb'. - The updated vocab will also be written to a directory in the output_dir.""" - - nlp = spacy.load(model) # load existing spaCy model - print("Loaded model '%s'" % model) - - # check the length of the nlp vectors - if "vectors" not in nlp.meta or not nlp.vocab.vectors.size: - raise ValueError( - "The `nlp` object should have access to pretrained word vectors, " - " cf. https://spacy.io/usage/models#languages." - ) - - # You can change the dimension of vectors in your KB by using an encoder that changes the dimensionality. - # For simplicity, we'll just use the original vector dimension here instead. - vectors_dim = nlp.vocab.vectors.shape[1] - kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=vectors_dim) - - # set up the data - entity_ids = [] - descr_embeddings = [] - freqs = [] - for key, value in ENTITIES.items(): - desc, freq = value - entity_ids.append(key) - descr_embeddings.append(nlp(desc).vector) - freqs.append(freq) - - # set the entities, can also be done by calling `kb.add_entity` for each entity - kb.set_entities(entity_list=entity_ids, freq_list=freqs, vector_list=descr_embeddings) - - # adding aliases, the entities need to be defined in the KB beforehand - kb.add_alias( - alias="Russ Cochran", - entities=["Q2146908", "Q7381115"], - probabilities=[0.24, 0.7], # the sum of these probabilities should not exceed 1 - ) - - # test the trained model - print() - _print_kb(kb) - - # save model to output directory - if output_dir is not None: - output_dir = Path(output_dir) - if not output_dir.exists(): - output_dir.mkdir() - kb_path = str(output_dir / "kb") - kb.dump(kb_path) - print() - print("Saved KB to", kb_path) - - vocab_path = output_dir / "vocab" - kb.vocab.to_disk(vocab_path) - print("Saved vocab to", vocab_path) - - print() - - # test the saved model - # always reload a knowledge base with the same vocab instance! 
- print("Loading vocab from", vocab_path) - print("Loading KB from", kb_path) - vocab2 = Vocab().from_disk(vocab_path) - kb2 = KnowledgeBase(vocab=vocab2) - kb2.load_bulk(kb_path) - print() - _print_kb(kb2) - - -def _print_kb(kb): - print(kb.get_size_entities(), "kb entities:", kb.get_entity_strings()) - print(kb.get_size_aliases(), "kb aliases:", kb.get_alias_strings()) - - -if __name__ == "__main__": - plac.call(main) - - # Expected output: - # 2 kb entities: ['Q2146908', 'Q7381115'] - # 1 kb aliases: ['Russ Cochran'] diff --git a/examples/training/ner_multitask_objective.py b/examples/training/ner_multitask_objective.py deleted file mode 100644 index 4bf7a008f..000000000 --- a/examples/training/ner_multitask_objective.py +++ /dev/null @@ -1,89 +0,0 @@ -"""This example shows how to add a multi-task objective that is trained -alongside the entity recognizer. This is an alternative to adding features -to the model. - -The multi-task idea is to train an auxiliary model to predict some attribute, -with weights shared between the auxiliary model and the main model. In this -example, we're predicting the position of the word in the document. - -The model that predicts the position of the word encourages the convolutional -layers to include the position information in their representation. The -information is then available to the main model, as a feature. - -The overall idea is that we might know something about what sort of features -we'd like the CNN to extract. The multi-task objectives can encourage the -extraction of this type of feature. The multi-task objective is only used -during training. We discard the auxiliary model before run-time. - -The specific example here is not necessarily a good idea --- but it shows -how an arbitrary objective function for some word can be used. - -Developed and tested for spaCy 2.0.6. Updated for v2.2.2 -""" -import random -import plac -import spacy -import os.path -from spacy.tokens import Doc -from spacy.gold import read_json_file, GoldParse - -random.seed(0) - -PWD = os.path.dirname(__file__) - -TRAIN_DATA = list(read_json_file( - os.path.join(PWD, "ner_example_data", "ner-sent-per-line.json"))) - - -def get_position_label(i, words, tags, heads, labels, ents): - """Return labels indicating the position of the word in the document. 
- """ - if len(words) < 20: - return "short-doc" - elif i == 0: - return "first-word" - elif i < 10: - return "early-word" - elif i < 20: - return "mid-word" - elif i == len(words) - 1: - return "last-word" - else: - return "late-word" - - -def main(n_iter=10): - nlp = spacy.blank("en") - ner = nlp.create_pipe("ner") - ner.add_multitask_objective(get_position_label) - nlp.add_pipe(ner) - print(nlp.pipeline) - - print("Create data", len(TRAIN_DATA)) - optimizer = nlp.begin_training(get_gold_tuples=lambda: TRAIN_DATA) - for itn in range(n_iter): - random.shuffle(TRAIN_DATA) - losses = {} - for text, annot_brackets in TRAIN_DATA: - for annotations, _ in annot_brackets: - doc = Doc(nlp.vocab, words=annotations[1]) - gold = GoldParse.from_annot_tuples(doc, annotations) - nlp.update( - [doc], # batch of texts - [gold], # batch of annotations - drop=0.2, # dropout - make it harder to memorise data - sgd=optimizer, # callable to update weights - losses=losses, - ) - print(losses.get("nn_labeller", 0.0), losses["ner"]) - - # test the trained model - for text, _ in TRAIN_DATA: - if text is not None: - doc = nlp(text) - print("Entities", [(ent.text, ent.label_) for ent in doc.ents]) - print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc]) - - -if __name__ == "__main__": - plac.call(main) diff --git a/examples/training/pretrain_textcat.py b/examples/training/pretrain_textcat.py deleted file mode 100644 index d29e20ad1..000000000 --- a/examples/training/pretrain_textcat.py +++ /dev/null @@ -1,217 +0,0 @@ -"""This script is experimental. - -Try pre-training the CNN component of the text categorizer using a cheap -language modelling-like objective. Specifically, we load pretrained vectors -(from something like word2vec, GloVe, FastText etc), and use the CNN to -predict the tokens' pretrained vectors. This isn't as easy as it sounds: -we're not merely doing compression here, because heavy dropout is applied, -including over the input words. This means the model must often (50% of the time) -use the context in order to predict the word. - -To evaluate the technique, we're pre-training with the 50k texts from the IMDB -corpus, and then training with only 100 labels. Note that it's a bit dirty to -pre-train with the development data, but also not *so* terrible: we're not using -the development labels, after all --- only the unlabelled text. 
-""" -import plac -import tqdm -import random -import spacy -import thinc.extra.datasets -from spacy.util import minibatch, use_gpu, compounding -from spacy._ml import Tok2Vec -from spacy.pipeline import TextCategorizer -import numpy - - -def load_texts(limit=0): - train, dev = thinc.extra.datasets.imdb() - train_texts, train_labels = zip(*train) - dev_texts, dev_labels = zip(*train) - train_texts = list(train_texts) - dev_texts = list(dev_texts) - random.shuffle(train_texts) - random.shuffle(dev_texts) - if limit >= 1: - return train_texts[:limit] - else: - return list(train_texts) + list(dev_texts) - - -def load_textcat_data(limit=0): - """Load data from the IMDB dataset.""" - # Partition off part of the train data for evaluation - train_data, eval_data = thinc.extra.datasets.imdb() - random.shuffle(train_data) - train_data = train_data[-limit:] - texts, labels = zip(*train_data) - eval_texts, eval_labels = zip(*eval_data) - cats = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in labels] - eval_cats = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in eval_labels] - return (texts, cats), (eval_texts, eval_cats) - - -def prefer_gpu(): - used = spacy.util.use_gpu(0) - if used is None: - return False - else: - import cupy.random - - cupy.random.seed(0) - return True - - -def build_textcat_model(tok2vec, nr_class, width): - from thinc.v2v import Model, Softmax, Maxout - from thinc.api import flatten_add_lengths, chain - from thinc.t2v import Pooling, sum_pool, mean_pool, max_pool - from thinc.misc import Residual, LayerNorm - from spacy._ml import logistic, zero_init - - with Model.define_operators({">>": chain}): - model = ( - tok2vec - >> flatten_add_lengths - >> Pooling(mean_pool) - >> Softmax(nr_class, width) - ) - model.tok2vec = tok2vec - return model - - -def block_gradients(model): - from thinc.api import wrap - - def forward(X, drop=0.0): - Y, _ = model.begin_update(X, drop=drop) - return Y, None - - return wrap(forward, model) - - -def create_pipeline(width, embed_size, vectors_model): - print("Load vectors") - nlp = spacy.load(vectors_model) - print("Start training") - textcat = TextCategorizer( - nlp.vocab, - labels=["POSITIVE", "NEGATIVE"], - model=build_textcat_model( - Tok2Vec(width=width, embed_size=embed_size), 2, width - ), - ) - - nlp.add_pipe(textcat) - return nlp - - -def train_tensorizer(nlp, texts, dropout, n_iter): - tensorizer = nlp.create_pipe("tensorizer") - nlp.add_pipe(tensorizer) - optimizer = nlp.begin_training() - for i in range(n_iter): - losses = {} - for i, batch in enumerate(minibatch(tqdm.tqdm(texts))): - docs = [nlp.make_doc(text) for text in batch] - tensorizer.update(docs, None, losses=losses, sgd=optimizer, drop=dropout) - print(losses) - return optimizer - - -def train_textcat(nlp, n_texts, n_iter=10): - textcat = nlp.get_pipe("textcat") - tok2vec_weights = textcat.model.tok2vec.to_bytes() - (train_texts, train_cats), (dev_texts, dev_cats) = load_textcat_data(limit=n_texts) - print( - "Using {} examples ({} training, {} evaluation)".format( - n_texts, len(train_texts), len(dev_texts) - ) - ) - train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats])) - - # get names of other pipes to disable them during training - pipe_exceptions = ["textcat", "trf_wordpiecer", "trf_tok2vec"] - other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions] - with nlp.disable_pipes(*other_pipes): # only train textcat - optimizer = nlp.begin_training() - textcat.model.tok2vec.from_bytes(tok2vec_weights) - print("Training 
the model...") - print("{:^5}\t{:^5}\t{:^5}\t{:^5}".format("LOSS", "P", "R", "F")) - for i in range(n_iter): - losses = {"textcat": 0.0} - # batch up the examples using spaCy's minibatch - batches = minibatch(tqdm.tqdm(train_data), size=2) - for batch in batches: - texts, annotations = zip(*batch) - nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses) - with textcat.model.use_params(optimizer.averages): - # evaluate on the dev data split off in load_data() - scores = evaluate_textcat(nlp.tokenizer, textcat, dev_texts, dev_cats) - print( - "{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}".format( # print a simple table - losses["textcat"], - scores["textcat_p"], - scores["textcat_r"], - scores["textcat_f"], - ) - ) - - -def evaluate_textcat(tokenizer, textcat, texts, cats): - docs = (tokenizer(text) for text in texts) - tp = 1e-8 - fp = 1e-8 - tn = 1e-8 - fn = 1e-8 - for i, doc in enumerate(textcat.pipe(docs)): - gold = cats[i] - for label, score in doc.cats.items(): - if label not in gold: - continue - if score >= 0.5 and gold[label] >= 0.5: - tp += 1.0 - elif score >= 0.5 and gold[label] < 0.5: - fp += 1.0 - elif score < 0.5 and gold[label] < 0.5: - tn += 1 - elif score < 0.5 and gold[label] >= 0.5: - fn += 1 - precision = tp / (tp + fp) - recall = tp / (tp + fn) - f_score = 2 * (precision * recall) / (precision + recall) - return {"textcat_p": precision, "textcat_r": recall, "textcat_f": f_score} - - -@plac.annotations( - width=("Width of CNN layers", "positional", None, int), - embed_size=("Embedding rows", "positional", None, int), - pretrain_iters=("Number of iterations to pretrain", "option", "pn", int), - train_iters=("Number of iterations to train", "option", "tn", int), - train_examples=("Number of labelled examples", "option", "eg", int), - vectors_model=("Name or path to vectors model to learn from"), -) -def main( - width, - embed_size, - vectors_model, - pretrain_iters=30, - train_iters=30, - train_examples=1000, -): - random.seed(0) - numpy.random.seed(0) - use_gpu = prefer_gpu() - print("Using GPU?", use_gpu) - - nlp = create_pipeline(width, embed_size, vectors_model) - print("Load data") - texts = load_texts(limit=0) - print("Train tensorizer") - optimizer = train_tensorizer(nlp, texts, dropout=0.2, n_iter=pretrain_iters) - print("Train textcat") - train_textcat(nlp, train_examples, n_iter=train_iters) - - -if __name__ == "__main__": - plac.call(main) diff --git a/examples/training/rehearsal.py b/examples/training/rehearsal.py deleted file mode 100644 index 1cdac02aa..000000000 --- a/examples/training/rehearsal.py +++ /dev/null @@ -1,97 +0,0 @@ -"""Prevent catastrophic forgetting with rehearsal updates.""" -import plac -import random -import warnings -import srsly -import spacy -from spacy.gold import GoldParse -from spacy.util import minibatch, compounding - - -LABEL = "ANIMAL" -TRAIN_DATA = [ - ( - "Horses are too tall and they pretend to care about your feelings", - {"entities": [(0, 6, "ANIMAL")]}, - ), - ("Do they bite?", {"entities": []}), - ( - "horses are too tall and they pretend to care about your feelings", - {"entities": [(0, 6, "ANIMAL")]}, - ), - ("horses pretend to care about your feelings", {"entities": [(0, 6, "ANIMAL")]}), - ( - "they pretend to care about your feelings, those horses", - {"entities": [(48, 54, "ANIMAL")]}, - ), - ("horses?", {"entities": [(0, 6, "ANIMAL")]}), -] - - -def read_raw_data(nlp, jsonl_loc): - for json_obj in srsly.read_jsonl(jsonl_loc): - if json_obj["text"].strip(): - doc = nlp.make_doc(json_obj["text"]) - yield doc - - 
-def read_gold_data(nlp, gold_loc): - docs = [] - golds = [] - for json_obj in srsly.read_jsonl(gold_loc): - doc = nlp.make_doc(json_obj["text"]) - ents = [(ent["start"], ent["end"], ent["label"]) for ent in json_obj["spans"]] - gold = GoldParse(doc, entities=ents) - docs.append(doc) - golds.append(gold) - return list(zip(docs, golds)) - - -def main(model_name, unlabelled_loc): - n_iter = 10 - dropout = 0.2 - batch_size = 4 - nlp = spacy.load(model_name) - nlp.get_pipe("ner").add_label(LABEL) - raw_docs = list(read_raw_data(nlp, unlabelled_loc)) - optimizer = nlp.resume_training() - # Avoid use of Adam when resuming training. I don't understand this well - # yet, but I'm getting weird results from Adam. Try commenting out the - # nlp.update(), and using Adam -- you'll find the models drift apart. - # I guess Adam is losing precision, introducing gradient noise? - optimizer.alpha = 0.1 - optimizer.b1 = 0.0 - optimizer.b2 = 0.0 - - # get names of other pipes to disable them during training - pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"] - other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions] - sizes = compounding(1.0, 4.0, 1.001) - with nlp.disable_pipes(*other_pipes), warnings.catch_warnings(): - # show warnings for misaligned entity spans once - warnings.filterwarnings("once", category=UserWarning, module='spacy') - - for itn in range(n_iter): - random.shuffle(TRAIN_DATA) - random.shuffle(raw_docs) - losses = {} - r_losses = {} - # batch up the examples using spaCy's minibatch - raw_batches = minibatch(raw_docs, size=4) - for batch in minibatch(TRAIN_DATA, size=sizes): - docs, golds = zip(*batch) - nlp.update(docs, golds, sgd=optimizer, drop=dropout, losses=losses) - raw_batch = list(next(raw_batches)) - nlp.rehearse(raw_batch, sgd=optimizer, losses=r_losses) - print("Losses", losses) - print("R. Losses", r_losses) - print(nlp.get_pipe("ner").model.unseen_classes) - test_text = "Do you like horses?" - doc = nlp(test_text) - print("Entities in '%s'" % test_text) - for ent in doc.ents: - print(ent.label_, ent.text) - - -if __name__ == "__main__": - plac.call(main) diff --git a/examples/training/train_entity_linker.py b/examples/training/train_entity_linker.py deleted file mode 100644 index a68007504..000000000 --- a/examples/training/train_entity_linker.py +++ /dev/null @@ -1,177 +0,0 @@ -#!/usr/bin/env python -# coding: utf8 - -"""Example of training spaCy's entity linker, starting off with a predefined -knowledge base and corresponding vocab, and a blank English model. - -For more details, see the documentation: -* Training: https://spacy.io/usage/training -* Entity Linking: https://spacy.io/usage/linguistic-features#entity-linking - -Compatible with: spaCy v2.2.4 -Last tested with: v2.2.4 -""" -from __future__ import unicode_literals, print_function - -import plac -import random -from pathlib import Path - -from spacy.vocab import Vocab - -import spacy -from spacy.kb import KnowledgeBase -from spacy.pipeline import EntityRuler -from spacy.util import minibatch, compounding - - -def sample_train_data(): - train_data = [] - - # Q2146908 (Russ Cochran): American golfer - # Q7381115 (Russ Cochran): publisher - - text_1 = "Russ Cochran his reprints include EC Comics." - dict_1 = {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}} - train_data.append((text_1, {"links": dict_1})) - - text_2 = "Russ Cochran has been publishing comic art." 
- dict_2 = {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}} - train_data.append((text_2, {"links": dict_2})) - - text_3 = "Russ Cochran captured his first major title with his son as caddie." - dict_3 = {(0, 12): {"Q7381115": 0.0, "Q2146908": 1.0}} - train_data.append((text_3, {"links": dict_3})) - - text_4 = "Russ Cochran was a member of University of Kentucky's golf team." - dict_4 = {(0, 12): {"Q7381115": 0.0, "Q2146908": 1.0}} - train_data.append((text_4, {"links": dict_4})) - - return train_data - - -# training data -TRAIN_DATA = sample_train_data() - - -@plac.annotations( - kb_path=("Path to the knowledge base", "positional", None, Path), - vocab_path=("Path to the vocab for the kb", "positional", None, Path), - output_dir=("Optional output directory", "option", "o", Path), - n_iter=("Number of training iterations", "option", "n", int), -) -def main(kb_path, vocab_path, output_dir=None, n_iter=50): - """Create a blank model with the specified vocab, set up the pipeline and train the entity linker. - The `vocab` should be the one used during creation of the KB.""" - # create blank English model with correct vocab - nlp = spacy.blank("en") - nlp.vocab.from_disk(vocab_path) - nlp.vocab.vectors.name = "spacy_pretrained_vectors" - print("Created blank 'en' model with vocab from '%s'" % vocab_path) - - # Add a sentencizer component. Alternatively, add a dependency parser for higher accuracy. - nlp.add_pipe(nlp.create_pipe('sentencizer')) - - # Add a custom component to recognize "Russ Cochran" as an entity for the example training data. - # Note that in a realistic application, an actual NER algorithm should be used instead. - ruler = EntityRuler(nlp) - patterns = [{"label": "PERSON", "pattern": [{"LOWER": "russ"}, {"LOWER": "cochran"}]}] - ruler.add_patterns(patterns) - nlp.add_pipe(ruler) - - # Create the Entity Linker component and add it to the pipeline. - if "entity_linker" not in nlp.pipe_names: - # use only the predicted EL score and not the prior probability (for demo purposes) - cfg = {"incl_prior": False} - entity_linker = nlp.create_pipe("entity_linker", cfg) - kb = KnowledgeBase(vocab=nlp.vocab) - kb.load_bulk(kb_path) - print("Loaded Knowledge Base from '%s'" % kb_path) - entity_linker.set_kb(kb) - nlp.add_pipe(entity_linker, last=True) - - # Convert the texts to docs to make sure we have doc.ents set for the training examples. - # Also ensure that the annotated examples correspond to known identifiers in the knowlege base. - kb_ids = nlp.get_pipe("entity_linker").kb.get_entity_strings() - TRAIN_DOCS = [] - for text, annotation in TRAIN_DATA: - with nlp.disable_pipes("entity_linker"): - doc = nlp(text) - annotation_clean = annotation - for offset, kb_id_dict in annotation["links"].items(): - new_dict = {} - for kb_id, value in kb_id_dict.items(): - if kb_id in kb_ids: - new_dict[kb_id] = value - else: - print( - "Removed", kb_id, "from training because it is not in the KB." 
- ) - annotation_clean["links"][offset] = new_dict - TRAIN_DOCS.append((doc, annotation_clean)) - - # get names of other pipes to disable them during training - pipe_exceptions = ["entity_linker", "trf_wordpiecer", "trf_tok2vec"] - other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions] - with nlp.disable_pipes(*other_pipes): # only train entity linker - # reset and initialize the weights randomly - optimizer = nlp.begin_training() - for itn in range(n_iter): - random.shuffle(TRAIN_DOCS) - losses = {} - # batch up the examples using spaCy's minibatch - batches = minibatch(TRAIN_DOCS, size=compounding(4.0, 32.0, 1.001)) - for batch in batches: - texts, annotations = zip(*batch) - nlp.update( - texts, # batch of texts - annotations, # batch of annotations - drop=0.2, # dropout - make it harder to memorise data - losses=losses, - sgd=optimizer, - ) - print(itn, "Losses", losses) - - # test the trained model - _apply_model(nlp) - - # save model to output directory - if output_dir is not None: - output_dir = Path(output_dir) - if not output_dir.exists(): - output_dir.mkdir() - nlp.to_disk(output_dir) - print() - print("Saved model to", output_dir) - - # test the saved model - print("Loading from", output_dir) - nlp2 = spacy.load(output_dir) - _apply_model(nlp2) - - -def _apply_model(nlp): - for text, annotation in TRAIN_DATA: - # apply the entity linker which will now make predictions for the 'Russ Cochran' entities - doc = nlp(text) - print() - print("Entities", [(ent.text, ent.label_, ent.kb_id_) for ent in doc.ents]) - print("Tokens", [(t.text, t.ent_type_, t.ent_kb_id_) for t in doc]) - - -if __name__ == "__main__": - plac.call(main) - - # Expected output (can be shuffled): - - # Entities[('Russ Cochran', 'PERSON', 'Q7381115')] - # Tokens[('Russ', 'PERSON', 'Q7381115'), ('Cochran', 'PERSON', 'Q7381115'), ("his", '', ''), ('reprints', '', ''), ('include', '', ''), ('The', '', ''), ('Complete', '', ''), ('EC', '', ''), ('Library', '', ''), ('.', '', '')] - - # Entities[('Russ Cochran', 'PERSON', 'Q7381115')] - # Tokens[('Russ', 'PERSON', 'Q7381115'), ('Cochran', 'PERSON', 'Q7381115'), ('has', '', ''), ('been', '', ''), ('publishing', '', ''), ('comic', '', ''), ('art', '', ''), ('.', '', '')] - - # Entities[('Russ Cochran', 'PERSON', 'Q2146908')] - # Tokens[('Russ', 'PERSON', 'Q2146908'), ('Cochran', 'PERSON', 'Q2146908'), ('captured', '', ''), ('his', '', ''), ('first', '', ''), ('major', '', ''), ('title', '', ''), ('with', '', ''), ('his', '', ''), ('son', '', ''), ('as', '', ''), ('caddie', '', ''), ('.', '', '')] - - # Entities[('Russ Cochran', 'PERSON', 'Q2146908')] - # Tokens[('Russ', 'PERSON', 'Q2146908'), ('Cochran', 'PERSON', 'Q2146908'), ('was', '', ''), ('a', '', ''), ('member', '', ''), ('of', '', ''), ('University', '', ''), ('of', '', ''), ('Kentucky', '', ''), ("'s", '', ''), ('golf', '', ''), ('team', '', ''), ('.', '', '')] diff --git a/examples/training/train_intent_parser.py b/examples/training/train_intent_parser.py deleted file mode 100644 index a91102093..000000000 --- a/examples/training/train_intent_parser.py +++ /dev/null @@ -1,195 +0,0 @@ -#!/usr/bin/env python -# coding: utf-8 -"""Using the parser to recognise your own semantics - -spaCy's parser component can be trained to predict any type of tree -structure over your input text. You can also predict trees over whole documents -or chat logs, with connections between the sentence-roots used to annotate -discourse structure. 
In this example, we'll build a message parser for a common -"chat intent": finding local businesses. Our message semantics will have the -following types of relations: ROOT, PLACE, QUALITY, ATTRIBUTE, TIME, LOCATION. - -"show me the best hotel in berlin" -('show', 'ROOT', 'show') -('best', 'QUALITY', 'hotel') --> hotel with QUALITY best -('hotel', 'PLACE', 'show') --> show PLACE hotel -('berlin', 'LOCATION', 'hotel') --> hotel with LOCATION berlin - -Compatible with: spaCy v2.0.0+ -""" -from __future__ import unicode_literals, print_function - -import plac -import random -from pathlib import Path -import spacy -from spacy.util import minibatch, compounding - - -# training data: texts, heads and dependency labels -# for no relation, we simply chose an arbitrary dependency label, e.g. '-' -TRAIN_DATA = [ - ( - "find a cafe with great wifi", - { - "heads": [0, 2, 0, 5, 5, 2], # index of token head - "deps": ["ROOT", "-", "PLACE", "-", "QUALITY", "ATTRIBUTE"], - }, - ), - ( - "find a hotel near the beach", - { - "heads": [0, 2, 0, 5, 5, 2], - "deps": ["ROOT", "-", "PLACE", "QUALITY", "-", "ATTRIBUTE"], - }, - ), - ( - "find me the closest gym that's open late", - { - "heads": [0, 0, 4, 4, 0, 6, 4, 6, 6], - "deps": [ - "ROOT", - "-", - "-", - "QUALITY", - "PLACE", - "-", - "-", - "ATTRIBUTE", - "TIME", - ], - }, - ), - ( - "show me the cheapest store that sells flowers", - { - "heads": [0, 0, 4, 4, 0, 4, 4, 4], # attach "flowers" to store! - "deps": ["ROOT", "-", "-", "QUALITY", "PLACE", "-", "-", "PRODUCT"], - }, - ), - ( - "find a nice restaurant in london", - { - "heads": [0, 3, 3, 0, 3, 3], - "deps": ["ROOT", "-", "QUALITY", "PLACE", "-", "LOCATION"], - }, - ), - ( - "show me the coolest hostel in berlin", - { - "heads": [0, 0, 4, 4, 0, 4, 4], - "deps": ["ROOT", "-", "-", "QUALITY", "PLACE", "-", "LOCATION"], - }, - ), - ( - "find a good italian restaurant near work", - { - "heads": [0, 4, 4, 4, 0, 4, 5], - "deps": [ - "ROOT", - "-", - "QUALITY", - "ATTRIBUTE", - "PLACE", - "ATTRIBUTE", - "LOCATION", - ], - }, - ), -] - - -@plac.annotations( - model=("Model name. Defaults to blank 'en' model.", "option", "m", str), - output_dir=("Optional output directory", "option", "o", Path), - n_iter=("Number of training iterations", "option", "n", int), -) -def main(model=None, output_dir=None, n_iter=15): - """Load the model, set up the pipeline and train the parser.""" - if model is not None: - nlp = spacy.load(model) # load existing spaCy model - print("Loaded model '%s'" % model) - else: - nlp = spacy.blank("en") # create blank Language class - print("Created blank 'en' model") - - # We'll use the built-in dependency parser class, but we want to create a - # fresh instance – just in case. 
- if "parser" in nlp.pipe_names: - nlp.remove_pipe("parser") - parser = nlp.create_pipe("parser") - nlp.add_pipe(parser, first=True) - - for text, annotations in TRAIN_DATA: - for dep in annotations.get("deps", []): - parser.add_label(dep) - - pipe_exceptions = ["parser", "trf_wordpiecer", "trf_tok2vec"] - other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions] - with nlp.disable_pipes(*other_pipes): # only train parser - optimizer = nlp.begin_training() - for itn in range(n_iter): - random.shuffle(TRAIN_DATA) - losses = {} - # batch up the examples using spaCy's minibatch - batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001)) - for batch in batches: - texts, annotations = zip(*batch) - nlp.update(texts, annotations, sgd=optimizer, losses=losses) - print("Losses", losses) - - # test the trained model - test_model(nlp) - - # save model to output directory - if output_dir is not None: - output_dir = Path(output_dir) - if not output_dir.exists(): - output_dir.mkdir() - nlp.to_disk(output_dir) - print("Saved model to", output_dir) - - # test the saved model - print("Loading from", output_dir) - nlp2 = spacy.load(output_dir) - test_model(nlp2) - - -def test_model(nlp): - texts = [ - "find a hotel with good wifi", - "find me the cheapest gym near work", - "show me the best hotel in berlin", - ] - docs = nlp.pipe(texts) - for doc in docs: - print(doc.text) - print([(t.text, t.dep_, t.head.text) for t in doc if t.dep_ != "-"]) - - -if __name__ == "__main__": - plac.call(main) - - # Expected output: - # find a hotel with good wifi - # [ - # ('find', 'ROOT', 'find'), - # ('hotel', 'PLACE', 'find'), - # ('good', 'QUALITY', 'wifi'), - # ('wifi', 'ATTRIBUTE', 'hotel') - # ] - # find me the cheapest gym near work - # [ - # ('find', 'ROOT', 'find'), - # ('cheapest', 'QUALITY', 'gym'), - # ('gym', 'PLACE', 'find'), - # ('near', 'ATTRIBUTE', 'gym'), - # ('work', 'LOCATION', 'near') - # ] - # show me the best hotel in berlin - # [ - # ('show', 'ROOT', 'show'), - # ('best', 'QUALITY', 'hotel'), - # ('hotel', 'PLACE', 'show'), - # ('berlin', 'LOCATION', 'hotel') - # ] diff --git a/examples/training/train_ner.py b/examples/training/train_ner.py deleted file mode 100644 index f64ba801a..000000000 --- a/examples/training/train_ner.py +++ /dev/null @@ -1,117 +0,0 @@ -#!/usr/bin/env python -# coding: utf8 -"""Example of training spaCy's named entity recognizer, starting off with an -existing model or a blank model. - -For more details, see the documentation: -* Training: https://spacy.io/usage/training -* NER: https://spacy.io/usage/linguistic-features#named-entities - -Compatible with: spaCy v2.0.0+ -Last tested with: v2.2.4 -""" -from __future__ import unicode_literals, print_function - -import plac -import random -import warnings -from pathlib import Path -import spacy -from spacy.util import minibatch, compounding - - -# training data -TRAIN_DATA = [ - ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}), - ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}), -] - - -@plac.annotations( - model=("Model name. 
Defaults to blank 'en' model.", "option", "m", str), - output_dir=("Optional output directory", "option", "o", Path), - n_iter=("Number of training iterations", "option", "n", int), -) -def main(model=None, output_dir=None, n_iter=100): - """Load the model, set up the pipeline and train the entity recognizer.""" - if model is not None: - nlp = spacy.load(model) # load existing spaCy model - print("Loaded model '%s'" % model) - else: - nlp = spacy.blank("en") # create blank Language class - print("Created blank 'en' model") - - # create the built-in pipeline components and add them to the pipeline - # nlp.create_pipe works for built-ins that are registered with spaCy - if "ner" not in nlp.pipe_names: - ner = nlp.create_pipe("ner") - nlp.add_pipe(ner, last=True) - # otherwise, get it so we can add labels - else: - ner = nlp.get_pipe("ner") - - # add labels - for _, annotations in TRAIN_DATA: - for ent in annotations.get("entities"): - ner.add_label(ent[2]) - - # get names of other pipes to disable them during training - pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"] - other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions] - # only train NER - with nlp.disable_pipes(*other_pipes), warnings.catch_warnings(): - # show warnings for misaligned entity spans once - warnings.filterwarnings("once", category=UserWarning, module='spacy') - - # reset and initialize the weights randomly – but only if we're - # training a new model - if model is None: - nlp.begin_training() - for itn in range(n_iter): - random.shuffle(TRAIN_DATA) - losses = {} - # batch up the examples using spaCy's minibatch - batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001)) - for batch in batches: - texts, annotations = zip(*batch) - nlp.update( - texts, # batch of texts - annotations, # batch of annotations - drop=0.5, # dropout - make it harder to memorise data - losses=losses, - ) - print("Losses", losses) - - # test the trained model - for text, _ in TRAIN_DATA: - doc = nlp(text) - print("Entities", [(ent.text, ent.label_) for ent in doc.ents]) - print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc]) - - # save model to output directory - if output_dir is not None: - output_dir = Path(output_dir) - if not output_dir.exists(): - output_dir.mkdir() - nlp.to_disk(output_dir) - print("Saved model to", output_dir) - - # test the saved model - print("Loading from", output_dir) - nlp2 = spacy.load(output_dir) - for text, _ in TRAIN_DATA: - doc = nlp2(text) - print("Entities", [(ent.text, ent.label_) for ent in doc.ents]) - print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc]) - - -if __name__ == "__main__": - plac.call(main) - - # Expected output: - # Entities [('Shaka Khan', 'PERSON')] - # Tokens [('Who', '', 2), ('is', '', 2), ('Shaka', 'PERSON', 3), - # ('Khan', 'PERSON', 1), ('?', '', 2)] - # Entities [('London', 'LOC'), ('Berlin', 'LOC')] - # Tokens [('I', '', 2), ('like', '', 2), ('London', 'LOC', 3), - # ('and', '', 2), ('Berlin', 'LOC', 3), ('.', '', 2)] diff --git a/examples/training/train_new_entity_type.py b/examples/training/train_new_entity_type.py deleted file mode 100644 index a14688012..000000000 --- a/examples/training/train_new_entity_type.py +++ /dev/null @@ -1,144 +0,0 @@ -#!/usr/bin/env python -# coding: utf8 -"""Example of training an additional entity type - -This script shows how to add a new entity type to an existing pretrained NER -model. To keep the example short and simple, only four sentences are provided -as examples. 
In practice, you'll need many more — a few hundred would be a -good start. You will also likely need to mix in examples of other entity -types, which might be obtained by running the entity recognizer over unlabelled -sentences, and adding their annotations to the training set. - -The actual training is performed by looping over the examples, and calling -`nlp.entity.update()`. The `update()` method steps through the words of the -input. At each word, it makes a prediction. It then consults the annotations -provided on the GoldParse instance, to see whether it was right. If it was -wrong, it adjusts its weights so that the correct action will score higher -next time. - -After training your model, you can save it to a directory. We recommend -wrapping models as Python packages, for ease of deployment. - -For more details, see the documentation: -* Training: https://spacy.io/usage/training -* NER: https://spacy.io/usage/linguistic-features#named-entities - -Compatible with: spaCy v2.1.0+ -Last tested with: v2.2.4 -""" -from __future__ import unicode_literals, print_function - -import plac -import random -import warnings -from pathlib import Path -import spacy -from spacy.util import minibatch, compounding - - -# new entity label -LABEL = "ANIMAL" - -# training data -# Note: If you're using an existing model, make sure to mix in examples of -# other entity types that spaCy correctly recognized before. Otherwise, your -# model might learn the new type, but "forget" what it previously knew. -# https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting -TRAIN_DATA = [ - ( - "Horses are too tall and they pretend to care about your feelings", - {"entities": [(0, 6, LABEL)]}, - ), - ("Do they bite?", {"entities": []}), - ( - "horses are too tall and they pretend to care about your feelings", - {"entities": [(0, 6, LABEL)]}, - ), - ("horses pretend to care about your feelings", {"entities": [(0, 6, LABEL)]}), - ( - "they pretend to care about your feelings, those horses", - {"entities": [(48, 54, LABEL)]}, - ), - ("horses?", {"entities": [(0, 6, LABEL)]}), -] - - -@plac.annotations( - model=("Model name. 
Defaults to blank 'en' model.", "option", "m", str), - new_model_name=("New model name for model meta.", "option", "nm", str), - output_dir=("Optional output directory", "option", "o", Path), - n_iter=("Number of training iterations", "option", "n", int), -) -def main(model=None, new_model_name="animal", output_dir=None, n_iter=30): - """Set up the pipeline and entity recognizer, and train the new entity.""" - random.seed(0) - if model is not None: - nlp = spacy.load(model) # load existing spaCy model - print("Loaded model '%s'" % model) - else: - nlp = spacy.blank("en") # create blank Language class - print("Created blank 'en' model") - # Add entity recognizer to model if it's not in the pipeline - # nlp.create_pipe works for built-ins that are registered with spaCy - if "ner" not in nlp.pipe_names: - ner = nlp.create_pipe("ner") - nlp.add_pipe(ner) - # otherwise, get it, so we can add labels to it - else: - ner = nlp.get_pipe("ner") - - ner.add_label(LABEL) # add new entity label to entity recognizer - # Adding extraneous labels shouldn't mess anything up - ner.add_label("VEGETABLE") - if model is None: - optimizer = nlp.begin_training() - else: - optimizer = nlp.resume_training() - move_names = list(ner.move_names) - # get names of other pipes to disable them during training - pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"] - other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions] - # only train NER - with nlp.disable_pipes(*other_pipes), warnings.catch_warnings(): - # show warnings for misaligned entity spans once - warnings.filterwarnings("once", category=UserWarning, module='spacy') - - sizes = compounding(1.0, 4.0, 1.001) - # batch up the examples using spaCy's minibatch - for itn in range(n_iter): - random.shuffle(TRAIN_DATA) - batches = minibatch(TRAIN_DATA, size=sizes) - losses = {} - for batch in batches: - texts, annotations = zip(*batch) - nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses) - print("Losses", losses) - - # test the trained model - test_text = "Do you like horses?" - doc = nlp(test_text) - print("Entities in '%s'" % test_text) - for ent in doc.ents: - print(ent.label_, ent.text) - - # save model to output directory - if output_dir is not None: - output_dir = Path(output_dir) - if not output_dir.exists(): - output_dir.mkdir() - nlp.meta["name"] = new_model_name # rename model - nlp.to_disk(output_dir) - print("Saved model to", output_dir) - - # test the saved model - print("Loading from", output_dir) - nlp2 = spacy.load(output_dir) - # Check the classes have loaded back consistently - assert nlp2.get_pipe("ner").move_names == move_names - doc2 = nlp2(test_text) - for ent in doc2.ents: - print(ent.label_, ent.text) - - -if __name__ == "__main__": - plac.call(main) diff --git a/examples/training/train_parser.py b/examples/training/train_parser.py deleted file mode 100644 index c5adb0dec..000000000 --- a/examples/training/train_parser.py +++ /dev/null @@ -1,111 +0,0 @@ -#!/usr/bin/env python -# coding: utf8 -"""Example of training spaCy dependency parser, starting off with an existing -model or a blank model. 
For more details, see the documentation: -* Training: https://spacy.io/usage/training -* Dependency Parse: https://spacy.io/usage/linguistic-features#dependency-parse - -Compatible with: spaCy v2.0.0+ -Last tested with: v2.1.0 -""" -from __future__ import unicode_literals, print_function - -import plac -import random -from pathlib import Path -import spacy -from spacy.util import minibatch, compounding - - -# training data -TRAIN_DATA = [ - ( - "They trade mortgage-backed securities.", - { - "heads": [1, 1, 4, 4, 5, 1, 1], - "deps": ["nsubj", "ROOT", "compound", "punct", "nmod", "dobj", "punct"], - }, - ), - ( - "I like London and Berlin.", - { - "heads": [1, 1, 1, 2, 2, 1], - "deps": ["nsubj", "ROOT", "dobj", "cc", "conj", "punct"], - }, - ), -] - - -@plac.annotations( - model=("Model name. Defaults to blank 'en' model.", "option", "m", str), - output_dir=("Optional output directory", "option", "o", Path), - n_iter=("Number of training iterations", "option", "n", int), -) -def main(model=None, output_dir=None, n_iter=15): - """Load the model, set up the pipeline and train the parser.""" - if model is not None: - nlp = spacy.load(model) # load existing spaCy model - print("Loaded model '%s'" % model) - else: - nlp = spacy.blank("en") # create blank Language class - print("Created blank 'en' model") - - # add the parser to the pipeline if it doesn't exist - # nlp.create_pipe works for built-ins that are registered with spaCy - if "parser" not in nlp.pipe_names: - parser = nlp.create_pipe("parser") - nlp.add_pipe(parser, first=True) - # otherwise, get it, so we can add labels to it - else: - parser = nlp.get_pipe("parser") - - # add labels to the parser - for _, annotations in TRAIN_DATA: - for dep in annotations.get("deps", []): - parser.add_label(dep) - - # get names of other pipes to disable them during training - pipe_exceptions = ["parser", "trf_wordpiecer", "trf_tok2vec"] - other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions] - with nlp.disable_pipes(*other_pipes): # only train parser - optimizer = nlp.begin_training() - for itn in range(n_iter): - random.shuffle(TRAIN_DATA) - losses = {} - # batch up the examples using spaCy's minibatch - batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001)) - for batch in batches: - texts, annotations = zip(*batch) - nlp.update(texts, annotations, sgd=optimizer, losses=losses) - print("Losses", losses) - - # test the trained model - test_text = "I like securities." - doc = nlp(test_text) - print("Dependencies", [(t.text, t.dep_, t.head.text) for t in doc]) - - # save model to output directory - if output_dir is not None: - output_dir = Path(output_dir) - if not output_dir.exists(): - output_dir.mkdir() - nlp.to_disk(output_dir) - print("Saved model to", output_dir) - - # test the saved model - print("Loading from", output_dir) - nlp2 = spacy.load(output_dir) - doc = nlp2(test_text) - print("Dependencies", [(t.text, t.dep_, t.head.text) for t in doc]) - - -if __name__ == "__main__": - plac.call(main) - - # expected result: - # [ - # ('I', 'nsubj', 'like'), - # ('like', 'ROOT', 'like'), - # ('securities', 'dobj', 'like'), - # ('.', 'punct', 'like') - # ] diff --git a/examples/training/train_tagger.py b/examples/training/train_tagger.py deleted file mode 100644 index 7136273b3..000000000 --- a/examples/training/train_tagger.py +++ /dev/null @@ -1,101 +0,0 @@ -#!/usr/bin/env python -# coding: utf8 -""" -A simple example for training a part-of-speech tagger with a custom tag map. 
-To allow us to update the tag map with our custom one, this example starts off -with a blank Language class and modifies its defaults. For more details, see -the documentation: -* Training: https://spacy.io/usage/training -* POS Tagging: https://spacy.io/usage/linguistic-features#pos-tagging - -Compatible with: spaCy v2.0.0+ -Last tested with: v2.1.0 -""" -from __future__ import unicode_literals, print_function - -import plac -import random -from pathlib import Path -import spacy -from spacy.util import minibatch, compounding - - -# You need to define a mapping from your data's part-of-speech tag names to the -# Universal Part-of-Speech tag set, as spaCy includes an enum of these tags. -# See here for the Universal Tag Set: -# http://universaldependencies.github.io/docs/u/pos/index.html -# You may also specify morphological features for your tags, from the universal -# scheme. -TAG_MAP = {"N": {"pos": "NOUN"}, "V": {"pos": "VERB"}, "J": {"pos": "ADJ"}} - -# Usually you'll read this in, of course. Data formats vary. Ensure your -# strings are unicode and that the number of tags assigned matches spaCy's -# tokenization. If not, you can always add a 'words' key to the annotations -# that specifies the gold-standard tokenization, e.g.: -# ("Eatblueham", {'words': ['Eat', 'blue', 'ham'], 'tags': ['V', 'J', 'N']}) -TRAIN_DATA = [ - ("I like green eggs", {"tags": ["N", "V", "J", "N"]}), - ("Eat blue ham", {"tags": ["V", "J", "N"]}), -] - - -@plac.annotations( - lang=("ISO Code of language to use", "option", "l", str), - output_dir=("Optional output directory", "option", "o", Path), - n_iter=("Number of training iterations", "option", "n", int), -) -def main(lang="en", output_dir=None, n_iter=25): - """Create a new model, set up the pipeline and train the tagger. In order to - train the tagger with a custom tag map, we're creating a new Language - instance with a custom vocab. - """ - nlp = spacy.blank(lang) - # add the tagger to the pipeline - # nlp.create_pipe works for built-ins that are registered with spaCy - tagger = nlp.create_pipe("tagger") - # Add the tags. This needs to be done before you start training. 
- for tag, values in TAG_MAP.items(): - tagger.add_label(tag, values) - nlp.add_pipe(tagger) - - optimizer = nlp.begin_training() - for i in range(n_iter): - random.shuffle(TRAIN_DATA) - losses = {} - # batch up the examples using spaCy's minibatch - batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001)) - for batch in batches: - texts, annotations = zip(*batch) - nlp.update(texts, annotations, sgd=optimizer, losses=losses) - print("Losses", losses) - - # test the trained model - test_text = "I like blue eggs" - doc = nlp(test_text) - print("Tags", [(t.text, t.tag_, t.pos_) for t in doc]) - - # save model to output directory - if output_dir is not None: - output_dir = Path(output_dir) - if not output_dir.exists(): - output_dir.mkdir() - nlp.to_disk(output_dir) - print("Saved model to", output_dir) - - # test the save model - print("Loading from", output_dir) - nlp2 = spacy.load(output_dir) - doc = nlp2(test_text) - print("Tags", [(t.text, t.tag_, t.pos_) for t in doc]) - - -if __name__ == "__main__": - plac.call(main) - - # Expected output: - # [ - # ('I', 'N', 'NOUN'), - # ('like', 'V', 'VERB'), - # ('blue', 'J', 'ADJ'), - # ('eggs', 'N', 'NOUN') - # ] diff --git a/examples/training/train_textcat.py b/examples/training/train_textcat.py deleted file mode 100644 index 456ef098c..000000000 --- a/examples/training/train_textcat.py +++ /dev/null @@ -1,160 +0,0 @@ -#!/usr/bin/env python -# coding: utf8 -"""Train a convolutional neural network text classifier on the -IMDB dataset, using the TextCategorizer component. The dataset will be loaded -automatically via Thinc's built-in dataset loader. The model is added to -spacy.pipeline, and predictions are available via `doc.cats`. For more details, -see the documentation: -* Training: https://spacy.io/usage/training - -Compatible with: spaCy v2.0.0+ -""" -from __future__ import unicode_literals, print_function -import plac -import random -from pathlib import Path -import thinc.extra.datasets - -import spacy -from spacy.util import minibatch, compounding - - -@plac.annotations( - model=("Model name. 
Defaults to blank 'en' model.", "option", "m", str), - output_dir=("Optional output directory", "option", "o", Path), - n_texts=("Number of texts to train from", "option", "t", int), - n_iter=("Number of training iterations", "option", "n", int), - init_tok2vec=("Pretrained tok2vec weights", "option", "t2v", Path), -) -def main(model=None, output_dir=None, n_iter=20, n_texts=2000, init_tok2vec=None): - if output_dir is not None: - output_dir = Path(output_dir) - if not output_dir.exists(): - output_dir.mkdir() - - if model is not None: - nlp = spacy.load(model) # load existing spaCy model - print("Loaded model '%s'" % model) - else: - nlp = spacy.blank("en") # create blank Language class - print("Created blank 'en' model") - - # add the text classifier to the pipeline if it doesn't exist - # nlp.create_pipe works for built-ins that are registered with spaCy - if "textcat" not in nlp.pipe_names: - textcat = nlp.create_pipe( - "textcat", config={"exclusive_classes": True, "architecture": "simple_cnn"} - ) - nlp.add_pipe(textcat, last=True) - # otherwise, get it, so we can add labels to it - else: - textcat = nlp.get_pipe("textcat") - - # add label to text classifier - textcat.add_label("POSITIVE") - textcat.add_label("NEGATIVE") - - # load the IMDB dataset - print("Loading IMDB data...") - (train_texts, train_cats), (dev_texts, dev_cats) = load_data() - train_texts = train_texts[:n_texts] - train_cats = train_cats[:n_texts] - print( - "Using {} examples ({} training, {} evaluation)".format( - n_texts, len(train_texts), len(dev_texts) - ) - ) - train_data = list(zip(train_texts, [{"cats": cats} for cats in train_cats])) - - # get names of other pipes to disable them during training - pipe_exceptions = ["textcat", "trf_wordpiecer", "trf_tok2vec"] - other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions] - with nlp.disable_pipes(*other_pipes): # only train textcat - optimizer = nlp.begin_training() - if init_tok2vec is not None: - with init_tok2vec.open("rb") as file_: - textcat.model.tok2vec.from_bytes(file_.read()) - print("Training the model...") - print("{:^5}\t{:^5}\t{:^5}\t{:^5}".format("LOSS", "P", "R", "F")) - batch_sizes = compounding(4.0, 32.0, 1.001) - for i in range(n_iter): - losses = {} - # batch up the examples using spaCy's minibatch - random.shuffle(train_data) - batches = minibatch(train_data, size=batch_sizes) - for batch in batches: - texts, annotations = zip(*batch) - nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses) - with textcat.model.use_params(optimizer.averages): - # evaluate on the dev data split off in load_data() - scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats) - print( - "{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}".format( # print a simple table - losses["textcat"], - scores["textcat_p"], - scores["textcat_r"], - scores["textcat_f"], - ) - ) - - # test the trained model - test_text = "This movie sucked" - doc = nlp(test_text) - print(test_text, doc.cats) - - if output_dir is not None: - with nlp.use_params(optimizer.averages): - nlp.to_disk(output_dir) - print("Saved model to", output_dir) - - # test the saved model - print("Loading from", output_dir) - nlp2 = spacy.load(output_dir) - doc2 = nlp2(test_text) - print(test_text, doc2.cats) - - -def load_data(limit=0, split=0.8): - """Load data from the IMDB dataset.""" - # Partition off part of the train data for evaluation - train_data, _ = thinc.extra.datasets.imdb() - random.shuffle(train_data) - train_data = train_data[-limit:] - texts, labels = zip(*train_data) 
- cats = [{"POSITIVE": bool(y), "NEGATIVE": not bool(y)} for y in labels] - split = int(len(train_data) * split) - return (texts[:split], cats[:split]), (texts[split:], cats[split:]) - - -def evaluate(tokenizer, textcat, texts, cats): - docs = (tokenizer(text) for text in texts) - tp = 0.0 # True positives - fp = 1e-8 # False positives - fn = 1e-8 # False negatives - tn = 0.0 # True negatives - for i, doc in enumerate(textcat.pipe(docs)): - gold = cats[i] - for label, score in doc.cats.items(): - if label not in gold: - continue - if label == "NEGATIVE": - continue - if score >= 0.5 and gold[label] >= 0.5: - tp += 1.0 - elif score >= 0.5 and gold[label] < 0.5: - fp += 1.0 - elif score < 0.5 and gold[label] < 0.5: - tn += 1 - elif score < 0.5 and gold[label] >= 0.5: - fn += 1 - precision = tp / (tp + fp) - recall = tp / (tp + fn) - if (precision + recall) == 0: - f_score = 0.0 - else: - f_score = 2 * (precision * recall) / (precision + recall) - return {"textcat_p": precision, "textcat_r": recall, "textcat_f": f_score} - - -if __name__ == "__main__": - plac.call(main) diff --git a/examples/vectors_fast_text.py b/examples/vectors_fast_text.py deleted file mode 100644 index 9b34811f7..000000000 --- a/examples/vectors_fast_text.py +++ /dev/null @@ -1,49 +0,0 @@ -#!/usr/bin/env python -# coding: utf8 -"""Load vectors for a language trained using fastText -https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md -Compatible with: spaCy v2.0.0+ -""" -from __future__ import unicode_literals -import plac -import numpy - -import spacy -from spacy.language import Language - - -@plac.annotations( - vectors_loc=("Path to .vec file", "positional", None, str), - lang=( - "Optional language ID. If not set, blank Language() will be used.", - "positional", - None, - str, - ), -) -def main(vectors_loc, lang=None): - if lang is None: - nlp = Language() - else: - # create empty language class – this is required if you're planning to - # save the model to disk and load it back later (models always need a - # "lang" setting). Use 'xx' for blank multi-language class. - nlp = spacy.blank(lang) - with open(vectors_loc, "rb") as file_: - header = file_.readline() - nr_row, nr_dim = header.split() - nlp.vocab.reset_vectors(width=int(nr_dim)) - for line in file_: - line = line.rstrip().decode("utf8") - pieces = line.rsplit(" ", int(nr_dim)) - word = pieces[0] - vector = numpy.asarray([float(v) for v in pieces[1:]], dtype="f") - nlp.vocab.set_vector(word, vector) # add the vectors to the vocab - # test the vectors and similarity - text = "class colspan" - doc = nlp(text) - print(text, doc[0].similarity(doc[1])) - - -if __name__ == "__main__": - plac.call(main) diff --git a/examples/vectors_tensorboard.py b/examples/vectors_tensorboard.py deleted file mode 100644 index 72eda1edc..000000000 --- a/examples/vectors_tensorboard.py +++ /dev/null @@ -1,105 +0,0 @@ -#!/usr/bin/env python -# coding: utf8 -"""Visualize spaCy word vectors in Tensorboard. 
- -Adapted from: https://gist.github.com/BrikerMan/7bd4e4bd0a00ac9076986148afc06507 -""" -from __future__ import unicode_literals - -from os import path - -import tqdm -import math -import numpy -import plac -import spacy -import tensorflow as tf -from tensorflow.contrib.tensorboard.plugins.projector import ( - visualize_embeddings, - ProjectorConfig, -) - - -@plac.annotations( - vectors_loc=("Path to spaCy model that contains vectors", "positional", None, str), - out_loc=( - "Path to output folder for tensorboard session data", - "positional", - None, - str, - ), - name=( - "Human readable name for tsv file and vectors tensor", - "positional", - None, - str, - ), -) -def main(vectors_loc, out_loc, name="spaCy_vectors"): - meta_file = "{}.tsv".format(name) - out_meta_file = path.join(out_loc, meta_file) - - print("Loading spaCy vectors model: {}".format(vectors_loc)) - model = spacy.load(vectors_loc) - print("Finding lexemes with vectors attached: {}".format(vectors_loc)) - strings_stream = tqdm.tqdm( - model.vocab.strings, total=len(model.vocab.strings), leave=False - ) - queries = [w for w in strings_stream if model.vocab.has_vector(w)] - vector_count = len(queries) - - print( - "Building Tensorboard Projector metadata for ({}) vectors: {}".format( - vector_count, out_meta_file - ) - ) - - # Store vector data in a tensorflow variable - tf_vectors_variable = numpy.zeros((vector_count, model.vocab.vectors.shape[1])) - - # Write a tab-separated file that contains information about the vectors for visualization - # - # Reference: https://www.tensorflow.org/programmers_guide/embedding#metadata - with open(out_meta_file, "wb") as file_metadata: - # Define columns in the first row - file_metadata.write("Text\tFrequency\n".encode("utf-8")) - # Write out a row for each vector that we add to the tensorflow variable we created - vec_index = 0 - for text in tqdm.tqdm(queries, total=len(queries), leave=False): - # https://github.com/tensorflow/tensorflow/issues/9094 - text = "" if text.lstrip() == "" else text - lex = model.vocab[text] - - # Store vector data and metadata - tf_vectors_variable[vec_index] = model.vocab.get_vector(text) - file_metadata.write( - "{}\t{}\n".format(text, math.exp(lex.prob) * vector_count).encode( - "utf-8" - ) - ) - vec_index += 1 - - print("Running Tensorflow Session...") - sess = tf.InteractiveSession() - tf.Variable(tf_vectors_variable, trainable=False, name=name) - tf.global_variables_initializer().run() - saver = tf.train.Saver() - writer = tf.summary.FileWriter(out_loc, sess.graph) - - # Link the embeddings into the config - config = ProjectorConfig() - embed = config.embeddings.add() - embed.tensor_name = name - embed.metadata_path = meta_file - - # Tell the projector about the configured embeddings and metadata file - visualize_embeddings(writer, config) - - # Save session and print run command to the output - print("Saving Tensorboard Session...") - saver.save(sess, path.join(out_loc, "{}.ckpt".format(name))) - print("Done. 
Run `tensorboard --logdir={0}` to view in Tensorboard".format(out_loc)) - - -if __name__ == "__main__": - plac.call(main) diff --git a/examples/training/ner_example_data/README.md b/extra/example_data/ner_example_data/README.md similarity index 100% rename from examples/training/ner_example_data/README.md rename to extra/example_data/ner_example_data/README.md diff --git a/examples/training/ner_example_data/ner-sent-per-line.iob b/extra/example_data/ner_example_data/ner-sent-per-line.iob similarity index 100% rename from examples/training/ner_example_data/ner-sent-per-line.iob rename to extra/example_data/ner_example_data/ner-sent-per-line.iob diff --git a/examples/training/ner_example_data/ner-sent-per-line.json b/extra/example_data/ner_example_data/ner-sent-per-line.json similarity index 100% rename from examples/training/ner_example_data/ner-sent-per-line.json rename to extra/example_data/ner_example_data/ner-sent-per-line.json diff --git a/examples/training/ner_example_data/ner-token-per-line-conll2003.iob b/extra/example_data/ner_example_data/ner-token-per-line-conll2003.iob similarity index 100% rename from examples/training/ner_example_data/ner-token-per-line-conll2003.iob rename to extra/example_data/ner_example_data/ner-token-per-line-conll2003.iob diff --git a/examples/training/ner_example_data/ner-token-per-line-conll2003.json b/extra/example_data/ner_example_data/ner-token-per-line-conll2003.json similarity index 100% rename from examples/training/ner_example_data/ner-token-per-line-conll2003.json rename to extra/example_data/ner_example_data/ner-token-per-line-conll2003.json diff --git a/examples/training/ner_example_data/ner-token-per-line-with-pos.iob b/extra/example_data/ner_example_data/ner-token-per-line-with-pos.iob similarity index 100% rename from examples/training/ner_example_data/ner-token-per-line-with-pos.iob rename to extra/example_data/ner_example_data/ner-token-per-line-with-pos.iob diff --git a/examples/training/ner_example_data/ner-token-per-line-with-pos.json b/extra/example_data/ner_example_data/ner-token-per-line-with-pos.json similarity index 100% rename from examples/training/ner_example_data/ner-token-per-line-with-pos.json rename to extra/example_data/ner_example_data/ner-token-per-line-with-pos.json diff --git a/examples/training/ner_example_data/ner-token-per-line.iob b/extra/example_data/ner_example_data/ner-token-per-line.iob similarity index 100% rename from examples/training/ner_example_data/ner-token-per-line.iob rename to extra/example_data/ner_example_data/ner-token-per-line.iob diff --git a/examples/training/ner_example_data/ner-token-per-line.json b/extra/example_data/ner_example_data/ner-token-per-line.json similarity index 100% rename from examples/training/ner_example_data/ner-token-per-line.json rename to extra/example_data/ner_example_data/ner-token-per-line.json diff --git a/examples/training/textcat_example_data/CC0.txt b/extra/example_data/textcat_example_data/CC0.txt similarity index 100% rename from examples/training/textcat_example_data/CC0.txt rename to extra/example_data/textcat_example_data/CC0.txt diff --git a/examples/training/textcat_example_data/CC_BY-SA-3.0.txt b/extra/example_data/textcat_example_data/CC_BY-SA-3.0.txt similarity index 100% rename from examples/training/textcat_example_data/CC_BY-SA-3.0.txt rename to extra/example_data/textcat_example_data/CC_BY-SA-3.0.txt diff --git a/examples/training/textcat_example_data/CC_BY-SA-4.0.txt b/extra/example_data/textcat_example_data/CC_BY-SA-4.0.txt similarity index 100% 
rename from examples/training/textcat_example_data/CC_BY-SA-4.0.txt rename to extra/example_data/textcat_example_data/CC_BY-SA-4.0.txt diff --git a/examples/training/textcat_example_data/README.md b/extra/example_data/textcat_example_data/README.md similarity index 100% rename from examples/training/textcat_example_data/README.md rename to extra/example_data/textcat_example_data/README.md diff --git a/examples/training/textcat_example_data/cooking.json b/extra/example_data/textcat_example_data/cooking.json similarity index 100% rename from examples/training/textcat_example_data/cooking.json rename to extra/example_data/textcat_example_data/cooking.json diff --git a/examples/training/textcat_example_data/cooking.jsonl b/extra/example_data/textcat_example_data/cooking.jsonl similarity index 100% rename from examples/training/textcat_example_data/cooking.jsonl rename to extra/example_data/textcat_example_data/cooking.jsonl diff --git a/examples/training/textcat_example_data/jigsaw-toxic-comment.json b/extra/example_data/textcat_example_data/jigsaw-toxic-comment.json similarity index 100% rename from examples/training/textcat_example_data/jigsaw-toxic-comment.json rename to extra/example_data/textcat_example_data/jigsaw-toxic-comment.json diff --git a/examples/training/textcat_example_data/jigsaw-toxic-comment.jsonl b/extra/example_data/textcat_example_data/jigsaw-toxic-comment.jsonl similarity index 100% rename from examples/training/textcat_example_data/jigsaw-toxic-comment.jsonl rename to extra/example_data/textcat_example_data/jigsaw-toxic-comment.jsonl diff --git a/examples/training/textcat_example_data/textcatjsonl_to_trainjson.py b/extra/example_data/textcat_example_data/textcatjsonl_to_trainjson.py similarity index 90% rename from examples/training/textcat_example_data/textcatjsonl_to_trainjson.py rename to extra/example_data/textcat_example_data/textcatjsonl_to_trainjson.py index 339ce39be..41b6a70da 100644 --- a/examples/training/textcat_example_data/textcatjsonl_to_trainjson.py +++ b/extra/example_data/textcat_example_data/textcatjsonl_to_trainjson.py @@ -1,20 +1,21 @@ from pathlib import Path import plac import spacy -from spacy.gold import docs_to_json +from spacy.training import docs_to_json import srsly import sys + @plac.annotations( model=("Model name. 
Defaults to 'en'.", "option", "m", str), input_file=("Input file (jsonl)", "positional", None, Path), output_dir=("Output directory", "positional", None, Path), n_texts=("Number of texts to convert", "option", "t", int), ) -def convert(model='en', input_file=None, output_dir=None, n_texts=0): +def convert(model="en", input_file=None, output_dir=None, n_texts=0): # Load model with tokenizer + sentencizer only nlp = spacy.load(model) - nlp.disable_pipes(*nlp.pipe_names) + nlp.select_pipes(disable=nlp.pipe_names) sentencizer = nlp.create_pipe("sentencizer") nlp.add_pipe(sentencizer, first=True) @@ -49,5 +50,6 @@ def convert(model='en', input_file=None, output_dir=None, n_texts=0): srsly.write_json(output_dir / input_file.with_suffix(".json"), [docs_to_json(docs)]) + if __name__ == "__main__": plac.call(convert) diff --git a/examples/training/training-data.json b/extra/example_data/training-data.json similarity index 100% rename from examples/training/training-data.json rename to extra/example_data/training-data.json diff --git a/examples/training/vocab-data.jsonl b/extra/example_data/vocab-data.jsonl similarity index 100% rename from examples/training/vocab-data.jsonl rename to extra/example_data/vocab-data.jsonl diff --git a/fabfile.py b/fabfile.py deleted file mode 100644 index fcab493f5..000000000 --- a/fabfile.py +++ /dev/null @@ -1,154 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals, print_function - -import contextlib -from pathlib import Path -from fabric.api import local, lcd, env, settings, prefix -from os import path, environ -import shutil -import sys - - -PWD = path.dirname(__file__) -ENV = environ["VENV_DIR"] if "VENV_DIR" in environ else ".env" -VENV_DIR = Path(PWD) / ENV - - -@contextlib.contextmanager -def virtualenv(name, create=False, python="/usr/bin/python3.6"): - python = Path(python).resolve() - env_path = VENV_DIR - if create: - if env_path.exists(): - shutil.rmtree(str(env_path)) - local("{python} -m venv {env_path}".format(python=python, env_path=VENV_DIR)) - - def wrapped_local(cmd, env_vars=[], capture=False, direct=False): - return local( - "source {}/bin/activate && {}".format(env_path, cmd), - shell="/bin/bash", - capture=False, - ) - - yield wrapped_local - - -def env(lang="python3.6"): - if VENV_DIR.exists(): - local("rm -rf {env}".format(env=VENV_DIR)) - if lang.startswith("python3"): - local("{lang} -m venv {env}".format(lang=lang, env=VENV_DIR)) - else: - local("{lang} -m pip install virtualenv --no-cache-dir".format(lang=lang)) - local( - "{lang} -m virtualenv {env} --no-cache-dir".format(lang=lang, env=VENV_DIR) - ) - with virtualenv(VENV_DIR) as venv_local: - print(venv_local("python --version", capture=True)) - venv_local("pip install --upgrade setuptools --no-cache-dir") - venv_local("pip install pytest --no-cache-dir") - venv_local("pip install wheel --no-cache-dir") - venv_local("pip install -r requirements.txt --no-cache-dir") - venv_local("pip install pex --no-cache-dir") - - -def install(): - with virtualenv(VENV_DIR) as venv_local: - venv_local("pip install dist/*.tar.gz") - - -def make(): - with lcd(path.dirname(__file__)): - local( - "export PYTHONPATH=`pwd` && source .env/bin/activate && python setup.py build_ext --inplace", - shell="/bin/bash", - ) - - -def sdist(): - with virtualenv(VENV_DIR) as venv_local: - with lcd(path.dirname(__file__)): - venv_local("python -m pip install -U setuptools srsly") - venv_local("python setup.py sdist") - - -def wheel(): - with virtualenv(VENV_DIR) as venv_local: - with 
lcd(path.dirname(__file__)): - venv_local("python setup.py bdist_wheel") - - -def pex(): - with virtualenv(VENV_DIR) as venv_local: - with lcd(path.dirname(__file__)): - sha = local("git rev-parse --short HEAD", capture=True) - venv_local( - "pex dist/*.whl -e spacy -o dist/spacy-%s.pex" % sha, direct=True - ) - - -def clean(): - with lcd(path.dirname(__file__)): - local("rm -f dist/*.whl") - local("rm -f dist/*.pex") - with virtualenv(VENV_DIR) as venv_local: - venv_local("python setup.py clean --all") - - -def test(): - with virtualenv(VENV_DIR) as venv_local: - with lcd(path.dirname(__file__)): - venv_local("pytest -x spacy/tests") - - -def train(): - args = environ.get("SPACY_TRAIN_ARGS", "") - with virtualenv(VENV_DIR) as venv_local: - venv_local("spacy train {args}".format(args=args)) - - -def conll17(treebank_dir, experiment_dir, vectors_dir, config, corpus=""): - is_not_clean = local("git status --porcelain", capture=True) - if is_not_clean: - print("Repository is not clean") - print(is_not_clean) - sys.exit(1) - git_sha = local("git rev-parse --short HEAD", capture=True) - config_checksum = local("sha256sum {config}".format(config=config), capture=True) - experiment_dir = Path(experiment_dir) / "{}--{}".format( - config_checksum[:6], git_sha - ) - if not experiment_dir.exists(): - experiment_dir.mkdir() - test_data_dir = Path(treebank_dir) / "ud-test-v2.0-conll2017" - assert test_data_dir.exists() - assert test_data_dir.is_dir() - if corpus: - corpora = [corpus] - else: - corpora = ["UD_English", "UD_Chinese", "UD_Japanese", "UD_Vietnamese"] - - local( - "cp {config} {experiment_dir}/config.json".format( - config=config, experiment_dir=experiment_dir - ) - ) - with virtualenv(VENV_DIR) as venv_local: - for corpus in corpora: - venv_local( - "spacy ud-train {treebank_dir} {experiment_dir} {config} {corpus} -v {vectors_dir}".format( - treebank_dir=treebank_dir, - experiment_dir=experiment_dir, - config=config, - corpus=corpus, - vectors_dir=vectors_dir, - ) - ) - venv_local( - "spacy ud-run-test {test_data_dir} {experiment_dir} {corpus}".format( - test_data_dir=test_data_dir, - experiment_dir=experiment_dir, - config=config, - corpus=corpus, - ) - ) diff --git a/include/msvc9/stdint.h b/include/msvc9/stdint.h deleted file mode 100644 index 4fe0ef9a9..000000000 --- a/include/msvc9/stdint.h +++ /dev/null @@ -1,259 +0,0 @@ -// ISO C9x compliant stdint.h for Microsoft Visual Studio -// Based on ISO/IEC 9899:TC2 Committee draft (May 6, 2005) WG14/N1124 -// -// Copyright (c) 2006-2013 Alexander Chemeris -// -// Redistribution and use in source and binary forms, with or without -// modification, are permitted provided that the following conditions are met: -// -// 1. Redistributions of source code must retain the above copyright notice, -// this list of conditions and the following disclaimer. -// -// 2. Redistributions in binary form must reproduce the above copyright -// notice, this list of conditions and the following disclaimer in the -// documentation and/or other materials provided with the distribution. -// -// 3. Neither the name of the product nor the names of its contributors may -// be used to endorse or promote products derived from this software -// without specific prior written permission. -// -// THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED -// WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF -// MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO -// EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, -// SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, -// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; -// OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, -// WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR -// OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF -// ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -// -/////////////////////////////////////////////////////////////////////////////// - -#ifndef _MSC_VER // [ -#error "Use this header only with Microsoft Visual C++ compilers!" -#endif // _MSC_VER ] - -#ifndef _MSC_STDINT_H_ // [ -#define _MSC_STDINT_H_ - -#if _MSC_VER > 1000 -#pragma once -#endif - -#if _MSC_VER >= 1600 // [ -#include -#else // ] _MSC_VER >= 1600 [ - -#include - -// For Visual Studio 6 in C++ mode and for many Visual Studio versions when -// compiling for ARM we should wrap include with 'extern "C++" {}' -// or compiler give many errors like this: -// error C2733: second C linkage of overloaded function 'wmemchr' not allowed -#ifdef __cplusplus -extern "C" { -#endif -# include -#ifdef __cplusplus -} -#endif - -// Define _W64 macros to mark types changing their size, like intptr_t. -#ifndef _W64 -# if !defined(__midl) && (defined(_X86_) || defined(_M_IX86)) && _MSC_VER >= 1300 -# define _W64 __w64 -# else -# define _W64 -# endif -#endif - - -// 7.18.1 Integer types - -// 7.18.1.1 Exact-width integer types - -// Visual Studio 6 and Embedded Visual C++ 4 doesn't -// realize that, e.g. char has the same size as __int8 -// so we give up on __intX for them. -#if (_MSC_VER < 1300) - typedef signed char int8_t; - typedef signed short int16_t; - typedef signed int int32_t; - typedef unsigned char uint8_t; - typedef unsigned short uint16_t; - typedef unsigned int uint32_t; -#else - typedef signed __int8 int8_t; - typedef signed __int16 int16_t; - typedef signed __int32 int32_t; - typedef unsigned __int8 uint8_t; - typedef unsigned __int16 uint16_t; - typedef unsigned __int32 uint32_t; -#endif -typedef signed __int64 int64_t; -typedef unsigned __int64 uint64_t; - - -// 7.18.1.2 Minimum-width integer types -typedef int8_t int_least8_t; -typedef int16_t int_least16_t; -typedef int32_t int_least32_t; -typedef int64_t int_least64_t; -typedef uint8_t uint_least8_t; -typedef uint16_t uint_least16_t; -typedef uint32_t uint_least32_t; -typedef uint64_t uint_least64_t; - -// 7.18.1.3 Fastest minimum-width integer types -typedef int8_t int_fast8_t; -typedef int16_t int_fast16_t; -typedef int32_t int_fast32_t; -typedef int64_t int_fast64_t; -typedef uint8_t uint_fast8_t; -typedef uint16_t uint_fast16_t; -typedef uint32_t uint_fast32_t; -typedef uint64_t uint_fast64_t; - -// 7.18.1.4 Integer types capable of holding object pointers -#ifdef _WIN64 // [ - typedef signed __int64 intptr_t; - typedef unsigned __int64 uintptr_t; -#else // _WIN64 ][ - typedef _W64 signed int intptr_t; - typedef _W64 unsigned int uintptr_t; -#endif // _WIN64 ] - -// 7.18.1.5 Greatest-width integer types -typedef int64_t intmax_t; -typedef uint64_t uintmax_t; - - -// 7.18.2 Limits of specified-width integer types - -#if !defined(__cplusplus) || defined(__STDC_LIMIT_MACROS) // [ See footnote 220 at page 257 and footnote 221 at page 259 - -// 7.18.2.1 Limits of exact-width integer types -#define INT8_MIN ((int8_t)_I8_MIN) -#define INT8_MAX _I8_MAX -#define INT16_MIN ((int16_t)_I16_MIN) -#define INT16_MAX _I16_MAX 
-#define INT32_MIN ((int32_t)_I32_MIN) -#define INT32_MAX _I32_MAX -#define INT64_MIN ((int64_t)_I64_MIN) -#define INT64_MAX _I64_MAX -#define UINT8_MAX _UI8_MAX -#define UINT16_MAX _UI16_MAX -#define UINT32_MAX _UI32_MAX -#define UINT64_MAX _UI64_MAX - -// 7.18.2.2 Limits of minimum-width integer types -#define INT_LEAST8_MIN INT8_MIN -#define INT_LEAST8_MAX INT8_MAX -#define INT_LEAST16_MIN INT16_MIN -#define INT_LEAST16_MAX INT16_MAX -#define INT_LEAST32_MIN INT32_MIN -#define INT_LEAST32_MAX INT32_MAX -#define INT_LEAST64_MIN INT64_MIN -#define INT_LEAST64_MAX INT64_MAX -#define UINT_LEAST8_MAX UINT8_MAX -#define UINT_LEAST16_MAX UINT16_MAX -#define UINT_LEAST32_MAX UINT32_MAX -#define UINT_LEAST64_MAX UINT64_MAX - -// 7.18.2.3 Limits of fastest minimum-width integer types -#define INT_FAST8_MIN INT8_MIN -#define INT_FAST8_MAX INT8_MAX -#define INT_FAST16_MIN INT16_MIN -#define INT_FAST16_MAX INT16_MAX -#define INT_FAST32_MIN INT32_MIN -#define INT_FAST32_MAX INT32_MAX -#define INT_FAST64_MIN INT64_MIN -#define INT_FAST64_MAX INT64_MAX -#define UINT_FAST8_MAX UINT8_MAX -#define UINT_FAST16_MAX UINT16_MAX -#define UINT_FAST32_MAX UINT32_MAX -#define UINT_FAST64_MAX UINT64_MAX - -// 7.18.2.4 Limits of integer types capable of holding object pointers -#ifdef _WIN64 // [ -# define INTPTR_MIN INT64_MIN -# define INTPTR_MAX INT64_MAX -# define UINTPTR_MAX UINT64_MAX -#else // _WIN64 ][ -# define INTPTR_MIN INT32_MIN -# define INTPTR_MAX INT32_MAX -# define UINTPTR_MAX UINT32_MAX -#endif // _WIN64 ] - -// 7.18.2.5 Limits of greatest-width integer types -#define INTMAX_MIN INT64_MIN -#define INTMAX_MAX INT64_MAX -#define UINTMAX_MAX UINT64_MAX - -// 7.18.3 Limits of other integer types - -#ifdef _WIN64 // [ -# define PTRDIFF_MIN _I64_MIN -# define PTRDIFF_MAX _I64_MAX -#else // _WIN64 ][ -# define PTRDIFF_MIN _I32_MIN -# define PTRDIFF_MAX _I32_MAX -#endif // _WIN64 ] - -#define SIG_ATOMIC_MIN INT_MIN -#define SIG_ATOMIC_MAX INT_MAX - -#ifndef SIZE_MAX // [ -# ifdef _WIN64 // [ -# define SIZE_MAX _UI64_MAX -# else // _WIN64 ][ -# define SIZE_MAX _UI32_MAX -# endif // _WIN64 ] -#endif // SIZE_MAX ] - -// WCHAR_MIN and WCHAR_MAX are also defined in -#ifndef WCHAR_MIN // [ -# define WCHAR_MIN 0 -#endif // WCHAR_MIN ] -#ifndef WCHAR_MAX // [ -# define WCHAR_MAX _UI16_MAX -#endif // WCHAR_MAX ] - -#define WINT_MIN 0 -#define WINT_MAX _UI16_MAX - -#endif // __STDC_LIMIT_MACROS ] - - -// 7.18.4 Limits of other integer types - -#if !defined(__cplusplus) || defined(__STDC_CONSTANT_MACROS) // [ See footnote 224 at page 260 - -// 7.18.4.1 Macros for minimum-width integer constants - -#define INT8_C(val) val##i8 -#define INT16_C(val) val##i16 -#define INT32_C(val) val##i32 -#define INT64_C(val) val##i64 - -#define UINT8_C(val) val##ui8 -#define UINT16_C(val) val##ui16 -#define UINT32_C(val) val##ui32 -#define UINT64_C(val) val##ui64 - -// 7.18.4.2 Macros for greatest-width integer constants -// These #ifndef's are needed to prevent collisions with . -// Check out Issue 9 for the details. 
-#ifndef INTMAX_C // [ -# define INTMAX_C INT64_C -#endif // INTMAX_C ] -#ifndef UINTMAX_C // [ -# define UINTMAX_C UINT64_C -#endif // UINTMAX_C ] - -#endif // __STDC_CONSTANT_MACROS ] - -#endif // _MSC_VER >= 1600 ] - -#endif // _MSC_STDINT_H_ ] diff --git a/include/murmurhash/MurmurHash2.h b/include/murmurhash/MurmurHash2.h deleted file mode 100644 index 6d7ccf4b2..000000000 --- a/include/murmurhash/MurmurHash2.h +++ /dev/null @@ -1,22 +0,0 @@ -//----------------------------------------------------------------------------- -// MurmurHash2 was written by Austin Appleby, and is placed in the public -// domain. The author hereby disclaims copyright to this source code. - -#ifndef _MURMURHASH2_H_ -#define _MURMURHASH2_H_ - -#include - -//----------------------------------------------------------------------------- - -uint32_t MurmurHash2 ( const void * key, int len, uint32_t seed ); -uint64_t MurmurHash64A ( const void * key, int len, uint64_t seed ); -uint64_t MurmurHash64B ( const void * key, int len, uint64_t seed ); -uint32_t MurmurHash2A ( const void * key, int len, uint32_t seed ); -uint32_t MurmurHashNeutral2 ( const void * key, int len, uint32_t seed ); -uint32_t MurmurHashAligned2 ( const void * key, int len, uint32_t seed ); - -//----------------------------------------------------------------------------- - -#endif // _MURMURHASH2_H_ - diff --git a/include/murmurhash/MurmurHash3.h b/include/murmurhash/MurmurHash3.h deleted file mode 100644 index 9b4c3c90b..000000000 --- a/include/murmurhash/MurmurHash3.h +++ /dev/null @@ -1,28 +0,0 @@ -//----------------------------------------------------------------------------- -// MurmurHash3 was written by Austin Appleby, and is placed in the public -// domain. The author hereby disclaims copyright to this source code. 
- -#ifndef _MURMURHASH3_H_ -#define _MURMURHASH3_H_ - -#include - -//----------------------------------------------------------------------------- -#ifdef __cplusplus -extern "C" { -#endif - - -void MurmurHash3_x86_32 ( const void * key, int len, uint32_t seed, void * out ); - -void MurmurHash3_x86_128 ( const void * key, int len, uint32_t seed, void * out ); - -void MurmurHash3_x64_128 ( const void * key, int len, uint32_t seed, void * out ); - -#ifdef __cplusplus -} -#endif - -//----------------------------------------------------------------------------- - -#endif // _MURMURHASH3_H_ diff --git a/include/numpy/__multiarray_api.h b/include/numpy/__multiarray_api.h deleted file mode 100644 index c949d732f..000000000 --- a/include/numpy/__multiarray_api.h +++ /dev/null @@ -1,1686 +0,0 @@ - -#ifdef _MULTIARRAYMODULE - -typedef struct { - PyObject_HEAD - npy_bool obval; -} PyBoolScalarObject; - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION -extern NPY_NO_EXPORT PyTypeObject PyArrayMapIter_Type; -extern NPY_NO_EXPORT PyTypeObject PyArrayNeighborhoodIter_Type; -extern NPY_NO_EXPORT PyBoolScalarObject _PyArrayScalar_BoolValues[2]; -#else -NPY_NO_EXPORT PyTypeObject PyArrayMapIter_Type; -NPY_NO_EXPORT PyTypeObject PyArrayNeighborhoodIter_Type; -NPY_NO_EXPORT PyBoolScalarObject _PyArrayScalar_BoolValues[2]; -#endif - -NPY_NO_EXPORT unsigned int PyArray_GetNDArrayCVersion \ - (void); -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyBigArray_Type; -#else - NPY_NO_EXPORT PyTypeObject PyBigArray_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyArray_Type; -#else - NPY_NO_EXPORT PyTypeObject PyArray_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyArrayDescr_Type; -#else - NPY_NO_EXPORT PyTypeObject PyArrayDescr_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyArrayFlags_Type; -#else - NPY_NO_EXPORT PyTypeObject PyArrayFlags_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyArrayIter_Type; -#else - NPY_NO_EXPORT PyTypeObject PyArrayIter_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyArrayMultiIter_Type; -#else - NPY_NO_EXPORT PyTypeObject PyArrayMultiIter_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT int NPY_NUMUSERTYPES; -#else - NPY_NO_EXPORT int NPY_NUMUSERTYPES; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyBoolArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyBoolArrType_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION -extern NPY_NO_EXPORT PyBoolScalarObject _PyArrayScalar_BoolValues[2]; -#else -NPY_NO_EXPORT PyBoolScalarObject _PyArrayScalar_BoolValues[2]; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyGenericArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyGenericArrType_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyNumberArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyNumberArrType_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyIntegerArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyIntegerArrType_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PySignedIntegerArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PySignedIntegerArrType_Type; -#endif - -#ifdef 
NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyUnsignedIntegerArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyUnsignedIntegerArrType_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyInexactArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyInexactArrType_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyFloatingArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyFloatingArrType_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyComplexFloatingArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyComplexFloatingArrType_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyFlexibleArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyFlexibleArrType_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyCharacterArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyCharacterArrType_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyByteArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyByteArrType_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyShortArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyShortArrType_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyIntArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyIntArrType_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyLongArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyLongArrType_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyLongLongArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyLongLongArrType_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyUByteArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyUByteArrType_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyUShortArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyUShortArrType_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyUIntArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyUIntArrType_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyULongArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyULongArrType_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyULongLongArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyULongLongArrType_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyFloatArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyFloatArrType_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyDoubleArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyDoubleArrType_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyLongDoubleArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyLongDoubleArrType_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyCFloatArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyCFloatArrType_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyCDoubleArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyCDoubleArrType_Type; -#endif - -#ifdef 
NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyCLongDoubleArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyCLongDoubleArrType_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyObjectArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyObjectArrType_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyStringArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyStringArrType_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyUnicodeArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyUnicodeArrType_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyVoidArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyVoidArrType_Type; -#endif - -NPY_NO_EXPORT int PyArray_SetNumericOps \ - (PyObject *); -NPY_NO_EXPORT PyObject * PyArray_GetNumericOps \ - (void); -NPY_NO_EXPORT int PyArray_INCREF \ - (PyArrayObject *); -NPY_NO_EXPORT int PyArray_XDECREF \ - (PyArrayObject *); -NPY_NO_EXPORT void PyArray_SetStringFunction \ - (PyObject *, int); -NPY_NO_EXPORT PyArray_Descr * PyArray_DescrFromType \ - (int); -NPY_NO_EXPORT PyObject * PyArray_TypeObjectFromType \ - (int); -NPY_NO_EXPORT char * PyArray_Zero \ - (PyArrayObject *); -NPY_NO_EXPORT char * PyArray_One \ - (PyArrayObject *); -NPY_NO_EXPORT PyObject * PyArray_CastToType \ - (PyArrayObject *, PyArray_Descr *, int); -NPY_NO_EXPORT int PyArray_CastTo \ - (PyArrayObject *, PyArrayObject *); -NPY_NO_EXPORT int PyArray_CastAnyTo \ - (PyArrayObject *, PyArrayObject *); -NPY_NO_EXPORT int PyArray_CanCastSafely \ - (int, int); -NPY_NO_EXPORT npy_bool PyArray_CanCastTo \ - (PyArray_Descr *, PyArray_Descr *); -NPY_NO_EXPORT int PyArray_ObjectType \ - (PyObject *, int); -NPY_NO_EXPORT PyArray_Descr * PyArray_DescrFromObject \ - (PyObject *, PyArray_Descr *); -NPY_NO_EXPORT PyArrayObject ** PyArray_ConvertToCommonType \ - (PyObject *, int *); -NPY_NO_EXPORT PyArray_Descr * PyArray_DescrFromScalar \ - (PyObject *); -NPY_NO_EXPORT PyArray_Descr * PyArray_DescrFromTypeObject \ - (PyObject *); -NPY_NO_EXPORT npy_intp PyArray_Size \ - (PyObject *); -NPY_NO_EXPORT PyObject * PyArray_Scalar \ - (void *, PyArray_Descr *, PyObject *); -NPY_NO_EXPORT PyObject * PyArray_FromScalar \ - (PyObject *, PyArray_Descr *); -NPY_NO_EXPORT void PyArray_ScalarAsCtype \ - (PyObject *, void *); -NPY_NO_EXPORT int PyArray_CastScalarToCtype \ - (PyObject *, void *, PyArray_Descr *); -NPY_NO_EXPORT int PyArray_CastScalarDirect \ - (PyObject *, PyArray_Descr *, void *, int); -NPY_NO_EXPORT PyObject * PyArray_ScalarFromObject \ - (PyObject *); -NPY_NO_EXPORT PyArray_VectorUnaryFunc * PyArray_GetCastFunc \ - (PyArray_Descr *, int); -NPY_NO_EXPORT PyObject * PyArray_FromDims \ - (int, int *, int); -NPY_NO_EXPORT PyObject * PyArray_FromDimsAndDataAndDescr \ - (int, int *, PyArray_Descr *, char *); -NPY_NO_EXPORT PyObject * PyArray_FromAny \ - (PyObject *, PyArray_Descr *, int, int, int, PyObject *); -NPY_NO_EXPORT PyObject * PyArray_EnsureArray \ - (PyObject *); -NPY_NO_EXPORT PyObject * PyArray_EnsureAnyArray \ - (PyObject *); -NPY_NO_EXPORT PyObject * PyArray_FromFile \ - (FILE *, PyArray_Descr *, npy_intp, char *); -NPY_NO_EXPORT PyObject * PyArray_FromString \ - (char *, npy_intp, PyArray_Descr *, npy_intp, char *); -NPY_NO_EXPORT PyObject * PyArray_FromBuffer \ - (PyObject *, PyArray_Descr *, npy_intp, npy_intp); -NPY_NO_EXPORT PyObject * PyArray_FromIter \ - (PyObject *, PyArray_Descr *, npy_intp); 
-NPY_NO_EXPORT PyObject * PyArray_Return \ - (PyArrayObject *); -NPY_NO_EXPORT PyObject * PyArray_GetField \ - (PyArrayObject *, PyArray_Descr *, int); -NPY_NO_EXPORT int PyArray_SetField \ - (PyArrayObject *, PyArray_Descr *, int, PyObject *); -NPY_NO_EXPORT PyObject * PyArray_Byteswap \ - (PyArrayObject *, npy_bool); -NPY_NO_EXPORT PyObject * PyArray_Resize \ - (PyArrayObject *, PyArray_Dims *, int, NPY_ORDER); -NPY_NO_EXPORT int PyArray_MoveInto \ - (PyArrayObject *, PyArrayObject *); -NPY_NO_EXPORT int PyArray_CopyInto \ - (PyArrayObject *, PyArrayObject *); -NPY_NO_EXPORT int PyArray_CopyAnyInto \ - (PyArrayObject *, PyArrayObject *); -NPY_NO_EXPORT int PyArray_CopyObject \ - (PyArrayObject *, PyObject *); -NPY_NO_EXPORT PyObject * PyArray_NewCopy \ - (PyArrayObject *, NPY_ORDER); -NPY_NO_EXPORT PyObject * PyArray_ToList \ - (PyArrayObject *); -NPY_NO_EXPORT PyObject * PyArray_ToString \ - (PyArrayObject *, NPY_ORDER); -NPY_NO_EXPORT int PyArray_ToFile \ - (PyArrayObject *, FILE *, char *, char *); -NPY_NO_EXPORT int PyArray_Dump \ - (PyObject *, PyObject *, int); -NPY_NO_EXPORT PyObject * PyArray_Dumps \ - (PyObject *, int); -NPY_NO_EXPORT int PyArray_ValidType \ - (int); -NPY_NO_EXPORT void PyArray_UpdateFlags \ - (PyArrayObject *, int); -NPY_NO_EXPORT PyObject * PyArray_New \ - (PyTypeObject *, int, npy_intp *, int, npy_intp *, void *, int, int, PyObject *); -NPY_NO_EXPORT PyObject * PyArray_NewFromDescr \ - (PyTypeObject *, PyArray_Descr *, int, npy_intp *, npy_intp *, void *, int, PyObject *); -NPY_NO_EXPORT PyArray_Descr * PyArray_DescrNew \ - (PyArray_Descr *); -NPY_NO_EXPORT PyArray_Descr * PyArray_DescrNewFromType \ - (int); -NPY_NO_EXPORT double PyArray_GetPriority \ - (PyObject *, double); -NPY_NO_EXPORT PyObject * PyArray_IterNew \ - (PyObject *); -NPY_NO_EXPORT PyObject * PyArray_MultiIterNew \ - (int, ...); -NPY_NO_EXPORT int PyArray_PyIntAsInt \ - (PyObject *); -NPY_NO_EXPORT npy_intp PyArray_PyIntAsIntp \ - (PyObject *); -NPY_NO_EXPORT int PyArray_Broadcast \ - (PyArrayMultiIterObject *); -NPY_NO_EXPORT void PyArray_FillObjectArray \ - (PyArrayObject *, PyObject *); -NPY_NO_EXPORT int PyArray_FillWithScalar \ - (PyArrayObject *, PyObject *); -NPY_NO_EXPORT npy_bool PyArray_CheckStrides \ - (int, int, npy_intp, npy_intp, npy_intp *, npy_intp *); -NPY_NO_EXPORT PyArray_Descr * PyArray_DescrNewByteorder \ - (PyArray_Descr *, char); -NPY_NO_EXPORT PyObject * PyArray_IterAllButAxis \ - (PyObject *, int *); -NPY_NO_EXPORT PyObject * PyArray_CheckFromAny \ - (PyObject *, PyArray_Descr *, int, int, int, PyObject *); -NPY_NO_EXPORT PyObject * PyArray_FromArray \ - (PyArrayObject *, PyArray_Descr *, int); -NPY_NO_EXPORT PyObject * PyArray_FromInterface \ - (PyObject *); -NPY_NO_EXPORT PyObject * PyArray_FromStructInterface \ - (PyObject *); -NPY_NO_EXPORT PyObject * PyArray_FromArrayAttr \ - (PyObject *, PyArray_Descr *, PyObject *); -NPY_NO_EXPORT NPY_SCALARKIND PyArray_ScalarKind \ - (int, PyArrayObject **); -NPY_NO_EXPORT int PyArray_CanCoerceScalar \ - (int, int, NPY_SCALARKIND); -NPY_NO_EXPORT PyObject * PyArray_NewFlagsObject \ - (PyObject *); -NPY_NO_EXPORT npy_bool PyArray_CanCastScalar \ - (PyTypeObject *, PyTypeObject *); -NPY_NO_EXPORT int PyArray_CompareUCS4 \ - (npy_ucs4 *, npy_ucs4 *, size_t); -NPY_NO_EXPORT int PyArray_RemoveSmallest \ - (PyArrayMultiIterObject *); -NPY_NO_EXPORT int PyArray_ElementStrides \ - (PyObject *); -NPY_NO_EXPORT void PyArray_Item_INCREF \ - (char *, PyArray_Descr *); -NPY_NO_EXPORT void PyArray_Item_XDECREF \ - (char *, PyArray_Descr *); 
-NPY_NO_EXPORT PyObject * PyArray_FieldNames \ - (PyObject *); -NPY_NO_EXPORT PyObject * PyArray_Transpose \ - (PyArrayObject *, PyArray_Dims *); -NPY_NO_EXPORT PyObject * PyArray_TakeFrom \ - (PyArrayObject *, PyObject *, int, PyArrayObject *, NPY_CLIPMODE); -NPY_NO_EXPORT PyObject * PyArray_PutTo \ - (PyArrayObject *, PyObject*, PyObject *, NPY_CLIPMODE); -NPY_NO_EXPORT PyObject * PyArray_PutMask \ - (PyArrayObject *, PyObject*, PyObject*); -NPY_NO_EXPORT PyObject * PyArray_Repeat \ - (PyArrayObject *, PyObject *, int); -NPY_NO_EXPORT PyObject * PyArray_Choose \ - (PyArrayObject *, PyObject *, PyArrayObject *, NPY_CLIPMODE); -NPY_NO_EXPORT int PyArray_Sort \ - (PyArrayObject *, int, NPY_SORTKIND); -NPY_NO_EXPORT PyObject * PyArray_ArgSort \ - (PyArrayObject *, int, NPY_SORTKIND); -NPY_NO_EXPORT PyObject * PyArray_SearchSorted \ - (PyArrayObject *, PyObject *, NPY_SEARCHSIDE, PyObject *); -NPY_NO_EXPORT PyObject * PyArray_ArgMax \ - (PyArrayObject *, int, PyArrayObject *); -NPY_NO_EXPORT PyObject * PyArray_ArgMin \ - (PyArrayObject *, int, PyArrayObject *); -NPY_NO_EXPORT PyObject * PyArray_Reshape \ - (PyArrayObject *, PyObject *); -NPY_NO_EXPORT PyObject * PyArray_Newshape \ - (PyArrayObject *, PyArray_Dims *, NPY_ORDER); -NPY_NO_EXPORT PyObject * PyArray_Squeeze \ - (PyArrayObject *); -NPY_NO_EXPORT PyObject * PyArray_View \ - (PyArrayObject *, PyArray_Descr *, PyTypeObject *); -NPY_NO_EXPORT PyObject * PyArray_SwapAxes \ - (PyArrayObject *, int, int); -NPY_NO_EXPORT PyObject * PyArray_Max \ - (PyArrayObject *, int, PyArrayObject *); -NPY_NO_EXPORT PyObject * PyArray_Min \ - (PyArrayObject *, int, PyArrayObject *); -NPY_NO_EXPORT PyObject * PyArray_Ptp \ - (PyArrayObject *, int, PyArrayObject *); -NPY_NO_EXPORT PyObject * PyArray_Mean \ - (PyArrayObject *, int, int, PyArrayObject *); -NPY_NO_EXPORT PyObject * PyArray_Trace \ - (PyArrayObject *, int, int, int, int, PyArrayObject *); -NPY_NO_EXPORT PyObject * PyArray_Diagonal \ - (PyArrayObject *, int, int, int); -NPY_NO_EXPORT PyObject * PyArray_Clip \ - (PyArrayObject *, PyObject *, PyObject *, PyArrayObject *); -NPY_NO_EXPORT PyObject * PyArray_Conjugate \ - (PyArrayObject *, PyArrayObject *); -NPY_NO_EXPORT PyObject * PyArray_Nonzero \ - (PyArrayObject *); -NPY_NO_EXPORT PyObject * PyArray_Std \ - (PyArrayObject *, int, int, PyArrayObject *, int); -NPY_NO_EXPORT PyObject * PyArray_Sum \ - (PyArrayObject *, int, int, PyArrayObject *); -NPY_NO_EXPORT PyObject * PyArray_CumSum \ - (PyArrayObject *, int, int, PyArrayObject *); -NPY_NO_EXPORT PyObject * PyArray_Prod \ - (PyArrayObject *, int, int, PyArrayObject *); -NPY_NO_EXPORT PyObject * PyArray_CumProd \ - (PyArrayObject *, int, int, PyArrayObject *); -NPY_NO_EXPORT PyObject * PyArray_All \ - (PyArrayObject *, int, PyArrayObject *); -NPY_NO_EXPORT PyObject * PyArray_Any \ - (PyArrayObject *, int, PyArrayObject *); -NPY_NO_EXPORT PyObject * PyArray_Compress \ - (PyArrayObject *, PyObject *, int, PyArrayObject *); -NPY_NO_EXPORT PyObject * PyArray_Flatten \ - (PyArrayObject *, NPY_ORDER); -NPY_NO_EXPORT PyObject * PyArray_Ravel \ - (PyArrayObject *, NPY_ORDER); -NPY_NO_EXPORT npy_intp PyArray_MultiplyList \ - (npy_intp *, int); -NPY_NO_EXPORT int PyArray_MultiplyIntList \ - (int *, int); -NPY_NO_EXPORT void * PyArray_GetPtr \ - (PyArrayObject *, npy_intp*); -NPY_NO_EXPORT int PyArray_CompareLists \ - (npy_intp *, npy_intp *, int); -NPY_NO_EXPORT int PyArray_AsCArray \ - (PyObject **, void *, npy_intp *, int, PyArray_Descr*); -NPY_NO_EXPORT int PyArray_As1D \ - (PyObject **, char **, int 
*, int); -NPY_NO_EXPORT int PyArray_As2D \ - (PyObject **, char ***, int *, int *, int); -NPY_NO_EXPORT int PyArray_Free \ - (PyObject *, void *); -NPY_NO_EXPORT int PyArray_Converter \ - (PyObject *, PyObject **); -NPY_NO_EXPORT int PyArray_IntpFromSequence \ - (PyObject *, npy_intp *, int); -NPY_NO_EXPORT PyObject * PyArray_Concatenate \ - (PyObject *, int); -NPY_NO_EXPORT PyObject * PyArray_InnerProduct \ - (PyObject *, PyObject *); -NPY_NO_EXPORT PyObject * PyArray_MatrixProduct \ - (PyObject *, PyObject *); -NPY_NO_EXPORT PyObject * PyArray_CopyAndTranspose \ - (PyObject *); -NPY_NO_EXPORT PyObject * PyArray_Correlate \ - (PyObject *, PyObject *, int); -NPY_NO_EXPORT int PyArray_TypestrConvert \ - (int, int); -NPY_NO_EXPORT int PyArray_DescrConverter \ - (PyObject *, PyArray_Descr **); -NPY_NO_EXPORT int PyArray_DescrConverter2 \ - (PyObject *, PyArray_Descr **); -NPY_NO_EXPORT int PyArray_IntpConverter \ - (PyObject *, PyArray_Dims *); -NPY_NO_EXPORT int PyArray_BufferConverter \ - (PyObject *, PyArray_Chunk *); -NPY_NO_EXPORT int PyArray_AxisConverter \ - (PyObject *, int *); -NPY_NO_EXPORT int PyArray_BoolConverter \ - (PyObject *, npy_bool *); -NPY_NO_EXPORT int PyArray_ByteorderConverter \ - (PyObject *, char *); -NPY_NO_EXPORT int PyArray_OrderConverter \ - (PyObject *, NPY_ORDER *); -NPY_NO_EXPORT unsigned char PyArray_EquivTypes \ - (PyArray_Descr *, PyArray_Descr *); -NPY_NO_EXPORT PyObject * PyArray_Zeros \ - (int, npy_intp *, PyArray_Descr *, int); -NPY_NO_EXPORT PyObject * PyArray_Empty \ - (int, npy_intp *, PyArray_Descr *, int); -NPY_NO_EXPORT PyObject * PyArray_Where \ - (PyObject *, PyObject *, PyObject *); -NPY_NO_EXPORT PyObject * PyArray_Arange \ - (double, double, double, int); -NPY_NO_EXPORT PyObject * PyArray_ArangeObj \ - (PyObject *, PyObject *, PyObject *, PyArray_Descr *); -NPY_NO_EXPORT int PyArray_SortkindConverter \ - (PyObject *, NPY_SORTKIND *); -NPY_NO_EXPORT PyObject * PyArray_LexSort \ - (PyObject *, int); -NPY_NO_EXPORT PyObject * PyArray_Round \ - (PyArrayObject *, int, PyArrayObject *); -NPY_NO_EXPORT unsigned char PyArray_EquivTypenums \ - (int, int); -NPY_NO_EXPORT int PyArray_RegisterDataType \ - (PyArray_Descr *); -NPY_NO_EXPORT int PyArray_RegisterCastFunc \ - (PyArray_Descr *, int, PyArray_VectorUnaryFunc *); -NPY_NO_EXPORT int PyArray_RegisterCanCast \ - (PyArray_Descr *, int, NPY_SCALARKIND); -NPY_NO_EXPORT void PyArray_InitArrFuncs \ - (PyArray_ArrFuncs *); -NPY_NO_EXPORT PyObject * PyArray_IntTupleFromIntp \ - (int, npy_intp *); -NPY_NO_EXPORT int PyArray_TypeNumFromName \ - (char *); -NPY_NO_EXPORT int PyArray_ClipmodeConverter \ - (PyObject *, NPY_CLIPMODE *); -NPY_NO_EXPORT int PyArray_OutputConverter \ - (PyObject *, PyArrayObject **); -NPY_NO_EXPORT PyObject * PyArray_BroadcastToShape \ - (PyObject *, npy_intp *, int); -NPY_NO_EXPORT void _PyArray_SigintHandler \ - (int); -NPY_NO_EXPORT void* _PyArray_GetSigintBuf \ - (void); -NPY_NO_EXPORT int PyArray_DescrAlignConverter \ - (PyObject *, PyArray_Descr **); -NPY_NO_EXPORT int PyArray_DescrAlignConverter2 \ - (PyObject *, PyArray_Descr **); -NPY_NO_EXPORT int PyArray_SearchsideConverter \ - (PyObject *, void *); -NPY_NO_EXPORT PyObject * PyArray_CheckAxis \ - (PyArrayObject *, int *, int); -NPY_NO_EXPORT npy_intp PyArray_OverflowMultiplyList \ - (npy_intp *, int); -NPY_NO_EXPORT int PyArray_CompareString \ - (char *, char *, size_t); -NPY_NO_EXPORT PyObject * PyArray_MultiIterFromObjects \ - (PyObject **, int, int, ...); -NPY_NO_EXPORT int PyArray_GetEndianness \ - (void); 
-NPY_NO_EXPORT unsigned int PyArray_GetNDArrayCFeatureVersion \ - (void); -NPY_NO_EXPORT PyObject * PyArray_Correlate2 \ - (PyObject *, PyObject *, int); -NPY_NO_EXPORT PyObject* PyArray_NeighborhoodIterNew \ - (PyArrayIterObject *, npy_intp *, int, PyArrayObject*); -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyTimeIntegerArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyTimeIntegerArrType_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyDatetimeArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyDatetimeArrType_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyTimedeltaArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyTimedeltaArrType_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyHalfArrType_Type; -#else - NPY_NO_EXPORT PyTypeObject PyHalfArrType_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject NpyIter_Type; -#else - NPY_NO_EXPORT PyTypeObject NpyIter_Type; -#endif - -NPY_NO_EXPORT void PyArray_SetDatetimeParseFunction \ - (PyObject *); -NPY_NO_EXPORT void PyArray_DatetimeToDatetimeStruct \ - (npy_datetime, NPY_DATETIMEUNIT, npy_datetimestruct *); -NPY_NO_EXPORT void PyArray_TimedeltaToTimedeltaStruct \ - (npy_timedelta, NPY_DATETIMEUNIT, npy_timedeltastruct *); -NPY_NO_EXPORT npy_datetime PyArray_DatetimeStructToDatetime \ - (NPY_DATETIMEUNIT, npy_datetimestruct *); -NPY_NO_EXPORT npy_datetime PyArray_TimedeltaStructToTimedelta \ - (NPY_DATETIMEUNIT, npy_timedeltastruct *); -NPY_NO_EXPORT NpyIter * NpyIter_New \ - (PyArrayObject *, npy_uint32, NPY_ORDER, NPY_CASTING, PyArray_Descr*); -NPY_NO_EXPORT NpyIter * NpyIter_MultiNew \ - (int, PyArrayObject **, npy_uint32, NPY_ORDER, NPY_CASTING, npy_uint32 *, PyArray_Descr **); -NPY_NO_EXPORT NpyIter * NpyIter_AdvancedNew \ - (int, PyArrayObject **, npy_uint32, NPY_ORDER, NPY_CASTING, npy_uint32 *, PyArray_Descr **, int, int **, npy_intp *, npy_intp); -NPY_NO_EXPORT NpyIter * NpyIter_Copy \ - (NpyIter *); -NPY_NO_EXPORT int NpyIter_Deallocate \ - (NpyIter *); -NPY_NO_EXPORT npy_bool NpyIter_HasDelayedBufAlloc \ - (NpyIter *); -NPY_NO_EXPORT npy_bool NpyIter_HasExternalLoop \ - (NpyIter *); -NPY_NO_EXPORT int NpyIter_EnableExternalLoop \ - (NpyIter *); -NPY_NO_EXPORT npy_intp * NpyIter_GetInnerStrideArray \ - (NpyIter *); -NPY_NO_EXPORT npy_intp * NpyIter_GetInnerLoopSizePtr \ - (NpyIter *); -NPY_NO_EXPORT int NpyIter_Reset \ - (NpyIter *, char **); -NPY_NO_EXPORT int NpyIter_ResetBasePointers \ - (NpyIter *, char **, char **); -NPY_NO_EXPORT int NpyIter_ResetToIterIndexRange \ - (NpyIter *, npy_intp, npy_intp, char **); -NPY_NO_EXPORT int NpyIter_GetNDim \ - (NpyIter *); -NPY_NO_EXPORT int NpyIter_GetNOp \ - (NpyIter *); -NPY_NO_EXPORT NpyIter_IterNextFunc * NpyIter_GetIterNext \ - (NpyIter *, char **); -NPY_NO_EXPORT npy_intp NpyIter_GetIterSize \ - (NpyIter *); -NPY_NO_EXPORT void NpyIter_GetIterIndexRange \ - (NpyIter *, npy_intp *, npy_intp *); -NPY_NO_EXPORT npy_intp NpyIter_GetIterIndex \ - (NpyIter *); -NPY_NO_EXPORT int NpyIter_GotoIterIndex \ - (NpyIter *, npy_intp); -NPY_NO_EXPORT npy_bool NpyIter_HasMultiIndex \ - (NpyIter *); -NPY_NO_EXPORT int NpyIter_GetShape \ - (NpyIter *, npy_intp *); -NPY_NO_EXPORT NpyIter_GetMultiIndexFunc * NpyIter_GetGetMultiIndex \ - (NpyIter *, char **); -NPY_NO_EXPORT int NpyIter_GotoMultiIndex \ - (NpyIter *, npy_intp *); -NPY_NO_EXPORT int NpyIter_RemoveMultiIndex \ - (NpyIter *); 
-NPY_NO_EXPORT npy_bool NpyIter_HasIndex \ - (NpyIter *); -NPY_NO_EXPORT npy_bool NpyIter_IsBuffered \ - (NpyIter *); -NPY_NO_EXPORT npy_bool NpyIter_IsGrowInner \ - (NpyIter *); -NPY_NO_EXPORT npy_intp NpyIter_GetBufferSize \ - (NpyIter *); -NPY_NO_EXPORT npy_intp * NpyIter_GetIndexPtr \ - (NpyIter *); -NPY_NO_EXPORT int NpyIter_GotoIndex \ - (NpyIter *, npy_intp); -NPY_NO_EXPORT char ** NpyIter_GetDataPtrArray \ - (NpyIter *); -NPY_NO_EXPORT PyArray_Descr ** NpyIter_GetDescrArray \ - (NpyIter *); -NPY_NO_EXPORT PyArrayObject ** NpyIter_GetOperandArray \ - (NpyIter *); -NPY_NO_EXPORT PyArrayObject * NpyIter_GetIterView \ - (NpyIter *, npy_intp); -NPY_NO_EXPORT void NpyIter_GetReadFlags \ - (NpyIter *, char *); -NPY_NO_EXPORT void NpyIter_GetWriteFlags \ - (NpyIter *, char *); -NPY_NO_EXPORT void NpyIter_DebugPrint \ - (NpyIter *); -NPY_NO_EXPORT npy_bool NpyIter_IterationNeedsAPI \ - (NpyIter *); -NPY_NO_EXPORT void NpyIter_GetInnerFixedStrideArray \ - (NpyIter *, npy_intp *); -NPY_NO_EXPORT int NpyIter_RemoveAxis \ - (NpyIter *, int); -NPY_NO_EXPORT npy_intp * NpyIter_GetAxisStrideArray \ - (NpyIter *, int); -NPY_NO_EXPORT npy_bool NpyIter_RequiresBuffering \ - (NpyIter *); -NPY_NO_EXPORT char ** NpyIter_GetInitialDataPtrArray \ - (NpyIter *); -NPY_NO_EXPORT int NpyIter_CreateCompatibleStrides \ - (NpyIter *, npy_intp, npy_intp *); -NPY_NO_EXPORT int PyArray_CastingConverter \ - (PyObject *, NPY_CASTING *); -NPY_NO_EXPORT npy_intp PyArray_CountNonzero \ - (PyArrayObject *); -NPY_NO_EXPORT PyArray_Descr * PyArray_PromoteTypes \ - (PyArray_Descr *, PyArray_Descr *); -NPY_NO_EXPORT PyArray_Descr * PyArray_MinScalarType \ - (PyArrayObject *); -NPY_NO_EXPORT PyArray_Descr * PyArray_ResultType \ - (npy_intp, PyArrayObject **, npy_intp, PyArray_Descr **); -NPY_NO_EXPORT npy_bool PyArray_CanCastArrayTo \ - (PyArrayObject *, PyArray_Descr *, NPY_CASTING); -NPY_NO_EXPORT npy_bool PyArray_CanCastTypeTo \ - (PyArray_Descr *, PyArray_Descr *, NPY_CASTING); -NPY_NO_EXPORT PyArrayObject * PyArray_EinsteinSum \ - (char *, npy_intp, PyArrayObject **, PyArray_Descr *, NPY_ORDER, NPY_CASTING, PyArrayObject *); -NPY_NO_EXPORT PyObject * PyArray_NewLikeArray \ - (PyArrayObject *, NPY_ORDER, PyArray_Descr *, int); -NPY_NO_EXPORT int PyArray_GetArrayParamsFromObject \ - (PyObject *, PyArray_Descr *, npy_bool, PyArray_Descr **, int *, npy_intp *, PyArrayObject **, PyObject *); -NPY_NO_EXPORT int PyArray_ConvertClipmodeSequence \ - (PyObject *, NPY_CLIPMODE *, int); -NPY_NO_EXPORT PyObject * PyArray_MatrixProduct2 \ - (PyObject *, PyObject *, PyArrayObject*); -NPY_NO_EXPORT npy_bool NpyIter_IsFirstVisit \ - (NpyIter *, int); -NPY_NO_EXPORT int PyArray_SetBaseObject \ - (PyArrayObject *, PyObject *); -NPY_NO_EXPORT void PyArray_CreateSortedStridePerm \ - (int, npy_intp *, npy_stride_sort_item *); -NPY_NO_EXPORT void PyArray_RemoveAxesInPlace \ - (PyArrayObject *, npy_bool *); -NPY_NO_EXPORT void PyArray_DebugPrint \ - (PyArrayObject *); -NPY_NO_EXPORT int PyArray_FailUnlessWriteable \ - (PyArrayObject *, const char *); -NPY_NO_EXPORT int PyArray_SetUpdateIfCopyBase \ - (PyArrayObject *, PyArrayObject *); -NPY_NO_EXPORT void * PyDataMem_NEW \ - (size_t); -NPY_NO_EXPORT void PyDataMem_FREE \ - (void *); -NPY_NO_EXPORT void * PyDataMem_RENEW \ - (void *, size_t); -NPY_NO_EXPORT PyDataMem_EventHookFunc * PyDataMem_SetEventHook \ - (PyDataMem_EventHookFunc *, void *, void **); -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT NPY_CASTING NPY_DEFAULT_ASSIGN_CASTING; -#else - NPY_NO_EXPORT NPY_CASTING 
NPY_DEFAULT_ASSIGN_CASTING; -#endif - - -#else - -#if defined(PY_ARRAY_UNIQUE_SYMBOL) -#define PyArray_API PY_ARRAY_UNIQUE_SYMBOL -#endif - -#if defined(NO_IMPORT) || defined(NO_IMPORT_ARRAY) -extern void **PyArray_API; -#else -#if defined(PY_ARRAY_UNIQUE_SYMBOL) -void **PyArray_API; -#else -static void **PyArray_API=NULL; -#endif -#endif - -#define PyArray_GetNDArrayCVersion \ - (*(unsigned int (*)(void)) \ - PyArray_API[0]) -#define PyBigArray_Type (*(PyTypeObject *)PyArray_API[1]) -#define PyArray_Type (*(PyTypeObject *)PyArray_API[2]) -#define PyArrayDescr_Type (*(PyTypeObject *)PyArray_API[3]) -#define PyArrayFlags_Type (*(PyTypeObject *)PyArray_API[4]) -#define PyArrayIter_Type (*(PyTypeObject *)PyArray_API[5]) -#define PyArrayMultiIter_Type (*(PyTypeObject *)PyArray_API[6]) -#define NPY_NUMUSERTYPES (*(int *)PyArray_API[7]) -#define PyBoolArrType_Type (*(PyTypeObject *)PyArray_API[8]) -#define _PyArrayScalar_BoolValues ((PyBoolScalarObject *)PyArray_API[9]) -#define PyGenericArrType_Type (*(PyTypeObject *)PyArray_API[10]) -#define PyNumberArrType_Type (*(PyTypeObject *)PyArray_API[11]) -#define PyIntegerArrType_Type (*(PyTypeObject *)PyArray_API[12]) -#define PySignedIntegerArrType_Type (*(PyTypeObject *)PyArray_API[13]) -#define PyUnsignedIntegerArrType_Type (*(PyTypeObject *)PyArray_API[14]) -#define PyInexactArrType_Type (*(PyTypeObject *)PyArray_API[15]) -#define PyFloatingArrType_Type (*(PyTypeObject *)PyArray_API[16]) -#define PyComplexFloatingArrType_Type (*(PyTypeObject *)PyArray_API[17]) -#define PyFlexibleArrType_Type (*(PyTypeObject *)PyArray_API[18]) -#define PyCharacterArrType_Type (*(PyTypeObject *)PyArray_API[19]) -#define PyByteArrType_Type (*(PyTypeObject *)PyArray_API[20]) -#define PyShortArrType_Type (*(PyTypeObject *)PyArray_API[21]) -#define PyIntArrType_Type (*(PyTypeObject *)PyArray_API[22]) -#define PyLongArrType_Type (*(PyTypeObject *)PyArray_API[23]) -#define PyLongLongArrType_Type (*(PyTypeObject *)PyArray_API[24]) -#define PyUByteArrType_Type (*(PyTypeObject *)PyArray_API[25]) -#define PyUShortArrType_Type (*(PyTypeObject *)PyArray_API[26]) -#define PyUIntArrType_Type (*(PyTypeObject *)PyArray_API[27]) -#define PyULongArrType_Type (*(PyTypeObject *)PyArray_API[28]) -#define PyULongLongArrType_Type (*(PyTypeObject *)PyArray_API[29]) -#define PyFloatArrType_Type (*(PyTypeObject *)PyArray_API[30]) -#define PyDoubleArrType_Type (*(PyTypeObject *)PyArray_API[31]) -#define PyLongDoubleArrType_Type (*(PyTypeObject *)PyArray_API[32]) -#define PyCFloatArrType_Type (*(PyTypeObject *)PyArray_API[33]) -#define PyCDoubleArrType_Type (*(PyTypeObject *)PyArray_API[34]) -#define PyCLongDoubleArrType_Type (*(PyTypeObject *)PyArray_API[35]) -#define PyObjectArrType_Type (*(PyTypeObject *)PyArray_API[36]) -#define PyStringArrType_Type (*(PyTypeObject *)PyArray_API[37]) -#define PyUnicodeArrType_Type (*(PyTypeObject *)PyArray_API[38]) -#define PyVoidArrType_Type (*(PyTypeObject *)PyArray_API[39]) -#define PyArray_SetNumericOps \ - (*(int (*)(PyObject *)) \ - PyArray_API[40]) -#define PyArray_GetNumericOps \ - (*(PyObject * (*)(void)) \ - PyArray_API[41]) -#define PyArray_INCREF \ - (*(int (*)(PyArrayObject *)) \ - PyArray_API[42]) -#define PyArray_XDECREF \ - (*(int (*)(PyArrayObject *)) \ - PyArray_API[43]) -#define PyArray_SetStringFunction \ - (*(void (*)(PyObject *, int)) \ - PyArray_API[44]) -#define PyArray_DescrFromType \ - (*(PyArray_Descr * (*)(int)) \ - PyArray_API[45]) -#define PyArray_TypeObjectFromType \ - (*(PyObject * (*)(int)) \ - PyArray_API[46]) -#define 
PyArray_Zero \ - (*(char * (*)(PyArrayObject *)) \ - PyArray_API[47]) -#define PyArray_One \ - (*(char * (*)(PyArrayObject *)) \ - PyArray_API[48]) -#define PyArray_CastToType \ - (*(PyObject * (*)(PyArrayObject *, PyArray_Descr *, int)) \ - PyArray_API[49]) -#define PyArray_CastTo \ - (*(int (*)(PyArrayObject *, PyArrayObject *)) \ - PyArray_API[50]) -#define PyArray_CastAnyTo \ - (*(int (*)(PyArrayObject *, PyArrayObject *)) \ - PyArray_API[51]) -#define PyArray_CanCastSafely \ - (*(int (*)(int, int)) \ - PyArray_API[52]) -#define PyArray_CanCastTo \ - (*(npy_bool (*)(PyArray_Descr *, PyArray_Descr *)) \ - PyArray_API[53]) -#define PyArray_ObjectType \ - (*(int (*)(PyObject *, int)) \ - PyArray_API[54]) -#define PyArray_DescrFromObject \ - (*(PyArray_Descr * (*)(PyObject *, PyArray_Descr *)) \ - PyArray_API[55]) -#define PyArray_ConvertToCommonType \ - (*(PyArrayObject ** (*)(PyObject *, int *)) \ - PyArray_API[56]) -#define PyArray_DescrFromScalar \ - (*(PyArray_Descr * (*)(PyObject *)) \ - PyArray_API[57]) -#define PyArray_DescrFromTypeObject \ - (*(PyArray_Descr * (*)(PyObject *)) \ - PyArray_API[58]) -#define PyArray_Size \ - (*(npy_intp (*)(PyObject *)) \ - PyArray_API[59]) -#define PyArray_Scalar \ - (*(PyObject * (*)(void *, PyArray_Descr *, PyObject *)) \ - PyArray_API[60]) -#define PyArray_FromScalar \ - (*(PyObject * (*)(PyObject *, PyArray_Descr *)) \ - PyArray_API[61]) -#define PyArray_ScalarAsCtype \ - (*(void (*)(PyObject *, void *)) \ - PyArray_API[62]) -#define PyArray_CastScalarToCtype \ - (*(int (*)(PyObject *, void *, PyArray_Descr *)) \ - PyArray_API[63]) -#define PyArray_CastScalarDirect \ - (*(int (*)(PyObject *, PyArray_Descr *, void *, int)) \ - PyArray_API[64]) -#define PyArray_ScalarFromObject \ - (*(PyObject * (*)(PyObject *)) \ - PyArray_API[65]) -#define PyArray_GetCastFunc \ - (*(PyArray_VectorUnaryFunc * (*)(PyArray_Descr *, int)) \ - PyArray_API[66]) -#define PyArray_FromDims \ - (*(PyObject * (*)(int, int *, int)) \ - PyArray_API[67]) -#define PyArray_FromDimsAndDataAndDescr \ - (*(PyObject * (*)(int, int *, PyArray_Descr *, char *)) \ - PyArray_API[68]) -#define PyArray_FromAny \ - (*(PyObject * (*)(PyObject *, PyArray_Descr *, int, int, int, PyObject *)) \ - PyArray_API[69]) -#define PyArray_EnsureArray \ - (*(PyObject * (*)(PyObject *)) \ - PyArray_API[70]) -#define PyArray_EnsureAnyArray \ - (*(PyObject * (*)(PyObject *)) \ - PyArray_API[71]) -#define PyArray_FromFile \ - (*(PyObject * (*)(FILE *, PyArray_Descr *, npy_intp, char *)) \ - PyArray_API[72]) -#define PyArray_FromString \ - (*(PyObject * (*)(char *, npy_intp, PyArray_Descr *, npy_intp, char *)) \ - PyArray_API[73]) -#define PyArray_FromBuffer \ - (*(PyObject * (*)(PyObject *, PyArray_Descr *, npy_intp, npy_intp)) \ - PyArray_API[74]) -#define PyArray_FromIter \ - (*(PyObject * (*)(PyObject *, PyArray_Descr *, npy_intp)) \ - PyArray_API[75]) -#define PyArray_Return \ - (*(PyObject * (*)(PyArrayObject *)) \ - PyArray_API[76]) -#define PyArray_GetField \ - (*(PyObject * (*)(PyArrayObject *, PyArray_Descr *, int)) \ - PyArray_API[77]) -#define PyArray_SetField \ - (*(int (*)(PyArrayObject *, PyArray_Descr *, int, PyObject *)) \ - PyArray_API[78]) -#define PyArray_Byteswap \ - (*(PyObject * (*)(PyArrayObject *, npy_bool)) \ - PyArray_API[79]) -#define PyArray_Resize \ - (*(PyObject * (*)(PyArrayObject *, PyArray_Dims *, int, NPY_ORDER)) \ - PyArray_API[80]) -#define PyArray_MoveInto \ - (*(int (*)(PyArrayObject *, PyArrayObject *)) \ - PyArray_API[81]) -#define PyArray_CopyInto \ - (*(int 
(*)(PyArrayObject *, PyArrayObject *)) \ - PyArray_API[82]) -#define PyArray_CopyAnyInto \ - (*(int (*)(PyArrayObject *, PyArrayObject *)) \ - PyArray_API[83]) -#define PyArray_CopyObject \ - (*(int (*)(PyArrayObject *, PyObject *)) \ - PyArray_API[84]) -#define PyArray_NewCopy \ - (*(PyObject * (*)(PyArrayObject *, NPY_ORDER)) \ - PyArray_API[85]) -#define PyArray_ToList \ - (*(PyObject * (*)(PyArrayObject *)) \ - PyArray_API[86]) -#define PyArray_ToString \ - (*(PyObject * (*)(PyArrayObject *, NPY_ORDER)) \ - PyArray_API[87]) -#define PyArray_ToFile \ - (*(int (*)(PyArrayObject *, FILE *, char *, char *)) \ - PyArray_API[88]) -#define PyArray_Dump \ - (*(int (*)(PyObject *, PyObject *, int)) \ - PyArray_API[89]) -#define PyArray_Dumps \ - (*(PyObject * (*)(PyObject *, int)) \ - PyArray_API[90]) -#define PyArray_ValidType \ - (*(int (*)(int)) \ - PyArray_API[91]) -#define PyArray_UpdateFlags \ - (*(void (*)(PyArrayObject *, int)) \ - PyArray_API[92]) -#define PyArray_New \ - (*(PyObject * (*)(PyTypeObject *, int, npy_intp *, int, npy_intp *, void *, int, int, PyObject *)) \ - PyArray_API[93]) -#define PyArray_NewFromDescr \ - (*(PyObject * (*)(PyTypeObject *, PyArray_Descr *, int, npy_intp *, npy_intp *, void *, int, PyObject *)) \ - PyArray_API[94]) -#define PyArray_DescrNew \ - (*(PyArray_Descr * (*)(PyArray_Descr *)) \ - PyArray_API[95]) -#define PyArray_DescrNewFromType \ - (*(PyArray_Descr * (*)(int)) \ - PyArray_API[96]) -#define PyArray_GetPriority \ - (*(double (*)(PyObject *, double)) \ - PyArray_API[97]) -#define PyArray_IterNew \ - (*(PyObject * (*)(PyObject *)) \ - PyArray_API[98]) -#define PyArray_MultiIterNew \ - (*(PyObject * (*)(int, ...)) \ - PyArray_API[99]) -#define PyArray_PyIntAsInt \ - (*(int (*)(PyObject *)) \ - PyArray_API[100]) -#define PyArray_PyIntAsIntp \ - (*(npy_intp (*)(PyObject *)) \ - PyArray_API[101]) -#define PyArray_Broadcast \ - (*(int (*)(PyArrayMultiIterObject *)) \ - PyArray_API[102]) -#define PyArray_FillObjectArray \ - (*(void (*)(PyArrayObject *, PyObject *)) \ - PyArray_API[103]) -#define PyArray_FillWithScalar \ - (*(int (*)(PyArrayObject *, PyObject *)) \ - PyArray_API[104]) -#define PyArray_CheckStrides \ - (*(npy_bool (*)(int, int, npy_intp, npy_intp, npy_intp *, npy_intp *)) \ - PyArray_API[105]) -#define PyArray_DescrNewByteorder \ - (*(PyArray_Descr * (*)(PyArray_Descr *, char)) \ - PyArray_API[106]) -#define PyArray_IterAllButAxis \ - (*(PyObject * (*)(PyObject *, int *)) \ - PyArray_API[107]) -#define PyArray_CheckFromAny \ - (*(PyObject * (*)(PyObject *, PyArray_Descr *, int, int, int, PyObject *)) \ - PyArray_API[108]) -#define PyArray_FromArray \ - (*(PyObject * (*)(PyArrayObject *, PyArray_Descr *, int)) \ - PyArray_API[109]) -#define PyArray_FromInterface \ - (*(PyObject * (*)(PyObject *)) \ - PyArray_API[110]) -#define PyArray_FromStructInterface \ - (*(PyObject * (*)(PyObject *)) \ - PyArray_API[111]) -#define PyArray_FromArrayAttr \ - (*(PyObject * (*)(PyObject *, PyArray_Descr *, PyObject *)) \ - PyArray_API[112]) -#define PyArray_ScalarKind \ - (*(NPY_SCALARKIND (*)(int, PyArrayObject **)) \ - PyArray_API[113]) -#define PyArray_CanCoerceScalar \ - (*(int (*)(int, int, NPY_SCALARKIND)) \ - PyArray_API[114]) -#define PyArray_NewFlagsObject \ - (*(PyObject * (*)(PyObject *)) \ - PyArray_API[115]) -#define PyArray_CanCastScalar \ - (*(npy_bool (*)(PyTypeObject *, PyTypeObject *)) \ - PyArray_API[116]) -#define PyArray_CompareUCS4 \ - (*(int (*)(npy_ucs4 *, npy_ucs4 *, size_t)) \ - PyArray_API[117]) -#define PyArray_RemoveSmallest 
\ - (*(int (*)(PyArrayMultiIterObject *)) \ - PyArray_API[118]) -#define PyArray_ElementStrides \ - (*(int (*)(PyObject *)) \ - PyArray_API[119]) -#define PyArray_Item_INCREF \ - (*(void (*)(char *, PyArray_Descr *)) \ - PyArray_API[120]) -#define PyArray_Item_XDECREF \ - (*(void (*)(char *, PyArray_Descr *)) \ - PyArray_API[121]) -#define PyArray_FieldNames \ - (*(PyObject * (*)(PyObject *)) \ - PyArray_API[122]) -#define PyArray_Transpose \ - (*(PyObject * (*)(PyArrayObject *, PyArray_Dims *)) \ - PyArray_API[123]) -#define PyArray_TakeFrom \ - (*(PyObject * (*)(PyArrayObject *, PyObject *, int, PyArrayObject *, NPY_CLIPMODE)) \ - PyArray_API[124]) -#define PyArray_PutTo \ - (*(PyObject * (*)(PyArrayObject *, PyObject*, PyObject *, NPY_CLIPMODE)) \ - PyArray_API[125]) -#define PyArray_PutMask \ - (*(PyObject * (*)(PyArrayObject *, PyObject*, PyObject*)) \ - PyArray_API[126]) -#define PyArray_Repeat \ - (*(PyObject * (*)(PyArrayObject *, PyObject *, int)) \ - PyArray_API[127]) -#define PyArray_Choose \ - (*(PyObject * (*)(PyArrayObject *, PyObject *, PyArrayObject *, NPY_CLIPMODE)) \ - PyArray_API[128]) -#define PyArray_Sort \ - (*(int (*)(PyArrayObject *, int, NPY_SORTKIND)) \ - PyArray_API[129]) -#define PyArray_ArgSort \ - (*(PyObject * (*)(PyArrayObject *, int, NPY_SORTKIND)) \ - PyArray_API[130]) -#define PyArray_SearchSorted \ - (*(PyObject * (*)(PyArrayObject *, PyObject *, NPY_SEARCHSIDE, PyObject *)) \ - PyArray_API[131]) -#define PyArray_ArgMax \ - (*(PyObject * (*)(PyArrayObject *, int, PyArrayObject *)) \ - PyArray_API[132]) -#define PyArray_ArgMin \ - (*(PyObject * (*)(PyArrayObject *, int, PyArrayObject *)) \ - PyArray_API[133]) -#define PyArray_Reshape \ - (*(PyObject * (*)(PyArrayObject *, PyObject *)) \ - PyArray_API[134]) -#define PyArray_Newshape \ - (*(PyObject * (*)(PyArrayObject *, PyArray_Dims *, NPY_ORDER)) \ - PyArray_API[135]) -#define PyArray_Squeeze \ - (*(PyObject * (*)(PyArrayObject *)) \ - PyArray_API[136]) -#define PyArray_View \ - (*(PyObject * (*)(PyArrayObject *, PyArray_Descr *, PyTypeObject *)) \ - PyArray_API[137]) -#define PyArray_SwapAxes \ - (*(PyObject * (*)(PyArrayObject *, int, int)) \ - PyArray_API[138]) -#define PyArray_Max \ - (*(PyObject * (*)(PyArrayObject *, int, PyArrayObject *)) \ - PyArray_API[139]) -#define PyArray_Min \ - (*(PyObject * (*)(PyArrayObject *, int, PyArrayObject *)) \ - PyArray_API[140]) -#define PyArray_Ptp \ - (*(PyObject * (*)(PyArrayObject *, int, PyArrayObject *)) \ - PyArray_API[141]) -#define PyArray_Mean \ - (*(PyObject * (*)(PyArrayObject *, int, int, PyArrayObject *)) \ - PyArray_API[142]) -#define PyArray_Trace \ - (*(PyObject * (*)(PyArrayObject *, int, int, int, int, PyArrayObject *)) \ - PyArray_API[143]) -#define PyArray_Diagonal \ - (*(PyObject * (*)(PyArrayObject *, int, int, int)) \ - PyArray_API[144]) -#define PyArray_Clip \ - (*(PyObject * (*)(PyArrayObject *, PyObject *, PyObject *, PyArrayObject *)) \ - PyArray_API[145]) -#define PyArray_Conjugate \ - (*(PyObject * (*)(PyArrayObject *, PyArrayObject *)) \ - PyArray_API[146]) -#define PyArray_Nonzero \ - (*(PyObject * (*)(PyArrayObject *)) \ - PyArray_API[147]) -#define PyArray_Std \ - (*(PyObject * (*)(PyArrayObject *, int, int, PyArrayObject *, int)) \ - PyArray_API[148]) -#define PyArray_Sum \ - (*(PyObject * (*)(PyArrayObject *, int, int, PyArrayObject *)) \ - PyArray_API[149]) -#define PyArray_CumSum \ - (*(PyObject * (*)(PyArrayObject *, int, int, PyArrayObject *)) \ - PyArray_API[150]) -#define PyArray_Prod \ - (*(PyObject * (*)(PyArrayObject 
*, int, int, PyArrayObject *)) \ - PyArray_API[151]) -#define PyArray_CumProd \ - (*(PyObject * (*)(PyArrayObject *, int, int, PyArrayObject *)) \ - PyArray_API[152]) -#define PyArray_All \ - (*(PyObject * (*)(PyArrayObject *, int, PyArrayObject *)) \ - PyArray_API[153]) -#define PyArray_Any \ - (*(PyObject * (*)(PyArrayObject *, int, PyArrayObject *)) \ - PyArray_API[154]) -#define PyArray_Compress \ - (*(PyObject * (*)(PyArrayObject *, PyObject *, int, PyArrayObject *)) \ - PyArray_API[155]) -#define PyArray_Flatten \ - (*(PyObject * (*)(PyArrayObject *, NPY_ORDER)) \ - PyArray_API[156]) -#define PyArray_Ravel \ - (*(PyObject * (*)(PyArrayObject *, NPY_ORDER)) \ - PyArray_API[157]) -#define PyArray_MultiplyList \ - (*(npy_intp (*)(npy_intp *, int)) \ - PyArray_API[158]) -#define PyArray_MultiplyIntList \ - (*(int (*)(int *, int)) \ - PyArray_API[159]) -#define PyArray_GetPtr \ - (*(void * (*)(PyArrayObject *, npy_intp*)) \ - PyArray_API[160]) -#define PyArray_CompareLists \ - (*(int (*)(npy_intp *, npy_intp *, int)) \ - PyArray_API[161]) -#define PyArray_AsCArray \ - (*(int (*)(PyObject **, void *, npy_intp *, int, PyArray_Descr*)) \ - PyArray_API[162]) -#define PyArray_As1D \ - (*(int (*)(PyObject **, char **, int *, int)) \ - PyArray_API[163]) -#define PyArray_As2D \ - (*(int (*)(PyObject **, char ***, int *, int *, int)) \ - PyArray_API[164]) -#define PyArray_Free \ - (*(int (*)(PyObject *, void *)) \ - PyArray_API[165]) -#define PyArray_Converter \ - (*(int (*)(PyObject *, PyObject **)) \ - PyArray_API[166]) -#define PyArray_IntpFromSequence \ - (*(int (*)(PyObject *, npy_intp *, int)) \ - PyArray_API[167]) -#define PyArray_Concatenate \ - (*(PyObject * (*)(PyObject *, int)) \ - PyArray_API[168]) -#define PyArray_InnerProduct \ - (*(PyObject * (*)(PyObject *, PyObject *)) \ - PyArray_API[169]) -#define PyArray_MatrixProduct \ - (*(PyObject * (*)(PyObject *, PyObject *)) \ - PyArray_API[170]) -#define PyArray_CopyAndTranspose \ - (*(PyObject * (*)(PyObject *)) \ - PyArray_API[171]) -#define PyArray_Correlate \ - (*(PyObject * (*)(PyObject *, PyObject *, int)) \ - PyArray_API[172]) -#define PyArray_TypestrConvert \ - (*(int (*)(int, int)) \ - PyArray_API[173]) -#define PyArray_DescrConverter \ - (*(int (*)(PyObject *, PyArray_Descr **)) \ - PyArray_API[174]) -#define PyArray_DescrConverter2 \ - (*(int (*)(PyObject *, PyArray_Descr **)) \ - PyArray_API[175]) -#define PyArray_IntpConverter \ - (*(int (*)(PyObject *, PyArray_Dims *)) \ - PyArray_API[176]) -#define PyArray_BufferConverter \ - (*(int (*)(PyObject *, PyArray_Chunk *)) \ - PyArray_API[177]) -#define PyArray_AxisConverter \ - (*(int (*)(PyObject *, int *)) \ - PyArray_API[178]) -#define PyArray_BoolConverter \ - (*(int (*)(PyObject *, npy_bool *)) \ - PyArray_API[179]) -#define PyArray_ByteorderConverter \ - (*(int (*)(PyObject *, char *)) \ - PyArray_API[180]) -#define PyArray_OrderConverter \ - (*(int (*)(PyObject *, NPY_ORDER *)) \ - PyArray_API[181]) -#define PyArray_EquivTypes \ - (*(unsigned char (*)(PyArray_Descr *, PyArray_Descr *)) \ - PyArray_API[182]) -#define PyArray_Zeros \ - (*(PyObject * (*)(int, npy_intp *, PyArray_Descr *, int)) \ - PyArray_API[183]) -#define PyArray_Empty \ - (*(PyObject * (*)(int, npy_intp *, PyArray_Descr *, int)) \ - PyArray_API[184]) -#define PyArray_Where \ - (*(PyObject * (*)(PyObject *, PyObject *, PyObject *)) \ - PyArray_API[185]) -#define PyArray_Arange \ - (*(PyObject * (*)(double, double, double, int)) \ - PyArray_API[186]) -#define PyArray_ArangeObj \ - (*(PyObject * (*)(PyObject 
*, PyObject *, PyObject *, PyArray_Descr *)) \ - PyArray_API[187]) -#define PyArray_SortkindConverter \ - (*(int (*)(PyObject *, NPY_SORTKIND *)) \ - PyArray_API[188]) -#define PyArray_LexSort \ - (*(PyObject * (*)(PyObject *, int)) \ - PyArray_API[189]) -#define PyArray_Round \ - (*(PyObject * (*)(PyArrayObject *, int, PyArrayObject *)) \ - PyArray_API[190]) -#define PyArray_EquivTypenums \ - (*(unsigned char (*)(int, int)) \ - PyArray_API[191]) -#define PyArray_RegisterDataType \ - (*(int (*)(PyArray_Descr *)) \ - PyArray_API[192]) -#define PyArray_RegisterCastFunc \ - (*(int (*)(PyArray_Descr *, int, PyArray_VectorUnaryFunc *)) \ - PyArray_API[193]) -#define PyArray_RegisterCanCast \ - (*(int (*)(PyArray_Descr *, int, NPY_SCALARKIND)) \ - PyArray_API[194]) -#define PyArray_InitArrFuncs \ - (*(void (*)(PyArray_ArrFuncs *)) \ - PyArray_API[195]) -#define PyArray_IntTupleFromIntp \ - (*(PyObject * (*)(int, npy_intp *)) \ - PyArray_API[196]) -#define PyArray_TypeNumFromName \ - (*(int (*)(char *)) \ - PyArray_API[197]) -#define PyArray_ClipmodeConverter \ - (*(int (*)(PyObject *, NPY_CLIPMODE *)) \ - PyArray_API[198]) -#define PyArray_OutputConverter \ - (*(int (*)(PyObject *, PyArrayObject **)) \ - PyArray_API[199]) -#define PyArray_BroadcastToShape \ - (*(PyObject * (*)(PyObject *, npy_intp *, int)) \ - PyArray_API[200]) -#define _PyArray_SigintHandler \ - (*(void (*)(int)) \ - PyArray_API[201]) -#define _PyArray_GetSigintBuf \ - (*(void* (*)(void)) \ - PyArray_API[202]) -#define PyArray_DescrAlignConverter \ - (*(int (*)(PyObject *, PyArray_Descr **)) \ - PyArray_API[203]) -#define PyArray_DescrAlignConverter2 \ - (*(int (*)(PyObject *, PyArray_Descr **)) \ - PyArray_API[204]) -#define PyArray_SearchsideConverter \ - (*(int (*)(PyObject *, void *)) \ - PyArray_API[205]) -#define PyArray_CheckAxis \ - (*(PyObject * (*)(PyArrayObject *, int *, int)) \ - PyArray_API[206]) -#define PyArray_OverflowMultiplyList \ - (*(npy_intp (*)(npy_intp *, int)) \ - PyArray_API[207]) -#define PyArray_CompareString \ - (*(int (*)(char *, char *, size_t)) \ - PyArray_API[208]) -#define PyArray_MultiIterFromObjects \ - (*(PyObject * (*)(PyObject **, int, int, ...)) \ - PyArray_API[209]) -#define PyArray_GetEndianness \ - (*(int (*)(void)) \ - PyArray_API[210]) -#define PyArray_GetNDArrayCFeatureVersion \ - (*(unsigned int (*)(void)) \ - PyArray_API[211]) -#define PyArray_Correlate2 \ - (*(PyObject * (*)(PyObject *, PyObject *, int)) \ - PyArray_API[212]) -#define PyArray_NeighborhoodIterNew \ - (*(PyObject* (*)(PyArrayIterObject *, npy_intp *, int, PyArrayObject*)) \ - PyArray_API[213]) -#define PyTimeIntegerArrType_Type (*(PyTypeObject *)PyArray_API[214]) -#define PyDatetimeArrType_Type (*(PyTypeObject *)PyArray_API[215]) -#define PyTimedeltaArrType_Type (*(PyTypeObject *)PyArray_API[216]) -#define PyHalfArrType_Type (*(PyTypeObject *)PyArray_API[217]) -#define NpyIter_Type (*(PyTypeObject *)PyArray_API[218]) -#define PyArray_SetDatetimeParseFunction \ - (*(void (*)(PyObject *)) \ - PyArray_API[219]) -#define PyArray_DatetimeToDatetimeStruct \ - (*(void (*)(npy_datetime, NPY_DATETIMEUNIT, npy_datetimestruct *)) \ - PyArray_API[220]) -#define PyArray_TimedeltaToTimedeltaStruct \ - (*(void (*)(npy_timedelta, NPY_DATETIMEUNIT, npy_timedeltastruct *)) \ - PyArray_API[221]) -#define PyArray_DatetimeStructToDatetime \ - (*(npy_datetime (*)(NPY_DATETIMEUNIT, npy_datetimestruct *)) \ - PyArray_API[222]) -#define PyArray_TimedeltaStructToTimedelta \ - (*(npy_datetime (*)(NPY_DATETIMEUNIT, npy_timedeltastruct *)) \ - 
PyArray_API[223]) -#define NpyIter_New \ - (*(NpyIter * (*)(PyArrayObject *, npy_uint32, NPY_ORDER, NPY_CASTING, PyArray_Descr*)) \ - PyArray_API[224]) -#define NpyIter_MultiNew \ - (*(NpyIter * (*)(int, PyArrayObject **, npy_uint32, NPY_ORDER, NPY_CASTING, npy_uint32 *, PyArray_Descr **)) \ - PyArray_API[225]) -#define NpyIter_AdvancedNew \ - (*(NpyIter * (*)(int, PyArrayObject **, npy_uint32, NPY_ORDER, NPY_CASTING, npy_uint32 *, PyArray_Descr **, int, int **, npy_intp *, npy_intp)) \ - PyArray_API[226]) -#define NpyIter_Copy \ - (*(NpyIter * (*)(NpyIter *)) \ - PyArray_API[227]) -#define NpyIter_Deallocate \ - (*(int (*)(NpyIter *)) \ - PyArray_API[228]) -#define NpyIter_HasDelayedBufAlloc \ - (*(npy_bool (*)(NpyIter *)) \ - PyArray_API[229]) -#define NpyIter_HasExternalLoop \ - (*(npy_bool (*)(NpyIter *)) \ - PyArray_API[230]) -#define NpyIter_EnableExternalLoop \ - (*(int (*)(NpyIter *)) \ - PyArray_API[231]) -#define NpyIter_GetInnerStrideArray \ - (*(npy_intp * (*)(NpyIter *)) \ - PyArray_API[232]) -#define NpyIter_GetInnerLoopSizePtr \ - (*(npy_intp * (*)(NpyIter *)) \ - PyArray_API[233]) -#define NpyIter_Reset \ - (*(int (*)(NpyIter *, char **)) \ - PyArray_API[234]) -#define NpyIter_ResetBasePointers \ - (*(int (*)(NpyIter *, char **, char **)) \ - PyArray_API[235]) -#define NpyIter_ResetToIterIndexRange \ - (*(int (*)(NpyIter *, npy_intp, npy_intp, char **)) \ - PyArray_API[236]) -#define NpyIter_GetNDim \ - (*(int (*)(NpyIter *)) \ - PyArray_API[237]) -#define NpyIter_GetNOp \ - (*(int (*)(NpyIter *)) \ - PyArray_API[238]) -#define NpyIter_GetIterNext \ - (*(NpyIter_IterNextFunc * (*)(NpyIter *, char **)) \ - PyArray_API[239]) -#define NpyIter_GetIterSize \ - (*(npy_intp (*)(NpyIter *)) \ - PyArray_API[240]) -#define NpyIter_GetIterIndexRange \ - (*(void (*)(NpyIter *, npy_intp *, npy_intp *)) \ - PyArray_API[241]) -#define NpyIter_GetIterIndex \ - (*(npy_intp (*)(NpyIter *)) \ - PyArray_API[242]) -#define NpyIter_GotoIterIndex \ - (*(int (*)(NpyIter *, npy_intp)) \ - PyArray_API[243]) -#define NpyIter_HasMultiIndex \ - (*(npy_bool (*)(NpyIter *)) \ - PyArray_API[244]) -#define NpyIter_GetShape \ - (*(int (*)(NpyIter *, npy_intp *)) \ - PyArray_API[245]) -#define NpyIter_GetGetMultiIndex \ - (*(NpyIter_GetMultiIndexFunc * (*)(NpyIter *, char **)) \ - PyArray_API[246]) -#define NpyIter_GotoMultiIndex \ - (*(int (*)(NpyIter *, npy_intp *)) \ - PyArray_API[247]) -#define NpyIter_RemoveMultiIndex \ - (*(int (*)(NpyIter *)) \ - PyArray_API[248]) -#define NpyIter_HasIndex \ - (*(npy_bool (*)(NpyIter *)) \ - PyArray_API[249]) -#define NpyIter_IsBuffered \ - (*(npy_bool (*)(NpyIter *)) \ - PyArray_API[250]) -#define NpyIter_IsGrowInner \ - (*(npy_bool (*)(NpyIter *)) \ - PyArray_API[251]) -#define NpyIter_GetBufferSize \ - (*(npy_intp (*)(NpyIter *)) \ - PyArray_API[252]) -#define NpyIter_GetIndexPtr \ - (*(npy_intp * (*)(NpyIter *)) \ - PyArray_API[253]) -#define NpyIter_GotoIndex \ - (*(int (*)(NpyIter *, npy_intp)) \ - PyArray_API[254]) -#define NpyIter_GetDataPtrArray \ - (*(char ** (*)(NpyIter *)) \ - PyArray_API[255]) -#define NpyIter_GetDescrArray \ - (*(PyArray_Descr ** (*)(NpyIter *)) \ - PyArray_API[256]) -#define NpyIter_GetOperandArray \ - (*(PyArrayObject ** (*)(NpyIter *)) \ - PyArray_API[257]) -#define NpyIter_GetIterView \ - (*(PyArrayObject * (*)(NpyIter *, npy_intp)) \ - PyArray_API[258]) -#define NpyIter_GetReadFlags \ - (*(void (*)(NpyIter *, char *)) \ - PyArray_API[259]) -#define NpyIter_GetWriteFlags \ - (*(void (*)(NpyIter *, char *)) \ - PyArray_API[260]) 
-#define NpyIter_DebugPrint \ - (*(void (*)(NpyIter *)) \ - PyArray_API[261]) -#define NpyIter_IterationNeedsAPI \ - (*(npy_bool (*)(NpyIter *)) \ - PyArray_API[262]) -#define NpyIter_GetInnerFixedStrideArray \ - (*(void (*)(NpyIter *, npy_intp *)) \ - PyArray_API[263]) -#define NpyIter_RemoveAxis \ - (*(int (*)(NpyIter *, int)) \ - PyArray_API[264]) -#define NpyIter_GetAxisStrideArray \ - (*(npy_intp * (*)(NpyIter *, int)) \ - PyArray_API[265]) -#define NpyIter_RequiresBuffering \ - (*(npy_bool (*)(NpyIter *)) \ - PyArray_API[266]) -#define NpyIter_GetInitialDataPtrArray \ - (*(char ** (*)(NpyIter *)) \ - PyArray_API[267]) -#define NpyIter_CreateCompatibleStrides \ - (*(int (*)(NpyIter *, npy_intp, npy_intp *)) \ - PyArray_API[268]) -#define PyArray_CastingConverter \ - (*(int (*)(PyObject *, NPY_CASTING *)) \ - PyArray_API[269]) -#define PyArray_CountNonzero \ - (*(npy_intp (*)(PyArrayObject *)) \ - PyArray_API[270]) -#define PyArray_PromoteTypes \ - (*(PyArray_Descr * (*)(PyArray_Descr *, PyArray_Descr *)) \ - PyArray_API[271]) -#define PyArray_MinScalarType \ - (*(PyArray_Descr * (*)(PyArrayObject *)) \ - PyArray_API[272]) -#define PyArray_ResultType \ - (*(PyArray_Descr * (*)(npy_intp, PyArrayObject **, npy_intp, PyArray_Descr **)) \ - PyArray_API[273]) -#define PyArray_CanCastArrayTo \ - (*(npy_bool (*)(PyArrayObject *, PyArray_Descr *, NPY_CASTING)) \ - PyArray_API[274]) -#define PyArray_CanCastTypeTo \ - (*(npy_bool (*)(PyArray_Descr *, PyArray_Descr *, NPY_CASTING)) \ - PyArray_API[275]) -#define PyArray_EinsteinSum \ - (*(PyArrayObject * (*)(char *, npy_intp, PyArrayObject **, PyArray_Descr *, NPY_ORDER, NPY_CASTING, PyArrayObject *)) \ - PyArray_API[276]) -#define PyArray_NewLikeArray \ - (*(PyObject * (*)(PyArrayObject *, NPY_ORDER, PyArray_Descr *, int)) \ - PyArray_API[277]) -#define PyArray_GetArrayParamsFromObject \ - (*(int (*)(PyObject *, PyArray_Descr *, npy_bool, PyArray_Descr **, int *, npy_intp *, PyArrayObject **, PyObject *)) \ - PyArray_API[278]) -#define PyArray_ConvertClipmodeSequence \ - (*(int (*)(PyObject *, NPY_CLIPMODE *, int)) \ - PyArray_API[279]) -#define PyArray_MatrixProduct2 \ - (*(PyObject * (*)(PyObject *, PyObject *, PyArrayObject*)) \ - PyArray_API[280]) -#define NpyIter_IsFirstVisit \ - (*(npy_bool (*)(NpyIter *, int)) \ - PyArray_API[281]) -#define PyArray_SetBaseObject \ - (*(int (*)(PyArrayObject *, PyObject *)) \ - PyArray_API[282]) -#define PyArray_CreateSortedStridePerm \ - (*(void (*)(int, npy_intp *, npy_stride_sort_item *)) \ - PyArray_API[283]) -#define PyArray_RemoveAxesInPlace \ - (*(void (*)(PyArrayObject *, npy_bool *)) \ - PyArray_API[284]) -#define PyArray_DebugPrint \ - (*(void (*)(PyArrayObject *)) \ - PyArray_API[285]) -#define PyArray_FailUnlessWriteable \ - (*(int (*)(PyArrayObject *, const char *)) \ - PyArray_API[286]) -#define PyArray_SetUpdateIfCopyBase \ - (*(int (*)(PyArrayObject *, PyArrayObject *)) \ - PyArray_API[287]) -#define PyDataMem_NEW \ - (*(void * (*)(size_t)) \ - PyArray_API[288]) -#define PyDataMem_FREE \ - (*(void (*)(void *)) \ - PyArray_API[289]) -#define PyDataMem_RENEW \ - (*(void * (*)(void *, size_t)) \ - PyArray_API[290]) -#define PyDataMem_SetEventHook \ - (*(PyDataMem_EventHookFunc * (*)(PyDataMem_EventHookFunc *, void *, void **)) \ - PyArray_API[291]) -#define NPY_DEFAULT_ASSIGN_CASTING (*(NPY_CASTING *)PyArray_API[292]) - -#if !defined(NO_IMPORT_ARRAY) && !defined(NO_IMPORT) -static int -_import_array(void) -{ - int st; - PyObject *numpy = PyImport_ImportModule("numpy.core.multiarray"); - 
PyObject *c_api = NULL; - - if (numpy == NULL) { - PyErr_SetString(PyExc_ImportError, "numpy.core.multiarray failed to import"); - return -1; - } - c_api = PyObject_GetAttrString(numpy, "_ARRAY_API"); - Py_DECREF(numpy); - if (c_api == NULL) { - PyErr_SetString(PyExc_AttributeError, "_ARRAY_API not found"); - return -1; - } - -#if PY_VERSION_HEX >= 0x03000000 - if (!PyCapsule_CheckExact(c_api)) { - PyErr_SetString(PyExc_RuntimeError, "_ARRAY_API is not PyCapsule object"); - Py_DECREF(c_api); - return -1; - } - PyArray_API = (void **)PyCapsule_GetPointer(c_api, NULL); -#else - if (!PyCObject_Check(c_api)) { - PyErr_SetString(PyExc_RuntimeError, "_ARRAY_API is not PyCObject object"); - Py_DECREF(c_api); - return -1; - } - PyArray_API = (void **)PyCObject_AsVoidPtr(c_api); -#endif - Py_DECREF(c_api); - if (PyArray_API == NULL) { - PyErr_SetString(PyExc_RuntimeError, "_ARRAY_API is NULL pointer"); - return -1; - } - - /* Perform runtime check of C API version */ - if (NPY_VERSION != PyArray_GetNDArrayCVersion()) { - PyErr_Format(PyExc_RuntimeError, "module compiled against "\ - "ABI version %x but this version of numpy is %x", \ - (int) NPY_VERSION, (int) PyArray_GetNDArrayCVersion()); - return -1; - } - if (NPY_FEATURE_VERSION > PyArray_GetNDArrayCFeatureVersion()) { - PyErr_Format(PyExc_RuntimeError, "module compiled against "\ - "API version %x but this version of numpy is %x", \ - (int) NPY_FEATURE_VERSION, (int) PyArray_GetNDArrayCFeatureVersion()); - return -1; - } - - /* - * Perform runtime check of endianness and check it matches the one set by - * the headers (npy_endian.h) as a safeguard - */ - st = PyArray_GetEndianness(); - if (st == NPY_CPU_UNKNOWN_ENDIAN) { - PyErr_Format(PyExc_RuntimeError, "FATAL: module compiled as unknown endian"); - return -1; - } -#if NPY_BYTE_ORDER == NPY_BIG_ENDIAN - if (st != NPY_CPU_BIG) { - PyErr_Format(PyExc_RuntimeError, "FATAL: module compiled as "\ - "big endian, but detected different endianness at runtime"); - return -1; - } -#elif NPY_BYTE_ORDER == NPY_LITTLE_ENDIAN - if (st != NPY_CPU_LITTLE) { - PyErr_Format(PyExc_RuntimeError, "FATAL: module compiled as "\ - "little endian, but detected different endianness at runtime"); - return -1; - } -#endif - - return 0; -} - -#if PY_VERSION_HEX >= 0x03000000 -#define NUMPY_IMPORT_ARRAY_RETVAL NULL -#else -#define NUMPY_IMPORT_ARRAY_RETVAL -#endif - -#define import_array() {if (_import_array() < 0) {PyErr_Print(); PyErr_SetString(PyExc_ImportError, "numpy.core.multiarray failed to import"); return NUMPY_IMPORT_ARRAY_RETVAL; } } - -#define import_array1(ret) {if (_import_array() < 0) {PyErr_Print(); PyErr_SetString(PyExc_ImportError, "numpy.core.multiarray failed to import"); return ret; } } - -#define import_array2(msg, ret) {if (_import_array() < 0) {PyErr_Print(); PyErr_SetString(PyExc_ImportError, msg); return ret; } } - -#endif - -#endif diff --git a/include/numpy/__ufunc_api.h b/include/numpy/__ufunc_api.h deleted file mode 100644 index fd81d07b5..000000000 --- a/include/numpy/__ufunc_api.h +++ /dev/null @@ -1,323 +0,0 @@ - -#ifdef _UMATHMODULE - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION -extern NPY_NO_EXPORT PyTypeObject PyUFunc_Type; -#else -NPY_NO_EXPORT PyTypeObject PyUFunc_Type; -#endif - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - extern NPY_NO_EXPORT PyTypeObject PyUFunc_Type; -#else - NPY_NO_EXPORT PyTypeObject PyUFunc_Type; -#endif - -NPY_NO_EXPORT PyObject * PyUFunc_FromFuncAndData \ - (PyUFuncGenericFunction *, void **, char *, int, int, int, int, char *, char *, int); -NPY_NO_EXPORT int 
PyUFunc_RegisterLoopForType \ - (PyUFuncObject *, int, PyUFuncGenericFunction, int *, void *); -NPY_NO_EXPORT int PyUFunc_GenericFunction \ - (PyUFuncObject *, PyObject *, PyObject *, PyArrayObject **); -NPY_NO_EXPORT void PyUFunc_f_f_As_d_d \ - (char **, npy_intp *, npy_intp *, void *); -NPY_NO_EXPORT void PyUFunc_d_d \ - (char **, npy_intp *, npy_intp *, void *); -NPY_NO_EXPORT void PyUFunc_f_f \ - (char **, npy_intp *, npy_intp *, void *); -NPY_NO_EXPORT void PyUFunc_g_g \ - (char **, npy_intp *, npy_intp *, void *); -NPY_NO_EXPORT void PyUFunc_F_F_As_D_D \ - (char **, npy_intp *, npy_intp *, void *); -NPY_NO_EXPORT void PyUFunc_F_F \ - (char **, npy_intp *, npy_intp *, void *); -NPY_NO_EXPORT void PyUFunc_D_D \ - (char **, npy_intp *, npy_intp *, void *); -NPY_NO_EXPORT void PyUFunc_G_G \ - (char **, npy_intp *, npy_intp *, void *); -NPY_NO_EXPORT void PyUFunc_O_O \ - (char **, npy_intp *, npy_intp *, void *); -NPY_NO_EXPORT void PyUFunc_ff_f_As_dd_d \ - (char **, npy_intp *, npy_intp *, void *); -NPY_NO_EXPORT void PyUFunc_ff_f \ - (char **, npy_intp *, npy_intp *, void *); -NPY_NO_EXPORT void PyUFunc_dd_d \ - (char **, npy_intp *, npy_intp *, void *); -NPY_NO_EXPORT void PyUFunc_gg_g \ - (char **, npy_intp *, npy_intp *, void *); -NPY_NO_EXPORT void PyUFunc_FF_F_As_DD_D \ - (char **, npy_intp *, npy_intp *, void *); -NPY_NO_EXPORT void PyUFunc_DD_D \ - (char **, npy_intp *, npy_intp *, void *); -NPY_NO_EXPORT void PyUFunc_FF_F \ - (char **, npy_intp *, npy_intp *, void *); -NPY_NO_EXPORT void PyUFunc_GG_G \ - (char **, npy_intp *, npy_intp *, void *); -NPY_NO_EXPORT void PyUFunc_OO_O \ - (char **, npy_intp *, npy_intp *, void *); -NPY_NO_EXPORT void PyUFunc_O_O_method \ - (char **, npy_intp *, npy_intp *, void *); -NPY_NO_EXPORT void PyUFunc_OO_O_method \ - (char **, npy_intp *, npy_intp *, void *); -NPY_NO_EXPORT void PyUFunc_On_Om \ - (char **, npy_intp *, npy_intp *, void *); -NPY_NO_EXPORT int PyUFunc_GetPyValues \ - (char *, int *, int *, PyObject **); -NPY_NO_EXPORT int PyUFunc_checkfperr \ - (int, PyObject *, int *); -NPY_NO_EXPORT void PyUFunc_clearfperr \ - (void); -NPY_NO_EXPORT int PyUFunc_getfperr \ - (void); -NPY_NO_EXPORT int PyUFunc_handlefperr \ - (int, PyObject *, int, int *); -NPY_NO_EXPORT int PyUFunc_ReplaceLoopBySignature \ - (PyUFuncObject *, PyUFuncGenericFunction, int *, PyUFuncGenericFunction *); -NPY_NO_EXPORT PyObject * PyUFunc_FromFuncAndDataAndSignature \ - (PyUFuncGenericFunction *, void **, char *, int, int, int, int, char *, char *, int, const char *); -NPY_NO_EXPORT int PyUFunc_SetUsesArraysAsData \ - (void **, size_t); -NPY_NO_EXPORT void PyUFunc_e_e \ - (char **, npy_intp *, npy_intp *, void *); -NPY_NO_EXPORT void PyUFunc_e_e_As_f_f \ - (char **, npy_intp *, npy_intp *, void *); -NPY_NO_EXPORT void PyUFunc_e_e_As_d_d \ - (char **, npy_intp *, npy_intp *, void *); -NPY_NO_EXPORT void PyUFunc_ee_e \ - (char **, npy_intp *, npy_intp *, void *); -NPY_NO_EXPORT void PyUFunc_ee_e_As_ff_f \ - (char **, npy_intp *, npy_intp *, void *); -NPY_NO_EXPORT void PyUFunc_ee_e_As_dd_d \ - (char **, npy_intp *, npy_intp *, void *); -NPY_NO_EXPORT int PyUFunc_DefaultTypeResolver \ - (PyUFuncObject *, NPY_CASTING, PyArrayObject **, PyObject *, PyArray_Descr **); -NPY_NO_EXPORT int PyUFunc_ValidateCasting \ - (PyUFuncObject *, NPY_CASTING, PyArrayObject **, PyArray_Descr **); - -#else - -#if defined(PY_UFUNC_UNIQUE_SYMBOL) -#define PyUFunc_API PY_UFUNC_UNIQUE_SYMBOL -#endif - -#if defined(NO_IMPORT) || defined(NO_IMPORT_UFUNC) -extern void **PyUFunc_API; -#else -#if 
defined(PY_UFUNC_UNIQUE_SYMBOL) -void **PyUFunc_API; -#else -static void **PyUFunc_API=NULL; -#endif -#endif - -#define PyUFunc_Type (*(PyTypeObject *)PyUFunc_API[0]) -#define PyUFunc_FromFuncAndData \ - (*(PyObject * (*)(PyUFuncGenericFunction *, void **, char *, int, int, int, int, char *, char *, int)) \ - PyUFunc_API[1]) -#define PyUFunc_RegisterLoopForType \ - (*(int (*)(PyUFuncObject *, int, PyUFuncGenericFunction, int *, void *)) \ - PyUFunc_API[2]) -#define PyUFunc_GenericFunction \ - (*(int (*)(PyUFuncObject *, PyObject *, PyObject *, PyArrayObject **)) \ - PyUFunc_API[3]) -#define PyUFunc_f_f_As_d_d \ - (*(void (*)(char **, npy_intp *, npy_intp *, void *)) \ - PyUFunc_API[4]) -#define PyUFunc_d_d \ - (*(void (*)(char **, npy_intp *, npy_intp *, void *)) \ - PyUFunc_API[5]) -#define PyUFunc_f_f \ - (*(void (*)(char **, npy_intp *, npy_intp *, void *)) \ - PyUFunc_API[6]) -#define PyUFunc_g_g \ - (*(void (*)(char **, npy_intp *, npy_intp *, void *)) \ - PyUFunc_API[7]) -#define PyUFunc_F_F_As_D_D \ - (*(void (*)(char **, npy_intp *, npy_intp *, void *)) \ - PyUFunc_API[8]) -#define PyUFunc_F_F \ - (*(void (*)(char **, npy_intp *, npy_intp *, void *)) \ - PyUFunc_API[9]) -#define PyUFunc_D_D \ - (*(void (*)(char **, npy_intp *, npy_intp *, void *)) \ - PyUFunc_API[10]) -#define PyUFunc_G_G \ - (*(void (*)(char **, npy_intp *, npy_intp *, void *)) \ - PyUFunc_API[11]) -#define PyUFunc_O_O \ - (*(void (*)(char **, npy_intp *, npy_intp *, void *)) \ - PyUFunc_API[12]) -#define PyUFunc_ff_f_As_dd_d \ - (*(void (*)(char **, npy_intp *, npy_intp *, void *)) \ - PyUFunc_API[13]) -#define PyUFunc_ff_f \ - (*(void (*)(char **, npy_intp *, npy_intp *, void *)) \ - PyUFunc_API[14]) -#define PyUFunc_dd_d \ - (*(void (*)(char **, npy_intp *, npy_intp *, void *)) \ - PyUFunc_API[15]) -#define PyUFunc_gg_g \ - (*(void (*)(char **, npy_intp *, npy_intp *, void *)) \ - PyUFunc_API[16]) -#define PyUFunc_FF_F_As_DD_D \ - (*(void (*)(char **, npy_intp *, npy_intp *, void *)) \ - PyUFunc_API[17]) -#define PyUFunc_DD_D \ - (*(void (*)(char **, npy_intp *, npy_intp *, void *)) \ - PyUFunc_API[18]) -#define PyUFunc_FF_F \ - (*(void (*)(char **, npy_intp *, npy_intp *, void *)) \ - PyUFunc_API[19]) -#define PyUFunc_GG_G \ - (*(void (*)(char **, npy_intp *, npy_intp *, void *)) \ - PyUFunc_API[20]) -#define PyUFunc_OO_O \ - (*(void (*)(char **, npy_intp *, npy_intp *, void *)) \ - PyUFunc_API[21]) -#define PyUFunc_O_O_method \ - (*(void (*)(char **, npy_intp *, npy_intp *, void *)) \ - PyUFunc_API[22]) -#define PyUFunc_OO_O_method \ - (*(void (*)(char **, npy_intp *, npy_intp *, void *)) \ - PyUFunc_API[23]) -#define PyUFunc_On_Om \ - (*(void (*)(char **, npy_intp *, npy_intp *, void *)) \ - PyUFunc_API[24]) -#define PyUFunc_GetPyValues \ - (*(int (*)(char *, int *, int *, PyObject **)) \ - PyUFunc_API[25]) -#define PyUFunc_checkfperr \ - (*(int (*)(int, PyObject *, int *)) \ - PyUFunc_API[26]) -#define PyUFunc_clearfperr \ - (*(void (*)(void)) \ - PyUFunc_API[27]) -#define PyUFunc_getfperr \ - (*(int (*)(void)) \ - PyUFunc_API[28]) -#define PyUFunc_handlefperr \ - (*(int (*)(int, PyObject *, int, int *)) \ - PyUFunc_API[29]) -#define PyUFunc_ReplaceLoopBySignature \ - (*(int (*)(PyUFuncObject *, PyUFuncGenericFunction, int *, PyUFuncGenericFunction *)) \ - PyUFunc_API[30]) -#define PyUFunc_FromFuncAndDataAndSignature \ - (*(PyObject * (*)(PyUFuncGenericFunction *, void **, char *, int, int, int, int, char *, char *, int, const char *)) \ - PyUFunc_API[31]) -#define PyUFunc_SetUsesArraysAsData \ - (*(int 
(*)(void **, size_t)) \ - PyUFunc_API[32]) -#define PyUFunc_e_e \ - (*(void (*)(char **, npy_intp *, npy_intp *, void *)) \ - PyUFunc_API[33]) -#define PyUFunc_e_e_As_f_f \ - (*(void (*)(char **, npy_intp *, npy_intp *, void *)) \ - PyUFunc_API[34]) -#define PyUFunc_e_e_As_d_d \ - (*(void (*)(char **, npy_intp *, npy_intp *, void *)) \ - PyUFunc_API[35]) -#define PyUFunc_ee_e \ - (*(void (*)(char **, npy_intp *, npy_intp *, void *)) \ - PyUFunc_API[36]) -#define PyUFunc_ee_e_As_ff_f \ - (*(void (*)(char **, npy_intp *, npy_intp *, void *)) \ - PyUFunc_API[37]) -#define PyUFunc_ee_e_As_dd_d \ - (*(void (*)(char **, npy_intp *, npy_intp *, void *)) \ - PyUFunc_API[38]) -#define PyUFunc_DefaultTypeResolver \ - (*(int (*)(PyUFuncObject *, NPY_CASTING, PyArrayObject **, PyObject *, PyArray_Descr **)) \ - PyUFunc_API[39]) -#define PyUFunc_ValidateCasting \ - (*(int (*)(PyUFuncObject *, NPY_CASTING, PyArrayObject **, PyArray_Descr **)) \ - PyUFunc_API[40]) - -static int -_import_umath(void) -{ - PyObject *numpy = PyImport_ImportModule("numpy.core.umath"); - PyObject *c_api = NULL; - - if (numpy == NULL) { - PyErr_SetString(PyExc_ImportError, "numpy.core.umath failed to import"); - return -1; - } - c_api = PyObject_GetAttrString(numpy, "_UFUNC_API"); - Py_DECREF(numpy); - if (c_api == NULL) { - PyErr_SetString(PyExc_AttributeError, "_UFUNC_API not found"); - return -1; - } - -#if PY_VERSION_HEX >= 0x03000000 - if (!PyCapsule_CheckExact(c_api)) { - PyErr_SetString(PyExc_RuntimeError, "_UFUNC_API is not PyCapsule object"); - Py_DECREF(c_api); - return -1; - } - PyUFunc_API = (void **)PyCapsule_GetPointer(c_api, NULL); -#else - if (!PyCObject_Check(c_api)) { - PyErr_SetString(PyExc_RuntimeError, "_UFUNC_API is not PyCObject object"); - Py_DECREF(c_api); - return -1; - } - PyUFunc_API = (void **)PyCObject_AsVoidPtr(c_api); -#endif - Py_DECREF(c_api); - if (PyUFunc_API == NULL) { - PyErr_SetString(PyExc_RuntimeError, "_UFUNC_API is NULL pointer"); - return -1; - } - return 0; -} - -#if PY_VERSION_HEX >= 0x03000000 -#define NUMPY_IMPORT_UMATH_RETVAL NULL -#else -#define NUMPY_IMPORT_UMATH_RETVAL -#endif - -#define import_umath() \ - do {\ - UFUNC_NOFPE\ - if (_import_umath() < 0) {\ - PyErr_Print();\ - PyErr_SetString(PyExc_ImportError,\ - "numpy.core.umath failed to import");\ - return NUMPY_IMPORT_UMATH_RETVAL;\ - }\ - } while(0) - -#define import_umath1(ret) \ - do {\ - UFUNC_NOFPE\ - if (_import_umath() < 0) {\ - PyErr_Print();\ - PyErr_SetString(PyExc_ImportError,\ - "numpy.core.umath failed to import");\ - return ret;\ - }\ - } while(0) - -#define import_umath2(ret, msg) \ - do {\ - UFUNC_NOFPE\ - if (_import_umath() < 0) {\ - PyErr_Print();\ - PyErr_SetString(PyExc_ImportError, msg);\ - return ret;\ - }\ - } while(0) - -#define import_ufunc() \ - do {\ - UFUNC_NOFPE\ - if (_import_umath() < 0) {\ - PyErr_Print();\ - PyErr_SetString(PyExc_ImportError,\ - "numpy.core.umath failed to import");\ - }\ - } while(0) - -#endif diff --git a/include/numpy/_neighborhood_iterator_imp.h b/include/numpy/_neighborhood_iterator_imp.h deleted file mode 100644 index e8860cbc7..000000000 --- a/include/numpy/_neighborhood_iterator_imp.h +++ /dev/null @@ -1,90 +0,0 @@ -#ifndef _NPY_INCLUDE_NEIGHBORHOOD_IMP -#error You should not include this header directly -#endif -/* - * Private API (here for inline) - */ -static NPY_INLINE int -_PyArrayNeighborhoodIter_IncrCoord(PyArrayNeighborhoodIterObject* iter); - -/* - * Update to next item of the iterator - * - * Note: this simply increment the coordinates vector, last 
dimension - * incremented first , i.e, for dimension 3 - * ... - * -1, -1, -1 - * -1, -1, 0 - * -1, -1, 1 - * .... - * -1, 0, -1 - * -1, 0, 0 - * .... - * 0, -1, -1 - * 0, -1, 0 - * .... - */ -#define _UPDATE_COORD_ITER(c) \ - wb = iter->coordinates[c] < iter->bounds[c][1]; \ - if (wb) { \ - iter->coordinates[c] += 1; \ - return 0; \ - } \ - else { \ - iter->coordinates[c] = iter->bounds[c][0]; \ - } - -static NPY_INLINE int -_PyArrayNeighborhoodIter_IncrCoord(PyArrayNeighborhoodIterObject* iter) -{ - npy_intp i, wb; - - for (i = iter->nd - 1; i >= 0; --i) { - _UPDATE_COORD_ITER(i) - } - - return 0; -} - -/* - * Version optimized for 2d arrays, manual loop unrolling - */ -static NPY_INLINE int -_PyArrayNeighborhoodIter_IncrCoord2D(PyArrayNeighborhoodIterObject* iter) -{ - npy_intp wb; - - _UPDATE_COORD_ITER(1) - _UPDATE_COORD_ITER(0) - - return 0; -} -#undef _UPDATE_COORD_ITER - -/* - * Advance to the next neighbour - */ -static NPY_INLINE int -PyArrayNeighborhoodIter_Next(PyArrayNeighborhoodIterObject* iter) -{ - _PyArrayNeighborhoodIter_IncrCoord (iter); - iter->dataptr = iter->translate((PyArrayIterObject*)iter, iter->coordinates); - - return 0; -} - -/* - * Reset functions - */ -static NPY_INLINE int -PyArrayNeighborhoodIter_Reset(PyArrayNeighborhoodIterObject* iter) -{ - npy_intp i; - - for (i = 0; i < iter->nd; ++i) { - iter->coordinates[i] = iter->bounds[i][0]; - } - iter->dataptr = iter->translate((PyArrayIterObject*)iter, iter->coordinates); - - return 0; -} diff --git a/include/numpy/_numpyconfig.h b/include/numpy/_numpyconfig.h deleted file mode 100644 index d55ffc38d..000000000 --- a/include/numpy/_numpyconfig.h +++ /dev/null @@ -1,29 +0,0 @@ -#define NPY_SIZEOF_SHORT SIZEOF_SHORT -#define NPY_SIZEOF_INT SIZEOF_INT -#define NPY_SIZEOF_LONG SIZEOF_LONG -#define NPY_SIZEOF_FLOAT 4 -#define NPY_SIZEOF_COMPLEX_FLOAT 8 -#define NPY_SIZEOF_DOUBLE 8 -#define NPY_SIZEOF_COMPLEX_DOUBLE 16 -#define NPY_SIZEOF_LONGDOUBLE 16 -#define NPY_SIZEOF_COMPLEX_LONGDOUBLE 32 -#define NPY_SIZEOF_PY_INTPTR_T 8 -#define NPY_SIZEOF_PY_LONG_LONG 8 -#define NPY_SIZEOF_LONGLONG 8 -#define NPY_NO_SMP 0 -#define NPY_HAVE_DECL_ISNAN -#define NPY_HAVE_DECL_ISINF -#define NPY_HAVE_DECL_ISFINITE -#define NPY_HAVE_DECL_SIGNBIT -#define NPY_USE_C99_COMPLEX 1 -#define NPY_HAVE_COMPLEX_DOUBLE 1 -#define NPY_HAVE_COMPLEX_FLOAT 1 -#define NPY_HAVE_COMPLEX_LONG_DOUBLE 1 -#define NPY_USE_C99_FORMATS 1 -#define NPY_VISIBILITY_HIDDEN __attribute__((visibility("hidden"))) -#define NPY_ABI_VERSION 0x01000009 -#define NPY_API_VERSION 0x00000007 - -#ifndef __STDC_FORMAT_MACROS -#define __STDC_FORMAT_MACROS 1 -#endif diff --git a/include/numpy/arrayobject.h b/include/numpy/arrayobject.h deleted file mode 100644 index a84766f63..000000000 --- a/include/numpy/arrayobject.h +++ /dev/null @@ -1,22 +0,0 @@ - -/* This expects the following variables to be defined (besides - the usual ones from pyconfig.h - - SIZEOF_LONG_DOUBLE -- sizeof(long double) or sizeof(double) if no - long double is present on platform. 
- CHAR_BIT -- number of bits in a char (usually 8) - (should be in limits.h) - -*/ - -#ifndef Py_ARRAYOBJECT_H -#define Py_ARRAYOBJECT_H - -#include "ndarrayobject.h" -#include "npy_interrupt.h" - -#ifdef NPY_NO_PREFIX -#include "noprefix.h" -#endif - -#endif diff --git a/include/numpy/arrayscalars.h b/include/numpy/arrayscalars.h deleted file mode 100644 index 64450e713..000000000 --- a/include/numpy/arrayscalars.h +++ /dev/null @@ -1,175 +0,0 @@ -#ifndef _NPY_ARRAYSCALARS_H_ -#define _NPY_ARRAYSCALARS_H_ - -#ifndef _MULTIARRAYMODULE -typedef struct { - PyObject_HEAD - npy_bool obval; -} PyBoolScalarObject; -#endif - - -typedef struct { - PyObject_HEAD - signed char obval; -} PyByteScalarObject; - - -typedef struct { - PyObject_HEAD - short obval; -} PyShortScalarObject; - - -typedef struct { - PyObject_HEAD - int obval; -} PyIntScalarObject; - - -typedef struct { - PyObject_HEAD - long obval; -} PyLongScalarObject; - - -typedef struct { - PyObject_HEAD - npy_longlong obval; -} PyLongLongScalarObject; - - -typedef struct { - PyObject_HEAD - unsigned char obval; -} PyUByteScalarObject; - - -typedef struct { - PyObject_HEAD - unsigned short obval; -} PyUShortScalarObject; - - -typedef struct { - PyObject_HEAD - unsigned int obval; -} PyUIntScalarObject; - - -typedef struct { - PyObject_HEAD - unsigned long obval; -} PyULongScalarObject; - - -typedef struct { - PyObject_HEAD - npy_ulonglong obval; -} PyULongLongScalarObject; - - -typedef struct { - PyObject_HEAD - npy_half obval; -} PyHalfScalarObject; - - -typedef struct { - PyObject_HEAD - float obval; -} PyFloatScalarObject; - - -typedef struct { - PyObject_HEAD - double obval; -} PyDoubleScalarObject; - - -typedef struct { - PyObject_HEAD - npy_longdouble obval; -} PyLongDoubleScalarObject; - - -typedef struct { - PyObject_HEAD - npy_cfloat obval; -} PyCFloatScalarObject; - - -typedef struct { - PyObject_HEAD - npy_cdouble obval; -} PyCDoubleScalarObject; - - -typedef struct { - PyObject_HEAD - npy_clongdouble obval; -} PyCLongDoubleScalarObject; - - -typedef struct { - PyObject_HEAD - PyObject * obval; -} PyObjectScalarObject; - -typedef struct { - PyObject_HEAD - npy_datetime obval; - PyArray_DatetimeMetaData obmeta; -} PyDatetimeScalarObject; - -typedef struct { - PyObject_HEAD - npy_timedelta obval; - PyArray_DatetimeMetaData obmeta; -} PyTimedeltaScalarObject; - - -typedef struct { - PyObject_HEAD - char obval; -} PyScalarObject; - -#define PyStringScalarObject PyStringObject -#define PyUnicodeScalarObject PyUnicodeObject - -typedef struct { - PyObject_VAR_HEAD - char *obval; - PyArray_Descr *descr; - int flags; - PyObject *base; -} PyVoidScalarObject; - -/* Macros - PyScalarObject - PyArrType_Type - are defined in ndarrayobject.h -*/ - -#define PyArrayScalar_False ((PyObject *)(&(_PyArrayScalar_BoolValues[0]))) -#define PyArrayScalar_True ((PyObject *)(&(_PyArrayScalar_BoolValues[1]))) -#define PyArrayScalar_FromLong(i) \ - ((PyObject *)(&(_PyArrayScalar_BoolValues[((i)!=0)]))) -#define PyArrayScalar_RETURN_BOOL_FROM_LONG(i) \ - return Py_INCREF(PyArrayScalar_FromLong(i)), \ - PyArrayScalar_FromLong(i) -#define PyArrayScalar_RETURN_FALSE \ - return Py_INCREF(PyArrayScalar_False), \ - PyArrayScalar_False -#define PyArrayScalar_RETURN_TRUE \ - return Py_INCREF(PyArrayScalar_True), \ - PyArrayScalar_True - -#define PyArrayScalar_New(cls) \ - Py##cls##ArrType_Type.tp_alloc(&Py##cls##ArrType_Type, 0) -#define PyArrayScalar_VAL(obj, cls) \ - ((Py##cls##ScalarObject *)obj)->obval -#define PyArrayScalar_ASSIGN(obj, cls, val) \ - 
PyArrayScalar_VAL(obj, cls) = val - -#endif diff --git a/include/numpy/halffloat.h b/include/numpy/halffloat.h deleted file mode 100644 index 944f0ea34..000000000 --- a/include/numpy/halffloat.h +++ /dev/null @@ -1,69 +0,0 @@ -#ifndef __NPY_HALFFLOAT_H__ -#define __NPY_HALFFLOAT_H__ - -#include -#include - -#ifdef __cplusplus -extern "C" { -#endif - -/* - * Half-precision routines - */ - -/* Conversions */ -float npy_half_to_float(npy_half h); -double npy_half_to_double(npy_half h); -npy_half npy_float_to_half(float f); -npy_half npy_double_to_half(double d); -/* Comparisons */ -int npy_half_eq(npy_half h1, npy_half h2); -int npy_half_ne(npy_half h1, npy_half h2); -int npy_half_le(npy_half h1, npy_half h2); -int npy_half_lt(npy_half h1, npy_half h2); -int npy_half_ge(npy_half h1, npy_half h2); -int npy_half_gt(npy_half h1, npy_half h2); -/* faster *_nonan variants for when you know h1 and h2 are not NaN */ -int npy_half_eq_nonan(npy_half h1, npy_half h2); -int npy_half_lt_nonan(npy_half h1, npy_half h2); -int npy_half_le_nonan(npy_half h1, npy_half h2); -/* Miscellaneous functions */ -int npy_half_iszero(npy_half h); -int npy_half_isnan(npy_half h); -int npy_half_isinf(npy_half h); -int npy_half_isfinite(npy_half h); -int npy_half_signbit(npy_half h); -npy_half npy_half_copysign(npy_half x, npy_half y); -npy_half npy_half_spacing(npy_half h); -npy_half npy_half_nextafter(npy_half x, npy_half y); - -/* - * Half-precision constants - */ - -#define NPY_HALF_ZERO (0x0000u) -#define NPY_HALF_PZERO (0x0000u) -#define NPY_HALF_NZERO (0x8000u) -#define NPY_HALF_ONE (0x3c00u) -#define NPY_HALF_NEGONE (0xbc00u) -#define NPY_HALF_PINF (0x7c00u) -#define NPY_HALF_NINF (0xfc00u) -#define NPY_HALF_NAN (0x7e00u) - -#define NPY_MAX_HALF (0x7bffu) - -/* - * Bit-level conversions - */ - -npy_uint16 npy_floatbits_to_halfbits(npy_uint32 f); -npy_uint16 npy_doublebits_to_halfbits(npy_uint64 d); -npy_uint32 npy_halfbits_to_floatbits(npy_uint16 h); -npy_uint64 npy_halfbits_to_doublebits(npy_uint16 h); - -#ifdef __cplusplus -} -#endif - -#endif diff --git a/include/numpy/multiarray_api.txt b/include/numpy/multiarray_api.txt deleted file mode 100644 index 7e588f067..000000000 --- a/include/numpy/multiarray_api.txt +++ /dev/null @@ -1,2375 +0,0 @@ - -=========== -Numpy C-API -=========== -:: - - unsigned int - PyArray_GetNDArrayCVersion(void ) - - -Included at the very first so not auto-grabbed and thus not labeled. - -:: - - int - PyArray_SetNumericOps(PyObject *dict) - -Set internal structure with number functions that all arrays will use - -:: - - PyObject * - PyArray_GetNumericOps(void ) - -Get dictionary showing number functions that all arrays will use - -:: - - int - PyArray_INCREF(PyArrayObject *mp) - -For object arrays, increment all internal references. - -:: - - int - PyArray_XDECREF(PyArrayObject *mp) - -Decrement all internal references for object arrays. -(or arrays with object fields) - -:: - - void - PyArray_SetStringFunction(PyObject *op, int repr) - -Set the array print function to be a Python function. - -:: - - PyArray_Descr * - PyArray_DescrFromType(int type) - -Get the PyArray_Descr structure for a type. - -:: - - PyObject * - PyArray_TypeObjectFromType(int type) - -Get a typeobject from a type-number -- can return NULL. - -New reference - -:: - - char * - PyArray_Zero(PyArrayObject *arr) - -Get pointer to zero of correct type for array. 
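The half-precision routines declared in the removed ``halffloat.h`` above are plain C functions with no Python-object overhead. A minimal round-trip sketch, assuming the file is compiled and linked like any ordinary NumPy C extension; the helper name is illustrative::

    #include <Python.h>
    #include <numpy/halffloat.h>

    /* Round-trip a C float through the 16-bit half representation,
     * mapping NaN inputs onto the canonical half NaN bit pattern. */
    static float
    half_roundtrip(float value)
    {
        npy_half h = npy_float_to_half(value);   /* narrow to 16 bits */
        if (npy_half_isnan(h)) {
            h = NPY_HALF_NAN;                    /* canonicalise NaN */
        }
        return npy_half_to_float(h);             /* widen back */
    }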
- -:: - - char * - PyArray_One(PyArrayObject *arr) - -Get pointer to one of correct type for array - -:: - - PyObject * - PyArray_CastToType(PyArrayObject *arr, PyArray_Descr *dtype, int - is_f_order) - -For backward compatibility - -Cast an array using typecode structure. -steals reference to at --- cannot be NULL - -This function always makes a copy of arr, even if the dtype -doesn't change. - -:: - - int - PyArray_CastTo(PyArrayObject *out, PyArrayObject *mp) - -Cast to an already created array. - -:: - - int - PyArray_CastAnyTo(PyArrayObject *out, PyArrayObject *mp) - -Cast to an already created array. Arrays don't have to be "broadcastable" -Only requirement is they have the same number of elements. - -:: - - int - PyArray_CanCastSafely(int fromtype, int totype) - -Check the type coercion rules. - -:: - - npy_bool - PyArray_CanCastTo(PyArray_Descr *from, PyArray_Descr *to) - -leaves reference count alone --- cannot be NULL - -PyArray_CanCastTypeTo is equivalent to this, but adds a 'casting' -parameter. - -:: - - int - PyArray_ObjectType(PyObject *op, int minimum_type) - -Return the typecode of the array a Python object would be converted to - -Returns the type number the result should have, or NPY_NOTYPE on error. - -:: - - PyArray_Descr * - PyArray_DescrFromObject(PyObject *op, PyArray_Descr *mintype) - -new reference -- accepts NULL for mintype - -:: - - PyArrayObject ** - PyArray_ConvertToCommonType(PyObject *op, int *retn) - - -:: - - PyArray_Descr * - PyArray_DescrFromScalar(PyObject *sc) - -Return descr object from array scalar. - -New reference - -:: - - PyArray_Descr * - PyArray_DescrFromTypeObject(PyObject *type) - - -:: - - npy_intp - PyArray_Size(PyObject *op) - -Compute the size of an array (in number of items) - -:: - - PyObject * - PyArray_Scalar(void *data, PyArray_Descr *descr, PyObject *base) - -Get scalar-equivalent to a region of memory described by a descriptor. - -:: - - PyObject * - PyArray_FromScalar(PyObject *scalar, PyArray_Descr *outcode) - -Get 0-dim array from scalar - -0-dim array from array-scalar object -always contains a copy of the data -unless outcode is NULL, it is of void type and the referrer does -not own it either. - -steals reference to outcode - -:: - - void - PyArray_ScalarAsCtype(PyObject *scalar, void *ctypeptr) - -Convert to c-type - -no error checking is performed -- ctypeptr must be same type as scalar -in case of flexible type, the data is not copied -into ctypeptr which is expected to be a pointer to pointer - -:: - - int - PyArray_CastScalarToCtype(PyObject *scalar, void - *ctypeptr, PyArray_Descr *outcode) - -Cast Scalar to c-type - -The output buffer must be large-enough to receive the value -Even for flexible types which is different from ScalarAsCtype -where only a reference for flexible types is returned - -This may not work right on narrow builds for NumPy unicode scalars. - -:: - - int - PyArray_CastScalarDirect(PyObject *scalar, PyArray_Descr - *indescr, void *ctypeptr, int outtype) - -Cast Scalar to c-type - -:: - - PyObject * - PyArray_ScalarFromObject(PyObject *object) - -Get an Array Scalar From a Python Object - -Returns NULL if unsuccessful but error is only set if another error occurred. -Currently only Numeric-like object supported. - -:: - - PyArray_VectorUnaryFunc * - PyArray_GetCastFunc(PyArray_Descr *descr, int type_num) - -Get a cast function to cast from the input descriptor to the -output type_number (must be a registered data-type). -Returns NULL if un-successful. 
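A short sketch of the casting helpers listed above (``PyArray_CanCastTo`` and ``PyArray_CastToType``), assuming ``import_array()`` has already run in the module's init function; the function name is illustrative::

    #include <Python.h>
    #include <numpy/arrayobject.h>

    /* Cast `arr` to float64 only if NumPy's rules say the cast is safe.
     * Returns a new reference, or NULL with an exception set. */
    static PyObject *
    cast_to_double_if_safe(PyArrayObject *arr)
    {
        PyArray_Descr *to = PyArray_DescrFromType(NPY_DOUBLE);   /* new ref */
        if (to == NULL) {
            return NULL;
        }
        if (!PyArray_CanCastTo(PyArray_DESCR(arr), to)) {
            Py_DECREF(to);
            PyErr_SetString(PyExc_TypeError, "unsafe cast to float64");
            return NULL;
        }
        /* PyArray_CastToType steals the reference to `to`. */
        return PyArray_CastToType(arr, to, 0 /* C order */);
    }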
- -:: - - PyObject * - PyArray_FromDims(int nd, int *d, int type) - -Construct an empty array from dimensions and typenum - -:: - - PyObject * - PyArray_FromDimsAndDataAndDescr(int nd, int *d, PyArray_Descr - *descr, char *data) - -Like FromDimsAndData but uses the Descr structure instead of typecode -as input. - -:: - - PyObject * - PyArray_FromAny(PyObject *op, PyArray_Descr *newtype, int - min_depth, int max_depth, int flags, PyObject - *context) - -Does not check for NPY_ARRAY_ENSURECOPY and NPY_ARRAY_NOTSWAPPED in flags -Steals a reference to newtype --- which can be NULL - -:: - - PyObject * - PyArray_EnsureArray(PyObject *op) - -This is a quick wrapper around PyArray_FromAny(op, NULL, 0, 0, ENSUREARRAY) -that special cases Arrays and PyArray_Scalars up front -It *steals a reference* to the object -It also guarantees that the result is PyArray_Type -Because it decrefs op if any conversion needs to take place -so it can be used like PyArray_EnsureArray(some_function(...)) - -:: - - PyObject * - PyArray_EnsureAnyArray(PyObject *op) - - -:: - - PyObject * - PyArray_FromFile(FILE *fp, PyArray_Descr *dtype, npy_intp num, char - *sep) - - -Given a ``FILE *`` pointer ``fp``, and a ``PyArray_Descr``, return an -array corresponding to the data encoded in that file. - -If the dtype is NULL, the default array type is used (double). -If non-null, the reference is stolen. - -The number of elements to read is given as ``num``; if it is < 0, then -then as many as possible are read. - -If ``sep`` is NULL or empty, then binary data is assumed, else -text data, with ``sep`` as the separator between elements. Whitespace in -the separator matches any length of whitespace in the text, and a match -for whitespace around the separator is added. - -For memory-mapped files, use the buffer interface. No more data than -necessary is read by this routine. - -:: - - PyObject * - PyArray_FromString(char *data, npy_intp slen, PyArray_Descr - *dtype, npy_intp num, char *sep) - - -Given a pointer to a string ``data``, a string length ``slen``, and -a ``PyArray_Descr``, return an array corresponding to the data -encoded in that string. - -If the dtype is NULL, the default array type is used (double). -If non-null, the reference is stolen. - -If ``slen`` is < 0, then the end of string is used for text data. -It is an error for ``slen`` to be < 0 for binary data (since embedded NULLs -would be the norm). - -The number of elements to read is given as ``num``; if it is < 0, then -then as many as possible are read. - -If ``sep`` is NULL or empty, then binary data is assumed, else -text data, with ``sep`` as the separator between elements. Whitespace in -the separator matches any length of whitespace in the text, and a match -for whitespace around the separator is added. - -:: - - PyObject * - PyArray_FromBuffer(PyObject *buf, PyArray_Descr *type, npy_intp - count, npy_intp offset) - - -:: - - PyObject * - PyArray_FromIter(PyObject *obj, PyArray_Descr *dtype, npy_intp count) - - -steals a reference to dtype (which cannot be NULL) - -:: - - PyObject * - PyArray_Return(PyArrayObject *mp) - - -Return either an array or the appropriate Python object if the array -is 0d and matches a Python type. 
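``PyArray_FromAny``, documented above, is the usual entry point for coercing an arbitrary Python object to an array. A minimal sketch under the same reference-stealing rule described there, assuming ``import_array()`` has already been called; the helper name is illustrative::

    #include <Python.h>
    #include <numpy/arrayobject.h>

    /* Coerce an arbitrary Python object into a 1-d, C-contiguous
     * float64 array.  Returns a new reference or NULL on error. */
    static PyArrayObject *
    as_1d_double(PyObject *obj)
    {
        /* PyArray_FromAny steals this descriptor reference. */
        PyArray_Descr *dtype = PyArray_DescrFromType(NPY_DOUBLE);
        if (dtype == NULL) {
            return NULL;
        }
        return (PyArrayObject *)PyArray_FromAny(
            obj, dtype,
            1, 1,                 /* require exactly one dimension */
            NPY_ARRAY_CARRAY,     /* aligned, C-contiguous, writeable */
            NULL);                /* no context */
    }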
- -:: - - PyObject * - PyArray_GetField(PyArrayObject *self, PyArray_Descr *typed, int - offset) - -Get a subset of bytes from each element of the array - -:: - - int - PyArray_SetField(PyArrayObject *self, PyArray_Descr *dtype, int - offset, PyObject *val) - -Set a subset of bytes from each element of the array - -:: - - PyObject * - PyArray_Byteswap(PyArrayObject *self, npy_bool inplace) - - -:: - - PyObject * - PyArray_Resize(PyArrayObject *self, PyArray_Dims *newshape, int - refcheck, NPY_ORDER order) - -Resize (reallocate data). Only works if nothing else is referencing this -array and it is contiguous. If refcheck is 0, then the reference count is -not checked and assumed to be 1. You still must own this data and have no -weak-references and no base object. - -:: - - int - PyArray_MoveInto(PyArrayObject *dst, PyArrayObject *src) - -Move the memory of one array into another, allowing for overlapping data. - -Returns 0 on success, negative on failure. - -:: - - int - PyArray_CopyInto(PyArrayObject *dst, PyArrayObject *src) - -Copy an Array into another array. -Broadcast to the destination shape if necessary. - -Returns 0 on success, -1 on failure. - -:: - - int - PyArray_CopyAnyInto(PyArrayObject *dst, PyArrayObject *src) - -Copy an Array into another array -- memory must not overlap -Does not require src and dest to have "broadcastable" shapes -(only the same number of elements). - -TODO: For NumPy 2.0, this could accept an order parameter which -only allows NPY_CORDER and NPY_FORDER. Could also rename -this to CopyAsFlat to make the name more intuitive. - -Returns 0 on success, -1 on error. - -:: - - int - PyArray_CopyObject(PyArrayObject *dest, PyObject *src_object) - - -:: - - PyObject * - PyArray_NewCopy(PyArrayObject *obj, NPY_ORDER order) - -Copy an array. - -:: - - PyObject * - PyArray_ToList(PyArrayObject *self) - -To List - -:: - - PyObject * - PyArray_ToString(PyArrayObject *self, NPY_ORDER order) - - -:: - - int - PyArray_ToFile(PyArrayObject *self, FILE *fp, char *sep, char *format) - -To File - -:: - - int - PyArray_Dump(PyObject *self, PyObject *file, int protocol) - - -:: - - PyObject * - PyArray_Dumps(PyObject *self, int protocol) - - -:: - - int - PyArray_ValidType(int type) - -Is the typenum valid? - -:: - - void - PyArray_UpdateFlags(PyArrayObject *ret, int flagmask) - -Update Several Flags at once. - -:: - - PyObject * - PyArray_New(PyTypeObject *subtype, int nd, npy_intp *dims, int - type_num, npy_intp *strides, void *data, int itemsize, int - flags, PyObject *obj) - -Generic new array creation routine. - -:: - - PyObject * - PyArray_NewFromDescr(PyTypeObject *subtype, PyArray_Descr *descr, int - nd, npy_intp *dims, npy_intp *strides, void - *data, int flags, PyObject *obj) - -Generic new array creation routine. - -steals a reference to descr (even on failure) - -:: - - PyArray_Descr * - PyArray_DescrNew(PyArray_Descr *base) - -base cannot be NULL - -:: - - PyArray_Descr * - PyArray_DescrNewFromType(int type_num) - - -:: - - double - PyArray_GetPriority(PyObject *obj, double default_) - -Get Priority from object - -:: - - PyObject * - PyArray_IterNew(PyObject *obj) - -Get Iterator. - -:: - - PyObject * - PyArray_MultiIterNew(int n, ... 
) - -Get MultiIterator, - -:: - - int - PyArray_PyIntAsInt(PyObject *o) - - -:: - - npy_intp - PyArray_PyIntAsIntp(PyObject *o) - - -:: - - int - PyArray_Broadcast(PyArrayMultiIterObject *mit) - - -:: - - void - PyArray_FillObjectArray(PyArrayObject *arr, PyObject *obj) - -Assumes contiguous - -:: - - int - PyArray_FillWithScalar(PyArrayObject *arr, PyObject *obj) - - -:: - - npy_bool - PyArray_CheckStrides(int elsize, int nd, npy_intp numbytes, npy_intp - offset, npy_intp *dims, npy_intp *newstrides) - - -:: - - PyArray_Descr * - PyArray_DescrNewByteorder(PyArray_Descr *self, char newendian) - - -returns a copy of the PyArray_Descr structure with the byteorder -altered: -no arguments: The byteorder is swapped (in all subfields as well) -single argument: The byteorder is forced to the given state -(in all subfields as well) - -Valid states: ('big', '>') or ('little' or '<') -('native', or '=') - -If a descr structure with | is encountered it's own -byte-order is not changed but any fields are: - - -Deep bytorder change of a data-type descriptor -Leaves reference count of self unchanged --- does not DECREF self *** - -:: - - PyObject * - PyArray_IterAllButAxis(PyObject *obj, int *inaxis) - -Get Iterator that iterates over all but one axis (don't use this with -PyArray_ITER_GOTO1D). The axis will be over-written if negative -with the axis having the smallest stride. - -:: - - PyObject * - PyArray_CheckFromAny(PyObject *op, PyArray_Descr *descr, int - min_depth, int max_depth, int requires, PyObject - *context) - -steals a reference to descr -- accepts NULL - -:: - - PyObject * - PyArray_FromArray(PyArrayObject *arr, PyArray_Descr *newtype, int - flags) - -steals reference to newtype --- acc. NULL - -:: - - PyObject * - PyArray_FromInterface(PyObject *origin) - - -:: - - PyObject * - PyArray_FromStructInterface(PyObject *input) - - -:: - - PyObject * - PyArray_FromArrayAttr(PyObject *op, PyArray_Descr *typecode, PyObject - *context) - - -:: - - NPY_SCALARKIND - PyArray_ScalarKind(int typenum, PyArrayObject **arr) - -ScalarKind - -Returns the scalar kind of a type number, with an -optional tweak based on the scalar value itself. -If no scalar is provided, it returns INTPOS_SCALAR -for both signed and unsigned integers, otherwise -it checks the sign of any signed integer to choose -INTNEG_SCALAR when appropriate. - -:: - - int - PyArray_CanCoerceScalar(int thistype, int neededtype, NPY_SCALARKIND - scalar) - - -Determines whether the data type 'thistype', with -scalar kind 'scalar', can be coerced into 'neededtype'. - -:: - - PyObject * - PyArray_NewFlagsObject(PyObject *obj) - - -Get New ArrayFlagsObject - -:: - - npy_bool - PyArray_CanCastScalar(PyTypeObject *from, PyTypeObject *to) - -See if array scalars can be cast. - -TODO: For NumPy 2.0, add a NPY_CASTING parameter. - -:: - - int - PyArray_CompareUCS4(npy_ucs4 *s1, npy_ucs4 *s2, size_t len) - - -:: - - int - PyArray_RemoveSmallest(PyArrayMultiIterObject *multi) - -Adjusts previously broadcasted iterators so that the axis with -the smallest sum of iterator strides is not iterated over. -Returns dimension which is smallest in the range [0,multi->nd). -A -1 is returned if multi->nd == 0. 
- -don't use with PyArray_ITER_GOTO1D because factors are not adjusted - -:: - - int - PyArray_ElementStrides(PyObject *obj) - - -:: - - void - PyArray_Item_INCREF(char *data, PyArray_Descr *descr) - - -:: - - void - PyArray_Item_XDECREF(char *data, PyArray_Descr *descr) - - -:: - - PyObject * - PyArray_FieldNames(PyObject *fields) - -Return the tuple of ordered field names from a dictionary. - -:: - - PyObject * - PyArray_Transpose(PyArrayObject *ap, PyArray_Dims *permute) - -Return Transpose. - -:: - - PyObject * - PyArray_TakeFrom(PyArrayObject *self0, PyObject *indices0, int - axis, PyArrayObject *out, NPY_CLIPMODE clipmode) - -Take - -:: - - PyObject * - PyArray_PutTo(PyArrayObject *self, PyObject*values0, PyObject - *indices0, NPY_CLIPMODE clipmode) - -Put values into an array - -:: - - PyObject * - PyArray_PutMask(PyArrayObject *self, PyObject*values0, PyObject*mask0) - -Put values into an array according to a mask. - -:: - - PyObject * - PyArray_Repeat(PyArrayObject *aop, PyObject *op, int axis) - -Repeat the array. - -:: - - PyObject * - PyArray_Choose(PyArrayObject *ip, PyObject *op, PyArrayObject - *out, NPY_CLIPMODE clipmode) - - -:: - - int - PyArray_Sort(PyArrayObject *op, int axis, NPY_SORTKIND which) - -Sort an array in-place - -:: - - PyObject * - PyArray_ArgSort(PyArrayObject *op, int axis, NPY_SORTKIND which) - -ArgSort an array - -:: - - PyObject * - PyArray_SearchSorted(PyArrayObject *op1, PyObject *op2, NPY_SEARCHSIDE - side, PyObject *perm) - - -Search the sorted array op1 for the location of the items in op2. The -result is an array of indexes, one for each element in op2, such that if -the item were to be inserted in op1 just before that index the array -would still be in sorted order. - -Parameters ----------- -op1 : PyArrayObject * -Array to be searched, must be 1-D. -op2 : PyObject * -Array of items whose insertion indexes in op1 are wanted -side : {NPY_SEARCHLEFT, NPY_SEARCHRIGHT} -If NPY_SEARCHLEFT, return first valid insertion indexes -If NPY_SEARCHRIGHT, return last valid insertion indexes -perm : PyObject * -Permutation array that sorts op1 (optional) - -Returns -------- -ret : PyObject * -New reference to npy_intp array containing indexes where items in op2 -could be validly inserted into op1. NULL on error. - -Notes ------ -Binary search is used to find the indexes. - -:: - - PyObject * - PyArray_ArgMax(PyArrayObject *op, int axis, PyArrayObject *out) - -ArgMax - -:: - - PyObject * - PyArray_ArgMin(PyArrayObject *op, int axis, PyArrayObject *out) - -ArgMin - -:: - - PyObject * - PyArray_Reshape(PyArrayObject *self, PyObject *shape) - -Reshape - -:: - - PyObject * - PyArray_Newshape(PyArrayObject *self, PyArray_Dims *newdims, NPY_ORDER - order) - -New shape for an array - -:: - - PyObject * - PyArray_Squeeze(PyArrayObject *self) - - -return a new view of the array object with all of its unit-length -dimensions squeezed out if needed, otherwise -return the same array. 
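The ``PyArray_Sort`` and ``PyArray_ArgSort`` entries above pair naturally: argsort first to capture the original order, then sort in place. A brief sketch, assuming ``import_array()`` has already run; the helper name is illustrative::

    #include <Python.h>
    #include <numpy/arrayobject.h>

    /* Return the permutation that sorts `arr` along its last axis,
     * then sort `arr` itself in place.  NULL on error. */
    static PyObject *
    sort_and_argsort(PyArrayObject *arr)
    {
        PyObject *perm = PyArray_ArgSort(arr, -1, NPY_QUICKSORT);  /* new ref */
        if (perm == NULL) {
            return NULL;
        }
        if (PyArray_Sort(arr, -1, NPY_QUICKSORT) < 0) {            /* in place */
            Py_DECREF(perm);
            return NULL;
        }
        return perm;
    }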
- -:: - - PyObject * - PyArray_View(PyArrayObject *self, PyArray_Descr *type, PyTypeObject - *pytype) - -View -steals a reference to type -- accepts NULL - -:: - - PyObject * - PyArray_SwapAxes(PyArrayObject *ap, int a1, int a2) - -SwapAxes - -:: - - PyObject * - PyArray_Max(PyArrayObject *ap, int axis, PyArrayObject *out) - -Max - -:: - - PyObject * - PyArray_Min(PyArrayObject *ap, int axis, PyArrayObject *out) - -Min - -:: - - PyObject * - PyArray_Ptp(PyArrayObject *ap, int axis, PyArrayObject *out) - -Ptp - -:: - - PyObject * - PyArray_Mean(PyArrayObject *self, int axis, int rtype, PyArrayObject - *out) - -Mean - -:: - - PyObject * - PyArray_Trace(PyArrayObject *self, int offset, int axis1, int - axis2, int rtype, PyArrayObject *out) - -Trace - -:: - - PyObject * - PyArray_Diagonal(PyArrayObject *self, int offset, int axis1, int - axis2) - -Diagonal - -In NumPy versions prior to 1.7, this function always returned a copy of -the diagonal array. In 1.7, the code has been updated to compute a view -onto 'self', but it still copies this array before returning, as well as -setting the internal WARN_ON_WRITE flag. In a future version, it will -simply return a view onto self. - -:: - - PyObject * - PyArray_Clip(PyArrayObject *self, PyObject *min, PyObject - *max, PyArrayObject *out) - -Clip - -:: - - PyObject * - PyArray_Conjugate(PyArrayObject *self, PyArrayObject *out) - -Conjugate - -:: - - PyObject * - PyArray_Nonzero(PyArrayObject *self) - -Nonzero - -TODO: In NumPy 2.0, should make the iteration order a parameter. - -:: - - PyObject * - PyArray_Std(PyArrayObject *self, int axis, int rtype, PyArrayObject - *out, int variance) - -Set variance to 1 to by-pass square-root calculation and return variance -Std - -:: - - PyObject * - PyArray_Sum(PyArrayObject *self, int axis, int rtype, PyArrayObject - *out) - -Sum - -:: - - PyObject * - PyArray_CumSum(PyArrayObject *self, int axis, int rtype, PyArrayObject - *out) - -CumSum - -:: - - PyObject * - PyArray_Prod(PyArrayObject *self, int axis, int rtype, PyArrayObject - *out) - -Prod - -:: - - PyObject * - PyArray_CumProd(PyArrayObject *self, int axis, int - rtype, PyArrayObject *out) - -CumProd - -:: - - PyObject * - PyArray_All(PyArrayObject *self, int axis, PyArrayObject *out) - -All - -:: - - PyObject * - PyArray_Any(PyArrayObject *self, int axis, PyArrayObject *out) - -Any - -:: - - PyObject * - PyArray_Compress(PyArrayObject *self, PyObject *condition, int - axis, PyArrayObject *out) - -Compress - -:: - - PyObject * - PyArray_Flatten(PyArrayObject *a, NPY_ORDER order) - -Flatten - -:: - - PyObject * - PyArray_Ravel(PyArrayObject *arr, NPY_ORDER order) - -Ravel -Returns a contiguous array - -:: - - npy_intp - PyArray_MultiplyList(npy_intp *l1, int n) - -Multiply a List - -:: - - int - PyArray_MultiplyIntList(int *l1, int n) - -Multiply a List of ints - -:: - - void * - PyArray_GetPtr(PyArrayObject *obj, npy_intp*ind) - -Produce a pointer into array - -:: - - int - PyArray_CompareLists(npy_intp *l1, npy_intp *l2, int n) - -Compare Lists - -:: - - int - PyArray_AsCArray(PyObject **op, void *ptr, npy_intp *dims, int - nd, PyArray_Descr*typedescr) - -Simulate a C-array -steals a reference to typedescr -- can be NULL - -:: - - int - PyArray_As1D(PyObject **op, char **ptr, int *d1, int typecode) - -Convert to a 1D C-array - -:: - - int - PyArray_As2D(PyObject **op, char ***ptr, int *d1, int *d2, int - typecode) - -Convert to a 2D C-array - -:: - - int - PyArray_Free(PyObject *op, void *ptr) - -Free pointers created if As2D is called - -:: - - int - 
PyArray_Converter(PyObject *object, PyObject **address) - - -Useful to pass as converter function for O& processing in PyArgs_ParseTuple. - -This conversion function can be used with the "O&" argument for -PyArg_ParseTuple. It will immediately return an object of array type -or will convert to a NPY_ARRAY_CARRAY any other object. - -If you use PyArray_Converter, you must DECREF the array when finished -as you get a new reference to it. - -:: - - int - PyArray_IntpFromSequence(PyObject *seq, npy_intp *vals, int maxvals) - -PyArray_IntpFromSequence -Returns the number of dimensions or -1 if an error occurred. -vals must be large enough to hold maxvals - -:: - - PyObject * - PyArray_Concatenate(PyObject *op, int axis) - -Concatenate - -Concatenate an arbitrary Python sequence into an array. -op is a python object supporting the sequence interface. -Its elements will be concatenated together to form a single -multidimensional array. If axis is NPY_MAXDIMS or bigger, then -each sequence object will be flattened before concatenation - -:: - - PyObject * - PyArray_InnerProduct(PyObject *op1, PyObject *op2) - -Numeric.innerproduct(a,v) - -:: - - PyObject * - PyArray_MatrixProduct(PyObject *op1, PyObject *op2) - -Numeric.matrixproduct(a,v) -just like inner product but does the swapaxes stuff on the fly - -:: - - PyObject * - PyArray_CopyAndTranspose(PyObject *op) - -Copy and Transpose - -Could deprecate this function, as there isn't a speed benefit over -calling Transpose and then Copy. - -:: - - PyObject * - PyArray_Correlate(PyObject *op1, PyObject *op2, int mode) - -Numeric.correlate(a1,a2,mode) - -:: - - int - PyArray_TypestrConvert(int itemsize, int gentype) - -Typestr converter - -:: - - int - PyArray_DescrConverter(PyObject *obj, PyArray_Descr **at) - -Get typenum from an object -- None goes to NPY_DEFAULT_TYPE -This function takes a Python object representing a type and converts it -to a the correct PyArray_Descr * structure to describe the type. - -Many objects can be used to represent a data-type which in NumPy is -quite a flexible concept. - -This is the central code that converts Python objects to -Type-descriptor objects that are used throughout numpy. - -Returns a new reference in *at, but the returned should not be -modified as it may be one of the canonical immutable objects or -a reference to the input obj. - -:: - - int - PyArray_DescrConverter2(PyObject *obj, PyArray_Descr **at) - -Get typenum from an object -- None goes to NULL - -:: - - int - PyArray_IntpConverter(PyObject *obj, PyArray_Dims *seq) - -Get intp chunk from sequence - -This function takes a Python sequence object and allocates and -fills in an intp array with the converted values. - -Remember to free the pointer seq.ptr when done using -PyDimMem_FREE(seq.ptr)** - -:: - - int - PyArray_BufferConverter(PyObject *obj, PyArray_Chunk *buf) - -Get buffer chunk from object - -this function takes a Python object which exposes the (single-segment) -buffer interface and returns a pointer to the data segment - -You should increment the reference count by one of buf->base -if you will hang on to a reference - -You only get a borrowed reference to the object. Do not free the -memory... - -:: - - int - PyArray_AxisConverter(PyObject *obj, int *axis) - -Get axis from an object (possibly None) -- a converter function, - -See also PyArray_ConvertMultiAxis, which also handles a tuple of axes. 
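The O& pattern described above is easiest to see in a small example. The following sketch accepts any array-like argument through PyArray_Converter and reduces it with PyArray_Sum; the function name sum_of and the surrounding module boilerplate (including numpy/arrayobject.h and a prior import_array() call) are assumptions for illustration, not part of this listing::

    static PyObject *
    sum_of(PyObject *self, PyObject *args)
    {
        PyObject *arr = NULL;
        PyObject *result;

        /* PyArray_Converter hands back a new reference to an ndarray
           (converting the argument if necessary), so it must be
           Py_DECREF'd once we are done with it. */
        if (!PyArg_ParseTuple(args, "O&", PyArray_Converter, &arr)) {
            return NULL;
        }

        /* Reduce over all axes (NPY_MAXDIMS), let NumPy pick the
           reduction type (NPY_NOTYPE) and allocate the output (NULL). */
        result = PyArray_Sum((PyArrayObject *)arr, NPY_MAXDIMS,
                             NPY_NOTYPE, NULL);
        Py_DECREF(arr);
        return result;
    }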
- -:: - - int - PyArray_BoolConverter(PyObject *object, npy_bool *val) - -Convert an object to true / false - -:: - - int - PyArray_ByteorderConverter(PyObject *obj, char *endian) - -Convert object to endian - -:: - - int - PyArray_OrderConverter(PyObject *object, NPY_ORDER *val) - -Convert an object to FORTRAN / C / ANY / KEEP - -:: - - unsigned char - PyArray_EquivTypes(PyArray_Descr *type1, PyArray_Descr *type2) - - -This function returns true if the two typecodes are -equivalent (same basic kind and same itemsize). - -:: - - PyObject * - PyArray_Zeros(int nd, npy_intp *dims, PyArray_Descr *type, int - is_f_order) - -Zeros - -steal a reference -accepts NULL type - -:: - - PyObject * - PyArray_Empty(int nd, npy_intp *dims, PyArray_Descr *type, int - is_f_order) - -Empty - -accepts NULL type -steals referenct to type - -:: - - PyObject * - PyArray_Where(PyObject *condition, PyObject *x, PyObject *y) - -Where - -:: - - PyObject * - PyArray_Arange(double start, double stop, double step, int type_num) - -Arange, - -:: - - PyObject * - PyArray_ArangeObj(PyObject *start, PyObject *stop, PyObject - *step, PyArray_Descr *dtype) - - -ArangeObj, - -this doesn't change the references - -:: - - int - PyArray_SortkindConverter(PyObject *obj, NPY_SORTKIND *sortkind) - -Convert object to sort kind - -:: - - PyObject * - PyArray_LexSort(PyObject *sort_keys, int axis) - -LexSort an array providing indices that will sort a collection of arrays -lexicographically. The first key is sorted on first, followed by the second key --- requires that arg"merge"sort is available for each sort_key - -Returns an index array that shows the indexes for the lexicographic sort along -the given axis. - -:: - - PyObject * - PyArray_Round(PyArrayObject *a, int decimals, PyArrayObject *out) - -Round - -:: - - unsigned char - PyArray_EquivTypenums(int typenum1, int typenum2) - - -:: - - int - PyArray_RegisterDataType(PyArray_Descr *descr) - -Register Data type -Does not change the reference count of descr - -:: - - int - PyArray_RegisterCastFunc(PyArray_Descr *descr, int - totype, PyArray_VectorUnaryFunc *castfunc) - -Register Casting Function -Replaces any function currently stored. - -:: - - int - PyArray_RegisterCanCast(PyArray_Descr *descr, int - totype, NPY_SCALARKIND scalar) - -Register a type number indicating that a descriptor can be cast -to it safely - -:: - - void - PyArray_InitArrFuncs(PyArray_ArrFuncs *f) - -Initialize arrfuncs to NULL - -:: - - PyObject * - PyArray_IntTupleFromIntp(int len, npy_intp *vals) - -PyArray_IntTupleFromIntp - -:: - - int - PyArray_TypeNumFromName(char *str) - - -:: - - int - PyArray_ClipmodeConverter(PyObject *object, NPY_CLIPMODE *val) - -Convert an object to NPY_RAISE / NPY_CLIP / NPY_WRAP - -:: - - int - PyArray_OutputConverter(PyObject *object, PyArrayObject **address) - -Useful to pass as converter function for O& processing in -PyArgs_ParseTuple for output arrays - -:: - - PyObject * - PyArray_BroadcastToShape(PyObject *obj, npy_intp *dims, int nd) - -Get Iterator broadcast to a particular shape - -:: - - void - _PyArray_SigintHandler(int signum) - - -:: - - void* - _PyArray_GetSigintBuf(void ) - - -:: - - int - PyArray_DescrAlignConverter(PyObject *obj, PyArray_Descr **at) - - -Get type-descriptor from an object forcing alignment if possible -None goes to DEFAULT type. - -any object with the .fields attribute and/or .itemsize attribute (if the -.fields attribute does not give the total size -- i.e. a partial record -naming). 
If itemsize is given it must be >= size computed from fields - -The .fields attribute must return a convertible dictionary if present. -Result inherits from NPY_VOID. - -:: - - int - PyArray_DescrAlignConverter2(PyObject *obj, PyArray_Descr **at) - - -Get type-descriptor from an object forcing alignment if possible -None goes to NULL. - -:: - - int - PyArray_SearchsideConverter(PyObject *obj, void *addr) - -Convert object to searchsorted side - -:: - - PyObject * - PyArray_CheckAxis(PyArrayObject *arr, int *axis, int flags) - -PyArray_CheckAxis - -check that axis is valid -convert 0-d arrays to 1-d arrays - -:: - - npy_intp - PyArray_OverflowMultiplyList(npy_intp *l1, int n) - -Multiply a List of Non-negative numbers with over-flow detection. - -:: - - int - PyArray_CompareString(char *s1, char *s2, size_t len) - - -:: - - PyObject * - PyArray_MultiIterFromObjects(PyObject **mps, int n, int nadd, ... ) - -Get MultiIterator from array of Python objects and any additional - -PyObject **mps -- array of PyObjects -int n - number of PyObjects in the array -int nadd - number of additional arrays to include in the iterator. - -Returns a multi-iterator object. - -:: - - int - PyArray_GetEndianness(void ) - - -:: - - unsigned int - PyArray_GetNDArrayCFeatureVersion(void ) - -Returns the built-in (at compilation time) C API version - -:: - - PyObject * - PyArray_Correlate2(PyObject *op1, PyObject *op2, int mode) - -correlate(a1,a2,mode) - -This function computes the usual correlation (correlate(a1, a2) != -correlate(a2, a1), and conjugate the second argument for complex inputs - -:: - - PyObject* - PyArray_NeighborhoodIterNew(PyArrayIterObject *x, npy_intp - *bounds, int mode, PyArrayObject*fill) - -A Neighborhood Iterator object. - -:: - - void - PyArray_SetDatetimeParseFunction(PyObject *op) - -This function is scheduled to be removed - -TO BE REMOVED - NOT USED INTERNALLY. - -:: - - void - PyArray_DatetimeToDatetimeStruct(npy_datetime val, NPY_DATETIMEUNIT - fr, npy_datetimestruct *result) - -Fill the datetime struct from the value and resolution unit. - -TO BE REMOVED - NOT USED INTERNALLY. - -:: - - void - PyArray_TimedeltaToTimedeltaStruct(npy_timedelta val, NPY_DATETIMEUNIT - fr, npy_timedeltastruct *result) - -Fill the timedelta struct from the timedelta value and resolution unit. - -TO BE REMOVED - NOT USED INTERNALLY. - -:: - - npy_datetime - PyArray_DatetimeStructToDatetime(NPY_DATETIMEUNIT - fr, npy_datetimestruct *d) - -Create a datetime value from a filled datetime struct and resolution unit. - -TO BE REMOVED - NOT USED INTERNALLY. - -:: - - npy_datetime - PyArray_TimedeltaStructToTimedelta(NPY_DATETIMEUNIT - fr, npy_timedeltastruct *d) - -Create a timdelta value from a filled timedelta struct and resolution unit. - -TO BE REMOVED - NOT USED INTERNALLY. - -:: - - NpyIter * - NpyIter_New(PyArrayObject *op, npy_uint32 flags, NPY_ORDER - order, NPY_CASTING casting, PyArray_Descr*dtype) - -Allocate a new iterator for one array object. - -:: - - NpyIter * - NpyIter_MultiNew(int nop, PyArrayObject **op_in, npy_uint32 - flags, NPY_ORDER order, NPY_CASTING - casting, npy_uint32 *op_flags, PyArray_Descr - **op_request_dtypes) - -Allocate a new iterator for more than one array object, using -standard NumPy broadcasting rules and the default buffer size. 
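As a usage note for NpyIter_New, here is a minimal single-operand allocation sketch. It assumes arr is a PyArrayObject * the caller already holds; the read-only, keep-order, no-casting choices are just one sensible combination, and the actual element loop is sketched after the next group of signatures::

    NpyIter *iter;

    /* Read-only pass over 'arr' in its native memory order; a NULL
       dtype keeps the array's own descriptor, so no casting occurs. */
    iter = NpyIter_New(arr, NPY_ITER_READONLY | NPY_ITER_EXTERNAL_LOOP,
                       NPY_KEEPORDER, NPY_NO_CASTING, NULL);
    if (iter == NULL) {
        return NULL;
    }

    /* ... run the iteration loop ... */

    if (NpyIter_Deallocate(iter) != NPY_SUCCEED) {
        return NULL;
    }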
- -:: - - NpyIter * - NpyIter_AdvancedNew(int nop, PyArrayObject **op_in, npy_uint32 - flags, NPY_ORDER order, NPY_CASTING - casting, npy_uint32 *op_flags, PyArray_Descr - **op_request_dtypes, int oa_ndim, int - **op_axes, npy_intp *itershape, npy_intp - buffersize) - -Allocate a new iterator for multiple array objects, and advanced -options for controlling the broadcasting, shape, and buffer size. - -:: - - NpyIter * - NpyIter_Copy(NpyIter *iter) - -Makes a copy of the iterator - -:: - - int - NpyIter_Deallocate(NpyIter *iter) - -Deallocate an iterator - -:: - - npy_bool - NpyIter_HasDelayedBufAlloc(NpyIter *iter) - -Whether the buffer allocation is being delayed - -:: - - npy_bool - NpyIter_HasExternalLoop(NpyIter *iter) - -Whether the iterator handles the inner loop - -:: - - int - NpyIter_EnableExternalLoop(NpyIter *iter) - -Removes the inner loop handling (so HasExternalLoop returns true) - -:: - - npy_intp * - NpyIter_GetInnerStrideArray(NpyIter *iter) - -Get the array of strides for the inner loop (when HasExternalLoop is true) - -This function may be safely called without holding the Python GIL. - -:: - - npy_intp * - NpyIter_GetInnerLoopSizePtr(NpyIter *iter) - -Get a pointer to the size of the inner loop (when HasExternalLoop is true) - -This function may be safely called without holding the Python GIL. - -:: - - int - NpyIter_Reset(NpyIter *iter, char **errmsg) - -Resets the iterator to its initial state - -If errmsg is non-NULL, it should point to a variable which will -receive the error message, and no Python exception will be set. -This is so that the function can be called from code not holding -the GIL. - -:: - - int - NpyIter_ResetBasePointers(NpyIter *iter, char **baseptrs, char - **errmsg) - -Resets the iterator to its initial state, with new base data pointers. -This function requires great caution. - -If errmsg is non-NULL, it should point to a variable which will -receive the error message, and no Python exception will be set. -This is so that the function can be called from code not holding -the GIL. - -:: - - int - NpyIter_ResetToIterIndexRange(NpyIter *iter, npy_intp istart, npy_intp - iend, char **errmsg) - -Resets the iterator to a new iterator index range - -If errmsg is non-NULL, it should point to a variable which will -receive the error message, and no Python exception will be set. -This is so that the function can be called from code not holding -the GIL. - -:: - - int - NpyIter_GetNDim(NpyIter *iter) - -Gets the number of dimensions being iterated - -:: - - int - NpyIter_GetNOp(NpyIter *iter) - -Gets the number of operands being iterated - -:: - - NpyIter_IterNextFunc * - NpyIter_GetIterNext(NpyIter *iter, char **errmsg) - -Compute the specialized iteration function for an iterator - -If errmsg is non-NULL, it should point to a variable which will -receive the error message, and no Python exception will be set. -This is so that the function can be called from code not holding -the GIL. - -:: - - npy_intp - NpyIter_GetIterSize(NpyIter *iter) - -Gets the number of elements being iterated - -:: - - void - NpyIter_GetIterIndexRange(NpyIter *iter, npy_intp *istart, npy_intp - *iend) - -Gets the range of iteration indices being iterated - -:: - - npy_intp - NpyIter_GetIterIndex(NpyIter *iter) - -Gets the current iteration index - -:: - - int - NpyIter_GotoIterIndex(NpyIter *iter, npy_intp iterindex) - -Sets the iterator position to the specified iterindex, -which matches the iteration order of the iterator. - -Returns NPY_SUCCEED on success, NPY_FAIL on failure. 
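Tying the functions above together, this is the usual inner-loop shape for an iterator created with NPY_ITER_EXTERNAL_LOOP, as in the earlier allocation sketch. NpyIter_GetDataPtrArray is documented a little further below, and treating the elements as double is only an assumption for illustration::

    NpyIter_IterNextFunc *iternext;
    char **dataptr;
    npy_intp *strideptr, *innersizeptr;

    iternext = NpyIter_GetIterNext(iter, NULL);
    if (iternext == NULL) {
        NpyIter_Deallocate(iter);
        return NULL;
    }

    /* These pointers point into the iterator and are updated in place
       by every iternext() call. */
    dataptr = NpyIter_GetDataPtrArray(iter);
    strideptr = NpyIter_GetInnerStrideArray(iter);
    innersizeptr = NpyIter_GetInnerLoopSizePtr(iter);

    /* Guard against a zero-sized iteration before dereferencing. */
    if (NpyIter_GetIterSize(iter) != 0) {
        do {
            char *data = dataptr[0];
            npy_intp stride = strideptr[0];
            npy_intp count = *innersizeptr;

            while (count--) {
                /* process one element, e.g. *(double *)data */
                data += stride;
            }
        } while (iternext(iter));
    }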
- -:: - - npy_bool - NpyIter_HasMultiIndex(NpyIter *iter) - -Whether the iterator is tracking a multi-index - -:: - - int - NpyIter_GetShape(NpyIter *iter, npy_intp *outshape) - -Gets the broadcast shape if a multi-index is being tracked by the iterator, -otherwise gets the shape of the iteration as Fortran-order -(fastest-changing index first). - -The reason Fortran-order is returned when a multi-index -is not enabled is that this is providing a direct view into how -the iterator traverses the n-dimensional space. The iterator organizes -its memory from fastest index to slowest index, and when -a multi-index is enabled, it uses a permutation to recover the original -order. - -Returns NPY_SUCCEED or NPY_FAIL. - -:: - - NpyIter_GetMultiIndexFunc * - NpyIter_GetGetMultiIndex(NpyIter *iter, char **errmsg) - -Compute a specialized get_multi_index function for the iterator - -If errmsg is non-NULL, it should point to a variable which will -receive the error message, and no Python exception will be set. -This is so that the function can be called from code not holding -the GIL. - -:: - - int - NpyIter_GotoMultiIndex(NpyIter *iter, npy_intp *multi_index) - -Sets the iterator to the specified multi-index, which must have the -correct number of entries for 'ndim'. It is only valid -when NPY_ITER_MULTI_INDEX was passed to the constructor. This operation -fails if the multi-index is out of bounds. - -Returns NPY_SUCCEED on success, NPY_FAIL on failure. - -:: - - int - NpyIter_RemoveMultiIndex(NpyIter *iter) - -Removes multi-index support from an iterator. - -Returns NPY_SUCCEED or NPY_FAIL. - -:: - - npy_bool - NpyIter_HasIndex(NpyIter *iter) - -Whether the iterator is tracking an index - -:: - - npy_bool - NpyIter_IsBuffered(NpyIter *iter) - -Whether the iterator is buffered - -:: - - npy_bool - NpyIter_IsGrowInner(NpyIter *iter) - -Whether the inner loop can grow if buffering is unneeded - -:: - - npy_intp - NpyIter_GetBufferSize(NpyIter *iter) - -Gets the size of the buffer, or 0 if buffering is not enabled - -:: - - npy_intp * - NpyIter_GetIndexPtr(NpyIter *iter) - -Get a pointer to the index, if it is being tracked - -:: - - int - NpyIter_GotoIndex(NpyIter *iter, npy_intp flat_index) - -If the iterator is tracking an index, sets the iterator -to the specified index. - -Returns NPY_SUCCEED on success, NPY_FAIL on failure. - -:: - - char ** - NpyIter_GetDataPtrArray(NpyIter *iter) - -Get the array of data pointers (1 per object being iterated) - -This function may be safely called without holding the Python GIL. - -:: - - PyArray_Descr ** - NpyIter_GetDescrArray(NpyIter *iter) - -Get the array of data type pointers (1 per object being iterated) - -:: - - PyArrayObject ** - NpyIter_GetOperandArray(NpyIter *iter) - -Get the array of objects being iterated - -:: - - PyArrayObject * - NpyIter_GetIterView(NpyIter *iter, npy_intp i) - -Returns a view to the i-th object with the iterator's internal axes - -:: - - void - NpyIter_GetReadFlags(NpyIter *iter, char *outreadflags) - -Gets an array of read flags (1 per object being iterated) - -:: - - void - NpyIter_GetWriteFlags(NpyIter *iter, char *outwriteflags) - -Gets an array of write flags (1 per object being iterated) - -:: - - void - NpyIter_DebugPrint(NpyIter *iter) - -For debugging - -:: - - npy_bool - NpyIter_IterationNeedsAPI(NpyIter *iter) - -Whether the iteration loop, and in particular the iternext() -function, needs API access. If this is true, the GIL must -be retained while iterating. 
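Because NpyIter_IterationNeedsAPI reports whether the loop may touch the Python C-API, a common pattern is to release the GIL only when it returns false. This sketch reuses iter and the loop from the previous example and relies on the NPY_BEGIN_THREADS / NPY_END_THREADS macros::

    NPY_BEGIN_THREADS_DEF;

    /* Release the GIL only if the iteration never needs the API
       (no object dtypes, no Python-level casting or buffering). */
    if (!NpyIter_IterationNeedsAPI(iter)) {
        NPY_BEGIN_THREADS;
    }

    /* ... run the do/while loop from the previous sketch, touching
       no Python objects while the GIL may be released ... */

    /* Safe to call unconditionally; this is a no-op if the GIL was
       never released. */
    NPY_END_THREADS;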
- -:: - - void - NpyIter_GetInnerFixedStrideArray(NpyIter *iter, npy_intp *out_strides) - -Get an array of strides which are fixed. Any strides which may -change during iteration receive the value NPY_MAX_INTP. Once -the iterator is ready to iterate, call this to get the strides -which will always be fixed in the inner loop, then choose optimized -inner loop functions which take advantage of those fixed strides. - -This function may be safely called without holding the Python GIL. - -:: - - int - NpyIter_RemoveAxis(NpyIter *iter, int axis) - -Removes an axis from iteration. This requires that NPY_ITER_MULTI_INDEX -was set for iterator creation, and does not work if buffering is -enabled. This function also resets the iterator to its initial state. - -Returns NPY_SUCCEED or NPY_FAIL. - -:: - - npy_intp * - NpyIter_GetAxisStrideArray(NpyIter *iter, int axis) - -Gets the array of strides for the specified axis. -If the iterator is tracking a multi-index, gets the strides -for the axis specified, otherwise gets the strides for -the iteration axis as Fortran order (fastest-changing axis first). - -Returns NULL if an error occurs. - -:: - - npy_bool - NpyIter_RequiresBuffering(NpyIter *iter) - -Whether the iteration could be done with no buffering. - -:: - - char ** - NpyIter_GetInitialDataPtrArray(NpyIter *iter) - -Get the array of data pointers (1 per object being iterated), -directly into the arrays (never pointing to a buffer), for starting -unbuffered iteration. This always returns the addresses for the -iterator position as reset to iterator index 0. - -These pointers are different from the pointers accepted by -NpyIter_ResetBasePointers, because the direction along some -axes may have been reversed, requiring base offsets. - -This function may be safely called without holding the Python GIL. - -:: - - int - NpyIter_CreateCompatibleStrides(NpyIter *iter, npy_intp - itemsize, npy_intp *outstrides) - -Builds a set of strides which are the same as the strides of an -output array created using the NPY_ITER_ALLOCATE flag, where NULL -was passed for op_axes. This is for data packed contiguously, -but not necessarily in C or Fortran order. This should be used -together with NpyIter_GetShape and NpyIter_GetNDim. - -A use case for this function is to match the shape and layout of -the iterator and tack on one or more dimensions. For example, -in order to generate a vector per input value for a numerical gradient, -you pass in ndim*itemsize for itemsize, then add another dimension to -the end with size ndim and stride itemsize. To do the Hessian matrix, -you do the same thing but add two dimensions, or take advantage of -the symmetry and pack it into 1 dimension with a particular encoding. - -This function may only be called if the iterator is tracking a multi-index -and if NPY_ITER_DONT_NEGATE_STRIDES was used to prevent an axis from -being iterated in reverse order. - -If an array is created with this method, simply adding 'itemsize' -for each iteration will traverse the new array matching the -iterator. - -Returns NPY_SUCCEED or NPY_FAIL. - -:: - - int - PyArray_CastingConverter(PyObject *obj, NPY_CASTING *casting) - -Convert any Python object, *obj*, to an NPY_CASTING enum. - -:: - - npy_intp - PyArray_CountNonzero(PyArrayObject *self) - -Counts the number of non-zero elements in the array. - -Returns -1 on error. - -:: - - PyArray_Descr * - PyArray_PromoteTypes(PyArray_Descr *type1, PyArray_Descr *type2) - -Produces the smallest size and lowest kind type to which both -input types can be cast. 
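As a concrete case, promoting int32 with float32 yields float64, because float32 cannot represent every int32 exactly. A minimal sketch follows; note that PyArray_PromoteTypes returns a new reference and does not steal its arguments::

    PyArray_Descr *d1 = PyArray_DescrFromType(NPY_INT32);
    PyArray_Descr *d2 = PyArray_DescrFromType(NPY_FLOAT32);
    PyArray_Descr *res = PyArray_PromoteTypes(d1, d2);

    if (res != NULL) {
        /* res->type_num == NPY_FLOAT64 */
        Py_DECREF(res);
    }
    Py_DECREF(d1);
    Py_DECREF(d2);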
- -:: - - PyArray_Descr * - PyArray_MinScalarType(PyArrayObject *arr) - -If arr is a scalar (has 0 dimensions) with a built-in number data type, -finds the smallest type size/kind which can still represent its data. -Otherwise, returns the array's data type. - - -:: - - PyArray_Descr * - PyArray_ResultType(npy_intp narrs, PyArrayObject **arr, npy_intp - ndtypes, PyArray_Descr **dtypes) - -Produces the result type of a bunch of inputs, using the UFunc -type promotion rules. Use this function when you have a set of -input arrays, and need to determine an output array dtype. - -If all the inputs are scalars (have 0 dimensions) or the maximum "kind" -of the scalars is greater than the maximum "kind" of the arrays, does -a regular type promotion. - -Otherwise, does a type promotion on the MinScalarType -of all the inputs. Data types passed directly are treated as array -types. - - -:: - - npy_bool - PyArray_CanCastArrayTo(PyArrayObject *arr, PyArray_Descr - *to, NPY_CASTING casting) - -Returns 1 if the array object may be cast to the given data type using -the casting rule, 0 otherwise. This differs from PyArray_CanCastTo in -that it handles scalar arrays (0 dimensions) specially, by checking -their value. - -:: - - npy_bool - PyArray_CanCastTypeTo(PyArray_Descr *from, PyArray_Descr - *to, NPY_CASTING casting) - -Returns true if data of type 'from' may be cast to data of type -'to' according to the rule 'casting'. - -:: - - PyArrayObject * - PyArray_EinsteinSum(char *subscripts, npy_intp nop, PyArrayObject - **op_in, PyArray_Descr *dtype, NPY_ORDER - order, NPY_CASTING casting, PyArrayObject *out) - -This function provides summation of array elements according to -the Einstein summation convention. For example: -- trace(a) -> einsum("ii", a) -- transpose(a) -> einsum("ji", a) -- multiply(a,b) -> einsum(",", a, b) -- inner(a,b) -> einsum("i,i", a, b) -- outer(a,b) -> einsum("i,j", a, b) -- matvec(a,b) -> einsum("ij,j", a, b) -- matmat(a,b) -> einsum("ij,jk", a, b) - -subscripts: The string of subscripts for einstein summation. -nop: The number of operands -op_in: The array of operands -dtype: Either NULL, or the data type to force the calculation as. -order: The order for the calculation/the output axes. -casting: What kind of casts should be permitted. -out: Either NULL, or an array into which the output should be placed. - -By default, the labels get placed in alphabetical order -at the end of the output. So, if c = einsum("i,j", a, b) -then c[i,j] == a[i]*b[j], but if c = einsum("j,i", a, b) -then c[i,j] = a[j]*b[i]. - -Alternatively, you can control the output order or prevent -an axis from being summed/force an axis to be summed by providing -indices for the output. This allows us to turn 'trace' into -'diag', for example. -- diag(a) -> einsum("ii->i", a) -- sum(a, axis=0) -> einsum("i...->", a) - -Subscripts at the beginning and end may be specified by -putting an ellipsis "..." in the middle. For example, -the function einsum("i...i", a) takes the diagonal of -the first and last dimensions of the operand, and -einsum("ij...,jk...->ik...") takes the matrix product using -the first two indices of each operand instead of the last two. - -When there is only one operand, no axes being summed, and -no output parameter, this function returns a view -into the operand instead of making a copy. 
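The matmat case from the list above looks like the following sketch, where a and b are assumed to be existing PyArrayObject * operands, and NULL is passed for both dtype and out so NumPy chooses the result type and allocates the output::

    PyArrayObject *ops[2] = {a, b};

    /* c[i,k] = sum over j of a[i,j] * b[j,k] */
    PyArrayObject *c = PyArray_EinsteinSum("ij,jk->ik", 2, ops, NULL,
                                           NPY_KEEPORDER,
                                           NPY_SAFE_CASTING, NULL);
    if (c == NULL) {
        return NULL;
    }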
- -:: - - PyObject * - PyArray_NewLikeArray(PyArrayObject *prototype, NPY_ORDER - order, PyArray_Descr *dtype, int subok) - -Creates a new array with the same shape as the provided one, -with possible memory layout order and data type changes. - -prototype - The array the new one should be like. -order - NPY_CORDER - C-contiguous result. -NPY_FORTRANORDER - Fortran-contiguous result. -NPY_ANYORDER - Fortran if prototype is Fortran, C otherwise. -NPY_KEEPORDER - Keeps the axis ordering of prototype. -dtype - If not NULL, overrides the data type of the result. -subok - If 1, use the prototype's array subtype, otherwise -always create a base-class array. - -NOTE: If dtype is not NULL, steals the dtype reference. - -:: - - int - PyArray_GetArrayParamsFromObject(PyObject *op, PyArray_Descr - *requested_dtype, npy_bool - writeable, PyArray_Descr - **out_dtype, int *out_ndim, npy_intp - *out_dims, PyArrayObject - **out_arr, PyObject *context) - -Retrieves the array parameters for viewing/converting an arbitrary -PyObject* to a NumPy array. This allows the "innate type and shape" -of Python list-of-lists to be discovered without -actually converting to an array. - -In some cases, such as structured arrays and the __array__ interface, -a data type needs to be used to make sense of the object. When -this is needed, provide a Descr for 'requested_dtype', otherwise -provide NULL. This reference is not stolen. Also, if the requested -dtype doesn't modify the interpretation of the input, out_dtype will -still get the "innate" dtype of the object, not the dtype passed -in 'requested_dtype'. - -If writing to the value in 'op' is desired, set the boolean -'writeable' to 1. This raises an error when 'op' is a scalar, list -of lists, or other non-writeable 'op'. - -Result: When success (0 return value) is returned, either out_arr -is filled with a non-NULL PyArrayObject and -the rest of the parameters are untouched, or out_arr is -filled with NULL, and the rest of the parameters are -filled. - -Typical usage: - -PyArrayObject *arr = NULL; -PyArray_Descr *dtype = NULL; -int ndim = 0; -npy_intp dims[NPY_MAXDIMS]; - -if (PyArray_GetArrayParamsFromObject(op, NULL, 1, &dtype, -&ndim, &dims, &arr, NULL) < 0) { -return NULL; -} -if (arr == NULL) { -... validate/change dtype, validate flags, ndim, etc ... -// Could make custom strides here too -arr = PyArray_NewFromDescr(&PyArray_Type, dtype, ndim, -dims, NULL, -is_f_order ? NPY_ARRAY_F_CONTIGUOUS : 0, -NULL); -if (arr == NULL) { -return NULL; -} -if (PyArray_CopyObject(arr, op) < 0) { -Py_DECREF(arr); -return NULL; -} -} -else { -... in this case the other parameters weren't filled, just -validate and possibly copy arr itself ... -} -... use arr ... - -:: - - int - PyArray_ConvertClipmodeSequence(PyObject *object, NPY_CLIPMODE - *modes, int n) - -Convert an object to an array of n NPY_CLIPMODE values. -This is intended to be used in functions where a different mode -could be applied to each axis, like in ravel_multi_index. - -:: - - PyObject * - PyArray_MatrixProduct2(PyObject *op1, PyObject - *op2, PyArrayObject*out) - -Numeric.matrixproduct(a,v,out) -just like inner product but does the swapaxes stuff on the fly - -:: - - npy_bool - NpyIter_IsFirstVisit(NpyIter *iter, int iop) - -Checks to see whether this is the first time the elements -of the specified reduction operand which the iterator points at are -being seen for the first time. The function returns -a reasonable answer for reduction operands and when buffering is -disabled. 
The answer may be incorrect for buffered non-reduction -operands. - -This function is intended to be used in EXTERNAL_LOOP mode only, -and will produce some wrong answers when that mode is not enabled. - -If this function returns true, the caller should also -check the inner loop stride of the operand, because if -that stride is 0, then only the first element of the innermost -external loop is being visited for the first time. - -WARNING: For performance reasons, 'iop' is not bounds-checked, -it is not confirmed that 'iop' is actually a reduction -operand, and it is not confirmed that EXTERNAL_LOOP -mode is enabled. These checks are the responsibility of -the caller, and should be done outside of any inner loops. - -:: - - int - PyArray_SetBaseObject(PyArrayObject *arr, PyObject *obj) - -Sets the 'base' attribute of the array. This steals a reference -to 'obj'. - -Returns 0 on success, -1 on failure. - -:: - - void - PyArray_CreateSortedStridePerm(int ndim, npy_intp - *strides, npy_stride_sort_item - *out_strideperm) - - -This function populates the first ndim elements -of strideperm with sorted descending by their absolute values. -For example, the stride array (4, -2, 12) becomes -[(2, 12), (0, 4), (1, -2)]. - -:: - - void - PyArray_RemoveAxesInPlace(PyArrayObject *arr, npy_bool *flags) - - -Removes the axes flagged as True from the array, -modifying it in place. If an axis flagged for removal -has a shape entry bigger than one, this effectively selects -index zero for that axis. - -WARNING: If an axis flagged for removal has a shape equal to zero, -the array will point to invalid memory. The caller must -validate this! - -For example, this can be used to remove the reduction axes -from a reduction result once its computation is complete. - -:: - - void - PyArray_DebugPrint(PyArrayObject *obj) - -Prints the raw data of the ndarray in a form useful for debugging -low-level C issues. - -:: - - int - PyArray_FailUnlessWriteable(PyArrayObject *obj, const char *name) - - -This function does nothing if obj is writeable, and raises an exception -(and returns -1) if obj is not writeable. It may also do other -house-keeping, such as issuing warnings on arrays which are transitioning -to become views. Always call this function at some point before writing to -an array. - -'name' is a name for the array, used to give better error -messages. Something like "assignment destination", "output array", or even -just "array". - -:: - - int - PyArray_SetUpdateIfCopyBase(PyArrayObject *arr, PyArrayObject *base) - - -Precondition: 'arr' is a copy of 'base' (though possibly with different -strides, ordering, etc.). This function sets the UPDATEIFCOPY flag and the -->base pointer on 'arr', so that when 'arr' is destructed, it will copy any -changes back to 'base'. - -Steals a reference to 'base'. - -Returns 0 on success, -1 on failure. - -:: - - void * - PyDataMem_NEW(size_t size) - -Allocates memory for array data. - -:: - - void - PyDataMem_FREE(void *ptr) - -Free memory for array data. - -:: - - void * - PyDataMem_RENEW(void *ptr, size_t size) - -Reallocate/resize memory for array data. - -:: - - PyDataMem_EventHookFunc * - PyDataMem_SetEventHook(PyDataMem_EventHookFunc *newhook, void - *user_data, void **old_data) - -Sets the allocation event hook for numpy array data. -Takes a PyDataMem_EventHookFunc *, which has the signature: -void hook(void *old, void *new, size_t size, void *user_data). -Also takes a void *user_data, and void **old_data. - -Returns a pointer to the previous hook or NULL. 
If old_data is -non-NULL, the previous user_data pointer will be copied to it. - -If not NULL, hook will be called at the end of each PyDataMem_NEW/FREE/RENEW: -result = PyDataMem_NEW(size) -> (*hook)(NULL, result, size, user_data) -PyDataMem_FREE(ptr) -> (*hook)(ptr, NULL, 0, user_data) -result = PyDataMem_RENEW(ptr, size) -> (*hook)(ptr, result, size, user_data) - -When the hook is called, the GIL will be held by the calling -thread. The hook should be written to be reentrant, if it performs -operations that might cause new allocation events (such as the -creation/descruction numpy objects, or creating/destroying Python -objects which might cause a gc) - diff --git a/include/numpy/ndarrayobject.h b/include/numpy/ndarrayobject.h deleted file mode 100644 index f00dd7744..000000000 --- a/include/numpy/ndarrayobject.h +++ /dev/null @@ -1,244 +0,0 @@ -/* - * DON'T INCLUDE THIS DIRECTLY. - */ - -#ifndef NPY_NDARRAYOBJECT_H -#define NPY_NDARRAYOBJECT_H -#ifdef __cplusplus -#define CONFUSE_EMACS { -#define CONFUSE_EMACS2 } -extern "C" CONFUSE_EMACS -#undef CONFUSE_EMACS -#undef CONFUSE_EMACS2 -/* ... otherwise a semi-smart identer (like emacs) tries to indent - everything when you're typing */ -#endif - -#include "ndarraytypes.h" - -/* Includes the "function" C-API -- these are all stored in a - list of pointers --- one for each file - The two lists are concatenated into one in multiarray. - - They are available as import_array() -*/ - -#include "__multiarray_api.h" - - -/* C-API that requries previous API to be defined */ - -#define PyArray_DescrCheck(op) (((PyObject*)(op))->ob_type==&PyArrayDescr_Type) - -#define PyArray_Check(op) PyObject_TypeCheck(op, &PyArray_Type) -#define PyArray_CheckExact(op) (((PyObject*)(op))->ob_type == &PyArray_Type) - -#define PyArray_HasArrayInterfaceType(op, type, context, out) \ - ((((out)=PyArray_FromStructInterface(op)) != Py_NotImplemented) || \ - (((out)=PyArray_FromInterface(op)) != Py_NotImplemented) || \ - (((out)=PyArray_FromArrayAttr(op, type, context)) != \ - Py_NotImplemented)) - -#define PyArray_HasArrayInterface(op, out) \ - PyArray_HasArrayInterfaceType(op, NULL, NULL, out) - -#define PyArray_IsZeroDim(op) (PyArray_Check(op) && \ - (PyArray_NDIM((PyArrayObject *)op) == 0)) - -#define PyArray_IsScalar(obj, cls) \ - (PyObject_TypeCheck(obj, &Py##cls##ArrType_Type)) - -#define PyArray_CheckScalar(m) (PyArray_IsScalar(m, Generic) || \ - PyArray_IsZeroDim(m)) - -#define PyArray_IsPythonNumber(obj) \ - (PyInt_Check(obj) || PyFloat_Check(obj) || PyComplex_Check(obj) || \ - PyLong_Check(obj) || PyBool_Check(obj)) - -#define PyArray_IsPythonScalar(obj) \ - (PyArray_IsPythonNumber(obj) || PyString_Check(obj) || \ - PyUnicode_Check(obj)) - -#define PyArray_IsAnyScalar(obj) \ - (PyArray_IsScalar(obj, Generic) || PyArray_IsPythonScalar(obj)) - -#define PyArray_CheckAnyScalar(obj) (PyArray_IsPythonScalar(obj) || \ - PyArray_CheckScalar(obj)) - -#define PyArray_IsIntegerScalar(obj) (PyInt_Check(obj) \ - || PyLong_Check(obj) \ - || PyArray_IsScalar((obj), Integer)) - - -#define PyArray_GETCONTIGUOUS(m) (PyArray_ISCONTIGUOUS(m) ? 
\ - Py_INCREF(m), (m) : \ - (PyArrayObject *)(PyArray_Copy(m))) - -#define PyArray_SAMESHAPE(a1,a2) ((PyArray_NDIM(a1) == PyArray_NDIM(a2)) && \ - PyArray_CompareLists(PyArray_DIMS(a1), \ - PyArray_DIMS(a2), \ - PyArray_NDIM(a1))) - -#define PyArray_SIZE(m) PyArray_MultiplyList(PyArray_DIMS(m), PyArray_NDIM(m)) -#define PyArray_NBYTES(m) (PyArray_ITEMSIZE(m) * PyArray_SIZE(m)) -#define PyArray_FROM_O(m) PyArray_FromAny(m, NULL, 0, 0, 0, NULL) - -#define PyArray_FROM_OF(m,flags) PyArray_CheckFromAny(m, NULL, 0, 0, flags, \ - NULL) - -#define PyArray_FROM_OT(m,type) PyArray_FromAny(m, \ - PyArray_DescrFromType(type), 0, 0, 0, NULL); - -#define PyArray_FROM_OTF(m, type, flags) \ - PyArray_FromAny(m, PyArray_DescrFromType(type), 0, 0, \ - (((flags) & NPY_ARRAY_ENSURECOPY) ? \ - ((flags) | NPY_ARRAY_DEFAULT) : (flags)), NULL) - -#define PyArray_FROMANY(m, type, min, max, flags) \ - PyArray_FromAny(m, PyArray_DescrFromType(type), min, max, \ - (((flags) & NPY_ARRAY_ENSURECOPY) ? \ - (flags) | NPY_ARRAY_DEFAULT : (flags)), NULL) - -#define PyArray_ZEROS(m, dims, type, is_f_order) \ - PyArray_Zeros(m, dims, PyArray_DescrFromType(type), is_f_order) - -#define PyArray_EMPTY(m, dims, type, is_f_order) \ - PyArray_Empty(m, dims, PyArray_DescrFromType(type), is_f_order) - -#define PyArray_FILLWBYTE(obj, val) memset(PyArray_DATA(obj), val, \ - PyArray_NBYTES(obj)) - -#define PyArray_REFCOUNT(obj) (((PyObject *)(obj))->ob_refcnt) -#define NPY_REFCOUNT PyArray_REFCOUNT -#define NPY_MAX_ELSIZE (2 * NPY_SIZEOF_LONGDOUBLE) - -#define PyArray_ContiguousFromAny(op, type, min_depth, max_depth) \ - PyArray_FromAny(op, PyArray_DescrFromType(type), min_depth, \ - max_depth, NPY_ARRAY_DEFAULT, NULL) - -#define PyArray_EquivArrTypes(a1, a2) \ - PyArray_EquivTypes(PyArray_DESCR(a1), PyArray_DESCR(a2)) - -#define PyArray_EquivByteorders(b1, b2) \ - (((b1) == (b2)) || (PyArray_ISNBO(b1) == PyArray_ISNBO(b2))) - -#define PyArray_SimpleNew(nd, dims, typenum) \ - PyArray_New(&PyArray_Type, nd, dims, typenum, NULL, NULL, 0, 0, NULL) - -#define PyArray_SimpleNewFromData(nd, dims, typenum, data) \ - PyArray_New(&PyArray_Type, nd, dims, typenum, NULL, \ - data, 0, NPY_ARRAY_CARRAY, NULL) - -#define PyArray_SimpleNewFromDescr(nd, dims, descr) \ - PyArray_NewFromDescr(&PyArray_Type, descr, nd, dims, \ - NULL, NULL, 0, NULL) - -#define PyArray_ToScalar(data, arr) \ - PyArray_Scalar(data, PyArray_DESCR(arr), (PyObject *)arr) - - -/* These might be faster without the dereferencing of obj - going on inside -- of course an optimizing compiler should - inline the constants inside a for loop making it a moot point -*/ - -#define PyArray_GETPTR1(obj, i) ((void *)(PyArray_BYTES(obj) + \ - (i)*PyArray_STRIDES(obj)[0])) - -#define PyArray_GETPTR2(obj, i, j) ((void *)(PyArray_BYTES(obj) + \ - (i)*PyArray_STRIDES(obj)[0] + \ - (j)*PyArray_STRIDES(obj)[1])) - -#define PyArray_GETPTR3(obj, i, j, k) ((void *)(PyArray_BYTES(obj) + \ - (i)*PyArray_STRIDES(obj)[0] + \ - (j)*PyArray_STRIDES(obj)[1] + \ - (k)*PyArray_STRIDES(obj)[2])) - -#define PyArray_GETPTR4(obj, i, j, k, l) ((void *)(PyArray_BYTES(obj) + \ - (i)*PyArray_STRIDES(obj)[0] + \ - (j)*PyArray_STRIDES(obj)[1] + \ - (k)*PyArray_STRIDES(obj)[2] + \ - (l)*PyArray_STRIDES(obj)[3])) - -static NPY_INLINE void -PyArray_XDECREF_ERR(PyArrayObject *arr) -{ - if (arr != NULL) { - if (PyArray_FLAGS(arr) & NPY_ARRAY_UPDATEIFCOPY) { - PyArrayObject *base = (PyArrayObject *)PyArray_BASE(arr); - PyArray_ENABLEFLAGS(base, NPY_ARRAY_WRITEABLE); - PyArray_CLEARFLAGS(arr, NPY_ARRAY_UPDATEIFCOPY); - } - 
Py_DECREF(arr); - } -} - -#define PyArray_DESCR_REPLACE(descr) do { \ - PyArray_Descr *_new_; \ - _new_ = PyArray_DescrNew(descr); \ - Py_XDECREF(descr); \ - descr = _new_; \ - } while(0) - -/* Copy should always return contiguous array */ -#define PyArray_Copy(obj) PyArray_NewCopy(obj, NPY_CORDER) - -#define PyArray_FromObject(op, type, min_depth, max_depth) \ - PyArray_FromAny(op, PyArray_DescrFromType(type), min_depth, \ - max_depth, NPY_ARRAY_BEHAVED | \ - NPY_ARRAY_ENSUREARRAY, NULL) - -#define PyArray_ContiguousFromObject(op, type, min_depth, max_depth) \ - PyArray_FromAny(op, PyArray_DescrFromType(type), min_depth, \ - max_depth, NPY_ARRAY_DEFAULT | \ - NPY_ARRAY_ENSUREARRAY, NULL) - -#define PyArray_CopyFromObject(op, type, min_depth, max_depth) \ - PyArray_FromAny(op, PyArray_DescrFromType(type), min_depth, \ - max_depth, NPY_ARRAY_ENSURECOPY | \ - NPY_ARRAY_DEFAULT | \ - NPY_ARRAY_ENSUREARRAY, NULL) - -#define PyArray_Cast(mp, type_num) \ - PyArray_CastToType(mp, PyArray_DescrFromType(type_num), 0) - -#define PyArray_Take(ap, items, axis) \ - PyArray_TakeFrom(ap, items, axis, NULL, NPY_RAISE) - -#define PyArray_Put(ap, items, values) \ - PyArray_PutTo(ap, items, values, NPY_RAISE) - -/* Compatibility with old Numeric stuff -- don't use in new code */ - -#define PyArray_FromDimsAndData(nd, d, type, data) \ - PyArray_FromDimsAndDataAndDescr(nd, d, PyArray_DescrFromType(type), \ - data) - - -/* - Check to see if this key in the dictionary is the "title" - entry of the tuple (i.e. a duplicate dictionary entry in the fields - dict. -*/ - -#define NPY_TITLE_KEY(key, value) ((PyTuple_GET_SIZE((value))==3) && \ - (PyTuple_GET_ITEM((value), 2) == (key))) - - -/* Define python version independent deprecation macro */ - -#if PY_VERSION_HEX >= 0x02050000 -#define DEPRECATE(msg) PyErr_WarnEx(PyExc_DeprecationWarning,msg,1) -#define DEPRECATE_FUTUREWARNING(msg) PyErr_WarnEx(PyExc_FutureWarning,msg,1) -#else -#define DEPRECATE(msg) PyErr_Warn(PyExc_DeprecationWarning,msg) -#define DEPRECATE_FUTUREWARNING(msg) PyErr_Warn(PyExc_FutureWarning,msg) -#endif - - -#ifdef __cplusplus -} -#endif - - -#endif /* NPY_NDARRAYOBJECT_H */ diff --git a/include/numpy/ndarraytypes.h b/include/numpy/ndarraytypes.h deleted file mode 100644 index 04d037ec8..000000000 --- a/include/numpy/ndarraytypes.h +++ /dev/null @@ -1,1731 +0,0 @@ -#ifndef NDARRAYTYPES_H -#define NDARRAYTYPES_H - -/* numpyconfig.h is auto-generated by the installer */ -#include "numpyconfig.h" - -#include "npy_common.h" -#include "npy_endian.h" -#include "npy_cpu.h" -#include "utils.h" - -#ifdef NPY_ENABLE_SEPARATE_COMPILATION - #define NPY_NO_EXPORT NPY_VISIBILITY_HIDDEN -#else - #define NPY_NO_EXPORT static -#endif - -/* Only use thread if configured in config and python supports it */ -#if defined WITH_THREAD && !NPY_NO_SMP - #define NPY_ALLOW_THREADS 1 -#else - #define NPY_ALLOW_THREADS 0 -#endif - - - -/* - * There are several places in the code where an array of dimensions - * is allocated statically. This is the size of that static - * allocation. - * - * The array creation itself could have arbitrary dimensions but all - * the places where static allocation is used would need to be changed - * to dynamic (including inside of several structures) - */ - -#define NPY_MAXDIMS 32 -#define NPY_MAXARGS 32 - -/* Used for Converter Functions "O&" code in ParseTuple */ -#define NPY_FAIL 0 -#define NPY_SUCCEED 1 - -/* - * Binary compatibility version number. 
This number is increased - * whenever the C-API is changed such that binary compatibility is - * broken, i.e. whenever a recompile of extension modules is needed. - */ -#define NPY_VERSION NPY_ABI_VERSION - -/* - * Minor API version. This number is increased whenever a change is - * made to the C-API -- whether it breaks binary compatibility or not. - * Some changes, such as adding a function pointer to the end of the - * function table, can be made without breaking binary compatibility. - * In this case, only the NPY_FEATURE_VERSION (*not* NPY_VERSION) - * would be increased. Whenever binary compatibility is broken, both - * NPY_VERSION and NPY_FEATURE_VERSION should be increased. - */ -#define NPY_FEATURE_VERSION NPY_API_VERSION - -enum NPY_TYPES { NPY_BOOL=0, - NPY_BYTE, NPY_UBYTE, - NPY_SHORT, NPY_USHORT, - NPY_INT, NPY_UINT, - NPY_LONG, NPY_ULONG, - NPY_LONGLONG, NPY_ULONGLONG, - NPY_FLOAT, NPY_DOUBLE, NPY_LONGDOUBLE, - NPY_CFLOAT, NPY_CDOUBLE, NPY_CLONGDOUBLE, - NPY_OBJECT=17, - NPY_STRING, NPY_UNICODE, - NPY_VOID, - /* - * New 1.6 types appended, may be integrated - * into the above in 2.0. - */ - NPY_DATETIME, NPY_TIMEDELTA, NPY_HALF, - - NPY_NTYPES, - NPY_NOTYPE, - NPY_CHAR, /* special flag */ - NPY_USERDEF=256, /* leave room for characters */ - - /* The number of types not including the new 1.6 types */ - NPY_NTYPES_ABI_COMPATIBLE=21 -}; - -/* basetype array priority */ -#define NPY_PRIORITY 0.0 - -/* default subtype priority */ -#define NPY_SUBTYPE_PRIORITY 1.0 - -/* default scalar priority */ -#define NPY_SCALAR_PRIORITY -1000000.0 - -/* How many floating point types are there (excluding half) */ -#define NPY_NUM_FLOATTYPE 3 - -/* - * These characters correspond to the array type and the struct - * module - */ - -enum NPY_TYPECHAR { - NPY_BOOLLTR = '?', - NPY_BYTELTR = 'b', - NPY_UBYTELTR = 'B', - NPY_SHORTLTR = 'h', - NPY_USHORTLTR = 'H', - NPY_INTLTR = 'i', - NPY_UINTLTR = 'I', - NPY_LONGLTR = 'l', - NPY_ULONGLTR = 'L', - NPY_LONGLONGLTR = 'q', - NPY_ULONGLONGLTR = 'Q', - NPY_HALFLTR = 'e', - NPY_FLOATLTR = 'f', - NPY_DOUBLELTR = 'd', - NPY_LONGDOUBLELTR = 'g', - NPY_CFLOATLTR = 'F', - NPY_CDOUBLELTR = 'D', - NPY_CLONGDOUBLELTR = 'G', - NPY_OBJECTLTR = 'O', - NPY_STRINGLTR = 'S', - NPY_STRINGLTR2 = 'a', - NPY_UNICODELTR = 'U', - NPY_VOIDLTR = 'V', - NPY_DATETIMELTR = 'M', - NPY_TIMEDELTALTR = 'm', - NPY_CHARLTR = 'c', - - /* - * No Descriptor, just a define -- this let's - * Python users specify an array of integers - * large enough to hold a pointer on the - * platform - */ - NPY_INTPLTR = 'p', - NPY_UINTPLTR = 'P', - - /* - * These are for dtype 'kinds', not dtype 'typecodes' - * as the above are for. 
- */ - NPY_GENBOOLLTR ='b', - NPY_SIGNEDLTR = 'i', - NPY_UNSIGNEDLTR = 'u', - NPY_FLOATINGLTR = 'f', - NPY_COMPLEXLTR = 'c' -}; - -typedef enum { - NPY_QUICKSORT=0, - NPY_HEAPSORT=1, - NPY_MERGESORT=2 -} NPY_SORTKIND; -#define NPY_NSORTS (NPY_MERGESORT + 1) - - -typedef enum { - NPY_SEARCHLEFT=0, - NPY_SEARCHRIGHT=1 -} NPY_SEARCHSIDE; -#define NPY_NSEARCHSIDES (NPY_SEARCHRIGHT + 1) - - -typedef enum { - NPY_NOSCALAR=-1, - NPY_BOOL_SCALAR, - NPY_INTPOS_SCALAR, - NPY_INTNEG_SCALAR, - NPY_FLOAT_SCALAR, - NPY_COMPLEX_SCALAR, - NPY_OBJECT_SCALAR -} NPY_SCALARKIND; -#define NPY_NSCALARKINDS (NPY_OBJECT_SCALAR + 1) - -/* For specifying array memory layout or iteration order */ -typedef enum { - /* Fortran order if inputs are all Fortran, C otherwise */ - NPY_ANYORDER=-1, - /* C order */ - NPY_CORDER=0, - /* Fortran order */ - NPY_FORTRANORDER=1, - /* An order as close to the inputs as possible */ - NPY_KEEPORDER=2 -} NPY_ORDER; - -/* For specifying allowed casting in operations which support it */ -typedef enum { - /* Only allow identical types */ - NPY_NO_CASTING=0, - /* Allow identical and byte swapped types */ - NPY_EQUIV_CASTING=1, - /* Only allow safe casts */ - NPY_SAFE_CASTING=2, - /* Allow safe casts or casts within the same kind */ - NPY_SAME_KIND_CASTING=3, - /* Allow any casts */ - NPY_UNSAFE_CASTING=4, - - /* - * Temporary internal definition only, will be removed in upcoming - * release, see below - * */ - NPY_INTERNAL_UNSAFE_CASTING_BUT_WARN_UNLESS_SAME_KIND = 100, -} NPY_CASTING; - -typedef enum { - NPY_CLIP=0, - NPY_WRAP=1, - NPY_RAISE=2 -} NPY_CLIPMODE; - -/* The special not-a-time (NaT) value */ -#define NPY_DATETIME_NAT NPY_MIN_INT64 - -/* - * Upper bound on the length of a DATETIME ISO 8601 string - * YEAR: 21 (64-bit year) - * MONTH: 3 - * DAY: 3 - * HOURS: 3 - * MINUTES: 3 - * SECONDS: 3 - * ATTOSECONDS: 1 + 3*6 - * TIMEZONE: 5 - * NULL TERMINATOR: 1 - */ -#define NPY_DATETIME_MAX_ISO8601_STRLEN (21+3*5+1+3*6+6+1) - -typedef enum { - NPY_FR_Y = 0, /* Years */ - NPY_FR_M = 1, /* Months */ - NPY_FR_W = 2, /* Weeks */ - /* Gap where 1.6 NPY_FR_B (value 3) was */ - NPY_FR_D = 4, /* Days */ - NPY_FR_h = 5, /* hours */ - NPY_FR_m = 6, /* minutes */ - NPY_FR_s = 7, /* seconds */ - NPY_FR_ms = 8, /* milliseconds */ - NPY_FR_us = 9, /* microseconds */ - NPY_FR_ns = 10,/* nanoseconds */ - NPY_FR_ps = 11,/* picoseconds */ - NPY_FR_fs = 12,/* femtoseconds */ - NPY_FR_as = 13,/* attoseconds */ - NPY_FR_GENERIC = 14 /* Generic, unbound units, can convert to anything */ -} NPY_DATETIMEUNIT; - -/* - * NOTE: With the NPY_FR_B gap for 1.6 ABI compatibility, NPY_DATETIME_NUMUNITS - * is technically one more than the actual number of units. - */ -#define NPY_DATETIME_NUMUNITS (NPY_FR_GENERIC + 1) -#define NPY_DATETIME_DEFAULTUNIT NPY_FR_GENERIC - -/* - * Business day conventions for mapping invalid business - * days to valid business days. - */ -typedef enum { - /* Go forward in time to the following business day. */ - NPY_BUSDAY_FORWARD, - NPY_BUSDAY_FOLLOWING = NPY_BUSDAY_FORWARD, - /* Go backward in time to the preceding business day. */ - NPY_BUSDAY_BACKWARD, - NPY_BUSDAY_PRECEDING = NPY_BUSDAY_BACKWARD, - /* - * Go forward in time to the following business day, unless it - * crosses a month boundary, in which case go backward - */ - NPY_BUSDAY_MODIFIEDFOLLOWING, - /* - * Go backward in time to the preceding business day, unless it - * crosses a month boundary, in which case go forward. - */ - NPY_BUSDAY_MODIFIEDPRECEDING, - /* Produce a NaT for non-business days. 
*/ - NPY_BUSDAY_NAT, - /* Raise an exception for non-business days. */ - NPY_BUSDAY_RAISE -} NPY_BUSDAY_ROLL; - -/************************************************************ - * NumPy Auxiliary Data for inner loops, sort functions, etc. - ************************************************************/ - -/* - * When creating an auxiliary data struct, this should always appear - * as the first member, like this: - * - * typedef struct { - * NpyAuxData base; - * double constant; - * } constant_multiplier_aux_data; - */ -typedef struct NpyAuxData_tag NpyAuxData; - -/* Function pointers for freeing or cloning auxiliary data */ -typedef void (NpyAuxData_FreeFunc) (NpyAuxData *); -typedef NpyAuxData *(NpyAuxData_CloneFunc) (NpyAuxData *); - -struct NpyAuxData_tag { - NpyAuxData_FreeFunc *free; - NpyAuxData_CloneFunc *clone; - /* To allow for a bit of expansion without breaking the ABI */ - void *reserved[2]; -}; - -/* Macros to use for freeing and cloning auxiliary data */ -#define NPY_AUXDATA_FREE(auxdata) \ - do { \ - if ((auxdata) != NULL) { \ - (auxdata)->free(auxdata); \ - } \ - } while(0) -#define NPY_AUXDATA_CLONE(auxdata) \ - ((auxdata)->clone(auxdata)) - -#define NPY_ERR(str) fprintf(stderr, #str); fflush(stderr); -#define NPY_ERR2(str) fprintf(stderr, str); fflush(stderr); - -#define NPY_STRINGIFY(x) #x -#define NPY_TOSTRING(x) NPY_STRINGIFY(x) - - /* - * Macros to define how array, and dimension/strides data is - * allocated. - */ - - /* Data buffer - PyDataMem_NEW/FREE/RENEW are in multiarraymodule.c */ - -#define NPY_USE_PYMEM 1 - -#if NPY_USE_PYMEM == 1 -#define PyArray_malloc PyMem_Malloc -#define PyArray_free PyMem_Free -#define PyArray_realloc PyMem_Realloc -#else -#define PyArray_malloc malloc -#define PyArray_free free -#define PyArray_realloc realloc -#endif - -/* Dimensions and strides */ -#define PyDimMem_NEW(size) \ - ((npy_intp *)PyArray_malloc(size*sizeof(npy_intp))) - -#define PyDimMem_FREE(ptr) PyArray_free(ptr) - -#define PyDimMem_RENEW(ptr,size) \ - ((npy_intp *)PyArray_realloc(ptr,size*sizeof(npy_intp))) - -/* forward declaration */ -struct _PyArray_Descr; - -/* These must deal with unaligned and swapped data if necessary */ -typedef PyObject * (PyArray_GetItemFunc) (void *, void *); -typedef int (PyArray_SetItemFunc)(PyObject *, void *, void *); - -typedef void (PyArray_CopySwapNFunc)(void *, npy_intp, void *, npy_intp, - npy_intp, int, void *); - -typedef void (PyArray_CopySwapFunc)(void *, void *, int, void *); -typedef npy_bool (PyArray_NonzeroFunc)(void *, void *); - - -/* - * These assume aligned and notswapped data -- a buffer will be used - * before or contiguous data will be obtained - */ - -typedef int (PyArray_CompareFunc)(const void *, const void *, void *); -typedef int (PyArray_ArgFunc)(void*, npy_intp, npy_intp*, void *); - -typedef void (PyArray_DotFunc)(void *, npy_intp, void *, npy_intp, void *, - npy_intp, void *); - -typedef void (PyArray_VectorUnaryFunc)(void *, void *, npy_intp, void *, - void *); - -/* - * XXX the ignore argument should be removed next time the API version - * is bumped. It used to be the separator. 
- */ -typedef int (PyArray_ScanFunc)(FILE *fp, void *dptr, - char *ignore, struct _PyArray_Descr *); -typedef int (PyArray_FromStrFunc)(char *s, void *dptr, char **endptr, - struct _PyArray_Descr *); - -typedef int (PyArray_FillFunc)(void *, npy_intp, void *); - -typedef int (PyArray_SortFunc)(void *, npy_intp, void *); -typedef int (PyArray_ArgSortFunc)(void *, npy_intp *, npy_intp, void *); - -typedef int (PyArray_FillWithScalarFunc)(void *, npy_intp, void *, void *); - -typedef int (PyArray_ScalarKindFunc)(void *); - -typedef void (PyArray_FastClipFunc)(void *in, npy_intp n_in, void *min, - void *max, void *out); -typedef void (PyArray_FastPutmaskFunc)(void *in, void *mask, npy_intp n_in, - void *values, npy_intp nv); -typedef int (PyArray_FastTakeFunc)(void *dest, void *src, npy_intp *indarray, - npy_intp nindarray, npy_intp n_outer, - npy_intp m_middle, npy_intp nelem, - NPY_CLIPMODE clipmode); - -typedef struct { - npy_intp *ptr; - int len; -} PyArray_Dims; - -typedef struct { - /* - * Functions to cast to most other standard types - * Can have some NULL entries. The types - * DATETIME, TIMEDELTA, and HALF go into the castdict - * even though they are built-in. - */ - PyArray_VectorUnaryFunc *cast[NPY_NTYPES_ABI_COMPATIBLE]; - - /* The next four functions *cannot* be NULL */ - - /* - * Functions to get and set items with standard Python types - * -- not array scalars - */ - PyArray_GetItemFunc *getitem; - PyArray_SetItemFunc *setitem; - - /* - * Copy and/or swap data. Memory areas may not overlap - * Use memmove first if they might - */ - PyArray_CopySwapNFunc *copyswapn; - PyArray_CopySwapFunc *copyswap; - - /* - * Function to compare items - * Can be NULL - */ - PyArray_CompareFunc *compare; - - /* - * Function to select largest - * Can be NULL - */ - PyArray_ArgFunc *argmax; - - /* - * Function to compute dot product - * Can be NULL - */ - PyArray_DotFunc *dotfunc; - - /* - * Function to scan an ASCII file and - * place a single value plus possible separator - * Can be NULL - */ - PyArray_ScanFunc *scanfunc; - - /* - * Function to read a single value from a string - * and adjust the pointer; Can be NULL - */ - PyArray_FromStrFunc *fromstr; - - /* - * Function to determine if data is zero or not - * If NULL a default version is - * used at Registration time. - */ - PyArray_NonzeroFunc *nonzero; - - /* - * Used for arange. - * Can be NULL. - */ - PyArray_FillFunc *fill; - - /* - * Function to fill arrays with scalar values - * Can be NULL - */ - PyArray_FillWithScalarFunc *fillwithscalar; - - /* - * Sorting functions - * Can be NULL - */ - PyArray_SortFunc *sort[NPY_NSORTS]; - PyArray_ArgSortFunc *argsort[NPY_NSORTS]; - - /* - * Dictionary of additional casting functions - * PyArray_VectorUnaryFuncs - * which can be populated to support casting - * to other registered types. Can be NULL - */ - PyObject *castdict; - - /* - * Functions useful for generalizing - * the casting rules. - * Can be NULL; - */ - PyArray_ScalarKindFunc *scalarkind; - int **cancastscalarkindto; - int *cancastto; - - PyArray_FastClipFunc *fastclip; - PyArray_FastPutmaskFunc *fastputmask; - PyArray_FastTakeFunc *fasttake; - - /* - * Function to select smallest - * Can be NULL - */ - PyArray_ArgFunc *argmin; - -} PyArray_ArrFuncs; - -/* The item must be reference counted when it is inserted or extracted. 
*/ -#define NPY_ITEM_REFCOUNT 0x01 -/* Same as needing REFCOUNT */ -#define NPY_ITEM_HASOBJECT 0x01 -/* Convert to list for pickling */ -#define NPY_LIST_PICKLE 0x02 -/* The item is a POINTER */ -#define NPY_ITEM_IS_POINTER 0x04 -/* memory needs to be initialized for this data-type */ -#define NPY_NEEDS_INIT 0x08 -/* operations need Python C-API so don't give-up thread. */ -#define NPY_NEEDS_PYAPI 0x10 -/* Use f.getitem when extracting elements of this data-type */ -#define NPY_USE_GETITEM 0x20 -/* Use f.setitem when setting creating 0-d array from this data-type.*/ -#define NPY_USE_SETITEM 0x40 -/* A sticky flag specifically for structured arrays */ -#define NPY_ALIGNED_STRUCT 0x80 - -/* - *These are inherited for global data-type if any data-types in the - * field have them - */ -#define NPY_FROM_FIELDS (NPY_NEEDS_INIT | NPY_LIST_PICKLE | \ - NPY_ITEM_REFCOUNT | NPY_NEEDS_PYAPI) - -#define NPY_OBJECT_DTYPE_FLAGS (NPY_LIST_PICKLE | NPY_USE_GETITEM | \ - NPY_ITEM_IS_POINTER | NPY_ITEM_REFCOUNT | \ - NPY_NEEDS_INIT | NPY_NEEDS_PYAPI) - -#define PyDataType_FLAGCHK(dtype, flag) \ - (((dtype)->flags & (flag)) == (flag)) - -#define PyDataType_REFCHK(dtype) \ - PyDataType_FLAGCHK(dtype, NPY_ITEM_REFCOUNT) - -typedef struct _PyArray_Descr { - PyObject_HEAD - /* - * the type object representing an - * instance of this type -- should not - * be two type_numbers with the same type - * object. - */ - PyTypeObject *typeobj; - /* kind for this type */ - char kind; - /* unique-character representing this type */ - char type; - /* - * '>' (big), '<' (little), '|' - * (not-applicable), or '=' (native). - */ - char byteorder; - /* flags describing data type */ - char flags; - /* number representing this type */ - int type_num; - /* element size (itemsize) for this type */ - int elsize; - /* alignment needed for this type */ - int alignment; - /* - * Non-NULL if this type is - * is an array (C-contiguous) - * of some other type - */ - struct _arr_descr *subarray; - /* - * The fields dictionary for this type - * For statically defined descr this - * is always Py_None - */ - PyObject *fields; - /* - * An ordered tuple of field names or NULL - * if no fields are defined - */ - PyObject *names; - /* - * a table of functions specific for each - * basic data descriptor - */ - PyArray_ArrFuncs *f; - /* Metadata about this dtype */ - PyObject *metadata; - /* - * Metadata specific to the C implementation - * of the particular dtype. This was added - * for NumPy 1.7.0. - */ - NpyAuxData *c_metadata; -} PyArray_Descr; - -typedef struct _arr_descr { - PyArray_Descr *base; - PyObject *shape; /* a tuple */ -} PyArray_ArrayDescr; - -/* - * The main array object structure. - * - * It has been recommended to use the inline functions defined below - * (PyArray_DATA and friends) to access fields here for a number of - * releases. Direct access to the members themselves is deprecated. - * To ensure that your code does not use deprecated access, - * #define NPY_NO_DEPRECATED_API NPY_1_7_VERSION - * (or NPY_1_8_VERSION or higher as required). 
- */ -/* This struct will be moved to a private header in a future release */ -typedef struct tagPyArrayObject_fields { - PyObject_HEAD - /* Pointer to the raw data buffer */ - char *data; - /* The number of dimensions, also called 'ndim' */ - int nd; - /* The size in each dimension, also called 'shape' */ - npy_intp *dimensions; - /* - * Number of bytes to jump to get to the - * next element in each dimension - */ - npy_intp *strides; - /* - * This object is decref'd upon - * deletion of array. Except in the - * case of UPDATEIFCOPY which has - * special handling. - * - * For views it points to the original - * array, collapsed so no chains of - * views occur. - * - * For creation from buffer object it - * points to an object that shold be - * decref'd on deletion - * - * For UPDATEIFCOPY flag this is an - * array to-be-updated upon deletion - * of this one - */ - PyObject *base; - /* Pointer to type structure */ - PyArray_Descr *descr; - /* Flags describing array -- see below */ - int flags; - /* For weak references */ - PyObject *weakreflist; -} PyArrayObject_fields; - -/* - * To hide the implementation details, we only expose - * the Python struct HEAD. - */ -#if !(defined(NPY_NO_DEPRECATED_API) && (NPY_API_VERSION <= NPY_NO_DEPRECATED_API)) -/* - * Can't put this in npy_deprecated_api.h like the others. - * PyArrayObject field access is deprecated as of NumPy 1.7. - */ -typedef PyArrayObject_fields PyArrayObject; -#else -typedef struct tagPyArrayObject { - PyObject_HEAD -} PyArrayObject; -#endif - -#define NPY_SIZEOF_PYARRAYOBJECT (sizeof(PyArrayObject_fields)) - -/* Array Flags Object */ -typedef struct PyArrayFlagsObject { - PyObject_HEAD - PyObject *arr; - int flags; -} PyArrayFlagsObject; - -/* Mirrors buffer object to ptr */ - -typedef struct { - PyObject_HEAD - PyObject *base; - void *ptr; - npy_intp len; - int flags; -} PyArray_Chunk; - -typedef struct { - NPY_DATETIMEUNIT base; - int num; -} PyArray_DatetimeMetaData; - -typedef struct { - NpyAuxData base; - PyArray_DatetimeMetaData meta; -} PyArray_DatetimeDTypeMetaData; - -/* - * This structure contains an exploded view of a date-time value. - * NaT is represented by year == NPY_DATETIME_NAT. - */ -typedef struct { - npy_int64 year; - npy_int32 month, day, hour, min, sec, us, ps, as; -} npy_datetimestruct; - -/* This is not used internally. */ -typedef struct { - npy_int64 day; - npy_int32 sec, us, ps, as; -} npy_timedeltastruct; - -typedef int (PyArray_FinalizeFunc)(PyArrayObject *, PyObject *); - -/* - * Means c-style contiguous (last index varies the fastest). The data - * elements right after each other. - * - * This flag may be requested in constructor functions. - * This flag may be tested for in PyArray_FLAGS(arr). - */ -#define NPY_ARRAY_C_CONTIGUOUS 0x0001 - -/* - * Set if array is a contiguous Fortran array: the first index varies - * the fastest in memory (strides array is reverse of C-contiguous - * array) - * - * This flag may be requested in constructor functions. - * This flag may be tested for in PyArray_FLAGS(arr). - */ -#define NPY_ARRAY_F_CONTIGUOUS 0x0002 - -/* - * Note: all 0-d arrays are C_CONTIGUOUS and F_CONTIGUOUS. If a - * 1-d array is C_CONTIGUOUS it is also F_CONTIGUOUS - */ - -/* - * If set, the array owns the data: it will be free'd when the array - * is deleted. - * - * This flag may be tested for in PyArray_FLAGS(arr). 
- */ -#define NPY_ARRAY_OWNDATA 0x0004 - -/* - * An array never has the next four set; they're only used as parameter - * flags to the the various FromAny functions - * - * This flag may be requested in constructor functions. - */ - -/* Cause a cast to occur regardless of whether or not it is safe. */ -#define NPY_ARRAY_FORCECAST 0x0010 - -/* - * Always copy the array. Returned arrays are always CONTIGUOUS, - * ALIGNED, and WRITEABLE. - * - * This flag may be requested in constructor functions. - */ -#define NPY_ARRAY_ENSURECOPY 0x0020 - -/* - * Make sure the returned array is a base-class ndarray - * - * This flag may be requested in constructor functions. - */ -#define NPY_ARRAY_ENSUREARRAY 0x0040 - -/* - * Make sure that the strides are in units of the element size Needed - * for some operations with record-arrays. - * - * This flag may be requested in constructor functions. - */ -#define NPY_ARRAY_ELEMENTSTRIDES 0x0080 - -/* - * Array data is aligned on the appropiate memory address for the type - * stored according to how the compiler would align things (e.g., an - * array of integers (4 bytes each) starts on a memory address that's - * a multiple of 4) - * - * This flag may be requested in constructor functions. - * This flag may be tested for in PyArray_FLAGS(arr). - */ -#define NPY_ARRAY_ALIGNED 0x0100 - -/* - * Array data has the native endianness - * - * This flag may be requested in constructor functions. - */ -#define NPY_ARRAY_NOTSWAPPED 0x0200 - -/* - * Array data is writeable - * - * This flag may be requested in constructor functions. - * This flag may be tested for in PyArray_FLAGS(arr). - */ -#define NPY_ARRAY_WRITEABLE 0x0400 - -/* - * If this flag is set, then base contains a pointer to an array of - * the same size that should be updated with the current contents of - * this array when this array is deallocated - * - * This flag may be requested in constructor functions. - * This flag may be tested for in PyArray_FLAGS(arr). - */ -#define NPY_ARRAY_UPDATEIFCOPY 0x1000 - -/* - * NOTE: there are also internal flags defined in multiarray/arrayobject.h, - * which start at bit 31 and work down. 
- */ - -#define NPY_ARRAY_BEHAVED (NPY_ARRAY_ALIGNED | \ - NPY_ARRAY_WRITEABLE) -#define NPY_ARRAY_BEHAVED_NS (NPY_ARRAY_ALIGNED | \ - NPY_ARRAY_WRITEABLE | \ - NPY_ARRAY_NOTSWAPPED) -#define NPY_ARRAY_CARRAY (NPY_ARRAY_C_CONTIGUOUS | \ - NPY_ARRAY_BEHAVED) -#define NPY_ARRAY_CARRAY_RO (NPY_ARRAY_C_CONTIGUOUS | \ - NPY_ARRAY_ALIGNED) -#define NPY_ARRAY_FARRAY (NPY_ARRAY_F_CONTIGUOUS | \ - NPY_ARRAY_BEHAVED) -#define NPY_ARRAY_FARRAY_RO (NPY_ARRAY_F_CONTIGUOUS | \ - NPY_ARRAY_ALIGNED) -#define NPY_ARRAY_DEFAULT (NPY_ARRAY_CARRAY) -#define NPY_ARRAY_IN_ARRAY (NPY_ARRAY_CARRAY_RO) -#define NPY_ARRAY_OUT_ARRAY (NPY_ARRAY_CARRAY) -#define NPY_ARRAY_INOUT_ARRAY (NPY_ARRAY_CARRAY | \ - NPY_ARRAY_UPDATEIFCOPY) -#define NPY_ARRAY_IN_FARRAY (NPY_ARRAY_FARRAY_RO) -#define NPY_ARRAY_OUT_FARRAY (NPY_ARRAY_FARRAY) -#define NPY_ARRAY_INOUT_FARRAY (NPY_ARRAY_FARRAY | \ - NPY_ARRAY_UPDATEIFCOPY) - -#define NPY_ARRAY_UPDATE_ALL (NPY_ARRAY_C_CONTIGUOUS | \ - NPY_ARRAY_F_CONTIGUOUS | \ - NPY_ARRAY_ALIGNED) - -/* This flag is for the array interface, not PyArrayObject */ -#define NPY_ARR_HAS_DESCR 0x0800 - - - - -/* - * Size of internal buffers used for alignment Make BUFSIZE a multiple - * of sizeof(npy_cdouble) -- usually 16 so that ufunc buffers are aligned - */ -#define NPY_MIN_BUFSIZE ((int)sizeof(npy_cdouble)) -#define NPY_MAX_BUFSIZE (((int)sizeof(npy_cdouble))*1000000) -#define NPY_BUFSIZE 8192 -/* buffer stress test size: */ -/*#define NPY_BUFSIZE 17*/ - -#define PyArray_MAX(a,b) (((a)>(b))?(a):(b)) -#define PyArray_MIN(a,b) (((a)<(b))?(a):(b)) -#define PyArray_CLT(p,q) ((((p).real==(q).real) ? ((p).imag < (q).imag) : \ - ((p).real < (q).real))) -#define PyArray_CGT(p,q) ((((p).real==(q).real) ? ((p).imag > (q).imag) : \ - ((p).real > (q).real))) -#define PyArray_CLE(p,q) ((((p).real==(q).real) ? ((p).imag <= (q).imag) : \ - ((p).real <= (q).real))) -#define PyArray_CGE(p,q) ((((p).real==(q).real) ? ((p).imag >= (q).imag) : \ - ((p).real >= (q).real))) -#define PyArray_CEQ(p,q) (((p).real==(q).real) && ((p).imag == (q).imag)) -#define PyArray_CNE(p,q) (((p).real!=(q).real) || ((p).imag != (q).imag)) - -/* - * C API: consists of Macros and functions. The MACROS are defined - * here. 
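(Hedged aside, not part of the deleted header: a minimal sketch of requesting a "behaved" input array through the flag combinations defined above, using the public PyArray_FROM_OTF macro from ndarrayobject.h.)

    /* Ask for an aligned, C-contiguous array of native doubles.
     * NumPy copies the input only if the requirements are not already met;
     * NULL is returned (with an exception set) on failure. */
    static PyArrayObject *
    as_in_array(PyObject *obj)
    {
        return (PyArrayObject *)PyArray_FROM_OTF(obj, NPY_DOUBLE,
                                                 NPY_ARRAY_IN_ARRAY);
    }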
- */ - - -#define PyArray_ISCONTIGUOUS(m) PyArray_CHKFLAGS(m, NPY_ARRAY_C_CONTIGUOUS) -#define PyArray_ISWRITEABLE(m) PyArray_CHKFLAGS(m, NPY_ARRAY_WRITEABLE) -#define PyArray_ISALIGNED(m) PyArray_CHKFLAGS(m, NPY_ARRAY_ALIGNED) - -#define PyArray_IS_C_CONTIGUOUS(m) PyArray_CHKFLAGS(m, NPY_ARRAY_C_CONTIGUOUS) -#define PyArray_IS_F_CONTIGUOUS(m) PyArray_CHKFLAGS(m, NPY_ARRAY_F_CONTIGUOUS) - -#if NPY_ALLOW_THREADS -#define NPY_BEGIN_ALLOW_THREADS Py_BEGIN_ALLOW_THREADS -#define NPY_END_ALLOW_THREADS Py_END_ALLOW_THREADS -#define NPY_BEGIN_THREADS_DEF PyThreadState *_save=NULL; -#define NPY_BEGIN_THREADS do {_save = PyEval_SaveThread();} while (0); -#define NPY_END_THREADS do {if (_save) PyEval_RestoreThread(_save);} while (0); - -#define NPY_BEGIN_THREADS_DESCR(dtype) \ - do {if (!(PyDataType_FLAGCHK(dtype, NPY_NEEDS_PYAPI))) \ - NPY_BEGIN_THREADS;} while (0); - -#define NPY_END_THREADS_DESCR(dtype) \ - do {if (!(PyDataType_FLAGCHK(dtype, NPY_NEEDS_PYAPI))) \ - NPY_END_THREADS; } while (0); - -#define NPY_ALLOW_C_API_DEF PyGILState_STATE __save__; -#define NPY_ALLOW_C_API do {__save__ = PyGILState_Ensure();} while (0); -#define NPY_DISABLE_C_API do {PyGILState_Release(__save__);} while (0); -#else -#define NPY_BEGIN_ALLOW_THREADS -#define NPY_END_ALLOW_THREADS -#define NPY_BEGIN_THREADS_DEF -#define NPY_BEGIN_THREADS -#define NPY_END_THREADS -#define NPY_BEGIN_THREADS_DESCR(dtype) -#define NPY_END_THREADS_DESCR(dtype) -#define NPY_ALLOW_C_API_DEF -#define NPY_ALLOW_C_API -#define NPY_DISABLE_C_API -#endif - -/********************************** - * The nditer object, added in 1.6 - **********************************/ - -/* The actual structure of the iterator is an internal detail */ -typedef struct NpyIter_InternalOnly NpyIter; - -/* Iterator function pointers that may be specialized */ -typedef int (NpyIter_IterNextFunc)(NpyIter *iter); -typedef void (NpyIter_GetMultiIndexFunc)(NpyIter *iter, - npy_intp *outcoords); - -/*** Global flags that may be passed to the iterator constructors ***/ - -/* Track an index representing C order */ -#define NPY_ITER_C_INDEX 0x00000001 -/* Track an index representing Fortran order */ -#define NPY_ITER_F_INDEX 0x00000002 -/* Track a multi-index */ -#define NPY_ITER_MULTI_INDEX 0x00000004 -/* User code external to the iterator does the 1-dimensional innermost loop */ -#define NPY_ITER_EXTERNAL_LOOP 0x00000008 -/* Convert all the operands to a common data type */ -#define NPY_ITER_COMMON_DTYPE 0x00000010 -/* Operands may hold references, requiring API access during iteration */ -#define NPY_ITER_REFS_OK 0x00000020 -/* Zero-sized operands should be permitted, iteration checks IterSize for 0 */ -#define NPY_ITER_ZEROSIZE_OK 0x00000040 -/* Permits reductions (size-0 stride with dimension size > 1) */ -#define NPY_ITER_REDUCE_OK 0x00000080 -/* Enables sub-range iteration */ -#define NPY_ITER_RANGED 0x00000100 -/* Enables buffering */ -#define NPY_ITER_BUFFERED 0x00000200 -/* When buffering is enabled, grows the inner loop if possible */ -#define NPY_ITER_GROWINNER 0x00000400 -/* Delay allocation of buffers until first Reset* call */ -#define NPY_ITER_DELAY_BUFALLOC 0x00000800 -/* When NPY_KEEPORDER is specified, disable reversing negative-stride axes */ -#define NPY_ITER_DONT_NEGATE_STRIDES 0x00001000 - -/*** Per-operand flags that may be passed to the iterator constructors ***/ - -/* The operand will be read from and written to */ -#define NPY_ITER_READWRITE 0x00010000 -/* The operand will only be read from */ -#define NPY_ITER_READONLY 0x00020000 -/* The operand 
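(Hedged aside, not part of the deleted header: a minimal sketch of the GIL-release macros above. It assumes `arr` is a one-dimensional, C-contiguous NPY_DOUBLE array.)

    /* Release the GIL around a tight numeric loop; the _DESCR variants
     * keep the GIL held automatically when the dtype needs the Python API. */
    static double
    sum_contiguous_doubles(PyArrayObject *arr)
    {
        double total = 0.0;
        const double *data = (const double *)PyArray_DATA(arr);
        npy_intp i, n = PyArray_DIM(arr, 0);
        NPY_BEGIN_THREADS_DEF;

        NPY_BEGIN_THREADS_DESCR(PyArray_DESCR(arr));
        for (i = 0; i < n; i++) {
            total += data[i];
        }
        NPY_END_THREADS_DESCR(PyArray_DESCR(arr));
        return total;
    }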
will only be written to */ -#define NPY_ITER_WRITEONLY 0x00040000 -/* The operand's data must be in native byte order */ -#define NPY_ITER_NBO 0x00080000 -/* The operand's data must be aligned */ -#define NPY_ITER_ALIGNED 0x00100000 -/* The operand's data must be contiguous (within the inner loop) */ -#define NPY_ITER_CONTIG 0x00200000 -/* The operand may be copied to satisfy requirements */ -#define NPY_ITER_COPY 0x00400000 -/* The operand may be copied with UPDATEIFCOPY to satisfy requirements */ -#define NPY_ITER_UPDATEIFCOPY 0x00800000 -/* Allocate the operand if it is NULL */ -#define NPY_ITER_ALLOCATE 0x01000000 -/* If an operand is allocated, don't use any subtype */ -#define NPY_ITER_NO_SUBTYPE 0x02000000 -/* This is a virtual array slot, operand is NULL but temporary data is there */ -#define NPY_ITER_VIRTUAL 0x04000000 -/* Require that the dimension match the iterator dimensions exactly */ -#define NPY_ITER_NO_BROADCAST 0x08000000 -/* A mask is being used on this array, affects buffer -> array copy */ -#define NPY_ITER_WRITEMASKED 0x10000000 -/* This array is the mask for all WRITEMASKED operands */ -#define NPY_ITER_ARRAYMASK 0x20000000 - -#define NPY_ITER_GLOBAL_FLAGS 0x0000ffff -#define NPY_ITER_PER_OP_FLAGS 0xffff0000 - - -/***************************** - * Basic iterator object - *****************************/ - -/* FWD declaration */ -typedef struct PyArrayIterObject_tag PyArrayIterObject; - -/* - * type of the function which translates a set of coordinates to a - * pointer to the data - */ -typedef char* (*npy_iter_get_dataptr_t)(PyArrayIterObject* iter, npy_intp*); - -struct PyArrayIterObject_tag { - PyObject_HEAD - int nd_m1; /* number of dimensions - 1 */ - npy_intp index, size; - npy_intp coordinates[NPY_MAXDIMS];/* N-dimensional loop */ - npy_intp dims_m1[NPY_MAXDIMS]; /* ao->dimensions - 1 */ - npy_intp strides[NPY_MAXDIMS]; /* ao->strides or fake */ - npy_intp backstrides[NPY_MAXDIMS];/* how far to jump back */ - npy_intp factors[NPY_MAXDIMS]; /* shape factors */ - PyArrayObject *ao; - char *dataptr; /* pointer to current item*/ - npy_bool contiguous; - - npy_intp bounds[NPY_MAXDIMS][2]; - npy_intp limits[NPY_MAXDIMS][2]; - npy_intp limits_sizes[NPY_MAXDIMS]; - npy_iter_get_dataptr_t translate; -} ; - - -/* Iterator API */ -#define PyArrayIter_Check(op) PyObject_TypeCheck(op, &PyArrayIter_Type) - -#define _PyAIT(it) ((PyArrayIterObject *)(it)) -#define PyArray_ITER_RESET(it) do { \ - _PyAIT(it)->index = 0; \ - _PyAIT(it)->dataptr = PyArray_BYTES(_PyAIT(it)->ao); \ - memset(_PyAIT(it)->coordinates, 0, \ - (_PyAIT(it)->nd_m1+1)*sizeof(npy_intp)); \ -} while (0) - -#define _PyArray_ITER_NEXT1(it) do { \ - (it)->dataptr += _PyAIT(it)->strides[0]; \ - (it)->coordinates[0]++; \ -} while (0) - -#define _PyArray_ITER_NEXT2(it) do { \ - if ((it)->coordinates[1] < (it)->dims_m1[1]) { \ - (it)->coordinates[1]++; \ - (it)->dataptr += (it)->strides[1]; \ - } \ - else { \ - (it)->coordinates[1] = 0; \ - (it)->coordinates[0]++; \ - (it)->dataptr += (it)->strides[0] - \ - (it)->backstrides[1]; \ - } \ -} while (0) - -#define _PyArray_ITER_NEXT3(it) do { \ - if ((it)->coordinates[2] < (it)->dims_m1[2]) { \ - (it)->coordinates[2]++; \ - (it)->dataptr += (it)->strides[2]; \ - } \ - else { \ - (it)->coordinates[2] = 0; \ - (it)->dataptr -= (it)->backstrides[2]; \ - if ((it)->coordinates[1] < (it)->dims_m1[1]) { \ - (it)->coordinates[1]++; \ - (it)->dataptr += (it)->strides[1]; \ - } \ - else { \ - (it)->coordinates[1] = 0; \ - (it)->coordinates[0]++; \ - (it)->dataptr += 
(it)->strides[0] \ - (it)->backstrides[1]; \ - } \ - } \ -} while (0) - -#define PyArray_ITER_NEXT(it) do { \ - _PyAIT(it)->index++; \ - if (_PyAIT(it)->nd_m1 == 0) { \ - _PyArray_ITER_NEXT1(_PyAIT(it)); \ - } \ - else if (_PyAIT(it)->contiguous) \ - _PyAIT(it)->dataptr += PyArray_DESCR(_PyAIT(it)->ao)->elsize; \ - else if (_PyAIT(it)->nd_m1 == 1) { \ - _PyArray_ITER_NEXT2(_PyAIT(it)); \ - } \ - else { \ - int __npy_i; \ - for (__npy_i=_PyAIT(it)->nd_m1; __npy_i >= 0; __npy_i--) { \ - if (_PyAIT(it)->coordinates[__npy_i] < \ - _PyAIT(it)->dims_m1[__npy_i]) { \ - _PyAIT(it)->coordinates[__npy_i]++; \ - _PyAIT(it)->dataptr += \ - _PyAIT(it)->strides[__npy_i]; \ - break; \ - } \ - else { \ - _PyAIT(it)->coordinates[__npy_i] = 0; \ - _PyAIT(it)->dataptr -= \ - _PyAIT(it)->backstrides[__npy_i]; \ - } \ - } \ - } \ -} while (0) - -#define PyArray_ITER_GOTO(it, destination) do { \ - int __npy_i; \ - _PyAIT(it)->index = 0; \ - _PyAIT(it)->dataptr = PyArray_BYTES(_PyAIT(it)->ao); \ - for (__npy_i = _PyAIT(it)->nd_m1; __npy_i>=0; __npy_i--) { \ - if (destination[__npy_i] < 0) { \ - destination[__npy_i] += \ - _PyAIT(it)->dims_m1[__npy_i]+1; \ - } \ - _PyAIT(it)->dataptr += destination[__npy_i] * \ - _PyAIT(it)->strides[__npy_i]; \ - _PyAIT(it)->coordinates[__npy_i] = \ - destination[__npy_i]; \ - _PyAIT(it)->index += destination[__npy_i] * \ - ( __npy_i==_PyAIT(it)->nd_m1 ? 1 : \ - _PyAIT(it)->dims_m1[__npy_i+1]+1) ; \ - } \ -} while (0) - -#define PyArray_ITER_GOTO1D(it, ind) do { \ - int __npy_i; \ - npy_intp __npy_ind = (npy_intp) (ind); \ - if (__npy_ind < 0) __npy_ind += _PyAIT(it)->size; \ - _PyAIT(it)->index = __npy_ind; \ - if (_PyAIT(it)->nd_m1 == 0) { \ - _PyAIT(it)->dataptr = PyArray_BYTES(_PyAIT(it)->ao) + \ - __npy_ind * _PyAIT(it)->strides[0]; \ - } \ - else if (_PyAIT(it)->contiguous) \ - _PyAIT(it)->dataptr = PyArray_BYTES(_PyAIT(it)->ao) + \ - __npy_ind * PyArray_DESCR(_PyAIT(it)->ao)->elsize; \ - else { \ - _PyAIT(it)->dataptr = PyArray_BYTES(_PyAIT(it)->ao); \ - for (__npy_i = 0; __npy_i<=_PyAIT(it)->nd_m1; \ - __npy_i++) { \ - _PyAIT(it)->dataptr += \ - (__npy_ind / _PyAIT(it)->factors[__npy_i]) \ - * _PyAIT(it)->strides[__npy_i]; \ - __npy_ind %= _PyAIT(it)->factors[__npy_i]; \ - } \ - } \ -} while (0) - -#define PyArray_ITER_DATA(it) ((void *)(_PyAIT(it)->dataptr)) - -#define PyArray_ITER_NOTDONE(it) (_PyAIT(it)->index < _PyAIT(it)->size) - - -/* - * Any object passed to PyArray_Broadcast must be binary compatible - * with this structure. 
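(Hedged aside, not part of the deleted header: a minimal sketch of flat iteration over an arbitrarily strided array using PyArray_IterNew and the PyArray_ITER_* macros above. It assumes the array holds NPY_DOUBLE data.)

    /* Count non-zero elements regardless of the array's memory layout. */
    static npy_intp
    count_nonzero_doubles(PyArrayObject *arr)
    {
        npy_intp count = 0;
        PyObject *it = PyArray_IterNew((PyObject *)arr);
        if (it == NULL) {
            return -1;   /* exception already set */
        }
        while (PyArray_ITER_NOTDONE(it)) {
            if (*(const double *)PyArray_ITER_DATA(it) != 0.0) {
                count++;
            }
            PyArray_ITER_NEXT(it);
        }
        Py_DECREF(it);
        return count;
    }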
- */ - -typedef struct { - PyObject_HEAD - int numiter; /* number of iters */ - npy_intp size; /* broadcasted size */ - npy_intp index; /* current index */ - int nd; /* number of dims */ - npy_intp dimensions[NPY_MAXDIMS]; /* dimensions */ - PyArrayIterObject *iters[NPY_MAXARGS]; /* iterators */ -} PyArrayMultiIterObject; - -#define _PyMIT(m) ((PyArrayMultiIterObject *)(m)) -#define PyArray_MultiIter_RESET(multi) do { \ - int __npy_mi; \ - _PyMIT(multi)->index = 0; \ - for (__npy_mi=0; __npy_mi < _PyMIT(multi)->numiter; __npy_mi++) { \ - PyArray_ITER_RESET(_PyMIT(multi)->iters[__npy_mi]); \ - } \ -} while (0) - -#define PyArray_MultiIter_NEXT(multi) do { \ - int __npy_mi; \ - _PyMIT(multi)->index++; \ - for (__npy_mi=0; __npy_mi < _PyMIT(multi)->numiter; __npy_mi++) { \ - PyArray_ITER_NEXT(_PyMIT(multi)->iters[__npy_mi]); \ - } \ -} while (0) - -#define PyArray_MultiIter_GOTO(multi, dest) do { \ - int __npy_mi; \ - for (__npy_mi=0; __npy_mi < _PyMIT(multi)->numiter; __npy_mi++) { \ - PyArray_ITER_GOTO(_PyMIT(multi)->iters[__npy_mi], dest); \ - } \ - _PyMIT(multi)->index = _PyMIT(multi)->iters[0]->index; \ -} while (0) - -#define PyArray_MultiIter_GOTO1D(multi, ind) do { \ - int __npy_mi; \ - for (__npy_mi=0; __npy_mi < _PyMIT(multi)->numiter; __npy_mi++) { \ - PyArray_ITER_GOTO1D(_PyMIT(multi)->iters[__npy_mi], ind); \ - } \ - _PyMIT(multi)->index = _PyMIT(multi)->iters[0]->index; \ -} while (0) - -#define PyArray_MultiIter_DATA(multi, i) \ - ((void *)(_PyMIT(multi)->iters[i]->dataptr)) - -#define PyArray_MultiIter_NEXTi(multi, i) \ - PyArray_ITER_NEXT(_PyMIT(multi)->iters[i]) - -#define PyArray_MultiIter_NOTDONE(multi) \ - (_PyMIT(multi)->index < _PyMIT(multi)->size) - -/* Store the information needed for fancy-indexing over an array */ - -typedef struct { - PyObject_HEAD - /* - * Multi-iterator portion --- needs to be present in this - * order to work with PyArray_Broadcast - */ - - int numiter; /* number of index-array - iterators */ - npy_intp size; /* size of broadcasted - result */ - npy_intp index; /* current index */ - int nd; /* number of dims */ - npy_intp dimensions[NPY_MAXDIMS]; /* dimensions */ - PyArrayIterObject *iters[NPY_MAXDIMS]; /* index object - iterators */ - PyArrayIterObject *ait; /* flat Iterator for - underlying array */ - - /* flat iterator for subspace (when numiter < nd) */ - PyArrayIterObject *subspace; - - /* - * if subspace iteration, then this is the array of axes in - * the underlying array represented by the index objects - */ - int iteraxes[NPY_MAXDIMS]; - /* - * if subspace iteration, the these are the coordinates to the - * start of the subspace. 
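(Hedged aside, not part of the deleted header: a minimal sketch of broadcasting with PyArray_MultiIterNew and the PyArray_MultiIter_* macros above. Both operands are assumed to hold NPY_DOUBLE data; the caller should check PyErr_Occurred on a negative result.)

    /* Accumulate the elementwise product of two broadcast-compatible arrays. */
    static double
    broadcast_dot(PyObject *a, PyObject *b)
    {
        double acc = 0.0;
        PyObject *multi = PyArray_MultiIterNew(2, a, b);
        if (multi == NULL) {
            return -1.0;
        }
        while (PyArray_MultiIter_NOTDONE(multi)) {
            acc += *(const double *)PyArray_MultiIter_DATA(multi, 0) *
                   *(const double *)PyArray_MultiIter_DATA(multi, 1);
            PyArray_MultiIter_NEXT(multi);
        }
        Py_DECREF(multi);
        return acc;
    }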
- */ - npy_intp bscoord[NPY_MAXDIMS]; - - PyObject *indexobj; /* creating obj */ - int consec; - char *dataptr; - -} PyArrayMapIterObject; - -enum { - NPY_NEIGHBORHOOD_ITER_ZERO_PADDING, - NPY_NEIGHBORHOOD_ITER_ONE_PADDING, - NPY_NEIGHBORHOOD_ITER_CONSTANT_PADDING, - NPY_NEIGHBORHOOD_ITER_CIRCULAR_PADDING, - NPY_NEIGHBORHOOD_ITER_MIRROR_PADDING -}; - -typedef struct { - PyObject_HEAD - - /* - * PyArrayIterObject part: keep this in this exact order - */ - int nd_m1; /* number of dimensions - 1 */ - npy_intp index, size; - npy_intp coordinates[NPY_MAXDIMS];/* N-dimensional loop */ - npy_intp dims_m1[NPY_MAXDIMS]; /* ao->dimensions - 1 */ - npy_intp strides[NPY_MAXDIMS]; /* ao->strides or fake */ - npy_intp backstrides[NPY_MAXDIMS];/* how far to jump back */ - npy_intp factors[NPY_MAXDIMS]; /* shape factors */ - PyArrayObject *ao; - char *dataptr; /* pointer to current item*/ - npy_bool contiguous; - - npy_intp bounds[NPY_MAXDIMS][2]; - npy_intp limits[NPY_MAXDIMS][2]; - npy_intp limits_sizes[NPY_MAXDIMS]; - npy_iter_get_dataptr_t translate; - - /* - * New members - */ - npy_intp nd; - - /* Dimensions is the dimension of the array */ - npy_intp dimensions[NPY_MAXDIMS]; - - /* - * Neighborhood points coordinates are computed relatively to the - * point pointed by _internal_iter - */ - PyArrayIterObject* _internal_iter; - /* - * To keep a reference to the representation of the constant value - * for constant padding - */ - char* constant; - - int mode; -} PyArrayNeighborhoodIterObject; - -/* - * Neighborhood iterator API - */ - -/* General: those work for any mode */ -static NPY_INLINE int -PyArrayNeighborhoodIter_Reset(PyArrayNeighborhoodIterObject* iter); -static NPY_INLINE int -PyArrayNeighborhoodIter_Next(PyArrayNeighborhoodIterObject* iter); -#if 0 -static NPY_INLINE int -PyArrayNeighborhoodIter_Next2D(PyArrayNeighborhoodIterObject* iter); -#endif - -/* - * Include inline implementations - functions defined there are not - * considered public API - */ -#define _NPY_INCLUDE_NEIGHBORHOOD_IMP -#include "_neighborhood_iterator_imp.h" -#undef _NPY_INCLUDE_NEIGHBORHOOD_IMP - -/* The default array type */ -#define NPY_DEFAULT_TYPE NPY_DOUBLE - -/* - * All sorts of useful ways to look into a PyArrayObject. It is recommended - * to use PyArrayObject * objects instead of always casting from PyObject *, - * for improved type checking. - * - * In many cases here the macro versions of the accessors are deprecated, - * but can't be immediately changed to inline functions because the - * preexisting macros accept PyObject * and do automatic casts. Inline - * functions accepting PyArrayObject * provides for some compile-time - * checking of correctness when working with these objects in C. - */ - -#define PyArray_ISONESEGMENT(m) (PyArray_NDIM(m) == 0 || \ - PyArray_CHKFLAGS(m, NPY_ARRAY_C_CONTIGUOUS) || \ - PyArray_CHKFLAGS(m, NPY_ARRAY_F_CONTIGUOUS)) - -#define PyArray_ISFORTRAN(m) (PyArray_CHKFLAGS(m, NPY_ARRAY_F_CONTIGUOUS) && \ - (PyArray_NDIM(m) > 1)) - -#define PyArray_FORTRAN_IF(m) ((PyArray_CHKFLAGS(m, NPY_ARRAY_F_CONTIGUOUS) ? \ - NPY_ARRAY_F_CONTIGUOUS : 0)) - -#if (defined(NPY_NO_DEPRECATED_API) && (NPY_API_VERSION <= NPY_NO_DEPRECATED_API)) -/* - * Changing access macros into functions, to allow for future hiding - * of the internal memory layout. This later hiding will allow the 2.x series - * to change the internal representation of arrays without affecting - * ABI compatibility. 
- */ - -static NPY_INLINE int -PyArray_NDIM(const PyArrayObject *arr) -{ - return ((PyArrayObject_fields *)arr)->nd; -} - -static NPY_INLINE void * -PyArray_DATA(PyArrayObject *arr) -{ - return ((PyArrayObject_fields *)arr)->data; -} - -static NPY_INLINE char * -PyArray_BYTES(PyArrayObject *arr) -{ - return ((PyArrayObject_fields *)arr)->data; -} - -static NPY_INLINE npy_intp * -PyArray_DIMS(PyArrayObject *arr) -{ - return ((PyArrayObject_fields *)arr)->dimensions; -} - -static NPY_INLINE npy_intp * -PyArray_STRIDES(PyArrayObject *arr) -{ - return ((PyArrayObject_fields *)arr)->strides; -} - -static NPY_INLINE npy_intp -PyArray_DIM(const PyArrayObject *arr, int idim) -{ - return ((PyArrayObject_fields *)arr)->dimensions[idim]; -} - -static NPY_INLINE npy_intp -PyArray_STRIDE(const PyArrayObject *arr, int istride) -{ - return ((PyArrayObject_fields *)arr)->strides[istride]; -} - -static NPY_INLINE PyObject * -PyArray_BASE(PyArrayObject *arr) -{ - return ((PyArrayObject_fields *)arr)->base; -} - -static NPY_INLINE PyArray_Descr * -PyArray_DESCR(PyArrayObject *arr) -{ - return ((PyArrayObject_fields *)arr)->descr; -} - -static NPY_INLINE int -PyArray_FLAGS(const PyArrayObject *arr) -{ - return ((PyArrayObject_fields *)arr)->flags; -} - -static NPY_INLINE npy_intp -PyArray_ITEMSIZE(const PyArrayObject *arr) -{ - return ((PyArrayObject_fields *)arr)->descr->elsize; -} - -static NPY_INLINE int -PyArray_TYPE(const PyArrayObject *arr) -{ - return ((PyArrayObject_fields *)arr)->descr->type_num; -} - -static NPY_INLINE int -PyArray_CHKFLAGS(const PyArrayObject *arr, int flags) -{ - return (PyArray_FLAGS(arr) & flags) == flags; -} - -static NPY_INLINE PyObject * -PyArray_GETITEM(const PyArrayObject *arr, const char *itemptr) -{ - return ((PyArrayObject_fields *)arr)->descr->f->getitem( - (void *)itemptr, (PyArrayObject *)arr); -} - -static NPY_INLINE int -PyArray_SETITEM(PyArrayObject *arr, char *itemptr, PyObject *v) -{ - return ((PyArrayObject_fields *)arr)->descr->f->setitem( - v, itemptr, arr); -} - -#else - -/* These macros are deprecated as of NumPy 1.7. 
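(Hedged aside, not part of the deleted header: a minimal sketch that uses only the inline accessors defined above, never touching PyArrayObject_fields directly. numpy/arrayobject.h is assumed to be included.)

    #include <stdio.h>

    /* Print basic metadata about an array via the accessor functions. */
    static void
    describe_array(PyArrayObject *arr)
    {
        int i, nd = PyArray_NDIM(arr);
        printf("type_num=%d, itemsize=%ld, shape=(",
               PyArray_TYPE(arr), (long)PyArray_ITEMSIZE(arr));
        for (i = 0; i < nd; i++) {
            printf("%ld%s", (long)PyArray_DIM(arr, i),
                   (i + 1 < nd) ? ", " : "");
        }
        printf("), C-contiguous=%d\n",
               PyArray_CHKFLAGS(arr, NPY_ARRAY_C_CONTIGUOUS));
    }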
*/ -#define PyArray_NDIM(obj) (((PyArrayObject_fields *)(obj))->nd) -#define PyArray_BYTES(obj) (((PyArrayObject_fields *)(obj))->data) -#define PyArray_DATA(obj) ((void *)((PyArrayObject_fields *)(obj))->data) -#define PyArray_DIMS(obj) (((PyArrayObject_fields *)(obj))->dimensions) -#define PyArray_STRIDES(obj) (((PyArrayObject_fields *)(obj))->strides) -#define PyArray_DIM(obj,n) (PyArray_DIMS(obj)[n]) -#define PyArray_STRIDE(obj,n) (PyArray_STRIDES(obj)[n]) -#define PyArray_BASE(obj) (((PyArrayObject_fields *)(obj))->base) -#define PyArray_DESCR(obj) (((PyArrayObject_fields *)(obj))->descr) -#define PyArray_FLAGS(obj) (((PyArrayObject_fields *)(obj))->flags) -#define PyArray_CHKFLAGS(m, FLAGS) \ - ((((PyArrayObject_fields *)(m))->flags & (FLAGS)) == (FLAGS)) -#define PyArray_ITEMSIZE(obj) \ - (((PyArrayObject_fields *)(obj))->descr->elsize) -#define PyArray_TYPE(obj) \ - (((PyArrayObject_fields *)(obj))->descr->type_num) -#define PyArray_GETITEM(obj,itemptr) \ - PyArray_DESCR(obj)->f->getitem((char *)(itemptr), \ - (PyArrayObject *)(obj)) - -#define PyArray_SETITEM(obj,itemptr,v) \ - PyArray_DESCR(obj)->f->setitem((PyObject *)(v), \ - (char *)(itemptr), \ - (PyArrayObject *)(obj)) -#endif - -static NPY_INLINE PyArray_Descr * -PyArray_DTYPE(PyArrayObject *arr) -{ - return ((PyArrayObject_fields *)arr)->descr; -} - -static NPY_INLINE npy_intp * -PyArray_SHAPE(PyArrayObject *arr) -{ - return ((PyArrayObject_fields *)arr)->dimensions; -} - -/* - * Enables the specified array flags. Does no checking, - * assumes you know what you're doing. - */ -static NPY_INLINE void -PyArray_ENABLEFLAGS(PyArrayObject *arr, int flags) -{ - ((PyArrayObject_fields *)arr)->flags |= flags; -} - -/* - * Clears the specified array flags. Does no checking, - * assumes you know what you're doing. 
- */ -static NPY_INLINE void -PyArray_CLEARFLAGS(PyArrayObject *arr, int flags) -{ - ((PyArrayObject_fields *)arr)->flags &= ~flags; -} - -#define PyTypeNum_ISBOOL(type) ((type) == NPY_BOOL) - -#define PyTypeNum_ISUNSIGNED(type) (((type) == NPY_UBYTE) || \ - ((type) == NPY_USHORT) || \ - ((type) == NPY_UINT) || \ - ((type) == NPY_ULONG) || \ - ((type) == NPY_ULONGLONG)) - -#define PyTypeNum_ISSIGNED(type) (((type) == NPY_BYTE) || \ - ((type) == NPY_SHORT) || \ - ((type) == NPY_INT) || \ - ((type) == NPY_LONG) || \ - ((type) == NPY_LONGLONG)) - -#define PyTypeNum_ISINTEGER(type) (((type) >= NPY_BYTE) && \ - ((type) <= NPY_ULONGLONG)) - -#define PyTypeNum_ISFLOAT(type) ((((type) >= NPY_FLOAT) && \ - ((type) <= NPY_LONGDOUBLE)) || \ - ((type) == NPY_HALF)) - -#define PyTypeNum_ISNUMBER(type) (((type) <= NPY_CLONGDOUBLE) || \ - ((type) == NPY_HALF)) - -#define PyTypeNum_ISSTRING(type) (((type) == NPY_STRING) || \ - ((type) == NPY_UNICODE)) - -#define PyTypeNum_ISCOMPLEX(type) (((type) >= NPY_CFLOAT) && \ - ((type) <= NPY_CLONGDOUBLE)) - -#define PyTypeNum_ISPYTHON(type) (((type) == NPY_LONG) || \ - ((type) == NPY_DOUBLE) || \ - ((type) == NPY_CDOUBLE) || \ - ((type) == NPY_BOOL) || \ - ((type) == NPY_OBJECT )) - -#define PyTypeNum_ISFLEXIBLE(type) (((type) >=NPY_STRING) && \ - ((type) <=NPY_VOID)) - -#define PyTypeNum_ISDATETIME(type) (((type) >=NPY_DATETIME) && \ - ((type) <=NPY_TIMEDELTA)) - -#define PyTypeNum_ISUSERDEF(type) (((type) >= NPY_USERDEF) && \ - ((type) < NPY_USERDEF+ \ - NPY_NUMUSERTYPES)) - -#define PyTypeNum_ISEXTENDED(type) (PyTypeNum_ISFLEXIBLE(type) || \ - PyTypeNum_ISUSERDEF(type)) - -#define PyTypeNum_ISOBJECT(type) ((type) == NPY_OBJECT) - - -#define PyDataType_ISBOOL(obj) PyTypeNum_ISBOOL(_PyADt(obj)) -#define PyDataType_ISUNSIGNED(obj) PyTypeNum_ISUNSIGNED(((PyArray_Descr*)(obj))->type_num) -#define PyDataType_ISSIGNED(obj) PyTypeNum_ISSIGNED(((PyArray_Descr*)(obj))->type_num) -#define PyDataType_ISINTEGER(obj) PyTypeNum_ISINTEGER(((PyArray_Descr*)(obj))->type_num ) -#define PyDataType_ISFLOAT(obj) PyTypeNum_ISFLOAT(((PyArray_Descr*)(obj))->type_num) -#define PyDataType_ISNUMBER(obj) PyTypeNum_ISNUMBER(((PyArray_Descr*)(obj))->type_num) -#define PyDataType_ISSTRING(obj) PyTypeNum_ISSTRING(((PyArray_Descr*)(obj))->type_num) -#define PyDataType_ISCOMPLEX(obj) PyTypeNum_ISCOMPLEX(((PyArray_Descr*)(obj))->type_num) -#define PyDataType_ISPYTHON(obj) PyTypeNum_ISPYTHON(((PyArray_Descr*)(obj))->type_num) -#define PyDataType_ISFLEXIBLE(obj) PyTypeNum_ISFLEXIBLE(((PyArray_Descr*)(obj))->type_num) -#define PyDataType_ISDATETIME(obj) PyTypeNum_ISDATETIME(((PyArray_Descr*)(obj))->type_num) -#define PyDataType_ISUSERDEF(obj) PyTypeNum_ISUSERDEF(((PyArray_Descr*)(obj))->type_num) -#define PyDataType_ISEXTENDED(obj) PyTypeNum_ISEXTENDED(((PyArray_Descr*)(obj))->type_num) -#define PyDataType_ISOBJECT(obj) PyTypeNum_ISOBJECT(((PyArray_Descr*)(obj))->type_num) -#define PyDataType_HASFIELDS(obj) (((PyArray_Descr *)(obj))->names != NULL) -#define PyDataType_HASSUBARRAY(dtype) ((dtype)->subarray != NULL) - -#define PyArray_ISBOOL(obj) PyTypeNum_ISBOOL(PyArray_TYPE(obj)) -#define PyArray_ISUNSIGNED(obj) PyTypeNum_ISUNSIGNED(PyArray_TYPE(obj)) -#define PyArray_ISSIGNED(obj) PyTypeNum_ISSIGNED(PyArray_TYPE(obj)) -#define PyArray_ISINTEGER(obj) PyTypeNum_ISINTEGER(PyArray_TYPE(obj)) -#define PyArray_ISFLOAT(obj) PyTypeNum_ISFLOAT(PyArray_TYPE(obj)) -#define PyArray_ISNUMBER(obj) PyTypeNum_ISNUMBER(PyArray_TYPE(obj)) -#define PyArray_ISSTRING(obj) PyTypeNum_ISSTRING(PyArray_TYPE(obj)) -#define 
PyArray_ISCOMPLEX(obj) PyTypeNum_ISCOMPLEX(PyArray_TYPE(obj)) -#define PyArray_ISPYTHON(obj) PyTypeNum_ISPYTHON(PyArray_TYPE(obj)) -#define PyArray_ISFLEXIBLE(obj) PyTypeNum_ISFLEXIBLE(PyArray_TYPE(obj)) -#define PyArray_ISDATETIME(obj) PyTypeNum_ISDATETIME(PyArray_TYPE(obj)) -#define PyArray_ISUSERDEF(obj) PyTypeNum_ISUSERDEF(PyArray_TYPE(obj)) -#define PyArray_ISEXTENDED(obj) PyTypeNum_ISEXTENDED(PyArray_TYPE(obj)) -#define PyArray_ISOBJECT(obj) PyTypeNum_ISOBJECT(PyArray_TYPE(obj)) -#define PyArray_HASFIELDS(obj) PyDataType_HASFIELDS(PyArray_DESCR(obj)) - - /* - * FIXME: This should check for a flag on the data-type that - * states whether or not it is variable length. Because the - * ISFLEXIBLE check is hard-coded to the built-in data-types. - */ -#define PyArray_ISVARIABLE(obj) PyTypeNum_ISFLEXIBLE(PyArray_TYPE(obj)) - -#define PyArray_SAFEALIGNEDCOPY(obj) (PyArray_ISALIGNED(obj) && !PyArray_ISVARIABLE(obj)) - - -#define NPY_LITTLE '<' -#define NPY_BIG '>' -#define NPY_NATIVE '=' -#define NPY_SWAP 's' -#define NPY_IGNORE '|' - -#if NPY_BYTE_ORDER == NPY_BIG_ENDIAN -#define NPY_NATBYTE NPY_BIG -#define NPY_OPPBYTE NPY_LITTLE -#else -#define NPY_NATBYTE NPY_LITTLE -#define NPY_OPPBYTE NPY_BIG -#endif - -#define PyArray_ISNBO(arg) ((arg) != NPY_OPPBYTE) -#define PyArray_IsNativeByteOrder PyArray_ISNBO -#define PyArray_ISNOTSWAPPED(m) PyArray_ISNBO(PyArray_DESCR(m)->byteorder) -#define PyArray_ISBYTESWAPPED(m) (!PyArray_ISNOTSWAPPED(m)) - -#define PyArray_FLAGSWAP(m, flags) (PyArray_CHKFLAGS(m, flags) && \ - PyArray_ISNOTSWAPPED(m)) - -#define PyArray_ISCARRAY(m) PyArray_FLAGSWAP(m, NPY_ARRAY_CARRAY) -#define PyArray_ISCARRAY_RO(m) PyArray_FLAGSWAP(m, NPY_ARRAY_CARRAY_RO) -#define PyArray_ISFARRAY(m) PyArray_FLAGSWAP(m, NPY_ARRAY_FARRAY) -#define PyArray_ISFARRAY_RO(m) PyArray_FLAGSWAP(m, NPY_ARRAY_FARRAY_RO) -#define PyArray_ISBEHAVED(m) PyArray_FLAGSWAP(m, NPY_ARRAY_BEHAVED) -#define PyArray_ISBEHAVED_RO(m) PyArray_FLAGSWAP(m, NPY_ARRAY_ALIGNED) - - -#define PyDataType_ISNOTSWAPPED(d) PyArray_ISNBO(((PyArray_Descr *)(d))->byteorder) -#define PyDataType_ISBYTESWAPPED(d) (!PyDataType_ISNOTSWAPPED(d)) - -/************************************************************ - * A struct used by PyArray_CreateSortedStridePerm, new in 1.7. - ************************************************************/ - -typedef struct { - npy_intp perm, stride; -} npy_stride_sort_item; - -/************************************************************ - * This is the form of the struct that's returned pointed by the - * PyCObject attribute of an array __array_struct__. See - * http://docs.scipy.org/doc/numpy/reference/arrays.interface.html for the full - * documentation. - ************************************************************/ -typedef struct { - int two; /* - * contains the integer 2 as a sanity - * check - */ - - int nd; /* number of dimensions */ - - char typekind; /* - * kind in array --- character code of - * typestr - */ - - int itemsize; /* size of each element */ - - int flags; /* - * how should be data interpreted. Valid - * flags are CONTIGUOUS (1), F_CONTIGUOUS (2), - * ALIGNED (0x100), NOTSWAPPED (0x200), and - * WRITEABLE (0x400). 
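(Hedged aside, not part of the deleted header: a minimal sketch of gating a fast path on the layout/byte-order predicates above. The -1 return simply signals that the caller must use a slower, more general path.)

    /* In-place scaling, only when the data is C-contiguous, aligned,
     * writeable, native-endian doubles. */
    static int
    scale_inplace(PyArrayObject *arr, double factor)
    {
        double *data;
        npy_intp i, n;

        if (!PyArray_ISCARRAY(arr) || PyArray_TYPE(arr) != NPY_DOUBLE) {
            return -1;   /* caller falls back to a generic routine */
        }
        data = (double *)PyArray_DATA(arr);
        n = PyArray_SIZE(arr);
        for (i = 0; i < n; i++) {
            data[i] *= factor;
        }
        return 0;
    }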
ARR_HAS_DESCR (0x800) - * states that arrdescr field is present in - * structure - */ - - npy_intp *shape; /* - * A length-nd array of shape - * information - */ - - npy_intp *strides; /* A length-nd array of stride information */ - - void *data; /* A pointer to the first element of the array */ - - PyObject *descr; /* - * A list of fields or NULL (ignored if flags - * does not have ARR_HAS_DESCR flag set) - */ -} PyArrayInterface; - -/* - * This is a function for hooking into the PyDataMem_NEW/FREE/RENEW functions. - * See the documentation for PyDataMem_SetEventHook. - */ -typedef void (PyDataMem_EventHookFunc)(void *inp, void *outp, size_t size, - void *user_data); - -#if !(defined(NPY_NO_DEPRECATED_API) && (NPY_API_VERSION <= NPY_NO_DEPRECATED_API)) -#include "npy_deprecated_api.h" -#endif - -#endif /* NPY_ARRAYTYPES_H */ diff --git a/include/numpy/noprefix.h b/include/numpy/noprefix.h deleted file mode 100644 index b3e57480e..000000000 --- a/include/numpy/noprefix.h +++ /dev/null @@ -1,209 +0,0 @@ -#ifndef NPY_NOPREFIX_H -#define NPY_NOPREFIX_H - -/* - * You can directly include noprefix.h as a backward - * compatibility measure - */ -#ifndef NPY_NO_PREFIX -#include "ndarrayobject.h" -#include "npy_interrupt.h" -#endif - -#define SIGSETJMP NPY_SIGSETJMP -#define SIGLONGJMP NPY_SIGLONGJMP -#define SIGJMP_BUF NPY_SIGJMP_BUF - -#define MAX_DIMS NPY_MAXDIMS - -#define longlong npy_longlong -#define ulonglong npy_ulonglong -#define Bool npy_bool -#define longdouble npy_longdouble -#define byte npy_byte - -#ifndef _BSD_SOURCE -#define ushort npy_ushort -#define uint npy_uint -#define ulong npy_ulong -#endif - -#define ubyte npy_ubyte -#define ushort npy_ushort -#define uint npy_uint -#define ulong npy_ulong -#define cfloat npy_cfloat -#define cdouble npy_cdouble -#define clongdouble npy_clongdouble -#define Int8 npy_int8 -#define UInt8 npy_uint8 -#define Int16 npy_int16 -#define UInt16 npy_uint16 -#define Int32 npy_int32 -#define UInt32 npy_uint32 -#define Int64 npy_int64 -#define UInt64 npy_uint64 -#define Int128 npy_int128 -#define UInt128 npy_uint128 -#define Int256 npy_int256 -#define UInt256 npy_uint256 -#define Float16 npy_float16 -#define Complex32 npy_complex32 -#define Float32 npy_float32 -#define Complex64 npy_complex64 -#define Float64 npy_float64 -#define Complex128 npy_complex128 -#define Float80 npy_float80 -#define Complex160 npy_complex160 -#define Float96 npy_float96 -#define Complex192 npy_complex192 -#define Float128 npy_float128 -#define Complex256 npy_complex256 -#define intp npy_intp -#define uintp npy_uintp -#define datetime npy_datetime -#define timedelta npy_timedelta - -#define SIZEOF_INTP NPY_SIZEOF_INTP -#define SIZEOF_UINTP NPY_SIZEOF_UINTP -#define SIZEOF_DATETIME NPY_SIZEOF_DATETIME -#define SIZEOF_TIMEDELTA NPY_SIZEOF_TIMEDELTA - -#define LONGLONG_FMT NPY_LONGLONG_FMT -#define ULONGLONG_FMT NPY_ULONGLONG_FMT -#define LONGLONG_SUFFIX NPY_LONGLONG_SUFFIX -#define ULONGLONG_SUFFIX NPY_ULONGLONG_SUFFIX - -#define MAX_INT8 127 -#define MIN_INT8 -128 -#define MAX_UINT8 255 -#define MAX_INT16 32767 -#define MIN_INT16 -32768 -#define MAX_UINT16 65535 -#define MAX_INT32 2147483647 -#define MIN_INT32 (-MAX_INT32 - 1) -#define MAX_UINT32 4294967295U -#define MAX_INT64 LONGLONG_SUFFIX(9223372036854775807) -#define MIN_INT64 (-MAX_INT64 - LONGLONG_SUFFIX(1)) -#define MAX_UINT64 ULONGLONG_SUFFIX(18446744073709551615) -#define MAX_INT128 LONGLONG_SUFFIX(85070591730234615865843651857942052864) -#define MIN_INT128 (-MAX_INT128 - LONGLONG_SUFFIX(1)) -#define MAX_UINT128 
ULONGLONG_SUFFIX(170141183460469231731687303715884105728) -#define MAX_INT256 LONGLONG_SUFFIX(57896044618658097711785492504343953926634992332820282019728792003956564819967) -#define MIN_INT256 (-MAX_INT256 - LONGLONG_SUFFIX(1)) -#define MAX_UINT256 ULONGLONG_SUFFIX(115792089237316195423570985008687907853269984665640564039457584007913129639935) - -#define MAX_BYTE NPY_MAX_BYTE -#define MIN_BYTE NPY_MIN_BYTE -#define MAX_UBYTE NPY_MAX_UBYTE -#define MAX_SHORT NPY_MAX_SHORT -#define MIN_SHORT NPY_MIN_SHORT -#define MAX_USHORT NPY_MAX_USHORT -#define MAX_INT NPY_MAX_INT -#define MIN_INT NPY_MIN_INT -#define MAX_UINT NPY_MAX_UINT -#define MAX_LONG NPY_MAX_LONG -#define MIN_LONG NPY_MIN_LONG -#define MAX_ULONG NPY_MAX_ULONG -#define MAX_LONGLONG NPY_MAX_LONGLONG -#define MIN_LONGLONG NPY_MIN_LONGLONG -#define MAX_ULONGLONG NPY_MAX_ULONGLONG -#define MIN_DATETIME NPY_MIN_DATETIME -#define MAX_DATETIME NPY_MAX_DATETIME -#define MIN_TIMEDELTA NPY_MIN_TIMEDELTA -#define MAX_TIMEDELTA NPY_MAX_TIMEDELTA - -#define SIZEOF_LONGDOUBLE NPY_SIZEOF_LONGDOUBLE -#define SIZEOF_LONGLONG NPY_SIZEOF_LONGLONG -#define SIZEOF_HALF NPY_SIZEOF_HALF -#define BITSOF_BOOL NPY_BITSOF_BOOL -#define BITSOF_CHAR NPY_BITSOF_CHAR -#define BITSOF_SHORT NPY_BITSOF_SHORT -#define BITSOF_INT NPY_BITSOF_INT -#define BITSOF_LONG NPY_BITSOF_LONG -#define BITSOF_LONGLONG NPY_BITSOF_LONGLONG -#define BITSOF_HALF NPY_BITSOF_HALF -#define BITSOF_FLOAT NPY_BITSOF_FLOAT -#define BITSOF_DOUBLE NPY_BITSOF_DOUBLE -#define BITSOF_LONGDOUBLE NPY_BITSOF_LONGDOUBLE -#define BITSOF_DATETIME NPY_BITSOF_DATETIME -#define BITSOF_TIMEDELTA NPY_BITSOF_TIMEDELTA - -#define _pya_malloc PyArray_malloc -#define _pya_free PyArray_free -#define _pya_realloc PyArray_realloc - -#define BEGIN_THREADS_DEF NPY_BEGIN_THREADS_DEF -#define BEGIN_THREADS NPY_BEGIN_THREADS -#define END_THREADS NPY_END_THREADS -#define ALLOW_C_API_DEF NPY_ALLOW_C_API_DEF -#define ALLOW_C_API NPY_ALLOW_C_API -#define DISABLE_C_API NPY_DISABLE_C_API - -#define PY_FAIL NPY_FAIL -#define PY_SUCCEED NPY_SUCCEED - -#ifndef TRUE -#define TRUE NPY_TRUE -#endif - -#ifndef FALSE -#define FALSE NPY_FALSE -#endif - -#define LONGDOUBLE_FMT NPY_LONGDOUBLE_FMT - -#define CONTIGUOUS NPY_CONTIGUOUS -#define C_CONTIGUOUS NPY_C_CONTIGUOUS -#define FORTRAN NPY_FORTRAN -#define F_CONTIGUOUS NPY_F_CONTIGUOUS -#define OWNDATA NPY_OWNDATA -#define FORCECAST NPY_FORCECAST -#define ENSURECOPY NPY_ENSURECOPY -#define ENSUREARRAY NPY_ENSUREARRAY -#define ELEMENTSTRIDES NPY_ELEMENTSTRIDES -#define ALIGNED NPY_ALIGNED -#define NOTSWAPPED NPY_NOTSWAPPED -#define WRITEABLE NPY_WRITEABLE -#define UPDATEIFCOPY NPY_UPDATEIFCOPY -#define ARR_HAS_DESCR NPY_ARR_HAS_DESCR -#define BEHAVED NPY_BEHAVED -#define BEHAVED_NS NPY_BEHAVED_NS -#define CARRAY NPY_CARRAY -#define CARRAY_RO NPY_CARRAY_RO -#define FARRAY NPY_FARRAY -#define FARRAY_RO NPY_FARRAY_RO -#define DEFAULT NPY_DEFAULT -#define IN_ARRAY NPY_IN_ARRAY -#define OUT_ARRAY NPY_OUT_ARRAY -#define INOUT_ARRAY NPY_INOUT_ARRAY -#define IN_FARRAY NPY_IN_FARRAY -#define OUT_FARRAY NPY_OUT_FARRAY -#define INOUT_FARRAY NPY_INOUT_FARRAY -#define UPDATE_ALL NPY_UPDATE_ALL - -#define OWN_DATA NPY_OWNDATA -#define BEHAVED_FLAGS NPY_BEHAVED -#define BEHAVED_FLAGS_NS NPY_BEHAVED_NS -#define CARRAY_FLAGS_RO NPY_CARRAY_RO -#define CARRAY_FLAGS NPY_CARRAY -#define FARRAY_FLAGS NPY_FARRAY -#define FARRAY_FLAGS_RO NPY_FARRAY_RO -#define DEFAULT_FLAGS NPY_DEFAULT -#define UPDATE_ALL_FLAGS NPY_UPDATE_ALL_FLAGS - -#ifndef MIN -#define MIN PyArray_MIN -#endif -#ifndef MAX -#define MAX 
PyArray_MAX -#endif -#define MAX_INTP NPY_MAX_INTP -#define MIN_INTP NPY_MIN_INTP -#define MAX_UINTP NPY_MAX_UINTP -#define INTP_FMT NPY_INTP_FMT - -#define REFCOUNT PyArray_REFCOUNT -#define MAX_ELSIZE NPY_MAX_ELSIZE - -#endif diff --git a/include/numpy/npy_3kcompat.h b/include/numpy/npy_3kcompat.h deleted file mode 100644 index d0cd9ac1a..000000000 --- a/include/numpy/npy_3kcompat.h +++ /dev/null @@ -1,417 +0,0 @@ -/* - * This is a convenience header file providing compatibility utilities - * for supporting Python 2 and Python 3 in the same code base. - * - * If you want to use this for your own projects, it's recommended to make a - * copy of it. Although the stuff below is unlikely to change, we don't provide - * strong backwards compatibility guarantees at the moment. - */ - -#ifndef _NPY_3KCOMPAT_H_ -#define _NPY_3KCOMPAT_H_ - -#include -#include - -#if PY_VERSION_HEX >= 0x03000000 -#ifndef NPY_PY3K -#define NPY_PY3K 1 -#endif -#endif - -#include "numpy/npy_common.h" -#include "numpy/ndarrayobject.h" - -#ifdef __cplusplus -extern "C" { -#endif - -/* - * PyInt -> PyLong - */ - -#if defined(NPY_PY3K) -/* Return True only if the long fits in a C long */ -static NPY_INLINE int PyInt_Check(PyObject *op) { - int overflow = 0; - if (!PyLong_Check(op)) { - return 0; - } - PyLong_AsLongAndOverflow(op, &overflow); - return (overflow == 0); -} - -#define PyInt_FromLong PyLong_FromLong -#define PyInt_AsLong PyLong_AsLong -#define PyInt_AS_LONG PyLong_AsLong -#define PyInt_AsSsize_t PyLong_AsSsize_t - -/* NOTE: - * - * Since the PyLong type is very different from the fixed-range PyInt, - * we don't define PyInt_Type -> PyLong_Type. - */ -#endif /* NPY_PY3K */ - -/* - * PyString -> PyBytes - */ - -#if defined(NPY_PY3K) - -#define PyString_Type PyBytes_Type -#define PyString_Check PyBytes_Check -#define PyStringObject PyBytesObject -#define PyString_FromString PyBytes_FromString -#define PyString_FromStringAndSize PyBytes_FromStringAndSize -#define PyString_AS_STRING PyBytes_AS_STRING -#define PyString_AsStringAndSize PyBytes_AsStringAndSize -#define PyString_FromFormat PyBytes_FromFormat -#define PyString_Concat PyBytes_Concat -#define PyString_ConcatAndDel PyBytes_ConcatAndDel -#define PyString_AsString PyBytes_AsString -#define PyString_GET_SIZE PyBytes_GET_SIZE -#define PyString_Size PyBytes_Size - -#define PyUString_Type PyUnicode_Type -#define PyUString_Check PyUnicode_Check -#define PyUStringObject PyUnicodeObject -#define PyUString_FromString PyUnicode_FromString -#define PyUString_FromStringAndSize PyUnicode_FromStringAndSize -#define PyUString_FromFormat PyUnicode_FromFormat -#define PyUString_Concat PyUnicode_Concat2 -#define PyUString_ConcatAndDel PyUnicode_ConcatAndDel -#define PyUString_GET_SIZE PyUnicode_GET_SIZE -#define PyUString_Size PyUnicode_Size -#define PyUString_InternFromString PyUnicode_InternFromString -#define PyUString_Format PyUnicode_Format - -#else - -#define PyBytes_Type PyString_Type -#define PyBytes_Check PyString_Check -#define PyBytesObject PyStringObject -#define PyBytes_FromString PyString_FromString -#define PyBytes_FromStringAndSize PyString_FromStringAndSize -#define PyBytes_AS_STRING PyString_AS_STRING -#define PyBytes_AsStringAndSize PyString_AsStringAndSize -#define PyBytes_FromFormat PyString_FromFormat -#define PyBytes_Concat PyString_Concat -#define PyBytes_ConcatAndDel PyString_ConcatAndDel -#define PyBytes_AsString PyString_AsString -#define PyBytes_GET_SIZE PyString_GET_SIZE -#define PyBytes_Size PyString_Size - -#define PyUString_Type PyString_Type 
-#define PyUString_Check PyString_Check -#define PyUStringObject PyStringObject -#define PyUString_FromString PyString_FromString -#define PyUString_FromStringAndSize PyString_FromStringAndSize -#define PyUString_FromFormat PyString_FromFormat -#define PyUString_Concat PyString_Concat -#define PyUString_ConcatAndDel PyString_ConcatAndDel -#define PyUString_GET_SIZE PyString_GET_SIZE -#define PyUString_Size PyString_Size -#define PyUString_InternFromString PyString_InternFromString -#define PyUString_Format PyString_Format - -#endif /* NPY_PY3K */ - - -static NPY_INLINE void -PyUnicode_ConcatAndDel(PyObject **left, PyObject *right) -{ - PyObject *newobj; - newobj = PyUnicode_Concat(*left, right); - Py_DECREF(*left); - Py_DECREF(right); - *left = newobj; -} - -static NPY_INLINE void -PyUnicode_Concat2(PyObject **left, PyObject *right) -{ - PyObject *newobj; - newobj = PyUnicode_Concat(*left, right); - Py_DECREF(*left); - *left = newobj; -} - -/* - * PyFile_* compatibility - */ -#if defined(NPY_PY3K) - -/* - * Get a FILE* handle to the file represented by the Python object - */ -static NPY_INLINE FILE* -npy_PyFile_Dup(PyObject *file, char *mode) -{ - int fd, fd2; - PyObject *ret, *os; - Py_ssize_t pos; - FILE *handle; - /* Flush first to ensure things end up in the file in the correct order */ - ret = PyObject_CallMethod(file, "flush", ""); - if (ret == NULL) { - return NULL; - } - Py_DECREF(ret); - fd = PyObject_AsFileDescriptor(file); - if (fd == -1) { - return NULL; - } - os = PyImport_ImportModule("os"); - if (os == NULL) { - return NULL; - } - ret = PyObject_CallMethod(os, "dup", "i", fd); - Py_DECREF(os); - if (ret == NULL) { - return NULL; - } - fd2 = PyNumber_AsSsize_t(ret, NULL); - Py_DECREF(ret); -#ifdef _WIN32 - handle = _fdopen(fd2, mode); -#else - handle = fdopen(fd2, mode); -#endif - if (handle == NULL) { - PyErr_SetString(PyExc_IOError, - "Getting a FILE* from a Python file object failed"); - } - ret = PyObject_CallMethod(file, "tell", ""); - if (ret == NULL) { - fclose(handle); - return NULL; - } - pos = PyNumber_AsSsize_t(ret, PyExc_OverflowError); - Py_DECREF(ret); - if (PyErr_Occurred()) { - fclose(handle); - return NULL; - } - npy_fseek(handle, pos, SEEK_SET); - return handle; -} - -/* - * Close the dup-ed file handle, and seek the Python one to the current position - */ -static NPY_INLINE int -npy_PyFile_DupClose(PyObject *file, FILE* handle) -{ - PyObject *ret; - Py_ssize_t position; - position = npy_ftell(handle); - fclose(handle); - - ret = PyObject_CallMethod(file, "seek", NPY_SSIZE_T_PYFMT "i", position, 0); - if (ret == NULL) { - return -1; - } - Py_DECREF(ret); - return 0; -} - -static NPY_INLINE int -npy_PyFile_Check(PyObject *file) -{ - int fd; - fd = PyObject_AsFileDescriptor(file); - if (fd == -1) { - PyErr_Clear(); - return 0; - } - return 1; -} - -#else - -#define npy_PyFile_Dup(file, mode) PyFile_AsFile(file) -#define npy_PyFile_DupClose(file, handle) (0) -#define npy_PyFile_Check PyFile_Check - -#endif - -static NPY_INLINE PyObject* -npy_PyFile_OpenFile(PyObject *filename, const char *mode) -{ - PyObject *open; - open = PyDict_GetItemString(PyEval_GetBuiltins(), "open"); - if (open == NULL) { - return NULL; - } - return PyObject_CallFunction(open, "Os", filename, mode); -} - -static NPY_INLINE int -npy_PyFile_CloseFile(PyObject *file) -{ - PyObject *ret; - - ret = PyObject_CallMethod(file, "close", NULL); - if (ret == NULL) { - return -1; - } - Py_DECREF(ret); - return 0; -} - -/* - * PyObject_Cmp - */ -#if defined(NPY_PY3K) -static NPY_INLINE int 
-PyObject_Cmp(PyObject *i1, PyObject *i2, int *cmp) -{ - int v; - v = PyObject_RichCompareBool(i1, i2, Py_LT); - if (v == 0) { - *cmp = -1; - return 1; - } - else if (v == -1) { - return -1; - } - - v = PyObject_RichCompareBool(i1, i2, Py_GT); - if (v == 0) { - *cmp = 1; - return 1; - } - else if (v == -1) { - return -1; - } - - v = PyObject_RichCompareBool(i1, i2, Py_EQ); - if (v == 0) { - *cmp = 0; - return 1; - } - else { - *cmp = 0; - return -1; - } -} -#endif - -/* - * PyCObject functions adapted to PyCapsules. - * - * The main job here is to get rid of the improved error handling - * of PyCapsules. It's a shame... - */ -#if PY_VERSION_HEX >= 0x03000000 - -static NPY_INLINE PyObject * -NpyCapsule_FromVoidPtr(void *ptr, void (*dtor)(PyObject *)) -{ - PyObject *ret = PyCapsule_New(ptr, NULL, dtor); - if (ret == NULL) { - PyErr_Clear(); - } - return ret; -} - -static NPY_INLINE PyObject * -NpyCapsule_FromVoidPtrAndDesc(void *ptr, void* context, void (*dtor)(PyObject *)) -{ - PyObject *ret = NpyCapsule_FromVoidPtr(ptr, dtor); - if (ret != NULL && PyCapsule_SetContext(ret, context) != 0) { - PyErr_Clear(); - Py_DECREF(ret); - ret = NULL; - } - return ret; -} - -static NPY_INLINE void * -NpyCapsule_AsVoidPtr(PyObject *obj) -{ - void *ret = PyCapsule_GetPointer(obj, NULL); - if (ret == NULL) { - PyErr_Clear(); - } - return ret; -} - -static NPY_INLINE void * -NpyCapsule_GetDesc(PyObject *obj) -{ - return PyCapsule_GetContext(obj); -} - -static NPY_INLINE int -NpyCapsule_Check(PyObject *ptr) -{ - return PyCapsule_CheckExact(ptr); -} - -static NPY_INLINE void -simple_capsule_dtor(PyObject *cap) -{ - PyArray_free(PyCapsule_GetPointer(cap, NULL)); -} - -#else - -static NPY_INLINE PyObject * -NpyCapsule_FromVoidPtr(void *ptr, void (*dtor)(void *)) -{ - return PyCObject_FromVoidPtr(ptr, dtor); -} - -static NPY_INLINE PyObject * -NpyCapsule_FromVoidPtrAndDesc(void *ptr, void* context, - void (*dtor)(void *, void *)) -{ - return PyCObject_FromVoidPtrAndDesc(ptr, context, dtor); -} - -static NPY_INLINE void * -NpyCapsule_AsVoidPtr(PyObject *ptr) -{ - return PyCObject_AsVoidPtr(ptr); -} - -static NPY_INLINE void * -NpyCapsule_GetDesc(PyObject *obj) -{ - return PyCObject_GetDesc(obj); -} - -static NPY_INLINE int -NpyCapsule_Check(PyObject *ptr) -{ - return PyCObject_Check(ptr); -} - -static NPY_INLINE void -simple_capsule_dtor(void *ptr) -{ - PyArray_free(ptr); -} - -#endif - -/* - * Hash value compatibility. - * As of Python 3.2 hash values are of type Py_hash_t. - * Previous versions use C long. - */ -#if PY_VERSION_HEX < 0x03020000 -typedef long npy_hash_t; -#define NPY_SIZEOF_HASH_T NPY_SIZEOF_LONG -#else -typedef Py_hash_t npy_hash_t; -#define NPY_SIZEOF_HASH_T NPY_SIZEOF_INTP -#endif - -#ifdef __cplusplus -} -#endif - -#endif /* _NPY_3KCOMPAT_H_ */ diff --git a/include/numpy/npy_common.h b/include/numpy/npy_common.h deleted file mode 100644 index 7fca7e220..000000000 --- a/include/numpy/npy_common.h +++ /dev/null @@ -1,930 +0,0 @@ -#ifndef _NPY_COMMON_H_ -#define _NPY_COMMON_H_ - -/* numpconfig.h is auto-generated */ -#include "numpyconfig.h" - -#if defined(_MSC_VER) - #define NPY_INLINE __inline -#elif defined(__GNUC__) - #if defined(__STRICT_ANSI__) - #define NPY_INLINE __inline__ - #else - #define NPY_INLINE inline - #endif -#else - #define NPY_INLINE -#endif - -/* Enable 64 bit file position support on win-amd64. 
Ticket #1660 */ -#if defined(_MSC_VER) && defined(_WIN64) && (_MSC_VER > 1400) - #define npy_fseek _fseeki64 - #define npy_ftell _ftelli64 -#else - #define npy_fseek fseek - #define npy_ftell ftell -#endif - -/* enums for detected endianness */ -enum { - NPY_CPU_UNKNOWN_ENDIAN, - NPY_CPU_LITTLE, - NPY_CPU_BIG -}; - -/* - * This is to typedef npy_intp to the appropriate pointer size for - * this platform. Py_intptr_t, Py_uintptr_t are defined in pyport.h. - */ -typedef Py_intptr_t npy_intp; -typedef Py_uintptr_t npy_uintp; -#define NPY_SIZEOF_CHAR 1 -#define NPY_SIZEOF_BYTE 1 -#define NPY_SIZEOF_INTP NPY_SIZEOF_PY_INTPTR_T -#define NPY_SIZEOF_UINTP NPY_SIZEOF_PY_INTPTR_T -#define NPY_SIZEOF_CFLOAT NPY_SIZEOF_COMPLEX_FLOAT -#define NPY_SIZEOF_CDOUBLE NPY_SIZEOF_COMPLEX_DOUBLE -#define NPY_SIZEOF_CLONGDOUBLE NPY_SIZEOF_COMPLEX_LONGDOUBLE - -#ifdef constchar -#undef constchar -#endif - -#if (PY_VERSION_HEX < 0x02050000) - #ifndef PY_SSIZE_T_MIN - typedef int Py_ssize_t; - #define PY_SSIZE_T_MAX INT_MAX - #define PY_SSIZE_T_MIN INT_MIN - #endif -#define NPY_SSIZE_T_PYFMT "i" -#define constchar const char -#else -#define NPY_SSIZE_T_PYFMT "n" -#define constchar char -#endif - -/* NPY_INTP_FMT Note: - * Unlike the other NPY_*_FMT macros which are used with - * PyOS_snprintf, NPY_INTP_FMT is used with PyErr_Format and - * PyString_Format. These functions use different formatting - * codes which are portably specified according to the Python - * documentation. See ticket #1795. - * - * On Windows x64, the LONGLONG formatter should be used, but - * in Python 2.6 the %lld formatter is not supported. In this - * case we work around the problem by using the %zd formatter. - */ -#if NPY_SIZEOF_PY_INTPTR_T == NPY_SIZEOF_INT - #define NPY_INTP NPY_INT - #define NPY_UINTP NPY_UINT - #define PyIntpArrType_Type PyIntArrType_Type - #define PyUIntpArrType_Type PyUIntArrType_Type - #define NPY_MAX_INTP NPY_MAX_INT - #define NPY_MIN_INTP NPY_MIN_INT - #define NPY_MAX_UINTP NPY_MAX_UINT - #define NPY_INTP_FMT "d" -#elif NPY_SIZEOF_PY_INTPTR_T == NPY_SIZEOF_LONG - #define NPY_INTP NPY_LONG - #define NPY_UINTP NPY_ULONG - #define PyIntpArrType_Type PyLongArrType_Type - #define PyUIntpArrType_Type PyULongArrType_Type - #define NPY_MAX_INTP NPY_MAX_LONG - #define NPY_MIN_INTP NPY_MIN_LONG - #define NPY_MAX_UINTP NPY_MAX_ULONG - #define NPY_INTP_FMT "ld" -#elif defined(PY_LONG_LONG) && (NPY_SIZEOF_PY_INTPTR_T == NPY_SIZEOF_LONGLONG) - #define NPY_INTP NPY_LONGLONG - #define NPY_UINTP NPY_ULONGLONG - #define PyIntpArrType_Type PyLongLongArrType_Type - #define PyUIntpArrType_Type PyULongLongArrType_Type - #define NPY_MAX_INTP NPY_MAX_LONGLONG - #define NPY_MIN_INTP NPY_MIN_LONGLONG - #define NPY_MAX_UINTP NPY_MAX_ULONGLONG - #if (PY_VERSION_HEX >= 0x02070000) - #define NPY_INTP_FMT "lld" - #else - #define NPY_INTP_FMT "zd" - #endif -#endif - -/* - * We can only use C99 formats for npy_int_p if it is the same as - * intp_t, hence the condition on HAVE_UNITPTR_T - */ -#if (NPY_USE_C99_FORMATS) == 1 \ - && (defined HAVE_UINTPTR_T) \ - && (defined HAVE_INTTYPES_H) - #include - #undef NPY_INTP_FMT - #define NPY_INTP_FMT PRIdPTR -#endif - - -/* - * Some platforms don't define bool, long long, or long double. - * Handle that here. 
- */ -#define NPY_BYTE_FMT "hhd" -#define NPY_UBYTE_FMT "hhu" -#define NPY_SHORT_FMT "hd" -#define NPY_USHORT_FMT "hu" -#define NPY_INT_FMT "d" -#define NPY_UINT_FMT "u" -#define NPY_LONG_FMT "ld" -#define NPY_ULONG_FMT "lu" -#define NPY_HALF_FMT "g" -#define NPY_FLOAT_FMT "g" -#define NPY_DOUBLE_FMT "g" - - -#ifdef PY_LONG_LONG -typedef PY_LONG_LONG npy_longlong; -typedef unsigned PY_LONG_LONG npy_ulonglong; -# ifdef _MSC_VER -# define NPY_LONGLONG_FMT "I64d" -# define NPY_ULONGLONG_FMT "I64u" -# elif defined(__APPLE__) || defined(__FreeBSD__) -/* "%Ld" only parses 4 bytes -- "L" is floating modifier on MacOS X/BSD */ -# define NPY_LONGLONG_FMT "lld" -# define NPY_ULONGLONG_FMT "llu" -/* - another possible variant -- *quad_t works on *BSD, but is deprecated: - #define LONGLONG_FMT "qd" - #define ULONGLONG_FMT "qu" -*/ -# else -# define NPY_LONGLONG_FMT "Ld" -# define NPY_ULONGLONG_FMT "Lu" -# endif -# ifdef _MSC_VER -# define NPY_LONGLONG_SUFFIX(x) (x##i64) -# define NPY_ULONGLONG_SUFFIX(x) (x##Ui64) -# else -# define NPY_LONGLONG_SUFFIX(x) (x##LL) -# define NPY_ULONGLONG_SUFFIX(x) (x##ULL) -# endif -#else -typedef long npy_longlong; -typedef unsigned long npy_ulonglong; -# define NPY_LONGLONG_SUFFIX(x) (x##L) -# define NPY_ULONGLONG_SUFFIX(x) (x##UL) -#endif - - -typedef unsigned char npy_bool; -#define NPY_FALSE 0 -#define NPY_TRUE 1 - - -#if NPY_SIZEOF_LONGDOUBLE == NPY_SIZEOF_DOUBLE - typedef double npy_longdouble; - #define NPY_LONGDOUBLE_FMT "g" -#else - typedef long double npy_longdouble; - #define NPY_LONGDOUBLE_FMT "Lg" -#endif - -#ifndef Py_USING_UNICODE -#error Must use Python with unicode enabled. -#endif - - -typedef signed char npy_byte; -typedef unsigned char npy_ubyte; -typedef unsigned short npy_ushort; -typedef unsigned int npy_uint; -typedef unsigned long npy_ulong; - -/* These are for completeness */ -typedef char npy_char; -typedef short npy_short; -typedef int npy_int; -typedef long npy_long; -typedef float npy_float; -typedef double npy_double; - -/* - * Disabling C99 complex usage: a lot of C code in numpy/scipy rely on being - * able to do .real/.imag. Will have to convert code first. - */ -#if 0 -#if defined(NPY_USE_C99_COMPLEX) && defined(NPY_HAVE_COMPLEX_DOUBLE) -typedef complex npy_cdouble; -#else -typedef struct { double real, imag; } npy_cdouble; -#endif - -#if defined(NPY_USE_C99_COMPLEX) && defined(NPY_HAVE_COMPLEX_FLOAT) -typedef complex float npy_cfloat; -#else -typedef struct { float real, imag; } npy_cfloat; -#endif - -#if defined(NPY_USE_C99_COMPLEX) && defined(NPY_HAVE_COMPLEX_LONG_DOUBLE) -typedef complex long double npy_clongdouble; -#else -typedef struct {npy_longdouble real, imag;} npy_clongdouble; -#endif -#endif -#if NPY_SIZEOF_COMPLEX_DOUBLE != 2 * NPY_SIZEOF_DOUBLE -#error npy_cdouble definition is not compatible with C99 complex definition ! \ - Please contact Numpy maintainers and give detailed information about your \ - compiler and platform -#endif -typedef struct { double real, imag; } npy_cdouble; - -#if NPY_SIZEOF_COMPLEX_FLOAT != 2 * NPY_SIZEOF_FLOAT -#error npy_cfloat definition is not compatible with C99 complex definition ! \ - Please contact Numpy maintainers and give detailed information about your \ - compiler and platform -#endif -typedef struct { float real, imag; } npy_cfloat; - -#if NPY_SIZEOF_COMPLEX_LONGDOUBLE != 2 * NPY_SIZEOF_LONGDOUBLE -#error npy_clongdouble definition is not compatible with C99 complex definition ! 
\ - Please contact Numpy maintainers and give detailed information about your \ - compiler and platform -#endif -typedef struct { npy_longdouble real, imag; } npy_clongdouble; - -/* - * numarray-style bit-width typedefs - */ -#define NPY_MAX_INT8 127 -#define NPY_MIN_INT8 -128 -#define NPY_MAX_UINT8 255 -#define NPY_MAX_INT16 32767 -#define NPY_MIN_INT16 -32768 -#define NPY_MAX_UINT16 65535 -#define NPY_MAX_INT32 2147483647 -#define NPY_MIN_INT32 (-NPY_MAX_INT32 - 1) -#define NPY_MAX_UINT32 4294967295U -#define NPY_MAX_INT64 NPY_LONGLONG_SUFFIX(9223372036854775807) -#define NPY_MIN_INT64 (-NPY_MAX_INT64 - NPY_LONGLONG_SUFFIX(1)) -#define NPY_MAX_UINT64 NPY_ULONGLONG_SUFFIX(18446744073709551615) -#define NPY_MAX_INT128 NPY_LONGLONG_SUFFIX(85070591730234615865843651857942052864) -#define NPY_MIN_INT128 (-NPY_MAX_INT128 - NPY_LONGLONG_SUFFIX(1)) -#define NPY_MAX_UINT128 NPY_ULONGLONG_SUFFIX(170141183460469231731687303715884105728) -#define NPY_MAX_INT256 NPY_LONGLONG_SUFFIX(57896044618658097711785492504343953926634992332820282019728792003956564819967) -#define NPY_MIN_INT256 (-NPY_MAX_INT256 - NPY_LONGLONG_SUFFIX(1)) -#define NPY_MAX_UINT256 NPY_ULONGLONG_SUFFIX(115792089237316195423570985008687907853269984665640564039457584007913129639935) -#define NPY_MIN_DATETIME NPY_MIN_INT64 -#define NPY_MAX_DATETIME NPY_MAX_INT64 -#define NPY_MIN_TIMEDELTA NPY_MIN_INT64 -#define NPY_MAX_TIMEDELTA NPY_MAX_INT64 - - /* Need to find the number of bits for each type and - make definitions accordingly. - - C states that sizeof(char) == 1 by definition - - So, just using the sizeof keyword won't help. - - It also looks like Python itself uses sizeof(char) quite a - bit, which by definition should be 1 all the time. - - Idea: Make Use of CHAR_BIT which should tell us how many - BITS per CHARACTER - */ - - /* Include platform definitions -- These are in the C89/90 standard */ -#include -#define NPY_MAX_BYTE SCHAR_MAX -#define NPY_MIN_BYTE SCHAR_MIN -#define NPY_MAX_UBYTE UCHAR_MAX -#define NPY_MAX_SHORT SHRT_MAX -#define NPY_MIN_SHORT SHRT_MIN -#define NPY_MAX_USHORT USHRT_MAX -#define NPY_MAX_INT INT_MAX -#ifndef INT_MIN -#define INT_MIN (-INT_MAX - 1) -#endif -#define NPY_MIN_INT INT_MIN -#define NPY_MAX_UINT UINT_MAX -#define NPY_MAX_LONG LONG_MAX -#define NPY_MIN_LONG LONG_MIN -#define NPY_MAX_ULONG ULONG_MAX - -#define NPY_SIZEOF_HALF 2 -#define NPY_SIZEOF_DATETIME 8 -#define NPY_SIZEOF_TIMEDELTA 8 - -#define NPY_BITSOF_BOOL (sizeof(npy_bool) * CHAR_BIT) -#define NPY_BITSOF_CHAR CHAR_BIT -#define NPY_BITSOF_BYTE (NPY_SIZEOF_BYTE * CHAR_BIT) -#define NPY_BITSOF_SHORT (NPY_SIZEOF_SHORT * CHAR_BIT) -#define NPY_BITSOF_INT (NPY_SIZEOF_INT * CHAR_BIT) -#define NPY_BITSOF_LONG (NPY_SIZEOF_LONG * CHAR_BIT) -#define NPY_BITSOF_LONGLONG (NPY_SIZEOF_LONGLONG * CHAR_BIT) -#define NPY_BITSOF_INTP (NPY_SIZEOF_INTP * CHAR_BIT) -#define NPY_BITSOF_HALF (NPY_SIZEOF_HALF * CHAR_BIT) -#define NPY_BITSOF_FLOAT (NPY_SIZEOF_FLOAT * CHAR_BIT) -#define NPY_BITSOF_DOUBLE (NPY_SIZEOF_DOUBLE * CHAR_BIT) -#define NPY_BITSOF_LONGDOUBLE (NPY_SIZEOF_LONGDOUBLE * CHAR_BIT) -#define NPY_BITSOF_CFLOAT (NPY_SIZEOF_CFLOAT * CHAR_BIT) -#define NPY_BITSOF_CDOUBLE (NPY_SIZEOF_CDOUBLE * CHAR_BIT) -#define NPY_BITSOF_CLONGDOUBLE (NPY_SIZEOF_CLONGDOUBLE * CHAR_BIT) -#define NPY_BITSOF_DATETIME (NPY_SIZEOF_DATETIME * CHAR_BIT) -#define NPY_BITSOF_TIMEDELTA (NPY_SIZEOF_TIMEDELTA * CHAR_BIT) - -#if NPY_BITSOF_LONG == 8 -#define NPY_INT8 NPY_LONG -#define NPY_UINT8 NPY_ULONG - typedef long npy_int8; - typedef unsigned long npy_uint8; -#define 
PyInt8ScalarObject PyLongScalarObject -#define PyInt8ArrType_Type PyLongArrType_Type -#define PyUInt8ScalarObject PyULongScalarObject -#define PyUInt8ArrType_Type PyULongArrType_Type -#define NPY_INT8_FMT NPY_LONG_FMT -#define NPY_UINT8_FMT NPY_ULONG_FMT -#elif NPY_BITSOF_LONG == 16 -#define NPY_INT16 NPY_LONG -#define NPY_UINT16 NPY_ULONG - typedef long npy_int16; - typedef unsigned long npy_uint16; -#define PyInt16ScalarObject PyLongScalarObject -#define PyInt16ArrType_Type PyLongArrType_Type -#define PyUInt16ScalarObject PyULongScalarObject -#define PyUInt16ArrType_Type PyULongArrType_Type -#define NPY_INT16_FMT NPY_LONG_FMT -#define NPY_UINT16_FMT NPY_ULONG_FMT -#elif NPY_BITSOF_LONG == 32 -#define NPY_INT32 NPY_LONG -#define NPY_UINT32 NPY_ULONG - typedef long npy_int32; - typedef unsigned long npy_uint32; - typedef unsigned long npy_ucs4; -#define PyInt32ScalarObject PyLongScalarObject -#define PyInt32ArrType_Type PyLongArrType_Type -#define PyUInt32ScalarObject PyULongScalarObject -#define PyUInt32ArrType_Type PyULongArrType_Type -#define NPY_INT32_FMT NPY_LONG_FMT -#define NPY_UINT32_FMT NPY_ULONG_FMT -#elif NPY_BITSOF_LONG == 64 -#define NPY_INT64 NPY_LONG -#define NPY_UINT64 NPY_ULONG - typedef long npy_int64; - typedef unsigned long npy_uint64; -#define PyInt64ScalarObject PyLongScalarObject -#define PyInt64ArrType_Type PyLongArrType_Type -#define PyUInt64ScalarObject PyULongScalarObject -#define PyUInt64ArrType_Type PyULongArrType_Type -#define NPY_INT64_FMT NPY_LONG_FMT -#define NPY_UINT64_FMT NPY_ULONG_FMT -#define MyPyLong_FromInt64 PyLong_FromLong -#define MyPyLong_AsInt64 PyLong_AsLong -#elif NPY_BITSOF_LONG == 128 -#define NPY_INT128 NPY_LONG -#define NPY_UINT128 NPY_ULONG - typedef long npy_int128; - typedef unsigned long npy_uint128; -#define PyInt128ScalarObject PyLongScalarObject -#define PyInt128ArrType_Type PyLongArrType_Type -#define PyUInt128ScalarObject PyULongScalarObject -#define PyUInt128ArrType_Type PyULongArrType_Type -#define NPY_INT128_FMT NPY_LONG_FMT -#define NPY_UINT128_FMT NPY_ULONG_FMT -#endif - -#if NPY_BITSOF_LONGLONG == 8 -# ifndef NPY_INT8 -# define NPY_INT8 NPY_LONGLONG -# define NPY_UINT8 NPY_ULONGLONG - typedef npy_longlong npy_int8; - typedef npy_ulonglong npy_uint8; -# define PyInt8ScalarObject PyLongLongScalarObject -# define PyInt8ArrType_Type PyLongLongArrType_Type -# define PyUInt8ScalarObject PyULongLongScalarObject -# define PyUInt8ArrType_Type PyULongLongArrType_Type -#define NPY_INT8_FMT NPY_LONGLONG_FMT -#define NPY_UINT8_FMT NPY_ULONGLONG_FMT -# endif -# define NPY_MAX_LONGLONG NPY_MAX_INT8 -# define NPY_MIN_LONGLONG NPY_MIN_INT8 -# define NPY_MAX_ULONGLONG NPY_MAX_UINT8 -#elif NPY_BITSOF_LONGLONG == 16 -# ifndef NPY_INT16 -# define NPY_INT16 NPY_LONGLONG -# define NPY_UINT16 NPY_ULONGLONG - typedef npy_longlong npy_int16; - typedef npy_ulonglong npy_uint16; -# define PyInt16ScalarObject PyLongLongScalarObject -# define PyInt16ArrType_Type PyLongLongArrType_Type -# define PyUInt16ScalarObject PyULongLongScalarObject -# define PyUInt16ArrType_Type PyULongLongArrType_Type -#define NPY_INT16_FMT NPY_LONGLONG_FMT -#define NPY_UINT16_FMT NPY_ULONGLONG_FMT -# endif -# define NPY_MAX_LONGLONG NPY_MAX_INT16 -# define NPY_MIN_LONGLONG NPY_MIN_INT16 -# define NPY_MAX_ULONGLONG NPY_MAX_UINT16 -#elif NPY_BITSOF_LONGLONG == 32 -# ifndef NPY_INT32 -# define NPY_INT32 NPY_LONGLONG -# define NPY_UINT32 NPY_ULONGLONG - typedef npy_longlong npy_int32; - typedef npy_ulonglong npy_uint32; - typedef npy_ulonglong npy_ucs4; -# define PyInt32ScalarObject 
PyLongLongScalarObject -# define PyInt32ArrType_Type PyLongLongArrType_Type -# define PyUInt32ScalarObject PyULongLongScalarObject -# define PyUInt32ArrType_Type PyULongLongArrType_Type -#define NPY_INT32_FMT NPY_LONGLONG_FMT -#define NPY_UINT32_FMT NPY_ULONGLONG_FMT -# endif -# define NPY_MAX_LONGLONG NPY_MAX_INT32 -# define NPY_MIN_LONGLONG NPY_MIN_INT32 -# define NPY_MAX_ULONGLONG NPY_MAX_UINT32 -#elif NPY_BITSOF_LONGLONG == 64 -# ifndef NPY_INT64 -# define NPY_INT64 NPY_LONGLONG -# define NPY_UINT64 NPY_ULONGLONG - typedef npy_longlong npy_int64; - typedef npy_ulonglong npy_uint64; -# define PyInt64ScalarObject PyLongLongScalarObject -# define PyInt64ArrType_Type PyLongLongArrType_Type -# define PyUInt64ScalarObject PyULongLongScalarObject -# define PyUInt64ArrType_Type PyULongLongArrType_Type -#define NPY_INT64_FMT NPY_LONGLONG_FMT -#define NPY_UINT64_FMT NPY_ULONGLONG_FMT -# define MyPyLong_FromInt64 PyLong_FromLongLong -# define MyPyLong_AsInt64 PyLong_AsLongLong -# endif -# define NPY_MAX_LONGLONG NPY_MAX_INT64 -# define NPY_MIN_LONGLONG NPY_MIN_INT64 -# define NPY_MAX_ULONGLONG NPY_MAX_UINT64 -#elif NPY_BITSOF_LONGLONG == 128 -# ifndef NPY_INT128 -# define NPY_INT128 NPY_LONGLONG -# define NPY_UINT128 NPY_ULONGLONG - typedef npy_longlong npy_int128; - typedef npy_ulonglong npy_uint128; -# define PyInt128ScalarObject PyLongLongScalarObject -# define PyInt128ArrType_Type PyLongLongArrType_Type -# define PyUInt128ScalarObject PyULongLongScalarObject -# define PyUInt128ArrType_Type PyULongLongArrType_Type -#define NPY_INT128_FMT NPY_LONGLONG_FMT -#define NPY_UINT128_FMT NPY_ULONGLONG_FMT -# endif -# define NPY_MAX_LONGLONG NPY_MAX_INT128 -# define NPY_MIN_LONGLONG NPY_MIN_INT128 -# define NPY_MAX_ULONGLONG NPY_MAX_UINT128 -#elif NPY_BITSOF_LONGLONG == 256 -# define NPY_INT256 NPY_LONGLONG -# define NPY_UINT256 NPY_ULONGLONG - typedef npy_longlong npy_int256; - typedef npy_ulonglong npy_uint256; -# define PyInt256ScalarObject PyLongLongScalarObject -# define PyInt256ArrType_Type PyLongLongArrType_Type -# define PyUInt256ScalarObject PyULongLongScalarObject -# define PyUInt256ArrType_Type PyULongLongArrType_Type -#define NPY_INT256_FMT NPY_LONGLONG_FMT -#define NPY_UINT256_FMT NPY_ULONGLONG_FMT -# define NPY_MAX_LONGLONG NPY_MAX_INT256 -# define NPY_MIN_LONGLONG NPY_MIN_INT256 -# define NPY_MAX_ULONGLONG NPY_MAX_UINT256 -#endif - -#if NPY_BITSOF_INT == 8 -#ifndef NPY_INT8 -#define NPY_INT8 NPY_INT -#define NPY_UINT8 NPY_UINT - typedef int npy_int8; - typedef unsigned int npy_uint8; -# define PyInt8ScalarObject PyIntScalarObject -# define PyInt8ArrType_Type PyIntArrType_Type -# define PyUInt8ScalarObject PyUIntScalarObject -# define PyUInt8ArrType_Type PyUIntArrType_Type -#define NPY_INT8_FMT NPY_INT_FMT -#define NPY_UINT8_FMT NPY_UINT_FMT -#endif -#elif NPY_BITSOF_INT == 16 -#ifndef NPY_INT16 -#define NPY_INT16 NPY_INT -#define NPY_UINT16 NPY_UINT - typedef int npy_int16; - typedef unsigned int npy_uint16; -# define PyInt16ScalarObject PyIntScalarObject -# define PyInt16ArrType_Type PyIntArrType_Type -# define PyUInt16ScalarObject PyIntUScalarObject -# define PyUInt16ArrType_Type PyIntUArrType_Type -#define NPY_INT16_FMT NPY_INT_FMT -#define NPY_UINT16_FMT NPY_UINT_FMT -#endif -#elif NPY_BITSOF_INT == 32 -#ifndef NPY_INT32 -#define NPY_INT32 NPY_INT -#define NPY_UINT32 NPY_UINT - typedef int npy_int32; - typedef unsigned int npy_uint32; - typedef unsigned int npy_ucs4; -# define PyInt32ScalarObject PyIntScalarObject -# define PyInt32ArrType_Type PyIntArrType_Type -# define 
PyUInt32ScalarObject PyUIntScalarObject -# define PyUInt32ArrType_Type PyUIntArrType_Type -#define NPY_INT32_FMT NPY_INT_FMT -#define NPY_UINT32_FMT NPY_UINT_FMT -#endif -#elif NPY_BITSOF_INT == 64 -#ifndef NPY_INT64 -#define NPY_INT64 NPY_INT -#define NPY_UINT64 NPY_UINT - typedef int npy_int64; - typedef unsigned int npy_uint64; -# define PyInt64ScalarObject PyIntScalarObject -# define PyInt64ArrType_Type PyIntArrType_Type -# define PyUInt64ScalarObject PyUIntScalarObject -# define PyUInt64ArrType_Type PyUIntArrType_Type -#define NPY_INT64_FMT NPY_INT_FMT -#define NPY_UINT64_FMT NPY_UINT_FMT -# define MyPyLong_FromInt64 PyLong_FromLong -# define MyPyLong_AsInt64 PyLong_AsLong -#endif -#elif NPY_BITSOF_INT == 128 -#ifndef NPY_INT128 -#define NPY_INT128 NPY_INT -#define NPY_UINT128 NPY_UINT - typedef int npy_int128; - typedef unsigned int npy_uint128; -# define PyInt128ScalarObject PyIntScalarObject -# define PyInt128ArrType_Type PyIntArrType_Type -# define PyUInt128ScalarObject PyUIntScalarObject -# define PyUInt128ArrType_Type PyUIntArrType_Type -#define NPY_INT128_FMT NPY_INT_FMT -#define NPY_UINT128_FMT NPY_UINT_FMT -#endif -#endif - -#if NPY_BITSOF_SHORT == 8 -#ifndef NPY_INT8 -#define NPY_INT8 NPY_SHORT -#define NPY_UINT8 NPY_USHORT - typedef short npy_int8; - typedef unsigned short npy_uint8; -# define PyInt8ScalarObject PyShortScalarObject -# define PyInt8ArrType_Type PyShortArrType_Type -# define PyUInt8ScalarObject PyUShortScalarObject -# define PyUInt8ArrType_Type PyUShortArrType_Type -#define NPY_INT8_FMT NPY_SHORT_FMT -#define NPY_UINT8_FMT NPY_USHORT_FMT -#endif -#elif NPY_BITSOF_SHORT == 16 -#ifndef NPY_INT16 -#define NPY_INT16 NPY_SHORT -#define NPY_UINT16 NPY_USHORT - typedef short npy_int16; - typedef unsigned short npy_uint16; -# define PyInt16ScalarObject PyShortScalarObject -# define PyInt16ArrType_Type PyShortArrType_Type -# define PyUInt16ScalarObject PyUShortScalarObject -# define PyUInt16ArrType_Type PyUShortArrType_Type -#define NPY_INT16_FMT NPY_SHORT_FMT -#define NPY_UINT16_FMT NPY_USHORT_FMT -#endif -#elif NPY_BITSOF_SHORT == 32 -#ifndef NPY_INT32 -#define NPY_INT32 NPY_SHORT -#define NPY_UINT32 NPY_USHORT - typedef short npy_int32; - typedef unsigned short npy_uint32; - typedef unsigned short npy_ucs4; -# define PyInt32ScalarObject PyShortScalarObject -# define PyInt32ArrType_Type PyShortArrType_Type -# define PyUInt32ScalarObject PyUShortScalarObject -# define PyUInt32ArrType_Type PyUShortArrType_Type -#define NPY_INT32_FMT NPY_SHORT_FMT -#define NPY_UINT32_FMT NPY_USHORT_FMT -#endif -#elif NPY_BITSOF_SHORT == 64 -#ifndef NPY_INT64 -#define NPY_INT64 NPY_SHORT -#define NPY_UINT64 NPY_USHORT - typedef short npy_int64; - typedef unsigned short npy_uint64; -# define PyInt64ScalarObject PyShortScalarObject -# define PyInt64ArrType_Type PyShortArrType_Type -# define PyUInt64ScalarObject PyUShortScalarObject -# define PyUInt64ArrType_Type PyUShortArrType_Type -#define NPY_INT64_FMT NPY_SHORT_FMT -#define NPY_UINT64_FMT NPY_USHORT_FMT -# define MyPyLong_FromInt64 PyLong_FromLong -# define MyPyLong_AsInt64 PyLong_AsLong -#endif -#elif NPY_BITSOF_SHORT == 128 -#ifndef NPY_INT128 -#define NPY_INT128 NPY_SHORT -#define NPY_UINT128 NPY_USHORT - typedef short npy_int128; - typedef unsigned short npy_uint128; -# define PyInt128ScalarObject PyShortScalarObject -# define PyInt128ArrType_Type PyShortArrType_Type -# define PyUInt128ScalarObject PyUShortScalarObject -# define PyUInt128ArrType_Type PyUShortArrType_Type -#define NPY_INT128_FMT NPY_SHORT_FMT -#define 
NPY_UINT128_FMT NPY_USHORT_FMT -#endif -#endif - - -#if NPY_BITSOF_CHAR == 8 -#ifndef NPY_INT8 -#define NPY_INT8 NPY_BYTE -#define NPY_UINT8 NPY_UBYTE - typedef signed char npy_int8; - typedef unsigned char npy_uint8; -# define PyInt8ScalarObject PyByteScalarObject -# define PyInt8ArrType_Type PyByteArrType_Type -# define PyUInt8ScalarObject PyUByteScalarObject -# define PyUInt8ArrType_Type PyUByteArrType_Type -#define NPY_INT8_FMT NPY_BYTE_FMT -#define NPY_UINT8_FMT NPY_UBYTE_FMT -#endif -#elif NPY_BITSOF_CHAR == 16 -#ifndef NPY_INT16 -#define NPY_INT16 NPY_BYTE -#define NPY_UINT16 NPY_UBYTE - typedef signed char npy_int16; - typedef unsigned char npy_uint16; -# define PyInt16ScalarObject PyByteScalarObject -# define PyInt16ArrType_Type PyByteArrType_Type -# define PyUInt16ScalarObject PyUByteScalarObject -# define PyUInt16ArrType_Type PyUByteArrType_Type -#define NPY_INT16_FMT NPY_BYTE_FMT -#define NPY_UINT16_FMT NPY_UBYTE_FMT -#endif -#elif NPY_BITSOF_CHAR == 32 -#ifndef NPY_INT32 -#define NPY_INT32 NPY_BYTE -#define NPY_UINT32 NPY_UBYTE - typedef signed char npy_int32; - typedef unsigned char npy_uint32; - typedef unsigned char npy_ucs4; -# define PyInt32ScalarObject PyByteScalarObject -# define PyInt32ArrType_Type PyByteArrType_Type -# define PyUInt32ScalarObject PyUByteScalarObject -# define PyUInt32ArrType_Type PyUByteArrType_Type -#define NPY_INT32_FMT NPY_BYTE_FMT -#define NPY_UINT32_FMT NPY_UBYTE_FMT -#endif -#elif NPY_BITSOF_CHAR == 64 -#ifndef NPY_INT64 -#define NPY_INT64 NPY_BYTE -#define NPY_UINT64 NPY_UBYTE - typedef signed char npy_int64; - typedef unsigned char npy_uint64; -# define PyInt64ScalarObject PyByteScalarObject -# define PyInt64ArrType_Type PyByteArrType_Type -# define PyUInt64ScalarObject PyUByteScalarObject -# define PyUInt64ArrType_Type PyUByteArrType_Type -#define NPY_INT64_FMT NPY_BYTE_FMT -#define NPY_UINT64_FMT NPY_UBYTE_FMT -# define MyPyLong_FromInt64 PyLong_FromLong -# define MyPyLong_AsInt64 PyLong_AsLong -#endif -#elif NPY_BITSOF_CHAR == 128 -#ifndef NPY_INT128 -#define NPY_INT128 NPY_BYTE -#define NPY_UINT128 NPY_UBYTE - typedef signed char npy_int128; - typedef unsigned char npy_uint128; -# define PyInt128ScalarObject PyByteScalarObject -# define PyInt128ArrType_Type PyByteArrType_Type -# define PyUInt128ScalarObject PyUByteScalarObject -# define PyUInt128ArrType_Type PyUByteArrType_Type -#define NPY_INT128_FMT NPY_BYTE_FMT -#define NPY_UINT128_FMT NPY_UBYTE_FMT -#endif -#endif - - - -#if NPY_BITSOF_DOUBLE == 32 -#ifndef NPY_FLOAT32 -#define NPY_FLOAT32 NPY_DOUBLE -#define NPY_COMPLEX64 NPY_CDOUBLE - typedef double npy_float32; - typedef npy_cdouble npy_complex64; -# define PyFloat32ScalarObject PyDoubleScalarObject -# define PyComplex64ScalarObject PyCDoubleScalarObject -# define PyFloat32ArrType_Type PyDoubleArrType_Type -# define PyComplex64ArrType_Type PyCDoubleArrType_Type -#define NPY_FLOAT32_FMT NPY_DOUBLE_FMT -#define NPY_COMPLEX64_FMT NPY_CDOUBLE_FMT -#endif -#elif NPY_BITSOF_DOUBLE == 64 -#ifndef NPY_FLOAT64 -#define NPY_FLOAT64 NPY_DOUBLE -#define NPY_COMPLEX128 NPY_CDOUBLE - typedef double npy_float64; - typedef npy_cdouble npy_complex128; -# define PyFloat64ScalarObject PyDoubleScalarObject -# define PyComplex128ScalarObject PyCDoubleScalarObject -# define PyFloat64ArrType_Type PyDoubleArrType_Type -# define PyComplex128ArrType_Type PyCDoubleArrType_Type -#define NPY_FLOAT64_FMT NPY_DOUBLE_FMT -#define NPY_COMPLEX128_FMT NPY_CDOUBLE_FMT -#endif -#elif NPY_BITSOF_DOUBLE == 80 -#ifndef NPY_FLOAT80 -#define NPY_FLOAT80 NPY_DOUBLE -#define 
NPY_COMPLEX160 NPY_CDOUBLE - typedef double npy_float80; - typedef npy_cdouble npy_complex160; -# define PyFloat80ScalarObject PyDoubleScalarObject -# define PyComplex160ScalarObject PyCDoubleScalarObject -# define PyFloat80ArrType_Type PyDoubleArrType_Type -# define PyComplex160ArrType_Type PyCDoubleArrType_Type -#define NPY_FLOAT80_FMT NPY_DOUBLE_FMT -#define NPY_COMPLEX160_FMT NPY_CDOUBLE_FMT -#endif -#elif NPY_BITSOF_DOUBLE == 96 -#ifndef NPY_FLOAT96 -#define NPY_FLOAT96 NPY_DOUBLE -#define NPY_COMPLEX192 NPY_CDOUBLE - typedef double npy_float96; - typedef npy_cdouble npy_complex192; -# define PyFloat96ScalarObject PyDoubleScalarObject -# define PyComplex192ScalarObject PyCDoubleScalarObject -# define PyFloat96ArrType_Type PyDoubleArrType_Type -# define PyComplex192ArrType_Type PyCDoubleArrType_Type -#define NPY_FLOAT96_FMT NPY_DOUBLE_FMT -#define NPY_COMPLEX192_FMT NPY_CDOUBLE_FMT -#endif -#elif NPY_BITSOF_DOUBLE == 128 -#ifndef NPY_FLOAT128 -#define NPY_FLOAT128 NPY_DOUBLE -#define NPY_COMPLEX256 NPY_CDOUBLE - typedef double npy_float128; - typedef npy_cdouble npy_complex256; -# define PyFloat128ScalarObject PyDoubleScalarObject -# define PyComplex256ScalarObject PyCDoubleScalarObject -# define PyFloat128ArrType_Type PyDoubleArrType_Type -# define PyComplex256ArrType_Type PyCDoubleArrType_Type -#define NPY_FLOAT128_FMT NPY_DOUBLE_FMT -#define NPY_COMPLEX256_FMT NPY_CDOUBLE_FMT -#endif -#endif - - - -#if NPY_BITSOF_FLOAT == 32 -#ifndef NPY_FLOAT32 -#define NPY_FLOAT32 NPY_FLOAT -#define NPY_COMPLEX64 NPY_CFLOAT - typedef float npy_float32; - typedef npy_cfloat npy_complex64; -# define PyFloat32ScalarObject PyFloatScalarObject -# define PyComplex64ScalarObject PyCFloatScalarObject -# define PyFloat32ArrType_Type PyFloatArrType_Type -# define PyComplex64ArrType_Type PyCFloatArrType_Type -#define NPY_FLOAT32_FMT NPY_FLOAT_FMT -#define NPY_COMPLEX64_FMT NPY_CFLOAT_FMT -#endif -#elif NPY_BITSOF_FLOAT == 64 -#ifndef NPY_FLOAT64 -#define NPY_FLOAT64 NPY_FLOAT -#define NPY_COMPLEX128 NPY_CFLOAT - typedef float npy_float64; - typedef npy_cfloat npy_complex128; -# define PyFloat64ScalarObject PyFloatScalarObject -# define PyComplex128ScalarObject PyCFloatScalarObject -# define PyFloat64ArrType_Type PyFloatArrType_Type -# define PyComplex128ArrType_Type PyCFloatArrType_Type -#define NPY_FLOAT64_FMT NPY_FLOAT_FMT -#define NPY_COMPLEX128_FMT NPY_CFLOAT_FMT -#endif -#elif NPY_BITSOF_FLOAT == 80 -#ifndef NPY_FLOAT80 -#define NPY_FLOAT80 NPY_FLOAT -#define NPY_COMPLEX160 NPY_CFLOAT - typedef float npy_float80; - typedef npy_cfloat npy_complex160; -# define PyFloat80ScalarObject PyFloatScalarObject -# define PyComplex160ScalarObject PyCFloatScalarObject -# define PyFloat80ArrType_Type PyFloatArrType_Type -# define PyComplex160ArrType_Type PyCFloatArrType_Type -#define NPY_FLOAT80_FMT NPY_FLOAT_FMT -#define NPY_COMPLEX160_FMT NPY_CFLOAT_FMT -#endif -#elif NPY_BITSOF_FLOAT == 96 -#ifndef NPY_FLOAT96 -#define NPY_FLOAT96 NPY_FLOAT -#define NPY_COMPLEX192 NPY_CFLOAT - typedef float npy_float96; - typedef npy_cfloat npy_complex192; -# define PyFloat96ScalarObject PyFloatScalarObject -# define PyComplex192ScalarObject PyCFloatScalarObject -# define PyFloat96ArrType_Type PyFloatArrType_Type -# define PyComplex192ArrType_Type PyCFloatArrType_Type -#define NPY_FLOAT96_FMT NPY_FLOAT_FMT -#define NPY_COMPLEX192_FMT NPY_CFLOAT_FMT -#endif -#elif NPY_BITSOF_FLOAT == 128 -#ifndef NPY_FLOAT128 -#define NPY_FLOAT128 NPY_FLOAT -#define NPY_COMPLEX256 NPY_CFLOAT - typedef float npy_float128; - typedef npy_cfloat 
npy_complex256; -# define PyFloat128ScalarObject PyFloatScalarObject -# define PyComplex256ScalarObject PyCFloatScalarObject -# define PyFloat128ArrType_Type PyFloatArrType_Type -# define PyComplex256ArrType_Type PyCFloatArrType_Type -#define NPY_FLOAT128_FMT NPY_FLOAT_FMT -#define NPY_COMPLEX256_FMT NPY_CFLOAT_FMT -#endif -#endif - -/* half/float16 isn't a floating-point type in C */ -#define NPY_FLOAT16 NPY_HALF -typedef npy_uint16 npy_half; -typedef npy_half npy_float16; - -#if NPY_BITSOF_LONGDOUBLE == 32 -#ifndef NPY_FLOAT32 -#define NPY_FLOAT32 NPY_LONGDOUBLE -#define NPY_COMPLEX64 NPY_CLONGDOUBLE - typedef npy_longdouble npy_float32; - typedef npy_clongdouble npy_complex64; -# define PyFloat32ScalarObject PyLongDoubleScalarObject -# define PyComplex64ScalarObject PyCLongDoubleScalarObject -# define PyFloat32ArrType_Type PyLongDoubleArrType_Type -# define PyComplex64ArrType_Type PyCLongDoubleArrType_Type -#define NPY_FLOAT32_FMT NPY_LONGDOUBLE_FMT -#define NPY_COMPLEX64_FMT NPY_CLONGDOUBLE_FMT -#endif -#elif NPY_BITSOF_LONGDOUBLE == 64 -#ifndef NPY_FLOAT64 -#define NPY_FLOAT64 NPY_LONGDOUBLE -#define NPY_COMPLEX128 NPY_CLONGDOUBLE - typedef npy_longdouble npy_float64; - typedef npy_clongdouble npy_complex128; -# define PyFloat64ScalarObject PyLongDoubleScalarObject -# define PyComplex128ScalarObject PyCLongDoubleScalarObject -# define PyFloat64ArrType_Type PyLongDoubleArrType_Type -# define PyComplex128ArrType_Type PyCLongDoubleArrType_Type -#define NPY_FLOAT64_FMT NPY_LONGDOUBLE_FMT -#define NPY_COMPLEX128_FMT NPY_CLONGDOUBLE_FMT -#endif -#elif NPY_BITSOF_LONGDOUBLE == 80 -#ifndef NPY_FLOAT80 -#define NPY_FLOAT80 NPY_LONGDOUBLE -#define NPY_COMPLEX160 NPY_CLONGDOUBLE - typedef npy_longdouble npy_float80; - typedef npy_clongdouble npy_complex160; -# define PyFloat80ScalarObject PyLongDoubleScalarObject -# define PyComplex160ScalarObject PyCLongDoubleScalarObject -# define PyFloat80ArrType_Type PyLongDoubleArrType_Type -# define PyComplex160ArrType_Type PyCLongDoubleArrType_Type -#define NPY_FLOAT80_FMT NPY_LONGDOUBLE_FMT -#define NPY_COMPLEX160_FMT NPY_CLONGDOUBLE_FMT -#endif -#elif NPY_BITSOF_LONGDOUBLE == 96 -#ifndef NPY_FLOAT96 -#define NPY_FLOAT96 NPY_LONGDOUBLE -#define NPY_COMPLEX192 NPY_CLONGDOUBLE - typedef npy_longdouble npy_float96; - typedef npy_clongdouble npy_complex192; -# define PyFloat96ScalarObject PyLongDoubleScalarObject -# define PyComplex192ScalarObject PyCLongDoubleScalarObject -# define PyFloat96ArrType_Type PyLongDoubleArrType_Type -# define PyComplex192ArrType_Type PyCLongDoubleArrType_Type -#define NPY_FLOAT96_FMT NPY_LONGDOUBLE_FMT -#define NPY_COMPLEX192_FMT NPY_CLONGDOUBLE_FMT -#endif -#elif NPY_BITSOF_LONGDOUBLE == 128 -#ifndef NPY_FLOAT128 -#define NPY_FLOAT128 NPY_LONGDOUBLE -#define NPY_COMPLEX256 NPY_CLONGDOUBLE - typedef npy_longdouble npy_float128; - typedef npy_clongdouble npy_complex256; -# define PyFloat128ScalarObject PyLongDoubleScalarObject -# define PyComplex256ScalarObject PyCLongDoubleScalarObject -# define PyFloat128ArrType_Type PyLongDoubleArrType_Type -# define PyComplex256ArrType_Type PyCLongDoubleArrType_Type -#define NPY_FLOAT128_FMT NPY_LONGDOUBLE_FMT -#define NPY_COMPLEX256_FMT NPY_CLONGDOUBLE_FMT -#endif -#elif NPY_BITSOF_LONGDOUBLE == 256 -#define NPY_FLOAT256 NPY_LONGDOUBLE -#define NPY_COMPLEX512 NPY_CLONGDOUBLE - typedef npy_longdouble npy_float256; - typedef npy_clongdouble npy_complex512; -# define PyFloat256ScalarObject PyLongDoubleScalarObject -# define PyComplex512ScalarObject PyCLongDoubleScalarObject -# define 
PyFloat256ArrType_Type PyLongDoubleArrType_Type -# define PyComplex512ArrType_Type PyCLongDoubleArrType_Type -#define NPY_FLOAT256_FMT NPY_LONGDOUBLE_FMT -#define NPY_COMPLEX512_FMT NPY_CLONGDOUBLE_FMT -#endif - -/* datetime typedefs */ -typedef npy_int64 npy_timedelta; -typedef npy_int64 npy_datetime; -#define NPY_DATETIME_FMT NPY_INT64_FMT -#define NPY_TIMEDELTA_FMT NPY_INT64_FMT - -/* End of typedefs for numarray style bit-width names */ - -#endif - diff --git a/include/numpy/npy_cpu.h b/include/numpy/npy_cpu.h deleted file mode 100644 index 9707a7adf..000000000 --- a/include/numpy/npy_cpu.h +++ /dev/null @@ -1,109 +0,0 @@ -/* - * This set (target) cpu specific macros: - * - Possible values: - * NPY_CPU_X86 - * NPY_CPU_AMD64 - * NPY_CPU_PPC - * NPY_CPU_PPC64 - * NPY_CPU_SPARC - * NPY_CPU_S390 - * NPY_CPU_IA64 - * NPY_CPU_HPPA - * NPY_CPU_ALPHA - * NPY_CPU_ARMEL - * NPY_CPU_ARMEB - * NPY_CPU_SH_LE - * NPY_CPU_SH_BE - */ -#ifndef _NPY_CPUARCH_H_ -#define _NPY_CPUARCH_H_ - -#include "numpyconfig.h" - -#if defined( __i386__ ) || defined(i386) || defined(_M_IX86) - /* - * __i386__ is defined by gcc and Intel compiler on Linux, - * _M_IX86 by VS compiler, - * i386 by Sun compilers on opensolaris at least - */ - #define NPY_CPU_X86 -#elif defined(__x86_64__) || defined(__amd64__) || defined(__x86_64) || defined(_M_AMD64) - /* - * both __x86_64__ and __amd64__ are defined by gcc - * __x86_64 defined by sun compiler on opensolaris at least - * _M_AMD64 defined by MS compiler - */ - #define NPY_CPU_AMD64 -#elif defined(__ppc__) || defined(__powerpc__) || defined(_ARCH_PPC) - /* - * __ppc__ is defined by gcc, I remember having seen __powerpc__ once, - * but can't find it ATM - * _ARCH_PPC is used by at least gcc on AIX - */ - #define NPY_CPU_PPC -#elif defined(__ppc64__) - #define NPY_CPU_PPC64 -#elif defined(__sparc__) || defined(__sparc) - /* __sparc__ is defined by gcc and Forte (e.g. Sun) compilers */ - #define NPY_CPU_SPARC -#elif defined(__s390__) - #define NPY_CPU_S390 -#elif defined(__ia64) - #define NPY_CPU_IA64 -#elif defined(__hppa) - #define NPY_CPU_HPPA -#elif defined(__alpha__) - #define NPY_CPU_ALPHA -#elif defined(__arm__) && defined(__ARMEL__) - #define NPY_CPU_ARMEL -#elif defined(__arm__) && defined(__ARMEB__) - #define NPY_CPU_ARMEB -#elif defined(__sh__) && defined(__LITTLE_ENDIAN__) - #define NPY_CPU_SH_LE -#elif defined(__sh__) && defined(__BIG_ENDIAN__) - #define NPY_CPU_SH_BE -#elif defined(__MIPSEL__) - #define NPY_CPU_MIPSEL -#elif defined(__MIPSEB__) - #define NPY_CPU_MIPSEB -#elif defined(__aarch64__) - #define NPY_CPU_AARCH64 -#else - #error Unknown CPU, please report this to numpy maintainers with \ - information about your platform (OS, CPU and compiler) -#endif - -/* - This "white-lists" the architectures that we know don't require - pointer alignment. We white-list, since the memcpy version will - work everywhere, whereas assignment will only work where pointer - dereferencing doesn't require alignment. - - TODO: There may be more architectures we can white list. 
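A minimal sketch of how downstream code typically keys off the NPY_CPU_* macros defined above; describe_cpu is an illustrative name and only a few of the listed architectures are shown:

    #include <numpy/npy_cpu.h>

    /* Pick a human-readable label at compile time from the CPU macros. */
    static const char *
    describe_cpu(void)
    {
    #if defined(NPY_CPU_AMD64)
        return "x86-64";
    #elif defined(NPY_CPU_X86)
        return "32-bit x86";
    #elif defined(NPY_CPU_AARCH64)
        return "64-bit ARM";
    #else
        return "another supported architecture";
    #endif
    }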
-*/ -#if defined(NPY_CPU_X86) || defined(NPY_CPU_AMD64) - #define NPY_COPY_PYOBJECT_PTR(dst, src) (*((PyObject **)(dst)) = *((PyObject **)(src))) -#else - #if NPY_SIZEOF_PY_INTPTR_T == 4 - #define NPY_COPY_PYOBJECT_PTR(dst, src) \ - ((char*)(dst))[0] = ((char*)(src))[0]; \ - ((char*)(dst))[1] = ((char*)(src))[1]; \ - ((char*)(dst))[2] = ((char*)(src))[2]; \ - ((char*)(dst))[3] = ((char*)(src))[3]; - #elif NPY_SIZEOF_PY_INTPTR_T == 8 - #define NPY_COPY_PYOBJECT_PTR(dst, src) \ - ((char*)(dst))[0] = ((char*)(src))[0]; \ - ((char*)(dst))[1] = ((char*)(src))[1]; \ - ((char*)(dst))[2] = ((char*)(src))[2]; \ - ((char*)(dst))[3] = ((char*)(src))[3]; \ - ((char*)(dst))[4] = ((char*)(src))[4]; \ - ((char*)(dst))[5] = ((char*)(src))[5]; \ - ((char*)(dst))[6] = ((char*)(src))[6]; \ - ((char*)(dst))[7] = ((char*)(src))[7]; - #else - #error Unknown architecture, please report this to numpy maintainers with \ - information about your platform (OS, CPU and compiler) - #endif -#endif - -#endif diff --git a/include/numpy/npy_deprecated_api.h b/include/numpy/npy_deprecated_api.h deleted file mode 100644 index c27b4a4c9..000000000 --- a/include/numpy/npy_deprecated_api.h +++ /dev/null @@ -1,129 +0,0 @@ -#ifndef _NPY_DEPRECATED_API_H -#define _NPY_DEPRECATED_API_H - -#if defined(_WIN32) -#define _WARN___STR2__(x) #x -#define _WARN___STR1__(x) _WARN___STR2__(x) -#define _WARN___LOC__ __FILE__ "(" _WARN___STR1__(__LINE__) ") : Warning Msg: " -#pragma message(_WARN___LOC__"Using deprecated NumPy API, disable it by " \ - "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION") -#elif defined(__GNUC__) -#warning "Using deprecated NumPy API, disable it by #defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" -#endif -/* TODO: How to do this warning message for other compilers? */ - -/* - * This header exists to collect all dangerous/deprecated NumPy API. - * - * This is an attempt to remove bad API, the proliferation of macros, - * and namespace pollution currently produced by the NumPy headers. - */ - -#if defined(NPY_NO_DEPRECATED_API) -#error Should never include npy_deprecated_api directly. -#endif - -/* These array flags are deprecated as of NumPy 1.7 */ -#define NPY_CONTIGUOUS NPY_ARRAY_C_CONTIGUOUS -#define NPY_FORTRAN NPY_ARRAY_F_CONTIGUOUS - -/* - * The consistent NPY_ARRAY_* names which don't pollute the NPY_* - * namespace were added in NumPy 1.7. - * - * These versions of the carray flags are deprecated, but - * probably should only be removed after two releases instead of one. 
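As a rough illustration of why the byte-wise fallback above exists, a sketch of storing a PyObject pointer into possibly unaligned array memory; store_object is a hypothetical helper:

    #include <Python.h>
    #include <numpy/npy_cpu.h>

    /* Copy a PyObject* into a buffer that may not be pointer-aligned.
       On the whitelisted CPUs the macro is a plain assignment; elsewhere
       it expands to the byte-by-byte copy shown above. */
    static void
    store_object(char *unaligned_slot, PyObject *obj)
    {
        Py_INCREF(obj);
        NPY_COPY_PYOBJECT_PTR(unaligned_slot, &obj);
    }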
- */ -#define NPY_C_CONTIGUOUS NPY_ARRAY_C_CONTIGUOUS -#define NPY_F_CONTIGUOUS NPY_ARRAY_F_CONTIGUOUS -#define NPY_OWNDATA NPY_ARRAY_OWNDATA -#define NPY_FORCECAST NPY_ARRAY_FORCECAST -#define NPY_ENSURECOPY NPY_ARRAY_ENSURECOPY -#define NPY_ENSUREARRAY NPY_ARRAY_ENSUREARRAY -#define NPY_ELEMENTSTRIDES NPY_ARRAY_ELEMENTSTRIDES -#define NPY_ALIGNED NPY_ARRAY_ALIGNED -#define NPY_NOTSWAPPED NPY_ARRAY_NOTSWAPPED -#define NPY_WRITEABLE NPY_ARRAY_WRITEABLE -#define NPY_UPDATEIFCOPY NPY_ARRAY_UPDATEIFCOPY -#define NPY_BEHAVED NPY_ARRAY_BEHAVED -#define NPY_BEHAVED_NS NPY_ARRAY_BEHAVED_NS -#define NPY_CARRAY NPY_ARRAY_CARRAY -#define NPY_CARRAY_RO NPY_ARRAY_CARRAY_RO -#define NPY_FARRAY NPY_ARRAY_FARRAY -#define NPY_FARRAY_RO NPY_ARRAY_FARRAY_RO -#define NPY_DEFAULT NPY_ARRAY_DEFAULT -#define NPY_IN_ARRAY NPY_ARRAY_IN_ARRAY -#define NPY_OUT_ARRAY NPY_ARRAY_OUT_ARRAY -#define NPY_INOUT_ARRAY NPY_ARRAY_INOUT_ARRAY -#define NPY_IN_FARRAY NPY_ARRAY_IN_FARRAY -#define NPY_OUT_FARRAY NPY_ARRAY_OUT_FARRAY -#define NPY_INOUT_FARRAY NPY_ARRAY_INOUT_FARRAY -#define NPY_UPDATE_ALL NPY_ARRAY_UPDATE_ALL - -/* This way of accessing the default type is deprecated as of NumPy 1.7 */ -#define PyArray_DEFAULT NPY_DEFAULT_TYPE - -/* These DATETIME bits aren't used internally */ -#if PY_VERSION_HEX >= 0x03000000 -#define PyDataType_GetDatetimeMetaData(descr) \ - ((descr->metadata == NULL) ? NULL : \ - ((PyArray_DatetimeMetaData *)(PyCapsule_GetPointer( \ - PyDict_GetItemString( \ - descr->metadata, NPY_METADATA_DTSTR), NULL)))) -#else -#define PyDataType_GetDatetimeMetaData(descr) \ - ((descr->metadata == NULL) ? NULL : \ - ((PyArray_DatetimeMetaData *)(PyCObject_AsVoidPtr( \ - PyDict_GetItemString(descr->metadata, NPY_METADATA_DTSTR))))) -#endif - -/* - * Deprecated as of NumPy 1.7, this kind of shortcut doesn't - * belong in the public API. - */ -#define NPY_AO PyArrayObject - -/* - * Deprecated as of NumPy 1.7, an all-lowercase macro doesn't - * belong in the public API. - */ -#define fortran fortran_ - -/* - * Deprecated as of NumPy 1.7, as it is a namespace-polluting - * macro. - */ -#define FORTRAN_IF PyArray_FORTRAN_IF - -/* Deprecated as of NumPy 1.7, datetime64 uses c_metadata instead */ -#define NPY_METADATA_DTSTR "__timeunit__" - -/* - * Deprecated as of NumPy 1.7. - * The reasoning: - * - These are for datetime, but there's no datetime "namespace". - * - They just turn NPY_STR_ into "", which is just - * making something simple be indirected. - */ -#define NPY_STR_Y "Y" -#define NPY_STR_M "M" -#define NPY_STR_W "W" -#define NPY_STR_D "D" -#define NPY_STR_h "h" -#define NPY_STR_m "m" -#define NPY_STR_s "s" -#define NPY_STR_ms "ms" -#define NPY_STR_us "us" -#define NPY_STR_ns "ns" -#define NPY_STR_ps "ps" -#define NPY_STR_fs "fs" -#define NPY_STR_as "as" - -/* - * The macros in old_defines.h are Deprecated as of NumPy 1.7 and will be - * removed in the next major release. 
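A before/after sketch of what the renaming above means for extension code; as_c_double_array is an illustrative helper name, and both spellings request the same aligned, C-contiguous double array:

    #include <Python.h>
    #include <numpy/arrayobject.h>

    static PyObject *
    as_c_double_array(PyObject *obj)
    {
        /* Pre-1.7 spelling, kept compiling only by this compatibility header:
           return PyArray_FROM_OTF(obj, NPY_DOUBLE, NPY_IN_ARRAY);          */

        /* Namespaced spelling that also builds with NPY_NO_DEPRECATED_API: */
        return PyArray_FROM_OTF(obj, NPY_DOUBLE, NPY_ARRAY_IN_ARRAY);
    }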
- */ -#include "old_defines.h" - - -#endif diff --git a/include/numpy/npy_endian.h b/include/numpy/npy_endian.h deleted file mode 100644 index 4e3349ffe..000000000 --- a/include/numpy/npy_endian.h +++ /dev/null @@ -1,46 +0,0 @@ -#ifndef _NPY_ENDIAN_H_ -#define _NPY_ENDIAN_H_ - -/* - * NPY_BYTE_ORDER is set to the same value as BYTE_ORDER set by glibc in - * endian.h - */ - -#ifdef NPY_HAVE_ENDIAN_H - /* Use endian.h if available */ - #include - - #define NPY_BYTE_ORDER __BYTE_ORDER - #define NPY_LITTLE_ENDIAN __LITTLE_ENDIAN - #define NPY_BIG_ENDIAN __BIG_ENDIAN -#else - /* Set endianness info using target CPU */ - #include "npy_cpu.h" - - #define NPY_LITTLE_ENDIAN 1234 - #define NPY_BIG_ENDIAN 4321 - - #if defined(NPY_CPU_X86) \ - || defined(NPY_CPU_AMD64) \ - || defined(NPY_CPU_IA64) \ - || defined(NPY_CPU_ALPHA) \ - || defined(NPY_CPU_ARMEL) \ - || defined(NPY_CPU_AARCH64) \ - || defined(NPY_CPU_SH_LE) \ - || defined(NPY_CPU_MIPSEL) - #define NPY_BYTE_ORDER NPY_LITTLE_ENDIAN - #elif defined(NPY_CPU_PPC) \ - || defined(NPY_CPU_SPARC) \ - || defined(NPY_CPU_S390) \ - || defined(NPY_CPU_HPPA) \ - || defined(NPY_CPU_PPC64) \ - || defined(NPY_CPU_ARMEB) \ - || defined(NPY_CPU_SH_BE) \ - || defined(NPY_CPU_MIPSEB) - #define NPY_BYTE_ORDER NPY_BIG_ENDIAN - #else - #error Unknown CPU: can not set endianness - #endif -#endif - -#endif diff --git a/include/numpy/npy_interrupt.h b/include/numpy/npy_interrupt.h deleted file mode 100644 index f71fd689e..000000000 --- a/include/numpy/npy_interrupt.h +++ /dev/null @@ -1,117 +0,0 @@ - -/* Signal handling: - -This header file defines macros that allow your code to handle -interrupts received during processing. Interrupts that -could reasonably be handled: - -SIGINT, SIGABRT, SIGALRM, SIGSEGV - -****Warning*************** - -Do not allow code that creates temporary memory or increases reference -counts of Python objects to be interrupted unless you handle it -differently. - -************************** - -The mechanism for handling interrupts is conceptually simple: - - - replace the signal handler with our own home-grown version - and store the old one. - - run the code to be interrupted -- if an interrupt occurs - the handler should basically just cause a return to the - calling function for finish work. - - restore the old signal handler - -Of course, every code that allows interrupts must account for -returning via the interrupt and handle clean-up correctly. But, -even still, the simple paradigm is complicated by at least three -factors. - - 1) platform portability (i.e. Microsoft says not to use longjmp - to return from signal handling. They have a __try and __except - extension to C instead but what about mingw?). - - 2) how to handle threads: apparently whether signals are delivered to - every thread of the process or the "invoking" thread is platform - dependent. --- we don't handle threads for now. - - 3) do we need to worry about re-entrance. For now, assume the - code will not call-back into itself. - -Ideas: - - 1) Start by implementing an approach that works on platforms that - can use setjmp and longjmp functionality and does nothing - on other platforms. - - 2) Ignore threads --- i.e. do not mix interrupt handling and threads - - 3) Add a default signal_handler function to the C-API but have the rest - use macros. 
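A small sketch of consuming the byte-order macros from npy_endian.h above; machine_is_little_endian is a made-up name:

    #include <numpy/npy_endian.h>

    /* NPY_BYTE_ORDER is fixed at compile time, so this folds to a constant. */
    static int
    machine_is_little_endian(void)
    {
    #if NPY_BYTE_ORDER == NPY_LITTLE_ENDIAN
        return 1;
    #else
        return 0;
    #endif
    }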
- - -Simple Interface: - - -In your C-extension: around a block of code you want to be interruptable -with a SIGINT - -NPY_SIGINT_ON -[code] -NPY_SIGINT_OFF - -In order for this to work correctly, the -[code] block must not allocate any memory or alter the reference count of any -Python objects. In other words [code] must be interruptible so that continuation -after NPY_SIGINT_OFF will only be "missing some computations" - -Interrupt handling does not work well with threads. - -*/ - -/* Add signal handling macros - Make the global variable and signal handler part of the C-API -*/ - -#ifndef NPY_INTERRUPT_H -#define NPY_INTERRUPT_H - -#ifndef NPY_NO_SIGNAL - -#include -#include - -#ifndef sigsetjmp - -#define NPY_SIGSETJMP(arg1, arg2) setjmp(arg1) -#define NPY_SIGLONGJMP(arg1, arg2) longjmp(arg1, arg2) -#define NPY_SIGJMP_BUF jmp_buf - -#else - -#define NPY_SIGSETJMP(arg1, arg2) sigsetjmp(arg1, arg2) -#define NPY_SIGLONGJMP(arg1, arg2) siglongjmp(arg1, arg2) -#define NPY_SIGJMP_BUF sigjmp_buf - -#endif - -# define NPY_SIGINT_ON { \ - PyOS_sighandler_t _npy_sig_save; \ - _npy_sig_save = PyOS_setsig(SIGINT, _PyArray_SigintHandler); \ - if (NPY_SIGSETJMP(*((NPY_SIGJMP_BUF *)_PyArray_GetSigintBuf()), \ - 1) == 0) { \ - -# define NPY_SIGINT_OFF } \ - PyOS_setsig(SIGINT, _npy_sig_save); \ - } - -#else /* NPY_NO_SIGNAL */ - -#define NPY_SIGINT_ON -#define NPY_SIGINT_OFF - -#endif /* HAVE_SIGSETJMP */ - -#endif /* NPY_INTERRUPT_H */ diff --git a/include/numpy/npy_math.h b/include/numpy/npy_math.h deleted file mode 100644 index 7ae166e54..000000000 --- a/include/numpy/npy_math.h +++ /dev/null @@ -1,438 +0,0 @@ -#ifndef __NPY_MATH_C99_H_ -#define __NPY_MATH_C99_H_ - -#include -#ifdef __SUNPRO_CC -#include -#endif -#include - -/* - * NAN and INFINITY like macros (same behavior as glibc for NAN, same as C99 - * for INFINITY) - * - * XXX: I should test whether INFINITY and NAN are available on the platform - */ -NPY_INLINE static float __npy_inff(void) -{ - const union { npy_uint32 __i; float __f;} __bint = {0x7f800000UL}; - return __bint.__f; -} - -NPY_INLINE static float __npy_nanf(void) -{ - const union { npy_uint32 __i; float __f;} __bint = {0x7fc00000UL}; - return __bint.__f; -} - -NPY_INLINE static float __npy_pzerof(void) -{ - const union { npy_uint32 __i; float __f;} __bint = {0x00000000UL}; - return __bint.__f; -} - -NPY_INLINE static float __npy_nzerof(void) -{ - const union { npy_uint32 __i; float __f;} __bint = {0x80000000UL}; - return __bint.__f; -} - -#define NPY_INFINITYF __npy_inff() -#define NPY_NANF __npy_nanf() -#define NPY_PZEROF __npy_pzerof() -#define NPY_NZEROF __npy_nzerof() - -#define NPY_INFINITY ((npy_double)NPY_INFINITYF) -#define NPY_NAN ((npy_double)NPY_NANF) -#define NPY_PZERO ((npy_double)NPY_PZEROF) -#define NPY_NZERO ((npy_double)NPY_NZEROF) - -#define NPY_INFINITYL ((npy_longdouble)NPY_INFINITYF) -#define NPY_NANL ((npy_longdouble)NPY_NANF) -#define NPY_PZEROL ((npy_longdouble)NPY_PZEROF) -#define NPY_NZEROL ((npy_longdouble)NPY_NZEROF) - -/* - * Useful constants - */ -#define NPY_E 2.718281828459045235360287471352662498 /* e */ -#define NPY_LOG2E 1.442695040888963407359924681001892137 /* log_2 e */ -#define NPY_LOG10E 0.434294481903251827651128918916605082 /* log_10 e */ -#define NPY_LOGE2 0.693147180559945309417232121458176568 /* log_e 2 */ -#define NPY_LOGE10 2.302585092994045684017991454684364208 /* log_e 10 */ -#define NPY_PI 3.141592653589793238462643383279502884 /* pi */ -#define NPY_PI_2 1.570796326794896619231321691639751442 /* pi/2 */ -#define NPY_PI_4 
0.785398163397448309615660845819875721 /* pi/4 */ -#define NPY_1_PI 0.318309886183790671537767526745028724 /* 1/pi */ -#define NPY_2_PI 0.636619772367581343075535053490057448 /* 2/pi */ -#define NPY_EULER 0.577215664901532860606512090082402431 /* Euler constant */ -#define NPY_SQRT2 1.414213562373095048801688724209698079 /* sqrt(2) */ -#define NPY_SQRT1_2 0.707106781186547524400844362104849039 /* 1/sqrt(2) */ - -#define NPY_Ef 2.718281828459045235360287471352662498F /* e */ -#define NPY_LOG2Ef 1.442695040888963407359924681001892137F /* log_2 e */ -#define NPY_LOG10Ef 0.434294481903251827651128918916605082F /* log_10 e */ -#define NPY_LOGE2f 0.693147180559945309417232121458176568F /* log_e 2 */ -#define NPY_LOGE10f 2.302585092994045684017991454684364208F /* log_e 10 */ -#define NPY_PIf 3.141592653589793238462643383279502884F /* pi */ -#define NPY_PI_2f 1.570796326794896619231321691639751442F /* pi/2 */ -#define NPY_PI_4f 0.785398163397448309615660845819875721F /* pi/4 */ -#define NPY_1_PIf 0.318309886183790671537767526745028724F /* 1/pi */ -#define NPY_2_PIf 0.636619772367581343075535053490057448F /* 2/pi */ -#define NPY_EULERf 0.577215664901532860606512090082402431F /* Euler constan*/ -#define NPY_SQRT2f 1.414213562373095048801688724209698079F /* sqrt(2) */ -#define NPY_SQRT1_2f 0.707106781186547524400844362104849039F /* 1/sqrt(2) */ - -#define NPY_El 2.718281828459045235360287471352662498L /* e */ -#define NPY_LOG2El 1.442695040888963407359924681001892137L /* log_2 e */ -#define NPY_LOG10El 0.434294481903251827651128918916605082L /* log_10 e */ -#define NPY_LOGE2l 0.693147180559945309417232121458176568L /* log_e 2 */ -#define NPY_LOGE10l 2.302585092994045684017991454684364208L /* log_e 10 */ -#define NPY_PIl 3.141592653589793238462643383279502884L /* pi */ -#define NPY_PI_2l 1.570796326794896619231321691639751442L /* pi/2 */ -#define NPY_PI_4l 0.785398163397448309615660845819875721L /* pi/4 */ -#define NPY_1_PIl 0.318309886183790671537767526745028724L /* 1/pi */ -#define NPY_2_PIl 0.636619772367581343075535053490057448L /* 2/pi */ -#define NPY_EULERl 0.577215664901532860606512090082402431L /* Euler constan*/ -#define NPY_SQRT2l 1.414213562373095048801688724209698079L /* sqrt(2) */ -#define NPY_SQRT1_2l 0.707106781186547524400844362104849039L /* 1/sqrt(2) */ - -/* - * C99 double math funcs - */ -double npy_sin(double x); -double npy_cos(double x); -double npy_tan(double x); -double npy_sinh(double x); -double npy_cosh(double x); -double npy_tanh(double x); - -double npy_asin(double x); -double npy_acos(double x); -double npy_atan(double x); -double npy_aexp(double x); -double npy_alog(double x); -double npy_asqrt(double x); -double npy_afabs(double x); - -double npy_log(double x); -double npy_log10(double x); -double npy_exp(double x); -double npy_sqrt(double x); - -double npy_fabs(double x); -double npy_ceil(double x); -double npy_fmod(double x, double y); -double npy_floor(double x); - -double npy_expm1(double x); -double npy_log1p(double x); -double npy_hypot(double x, double y); -double npy_acosh(double x); -double npy_asinh(double xx); -double npy_atanh(double x); -double npy_rint(double x); -double npy_trunc(double x); -double npy_exp2(double x); -double npy_log2(double x); - -double npy_atan2(double x, double y); -double npy_pow(double x, double y); -double npy_modf(double x, double* y); - -double npy_copysign(double x, double y); -double npy_nextafter(double x, double y); -double npy_spacing(double x); - -/* - * IEEE 754 fpu handling. 
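A minimal sketch of using the constants and portable C99-style wrappers declared above, assuming the extension links against the bundled npymath library; angle_of is an illustrative name:

    #include <numpy/npy_math.h>

    /* Angle of the vector (x, y) in degrees, via the npy_* wrappers so the
       code does not depend on the platform's own hypot()/atan2(). */
    static double
    angle_of(double x, double y)
    {
        if (npy_hypot(x, y) == 0.0) {
            return NPY_NAN;    /* quiet NaN constant defined in this header */
        }
        return npy_atan2(y, x) * 180.0 / NPY_PI;
    }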
Those are guaranteed to be macros - */ -#ifndef NPY_HAVE_DECL_ISNAN - #define npy_isnan(x) ((x) != (x)) -#else - #ifdef _MSC_VER - #define npy_isnan(x) _isnan((x)) - #else - #define npy_isnan(x) isnan((x)) - #endif -#endif - -#ifndef NPY_HAVE_DECL_ISFINITE - #ifdef _MSC_VER - #define npy_isfinite(x) _finite((x)) - #else - #define npy_isfinite(x) !npy_isnan((x) + (-x)) - #endif -#else - #define npy_isfinite(x) isfinite((x)) -#endif - -#ifndef NPY_HAVE_DECL_ISINF - #define npy_isinf(x) (!npy_isfinite(x) && !npy_isnan(x)) -#else - #ifdef _MSC_VER - #define npy_isinf(x) (!_finite((x)) && !_isnan((x))) - #else - #define npy_isinf(x) isinf((x)) - #endif -#endif - -#ifndef NPY_HAVE_DECL_SIGNBIT - int _npy_signbit_f(float x); - int _npy_signbit_d(double x); - int _npy_signbit_ld(long double x); - #define npy_signbit(x) \ - (sizeof (x) == sizeof (long double) ? _npy_signbit_ld (x) \ - : sizeof (x) == sizeof (double) ? _npy_signbit_d (x) \ - : _npy_signbit_f (x)) -#else - #define npy_signbit(x) signbit((x)) -#endif - -/* - * float C99 math functions - */ - -float npy_sinf(float x); -float npy_cosf(float x); -float npy_tanf(float x); -float npy_sinhf(float x); -float npy_coshf(float x); -float npy_tanhf(float x); -float npy_fabsf(float x); -float npy_floorf(float x); -float npy_ceilf(float x); -float npy_rintf(float x); -float npy_truncf(float x); -float npy_sqrtf(float x); -float npy_log10f(float x); -float npy_logf(float x); -float npy_expf(float x); -float npy_expm1f(float x); -float npy_asinf(float x); -float npy_acosf(float x); -float npy_atanf(float x); -float npy_asinhf(float x); -float npy_acoshf(float x); -float npy_atanhf(float x); -float npy_log1pf(float x); -float npy_exp2f(float x); -float npy_log2f(float x); - -float npy_atan2f(float x, float y); -float npy_hypotf(float x, float y); -float npy_powf(float x, float y); -float npy_fmodf(float x, float y); - -float npy_modff(float x, float* y); - -float npy_copysignf(float x, float y); -float npy_nextafterf(float x, float y); -float npy_spacingf(float x); - -/* - * float C99 math functions - */ - -npy_longdouble npy_sinl(npy_longdouble x); -npy_longdouble npy_cosl(npy_longdouble x); -npy_longdouble npy_tanl(npy_longdouble x); -npy_longdouble npy_sinhl(npy_longdouble x); -npy_longdouble npy_coshl(npy_longdouble x); -npy_longdouble npy_tanhl(npy_longdouble x); -npy_longdouble npy_fabsl(npy_longdouble x); -npy_longdouble npy_floorl(npy_longdouble x); -npy_longdouble npy_ceill(npy_longdouble x); -npy_longdouble npy_rintl(npy_longdouble x); -npy_longdouble npy_truncl(npy_longdouble x); -npy_longdouble npy_sqrtl(npy_longdouble x); -npy_longdouble npy_log10l(npy_longdouble x); -npy_longdouble npy_logl(npy_longdouble x); -npy_longdouble npy_expl(npy_longdouble x); -npy_longdouble npy_expm1l(npy_longdouble x); -npy_longdouble npy_asinl(npy_longdouble x); -npy_longdouble npy_acosl(npy_longdouble x); -npy_longdouble npy_atanl(npy_longdouble x); -npy_longdouble npy_asinhl(npy_longdouble x); -npy_longdouble npy_acoshl(npy_longdouble x); -npy_longdouble npy_atanhl(npy_longdouble x); -npy_longdouble npy_log1pl(npy_longdouble x); -npy_longdouble npy_exp2l(npy_longdouble x); -npy_longdouble npy_log2l(npy_longdouble x); - -npy_longdouble npy_atan2l(npy_longdouble x, npy_longdouble y); -npy_longdouble npy_hypotl(npy_longdouble x, npy_longdouble y); -npy_longdouble npy_powl(npy_longdouble x, npy_longdouble y); -npy_longdouble npy_fmodl(npy_longdouble x, npy_longdouble y); - -npy_longdouble npy_modfl(npy_longdouble x, npy_longdouble* y); - -npy_longdouble 
npy_copysignl(npy_longdouble x, npy_longdouble y); -npy_longdouble npy_nextafterl(npy_longdouble x, npy_longdouble y); -npy_longdouble npy_spacingl(npy_longdouble x); - -/* - * Non standard functions - */ -double npy_deg2rad(double x); -double npy_rad2deg(double x); -double npy_logaddexp(double x, double y); -double npy_logaddexp2(double x, double y); - -float npy_deg2radf(float x); -float npy_rad2degf(float x); -float npy_logaddexpf(float x, float y); -float npy_logaddexp2f(float x, float y); - -npy_longdouble npy_deg2radl(npy_longdouble x); -npy_longdouble npy_rad2degl(npy_longdouble x); -npy_longdouble npy_logaddexpl(npy_longdouble x, npy_longdouble y); -npy_longdouble npy_logaddexp2l(npy_longdouble x, npy_longdouble y); - -#define npy_degrees npy_rad2deg -#define npy_degreesf npy_rad2degf -#define npy_degreesl npy_rad2degl - -#define npy_radians npy_deg2rad -#define npy_radiansf npy_deg2radf -#define npy_radiansl npy_deg2radl - -/* - * Complex declarations - */ - -/* - * C99 specifies that complex numbers have the same representation as - * an array of two elements, where the first element is the real part - * and the second element is the imaginary part. - */ -#define __NPY_CPACK_IMP(x, y, type, ctype) \ - union { \ - ctype z; \ - type a[2]; \ - } z1;; \ - \ - z1.a[0] = (x); \ - z1.a[1] = (y); \ - \ - return z1.z; - -static NPY_INLINE npy_cdouble npy_cpack(double x, double y) -{ - __NPY_CPACK_IMP(x, y, double, npy_cdouble); -} - -static NPY_INLINE npy_cfloat npy_cpackf(float x, float y) -{ - __NPY_CPACK_IMP(x, y, float, npy_cfloat); -} - -static NPY_INLINE npy_clongdouble npy_cpackl(npy_longdouble x, npy_longdouble y) -{ - __NPY_CPACK_IMP(x, y, npy_longdouble, npy_clongdouble); -} -#undef __NPY_CPACK_IMP - -/* - * Same remark as above, but in the other direction: extract first/second - * member of complex number, assuming a C99-compatible representation - * - * Those are defineds as static inline, and such as a reasonable compiler would - * most likely compile this to one or two instructions (on CISC at least) - */ -#define __NPY_CEXTRACT_IMP(z, index, type, ctype) \ - union { \ - ctype z; \ - type a[2]; \ - } __z_repr; \ - __z_repr.z = z; \ - \ - return __z_repr.a[index]; - -static NPY_INLINE double npy_creal(npy_cdouble z) -{ - __NPY_CEXTRACT_IMP(z, 0, double, npy_cdouble); -} - -static NPY_INLINE double npy_cimag(npy_cdouble z) -{ - __NPY_CEXTRACT_IMP(z, 1, double, npy_cdouble); -} - -static NPY_INLINE float npy_crealf(npy_cfloat z) -{ - __NPY_CEXTRACT_IMP(z, 0, float, npy_cfloat); -} - -static NPY_INLINE float npy_cimagf(npy_cfloat z) -{ - __NPY_CEXTRACT_IMP(z, 1, float, npy_cfloat); -} - -static NPY_INLINE npy_longdouble npy_creall(npy_clongdouble z) -{ - __NPY_CEXTRACT_IMP(z, 0, npy_longdouble, npy_clongdouble); -} - -static NPY_INLINE npy_longdouble npy_cimagl(npy_clongdouble z) -{ - __NPY_CEXTRACT_IMP(z, 1, npy_longdouble, npy_clongdouble); -} -#undef __NPY_CEXTRACT_IMP - -/* - * Double precision complex functions - */ -double npy_cabs(npy_cdouble z); -double npy_carg(npy_cdouble z); - -npy_cdouble npy_cexp(npy_cdouble z); -npy_cdouble npy_clog(npy_cdouble z); -npy_cdouble npy_cpow(npy_cdouble x, npy_cdouble y); - -npy_cdouble npy_csqrt(npy_cdouble z); - -npy_cdouble npy_ccos(npy_cdouble z); -npy_cdouble npy_csin(npy_cdouble z); - -/* - * Single precision complex functions - */ -float npy_cabsf(npy_cfloat z); -float npy_cargf(npy_cfloat z); - -npy_cfloat npy_cexpf(npy_cfloat z); -npy_cfloat npy_clogf(npy_cfloat z); -npy_cfloat npy_cpowf(npy_cfloat x, npy_cfloat y); - 
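A short sketch built on the pack/extract helpers defined earlier in this header; scale_complex is an illustrative name, and npy_cpack/npy_creal/npy_cimag are the static inline helpers shown above, so they need no extra linking:

    #include <numpy/npy_math.h>

    /* Multiply a complex value by a real factor using the C99-layout helpers. */
    static npy_cdouble
    scale_complex(npy_cdouble z, double factor)
    {
        double re = npy_creal(z);
        double im = npy_cimag(z);
        return npy_cpack(re * factor, im * factor);
    }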
-npy_cfloat npy_csqrtf(npy_cfloat z); - -npy_cfloat npy_ccosf(npy_cfloat z); -npy_cfloat npy_csinf(npy_cfloat z); - -/* - * Extended precision complex functions - */ -npy_longdouble npy_cabsl(npy_clongdouble z); -npy_longdouble npy_cargl(npy_clongdouble z); - -npy_clongdouble npy_cexpl(npy_clongdouble z); -npy_clongdouble npy_clogl(npy_clongdouble z); -npy_clongdouble npy_cpowl(npy_clongdouble x, npy_clongdouble y); - -npy_clongdouble npy_csqrtl(npy_clongdouble z); - -npy_clongdouble npy_ccosl(npy_clongdouble z); -npy_clongdouble npy_csinl(npy_clongdouble z); - -/* - * Functions that set the floating point error - * status word. - */ - -void npy_set_floatstatus_divbyzero(void); -void npy_set_floatstatus_overflow(void); -void npy_set_floatstatus_underflow(void); -void npy_set_floatstatus_invalid(void); - -#endif diff --git a/include/numpy/npy_no_deprecated_api.h b/include/numpy/npy_no_deprecated_api.h deleted file mode 100644 index 6183dc278..000000000 --- a/include/numpy/npy_no_deprecated_api.h +++ /dev/null @@ -1,19 +0,0 @@ -/* - * This include file is provided for inclusion in Cython *.pyd files where - * one would like to define the NPY_NO_DEPRECATED_API macro. It can be - * included by - * - * cdef extern from "npy_no_deprecated_api.h": pass - * - */ -#ifndef NPY_NO_DEPRECATED_API - -/* put this check here since there may be multiple includes in C extensions. */ -#if defined(NDARRAYTYPES_H) || defined(_NPY_DEPRECATED_API_H) || \ - defined(OLD_DEFINES_H) -#error "npy_no_deprecated_api.h" must be first among numpy includes. -#else -#define NPY_NO_DEPRECATED_API NPY_API_VERSION -#endif - -#endif diff --git a/include/numpy/npy_os.h b/include/numpy/npy_os.h deleted file mode 100644 index 9228c3916..000000000 --- a/include/numpy/npy_os.h +++ /dev/null @@ -1,30 +0,0 @@ -#ifndef _NPY_OS_H_ -#define _NPY_OS_H_ - -#if defined(linux) || defined(__linux) || defined(__linux__) - #define NPY_OS_LINUX -#elif defined(__FreeBSD__) || defined(__NetBSD__) || \ - defined(__OpenBSD__) || defined(__DragonFly__) - #define NPY_OS_BSD - #ifdef __FreeBSD__ - #define NPY_OS_FREEBSD - #elif defined(__NetBSD__) - #define NPY_OS_NETBSD - #elif defined(__OpenBSD__) - #define NPY_OS_OPENBSD - #elif defined(__DragonFly__) - #define NPY_OS_DRAGONFLY - #endif -#elif defined(sun) || defined(__sun) - #define NPY_OS_SOLARIS -#elif defined(__CYGWIN__) - #define NPY_OS_CYGWIN -#elif defined(_WIN32) || defined(__WIN32__) || defined(WIN32) - #define NPY_OS_WIN32 -#elif defined(__APPLE__) - #define NPY_OS_DARWIN -#else - #define NPY_OS_UNKNOWN -#endif - -#endif diff --git a/include/numpy/numpyconfig.h b/include/numpy/numpyconfig.h deleted file mode 100644 index 401d19fd7..000000000 --- a/include/numpy/numpyconfig.h +++ /dev/null @@ -1,33 +0,0 @@ -#ifndef _NPY_NUMPYCONFIG_H_ -#define _NPY_NUMPYCONFIG_H_ - -#include "_numpyconfig.h" - -/* - * On Mac OS X, because there is only one configuration stage for all the archs - * in universal builds, any macro which depends on the arch needs to be - * harcoded - */ -#ifdef __APPLE__ - #undef NPY_SIZEOF_LONG - #undef NPY_SIZEOF_PY_INTPTR_T - - #ifdef __LP64__ - #define NPY_SIZEOF_LONG 8 - #define NPY_SIZEOF_PY_INTPTR_T 8 - #else - #define NPY_SIZEOF_LONG 4 - #define NPY_SIZEOF_PY_INTPTR_T 4 - #endif -#endif - -/** - * To help with the NPY_NO_DEPRECATED_API macro, we include API version - * numbers for specific versions of NumPy. 
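A sketch of the opt-in that this header automates for Cython users, written out for a plain C extension; the define has to appear before the first NumPy include in the translation unit:

    #include <Python.h>

    #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
    #include <numpy/arrayobject.h>

    /* From here on, only the namespaced NPY_ARRAY_* style names are visible. */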
To exclude all API that was - * deprecated as of 1.7, add the following before #including any NumPy - * headers: - * #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION - */ -#define NPY_1_7_API_VERSION 0x00000007 - -#endif diff --git a/include/numpy/old_defines.h b/include/numpy/old_defines.h deleted file mode 100644 index abf81595a..000000000 --- a/include/numpy/old_defines.h +++ /dev/null @@ -1,187 +0,0 @@ -/* This header is deprecated as of NumPy 1.7 */ -#ifndef OLD_DEFINES_H -#define OLD_DEFINES_H - -#if defined(NPY_NO_DEPRECATED_API) && NPY_NO_DEPRECATED_API >= NPY_1_7_API_VERSION -#error The header "old_defines.h" is deprecated as of NumPy 1.7. -#endif - -#define NDARRAY_VERSION NPY_VERSION - -#define PyArray_MIN_BUFSIZE NPY_MIN_BUFSIZE -#define PyArray_MAX_BUFSIZE NPY_MAX_BUFSIZE -#define PyArray_BUFSIZE NPY_BUFSIZE - -#define PyArray_PRIORITY NPY_PRIORITY -#define PyArray_SUBTYPE_PRIORITY NPY_PRIORITY -#define PyArray_NUM_FLOATTYPE NPY_NUM_FLOATTYPE - -#define NPY_MAX PyArray_MAX -#define NPY_MIN PyArray_MIN - -#define PyArray_TYPES NPY_TYPES -#define PyArray_BOOL NPY_BOOL -#define PyArray_BYTE NPY_BYTE -#define PyArray_UBYTE NPY_UBYTE -#define PyArray_SHORT NPY_SHORT -#define PyArray_USHORT NPY_USHORT -#define PyArray_INT NPY_INT -#define PyArray_UINT NPY_UINT -#define PyArray_LONG NPY_LONG -#define PyArray_ULONG NPY_ULONG -#define PyArray_LONGLONG NPY_LONGLONG -#define PyArray_ULONGLONG NPY_ULONGLONG -#define PyArray_HALF NPY_HALF -#define PyArray_FLOAT NPY_FLOAT -#define PyArray_DOUBLE NPY_DOUBLE -#define PyArray_LONGDOUBLE NPY_LONGDOUBLE -#define PyArray_CFLOAT NPY_CFLOAT -#define PyArray_CDOUBLE NPY_CDOUBLE -#define PyArray_CLONGDOUBLE NPY_CLONGDOUBLE -#define PyArray_OBJECT NPY_OBJECT -#define PyArray_STRING NPY_STRING -#define PyArray_UNICODE NPY_UNICODE -#define PyArray_VOID NPY_VOID -#define PyArray_DATETIME NPY_DATETIME -#define PyArray_TIMEDELTA NPY_TIMEDELTA -#define PyArray_NTYPES NPY_NTYPES -#define PyArray_NOTYPE NPY_NOTYPE -#define PyArray_CHAR NPY_CHAR -#define PyArray_USERDEF NPY_USERDEF -#define PyArray_NUMUSERTYPES NPY_NUMUSERTYPES - -#define PyArray_INTP NPY_INTP -#define PyArray_UINTP NPY_UINTP - -#define PyArray_INT8 NPY_INT8 -#define PyArray_UINT8 NPY_UINT8 -#define PyArray_INT16 NPY_INT16 -#define PyArray_UINT16 NPY_UINT16 -#define PyArray_INT32 NPY_INT32 -#define PyArray_UINT32 NPY_UINT32 - -#ifdef NPY_INT64 -#define PyArray_INT64 NPY_INT64 -#define PyArray_UINT64 NPY_UINT64 -#endif - -#ifdef NPY_INT128 -#define PyArray_INT128 NPY_INT128 -#define PyArray_UINT128 NPY_UINT128 -#endif - -#ifdef NPY_FLOAT16 -#define PyArray_FLOAT16 NPY_FLOAT16 -#define PyArray_COMPLEX32 NPY_COMPLEX32 -#endif - -#ifdef NPY_FLOAT80 -#define PyArray_FLOAT80 NPY_FLOAT80 -#define PyArray_COMPLEX160 NPY_COMPLEX160 -#endif - -#ifdef NPY_FLOAT96 -#define PyArray_FLOAT96 NPY_FLOAT96 -#define PyArray_COMPLEX192 NPY_COMPLEX192 -#endif - -#ifdef NPY_FLOAT128 -#define PyArray_FLOAT128 NPY_FLOAT128 -#define PyArray_COMPLEX256 NPY_COMPLEX256 -#endif - -#define PyArray_FLOAT32 NPY_FLOAT32 -#define PyArray_COMPLEX64 NPY_COMPLEX64 -#define PyArray_FLOAT64 NPY_FLOAT64 -#define PyArray_COMPLEX128 NPY_COMPLEX128 - - -#define PyArray_TYPECHAR NPY_TYPECHAR -#define PyArray_BOOLLTR NPY_BOOLLTR -#define PyArray_BYTELTR NPY_BYTELTR -#define PyArray_UBYTELTR NPY_UBYTELTR -#define PyArray_SHORTLTR NPY_SHORTLTR -#define PyArray_USHORTLTR NPY_USHORTLTR -#define PyArray_INTLTR NPY_INTLTR -#define PyArray_UINTLTR NPY_UINTLTR -#define PyArray_LONGLTR NPY_LONGLTR -#define PyArray_ULONGLTR NPY_ULONGLTR -#define 
PyArray_LONGLONGLTR NPY_LONGLONGLTR -#define PyArray_ULONGLONGLTR NPY_ULONGLONGLTR -#define PyArray_HALFLTR NPY_HALFLTR -#define PyArray_FLOATLTR NPY_FLOATLTR -#define PyArray_DOUBLELTR NPY_DOUBLELTR -#define PyArray_LONGDOUBLELTR NPY_LONGDOUBLELTR -#define PyArray_CFLOATLTR NPY_CFLOATLTR -#define PyArray_CDOUBLELTR NPY_CDOUBLELTR -#define PyArray_CLONGDOUBLELTR NPY_CLONGDOUBLELTR -#define PyArray_OBJECTLTR NPY_OBJECTLTR -#define PyArray_STRINGLTR NPY_STRINGLTR -#define PyArray_STRINGLTR2 NPY_STRINGLTR2 -#define PyArray_UNICODELTR NPY_UNICODELTR -#define PyArray_VOIDLTR NPY_VOIDLTR -#define PyArray_DATETIMELTR NPY_DATETIMELTR -#define PyArray_TIMEDELTALTR NPY_TIMEDELTALTR -#define PyArray_CHARLTR NPY_CHARLTR -#define PyArray_INTPLTR NPY_INTPLTR -#define PyArray_UINTPLTR NPY_UINTPLTR -#define PyArray_GENBOOLLTR NPY_GENBOOLLTR -#define PyArray_SIGNEDLTR NPY_SIGNEDLTR -#define PyArray_UNSIGNEDLTR NPY_UNSIGNEDLTR -#define PyArray_FLOATINGLTR NPY_FLOATINGLTR -#define PyArray_COMPLEXLTR NPY_COMPLEXLTR - -#define PyArray_QUICKSORT NPY_QUICKSORT -#define PyArray_HEAPSORT NPY_HEAPSORT -#define PyArray_MERGESORT NPY_MERGESORT -#define PyArray_SORTKIND NPY_SORTKIND -#define PyArray_NSORTS NPY_NSORTS - -#define PyArray_NOSCALAR NPY_NOSCALAR -#define PyArray_BOOL_SCALAR NPY_BOOL_SCALAR -#define PyArray_INTPOS_SCALAR NPY_INTPOS_SCALAR -#define PyArray_INTNEG_SCALAR NPY_INTNEG_SCALAR -#define PyArray_FLOAT_SCALAR NPY_FLOAT_SCALAR -#define PyArray_COMPLEX_SCALAR NPY_COMPLEX_SCALAR -#define PyArray_OBJECT_SCALAR NPY_OBJECT_SCALAR -#define PyArray_SCALARKIND NPY_SCALARKIND -#define PyArray_NSCALARKINDS NPY_NSCALARKINDS - -#define PyArray_ANYORDER NPY_ANYORDER -#define PyArray_CORDER NPY_CORDER -#define PyArray_FORTRANORDER NPY_FORTRANORDER -#define PyArray_ORDER NPY_ORDER - -#define PyDescr_ISBOOL PyDataType_ISBOOL -#define PyDescr_ISUNSIGNED PyDataType_ISUNSIGNED -#define PyDescr_ISSIGNED PyDataType_ISSIGNED -#define PyDescr_ISINTEGER PyDataType_ISINTEGER -#define PyDescr_ISFLOAT PyDataType_ISFLOAT -#define PyDescr_ISNUMBER PyDataType_ISNUMBER -#define PyDescr_ISSTRING PyDataType_ISSTRING -#define PyDescr_ISCOMPLEX PyDataType_ISCOMPLEX -#define PyDescr_ISPYTHON PyDataType_ISPYTHON -#define PyDescr_ISFLEXIBLE PyDataType_ISFLEXIBLE -#define PyDescr_ISUSERDEF PyDataType_ISUSERDEF -#define PyDescr_ISEXTENDED PyDataType_ISEXTENDED -#define PyDescr_ISOBJECT PyDataType_ISOBJECT -#define PyDescr_HASFIELDS PyDataType_HASFIELDS - -#define PyArray_LITTLE NPY_LITTLE -#define PyArray_BIG NPY_BIG -#define PyArray_NATIVE NPY_NATIVE -#define PyArray_SWAP NPY_SWAP -#define PyArray_IGNORE NPY_IGNORE - -#define PyArray_NATBYTE NPY_NATBYTE -#define PyArray_OPPBYTE NPY_OPPBYTE - -#define PyArray_MAX_ELSIZE NPY_MAX_ELSIZE - -#define PyArray_USE_PYMEM NPY_USE_PYMEM - -#define PyArray_RemoveLargest PyArray_RemoveSmallest - -#define PyArray_UCS4 npy_ucs4 - -#endif diff --git a/include/numpy/oldnumeric.h b/include/numpy/oldnumeric.h deleted file mode 100644 index 748f06da3..000000000 --- a/include/numpy/oldnumeric.h +++ /dev/null @@ -1,23 +0,0 @@ -#include "arrayobject.h" - -#ifndef REFCOUNT -# define REFCOUNT NPY_REFCOUNT -# define MAX_ELSIZE 16 -#endif - -#define PyArray_UNSIGNED_TYPES -#define PyArray_SBYTE NPY_BYTE -#define PyArray_CopyArray PyArray_CopyInto -#define _PyArray_multiply_list PyArray_MultiplyIntList -#define PyArray_ISSPACESAVER(m) NPY_FALSE -#define PyScalarArray_Check PyArray_CheckScalar - -#define CONTIGUOUS NPY_CONTIGUOUS -#define OWN_DIMENSIONS 0 -#define OWN_STRIDES 0 -#define OWN_DATA NPY_OWNDATA -#define 
SAVESPACE 0 -#define SAVESPACEBIT 0 - -#undef import_array -#define import_array() { if (_import_array() < 0) {PyErr_Print(); PyErr_SetString(PyExc_ImportError, "numpy.core.multiarray failed to import"); } } diff --git a/include/numpy/ufunc_api.txt b/include/numpy/ufunc_api.txt deleted file mode 100644 index 3365433cd..000000000 --- a/include/numpy/ufunc_api.txt +++ /dev/null @@ -1,312 +0,0 @@ - -================= -Numpy Ufunc C-API -================= -:: - - PyObject * - PyUFunc_FromFuncAndData(PyUFuncGenericFunction *func, void - **data, char *types, int ntypes, int nin, int - nout, int identity, char *name, char *doc, int - check_return) - - -:: - - int - PyUFunc_RegisterLoopForType(PyUFuncObject *ufunc, int - usertype, PyUFuncGenericFunction - function, int *arg_types, void *data) - - -:: - - int - PyUFunc_GenericFunction(PyUFuncObject *ufunc, PyObject *args, PyObject - *kwds, PyArrayObject **op) - - -This generic function is called with the ufunc object, the arguments to it, -and an array of (pointers to) PyArrayObjects which are NULL. - -'op' is an array of at least NPY_MAXARGS PyArrayObject *. - -:: - - void - PyUFunc_f_f_As_d_d(char **args, npy_intp *dimensions, npy_intp - *steps, void *func) - - -:: - - void - PyUFunc_d_d(char **args, npy_intp *dimensions, npy_intp *steps, void - *func) - - -:: - - void - PyUFunc_f_f(char **args, npy_intp *dimensions, npy_intp *steps, void - *func) - - -:: - - void - PyUFunc_g_g(char **args, npy_intp *dimensions, npy_intp *steps, void - *func) - - -:: - - void - PyUFunc_F_F_As_D_D(char **args, npy_intp *dimensions, npy_intp - *steps, void *func) - - -:: - - void - PyUFunc_F_F(char **args, npy_intp *dimensions, npy_intp *steps, void - *func) - - -:: - - void - PyUFunc_D_D(char **args, npy_intp *dimensions, npy_intp *steps, void - *func) - - -:: - - void - PyUFunc_G_G(char **args, npy_intp *dimensions, npy_intp *steps, void - *func) - - -:: - - void - PyUFunc_O_O(char **args, npy_intp *dimensions, npy_intp *steps, void - *func) - - -:: - - void - PyUFunc_ff_f_As_dd_d(char **args, npy_intp *dimensions, npy_intp - *steps, void *func) - - -:: - - void - PyUFunc_ff_f(char **args, npy_intp *dimensions, npy_intp *steps, void - *func) - - -:: - - void - PyUFunc_dd_d(char **args, npy_intp *dimensions, npy_intp *steps, void - *func) - - -:: - - void - PyUFunc_gg_g(char **args, npy_intp *dimensions, npy_intp *steps, void - *func) - - -:: - - void - PyUFunc_FF_F_As_DD_D(char **args, npy_intp *dimensions, npy_intp - *steps, void *func) - - -:: - - void - PyUFunc_DD_D(char **args, npy_intp *dimensions, npy_intp *steps, void - *func) - - -:: - - void - PyUFunc_FF_F(char **args, npy_intp *dimensions, npy_intp *steps, void - *func) - - -:: - - void - PyUFunc_GG_G(char **args, npy_intp *dimensions, npy_intp *steps, void - *func) - - -:: - - void - PyUFunc_OO_O(char **args, npy_intp *dimensions, npy_intp *steps, void - *func) - - -:: - - void - PyUFunc_O_O_method(char **args, npy_intp *dimensions, npy_intp - *steps, void *func) - - -:: - - void - PyUFunc_OO_O_method(char **args, npy_intp *dimensions, npy_intp - *steps, void *func) - - -:: - - void - PyUFunc_On_Om(char **args, npy_intp *dimensions, npy_intp *steps, void - *func) - - -:: - - int - PyUFunc_GetPyValues(char *name, int *bufsize, int *errmask, PyObject - **errobj) - - -On return, if errobj is populated with a non-NULL value, the caller -owns a new reference to errobj. 
- -:: - - int - PyUFunc_checkfperr(int errmask, PyObject *errobj, int *first) - - -:: - - void - PyUFunc_clearfperr() - - -:: - - int - PyUFunc_getfperr(void ) - - -:: - - int - PyUFunc_handlefperr(int errmask, PyObject *errobj, int retstatus, int - *first) - - -:: - - int - PyUFunc_ReplaceLoopBySignature(PyUFuncObject - *func, PyUFuncGenericFunction - newfunc, int - *signature, PyUFuncGenericFunction - *oldfunc) - - -:: - - PyObject * - PyUFunc_FromFuncAndDataAndSignature(PyUFuncGenericFunction *func, void - **data, char *types, int - ntypes, int nin, int nout, int - identity, char *name, char - *doc, int check_return, const char - *signature) - - -:: - - int - PyUFunc_SetUsesArraysAsData(void **data, size_t i) - - -:: - - void - PyUFunc_e_e(char **args, npy_intp *dimensions, npy_intp *steps, void - *func) - - -:: - - void - PyUFunc_e_e_As_f_f(char **args, npy_intp *dimensions, npy_intp - *steps, void *func) - - -:: - - void - PyUFunc_e_e_As_d_d(char **args, npy_intp *dimensions, npy_intp - *steps, void *func) - - -:: - - void - PyUFunc_ee_e(char **args, npy_intp *dimensions, npy_intp *steps, void - *func) - - -:: - - void - PyUFunc_ee_e_As_ff_f(char **args, npy_intp *dimensions, npy_intp - *steps, void *func) - - -:: - - void - PyUFunc_ee_e_As_dd_d(char **args, npy_intp *dimensions, npy_intp - *steps, void *func) - - -:: - - int - PyUFunc_DefaultTypeResolver(PyUFuncObject *ufunc, NPY_CASTING - casting, PyArrayObject - **operands, PyObject - *type_tup, PyArray_Descr **out_dtypes) - - -This function applies the default type resolution rules -for the provided ufunc. - -Returns 0 on success, -1 on error. - -:: - - int - PyUFunc_ValidateCasting(PyUFuncObject *ufunc, NPY_CASTING - casting, PyArrayObject - **operands, PyArray_Descr **dtypes) - - -Validates that the input operands can be cast to -the input types, and the output types can be cast to -the output operands where provided. - -Returns 0 on success, -1 (with exception raised) on validation failure. - diff --git a/include/numpy/ufuncobject.h b/include/numpy/ufuncobject.h deleted file mode 100644 index 95afd5aa2..000000000 --- a/include/numpy/ufuncobject.h +++ /dev/null @@ -1,446 +0,0 @@ -#ifndef Py_UFUNCOBJECT_H -#define Py_UFUNCOBJECT_H - -#include - -#ifdef __cplusplus -extern "C" { -#endif - -/* - * The legacy generic inner loop for a standard element-wise or - * generalized ufunc. - */ -typedef void (*PyUFuncGenericFunction) - (char **args, - npy_intp *dimensions, - npy_intp *strides, - void *innerloopdata); - -/* - * The most generic one-dimensional inner loop for - * a standard element-wise ufunc. This typedef is also - * more consistent with the other NumPy function pointer typedefs - * than PyUFuncGenericFunction. - */ -typedef void (PyUFunc_StridedInnerLoopFunc)( - char **dataptrs, npy_intp *strides, - npy_intp count, - NpyAuxData *innerloopdata); - -/* - * The most generic one-dimensional inner loop for - * a masked standard element-wise ufunc. "Masked" here means that it skips - * doing calculations on any items for which the maskptr array has a true - * value. - */ -typedef void (PyUFunc_MaskedStridedInnerLoopFunc)( - char **dataptrs, npy_intp *strides, - char *maskptr, npy_intp mask_stride, - npy_intp count, - NpyAuxData *innerloopdata); - -/* Forward declaration for the type resolver and loop selector typedefs */ -struct _tagPyUFuncObject; - -/* - * Given the operands for calling a ufunc, should determine the - * calculation input and output data types and return an inner loop function. 
- * This function should validate that the casting rule is being followed, - * and fail if it is not. - * - * For backwards compatibility, the regular type resolution function does not - * support auxiliary data with object semantics. The type resolution call - * which returns a masked generic function returns a standard NpyAuxData - * object, for which the NPY_AUXDATA_FREE and NPY_AUXDATA_CLONE macros - * work. - * - * ufunc: The ufunc object. - * casting: The 'casting' parameter provided to the ufunc. - * operands: An array of length (ufunc->nin + ufunc->nout), - * with the output parameters possibly NULL. - * type_tup: Either NULL, or the type_tup passed to the ufunc. - * out_dtypes: An array which should be populated with new - * references to (ufunc->nin + ufunc->nout) new - * dtypes, one for each input and output. These - * dtypes should all be in native-endian format. - * - * Should return 0 on success, -1 on failure (with exception set), - * or -2 if Py_NotImplemented should be returned. - */ -typedef int (PyUFunc_TypeResolutionFunc)( - struct _tagPyUFuncObject *ufunc, - NPY_CASTING casting, - PyArrayObject **operands, - PyObject *type_tup, - PyArray_Descr **out_dtypes); - -/* - * Given an array of DTypes as returned by the PyUFunc_TypeResolutionFunc, - * and an array of fixed strides (the array will contain NPY_MAX_INTP for - * strides which are not necessarily fixed), returns an inner loop - * with associated auxiliary data. - * - * For backwards compatibility, there is a variant of the inner loop - * selection which returns an inner loop irrespective of the strides, - * and with a void* static auxiliary data instead of an NpyAuxData * - * dynamically allocatable auxiliary data. - * - * ufunc: The ufunc object. - * dtypes: An array which has been populated with dtypes, - * in most cases by the type resolution funciton - * for the same ufunc. - * fixed_strides: For each input/output, either the stride that - * will be used every time the function is called - * or NPY_MAX_INTP if the stride might change or - * is not known ahead of time. The loop selection - * function may use this stride to pick inner loops - * which are optimized for contiguous or 0-stride - * cases. - * out_innerloop: Should be populated with the correct ufunc inner - * loop for the given type. - * out_innerloopdata: Should be populated with the void* data to - * be passed into the out_innerloop function. - * out_needs_api: If the inner loop needs to use the Python API, - * should set the to 1, otherwise should leave - * this untouched. - */ -typedef int (PyUFunc_LegacyInnerLoopSelectionFunc)( - struct _tagPyUFuncObject *ufunc, - PyArray_Descr **dtypes, - PyUFuncGenericFunction *out_innerloop, - void **out_innerloopdata, - int *out_needs_api); -typedef int (PyUFunc_InnerLoopSelectionFunc)( - struct _tagPyUFuncObject *ufunc, - PyArray_Descr **dtypes, - npy_intp *fixed_strides, - PyUFunc_StridedInnerLoopFunc **out_innerloop, - NpyAuxData **out_innerloopdata, - int *out_needs_api); -typedef int (PyUFunc_MaskedInnerLoopSelectionFunc)( - struct _tagPyUFuncObject *ufunc, - PyArray_Descr **dtypes, - PyArray_Descr *mask_dtype, - npy_intp *fixed_strides, - npy_intp fixed_mask_stride, - PyUFunc_MaskedStridedInnerLoopFunc **out_innerloop, - NpyAuxData **out_innerloopdata, - int *out_needs_api); - -typedef struct _tagPyUFuncObject { - PyObject_HEAD - /* - * nin: Number of inputs - * nout: Number of outputs - * nargs: Always nin + nout (Why is it stored?) 
- */ - int nin, nout, nargs; - - /* Identity for reduction, either PyUFunc_One or PyUFunc_Zero */ - int identity; - - /* Array of one-dimensional core loops */ - PyUFuncGenericFunction *functions; - /* Array of funcdata that gets passed into the functions */ - void **data; - /* The number of elements in 'functions' and 'data' */ - int ntypes; - - /* Does not appear to be used */ - int check_return; - - /* The name of the ufunc */ - char *name; - - /* Array of type numbers, of size ('nargs' * 'ntypes') */ - char *types; - - /* Documentation string */ - char *doc; - - void *ptr; - PyObject *obj; - PyObject *userloops; - - /* generalized ufunc parameters */ - - /* 0 for scalar ufunc; 1 for generalized ufunc */ - int core_enabled; - /* number of distinct dimension names in signature */ - int core_num_dim_ix; - - /* - * dimension indices of input/output argument k are stored in - * core_dim_ixs[core_offsets[k]..core_offsets[k]+core_num_dims[k]-1] - */ - - /* numbers of core dimensions of each argument */ - int *core_num_dims; - /* - * dimension indices in a flatted form; indices - * are in the range of [0,core_num_dim_ix) - */ - int *core_dim_ixs; - /* - * positions of 1st core dimensions of each - * argument in core_dim_ixs - */ - int *core_offsets; - /* signature string for printing purpose */ - char *core_signature; - - /* - * A function which resolves the types and fills an array - * with the dtypes for the inputs and outputs. - */ - PyUFunc_TypeResolutionFunc *type_resolver; - /* - * A function which returns an inner loop written for - * NumPy 1.6 and earlier ufuncs. This is for backwards - * compatibility, and may be NULL if inner_loop_selector - * is specified. - */ - PyUFunc_LegacyInnerLoopSelectionFunc *legacy_inner_loop_selector; - /* - * A function which returns an inner loop for the new mechanism - * in NumPy 1.7 and later. If provided, this is used, otherwise - * if NULL the legacy_inner_loop_selector is used instead. - */ - PyUFunc_InnerLoopSelectionFunc *inner_loop_selector; - /* - * A function which returns a masked inner loop for the ufunc. 
- */ - PyUFunc_MaskedInnerLoopSelectionFunc *masked_inner_loop_selector; -} PyUFuncObject; - -#include "arrayobject.h" - -#define UFUNC_ERR_IGNORE 0 -#define UFUNC_ERR_WARN 1 -#define UFUNC_ERR_RAISE 2 -#define UFUNC_ERR_CALL 3 -#define UFUNC_ERR_PRINT 4 -#define UFUNC_ERR_LOG 5 - - /* Python side integer mask */ - -#define UFUNC_MASK_DIVIDEBYZERO 0x07 -#define UFUNC_MASK_OVERFLOW 0x3f -#define UFUNC_MASK_UNDERFLOW 0x1ff -#define UFUNC_MASK_INVALID 0xfff - -#define UFUNC_SHIFT_DIVIDEBYZERO 0 -#define UFUNC_SHIFT_OVERFLOW 3 -#define UFUNC_SHIFT_UNDERFLOW 6 -#define UFUNC_SHIFT_INVALID 9 - - -/* platform-dependent code translates floating point - status to an integer sum of these values -*/ -#define UFUNC_FPE_DIVIDEBYZERO 1 -#define UFUNC_FPE_OVERFLOW 2 -#define UFUNC_FPE_UNDERFLOW 4 -#define UFUNC_FPE_INVALID 8 - -/* Error mode that avoids look-up (no checking) */ -#define UFUNC_ERR_DEFAULT 0 - -#define UFUNC_OBJ_ISOBJECT 1 -#define UFUNC_OBJ_NEEDS_API 2 - - /* Default user error mode */ -#define UFUNC_ERR_DEFAULT2 \ - (UFUNC_ERR_WARN << UFUNC_SHIFT_DIVIDEBYZERO) + \ - (UFUNC_ERR_WARN << UFUNC_SHIFT_OVERFLOW) + \ - (UFUNC_ERR_WARN << UFUNC_SHIFT_INVALID) - -#if NPY_ALLOW_THREADS -#define NPY_LOOP_BEGIN_THREADS do {if (!(loop->obj & UFUNC_OBJ_NEEDS_API)) _save = PyEval_SaveThread();} while (0); -#define NPY_LOOP_END_THREADS do {if (!(loop->obj & UFUNC_OBJ_NEEDS_API)) PyEval_RestoreThread(_save);} while (0); -#else -#define NPY_LOOP_BEGIN_THREADS -#define NPY_LOOP_END_THREADS -#endif - -/* - * UFunc has unit of 1, and the order of operations can be reordered - * This case allows reduction with multiple axes at once. - */ -#define PyUFunc_One 1 -/* - * UFunc has unit of 0, and the order of operations can be reordered - * This case allows reduction with multiple axes at once. - */ -#define PyUFunc_Zero 0 -/* - * UFunc has no unit, and the order of operations cannot be reordered. - * This case does not allow reduction with multiple axes at once. - */ -#define PyUFunc_None -1 -/* - * UFunc has no unit, and the order of operations can be reordered - * This case allows reduction with multiple axes at once. - */ -#define PyUFunc_ReorderableNone -2 - -#define UFUNC_REDUCE 0 -#define UFUNC_ACCUMULATE 1 -#define UFUNC_REDUCEAT 2 -#define UFUNC_OUTER 3 - - -typedef struct { - int nin; - int nout; - PyObject *callable; -} PyUFunc_PyFuncData; - -/* A linked-list of function information for - user-defined 1-d loops. - */ -typedef struct _loop1d_info { - PyUFuncGenericFunction func; - void *data; - int *arg_types; - struct _loop1d_info *next; -} PyUFunc_Loop1d; - - -#include "__ufunc_api.h" - -#define UFUNC_PYVALS_NAME "UFUNC_PYVALS" - -#define UFUNC_CHECK_ERROR(arg) \ - do {if ((((arg)->obj & UFUNC_OBJ_NEEDS_API) && PyErr_Occurred()) || \ - ((arg)->errormask && \ - PyUFunc_checkfperr((arg)->errormask, \ - (arg)->errobj, \ - &(arg)->first))) \ - goto fail;} while (0) - -/* This code checks the IEEE status flags in a platform-dependent way */ -/* Adapted from Numarray */ - -#if (defined(__unix__) || defined(unix)) && !defined(USG) -#include -#endif - -/* OSF/Alpha (Tru64) ---------------------------------------------*/ -#if defined(__osf__) && defined(__alpha) - -#include - -#define UFUNC_CHECK_STATUS(ret) { \ - unsigned long fpstatus; \ - \ - fpstatus = ieee_get_fp_control(); \ - /* clear status bits as well as disable exception mode if on */ \ - ieee_set_fp_control( 0 ); \ - ret = ((IEEE_STATUS_DZE & fpstatus) ? UFUNC_FPE_DIVIDEBYZERO : 0) \ - | ((IEEE_STATUS_OVF & fpstatus) ? 
UFUNC_FPE_OVERFLOW : 0) \ - | ((IEEE_STATUS_UNF & fpstatus) ? UFUNC_FPE_UNDERFLOW : 0) \ - | ((IEEE_STATUS_INV & fpstatus) ? UFUNC_FPE_INVALID : 0); \ - } - -/* MS Windows -----------------------------------------------------*/ -#elif defined(_MSC_VER) - -#include - - /* Clear the floating point exception default of Borland C++ */ -#if defined(__BORLANDC__) -#define UFUNC_NOFPE _control87(MCW_EM, MCW_EM); -#endif - -#define UFUNC_CHECK_STATUS(ret) { \ - int fpstatus = (int) _clearfp(); \ - \ - ret = ((SW_ZERODIVIDE & fpstatus) ? UFUNC_FPE_DIVIDEBYZERO : 0) \ - | ((SW_OVERFLOW & fpstatus) ? UFUNC_FPE_OVERFLOW : 0) \ - | ((SW_UNDERFLOW & fpstatus) ? UFUNC_FPE_UNDERFLOW : 0) \ - | ((SW_INVALID & fpstatus) ? UFUNC_FPE_INVALID : 0); \ - } - -/* Solaris --------------------------------------------------------*/ -/* --------ignoring SunOS ieee_flags approach, someone else can -** deal with that! */ -#elif defined(sun) || defined(__BSD__) || defined(__OpenBSD__) || \ - (defined(__FreeBSD__) && (__FreeBSD_version < 502114)) || \ - defined(__NetBSD__) -#include - -#define UFUNC_CHECK_STATUS(ret) { \ - int fpstatus; \ - \ - fpstatus = (int) fpgetsticky(); \ - ret = ((FP_X_DZ & fpstatus) ? UFUNC_FPE_DIVIDEBYZERO : 0) \ - | ((FP_X_OFL & fpstatus) ? UFUNC_FPE_OVERFLOW : 0) \ - | ((FP_X_UFL & fpstatus) ? UFUNC_FPE_UNDERFLOW : 0) \ - | ((FP_X_INV & fpstatus) ? UFUNC_FPE_INVALID : 0); \ - (void) fpsetsticky(0); \ - } - -#elif defined(__GLIBC__) || defined(__APPLE__) || \ - defined(__CYGWIN__) || defined(__MINGW32__) || \ - (defined(__FreeBSD__) && (__FreeBSD_version >= 502114)) - -#if defined(__GLIBC__) || defined(__APPLE__) || \ - defined(__MINGW32__) || defined(__FreeBSD__) -#include -#endif - -#define UFUNC_CHECK_STATUS(ret) { \ - int fpstatus = (int) fetestexcept(FE_DIVBYZERO | FE_OVERFLOW | \ - FE_UNDERFLOW | FE_INVALID); \ - ret = ((FE_DIVBYZERO & fpstatus) ? UFUNC_FPE_DIVIDEBYZERO : 0) \ - | ((FE_OVERFLOW & fpstatus) ? UFUNC_FPE_OVERFLOW : 0) \ - | ((FE_UNDERFLOW & fpstatus) ? UFUNC_FPE_UNDERFLOW : 0) \ - | ((FE_INVALID & fpstatus) ? UFUNC_FPE_INVALID : 0); \ - (void) feclearexcept(FE_DIVBYZERO | FE_OVERFLOW | \ - FE_UNDERFLOW | FE_INVALID); \ -} - -#elif defined(_AIX) - -#include -#include - -#define UFUNC_CHECK_STATUS(ret) { \ - fpflag_t fpstatus; \ - \ - fpstatus = fp_read_flag(); \ - ret = ((FP_DIV_BY_ZERO & fpstatus) ? UFUNC_FPE_DIVIDEBYZERO : 0) \ - | ((FP_OVERFLOW & fpstatus) ? UFUNC_FPE_OVERFLOW : 0) \ - | ((FP_UNDERFLOW & fpstatus) ? UFUNC_FPE_UNDERFLOW : 0) \ - | ((FP_INVALID & fpstatus) ? UFUNC_FPE_INVALID : 0); \ - fp_swap_flag(0); \ -} - -#else - -#define NO_FLOATING_POINT_SUPPORT -#define UFUNC_CHECK_STATUS(ret) { \ - ret = 0; \ - } - -#endif - -/* - * THESE MACROS ARE DEPRECATED. - * Use npy_set_floatstatus_* in the npymath library. 
- */ -#define generate_divbyzero_error() npy_set_floatstatus_divbyzero() -#define generate_overflow_error() npy_set_floatstatus_overflow() - - /* Make sure it gets defined if it isn't already */ -#ifndef UFUNC_NOFPE -#define UFUNC_NOFPE -#endif - - -#ifdef __cplusplus -} -#endif -#endif /* !Py_UFUNCOBJECT_H */ diff --git a/include/numpy/utils.h b/include/numpy/utils.h deleted file mode 100644 index cc968a354..000000000 --- a/include/numpy/utils.h +++ /dev/null @@ -1,19 +0,0 @@ -#ifndef __NUMPY_UTILS_HEADER__ -#define __NUMPY_UTILS_HEADER__ - -#ifndef __COMP_NPY_UNUSED - #if defined(__GNUC__) - #define __COMP_NPY_UNUSED __attribute__ ((__unused__)) - # elif defined(__ICC) - #define __COMP_NPY_UNUSED __attribute__ ((__unused__)) - #else - #define __COMP_NPY_UNUSED - #endif -#endif - -/* Use this to tag a variable as not used. It will remove unused variable - * warning on support platforms (see __COM_NPY_UNUSED) and mangle the variable - * to avoid accidental use */ -#define NPY_UNUSED(x) (__NPY_UNUSED_TAGGED ## x) __COMP_NPY_UNUSED - -#endif diff --git a/netlify.toml b/netlify.toml index 9cb11ae81..3c17b876c 100644 --- a/netlify.toml +++ b/netlify.toml @@ -24,7 +24,7 @@ redirects = [ {from = "/docs/usage/customizing-tokenizer", to = "/usage/linguistic-features#tokenization", force = true}, {from = "/docs/usage/language-processing-pipeline", to = "/usage/processing-pipelines", force = true}, {from = "/docs/usage/customizing-pipeline", to = "/usage/processing-pipelines", force = true}, - {from = "/docs/usage/training-ner", to = "/usage/training#ner", force = true}, + {from = "/docs/usage/training-ner", to = "/usage/training", force = true}, {from = "/docs/usage/tutorials", to = "/usage/examples", force = true}, {from = "/docs/usage/data-model", to = "/api", force = true}, {from = "/docs/usage/cli", to = "/api/cli", force = true}, @@ -36,8 +36,15 @@ redirects = [ {from = "/docs/api/features", to = "/models/#architecture", force = true}, {from = "/docs/api/philosophy", to = "/usage/spacy-101", force = true}, {from = "/docs/usage/showcase", to = "/universe", force = true}, - {from = "/tutorials/load-new-word-vectors", to = "/usage/vectors-similarity#custom", force = true}, + {from = "/tutorials/load-new-word-vectors", to = "/usage/linguistic-features", force = true}, {from = "/tutorials", to = "/usage/examples", force = true}, + # Old documentation pages (v2.x) + {from = "/usage/adding-languages", to = "/usage/linguistic-features", force = true}, + {from = "/usage/vectors-similarity", to = "/usage/linguistic-features#vectors-similarity", force = true}, + {from = "/api/goldparse", to = "/api/top-level", force = true}, + {from = "/api/goldcorpus", to = "/api/corpus", force = true}, + {from = "/api/annotation", to = "/api/data-formats", force = true}, + {from = "/usage/examples", to = "/usage/projects", force = true}, # Rewrite all other docs pages to / {from = "/docs/*", to = "/:splat"}, # Updated documentation pages diff --git a/pyproject.toml b/pyproject.toml index fe66494ff..14a2d7690 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -6,6 +6,9 @@ requires = [ "cymem>=2.0.2,<2.1.0", "preshed>=3.0.2,<3.1.0", "murmurhash>=0.28.0,<1.1.0", - "thinc==7.4.1", + "thinc>=8.0.0rc0,<8.1.0", + "blis>=0.4.0,<0.8.0", + "pytokenizations", + "pathy" ] build-backend = "setuptools.build_meta" diff --git a/requirements.txt b/requirements.txt index 367eef111..36f0d1e92 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,24 +1,30 @@ # Our libraries cymem>=2.0.2,<2.1.0 preshed>=3.0.2,<3.1.0 -thinc==7.4.1 
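The pyproject.toml and requirements.txt hunks above replace the Thinc 7 pin (thinc==7.4.1) with the Thinc 8 release candidate (thinc>=8.0.0rc0,<8.1.0) and add pydantic, typer and pathy. The config system the rest of this diff builds on comes from that Thinc 8 dependency; a minimal sketch of it, with an illustrative section name that is not part of this diff:

    # Sketch only: exercises the thinc.api.Config object provided by the new
    # thinc>=8.0.0rc0 pin; the [training] section and its values are made up.
    from thinc.api import Config

    cfg = Config({"training": {"seed": 0, "dropout": 0.1}})
    print(cfg["training"]["dropout"])  # 0.1 (values keep their types)
    print(cfg.to_str())                # renders the INI-style config text
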
-blis>=0.4.0,<0.5.0 +thinc>=8.0.0rc0,<8.1.0 +blis>=0.4.0,<0.8.0 +ml_datasets==0.2.0a0 murmurhash>=0.28.0,<1.1.0 -wasabi>=0.4.0,<1.1.0 -srsly>=1.0.2,<1.1.0 -catalogue>=0.0.7,<1.1.0 +wasabi>=0.8.0,<1.1.0 +srsly>=2.3.0,<3.0.0 +catalogue>=2.0.1,<2.1.0 +typer>=0.3.0,<0.4.0 +pathy # Third party dependencies numpy>=1.15.0 requests>=2.13.0,<3.0.0 -plac>=0.9.6,<1.2.0 -pathlib==1.0.1; python_version < "3.4" tqdm>=4.38.0,<5.0.0 -# Optional dependencies -pyrsistent<0.17.0 -jsonschema>=2.6.0,<3.1.0 +pydantic>=1.5.0,<2.0.0 +pytokenizations +# Official Python utilities +setuptools +packaging>=20.0 +importlib_metadata>=0.20; python_version < "3.8" +typing_extensions>=3.7.4; python_version < "3.8" # Development dependencies cython>=0.25 pytest>=4.6.5 pytest-timeout>=1.3.0,<2.0.0 mock>=2.0.0,<3.0.0 flake8>=3.5.0,<3.6.0 +jinja2 diff --git a/setup.cfg b/setup.cfg index 9bd45d45d..adf0c0e20 100644 --- a/setup.cfg +++ b/setup.cfg @@ -16,10 +16,7 @@ classifiers = Operating System :: MacOS :: MacOS X Operating System :: Microsoft :: Windows Programming Language :: Cython - Programming Language :: Python :: 2 - Programming Language :: Python :: 2.7 Programming Language :: Python :: 3 - Programming Language :: Python :: 3.5 Programming Language :: Python :: 3.6 Programming Language :: Python :: 3.7 Programming Language :: Python :: 3.8 @@ -28,62 +25,77 @@ classifiers = [options] zip_safe = false include_package_data = true -scripts = - bin/spacy -python_requires = >=2.7,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.* +python_requires = >=3.6 setup_requires = wheel cython>=0.25 + numpy>=1.15.0 # We also need our Cython packages here to compile against cymem>=2.0.2,<2.1.0 preshed>=3.0.2,<3.1.0 murmurhash>=0.28.0,<1.1.0 - thinc==7.4.1 + thinc>=8.0.0rc0,<8.1.0 install_requires = # Our libraries murmurhash>=0.28.0,<1.1.0 cymem>=2.0.2,<2.1.0 preshed>=3.0.2,<3.1.0 - thinc==7.4.1 - blis>=0.4.0,<0.5.0 - wasabi>=0.4.0,<1.1.0 - srsly>=1.0.2,<1.1.0 - catalogue>=0.0.7,<1.1.0 + thinc>=8.0.0rc0,<8.1.0 + blis>=0.4.0,<0.8.0 + wasabi>=0.8.0,<1.1.0 + srsly>=2.3.0,<3.0.0 + catalogue>=2.0.1,<2.1.0 + typer>=0.3.0,<0.4.0 + pathy # Third-party dependencies tqdm>=4.38.0,<5.0.0 - setuptools numpy>=1.15.0 - plac>=0.9.6,<1.2.0 requests>=2.13.0,<3.0.0 - pathlib==1.0.1; python_version < "3.4" + pydantic>=1.5.0,<2.0.0 + pytokenizations + # Official Python utilities + setuptools + packaging>=20.0 + importlib_metadata>=0.20; python_version < "3.8" + typing_extensions>=3.7.4; python_version < "3.8" + +[options.entry_points] +console_scripts = + spacy = spacy.cli:app [options.extras_require] lookups = - spacy_lookups_data>=0.3.2,<0.4.0 + spacy_lookups_data>=1.0.0rc0,<1.1.0 +transformers = + spacy_transformers>=1.0.0rc0,<1.1.0 +ray = + spacy_ray>=0.1.0,<1.0.0 cuda = - cupy>=5.0.0b4,<8.0.0 + cupy>=5.0.0b4,<9.0.0 cuda80 = - cupy-cuda80>=5.0.0b4,<8.0.0 + cupy-cuda80>=5.0.0b4,<9.0.0 cuda90 = - cupy-cuda90>=5.0.0b4,<8.0.0 + cupy-cuda90>=5.0.0b4,<9.0.0 cuda91 = - cupy-cuda91>=5.0.0b4,<8.0.0 + cupy-cuda91>=5.0.0b4,<9.0.0 cuda92 = - cupy-cuda92>=5.0.0b4,<8.0.0 + cupy-cuda92>=5.0.0b4,<9.0.0 cuda100 = - cupy-cuda100>=5.0.0b4,<8.0.0 + cupy-cuda100>=5.0.0b4,<9.0.0 cuda101 = - cupy-cuda101>=5.0.0b4,<8.0.0 + cupy-cuda101>=5.0.0b4,<9.0.0 cuda102 = - cupy-cuda102>=5.0.0b4,<8.0.0 + cupy-cuda102>=5.0.0b4,<9.0.0 # Language tokenizers with external dependencies ja = - sudachipy>=0.4.5 + sudachipy>=0.4.9 sudachidict_core>=20200330 ko = natto-py==0.9.0 th = pythainlp>=2.0 +zh = + spacy-pkuseg==0.0.26 [bdist_wheel] universal = false @@ -92,7 +104,7 @@ universal = false formats = 
gztar [flake8] -ignore = E203, E266, E501, E731, W503 +ignore = E203, E266, E501, E731, W503, E741 max-line-length = 80 select = B,C,E,F,W,T4,B9 exclude = @@ -100,8 +112,12 @@ exclude = .git, __pycache__, _tokenizer_exceptions_list.py, - spacy/__init__.py [tool:pytest] markers = slow + +[mypy] +ignore_missing_imports = True +no_implicit_optional = True +plugins = pydantic.mypy, thinc.mypy diff --git a/setup.py b/setup.py index f78781918..604d65745 100755 --- a/setup.py +++ b/setup.py @@ -1,55 +1,55 @@ #!/usr/bin/env python -from __future__ import print_function -import io -import os -import subprocess +from setuptools import Extension, setup, find_packages import sys -import contextlib +import platform from distutils.command.build_ext import build_ext from distutils.sysconfig import get_python_inc -import distutils.util -from distutils import ccompiler, msvccompiler -from setuptools import Extension, setup, find_packages +import numpy +from pathlib import Path +import shutil +from Cython.Build import cythonize +from Cython.Compiler import Options +import os +import subprocess -def is_new_osx(): - """Check whether we're on OSX >= 10.10""" - name = distutils.util.get_platform() - if sys.platform != "darwin": - return False - elif name.startswith("macosx-10"): - minor_version = int(name.split("-")[1].split(".")[1]) - if minor_version >= 7: - return True - else: - return False - else: - return False +ROOT = Path(__file__).parent +PACKAGE_ROOT = ROOT / "spacy" +# Preserve `__doc__` on functions and classes +# http://docs.cython.org/en/latest/src/userguide/source_files_and_compilation.html#compiler-options +Options.docstrings = True + PACKAGES = find_packages() - - MOD_NAMES = [ + "spacy.training.example", "spacy.parts_of_speech", "spacy.strings", "spacy.lexeme", "spacy.vocab", "spacy.attrs", "spacy.kb", + "spacy.ml.parser_model", "spacy.morphology", - "spacy.pipeline.pipes", + "spacy.pipeline.dep_parser", "spacy.pipeline.morphologizer", - "spacy.syntax.stateclass", - "spacy.syntax._state", + "spacy.pipeline.multitask", + "spacy.pipeline.ner", + "spacy.pipeline.pipe", + "spacy.pipeline.trainable_pipe", + "spacy.pipeline.sentencizer", + "spacy.pipeline.senter", + "spacy.pipeline.tagger", + "spacy.pipeline.transition_parser", + "spacy.pipeline._parser_internals.arc_eager", + "spacy.pipeline._parser_internals.ner", + "spacy.pipeline._parser_internals.nonproj", + "spacy.pipeline._parser_internals._state", + "spacy.pipeline._parser_internals.stateclass", + "spacy.pipeline._parser_internals.transition_system", "spacy.tokenizer", - "spacy.syntax.nn_parser", - "spacy.syntax._parser_model", - "spacy.syntax._beam_utils", - "spacy.syntax.nonproj", - "spacy.syntax.transition_system", - "spacy.syntax.arc_eager", - "spacy.gold", + "spacy.training.gold_io", "spacy.tokens.doc", "spacy.tokens.span", "spacy.tokens.token", @@ -58,20 +58,40 @@ MOD_NAMES = [ "spacy.matcher.matcher", "spacy.matcher.phrasematcher", "spacy.matcher.dependencymatcher", - "spacy.syntax.ner", "spacy.symbols", "spacy.vectors", ] - - COMPILE_OPTIONS = { "msvc": ["/Ox", "/EHsc"], "mingw32": ["-O2", "-Wno-strict-prototypes", "-Wno-unused-function"], "other": ["-O2", "-Wno-strict-prototypes", "-Wno-unused-function"], } - - LINK_OPTIONS = {"msvc": [], "mingw32": [], "other": []} +COMPILER_DIRECTIVES = { + "language_level": -3, + "embedsignature": True, + "annotation_typing": False, +} +# Files to copy into the package that are otherwise not included +COPY_FILES = { + ROOT / "setup.cfg": PACKAGE_ROOT / "tests" / "package", + ROOT / 
"pyproject.toml": PACKAGE_ROOT / "tests" / "package", + ROOT / "requirements.txt": PACKAGE_ROOT / "tests" / "package", +} + + +def is_new_osx(): + """Check whether we're on OSX >= 10.7""" + if sys.platform != "darwin": + return False + mac_ver = platform.mac_ver()[0] + if mac_ver.startswith("10"): + minor_version = int(mac_ver.split(".")[1]) + if minor_version >= 7: + return True + else: + return False + return False if is_new_osx(): @@ -104,20 +124,6 @@ class build_ext_subclass(build_ext, build_ext_options): build_ext.build_extensions(self) -def generate_cython(root, source): - print("Cythonizing sources") - p = subprocess.call( - [sys.executable, os.path.join(root, "bin", "cythonize.py"), source], - env=os.environ, - ) - if p != 0: - raise RuntimeError("Running cythonize failed") - - -def is_source_release(path): - return os.path.exists(os.path.join(path, "PKG-INFO")) - - # Include the git version in the build (adapted from NumPy) # Copyright (c) 2005-2020, NumPy Developers. # BSD 3-Clause license, see licenses/3rd_party_licenses.txt @@ -137,19 +143,19 @@ def write_git_info_py(filename="spacy/git_info.py"): return out git_version = "Unknown" - if os.path.exists(".git"): + if Path(".git").exists(): try: out = _minimal_ext_cmd(["git", "rev-parse", "--short", "HEAD"]) git_version = out.strip().decode("ascii") - except: + except Exception: pass - elif os.path.exists(filename): + elif Path(filename).exists(): # must be a source distribution, use existing version file try: a = open(filename, "r") lines = a.readlines() git_version = lines[-1].split('"')[1] - except: + except Exception: pass finally: a.close() @@ -160,90 +166,53 @@ GIT_VERSION = "%(git_version)s" """ a = open(filename, "w") try: - a.write( - text % {"git_version": git_version,} - ) + a.write(text % {"git_version": git_version}) finally: a.close() def clean(path): - for name in MOD_NAMES: - name = name.replace(".", "/") - for ext in [".so", ".html", ".cpp", ".c"]: - file_path = os.path.join(path, name + ext) - if os.path.exists(file_path): - os.unlink(file_path) - - -@contextlib.contextmanager -def chdir(new_dir): - old_dir = os.getcwd() - try: - os.chdir(new_dir) - sys.path.insert(0, new_dir) - yield - finally: - del sys.path[0] - os.chdir(old_dir) + for path in path.glob("**/*"): + if path.is_file() and path.suffix in (".so", ".cpp", ".html"): + print(f"Deleting {path.name}") + path.unlink() def setup_package(): write_git_info_py() + if len(sys.argv) > 1 and sys.argv[1] == "clean": + return clean(PACKAGE_ROOT) - root = os.path.abspath(os.path.dirname(__file__)) + with (PACKAGE_ROOT / "about.py").open("r") as f: + about = {} + exec(f.read(), about) - if hasattr(sys, "argv") and len(sys.argv) > 1 and sys.argv[1] == "clean": - return clean(root) + for copy_file, target_dir in COPY_FILES.items(): + if copy_file.exists(): + shutil.copy(str(copy_file), str(target_dir)) + print(f"Copied {copy_file} -> {target_dir}") - with chdir(root): - with io.open(os.path.join(root, "spacy", "about.py"), encoding="utf8") as f: - about = {} - exec(f.read(), about) + include_dirs = [ + get_python_inc(plat_specific=True), + numpy.get_include(), + ] + ext_modules = [] + for name in MOD_NAMES: + mod_path = name.replace(".", "/") + ".pyx" + ext = Extension(name, [mod_path], language="c++") + ext_modules.append(ext) + print("Cythonizing sources") + ext_modules = cythonize(ext_modules, compiler_directives=COMPILER_DIRECTIVES) - include_dirs = [ - get_python_inc(plat_specific=True), - os.path.join(root, "include"), - ] - - if ( - 
ccompiler.new_compiler().compiler_type == "msvc" - and msvccompiler.get_build_version() == 9 - ): - include_dirs.append(os.path.join(root, "include", "msvc9")) - - ext_modules = [] - for mod_name in MOD_NAMES: - mod_path = mod_name.replace(".", "/") + ".cpp" - extra_link_args = [] - # ??? - # Imported from patch from @mikepb - # See Issue #267. Running blind here... - if sys.platform == "darwin": - dylib_path = [".." for _ in range(mod_name.count("."))] - dylib_path = "/".join(dylib_path) - dylib_path = "@loader_path/%s/spacy/platform/darwin/lib" % dylib_path - extra_link_args.append("-Wl,-rpath,%s" % dylib_path) - ext_modules.append( - Extension( - mod_name, - [mod_path], - language="c++", - include_dirs=include_dirs, - extra_link_args=extra_link_args, - ) - ) - - if not is_source_release(root): - generate_cython(root, "spacy") - - setup( - name="spacy", - packages=PACKAGES, - version=about["__version__"], - ext_modules=ext_modules, - cmdclass={"build_ext": build_ext_subclass}, - ) + setup( + name="spacy-nightly", + packages=PACKAGES, + version=about["__version__"], + ext_modules=ext_modules, + cmdclass={"build_ext": build_ext_subclass}, + include_dirs=include_dirs, + package_data={"": ["*.pyx", "*.pxd", "*.pxi", "*.cpp"]}, + ) if __name__ == "__main__": diff --git a/spacy/__init__.py b/spacy/__init__.py index 6aa7b7c16..7334b4149 100644 --- a/spacy/__init__.py +++ b/spacy/__init__.py @@ -1,39 +1,68 @@ -# coding: utf8 -from __future__ import unicode_literals +from typing import Union, Iterable, Dict, Any +from pathlib import Path import warnings import sys -warnings.filterwarnings("ignore", message="numpy.dtype size changed") -warnings.filterwarnings("ignore", message="numpy.ufunc size changed") +warnings.filterwarnings("ignore", message="numpy.dtype size changed") # noqa +warnings.filterwarnings("ignore", message="numpy.ufunc size changed") # noqa # These are imported as part of the API -from thinc.neural.util import prefer_gpu, require_gpu +from thinc.api import prefer_gpu, require_gpu # noqa: F401 +from thinc.api import Config -from . import pipeline -from .cli.info import info as cli_info -from .glossary import explain -from .about import __version__ -from .errors import Errors, Warnings +from . import pipeline # noqa: F401 +from .cli.info import info # noqa: F401 +from .glossary import explain # noqa: F401 +from .about import __version__ # noqa: F401 +from .util import registry, logger # noqa: F401 + +from .errors import Errors +from .language import Language +from .vocab import Vocab from . import util -from .util import registry -from .language import component if sys.maxunicode == 65535: raise SystemError(Errors.E130) -def load(name, **overrides): - depr_path = overrides.get("path") - if depr_path not in (True, False, None): - warnings.warn(Warnings.W001.format(path=depr_path), DeprecationWarning) - return util.load_model(name, **overrides) +def load( + name: Union[str, Path], + disable: Iterable[str] = util.SimpleFrozenList(), + exclude: Iterable[str] = util.SimpleFrozenList(), + config: Union[Dict[str, Any], Config] = util.SimpleFrozenDict(), +) -> Language: + """Load a spaCy model from an installed package or a local path. + + name (str): Package name or model path. + disable (Iterable[str]): Names of pipeline components to disable. Disabled + pipes will be loaded but they won't be run unless you explicitly + enable them by calling nlp.enable_pipe. + exclude (Iterable[str]): Names of pipeline components to exclude. Excluded + components won't be loaded. 
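The rewritten spacy/__init__.py above replaces the old load(name, **overrides) entry point with explicit disable, exclude and config keyword arguments. A hedged usage sketch of that signature; the package name and the override value are illustrative examples, not taken from this diff:

    # Sketch only: "en_core_web_sm" and the batch_size override are examples.
    import spacy

    nlp = spacy.load(
        "en_core_web_sm",
        disable=["parser"],                   # loaded, but not run until enabled
        exclude=["ner"],                      # not loaded at all
        config={"nlp": {"batch_size": 128}},  # nested-dict config override
    )
    doc = nlp("This is a sentence.")

    # A pipeline-free Language object for a language code:
    blank_nlp = spacy.blank("en")

Per the docstring, disabled components can later be re-enabled via nlp.enable_pipe, while excluded components are never loaded.
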
+ config (Dict[str, Any] / Config): Config overrides as nested dict or dict + keyed by section values in dot notation. + RETURNS (Language): The loaded nlp object. + """ + return util.load_model(name, disable=disable, exclude=exclude, config=config) -def blank(name, **kwargs): +def blank( + name: str, + *, + vocab: Union[Vocab, bool] = True, + config: Union[Dict[str, Any], Config] = util.SimpleFrozenDict(), + meta: Dict[str, Any] = util.SimpleFrozenDict() +) -> Language: + """Create a blank nlp object for a given language code. + + name (str): The language code, e.g. "en". + vocab (Vocab): A Vocab object. If True, a vocab is created. + config (Dict[str, Any] / Config): Optional config overrides. + meta (Dict[str, Any]): Overrides for nlp.meta. + RETURNS (Language): The nlp object. + """ LangClass = util.get_lang_class(name) - return LangClass(**kwargs) - - -def info(model=None, markdown=False, silent=False): - return cli_info(model, markdown, silent) + # We should accept both dot notation and nested dict here for consistency + config = util.dot_to_dict(config) + return LangClass.from_config(config, meta=meta) diff --git a/spacy/__main__.py b/spacy/__main__.py index 2c285095e..f6b5066b7 100644 --- a/spacy/__main__.py +++ b/spacy/__main__.py @@ -1,36 +1,4 @@ -# coding: utf8 -from __future__ import print_function - -# NB! This breaks in plac on Python 2!! -# from __future__ import unicode_literals - if __name__ == "__main__": - import plac - import sys - from wasabi import msg - from spacy.cli import download, link, info, package, train, pretrain, convert - from spacy.cli import init_model, profile, evaluate, validate, debug_data + from spacy.cli import setup_cli - commands = { - "download": download, - "link": link, - "info": info, - "train": train, - "pretrain": pretrain, - "debug-data": debug_data, - "evaluate": evaluate, - "convert": convert, - "package": package, - "init-model": init_model, - "profile": profile, - "validate": validate, - } - if len(sys.argv) == 1: - msg.info("Available commands", ", ".join(commands), exits=1) - command = sys.argv.pop(1) - sys.argv[0] = "spacy %s" % command - if command in commands: - plac.call(commands[command], sys.argv[1:]) - else: - available = "Available: {}".format(", ".join(commands)) - msg.fail("Unknown command: {}".format(command), available, exits=1) + setup_cli() diff --git a/spacy/_ml.py b/spacy/_ml.py deleted file mode 100644 index 3fc2c4718..000000000 --- a/spacy/_ml.py +++ /dev/null @@ -1,1004 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import numpy -import warnings -from thinc.v2v import Model, Maxout, Softmax, Affine, ReLu -from thinc.t2t import ExtractWindow, ParametricAttention -from thinc.t2v import Pooling, sum_pool, mean_pool -from thinc.i2v import HashEmbed -from thinc.misc import Residual, FeatureExtracter -from thinc.misc import LayerNorm as LN -from thinc.api import add, layerize, chain, clone, concatenate, with_flatten -from thinc.api import with_getitem, flatten_add_lengths -from thinc.api import uniqued, wrap, noop -from thinc.linear.linear import LinearModel -from thinc.neural.ops import NumpyOps, CupyOps -from thinc.neural.util import get_array_module, copy_array, to_categorical -from thinc.neural.optimizers import Adam - -from thinc import describe -from thinc.describe import Dimension, Synapses, Biases, Gradient -from thinc.neural._classes.affine import _set_dimensions_if_needed -import thinc.extra.load_nlp - -from .attrs import ID, ORTH, LOWER, NORM, PREFIX, SUFFIX, SHAPE -from .errors import 
Errors, Warnings -from . import util -from . import ml as new_ml -from .ml import _legacy_tok2vec - - -VECTORS_KEY = "spacy_pretrained_vectors" -# Backwards compatibility with <2.2.2 -USE_MODEL_REGISTRY_TOK2VEC = False - - -def cosine(vec1, vec2): - xp = get_array_module(vec1) - norm1 = xp.linalg.norm(vec1) - norm2 = xp.linalg.norm(vec2) - if norm1 == 0.0 or norm2 == 0.0: - return 0 - else: - return vec1.dot(vec2) / (norm1 * norm2) - - -def create_default_optimizer(ops, **cfg): - learn_rate = util.env_opt("learn_rate", 0.001) - beta1 = util.env_opt("optimizer_B1", 0.9) - beta2 = util.env_opt("optimizer_B2", 0.999) - eps = util.env_opt("optimizer_eps", 1e-8) - L2 = util.env_opt("L2_penalty", 1e-6) - max_grad_norm = util.env_opt("grad_norm_clip", 1.0) - optimizer = Adam(ops, learn_rate, L2=L2, beta1=beta1, beta2=beta2, eps=eps) - optimizer.max_grad_norm = max_grad_norm - optimizer.device = ops.device - return optimizer - - -@layerize -def _flatten_add_lengths(seqs, pad=0, drop=0.0): - ops = Model.ops - lengths = ops.asarray([len(seq) for seq in seqs], dtype="i") - - def finish_update(d_X, sgd=None): - return ops.unflatten(d_X, lengths, pad=pad) - - X = ops.flatten(seqs, pad=pad) - return (X, lengths), finish_update - - -def _zero_init(model): - def _zero_init_impl(self, *args, **kwargs): - self.W.fill(0) - - model.on_init_hooks.append(_zero_init_impl) - if model.W is not None: - model.W.fill(0.0) - return model - - -def with_cpu(ops, model): - """Wrap a model that should run on CPU, transferring inputs and outputs - as necessary.""" - model.to_cpu() - - def with_cpu_forward(inputs, drop=0.0): - cpu_outputs, backprop = model.begin_update(_to_cpu(inputs), drop=drop) - gpu_outputs = _to_device(ops, cpu_outputs) - - def with_cpu_backprop(d_outputs, sgd=None): - cpu_d_outputs = _to_cpu(d_outputs) - return backprop(cpu_d_outputs, sgd=sgd) - - return gpu_outputs, with_cpu_backprop - - return wrap(with_cpu_forward, model) - - -def _to_cpu(X): - if isinstance(X, numpy.ndarray): - return X - elif isinstance(X, tuple): - return tuple([_to_cpu(x) for x in X]) - elif isinstance(X, list): - return [_to_cpu(x) for x in X] - elif hasattr(X, "get"): - return X.get() - else: - return X - - -def _to_device(ops, X): - if isinstance(X, tuple): - return tuple([_to_device(ops, x) for x in X]) - elif isinstance(X, list): - return [_to_device(ops, x) for x in X] - else: - return ops.asarray(X) - - -class extract_ngrams(Model): - def __init__(self, ngram_size, attr=LOWER): - Model.__init__(self) - self.ngram_size = ngram_size - self.attr = attr - - def begin_update(self, docs, drop=0.0): - batch_keys = [] - batch_vals = [] - for doc in docs: - unigrams = doc.to_array([self.attr]) - ngrams = [unigrams] - for n in range(2, self.ngram_size + 1): - ngrams.append(self.ops.ngrams(n, unigrams)) - keys = self.ops.xp.concatenate(ngrams) - keys, vals = self.ops.xp.unique(keys, return_counts=True) - batch_keys.append(keys) - batch_vals.append(vals) - # The dtype here matches what thinc is expecting -- which differs per - # platform (by int definition). This should be fixed once the problem - # is fixed on Thinc's side. 
- lengths = self.ops.asarray( - [arr.shape[0] for arr in batch_keys], dtype=numpy.int_ - ) - batch_keys = self.ops.xp.concatenate(batch_keys) - batch_vals = self.ops.asarray(self.ops.xp.concatenate(batch_vals), dtype="f") - return (batch_keys, batch_vals, lengths), None - - -@describe.on_data( - _set_dimensions_if_needed, lambda model, X, y: model.init_weights(model) -) -@describe.attributes( - nI=Dimension("Input size"), - nF=Dimension("Number of features"), - nO=Dimension("Output size"), - nP=Dimension("Maxout pieces"), - W=Synapses("Weights matrix", lambda obj: (obj.nF, obj.nO, obj.nP, obj.nI)), - b=Biases("Bias vector", lambda obj: (obj.nO, obj.nP)), - pad=Synapses( - "Pad", - lambda obj: (1, obj.nF, obj.nO, obj.nP), - lambda M, ops: ops.normal_init(M, 1.0), - ), - d_W=Gradient("W"), - d_pad=Gradient("pad"), - d_b=Gradient("b"), -) -class PrecomputableAffine(Model): - def __init__(self, nO=None, nI=None, nF=None, nP=None, **kwargs): - Model.__init__(self, **kwargs) - self.nO = nO - self.nP = nP - self.nI = nI - self.nF = nF - - def begin_update(self, X, drop=0.0): - Yf = self.ops.gemm( - X, self.W.reshape((self.nF * self.nO * self.nP, self.nI)), trans2=True - ) - Yf = Yf.reshape((Yf.shape[0], self.nF, self.nO, self.nP)) - Yf = self._add_padding(Yf) - - def backward(dY_ids, sgd=None): - dY, ids = dY_ids - dY, ids = self._backprop_padding(dY, ids) - Xf = X[ids] - Xf = Xf.reshape((Xf.shape[0], self.nF * self.nI)) - - self.d_b += dY.sum(axis=0) - dY = dY.reshape((dY.shape[0], self.nO * self.nP)) - - Wopfi = self.W.transpose((1, 2, 0, 3)) - Wopfi = self.ops.xp.ascontiguousarray(Wopfi) - Wopfi = Wopfi.reshape((self.nO * self.nP, self.nF * self.nI)) - dXf = self.ops.gemm(dY.reshape((dY.shape[0], self.nO * self.nP)), Wopfi) - - # Reuse the buffer - dWopfi = Wopfi - dWopfi.fill(0.0) - self.ops.gemm(dY, Xf, out=dWopfi, trans1=True) - dWopfi = dWopfi.reshape((self.nO, self.nP, self.nF, self.nI)) - # (o, p, f, i) --> (f, o, p, i) - self.d_W += dWopfi.transpose((2, 0, 1, 3)) - - if sgd is not None: - sgd(self._mem.weights, self._mem.gradient, key=self.id) - return dXf.reshape((dXf.shape[0], self.nF, self.nI)) - - return Yf, backward - - def _add_padding(self, Yf): - Yf_padded = self.ops.xp.vstack((self.pad, Yf)) - return Yf_padded - - def _backprop_padding(self, dY, ids): - # (1, nF, nO, nP) += (nN, nF, nO, nP) where IDs (nN, nF) < 0 - mask = ids < 0.0 - mask = mask.sum(axis=1) - d_pad = dY * mask.reshape((ids.shape[0], 1, 1)) - self.d_pad += d_pad.sum(axis=0) - return dY, ids - - @staticmethod - def init_weights(model): - """This is like the 'layer sequential unit variance', but instead - of taking the actual inputs, we randomly generate whitened data. - - Why's this all so complicated? We have a huge number of inputs, - and the maxout unit makes guessing the dynamics tricky. Instead - we set the maxout weights to values that empirically result in - whitened outputs given whitened inputs. - """ - if (model.W ** 2).sum() != 0.0: - return - ops = model.ops - xp = ops.xp - ops.normal_init(model.W, model.nF * model.nI, inplace=True) - - ids = ops.allocate((5000, model.nF), dtype="f") - ids += xp.random.uniform(0, 1000, ids.shape) - ids = ops.asarray(ids, dtype="i") - tokvecs = ops.allocate((5000, model.nI), dtype="f") - tokvecs += xp.random.normal(loc=0.0, scale=1.0, size=tokvecs.size).reshape( - tokvecs.shape - ) - - def predict(ids, tokvecs): - # nS ids. nW tokvecs. Exclude the padding array. 
- hiddens = model(tokvecs[:-1]) # (nW, f, o, p) - vectors = model.ops.allocate((ids.shape[0], model.nO * model.nP), dtype="f") - # need nS vectors - hiddens = hiddens.reshape( - (hiddens.shape[0] * model.nF, model.nO * model.nP) - ) - model.ops.scatter_add(vectors, ids.flatten(), hiddens) - vectors = vectors.reshape((vectors.shape[0], model.nO, model.nP)) - vectors += model.b - vectors = model.ops.asarray(vectors) - if model.nP >= 2: - return model.ops.maxout(vectors)[0] - else: - return vectors * (vectors >= 0) - - tol_var = 0.01 - tol_mean = 0.01 - t_max = 10 - t_i = 0 - for t_i in range(t_max): - acts1 = predict(ids, tokvecs) - var = model.ops.xp.var(acts1) - mean = model.ops.xp.mean(acts1) - if abs(var - 1.0) >= tol_var: - model.W /= model.ops.xp.sqrt(var) - elif abs(mean) >= tol_mean: - model.b -= mean - else: - break - - -def link_vectors_to_models(vocab, skip_rank=False): - vectors = vocab.vectors - if vectors.name is None: - vectors.name = VECTORS_KEY - if vectors.data.size != 0: - warnings.warn(Warnings.W020.format(shape=vectors.data.shape)) - ops = Model.ops - if not skip_rank: - for word in vocab: - if word.orth in vectors.key2row: - word.rank = vectors.key2row[word.orth] - else: - word.rank = util.OOV_RANK - data = ops.asarray(vectors.data) - # Set an entry here, so that vectors are accessed by StaticVectors - # (unideal, I know) - key = (ops.device, vectors.name) - if key in thinc.extra.load_nlp.VECTORS: - if thinc.extra.load_nlp.VECTORS[key].shape != data.shape: - # This is a hack to avoid the problem in #3853. - old_name = vectors.name - new_name = vectors.name + "_%d" % data.shape[0] - warnings.warn(Warnings.W019.format(old=old_name, new=new_name)) - vectors.name = new_name - key = (ops.device, vectors.name) - thinc.extra.load_nlp.VECTORS[key] = data - - -def PyTorchBiLSTM(nO, nI, depth, dropout=0.2): - import torch.nn - from thinc.api import with_square_sequences - from thinc.extra.wrappers import PyTorchWrapperRNN - - if depth == 0: - return layerize(noop()) - model = torch.nn.LSTM(nI, nO // 2, depth, bidirectional=True, dropout=dropout) - return with_square_sequences(PyTorchWrapperRNN(model)) - - -def Tok2Vec(width, embed_size, **kwargs): - if not USE_MODEL_REGISTRY_TOK2VEC: - # Preserve prior tok2vec for backwards compat, in v2.2.2 - return _legacy_tok2vec.Tok2Vec(width, embed_size, **kwargs) - pretrained_vectors = kwargs.get("pretrained_vectors", None) - cnn_maxout_pieces = kwargs.get("cnn_maxout_pieces", 3) - subword_features = kwargs.get("subword_features", True) - char_embed = kwargs.get("char_embed", False) - conv_depth = kwargs.get("conv_depth", 4) - bilstm_depth = kwargs.get("bilstm_depth", 0) - conv_window = kwargs.get("conv_window", 1) - - cols = ["ID", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"] - - doc2feats_cfg = {"arch": "spacy.Doc2Feats.v1", "config": {"columns": cols}} - if char_embed: - embed_cfg = { - "arch": "spacy.CharacterEmbed.v1", - "config": { - "width": 64, - "chars": 6, - "@mix": { - "arch": "spacy.LayerNormalizedMaxout.v1", - "config": {"width": width, "pieces": 3}, - }, - "@embed_features": None, - }, - } - else: - embed_cfg = { - "arch": "spacy.MultiHashEmbed.v1", - "config": { - "width": width, - "rows": embed_size, - "columns": cols, - "use_subwords": subword_features, - "@pretrained_vectors": None, - "@mix": { - "arch": "spacy.LayerNormalizedMaxout.v1", - "config": {"width": width, "pieces": 3}, - }, - }, - } - if pretrained_vectors: - embed_cfg["config"]["@pretrained_vectors"] = { - "arch": "spacy.PretrainedVectors.v1", - "config": { - 
"vectors_name": pretrained_vectors, - "width": width, - "column": cols.index("ID"), - }, - } - if cnn_maxout_pieces >= 2: - cnn_cfg = { - "arch": "spacy.MaxoutWindowEncoder.v1", - "config": { - "width": width, - "window_size": conv_window, - "pieces": cnn_maxout_pieces, - "depth": conv_depth, - }, - } - else: - cnn_cfg = { - "arch": "spacy.MishWindowEncoder.v1", - "config": {"width": width, "window_size": conv_window, "depth": conv_depth}, - } - bilstm_cfg = { - "arch": "spacy.TorchBiLSTMEncoder.v1", - "config": {"width": width, "depth": bilstm_depth}, - } - if conv_depth == 0 and bilstm_depth == 0: - encode_cfg = {} - elif conv_depth >= 1 and bilstm_depth >= 1: - encode_cfg = { - "arch": "thinc.FeedForward.v1", - "config": {"children": [cnn_cfg, bilstm_cfg]}, - } - elif conv_depth >= 1: - encode_cfg = cnn_cfg - else: - encode_cfg = bilstm_cfg - config = {"@doc2feats": doc2feats_cfg, "@embed": embed_cfg, "@encode": encode_cfg} - return new_ml.Tok2Vec(config) - - -def reapply(layer, n_times): - def reapply_fwd(X, drop=0.0): - backprops = [] - for i in range(n_times): - Y, backprop = layer.begin_update(X, drop=drop) - X = Y - backprops.append(backprop) - - def reapply_bwd(dY, sgd=None): - dX = None - for backprop in reversed(backprops): - dY = backprop(dY, sgd=sgd) - if dX is None: - dX = dY - else: - dX += dY - return dX - - return Y, reapply_bwd - - return wrap(reapply_fwd, layer) - - -def asarray(ops, dtype): - def forward(X, drop=0.0): - return ops.asarray(X, dtype=dtype), None - - return layerize(forward) - - -def _divide_array(X, size): - parts = [] - index = 0 - while index < len(X): - parts.append(X[index : index + size]) - index += size - return parts - - -def get_col(idx): - if idx < 0: - raise IndexError(Errors.E066.format(value=idx)) - - def forward(X, drop=0.0): - if isinstance(X, numpy.ndarray): - ops = NumpyOps() - else: - ops = CupyOps() - output = ops.xp.ascontiguousarray(X[:, idx], dtype=X.dtype) - - def backward(y, sgd=None): - dX = ops.allocate(X.shape) - dX[:, idx] += y - return dX - - return output, backward - - return layerize(forward) - - -def doc2feats(cols=None): - if cols is None: - cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH] - - def forward(docs, drop=0.0): - feats = [] - for doc in docs: - feats.append(doc.to_array(cols)) - return feats, None - - model = layerize(forward) - model.cols = cols - return model - - -def print_shape(prefix): - def forward(X, drop=0.0): - return X, lambda dX, **kwargs: dX - - return layerize(forward) - - -@layerize -def get_token_vectors(tokens_attrs_vectors, drop=0.0): - tokens, attrs, vectors = tokens_attrs_vectors - - def backward(d_output, sgd=None): - return (tokens, d_output) - - return vectors, backward - - -@layerize -def logistic(X, drop=0.0): - xp = get_array_module(X) - if not isinstance(X, xp.ndarray): - X = xp.asarray(X) - # Clip to range (-10, 10) - X = xp.minimum(X, 10.0, X) - X = xp.maximum(X, -10.0, X) - Y = 1.0 / (1.0 + xp.exp(-X)) - - def logistic_bwd(dY, sgd=None): - dX = dY * (Y * (1 - Y)) - return dX - - return Y, logistic_bwd - - -def zero_init(model): - def _zero_init_impl(self, X, y): - self.W.fill(0) - - model.on_data_hooks.append(_zero_init_impl) - return model - - -def getitem(i): - def getitem_fwd(X, drop=0.0): - return X[i], None - - return layerize(getitem_fwd) - - -@describe.attributes( - W=Synapses("Weights matrix", lambda obj: (obj.nO, obj.nI), lambda W, ops: None) -) -class MultiSoftmax(Affine): - """Neural network layer that predicts several multi-class attributes at once. 
- For instance, we might predict one class with 6 variables, and another with 5. - We predict the 11 neurons required for this, and then softmax them such - that columns 0-6 make a probability distribution and coumns 6-11 make another. - """ - - name = "multisoftmax" - - def __init__(self, out_sizes, nI=None, **kwargs): - Model.__init__(self, **kwargs) - self.out_sizes = out_sizes - self.nO = sum(out_sizes) - self.nI = nI - - def predict(self, input__BI): - output__BO = self.ops.affine(self.W, self.b, input__BI) - i = 0 - for out_size in self.out_sizes: - self.ops.softmax(output__BO[:, i : i + out_size], inplace=True) - i += out_size - return output__BO - - def begin_update(self, input__BI, drop=0.0): - output__BO = self.predict(input__BI) - - def finish_update(grad__BO, sgd=None): - self.d_W += self.ops.gemm(grad__BO, input__BI, trans1=True) - self.d_b += grad__BO.sum(axis=0) - grad__BI = self.ops.gemm(grad__BO, self.W) - if sgd is not None: - sgd(self._mem.weights, self._mem.gradient, key=self.id) - return grad__BI - - return output__BO, finish_update - - -def build_tagger_model(nr_class, **cfg): - embed_size = util.env_opt("embed_size", 2000) - if "token_vector_width" in cfg: - token_vector_width = cfg["token_vector_width"] - else: - token_vector_width = util.env_opt("token_vector_width", 96) - pretrained_vectors = cfg.get("pretrained_vectors") - subword_features = cfg.get("subword_features", True) - with Model.define_operators({">>": chain, "+": add}): - if "tok2vec" in cfg: - tok2vec = cfg["tok2vec"] - else: - tok2vec = Tok2Vec( - token_vector_width, - embed_size, - subword_features=subword_features, - pretrained_vectors=pretrained_vectors, - ) - softmax = with_flatten(Softmax(nr_class, token_vector_width)) - model = tok2vec >> softmax - model.nI = None - model.tok2vec = tok2vec - model.softmax = softmax - return model - - -def build_morphologizer_model(class_nums, **cfg): - embed_size = util.env_opt("embed_size", 7000) - if "token_vector_width" in cfg: - token_vector_width = cfg["token_vector_width"] - else: - token_vector_width = util.env_opt("token_vector_width", 128) - pretrained_vectors = cfg.get("pretrained_vectors") - char_embed = cfg.get("char_embed", True) - with Model.define_operators({">>": chain, "+": add, "**": clone}): - if "tok2vec" in cfg: - tok2vec = cfg["tok2vec"] - else: - tok2vec = Tok2Vec( - token_vector_width, - embed_size, - char_embed=char_embed, - pretrained_vectors=pretrained_vectors, - ) - softmax = with_flatten(MultiSoftmax(class_nums, token_vector_width)) - softmax.out_sizes = class_nums - model = tok2vec >> softmax - model.nI = None - model.tok2vec = tok2vec - model.softmax = softmax - return model - - -@layerize -def SpacyVectors(docs, drop=0.0): - batch = [] - for doc in docs: - indices = numpy.zeros((len(doc),), dtype="i") - for i, word in enumerate(doc): - if word.orth in doc.vocab.vectors.key2row: - indices[i] = doc.vocab.vectors.key2row[word.orth] - else: - indices[i] = 0 - vectors = doc.vocab.vectors.data[indices] - batch.append(vectors) - return batch, None - - -def build_text_classifier(nr_class, width=64, **cfg): - depth = cfg.get("depth", 2) - nr_vector = cfg.get("nr_vector", 5000) - pretrained_dims = cfg.get("pretrained_dims", 0) - with Model.define_operators({">>": chain, "+": add, "|": concatenate, "**": clone}): - if cfg.get("low_data") and pretrained_dims: - model = ( - SpacyVectors - >> flatten_add_lengths - >> with_getitem(0, Affine(width, pretrained_dims)) - >> ParametricAttention(width) - >> Pooling(sum_pool) - >> Residual(ReLu(width, 
width)) ** 2 - >> zero_init(Affine(nr_class, width, drop_factor=0.0)) - >> logistic - ) - return model - - lower = HashEmbed(width, nr_vector, column=1, seed=10) - prefix = HashEmbed(width // 2, nr_vector, column=2, seed=11) - suffix = HashEmbed(width // 2, nr_vector, column=3, seed=12) - shape = HashEmbed(width // 2, nr_vector, column=4, seed=13) - - trained_vectors = FeatureExtracter( - [ORTH, LOWER, PREFIX, SUFFIX, SHAPE, ID] - ) >> with_flatten( - uniqued( - (lower | prefix | suffix | shape) - >> LN(Maxout(width, width + (width // 2) * 3)), - column=0, - ) - ) - - if pretrained_dims: - static_vectors = SpacyVectors >> with_flatten( - Affine(width, pretrained_dims) - ) - # TODO Make concatenate support lists - vectors = concatenate_lists(trained_vectors, static_vectors) - vectors_width = width * 2 - else: - vectors = trained_vectors - vectors_width = width - static_vectors = None - tok2vec = vectors >> with_flatten( - LN(Maxout(width, vectors_width)) - >> Residual((ExtractWindow(nW=1) >> LN(Maxout(width, width * 3)))) ** depth, - pad=depth, - ) - cnn_model = ( - tok2vec - >> flatten_add_lengths - >> ParametricAttention(width) - >> Pooling(sum_pool) - >> Residual(zero_init(Maxout(width, width))) - >> zero_init(Affine(nr_class, width, drop_factor=0.0)) - ) - - linear_model = build_bow_text_classifier( - nr_class, - ngram_size=cfg.get("ngram_size", 1), - exclusive_classes=cfg.get("exclusive_classes", False), - ) - if cfg.get("exclusive_classes", False): - output_layer = Softmax(nr_class, nr_class * 2) - else: - output_layer = ( - zero_init(Affine(nr_class, nr_class * 2, drop_factor=0.0)) >> logistic - ) - model = (linear_model | cnn_model) >> output_layer - model.tok2vec = chain(tok2vec, flatten) - model.nO = nr_class - model.lsuv = False - return model - - -def build_bow_text_classifier( - nr_class, ngram_size=1, exclusive_classes=False, no_output_layer=False, **cfg -): - with Model.define_operators({">>": chain}): - model = with_cpu( - Model.ops, extract_ngrams(ngram_size, attr=ORTH) >> LinearModel(nr_class) - ) - if not no_output_layer: - model = model >> (cpu_softmax if exclusive_classes else logistic) - model.nO = nr_class - return model - - -@layerize -def cpu_softmax(X, drop=0.0): - ops = NumpyOps() - - def cpu_softmax_backward(dY, sgd=None): - return dY - - return ops.softmax(X), cpu_softmax_backward - - -def build_simple_cnn_text_classifier(tok2vec, nr_class, exclusive_classes=False, **cfg): - """ - Build a simple CNN text classifier, given a token-to-vector model as inputs. - If exclusive_classes=True, a softmax non-linearity is applied, so that the - outputs sum to 1. If exclusive_classes=False, a logistic non-linearity - is applied instead, so that outputs are in the range [0, 1]. 
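The softmax-versus-logistic distinction described above determines whether class scores compete. A quick self-contained numpy sketch (toy scores, not tied to the layers in this file) shows that a softmax row sums to 1, while an element-wise logistic scores each class independently in (0, 1):

import numpy as np

scores = np.array([[2.0, 1.0, 0.1]])  # one example, three class scores

softmax = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
logistic = 1.0 / (1.0 + np.exp(-scores))

print(softmax.sum(axis=1))  # [1.] -> mutually exclusive classes
print(logistic)             # each score independently in (0, 1); rows need not sum to 1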
- """ - with Model.define_operators({">>": chain}): - if exclusive_classes: - output_layer = Softmax(nr_class, tok2vec.nO) - else: - output_layer = ( - zero_init(Affine(nr_class, tok2vec.nO, drop_factor=0.0)) >> logistic - ) - model = tok2vec >> flatten_add_lengths >> Pooling(mean_pool) >> output_layer - model.tok2vec = chain(tok2vec, flatten) - model.nO = nr_class - return model - - -def build_nel_encoder(embed_width, hidden_width, ner_types, **cfg): - if "entity_width" not in cfg: - raise ValueError(Errors.E144.format(param="entity_width")) - - conv_depth = cfg.get("conv_depth", 2) - cnn_maxout_pieces = cfg.get("cnn_maxout_pieces", 3) - pretrained_vectors = cfg.get("pretrained_vectors", None) - context_width = cfg.get("entity_width") - - with Model.define_operators({">>": chain, "**": clone}): - # context encoder - tok2vec = Tok2Vec( - width=hidden_width, - embed_size=embed_width, - pretrained_vectors=pretrained_vectors, - cnn_maxout_pieces=cnn_maxout_pieces, - subword_features=True, - conv_depth=conv_depth, - bilstm_depth=0, - ) - - model = ( - tok2vec - >> flatten_add_lengths - >> Pooling(mean_pool) - >> Residual(zero_init(Maxout(hidden_width, hidden_width))) - >> zero_init(Affine(context_width, hidden_width, drop_factor=0.0)) - ) - - model.tok2vec = tok2vec - model.nO = context_width - return model - - -@layerize -def flatten(seqs, drop=0.0): - ops = Model.ops - lengths = ops.asarray([len(seq) for seq in seqs], dtype="i") - - def finish_update(d_X, sgd=None): - return ops.unflatten(d_X, lengths, pad=0) - - X = ops.flatten(seqs, pad=0) - return X, finish_update - - -def concatenate_lists(*layers, **kwargs): # pragma: no cover - """Compose two or more models `f`, `g`, etc, such that their outputs are - concatenated, i.e. `concatenate(f, g)(x)` computes `hstack(f(x), g(x))` - """ - if not layers: - return noop() - drop_factor = kwargs.get("drop_factor", 1.0) - ops = layers[0].ops - layers = [chain(layer, flatten) for layer in layers] - concat = concatenate(*layers) - - def concatenate_lists_fwd(Xs, drop=0.0): - if drop is not None: - drop *= drop_factor - lengths = ops.asarray([len(X) for X in Xs], dtype="i") - flat_y, bp_flat_y = concat.begin_update(Xs, drop=drop) - ys = ops.unflatten(flat_y, lengths) - - def concatenate_lists_bwd(d_ys, sgd=None): - return bp_flat_y(ops.flatten(d_ys), sgd=sgd) - - return ys, concatenate_lists_bwd - - model = wrap(concatenate_lists_fwd, concat) - return model - - -def masked_language_model(vocab, model, mask_prob=0.15): - """Convert a model into a BERT-style masked language model""" - - random_words = _RandomWords(vocab) - - def mlm_forward(docs, drop=0.0): - mask, docs = _apply_mask(docs, random_words, mask_prob=mask_prob) - mask = model.ops.asarray(mask).reshape((mask.shape[0], 1)) - output, backprop = model.begin_update(docs, drop=drop) - - def mlm_backward(d_output, sgd=None): - d_output *= 1 - mask - # Rescale gradient for number of instances. 
- d_output *= mask.size - mask.sum() - return backprop(d_output, sgd=sgd) - - return output, mlm_backward - - return wrap(mlm_forward, model) - - -class _RandomWords(object): - def __init__(self, vocab): - self.words = [lex.text for lex in vocab if lex.prob != 0.0] - self.probs = [lex.prob for lex in vocab if lex.prob != 0.0] - self.words = self.words[:10000] - self.probs = self.probs[:10000] - self.probs = numpy.exp(numpy.array(self.probs, dtype="f")) - self.probs /= self.probs.sum() - self._cache = [] - - def next(self): - if not self._cache: - self._cache.extend( - numpy.random.choice(len(self.words), 10000, p=self.probs) - ) - index = self._cache.pop() - return self.words[index] - - -def _apply_mask(docs, random_words, mask_prob=0.15): - # This needs to be here to avoid circular imports - from .tokens.doc import Doc - - N = sum(len(doc) for doc in docs) - mask = numpy.random.uniform(0.0, 1.0, (N,)) - mask = mask >= mask_prob - i = 0 - masked_docs = [] - for doc in docs: - words = [] - for token in doc: - if not mask[i]: - word = _replace_word(token.text, random_words) - else: - word = token.text - words.append(word) - i += 1 - spaces = [bool(w.whitespace_) for w in doc] - # NB: If you change this implementation to instead modify - # the docs in place, take care that the IDs reflect the original - # words. Currently we use the original docs to make the vectors - # for the target, so we don't lose the original tokens. But if - # you modified the docs in place here, you would. - masked_docs.append(Doc(doc.vocab, words=words, spaces=spaces)) - return mask, masked_docs - - -def _replace_word(word, random_words, mask="[MASK]"): - roll = numpy.random.random() - if roll < 0.8: - return mask - elif roll < 0.9: - return random_words.next() - else: - return word - - -def _uniform_init(lo, hi): - def wrapped(W, ops): - copy_array(W, ops.xp.random.uniform(lo, hi, W.shape)) - - return wrapped - - -@describe.attributes( - nM=Dimension("Vector dimensions"), - nC=Dimension("Number of characters per word"), - vectors=Synapses( - "Embed matrix", lambda obj: (obj.nC, obj.nV, obj.nM), _uniform_init(-0.1, 0.1) - ), - d_vectors=Gradient("vectors"), -) -class CharacterEmbed(Model): - def __init__(self, nM=None, nC=None, **kwargs): - Model.__init__(self, **kwargs) - self.nM = nM - self.nC = nC - - @property - def nO(self): - return self.nM * self.nC - - @property - def nV(self): - return 256 - - def begin_update(self, docs, drop=0.0): - if not docs: - return [] - ids = [] - output = [] - weights = self.vectors - # This assists in indexing; it's like looping over this dimension. - # Still consider this weird witch craft...But thanks to Mark Neumann - # for the tip. - nCv = self.ops.xp.arange(self.nC) - for doc in docs: - doc_ids = self.ops.asarray(doc.to_utf8_array(nr_char=self.nC)) - doc_vectors = self.ops.allocate((len(doc), self.nC, self.nM)) - # Let's say I have a 2d array of indices, and a 3d table of data. What numpy - # incantation do I chant to get - # output[i, j, k] == data[j, ids[i, j], k]? 
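# A tiny self-contained numpy check of exactly that identity (shapes made up
# for the demo; `data` plays the role of `weights`, `ids` of `doc_ids`):
import numpy as np

nC, nV, nM, N = 4, 7, 3, 5
data = np.random.rand(nC, nV, nM)
ids = np.random.randint(0, nV, size=(N, nC))
nCv = np.arange(nC)
out = np.empty((N, nC, nM))
out[:, nCv] = data[nCv, ids[:, nCv]]
for i in range(N):
    for j in range(nC):
        assert (out[i, j] == data[j, ids[i, j]]).all()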
- doc_vectors[:, nCv] = weights[nCv, doc_ids[:, nCv]] - output.append(doc_vectors.reshape((len(doc), self.nO))) - ids.append(doc_ids) - - def backprop_character_embed(d_vectors, sgd=None): - gradient = self.d_vectors - for doc_ids, d_doc_vectors in zip(ids, d_vectors): - d_doc_vectors = d_doc_vectors.reshape((len(doc_ids), self.nC, self.nM)) - gradient[nCv, doc_ids[:, nCv]] += d_doc_vectors[:, nCv] - if sgd is not None: - sgd(self._mem.weights, self._mem.gradient, key=self.id) - return None - - return output, backprop_character_embed - - -def get_cossim_loss(yh, y, ignore_zeros=False): - xp = get_array_module(yh) - # Find the zero vectors - if ignore_zeros: - zero_indices = xp.abs(y).sum(axis=1) == 0 - # Add a small constant to avoid 0 vectors - yh = yh + 1e-8 - y = y + 1e-8 - # https://math.stackexchange.com/questions/1923613/partial-derivative-of-cosine-similarity - norm_yh = xp.linalg.norm(yh, axis=1, keepdims=True) - norm_y = xp.linalg.norm(y, axis=1, keepdims=True) - mul_norms = norm_yh * norm_y - cosine = (yh * y).sum(axis=1, keepdims=True) / mul_norms - d_yh = (y / mul_norms) - (cosine * (yh / norm_yh ** 2)) - losses = xp.abs(cosine - 1) - if ignore_zeros: - # If the target was a zero vector, don't count it in the loss. - d_yh[zero_indices] = 0 - losses[zero_indices] = 0 - loss = losses.sum() - return loss, -d_yh - - -def get_characters_loss(ops, docs, prediction, nr_char=10): - target_ids = numpy.vstack([doc.to_utf8_array(nr_char=nr_char) for doc in docs]) - target_ids = target_ids.reshape((-1,)) - target = ops.asarray(to_categorical(target_ids, nb_classes=256), dtype="f") - target = target.reshape((-1, 256*nr_char)) - diff = prediction - target - loss = (diff**2).sum() - d_target = diff / float(prediction.shape[0]) - return loss, d_target - - - diff --git a/spacy/about.py b/spacy/about.py index 42c38cda5..9c5dd0b4f 100644 --- a/spacy/about.py +++ b/spacy/about.py @@ -1,7 +1,7 @@ # fmt: off -__title__ = "spacy" -__version__ = "2.3.2" -__release__ = True +__title__ = "spacy-nightly" +__version__ = "3.0.0a41" __download_url__ = "https://github.com/explosion/spacy-models/releases/download" __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json" -__shortcuts__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/shortcuts-v2.json" +__projects__ = "https://github.com/explosion/projects" +__projects_branch__ = "v3" diff --git a/spacy/analysis.py b/spacy/analysis.py deleted file mode 100644 index 960ce6c0f..000000000 --- a/spacy/analysis.py +++ /dev/null @@ -1,181 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import warnings - -from collections import OrderedDict -from wasabi import Printer - -from .tokens import Doc, Token, Span -from .errors import Errors, Warnings - - -def analyze_pipes(pipeline, name, pipe, index, warn=True): - """Analyze a pipeline component with respect to its position in the current - pipeline and the other components. Will check whether requirements are - fulfilled (e.g. if previous components assign the attributes). - - pipeline (list): A list of (name, pipe) tuples e.g. nlp.pipeline. - name (unicode): The name of the pipeline component to analyze. - pipe (callable): The pipeline component function to analyze. - index (int): The index of the component in the pipeline. - warn (bool): Show user warning if problem is found. - RETURNS (list): The problems found for the given pipeline component. 
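The gradient returned by get_cossim_loss above follows the cosine-similarity derivative referenced in the linked thread. A quick finite-difference sanity check, using a standalone numpy re-implementation of the same formula (zero-vector handling omitted):

import numpy as np

def cossim_loss(yh, y):
    yh = yh + 1e-8
    y = y + 1e-8
    norm_yh = np.linalg.norm(yh, axis=1, keepdims=True)
    norm_y = np.linalg.norm(y, axis=1, keepdims=True)
    mul_norms = norm_yh * norm_y
    cosine = (yh * y).sum(axis=1, keepdims=True) / mul_norms
    d_yh = (y / mul_norms) - (cosine * (yh / norm_yh ** 2))
    return np.abs(cosine - 1).sum(), -d_yh

rng = np.random.RandomState(0)
yh, y = rng.rand(3, 5), rng.rand(3, 5)
loss, grad = cossim_loss(yh, y)
eps = 1e-6
yh_shift = yh.copy()
yh_shift[0, 0] += eps
loss_shift, _ = cossim_loss(yh_shift, y)
assert abs((loss_shift - loss) / eps - grad[0, 0]) < 1e-4  # analytic grad matches finite difference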
- """ - assert pipeline[index][0] == name - prev_pipes = pipeline[:index] - pipe_requires = getattr(pipe, "requires", []) - requires = OrderedDict([(annot, False) for annot in pipe_requires]) - if requires: - for prev_name, prev_pipe in prev_pipes: - prev_assigns = getattr(prev_pipe, "assigns", []) - for annot in prev_assigns: - requires[annot] = True - problems = [] - for annot, fulfilled in requires.items(): - if not fulfilled: - problems.append(annot) - if warn: - warnings.warn(Warnings.W025.format(name=name, attr=annot)) - return problems - - -def analyze_all_pipes(pipeline, warn=True): - """Analyze all pipes in the pipeline in order. - - pipeline (list): A list of (name, pipe) tuples e.g. nlp.pipeline. - warn (bool): Show user warning if problem is found. - RETURNS (dict): The problems found, keyed by component name. - """ - problems = {} - for i, (name, pipe) in enumerate(pipeline): - problems[name] = analyze_pipes(pipeline, name, pipe, i, warn=warn) - return problems - - -def dot_to_dict(values): - """Convert dot notation to a dict. For example: ["token.pos", "token._.xyz"] - become {"token": {"pos": True, "_": {"xyz": True }}}. - - values (iterable): The values to convert. - RETURNS (dict): The converted values. - """ - result = {} - for value in values: - path = result - parts = value.lower().split(".") - for i, item in enumerate(parts): - is_last = i == len(parts) - 1 - path = path.setdefault(item, True if is_last else {}) - return result - - -def validate_attrs(values): - """Validate component attributes provided to "assigns", "requires" etc. - Raises error for invalid attributes and formatting. Doesn't check if - custom extension attributes are registered, since this is something the - user might want to do themselves later in the component. - - values (iterable): The string attributes to check, e.g. `["token.pos"]`. - RETURNS (iterable): The checked attributes. 
- """ - data = dot_to_dict(values) - objs = {"doc": Doc, "token": Token, "span": Span} - for obj_key, attrs in data.items(): - if obj_key == "span": - # Support Span only for custom extension attributes - span_attrs = [attr for attr in values if attr.startswith("span.")] - span_attrs = [attr for attr in span_attrs if not attr.startswith("span._.")] - if span_attrs: - raise ValueError(Errors.E180.format(attrs=", ".join(span_attrs))) - if obj_key not in objs: # first element is not doc/token/span - invalid_attrs = ", ".join(a for a in values if a.startswith(obj_key)) - raise ValueError(Errors.E181.format(obj=obj_key, attrs=invalid_attrs)) - if not isinstance(attrs, dict): # attr is something like "doc" - raise ValueError(Errors.E182.format(attr=obj_key)) - for attr, value in attrs.items(): - if attr == "_": - if value is True: # attr is something like "doc._" - raise ValueError(Errors.E182.format(attr="{}._".format(obj_key))) - for ext_attr, ext_value in value.items(): - # We don't check whether the attribute actually exists - if ext_value is not True: # attr is something like doc._.x.y - good = "{}._.{}".format(obj_key, ext_attr) - bad = "{}.{}".format(good, ".".join(ext_value)) - raise ValueError(Errors.E183.format(attr=bad, solution=good)) - continue # we can't validate those further - if attr.endswith("_"): # attr is something like "token.pos_" - raise ValueError(Errors.E184.format(attr=attr, solution=attr[:-1])) - if value is not True: # attr is something like doc.x.y - good = "{}.{}".format(obj_key, attr) - bad = "{}.{}".format(good, ".".join(value)) - raise ValueError(Errors.E183.format(attr=bad, solution=good)) - obj = objs[obj_key] - if not hasattr(obj, attr): - raise ValueError(Errors.E185.format(obj=obj_key, attr=attr)) - return values - - -def _get_feature_for_attr(pipeline, attr, feature): - assert feature in ["assigns", "requires"] - result = [] - for pipe_name, pipe in pipeline: - pipe_assigns = getattr(pipe, feature, []) - if attr in pipe_assigns: - result.append((pipe_name, pipe)) - return result - - -def get_assigns_for_attr(pipeline, attr): - """Get all pipeline components that assign an attr, e.g. "doc.tensor". - - pipeline (list): A list of (name, pipe) tuples e.g. nlp.pipeline. - attr (unicode): The attribute to check. - RETURNS (list): (name, pipeline) tuples of components that assign the attr. - """ - return _get_feature_for_attr(pipeline, attr, "assigns") - - -def get_requires_for_attr(pipeline, attr): - """Get all pipeline components that require an attr, e.g. "doc.tensor". - - pipeline (list): A list of (name, pipe) tuples e.g. nlp.pipeline. - attr (unicode): The attribute to check. - RETURNS (list): (name, pipeline) tuples of components that require the attr. - """ - return _get_feature_for_attr(pipeline, attr, "requires") - - -def print_summary(nlp, pretty=True, no_print=False): - """Print a formatted summary for the current nlp object's pipeline. Shows - a table with the pipeline components and why they assign and require, as - well as any problems if available. - - nlp (Language): The nlp object. - pretty (bool): Pretty-print the results (color etc). - no_print (bool): Don't print anything, just return the data. - RETURNS (dict): A dict with "overview" and "problems". 
- """ - msg = Printer(pretty=pretty, no_print=no_print) - overview = [] - problems = {} - for i, (name, pipe) in enumerate(nlp.pipeline): - requires = getattr(pipe, "requires", []) - assigns = getattr(pipe, "assigns", []) - retok = getattr(pipe, "retokenizes", False) - overview.append((i, name, requires, assigns, retok)) - problems[name] = analyze_pipes(nlp.pipeline, name, pipe, i, warn=False) - msg.divider("Pipeline Overview") - header = ("#", "Component", "Requires", "Assigns", "Retokenizes") - msg.table(overview, header=header, divider=True, multiline=True) - n_problems = sum(len(p) for p in problems.values()) - if any(p for p in problems.values()): - msg.divider("Problems ({})".format(n_problems)) - for name, problem in problems.items(): - if problem: - problem = ", ".join(problem) - msg.warn("'{}' requirements not met: {}".format(name, problem)) - else: - msg.good("No problems found.") - if no_print: - return {"overview": overview, "problems": problems} diff --git a/spacy/attrs.pxd b/spacy/attrs.pxd index 805dc2950..33d5372de 100644 --- a/spacy/attrs.pxd +++ b/spacy/attrs.pxd @@ -91,6 +91,7 @@ cdef enum attr_id_t: LANG ENT_KB_ID = symbols.ENT_KB_ID + MORPH ENT_ID = symbols.ENT_ID IDX diff --git a/spacy/attrs.pyx b/spacy/attrs.pyx index fe9895d06..b15db7599 100644 --- a/spacy/attrs.pyx +++ b/spacy/attrs.pyx @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - IDS = { "": NULL_ATTR, @@ -92,6 +89,7 @@ IDS = { "SPACY": SPACY, "PROB": PROB, "LANG": LANG, + "MORPH": MORPH, "IDX": IDX } diff --git a/spacy/cli/__init__.py b/spacy/cli/__init__.py index 778453711..7368bcef3 100644 --- a/spacy/cli/__init__.py +++ b/spacy/cli/__init__.py @@ -1,12 +1,37 @@ +from wasabi import msg + +from ._util import app, setup_cli # noqa: F401 + +# These are the actual functions, NOT the wrapped CLI commands. The CLI commands +# are registered automatically and won't have to be imported here. from .download import download # noqa: F401 from .info import info # noqa: F401 -from .link import link # noqa: F401 from .package import package # noqa: F401 from .profile import profile # noqa: F401 -from .train import train # noqa: F401 +from .train import train_cli # noqa: F401 from .pretrain import pretrain # noqa: F401 from .debug_data import debug_data # noqa: F401 +from .debug_config import debug_config # noqa: F401 +from .debug_model import debug_model # noqa: F401 from .evaluate import evaluate # noqa: F401 from .convert import convert # noqa: F401 -from .init_model import init_model # noqa: F401 +from .init_pipeline import init_pipeline_cli # noqa: F401 +from .init_config import init_config, fill_config # noqa: F401 from .validate import validate # noqa: F401 +from .project.clone import project_clone # noqa: F401 +from .project.assets import project_assets # noqa: F401 +from .project.run import project_run # noqa: F401 +from .project.dvc import project_update_dvc # noqa: F401 +from .project.push import project_push # noqa: F401 +from .project.pull import project_pull # noqa: F401 +from .project.document import project_document # noqa: F401 + + +@app.command("link", no_args_is_help=True, deprecated=True, hidden=True) +def link(*args, **kwargs): + """As of spaCy v3.0, symlinks like "en" are deprecated. You can load trained + pipeline packages using their full names or from a directory path.""" + msg.warn( + "As of spaCy v3.0, model symlinks are deprecated. You can load trained " + "pipeline packages using their full names or from a directory path." 
+ ) diff --git a/spacy/cli/_schemas.py b/spacy/cli/_schemas.py deleted file mode 100644 index 3fb2c8979..000000000 --- a/spacy/cli/_schemas.py +++ /dev/null @@ -1,220 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - - -# NB: This schema describes the new format of the training data, see #2928 -TRAINING_SCHEMA = { - "$schema": "http://json-schema.org/draft-06/schema", - "title": "Training data for spaCy models", - "type": "array", - "items": { - "type": "object", - "properties": { - "text": { - "title": "The text of the training example", - "type": "string", - "minLength": 1, - }, - "ents": { - "title": "Named entity spans in the text", - "type": "array", - "items": { - "type": "object", - "properties": { - "start": { - "title": "Start character offset of the span", - "type": "integer", - "minimum": 0, - }, - "end": { - "title": "End character offset of the span", - "type": "integer", - "minimum": 0, - }, - "label": { - "title": "Entity label", - "type": "string", - "minLength": 1, - "pattern": "^[A-Z0-9]*$", - }, - }, - "required": ["start", "end", "label"], - }, - }, - "sents": { - "title": "Sentence spans in the text", - "type": "array", - "items": { - "type": "object", - "properties": { - "start": { - "title": "Start character offset of the span", - "type": "integer", - "minimum": 0, - }, - "end": { - "title": "End character offset of the span", - "type": "integer", - "minimum": 0, - }, - }, - "required": ["start", "end"], - }, - }, - "cats": { - "title": "Text categories for the text classifier", - "type": "object", - "patternProperties": { - "*": { - "title": "A text category", - "oneOf": [ - {"type": "boolean"}, - {"type": "number", "minimum": 0}, - ], - } - }, - "propertyNames": {"pattern": "^[A-Z0-9]*$", "minLength": 1}, - }, - "tokens": { - "title": "The tokens in the text", - "type": "array", - "items": { - "type": "object", - "minProperties": 1, - "properties": { - "id": { - "title": "Token ID, usually token index", - "type": "integer", - "minimum": 0, - }, - "start": { - "title": "Start character offset of the token", - "type": "integer", - "minimum": 0, - }, - "end": { - "title": "End character offset of the token", - "type": "integer", - "minimum": 0, - }, - "pos": { - "title": "Coarse-grained part-of-speech tag", - "type": "string", - "minLength": 1, - }, - "tag": { - "title": "Fine-grained part-of-speech tag", - "type": "string", - "minLength": 1, - }, - "dep": { - "title": "Dependency label", - "type": "string", - "minLength": 1, - }, - "head": { - "title": "Index of the token's head", - "type": "integer", - "minimum": 0, - }, - }, - "required": ["start", "end"], - }, - }, - "_": {"title": "Custom user space", "type": "object"}, - }, - "required": ["text"], - }, -} - -META_SCHEMA = { - "$schema": "http://json-schema.org/draft-06/schema", - "type": "object", - "properties": { - "lang": { - "title": "Two-letter language code, e.g. 'en'", - "type": "string", - "minLength": 2, - "maxLength": 2, - "pattern": "^[a-z]*$", - }, - "name": { - "title": "Model name", - "type": "string", - "minLength": 1, - "pattern": "^[a-z_]*$", - }, - "version": { - "title": "Model version", - "type": "string", - "minLength": 1, - "pattern": "^[0-9a-z.-]*$", - }, - "spacy_version": { - "title": "Compatible spaCy version identifier", - "type": "string", - "minLength": 1, - "pattern": "^[0-9a-z.-><=]*$", - }, - "parent_package": { - "title": "Name of parent spaCy package, e.g. 
spacy or spacy-nightly", - "type": "string", - "minLength": 1, - "default": "spacy", - }, - "pipeline": { - "title": "Names of pipeline components", - "type": "array", - "items": {"type": "string", "minLength": 1}, - }, - "description": {"title": "Model description", "type": "string"}, - "license": {"title": "Model license", "type": "string"}, - "author": {"title": "Model author name", "type": "string"}, - "email": {"title": "Model author email", "type": "string", "format": "email"}, - "url": {"title": "Model author URL", "type": "string", "format": "uri"}, - "sources": { - "title": "Training data sources", - "type": "array", - "items": {"type": "string"}, - }, - "vectors": { - "title": "Included word vectors", - "type": "object", - "properties": { - "keys": { - "title": "Number of unique keys", - "type": "integer", - "minimum": 0, - }, - "vectors": { - "title": "Number of unique vectors", - "type": "integer", - "minimum": 0, - }, - "width": { - "title": "Number of dimensions", - "type": "integer", - "minimum": 0, - }, - }, - }, - "accuracy": { - "title": "Accuracy numbers", - "type": "object", - "patternProperties": {"*": {"type": "number", "minimum": 0.0}}, - }, - "speed": { - "title": "Speed evaluation numbers", - "type": "object", - "patternProperties": { - "*": { - "oneOf": [ - {"type": "number", "minimum": 0.0}, - {"type": "integer", "minimum": 0}, - ] - } - }, - }, - }, - "required": ["lang", "name", "version"], -} diff --git a/spacy/cli/_util.py b/spacy/cli/_util.py new file mode 100644 index 000000000..60e400fb4 --- /dev/null +++ b/spacy/cli/_util.py @@ -0,0 +1,478 @@ +from typing import Dict, Any, Union, List, Optional, Tuple, Iterable, TYPE_CHECKING +import sys +import shutil +from pathlib import Path +from wasabi import msg +import srsly +import hashlib +import typer +from click import NoSuchOption +from click.parser import split_arg_string +from typer.main import get_command +from contextlib import contextmanager +from thinc.api import Config, ConfigValidationError, require_gpu +from configparser import InterpolationError +import os + +from ..schemas import ProjectConfigSchema, validate +from ..util import import_file, run_command, make_tempdir, registry, logger +from ..util import is_compatible_version, ENV_VARS +from .. import about + +if TYPE_CHECKING: + from pathy import Pathy # noqa: F401 + + +PROJECT_FILE = "project.yml" +PROJECT_LOCK = "project.lock" +COMMAND = "python -m spacy" +NAME = "spacy" +HELP = """spaCy Command-line Interface + +DOCS: https://nightly.spacy.io/api/cli +""" +PROJECT_HELP = f"""Command-line interface for spaCy projects and templates. +You'd typically start by cloning a project template to a local directory and +fetching its assets like datasets etc. See the project's {PROJECT_FILE} for the +available commands. +""" +DEBUG_HELP = """Suite of helpful commands for debugging and profiling. Includes +commands to check and validate your config files, training and evaluation data, +and custom model implementations. +""" +INIT_HELP = """Commands for initializing configs and pipeline packages.""" + +# Wrappers for Typer's annotations. Initially created to set defaults and to +# keep the names short, but not needed at the moment. 
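For readers unfamiliar with Typer: the Arg/Opt aliases defined just below are plain typer.Argument/typer.Option. A minimal, self-contained sketch of the same pattern (toy command and option names, not part of the spaCy CLI):

import typer

app = typer.Typer(name="demo", help="Toy CLI using the same Arg/Opt aliasing pattern")
Arg = typer.Argument
Opt = typer.Option

@app.command("hello")
def hello(
    name: str = Arg(..., help="Who to greet"),
    shout: bool = Opt(False, "--shout", "-s", help="Uppercase the greeting"),
):
    greeting = f"hello {name}"
    typer.echo(greeting.upper() if shout else greeting)

if __name__ == "__main__":
    app()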
+Arg = typer.Argument +Opt = typer.Option + +app = typer.Typer(name=NAME, help=HELP) +project_cli = typer.Typer(name="project", help=PROJECT_HELP, no_args_is_help=True) +debug_cli = typer.Typer(name="debug", help=DEBUG_HELP, no_args_is_help=True) +init_cli = typer.Typer(name="init", help=INIT_HELP, no_args_is_help=True) + +app.add_typer(project_cli) +app.add_typer(debug_cli) +app.add_typer(init_cli) + + +def setup_cli() -> None: + # Make sure the entry-point for CLI runs, so that they get imported. + registry.cli.get_all() + # Ensure that the help messages always display the correct prompt + command = get_command(app) + command(prog_name=COMMAND) + + +def parse_config_overrides( + args: List[str], env_var: Optional[str] = ENV_VARS.CONFIG_OVERRIDES +) -> Dict[str, Any]: + """Generate a dictionary of config overrides based on the extra arguments + provided on the CLI, e.g. --training.batch_size to override + "training.batch_size". Arguments without a "." are considered invalid, + since the config only allows top-level sections to exist. + + env_vars (Optional[str]): Optional environment variable to read from. + RETURNS (Dict[str, Any]): The parsed dict, keyed by nested config setting. + """ + env_string = os.environ.get(env_var, "") if env_var else "" + env_overrides = _parse_overrides(split_arg_string(env_string)) + cli_overrides = _parse_overrides(args, is_cli=True) + if cli_overrides: + keys = [k for k in cli_overrides if k not in env_overrides] + logger.debug(f"Config overrides from CLI: {keys}") + if env_overrides: + logger.debug(f"Config overrides from env variables: {list(env_overrides)}") + return {**cli_overrides, **env_overrides} + + +def _parse_overrides(args: List[str], is_cli: bool = False) -> Dict[str, Any]: + result = {} + while args: + opt = args.pop(0) + err = f"Invalid config override '{opt}'" + if opt.startswith("--"): # new argument + orig_opt = opt + opt = opt.replace("--", "") + if "." not in opt: + if is_cli: + raise NoSuchOption(orig_opt) + else: + msg.fail(f"{err}: can't override top-level sections", exits=1) + if "=" in opt: # we have --opt=value + opt, value = opt.split("=", 1) + opt = opt.replace("-", "_") + else: + if not args or args[0].startswith("--"): # flag with no value + value = "true" + else: + value = args.pop(0) + # Just like we do in the config, we're calling json.loads on the + # values. But since they come from the CLI, it'd be unintuitive to + # explicitly mark strings with escaped quotes. So we're working + # around that here by falling back to a string if parsing fails. + # TODO: improve logic to handle simple types like list of strings? + try: + result[opt] = srsly.json_loads(value) + except ValueError: + result[opt] = str(value) + else: + msg.fail(f"{err}: name should start with --", exits=1) + return result + + +def load_project_config(path: Path, interpolate: bool = True) -> Dict[str, Any]: + """Load the project.yml file from a directory and validate it. Also make + sure that all directories defined in the config exist. + + path (Path): The path to the project directory. + interpolate (bool): Whether to substitute project variables. + RETURNS (Dict[str, Any]): The loaded project.yml. + """ + config_path = path / PROJECT_FILE + if not config_path.exists(): + msg.fail(f"Can't find {PROJECT_FILE}", config_path, exits=1) + invalid_err = f"Invalid {PROJECT_FILE}. Double-check that the YAML is correct." 
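In practice, the overrides parsed by _parse_overrides above look like --training.batch_size 128 or --paths.train=corpus/train.spacy. A simplified standalone sketch of the same parsing logic (only the basic cases, no CLI error handling):

import srsly  # same JSON helper the CLI code uses

def parse_overrides(args):
    # simplified version of _parse_overrides above: only --section.key values
    result = {}
    while args:
        opt = args.pop(0)
        assert opt.startswith("--") and "." in opt, f"invalid override: {opt}"
        opt = opt[2:]
        if "=" in opt:  # --section.key=value
            opt, value = opt.split("=", 1)
        elif not args or args[0].startswith("--"):  # bare flag
            value = "true"
        else:  # --section.key value
            value = args.pop(0)
        try:
            result[opt.replace("-", "_")] = srsly.json_loads(value)
        except ValueError:
            result[opt.replace("-", "_")] = str(value)
    return result

print(parse_overrides(["--training.batch_size", "128", "--paths.train=corpus/train.spacy"]))
# {'training.batch_size': 128, 'paths.train': 'corpus/train.spacy'}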
+ try: + config = srsly.read_yaml(config_path) + except ValueError as e: + msg.fail(invalid_err, e, exits=1) + errors = validate(ProjectConfigSchema, config) + if errors: + msg.fail(invalid_err) + print("\n".join(errors)) + sys.exit(1) + validate_project_version(config) + validate_project_commands(config) + # Make sure directories defined in config exist + for subdir in config.get("directories", []): + dir_path = path / subdir + if not dir_path.exists(): + dir_path.mkdir(parents=True) + if interpolate: + err = "project.yml validation error" + with show_validation_error(title=err, hint_fill=False): + config = substitute_project_variables(config) + return config + + +def substitute_project_variables(config: Dict[str, Any], overrides: Dict = {}): + key = "vars" + config.setdefault(key, {}) + config[key].update(overrides) + # Need to put variables in the top scope again so we can have a top-level + # section "project" (otherwise, a list of commands in the top scope wouldn't + # be allowed by Thinc's config system) + cfg = Config({"project": config, key: config[key]}) + interpolated = cfg.interpolate() + return dict(interpolated["project"]) + + +def validate_project_version(config: Dict[str, Any]) -> None: + """If the project defines a compatible spaCy version range, check that it's + compatible with the current version of spaCy. + + config (Dict[str, Any]): The loaded config. + """ + spacy_version = config.get("spacy_version", None) + if spacy_version and not is_compatible_version(about.__version__, spacy_version): + err = ( + f"The {PROJECT_FILE} specifies a spaCy version range ({spacy_version}) " + f"that's not compatible with the version of spaCy you're running " + f"({about.__version__}). You can edit the version requirement in the " + f"{PROJECT_FILE} to load it, but the project may not run as expected." + ) + msg.fail(err, exits=1) + + +def validate_project_commands(config: Dict[str, Any]) -> None: + """Check that project commands and workflows are valid, don't contain + duplicates, don't clash and only refer to commands that exist. + + config (Dict[str, Any]): The loaded config. + """ + command_names = [cmd["name"] for cmd in config.get("commands", [])] + workflows = config.get("workflows", {}) + duplicates = set([cmd for cmd in command_names if command_names.count(cmd) > 1]) + if duplicates: + err = f"Duplicate commands defined in {PROJECT_FILE}: {', '.join(duplicates)}" + msg.fail(err, exits=1) + for workflow_name, workflow_steps in workflows.items(): + if workflow_name in command_names: + err = f"Can't use workflow name '{workflow_name}': name already exists as a command" + msg.fail(err, exits=1) + for step in workflow_steps: + if step not in command_names: + msg.fail( + f"Unknown command specified in workflow '{workflow_name}': {step}", + f"Workflows can only refer to commands defined in the 'commands' " + f"section of the {PROJECT_FILE}.", + exits=1, + ) + + +def get_hash(data, exclude: Iterable[str] = tuple()) -> str: + """Get the hash for a JSON-serializable object. + + data: The data to hash. + exclude (Iterable[str]): Top-level keys to exclude if data is a dict. + RETURNS (str): The hash. + """ + if isinstance(data, dict): + data = {k: v for k, v in data.items() if k not in exclude} + data_str = srsly.json_dumps(data, sort_keys=True).encode("utf8") + return hashlib.md5(data_str).hexdigest() + + +def get_checksum(path: Union[Path, str]) -> str: + """Get the checksum for a file or directory given its file path. If a + directory path is provided, this uses all files in that directory. 
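A short usage sketch for get_hash (toy command dict, hypothetical key names): hashing the definition while excluding volatile keys gives a stable fingerprint for change detection.

import hashlib
import srsly

command = {"name": "train", "script": ["python train.py"], "outputs": ["training/model-best"]}
stable = {k: v for k, v in command.items() if k not in ("outputs",)}
fingerprint = hashlib.md5(srsly.json_dumps(stable, sort_keys=True).encode("utf8")).hexdigest()
print(fingerprint)  # unchanged as long as "name" and "script" stay the same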
+ + path (Union[Path, str]): The file or directory path. + RETURNS (str): The checksum. + """ + path = Path(path) + if path.is_file(): + return hashlib.md5(Path(path).read_bytes()).hexdigest() + if path.is_dir(): + # TODO: this is currently pretty slow + dir_checksum = hashlib.md5() + for sub_file in sorted(fp for fp in path.rglob("*") if fp.is_file()): + dir_checksum.update(sub_file.read_bytes()) + return dir_checksum.hexdigest() + msg.fail(f"Can't get checksum for {path}: not a file or directory", exits=1) + + +@contextmanager +def show_validation_error( + file_path: Optional[Union[str, Path]] = None, + *, + title: Optional[str] = None, + desc: str = "", + show_config: Optional[bool] = None, + hint_fill: bool = True, +): + """Helper to show custom config validation errors on the CLI. + + file_path (str / Path): Optional file path of config file, used in hints. + title (str): Override title of custom formatted error. + desc (str): Override description of custom formatted error. + show_config (bool): Whether to output the config the error refers to. + hint_fill (bool): Show hint about filling config. + """ + try: + yield + except ConfigValidationError as e: + title = title if title is not None else e.title + if e.desc: + desc = f"{e.desc}" if not desc else f"{e.desc}\n\n{desc}" + # Re-generate a new error object with overrides + err = e.from_error(e, title="", desc=desc, show_config=show_config) + msg.fail(title) + print(err.text.strip()) + if hint_fill and "value_error.missing" in err.error_types: + config_path = file_path if file_path is not None else "config.cfg" + msg.text( + "If your config contains missing values, you can run the 'init " + "fill-config' command to fill in all the defaults, if possible:", + spaced=True, + ) + print(f"{COMMAND} init fill-config {config_path} {config_path} \n") + sys.exit(1) + except InterpolationError as e: + msg.fail("Config validation error", e, exits=1) + + +def import_code(code_path: Optional[Union[Path, str]]) -> None: + """Helper to import Python file provided in training commands / commands + using the config. This makes custom registered functions available. + """ + if code_path is not None: + if not Path(code_path).exists(): + msg.fail("Path to Python code not found", code_path, exits=1) + try: + import_file("python_code", code_path) + except Exception as e: + msg.fail(f"Couldn't load Python code: {code_path}", e, exits=1) + + +def upload_file(src: Path, dest: Union[str, "Pathy"]) -> None: + """Upload a file. + + src (Path): The source path. + url (str): The destination URL to upload to. + """ + import smart_open + + dest = str(dest) + with smart_open.open(dest, mode="wb") as output_file: + with src.open(mode="rb") as input_file: + output_file.write(input_file.read()) + + +def download_file(src: Union[str, "Pathy"], dest: Path, *, force: bool = False) -> None: + """Download a file using smart_open. + + url (str): The URL of the file. + dest (Path): The destination path. + force (bool): Whether to force download even if file exists. + If False, the download will be skipped. 
+ """ + import smart_open + + if dest.exists() and not force: + return None + src = str(src) + with smart_open.open(src, mode="rb", ignore_ext=True) as input_file: + with dest.open(mode="wb") as output_file: + output_file.write(input_file.read()) + + +def ensure_pathy(path): + """Temporary helper to prevent importing Pathy globally (which can cause + slow and annoying Google Cloud warning).""" + from pathy import Pathy # noqa: F811 + + return Pathy(path) + + +def git_checkout( + repo: str, subpath: str, dest: Path, *, branch: str = "master", sparse: bool = False +): + git_version = get_git_version() + if dest.exists(): + msg.fail("Destination of checkout must not exist", exits=1) + if not dest.parent.exists(): + msg.fail("Parent of destination of checkout must exist", exits=1) + if sparse and git_version >= (2, 22): + return git_sparse_checkout(repo, subpath, dest, branch) + elif sparse: + # Only show warnings if the user explicitly wants sparse checkout but + # the Git version doesn't support it + err_old = ( + f"You're running an old version of Git (v{git_version[0]}.{git_version[1]}) " + f"that doesn't fully support sparse checkout yet." + ) + err_unk = "You're running an unknown version of Git, so sparse checkout has been disabled." + msg.warn( + f"{err_unk if git_version == (0, 0) else err_old} " + f"This means that more files than necessary may be downloaded " + f"temporarily. To only download the files needed, make sure " + f"you're using Git v2.22 or above." + ) + with make_tempdir() as tmp_dir: + cmd = f"git -C {tmp_dir} clone {repo} . -b {branch}" + run_command(cmd, capture=True) + # We need Path(name) to make sure we also support subdirectories + shutil.copytree(str(tmp_dir / Path(subpath)), str(dest)) + + +def git_sparse_checkout(repo, subpath, dest, branch): + # We're using Git, partial clone and sparse checkout to + # only clone the files we need + # This ends up being RIDICULOUS. omg. + # So, every tutorial and SO post talks about 'sparse checkout'...But they + # go and *clone* the whole repo. Worthless. And cloning part of a repo + # turns out to be completely broken. The only way to specify a "path" is.. + # a path *on the server*? The contents of which, specifies the paths. Wat. + # Obviously this is hopelessly broken and insecure, because you can query + # arbitrary paths on the server! So nobody enables this. + # What we have to do is disable *all* files. We could then just checkout + # the path, and it'd "work", but be hopelessly slow...Because it goes and + # transfers every missing object one-by-one. So the final piece is that we + # need to use some weird git internals to fetch the missings in bulk, and + # *that* we can do by path. + # We're using Git and sparse checkout to only clone the files we need + with make_tempdir() as tmp_dir: + # This is the "clone, but don't download anything" part. + cmd = ( + f"git clone {repo} {tmp_dir} --no-checkout --depth 1 " + f"-b {branch} --filter=blob:none" + ) + run_command(cmd) + # Now we need to find the missing filenames for the subpath we want. + # Looking for this 'rev-list' command in the git --help? Hah. + cmd = f"git -C {tmp_dir} rev-list --objects --all --missing=print -- {subpath}" + ret = run_command(cmd, capture=True) + git_repo = _http_to_git(repo) + # Now pass those missings into another bit of git internals + missings = " ".join([x[1:] for x in ret.stdout.split() if x.startswith("?")]) + if not missings: + err = ( + f"Could not find any relevant files for '{subpath}'. 
" + f"Did you specify a correct and complete path within repo '{repo}' " + f"and branch {branch}?" + ) + msg.fail(err, exits=1) + cmd = f"git -C {tmp_dir} fetch-pack {git_repo} {missings}" + run_command(cmd, capture=True) + # And finally, we can checkout our subpath + cmd = f"git -C {tmp_dir} checkout {branch} {subpath}" + run_command(cmd, capture=True) + # We need Path(name) to make sure we also support subdirectories + shutil.move(str(tmp_dir / Path(subpath)), str(dest)) + + +def get_git_version( + error: str = "Could not run 'git'. Make sure it's installed and the executable is available.", +) -> Tuple[int, int]: + """Get the version of git and raise an error if calling 'git --version' fails. + + error (str): The error message to show. + RETURNS (Tuple[int, int]): The version as a (major, minor) tuple. Returns + (0, 0) if the version couldn't be determined. + """ + ret = run_command("git --version", capture=True) + stdout = ret.stdout.strip() + if not stdout or not stdout.startswith("git version"): + return (0, 0) + version = stdout[11:].strip().split(".") + return (int(version[0]), int(version[1])) + + +def _http_to_git(repo: str) -> str: + if repo.startswith("http://"): + repo = repo.replace(r"http://", r"https://") + if repo.startswith(r"https://"): + repo = repo.replace("https://", "git@").replace("/", ":", 1) + if repo.endswith("/"): + repo = repo[:-1] + repo = f"{repo}.git" + return repo + + +def string_to_list(value: str, intify: bool = False) -> Union[List[str], List[int]]: + """Parse a comma-separated string to a list and account for various + formatting options. Mostly used to handle CLI arguments that take a list of + comma-separated values. + + value (str): The value to parse. + intify (bool): Whether to convert values to ints. + RETURNS (Union[List[str], List[int]]): A list of strings or ints. + """ + if not value: + return [] + if value.startswith("[") and value.endswith("]"): + value = value[1:-1] + result = [] + for p in value.split(","): + p = p.strip() + if p.startswith("'") and p.endswith("'"): + p = p[1:-1] + if p.startswith('"') and p.endswith('"'): + p = p[1:-1] + p = p.strip() + if intify: + p = int(p) + result.append(p) + return result + + +def setup_gpu(use_gpu: int) -> None: + """Configure the GPU and log info.""" + if use_gpu >= 0: + msg.info(f"Using GPU: {use_gpu}") + require_gpu(use_gpu) + else: + msg.info("Using CPU") diff --git a/spacy/cli/convert.py b/spacy/cli/convert.py index fa867fa04..8413c639b 100644 --- a/spacy/cli/convert.py +++ b/spacy/cli/convert.py @@ -1,132 +1,175 @@ -# coding: utf8 -from __future__ import unicode_literals - -import plac +from typing import Optional, Any, List, Union +from enum import Enum from pathlib import Path from wasabi import Printer import srsly import re +import sys -from .converters import conllu2json, iob2json, conll_ner2json -from .converters import ner_jsonl2json +from ._util import app, Arg, Opt +from ..training import docs_to_json +from ..tokens import DocBin +from ..training.converters import iob_to_docs, conll_ner_to_docs, json_to_docs +from ..training.converters import conllu_to_docs # Converters are matched by file extension except for ner/iob, which are # matched by file extension and content. To add a converter, add a new # entry to this dict with the file extension mapped to the converter function # imported from /converters. 
+ CONVERTERS = { - "conllubio": conllu2json, - "conllu": conllu2json, - "conll": conllu2json, - "ner": conll_ner2json, - "iob": iob2json, - "jsonl": ner_jsonl2json, + "conllubio": conllu_to_docs, + "conllu": conllu_to_docs, + "conll": conllu_to_docs, + "ner": conll_ner_to_docs, + "iob": iob_to_docs, + "json": json_to_docs, } -# File types -FILE_TYPES = ("json", "jsonl", "msg") -FILE_TYPES_STDOUT = ("json", "jsonl") + +# File types that can be written to stdout +FILE_TYPES_STDOUT = ("json",) -@plac.annotations( - input_file=("Input file", "positional", None, str), - output_dir=("Output directory. '-' for stdout.", "positional", None, str), - file_type=("Type of data to produce: {}".format(FILE_TYPES), "option", "t", str), - n_sents=("Number of sentences per doc (0 to disable)", "option", "n", int), - seg_sents=("Segment sentences (for -c ner)", "flag", "s"), - model=("Model for sentence segmentation (for -s)", "option", "b", str), - converter=("Converter: {}".format(tuple(CONVERTERS.keys())), "option", "c", str), - lang=("Language (if tokenizer required)", "option", "l", str), - morphology=("Enable appending morphology to tags", "flag", "m", bool), -) -def convert( - input_file, - output_dir="-", - file_type="json", - n_sents=1, - seg_sents=False, - model=None, - morphology=False, - converter="auto", - lang=None, +class FileTypes(str, Enum): + json = "json" + spacy = "spacy" + + +@app.command("convert") +def convert_cli( + # fmt: off + input_path: str = Arg(..., help="Input file or directory", exists=True), + output_dir: Path = Arg("-", help="Output directory. '-' for stdout.", allow_dash=True, exists=True), + file_type: FileTypes = Opt("spacy", "--file-type", "-t", help="Type of data to produce"), + n_sents: int = Opt(1, "--n-sents", "-n", help="Number of sentences per doc (0 to disable)"), + seg_sents: bool = Opt(False, "--seg-sents", "-s", help="Segment sentences (for -c ner)"), + model: Optional[str] = Opt(None, "--model", "--base", "-b", help="Trained spaCy pipeline for sentence segmentation to use as base (for --seg-sents)"), + morphology: bool = Opt(False, "--morphology", "-m", help="Enable appending morphology to tags"), + merge_subtokens: bool = Opt(False, "--merge-subtokens", "-T", help="Merge CoNLL-U subtokens"), + converter: str = Opt("auto", "--converter", "-c", help=f"Converter: {tuple(CONVERTERS.keys())}"), + ner_map: Optional[Path] = Opt(None, "--ner-map", "-nm", help="NER tag mapping (as JSON-encoded dict of entity types)", exists=True), + lang: Optional[str] = Opt(None, "--lang", "-l", help="Language (if tokenizer required)"), + concatenate: bool = Opt(None, "--concatenate", "-C", help="Concatenate output to a single file"), + # fmt: on ): """ - Convert files into JSON format for use with train command and other - experiment management functions. If no output_dir is specified, the data + Convert files into json or DocBin format for training. The resulting .spacy + file can be used with the train command and other experiment management + functions. 
+ + If no output_dir is specified and the output format is JSON, the data is written to stdout, so you can pipe them forward to a JSON file: - $ spacy convert some_file.conllu > some_file.json + $ spacy convert some_file.conllu --file-type json > some_file.json + + DOCS: https://nightly.spacy.io/api/cli#convert """ - no_print = output_dir == "-" - msg = Printer(no_print=no_print) - input_path = Path(input_file) - if file_type not in FILE_TYPES: - msg.fail( - "Unknown file type: '{}'".format(file_type), - "Supported file types: '{}'".format(", ".join(FILE_TYPES)), - exits=1, - ) - if file_type not in FILE_TYPES_STDOUT and output_dir == "-": - # TODO: support msgpack via stdout in srsly? - msg.fail( - "Can't write .{} data to stdout.".format(file_type), - "Please specify an output directory.", - exits=1, - ) - if not input_path.exists(): - msg.fail("Input file not found", input_path, exits=1) - if output_dir != "-" and not Path(output_dir).exists(): - msg.fail("Output directory not found", output_dir, exits=1) - input_data = input_path.open("r", encoding="utf-8").read() - if converter == "auto": - converter = input_path.suffix[1:] - if converter == "ner" or converter == "iob": - converter_autodetect = autodetect_ner_format(input_data) - if converter_autodetect == "ner": - msg.info("Auto-detected token-per-line NER format") - converter = converter_autodetect - elif converter_autodetect == "iob": - msg.info("Auto-detected sentence-per-line NER format") - converter = converter_autodetect - else: - msg.warn( - "Can't automatically detect NER format. Conversion may not succeed. See https://spacy.io/api/cli#convert" - ) - if converter not in CONVERTERS: - msg.fail("Can't find converter for {}".format(converter), exits=1) - # Use converter function to convert data - func = CONVERTERS[converter] - data = func( - input_data, + if isinstance(file_type, FileTypes): + # We get an instance of the FileTypes from the CLI so we need its string value + file_type = file_type.value + input_path = Path(input_path) + output_dir = "-" if output_dir == Path("-") else output_dir + silent = output_dir == "-" + msg = Printer(no_print=silent) + verify_cli_args(msg, input_path, output_dir, file_type, converter, ner_map) + converter = _get_converter(msg, converter, input_path) + convert( + input_path, + output_dir, + file_type=file_type, n_sents=n_sents, seg_sents=seg_sents, - use_morphology=morphology, - lang=lang, model=model, - no_print=no_print, + morphology=morphology, + merge_subtokens=merge_subtokens, + converter=converter, + ner_map=ner_map, + lang=lang, + concatenate=concatenate, + silent=silent, + msg=msg, ) - if output_dir != "-": - # Export data to a file - suffix = ".{}".format(file_type) - output_file = Path(output_dir) / Path(input_path.parts[-1]).with_suffix(suffix) - if file_type == "json": - srsly.write_json(output_file, data) - elif file_type == "jsonl": - srsly.write_jsonl(output_file, data) - elif file_type == "msg": - srsly.write_msgpack(output_file, data) - msg.good( - "Generated output file ({} documents): {}".format(len(data), output_file) + + +def convert( + input_path: Union[str, Path], + output_dir: Union[str, Path], + *, + file_type: str = "json", + n_sents: int = 1, + seg_sents: bool = False, + model: Optional[str] = None, + morphology: bool = False, + merge_subtokens: bool = False, + converter: str = "auto", + ner_map: Optional[Path] = None, + lang: Optional[str] = None, + concatenate: bool = False, + silent: bool = True, + msg: Optional[Printer], +) -> None: + if not msg: + msg = 
Printer(no_print=silent) + ner_map = srsly.read_json(ner_map) if ner_map is not None else None + doc_files = [] + for input_loc in walk_directory(Path(input_path), converter): + input_data = input_loc.open("r", encoding="utf-8").read() + # Use converter function to convert data + func = CONVERTERS[converter] + docs = func( + input_data, + n_sents=n_sents, + seg_sents=seg_sents, + append_morphology=morphology, + merge_subtokens=merge_subtokens, + lang=lang, + model=model, + no_print=silent, + ner_map=ner_map, ) - else: - # Print to stdout + doc_files.append((input_loc, docs)) + if concatenate: + all_docs = [] + for _, docs in doc_files: + all_docs.extend(docs) + doc_files = [(input_path, all_docs)] + for input_loc, docs in doc_files: if file_type == "json": - srsly.write_json("-", data) - elif file_type == "jsonl": - srsly.write_jsonl("-", data) + data = [docs_to_json(docs)] + else: + data = DocBin(docs=docs, store_user_data=True).to_bytes() + if output_dir == "-": + _print_docs_to_stdout(data, file_type) + else: + if input_loc != input_path: + subpath = input_loc.relative_to(input_path) + output_file = Path(output_dir) / subpath.with_suffix(f".{file_type}") + else: + output_file = Path(output_dir) / input_loc.parts[-1] + output_file = output_file.with_suffix(f".{file_type}") + _write_docs_to_file(data, output_file, file_type) + msg.good(f"Generated output file ({len(docs)} documents): {output_file}") -def autodetect_ner_format(input_data): +def _print_docs_to_stdout(data: Any, output_type: str) -> None: + if output_type == "json": + srsly.write_json("-", data) + else: + sys.stdout.buffer.write(data) + + +def _write_docs_to_file(data: Any, output_file: Path, output_type: str) -> None: + if not output_file.parent.exists(): + output_file.parent.mkdir(parents=True) + if output_type == "json": + srsly.write_json(output_file, data) + else: + with output_file.open("wb") as file_: + file_.write(data) + + +def autodetect_ner_format(input_data: str) -> Optional[str]: # guess format from the first 20 lines lines = input_data.split("\n")[:20] format_guesses = {"ner": 0, "iob": 0} @@ -143,3 +186,86 @@ def autodetect_ner_format(input_data): if format_guesses["ner"] == 0 and format_guesses["iob"] > 0: return "iob" return None + + +def walk_directory(path: Path, converter: str) -> List[Path]: + if not path.is_dir(): + return [path] + paths = [path] + locs = [] + seen = set() + for path in paths: + if str(path) in seen: + continue + seen.add(str(path)) + if path.parts[-1].startswith("."): + continue + elif path.is_dir(): + paths.extend(path.iterdir()) + elif converter == "json" and not path.parts[-1].endswith("json"): + continue + elif converter == "conll" and not path.parts[-1].endswith("conll"): + continue + elif converter == "iob" and not path.parts[-1].endswith("iob"): + continue + else: + locs.append(path) + # It's good to sort these, in case the ordering messes up cache. + locs.sort() + return locs + + +def verify_cli_args( + msg: Printer, + input_path: Union[str, Path], + output_dir: Union[str, Path], + file_type: FileTypes, + converter: str, + ner_map: Optional[Path], +): + input_path = Path(input_path) + if file_type not in FILE_TYPES_STDOUT and output_dir == "-": + msg.fail( + f"Can't write .{file_type} data to stdout. 
Please specify an output directory.", + exits=1, + ) + if not input_path.exists(): + msg.fail("Input file not found", input_path, exits=1) + if output_dir != "-" and not Path(output_dir).exists(): + msg.fail("Output directory not found", output_dir, exits=1) + if ner_map is not None and not Path(ner_map).exists(): + msg.fail("NER map not found", ner_map, exits=1) + if input_path.is_dir(): + input_locs = walk_directory(input_path, converter) + if len(input_locs) == 0: + msg.fail("No input files in directory", input_path, exits=1) + file_types = list(set([loc.suffix[1:] for loc in input_locs])) + if converter == "auto" and len(file_types) >= 2: + file_types = ",".join(file_types) + msg.fail("All input files must be same type", file_types, exits=1) + if converter != "auto" and converter not in CONVERTERS: + msg.fail(f"Can't find converter for {converter}", exits=1) + + +def _get_converter(msg, converter, input_path): + if input_path.is_dir(): + input_path = walk_directory(input_path, converter)[0] + if converter == "auto": + converter = input_path.suffix[1:] + if converter == "ner" or converter == "iob": + with input_path.open(encoding="utf8") as file_: + input_data = file_.read() + converter_autodetect = autodetect_ner_format(input_data) + if converter_autodetect == "ner": + msg.info("Auto-detected token-per-line NER format") + converter = converter_autodetect + elif converter_autodetect == "iob": + msg.info("Auto-detected sentence-per-line NER format") + converter = converter_autodetect + else: + msg.warn( + "Can't automatically detect NER format. " + "Conversion may not succeed. " + "See https://nightly.spacy.io/api/cli#convert" + ) + return converter diff --git a/spacy/cli/converters/__init__.py b/spacy/cli/converters/__init__.py deleted file mode 100644 index 9dcbf5b13..000000000 --- a/spacy/cli/converters/__init__.py +++ /dev/null @@ -1,4 +0,0 @@ -from .conllu2json import conllu2json # noqa: F401 -from .iob2json import iob2json # noqa: F401 -from .conll_ner2json import conll_ner2json # noqa: F401 -from .jsonl2json import ner_jsonl2json # noqa: F401 diff --git a/spacy/cli/converters/conllu2json.py b/spacy/cli/converters/conllu2json.py deleted file mode 100644 index 3de4dcc30..000000000 --- a/spacy/cli/converters/conllu2json.py +++ /dev/null @@ -1,141 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import re - -from ...gold import iob_to_biluo - - -def conllu2json(input_data, n_sents=10, use_morphology=False, lang=None, **_): - """ - Convert conllu files into JSON format for use with train cli. - use_morphology parameter enables appending morphology to tags, which is - useful for languages such as Spanish, where UD tags are not so rich. 
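For reference (not part of this diff), the refactored convert entry point above can also be driven from Python. A minimal sketch, assuming it is importable from spacy.cli.convert and using placeholder paths; note that the plain convert() function looks the converter up in CONVERTERS directly, so the caller has to pass a concrete converter name rather than "auto":

    from pathlib import Path
    from spacy.cli.convert import convert  # assumed import location

    convert(
        Path("corpus/train.conllu"),   # placeholder input file (or directory)
        Path("corpus/converted"),      # placeholder output directory
        file_type="json",
        n_sents=10,
        converter="conllu",            # no auto-detection here; that happens in the CLI wrapper
        msg=None,                      # required keyword; None makes convert() create its own Printer
    )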
- - Extract NER tags if available and convert them so that they follow - BILUO and the Wikipedia scheme - """ - # by @dvsrepo, via #11 explosion/spacy-dev-resources - # by @katarkor - docs = [] - sentences = [] - conll_tuples = read_conllx(input_data, use_morphology=use_morphology) - checked_for_ner = False - has_ner_tags = False - for i, (raw_text, tokens) in enumerate(conll_tuples): - sentence, brackets = tokens[0] - if not checked_for_ner: - has_ner_tags = is_ner(sentence[5][0]) - checked_for_ner = True - sentences.append(generate_sentence(sentence, has_ner_tags)) - # Real-sized documents could be extracted using the comments on the - # conluu document - if len(sentences) % n_sents == 0: - doc = create_doc(sentences, i) - docs.append(doc) - sentences = [] - if sentences: - doc = create_doc(sentences, i) - docs.append(doc) - return docs - - -def is_ner(tag): - """ - Check the 10th column of the first token to determine if the file contains - NER tags - """ - tag_match = re.match("([A-Z_]+)-([A-Z_]+)", tag) - if tag_match: - return True - elif tag == "O": - return True - else: - return False - - -def read_conllx(input_data, use_morphology=False, n=0): - i = 0 - for sent in input_data.strip().split("\n\n"): - lines = sent.strip().split("\n") - if lines: - while lines[0].startswith("#"): - lines.pop(0) - tokens = [] - for line in lines: - - parts = line.split("\t") - id_, word, lemma, pos, tag, morph, head, dep, _1, iob = parts - if "-" in id_ or "." in id_: - continue - try: - id_ = int(id_) - 1 - head = (int(head) - 1) if head not in ["0", "_"] else id_ - dep = "ROOT" if dep == "root" else dep - tag = pos if tag == "_" else tag - tag = tag + "__" + morph if use_morphology else tag - iob = iob if iob else "O" - tokens.append((id_, word, tag, head, dep, iob)) - except: # noqa: E722 - print(line) - raise - tuples = [list(t) for t in zip(*tokens)] - yield (None, [[tuples, []]]) - i += 1 - if n >= 1 and i >= n: - break - - -def simplify_tags(iob): - """ - Simplify tags obtained from the dataset in order to follow Wikipedia - scheme (PER, LOC, ORG, MISC). 'PER', 'LOC' and 'ORG' keep their tags, while - 'GPE_LOC' is simplified to 'LOC', 'GPE_ORG' to 'ORG' and all remaining tags to - 'MISC'. 
- """ - new_iob = [] - for tag in iob: - tag_match = re.match("([A-Z_]+)-([A-Z_]+)", tag) - if tag_match: - prefix = tag_match.group(1) - suffix = tag_match.group(2) - if suffix == "GPE_LOC": - suffix = "LOC" - elif suffix == "GPE_ORG": - suffix = "ORG" - elif suffix != "PER" and suffix != "LOC" and suffix != "ORG": - suffix = "MISC" - tag = prefix + "-" + suffix - new_iob.append(tag) - return new_iob - - -def generate_sentence(sent, has_ner_tags): - (id_, word, tag, head, dep, iob) = sent - sentence = {} - tokens = [] - if has_ner_tags: - iob = simplify_tags(iob) - biluo = iob_to_biluo(iob) - for i, id in enumerate(id_): - token = {} - token["id"] = id - token["orth"] = word[i] - token["tag"] = tag[i] - token["head"] = head[i] - id - token["dep"] = dep[i] - if has_ner_tags: - token["ner"] = biluo[i] - tokens.append(token) - sentence["tokens"] = tokens - return sentence - - -def create_doc(sentences, id): - doc = {} - paragraph = {} - doc["id"] = id - doc["paragraphs"] = [] - paragraph["sentences"] = sentences - doc["paragraphs"].append(paragraph) - return doc diff --git a/spacy/cli/converters/iob2json.py b/spacy/cli/converters/iob2json.py deleted file mode 100644 index 61c398f8d..000000000 --- a/spacy/cli/converters/iob2json.py +++ /dev/null @@ -1,68 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from wasabi import Printer - -from ...gold import iob_to_biluo -from ...util import minibatch -from .conll_ner2json import n_sents_info - - -def iob2json(input_data, n_sents=10, no_print=False, *args, **kwargs): - """ - Convert IOB files with one sentence per line and tags separated with '|' - into JSON format for use with train cli. IOB and IOB2 are accepted. - - Sample formats: - - I|O like|O London|I-GPE and|O New|B-GPE York|I-GPE City|I-GPE .|O - I|O like|O London|B-GPE and|O New|B-GPE York|I-GPE City|I-GPE .|O - I|PRP|O like|VBP|O London|NNP|I-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O - I|PRP|O like|VBP|O London|NNP|B-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O - """ - msg = Printer(no_print=no_print) - docs = read_iob(input_data.split("\n")) - if n_sents > 0: - n_sents_info(msg, n_sents) - docs = merge_sentences(docs, n_sents) - return docs - - -def read_iob(raw_sents): - sentences = [] - for line in raw_sents: - if not line.strip(): - continue - tokens = [t.split("|") for t in line.split()] - if len(tokens[0]) == 3: - words, pos, iob = zip(*tokens) - elif len(tokens[0]) == 2: - words, iob = zip(*tokens) - pos = ["-"] * len(words) - else: - raise ValueError( - "The sentence-per-line IOB/IOB2 file is not formatted correctly. Try checking whitespace and delimiters. 
See https://spacy.io/api/cli#convert" - ) - biluo = iob_to_biluo(iob) - sentences.append( - [ - {"orth": w, "tag": p, "ner": ent} - for (w, p, ent) in zip(words, pos, biluo) - ] - ) - sentences = [{"tokens": sent} for sent in sentences] - paragraphs = [{"sentences": [sent]} for sent in sentences] - docs = [{"id": i, "paragraphs": [para]} for i, para in enumerate(paragraphs)] - return docs - - -def merge_sentences(docs, n_sents): - merged = [] - for group in minibatch(docs, size=n_sents): - group = list(group) - first = group.pop(0) - to_extend = first["paragraphs"][0]["sentences"] - for sent in group: - to_extend.extend(sent["paragraphs"][0]["sentences"]) - merged.append(first) - return merged diff --git a/spacy/cli/converters/jsonl2json.py b/spacy/cli/converters/jsonl2json.py deleted file mode 100644 index 1c1bc45c7..000000000 --- a/spacy/cli/converters/jsonl2json.py +++ /dev/null @@ -1,53 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import srsly - -from ...gold import docs_to_json -from ...util import get_lang_class, minibatch - - -def ner_jsonl2json(input_data, lang=None, n_sents=10, use_morphology=False, **_): - if lang is None: - raise ValueError("No --lang specified, but tokenization required") - json_docs = [] - input_examples = [srsly.json_loads(line) for line in input_data.strip().split("\n")] - nlp = get_lang_class(lang)() - sentencizer = nlp.create_pipe("sentencizer") - for i, batch in enumerate(minibatch(input_examples, size=n_sents)): - docs = [] - for record in batch: - raw_text = record["text"] - if "entities" in record: - ents = record["entities"] - else: - ents = record["spans"] - ents = [(e["start"], e["end"], e["label"]) for e in ents] - doc = nlp.make_doc(raw_text) - sentencizer(doc) - spans = [doc.char_span(s, e, label=L) for s, e, L in ents] - doc.ents = _cleanup_spans(spans) - docs.append(doc) - json_docs.append(docs_to_json(docs, id=i)) - return json_docs - - -def _cleanup_spans(spans): - output = [] - seen = set() - for span in spans: - if span is not None: - # Trim whitespace - while len(span) and span[0].is_space: - span = span[1:] - while len(span) and span[-1].is_space: - span = span[:-1] - if not len(span): - continue - for i in range(span.start, span.end): - if i in seen: - break - else: - output.append(span) - seen.update(range(span.start, span.end)) - return output diff --git a/spacy/cli/debug_config.py b/spacy/cli/debug_config.py new file mode 100644 index 000000000..a6c7345f0 --- /dev/null +++ b/spacy/cli/debug_config.py @@ -0,0 +1,101 @@ +from typing import Optional, Dict, Any, Union, List +from pathlib import Path +from wasabi import msg, table +from thinc.api import Config +from thinc.config import VARIABLE_RE +import typer + +from ._util import Arg, Opt, show_validation_error, parse_config_overrides +from ._util import import_code, debug_cli +from ..schemas import ConfigSchemaTraining +from ..util import registry +from .. 
import util + + +@debug_cli.command( + "config", + context_settings={"allow_extra_args": True, "ignore_unknown_options": True}, +) +def debug_config_cli( + # fmt: off + ctx: typer.Context, # This is only used to read additional arguments + config_path: Path = Arg(..., help="Path to config file", exists=True), + code_path: Optional[Path] = Opt(None, "--code-path", "-c", help="Path to Python file with additional code (registered functions) to be imported"), + show_funcs: bool = Opt(False, "--show-functions", "-F", help="Show an overview of all registered functions used in the config and where they come from (modules, files etc.)"), + show_vars: bool = Opt(False, "--show-variables", "-V", help="Show an overview of all variables referenced in the config and their values. This will also reflect variables overwritten on the CLI.") + # fmt: on +): + """Debug a config.cfg file and show validation errors. The command will + create all objects in the tree and validate them. Note that some config + validation errors are blocking and will prevent the rest of the config from + being resolved. This means that you may not see all validation errors at + once and some issues are only shown once previous errors have been fixed. + Similar as with the 'train' command, you can override settings from the config + as command line options. For instance, --training.batch_size 128 overrides + the value of "batch_size" in the block "[training]". + + DOCS: https://nightly.spacy.io/api/cli#debug-config + """ + overrides = parse_config_overrides(ctx.args) + import_code(code_path) + debug_config( + config_path, overrides=overrides, show_funcs=show_funcs, show_vars=show_vars + ) + + +def debug_config( + config_path: Path, + *, + overrides: Dict[str, Any] = {}, + show_funcs: bool = False, + show_vars: bool = False, +): + msg.divider("Config validation") + with show_validation_error(config_path): + config = util.load_config(config_path, overrides=overrides) + nlp = util.load_model_from_config(config) + config = nlp.config.interpolate() + T = registry.resolve(config["training"], schema=ConfigSchemaTraining) + dot_names = [T["train_corpus"], T["dev_corpus"]] + util.resolve_dot_names(config, dot_names) + msg.good("Config is valid") + if show_vars: + variables = get_variables(config) + msg.divider(f"Variables ({len(variables)})") + head = ("Variable", "Value") + msg.table(variables, header=head, divider=True, widths=(41, 34), spacing=2) + if show_funcs: + funcs = get_registered_funcs(config) + msg.divider(f"Registered functions ({len(funcs)})") + for func in funcs: + func_data = { + "Registry": f"@{func['registry']}", + "Name": func["name"], + "Module": func["module"], + "File": f"{func['file']} (line {func['line_no']})", + } + msg.info(f"[{func['path']}]") + print(table(func_data).strip()) + + +def get_registered_funcs(config: Config) -> List[Dict[str, Optional[Union[str, int]]]]: + result = [] + for key, value in util.walk_dict(config): + if not key[-1].startswith("@"): + continue + # We have a reference to a registered function + reg_name = key[-1][1:] + registry = getattr(util.registry, reg_name) + path = ".".join(key[:-1]) + info = registry.find(value) + result.append({"name": value, "registry": reg_name, "path": path, **info}) + return result + + +def get_variables(config: Config) -> Dict[str, Any]: + result = {} + for variable in sorted(set(VARIABLE_RE.findall(config.to_str()))): + path = variable[2:-1].replace(":", ".") + value = util.dot_to_object(config, path) + result[variable] = repr(value) + return result diff 
--git a/spacy/cli/debug_data.py b/spacy/cli/debug_data.py index 7e6c99c06..ead759e33 100644 --- a/spacy/cli/debug_data.py +++ b/spacy/cli/debug_data.py @@ -1,192 +1,177 @@ -# coding: utf8 -from __future__ import unicode_literals, print_function - +from typing import List, Sequence, Dict, Any, Tuple, Optional from pathlib import Path from collections import Counter -import plac import sys import srsly -from wasabi import Printer, MESSAGES +from wasabi import Printer, MESSAGES, msg +import typer -from ..gold import GoldCorpus -from ..syntax import nonproj -from ..util import load_model, get_lang_class +from ._util import app, Arg, Opt, show_validation_error, parse_config_overrides +from ._util import import_code, debug_cli +from ..training import Example +from ..training.initialize import get_sourced_components +from ..schemas import ConfigSchemaTraining +from ..pipeline._parser_internals import nonproj +from ..language import Language +from ..util import registry, resolve_dot_names +from .. import util # Minimum number of expected occurrences of NER label in data to train new label NEW_LABEL_THRESHOLD = 50 # Minimum number of expected occurrences of dependency labels DEP_LABEL_THRESHOLD = 20 -# Minimum number of expected examples to train a blank model +# Minimum number of expected examples to train a new pipeline BLANK_MODEL_MIN_THRESHOLD = 100 BLANK_MODEL_THRESHOLD = 2000 -@plac.annotations( - # fmt: off - lang=("model language", "positional", None, str), - train_path=("location of JSON-formatted training data", "positional", None, Path), - dev_path=("location of JSON-formatted development data", "positional", None, Path), - tag_map_path=("Location of JSON-formatted tag map", "option", "tm", Path), - base_model=("name of model to update (optional)", "option", "b", str), - pipeline=("Comma-separated names of pipeline components to train", "option", "p", str), - ignore_warnings=("Ignore warnings, only show stats and errors", "flag", "IW", bool), - verbose=("Print additional information and explanations", "flag", "V", bool), - no_format=("Don't pretty-print the results", "flag", "NF", bool), - # fmt: on +@debug_cli.command( + "data", context_settings={"allow_extra_args": True, "ignore_unknown_options": True} ) -def debug_data( - lang, - train_path, - dev_path, - tag_map_path=None, - base_model=None, - pipeline="tagger,parser,ner", - ignore_warnings=False, - verbose=False, - no_format=False, +@app.command( + "debug-data", + context_settings={"allow_extra_args": True, "ignore_unknown_options": True}, + hidden=True, # hide this from main CLI help but still allow it to work with warning +) +def debug_data_cli( + # fmt: off + ctx: typer.Context, # This is only used to read additional arguments + config_path: Path = Arg(..., help="Path to config file", exists=True), + code_path: Optional[Path] = Opt(None, "--code-path", "-c", help="Path to Python file with additional code (registered functions) to be imported"), + ignore_warnings: bool = Opt(False, "--ignore-warnings", "-IW", help="Ignore warnings, only show stats and errors"), + verbose: bool = Opt(False, "--verbose", "-V", help="Print additional information and explanations"), + no_format: bool = Opt(False, "--no-format", "-NF", help="Don't pretty-print the results"), + # fmt: on ): """ - Analyze, debug and validate your training and development data, get useful - stats, and find problems like invalid entity annotations, cyclic - dependencies, low data labels and more. + Analyze, debug and validate your training and development data. 
Outputs + useful stats, and can help you find problems like invalid entity annotations, + cyclic dependencies, low data labels and more. + + DOCS: https://nightly.spacy.io/api/cli#debug-data """ - msg = Printer(pretty=not no_format, ignore_warnings=ignore_warnings) + if ctx.command.name == "debug-data": + msg.warn( + "The debug-data command is now available via the 'debug data' " + "subcommand (without the hyphen). You can run python -m spacy debug " + "--help for an overview of the other available debugging commands." + ) + overrides = parse_config_overrides(ctx.args) + import_code(code_path) + debug_data( + config_path, + config_overrides=overrides, + ignore_warnings=ignore_warnings, + verbose=verbose, + no_format=no_format, + silent=False, + ) + +def debug_data( + config_path: Path, + *, + config_overrides: Dict[str, Any] = {}, + ignore_warnings: bool = False, + verbose: bool = False, + no_format: bool = True, + silent: bool = True, +): + msg = Printer( + no_print=silent, pretty=not no_format, ignore_warnings=ignore_warnings + ) # Make sure all files and paths exists if they are needed - if not train_path.exists(): - msg.fail("Training data not found", train_path, exits=1) - if not dev_path.exists(): - msg.fail("Development data not found", dev_path, exits=1) - - # Initialize the model and pipeline - pipeline = [p.strip() for p in pipeline.split(",")] - if base_model: - nlp = load_model(base_model) - else: - lang_cls = get_lang_class(lang) - nlp = lang_cls() - - if tag_map_path is not None: - tag_map = srsly.read_json(tag_map_path) - # Replace tag map with provided mapping - nlp.vocab.morphology.load_tag_map(tag_map) - - msg.divider("Data format validation") - - # TODO: Validate data format using the JSON schema - # TODO: update once the new format is ready - # TODO: move validation to GoldCorpus in order to be able to load from dir + with show_validation_error(config_path): + cfg = util.load_config(config_path, overrides=config_overrides) + nlp = util.load_model_from_config(cfg) + config = nlp.config.interpolate() + T = registry.resolve(config["training"], schema=ConfigSchemaTraining) + # Use original config here, not resolved version + sourced_components = get_sourced_components(cfg) + frozen_components = T["frozen_components"] + resume_components = [p for p in sourced_components if p not in frozen_components] + pipeline = nlp.pipe_names + factory_names = [nlp.get_pipe_meta(pipe).factory for pipe in nlp.pipe_names] + msg.divider("Data file validation") # Create the gold corpus to be able to better analyze data - loading_train_error_message = "" - loading_dev_error_message = "" - with msg.loading("Loading corpus..."): - corpus = GoldCorpus(train_path, dev_path) - try: - train_docs = list(corpus.train_docs(nlp)) - train_docs_unpreprocessed = list( - corpus.train_docs_without_preprocessing(nlp) - ) - except ValueError as e: - loading_train_error_message = "Training data cannot be loaded: {}".format( - str(e) - ) - try: - dev_docs = list(corpus.dev_docs(nlp)) - except ValueError as e: - loading_dev_error_message = "Development data cannot be loaded: {}".format( - str(e) - ) - if loading_train_error_message or loading_dev_error_message: - if loading_train_error_message: - msg.fail(loading_train_error_message) - if loading_dev_error_message: - msg.fail(loading_dev_error_message) - sys.exit(1) + dot_names = [T["train_corpus"], T["dev_corpus"]] + train_corpus, dev_corpus = resolve_dot_names(config, dot_names) + train_dataset = list(train_corpus(nlp)) + dev_dataset = list(dev_corpus(nlp)) 
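Not part of the diff, for reference: the corpus readers resolved just above typically point into the [corpora] block of the config, so the command can also be driven programmatically. A minimal sketch with placeholder paths, assuming the config interpolates ${paths.train} and ${paths.dev}:

    from pathlib import Path
    from spacy.cli.debug_data import debug_data

    debug_data(
        Path("config.cfg"),
        config_overrides={
            "paths.train": "corpus/train.spacy",
            "paths.dev": "corpus/dev.spacy",
        },
        verbose=True,
        no_format=False,  # pretty-print the report
        silent=False,     # print instead of only collecting counts
    )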
msg.good("Corpus is loadable") - # Create all gold data here to avoid iterating over the train_docs constantly - gold_train_data = _compile_gold(train_docs, pipeline, nlp) + nlp.initialize(lambda: train_dataset) + msg.good("Pipeline can be initialized with data") + + # Create all gold data here to avoid iterating over the train_dataset constantly + gold_train_data = _compile_gold(train_dataset, factory_names, nlp, make_proj=True) gold_train_unpreprocessed_data = _compile_gold( - train_docs_unpreprocessed, pipeline, nlp + train_dataset, factory_names, nlp, make_proj=False ) - gold_dev_data = _compile_gold(dev_docs, pipeline, nlp) + gold_dev_data = _compile_gold(dev_dataset, factory_names, nlp, make_proj=True) train_texts = gold_train_data["texts"] dev_texts = gold_dev_data["texts"] + frozen_components = T["frozen_components"] msg.divider("Training stats") - msg.text("Training pipeline: {}".format(", ".join(pipeline))) - for pipe in [p for p in pipeline if p not in nlp.factories]: - msg.fail("Pipeline component '{}' not available in factories".format(pipe)) - if base_model: - msg.text("Starting with base model '{}'".format(base_model)) - else: - msg.text("Starting with blank model '{}'".format(lang)) - msg.text("{} training docs".format(len(train_docs))) - msg.text("{} evaluation docs".format(len(dev_docs))) + msg.text(f"Language: {nlp.lang}") + msg.text(f"Training pipeline: {', '.join(pipeline)}") + if resume_components: + msg.text(f"Components from other pipelines: {', '.join(resume_components)}") + if frozen_components: + msg.text(f"Frozen components: {', '.join(frozen_components)}") + msg.text(f"{len(train_dataset)} training docs") + msg.text(f"{len(dev_dataset)} evaluation docs") - if not len(dev_docs): + if not len(gold_dev_data): msg.fail("No evaluation docs") overlap = len(train_texts.intersection(dev_texts)) if overlap: - msg.warn("{} training examples also in evaluation data".format(overlap)) + msg.warn(f"{overlap} training examples also in evaluation data") else: msg.good("No overlap between training and evaluation data") - if not base_model and len(train_docs) < BLANK_MODEL_THRESHOLD: - text = "Low number of examples to train from a blank model ({})".format( - len(train_docs) - ) - if len(train_docs) < BLANK_MODEL_MIN_THRESHOLD: + # TODO: make this feedback more fine-grained and report on updated + # components vs. 
blank components + if not resume_components and len(train_dataset) < BLANK_MODEL_THRESHOLD: + text = f"Low number of examples to train a new pipeline ({len(train_dataset)})" + if len(train_dataset) < BLANK_MODEL_MIN_THRESHOLD: msg.fail(text) else: msg.warn(text) msg.text( - "It's recommended to use at least {} examples (minimum {})".format( - BLANK_MODEL_THRESHOLD, BLANK_MODEL_MIN_THRESHOLD - ), + f"It's recommended to use at least {BLANK_MODEL_THRESHOLD} examples " + f"(minimum {BLANK_MODEL_MIN_THRESHOLD})", show=verbose, ) msg.divider("Vocab & Vectors") n_words = gold_train_data["n_words"] msg.info( - "{} total {} in the data ({} unique)".format( - n_words, "word" if n_words == 1 else "words", len(gold_train_data["words"]) - ) + f"{n_words} total word(s) in the data ({len(gold_train_data['words'])} unique)" ) if gold_train_data["n_misaligned_words"] > 0: - msg.warn( - "{} misaligned tokens in the training data".format( - gold_train_data["n_misaligned_words"] - ) - ) + n_misaligned = gold_train_data["n_misaligned_words"] + msg.warn(f"{n_misaligned} misaligned tokens in the training data") if gold_dev_data["n_misaligned_words"] > 0: - msg.warn( - "{} misaligned tokens in the dev data".format( - gold_dev_data["n_misaligned_words"] - ) - ) + n_misaligned = gold_dev_data["n_misaligned_words"] + msg.warn(f"{n_misaligned} misaligned tokens in the dev data") most_common_words = gold_train_data["words"].most_common(10) msg.text( - "10 most common words: {}".format( - _format_labels(most_common_words, counts=True) - ), + f"10 most common words: {_format_labels(most_common_words, counts=True)}", show=verbose, ) if len(nlp.vocab.vectors): msg.info( - "{} vectors ({} unique keys, {} dimensions)".format( - len(nlp.vocab.vectors), - nlp.vocab.vectors.n_keys, - nlp.vocab.vectors_length, - ) + f"{len(nlp.vocab.vectors)} vectors ({nlp.vocab.vectors.n_keys} " + f"unique keys, {nlp.vocab.vectors_length} dimensions)" ) n_missing_vectors = sum(gold_train_data["words_missing_vectors"].values()) msg.warn( "{} words in training data without vectors ({:0.2f}%)".format( - n_missing_vectors, n_missing_vectors / gold_train_data["n_words"], + n_missing_vectors, n_missing_vectors / gold_train_data["n_words"] ), ) msg.text( @@ -199,12 +184,12 @@ def debug_data( show=verbose, ) else: - msg.info("No word vectors present in the model") + msg.info("No word vectors present in the package") - if "ner" in pipeline: + if "ner" in factory_names: # Get all unique NER labels present in the data labels = set( - label for label in gold_train_data["ner"] if label not in ("O", "-") + label for label in gold_train_data["ner"] if label not in ("O", "-", None) ) label_counts = gold_train_data["ner"] model_labels = _get_labels_from_model(nlp, "ner") @@ -217,19 +202,10 @@ def debug_data( msg.divider("Named Entity Recognition") msg.info( - "{} new {}, {} existing {}".format( - len(new_labels), - "label" if len(new_labels) == 1 else "labels", - len(existing_labels), - "label" if len(existing_labels) == 1 else "labels", - ) + f"{len(new_labels)} new label(s), {len(existing_labels)} existing label(s)" ) missing_values = label_counts["-"] - msg.text( - "{} missing {} (tokens with '-' label)".format( - missing_values, "value" if missing_values == 1 else "values" - ) - ) + msg.text(f"{missing_values} missing value(s) (tokens with '-' label)") for label in new_labels: if len(label) == 0: msg.fail("Empty label found in new labels") @@ -240,43 +216,28 @@ def debug_data( if label != "-" ] labels_with_counts = _format_labels(labels_with_counts, 
counts=True) - msg.text("New: {}".format(labels_with_counts), show=verbose) + msg.text(f"New: {labels_with_counts}", show=verbose) if existing_labels: - msg.text( - "Existing: {}".format(_format_labels(existing_labels)), show=verbose - ) - + msg.text(f"Existing: {_format_labels(existing_labels)}", show=verbose) if gold_train_data["ws_ents"]: - msg.fail( - "{} invalid whitespace entity span(s)".format( - gold_train_data["ws_ents"] - ) - ) + msg.fail(f"{gold_train_data['ws_ents']} invalid whitespace entity spans") has_ws_ents_error = True if gold_train_data["punct_ents"]: - msg.warn( - "{} entity span(s) with punctuation".format( - gold_train_data["punct_ents"] - ) - ) + msg.warn(f"{gold_train_data['punct_ents']} entity span(s) with punctuation") has_punct_ents_warning = True for label in new_labels: if label_counts[label] <= NEW_LABEL_THRESHOLD: msg.warn( - "Low number of examples for new label '{}' ({})".format( - label, label_counts[label] - ) + f"Low number of examples for new label '{label}' ({label_counts[label]})" ) has_low_data_warning = True with msg.loading("Analyzing label distribution..."): - neg_docs = _get_examples_without_label(train_docs, label) + neg_docs = _get_examples_without_label(train_dataset, label) if neg_docs == 0: - msg.warn( - "No examples for texts WITHOUT new label '{}'".format(label) - ) + msg.warn(f"No examples for texts WITHOUT new label '{label}'") has_no_neg_warning = True if not has_low_data_warning: @@ -290,8 +251,8 @@ def debug_data( if has_low_data_warning: msg.text( - "To train a new entity type, your data should include at " - "least {} instances of the new label".format(NEW_LABEL_THRESHOLD), + f"To train a new entity type, your data should include at " + f"least {NEW_LABEL_THRESHOLD} instances of the new label", show=verbose, ) if has_no_neg_warning: @@ -313,34 +274,28 @@ def debug_data( "with punctuation can not be trained with a noise level > 0." ) - if "textcat" in pipeline: + if "textcat" in factory_names: msg.divider("Text Classification") labels = [label for label in gold_train_data["cats"]] model_labels = _get_labels_from_model(nlp, "textcat") new_labels = [l for l in labels if l not in model_labels] existing_labels = [l for l in labels if l in model_labels] msg.info( - "Text Classification: {} new label(s), {} existing label(s)".format( - len(new_labels), len(existing_labels) - ) + f"Text Classification: {len(new_labels)} new label(s), " + f"{len(existing_labels)} existing label(s)" ) if new_labels: labels_with_counts = _format_labels( gold_train_data["cats"].most_common(), counts=True ) - msg.text("New: {}".format(labels_with_counts), show=verbose) + msg.text(f"New: {labels_with_counts}", show=verbose) if existing_labels: - msg.text( - "Existing: {}".format(_format_labels(existing_labels)), show=verbose - ) + msg.text(f"Existing: {_format_labels(existing_labels)}", show=verbose) if set(gold_train_data["cats"]) != set(gold_dev_data["cats"]): msg.fail( - "The train and dev labels are not the same. " - "Train labels: {}. " - "Dev labels: {}.".format( - _format_labels(gold_train_data["cats"]), - _format_labels(gold_dev_data["cats"]), - ) + f"The train and dev labels are not the same. " + f"Train labels: {_format_labels(gold_train_data['cats'])}. " + f"Dev labels: {_format_labels(gold_dev_data['cats'])}." ) if gold_train_data["n_cats_multilabel"] > 0: msg.info( @@ -366,53 +321,34 @@ def debug_data( "contains only instances with mutually-exclusive classes." 
) - if "tagger" in pipeline: + if "tagger" in factory_names: msg.divider("Part-of-speech Tagging") labels = [label for label in gold_train_data["tags"]] - tag_map = nlp.vocab.morphology.tag_map - msg.info( - "{} {} in data ({} {} in tag map)".format( - len(labels), - "label" if len(labels) == 1 else "labels", - len(tag_map), - "label" if len(tag_map) == 1 else "labels", - ) - ) + # TODO: does this need to be updated? + msg.info(f"{len(labels)} label(s) in data") labels_with_counts = _format_labels( gold_train_data["tags"].most_common(), counts=True ) msg.text(labels_with_counts, show=verbose) - non_tagmap = [l for l in labels if l not in tag_map] - if not non_tagmap: - msg.good("All labels present in tag map for language '{}'".format(nlp.lang)) - for label in non_tagmap: - msg.fail( - "Label '{}' not found in tag map for language '{}'".format( - label, nlp.lang - ) - ) - if "parser" in pipeline: + if "parser" in factory_names: has_low_data_warning = False msg.divider("Dependency Parsing") # profile sentence length msg.info( - "Found {} sentence{} with an average length of {:.1f} words.".format( - gold_train_data["n_sents"], - "s" if len(train_docs) > 1 else "", - gold_train_data["n_words"] / gold_train_data["n_sents"], - ) + f"Found {gold_train_data['n_sents']} sentence(s) with an average " + f"length of {gold_train_data['n_words'] / gold_train_data['n_sents']:.1f} words." ) # check for documents with multiple sentences sents_per_doc = gold_train_data["n_sents"] / len(gold_train_data["texts"]) if sents_per_doc < 1.1: msg.warn( - "The training data contains {:.2f} sentences per " - "document. When there are very few documents containing more " - "than one sentence, the parser will not learn how to segment " - "longer texts into sentences.".format(sents_per_doc) + f"The training data contains {sents_per_doc:.2f} sentences per " + f"document. When there are very few documents containing more " + f"than one sentence, the parser will not learn how to segment " + f"longer texts into sentences." 
) # profile labels @@ -423,32 +359,13 @@ def debug_data( labels_dev = [label for label in gold_dev_data["deps"]] if gold_train_unpreprocessed_data["n_nonproj"] > 0: - msg.info( - "Found {} nonprojective train sentence{}".format( - gold_train_unpreprocessed_data["n_nonproj"], - "s" if gold_train_unpreprocessed_data["n_nonproj"] > 1 else "", - ) - ) + n_nonproj = gold_train_unpreprocessed_data["n_nonproj"] + msg.info(f"Found {n_nonproj} nonprojective train sentence(s)") if gold_dev_data["n_nonproj"] > 0: - msg.info( - "Found {} nonprojective dev sentence{}".format( - gold_dev_data["n_nonproj"], - "s" if gold_dev_data["n_nonproj"] > 1 else "", - ) - ) - - msg.info( - "{} {} in train data".format( - len(labels_train_unpreprocessed), - "label" if len(labels_train) == 1 else "labels", - ) - ) - msg.info( - "{} {} in projectivized train data".format( - len(labels_train), "label" if len(labels_train) == 1 else "labels" - ) - ) - + n_nonproj = gold_dev_data["n_nonproj"] + msg.info(f"Found {n_nonproj} nonprojective dev sentence(s)") + msg.info(f"{len(labels_train_unpreprocessed)} label(s) in train data") + msg.info(f"{len(labels_train)} label(s) in projectivized train data") labels_with_counts = _format_labels( gold_train_unpreprocessed_data["deps"].most_common(), counts=True ) @@ -458,9 +375,8 @@ def debug_data( for label in gold_train_unpreprocessed_data["deps"]: if gold_train_unpreprocessed_data["deps"][label] <= DEP_LABEL_THRESHOLD: msg.warn( - "Low number of examples for label '{}' ({})".format( - label, gold_train_unpreprocessed_data["deps"][label] - ) + f"Low number of examples for label '{label}' " + f"({gold_train_unpreprocessed_data['deps'][label]})" ) has_low_data_warning = True @@ -469,22 +385,19 @@ def debug_data( for label in gold_train_data["deps"]: if gold_train_data["deps"][label] <= DEP_LABEL_THRESHOLD and "||" in label: rare_projectivized_labels.append( - "{}: {}".format(label, str(gold_train_data["deps"][label])) + f"{label}: {gold_train_data['deps'][label]}" ) if len(rare_projectivized_labels) > 0: msg.warn( - "Low number of examples for {} label{} in the " - "projectivized dependency trees used for training. You may " - "want to projectivize labels such as punct before " - "training in order to improve parser performance.".format( - len(rare_projectivized_labels), - "s" if len(rare_projectivized_labels) > 1 else "", - ) + f"Low number of examples for {len(rare_projectivized_labels)} " + "label(s) in the projectivized dependency trees used for " + "training. You may want to projectivize labels such as punct " + "before training in order to improve parser performance." 
) msg.warn( - "Projectivized labels with low numbers of examples: " - "{}".format("\n".join(rare_projectivized_labels)), + f"Projectivized labels with low numbers of examples: ", + ", ".join(rare_projectivized_labels), show=verbose, ) has_low_data_warning = True @@ -492,50 +405,44 @@ def debug_data( # labels only in train if set(labels_train) - set(labels_dev): msg.warn( - "The following labels were found only in the train data: " - "{}".format(", ".join(set(labels_train) - set(labels_dev))), + "The following labels were found only in the train data:", + ", ".join(set(labels_train) - set(labels_dev)), show=verbose, ) # labels only in dev if set(labels_dev) - set(labels_train): msg.warn( - "The following labels were found only in the dev data: " - + ", ".join(set(labels_dev) - set(labels_train)), + "The following labels were found only in the dev data:", + ", ".join(set(labels_dev) - set(labels_train)), show=verbose, ) if has_low_data_warning: msg.text( - "To train a parser, your data should include at " - "least {} instances of each label.".format(DEP_LABEL_THRESHOLD), + f"To train a parser, your data should include at " + f"least {DEP_LABEL_THRESHOLD} instances of each label.", show=verbose, ) # multiple root labels if len(gold_train_unpreprocessed_data["roots"]) > 1: msg.warn( - "Multiple root labels ({}) ".format( - ", ".join(gold_train_unpreprocessed_data["roots"]) - ) - + "found in training data. spaCy's parser uses a single root " - "label ROOT so this distinction will not be available." + f"Multiple root labels " + f"({', '.join(gold_train_unpreprocessed_data['roots'])}) " + f"found in training data. spaCy's parser uses a single root " + f"label ROOT so this distinction will not be available." ) # these should not happen, but just in case if gold_train_data["n_nonproj"] > 0: msg.fail( - "Found {} nonprojective projectivized train sentence{}".format( - gold_train_data["n_nonproj"], - "s" if gold_train_data["n_nonproj"] > 1 else "", - ) + f"Found {gold_train_data['n_nonproj']} nonprojective " + f"projectivized train sentence(s)" ) if gold_train_data["n_cycles"] > 0: msg.fail( - "Found {} projectivized train sentence{} with cycles".format( - gold_train_data["n_cycles"], - "s" if gold_train_data["n_cycles"] > 1 else "", - ) + f"Found {gold_train_data['n_cycles']} projectivized train sentence(s) with cycles" ) msg.divider("Summary") @@ -543,42 +450,39 @@ def debug_data( warn_counts = msg.counts[MESSAGES.WARN] fail_counts = msg.counts[MESSAGES.FAIL] if good_counts: - msg.good( - "{} {} passed".format( - good_counts, "check" if good_counts == 1 else "checks" - ) - ) + msg.good(f"{good_counts} {'check' if good_counts == 1 else 'checks'} passed") if warn_counts: - msg.warn( - "{} {}".format(warn_counts, "warning" if warn_counts == 1 else "warnings") - ) - if fail_counts: - msg.fail("{} {}".format(fail_counts, "error" if fail_counts == 1 else "errors")) - + msg.warn(f"{warn_counts} {'warning' if warn_counts == 1 else 'warnings'}") if fail_counts: + msg.fail(f"{fail_counts} {'error' if fail_counts == 1 else 'errors'}") sys.exit(1) -def _load_file(file_path, msg): +def _load_file(file_path: Path, msg: Printer) -> None: file_name = file_path.parts[-1] if file_path.suffix == ".json": - with msg.loading("Loading {}...".format(file_name)): + with msg.loading(f"Loading {file_name}..."): data = srsly.read_json(file_path) - msg.good("Loaded {}".format(file_name)) + msg.good(f"Loaded {file_name}") return data elif file_path.suffix == ".jsonl": - with msg.loading("Loading {}...".format(file_name)): + 
with msg.loading(f"Loading {file_name}..."): data = srsly.read_jsonl(file_path) - msg.good("Loaded {}".format(file_name)) + msg.good(f"Loaded {file_name}") return data msg.fail( - "Can't load file extension {}".format(file_path.suffix), + f"Can't load file extension {file_path.suffix}", "Expected .json or .jsonl", exits=1, ) -def _compile_gold(train_docs, pipeline, nlp): +def _compile_gold( + examples: Sequence[Example], + factory_names: List[str], + nlp: Language, + make_proj: bool, +) -> Dict[str, Any]: data = { "ner": Counter(), "cats": Counter(), @@ -597,18 +501,20 @@ def _compile_gold(train_docs, pipeline, nlp): "n_cats_multilabel": 0, "texts": set(), } - for doc, gold in train_docs: - valid_words = [x for x in gold.words if x is not None] + for eg in examples: + gold = eg.reference + doc = eg.predicted + valid_words = [x for x in gold if x is not None] data["words"].update(valid_words) data["n_words"] += len(valid_words) - data["n_misaligned_words"] += len(gold.words) - len(valid_words) + data["n_misaligned_words"] += len(gold) - len(valid_words) data["texts"].add(doc.text) if len(nlp.vocab.vectors): for word in valid_words: if nlp.vocab.strings[word] not in nlp.vocab.vectors: data["words_missing_vectors"].update([word]) - if "ner" in pipeline: - for i, label in enumerate(gold.ner): + if "ner" in factory_names: + for i, label in enumerate(eg.get_aligned_ner()): if label is None: continue if label.startswith(("B-", "U-", "L-")) and doc[i].is_space: @@ -629,45 +535,47 @@ def _compile_gold(train_docs, pipeline, nlp): data["ner"][combined_label] += 1 elif label == "-": data["ner"]["-"] += 1 - if "textcat" in pipeline: + if "textcat" in factory_names: data["cats"].update(gold.cats) if list(gold.cats.values()).count(1.0) != 1: data["n_cats_multilabel"] += 1 - if "tagger" in pipeline: - data["tags"].update([x for x in gold.tags if x is not None]) - if "parser" in pipeline: - data["deps"].update([x for x in gold.labels if x is not None]) - for i, (dep, head) in enumerate(zip(gold.labels, gold.heads)): + if "tagger" in factory_names: + tags = eg.get_aligned("TAG", as_string=True) + data["tags"].update([x for x in tags if x is not None]) + if "parser" in factory_names: + aligned_heads, aligned_deps = eg.get_aligned_parse(projectivize=make_proj) + data["deps"].update([x for x in aligned_deps if x is not None]) + for i, (dep, head) in enumerate(zip(aligned_deps, aligned_heads)): if head == i: data["roots"].update([dep]) data["n_sents"] += 1 - if nonproj.is_nonproj_tree(gold.heads): + if nonproj.is_nonproj_tree(aligned_heads): data["n_nonproj"] += 1 - if nonproj.contains_cycle(gold.heads): + if nonproj.contains_cycle(aligned_heads): data["n_cycles"] += 1 return data -def _format_labels(labels, counts=False): +def _format_labels(labels: List[Tuple[str, int]], counts: bool = False) -> str: if counts: - return ", ".join(["'{}' ({})".format(l, c) for l, c in labels]) - return ", ".join(["'{}'".format(l) for l in labels]) + return ", ".join([f"'{l}' ({c})" for l, c in labels]) + return ", ".join([f"'{l}'" for l in labels]) -def _get_examples_without_label(data, label): +def _get_examples_without_label(data: Sequence[Example], label: str) -> int: count = 0 - for doc, gold in data: + for eg in data: labels = [ label.split("-")[1] - for label in gold.ner - if label is not None and label not in ("O", "-") + for label in eg.get_aligned_ner() + if label not in ("O", "-", None) ] if label not in labels: count += 1 return count -def _get_labels_from_model(nlp, pipe_name): +def _get_labels_from_model(nlp: 
Language, pipe_name: str) -> Sequence[str]: if pipe_name not in nlp.pipe_names: return set() pipe = nlp.get_pipe(pipe_name) diff --git a/spacy/cli/debug_model.py b/spacy/cli/debug_model.py new file mode 100644 index 000000000..3b8ba7dae --- /dev/null +++ b/spacy/cli/debug_model.py @@ -0,0 +1,246 @@ +from typing import Dict, Any, Optional, Iterable +from pathlib import Path + +from spacy.training import Example +from spacy.util import resolve_dot_names +from wasabi import msg +from thinc.api import fix_random_seed, set_dropout_rate, Adam +from thinc.api import Model, data_validation, set_gpu_allocator +import typer + +from ._util import Arg, Opt, debug_cli, show_validation_error +from ._util import parse_config_overrides, string_to_list, setup_gpu +from ..schemas import ConfigSchemaTraining +from ..util import registry +from .. import util + + +@debug_cli.command( + "model", + context_settings={"allow_extra_args": True, "ignore_unknown_options": True}, +) +def debug_model_cli( + # fmt: off + ctx: typer.Context, # This is only used to read additional arguments + config_path: Path = Arg(..., help="Path to config file", exists=True), + component: str = Arg(..., help="Name of the pipeline component of which the model should be analysed"), + layers: str = Opt("", "--layers", "-l", help="Comma-separated names of layer IDs to print"), + dimensions: bool = Opt(False, "--dimensions", "-DIM", help="Show dimensions"), + parameters: bool = Opt(False, "--parameters", "-PAR", help="Show parameters"), + gradients: bool = Opt(False, "--gradients", "-GRAD", help="Show gradients"), + attributes: bool = Opt(False, "--attributes", "-ATTR", help="Show attributes"), + P0: bool = Opt(False, "--print-step0", "-P0", help="Print model before training"), + P1: bool = Opt(False, "--print-step1", "-P1", help="Print model after initialization"), + P2: bool = Opt(False, "--print-step2", "-P2", help="Print model after training"), + P3: bool = Opt(False, "--print-step3", "-P3", help="Print final predictions"), + use_gpu: int = Opt(-1, "--gpu-id", "-g", help="GPU ID or -1 for CPU") + # fmt: on +): + """ + Analyze a Thinc model implementation. Includes checks for internal structure + and activations during training. 
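For reference (not part of this change), a typical invocation against a single component; the config path and the component name "tagger" below are placeholders:

    $ python -m spacy debug model ./config.cfg tagger --print-step0 --print-step1 --dimensions --gpu-id -1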
+ + DOCS: https://nightly.spacy.io/api/cli#debug-model + """ + setup_gpu(use_gpu) + layers = string_to_list(layers, intify=True) + print_settings = { + "dimensions": dimensions, + "parameters": parameters, + "gradients": gradients, + "attributes": attributes, + "layers": layers, + "print_before_training": P0, + "print_after_init": P1, + "print_after_training": P2, + "print_prediction": P3, + } + config_overrides = parse_config_overrides(ctx.args) + with show_validation_error(config_path): + raw_config = util.load_config( + config_path, overrides=config_overrides, interpolate=False + ) + config = raw_config.interpolate() + allocator = config["training"]["gpu_allocator"] + if use_gpu >= 0 and allocator: + set_gpu_allocator(allocator) + with show_validation_error(config_path): + nlp = util.load_model_from_config(raw_config) + config = nlp.config.interpolate() + T = registry.resolve(config["training"], schema=ConfigSchemaTraining) + seed = T["seed"] + if seed is not None: + msg.info(f"Fixing random seed: {seed}") + fix_random_seed(seed) + pipe = nlp.get_pipe(component) + if not hasattr(pipe, "model"): + msg.fail( + f"The component '{component}' does not specify an object that holds a Model.", + exits=1, + ) + model = pipe.model + debug_model(config, T, nlp, model, print_settings=print_settings) + + +def debug_model( + config, + resolved_train_config, + nlp, + model: Model, + *, + print_settings: Optional[Dict[str, Any]] = None, +): + if not isinstance(model, Model): + msg.fail( + f"Requires a Thinc Model to be analysed, but found {type(model)} instead.", + exits=1, + ) + if print_settings is None: + print_settings = {} + + # STEP 0: Printing before training + msg.info(f"Analysing model with ID {model.id}") + if print_settings.get("print_before_training"): + msg.divider(f"STEP 0 - before training") + _print_model(model, print_settings) + + # STEP 1: Initializing the model and printing again + X = _get_docs() + # The output vector might differ from the official type of the output layer + with data_validation(False): + try: + dot_names = [resolved_train_config["train_corpus"]] + with show_validation_error(): + (train_corpus,) = resolve_dot_names(config, dot_names) + nlp.initialize(lambda: train_corpus(nlp)) + msg.info("Initialized the model with the training corpus.") + except ValueError: + try: + _set_output_dim(nO=7, model=model) + with show_validation_error(): + nlp.initialize(lambda: [Example.from_dict(x, {}) for x in X]) + msg.info("Initialized the model with dummy data.") + except Exception: + msg.fail( + "Could not initialize the model: you'll have to provide a valid train_corpus argument in the config file.", + exits=1, + ) + + if print_settings.get("print_after_init"): + msg.divider(f"STEP 1 - after initialization") + _print_model(model, print_settings) + + # STEP 2: Updating the model and printing again + optimizer = Adam(0.001) + set_dropout_rate(model, 0.2) + # ugly hack to deal with Tok2Vec listeners + tok2vec = None + if model.has_ref("tok2vec") and model.get_ref("tok2vec").name == "tok2vec-listener": + tok2vec = nlp.get_pipe("tok2vec") + goldY = None + for e in range(3): + if tok2vec: + tok2vec.update([Example.from_dict(x, {}) for x in X]) + Y, get_dX = model.begin_update(X) + if goldY is None: + goldY = _simulate_gold(Y) + dY = get_gradient(goldY, Y, model.ops) + get_dX(dY) + model.finish_update(optimizer) + if print_settings.get("print_after_training"): + msg.divider(f"STEP 2 - after training") + _print_model(model, print_settings) + + # STEP 3: the final prediction + prediction 
= model.predict(X) + if print_settings.get("print_prediction"): + msg.divider(f"STEP 3 - prediction") + msg.info(str(prediction)) + + msg.good(f"Succesfully ended analysis - model looks good.") + + +def get_gradient(goldY, Y, ops): + return ops.asarray(Y) - ops.asarray(goldY) + + +def _simulate_gold(element, counter=1): + if isinstance(element, Iterable): + for i in range(len(element)): + element[i] = _simulate_gold(element[i], counter + i) + return element + else: + return 1 / counter + + +def _sentences(): + return [ + "Apple is looking at buying U.K. startup for $1 billion", + "Autonomous cars shift insurance liability toward manufacturers", + "San Francisco considers banning sidewalk delivery robots", + "London is a big city in the United Kingdom.", + ] + + +def _get_docs(lang: str = "en"): + nlp = util.get_lang_class(lang)() + return list(nlp.pipe(_sentences())) + + +def _set_output_dim(model, nO): + # simulating dim inference by directly setting the nO argument of the model + if model.has_dim("nO") is None: + model.set_dim("nO", nO) + if model.has_ref("output_layer"): + if model.get_ref("output_layer").has_dim("nO") is None: + model.get_ref("output_layer").set_dim("nO", nO) + + +def _print_model(model, print_settings): + layers = print_settings.get("layers", "") + parameters = print_settings.get("parameters", False) + dimensions = print_settings.get("dimensions", False) + gradients = print_settings.get("gradients", False) + attributes = print_settings.get("attributes", False) + + for i, node in enumerate(model.walk()): + if not layers or i in layers: + msg.info(f"Layer {i}: model ID {node.id}: '{node.name}'") + + if dimensions: + for name in node.dim_names: + if node.has_dim(name): + msg.info(f" - dim {name}: {node.get_dim(name)}") + else: + msg.info(f" - dim {name}: {node.has_dim(name)}") + + if parameters: + for name in node.param_names: + if node.has_param(name): + print_value = _print_matrix(node.get_param(name)) + msg.info(f" - param {name}: {print_value}") + else: + msg.info(f" - param {name}: {node.has_param(name)}") + if gradients: + for name in node.param_names: + if node.has_grad(name): + print_value = _print_matrix(node.get_grad(name)) + msg.info(f" - grad {name}: {print_value}") + else: + msg.info(f" - grad {name}: {node.has_grad(name)}") + if attributes: + attrs = node.attrs + for name, value in attrs.items(): + msg.info(f" - attr {name}: {value}") + + +def _print_matrix(value): + if value is None or isinstance(value, bool): + return value + result = str(value.shape) + " - sample: " + sample_matrix = value + for d in range(value.ndim - 1): + sample_matrix = sample_matrix[0] + sample_matrix = sample_matrix[0:5] + result = result + str(sample_matrix) + return result diff --git a/spacy/cli/download.py b/spacy/cli/download.py index 19f3e7860..0e7ec2ea5 100644 --- a/spacy/cli/download.py +++ b/spacy/cli/download.py @@ -1,36 +1,47 @@ -# coding: utf8 -from __future__ import unicode_literals - -import plac +from typing import Optional, Sequence import requests -import os -import subprocess import sys from wasabi import msg +import typer -from .link import link -from ..util import get_package_path +from ._util import app, Arg, Opt from .. 
import about +from ..util import is_package, get_base_version, run_command +from ..errors import OLD_MODEL_SHORTCUTS -@plac.annotations( - model=("Model to download (shortcut or name)", "positional", None, str), - direct=("Force direct download of name + version", "flag", "d", bool), - pip_args=("Additional arguments to be passed to `pip install` on model install"), +@app.command( + "download", + context_settings={"allow_extra_args": True, "ignore_unknown_options": True}, ) -def download(model, direct=False, *pip_args): +def download_cli( + # fmt: off + ctx: typer.Context, + model: str = Arg(..., help="Name of pipeline package to download"), + direct: bool = Opt(False, "--direct", "-d", "-D", help="Force direct download of name + version"), + # fmt: on +): """ - Download compatible model from default download path using pip. Model - can be shortcut, model name or, if --direct flag is set, full model name - with version. For direct downloads, the compatibility check will be skipped. + Download compatible trained pipeline from the default download path using + pip. If --direct flag is set, the command expects the full package name with + version. For direct downloads, the compatibility check will be skipped. All + additional arguments provided to this command will be passed to `pip install` + on package installation. + + DOCS: https://nightly.spacy.io/api/cli#download + AVAILABLE PACKAGES: https://spacy.io/models """ - if not require_package("spacy") and "--no-deps" not in pip_args: + download(model, direct, *ctx.args) + + +def download(model: str, direct: bool = False, *pip_args) -> None: + if not is_package("spacy") and "--no-deps" not in pip_args: msg.warn( - "Skipping model package dependencies and setting `--no-deps`. " + "Skipping pipeline package dependencies and setting `--no-deps`. " "You don't seem to have the spaCy package itself installed " "(maybe because you've built from source?), so installing the " - "model dependencies would cause spaCy to be downloaded, which " - "probably isn't what you want. If the model package has other " + "package dependencies would cause spaCy to be downloaded, which " + "probably isn't what you want. If the pipeline package has other " "dependencies, you'll have to install them manually." ) pip_args = pip_args + ("--no-deps",) @@ -39,97 +50,58 @@ def download(model, direct=False, *pip_args): components = model.split("-") model_name = "".join(components[:-1]) version = components[-1] - dl = download_model(dl_tpl.format(m=model_name, v=version), pip_args) + download_model(dl_tpl.format(m=model_name, v=version), pip_args) else: - shortcuts = get_json(about.__shortcuts__, "available shortcuts") - model_name = shortcuts.get(model, model) + model_name = model + if model in OLD_MODEL_SHORTCUTS: + msg.warn( + f"As of spaCy v3.0, shortcuts like '{model}' are deprecated. Please" + f"use the full pipeline package name '{OLD_MODEL_SHORTCUTS[model]}' instead." + ) + model_name = OLD_MODEL_SHORTCUTS[model] compatibility = get_compatibility() version = get_version(model_name, compatibility) - dl = download_model(dl_tpl.format(m=model_name, v=version), pip_args) - if dl != 0: # if download subprocess doesn't return 0, exit - sys.exit(dl) - msg.good( - "Download and installation successful", - "You can now load the model via spacy.load('{}')".format(model_name), - ) - # Only create symlink if the model is installed via a shortcut like 'en'. 
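Not part of this diff, for reference: the refactored download helper above is also callable from Python. The package name follows the usual spaCy naming convention, and the pinned version in the second call is purely illustrative:

    from spacy.cli.download import download

    # Compatibility-checked install, equivalent to: python -m spacy download en_core_web_sm
    download("en_core_web_sm")

    # Direct download of an explicit package + version, skipping the compatibility check
    download("en_core_web_sm-3.0.0", direct=True)  # version string is illustrative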
- # There's no real advantage over an additional symlink for en_core_web_sm - # and if anything, it's more error prone and causes more confusion. - if model in shortcuts: - try: - # Get package path here because link uses - # pip.get_installed_distributions() to check if model is a - # package, which fails if model was just installed via - # subprocess - package_path = get_package_path(model_name) - link(model_name, model, force=True, model_path=package_path) - except: # noqa: E722 - # Dirty, but since spacy.download and the auto-linking is - # mostly a convenience wrapper, it's best to show a success - # message and loading instructions, even if linking fails. - msg.warn( - "Download successful but linking failed", - "Creating a shortcut link for '{}' didn't work (maybe you " - "don't have admin permissions?), but you can still load " - "the model via its full package name: " - "nlp = spacy.load('{}')".format(model, model_name), - ) - # If a model is downloaded and then loaded within the same process, our - # is_package check currently fails, because pkg_resources.working_set - # is not refreshed automatically (see #3923). We're trying to work - # around this here be requiring the package explicitly. - require_package(model_name) + download_model(dl_tpl.format(m=model_name, v=version), pip_args) + msg.good( + "Download and installation successful", + f"You can now load the package via spacy.load('{model_name}')", + ) -def require_package(name): - try: - import pkg_resources - - pkg_resources.working_set.require(name) - return True - except: # noqa: E722 - return False - - -def get_json(url, desc): - r = requests.get(url) +def get_compatibility() -> dict: + version = get_base_version(about.__version__) + r = requests.get(about.__compatibility__) if r.status_code != 200: msg.fail( - "Server error ({})".format(r.status_code), - "Couldn't fetch {}. Please find a model for your spaCy " - "installation (v{}), and download it manually. For more " - "details, see the documentation: " - "https://spacy.io/usage/models".format(desc, about.__version__), + f"Server error ({r.status_code})", + f"Couldn't fetch compatibility table. Please find a package for your spaCy " + f"installation (v{about.__version__}), and download it manually. 
" + f"For more details, see the documentation: " + f"https://nightly.spacy.io/usage/models", exits=1, ) - return r.json() - - -def get_compatibility(): - version = about.__version__ - version = version.rsplit(".dev", 1)[0] - comp_table = get_json(about.__compatibility__, "compatibility table") + comp_table = r.json() comp = comp_table["spacy"] if version not in comp: - msg.fail("No compatible models found for v{} of spaCy".format(version), exits=1) + msg.fail(f"No compatible packages found for v{version} of spaCy", exits=1) return comp[version] -def get_version(model, comp): - model = model.rsplit(".dev", 1)[0] +def get_version(model: str, comp: dict) -> str: if model not in comp: msg.fail( - "No compatible model found for '{}' " - "(spaCy v{}).".format(model, about.__version__), + f"No compatible package found for '{model}' (spaCy v{about.__version__})", exits=1, ) return comp[model][0] -def download_model(filename, user_pip_args=None): +def download_model( + filename: str, user_pip_args: Optional[Sequence[str]] = None +) -> None: download_url = about.__download_url__ + "/" + filename pip_args = ["--no-cache-dir"] if user_pip_args: pip_args.extend(user_pip_args) cmd = [sys.executable, "-m", "pip", "install"] + pip_args + [download_url] - return subprocess.call(cmd, env=os.environ.copy()) + run_command(cmd) diff --git a/spacy/cli/evaluate.py b/spacy/cli/evaluate.py index be994de73..566820283 100644 --- a/spacy/cli/evaluate.py +++ b/spacy/cli/evaluate.py @@ -1,76 +1,125 @@ -# coding: utf8 -from __future__ import unicode_literals, division, print_function +from typing import Optional, List, Dict +from wasabi import Printer +from pathlib import Path +import re +import srsly +from thinc.api import fix_random_seed -import plac -from timeit import default_timer as timer -from wasabi import msg - -from ..gold import GoldCorpus +from ..training import Corpus +from ..tokens import Doc +from ._util import app, Arg, Opt, setup_gpu, import_code +from ..scorer import Scorer from .. import util from .. 
import displacy -@plac.annotations( - model=("Model name or path", "positional", None, str), - data_path=("Location of JSON-formatted evaluation data", "positional", None, str), - gold_preproc=("Use gold preprocessing", "flag", "G", bool), - gpu_id=("Use GPU", "option", "g", int), - displacy_path=("Directory to output rendered parses as HTML", "option", "dp", str), - displacy_limit=("Limit of parses to render as HTML", "option", "dl", int), - return_scores=("Return dict containing model scores", "flag", "R", bool), -) -def evaluate( - model, - data_path, - gpu_id=-1, - gold_preproc=False, - displacy_path=None, - displacy_limit=25, - return_scores=False, +@app.command("evaluate") +def evaluate_cli( + # fmt: off + model: str = Arg(..., help="Model name or path"), + data_path: Path = Arg(..., help="Location of binary evaluation data in .spacy format", exists=True), + output: Optional[Path] = Opt(None, "--output", "-o", help="Output JSON file for metrics", dir_okay=False), + code_path: Optional[Path] = Opt(None, "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"), + use_gpu: int = Opt(-1, "--gpu-id", "-g", help="GPU ID or -1 for CPU"), + gold_preproc: bool = Opt(False, "--gold-preproc", "-G", help="Use gold preprocessing"), + displacy_path: Optional[Path] = Opt(None, "--displacy-path", "-dp", help="Directory to output rendered parses as HTML", exists=True, file_okay=False), + displacy_limit: int = Opt(25, "--displacy-limit", "-dl", help="Limit of parses to render as HTML"), + # fmt: on ): """ - Evaluate a model. To render a sample of parses in a HTML file, set an - output directory as the displacy_path argument. + Evaluate a trained pipeline. Expects a loadable spaCy pipeline and evaluation + data in the binary .spacy format. The --gold-preproc option sets up the + evaluation examples with gold-standard sentences and tokens for the + predictions. Gold preprocessing helps the annotations align to the + tokenization, and may result in sequences of more consistent length. However, + it may reduce runtime accuracy due to train/test skew. To render a sample of + dependency parses in a HTML file, set as output directory as the + displacy_path argument. 
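As a rough usage sketch (paths are illustrative), the command and the programmatic evaluate() helper defined below can be driven like this:

# Shell usage:
#   python -m spacy evaluate ./training/model-best ./corpus/dev.spacy --output ./metrics.json

from pathlib import Path
from spacy.cli.evaluate import evaluate

scores = evaluate(
    "./training/model-best",        # loadable pipeline name or path
    Path("./corpus/dev.spacy"),     # binary .spacy evaluation data
    output=Path("./metrics.json"),  # optional JSON dump of the metrics
    silent=False,
)
print(scores)  # dict of metrics; keys depend on the components in the pipeline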
+ + DOCS: https://nightly.spacy.io/api/cli#evaluate """ - util.fix_random_seed() - if gpu_id >= 0: - util.use_gpu(gpu_id) - util.set_env_log(False) + import_code(code_path) + evaluate( + model, + data_path, + output=output, + use_gpu=use_gpu, + gold_preproc=gold_preproc, + displacy_path=displacy_path, + displacy_limit=displacy_limit, + silent=False, + ) + + +def evaluate( + model: str, + data_path: Path, + output: Optional[Path] = None, + use_gpu: int = -1, + gold_preproc: bool = False, + displacy_path: Optional[Path] = None, + displacy_limit: int = 25, + silent: bool = True, +) -> Scorer: + msg = Printer(no_print=silent, pretty=not silent) + fix_random_seed() + setup_gpu(use_gpu) data_path = util.ensure_path(data_path) + output_path = util.ensure_path(output) displacy_path = util.ensure_path(displacy_path) if not data_path.exists(): msg.fail("Evaluation data not found", data_path, exits=1) if displacy_path and not displacy_path.exists(): msg.fail("Visualization output directory not found", displacy_path, exits=1) - corpus = GoldCorpus(data_path, data_path) - if model.startswith("blank:"): - nlp = util.get_lang_class(model.replace("blank:", ""))() - else: - nlp = util.load_model(model) - dev_docs = list(corpus.dev_docs(nlp, gold_preproc=gold_preproc)) - begin = timer() - scorer = nlp.evaluate(dev_docs, verbose=False) - end = timer() - nwords = sum(len(doc_gold[0]) for doc_gold in dev_docs) - results = { - "Time": "%.2f s" % (end - begin), - "Words": nwords, - "Words/s": "%.0f" % (nwords / (end - begin)), - "TOK": "%.2f" % scorer.token_acc, - "POS": "%.2f" % scorer.tags_acc, - "UAS": "%.2f" % scorer.uas, - "LAS": "%.2f" % scorer.las, - "NER P": "%.2f" % scorer.ents_p, - "NER R": "%.2f" % scorer.ents_r, - "NER F": "%.2f" % scorer.ents_f, - "Textcat": "%.2f" % scorer.textcat_score, + corpus = Corpus(data_path, gold_preproc=gold_preproc) + nlp = util.load_model(model) + dev_dataset = list(corpus(nlp)) + scores = nlp.evaluate(dev_dataset) + metrics = { + "TOK": "token_acc", + "TAG": "tag_acc", + "POS": "pos_acc", + "MORPH": "morph_acc", + "LEMMA": "lemma_acc", + "UAS": "dep_uas", + "LAS": "dep_las", + "NER P": "ents_p", + "NER R": "ents_r", + "NER F": "ents_f", + "TEXTCAT": "cats_score", + "SENT P": "sents_p", + "SENT R": "sents_r", + "SENT F": "sents_f", + "SPEED": "speed", } + results = {} + for metric, key in metrics.items(): + if key in scores: + if key == "cats_score": + metric = metric + " (" + scores.get("cats_score_desc", "unk") + ")" + if key == "speed": + results[metric] = f"{scores[key]:.0f}" + else: + results[metric] = f"{scores[key]*100:.2f}" + data = {re.sub(r"[\s/]", "_", k.lower()): v for k, v in results.items()} + msg.table(results, title="Results") + if "ents_per_type" in scores: + if scores["ents_per_type"]: + print_ents_per_type(msg, scores["ents_per_type"]) + if "cats_f_per_type" in scores: + if scores["cats_f_per_type"]: + print_textcats_f_per_cat(msg, scores["cats_f_per_type"]) + if "cats_auc_per_type" in scores: + if scores["cats_auc_per_type"]: + print_textcats_auc_per_cat(msg, scores["cats_auc_per_type"]) + if displacy_path: - docs, golds = zip(*dev_docs) - render_deps = "parser" in nlp.meta.get("pipeline", []) - render_ents = "ner" in nlp.meta.get("pipeline", []) + factory_names = [nlp.get_pipe_meta(pipe).factory for pipe in nlp.pipe_names] + docs = [ex.predicted for ex in dev_dataset] + render_deps = "parser" in factory_names + render_ents = "ner" in factory_names render_parses( docs, displacy_path, @@ -79,12 +128,22 @@ def evaluate( deps=render_deps, 
ents=render_ents, ) - msg.good("Generated {} parses as HTML".format(displacy_limit), displacy_path) - if return_scores: - return scorer.scores + msg.good(f"Generated {displacy_limit} parses as HTML", displacy_path) + + if output_path is not None: + srsly.write_json(output_path, data) + msg.good(f"Saved results to {output_path}") + return data -def render_parses(docs, output_path, model_name="", limit=250, deps=True, ents=True): +def render_parses( + docs: List[Doc], + output_path: Path, + model_name: str = "", + limit: int = 250, + deps: bool = True, + ents: bool = True, +): docs[0].user_data["title"] = model_name if ents: html = displacy.render(docs[:limit], style="ent", page=True) @@ -96,3 +155,40 @@ def render_parses(docs, output_path, model_name="", limit=250, deps=True, ents=T ) with (output_path / "parses.html").open("w", encoding="utf8") as file_: file_.write(html) + + +def print_ents_per_type(msg: Printer, scores: Dict[str, Dict[str, float]]) -> None: + data = [ + (k, f"{v['p']*100:.2f}", f"{v['r']*100:.2f}", f"{v['f']*100:.2f}") + for k, v in scores.items() + ] + msg.table( + data, + header=("", "P", "R", "F"), + aligns=("l", "r", "r", "r"), + title="NER (per type)", + ) + + +def print_textcats_f_per_cat(msg: Printer, scores: Dict[str, Dict[str, float]]) -> None: + data = [ + (k, f"{v['p']*100:.2f}", f"{v['r']*100:.2f}", f"{v['f']*100:.2f}") + for k, v in scores.items() + ] + msg.table( + data, + header=("", "P", "R", "F"), + aligns=("l", "r", "r", "r"), + title="Textcat F (per label)", + ) + + +def print_textcats_auc_per_cat( + msg: Printer, scores: Dict[str, Dict[str, float]] +) -> None: + msg.table( + [(k, f"{v:.2f}") for k, v in scores.items()], + header=("", "ROC AUC"), + aligns=("l", "r"), + title="Textcat ROC AUC (per label)", + ) diff --git a/spacy/cli/info.py b/spacy/cli/info.py index 080d0dc77..2f2515278 100644 --- a/spacy/cli/info.py +++ b/spacy/cli/info.py @@ -1,92 +1,115 @@ -# coding: utf8 -from __future__ import unicode_literals - -import plac +from typing import Optional, Dict, Any, Union import platform from pathlib import Path -from wasabi import msg +from wasabi import Printer, MarkdownRenderer import srsly -from ..compat import path2str, basestring_, unicode_ +from ._util import app, Arg, Opt from .. import util from .. import about -@plac.annotations( - model=("Optional shortcut link of model", "positional", None, str), - markdown=("Generate Markdown for GitHub issues", "flag", "md", str), - silent=("Don't print anything (just return)", "flag", "s"), -) -def info(model=None, markdown=False, silent=False): +@app.command("info") +def info_cli( + # fmt: off + model: Optional[str] = Arg(None, help="Optional loadable spaCy pipeline"), + markdown: bool = Opt(False, "--markdown", "-md", help="Generate Markdown for GitHub issues"), + silent: bool = Opt(False, "--silent", "-s", "-S", help="Don't print anything (just return)"), + # fmt: on +): """ - Print info about spaCy installation. If a model shortcut link is - speficied as an argument, print model information. Flag --markdown - prints details in Markdown for easy copy-pasting to GitHub issues. + Print info about spaCy installation. If a pipeline is speficied as an argument, + print its meta information. Flag --markdown prints details in Markdown for easy + copy-pasting to GitHub issues. 
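A small sketch of the programmatic entry point wrapped by this command (the pipeline name is illustrative and assumed to be installed):

import spacy

# Info about the spaCy installation itself, returned as a dict with lowercased keys
data = spacy.info()
print(data["spacy_version"], data["platform"])

# Markdown-formatted info about an installed pipeline, e.g. for a GitHub issue
md = spacy.info("en_core_web_sm", markdown=True, silent=True)
print(md)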
+ + DOCS: https://nightly.spacy.io/api/cli#info """ + info(model, markdown=markdown, silent=silent) + + +def info( + model: Optional[str] = None, *, markdown: bool = False, silent: bool = True +) -> Union[str, dict]: + msg = Printer(no_print=silent, pretty=not silent) if model: - if util.is_package(model): - model_path = util.get_package_path(model) - else: - model_path = util.get_data_path() / model - meta_path = model_path / "meta.json" - if not meta_path.is_file(): - msg.fail("Can't find model meta.json", meta_path, exits=1) - meta = srsly.read_json(meta_path) - if model_path.resolve() != model_path: - meta["link"] = path2str(model_path) - meta["source"] = path2str(model_path.resolve()) - else: - meta["source"] = path2str(model_path) + title = f"Info about pipeline '{model}'" + data = info_model(model, silent=silent) + else: + title = "Info about spaCy" + data = info_spacy() + raw_data = {k.lower().replace(" ", "_"): v for k, v in data.items()} + if "Pipelines" in data and isinstance(data["Pipelines"], dict): + data["Pipelines"] = ", ".join( + f"{n} ({v})" for n, v in data["Pipelines"].items() + ) + markdown_data = get_markdown(data, title=title) + if markdown: if not silent: - title = "Info about model '{}'".format(model) - model_meta = { - k: v for k, v in meta.items() if k not in ("accuracy", "speed") - } - if markdown: - print_markdown(model_meta, title=title) - else: - msg.table(model_meta, title=title) - return meta - data = { + print(markdown_data) + return markdown_data + if not silent: + table_data = dict(data) + msg.table(table_data, title=title) + return raw_data + + +def info_spacy() -> Dict[str, any]: + """Generate info about the current spaCy intallation. + + RETURNS (dict): The spaCy info. + """ + all_models = {} + for pkg_name in util.get_installed_models(): + package = pkg_name.replace("-", "_") + all_models[package] = util.get_package_version(pkg_name) + return { "spaCy version": about.__version__, - "Location": path2str(Path(__file__).parent.parent), + "Location": str(Path(__file__).parent.parent), "Platform": platform.platform(), "Python version": platform.python_version(), - "Models": list_models(), + "Pipelines": all_models, } - if not silent: - title = "Info about spaCy" - if markdown: - print_markdown(data, title=title) - else: - msg.table(data, title=title) - return data -def list_models(): - def exclude_dir(dir_name): - # exclude common cache directories and hidden directories - exclude = ("cache", "pycache", "__pycache__") - return dir_name in exclude or dir_name.startswith(".") +def info_model(model: str, *, silent: bool = True) -> Dict[str, Any]: + """Generate info about a specific model. - data_path = util.get_data_path() - if data_path: - models = [f.parts[-1] for f in data_path.iterdir() if f.is_dir()] - return ", ".join([m for m in models if not exclude_dir(m)]) - return "-" + model (str): Model name of path. + silent (bool): Don't print anything, just return. + RETURNS (dict): The model meta. 
+ """ + msg = Printer(no_print=silent, pretty=not silent) + if util.is_package(model): + model_path = util.get_package_path(model) + else: + model_path = model + meta_path = model_path / "meta.json" + if not meta_path.is_file(): + msg.fail("Can't find pipeline meta.json", meta_path, exits=1) + meta = srsly.read_json(meta_path) + if model_path.resolve() != model_path: + meta["source"] = str(model_path.resolve()) + else: + meta["source"] = str(model_path) + return { + k: v for k, v in meta.items() if k not in ("accuracy", "performance", "speed") + } -def print_markdown(data, title=None): - """Print data in GitHub-flavoured Markdown format for issues etc. +def get_markdown(data: Dict[str, Any], title: Optional[str] = None) -> str: + """Get data in GitHub-flavoured Markdown format for issues etc. data (dict or list of tuples): Label/value pairs. - title (unicode or None): Title, will be rendered as headline 2. + title (str / None): Title, will be rendered as headline 2. + RETURNS (str): The Markdown string. """ - markdown = [] - for key, value in data.items(): - if isinstance(value, basestring_) and Path(value).exists(): - continue - markdown.append("* **{}:** {}".format(key, unicode_(value))) + md = MarkdownRenderer() if title: - print("\n## {}".format(title)) - print("\n{}\n".format("\n".join(markdown))) + md.add(md.title(2, title)) + items = [] + for key, value in data.items(): + if isinstance(value, str) and Path(value).exists(): + continue + items.append(f"{md.bold(f'{key}:')} {value}") + md.add(md.list(items)) + return f"\n{md.text}\n" diff --git a/spacy/cli/init_config.py b/spacy/cli/init_config.py new file mode 100644 index 000000000..9f73b17ae --- /dev/null +++ b/spacy/cli/init_config.py @@ -0,0 +1,218 @@ +from typing import Optional, List, Tuple +from enum import Enum +from pathlib import Path +from wasabi import Printer, diff_strings +from thinc.api import Config +import srsly +import re + +from .. import util +from ..language import DEFAULT_CONFIG_PRETRAIN_PATH +from ..schemas import RecommendationSchema +from ._util import init_cli, Arg, Opt, show_validation_error, COMMAND, string_to_list + + +ROOT = Path(__file__).parent / "templates" +TEMPLATE_PATH = ROOT / "quickstart_training.jinja" +RECOMMENDATIONS = srsly.read_yaml(ROOT / "quickstart_training_recommendations.yml") + + +class Optimizations(str, Enum): + efficiency = "efficiency" + accuracy = "accuracy" + + +@init_cli.command("config") +def init_config_cli( + # fmt: off + output_file: Path = Arg(..., help="File to save config.cfg to or - for stdout (will only output config and no additional logging info)", allow_dash=True), + lang: Optional[str] = Opt("en", "--lang", "-l", help="Two-letter code of the language to use"), + pipeline: Optional[str] = Opt("tagger,parser,ner", "--pipeline", "-p", help="Comma-separated names of trainable pipeline components to include (without 'tok2vec' or 'transformer')"), + optimize: Optimizations = Opt(Optimizations.efficiency.value, "--optimize", "-o", help="Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). This will impact the choice of architecture, pretrained weights and related hyperparameters."), + cpu: bool = Opt(False, "--cpu", "-C", help="Whether the model needs to run on CPU. 
This will impact the choice of architecture, pretrained weights and related hyperparameters."), + pretraining: bool = Opt(False, "--pretraining", "-pt", help="Include config for pretraining (with 'spacy pretrain')"), + # fmt: on +): + """ + Generate a starter config.cfg for training. Based on your requirements + specified via the CLI arguments, this command generates a config with the + optimal settings for your use case. This includes the choice of architecture, + pretrained weights and related hyperparameters. + + DOCS: https://nightly.spacy.io/api/cli#init-config + """ + if isinstance(optimize, Optimizations): # instance of enum from the CLI + optimize = optimize.value + pipeline = string_to_list(pipeline) + init_config( + output_file, + lang=lang, + pipeline=pipeline, + optimize=optimize, + cpu=cpu, + pretraining=pretraining, + ) + + +@init_cli.command("fill-config") +def init_fill_config_cli( + # fmt: off + base_path: Path = Arg(..., help="Base config to fill", exists=True, dir_okay=False), + output_file: Path = Arg("-", help="File to save config.cfg to (or - for stdout)", allow_dash=True), + pretraining: bool = Opt(False, "--pretraining", "-pt", help="Include config for pretraining (with 'spacy pretrain')"), + diff: bool = Opt(False, "--diff", "-D", help="Print a visual diff highlighting the changes") + # fmt: on +): + """ + Fill partial config.cfg with default values. Will add all missing settings + from the default config and will create all objects, check the registered + functions for their default values and update the base config. This command + can be used with a config generated via the training quickstart widget: + https://nightly.spacy.io/usage/training#quickstart + + DOCS: https://nightly.spacy.io/api/cli#init-fill-config + """ + fill_config(output_file, base_path, pretraining=pretraining, diff=diff) + + +def fill_config( + output_file: Path, + base_path: Path, + *, + pretraining: bool = False, + diff: bool = False, + silent: bool = False, +) -> Tuple[Config, Config]: + is_stdout = str(output_file) == "-" + no_print = is_stdout or silent + msg = Printer(no_print=no_print) + with show_validation_error(hint_fill=False): + config = util.load_config(base_path) + nlp = util.load_model_from_config(config, auto_fill=True, validate=False) + # Load a second time with validation to be extra sure that the produced + # config result is a valid config + nlp = util.load_model_from_config(nlp.config) + filled = nlp.config + if pretraining: + validate_config_for_pretrain(filled, msg) + pretrain_config = util.load_config(DEFAULT_CONFIG_PRETRAIN_PATH) + filled = pretrain_config.merge(filled) + before = config.to_str() + after = filled.to_str() + if before == after: + msg.warn("Nothing to auto-fill: base config is already complete") + else: + msg.good("Auto-filled config with all values") + if diff and not no_print: + if before == after: + msg.warn("No diff to show: nothing was auto-filled") + else: + msg.divider("START CONFIG DIFF") + print("") + print(diff_strings(before, after)) + msg.divider("END CONFIG DIFF") + print("") + save_config(filled, output_file, is_stdout=is_stdout, silent=silent) + return config, filled + + +def init_config( + output_file: Path, + *, + lang: str, + pipeline: List[str], + optimize: str, + cpu: bool, + pretraining: bool = False, +) -> None: + is_stdout = str(output_file) == "-" + msg = Printer(no_print=is_stdout) + try: + from jinja2 import Template + except ImportError: + msg.fail("This command requires jinja2", "pip install jinja2", exits=1) + with 
TEMPLATE_PATH.open("r") as f: + template = Template(f.read()) + # Filter out duplicates since tok2vec and transformer are added by template + pipeline = [pipe for pipe in pipeline if pipe not in ("tok2vec", "transformer")] + reco = RecommendationSchema(**RECOMMENDATIONS.get(lang, {})).dict() + variables = { + "lang": lang, + "components": pipeline, + "optimize": optimize, + "hardware": "cpu" if cpu else "gpu", + "transformer_data": reco["transformer"], + "word_vectors": reco["word_vectors"], + "has_letters": reco["has_letters"], + } + if variables["transformer_data"] and not has_spacy_transformers(): + msg.warn( + "To generate a more effective transformer-based config (GPU-only), " + "install the spacy-transformers package and re-run this command. " + "The config generated now does not use transformers." + ) + variables["transformer_data"] = None + base_template = template.render(variables).strip() + # Giving up on getting the newlines right in jinja for now + base_template = re.sub(r"\n\n\n+", "\n\n", base_template) + # Access variables declared in templates + template_vars = template.make_module(variables) + use_case = { + "Language": lang, + "Pipeline": ", ".join(pipeline), + "Optimize for": optimize, + "Hardware": variables["hardware"].upper(), + "Transformer": template_vars.transformer.get("name", False), + } + msg.info("Generated template specific for your use case") + for label, value in use_case.items(): + msg.text(f"- {label}: {value}") + with show_validation_error(hint_fill=False): + config = util.load_config_from_str(base_template) + nlp = util.load_model_from_config(config, auto_fill=True) + config = nlp.config + if pretraining: + validate_config_for_pretrain(config, msg) + pretrain_config = util.load_config(DEFAULT_CONFIG_PRETRAIN_PATH) + config = pretrain_config.merge(config) + msg.good("Auto-filled config with all values") + save_config(config, output_file, is_stdout=is_stdout) + + +def save_config( + config: Config, output_file: Path, is_stdout: bool = False, silent: bool = False +) -> None: + no_print = is_stdout or silent + msg = Printer(no_print=no_print) + if is_stdout: + print(config.to_str()) + else: + if not output_file.parent.exists(): + output_file.parent.mkdir(parents=True) + config.to_disk(output_file, interpolate=False) + msg.good("Saved config", output_file) + msg.text("You can now add your data and train your pipeline:") + variables = ["--paths.train ./train.spacy", "--paths.dev ./dev.spacy"] + if not no_print: + print(f"{COMMAND} train {output_file.parts[-1]} {' '.join(variables)}") + + +def has_spacy_transformers() -> bool: + try: + import spacy_transformers # noqa: F401 + + return True + except ImportError: + return False + + +def validate_config_for_pretrain(config: Config, msg: Printer) -> None: + if "tok2vec" not in config["nlp"]["pipeline"]: + msg.warn( + "No tok2vec component found in the pipeline. If your tok2vec " + "component has a different name, you may need to adjust the " + "tok2vec_model reference in the [pretraining] block. If you don't " + "have a tok2vec component, make sure to add it to your [components] " + "and the pipeline specified in the [nlp] block, so you can pretrain " + "weights for it." 
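To make the intended workflow concrete, a sketch of how the two commands above are typically used (file names and pipeline choices are illustrative):

# Shell usage:
#   python -m spacy init config config.cfg --lang en --pipeline tagger,parser,ner --optimize efficiency --cpu
#   python -m spacy init fill-config base_config.cfg config.cfg --diff

from pathlib import Path
from spacy.cli.init_config import init_config, fill_config

# Generate a starter config from the quickstart template (requires jinja2)
init_config(
    Path("config.cfg"),
    lang="en",
    pipeline=["tagger", "parser", "ner"],
    optimize="efficiency",
    cpu=True,
)

# Fill a partial config (e.g. from the quickstart widget) with all default values
config, filled = fill_config(Path("filled_config.cfg"), Path("base_config.cfg"), diff=True)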
+ ) diff --git a/spacy/cli/init_model.py b/spacy/cli/init_model.py deleted file mode 100644 index 7fdd39932..000000000 --- a/spacy/cli/init_model.py +++ /dev/null @@ -1,301 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import plac -import math -from tqdm import tqdm -import numpy -from ast import literal_eval -from pathlib import Path -from preshed.counter import PreshCounter -import tarfile -import gzip -import zipfile -import srsly -import warnings -from wasabi import msg - -from ..vectors import Vectors -from ..errors import Errors, Warnings -from ..util import ensure_path, get_lang_class, load_model, OOV_RANK -from ..lookups import Lookups - - -try: - import ftfy -except ImportError: - ftfy = None - - -DEFAULT_OOV_PROB = -20 - - -@plac.annotations( - lang=("Model language", "positional", None, str), - output_dir=("Model output directory", "positional", None, Path), - freqs_loc=("Location of words frequencies file", "option", "f", Path), - jsonl_loc=("Location of JSONL-formatted attributes file", "option", "j", Path), - clusters_loc=("Optional location of brown clusters data", "option", "c", str), - vectors_loc=("Optional vectors file in Word2Vec format", "option", "v", str), - truncate_vectors=( - "Optional number of vectors to truncate to when reading in vectors file", - "option", - "t", - int, - ), - prune_vectors=("Optional number of vectors to prune to", "option", "V", int), - vectors_name=( - "Optional name for the word vectors, e.g. en_core_web_lg.vectors", - "option", - "vn", - str, - ), - model_name=("Optional name for the model meta", "option", "mn", str), - omit_extra_lookups=("Don't include extra lookups in model", "flag", "OEL", bool), - base_model=("Base model (for languages with custom tokenizers)", "option", "b", str), -) -def init_model( - lang, - output_dir, - freqs_loc=None, - clusters_loc=None, - jsonl_loc=None, - vectors_loc=None, - truncate_vectors=0, - prune_vectors=-1, - vectors_name=None, - model_name=None, - omit_extra_lookups=False, - base_model=None, -): - """ - Create a new model from raw data, like word frequencies, Brown clusters - and word vectors. If vectors are provided in Word2Vec format, they can - be either a .txt or zipped as a .zip or .tar.gz. - """ - if jsonl_loc is not None: - if freqs_loc is not None or clusters_loc is not None: - settings = ["-j"] - if freqs_loc: - settings.append("-f") - if clusters_loc: - settings.append("-c") - msg.warn( - "Incompatible arguments", - "The -f and -c arguments are deprecated, and not compatible " - "with the -j argument, which should specify the same " - "information. 
Either merge the frequencies and clusters data " - "into the JSONL-formatted file (recommended), or use only the " - "-f and -c files, without the other lexical attributes.", - ) - jsonl_loc = ensure_path(jsonl_loc) - lex_attrs = srsly.read_jsonl(jsonl_loc) - else: - clusters_loc = ensure_path(clusters_loc) - freqs_loc = ensure_path(freqs_loc) - if freqs_loc is not None and not freqs_loc.exists(): - msg.fail("Can't find words frequencies file", freqs_loc, exits=1) - lex_attrs = read_attrs_from_deprecated(freqs_loc, clusters_loc) - - with msg.loading("Creating model..."): - nlp = create_model(lang, lex_attrs, name=model_name, base_model=base_model) - - # Create empty extra lexeme tables so the data from spacy-lookups-data - # isn't loaded if these features are accessed - if omit_extra_lookups: - nlp.vocab.lookups_extra = Lookups() - nlp.vocab.lookups_extra.add_table("lexeme_cluster") - nlp.vocab.lookups_extra.add_table("lexeme_prob") - nlp.vocab.lookups_extra.add_table("lexeme_settings") - - msg.good("Successfully created model") - if vectors_loc is not None: - add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, vectors_name) - vec_added = len(nlp.vocab.vectors) - lex_added = len(nlp.vocab) - msg.good( - "Sucessfully compiled vocab", - "{} entries, {} vectors".format(lex_added, vec_added), - ) - if not output_dir.exists(): - output_dir.mkdir() - nlp.to_disk(output_dir) - return nlp - - -def open_file(loc): - """Handle .gz, .tar.gz or unzipped files""" - loc = ensure_path(loc) - if tarfile.is_tarfile(str(loc)): - return tarfile.open(str(loc), "r:gz") - elif loc.parts[-1].endswith("gz"): - return (line.decode("utf8") for line in gzip.open(str(loc), "r")) - elif loc.parts[-1].endswith("zip"): - zip_file = zipfile.ZipFile(str(loc)) - names = zip_file.namelist() - file_ = zip_file.open(names[0]) - return (line.decode("utf8") for line in file_) - else: - return loc.open("r", encoding="utf8") - - -def read_attrs_from_deprecated(freqs_loc, clusters_loc): - if freqs_loc is not None: - with msg.loading("Counting frequencies..."): - probs, _ = read_freqs(freqs_loc) - msg.good("Counted frequencies") - else: - probs, _ = ({}, DEFAULT_OOV_PROB) # noqa: F841 - if clusters_loc: - with msg.loading("Reading clusters..."): - clusters = read_clusters(clusters_loc) - msg.good("Read clusters") - else: - clusters = {} - lex_attrs = [] - sorted_probs = sorted(probs.items(), key=lambda item: item[1], reverse=True) - if len(sorted_probs): - for i, (word, prob) in tqdm(enumerate(sorted_probs)): - attrs = {"orth": word, "id": i, "prob": prob} - # Decode as a little-endian string, so that we can do & 15 to get - # the first 4 bits. 
See _parse_features.pyx - if word in clusters: - attrs["cluster"] = int(clusters[word][::-1], 2) - else: - attrs["cluster"] = 0 - lex_attrs.append(attrs) - return lex_attrs - - -def create_model(lang, lex_attrs, name=None, base_model=None): - if base_model: - nlp = load_model(base_model) - # keep the tokenizer but remove any existing pipeline components due to - # potentially conflicting vectors - for pipe in nlp.pipe_names: - nlp.remove_pipe(pipe) - else: - lang_class = get_lang_class(lang) - nlp = lang_class() - for lexeme in nlp.vocab: - lexeme.rank = OOV_RANK - for attrs in lex_attrs: - if "settings" in attrs: - continue - lexeme = nlp.vocab[attrs["orth"]] - lexeme.set_attrs(**attrs) - if len(nlp.vocab): - oov_prob = min(lex.prob for lex in nlp.vocab) - 1 - else: - oov_prob = DEFAULT_OOV_PROB - nlp.vocab.cfg.update({"oov_prob": oov_prob}) - if name: - nlp.meta["name"] = name - return nlp - - -def add_vectors(nlp, vectors_loc, truncate_vectors, prune_vectors, name=None): - vectors_loc = ensure_path(vectors_loc) - if vectors_loc and vectors_loc.parts[-1].endswith(".npz"): - nlp.vocab.vectors = Vectors(data=numpy.load(vectors_loc.open("rb"))) - for lex in nlp.vocab: - if lex.rank and lex.rank != OOV_RANK: - nlp.vocab.vectors.add(lex.orth, row=lex.rank) - else: - if vectors_loc: - with msg.loading("Reading vectors from {}".format(vectors_loc)): - vectors_data, vector_keys = read_vectors(vectors_loc, truncate_vectors) - msg.good("Loaded vectors from {}".format(vectors_loc)) - else: - vectors_data, vector_keys = (None, None) - if vector_keys is not None: - for word in vector_keys: - if word not in nlp.vocab: - nlp.vocab[word] - if vectors_data is not None: - nlp.vocab.vectors = Vectors(data=vectors_data, keys=vector_keys) - if name is None: - nlp.vocab.vectors.name = "%s_model.vectors" % nlp.meta["lang"] - else: - nlp.vocab.vectors.name = name - nlp.meta["vectors"]["name"] = nlp.vocab.vectors.name - if prune_vectors >= 1: - nlp.vocab.prune_vectors(prune_vectors) - - -def read_vectors(vectors_loc, truncate_vectors=0): - f = open_file(vectors_loc) - shape = tuple(int(size) for size in next(f).split()) - if truncate_vectors >= 1: - shape = (truncate_vectors, shape[1]) - vectors_data = numpy.zeros(shape=shape, dtype="f") - vectors_keys = [] - for i, line in enumerate(tqdm(f)): - line = line.rstrip() - pieces = line.rsplit(" ", vectors_data.shape[1]) - word = pieces.pop(0) - if len(pieces) != vectors_data.shape[1]: - msg.fail(Errors.E094.format(line_num=i, loc=vectors_loc), exits=1) - vectors_data[i] = numpy.asarray(pieces, dtype="f") - vectors_keys.append(word) - if i == truncate_vectors - 1: - break - return vectors_data, vectors_keys - - -def read_freqs(freqs_loc, max_length=100, min_doc_freq=5, min_freq=50): - counts = PreshCounter() - total = 0 - with freqs_loc.open() as f: - for i, line in enumerate(f): - freq, doc_freq, key = line.rstrip().split("\t", 2) - freq = int(freq) - counts.inc(i + 1, freq) - total += freq - counts.smooth() - log_total = math.log(total) - probs = {} - with freqs_loc.open() as f: - for line in tqdm(f): - freq, doc_freq, key = line.rstrip().split("\t", 2) - doc_freq = int(doc_freq) - freq = int(freq) - if doc_freq >= min_doc_freq and freq >= min_freq and len(key) < max_length: - try: - word = literal_eval(key) - except SyntaxError: - # Take odd strings literally. 
- word = literal_eval("'%s'" % key) - smooth_count = counts.smoother(int(freq)) - probs[word] = math.log(smooth_count) - log_total - oov_prob = math.log(counts.smoother(0)) - log_total - return probs, oov_prob - - -def read_clusters(clusters_loc): - clusters = {} - if ftfy is None: - warnings.warn(Warnings.W004) - with clusters_loc.open() as f: - for line in tqdm(f): - try: - cluster, word, freq = line.split() - if ftfy is not None: - word = ftfy.fix_text(word) - except ValueError: - continue - # If the clusterer has only seen the word a few times, its - # cluster is unreliable. - if int(freq) >= 3: - clusters[word] = cluster - else: - clusters[word] = "0" - # Expand clusters with re-casing - for word, cluster in list(clusters.items()): - if word.lower() not in clusters: - clusters[word.lower()] = cluster - if word.title() not in clusters: - clusters[word.title()] = cluster - if word.upper() not in clusters: - clusters[word.upper()] = cluster - return clusters diff --git a/spacy/cli/init_pipeline.py b/spacy/cli/init_pipeline.py new file mode 100644 index 000000000..1c0233539 --- /dev/null +++ b/spacy/cli/init_pipeline.py @@ -0,0 +1,117 @@ +from typing import Optional +import logging +from pathlib import Path +from wasabi import msg +import typer +import srsly + +from .. import util +from ..training.initialize import init_nlp, convert_vectors +from ..language import Language +from ._util import init_cli, Arg, Opt, parse_config_overrides, show_validation_error +from ._util import import_code, setup_gpu + + +@init_cli.command("vectors") +def init_vectors_cli( + # fmt: off + lang: str = Arg(..., help="The language of the nlp object to create"), + vectors_loc: Path = Arg(..., help="Vectors file in Word2Vec format", exists=True), + output_dir: Path = Arg(..., help="Pipeline output directory"), + prune: int = Opt(-1, "--prune", "-p", help="Optional number of vectors to prune to"), + truncate: int = Opt(0, "--truncate", "-t", help="Optional number of vectors to truncate to when reading in vectors file"), + name: Optional[str] = Opt(None, "--name", "-n", help="Optional name for the word vectors, e.g. en_core_web_lg.vectors"), + verbose: bool = Opt(False, "--verbose", "-V", "-VV", help="Display more information for debugging purposes"), + jsonl_loc: Optional[Path] = Opt(None, "--lexemes-jsonl", "-j", help="Location of JSONL-formatted attributes file", hidden=True), + # fmt: on +): + """Convert word vectors for use with spaCy. Will export an nlp object that + you can use in the [initialize] block of your config to initialize + a model with vectors. + """ + util.logger.setLevel(logging.DEBUG if verbose else logging.INFO) + msg.info(f"Creating blank nlp object for language '{lang}'") + nlp = util.get_lang_class(lang)() + if jsonl_loc is not None: + update_lexemes(nlp, jsonl_loc) + convert_vectors(nlp, vectors_loc, truncate=truncate, prune=prune, name=name) + msg.good(f"Successfully converted {len(nlp.vocab.vectors)} vectors") + nlp.to_disk(output_dir) + msg.good( + "Saved nlp object with vectors to output directory. 
You can now use the " + "path to it in your config as the 'vectors' setting in [initialize.vocab].", + output_dir.resolve(), + ) + + +def update_lexemes(nlp: Language, jsonl_loc: Path) -> None: + # Mostly used for backwards-compatibility and may be removed in the future + lex_attrs = srsly.read_jsonl(jsonl_loc) + for attrs in lex_attrs: + if "settings" in attrs: + continue + lexeme = nlp.vocab[attrs["orth"]] + lexeme.set_attrs(**attrs) + + +@init_cli.command( + "nlp", + context_settings={"allow_extra_args": True, "ignore_unknown_options": True}, + hidden=True, +) +def init_pipeline_cli( + # fmt: off + ctx: typer.Context, # This is only used to read additional arguments + config_path: Path = Arg(..., help="Path to config file", exists=True), + output_path: Path = Arg(..., help="Output directory for the prepared data"), + code_path: Optional[Path] = Opt(None, "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"), + verbose: bool = Opt(False, "--verbose", "-V", "-VV", help="Display more information for debugging purposes"), + use_gpu: int = Opt(-1, "--gpu-id", "-g", help="GPU ID or -1 for CPU") + # fmt: on +): + util.logger.setLevel(logging.DEBUG if verbose else logging.INFO) + overrides = parse_config_overrides(ctx.args) + import_code(code_path) + setup_gpu(use_gpu) + with show_validation_error(config_path): + config = util.load_config(config_path, overrides=overrides) + with show_validation_error(hint_fill=False): + nlp = init_nlp(config, use_gpu=use_gpu) + nlp.to_disk(output_path) + msg.good(f"Saved initialized pipeline to {output_path}") + + +@init_cli.command( + "labels", + context_settings={"allow_extra_args": True, "ignore_unknown_options": True}, +) +def init_labels_cli( + # fmt: off + ctx: typer.Context, # This is only used to read additional arguments + config_path: Path = Arg(..., help="Path to config file", exists=True), + output_path: Path = Arg(..., help="Output directory for the labels"), + code_path: Optional[Path] = Opt(None, "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"), + verbose: bool = Opt(False, "--verbose", "-V", "-VV", help="Display more information for debugging purposes"), + use_gpu: int = Opt(-1, "--gpu-id", "-g", help="GPU ID or -1 for CPU") + # fmt: on +): + """Generate JSON files for the labels in the data. This helps speed up the + training process, since spaCy won't have to preprocess the data to + extract the labels.""" + util.logger.setLevel(logging.DEBUG if verbose else logging.INFO) + if not output_path.exists(): + output_path.mkdir() + overrides = parse_config_overrides(ctx.args) + import_code(code_path) + setup_gpu(use_gpu) + with show_validation_error(config_path): + config = util.load_config(config_path, overrides=overrides) + with show_validation_error(hint_fill=False): + nlp = init_nlp(config, use_gpu=use_gpu) + for name, component in nlp.pipeline: + if getattr(component, "label_data", None) is not None: + output_file = output_path / f"{name}.json" + srsly.write_json(output_file, component.label_data) + msg.good(f"Saving {name} labels to {output_file}") + else: + msg.info(f"No labels found for {name}") diff --git a/spacy/cli/link.py b/spacy/cli/link.py deleted file mode 100644 index 8117829b5..000000000 --- a/spacy/cli/link.py +++ /dev/null @@ -1,77 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import plac -from pathlib import Path -from wasabi import msg - -from ..compat import symlink_to, path2str -from .. 
import util - - -@plac.annotations( - origin=("package name or local path to model", "positional", None, str), - link_name=("name of shortuct link to create", "positional", None, str), - force=("force overwriting of existing link", "flag", "f", bool), -) -def link(origin, link_name, force=False, model_path=None): - """ - Create a symlink for models within the spacy/data directory. Accepts - either the name of a pip package, or the local path to the model data - directory. Linking models allows loading them via spacy.load(link_name). - """ - if util.is_package(origin): - model_path = util.get_package_path(origin) - else: - model_path = Path(origin) if model_path is None else Path(model_path) - if not model_path.exists(): - msg.fail( - "Can't locate model data", - "The data should be located in {}".format(path2str(model_path)), - exits=1, - ) - data_path = util.get_data_path() - if not data_path or not data_path.exists(): - spacy_loc = Path(__file__).parent.parent - msg.fail( - "Can't find the spaCy data path to create model symlink", - "Make sure a directory `/data` exists within your spaCy " - "installation and try again. The data directory should be located " - "here:".format(path=spacy_loc), - exits=1, - ) - link_path = util.get_data_path() / link_name - if link_path.is_symlink() and not force: - msg.fail( - "Link '{}' already exists".format(link_name), - "To overwrite an existing link, use the --force flag", - exits=1, - ) - elif link_path.is_symlink(): # does a symlink exist? - # NB: It's important to check for is_symlink here and not for exists, - # because invalid/outdated symlinks would return False otherwise. - link_path.unlink() - elif link_path.exists(): # does it exist otherwise? - # NB: Check this last because valid symlinks also "exist". - msg.fail( - "Can't overwrite symlink '{}'".format(link_name), - "This can happen if your data directory contains a directory or " - "file of the same name.", - exits=1, - ) - details = "%s --> %s" % (path2str(model_path), path2str(link_path)) - try: - symlink_to(link_path, model_path) - except: # noqa: E722 - # This is quite dirty, but just making sure other errors are caught. - msg.fail( - "Couldn't link model to '{}'".format(link_name), - "Creating a symlink in spacy/data failed. Make sure you have the " - "required permissions and try re-running the command as admin, or " - "use a virtualenv. You can still import the model as a module and " - "call its load() method, or create the symlink manually.", - ) - msg.text(details) - raise - msg.good("Linking successful", details) - msg.text("You can now load the model via spacy.load('{}')".format(link_name)) diff --git a/spacy/cli/package.py b/spacy/cli/package.py index 8ed92259c..49a0ab75d 100644 --- a/spacy/cli/package.py +++ b/spacy/cli/package.py @@ -1,126 +1,177 @@ -# coding: utf8 -from __future__ import unicode_literals - -import plac +from typing import Optional, Union, Any, Dict import shutil from pathlib import Path -from wasabi import msg, get_raw_input +from wasabi import Printer, get_raw_input import srsly +import sys -from ..compat import path2str +from ._util import app, Arg, Opt +from ..schemas import validate, ModelMetaSchema from .. import util from .. 
import about -@plac.annotations( - input_dir=("Directory with model data", "positional", None, str), - output_dir=("Output parent directory", "positional", None, str), - meta_path=("Path to meta.json", "option", "m", str), - create_meta=("Create meta.json, even if one exists", "flag", "c", bool), - force=("Force overwriting existing model in output directory", "flag", "f", bool), -) -def package(input_dir, output_dir, meta_path=None, create_meta=False, force=False): +@app.command("package") +def package_cli( + # fmt: off + input_dir: Path = Arg(..., help="Directory with pipeline data", exists=True, file_okay=False), + output_dir: Path = Arg(..., help="Output parent directory", exists=True, file_okay=False), + meta_path: Optional[Path] = Opt(None, "--meta-path", "--meta", "-m", help="Path to meta.json", exists=True, dir_okay=False), + create_meta: bool = Opt(False, "--create-meta", "-c", "-C", help="Create meta.json, even if one exists"), + name: Optional[str] = Opt(None, "--name", "-n", help="Package name to override meta"), + version: Optional[str] = Opt(None, "--version", "-v", help="Package version to override meta"), + no_sdist: bool = Opt(False, "--no-sdist", "-NS", help="Don't build .tar.gz sdist, can be set if you want to run this step manually"), + force: bool = Opt(False, "--force", "-f", "-F", help="Force overwriting existing data in output directory"), + # fmt: on +): """ - Generate Python package for model data, including meta and required - installation files. A new directory will be created in the specified - output directory, and model data will be copied over. If --create-meta is - set and a meta.json already exists in the output directory, the existing - values will be used as the defaults in the command-line prompt. + Generate an installable Python package for a pipeline. Includes binary data, + meta and required installation files. A new directory will be created in the + specified output directory, and the data will be copied over. If + --create-meta is set and a meta.json already exists in the output directory, + the existing values will be used as the defaults in the command-line prompt. + After packaging, "python setup.py sdist" is run in the package directory, + which will create a .tar.gz archive that can be installed via "pip install". 
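A usage sketch for the updated packaging flow (paths, package name and version are illustrative; the sdist file name assumes 'en' as the language in the meta):

# Shell usage:
#   python -m spacy package ./training/model-best ./packages --name my_pipeline --version 0.1.0
#   pip install ./packages/en_my_pipeline-0.1.0/dist/en_my_pipeline-0.1.0.tar.gz

from pathlib import Path
from spacy.cli.package import package

package(
    Path("./training/model-best"),  # directory with the trained pipeline and meta.json
    Path("./packages"),             # output parent directory, must already exist
    name="my_pipeline",
    version="0.1.0",
    create_sdist=True,              # runs `python setup.py sdist` in the package directory
    force=True,
    silent=False,
)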
+ + DOCS: https://nightly.spacy.io/api/cli#package """ + package( + input_dir, + output_dir, + meta_path=meta_path, + name=name, + version=version, + create_meta=create_meta, + create_sdist=not no_sdist, + force=force, + silent=False, + ) + + +def package( + input_dir: Path, + output_dir: Path, + meta_path: Optional[Path] = None, + name: Optional[str] = None, + version: Optional[str] = None, + create_meta: bool = False, + create_sdist: bool = True, + force: bool = False, + silent: bool = True, +) -> None: + msg = Printer(no_print=silent, pretty=not silent) input_path = util.ensure_path(input_dir) output_path = util.ensure_path(output_dir) meta_path = util.ensure_path(meta_path) if not input_path or not input_path.exists(): - msg.fail("Can't locate model data", input_path, exits=1) + msg.fail("Can't locate pipeline data", input_path, exits=1) if not output_path or not output_path.exists(): msg.fail("Output directory not found", output_path, exits=1) if meta_path and not meta_path.exists(): - msg.fail("Can't find model meta.json", meta_path, exits=1) - - meta_path = meta_path or input_path / "meta.json" - if meta_path.is_file(): - meta = srsly.read_json(meta_path) - if not create_meta: # only print if user doesn't want to overwrite - msg.good("Loaded meta.json from file", meta_path) - else: - meta = generate_meta(input_dir, meta, msg) - for key in ("lang", "name", "version"): - if key not in meta or meta[key] == "": - msg.fail( - "No '{}' setting found in meta.json".format(key), - "This setting is required to build your package.", - exits=1, - ) + msg.fail("Can't find pipeline meta.json", meta_path, exits=1) + meta_path = meta_path or input_dir / "meta.json" + if not meta_path.exists() or not meta_path.is_file(): + msg.fail("Can't load pipeline meta.json", meta_path, exits=1) + meta = srsly.read_json(meta_path) + meta = get_meta(input_dir, meta) + if name is not None: + meta["name"] = name + if version is not None: + meta["version"] = version + if not create_meta: # only print if user doesn't want to overwrite + msg.good("Loaded meta.json from file", meta_path) + else: + meta = generate_meta(meta, msg) + errors = validate(ModelMetaSchema, meta) + if errors: + msg.fail("Invalid pipeline meta.json") + print("\n".join(errors)) + sys.exit(1) model_name = meta["lang"] + "_" + meta["name"] model_name_v = model_name + "-" + meta["version"] - main_path = output_path / model_name_v + main_path = output_dir / model_name_v package_path = main_path / model_name - if package_path.exists(): if force: - shutil.rmtree(path2str(package_path)) + shutil.rmtree(str(package_path)) else: msg.fail( "Package directory already exists", "Please delete the directory and try again, or use the " - "`--force` flag to overwrite existing " - "directories.".format(path=path2str(package_path)), + "`--force` flag to overwrite existing directories.", exits=1, ) Path.mkdir(package_path, parents=True) - shutil.copytree(path2str(input_path), path2str(package_path / model_name_v)) + shutil.copytree(str(input_dir), str(package_path / model_name_v)) create_file(main_path / "meta.json", srsly.json_dumps(meta, indent=2)) create_file(main_path / "setup.py", TEMPLATE_SETUP) create_file(main_path / "MANIFEST.in", TEMPLATE_MANIFEST) create_file(package_path / "__init__.py", TEMPLATE_INIT) - msg.good("Successfully created package '{}'".format(model_name_v), main_path) - msg.text("To build the package, run `python setup.py sdist` in this directory.") + msg.good(f"Successfully created package '{model_name_v}'", main_path) + if create_sdist: 
+ with util.working_dir(main_path): + util.run_command([sys.executable, "setup.py", "sdist"], capture=False) + zip_file = main_path / "dist" / f"{model_name_v}.tar.gz" + msg.good(f"Successfully created zipped Python package", zip_file) -def create_file(file_path, contents): +def create_file(file_path: Path, contents: str) -> None: file_path.touch() file_path.open("w", encoding="utf-8").write(contents) -def generate_meta(model_path, existing_meta, msg): - meta = existing_meta or {} - settings = [ - ("lang", "Model language", meta.get("lang", "en")), - ("name", "Model name", meta.get("name", "model")), - ("version", "Model version", meta.get("version", "0.0.0")), - ("spacy_version", "Required spaCy version", ">=%s,<3.0.0" % about.__version__), - ("description", "Model description", meta.get("description", False)), - ("author", "Author", meta.get("author", False)), - ("email", "Author email", meta.get("email", False)), - ("url", "Author website", meta.get("url", False)), - ("license", "License", meta.get("license", "CC BY-SA 3.0")), - ] +def get_meta( + model_path: Union[str, Path], existing_meta: Dict[str, Any] +) -> Dict[str, Any]: + meta = { + "lang": "en", + "name": "pipeline", + "version": "0.0.0", + "description": "", + "author": "", + "email": "", + "url": "", + "license": "MIT", + } + meta.update(existing_meta) nlp = util.load_model_from_path(Path(model_path)) - meta["pipeline"] = nlp.pipe_names + meta["spacy_version"] = util.get_model_version_range(about.__version__) meta["vectors"] = { "width": nlp.vocab.vectors_length, "vectors": len(nlp.vocab.vectors), "keys": nlp.vocab.vectors.n_keys, "name": nlp.vocab.vectors.name, } - msg.divider("Generating meta.json") - msg.text( - "Enter the package settings for your model. The following information " - "will be read from your model data: pipeline, vectors." - ) - for setting, desc, default in settings: - response = get_raw_input(desc, default) - meta[setting] = default if response == "" and default else response if about.__title__ != "spacy": meta["parent_package"] = about.__title__ return meta +def generate_meta(existing_meta: Dict[str, Any], msg: Printer) -> Dict[str, Any]: + meta = existing_meta or {} + settings = [ + ("lang", "Pipeline language", meta.get("lang", "en")), + ("name", "Pipeline name", meta.get("name", "pipeline")), + ("version", "Package version", meta.get("version", "0.0.0")), + ("description", "Package description", meta.get("description", None)), + ("author", "Author", meta.get("author", None)), + ("email", "Author email", meta.get("email", None)), + ("url", "Author website", meta.get("url", None)), + ("license", "License", meta.get("license", "MIT")), + ] + msg.divider("Generating meta.json") + msg.text( + "Enter the package settings for your pipeline. The following information " + "will be read from your pipeline data: pipeline, vectors." 
+ ) + for setting, desc, default in settings: + response = get_raw_input(desc, default) + meta[setting] = default if response == "" and default else response + return meta + + TEMPLATE_SETUP = """ #!/usr/bin/env python -# coding: utf8 -from __future__ import unicode_literals - import io import json from os import path, walk @@ -166,16 +217,17 @@ def setup_package(): setup( name=model_name, - description=meta['description'], - author=meta['author'], - author_email=meta['email'], - url=meta['url'], + description=meta.get('description'), + author=meta.get('author'), + author_email=meta.get('email'), + url=meta.get('url'), version=meta['version'], - license=meta['license'], + license=meta.get('license'), packages=[model_name], package_data={model_name: list_files(model_dir)}, install_requires=list_requirements(meta), zip_safe=False, + entry_points={'spacy_models': ['{m} = {m}'.format(m=model_name)]} ) @@ -186,13 +238,11 @@ if __name__ == '__main__': TEMPLATE_MANIFEST = """ include meta.json +include config.cfg """.strip() TEMPLATE_INIT = """ -# coding: utf8 -from __future__ import unicode_literals - from pathlib import Path from spacy.util import load_model_from_init_py, get_model_meta diff --git a/spacy/cli/pretrain.py b/spacy/cli/pretrain.py index e949f76cf..de9341449 100644 --- a/spacy/cli/pretrain.py +++ b/spacy/cli/pretrain.py @@ -1,405 +1,108 @@ -# coding: utf8 -from __future__ import print_function, unicode_literals - -import plac -import random -import numpy -import time -import re -from collections import Counter +from typing import Optional from pathlib import Path -from thinc.v2v import Affine, Maxout -from thinc.misc import LayerNorm as LN -from thinc.neural.util import prefer_gpu from wasabi import msg -import srsly +import typer +import re -from ..errors import Errors -from ..tokens import Doc -from ..attrs import ID, HEAD -from .._ml import Tok2Vec, flatten, chain, create_default_optimizer -from .._ml import masked_language_model, get_cossim_loss, get_characters_loss -from .._ml import MultiSoftmax -from .. import util -from .train import _load_pretrained_tok2vec +from ._util import app, Arg, Opt, parse_config_overrides, show_validation_error +from ._util import import_code, setup_gpu +from ..training.pretrain import pretrain +from ..util import load_config -@plac.annotations( - texts_loc=( - "Path to JSONL file with raw texts to learn from, with text provided as the key 'text' or tokens as the " - "key 'tokens'", - "positional", - None, - str, - ), - vectors_model=("Name or path to spaCy model with vectors to learn from"), - output_dir=("Directory to write models to on each epoch", "positional", None, str), - width=("Width of CNN layers", "option", "cw", int), - conv_depth=("Depth of CNN layers", "option", "cd", int), - cnn_window=("Window size for CNN layers", "option", "cW", int), - cnn_pieces=("Maxout size for CNN layers. 1 for Mish", "option", "cP", int), - use_chars=("Whether to use character-based embedding", "flag", "chr", bool), - sa_depth=("Depth of self-attention layers", "option", "sa", int), - bilstm_depth=("Depth of BiLSTM layers (requires PyTorch)", "option", "lstm", int), - embed_rows=("Number of embedding rows", "option", "er", int), - loss_func=( - "Loss function to use for the objective. 
Either 'characters', 'L2' or 'cosine'", - "option", - "L", - str, - ), - use_vectors=("Whether to use the static vectors as input features", "flag", "uv"), - dropout=("Dropout rate", "option", "d", float), - batch_size=("Number of words per training batch", "option", "bs", int), - max_length=( - "Max words per example. Longer examples are discarded", - "option", - "xw", - int, - ), - min_length=( - "Min words per example. Shorter examples are discarded", - "option", - "nw", - int, - ), - seed=("Seed for random number generators", "option", "s", int), - n_iter=("Number of iterations to pretrain", "option", "i", int), - n_save_every=("Save model every X batches.", "option", "se", int), - init_tok2vec=( - "Path to pretrained weights for the token-to-vector parts of the models. See 'spacy pretrain'. Experimental.", - "option", - "t2v", - Path, - ), - epoch_start=( - "The epoch to start counting at. Only relevant when using '--init-tok2vec' and the given weight file has been " - "renamed. Prevents unintended overwriting of existing weight files.", - "option", - "es", - int, - ), +@app.command( + "pretrain", + context_settings={"allow_extra_args": True, "ignore_unknown_options": True}, ) -def pretrain( - texts_loc, - vectors_model, - output_dir, - width=96, - conv_depth=4, - cnn_pieces=3, - sa_depth=0, - cnn_window=1, - bilstm_depth=0, - use_chars=False, - embed_rows=2000, - loss_func="cosine", - use_vectors=False, - dropout=0.2, - n_iter=1000, - batch_size=3000, - max_length=500, - min_length=5, - seed=0, - n_save_every=None, - init_tok2vec=None, - epoch_start=None, +def pretrain_cli( + # fmt: off + ctx: typer.Context, # This is only used to read additional arguments + config_path: Path = Arg(..., help="Path to config file", exists=True, dir_okay=False), + output_dir: Path = Arg(..., help="Directory to write weights to on each epoch"), + code_path: Optional[Path] = Opt(None, "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"), + resume_path: Optional[Path] = Opt(None, "--resume-path", "-r", help="Path to pretrained weights from which to resume pretraining"), + epoch_resume: Optional[int] = Opt(None, "--epoch-resume", "-er", help="The epoch to resume counting from when using --resume-path. Prevents unintended overwriting of existing weight files."), + use_gpu: int = Opt(-1, "--gpu-id", "-g", help="GPU ID or -1 for CPU"), + # fmt: on ): """ Pre-train the 'token-to-vector' (tok2vec) layer of pipeline components, - using an approximate language-modelling objective. Specifically, we load - pretrained vectors, and train a component like a CNN, BiLSTM, etc to predict - vectors which match the pretrained ones. The weights are saved to a directory - after each epoch. You can then pass a path to one of these pretrained weights - files to the 'spacy train' command. + using an approximate language-modelling objective. Two objective types + are available, vector-based and character-based. + + In the vector-based objective, we load word vectors that have been trained + using a word2vec-style distributional similarity algorithm, and train a + component like a CNN, BiLSTM, etc to predict vectors which match the + pretrained ones. The weights are saved to a directory after each epoch. You + can then pass a path to one of these pretrained weights files to the + 'spacy train' command. This technique may be especially helpful if you have little labelled data. However, it's still quite experimental, so your mileage may vary. 
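A sketch of the config-driven workflow this command now expects (paths are illustrative; the config must contain a filled [pretraining] block, e.g. generated via `init config --pretraining` as shown earlier), mirroring the call the CLI wrapper makes further down:

# Shell usage:
#   python -m spacy init config config.cfg --lang en --pipeline ner --pretraining
#   python -m spacy pretrain config.cfg ./pretrain_output --gpu-id 0

from pathlib import Path
from spacy.util import load_config
from spacy.training.pretrain import pretrain

output_dir = Path("./pretrain_output")
output_dir.mkdir(exist_ok=True)
config = load_config("config.cfg", interpolate=True)  # needs a non-empty [pretraining] block
pretrain(config, output_dir, use_gpu=-1, silent=False)

# Each epoch writes a weights file (model0.bin, model1.bin, ...) that can then be
# referenced from the training config, typically via init_tok2vec in [initialize].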
To load the weights back in during 'spacy train', you need to ensure - all settings are the same between pretraining and training. The API and - errors around this need some improvement. + all settings are the same between pretraining and training. Ideally, + this is done by using the same config file for both commands. + + DOCS: https://nightly.spacy.io/api/cli#pretrain """ - config = dict(locals()) - for key in config: - if isinstance(config[key], Path): - config[key] = str(config[key]) - util.fix_random_seed(seed) + config_overrides = parse_config_overrides(ctx.args) + import_code(code_path) + verify_cli_args(config_path, output_dir, resume_path, epoch_resume) + setup_gpu(use_gpu) + msg.info(f"Loading config from: {config_path}") - has_gpu = prefer_gpu() - msg.info("Using GPU" if has_gpu else "Not using GPU") - - output_dir = Path(output_dir) - if output_dir.exists() and [p for p in output_dir.iterdir()]: - msg.warn( - "Output directory is not empty", - "It is better to use an empty directory or refer to a new output path, " - "then the new directory will be created for you.", + with show_validation_error(config_path): + raw_config = load_config( + config_path, overrides=config_overrides, interpolate=False ) + config = raw_config.interpolate() + if not config.get("pretraining"): + # TODO: What's the solution here? How do we handle optional blocks? + msg.fail("The [pretraining] block in your config is empty", exits=1) if not output_dir.exists(): output_dir.mkdir() - msg.good("Created output directory: {}".format(output_dir)) - srsly.write_json(output_dir / "config.json", config) - msg.good("Saved settings to config.json") + msg.good(f"Created output directory: {output_dir}") + # Save non-interpolated config + raw_config.to_disk(output_dir / "config.cfg") + msg.good("Saved config file in the output directory") - # Load texts from file or stdin - if texts_loc != "-": # reading from a file - texts_loc = Path(texts_loc) - if not texts_loc.exists(): - msg.fail("Input text file doesn't exist", texts_loc, exits=1) - with msg.loading("Loading input texts..."): - texts = list(srsly.read_jsonl(texts_loc)) - if not texts: - msg.fail("Input file is empty", texts_loc, exits=1) - msg.good("Loaded input texts") - random.shuffle(texts) - else: # reading from stdin - msg.text("Reading input text from stdin...") - texts = srsly.read_jsonl("-") - - with msg.loading("Loading model '{}'...".format(vectors_model)): - nlp = util.load_model(vectors_model) - msg.good("Loaded model '{}'".format(vectors_model)) - pretrained_vectors = None if not use_vectors else nlp.vocab.vectors.name - model = create_pretraining_model( - nlp, - Tok2Vec( - width, - embed_rows, - conv_depth=conv_depth, - pretrained_vectors=pretrained_vectors, - bilstm_depth=bilstm_depth, # Requires PyTorch. Experimental. - subword_features=not use_chars, # Set to False for Chinese etc - cnn_maxout_pieces=cnn_pieces, # If set to 1, use Mish activation. 
- ), - objective=loss_func + pretrain( + config, + output_dir, + resume_path=resume_path, + epoch_resume=epoch_resume, + use_gpu=use_gpu, + silent=False, ) - # Load in pretrained weights - if init_tok2vec is not None: - components = _load_pretrained_tok2vec(nlp, init_tok2vec) - msg.text("Loaded pretrained tok2vec for: {}".format(components)) - # Parse the epoch number from the given weight file - model_name = re.search(r"model\d+\.bin", str(init_tok2vec)) - if model_name: - # Default weight file name so read epoch_start from it by cutting off 'model' and '.bin' - epoch_start = int(model_name.group(0)[5:][:-4]) + 1 - else: - if not epoch_start: - msg.fail( - "You have to use the '--epoch-start' argument when using a renamed weight file for " - "'--init-tok2vec'", - exits=True, - ) - elif epoch_start < 0: - msg.fail( - "The argument '--epoch-start' has to be greater or equal to 0. '%d' is invalid" - % epoch_start, - exits=True, - ) - else: - # Without '--init-tok2vec' the '--epoch-start' argument is ignored - epoch_start = 0 - - optimizer = create_default_optimizer(model.ops) - tracker = ProgressTracker(frequency=10000) - msg.divider("Pre-training tok2vec layer - starting at epoch %d" % epoch_start) - row_settings = {"widths": (3, 10, 10, 6, 4), "aligns": ("r", "r", "r", "r", "r")} - msg.row(("#", "# Words", "Total Loss", "Loss", "w/s"), **row_settings) - - def _save_model(epoch, is_temp=False): - is_temp_str = ".temp" if is_temp else "" - with model.use_params(optimizer.averages): - with (output_dir / ("model%d%s.bin" % (epoch, is_temp_str))).open( - "wb" - ) as file_: - file_.write(model.tok2vec.to_bytes()) - log = { - "nr_word": tracker.nr_word, - "loss": tracker.loss, - "epoch_loss": tracker.epoch_loss, - "epoch": epoch, - } - with (output_dir / "log.jsonl").open("a") as file_: - file_.write(srsly.json_dumps(log) + "\n") - - skip_counter = 0 - for epoch in range(epoch_start, n_iter + epoch_start): - for batch_id, batch in enumerate( - util.minibatch_by_words(((text, None) for text in texts), size=batch_size) - ): - docs, count = make_docs( - nlp, - [text for (text, _) in batch], - max_length=max_length, - min_length=min_length, - ) - skip_counter += count - loss = make_update( - model, docs, optimizer, objective=loss_func, drop=dropout - ) - progress = tracker.update(epoch, loss, docs) - if progress: - msg.row(progress, **row_settings) - if texts_loc == "-" and tracker.words_per_epoch[epoch] >= 10 ** 7: - break - if n_save_every and (batch_id % n_save_every == 0): - _save_model(epoch, is_temp=True) - _save_model(epoch) - tracker.epoch_loss = 0.0 - if texts_loc != "-": - # Reshuffle the texts if texts were loaded from a file - random.shuffle(texts) - if skip_counter > 0: - msg.warn("Skipped {count} empty values".format(count=str(skip_counter))) msg.good("Successfully finished pretrain") -def make_update(model, docs, optimizer, drop=0.0, objective="L2"): - """Perform an update over a single batch of documents. - - docs (iterable): A batch of `Doc` objects. - drop (float): The dropout rate. - optimizer (callable): An optimizer. - RETURNS loss: A float for the loss. 
- """ - predictions, backprop = model.begin_update(docs, drop=drop) - if objective == "characters": - loss, gradients = get_characters_loss(model.ops, docs, predictions) - else: - loss, gradients = get_vectors_loss(model.ops, docs, predictions, objective) - backprop(gradients, sgd=optimizer) - # Don't want to return a cupy object here - # The gradients are modified in-place by the BERT MLM, - # so we get an accurate loss - return float(loss) - - -def make_docs(nlp, batch, min_length, max_length): - docs = [] - skip_count = 0 - for record in batch: - if not isinstance(record, dict): - raise TypeError(Errors.E137.format(type=type(record), line=record)) - if "tokens" in record: - words = record["tokens"] - if not words: - skip_count += 1 - continue - doc = Doc(nlp.vocab, words=words) - elif "text" in record: - text = record["text"] - if not text: - skip_count += 1 - continue - doc = nlp.make_doc(text) - else: - raise ValueError(Errors.E138.format(text=record)) - if "heads" in record: - heads = record["heads"] - heads = numpy.asarray(heads, dtype="uint64") - heads = heads.reshape((len(doc), 1)) - doc = doc.from_array([HEAD], heads) - if len(doc) >= min_length and len(doc) < max_length: - docs.append(doc) - return docs, skip_count - - -def get_vectors_loss(ops, docs, prediction, objective="L2"): - """Compute a mean-squared error loss between the documents' vectors and - the prediction. - - Note that this is ripe for customization! We could compute the vectors - in some other word, e.g. with an LSTM language model, or use some other - type of objective. - """ - # The simplest way to implement this would be to vstack the - # token.vector values, but that's a bit inefficient, especially on GPU. - # Instead we fetch the index into the vectors table for each of our tokens, - # and look them up all at once. This prevents data copying. - ids = ops.flatten([doc.to_array(ID).ravel() for doc in docs]) - target = docs[0].vocab.vectors.data[ids] - if objective == "L2": - d_target = prediction - target - loss = (d_target ** 2).sum() - elif objective == "cosine": - loss, d_target = get_cossim_loss(prediction, target) - else: - raise ValueError(Errors.E142.format(loss_func=objective)) - return loss, d_target - - -def create_pretraining_model(nlp, tok2vec, objective="cosine", nr_char=10): - """Define a network for the pretraining. We simply add an output layer onto - the tok2vec input model. The tok2vec input model needs to be a model that - takes a batch of Doc objects (as a list), and returns a list of arrays. - Each array in the output needs to have one row per token in the doc. - """ - if objective == "characters": - out_sizes = [256] * nr_char - output_layer = chain( - LN(Maxout(300, pieces=3)), - MultiSoftmax(out_sizes, 300) - ) - else: - output_size = nlp.vocab.vectors.data.shape[1] - output_layer = chain( - LN(Maxout(300, pieces=3)), Affine(output_size, drop_factor=0.0) - ) - # This is annoying, but the parser etc have the flatten step after - # the tok2vec. To load the weights in cleanly, we need to match - # the shape of the models' components exactly. So what we cann - # "tok2vec" has to be the same set of processes as what the components do. 
- tok2vec = chain(tok2vec, flatten) - model = chain(tok2vec, output_layer) - model = masked_language_model(nlp.vocab, model) - model.tok2vec = tok2vec - model.output_layer = output_layer - model.begin_training([nlp.make_doc("Give it a doc to infer shapes")]) - return model - - -class ProgressTracker(object): - def __init__(self, frequency=1000000): - self.loss = 0.0 - self.prev_loss = 0.0 - self.nr_word = 0 - self.words_per_epoch = Counter() - self.frequency = frequency - self.last_time = time.time() - self.last_update = 0 - self.epoch_loss = 0.0 - - def update(self, epoch, loss, docs): - self.loss += loss - self.epoch_loss += loss - words_in_batch = sum(len(doc) for doc in docs) - self.words_per_epoch[epoch] += words_in_batch - self.nr_word += words_in_batch - words_since_update = self.nr_word - self.last_update - if words_since_update >= self.frequency: - wps = words_since_update / (time.time() - self.last_time) - self.last_update = self.nr_word - self.last_time = time.time() - loss_per_word = self.loss - self.prev_loss - status = ( - epoch, - self.nr_word, - _smart_round(self.loss, width=10), - _smart_round(loss_per_word, width=6), - int(wps), +def verify_cli_args(config_path, output_dir, resume_path, epoch_resume): + if not config_path or not config_path.exists(): + msg.fail("Config file not found", config_path, exits=1) + if output_dir.exists() and [p for p in output_dir.iterdir()]: + if resume_path: + msg.warn( + "Output directory is not empty.", + "If you're resuming a run in this directory, the old weights " + "for the consecutive epochs will be overwritten with the new ones.", ) - self.prev_loss = float(self.loss) - return status else: - return None - - -def _smart_round(figure, width=10, max_decimal=4): - """Round large numbers as integers, smaller numbers as decimals.""" - n_digits = len(str(int(figure))) - n_decimal = width - (n_digits + 1) - if n_decimal <= 1: - return str(int(figure)) - else: - n_decimal = min(n_decimal, max_decimal) - format_str = "%." + str(n_decimal) + "f" - return format_str % figure + msg.warn( + "Output directory is not empty. ", + "It is better to use an empty directory or refer to a new output path, " + "then the new directory will be created for you.", + ) + if resume_path is not None: + model_name = re.search(r"model\d+\.bin", str(resume_path)) + if not model_name and not epoch_resume: + msg.fail( + "You have to use the --epoch-resume setting when using a renamed weight file for --resume-path", + exits=True, + ) + elif not model_name and epoch_resume < 0: + msg.fail( + f"The argument --epoch-resume has to be greater or equal to 0. {epoch_resume} is invalid", + exits=True, + ) diff --git a/spacy/cli/profile.py b/spacy/cli/profile.py index 4ee72fc23..43226730d 100644 --- a/spacy/cli/profile.py +++ b/spacy/cli/profile.py @@ -1,7 +1,4 @@ -# coding: utf8 -from __future__ import unicode_literals, division, print_function - -import plac +from typing import Optional, Sequence, Union, Iterator import tqdm from pathlib import Path import srsly @@ -9,36 +6,65 @@ import cProfile import pstats import sys import itertools -import thinc.extra.datasets -from wasabi import msg +from wasabi import msg, Printer +import typer +from ._util import app, debug_cli, Arg, Opt, NAME +from ..language import Language from ..util import load_model -@plac.annotations( - model=("Model to load", "positional", None, str), - inputs=("Location of input file. 
'-' for stdin.", "positional", None, str), - n_texts=("Maximum number of texts to use if available", "option", "n", int), -) -def profile(model, inputs=None, n_texts=10000): +@debug_cli.command("profile") +@app.command("profile", hidden=True) +def profile_cli( + # fmt: off + ctx: typer.Context, # This is only used to read current calling context + model: str = Arg(..., help="Trained pipeline to load"), + inputs: Optional[Path] = Arg(None, help="Location of input file. '-' for stdin.", exists=True, allow_dash=True), + n_texts: int = Opt(10000, "--n-texts", "-n", help="Maximum number of texts to use if available"), + # fmt: on +): """ - Profile a spaCy pipeline, to find out which functions take the most time. + Profile which functions take the most time in a spaCy pipeline. Input should be formatted as one JSON object per line with a key "text". It can either be provided as a JSONL file, or be read from sys.sytdin. If no input file is specified, the IMDB dataset is loaded via Thinc. + + DOCS: https://nightly.spacy.io/api/cli#debug-profile """ + if ctx.parent.command.name == NAME: # called as top-level command + msg.warn( + "The profile command is now available via the 'debug profile' " + "subcommand. You can run python -m spacy debug --help for an " + "overview of the other available debugging commands." + ) + profile(model, inputs=inputs, n_texts=n_texts) + + +def profile(model: str, inputs: Optional[Path] = None, n_texts: int = 10000) -> None: + if inputs is not None: inputs = _read_inputs(inputs, msg) if inputs is None: + try: + import ml_datasets + except ImportError: + msg.fail( + "This command, when run without an input file, " + "requires the ml_datasets library to be installed: " + "pip install ml_datasets", + exits=1, + ) + n_inputs = 25000 with msg.loading("Loading IMDB dataset via Thinc..."): - imdb_train, _ = thinc.extra.datasets.imdb() + imdb_train, _ = ml_datasets.imdb() inputs, _ = zip(*imdb_train) - msg.info("Loaded IMDB dataset and using {} examples".format(n_inputs)) + msg.info(f"Loaded IMDB dataset and using {n_inputs} examples") inputs = inputs[:n_inputs] - with msg.loading("Loading model '{}'...".format(model)): + with msg.loading(f"Loading pipeline '{model}'..."): nlp = load_model(model) - msg.good("Loaded model '{}'".format(model)) + msg.good(f"Loaded pipeline '{model}'") texts = list(itertools.islice(inputs, n_texts)) cProfile.runctx("parse_texts(nlp, texts)", globals(), locals(), "Profile.prof") s = pstats.Stats("Profile.prof") @@ -46,12 +72,12 @@ def profile(model, inputs=None, n_texts=10000): s.strip_dirs().sort_stats("time").print_stats() -def parse_texts(nlp, texts): +def parse_texts(nlp: Language, texts: Sequence[str]) -> None: for doc in nlp.pipe(tqdm.tqdm(texts), batch_size=16): pass -def _read_inputs(loc, msg): +def _read_inputs(loc: Union[Path, str], msg: Printer) -> Iterator[str]: if loc == "-": msg.info("Reading input from sys.stdin") file_ = sys.stdin @@ -60,7 +86,7 @@ def _read_inputs(loc, msg): input_path = Path(loc) if not input_path.exists() or not input_path.is_file(): msg.fail("Not a valid input data file", loc, exits=1) - msg.info("Using data from {}".format(input_path.parts[-1])) + msg.info(f"Using data from {input_path.parts[-1]}") file_ = input_path.open() for line in file_: data = srsly.json_loads(line) diff --git a/bin/__init__.py b/spacy/cli/project/__init__.py similarity index 100% rename from bin/__init__.py rename to spacy/cli/project/__init__.py diff --git a/spacy/cli/project/assets.py b/spacy/cli/project/assets.py new file mode 100644 
index 000000000..58f59a3f9 --- /dev/null +++ b/spacy/cli/project/assets.py @@ -0,0 +1,154 @@ +from typing import Optional +from pathlib import Path +from wasabi import msg +import re +import shutil +import requests + +from ...util import ensure_path, working_dir +from .._util import project_cli, Arg, Opt, PROJECT_FILE, load_project_config +from .._util import get_checksum, download_file, git_checkout, get_git_version + + +@project_cli.command("assets") +def project_assets_cli( + # fmt: off + project_dir: Path = Arg(Path.cwd(), help="Path to cloned project. Defaults to current working directory.", exists=True, file_okay=False), + sparse_checkout: bool = Opt(False, "--sparse", "-S", help="Use sparse checkout for assets provided via Git, to only check out and clone the files needed. Requires Git v22.2+.") + # fmt: on +): + """Fetch project assets like datasets and pretrained weights. Assets are + defined in the "assets" section of the project.yml. If a checksum is + provided in the project.yml, the file is only downloaded if no local file + with the same checksum exists. + + DOCS: https://nightly.spacy.io/api/cli#project-assets + """ + project_assets(project_dir, sparse_checkout=sparse_checkout) + + +def project_assets(project_dir: Path, *, sparse_checkout: bool = False) -> None: + """Fetch assets for a project using DVC if possible. + + project_dir (Path): Path to project directory. + """ + project_path = ensure_path(project_dir) + config = load_project_config(project_path) + assets = config.get("assets", {}) + if not assets: + msg.warn(f"No assets specified in {PROJECT_FILE}", exits=0) + msg.info(f"Fetching {len(assets)} asset(s)") + for asset in assets: + dest = (project_dir / asset["dest"]).resolve() + checksum = asset.get("checksum") + if "git" in asset: + git_err = ( + f"Cloning spaCy project templates requires Git and the 'git' command. " + f"Make sure it's installed and that the executable is available." + ) + get_git_version(error=git_err) + if dest.exists(): + # If there's already a file, check for checksum + if checksum and checksum == get_checksum(dest): + msg.good( + f"Skipping download with matching checksum: {asset['dest']}" + ) + continue + else: + if dest.is_dir(): + shutil.rmtree(dest) + else: + dest.unlink() + git_checkout( + asset["git"]["repo"], + asset["git"]["path"], + dest, + branch=asset["git"].get("branch"), + sparse=sparse_checkout, + ) + msg.good(f"Downloaded asset {dest}") + else: + url = asset.get("url") + if not url: + # project.yml defines asset without URL that the user has to place + check_private_asset(dest, checksum) + continue + fetch_asset(project_path, url, dest, checksum) + + +def check_private_asset(dest: Path, checksum: Optional[str] = None) -> None: + """Check and validate assets without a URL (private assets that the user + has to provide themselves) and give feedback about the checksum. + + dest (Path): Destination path of the asset. + checksum (Optional[str]): Optional checksum of the expected file. + """ + if not Path(dest).exists(): + err = f"No URL provided for asset. You need to add this file yourself: {dest}" + msg.warn(err) + else: + if not checksum: + msg.good(f"Asset already exists: {dest}") + elif checksum == get_checksum(dest): + msg.good(f"Asset exists with matching checksum: {dest}") + else: + msg.fail(f"Asset available but with incorrect checksum: {dest}") + + +def fetch_asset( + project_path: Path, url: str, dest: Path, checksum: Optional[str] = None +) -> None: + """Fetch an asset from a given URL or path. 
If a checksum is provided and a + local file exists, it's only re-downloaded if the checksum doesn't match. + + project_path (Path): Path to project directory. + url (str): URL or path to asset. + checksum (Optional[str]): Optional expected checksum of local file. + RETURNS (Optional[Path]): The path to the fetched asset or None if fetching + the asset failed. + """ + dest_path = (project_path / dest).resolve() + if dest_path.exists() and checksum: + # If there's already a file, check for checksum + if checksum == get_checksum(dest_path): + msg.good(f"Skipping download with matching checksum: {dest}") + return dest_path + # We might as well support the user here and create parent directories in + # case the asset dir isn't listed as a dir to create in the project.yml + if not dest_path.parent.exists(): + dest_path.parent.mkdir(parents=True) + with working_dir(project_path): + url = convert_asset_url(url) + try: + download_file(url, dest_path) + msg.good(f"Downloaded asset {dest}") + except requests.exceptions.RequestException as e: + if Path(url).exists() and Path(url).is_file(): + # If it's a local file, copy to destination + shutil.copy(url, str(dest_path)) + msg.good(f"Copied local asset {dest}") + else: + msg.fail(f"Download failed: {dest}", e) + return + if checksum and checksum != get_checksum(dest_path): + msg.fail(f"Checksum doesn't match value defined in {PROJECT_FILE}: {dest}") + + +def convert_asset_url(url: str) -> str: + """Check and convert the asset URL if needed. + + url (str): The asset URL. + RETURNS (str): The converted URL. + """ + # If the asset URL is a regular GitHub URL it's likely a mistake + if re.match(r"(http(s?)):\/\/github.com", url) and "releases/download" not in url: + converted = url.replace("github.com", "raw.githubusercontent.com") + converted = re.sub(r"/(tree|blob)/", "/", converted) + msg.warn( + "Downloading from a regular GitHub URL. This will only download " + "the source of the page, not the actual file. Converting the URL " + "to a raw URL.", + converted, + ) + return converted + return url diff --git a/spacy/cli/project/clone.py b/spacy/cli/project/clone.py new file mode 100644 index 000000000..851fc444a --- /dev/null +++ b/spacy/cli/project/clone.py @@ -0,0 +1,92 @@ +from typing import Optional +from pathlib import Path +from wasabi import msg +import subprocess +import re + +from ... import about +from ...util import ensure_path +from .._util import project_cli, Arg, Opt, COMMAND, PROJECT_FILE +from .._util import git_checkout, get_git_version + + +@project_cli.command("clone") +def project_clone_cli( + # fmt: off + name: str = Arg(..., help="The name of the template to clone"), + dest: Optional[Path] = Arg(None, help="Where to clone the project. Defaults to current working directory", exists=False), + repo: str = Opt(about.__projects__, "--repo", "-r", help="The repository to clone from"), + branch: str = Opt(about.__projects_branch__, "--branch", "-b", help="The branch to clone from"), + sparse_checkout: bool = Opt(False, "--sparse", "-S", help="Use sparse Git checkout to only check out and clone the files needed. Requires Git v22.2+.") + # fmt: on +): + """Clone a project template from a repository. Calls into "git" and will + only download the files from the given subdirectory. The GitHub repo + defaults to the official spaCy template repo, but can be customized + (including using a private repo). 
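# A quick, hedged illustration of the URL rewriting done by convert_asset_url()
# in assets.py above; the asset URL is hypothetical and the import path follows
# the module layout added in this diff.
from spacy.cli.project.assets import convert_asset_url

url = "https://github.com/explosion/projects/blob/master/some-dir/data.jsonl"
print(convert_asset_url(url))
# prints (after a warning about using a regular GitHub URL):
# https://raw.githubusercontent.com/explosion/projects/master/some-dir/data.jsonl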
+ + DOCS: https://nightly.spacy.io/api/cli#project-clone + """ + if dest is None: + dest = Path.cwd() / Path(name).parts[-1] + project_clone(name, dest, repo=repo, branch=branch, sparse_checkout=sparse_checkout) + + +def project_clone( + name: str, + dest: Path, + *, + repo: str = about.__projects__, + branch: str = about.__projects_branch__, + sparse_checkout: bool = False, +) -> None: + """Clone a project template from a repository. + + name (str): Name of subdirectory to clone. + dest (Path): Destination path of cloned project. + repo (str): URL of Git repo containing project templates. + branch (str): The branch to clone from + """ + dest = ensure_path(dest) + check_clone(name, dest, repo) + project_dir = dest.resolve() + repo_name = re.sub(r"(http(s?)):\/\/github.com/", "", repo) + try: + git_checkout(repo, name, dest, branch=branch, sparse=sparse_checkout) + except subprocess.CalledProcessError: + err = f"Could not clone '{name}' from repo '{repo_name}'" + msg.fail(err, exits=1) + msg.good(f"Cloned '{name}' from {repo_name}", project_dir) + if not (project_dir / PROJECT_FILE).exists(): + msg.warn(f"No {PROJECT_FILE} found in directory") + else: + msg.good(f"Your project is now ready!") + print(f"To fetch the assets, run:\n{COMMAND} project assets {dest}") + + +def check_clone(name: str, dest: Path, repo: str) -> None: + """Check and validate that the destination path can be used to clone. Will + check that Git is available and that the destination path is suitable. + + name (str): Name of the directory to clone from the repo. + dest (Path): Local destination of cloned directory. + repo (str): URL of the repo to clone from. + """ + git_err = ( + f"Cloning spaCy project templates requires Git and the 'git' command. ", + f"To clone a project without Git, copy the files from the '{name}' " + f"directory in the {repo} to {dest} manually.", + ) + get_git_version(error=git_err) + if not dest: + msg.fail(f"Not a valid directory to clone project: {dest}", exits=1) + if dest.exists(): + # Directory already exists (not allowed, clone needs to create it) + msg.fail(f"Can't clone project, directory already exists: {dest}", exits=1) + if not dest.parent.exists(): + # We're not creating parents, parent dir should exist + msg.fail( + f"Can't clone project, parent directory doesn't exist: {dest.parent}. " + f"Create the necessary folder(s) first before continuing.", + exits=1, + ) diff --git a/spacy/cli/project/document.py b/spacy/cli/project/document.py new file mode 100644 index 000000000..811b7c746 --- /dev/null +++ b/spacy/cli/project/document.py @@ -0,0 +1,119 @@ +from pathlib import Path +from wasabi import msg, MarkdownRenderer + +from ...util import working_dir +from .._util import project_cli, Arg, Opt, PROJECT_FILE, load_project_config + + +DOCS_URL = "https://nightly.spacy.io" +INTRO = f"""> ⚠️ This project template uses the new [**spaCy v3.0**]({DOCS_URL}), which +> is currently available as a nightly pre-release. You can install it from pip as `spacy-nightly`: +> `pip install spacy-nightly`. Make sure to use a fresh virtual environment.""" +INTRO_PROJECT = f"""The [`{PROJECT_FILE}`]({PROJECT_FILE}) defines the data assets required by the +project, as well as the available commands and workflows. For details, see the +[spaCy projects documentation]({DOCS_URL}/usage/projects).""" +INTRO_COMMANDS = f"""The following commands are defined by the project. They +can be executed using [`spacy project run [name]`]({DOCS_URL}/api/cli#project-run). 
+Commands are only re-run if their inputs have changed.""" +INTRO_WORKFLOWS = f"""The following workflows are defined by the project. They +can be executed using [`spacy project run [name]`]({DOCS_URL}/api/cli#project-run) +and will run the specified commands in order. Commands are only re-run if their +inputs have changed.""" +INTRO_ASSETS = f"""The following assets are defined by the project. They can +be fetched by running [`spacy project assets`]({DOCS_URL}/api/cli#project-assets) +in the project directory.""" +# These markers are added to the Markdown and can be used to update the file in +# place if it already exists. Only the auto-generated part will be replaced. +MARKER_START = "" +MARKER_END = "" +# If this marker is used in an existing README, it's ignored and not replaced +MARKER_IGNORE = "" + + +@project_cli.command("document") +def project_document_cli( + # fmt: off + project_dir: Path = Arg(Path.cwd(), help="Path to cloned project. Defaults to current working directory.", exists=True, file_okay=False), + output_file: Path = Opt("-", "--output", "-o", help="Path to output Markdown file for output. Defaults to - for standard output"), + no_emoji: bool = Opt(False, "--no-emoji", "-NE", help="Don't use emoji") + # fmt: on +): + """ + Auto-generate a README.md for a project. If the content is saved to a file, + hidden markers are added so you can add custom content before or after the + auto-generated section and only the auto-generated docs will be replaced + when you re-run the command. + + DOCS: https://nightly.spacy.io/api/cli#project-document + """ + project_document(project_dir, output_file, no_emoji=no_emoji) + + +def project_document( + project_dir: Path, output_file: Path, *, no_emoji: bool = False +) -> None: + is_stdout = str(output_file) == "-" + config = load_project_config(project_dir) + md = MarkdownRenderer(no_emoji=no_emoji) + md.add(MARKER_START) + title = config.get("title") + description = config.get("description") + md.add(md.title(1, f"spaCy Project{f': {title}' if title else ''}", "🪐")) + md.add(INTRO) + if description: + md.add(description) + md.add(md.title(2, PROJECT_FILE, "📋")) + md.add(INTRO_PROJECT) + # Commands + cmds = config.get("commands", []) + data = [(md.code(cmd["name"]), cmd.get("help", "")) for cmd in cmds] + if data: + md.add(md.title(3, "Commands", "⏯")) + md.add(INTRO_COMMANDS) + md.add(md.table(data, ["Command", "Description"])) + # Workflows + wfs = config.get("workflows", {}).items() + data = [(md.code(n), " → ".join(md.code(w) for w in stp)) for n, stp in wfs] + if data: + md.add(md.title(3, "Workflows", "⏭")) + md.add(INTRO_WORKFLOWS) + md.add(md.table(data, ["Workflow", "Steps"])) + # Assets + assets = config.get("assets", []) + data = [] + for a in assets: + source = "Git" if a.get("git") else "URL" if a.get("url") else "Local" + dest_path = a["dest"] + dest = md.code(dest_path) + if source == "Local": + # Only link assets if they're in the repo + with working_dir(project_dir) as p: + if (p / dest_path).exists(): + dest = md.link(dest, dest_path) + data.append((dest, source, a.get("description", ""))) + if data: + md.add(md.title(3, "Assets", "🗂")) + md.add(INTRO_ASSETS) + md.add(md.table(data, ["File", "Source", "Description"])) + md.add(MARKER_END) + # Output result + if is_stdout: + print(md.text) + else: + content = md.text + if output_file.exists(): + with output_file.open("r", encoding="utf8") as f: + existing = f.read() + if MARKER_IGNORE in existing: + msg.warn("Found ignore marker in existing file: skipping", output_file) + 
return + if MARKER_START in existing and MARKER_END in existing: + msg.info("Found existing file: only replacing auto-generated docs") + before = existing.split(MARKER_START)[0] + after = existing.split(MARKER_END)[1] + content = f"{before}{content}{after}" + else: + msg.warn("Replacing existing file") + with output_file.open("w", encoding="utf8") as f: + f.write(content) + msg.good("Saved project documentation", output_file) diff --git a/spacy/cli/project/dvc.py b/spacy/cli/project/dvc.py new file mode 100644 index 000000000..6eedc9c20 --- /dev/null +++ b/spacy/cli/project/dvc.py @@ -0,0 +1,204 @@ +"""This module contains helpers and subcommands for integrating spaCy projects +with Data Version Controk (DVC). https://dvc.org""" +from typing import Dict, Any, List, Optional, Iterable +import subprocess +from pathlib import Path +from wasabi import msg + +from .._util import PROJECT_FILE, load_project_config, get_hash, project_cli +from .._util import Arg, Opt, NAME, COMMAND +from ...util import working_dir, split_command, join_command, run_command +from ...util import SimpleFrozenList + + +DVC_CONFIG = "dvc.yaml" +DVC_DIR = ".dvc" +UPDATE_COMMAND = "dvc" +DVC_CONFIG_COMMENT = f"""# This file is auto-generated by spaCy based on your {PROJECT_FILE}. If you've +# edited your {PROJECT_FILE}, you can regenerate this file by running: +# {COMMAND} project {UPDATE_COMMAND}""" + + +@project_cli.command(UPDATE_COMMAND) +def project_update_dvc_cli( + # fmt: off + project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False), + workflow: Optional[str] = Arg(None, help=f"Name of workflow defined in {PROJECT_FILE}. Defaults to first workflow if not set."), + verbose: bool = Opt(False, "--verbose", "-V", help="Print more info"), + force: bool = Opt(False, "--force", "-F", help="Force update DVC config"), + # fmt: on +): + """Auto-generate Data Version Control (DVC) config. A DVC + project can only define one pipeline, so you need to specify one workflow + defined in the project.yml. If no workflow is specified, the first defined + workflow is used. The DVC config will only be updated if the project.yml + changed. + + DOCS: https://nightly.spacy.io/api/cli#project-dvc + """ + project_update_dvc(project_dir, workflow, verbose=verbose, force=force) + + +def project_update_dvc( + project_dir: Path, + workflow: Optional[str] = None, + *, + verbose: bool = False, + force: bool = False, +) -> None: + """Update the auto-generated Data Version Control (DVC) config file. A DVC + project can only define one pipeline, so you need to specify one workflow + defined in the project.yml. Will only update the file if the checksum changed. + + project_dir (Path): The project directory. + workflow (Optional[str]): Optional name of workflow defined in project.yml. + If not set, the first workflow will be used. + verbose (bool): Print more info. + force (bool): Force update DVC config. 
+ """ + config = load_project_config(project_dir) + updated = update_dvc_config( + project_dir, config, workflow, verbose=verbose, force=force + ) + help_msg = "To execute the workflow with DVC, run: dvc repro" + if updated: + msg.good(f"Updated DVC config from {PROJECT_FILE}", help_msg) + else: + msg.info(f"No changes found in {PROJECT_FILE}, no update needed", help_msg) + + +def update_dvc_config( + path: Path, + config: Dict[str, Any], + workflow: Optional[str] = None, + verbose: bool = False, + silent: bool = False, + force: bool = False, +) -> bool: + """Re-run the DVC commands in dry mode and update dvc.yaml file in the + project directory. The file is auto-generated based on the config. The + first line of the auto-generated file specifies the hash of the config + dict, so if any of the config values change, the DVC config is regenerated. + + path (Path): The path to the project directory. + config (Dict[str, Any]): The loaded project.yml. + verbose (bool): Whether to print additional info (via DVC). + silent (bool): Don't output anything (via DVC). + force (bool): Force update, even if hashes match. + RETURNS (bool): Whether the DVC config file was updated. + """ + ensure_dvc(path) + workflows = config.get("workflows", {}) + workflow_names = list(workflows.keys()) + check_workflows(workflow_names, workflow) + if not workflow: + workflow = workflow_names[0] + config_hash = get_hash(config) + path = path.resolve() + dvc_config_path = path / DVC_CONFIG + if dvc_config_path.exists(): + # Check if the file was generated using the current config, if not, redo + with dvc_config_path.open("r", encoding="utf8") as f: + ref_hash = f.readline().strip().replace("# ", "") + if ref_hash == config_hash and not force: + return False # Nothing has changed in project.yml, don't need to update + dvc_config_path.unlink() + dvc_commands = [] + config_commands = {cmd["name"]: cmd for cmd in config.get("commands", [])} + for name in workflows[workflow]: + command = config_commands[name] + deps = command.get("deps", []) + outputs = command.get("outputs", []) + outputs_no_cache = command.get("outputs_no_cache", []) + if not deps and not outputs and not outputs_no_cache: + continue + # Default to the working dir as the project path since dvc.yaml is auto-generated + # and we don't want arbitrary paths in there + project_cmd = ["python", "-m", NAME, "project", "run", name] + deps_cmd = [c for cl in [["-d", p] for p in deps] for c in cl] + outputs_cmd = [c for cl in [["-o", p] for p in outputs] for c in cl] + outputs_nc_cmd = [c for cl in [["-O", p] for p in outputs_no_cache] for c in cl] + dvc_cmd = ["run", "-n", name, "-w", str(path), "--no-exec"] + if command.get("no_skip"): + dvc_cmd.append("--always-changed") + full_cmd = [*dvc_cmd, *deps_cmd, *outputs_cmd, *outputs_nc_cmd, *project_cmd] + dvc_commands.append(join_command(full_cmd)) + with working_dir(path): + dvc_flags = {"--verbose": verbose, "--quiet": silent} + run_dvc_commands(dvc_commands, flags=dvc_flags) + with dvc_config_path.open("r+", encoding="utf8") as f: + content = f.read() + f.seek(0, 0) + f.write(f"# {config_hash}\n{DVC_CONFIG_COMMENT}\n{content}") + return True + + +def run_dvc_commands( + commands: Iterable[str] = SimpleFrozenList(), flags: Dict[str, bool] = {} +) -> None: + """Run a sequence of DVC commands in a subprocess, in order. + + commands (List[str]): The string commands without the leading "dvc". + flags (Dict[str, bool]): Conditional flags to be added to command. 
Makes it + easier to pass flags like --quiet that depend on a variable or + command-line setting while avoiding lots of nested conditionals. + """ + for command in commands: + command = split_command(command) + dvc_command = ["dvc", *command] + # Add the flags if they are set to True + for flag, is_active in flags.items(): + if is_active: + dvc_command.append(flag) + run_command(dvc_command) + + +def check_workflows(workflows: List[str], workflow: Optional[str] = None) -> None: + """Validate workflows provided in project.yml and check that a given + workflow can be used to generate a DVC config. + + workflows (List[str]): Names of the available workflows. + workflow (Optional[str]): The name of the workflow to convert. + """ + if not workflows: + msg.fail( + f"No workflows defined in {PROJECT_FILE}. To generate a DVC config, " + f"define at least one list of commands.", + exits=1, + ) + if workflow is not None and workflow not in workflows: + msg.fail( + f"Workflow '{workflow}' not defined in {PROJECT_FILE}. " + f"Available workflows: {', '.join(workflows)}", + exits=1, + ) + if not workflow: + msg.warn( + f"No workflow specified for DVC pipeline. Using the first workflow " + f"defined in {PROJECT_FILE}: '{workflows[0]}'" + ) + + +def ensure_dvc(project_dir: Path) -> None: + """Ensure that the "dvc" command is available and that the current project + directory is an initialized DVC project. + """ + try: + subprocess.run(["dvc", "--version"], stdout=subprocess.DEVNULL) + except Exception: + msg.fail( + "To use spaCy projects with DVC (Data Version Control), DVC needs " + "to be installed and the 'dvc' command needs to be available", + "You can install the Python package from pip (pip install dvc) or " + "conda (conda install -c conda-forge dvc). For more details, see the " + "documentation: https://dvc.org/doc/install", + exits=1, + ) + if not (project_dir / ".dvc").exists(): + msg.fail( + "Project not initialized as a DVC project", + "To initialize a DVC project, you can run 'dvc init' in the project " + "directory. For more details, see the documentation: " + "https://dvc.org/doc/command-reference/init", + exits=1, + ) diff --git a/spacy/cli/project/pull.py b/spacy/cli/project/pull.py new file mode 100644 index 000000000..26676d5b3 --- /dev/null +++ b/spacy/cli/project/pull.py @@ -0,0 +1,58 @@ +from pathlib import Path +from wasabi import msg +from .remote_storage import RemoteStorage +from .remote_storage import get_command_hash +from .._util import project_cli, Arg +from .._util import load_project_config +from .run import update_lockfile + + +@project_cli.command("pull") +def project_pull_cli( + # fmt: off + remote: str = Arg("default", help="Name or path of remote storage"), + project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False), + # fmt: on +): + """Retrieve available precomputed outputs from a remote storage. + You can alias remotes in your project.yml by mapping them to storage paths. + A storage can be anything that the smart-open library can upload to, e.g. + AWS, Google Cloud Storage, SSH, local directories etc. + + DOCS: https://nightly.spacy.io/api/cli#project-pull + """ + for url, output_path in project_pull(project_dir, remote): + if url is not None: + msg.good(f"Pulled {output_path} from {url}") + + +def project_pull(project_dir: Path, remote: str, *, verbose: bool = False): + # TODO: We don't have tests for this :(. It would take a bit of mockery to + # set up. 
I guess see if it breaks first? + config = load_project_config(project_dir) + if remote in config.get("remotes", {}): + remote = config["remotes"][remote] + storage = RemoteStorage(project_dir, remote) + commands = list(config.get("commands", [])) + # We use a while loop here because we don't know how the commands + # will be ordered. A command might need dependencies from one that's later + # in the list. + while commands: + for i, cmd in enumerate(list(commands)): + deps = [project_dir / dep for dep in cmd.get("deps", [])] + if all(dep.exists() for dep in deps): + cmd_hash = get_command_hash("", "", deps, cmd["script"]) + for output_path in cmd.get("outputs", []): + url = storage.pull(output_path, command_hash=cmd_hash) + yield url, output_path + + out_locs = [project_dir / out for out in cmd.get("outputs", [])] + if all(loc.exists() for loc in out_locs): + update_lockfile(project_dir, cmd) + # We remove the command from the list here, and break, so that + # we iterate over the loop again. + commands.pop(i) + break + else: + # If we didn't break the for loop, break the while loop. + break diff --git a/spacy/cli/project/push.py b/spacy/cli/project/push.py new file mode 100644 index 000000000..26495412d --- /dev/null +++ b/spacy/cli/project/push.py @@ -0,0 +1,63 @@ +from pathlib import Path +from wasabi import msg +from .remote_storage import RemoteStorage +from .remote_storage import get_content_hash, get_command_hash +from .._util import load_project_config +from .._util import project_cli, Arg + + +@project_cli.command("push") +def project_push_cli( + # fmt: off + remote: str = Arg("default", help="Name or path of remote storage"), + project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False), + # fmt: on +): + """Persist outputs to a remote storage. You can alias remotes in your + project.yml by mapping them to storage paths. A storage can be anything that + the smart-open library can upload to, e.g. AWS, Google Cloud Storage, SSH, + local directories etc. + + DOCS: https://nightly.spacy.io/api/cli#project-push + """ + for output_path, url in project_push(project_dir, remote): + if url is None: + msg.info(f"Skipping {output_path}") + else: + msg.good(f"Pushed {output_path} to {url}") + + +def project_push(project_dir: Path, remote: str): + """Persist outputs to a remote storage. You can alias remotes in your project.yml + by mapping them to storage paths. A storage can be anything that the smart-open + library can upload to, e.g. 
gcs, aws, ssh, local directories etc + """ + config = load_project_config(project_dir) + if remote in config.get("remotes", {}): + remote = config["remotes"][remote] + storage = RemoteStorage(project_dir, remote) + for cmd in config.get("commands", []): + deps = [project_dir / dep for dep in cmd.get("deps", [])] + if any(not dep.exists() for dep in deps): + continue + cmd_hash = get_command_hash( + "", "", [project_dir / dep for dep in cmd.get("deps", [])], cmd["script"] + ) + for output_path in cmd.get("outputs", []): + output_loc = project_dir / output_path + if output_loc.exists() and _is_not_empty_dir(output_loc): + url = storage.push( + output_path, + command_hash=cmd_hash, + content_hash=get_content_hash(output_loc), + ) + yield output_path, url + + +def _is_not_empty_dir(loc: Path): + if not loc.is_dir(): + return True + elif any(_is_not_empty_dir(child) for child in loc.iterdir()): + return True + else: + return False diff --git a/spacy/cli/project/remote_storage.py b/spacy/cli/project/remote_storage.py new file mode 100644 index 000000000..6056458e2 --- /dev/null +++ b/spacy/cli/project/remote_storage.py @@ -0,0 +1,174 @@ +from typing import Optional, List, Dict, TYPE_CHECKING +import os +import site +import hashlib +import urllib.parse +import tarfile +from pathlib import Path + +from .._util import get_hash, get_checksum, download_file, ensure_pathy +from ...util import make_tempdir, get_minor_version, ENV_VARS, check_bool_env_var +from ...git_info import GIT_VERSION +from ... import about + +if TYPE_CHECKING: + from pathy import Pathy # noqa: F401 + + +class RemoteStorage: + """Push and pull outputs to and from a remote file storage. + + Remotes can be anything that `smart-open` can support: AWS, GCS, file system, + ssh, etc. + """ + + def __init__(self, project_root: Path, url: str, *, compression="gz"): + self.root = project_root + self.url = ensure_pathy(url) + self.compression = compression + + def push(self, path: Path, command_hash: str, content_hash: str) -> "Pathy": + """Compress a file or directory within a project and upload it to a remote + storage. If an object exists at the full URL, nothing is done. + + Within the remote storage, files are addressed by their project path + (url encoded) and two user-supplied hashes, representing their creation + context and their file contents. If the URL already exists, the data is + not uploaded. Paths are archived and compressed prior to upload. + """ + loc = self.root / path + if not loc.exists(): + raise IOError(f"Cannot push {loc}: does not exist.") + url = self.make_url(path, command_hash, content_hash) + if url.exists(): + return None + tmp: Path + with make_tempdir() as tmp: + tar_loc = tmp / self.encode_name(str(path)) + mode_string = f"w:{self.compression}" if self.compression else "w" + with tarfile.open(tar_loc, mode=mode_string) as tar_file: + tar_file.add(str(loc), arcname=str(path)) + with tar_loc.open(mode="rb") as input_file: + with url.open(mode="wb") as output_file: + output_file.write(input_file.read()) + return url + + def pull( + self, + path: Path, + *, + command_hash: Optional[str] = None, + content_hash: Optional[str] = None, + ) -> Optional["Pathy"]: + """Retrieve a file from the remote cache. If the file already exists, + nothing is done. + + If the command_hash and/or content_hash are specified, only matching + results are returned. If no results are available, an error is raised. 
+ """ + dest = self.root / path + if dest.exists(): + return None + url = self.find(path, command_hash=command_hash, content_hash=content_hash) + if url is None: + return url + else: + # Make sure the destination exists + if not dest.parent.exists(): + dest.parent.mkdir(parents=True) + tmp: Path + with make_tempdir() as tmp: + tar_loc = tmp / url.parts[-1] + download_file(url, tar_loc) + mode_string = f"r:{self.compression}" if self.compression else "r" + with tarfile.open(tar_loc, mode=mode_string) as tar_file: + # This requires that the path is added correctly, relative + # to root. This is how we set things up in push() + tar_file.extractall(self.root) + return url + + def find( + self, + path: Path, + *, + command_hash: Optional[str] = None, + content_hash: Optional[str] = None, + ) -> Optional["Pathy"]: + """Find the best matching version of a file within the storage, + or `None` if no match can be found. If both the creation and content hash + are specified, only exact matches will be returned. Otherwise, the most + recent matching file is preferred. + """ + name = self.encode_name(str(path)) + if command_hash is not None and content_hash is not None: + url = self.make_url(path, command_hash, content_hash) + urls = [url] if url.exists() else [] + elif command_hash is not None: + urls = list((self.url / name / command_hash).iterdir()) + else: + urls = list((self.url / name).iterdir()) + if content_hash is not None: + urls = [url for url in urls if url.parts[-1] == content_hash] + return urls[-1] if urls else None + + def make_url(self, path: Path, command_hash: str, content_hash: str) -> "Pathy": + """Construct a URL from a subpath, a creation hash and a content hash.""" + return self.url / self.encode_name(str(path)) / command_hash / content_hash + + def encode_name(self, name: str) -> str: + """Encode a subpath into a URL-safe name.""" + return urllib.parse.quote_plus(name) + + +def get_content_hash(loc: Path) -> str: + return get_checksum(loc) + + +def get_command_hash( + site_hash: str, env_hash: str, deps: List[Path], cmd: List[str] +) -> str: + """Create a hash representing the execution of a command. This includes the + currently installed packages, whatever environment variables have been marked + as relevant, and the command. + """ + check_commit = check_bool_env_var(ENV_VARS.PROJECT_USE_GIT_VERSION) + spacy_v = GIT_VERSION if check_commit else get_minor_version(about.__version__) + dep_checksums = [get_checksum(dep) for dep in sorted(deps)] + hashes = [spacy_v, site_hash, env_hash] + dep_checksums + hashes.extend(cmd) + creation_bytes = "".join(hashes).encode("utf8") + return hashlib.md5(creation_bytes).hexdigest() + + +def get_site_hash(): + """Hash the current Python environment's site-packages contents, including + the name and version of the libraries. The list we're hashing is what + `pip freeze` would output. + """ + site_dirs = site.getsitepackages() + if site.ENABLE_USER_SITE: + site_dirs.extend(site.getusersitepackages()) + packages = set() + for site_dir in site_dirs: + site_dir = Path(site_dir) + for subpath in site_dir.iterdir(): + if subpath.parts[-1].endswith("dist-info"): + packages.add(subpath.parts[-1].replace(".dist-info", "")) + package_bytes = "".join(sorted(packages)).encode("utf8") + return hashlib.md5sum(package_bytes).hexdigest() + + +def get_env_hash(env: Dict[str, str]) -> str: + """Construct a hash of the environment variables that will be passed into + the commands. 
+ + Values in the env dict may be references to the current os.environ, using + the syntax $ENV_VAR to mean os.environ[ENV_VAR] + """ + env_vars = {} + for key, value in env.items(): + if value.startswith("$"): + env_vars[key] = os.environ.get(value[1:], "") + else: + env_vars[key] = value + return get_hash(env_vars) diff --git a/spacy/cli/project/run.py b/spacy/cli/project/run.py new file mode 100644 index 000000000..1a9b447ea --- /dev/null +++ b/spacy/cli/project/run.py @@ -0,0 +1,276 @@ +from typing import Optional, List, Dict, Sequence, Any, Iterable +from pathlib import Path +from wasabi import msg +import sys +import srsly + +from ... import about +from ...git_info import GIT_VERSION +from ...util import working_dir, run_command, split_command, is_cwd, join_command +from ...util import SimpleFrozenList, is_minor_version_match, ENV_VARS +from ...util import check_bool_env_var +from .._util import PROJECT_FILE, PROJECT_LOCK, load_project_config, get_hash +from .._util import get_checksum, project_cli, Arg, Opt, COMMAND + + +@project_cli.command("run") +def project_run_cli( + # fmt: off + subcommand: str = Arg(None, help=f"Name of command defined in the {PROJECT_FILE}"), + project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False), + force: bool = Opt(False, "--force", "-F", help="Force re-running steps, even if nothing changed"), + dry: bool = Opt(False, "--dry", "-D", help="Perform a dry run and don't execute scripts"), + show_help: bool = Opt(False, "--help", help="Show help message and available subcommands") + # fmt: on +): + """Run a named command or workflow defined in the project.yml. If a workflow + name is specified, all commands in the workflow are run, in order. If + commands define dependencies and/or outputs, they will only be re-run if + state has changed. + + DOCS: https://nightly.spacy.io/api/cli#project-run + """ + if show_help or not subcommand: + print_run_help(project_dir, subcommand) + else: + project_run(project_dir, subcommand, force=force, dry=dry) + + +def project_run( + project_dir: Path, subcommand: str, *, force: bool = False, dry: bool = False +) -> None: + """Run a named script defined in the project.yml. If the script is part + of the default pipeline (defined in the "run" section), DVC is used to + execute the command, so it can determine whether to rerun it. It then + calls into "exec" to execute it. + + project_dir (Path): Path to project directory. + subcommand (str): Name of command to run. + force (bool): Force re-running, even if nothing changed. + dry (bool): Perform a dry run and don't execute commands. + """ + config = load_project_config(project_dir) + commands = {cmd["name"]: cmd for cmd in config.get("commands", [])} + workflows = config.get("workflows", {}) + validate_subcommand(commands.keys(), workflows.keys(), subcommand) + if subcommand in workflows: + msg.info(f"Running workflow '{subcommand}'") + for cmd in workflows[subcommand]: + project_run(project_dir, cmd, force=force, dry=dry) + else: + cmd = commands[subcommand] + for dep in cmd.get("deps", []): + if not (project_dir / dep).exists(): + err = f"Missing dependency specified by command '{subcommand}': {dep}" + err_help = "Maybe you forgot to run the 'project assets' command or a previous step?" 
+ err_kwargs = {"exits": 1} if not dry else {} + msg.fail(err, err_help, **err_kwargs) + check_spacy_commit = check_bool_env_var(ENV_VARS.PROJECT_USE_GIT_VERSION) + with working_dir(project_dir) as current_dir: + msg.divider(subcommand) + rerun = check_rerun(current_dir, cmd, check_spacy_commit=check_spacy_commit) + if not rerun and not force: + msg.info(f"Skipping '{cmd['name']}': nothing changed") + else: + run_commands(cmd["script"], dry=dry) + if not dry: + update_lockfile(current_dir, cmd) + + +def print_run_help(project_dir: Path, subcommand: Optional[str] = None) -> None: + """Simulate a CLI help prompt using the info available in the project.yml. + + project_dir (Path): The project directory. + subcommand (Optional[str]): The subcommand or None. If a subcommand is + provided, the subcommand help is shown. Otherwise, the top-level help + and a list of available commands is printed. + """ + config = load_project_config(project_dir) + config_commands = config.get("commands", []) + commands = {cmd["name"]: cmd for cmd in config_commands} + workflows = config.get("workflows", {}) + project_loc = "" if is_cwd(project_dir) else project_dir + if subcommand: + validate_subcommand(commands.keys(), workflows.keys(), subcommand) + print(f"Usage: {COMMAND} project run {subcommand} {project_loc}") + if subcommand in commands: + help_text = commands[subcommand].get("help") + if help_text: + print(f"\n{help_text}\n") + elif subcommand in workflows: + steps = workflows[subcommand] + print(f"\nWorkflow consisting of {len(steps)} commands:") + steps_data = [ + (f"{i + 1}. {step}", commands[step].get("help", "")) + for i, step in enumerate(steps) + ] + msg.table(steps_data) + help_cmd = f"{COMMAND} project run [COMMAND] {project_loc} --help" + print(f"For command details, run: {help_cmd}") + else: + print("") + title = config.get("title") + if title: + print(f"{title}\n") + if config_commands: + print(f"Available commands in {PROJECT_FILE}") + print(f"Usage: {COMMAND} project run [COMMAND] {project_loc}") + msg.table([(cmd["name"], cmd.get("help", "")) for cmd in config_commands]) + if workflows: + print(f"Available workflows in {PROJECT_FILE}") + print(f"Usage: {COMMAND} project run [WORKFLOW] {project_loc}") + msg.table([(name, " -> ".join(steps)) for name, steps in workflows.items()]) + + +def run_commands( + commands: Iterable[str] = SimpleFrozenList(), + silent: bool = False, + dry: bool = False, +) -> None: + """Run a sequence of commands in a subprocess, in order. + + commands (List[str]): The string commands. + silent (bool): Don't print the commands. + dry (bool): Perform a dry run and don't execut anything. + """ + for command in commands: + command = split_command(command) + # Not sure if this is needed or a good idea. Motivation: users may often + # use commands in their config that reference "python" and we want to + # make sure that it's always executing the same Python that spaCy is + # executed with and the pip in the same env, not some other Python/pip. + # Also ensures cross-compatibility if user 1 writes "python3" (because + # that's how it's set up on their system), and user 2 without the + # shortcut tries to re-run the command. 
+ if len(command) and command[0] in ("python", "python3"): + command[0] = sys.executable + elif len(command) and command[0] in ("pip", "pip3"): + command = [sys.executable, "-m", "pip", *command[1:]] + if not silent: + print(f"Running command: {join_command(command)}") + if not dry: + run_command(command, capture=False) + + +def validate_subcommand( + commands: Sequence[str], workflows: Sequence[str], subcommand: str +) -> None: + """Check that a subcommand is valid and defined. Raises an error otherwise. + + commands (Sequence[str]): The available commands. + subcommand (str): The subcommand. + """ + if not commands and not workflows: + msg.fail(f"No commands or workflows defined in {PROJECT_FILE}", exits=1) + if subcommand not in commands and subcommand not in workflows: + help_msg = [] + if commands: + help_msg.append(f"Available commands: {', '.join(commands)}") + if workflows: + help_msg.append(f"Available workflows: {', '.join(workflows)}") + msg.fail( + f"Can't find command or workflow '{subcommand}' in {PROJECT_FILE}", + ". ".join(help_msg), + exits=1, + ) + + +def check_rerun( + project_dir: Path, + command: Dict[str, Any], + *, + check_spacy_version: bool = True, + check_spacy_commit: bool = False, +) -> bool: + """Check if a command should be rerun because its settings or inputs/outputs + changed. + + project_dir (Path): The current project directory. + command (Dict[str, Any]): The command, as defined in the project.yml. + strict_version (bool): + RETURNS (bool): Whether to re-run the command. + """ + lock_path = project_dir / PROJECT_LOCK + if not lock_path.exists(): # We don't have a lockfile, run command + return True + data = srsly.read_yaml(lock_path) + if command["name"] not in data: # We don't have info about this command + return True + entry = data[command["name"]] + # Always run commands with no outputs (otherwise they'd always be skipped) + if not entry.get("outs", []): + return True + # Always rerun if spaCy version or commit hash changed + spacy_v = entry.get("spacy_version") + commit = entry.get("spacy_git_version") + if check_spacy_version and not is_minor_version_match(spacy_v, about.__version__): + info = f"({spacy_v} in {PROJECT_LOCK}, {about.__version__} current)" + msg.info(f"Re-running '{command['name']}': spaCy minor version changed {info}") + return True + if check_spacy_commit and commit != GIT_VERSION: + info = f"({commit} in {PROJECT_LOCK}, {GIT_VERSION} current)" + msg.info(f"Re-running '{command['name']}': spaCy commit changed {info}") + return True + # If the entry in the lockfile matches the lockfile entry that would be + # generated from the current command, we don't rerun because it means that + # all inputs/outputs, hashes and scripts are the same and nothing changed + lock_entry = get_lock_entry(project_dir, command) + exclude = ["spacy_version", "spacy_git_version"] + return get_hash(lock_entry, exclude=exclude) != get_hash(entry, exclude=exclude) + + +def update_lockfile(project_dir: Path, command: Dict[str, Any]) -> None: + """Update the lockfile after running a command. Will create a lockfile if + it doesn't yet exist and will add an entry for the current command, its + script and dependencies/outputs. + + project_dir (Path): The current project directory. + command (Dict[str, Any]): The command, as defined in the project.yml. 
+ """ + lock_path = project_dir / PROJECT_LOCK + if not lock_path.exists(): + srsly.write_yaml(lock_path, {}) + data = {} + else: + data = srsly.read_yaml(lock_path) + data[command["name"]] = get_lock_entry(project_dir, command) + srsly.write_yaml(lock_path, data) + + +def get_lock_entry(project_dir: Path, command: Dict[str, Any]) -> Dict[str, Any]: + """Get a lockfile entry for a given command. An entry includes the command, + the script (command steps) and a list of dependencies and outputs with + their paths and file hashes, if available. The format is based on the + dvc.lock files, to keep things consistent. + + project_dir (Path): The current project directory. + command (Dict[str, Any]): The command, as defined in the project.yml. + RETURNS (Dict[str, Any]): The lockfile entry. + """ + deps = get_fileinfo(project_dir, command.get("deps", [])) + outs = get_fileinfo(project_dir, command.get("outputs", [])) + outs_nc = get_fileinfo(project_dir, command.get("outputs_no_cache", [])) + return { + "cmd": f"{COMMAND} run {command['name']}", + "script": command["script"], + "deps": deps, + "outs": [*outs, *outs_nc], + "spacy_version": about.__version__, + "spacy_git_version": GIT_VERSION, + } + + +def get_fileinfo(project_dir: Path, paths: List[str]) -> List[Dict[str, str]]: + """Generate the file information for a list of paths (dependencies, outputs). + Includes the file path and the file's checksum. + + project_dir (Path): The current project directory. + paths (List[str]): The file paths. + RETURNS (List[Dict[str, str]]): The lockfile entry for a file. + """ + data = [] + for path in paths: + file_path = project_dir / path + md5 = get_checksum(file_path) if file_path.exists() else None + data.append({"path": path, "md5": md5}) + return data diff --git a/spacy/cli/templates/quickstart_training.jinja b/spacy/cli/templates/quickstart_training.jinja new file mode 100644 index 000000000..d92de9c15 --- /dev/null +++ b/spacy/cli/templates/quickstart_training.jinja @@ -0,0 +1,355 @@ +{# This is a template for training configs used for the quickstart widget in +the docs and the init config command. It encodes various best practices and +can help generate the best possible configuration, given a user's requirements. 
#} +{%- set use_transformer = (transformer_data and hardware != "cpu") -%} +{%- set transformer = transformer_data[optimize] if use_transformer else {} -%} +[paths] +train = null +dev = null + +[system] +{% if use_transformer -%} +gpu_allocator = "pytorch" +{% else -%} +gpu_allocator = null +{% endif %} + +[nlp] +lang = "{{ lang }}" +{%- set full_pipeline = ["transformer" if use_transformer else "tok2vec"] + components %} +pipeline = {{ full_pipeline|pprint()|replace("'", '"')|safe }} +tokenizer = {"@tokenizers": "spacy.Tokenizer.v1"} + +[components] + +{# TRANSFORMER PIPELINE #} +{%- if use_transformer -%} +[components.transformer] +factory = "transformer" + +[components.transformer.model] +@architectures = "spacy-transformers.TransformerModel.v1" +name = "{{ transformer["name"] }}" +tokenizer_config = {"use_fast": true} + +[components.transformer.model.get_spans] +@span_getters = "spacy-transformers.strided_spans.v1" +window = 128 +stride = 96 + +{% if "morphologizer" in components %} +[components.morphologizer] +factory = "morphologizer" + +[components.morphologizer.model] +@architectures = "spacy.Tagger.v1" +nO = null + +[components.morphologizer.model.tok2vec] +@architectures = "spacy-transformers.TransformerListener.v1" +grad_factor = 1.0 + +[components.morphologizer.model.tok2vec.pooling] +@layers = "reduce_mean.v1" +{%- endif %} + +{% if "tagger" in components %} +[components.tagger] +factory = "tagger" + +[components.tagger.model] +@architectures = "spacy.Tagger.v1" +nO = null + +[components.tagger.model.tok2vec] +@architectures = "spacy-transformers.TransformerListener.v1" +grad_factor = 1.0 + +[components.tagger.model.tok2vec.pooling] +@layers = "reduce_mean.v1" +{%- endif %} + +{% if "parser" in components -%} +[components.parser] +factory = "parser" + +[components.parser.model] +@architectures = "spacy.TransitionBasedParser.v1" +state_type = "parser" +extra_state_tokens = false +hidden_width = 128 +maxout_pieces = 3 +use_upper = false +nO = null + +[components.parser.model.tok2vec] +@architectures = "spacy-transformers.TransformerListener.v1" +grad_factor = 1.0 + +[components.parser.model.tok2vec.pooling] +@layers = "reduce_mean.v1" +{%- endif %} + +{% if "ner" in components -%} +[components.ner] +factory = "ner" + +[components.ner.model] +@architectures = "spacy.TransitionBasedParser.v1" +state_type = "ner" +extra_state_tokens = false +hidden_width = 64 +maxout_pieces = 2 +use_upper = false +nO = null + +[components.ner.model.tok2vec] +@architectures = "spacy-transformers.TransformerListener.v1" +grad_factor = 1.0 + +[components.ner.model.tok2vec.pooling] +@layers = "reduce_mean.v1" +{% endif -%} + +{% if "entity_linker" in components -%} +[components.entity_linker] +factory = "entity_linker" +get_candidates = {"@misc":"spacy.CandidateGenerator.v1"} +incl_context = true +incl_prior = true + +[components.entity_linker.model] +@architectures = "spacy.EntityLinker.v1" +nO = null + +[components.entity_linker.model.tok2vec] +@architectures = "spacy-transformers.TransformerListener.v1" +grad_factor = 1.0 + +[components.entity_linker.model.tok2vec.pooling] +@layers = "reduce_mean.v1" +{% endif -%} + +{% if "textcat" in components %} +[components.textcat] +factory = "textcat" + +{% if optimize == "accuracy" %} +[components.textcat.model] +@architectures = "spacy.TextCatEnsemble.v1" +exclusive_classes = false +width = 64 +conv_depth = 2 +embed_size = 2000 +window_size = 1 +ngram_size = 1 +nO = null + +{% else -%} +[components.textcat.model] +@architectures = "spacy.TextCatBOW.v1" 
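+# Bag-of-words architecture: used when optimizing for efficiency; generally
+# faster than the ensemble above, but typically less accurate.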
+exclusive_classes = false +ngram_size = 1 +no_output_layer = false +{%- endif %} +{%- endif %} + +{# NON-TRANSFORMER PIPELINE #} +{% else -%} + +{%- if hardware == "gpu" -%} +# There are no recommended transformer weights available for language '{{ lang }}' +# yet, so the pipeline described here is not transformer-based. +{%- endif %} + +[components.tok2vec] +factory = "tok2vec" + +[components.tok2vec.model] +@architectures = "spacy.Tok2Vec.v1" + +[components.tok2vec.model.embed] +@architectures = "spacy.MultiHashEmbed.v1" +width = ${components.tok2vec.model.encode.width} +{% if has_letters -%} +attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"] +rows = [5000, 2500, 2500, 2500] +{% else -%} +attrs = ["ORTH", "SHAPE"] +rows = [5000, 2500] +{% endif -%} +include_static_vectors = {{ "true" if optimize == "accuracy" else "false" }} + +[components.tok2vec.model.encode] +@architectures = "spacy.MaxoutWindowEncoder.v1" +width = {{ 96 if optimize == "efficiency" else 256 }} +depth = {{ 4 if optimize == "efficiency" else 8 }} +window_size = 1 +maxout_pieces = 3 + +{% if "morphologizer" in components %} +[components.morphologizer] +factory = "morphologizer" + +[components.morphologizer.model] +@architectures = "spacy.Tagger.v1" +nO = null + +[components.morphologizer.model.tok2vec] +@architectures = "spacy.Tok2VecListener.v1" +width = ${components.tok2vec.model.encode.width} +{%- endif %} + +{% if "tagger" in components %} +[components.tagger] +factory = "tagger" + +[components.tagger.model] +@architectures = "spacy.Tagger.v1" +nO = null + +[components.tagger.model.tok2vec] +@architectures = "spacy.Tok2VecListener.v1" +width = ${components.tok2vec.model.encode.width} +{%- endif %} + +{% if "parser" in components -%} +[components.parser] +factory = "parser" + +[components.parser.model] +@architectures = "spacy.TransitionBasedParser.v1" +state_type = "parser" +extra_state_tokens = false +hidden_width = 128 +maxout_pieces = 3 +use_upper = true +nO = null + +[components.parser.model.tok2vec] +@architectures = "spacy.Tok2VecListener.v1" +width = ${components.tok2vec.model.encode.width} +{%- endif %} + +{% if "ner" in components %} +[components.ner] +factory = "ner" + +[components.ner.model] +@architectures = "spacy.TransitionBasedParser.v1" +state_type = "ner" +extra_state_tokens = false +hidden_width = 64 +maxout_pieces = 2 +use_upper = true +nO = null + +[components.ner.model.tok2vec] +@architectures = "spacy.Tok2VecListener.v1" +width = ${components.tok2vec.model.encode.width} +{% endif %} + +{% if "entity_linker" in components -%} +[components.entity_linker] +factory = "entity_linker" +get_candidates = {"@misc":"spacy.CandidateGenerator.v1"} +incl_context = true +incl_prior = true + +[components.entity_linker.model] +@architectures = "spacy.EntityLinker.v1" +nO = null + +[components.entity_linker.model.tok2vec] +@architectures = "spacy.Tok2VecListener.v1" +width = ${components.tok2vec.model.encode.width} +{% endif %} + +{% if "textcat" in components %} +[components.textcat] +factory = "textcat" + +{% if optimize == "accuracy" %} +[components.textcat.model] +@architectures = "spacy.TextCatEnsemble.v1" +exclusive_classes = false +width = 64 +conv_depth = 2 +embed_size = 2000 +window_size = 1 +ngram_size = 1 +nO = null + +{% else -%} +[components.textcat.model] +@architectures = "spacy.TextCatBOW.v1" +exclusive_classes = false +ngram_size = 1 +no_output_layer = false +{%- endif %} +{%- endif %} +{% endif %} + +{% for pipe in components %} +{% if pipe not in ["tagger", "morphologizer", "parser", "ner", 
"textcat", "entity_linker"] %} +{# Other components defined by the user: we just assume they're factories #} +[components.{{ pipe }}] +factory = "{{ pipe }}" +{% endif %} +{% endfor %} + +[corpora] + +[corpora.train] +@readers = "spacy.Corpus.v1" +path = ${paths.train} +max_length = {{ 500 if hardware == "gpu" else 2000 }} + +[corpora.dev] +@readers = "spacy.Corpus.v1" +path = ${paths.dev} +max_length = 0 + +[training] +{% if use_transformer -%} +accumulate_gradient = {{ transformer["size_factor"] }} +{% endif -%} +dev_corpus = "corpora.dev" +train_corpus = "corpora.train" + +[training.optimizer] +@optimizers = "Adam.v1" + +{% if use_transformer -%} +[training.optimizer.learn_rate] +@schedules = "warmup_linear.v1" +warmup_steps = 250 +total_steps = 20000 +initial_rate = 5e-5 +{% endif %} + +{% if use_transformer %} +[training.batcher] +@batchers = "spacy.batch_by_padded.v1" +discard_oversize = true +size = 2000 +buffer = 256 +{%- else %} +[training.batcher] +@batchers = "spacy.batch_by_words.v1" +discard_oversize = false +tolerance = 0.2 + +[training.batcher.size] +@schedules = "compounding.v1" +start = 100 +stop = 1000 +compound = 1.001 +{% endif %} + +[initialize] +{% if use_transformer or optimize == "efficiency" or not word_vectors -%} +vectors = null +{% else -%} +vectors = "{{ word_vectors }}" +{% endif -%} diff --git a/spacy/cli/templates/quickstart_training_recommendations.yml b/spacy/cli/templates/quickstart_training_recommendations.yml new file mode 100644 index 000000000..54aec2e31 --- /dev/null +++ b/spacy/cli/templates/quickstart_training_recommendations.yml @@ -0,0 +1,121 @@ +# Recommended settings and available resources for each language, if available. +# Not all languages have recommended word vectors or transformers and for some, +# the recommended transformer for efficiency and accuracy may be the same. 
+en: + word_vectors: en_vectors_web_lg + transformer: + efficiency: + name: roberta-base + size_factor: 3 + accuracy: + name: roberta-base + size_factor: 3 +de: + word_vectors: null + transformer: + efficiency: + name: bert-base-german-cased + size_factor: 3 + accuracy: + name: bert-base-german-cased + size_factor: 3 +fr: + word_vectors: null + transformer: + efficiency: + name: camembert-base + size_factor: 3 + accuracy: + name: camembert-base + size_factor: 3 +es: + word_vectors: null + transformer: + efficiency: + name: dccuchile/bert-base-spanish-wwm-cased + size_factor: 3 + accuracy: + name: dccuchile/bert-base-spanish-wwm-cased + size_factor: 3 +sv: + word_vectors: null + transformer: + efficiency: + name: KB/bert-base-swedish-cased + size_factor: 3 + accuracy: + name: KB/bert-base-swedish-cased + size_factor: 3 +fi: + word_vectors: null + transformer: + efficiency: + name: TurkuNLP/bert-base-finnish-cased-v1 + size_factor: 3 + accuracy: + name: TurkuNLP/bert-base-finnish-cased-v1 + size_factor: 3 +el: + word_vectors: null + transformer: + efficiency: + name: nlpaueb/bert-base-greek-uncased-v1 + size_factor: 3 + accuracy: + name: nlpaueb/bert-base-greek-uncased-v1 + size_factor: 3 +tr: + word_vectors: null + transformer: + efficiency: + name: dbmdz/bert-base-turkish-cased + size_factor: 3 + accuracy: + name: dbmdz/bert-base-turkish-cased + size_factor: 3 +zh: + word_vectors: null + transformer: + efficiency: + name: bert-base-chinese + size_factor: 3 + accuracy: + name: bert-base-chinese + size_factor: 3 + has_letters: false +ar: + word_vectors: null + transformer: + efficiency: + name: asafaya/bert-base-arabic + size_factor: 3 + accuracy: + name: asafaya/bert-base-arabic + size_factor: 3 +pl: + word_vectors: null + transformer: + efficiency: + name: dkleczek/bert-base-polish-cased-v1 + size_factor: 3 + accuracy: + name: dkleczek/bert-base-polish-cased-v1 + size_factor: 3 +nl: + word_vectors: null + transformer: + efficiency: + name: pdelobelle/robbert-v2-dutch-base + size_factor: 3 + accuracy: + name: pdelobelle/robbert-v2-dutch-base + size_factor: 3 +pt: + word_vectors: null + transformer: + efficiency: + name: neuralmind/bert-base-portuguese-cased + size_factor: 3 + accuracy: + name: neuralmind/bert-base-portuguese-cased + size_factor: 3 diff --git a/spacy/cli/train.py b/spacy/cli/train.py index 0a640d909..0b27f63dc 100644 --- a/spacy/cli/train.py +++ b/spacy/cli/train.py @@ -1,774 +1,59 @@ -# coding: utf8 -from __future__ import unicode_literals, division, print_function - -import plac -import os -import tqdm +from typing import Optional from pathlib import Path -from thinc.neural._classes.model import Model -from timeit import default_timer as timer -import shutil -import srsly from wasabi import msg -import contextlib -import random +import typer +import logging +import sys -from .._ml import create_default_optimizer -from ..util import use_gpu as set_gpu -from ..errors import Errors -from ..gold import GoldCorpus -from ..compat import path2str -from ..lookups import Lookups +from ._util import app, Arg, Opt, parse_config_overrides, show_validation_error +from ._util import import_code, setup_gpu +from ..training.loop import train +from ..training.initialize import init_nlp from .. import util -from .. 
import about -@plac.annotations( - # fmt: off - lang=("Model language", "positional", None, str), - output_path=("Output directory to store model in", "positional", None, Path), - train_path=("Location of JSON-formatted training data", "positional", None, Path), - dev_path=("Location of JSON-formatted development data", "positional", None, Path), - raw_text=("Path to jsonl file with unlabelled text documents.", "option", "rt", Path), - base_model=("Name of model to update (optional)", "option", "b", str), - pipeline=("Comma-separated names of pipeline components", "option", "p", str), - replace_components=("Replace components from base model", "flag", "R", bool), - vectors=("Model to load vectors from", "option", "v", str), - width=("Width of CNN layers of Tok2Vec component", "option", "cw", int), - conv_depth=("Depth of CNN layers of Tok2Vec component", "option", "cd", int), - cnn_window=("Window size for CNN layers of Tok2Vec component", "option", "cW", int), - cnn_pieces=("Maxout size for CNN layers of Tok2Vec component. 1 for Mish", "option", "cP", int), - use_chars=("Whether to use character-based embedding of Tok2Vec component", "flag", "chr", bool), - bilstm_depth=("Depth of BiLSTM layers of Tok2Vec component (requires PyTorch)", "option", "lstm", int), - embed_rows=("Number of embedding rows of Tok2Vec component", "option", "er", int), - n_iter=("Number of iterations", "option", "n", int), - n_early_stopping=("Maximum number of training epochs without dev accuracy improvement", "option", "ne", int), - n_examples=("Number of examples", "option", "ns", int), - use_gpu=("Use GPU", "option", "g", int), - version=("Model version", "option", "V", str), - meta_path=("Optional path to meta.json to use as base.", "option", "m", Path), - init_tok2vec=("Path to pretrained weights for the token-to-vector parts of the models. See 'spacy pretrain'. Experimental.", "option", "t2v", Path), - parser_multitasks=("Side objectives for parser CNN, e.g. 'dep' or 'dep,tag'", "option", "pt", str), - entity_multitasks=("Side objectives for NER CNN, e.g. 'dep' or 'dep,tag'", "option", "et", str), - noise_level=("Amount of corruption for data augmentation", "option", "nl", float), - orth_variant_level=("Amount of orthography variation for data augmentation", "option", "ovl", float), - eval_beam_widths=("Beam widths to evaluate, e.g. 
4,8", "option", "bw", str), - gold_preproc=("Use gold preprocessing", "flag", "G", bool), - learn_tokens=("Make parser learn gold-standard tokenization", "flag", "T", bool), - textcat_multilabel=("Textcat classes aren't mutually exclusive (multilabel)", "flag", "TML", bool), - textcat_arch=("Textcat model architecture", "option", "ta", str), - textcat_positive_label=("Textcat positive label for binary classes with two labels", "option", "tpl", str), - tag_map_path=("Location of JSON-formatted tag map", "option", "tm", Path), - omit_extra_lookups=("Don't include extra lookups in model", "flag", "OEL", bool), - verbose=("Display more information for debug", "flag", "VV", bool), - debug=("Run data diagnostics before training", "flag", "D", bool), - # fmt: on +@app.command( + "train", context_settings={"allow_extra_args": True, "ignore_unknown_options": True} ) -def train( - lang, - output_path, - train_path, - dev_path, - raw_text=None, - base_model=None, - pipeline="tagger,parser,ner", - replace_components=False, - vectors=None, - width=96, - conv_depth=4, - cnn_window=1, - cnn_pieces=3, - use_chars=False, - bilstm_depth=0, - embed_rows=2000, - n_iter=30, - n_early_stopping=None, - n_examples=0, - use_gpu=-1, - version="0.0.0", - meta_path=None, - init_tok2vec=None, - parser_multitasks="", - entity_multitasks="", - noise_level=0.0, - orth_variant_level=0.0, - eval_beam_widths="", - gold_preproc=False, - learn_tokens=False, - textcat_multilabel=False, - textcat_arch="bow", - textcat_positive_label=None, - tag_map_path=None, - omit_extra_lookups=False, - verbose=False, - debug=False, -): - """ - Train or update a spaCy model. Requires data to be formatted in spaCy's - JSON format. To convert data from other formats, use the `spacy convert` - command. - """ - util.fix_random_seed() - util.set_env_log(verbose) - - # Make sure all files and paths exists if they are needed - train_path = util.ensure_path(train_path) - dev_path = util.ensure_path(dev_path) - meta_path = util.ensure_path(meta_path) - output_path = util.ensure_path(output_path) - if raw_text is not None: - raw_text = list(srsly.read_jsonl(raw_text)) - if not train_path or not train_path.exists(): - msg.fail("Training data not found", train_path, exits=1) - if not dev_path or not dev_path.exists(): - msg.fail("Development data not found", dev_path, exits=1) - if meta_path is not None and not meta_path.exists(): - msg.fail("Can't find model meta.json", meta_path, exits=1) - meta = srsly.read_json(meta_path) if meta_path else {} - if output_path.exists() and [p for p in output_path.iterdir() if p.is_dir()]: - msg.warn( - "Output directory is not empty", - "This can lead to unintended side effects when saving the model. " - "Please use an empty directory or a different path instead. If " - "the specified output path doesn't exist, the directory will be " - "created for you.", - ) - if not output_path.exists(): - output_path.mkdir() - msg.good("Created output directory: {}".format(output_path)) - - tag_map = {} - if tag_map_path is not None: - tag_map = srsly.read_json(tag_map_path) - # Take dropout and batch size as generators of values -- dropout - # starts high and decays sharply, to force the optimizer to explore. - # Batch size starts at 1 and grows, so that we make updates quickly - # at the beginning of training. 
- dropout_rates = util.decaying( - util.env_opt("dropout_from", 0.2), - util.env_opt("dropout_to", 0.2), - util.env_opt("dropout_decay", 0.0), - ) - batch_sizes = util.compounding( - util.env_opt("batch_from", 100.0), - util.env_opt("batch_to", 1000.0), - util.env_opt("batch_compound", 1.001), - ) - - if not eval_beam_widths: - eval_beam_widths = [1] - else: - eval_beam_widths = [int(bw) for bw in eval_beam_widths.split(",")] - if 1 not in eval_beam_widths: - eval_beam_widths.append(1) - eval_beam_widths.sort() - has_beam_widths = eval_beam_widths != [1] - - # Set up the base model and pipeline. If a base model is specified, load - # the model and make sure the pipeline matches the pipeline setting. If - # training starts from a blank model, intitalize the language class. - pipeline = [p.strip() for p in pipeline.split(",")] - disabled_pipes = None - pipes_added = False - msg.text("Training pipeline: {}".format(pipeline)) - if use_gpu >= 0: - activated_gpu = None - try: - activated_gpu = set_gpu(use_gpu) - except Exception as e: - msg.warn("Exception: {}".format(e)) - if activated_gpu is not None: - msg.text("Using GPU: {}".format(use_gpu)) - else: - msg.warn("Unable to activate GPU: {}".format(use_gpu)) - msg.text("Using CPU only") - use_gpu = -1 - base_components = [] - if base_model: - msg.text("Starting with base model '{}'".format(base_model)) - nlp = util.load_model(base_model) - if nlp.lang != lang: - msg.fail( - "Model language ('{}') doesn't match language specified as " - "`lang` argument ('{}') ".format(nlp.lang, lang), - exits=1, - ) - for pipe in pipeline: - pipe_cfg = {} - if pipe == "parser": - pipe_cfg = {"learn_tokens": learn_tokens} - elif pipe == "textcat": - pipe_cfg = { - "exclusive_classes": not textcat_multilabel, - "architecture": textcat_arch, - "positive_label": textcat_positive_label, - } - if pipe not in nlp.pipe_names: - msg.text("Adding component to base model: '{}'".format(pipe)) - nlp.add_pipe(nlp.create_pipe(pipe, config=pipe_cfg)) - pipes_added = True - elif replace_components: - msg.text("Replacing component from base model '{}'".format(pipe)) - nlp.replace_pipe(pipe, nlp.create_pipe(pipe, config=pipe_cfg)) - pipes_added = True - else: - if pipe == "textcat": - textcat_cfg = nlp.get_pipe("textcat").cfg - base_cfg = { - "exclusive_classes": textcat_cfg["exclusive_classes"], - "architecture": textcat_cfg["architecture"], - "positive_label": textcat_cfg["positive_label"], - } - if base_cfg != pipe_cfg: - msg.fail( - "The base textcat model configuration does" - "not match the provided training options. 
" - "Existing cfg: {}, provided cfg: {}".format( - base_cfg, pipe_cfg - ), - exits=1, - ) - msg.text("Extending component from base model '{}'".format(pipe)) - base_components.append(pipe) - disabled_pipes = nlp.disable_pipes( - [p for p in nlp.pipe_names if p not in pipeline] - ) - else: - msg.text("Starting with blank model '{}'".format(lang)) - lang_cls = util.get_lang_class(lang) - nlp = lang_cls() - for pipe in pipeline: - if pipe == "parser": - pipe_cfg = {"learn_tokens": learn_tokens} - elif pipe == "textcat": - pipe_cfg = { - "exclusive_classes": not textcat_multilabel, - "architecture": textcat_arch, - "positive_label": textcat_positive_label, - } - else: - pipe_cfg = {} - nlp.add_pipe(nlp.create_pipe(pipe, config=pipe_cfg)) - - # Replace tag map with provided mapping - nlp.vocab.morphology.load_tag_map(tag_map) - - # Create empty extra lexeme tables so the data from spacy-lookups-data - # isn't loaded if these features are accessed - if omit_extra_lookups: - nlp.vocab.lookups_extra = Lookups() - nlp.vocab.lookups_extra.add_table("lexeme_cluster") - nlp.vocab.lookups_extra.add_table("lexeme_prob") - nlp.vocab.lookups_extra.add_table("lexeme_settings") - - if vectors: - msg.text("Loading vector from model '{}'".format(vectors)) - _load_vectors(nlp, vectors) - - # Multitask objectives - multitask_options = [("parser", parser_multitasks), ("ner", entity_multitasks)] - for pipe_name, multitasks in multitask_options: - if multitasks: - if pipe_name not in pipeline: - msg.fail( - "Can't use multitask objective without '{}' in the " - "pipeline".format(pipe_name) - ) - pipe = nlp.get_pipe(pipe_name) - for objective in multitasks.split(","): - pipe.add_multitask_objective(objective) - - # Prepare training corpus - msg.text("Counting training words (limit={})".format(n_examples)) - corpus = GoldCorpus(train_path, dev_path, limit=n_examples) - n_train_words = corpus.count_train() - - if base_model and not pipes_added: - # Start with an existing model, use default optimizer - optimizer = nlp.resume_training(device=use_gpu) - else: - # Start with a blank model, call begin_training - cfg = {"device": use_gpu} - cfg["conv_depth"] = conv_depth - cfg["token_vector_width"] = width - cfg["bilstm_depth"] = bilstm_depth - cfg["cnn_maxout_pieces"] = cnn_pieces - cfg["embed_size"] = embed_rows - cfg["conv_window"] = cnn_window - cfg["subword_features"] = not use_chars - optimizer = nlp.begin_training(lambda: corpus.train_tuples, **cfg) - - nlp._optimizer = None - - # Load in pretrained weights - if init_tok2vec is not None: - components = _load_pretrained_tok2vec(nlp, init_tok2vec, base_components) - msg.text("Loaded pretrained tok2vec for: {}".format(components)) - - # Verify textcat config - if "textcat" in pipeline: - textcat_labels = nlp.get_pipe("textcat").cfg.get("labels", []) - if textcat_positive_label and textcat_positive_label not in textcat_labels: - msg.fail( - "The textcat_positive_label (tpl) '{}' does not match any " - "label in the training data.".format(textcat_positive_label), - exits=1, - ) - if textcat_positive_label and len(textcat_labels) != 2: - msg.fail( - "A textcat_positive_label (tpl) '{}' was provided for training " - "data that does not appear to be a binary classification " - "problem with two labels.".format(textcat_positive_label), - exits=1, - ) - train_docs = corpus.train_docs( - nlp, - noise_level=noise_level, - gold_preproc=gold_preproc, - max_length=0, - ignore_misaligned=True, - ) - train_labels = set() - if textcat_multilabel: - multilabel_found = False - for 
text, gold in train_docs: - train_labels.update(gold.cats.keys()) - if list(gold.cats.values()).count(1.0) != 1: - multilabel_found = True - if not multilabel_found and not base_model: - msg.warn( - "The textcat training instances look like they have " - "mutually-exclusive classes. Remove the flag " - "'--textcat-multilabel' to train a classifier with " - "mutually-exclusive classes." - ) - if not textcat_multilabel: - for text, gold in train_docs: - train_labels.update(gold.cats.keys()) - if list(gold.cats.values()).count(1.0) != 1 and not base_model: - msg.warn( - "Some textcat training instances do not have exactly " - "one positive label. Modifying training options to " - "include the flag '--textcat-multilabel' for classes " - "that are not mutually exclusive." - ) - nlp.get_pipe("textcat").cfg["exclusive_classes"] = False - textcat_multilabel = True - break - if base_model and set(textcat_labels) != train_labels: - msg.fail( - "Cannot extend textcat model using data with different " - "labels. Base model labels: {}, training data labels: " - "{}.".format(textcat_labels, list(train_labels)), - exits=1, - ) - if textcat_multilabel: - msg.text( - "Textcat evaluation score: ROC AUC score macro-averaged across " - "the labels '{}'".format(", ".join(textcat_labels)) - ) - elif textcat_positive_label and len(textcat_labels) == 2: - msg.text( - "Textcat evaluation score: F1-score for the " - "label '{}'".format(textcat_positive_label) - ) - elif len(textcat_labels) > 1: - if len(textcat_labels) == 2: - msg.warn( - "If the textcat component is a binary classifier with " - "exclusive classes, provide '--textcat-positive-label' for " - "an evaluation on the positive class." - ) - msg.text( - "Textcat evaluation score: F1-score macro-averaged across " - "the labels '{}'".format(", ".join(textcat_labels)) - ) - else: - msg.fail( - "Unsupported textcat configuration. Use `spacy debug-data` " - "for more information." 
- ) - +def train_cli( # fmt: off - row_head, output_stats = _configure_training_output(pipeline, use_gpu, has_beam_widths) - row_widths = [len(w) for w in row_head] - row_settings = {"widths": row_widths, "aligns": tuple(["r" for i in row_head]), "spacing": 2} + ctx: typer.Context, # This is only used to read additional arguments + config_path: Path = Arg(..., help="Path to config file", exists=True), + output_path: Optional[Path] = Opt(None, "--output", "--output-path", "-o", help="Output directory to store trained pipeline in"), + code_path: Optional[Path] = Opt(None, "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"), + verbose: bool = Opt(False, "--verbose", "-V", "-VV", help="Display more information for debugging purposes"), + use_gpu: int = Opt(-1, "--gpu-id", "-g", help="GPU ID or -1 for CPU") # fmt: on - print("") - msg.row(row_head, **row_settings) - msg.row(["-" * width for width in row_settings["widths"]], **row_settings) - try: - iter_since_best = 0 - best_score = 0.0 - for i in range(n_iter): - train_docs = corpus.train_docs( - nlp, - noise_level=noise_level, - orth_variant_level=orth_variant_level, - gold_preproc=gold_preproc, - max_length=0, - ignore_misaligned=True, - ) - if raw_text: - random.shuffle(raw_text) - raw_batches = util.minibatch( - (nlp.make_doc(rt["text"]) for rt in raw_text), size=8 - ) - words_seen = 0 - with tqdm.tqdm(total=n_train_words, leave=False) as pbar: - losses = {} - for batch in util.minibatch_by_words(train_docs, size=batch_sizes): - if not batch: - continue - docs, golds = zip(*batch) - try: - nlp.update( - docs, - golds, - sgd=optimizer, - drop=next(dropout_rates), - losses=losses, - ) - except ValueError as e: - err = "Error during training" - if init_tok2vec: - err += " Did you provide the same parameters during 'train' as during 'pretrain'?" - msg.fail(err, "Original error message: {}".format(e), exits=1) - if raw_text: - # If raw text is available, perform 'rehearsal' updates, - # which use unlabelled data to reduce overfitting. 
- raw_batch = list(next(raw_batches)) - nlp.rehearse(raw_batch, sgd=optimizer, losses=losses) - if not int(os.environ.get("LOG_FRIENDLY", 0)): - pbar.update(sum(len(doc) for doc in docs)) - words_seen += sum(len(doc) for doc in docs) - with nlp.use_params(optimizer.averages): - util.set_env_log(False) - epoch_model_path = output_path / ("model%d" % i) - nlp.to_disk(epoch_model_path) - nlp_loaded = util.load_model_from_path(epoch_model_path) - for beam_width in eval_beam_widths: - for name, component in nlp_loaded.pipeline: - if hasattr(component, "cfg"): - component.cfg["beam_width"] = beam_width - dev_docs = list( - corpus.dev_docs( - nlp_loaded, - gold_preproc=gold_preproc, - ignore_misaligned=True, - ) - ) - nwords = sum(len(doc_gold[0]) for doc_gold in dev_docs) - start_time = timer() - scorer = nlp_loaded.evaluate(dev_docs, verbose=verbose) - end_time = timer() - if use_gpu < 0: - gpu_wps = None - cpu_wps = nwords / (end_time - start_time) - else: - gpu_wps = nwords / (end_time - start_time) - # Only evaluate on CPU in the first iteration (for - # timing) if GPU is enabled - if i == 0: - with Model.use_device("cpu"): - nlp_loaded = util.load_model_from_path(epoch_model_path) - for name, component in nlp_loaded.pipeline: - if hasattr(component, "cfg"): - component.cfg["beam_width"] = beam_width - dev_docs = list( - corpus.dev_docs( - nlp_loaded, - gold_preproc=gold_preproc, - ignore_misaligned=True, - ) - ) - start_time = timer() - scorer = nlp_loaded.evaluate(dev_docs, verbose=verbose) - end_time = timer() - cpu_wps = nwords / (end_time - start_time) - acc_loc = output_path / ("model%d" % i) / "accuracy.json" - srsly.write_json(acc_loc, scorer.scores) - - # Update model meta.json - meta["lang"] = nlp.lang - meta["pipeline"] = nlp.pipe_names - meta["spacy_version"] = ">=%s" % about.__version__ - if beam_width == 1: - meta["speed"] = { - "nwords": nwords, - "cpu": cpu_wps, - "gpu": gpu_wps, - } - meta.setdefault("accuracy", {}) - for component in nlp.pipe_names: - for metric in _get_metrics(component): - meta["accuracy"][metric] = scorer.scores[metric] - else: - meta.setdefault("beam_accuracy", {}) - meta.setdefault("beam_speed", {}) - for component in nlp.pipe_names: - for metric in _get_metrics(component): - meta["beam_accuracy"][metric] = scorer.scores[metric] - meta["beam_speed"][beam_width] = { - "nwords": nwords, - "cpu": cpu_wps, - "gpu": gpu_wps, - } - meta["vectors"] = { - "width": nlp.vocab.vectors_length, - "vectors": len(nlp.vocab.vectors), - "keys": nlp.vocab.vectors.n_keys, - "name": nlp.vocab.vectors.name, - } - meta.setdefault("name", "model%d" % i) - meta.setdefault("version", version) - meta["labels"] = nlp.meta["labels"] - meta_loc = output_path / ("model%d" % i) / "meta.json" - srsly.write_json(meta_loc, meta) - util.set_env_log(verbose) - - progress = _get_progress( - i, - losses, - scorer.scores, - output_stats, - beam_width=beam_width if has_beam_widths else None, - cpu_wps=cpu_wps, - gpu_wps=gpu_wps, - ) - if i == 0 and "textcat" in pipeline: - textcats_per_cat = scorer.scores.get("textcats_per_cat", {}) - for cat, cat_score in textcats_per_cat.items(): - if cat_score.get("roc_auc_score", 0) < 0: - msg.warn( - "Textcat ROC AUC score is undefined due to " - "only one value in label '{}'.".format(cat) - ) - msg.row(progress, **row_settings) - # Early stopping - if n_early_stopping is not None: - current_score = _score_for_model(meta) - if current_score < best_score: - iter_since_best += 1 - else: - iter_since_best = 0 - best_score = current_score - if 
iter_since_best >= n_early_stopping: - iter_current = i + 1 - msg.text( - "Early stopping, best iteration " - "is: {}".format(iter_current - iter_since_best) - ) - msg.text( - "Best score = {}; Final iteration " - "score = {}".format(best_score, current_score) - ) - break - except Exception as e: - msg.warn( - "Aborting and saving the final best model. " - "Encountered exception: {}".format(e), - exits=1, - ) - finally: - best_pipes = nlp.pipe_names - if disabled_pipes: - disabled_pipes.restore() - meta["pipeline"] = nlp.pipe_names - with nlp.use_params(optimizer.averages): - final_model_path = output_path / "model-final" - nlp.to_disk(final_model_path) - srsly.write_json(final_model_path / "meta.json", meta) - - meta_loc = output_path / "model-final" / "meta.json" - final_meta = srsly.read_json(meta_loc) - final_meta.setdefault("accuracy", {}) - final_meta["accuracy"].update(meta.get("accuracy", {})) - final_meta.setdefault("speed", {}) - final_meta["speed"].setdefault("cpu", None) - final_meta["speed"].setdefault("gpu", None) - meta.setdefault("speed", {}) - meta["speed"].setdefault("cpu", None) - meta["speed"].setdefault("gpu", None) - # combine cpu and gpu speeds with the base model speeds - if final_meta["speed"]["cpu"] and meta["speed"]["cpu"]: - speed = _get_total_speed( - [final_meta["speed"]["cpu"], meta["speed"]["cpu"]] - ) - final_meta["speed"]["cpu"] = speed - if final_meta["speed"]["gpu"] and meta["speed"]["gpu"]: - speed = _get_total_speed( - [final_meta["speed"]["gpu"], meta["speed"]["gpu"]] - ) - final_meta["speed"]["gpu"] = speed - # if there were no speeds to update, overwrite with meta - if ( - final_meta["speed"]["cpu"] is None - and final_meta["speed"]["gpu"] is None - ): - final_meta["speed"].update(meta["speed"]) - # note: beam speeds are not combined with the base model - if has_beam_widths: - final_meta.setdefault("beam_accuracy", {}) - final_meta["beam_accuracy"].update(meta.get("beam_accuracy", {})) - final_meta.setdefault("beam_speed", {}) - final_meta["beam_speed"].update(meta.get("beam_speed", {})) - srsly.write_json(meta_loc, final_meta) - msg.good("Saved model to output directory", final_model_path) - with msg.loading("Creating best model..."): - best_model_path = _collate_best_model(final_meta, output_path, best_pipes) - msg.good("Created best model", best_model_path) - - -def _score_for_model(meta): - """ Returns mean score between tasks in pipeline that can be used for early stopping. """ - mean_acc = list() - pipes = meta["pipeline"] - acc = meta["accuracy"] - if "tagger" in pipes: - mean_acc.append(acc["tags_acc"]) - if "parser" in pipes: - mean_acc.append((acc["uas"] + acc["las"]) / 2) - if "ner" in pipes: - mean_acc.append((acc["ents_p"] + acc["ents_r"] + acc["ents_f"]) / 3) - if "textcat" in pipes: - mean_acc.append(acc["textcat_score"]) - return sum(mean_acc) / len(mean_acc) - - -@contextlib.contextmanager -def _create_progress_bar(total): - if int(os.environ.get("LOG_FRIENDLY", 0)): - yield - else: - pbar = tqdm.tqdm(total=total, leave=False) - yield pbar - - -def _load_vectors(nlp, vectors): - util.load_model(vectors, vocab=nlp.vocab) - - -def _load_pretrained_tok2vec(nlp, loc, base_components): - """Load pretrained weights for the 'token-to-vector' part of the component - models, which is typically a CNN. See 'spacy pretrain'. Experimental. 
- """ - with loc.open("rb") as file_: - weights_data = file_.read() - loaded = [] - for name, component in nlp.pipeline: - if hasattr(component, "model") and hasattr(component.model, "tok2vec"): - if name in base_components: - raise ValueError(Errors.E200.format(component=name)) - component.tok2vec.from_bytes(weights_data) - loaded.append(name) - return loaded - - -def _collate_best_model(meta, output_path, components): - bests = {} - meta.setdefault("accuracy", {}) - for component in components: - bests[component] = _find_best(output_path, component) - best_dest = output_path / "model-best" - shutil.copytree(path2str(output_path / "model-final"), path2str(best_dest)) - for component, best_component_src in bests.items(): - shutil.rmtree(path2str(best_dest / component)) - shutil.copytree( - path2str(best_component_src / component), path2str(best_dest / component) - ) - accs = srsly.read_json(best_component_src / "accuracy.json") - for metric in _get_metrics(component): - meta["accuracy"][metric] = accs[metric] - srsly.write_json(best_dest / "meta.json", meta) - return best_dest - - -def _find_best(experiment_dir, component): - accuracies = [] - for epoch_model in experiment_dir.iterdir(): - if epoch_model.is_dir() and epoch_model.parts[-1] != "model-final": - accs = srsly.read_json(epoch_model / "accuracy.json") - scores = [accs.get(metric, 0.0) for metric in _get_metrics(component)] - # remove per_type dicts from score list for max() comparison - scores = [score for score in scores if isinstance(score, float)] - accuracies.append((scores, epoch_model)) - if accuracies: - return max(accuracies)[1] - else: - return None - - -def _get_metrics(component): - if component == "parser": - return ("las", "uas", "las_per_type", "token_acc") - elif component == "tagger": - return ("tags_acc", "token_acc") - elif component == "ner": - return ("ents_f", "ents_p", "ents_r", "ents_per_type", "token_acc") - elif component == "textcat": - return ("textcat_score", "token_acc") - return ("token_acc",) - - -def _configure_training_output(pipeline, use_gpu, has_beam_widths): - row_head = ["Itn"] - output_stats = [] - for pipe in pipeline: - if pipe == "tagger": - row_head.extend(["Tag Loss ", " Tag % "]) - output_stats.extend(["tag_loss", "tags_acc"]) - elif pipe == "parser": - row_head.extend(["Dep Loss ", " UAS ", " LAS "]) - output_stats.extend(["dep_loss", "uas", "las"]) - elif pipe == "ner": - row_head.extend(["NER Loss ", "NER P ", "NER R ", "NER F "]) - output_stats.extend(["ner_loss", "ents_p", "ents_r", "ents_f"]) - elif pipe == "textcat": - row_head.extend(["Textcat Loss", "Textcat"]) - output_stats.extend(["textcat_loss", "textcat_score"]) - row_head.extend(["Token %", "CPU WPS"]) - output_stats.extend(["token_acc", "cpu_wps"]) - - if use_gpu >= 0: - row_head.extend(["GPU WPS"]) - output_stats.extend(["gpu_wps"]) - - if has_beam_widths: - row_head.insert(1, "Beam W.") - return row_head, output_stats - - -def _get_progress( - itn, losses, dev_scores, output_stats, beam_width=None, cpu_wps=0.0, gpu_wps=0.0 ): - scores = {} - for stat in output_stats: - scores[stat] = 0.0 - scores["dep_loss"] = losses.get("parser", 0.0) - scores["ner_loss"] = losses.get("ner", 0.0) - scores["tag_loss"] = losses.get("tagger", 0.0) - scores["textcat_loss"] = losses.get("textcat", 0.0) - scores["cpu_wps"] = cpu_wps - scores["gpu_wps"] = gpu_wps or 0.0 - scores.update(dev_scores) - formatted_scores = [] - for stat in output_stats: - format_spec = "{:.3f}" - if stat.endswith("_wps"): - format_spec = "{:.0f}" - 
formatted_scores.append(format_spec.format(scores[stat])) - result = [itn + 1] - result.extend(formatted_scores) - if beam_width is not None: - result.insert(1, beam_width) - return result + """ + Train or update a spaCy pipeline. Requires data in spaCy's binary format. To + convert data from other formats, use the `spacy convert` command. The + config file includes all settings and hyperparameters used during traing. + To override settings in the config, e.g. settings that point to local + paths or that you want to experiment with, you can override them as + command line options. For instance, --training.batch_size 128 overrides + the value of "batch_size" in the block "[training]". The --code argument + lets you pass in a Python file that's imported before training. It can be + used to register custom functions and architectures that can then be + referenced in the config. - -def _get_total_speed(speeds): - seconds_per_word = 0.0 - for words_per_second in speeds: - if words_per_second is None: - return None - seconds_per_word += 1.0 / words_per_second - return 1.0 / seconds_per_word + DOCS: https://nightly.spacy.io/api/cli#train + """ + util.logger.setLevel(logging.DEBUG if verbose else logging.INFO) + # Make sure all files and paths exists if they are needed + if not config_path or not config_path.exists(): + msg.fail("Config file not found", config_path, exits=1) + if output_path is not None and not output_path.exists(): + output_path.mkdir() + msg.good(f"Created output directory: {output_path}") + overrides = parse_config_overrides(ctx.args) + import_code(code_path) + setup_gpu(use_gpu) + with show_validation_error(config_path): + config = util.load_config(config_path, overrides=overrides, interpolate=False) + msg.divider("Initializing pipeline") + with show_validation_error(config_path, hint_fill=False): + nlp = init_nlp(config, use_gpu=use_gpu) + msg.good("Initialized pipeline") + msg.divider("Training pipeline") + train(nlp, output_path, use_gpu=use_gpu, stdout=sys.stdout, stderr=sys.stderr) diff --git a/spacy/cli/validate.py b/spacy/cli/validate.py index 93abad6f6..9a75ed6f3 100644 --- a/spacy/cli/validate.py +++ b/spacy/cli/validate.py @@ -1,148 +1,110 @@ -# coding: utf8 -from __future__ import unicode_literals, print_function - +from typing import Tuple from pathlib import Path import sys import requests -import srsly -from wasabi import msg +from wasabi import msg, Printer -from ..compat import path2str -from ..util import get_data_path +from ._util import app from .. import about +from ..util import get_package_version, get_installed_models, get_base_version +from ..util import get_package_path, get_model_meta, is_compatible_version -def validate(): +@app.command("validate") +def validate_cli(): """ - Validate that the currently installed version of spaCy is compatible - with the installed models. Should be run after `pip install -U spacy`. + Validate the currently installed pipeline packages and spaCy version. Checks + if the installed packages are compatible and shows upgrade instructions if + available. Should be run after `pip install -U spacy`. 
+ + DOCS: https://nightly.spacy.io/api/cli#validate """ + validate() + + +def validate() -> None: + model_pkgs, compat = get_model_pkgs() + spacy_version = get_base_version(about.__version__) + current_compat = compat.get(spacy_version, {}) + if not current_compat: + msg.warn(f"No compatible packages found for v{spacy_version} of spaCy") + incompat_models = {d["name"] for _, d in model_pkgs.items() if not d["compat"]} + na_models = [m for m in incompat_models if m not in current_compat] + update_models = [m for m in incompat_models if m in current_compat] + spacy_dir = Path(__file__).parent.parent + + msg.divider(f"Installed pipeline packages (spaCy v{about.__version__})") + msg.info(f"spaCy installation: {spacy_dir}") + + if model_pkgs: + header = ("NAME", "SPACY", "VERSION", "") + rows = [] + for name, data in model_pkgs.items(): + if data["compat"]: + comp = msg.text("", color="green", icon="good", no_print=True) + version = msg.text(data["version"], color="green", no_print=True) + else: + version = msg.text(data["version"], color="red", no_print=True) + comp = f"--> {compat.get(data['name'], ['n/a'])[0]}" + rows.append((data["name"], data["spacy"], version, comp)) + msg.table(rows, header=header) + else: + msg.text("No pipeline packages found in your current environment.", exits=0) + if update_models: + msg.divider("Install updates") + msg.text("Use the following commands to update the packages:") + cmd = "python -m spacy download {}" + print("\n".join([cmd.format(pkg) for pkg in update_models]) + "\n") + if na_models: + msg.info( + f"The following packages are custom spaCy pipelines or not " + f"available for spaCy v{about.__version__}:", + ", ".join(na_models), + ) + if incompat_models: + sys.exit(1) + + +def get_model_pkgs(silent: bool = False) -> Tuple[dict, dict]: + msg = Printer(no_print=silent, pretty=not silent) with msg.loading("Loading compatibility table..."): r = requests.get(about.__compatibility__) if r.status_code != 200: msg.fail( - "Server error ({})".format(r.status_code), + f"Server error ({r.status_code})", "Couldn't fetch compatibility table.", exits=1, ) msg.good("Loaded compatibility table") compat = r.json()["spacy"] - version = about.__version__ - version = version.rsplit(".dev", 1)[0] - current_compat = compat.get(version) - if not current_compat: - msg.fail( - "Can't find spaCy v{} in compatibility table".format(version), - about.__compatibility__, - exits=1, - ) all_models = set() + installed_models = get_installed_models() for spacy_v, models in dict(compat).items(): all_models.update(models.keys()) for model, model_vs in models.items(): compat[spacy_v][model] = [reformat_version(v) for v in model_vs] - model_links = get_model_links(current_compat) - model_pkgs = get_model_pkgs(current_compat, all_models) - incompat_links = {l for l, d in model_links.items() if not d["compat"]} - incompat_models = {d["name"] for _, d in model_pkgs.items() if not d["compat"]} - incompat_models.update( - [d["name"] for _, d in model_links.items() if not d["compat"]] - ) - na_models = [m for m in incompat_models if m not in current_compat] - update_models = [m for m in incompat_models if m in current_compat] - spacy_dir = Path(__file__).parent.parent - - msg.divider("Installed models (spaCy v{})".format(about.__version__)) - msg.info("spaCy installation: {}".format(path2str(spacy_dir))) - - if model_links or model_pkgs: - header = ("TYPE", "NAME", "MODEL", "VERSION", "") - rows = [] - for name, data in model_pkgs.items(): - rows.append(get_model_row(current_compat, name, data, 
msg)) - for name, data in model_links.items(): - rows.append(get_model_row(current_compat, name, data, msg, "link")) - msg.table(rows, header=header) - else: - msg.text("No models found in your current environment.", exits=0) - if update_models: - msg.divider("Install updates") - msg.text("Use the following commands to update the model packages:") - cmd = "python -m spacy download {}" - print("\n".join([cmd.format(pkg) for pkg in update_models]) + "\n") - if na_models: - msg.text( - "The following models are not available for spaCy " - "v{}: {}".format(about.__version__, ", ".join(na_models)) - ) - if incompat_links: - msg.text( - "You may also want to overwrite the incompatible links using the " - "`python -m spacy link` command with `--force`, or remove them " - "from the data directory. " - "Data path: {path}".format(path=path2str(get_data_path())) - ) - if incompat_models or incompat_links: - sys.exit(1) - - -def get_model_links(compat): - links = {} - data_path = get_data_path() - if data_path: - models = [p for p in data_path.iterdir() if is_model_path(p)] - for model in models: - meta_path = Path(model) / "meta.json" - if not meta_path.exists(): - continue - meta = srsly.read_json(meta_path) - link = model.parts[-1] - name = meta["lang"] + "_" + meta["name"] - links[link] = { - "name": name, - "version": meta["version"], - "compat": is_compat(compat, name, meta["version"]), - } - return links - - -def get_model_pkgs(compat, all_models): - import pkg_resources - pkgs = {} - for pkg_name, pkg_data in pkg_resources.working_set.by_key.items(): + for pkg_name in installed_models: package = pkg_name.replace("-", "_") - if package in all_models: - version = pkg_data.version - pkgs[pkg_name] = { - "name": package, - "version": version, - "compat": is_compat(compat, package, version), - } - return pkgs + version = get_package_version(pkg_name) + if package in compat: + is_compat = version in compat[package] + spacy_version = about.__version__ + else: + model_path = get_package_path(package) + model_meta = get_model_meta(model_path) + spacy_version = model_meta.get("spacy_version", "n/a") + is_compat = is_compatible_version(about.__version__, spacy_version) + pkgs[pkg_name] = { + "name": package, + "version": version, + "spacy": spacy_version, + "compat": is_compat, + } + return pkgs, compat -def get_model_row(compat, name, data, msg, model_type="package"): - if data["compat"]: - comp = msg.text("", color="green", icon="good", no_print=True) - version = msg.text(data["version"], color="green", no_print=True) - else: - version = msg.text(data["version"], color="red", no_print=True) - comp = "--> {}".format(compat.get(data["name"], ["n/a"])[0]) - return (model_type, name, data["name"], version, comp) - - -def is_model_path(model_path): - exclude = ["cache", "pycache", "__pycache__"] - name = model_path.parts[-1] - return model_path.is_dir() and name not in exclude and not name.startswith(".") - - -def is_compat(compat, name, version): - return name in compat and version in compat[name] - - -def reformat_version(version): +def reformat_version(version: str) -> str: """Hack to reformat old versions ending on '-alpha' to match pip format.""" if version.endswith("-alpha"): return version.replace("-alpha", "a0") diff --git a/spacy/compat.py b/spacy/compat.py index 0ea31c6b3..6eca18b80 100644 --- a/spacy/compat.py +++ b/spacy/compat.py @@ -1,20 +1,6 @@ -# coding: utf8 -""" -Helpers for Python and platform compatibility. 
To distinguish them from -the builtin functions, replacement functions are suffixed with an underscore, -e.g. `unicode_`. - -DOCS: https://spacy.io/api/top-level#compat -""" -from __future__ import unicode_literals - -import os +"""Helpers for Python and platform compatibility.""" import sys -import itertools -import ast -import types - -from thinc.neural.util import copy_array +from thinc.util import copy_array try: import cPickle as pickle @@ -36,146 +22,19 @@ try: except ImportError: cupy = None -try: - from thinc.neural.optimizers import Optimizer # noqa: F401 +try: # Python 3.8+ + from typing import Literal except ImportError: - from thinc.neural.optimizers import Adam as Optimizer # noqa: F401 + from typing_extensions import Literal # noqa: F401 + +from thinc.api import Optimizer # noqa: F401 pickle = pickle copy_reg = copy_reg CudaStream = CudaStream cupy = cupy copy_array = copy_array -izip = getattr(itertools, "izip", zip) is_windows = sys.platform.startswith("win") is_linux = sys.platform.startswith("linux") is_osx = sys.platform == "darwin" - -# See: https://github.com/benjaminp/six/blob/master/six.py -is_python2 = sys.version_info[0] == 2 -is_python3 = sys.version_info[0] == 3 -is_python_pre_3_5 = is_python2 or (is_python3 and sys.version_info[1] < 5) - -if is_python2: - bytes_ = str - unicode_ = unicode # noqa: F821 - basestring_ = basestring # noqa: F821 - input_ = raw_input # noqa: F821 - path2str = lambda path: str(path).decode("utf8") - class_types = (type, types.ClassType) - -elif is_python3: - bytes_ = bytes - unicode_ = str - basestring_ = str - input_ = input - path2str = lambda path: str(path) - class_types = (type, types.ClassType) if is_python_pre_3_5 else type - - -def b_to_str(b_str): - """Convert a bytes object to a string. - - b_str (bytes): The object to convert. - RETURNS (unicode): The converted string. - """ - if is_python2: - return b_str - # Important: if no encoding is set, string becomes "b'...'" - return str(b_str, encoding="utf8") - - -def symlink_to(orig, dest): - """Create a symlink. Used for model shortcut links. - - orig (unicode / Path): The origin path. - dest (unicode / Path): The destination path of the symlink. - """ - if is_windows: - import subprocess - - subprocess.check_call( - ["mklink", "/d", path2str(orig), path2str(dest)], shell=True - ) - else: - orig.symlink_to(dest) - - -def symlink_remove(link): - """Remove a symlink. Used for model shortcut links. - - link (unicode / Path): The path to the symlink. - """ - # https://stackoverflow.com/q/26554135/6400719 - if os.path.isdir(path2str(link)) and is_windows: - # this should only be on Py2.7 and windows - os.rmdir(path2str(link)) - else: - os.unlink(path2str(link)) - - -def is_config(python2=None, python3=None, windows=None, linux=None, osx=None): - """Check if a specific configuration of Python version and operating system - matches the user's setup. Mostly used to display targeted error messages. - - python2 (bool): spaCy is executed with Python 2.x. - python3 (bool): spaCy is executed with Python 3.x. - windows (bool): spaCy is executed on Windows. - linux (bool): spaCy is executed on Linux. - osx (bool): spaCy is executed on OS X or macOS. - RETURNS (bool): Whether the configuration matches the user's platform. 
- - DOCS: https://spacy.io/api/top-level#compat.is_config - """ - return ( - python2 in (None, is_python2) - and python3 in (None, is_python3) - and windows in (None, is_windows) - and linux in (None, is_linux) - and osx in (None, is_osx) - ) - - -def import_file(name, loc): - """Import module from a file. Used to load models from a directory. - - name (unicode): Name of module to load. - loc (unicode / Path): Path to the file. - RETURNS: The loaded module. - """ - loc = path2str(loc) - if is_python_pre_3_5: - import imp - - return imp.load_source(name, loc) - else: - import importlib.util - - spec = importlib.util.spec_from_file_location(name, str(loc)) - module = importlib.util.module_from_spec(spec) - spec.loader.exec_module(module) - return module - - -def unescape_unicode(string): - """Python2.7's re module chokes when compiling patterns that have ranges - between escaped unicode codepoints if the two codepoints are unrecognised - in the unicode database. For instance: - - re.compile('[\\uAA77-\\uAA79]').findall("hello") - - Ends up matching every character (on Python 2). This problem doesn't occur - if we're dealing with unicode literals. - """ - if string is None: - return string - # We only want to unescape the unicode, so we first must protect the other - # backslashes. - string = string.replace("\\", "\\\\") - # Now we remove that protection for the unicode. - string = string.replace("\\\\u", "\\u") - string = string.replace("\\\\U", "\\U") - # Now we unescape by evaling the string with the AST. This can't execute - # code -- it only does the representational level. - return ast.literal_eval("u'''" + string + "'''") diff --git a/spacy/default_config.cfg b/spacy/default_config.cfg new file mode 100644 index 000000000..d7fc46ea0 --- /dev/null +++ b/spacy/default_config.cfg @@ -0,0 +1,124 @@ +[paths] +train = null +dev = null +vectors = null +init_tok2vec = null + +[system] +seed = 0 +gpu_allocator = null + +[nlp] +lang = null +# List of pipeline component names, in order. The names should correspond to +# components defined in the [components block] +pipeline = [] +# Components that are loaded but disabled by default +disabled = [] +# Optional callbacks to modify the nlp object before it's initialized, after +# it's created and after the pipeline has been set up +before_creation = null +after_creation = null +after_pipeline_creation = null + +[nlp.tokenizer] +@tokenizers = "spacy.Tokenizer.v1" + +# The pipeline components and their models +[components] + +# Readers for corpora like dev and train. +[corpora] + +[corpora.train] +@readers = "spacy.Corpus.v1" +path = ${paths.train} +# Whether to train on sequences with 'gold standard' sentence boundaries +# and tokens. If you set this to true, take care to ensure your run-time +# data is passed in sentence-by-sentence via some prior preprocessing. +gold_preproc = false +# Limitations on training document length +max_length = 0 +# Limitation on number of training examples +limit = 0 +# Apply some simply data augmentation, where we replace tokens with variations. +# This is especially useful for punctuation and case replacement, to help +# generalize beyond corpora that don't/only have smart quotes etc. +augmenter = null + +[corpora.dev] +@readers = "spacy.Corpus.v1" +path = ${paths.dev} +# Whether to train on sequences with 'gold standard' sentence boundaries +# and tokens. If you set this to true, take care to ensure your run-time +# data is passed in sentence-by-sentence via some prior preprocessing. 
+gold_preproc = false +# Limitations on training document length +max_length = 0 +# Limitation on number of training examples +limit = 0 +# Optional callback for data augmentation +augmenter = null + +# Training hyper-parameters and additional features. +[training] +seed = ${system.seed} +gpu_allocator = ${system.gpu_allocator} +dropout = 0.1 +accumulate_gradient = 1 +# Controls early-stopping. 0 or -1 mean unlimited. +patience = 1600 +max_epochs = 0 +max_steps = 20000 +eval_frequency = 200 +# Control how scores are printed and checkpoints are evaluated. +score_weights = {} +# Names of pipeline components that shouldn't be updated during training +frozen_components = [] +# Location in the config where the dev corpus is defined +dev_corpus = "corpora.dev" +# Location in the config where the train corpus is defined +train_corpus = "corpora.train" +# Optional callback before nlp object is saved to disk after training +before_to_disk = null + +[training.logger] +@loggers = "spacy.ConsoleLogger.v1" + +[training.batcher] +@batchers = "spacy.batch_by_words.v1" +discard_oversize = false +tolerance = 0.2 + +[training.batcher.size] +@schedules = "compounding.v1" +start = 100 +stop = 1000 +compound = 1.001 + +[training.optimizer] +@optimizers = "Adam.v1" +beta1 = 0.9 +beta2 = 0.999 +L2_is_weight_decay = true +L2 = 0.01 +grad_clip = 1.0 +use_averages = false +eps = 1e-8 +learn_rate = 0.001 + +# These settings are used when nlp.initialize() is called (typically before +# training or pretraining). Components and the tokenizer can each define their +# own arguments via their initialize methods that are populated by the config. +# This lets them gather data resources, build label sets etc. +[initialize] +vectors = ${paths.vectors} +# Extra resources for transfer-learning or pseudo-rehearsal +init_tok2vec = ${paths.init_tok2vec} +# Data and lookups for vocabulary +vocab_data = null +lookups = null +# Arguments passed to the tokenizer's initialize method +tokenizer = {} +# Arguments for initialize methods of the components (keyed by component) +components = {} diff --git a/spacy/default_config_pretraining.cfg b/spacy/default_config_pretraining.cfg new file mode 100644 index 000000000..66987171a --- /dev/null +++ b/spacy/default_config_pretraining.cfg @@ -0,0 +1,41 @@ +[paths] +raw_text = null + +[pretraining] +max_epochs = 1000 +dropout = 0.2 +n_save_every = null +component = "tok2vec" +layer = "" +corpus = "corpora.pretrain" + +[pretraining.batcher] +@batchers = "spacy.batch_by_words.v1" +size = 3000 +discard_oversize = false +tolerance = 0.2 +get_length = null + +[pretraining.objective] +type = "characters" +n_characters = 4 + +[pretraining.optimizer] +@optimizers = "Adam.v1" +beta1 = 0.9 +beta2 = 0.999 +L2_is_weight_decay = true +L2 = 0.01 +grad_clip = 1.0 +use_averages = true +eps = 1e-8 +learn_rate = 0.001 + +[corpora] + +[corpora.pretrain] +@readers = "spacy.JsonlCorpus.v1" +path = ${paths.raw_text} +min_length = 5 +max_length = 500 +limit = 0 diff --git a/spacy/displacy/__init__.py b/spacy/displacy/__init__.py index a0cccbbde..48229572b 100644 --- a/spacy/displacy/__init__.py +++ b/spacy/displacy/__init__.py @@ -1,17 +1,14 @@ -# coding: utf8 """ spaCy's built in visualization suite for dependencies and named entities. 
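The default training and pretraining configs above rely on variable interpolation (`${paths.train}`, `${system.seed}`) and dotted section names. A small sketch of loading such a config programmatically, assuming thinc's `Config` class (which spaCy's config system builds on); the paths and values are placeholders.

```python
from thinc.api import Config

CONFIG_STR = """
[paths]
train = "corpus/train.spacy"

[corpora]

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
gold_preproc = false
"""

config = Config().from_str(CONFIG_STR, interpolate=False)  # keep ${...} references
resolved = config.interpolate()                            # substitute ${paths.train}
print(resolved["corpora"]["train"]["path"])                # corpus/train.spacy
```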
-DOCS: https://spacy.io/api/top-level#displacy -USAGE: https://spacy.io/usage/visualizers +DOCS: https://nightly.spacy.io/api/top-level#displacy +USAGE: https://nightly.spacy.io/usage/visualizers """ -from __future__ import unicode_literals - +from typing import Union, Iterable, Optional, Dict, Any, Callable import warnings from .render import DependencyRenderer, EntityRenderer from ..tokens import Doc, Span -from ..compat import b_to_str from ..errors import Errors, Warnings from ..util import is_in_jupyter @@ -21,21 +18,27 @@ RENDER_WRAPPER = None def render( - docs, style="dep", page=False, minify=False, jupyter=None, options={}, manual=False -): + docs: Union[Iterable[Union[Doc, Span]], Doc, Span], + style: str = "dep", + page: bool = False, + minify: bool = False, + jupyter: Optional[bool] = None, + options: Dict[str, Any] = {}, + manual: bool = False, +) -> str: """Render displaCy visualisation. - docs (list or Doc): Document(s) to visualise. - style (unicode): Visualisation style, 'dep' or 'ent'. + docs (Union[Iterable[Doc], Doc]): Document(s) to visualise. + style (str): Visualisation style, 'dep' or 'ent'. page (bool): Render markup as full HTML page. minify (bool): Minify HTML markup. jupyter (bool): Override Jupyter auto-detection. options (dict): Visualiser-specific options, e.g. colors. manual (bool): Don't parse `Doc` and instead expect a dict/list of dicts. - RETURNS (unicode): Rendered HTML markup. + RETURNS (str): Rendered HTML markup. - DOCS: https://spacy.io/api/top-level#displacy.render - USAGE: https://spacy.io/usage/visualizers + DOCS: https://nightly.spacy.io/api/top-level#displacy.render + USAGE: https://nightly.spacy.io/usage/visualizers """ factories = { "dep": (DependencyRenderer, parse_deps), @@ -48,8 +51,8 @@ def render( docs = [obj if not isinstance(obj, Span) else obj.as_doc() for obj in docs] if not all(isinstance(obj, (Doc, Span, dict)) for obj in docs): raise ValueError(Errors.E096) - renderer, converter = factories[style] - renderer = renderer(options=options) + renderer_func, converter = factories[style] + renderer = renderer_func(options=options) parsed = [converter(doc, options) for doc in docs] if not manual else docs _html["parsed"] = renderer.render(parsed, page=page, minify=minify).strip() html = _html["parsed"] @@ -65,62 +68,60 @@ def render( def serve( - docs, - style="dep", - page=True, - minify=False, - options={}, - manual=False, - port=5000, - host="0.0.0.0", -): + docs: Union[Iterable[Doc], Doc], + style: str = "dep", + page: bool = True, + minify: bool = False, + options: Dict[str, Any] = {}, + manual: bool = False, + port: int = 5000, + host: str = "0.0.0.0", +) -> None: """Serve displaCy visualisation. docs (list or Doc): Document(s) to visualise. - style (unicode): Visualisation style, 'dep' or 'ent'. + style (str): Visualisation style, 'dep' or 'ent'. page (bool): Render markup as full HTML page. minify (bool): Minify HTML markup. options (dict): Visualiser-specific options, e.g. colors. manual (bool): Don't parse `Doc` and instead expect a dict/list of dicts. port (int): Port to serve visualisation. - host (unicode): Host to serve visualisation. + host (str): Host to serve visualisation. 
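A minimal usage sketch for the newly annotated `render()` signature; it assumes a trained pipeline such as `en_core_web_sm` is installed, but any loaded pipeline works.

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers.")

svg = displacy.render(doc, style="dep", options={"compact": True})   # SVG string
html = displacy.render(doc, style="ent", page=True, minify=True)     # full HTML page
```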
- DOCS: https://spacy.io/api/top-level#displacy.serve - USAGE: https://spacy.io/usage/visualizers + DOCS: https://nightly.spacy.io/api/top-level#displacy.serve + USAGE: https://nightly.spacy.io/usage/visualizers """ from wsgiref import simple_server if is_in_jupyter(): warnings.warn(Warnings.W011) - render(docs, style=style, page=page, minify=minify, options=options, manual=manual) httpd = simple_server.make_server(host, port, app) - print("\nUsing the '{}' visualizer".format(style)) - print("Serving on http://{}:{} ...\n".format(host, port)) + print(f"\nUsing the '{style}' visualizer") + print(f"Serving on http://{host}:{port} ...\n") try: httpd.serve_forever() except KeyboardInterrupt: - print("Shutting down server on port {}.".format(port)) + print(f"Shutting down server on port {port}.") finally: httpd.server_close() def app(environ, start_response): - # Headers and status need to be bytes in Python 2, see #1227 - headers = [(b_to_str(b"Content-type"), b_to_str(b"text/html; charset=utf-8"))] - start_response(b_to_str(b"200 OK"), headers) + headers = [("Content-type", "text/html; charset=utf-8")] + start_response("200 OK", headers) res = _html["parsed"].encode(encoding="utf-8") return [res] -def parse_deps(orig_doc, options={}): +def parse_deps(orig_doc: Doc, options: Dict[str, Any] = {}) -> Dict[str, Any]: """Generate dependency parse in {'words': [], 'arcs': []} format. doc (Doc): Document do parse. RETURNS (dict): Generated dependency parse keyed by words and arcs. """ doc = Doc(orig_doc.vocab).from_bytes(orig_doc.to_bytes(exclude=["user_data"])) - if not doc.is_parsed: + if not doc.has_annotation("DEP"): warnings.warn(Warnings.W005) if options.get("collapse_phrases", False): with doc.retokenize() as retokenizer: @@ -156,7 +157,6 @@ def parse_deps(orig_doc, options={}): } for w in doc ] - arcs = [] for word in doc: if word.i < word.head.i: @@ -175,7 +175,7 @@ def parse_deps(orig_doc, options={}): return {"words": words, "arcs": arcs, "settings": get_doc_settings(orig_doc)} -def parse_ents(doc, options={}): +def parse_ents(doc: Doc, options: Dict[str, Any] = {}) -> Dict[str, Any]: """Generate named entities in [{start: i, end: i, label: 'label'}] format. doc (Doc): Document do parse. @@ -192,7 +192,7 @@ def parse_ents(doc, options={}): return {"text": doc.text, "ents": ents, "title": title, "settings": settings} -def set_render_wrapper(func): +def set_render_wrapper(func: Callable[[str], str]) -> None: """Set an optional wrapper function that is called around the generated HTML markup on displacy.render. 
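The typed `parse_ents` and `set_render_wrapper` hooks above correspond to the manual-input path: with `manual=True`, `render` skips parsing and expects pre-built dicts, and the registered wrapper is called around the generated markup. A short sketch; the wrapper's div class is arbitrary.

```python
from spacy import displacy

ex = {
    "text": "But Google is starting from behind.",
    "ents": [{"start": 4, "end": 10, "label": "ORG"}],
    "title": None,
}

def wrap(html: str) -> str:
    # Called around the rendered markup, e.g. to embed it in another UI.
    return '<div class="viz">' + html + "</div>"

displacy.set_render_wrapper(wrap)
html = displacy.render(ex, style="ent", manual=True)
```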
This can be used to allow integration into other platforms, similar to Jupyter Notebooks that require functions to be @@ -209,7 +209,7 @@ def set_render_wrapper(func): RENDER_WRAPPER = func -def get_doc_settings(doc): +def get_doc_settings(doc: Doc) -> Dict[str, Any]: return { "lang": doc.lang_, "direction": doc.vocab.writing_system.get("direction", "ltr"), diff --git a/spacy/displacy/render.py b/spacy/displacy/render.py index 431e02841..ba56beca3 100644 --- a/spacy/displacy/render.py +++ b/spacy/displacy/render.py @@ -1,30 +1,44 @@ -# coding: utf8 -from __future__ import unicode_literals - +from typing import Dict, Any, List, Optional, Union import uuid -from .templates import ( - TPL_DEP_SVG, - TPL_DEP_WORDS, - TPL_DEP_WORDS_LEMMA, - TPL_DEP_ARCS, - TPL_ENTS, -) +from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_WORDS_LEMMA, TPL_DEP_ARCS from .templates import TPL_ENT, TPL_ENT_RTL, TPL_FIGURE, TPL_TITLE, TPL_PAGE +from .templates import TPL_ENTS from ..util import minify_html, escape_html, registry from ..errors import Errors DEFAULT_LANG = "en" DEFAULT_DIR = "ltr" +DEFAULT_ENTITY_COLOR = "#ddd" +DEFAULT_LABEL_COLORS = { + "ORG": "#7aecec", + "PRODUCT": "#bfeeb7", + "GPE": "#feca74", + "LOC": "#ff9561", + "PERSON": "#aa9cfc", + "NORP": "#c887fb", + "FACILITY": "#9cc9cc", + "EVENT": "#ffeb80", + "LAW": "#ff8197", + "LANGUAGE": "#ff8197", + "WORK_OF_ART": "#f0d0ff", + "DATE": "#bfe1d9", + "TIME": "#bfe1d9", + "MONEY": "#e4e7d2", + "QUANTITY": "#e4e7d2", + "ORDINAL": "#e4e7d2", + "CARDINAL": "#e4e7d2", + "PERCENT": "#e4e7d2", +} -class DependencyRenderer(object): +class DependencyRenderer: """Render dependency parses as SVGs.""" style = "dep" - def __init__(self, options={}): + def __init__(self, options: Dict[str, Any] = {}) -> None: """Initialise dependency renderer. options (dict): Visualiser-specific options (compact, word_spacing, @@ -44,13 +58,15 @@ class DependencyRenderer(object): self.direction = DEFAULT_DIR self.lang = DEFAULT_LANG - def render(self, parsed, page=False, minify=False): + def render( + self, parsed: List[Dict[str, Any]], page: bool = False, minify: bool = False + ) -> str: """Render complete markup. parsed (list): Dependency parses to render. page (bool): Render parses wrapped as full HTML page. minify (bool): Minify HTML markup. - RETURNS (unicode): Rendered SVG or HTML markup. + RETURNS (str): Rendered SVG or HTML markup. """ # Create a random ID prefix to make sure parses don't receive the # same ID, even if they're identical @@ -61,7 +77,7 @@ class DependencyRenderer(object): settings = p.get("settings", {}) self.direction = settings.get("direction", DEFAULT_DIR) self.lang = settings.get("lang", DEFAULT_LANG) - render_id = "{}-{}".format(id_prefix, i) + render_id = f"{id_prefix}-{i}" svg = self.render_svg(render_id, p["words"], p["arcs"]) rendered.append(svg) if page: @@ -75,13 +91,18 @@ class DependencyRenderer(object): return minify_html(markup) return markup - def render_svg(self, render_id, words, arcs): + def render_svg( + self, + render_id: Union[int, str], + words: List[Dict[str, Any]], + arcs: List[Dict[str, Any]], + ) -> str: """Render SVG. - render_id (int): Unique ID, typically index of document. + render_id (Union[int, str]): Unique ID, typically index of document. words (list): Individual words and their tags. arcs (list): Individual arcs and their start, end, direction and label. - RETURNS (unicode): Rendered SVG markup. + RETURNS (str): Rendered SVG markup. 
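The entity label colors below are now module-level defaults (`DEFAULT_LABEL_COLORS`). They can still be overridden per call via the `colors` and `ents` options, which `EntityRenderer` merges on top of the defaults; a quick sketch, assuming an installed pipeline with NER such as `en_core_web_sm`.

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

options = {
    "ents": ["ORG", "GPE", "MONEY"],         # restrict which labels are rendered
    "colors": {"ORG": "linear-gradient(90deg, #aa9cfc, #fc9ce7)"},
}
html = displacy.render(doc, style="ent", options=options)
```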
""" self.levels = self.get_levels(arcs) self.highest_level = len(self.levels) @@ -89,15 +110,15 @@ class DependencyRenderer(object): self.width = self.offset_x + len(words) * self.distance self.height = self.offset_y + 3 * self.word_spacing self.id = render_id - words = [ + words_svg = [ self.render_word(w["text"], w["tag"], w.get("lemma", None), i) for i, w in enumerate(words) ] - arcs = [ + arcs_svg = [ self.render_arrow(a["label"], a["start"], a["end"], a["dir"], i) for i, a in enumerate(arcs) ] - content = "".join(words) + "".join(arcs) + content = "".join(words_svg) + "".join(arcs_svg) return TPL_DEP_SVG.format( id=self.id, width=self.width, @@ -110,15 +131,13 @@ class DependencyRenderer(object): lang=self.lang, ) - def render_word( - self, text, tag, lemma, i, - ): + def render_word(self, text: str, tag: str, lemma: str, i: int) -> str: """Render individual word. - text (unicode): Word text. - tag (unicode): Part-of-speech tag. + text (str): Word text. + tag (str): Part-of-speech tag. i (int): Unique ID, typically word index. - RETURNS (unicode): Rendered SVG markup. + RETURNS (str): Rendered SVG markup. """ y = self.offset_y + self.word_spacing x = self.offset_x + i * self.distance @@ -131,15 +150,17 @@ class DependencyRenderer(object): ) return TPL_DEP_WORDS.format(text=html_text, tag=tag, x=x, y=y) - def render_arrow(self, label, start, end, direction, i): + def render_arrow( + self, label: str, start: int, end: int, direction: str, i: int + ) -> str: """Render individual arrow. - label (unicode): Dependency label. + label (str): Dependency label. start (int): Index of start word. end (int): Index of end word. - direction (unicode): Arrow direction, 'left' or 'right'. + direction (str): Arrow direction, 'left' or 'right'. i (int): Unique ID, typically arrow index. - RETURNS (unicode): Rendered SVG markup. + RETURNS (str): Rendered SVG markup. """ if start < 0 or end < 0: error_args = dict(start=start, end=end, label=label, dir=direction) @@ -175,48 +196,36 @@ class DependencyRenderer(object): arc=arc, ) - def get_arc(self, x_start, y, y_curve, x_end): + def get_arc(self, x_start: int, y: int, y_curve: int, x_end: int) -> str: """Render individual arc. x_start (int): X-coordinate of arrow start point. y (int): Y-coordinate of arrow start and end point. y_curve (int): Y-corrdinate of Cubic Bézier y_curve point. x_end (int): X-coordinate of arrow end point. - RETURNS (unicode): Definition of the arc path ('d' attribute). + RETURNS (str): Definition of the arc path ('d' attribute). """ template = "M{x},{y} C{x},{c} {e},{c} {e},{y}" if self.compact: template = "M{x},{y} {x},{c} {e},{c} {e},{y}" return template.format(x=x_start, y=y, c=y_curve, e=x_end) - def get_arrowhead(self, direction, x, y, end): + def get_arrowhead(self, direction: str, x: int, y: int, end: int) -> str: """Render individual arrow head. - direction (unicode): Arrow direction, 'left' or 'right'. + direction (str): Arrow direction, 'left' or 'right'. x (int): X-coordinate of arrow start point. y (int): Y-coordinate of arrow start and end point. end (int): X-coordinate of arrow end point. - RETURNS (unicode): Definition of the arrow head path ('d' attribute). + RETURNS (str): Definition of the arrow head path ('d' attribute). 
""" if direction == "left": - pos1, pos2, pos3 = (x, x - self.arrow_width + 2, x + self.arrow_width - 2) + p1, p2, p3 = (x, x - self.arrow_width + 2, x + self.arrow_width - 2) else: - pos1, pos2, pos3 = ( - end, - end + self.arrow_width - 2, - end - self.arrow_width + 2, - ) - arrowhead = ( - pos1, - y + 2, - pos2, - y - self.arrow_width, - pos3, - y - self.arrow_width, - ) - return "M{},{} L{},{} {},{}".format(*arrowhead) + p1, p2, p3 = (end, end + self.arrow_width - 2, end - self.arrow_width + 2) + return f"M{p1},{y + 2} L{p2},{y - self.arrow_width} {p3},{y - self.arrow_width}" - def get_levels(self, arcs): + def get_levels(self, arcs: List[Dict[str, Any]]) -> List[int]: """Calculate available arc height "levels". Used to calculate arrow heights dynamically and without wasting space. @@ -227,46 +236,34 @@ class DependencyRenderer(object): return sorted(list(levels)) -class EntityRenderer(object): +class EntityRenderer: """Render named entities as HTML.""" style = "ent" - def __init__(self, options={}): + def __init__(self, options: Dict[str, Any] = {}) -> None: """Initialise dependency renderer. options (dict): Visualiser-specific options (colors, ents) """ - colors = { - "ORG": "#7aecec", - "PRODUCT": "#bfeeb7", - "GPE": "#feca74", - "LOC": "#ff9561", - "PERSON": "#aa9cfc", - "NORP": "#c887fb", - "FACILITY": "#9cc9cc", - "EVENT": "#ffeb80", - "LAW": "#ff8197", - "LANGUAGE": "#ff8197", - "WORK_OF_ART": "#f0d0ff", - "DATE": "#bfe1d9", - "TIME": "#bfe1d9", - "MONEY": "#e4e7d2", - "QUANTITY": "#e4e7d2", - "ORDINAL": "#e4e7d2", - "CARDINAL": "#e4e7d2", - "PERCENT": "#e4e7d2", - } + colors = dict(DEFAULT_LABEL_COLORS) user_colors = registry.displacy_colors.get_all() for user_color in user_colors.values(): + if callable(user_color): + # Since this comes from the function registry, we want to make + # sure we support functions that *return* a dict of colors + user_color = user_color() + if not isinstance(user_color, dict): + raise ValueError(Errors.E925.format(obj=type(user_color))) colors.update(user_color) colors.update(options.get("colors", {})) - self.default_color = "#ddd" - self.colors = colors + self.default_color = DEFAULT_ENTITY_COLOR + self.colors = {label.upper(): color for label, color in colors.items()} self.ents = options.get("ents", None) + if self.ents is not None: + self.ents = [ent.upper() for ent in self.ents] self.direction = DEFAULT_DIR self.lang = DEFAULT_LANG - template = options.get("template") if template: self.ent_template = template @@ -276,13 +273,15 @@ class EntityRenderer(object): else: self.ent_template = TPL_ENT - def render(self, parsed, page=False, minify=False): + def render( + self, parsed: List[Dict[str, Any]], page: bool = False, minify: bool = False + ) -> str: """Render complete markup. parsed (list): Dependency parses to render. page (bool): Render parses wrapped as full HTML page. minify (bool): Minify HTML markup. - RETURNS (unicode): Rendered HTML markup. + RETURNS (str): Rendered HTML markup. """ rendered = [] for i, p in enumerate(parsed): @@ -300,12 +299,14 @@ class EntityRenderer(object): return minify_html(markup) return markup - def render_ents(self, text, spans, title): + def render_ents( + self, text: str, spans: List[Dict[str, Any]], title: Optional[str] + ) -> str: """Render entities in text. - text (unicode): Original text. + text (str): Original text. spans (list): Individual entity spans and their start, end and label. - title (unicode or None): Document title set in Doc.user_data['title']. 
+ title (str / None): Document title set in Doc.user_data['title']. """ markup = "" offset = 0 diff --git a/spacy/displacy/templates.py b/spacy/displacy/templates.py index f29eab86f..b9cbf717b 100644 --- a/spacy/displacy/templates.py +++ b/spacy/displacy/templates.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Setting explicit height and max-width: none on the SVG is required for # Jupyter to render it properly in a cell @@ -55,14 +51,14 @@ TPL_ENTS = """ TPL_ENT = """ {text} - {label} + {label} """ TPL_ENT_RTL = """ {text} - {label} + {label} """ diff --git a/spacy/errors.py b/spacy/errors.py index 7f9164694..5fab0bab1 100644 --- a/spacy/errors.py +++ b/spacy/errors.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - def add_codes(err_cls): """Add error codes to string messages via class attribute names.""" @@ -19,24 +15,12 @@ def add_codes(err_cls): # fmt: off @add_codes -class Warnings(object): - W001 = ("As of spaCy v2.0, the keyword argument `path=` is deprecated. " - "You can now call spacy.load with the path as its first argument, " - "and the model's meta.json will be used to determine the language " - "to load. For example:\nnlp = spacy.load('{path}')") - W002 = ("Tokenizer.from_list is now deprecated. Create a new Doc object " - "instead and pass in the strings as the `words` keyword argument, " - "for example:\nfrom spacy.tokens import Doc\n" - "doc = Doc(nlp.vocab, words=[...])") - W003 = ("Positional arguments to Doc.merge are deprecated. Instead, use " - "the keyword arguments, for example tag=, lemma= or ent_type=.") - W004 = ("No text fixing enabled. Run `pip install ftfy` to enable fixing " - "using ftfy.fix_text if necessary.") +class Warnings: W005 = ("Doc object not parsed. This means displaCy won't be able to " "generate a dependency visualization for it. Make sure the Doc " "was processed with a model that supports dependency parsing, and " "not just a language class like `English()`. For more info, see " - "the docs:\nhttps://spacy.io/usage/models") + "the docs:\nhttps://nightly.spacy.io/usage/models") W006 = ("No entities to visualize found in Doc object. If this is " "surprising to you, make sure the Doc was processed using a model " "that supports named entity recognition, and check the `doc.ents` " @@ -49,12 +33,6 @@ class Warnings(object): "use context-sensitive tensors. You can always add your own word " "vectors, or use one of the larger models instead if available.") W008 = ("Evaluating {obj}.similarity based on empty vectors.") - W009 = ("Custom factory '{name}' provided by entry points of another " - "package overwrites built-in factory.") - W010 = ("As of v2.1.0, the PhraseMatcher doesn't have a phrase length " - "limit anymore, so the max_length argument is now deprecated. " - "If you did not specify this parameter, make sure you call the " - "constructor with named arguments instead of positional ones.") W011 = ("It looks like you're calling displacy.serve from within a " "Jupyter notebook or a similar environment. This likely means " "you're already running a local web server, so there's no need to " @@ -68,110 +46,114 @@ class Warnings(object): "components are applied. To only create tokenized Doc objects, " "try using `nlp.make_doc(text)` or process all texts as a stream " "using `list(nlp.tokenizer.pipe(all_texts))`.") - W013 = ("As of v2.1.0, {obj}.merge is deprecated. 
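For context on the `errors.py` hunks that follow: `add_codes` prefixes each message with its attribute name, so `Errors.E001` renders as `"[E001] ..."`. Below is a simplified sketch of that behaviour under that assumption, not the exact implementation.

```python
def add_codes(err_cls):
    """Return an object whose attribute lookups prefix messages with their code."""

    class ErrorsWithCodes:
        def __getattribute__(self, code):
            msg = getattr(err_cls, code)
            if code.startswith(("E", "W")) and code[1:].isdigit():
                return f"[{code}] {msg}"
            return msg

    return ErrorsWithCodes()


class _Demo:
    E001 = "No component '{name}' found in pipeline. Available names: {opts}"


Demo = add_codes(_Demo)
print(Demo.E001.format(name="ner", opts="tagger, parser"))
# [E001] No component 'ner' found in pipeline. Available names: tagger, parser
```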
Please use the more " - "efficient and less error-prone Doc.retokenize context manager " - "instead.") - W014 = ("As of v2.1.0, the `disable` keyword argument on the serialization " - "methods is and should be replaced with `exclude`. This makes it " - "consistent with the other serializable objects.") - W015 = ("As of v2.1.0, the use of keyword arguments to exclude fields from " - "being serialized or deserialized is deprecated. Please use the " - "`exclude` argument instead. For example: exclude=['{arg}'].") - W016 = ("The keyword argument `n_threads` is now deprecated. As of v2.2.2, " - "the argument `n_process` controls parallel inference via " - "multiprocessing.") W017 = ("Alias '{alias}' already exists in the Knowledge Base.") W018 = ("Entity '{entity}' already exists in the Knowledge Base - " "ignoring the duplicate entry.") - W019 = ("Changing vectors name from {old} to {new}, to avoid clash with " - "previously loaded vectors. See Issue #3853.") - W020 = ("Unnamed vectors. This won't allow multiple vectors models to be " - "loaded. (Shape: {shape})") W021 = ("Unexpected hash collision in PhraseMatcher. Matches may be " "incorrect. Modify PhraseMatcher._terminal_hash to fix.") - W022 = ("Training a new part-of-speech tagger using a model with no " - "lemmatization rules or data. This means that the trained model " - "may not be able to lemmatize correctly. If this is intentional " - "or the language you're using doesn't have lemmatization data, " - "please ignore this warning. If this is surprising, make sure you " - "have the spacy-lookups-data package installed.") - W023 = ("Multiprocessing of Language.pipe is not supported in Python 2. " - "'n_process' will be set to 1.") W024 = ("Entity '{entity}' - Alias '{alias}' combination already exists in " "the Knowledge Base.") - W025 = ("'{name}' requires '{attr}' to be assigned, but none of the " - "previous components in the pipeline declare that they assign it.") - W026 = ("Unable to set all sentence boundaries from dependency parses.") + W026 = ("Unable to set all sentence boundaries from dependency parses. If " + "you are constructing a parse tree incrementally by setting " + "token.head values, you can probably ignore this warning. Consider " + "using Doc(words, ..., heads=heads, deps=deps) instead.") W027 = ("Found a large training file of {size} bytes. Note that it may " "be more efficient to split your training data into multiple " "smaller JSON files instead.") W028 = ("Doc.from_array was called with a vector of type '{type}', " - "but is expecting one of type 'uint64' instead. This may result " + "but is expecting one of type uint64 instead. This may result " "in problems with the vocab further on in the pipeline.") - W029 = ("Unable to align tokens with entities from character offsets. " - "Discarding entity annotation for the text: {text}.") W030 = ("Some entities could not be aligned in the text \"{text}\" with " "entities \"{entities}\". Use " - "`spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)`" - " to check the alignment. Misaligned entities (with BILUO tag '-') " - "will be ignored during training.") - W031 = ("Model '{model}' ({model_version}) requires spaCy {version} and " - "is incompatible with the current spaCy version ({current}). This " - "may lead to unexpected results or runtime errors. To resolve " - "this, download a newer compatible model or retrain your custom " - "model with the current spaCy version. 
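The updated W030 message points at the renamed alignment helper. A quick way to check entity alignment as the message suggests, assuming `spacy.training.offsets_to_biluo_tags` from v3:

```python
import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")
text = "I like London."
entities = [(7, 13, "LOC")]                       # character offsets into the text
tags = offsets_to_biluo_tags(nlp.make_doc(text), entities)
print(tags)   # ['O', 'O', 'U-LOC', 'O']; a '-' tag would indicate misalignment
```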
For more details and " - "available updates, run: python -m spacy validate") - W032 = ("Unable to determine model compatibility for model '{model}' " - "({model_version}) with the current spaCy version ({current}). " - "This may lead to unexpected results or runtime errors. To resolve " - "this, download a newer compatible model or retrain your custom " - "model with the current spaCy version. For more details and " - "available updates, run: python -m spacy validate") - W033 = ("Training a new {model} using a model with an empty lexeme " - "normalization table. This may degrade the performance to some " - "degree. If this is intentional or this language doesn't have a " - "normalization table, please ignore this warning.") - W034 = ("Please install the package spacy-lookups-data in order to include " - "the default lexeme normalization table for the language '{lang}'.") + "`spacy.training.offsets_to_biluo_tags(nlp.make_doc(text), entities)`" + " to check the alignment. Misaligned entities ('-') will be " + "ignored during training.") + W033 = ("Training a new {model} using a model with no lexeme normalization " + "table. This may degrade the performance of the model to some " + "degree. If this is intentional or the language you're using " + "doesn't have a normalization table, please ignore this warning. " + "If this is surprising, make sure you have the spacy-lookups-data " + "package installed. The languages with lexeme normalization tables " + "are currently: {langs}") W035 = ('Discarding subpattern "{pattern}" due to an unrecognized ' "attribute or operator.") + # TODO: fix numbering after merging develop into master + W088 = ("The pipeline component {name} implements a `begin_training` " + "method, which won't be called by spaCy. As of v3.0, `begin_training` " + "has been renamed to `initialize`, so you likely want to rename the " + "component method. See the documentation for details: " + "https://nightly.spacy.io/api/language#initialize") + W089 = ("As of spaCy v3.0, the `nlp.begin_training` method has been renamed " + "to `nlp.initialize`.") + W090 = ("Could not locate any {format} files in path '{path}'.") + W091 = ("Could not clean/remove the temp directory at {dir}: {msg}.") + W092 = ("Ignoring annotations for sentence starts, as dependency heads are set.") + W093 = ("Could not find any data to train the {name} on. Is your " + "input data correctly formatted?") + W094 = ("Model '{model}' ({model_version}) specifies an under-constrained " + "spaCy version requirement: {version}. This can lead to compatibility " + "problems with older versions, or as new spaCy versions are " + "released, because the model may say it's compatible when it's " + 'not. Consider changing the "spacy_version" in your meta.json to a ' + "version range, with a lower and upper pin. For example: {example}") + W095 = ("Model '{model}' ({model_version}) requires spaCy {version} and is " + "incompatible with the current version ({current}). This may lead " + "to unexpected results or runtime errors. To resolve this, " + "download a newer compatible model or retrain your custom model " + "with the current spaCy version. For more details and available " + "updates, run: python -m spacy validate") + W096 = ("The method `nlp.disable_pipes` is now deprecated - use " + "`nlp.select_pipes` instead.") + W100 = ("Skipping unsupported morphological feature(s): '{feature}'. 
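W088/W089 describe the `begin_training` to `initialize` rename. A hypothetical sketch of what that rename looks like on a custom component; the keyword arguments are an assumption based on the v3 component API and are not taken from this diff.

```python
class MyCustomPipe:
    # spaCy v2 style (now triggers W088 if left in place):
    # def begin_training(self, get_examples, pipeline=None, sgd=None, **kwargs): ...

    # spaCy v3 style:
    def initialize(self, get_examples, *, nlp=None):
        # Gather labels and data resources from the Example stream here.
        for example in get_examples():
            pass
```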
" + "Provide features as a dict {{\"Field1\": \"Value1,Value2\"}} or " + "string \"Field1=Value1,Value2|Field2=Value3\".") + W101 = ("Skipping Doc custom extension '{name}' while merging docs.") + W102 = ("Skipping unsupported user data '{key}: {value}' while merging docs.") + W103 = ("Unknown {lang} word segmenter '{segmenter}'. Supported " + "word segmenters: {supported}. Defaulting to {default}.") + W104 = ("Skipping modifications for '{target}' segmenter. The current " + "segmenter is '{current}'.") + W105 = ("As of spaCy v3.0, the `{matcher}.pipe` method is deprecated. If you " + "need to match on a stream of documents, you can use `nlp.pipe` and " + "call the {matcher} on each Doc object.") + W107 = ("The property `Doc.{prop}` is deprecated. Use " + "`Doc.has_annotation(\"{attr}\")` instead.") + @add_codes -class Errors(object): +class Errors: E001 = ("No component '{name}' found in pipeline. Available names: {opts}") - E002 = ("Can't find factory for '{name}'. This usually happens when spaCy " - "calls `nlp.create_pipe` with a component name that's not built " - "in - for example, when constructing the pipeline from a model's " - "meta.json. If you're using a custom component, you can write to " - "`Language.factories['{name}']` or remove it from the model meta " - "and add it via `nlp.add_pipe` instead.") + E002 = ("Can't find factory for '{name}' for language {lang} ({lang_code}). " + "This usually happens when spaCy calls `nlp.{method}` with custom " + "component name that's not registered on the current language class. " + "If you're using a custom component, make sure you've added the " + "decorator `@Language.component` (for function components) or " + "`@Language.factory` (for class components).\n\nAvailable " + "factories: {opts}") E003 = ("Not a valid pipeline component. Expected callable, but " - "got {component} (name: '{name}').") - E004 = ("If you meant to add a built-in component, use `create_pipe`: " - "`nlp.add_pipe(nlp.create_pipe('{component}'))`") + "got {component} (name: '{name}'). If you're using a custom " + "component factory, double-check that it correctly returns your " + "initialized component.") + E004 = ("Can't set up pipeline component: a factory for '{name}' already " + "exists. Existing factory: {func}. New factory: {new_func}") E005 = ("Pipeline component '{name}' returned None. If you're using a " "custom component, maybe you forgot to return the processed Doc?") - E006 = ("Invalid constraints. You can only set one of the following: " - "before, after, first, last.") + E006 = ("Invalid constraints for adding pipeline component. You can only " + "set one of the following: before (component name or index), " + "after (component name or index), first (True) or last (True). " + "Invalid configuration: {args}. Existing components: {opts}") E007 = ("'{name}' already exists in pipeline. Existing names: {opts}") - E008 = ("Some current components would be lost when restoring previous " - "pipeline state. If you added components after calling " - "`nlp.disable_pipes()`, you should remove them explicitly with " - "`nlp.remove_pipe()` before the pipeline is restored. Names of " - "the new components: {names}") - E009 = ("The `update` method expects same number of docs and golds, but " - "got: {n_docs} docs, {n_golds} golds.") + E008 = ("Can't restore disabled pipeline component '{name}' because it " + "doesn't exist in the pipeline anymore. 
If you want to remove " + "components from the pipeline, you should do it before calling " + "`nlp.select_pipes` or after restoring the disabled components.") E010 = ("Word vectors set to length 0. This may be because you don't have " "a model installed or loaded, or because your model doesn't " "include word vectors. For more info, see the docs:\n" - "https://spacy.io/usage/models") + "https://nightly.spacy.io/usage/models") E011 = ("Unknown operator: '{op}'. Options: {opts}") E012 = ("Cannot add pattern for zero tokens to matcher.\nKey: {key}") - E013 = ("Error selecting action in matcher") - E014 = ("Unknown tag ID: {tag}") - E015 = ("Conflicting morphology exception for ({tag}, {orth}). Use " - "`force=True` to overwrite.") E016 = ("MultitaskObjective target should be function or one of: dep, " "tag, ent, dep_tag_offset, ent_tag.") E017 = ("Can only add unicode or bytes. Got type: {value_type}") @@ -179,71 +161,46 @@ class Errors(object): "refers to an issue with the `Vocab` or `StringStore`.") E019 = ("Can't create transition with unknown action ID: {action}. Action " "IDs are enumerated in spacy/syntax/{src}.pyx.") - E020 = ("Could not find a gold-standard action to supervise the " - "dependency parser. The tree is non-projective (i.e. it has " - "crossing arcs - see spacy/syntax/nonproj.pyx for definitions). " - "The ArcEager transition system only supports projective trees. " - "To learn non-projective representations, transform the data " - "before training and after parsing. Either pass " - "`make_projective=True` to the GoldParse class, or use " - "spacy.syntax.nonproj.preprocess_training_data.") - E021 = ("Could not find a gold-standard action to supervise the " - "dependency parser. The GoldParse was projective. The transition " - "system has {n_actions} actions. State at failure: {state}") E022 = ("Could not find a transition with the name '{name}' in the NER " "model.") - E023 = ("Error cleaning up beam: The same state occurred twice at " - "memory address {addr} and position {i}.") E024 = ("Could not find an optimal move to supervise the parser. Usually, " "this means that the model can't be updated in a way that's valid " "and satisfies the correct annotations specified in the GoldParse. " "For example, are all labels added to the model? If you're " "training a named entity recognizer, also make sure that none of " "your annotated entity spans have leading or trailing whitespace " - "or punctuation. " - "You can also use the experimental `debug-data` command to " + "or punctuation. You can also use the `debug data` command to " "validate your JSON-formatted training data. For details, run:\n" - "python -m spacy debug-data --help") + "python -m spacy debug data --help") E025 = ("String is too long: {length} characters. Max is 2**30.") E026 = ("Error accessing token at position {i}: out of bounds in Doc of " "length {length}.") - E027 = ("Arguments 'words' and 'spaces' should be sequences of the same " - "length, or 'spaces' should be left default at None. spaces " + E027 = ("Arguments `words` and `spaces` should be sequences of the same " + "length, or `spaces` should be left default at None. `spaces` " "should be a sequence of booleans, with True meaning that the " "word owns a ' ' character following it.") - E028 = ("orths_and_spaces expects either a list of unicode string or a " - "list of (unicode, bool) tuples. 
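The rewritten E002-E008 messages reflect the v3 workflow: components are registered with `@Language.component` or `@Language.factory`, positioned via `before` / `after` / `first` / `last`, and temporarily disabled with `nlp.select_pipes`. A compact sketch of all three; the component name is arbitrary.

```python
import spacy
from spacy.language import Language

@Language.component("debug_tokens")
def debug_tokens(doc):
    # A stateless function component: must return the (possibly modified) Doc.
    print("tokens:", len(doc))
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer", first=True)            # position by flag (E006)
nlp.add_pipe("debug_tokens", after="sentencizer")  # or by component name

with nlp.select_pipes(disable=["debug_tokens"]):   # restored on exit (E008)
    doc = nlp("Disabled components are skipped inside this block.")
```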
Got bytes instance: {value}") - E029 = ("noun_chunks requires the dependency parse, which requires a " + E028 = ("`words` expects a list of unicode strings, but got bytes instance: {value}") + E029 = ("`noun_chunks` requires the dependency parse, which requires a " "statistical model to be installed and loaded. For more info, see " - "the documentation:\nhttps://spacy.io/usage/models") + "the documentation:\nhttps://nightly.spacy.io/usage/models") E030 = ("Sentence boundaries unset. You can add the 'sentencizer' " - "component to the pipeline with: " - "nlp.add_pipe(nlp.create_pipe('sentencizer')) " - "Alternatively, add the dependency parser, or set sentence " - "boundaries by setting doc[i].is_sent_start.") + "component to the pipeline with: `nlp.add_pipe('sentencizer')`. " + "Alternatively, add the dependency parser or sentence recognizer, " + "or set sentence boundaries by setting `doc[i].is_sent_start`.") E031 = ("Invalid token: empty string ('') at position {i}.") - E032 = ("Conflicting attributes specified in doc.from_array(): " - "(HEAD, SENT_START). The HEAD attribute currently sets sentence " - "boundaries implicitly, based on the tree structure. This means " - "the HEAD attribute would potentially override the sentence " - "boundaries set by SENT_START.") E033 = ("Cannot load into non-empty Doc of length {length}.") - E034 = ("Doc.merge received {n_args} non-keyword arguments. Expected " - "either 3 arguments (deprecated), or 0 (use keyword arguments).\n" - "Arguments supplied:\n{args}\nKeyword arguments:{kwargs}") E035 = ("Error creating span with start {start} and end {end} for Doc of " "length {length}.") E036 = ("Error calculating span: Can't find a token starting at character " "offset {start}.") E037 = ("Error calculating span: Can't find a token ending at character " "offset {end}.") - E038 = ("Error finding sentence for span. Infinite loop detected.") E039 = ("Array bounds exceeded while searching for root word. This likely " "means the parse tree is in an invalid state. Please report this " "issue here: http://github.com/explosion/spaCy/issues") E040 = ("Attempt to access token at {i}, max length {max_length}.") E041 = ("Invalid comparison operator: {op}. Likely a Cython bug?") - E042 = ("Error accessing doc[{i}].nbor({j}), for doc of length {length}.") + E042 = ("Error accessing `doc[{i}].nbor({j})`, for doc of length {length}.") E043 = ("Refusing to write to token.sent_start if its document is parsed, " "because this may cause inconsistent state.") E044 = ("Invalid value for token.sent_start: {value}. Must be one of: " @@ -254,32 +211,25 @@ class Errors(object): E047 = ("Can't assign a value to unregistered extension attribute " "'{name}'. Did you forget to call the `set_extension` method?") E048 = ("Can't import language {lang} from spacy.lang: {err}") - E049 = ("Can't find spaCy data directory: '{path}'. Check your " - "installation and permissions, or use spacy.util.set_data_path " - "to customise the location if necessary.") - E050 = ("Can't find model '{name}'. It doesn't seem to be a shortcut " - "link, a Python package or a valid path to a data directory.") - E051 = ("Cant' load '{name}'. If you're using a shortcut link, make sure " - "it points to a valid package (not just a data directory).") + E050 = ("Can't find model '{name}'. 
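E027/E028 now describe the `Doc.__init__` contract directly: `words` is a list of strings and `spaces`, if given, is a same-length list of booleans. A quick sketch:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
words = ["Hello", ",", "world", "!"]
spaces = [False, True, False, False]   # True = token is followed by a space
doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(doc.text)                        # Hello, world!
```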
It doesn't seem to be a Python " + "package or a valid path to a data directory.") E052 = ("Can't find model directory: {path}") - E053 = ("Could not read meta.json from {path}") + E053 = ("Could not read {name} from {path}") E054 = ("No valid '{setting}' setting found in model meta.json.") E055 = ("Invalid ORTH value in exception:\nKey: {key}\nOrths: {orths}") E056 = ("Invalid tokenizer exception: ORTH values combined don't match " "original string.\nKey: {key}\nOrths: {orths}") E057 = ("Stepped slices not supported in Span objects. Try: " - "list(tokens)[start:stop:step] instead.") + "`list(tokens)[start:stop:step]` instead.") E058 = ("Could not retrieve vector for key {key}.") E059 = ("One (and only one) keyword arg must be set. Got: {kwargs}") E060 = ("Cannot add new key to vectors: the table is full. Current shape: " "({rows}, {cols}).") - E061 = ("Bad file name: {filename}. Example of a valid file name: " - "'vectors.128.f.bin'") E062 = ("Cannot find empty bit for new lexical flag. All bits between 0 " "and 63 are occupied. You can replace one by specifying the " "`flag_id` explicitly, e.g. " "`nlp.vocab.add_flag(your_func, flag_id=IS_ALPHA`.") - E063 = ("Invalid value for flag_id: {value}. Flag IDs must be between 1 " + E063 = ("Invalid value for `flag_id`: {value}. Flag IDs must be between 1 " "and 63 (inclusive).") E064 = ("Error fetching a Lexeme from the Vocab. When looking up a " "string, the lexeme returned had an orth ID that did not match " @@ -288,39 +238,17 @@ class Errors(object): "Query string: {string}\nOrth cached: {orth}\nOrth ID: {orth_id}") E065 = ("Only one of the vector table's width and shape can be specified. " "Got width {width} and shape {shape}.") - E066 = ("Error creating model helper for extracting columns. Can only " - "extract columns by positive integer. Got: {value}.") - E067 = ("Invalid BILUO tag sequence: Got a tag starting with 'I' (inside " - "an entity) without a preceding 'B' (beginning of an entity). " + E067 = ("Invalid BILUO tag sequence: Got a tag starting with {start} " + "without a preceding 'B' (beginning of an entity). " "Tag sequence:\n{tags}") E068 = ("Invalid BILUO tag: '{tag}'.") - E069 = ("Invalid gold-standard parse tree. Found cycle between word " - "IDs: {cycle} (tokens: {cycle_tokens}) in the document starting " - "with tokens: {doc_tokens}.") - E070 = ("Invalid gold-standard data. Number of documents ({n_docs}) " - "does not align with number of annotations ({n_annots}).") E071 = ("Error creating lexeme: specified orth ID ({orth}) does not " "match the one in the vocab ({vocab_orth}).") - E072 = ("Error serializing lexeme: expected data length {length}, " - "got {bad_length}.") E073 = ("Cannot assign vector of length {new_length}. Existing vectors " "are of length {length}. You can use `vocab.reset_vectors` to " "clear the existing vectors and resize the table.") E074 = ("Error interpreting compiled match pattern: patterns are expected " "to end with the attribute {attr}. 
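E057 keeps pointing at the same workaround: `Span` rejects stepped slices, so materialize the tokens first.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("one two three four five six")
span = doc[0:6]
every_other = list(span)[0:6:2]        # [one, three, five]; span[0:6:2] raises E057
```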
Got: {bad_attr}.") - E075 = ("Error accepting match: length ({length}) > maximum length " - "({max_len}).") - E076 = ("Error setting tensor on Doc: tensor has {rows} rows, while Doc " - "has {words} words.") - E077 = ("Error computing {value}: number of Docs ({n_docs}) does not " - "equal number of GoldParse objects ({n_golds}) in batch.") - E078 = ("Error computing score: number of words in Doc ({words_doc}) does " - "not equal number of words in GoldParse ({words_gold}).") - E079 = ("Error computing states in beam: number of predicted beams " - "({pbeams}) does not equal number of gold beams ({gbeams}).") - E080 = ("Duplicate state found in beam: {key}.") - E081 = ("Error getting gradient in beam: number of histories ({n_hist}) " - "does not equal number of losses ({losses}).") E082 = ("Error deprojectivizing parse: number of heads ({n_heads}), " "projective heads ({n_proj_heads}) and labels ({n_labels}) do not " "match.") @@ -328,11 +256,9 @@ class Errors(object): "`getter` (plus optional `setter`) is allowed. Got: {nr_defined}") E084 = ("Error assigning label ID {label} to span: not in StringStore.") E085 = ("Can't create lexeme for string '{string}'.") - E086 = ("Error deserializing lexeme '{string}': orth ID {orth_id} does " - "not match hash {hash_id} in StringStore.") E087 = ("Unknown displaCy style: {style}.") E088 = ("Text of length {length} exceeds maximum of {max_length}. The " - "v2.x parser and NER models require roughly 1GB of temporary " + "parser and NER models require roughly 1GB of temporary " "memory per 100,000 characters in the input. This means long " "texts may cause memory allocation errors. If you're not using " "the parser or NER, it's probably safe to increase the " @@ -345,43 +271,33 @@ class Errors(object): "existing extension, set `force=True` on `{obj}.set_extension`.") E091 = ("Invalid extension attribute {name}: expected callable or None, " "but got: {value}") - E092 = ("Could not find or assign name for word vectors. Ususally, the " - "name is read from the model's meta.json in vector.name. " - "Alternatively, it is built from the 'lang' and 'name' keys in " - "the meta.json. Vector names are required to avoid issue #1660.") E093 = ("token.ent_iob values make invalid sequence: I without B\n{seq}") E094 = ("Error reading line {line_num} in vectors file {loc}.") E095 = ("Can't write to frozen dictionary. This is likely an internal " "error. Are you writing to a default function argument?") - E096 = ("Invalid object passed to displaCy: Can only visualize Doc or " - "Span objects, or dicts if set to manual=True.") + E096 = ("Invalid object passed to displaCy: Can only visualize `Doc` or " + "Span objects, or dicts if set to `manual=True`.") E097 = ("Invalid pattern: expected token pattern (list of dicts) or " "phrase pattern (string) but got:\n{pattern}") - E098 = ("Invalid pattern specified: expected both SPEC and PATTERN.") - E099 = ("First node of pattern should be a root node. The root should " - "only contain NODE_NAME.") - E100 = ("Nodes apart from the root should contain NODE_NAME, NBOR_NAME and " - "NBOR_RELOP.") - E101 = ("NODE_NAME should be a new node and NBOR_NAME should already have " + E098 = ("Invalid pattern: expected both RIGHT_ID and RIGHT_ATTRS.") + E099 = ("Invalid pattern: the first node of pattern should be an anchor " + "node. 
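E088's memory warning is unchanged in substance: for tokenizer-only processing of very long texts, the limit can be raised explicitly, memory permitting.

```python
import spacy

nlp = spacy.blank("en")                # tokenizer-only pipeline, no parser or NER
nlp.max_length = 5_000_000             # default is 1,000,000 characters
doc = nlp("some very long text " * 200_000)
```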
The node should only contain RIGHT_ID and RIGHT_ATTRS.") + E100 = ("Nodes other than the anchor node should all contain LEFT_ID, " + "REL_OP and RIGHT_ID.") + E101 = ("RIGHT_ID should be a new node and LEFT_ID should already have " "have been declared in previous edges.") E102 = ("Can't merge non-disjoint spans. '{token}' is already part of " "tokens to merge. If you want to find the longest non-overlapping " "spans, you can use the util.filter_spans helper:\n" - "https://spacy.io/api/top-level#util.filter_spans") + "https://nightly.spacy.io/api/top-level#util.filter_spans") E103 = ("Trying to set conflicting doc.ents: '{span1}' and '{span2}'. A " "token can only be part of one entity, so make sure the entities " "you're setting don't overlap.") - E104 = ("Can't find JSON schema for '{name}'.") - E105 = ("The Doc.print_tree() method is now deprecated. Please use " - "Doc.to_json() instead or write your own function.") - E106 = ("Can't find doc._.{attr} attribute specified in the underscore " + E106 = ("Can't find `doc._.{attr}` attribute specified in the underscore " "settings: {opts}") - E107 = ("Value of doc._.{attr} is not JSON-serializable: {value}") - E108 = ("As of spaCy v2.1, the pipe name `sbd` has been deprecated " - "in favor of the pipe name `sentencizer`, which does the same " - "thing. For example, use `nlp.create_pipeline('sentencizer')`") - E109 = ("Model for component '{name}' not initialized. Did you forget to " - "load a model, or forget to call begin_training()?") + E107 = ("Value of `doc._.{attr}` is not JSON-serializable: {value}") + E109 = ("Component '{name}' could not be run. Did you forget to " + "call `initialize()`?") E110 = ("Invalid displaCy render wrapper. Expected callable, got: {obj}") E111 = ("Pickling a token is not supported, because tokens are only views " "of the parent Doc and can't exist on their own. A pickled token " @@ -394,18 +310,12 @@ class Errors(object): "practically no advantage over pickling the parent Doc directly. " "So instead of pickling the span, pickle the Doc it belongs to or " "use Span.as_doc to convert the span to a standalone Doc object.") - E113 = ("The newly split token can only have one root (head = 0).") - E114 = ("The newly split token needs to have a root (head = 0).") E115 = ("All subtokens must have associated heads.") - E116 = ("Cannot currently add labels to pretrained text classifier. Add " - "labels before training begins. This functionality was available " - "in previous versions, but had significant bugs that led to poor " - "performance.") E117 = ("The newly split tokens must match the text of the original token. " "New orths: {new}. Old text: {old}.") E118 = ("The custom extension attribute '{attr}' is not registered on the " - "Token object so it can't be set during retokenization. To " - "register an attribute, use the Token.set_extension classmethod.") + "`Token` object so it can't be set during retokenization. To " + "register an attribute, use the `Token.set_extension` classmethod.") E119 = ("Can't set custom extension attribute '{attr}' during " "retokenization because it's not writable. This usually means it " "was registered with a getter function (and no setter) or as a " @@ -418,16 +328,9 @@ class Errors(object): "equal to span length ({span_len}).") E122 = ("Cannot find token to be split. Did it get merged?") E123 = ("Cannot find head of token to be split. Did it get merged?") - E124 = ("Cannot read from file: {path}. 
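E098-E101 now use the DependencyMatcher's pattern vocabulary (`RIGHT_ID`, `RIGHT_ATTRS`, `LEFT_ID`, `REL_OP`). A minimal pattern in that format; it assumes a pipeline with a parser, e.g. `en_core_web_sm`, and the example sentence is illustrative.

```python
import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)
pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"POS": "VERB"}},        # anchor node
    {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "subject",    # attached node
     "RIGHT_ATTRS": {"DEP": "nsubj"}},
]
matcher.add("VERB_SUBJECT", [pattern])
matches = matcher(nlp("Smith founded a healthcare company."))
```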
Supported formats: {formats}") E125 = ("Unexpected value: {value}") E126 = ("Unexpected matcher predicate: '{bad}'. Expected one of: {good}. " "This is likely a bug in spaCy, so feel free to open an issue.") - E127 = ("Cannot create phrase pattern representation for length 0. This " - "is likely a bug in spaCy.") - E128 = ("Unsupported serialization argument: '{arg}'. The use of keyword " - "arguments to exclude fields from being serialized or deserialized " - "is now deprecated. Please use the `exclude` argument instead. " - "For example: exclude=['{arg}'].") E129 = ("Cannot write the label of an existing Span object because a Span " "is a read-only view of the underlying Token objects stored in the " "Doc. Instead, create a new Span object and specify the `label` " @@ -436,7 +339,7 @@ class Errors(object): E130 = ("You are running a narrow unicode build, which is incompatible " "with spacy >= 2.1.0. To fix this, reinstall Python and use a wide " "unicode build instead. You can also rebuild Python and set the " - "--enable-unicode=ucs4 flag.") + "`--enable-unicode=ucs4 flag`.") E131 = ("Cannot write the kb_id of an existing Span object because a Span " "is a read-only view of the underlying Token objects stored in " "the Doc. Instead, create a new Span object and specify the " @@ -449,59 +352,39 @@ class Errors(object): E133 = ("The sum of prior probabilities for alias '{alias}' should not " "exceed 1, but found {sum}.") E134 = ("Entity '{entity}' is not defined in the Knowledge Base.") - E135 = ("If you meant to replace a built-in component, use `create_pipe`: " - "`nlp.replace_pipe('{name}', nlp.create_pipe('{name}'))`") - E136 = ("This additional feature requires the jsonschema library to be " - "installed:\npip install jsonschema") - E137 = ("Expected 'dict' type, but got '{type}' from '{line}'. Make sure " - "to provide a valid JSON object as input with either the `text` " - "or `tokens` key. For more info, see the docs:\n" - "https://spacy.io/api/cli#pretrain-jsonl") - E138 = ("Invalid JSONL format for raw text '{text}'. Make sure the input " - "includes either the `text` or `tokens` key. For more info, see " - "the docs:\nhttps://spacy.io/api/cli#pretrain-jsonl") - E139 = ("Knowledge Base for component '{name}' not initialized. Did you " - "forget to call set_kb()?") + E139 = ("Knowledge base for component '{name}' is empty. Use the methods " + "`kb.add_entity` and `kb.add_alias` to add entries.") E140 = ("The list of entities, prior probabilities and entity vectors " "should be of equal length.") E141 = ("Entity vectors should be of length {required} instead of the " "provided {found}.") - E142 = ("Unsupported loss_function '{loss_func}'. Use either 'L2' or " - "'cosine'.") - E143 = ("Labels for component '{name}' not initialized. Did you forget to " - "call add_label()?") - E144 = ("Could not find parameter `{param}` when building the entity " - "linker model.") + E143 = ("Labels for component '{name}' not initialized. This can be fixed " + "by calling add_label, or by providing a representative batch of " + "examples to the component's `initialize` method.") E145 = ("Error reading `{param}` from input file.") - E146 = ("Could not access `{path}`.") + E146 = ("Could not access {path}.") E147 = ("Unexpected error in the {method} functionality of the " "EntityLinker: {msg}. This is likely a bug in spaCy, so feel free " - "to open an issue.") + "to open an issue: https://github.com/explosion/spaCy/issues") E148 = ("Expected {ents} KB identifiers but got {ids}. 
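E129/E131 explain that `Span.label` and `Span.kb_id` are read-only on existing spans; the fix they describe is to build a new `Span` over the same tokens. A short sketch, with an arbitrary label and KB identifier:

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying a startup")
ent = Span(doc, 0, 1, label="ORG", kb_id="Q312")   # new span instead of assignment
doc.ents = [ent]
```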
Make sure that " "each entity in `doc.ents` is assigned to a KB identifier.") E149 = ("Error deserializing model. Check that the config used to create " "the component matches the model being loaded.") E150 = ("The language of the `nlp` object and the `vocab` should be the " "same, but found '{nlp}' and '{vocab}' respectively.") - E151 = ("Trying to call nlp.update without required annotation types. " - "Expected top-level keys: {exp}. Got: {unexp}.") E152 = ("The attribute {attr} is not supported for token patterns. " - "Please use the option validate=True with Matcher, PhraseMatcher, " + "Please use the option `validate=True` with the Matcher, PhraseMatcher, " "or EntityRuler for more details.") E153 = ("The value type {vtype} is not supported for token patterns. " "Please use the option validate=True with Matcher, PhraseMatcher, " "or EntityRuler for more details.") E154 = ("One of the attributes or values is not supported for token " - "patterns. Please use the option validate=True with Matcher, " + "patterns. Please use the option `validate=True` with the Matcher, " "PhraseMatcher, or EntityRuler for more details.") - E155 = ("The pipeline needs to include a tagger in order to use " - "Matcher or PhraseMatcher with the attributes POS, TAG, or LEMMA. " - "Try using nlp() instead of nlp.make_doc() or list(nlp.pipe()) " - "instead of list(nlp.tokenizer.pipe()).") - E156 = ("The pipeline needs to include a parser in order to use " - "Matcher or PhraseMatcher with the attribute DEP. Try using " - "nlp() instead of nlp.make_doc() or list(nlp.pipe()) instead of " - "list(nlp.tokenizer.pipe()).") + E155 = ("The pipeline needs to include a {pipe} in order to use " + "Matcher or PhraseMatcher with the attribute {attr}. " + "Try using `nlp()` instead of `nlp.make_doc()` or `list(nlp.pipe())` " + "instead of `list(nlp.tokenizer.pipe())`.") E157 = ("Can't render negative values for dependency arc start or end. " "Make sure that you're passing in absolute token indices, not " "relative token offsets.\nstart: {start}, end: {end}, label: " @@ -510,44 +393,29 @@ class Errors(object): E159 = ("Can't find table '{name}' in lookups. Available tables: {tables}") E160 = ("Can't find language data file: {path}") E161 = ("Found an internal inconsistency when predicting entity links. " - "This is likely a bug in spaCy, so feel free to open an issue.") - E162 = ("Cannot evaluate textcat model on data with different labels.\n" - "Labels in model: {model_labels}\nLabels in evaluation " - "data: {eval_labels}") + "This is likely a bug in spaCy, so feel free to open an issue: " + "https://github.com/explosion/spaCy/issues") E163 = ("cumsum was found to be unstable: its last element does not " "correspond to sum") - E164 = ("x is neither increasing nor decreasing: {}.") + E164 = ("x is neither increasing nor decreasing: {x}.") E165 = ("Only one class present in y_true. ROC AUC score is not defined in " "that case.") - E166 = ("Can only merge DocBins with the same pre-defined attributes.\n" + E166 = ("Can only merge DocBins with the same value for '{param}'.\n" "Current DocBin: {current}\nOther DocBin: {other}") - E167 = ("Unknown morphological feature: '{feat}' ({feat_id}). This can " - "happen if the tagger was trained with a different set of " - "morphological features. 
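E166 now reports the specific mismatched setting when merging `DocBin`s. A sketch of a merge that satisfies the constraint by constructing both containers with the same options; the `attrs` list is just an example.

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
attrs = ["ORTH", "TAG", "HEAD", "DEP", "ENT_IOB", "ENT_TYPE"]
db1 = DocBin(attrs=attrs, store_user_data=True)
db2 = DocBin(attrs=attrs, store_user_data=True)
db1.add(nlp.make_doc("First doc"))
db2.add(nlp.make_doc("Second doc"))
db1.merge(db2)                         # differing settings would raise E166
```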
If you're using a pretrained model, make " - "sure that your models are up to date:\npython -m spacy validate") - E168 = ("Unknown field: {field}") E169 = ("Can't find module: {module}") E170 = ("Cannot apply transition {name}: invalid for the current state.") - E171 = ("Matcher.add received invalid on_match callback argument: expected " + E171 = ("Matcher.add received invalid 'on_match' callback argument: expected " "callable or None, but got: {arg_type}") - E172 = ("The Lemmatizer.load classmethod is deprecated. To create a " - "Lemmatizer, initialize the class directly. See the docs for " - "details: https://spacy.io/api/lemmatizer") - E173 = ("As of v2.2, the Lemmatizer is initialized with an instance of " - "Lookups containing the lemmatization tables. See the docs for " - "details: https://spacy.io/api/lemmatizer#init") - E174 = ("Architecture '{name}' not found in registry. Available " - "names: {names}") E175 = ("Can't remove rule for unknown match pattern ID: {key}") E176 = ("Alias '{alias}' is not defined in the Knowledge Base.") E177 = ("Ill-formed IOB input detected: {tag}") - E178 = ("Invalid pattern. Expected list of dicts but got: {pat}. Maybe you " + E178 = ("Each pattern should be a list of dicts, but got: {pat}. Maybe you " "accidentally passed a single pattern to Matcher.add instead of a " "list of patterns? If you only want to add one pattern, make sure " - "to wrap it in a list. For example: matcher.add('{key}', [pattern])") + "to wrap it in a list. For example: `matcher.add('{key}', [pattern])`") E179 = ("Invalid pattern. Expected a list of Doc objects but got a single " "Doc. If you only want to add one pattern, make sure to wrap it " - "in a list. For example: matcher.add('{key}', [doc])") + "in a list. For example: `matcher.add('{key}', [doc])`") E180 = ("Span attributes can't be declared as required or assigned by " "components, since spans are only views of the Doc. Use Doc and " "Token attributes (or custom extension attributes) only and remove " @@ -555,20 +423,16 @@ class Errors(object): E181 = ("Received invalid attributes for unkown object {obj}: {attrs}. " "Only Doc and Token attributes are supported.") E182 = ("Received invalid attribute declaration: {attr}\nDid you forget " - "to define the attribute? For example: {attr}.???") + "to define the attribute? For example: `{attr}.???`") E183 = ("Received invalid attribute declaration: {attr}\nOnly top-level " "attributes are supported, for example: {solution}") E184 = ("Only attributes without underscores are supported in component " "attribute declarations (because underscore and non-underscore " "attributes are connected anyways): {attr} -> {solution}") E185 = ("Received invalid attribute in component attribute declaration: " - "{obj}.{attr}\nAttribute '{attr}' does not exist on {obj}.") - E186 = ("'{tok_a}' and '{tok_b}' are different texts.") + "`{obj}.{attr}`\nAttribute '{attr}' does not exist on {obj}.") E187 = ("Only unicode strings are supported as labels.") - E188 = ("Could not match the gold entity links to entities in the doc - " - "make sure the gold EL data refers to valid results of the " - "named entity recognizer in the `nlp` pipeline.") - E189 = ("Each argument to `get_doc` should be of equal length.") + E189 = ("Each argument to `Doc.__init__` should be of equal length.") E190 = ("Token head out of range in `Doc.from_array()` for token index " "'{index}' with value '{value}' (equivalent to relative head " "index: '{rel_head_index}'). 
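E171/E178/E179 all enforce the same calling convention: `Matcher.add` takes a key, a list of patterns, and an optional `on_match` callback passed by keyword. An illustrative sketch:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab, validate=True)        # validate patterns up front

def on_match(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    print("matched:", doc[start:end].text)

pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HELLO_WORLD", [pattern], on_match=on_match)   # note the list wrapper
matches = matcher(nlp("Hello, world!"))
```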
The head indices should be relative " @@ -582,27 +446,276 @@ class Errors(object): "({curr_dim}).") E194 = ("Unable to aligned mismatched text '{text}' and words '{words}'.") E195 = ("Matcher can be called on {good} only, got {got}.") - E196 = ("Refusing to write to token.is_sent_end. Sentence boundaries can " - "only be fixed with token.is_sent_start.") + E196 = ("Refusing to write to `token.is_sent_end`. Sentence boundaries can " + "only be fixed with `token.is_sent_start`.") E197 = ("Row out of bounds, unable to add row {row} for key {key}.") E198 = ("Unable to return {n} most similar vectors for the current vectors " "table, which contains {n_rows} vectors.") - E199 = ("Unable to merge 0-length span at doc[{start}:{end}].") - E200 = ("Specifying a base model with a pretrained component '{component}' " - "can not be combined with adding a pretrained Tok2Vec layer.") - E201 = ("Span index out of range.") - - -@add_codes -class TempErrors(object): - T003 = ("Resizing pretrained Tagger models is not currently supported.") - T004 = ("Currently parser depth is hard-coded to 1. Received: {value}.") - T007 = ("Can't yet set {attr} from Span. Vote for this feature on the " + E199 = ("Unable to merge 0-length span at `doc[{start}:{end}]`.") + E200 = ("Can't yet set {attr} from Span. Vote for this feature on the " "issue tracker: http://github.com/explosion/spaCy/issues") - T008 = ("Bad configuration of Tagger. This is probably a bug within " - "spaCy. We changed the name of an internal attribute for loading " - "pretrained vectors, and the class has been passed the old name " - "(pretrained_dims) but not the new name (pretrained_vectors).") + + # TODO: fix numbering after merging develop into master + E898 = ("Can't serialize trainable pipe '{name}': the `model` attribute " + "is not set or None. If you've implemented a custom component, make " + "sure to store the component model as `self.model` in your " + "component's __init__ method.") + E899 = ("Can't serialize trainable pipe '{name}': the `vocab` attribute " + "is not set or None. If you've implemented a custom component, make " + "sure to store the current `nlp` object's vocab as `self.vocab` in " + "your component's __init__ method.") + E900 = ("Could not run the full pipeline for evaluation. If you specified " + "frozen components, make sure they were already initialized and " + "trained. Full pipeline: {pipeline}") + E901 = ("Failed to remove existing output directory: {path}. If your " + "config and the components you train change between runs, a " + "non-empty output directory can lead to stale pipeline data. To " + "solve this, remove the existing directories in the output directory.") + E902 = ("The sentence-per-line IOB/IOB2 file is not formatted correctly. " + "Try checking whitespace and delimiters. See " + "https://nightly.spacy.io/api/cli#convert") + E903 = ("The token-per-line NER file is not formatted correctly. Try checking " + "whitespace and delimiters. See https://nightly.spacy.io/api/cli#convert") + E904 = ("Cannot initialize StaticVectors layer: nO dimension unset. This " + "dimension refers to the output width, after the linear projection " + "has been applied.") + E905 = ("Cannot initialize StaticVectors layer: nM dimension unset. 
This " + "dimension refers to the width of the vectors table.") + E906 = ("Unexpected `loss` value in pretraining objective: {loss_type}") + E907 = ("Unexpected `objective_type` value in pretraining objective: {objective_type}") + E908 = ("Can't set `spaces` without `words` in `Doc.__init__`.") + E909 = ("Expected {name} in parser internals. This is likely a bug in spaCy.") + E910 = ("Encountered NaN value when computing loss for component '{name}'.") + E911 = ("Invalid feature: {feat}. Must be a token attribute.") + E912 = ("Failed to initialize lemmatizer. Missing lemmatizer table(s) found " + "for mode '{mode}'. Required tables: {tables}. Found: {found}.") + E913 = ("Corpus path can't be None. Maybe you forgot to define it in your " + "config.cfg or override it on the CLI?") + E914 = ("Executing {name} callback failed. Expected the function to " + "return the nlp object but got: {value}. Maybe you forgot to return " + "the modified object in your function?") + E915 = ("Can't use score '{name}' to calculate final weighted score. Expected " + "float or int but got: {score_type}. To exclude the score from the " + "final score, set its weight to null in the [training.score_weights] " + "section of your training config.") + E916 = ("Can't log score for '{name}' in table: not a valid score ({score_type})") + E917 = ("Received invalid value {value} for `state_type` in " + "TransitionBasedParser: only 'parser' or 'ner' are valid options.") + E918 = ("Received invalid value for vocab: {vocab} ({vocab_type}). Valid " + "values are an instance of `spacy.vocab.Vocab` or True to create one" + " (default).") + E919 = ("A textcat `positive_label` '{pos_label}' was provided for training " + "data that does not appear to be a binary classification problem " + "with two labels. Labels found: {labels}") + E920 = ("The textcat's `positive_label` setting '{pos_label}' " + "does not match any label in the training data or provided during " + "initialization. Available labels: {labels}") + E921 = ("The method `set_output` can only be called on components that have " + "a Model with a `resize_output` attribute. Otherwise, the output " + "layer can not be dynamically changed.") + E922 = ("Component '{name}' has been initialized with an output dimension of " + "{nO} - cannot add any more labels.") + E923 = ("It looks like there is no proper sample data to initialize the " + "Model of component '{name}'. This is likely a bug in spaCy, so " + "feel free to open an issue: https://github.com/explosion/spaCy/issues") + E924 = ("The '{name}' component does not seem to be initialized properly. " + "This is likely a bug in spaCy, so feel free to open an issue: " + "https://github.com/explosion/spaCy/issues") + E925 = ("Invalid color values for displaCy visualizer: expected dictionary " + "mapping label names to colors but got: {obj}") + E926 = ("It looks like you're trying to modify `nlp.{attr}` directly. This " + "doesn't work because it's an immutable computed property. If you " + "need to modify the pipeline, use the built-in methods like " + "`nlp.add_pipe`, `nlp.remove_pipe`, `nlp.disable_pipe` or " + "`nlp.enable_pipe` instead.") + E927 = ("Can't write to frozen list Maybe you're trying to modify a computed " + "property or default function argument?") + E928 = ("A KnowledgeBase can only be serialized to/from from a directory, " + "but the provided argument {loc} points to a file.") + E929 = ("Couldn't read KnowledgeBase from {loc}. 
The path does not seem to exist.") + E930 = ("Received invalid get_examples callback in `{method}`. " + "Expected function that returns an iterable of Example objects but " + "got: {obj}") + E931 = ("Encountered {parent} subclass without `{parent}.{method}` " + "method in component '{name}'. If you want to use this " + "method, make sure it's overwritten on the subclass.") + E940 = ("Found NaN values in scores.") + E941 = ("Can't find model '{name}'. It looks like you're trying to load a " + "model from a shortcut, which is deprecated as of spaCy v3.0. To " + "load the model, use its full name instead:\n\n" + "nlp = spacy.load(\"{full}\")\n\nFor more details on the available " + "models, see the models directory: https://spacy.io/models. If you " + "want to create a blank model, use spacy.blank: " + "nlp = spacy.blank(\"{name}\")") + E942 = ("Executing `after_{name}` callback failed. Expected the function to " + "return an initialized nlp object but got: {value}. Maybe " + "you forgot to return the modified object in your function?") + E943 = ("Executing `before_creation` callback failed. Expected the function to " + "return an uninitialized Language subclass but got: {value}. Maybe " + "you forgot to return the modified object in your function or " + "returned the initialized nlp object instead?") + E944 = ("Can't copy pipeline component '{name}' from source '{model}': " + "not found in pipeline. Available components: {opts}") + E945 = ("Can't copy pipeline component '{name}' from source. Expected loaded " + "nlp object, but got: {source}") + E947 = ("`Matcher.add` received invalid `greedy` argument: expected " + "a string value from {expected} but got: '{arg}'") + E948 = ("`Matcher.add` received invalid 'patterns' argument: expected " + "a list, but got: {arg_type}") + E949 = ("Can only create an alignment when the texts are the same.") + E952 = ("The section '{name}' is not a valid section in the provided config.") + E953 = ("Mismatched IDs received by the Tok2Vec listener: {id1} vs. {id2}") + E954 = ("The Tok2Vec listener did not receive any valid input from an upstream " + "component.") + E955 = ("Can't find table(s) {table} for language '{lang}' in " + "spacy-lookups-data. Make sure you have the package installed or " + "provide your own lookup tables if no default lookups are available " + "for your language.") + E956 = ("Can't find component '{name}' in [components] block in the config. " + "Available components: {opts}") + E957 = ("Writing directly to `Language.factories` isn't needed anymore in " + "spaCy v3. Instead, you can use the `@Language.factory` decorator " + "to register your custom component factory or `@Language.component` " + "to register a simple stateless function component that just takes " + "a Doc and returns it.") + E958 = ("Language code defined in config ({bad_lang_code}) does not match " + "language code of current Language subclass {lang} ({lang_code}). " + "If you want to create an nlp object from a config, make sure to " + "use the matching subclass with the language-specific settings and " + "data.") + E959 = ("Can't insert component {dir} index {idx}. Existing components: {opts}") + E960 = ("No config data found for component '{name}'. This is likely a bug " + "in spaCy.") + E961 = ("Found non-serializable Python object in config. Configs should " + "only include values that can be serialized to JSON. 
If you need " + "to pass models or other objects to your component, use a reference " + "to a registered function or initialize the object in your " + "component.\n\n{config}") + E962 = ("Received incorrect {style} for pipe '{name}'. Expected dict, " + "got: {cfg_type}.") + E963 = ("Can't read component info from `@Language.{decorator}` decorator. " + "Maybe you forgot to call it? Make sure you're using " + "`@Language.{decorator}()` instead of `@Language.{decorator}`.") + E964 = ("The pipeline component factory for '{name}' needs to have the " + "following named arguments, which are passed in by spaCy:\n- nlp: " + "receives the current nlp object and lets you access the vocab\n- " + "name: the name of the component instance, can be used to identify " + "the component, output losses etc.") + E965 = ("It looks like you're using the `@Language.component` decorator to " + "register '{name}' on a class instead of a function component. If " + "you need to register a class or function that *returns* a component " + "function, use the `@Language.factory` decorator instead.") + E966 = ("`nlp.add_pipe` now takes the string name of the registered component " + "factory, not a callable component. Expected string, but got " + "{component} (name: '{name}').\n\n- If you created your component " + "with `nlp.create_pipe('name')`: remove nlp.create_pipe and call " + "`nlp.add_pipe('name')` instead.\n\n- If you passed in a component " + "like `TextCategorizer()`: call `nlp.add_pipe` with the string name " + "instead, e.g. `nlp.add_pipe('textcat')`.\n\n- If you're using a custom " + "component: Add the decorator `@Language.component` (for function " + "components) or `@Language.factory` (for class components / factories) " + "to your custom component and assign it a name, e.g. " + "`@Language.component('your_name')`. You can then run " + "`nlp.add_pipe('your_name')` to add it to the pipeline.") + E967 = ("No {meta} meta information found for '{name}'. This is likely a bug in spaCy.") + E968 = ("`nlp.replace_pipe` now takes the string name of the registered component " + "factory, not a callable component. Expected string, but got " + "{component}.\n\n- If you created your component with" + "with `nlp.create_pipe('name')`: remove `nlp.create_pipe` and call " + "`nlp.replace_pipe('{name}', 'name')` instead.\n\n- If you passed in a " + "component like `TextCategorizer()`: call `nlp.replace_pipe` with the " + "string name instead, e.g. `nlp.replace_pipe('{name}', 'textcat')`.\n\n" + "- If you're using a custom component: Add the decorator " + "`@Language.component` (for function components) or `@Language.factory` " + "(for class components / factories) to your custom component and " + "assign it a name, e.g. `@Language.component('your_name')`. You can " + "then run `nlp.replace_pipe('{name}', 'your_name')`.") + E969 = ("Expected string values for field '{field}', but received {types} instead. ") + E970 = ("Can not execute command '{str_command}'. Do you have '{tool}' installed?") + E971 = ("Found incompatible lengths in `Doc.from_array`: {array_length} for the " + "array and {doc_length} for the Doc itself.") + E972 = ("`Example.__init__` got None for '{arg}'. Requires Doc.") + E973 = ("Unexpected type for NER data") + E974 = ("Unknown {obj} attribute: {key}") + E976 = ("The method `Example.from_dict` expects a {type} as {n} argument, " + "but received None.") + E977 = ("Can not compare a MorphAnalysis with a string object. 
" + "This is likely a bug in spaCy, so feel free to open an issue: " + "https://github.com/explosion/spaCy/issues") + E978 = ("The {name} method takes a list of Example objects, but got: {types}") + E980 = ("Each link annotation should refer to a dictionary with at most one " + "identifier mapping to 1.0, and all others to 0.0.") + E981 = ("The offsets of the annotations for `links` could not be aligned " + "to token boundaries.") + E982 = ("The `Token.ent_iob` attribute should be an integer indexing " + "into {values}, but found {value}.") + E983 = ("Invalid key for '{dict}': {key}. Available keys: " + "{keys}") + E984 = ("Invalid component config for '{name}': component block needs either " + "a key `factory` specifying the registered function used to " + "initialize the component, or a key `source` key specifying a " + "spaCy model to copy the component from. For example, `factory = " + "\"ner\"` will use the 'ner' factory and all other settings in the " + "block will be passed to it as arguments. Alternatively, `source = " + "\"en_core_web_sm\"` will copy the component from that model.\n\n{config}") + E985 = ("Can't load model from config file: no [nlp] section found.\n\n{config}") + E986 = ("Could not create any training batches: check your input. " + "Are the train and dev paths defined? Is `discard_oversize` set appropriately? ") + E989 = ("`nlp.update()` was called with two positional arguments. This " + "may be due to a backwards-incompatible change to the format " + "of the training data in spaCy 3.0 onwards. The 'update' " + "function should now be called with a batch of Example " + "objects, instead of `(text, annotation)` tuples. ") + E991 = ("The function `nlp.select_pipes` should be called with either a " + "`disable` argument to list the names of the pipe components " + "that should be disabled, or with an 'enable' argument that " + "specifies which pipes should not be disabled.") + E992 = ("The function `select_pipes` was called with `enable`={enable} " + "and `disable`={disable} but that information is conflicting " + "for the `nlp` pipeline with components {names}.") + E993 = ("The config for the nlp object needs to include a key `lang` specifying " + "the code of the language to initialize it with (for example " + "'en' for English) - this can't be None.\n\n{config}") + E997 = ("Tokenizer special cases are not allowed to modify the text. " + "This would map '{chunk}' to '{orth}' given token attributes " + "'{token_attrs}'.") + E999 = ("Unable to merge the Doc objects because they do not all share " + "the same `Vocab`.") + E1000 = ("The Chinese word segmenter is pkuseg but no pkuseg model was " + "loaded. Provide the name of a pretrained model or the path to " + "a model and initialize the pipeline:\n\n" + 'nlp.tokenizer.initialize(pkuseg_model="default")') + E1001 = ("Target token outside of matched span for match with tokens " + "'{span}' and offset '{index}' matched by patterns '{patterns}'.") + E1002 = ("Span index out of range.") + E1003 = ("Unsupported lemmatizer mode '{mode}'.") + E1004 = ("Missing lemmatizer table(s) found for lemmatizer mode '{mode}'. " + "Required tables: {tables}. Found: {found}. Maybe you forgot to " + "call `nlp.initialize()` to load in the data?") + E1005 = ("Unable to set attribute '{attr}' in tokenizer exception for " + "'{chunk}'. Tokenizer exceptions are only allowed to specify " + "ORTH and NORM.") + E1007 = ("Unsupported DependencyMatcher operator '{op}'.") + E1008 = ("Invalid pattern: each pattern should be a list of dicts. 
Check " + "that you are providing a list of patterns as `List[List[dict]]`.") + E1010 = ("Unable to set entity information for token {i} which is included " + "in more than one span in entities, blocked, missing or outside.") + E1011 = ("Unsupported default '{default}' in `doc.set_ents`. Available " + "options: {modes}") + E1012 = ("Entity spans and blocked/missing/outside spans should be " + "provided to `doc.set_ents` as lists of Span objects.") + E1013 = ("Invalid morph: the MorphAnalysis must have the same vocab as the " + "token itself. To set the morph from this MorphAnalysis, set from " + "the string value with: `token.set_morph(str(other_morph))`.") + + +# Deprecated model shortcuts, only used in errors and warnings +OLD_MODEL_SHORTCUTS = { + "en": "en_core_web_sm", "de": "de_core_news_sm", "es": "es_core_news_sm", + "pt": "pt_core_news_sm", "fr": "fr_core_news_sm", "it": "it_core_news_sm", + "nl": "nl_core_news_sm", "el": "el_core_news_sm", "nb": "nb_core_news_sm", + "lt": "lt_core_news_sm", "xx": "xx_ent_wiki_sm" +} # fmt: on @@ -612,16 +725,12 @@ class MatchPatternError(ValueError): def __init__(self, key, errors): """Custom error for validating match patterns. - key (unicode): The name of the matcher rule. + key (str): The name of the matcher rule. errors (dict): Validation errors (sequence of strings) mapped to pattern ID, i.e. the index of the added pattern. """ - msg = "Invalid token patterns for matcher rule '{}'\n".format(key) + msg = f"Invalid token patterns for matcher rule '{key}'\n" for pattern_idx, error_msgs in errors.items(): - pattern_errors = "\n".join(["- {}".format(e) for e in error_msgs]) - msg += "\nPattern {}:\n{}\n".format(pattern_idx, pattern_errors) + pattern_errors = "\n".join([f"- {e}" for e in error_msgs]) + msg += f"\nPattern {pattern_idx}:\n{pattern_errors}\n" ValueError.__init__(self, msg) - - -class AlignmentError(ValueError): - pass diff --git a/spacy/glossary.py b/spacy/glossary.py index 44a8277da..c4a6a5c45 100644 --- a/spacy/glossary.py +++ b/spacy/glossary.py @@ -1,12 +1,8 @@ -# coding: utf8 -from __future__ import unicode_literals - - def explain(term): """Get a description for a given POS tag, dependency label or entity type. - term (unicode): The term to explain. - RETURNS (unicode): The explanation, or `None` if not found in the glossary. + term (str): The term to explain. + RETURNS (str): The explanation, or `None` if not found in the glossary. 
EXAMPLE: >>> spacy.explain(u'NORP') diff --git a/spacy/gold.pxd b/spacy/gold.pxd deleted file mode 100644 index 20a25a939..000000000 --- a/spacy/gold.pxd +++ /dev/null @@ -1,41 +0,0 @@ -from cymem.cymem cimport Pool - -from .structs cimport TokenC -from .typedefs cimport attr_t -from .syntax.transition_system cimport Transition - - -cdef struct GoldParseC: - int* tags - int* heads - int* has_dep - int* sent_start - attr_t* labels - int** brackets - Transition* ner - - -cdef class GoldParse: - cdef Pool mem - - cdef GoldParseC c - - cdef int length - cdef public int loss - cdef public list words - cdef public list tags - cdef public list morphology - cdef public list heads - cdef public list labels - cdef public dict orths - cdef public list ner - cdef public list ents - cdef public dict brackets - cdef public object cats - cdef public dict links - - cdef readonly list cand_to_gold - cdef readonly list gold_to_cand - cdef readonly list orig_annot - - diff --git a/spacy/gold.pyx b/spacy/gold.pyx deleted file mode 100644 index e69ff5933..000000000 --- a/spacy/gold.pyx +++ /dev/null @@ -1,1004 +0,0 @@ -# cython: profile=True -# coding: utf8 -from __future__ import unicode_literals, print_function - -import re -import random -import numpy -import tempfile -import shutil -import itertools -from pathlib import Path -import srsly -import warnings - -from .syntax import nonproj -from .tokens import Doc, Span -from .errors import Errors, AlignmentError, Warnings -from .compat import path2str -from . import util -from .util import minibatch, itershuffle - -from libc.stdio cimport FILE, fopen, fclose, fread, fwrite, feof, fseek - - -punct_re = re.compile(r"\W") - - -def tags_to_entities(tags): - entities = [] - start = None - for i, tag in enumerate(tags): - if tag is None: - continue - if tag.startswith("O"): - # TODO: We shouldn't be getting these malformed inputs. Fix this. - if start is not None: - start = None - continue - elif tag == "-": - continue - elif tag.startswith("I"): - if start is None: - raise ValueError(Errors.E067.format(tags=tags[:i + 1])) - continue - if tag.startswith("U"): - entities.append((tag[2:], i, i)) - elif tag.startswith("B"): - start = i - elif tag.startswith("L"): - entities.append((tag[2:], start, i)) - start = None - else: - raise ValueError(Errors.E068.format(tag=tag)) - return entities - - -def merge_sents(sents): - m_deps = [[], [], [], [], [], []] - m_cats = {} - m_brackets = [] - i = 0 - for (ids, words, tags, heads, labels, ner), (cats, brackets) in sents: - m_deps[0].extend(id_ + i for id_ in ids) - m_deps[1].extend(words) - m_deps[2].extend(tags) - m_deps[3].extend(head + i for head in heads) - m_deps[4].extend(labels) - m_deps[5].extend(ner) - m_brackets.extend((b["first"] + i, b["last"] + i, b["label"]) - for b in brackets) - m_cats.update(cats) - i += len(ids) - return [(m_deps, (m_cats, m_brackets))] - - -def _normalize_for_alignment(tokens): - return [w.replace(" ", "").lower() for w in tokens] - - -def align(tokens_a, tokens_b): - """Calculate alignment tables between two tokenizations. - - tokens_a (List[str]): The candidate tokenization. - tokens_b (List[str]): The reference tokenization. - RETURNS: (tuple): A 5-tuple consisting of the following information: - * cost (int): The number of misaligned tokens. - * a2b (List[int]): Mapping of indices in `tokens_a` to indices in `tokens_b`. - For instance, if `a2b[4] == 6`, that means that `tokens_a[4]` aligns - to `tokens_b[6]`. If there's no one-to-one alignment for a token, - it has the value -1. 
- * b2a (List[int]): The same as `a2b`, but mapping the other direction. - * a2b_multi (Dict[int, int]): A dictionary mapping indices in `tokens_a` - to indices in `tokens_b`, where multiple tokens of `tokens_a` align to - the same token of `tokens_b`. - * b2a_multi (Dict[int, int]): As with `a2b_multi`, but mapping the other - direction. - """ - tokens_a = _normalize_for_alignment(tokens_a) - tokens_b = _normalize_for_alignment(tokens_b) - cost = 0 - a2b = numpy.empty(len(tokens_a), dtype="i") - b2a = numpy.empty(len(tokens_b), dtype="i") - a2b.fill(-1) - b2a.fill(-1) - a2b_multi = {} - b2a_multi = {} - i = 0 - j = 0 - offset_a = 0 - offset_b = 0 - while i < len(tokens_a) and j < len(tokens_b): - a = tokens_a[i][offset_a:] - b = tokens_b[j][offset_b:] - if a == b: - if offset_a == offset_b == 0: - a2b[i] = j - b2a[j] = i - elif offset_a == 0: - cost += 2 - a2b_multi[i] = j - elif offset_b == 0: - cost += 2 - b2a_multi[j] = i - offset_a = offset_b = 0 - i += 1 - j += 1 - elif a == "": - assert offset_a == 0 - cost += 1 - i += 1 - elif b == "": - assert offset_b == 0 - cost += 1 - j += 1 - elif b.startswith(a): - cost += 1 - if offset_a == 0: - a2b_multi[i] = j - i += 1 - offset_a = 0 - offset_b += len(a) - elif a.startswith(b): - cost += 1 - if offset_b == 0: - b2a_multi[j] = i - j += 1 - offset_b = 0 - offset_a += len(b) - else: - assert "".join(tokens_a) != "".join(tokens_b) - raise AlignmentError(Errors.E186.format(tok_a=tokens_a, tok_b=tokens_b)) - return cost, a2b, b2a, a2b_multi, b2a_multi - - -class GoldCorpus(object): - """An annotated corpus, using the JSON file format. Manages - annotations for tagging, dependency parsing and NER. - - DOCS: https://spacy.io/api/goldcorpus - """ - def __init__(self, train, dev, gold_preproc=False, limit=None): - """Create a GoldCorpus. - - train_path (unicode or Path): File or directory of training data. - dev_path (unicode or Path): File or directory of development data. - RETURNS (GoldCorpus): The newly created object. 
- """ - self.limit = limit - if isinstance(train, str) or isinstance(train, Path): - train = self.read_tuples(self.walk_corpus(train)) - dev = self.read_tuples(self.walk_corpus(dev)) - # Write temp directory with one doc per file, so we can shuffle and stream - self.tmp_dir = Path(tempfile.mkdtemp()) - self.write_msgpack(self.tmp_dir / "train", train, limit=self.limit) - self.write_msgpack(self.tmp_dir / "dev", dev, limit=self.limit) - - def __del__(self): - shutil.rmtree(path2str(self.tmp_dir)) - - @staticmethod - def write_msgpack(directory, doc_tuples, limit=0): - if not directory.exists(): - directory.mkdir() - n = 0 - for i, doc_tuple in enumerate(doc_tuples): - srsly.write_msgpack(directory / "{}.msg".format(i), [doc_tuple]) - n += len(doc_tuple[1]) - if limit and n >= limit: - break - - @staticmethod - def walk_corpus(path): - path = util.ensure_path(path) - if not path.is_dir(): - return [path] - paths = [path] - locs = [] - seen = set() - for path in paths: - if str(path) in seen: - continue - seen.add(str(path)) - if path.parts[-1].startswith("."): - continue - elif path.is_dir(): - paths.extend(path.iterdir()) - elif path.parts[-1].endswith((".json", ".jsonl")): - locs.append(path) - return locs - - @staticmethod - def read_tuples(locs, limit=0): - i = 0 - for loc in locs: - loc = util.ensure_path(loc) - if loc.parts[-1].endswith("json"): - gold_tuples = read_json_file(loc) - elif loc.parts[-1].endswith("jsonl"): - gold_tuples = srsly.read_jsonl(loc) - first_gold_tuple = next(gold_tuples) - gold_tuples = itertools.chain([first_gold_tuple], gold_tuples) - # TODO: proper format checks with schemas - if isinstance(first_gold_tuple, dict): - gold_tuples = read_json_object(gold_tuples) - elif loc.parts[-1].endswith("msg"): - gold_tuples = srsly.read_msgpack(loc) - else: - supported = ("json", "jsonl", "msg") - raise ValueError(Errors.E124.format(path=path2str(loc), formats=supported)) - for item in gold_tuples: - yield item - i += len(item[1]) - if limit and i >= limit: - return - - @property - def dev_tuples(self): - locs = (self.tmp_dir / "dev").iterdir() - yield from self.read_tuples(locs, limit=self.limit) - - @property - def train_tuples(self): - locs = (self.tmp_dir / "train").iterdir() - yield from self.read_tuples(locs, limit=self.limit) - - def count_train(self): - n = 0 - i = 0 - for raw_text, paragraph_tuples in self.train_tuples: - for sent_tuples, brackets in paragraph_tuples: - n += len(sent_tuples[1]) - if self.limit and i >= self.limit: - break - i += 1 - return n - - def train_docs(self, nlp, gold_preproc=False, max_length=None, - noise_level=0.0, orth_variant_level=0.0, - ignore_misaligned=False): - locs = list((self.tmp_dir / 'train').iterdir()) - random.shuffle(locs) - train_tuples = self.read_tuples(locs, limit=self.limit) - gold_docs = self.iter_gold_docs(nlp, train_tuples, gold_preproc, - max_length=max_length, - noise_level=noise_level, - orth_variant_level=orth_variant_level, - make_projective=True, - ignore_misaligned=ignore_misaligned) - yield from gold_docs - - def train_docs_without_preprocessing(self, nlp, gold_preproc=False): - gold_docs = self.iter_gold_docs(nlp, self.train_tuples, gold_preproc=gold_preproc) - yield from gold_docs - - def dev_docs(self, nlp, gold_preproc=False, ignore_misaligned=False): - gold_docs = self.iter_gold_docs(nlp, self.dev_tuples, gold_preproc=gold_preproc, - ignore_misaligned=ignore_misaligned) - yield from gold_docs - - @classmethod - def iter_gold_docs(cls, nlp, tuples, gold_preproc, max_length=None, - noise_level=0.0, 
orth_variant_level=0.0, make_projective=False, - ignore_misaligned=False): - for raw_text, paragraph_tuples in tuples: - if gold_preproc: - raw_text = None - else: - paragraph_tuples = merge_sents(paragraph_tuples) - docs, paragraph_tuples = cls._make_docs(nlp, raw_text, - paragraph_tuples, gold_preproc, noise_level=noise_level, - orth_variant_level=orth_variant_level) - golds = cls._make_golds(docs, paragraph_tuples, make_projective, - ignore_misaligned=ignore_misaligned) - for doc, gold in zip(docs, golds): - if gold is not None: - if (not max_length) or len(doc) < max_length: - yield doc, gold - - @classmethod - def _make_docs(cls, nlp, raw_text, paragraph_tuples, gold_preproc, noise_level=0.0, orth_variant_level=0.0): - if raw_text is not None: - raw_text, paragraph_tuples = make_orth_variants(nlp, raw_text, paragraph_tuples, orth_variant_level=orth_variant_level) - raw_text = add_noise(raw_text, noise_level) - return [nlp.make_doc(raw_text)], paragraph_tuples - else: - docs = [] - raw_text, paragraph_tuples = make_orth_variants(nlp, None, paragraph_tuples, orth_variant_level=orth_variant_level) - return [Doc(nlp.vocab, words=add_noise(sent_tuples[1], noise_level)) - for (sent_tuples, brackets) in paragraph_tuples], paragraph_tuples - - - @classmethod - def _make_golds(cls, docs, paragraph_tuples, make_projective, ignore_misaligned=False): - if len(docs) != len(paragraph_tuples): - n_annots = len(paragraph_tuples) - raise ValueError(Errors.E070.format(n_docs=len(docs), n_annots=n_annots)) - golds = [] - for doc, (sent_tuples, (cats, brackets)) in zip(docs, paragraph_tuples): - try: - gold = GoldParse.from_annot_tuples(doc, sent_tuples, cats=cats, - make_projective=make_projective) - except AlignmentError: - if ignore_misaligned: - gold = None - else: - raise - golds.append(gold) - return golds - - -def make_orth_variants(nlp, raw, paragraph_tuples, orth_variant_level=0.0): - if random.random() >= orth_variant_level: - return raw, paragraph_tuples - raw_orig = str(raw) - lower = False - if random.random() >= 0.5: - lower = True - if raw is not None: - raw = raw.lower() - ndsv = nlp.Defaults.single_orth_variants - ndpv = nlp.Defaults.paired_orth_variants - # modify words in paragraph_tuples - variant_paragraph_tuples = [] - for sent_tuples, brackets in paragraph_tuples: - ids, words, tags, heads, labels, ner = sent_tuples - if lower: - words = [w.lower() for w in words] - # single variants - punct_choices = [random.choice(x["variants"]) for x in ndsv] - for word_idx in range(len(words)): - for punct_idx in range(len(ndsv)): - if tags[word_idx] in ndsv[punct_idx]["tags"] \ - and words[word_idx] in ndsv[punct_idx]["variants"]: - words[word_idx] = punct_choices[punct_idx] - # paired variants - punct_choices = [random.choice(x["variants"]) for x in ndpv] - for word_idx in range(len(words)): - for punct_idx in range(len(ndpv)): - if tags[word_idx] in ndpv[punct_idx]["tags"] \ - and words[word_idx] in itertools.chain.from_iterable(ndpv[punct_idx]["variants"]): - # backup option: random left vs. 
right from pair - pair_idx = random.choice([0, 1]) - # best option: rely on paired POS tags like `` / '' - if len(ndpv[punct_idx]["tags"]) == 2: - pair_idx = ndpv[punct_idx]["tags"].index(tags[word_idx]) - # next best option: rely on position in variants - # (may not be unambiguous, so order of variants matters) - else: - for pair in ndpv[punct_idx]["variants"]: - if words[word_idx] in pair: - pair_idx = pair.index(words[word_idx]) - words[word_idx] = punct_choices[punct_idx][pair_idx] - - variant_paragraph_tuples.append(((ids, words, tags, heads, labels, ner), brackets)) - # modify raw to match variant_paragraph_tuples - if raw is not None: - variants = [] - for single_variants in ndsv: - variants.extend(single_variants["variants"]) - for paired_variants in ndpv: - variants.extend(list(itertools.chain.from_iterable(paired_variants["variants"]))) - # store variants in reverse length order to be able to prioritize - # longer matches (e.g., "---" before "--") - variants = sorted(variants, key=lambda x: len(x)) - variants.reverse() - variant_raw = "" - raw_idx = 0 - # add initial whitespace - while raw_idx < len(raw) and re.match("\s", raw[raw_idx]): - variant_raw += raw[raw_idx] - raw_idx += 1 - for sent_tuples, brackets in variant_paragraph_tuples: - ids, words, tags, heads, labels, ner = sent_tuples - for word in words: - match_found = False - # skip whitespace words - if word.isspace(): - match_found = True - # add identical word - elif word not in variants and raw[raw_idx:].startswith(word): - variant_raw += word - raw_idx += len(word) - match_found = True - # add variant word - else: - for variant in variants: - if not match_found and \ - raw[raw_idx:].startswith(variant): - raw_idx += len(variant) - variant_raw += word - match_found = True - # something went wrong, abort - # (add a warning message?) - if not match_found: - return raw_orig, paragraph_tuples - # add following whitespace - while raw_idx < len(raw) and re.match("\s", raw[raw_idx]): - variant_raw += raw[raw_idx] - raw_idx += 1 - return variant_raw, variant_paragraph_tuples - return raw, variant_paragraph_tuples - - -def add_noise(orig, noise_level): - if random.random() >= noise_level: - return orig - elif type(orig) == list: - corrupted = [_corrupt(word, noise_level) for word in orig] - corrupted = [w for w in corrupted if w] - return corrupted - else: - return "".join(_corrupt(c, noise_level) for c in orig) - - -def _corrupt(c, noise_level): - if random.random() >= noise_level: - return c - elif c in [".", "'", "!", "?", ","]: - return "\n" - else: - return c.lower() - - -def read_json_object(json_corpus_section): - """Take a list of JSON-formatted documents (e.g. from an already loaded - training data file) and yield tuples in the GoldParse format. - - json_corpus_section (list): The data. - YIELDS (tuple): The reformatted data. - """ - for json_doc in json_corpus_section: - tuple_doc = json_to_tuple(json_doc) - for tuple_paragraph in tuple_doc: - yield tuple_paragraph - - -def json_to_tuple(doc): - """Convert an item in the JSON-formatted training data to the tuple format - used by GoldParse. - - doc (dict): One entry in the training data. - YIELDS (tuple): The reformatted data. 
- """ - paragraphs = [] - for paragraph in doc["paragraphs"]: - sents = [] - cats = {} - for cat in paragraph.get("cats", {}): - cats[cat["label"]] = cat["value"] - for sent in paragraph["sentences"]: - words = [] - ids = [] - tags = [] - heads = [] - labels = [] - ner = [] - for i, token in enumerate(sent["tokens"]): - words.append(token["orth"]) - ids.append(i) - tags.append(token.get('tag', "-")) - heads.append(token.get("head", 0) + i) - labels.append(token.get("dep", "")) - # Ensure ROOT label is case-insensitive - if labels[-1].lower() == "root": - labels[-1] = "ROOT" - ner.append(token.get("ner", "-")) - sents.append([ - [ids, words, tags, heads, labels, ner], - [cats, sent.get("brackets", [])]]) - if sents: - yield [paragraph.get("raw", None), sents] - - -def read_json_file(loc, docs_filter=None, limit=None): - loc = util.ensure_path(loc) - if loc.is_dir(): - for filename in loc.iterdir(): - yield from read_json_file(loc / filename, limit=limit) - else: - for doc in _json_iterate(loc): - if docs_filter is not None and not docs_filter(doc): - continue - for json_tuple in json_to_tuple(doc): - yield json_tuple - - -def _json_iterate(loc): - # We should've made these files jsonl...But since we didn't, parse out - # the docs one-by-one to reduce memory usage. - # It's okay to read in the whole file -- just don't parse it into JSON. - cdef bytes py_raw - loc = util.ensure_path(loc) - with loc.open("rb") as file_: - py_raw = file_.read() - cdef long file_length = len(py_raw) - if file_length > 2 ** 30: - warnings.warn(Warnings.W027.format(size=file_length)) - - raw = py_raw - cdef int square_depth = 0 - cdef int curly_depth = 0 - cdef int inside_string = 0 - cdef int escape = 0 - cdef long start = -1 - cdef char c - cdef char quote = ord('"') - cdef char backslash = ord("\\") - cdef char open_square = ord("[") - cdef char close_square = ord("]") - cdef char open_curly = ord("{") - cdef char close_curly = ord("}") - for i in range(file_length): - c = raw[i] - if escape: - escape = False - continue - if c == backslash: - escape = True - continue - if c == quote: - inside_string = not inside_string - continue - if inside_string: - continue - if c == open_square: - square_depth += 1 - elif c == close_square: - square_depth -= 1 - elif c == open_curly: - if square_depth == 1 and curly_depth == 0: - start = i - curly_depth += 1 - elif c == close_curly: - curly_depth -= 1 - if square_depth == 1 and curly_depth == 0: - py_str = py_raw[start : i + 1].decode("utf8") - try: - yield srsly.json_loads(py_str) - except Exception: - print(py_str) - raise - start = -1 - - -def iob_to_biluo(tags): - out = [] - tags = list(tags) - while tags: - out.extend(_consume_os(tags)) - out.extend(_consume_ent(tags)) - return out - - -def _consume_os(tags): - while tags and tags[0] == "O": - yield tags.pop(0) - - -def _consume_ent(tags): - if not tags: - return [] - tag = tags.pop(0) - target_in = "I" + tag[1:] - target_last = "L" + tag[1:] - length = 1 - while tags and tags[0] in {target_in, target_last}: - length += 1 - tags.pop(0) - label = tag[2:] - if length == 1: - if len(label) == 0: - raise ValueError(Errors.E177.format(tag=tag)) - return ["U-" + label] - else: - start = "B-" + label - end = "L-" + label - middle = ["I-%s" % label for _ in range(1, length - 1)] - return [start] + middle + [end] - - -cdef class GoldParse: - """Collection for training annotations. 
- - DOCS: https://spacy.io/api/goldparse - """ - @classmethod - def from_annot_tuples(cls, doc, annot_tuples, cats=None, make_projective=False): - _, words, tags, heads, deps, entities = annot_tuples - return cls(doc, words=words, tags=tags, heads=heads, deps=deps, - entities=entities, cats=cats, - make_projective=make_projective) - - def __init__(self, doc, annot_tuples=None, words=None, tags=None, morphology=None, - heads=None, deps=None, entities=None, make_projective=False, - cats=None, links=None, **_): - """Create a GoldParse. The fields will not be initialized if len(doc) is zero. - - doc (Doc): The document the annotations refer to. - words (iterable): A sequence of unicode word strings. - tags (iterable): A sequence of strings, representing tag annotations. - heads (iterable): A sequence of integers, representing syntactic - head offsets. - deps (iterable): A sequence of strings, representing the syntactic - relation types. - entities (iterable): A sequence of named entity annotations, either as - BILUO tag strings, or as `(start_char, end_char, label)` tuples, - representing the entity positions. - cats (dict): Labels for text classification. Each key in the dictionary - may be a string or an int, or a `(start_char, end_char, label)` - tuple, indicating that the label is applied to only part of the - document (usually a sentence). Unlike entity annotations, label - annotations can overlap, i.e. a single word can be covered by - multiple labelled spans. The TextCategorizer component expects - true examples of a label to have the value 1.0, and negative - examples of a label to have the value 0.0. Labels not in the - dictionary are treated as missing - the gradient for those labels - will be zero. - links (dict): A dict with `(start_char, end_char)` keys, - and the values being dicts with kb_id:value entries, - representing the external IDs in a knowledge base (KB) - mapped to either 1.0 or 0.0, indicating positive and - negative examples respectively. - make_projective (bool): Whether to projectivize the dependency tree. - RETURNS (GoldParse): The newly constructed object. - """ - self.mem = Pool() - self.loss = 0 - self.length = len(doc) - - self.cats = {} if cats is None else dict(cats) - self.links = links - - # orig_annot is used as an iterator in `nlp.evalate` even if self.length == 0, - # so set a empty list to avoid error. - # if self.lenght > 0, this is modified latter. - self.orig_annot = [] - - # temporary doc for aligning entity annotation - entdoc = None - - # avoid allocating memory if the doc does not contain any tokens - if self.length == 0: - self.words = [] - self.tags = [] - self.heads = [] - self.labels = [] - self.ner = [] - self.morphology = [] - - else: - if words is None: - words = [token.text for token in doc] - if tags is None: - tags = [None for _ in words] - if heads is None: - heads = [None for _ in words] - if deps is None: - deps = [None for _ in words] - if morphology is None: - morphology = [None for _ in words] - if entities is None: - entities = ["-" for _ in words] - elif len(entities) == 0: - entities = ["O" for _ in words] - else: - # Translate the None values to '-', to make processing easier. - # See Issue #2603 - entities = [(ent if ent is not None else "-") for ent in entities] - if not isinstance(entities[0], basestring): - # Assume we have entities specified by character offset. - # Create a temporary Doc corresponding to provided words - # (to preserve gold tokenization) and text (to preserve - # character offsets). 
- entdoc_words, entdoc_spaces = util.get_words_and_spaces(words, doc.text) - entdoc = Doc(doc.vocab, words=entdoc_words, spaces=entdoc_spaces) - entdoc_entities = biluo_tags_from_offsets(entdoc, entities) - # There may be some additional whitespace tokens in the - # temporary doc, so check that the annotations align with - # the provided words while building a list of BILUO labels. - entities = [] - words_offset = 0 - for i in range(len(entdoc_words)): - if words[i + words_offset] == entdoc_words[i]: - entities.append(entdoc_entities[i]) - else: - words_offset -= 1 - if len(entities) != len(words): - warnings.warn(Warnings.W029.format(text=doc.text)) - entities = ["-" for _ in words] - - # These are filled by the tagger/parser/entity recogniser - self.c.tags = self.mem.alloc(len(doc), sizeof(int)) - self.c.heads = self.mem.alloc(len(doc), sizeof(int)) - self.c.labels = self.mem.alloc(len(doc), sizeof(attr_t)) - self.c.has_dep = self.mem.alloc(len(doc), sizeof(int)) - self.c.sent_start = self.mem.alloc(len(doc), sizeof(int)) - self.c.ner = self.mem.alloc(len(doc), sizeof(Transition)) - - self.words = [None] * len(doc) - self.tags = [None] * len(doc) - self.heads = [None] * len(doc) - self.labels = [None] * len(doc) - self.ner = [None] * len(doc) - self.morphology = [None] * len(doc) - - # This needs to be done before we align the words - if make_projective and heads is not None and deps is not None: - heads, deps = nonproj.projectivize(heads, deps) - - # Do many-to-one alignment for misaligned tokens. - # If we over-segment, we'll have one gold word that covers a sequence - # of predicted words - # If we under-segment, we'll have one predicted word that covers a - # sequence of gold words. - # If we "mis-segment", we'll have a sequence of predicted words covering - # a sequence of gold words. That's many-to-many -- we don't do that - # except for NER spans where the start and end can be aligned. 
- cost, i2j, j2i, i2j_multi, j2i_multi = align([t.orth_ for t in doc], words) - - self.cand_to_gold = [(j if j >= 0 else None) for j in i2j] - self.gold_to_cand = [(i if i >= 0 else None) for i in j2i] - - annot_tuples = (range(len(words)), words, tags, heads, deps, entities) - self.orig_annot = list(zip(*annot_tuples)) - - for i, gold_i in enumerate(self.cand_to_gold): - if doc[i].text.isspace(): - self.words[i] = doc[i].text - self.tags[i] = "_SP" - self.heads[i] = None - self.labels[i] = None - self.ner[i] = None - self.morphology[i] = set() - if gold_i is None: - if i in i2j_multi: - self.words[i] = words[i2j_multi[i]] - self.tags[i] = tags[i2j_multi[i]] - self.morphology[i] = morphology[i2j_multi[i]] - is_last = i2j_multi[i] != i2j_multi.get(i+1) - # Set next word in multi-token span as head, until last - if not is_last: - self.heads[i] = i+1 - self.labels[i] = "subtok" - else: - head_i = heads[i2j_multi[i]] - if head_i: - self.heads[i] = self.gold_to_cand[head_i] - self.labels[i] = deps[i2j_multi[i]] - ner_tag = entities[i2j_multi[i]] - # Assign O/- for many-to-one O/- NER tags - if ner_tag in ("O", "-"): - self.ner[i] = ner_tag - else: - self.words[i] = words[gold_i] - self.tags[i] = tags[gold_i] - self.morphology[i] = morphology[gold_i] - if heads[gold_i] is None: - self.heads[i] = None - else: - self.heads[i] = self.gold_to_cand[heads[gold_i]] - self.labels[i] = deps[gold_i] - self.ner[i] = entities[gold_i] - # Assign O/- for one-to-many O/- NER tags - for j, cand_j in enumerate(self.gold_to_cand): - if cand_j is None: - if j in j2i_multi: - i = j2i_multi[j] - ner_tag = entities[j] - if ner_tag in ("O", "-"): - self.ner[i] = ner_tag - - # If there is entity annotation and some tokens remain unaligned, - # align all entities at the character level to account for all - # possible token misalignments within the entity spans - if any([e not in ("O", "-") for e in entities]) and None in self.ner: - # If the temporary entdoc wasn't created above, initialize it - if not entdoc: - entdoc_words, entdoc_spaces = util.get_words_and_spaces(words, doc.text) - entdoc = Doc(doc.vocab, words=entdoc_words, spaces=entdoc_spaces) - # Get offsets based on gold words and BILUO entities - entdoc_offsets = offsets_from_biluo_tags(entdoc, entities) - aligned_offsets = [] - aligned_spans = [] - # Filter offsets to identify those that align with doc tokens - for offset in entdoc_offsets: - span = doc.char_span(offset[0], offset[1]) - if span and not span.text.isspace(): - aligned_offsets.append(offset) - aligned_spans.append(span) - # Convert back to BILUO for doc tokens and assign NER for all - # aligned spans - biluo_tags = biluo_tags_from_offsets(doc, aligned_offsets, missing=None) - for span in aligned_spans: - for i in range(span.start, span.end): - self.ner[i] = biluo_tags[i] - - # Prevent whitespace that isn't within entities from being tagged as - # an entity. - for i in range(len(self.ner)): - if self.tags[i] == "_SP": - prev_ner = self.ner[i-1] if i >= 1 else None - next_ner = self.ner[i+1] if (i+1) < len(self.ner) else None - if prev_ner == "O" or next_ner == "O": - self.ner[i] = "O" - - cycle = nonproj.contains_cycle(self.heads) - if cycle is not None: - raise ValueError(Errors.E069.format(cycle=cycle, - cycle_tokens=" ".join(["'{}'".format(self.words[tok_id]) for tok_id in cycle]), - doc_tokens=" ".join(words[:50]))) - - def __len__(self): - """Get the number of gold-standard tokens. - - RETURNS (int): The number of gold-standard tokens. 
- """ - return self.length - - @property - def is_projective(self): - """Whether the provided syntactic annotations form a projective - dependency tree. - """ - return not nonproj.is_nonproj_tree(self.heads) - - property sent_starts: - def __get__(self): - return [self.c.sent_start[i] for i in range(self.length)] - - def __set__(self, sent_starts): - for gold_i, is_sent_start in enumerate(sent_starts): - i = self.gold_to_cand[gold_i] - if i is not None: - if is_sent_start in (1, True): - self.c.sent_start[i] = 1 - elif is_sent_start in (-1, False): - self.c.sent_start[i] = -1 - else: - self.c.sent_start[i] = 0 - - -def docs_to_json(docs, id=0, ner_missing_tag="O"): - """Convert a list of Doc objects into the JSON-serializable format used by - the spacy train command. - - docs (iterable / Doc): The Doc object(s) to convert. - id (int): Id for the JSON. - RETURNS (dict): The data in spaCy's JSON format - - each input doc will be treated as a paragraph in the output doc - """ - if isinstance(docs, Doc): - docs = [docs] - json_doc = {"id": id, "paragraphs": []} - for i, doc in enumerate(docs): - json_para = {'raw': doc.text, "sentences": [], "cats": []} - for cat, val in doc.cats.items(): - json_cat = {"label": cat, "value": val} - json_para["cats"].append(json_cat) - ent_offsets = [(e.start_char, e.end_char, e.label_) for e in doc.ents] - biluo_tags = biluo_tags_from_offsets(doc, ent_offsets, missing=ner_missing_tag) - for j, sent in enumerate(doc.sents): - json_sent = {"tokens": [], "brackets": []} - for token in sent: - json_token = {"id": token.i, "orth": token.text} - if doc.is_tagged: - json_token["tag"] = token.tag_ - if doc.is_parsed: - json_token["head"] = token.head.i-token.i - json_token["dep"] = token.dep_ - json_token["ner"] = biluo_tags[token.i] - json_sent["tokens"].append(json_token) - json_para["sentences"].append(json_sent) - json_doc["paragraphs"].append(json_para) - return json_doc - - -def biluo_tags_from_offsets(doc, entities, missing="O"): - """Encode labelled spans into per-token tags, using the - Begin/In/Last/Unit/Out scheme (BILUO). - - doc (Doc): The document that the entity offsets refer to. The output tags - will refer to the token boundaries within the document. - entities (iterable): A sequence of `(start, end, label)` triples. `start` - and `end` should be character-offset integers denoting the slice into - the original string. - RETURNS (list): A list of unicode strings, describing the tags. Each tag - string will be of the form either "", "O" or "{action}-{label}", where - action is one of "B", "I", "L", "U". The string "-" is used where the - entity offsets don't align with the tokenization in the `Doc` object. - The training algorithm will view these as missing values. "O" denotes a - non-entity token. "B" denotes the beginning of a multi-token entity, - "I" the inside of an entity of three or more tokens, and "L" the end - of an entity of two or more tokens. "U" denotes a single-token entity. - - EXAMPLE: - >>> text = 'I like London.' 
- >>> entities = [(len('I like '), len('I like London'), 'LOC')] - >>> doc = nlp.tokenizer(text) - >>> tags = biluo_tags_from_offsets(doc, entities) - >>> assert tags == ["O", "O", 'U-LOC', "O"] - """ - # Ensure no overlapping entity labels exist - tokens_in_ents = {} - - starts = {token.idx: token.i for token in doc} - ends = {token.idx + len(token): token.i for token in doc} - biluo = ["-" for _ in doc] - # Handle entity cases - for start_char, end_char, label in entities: - for token_index in range(start_char, end_char): - if token_index in tokens_in_ents.keys(): - raise ValueError(Errors.E103.format( - span1=(tokens_in_ents[token_index][0], - tokens_in_ents[token_index][1], - tokens_in_ents[token_index][2]), - span2=(start_char, end_char, label))) - tokens_in_ents[token_index] = (start_char, end_char, label) - - start_token = starts.get(start_char) - end_token = ends.get(end_char) - # Only interested if the tokenization is correct - if start_token is not None and end_token is not None: - if start_token == end_token: - biluo[start_token] = "U-%s" % label - else: - biluo[start_token] = "B-%s" % label - for i in range(start_token+1, end_token): - biluo[i] = "I-%s" % label - biluo[end_token] = "L-%s" % label - # Now distinguish the O cases from ones where we miss the tokenization - entity_chars = set() - for start_char, end_char, label in entities: - for i in range(start_char, end_char): - entity_chars.add(i) - for token in doc: - for i in range(token.idx, token.idx + len(token)): - if i in entity_chars: - break - else: - biluo[token.i] = missing - if "-" in biluo: - ent_str = str(entities) - warnings.warn(Warnings.W030.format( - text=doc.text[:50] + "..." if len(doc.text) > 50 else doc.text, - entities=ent_str[:50] + "..." if len(ent_str) > 50 else ent_str - )) - return biluo - - -def spans_from_biluo_tags(doc, tags): - """Encode per-token tags following the BILUO scheme into Span object, e.g. - to overwrite the doc.ents. - - doc (Doc): The document that the BILUO tags refer to. - entities (iterable): A sequence of BILUO tags with each tag describing one - token. Each tags string will be of the form of either "", "O" or - "{action}-{label}", where action is one of "B", "I", "L", "U". - RETURNS (list): A sequence of Span objects. - """ - token_offsets = tags_to_entities(tags) - spans = [] - for label, start_idx, end_idx in token_offsets: - span = Span(doc, start_idx, end_idx + 1, label=label) - spans.append(span) - return spans - - -def offsets_from_biluo_tags(doc, tags): - """Encode per-token tags following the BILUO scheme into entity offsets. - - doc (Doc): The document that the BILUO tags refer to. - entities (iterable): A sequence of BILUO tags with each tag describing one - token. Each tags string will be of the form of either "", "O" or - "{action}-{label}", where action is one of "B", "I", "L", "U". - RETURNS (list): A sequence of `(start, end, label)` triples. `start` and - `end` will be character-offset integers denoting the slice into the - original string. 
- """ - spans = spans_from_biluo_tags(doc, tags) - return [(span.start_char, span.end_char, span.label_) for span in spans] - - -def is_punct_label(label): - return label == "P" or label.lower() == "punct" diff --git a/spacy/kb.pxd b/spacy/kb.pxd index 518ce0f4e..4a71b26a2 100644 --- a/spacy/kb.pxd +++ b/spacy/kb.pxd @@ -1,15 +1,15 @@ """Knowledge-base for entity or concept linking.""" from cymem.cymem cimport Pool from preshed.maps cimport PreshMap - from libcpp.vector cimport vector from libc.stdint cimport int32_t, int64_t from libc.stdio cimport FILE from .vocab cimport Vocab from .typedefs cimport hash_t - from .structs cimport KBEntryC, AliasC + + ctypedef vector[KBEntryC] entry_vec ctypedef vector[AliasC] alias_vec ctypedef vector[float] float_vec @@ -140,7 +140,6 @@ cdef class KnowledgeBase: self._entries.push_back(entry) self._aliases_table.push_back(alias) - cpdef load_bulk(self, loc) cpdef set_entities(self, entity_list, freq_list, vector_list) diff --git a/spacy/kb.pyx b/spacy/kb.pyx index a187e63d6..10aa377eb 100644 --- a/spacy/kb.pyx +++ b/spacy/kb.pyx @@ -1,6 +1,7 @@ -# cython: infer_types=True -# cython: profile=True -# coding: utf8 +# cython: infer_types=True, profile=True +from typing import Iterator, Iterable + +import srsly from cymem.cymem cimport Pool from preshed.maps cimport PreshMap from cpython.exc cimport PyErr_SetFromErrno @@ -8,14 +9,13 @@ from libc.stdio cimport fopen, fclose, fread, fwrite, feof, fseek from libc.stdint cimport int32_t, int64_t from libcpp.vector cimport vector -import warnings -from os import path from pathlib import Path +import warnings from .typedefs cimport hash_t - from .errors import Errors, Warnings - +from . import util +from .util import SimpleFrozenList, ensure_path cdef class Candidate: """A `Candidate` object refers to a textual mention (`alias`) that may or may not be resolved @@ -23,7 +23,7 @@ cdef class Candidate: algorithm which will disambiguate the various candidates to the correct one. Each candidate (alias, entity) pair is assigned to a certain prior probability. - DOCS: https://spacy.io/api/kb/#candidate_init + DOCS: https://nightly.spacy.io/api/kb/#candidate_init """ def __init__(self, KnowledgeBase kb, entity_hash, entity_freq, entity_vector, alias_hash, prior_prob): @@ -41,7 +41,7 @@ cdef class Candidate: @property def entity_(self): - """RETURNS (unicode): ID/name of this entity in the KB""" + """RETURNS (str): ID/name of this entity in the KB""" return self.kb.vocab.strings[self.entity_hash] @property @@ -51,7 +51,7 @@ cdef class Candidate: @property def alias_(self): - """RETURNS (unicode): ID of the original alias""" + """RETURNS (str): ID of the original alias""" return self.kb.vocab.strings[self.alias_hash] @property @@ -67,22 +67,30 @@ cdef class Candidate: return self.prior_prob +def get_candidates(KnowledgeBase kb, span) -> Iterator[Candidate]: + """ + Return candidate entities for a given span by using the text of the span as the alias + and fetching appropriate entries from the index. + This particular function is optimized to work with the built-in KB functionality, + but any other custom candidate generation method can be used in combination with the KB as well. + """ + return kb.get_alias_candidates(span.text) + + cdef class KnowledgeBase: """A `KnowledgeBase` instance stores unique identifiers for entities and their textual aliases, to support entity linking of named entities to real-world concepts. 
- DOCS: https://spacy.io/api/kb + DOCS: https://nightly.spacy.io/api/kb """ - def __init__(self, Vocab vocab, entity_vector_length=64): - self.vocab = vocab + def __init__(self, Vocab vocab, entity_vector_length): + """Create a KnowledgeBase.""" self.mem = Pool() self.entity_vector_length = entity_vector_length - self._entry_index = PreshMap() self._alias_index = PreshMap() - - self.vocab.strings.add("") + self.vocab = vocab self._create_empty_vectors(dummy_hash=self.vocab.strings[""]) @property @@ -261,8 +269,7 @@ cdef class KnowledgeBase: alias_entry.probs = probs self._aliases_table[alias_index] = alias_entry - - def get_candidates(self, unicode alias): + def get_alias_candidates(self, unicode alias) -> Iterator[Candidate]: """ Return candidate entities for an alias. Each candidate defines the entity, the original alias, and the prior probability of that alias resolving to that entity. @@ -312,9 +319,30 @@ cdef class KnowledgeBase: return 0.0 + def to_disk(self, path, exclude: Iterable[str] = SimpleFrozenList()): + path = ensure_path(path) + if not path.exists(): + path.mkdir(parents=True) + if not path.is_dir(): + raise ValueError(Errors.E928.format(loc=path)) + serialize = {} + serialize["contents"] = lambda p: self.write_contents(p) + serialize["strings.json"] = lambda p: self.vocab.strings.to_disk(p) + util.to_disk(path, serialize, exclude) - def dump(self, loc): - cdef Writer writer = Writer(loc) + def from_disk(self, path, exclude: Iterable[str] = SimpleFrozenList()): + path = ensure_path(path) + if not path.exists(): + raise ValueError(Errors.E929.format(loc=path)) + if not path.is_dir(): + raise ValueError(Errors.E928.format(loc=path)) + deserialize = {} + deserialize["contents"] = lambda p: self.read_contents(p) + deserialize["strings.json"] = lambda p: self.vocab.strings.from_disk(p) + util.from_disk(path, deserialize, exclude) + + def write_contents(self, file_path): + cdef Writer writer = Writer(file_path) writer.write_header(self.get_size_entities(), self.entity_vector_length) # dumping the entity vectors in their original order @@ -353,7 +381,7 @@ cdef class KnowledgeBase: writer.close() - cpdef load_bulk(self, loc): + def read_contents(self, file_path): cdef hash_t entity_hash cdef hash_t alias_hash cdef int64_t entry_index @@ -363,7 +391,7 @@ cdef class KnowledgeBase: cdef AliasC alias cdef float vector_element - cdef Reader reader = Reader(loc) + cdef Reader reader = Reader(file_path) # STEP 0: load header and initialize KB cdef int64_t nr_entities @@ -444,15 +472,13 @@ cdef class KnowledgeBase: cdef class Writer: - def __init__(self, object loc): - if isinstance(loc, Path): - loc = bytes(loc) - if path.exists(loc): - assert not path.isdir(loc), "%s is directory." 
% loc - cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc + def __init__(self, path): + assert isinstance(path, Path) + content = bytes(path) + cdef bytes bytes_loc = content.encode('utf8') if type(content) == unicode else content self._fp = fopen(bytes_loc, 'wb') if not self._fp: - raise IOError(Errors.E146.format(path=loc)) + raise IOError(Errors.E146.format(path=path)) fseek(self._fp, 0, 0) def close(self): @@ -489,12 +515,9 @@ cdef class Writer: cdef class Reader: - def __init__(self, object loc): - if isinstance(loc, Path): - loc = bytes(loc) - assert path.exists(loc) - assert not path.isdir(loc) - cdef bytes bytes_loc = loc.encode('utf8') if type(loc) == unicode else loc + def __init__(self, path): + content = bytes(path) + cdef bytes bytes_loc = content.encode('utf8') if type(content) == unicode else content self._fp = fopen(bytes_loc, 'rb') if not self._fp: PyErr_SetFromErrno(IOError) diff --git a/spacy/lang/af/__init__.py b/spacy/lang/af/__init__.py index 90ea324f0..91917daee 100644 --- a/spacy/lang/af/__init__.py +++ b/spacy/lang/af/__init__.py @@ -1,14 +1,8 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from ...language import Language -from ...attrs import LANG class AfrikaansDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = lambda text: "af" stop_words = STOP_WORDS diff --git a/spacy/lang/af/stop_words.py b/spacy/lang/af/stop_words.py index 2b3bcc019..4b5a04a5e 100644 --- a/spacy/lang/af/stop_words.py +++ b/spacy/lang/af/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Source: https://github.com/stopwords-iso/stopwords-af STOP_WORDS = set( diff --git a/spacy/lang/ar/__init__.py b/spacy/lang/ar/__init__.py index c120703f6..6abb65efb 100644 --- a/spacy/lang/ar/__init__.py +++ b/spacy/lang/ar/__init__.py @@ -1,34 +1,21 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS from .punctuation import TOKENIZER_SUFFIXES - from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS -from ..tokenizer_exceptions import BASE_EXCEPTIONS -from ..norm_exceptions import BASE_NORMS from ...language import Language -from ...attrs import LANG, NORM -from ...util import update_exc, add_lookups class ArabicDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters.update(LEX_ATTRS) - lex_attr_getters[LANG] = lambda text: "ar" - lex_attr_getters[NORM] = add_lookups( - Language.Defaults.lex_attr_getters[NORM], BASE_NORMS - ) - tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) - stop_words = STOP_WORDS + tokenizer_exceptions = TOKENIZER_EXCEPTIONS suffixes = TOKENIZER_SUFFIXES + stop_words = STOP_WORDS + lex_attr_getters = LEX_ATTRS writing_system = {"direction": "rtl", "has_case": False, "has_letters": True} class Arabic(Language): - lang = "ar" Defaults = ArabicDefaults + lang = "ar" __all__ = ["Arabic"] diff --git a/spacy/lang/ar/examples.py b/spacy/lang/ar/examples.py index 2a10f4fcc..a51bb9ded 100644 --- a/spacy/lang/ar/examples.py +++ b/spacy/lang/ar/examples.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - """ Example sentences to test spaCy and its language models. 
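Note on the `spacy/kb.pyx` hunks above: `dump`/`load_bulk` become `to_disk`/`from_disk` (backed by `write_contents`/`read_contents`), the method `get_candidates` is renamed to `get_alias_candidates`, and a module-level `get_candidates(kb, span)` helper is added as the default, pluggable candidate generator. A minimal usage sketch of the new surface — the `add_entity`/`add_alias` signatures are assumed from the rest of `kb.pyx`, which is not part of these hunks:

```python
from spacy.kb import KnowledgeBase
from spacy.vocab import Vocab

vocab = Vocab()
kb = KnowledgeBase(vocab, entity_vector_length=3)  # length is now a required argument
kb.add_entity(entity="Q42", freq=12, entity_vector=[1.0, 2.0, 3.0])
kb.add_alias(alias="Douglas", entities=["Q42"], probabilities=[0.8])

# get_alias_candidates replaces the old KnowledgeBase.get_candidates
print([c.entity_ for c in kb.get_alias_candidates("Douglas")])

# to_disk/from_disk replace dump/load_bulk and write to a directory
kb.to_disk("/tmp/kb")
kb2 = KnowledgeBase(vocab, entity_vector_length=3)
kb2.from_disk("/tmp/kb")
```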
diff --git a/spacy/lang/ar/lex_attrs.py b/spacy/lang/ar/lex_attrs.py index 19e7aef8a..54ad7a8c3 100644 --- a/spacy/lang/ar/lex_attrs.py +++ b/spacy/lang/ar/lex_attrs.py @@ -1,5 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals from ...attrs import LIKE_NUM _num_words = set( diff --git a/spacy/lang/ar/punctuation.py b/spacy/lang/ar/punctuation.py index 6625c5475..f30204c02 100644 --- a/spacy/lang/ar/punctuation.py +++ b/spacy/lang/ar/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY from ..char_classes import UNITS, ALPHA_UPPER diff --git a/spacy/lang/ar/stop_words.py b/spacy/lang/ar/stop_words.py index de2fc7443..f4da54dda 100644 --- a/spacy/lang/ar/stop_words.py +++ b/spacy/lang/ar/stop_words.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - STOP_WORDS = set( """ من diff --git a/spacy/lang/ar/tokenizer_exceptions.py b/spacy/lang/ar/tokenizer_exceptions.py index 030daecd5..7c385bef8 100644 --- a/spacy/lang/ar/tokenizer_exceptions.py +++ b/spacy/lang/ar/tokenizer_exceptions.py @@ -1,7 +1,6 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import ORTH, LEMMA +from ..tokenizer_exceptions import BASE_EXCEPTIONS +from ...symbols import ORTH, NORM +from ...util import update_exc _exc = {} @@ -9,41 +8,41 @@ _exc = {} # Time for exc_data in [ - {LEMMA: "قبل الميلاد", ORTH: "ق.م"}, - {LEMMA: "بعد الميلاد", ORTH: "ب. م"}, - {LEMMA: "ميلادي", ORTH: ".م"}, - {LEMMA: "هجري", ORTH: ".هـ"}, - {LEMMA: "توفي", ORTH: ".ت"}, + {NORM: "قبل الميلاد", ORTH: "ق.م"}, + {NORM: "بعد الميلاد", ORTH: "ب. م"}, + {NORM: "ميلادي", ORTH: ".م"}, + {NORM: "هجري", ORTH: ".هـ"}, + {NORM: "توفي", ORTH: ".ت"}, ]: _exc[exc_data[ORTH]] = [exc_data] # Scientific abv. for exc_data in [ - {LEMMA: "صلى الله عليه وسلم", ORTH: "صلعم"}, - {LEMMA: "الشارح", ORTH: "الشـ"}, - {LEMMA: "الظاهر", ORTH: "الظـ"}, - {LEMMA: "أيضًا", ORTH: "أيضـ"}, - {LEMMA: "إلى آخره", ORTH: "إلخ"}, - {LEMMA: "انتهى", ORTH: "اهـ"}, - {LEMMA: "حدّثنا", ORTH: "ثنا"}, - {LEMMA: "حدثني", ORTH: "ثنى"}, - {LEMMA: "أنبأنا", ORTH: "أنا"}, - {LEMMA: "أخبرنا", ORTH: "نا"}, - {LEMMA: "مصدر سابق", ORTH: "م. س"}, - {LEMMA: "مصدر نفسه", ORTH: "م. ن"}, + {NORM: "صلى الله عليه وسلم", ORTH: "صلعم"}, + {NORM: "الشارح", ORTH: "الشـ"}, + {NORM: "الظاهر", ORTH: "الظـ"}, + {NORM: "أيضًا", ORTH: "أيضـ"}, + {NORM: "إلى آخره", ORTH: "إلخ"}, + {NORM: "انتهى", ORTH: "اهـ"}, + {NORM: "حدّثنا", ORTH: "ثنا"}, + {NORM: "حدثني", ORTH: "ثنى"}, + {NORM: "أنبأنا", ORTH: "أنا"}, + {NORM: "أخبرنا", ORTH: "نا"}, + {NORM: "مصدر سابق", ORTH: "م. س"}, + {NORM: "مصدر نفسه", ORTH: "م. ن"}, ]: _exc[exc_data[ORTH]] = [exc_data] # Other abv. 
for exc_data in [ - {LEMMA: "دكتور", ORTH: "د."}, - {LEMMA: "أستاذ دكتور", ORTH: "أ.د"}, - {LEMMA: "أستاذ", ORTH: "أ."}, - {LEMMA: "بروفيسور", ORTH: "ب."}, + {NORM: "دكتور", ORTH: "د."}, + {NORM: "أستاذ دكتور", ORTH: "أ.د"}, + {NORM: "أستاذ", ORTH: "أ."}, + {NORM: "بروفيسور", ORTH: "ب."}, ]: _exc[exc_data[ORTH]] = [exc_data] -for exc_data in [{LEMMA: "تلفون", ORTH: "ت."}, {LEMMA: "صندوق بريد", ORTH: "ص.ب"}]: +for exc_data in [{NORM: "تلفون", ORTH: "ت."}, {NORM: "صندوق بريد", ORTH: "ص.ب"}]: _exc[exc_data[ORTH]] = [exc_data] -TOKENIZER_EXCEPTIONS = _exc +TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc) diff --git a/spacy/lang/bg/__init__.py b/spacy/lang/bg/__init__.py index 9b4c647e3..a30f49ce7 100644 --- a/spacy/lang/bg/__init__.py +++ b/spacy/lang/bg/__init__.py @@ -1,14 +1,8 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from ...language import Language -from ...attrs import LANG class BulgarianDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = lambda text: "bg" stop_words = STOP_WORDS diff --git a/spacy/lang/bg/examples.py b/spacy/lang/bg/examples.py index b08b8926d..a6d40da1a 100644 --- a/spacy/lang/bg/examples.py +++ b/spacy/lang/bg/examples.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/bg/stop_words.py b/spacy/lang/bg/stop_words.py index e7c65cbc2..aae7692a2 100644 --- a/spacy/lang/bg/stop_words.py +++ b/spacy/lang/bg/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Source: https://github.com/Alir3z4/stop-words STOP_WORDS = set( diff --git a/spacy/lang/bn/__init__.py b/spacy/lang/bn/__init__.py index 7da50ff2d..879229888 100644 --- a/spacy/lang/bn/__init__.py +++ b/spacy/lang/bn/__init__.py @@ -1,24 +1,18 @@ -# coding: utf8 -from __future__ import unicode_literals - +from typing import Optional +from thinc.api import Model from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES, TOKENIZER_INFIXES from .stop_words import STOP_WORDS - -from ..tokenizer_exceptions import BASE_EXCEPTIONS from ...language import Language -from ...attrs import LANG -from ...util import update_exc +from ...pipeline import Lemmatizer class BengaliDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = lambda text: "bn" - tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) - stop_words = STOP_WORDS + tokenizer_exceptions = TOKENIZER_EXCEPTIONS prefixes = TOKENIZER_PREFIXES suffixes = TOKENIZER_SUFFIXES infixes = TOKENIZER_INFIXES + stop_words = STOP_WORDS class Bengali(Language): @@ -26,4 +20,14 @@ class Bengali(Language): Defaults = BengaliDefaults +@Bengali.factory( + "lemmatizer", + assigns=["token.lemma"], + default_config={"model": None, "mode": "rule"}, + default_score_weights={"lemma_acc": 1.0}, +) +def make_lemmatizer(nlp: Language, model: Optional[Model], name: str, mode: str): + return Lemmatizer(nlp.vocab, model, name, mode=mode) + + __all__ = ["Bengali"] diff --git a/spacy/lang/bn/examples.py b/spacy/lang/bn/examples.py index 2d5bdb238..c3be4c556 100644 --- a/spacy/lang/bn/examples.py +++ b/spacy/lang/bn/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
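The Bengali `__init__.py` above swaps the old `create_lemmatizer` hook for a `@Bengali.factory("lemmatizer", ...)` registration. A short sketch of how that factory is picked up at pipeline-construction time; it assumes the rule tables for Bengali are available via `spacy-lookups-data`, otherwise `initialize()` will complain about missing lookups:

```python
import spacy

nlp = spacy.blank("bn")
# "lemmatizer" resolves to the factory registered above; the default config
# selects the rule-based mode with no model.
nlp.add_pipe("lemmatizer", config={"mode": "rule"})
nlp.initialize()  # loads the lookups tables the rule lemmatizer needs
doc = nlp("আমি বাংলায় গান গাই")
print([token.lemma_ for token in doc])
```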
diff --git a/spacy/lang/bn/morph_rules.py b/spacy/lang/bn/morph_rules.py deleted file mode 100644 index 21a76c7e6..000000000 --- a/spacy/lang/bn/morph_rules.py +++ /dev/null @@ -1,266 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import LEMMA, PRON_LEMMA - - -MORPH_RULES = { - "PRP": { - "ঐ": {LEMMA: PRON_LEMMA, "PronType": "Dem"}, - "ওই": {LEMMA: PRON_LEMMA, "PronType": "Dem"}, - "আমাকে": { - LEMMA: PRON_LEMMA, - "Number": "Sing", - "Person": "One", - "PronType": "Prs", - "Case": "Acc", - }, - "কি": { - LEMMA: PRON_LEMMA, - "Number": "Sing", - "Gender": "Neut", - "PronType": "Int", - "Case": "Acc", - }, - "সে": { - LEMMA: PRON_LEMMA, - "Number": "Sing", - "Person": "Three", - "PronType": "Prs", - "Case": "Nom", - }, - "কিসে": { - LEMMA: PRON_LEMMA, - "Number": "Sing", - "Gender": "Neut", - "PronType": "Int", - "Case": "Acc", - }, - "তাকে": { - LEMMA: PRON_LEMMA, - "Number": "Sing", - "Person": "Three", - "PronType": "Prs", - "Case": "Acc", - }, - "স্বয়ং": {LEMMA: PRON_LEMMA, "Reflex": "Yes", "PronType": "Ref"}, - "কোনগুলো": { - LEMMA: PRON_LEMMA, - "Number": "Plur", - "Gender": "Neut", - "PronType": "Int", - "Case": "Acc", - }, - "তুমি": { - LEMMA: PRON_LEMMA, - "Number": "Sing", - "Person": "Two", - "PronType": "Prs", - "Case": "Nom", - }, - "তুই": { - LEMMA: PRON_LEMMA, - "Number": "Sing", - "Person": "Two", - "PronType": "Prs", - "Case": "Nom", - }, - "তাদেরকে": { - LEMMA: PRON_LEMMA, - "Number": "Plur", - "Person": "Three", - "PronType": "Prs", - "Case": "Acc", - }, - "আমরা": { - LEMMA: PRON_LEMMA, - "Number": "Plur", - "Person": "One ", - "PronType": "Prs", - "Case": "Nom", - }, - "যিনি": {LEMMA: PRON_LEMMA, "Number": "Sing", "PronType": "Rel", "Case": "Nom"}, - "আমাদেরকে": { - LEMMA: PRON_LEMMA, - "Number": "Plur", - "Person": "One", - "PronType": "Prs", - "Case": "Acc", - }, - "কোন": {LEMMA: PRON_LEMMA, "Number": "Sing", "PronType": "Int", "Case": "Acc"}, - "কারা": {LEMMA: PRON_LEMMA, "Number": "Plur", "PronType": "Int", "Case": "Acc"}, - "তোমাকে": { - LEMMA: PRON_LEMMA, - "Number": "Sing", - "Person": "Two", - "PronType": "Prs", - "Case": "Acc", - }, - "তোকে": { - LEMMA: PRON_LEMMA, - "Number": "Sing", - "Person": "Two", - "PronType": "Prs", - "Case": "Acc", - }, - "খোদ": {LEMMA: PRON_LEMMA, "Reflex": "Yes", "PronType": "Ref"}, - "কে": {LEMMA: PRON_LEMMA, "Number": "Sing", "PronType": "Int", "Case": "Acc"}, - "যারা": {LEMMA: PRON_LEMMA, "Number": "Plur", "PronType": "Rel", "Case": "Nom"}, - "যে": {LEMMA: PRON_LEMMA, "Number": "Sing", "PronType": "Rel", "Case": "Nom"}, - "তোমরা": { - LEMMA: PRON_LEMMA, - "Number": "Plur", - "Person": "Two", - "PronType": "Prs", - "Case": "Nom", - }, - "তোরা": { - LEMMA: PRON_LEMMA, - "Number": "Plur", - "Person": "Two", - "PronType": "Prs", - "Case": "Nom", - }, - "তোমাদেরকে": { - LEMMA: PRON_LEMMA, - "Number": "Plur", - "Person": "Two", - "PronType": "Prs", - "Case": "Acc", - }, - "তোদেরকে": { - LEMMA: PRON_LEMMA, - "Number": "Plur", - "Person": "Two", - "PronType": "Prs", - "Case": "Acc", - }, - "আপন": {LEMMA: PRON_LEMMA, "Reflex": "Yes", "PronType": "Ref"}, - "এ": {LEMMA: PRON_LEMMA, "PronType": "Dem"}, - "নিজ": {LEMMA: PRON_LEMMA, "Reflex": "Yes", "PronType": "Ref"}, - "কার": {LEMMA: PRON_LEMMA, "Number": "Sing", "PronType": "Int", "Case": "Acc"}, - "যা": { - LEMMA: PRON_LEMMA, - "Number": "Sing", - "Gender": "Neut", - "PronType": "Rel", - "Case": "Nom", - }, - "তারা": { - LEMMA: PRON_LEMMA, - "Number": "Plur", - "Person": "Three", - "PronType": "Prs", - "Case": "Nom", - }, - "আমি": { - LEMMA: PRON_LEMMA, 
- "Number": "Sing", - "Person": "One", - "PronType": "Prs", - "Case": "Nom", - }, - }, - "PRP$": { - "আমার": { - LEMMA: PRON_LEMMA, - "Number": "Sing", - "Person": "One", - "PronType": "Prs", - "Poss": "Yes", - "Case": "Nom", - }, - "মোর": { - LEMMA: PRON_LEMMA, - "Number": "Sing", - "Person": "One", - "PronType": "Prs", - "Poss": "Yes", - "Case": "Nom", - }, - "মোদের": { - LEMMA: PRON_LEMMA, - "Number": "Plur", - "Person": "One", - "PronType": "Prs", - "Poss": "Yes", - "Case": "Nom", - }, - "তার": { - LEMMA: PRON_LEMMA, - "Number": "Sing", - "Person": "Three", - "PronType": "Prs", - "Poss": "Yes", - "Case": "Nom", - }, - "তাহাার": { - LEMMA: PRON_LEMMA, - "Number": "Sing", - "Person": "Three", - "PronType": "Prs", - "Poss": "Yes", - "Case": "Nom", - }, - "তোমাদের": { - LEMMA: PRON_LEMMA, - "Number": "Plur", - "Person": "Two", - "PronType": "Prs", - "Poss": "Yes", - "Case": "Nom", - }, - "আমাদের": { - LEMMA: PRON_LEMMA, - "Number": "Plur", - "Person": "One", - "PronType": "Prs", - "Poss": "Yes", - "Case": "Nom", - }, - "তোমার": { - LEMMA: PRON_LEMMA, - "Number": "Sing", - "Person": "Two", - "PronType": "Prs", - "Poss": "Yes", - "Case": "Nom", - }, - "তোর": { - LEMMA: PRON_LEMMA, - "Number": "Sing", - "Person": "Two", - "PronType": "Prs", - "Poss": "Yes", - "Case": "Nom", - }, - "তাদের": { - LEMMA: PRON_LEMMA, - "Number": "Plur", - "Person": "Three", - "PronType": "Prs", - "Poss": "Yes", - "Case": "Nom", - }, - "কাদের": { - LEMMA: PRON_LEMMA, - "Number": "Plur", - "PronType": "Int", - "Case": "Acc", - }, - "তোদের": { - LEMMA: PRON_LEMMA, - "Number": "Plur", - "Person": "Two", - "PronType": "Prs", - "Poss": "Yes", - "Case": "Nom", - }, - "যাদের": { - LEMMA: PRON_LEMMA, - "Number": "Plur", - "PronType": "Int", - "Case": "Acc", - }, - }, -} diff --git a/spacy/lang/bn/punctuation.py b/spacy/lang/bn/punctuation.py index f624b4ba4..becfe8d2a 100644 --- a/spacy/lang/bn/punctuation.py +++ b/spacy/lang/bn/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, LIST_ICONS from ..char_classes import ALPHA_LOWER, ALPHA, HYPHENS, CONCAT_QUOTES, UNITS diff --git a/spacy/lang/bn/stop_words.py b/spacy/lang/bn/stop_words.py index 6c9967df8..bf38e3254 100644 --- a/spacy/lang/bn/stop_words.py +++ b/spacy/lang/bn/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ অতএব অথচ অথবা অনুযায়ী অনেক অনেকে অনেকেই অন্তত অবধি অবশ্য অর্থাৎ অন্য অনুযায়ী অর্ধভাগে diff --git a/spacy/lang/bn/tokenizer_exceptions.py b/spacy/lang/bn/tokenizer_exceptions.py index 32acb1730..e666522b8 100644 --- a/spacy/lang/bn/tokenizer_exceptions.py +++ b/spacy/lang/bn/tokenizer_exceptions.py @@ -1,27 +1,26 @@ -# coding=utf-8 -from __future__ import unicode_literals - -from ...symbols import ORTH, LEMMA +from ..tokenizer_exceptions import BASE_EXCEPTIONS +from ...symbols import ORTH, NORM +from ...util import update_exc _exc = {} for exc_data in [ - {ORTH: "ডঃ", LEMMA: "ডক্টর"}, - {ORTH: "ডাঃ", LEMMA: "ডাক্তার"}, - {ORTH: "ড.", LEMMA: "ডক্টর"}, - {ORTH: "ডা.", LEMMA: "ডাক্তার"}, - {ORTH: "মোঃ", LEMMA: "মোহাম্মদ"}, - {ORTH: "মো.", LEMMA: "মোহাম্মদ"}, - {ORTH: "সে.", LEMMA: "সেলসিয়াস"}, - {ORTH: "কি.মি.", LEMMA: "কিলোমিটার"}, - {ORTH: "কি.মি", LEMMA: "কিলোমিটার"}, - {ORTH: "সে.মি.", LEMMA: "সেন্টিমিটার"}, - {ORTH: "সে.মি", LEMMA: "সেন্টিমিটার"}, - {ORTH: "মি.লি.", LEMMA: "মিলিলিটার"}, + {ORTH: "ডঃ", NORM: "ডক্টর"}, + {ORTH: "ডাঃ", NORM: "ডাক্তার"}, + {ORTH: "ড.", NORM: "ডক্টর"}, + 
{ORTH: "ডা.", NORM: "ডাক্তার"}, + {ORTH: "মোঃ", NORM: "মোহাম্মদ"}, + {ORTH: "মো.", NORM: "মোহাম্মদ"}, + {ORTH: "সে.", NORM: "সেলসিয়াস"}, + {ORTH: "কি.মি.", NORM: "কিলোমিটার"}, + {ORTH: "কি.মি", NORM: "কিলোমিটার"}, + {ORTH: "সে.মি.", NORM: "সেন্টিমিটার"}, + {ORTH: "সে.মি", NORM: "সেন্টিমিটার"}, + {ORTH: "মি.লি.", NORM: "মিলিলিটার"}, ]: _exc[exc_data[ORTH]] = [exc_data] -TOKENIZER_EXCEPTIONS = _exc +TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc) diff --git a/spacy/lang/ca/__init__.py b/spacy/lang/ca/__init__.py index 6d4c00a6b..970b23c1e 100644 --- a/spacy/lang/ca/__init__.py +++ b/spacy/lang/ca/__init__.py @@ -1,29 +1,15 @@ -# coding: utf8 -from __future__ import unicode_literals - from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS +from .punctuation import TOKENIZER_INFIXES from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS - -from ..tokenizer_exceptions import BASE_EXCEPTIONS -from ..norm_exceptions import BASE_NORMS from ...language import Language -from ...attrs import LANG, NORM -from ...util import update_exc, add_lookups - -from .punctuation import TOKENIZER_INFIXES class CatalanDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = lambda text: "ca" - lex_attr_getters[NORM] = add_lookups( - Language.Defaults.lex_attr_getters[NORM], BASE_NORMS - ) - lex_attr_getters.update(LEX_ATTRS) - tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) - stop_words = STOP_WORDS + tokenizer_exceptions = TOKENIZER_EXCEPTIONS infixes = TOKENIZER_INFIXES + stop_words = STOP_WORDS + lex_attr_getters = LEX_ATTRS class Catalan(Language): diff --git a/spacy/lang/ca/examples.py b/spacy/lang/ca/examples.py index 3020ee707..ae6aa3e24 100644 --- a/spacy/lang/ca/examples.py +++ b/spacy/lang/ca/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
diff --git a/spacy/lang/ca/lex_attrs.py b/spacy/lang/ca/lex_attrs.py index 6314efa92..be8b7a6ea 100644 --- a/spacy/lang/ca/lex_attrs.py +++ b/spacy/lang/ca/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/ca/punctuation.py b/spacy/lang/ca/punctuation.py index 4439376c8..d50b75589 100644 --- a/spacy/lang/ca/punctuation.py +++ b/spacy/lang/ca/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..punctuation import TOKENIZER_INFIXES from ..char_classes import ALPHA diff --git a/spacy/lang/ca/stop_words.py b/spacy/lang/ca/stop_words.py index a803db2a5..1a87b2f9d 100644 --- a/spacy/lang/ca/stop_words.py +++ b/spacy/lang/ca/stop_words.py @@ -1,7 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ a abans ací ah així això al aleshores algun alguna algunes alguns alhora allà allí allò diff --git a/spacy/lang/ca/tag_map.py b/spacy/lang/ca/tag_map.py deleted file mode 100644 index 472e772ef..000000000 --- a/spacy/lang/ca/tag_map.py +++ /dev/null @@ -1,28 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ..symbols import POS, ADV, NOUN, ADP, PRON, SCONJ, PROPN, DET, SYM, INTJ -from ..symbols import PUNCT, NUM, AUX, X, CONJ, ADJ, VERB, PART, SPACE, CCONJ - - -TAG_MAP = { - "ADV": {POS: ADV}, - "NOUN": {POS: NOUN}, - "ADP": {POS: ADP}, - "PRON": {POS: PRON}, - "SCONJ": {POS: SCONJ}, - "PROPN": {POS: PROPN}, - "DET": {POS: DET}, - "SYM": {POS: SYM}, - "INTJ": {POS: INTJ}, - "PUNCT": {POS: PUNCT}, - "NUM": {POS: NUM}, - "AUX": {POS: AUX}, - "X": {POS: X}, - "CONJ": {POS: CONJ}, - "CCONJ": {POS: CCONJ}, - "ADJ": {POS: ADJ}, - "VERB": {POS: VERB}, - "PART": {POS: PART}, - "SP": {POS: SPACE}, -} diff --git a/spacy/lang/ca/tokenizer_exceptions.py b/spacy/lang/ca/tokenizer_exceptions.py index d95e5e626..b465e97ba 100644 --- a/spacy/lang/ca/tokenizer_exceptions.py +++ b/spacy/lang/ca/tokenizer_exceptions.py @@ -1,41 +1,40 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import ORTH, LEMMA +from ..tokenizer_exceptions import BASE_EXCEPTIONS +from ...symbols import ORTH, NORM +from ...util import update_exc _exc = {} for exc_data in [ - {ORTH: "aprox.", LEMMA: "aproximadament"}, - {ORTH: "pàg.", LEMMA: "pàgina"}, - {ORTH: "p.ex.", LEMMA: "per exemple"}, - {ORTH: "gen.", LEMMA: "gener"}, - {ORTH: "feb.", LEMMA: "febrer"}, - {ORTH: "abr.", LEMMA: "abril"}, - {ORTH: "jul.", LEMMA: "juliol"}, - {ORTH: "set.", LEMMA: "setembre"}, - {ORTH: "oct.", LEMMA: "octubre"}, - {ORTH: "nov.", LEMMA: "novembre"}, - {ORTH: "dec.", LEMMA: "desembre"}, - {ORTH: "Dr.", LEMMA: "doctor"}, - {ORTH: "Sr.", LEMMA: "senyor"}, - {ORTH: "Sra.", LEMMA: "senyora"}, - {ORTH: "Srta.", LEMMA: "senyoreta"}, - {ORTH: "núm", LEMMA: "número"}, - {ORTH: "St.", LEMMA: "sant"}, - {ORTH: "Sta.", LEMMA: "santa"}, + {ORTH: "aprox.", NORM: "aproximadament"}, + {ORTH: "pàg.", NORM: "pàgina"}, + {ORTH: "p.ex.", NORM: "per exemple"}, + {ORTH: "gen.", NORM: "gener"}, + {ORTH: "feb.", NORM: "febrer"}, + {ORTH: "abr.", NORM: "abril"}, + {ORTH: "jul.", NORM: "juliol"}, + {ORTH: "set.", NORM: "setembre"}, + {ORTH: "oct.", NORM: "octubre"}, + {ORTH: "nov.", NORM: "novembre"}, + {ORTH: "dec.", NORM: "desembre"}, + {ORTH: "Dr.", NORM: "doctor"}, + {ORTH: "Sr.", NORM: "senyor"}, + {ORTH: "Sra.", NORM: "senyora"}, + {ORTH: "Srta.", NORM: "senyoreta"}, + {ORTH: "núm", NORM: "número"}, + {ORTH: "St.", NORM: "sant"}, + {ORTH: "Sta.", NORM: 
"santa"}, ]: _exc[exc_data[ORTH]] = [exc_data] # Times -_exc["12m."] = [{ORTH: "12"}, {ORTH: "m.", LEMMA: "p.m."}] +_exc["12m."] = [{ORTH: "12"}, {ORTH: "m.", NORM: "p.m."}] for h in range(1, 12 + 1): for period in ["a.m.", "am"]: - _exc["%d%s" % (h, period)] = [{ORTH: "%d" % h}, {ORTH: period, LEMMA: "a.m."}] + _exc[f"{h}{period}"] = [{ORTH: f"{h}"}, {ORTH: period, NORM: "a.m."}] for period in ["p.m.", "pm"]: - _exc["%d%s" % (h, period)] = [{ORTH: "%d" % h}, {ORTH: period, LEMMA: "p.m."}] + _exc[f"{h}{period}"] = [{ORTH: f"{h}"}, {ORTH: period, NORM: "p.m."}] -TOKENIZER_EXCEPTIONS = _exc +TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc) diff --git a/spacy/lang/char_classes.py b/spacy/lang/char_classes.py index bd0f7e437..b8094319f 100644 --- a/spacy/lang/char_classes.py +++ b/spacy/lang/char_classes.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - split_chars = lambda char: list(char.strip().split(" ")) merge_chars = lambda char: char.strip().replace(" ", "|") group_chars = lambda char: char.strip().replace(" ", "") diff --git a/spacy/lang/cs/__init__.py b/spacy/lang/cs/__init__.py index baaaa162b..0c35e2288 100644 --- a/spacy/lang/cs/__init__.py +++ b/spacy/lang/cs/__init__.py @@ -1,17 +1,11 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS -from ...language import Language -from ...attrs import LANG from .lex_attrs import LEX_ATTRS +from ...language import Language class CzechDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters.update(LEX_ATTRS) - lex_attr_getters[LANG] = lambda text: "cs" stop_words = STOP_WORDS + lex_attr_getters = LEX_ATTRS class Czech(Language): diff --git a/spacy/lang/cs/examples.py b/spacy/lang/cs/examples.py index fe8a9f6d1..a30b5ac14 100644 --- a/spacy/lang/cs/examples.py +++ b/spacy/lang/cs/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. >>> from spacy.lang.cs.examples import sentences @@ -10,9 +6,9 @@ Example sentences to test spaCy and its language models. sentences = [ - "Máma mele maso.", + "Máma mele maso.", "Příliš žluťoučký kůň úpěl ďábelské ódy.", - "ArcGIS je geografický informační systém určený pro práci s prostorovými daty." 
, + "ArcGIS je geografický informační systém určený pro práci s prostorovými daty.", "Může data vytvářet a spravovat, ale především je dokáže analyzovat, najít v nich nové vztahy a vše přehledně vizualizovat.", "Dnes je krásné počasí.", "Nestihl autobus, protože pozdě vstal z postele.", @@ -39,4 +35,4 @@ sentences = [ "Jaké PSČ má Praha 1?", "PSČ Prahy 1 je 110 00.", "Za 20 minut jede vlak.", - ] +] diff --git a/spacy/lang/cs/lex_attrs.py b/spacy/lang/cs/lex_attrs.py index 368cab6c8..530d1d5eb 100644 --- a/spacy/lang/cs/lex_attrs.py +++ b/spacy/lang/cs/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM _num_words = [ @@ -43,7 +40,7 @@ _num_words = [ "kvadrilion", "kvadriliarda", "kvintilion", - ] +] def like_num(text): diff --git a/spacy/lang/cs/stop_words.py b/spacy/lang/cs/stop_words.py index 9277772fb..f61f424f6 100644 --- a/spacy/lang/cs/stop_words.py +++ b/spacy/lang/cs/stop_words.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - # Source: https://github.com/Alir3z4/stop-words # Source: https://github.com/stopwords-iso/stopwords-cs/blob/master/stopwords-cs.txt diff --git a/spacy/lang/cs/test_text.py b/spacy/lang/cs/test_text.py deleted file mode 100644 index e69de29bb..000000000 diff --git a/spacy/lang/da/__init__.py b/spacy/lang/da/__init__.py index 0190656e5..8cac30b26 100644 --- a/spacy/lang/da/__init__.py +++ b/spacy/lang/da/__init__.py @@ -1,28 +1,15 @@ -# coding: utf8 -from __future__ import unicode_literals - from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS -from .morph_rules import MORPH_RULES -from ..tag_map import TAG_MAP - -from ..tokenizer_exceptions import BASE_EXCEPTIONS from ...language import Language -from ...attrs import LANG -from ...util import update_exc class DanishDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters.update(LEX_ATTRS) - lex_attr_getters[LANG] = lambda text: "da" - tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) - morph_rules = MORPH_RULES + tokenizer_exceptions = TOKENIZER_EXCEPTIONS infixes = TOKENIZER_INFIXES suffixes = TOKENIZER_SUFFIXES - tag_map = TAG_MAP + lex_attr_getters = LEX_ATTRS stop_words = STOP_WORDS diff --git a/spacy/lang/da/examples.py b/spacy/lang/da/examples.py index 525c6519c..efa1a7c0e 100644 --- a/spacy/lang/da/examples.py +++ b/spacy/lang/da/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/da/lex_attrs.py b/spacy/lang/da/lex_attrs.py index 9fefc1eba..403af686c 100644 --- a/spacy/lang/da/lex_attrs.py +++ b/spacy/lang/da/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/da/morph_rules.py b/spacy/lang/da/morph_rules.py deleted file mode 100644 index 7ffe2ac6f..000000000 --- a/spacy/lang/da/morph_rules.py +++ /dev/null @@ -1,311 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import LEMMA, PRON_LEMMA - -# Source: Danish Universal Dependencies and http://fjern-uv.dk/pronom.php - -# Note: The Danish Universal Dependencies specify Case=Acc for all instances -# of "den"/"det" even when the case is in fact "Nom". 
In the rules below, Case -# is left unspecified for "den" and "det". - -MORPH_RULES = { - "PRON": { - "jeg": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "One", - "Number": "Sing", - "Case": "Nom", - "Gender": "Com", - }, # Case=Nom|Gender=Com|Number=Sing|Person=1|PronType=Prs - "mig": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "One", - "Number": "Sing", - "Case": "Acc", - "Gender": "Com", - }, # Case=Acc|Gender=Com|Number=Sing|Person=1|PronType=Prs - "min": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "One", - "Number": "Sing", - "Poss": "Yes", - "Gender": "Com", - }, # Gender=Com|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs - "mit": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "One", - "Number": "Sing", - "Poss": "Yes", - "Gender": "Neut", - }, # Gender=Neut|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs - "vor": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "One", - "Number": "Sing", - "Poss": "Yes", - "Gender": "Com", - }, # Gender=Com|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs|Style=Form - "vort": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "One", - "Number": "Sing", - "Poss": "Yes", - "Gender": "Neut", - }, # Gender=Neut|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs|Style=Form - "du": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Two", - "Number": "Sing", - "Case": "Nom", - "Gender": "Com", - }, # Case=Nom|Gender=Com|Number=Sing|Person=2|PronType=Prs - "dig": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Two", - "Number": "Sing", - "Case": "Acc", - "Gender": "Com", - }, # Case=Acc|Gender=Com|Number=Sing|Person=2|PronType=Prs - "din": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Two", - "Number": "Sing", - "Poss": "Yes", - "Gender": "Com", - }, # Gender=Com|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs - "dit": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Two", - "Number": "Sing", - "Poss": "Yes", - "Gender": "Neut", - }, # Gender=Neut|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs - "han": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Case": "Nom", - "Gender": "Com", - }, # Case=Nom|Gender=Com|Number=Sing|Person=3|PronType=Prs - "hun": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Case": "Nom", - "Gender": "Com", - }, # Case=Nom|Gender=Com|Number=Sing|Person=3|PronType=Prs - "den": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Gender": "Com", - }, # Case=Acc|Gender=Com|Number=Sing|Person=3|PronType=Prs, See note above. - "det": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Gender": "Neut", - }, # Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs See note above. 
- "ham": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Case": "Acc", - "Gender": "Com", - }, # Case=Acc|Gender=Com|Number=Sing|Person=3|PronType=Prs - "hende": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Case": "Acc", - "Gender": "Com", - }, # Case=Acc|Gender=Com|Number=Sing|Person=3|PronType=Prs - "sin": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Poss": "Yes", - "Gender": "Com", - "Reflex": "Yes", - }, # Gender=Com|Number=Sing|Number[psor]=Sing|Person=3|Poss=Yes|PronType=Prs|Reflex=Yes - "sit": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Poss": "Yes", - "Gender": "Neut", - "Reflex": "Yes", - }, # Gender=Neut|Number=Sing|Number[psor]=Sing|Person=3|Poss=Yes|PronType=Prs|Reflex=Yes - "vi": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "One", - "Number": "Plur", - "Case": "Nom", - "Gender": "Com", - }, # Case=Nom|Gender=Com|Number=Plur|Person=1|PronType=Prs - "os": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "One", - "Number": "Plur", - "Case": "Acc", - "Gender": "Com", - }, # Case=Acc|Gender=Com|Number=Plur|Person=1|PronType=Prs - "mine": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "One", - "Number": "Plur", - "Poss": "Yes", - }, # Number=Plur|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs - "vore": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "One", - "Number": "Plur", - "Poss": "Yes", - }, # Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs|Style=Form - "I": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Two", - "Number": "Plur", - "Case": "Nom", - "Gender": "Com", - }, # Case=Nom|Gender=Com|Number=Plur|Person=2|PronType=Prs - "jer": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Two", - "Number": "Plur", - "Case": "Acc", - "Gender": "Com", - }, # Case=Acc|Gender=Com|Number=Plur|Person=2|PronType=Prs - "dine": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Two", - "Number": "Plur", - "Poss": "Yes", - }, # Number=Plur|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs - "de": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Number": "Plur", - "Case": "Nom", - }, # Case=Nom|Number=Plur|Person=3|PronType=Prs - "dem": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Number": "Plur", - "Case": "Acc", - }, # Case=Acc|Number=Plur|Person=3|PronType=Prs - "sine": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Number": "Plur", - "Poss": "Yes", - "Reflex": "Yes", - }, # Number=Plur|Number[psor]=Sing|Person=3|Poss=Yes|PronType=Prs|Reflex=Yes - "vores": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "One", - "Poss": "Yes", - }, # Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs - "De": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Two", - "Case": "Nom", - "Gender": "Com", - }, # Case=Nom|Gender=Com|Person=2|Polite=Form|PronType=Prs - "Dem": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Two", - "Case": "Acc", - "Gender": "Com", - }, # Case=Acc|Gender=Com|Person=2|Polite=Form|PronType=Prs - "Deres": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Two", - "Poss": "Yes", - }, # Person=2|Polite=Form|Poss=Yes|PronType=Prs - "jeres": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Two", - "Poss": "Yes", - }, # Number[psor]=Plur|Person=2|Poss=Yes|PronType=Prs - "sig": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Case": "Acc", 
- "Reflex": "Yes", - }, # Case=Acc|Person=3|PronType=Prs|Reflex=Yes - "hans": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Poss": "Yes", - }, # Number[psor]=Sing|Person=3|Poss=Yes|PronType=Prs - "hendes": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Poss": "Yes", - }, # Number[psor]=Sing|Person=3|Poss=Yes|PronType=Prs - "dens": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Poss": "Yes", - }, # Number[psor]=Sing|Person=3|Poss=Yes|PronType=Prs - "dets": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Poss": "Yes", - }, # Number[psor]=Sing|Person=3|Poss=Yes|PronType=Prs - "deres": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Poss": "Yes", - }, # Number[psor]=Plur|Person=3|Poss=Yes|PronType=Prs - }, - "VERB": { - "er": {LEMMA: "være", "VerbForm": "Fin", "Tense": "Pres"}, - "var": {LEMMA: "være", "VerbForm": "Fin", "Tense": "Past"}, - }, -} - -for tag, rules in MORPH_RULES.items(): - for key, attrs in dict(rules).items(): - rules[key.title()] = attrs diff --git a/spacy/lang/da/punctuation.py b/spacy/lang/da/punctuation.py index b6b852c55..e050ab7aa 100644 --- a/spacy/lang/da/punctuation.py +++ b/spacy/lang/da/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..char_classes import LIST_ELLIPSES, LIST_ICONS from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER from ..punctuation import TOKENIZER_SUFFIXES diff --git a/spacy/lang/da/stop_words.py b/spacy/lang/da/stop_words.py index 48de0c7ca..05b2084dd 100644 --- a/spacy/lang/da/stop_words.py +++ b/spacy/lang/da/stop_words.py @@ -1,6 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - # Source: Handpicked by Jens Dahl Møllerhøj. STOP_WORDS = set( diff --git a/spacy/lang/da/tokenizer_exceptions.py b/spacy/lang/da/tokenizer_exceptions.py index 9e4637bfb..ce25c546b 100644 --- a/spacy/lang/da/tokenizer_exceptions.py +++ b/spacy/lang/da/tokenizer_exceptions.py @@ -1,12 +1,10 @@ -# encoding: utf8 """ Tokenizer Exceptions. Source: https://forkortelse.dk/ and various others. """ - -from __future__ import unicode_literals - -from ...symbols import ORTH, LEMMA, NORM +from ..tokenizer_exceptions import BASE_EXCEPTIONS +from ...symbols import ORTH, NORM +from ...util import update_exc _exc = {} @@ -15,44 +13,44 @@ _exc = {} # (for "torsdag") are left out because they are ambiguous. The same is the case # for abbreviations "jul." and "Jul." ("juli"). 
for exc_data in [ - {ORTH: "Kbh.", LEMMA: "København", NORM: "København"}, - {ORTH: "jan.", LEMMA: "januar"}, - {ORTH: "febr.", LEMMA: "februar"}, - {ORTH: "feb.", LEMMA: "februar"}, - {ORTH: "mar.", LEMMA: "marts"}, - {ORTH: "apr.", LEMMA: "april"}, - {ORTH: "jun.", LEMMA: "juni"}, - {ORTH: "aug.", LEMMA: "august"}, - {ORTH: "sept.", LEMMA: "september"}, - {ORTH: "sep.", LEMMA: "september"}, - {ORTH: "okt.", LEMMA: "oktober"}, - {ORTH: "nov.", LEMMA: "november"}, - {ORTH: "dec.", LEMMA: "december"}, - {ORTH: "man.", LEMMA: "mandag"}, - {ORTH: "tirs.", LEMMA: "tirsdag"}, - {ORTH: "ons.", LEMMA: "onsdag"}, - {ORTH: "tor.", LEMMA: "torsdag"}, - {ORTH: "tors.", LEMMA: "torsdag"}, - {ORTH: "fre.", LEMMA: "fredag"}, - {ORTH: "lør.", LEMMA: "lørdag"}, - {ORTH: "Jan.", LEMMA: "januar"}, - {ORTH: "Febr.", LEMMA: "februar"}, - {ORTH: "Feb.", LEMMA: "februar"}, - {ORTH: "Mar.", LEMMA: "marts"}, - {ORTH: "Apr.", LEMMA: "april"}, - {ORTH: "Jun.", LEMMA: "juni"}, - {ORTH: "Aug.", LEMMA: "august"}, - {ORTH: "Sept.", LEMMA: "september"}, - {ORTH: "Sep.", LEMMA: "september"}, - {ORTH: "Okt.", LEMMA: "oktober"}, - {ORTH: "Nov.", LEMMA: "november"}, - {ORTH: "Dec.", LEMMA: "december"}, - {ORTH: "Man.", LEMMA: "mandag"}, - {ORTH: "Tirs.", LEMMA: "tirsdag"}, - {ORTH: "Ons.", LEMMA: "onsdag"}, - {ORTH: "Fre.", LEMMA: "fredag"}, - {ORTH: "Lør.", LEMMA: "lørdag"}, - {ORTH: "og/eller", LEMMA: "og/eller", NORM: "og/eller"}, + {ORTH: "Kbh.", NORM: "København"}, + {ORTH: "jan.", NORM: "januar"}, + {ORTH: "febr.", NORM: "februar"}, + {ORTH: "feb.", NORM: "februar"}, + {ORTH: "mar.", NORM: "marts"}, + {ORTH: "apr.", NORM: "april"}, + {ORTH: "jun.", NORM: "juni"}, + {ORTH: "aug.", NORM: "august"}, + {ORTH: "sept.", NORM: "september"}, + {ORTH: "sep.", NORM: "september"}, + {ORTH: "okt.", NORM: "oktober"}, + {ORTH: "nov.", NORM: "november"}, + {ORTH: "dec.", NORM: "december"}, + {ORTH: "man.", NORM: "mandag"}, + {ORTH: "tirs.", NORM: "tirsdag"}, + {ORTH: "ons.", NORM: "onsdag"}, + {ORTH: "tor.", NORM: "torsdag"}, + {ORTH: "tors.", NORM: "torsdag"}, + {ORTH: "fre.", NORM: "fredag"}, + {ORTH: "lør.", NORM: "lørdag"}, + {ORTH: "Jan.", NORM: "januar"}, + {ORTH: "Febr.", NORM: "februar"}, + {ORTH: "Feb.", NORM: "februar"}, + {ORTH: "Mar.", NORM: "marts"}, + {ORTH: "Apr.", NORM: "april"}, + {ORTH: "Jun.", NORM: "juni"}, + {ORTH: "Aug.", NORM: "august"}, + {ORTH: "Sept.", NORM: "september"}, + {ORTH: "Sep.", NORM: "september"}, + {ORTH: "Okt.", NORM: "oktober"}, + {ORTH: "Nov.", NORM: "november"}, + {ORTH: "Dec.", NORM: "december"}, + {ORTH: "Man.", NORM: "mandag"}, + {ORTH: "Tirs.", NORM: "tirsdag"}, + {ORTH: "Ons.", NORM: "onsdag"}, + {ORTH: "Fre.", NORM: "fredag"}, + {ORTH: "Lør.", NORM: "lørdag"}, + {ORTH: "og/eller", NORM: "og/eller"}, ]: _exc[exc_data[ORTH]] = [exc_data] @@ -552,22 +550,22 @@ for orth in [ _exc[capitalized] = [{ORTH: capitalized}] for exc_data in [ - {ORTH: "s'gu", LEMMA: "s'gu", NORM: "s'gu"}, - {ORTH: "S'gu", LEMMA: "s'gu", NORM: "s'gu"}, - {ORTH: "sgu'", LEMMA: "s'gu", NORM: "s'gu"}, - {ORTH: "Sgu'", LEMMA: "s'gu", NORM: "s'gu"}, - {ORTH: "sku'", LEMMA: "skal", NORM: "skulle"}, - {ORTH: "ku'", LEMMA: "kan", NORM: "kunne"}, - {ORTH: "Ku'", LEMMA: "kan", NORM: "kunne"}, - {ORTH: "ka'", LEMMA: "kan", NORM: "kan"}, - {ORTH: "Ka'", LEMMA: "kan", NORM: "kan"}, - {ORTH: "gi'", LEMMA: "give", NORM: "giv"}, - {ORTH: "Gi'", LEMMA: "give", NORM: "giv"}, - {ORTH: "li'", LEMMA: "lide", NORM: "lide"}, - {ORTH: "ha'", LEMMA: "have", NORM: "have"}, - {ORTH: "Ha'", LEMMA: "have", NORM: "have"}, - {ORTH: "ik'", LEMMA: 
"ikke", NORM: "ikke"}, - {ORTH: "Ik'", LEMMA: "ikke", NORM: "ikke"}, + {ORTH: "s'gu", NORM: "s'gu"}, + {ORTH: "S'gu", NORM: "s'gu"}, + {ORTH: "sgu'", NORM: "s'gu"}, + {ORTH: "Sgu'", NORM: "s'gu"}, + {ORTH: "sku'", NORM: "skulle"}, + {ORTH: "ku'", NORM: "kunne"}, + {ORTH: "Ku'", NORM: "kunne"}, + {ORTH: "ka'", NORM: "kan"}, + {ORTH: "Ka'", NORM: "kan"}, + {ORTH: "gi'", NORM: "giv"}, + {ORTH: "Gi'", NORM: "giv"}, + {ORTH: "li'", NORM: "lide"}, + {ORTH: "ha'", NORM: "have"}, + {ORTH: "Ha'", NORM: "have"}, + {ORTH: "ik'", NORM: "ikke"}, + {ORTH: "Ik'", NORM: "ikke"}, ]: _exc[exc_data[ORTH]] = [exc_data] @@ -575,9 +573,9 @@ for exc_data in [ # Dates for h in range(1, 31 + 1): for period in ["."]: - _exc["%d%s" % (h, period)] = [{ORTH: "%d." % h}] + _exc[f"{h}{period}"] = [{ORTH: f"{h}."}] -_custom_base_exc = {"i.": [{ORTH: "i", LEMMA: "i", NORM: "i"}, {ORTH: "."}]} +_custom_base_exc = {"i.": [{ORTH: "i", NORM: "i"}, {ORTH: "."}]} _exc.update(_custom_base_exc) -TOKENIZER_EXCEPTIONS = _exc +TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc) diff --git a/spacy/lang/de/__init__.py b/spacy/lang/de/__init__.py index ca01428ba..b645d3480 100644 --- a/spacy/lang/de/__init__.py +++ b/spacy/lang/de/__init__.py @@ -1,43 +1,17 @@ -# coding: utf8 -from __future__ import unicode_literals - from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS -from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES -from .punctuation import TOKENIZER_INFIXES -from .tag_map import TAG_MAP +from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES, TOKENIZER_INFIXES from .stop_words import STOP_WORDS from .syntax_iterators import SYNTAX_ITERATORS - -from ..tokenizer_exceptions import BASE_EXCEPTIONS from ...language import Language -from ...attrs import LANG -from ...util import update_exc class GermanDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = lambda text: "de" - tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) + tokenizer_exceptions = TOKENIZER_EXCEPTIONS prefixes = TOKENIZER_PREFIXES suffixes = TOKENIZER_SUFFIXES infixes = TOKENIZER_INFIXES - tag_map = TAG_MAP - stop_words = STOP_WORDS syntax_iterators = SYNTAX_ITERATORS - single_orth_variants = [ - {"tags": ["$("], "variants": ["…", "..."]}, - {"tags": ["$("], "variants": ["-", "—", "–", "--", "---", "——"]}, - ] - paired_orth_variants = [ - { - "tags": ["$("], - "variants": [("'", "'"), (",", "'"), ("‚", "‘"), ("›", "‹"), ("‹", "›")], - }, - { - "tags": ["$("], - "variants": [("``", "''"), ('"', '"'), ("„", "“"), ("»", "«"), ("«", "»")], - }, - ] + stop_words = STOP_WORDS class German(Language): diff --git a/spacy/lang/de/examples.py b/spacy/lang/de/examples.py index 0c64a693a..735d1c316 100644 --- a/spacy/lang/de/examples.py +++ b/spacy/lang/de/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
diff --git a/spacy/lang/de/punctuation.py b/spacy/lang/de/punctuation.py index 93454ffff..69d402237 100644 --- a/spacy/lang/de/punctuation.py +++ b/spacy/lang/de/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..char_classes import LIST_ELLIPSES, LIST_ICONS, LIST_PUNCT, LIST_QUOTES from ..char_classes import CURRENCY, UNITS, PUNCT from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER diff --git a/spacy/lang/de/stop_words.py b/spacy/lang/de/stop_words.py index 0c8b375e0..f52687eb9 100644 --- a/spacy/lang/de/stop_words.py +++ b/spacy/lang/de/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ á a ab aber ach acht achte achten achter achtes ag alle allein allem allen diff --git a/spacy/lang/de/syntax_iterators.py b/spacy/lang/de/syntax_iterators.py index c5513abc0..aba0e8024 100644 --- a/spacy/lang/de/syntax_iterators.py +++ b/spacy/lang/de/syntax_iterators.py @@ -1,42 +1,26 @@ -# coding: utf8 -from __future__ import unicode_literals +from typing import Union, Iterator from ...symbols import NOUN, PROPN, PRON from ...errors import Errors +from ...tokens import Doc, Span -def noun_chunks(doclike): - """ - Detect base noun phrases from a dependency parse. Works on both Doc and Span. - """ +def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Span]: + """Detect base noun phrases from a dependency parse. Works on Doc and Span.""" # this iterator extracts spans headed by NOUNs starting from the left-most # syntactic dependent until the NOUN itself for close apposition and # measurement construction, the span is sometimes extended to the right of # the NOUN. Example: "eine Tasse Tee" (a cup (of) tea) returns "eine Tasse Tee" # and not just "eine Tasse", same for "das Thema Familie". - labels = [ - "sb", - "oa", - "da", - "nk", - "mo", - "ag", - "ROOT", - "root", - "cj", - "pd", - "og", - "app", - ] + # fmt: off + labels = ["sb", "oa", "da", "nk", "mo", "ag", "ROOT", "root", "cj", "pd", "og", "app"] + # fmt: on doc = doclike.doc # Ensure works on both Doc and Span. 
- - if not doc.is_parsed: + if not doc.has_annotation("DEP"): raise ValueError(Errors.E029) - np_label = doc.vocab.strings.add("NP") np_deps = set(doc.vocab.strings.add(label) for label in labels) close_app = doc.vocab.strings.add("nk") - rbracket = 0 prev_end = -1 for i, word in enumerate(doclike): diff --git a/spacy/lang/de/tag_map.py b/spacy/lang/de/tag_map.py deleted file mode 100644 index c169501a9..000000000 --- a/spacy/lang/de/tag_map.py +++ /dev/null @@ -1,66 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import POS, PUNCT, ADJ, CCONJ, SCONJ, NUM, DET, ADV, ADP, X -from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX, VERB - - -TAG_MAP = { - "$(": {POS: PUNCT, "PunctType": "brck"}, - "$,": {POS: PUNCT, "PunctType": "comm"}, - "$.": {POS: PUNCT, "PunctType": "peri"}, - "ADJA": {POS: ADJ}, - "ADJD": {POS: ADJ}, - "ADV": {POS: ADV}, - "APPO": {POS: ADP, "AdpType": "post"}, - "APPR": {POS: ADP, "AdpType": "prep"}, - "APPRART": {POS: ADP, "AdpType": "prep", "PronType": "art"}, - "APZR": {POS: ADP, "AdpType": "circ"}, - "ART": {POS: DET, "PronType": "art"}, - "CARD": {POS: NUM, "NumType": "card"}, - "FM": {POS: X, "Foreign": "yes"}, - "ITJ": {POS: INTJ}, - "KOKOM": {POS: CCONJ, "ConjType": "comp"}, - "KON": {POS: CCONJ}, - "KOUI": {POS: SCONJ}, - "KOUS": {POS: SCONJ}, - "NE": {POS: PROPN}, - "NNE": {POS: PROPN}, - "NN": {POS: NOUN}, - "PAV": {POS: ADV, "PronType": "dem"}, - "PROAV": {POS: ADV, "PronType": "dem"}, - "PDAT": {POS: DET, "PronType": "dem"}, - "PDS": {POS: PRON, "PronType": "dem"}, - "PIAT": {POS: DET, "PronType": "ind|neg|tot"}, - "PIDAT": {POS: DET, "PronType": "ind|neg|tot"}, - "PIS": {POS: PRON, "PronType": "ind|neg|tot"}, - "PPER": {POS: PRON, "PronType": "prs"}, - "PPOSAT": {POS: DET, "Poss": "yes", "PronType": "prs"}, - "PPOSS": {POS: PRON, "Poss": "yes", "PronType": "prs"}, - "PRELAT": {POS: DET, "PronType": "rel"}, - "PRELS": {POS: PRON, "PronType": "rel"}, - "PRF": {POS: PRON, "PronType": "prs", "Reflex": "yes"}, - "PTKA": {POS: PART}, - "PTKANT": {POS: PART, "PartType": "res"}, - "PTKNEG": {POS: PART, "Polarity": "neg"}, - "PTKVZ": {POS: ADP, "PartType": "vbp"}, - "PTKZU": {POS: PART, "PartType": "inf"}, - "PWAT": {POS: DET, "PronType": "int"}, - "PWAV": {POS: ADV, "PronType": "int"}, - "PWS": {POS: PRON, "PronType": "int"}, - "TRUNC": {POS: X, "Hyph": "yes"}, - "VAFIN": {POS: AUX, "Mood": "ind", "VerbForm": "fin"}, - "VAIMP": {POS: AUX, "Mood": "imp", "VerbForm": "fin"}, - "VAINF": {POS: AUX, "VerbForm": "inf"}, - "VAPP": {POS: AUX, "Aspect": "perf", "VerbForm": "part"}, - "VMFIN": {POS: VERB, "Mood": "ind", "VerbForm": "fin", "VerbType": "mod"}, - "VMINF": {POS: VERB, "VerbForm": "inf", "VerbType": "mod"}, - "VMPP": {POS: VERB, "Aspect": "perf", "VerbForm": "part", "VerbType": "mod"}, - "VVFIN": {POS: VERB, "Mood": "ind", "VerbForm": "fin"}, - "VVIMP": {POS: VERB, "Mood": "imp", "VerbForm": "fin"}, - "VVINF": {POS: VERB, "VerbForm": "inf"}, - "VVIZU": {POS: VERB, "VerbForm": "inf"}, - "VVPP": {POS: VERB, "Aspect": "perf", "VerbForm": "part"}, - "XY": {POS: X}, - "_SP": {POS: SPACE}, -} diff --git a/spacy/lang/de/tokenizer_exceptions.py b/spacy/lang/de/tokenizer_exceptions.py index ebbbfba8c..21d99cffe 100644 --- a/spacy/lang/de/tokenizer_exceptions.py +++ b/spacy/lang/de/tokenizer_exceptions.py @@ -1,160 +1,135 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import ORTH, LEMMA, TAG, NORM, PRON_LEMMA +from ..tokenizer_exceptions import BASE_EXCEPTIONS +from ...symbols import ORTH, NORM 
+from ...util import update_exc _exc = { - "auf'm": [{ORTH: "auf", LEMMA: "auf"}, {ORTH: "'m", LEMMA: "der", NORM: "dem"}], - "du's": [ - {ORTH: "du", LEMMA: PRON_LEMMA, TAG: "PPER"}, - {ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"}, - ], - "er's": [ - {ORTH: "er", LEMMA: PRON_LEMMA, TAG: "PPER"}, - {ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"}, - ], - "hinter'm": [ - {ORTH: "hinter", LEMMA: "hinter"}, - {ORTH: "'m", LEMMA: "der", NORM: "dem"}, - ], - "ich's": [ - {ORTH: "ich", LEMMA: PRON_LEMMA, TAG: "PPER"}, - {ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"}, - ], - "ihr's": [ - {ORTH: "ihr", LEMMA: PRON_LEMMA, TAG: "PPER"}, - {ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"}, - ], - "sie's": [ - {ORTH: "sie", LEMMA: PRON_LEMMA, TAG: "PPER"}, - {ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"}, - ], - "unter'm": [ - {ORTH: "unter", LEMMA: "unter"}, - {ORTH: "'m", LEMMA: "der", NORM: "dem"}, - ], - "vor'm": [{ORTH: "vor", LEMMA: "vor"}, {ORTH: "'m", LEMMA: "der", NORM: "dem"}], - "wir's": [ - {ORTH: "wir", LEMMA: PRON_LEMMA, TAG: "PPER"}, - {ORTH: "'s", LEMMA: PRON_LEMMA, TAG: "PPER", NORM: "es"}, - ], - "über'm": [{ORTH: "über", LEMMA: "über"}, {ORTH: "'m", LEMMA: "der", NORM: "dem"}], + "auf'm": [{ORTH: "auf"}, {ORTH: "'m", NORM: "dem"}], + "du's": [{ORTH: "du"}, {ORTH: "'s", NORM: "es"}], + "er's": [{ORTH: "er"}, {ORTH: "'s", NORM: "es"}], + "hinter'm": [{ORTH: "hinter"}, {ORTH: "'m", NORM: "dem"}], + "ich's": [{ORTH: "ich"}, {ORTH: "'s", NORM: "es"}], + "ihr's": [{ORTH: "ihr"}, {ORTH: "'s", NORM: "es"}], + "sie's": [{ORTH: "sie"}, {ORTH: "'s", NORM: "es"}], + "unter'm": [{ORTH: "unter"}, {ORTH: "'m", NORM: "dem"}], + "vor'm": [{ORTH: "vor"}, {ORTH: "'m", NORM: "dem"}], + "wir's": [{ORTH: "wir"}, {ORTH: "'s", NORM: "es"}], + "über'm": [{ORTH: "über"}, {ORTH: "'m", NORM: "dem"}], } for exc_data in [ - {ORTH: "'S", LEMMA: PRON_LEMMA, NORM: "'s", TAG: "PPER"}, - {ORTH: "'s", LEMMA: PRON_LEMMA, NORM: "'s", TAG: "PPER"}, - {ORTH: "S'", LEMMA: PRON_LEMMA, NORM: "'s", TAG: "PPER"}, - {ORTH: "s'", LEMMA: PRON_LEMMA, NORM: "'s", TAG: "PPER"}, - {ORTH: "'n", LEMMA: "ein", NORM: "ein"}, - {ORTH: "'ne", LEMMA: "eine", NORM: "eine"}, - {ORTH: "'nen", LEMMA: "ein", NORM: "einen"}, - {ORTH: "'nem", LEMMA: "ein", NORM: "einem"}, - {ORTH: "Abb.", LEMMA: "Abbildung", NORM: "Abbildung"}, - {ORTH: "Abk.", LEMMA: "Abkürzung", NORM: "Abkürzung"}, - {ORTH: "Abt.", LEMMA: "Abteilung", NORM: "Abteilung"}, - {ORTH: "Apr.", LEMMA: "April", NORM: "April"}, - {ORTH: "Aug.", LEMMA: "August", NORM: "August"}, - {ORTH: "Bd.", LEMMA: "Band", NORM: "Band"}, - {ORTH: "Betr.", LEMMA: "Betreff", NORM: "Betreff"}, - {ORTH: "Bf.", LEMMA: "Bahnhof", NORM: "Bahnhof"}, - {ORTH: "Bhf.", LEMMA: "Bahnhof", NORM: "Bahnhof"}, - {ORTH: "Bsp.", LEMMA: "Beispiel", NORM: "Beispiel"}, - {ORTH: "Dez.", LEMMA: "Dezember", NORM: "Dezember"}, - {ORTH: "Di.", LEMMA: "Dienstag", NORM: "Dienstag"}, - {ORTH: "Do.", LEMMA: "Donnerstag", NORM: "Donnerstag"}, - {ORTH: "Fa.", LEMMA: "Firma", NORM: "Firma"}, - {ORTH: "Fam.", LEMMA: "Familie", NORM: "Familie"}, - {ORTH: "Feb.", LEMMA: "Februar", NORM: "Februar"}, - {ORTH: "Fr.", LEMMA: "Frau", NORM: "Frau"}, - {ORTH: "Frl.", LEMMA: "Fräulein", NORM: "Fräulein"}, - {ORTH: "Hbf.", LEMMA: "Hauptbahnhof", NORM: "Hauptbahnhof"}, - {ORTH: "Hr.", LEMMA: "Herr", NORM: "Herr"}, - {ORTH: "Hrn.", LEMMA: "Herr", NORM: "Herrn"}, - {ORTH: "Jan.", LEMMA: "Januar", NORM: "Januar"}, - {ORTH: "Jh.", LEMMA: "Jahrhundert", NORM: "Jahrhundert"}, - {ORTH: "Jhd.", LEMMA: 
"Jahrhundert", NORM: "Jahrhundert"}, - {ORTH: "Jul.", LEMMA: "Juli", NORM: "Juli"}, - {ORTH: "Jun.", LEMMA: "Juni", NORM: "Juni"}, - {ORTH: "Mi.", LEMMA: "Mittwoch", NORM: "Mittwoch"}, - {ORTH: "Mio.", LEMMA: "Million", NORM: "Million"}, - {ORTH: "Mo.", LEMMA: "Montag", NORM: "Montag"}, - {ORTH: "Mrd.", LEMMA: "Milliarde", NORM: "Milliarde"}, - {ORTH: "Mrz.", LEMMA: "März", NORM: "März"}, - {ORTH: "MwSt.", LEMMA: "Mehrwertsteuer", NORM: "Mehrwertsteuer"}, - {ORTH: "Mär.", LEMMA: "März", NORM: "März"}, - {ORTH: "Nov.", LEMMA: "November", NORM: "November"}, - {ORTH: "Nr.", LEMMA: "Nummer", NORM: "Nummer"}, - {ORTH: "Okt.", LEMMA: "Oktober", NORM: "Oktober"}, - {ORTH: "Orig.", LEMMA: "Original", NORM: "Original"}, - {ORTH: "Pkt.", LEMMA: "Punkt", NORM: "Punkt"}, - {ORTH: "Prof.", LEMMA: "Professor", NORM: "Professor"}, - {ORTH: "Red.", LEMMA: "Redaktion", NORM: "Redaktion"}, - {ORTH: "Sa.", LEMMA: "Samstag", NORM: "Samstag"}, - {ORTH: "Sep.", LEMMA: "September", NORM: "September"}, - {ORTH: "Sept.", LEMMA: "September", NORM: "September"}, - {ORTH: "So.", LEMMA: "Sonntag", NORM: "Sonntag"}, - {ORTH: "Std.", LEMMA: "Stunde", NORM: "Stunde"}, - {ORTH: "Str.", LEMMA: "Straße", NORM: "Straße"}, - {ORTH: "Tel.", LEMMA: "Telefon", NORM: "Telefon"}, - {ORTH: "Tsd.", LEMMA: "Tausend", NORM: "Tausend"}, - {ORTH: "Univ.", LEMMA: "Universität", NORM: "Universität"}, - {ORTH: "abzgl.", LEMMA: "abzüglich", NORM: "abzüglich"}, - {ORTH: "allg.", LEMMA: "allgemein", NORM: "allgemein"}, - {ORTH: "bspw.", LEMMA: "beispielsweise", NORM: "beispielsweise"}, - {ORTH: "bzgl.", LEMMA: "bezüglich", NORM: "bezüglich"}, - {ORTH: "bzw.", LEMMA: "beziehungsweise", NORM: "beziehungsweise"}, - {ORTH: "d.h.", LEMMA: "das heißt"}, - {ORTH: "dgl.", LEMMA: "dergleichen", NORM: "dergleichen"}, - {ORTH: "ebd.", LEMMA: "ebenda", NORM: "ebenda"}, - {ORTH: "eigtl.", LEMMA: "eigentlich", NORM: "eigentlich"}, - {ORTH: "engl.", LEMMA: "englisch", NORM: "englisch"}, - {ORTH: "evtl.", LEMMA: "eventuell", NORM: "eventuell"}, - {ORTH: "frz.", LEMMA: "französisch", NORM: "französisch"}, - {ORTH: "gegr.", LEMMA: "gegründet", NORM: "gegründet"}, - {ORTH: "ggf.", LEMMA: "gegebenenfalls", NORM: "gegebenenfalls"}, - {ORTH: "ggfs.", LEMMA: "gegebenenfalls", NORM: "gegebenenfalls"}, - {ORTH: "ggü.", LEMMA: "gegenüber", NORM: "gegenüber"}, - {ORTH: "i.O.", LEMMA: "in Ordnung"}, - {ORTH: "i.d.R.", LEMMA: "in der Regel"}, - {ORTH: "incl.", LEMMA: "inklusive", NORM: "inklusive"}, - {ORTH: "inkl.", LEMMA: "inklusive", NORM: "inklusive"}, - {ORTH: "insb.", LEMMA: "insbesondere", NORM: "insbesondere"}, - {ORTH: "kath.", LEMMA: "katholisch", NORM: "katholisch"}, - {ORTH: "lt.", LEMMA: "laut", NORM: "laut"}, - {ORTH: "max.", LEMMA: "maximal", NORM: "maximal"}, - {ORTH: "min.", LEMMA: "minimal", NORM: "minimal"}, - {ORTH: "mind.", LEMMA: "mindestens", NORM: "mindestens"}, - {ORTH: "mtl.", LEMMA: "monatlich", NORM: "monatlich"}, - {ORTH: "n.Chr.", LEMMA: "nach Christus"}, - {ORTH: "orig.", LEMMA: "original", NORM: "original"}, - {ORTH: "röm.", LEMMA: "römisch", NORM: "römisch"}, - {ORTH: "s.o.", LEMMA: "siehe oben"}, - {ORTH: "sog.", LEMMA: "so genannt"}, - {ORTH: "stellv.", LEMMA: "stellvertretend"}, - {ORTH: "tägl.", LEMMA: "täglich", NORM: "täglich"}, - {ORTH: "u.U.", LEMMA: "unter Umständen"}, - {ORTH: "u.s.w.", LEMMA: "und so weiter"}, - {ORTH: "u.v.m.", LEMMA: "und vieles mehr"}, - {ORTH: "usf.", LEMMA: "und so fort"}, - {ORTH: "usw.", LEMMA: "und so weiter"}, - {ORTH: "uvm.", LEMMA: "und vieles mehr"}, - {ORTH: "v.Chr.", LEMMA: "vor Christus"}, - 
{ORTH: "v.a.", LEMMA: "vor allem"}, - {ORTH: "v.l.n.r.", LEMMA: "von links nach rechts"}, - {ORTH: "vgl.", LEMMA: "vergleiche", NORM: "vergleiche"}, - {ORTH: "vllt.", LEMMA: "vielleicht", NORM: "vielleicht"}, - {ORTH: "vlt.", LEMMA: "vielleicht", NORM: "vielleicht"}, - {ORTH: "z.B.", LEMMA: "zum Beispiel"}, - {ORTH: "z.Bsp.", LEMMA: "zum Beispiel"}, - {ORTH: "z.T.", LEMMA: "zum Teil"}, - {ORTH: "z.Z.", LEMMA: "zur Zeit"}, - {ORTH: "z.Zt.", LEMMA: "zur Zeit"}, - {ORTH: "z.b.", LEMMA: "zum Beispiel"}, - {ORTH: "zzgl.", LEMMA: "zuzüglich"}, - {ORTH: "österr.", LEMMA: "österreichisch", NORM: "österreichisch"}, + {ORTH: "'S", NORM: "'s"}, + {ORTH: "'s", NORM: "'s"}, + {ORTH: "S'", NORM: "'s"}, + {ORTH: "s'", NORM: "'s"}, + {ORTH: "'n", NORM: "ein"}, + {ORTH: "'ne", NORM: "eine"}, + {ORTH: "'nen", NORM: "einen"}, + {ORTH: "'nem", NORM: "einem"}, + {ORTH: "Abb.", NORM: "Abbildung"}, + {ORTH: "Abk.", NORM: "Abkürzung"}, + {ORTH: "Abt.", NORM: "Abteilung"}, + {ORTH: "Apr.", NORM: "April"}, + {ORTH: "Aug.", NORM: "August"}, + {ORTH: "Bd.", NORM: "Band"}, + {ORTH: "Betr.", NORM: "Betreff"}, + {ORTH: "Bf.", NORM: "Bahnhof"}, + {ORTH: "Bhf.", NORM: "Bahnhof"}, + {ORTH: "Bsp.", NORM: "Beispiel"}, + {ORTH: "Dez.", NORM: "Dezember"}, + {ORTH: "Di.", NORM: "Dienstag"}, + {ORTH: "Do.", NORM: "Donnerstag"}, + {ORTH: "Fa.", NORM: "Firma"}, + {ORTH: "Fam.", NORM: "Familie"}, + {ORTH: "Feb.", NORM: "Februar"}, + {ORTH: "Fr.", NORM: "Frau"}, + {ORTH: "Frl.", NORM: "Fräulein"}, + {ORTH: "Hbf.", NORM: "Hauptbahnhof"}, + {ORTH: "Hr.", NORM: "Herr"}, + {ORTH: "Hrn.", NORM: "Herrn"}, + {ORTH: "Jan.", NORM: "Januar"}, + {ORTH: "Jh.", NORM: "Jahrhundert"}, + {ORTH: "Jhd.", NORM: "Jahrhundert"}, + {ORTH: "Jul.", NORM: "Juli"}, + {ORTH: "Jun.", NORM: "Juni"}, + {ORTH: "Mi.", NORM: "Mittwoch"}, + {ORTH: "Mio.", NORM: "Million"}, + {ORTH: "Mo.", NORM: "Montag"}, + {ORTH: "Mrd.", NORM: "Milliarde"}, + {ORTH: "Mrz.", NORM: "März"}, + {ORTH: "MwSt.", NORM: "Mehrwertsteuer"}, + {ORTH: "Mär.", NORM: "März"}, + {ORTH: "Nov.", NORM: "November"}, + {ORTH: "Nr.", NORM: "Nummer"}, + {ORTH: "Okt.", NORM: "Oktober"}, + {ORTH: "Orig.", NORM: "Original"}, + {ORTH: "Pkt.", NORM: "Punkt"}, + {ORTH: "Prof.", NORM: "Professor"}, + {ORTH: "Red.", NORM: "Redaktion"}, + {ORTH: "Sa.", NORM: "Samstag"}, + {ORTH: "Sep.", NORM: "September"}, + {ORTH: "Sept.", NORM: "September"}, + {ORTH: "So.", NORM: "Sonntag"}, + {ORTH: "Std.", NORM: "Stunde"}, + {ORTH: "Str.", NORM: "Straße"}, + {ORTH: "Tel.", NORM: "Telefon"}, + {ORTH: "Tsd.", NORM: "Tausend"}, + {ORTH: "Univ.", NORM: "Universität"}, + {ORTH: "abzgl.", NORM: "abzüglich"}, + {ORTH: "allg.", NORM: "allgemein"}, + {ORTH: "bspw.", NORM: "beispielsweise"}, + {ORTH: "bzgl.", NORM: "bezüglich"}, + {ORTH: "bzw.", NORM: "beziehungsweise"}, + {ORTH: "d.h."}, + {ORTH: "dgl.", NORM: "dergleichen"}, + {ORTH: "ebd.", NORM: "ebenda"}, + {ORTH: "eigtl.", NORM: "eigentlich"}, + {ORTH: "engl.", NORM: "englisch"}, + {ORTH: "evtl.", NORM: "eventuell"}, + {ORTH: "frz.", NORM: "französisch"}, + {ORTH: "gegr.", NORM: "gegründet"}, + {ORTH: "ggf.", NORM: "gegebenenfalls"}, + {ORTH: "ggfs.", NORM: "gegebenenfalls"}, + {ORTH: "ggü.", NORM: "gegenüber"}, + {ORTH: "i.O."}, + {ORTH: "i.d.R."}, + {ORTH: "incl.", NORM: "inklusive"}, + {ORTH: "inkl.", NORM: "inklusive"}, + {ORTH: "insb.", NORM: "insbesondere"}, + {ORTH: "kath.", NORM: "katholisch"}, + {ORTH: "lt.", NORM: "laut"}, + {ORTH: "max.", NORM: "maximal"}, + {ORTH: "min.", NORM: "minimal"}, + {ORTH: "mind.", NORM: "mindestens"}, + {ORTH: "mtl.", NORM: "monatlich"}, + 
{ORTH: "n.Chr."}, + {ORTH: "orig.", NORM: "original"}, + {ORTH: "röm.", NORM: "römisch"}, + {ORTH: "s.o."}, + {ORTH: "sog."}, + {ORTH: "stellv."}, + {ORTH: "tägl.", NORM: "täglich"}, + {ORTH: "u.U."}, + {ORTH: "u.s.w."}, + {ORTH: "u.v.m."}, + {ORTH: "usf."}, + {ORTH: "usw."}, + {ORTH: "uvm."}, + {ORTH: "v.Chr."}, + {ORTH: "v.a."}, + {ORTH: "v.l.n.r."}, + {ORTH: "vgl.", NORM: "vergleiche"}, + {ORTH: "vllt.", NORM: "vielleicht"}, + {ORTH: "vlt.", NORM: "vielleicht"}, + {ORTH: "z.B."}, + {ORTH: "z.Bsp."}, + {ORTH: "z.T."}, + {ORTH: "z.Z."}, + {ORTH: "z.Zt."}, + {ORTH: "z.b."}, + {ORTH: "zzgl."}, + {ORTH: "österr.", NORM: "österreichisch"}, ]: _exc[exc_data[ORTH]] = [exc_data] @@ -257,4 +232,4 @@ for orth in [ _exc[orth] = [{ORTH: orth}] -TOKENIZER_EXCEPTIONS = _exc +TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc) diff --git a/spacy/lang/el/__init__.py b/spacy/lang/el/__init__.py index d03a42da9..53069334e 100644 --- a/spacy/lang/el/__init__.py +++ b/spacy/lang/el/__init__.py @@ -1,43 +1,38 @@ -# -*- coding: utf-8 -*- - -from __future__ import unicode_literals +from typing import Optional +from thinc.api import Model from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS -from ..tag_map import TAG_MAP from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS -from .lemmatizer import GreekLemmatizer from .syntax_iterators import SYNTAX_ITERATORS from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES, TOKENIZER_INFIXES -from ..tokenizer_exceptions import BASE_EXCEPTIONS +from .lemmatizer import GreekLemmatizer from ...language import Language -from ...lookups import Lookups -from ...attrs import LANG -from ...util import update_exc class GreekDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters.update(LEX_ATTRS) - lex_attr_getters[LANG] = lambda text: "el" - tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) - stop_words = STOP_WORDS - tag_map = TAG_MAP + tokenizer_exceptions = TOKENIZER_EXCEPTIONS prefixes = TOKENIZER_PREFIXES suffixes = TOKENIZER_SUFFIXES infixes = TOKENIZER_INFIXES + lex_attr_getters = LEX_ATTRS + stop_words = STOP_WORDS syntax_iterators = SYNTAX_ITERATORS - @classmethod - def create_lemmatizer(cls, nlp=None, lookups=None): - if lookups is None: - lookups = Lookups() - return GreekLemmatizer(lookups) - class Greek(Language): lang = "el" Defaults = GreekDefaults +@Greek.factory( + "lemmatizer", + assigns=["token.lemma"], + default_config={"model": None, "mode": "rule"}, + default_score_weights={"lemma_acc": 1.0}, +) +def make_lemmatizer(nlp: Language, model: Optional[Model], name: str, mode: str): + return GreekLemmatizer(nlp.vocab, model, name, mode=mode) + + __all__ = ["Greek"] diff --git a/spacy/lang/el/examples.py b/spacy/lang/el/examples.py index 521e7b30d..62515c07a 100644 --- a/spacy/lang/el/examples.py +++ b/spacy/lang/el/examples.py @@ -1,7 +1,3 @@ -# -*- coding: utf-8 -*- - -from __future__ import unicode_literals - """ Example sentences to test spaCy and its language models. 
>>> from spacy.lang.el.examples import sentences diff --git a/spacy/lang/el/get_pos_from_wiktionary.py b/spacy/lang/el/get_pos_from_wiktionary.py index f41833974..369973cc0 100644 --- a/spacy/lang/el/get_pos_from_wiktionary.py +++ b/spacy/lang/el/get_pos_from_wiktionary.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - def get_pos_from_wiktionary(): import re from gensim.corpora.wikicorpus import extract_pages diff --git a/spacy/lang/el/lemmatizer.py b/spacy/lang/el/lemmatizer.py index 6f5b3999b..a049601dc 100644 --- a/spacy/lang/el/lemmatizer.py +++ b/spacy/lang/el/lemmatizer.py @@ -1,7 +1,7 @@ -# coding: utf8 -from __future__ import unicode_literals +from typing import List -from ...lemmatizer import Lemmatizer +from ...pipeline import Lemmatizer +from ...tokens import Token class GreekLemmatizer(Lemmatizer): @@ -15,7 +15,27 @@ class GreekLemmatizer(Lemmatizer): not applicable for Greek language. """ - def lemmatize(self, string, index, exceptions, rules): + def rule_lemmatize(self, token: Token) -> List[str]: + """Lemmatize using a rule-based approach. + + token (Token): The token to lemmatize. + RETURNS (list): The available lemmas for the string. + """ + cache_key = (token.lower, token.pos) + if cache_key in self.cache: + return self.cache[cache_key] + string = token.text + univ_pos = token.pos_.lower() + if univ_pos in ("", "eol", "space"): + return [string.lower()] + + index_table = self.lookups.get_table("lemma_index", {}) + exc_table = self.lookups.get_table("lemma_exc", {}) + rules_table = self.lookups.get_table("lemma_rules", {}) + index = index_table.get(univ_pos, {}) + exceptions = exc_table.get(univ_pos, {}) + rules = rules_table.get(univ_pos, {}) + string = string.lower() forms = [] if string in index: @@ -37,4 +57,6 @@ class GreekLemmatizer(Lemmatizer): forms.extend(oov_forms) if not forms: forms.append(string) - return list(set(forms)) + forms = list(set(forms)) + self.cache[cache_key] = forms + return forms diff --git a/spacy/lang/el/lex_attrs.py b/spacy/lang/el/lex_attrs.py index cf32fe12c..5c8f96848 100644 --- a/spacy/lang/el/lex_attrs.py +++ b/spacy/lang/el/lex_attrs.py @@ -1,7 +1,3 @@ -# -*- coding: utf-8 -*- - -from __future__ import unicode_literals - from ...attrs import LIKE_NUM _num_words = [ diff --git a/spacy/lang/el/punctuation.py b/spacy/lang/el/punctuation.py index fbf773f4d..2d5690407 100644 --- a/spacy/lang/el/punctuation.py +++ b/spacy/lang/el/punctuation.py @@ -1,7 +1,3 @@ -# -*- coding: utf-8 -*- - -from __future__ import unicode_literals - from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, LIST_CURRENCY from ..char_classes import LIST_ICONS, ALPHA_LOWER, ALPHA_UPPER, ALPHA, HYPHENS from ..char_classes import CONCAT_QUOTES, CURRENCY diff --git a/spacy/lang/el/stop_words.py b/spacy/lang/el/stop_words.py index f13c47ec2..7c436219f 100644 --- a/spacy/lang/el/stop_words.py +++ b/spacy/lang/el/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Stop words # Link to greek stop words: https://www.translatum.gr/forum/index.php?topic=3550.0?topic=3550.0 STOP_WORDS = set( diff --git a/spacy/lang/el/syntax_iterators.py b/spacy/lang/el/syntax_iterators.py index 4a40e28c2..89cfd8b72 100644 --- a/spacy/lang/el/syntax_iterators.py +++ b/spacy/lang/el/syntax_iterators.py @@ -1,24 +1,20 @@ -# coding: utf8 -from __future__ import unicode_literals +from typing import Union, Iterator from ...symbols import NOUN, PROPN, PRON from ...errors import Errors +from ...tokens import Doc, Span 
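Usage note (not part of the diff): with the `@Greek.factory("lemmatizer", ...)` registration above, the Greek rule lemmatizer becomes an ordinary pipeline component. Below is a minimal sketch of how it could be exercised, assuming the lookup tables referenced in `rule_lemmatize` ("lemma_index", "lemma_exc", "lemma_rules") are available, e.g. from the spacy-lookups-data package; without POS annotation from a tagger, `rule_lemmatize` simply falls back to the lowercased token text.

import spacy

# Illustrative sketch only, not part of this changeset.
# spacy.blank("el") returns the Greek subclass, so add_pipe("lemmatizer")
# resolves to the factory registered in spacy/lang/el/__init__.py above.
nlp = spacy.blank("el")
nlp.add_pipe("lemmatizer", config={"mode": "rule"})

# nlp.initialize() lets the lemmatizer load its lookup tables; this assumes
# the tables are installed (e.g. via the spacy-lookups-data package).
nlp.initialize()

doc = nlp("Αυτό είναι ένα παράδειγμα.")
print([(token.text, token.lemma_) for token in doc])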
-def noun_chunks(doclike): - """ - Detect base noun phrases. Works on both Doc and Span. - """ +def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Span]: + """Detect base noun phrases from a dependency parse. Works on Doc and Span.""" # It follows the logic of the noun chunks finder of English language, # adjusted to some Greek language special characteristics. # obj tag corrects some DEP tagger mistakes. # Further improvement of the models will eliminate the need for this tag. labels = ["nsubj", "obj", "iobj", "appos", "ROOT", "obl"] doc = doclike.doc # Ensure works on both Doc and Span. - - if not doc.is_parsed: + if not doc.has_annotation("DEP"): raise ValueError(Errors.E029) - np_deps = [doc.vocab.strings.add(label) for label in labels] conj = doc.vocab.strings.add("conj") nmod = doc.vocab.strings.add("nmod") diff --git a/spacy/lang/el/tag_map_fine.py b/spacy/lang/el/tag_map_fine.py deleted file mode 100644 index b346299bc..000000000 --- a/spacy/lang/el/tag_map_fine.py +++ /dev/null @@ -1,4268 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import POS, PUNCT, SYM, ADJ, CCONJ, SCONJ, NUM, DET, ADV, ADP, X, VERB -from ...symbols import NOUN, PROPN, PART, INTJ, PRON, AUX - - -TAG_MAP = { - "ABBR": {POS: NOUN, "Abbr": "Yes"}, - "AdXxBa": {POS: ADV, "Degree": ""}, - "AdXxCp": {POS: ADV, "Degree": "Cmp"}, - "AdXxSu": {POS: ADV, "Degree": "Sup"}, - "AjBaFePlAc": { - POS: ADJ, - "Degree": "", - "Gender": "Fem", - "Number": "Plur", - "Case": "Acc", - }, - "AjBaFePlDa": { - POS: ADJ, - "Degree": "", - "Gender": "Fem", - "Number": "Plur", - "Case": "Dat", - }, - "AjBaFePlGe": { - POS: ADJ, - "Degree": "", - "Gender": "Fem", - "Number": "Plur", - "Case": "Gen", - }, - "AjBaFePlNm": { - POS: ADJ, - "Degree": "", - "Gender": "Fem", - "Number": "Plur", - "Case": "Nom", - }, - "AjBaFePlVo": { - POS: ADJ, - "Degree": "", - "Gender": "Fem", - "Number": "Plur", - "Case": "Voc", - }, - "AjBaFeSgAc": { - POS: ADJ, - "Degree": "", - "Gender": "Fem", - "Number": "Sing", - "Case": "Acc", - }, - "AjBaFeSgDa": { - POS: ADJ, - "Degree": "", - "Gender": "Fem", - "Number": "Sing", - "Case": "Dat", - }, - "AjBaFeSgGe": { - POS: ADJ, - "Degree": "", - "Gender": "Fem", - "Number": "Sing", - "Case": "Gen", - }, - "AjBaFeSgNm": { - POS: ADJ, - "Degree": "", - "Gender": "Fem", - "Number": "Sing", - "Case": "Nom", - }, - "AjBaFeSgVo": { - POS: ADJ, - "Degree": "", - "Gender": "Fem", - "Number": "Sing", - "Case": "Voc", - }, - "AjBaMaPlAc": { - POS: ADJ, - "Degree": "", - "Gender": "Masc", - "Number": "Plur", - "Case": "Acc", - }, - "AjBaMaPlDa": { - POS: ADJ, - "Degree": "", - "Gender": "Masc", - "Number": "Plur", - "Case": "Dat", - }, - "AjBaMaPlGe": { - POS: ADJ, - "Degree": "", - "Gender": "Masc", - "Number": "Plur", - "Case": "Gen", - }, - "AjBaMaPlNm": { - POS: ADJ, - "Degree": "", - "Gender": "Masc", - "Number": "Plur", - "Case": "Nom", - }, - "AjBaMaPlVo": { - POS: ADJ, - "Degree": "", - "Gender": "Masc", - "Number": "Plur", - "Case": "Voc", - }, - "AjBaMaSgAc": { - POS: ADJ, - "Degree": "", - "Gender": "Masc", - "Number": "Sing", - "Case": "Acc", - }, - "AjBaMaSgDa": { - POS: ADJ, - "Degree": "", - "Gender": "Masc", - "Number": "Sing", - "Case": "Dat", - }, - "AjBaMaSgGe": { - POS: ADJ, - "Degree": "", - "Gender": "Masc", - "Number": "Sing", - "Case": "Gen", - }, - "AjBaMaSgNm": { - POS: ADJ, - "Degree": "", - "Gender": "Masc", - "Number": "Sing", - "Case": "Nom", - }, - "AjBaMaSgVo": { - POS: ADJ, - "Degree": "", - "Gender": "Masc", - "Number": "Sing", - "Case": "Voc", - 
}, - "AjBaNePlAc": { - POS: ADJ, - "Degree": "", - "Gender": "Neut", - "Number": "Plur", - "Case": "Acc", - }, - "AjBaNePlDa": { - POS: ADJ, - "Degree": "", - "Gender": "Neut", - "Number": "Plur", - "Case": "Dat", - }, - "AjBaNePlGe": { - POS: ADJ, - "Degree": "", - "Gender": "Neut", - "Number": "Plur", - "Case": "Gen", - }, - "AjBaNePlNm": { - POS: ADJ, - "Degree": "", - "Gender": "Neut", - "Number": "Plur", - "Case": "Nom", - }, - "AjBaNePlVo": { - POS: ADJ, - "Degree": "", - "Gender": "Neut", - "Number": "Plur", - "Case": "Voc", - }, - "AjBaNeSgAc": { - POS: ADJ, - "Degree": "", - "Gender": "Neut", - "Number": "Sing", - "Case": "Acc", - }, - "AjBaNeSgDa": { - POS: ADJ, - "Degree": "", - "Gender": "Neut", - "Number": "Sing", - "Case": "Dat", - }, - "AjBaNeSgGe": { - POS: ADJ, - "Degree": "", - "Gender": "Neut", - "Number": "Sing", - "Case": "Gen", - }, - "AjBaNeSgNm": { - POS: ADJ, - "Degree": "", - "Gender": "Neut", - "Number": "Sing", - "Case": "Nom", - }, - "AjBaNeSgVo": { - POS: ADJ, - "Degree": "", - "Gender": "Neut", - "Number": "Sing", - "Case": "Voc", - }, - "AjCpFePlAc": { - POS: ADJ, - "Degree": "Cmp", - "Gender": "Fem", - "Number": "Plur", - "Case": "Acc", - }, - "AjCpFePlDa": { - POS: ADJ, - "Degree": "Cmp", - "Gender": "Fem", - "Number": "Plur", - "Case": "Dat", - }, - "AjCpFePlGe": { - POS: ADJ, - "Degree": "Cmp", - "Gender": "Fem", - "Number": "Plur", - "Case": "Gen", - }, - "AjCpFePlNm": { - POS: ADJ, - "Degree": "Cmp", - "Gender": "Fem", - "Number": "Plur", - "Case": "Nom", - }, - "AjCpFePlVo": { - POS: ADJ, - "Degree": "Cmp", - "Gender": "Fem", - "Number": "Plur", - "Case": "Voc", - }, - "AjCpFeSgAc": { - POS: ADJ, - "Degree": "Cmp", - "Gender": "Fem", - "Number": "Sing", - "Case": "Acc", - }, - "AjCpFeSgDa": { - POS: ADJ, - "Degree": "Cmp", - "Gender": "Fem", - "Number": "Sing", - "Case": "Dat", - }, - "AjCpFeSgGe": { - POS: ADJ, - "Degree": "Cmp", - "Gender": "Fem", - "Number": "Sing", - "Case": "Gen", - }, - "AjCpFeSgNm": { - POS: ADJ, - "Degree": "Cmp", - "Gender": "Fem", - "Number": "Sing", - "Case": "Nom", - }, - "AjCpFeSgVo": { - POS: ADJ, - "Degree": "Cmp", - "Gender": "Fem", - "Number": "Sing", - "Case": "Voc", - }, - "AjCpMaPlAc": { - POS: ADJ, - "Degree": "Cmp", - "Gender": "Masc", - "Number": "Plur", - "Case": "Acc", - }, - "AjCpMaPlDa": { - POS: ADJ, - "Degree": "Cmp", - "Gender": "Masc", - "Number": "Plur", - "Case": "Dat", - }, - "AjCpMaPlGe": { - POS: ADJ, - "Degree": "Cmp", - "Gender": "Masc", - "Number": "Plur", - "Case": "Gen", - }, - "AjCpMaPlNm": { - POS: ADJ, - "Degree": "Cmp", - "Gender": "Masc", - "Number": "Plur", - "Case": "Nom", - }, - "AjCpMaPlVo": { - POS: ADJ, - "Degree": "Cmp", - "Gender": "Masc", - "Number": "Plur", - "Case": "Voc", - }, - "AjCpMaSgAc": { - POS: ADJ, - "Degree": "Cmp", - "Gender": "Masc", - "Number": "Sing", - "Case": "Acc", - }, - "AjCpMaSgDa": { - POS: ADJ, - "Degree": "Cmp", - "Gender": "Masc", - "Number": "Sing", - "Case": "Dat", - }, - "AjCpMaSgGe": { - POS: ADJ, - "Degree": "Cmp", - "Gender": "Masc", - "Number": "Sing", - "Case": "Gen", - }, - "AjCpMaSgNm": { - POS: ADJ, - "Degree": "Cmp", - "Gender": "Masc", - "Number": "Sing", - "Case": "Nom", - }, - "AjCpMaSgVo": { - POS: ADJ, - "Degree": "Cmp", - "Gender": "Masc", - "Number": "Sing", - "Case": "Voc", - }, - "AjCpNePlAc": { - POS: ADJ, - "Degree": "Cmp", - "Gender": "Neut", - "Number": "Plur", - "Case": "Acc", - }, - "AjCpNePlDa": { - POS: ADJ, - "Degree": "Cmp", - "Gender": "Neut", - "Number": "Plur", - "Case": "Dat", - }, - "AjCpNePlGe": { - POS: ADJ, - 
"Degree": "Cmp", - "Gender": "Neut", - "Number": "Plur", - "Case": "Gen", - }, - "AjCpNePlNm": { - POS: ADJ, - "Degree": "Cmp", - "Gender": "Neut", - "Number": "Plur", - "Case": "Nom", - }, - "AjCpNePlVo": { - POS: ADJ, - "Degree": "Cmp", - "Gender": "Neut", - "Number": "Plur", - "Case": "Voc", - }, - "AjCpNeSgAc": { - POS: ADJ, - "Degree": "Cmp", - "Gender": "Neut", - "Number": "Sing", - "Case": "Acc", - }, - "AjCpNeSgDa": { - POS: ADJ, - "Degree": "Cmp", - "Gender": "Neut", - "Number": "Sing", - "Case": "Dat", - }, - "AjCpNeSgGe": { - POS: ADJ, - "Degree": "Cmp", - "Gender": "Neut", - "Number": "Sing", - "Case": "Gen", - }, - "AjCpNeSgNm": { - POS: ADJ, - "Degree": "Cmp", - "Gender": "Neut", - "Number": "Sing", - "Case": "Nom", - }, - "AjCpNeSgVo": { - POS: ADJ, - "Degree": "Cmp", - "Gender": "Neut", - "Number": "Sing", - "Case": "Voc", - }, - "AjSuFePlAc": { - POS: ADJ, - "Degree": "Sup", - "Gender": "Fem", - "Number": "Plur", - "Case": "Acc", - }, - "AjSuFePlDa": { - POS: ADJ, - "Degree": "Sup", - "Gender": "Fem", - "Number": "Plur", - "Case": "Dat", - }, - "AjSuFePlGe": { - POS: ADJ, - "Degree": "Sup", - "Gender": "Fem", - "Number": "Plur", - "Case": "Gen", - }, - "AjSuFePlNm": { - POS: ADJ, - "Degree": "Sup", - "Gender": "Fem", - "Number": "Plur", - "Case": "Nom", - }, - "AjSuFePlVo": { - POS: ADJ, - "Degree": "Sup", - "Gender": "Fem", - "Number": "Plur", - "Case": "Voc", - }, - "AjSuFeSgAc": { - POS: ADJ, - "Degree": "Sup", - "Gender": "Fem", - "Number": "Sing", - "Case": "Acc", - }, - "AjSuFeSgDa": { - POS: ADJ, - "Degree": "Sup", - "Gender": "Fem", - "Number": "Sing", - "Case": "Dat", - }, - "AjSuFeSgGe": { - POS: ADJ, - "Degree": "Sup", - "Gender": "Fem", - "Number": "Sing", - "Case": "Gen", - }, - "AjSuFeSgNm": { - POS: ADJ, - "Degree": "Sup", - "Gender": "Fem", - "Number": "Sing", - "Case": "Nom", - }, - "AjSuFeSgVo": { - POS: ADJ, - "Degree": "Sup", - "Gender": "Fem", - "Number": "Sing", - "Case": "Voc", - }, - "AjSuMaPlAc": { - POS: ADJ, - "Degree": "Sup", - "Gender": "Masc", - "Number": "Plur", - "Case": "Acc", - }, - "AjSuMaPlDa": { - POS: ADJ, - "Degree": "Sup", - "Gender": "Masc", - "Number": "Plur", - "Case": "Dat", - }, - "AjSuMaPlGe": { - POS: ADJ, - "Degree": "Sup", - "Gender": "Masc", - "Number": "Plur", - "Case": "Gen", - }, - "AjSuMaPlNm": { - POS: ADJ, - "Degree": "Sup", - "Gender": "Masc", - "Number": "Plur", - "Case": "Nom", - }, - "AjSuMaPlVo": { - POS: ADJ, - "Degree": "Sup", - "Gender": "Masc", - "Number": "Plur", - "Case": "Voc", - }, - "AjSuMaSgAc": { - POS: ADJ, - "Degree": "Sup", - "Gender": "Masc", - "Number": "Sing", - "Case": "Acc", - }, - "AjSuMaSgDa": { - POS: ADJ, - "Degree": "Sup", - "Gender": "Masc", - "Number": "Sing", - "Case": "Dat", - }, - "AjSuMaSgGe": { - POS: ADJ, - "Degree": "Sup", - "Gender": "Masc", - "Number": "Sing", - "Case": "Gen", - }, - "AjSuMaSgNm": { - POS: ADJ, - "Degree": "Sup", - "Gender": "Masc", - "Number": "Sing", - "Case": "Nom", - }, - "AjSuMaSgVo": { - POS: ADJ, - "Degree": "Sup", - "Gender": "Masc", - "Number": "Sing", - "Case": "Voc", - }, - "AjSuNePlAc": { - POS: ADJ, - "Degree": "Sup", - "Gender": "Neut", - "Number": "Plur", - "Case": "Acc", - }, - "AjSuNePlDa": { - POS: ADJ, - "Degree": "Sup", - "Gender": "Neut", - "Number": "Plur", - "Case": "Dat", - }, - "AjSuNePlGe": { - POS: ADJ, - "Degree": "Sup", - "Gender": "Neut", - "Number": "Plur", - "Case": "Gen", - }, - "AjSuNePlNm": { - POS: ADJ, - "Degree": "Sup", - "Gender": "Neut", - "Number": "Plur", - "Case": "Nom", - }, - "AjSuNePlVo": { - POS: ADJ, - "Degree": 
"Sup", - "Gender": "Neut", - "Number": "Plur", - "Case": "Voc", - }, - "AjSuNeSgAc": { - POS: ADJ, - "Degree": "Sup", - "Gender": "Neut", - "Number": "Sing", - "Case": "Acc", - }, - "AjSuNeSgDa": { - POS: ADJ, - "Degree": "Sup", - "Gender": "Neut", - "Number": "Sing", - "Case": "Dat", - }, - "AjSuNeSgGe": { - POS: ADJ, - "Degree": "Sup", - "Gender": "Neut", - "Number": "Sing", - "Case": "Gen", - }, - "AjSuNeSgNm": { - POS: ADJ, - "Degree": "Sup", - "Gender": "Neut", - "Number": "Sing", - "Case": "Nom", - }, - "AjSuNeSgVo": { - POS: ADJ, - "Degree": "Sup", - "Gender": "Neut", - "Number": "Sing", - "Case": "Voc", - }, - "AsPpPaFePlAc": {POS: ADP, "Gender": "Fem", "Number": "Plur", "Case": "Acc"}, - "AsPpPaFePlGe": {POS: ADP, "Gender": "Fem", "Number": "Plur", "Case": "Gen"}, - "AsPpPaFeSgAc": {POS: ADP, "Gender": "Fem", "Number": "Sing", "Case": "Acc"}, - "AsPpPaFeSgGe": {POS: ADP, "Gender": "Fem", "Number": "Sing", "Case": "Gen"}, - "AsPpPaMaPlAc": {POS: ADP, "Gender": "Masc", "Number": "Plur", "Case": "Acc"}, - "AsPpPaMaPlGe": {POS: ADP, "Gender": "Masc", "Number": "Plur", "Case": "Gen"}, - "AsPpPaMaSgAc": {POS: ADP, "Gender": "Masc", "Number": "Sing", "Case": "Acc"}, - "AsPpPaMaSgGe": {POS: ADP, "Gender": "Masc", "Number": "Sing", "Case": "Gen"}, - "AsPpPaNePlAc": {POS: ADP, "Gender": "Neut", "Number": "Plur", "Case": "Acc"}, - "AsPpPaNePlGe": {POS: ADP, "Gender": "Neut", "Number": "Plur", "Case": "Gen"}, - "AsPpPaNeSgAc": {POS: ADP, "Gender": "Neut", "Number": "Sing", "Case": "Acc"}, - "AsPpPaNeSgGe": {POS: ADP, "Gender": "Neut", "Number": "Sing", "Case": "Gen"}, - "AsPpSp": {POS: ADP}, - "AtDfFePlAc": { - POS: DET, - "PronType": "Art", - "Gender": "Fem", - "Number": "Plur", - "Case": "Acc", - "Other": {"Definite": "Def"}, - }, - "AtDfFePlGe": { - POS: DET, - "PronType": "Art", - "Gender": "Fem", - "Number": "Plur", - "Case": "Gen", - "Other": {"Definite": "Def"}, - }, - "AtDfFePlNm": { - POS: DET, - "PronType": "Art", - "Gender": "Fem", - "Number": "Plur", - "Case": "Nom", - "Other": {"Definite": "Def"}, - }, - "AtDfFeSgAc": { - POS: DET, - "PronType": "Art", - "Gender": "Fem", - "Number": "Sing", - "Case": "Acc", - "Other": {"Definite": "Def"}, - }, - "AtDfFeSgDa": { - POS: DET, - "PronType": "Art", - "Gender": "Fem", - "Number": "Sing", - "Case": "Dat", - "Other": {"Definite": "Def"}, - }, - "AtDfFeSgGe": { - POS: DET, - "PronType": "Art", - "Gender": "Fem", - "Number": "Sing", - "Case": "Gen", - "Other": {"Definite": "Def"}, - }, - "AtDfFeSgNm": { - POS: DET, - "PronType": "Art", - "Gender": "Fem", - "Number": "Sing", - "Case": "Nom", - "Other": {"Definite": "Def"}, - }, - "AtDfMaPlAc": { - POS: DET, - "PronType": "Art", - "Gender": "Masc", - "Number": "Plur", - "Case": "Acc", - "Other": {"Definite": "Def"}, - }, - "AtDfMaPlGe": { - POS: DET, - "PronType": "Art", - "Gender": "Masc", - "Number": "Plur", - "Case": "Gen", - "Other": {"Definite": "Def"}, - }, - "AtDfMaPlNm": { - POS: DET, - "PronType": "Art", - "Gender": "Masc", - "Number": "Plur", - "Case": "Nom", - "Other": {"Definite": "Def"}, - }, - "AtDfMaSgAc": { - POS: DET, - "PronType": "Art", - "Gender": "Masc", - "Number": "Sing", - "Case": "Acc", - "Other": {"Definite": "Def"}, - }, - "AtDfMaSgDa": { - POS: DET, - "PronType": "Art", - "Gender": "Masc", - "Number": "Sing", - "Case": "Dat", - "Other": {"Definite": "Def"}, - }, - "AtDfMaSgGe": { - POS: DET, - "PronType": "Art", - "Gender": "Masc", - "Number": "Sing", - "Case": "Gen", - "Other": {"Definite": "Def"}, - }, - "AtDfMaSgNm": { - POS: DET, - "PronType": "Art", - 
"Gender": "Masc", - "Number": "Sing", - "Case": "Nom", - "Other": {"Definite": "Def"}, - }, - "AtDfNePlAc": { - POS: DET, - "PronType": "Art", - "Gender": "Neut", - "Number": "Plur", - "Case": "Acc", - "Other": {"Definite": "Def"}, - }, - "AtDfNePlDa": { - POS: DET, - "PronType": "Art", - "Gender": "Neut", - "Number": "Plur", - "Case": "Dat", - "Other": {"Definite": "Def"}, - }, - "AtDfNePlGe": { - POS: DET, - "PronType": "Art", - "Gender": "Neut", - "Number": "Plur", - "Case": "Gen", - "Other": {"Definite": "Def"}, - }, - "AtDfNePlNm": { - POS: DET, - "PronType": "Art", - "Gender": "Neut", - "Number": "Plur", - "Case": "Nom", - "Other": {"Definite": "Def"}, - }, - "AtDfNeSgAc": { - POS: DET, - "PronType": "Art", - "Gender": "Neut", - "Number": "Sing", - "Case": "Acc", - "Other": {"Definite": "Def"}, - }, - "AtDfNeSgDa": { - POS: DET, - "PronType": "Art", - "Gender": "Neut", - "Number": "Sing", - "Case": "Dat", - "Other": {"Definite": "Def"}, - }, - "AtDfNeSgGe": { - POS: DET, - "PronType": "Art", - "Gender": "Neut", - "Number": "Sing", - "Case": "Gen", - "Other": {"Definite": "Def"}, - }, - "AtDfNeSgNm": { - POS: DET, - "PronType": "Art", - "Gender": "Neut", - "Number": "Sing", - "Case": "Nom", - "Other": {"Definite": "Def"}, - }, - "AtIdFeSgAc": { - POS: DET, - "PronType": "Art", - "Gender": "Fem", - "Number": "Sing", - "Case": "Acc", - "Other": {"Definite": "Ind"}, - }, - "AtIdFeSgDa": { - POS: DET, - "PronType": "Art", - "Gender": "Fem", - "Number": "Sing", - "Case": "Dat", - "Other": {"Definite": "Ind"}, - }, - "AtIdFeSgGe": { - POS: DET, - "PronType": "Art", - "Gender": "Fem", - "Number": "Sing", - "Case": "Gen", - "Other": {"Definite": "Ind"}, - }, - "AtIdFeSgNm": { - POS: DET, - "PronType": "Art", - "Gender": "Fem", - "Number": "Sing", - "Case": "Nom", - "Other": {"Definite": "Ind"}, - }, - "AtIdMaSgAc": { - POS: DET, - "PronType": "Art", - "Gender": "Masc", - "Number": "Sing", - "Case": "Acc", - "Other": {"Definite": "Ind"}, - }, - "AtIdMaSgGe": { - POS: DET, - "PronType": "Art", - "Gender": "Masc", - "Number": "Sing", - "Case": "Gen", - "Other": {"Definite": "Ind"}, - }, - "AtIdMaSgNm": { - POS: DET, - "PronType": "Art", - "Gender": "Masc", - "Number": "Sing", - "Case": "Nom", - "Other": {"Definite": "Ind"}, - }, - "AtIdNeSgAc": { - POS: DET, - "PronType": "Art", - "Gender": "Neut", - "Number": "Sing", - "Case": "Acc", - "Other": {"Definite": "Ind"}, - }, - "AtIdNeSgGe": { - POS: DET, - "PronType": "Art", - "Gender": "Neut", - "Number": "Sing", - "Case": "Gen", - "Other": {"Definite": "Ind"}, - }, - "AtIdNeSgNm": { - POS: DET, - "PronType": "Art", - "Gender": "Neut", - "Number": "Sing", - "Case": "Nom", - "Other": {"Definite": "Ind"}, - }, - "CjCo": {POS: CCONJ}, - "CjSb": {POS: SCONJ}, - "CPUNCT": {POS: PUNCT}, - "DATE": {POS: NUM}, - "DIG": {POS: NUM}, - "ENUM": {POS: NUM}, - "Ij": {POS: INTJ}, - "INIT": {POS: SYM}, - "NBABBR": {POS: NOUN, "Abbr": "Yes"}, - "NmAnFePlAcAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Fem", - "Number": "Plur", - "Case": "Acc", - }, - "NmAnFePlGeAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Fem", - "Number": "Plur", - "Case": "Gen", - }, - "NmAnFePlNmAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Fem", - "Number": "Plur", - "Case": "Nom", - }, - "NmAnFePlVoAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Fem", - "Number": "Plur", - "Case": "Voc", - }, - "NmAnFeSgAcAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Fem", - "Number": "Sing", - "Case": "Acc", - }, - "NmAnFeSgGeAj": { - POS: NUM, - "NumType": "Mult", - "Gender": 
"Fem", - "Number": "Sing", - "Case": "Gen", - }, - "NmAnFeSgNmAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Fem", - "Number": "Sing", - "Case": "Nom", - }, - "NmAnFeSgVoAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Fem", - "Number": "Sing", - "Case": "Voc", - }, - "NmAnMaPlAcAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Masc", - "Number": "Plur", - "Case": "Acc", - }, - "NmAnMaPlGeAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Masc", - "Number": "Plur", - "Case": "Gen", - }, - "NmAnMaPlNmAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Masc", - "Number": "Plur", - "Case": "Nom", - }, - "NmAnMaPlVoAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Masc", - "Number": "Plur", - "Case": "Voc", - }, - "NmAnMaSgAcAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Masc", - "Number": "Sing", - "Case": "Acc", - }, - "NmAnMaSgGeAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Masc", - "Number": "Sing", - "Case": "Gen", - }, - "NmAnMaSgNmAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Masc", - "Number": "Sing", - "Case": "Nom", - }, - "NmAnMaSgVoAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Masc", - "Number": "Sing", - "Case": "Voc", - }, - "NmAnNePlAcAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Neut", - "Number": "Plur", - "Case": "Acc", - }, - "NmAnNePlGeAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Neut", - "Number": "Plur", - "Case": "Gen", - }, - "NmAnNePlNmAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Neut", - "Number": "Plur", - "Case": "Nom", - }, - "NmAnNePlVoAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Neut", - "Number": "Plur", - "Case": "Voc", - }, - "NmAnNeSgAcAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Neut", - "Number": "Sing", - "Case": "Acc", - }, - "NmAnNeSgGeAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Neut", - "Number": "Sing", - "Case": "Gen", - }, - "NmAnNeSgNmAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Neut", - "Number": "Sing", - "Case": "Nom", - }, - "NmAnNeSgVoAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Neut", - "Number": "Sing", - "Case": "Voc", - }, - "NmAnXxXxXxAd": { - POS: NUM, - "NumType": "Mult", - "Gender": "Masc|Fem|Neut", - "Number": "Sing|Plur", - "Case": "Acc|Gen|Nom|Voc", - }, - "NmCdFePlAcAj": { - POS: NUM, - "NumType": "Card", - "Gender": "Fem", - "Number": "Plur", - "Case": "Acc", - }, - "NmCdFePlGeAj": { - POS: NUM, - "NumType": "Card", - "Gender": "Fem", - "Number": "Plur", - "Case": "Gen", - }, - "NmCdFePlNmAj": { - POS: NUM, - "NumType": "Card", - "Gender": "Fem", - "Number": "Plur", - "Case": "Nom", - }, - "NmCdFePlVoAj": { - POS: NUM, - "NumType": "Card", - "Gender": "Fem", - "Number": "Plur", - "Case": "Voc", - }, - "NmCdFeSgAcAj": { - POS: NUM, - "NumType": "Card", - "Gender": "Fem", - "Number": "Sing", - "Case": "Acc", - }, - "NmCdFeSgDaAj": { - POS: NUM, - "NumType": "Card", - "Gender": "Fem", - "Number": "Sing", - "Case": "Dat", - }, - "NmCdFeSgGeAj": { - POS: NUM, - "NumType": "Card", - "Gender": "Fem", - "Number": "Sing", - "Case": "Gen", - }, - "NmCdFeSgNmAj": { - POS: NUM, - "NumType": "Card", - "Gender": "Fem", - "Number": "Sing", - "Case": "Nom", - }, - "NmCdMaPlAcAj": { - POS: NUM, - "NumType": "Card", - "Gender": "Masc", - "Number": "Plur", - "Case": "Acc", - }, - "NmCdMaPlGeAj": { - POS: NUM, - "NumType": "Card", - "Gender": "Masc", - "Number": "Plur", - "Case": "Gen", - }, - "NmCdMaPlNmAj": { - POS: NUM, - "NumType": "Card", - "Gender": "Masc", - "Number": "Plur", - "Case": "Nom", - }, - "NmCdMaPlVoAj": { - POS: NUM, - 
"NumType": "Card", - "Gender": "Masc", - "Number": "Plur", - "Case": "Voc", - }, - "NmCdMaSgAcAj": { - POS: NUM, - "NumType": "Card", - "Gender": "Masc", - "Number": "Sing", - "Case": "Acc", - }, - "NmCdMaSgGeAj": { - POS: NUM, - "NumType": "Card", - "Gender": "Masc", - "Number": "Sing", - "Case": "Gen", - }, - "NmCdMaSgNmAj": { - POS: NUM, - "NumType": "Card", - "Gender": "Masc", - "Number": "Sing", - "Case": "Nom", - }, - "NmCdNePlAcAj": { - POS: NUM, - "NumType": "Card", - "Gender": "Neut", - "Number": "Plur", - "Case": "Acc", - }, - "NmCdNePlDaAj": { - POS: NUM, - "NumType": "Card", - "Gender": "Neut", - "Number": "Plur", - "Case": "Dat", - }, - "NmCdNePlGeAj": { - POS: NUM, - "NumType": "Card", - "Gender": "Neut", - "Number": "Plur", - "Case": "Gen", - }, - "NmCdNePlNmAj": { - POS: NUM, - "NumType": "Card", - "Gender": "Neut", - "Number": "Plur", - "Case": "Nom", - }, - "NmCdNePlVoAj": { - POS: NUM, - "NumType": "Card", - "Gender": "Neut", - "Number": "Plur", - "Case": "Voc", - }, - "NmCdNeSgAcAj": { - POS: NUM, - "NumType": "Card", - "Gender": "Neut", - "Number": "Sing", - "Case": "Acc", - }, - "NmCdNeSgGeAj": { - POS: NUM, - "NumType": "Card", - "Gender": "Neut", - "Number": "Sing", - "Case": "Gen", - }, - "NmCdNeSgNmAj": { - POS: NUM, - "NumType": "Card", - "Gender": "Neut", - "Number": "Sing", - "Case": "Nom", - }, - "NmCtFePlAcNo": { - POS: NUM, - "NumType": "Sets", - "Gender": "Fem", - "Number": "Plur", - "Case": "Acc", - }, - "NmCtFePlGeNo": { - POS: NUM, - "NumType": "Sets", - "Gender": "Fem", - "Number": "Plur", - "Case": "Gen", - }, - "NmCtFePlNmNo": { - POS: NUM, - "NumType": "Sets", - "Gender": "Fem", - "Number": "Plur", - "Case": "Nom", - }, - "NmCtFePlVoNo": { - POS: NUM, - "NumType": "Sets", - "Gender": "Fem", - "Number": "Plur", - "Case": "Voc", - }, - "NmCtFeSgAcNo": { - POS: NUM, - "NumType": "Sets", - "Gender": "Fem", - "Number": "Sing", - "Case": "Acc", - }, - "NmCtFeSgGeNo": { - POS: NUM, - "NumType": "Sets", - "Gender": "Fem", - "Number": "Sing", - "Case": "Gen", - }, - "NmCtFeSgNmNo": { - POS: NUM, - "NumType": "Sets", - "Gender": "Fem", - "Number": "Sing", - "Case": "Nom", - }, - "NmCtFeSgVoNo": { - POS: NUM, - "NumType": "Sets", - "Gender": "Fem", - "Number": "Sing", - "Case": "Voc", - }, - "NmMlFePlAcAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Fem", - "Number": "Plur", - "Case": "Acc", - }, - "NmMlFePlGeAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Fem", - "Number": "Plur", - "Case": "Gen", - }, - "NmMlFePlNmAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Fem", - "Number": "Plur", - "Case": "Nom", - }, - "NmMlFePlVoAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Fem", - "Number": "Plur", - "Case": "Voc", - }, - "NmMlFeSgAcAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Fem", - "Number": "Sing", - "Case": "Acc", - }, - "NmMlFeSgGeAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Fem", - "Number": "Sing", - "Case": "Gen", - }, - "NmMlFeSgNmAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Fem", - "Number": "Sing", - "Case": "Nom", - }, - "NmMlFeSgVoAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Fem", - "Number": "Sing", - "Case": "Voc", - }, - "NmMlMaPlAcAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Masc", - "Number": "Plur", - "Case": "Acc", - }, - "NmMlMaPlGeAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Masc", - "Number": "Plur", - "Case": "Gen", - }, - "NmMlMaPlNmAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Masc", - "Number": "Plur", - "Case": "Nom", - }, - "NmMlMaPlVoAj": { - POS: NUM, - 
"NumType": "Mult", - "Gender": "Masc", - "Number": "Plur", - "Case": "Voc", - }, - "NmMlMaSgAcAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Masc", - "Number": "Sing", - "Case": "Acc", - }, - "NmMlMaSgGeAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Masc", - "Number": "Sing", - "Case": "Gen", - }, - "NmMlMaSgNmAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Masc", - "Number": "Sing", - "Case": "Nom", - }, - "NmMlMaSgVoAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Masc", - "Number": "Sing", - "Case": "Voc", - }, - "NmMlNePlAcAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Neut", - "Number": "Plur", - "Case": "Acc", - }, - "NmMlNePlGeAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Neut", - "Number": "Plur", - "Case": "Gen", - }, - "NmMlNePlNmAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Neut", - "Number": "Plur", - "Case": "Nom", - }, - "NmMlNePlVoAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Neut", - "Number": "Plur", - "Case": "Voc", - }, - "NmMlNeSgAcAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Neut", - "Number": "Sing", - "Case": "Acc", - }, - "NmMlNeSgGeAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Neut", - "Number": "Sing", - "Case": "Gen", - }, - "NmMlNeSgNmAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Neut", - "Number": "Sing", - "Case": "Nom", - }, - "NmMlNeSgVoAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Neut", - "Number": "Sing", - "Case": "Voc", - }, - "NmMlXxXxXxAd": { - POS: NUM, - "NumType": "Mult", - "Gender": "Masc|Fem|Neut", - "Number": "Sing|Plur", - "Case": "Acc|Gen|Nom|Voc", - }, - "NmOdFePlAcAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Fem", - "Number": "Plur", - "Case": "Acc", - }, - "NmOdFePlGeAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Fem", - "Number": "Plur", - "Case": "Gen", - }, - "NmOdFePlNmAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Fem", - "Number": "Plur", - "Case": "Nom", - }, - "NmOdFePlVoAj": { - POS: NUM, - "NumType": "Mult", - "Gender": "Fem", - "Number": "Plur", - "Case": "Voc", - }, - "NmOdFeSgAcAj": { - POS: NUM, - "NumType": "Ord", - "Gender": "Fem", - "Number": "Sing", - "Case": "Acc", - }, - "NmOdFeSgGeAj": { - POS: NUM, - "NumType": "Ord", - "Gender": "Fem", - "Number": "Sing", - "Case": "Gen", - }, - "NmOdFeSgNmAj": { - POS: NUM, - "NumType": "Ord", - "Gender": "Fem", - "Number": "Sing", - "Case": "Nom", - }, - "NmOdFeSgVoAj": { - POS: NUM, - "NumType": "Ord", - "Gender": "Fem", - "Number": "Sing", - "Case": "Voc", - }, - "NmOdMaPlAcAj": { - POS: NUM, - "NumType": "Ord", - "Gender": "Masc", - "Number": "Plur", - "Case": "Acc", - }, - "NmOdMaPlGeAj": { - POS: NUM, - "NumType": "Ord", - "Gender": "Masc", - "Number": "Plur", - "Case": "Gen", - }, - "NmOdMaPlNmAj": { - POS: NUM, - "NumType": "Ord", - "Gender": "Masc", - "Number": "Plur", - "Case": "Nom", - }, - "NmOdMaPlVoAj": { - POS: NUM, - "NumType": "Ord", - "Gender": "Masc", - "Number": "Plur", - "Case": "Voc", - }, - "NmOdMaSgAcAj": { - POS: NUM, - "NumType": "Ord", - "Gender": "Masc", - "Number": "Sing", - "Case": "Acc", - }, - "NmOdMaSgGeAj": { - POS: NUM, - "NumType": "Ord", - "Gender": "Masc", - "Number": "Sing", - "Case": "Gen", - }, - "NmOdMaSgNmAj": { - POS: NUM, - "NumType": "Ord", - "Gender": "Masc", - "Number": "Sing", - "Case": "Nom", - }, - "NmOdMaSgVoAj": { - POS: NUM, - "NumType": "Ord", - "Gender": "Masc", - "Number": "Sing", - "Case": "Voc", - }, - "NmOdNePlAcAj": { - POS: NUM, - "NumType": "Ord", - "Gender": "Neut", - "Number": "Plur", - "Case": "Acc", - }, - "NmOdNePlGeAj": 
{ - POS: NUM, - "NumType": "Ord", - "Gender": "Neut", - "Number": "Plur", - "Case": "Gen", - }, - "NmOdNePlNmAj": { - POS: NUM, - "NumType": "Ord", - "Gender": "Neut", - "Number": "Plur", - "Case": "Nom", - }, - "NmOdNePlVoAj": { - POS: NUM, - "NumType": "Ord", - "Gender": "Neut", - "Number": "Plur", - "Case": "Voc", - }, - "NmOdNeSgAcAj": { - POS: NUM, - "NumType": "Ord", - "Gender": "Neut", - "Number": "Sing", - "Case": "Acc", - }, - "NmOdNeSgGeAj": { - POS: NUM, - "NumType": "Ord", - "Gender": "Neut", - "Number": "Sing", - "Case": "Gen", - }, - "NmOdNeSgNmAj": { - POS: NUM, - "NumType": "Ord", - "Gender": "Neut", - "Number": "Sing", - "Case": "Nom", - }, - "NmOdNeSgVoAj": { - POS: NUM, - "NumType": "Ord", - "Gender": "Neut", - "Number": "Sing", - "Case": "Voc", - }, - "NoCmFePlAc": {POS: NOUN, "Gender": "Fem", "Number": "Plur", "Case": "Acc"}, - "NoCmFePlDa": {POS: NOUN, "Gender": "Fem", "Number": "Plur", "Case": "Dat"}, - "NoCmFePlGe": {POS: NOUN, "Gender": "Fem", "Number": "Plur", "Case": "Gen"}, - "NoCmFePlNm": {POS: NOUN, "Gender": "Fem", "Number": "Plur", "Case": "Nom"}, - "NoCmFePlVo": {POS: NOUN, "Gender": "Fem", "Number": "Plur", "Case": "Voc"}, - "NoCmFeSgAc": {POS: NOUN, "Gender": "Fem", "Number": "Sing", "Case": "Acc"}, - "NoCmFeSgDa": {POS: NOUN, "Gender": "Fem", "Number": "Sing", "Case": "Dat"}, - "NoCmFeSgGe": {POS: NOUN, "Gender": "Fem", "Number": "Sing", "Case": "Gen"}, - "NoCmFeSgNm": {POS: NOUN, "Gender": "Fem", "Number": "Sing", "Case": "Nom"}, - "NoCmFeSgVo": {POS: NOUN, "Gender": "Fem", "Number": "Sing", "Case": "Voc"}, - "NoCmMaPlAc": {POS: NOUN, "Gender": "Masc", "Number": "Plur", "Case": "Acc"}, - "NoCmMaPlDa": {POS: NOUN, "Gender": "Masc", "Number": "Plur", "Case": "Dat"}, - "NoCmMaPlGe": {POS: NOUN, "Gender": "Masc", "Number": "Plur", "Case": "Gen"}, - "NoCmMaPlNm": {POS: NOUN, "Gender": "Masc", "Number": "Plur", "Case": "Nom"}, - "NoCmMaPlVo": {POS: NOUN, "Gender": "Masc", "Number": "Plur", "Case": "Voc"}, - "NoCmMaSgAc": {POS: NOUN, "Gender": "Masc", "Number": "Sing", "Case": "Acc"}, - "NoCmMaSgDa": {POS: NOUN, "Gender": "Masc", "Number": "Sing", "Case": "Dat"}, - "NoCmMaSgGe": {POS: NOUN, "Gender": "Masc", "Number": "Sing", "Case": "Gen"}, - "NoCmMaSgNm": {POS: NOUN, "Gender": "Masc", "Number": "Sing", "Case": "Nom"}, - "NoCmMaSgVo": {POS: NOUN, "Gender": "Masc", "Number": "Sing", "Case": "Voc"}, - "NoCmNePlAc": {POS: NOUN, "Gender": "Neut", "Number": "Plur", "Case": "Acc"}, - "NoCmNePlDa": {POS: NOUN, "Gender": "Neut", "Number": "Plur", "Case": "Dat"}, - "NoCmNePlGe": {POS: NOUN, "Gender": "Neut", "Number": "Plur", "Case": "Gen"}, - "NoCmNePlNm": {POS: NOUN, "Gender": "Neut", "Number": "Plur", "Case": "Nom"}, - "NoCmNePlVo": {POS: NOUN, "Gender": "Neut", "Number": "Plur", "Case": "Voc"}, - "NoCmNeSgAc": {POS: NOUN, "Gender": "Neut", "Number": "Sing", "Case": "Acc"}, - "NoCmNeSgDa": {POS: NOUN, "Gender": "Neut", "Number": "Sing", "Case": "Dat"}, - "NoCmNeSgGe": {POS: NOUN, "Gender": "Neut", "Number": "Sing", "Case": "Gen"}, - "NoCmNeSgNm": {POS: NOUN, "Gender": "Neut", "Number": "Sing", "Case": "Nom"}, - "NoCmNeSgVo": {POS: NOUN, "Gender": "Neut", "Number": "Sing", "Case": "Voc"}, - "NoPrFePlAc": {POS: PROPN, "Gender": "Fem", "Number": "Plur", "Case": "Acc"}, - "NoPrFePlDa": {POS: PROPN, "Gender": "Fem", "Number": "Plur", "Case": "Dat"}, - "NoPrFePlGe": {POS: PROPN, "Gender": "Fem", "Number": "Plur", "Case": "Gen"}, - "NoPrFePlNm": {POS: PROPN, "Gender": "Fem", "Number": "Plur", "Case": "Nom"}, - "NoPrFePlVo": {POS: PROPN, "Gender": "Fem", "Number": "Plur", 
"Case": "Voc"}, - "NoPrFeSgAc": {POS: PROPN, "Gender": "Fem", "Number": "Sing", "Case": "Acc"}, - "NoPrFeSgDa": {POS: PROPN, "Gender": "Fem", "Number": "Sing", "Case": "Dat"}, - "NoPrFeSgGe": {POS: PROPN, "Gender": "Fem", "Number": "Sing", "Case": "Gen"}, - "NoPrFeSgNm": {POS: PROPN, "Gender": "Fem", "Number": "Sing", "Case": "Nom"}, - "NoPrFeSgVo": {POS: PROPN, "Gender": "Fem", "Number": "Sing", "Case": "Voc"}, - "NoPrMaPlAc": {POS: PROPN, "Gender": "Masc", "Number": "Plur", "Case": "Acc"}, - "NoPrMaPlGe": {POS: PROPN, "Gender": "Masc", "Number": "Plur", "Case": "Gen"}, - "NoPrMaPlNm": {POS: PROPN, "Gender": "Masc", "Number": "Plur", "Case": "Nom"}, - "NoPrMaPlVo": {POS: PROPN, "Gender": "Masc", "Number": "Plur", "Case": "Voc"}, - "NoPrMaSgAc": {POS: PROPN, "Gender": "Masc", "Number": "Sing", "Case": "Acc"}, - "NoPrMaSgDa": {POS: PROPN, "Gender": "Masc", "Number": "Sing", "Case": "Dat"}, - "NoPrMaSgGe": {POS: PROPN, "Gender": "Masc", "Number": "Sing", "Case": "Gen"}, - "NoPrMaSgNm": {POS: PROPN, "Gender": "Masc", "Number": "Sing", "Case": "Nom"}, - "NoPrMaSgVo": {POS: PROPN, "Gender": "Masc", "Number": "Sing", "Case": "Voc"}, - "NoPrNePlAc": {POS: PROPN, "Gender": "Neut", "Number": "Plur", "Case": "Acc"}, - "NoPrNePlGe": {POS: PROPN, "Gender": "Neut", "Number": "Plur", "Case": "Gen"}, - "NoPrNePlNm": {POS: PROPN, "Gender": "Neut", "Number": "Plur", "Case": "Nom"}, - "NoPrNeSgAc": {POS: PROPN, "Gender": "Neut", "Number": "Sing", "Case": "Acc"}, - "NoPrNeSgGe": {POS: PROPN, "Gender": "Neut", "Number": "Sing", "Case": "Gen"}, - "NoPrNeSgNm": {POS: PROPN, "Gender": "Neut", "Number": "Sing", "Case": "Nom"}, - "OPUNCT": {POS: PUNCT}, - "PnDfFe03PlAcXx": { - POS: PRON, - "PronType": "", - "Gender": "Fem", - "Person": "3", - "Number": "Plur", - "Case": "Acc", - }, - "PnDfFe03SgAcXx": { - POS: PRON, - "PronType": "", - "Gender": "Fem", - "Person": "3", - "Number": "Sing", - "Case": "Acc", - }, - "PnDfMa03PlGeXx": { - POS: PRON, - "PronType": "", - "Gender": "Masc", - "Person": "3", - "Number": "Plur", - "Case": "Gen", - }, - "PnDmFe03PlAcXx": { - POS: PRON, - "PronType": "Dem", - "Gender": "Fem", - "Person": "3", - "Number": "Plur", - "Case": "Acc", - }, - "PnDmFe03PlGeXx": { - POS: PRON, - "PronType": "Dem", - "Gender": "Fem", - "Person": "3", - "Number": "Plur", - "Case": "Gen", - }, - "PnDmFe03PlNmXx": { - POS: PRON, - "PronType": "Dem", - "Gender": "Fem", - "Person": "3", - "Number": "Plur", - "Case": "Nom", - }, - "PnDmFe03SgAcXx": { - POS: PRON, - "PronType": "Dem", - "Gender": "Fem", - "Person": "3", - "Number": "Sing", - "Case": "Acc", - }, - "PnDmFe03SgDaXx": { - POS: PRON, - "PronType": "Dem", - "Gender": "Fem", - "Person": "3", - "Number": "Sing", - "Case": "Dat", - }, - "PnDmFe03SgGeXx": { - POS: PRON, - "PronType": "Dem", - "Gender": "Fem", - "Person": "3", - "Number": "Sing", - "Case": "Gen", - }, - "PnDmFe03SgNmXx": { - POS: PRON, - "PronType": "Dem", - "Gender": "Fem", - "Person": "3", - "Number": "Sing", - "Case": "Nom", - }, - "PnDmMa03PlAcXx": { - POS: PRON, - "PronType": "Dem", - "Gender": "Masc", - "Person": "3", - "Number": "Plur", - "Case": "Acc", - }, - "PnDmMa03PlDaXx": { - POS: PRON, - "PronType": "Dem", - "Gender": "Masc", - "Person": "3", - "Number": "Plur", - "Case": "Dat", - }, - "PnDmMa03PlGeXx": { - POS: PRON, - "PronType": "Dem", - "Gender": "Masc", - "Person": "3", - "Number": "Plur", - "Case": "Gen", - }, - "PnDmMa03PlNmXx": { - POS: PRON, - "PronType": "Dem", - "Gender": "Masc", - "Person": "3", - "Number": "Plur", - "Case": "Nom", - }, - "PnDmMa03SgAcXx": { - 
POS: PRON, - "PronType": "Dem", - "Gender": "Masc", - "Person": "3", - "Number": "Sing", - "Case": "Acc", - }, - "PnDmMa03SgGeXx": { - POS: PRON, - "PronType": "Dem", - "Gender": "Masc", - "Person": "3", - "Number": "Sing", - "Case": "Gen", - }, - "PnDmMa03SgNmXx": { - POS: PRON, - "PronType": "Dem", - "Gender": "Masc", - "Person": "3", - "Number": "Sing", - "Case": "Nom", - }, - "PnDmNe03PlAcXx": { - POS: PRON, - "PronType": "Dem", - "Gender": "Neut", - "Person": "3", - "Number": "Plur", - "Case": "Acc", - }, - "PnDmNe03PlDaXx": { - POS: PRON, - "PronType": "Dem", - "Gender": "Neut", - "Person": "3", - "Number": "Plur", - "Case": "Dat", - }, - "PnDmNe03PlGeXx": { - POS: PRON, - "PronType": "Dem", - "Gender": "Neut", - "Person": "3", - "Number": "Plur", - "Case": "Gen", - }, - "PnDmNe03PlNmXx": { - POS: PRON, - "PronType": "Dem", - "Gender": "Neut", - "Person": "3", - "Number": "Plur", - "Case": "Nom", - }, - "PnDmNe03SgAcXx": { - POS: PRON, - "PronType": "Dem", - "Gender": "Neut", - "Person": "3", - "Number": "Sing", - "Case": "Acc", - }, - "PnDmNe03SgDaXx": { - POS: PRON, - "PronType": "Dem", - "Gender": "Neut", - "Person": "3", - "Number": "Sing", - "Case": "Dat", - }, - "PnDmNe03SgGeXx": { - POS: PRON, - "PronType": "Dem", - "Gender": "Neut", - "Person": "3", - "Number": "Sing", - "Case": "Gen", - }, - "PnDmNe03SgNmXx": { - POS: PRON, - "PronType": "Dem", - "Gender": "Neut", - "Person": "3", - "Number": "Sing", - "Case": "Nom", - }, - "PnIdFe03PlAcXx": { - POS: PRON, - "PronType": "Ind", - "Gender": "Fem", - "Person": "3", - "Number": "Plur", - "Case": "Acc", - }, - "PnIdFe03PlGeXx": { - POS: PRON, - "PronType": "Ind", - "Gender": "Fem", - "Person": "3", - "Number": "Plur", - "Case": "Gen", - }, - "PnIdFe03PlNmXx": { - POS: PRON, - "PronType": "Ind", - "Gender": "Fem", - "Person": "3", - "Number": "Plur", - "Case": "Nom", - }, - "PnIdFe03SgAcXx": { - POS: PRON, - "PronType": "Ind", - "Gender": "Fem", - "Person": "3", - "Number": "Sing", - "Case": "Acc", - }, - "PnIdFe03SgGeXx": { - POS: PRON, - "PronType": "Ind", - "Gender": "Fem", - "Person": "3", - "Number": "Sing", - "Case": "Gen", - }, - "PnIdFe03SgNmXx": { - POS: PRON, - "PronType": "Ind", - "Gender": "Fem", - "Person": "3", - "Number": "Sing", - "Case": "Nom", - }, - "PnIdMa03PlAcXx": { - POS: PRON, - "PronType": "Ind", - "Gender": "Masc", - "Person": "3", - "Number": "Plur", - "Case": "Acc", - }, - "PnIdMa03PlGeXx": { - POS: PRON, - "PronType": "Ind", - "Gender": "Masc", - "Person": "3", - "Number": "Plur", - "Case": "Gen", - }, - "PnIdMa03PlNmXx": { - POS: PRON, - "PronType": "Ind", - "Gender": "Masc", - "Person": "3", - "Number": "Plur", - "Case": "Nom", - }, - "PnIdMa03SgAcXx": { - POS: PRON, - "PronType": "Ind", - "Gender": "Masc", - "Person": "3", - "Number": "Sing", - "Case": "Acc", - }, - "PnIdMa03SgGeXx": { - POS: PRON, - "PronType": "Ind", - "Gender": "Masc", - "Person": "3", - "Number": "Sing", - "Case": "Gen", - }, - "PnIdMa03SgNmXx": { - POS: PRON, - "PronType": "Ind", - "Gender": "Masc", - "Person": "3", - "Number": "Sing", - "Case": "Nom", - }, - "PnIdNe03PlAcXx": { - POS: PRON, - "PronType": "Ind", - "Gender": "Neut", - "Person": "3", - "Number": "Plur", - "Case": "Acc", - }, - "PnIdNe03PlGeXx": { - POS: PRON, - "PronType": "Ind", - "Gender": "Neut", - "Person": "3", - "Number": "Plur", - "Case": "Gen", - }, - "PnIdNe03PlNmXx": { - POS: PRON, - "PronType": "Ind", - "Gender": "Neut", - "Person": "3", - "Number": "Plur", - "Case": "Nom", - }, - "PnIdNe03SgAcXx": { - POS: PRON, - "PronType": "Ind", - "Gender": 
"Neut", - "Person": "3", - "Number": "Sing", - "Case": "Acc", - }, - "PnIdNe03SgDaXx": { - POS: PRON, - "PronType": "Ind", - "Gender": "Neut", - "Person": "3", - "Number": "Sing", - "Case": "Dat", - }, - "PnIdNe03SgGeXx": { - POS: PRON, - "PronType": "Ind", - "Gender": "Neut", - "Person": "3", - "Number": "Sing", - "Case": "Gen", - }, - "PnIdNe03SgNmXx": { - POS: PRON, - "PronType": "Ind", - "Gender": "Neut", - "Person": "3", - "Number": "Sing", - "Case": "Nom", - }, - "PnIrFe03PlAcXx": { - POS: PRON, - "PronType": "Int", - "Gender": "Fem", - "Person": "3", - "Number": "Plur", - "Case": "Acc", - }, - "PnIrFe03PlGeXx": { - POS: PRON, - "PronType": "Int", - "Gender": "Fem", - "Person": "3", - "Number": "Plur", - "Case": "Gen", - }, - "PnIrFe03PlNmXx": { - POS: PRON, - "PronType": "Int", - "Gender": "Fem", - "Person": "3", - "Number": "Plur", - "Case": "Nom", - }, - "PnIrFe03SgAcXx": { - POS: PRON, - "PronType": "Int", - "Gender": "Fem", - "Person": "3", - "Number": "Sing", - "Case": "Acc", - }, - "PnIrFe03SgGeXx": { - POS: PRON, - "PronType": "Int", - "Gender": "Fem", - "Person": "3", - "Number": "Sing", - "Case": "Gen", - }, - "PnIrFe03SgNmXx": { - POS: PRON, - "PronType": "Int", - "Gender": "Fem", - "Person": "3", - "Number": "Sing", - "Case": "Nom", - }, - "PnIrMa03PlAcXx": { - POS: PRON, - "PronType": "Int", - "Gender": "Masc", - "Person": "3", - "Number": "Plur", - "Case": "Acc", - }, - "PnIrMa03PlGeXx": { - POS: PRON, - "PronType": "Int", - "Gender": "Masc", - "Person": "3", - "Number": "Plur", - "Case": "Gen", - }, - "PnIrMa03PlNmXx": { - POS: PRON, - "PronType": "Int", - "Gender": "Masc", - "Person": "3", - "Number": "Plur", - "Case": "Nom", - }, - "PnIrMa03SgAcXx": { - POS: PRON, - "PronType": "Int", - "Gender": "Masc", - "Person": "3", - "Number": "Sing", - "Case": "Acc", - }, - "PnIrMa03SgGeXx": { - POS: PRON, - "PronType": "Int", - "Gender": "Masc", - "Person": "3", - "Number": "Sing", - "Case": "Gen", - }, - "PnIrMa03SgNmXx": { - POS: PRON, - "PronType": "Int", - "Gender": "Masc", - "Person": "3", - "Number": "Sing", - "Case": "Nom", - }, - "PnIrNe03PlAcXx": { - POS: PRON, - "PronType": "Int", - "Gender": "Neut", - "Person": "3", - "Number": "Plur", - "Case": "Acc", - }, - "PnIrNe03PlGeXx": { - POS: PRON, - "PronType": "Int", - "Gender": "Neut", - "Person": "3", - "Number": "Plur", - "Case": "Gen", - }, - "PnIrNe03PlNmXx": { - POS: PRON, - "PronType": "Int", - "Gender": "Neut", - "Person": "3", - "Number": "Plur", - "Case": "Nom", - }, - "PnIrNe03SgAcXx": { - POS: PRON, - "PronType": "Int", - "Gender": "Neut", - "Person": "3", - "Number": "Sing", - "Case": "Acc", - }, - "PnIrNe03SgGeXx": { - POS: PRON, - "PronType": "Int", - "Gender": "Neut", - "Person": "3", - "Number": "Sing", - "Case": "Gen", - }, - "PnIrNe03SgNmXx": { - POS: PRON, - "PronType": "Int", - "Gender": "Neut", - "Person": "3", - "Number": "Sing", - "Case": "Nom", - }, - "PnPeFe01PlAcSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Fem", - "Person": "1", - "Number": "Plur", - "Case": "Acc", - }, - "PnPeFe01PlAcWe": { - POS: PRON, - "PronType": "Prs", - "Gender": "Fem", - "Person": "1", - "Number": "Plur", - "Case": "Acc", - }, - "PnPeFe01PlGeWe": { - POS: PRON, - "PronType": "Prs", - "Gender": "Fem", - "Person": "1", - "Number": "Plur", - "Case": "Gen", - }, - "PnPeFe01PlNmSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Fem", - "Person": "1", - "Number": "Plur", - "Case": "Nom", - }, - "PnPeFe01SgAcSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Fem", - "Person": "1", - "Number": "Sing", - "Case": 
"Acc", - }, - "PnPeFe01SgAcWe": { - POS: PRON, - "PronType": "Prs", - "Gender": "Fem", - "Person": "1", - "Number": "Sing", - "Case": "Acc", - }, - "PnPeFe01SgGeSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Fem", - "Person": "1", - "Number": "Sing", - "Case": "Gen", - }, - "PnPeFe01SgGeWe": { - POS: PRON, - "PronType": "Prs", - "Gender": "Fem", - "Person": "1", - "Number": "Sing", - "Case": "Gen", - }, - "PnPeFe01SgNmSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Fem", - "Person": "1", - "Number": "Sing", - "Case": "Nom", - }, - "PnPeFe02PlAcSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Fem", - "Person": "2", - "Number": "Plur", - "Case": "Acc", - }, - "PnPeFe02PlAcWe": { - POS: PRON, - "PronType": "Prs", - "Gender": "Fem", - "Person": "2", - "Number": "Plur", - "Case": "Acc", - }, - "PnPeFe02PlGeSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Fem", - "Person": "2", - "Number": "Plur", - "Case": "Gen", - }, - "PnPeFe02PlGeWe": { - POS: PRON, - "PronType": "Prs", - "Gender": "Fem", - "Person": "2", - "Number": "Plur", - "Case": "Gen", - }, - "PnPeFe02PlNmSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Fem", - "Person": "2", - "Number": "Plur", - "Case": "Nom", - }, - "PnPeFe02SgAcSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Fem", - "Person": "2", - "Number": "Sing", - "Case": "Acc", - }, - "PnPeFe02SgAcWe": { - POS: PRON, - "PronType": "Prs", - "Gender": "Fem", - "Person": "2", - "Number": "Sing", - "Case": "Acc", - }, - "PnPeFe02SgGeWe": { - POS: PRON, - "PronType": "Prs", - "Gender": "Fem", - "Person": "2", - "Number": "Sing", - "Case": "Gen", - }, - "PnPeFe02SgNmSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Fem", - "Person": "2", - "Number": "Sing", - "Case": "Nom", - }, - "PnPeFe03PlAcSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Fem", - "Person": "3", - "Number": "Plur", - "Case": "Acc", - }, - "PnPeFe03PlAcWe": { - POS: PRON, - "PronType": "Prs", - "Gender": "Fem", - "Person": "3", - "Number": "Plur", - "Case": "Acc", - }, - "PnPeFe03PlGeSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Fem", - "Person": "3", - "Number": "Plur", - "Case": "Gen", - }, - "PnPeFe03PlGeWe": { - POS: PRON, - "PronType": "Prs", - "Gender": "Fem", - "Person": "3", - "Number": "Plur", - "Case": "Gen", - }, - "PnPeFe03PlNmSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Fem", - "Person": "3", - "Number": "Plur", - "Case": "Nom", - }, - "PnPeFe03SgAcSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Fem", - "Person": "3", - "Number": "Sing", - "Case": "Acc", - }, - "PnPeFe03SgAcWe": { - POS: PRON, - "PronType": "Prs", - "Gender": "Fem", - "Person": "3", - "Number": "Sing", - "Case": "Acc", - }, - "PnPeFe03SgGeSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Fem", - "Person": "3", - "Number": "Sing", - "Case": "Gen", - }, - "PnPeFe03SgGeWe": { - POS: PRON, - "PronType": "Prs", - "Gender": "Fem", - "Person": "3", - "Number": "Sing", - "Case": "Gen", - }, - "PnPeMa01PlAcSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Masc", - "Person": "1", - "Number": "Plur", - "Case": "Acc", - }, - "PnPeMa01PlAcWe": { - POS: PRON, - "PronType": "Prs", - "Gender": "Masc", - "Person": "1", - "Number": "Plur", - "Case": "Acc", - }, - "PnPeMa01PlDaSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Masc", - "Person": "1", - "Number": "Plur", - "Case": "Dat", - }, - "PnPeMa01PlGeSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Masc", - "Person": "1", - "Number": "Plur", - "Case": "Gen", - }, - "PnPeMa01PlGeWe": { - POS: PRON, - "PronType": 
"Prs", - "Gender": "Masc", - "Person": "1", - "Number": "Plur", - "Case": "Gen", - }, - "PnPeMa01PlNmSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Masc", - "Person": "1", - "Number": "Plur", - "Case": "Nom", - }, - "PnPeMa01SgAcSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Masc", - "Person": "1", - "Number": "Sing", - "Case": "Acc", - }, - "PnPeMa01SgAcWe": { - POS: PRON, - "PronType": "Prs", - "Gender": "Masc", - "Person": "1", - "Number": "Sing", - "Case": "Acc", - }, - "PnPeMa01SgGeSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Masc", - "Person": "1", - "Number": "Sing", - "Case": "Gen", - }, - "PnPeMa01SgGeWe": { - POS: PRON, - "PronType": "Prs", - "Gender": "Masc", - "Person": "1", - "Number": "Sing", - "Case": "Gen", - }, - "PnPeMa01SgNmSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Masc", - "Person": "1", - "Number": "Sing", - "Case": "Nom", - }, - "PnPeMa02PlAcSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Masc", - "Person": "2", - "Number": "Plur", - "Case": "Acc", - }, - "PnPeMa02PlAcWe": { - POS: PRON, - "PronType": "Prs", - "Gender": "Masc", - "Person": "2", - "Number": "Plur", - "Case": "Acc", - }, - "PnPeMa02PlGeWe": { - POS: PRON, - "PronType": "Prs", - "Gender": "Masc", - "Person": "2", - "Number": "Plur", - "Case": "Gen", - }, - "PnPeMa02PlNmSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Masc", - "Person": "2", - "Number": "Plur", - "Case": "Nom", - }, - "PnPeMa02PlVoSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Masc", - "Person": "2", - "Number": "Plur", - "Case": "Voc", - }, - "PnPeMa02SgAcSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Masc", - "Person": "2", - "Number": "Sing", - "Case": "Acc", - }, - "PnPeMa02SgAcWe": { - POS: PRON, - "PronType": "Prs", - "Gender": "Masc", - "Person": "2", - "Number": "Sing", - "Case": "Acc", - }, - "PnPeMa02SgGeWe": { - POS: PRON, - "PronType": "Prs", - "Gender": "Masc", - "Person": "2", - "Number": "Sing", - "Case": "Gen", - }, - "PnPeMa02SgNmSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Masc", - "Person": "2", - "Number": "Sing", - "Case": "Nom", - }, - "PnPeMa03PlAcWe": { - POS: PRON, - "PronType": "Prs", - "Gender": "Masc", - "Person": "3", - "Number": "Plur", - "Case": "Acc", - }, - "PnPeMa03PlGeSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Masc", - "Person": "3", - "Number": "Plur", - "Case": "Gen", - }, - "PnPeMa03PlGeWe": { - POS: PRON, - "PronType": "Prs", - "Gender": "Masc", - "Person": "3", - "Number": "Plur", - "Case": "Gen", - }, - "PnPeMa03PlNmSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Masc", - "Person": "3", - "Number": "Plur", - "Case": "Nom", - }, - "PnPeMa03SgAcSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Masc", - "Person": "3", - "Number": "Sing", - "Case": "Acc", - }, - "PnPeMa03SgAcWe": { - POS: PRON, - "PronType": "Prs", - "Gender": "Masc", - "Person": "3", - "Number": "Sing", - "Case": "Acc", - }, - "PnPeMa03SgGeSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Masc", - "Person": "3", - "Number": "Sing", - "Case": "Gen", - }, - "PnPeMa03SgGeWe": { - POS: PRON, - "PronType": "Prs", - "Gender": "Masc", - "Person": "3", - "Number": "Sing", - "Case": "Gen", - }, - "PnPeMa03SgNmWe": { - POS: PRON, - "PronType": "Prs", - "Gender": "Masc", - "Person": "3", - "Number": "Sing", - "Case": "Nom", - }, - "PnPeNe03PlAcWe": { - POS: PRON, - "PronType": "Prs", - "Gender": "Neut", - "Person": "3", - "Number": "Plur", - "Case": "Acc", - }, - "PnPeNe03PlGeSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Neut", - "Person": "3", 
- "Number": "Plur", - "Case": "Gen", - }, - "PnPeNe03PlGeWe": { - POS: PRON, - "PronType": "Prs", - "Gender": "Neut", - "Person": "3", - "Number": "Plur", - "Case": "Gen", - }, - "PnPeNe03SgAcSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Neut", - "Person": "3", - "Number": "Sing", - "Case": "Acc", - }, - "PnPeNe03SgAcWe": { - POS: PRON, - "PronType": "Prs", - "Gender": "Neut", - "Person": "3", - "Number": "Sing", - "Case": "Acc", - }, - "PnPeNe03SgGeSt": { - POS: PRON, - "PronType": "Prs", - "Gender": "Neut", - "Person": "3", - "Number": "Sing", - "Case": "Gen", - }, - "PnPeNe03SgGeWe": { - POS: PRON, - "PronType": "Prs", - "Gender": "Neut", - "Person": "3", - "Number": "Sing", - "Case": "Gen", - }, - "PnPoFe01PlGeXx": { - POS: PRON, - "Poss": "Yes", - "Gender": "Fem", - "Person": "1", - "Number": "Plur", - "Case": "Gen", - }, - "PnPoFe01SgGeXx": { - POS: PRON, - "Poss": "Yes", - "Gender": "Fem", - "Person": "1", - "Number": "Sing", - "Case": "Gen", - }, - "PnPoFe02PlGeXx": { - POS: PRON, - "Poss": "Yes", - "Gender": "Fem", - "Person": "2", - "Number": "Plur", - "Case": "Gen", - }, - "PnPoFe02SgGeXx": { - POS: PRON, - "Poss": "Yes", - "Gender": "Fem", - "Person": "2", - "Number": "Sing", - "Case": "Gen", - }, - "PnPoFe03PlGeXx": { - POS: PRON, - "Poss": "Yes", - "Gender": "Fem", - "Person": "3", - "Number": "Plur", - "Case": "Gen", - }, - "PnPoFe03SgGeXx": { - POS: PRON, - "Poss": "Yes", - "Gender": "Fem", - "Person": "3", - "Number": "Sing", - "Case": "Gen", - }, - "PnPoMa01PlGeXx": { - POS: PRON, - "Poss": "Yes", - "Gender": "Masc", - "Person": "1", - "Number": "Plur", - "Case": "Gen", - }, - "PnPoMa01SgGeXx": { - POS: PRON, - "Poss": "Yes", - "Gender": "Masc", - "Person": "1", - "Number": "Sing", - "Case": "Gen", - }, - "PnPoMa02PlGeXx": { - POS: PRON, - "Poss": "Yes", - "Gender": "Masc", - "Person": "2", - "Number": "Plur", - "Case": "Gen", - }, - "PnPoMa02SgGeXx": { - POS: PRON, - "Poss": "Yes", - "Gender": "Masc", - "Person": "2", - "Number": "Sing", - "Case": "Gen", - }, - "PnPoMa03PlGeXx": { - POS: PRON, - "Poss": "Yes", - "Gender": "Masc", - "Person": "3", - "Number": "Plur", - "Case": "Gen", - }, - "PnPoMa03SgGeXx": { - POS: PRON, - "Poss": "Yes", - "Gender": "Masc", - "Person": "3", - "Number": "Sing", - "Case": "Gen", - }, - "PnPoNe03PlGeXx": { - POS: PRON, - "Poss": "Yes", - "Gender": "Neut", - "Person": "3", - "Number": "Plur", - "Case": "Gen", - }, - "PnPoNe03SgGeXx": { - POS: PRON, - "Poss": "Yes", - "Gender": "Neut", - "Person": "3", - "Number": "Sing", - "Case": "Gen", - }, - "PnReFe03PlAcXx": { - POS: PRON, - "PronType": "Rel", - "Gender": "Fem", - "Person": "3", - "Number": "Plur", - "Case": "Acc", - }, - "PnReFe03PlGeXx": { - POS: PRON, - "PronType": "Rel", - "Gender": "Fem", - "Person": "3", - "Number": "Plur", - "Case": "Gen", - }, - "PnReFe03PlNmXx": { - POS: PRON, - "PronType": "Rel", - "Gender": "Fem", - "Person": "3", - "Number": "Plur", - "Case": "Nom", - }, - "PnReFe03SgAcXx": { - POS: PRON, - "PronType": "Rel", - "Gender": "Fem", - "Person": "3", - "Number": "Sing", - "Case": "Acc", - }, - "PnReFe03SgGeXx": { - POS: PRON, - "PronType": "Rel", - "Gender": "Fem", - "Person": "3", - "Number": "Sing", - "Case": "Gen", - }, - "PnReFe03SgNmXx": { - POS: PRON, - "PronType": "Rel", - "Gender": "Fem", - "Person": "3", - "Number": "Sing", - "Case": "Nom", - }, - "PnReMa03PlAcXx": { - POS: PRON, - "PronType": "Rel", - "Gender": "Masc", - "Person": "3", - "Number": "Plur", - "Case": "Acc", - }, - "PnReMa03PlGeXx": { - POS: PRON, - "PronType": "Rel", - "Gender": 
"Masc", - "Person": "3", - "Number": "Plur", - "Case": "Gen", - }, - "PnReMa03PlNmXx": { - POS: PRON, - "PronType": "Rel", - "Gender": "Masc", - "Person": "3", - "Number": "Plur", - "Case": "Nom", - }, - "PnReMa03SgAcXx": { - POS: PRON, - "PronType": "Rel", - "Gender": "Masc", - "Person": "3", - "Number": "Sing", - "Case": "Acc", - }, - "PnReMa03SgGeXx": { - POS: PRON, - "PronType": "Rel", - "Gender": "Masc", - "Person": "3", - "Number": "Sing", - "Case": "Gen", - }, - "PnReMa03SgNmXx": { - POS: PRON, - "PronType": "Rel", - "Gender": "Masc", - "Person": "3", - "Number": "Sing", - "Case": "Nom", - }, - "PnReNe03PlAcXx": { - POS: PRON, - "PronType": "Rel", - "Gender": "Neut", - "Person": "3", - "Number": "Plur", - "Case": "Acc", - }, - "PnReNe03PlGeXx": { - POS: PRON, - "PronType": "Rel", - "Gender": "Neut", - "Person": "3", - "Number": "Plur", - "Case": "Gen", - }, - "PnReNe03PlNmXx": { - POS: PRON, - "PronType": "Rel", - "Gender": "Neut", - "Person": "3", - "Number": "Plur", - "Case": "Nom", - }, - "PnReNe03SgAcXx": { - POS: PRON, - "PronType": "Rel", - "Gender": "Neut", - "Person": "3", - "Number": "Sing", - "Case": "Acc", - }, - "PnReNe03SgGeXx": { - POS: PRON, - "PronType": "Rel", - "Gender": "Neut", - "Person": "3", - "Number": "Sing", - "Case": "Gen", - }, - "PnReNe03SgNmXx": { - POS: PRON, - "PronType": "Rel", - "Gender": "Neut", - "Person": "3", - "Number": "Sing", - "Case": "Nom", - }, - "PnRiFe03PlAcXx": { - POS: PRON, - "PronType": "Ind,Rel", - "Gender": "Fem", - "Person": "3", - "Number": "Plur", - "Case": "Acc", - }, - "PnRiFe03PlGeXx": { - POS: PRON, - "PronType": "Ind,Rel", - "Gender": "Fem", - "Person": "3", - "Number": "Plur", - "Case": "Gen", - }, - "PnRiFe03PlNmXx": { - POS: PRON, - "PronType": "Ind,Rel", - "Gender": "Fem", - "Person": "3", - "Number": "Plur", - "Case": "Nom", - }, - "PnRiFe03SgAcXx": { - POS: PRON, - "PronType": "Ind,Rel", - "Gender": "Fem", - "Person": "3", - "Number": "Sing", - "Case": "Acc", - }, - "PnRiFe03SgGeXx": { - POS: PRON, - "PronType": "Ind,Rel", - "Gender": "Fem", - "Person": "3", - "Number": "Sing", - "Case": "Gen", - }, - "PnRiFe03SgNmXx": { - POS: PRON, - "PronType": "Ind,Rel", - "Gender": "Fem", - "Person": "3", - "Number": "Sing", - "Case": "Nom", - }, - "PnRiMa03PlAcXx": { - POS: PRON, - "PronType": "Ind,Rel", - "Gender": "Masc", - "Person": "3", - "Number": "Plur", - "Case": "Acc", - }, - "PnRiMa03PlGeXx": { - POS: PRON, - "PronType": "Ind,Rel", - "Gender": "Masc", - "Person": "3", - "Number": "Plur", - "Case": "Gen", - }, - "PnRiMa03PlNmXx": { - POS: PRON, - "PronType": "Ind,Rel", - "Gender": "Masc", - "Person": "3", - "Number": "Plur", - "Case": "Nom", - }, - "PnRiMa03SgAcXx": { - POS: PRON, - "PronType": "Ind,Rel", - "Gender": "Masc", - "Person": "3", - "Number": "Sing", - "Case": "Acc", - }, - "PnRiMa03SgGeXx": { - POS: PRON, - "PronType": "Ind,Rel", - "Gender": "Masc", - "Person": "3", - "Number": "Sing", - "Case": "Gen", - }, - "PnRiMa03SgNmXx": { - POS: PRON, - "PronType": "Ind,Rel", - "Gender": "Masc", - "Person": "3", - "Number": "Sing", - "Case": "Nom", - }, - "PnRiNe03PlAcXx": { - POS: PRON, - "PronType": "Ind,Rel", - "Gender": "Neut", - "Person": "3", - "Number": "Plur", - "Case": "Acc", - }, - "PnRiNe03PlGeXx": { - POS: PRON, - "PronType": "Ind,Rel", - "Gender": "Neut", - "Person": "3", - "Number": "Plur", - "Case": "Gen", - }, - "PnRiNe03PlNmXx": { - POS: PRON, - "PronType": "Ind,Rel", - "Gender": "Neut", - "Person": "3", - "Number": "Plur", - "Case": "Nom", - }, - "PnRiNe03SgAcXx": { - POS: PRON, - "PronType": 
"Ind,Rel", - "Gender": "Neut", - "Person": "3", - "Number": "Sing", - "Case": "Acc", - }, - "PnRiNe03SgGeXx": { - POS: PRON, - "PronType": "Ind,Rel", - "Gender": "Neut", - "Person": "3", - "Number": "Sing", - "Case": "Gen", - }, - "PnRiNe03SgNmXx": { - POS: PRON, - "PronType": "Ind,Rel", - "Gender": "Neut", - "Person": "3", - "Number": "Sing", - "Case": "Nom", - }, - "PTERM_P": {POS: PUNCT}, - "PtFu": {POS: PART}, - "PtNg": {POS: PART}, - "PtOt": {POS: PART}, - "PtSj": {POS: PART}, - "Pu": {POS: SYM}, - "PUNCT": {POS: PUNCT}, - "RgAbXx": {POS: X}, - "RgAnXx": {POS: X}, - "RgFwOr": {POS: X, "Foreign": "Yes"}, - "RgFwTr": {POS: X, "Foreign": "Yes"}, - "RgSyXx": {POS: SYM}, - "VbIsIdPa03SgXxIpAvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Past", - "Person": "3", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbIsIdPa03SgXxIpPvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Past", - "Person": "3", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbIsIdPa03SgXxPeAvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Past", - "Person": "3", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbIsIdPa03SgXxPePvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Past", - "Person": "3", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbIsIdPr03SgXxIpAvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Pres", - "Person": "3", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbIsIdPr03SgXxIpPvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Pres", - "Person": "3", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbIsIdXx03SgXxPeAvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Pres|Past", - "Person": "3", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbIsIdXx03SgXxPePvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Pres|Past", - "Person": "3", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbIsNfXxXxXxXxPeAvXx": { - POS: VERB, - "VerbForm": "Inf", - "Mood": "", - "Tense": "Pres|Past", - "Person": "1|2|3", - "Number": "Sing|Plur", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPa01PlXxIpAvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Past", - "Person": "1", - "Number": "Plur", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPa01PlXxIpPvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Past", - "Person": "1", - "Number": "Plur", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPa01PlXxPeAvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Past", - "Person": "1", - "Number": "Plur", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Act", - "Case": 
"Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPa01PlXxPePvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Past", - "Person": "1", - "Number": "Plur", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPa01SgXxIpAvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Past", - "Person": "1", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPa01SgXxIpPvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Past", - "Person": "1", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPa01SgXxPeAvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Past", - "Person": "1|2|3", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPa01SgXxPePvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Past", - "Person": "1", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPa02PlXxIpAvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Past", - "Person": "2", - "Number": "Plur", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPa02PlXxIpPvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Past", - "Person": "2", - "Number": "Plur", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPa02PlXxPeAvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Past", - "Person": "2", - "Number": "Plur", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPa02PlXxPePvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Past", - "Person": "2", - "Number": "Plur", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPa02SgXxIpAvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Past", - "Person": "2", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPa02SgXxIpPvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Past", - "Person": "2", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPa02SgXxPeAvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Past", - "Person": "2", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPa02SgXxPePvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Past", - "Person": "2", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPa03PlXxIpAvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Past", - "Person": "3", - "Number": "Plur", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPa03PlXxIpPvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Past", - "Person": "3", - "Number": "Plur", - "Gender": 
"Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPa03PlXxPeAvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Past", - "Person": "3", - "Number": "Plur", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPa03PlXxPePvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Past", - "Person": "3", - "Number": "Plur", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPa03SgXxIpAvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Past", - "Person": "3", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPa03SgXxIpPvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Past", - "Person": "3", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPa03SgXxPeAvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Past", - "Person": "3", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPa03SgXxPePvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Past", - "Person": "3", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPr01PlXxIpAvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Pres", - "Person": "1", - "Number": "Plur", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPr01PlXxIpPvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Pres", - "Person": "1", - "Number": "Plur", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPr01SgXxIpAvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Pres", - "Person": "1", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPr01SgXxIpPvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Pres", - "Person": "1", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPr02PlXxIpAvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Pres", - "Person": "2", - "Number": "Plur", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPr02PlXxIpPvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Pres", - "Person": "2", - "Number": "Plur", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPr02SgXxIpAvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Pres", - "Person": "2", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPr02SgXxIpPvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Pres", - "Person": "2", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPr03PlXxIpAvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Pres", - 
"Person": "3", - "Number": "Plur", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPr03PlXxIpPvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Pres", - "Person": "3", - "Number": "Plur", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPr03SgXxIpAvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Pres", - "Person": "3", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdPr03SgXxIpPvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Pres", - "Person": "3", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdXx01PlXxPeAvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Pres|Past", - "Person": "1", - "Number": "Plur", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdXx01PlXxPePvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Pres|Past", - "Person": "1", - "Number": "Plur", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdXx01SgXxPeAvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Pres|Past", - "Person": "1", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdXx01SgXxPePvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Pres|Past", - "Person": "1", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdXx02PlXxPeAvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Pres|Past", - "Person": "2", - "Number": "Plur", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdXx02PlXxPePvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Pres|Past", - "Person": "2", - "Number": "Plur", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdXx02SgXxPeAvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Pres|Past", - "Person": "2", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdXx02SgXxPePvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Pres|Past", - "Person": "2", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdXx03PlXxPeAvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Pres|Past", - "Person": "3", - "Number": "Plur", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdXx03PlXxPePvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Pres|Past", - "Person": "3", - "Number": "Plur", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnIdXx03SgXxPeAvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Pres|Past", - "Person": "3", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - 
}, - "VbMnIdXx03SgXxPePvXx": { - POS: VERB, - "VerbForm": "Fin", - "Mood": "Ind", - "Tense": "Pres|Past", - "Person": "3", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnMpXx02PlXxIpAvXx": { - POS: VERB, - "VerbForm": "", - "Mood": "Imp", - "Tense": "Pres|Past", - "Person": "2", - "Number": "Plur", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnMpXx02PlXxIpPvXx": { - POS: VERB, - "VerbForm": "", - "Mood": "Imp", - "Tense": "Pres|Past", - "Person": "2", - "Number": "Plur", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnMpXx02PlXxPeAvXx": { - POS: VERB, - "VerbForm": "", - "Mood": "Imp", - "Tense": "Pres|Past", - "Person": "2", - "Number": "Plur", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnMpXx02PlXxPePvXx": { - POS: VERB, - "VerbForm": "", - "Mood": "Imp", - "Tense": "Pres|Past", - "Person": "2", - "Number": "Plur", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnMpXx02SgXxIpAvXx": { - POS: VERB, - "VerbForm": "", - "Mood": "Imp", - "Tense": "Pres|Past", - "Person": "2", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnMpXx02SgXxIpPvXx": { - POS: VERB, - "VerbForm": "", - "Mood": "Imp", - "Tense": "Pres|Past", - "Person": "2", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnMpXx02SgXxPeAvXx": { - POS: VERB, - "VerbForm": "", - "Mood": "Imp", - "Tense": "Pres|Past", - "Person": "2", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnMpXx02SgXxPePvXx": { - POS: VERB, - "VerbForm": "", - "Mood": "Imp", - "Tense": "Pres|Past", - "Person": "2", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnMpXx03SgXxIpPvXx": { - POS: VERB, - "VerbForm": "", - "Mood": "Imp", - "Tense": "Pres|Past", - "Person": "3", - "Number": "Sing", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnNfXxXxXxXxPeAvXx": { - POS: VERB, - "VerbForm": "Inf", - "Mood": "", - "Tense": "Pres|Past", - "Person": "1|2|3", - "Number": "Sing|Plur", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnNfXxXxXxXxPePvXx": { - POS: VERB, - "VerbForm": "Inf", - "Mood": "", - "Tense": "Pres|Past", - "Person": "1|2|3", - "Number": "Sing|Plur", - "Gender": "Masc|Fem|Neut", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnPpPrXxXxXxIpAvXx": { - POS: VERB, - "VerbForm": "Conv", - "Mood": "", - "Tense": "Pres", - "Person": "1|2|3", - "Number": "Sing|Plur", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "VbMnPpXxXxPlFePePvAc": { - POS: VERB, - "VerbForm": "Part", - "Mood": "", - "Tense": "Pres|Past", - "Person": "1|2|3", - "Number": "Plur", - "Gender": "Fem", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Acc", - }, - "VbMnPpXxXxPlFePePvGe": { - POS: VERB, - "VerbForm": "Part", - "Mood": "", - "Tense": "Pres|Past", - "Person": "1|2|3", - "Number": "Plur", - "Gender": 
"Fem", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Gen", - }, - "VbMnPpXxXxPlFePePvNm": { - POS: VERB, - "VerbForm": "Part", - "Mood": "", - "Tense": "Pres|Past", - "Person": "1|2|3", - "Number": "Plur", - "Gender": "Fem", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Nom", - }, - "VbMnPpXxXxPlFePePvVo": { - POS: VERB, - "VerbForm": "Part", - "Mood": "", - "Tense": "Pres|Past", - "Person": "1|2|3", - "Number": "Plur", - "Gender": "Fem", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Voc", - }, - "VbMnPpXxXxPlMaPePvAc": { - POS: VERB, - "VerbForm": "Part", - "Mood": "", - "Tense": "Pres|Past", - "Person": "1|2|3", - "Number": "Plur", - "Gender": "Masc", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Acc", - }, - "VbMnPpXxXxPlMaPePvGe": { - POS: VERB, - "VerbForm": "Part", - "Mood": "", - "Tense": "Pres|Past", - "Person": "1|2|3", - "Number": "Plur", - "Gender": "Masc", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Gen", - }, - "VbMnPpXxXxPlMaPePvNm": { - POS: VERB, - "VerbForm": "Part", - "Mood": "", - "Tense": "Pres|Past", - "Person": "1|2|3", - "Number": "Plur", - "Gender": "Masc", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Nom", - }, - "VbMnPpXxXxPlMaPePvVo": { - POS: VERB, - "VerbForm": "Part", - "Mood": "", - "Tense": "Pres|Past", - "Person": "1|2|3", - "Number": "Plur", - "Gender": "Masc", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Voc", - }, - "VbMnPpXxXxPlNePePvAc": { - POS: VERB, - "VerbForm": "Part", - "Mood": "", - "Tense": "Pres|Past", - "Person": "1|2|3", - "Number": "Plur", - "Gender": "Neut", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Acc", - }, - "VbMnPpXxXxPlNePePvGe": { - POS: VERB, - "VerbForm": "Part", - "Mood": "", - "Tense": "Pres|Past", - "Person": "1|2|3", - "Number": "Plur", - "Gender": "Neut", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Gen", - }, - "VbMnPpXxXxPlNePePvNm": { - POS: VERB, - "VerbForm": "Part", - "Mood": "", - "Tense": "Pres|Past", - "Person": "1|2|3", - "Number": "Plur", - "Gender": "Neut", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Nom", - }, - "VbMnPpXxXxPlNePePvVo": { - POS: VERB, - "VerbForm": "Part", - "Mood": "", - "Tense": "Pres|Past", - "Person": "1|2|3", - "Number": "Plur", - "Gender": "Neut", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Voc", - }, - "VbMnPpXxXxSgFePePvAc": { - POS: VERB, - "VerbForm": "Part", - "Mood": "", - "Tense": "Pres|Past", - "Person": "1|2|3", - "Number": "Sing", - "Gender": "Fem", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Acc", - }, - "VbMnPpXxXxSgFePePvGe": { - POS: VERB, - "VerbForm": "Part", - "Mood": "", - "Tense": "Pres|Past", - "Person": "1|2|3", - "Number": "Sing", - "Gender": "Fem", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Gen", - }, - "VbMnPpXxXxSgFePePvNm": { - POS: VERB, - "VerbForm": "Part", - "Mood": "", - "Tense": "Pres|Past", - "Person": "1|2|3", - "Number": "Sing", - "Gender": "Fem", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Nom", - }, - "VbMnPpXxXxSgFePePvVo": { - POS: VERB, - "VerbForm": "Part", - "Mood": "", - "Tense": "Pres|Past", - "Person": "1|2|3", - "Number": "Sing", - "Gender": "Fem", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Voc", - }, - "VbMnPpXxXxSgMaPePvAc": { - POS: VERB, - "VerbForm": "Part", - "Mood": "", - "Tense": "Pres|Past", - "Person": "1|2|3", - "Number": "Sing", - "Gender": "Masc", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Acc", - }, - "VbMnPpXxXxSgMaPePvGe": { - POS: VERB, - "VerbForm": "Part", - "Mood": "", - "Tense": "Pres|Past", - "Person": "1|2|3", - "Number": "Sing", - 
"Gender": "Masc", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Gen", - }, - "VbMnPpXxXxSgMaPePvNm": { - POS: VERB, - "VerbForm": "Part", - "Mood": "", - "Tense": "Pres|Past", - "Person": "1|2|3", - "Number": "Sing", - "Gender": "Masc", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Nom", - }, - "VbMnPpXxXxSgMaPePvVo": { - POS: VERB, - "VerbForm": "Part", - "Mood": "", - "Tense": "Pres|Past", - "Person": "1|2|3", - "Number": "Sing", - "Gender": "Masc", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Voc", - }, - "VbMnPpXxXxSgNePePvAc": { - POS: VERB, - "VerbForm": "Part", - "Mood": "", - "Tense": "Pres|Past", - "Person": "1|2|3", - "Number": "Sing", - "Gender": "Neut", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Acc", - }, - "VbMnPpXxXxSgNePePvGe": { - POS: VERB, - "VerbForm": "Part", - "Mood": "", - "Tense": "Pres|Past", - "Person": "1|2|3", - "Number": "Sing", - "Gender": "Neut", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Gen", - }, - "VbMnPpXxXxSgNePePvNm": { - POS: VERB, - "VerbForm": "Part", - "Mood": "", - "Tense": "Pres|Past", - "Person": "1|2|3", - "Number": "Sing", - "Gender": "Neut", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Nom", - }, - "VbMnPpXxXxSgNePePvVo": { - POS: VERB, - "VerbForm": "Part", - "Mood": "", - "Tense": "Pres|Past", - "Person": "1|2|3", - "Number": "Sing", - "Gender": "Neut", - "Aspect": "Perf", - "Voice": "Pass", - "Case": "Voc", - }, - "VbMnPpXxXxXxXxIpAvXx": { - POS: VERB, - "VerbForm": "Conv", - "Mood": "", - "Tense": "Pres|Past", - "Person": "1|2|3", - "Number": "Sing|Plur", - "Gender": "Masc|Fem|Neut", - "Aspect": "Imp", - "Voice": "Act", - "Case": "Nom|Gen|Dat|Acc|Voc", - }, - "ADJ": {POS: ADJ}, - "ADP": {POS: ADP}, - "ADV": {POS: ADV}, - "AtDf": {POS: DET}, - "AUX": {POS: AUX}, - "CCONJ": {POS: CCONJ}, - "DET": {POS: DET}, - "NOUN": {POS: NOUN}, - "NUM": {POS: NUM}, - "PART": {POS: PART}, - "PRON": {POS: PRON}, - "PROPN": {POS: PROPN}, - "SCONJ": {POS: SCONJ}, - "SYM": {POS: SYM}, - "VERB": {POS: VERB}, - "X": {POS: X}, -} diff --git a/spacy/lang/el/tokenizer_exceptions.py b/spacy/lang/el/tokenizer_exceptions.py index a3c36542e..0a36d5d2b 100644 --- a/spacy/lang/el/tokenizer_exceptions.py +++ b/spacy/lang/el/tokenizer_exceptions.py @@ -1,132 +1,128 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import ORTH, LEMMA, NORM - +from ..tokenizer_exceptions import BASE_EXCEPTIONS +from ...symbols import ORTH, NORM +from ...util import update_exc _exc = {} for token in ["Απ'", "ΑΠ'", "αφ'", "Αφ'"]: - _exc[token] = [{ORTH: token, LEMMA: "από", NORM: "από"}] + _exc[token] = [{ORTH: token, NORM: "από"}] for token in ["Αλλ'", "αλλ'"]: - _exc[token] = [{ORTH: token, LEMMA: "αλλά", NORM: "αλλά"}] + _exc[token] = [{ORTH: token, NORM: "αλλά"}] for token in ["παρ'", "Παρ'", "ΠΑΡ'"]: - _exc[token] = [{ORTH: token, LEMMA: "παρά", NORM: "παρά"}] + _exc[token] = [{ORTH: token, NORM: "παρά"}] for token in ["καθ'", "Καθ'"]: - _exc[token] = [{ORTH: token, LEMMA: "κάθε", NORM: "κάθε"}] + _exc[token] = [{ORTH: token, NORM: "κάθε"}] for token in ["κατ'", "Κατ'"]: - _exc[token] = [{ORTH: token, LEMMA: "κατά", NORM: "κατά"}] + _exc[token] = [{ORTH: token, NORM: "κατά"}] for token in ["'ΣΟΥΝ", "'ναι", "'ταν", "'τανε", "'μαστε", "'μουνα", "'μουν"]: - _exc[token] = [{ORTH: token, LEMMA: "είμαι", NORM: "είμαι"}] + _exc[token] = [{ORTH: token, NORM: "είμαι"}] for token in ["Επ'", "επ'", "εφ'", "Εφ'"]: - _exc[token] = [{ORTH: token, LEMMA: "επί", NORM: "επί"}] + _exc[token] = [{ORTH: token, NORM: "επί"}] for token in ["Δι'", 
"δι'"]: - _exc[token] = [{ORTH: token, LEMMA: "δια", NORM: "δια"}] + _exc[token] = [{ORTH: token, NORM: "δια"}] for token in ["'χουν", "'χουμε", "'χαμε", "'χα", "'χε", "'χεις", "'χει"]: - _exc[token] = [{ORTH: token, LEMMA: "έχω", NORM: "έχω"}] + _exc[token] = [{ORTH: token, NORM: "έχω"}] for token in ["υπ'", "Υπ'"]: - _exc[token] = [{ORTH: token, LEMMA: "υπό", NORM: "υπό"}] + _exc[token] = [{ORTH: token, NORM: "υπό"}] for token in ["Μετ'", "ΜΕΤ'", "'μετ"]: - _exc[token] = [{ORTH: token, LEMMA: "μετά", NORM: "μετά"}] + _exc[token] = [{ORTH: token, NORM: "μετά"}] for token in ["Μ'", "μ'"]: - _exc[token] = [{ORTH: token, LEMMA: "με", NORM: "με"}] + _exc[token] = [{ORTH: token, NORM: "με"}] for token in ["Γι'", "ΓΙ'", "γι'"]: - _exc[token] = [{ORTH: token, LEMMA: "για", NORM: "για"}] + _exc[token] = [{ORTH: token, NORM: "για"}] for token in ["Σ'", "σ'"]: - _exc[token] = [{ORTH: token, LEMMA: "σε", NORM: "σε"}] + _exc[token] = [{ORTH: token, NORM: "σε"}] for token in ["Θ'", "θ'"]: - _exc[token] = [{ORTH: token, LEMMA: "θα", NORM: "θα"}] + _exc[token] = [{ORTH: token, NORM: "θα"}] for token in ["Ν'", "ν'"]: - _exc[token] = [{ORTH: token, LEMMA: "να", NORM: "να"}] + _exc[token] = [{ORTH: token, NORM: "να"}] for token in ["Τ'", "τ'"]: - _exc[token] = [{ORTH: token, LEMMA: "να", NORM: "να"}] + _exc[token] = [{ORTH: token, NORM: "να"}] for token in ["'γω", "'σένα", "'μεις"]: - _exc[token] = [{ORTH: token, LEMMA: "εγώ", NORM: "εγώ"}] + _exc[token] = [{ORTH: token, NORM: "εγώ"}] for token in ["Τ'", "τ'"]: - _exc[token] = [{ORTH: token, LEMMA: "το", NORM: "το"}] + _exc[token] = [{ORTH: token, NORM: "το"}] for token in ["Φέρ'", "Φερ'", "φέρ'", "φερ'"]: - _exc[token] = [{ORTH: token, LEMMA: "φέρνω", NORM: "φέρνω"}] + _exc[token] = [{ORTH: token, NORM: "φέρνω"}] for token in ["'ρθούνε", "'ρθουν", "'ρθει", "'ρθεί", "'ρθε", "'ρχεται"]: - _exc[token] = [{ORTH: token, LEMMA: "έρχομαι", NORM: "έρχομαι"}] + _exc[token] = [{ORTH: token, NORM: "έρχομαι"}] for token in ["'πανε", "'λεγε", "'λεγαν", "'πε", "'λεγα"]: - _exc[token] = [{ORTH: token, LEMMA: "λέγω", NORM: "λέγω"}] + _exc[token] = [{ORTH: token, NORM: "λέγω"}] for token in ["Πάρ'", "πάρ'"]: - _exc[token] = [{ORTH: token, LEMMA: "παίρνω", NORM: "παίρνω"}] + _exc[token] = [{ORTH: token, NORM: "παίρνω"}] for token in ["μέσ'", "Μέσ'", "μεσ'"]: - _exc[token] = [{ORTH: token, LEMMA: "μέσα", NORM: "μέσα"}] + _exc[token] = [{ORTH: token, NORM: "μέσα"}] for token in ["Δέσ'", "Δεσ'", "δεσ'"]: - _exc[token] = [{ORTH: token, LEMMA: "δένω", NORM: "δένω"}] + _exc[token] = [{ORTH: token, NORM: "δένω"}] for token in ["'κανε", "Κάν'"]: - _exc[token] = [{ORTH: token, LEMMA: "κάνω", NORM: "κάνω"}] + _exc[token] = [{ORTH: token, NORM: "κάνω"}] _other_exc = { - "κι": [{ORTH: "κι", LEMMA: "και", NORM: "και"}], - "Παίξ'": [{ORTH: "Παίξ'", LEMMA: "παίζω", NORM: "παίζω"}], - "Αντ'": [{ORTH: "Αντ'", LEMMA: "αντί", NORM: "αντί"}], - "ολ'": [{ORTH: "ολ'", LEMMA: "όλος", NORM: "όλος"}], - "ύστερ'": [{ORTH: "ύστερ'", LEMMA: "ύστερα", NORM: "ύστερα"}], - "'πρεπε": [{ORTH: "'πρεπε", LEMMA: "πρέπει", NORM: "πρέπει"}], - "Δύσκολ'": [{ORTH: "Δύσκολ'", LEMMA: "δύσκολος", NORM: "δύσκολος"}], - "'θελα": [{ORTH: "'θελα", LEMMA: "θέλω", NORM: "θέλω"}], - "'γραφα": [{ORTH: "'γραφα", LEMMA: "γράφω", NORM: "γράφω"}], - "'παιρνα": [{ORTH: "'παιρνα", LEMMA: "παίρνω", NORM: "παίρνω"}], - "'δειξε": [{ORTH: "'δειξε", LEMMA: "δείχνω", NORM: "δείχνω"}], - "όμουρφ'": [{ORTH: "όμουρφ'", LEMMA: "όμορφος", NORM: "όμορφος"}], - "κ'τσή": [{ORTH: "κ'τσή", LEMMA: "κουτσός", NORM: "κουτσός"}], - "μηδ'": [{ORTH: 
"μηδ'", LEMMA: "μήδε", NORM: "μήδε"}], - "'ξομολογήθηκε": [ - {ORTH: "'ξομολογήθηκε", LEMMA: "εξομολογούμαι", NORM: "εξομολογούμαι"} - ], - "'μας": [{ORTH: "'μας", LEMMA: "εμάς", NORM: "εμάς"}], - "'ξερες": [{ORTH: "'ξερες", LEMMA: "ξέρω", NORM: "ξέρω"}], - "έφθασ'": [{ORTH: "έφθασ'", LEMMA: "φθάνω", NORM: "φθάνω"}], - "εξ'": [{ORTH: "εξ'", LEMMA: "εκ", NORM: "εκ"}], - "δώσ'": [{ORTH: "δώσ'", LEMMA: "δίνω", NORM: "δίνω"}], - "τίποτ'": [{ORTH: "τίποτ'", LEMMA: "τίποτα", NORM: "τίποτα"}], - "Λήξ'": [{ORTH: "Λήξ'", LEMMA: "λήγω", NORM: "λήγω"}], - "άσ'": [{ORTH: "άσ'", LEMMA: "αφήνω", NORM: "αφήνω"}], - "Στ'": [{ORTH: "Στ'", LEMMA: "στο", NORM: "στο"}], - "Δωσ'": [{ORTH: "Δωσ'", LEMMA: "δίνω", NORM: "δίνω"}], - "Βάψ'": [{ORTH: "Βάψ'", LEMMA: "βάφω", NORM: "βάφω"}], - "Αλλ'": [{ORTH: "Αλλ'", LEMMA: "αλλά", NORM: "αλλά"}], - "Αμ'": [{ORTH: "Αμ'", LEMMA: "άμα", NORM: "άμα"}], - "Αγόρασ'": [{ORTH: "Αγόρασ'", LEMMA: "αγοράζω", NORM: "αγοράζω"}], - "'φύγε": [{ORTH: "'φύγε", LEMMA: "φεύγω", NORM: "φεύγω"}], - "'φερε": [{ORTH: "'φερε", LEMMA: "φέρνω", NORM: "φέρνω"}], - "'φαγε": [{ORTH: "'φαγε", LEMMA: "τρώω", NORM: "τρώω"}], - "'σπαγαν": [{ORTH: "'σπαγαν", LEMMA: "σπάω", NORM: "σπάω"}], - "'σκασε": [{ORTH: "'σκασε", LEMMA: "σκάω", NORM: "σκάω"}], - "'σβηνε": [{ORTH: "'σβηνε", LEMMA: "σβήνω", NORM: "σβήνω"}], - "'ριξε": [{ORTH: "'ριξε", LEMMA: "ρίχνω", NORM: "ρίχνω"}], - "'κλεβε": [{ORTH: "'κλεβε", LEMMA: "κλέβω", NORM: "κλέβω"}], - "'κει": [{ORTH: "'κει", LEMMA: "εκεί", NORM: "εκεί"}], - "'βλεπε": [{ORTH: "'βλεπε", LEMMA: "βλέπω", NORM: "βλέπω"}], - "'βγαινε": [{ORTH: "'βγαινε", LEMMA: "βγαίνω", NORM: "βγαίνω"}], + "κι": [{ORTH: "κι", NORM: "και"}], + "Παίξ'": [{ORTH: "Παίξ'", NORM: "παίζω"}], + "Αντ'": [{ORTH: "Αντ'", NORM: "αντί"}], + "ολ'": [{ORTH: "ολ'", NORM: "όλος"}], + "ύστερ'": [{ORTH: "ύστερ'", NORM: "ύστερα"}], + "'πρεπε": [{ORTH: "'πρεπε", NORM: "πρέπει"}], + "Δύσκολ'": [{ORTH: "Δύσκολ'", NORM: "δύσκολος"}], + "'θελα": [{ORTH: "'θελα", NORM: "θέλω"}], + "'γραφα": [{ORTH: "'γραφα", NORM: "γράφω"}], + "'παιρνα": [{ORTH: "'παιρνα", NORM: "παίρνω"}], + "'δειξε": [{ORTH: "'δειξε", NORM: "δείχνω"}], + "όμουρφ'": [{ORTH: "όμουρφ'", NORM: "όμορφος"}], + "κ'τσή": [{ORTH: "κ'τσή", NORM: "κουτσός"}], + "μηδ'": [{ORTH: "μηδ'", NORM: "μήδε"}], + "'ξομολογήθηκε": [{ORTH: "'ξομολογήθηκε", NORM: "εξομολογούμαι"}], + "'μας": [{ORTH: "'μας", NORM: "εμάς"}], + "'ξερες": [{ORTH: "'ξερες", NORM: "ξέρω"}], + "έφθασ'": [{ORTH: "έφθασ'", NORM: "φθάνω"}], + "εξ'": [{ORTH: "εξ'", NORM: "εκ"}], + "δώσ'": [{ORTH: "δώσ'", NORM: "δίνω"}], + "τίποτ'": [{ORTH: "τίποτ'", NORM: "τίποτα"}], + "Λήξ'": [{ORTH: "Λήξ'", NORM: "λήγω"}], + "άσ'": [{ORTH: "άσ'", NORM: "αφήνω"}], + "Στ'": [{ORTH: "Στ'", NORM: "στο"}], + "Δωσ'": [{ORTH: "Δωσ'", NORM: "δίνω"}], + "Βάψ'": [{ORTH: "Βάψ'", NORM: "βάφω"}], + "Αλλ'": [{ORTH: "Αλλ'", NORM: "αλλά"}], + "Αμ'": [{ORTH: "Αμ'", NORM: "άμα"}], + "Αγόρασ'": [{ORTH: "Αγόρασ'", NORM: "αγοράζω"}], + "'φύγε": [{ORTH: "'φύγε", NORM: "φεύγω"}], + "'φερε": [{ORTH: "'φερε", NORM: "φέρνω"}], + "'φαγε": [{ORTH: "'φαγε", NORM: "τρώω"}], + "'σπαγαν": [{ORTH: "'σπαγαν", NORM: "σπάω"}], + "'σκασε": [{ORTH: "'σκασε", NORM: "σκάω"}], + "'σβηνε": [{ORTH: "'σβηνε", NORM: "σβήνω"}], + "'ριξε": [{ORTH: "'ριξε", NORM: "ρίχνω"}], + "'κλεβε": [{ORTH: "'κλεβε", NORM: "κλέβω"}], + "'κει": [{ORTH: "'κει", NORM: "εκεί"}], + "'βλεπε": [{ORTH: "'βλεπε", NORM: "βλέπω"}], + "'βγαινε": [{ORTH: "'βγαινε", NORM: "βγαίνω"}], } _exc.update(_other_exc) @@ -134,37 +130,37 @@ _exc.update(_other_exc) for h in range(1, 12 + 1): for 
period in ["π.μ.", "πμ"]: - _exc["%d%s" % (h, period)] = [ - {ORTH: "%d" % h}, - {ORTH: period, LEMMA: "π.μ.", NORM: "π.μ."}, + _exc[f"{h}{period}"] = [ + {ORTH: f"{h}"}, + {ORTH: period, NORM: "π.μ."}, ] for period in ["μ.μ.", "μμ"]: - _exc["%d%s" % (h, period)] = [ - {ORTH: "%d" % h}, - {ORTH: period, LEMMA: "μ.μ.", NORM: "μ.μ."}, + _exc[f"{h}{period}"] = [ + {ORTH: f"{h}"}, + {ORTH: period, NORM: "μ.μ."}, ] for exc_data in [ - {ORTH: "ΑΓΡ.", LEMMA: "Αγροτικός", NORM: "Αγροτικός"}, - {ORTH: "Αγ. Γρ.", LEMMA: "Αγία Γραφή", NORM: "Αγία Γραφή"}, - {ORTH: "Αθ.", LEMMA: "Αθανάσιος", NORM: "Αθανάσιος"}, - {ORTH: "Αλεξ.", LEMMA: "Αλέξανδρος", NORM: "Αλέξανδρος"}, - {ORTH: "Απρ.", LEMMA: "Απρίλιος", NORM: "Απρίλιος"}, - {ORTH: "Αύγ.", LEMMA: "Αύγουστος", NORM: "Αύγουστος"}, - {ORTH: "Δεκ.", LEMMA: "Δεκέμβριος", NORM: "Δεκέμβριος"}, - {ORTH: "Δημ.", LEMMA: "Δήμος", NORM: "Δήμος"}, - {ORTH: "Ιαν.", LEMMA: "Ιανουάριος", NORM: "Ιανουάριος"}, - {ORTH: "Ιούλ.", LEMMA: "Ιούλιος", NORM: "Ιούλιος"}, - {ORTH: "Ιούν.", LEMMA: "Ιούνιος", NORM: "Ιούνιος"}, - {ORTH: "Ιωαν.", LEMMA: "Ιωάννης", NORM: "Ιωάννης"}, - {ORTH: "Μ. Ασία", LEMMA: "Μικρά Ασία", NORM: "Μικρά Ασία"}, - {ORTH: "Μάρτ.", LEMMA: "Μάρτιος", NORM: "Μάρτιος"}, - {ORTH: "Μάρτ'", LEMMA: "Μάρτιος", NORM: "Μάρτιος"}, - {ORTH: "Νοέμβρ.", LEMMA: "Νοέμβριος", NORM: "Νοέμβριος"}, - {ORTH: "Οκτ.", LEMMA: "Οκτώβριος", NORM: "Οκτώβριος"}, - {ORTH: "Σεπτ.", LEMMA: "Σεπτέμβριος", NORM: "Σεπτέμβριος"}, - {ORTH: "Φεβρ.", LEMMA: "Φεβρουάριος", NORM: "Φεβρουάριος"}, + {ORTH: "ΑΓΡ.", NORM: "Αγροτικός"}, + {ORTH: "Αγ. Γρ.", NORM: "Αγία Γραφή"}, + {ORTH: "Αθ.", NORM: "Αθανάσιος"}, + {ORTH: "Αλεξ.", NORM: "Αλέξανδρος"}, + {ORTH: "Απρ.", NORM: "Απρίλιος"}, + {ORTH: "Αύγ.", NORM: "Αύγουστος"}, + {ORTH: "Δεκ.", NORM: "Δεκέμβριος"}, + {ORTH: "Δημ.", NORM: "Δήμος"}, + {ORTH: "Ιαν.", NORM: "Ιανουάριος"}, + {ORTH: "Ιούλ.", NORM: "Ιούλιος"}, + {ORTH: "Ιούν.", NORM: "Ιούνιος"}, + {ORTH: "Ιωαν.", NORM: "Ιωάννης"}, + {ORTH: "Μ. 
Ασία", NORM: "Μικρά Ασία"}, + {ORTH: "Μάρτ.", NORM: "Μάρτιος"}, + {ORTH: "Μάρτ'", NORM: "Μάρτιος"}, + {ORTH: "Νοέμβρ.", NORM: "Νοέμβριος"}, + {ORTH: "Οκτ.", NORM: "Οκτώβριος"}, + {ORTH: "Σεπτ.", NORM: "Σεπτέμβριος"}, + {ORTH: "Φεβρ.", NORM: "Φεβρουάριος"}, ]: _exc[exc_data[ORTH]] = [exc_data] @@ -395,4 +391,4 @@ for orth in [ ]: _exc[orth] = [{ORTH: orth}] -TOKENIZER_EXCEPTIONS = _exc +TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc) diff --git a/spacy/lang/en/__init__.py b/spacy/lang/en/__init__.py index f58ae4a4e..3a3ebeefd 100644 --- a/spacy/lang/en/__init__.py +++ b/spacy/lang/en/__init__.py @@ -1,75 +1,21 @@ -# coding: utf8 -from __future__ import unicode_literals +from typing import Optional +from thinc.api import Model from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS -from .tag_map import TAG_MAP from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS -from .morph_rules import MORPH_RULES from .syntax_iterators import SYNTAX_ITERATORS - -from ..tokenizer_exceptions import BASE_EXCEPTIONS +from .punctuation import TOKENIZER_INFIXES +from .lemmatizer import EnglishLemmatizer from ...language import Language -from ...attrs import LANG -from ...util import update_exc - - -def _return_en(_): - return "en" class EnglishDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters.update(LEX_ATTRS) - lex_attr_getters[LANG] = _return_en - tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) - tag_map = TAG_MAP - stop_words = STOP_WORDS - morph_rules = MORPH_RULES + tokenizer_exceptions = TOKENIZER_EXCEPTIONS + infixes = TOKENIZER_INFIXES + lex_attr_getters = LEX_ATTRS syntax_iterators = SYNTAX_ITERATORS - single_orth_variants = [ - {"tags": ["NFP"], "variants": ["…", "..."]}, - {"tags": [":"], "variants": ["-", "—", "–", "--", "---", "——"]}, - ] - paired_orth_variants = [ - {"tags": ["``", "''"], "variants": [("'", "'"), ("‘", "’")]}, - {"tags": ["``", "''"], "variants": [('"', '"'), ("“", "”")]}, - ] - - @classmethod - def is_base_form(cls, univ_pos, morphology=None): - """ - Check whether we're dealing with an uninflected paradigm, so we can - avoid lemmatization entirely. - - univ_pos (unicode / int): The token's universal part-of-speech tag. - morphology (dict): The token's morphological features following the - Universal Dependencies scheme. 
- """ - if morphology is None: - morphology = {} - if univ_pos == "noun" and morphology.get("Number") == "sing": - return True - elif univ_pos == "verb" and morphology.get("VerbForm") == "inf": - return True - # This maps 'VBP' to base form -- probably just need 'IS_BASE' - # morphology - elif univ_pos == "verb" and ( - morphology.get("VerbForm") == "fin" - and morphology.get("Tense") == "pres" - and morphology.get("Number") is None - ): - return True - elif univ_pos == "adj" and morphology.get("Degree") == "pos": - return True - elif morphology.get("VerbForm") == "inf": - return True - elif morphology.get("VerbForm") == "none": - return True - elif morphology.get("Degree") == "pos": - return True - else: - return False + stop_words = STOP_WORDS class English(Language): @@ -77,4 +23,14 @@ class English(Language): Defaults = EnglishDefaults +@English.factory( + "lemmatizer", + assigns=["token.lemma"], + default_config={"model": None, "mode": "rule"}, + default_score_weights={"lemma_acc": 1.0}, +) +def make_lemmatizer(nlp: Language, model: Optional[Model], name: str, mode: str): + return EnglishLemmatizer(nlp.vocab, model, name, mode=mode) + + __all__ = ["English"] diff --git a/spacy/lang/en/examples.py b/spacy/lang/en/examples.py index 946289c7c..2cca9e05f 100644 --- a/spacy/lang/en/examples.py +++ b/spacy/lang/en/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/en/lemmatizer.py b/spacy/lang/en/lemmatizer.py new file mode 100644 index 000000000..2cb0f9a53 --- /dev/null +++ b/spacy/lang/en/lemmatizer.py @@ -0,0 +1,40 @@ +from ...pipeline import Lemmatizer +from ...tokens import Token + + +class EnglishLemmatizer(Lemmatizer): + """English lemmatizer. Only overrides is_base_form.""" + + def is_base_form(self, token: Token) -> bool: + """ + Check whether we're dealing with an uninflected paradigm, so we can + avoid lemmatization entirely. + + univ_pos (unicode / int): The token's universal part-of-speech tag. + morphology (dict): The token's morphological features following the + Universal Dependencies scheme. 
+ """ + univ_pos = token.pos_.lower() + morphology = token.morph.to_dict() + if univ_pos == "noun" and morphology.get("Number") == "Sing": + return True + elif univ_pos == "verb" and morphology.get("VerbForm") == "Inf": + return True + # This maps 'VBP' to base form -- probably just need 'IS_BASE' + # morphology + elif univ_pos == "verb" and ( + morphology.get("VerbForm") == "Fin" + and morphology.get("Tense") == "Pres" + and morphology.get("Number") is None + ): + return True + elif univ_pos == "adj" and morphology.get("Degree") == "Pos": + return True + elif morphology.get("VerbForm") == "Inf": + return True + elif morphology.get("VerbForm") == "None": + return True + elif morphology.get("Degree") == "Pos": + return True + else: + return False diff --git a/spacy/lang/en/lex_attrs.py b/spacy/lang/en/lex_attrs.py index 4f6988bd5..fcc7c6bf2 100644 --- a/spacy/lang/en/lex_attrs.py +++ b/spacy/lang/en/lex_attrs.py @@ -1,87 +1,25 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM +# fmt: off _num_words = [ - "zero", - "one", - "two", - "three", - "four", - "five", - "six", - "seven", - "eight", - "nine", - "ten", - "eleven", - "twelve", - "thirteen", - "fourteen", - "fifteen", - "sixteen", - "seventeen", - "eighteen", - "nineteen", - "twenty", - "thirty", - "forty", - "fifty", - "sixty", - "seventy", - "eighty", - "ninety", - "hundred", - "thousand", - "million", - "billion", - "trillion", - "quadrillion", - "gajillion", - "bazillion", + "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", + "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", + "sixteen", "seventeen", "eighteen", "nineteen", "twenty", "thirty", "forty", + "fifty", "sixty", "seventy", "eighty", "ninety", "hundred", "thousand", + "million", "billion", "trillion", "quadrillion", "gajillion", "bazillion" ] - - _ordinal_words = [ - "first", - "second", - "third", - "fourth", - "fifth", - "sixth", - "seventh", - "eighth", - "ninth", - "tenth", - "eleventh", - "twelfth", - "thirteenth", - "fourteenth", - "fifteenth", - "sixteenth", - "seventeenth", - "eighteenth", - "nineteenth", - "twentieth", - "thirtieth", - "fortieth", - "fiftieth", - "sixtieth", - "seventieth", - "eightieth", - "ninetieth", - "hundredth", - "thousandth", - "millionth", - "billionth", - "trillionth", - "quadrillionth", - "gajillionth", - "bazillionth", + "first", "second", "third", "fourth", "fifth", "sixth", "seventh", "eighth", + "ninth", "tenth", "eleventh", "twelfth", "thirteenth", "fourteenth", + "fifteenth", "sixteenth", "seventeenth", "eighteenth", "nineteenth", + "twentieth", "thirtieth", "fortieth", "fiftieth", "sixtieth", "seventieth", + "eightieth", "ninetieth", "hundredth", "thousandth", "millionth", "billionth", + "trillionth", "quadrillionth", "gajillionth", "bazillionth" ] +# fmt: on -def like_num(text): + +def like_num(text: str) -> bool: if text.startswith(("+", "-", "±", "~")): text = text[1:] text = text.replace(",", "").replace(".", "") @@ -91,18 +29,15 @@ def like_num(text): num, denom = text.split("/") if num.isdigit() and denom.isdigit(): return True - text_lower = text.lower() if text_lower in _num_words: return True - - # CHeck ordinal number + # Check ordinal number if text_lower in _ordinal_words: return True if text_lower.endswith("th"): if text_lower[:-2].isdigit(): - return True - + return True return False diff --git a/spacy/lang/en/morph_rules.py b/spacy/lang/en/morph_rules.py deleted file mode 100644 index 5ed4eac59..000000000 --- 
a/spacy/lang/en/morph_rules.py +++ /dev/null @@ -1,493 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import LEMMA, PRON_LEMMA - -# Several entries here look pretty suspicious. These will get the POS SCONJ -# given the tag IN, when an adpositional reading seems much more likely for -# a lot of these prepositions. I'm not sure what I was running in 04395ffa4 -# when I did this? It doesn't seem right. -_subordinating_conjunctions = [ - "that", - "if", - "as", - "because", - # "of", - # "for", - # "before", - # "in", - "while", - # "after", - "since", - "like", - # "with", - "so", - # "to", - # "by", - # "on", - # "about", - "than", - "whether", - "although", - # "from", - "though", - # "until", - "unless", - "once", - # "without", - # "at", - # "into", - "cause", - # "over", - "upon", - "till", - "whereas", - # "beyond", - "whilst", - "except", - "despite", - "wether", - # "then", - "but", - "becuse", - "whie", - # "below", - # "against", - "it", - "w/out", - # "toward", - "albeit", - "save", - "besides", - "becouse", - "coz", - "til", - "ask", - "i'd", - "out", - "near", - "seince", - # "towards", - "tho", - "sice", - "will", -] - -# This seems kind of wrong too? -# _relative_pronouns = ["this", "that", "those", "these"] - -MORPH_RULES = { - # "DT": {word: {"POS": "PRON"} for word in _relative_pronouns}, - "IN": {word: {"POS": "SCONJ"} for word in _subordinating_conjunctions}, - "NN": { - "something": {"POS": "PRON"}, - "anyone": {"POS": "PRON"}, - "anything": {"POS": "PRON"}, - "nothing": {"POS": "PRON"}, - "someone": {"POS": "PRON"}, - "everything": {"POS": "PRON"}, - "everyone": {"POS": "PRON"}, - "everybody": {"POS": "PRON"}, - "nobody": {"POS": "PRON"}, - "somebody": {"POS": "PRON"}, - "anybody": {"POS": "PRON"}, - "any1": {"POS": "PRON"}, - }, - "PRP": { - "I": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "PronType": "Prs", - "Person": "One", - "Number": "Sing", - "Case": "Nom", - }, - "me": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "PronType": "Prs", - "Person": "One", - "Number": "Sing", - "Case": "Acc", - }, - "you": {LEMMA: PRON_LEMMA, "POS": "PRON", "PronType": "Prs", "Person": "Two"}, - "he": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Gender": "Masc", - "Case": "Nom", - }, - "him": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Gender": "Masc", - "Case": "Acc", - }, - "she": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Gender": "Fem", - "Case": "Nom", - }, - "her": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Gender": "Fem", - "Case": "Acc", - }, - "it": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Gender": "Neut", - }, - "we": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "PronType": "Prs", - "Person": "One", - "Number": "Plur", - "Case": "Nom", - }, - "us": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "PronType": "Prs", - "Person": "One", - "Number": "Plur", - "Case": "Acc", - }, - "they": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "PronType": "Prs", - "Person": "Three", - "Number": "Plur", - "Case": "Nom", - }, - "them": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "PronType": "Prs", - "Person": "Three", - "Number": "Plur", - "Case": "Acc", - }, - "mine": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "PronType": "Prs", - "Person": "One", - "Number": "Sing", - 
"Poss": "Yes", - "Reflex": "Yes", - }, - "his": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Gender": "Masc", - "Poss": "Yes", - "Reflex": "Yes", - }, - "hers": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Gender": "Fem", - "Poss": "Yes", - "Reflex": "Yes", - }, - "its": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Gender": "Neut", - "Poss": "Yes", - "Reflex": "Yes", - }, - "ours": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "PronType": "Prs", - "Person": "One", - "Number": "Plur", - "Poss": "Yes", - "Reflex": "Yes", - }, - "yours": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "PronType": "Prs", - "Person": "Two", - "Number": "Plur", - "Poss": "Yes", - "Reflex": "Yes", - }, - "theirs": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "PronType": "Prs", - "Person": "Three", - "Number": "Plur", - "Poss": "Yes", - "Reflex": "Yes", - }, - "myself": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "PronType": "Prs", - "Person": "One", - "Number": "Sing", - "Case": "Acc", - "Reflex": "Yes", - }, - "yourself": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "PronType": "Prs", - "Person": "Two", - "Case": "Acc", - "Reflex": "Yes", - }, - "himself": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Case": "Acc", - "Gender": "Masc", - "Reflex": "Yes", - }, - "herself": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Case": "Acc", - "Gender": "Fem", - "Reflex": "Yes", - }, - "itself": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Case": "Acc", - "Gender": "Neut", - "Reflex": "Yes", - }, - "themself": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Case": "Acc", - "Reflex": "Yes", - }, - "ourselves": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "PronType": "Prs", - "Person": "One", - "Number": "Plur", - "Case": "Acc", - "Reflex": "Yes", - }, - "yourselves": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "PronType": "Prs", - "Person": "Two", - "Case": "Acc", - "Reflex": "Yes", - }, - "themselves": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "PronType": "Prs", - "Person": "Three", - "Number": "Plur", - "Case": "Acc", - "Reflex": "Yes", - }, - }, - "PRP$": { - "my": { - LEMMA: PRON_LEMMA, - "Person": "One", - "Number": "Sing", - "PronType": "Prs", - "Poss": "Yes", - }, - "your": {LEMMA: PRON_LEMMA, "Person": "Two", "PronType": "Prs", "Poss": "Yes"}, - "his": { - LEMMA: PRON_LEMMA, - "Person": "Three", - "Number": "Sing", - "Gender": "Masc", - "PronType": "Prs", - "Poss": "Yes", - }, - "her": { - LEMMA: PRON_LEMMA, - "Person": "Three", - "Number": "Sing", - "Gender": "Fem", - "PronType": "Prs", - "Poss": "Yes", - }, - "its": { - LEMMA: PRON_LEMMA, - "Person": "Three", - "Number": "Sing", - "Gender": "Neut", - "PronType": "Prs", - "Poss": "Yes", - }, - "our": { - LEMMA: PRON_LEMMA, - "Person": "One", - "Number": "Plur", - "PronType": "Prs", - "Poss": "Yes", - }, - "their": { - LEMMA: PRON_LEMMA, - "Person": "Three", - "Number": "Plur", - "PronType": "Prs", - "Poss": "Yes", - }, - }, - "RB": {word: {"POS": "PART"} for word in ["not", "n't", "nt", "n’t"]}, - "VB": { - word: {"POS": "AUX"} - for word in ["be", "have", "do", "get", "of", "am", "are", "'ve"] - }, - "VBN": {"been": {LEMMA: "be", "POS": "AUX"}}, - "VBG": {"being": {LEMMA: "be", 
"POS": "AUX"}}, - "VBZ": { - "am": { - LEMMA: "be", - "POS": "AUX", - "VerbForm": "Fin", - "Person": "One", - "Tense": "Pres", - "Mood": "Ind", - }, - "are": { - LEMMA: "be", - "POS": "AUX", - "VerbForm": "Fin", - "Person": "Two", - "Tense": "Pres", - "Mood": "Ind", - }, - "is": { - LEMMA: "be", - "POS": "AUX", - "VerbForm": "Fin", - "Person": "Three", - "Tense": "Pres", - "Mood": "Ind", - }, - "'re": { - LEMMA: "be", - "POS": "AUX", - "VerbForm": "Fin", - "Person": "Two", - "Tense": "Pres", - "Mood": "Ind", - }, - "'s": { - LEMMA: "be", - "POS": "AUX", - "VerbForm": "Fin", - "Person": "Three", - "Tense": "Pres", - "Mood": "Ind", - }, - "has": {LEMMA: "have", "POS": "AUX"}, - "does": {LEMMA: "do", "POS": "AUX"}, - }, - "VBP": { - "are": { - LEMMA: "be", - "POS": "AUX", - "VerbForm": "Fin", - "Tense": "Pres", - "Mood": "Ind", - }, - "'re": { - LEMMA: "be", - "POS": "AUX", - "VerbForm": "Fin", - "Tense": "Pres", - "Mood": "Ind", - }, - "am": { - LEMMA: "be", - "POS": "AUX", - "VerbForm": "Fin", - "Person": "One", - "Tense": "Pres", - "Mood": "Ind", - }, - "do": {"POS": "AUX"}, - "have": {"POS": "AUX"}, - "'m": {"POS": "AUX", LEMMA: "be"}, - "'ve": {"POS": "AUX"}, - "'s": {"POS": "AUX"}, - "is": {"POS": "AUX"}, - "'d": {"POS": "AUX"}, - }, - "VBD": { - "was": { - LEMMA: "be", - "POS": "AUX", - "VerbForm": "Fin", - "Tense": "Past", - "Number": "Sing", - }, - "were": { - LEMMA: "be", - "POS": "AUX", - "VerbForm": "Fin", - "Tense": "Past", - "Number": "Plur", - }, - "did": {LEMMA: "do", "POS": "AUX"}, - "had": {LEMMA: "have", "POS": "AUX"}, - "'d": {LEMMA: "have", "POS": "AUX"}, - }, -} - - -for tag, rules in MORPH_RULES.items(): - for key, attrs in dict(rules).items(): - rules[key.title()] = attrs diff --git a/spacy/lang/en/punctuation.py b/spacy/lang/en/punctuation.py new file mode 100644 index 000000000..5d3eb792e --- /dev/null +++ b/spacy/lang/en/punctuation.py @@ -0,0 +1,19 @@ +from ..char_classes import LIST_ELLIPSES, LIST_ICONS, HYPHENS +from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA + +_infixes = ( + LIST_ELLIPSES + + LIST_ICONS + + [ + r"(?<=[0-9])[+\-\*^](?=[0-9-])", + r"(?<=[{al}{q}])\.(?=[{au}{q}])".format( + al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES + ), + r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA), + r"(?<=[{a}0-9])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS), + r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA), + ] +) + + +TOKENIZER_INFIXES = _infixes diff --git a/spacy/lang/en/stop_words.py b/spacy/lang/en/stop_words.py index 3505b13bf..1ca5cbc16 100644 --- a/spacy/lang/en/stop_words.py +++ b/spacy/lang/en/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Stop words STOP_WORDS = set( """ diff --git a/spacy/lang/en/syntax_iterators.py b/spacy/lang/en/syntax_iterators.py index 0f2b28b58..2a1b0867e 100644 --- a/spacy/lang/en/syntax_iterators.py +++ b/spacy/lang/en/syntax_iterators.py @@ -1,30 +1,18 @@ -# coding: utf8 -from __future__ import unicode_literals +from typing import Union, Iterator from ...symbols import NOUN, PROPN, PRON from ...errors import Errors +from ...tokens import Doc, Span -def noun_chunks(doclike): - """ - Detect base noun phrases from a dependency parse. Works on both Doc and Span. - """ - labels = [ - "nsubj", - "dobj", - "nsubjpass", - "pcomp", - "pobj", - "dative", - "appos", - "attr", - "ROOT", - ] +def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Span]: + """Detect base noun phrases from a dependency parse. 
Works on Doc and Span.""" + # fmt: off + labels = ["nsubj", "dobj", "nsubjpass", "pcomp", "pobj", "dative", "appos", "attr", "ROOT"] + # fmt: on doc = doclike.doc # Ensure works on both Doc and Span. - - if not doc.is_parsed: + if not doc.has_annotation("DEP"): raise ValueError(Errors.E029) - np_deps = [doc.vocab.strings.add(label) for label in labels] conj = doc.vocab.strings.add("conj") np_label = doc.vocab.strings.add("NP") diff --git a/spacy/lang/en/tag_map.py b/spacy/lang/en/tag_map.py deleted file mode 100644 index ecb3103cc..000000000 --- a/spacy/lang/en/tag_map.py +++ /dev/null @@ -1,72 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import POS, PUNCT, SYM, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB -from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON - - -TAG_MAP = { - ".": {POS: PUNCT, "PunctType": "peri"}, - ",": {POS: PUNCT, "PunctType": "comm"}, - "-LRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "ini"}, - "-RRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "fin"}, - "``": {POS: PUNCT, "PunctType": "quot", "PunctSide": "ini"}, - '""': {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"}, - "''": {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"}, - ":": {POS: PUNCT}, - "$": {POS: SYM}, - "#": {POS: SYM}, - "AFX": {POS: ADJ, "Hyph": "yes"}, - "CC": {POS: CCONJ, "ConjType": "comp"}, - "CD": {POS: NUM, "NumType": "card"}, - "DT": {POS: DET}, - "EX": {POS: PRON, "AdvType": "ex"}, - "FW": {POS: X, "Foreign": "yes"}, - "HYPH": {POS: PUNCT, "PunctType": "dash"}, - "IN": {POS: ADP}, - "JJ": {POS: ADJ, "Degree": "pos"}, - "JJR": {POS: ADJ, "Degree": "comp"}, - "JJS": {POS: ADJ, "Degree": "sup"}, - "LS": {POS: X, "NumType": "ord"}, - "MD": {POS: VERB, "VerbType": "mod"}, - "NIL": {POS: X}, - "NN": {POS: NOUN, "Number": "sing"}, - "NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"}, - "NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"}, - "NNS": {POS: NOUN, "Number": "plur"}, - "PDT": {POS: DET}, - "POS": {POS: PART, "Poss": "yes"}, - "PRP": {POS: PRON, "PronType": "prs"}, - "PRP$": {POS: DET, "PronType": "prs", "Poss": "yes"}, - "RB": {POS: ADV, "Degree": "pos"}, - "RBR": {POS: ADV, "Degree": "comp"}, - "RBS": {POS: ADV, "Degree": "sup"}, - "RP": {POS: ADP}, - "SP": {POS: SPACE}, - "SYM": {POS: SYM}, - "TO": {POS: PART, "PartType": "inf", "VerbForm": "inf"}, - "UH": {POS: INTJ}, - "VB": {POS: VERB, "VerbForm": "inf"}, - "VBD": {POS: VERB, "VerbForm": "fin", "Tense": "past"}, - "VBG": {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"}, - "VBN": {POS: VERB, "VerbForm": "part", "Tense": "past", "Aspect": "perf"}, - "VBP": {POS: VERB, "VerbForm": "fin", "Tense": "pres"}, - "VBZ": { - POS: VERB, - "VerbForm": "fin", - "Tense": "pres", - "Number": "sing", - "Person": "three", - }, - "WDT": {POS: DET}, - "WP": {POS: PRON}, - "WP$": {POS: DET, "Poss": "yes"}, - "WRB": {POS: ADV}, - "ADD": {POS: X}, - "NFP": {POS: PUNCT}, - "GW": {POS: X}, - "XX": {POS: X}, - "BES": {POS: VERB}, - "HVS": {POS: VERB}, - "_SP": {POS: SPACE}, -} diff --git a/spacy/lang/en/tokenizer_exceptions.py b/spacy/lang/en/tokenizer_exceptions.py index 964a714ae..c210e1a19 100644 --- a/spacy/lang/en/tokenizer_exceptions.py +++ b/spacy/lang/en/tokenizer_exceptions.py @@ -1,7 +1,6 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import ORTH, LEMMA, TAG, NORM, PRON_LEMMA +from ..tokenizer_exceptions import BASE_EXCEPTIONS +from ...symbols import ORTH, NORM +from ...util import update_exc _exc = {} @@ -29,258 +28,270 
@@ _exclude = [ for pron in ["i"]: for orth in [pron, pron.title()]: _exc[orth + "'m"] = [ - {ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"}, - {ORTH: "'m", LEMMA: "be", NORM: "am", TAG: "VBP"}, + {ORTH: orth, NORM: pron}, + {ORTH: "'m", NORM: "am"}, ] _exc[orth + "m"] = [ - {ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"}, - {ORTH: "m", LEMMA: "be", TAG: "VBP", "tenspect": 1, "number": 1}, + {ORTH: orth, NORM: pron}, + {ORTH: "m", "tenspect": 1, "number": 1}, ] _exc[orth + "'ma"] = [ - {ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"}, - {ORTH: "'m", LEMMA: "be", NORM: "am"}, - {ORTH: "a", LEMMA: "going to", NORM: "gonna"}, + {ORTH: orth, NORM: pron}, + {ORTH: "'m", NORM: "am"}, + {ORTH: "a", NORM: "gonna"}, ] _exc[orth + "ma"] = [ - {ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"}, - {ORTH: "m", LEMMA: "be", NORM: "am"}, - {ORTH: "a", LEMMA: "going to", NORM: "gonna"}, + {ORTH: orth, NORM: pron}, + {ORTH: "m", NORM: "am"}, + {ORTH: "a", NORM: "gonna"}, ] for pron in ["i", "you", "he", "she", "it", "we", "they"]: for orth in [pron, pron.title()]: _exc[orth + "'ll"] = [ - {ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"}, - {ORTH: "'ll", LEMMA: "will", NORM: "will", TAG: "MD"}, + {ORTH: orth, NORM: pron}, + {ORTH: "'ll", NORM: "will"}, ] _exc[orth + "ll"] = [ - {ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"}, - {ORTH: "ll", LEMMA: "will", NORM: "will", TAG: "MD"}, + {ORTH: orth, NORM: pron}, + {ORTH: "ll", NORM: "will"}, ] _exc[orth + "'ll've"] = [ - {ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"}, - {ORTH: "'ll", LEMMA: "will", NORM: "will", TAG: "MD"}, - {ORTH: "'ve", LEMMA: "have", NORM: "have", TAG: "VB"}, + {ORTH: orth, NORM: pron}, + {ORTH: "'ll", NORM: "will"}, + {ORTH: "'ve", NORM: "have"}, ] _exc[orth + "llve"] = [ - {ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"}, - {ORTH: "ll", LEMMA: "will", NORM: "will", TAG: "MD"}, - {ORTH: "ve", LEMMA: "have", NORM: "have", TAG: "VB"}, + {ORTH: orth, NORM: pron}, + {ORTH: "ll", NORM: "will"}, + {ORTH: "ve", NORM: "have"}, ] _exc[orth + "'d"] = [ - {ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"}, + {ORTH: orth, NORM: pron}, {ORTH: "'d", NORM: "'d"}, ] _exc[orth + "d"] = [ - {ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"}, + {ORTH: orth, NORM: pron}, {ORTH: "d", NORM: "'d"}, ] _exc[orth + "'d've"] = [ - {ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"}, - {ORTH: "'d", LEMMA: "would", NORM: "would", TAG: "MD"}, - {ORTH: "'ve", LEMMA: "have", NORM: "have", TAG: "VB"}, + {ORTH: orth, NORM: pron}, + {ORTH: "'d", NORM: "would"}, + {ORTH: "'ve", NORM: "have"}, ] _exc[orth + "dve"] = [ - {ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"}, - {ORTH: "d", LEMMA: "would", NORM: "would", TAG: "MD"}, - {ORTH: "ve", LEMMA: "have", NORM: "have", TAG: "VB"}, + {ORTH: orth, NORM: pron}, + {ORTH: "d", NORM: "would"}, + {ORTH: "ve", NORM: "have"}, ] for pron in ["i", "you", "we", "they"]: for orth in [pron, pron.title()]: _exc[orth + "'ve"] = [ - {ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"}, - {ORTH: "'ve", LEMMA: "have", NORM: "have", TAG: "VB"}, + {ORTH: orth, NORM: pron}, + {ORTH: "'ve", NORM: "have"}, ] _exc[orth + "ve"] = [ - {ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"}, - {ORTH: "ve", LEMMA: "have", NORM: "have", TAG: "VB"}, + {ORTH: orth, NORM: pron}, + {ORTH: "ve", NORM: "have"}, ] for pron in ["you", "we", "they"]: for orth in [pron, pron.title()]: _exc[orth + "'re"] = [ - {ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"}, - {ORTH: "'re", 
LEMMA: "be", NORM: "are"}, + {ORTH: orth, NORM: pron}, + {ORTH: "'re", NORM: "are"}, ] _exc[orth + "re"] = [ - {ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"}, - {ORTH: "re", LEMMA: "be", NORM: "are", TAG: "VBZ"}, + {ORTH: orth, NORM: pron}, + {ORTH: "re", NORM: "are"}, ] for pron in ["he", "she", "it"]: for orth in [pron, pron.title()]: _exc[orth + "'s"] = [ - {ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"}, + {ORTH: orth, NORM: pron}, {ORTH: "'s", NORM: "'s"}, ] _exc[orth + "s"] = [ - {ORTH: orth, LEMMA: PRON_LEMMA, NORM: pron, TAG: "PRP"}, + {ORTH: orth, NORM: pron}, {ORTH: "s"}, ] # W-words, relative pronouns, prepositions etc. -for word in ["who", "what", "when", "where", "why", "how", "there", "that", "this", "these", "those"]: +for word in [ + "who", + "what", + "when", + "where", + "why", + "how", + "there", + "that", + "this", + "these", + "those", +]: for orth in [word, word.title()]: _exc[orth + "'s"] = [ - {ORTH: orth, LEMMA: word, NORM: word}, + {ORTH: orth, NORM: word}, {ORTH: "'s", NORM: "'s"}, ] - _exc[orth + "s"] = [{ORTH: orth, LEMMA: word, NORM: word}, {ORTH: "s"}] + _exc[orth + "s"] = [{ORTH: orth, NORM: word}, {ORTH: "s"}] _exc[orth + "'ll"] = [ - {ORTH: orth, LEMMA: word, NORM: word}, - {ORTH: "'ll", LEMMA: "will", NORM: "will", TAG: "MD"}, + {ORTH: orth, NORM: word}, + {ORTH: "'ll", NORM: "will"}, ] _exc[orth + "ll"] = [ - {ORTH: orth, LEMMA: word, NORM: word}, - {ORTH: "ll", LEMMA: "will", NORM: "will", TAG: "MD"}, + {ORTH: orth, NORM: word}, + {ORTH: "ll", NORM: "will"}, ] _exc[orth + "'ll've"] = [ - {ORTH: orth, LEMMA: word, NORM: word}, - {ORTH: "'ll", LEMMA: "will", NORM: "will", TAG: "MD"}, - {ORTH: "'ve", LEMMA: "have", NORM: "have", TAG: "VB"}, + {ORTH: orth, NORM: word}, + {ORTH: "'ll", NORM: "will"}, + {ORTH: "'ve", NORM: "have"}, ] _exc[orth + "llve"] = [ - {ORTH: orth, LEMMA: word, NORM: word}, - {ORTH: "ll", LEMMA: "will", NORM: "will", TAG: "MD"}, - {ORTH: "ve", LEMMA: "have", NORM: "have", TAG: "VB"}, + {ORTH: orth, NORM: word}, + {ORTH: "ll", NORM: "will"}, + {ORTH: "ve", NORM: "have"}, ] _exc[orth + "'re"] = [ - {ORTH: orth, LEMMA: word, NORM: word}, - {ORTH: "'re", LEMMA: "be", NORM: "are"}, + {ORTH: orth, NORM: word}, + {ORTH: "'re", NORM: "are"}, ] _exc[orth + "re"] = [ - {ORTH: orth, LEMMA: word, NORM: word}, - {ORTH: "re", LEMMA: "be", NORM: "are"}, + {ORTH: orth, NORM: word}, + {ORTH: "re", NORM: "are"}, ] _exc[orth + "'ve"] = [ - {ORTH: orth, LEMMA: word, NORM: word}, - {ORTH: "'ve", LEMMA: "have", TAG: "VB"}, + {ORTH: orth, NORM: word}, + {ORTH: "'ve"}, ] _exc[orth + "ve"] = [ - {ORTH: orth, LEMMA: word}, - {ORTH: "ve", LEMMA: "have", NORM: "have", TAG: "VB"}, + {ORTH: orth}, + {ORTH: "ve", NORM: "have"}, ] _exc[orth + "'d"] = [ - {ORTH: orth, LEMMA: word, NORM: word}, + {ORTH: orth, NORM: word}, {ORTH: "'d", NORM: "'d"}, ] _exc[orth + "d"] = [ - {ORTH: orth, LEMMA: word, NORM: word}, + {ORTH: orth, NORM: word}, {ORTH: "d", NORM: "'d"}, ] _exc[orth + "'d've"] = [ - {ORTH: orth, LEMMA: word, NORM: word}, - {ORTH: "'d", LEMMA: "would", NORM: "would", TAG: "MD"}, - {ORTH: "'ve", LEMMA: "have", NORM: "have", TAG: "VB"}, + {ORTH: orth, NORM: word}, + {ORTH: "'d", NORM: "would"}, + {ORTH: "'ve", NORM: "have"}, ] _exc[orth + "dve"] = [ - {ORTH: orth, LEMMA: word, NORM: word}, - {ORTH: "d", LEMMA: "would", NORM: "would", TAG: "MD"}, - {ORTH: "ve", LEMMA: "have", NORM: "have", TAG: "VB"}, + {ORTH: orth, NORM: word}, + {ORTH: "d", NORM: "would"}, + {ORTH: "ve", NORM: "have"}, ] # Verbs for verb_data in [ - {ORTH: "ca", LEMMA: "can", 
NORM: "can", TAG: "MD"}, - {ORTH: "could", NORM: "could", TAG: "MD"}, - {ORTH: "do", LEMMA: "do", NORM: "do"}, - {ORTH: "does", LEMMA: "do", NORM: "does"}, - {ORTH: "did", LEMMA: "do", NORM: "do", TAG: "VBD"}, - {ORTH: "had", LEMMA: "have", NORM: "have", TAG: "VBD"}, - {ORTH: "may", NORM: "may", TAG: "MD"}, - {ORTH: "might", NORM: "might", TAG: "MD"}, - {ORTH: "must", NORM: "must", TAG: "MD"}, + {ORTH: "ca", NORM: "can"}, + {ORTH: "could", NORM: "could"}, + {ORTH: "do", NORM: "do"}, + {ORTH: "does", NORM: "does"}, + {ORTH: "did", NORM: "do"}, + {ORTH: "had", NORM: "have"}, + {ORTH: "may", NORM: "may"}, + {ORTH: "might", NORM: "might"}, + {ORTH: "must", NORM: "must"}, {ORTH: "need", NORM: "need"}, - {ORTH: "ought", NORM: "ought", TAG: "MD"}, - {ORTH: "sha", LEMMA: "shall", NORM: "shall", TAG: "MD"}, - {ORTH: "should", NORM: "should", TAG: "MD"}, - {ORTH: "wo", LEMMA: "will", NORM: "will", TAG: "MD"}, - {ORTH: "would", NORM: "would", TAG: "MD"}, + {ORTH: "ought", NORM: "ought"}, + {ORTH: "sha", NORM: "shall"}, + {ORTH: "should", NORM: "should"}, + {ORTH: "wo", NORM: "will"}, + {ORTH: "would", NORM: "would"}, ]: verb_data_tc = dict(verb_data) verb_data_tc[ORTH] = verb_data_tc[ORTH].title() for data in [verb_data, verb_data_tc]: _exc[data[ORTH] + "n't"] = [ dict(data), - {ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}, + {ORTH: "n't", NORM: "not"}, ] _exc[data[ORTH] + "nt"] = [ dict(data), - {ORTH: "nt", LEMMA: "not", NORM: "not", TAG: "RB"}, + {ORTH: "nt", NORM: "not"}, ] _exc[data[ORTH] + "n't've"] = [ dict(data), - {ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}, - {ORTH: "'ve", LEMMA: "have", NORM: "have", TAG: "VB"}, + {ORTH: "n't", NORM: "not"}, + {ORTH: "'ve", NORM: "have"}, ] _exc[data[ORTH] + "ntve"] = [ dict(data), - {ORTH: "nt", LEMMA: "not", NORM: "not", TAG: "RB"}, - {ORTH: "ve", LEMMA: "have", NORM: "have", TAG: "VB"}, + {ORTH: "nt", NORM: "not"}, + {ORTH: "ve", NORM: "have"}, ] for verb_data in [ - {ORTH: "could", NORM: "could", TAG: "MD"}, - {ORTH: "might", NORM: "might", TAG: "MD"}, - {ORTH: "must", NORM: "must", TAG: "MD"}, - {ORTH: "should", NORM: "should", TAG: "MD"}, - {ORTH: "would", NORM: "would", TAG: "MD"}, + {ORTH: "could", NORM: "could"}, + {ORTH: "might", NORM: "might"}, + {ORTH: "must", NORM: "must"}, + {ORTH: "should", NORM: "should"}, + {ORTH: "would", NORM: "would"}, ]: verb_data_tc = dict(verb_data) verb_data_tc[ORTH] = verb_data_tc[ORTH].title() for data in [verb_data, verb_data_tc]: - _exc[data[ORTH] + "'ve"] = [dict(data), {ORTH: "'ve", LEMMA: "have", TAG: "VB"}] + _exc[data[ORTH] + "'ve"] = [dict(data), {ORTH: "'ve"}] - _exc[data[ORTH] + "ve"] = [dict(data), {ORTH: "ve", LEMMA: "have", TAG: "VB"}] + _exc[data[ORTH] + "ve"] = [dict(data), {ORTH: "ve"}] for verb_data in [ - {ORTH: "ai", LEMMA: "be", TAG: "VBP", "number": 2}, - {ORTH: "are", LEMMA: "be", NORM: "are", TAG: "VBP", "number": 2}, - {ORTH: "is", LEMMA: "be", NORM: "is", TAG: "VBZ"}, - {ORTH: "was", LEMMA: "be", NORM: "was"}, - {ORTH: "were", LEMMA: "be", NORM: "were"}, + {ORTH: "ai", "number": 2}, + {ORTH: "are", NORM: "are", "number": 2}, + {ORTH: "is", NORM: "is"}, + {ORTH: "was", NORM: "was"}, + {ORTH: "were", NORM: "were"}, {ORTH: "have", NORM: "have"}, - {ORTH: "has", LEMMA: "have", NORM: "has"}, + {ORTH: "has", NORM: "has"}, {ORTH: "dare", NORM: "dare"}, ]: verb_data_tc = dict(verb_data) @@ -288,24 +299,24 @@ for verb_data in [ for data in [verb_data, verb_data_tc]: _exc[data[ORTH] + "n't"] = [ dict(data), - {ORTH: "n't", LEMMA: "not", NORM: "not", TAG: "RB"}, + {ORTH: "n't", NORM: 
"not"}, ] _exc[data[ORTH] + "nt"] = [ dict(data), - {ORTH: "nt", LEMMA: "not", NORM: "not", TAG: "RB"}, + {ORTH: "nt", NORM: "not"}, ] # Other contractions with trailing apostrophe for exc_data in [ - {ORTH: "doin", LEMMA: "do", NORM: "doing"}, - {ORTH: "goin", LEMMA: "go", NORM: "going"}, - {ORTH: "nothin", LEMMA: "nothing", NORM: "nothing"}, - {ORTH: "nuthin", LEMMA: "nothing", NORM: "nothing"}, - {ORTH: "ol", LEMMA: "old", NORM: "old"}, - {ORTH: "somethin", LEMMA: "something", NORM: "something"}, + {ORTH: "doin", NORM: "doing"}, + {ORTH: "goin", NORM: "going"}, + {ORTH: "nothin", NORM: "nothing"}, + {ORTH: "nuthin", NORM: "nothing"}, + {ORTH: "ol", NORM: "old"}, + {ORTH: "somethin", NORM: "something"}, ]: exc_data_tc = dict(exc_data) exc_data_tc[ORTH] = exc_data_tc[ORTH].title() @@ -320,9 +331,9 @@ for exc_data in [ for exc_data in [ {ORTH: "cause", NORM: "because"}, - {ORTH: "em", LEMMA: PRON_LEMMA, NORM: "them"}, - {ORTH: "ll", LEMMA: "will", NORM: "will"}, - {ORTH: "nuff", LEMMA: "enough", NORM: "enough"}, + {ORTH: "em", NORM: "them"}, + {ORTH: "ll", NORM: "will"}, + {ORTH: "nuff", NORM: "enough"}, ]: exc_data_apos = dict(exc_data) exc_data_apos[ORTH] = "'" + exc_data_apos[ORTH] @@ -334,174 +345,133 @@ for exc_data in [ for h in range(1, 12 + 1): for period in ["a.m.", "am"]: - _exc["%d%s" % (h, period)] = [ - {ORTH: "%d" % h}, - {ORTH: period, LEMMA: "a.m.", NORM: "a.m."}, + _exc[f"{h}{period}"] = [ + {ORTH: f"{h}"}, + {ORTH: period, NORM: "a.m."}, ] for period in ["p.m.", "pm"]: - _exc["%d%s" % (h, period)] = [ - {ORTH: "%d" % h}, - {ORTH: period, LEMMA: "p.m.", NORM: "p.m."}, + _exc[f"{h}{period}"] = [ + {ORTH: f"{h}"}, + {ORTH: period, NORM: "p.m."}, ] # Rest _other_exc = { - "y'all": [{ORTH: "y'", LEMMA: PRON_LEMMA, NORM: "you"}, {ORTH: "all"}], - "yall": [{ORTH: "y", LEMMA: PRON_LEMMA, NORM: "you"}, {ORTH: "all"}], - "how'd'y": [ - {ORTH: "how", LEMMA: "how"}, - {ORTH: "'d", LEMMA: "do"}, - {ORTH: "'y", LEMMA: PRON_LEMMA, NORM: "you"}, - ], - "How'd'y": [ - {ORTH: "How", LEMMA: "how", NORM: "how"}, - {ORTH: "'d", LEMMA: "do"}, - {ORTH: "'y", LEMMA: PRON_LEMMA, NORM: "you"}, - ], - "not've": [ - {ORTH: "not", LEMMA: "not", TAG: "RB"}, - {ORTH: "'ve", LEMMA: "have", NORM: "have", TAG: "VB"}, - ], - "notve": [ - {ORTH: "not", LEMMA: "not", TAG: "RB"}, - {ORTH: "ve", LEMMA: "have", NORM: "have", TAG: "VB"}, - ], - "Not've": [ - {ORTH: "Not", LEMMA: "not", NORM: "not", TAG: "RB"}, - {ORTH: "'ve", LEMMA: "have", NORM: "have", TAG: "VB"}, - ], - "Notve": [ - {ORTH: "Not", LEMMA: "not", NORM: "not", TAG: "RB"}, - {ORTH: "ve", LEMMA: "have", NORM: "have", TAG: "VB"}, - ], - "cannot": [ - {ORTH: "can", LEMMA: "can", TAG: "MD"}, - {ORTH: "not", LEMMA: "not", TAG: "RB"}, - ], - "Cannot": [ - {ORTH: "Can", LEMMA: "can", NORM: "can", TAG: "MD"}, - {ORTH: "not", LEMMA: "not", TAG: "RB"}, - ], - "gonna": [ - {ORTH: "gon", LEMMA: "go", NORM: "going"}, - {ORTH: "na", LEMMA: "to", NORM: "to"}, - ], - "Gonna": [ - {ORTH: "Gon", LEMMA: "go", NORM: "going"}, - {ORTH: "na", LEMMA: "to", NORM: "to"}, - ], - "gotta": [{ORTH: "got"}, {ORTH: "ta", LEMMA: "to", NORM: "to"}], - "Gotta": [{ORTH: "Got", NORM: "got"}, {ORTH: "ta", LEMMA: "to", NORM: "to"}], - "let's": [{ORTH: "let"}, {ORTH: "'s", LEMMA: PRON_LEMMA, NORM: "us"}], - "Let's": [ - {ORTH: "Let", LEMMA: "let", NORM: "let"}, - {ORTH: "'s", LEMMA: PRON_LEMMA, NORM: "us"}, - ], - "c'mon": [ - {ORTH: "c'm", NORM: "come", LEMMA: "come"}, - {ORTH: "on"} - ], - "C'mon": [ - {ORTH: "C'm", NORM: "come", LEMMA: "come"}, - {ORTH: "on"} - ] + "y'all": [{ORTH: 
"y'", NORM: "you"}, {ORTH: "all"}], + "yall": [{ORTH: "y", NORM: "you"}, {ORTH: "all"}], + "how'd'y": [{ORTH: "how"}, {ORTH: "'d"}, {ORTH: "'y", NORM: "you"}], + "How'd'y": [{ORTH: "How", NORM: "how"}, {ORTH: "'d"}, {ORTH: "'y", NORM: "you"}], + "not've": [{ORTH: "not"}, {ORTH: "'ve", NORM: "have"}], + "notve": [{ORTH: "not"}, {ORTH: "ve", NORM: "have"}], + "Not've": [{ORTH: "Not", NORM: "not"}, {ORTH: "'ve", NORM: "have"}], + "Notve": [{ORTH: "Not", NORM: "not"}, {ORTH: "ve", NORM: "have"}], + "cannot": [{ORTH: "can"}, {ORTH: "not"}], + "Cannot": [{ORTH: "Can", NORM: "can"}, {ORTH: "not"}], + "gonna": [{ORTH: "gon", NORM: "going"}, {ORTH: "na", NORM: "to"}], + "Gonna": [{ORTH: "Gon", NORM: "going"}, {ORTH: "na", NORM: "to"}], + "gotta": [{ORTH: "got"}, {ORTH: "ta", NORM: "to"}], + "Gotta": [{ORTH: "Got", NORM: "got"}, {ORTH: "ta", NORM: "to"}], + "let's": [{ORTH: "let"}, {ORTH: "'s", NORM: "us"}], + "Let's": [{ORTH: "Let", NORM: "let"}, {ORTH: "'s", NORM: "us"}], + "c'mon": [{ORTH: "c'm", NORM: "come"}, {ORTH: "on"}], + "C'mon": [{ORTH: "C'm", NORM: "come"}, {ORTH: "on"}], } _exc.update(_other_exc) for exc_data in [ - {ORTH: "'S", LEMMA: "'s", NORM: "'s"}, - {ORTH: "'s", LEMMA: "'s", NORM: "'s"}, - {ORTH: "\u2018S", LEMMA: "'s", NORM: "'s"}, - {ORTH: "\u2018s", LEMMA: "'s", NORM: "'s"}, - {ORTH: "and/or", LEMMA: "and/or", NORM: "and/or", TAG: "CC"}, - {ORTH: "w/o", LEMMA: "without", NORM: "without"}, - {ORTH: "'re", LEMMA: "be", NORM: "are"}, - {ORTH: "'Cause", LEMMA: "because", NORM: "because"}, - {ORTH: "'cause", LEMMA: "because", NORM: "because"}, - {ORTH: "'cos", LEMMA: "because", NORM: "because"}, - {ORTH: "'Cos", LEMMA: "because", NORM: "because"}, - {ORTH: "'coz", LEMMA: "because", NORM: "because"}, - {ORTH: "'Coz", LEMMA: "because", NORM: "because"}, - {ORTH: "'cuz", LEMMA: "because", NORM: "because"}, - {ORTH: "'Cuz", LEMMA: "because", NORM: "because"}, - {ORTH: "'bout", LEMMA: "about", NORM: "about"}, - {ORTH: "ma'am", LEMMA: "madam", NORM: "madam"}, - {ORTH: "Ma'am", LEMMA: "madam", NORM: "madam"}, - {ORTH: "o'clock", LEMMA: "o'clock", NORM: "o'clock"}, - {ORTH: "O'clock", LEMMA: "o'clock", NORM: "o'clock"}, - {ORTH: "lovin'", LEMMA: "love", NORM: "loving"}, - {ORTH: "Lovin'", LEMMA: "love", NORM: "loving"}, - {ORTH: "lovin", LEMMA: "love", NORM: "loving"}, - {ORTH: "Lovin", LEMMA: "love", NORM: "loving"}, - {ORTH: "havin'", LEMMA: "have", NORM: "having"}, - {ORTH: "Havin'", LEMMA: "have", NORM: "having"}, - {ORTH: "havin", LEMMA: "have", NORM: "having"}, - {ORTH: "Havin", LEMMA: "have", NORM: "having"}, - {ORTH: "doin'", LEMMA: "do", NORM: "doing"}, - {ORTH: "Doin'", LEMMA: "do", NORM: "doing"}, - {ORTH: "doin", LEMMA: "do", NORM: "doing"}, - {ORTH: "Doin", LEMMA: "do", NORM: "doing"}, - {ORTH: "goin'", LEMMA: "go", NORM: "going"}, - {ORTH: "Goin'", LEMMA: "go", NORM: "going"}, - {ORTH: "goin", LEMMA: "go", NORM: "going"}, - {ORTH: "Goin", LEMMA: "go", NORM: "going"}, - {ORTH: "Mt.", LEMMA: "Mount", NORM: "Mount"}, - {ORTH: "Ak.", LEMMA: "Alaska", NORM: "Alaska"}, - {ORTH: "Ala.", LEMMA: "Alabama", NORM: "Alabama"}, - {ORTH: "Apr.", LEMMA: "April", NORM: "April"}, - {ORTH: "Ariz.", LEMMA: "Arizona", NORM: "Arizona"}, - {ORTH: "Ark.", LEMMA: "Arkansas", NORM: "Arkansas"}, - {ORTH: "Aug.", LEMMA: "August", NORM: "August"}, - {ORTH: "Calif.", LEMMA: "California", NORM: "California"}, - {ORTH: "Colo.", LEMMA: "Colorado", NORM: "Colorado"}, - {ORTH: "Conn.", LEMMA: "Connecticut", NORM: "Connecticut"}, - {ORTH: "Dec.", LEMMA: "December", NORM: "December"}, - {ORTH: "Del.", LEMMA: 
"Delaware", NORM: "Delaware"}, - {ORTH: "Feb.", LEMMA: "February", NORM: "February"}, - {ORTH: "Fla.", LEMMA: "Florida", NORM: "Florida"}, - {ORTH: "Ga.", LEMMA: "Georgia", NORM: "Georgia"}, - {ORTH: "Ia.", LEMMA: "Iowa", NORM: "Iowa"}, - {ORTH: "Id.", LEMMA: "Idaho", NORM: "Idaho"}, - {ORTH: "Ill.", LEMMA: "Illinois", NORM: "Illinois"}, - {ORTH: "Ind.", LEMMA: "Indiana", NORM: "Indiana"}, - {ORTH: "Jan.", LEMMA: "January", NORM: "January"}, - {ORTH: "Jul.", LEMMA: "July", NORM: "July"}, - {ORTH: "Jun.", LEMMA: "June", NORM: "June"}, - {ORTH: "Kan.", LEMMA: "Kansas", NORM: "Kansas"}, - {ORTH: "Kans.", LEMMA: "Kansas", NORM: "Kansas"}, - {ORTH: "Ky.", LEMMA: "Kentucky", NORM: "Kentucky"}, - {ORTH: "La.", LEMMA: "Louisiana", NORM: "Louisiana"}, - {ORTH: "Mar.", LEMMA: "March", NORM: "March"}, - {ORTH: "Mass.", LEMMA: "Massachusetts", NORM: "Massachusetts"}, - {ORTH: "May.", LEMMA: "May", NORM: "May"}, - {ORTH: "Mich.", LEMMA: "Michigan", NORM: "Michigan"}, - {ORTH: "Minn.", LEMMA: "Minnesota", NORM: "Minnesota"}, - {ORTH: "Miss.", LEMMA: "Mississippi", NORM: "Mississippi"}, - {ORTH: "N.C.", LEMMA: "North Carolina", NORM: "North Carolina"}, - {ORTH: "N.D.", LEMMA: "North Dakota", NORM: "North Dakota"}, - {ORTH: "N.H.", LEMMA: "New Hampshire", NORM: "New Hampshire"}, - {ORTH: "N.J.", LEMMA: "New Jersey", NORM: "New Jersey"}, - {ORTH: "N.M.", LEMMA: "New Mexico", NORM: "New Mexico"}, - {ORTH: "N.Y.", LEMMA: "New York", NORM: "New York"}, - {ORTH: "Neb.", LEMMA: "Nebraska", NORM: "Nebraska"}, - {ORTH: "Nebr.", LEMMA: "Nebraska", NORM: "Nebraska"}, - {ORTH: "Nev.", LEMMA: "Nevada", NORM: "Nevada"}, - {ORTH: "Nov.", LEMMA: "November", NORM: "November"}, - {ORTH: "Oct.", LEMMA: "October", NORM: "October"}, - {ORTH: "Okla.", LEMMA: "Oklahoma", NORM: "Oklahoma"}, - {ORTH: "Ore.", LEMMA: "Oregon", NORM: "Oregon"}, - {ORTH: "Pa.", LEMMA: "Pennsylvania", NORM: "Pennsylvania"}, - {ORTH: "S.C.", LEMMA: "South Carolina", NORM: "South Carolina"}, - {ORTH: "Sep.", LEMMA: "September", NORM: "September"}, - {ORTH: "Sept.", LEMMA: "September", NORM: "September"}, - {ORTH: "Tenn.", LEMMA: "Tennessee", NORM: "Tennessee"}, - {ORTH: "Va.", LEMMA: "Virginia", NORM: "Virginia"}, - {ORTH: "Wash.", LEMMA: "Washington", NORM: "Washington"}, - {ORTH: "Wis.", LEMMA: "Wisconsin", NORM: "Wisconsin"}, + {ORTH: "'S", NORM: "'s"}, + {ORTH: "'s", NORM: "'s"}, + {ORTH: "\u2018S", NORM: "'s"}, + {ORTH: "\u2018s", NORM: "'s"}, + {ORTH: "and/or", NORM: "and/or"}, + {ORTH: "w/o", NORM: "without"}, + {ORTH: "'re", NORM: "are"}, + {ORTH: "'Cause", NORM: "because"}, + {ORTH: "'cause", NORM: "because"}, + {ORTH: "'cos", NORM: "because"}, + {ORTH: "'Cos", NORM: "because"}, + {ORTH: "'coz", NORM: "because"}, + {ORTH: "'Coz", NORM: "because"}, + {ORTH: "'cuz", NORM: "because"}, + {ORTH: "'Cuz", NORM: "because"}, + {ORTH: "'bout", NORM: "about"}, + {ORTH: "ma'am", NORM: "madam"}, + {ORTH: "Ma'am", NORM: "madam"}, + {ORTH: "o'clock", NORM: "o'clock"}, + {ORTH: "O'clock", NORM: "o'clock"}, + {ORTH: "lovin'", NORM: "loving"}, + {ORTH: "Lovin'", NORM: "loving"}, + {ORTH: "lovin", NORM: "loving"}, + {ORTH: "Lovin", NORM: "loving"}, + {ORTH: "havin'", NORM: "having"}, + {ORTH: "Havin'", NORM: "having"}, + {ORTH: "havin", NORM: "having"}, + {ORTH: "Havin", NORM: "having"}, + {ORTH: "doin'", NORM: "doing"}, + {ORTH: "Doin'", NORM: "doing"}, + {ORTH: "doin", NORM: "doing"}, + {ORTH: "Doin", NORM: "doing"}, + {ORTH: "goin'", NORM: "going"}, + {ORTH: "Goin'", NORM: "going"}, + {ORTH: "goin", NORM: "going"}, + {ORTH: "Goin", NORM: "going"}, + {ORTH: 
"Mt.", NORM: "Mount"}, + {ORTH: "Ak.", NORM: "Alaska"}, + {ORTH: "Ala.", NORM: "Alabama"}, + {ORTH: "Apr.", NORM: "April"}, + {ORTH: "Ariz.", NORM: "Arizona"}, + {ORTH: "Ark.", NORM: "Arkansas"}, + {ORTH: "Aug.", NORM: "August"}, + {ORTH: "Calif.", NORM: "California"}, + {ORTH: "Colo.", NORM: "Colorado"}, + {ORTH: "Conn.", NORM: "Connecticut"}, + {ORTH: "Dec.", NORM: "December"}, + {ORTH: "Del.", NORM: "Delaware"}, + {ORTH: "Feb.", NORM: "February"}, + {ORTH: "Fla.", NORM: "Florida"}, + {ORTH: "Ga.", NORM: "Georgia"}, + {ORTH: "Ia.", NORM: "Iowa"}, + {ORTH: "Id.", NORM: "Idaho"}, + {ORTH: "Ill.", NORM: "Illinois"}, + {ORTH: "Ind.", NORM: "Indiana"}, + {ORTH: "Jan.", NORM: "January"}, + {ORTH: "Jul.", NORM: "July"}, + {ORTH: "Jun.", NORM: "June"}, + {ORTH: "Kan.", NORM: "Kansas"}, + {ORTH: "Kans.", NORM: "Kansas"}, + {ORTH: "Ky.", NORM: "Kentucky"}, + {ORTH: "La.", NORM: "Louisiana"}, + {ORTH: "Mar.", NORM: "March"}, + {ORTH: "Mass.", NORM: "Massachusetts"}, + {ORTH: "May.", NORM: "May"}, + {ORTH: "Mich.", NORM: "Michigan"}, + {ORTH: "Minn.", NORM: "Minnesota"}, + {ORTH: "Miss.", NORM: "Mississippi"}, + {ORTH: "N.C.", NORM: "North Carolina"}, + {ORTH: "N.D.", NORM: "North Dakota"}, + {ORTH: "N.H.", NORM: "New Hampshire"}, + {ORTH: "N.J.", NORM: "New Jersey"}, + {ORTH: "N.M.", NORM: "New Mexico"}, + {ORTH: "N.Y.", NORM: "New York"}, + {ORTH: "Neb.", NORM: "Nebraska"}, + {ORTH: "Nebr.", NORM: "Nebraska"}, + {ORTH: "Nev.", NORM: "Nevada"}, + {ORTH: "Nov.", NORM: "November"}, + {ORTH: "Oct.", NORM: "October"}, + {ORTH: "Okla.", NORM: "Oklahoma"}, + {ORTH: "Ore.", NORM: "Oregon"}, + {ORTH: "Pa.", NORM: "Pennsylvania"}, + {ORTH: "S.C.", NORM: "South Carolina"}, + {ORTH: "Sep.", NORM: "September"}, + {ORTH: "Sept.", NORM: "September"}, + {ORTH: "Tenn.", NORM: "Tennessee"}, + {ORTH: "Va.", NORM: "Virginia"}, + {ORTH: "Wash.", NORM: "Washington"}, + {ORTH: "Wis.", NORM: "Wisconsin"}, ]: _exc[exc_data[ORTH]] = [exc_data] @@ -552,4 +522,4 @@ for string in _exclude: _exc.pop(string) -TOKENIZER_EXCEPTIONS = _exc +TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc) diff --git a/spacy/lang/es/__init__.py b/spacy/lang/es/__init__.py index 249748a17..9a47855b1 100644 --- a/spacy/lang/es/__init__.py +++ b/spacy/lang/es/__init__.py @@ -1,33 +1,18 @@ -# coding: utf8 -from __future__ import unicode_literals - from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS -from .tag_map import TAG_MAP from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS from .syntax_iterators import SYNTAX_ITERATORS from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES - -from ..tokenizer_exceptions import BASE_EXCEPTIONS -from ..norm_exceptions import BASE_NORMS from ...language import Language -from ...attrs import LANG, NORM -from ...util import update_exc, add_lookups class SpanishDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters.update(LEX_ATTRS) - lex_attr_getters[LANG] = lambda text: "es" - lex_attr_getters[NORM] = add_lookups( - Language.Defaults.lex_attr_getters[NORM], BASE_NORMS - ) - tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) - tag_map = TAG_MAP + tokenizer_exceptions = TOKENIZER_EXCEPTIONS infixes = TOKENIZER_INFIXES suffixes = TOKENIZER_SUFFIXES - stop_words = STOP_WORDS + lex_attr_getters = LEX_ATTRS syntax_iterators = SYNTAX_ITERATORS + stop_words = STOP_WORDS class Spanish(Language): diff --git a/spacy/lang/es/examples.py b/spacy/lang/es/examples.py index 7ab0a7dfe..2bcbd8740 100644 --- 
a/spacy/lang/es/examples.py +++ b/spacy/lang/es/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/es/lex_attrs.py b/spacy/lang/es/lex_attrs.py index 632a638fc..988dbaba1 100644 --- a/spacy/lang/es/lex_attrs.py +++ b/spacy/lang/es/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/es/punctuation.py b/spacy/lang/es/punctuation.py index f989221c2..e9552371e 100644 --- a/spacy/lang/es/punctuation.py +++ b/spacy/lang/es/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES from ..char_classes import LIST_ICONS, CURRENCY, LIST_UNITS, PUNCT from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA diff --git a/spacy/lang/es/stop_words.py b/spacy/lang/es/stop_words.py index 20e929b48..004df4fca 100644 --- a/spacy/lang/es/stop_words.py +++ b/spacy/lang/es/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ actualmente acuerdo adelante ademas además adrede afirmó agregó ahi ahora ahí diff --git a/spacy/lang/es/syntax_iterators.py b/spacy/lang/es/syntax_iterators.py index d4572b682..4dd4f99be 100644 --- a/spacy/lang/es/syntax_iterators.py +++ b/spacy/lang/es/syntax_iterators.py @@ -1,16 +1,15 @@ -# coding: utf8 -from __future__ import unicode_literals +from typing import Union, Iterator, Optional, List, Tuple from ...symbols import NOUN, PROPN, PRON, VERB, AUX from ...errors import Errors +from ...tokens import Doc, Span, Token -def noun_chunks(doclike): +def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Span]: + """Detect base noun phrases from a dependency parse. 
Works on Doc and Span.""" doc = doclike.doc - - if not doc.is_parsed: + if not doc.has_annotation("DEP"): raise ValueError(Errors.E029) - if not len(doc): return np_label = doc.vocab.strings.add("NP") @@ -30,18 +29,24 @@ def noun_chunks(doclike): token = next_token(token) -def is_verb_token(token): +def is_verb_token(token: Token) -> bool: return token.pos in [VERB, AUX] -def next_token(token): +def next_token(token: Token) -> Optional[Token]: try: return token.nbor() except IndexError: return None -def noun_bounds(doc, root, np_left_deps, np_right_deps, stop_deps): +def noun_bounds( + doc: Doc, + root: Token, + np_left_deps: List[str], + np_right_deps: List[str], + stop_deps: List[str], +) -> Tuple[Token, Token]: left_bound = root for token in reversed(list(root.lefts)): if token.dep in np_left_deps: @@ -52,12 +57,8 @@ def noun_bounds(doc, root, np_left_deps, np_right_deps, stop_deps): left, right = noun_bounds( doc, token, np_left_deps, np_right_deps, stop_deps ) - if list( - filter( - lambda t: is_verb_token(t) or t.dep in stop_deps, - doc[left_bound.i : right.i], - ) - ): + filter_func = lambda t: is_verb_token(t) or t.dep in stop_deps + if list(filter(filter_func, doc[left_bound.i : right.i])): break else: right_bound = right diff --git a/spacy/lang/es/tag_map.py b/spacy/lang/es/tag_map.py deleted file mode 100644 index 7a7c9d549..000000000 --- a/spacy/lang/es/tag_map.py +++ /dev/null @@ -1,313 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import POS, PUNCT, SYM, ADJ, NUM, DET, ADV, ADP, X, VERB -from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, SCONJ, AUX, CONJ - -# fmt: off -TAG_MAP = { - "ADJ___": {"morph": "_", POS: ADJ}, - "ADJ__AdpType=Prep": {"morph": "AdpType=Prep", POS: ADJ}, - "ADJ__AdpType=Preppron|Gender=Masc|Number=Sing": {"morph": "AdpType=Preppron|Gender=Masc|Number=Sing", POS: ADV}, - "ADJ__AdvType=Tim": {"morph": "AdvType=Tim", POS: ADJ}, - "ADJ__Gender=Fem|Number=Plur": {"morph": "Gender=Fem|Number=Plur", POS: ADJ}, - "ADJ__Gender=Fem|Number=Plur|NumType=Ord": {"morph": "Gender=Fem|Number=Plur|NumType=Ord", POS: ADJ}, - "ADJ__Gender=Fem|Number=Plur|VerbForm=Part": {"morph": "Gender=Fem|Number=Plur|VerbForm=Part", POS: ADJ}, - "ADJ__Gender=Fem|Number=Sing": {"morph": "Gender=Fem|Number=Sing", POS: ADJ}, - "ADJ__Gender=Fem|Number=Sing|NumType=Ord": {"morph": "Gender=Fem|Number=Sing|NumType=Ord", POS: ADJ}, - "ADJ__Gender=Fem|Number=Sing|VerbForm=Part": {"morph": "Gender=Fem|Number=Sing|VerbForm=Part", POS: ADJ}, - "ADJ__Gender=Masc": {"morph": "Gender=Masc", POS: ADJ}, - "ADJ__Gender=Masc|Number=Plur": {"morph": "Gender=Masc|Number=Plur", POS: ADJ}, - "ADJ__Gender=Masc|Number=Plur|NumType=Ord": {"morph": "Gender=Masc|Number=Plur|NumType=Ord", POS: ADJ}, - "ADJ__Gender=Masc|Number=Plur|VerbForm=Part": {"morph": "Gender=Masc|Number=Plur|VerbForm=Part", POS: ADJ}, - "ADJ__Gender=Masc|Number=Sing": {"morph": "Gender=Masc|Number=Sing", POS: ADJ}, - "ADJ__Gender=Masc|Number=Sing|NumType=Ord": {"morph": "Gender=Masc|Number=Sing|NumType=Ord", POS: ADJ}, - "ADJ__Gender=Masc|Number=Sing|VerbForm=Part": {"morph": "Gender=Masc|Number=Sing|VerbForm=Part", POS: ADJ}, - "ADJ__Number=Plur": {"morph": "Number=Plur", POS: ADJ}, - "ADJ__Number=Sing": {"morph": "Number=Sing", POS: ADJ}, - "ADP__AdpType=Prep": {"morph": "AdpType=Prep", POS: ADP}, - "ADP__AdpType=Preppron|Gender=Fem|Number=Sing": {"morph": "AdpType=Preppron|Gender=Fem|Number=Sing", POS: ADP}, - "ADP__AdpType=Preppron|Gender=Masc|Number=Plur": {"morph": 
"AdpType=Preppron|Gender=Masc|Number=Plur", POS: ADP}, - "ADP__AdpType=Preppron|Gender=Masc|Number=Sing": {"morph": "AdpType=Preppron|Gender=Masc|Number=Sing", POS: ADP}, - "ADP": {POS: ADP}, - "ADV___": {"morph": "_", POS: ADV}, - "ADV__AdpType=Prep": {"morph": "AdpType=Prep", POS: ADV}, - "ADV__AdpType=Preppron|Gender=Masc|Number=Sing": {"morph": "AdpType=Preppron|Gender=Masc|Number=Sing", POS: ADV}, - "ADV__AdvType=Tim": {"morph": "AdvType=Tim", POS: ADV}, - "ADV__Gender=Masc|Number=Sing": {"morph": "Gender=Masc|Number=Sing", POS: ADV}, - "ADV__Mood=Ind|Number=Sing|Person=3|Tense=Imp|VerbForm=Fin": {"morph": "Mood=Ind|Number=Sing|Person=3|Tense=Imp|VerbForm=Fin", POS: ADV}, - "ADV__Negative=Neg": {"morph": "Negative=Neg", POS: ADV}, - "ADV__Number=Plur": {"morph": "Number=Plur", POS: ADV}, - "ADV__Polarity=Neg": {"morph": "Polarity=Neg", POS: ADV}, - "AUX__Gender=Fem|Number=Plur|Tense=Past|VerbForm=Part": {"morph": "Gender=Fem|Number=Plur|Tense=Past|VerbForm=Part", POS: AUX}, - "AUX__Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part": {"morph": "Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part", POS: AUX}, - "AUX__Gender=Masc|Number=Plur|Tense=Past|VerbForm=Part": {"morph": "Gender=Masc|Number=Plur|Tense=Past|VerbForm=Part", POS: AUX}, - "AUX__Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part": {"morph": "Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part", POS: AUX}, - "AUX__Mood=Cnd|Number=Plur|Person=1|VerbForm=Fin": {"morph": "Mood=Cnd|Number=Plur|Person=1|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Cnd|Number=Plur|Person=3|VerbForm=Fin": {"morph": "Mood=Cnd|Number=Plur|Person=3|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Cnd|Number=Sing|Person=1|VerbForm=Fin": {"morph": "Mood=Cnd|Number=Sing|Person=1|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Cnd|Number=Sing|Person=2|VerbForm=Fin": {"morph": "Mood=Cnd|Number=Sing|Person=2|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Cnd|Number=Sing|Person=3|VerbForm=Fin": {"morph": "Mood=Cnd|Number=Sing|Person=3|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Imp|Number=Plur|Person=3|VerbForm=Fin": {"morph": "Mood=Imp|Number=Plur|Person=3|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Imp|Number=Sing|Person=2|VerbForm=Fin": {"morph": "Mood=Imp|Number=Sing|Person=2|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Imp|Number=Sing|Person=3|VerbForm=Fin": {"morph": "Mood=Imp|Number=Sing|Person=3|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Ind|Number=Plur|Person=1|Tense=Fut|VerbForm=Fin": {"morph": "Mood=Ind|Number=Plur|Person=1|Tense=Fut|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Ind|Number=Plur|Person=1|Tense=Imp|VerbForm=Fin": {"morph": "Mood=Ind|Number=Plur|Person=1|Tense=Imp|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Ind|Number=Plur|Person=1|Tense=Past|VerbForm=Fin": {"morph": "Mood=Ind|Number=Plur|Person=1|Tense=Past|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin": {"morph": "Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Ind|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin": {"morph": "Mood=Ind|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Ind|Number=Plur|Person=3|Tense=Fut|VerbForm=Fin": {"morph": "Mood=Ind|Number=Plur|Person=3|Tense=Fut|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Ind|Number=Plur|Person=3|Tense=Imp|VerbForm=Fin": {"morph": "Mood=Ind|Number=Plur|Person=3|Tense=Imp|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin": {"morph": "Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin": {"morph": 
"Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Ind|Number=Sing|Person=1|Tense=Fut|VerbForm=Fin": {"morph": "Mood=Ind|Number=Sing|Person=1|Tense=Fut|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Ind|Number=Sing|Person=1|Tense=Imp|VerbForm=Fin": {"morph": "Mood=Ind|Number=Sing|Person=1|Tense=Imp|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Ind|Number=Sing|Person=1|Tense=Past|VerbForm=Fin": {"morph": "Mood=Ind|Number=Sing|Person=1|Tense=Past|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin": {"morph": "Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Ind|Number=Sing|Person=2|Tense=Fut|VerbForm=Fin": {"morph": "Mood=Ind|Number=Sing|Person=2|Tense=Fut|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Ind|Number=Sing|Person=2|Tense=Imp|VerbForm=Fin": {"morph": "Mood=Ind|Number=Sing|Person=2|Tense=Imp|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin": {"morph": "Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Ind|Number=Sing|Person=3|Tense=Fut|VerbForm=Fin": {"morph": "Mood=Ind|Number=Sing|Person=3|Tense=Fut|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Ind|Number=Sing|Person=3|Tense=Imp|VerbForm=Fin": {"morph": "Mood=Ind|Number=Sing|Person=3|Tense=Imp|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin": {"morph": "Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin": {"morph": "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Sub|Number=Plur|Person=1|Tense=Imp|VerbForm=Fin": {"morph": "Mood=Sub|Number=Plur|Person=1|Tense=Imp|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Sub|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin": {"morph": "Mood=Sub|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Sub|Number=Plur|Person=3|Tense=Imp|VerbForm=Fin": {"morph": "Mood=Sub|Number=Plur|Person=3|Tense=Imp|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Sub|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin": {"morph": "Mood=Sub|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Sub|Number=Sing|Person=1|Tense=Imp|VerbForm=Fin": {"morph": "Mood=Sub|Number=Sing|Person=1|Tense=Imp|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Sub|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin": {"morph": "Mood=Sub|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Sub|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin": {"morph": "Mood=Sub|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Sub|Number=Sing|Person=3|Tense=Imp|VerbForm=Fin": {"morph": "Mood=Sub|Number=Sing|Person=3|Tense=Imp|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Sub|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin": {"morph": "Mood=Sub|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin", POS: AUX}, - "AUX__VerbForm=Ger": {"morph": "VerbForm=Ger", POS: AUX}, - "AUX__VerbForm=Inf": {"morph": "VerbForm=Inf", POS: AUX}, - "CCONJ___": {"morph": "_", POS: CONJ}, - "CONJ___": {"morph": "_", POS: CONJ}, - "DET__Definite=Def|Gender=Fem|Number=Plur|PronType=Art": {"morph": "Definite=Def|Gender=Fem|Number=Plur|PronType=Art", POS: DET}, - "DET__Definite=Def|Gender=Fem|Number=Sing|PronType=Art": {"morph": "Definite=Def|Gender=Fem|Number=Sing|PronType=Art", POS: DET}, - "DET__Definite=Def|Gender=Masc|Number=Plur|PronType=Art": {"morph": "Definite=Def|Gender=Masc|Number=Plur|PronType=Art", POS: DET}, - "DET__Definite=Def|Gender=Masc|Number=Sing|PronType=Art": 
{"morph": "Definite=Def|Gender=Masc|Number=Sing|PronType=Art", POS: DET}, - "DET__Definite=Def|Gender=Masc|PronType=Art": {"morph": "Definite=Def|Gender=Masc|PronType=Art", POS: DET}, - "DET__Definite=Def|Number=Sing|PronType=Art": {"morph": "Definite=Def|Number=Sing|PronType=Art", POS: DET}, - "DET__Definite=Ind|Gender=Fem|Number=Plur|PronType=Art": {"morph": "Definite=Ind|Gender=Fem|Number=Plur|PronType=Art", POS: DET}, - "DET__Definite=Ind|Gender=Fem|Number=Sing|NumType=Card|PronType=Art": {"morph": "Definite=Ind|Gender=Fem|Number=Sing|NumType=Card|PronType=Art", POS: DET}, - "DET__Definite=Ind|Gender=Fem|Number=Sing|PronType=Art": {"morph": "Definite=Ind|Gender=Fem|Number=Sing|PronType=Art", POS: DET}, - "DET__Definite=Ind|Gender=Masc|Number=Plur|PronType=Art": {"morph": "Definite=Ind|Gender=Masc|Number=Plur|PronType=Art", POS: DET}, - "DET__Definite=Ind|Gender=Masc|Number=Sing|NumType=Card|PronType=Art": {"morph": "Definite=Ind|Gender=Masc|Number=Sing|NumType=Card|PronType=Art", POS: DET}, - "DET__Definite=Ind|Gender=Masc|Number=Sing|PronType=Art": {"morph": "Definite=Ind|Gender=Masc|Number=Sing|PronType=Art", POS: DET}, - "DET__Gender=Fem|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {"morph": "Gender=Fem|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs", POS: DET}, - "DET__Gender=Fem|Number=Plur|Number[psor]=Plur|Person=2|Poss=Yes|PronType=Prs": {"morph": "Gender=Fem|Number=Plur|Number[psor]=Plur|Person=2|Poss=Yes|PronType=Prs", POS: DET}, - "DET__Gender=Fem|Number=Plur|Person=3|Poss=Yes|PronType=Prs": {"morph": "Gender=Fem|Number=Plur|Person=3|Poss=Yes|PronType=Prs", POS: DET}, - "DET__Gender=Fem|Number=Plur|PronType=Art": {"morph": "Gender=Fem|Number=Plur|PronType=Art", POS: DET}, - "DET__Gender=Fem|Number=Plur|PronType=Dem": {"morph": "Gender=Fem|Number=Plur|PronType=Dem", POS: DET}, - "DET__Gender=Fem|Number=Plur|PronType=Ind": {"morph": "Gender=Fem|Number=Plur|PronType=Ind", POS: DET}, - "DET__Gender=Fem|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {"morph": "Gender=Fem|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs", POS: DET}, - "DET__Gender=Fem|Number=Sing|Number[psor]=Plur|Person=2|Poss=Yes|PronType=Prs": {"morph": "Gender=Fem|Number=Sing|Number[psor]=Plur|Person=2|Poss=Yes|PronType=Prs", POS: DET}, - "DET__Gender=Fem|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {"morph": "Gender=Fem|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs", POS: DET}, - "DET__Gender=Fem|Number=Sing|Person=3|Poss=Yes|PronType=Prs": {"morph": "Gender=Fem|Number=Sing|Person=3|Poss=Yes|PronType=Prs", POS: DET}, - "DET__Gender=Fem|Number=Sing|PronType=Art": {"morph": "Gender=Fem|Number=Sing|PronType=Art", POS: DET}, - "DET__Gender=Fem|Number=Sing|PronType=Dem": {"morph": "Gender=Fem|Number=Sing|PronType=Dem", POS: DET}, - "DET__Gender=Fem|Number=Sing|PronType=Ind": {"morph": "Gender=Fem|Number=Sing|PronType=Ind", POS: DET}, - "DET__Gender=Fem|Number=Sing|PronType=Int": {"morph": "Gender=Fem|Number=Sing|PronType=Int", POS: DET}, - "DET__Gender=Masc|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {"morph": "Gender=Masc|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs", POS: DET}, - "DET__Gender=Masc|Number=Plur|Person=3|Poss=Yes|PronType=Prs": {"morph": "Gender=Masc|Number=Plur|Person=3|Poss=Yes|PronType=Prs", POS: DET}, - "DET__Gender=Masc|Number=Plur|PronType=Art": {"morph": "Gender=Masc|Number=Plur|PronType=Art", POS: DET}, - "DET__Gender=Masc|Number=Plur|PronType=Dem": {"morph": 
"Gender=Masc|Number=Plur|PronType=Dem", POS: DET}, - "DET__Gender=Masc|Number=Plur|PronType=Ind": {"morph": "Gender=Masc|Number=Plur|PronType=Ind", POS: DET}, - "DET__Gender=Masc|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {"morph": "Gender=Masc|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs", POS: DET}, - "DET__Gender=Masc|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {"morph": "Gender=Masc|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs", POS: DET}, - "DET__Gender=Masc|Number=Sing|Person=3|Poss=Yes|PronType=Prs": {"morph": "Gender=Masc|Number=Sing|Person=3|Poss=Yes|PronType=Prs", POS: DET}, - "DET__Gender=Masc|Number=Sing|PronType=Art": {"morph": "Gender=Masc|Number=Sing|PronType=Art", POS: DET}, - "DET__Gender=Masc|Number=Sing|PronType=Dem": {"morph": "Gender=Masc|Number=Sing|PronType=Dem", POS: DET}, - "DET__Gender=Masc|Number=Sing|PronType=Ind": {"morph": "Gender=Masc|Number=Sing|PronType=Ind", POS: DET}, - "DET__Gender=Masc|Number=Sing|PronType=Int": {"morph": "Gender=Masc|Number=Sing|PronType=Int", POS: DET}, - "DET__Gender=Masc|Number=Sing|PronType=Tot": {"morph": "Gender=Masc|Number=Sing|PronType=Tot", POS: DET}, - "DET__Number=Plur|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {"morph": "Number=Plur|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs", POS: DET}, - "DET__Number=Plur|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs": {"morph": "Number=Plur|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs", POS: DET}, - "DET__Number=Plur|Person=3|Poss=Yes|PronType=Prs": {"morph": "Number=Plur|Person=3|Poss=Yes|PronType=Prs", POS: DET}, - "DET__Number=Plur|PronType=Dem": {"morph": "Number=Plur|PronType=Dem", POS: DET}, - "DET__Number=Plur|PronType=Ind": {"morph": "Number=Plur|PronType=Ind", POS: DET}, - "DET__Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {"morph": "Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs", POS: DET}, - "DET__Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs": {"morph": "Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs", POS: DET}, - "DET__Number=Sing|Person=3|Poss=Yes|PronType=Prs": {"morph": "Number=Sing|Person=3|Poss=Yes|PronType=Prs", POS: DET}, - "DET__Number=Sing|PronType=Dem": {"morph": "Number=Sing|PronType=Dem", POS: DET}, - "DET__Number=Sing|PronType=Ind": {"morph": "Number=Sing|PronType=Ind", POS: DET}, - "DET__PronType=Int": {"morph": "PronType=Int", POS: DET}, - "DET__PronType=Rel": {"morph": "PronType=Rel", POS: DET}, - "DET": {POS: DET}, - "INTJ___": {"morph": "_", POS: INTJ}, - "NOUN___": {"morph": "_", POS: NOUN}, - "NOUN__AdvType=Tim": {"morph": "AdvType=Tim", POS: NOUN}, - "NOUN__AdvType=Tim|Gender=Masc|Number=Sing": {"morph": "AdvType=Tim|Gender=Masc|Number=Sing", POS: NOUN}, - "NOUN__Gender=Fem": {"morph": "Gender=Fem", POS: NOUN}, - "NOUN__Gender=Fem|Number=Plur": {"morph": "Gender=Fem|Number=Plur", POS: NOUN}, - "NOUN__Gender=Fem|Number=Sing": {"morph": "Gender=Fem|Number=Sing", POS: NOUN}, - "NOUN__Gender=Masc": {"morph": "Gender=Masc", POS: NOUN}, - "NOUN__Gender=Masc|Number=Plur": {"morph": "Gender=Masc|Number=Plur", POS: NOUN}, - "NOUN__Gender=Masc|Number=Sing": {"morph": "Gender=Masc|Number=Sing", POS: NOUN}, - "NOUN__Gender=Masc|Number=Sing|VerbForm=Part": {"morph": "Gender=Masc|Number=Sing|VerbForm=Part", POS: NOUN}, - "NOUN__Number=Plur": {"morph": "Number=Plur", POS: NOUN}, - "NOUN__Number=Sing": {"morph": "Number=Sing", POS: NOUN}, - "NOUN__NumForm=Digit": {"morph": "NumForm=Digit", POS: NOUN}, - 
"NUM__Gender=Fem|Number=Plur|NumType=Card": {"morph": "Gender=Fem|Number=Plur|NumType=Card", POS: NUM}, - "NUM__Gender=Fem|Number=Sing|NumType=Card": {"morph": "Gender=Fem|Number=Sing|NumType=Card", POS: NUM}, - "NUM__Gender=Masc|Number=Plur|NumType=Card": {"morph": "Gender=Masc|Number=Plur|NumType=Card", POS: NUM}, - "NUM__Gender=Masc|Number=Sing|NumType=Card": {"morph": "Gender=Masc|Number=Sing|NumType=Card", POS: NUM}, - "NUM__Number=Plur|NumType=Card": {"morph": "Number=Plur|NumType=Card", POS: NUM}, - "NUM__Number=Sing|NumType=Card": {"morph": "Number=Sing|NumType=Card", POS: NUM}, - "NUM__NumForm=Digit": {"morph": "NumForm=Digit", POS: NUM}, - "NUM__NumForm=Digit|NumType=Card": {"morph": "NumForm=Digit|NumType=Card", POS: NUM}, - "NUM__NumForm=Digit|NumType=Frac": {"morph": "NumForm=Digit|NumType=Frac", POS: NUM}, - "NUM__NumType=Card": {"morph": "NumType=Card", POS: NUM}, - "PART___": {"morph": "_", POS: PART}, - "PART__Negative=Neg": {"morph": "Negative=Neg", POS: PART}, - "PRON___": {"morph": "_", POS: PRON}, - "PRON__Case=Acc|Gender=Fem|Number=Plur|Person=3|PronType=Prs": {"morph": "Case=Acc|Gender=Fem|Number=Plur|Person=3|PronType=Prs", POS: PRON}, - "PRON__Case=Acc|Gender=Fem|Number=Sing|Person=3|PronType=Prs": {"morph": "Case=Acc|Gender=Fem|Number=Sing|Person=3|PronType=Prs", POS: PRON}, - "PRON__Case=Acc|Gender=Masc|Number=Plur|Person=3|PronType=Prs": {"morph": "Case=Acc|Gender=Masc|Number=Plur|Person=3|PronType=Prs", POS: PRON}, - "PRON__Case=Acc|Gender=Masc|Number=Sing|Person=3|PronType=Prs": {"morph": "Case=Acc|Gender=Masc|Number=Sing|Person=3|PronType=Prs", POS: PRON}, - "PRON__Case=Acc|Number=Plur|Person=3|PronType=Prs": {"morph": "Case=Acc|Number=Plur|Person=3|PronType=Prs", POS: PRON}, - "PRON__Case=Acc|Number=Sing|Person=3|PronType=Prs": {"morph": "Case=Acc|Number=Sing|Person=3|PronType=Prs", POS: PRON}, - "PRON__Case=Acc|Person=3|PronType=Prs": {"morph": "Case=Acc|Person=3|PronType=Prs", POS: PRON}, - "PRON__Case=Dat|Number=Plur|Person=3|PronType=Prs": {"morph": "Case=Dat|Number=Plur|Person=3|PronType=Prs", POS: PRON}, - "PRON__Case=Dat|Number=Sing|Person=3|PronType=Prs": {"morph": "Case=Dat|Number=Sing|Person=3|PronType=Prs", POS: PRON}, - "PRON__Case=Nom|Number=Sing|Person=1|PronType=Prs": {"morph": "Case=Nom|Number=Sing|Person=1|PronType=Prs", POS: PRON}, - "PRON__Case=Nom|Number=Sing|Person=2|PronType=Prs": {"morph": "Case=Nom|Number=Sing|Person=2|PronType=Prs", POS: PRON}, - "PRON__Gender=Fem|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {"morph": "Gender=Fem|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs", POS: PRON}, - "PRON__Gender=Fem|Number=Plur|Person=3|Poss=Yes|PronType=Prs": {"morph": "Gender=Fem|Number=Plur|Person=3|Poss=Yes|PronType=Prs", POS: PRON}, - "PRON__Gender=Fem|Number=Plur|Person=3|PronType=Prs": {"morph": "Gender=Fem|Number=Plur|Person=3|PronType=Prs", POS: PRON}, - "PRON__Gender=Fem|Number=Plur|PronType=Dem": {"morph": "Gender=Fem|Number=Plur|PronType=Dem", POS: PRON}, - "PRON__Gender=Fem|Number=Plur|PronType=Ind": {"morph": "Gender=Fem|Number=Plur|PronType=Ind", POS: PRON}, - "PRON__Gender=Fem|Number=Plur|PronType=Int": {"morph": "Gender=Fem|Number=Plur|PronType=Int", POS: PRON}, - "PRON__Gender=Fem|Number=Plur|PronType=Rel": {"morph": "Gender=Fem|Number=Plur|PronType=Rel", POS: PRON}, - "PRON__Gender=Fem|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {"morph": "Gender=Fem|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs", POS: PRON}, - 
"PRON__Gender=Fem|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {"morph": "Gender=Fem|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs", POS: PRON}, - "PRON__Gender=Fem|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs": {"morph": "Gender=Fem|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs", POS: PRON}, - "PRON__Gender=Fem|Number=Sing|Person=1|PronType=Prs": {"morph": "Gender=Fem|Number=Sing|Person=1|PronType=Prs", POS: PRON}, - "PRON__Gender=Fem|Number=Sing|Person=3|Poss=Yes|PronType=Prs": {"morph": "Gender=Fem|Number=Sing|Person=3|Poss=Yes|PronType=Prs", POS: PRON}, - "PRON__Gender=Fem|Number=Sing|Person=3|PronType=Prs": {"morph": "Gender=Fem|Number=Sing|Person=3|PronType=Prs", POS: PRON}, - "PRON__Gender=Fem|Number=Sing|PronType=Dem": {"morph": "Gender=Fem|Number=Sing|PronType=Dem", POS: PRON}, - "PRON__Gender=Fem|Number=Sing|PronType=Ind": {"morph": "Gender=Fem|Number=Sing|PronType=Ind", POS: PRON}, - "PRON__Gender=Fem|Number=Sing|PronType=Rel": {"morph": "Gender=Fem|Number=Sing|PronType=Rel", POS: PRON}, - "PRON__Gender=Masc|Number=Plur|Person=1|PronType=Prs": {"morph": "Gender=Masc|Number=Plur|Person=1|PronType=Prs", POS: PRON}, - "PRON__Gender=Masc|Number=Plur|Person=2|Poss=Yes|PronType=Prs": {"morph": "Gender=Masc|Number=Plur|Person=2|Poss=Yes|PronType=Prs", POS: PRON}, - "PRON__Gender=Masc|Number=Plur|Person=3|PronType=Prs": {"morph": "Gender=Masc|Number=Plur|Person=3|PronType=Prs", POS: PRON}, - "PRON__Gender=Masc|Number=Plur|PronType=Dem": {"morph": "Gender=Masc|Number=Plur|PronType=Dem", POS: PRON}, - "PRON__Gender=Masc|Number=Plur|PronType=Ind": {"morph": "Gender=Masc|Number=Plur|PronType=Ind", POS: PRON}, - "PRON__Gender=Masc|Number=Plur|PronType=Int": {"morph": "Gender=Masc|Number=Plur|PronType=Int", POS: PRON}, - "PRON__Gender=Masc|Number=Plur|PronType=Rel": {"morph": "Gender=Masc|Number=Plur|PronType=Rel", POS: PRON}, - "PRON__Gender=Masc|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {"morph": "Gender=Masc|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs", POS: PRON}, - "PRON__Gender=Masc|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {"morph": "Gender=Masc|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs", POS: PRON}, - "PRON__Gender=Masc|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs": {"morph": "Gender=Masc|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs", POS: PRON}, - "PRON__Gender=Masc|Number=Sing|Person=3|Poss=Yes|PronType=Prs": {"morph": "Gender=Masc|Number=Sing|Person=3|Poss=Yes|PronType=Prs", POS: PRON}, - "PRON__Gender=Masc|Number=Sing|Person=3|PronType=Prs": {"morph": "Gender=Masc|Number=Sing|Person=3|PronType=Prs", POS: PRON}, - "PRON__Gender=Masc|Number=Sing|PronType=Dem": {"morph": "Gender=Masc|Number=Sing|PronType=Dem", POS: PRON}, - "PRON__Gender=Masc|Number=Sing|PronType=Ind": {"morph": "Gender=Masc|Number=Sing|PronType=Ind", POS: PRON}, - "PRON__Gender=Masc|Number=Sing|PronType=Int": {"morph": "Gender=Masc|Number=Sing|PronType=Int", POS: PRON}, - "PRON__Gender=Masc|Number=Sing|PronType=Rel": {"morph": "Gender=Masc|Number=Sing|PronType=Rel", POS: PRON}, - "PRON__Gender=Masc|Number=Sing|PronType=Tot": {"morph": "Gender=Masc|Number=Sing|PronType=Tot", POS: PRON}, - "PRON__Number=Plur|Person=1": {"morph": "Number=Plur|Person=1", POS: PRON}, - "PRON__Number=Plur|Person=1|PronType=Prs": {"morph": "Number=Plur|Person=1|PronType=Prs", POS: PRON}, - "PRON__Number=Plur|Person=2|Polite=Form|PronType=Prs": {"morph": 
"Number=Plur|Person=2|Polite=Form|PronType=Prs", POS: PRON}, - "PRON__Number=Plur|Person=2|PronType=Prs": {"morph": "Number=Plur|Person=2|PronType=Prs", POS: PRON}, - "PRON__Number=Plur|Person=3|Poss=Yes|PronType=Prs": {"morph": "Number=Plur|Person=3|Poss=Yes|PronType=Prs", POS: PRON}, - "PRON__Number=Plur|Person=3|PronType=Prs": {"morph": "Number=Plur|Person=3|PronType=Prs", POS: PRON}, - "PRON__Number=Plur|PronType=Dem": {"morph": "Number=Plur|PronType=Dem", POS: PRON}, - "PRON__Number=Plur|PronType=Ind": {"morph": "Number=Plur|PronType=Ind", POS: PRON}, - "PRON__Number=Plur|PronType=Int": {"morph": "Number=Plur|PronType=Int", POS: PRON}, - "PRON__Number=Plur|PronType=Rel": {"morph": "Number=Plur|PronType=Rel", POS: PRON}, - "PRON__Number=Sing|Person=1": {"morph": "Number=Sing|Person=1", POS: PRON}, - "PRON__Number=Sing|Person=1|PrepCase=Pre|PronType=Prs": {"morph": "Number=Sing|Person=1|PrepCase=Pre|PronType=Prs", POS: PRON}, - "PRON__Number=Sing|Person=1|PronType=Prs": {"morph": "Number=Sing|Person=1|PronType=Prs", POS: PRON}, - "PRON__Number=Sing|Person=2": {"morph": "Number=Sing|Person=2", POS: PRON}, - "PRON__Number=Sing|Person=2|Polite=Form|PronType=Prs": {"morph": "Number=Sing|Person=2|Polite=Form|PronType=Prs", POS: PRON}, - "PRON__Number=Sing|Person=2|PrepCase=Pre|PronType=Prs": {"morph": "Number=Sing|Person=2|PrepCase=Pre|PronType=Prs", POS: PRON}, - "PRON__Number=Sing|Person=2|PronType=Prs": {"morph": "Number=Sing|Person=2|PronType=Prs", POS: PRON}, - "PRON__Number=Sing|Person=3|Poss=Yes|PronType=Prs": {"morph": "Number=Sing|Person=3|Poss=Yes|PronType=Prs", POS: PRON}, - "PRON__Number=Sing|Person=3|PronType=Prs": {"morph": "Number=Sing|Person=3|PronType=Prs", POS: PRON}, - "PRON__Number=Sing|PronType=Dem": {"morph": "Number=Sing|PronType=Dem", POS: PRON}, - "PRON__Number=Sing|PronType=Ind": {"morph": "Number=Sing|PronType=Ind", POS: PRON}, - "PRON__Number=Sing|PronType=Int": {"morph": "Number=Sing|PronType=Int", POS: PRON}, - "PRON__Number=Sing|PronType=Rel": {"morph": "Number=Sing|PronType=Rel", POS: PRON}, - "PRON__Person=1|PronType=Prs": {"morph": "Person=1|PronType=Prs", POS: PRON}, - "PRON__Person=3": {"morph": "Person=3", POS: PRON}, - "PRON__Person=3|PrepCase=Pre|PronType=Prs": {"morph": "Person=3|PrepCase=Pre|PronType=Prs", POS: PRON}, - "PRON__Person=3|PronType=Prs": {"morph": "Person=3|PronType=Prs", POS: PRON}, - "PRON__PronType=Ind": {"morph": "PronType=Ind", POS: PRON}, - "PRON__PronType=Int": {"morph": "PronType=Int", POS: PRON}, - "PRON__PronType=Rel": {"morph": "PronType=Rel", POS: PRON}, - "PROPN___": {"morph": "_", POS: PROPN}, - "PUNCT___": {"morph": "_", POS: PUNCT}, - "PUNCT__PunctSide=Fin|PunctType=Brck": {"morph": "PunctSide=Fin|PunctType=Brck", POS: PUNCT}, - "PUNCT__PunctSide=Fin|PunctType=Excl": {"morph": "PunctSide=Fin|PunctType=Excl", POS: PUNCT}, - "PUNCT__PunctSide=Fin|PunctType=Qest": {"morph": "PunctSide=Fin|PunctType=Qest", POS: PUNCT}, - "PUNCT__PunctSide=Ini|PunctType=Brck": {"morph": "PunctSide=Ini|PunctType=Brck", POS: PUNCT}, - "PUNCT__PunctSide=Ini|PunctType=Excl": {"morph": "PunctSide=Ini|PunctType=Excl", POS: PUNCT}, - "PUNCT__PunctSide=Ini|PunctType=Qest": {"morph": "PunctSide=Ini|PunctType=Qest", POS: PUNCT}, - "PUNCT__PunctType=Colo": {"morph": "PunctType=Colo", POS: PUNCT}, - "PUNCT__PunctType=Comm": {"morph": "PunctType=Comm", POS: PUNCT}, - "PUNCT__PunctType=Dash": {"morph": "PunctType=Dash", POS: PUNCT}, - "PUNCT__PunctType=Peri": {"morph": "PunctType=Peri", POS: PUNCT}, - "PUNCT__PunctType=Quot": {"morph": "PunctType=Quot", POS: 
PUNCT}, - "PUNCT__PunctType=Semi": {"morph": "PunctType=Semi", POS: PUNCT}, - "SCONJ___": {"morph": "_", POS: SCONJ}, - "SYM___": {"morph": "_", POS: SYM}, - "SYM__NumForm=Digit": {"morph": "NumForm=Digit", POS: SYM}, - "SYM__NumForm=Digit|NumType=Frac": {"morph": "NumForm=Digit|NumType=Frac", POS: SYM}, - "VERB__Gender=Fem|Number=Plur|Tense=Past|VerbForm=Part": {"morph": "Gender=Fem|Number=Plur|Tense=Past|VerbForm=Part", POS: VERB}, - "VERB__Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part": {"morph": "Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part", POS: VERB}, - "VERB__Gender=Masc|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin": {"morph": "Gender=Masc|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin", POS: VERB}, - "VERB__Gender=Masc|Number=Plur|Tense=Past|VerbForm=Part": {"morph": "Gender=Masc|Number=Plur|Tense=Past|VerbForm=Part", POS: VERB}, - "VERB__Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part": {"morph": "Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part", POS: VERB}, - "VERB__Mood=Cnd|Number=Plur|Person=1|VerbForm=Fin": {"morph": "Mood=Cnd|Number=Plur|Person=1|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Cnd|Number=Plur|Person=3|VerbForm=Fin": {"morph": "Mood=Cnd|Number=Plur|Person=3|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Cnd|Number=Sing|Person=1|VerbForm=Fin": {"morph": "Mood=Cnd|Number=Sing|Person=1|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Cnd|Number=Sing|Person=2|VerbForm=Fin": {"morph": "Mood=Cnd|Number=Sing|Person=2|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Cnd|Number=Sing|Person=3|VerbForm=Fin": {"morph": "Mood=Cnd|Number=Sing|Person=3|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Imp|Number=Plur|Person=1|VerbForm=Fin": {"morph": "Mood=Imp|Number=Plur|Person=1|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Imp|Number=Plur|Person=2|VerbForm=Fin": {"morph": "Mood=Imp|Number=Plur|Person=2|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Imp|Number=Plur|Person=3|VerbForm=Fin": {"morph": "Mood=Imp|Number=Plur|Person=3|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Imp|Number=Sing|Person=2|VerbForm=Fin": {"morph": "Mood=Imp|Number=Sing|Person=2|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Imp|Number=Sing|Person=3|VerbForm=Fin": {"morph": "Mood=Imp|Number=Sing|Person=3|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Ind|Number=Plur|Person=1|Tense=Fut|VerbForm=Fin": {"morph": "Mood=Ind|Number=Plur|Person=1|Tense=Fut|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Ind|Number=Plur|Person=1|Tense=Imp|VerbForm=Fin": {"morph": "Mood=Ind|Number=Plur|Person=1|Tense=Imp|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Ind|Number=Plur|Person=1|Tense=Past|VerbForm=Fin": {"morph": "Mood=Ind|Number=Plur|Person=1|Tense=Past|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin": {"morph": "Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Ind|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin": {"morph": "Mood=Ind|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Ind|Number=Plur|Person=3|Tense=Fut|VerbForm=Fin": {"morph": "Mood=Ind|Number=Plur|Person=3|Tense=Fut|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Ind|Number=Plur|Person=3|Tense=Imp|VerbForm=Fin": {"morph": "Mood=Ind|Number=Plur|Person=3|Tense=Imp|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin": {"morph": "Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin": {"morph": "Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin", POS: VERB}, - 
"VERB__Mood=Ind|Number=Sing|Person=1|Tense=Fut|VerbForm=Fin": {"morph": "Mood=Ind|Number=Sing|Person=1|Tense=Fut|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Ind|Number=Sing|Person=1|Tense=Imp|VerbForm=Fin": {"morph": "Mood=Ind|Number=Sing|Person=1|Tense=Imp|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Ind|Number=Sing|Person=1|Tense=Past|VerbForm=Fin": {"morph": "Mood=Ind|Number=Sing|Person=1|Tense=Past|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin": {"morph": "Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Ind|Number=Sing|Person=2|Tense=Fut|VerbForm=Fin": {"morph": "Mood=Ind|Number=Sing|Person=2|Tense=Fut|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Ind|Number=Sing|Person=2|Tense=Imp|VerbForm=Fin": {"morph": "Mood=Ind|Number=Sing|Person=2|Tense=Imp|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Ind|Number=Sing|Person=2|Tense=Past|VerbForm=Fin": {"morph": "Mood=Ind|Number=Sing|Person=2|Tense=Past|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin": {"morph": "Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Ind|Number=Sing|Person=3|Tense=Fut|VerbForm=Fin": {"morph": "Mood=Ind|Number=Sing|Person=3|Tense=Fut|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Ind|Number=Sing|Person=3|Tense=Imp|VerbForm=Fin": {"morph": "Mood=Ind|Number=Sing|Person=3|Tense=Imp|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin": {"morph": "Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin": {"morph": "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Ind|Person=3|Tense=Pres|VerbForm=Fin": {"morph": "Mood=Ind|Person=3|Tense=Pres|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Sub|Number=Plur|Person=1|Tense=Imp|VerbForm=Fin": {"morph": "Mood=Sub|Number=Plur|Person=1|Tense=Imp|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Sub|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin": {"morph": "Mood=Sub|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Sub|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin": {"morph": "Mood=Sub|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Sub|Number=Plur|Person=3|Tense=Imp|VerbForm=Fin": {"morph": "Mood=Sub|Number=Plur|Person=3|Tense=Imp|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Sub|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin": {"morph": "Mood=Sub|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Sub|Number=Sing|Person=1|Tense=Imp|VerbForm=Fin": {"morph": "Mood=Sub|Number=Sing|Person=1|Tense=Imp|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Sub|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin": {"morph": "Mood=Sub|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Sub|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin": {"morph": "Mood=Sub|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Sub|Number=Sing|Person=3|Tense=Imp|VerbForm=Fin": {"morph": "Mood=Sub|Number=Sing|Person=3|Tense=Imp|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Sub|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin": {"morph": "Mood=Sub|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin", POS: VERB}, - "VERB__VerbForm=Ger": {"morph": "VerbForm=Ger", POS: VERB}, - "VERB__VerbForm=Inf": {"morph": "VerbForm=Inf", POS: VERB}, - "X___": {"morph": "_", POS: X}, - "___PunctType=Quot": {POS: PUNCT}, - "___VerbForm=Inf": {POS: VERB}, - "___Number=Sing|Person=2|PronType=Prs": {POS: PRON}, - 
"_SP": {"morph": "_", POS: SPACE}, -} -# fmt: on diff --git a/spacy/lang/es/tokenizer_exceptions.py b/spacy/lang/es/tokenizer_exceptions.py index 891323705..fbfe75545 100644 --- a/spacy/lang/es/tokenizer_exceptions.py +++ b/spacy/lang/es/tokenizer_exceptions.py @@ -1,42 +1,42 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import ORTH, LEMMA, NORM, PRON_LEMMA +from ..tokenizer_exceptions import BASE_EXCEPTIONS +from ...symbols import ORTH, NORM +from ...util import update_exc -_exc = {} +_exc = { + "pal": [{ORTH: "pa"}, {ORTH: "l", NORM: "el"}], +} for exc_data in [ - {ORTH: "n°", LEMMA: "número"}, - {ORTH: "°C", LEMMA: "grados Celcius"}, - {ORTH: "aprox.", LEMMA: "aproximadamente"}, - {ORTH: "dna.", LEMMA: "docena"}, - {ORTH: "dpto.", LEMMA: "departamento"}, - {ORTH: "ej.", LEMMA: "ejemplo"}, - {ORTH: "esq.", LEMMA: "esquina"}, - {ORTH: "pág.", LEMMA: "página"}, - {ORTH: "p.ej.", LEMMA: "por ejemplo"}, - {ORTH: "Ud.", LEMMA: PRON_LEMMA, NORM: "usted"}, - {ORTH: "Vd.", LEMMA: PRON_LEMMA, NORM: "usted"}, - {ORTH: "Uds.", LEMMA: PRON_LEMMA, NORM: "ustedes"}, - {ORTH: "Vds.", LEMMA: PRON_LEMMA, NORM: "ustedes"}, + {ORTH: "n°"}, + {ORTH: "°C"}, + {ORTH: "aprox."}, + {ORTH: "dna."}, + {ORTH: "dpto."}, + {ORTH: "ej."}, + {ORTH: "esq."}, + {ORTH: "pág."}, + {ORTH: "p.ej."}, + {ORTH: "Ud.", NORM: "usted"}, + {ORTH: "Vd.", NORM: "usted"}, + {ORTH: "Uds.", NORM: "ustedes"}, + {ORTH: "Vds.", NORM: "ustedes"}, {ORTH: "vol.", NORM: "volúmen"}, - ]: _exc[exc_data[ORTH]] = [exc_data] # Times -_exc["12m."] = [{ORTH: "12"}, {ORTH: "m.", LEMMA: "p.m."}] +_exc["12m."] = [{ORTH: "12"}, {ORTH: "m."}] for h in range(1, 12 + 1): for period in ["a.m.", "am"]: - _exc["%d%s" % (h, period)] = [{ORTH: "%d" % h}, {ORTH: period, LEMMA: "a.m."}] + _exc[f"{h}{period}"] = [{ORTH: f"{h}"}, {ORTH: period}] for period in ["p.m.", "pm"]: - _exc["%d%s" % (h, period)] = [{ORTH: "%d" % h}, {ORTH: period, LEMMA: "p.m."}] + _exc[f"{h}{period}"] = [{ORTH: f"{h}"}, {ORTH: period}] for orth in [ @@ -65,11 +65,9 @@ for orth in [ "Prof.", "Profa.", "q.e.p.d.", - "Q.E.P.D." - "S.A.", + "Q.E.P.D." "S.A.", "S.L.", - "S.R.L." - "s.s.s.", + "S.R.L." 
"s.s.s.", "Sr.", "Sra.", "Srta.", @@ -77,4 +75,4 @@ for orth in [ _exc[orth] = [{ORTH: orth}] -TOKENIZER_EXCEPTIONS = _exc +TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc) diff --git a/spacy/lang/et/__init__.py b/spacy/lang/et/__init__.py index d84c081ef..9f71882d2 100644 --- a/spacy/lang/et/__init__.py +++ b/spacy/lang/et/__init__.py @@ -1,14 +1,8 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from ...language import Language -from ...attrs import LANG class EstonianDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = lambda text: "et" stop_words = STOP_WORDS diff --git a/spacy/lang/et/stop_words.py b/spacy/lang/et/stop_words.py index 15070db5f..e1da1f14d 100644 --- a/spacy/lang/et/stop_words.py +++ b/spacy/lang/et/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Source: https://github.com/stopwords-iso/stopwords-et STOP_WORDS = set( diff --git a/spacy/lang/eu/__init__.py b/spacy/lang/eu/__init__.py index b72529fab..89550be96 100644 --- a/spacy/lang/eu/__init__.py +++ b/spacy/lang/eu/__init__.py @@ -1,23 +1,13 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS from .punctuation import TOKENIZER_SUFFIXES - -from ..tokenizer_exceptions import BASE_EXCEPTIONS from ...language import Language -from ...attrs import LANG class BasqueDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters.update(LEX_ATTRS) - lex_attr_getters[LANG] = lambda text: "eu" - - tokenizer_exceptions = BASE_EXCEPTIONS - stop_words = STOP_WORDS suffixes = TOKENIZER_SUFFIXES + stop_words = STOP_WORDS + lex_attr_getters = LEX_ATTRS class Basque(Language): diff --git a/spacy/lang/eu/examples.py b/spacy/lang/eu/examples.py index 463494abd..3b9ef71b6 100644 --- a/spacy/lang/eu/examples.py +++ b/spacy/lang/eu/examples.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - """ Example sentences to test spaCy and its language models. 
diff --git a/spacy/lang/eu/lex_attrs.py b/spacy/lang/eu/lex_attrs.py index 19b75c111..a3ab018ee 100644 --- a/spacy/lang/eu/lex_attrs.py +++ b/spacy/lang/eu/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM # Source http://mylanguages.org/basque_numbers.php diff --git a/spacy/lang/eu/punctuation.py b/spacy/lang/eu/punctuation.py index b8b1a1c83..5d35d0a25 100644 --- a/spacy/lang/eu/punctuation.py +++ b/spacy/lang/eu/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..punctuation import TOKENIZER_SUFFIXES diff --git a/spacy/lang/eu/stop_words.py b/spacy/lang/eu/stop_words.py index dda11a7fd..d213b5b81 100644 --- a/spacy/lang/eu/stop_words.py +++ b/spacy/lang/eu/stop_words.py @@ -1,6 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - # Source: https://github.com/stopwords-iso/stopwords-eu # https://www.ranks.nl/stopwords/basque # https://www.mustgo.com/worldlanguages/basque/ diff --git a/spacy/lang/fa/__init__.py b/spacy/lang/fa/__init__.py index c93bca671..77ee3bca3 100644 --- a/spacy/lang/fa/__init__.py +++ b/spacy/lang/fa/__init__.py @@ -1,31 +1,21 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...language import Language -from ...attrs import LANG, NORM -from ...util import update_exc, add_lookups -from ..norm_exceptions import BASE_NORMS +from typing import Optional +from thinc.api import Model from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS -from .tag_map import TAG_MAP from .punctuation import TOKENIZER_SUFFIXES from .syntax_iterators import SYNTAX_ITERATORS +from ...language import Language +from ...pipeline import Lemmatizer class PersianDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters.update(LEX_ATTRS) - lex_attr_getters[NORM] = add_lookups( - Language.Defaults.lex_attr_getters[NORM], BASE_NORMS - ) - lex_attr_getters[LANG] = lambda text: "fa" - tokenizer_exceptions = update_exc(TOKENIZER_EXCEPTIONS) - stop_words = STOP_WORDS - tag_map = TAG_MAP + tokenizer_exceptions = TOKENIZER_EXCEPTIONS suffixes = TOKENIZER_SUFFIXES - writing_system = {"direction": "rtl", "has_case": False, "has_letters": True} + lex_attr_getters = LEX_ATTRS syntax_iterators = SYNTAX_ITERATORS + stop_words = STOP_WORDS + writing_system = {"direction": "rtl", "has_case": False, "has_letters": True} class Persian(Language): @@ -33,4 +23,14 @@ class Persian(Language): Defaults = PersianDefaults +@Persian.factory( + "lemmatizer", + assigns=["token.lemma"], + default_config={"model": None, "mode": "rule"}, + default_score_weights={"lemma_acc": 1.0}, +) +def make_lemmatizer(nlp: Language, model: Optional[Model], name: str, mode: str): + return Lemmatizer(nlp.vocab, model, name, mode=mode) + + __all__ = ["Persian"] diff --git a/spacy/lang/fa/examples.py b/spacy/lang/fa/examples.py index 3f65a366d..9c6fb0345 100644 --- a/spacy/lang/fa/examples.py +++ b/spacy/lang/fa/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
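The spacy/lang/fa/__init__.py hunk above registers a language-specific "lemmatizer" factory for Persian, with mode="rule" as the default configuration. A rough usage sketch (not part of the diff) of how that factory is picked up at runtime; it assumes spaCy v3 with the optional spacy-lookups-data package installed for the rule tables, and lemma quality would additionally depend on POS tags from a trained pipeline:

    import spacy

    nlp = spacy.blank("fa")
    # Resolves to the Persian-specific factory registered above (mode="rule").
    nlp.add_pipe("lemmatizer")
    # Initializes the lemmatizer; this should fail if the fa lookup tables
    # from spacy-lookups-data are not available.
    nlp.initialize()
    doc = nlp("کتابش را خواند")
    print([(token.text, token.lemma_) for token in doc])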
diff --git a/spacy/lang/fa/generate_verbs_exc.py b/spacy/lang/fa/generate_verbs_exc.py index 5d0ff944d..62094c6de 100644 --- a/spacy/lang/fa/generate_verbs_exc.py +++ b/spacy/lang/fa/generate_verbs_exc.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - verb_roots = """ #هست آخت#آهنج diff --git a/spacy/lang/fa/lex_attrs.py b/spacy/lang/fa/lex_attrs.py index dbea66b68..99b8e2787 100644 --- a/spacy/lang/fa/lex_attrs.py +++ b/spacy/lang/fa/lex_attrs.py @@ -1,5 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals from ...attrs import LIKE_NUM diff --git a/spacy/lang/fa/punctuation.py b/spacy/lang/fa/punctuation.py index 33aa46ae2..4b258c13d 100644 --- a/spacy/lang/fa/punctuation.py +++ b/spacy/lang/fa/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY from ..char_classes import UNITS, ALPHA_UPPER diff --git a/spacy/lang/fa/stop_words.py b/spacy/lang/fa/stop_words.py index 682fb7a71..f462f2e7a 100644 --- a/spacy/lang/fa/stop_words.py +++ b/spacy/lang/fa/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Stop words from HAZM package STOP_WORDS = set( """ diff --git a/spacy/lang/fa/syntax_iterators.py b/spacy/lang/fa/syntax_iterators.py index 0f2b28b58..0be06e73c 100644 --- a/spacy/lang/fa/syntax_iterators.py +++ b/spacy/lang/fa/syntax_iterators.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import NOUN, PROPN, PRON from ...errors import Errors @@ -22,7 +19,7 @@ def noun_chunks(doclike): ] doc = doclike.doc # Ensure works on both Doc and Span. - if not doc.is_parsed: + if not doc.has_annotation("DEP"): raise ValueError(Errors.E029) np_deps = [doc.vocab.strings.add(label) for label in labels] diff --git a/spacy/lang/fa/tag_map.py b/spacy/lang/fa/tag_map.py deleted file mode 100644 index b9043adf0..000000000 --- a/spacy/lang/fa/tag_map.py +++ /dev/null @@ -1,39 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import POS, PUNCT, ADJ, CONJ, NUM, DET, ADV, ADP, X, VERB -from ...symbols import PRON, NOUN, PART, INTJ, AUX - - -TAG_MAP = { - "ADJ": {POS: ADJ}, - "ADJ_CMPR": {POS: ADJ}, - "ADJ_INO": {POS: ADJ}, - "ADJ_SUP": {POS: ADJ}, - "ADV": {POS: ADV}, - "ADV_COMP": {POS: ADV}, - "ADV_I": {POS: ADV}, - "ADV_LOC": {POS: ADV}, - "ADV_NEG": {POS: ADV}, - "ADV_TIME": {POS: ADV}, - "CLITIC": {POS: PART}, - "CON": {POS: CONJ}, - "CONJ": {POS: CONJ}, - "DELM": {POS: PUNCT}, - "DET": {POS: DET}, - "FW": {POS: X}, - "INT": {POS: INTJ}, - "N_PL": {POS: NOUN}, - "N_SING": {POS: NOUN}, - "N_VOC": {POS: NOUN}, - "NUM": {POS: NUM}, - "P": {POS: ADP}, - "PREV": {POS: ADP}, - "PRO": {POS: PRON}, - "V_AUX": {POS: AUX}, - "V_IMP": {POS: VERB}, - "V_PA": {POS: VERB}, - "V_PP": {POS: VERB}, - "V_PRS": {POS: VERB}, - "V_SUB": {POS: VERB}, -} diff --git a/spacy/lang/fa/tokenizer_exceptions.py b/spacy/lang/fa/tokenizer_exceptions.py index b3f8dcbf5..30df798ab 100644 --- a/spacy/lang/fa/tokenizer_exceptions.py +++ b/spacy/lang/fa/tokenizer_exceptions.py @@ -1,2756 +1,747 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import ORTH, LEMMA, TAG, NORM +from ...symbols import ORTH, NORM -_exc = { - ".ق ": [{LEMMA: "قمری", ORTH: ".ق "}], - ".م": [{LEMMA: "میلادی", ORTH: ".م"}], - ".هـ": [{LEMMA: "هجری", ORTH: ".هـ"}], - "ب.م": [{LEMMA: "بعد از میلاد", ORTH: "ب.م"}], - "ق.م": [{LEMMA: "قبل از میلاد", ORTH: "ق.م"}], 
+TOKENIZER_EXCEPTIONS = { + ".ق ": [{ORTH: ".ق "}], + ".م": [{ORTH: ".م"}], + ".هـ": [{ORTH: ".هـ"}], + "ب.م": [{ORTH: "ب.م"}], + "ق.م": [{ORTH: "ق.م"}], + "آبرویت": [{ORTH: "آبروی", NORM: "آبروی"}, {ORTH: "ت", NORM: "ت"}], + "آب‌نباتش": [{ORTH: "آب‌نبات", NORM: "آب‌نبات"}, {ORTH: "ش", NORM: "ش"}], + "آثارش": [{ORTH: "آثار", NORM: "آثار"}, {ORTH: "ش", NORM: "ش"}], + "آخرش": [{ORTH: "آخر", NORM: "آخر"}, {ORTH: "ش", NORM: "ش"}], + "آدمهاست": [{ORTH: "آدمها", NORM: "آدمها"}, {ORTH: "ست", NORM: "ست"}], + "آرزومندیم": [{ORTH: "آرزومند", NORM: "آرزومند"}, {ORTH: "یم", NORM: "یم"}], + "آزادند": [{ORTH: "آزاد", NORM: "آزاد"}, {ORTH: "ند", NORM: "ند"}], + "آسیب‌پذیرند": [{ORTH: "آسیب‌پذیر", NORM: "آسیب‌پذیر"}, {ORTH: "ند", NORM: "ند"}], + "آفریده‌اند": [{ORTH: "آفریده‌", NORM: "آفریده‌"}, {ORTH: "اند", NORM: "اند"}], + "آمدنش": [{ORTH: "آمدن", NORM: "آمدن"}, {ORTH: "ش", NORM: "ش"}], + "آمریکاست": [{ORTH: "آمریکا", NORM: "آمریکا"}, {ORTH: "ست", NORM: "ست"}], + "آنجاست": [{ORTH: "آنجا", NORM: "آنجا"}, {ORTH: "ست", NORM: "ست"}], + "آنست": [{ORTH: "آن", NORM: "آن"}, {ORTH: "ست", NORM: "ست"}], + "آنند": [{ORTH: "آن", NORM: "آن"}, {ORTH: "ند", NORM: "ند"}], + "آن‌هاست": [{ORTH: "آن‌ها", NORM: "آن‌ها"}, {ORTH: "ست", NORM: "ست"}], + "آپاداناست": [{ORTH: "آپادانا", NORM: "آپادانا"}, {ORTH: "ست", NORM: "ست"}], + "اجتماعی‌مان": [{ORTH: "اجتماعی‌", NORM: "اجتماعی‌"}, {ORTH: "مان", NORM: "مان"}], + "اجدادت": [{ORTH: "اجداد", NORM: "اجداد"}, {ORTH: "ت", NORM: "ت"}], + "اجدادش": [{ORTH: "اجداد", NORM: "اجداد"}, {ORTH: "ش", NORM: "ش"}], + "اجدادی‌شان": [{ORTH: "اجدادی‌", NORM: "اجدادی‌"}, {ORTH: "شان", NORM: "شان"}], + "اجراست": [{ORTH: "اجرا", NORM: "اجرا"}, {ORTH: "ست", NORM: "ست"}], + "اختیارش": [{ORTH: "اختیار", NORM: "اختیار"}, {ORTH: "ش", NORM: "ش"}], + "اخلاقشان": [{ORTH: "اخلاق", NORM: "اخلاق"}, {ORTH: "شان", NORM: "شان"}], + "ادعایمان": [{ORTH: "ادعای", NORM: "ادعای"}, {ORTH: "مان", NORM: "مان"}], + "اذیتش": [{ORTH: "اذیت", NORM: "اذیت"}, {ORTH: "ش", NORM: "ش"}], + "اراده‌اش": [{ORTH: "اراده‌", NORM: "اراده‌"}, {ORTH: "اش", NORM: "اش"}], + "ارتباطش": [{ORTH: "ارتباط", NORM: "ارتباط"}, {ORTH: "ش", NORM: "ش"}], + "ارتباطمان": [{ORTH: "ارتباط", NORM: "ارتباط"}, {ORTH: "مان", NORM: "مان"}], + "ارزشهاست": [{ORTH: "ارزشها", NORM: "ارزشها"}, {ORTH: "ست", NORM: "ست"}], + "ارزی‌اش": [{ORTH: "ارزی‌", NORM: "ارزی‌"}, {ORTH: "اش", NORM: "اش"}], + "اره‌اش": [{ORTH: "اره‌", NORM: "اره‌"}, {ORTH: "اش", NORM: "اش"}], + "ازش": [{ORTH: "از", NORM: "از"}, {ORTH: "ش", NORM: "ش"}], + "ازین": [{ORTH: "از", NORM: "از"}, {ORTH: "ین", NORM: "ین"}], + "ازین‌هاست": [ + {ORTH: "از", NORM: "از"}, + {ORTH: "ین‌ها", NORM: "ین‌ها"}, + {ORTH: "ست", NORM: "ست"}, + ], + "استخوانند": [{ORTH: "استخوان", NORM: "استخوان"}, {ORTH: "ند", NORM: "ند"}], + "اسلامند": [{ORTH: "اسلام", NORM: "اسلام"}, {ORTH: "ند", NORM: "ند"}], + "اسلامی‌اند": [{ORTH: "اسلامی‌", NORM: "اسلامی‌"}, {ORTH: "اند", NORM: "اند"}], + "اسلحه‌هایشان": [ + {ORTH: "اسلحه‌های", NORM: "اسلحه‌های"}, + {ORTH: "شان", NORM: "شان"}, + ], + "اسمت": [{ORTH: "اسم", NORM: "اسم"}, {ORTH: "ت", NORM: "ت"}], + "اسمش": [{ORTH: "اسم", NORM: "اسم"}, {ORTH: "ش", NORM: "ش"}], + "اشتباهند": [{ORTH: "اشتباه", NORM: "اشتباه"}, {ORTH: "ند", NORM: "ند"}], + "اصلش": [{ORTH: "اصل", NORM: "اصل"}, {ORTH: "ش", NORM: "ش"}], + "اطاقش": [{ORTH: "اطاق", NORM: "اطاق"}, {ORTH: "ش", NORM: "ش"}], + "اعتقادند": [{ORTH: "اعتقاد", NORM: "اعتقاد"}, {ORTH: "ند", NORM: "ند"}], + "اعلایش": [{ORTH: "اعلای", NORM: "اعلای"}, {ORTH: "ش", NORM: "ش"}], + "افتراست": [{ORTH: "افترا", NORM: "افترا"}, {ORTH: "ست", NORM: "ست"}], + 
"افطارت": [{ORTH: "افطار", NORM: "افطار"}, {ORTH: "ت", NORM: "ت"}], + "اقوامش": [{ORTH: "اقوام", NORM: "اقوام"}, {ORTH: "ش", NORM: "ش"}], + "امروزیش": [{ORTH: "امروزی", NORM: "امروزی"}, {ORTH: "ش", NORM: "ش"}], + "اموالش": [{ORTH: "اموال", NORM: "اموال"}, {ORTH: "ش", NORM: "ش"}], + "امیدوارند": [{ORTH: "امیدوار", NORM: "امیدوار"}, {ORTH: "ند", NORM: "ند"}], + "امیدواریم": [{ORTH: "امیدوار", NORM: "امیدوار"}, {ORTH: "یم", NORM: "یم"}], + "انتخابهایم": [{ORTH: "انتخابها", NORM: "انتخابها"}, {ORTH: "یم", NORM: "یم"}], + "انتظارم": [{ORTH: "انتظار", NORM: "انتظار"}, {ORTH: "م", NORM: "م"}], + "انجمنم": [{ORTH: "انجمن", NORM: "انجمن"}, {ORTH: "م", NORM: "م"}], + "اندرش": [{ORTH: "اندر", NORM: "اندر"}, {ORTH: "ش", NORM: "ش"}], + "انشایش": [{ORTH: "انشای", NORM: "انشای"}, {ORTH: "ش", NORM: "ش"}], + "انگشتشان": [{ORTH: "انگشت", NORM: "انگشت"}, {ORTH: "شان", NORM: "شان"}], + "انگشتهایش": [{ORTH: "انگشتهای", NORM: "انگشتهای"}, {ORTH: "ش", NORM: "ش"}], + "اهمیتشان": [{ORTH: "اهمیت", NORM: "اهمیت"}, {ORTH: "شان", NORM: "شان"}], + "اهمیتند": [{ORTH: "اهمیت", NORM: "اهمیت"}, {ORTH: "ند", NORM: "ند"}], + "اوایلش": [{ORTH: "اوایل", NORM: "اوایل"}, {ORTH: "ش", NORM: "ش"}], + "اوست": [{ORTH: "او", NORM: "او"}, {ORTH: "ست", NORM: "ست"}], + "اولش": [{ORTH: "اول", NORM: "اول"}, {ORTH: "ش", NORM: "ش"}], + "اولشان": [{ORTH: "اول", NORM: "اول"}, {ORTH: "شان", NORM: "شان"}], + "اولم": [{ORTH: "اول", NORM: "اول"}, {ORTH: "م", NORM: "م"}], + "اکثرشان": [{ORTH: "اکثر", NORM: "اکثر"}, {ORTH: "شان", NORM: "شان"}], + "ایتالیاست": [{ORTH: "ایتالیا", NORM: "ایتالیا"}, {ORTH: "ست", NORM: "ست"}], + "ایرانی‌اش": [{ORTH: "ایرانی‌", NORM: "ایرانی‌"}, {ORTH: "اش", NORM: "اش"}], + "اینجاست": [{ORTH: "اینجا", NORM: "اینجا"}, {ORTH: "ست", NORM: "ست"}], + "این‌هاست": [{ORTH: "این‌ها", NORM: "این‌ها"}, {ORTH: "ست", NORM: "ست"}], + "بابات": [{ORTH: "بابا", NORM: "بابا"}, {ORTH: "ت", NORM: "ت"}], + "بارش": [{ORTH: "بار", NORM: "بار"}, {ORTH: "ش", NORM: "ش"}], + "بازیگرانش": [{ORTH: "بازیگران", NORM: "بازیگران"}, {ORTH: "ش", NORM: "ش"}], + "بازیگرمان": [{ORTH: "بازیگر", NORM: "بازیگر"}, {ORTH: "مان", NORM: "مان"}], + "بازیگرهایم": [{ORTH: "بازیگرها", NORM: "بازیگرها"}, {ORTH: "یم", NORM: "یم"}], + "بازی‌اش": [{ORTH: "بازی‌", NORM: "بازی‌"}, {ORTH: "اش", NORM: "اش"}], + "بالاست": [{ORTH: "بالا", NORM: "بالا"}, {ORTH: "ست", NORM: "ست"}], + "باورند": [{ORTH: "باور", NORM: "باور"}, {ORTH: "ند", NORM: "ند"}], + "بجاست": [{ORTH: "بجا", NORM: "بجا"}, {ORTH: "ست", NORM: "ست"}], + "بدان": [{ORTH: "ب", NORM: "ب"}, {ORTH: "دان", NORM: "دان"}], + "بدش": [{ORTH: "بد", NORM: "بد"}, {ORTH: "ش", NORM: "ش"}], + "بدشان": [{ORTH: "بد", NORM: "بد"}, {ORTH: "شان", NORM: "شان"}], + "بدنم": [{ORTH: "بدن", NORM: "بدن"}, {ORTH: "م", NORM: "م"}], + "بدهی‌ات": [{ORTH: "بدهی‌", NORM: "بدهی‌"}, {ORTH: "ات", NORM: "ات"}], + "بدین": [{ORTH: "ب", NORM: "ب"}, {ORTH: "دین", NORM: "دین"}], + "برابرش": [{ORTH: "برابر", NORM: "برابر"}, {ORTH: "ش", NORM: "ش"}], + "برادرت": [{ORTH: "برادر", NORM: "برادر"}, {ORTH: "ت", NORM: "ت"}], + "برادرش": [{ORTH: "برادر", NORM: "برادر"}, {ORTH: "ش", NORM: "ش"}], + "برایت": [{ORTH: "برای", NORM: "برای"}, {ORTH: "ت", NORM: "ت"}], + "برایتان": [{ORTH: "برای", NORM: "برای"}, {ORTH: "تان", NORM: "تان"}], + "برایش": [{ORTH: "برای", NORM: "برای"}, {ORTH: "ش", NORM: "ش"}], + "برایشان": [{ORTH: "برای", NORM: "برای"}, {ORTH: "شان", NORM: "شان"}], + "برایم": [{ORTH: "برای", NORM: "برای"}, {ORTH: "م", NORM: "م"}], + "برایمان": [{ORTH: "برای", NORM: "برای"}, {ORTH: "مان", NORM: "مان"}], + "برخوردارند": [{ORTH: "برخوردار", NORM: "برخوردار"}, {ORTH: 
"ند", NORM: "ند"}], + "برنامه‌سازهاست": [ + {ORTH: "برنامه‌سازها", NORM: "برنامه‌سازها"}, + {ORTH: "ست", NORM: "ست"}, + ], + "برهمش": [{ORTH: "برهم", NORM: "برهم"}, {ORTH: "ش", NORM: "ش"}], + "برهنه‌اش": [{ORTH: "برهنه‌", NORM: "برهنه‌"}, {ORTH: "اش", NORM: "اش"}], + "برگهایش": [{ORTH: "برگها", NORM: "برگها"}, {ORTH: "یش", NORM: "یش"}], + "برین": [{ORTH: "بر", NORM: "بر"}, {ORTH: "ین", NORM: "ین"}], + "بزرگش": [{ORTH: "بزرگ", NORM: "بزرگ"}, {ORTH: "ش", NORM: "ش"}], + "بزرگ‌تری": [{ORTH: "بزرگ‌تر", NORM: "بزرگ‌تر"}, {ORTH: "ی", NORM: "ی"}], + "بساطش": [{ORTH: "بساط", NORM: "بساط"}, {ORTH: "ش", NORM: "ش"}], + "بعدش": [{ORTH: "بعد", NORM: "بعد"}, {ORTH: "ش", NORM: "ش"}], + "بعضیهایشان": [{ORTH: "بعضیهای", NORM: "بعضیهای"}, {ORTH: "شان", NORM: "شان"}], + "بعضی‌شان": [{ORTH: "بعضی", NORM: "بعضی"}, {ORTH: "‌شان", NORM: "شان"}], + "بقیه‌اش": [{ORTH: "بقیه‌", NORM: "بقیه‌"}, {ORTH: "اش", NORM: "اش"}], + "بلندش": [{ORTH: "بلند", NORM: "بلند"}, {ORTH: "ش", NORM: "ش"}], + "بناگوشش": [{ORTH: "بناگوش", NORM: "بناگوش"}, {ORTH: "ش", NORM: "ش"}], + "بنظرم": [ + {ORTH: "ب", NORM: "ب"}, + {ORTH: "نظر", NORM: "نظر"}, + {ORTH: "م", NORM: "م"}, + ], + "بهت": [{ORTH: "به", NORM: "به"}, {ORTH: "ت", NORM: "ت"}], + "بهترش": [{ORTH: "بهتر", NORM: "بهتر"}, {ORTH: "ش", NORM: "ش"}], + "بهترم": [{ORTH: "بهتر", NORM: "بهتر"}, {ORTH: "م", NORM: "م"}], + "بهتری": [{ORTH: "بهتر", NORM: "بهتر"}, {ORTH: "ی", NORM: "ی"}], + "بهش": [{ORTH: "به", NORM: "به"}, {ORTH: "ش", NORM: "ش"}], + "به‌شان": [{ORTH: "به‌", NORM: "به‌"}, {ORTH: "شان", NORM: "شان"}], + "بودمش": [{ORTH: "بودم", NORM: "بودم"}, {ORTH: "ش", NORM: "ش"}], + "بودنش": [{ORTH: "بودن", NORM: "بودن"}, {ORTH: "ش", NORM: "ش"}], + "بودن‌شان": [{ORTH: "بودن‌", NORM: "بودن‌"}, {ORTH: "شان", NORM: "شان"}], + "بوستانش": [{ORTH: "بوستان", NORM: "بوستان"}, {ORTH: "ش", NORM: "ش"}], + "بویش": [{ORTH: "بو", NORM: "بو"}, {ORTH: "یش", NORM: "یش"}], + "بچه‌اش": [{ORTH: "بچه‌", NORM: "بچه‌"}, {ORTH: "اش", NORM: "اش"}], + "بچه‌م": [{ORTH: "بچه‌", NORM: "بچه‌"}, {ORTH: "م", NORM: "م"}], + "بچه‌هایش": [{ORTH: "بچه‌های", NORM: "بچه‌های"}, {ORTH: "ش", NORM: "ش"}], + "بیانیه‌شان": [{ORTH: "بیانیه‌", NORM: "بیانیه‌"}, {ORTH: "شان", NORM: "شان"}], + "بیدارم": [{ORTH: "بیدار", NORM: "بیدار"}, {ORTH: "م", NORM: "م"}], + "بیناتری": [{ORTH: "بیناتر", NORM: "بیناتر"}, {ORTH: "ی", NORM: "ی"}], + "بی‌اطلاعند": [{ORTH: "بی‌اطلاع", NORM: "بی‌اطلاع"}, {ORTH: "ند", NORM: "ند"}], + "بی‌اطلاعید": [{ORTH: "بی‌اطلاع", NORM: "بی‌اطلاع"}, {ORTH: "ید", NORM: "ید"}], + "بی‌بهره‌اند": [{ORTH: "بی‌بهره‌", NORM: "بی‌بهره‌"}, {ORTH: "اند", NORM: "اند"}], + "بی‌تفاوتند": [{ORTH: "بی‌تفاوت", NORM: "بی‌تفاوت"}, {ORTH: "ند", NORM: "ند"}], + "بی‌حسابش": [{ORTH: "بی‌حساب", NORM: "بی‌حساب"}, {ORTH: "ش", NORM: "ش"}], + "بی‌نیش": [{ORTH: "بی‌نی", NORM: "بی‌نی"}, {ORTH: "ش", NORM: "ش"}], + "تجربه‌هایم": [{ORTH: "تجربه‌ها", NORM: "تجربه‌ها"}, {ORTH: "یم", NORM: "یم"}], + "تحریم‌هاست": [{ORTH: "تحریم‌ها", NORM: "تحریم‌ها"}, {ORTH: "ست", NORM: "ست"}], + "تحولند": [{ORTH: "تحول", NORM: "تحول"}, {ORTH: "ند", NORM: "ند"}], + "تخیلی‌اش": [{ORTH: "تخیلی‌", NORM: "تخیلی‌"}, {ORTH: "اش", NORM: "اش"}], + "ترا": [{ORTH: "ت", NORM: "ت"}, {ORTH: "را", NORM: "را"}], + "ترسشان": [{ORTH: "ترس", NORM: "ترس"}, {ORTH: "شان", NORM: "شان"}], + "ترکش": [{ORTH: "ترک", NORM: "ترک"}, {ORTH: "ش", NORM: "ش"}], + "تشنه‌ت": [{ORTH: "تشنه‌", NORM: "تشنه‌"}, {ORTH: "ت", NORM: "ت"}], + "تشکیلاتی‌اش": [{ORTH: "تشکیلاتی‌", NORM: "تشکیلاتی‌"}, {ORTH: "اش", NORM: "اش"}], + "تعلقش": [{ORTH: "تعلق", NORM: "تعلق"}, {ORTH: "ش", NORM: "ش"}], + "تلاششان": [{ORTH: 
"تلاش", NORM: "تلاش"}, {ORTH: "شان", NORM: "شان"}], + "تلاشمان": [{ORTH: "تلاش", NORM: "تلاش"}, {ORTH: "مان", NORM: "مان"}], + "تماشاگرش": [{ORTH: "تماشاگر", NORM: "تماشاگر"}, {ORTH: "ش", NORM: "ش"}], + "تمامشان": [{ORTH: "تمام", NORM: "تمام"}, {ORTH: "شان", NORM: "شان"}], + "تنش": [{ORTH: "تن", NORM: "تن"}, {ORTH: "ش", NORM: "ش"}], + "تنمان": [{ORTH: "تن", NORM: "تن"}, {ORTH: "مان", NORM: "مان"}], + "تنهایی‌اش": [{ORTH: "تنهایی‌", NORM: "تنهایی‌"}, {ORTH: "اش", NORM: "اش"}], + "توانایی‌اش": [{ORTH: "توانایی‌", NORM: "توانایی‌"}, {ORTH: "اش", NORM: "اش"}], + "توجهش": [{ORTH: "توجه", NORM: "توجه"}, {ORTH: "ش", NORM: "ش"}], + "توست": [{ORTH: "تو", NORM: "تو"}, {ORTH: "ست", NORM: "ست"}], + "توصیه‌اش": [{ORTH: "توصیه‌", NORM: "توصیه‌"}, {ORTH: "اش", NORM: "اش"}], + "تیغه‌اش": [{ORTH: "تیغه‌", NORM: "تیغه‌"}, {ORTH: "اش", NORM: "اش"}], + "جاست": [{ORTH: "جا", NORM: "جا"}, {ORTH: "ست", NORM: "ست"}], + "جامعه‌اند": [{ORTH: "جامعه‌", NORM: "جامعه‌"}, {ORTH: "اند", NORM: "اند"}], + "جانم": [{ORTH: "جان", NORM: "جان"}, {ORTH: "م", NORM: "م"}], + "جایش": [{ORTH: "جای", NORM: "جای"}, {ORTH: "ش", NORM: "ش"}], + "جایشان": [{ORTH: "جای", NORM: "جای"}, {ORTH: "شان", NORM: "شان"}], + "جدیدش": [{ORTH: "جدید", NORM: "جدید"}, {ORTH: "ش", NORM: "ش"}], + "جرمزاست": [{ORTH: "جرمزا", NORM: "جرمزا"}, {ORTH: "ست", NORM: "ست"}], + "جلوست": [{ORTH: "جلو", NORM: "جلو"}, {ORTH: "ست", NORM: "ست"}], + "جلویش": [{ORTH: "جلوی", NORM: "جلوی"}, {ORTH: "ش", NORM: "ش"}], + "جمهوریست": [{ORTH: "جمهوری", NORM: "جمهوری"}, {ORTH: "ست", NORM: "ست"}], + "جنسش": [{ORTH: "جنس", NORM: "جنس"}, {ORTH: "ش", NORM: "ش"}], + "جنس‌اند": [{ORTH: "جنس‌", NORM: "جنس‌"}, {ORTH: "اند", NORM: "اند"}], + "جوانانش": [{ORTH: "جوانان", NORM: "جوانان"}, {ORTH: "ش", NORM: "ش"}], + "جویش": [{ORTH: "جوی", NORM: "جوی"}, {ORTH: "ش", NORM: "ش"}], + "جگرش": [{ORTH: "جگر", NORM: "جگر"}, {ORTH: "ش", NORM: "ش"}], + "حاضرم": [{ORTH: "حاضر", NORM: "حاضر"}, {ORTH: "م", NORM: "م"}], + "حالتهایشان": [{ORTH: "حالتهای", NORM: "حالتهای"}, {ORTH: "شان", NORM: "شان"}], + "حالیست": [{ORTH: "حالی", NORM: "حالی"}, {ORTH: "ست", NORM: "ست"}], + "حالی‌مان": [{ORTH: "حالی‌", NORM: "حالی‌"}, {ORTH: "مان", NORM: "مان"}], + "حاکیست": [{ORTH: "حاکی", NORM: "حاکی"}, {ORTH: "ست", NORM: "ست"}], + "حرامزادگی‌اش": [ + {ORTH: "حرامزادگی‌", NORM: "حرامزادگی‌"}, + {ORTH: "اش", NORM: "اش"}, + ], + "حرفتان": [{ORTH: "حرف", NORM: "حرف"}, {ORTH: "تان", NORM: "تان"}], + "حرفش": [{ORTH: "حرف", NORM: "حرف"}, {ORTH: "ش", NORM: "ش"}], + "حرفشان": [{ORTH: "حرف", NORM: "حرف"}, {ORTH: "شان", NORM: "شان"}], + "حرفم": [{ORTH: "حرف", NORM: "حرف"}, {ORTH: "م", NORM: "م"}], + "حرف‌های‌شان": [{ORTH: "حرف‌های‌", NORM: "حرف‌های‌"}, {ORTH: "شان", NORM: "شان"}], + "حرکتمان": [{ORTH: "حرکت", NORM: "حرکت"}, {ORTH: "مان", NORM: "مان"}], + "حریفانشان": [{ORTH: "حریفان", NORM: "حریفان"}, {ORTH: "شان", NORM: "شان"}], + "حضورشان": [{ORTH: "حضور", NORM: "حضور"}, {ORTH: "شان", NORM: "شان"}], + "حمایتش": [{ORTH: "حمایت", NORM: "حمایت"}, {ORTH: "ش", NORM: "ش"}], + "حواسش": [{ORTH: "حواس", NORM: "حواس"}, {ORTH: "ش", NORM: "ش"}], + "حواسشان": [{ORTH: "حواس", NORM: "حواس"}, {ORTH: "شان", NORM: "شان"}], + "حوصله‌مان": [{ORTH: "حوصله‌", NORM: "حوصله‌"}, {ORTH: "مان", NORM: "مان"}], + "حکومتش": [{ORTH: "حکومت", NORM: "حکومت"}, {ORTH: "ش", NORM: "ش"}], + "حکومتشان": [{ORTH: "حکومت", NORM: "حکومت"}, {ORTH: "شان", NORM: "شان"}], + "حیفم": [{ORTH: "حیف", NORM: "حیف"}, {ORTH: "م", NORM: "م"}], + "خاندانش": [{ORTH: "خاندان", NORM: "خاندان"}, {ORTH: "ش", NORM: "ش"}], + "خانه‌اش": [{ORTH: "خانه‌", NORM: "خانه‌"}, {ORTH: "اش", NORM: 
"اش"}], + "خانه‌شان": [{ORTH: "خانه‌", NORM: "خانه‌"}, {ORTH: "شان", NORM: "شان"}], + "خانه‌مان": [{ORTH: "خانه‌", NORM: "خانه‌"}, {ORTH: "مان", NORM: "مان"}], + "خانه‌هایشان": [{ORTH: "خانه‌های", NORM: "خانه‌های"}, {ORTH: "شان", NORM: "شان"}], + "خانواده‌ات": [{ORTH: "خانواده", NORM: "خانواده"}, {ORTH: "‌ات", NORM: "ات"}], + "خانواده‌اش": [{ORTH: "خانواده‌", NORM: "خانواده‌"}, {ORTH: "اش", NORM: "اش"}], + "خانواده‌ام": [{ORTH: "خانواده‌", NORM: "خانواده‌"}, {ORTH: "ام", NORM: "ام"}], + "خانواده‌شان": [{ORTH: "خانواده‌", NORM: "خانواده‌"}, {ORTH: "شان", NORM: "شان"}], + "خداست": [{ORTH: "خدا", NORM: "خدا"}, {ORTH: "ست", NORM: "ست"}], + "خدایش": [{ORTH: "خدا", NORM: "خدا"}, {ORTH: "یش", NORM: "یش"}], + "خدایشان": [{ORTH: "خدای", NORM: "خدای"}, {ORTH: "شان", NORM: "شان"}], + "خردسالش": [{ORTH: "خردسال", NORM: "خردسال"}, {ORTH: "ش", NORM: "ش"}], + "خروپفشان": [{ORTH: "خروپف", NORM: "خروپف"}, {ORTH: "شان", NORM: "شان"}], + "خسته‌ای": [{ORTH: "خسته‌", NORM: "خسته‌"}, {ORTH: "ای", NORM: "ای"}], + "خطت": [{ORTH: "خط", NORM: "خط"}, {ORTH: "ت", NORM: "ت"}], + "خوابمان": [{ORTH: "خواب", NORM: "خواب"}, {ORTH: "مان", NORM: "مان"}], + "خواندنش": [{ORTH: "خواندن", NORM: "خواندن"}, {ORTH: "ش", NORM: "ش"}], + "خواهرش": [{ORTH: "خواهر", NORM: "خواهر"}, {ORTH: "ش", NORM: "ش"}], + "خوبش": [{ORTH: "خوب", NORM: "خوب"}, {ORTH: "ش", NORM: "ش"}], + "خودت": [{ORTH: "خود", NORM: "خود"}, {ORTH: "ت", NORM: "ت"}], + "خودتان": [{ORTH: "خود", NORM: "خود"}, {ORTH: "تان", NORM: "تان"}], + "خودش": [{ORTH: "خود", NORM: "خود"}, {ORTH: "ش", NORM: "ش"}], + "خودشان": [{ORTH: "خود", NORM: "خود"}, {ORTH: "شان", NORM: "شان"}], + "خودمان": [{ORTH: "خود", NORM: "خود"}, {ORTH: "مان", NORM: "مان"}], + "خوردمان": [{ORTH: "خورد", NORM: "خورد"}, {ORTH: "مان", NORM: "مان"}], + "خوردنشان": [{ORTH: "خوردن", NORM: "خوردن"}, {ORTH: "شان", NORM: "شان"}], + "خوشش": [{ORTH: "خوش", NORM: "خوش"}, {ORTH: "ش", NORM: "ش"}], + "خوشوقتم": [{ORTH: "خوشوقت", NORM: "خوشوقت"}, {ORTH: "م", NORM: "م"}], + "خونشان": [{ORTH: "خون", NORM: "خون"}, {ORTH: "شان", NORM: "شان"}], + "خویش": [{ORTH: "خوی", NORM: "خوی"}, {ORTH: "ش", NORM: "ش"}], + "خویشتنم": [{ORTH: "خویشتن", NORM: "خویشتن"}, {ORTH: "م", NORM: "م"}], + "خیالش": [{ORTH: "خیال", NORM: "خیال"}, {ORTH: "ش", NORM: "ش"}], + "خیسش": [{ORTH: "خیس", NORM: "خیس"}, {ORTH: "ش", NORM: "ش"}], + "داراست": [{ORTH: "دارا", NORM: "دارا"}, {ORTH: "ست", NORM: "ست"}], + "داستانهایش": [{ORTH: "داستانهای", NORM: "داستانهای"}, {ORTH: "ش", NORM: "ش"}], + "دخترمان": [{ORTH: "دختر", NORM: "دختر"}, {ORTH: "مان", NORM: "مان"}], + "دخیلند": [{ORTH: "دخیل", NORM: "دخیل"}, {ORTH: "ند", NORM: "ند"}], + "درباره‌ات": [{ORTH: "درباره", NORM: "درباره"}, {ORTH: "‌ات", NORM: "ات"}], + "درباره‌اش": [{ORTH: "درباره‌", NORM: "درباره‌"}, {ORTH: "اش", NORM: "اش"}], + "دردش": [{ORTH: "درد", NORM: "درد"}, {ORTH: "ش", NORM: "ش"}], + "دردشان": [{ORTH: "درد", NORM: "درد"}, {ORTH: "شان", NORM: "شان"}], + "درسته": [{ORTH: "درست", NORM: "درست"}, {ORTH: "ه", NORM: "ه"}], + "درش": [{ORTH: "در", NORM: "در"}, {ORTH: "ش", NORM: "ش"}], + "درون‌شان": [{ORTH: "درون‌", NORM: "درون‌"}, {ORTH: "شان", NORM: "شان"}], + "درین": [{ORTH: "در", NORM: "در"}, {ORTH: "ین", NORM: "ین"}], + "دریچه‌هایش": [{ORTH: "دریچه‌های", NORM: "دریچه‌های"}, {ORTH: "ش", NORM: "ش"}], + "دزدانش": [{ORTH: "دزدان", NORM: "دزدان"}, {ORTH: "ش", NORM: "ش"}], + "دستت": [{ORTH: "دست", NORM: "دست"}, {ORTH: "ت", NORM: "ت"}], + "دستش": [{ORTH: "دست", NORM: "دست"}, {ORTH: "ش", NORM: "ش"}], + "دستمان": [{ORTH: "دست", NORM: "دست"}, {ORTH: "مان", NORM: "مان"}], + "دستهایشان": [{ORTH: "دستهای", NORM: 
"دستهای"}, {ORTH: "شان", NORM: "شان"}], + "دست‌یافتنی‌ست": [ + {ORTH: "دست‌یافتنی‌", NORM: "دست‌یافتنی‌"}, + {ORTH: "ست", NORM: "ست"}, + ], + "دشمنند": [{ORTH: "دشمن", NORM: "دشمن"}, {ORTH: "ند", NORM: "ند"}], + "دشمنیشان": [{ORTH: "دشمنی", NORM: "دشمنی"}, {ORTH: "شان", NORM: "شان"}], + "دشمنیم": [{ORTH: "دشمن", NORM: "دشمن"}, {ORTH: "یم", NORM: "یم"}], + "دفترش": [{ORTH: "دفتر", NORM: "دفتر"}, {ORTH: "ش", NORM: "ش"}], + "دفنشان": [{ORTH: "دفن", NORM: "دفن"}, {ORTH: "شان", NORM: "شان"}], + "دلت": [{ORTH: "دل", NORM: "دل"}, {ORTH: "ت", NORM: "ت"}], + "دلش": [{ORTH: "دل", NORM: "دل"}, {ORTH: "ش", NORM: "ش"}], + "دلشان": [{ORTH: "دل", NORM: "دل"}, {ORTH: "شان", NORM: "شان"}], + "دلم": [{ORTH: "دل", NORM: "دل"}, {ORTH: "م", NORM: "م"}], + "دلیلش": [{ORTH: "دلیل", NORM: "دلیل"}, {ORTH: "ش", NORM: "ش"}], + "دنبالش": [{ORTH: "دنبال", NORM: "دنبال"}, {ORTH: "ش", NORM: "ش"}], + "دنباله‌اش": [{ORTH: "دنباله‌", NORM: "دنباله‌"}, {ORTH: "اش", NORM: "اش"}], + "دهاتی‌هایش": [{ORTH: "دهاتی‌های", NORM: "دهاتی‌های"}, {ORTH: "ش", NORM: "ش"}], + "دهانت": [{ORTH: "دهان", NORM: "دهان"}, {ORTH: "ت", NORM: "ت"}], + "دهنش": [{ORTH: "دهن", NORM: "دهن"}, {ORTH: "ش", NORM: "ش"}], + "دورش": [{ORTH: "دور", NORM: "دور"}, {ORTH: "ش", NORM: "ش"}], + "دوروبریهاشان": [ + {ORTH: "دوروبریها", NORM: "دوروبریها"}, + {ORTH: "شان", NORM: "شان"}, + ], + "دوستانش": [{ORTH: "دوستان", NORM: "دوستان"}, {ORTH: "ش", NORM: "ش"}], + "دوستانشان": [{ORTH: "دوستان", NORM: "دوستان"}, {ORTH: "شان", NORM: "شان"}], + "دوستت": [{ORTH: "دوست", NORM: "دوست"}, {ORTH: "ت", NORM: "ت"}], + "دوستش": [{ORTH: "دوست", NORM: "دوست"}, {ORTH: "ش", NORM: "ش"}], + "دومش": [{ORTH: "دوم", NORM: "دوم"}, {ORTH: "ش", NORM: "ش"}], + "دویدنش": [{ORTH: "دویدن", NORM: "دویدن"}, {ORTH: "ش", NORM: "ش"}], + "دکورهایمان": [{ORTH: "دکورهای", NORM: "دکورهای"}, {ORTH: "مان", NORM: "مان"}], + "دیدگاهش": [{ORTH: "دیدگاه", NORM: "دیدگاه"}, {ORTH: "ش", NORM: "ش"}], + "دیرت": [{ORTH: "دیر", NORM: "دیر"}, {ORTH: "ت", NORM: "ت"}], + "دیرم": [{ORTH: "دیر", NORM: "دیر"}, {ORTH: "م", NORM: "م"}], + "دینت": [{ORTH: "دین", NORM: "دین"}, {ORTH: "ت", NORM: "ت"}], + "دینش": [{ORTH: "دین", NORM: "دین"}, {ORTH: "ش", NORM: "ش"}], + "دین‌شان": [{ORTH: "دین‌", NORM: "دین‌"}, {ORTH: "شان", NORM: "شان"}], + "دیواره‌هایش": [{ORTH: "دیواره‌های", NORM: "دیواره‌های"}, {ORTH: "ش", NORM: "ش"}], + "دیوانه‌ای": [{ORTH: "دیوانه‌", NORM: "دیوانه‌"}, {ORTH: "ای", NORM: "ای"}], + "دیوی": [{ORTH: "دیو", NORM: "دیو"}, {ORTH: "ی", NORM: "ی"}], + "دیگرم": [{ORTH: "دیگر", NORM: "دیگر"}, {ORTH: "م", NORM: "م"}], + "دیگرمان": [{ORTH: "دیگر", NORM: "دیگر"}, {ORTH: "مان", NORM: "مان"}], + "ذهنش": [{ORTH: "ذهن", NORM: "ذهن"}, {ORTH: "ش", NORM: "ش"}], + "ذهنشان": [{ORTH: "ذهن", NORM: "ذهن"}, {ORTH: "شان", NORM: "شان"}], + "ذهنم": [{ORTH: "ذهن", NORM: "ذهن"}, {ORTH: "م", NORM: "م"}], + "رئوسش": [{ORTH: "رئوس", NORM: "رئوس"}, {ORTH: "ش", NORM: "ش"}], + "راهشان": [{ORTH: "راه", NORM: "راه"}, {ORTH: "شان", NORM: "شان"}], + "راهگشاست": [{ORTH: "راهگشا", NORM: "راهگشا"}, {ORTH: "ست", NORM: "ست"}], + "رایانه‌هایشان": [ + {ORTH: "رایانه‌های", NORM: "رایانه‌های"}, + {ORTH: "شان", NORM: "شان"}, + ], + "رعایتشان": [{ORTH: "رعایت", NORM: "رعایت"}, {ORTH: "شان", NORM: "شان"}], + "رفتارش": [{ORTH: "رفتار", NORM: "رفتار"}, {ORTH: "ش", NORM: "ش"}], + "رفتارشان": [{ORTH: "رفتار", NORM: "رفتار"}, {ORTH: "شان", NORM: "شان"}], + "رفتارمان": [{ORTH: "رفتار", NORM: "رفتار"}, {ORTH: "مان", NORM: "مان"}], + "رفتارهاست": [{ORTH: "رفتارها", NORM: "رفتارها"}, {ORTH: "ست", NORM: "ست"}], + "رفتارهایشان": [{ORTH: "رفتارهای", NORM: "رفتارهای"}, 
{ORTH: "شان", NORM: "شان"}], + "رفقایم": [{ORTH: "رفقا", NORM: "رفقا"}, {ORTH: "یم", NORM: "یم"}], + "رقیق‌ترش": [{ORTH: "رقیق‌تر", NORM: "رقیق‌تر"}, {ORTH: "ش", NORM: "ش"}], + "رنجند": [{ORTH: "رنج", NORM: "رنج"}, {ORTH: "ند", NORM: "ند"}], + "رهگشاست": [{ORTH: "رهگشا", NORM: "رهگشا"}, {ORTH: "ست", NORM: "ست"}], + "رواست": [{ORTH: "روا", NORM: "روا"}, {ORTH: "ست", NORM: "ست"}], + "روبروست": [{ORTH: "روبرو", NORM: "روبرو"}, {ORTH: "ست", NORM: "ست"}], + "روحی‌اش": [{ORTH: "روحی‌", NORM: "روحی‌"}, {ORTH: "اش", NORM: "اش"}], + "روزنامه‌اش": [{ORTH: "روزنامه‌", NORM: "روزنامه‌"}, {ORTH: "اش", NORM: "اش"}], + "روزه‌ست": [{ORTH: "روزه‌", NORM: "روزه‌"}, {ORTH: "ست", NORM: "ست"}], + "روسری‌اش": [{ORTH: "روسری‌", NORM: "روسری‌"}, {ORTH: "اش", NORM: "اش"}], + "روشتان": [{ORTH: "روش", NORM: "روش"}, {ORTH: "تان", NORM: "تان"}], + "رویش": [{ORTH: "روی", NORM: "روی"}, {ORTH: "ش", NORM: "ش"}], + "زبانش": [{ORTH: "زبان", NORM: "زبان"}, {ORTH: "ش", NORM: "ش"}], + "زحماتشان": [{ORTH: "زحمات", NORM: "زحمات"}, {ORTH: "شان", NORM: "شان"}], + "زدنهایشان": [{ORTH: "زدنهای", NORM: "زدنهای"}, {ORTH: "شان", NORM: "شان"}], + "زرنگشان": [{ORTH: "زرنگ", NORM: "زرنگ"}, {ORTH: "شان", NORM: "شان"}], + "زشتش": [{ORTH: "زشت", NORM: "زشت"}, {ORTH: "ش", NORM: "ش"}], + "زشتکارانند": [{ORTH: "زشتکاران", NORM: "زشتکاران"}, {ORTH: "ند", NORM: "ند"}], + "زلفش": [{ORTH: "زلف", NORM: "زلف"}, {ORTH: "ش", NORM: "ش"}], + "زمن": [{ORTH: "ز", NORM: "ز"}, {ORTH: "من", NORM: "من"}], + "زنبوری‌اش": [{ORTH: "زنبوری‌", NORM: "زنبوری‌"}, {ORTH: "اش", NORM: "اش"}], + "زندانم": [{ORTH: "زندان", NORM: "زندان"}, {ORTH: "م", NORM: "م"}], + "زنده‌ام": [{ORTH: "زنده‌", NORM: "زنده‌"}, {ORTH: "ام", NORM: "ام"}], + "زندگانی‌اش": [{ORTH: "زندگانی‌", NORM: "زندگانی‌"}, {ORTH: "اش", NORM: "اش"}], + "زندگی‌اش": [{ORTH: "زندگی‌", NORM: "زندگی‌"}, {ORTH: "اش", NORM: "اش"}], + "زندگی‌ام": [{ORTH: "زندگی‌", NORM: "زندگی‌"}, {ORTH: "ام", NORM: "ام"}], + "زندگی‌شان": [{ORTH: "زندگی‌", NORM: "زندگی‌"}, {ORTH: "شان", NORM: "شان"}], + "زنش": [{ORTH: "زن", NORM: "زن"}, {ORTH: "ش", NORM: "ش"}], + "زنند": [{ORTH: "زن", NORM: "زن"}, {ORTH: "ند", NORM: "ند"}], + "زو": [{ORTH: "ز", NORM: "ز"}, {ORTH: "و", NORM: "و"}], + "زیاده": [{ORTH: "زیاد", NORM: "زیاد"}, {ORTH: "ه", NORM: "ه"}], + "زیباست": [{ORTH: "زیبا", NORM: "زیبا"}, {ORTH: "ست", NORM: "ست"}], + "زیبایش": [{ORTH: "زیبای", NORM: "زیبای"}, {ORTH: "ش", NORM: "ش"}], + "زیبایی": [{ORTH: "زیبای", NORM: "زیبای"}, {ORTH: "ی", NORM: "ی"}], + "زیربناست": [{ORTH: "زیربنا", NORM: "زیربنا"}, {ORTH: "ست", NORM: "ست"}], + "زیرک‌اند": [{ORTH: "زیرک‌", NORM: "زیرک‌"}, {ORTH: "اند", NORM: "اند"}], + "سؤالتان": [{ORTH: "سؤال", NORM: "سؤال"}, {ORTH: "تان", NORM: "تان"}], + "سؤالم": [{ORTH: "سؤال", NORM: "سؤال"}, {ORTH: "م", NORM: "م"}], + "سابقه‌اش": [{ORTH: "سابقه‌", NORM: "سابقه‌"}, {ORTH: "اش", NORM: "اش"}], + "ساختنم": [{ORTH: "ساختن", NORM: "ساختن"}, {ORTH: "م", NORM: "م"}], + "ساده‌اش": [{ORTH: "ساده‌", NORM: "ساده‌"}, {ORTH: "اش", NORM: "اش"}], + "ساده‌اند": [{ORTH: "ساده‌", NORM: "ساده‌"}, {ORTH: "اند", NORM: "اند"}], + "سازمانش": [{ORTH: "سازمان", NORM: "سازمان"}, {ORTH: "ش", NORM: "ش"}], + "ساعتم": [{ORTH: "ساعت", NORM: "ساعت"}, {ORTH: "م", NORM: "م"}], + "سالته": [ + {ORTH: "سال", NORM: "سال"}, + {ORTH: "ت", NORM: "ت"}, + {ORTH: "ه", NORM: "ه"}, + ], + "سالش": [{ORTH: "سال", NORM: "سال"}, {ORTH: "ش", NORM: "ش"}], + "سالهاست": [{ORTH: "سالها", NORM: "سالها"}, {ORTH: "ست", NORM: "ست"}], + "ساله‌اش": [{ORTH: "ساله‌", NORM: "ساله‌"}, {ORTH: "اش", NORM: "اش"}], + "ساکتند": [{ORTH: "ساکت", NORM: "ساکت"}, {ORTH: "ند", NORM: 
"ند"}], + "ساکنند": [{ORTH: "ساکن", NORM: "ساکن"}, {ORTH: "ند", NORM: "ند"}], + "سبزشان": [{ORTH: "سبز", NORM: "سبز"}, {ORTH: "شان", NORM: "شان"}], + "سبیل‌مان": [{ORTH: "سبیل‌", NORM: "سبیل‌"}, {ORTH: "مان", NORM: "مان"}], + "ستم‌هایش": [{ORTH: "ستم‌های", NORM: "ستم‌های"}, {ORTH: "ش", NORM: "ش"}], + "سخنانش": [{ORTH: "سخنان", NORM: "سخنان"}, {ORTH: "ش", NORM: "ش"}], + "سخنانشان": [{ORTH: "سخنان", NORM: "سخنان"}, {ORTH: "شان", NORM: "شان"}], + "سخنتان": [{ORTH: "سخن", NORM: "سخن"}, {ORTH: "تان", NORM: "تان"}], + "سخنش": [{ORTH: "سخن", NORM: "سخن"}, {ORTH: "ش", NORM: "ش"}], + "سخنم": [{ORTH: "سخن", NORM: "سخن"}, {ORTH: "م", NORM: "م"}], + "سردش": [{ORTH: "سرد", NORM: "سرد"}, {ORTH: "ش", NORM: "ش"}], + "سرزمینشان": [{ORTH: "سرزمین", NORM: "سرزمین"}, {ORTH: "شان", NORM: "شان"}], + "سرش": [{ORTH: "سر", NORM: "سر"}, {ORTH: "ش", NORM: "ش"}], + "سرمایه‌دارهاست": [ + {ORTH: "سرمایه‌دارها", NORM: "سرمایه‌دارها"}, + {ORTH: "ست", NORM: "ست"}, + ], + "سرنوشتش": [{ORTH: "سرنوشت", NORM: "سرنوشت"}, {ORTH: "ش", NORM: "ش"}], + "سرنوشتشان": [{ORTH: "سرنوشت", NORM: "سرنوشت"}, {ORTH: "شان", NORM: "شان"}], + "سروتهش": [{ORTH: "سروته", NORM: "سروته"}, {ORTH: "ش", NORM: "ش"}], + "سرچشمه‌اش": [{ORTH: "سرچشمه‌", NORM: "سرچشمه‌"}, {ORTH: "اش", NORM: "اش"}], + "سقمش": [{ORTH: "سقم", NORM: "سقم"}, {ORTH: "ش", NORM: "ش"}], + "سنش": [{ORTH: "سن", NORM: "سن"}, {ORTH: "ش", NORM: "ش"}], + "سپاهش": [{ORTH: "سپاه", NORM: "سپاه"}, {ORTH: "ش", NORM: "ش"}], + "سیاسیشان": [{ORTH: "سیاسی", NORM: "سیاسی"}, {ORTH: "شان", NORM: "شان"}], + "سیاه‌چاله‌هاست": [ + {ORTH: "سیاه‌چاله‌ها", NORM: "سیاه‌چاله‌ها"}, + {ORTH: "ست", NORM: "ست"}, + ], + "شاخه‌هایشان": [{ORTH: "شاخه‌های", NORM: "شاخه‌های"}, {ORTH: "شان", NORM: "شان"}], + "شالوده‌اش": [{ORTH: "شالوده‌", NORM: "شالوده‌"}, {ORTH: "اش", NORM: "اش"}], + "شانه‌هایش": [{ORTH: "شانه‌های", NORM: "شانه‌های"}, {ORTH: "ش", NORM: "ش"}], + "شاهدیم": [{ORTH: "شاهد", NORM: "شاهد"}, {ORTH: "یم", NORM: "یم"}], + "شاهکارهایش": [{ORTH: "شاهکارهای", NORM: "شاهکارهای"}, {ORTH: "ش", NORM: "ش"}], + "شخصیتش": [{ORTH: "شخصیت", NORM: "شخصیت"}, {ORTH: "ش", NORM: "ش"}], + "شدنشان": [{ORTH: "شدن", NORM: "شدن"}, {ORTH: "شان", NORM: "شان"}], + "شرکتیست": [{ORTH: "شرکتی", NORM: "شرکتی"}, {ORTH: "ست", NORM: "ست"}], + "شعارهاشان": [{ORTH: "شعارها", NORM: "شعارها"}, {ORTH: "شان", NORM: "شان"}], + "شعورش": [{ORTH: "شعور", NORM: "شعور"}, {ORTH: "ش", NORM: "ش"}], + "شغلش": [{ORTH: "شغل", NORM: "شغل"}, {ORTH: "ش", NORM: "ش"}], + "شماست": [{ORTH: "شما", NORM: "شما"}, {ORTH: "ست", NORM: "ست"}], + "شمشیرش": [{ORTH: "شمشیر", NORM: "شمشیر"}, {ORTH: "ش", NORM: "ش"}], + "شنیدنش": [{ORTH: "شنیدن", NORM: "شنیدن"}, {ORTH: "ش", NORM: "ش"}], + "شوراست": [{ORTH: "شورا", NORM: "شورا"}, {ORTH: "ست", NORM: "ست"}], + "شومت": [{ORTH: "شوم", NORM: "شوم"}, {ORTH: "ت", NORM: "ت"}], + "شیرینترش": [{ORTH: "شیرینتر", NORM: "شیرینتر"}, {ORTH: "ش", NORM: "ش"}], + "شیطان‌اند": [{ORTH: "شیطان‌", NORM: "شیطان‌"}, {ORTH: "اند", NORM: "اند"}], + "شیوه‌هاست": [{ORTH: "شیوه‌ها", NORM: "شیوه‌ها"}, {ORTH: "ست", NORM: "ست"}], + "صاحبش": [{ORTH: "صاحب", NORM: "صاحب"}, {ORTH: "ش", NORM: "ش"}], + "صحنه‌اش": [{ORTH: "صحنه‌", NORM: "صحنه‌"}, {ORTH: "اش", NORM: "اش"}], + "صدایش": [{ORTH: "صدای", NORM: "صدای"}, {ORTH: "ش", NORM: "ش"}], + "صددند": [{ORTH: "صدد", NORM: "صدد"}, {ORTH: "ند", NORM: "ند"}], + "صندوق‌هاست": [{ORTH: "صندوق‌ها", NORM: "صندوق‌ها"}, {ORTH: "ست", NORM: "ست"}], + "صندوق‌هایش": [{ORTH: "صندوق‌های", NORM: "صندوق‌های"}, {ORTH: "ش", NORM: "ش"}], + "صورتش": [{ORTH: "صورت", NORM: "صورت"}, {ORTH: "ش", NORM: "ش"}], + "ضروری‌اند": [{ORTH: "ضروری‌", 
NORM: "ضروری‌"}, {ORTH: "اند", NORM: "اند"}], + "ضمیرش": [{ORTH: "ضمیر", NORM: "ضمیر"}, {ORTH: "ش", NORM: "ش"}], + "طرفش": [{ORTH: "طرف", NORM: "طرف"}, {ORTH: "ش", NORM: "ش"}], + "طلسمش": [{ORTH: "طلسم", NORM: "طلسم"}, {ORTH: "ش", NORM: "ش"}], + "طوره": [{ORTH: "طور", NORM: "طور"}, {ORTH: "ه", NORM: "ه"}], + "عاشوراست": [{ORTH: "عاشورا", NORM: "عاشورا"}, {ORTH: "ست", NORM: "ست"}], + "عبارتند": [{ORTH: "عبارت", NORM: "عبارت"}, {ORTH: "ند", NORM: "ند"}], + "عزیزانتان": [{ORTH: "عزیزان", NORM: "عزیزان"}, {ORTH: "تان", NORM: "تان"}], + "عزیزانش": [{ORTH: "عزیزان", NORM: "عزیزان"}, {ORTH: "ش", NORM: "ش"}], + "عزیزش": [{ORTH: "عزیز", NORM: "عزیز"}, {ORTH: "ش", NORM: "ش"}], + "عشرت‌طلبی‌اش": [ + {ORTH: "عشرت‌طلبی‌", NORM: "عشرت‌طلبی‌"}, + {ORTH: "اش", NORM: "اش"}, + ], + "عقبیم": [{ORTH: "عقب", NORM: "عقب"}, {ORTH: "یم", NORM: "یم"}], + "علاقه‌اش": [{ORTH: "علاقه‌", NORM: "علاقه‌"}, {ORTH: "اش", NORM: "اش"}], + "علمیمان": [{ORTH: "علمی", NORM: "علمی"}, {ORTH: "مان", NORM: "مان"}], + "عمرش": [{ORTH: "عمر", NORM: "عمر"}, {ORTH: "ش", NORM: "ش"}], + "عمرشان": [{ORTH: "عمر", NORM: "عمر"}, {ORTH: "شان", NORM: "شان"}], + "عملش": [{ORTH: "عمل", NORM: "عمل"}, {ORTH: "ش", NORM: "ش"}], + "عملی‌اند": [{ORTH: "عملی‌", NORM: "عملی‌"}, {ORTH: "اند", NORM: "اند"}], + "عمویت": [{ORTH: "عموی", NORM: "عموی"}, {ORTH: "ت", NORM: "ت"}], + "عمویش": [{ORTH: "عموی", NORM: "عموی"}, {ORTH: "ش", NORM: "ش"}], + "عمیقش": [{ORTH: "عمیق", NORM: "عمیق"}, {ORTH: "ش", NORM: "ش"}], + "عواملش": [{ORTH: "عوامل", NORM: "عوامل"}, {ORTH: "ش", NORM: "ش"}], + "عوضشان": [{ORTH: "عوض", NORM: "عوض"}, {ORTH: "شان", NORM: "شان"}], + "غذایی‌شان": [{ORTH: "غذایی‌", NORM: "غذایی‌"}, {ORTH: "شان", NORM: "شان"}], + "غریبه‌اند": [{ORTH: "غریبه‌", NORM: "غریبه‌"}, {ORTH: "اند", NORM: "اند"}], + "غلامانش": [{ORTH: "غلامان", NORM: "غلامان"}, {ORTH: "ش", NORM: "ش"}], + "غلطهاست": [{ORTH: "غلطها", NORM: "غلطها"}, {ORTH: "ست", NORM: "ست"}], + "فراموشتان": [{ORTH: "فراموش", NORM: "فراموش"}, {ORTH: "تان", NORM: "تان"}], + "فردی‌اند": [{ORTH: "فردی‌", NORM: "فردی‌"}, {ORTH: "اند", NORM: "اند"}], + "فرزندانش": [{ORTH: "فرزندان", NORM: "فرزندان"}, {ORTH: "ش", NORM: "ش"}], + "فرزندش": [{ORTH: "فرزند", NORM: "فرزند"}, {ORTH: "ش", NORM: "ش"}], + "فرم‌هایش": [{ORTH: "فرم‌های", NORM: "فرم‌های"}, {ORTH: "ش", NORM: "ش"}], + "فرهنگی‌مان": [{ORTH: "فرهنگی‌", NORM: "فرهنگی‌"}, {ORTH: "مان", NORM: "مان"}], + "فریادشان": [{ORTH: "فریاد", NORM: "فریاد"}, {ORTH: "شان", NORM: "شان"}], + "فضایی‌شان": [{ORTH: "فضایی‌", NORM: "فضایی‌"}, {ORTH: "شان", NORM: "شان"}], + "فقیرشان": [{ORTH: "فقیر", NORM: "فقیر"}, {ORTH: "شان", NORM: "شان"}], + "فوری‌شان": [{ORTH: "فوری‌", NORM: "فوری‌"}, {ORTH: "شان", NORM: "شان"}], + "قائلند": [{ORTH: "قائل", NORM: "قائل"}, {ORTH: "ند", NORM: "ند"}], + "قائلیم": [{ORTH: "قائل", NORM: "قائل"}, {ORTH: "یم", NORM: "یم"}], + "قادرند": [{ORTH: "قادر", NORM: "قادر"}, {ORTH: "ند", NORM: "ند"}], + "قانونمندش": [{ORTH: "قانونمند", NORM: "قانونمند"}, {ORTH: "ش", NORM: "ش"}], + "قبلند": [{ORTH: "قبل", NORM: "قبل"}, {ORTH: "ند", NORM: "ند"}], + "قبلی‌اش": [{ORTH: "قبلی‌", NORM: "قبلی‌"}, {ORTH: "اش", NORM: "اش"}], + "قبلی‌مان": [{ORTH: "قبلی‌", NORM: "قبلی‌"}, {ORTH: "مان", NORM: "مان"}], + "قدریست": [{ORTH: "قدری", NORM: "قدری"}, {ORTH: "ست", NORM: "ست"}], + "قدمش": [{ORTH: "قدم", NORM: "قدم"}, {ORTH: "ش", NORM: "ش"}], + "قسمتش": [{ORTH: "قسمت", NORM: "قسمت"}, {ORTH: "ش", NORM: "ش"}], + "قضایاست": [{ORTH: "قضایا", NORM: "قضایا"}, {ORTH: "ست", NORM: "ست"}], + "قضیه‌شان": [{ORTH: "قضیه‌", NORM: "قضیه‌"}, {ORTH: "شان", NORM: "شان"}], + "قهرمانهایشان": [ + {ORTH: 
"قهرمانهای", NORM: "قهرمانهای"}, + {ORTH: "شان", NORM: "شان"}, + ], + "قهرمانیش": [{ORTH: "قهرمانی", NORM: "قهرمانی"}, {ORTH: "ش", NORM: "ش"}], + "قومت": [{ORTH: "قوم", NORM: "قوم"}, {ORTH: "ت", NORM: "ت"}], + "لازمه‌اش": [{ORTH: "لازمه‌", NORM: "لازمه‌"}, {ORTH: "اش", NORM: "اش"}], + "مأموریتش": [{ORTH: "مأموریت", NORM: "مأموریت"}, {ORTH: "ش", NORM: "ش"}], + "مأموریتم": [{ORTH: "مأموریت", NORM: "مأموریت"}, {ORTH: "م", NORM: "م"}], + "مأموریت‌اند": [{ORTH: "مأموریت‌", NORM: "مأموریت‌"}, {ORTH: "اند", NORM: "اند"}], + "مادرانشان": [{ORTH: "مادران", NORM: "مادران"}, {ORTH: "شان", NORM: "شان"}], + "مادرت": [{ORTH: "مادر", NORM: "مادر"}, {ORTH: "ت", NORM: "ت"}], + "مادرش": [{ORTH: "مادر", NORM: "مادر"}, {ORTH: "ش", NORM: "ش"}], + "مادرم": [{ORTH: "مادر", NORM: "مادر"}, {ORTH: "م", NORM: "م"}], + "ماست": [{ORTH: "ما", NORM: "ما"}, {ORTH: "ست", NORM: "ست"}], + "مالی‌اش": [{ORTH: "مالی‌", NORM: "مالی‌"}, {ORTH: "اش", NORM: "اش"}], + "ماهیتش": [{ORTH: "ماهیت", NORM: "ماهیت"}, {ORTH: "ش", NORM: "ش"}], + "مایی": [{ORTH: "ما", NORM: "ما"}, {ORTH: "یی", NORM: "یی"}], + "مجازاتش": [{ORTH: "مجازات", NORM: "مجازات"}, {ORTH: "ش", NORM: "ش"}], + "مجبورند": [{ORTH: "مجبور", NORM: "مجبور"}, {ORTH: "ند", NORM: "ند"}], + "محتاجند": [{ORTH: "محتاج", NORM: "محتاج"}, {ORTH: "ند", NORM: "ند"}], + "محرمم": [{ORTH: "محرم", NORM: "محرم"}, {ORTH: "م", NORM: "م"}], + "محلش": [{ORTH: "محل", NORM: "محل"}, {ORTH: "ش", NORM: "ش"}], + "مخالفند": [{ORTH: "مخالف", NORM: "مخالف"}, {ORTH: "ند", NORM: "ند"}], + "مخدرش": [{ORTH: "مخدر", NORM: "مخدر"}, {ORTH: "ش", NORM: "ش"}], + "مدتهاست": [{ORTH: "مدتها", NORM: "مدتها"}, {ORTH: "ست", NORM: "ست"}], + "مدرسه‌ات": [{ORTH: "مدرسه", NORM: "مدرسه"}, {ORTH: "‌ات", NORM: "ات"}], + "مدرکم": [{ORTH: "مدرک", NORM: "مدرک"}, {ORTH: "م", NORM: "م"}], + "مدیرانش": [{ORTH: "مدیران", NORM: "مدیران"}, {ORTH: "ش", NORM: "ش"}], + "مدیونم": [{ORTH: "مدیون", NORM: "مدیون"}, {ORTH: "م", NORM: "م"}], + "مذهبی‌اند": [{ORTH: "مذهبی‌", NORM: "مذهبی‌"}, {ORTH: "اند", NORM: "اند"}], + "مرا": [{ORTH: "م", NORM: "م"}, {ORTH: "را", NORM: "را"}], + "مرادت": [{ORTH: "مراد", NORM: "مراد"}, {ORTH: "ت", NORM: "ت"}], + "مردمشان": [{ORTH: "مردم", NORM: "مردم"}, {ORTH: "شان", NORM: "شان"}], + "مردمند": [{ORTH: "مردم", NORM: "مردم"}, {ORTH: "ند", NORM: "ند"}], + "مردم‌اند": [{ORTH: "مردم‌", NORM: "مردم‌"}, {ORTH: "اند", NORM: "اند"}], + "مرزشان": [{ORTH: "مرز", NORM: "مرز"}, {ORTH: "شان", NORM: "شان"}], + "مرزهاشان": [{ORTH: "مرزها", NORM: "مرزها"}, {ORTH: "شان", NORM: "شان"}], + "مزدورش": [{ORTH: "مزدور", NORM: "مزدور"}, {ORTH: "ش", NORM: "ش"}], + "مسئولیتش": [{ORTH: "مسئولیت", NORM: "مسئولیت"}, {ORTH: "ش", NORM: "ش"}], + "مسائلش": [{ORTH: "مسائل", NORM: "مسائل"}, {ORTH: "ش", NORM: "ش"}], + "مستحضرید": [{ORTH: "مستحضر", NORM: "مستحضر"}, {ORTH: "ید", NORM: "ید"}], + "مسلمانم": [{ORTH: "مسلمان", NORM: "مسلمان"}, {ORTH: "م", NORM: "م"}], + "مسلمانند": [{ORTH: "مسلمان", NORM: "مسلمان"}, {ORTH: "ند", NORM: "ند"}], + "مشتریانش": [{ORTH: "مشتریان", NORM: "مشتریان"}, {ORTH: "ش", NORM: "ش"}], + "مشتهایمان": [{ORTH: "مشتهای", NORM: "مشتهای"}, {ORTH: "مان", NORM: "مان"}], + "مشخصند": [{ORTH: "مشخص", NORM: "مشخص"}, {ORTH: "ند", NORM: "ند"}], + "مشغولند": [{ORTH: "مشغول", NORM: "مشغول"}, {ORTH: "ند", NORM: "ند"}], + "مشغولیم": [{ORTH: "مشغول", NORM: "مشغول"}, {ORTH: "یم", NORM: "یم"}], + "مشهورش": [{ORTH: "مشهور", NORM: "مشهور"}, {ORTH: "ش", NORM: "ش"}], + "مشکلاتشان": [{ORTH: "مشکلات", NORM: "مشکلات"}, {ORTH: "شان", NORM: "شان"}], + "مشکلم": [{ORTH: "مشکل", NORM: "مشکل"}, {ORTH: "م", NORM: "م"}], + "مطمئنم": [{ORTH: "مطمئن", NORM: 
"مطمئن"}, {ORTH: "م", NORM: "م"}], + "معامله‌مان": [{ORTH: "معامله‌", NORM: "معامله‌"}, {ORTH: "مان", NORM: "مان"}], + "معتقدم": [{ORTH: "معتقد", NORM: "معتقد"}, {ORTH: "م", NORM: "م"}], + "معتقدند": [{ORTH: "معتقد", NORM: "معتقد"}, {ORTH: "ند", NORM: "ند"}], + "معتقدیم": [{ORTH: "معتقد", NORM: "معتقد"}, {ORTH: "یم", NORM: "یم"}], + "معرفی‌اش": [{ORTH: "معرفی‌", NORM: "معرفی‌"}, {ORTH: "اش", NORM: "اش"}], + "معروفش": [{ORTH: "معروف", NORM: "معروف"}, {ORTH: "ش", NORM: "ش"}], + "معضلاتمان": [{ORTH: "معضلات", NORM: "معضلات"}, {ORTH: "مان", NORM: "مان"}], + "معلمش": [{ORTH: "معلم", NORM: "معلم"}, {ORTH: "ش", NORM: "ش"}], + "معنایش": [{ORTH: "معنای", NORM: "معنای"}, {ORTH: "ش", NORM: "ش"}], + "مغزشان": [{ORTH: "مغز", NORM: "مغز"}, {ORTH: "شان", NORM: "شان"}], + "مفیدند": [{ORTH: "مفید", NORM: "مفید"}, {ORTH: "ند", NORM: "ند"}], + "مقابلش": [{ORTH: "مقابل", NORM: "مقابل"}, {ORTH: "ش", NORM: "ش"}], + "مقاله‌اش": [{ORTH: "مقاله‌", NORM: "مقاله‌"}, {ORTH: "اش", NORM: "اش"}], + "مقدمش": [{ORTH: "مقدم", NORM: "مقدم"}, {ORTH: "ش", NORM: "ش"}], + "مقرش": [{ORTH: "مقر", NORM: "مقر"}, {ORTH: "ش", NORM: "ش"}], + "مقصدشان": [{ORTH: "مقصد", NORM: "مقصد"}, {ORTH: "شان", NORM: "شان"}], + "مقصرند": [{ORTH: "مقصر", NORM: "مقصر"}, {ORTH: "ند", NORM: "ند"}], + "مقصودتان": [{ORTH: "مقصود", NORM: "مقصود"}, {ORTH: "تان", NORM: "تان"}], + "ملاقاتهایش": [{ORTH: "ملاقاتهای", NORM: "ملاقاتهای"}, {ORTH: "ش", NORM: "ش"}], + "ممکنشان": [{ORTH: "ممکن", NORM: "ممکن"}, {ORTH: "شان", NORM: "شان"}], + "ممیزیهاست": [{ORTH: "ممیزیها", NORM: "ممیزیها"}, {ORTH: "ست", NORM: "ست"}], + "منظورم": [{ORTH: "منظور", NORM: "منظور"}, {ORTH: "م", NORM: "م"}], + "منی": [{ORTH: "من", NORM: "من"}, {ORTH: "ی", NORM: "ی"}], + "منید": [{ORTH: "من", NORM: "من"}, {ORTH: "ید", NORM: "ید"}], + "مهربانش": [{ORTH: "مهربان", NORM: "مهربان"}, {ORTH: "ش", NORM: "ش"}], + "مهم‌اند": [{ORTH: "مهم‌", NORM: "مهم‌"}, {ORTH: "اند", NORM: "اند"}], + "مواجهند": [{ORTH: "مواجه", NORM: "مواجه"}, {ORTH: "ند", NORM: "ند"}], + "مواجه‌اند": [{ORTH: "مواجه‌", NORM: "مواجه‌"}, {ORTH: "اند", NORM: "اند"}], + "مواخذه‌ات": [{ORTH: "مواخذه", NORM: "مواخذه"}, {ORTH: "‌ات", NORM: "ات"}], + "مواضعشان": [{ORTH: "مواضع", NORM: "مواضع"}, {ORTH: "شان", NORM: "شان"}], + "مواضعمان": [{ORTH: "مواضع", NORM: "مواضع"}, {ORTH: "مان", NORM: "مان"}], + "موافقند": [{ORTH: "موافق", NORM: "موافق"}, {ORTH: "ند", NORM: "ند"}], + "موجوداتش": [{ORTH: "موجودات", NORM: "موجودات"}, {ORTH: "ش", NORM: "ش"}], + "موجودند": [{ORTH: "موجود", NORM: "موجود"}, {ORTH: "ند", NORM: "ند"}], + "موردش": [{ORTH: "مورد", NORM: "مورد"}, {ORTH: "ش", NORM: "ش"}], + "موضعشان": [{ORTH: "موضع", NORM: "موضع"}, {ORTH: "شان", NORM: "شان"}], + "موظفند": [{ORTH: "موظف", NORM: "موظف"}, {ORTH: "ند", NORM: "ند"}], + "موهایش": [{ORTH: "موهای", NORM: "موهای"}, {ORTH: "ش", NORM: "ش"}], + "موهایمان": [{ORTH: "موهای", NORM: "موهای"}, {ORTH: "مان", NORM: "مان"}], + "مویم": [{ORTH: "مو", NORM: "مو"}, {ORTH: "یم", NORM: "یم"}], + "ناخرسندند": [{ORTH: "ناخرسند", NORM: "ناخرسند"}, {ORTH: "ند", NORM: "ند"}], + "ناراحتیش": [{ORTH: "ناراحتی", NORM: "ناراحتی"}, {ORTH: "ش", NORM: "ش"}], + "ناراضی‌اند": [{ORTH: "ناراضی‌", NORM: "ناراضی‌"}, {ORTH: "اند", NORM: "اند"}], + "نارواست": [{ORTH: "ناروا", NORM: "ناروا"}, {ORTH: "ست", NORM: "ست"}], + "نازش": [{ORTH: "ناز", NORM: "ناز"}, {ORTH: "ش", NORM: "ش"}], + "نامش": [{ORTH: "نام", NORM: "نام"}, {ORTH: "ش", NORM: "ش"}], + "نامشان": [{ORTH: "نام", NORM: "نام"}, {ORTH: "شان", NORM: "شان"}], + "نامم": [{ORTH: "نام", NORM: "نام"}, {ORTH: "م", NORM: "م"}], + "نامه‌ات": [{ORTH: "نامه", NORM: "نامه"}, {ORTH: 
"‌ات", NORM: "ات"}], + "نامه‌ام": [{ORTH: "نامه‌", NORM: "نامه‌"}, {ORTH: "ام", NORM: "ام"}], + "ناچارم": [{ORTH: "ناچار", NORM: "ناچار"}, {ORTH: "م", NORM: "م"}], + "نخست‌وزیری‌اش": [ + {ORTH: "نخست‌وزیری‌", NORM: "نخست‌وزیری‌"}, + {ORTH: "اش", NORM: "اش"}, + ], + "نزدش": [{ORTH: "نزد", NORM: "نزد"}, {ORTH: "ش", NORM: "ش"}], + "نشانم": [{ORTH: "نشان", NORM: "نشان"}, {ORTH: "م", NORM: "م"}], + "نظرات‌شان": [{ORTH: "نظرات‌", NORM: "نظرات‌"}, {ORTH: "شان", NORM: "شان"}], + "نظرتان": [{ORTH: "نظر", NORM: "نظر"}, {ORTH: "تان", NORM: "تان"}], + "نظرش": [{ORTH: "نظر", NORM: "نظر"}, {ORTH: "ش", NORM: "ش"}], + "نظرشان": [{ORTH: "نظر", NORM: "نظر"}, {ORTH: "شان", NORM: "شان"}], + "نظرم": [{ORTH: "نظر", NORM: "نظر"}, {ORTH: "م", NORM: "م"}], + "نظرهایشان": [{ORTH: "نظرهای", NORM: "نظرهای"}, {ORTH: "شان", NORM: "شان"}], + "نفاقش": [{ORTH: "نفاق", NORM: "نفاق"}, {ORTH: "ش", NORM: "ش"}], + "نفرند": [{ORTH: "نفر", NORM: "نفر"}, {ORTH: "ند", NORM: "ند"}], + "نفوذیند": [{ORTH: "نفوذی", NORM: "نفوذی"}, {ORTH: "ند", NORM: "ند"}], + "نقطه‌نظراتتان": [ + {ORTH: "نقطه‌نظرات", NORM: "نقطه‌نظرات"}, + {ORTH: "تان", NORM: "تان"}, + ], + "نمایشی‌مان": [{ORTH: "نمایشی‌", NORM: "نمایشی‌"}, {ORTH: "مان", NORM: "مان"}], + "نمایندگی‌شان": [ + {ORTH: "نمایندگی‌", NORM: "نمایندگی‌"}, + {ORTH: "شان", NORM: "شان"}, + ], + "نمونه‌اش": [{ORTH: "نمونه‌", NORM: "نمونه‌"}, {ORTH: "اش", NORM: "اش"}], + "نمی‌پذیرندش": [{ORTH: "نمی‌پذیرند", NORM: "نمی‌پذیرند"}, {ORTH: "ش", NORM: "ش"}], + "نوآوری‌اش": [{ORTH: "نوآوری‌", NORM: "نوآوری‌"}, {ORTH: "اش", NORM: "اش"}], + "نوشته‌هایشان": [ + {ORTH: "نوشته‌های", NORM: "نوشته‌های"}, + {ORTH: "شان", NORM: "شان"}, + ], + "نوشته‌هایم": [{ORTH: "نوشته‌ها", NORM: "نوشته‌ها"}, {ORTH: "یم", NORM: "یم"}], + "نکردنشان": [{ORTH: "نکردن", NORM: "نکردن"}, {ORTH: "شان", NORM: "شان"}], + "نگاهداری‌شان": [ + {ORTH: "نگاهداری‌", NORM: "نگاهداری‌"}, + {ORTH: "شان", NORM: "شان"}, + ], + "نگاهش": [{ORTH: "نگاه", NORM: "نگاه"}, {ORTH: "ش", NORM: "ش"}], + "نگرانم": [{ORTH: "نگران", NORM: "نگران"}, {ORTH: "م", NORM: "م"}], + "نگرشهایشان": [{ORTH: "نگرشهای", NORM: "نگرشهای"}, {ORTH: "شان", NORM: "شان"}], + "نیازمندند": [{ORTH: "نیازمند", NORM: "نیازمند"}, {ORTH: "ند", NORM: "ند"}], + "هدفش": [{ORTH: "هدف", NORM: "هدف"}, {ORTH: "ش", NORM: "ش"}], + "همانست": [{ORTH: "همان", NORM: "همان"}, {ORTH: "ست", NORM: "ست"}], + "همراهش": [{ORTH: "همراه", NORM: "همراه"}, {ORTH: "ش", NORM: "ش"}], + "همسرتان": [{ORTH: "همسر", NORM: "همسر"}, {ORTH: "تان", NORM: "تان"}], + "همسرش": [{ORTH: "همسر", NORM: "همسر"}, {ORTH: "ش", NORM: "ش"}], + "همسرم": [{ORTH: "همسر", NORM: "همسر"}, {ORTH: "م", NORM: "م"}], + "همفکرانش": [{ORTH: "همفکران", NORM: "همفکران"}, {ORTH: "ش", NORM: "ش"}], + "همه‌اش": [{ORTH: "همه‌", NORM: "همه‌"}, {ORTH: "اش", NORM: "اش"}], + "همه‌شان": [{ORTH: "همه‌", NORM: "همه‌"}, {ORTH: "شان", NORM: "شان"}], + "همکارانش": [{ORTH: "همکاران", NORM: "همکاران"}, {ORTH: "ش", NORM: "ش"}], + "هم‌نظریم": [{ORTH: "هم‌نظر", NORM: "هم‌نظر"}, {ORTH: "یم", NORM: "یم"}], + "هنرش": [{ORTH: "هنر", NORM: "هنر"}, {ORTH: "ش", NORM: "ش"}], + "هواست": [{ORTH: "هوا", NORM: "هوا"}, {ORTH: "ست", NORM: "ست"}], + "هویتش": [{ORTH: "هویت", NORM: "هویت"}, {ORTH: "ش", NORM: "ش"}], + "وابسته‌اند": [{ORTH: "وابسته‌", NORM: "وابسته‌"}, {ORTH: "اند", NORM: "اند"}], + "واقفند": [{ORTH: "واقف", NORM: "واقف"}, {ORTH: "ند", NORM: "ند"}], + "والدینشان": [{ORTH: "والدین", NORM: "والدین"}, {ORTH: "شان", NORM: "شان"}], + "وجدان‌تان": [{ORTH: "وجدان‌", NORM: "وجدان‌"}, {ORTH: "تان", NORM: "تان"}], + "وجودشان": [{ORTH: "وجود", NORM: "وجود"}, {ORTH: "شان", NORM: 
"شان"}], + "وطنم": [{ORTH: "وطن", NORM: "وطن"}, {ORTH: "م", NORM: "م"}], + "وعده‌اش": [{ORTH: "وعده‌", NORM: "وعده‌"}, {ORTH: "اش", NORM: "اش"}], + "وقتمان": [{ORTH: "وقت", NORM: "وقت"}, {ORTH: "مان", NORM: "مان"}], + "ولادتش": [{ORTH: "ولادت", NORM: "ولادت"}, {ORTH: "ش", NORM: "ش"}], + "پایانش": [{ORTH: "پایان", NORM: "پایان"}, {ORTH: "ش", NORM: "ش"}], + "پایش": [{ORTH: "پای", NORM: "پای"}, {ORTH: "ش", NORM: "ش"}], + "پایین‌ترند": [{ORTH: "پایین‌تر", NORM: "پایین‌تر"}, {ORTH: "ند", NORM: "ند"}], + "پدرت": [{ORTH: "پدر", NORM: "پدر"}, {ORTH: "ت", NORM: "ت"}], + "پدرش": [{ORTH: "پدر", NORM: "پدر"}, {ORTH: "ش", NORM: "ش"}], + "پدرشان": [{ORTH: "پدر", NORM: "پدر"}, {ORTH: "شان", NORM: "شان"}], + "پدرم": [{ORTH: "پدر", NORM: "پدر"}, {ORTH: "م", NORM: "م"}], + "پربارش": [{ORTH: "پربار", NORM: "پربار"}, {ORTH: "ش", NORM: "ش"}], + "پروردگارت": [{ORTH: "پروردگار", NORM: "پروردگار"}, {ORTH: "ت", NORM: "ت"}], + "پسرتان": [{ORTH: "پسر", NORM: "پسر"}, {ORTH: "تان", NORM: "تان"}], + "پسرش": [{ORTH: "پسر", NORM: "پسر"}, {ORTH: "ش", NORM: "ش"}], + "پسرعمویش": [{ORTH: "پسرعموی", NORM: "پسرعموی"}, {ORTH: "ش", NORM: "ش"}], + "پسر‌عمویت": [{ORTH: "پسر‌عموی", NORM: "پسر‌عموی"}, {ORTH: "ت", NORM: "ت"}], + "پشتش": [{ORTH: "پشت", NORM: "پشت"}, {ORTH: "ش", NORM: "ش"}], + "پشیمونی": [{ORTH: "پشیمون", NORM: "پشیمون"}, {ORTH: "ی", NORM: "ی"}], + "پولش": [{ORTH: "پول", NORM: "پول"}, {ORTH: "ش", NORM: "ش"}], + "پژوهش‌هایش": [{ORTH: "پژوهش‌های", NORM: "پژوهش‌های"}, {ORTH: "ش", NORM: "ش"}], + "پیامبرش": [{ORTH: "پیامبر", NORM: "پیامبر"}, {ORTH: "ش", NORM: "ش"}], + "پیامبری": [{ORTH: "پیامبر", NORM: "پیامبر"}, {ORTH: "ی", NORM: "ی"}], + "پیامش": [{ORTH: "پیام", NORM: "پیام"}, {ORTH: "ش", NORM: "ش"}], + "پیداست": [{ORTH: "پیدا", NORM: "پیدا"}, {ORTH: "ست", NORM: "ست"}], + "پیراهنش": [{ORTH: "پیراهن", NORM: "پیراهن"}, {ORTH: "ش", NORM: "ش"}], + "پیروانش": [{ORTH: "پیروان", NORM: "پیروان"}, {ORTH: "ش", NORM: "ش"}], + "پیشانی‌اش": [{ORTH: "پیشانی‌", NORM: "پیشانی‌"}, {ORTH: "اش", NORM: "اش"}], + "پیمانت": [{ORTH: "پیمان", NORM: "پیمان"}, {ORTH: "ت", NORM: "ت"}], + "پیوندشان": [{ORTH: "پیوند", NORM: "پیوند"}, {ORTH: "شان", NORM: "شان"}], + "چاپش": [{ORTH: "چاپ", NORM: "چاپ"}, {ORTH: "ش", NORM: "ش"}], + "چت": [{ORTH: "چ", NORM: "چ"}, {ORTH: "ت", NORM: "ت"}], + "چته": [{ORTH: "چ", NORM: "چ"}, {ORTH: "ت", NORM: "ت"}, {ORTH: "ه", NORM: "ه"}], + "چرخ‌هایش": [{ORTH: "چرخ‌های", NORM: "چرخ‌های"}, {ORTH: "ش", NORM: "ش"}], + "چشمم": [{ORTH: "چشم", NORM: "چشم"}, {ORTH: "م", NORM: "م"}], + "چشمهایش": [{ORTH: "چشمهای", NORM: "چشمهای"}, {ORTH: "ش", NORM: "ش"}], + "چشمهایشان": [{ORTH: "چشمهای", NORM: "چشمهای"}, {ORTH: "شان", NORM: "شان"}], + "چمنم": [{ORTH: "چمن", NORM: "چمن"}, {ORTH: "م", NORM: "م"}], + "چهره‌اش": [{ORTH: "چهره‌", NORM: "چهره‌"}, {ORTH: "اش", NORM: "اش"}], + "چکاره‌اند": [{ORTH: "چکاره‌", NORM: "چکاره‌"}, {ORTH: "اند", NORM: "اند"}], + "چیزهاست": [{ORTH: "چیزها", NORM: "چیزها"}, {ORTH: "ست", NORM: "ست"}], + "چیزهایش": [{ORTH: "چیزهای", NORM: "چیزهای"}, {ORTH: "ش", NORM: "ش"}], + "چیزیست": [{ORTH: "چیزی", NORM: "چیزی"}, {ORTH: "ست", NORM: "ست"}], + "چیست": [{ORTH: "چی", NORM: "چی"}, {ORTH: "ست", NORM: "ست"}], + "کارش": [{ORTH: "کار", NORM: "کار"}, {ORTH: "ش", NORM: "ش"}], + "کارشان": [{ORTH: "کار", NORM: "کار"}, {ORTH: "شان", NORM: "شان"}], + "کارم": [{ORTH: "کار", NORM: "کار"}, {ORTH: "م", NORM: "م"}], + "کارند": [{ORTH: "کار", NORM: "کار"}, {ORTH: "ند", NORM: "ند"}], + "کارهایم": [{ORTH: "کارها", NORM: "کارها"}, {ORTH: "یم", NORM: "یم"}], + "کافیست": [{ORTH: "کافی", NORM: "کافی"}, {ORTH: "ست", NORM: "ست"}], + 
"کتابخانه‌اش": [{ORTH: "کتابخانه‌", NORM: "کتابخانه‌"}, {ORTH: "اش", NORM: "اش"}], + "کتابش": [{ORTH: "کتاب", NORM: "کتاب"}, {ORTH: "ش", NORM: "ش"}], + "کتابهاشان": [{ORTH: "کتابها", NORM: "کتابها"}, {ORTH: "شان", NORM: "شان"}], + "کجاست": [{ORTH: "کجا", NORM: "کجا"}, {ORTH: "ست", NORM: "ست"}], + "کدورتهایشان": [{ORTH: "کدورتهای", NORM: "کدورتهای"}, {ORTH: "شان", NORM: "شان"}], + "کردنش": [{ORTH: "کردن", NORM: "کردن"}, {ORTH: "ش", NORM: "ش"}], + "کرم‌خورده‌اش": [ + {ORTH: "کرم‌خورده‌", NORM: "کرم‌خورده‌"}, + {ORTH: "اش", NORM: "اش"}, + ], + "کشش": [{ORTH: "کش", NORM: "کش"}, {ORTH: "ش", NORM: "ش"}], + "کشورش": [{ORTH: "کشور", NORM: "کشور"}, {ORTH: "ش", NORM: "ش"}], + "کشورشان": [{ORTH: "کشور", NORM: "کشور"}, {ORTH: "شان", NORM: "شان"}], + "کشورمان": [{ORTH: "کشور", NORM: "کشور"}, {ORTH: "مان", NORM: "مان"}], + "کشورهاست": [{ORTH: "کشورها", NORM: "کشورها"}, {ORTH: "ست", NORM: "ست"}], + "کلیشه‌هاست": [{ORTH: "کلیشه‌ها", NORM: "کلیشه‌ها"}, {ORTH: "ست", NORM: "ست"}], + "کمبودهاست": [{ORTH: "کمبودها", NORM: "کمبودها"}, {ORTH: "ست", NORM: "ست"}], + "کمتره": [{ORTH: "کمتر", NORM: "کمتر"}, {ORTH: "ه", NORM: "ه"}], + "کمکم": [{ORTH: "کمک", NORM: "کمک"}, {ORTH: "م", NORM: "م"}], + "کنارش": [{ORTH: "کنار", NORM: "کنار"}, {ORTH: "ش", NORM: "ش"}], + "کودکانشان": [{ORTH: "کودکان", NORM: "کودکان"}, {ORTH: "شان", NORM: "شان"}], + "کوچکش": [{ORTH: "کوچک", NORM: "کوچک"}, {ORTH: "ش", NORM: "ش"}], + "کیست": [{ORTH: "کی", NORM: "کی"}, {ORTH: "ست", NORM: "ست"}], + "کیفش": [{ORTH: "کیف", NORM: "کیف"}, {ORTH: "ش", NORM: "ش"}], + "گذشته‌اند": [{ORTH: "گذشته‌", NORM: "گذشته‌"}, {ORTH: "اند", NORM: "اند"}], + "گرانقدرش": [{ORTH: "گرانقدر", NORM: "گرانقدر"}, {ORTH: "ش", NORM: "ش"}], + "گرانقدرشان": [{ORTH: "گرانقدر", NORM: "گرانقدر"}, {ORTH: "شان", NORM: "شان"}], + "گردنتان": [{ORTH: "گردن", NORM: "گردن"}, {ORTH: "تان", NORM: "تان"}], + "گردنش": [{ORTH: "گردن", NORM: "گردن"}, {ORTH: "ش", NORM: "ش"}], + "گرفتارند": [{ORTH: "گرفتار", NORM: "گرفتار"}, {ORTH: "ند", NORM: "ند"}], + "گرفتنت": [{ORTH: "گرفتن", NORM: "گرفتن"}, {ORTH: "ت", NORM: "ت"}], + "گروهند": [{ORTH: "گروه", NORM: "گروه"}, {ORTH: "ند", NORM: "ند"}], + "گروگانهایش": [{ORTH: "گروگانهای", NORM: "گروگانهای"}, {ORTH: "ش", NORM: "ش"}], + "گریمش": [{ORTH: "گریم", NORM: "گریم"}, {ORTH: "ش", NORM: "ش"}], + "گفتارمان": [{ORTH: "گفتار", NORM: "گفتار"}, {ORTH: "مان", NORM: "مان"}], + "گلهایش": [{ORTH: "گلهای", NORM: "گلهای"}, {ORTH: "ش", NORM: "ش"}], + "گلویش": [{ORTH: "گلوی", NORM: "گلوی"}, {ORTH: "ش", NORM: "ش"}], + "گناهت": [{ORTH: "گناه", NORM: "گناه"}, {ORTH: "ت", NORM: "ت"}], + "گوشش": [{ORTH: "گوش", NORM: "گوش"}, {ORTH: "ش", NORM: "ش"}], + "گوشم": [{ORTH: "گوش", NORM: "گوش"}, {ORTH: "م", NORM: "م"}], + "گولش": [{ORTH: "گول", NORM: "گول"}, {ORTH: "ش", NORM: "ش"}], + "یادتان": [{ORTH: "یاد", NORM: "یاد"}, {ORTH: "تان", NORM: "تان"}], + "یادم": [{ORTH: "یاد", NORM: "یاد"}, {ORTH: "م", NORM: "م"}], + "یادمان": [{ORTH: "یاد", NORM: "یاد"}, {ORTH: "مان", NORM: "مان"}], + "یارانش": [{ORTH: "یاران", NORM: "یاران"}, {ORTH: "ش", NORM: "ش"}], } - -_exc.update( - { - "آبرویت": [ - {ORTH: "آبروی", LEMMA: "آبروی", NORM: "آبروی", TAG: "NOUN"}, - {ORTH: "ت", LEMMA: "ت", NORM: "ت", TAG: "NOUN"}, - ], - "آب‌نباتش": [ - {ORTH: "آب‌نبات", LEMMA: "آب‌نبات", NORM: "آب‌نبات", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "آثارش": [ - {ORTH: "آثار", LEMMA: "آثار", NORM: "آثار", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "آخرش": [ - {ORTH: "آخر", LEMMA: "آخر", NORM: "آخر", TAG: "ADV"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: 
"NOUN"}, - ], - "آدمهاست": [ - {ORTH: "آدمها", LEMMA: "آدمها", NORM: "آدمها", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "آرزومندیم": [ - {ORTH: "آرزومند", LEMMA: "آرزومند", NORM: "آرزومند", TAG: "ADJ"}, - {ORTH: "یم", LEMMA: "یم", NORM: "یم", TAG: "VERB"}, - ], - "آزادند": [ - {ORTH: "آزاد", LEMMA: "آزاد", NORM: "آزاد", TAG: "ADJ"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "آسیب‌پذیرند": [ - {ORTH: "آسیب‌پذیر", LEMMA: "آسیب‌پذیر", NORM: "آسیب‌پذیر", TAG: "ADJ"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "آفریده‌اند": [ - {ORTH: "آفریده‌", LEMMA: "آفریده‌", NORM: "آفریده‌", TAG: "NOUN"}, - {ORTH: "اند", LEMMA: "اند", NORM: "اند", TAG: "VERB"}, - ], - "آمدنش": [ - {ORTH: "آمدن", LEMMA: "آمدن", NORM: "آمدن", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "آمریکاست": [ - {ORTH: "آمریکا", LEMMA: "آمریکا", NORM: "آمریکا", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "آنجاست": [ - {ORTH: "آنجا", LEMMA: "آنجا", NORM: "آنجا", TAG: "ADV"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "آنست": [ - {ORTH: "آن", LEMMA: "آن", NORM: "آن", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "آنند": [ - {ORTH: "آن", LEMMA: "آن", NORM: "آن", TAG: "NOUN"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "آن‌هاست": [ - {ORTH: "آن‌ها", LEMMA: "آن‌ها", NORM: "آن‌ها", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "آپاداناست": [ - {ORTH: "آپادانا", LEMMA: "آپادانا", NORM: "آپادانا", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "اجتماعی‌مان": [ - {ORTH: "اجتماعی‌", LEMMA: "اجتماعی‌", NORM: "اجتماعی‌", TAG: "ADJ"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "اجدادت": [ - {ORTH: "اجداد", LEMMA: "اجداد", NORM: "اجداد", TAG: "NOUN"}, - {ORTH: "ت", LEMMA: "ت", NORM: "ت", TAG: "NOUN"}, - ], - "اجدادش": [ - {ORTH: "اجداد", LEMMA: "اجداد", NORM: "اجداد", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "اجدادی‌شان": [ - {ORTH: "اجدادی‌", LEMMA: "اجدادی‌", NORM: "اجدادی‌", TAG: "ADJ"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "اجراست": [ - {ORTH: "اجرا", LEMMA: "اجرا", NORM: "اجرا", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "اختیارش": [ - {ORTH: "اختیار", LEMMA: "اختیار", NORM: "اختیار", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "اخلاقشان": [ - {ORTH: "اخلاق", LEMMA: "اخلاق", NORM: "اخلاق", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "ادعایمان": [ - {ORTH: "ادعای", LEMMA: "ادعای", NORM: "ادعای", TAG: "NOUN"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "اذیتش": [ - {ORTH: "اذیت", LEMMA: "اذیت", NORM: "اذیت", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "اراده‌اش": [ - {ORTH: "اراده‌", LEMMA: "اراده‌", NORM: "اراده‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "ارتباطش": [ - {ORTH: "ارتباط", LEMMA: "ارتباط", NORM: "ارتباط", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "ارتباطمان": [ - {ORTH: "ارتباط", LEMMA: "ارتباط", NORM: "ارتباط", TAG: "NOUN"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "ارزشهاست": [ - {ORTH: "ارزشها", LEMMA: "ارزشها", NORM: "ارزشها", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "ارزی‌اش": [ - {ORTH: "ارزی‌", LEMMA: "ارزی‌", NORM: "ارزی‌", TAG: "ADJ"}, - 
{ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "اره‌اش": [ - {ORTH: "اره‌", LEMMA: "اره‌", NORM: "اره‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "ازش": [ - {ORTH: "از", LEMMA: "از", NORM: "از", TAG: "ADP"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "ازین": [ - {ORTH: "از", LEMMA: "از", NORM: "از", TAG: "ADP"}, - {ORTH: "ین", LEMMA: "ین", NORM: "ین", TAG: "NOUN"}, - ], - "ازین‌هاست": [ - {ORTH: "از", LEMMA: "از", NORM: "از", TAG: "ADP"}, - {ORTH: "ین‌ها", LEMMA: "ین‌ها", NORM: "ین‌ها", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "استخوانند": [ - {ORTH: "استخوان", LEMMA: "استخوان", NORM: "استخوان", TAG: "NOUN"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "اسلامند": [ - {ORTH: "اسلام", LEMMA: "اسلام", NORM: "اسلام", TAG: "NOUN"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "اسلامی‌اند": [ - {ORTH: "اسلامی‌", LEMMA: "اسلامی‌", NORM: "اسلامی‌", TAG: "ADJ"}, - {ORTH: "اند", LEMMA: "اند", NORM: "اند", TAG: "VERB"}, - ], - "اسلحه‌هایشان": [ - {ORTH: "اسلحه‌های", LEMMA: "اسلحه‌های", NORM: "اسلحه‌های", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "اسمت": [ - {ORTH: "اسم", LEMMA: "اسم", NORM: "اسم", TAG: "NOUN"}, - {ORTH: "ت", LEMMA: "ت", NORM: "ت", TAG: "NOUN"}, - ], - "اسمش": [ - {ORTH: "اسم", LEMMA: "اسم", NORM: "اسم", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "اشتباهند": [ - {ORTH: "اشتباه", LEMMA: "اشتباه", NORM: "اشتباه", TAG: "NOUN"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "اصلش": [ - {ORTH: "اصل", LEMMA: "اصل", NORM: "اصل", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "اطاقش": [ - {ORTH: "اطاق", LEMMA: "اطاق", NORM: "اطاق", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "اعتقادند": [ - {ORTH: "اعتقاد", LEMMA: "اعتقاد", NORM: "اعتقاد", TAG: "NOUN"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "اعلایش": [ - {ORTH: "اعلای", LEMMA: "اعلای", NORM: "اعلای", TAG: "ADJ"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "افتراست": [ - {ORTH: "افترا", LEMMA: "افترا", NORM: "افترا", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "افطارت": [ - {ORTH: "افطار", LEMMA: "افطار", NORM: "افطار", TAG: "NOUN"}, - {ORTH: "ت", LEMMA: "ت", NORM: "ت", TAG: "NOUN"}, - ], - "اقوامش": [ - {ORTH: "اقوام", LEMMA: "اقوام", NORM: "اقوام", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "امروزیش": [ - {ORTH: "امروزی", LEMMA: "امروزی", NORM: "امروزی", TAG: "ADJ"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "اموالش": [ - {ORTH: "اموال", LEMMA: "اموال", NORM: "اموال", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "امیدوارند": [ - {ORTH: "امیدوار", LEMMA: "امیدوار", NORM: "امیدوار", TAG: "ADJ"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "امیدواریم": [ - {ORTH: "امیدوار", LEMMA: "امیدوار", NORM: "امیدوار", TAG: "ADJ"}, - {ORTH: "یم", LEMMA: "یم", NORM: "یم", TAG: "VERB"}, - ], - "انتخابهایم": [ - {ORTH: "انتخابها", LEMMA: "انتخابها", NORM: "انتخابها", TAG: "NOUN"}, - {ORTH: "یم", LEMMA: "یم", NORM: "یم", TAG: "NOUN"}, - ], - "انتظارم": [ - {ORTH: "انتظار", LEMMA: "انتظار", NORM: "انتظار", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "انجمنم": [ - {ORTH: "انجمن", LEMMA: "انجمن", NORM: "انجمن", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "اندرش": [ - {ORTH: "اندر", LEMMA: "اندر", 
NORM: "اندر", TAG: "ADP"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "انشایش": [ - {ORTH: "انشای", LEMMA: "انشای", NORM: "انشای", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "انگشتشان": [ - {ORTH: "انگشت", LEMMA: "انگشت", NORM: "انگشت", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "انگشتهایش": [ - {ORTH: "انگشتهای", LEMMA: "انگشتهای", NORM: "انگشتهای", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "اهمیتشان": [ - {ORTH: "اهمیت", LEMMA: "اهمیت", NORM: "اهمیت", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "اهمیتند": [ - {ORTH: "اهمیت", LEMMA: "اهمیت", NORM: "اهمیت", TAG: "NOUN"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "اوایلش": [ - {ORTH: "اوایل", LEMMA: "اوایل", NORM: "اوایل", TAG: "ADV"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "اوست": [ - {ORTH: "او", LEMMA: "او", NORM: "او", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "اولش": [ - {ORTH: "اول", LEMMA: "اول", NORM: "اول", TAG: "ADV"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "اولشان": [ - {ORTH: "اول", LEMMA: "اول", NORM: "اول", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "اولم": [ - {ORTH: "اول", LEMMA: "اول", NORM: "اول", TAG: "ADJ"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "اکثرشان": [ - {ORTH: "اکثر", LEMMA: "اکثر", NORM: "اکثر", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "ایتالیاست": [ - {ORTH: "ایتالیا", LEMMA: "ایتالیا", NORM: "ایتالیا", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "ایرانی‌اش": [ - {ORTH: "ایرانی‌", LEMMA: "ایرانی‌", NORM: "ایرانی‌", TAG: "ADJ"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "اینجاست": [ - {ORTH: "اینجا", LEMMA: "اینجا", NORM: "اینجا", TAG: "ADV"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "این‌هاست": [ - {ORTH: "این‌ها", LEMMA: "این‌ها", NORM: "این‌ها", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "بابات": [ - {ORTH: "بابا", LEMMA: "بابا", NORM: "بابا", TAG: "NOUN"}, - {ORTH: "ت", LEMMA: "ت", NORM: "ت", TAG: "NOUN"}, - ], - "بارش": [ - {ORTH: "بار", LEMMA: "بار", NORM: "بار", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "بازیگرانش": [ - {ORTH: "بازیگران", LEMMA: "بازیگران", NORM: "بازیگران", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "بازیگرمان": [ - {ORTH: "بازیگر", LEMMA: "بازیگر", NORM: "بازیگر", TAG: "NOUN"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "بازیگرهایم": [ - {ORTH: "بازیگرها", LEMMA: "بازیگرها", NORM: "بازیگرها", TAG: "NOUN"}, - {ORTH: "یم", LEMMA: "یم", NORM: "یم", TAG: "NOUN"}, - ], - "بازی‌اش": [ - {ORTH: "بازی‌", LEMMA: "بازی‌", NORM: "بازی‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "بالاست": [ - {ORTH: "بالا", LEMMA: "بالا", NORM: "بالا", TAG: "ADV"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "باورند": [ - {ORTH: "باور", LEMMA: "باور", NORM: "باور", TAG: "NOUN"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "بجاست": [ - {ORTH: "بجا", LEMMA: "بجا", NORM: "بجا", TAG: "ADJ"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "بدان": [ - {ORTH: "ب", LEMMA: "ب", NORM: "ب", TAG: "ADP"}, - {ORTH: "دان", LEMMA: "دان", NORM: "دان", TAG: "NOUN"}, - ], - "بدش": [ - {ORTH: "بد", LEMMA: "بد", NORM: "بد", TAG: "ADJ"}, - {ORTH: "ش", LEMMA: 
"ش", NORM: "ش", TAG: "NOUN"}, - ], - "بدشان": [ - {ORTH: "بد", LEMMA: "بد", NORM: "بد", TAG: "ADJ"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "بدنم": [ - {ORTH: "بدن", LEMMA: "بدن", NORM: "بدن", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "بدهی‌ات": [ - {ORTH: "بدهی‌", LEMMA: "بدهی‌", NORM: "بدهی‌", TAG: "NOUN"}, - {ORTH: "ات", LEMMA: "ات", NORM: "ات", TAG: "NOUN"}, - ], - "بدین": [ - {ORTH: "ب", LEMMA: "ب", NORM: "ب", TAG: "ADP"}, - {ORTH: "دین", LEMMA: "دین", NORM: "دین", TAG: "NOUN"}, - ], - "برابرش": [ - {ORTH: "برابر", LEMMA: "برابر", NORM: "برابر", TAG: "ADP"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "برادرت": [ - {ORTH: "برادر", LEMMA: "برادر", NORM: "برادر", TAG: "NOUN"}, - {ORTH: "ت", LEMMA: "ت", NORM: "ت", TAG: "NOUN"}, - ], - "برادرش": [ - {ORTH: "برادر", LEMMA: "برادر", NORM: "برادر", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "برایت": [ - {ORTH: "برای", LEMMA: "برای", NORM: "برای", TAG: "ADP"}, - {ORTH: "ت", LEMMA: "ت", NORM: "ت", TAG: "NOUN"}, - ], - "برایتان": [ - {ORTH: "برای", LEMMA: "برای", NORM: "برای", TAG: "ADP"}, - {ORTH: "تان", LEMMA: "تان", NORM: "تان", TAG: "NOUN"}, - ], - "برایش": [ - {ORTH: "برای", LEMMA: "برای", NORM: "برای", TAG: "ADP"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "برایشان": [ - {ORTH: "برای", LEMMA: "برای", NORM: "برای", TAG: "ADP"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "برایم": [ - {ORTH: "برای", LEMMA: "برای", NORM: "برای", TAG: "ADP"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "برایمان": [ - {ORTH: "برای", LEMMA: "برای", NORM: "برای", TAG: "ADP"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "برخوردارند": [ - {ORTH: "برخوردار", LEMMA: "برخوردار", NORM: "برخوردار", TAG: "ADJ"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "برنامه‌سازهاست": [ - { - ORTH: "برنامه‌سازها", - LEMMA: "برنامه‌سازها", - NORM: "برنامه‌سازها", - TAG: "NOUN", - }, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "برهمش": [ - {ORTH: "برهم", LEMMA: "برهم", NORM: "برهم", TAG: "ADJ"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "برهنه‌اش": [ - {ORTH: "برهنه‌", LEMMA: "برهنه‌", NORM: "برهنه‌", TAG: "ADJ"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "برگهایش": [ - {ORTH: "برگها", LEMMA: "برگها", NORM: "برگها", TAG: "NOUN"}, - {ORTH: "یش", LEMMA: "یش", NORM: "یش", TAG: "NOUN"}, - ], - "برین": [ - {ORTH: "بر", LEMMA: "بر", NORM: "بر", TAG: "ADP"}, - {ORTH: "ین", LEMMA: "ین", NORM: "ین", TAG: "NOUN"}, - ], - "بزرگش": [ - {ORTH: "بزرگ", LEMMA: "بزرگ", NORM: "بزرگ", TAG: "ADJ"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "بزرگ‌تری": [ - {ORTH: "بزرگ‌تر", LEMMA: "بزرگ‌تر", NORM: "بزرگ‌تر", TAG: "ADJ"}, - {ORTH: "ی", LEMMA: "ی", NORM: "ی", TAG: "VERB"}, - ], - "بساطش": [ - {ORTH: "بساط", LEMMA: "بساط", NORM: "بساط", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "بعدش": [ - {ORTH: "بعد", LEMMA: "بعد", NORM: "بعد", TAG: "ADJ"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "بعضیهایشان": [ - {ORTH: "بعضیهای", LEMMA: "بعضیهای", NORM: "بعضیهای", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "بعضی‌شان": [ - {ORTH: "بعضی", LEMMA: "بعضی", NORM: "بعضی", TAG: "NOUN"}, - {ORTH: "‌شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "بقیه‌اش": [ - {ORTH: "بقیه‌", LEMMA: "بقیه‌", NORM: "بقیه‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "بلندش": [ - 
{ORTH: "بلند", LEMMA: "بلند", NORM: "بلند", TAG: "ADJ"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "بناگوشش": [ - {ORTH: "بناگوش", LEMMA: "بناگوش", NORM: "بناگوش", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "بنظرم": [ - {ORTH: "ب", LEMMA: "ب", NORM: "ب", TAG: "ADP"}, - {ORTH: "نظر", LEMMA: "نظر", NORM: "نظر", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "بهت": [ - {ORTH: "به", LEMMA: "به", NORM: "به", TAG: "ADP"}, - {ORTH: "ت", LEMMA: "ت", NORM: "ت", TAG: "NOUN"}, - ], - "بهترش": [ - {ORTH: "بهتر", LEMMA: "بهتر", NORM: "بهتر", TAG: "ADJ"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "بهترم": [ - {ORTH: "بهتر", LEMMA: "بهتر", NORM: "بهتر", TAG: "ADJ"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "VERB"}, - ], - "بهتری": [ - {ORTH: "بهتر", LEMMA: "بهتر", NORM: "بهتر", TAG: "ADJ"}, - {ORTH: "ی", LEMMA: "ی", NORM: "ی", TAG: "VERB"}, - ], - "بهش": [ - {ORTH: "به", LEMMA: "به", NORM: "به", TAG: "ADP"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "به‌شان": [ - {ORTH: "به‌", LEMMA: "به‌", NORM: "به‌", TAG: "ADP"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "بودمش": [ - {ORTH: "بودم", LEMMA: "بودم", NORM: "بودم", TAG: "VERB"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "بودنش": [ - {ORTH: "بودن", LEMMA: "بودن", NORM: "بودن", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "بودن‌شان": [ - {ORTH: "بودن‌", LEMMA: "بودن‌", NORM: "بودن‌", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "بوستانش": [ - {ORTH: "بوستان", LEMMA: "بوستان", NORM: "بوستان", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "بویش": [ - {ORTH: "بو", LEMMA: "بو", NORM: "بو", TAG: "NOUN"}, - {ORTH: "یش", LEMMA: "یش", NORM: "یش", TAG: "NOUN"}, - ], - "بچه‌اش": [ - {ORTH: "بچه‌", LEMMA: "بچه‌", NORM: "بچه‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "بچه‌م": [ - {ORTH: "بچه‌", LEMMA: "بچه‌", NORM: "بچه‌", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "بچه‌هایش": [ - {ORTH: "بچه‌های", LEMMA: "بچه‌های", NORM: "بچه‌های", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "بیانیه‌شان": [ - {ORTH: "بیانیه‌", LEMMA: "بیانیه‌", NORM: "بیانیه‌", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "بیدارم": [ - {ORTH: "بیدار", LEMMA: "بیدار", NORM: "بیدار", TAG: "ADJ"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "بیناتری": [ - {ORTH: "بیناتر", LEMMA: "بیناتر", NORM: "بیناتر", TAG: "ADJ"}, - {ORTH: "ی", LEMMA: "ی", NORM: "ی", TAG: "VERB"}, - ], - "بی‌اطلاعند": [ - {ORTH: "بی‌اطلاع", LEMMA: "بی‌اطلاع", NORM: "بی‌اطلاع", TAG: "ADJ"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "بی‌اطلاعید": [ - {ORTH: "بی‌اطلاع", LEMMA: "بی‌اطلاع", NORM: "بی‌اطلاع", TAG: "ADJ"}, - {ORTH: "ید", LEMMA: "ید", NORM: "ید", TAG: "VERB"}, - ], - "بی‌بهره‌اند": [ - {ORTH: "بی‌بهره‌", LEMMA: "بی‌بهره‌", NORM: "بی‌بهره‌", TAG: "ADJ"}, - {ORTH: "اند", LEMMA: "اند", NORM: "اند", TAG: "VERB"}, - ], - "بی‌تفاوتند": [ - {ORTH: "بی‌تفاوت", LEMMA: "بی‌تفاوت", NORM: "بی‌تفاوت", TAG: "ADJ"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "بی‌حسابش": [ - {ORTH: "بی‌حساب", LEMMA: "بی‌حساب", NORM: "بی‌حساب", TAG: "ADJ"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "بی‌نیش": [ - {ORTH: "بی‌نی", LEMMA: "بی‌نی", NORM: "بی‌نی", TAG: "ADJ"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "تجربه‌هایم": [ - 
{ORTH: "تجربه‌ها", LEMMA: "تجربه‌ها", NORM: "تجربه‌ها", TAG: "NOUN"}, - {ORTH: "یم", LEMMA: "یم", NORM: "یم", TAG: "NOUN"}, - ], - "تحریم‌هاست": [ - {ORTH: "تحریم‌ها", LEMMA: "تحریم‌ها", NORM: "تحریم‌ها", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "تحولند": [ - {ORTH: "تحول", LEMMA: "تحول", NORM: "تحول", TAG: "NOUN"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "تخیلی‌اش": [ - {ORTH: "تخیلی‌", LEMMA: "تخیلی‌", NORM: "تخیلی‌", TAG: "ADJ"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "ترا": [ - {ORTH: "ت", LEMMA: "ت", NORM: "ت", TAG: "NOUN"}, - {ORTH: "را", LEMMA: "را", NORM: "را", TAG: "PART"}, - ], - "ترسشان": [ - {ORTH: "ترس", LEMMA: "ترس", NORM: "ترس", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "ترکش": [ - {ORTH: "ترک", LEMMA: "ترک", NORM: "ترک", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "تشنه‌ت": [ - {ORTH: "تشنه‌", LEMMA: "تشنه‌", NORM: "تشنه‌", TAG: "NOUN"}, - {ORTH: "ت", LEMMA: "ت", NORM: "ت", TAG: "NOUN"}, - ], - "تشکیلاتی‌اش": [ - {ORTH: "تشکیلاتی‌", LEMMA: "تشکیلاتی‌", NORM: "تشکیلاتی‌", TAG: "ADJ"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "تعلقش": [ - {ORTH: "تعلق", LEMMA: "تعلق", NORM: "تعلق", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "تلاششان": [ - {ORTH: "تلاش", LEMMA: "تلاش", NORM: "تلاش", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "تلاشمان": [ - {ORTH: "تلاش", LEMMA: "تلاش", NORM: "تلاش", TAG: "NOUN"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "تماشاگرش": [ - {ORTH: "تماشاگر", LEMMA: "تماشاگر", NORM: "تماشاگر", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "تمامشان": [ - {ORTH: "تمام", LEMMA: "تمام", NORM: "تمام", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "تنش": [ - {ORTH: "تن", LEMMA: "تن", NORM: "تن", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "تنمان": [ - {ORTH: "تن", LEMMA: "تن", NORM: "تن", TAG: "NOUN"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "تنهایی‌اش": [ - {ORTH: "تنهایی‌", LEMMA: "تنهایی‌", NORM: "تنهایی‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "توانایی‌اش": [ - {ORTH: "توانایی‌", LEMMA: "توانایی‌", NORM: "توانایی‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "توجهش": [ - {ORTH: "توجه", LEMMA: "توجه", NORM: "توجه", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "توست": [ - {ORTH: "تو", LEMMA: "تو", NORM: "تو", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "توصیه‌اش": [ - {ORTH: "توصیه‌", LEMMA: "توصیه‌", NORM: "توصیه‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "تیغه‌اش": [ - {ORTH: "تیغه‌", LEMMA: "تیغه‌", NORM: "تیغه‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "جاست": [ - {ORTH: "جا", LEMMA: "جا", NORM: "جا", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "جامعه‌اند": [ - {ORTH: "جامعه‌", LEMMA: "جامعه‌", NORM: "جامعه‌", TAG: "NOUN"}, - {ORTH: "اند", LEMMA: "اند", NORM: "اند", TAG: "VERB"}, - ], - "جانم": [ - {ORTH: "جان", LEMMA: "جان", NORM: "جان", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "جایش": [ - {ORTH: "جای", LEMMA: "جای", NORM: "جای", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "جایشان": [ - {ORTH: "جای", LEMMA: "جای", NORM: "جای", TAG: 
"NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "جدیدش": [ - {ORTH: "جدید", LEMMA: "جدید", NORM: "جدید", TAG: "ADJ"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "جرمزاست": [ - {ORTH: "جرمزا", LEMMA: "جرمزا", NORM: "جرمزا", TAG: "ADJ"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "جلوست": [ - {ORTH: "جلو", LEMMA: "جلو", NORM: "جلو", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "جلویش": [ - {ORTH: "جلوی", LEMMA: "جلوی", NORM: "جلوی", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "جمهوریست": [ - {ORTH: "جمهوری", LEMMA: "جمهوری", NORM: "جمهوری", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "جنسش": [ - {ORTH: "جنس", LEMMA: "جنس", NORM: "جنس", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "جنس‌اند": [ - {ORTH: "جنس‌", LEMMA: "جنس‌", NORM: "جنس‌", TAG: "NOUN"}, - {ORTH: "اند", LEMMA: "اند", NORM: "اند", TAG: "VERB"}, - ], - "جوانانش": [ - {ORTH: "جوانان", LEMMA: "جوانان", NORM: "جوانان", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "جویش": [ - {ORTH: "جوی", LEMMA: "جوی", NORM: "جوی", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "جگرش": [ - {ORTH: "جگر", LEMMA: "جگر", NORM: "جگر", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "حاضرم": [ - {ORTH: "حاضر", LEMMA: "حاضر", NORM: "حاضر", TAG: "ADJ"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "VERB"}, - ], - "حالتهایشان": [ - {ORTH: "حالتهای", LEMMA: "حالتهای", NORM: "حالتهای", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "حالیست": [ - {ORTH: "حالی", LEMMA: "حالی", NORM: "حالی", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "حالی‌مان": [ - {ORTH: "حالی‌", LEMMA: "حالی‌", NORM: "حالی‌", TAG: "NOUN"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "حاکیست": [ - {ORTH: "حاکی", LEMMA: "حاکی", NORM: "حاکی", TAG: "ADJ"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "حرامزادگی‌اش": [ - {ORTH: "حرامزادگی‌", LEMMA: "حرامزادگی‌", NORM: "حرامزادگی‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "حرفتان": [ - {ORTH: "حرف", LEMMA: "حرف", NORM: "حرف", TAG: "NOUN"}, - {ORTH: "تان", LEMMA: "تان", NORM: "تان", TAG: "NOUN"}, - ], - "حرفش": [ - {ORTH: "حرف", LEMMA: "حرف", NORM: "حرف", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "حرفشان": [ - {ORTH: "حرف", LEMMA: "حرف", NORM: "حرف", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "حرفم": [ - {ORTH: "حرف", LEMMA: "حرف", NORM: "حرف", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "حرف‌های‌شان": [ - {ORTH: "حرف‌های‌", LEMMA: "حرف‌های‌", NORM: "حرف‌های‌", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "حرکتمان": [ - {ORTH: "حرکت", LEMMA: "حرکت", NORM: "حرکت", TAG: "NOUN"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "حریفانشان": [ - {ORTH: "حریفان", LEMMA: "حریفان", NORM: "حریفان", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "حضورشان": [ - {ORTH: "حضور", LEMMA: "حضور", NORM: "حضور", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "حمایتش": [ - {ORTH: "حمایت", LEMMA: "حمایت", NORM: "حمایت", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "حواسش": [ - {ORTH: "حواس", LEMMA: "حواس", NORM: "حواس", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", 
TAG: "NOUN"}, - ], - "حواسشان": [ - {ORTH: "حواس", LEMMA: "حواس", NORM: "حواس", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "حوصله‌مان": [ - {ORTH: "حوصله‌", LEMMA: "حوصله‌", NORM: "حوصله‌", TAG: "NOUN"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "حکومتش": [ - {ORTH: "حکومت", LEMMA: "حکومت", NORM: "حکومت", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "حکومتشان": [ - {ORTH: "حکومت", LEMMA: "حکومت", NORM: "حکومت", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "حیفم": [ - {ORTH: "حیف", LEMMA: "حیف", NORM: "حیف", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "خاندانش": [ - {ORTH: "خاندان", LEMMA: "خاندان", NORM: "خاندان", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "خانه‌اش": [ - {ORTH: "خانه‌", LEMMA: "خانه‌", NORM: "خانه‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "خانه‌شان": [ - {ORTH: "خانه‌", LEMMA: "خانه‌", NORM: "خانه‌", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "خانه‌مان": [ - {ORTH: "خانه‌", LEMMA: "خانه‌", NORM: "خانه‌", TAG: "NOUN"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "خانه‌هایشان": [ - {ORTH: "خانه‌های", LEMMA: "خانه‌های", NORM: "خانه‌های", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "خانواده‌ات": [ - {ORTH: "خانواده", LEMMA: "خانواده", NORM: "خانواده", TAG: "NOUN"}, - {ORTH: "‌ات", LEMMA: "ات", NORM: "ات", TAG: "NOUN"}, - ], - "خانواده‌اش": [ - {ORTH: "خانواده‌", LEMMA: "خانواده‌", NORM: "خانواده‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "خانواده‌ام": [ - {ORTH: "خانواده‌", LEMMA: "خانواده‌", NORM: "خانواده‌", TAG: "NOUN"}, - {ORTH: "ام", LEMMA: "ام", NORM: "ام", TAG: "NOUN"}, - ], - "خانواده‌شان": [ - {ORTH: "خانواده‌", LEMMA: "خانواده‌", NORM: "خانواده‌", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "خداست": [ - {ORTH: "خدا", LEMMA: "خدا", NORM: "خدا", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "خدایش": [ - {ORTH: "خدا", LEMMA: "خدا", NORM: "خدا", TAG: "NOUN"}, - {ORTH: "یش", LEMMA: "یش", NORM: "یش", TAG: "NOUN"}, - ], - "خدایشان": [ - {ORTH: "خدای", LEMMA: "خدای", NORM: "خدای", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "خردسالش": [ - {ORTH: "خردسال", LEMMA: "خردسال", NORM: "خردسال", TAG: "ADJ"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "خروپفشان": [ - {ORTH: "خروپف", LEMMA: "خروپف", NORM: "خروپف", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "خسته‌ای": [ - {ORTH: "خسته‌", LEMMA: "خسته‌", NORM: "خسته‌", TAG: "ADJ"}, - {ORTH: "ای", LEMMA: "ای", NORM: "ای", TAG: "VERB"}, - ], - "خطت": [ - {ORTH: "خط", LEMMA: "خط", NORM: "خط", TAG: "NOUN"}, - {ORTH: "ت", LEMMA: "ت", NORM: "ت", TAG: "NOUN"}, - ], - "خوابمان": [ - {ORTH: "خواب", LEMMA: "خواب", NORM: "خواب", TAG: "NOUN"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "خواندنش": [ - {ORTH: "خواندن", LEMMA: "خواندن", NORM: "خواندن", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "خواهرش": [ - {ORTH: "خواهر", LEMMA: "خواهر", NORM: "خواهر", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "خوبش": [ - {ORTH: "خوب", LEMMA: "خوب", NORM: "خوب", TAG: "ADJ"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "خودت": [ - {ORTH: "خود", LEMMA: "خود", NORM: "خود", TAG: "NOUN"}, - {ORTH: "ت", 
LEMMA: "ت", NORM: "ت", TAG: "NOUN"}, - ], - "خودتان": [ - {ORTH: "خود", LEMMA: "خود", NORM: "خود", TAG: "NOUN"}, - {ORTH: "تان", LEMMA: "تان", NORM: "تان", TAG: "NOUN"}, - ], - "خودش": [ - {ORTH: "خود", LEMMA: "خود", NORM: "خود", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "خودشان": [ - {ORTH: "خود", LEMMA: "خود", NORM: "خود", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "خودمان": [ - {ORTH: "خود", LEMMA: "خود", NORM: "خود", TAG: "NOUN"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "خوردمان": [ - {ORTH: "خورد", LEMMA: "خورد", NORM: "خورد", TAG: "NOUN"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "خوردنشان": [ - {ORTH: "خوردن", LEMMA: "خوردن", NORM: "خوردن", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "خوشش": [ - {ORTH: "خوش", LEMMA: "خوش", NORM: "خوش", TAG: "ADJ"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "خوشوقتم": [ - {ORTH: "خوشوقت", LEMMA: "خوشوقت", NORM: "خوشوقت", TAG: "ADJ"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "VERB"}, - ], - "خونشان": [ - {ORTH: "خون", LEMMA: "خون", NORM: "خون", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "خویش": [ - {ORTH: "خوی", LEMMA: "خوی", NORM: "خوی", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "خویشتنم": [ - {ORTH: "خویشتن", LEMMA: "خویشتن", NORM: "خویشتن", TAG: "VERB"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "خیالش": [ - {ORTH: "خیال", LEMMA: "خیال", NORM: "خیال", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "خیسش": [ - {ORTH: "خیس", LEMMA: "خیس", NORM: "خیس", TAG: "ADJ"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "داراست": [ - {ORTH: "دارا", LEMMA: "دارا", NORM: "دارا", TAG: "ADJ"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "داستانهایش": [ - {ORTH: "داستانهای", LEMMA: "داستانهای", NORM: "داستانهای", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "دخترمان": [ - {ORTH: "دختر", LEMMA: "دختر", NORM: "دختر", TAG: "NOUN"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "دخیلند": [ - {ORTH: "دخیل", LEMMA: "دخیل", NORM: "دخیل", TAG: "ADJ"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "درباره‌ات": [ - {ORTH: "درباره", LEMMA: "درباره", NORM: "درباره", TAG: "ADP"}, - {ORTH: "‌ات", LEMMA: "ات", NORM: "ات", TAG: "NOUN"}, - ], - "درباره‌اش": [ - {ORTH: "درباره‌", LEMMA: "درباره‌", NORM: "درباره‌", TAG: "ADP"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "دردش": [ - {ORTH: "درد", LEMMA: "درد", NORM: "درد", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "دردشان": [ - {ORTH: "درد", LEMMA: "درد", NORM: "درد", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "درسته": [ - {ORTH: "درست", LEMMA: "درست", NORM: "درست", TAG: "ADJ"}, - {ORTH: "ه", LEMMA: "ه", NORM: "ه", TAG: "VERB"}, - ], - "درش": [ - {ORTH: "در", LEMMA: "در", NORM: "در", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "درون‌شان": [ - {ORTH: "درون‌", LEMMA: "درون‌", NORM: "درون‌", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "درین": [ - {ORTH: "در", LEMMA: "در", NORM: "در", TAG: "ADP"}, - {ORTH: "ین", LEMMA: "ین", NORM: "ین", TAG: "NOUN"}, - ], - "دریچه‌هایش": [ - {ORTH: "دریچه‌های", LEMMA: "دریچه‌های", NORM: "دریچه‌های", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "دزدانش": [ - {ORTH: "دزدان", LEMMA: 
"دزدان", NORM: "دزدان", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "دستت": [ - {ORTH: "دست", LEMMA: "دست", NORM: "دست", TAG: "NOUN"}, - {ORTH: "ت", LEMMA: "ت", NORM: "ت", TAG: "NOUN"}, - ], - "دستش": [ - {ORTH: "دست", LEMMA: "دست", NORM: "دست", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "دستمان": [ - {ORTH: "دست", LEMMA: "دست", NORM: "دست", TAG: "NOUN"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "دستهایشان": [ - {ORTH: "دستهای", LEMMA: "دستهای", NORM: "دستهای", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "دست‌یافتنی‌ست": [ - { - ORTH: "دست‌یافتنی‌", - LEMMA: "دست‌یافتنی‌", - NORM: "دست‌یافتنی‌", - TAG: "ADJ", - }, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "دشمنند": [ - {ORTH: "دشمن", LEMMA: "دشمن", NORM: "دشمن", TAG: "NOUN"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "دشمنیشان": [ - {ORTH: "دشمنی", LEMMA: "دشمنی", NORM: "دشمنی", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "دشمنیم": [ - {ORTH: "دشمن", LEMMA: "دشمن", NORM: "دشمن", TAG: "NOUN"}, - {ORTH: "یم", LEMMA: "یم", NORM: "یم", TAG: "VERB"}, - ], - "دفترش": [ - {ORTH: "دفتر", LEMMA: "دفتر", NORM: "دفتر", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "دفنشان": [ - {ORTH: "دفن", LEMMA: "دفن", NORM: "دفن", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "دلت": [ - {ORTH: "دل", LEMMA: "دل", NORM: "دل", TAG: "NOUN"}, - {ORTH: "ت", LEMMA: "ت", NORM: "ت", TAG: "NOUN"}, - ], - "دلش": [ - {ORTH: "دل", LEMMA: "دل", NORM: "دل", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "دلشان": [ - {ORTH: "دل", LEMMA: "دل", NORM: "دل", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "دلم": [ - {ORTH: "دل", LEMMA: "دل", NORM: "دل", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "دلیلش": [ - {ORTH: "دلیل", LEMMA: "دلیل", NORM: "دلیل", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "دنبالش": [ - {ORTH: "دنبال", LEMMA: "دنبال", NORM: "دنبال", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "دنباله‌اش": [ - {ORTH: "دنباله‌", LEMMA: "دنباله‌", NORM: "دنباله‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "دهاتی‌هایش": [ - {ORTH: "دهاتی‌های", LEMMA: "دهاتی‌های", NORM: "دهاتی‌های", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "دهانت": [ - {ORTH: "دهان", LEMMA: "دهان", NORM: "دهان", TAG: "NOUN"}, - {ORTH: "ت", LEMMA: "ت", NORM: "ت", TAG: "NOUN"}, - ], - "دهنش": [ - {ORTH: "دهن", LEMMA: "دهن", NORM: "دهن", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "دورش": [ - {ORTH: "دور", LEMMA: "دور", NORM: "دور", TAG: "ADV"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "دوروبریهاشان": [ - {ORTH: "دوروبریها", LEMMA: "دوروبریها", NORM: "دوروبریها", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "دوستانش": [ - {ORTH: "دوستان", LEMMA: "دوستان", NORM: "دوستان", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "دوستانشان": [ - {ORTH: "دوستان", LEMMA: "دوستان", NORM: "دوستان", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "دوستت": [ - {ORTH: "دوست", LEMMA: "دوست", NORM: "دوست", TAG: "NOUN"}, - {ORTH: "ت", LEMMA: "ت", NORM: "ت", TAG: "NOUN"}, - ], - "دوستش": [ - {ORTH: "دوست", LEMMA: "دوست", NORM: "دوست", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: 
"ش", NORM: "ش", TAG: "NOUN"}, - ], - "دومش": [ - {ORTH: "دوم", LEMMA: "دوم", NORM: "دوم", TAG: "ADJ"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "دویدنش": [ - {ORTH: "دویدن", LEMMA: "دویدن", NORM: "دویدن", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "دکورهایمان": [ - {ORTH: "دکورهای", LEMMA: "دکورهای", NORM: "دکورهای", TAG: "NOUN"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "دیدگاهش": [ - {ORTH: "دیدگاه", LEMMA: "دیدگاه", NORM: "دیدگاه", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "دیرت": [ - {ORTH: "دیر", LEMMA: "دیر", NORM: "دیر", TAG: "ADV"}, - {ORTH: "ت", LEMMA: "ت", NORM: "ت", TAG: "NOUN"}, - ], - "دیرم": [ - {ORTH: "دیر", LEMMA: "دیر", NORM: "دیر", TAG: "ADV"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "دینت": [ - {ORTH: "دین", LEMMA: "دین", NORM: "دین", TAG: "NOUN"}, - {ORTH: "ت", LEMMA: "ت", NORM: "ت", TAG: "NOUN"}, - ], - "دینش": [ - {ORTH: "دین", LEMMA: "دین", NORM: "دین", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "دین‌شان": [ - {ORTH: "دین‌", LEMMA: "دین‌", NORM: "دین‌", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "دیواره‌هایش": [ - {ORTH: "دیواره‌های", LEMMA: "دیواره‌های", NORM: "دیواره‌های", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "دیوانه‌ای": [ - {ORTH: "دیوانه‌", LEMMA: "دیوانه‌", NORM: "دیوانه‌", TAG: "ADJ"}, - {ORTH: "ای", LEMMA: "ای", NORM: "ای", TAG: "VERB"}, - ], - "دیوی": [ - {ORTH: "دیو", LEMMA: "دیو", NORM: "دیو", TAG: "NOUN"}, - {ORTH: "ی", LEMMA: "ی", NORM: "ی", TAG: "VERB"}, - ], - "دیگرم": [ - {ORTH: "دیگر", LEMMA: "دیگر", NORM: "دیگر", TAG: "ADJ"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "دیگرمان": [ - {ORTH: "دیگر", LEMMA: "دیگر", NORM: "دیگر", TAG: "ADJ"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "ذهنش": [ - {ORTH: "ذهن", LEMMA: "ذهن", NORM: "ذهن", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "ذهنشان": [ - {ORTH: "ذهن", LEMMA: "ذهن", NORM: "ذهن", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "ذهنم": [ - {ORTH: "ذهن", LEMMA: "ذهن", NORM: "ذهن", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "رئوسش": [ - {ORTH: "رئوس", LEMMA: "رئوس", NORM: "رئوس", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "راهشان": [ - {ORTH: "راه", LEMMA: "راه", NORM: "راه", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "راهگشاست": [ - {ORTH: "راهگشا", LEMMA: "راهگشا", NORM: "راهگشا", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "رایانه‌هایشان": [ - {ORTH: "رایانه‌های", LEMMA: "رایانه‌های", NORM: "رایانه‌های", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "رعایتشان": [ - {ORTH: "رعایت", LEMMA: "رعایت", NORM: "رعایت", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "رفتارش": [ - {ORTH: "رفتار", LEMMA: "رفتار", NORM: "رفتار", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "رفتارشان": [ - {ORTH: "رفتار", LEMMA: "رفتار", NORM: "رفتار", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "رفتارمان": [ - {ORTH: "رفتار", LEMMA: "رفتار", NORM: "رفتار", TAG: "NOUN"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "رفتارهاست": [ - {ORTH: "رفتارها", LEMMA: "رفتارها", NORM: "رفتارها", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - 
"رفتارهایشان": [ - {ORTH: "رفتارهای", LEMMA: "رفتارهای", NORM: "رفتارهای", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "رفقایم": [ - {ORTH: "رفقا", LEMMA: "رفقا", NORM: "رفقا", TAG: "NOUN"}, - {ORTH: "یم", LEMMA: "یم", NORM: "یم", TAG: "NOUN"}, - ], - "رقیق‌ترش": [ - {ORTH: "رقیق‌تر", LEMMA: "رقیق‌تر", NORM: "رقیق‌تر", TAG: "ADJ"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "رنجند": [ - {ORTH: "رنج", LEMMA: "رنج", NORM: "رنج", TAG: "NOUN"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "رهگشاست": [ - {ORTH: "رهگشا", LEMMA: "رهگشا", NORM: "رهگشا", TAG: "ADJ"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "رواست": [ - {ORTH: "روا", LEMMA: "روا", NORM: "روا", TAG: "ADJ"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "روبروست": [ - {ORTH: "روبرو", LEMMA: "روبرو", NORM: "روبرو", TAG: "ADJ"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "روحی‌اش": [ - {ORTH: "روحی‌", LEMMA: "روحی‌", NORM: "روحی‌", TAG: "ADJ"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "روزنامه‌اش": [ - {ORTH: "روزنامه‌", LEMMA: "روزنامه‌", NORM: "روزنامه‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "روزه‌ست": [ - {ORTH: "روزه‌", LEMMA: "روزه‌", NORM: "روزه‌", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "روسری‌اش": [ - {ORTH: "روسری‌", LEMMA: "روسری‌", NORM: "روسری‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "روشتان": [ - {ORTH: "روش", LEMMA: "روش", NORM: "روش", TAG: "NOUN"}, - {ORTH: "تان", LEMMA: "تان", NORM: "تان", TAG: "NOUN"}, - ], - "رویش": [ - {ORTH: "روی", LEMMA: "روی", NORM: "روی", TAG: "ADP"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "زبانش": [ - {ORTH: "زبان", LEMMA: "زبان", NORM: "زبان", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "زحماتشان": [ - {ORTH: "زحمات", LEMMA: "زحمات", NORM: "زحمات", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "زدنهایشان": [ - {ORTH: "زدنهای", LEMMA: "زدنهای", NORM: "زدنهای", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "زرنگشان": [ - {ORTH: "زرنگ", LEMMA: "زرنگ", NORM: "زرنگ", TAG: "ADJ"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "زشتش": [ - {ORTH: "زشت", LEMMA: "زشت", NORM: "زشت", TAG: "ADJ"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "زشتکارانند": [ - {ORTH: "زشتکاران", LEMMA: "زشتکاران", NORM: "زشتکاران", TAG: "NOUN"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "زلفش": [ - {ORTH: "زلف", LEMMA: "زلف", NORM: "زلف", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "زمن": [ - {ORTH: "ز", LEMMA: "ز", NORM: "ز", TAG: "ADP"}, - {ORTH: "من", LEMMA: "من", NORM: "من", TAG: "NOUN"}, - ], - "زنبوری‌اش": [ - {ORTH: "زنبوری‌", LEMMA: "زنبوری‌", NORM: "زنبوری‌", TAG: "ADJ"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "زندانم": [ - {ORTH: "زندان", LEMMA: "زندان", NORM: "زندان", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "زنده‌ام": [ - {ORTH: "زنده‌", LEMMA: "زنده‌", NORM: "زنده‌", TAG: "ADJ"}, - {ORTH: "ام", LEMMA: "ام", NORM: "ام", TAG: "VERB"}, - ], - "زندگانی‌اش": [ - {ORTH: "زندگانی‌", LEMMA: "زندگانی‌", NORM: "زندگانی‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "زندگی‌اش": [ - {ORTH: "زندگی‌", LEMMA: "زندگی‌", NORM: "زندگی‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - 
"زندگی‌ام": [ - {ORTH: "زندگی‌", LEMMA: "زندگی‌", NORM: "زندگی‌", TAG: "NOUN"}, - {ORTH: "ام", LEMMA: "ام", NORM: "ام", TAG: "NOUN"}, - ], - "زندگی‌شان": [ - {ORTH: "زندگی‌", LEMMA: "زندگی‌", NORM: "زندگی‌", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "زنش": [ - {ORTH: "زن", LEMMA: "زن", NORM: "زن", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "زنند": [ - {ORTH: "زن", LEMMA: "زن", NORM: "زن", TAG: "NOUN"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "زو": [ - {ORTH: "ز", LEMMA: "ز", NORM: "ز", TAG: "ADP"}, - {ORTH: "و", LEMMA: "و", NORM: "و", TAG: "NOUN"}, - ], - "زیاده": [ - {ORTH: "زیاد", LEMMA: "زیاد", NORM: "زیاد", TAG: "ADJ"}, - {ORTH: "ه", LEMMA: "ه", NORM: "ه", TAG: "VERB"}, - ], - "زیباست": [ - {ORTH: "زیبا", LEMMA: "زیبا", NORM: "زیبا", TAG: "ADJ"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "زیبایش": [ - {ORTH: "زیبای", LEMMA: "زیبای", NORM: "زیبای", TAG: "ADJ"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "زیبایی": [ - {ORTH: "زیبای", LEMMA: "زیبای", NORM: "زیبای", TAG: "ADJ"}, - {ORTH: "ی", LEMMA: "ی", NORM: "ی", TAG: "VERB"}, - ], - "زیربناست": [ - {ORTH: "زیربنا", LEMMA: "زیربنا", NORM: "زیربنا", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "زیرک‌اند": [ - {ORTH: "زیرک‌", LEMMA: "زیرک‌", NORM: "زیرک‌", TAG: "ADJ"}, - {ORTH: "اند", LEMMA: "اند", NORM: "اند", TAG: "VERB"}, - ], - "سؤالتان": [ - {ORTH: "سؤال", LEMMA: "سؤال", NORM: "سؤال", TAG: "NOUN"}, - {ORTH: "تان", LEMMA: "تان", NORM: "تان", TAG: "NOUN"}, - ], - "سؤالم": [ - {ORTH: "سؤال", LEMMA: "سؤال", NORM: "سؤال", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "سابقه‌اش": [ - {ORTH: "سابقه‌", LEMMA: "سابقه‌", NORM: "سابقه‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "ساختنم": [ - {ORTH: "ساختن", LEMMA: "ساختن", NORM: "ساختن", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "ساده‌اش": [ - {ORTH: "ساده‌", LEMMA: "ساده‌", NORM: "ساده‌", TAG: "ADJ"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "ساده‌اند": [ - {ORTH: "ساده‌", LEMMA: "ساده‌", NORM: "ساده‌", TAG: "ADJ"}, - {ORTH: "اند", LEMMA: "اند", NORM: "اند", TAG: "VERB"}, - ], - "سازمانش": [ - {ORTH: "سازمان", LEMMA: "سازمان", NORM: "سازمان", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "ساعتم": [ - {ORTH: "ساعت", LEMMA: "ساعت", NORM: "ساعت", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "سالته": [ - {ORTH: "سال", LEMMA: "سال", NORM: "سال", TAG: "NOUN"}, - {ORTH: "ت", LEMMA: "ت", NORM: "ت", TAG: "NOUN"}, - {ORTH: "ه", LEMMA: "ه", NORM: "ه", TAG: "VERB"}, - ], - "سالش": [ - {ORTH: "سال", LEMMA: "سال", NORM: "سال", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "سالهاست": [ - {ORTH: "سالها", LEMMA: "سالها", NORM: "سالها", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "ساله‌اش": [ - {ORTH: "ساله‌", LEMMA: "ساله‌", NORM: "ساله‌", TAG: "ADJ"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "ساکتند": [ - {ORTH: "ساکت", LEMMA: "ساکت", NORM: "ساکت", TAG: "ADJ"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "ساکنند": [ - {ORTH: "ساکن", LEMMA: "ساکن", NORM: "ساکن", TAG: "ADJ"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "سبزشان": [ - {ORTH: "سبز", LEMMA: "سبز", NORM: "سبز", TAG: "ADJ"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "سبیل‌مان": [ - {ORTH: "سبیل‌", LEMMA: "سبیل‌", 
NORM: "سبیل‌", TAG: "NOUN"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "ستم‌هایش": [ - {ORTH: "ستم‌های", LEMMA: "ستم‌های", NORM: "ستم‌های", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "سخنانش": [ - {ORTH: "سخنان", LEMMA: "سخنان", NORM: "سخنان", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "سخنانشان": [ - {ORTH: "سخنان", LEMMA: "سخنان", NORM: "سخنان", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "سخنتان": [ - {ORTH: "سخن", LEMMA: "سخن", NORM: "سخن", TAG: "NOUN"}, - {ORTH: "تان", LEMMA: "تان", NORM: "تان", TAG: "NOUN"}, - ], - "سخنش": [ - {ORTH: "سخن", LEMMA: "سخن", NORM: "سخن", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "سخنم": [ - {ORTH: "سخن", LEMMA: "سخن", NORM: "سخن", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "سردش": [ - {ORTH: "سرد", LEMMA: "سرد", NORM: "سرد", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "سرزمینشان": [ - {ORTH: "سرزمین", LEMMA: "سرزمین", NORM: "سرزمین", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "سرش": [ - {ORTH: "سر", LEMMA: "سر", NORM: "سر", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "سرمایه‌دارهاست": [ - { - ORTH: "سرمایه‌دارها", - LEMMA: "سرمایه‌دارها", - NORM: "سرمایه‌دارها", - TAG: "NOUN", - }, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "سرنوشتش": [ - {ORTH: "سرنوشت", LEMMA: "سرنوشت", NORM: "سرنوشت", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "سرنوشتشان": [ - {ORTH: "سرنوشت", LEMMA: "سرنوشت", NORM: "سرنوشت", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "سروتهش": [ - {ORTH: "سروته", LEMMA: "سروته", NORM: "سروته", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "سرچشمه‌اش": [ - {ORTH: "سرچشمه‌", LEMMA: "سرچشمه‌", NORM: "سرچشمه‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "سقمش": [ - {ORTH: "سقم", LEMMA: "سقم", NORM: "سقم", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "سنش": [ - {ORTH: "سن", LEMMA: "سن", NORM: "سن", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "سپاهش": [ - {ORTH: "سپاه", LEMMA: "سپاه", NORM: "سپاه", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "سیاسیشان": [ - {ORTH: "سیاسی", LEMMA: "سیاسی", NORM: "سیاسی", TAG: "ADJ"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "سیاه‌چاله‌هاست": [ - { - ORTH: "سیاه‌چاله‌ها", - LEMMA: "سیاه‌چاله‌ها", - NORM: "سیاه‌چاله‌ها", - TAG: "NOUN", - }, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "شاخه‌هایشان": [ - {ORTH: "شاخه‌های", LEMMA: "شاخه‌های", NORM: "شاخه‌های", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "شالوده‌اش": [ - {ORTH: "شالوده‌", LEMMA: "شالوده‌", NORM: "شالوده‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "شانه‌هایش": [ - {ORTH: "شانه‌های", LEMMA: "شانه‌های", NORM: "شانه‌های", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "شاهدیم": [ - {ORTH: "شاهد", LEMMA: "شاهد", NORM: "شاهد", TAG: "NOUN"}, - {ORTH: "یم", LEMMA: "یم", NORM: "یم", TAG: "VERB"}, - ], - "شاهکارهایش": [ - {ORTH: "شاهکارهای", LEMMA: "شاهکارهای", NORM: "شاهکارهای", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "شخصیتش": [ - {ORTH: "شخصیت", LEMMA: "شخصیت", NORM: "شخصیت", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: 
"NOUN"}, - ], - "شدنشان": [ - {ORTH: "شدن", LEMMA: "شدن", NORM: "شدن", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "شرکتیست": [ - {ORTH: "شرکتی", LEMMA: "شرکتی", NORM: "شرکتی", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "شعارهاشان": [ - {ORTH: "شعارها", LEMMA: "شعارها", NORM: "شعارها", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "شعورش": [ - {ORTH: "شعور", LEMMA: "شعور", NORM: "شعور", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "شغلش": [ - {ORTH: "شغل", LEMMA: "شغل", NORM: "شغل", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "شماست": [ - {ORTH: "شما", LEMMA: "شما", NORM: "شما", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "شمشیرش": [ - {ORTH: "شمشیر", LEMMA: "شمشیر", NORM: "شمشیر", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "شنیدنش": [ - {ORTH: "شنیدن", LEMMA: "شنیدن", NORM: "شنیدن", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "شوراست": [ - {ORTH: "شورا", LEMMA: "شورا", NORM: "شورا", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "شومت": [ - {ORTH: "شوم", LEMMA: "شوم", NORM: "شوم", TAG: "ADJ"}, - {ORTH: "ت", LEMMA: "ت", NORM: "ت", TAG: "NOUN"}, - ], - "شیرینترش": [ - {ORTH: "شیرینتر", LEMMA: "شیرینتر", NORM: "شیرینتر", TAG: "ADJ"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "شیطان‌اند": [ - {ORTH: "شیطان‌", LEMMA: "شیطان‌", NORM: "شیطان‌", TAG: "NOUN"}, - {ORTH: "اند", LEMMA: "اند", NORM: "اند", TAG: "VERB"}, - ], - "شیوه‌هاست": [ - {ORTH: "شیوه‌ها", LEMMA: "شیوه‌ها", NORM: "شیوه‌ها", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "صاحبش": [ - {ORTH: "صاحب", LEMMA: "صاحب", NORM: "صاحب", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "صحنه‌اش": [ - {ORTH: "صحنه‌", LEMMA: "صحنه‌", NORM: "صحنه‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "صدایش": [ - {ORTH: "صدای", LEMMA: "صدای", NORM: "صدای", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "صددند": [ - {ORTH: "صدد", LEMMA: "صدد", NORM: "صدد", TAG: "NOUN"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "صندوق‌هاست": [ - {ORTH: "صندوق‌ها", LEMMA: "صندوق‌ها", NORM: "صندوق‌ها", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "صندوق‌هایش": [ - {ORTH: "صندوق‌های", LEMMA: "صندوق‌های", NORM: "صندوق‌های", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "صورتش": [ - {ORTH: "صورت", LEMMA: "صورت", NORM: "صورت", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "ضروری‌اند": [ - {ORTH: "ضروری‌", LEMMA: "ضروری‌", NORM: "ضروری‌", TAG: "ADJ"}, - {ORTH: "اند", LEMMA: "اند", NORM: "اند", TAG: "VERB"}, - ], - "ضمیرش": [ - {ORTH: "ضمیر", LEMMA: "ضمیر", NORM: "ضمیر", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "طرفش": [ - {ORTH: "طرف", LEMMA: "طرف", NORM: "طرف", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "طلسمش": [ - {ORTH: "طلسم", LEMMA: "طلسم", NORM: "طلسم", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "طوره": [ - {ORTH: "طور", LEMMA: "طور", NORM: "طور", TAG: "NOUN"}, - {ORTH: "ه", LEMMA: "ه", NORM: "ه", TAG: "VERB"}, - ], - "عاشوراست": [ - {ORTH: "عاشورا", LEMMA: "عاشورا", NORM: "عاشورا", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "عبارتند": [ - {ORTH: "عبارت", LEMMA: 
"عبارت", NORM: "عبارت", TAG: "NOUN"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "عزیزانتان": [ - {ORTH: "عزیزان", LEMMA: "عزیزان", NORM: "عزیزان", TAG: "NOUN"}, - {ORTH: "تان", LEMMA: "تان", NORM: "تان", TAG: "NOUN"}, - ], - "عزیزانش": [ - {ORTH: "عزیزان", LEMMA: "عزیزان", NORM: "عزیزان", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "عزیزش": [ - {ORTH: "عزیز", LEMMA: "عزیز", NORM: "عزیز", TAG: "ADJ"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "عشرت‌طلبی‌اش": [ - {ORTH: "عشرت‌طلبی‌", LEMMA: "عشرت‌طلبی‌", NORM: "عشرت‌طلبی‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "عقبیم": [ - {ORTH: "عقب", LEMMA: "عقب", NORM: "عقب", TAG: "NOUN"}, - {ORTH: "یم", LEMMA: "یم", NORM: "یم", TAG: "VERB"}, - ], - "علاقه‌اش": [ - {ORTH: "علاقه‌", LEMMA: "علاقه‌", NORM: "علاقه‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "علمیمان": [ - {ORTH: "علمی", LEMMA: "علمی", NORM: "علمی", TAG: "ADJ"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "عمرش": [ - {ORTH: "عمر", LEMMA: "عمر", NORM: "عمر", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "عمرشان": [ - {ORTH: "عمر", LEMMA: "عمر", NORM: "عمر", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "عملش": [ - {ORTH: "عمل", LEMMA: "عمل", NORM: "عمل", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "عملی‌اند": [ - {ORTH: "عملی‌", LEMMA: "عملی‌", NORM: "عملی‌", TAG: "ADJ"}, - {ORTH: "اند", LEMMA: "اند", NORM: "اند", TAG: "VERB"}, - ], - "عمویت": [ - {ORTH: "عموی", LEMMA: "عموی", NORM: "عموی", TAG: "NOUN"}, - {ORTH: "ت", LEMMA: "ت", NORM: "ت", TAG: "NOUN"}, - ], - "عمویش": [ - {ORTH: "عموی", LEMMA: "عموی", NORM: "عموی", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "عمیقش": [ - {ORTH: "عمیق", LEMMA: "عمیق", NORM: "عمیق", TAG: "ADJ"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "عواملش": [ - {ORTH: "عوامل", LEMMA: "عوامل", NORM: "عوامل", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "عوضشان": [ - {ORTH: "عوض", LEMMA: "عوض", NORM: "عوض", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "غذایی‌شان": [ - {ORTH: "غذایی‌", LEMMA: "غذایی‌", NORM: "غذایی‌", TAG: "ADJ"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "غریبه‌اند": [ - {ORTH: "غریبه‌", LEMMA: "غریبه‌", NORM: "غریبه‌", TAG: "NOUN"}, - {ORTH: "اند", LEMMA: "اند", NORM: "اند", TAG: "VERB"}, - ], - "غلامانش": [ - {ORTH: "غلامان", LEMMA: "غلامان", NORM: "غلامان", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "غلطهاست": [ - {ORTH: "غلطها", LEMMA: "غلطها", NORM: "غلطها", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "فراموشتان": [ - {ORTH: "فراموش", LEMMA: "فراموش", NORM: "فراموش", TAG: "ADJ"}, - {ORTH: "تان", LEMMA: "تان", NORM: "تان", TAG: "NOUN"}, - ], - "فردی‌اند": [ - {ORTH: "فردی‌", LEMMA: "فردی‌", NORM: "فردی‌", TAG: "ADJ"}, - {ORTH: "اند", LEMMA: "اند", NORM: "اند", TAG: "VERB"}, - ], - "فرزندانش": [ - {ORTH: "فرزندان", LEMMA: "فرزندان", NORM: "فرزندان", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "فرزندش": [ - {ORTH: "فرزند", LEMMA: "فرزند", NORM: "فرزند", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "فرم‌هایش": [ - {ORTH: "فرم‌های", LEMMA: "فرم‌های", NORM: "فرم‌های", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "فرهنگی‌مان": [ - {ORTH: "فرهنگی‌", LEMMA: 
"فرهنگی‌", NORM: "فرهنگی‌", TAG: "ADJ"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "فریادشان": [ - {ORTH: "فریاد", LEMMA: "فریاد", NORM: "فریاد", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "فضایی‌شان": [ - {ORTH: "فضایی‌", LEMMA: "فضایی‌", NORM: "فضایی‌", TAG: "ADJ"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "فقیرشان": [ - {ORTH: "فقیر", LEMMA: "فقیر", NORM: "فقیر", TAG: "ADJ"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "فوری‌شان": [ - {ORTH: "فوری‌", LEMMA: "فوری‌", NORM: "فوری‌", TAG: "ADJ"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "قائلند": [ - {ORTH: "قائل", LEMMA: "قائل", NORM: "قائل", TAG: "ADJ"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "قائلیم": [ - {ORTH: "قائل", LEMMA: "قائل", NORM: "قائل", TAG: "ADJ"}, - {ORTH: "یم", LEMMA: "یم", NORM: "یم", TAG: "VERB"}, - ], - "قادرند": [ - {ORTH: "قادر", LEMMA: "قادر", NORM: "قادر", TAG: "ADJ"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "قانونمندش": [ - {ORTH: "قانونمند", LEMMA: "قانونمند", NORM: "قانونمند", TAG: "ADJ"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "قبلند": [ - {ORTH: "قبل", LEMMA: "قبل", NORM: "قبل", TAG: "ADJ"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "قبلی‌اش": [ - {ORTH: "قبلی‌", LEMMA: "قبلی‌", NORM: "قبلی‌", TAG: "ADJ"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "قبلی‌مان": [ - {ORTH: "قبلی‌", LEMMA: "قبلی‌", NORM: "قبلی‌", TAG: "ADJ"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "قدریست": [ - {ORTH: "قدری", LEMMA: "قدری", NORM: "قدری", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "قدمش": [ - {ORTH: "قدم", LEMMA: "قدم", NORM: "قدم", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "قسمتش": [ - {ORTH: "قسمت", LEMMA: "قسمت", NORM: "قسمت", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "قضایاست": [ - {ORTH: "قضایا", LEMMA: "قضایا", NORM: "قضایا", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "قضیه‌شان": [ - {ORTH: "قضیه‌", LEMMA: "قضیه‌", NORM: "قضیه‌", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "قهرمانهایشان": [ - {ORTH: "قهرمانهای", LEMMA: "قهرمانهای", NORM: "قهرمانهای", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "قهرمانیش": [ - {ORTH: "قهرمانی", LEMMA: "قهرمانی", NORM: "قهرمانی", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "قومت": [ - {ORTH: "قوم", LEMMA: "قوم", NORM: "قوم", TAG: "NOUN"}, - {ORTH: "ت", LEMMA: "ت", NORM: "ت", TAG: "NOUN"}, - ], - "لازمه‌اش": [ - {ORTH: "لازمه‌", LEMMA: "لازمه‌", NORM: "لازمه‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "مأموریتش": [ - {ORTH: "مأموریت", LEMMA: "مأموریت", NORM: "مأموریت", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "مأموریتم": [ - {ORTH: "مأموریت", LEMMA: "مأموریت", NORM: "مأموریت", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "مأموریت‌اند": [ - {ORTH: "مأموریت‌", LEMMA: "مأموریت‌", NORM: "مأموریت‌", TAG: "NOUN"}, - {ORTH: "اند", LEMMA: "اند", NORM: "اند", TAG: "VERB"}, - ], - "مادرانشان": [ - {ORTH: "مادران", LEMMA: "مادران", NORM: "مادران", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "مادرت": [ - {ORTH: "مادر", LEMMA: "مادر", NORM: "مادر", TAG: "NOUN"}, - {ORTH: "ت", LEMMA: "ت", NORM: "ت", TAG: "NOUN"}, - ], - 
"مادرش": [ - {ORTH: "مادر", LEMMA: "مادر", NORM: "مادر", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "مادرم": [ - {ORTH: "مادر", LEMMA: "مادر", NORM: "مادر", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "ماست": [ - {ORTH: "ما", LEMMA: "ما", NORM: "ما", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "مالی‌اش": [ - {ORTH: "مالی‌", LEMMA: "مالی‌", NORM: "مالی‌", TAG: "ADJ"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "ماهیتش": [ - {ORTH: "ماهیت", LEMMA: "ماهیت", NORM: "ماهیت", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "مایی": [ - {ORTH: "ما", LEMMA: "ما", NORM: "ما", TAG: "NOUN"}, - {ORTH: "یی", LEMMA: "یی", NORM: "یی", TAG: "VERB"}, - ], - "مجازاتش": [ - {ORTH: "مجازات", LEMMA: "مجازات", NORM: "مجازات", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "مجبورند": [ - {ORTH: "مجبور", LEMMA: "مجبور", NORM: "مجبور", TAG: "ADJ"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "محتاجند": [ - {ORTH: "محتاج", LEMMA: "محتاج", NORM: "محتاج", TAG: "ADJ"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "محرمم": [ - {ORTH: "محرم", LEMMA: "محرم", NORM: "محرم", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "SCONJ"}, - ], - "محلش": [ - {ORTH: "محل", LEMMA: "محل", NORM: "محل", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "مخالفند": [ - {ORTH: "مخالف", LEMMA: "مخالف", NORM: "مخالف", TAG: "ADJ"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "مخدرش": [ - {ORTH: "مخدر", LEMMA: "مخدر", NORM: "مخدر", TAG: "ADJ"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "مدتهاست": [ - {ORTH: "مدتها", LEMMA: "مدتها", NORM: "مدتها", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "مدرسه‌ات": [ - {ORTH: "مدرسه", LEMMA: "مدرسه", NORM: "مدرسه", TAG: "NOUN"}, - {ORTH: "‌ات", LEMMA: "ات", NORM: "ات", TAG: "NOUN"}, - ], - "مدرکم": [ - {ORTH: "مدرک", LEMMA: "مدرک", NORM: "مدرک", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "مدیرانش": [ - {ORTH: "مدیران", LEMMA: "مدیران", NORM: "مدیران", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "مدیونم": [ - {ORTH: "مدیون", LEMMA: "مدیون", NORM: "مدیون", TAG: "ADJ"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "VERB"}, - ], - "مذهبی‌اند": [ - {ORTH: "مذهبی‌", LEMMA: "مذهبی‌", NORM: "مذهبی‌", TAG: "ADJ"}, - {ORTH: "اند", LEMMA: "اند", NORM: "اند", TAG: "VERB"}, - ], - "مرا": [ - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - {ORTH: "را", LEMMA: "را", NORM: "را", TAG: "PART"}, - ], - "مرادت": [ - {ORTH: "مراد", LEMMA: "مراد", NORM: "مراد", TAG: "NOUN"}, - {ORTH: "ت", LEMMA: "ت", NORM: "ت", TAG: "NOUN"}, - ], - "مردمشان": [ - {ORTH: "مردم", LEMMA: "مردم", NORM: "مردم", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "مردمند": [ - {ORTH: "مردم", LEMMA: "مردم", NORM: "مردم", TAG: "NOUN"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "مردم‌اند": [ - {ORTH: "مردم‌", LEMMA: "مردم‌", NORM: "مردم‌", TAG: "NOUN"}, - {ORTH: "اند", LEMMA: "اند", NORM: "اند", TAG: "VERB"}, - ], - "مرزشان": [ - {ORTH: "مرز", LEMMA: "مرز", NORM: "مرز", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "مرزهاشان": [ - {ORTH: "مرزها", LEMMA: "مرزها", NORM: "مرزها", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "مزدورش": [ - {ORTH: "مزدور", LEMMA: "مزدور", NORM: "مزدور", TAG: "ADJ"}, - {ORTH: "ش", LEMMA: 
"ش", NORM: "ش", TAG: "NOUN"}, - ], - "مسئولیتش": [ - {ORTH: "مسئولیت", LEMMA: "مسئولیت", NORM: "مسئولیت", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "مسائلش": [ - {ORTH: "مسائل", LEMMA: "مسائل", NORM: "مسائل", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "مستحضرید": [ - {ORTH: "مستحضر", LEMMA: "مستحضر", NORM: "مستحضر", TAG: "ADJ"}, - {ORTH: "ید", LEMMA: "ید", NORM: "ید", TAG: "VERB"}, - ], - "مسلمانم": [ - {ORTH: "مسلمان", LEMMA: "مسلمان", NORM: "مسلمان", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "VERB"}, - ], - "مسلمانند": [ - {ORTH: "مسلمان", LEMMA: "مسلمان", NORM: "مسلمان", TAG: "NOUN"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "مشتریانش": [ - {ORTH: "مشتریان", LEMMA: "مشتریان", NORM: "مشتریان", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "مشتهایمان": [ - {ORTH: "مشتهای", LEMMA: "مشتهای", NORM: "مشتهای", TAG: "NOUN"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "مشخصند": [ - {ORTH: "مشخص", LEMMA: "مشخص", NORM: "مشخص", TAG: "ADJ"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "مشغولند": [ - {ORTH: "مشغول", LEMMA: "مشغول", NORM: "مشغول", TAG: "ADJ"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "مشغولیم": [ - {ORTH: "مشغول", LEMMA: "مشغول", NORM: "مشغول", TAG: "ADJ"}, - {ORTH: "یم", LEMMA: "یم", NORM: "یم", TAG: "VERB"}, - ], - "مشهورش": [ - {ORTH: "مشهور", LEMMA: "مشهور", NORM: "مشهور", TAG: "ADJ"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "مشکلاتشان": [ - {ORTH: "مشکلات", LEMMA: "مشکلات", NORM: "مشکلات", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "مشکلم": [ - {ORTH: "مشکل", LEMMA: "مشکل", NORM: "مشکل", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "مطمئنم": [ - {ORTH: "مطمئن", LEMMA: "مطمئن", NORM: "مطمئن", TAG: "ADJ"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "VERB"}, - ], - "معامله‌مان": [ - {ORTH: "معامله‌", LEMMA: "معامله‌", NORM: "معامله‌", TAG: "NOUN"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "معتقدم": [ - {ORTH: "معتقد", LEMMA: "معتقد", NORM: "معتقد", TAG: "ADJ"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "VERB"}, - ], - "معتقدند": [ - {ORTH: "معتقد", LEMMA: "معتقد", NORM: "معتقد", TAG: "ADJ"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "معتقدیم": [ - {ORTH: "معتقد", LEMMA: "معتقد", NORM: "معتقد", TAG: "ADJ"}, - {ORTH: "یم", LEMMA: "یم", NORM: "یم", TAG: "VERB"}, - ], - "معرفی‌اش": [ - {ORTH: "معرفی‌", LEMMA: "معرفی‌", NORM: "معرفی‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "معروفش": [ - {ORTH: "معروف", LEMMA: "معروف", NORM: "معروف", TAG: "ADJ"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "معضلاتمان": [ - {ORTH: "معضلات", LEMMA: "معضلات", NORM: "معضلات", TAG: "NOUN"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "معلمش": [ - {ORTH: "معلم", LEMMA: "معلم", NORM: "معلم", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "معنایش": [ - {ORTH: "معنای", LEMMA: "معنای", NORM: "معنای", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "مغزشان": [ - {ORTH: "مغز", LEMMA: "مغز", NORM: "مغز", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "مفیدند": [ - {ORTH: "مفید", LEMMA: "مفید", NORM: "مفید", TAG: "ADJ"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "مقابلش": [ - {ORTH: "مقابل", LEMMA: "مقابل", NORM: "مقابل", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: 
"ش", NORM: "ش", TAG: "NOUN"}, - ], - "مقاله‌اش": [ - {ORTH: "مقاله‌", LEMMA: "مقاله‌", NORM: "مقاله‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "مقدمش": [ - {ORTH: "مقدم", LEMMA: "مقدم", NORM: "مقدم", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "مقرش": [ - {ORTH: "مقر", LEMMA: "مقر", NORM: "مقر", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "مقصدشان": [ - {ORTH: "مقصد", LEMMA: "مقصد", NORM: "مقصد", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "مقصرند": [ - {ORTH: "مقصر", LEMMA: "مقصر", NORM: "مقصر", TAG: "ADJ"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "مقصودتان": [ - {ORTH: "مقصود", LEMMA: "مقصود", NORM: "مقصود", TAG: "NOUN"}, - {ORTH: "تان", LEMMA: "تان", NORM: "تان", TAG: "NOUN"}, - ], - "ملاقاتهایش": [ - {ORTH: "ملاقاتهای", LEMMA: "ملاقاتهای", NORM: "ملاقاتهای", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "ممکنشان": [ - {ORTH: "ممکن", LEMMA: "ممکن", NORM: "ممکن", TAG: "ADJ"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "ممیزیهاست": [ - {ORTH: "ممیزیها", LEMMA: "ممیزیها", NORM: "ممیزیها", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "منظورم": [ - {ORTH: "منظور", LEMMA: "منظور", NORM: "منظور", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "منی": [ - {ORTH: "من", LEMMA: "من", NORM: "من", TAG: "NOUN"}, - {ORTH: "ی", LEMMA: "ی", NORM: "ی", TAG: "VERB"}, - ], - "منید": [ - {ORTH: "من", LEMMA: "من", NORM: "من", TAG: "NOUN"}, - {ORTH: "ید", LEMMA: "ید", NORM: "ید", TAG: "VERB"}, - ], - "مهربانش": [ - {ORTH: "مهربان", LEMMA: "مهربان", NORM: "مهربان", TAG: "ADJ"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "مهم‌اند": [ - {ORTH: "مهم‌", LEMMA: "مهم‌", NORM: "مهم‌", TAG: "ADJ"}, - {ORTH: "اند", LEMMA: "اند", NORM: "اند", TAG: "VERB"}, - ], - "مواجهند": [ - {ORTH: "مواجه", LEMMA: "مواجه", NORM: "مواجه", TAG: "ADJ"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "مواجه‌اند": [ - {ORTH: "مواجه‌", LEMMA: "مواجه‌", NORM: "مواجه‌", TAG: "ADJ"}, - {ORTH: "اند", LEMMA: "اند", NORM: "اند", TAG: "VERB"}, - ], - "مواخذه‌ات": [ - {ORTH: "مواخذه", LEMMA: "مواخذه", NORM: "مواخذه", TAG: "NOUN"}, - {ORTH: "‌ات", LEMMA: "ات", NORM: "ات", TAG: "NOUN"}, - ], - "مواضعشان": [ - {ORTH: "مواضع", LEMMA: "مواضع", NORM: "مواضع", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "مواضعمان": [ - {ORTH: "مواضع", LEMMA: "مواضع", NORM: "مواضع", TAG: "NOUN"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "موافقند": [ - {ORTH: "موافق", LEMMA: "موافق", NORM: "موافق", TAG: "ADJ"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "موجوداتش": [ - {ORTH: "موجودات", LEMMA: "موجودات", NORM: "موجودات", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "موجودند": [ - {ORTH: "موجود", LEMMA: "موجود", NORM: "موجود", TAG: "ADJ"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "موردش": [ - {ORTH: "مورد", LEMMA: "مورد", NORM: "مورد", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "موضعشان": [ - {ORTH: "موضع", LEMMA: "موضع", NORM: "موضع", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "موظفند": [ - {ORTH: "موظف", LEMMA: "موظف", NORM: "موظف", TAG: "ADJ"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "موهایش": [ - {ORTH: "موهای", LEMMA: "موهای", NORM: "موهای", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", 
TAG: "NOUN"}, - ], - "موهایمان": [ - {ORTH: "موهای", LEMMA: "موهای", NORM: "موهای", TAG: "NOUN"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "مویم": [ - {ORTH: "مو", LEMMA: "مو", NORM: "مو", TAG: "NOUN"}, - {ORTH: "یم", LEMMA: "یم", NORM: "یم", TAG: "NOUN"}, - ], - "ناخرسندند": [ - {ORTH: "ناخرسند", LEMMA: "ناخرسند", NORM: "ناخرسند", TAG: "ADJ"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "ناراحتیش": [ - {ORTH: "ناراحتی", LEMMA: "ناراحتی", NORM: "ناراحتی", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "ناراضی‌اند": [ - {ORTH: "ناراضی‌", LEMMA: "ناراضی‌", NORM: "ناراضی‌", TAG: "ADJ"}, - {ORTH: "اند", LEMMA: "اند", NORM: "اند", TAG: "VERB"}, - ], - "نارواست": [ - {ORTH: "ناروا", LEMMA: "ناروا", NORM: "ناروا", TAG: "ADJ"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "نازش": [ - {ORTH: "ناز", LEMMA: "ناز", NORM: "ناز", TAG: "ADJ"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "نامش": [ - {ORTH: "نام", LEMMA: "نام", NORM: "نام", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "نامشان": [ - {ORTH: "نام", LEMMA: "نام", NORM: "نام", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "نامم": [ - {ORTH: "نام", LEMMA: "نام", NORM: "نام", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "نامه‌ات": [ - {ORTH: "نامه", LEMMA: "نامه", NORM: "نامه", TAG: "NOUN"}, - {ORTH: "‌ات", LEMMA: "ات", NORM: "ات", TAG: "NOUN"}, - ], - "نامه‌ام": [ - {ORTH: "نامه‌", LEMMA: "نامه‌", NORM: "نامه‌", TAG: "NOUN"}, - {ORTH: "ام", LEMMA: "ام", NORM: "ام", TAG: "NOUN"}, - ], - "ناچارم": [ - {ORTH: "ناچار", LEMMA: "ناچار", NORM: "ناچار", TAG: "ADJ"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "VERB"}, - ], - "نخست‌وزیری‌اش": [ - { - ORTH: "نخست‌وزیری‌", - LEMMA: "نخست‌وزیری‌", - NORM: "نخست‌وزیری‌", - TAG: "NOUN", - }, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "نزدش": [ - {ORTH: "نزد", LEMMA: "نزد", NORM: "نزد", TAG: "ADP"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "نشانم": [ - {ORTH: "نشان", LEMMA: "نشان", NORM: "نشان", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "نظرات‌شان": [ - {ORTH: "نظرات‌", LEMMA: "نظرات‌", NORM: "نظرات‌", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "نظرتان": [ - {ORTH: "نظر", LEMMA: "نظر", NORM: "نظر", TAG: "NOUN"}, - {ORTH: "تان", LEMMA: "تان", NORM: "تان", TAG: "NOUN"}, - ], - "نظرش": [ - {ORTH: "نظر", LEMMA: "نظر", NORM: "نظر", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "نظرشان": [ - {ORTH: "نظر", LEMMA: "نظر", NORM: "نظر", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "نظرم": [ - {ORTH: "نظر", LEMMA: "نظر", NORM: "نظر", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "نظرهایشان": [ - {ORTH: "نظرهای", LEMMA: "نظرهای", NORM: "نظرهای", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "نفاقش": [ - {ORTH: "نفاق", LEMMA: "نفاق", NORM: "نفاق", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "نفرند": [ - {ORTH: "نفر", LEMMA: "نفر", NORM: "نفر", TAG: "NOUN"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "نفوذیند": [ - {ORTH: "نفوذی", LEMMA: "نفوذی", NORM: "نفوذی", TAG: "ADJ"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "نقطه‌نظراتتان": [ - {ORTH: "نقطه‌نظرات", LEMMA: "نقطه‌نظرات", NORM: "نقطه‌نظرات", TAG: "NOUN"}, - {ORTH: "تان", LEMMA: "تان", NORM: "تان", TAG: "NOUN"}, - ], - 
"نمایشی‌مان": [ - {ORTH: "نمایشی‌", LEMMA: "نمایشی‌", NORM: "نمایشی‌", TAG: "ADJ"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "نمایندگی‌شان": [ - {ORTH: "نمایندگی‌", LEMMA: "نمایندگی‌", NORM: "نمایندگی‌", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "نمونه‌اش": [ - {ORTH: "نمونه‌", LEMMA: "نمونه‌", NORM: "نمونه‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "نمی‌پذیرندش": [ - {ORTH: "نمی‌پذیرند", LEMMA: "نمی‌پذیرند", NORM: "نمی‌پذیرند", TAG: "VERB"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "نوآوری‌اش": [ - {ORTH: "نوآوری‌", LEMMA: "نوآوری‌", NORM: "نوآوری‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "نوشته‌هایشان": [ - {ORTH: "نوشته‌های", LEMMA: "نوشته‌های", NORM: "نوشته‌های", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "نوشته‌هایم": [ - {ORTH: "نوشته‌ها", LEMMA: "نوشته‌ها", NORM: "نوشته‌ها", TAG: "NOUN"}, - {ORTH: "یم", LEMMA: "یم", NORM: "یم", TAG: "NOUN"}, - ], - "نکردنشان": [ - {ORTH: "نکردن", LEMMA: "نکردن", NORM: "نکردن", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "نگاهداری‌شان": [ - {ORTH: "نگاهداری‌", LEMMA: "نگاهداری‌", NORM: "نگاهداری‌", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "نگاهش": [ - {ORTH: "نگاه", LEMMA: "نگاه", NORM: "نگاه", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "نگرانم": [ - {ORTH: "نگران", LEMMA: "نگران", NORM: "نگران", TAG: "ADJ"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "VERB"}, - ], - "نگرشهایشان": [ - {ORTH: "نگرشهای", LEMMA: "نگرشهای", NORM: "نگرشهای", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "نیازمندند": [ - {ORTH: "نیازمند", LEMMA: "نیازمند", NORM: "نیازمند", TAG: "ADJ"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "هدفش": [ - {ORTH: "هدف", LEMMA: "هدف", NORM: "هدف", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "همانست": [ - {ORTH: "همان", LEMMA: "همان", NORM: "همان", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "همراهش": [ - {ORTH: "همراه", LEMMA: "همراه", NORM: "همراه", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "همسرتان": [ - {ORTH: "همسر", LEMMA: "همسر", NORM: "همسر", TAG: "NOUN"}, - {ORTH: "تان", LEMMA: "تان", NORM: "تان", TAG: "NOUN"}, - ], - "همسرش": [ - {ORTH: "همسر", LEMMA: "همسر", NORM: "همسر", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "همسرم": [ - {ORTH: "همسر", LEMMA: "همسر", NORM: "همسر", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "همفکرانش": [ - {ORTH: "همفکران", LEMMA: "همفکران", NORM: "همفکران", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "همه‌اش": [ - {ORTH: "همه‌", LEMMA: "همه‌", NORM: "همه‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "همه‌شان": [ - {ORTH: "همه‌", LEMMA: "همه‌", NORM: "همه‌", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "همکارانش": [ - {ORTH: "همکاران", LEMMA: "همکاران", NORM: "همکاران", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "هم‌نظریم": [ - {ORTH: "هم‌نظر", LEMMA: "هم‌نظر", NORM: "هم‌نظر", TAG: "ADJ"}, - {ORTH: "یم", LEMMA: "یم", NORM: "یم", TAG: "VERB"}, - ], - "هنرش": [ - {ORTH: "هنر", LEMMA: "هنر", NORM: "هنر", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "هواست": [ - {ORTH: "هوا", LEMMA: "هوا", NORM: 
"هوا", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "هویتش": [ - {ORTH: "هویت", LEMMA: "هویت", NORM: "هویت", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "وابسته‌اند": [ - {ORTH: "وابسته‌", LEMMA: "وابسته‌", NORM: "وابسته‌", TAG: "ADJ"}, - {ORTH: "اند", LEMMA: "اند", NORM: "اند", TAG: "VERB"}, - ], - "واقفند": [ - {ORTH: "واقف", LEMMA: "واقف", NORM: "واقف", TAG: "ADJ"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "والدینشان": [ - {ORTH: "والدین", LEMMA: "والدین", NORM: "والدین", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "وجدان‌تان": [ - {ORTH: "وجدان‌", LEMMA: "وجدان‌", NORM: "وجدان‌", TAG: "NOUN"}, - {ORTH: "تان", LEMMA: "تان", NORM: "تان", TAG: "NOUN"}, - ], - "وجودشان": [ - {ORTH: "وجود", LEMMA: "وجود", NORM: "وجود", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "وطنم": [ - {ORTH: "وطن", LEMMA: "وطن", NORM: "وطن", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "وعده‌اش": [ - {ORTH: "وعده‌", LEMMA: "وعده‌", NORM: "وعده‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "وقتمان": [ - {ORTH: "وقت", LEMMA: "وقت", NORM: "وقت", TAG: "NOUN"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "ولادتش": [ - {ORTH: "ولادت", LEMMA: "ولادت", NORM: "ولادت", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "پایانش": [ - {ORTH: "پایان", LEMMA: "پایان", NORM: "پایان", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "پایش": [ - {ORTH: "پای", LEMMA: "پای", NORM: "پای", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "پایین‌ترند": [ - {ORTH: "پایین‌تر", LEMMA: "پایین‌تر", NORM: "پایین‌تر", TAG: "ADJ"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "پدرت": [ - {ORTH: "پدر", LEMMA: "پدر", NORM: "پدر", TAG: "NOUN"}, - {ORTH: "ت", LEMMA: "ت", NORM: "ت", TAG: "NOUN"}, - ], - "پدرش": [ - {ORTH: "پدر", LEMMA: "پدر", NORM: "پدر", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "پدرشان": [ - {ORTH: "پدر", LEMMA: "پدر", NORM: "پدر", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "پدرم": [ - {ORTH: "پدر", LEMMA: "پدر", NORM: "پدر", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "پربارش": [ - {ORTH: "پربار", LEMMA: "پربار", NORM: "پربار", TAG: "ADJ"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "پروردگارت": [ - {ORTH: "پروردگار", LEMMA: "پروردگار", NORM: "پروردگار", TAG: "NOUN"}, - {ORTH: "ت", LEMMA: "ت", NORM: "ت", TAG: "NOUN"}, - ], - "پسرتان": [ - {ORTH: "پسر", LEMMA: "پسر", NORM: "پسر", TAG: "NOUN"}, - {ORTH: "تان", LEMMA: "تان", NORM: "تان", TAG: "NOUN"}, - ], - "پسرش": [ - {ORTH: "پسر", LEMMA: "پسر", NORM: "پسر", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "پسرعمویش": [ - {ORTH: "پسرعموی", LEMMA: "پسرعموی", NORM: "پسرعموی", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "پسر‌عمویت": [ - {ORTH: "پسر‌عموی", LEMMA: "پسر‌عموی", NORM: "پسر‌عموی", TAG: "NOUN"}, - {ORTH: "ت", LEMMA: "ت", NORM: "ت", TAG: "NOUN"}, - ], - "پشتش": [ - {ORTH: "پشت", LEMMA: "پشت", NORM: "پشت", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "پشیمونی": [ - {ORTH: "پشیمون", LEMMA: "پشیمون", NORM: "پشیمون", TAG: "ADJ"}, - {ORTH: "ی", LEMMA: "ی", NORM: "ی", TAG: "VERB"}, - ], - "پولش": [ - {ORTH: "پول", LEMMA: "پول", NORM: "پول", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: 
"NOUN"}, - ], - "پژوهش‌هایش": [ - {ORTH: "پژوهش‌های", LEMMA: "پژوهش‌های", NORM: "پژوهش‌های", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "پیامبرش": [ - {ORTH: "پیامبر", LEMMA: "پیامبر", NORM: "پیامبر", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "پیامبری": [ - {ORTH: "پیامبر", LEMMA: "پیامبر", NORM: "پیامبر", TAG: "NOUN"}, - {ORTH: "ی", LEMMA: "ی", NORM: "ی", TAG: "VERB"}, - ], - "پیامش": [ - {ORTH: "پیام", LEMMA: "پیام", NORM: "پیام", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "پیداست": [ - {ORTH: "پیدا", LEMMA: "پیدا", NORM: "پیدا", TAG: "ADJ"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "پیراهنش": [ - {ORTH: "پیراهن", LEMMA: "پیراهن", NORM: "پیراهن", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "پیروانش": [ - {ORTH: "پیروان", LEMMA: "پیروان", NORM: "پیروان", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "پیشانی‌اش": [ - {ORTH: "پیشانی‌", LEMMA: "پیشانی‌", NORM: "پیشانی‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "پیمانت": [ - {ORTH: "پیمان", LEMMA: "پیمان", NORM: "پیمان", TAG: "NOUN"}, - {ORTH: "ت", LEMMA: "ت", NORM: "ت", TAG: "NOUN"}, - ], - "پیوندشان": [ - {ORTH: "پیوند", LEMMA: "پیوند", NORM: "پیوند", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "چاپش": [ - {ORTH: "چاپ", LEMMA: "چاپ", NORM: "چاپ", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "چت": [ - {ORTH: "چ", LEMMA: "چ", NORM: "چ", TAG: "ADV"}, - {ORTH: "ت", LEMMA: "ت", NORM: "ت", TAG: "NOUN"}, - ], - "چته": [ - {ORTH: "چ", LEMMA: "چ", NORM: "چ", TAG: "ADV"}, - {ORTH: "ت", LEMMA: "ت", NORM: "ت", TAG: "NOUN"}, - {ORTH: "ه", LEMMA: "ه", NORM: "ه", TAG: "VERB"}, - ], - "چرخ‌هایش": [ - {ORTH: "چرخ‌های", LEMMA: "چرخ‌های", NORM: "چرخ‌های", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "چشمم": [ - {ORTH: "چشم", LEMMA: "چشم", NORM: "چشم", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "چشمهایش": [ - {ORTH: "چشمهای", LEMMA: "چشمهای", NORM: "چشمهای", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "چشمهایشان": [ - {ORTH: "چشمهای", LEMMA: "چشمهای", NORM: "چشمهای", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "چمنم": [ - {ORTH: "چمن", LEMMA: "چمن", NORM: "چمن", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "چهره‌اش": [ - {ORTH: "چهره‌", LEMMA: "چهره‌", NORM: "چهره‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "چکاره‌اند": [ - {ORTH: "چکاره‌", LEMMA: "چکاره‌", NORM: "چکاره‌", TAG: "ADV"}, - {ORTH: "اند", LEMMA: "اند", NORM: "اند", TAG: "VERB"}, - ], - "چیزهاست": [ - {ORTH: "چیزها", LEMMA: "چیزها", NORM: "چیزها", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "چیزهایش": [ - {ORTH: "چیزهای", LEMMA: "چیزهای", NORM: "چیزهای", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "چیزیست": [ - {ORTH: "چیزی", LEMMA: "چیزی", NORM: "چیزی", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "چیست": [ - {ORTH: "چی", LEMMA: "چی", NORM: "چی", TAG: "ADV"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "کارش": [ - {ORTH: "کار", LEMMA: "کار", NORM: "کار", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "کارشان": [ - {ORTH: "کار", LEMMA: "کار", NORM: "کار", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "کارم": [ 
- {ORTH: "کار", LEMMA: "کار", NORM: "کار", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "کارند": [ - {ORTH: "کار", LEMMA: "کار", NORM: "کار", TAG: "NOUN"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "کارهایم": [ - {ORTH: "کارها", LEMMA: "کارها", NORM: "کارها", TAG: "NOUN"}, - {ORTH: "یم", LEMMA: "یم", NORM: "یم", TAG: "NOUN"}, - ], - "کافیست": [ - {ORTH: "کافی", LEMMA: "کافی", NORM: "کافی", TAG: "ADJ"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "کتابخانه‌اش": [ - {ORTH: "کتابخانه‌", LEMMA: "کتابخانه‌", NORM: "کتابخانه‌", TAG: "NOUN"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "کتابش": [ - {ORTH: "کتاب", LEMMA: "کتاب", NORM: "کتاب", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "کتابهاشان": [ - {ORTH: "کتابها", LEMMA: "کتابها", NORM: "کتابها", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "کجاست": [ - {ORTH: "کجا", LEMMA: "کجا", NORM: "کجا", TAG: "ADV"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "کدورتهایشان": [ - {ORTH: "کدورتهای", LEMMA: "کدورتهای", NORM: "کدورتهای", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "کردنش": [ - {ORTH: "کردن", LEMMA: "کردن", NORM: "کردن", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "کرم‌خورده‌اش": [ - {ORTH: "کرم‌خورده‌", LEMMA: "کرم‌خورده‌", NORM: "کرم‌خورده‌", TAG: "ADJ"}, - {ORTH: "اش", LEMMA: "اش", NORM: "اش", TAG: "NOUN"}, - ], - "کشش": [ - {ORTH: "کش", LEMMA: "کش", NORM: "کش", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "کشورش": [ - {ORTH: "کشور", LEMMA: "کشور", NORM: "کشور", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "کشورشان": [ - {ORTH: "کشور", LEMMA: "کشور", NORM: "کشور", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "کشورمان": [ - {ORTH: "کشور", LEMMA: "کشور", NORM: "کشور", TAG: "NOUN"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "کشورهاست": [ - {ORTH: "کشورها", LEMMA: "کشورها", NORM: "کشورها", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "کلیشه‌هاست": [ - {ORTH: "کلیشه‌ها", LEMMA: "کلیشه‌ها", NORM: "کلیشه‌ها", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "کمبودهاست": [ - {ORTH: "کمبودها", LEMMA: "کمبودها", NORM: "کمبودها", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "کمتره": [ - {ORTH: "کمتر", LEMMA: "کمتر", NORM: "کمتر", TAG: "ADJ"}, - {ORTH: "ه", LEMMA: "ه", NORM: "ه", TAG: "VERB"}, - ], - "کمکم": [ - {ORTH: "کمک", LEMMA: "کمک", NORM: "کمک", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "کنارش": [ - {ORTH: "کنار", LEMMA: "کنار", NORM: "کنار", TAG: "ADP"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "کودکانشان": [ - {ORTH: "کودکان", LEMMA: "کودکان", NORM: "کودکان", TAG: "NOUN"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "کوچکش": [ - {ORTH: "کوچک", LEMMA: "کوچک", NORM: "کوچک", TAG: "ADJ"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "کیست": [ - {ORTH: "کی", LEMMA: "کی", NORM: "کی", TAG: "NOUN"}, - {ORTH: "ست", LEMMA: "ست", NORM: "ست", TAG: "VERB"}, - ], - "کیفش": [ - {ORTH: "کیف", LEMMA: "کیف", NORM: "کیف", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "گذشته‌اند": [ - {ORTH: "گذشته‌", LEMMA: "گذشته‌", NORM: "گذشته‌", TAG: "ADJ"}, - {ORTH: "اند", LEMMA: "اند", NORM: "اند", TAG: "VERB"}, - ], - "گرانقدرش": [ - {ORTH: "گرانقدر", LEMMA: 
"گرانقدر", NORM: "گرانقدر", TAG: "ADJ"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "گرانقدرشان": [ - {ORTH: "گرانقدر", LEMMA: "گرانقدر", NORM: "گرانقدر", TAG: "ADJ"}, - {ORTH: "شان", LEMMA: "شان", NORM: "شان", TAG: "NOUN"}, - ], - "گردنتان": [ - {ORTH: "گردن", LEMMA: "گردن", NORM: "گردن", TAG: "NOUN"}, - {ORTH: "تان", LEMMA: "تان", NORM: "تان", TAG: "NOUN"}, - ], - "گردنش": [ - {ORTH: "گردن", LEMMA: "گردن", NORM: "گردن", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "گرفتارند": [ - {ORTH: "گرفتار", LEMMA: "گرفتار", NORM: "گرفتار", TAG: "ADJ"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "گرفتنت": [ - {ORTH: "گرفتن", LEMMA: "گرفتن", NORM: "گرفتن", TAG: "NOUN"}, - {ORTH: "ت", LEMMA: "ت", NORM: "ت", TAG: "NOUN"}, - ], - "گروهند": [ - {ORTH: "گروه", LEMMA: "گروه", NORM: "گروه", TAG: "NOUN"}, - {ORTH: "ند", LEMMA: "ند", NORM: "ند", TAG: "VERB"}, - ], - "گروگانهایش": [ - {ORTH: "گروگانهای", LEMMA: "گروگانهای", NORM: "گروگانهای", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "گریمش": [ - {ORTH: "گریم", LEMMA: "گریم", NORM: "گریم", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "گفتارمان": [ - {ORTH: "گفتار", LEMMA: "گفتار", NORM: "گفتار", TAG: "NOUN"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "گلهایش": [ - {ORTH: "گلهای", LEMMA: "گلهای", NORM: "گلهای", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "گلویش": [ - {ORTH: "گلوی", LEMMA: "گلوی", NORM: "گلوی", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "گناهت": [ - {ORTH: "گناه", LEMMA: "گناه", NORM: "گناه", TAG: "NOUN"}, - {ORTH: "ت", LEMMA: "ت", NORM: "ت", TAG: "NOUN"}, - ], - "گوشش": [ - {ORTH: "گوش", LEMMA: "گوش", NORM: "گوش", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "گوشم": [ - {ORTH: "گوش", LEMMA: "گوش", NORM: "گوش", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "گولش": [ - {ORTH: "گول", LEMMA: "گول", NORM: "گول", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - "یادتان": [ - {ORTH: "یاد", LEMMA: "یاد", NORM: "یاد", TAG: "NOUN"}, - {ORTH: "تان", LEMMA: "تان", NORM: "تان", TAG: "NOUN"}, - ], - "یادم": [ - {ORTH: "یاد", LEMMA: "یاد", NORM: "یاد", TAG: "NOUN"}, - {ORTH: "م", LEMMA: "م", NORM: "م", TAG: "NOUN"}, - ], - "یادمان": [ - {ORTH: "یاد", LEMMA: "یاد", NORM: "یاد", TAG: "NOUN"}, - {ORTH: "مان", LEMMA: "مان", NORM: "مان", TAG: "NOUN"}, - ], - "یارانش": [ - {ORTH: "یاران", LEMMA: "یاران", NORM: "یاران", TAG: "NOUN"}, - {ORTH: "ش", LEMMA: "ش", NORM: "ش", TAG: "NOUN"}, - ], - } -) -TOKENIZER_EXCEPTIONS = _exc diff --git a/spacy/lang/fi/__init__.py b/spacy/lang/fi/__init__.py index 45d2f886f..9233c6547 100644 --- a/spacy/lang/fi/__init__.py +++ b/spacy/lang/fi/__init__.py @@ -1,28 +1,15 @@ -# coding: utf8 -from __future__ import unicode_literals - from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES - -from ..tokenizer_exceptions import BASE_EXCEPTIONS -from ..norm_exceptions import BASE_NORMS from ...language import Language -from ...attrs import LANG, NORM -from ...util import update_exc, add_lookups class FinnishDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters.update(LEX_ATTRS) - lex_attr_getters[LANG] = lambda text: "fi" - lex_attr_getters[NORM] = add_lookups( - 
Language.Defaults.lex_attr_getters[NORM], BASE_NORMS - ) infixes = TOKENIZER_INFIXES suffixes = TOKENIZER_SUFFIXES - tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) + tokenizer_exceptions = TOKENIZER_EXCEPTIONS + lex_attr_getters = LEX_ATTRS stop_words = STOP_WORDS diff --git a/spacy/lang/fi/examples.py b/spacy/lang/fi/examples.py index 88be248a6..930fac273 100644 --- a/spacy/lang/fi/examples.py +++ b/spacy/lang/fi/examples.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - """ Example sentences to test spaCy and its language models. >>> from spacy.lang.fi.examples import sentences diff --git a/spacy/lang/fi/lex_attrs.py b/spacy/lang/fi/lex_attrs.py index e960b55eb..4d500cead 100644 --- a/spacy/lang/fi/lex_attrs.py +++ b/spacy/lang/fi/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/fi/punctuation.py b/spacy/lang/fi/punctuation.py index a85c0b228..6e14dde38 100644 --- a/spacy/lang/fi/punctuation.py +++ b/spacy/lang/fi/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..char_classes import LIST_ELLIPSES, LIST_ICONS, LIST_HYPHENS from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER from ..punctuation import TOKENIZER_SUFFIXES diff --git a/spacy/lang/fi/stop_words.py b/spacy/lang/fi/stop_words.py index e8e39ec6f..8e8dcfa56 100644 --- a/spacy/lang/fi/stop_words.py +++ b/spacy/lang/fi/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Source https://github.com/stopwords-iso/stopwords-fi/blob/master/stopwords-fi.txt # Reformatted with some minor corrections STOP_WORDS = set( diff --git a/spacy/lang/fi/tokenizer_exceptions.py b/spacy/lang/fi/tokenizer_exceptions.py index 7cdc7cf11..22d710cb0 100644 --- a/spacy/lang/fi/tokenizer_exceptions.py +++ b/spacy/lang/fi/tokenizer_exceptions.py @@ -1,7 +1,6 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import ORTH, LEMMA +from ..tokenizer_exceptions import BASE_EXCEPTIONS +from ...symbols import ORTH +from ...util import update_exc _exc = {} @@ -9,76 +8,76 @@ _exc = {} # Source https://www.cs.tut.fi/~jkorpela/kielenopas/5.5.html for exc_data in [ - {ORTH: "aik.", LEMMA: "aikaisempi"}, - {ORTH: "alk.", LEMMA: "alkaen"}, - {ORTH: "alv.", LEMMA: "arvonlisävero"}, - {ORTH: "ark.", LEMMA: "arkisin"}, - {ORTH: "as.", LEMMA: "asunto"}, - {ORTH: "eaa.", LEMMA: "ennen ajanlaskun alkua"}, - {ORTH: "ed.", LEMMA: "edellinen"}, - {ORTH: "esim.", LEMMA: "esimerkki"}, - {ORTH: "huom.", LEMMA: "huomautus"}, - {ORTH: "jne.", LEMMA: "ja niin edelleen"}, - {ORTH: "joht.", LEMMA: "johtaja"}, - {ORTH: "k.", LEMMA: "kuollut"}, - {ORTH: "ks.", LEMMA: "katso"}, - {ORTH: "lk.", LEMMA: "luokka"}, - {ORTH: "lkm.", LEMMA: "lukumäärä"}, - {ORTH: "lyh.", LEMMA: "lyhenne"}, - {ORTH: "läh.", LEMMA: "lähettäjä"}, - {ORTH: "miel.", LEMMA: "mieluummin"}, - {ORTH: "milj.", LEMMA: "miljoona"}, - {ORTH: "Mm.", LEMMA: "muun muassa"}, - {ORTH: "mm.", LEMMA: "muun muassa"}, - {ORTH: "myöh.", LEMMA: "myöhempi"}, - {ORTH: "n.", LEMMA: "noin"}, - {ORTH: "nimim.", LEMMA: "nimimerkki"}, - {ORTH: "n:o", LEMMA: "numero"}, - {ORTH: "N:o", LEMMA: "numero"}, - {ORTH: "nro", LEMMA: "numero"}, - {ORTH: "ns.", LEMMA: "niin sanottu"}, - {ORTH: "nyk.", LEMMA: "nykyinen"}, - {ORTH: "oik.", LEMMA: "oikealla"}, - {ORTH: "os.", LEMMA: "osoite"}, - {ORTH: "p.", LEMMA: "päivä"}, - {ORTH: "par.", LEMMA: "paremmin"}, - {ORTH: "per.", 
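Aside on the FinnishDefaults change above: the class now exposes the shared language data as plain class attributes, and the old LANG/NORM lex_attr_getters plumbing is dropped because v3 handles the language code and norm lookups elsewhere. A minimal sketch of how the defaults are picked up, assuming spaCy v3 with this patch applied; the sample text is illustrative only:

import spacy

nlp = spacy.blank("fi")          # builds a Finnish tokenizer from FinnishDefaults
doc = nlp("esim. tämä")
# "esim." is a special case in TOKENIZER_EXCEPTIONS, so the period stays attached
assert doc[0].text == "esim."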
LEMMA: "perustettu"}, - {ORTH: "pj.", LEMMA: "puheenjohtaja"}, - {ORTH: "puh.joht.", LEMMA: "puheenjohtaja"}, - {ORTH: "prof.", LEMMA: "professori"}, - {ORTH: "puh.", LEMMA: "puhelin"}, - {ORTH: "pvm.", LEMMA: "päivämäärä"}, - {ORTH: "rak.", LEMMA: "rakennettu"}, - {ORTH: "ry.", LEMMA: "rekisteröity yhdistys"}, - {ORTH: "s.", LEMMA: "sivu"}, - {ORTH: "siht.", LEMMA: "sihteeri"}, - {ORTH: "synt.", LEMMA: "syntynyt"}, - {ORTH: "t.", LEMMA: "toivoo"}, - {ORTH: "tark.", LEMMA: "tarkastanut"}, - {ORTH: "til.", LEMMA: "tilattu"}, - {ORTH: "tms.", LEMMA: "tai muuta sellaista"}, - {ORTH: "toim.", LEMMA: "toimittanut"}, - {ORTH: "v.", LEMMA: "vuosi"}, - {ORTH: "vas.", LEMMA: "vasen"}, - {ORTH: "vast.", LEMMA: "vastaus"}, - {ORTH: "vrt.", LEMMA: "vertaa"}, - {ORTH: "yht.", LEMMA: "yhteensä"}, - {ORTH: "yl.", LEMMA: "yleinen"}, - {ORTH: "ym.", LEMMA: "ynnä muuta"}, - {ORTH: "yms.", LEMMA: "ynnä muuta sellaista"}, - {ORTH: "yo.", LEMMA: "ylioppilas"}, - {ORTH: "yliopp.", LEMMA: "ylioppilas"}, - {ORTH: "ao.", LEMMA: "asianomainen"}, - {ORTH: "em.", LEMMA: "edellä mainittu"}, - {ORTH: "ko.", LEMMA: "kyseessä oleva"}, - {ORTH: "ml.", LEMMA: "mukaan luettuna"}, - {ORTH: "po.", LEMMA: "puheena oleva"}, - {ORTH: "so.", LEMMA: "se on"}, - {ORTH: "ts.", LEMMA: "toisin sanoen"}, - {ORTH: "vm.", LEMMA: "viimeksi mainittu"}, - {ORTH: "srk.", LEMMA: "seurakunta"}, + {ORTH: "aik."}, + {ORTH: "alk."}, + {ORTH: "alv."}, + {ORTH: "ark."}, + {ORTH: "as."}, + {ORTH: "eaa."}, + {ORTH: "ed."}, + {ORTH: "esim."}, + {ORTH: "huom."}, + {ORTH: "jne."}, + {ORTH: "joht."}, + {ORTH: "k."}, + {ORTH: "ks."}, + {ORTH: "lk."}, + {ORTH: "lkm."}, + {ORTH: "lyh."}, + {ORTH: "läh."}, + {ORTH: "miel."}, + {ORTH: "milj."}, + {ORTH: "Mm."}, + {ORTH: "mm."}, + {ORTH: "myöh."}, + {ORTH: "n."}, + {ORTH: "nimim."}, + {ORTH: "n:o"}, + {ORTH: "N:o"}, + {ORTH: "nro"}, + {ORTH: "ns."}, + {ORTH: "nyk."}, + {ORTH: "oik."}, + {ORTH: "os."}, + {ORTH: "p."}, + {ORTH: "par."}, + {ORTH: "per."}, + {ORTH: "pj."}, + {ORTH: "puh.joht."}, + {ORTH: "prof."}, + {ORTH: "puh."}, + {ORTH: "pvm."}, + {ORTH: "rak."}, + {ORTH: "ry."}, + {ORTH: "s."}, + {ORTH: "siht."}, + {ORTH: "synt."}, + {ORTH: "t."}, + {ORTH: "tark."}, + {ORTH: "til."}, + {ORTH: "tms."}, + {ORTH: "toim."}, + {ORTH: "v."}, + {ORTH: "vas."}, + {ORTH: "vast."}, + {ORTH: "vrt."}, + {ORTH: "yht."}, + {ORTH: "yl."}, + {ORTH: "ym."}, + {ORTH: "yms."}, + {ORTH: "yo."}, + {ORTH: "yliopp."}, + {ORTH: "ao."}, + {ORTH: "em."}, + {ORTH: "ko."}, + {ORTH: "ml."}, + {ORTH: "po."}, + {ORTH: "so."}, + {ORTH: "ts."}, + {ORTH: "vm."}, + {ORTH: "srk."}, ]: _exc[exc_data[ORTH]] = [exc_data] -TOKENIZER_EXCEPTIONS = _exc +TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc) diff --git a/spacy/lang/fr/__init__.py b/spacy/lang/fr/__init__.py index 7727aff0e..1e0011fba 100644 --- a/spacy/lang/fr/__init__.py +++ b/spacy/lang/fr/__init__.py @@ -1,44 +1,26 @@ -# coding: utf8 -from __future__ import unicode_literals +from typing import Optional + +from thinc.api import Model from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS, TOKEN_MATCH from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES from .punctuation import TOKENIZER_SUFFIXES -from .tag_map import TAG_MAP from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS -from .lemmatizer import FrenchLemmatizer from .syntax_iterators import SYNTAX_ITERATORS - -from ..tokenizer_exceptions import BASE_EXCEPTIONS -from ..norm_exceptions import BASE_NORMS +from .lemmatizer import FrenchLemmatizer from ...language import Language -from ...lookups import 
Lookups -from ...attrs import LANG, NORM -from ...util import update_exc, add_lookups class FrenchDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters.update(LEX_ATTRS) - lex_attr_getters[LANG] = lambda text: "fr" - lex_attr_getters[NORM] = add_lookups( - Language.Defaults.lex_attr_getters[NORM], BASE_NORMS - ) - tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) - tag_map = TAG_MAP - stop_words = STOP_WORDS + tokenizer_exceptions = TOKENIZER_EXCEPTIONS prefixes = TOKENIZER_PREFIXES infixes = TOKENIZER_INFIXES suffixes = TOKENIZER_SUFFIXES token_match = TOKEN_MATCH + lex_attr_getters = LEX_ATTRS syntax_iterators = SYNTAX_ITERATORS - - @classmethod - def create_lemmatizer(cls, nlp=None, lookups=None): - if lookups is None: - lookups = Lookups() - return FrenchLemmatizer(lookups) + stop_words = STOP_WORDS class French(Language): @@ -46,4 +28,14 @@ class French(Language): Defaults = FrenchDefaults +@French.factory( + "lemmatizer", + assigns=["token.lemma"], + default_config={"model": None, "mode": "rule"}, + default_score_weights={"lemma_acc": 1.0}, +) +def make_lemmatizer(nlp: Language, model: Optional[Model], name: str, mode: str): + return FrenchLemmatizer(nlp.vocab, model, name, mode=mode) + + __all__ = ["French"] diff --git a/spacy/lang/fr/_tokenizer_exceptions_list.py b/spacy/lang/fr/_tokenizer_exceptions_list.py index 0fcf02351..50f439501 100644 --- a/spacy/lang/fr/_tokenizer_exceptions_list.py +++ b/spacy/lang/fr/_tokenizer_exceptions_list.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - FR_BASE_EXCEPTIONS = [ "(+)-amphétamine", "(5R,6S)-7,8-didehydro-4,5-époxy-3-méthoxy-N-méthylmorphinan-6-ol", diff --git a/spacy/lang/fr/examples.py b/spacy/lang/fr/examples.py index a874c22fc..a74a62204 100644 --- a/spacy/lang/fr/examples.py +++ b/spacy/lang/fr/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/fr/lemmatizer.py b/spacy/lang/fr/lemmatizer.py index af8345e1b..bb5a270ab 100644 --- a/spacy/lang/fr/lemmatizer.py +++ b/spacy/lang/fr/lemmatizer.py @@ -1,10 +1,7 @@ -# coding: utf8 -from __future__ import unicode_literals +from typing import List, Tuple -from ...lemmatizer import Lemmatizer -from ...symbols import POS, NOUN, VERB, ADJ, ADV, PRON, DET, AUX, PUNCT, ADP -from ...symbols import SCONJ, CCONJ -from ...symbols import VerbForm_inf, VerbForm_none, Number_sing, Degree_pos +from ...pipeline import Lemmatizer +from ...tokens import Token class FrenchLemmatizer(Lemmatizer): @@ -17,69 +14,48 @@ class FrenchLemmatizer(Lemmatizer): the lookup table. 
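Aside on the fr/__init__.py change above: the lemmatizer is no longer created through Defaults.create_lemmatizer but registered as a pipeline factory on the French subclass, so it is requested by name like any other component. A rough usage sketch, assuming spaCy v3 and that spacy-lookups-data is installed so the French tables can be loaded; the config shown just restates the defaults:

import spacy

nlp = spacy.blank("fr")
# Resolves to make_lemmatizer() above, because the factory is registered on French.
nlp.add_pipe("lemmatizer", config={"mode": "rule"})
nlp.initialize()  # loads the lookup tables declared by get_lookups_config
# In a full pipeline a tagger/morphologizer runs before this component,
# since rule mode reads token.pos_.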
""" - def __call__(self, string, univ_pos, morphology=None): - lookup_table = self.lookups.get_table("lemma_lookup", {}) - if "lemma_rules" not in self.lookups: - return [lookup_table.get(string, string)] - if univ_pos in (NOUN, "NOUN", "noun"): - univ_pos = "noun" - elif univ_pos in (VERB, "VERB", "verb"): - univ_pos = "verb" - elif univ_pos in (ADJ, "ADJ", "adj"): - univ_pos = "adj" - elif univ_pos in (ADP, "ADP", "adp"): - univ_pos = "adp" - elif univ_pos in (ADV, "ADV", "adv"): - univ_pos = "adv" - elif univ_pos in (AUX, "AUX", "aux"): - univ_pos = "aux" - elif univ_pos in (CCONJ, "CCONJ", "cconj"): - univ_pos = "cconj" - elif univ_pos in (DET, "DET", "det"): - univ_pos = "det" - elif univ_pos in (PRON, "PRON", "pron"): - univ_pos = "pron" - elif univ_pos in (PUNCT, "PUNCT", "punct"): - univ_pos = "punct" - elif univ_pos in (SCONJ, "SCONJ", "sconj"): - univ_pos = "sconj" + @classmethod + def get_lookups_config(cls, mode: str) -> Tuple[List[str], List[str]]: + if mode == "rule": + required = ["lemma_lookup", "lemma_rules", "lemma_exc", "lemma_index"] + return (required, []) else: - return [self.lookup(string)] + return super().get_lookups_config(mode) + + def rule_lemmatize(self, token: Token) -> List[str]: + cache_key = (token.orth, token.pos) + if cache_key in self.cache: + return self.cache[cache_key] + string = token.text + univ_pos = token.pos_.lower() + if univ_pos in ("", "eol", "space"): + return [string.lower()] + elif "lemma_rules" not in self.lookups or univ_pos not in ( + "noun", + "verb", + "adj", + "adp", + "adv", + "aux", + "cconj", + "det", + "pron", + "punct", + "sconj", + ): + return self.lookup_lemmatize(token) index_table = self.lookups.get_table("lemma_index", {}) exc_table = self.lookups.get_table("lemma_exc", {}) rules_table = self.lookups.get_table("lemma_rules", {}) - lemmas = self.lemmatize( - string, - index_table.get(univ_pos, {}), - exc_table.get(univ_pos, {}), - rules_table.get(univ_pos, []), - ) - return lemmas - - def noun(self, string, morphology=None): - return self(string, "noun", morphology) - - def verb(self, string, morphology=None): - return self(string, "verb", morphology) - - def adj(self, string, morphology=None): - return self(string, "adj", morphology) - - def punct(self, string, morphology=None): - return self(string, "punct", morphology) - - def lookup(self, string, orth=None): - lookup_table = self.lookups.get_table("lemma_lookup", {}) - if orth is not None and orth in lookup_table: - return lookup_table[orth][0] - return string - - def lemmatize(self, string, index, exceptions, rules): lookup_table = self.lookups.get_table("lemma_lookup", {}) + index = index_table.get(univ_pos, {}) + exceptions = exc_table.get(univ_pos, {}) + rules = rules_table.get(univ_pos, []) string = string.lower() forms = [] if string in index: forms.append(string) + self.cache[cache_key] = forms return forms forms.extend(exceptions.get(string, [])) oov_forms = [] @@ -96,7 +72,9 @@ class FrenchLemmatizer(Lemmatizer): if not forms: forms.extend(oov_forms) if not forms and string in lookup_table.keys(): - forms.append(lookup_table[string][0]) + forms.append(self.lookup_lemmatize(token)[0]) if not forms: forms.append(string) - return list(set(forms)) + forms = list(set(forms)) + self.cache[cache_key] = forms + return forms diff --git a/spacy/lang/fr/lex_attrs.py b/spacy/lang/fr/lex_attrs.py index e3ccd9fdd..da98c6e37 100644 --- a/spacy/lang/fr/lex_attrs.py +++ b/spacy/lang/fr/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from 
...attrs import LIKE_NUM diff --git a/spacy/lang/fr/punctuation.py b/spacy/lang/fr/punctuation.py index 7d50c4a9e..873d01d87 100644 --- a/spacy/lang/fr/punctuation.py +++ b/spacy/lang/fr/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CURRENCY from ..char_classes import CONCAT_QUOTES, UNITS, ALPHA, ALPHA_LOWER, ALPHA_UPPER diff --git a/spacy/lang/fr/stop_words.py b/spacy/lang/fr/stop_words.py index ae8432043..a331f3c0f 100644 --- a/spacy/lang/fr/stop_words.py +++ b/spacy/lang/fr/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ a à â abord absolument afin ah ai aie ailleurs ainsi ait allaient allo allons diff --git a/spacy/lang/fr/syntax_iterators.py b/spacy/lang/fr/syntax_iterators.py index d6c12e69f..68117a54d 100644 --- a/spacy/lang/fr/syntax_iterators.py +++ b/spacy/lang/fr/syntax_iterators.py @@ -1,29 +1,18 @@ -# coding: utf8 -from __future__ import unicode_literals +from typing import Union, Iterator from ...symbols import NOUN, PROPN, PRON from ...errors import Errors +from ...tokens import Doc, Span -def noun_chunks(doclike): - """ - Detect base noun phrases from a dependency parse. Works on both Doc and Span. - """ - labels = [ - "nsubj", - "nsubj:pass", - "obj", - "iobj", - "ROOT", - "appos", - "nmod", - "nmod:poss", - ] +def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Span]: + """Detect base noun phrases from a dependency parse. Works on Doc and Span.""" + # fmt: off + labels = ["nsubj", "nsubj:pass", "obj", "iobj", "ROOT", "appos", "nmod", "nmod:poss"] + # fmt: on doc = doclike.doc # Ensure works on both Doc and Span. 
- - if not doc.is_parsed: + if not doc.has_annotation("DEP"): raise ValueError(Errors.E029) - np_deps = [doc.vocab.strings[label] for label in labels] conj = doc.vocab.strings.add("conj") np_label = doc.vocab.strings.add("NP") diff --git a/spacy/lang/fr/tag_map.py b/spacy/lang/fr/tag_map.py deleted file mode 100644 index 93b43c2ec..000000000 --- a/spacy/lang/fr/tag_map.py +++ /dev/null @@ -1,219 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import POS, PUNCT, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB -from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX, SCONJ - - -TAG_MAP = { - "ADJ__Gender=Fem|Number=Plur": {POS: ADJ}, - "ADJ__Gender=Fem|Number=Plur|NumType=Ord": {POS: ADJ}, - "ADJ__Gender=Fem|Number=Sing": {POS: ADJ}, - "ADJ__Gender=Fem|Number=Sing|NumType=Ord": {POS: ADJ}, - "ADJ__Gender=Masc": {POS: ADJ}, - "ADJ__Gender=Masc|Number=Plur": {POS: ADJ}, - "ADJ__Gender=Masc|Number=Plur|NumType=Ord": {POS: ADJ}, - "ADJ__Gender=Masc|Number=Sing": {POS: ADJ}, - "ADJ__Gender=Masc|Number=Sing|NumType=Card": {POS: ADJ}, - "ADJ__Gender=Masc|Number=Sing|NumType=Ord": {POS: ADJ}, - "ADJ__NumType=Card": {POS: ADJ}, - "ADJ__NumType=Ord": {POS: ADJ}, - "ADJ__Number=Plur": {POS: ADJ}, - "ADJ__Number=Sing": {POS: ADJ}, - "ADJ__Number=Sing|NumType=Ord": {POS: ADJ}, - "ADJ___": {POS: ADJ}, - "ADP__Gender=Fem|Number=Plur|Person=3": {POS: ADP}, - "ADP__Gender=Masc|Number=Plur|Person=3": {POS: ADP}, - "ADP__Gender=Masc|Number=Sing|Person=3": {POS: ADP}, - "ADP___": {POS: ADP}, - "ADV__Polarity=Neg": {POS: ADV}, - "ADV__PronType=Int": {POS: ADV}, - "ADV___": {POS: ADV}, - "AUX__Gender=Fem|Number=Plur|Tense=Past|VerbForm=Part": {POS: AUX}, - "AUX__Gender=Fem|Number=Plur|Tense=Past|VerbForm=Part|Voice=Pass": {POS: AUX}, - "AUX__Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part": {POS: AUX}, - "AUX__Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: AUX}, - "AUX__Gender=Masc|Number=Plur|Tense=Past|VerbForm=Part": {POS: AUX}, - "AUX__Gender=Masc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Pass": {POS: AUX}, - "AUX__Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part": {POS: AUX}, - "AUX__Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: AUX}, - "AUX__Mood=Cnd|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "AUX__Mood=Cnd|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "AUX__Mood=Cnd|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "AUX__Mood=Cnd|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "AUX__Mood=Cnd|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "AUX__Mood=Imp|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "AUX__Mood=Ind|Number=Plur|Person=1|Tense=Fut|VerbForm=Fin": {POS: AUX}, - "AUX__Mood=Ind|Number=Plur|Person=1|Tense=Imp|VerbForm=Fin": {POS: AUX}, - "AUX__Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "AUX__Mood=Ind|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "AUX__Mood=Ind|Number=Plur|Person=3|Tense=Fut|VerbForm=Fin": {POS: AUX}, - "AUX__Mood=Ind|Number=Plur|Person=3|Tense=Imp|VerbForm=Fin": {POS: AUX}, - "AUX__Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin": {POS: AUX}, - "AUX__Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "AUX__Mood=Ind|Number=Sing|Person=1|Tense=Imp|VerbForm=Fin": {POS: AUX}, - "AUX__Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "AUX__Mood=Ind|Number=Sing|Person=2|Tense=Imp|VerbForm=Fin": {POS: AUX}, - 
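Aside on the fr syntax_iterators change above: the old doc.is_parsed flag is replaced by doc.has_annotation("DEP"), which reports whether any token actually carries a dependency label. A minimal sketch of that guard, assuming spaCy v3; the tokens, heads and labels below are illustrative:

from spacy.tokens import Doc
from spacy.vocab import Vocab

unparsed = Doc(Vocab(), words=["Un", "exemple"])
assert not unparsed.has_annotation("DEP")   # the guard above would raise E029 here

parsed = Doc(Vocab(), words=["Un", "exemple"], heads=[1, 1], deps=["det", "ROOT"])
assert parsed.has_annotation("DEP")         # dependency annotation present, guard passes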
"AUX__Mood=Ind|Number=Sing|Person=3|Tense=Fut|VerbForm=Fin": {POS: AUX}, - "AUX__Mood=Ind|Number=Sing|Person=3|Tense=Imp|VerbForm=Fin": {POS: AUX}, - "AUX__Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin": {POS: AUX}, - "AUX__Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "AUX__Mood=Sub|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "AUX__Mood=Sub|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "AUX__Mood=Sub|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "AUX__Mood=Sub|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "AUX__Mood=Sub|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "AUX__Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: AUX}, - "AUX__Tense=Past|VerbForm=Part": {POS: AUX}, - "AUX__Tense=Pres|VerbForm=Part": {POS: AUX}, - "AUX__VerbForm=Inf": {POS: AUX}, - "CCONJ___": {POS: CCONJ}, - "DET__Definite=Def|Gender=Fem|Number=Sing|PronType=Art": {POS: DET}, - "DET__Definite=Def|Gender=Masc|Number=Sing|PronType=Art": {POS: DET}, - "DET__Definite=Def|Number=Plur|PronType=Art": {POS: DET}, - "DET__Definite=Def|Number=Sing|PronType=Art": {POS: DET}, - "DET__Definite=Ind|Gender=Fem|Number=Plur|PronType=Art": {POS: DET}, - "DET__Definite=Ind|Gender=Fem|Number=Sing|PronType=Art": {POS: DET}, - "DET__Definite=Ind|Gender=Masc|Number=Plur|PronType=Art": {POS: DET}, - "DET__Definite=Ind|Gender=Masc|Number=Sing|PronType=Art": {POS: DET}, - "DET__Definite=Ind|Number=Plur|PronType=Art": {POS: DET}, - "DET__Definite=Ind|Number=Sing|PronType=Art": {POS: DET}, - "DET__Gender=Fem|Number=Plur": {POS: DET}, - "DET__Gender=Fem|Number=Plur|PronType=Int": {POS: DET}, - "DET__Gender=Fem|Number=Sing": {POS: DET}, - "DET__Gender=Fem|Number=Sing|Poss=Yes": {POS: DET}, - "DET__Gender=Fem|Number=Sing|PronType=Dem": {POS: DET}, - "DET__Gender=Fem|Number=Sing|PronType=Int": {POS: DET}, - "DET__Gender=Masc|Number=Plur": {POS: DET}, - "DET__Gender=Masc|Number=Sing": {POS: DET}, - "DET__Gender=Masc|Number=Sing|PronType=Dem": {POS: DET}, - "DET__Gender=Masc|Number=Sing|PronType=Int": {POS: DET}, - "DET__Number=Plur": {POS: DET}, - "DET__Number=Plur|Poss=Yes": {POS: DET}, - "DET__Number=Plur|PronType=Dem": {POS: DET}, - "DET__Number=Sing": {POS: DET}, - "DET__Number=Sing|Poss=Yes": {POS: DET}, - "DET___": {POS: DET}, - "INTJ___": {POS: INTJ}, - "NOUN__Gender=Fem": {POS: NOUN}, - "NOUN__Gender=Fem|Number=Plur": {POS: NOUN}, - "NOUN__Gender=Fem|Number=Sing": {POS: NOUN}, - "NOUN__Gender=Masc": {POS: NOUN}, - "NOUN__Gender=Masc|Number=Plur": {POS: NOUN}, - "NOUN__Gender=Masc|Number=Plur|NumType=Card": {POS: NOUN}, - "NOUN__Gender=Masc|Number=Sing": {POS: NOUN}, - "NOUN__Gender=Masc|Number=Sing|NumType=Card": {POS: NOUN}, - "NOUN__NumType=Card": {POS: NOUN}, - "NOUN__Number=Plur": {POS: NOUN}, - "NOUN__Number=Sing": {POS: NOUN}, - "NOUN___": {POS: NOUN}, - "NUM__Gender=Masc|Number=Plur|NumType=Card": {POS: NUM}, - "NUM__NumType=Card": {POS: NUM}, - "PART___": {POS: PART}, - "PRON__Gender=Fem|Number=Plur": {POS: PRON}, - "PRON__Gender=Fem|Number=Plur|Person=3": {POS: PRON}, - "PRON__Gender=Fem|Number=Plur|Person=3|PronType=Prs": {POS: PRON}, - "PRON__Gender=Fem|Number=Plur|Person=3|PronType=Rel": {POS: PRON}, - "PRON__Gender=Fem|Number=Plur|PronType=Dem": {POS: PRON}, - "PRON__Gender=Fem|Number=Plur|PronType=Rel": {POS: PRON}, - "PRON__Gender=Fem|Number=Sing|Person=3": {POS: PRON}, - "PRON__Gender=Fem|Number=Sing|Person=3|PronType=Prs": {POS: PRON}, - "PRON__Gender=Fem|Number=Sing|PronType=Dem": {POS: PRON}, - 
"PRON__Gender=Fem|Number=Sing|PronType=Rel": {POS: PRON}, - "PRON__Gender=Fem|PronType=Rel": {POS: PRON}, - "PRON__Gender=Masc|Number=Plur": {POS: PRON}, - "PRON__Gender=Masc|Number=Plur|Person=3": {POS: PRON}, - "PRON__Gender=Masc|Number=Plur|Person=3|PronType=Prs": {POS: PRON}, - "PRON__Gender=Masc|Number=Plur|Person=3|PronType=Rel": {POS: PRON}, - "PRON__Gender=Masc|Number=Plur|PronType=Dem": {POS: PRON}, - "PRON__Gender=Masc|Number=Plur|PronType=Rel": {POS: PRON}, - "PRON__Gender=Masc|Number=Sing": {POS: PRON}, - "PRON__Gender=Masc|Number=Sing|Person=3": {POS: PRON}, - "PRON__Gender=Masc|Number=Sing|Person=3|PronType=Dem": {POS: PRON}, - "PRON__Gender=Masc|Number=Sing|Person=3|PronType=Prs": {POS: PRON}, - "PRON__Gender=Masc|Number=Sing|PronType=Dem": {POS: PRON}, - "PRON__Gender=Masc|Number=Sing|PronType=Rel": {POS: PRON}, - "PRON__Gender=Masc|PronType=Rel": {POS: PRON}, - "PRON__NumType=Card|PronType=Rel": {POS: PRON}, - "PRON__Number=Plur|Person=1": {POS: PRON}, - "PRON__Number=Plur|Person=1|PronType=Prs": {POS: PRON}, - "PRON__Number=Plur|Person=1|Reflex=Yes": {POS: PRON}, - "PRON__Number=Plur|Person=2": {POS: PRON}, - "PRON__Number=Plur|Person=2|PronType=Prs": {POS: PRON}, - "PRON__Number=Plur|Person=2|Reflex=Yes": {POS: PRON}, - "PRON__Number=Plur|Person=3": {POS: PRON}, - "PRON__Number=Plur|PronType=Rel": {POS: PRON}, - "PRON__Number=Sing|Person=1": {POS: PRON}, - "PRON__Number=Sing|Person=1|PronType=Prs": {POS: PRON}, - "PRON__Number=Sing|Person=1|Reflex=Yes": {POS: PRON}, - "PRON__Number=Sing|Person=2|PronType=Prs": {POS: PRON}, - "PRON__Number=Sing|Person=3": {POS: PRON}, - "PRON__Number=Sing|PronType=Dem": {POS: PRON}, - "PRON__Number=Sing|PronType=Rel": {POS: PRON}, - "PRON__Person=3": {POS: PRON}, - "PRON__Person=3|Reflex=Yes": {POS: PRON}, - "PRON__PronType=Int": {POS: PRON}, - "PRON__PronType=Rel": {POS: PRON}, - "PRON___": {POS: PRON}, - "PROPN__Gender=Fem|Number=Plur": {POS: PROPN}, - "PROPN__Gender=Fem|Number=Sing": {POS: PROPN}, - "PROPN__Gender=Masc": {POS: PROPN}, - "PROPN__Gender=Masc|Number=Plur": {POS: PROPN}, - "PROPN__Gender=Masc|Number=Sing": {POS: PROPN}, - "PROPN__Number=Plur": {POS: PROPN}, - "PROPN__Number=Sing": {POS: PROPN}, - "PROPN___": {POS: PROPN}, - "PUNCT___": {POS: PUNCT}, - "SCONJ___": {POS: SCONJ}, - "VERB__Gender=Fem|Number=Plur|Tense=Past|VerbForm=Part": {POS: VERB}, - "VERB__Gender=Fem|Number=Plur|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB}, - "VERB__Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part": {POS: VERB}, - "VERB__Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB}, - "VERB__Gender=Masc|Number=Plur|Tense=Past|VerbForm=Part": {POS: VERB}, - "VERB__Gender=Masc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB}, - "VERB__Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part": {POS: VERB}, - "VERB__Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB}, - "VERB__Gender=Masc|Tense=Past|VerbForm=Part": {POS: VERB}, - "VERB__Gender=Masc|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB}, - "VERB__Mood=Cnd|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "VERB__Mood=Cnd|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "VERB__Mood=Cnd|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "VERB__Mood=Imp|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "VERB__Mood=Imp|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "VERB__Mood=Imp|VerbForm=Fin": {POS: VERB}, - "VERB__Mood=Ind|Number=Plur|Person=1|Tense=Fut|VerbForm=Fin": {POS: VERB}, 
- "VERB__Mood=Ind|Number=Plur|Person=1|Tense=Imp|VerbForm=Fin": {POS: VERB}, - "VERB__Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "VERB__Mood=Ind|Number=Plur|Person=2|Tense=Fut|VerbForm=Fin": {POS: VERB}, - "VERB__Mood=Ind|Number=Plur|Person=2|Tense=Imp|VerbForm=Fin": {POS: VERB}, - "VERB__Mood=Ind|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "VERB__Mood=Ind|Number=Plur|Person=3|Tense=Fut|VerbForm=Fin": {POS: VERB}, - "VERB__Mood=Ind|Number=Plur|Person=3|Tense=Imp|VerbForm=Fin": {POS: VERB}, - "VERB__Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin": {POS: VERB}, - "VERB__Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "VERB__Mood=Ind|Number=Sing|Person=1|Tense=Fut|VerbForm=Fin": {POS: VERB}, - "VERB__Mood=Ind|Number=Sing|Person=1|Tense=Imp|VerbForm=Fin": {POS: VERB}, - "VERB__Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "VERB__Mood=Ind|Number=Sing|Person=3|Tense=Fut|VerbForm=Fin": {POS: VERB}, - "VERB__Mood=Ind|Number=Sing|Person=3|Tense=Imp|VerbForm=Fin": {POS: VERB}, - "VERB__Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin": {POS: VERB}, - "VERB__Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "VERB__Mood=Ind|Person=3|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "VERB__Mood=Ind|Person=3|VerbForm=Fin": {POS: VERB}, - "VERB__Mood=Ind|VerbForm=Fin": {POS: VERB}, - "VERB__Mood=Sub|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "VERB__Mood=Sub|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "VERB__Mood=Sub|Number=Sing|Person=3|Tense=Past|VerbForm=Fin": {POS: VERB}, - "VERB__Mood=Sub|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "VERB__Number=Plur|Tense=Past|VerbForm=Part": {POS: VERB}, - "VERB__Number=Plur|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB}, - "VERB__Number=Sing|Tense=Past|VerbForm=Part": {POS: VERB}, - "VERB__Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB}, - "VERB__Tense=Past|VerbForm=Part": {POS: VERB}, - "VERB__Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB}, - "VERB__Tense=Pres|VerbForm=Part": {POS: VERB}, - "VERB__VerbForm=Inf": {POS: VERB}, - "VERB__VerbForm=Part": {POS: VERB}, - "X___": {POS: X}, - "_SP": {POS: SPACE}, -} diff --git a/spacy/lang/fr/tokenizer_exceptions.py b/spacy/lang/fr/tokenizer_exceptions.py index 933607bdf..6f429eecc 100644 --- a/spacy/lang/fr/tokenizer_exceptions.py +++ b/spacy/lang/fr/tokenizer_exceptions.py @@ -1,11 +1,11 @@ -# coding: utf8 -from __future__ import unicode_literals - import re +from ..tokenizer_exceptions import BASE_EXCEPTIONS from .punctuation import ELISION, HYPHENS from ..char_classes import ALPHA_LOWER, ALPHA -from ...symbols import ORTH, LEMMA +from ...symbols import ORTH +from ...util import update_exc + # not using the large _tokenizer_exceptions_list by default as it slows down the tokenizer # from ._tokenizer_exceptions_list import FR_BASE_EXCEPTIONS @@ -28,29 +28,29 @@ def lower_first_letter(text): return text[0].lower() + text[1:] -_exc = {"J.-C.": [{LEMMA: "Jésus", ORTH: "J."}, {LEMMA: "Christ", ORTH: "-C."}]} +_exc = {"J.-C.": [{ORTH: "J."}, {ORTH: "-C."}]} for exc_data in [ - {LEMMA: "avant", ORTH: "av."}, - {LEMMA: "janvier", ORTH: "janv."}, - {LEMMA: "février", ORTH: "févr."}, - {LEMMA: "avril", ORTH: "avr."}, - {LEMMA: "juillet", ORTH: "juill."}, - {LEMMA: "septembre", ORTH: "sept."}, - {LEMMA: "octobre", ORTH: "oct."}, - {LEMMA: "novembre", ORTH: "nov."}, - {LEMMA: "décembre", ORTH: "déc."}, - {LEMMA: "après", ORTH: "apr."}, - 
{LEMMA: "docteur", ORTH: "Dr."}, - {LEMMA: "monsieur", ORTH: "M."}, - {LEMMA: "monsieur", ORTH: "Mr."}, - {LEMMA: "madame", ORTH: "Mme."}, - {LEMMA: "mademoiselle", ORTH: "Mlle."}, - {LEMMA: "numéro", ORTH: "n°"}, - {LEMMA: "degrés", ORTH: "d°"}, - {LEMMA: "saint", ORTH: "St."}, - {LEMMA: "sainte", ORTH: "Ste."}, + {ORTH: "av."}, + {ORTH: "janv."}, + {ORTH: "févr."}, + {ORTH: "avr."}, + {ORTH: "juill."}, + {ORTH: "sept."}, + {ORTH: "oct."}, + {ORTH: "nov."}, + {ORTH: "déc."}, + {ORTH: "apr."}, + {ORTH: "Dr."}, + {ORTH: "M."}, + {ORTH: "Mr."}, + {ORTH: "Mme."}, + {ORTH: "Mlle."}, + {ORTH: "n°"}, + {ORTH: "d°"}, + {ORTH: "St."}, + {ORTH: "Ste."}, ]: _exc[exc_data[ORTH]] = [exc_data] @@ -80,55 +80,37 @@ for orth in [ _exc[orth] = [{ORTH: orth}] -for verb, verb_lemma in [ - ("a", "avoir"), - ("est", "être"), - ("semble", "sembler"), - ("indique", "indiquer"), - ("moque", "moquer"), - ("passe", "passer"), +for verb in [ + "a", + "est" "semble", + "indique", + "moque", + "passe", ]: for orth in [verb, verb.title()]: for pronoun in ["elle", "il", "on"]: - token = "{}-t-{}".format(orth, pronoun) - _exc[token] = [ - {LEMMA: verb_lemma, ORTH: orth}, # , TAG: "VERB"}, - {LEMMA: "t", ORTH: "-t"}, - {LEMMA: pronoun, ORTH: "-" + pronoun}, - ] + token = f"{orth}-t-{pronoun}" + _exc[token] = [{ORTH: orth}, {ORTH: "-t"}, {ORTH: "-" + pronoun}] -for verb, verb_lemma in [("est", "être")]: +for verb in ["est"]: for orth in [verb, verb.title()]: - token = "{}-ce".format(orth) - _exc[token] = [ - {LEMMA: verb_lemma, ORTH: orth}, # , TAG: "VERB"}, - {LEMMA: "ce", ORTH: "-ce"}, - ] + _exc[f"{orth}-ce"] = [{ORTH: orth}, {ORTH: "-ce"}] -for pre, pre_lemma in [("qu'", "que"), ("n'", "ne")]: +for pre in ["qu'", "n'"]: for orth in [pre, pre.title()]: - _exc["%sest-ce" % orth] = [ - {LEMMA: pre_lemma, ORTH: orth}, - {LEMMA: "être", ORTH: "est"}, - {LEMMA: "ce", ORTH: "-ce"}, - ] + _exc[f"{orth}est-ce"] = [{ORTH: orth}, {ORTH: "est"}, {ORTH: "-ce"}] for verb, pronoun in [("est", "il"), ("EST", "IL")]: - token = "{}-{}".format(verb, pronoun) - _exc[token] = [ - {LEMMA: "être", ORTH: verb}, - {LEMMA: pronoun, ORTH: "-" + pronoun}, - ] + _exc[f"{verb}-{pronoun}"] = [{ORTH: verb}, {ORTH: "-" + pronoun}] for s, verb, pronoun in [("s", "est", "il"), ("S", "EST", "IL")]: - token = "{}'{}-{}".format(s, verb, pronoun) - _exc[token] = [ - {LEMMA: "se", ORTH: s + "'"}, - {LEMMA: "être", ORTH: verb}, - {LEMMA: pronoun, ORTH: "-" + pronoun}, + _exc[f"{s}'{verb}-{pronoun}"] = [ + {ORTH: s + "'"}, + {ORTH: verb}, + {ORTH: "-" + pronoun}, ] @@ -455,7 +437,7 @@ _regular_exp += [ ] -TOKENIZER_EXCEPTIONS = _exc +TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc) TOKEN_MATCH = re.compile( "(?iu)" + "|".join("(?:{})".format(m) for m in _regular_exp) ).match diff --git a/spacy/lang/ga/__init__.py b/spacy/lang/ga/__init__.py index 42b4d0d18..80131368b 100644 --- a/spacy/lang/ga/__init__.py +++ b/spacy/lang/ga/__init__.py @@ -1,21 +1,11 @@ -# coding: utf8 -from __future__ import unicode_literals - from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .stop_words import STOP_WORDS - -from ..tokenizer_exceptions import BASE_EXCEPTIONS from ...language import Language -from ...attrs import LANG -from ...util import update_exc class IrishDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = lambda text: "ga" - - tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) - stop_words = set(STOP_WORDS) + tokenizer_exceptions = TOKENIZER_EXCEPTIONS + stop_words 
= STOP_WORDS class Irish(Language): diff --git a/spacy/lang/ga/irish_morphology_helpers.py b/spacy/lang/ga/irish_morphology_helpers.py index 2133f0d22..d606da975 100644 --- a/spacy/lang/ga/irish_morphology_helpers.py +++ b/spacy/lang/ga/irish_morphology_helpers.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # fmt: off consonants = ["b", "c", "d", "f", "g", "h", "j", "k", "l", "m", "n", "p", "q", "r", "s", "t", "v", "w", "x", "z"] broad_vowels = ["a", "á", "o", "ó", "u", "ú"] diff --git a/spacy/lang/ga/stop_words.py b/spacy/lang/ga/stop_words.py index d8f705b59..4ef052ca5 100644 --- a/spacy/lang/ga/stop_words.py +++ b/spacy/lang/ga/stop_words.py @@ -1,7 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ a ach ag agus an aon ar arna as diff --git a/spacy/lang/ga/tag_map.py b/spacy/lang/ga/tag_map.py deleted file mode 100644 index 1d8284014..000000000 --- a/spacy/lang/ga/tag_map.py +++ /dev/null @@ -1,369 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -# fmt: off -TAG_MAP = { - "ADJ__Case=Gen|Form=Len|Gender=Masc|Number=Sing": {"pos": "ADJ", "Case": "gen", "Gender": "masc", "Number": "sing", "Other": {"Form": "len"}}, - "ADJ__Case=Gen|Gender=Fem|Number=Sing": {"pos": "ADJ", "Case": "gen", "Gender": "fem", "Number": "sing"}, - "ADJ__Case=Gen|Gender=Masc|Number=Sing": {"pos": "ADJ", "Case": "gen", "Gender": "masc", "Number": "sing"}, - "ADJ__Case=Gen|NounType=Strong|Number=Plur": {"pos": "ADJ", "Case": "gen", "Number": "plur", "Other": {"NounType": "strong"}}, - "ADJ__Case=Gen|NounType=Weak|Number=Plur": {"pos": "ADJ", "Case": "gen", "Number": "plur", "Other": {"NounType": "weak"}}, - "ADJ__Case=NomAcc|Form=Len|Gender=Fem|Number=Sing": {"pos": "ADJ", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Other": {"Form": "len"}}, - "ADJ__Case=NomAcc|Form=Len|Gender=Masc|Number=Sing": {"pos": "ADJ", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Other": {"Form": "len"}}, - "ADJ__Case=NomAcc|Gender=Fem|Number=Plur": {"pos": "ADJ", "Case": "nom|acc", "Gender": "fem", "Number": "plur"}, - "ADJ__Case=NomAcc|Gender=Fem|Number=Sing": {"pos": "ADJ", "Case": "nom|acc", "Gender": "fem", "Number": "sing"}, - "ADJ__Case=NomAcc|Gender=Masc|Number=Plur": {"pos": "ADJ", "Case": "nom|acc", "Gender": "masc", "Number": "plur"}, - "ADJ__Case=NomAcc|Gender=Masc|Number=Sing": {"pos": "ADJ", "Case": "nom|acc", "Gender": "masc", "Number": "sing"}, - "ADJ__Case=NomAcc|NounType=NotSlender|Number=Plur": {"pos": "ADJ", "Case": "nom|acc", "Number": "plur", "Other": {"NounType": "notslender"}}, - "ADJ__Case=NomAcc|NounType=Slender|Number=Plur": {"pos": "ADJ", "Case": "nom|acc", "Number": "plur", "Other": {"NounType": "slender"}}, - "ADJ__Degree=Cmp,Sup|Form=Len": {"pos": "ADJ", "Degree": "cmp|sup", "Other": {"Form": "len"}}, - "ADJ__Degree=Cmp,Sup": {"pos": "ADJ", "Degree": "cmp|sup"}, - "ADJ__Degree=Pos|Form=Ecl": {"pos": "ADJ", "Degree": "pos", "Other": {"Form": "ecl"}}, - "ADJ__Degree=Pos|Form=HPref": {"pos": "ADJ", "Degree": "pos", "Other": {"Form": "hpref"}}, - "ADJ__Degree=Pos|Form=Len": {"pos": "ADJ", "Degree": "pos", "Other": {"Form": "len"}}, - "ADJ__Degree=Pos": {"pos": "ADJ", "Degree": "pos"}, - "ADJ__Foreign=Yes": {"pos": "ADJ", "Foreign": "yes"}, - "ADJ__Form=Len|VerbForm=Part": {"pos": "ADJ", "VerbForm": "part", "Other": {"Form": "len"}}, - "ADJ__Gender=Masc|Number=Sing|PartType=Voc": {"pos": "ADJ", "Gender": "masc", "Number": "sing", "Case": "voc"}, - "ADJ__Gender=Masc|Number=Sing|Case=Voc": {"pos": "ADJ", 
"Gender": "masc", "Number": "sing", "Case": "voc"}, - "ADJ__Number=Plur|PartType=Voc": {"pos": "ADJ", "Number": "plur", "Case": "voc"}, - "ADJ__Number=Plur|Case=Voc": {"pos": "ADJ", "Number": "plur", "Case": "voc"}, - "ADJ__Number=Plur": {"pos": "ADJ", "Number": "plur"}, - "ADJ___": {"pos": "ADJ"}, - "ADJ__VerbForm=Part": {"pos": "ADJ", "VerbForm": "part"}, - "ADP__Foreign=Yes": {"pos": "ADP", "Foreign": "yes"}, - "ADP__Form=Len|Number=Plur|Person=1": {"pos": "ADP", "Number": "plur", "Person": 1, "Other": {"Form": "len"}}, - "ADP__Form=Len|Number=Plur|Person=3": {"pos": "ADP", "Number": "plur", "Person": 3, "Other": {"Form": "len"}}, - "ADP__Form=Len|Number=Sing|Person=1": {"pos": "ADP", "Number": "sing", "Person": 1, "Other": {"Form": "len"}}, - "ADP__Gender=Fem|Number=Sing|Person=3": {"pos": "ADP", "Gender": "fem", "Number": "sing", "Person": 3}, - "ADP__Gender=Fem|Number=Sing|Person=3|Poss=Yes": {"pos": "ADP", "Gender": "fem", "Number": "sing", "Person": 3, "Poss": "yes"}, - "ADP__Gender=Fem|Number=Sing|Person=3|Poss=Yes|PronType=Prs": {"pos": "ADP", "Gender": "fem", "Number": "sing", "Person": 3, "Poss": "yes", "PronType": "prs"}, - "ADP__Gender=Masc|Number=Sing|Person=3": {"pos": "ADP", "Gender": "masc", "Number": "sing", "Person": 3}, - "ADP__Gender=Masc|Number=Sing|Person=3|Poss=Yes": {"pos": "ADP", "Gender": "masc", "Number": "sing", "Person": 3, "Poss": "yes"}, - "ADP__Gender=Masc|Number=Sing|Person=3|Poss=Yes|PronType=Prs": {"pos": "ADP", "Gender": "masc", "Number": "sing", "Person": 3, "Poss": "yes", "PronType": "prs"}, - "ADP__Gender=Masc|Number=Sing|Person=3|PronType=Emp": {"pos": "ADP", "Gender": "masc", "Number": "sing", "Person": 3, "PronType": "emp"}, - "ADP__Number=Plur|Person=1": {"pos": "ADP", "Number": "plur", "Person": 1}, - "ADP__Number=Plur|Person=1|Poss=Yes": {"pos": "ADP", "Number": "plur", "Person": 1, "Poss": "yes"}, - "ADP__Number=Plur|Person=1|PronType=Emp": {"pos": "ADP", "Number": "plur", "Person": 1, "PronType": "emp"}, - "ADP__Number=Plur|Person=2": {"pos": "ADP", "Number": "plur", "Person": 2}, - "ADP__Number=Plur|Person=3": {"pos": "ADP", "Number": "plur", "Person": 3}, - "ADP__Number=Plur|Person=3|Poss=Yes": {"pos": "ADP", "Number": "plur", "Person": 3, "Poss": "yes"}, - "ADP__Number=Plur|Person=3|Poss=Yes|PronType=Prs": {"pos": "ADP", "Number": "plur", "Person": 3, "Poss": "yes", "PronType": "prs"}, - "ADP__Number=Plur|Person=3|PronType=Emp": {"pos": "ADP", "Number": "plur", "Person": 3, "PronType": "emp"}, - "ADP__Number=Plur|PronType=Art": {"pos": "ADP", "Number": "plur", "PronType": "art"}, - "ADP__Number=Sing|Person=1": {"pos": "ADP", "Number": "sing", "Person": 1}, - "ADP__Number=Sing|Person=1|Poss=Yes": {"pos": "ADP", "Number": "sing", "Person": 1, "Poss": "yes"}, - "ADP__Number=Sing|Person=1|PronType=Emp": {"pos": "ADP", "Number": "sing", "Person": 1, "PronType": "emp"}, - "ADP__Number=Sing|Person=2": {"pos": "ADP", "Number": "sing", "Person": 2}, - "ADP__Number=Sing|Person=3": {"pos": "ADP", "Number": "sing", "Person": 3}, - "ADP__Number=Sing|PronType=Art": {"pos": "ADP", "Number": "sing", "PronType": "art"}, - "ADP__Person=3|Poss=Yes": {"pos": "ADP", "Person": 3, "Poss": "yes"}, - "ADP___": {"pos": "ADP"}, - "ADP__Poss=Yes": {"pos": "ADP", "Poss": "yes"}, - "ADP__PrepForm=Cmpd": {"pos": "ADP", "Other": {"PrepForm": "cmpd"}}, - "ADP__PronType=Art": {"pos": "ADP", "PronType": "art"}, - "ADV__Form=Len": {"pos": "ADV", "Other": {"Form": "len"}}, - "ADV___": {"pos": "ADV"}, - "ADV__PronType=Int": {"pos": "ADV", "PronType": "int"}, - 
"AUX__Form=VF|Polarity=Neg|PronType=Rel|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "PronType": "rel", "Tense": "past", "Other": {"Form": "vf", "VerbForm": "cop"}}, - "AUX__Form=VF|Polarity=Neg|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "Tense": "past", "Other": {"Form": "vf", "VerbForm": "cop"}}, - "AUX__Form=VF|PronType=Rel|Tense=Past|VerbForm=Cop": {"pos": "AUX", "PronType": "rel", "Tense": "past", "Other": {"Form": "vf", "VerbForm": "cop"}}, - "AUX__Form=VF|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Tense": "past", "Other": {"Form": "vf", "VerbForm": "cop"}}, - "AUX__Form=VF|Tense=Pres|VerbForm=Cop": {"pos": "AUX", "Tense": "pres", "Other": {"Form": "vf", "VerbForm": "cop"}}, - "AUX__Gender=Masc|Number=Sing|Person=3|VerbForm=Cop": {"pos": "AUX", "Gender": "masc", "Number": "sing", "Person": 3, "Other": {"VerbForm": "cop"}}, - "AUX__Mood=Int|Number=Sing|PronType=Art|VerbForm=Cop": {"pos": "AUX", "Number": "sing", "PronType": "art", "Other": {"Mood": "int", "VerbForm": "cop"}}, - "AUX__Mood=Int|Polarity=Neg|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "Tense": "past", "Other": {"Mood": "int", "VerbForm": "cop"}}, - "AUX__Mood=Int|Polarity=Neg|Tense=Pres|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "Tense": "pres", "Other": {"Mood": "int", "VerbForm": "cop"}}, - "AUX__Mood=Int|Tense=Pres|VerbForm=Cop": {"pos": "AUX", "Tense": "pres", "Other": {"Mood": "int", "VerbForm": "cop"}}, - "AUX__PartType=Comp|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Tense": "past", "Other": {"PartType": "comp", "VerbForm": "cop"}}, - "AUX__Polarity=Neg|PronType=Rel|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "PronType": "rel", "Tense": "past", "Other": {"VerbForm": "cop"}}, - "AUX__Polarity=Neg|PronType=Rel|Tense=Pres|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "PronType": "rel", "Tense": "pres", "Other": {"VerbForm": "cop"}}, - "AUX__Polarity=Neg|Tense=Past|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "Tense": "past", "Other": {"VerbForm": "cop"}}, - "AUX__Polarity=Neg|Tense=Pres|VerbForm=Cop": {"pos": "AUX", "Polarity": "neg", "Tense": "pres", "Other": {"VerbForm": "cop"}}, - "AUX___": {"pos": "AUX"}, - "AUX__PronType=Dem|VerbForm=Cop": {"pos": "AUX", "PronType": "dem", "Other": {"VerbForm": "cop"}}, - "AUX__PronType=Rel|Tense=Past|VerbForm=Cop": {"pos": "AUX", "PronType": "rel", "Tense": "past", "Other": {"VerbForm": "cop"}}, - "AUX__PronType=Rel|Tense=Pres|VerbForm=Cop": {"pos": "AUX", "PronType": "rel", "Tense": "pres", "Other": {"VerbForm": "cop"}}, - "AUX__Tense=Past|VerbForm=Cop": {"pos": "AUX", "Tense": "past", "Other": {"VerbForm": "cop"}}, - "AUX__Tense=Pres|VerbForm=Cop": {"pos": "AUX", "Tense": "pres", "Other": {"VerbForm": "cop"}}, - "AUX__VerbForm=Cop": {"pos": "AUX", "Other": {"VerbForm": "cop"}}, - "CCONJ___": {"pos": "CCONJ"}, - "DET__Case=Gen|Definite=Def|Gender=Fem|Number=Sing|PronType=Art": {"pos": "DET", "Case": "gen", "Definite": "def", "Gender": "fem", "Number": "sing", "PronType": "art"}, - "DET__Definite=Def|Form=Ecl": {"pos": "DET", "Definite": "def", "Other": {"Form": "ecl"}}, - "DET__Definite=Def|Gender=Fem|Number=Sing|PronType=Art": {"pos": "DET", "Definite": "def", "Gender": "fem", "Number": "sing", "PronType": "art"}, - "DET__Definite=Def|Number=Plur|PronType=Art": {"pos": "DET", "Definite": "def", "Number": "plur", "PronType": "art"}, - "DET__Definite=Def|Number=Sing|PronType=Art": {"pos": "DET", "Definite": "def", "Number": "sing", "PronType": "art"}, - "DET__Definite=Def": {"pos": "DET", "Definite": 
"def"}, - "DET__Form=HPref|PronType=Ind": {"pos": "DET", "PronType": "ind", "Other": {"Form": "hpref"}}, - "DET__Gender=Fem|Number=Sing|Person=3|Poss=Yes": {"pos": "DET", "Gender": "fem", "Number": "sing", "Person": 3, "Poss": "yes"}, - "DET__Gender=Masc|Number=Sing|Person=3|Poss=Yes": {"pos": "DET", "Gender": "masc", "Number": "sing", "Person": 3, "Poss": "yes"}, - "DET__Number=Plur|Person=1|Poss=Yes": {"pos": "DET", "Number": "plur", "Person": 1, "Poss": "yes"}, - "DET__Number=Plur|Person=3|Poss=Yes": {"pos": "DET", "Number": "plur", "Person": 3, "Poss": "yes"}, - "DET__Number=Sing|Person=1|Poss=Yes": {"pos": "DET", "Number": "sing", "Person": 1, "Poss": "yes"}, - "DET__Number=Sing|Person=2|Poss=Yes": {"pos": "DET", "Number": "sing", "Person": 2, "Poss": "yes"}, - "DET__Number=Sing|PronType=Int": {"pos": "DET", "Number": "sing", "PronType": "int"}, - "DET___": {"pos": "DET"}, - "DET__PronType=Dem": {"pos": "DET", "PronType": "dem"}, - "DET__PronType=Ind": {"pos": "DET", "PronType": "ind"}, - "NOUN__Case=Dat|Definite=Ind|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "dat", "Definite": "ind", "Gender": "fem", "Number": "sing"}, - "NOUN__Case=Dat|Form=Ecl|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "dat", "Gender": "fem", "Number": "sing", "Other": {"Form": "ecl"}}, - "NOUN__Case=Dat|Form=Len|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "dat", "Gender": "fem", "Number": "sing", "Other": {"Form": "len"}}, - "NOUN__Case=Dat|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "dat", "Gender": "fem", "Number": "sing"}, - "NOUN__Case=Dat|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "dat", "Gender": "masc", "Number": "sing"}, - "NOUN__Case=Gen|Definite=Def|Gender=Fem|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Definite": "def", "Gender": "fem", "Number": "plur", "Other": {"NounType": "strong"}}, - "NOUN__Case=Gen|Definite=Def|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "gen", "Definite": "def", "Gender": "fem", "Number": "sing"}, - "NOUN__Case=Gen|Definite=Def|Gender=Masc|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Definite": "def", "Gender": "masc", "Number": "plur", "Other": {"NounType": "strong"}}, - "NOUN__Case=Gen|Definite=Def|Gender=Masc|NounType=Weak|Number=Plur": {"pos": "NOUN", "Case": "gen", "Definite": "def", "Gender": "masc", "Number": "plur", "Other": {"NounType": "weak"}}, - "NOUN__Case=Gen|Definite=Def|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "gen", "Definite": "def", "Gender": "masc", "Number": "sing"}, - "NOUN__Case=Gen|Definite=Ind|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "gen", "Definite": "ind", "Gender": "fem", "Number": "sing"}, - "NOUN__Case=Gen|Form=Ecl|Gender=Fem|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "plur", "Other": {"Form": "ecl", "NounType": "strong"}}, - "NOUN__Case=Gen|Form=Ecl|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "sing", "Other": {"Form": "ecl"}}, - "NOUN__Case=Gen|Form=Ecl|Gender=Masc|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur", "Other": {"Form": "ecl", "NounType": "strong"}}, - "NOUN__Case=Gen|Form=Ecl|Gender=Masc|NounType=Weak|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur", "Other": {"Form": "ecl", "NounType": "weak"}}, - "NOUN__Case=Gen|Form=Ecl|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "sing", "Other": {"Form": "ecl"}}, - 
"NOUN__Case=Gen|Form=HPref|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "sing", "Other": {"Form": "hpref"}}, - "NOUN__Case=Gen|Form=Len|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "sing", "Other": {"Form": "len"}}, - "NOUN__Case=Gen|Form=Len|Gender=Masc|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur", "Other": {"Form": "len", "NounType": "strong"}}, - "NOUN__Case=Gen|Form=Len|Gender=Masc|NounType=Weak|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur", "Other": {"Form": "len", "NounType": "weak"}}, - "NOUN__Case=Gen|Form=Len|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "sing", "Other": {"Form": "len"}}, - "NOUN__Case=Gen|Form=Len|VerbForm=Inf": {"pos": "NOUN", "Case": "gen", "VerbForm": "inf", "Other": {"Form": "len"}}, - "NOUN__Case=Gen|Gender=Fem|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "plur", "Other": {"NounType": "strong"}}, - "NOUN__Case=Gen|Gender=Fem|NounType=Weak|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "plur", "Other": {"NounType": "weak"}}, - "NOUN__Case=Gen|Gender=Fem|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "plur"}, - "NOUN__Case=Gen|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "fem", "Number": "sing"}, - "NOUN__Case=Gen|Gender=Masc|NounType=Strong|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur", "Other": {"NounType": "strong"}}, - "NOUN__Case=Gen|Gender=Masc|NounType=Weak|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur", "Other": {"NounType": "weak"}}, - "NOUN__Case=Gen|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "plur"}, - "NOUN__Case=Gen|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "gen", "Gender": "masc", "Number": "sing"}, - "NOUN__Case=Gen|Number=Sing": {"pos": "NOUN", "Case": "gen", "Number": "sing"}, - "NOUN__Case=Gen|VerbForm=Inf": {"pos": "NOUN", "Case": "gen", "VerbForm": "inf"}, - "NOUN__Case=NomAcc|Definite=Def|Gender=Fem|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Definite": "def", "Gender": "fem", "Number": "plur"}, - "NOUN__Case=NomAcc|Definite=Def|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Definite": "def", "Gender": "fem", "Number": "sing"}, - "NOUN__Case=NomAcc|Definite=Def|Gender=Fem": {"pos": "NOUN", "Case": "nom|acc", "Definite": "def", "Gender": "fem"}, - "NOUN__Case=NomAcc|Definite=Def|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Definite": "def", "Gender": "masc", "Number": "plur"}, - "NOUN__Case=NomAcc|Definite=Def|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Definite": "def", "Gender": "masc", "Number": "sing"}, - "NOUN__Case=NomAcc|Definite=Ind|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Definite": "ind", "Gender": "masc", "Number": "plur"}, - "NOUN__Case=NomAcc|Form=Ecl|Gender=Fem|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "plur", "Other": {"Form": "ecl"}}, - "NOUN__Case=NomAcc|Form=Ecl|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Other": {"Form": "ecl"}}, - "NOUN__Case=NomAcc|Form=Ecl|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "plur", "Other": {"Form": "ecl"}}, - "NOUN__Case=NomAcc|Form=Ecl|Gender=Masc|Number=Sing": 
{"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Other": {"Form": "ecl"}}, - "NOUN__Case=NomAcc|Form=Emp|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Other": {"Form": "emp"}}, - "NOUN__Case=NomAcc|Form=HPref|Gender=Fem|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "plur", "Other": {"Form": "hpref"}}, - "NOUN__Case=NomAcc|Form=HPref|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Other": {"Form": "hpref"}}, - "NOUN__Case=NomAcc|Form=HPref|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "plur", "Other": {"Form": "hpref"}}, - "NOUN__Case=NomAcc|Form=HPref|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Other": {"Form": "hpref"}}, - "NOUN__Case=NomAcc|Form=Len|Gender=Fem|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "plur", "Other": {"Form": "len"}}, - "NOUN__Case=NomAcc|Form=Len|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Other": {"Form": "len"}}, - "NOUN__Case=NomAcc|Form=Len|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "plur", "Other": {"Form": "len"}}, - "NOUN__Case=NomAcc|Form=Len|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Other": {"Form": "len"}}, - "NOUN__Case=NomAcc|Gender=Fem|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "plur"}, - "NOUN__Case=NomAcc|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "fem", "Number": "sing"}, - "NOUN__Case=NomAcc|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "plur"}, - "NOUN__Case=NomAcc|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "nom|acc", "Gender": "masc", "Number": "sing"}, - "NOUN__Case=Voc|Definite=Def|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "voc", "Definite": "def", "Gender": "masc", "Number": "plur"}, - "NOUN__Case=Voc|Form=Len|Gender=Fem|Number=Sing": {"pos": "NOUN", "Case": "voc", "Gender": "fem", "Number": "sing", "Other": {"Form": "len"}}, - "NOUN__Case=Voc|Form=Len|Gender=Masc|Number=Plur": {"pos": "NOUN", "Case": "voc", "Gender": "masc", "Number": "plur", "Other": {"Form": "len"}}, - "NOUN__Case=Voc|Form=Len|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "voc", "Gender": "masc", "Number": "sing", "Other": {"Form": "len"}}, - "NOUN__Case=Voc|Gender=Masc|Number=Sing": {"pos": "NOUN", "Case": "voc", "Gender": "masc", "Number": "sing"}, - "NOUN__Degree=Pos": {"pos": "NOUN", "Degree": "pos"}, - "NOUN__Foreign=Yes": {"pos": "NOUN", "Foreign": "yes"}, - "NOUN__Form=Ecl|Number=Sing": {"pos": "NOUN", "Number": "sing", "Other": {"Form": "ecl"}}, - "NOUN__Form=Ecl|VerbForm=Inf": {"pos": "NOUN", "VerbForm": "inf", "Other": {"Form": "ecl"}}, - "NOUN__Form=Ecl|VerbForm=Vnoun": {"pos": "NOUN", "VerbForm": "vnoun", "Other": {"Form": "ecl"}}, - "NOUN__Form=HPref|VerbForm=Inf": {"pos": "NOUN", "VerbForm": "inf", "Other": {"Form": "hpref"}}, - "NOUN__Form=Len|Number=Sing": {"pos": "NOUN", "Number": "sing", "Other": {"Form": "len"}}, - "NOUN__Form=Len|VerbForm=Inf": {"pos": "NOUN", "VerbForm": "inf", "Other": {"Form": "len"}}, - "NOUN__Gender=Fem|Number=Sing": {"pos": "NOUN", "Gender": "fem", "Number": "sing"}, - "NOUN__Number=Sing|PartType=Comp": {"pos": "NOUN", "Number": "sing", "Other": {"PartType": "comp"}}, - 
"NOUN__Number=Sing": {"pos": "NOUN", "Number": "sing"}, - "NOUN___": {"pos": "NOUN"}, - "NOUN__Reflex=Yes": {"pos": "NOUN", "Reflex": "yes"}, - "NOUN__VerbForm=Inf": {"pos": "NOUN", "VerbForm": "inf"}, - "NOUN__VerbForm=Vnoun": {"pos": "NOUN", "VerbForm": "vnoun"}, - "NUM__Definite=Def|NumType=Card": {"pos": "NUM", "Definite": "def", "NumType": "card"}, - "NUM__Form=Ecl|NumType=Card": {"pos": "NUM", "NumType": "card", "Other": {"Form": "ecl"}}, - "NUM__Form=Ecl|NumType=Ord": {"pos": "NUM", "NumType": "ord", "Other": {"Form": "ecl"}}, - "NUM__Form=HPref|NumType=Card": {"pos": "NUM", "NumType": "card", "Other": {"Form": "hpref"}}, - "NUM__Form=Len|NumType=Card": {"pos": "NUM", "NumType": "card", "Other": {"Form": "len"}}, - "NUM__Form=Len|NumType=Ord": {"pos": "NUM", "NumType": "ord", "Other": {"Form": "len"}}, - "NUM__NumType=Card": {"pos": "NUM", "NumType": "card"}, - "NUM__NumType=Ord": {"pos": "NUM", "NumType": "ord"}, - "NUM___": {"pos": "NUM"}, - "PART__Form=Ecl|PartType=Vb|PronType=Rel": {"pos": "PART", "PronType": "rel", "Other": {"Form": "ecl", "PartType": "vb"}}, - "PART__Mood=Imp|PartType=Vb|Polarity=Neg": {"pos": "PART", "Mood": "imp", "Polarity": "neg", "Other": {"PartType": "vb"}}, - "PART__Mood=Imp|PartType=Vb": {"pos": "PART", "Mood": "imp", "Other": {"PartType": "vb"}}, - "PART__Mood=Int|PartType=Vb|Polarity=Neg": {"pos": "PART", "Polarity": "neg", "Other": {"Mood": "int", "PartType": "vb"}}, - "PART__PartType=Ad": {"pos": "PART", "Other": {"PartType": "ad"}}, - "PART__PartType=Cmpl|Polarity=Neg": {"pos": "PART", "Polarity": "neg", "Other": {"PartType": "cmpl"}}, - "PART__PartType=Cmpl|Polarity=Neg|Tense=Past": {"pos": "PART", "Polarity": "neg", "Tense": "past", "Other": {"PartType": "cmpl"}}, - "PART__PartType=Cmpl": {"pos": "PART", "Other": {"PartType": "cmpl"}}, - "PART__PartType=Comp": {"pos": "PART", "Other": {"PartType": "comp"}}, - "PART__PartType=Cop|PronType=Rel": {"pos": "PART", "PronType": "rel", "Other": {"PartType": "cop"}}, - "PART__PartType=Deg": {"pos": "PART", "Other": {"PartType": "deg"}}, - "PART__PartType=Inf": {"pos": "PART", "PartType": "inf"}, - "PART__PartType=Num": {"pos": "PART", "Other": {"PartType": "num"}}, - "PART__PartType=Pat": {"pos": "PART", "Other": {"PartType": "pat"}}, - "PART__PartType=Vb|Polarity=Neg": {"pos": "PART", "Polarity": "neg", "Other": {"PartType": "vb"}}, - "PART__PartType=Vb|Polarity=Neg|PronType=Rel": {"pos": "PART", "Polarity": "neg", "PronType": "rel", "Other": {"PartType": "vb"}}, - "PART__PartType=Vb|Polarity=Neg|PronType=Rel|Tense=Past": {"pos": "PART", "Polarity": "neg", "PronType": "rel", "Tense": "past", "Other": {"PartType": "vb"}}, - "PART__PartType=Vb|Polarity=Neg|Tense=Past": {"pos": "PART", "Polarity": "neg", "Tense": "past", "Other": {"PartType": "vb"}}, - "PART__PartType=Vb": {"pos": "PART", "Other": {"PartType": "vb"}}, - "PART__PartType=Vb|PronType=Rel": {"pos": "PART", "PronType": "rel", "Other": {"PartType": "vb"}}, - "PART__PartType=Vb|PronType=Rel|Tense=Past": {"pos": "PART", "PronType": "rel", "Tense": "past", "Other": {"PartType": "vb"}}, - "PART__PartType=Vb|Tense=Past": {"pos": "PART", "Tense": "past", "Other": {"PartType": "vb"}}, - "PART__PartType=Voc": {"pos": "PART", "Other": {"PartType": "voc"}}, - "PART___": {"pos": "PART"}, - "PART__PronType=Rel": {"pos": "PART", "PronType": "rel"}, - "PRON__Form=Len|Number=Sing|Person=2": {"pos": "PRON", "Number": "sing", "Person": 2, "Other": {"Form": "len"}}, - "PRON__Form=Len|PronType=Ind": {"pos": "PRON", "PronType": "ind", "Other": {"Form": "len"}}, - 
"PRON__Gender=Fem|Number=Sing|Person=3": {"pos": "PRON", "Gender": "fem", "Number": "sing", "Person": 3}, - "PRON__Gender=Masc|Number=Sing|Person=3": {"pos": "PRON", "Gender": "masc", "Number": "sing", "Person": 3}, - "PRON__Gender=Masc|Number=Sing|Person=3|PronType=Emp": {"pos": "PRON", "Gender": "masc", "Number": "sing", "Person": 3, "PronType": "emp"}, - "PRON__Gender=Masc|Person=3": {"pos": "PRON", "Gender": "masc", "Person": 3}, - "PRON__Number=Plur|Person=1": {"pos": "PRON", "Number": "plur", "Person": 1}, - "PRON__Number=Plur|Person=1|PronType=Emp": {"pos": "PRON", "Number": "plur", "Person": 1, "PronType": "emp"}, - "PRON__Number=Plur|Person=2": {"pos": "PRON", "Number": "plur", "Person": 2}, - "PRON__Number=Plur|Person=3": {"pos": "PRON", "Number": "plur", "Person": 3}, - "PRON__Number=Plur|Person=3|PronType=Emp": {"pos": "PRON", "Number": "plur", "Person": 3, "PronType": "emp"}, - "PRON__Number=Sing|Person=1": {"pos": "PRON", "Number": "sing", "Person": 1}, - "PRON__Number=Sing|Person=1|PronType=Emp": {"pos": "PRON", "Number": "sing", "Person": 1, "PronType": "emp"}, - "PRON__Number=Sing|Person=2": {"pos": "PRON", "Number": "sing", "Person": 2}, - "PRON__Number=Sing|Person=2|PronType=Emp": {"pos": "PRON", "Number": "sing", "Person": 2, "PronType": "emp"}, - "PRON__Number=Sing|Person=3": {"pos": "PRON", "Number": "sing", "Person": 3}, - "PRON__Number=Sing|PronType=Int": {"pos": "PRON", "Number": "sing", "PronType": "int"}, - "PRON__PronType=Dem": {"pos": "PRON", "PronType": "dem"}, - "PRON__PronType=Ind": {"pos": "PRON", "PronType": "ind"}, - "PRON__PronType=Int": {"pos": "PRON", "PronType": "int"}, - "PRON__Reflex=Yes": {"pos": "PRON", "Reflex": "yes"}, - "PROPN__Abbr=Yes": {"pos": "PROPN", "Other": {"Abbr": "yes"}}, - "PROPN__Case=Dat|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "dat", "Gender": "fem", "Number": "sing"}, - "PROPN__Case=Gen|Definite=Def|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "gen", "Definite": "def", "Gender": "fem", "Number": "sing"}, - "PROPN__Case=Gen|Form=Ecl|Gender=Fem|Number=Plur": {"pos": "PROPN", "Case": "gen", "Gender": "fem", "Number": "plur", "Other": {"Form": "ecl"}}, - "PROPN__Case=Gen|Form=Ecl|Gender=Masc|Number=Plur": {"pos": "PROPN", "Case": "gen", "Gender": "masc", "Number": "plur", "Other": {"Form": "ecl"}}, - "PROPN__Case=Gen|Form=HPref|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "gen", "Gender": "fem", "Number": "sing", "Other": {"Form": "hpref"}}, - "PROPN__Case=Gen|Form=Len|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "gen", "Gender": "fem", "Number": "sing", "Other": {"Form": "len"}}, - "PROPN__Case=Gen|Form=Len|Gender=Fem": {"pos": "PROPN", "Case": "gen", "Gender": "fem", "Other": {"Form": "len"}}, - "PROPN__Case=Gen|Form=Len|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "gen", "Gender": "masc", "Number": "sing", "Other": {"Form": "len"}}, - "PROPN__Case=Gen|Form=Len|Gender=Masc": {"pos": "PROPN", "Case": "gen", "Gender": "masc", "Other": {"Form": "len"}}, - "PROPN__Case=Gen|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "gen", "Gender": "fem", "Number": "sing"}, - "PROPN__Case=Gen|Gender=Fem": {"pos": "PROPN", "Case": "gen", "Gender": "fem"}, - "PROPN__Case=Gen|Gender=Masc|NounType=Weak|Number=Plur": {"pos": "PROPN", "Case": "gen", "Gender": "masc", "Number": "plur", "Other": {"NounType": "weak"}}, - "PROPN__Case=Gen|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "gen", "Gender": "masc", "Number": "sing"}, - "PROPN__Case=Gen|Gender=Masc": {"pos": "PROPN", "Case": "gen", "Gender": "masc"}, - 
"PROPN__Case=NomAcc|Definite=Def|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Definite": "def", "Gender": "fem", "Number": "sing"}, - "PROPN__Case=NomAcc|Definite=Def|Gender=Masc|Number=Plur": {"pos": "PROPN", "Case": "nom|acc", "Definite": "def", "Gender": "masc", "Number": "plur"}, - "PROPN__Case=NomAcc|Definite=Def|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Definite": "def", "Gender": "masc", "Number": "sing"}, - "PROPN__Case=NomAcc|Form=Ecl|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Other": {"Form": "ecl"}}, - "PROPN__Case=NomAcc|Form=Ecl|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Other": {"Form": "ecl"}}, - "PROPN__Case=NomAcc|Form=HPref|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Other": {"Form": "hpref"}}, - "PROPN__Case=NomAcc|Form=Len|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Other": {"Form": "len"}}, - "PROPN__Case=NomAcc|Form=Len|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "masc", "Number": "sing", "Other": {"Form": "len"}}, - "PROPN__Case=NomAcc|Gender=Fem|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "fem", "Number": "sing"}, - "PROPN__Case=NomAcc|Gender=Masc|Number=Plur": {"pos": "PROPN", "Case": "nom|acc", "Gender": "masc", "Number": "plur"}, - "PROPN__Case=NomAcc|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "nom|acc", "Gender": "masc", "Number": "sing"}, - "PROPN__Case=NomAcc|Gender=Masc": {"pos": "PROPN", "Case": "nom|acc", "Gender": "masc"}, - "PROPN__Case=Voc|Form=Len|Gender=Fem": {"pos": "PROPN", "Case": "voc", "Gender": "fem", "Other": {"Form": "len"}}, - "PROPN__Case=Voc|Gender=Masc|Number=Sing": {"pos": "PROPN", "Case": "voc", "Gender": "masc", "Number": "sing"}, - "PROPN__Gender=Masc|Number=Sing": {"pos": "PROPN", "Gender": "masc", "Number": "sing"}, - "PROPN___": {"pos": "PROPN"}, - "PUNCT___": {"pos": "PUNCT"}, - "SCONJ___": {"pos": "SCONJ"}, - "SCONJ__Tense=Past|VerbForm=Cop": {"pos": "SCONJ", "Tense": "past", "Other": {"VerbForm": "cop"}}, - "SCONJ__VerbForm=Cop": {"pos": "SCONJ", "Other": {"VerbForm": "cop"}}, - "SYM__Abbr=Yes": {"pos": "SYM", "Other": {"Abbr": "yes"}}, - "VERB__Case=NomAcc|Gender=Masc|Mood=Ind|Number=Sing|Tense=Pres": {"pos": "VERB", "Case": "nom|acc", "Gender": "masc", "Mood": "ind", "Number": "sing", "Tense": "pres"}, - "VERB__Dialect=Munster|Form=Len|Mood=Ind|Tense=Past": {"pos": "VERB", "Mood": "ind", "Tense": "past", "Other": {"Dialect": "munster", "Form": "len"}}, - "VERB__Foreign=Yes": {"pos": "VERB", "Foreign": "yes"}, - "VERB__Form=Ecl|Mood=Cnd|Number=Sing|Person=1": {"pos": "VERB", "Mood": "cnd", "Number": "sing", "Person": 1, "Other": {"Form": "ecl"}}, - "VERB__Form=Ecl|Mood=Cnd|Polarity=Neg": {"pos": "VERB", "Mood": "cnd", "Polarity": "neg", "Other": {"Form": "ecl"}}, - "VERB__Form=Ecl|Mood=Cnd": {"pos": "VERB", "Mood": "cnd", "Other": {"Form": "ecl"}}, - "VERB__Form=Ecl|Mood=Cnd|Voice=Auto": {"pos": "VERB", "Mood": "cnd", "Other": {"Form": "ecl", "Voice": "auto"}}, - "VERB__Form=Ecl|Mood=Imp|Number=Sing|Person=1|Tense=Past": {"pos": "VERB", "Mood": "imp", "Number": "sing", "Person": 1, "Tense": "past", "Other": {"Form": "ecl"}}, - "VERB__Form=Ecl|Mood=Imp|Tense=Past": {"pos": "VERB", "Mood": "imp", "Tense": "past", "Other": {"Form": "ecl"}}, - "VERB__Form=Ecl|Mood=Ind|Number=Plur|Person=1|Tense=Pres": {"pos": "VERB", "Mood": "ind", 
"Number": "plur", "Person": 1, "Tense": "pres", "Other": {"Form": "ecl"}}, - "VERB__Form=Ecl|Mood=Ind|Number=Sing|Person=1|Tense=Past": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "past", "Other": {"Form": "ecl"}}, - "VERB__Form=Ecl|Mood=Ind|Number=Sing|Person=1|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "pres", "Other": {"Form": "ecl"}}, - "VERB__Form=Ecl|Mood=Ind|Polarity=Neg|Tense=Fut": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "fut", "Other": {"Form": "ecl"}}, - "VERB__Form=Ecl|Mood=Ind|Polarity=Neg|Tense=Fut|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "fut", "Other": {"Form": "ecl", "Voice": "auto"}}, - "VERB__Form=Ecl|Mood=Ind|Polarity=Neg|Tense=Past": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "past", "Other": {"Form": "ecl"}}, - "VERB__Form=Ecl|Mood=Ind|Polarity=Neg|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "pres", "Other": {"Form": "ecl"}}, - "VERB__Form=Ecl|Mood=Ind|Tense=Fut": {"pos": "VERB", "Mood": "ind", "Tense": "fut", "Other": {"Form": "ecl"}}, - "VERB__Form=Ecl|Mood=Ind|Tense=Fut|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "fut", "Other": {"Form": "ecl", "Voice": "auto"}}, - "VERB__Form=Ecl|Mood=Ind|Tense=Past": {"pos": "VERB", "Mood": "ind", "Tense": "past", "Other": {"Form": "ecl"}}, - "VERB__Form=Ecl|Mood=Ind|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Tense": "pres", "Other": {"Form": "ecl"}}, - "VERB__Form=Ecl|Mood=Ind|Tense=Pres|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "pres", "Other": {"Form": "ecl", "Voice": "auto"}}, - "VERB__Form=Ecl|Mood=Sub|Tense=Pres": {"pos": "VERB", "Mood": "sub", "Tense": "pres", "Other": {"Form": "ecl"}}, - "VERB__Form=Ecl": {"pos": "VERB", "Other": {"Form": "ecl"}}, - "VERB__Form=Emp|Mood=Ind|Number=Plur|Person=1|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "plur", "Person": 1, "Tense": "pres", "Other": {"Form": "emp"}}, - "VERB__Form=Emp|Mood=Ind|Number=Sing|Person=1|PronType=Rel|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "PronType": "rel", "Tense": "pres", "Other": {"Form": "emp"}}, - "VERB__Form=Emp|Mood=Ind|Number=Sing|Person=1|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "pres", "Other": {"Form": "emp"}}, - "VERB__Form=Len|Mood=Cnd|Number=Plur|Person=3": {"pos": "VERB", "Mood": "cnd", "Number": "plur", "Person": 3, "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Cnd|Number=Sing|Person=1": {"pos": "VERB", "Mood": "cnd", "Number": "sing", "Person": 1, "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Cnd|Number=Sing|Person=2": {"pos": "VERB", "Mood": "cnd", "Number": "sing", "Person": 2, "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Cnd|Polarity=Neg": {"pos": "VERB", "Mood": "cnd", "Polarity": "neg", "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Cnd": {"pos": "VERB", "Mood": "cnd", "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Cnd|Voice=Auto": {"pos": "VERB", "Mood": "cnd", "Other": {"Form": "len", "Voice": "auto"}}, - "VERB__Form=Len|Mood=Imp|Number=Plur|Person=3|Tense=Past": {"pos": "VERB", "Mood": "imp", "Number": "plur", "Person": 3, "Tense": "past", "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Imp|Tense=Past": {"pos": "VERB", "Mood": "imp", "Tense": "past", "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Imp|Tense=Past|Voice=Auto": {"pos": "VERB", "Mood": "imp", "Tense": "past", "Other": {"Form": "len", "Voice": "auto"}}, - 
"VERB__Form=Len|Mood=Imp|Voice=Auto": {"pos": "VERB", "Mood": "imp", "Other": {"Form": "len", "Voice": "auto"}}, - "VERB__Form=Len|Mood=Ind|Number=Plur|Person=1|Tense=Fut": {"pos": "VERB", "Mood": "ind", "Number": "plur", "Person": 1, "Tense": "fut", "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Ind|Number=Plur|Person=1|Tense=Past": {"pos": "VERB", "Mood": "ind", "Number": "plur", "Person": 1, "Tense": "past", "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Ind|Number=Plur|Person=3|Tense=Past": {"pos": "VERB", "Mood": "ind", "Number": "plur", "Person": 3, "Tense": "past", "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Ind|Number=Sing|Person=1|Polarity=Neg|Tense=Past": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Polarity": "neg", "Tense": "past", "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Ind|Number=Sing|Person=1|Polarity=Neg|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Polarity": "neg", "Tense": "pres", "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Ind|Number=Sing|Person=1|Tense=Past": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "past", "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Ind|Polarity=Neg|Tense=Fut": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "fut", "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Ind|Polarity=Neg|Tense=Fut|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "fut", "Other": {"Form": "len", "Voice": "auto"}}, - "VERB__Form=Len|Mood=Ind|Polarity=Neg|Tense=Past": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "past", "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Ind|Polarity=Neg|Tense=Past|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "past", "Other": {"Form": "len", "Voice": "auto"}}, - "VERB__Form=Len|Mood=Ind|Polarity=Neg|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "pres", "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Ind|Polarity=Neg|Tense=Pres|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "pres", "Other": {"Form": "len", "Voice": "auto"}}, - "VERB__Form=Len|Mood=Ind|Tense=Fut": {"pos": "VERB", "Mood": "ind", "Tense": "fut", "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Ind|Tense=Fut|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "fut", "Other": {"Form": "len", "Voice": "auto"}}, - "VERB__Form=Len|Mood=Ind|Tense=Past": {"pos": "VERB", "Mood": "ind", "Tense": "past", "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Ind|Tense=Past|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "past", "Other": {"Form": "len", "Voice": "auto"}}, - "VERB__Form=Len|Mood=Ind|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Tense": "pres", "Other": {"Form": "len"}}, - "VERB__Form=Len|Mood=Ind|Tense=Pres|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "pres", "Other": {"Form": "len", "Voice": "auto"}}, - "VERB__Form=Len|Mood=Sub|Polarity=Neg|Tense=Pres": {"pos": "VERB", "Mood": "sub", "Polarity": "neg", "Tense": "pres", "Other": {"Form": "len"}}, - "VERB__Form=Len|Polarity=Neg": {"pos": "VERB", "Polarity": "neg", "Other": {"Form": "len"}}, - "VERB__Form=Len": {"pos": "VERB", "Other": {"Form": "len"}}, - "VERB__Mood=Cnd|Number=Plur|Person=3": {"pos": "VERB", "Mood": "cnd", "Number": "plur", "Person": 3}, - "VERB__Mood=Cnd|Number=Sing|Person=1": {"pos": "VERB", "Mood": "cnd", "Number": "sing", "Person": 1}, - "VERB__Mood=Cnd": {"pos": "VERB", "Mood": "cnd"}, - "VERB__Mood=Cnd|Voice=Auto": {"pos": "VERB", "Mood": "cnd", "Other": {"Voice": 
"auto"}}, - "VERB__Mood=Imp|Number=Plur|Person=1|Polarity=Neg": {"pos": "VERB", "Mood": "imp", "Number": "plur", "Person": 1, "Polarity": "neg"}, - "VERB__Mood=Imp|Number=Plur|Person=1": {"pos": "VERB", "Mood": "imp", "Number": "plur", "Person": 1}, - "VERB__Mood=Imp|Number=Plur|Person=2": {"pos": "VERB", "Mood": "imp", "Number": "plur", "Person": 2}, - "VERB__Mood=Imp|Number=Sing|Person=2": {"pos": "VERB", "Mood": "imp", "Number": "sing", "Person": 2}, - "VERB__Mood=Imp|Tense=Past": {"pos": "VERB", "Mood": "imp", "Tense": "past"}, - "VERB__Mood=Ind|Number=Plur|Person=1|Tense=Past": {"pos": "VERB", "Mood": "ind", "Number": "plur", "Person": 1, "Tense": "past"}, - "VERB__Mood=Ind|Number=Plur|Person=1|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "plur", "Person": 1, "Tense": "pres"}, - "VERB__Mood=Ind|Number=Sing|Person=1|Tense=Past": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "past"}, - "VERB__Mood=Ind|Number=Sing|Person=1|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "pres"}, - "VERB__Mood=Ind|Polarity=Neg|Tense=Past|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "past", "Other": {"Voice": "auto"}}, - "VERB__Mood=Ind|Polarity=Neg|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Polarity": "neg", "Tense": "pres"}, - "VERB__Mood=Ind|PronType=Rel|Tense=Fut": {"pos": "VERB", "Mood": "ind", "PronType": "rel", "Tense": "fut"}, - "VERB__Mood=Ind|PronType=Rel|Tense=Pres": {"pos": "VERB", "Mood": "ind", "PronType": "rel", "Tense": "pres"}, - "VERB__Mood=Ind|Tense=Fut": {"pos": "VERB", "Mood": "ind", "Tense": "fut"}, - "VERB__Mood=Ind|Tense=Fut|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "fut", "Other": {"Voice": "auto"}}, - "VERB__Mood=Ind|Tense=Past": {"pos": "VERB", "Mood": "ind", "Tense": "past"}, - "VERB__Mood=Ind|Tense=Past|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "past", "Other": {"Voice": "auto"}}, - "VERB__Mood=Ind|Tense=Pres": {"pos": "VERB", "Mood": "ind", "Tense": "pres"}, - "VERB__Mood=Ind|Tense=Pres|Voice=Auto": {"pos": "VERB", "Mood": "ind", "Tense": "pres", "Other": {"Voice": "auto"}}, - "VERB___": {"pos": "VERB"}, - "X__Abbr=Yes": {"pos": "X", "Other": {"Abbr": "yes"}}, - "X__Case=NomAcc|Foreign=Yes|Gender=Fem|Number=Sing": {"pos": "X", "Case": "nom|acc", "Gender": "fem", "Number": "sing", "Foreign": "yes"}, - "X__Definite=Def|Dialect=Ulster": {"pos": "X", "Definite": "def", "Other": {"Dialect": "ulster"}}, - "X__Dialect=Munster|Form=Len|Mood=Ind|Number=Sing|Person=1|Tense=Past": {"pos": "X", "Mood": "ind", "Number": "sing", "Person": 1, "Tense": "past", "Other": {"Dialect": "munster", "Form": "len"}}, - "X__Dialect=Munster|Mood=Imp|Number=Sing|Person=2|Polarity=Neg": {"pos": "X", "Mood": "imp", "Number": "sing", "Person": 2, "Polarity": "neg", "Other": {"Dialect": "munster"}}, - "X__Dialect=Munster|Mood=Ind|Tense=Past|Voice=Auto": {"pos": "X", "Mood": "ind", "Tense": "past", "Other": {"Dialect": "munster", "Voice": "auto"}}, - "X__Dialect=Munster": {"pos": "X", "Other": {"Dialect": "munster"}}, - "X__Dialect=Munster|PronType=Dem": {"pos": "X", "PronType": "dem", "Other": {"Dialect": "munster"}}, - "X__Dialect=Ulster|Gender=Masc|Number=Sing|Person=3": {"pos": "X", "Gender": "masc", "Number": "sing", "Person": 3, "Other": {"Dialect": "ulster"}}, - "X__Dialect=Ulster|PartType=Vb|Polarity=Neg": {"pos": "X", "Polarity": "neg", "Other": {"Dialect": "ulster", "PartType": "vb"}}, - "X__Dialect=Ulster|VerbForm=Cop": {"pos": "X", "Other": {"Dialect": "ulster", "VerbForm": 
"cop"}}, - "X__Foreign=Yes": {"pos": "X", "Foreign": "yes"}, - "X___": {"pos": "X"} -} -# fmt: on diff --git a/spacy/lang/ga/tokenizer_exceptions.py b/spacy/lang/ga/tokenizer_exceptions.py index c0e53f522..abf49c511 100644 --- a/spacy/lang/ga/tokenizer_exceptions.py +++ b/spacy/lang/ga/tokenizer_exceptions.py @@ -1,82 +1,65 @@ -# encoding: utf8 -from __future__ import unicode_literals - -from ...symbols import POS, DET, ADP, CCONJ, ADV, NOUN, X, AUX -from ...symbols import ORTH, LEMMA, NORM +from ..tokenizer_exceptions import BASE_EXCEPTIONS +from ...symbols import ORTH, NORM +from ...util import update_exc _exc = { - "'acha'n": [ - {ORTH: "'ach", LEMMA: "gach", NORM: "gach", POS: DET}, - {ORTH: "a'n", LEMMA: "aon", NORM: "aon", POS: DET}, - ], - "dem'": [ - {ORTH: "de", LEMMA: "de", NORM: "de", POS: ADP}, - {ORTH: "m'", LEMMA: "mo", NORM: "mo", POS: DET}, - ], - "ded'": [ - {ORTH: "de", LEMMA: "de", NORM: "de", POS: ADP}, - {ORTH: "d'", LEMMA: "do", NORM: "do", POS: DET}, - ], - "lem'": [ - {ORTH: "le", LEMMA: "le", NORM: "le", POS: ADP}, - {ORTH: "m'", LEMMA: "mo", NORM: "mo", POS: DET}, - ], - "led'": [ - {ORTH: "le", LEMMA: "le", NORM: "le", POS: ADP}, - {ORTH: "d'", LEMMA: "mo", NORM: "do", POS: DET}, - ], + "'acha'n": [{ORTH: "'ach", NORM: "gach"}, {ORTH: "a'n", NORM: "aon"}], + "dem'": [{ORTH: "de", NORM: "de"}, {ORTH: "m'", NORM: "mo"}], + "ded'": [{ORTH: "de", NORM: "de"}, {ORTH: "d'", NORM: "do"}], + "lem'": [{ORTH: "le", NORM: "le"}, {ORTH: "m'", NORM: "mo"}], + "led'": [{ORTH: "le", NORM: "le"}, {ORTH: "d'", NORM: "do"}], } for exc_data in [ - {ORTH: "'gus", LEMMA: "agus", NORM: "agus", POS: CCONJ}, - {ORTH: "'ach", LEMMA: "gach", NORM: "gach", POS: DET}, - {ORTH: "ao'", LEMMA: "aon", NORM: "aon"}, - {ORTH: "'niar", LEMMA: "aniar", NORM: "aniar", POS: ADV}, - {ORTH: "'níos", LEMMA: "aníos", NORM: "aníos", POS: ADV}, - {ORTH: "'ndiu", LEMMA: "inniu", NORM: "inniu", POS: ADV}, - {ORTH: "'nocht", LEMMA: "anocht", NORM: "anocht", POS: ADV}, - {ORTH: "m'", LEMMA: "mo", POS: DET}, - {ORTH: "Aib.", LEMMA: "Aibreán", POS: NOUN}, - {ORTH: "Ath.", LEMMA: "athair", POS: NOUN}, - {ORTH: "Beal.", LEMMA: "Bealtaine", POS: NOUN}, - {ORTH: "a.C.n.", LEMMA: "ante Christum natum", POS: X}, - {ORTH: "m.sh.", LEMMA: "mar shampla", POS: ADV}, - {ORTH: "M.F.", LEMMA: "Meán Fómhair", POS: NOUN}, - {ORTH: "M.Fómh.", LEMMA: "Meán Fómhair", POS: NOUN}, - {ORTH: "D.F.", LEMMA: "Deireadh Fómhair", POS: NOUN}, - {ORTH: "D.Fómh.", LEMMA: "Deireadh Fómhair", POS: NOUN}, - {ORTH: "r.C.", LEMMA: "roimh Chríost", POS: ADV}, - {ORTH: "R.C.", LEMMA: "roimh Chríost", POS: ADV}, - {ORTH: "r.Ch.", LEMMA: "roimh Chríost", POS: ADV}, - {ORTH: "r.Chr.", LEMMA: "roimh Chríost", POS: ADV}, - {ORTH: "R.Ch.", LEMMA: "roimh Chríost", POS: ADV}, - {ORTH: "R.Chr.", LEMMA: "roimh Chríost", POS: ADV}, - {ORTH: "⁊rl.", LEMMA: "agus araile", POS: ADV}, - {ORTH: "srl.", LEMMA: "agus araile", POS: ADV}, - {ORTH: "Co.", LEMMA: "contae", POS: NOUN}, - {ORTH: "Ean.", LEMMA: "Eanáir", POS: NOUN}, - {ORTH: "Feab.", LEMMA: "Feabhra", POS: NOUN}, - {ORTH: "gCo.", LEMMA: "contae", POS: NOUN}, - {ORTH: ".i.", LEMMA: "eadhon", POS: ADV}, - {ORTH: "B'", LEMMA: "ba", POS: AUX}, - {ORTH: "b'", LEMMA: "ba", POS: AUX}, - {ORTH: "lch.", LEMMA: "leathanach", POS: NOUN}, - {ORTH: "Lch.", LEMMA: "leathanach", POS: NOUN}, - {ORTH: "lgh.", LEMMA: "leathanach", POS: NOUN}, - {ORTH: "Lgh.", LEMMA: "leathanach", POS: NOUN}, - {ORTH: "Lún.", LEMMA: "Lúnasa", POS: NOUN}, - {ORTH: "Már.", LEMMA: "Márta", POS: NOUN}, - {ORTH: "Meith.", LEMMA: "Meitheamh", 
POS: NOUN}, - {ORTH: "Noll.", LEMMA: "Nollaig", POS: NOUN}, - {ORTH: "Samh.", LEMMA: "Samhain", POS: NOUN}, - {ORTH: "tAth.", LEMMA: "athair", POS: NOUN}, - {ORTH: "tUas.", LEMMA: "Uasal", POS: NOUN}, - {ORTH: "teo.", LEMMA: "teoranta", POS: NOUN}, - {ORTH: "Teo.", LEMMA: "teoranta", POS: NOUN}, - {ORTH: "Uas.", LEMMA: "Uasal", POS: NOUN}, - {ORTH: "uimh.", LEMMA: "uimhir", POS: NOUN}, - {ORTH: "Uimh.", LEMMA: "uimhir", POS: NOUN}, + {ORTH: "'gus", NORM: "agus"}, + {ORTH: "'ach", NORM: "gach"}, + {ORTH: "ao'", NORM: "aon"}, + {ORTH: "'niar", NORM: "aniar"}, + {ORTH: "'níos", NORM: "aníos"}, + {ORTH: "'ndiu", NORM: "inniu"}, + {ORTH: "'nocht", NORM: "anocht"}, + {ORTH: "m'"}, + {ORTH: "Aib."}, + {ORTH: "Ath."}, + {ORTH: "Beal."}, + {ORTH: "a.C.n."}, + {ORTH: "m.sh."}, + {ORTH: "M.F."}, + {ORTH: "M.Fómh."}, + {ORTH: "D.F."}, + {ORTH: "D.Fómh."}, + {ORTH: "r.C."}, + {ORTH: "R.C."}, + {ORTH: "r.Ch."}, + {ORTH: "r.Chr."}, + {ORTH: "R.Ch."}, + {ORTH: "R.Chr."}, + {ORTH: "⁊rl."}, + {ORTH: "srl."}, + {ORTH: "Co."}, + {ORTH: "Ean."}, + {ORTH: "Feab."}, + {ORTH: "gCo."}, + {ORTH: ".i."}, + {ORTH: "B'"}, + {ORTH: "b'"}, + {ORTH: "lch."}, + {ORTH: "Lch."}, + {ORTH: "lgh."}, + {ORTH: "Lgh."}, + {ORTH: "Lún."}, + {ORTH: "Már."}, + {ORTH: "Meith."}, + {ORTH: "Noll."}, + {ORTH: "Samh."}, + {ORTH: "tAth."}, + {ORTH: "tUas."}, + {ORTH: "teo."}, + {ORTH: "Teo."}, + {ORTH: "Uas."}, + {ORTH: "uimh."}, + {ORTH: "Uimh."}, ]: _exc[exc_data[ORTH]] = [exc_data] @@ -84,4 +67,4 @@ for orth in ["d'", "D'"]: _exc[orth] = [{ORTH: orth}] -TOKENIZER_EXCEPTIONS = _exc +TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc) diff --git a/spacy/lang/gu/__init__.py b/spacy/lang/gu/__init__.py index 1f080c7c2..67228ac40 100644 --- a/spacy/lang/gu/__init__.py +++ b/spacy/lang/gu/__init__.py @@ -1,8 +1,4 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS - from ...language import Language diff --git a/spacy/lang/gu/examples.py b/spacy/lang/gu/examples.py index 202a8d022..1cf75fd32 100644 --- a/spacy/lang/gu/examples.py +++ b/spacy/lang/gu/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
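The Irish hunks above strip each tokenizer exception down to ORTH plus an optional NORM and merge the table with the shared BASE_EXCEPTIONS via update_exc. A minimal sketch of how those entries behave at runtime, assuming a spaCy v3 install (the expected splits and norms follow the _exc table above):

    from spacy.lang.ga import Irish

    nlp = Irish()                    # blank Irish pipeline, tokenizer only
    doc = nlp("lem' chara")          # "lem'" is listed in _exc above
    print([t.text for t in doc])     # expected: ["le", "m'", "chara"]
    print([t.norm_ for t in doc])    # expected: ["le", "mo", "chara"]
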
diff --git a/spacy/lang/gu/stop_words.py b/spacy/lang/gu/stop_words.py index 85d33763d..2c859681b 100644 --- a/spacy/lang/gu/stop_words.py +++ b/spacy/lang/gu/stop_words.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - STOP_WORDS = set( """ એમ diff --git a/spacy/lang/he/__init__.py b/spacy/lang/he/__init__.py index 922f61462..e0adc3293 100644 --- a/spacy/lang/he/__init__.py +++ b/spacy/lang/he/__init__.py @@ -1,21 +1,11 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS - -from ..tokenizer_exceptions import BASE_EXCEPTIONS from .lex_attrs import LEX_ATTRS from ...language import Language -from ...attrs import LANG -from ...util import update_exc class HebrewDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters.update(LEX_ATTRS) - lex_attr_getters[LANG] = lambda text: "he" - tokenizer_exceptions = update_exc(BASE_EXCEPTIONS) stop_words = STOP_WORDS + lex_attr_getters = LEX_ATTRS writing_system = {"direction": "rtl", "has_case": False, "has_letters": True} diff --git a/spacy/lang/he/examples.py b/spacy/lang/he/examples.py index 34cd157ae..d54d2a145 100644 --- a/spacy/lang/he/examples.py +++ b/spacy/lang/he/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/he/lex_attrs.py b/spacy/lang/he/lex_attrs.py index 9eab93ae4..2953e7592 100644 --- a/spacy/lang/he/lex_attrs.py +++ b/spacy/lang/he/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM _num_words = [ @@ -73,6 +70,7 @@ _ordinal_words = [ "עשירי", ] + def like_num(text): if text.startswith(("+", "-", "±", "~")): text = text[1:] @@ -84,7 +82,7 @@ def like_num(text): num, denom = text.split("/") if num.isdigit() and denom.isdigit(): return True - + if text in _num_words: return True diff --git a/spacy/lang/he/stop_words.py b/spacy/lang/he/stop_words.py index d4ac5e846..23bb5176d 100644 --- a/spacy/lang/he/stop_words.py +++ b/spacy/lang/he/stop_words.py @@ -1,7 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ אני diff --git a/spacy/lang/hi/__init__.py b/spacy/lang/hi/__init__.py index b0d45ddf3..384f040c8 100644 --- a/spacy/lang/hi/__init__.py +++ b/spacy/lang/hi/__init__.py @@ -1,18 +1,11 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS - from ...language import Language -from ...attrs import LANG class HindiDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters.update(LEX_ATTRS) - lex_attr_getters[LANG] = lambda text: "hi" stop_words = STOP_WORDS + lex_attr_getters = LEX_ATTRS class Hindi(Language): diff --git a/spacy/lang/hi/examples.py b/spacy/lang/hi/examples.py index 76b0e8bf8..1443b4908 100644 --- a/spacy/lang/hi/examples.py +++ b/spacy/lang/hi/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
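The HebrewDefaults hunk above replaces the old lex_attr_getters dict, the LANG getter and the empty update_exc call with plain class attributes. A quick sketch of how the slimmed-down defaults surface on a blank pipeline, assuming a spaCy v3 install ("אני" appears in the stop-word list shown above):

    import spacy

    nlp = spacy.blank("he")
    doc = nlp("אני")
    print(doc[0].is_stop)    # expected True: the token is in STOP_WORDS
    print(doc[0].like_num)   # expected False: like_num() from lex_attrs.py only
                             # accepts digits, fractions and listed number words
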
diff --git a/spacy/lang/hi/lex_attrs.py b/spacy/lang/hi/lex_attrs.py index 515dd0be3..6ae9812d6 100644 --- a/spacy/lang/hi/lex_attrs.py +++ b/spacy/lang/hi/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..norm_exceptions import BASE_NORMS from ...attrs import NORM, LIKE_NUM diff --git a/spacy/lang/hi/stop_words.py b/spacy/lang/hi/stop_words.py index efad18c84..475b07da1 100644 --- a/spacy/lang/hi/stop_words.py +++ b/spacy/lang/hi/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Source: https://github.com/taranjeet/hindi-tokenizer/blob/master/stopwords.txt, https://data.mendeley.com/datasets/bsr3frvvjc/1#file-a21d5092-99d7-45d8-b044-3ae9edd391c6 STOP_WORDS = set( diff --git a/spacy/lang/hr/__init__.py b/spacy/lang/hr/__init__.py index 539b164d7..118e0946a 100644 --- a/spacy/lang/hr/__init__.py +++ b/spacy/lang/hr/__init__.py @@ -1,22 +1,8 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS - -from ..tokenizer_exceptions import BASE_EXCEPTIONS -from ..norm_exceptions import BASE_NORMS from ...language import Language -from ...attrs import LANG, NORM -from ...util import update_exc, add_lookups class CroatianDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = lambda text: "hr" - lex_attr_getters[NORM] = add_lookups( - Language.Defaults.lex_attr_getters[NORM], BASE_NORMS - ) - tokenizer_exceptions = update_exc(BASE_EXCEPTIONS) stop_words = STOP_WORDS diff --git a/spacy/lang/hr/examples.py b/spacy/lang/hr/examples.py index dc52ce4f0..b28fb63c2 100644 --- a/spacy/lang/hr/examples.py +++ b/spacy/lang/hr/examples.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - """ Example sentences to test spaCy and its language models. 
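The hi/lex_attrs.py hunk above only shows the imports (BASE_NORMS, NORM, LIKE_NUM), not the function bodies. As an illustration of the pattern those imports imply, a NORM getter layered on the shared base norms might look like the sketch below; the names and body are illustrative, not the file's actual contents:

    from spacy.lang.norm_exceptions import BASE_NORMS
    from spacy.attrs import NORM

    def norm(text):
        # fall back to the shared base norms (quote and symbol variants),
        # returning the text unchanged when no exception is registered
        return BASE_NORMS.get(text, text)

    LEX_ATTRS = {NORM: norm}
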
diff --git a/spacy/lang/hr/stop_words.py b/spacy/lang/hr/stop_words.py index 408b802c5..dd10f792d 100644 --- a/spacy/lang/hr/stop_words.py +++ b/spacy/lang/hr/stop_words.py @@ -1,7 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - - # Source: https://github.com/stopwords-iso/stopwords-hr STOP_WORDS = set( """ diff --git a/spacy/lang/hu/__init__.py b/spacy/lang/hu/__init__.py index a331adc5b..8962603a6 100644 --- a/spacy/lang/hu/__init__.py +++ b/spacy/lang/hu/__init__.py @@ -1,29 +1,16 @@ -# coding: utf8 -from __future__ import unicode_literals - from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS, TOKEN_MATCH from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES, TOKENIZER_INFIXES from .stop_words import STOP_WORDS - -from ..tokenizer_exceptions import BASE_EXCEPTIONS -from ..norm_exceptions import BASE_NORMS from ...language import Language -from ...attrs import LANG, NORM -from ...util import update_exc, add_lookups class HungarianDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = lambda text: "hu" - lex_attr_getters[NORM] = add_lookups( - Language.Defaults.lex_attr_getters[NORM], BASE_NORMS - ) - tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) - stop_words = STOP_WORDS + tokenizer_exceptions = TOKENIZER_EXCEPTIONS prefixes = TOKENIZER_PREFIXES suffixes = TOKENIZER_SUFFIXES infixes = TOKENIZER_INFIXES token_match = TOKEN_MATCH + stop_words = STOP_WORDS class Hungarian(Language): diff --git a/spacy/lang/hu/examples.py b/spacy/lang/hu/examples.py index 3267887fe..711a438bd 100644 --- a/spacy/lang/hu/examples.py +++ b/spacy/lang/hu/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
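The TOKENIZER_EXCEPTIONS assignment in the Irish hunk above, and the identical change to hu/tokenizer_exceptions.py further below, both rely on update_exc to layer the language-specific _exc table over the shared BASE_EXCEPTIONS. A small sketch of the merge semantics, assuming spacy.util.update_exc (the "vmi." entry is a hypothetical Hungarian abbreviation, not part of the diff):

    from spacy.util import update_exc
    from spacy.lang.tokenizer_exceptions import BASE_EXCEPTIONS
    from spacy.symbols import ORTH, NORM

    _exc = {"vmi.": [{ORTH: "vmi.", NORM: "valami"}]}   # hypothetical entry
    TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc)

    print("vmi." in TOKENIZER_EXCEPTIONS)   # True: the language entry is added
    print(":)" in TOKENIZER_EXCEPTIONS)     # should be True: emoticons come from BASE_EXCEPTIONS

Language-specific entries win on key conflicts, and update_exc checks that the ORTH chunks of every exception concatenate back to its key.
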
diff --git a/spacy/lang/hu/punctuation.py b/spacy/lang/hu/punctuation.py index a010bb7ae..f827cd677 100644 --- a/spacy/lang/hu/punctuation.py +++ b/spacy/lang/hu/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, CONCAT_QUOTES from ..char_classes import CONCAT_ICONS, UNITS, ALPHA, ALPHA_LOWER, ALPHA_UPPER @@ -10,6 +7,7 @@ _concat_icons = CONCAT_ICONS.replace("\u00B0", "") _currency = r"\$¢£€¥฿" _quotes = CONCAT_QUOTES.replace("'", "") +_units = UNITS.replace("%", "") _prefixes = ( LIST_PUNCT @@ -29,7 +27,7 @@ _suffixes = ( r"(?<=[0-9])\+", r"(?<=°[FfCcKk])\.", r"(?<=[0-9])(?:[{c}])".format(c=_currency), - r"(?<=[0-9])(?:{u})".format(u=UNITS), + r"(?<=[0-9])(?:{u})".format(u=_units), r"(?<=[{al}{e}{q}(?:{c})])\.".format( al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES, c=_currency ), diff --git a/spacy/lang/hu/stop_words.py b/spacy/lang/hu/stop_words.py index c9a217dd6..e39a26d35 100644 --- a/spacy/lang/hu/stop_words.py +++ b/spacy/lang/hu/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ a abban ahhoz ahogy ahol aki akik akkor akár alatt amely amelyek amelyekben diff --git a/spacy/lang/hu/tokenizer_exceptions.py b/spacy/lang/hu/tokenizer_exceptions.py index d328baa22..4a64a1d2c 100644 --- a/spacy/lang/hu/tokenizer_exceptions.py +++ b/spacy/lang/hu/tokenizer_exceptions.py @@ -1,10 +1,9 @@ -# coding: utf8 -from __future__ import unicode_literals - import re +from ..tokenizer_exceptions import BASE_EXCEPTIONS from ..punctuation import ALPHA_LOWER, CURRENCY from ...symbols import ORTH +from ...util import update_exc _exc = {} @@ -647,5 +646,5 @@ _nums = r"(({ne})|({t})|({on})|({c}))({s})?".format( ) -TOKENIZER_EXCEPTIONS = _exc +TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc) TOKEN_MATCH = re.compile(r"^{n}$".format(n=_nums)).match diff --git a/spacy/lang/hy/__init__.py b/spacy/lang/hy/__init__.py index 6aaa965bb..4577ab641 100644 --- a/spacy/lang/hy/__init__.py +++ b/spacy/lang/hy/__init__.py @@ -1,21 +1,11 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS -from .tag_map import TAG_MAP - -from ...attrs import LANG from ...language import Language class ArmenianDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = lambda text: "hy" - - lex_attr_getters.update(LEX_ATTRS) + lex_attr_getters = LEX_ATTRS stop_words = STOP_WORDS - tag_map = TAG_MAP class Armenian(Language): diff --git a/spacy/lang/hy/examples.py b/spacy/lang/hy/examples.py index 8a00fd243..212a2ec86 100644 --- a/spacy/lang/hy/examples.py +++ b/spacy/lang/hy/examples.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - """ Example sentences to test spaCy and its language models. 
>>> from spacy.lang.hy.examples import sentences diff --git a/spacy/lang/hy/lex_attrs.py b/spacy/lang/hy/lex_attrs.py index dea3c0e97..9c9c0380c 100644 --- a/spacy/lang/hy/lex_attrs.py +++ b/spacy/lang/hy/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/hy/stop_words.py b/spacy/lang/hy/stop_words.py index d75aad6e2..46d0f6b51 100644 --- a/spacy/lang/hy/stop_words.py +++ b/spacy/lang/hy/stop_words.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - STOP_WORDS = set( """ նա diff --git a/spacy/lang/hy/tag_map.py b/spacy/lang/hy/tag_map.py deleted file mode 100644 index 4d5b6e918..000000000 --- a/spacy/lang/hy/tag_map.py +++ /dev/null @@ -1,2303 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import POS, ADJ, NUM, DET, ADV, ADP, X, VERB, NOUN -from ...symbols import PROPN, PART, INTJ, PRON, SCONJ, AUX, CCONJ - -TAG_MAP = { - "ADJ_Abbr=Yes": {POS: ADJ, "Abbr": "Yes"}, - "ADJ_Degree=Pos|NumForm=Word|NumType=Ord": { - POS: ADJ, - "Degree": "Pos", - "NumForm": "Word", - "NumType": "Ord", - }, - "ADJ_Degree=Pos": {POS: ADJ, "Degree": "Pos"}, - "ADJ_Degree=Pos|Style=Coll": {POS: ADJ, "Degree": "Pos", "Style": "Coll"}, - "ADJ_Degree=Pos|Style=Expr": {POS: ADJ, "Degree": "Pos", "Style": "Expr"}, - "ADJ_Degree=Sup": {POS: ADJ, "Degree": "Sup"}, - "ADJ_NumForm=Digit|NumType=Ord": {POS: ADJ, "NumForm": "Digit", "NumType": "Ord"}, - "ADJ_NumForm=Word|NumType=Card": {POS: ADJ, "NumForm": "Word", "NumType": "Card"}, - "ADJ_NumForm=Word|NumType=Ord": {POS: ADJ, "NumForm": "Word", "NumType": "Ord"}, - "ADJ_Style=Coll": {POS: ADJ, "Style": "Coll"}, - "ADJ_Style=Expr": {POS: ADJ, "Style": "Expr"}, - "ADP_AdpType=Post|Case=Dat": {POS: ADP, "AdpType": "Post", "Case": "Dat"}, - "ADP_AdpType=Post|Case=Nom": {POS: ADP, "AdpType": "Post", "Case": "Nom"}, - "ADP_AdpType=Post|Number=Plur|Person=3": { - POS: ADP, - "AdpType": "Post", - "Number": "Plur", - "Person": "three", - }, - "ADP_AdpType=Post": {POS: ADP, "AdpType": "Post"}, - "ADP_AdpType=Prep": {POS: ADP, "AdpType": "Prep"}, - "ADP_AdpType=Prep|Style=Arch": {POS: ADP, "AdpType": "Prep", "Style": "Arch"}, - "ADV_Degree=Cmp": {POS: ADV, "Degree": "Cmp"}, - "ADV_Degree=Pos": {POS: ADV, "Degree": "Pos"}, - "ADV_Degree=Sup": {POS: ADV, "Degree": "Sup"}, - "ADV_Distance=Dist|PronType=Dem": {POS: ADV, "PronType": "Dem"}, - "ADV_Distance=Dist|PronType=Exc": {POS: ADV, "PronType": "Exc"}, - "ADV_Distance=Med|PronType=Dem": {POS: ADV, "PronType": "Dem"}, - "ADV_Distance=Med|PronType=Dem|Style=Coll": { - POS: ADV, - "PronType": "Dem", - "Style": "Coll", - }, - "ADV_NumForm=Word|NumType=Card|PronType=Tot": { - POS: ADV, - "NumForm": "Word", - "NumType": "Card", - "PronType": "Tot", - }, - "ADV_PronType=Dem": {POS: ADV, "PronType": "Dem"}, - "ADV_PronType=Exc": {POS: ADV, "PronType": "Exc"}, - "ADV_PronType=Ind": {POS: ADV, "PronType": "Ind"}, - "ADV_PronType=Int": {POS: ADV, "PronType": "Int"}, - "ADV_PronType=Int|Style=Coll": {POS: ADV, "PronType": "Int", "Style": "Coll"}, - "ADV_PronType=Rel": {POS: ADV, "PronType": "Rel"}, - "ADV_Style=Coll": {POS: ADV, "Style": "Coll"}, - "ADV_Style=Rare": {POS: ADV, "Style": "Rare"}, - "AUX_Aspect=Imp|Mood=Ind|Number=Plur|Person=1|Polarity=Neg|Tense=Pres|VerbForm=Fin": { - POS: AUX, - "Aspect": "Imp", - "Mood": "Ind", - "Number": "Plur", - "Person": "one", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Fin", - }, - 
"AUX_Aspect=Imp|Mood=Ind|Number=Plur|Person=2|Polarity=Pos|Tense=Pres|VerbForm=Fin": { - POS: AUX, - "Aspect": "Imp", - "Mood": "Ind", - "Number": "Plur", - "Person": "two", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "AUX_Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Polarity=Neg|Tense=Imp|VerbForm=Fin": { - POS: AUX, - "Aspect": "Imp", - "Mood": "Ind", - "Number": "Plur", - "Person": "three", - "Polarity": "Neg", - "Tense": "Imp", - "VerbForm": "Fin", - }, - "AUX_Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Polarity=Neg|Tense=Pres|VerbForm=Fin": { - POS: AUX, - "Aspect": "Imp", - "Mood": "Ind", - "Number": "Plur", - "Person": "three", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "AUX_Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Polarity=Pos|Tense=Imp|VerbForm=Fin": { - POS: AUX, - "Aspect": "Imp", - "Mood": "Ind", - "Number": "Plur", - "Person": "three", - "Polarity": "Pos", - "Tense": "Imp", - "VerbForm": "Fin", - }, - "AUX_Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Polarity=Pos|Tense=Pres|VerbForm=Fin": { - POS: AUX, - "Aspect": "Imp", - "Mood": "Ind", - "Number": "Plur", - "Person": "three", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "AUX_Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Polarity=Neg|Tense=Imp|VerbForm=Fin": { - POS: AUX, - "Aspect": "Imp", - "Mood": "Ind", - "Number": "Sing", - "Person": "one", - "Polarity": "Neg", - "Tense": "Imp", - "VerbForm": "Fin", - }, - "AUX_Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Polarity=Neg|Tense=Pres|VerbForm=Fin": { - POS: AUX, - "Aspect": "Imp", - "Mood": "Ind", - "Number": "Sing", - "Person": "one", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "AUX_Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Polarity=Pos|Tense=Imp|VerbForm=Fin": { - POS: AUX, - "Aspect": "Imp", - "Mood": "Ind", - "Number": "Sing", - "Person": "one", - "Polarity": "Pos", - "Tense": "Imp", - "VerbForm": "Fin", - }, - "AUX_Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Polarity=Pos|Tense=Pres|VerbForm=Fin": { - POS: AUX, - "Aspect": "Imp", - "Mood": "Ind", - "Number": "Sing", - "Person": "one", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "AUX_Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Polarity=Neg|Tense=Pres|VerbForm=Fin": { - POS: AUX, - "Aspect": "Imp", - "Mood": "Ind", - "Number": "Sing", - "Person": "two", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "AUX_Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Polarity=Pos|Tense=Pres|VerbForm=Fin": { - POS: AUX, - "Aspect": "Imp", - "Mood": "Ind", - "Number": "Sing", - "Person": "two", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "AUX_Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Polarity=Neg|Tense=Imp|VerbForm=Fin": { - POS: AUX, - "Aspect": "Imp", - "Mood": "Ind", - "Number": "Sing", - "Person": "three", - "Polarity": "Neg", - "Tense": "Imp", - "VerbForm": "Fin", - }, - "AUX_Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Polarity=Neg|Tense=Pres|VerbForm=Fin": { - POS: AUX, - "Aspect": "Imp", - "Mood": "Ind", - "Number": "Sing", - "Person": "three", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "AUX_Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Imp|VerbForm=Fin": { - POS: AUX, - "Aspect": "Imp", - "Mood": "Ind", - "Number": "Sing", - "Person": "three", - "Polarity": "Pos", - "Tense": "Imp", - "VerbForm": "Fin", - }, - "AUX_Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Pres|VerbForm=Fin": { - POS: AUX, - "Aspect": "Imp", - "Mood": "Ind", - "Number": "Sing", - 
"Person": "three", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "AUX_Aspect=Imp|VerbForm=Part": {POS: AUX, "Aspect": "Imp", "VerbForm": "Part"}, - "AUX_Aspect=Perf|VerbForm=Part": {POS: AUX, "Aspect": "Perf", "VerbForm": "Part"}, - "AUX_Aspect=Prosp|VerbForm=Part": {POS: AUX, "Aspect": "Prosp", "VerbForm": "Part"}, - "AUX_Polarity=Pos": {POS: AUX, "Polarity": "Pos"}, - "CCONJ_ConjType=Comp": {POS: CCONJ, "ConjType": "Comp"}, - "CCONJ_ConjType=Comp|Style=Coll": {POS: CCONJ, "ConjType": "Comp", "Style": "Coll"}, - "DET_Case=Gen|Distance=Med|Number=Plur|Poss=Yes|PronType=Dem": { - POS: DET, - "Case": "Gen", - "Number": "Plur", - "Poss": "Yes", - "PronType": "Dem", - }, - "DET_Case=Gen|Distance=Med|Number=Sing|Poss=Yes|PronType=Dem": { - POS: DET, - "Case": "Gen", - "Number": "Sing", - "Poss": "Yes", - "PronType": "Dem", - }, - "DET_Case=Gen|Number=Plur|Person=1|Poss=Yes|PronType=Prs": { - POS: DET, - "Case": "Gen", - "Number": "Plur", - "Person": "one", - "Poss": "Yes", - "PronType": "Prs", - }, - "DET_Case=Gen|Number=Plur|Person=2|Polite=Infm|Poss=Yes|PronType=Prs": { - POS: DET, - "Case": "Gen", - "Number": "Plur", - "Person": "two", - "Poss": "Yes", - "PronType": "Prs", - }, - "DET_Case=Gen|Number=Plur|Person=3|Poss=Yes|PronType=Emp": { - POS: DET, - "Case": "Gen", - "Number": "Plur", - "Person": "three", - "Poss": "Yes", - }, - "DET_Case=Gen|Number=Plur|Person=3|Poss=Yes|PronType=Emp|Reflex=Yes": { - POS: DET, - "Case": "Gen", - "Number": "Plur", - "Person": "three", - "Poss": "Yes", - "Reflex": "Yes", - }, - "DET_Case=Gen|Number=Sing|Person=1|Poss=Yes|PronType=Prs": { - POS: DET, - "Case": "Gen", - "Number": "Sing", - "Person": "one", - "Poss": "Yes", - "PronType": "Prs", - }, - "DET_Case=Gen|Number=Sing|Person=2|Polite=Infm|Poss=Yes|PronType=Prs": { - POS: DET, - "Case": "Gen", - "Number": "Sing", - "Person": "two", - "Poss": "Yes", - "PronType": "Prs", - }, - "DET_Case=Gen|Number=Sing|Person=3|Poss=Yes|PronType=Emp": { - POS: DET, - "Case": "Gen", - "Number": "Sing", - "Person": "three", - "Poss": "Yes", - }, - "DET_Case=Gen|Number=Sing|Person=3|Poss=Yes|PronType=Emp|Reflex=Yes": { - POS: DET, - "Case": "Gen", - "Number": "Sing", - "Person": "three", - "Poss": "Yes", - "Reflex": "Yes", - }, - "DET_Case=Gen|Number=Sing|Person=3|Poss=Yes|PronType=Prs": { - POS: DET, - "Case": "Gen", - "Number": "Sing", - "Person": "three", - "Poss": "Yes", - "PronType": "Prs", - }, - "DET_Case=Gen|Number=Sing|Poss=Yes|PronType=Rel": { - POS: DET, - "Case": "Gen", - "Number": "Sing", - "Poss": "Yes", - "PronType": "Rel", - }, - "DET_Distance=Dist|PronType=Dem": {POS: DET, "PronType": "Dem"}, - "DET_Distance=Dist|PronType=Dem|Style=Coll": { - POS: DET, - "PronType": "Dem", - "Style": "Coll", - }, - "DET_Distance=Dist|PronType=Dem|Style=Vrnc": { - POS: DET, - "PronType": "Dem", - "Style": "Vrnc", - }, - "DET_Distance=Med|PronType=Dem": {POS: DET, "PronType": "Dem"}, - "DET_Distance=Med|PronType=Dem|Style=Coll": { - POS: DET, - "PronType": "Dem", - "Style": "Coll", - }, - "DET_Distance=Prox|PronType=Dem": {POS: DET, "PronType": "Dem"}, - "DET_Distance=Prox|PronType=Dem|Style=Coll": { - POS: DET, - "PronType": "Dem", - "Style": "Coll", - }, - "DET_PronType=Art": {POS: DET, "PronType": "Art"}, - "DET_PronType=Exc": {POS: DET, "PronType": "Exc"}, - "DET_PronType=Ind": {POS: DET, "PronType": "Ind"}, - "DET_PronType=Int": {POS: DET, "PronType": "Int"}, - "DET_PronType=Tot": {POS: DET, "PronType": "Tot"}, - "DET_PronType=Tot|Style=Arch": {POS: DET, "PronType": "Tot", "Style": "Arch"}, - 
"INTJ_Style=Vrnc": {POS: INTJ, "Style": "Vrnc"}, - "NOUN_Abbr=Yes|Animacy=Nhum|Case=Dat|Definite=Ind|Number=Plur": { - POS: NOUN, - "Abbr": "Yes", - "Animacy": "Nhum", - "Case": "Dat", - "Definite": "Ind", - "Number": "Plur", - }, - "NOUN_Abbr=Yes|Animacy=Nhum|Case=Nom|Definite=Ind|Number=Sing": { - POS: NOUN, - "Abbr": "Yes", - "Animacy": "Nhum", - "Case": "Nom", - "Definite": "Ind", - "Number": "Sing", - }, - "NOUN_Animacy=Hum|Case=Abl|Definite=Ind|Number=Plur": { - POS: NOUN, - "Animacy": "Hum", - "Case": "Abl", - "Definite": "Ind", - "Number": "Plur", - }, - "NOUN_Animacy=Hum|Case=Abl|Definite=Ind|Number=Plur|Style=Slng": { - POS: NOUN, - "Animacy": "Hum", - "Case": "Abl", - "Definite": "Ind", - "Number": "Plur", - }, - "NOUN_Animacy=Hum|Case=Abl|Definite=Ind|Number=Sing": { - POS: NOUN, - "Animacy": "Hum", - "Case": "Abl", - "Definite": "Ind", - "Number": "Sing", - }, - "NOUN_Animacy=Hum|Case=Dat|Definite=Def|Number=Plur": { - POS: NOUN, - "Animacy": "Hum", - "Case": "Dat", - "Definite": "Def", - "Number": "Plur", - }, - "NOUN_Animacy=Hum|Case=Dat|Definite=Def|Number=Sing": { - POS: NOUN, - "Animacy": "Hum", - "Case": "Dat", - "Definite": "Def", - "Number": "Sing", - }, - "NOUN_Animacy=Hum|Case=Dat|Definite=Def|Number=Sing|Style=Slng": { - POS: NOUN, - "Animacy": "Hum", - "Case": "Dat", - "Definite": "Def", - "Number": "Sing", - }, - "NOUN_Animacy=Hum|Case=Dat|Definite=Ind|Number=Assoc": { - POS: NOUN, - "Animacy": "Hum", - "Case": "Dat", - "Definite": "Ind", - }, - "NOUN_Animacy=Hum|Case=Dat|Definite=Ind|Number=Plur": { - POS: NOUN, - "Animacy": "Hum", - "Case": "Dat", - "Definite": "Ind", - "Number": "Plur", - }, - "NOUN_Animacy=Hum|Case=Dat|Definite=Ind|Number=Plur|Style=Coll": { - POS: NOUN, - "Animacy": "Hum", - "Case": "Dat", - "Definite": "Ind", - "Number": "Plur", - "Style": "Coll", - }, - "NOUN_Animacy=Hum|Case=Dat|Definite=Ind|Number=Plur|Style=Slng": { - POS: NOUN, - "Animacy": "Hum", - "Case": "Dat", - "Definite": "Ind", - "Number": "Plur", - }, - "NOUN_Animacy=Hum|Case=Dat|Definite=Ind|Number=Sing": { - POS: NOUN, - "Animacy": "Hum", - "Case": "Dat", - "Definite": "Ind", - "Number": "Sing", - }, - "NOUN_Animacy=Hum|Case=Dat|Definite=Ind|Number=Sing|Style=Arch": { - POS: NOUN, - "Animacy": "Hum", - "Case": "Dat", - "Definite": "Ind", - "Number": "Sing", - "Style": "Arch", - }, - "NOUN_Animacy=Hum|Case=Dat|Number=Sing|Number=Sing|Person=1": { - POS: NOUN, - "Animacy": "Hum", - "Case": "Dat", - "Number": "Sing", - "Number": "Sing", - "Person": "one", - }, - "NOUN_Animacy=Hum|Case=Dat|Number=Sing|Number=Sing|Person=1|Style=Coll": { - POS: NOUN, - "Animacy": "Hum", - "Case": "Dat", - "Number": "Sing", - "Number": "Sing", - "Person": "one", - "Style": "Coll", - }, - "NOUN_Animacy=Hum|Case=Ins|Definite=Ind|Number=Sing": { - POS: NOUN, - "Animacy": "Hum", - "Case": "Ins", - "Definite": "Ind", - "Number": "Sing", - }, - "NOUN_Animacy=Hum|Case=Nom|Definite=Def|Number=Plur": { - POS: NOUN, - "Animacy": "Hum", - "Case": "Nom", - "Definite": "Def", - "Number": "Plur", - }, - "NOUN_Animacy=Hum|Case=Nom|Definite=Def|Number=Plur|Style=Slng": { - POS: NOUN, - "Animacy": "Hum", - "Case": "Nom", - "Definite": "Def", - "Number": "Plur", - }, - "NOUN_Animacy=Hum|Case=Nom|Definite=Def|Number=Sing": { - POS: NOUN, - "Animacy": "Hum", - "Case": "Nom", - "Definite": "Def", - "Number": "Sing", - }, - "NOUN_Animacy=Hum|Case=Nom|Definite=Def|Number=Sing|Style=Coll": { - POS: NOUN, - "Animacy": "Hum", - "Case": "Nom", - "Definite": "Def", - "Number": "Sing", - "Style": "Coll", - }, - 
"NOUN_Animacy=Hum|Case=Nom|Definite=Ind|Number=Assoc": { - POS: NOUN, - "Animacy": "Hum", - "Case": "Nom", - "Definite": "Ind", - }, - "NOUN_Animacy=Hum|Case=Nom|Definite=Ind|Number=Plur": { - POS: NOUN, - "Animacy": "Hum", - "Case": "Nom", - "Definite": "Ind", - "Number": "Plur", - }, - "NOUN_Animacy=Hum|Case=Nom|Definite=Ind|Number=Plur|Style=Coll": { - POS: NOUN, - "Animacy": "Hum", - "Case": "Nom", - "Definite": "Ind", - "Number": "Plur", - "Style": "Coll", - }, - "NOUN_Animacy=Hum|Case=Nom|Definite=Ind|Number=Plur|Style=Slng": { - POS: NOUN, - "Animacy": "Hum", - "Case": "Nom", - "Definite": "Ind", - "Number": "Plur", - }, - "NOUN_Animacy=Hum|Case=Nom|Definite=Ind|Number=Plur|Typo=Yes": { - POS: NOUN, - "Animacy": "Hum", - "Case": "Nom", - "Definite": "Ind", - "Number": "Plur", - "Typo": "Yes", - }, - "NOUN_Animacy=Hum|Case=Nom|Definite=Ind|Number=Sing": { - POS: NOUN, - "Animacy": "Hum", - "Case": "Nom", - "Definite": "Ind", - "Number": "Sing", - }, - "NOUN_Animacy=Hum|Case=Nom|Definite=Ind|Number=Sing|Style=Coll": { - POS: NOUN, - "Animacy": "Hum", - "Case": "Nom", - "Definite": "Ind", - "Number": "Sing", - "Style": "Coll", - }, - "NOUN_Animacy=Hum|Case=Nom|Number=Sing|Number=Sing|Person=1": { - POS: NOUN, - "Animacy": "Hum", - "Case": "Nom", - "Number": "Sing", - "Number": "Sing", - "Person": "one", - }, - "NOUN_Animacy=Nhum|Case=Abl|Definite=Ind|Number=Coll": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Abl", - "Definite": "Ind", - }, - "NOUN_Animacy=Nhum|Case=Abl|Definite=Ind|Number=Plur": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Abl", - "Definite": "Ind", - "Number": "Plur", - }, - "NOUN_Animacy=Nhum|Case=Abl|Definite=Ind|Number=Sing": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Abl", - "Definite": "Ind", - "Number": "Sing", - }, - "NOUN_Animacy=Nhum|Case=Abl|Definite=Ind|Number=Sing|Style=Arch": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Abl", - "Definite": "Ind", - "Number": "Sing", - "Style": "Arch", - }, - "NOUN_Animacy=Nhum|Case=Abl|Number=Sing|Number=Sing|Person=2": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Abl", - "Number": "Sing", - "Number": "Sing", - "Person": "two", - }, - "NOUN_Animacy=Nhum|Case=Dat|Definite=Def|Number=Coll": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Dat", - "Definite": "Def", - }, - "NOUN_Animacy=Nhum|Case=Dat|Definite=Def|Number=Plur": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Dat", - "Definite": "Def", - "Number": "Plur", - }, - "NOUN_Animacy=Nhum|Case=Dat|Definite=Def|Number=Sing|NumForm=Digit": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Dat", - "Definite": "Def", - "Number": "Sing", - "NumForm": "Digit", - }, - "NOUN_Animacy=Nhum|Case=Dat|Definite=Def|Number=Sing|NumForm=Word": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Dat", - "Definite": "Def", - "Number": "Sing", - "NumForm": "Word", - }, - "NOUN_Animacy=Nhum|Case=Dat|Definite=Def|Number=Sing": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Dat", - "Definite": "Def", - "Number": "Sing", - }, - "NOUN_Animacy=Nhum|Case=Dat|Definite=Def|Number=Sing|Style=Rare": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Dat", - "Definite": "Def", - "Number": "Sing", - "Style": "Rare", - }, - "NOUN_Animacy=Nhum|Case=Dat|Definite=Def|Number=Sing|Style=Vrnc": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Dat", - "Definite": "Def", - "Number": "Sing", - "Style": "Vrnc", - }, - "NOUN_Animacy=Nhum|Case=Dat|Definite=Ind|Number=Coll": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Dat", - "Definite": "Ind", - }, - "NOUN_Animacy=Nhum|Case=Dat|Definite=Ind|Number=Plur": { - POS: 
NOUN, - "Animacy": "Nhum", - "Case": "Dat", - "Definite": "Ind", - "Number": "Plur", - }, - "NOUN_Animacy=Nhum|Case=Dat|Definite=Ind|Number=Sing|NumForm=Digit": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Dat", - "Definite": "Ind", - "Number": "Sing", - "NumForm": "Digit", - }, - "NOUN_Animacy=Nhum|Case=Dat|Definite=Ind|Number=Sing": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Dat", - "Definite": "Ind", - "Number": "Sing", - }, - "NOUN_Animacy=Nhum|Case=Dat|Definite=Ind|Number=Sing|Style=Coll": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Dat", - "Definite": "Ind", - "Number": "Sing", - "Style": "Coll", - }, - "NOUN_Animacy=Nhum|Case=Dat|Definite=Ind|Number=Sing|Style=Vrnc": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Dat", - "Definite": "Ind", - "Number": "Sing", - "Style": "Vrnc", - }, - "NOUN_Animacy=Nhum|Case=Dat|Number=Coll|Number=Sing|Person=1": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Dat", - # - "Number": "Sing", - "Person": "one", - }, - "NOUN_Animacy=Nhum|Case=Dat|Number=Sing|Number=Sing|Person=1": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Dat", - "Number": "Sing", - "Number": "Sing", - "Person": "one", - }, - "NOUN_Animacy=Nhum|Case=Dat|Number=Sing|Number=Sing|Person=2": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Dat", - "Number": "Sing", - "Number": "Sing", - "Person": "two", - }, - "NOUN_Animacy=Nhum|Case=Gen|Definite=Ind|Number=Sing|Style=Arch": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Gen", - "Definite": "Ind", - "Number": "Sing", - "Style": "Arch", - }, - "NOUN_Animacy=Nhum|Case=Ins|Definite=Ind|Number=Coll": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Ins", - "Definite": "Ind", - }, - "NOUN_Animacy=Nhum|Case=Ins|Definite=Ind|Number=Plur": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Ins", - "Definite": "Ind", - "Number": "Plur", - }, - "NOUN_Animacy=Nhum|Case=Ins|Definite=Ind|Number=Sing": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Ins", - "Definite": "Ind", - "Number": "Sing", - }, - "NOUN_Animacy=Nhum|Case=Ins|Definite=Ind|Number=Sing|Style=Coll": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Ins", - "Definite": "Ind", - "Number": "Sing", - "Style": "Coll", - }, - "NOUN_Animacy=Nhum|Case=Ins|Number=Sing|Number=Sing|Person=1": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Ins", - "Number": "Sing", - "Number": "Sing", - "Person": "one", - }, - "NOUN_Animacy=Nhum|Case=Loc|Definite=Ind|Number=Plur": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Loc", - "Definite": "Ind", - "Number": "Plur", - }, - "NOUN_Animacy=Nhum|Case=Loc|Definite=Ind|Number=Sing": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Loc", - "Definite": "Ind", - "Number": "Sing", - }, - "NOUN_Animacy=Nhum|Case=Loc|Number=Sing|Number=Sing|Person=2": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Loc", - "Number": "Sing", - "Number": "Sing", - "Person": "two", - }, - "NOUN_Animacy=Nhum|Case=Nom|Definite=Def|Number=Coll": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Nom", - "Definite": "Def", - }, - "NOUN_Animacy=Nhum|Case=Nom|Definite=Def|Number=Plur|Number=Sing|Poss=Yes": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Nom", - "Definite": "Def", - # "Number": "Plur", - "Number": "Sing", - "Poss": "Yes", - }, - "NOUN_Animacy=Nhum|Case=Nom|Definite=Def|Number=Plur": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Nom", - "Definite": "Def", - "Number": "Plur", - }, - "NOUN_Animacy=Nhum|Case=Nom|Definite=Def|Number=Sing|NumForm=Digit": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Nom", - "Definite": "Def", - "Number": "Sing", - "NumForm": "Digit", - }, - 
"NOUN_Animacy=Nhum|Case=Nom|Definite=Def|Number=Sing": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Nom", - "Definite": "Def", - "Number": "Sing", - }, - "NOUN_Animacy=Nhum|Case=Nom|Definite=Ind|Number=Coll": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Nom", - "Definite": "Ind", - }, - "NOUN_Animacy=Nhum|Case=Nom|Definite=Ind|Number=Coll|Typo=Yes": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Nom", - "Definite": "Ind", - "Typo": "Yes", - }, - "NOUN_Animacy=Nhum|Case=Nom|Definite=Ind|Number=Plur": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Nom", - "Definite": "Ind", - "Number": "Plur", - }, - "NOUN_Animacy=Nhum|Case=Nom|Definite=Ind|Number=Sing": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Nom", - "Definite": "Ind", - "Number": "Sing", - }, - "NOUN_Animacy=Nhum|Case=Nom|Definite=Ind": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Nom", - "Definite": "Ind", - }, - "NOUN_Animacy=Nhum|Case=Nom|Number=Plur|Number=Sing|Person=2": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Nom", - # "Number": "Plur", - "Number": "Sing", - "Person": "two", - }, - "NOUN_Animacy=Nhum|Case=Nom|Number=Sing|Number=Sing|Person=1": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Nom", - "Number": "Sing", - "Number": "Sing", - "Person": "one", - }, - "NOUN_Animacy=Nhum|Case=Nom|Number=Sing|Number=Sing|Person=2": { - POS: NOUN, - "Animacy": "Nhum", - "Case": "Nom", - "Number": "Sing", - "Number": "Sing", - "Person": "two", - }, - "NUM_NumForm=Digit|NumType=Card": {POS: NUM, "NumForm": "Digit", "NumType": "Card"}, - "NUM_NumForm=Digit|NumType=Frac|Typo=Yes": { - POS: NUM, - "NumForm": "Digit", - "NumType": "Frac", - "Typo": "Yes", - }, - "NUM_NumForm=Digit|NumType=Range": {POS: NUM, "NumForm": "Digit",}, - "NUM_NumForm=Word|NumType=Card": {POS: NUM, "NumForm": "Word", "NumType": "Card"}, - "NUM_NumForm=Word|NumType=Dist": {POS: NUM, "NumForm": "Word", "NumType": "Dist"}, - "NUM_NumForm=Word|NumType=Range": {POS: NUM, "NumForm": "Word",}, - "PART_Polarity=Neg": {POS: PART, "Polarity": "Neg"}, - "PRON_Case=Abl|Definite=Ind|Number=Sing|Person=3|PronType=Prs": { - POS: PRON, - "Case": "Abl", - "Definite": "Ind", - "Number": "Sing", - "Person": "three", - "PronType": "Prs", - }, - "PRON_Case=Abl|Number=Plur|Person=3|PronType=Prs": { - POS: PRON, - "Case": "Abl", - "Number": "Plur", - "Person": "three", - "PronType": "Prs", - }, - "PRON_Case=Abl|Number=Sing|Person=2|Polite=Infm|PronType=Prs": { - POS: PRON, - "Case": "Abl", - "Number": "Sing", - "Person": "two", - "PronType": "Prs", - }, - "PRON_Case=Dat|Definite=Def|Distance=Dist|Number=Sing|PronType=Dem": { - POS: PRON, - "Case": "Dat", - "Definite": "Def", - "Number": "Sing", - "PronType": "Dem", - }, - "PRON_Case=Dat|Definite=Def|Number=Sing|Person=3|PronType=Prs": { - POS: PRON, - "Case": "Dat", - "Definite": "Def", - "Number": "Sing", - "Person": "three", - "PronType": "Prs", - }, - "PRON_Case=Dat|Definite=Ind|Number=Sing|PronType=Int": { - POS: PRON, - "Case": "Dat", - "Definite": "Ind", - "Number": "Sing", - "PronType": "Int", - }, - "PRON_Case=Dat|Distance=Dist|Number=Sing|PronType=Dem": { - POS: PRON, - "Case": "Dat", - "Number": "Sing", - "PronType": "Dem", - }, - "PRON_Case=Dat|Distance=Med|Number=Plur|PronType=Dem": { - POS: PRON, - "Case": "Dat", - "Number": "Plur", - "PronType": "Dem", - }, - "PRON_Case=Dat|Number=Plur|Person=1|PronType=Prs": { - POS: PRON, - "Case": "Dat", - "Number": "Plur", - "Person": "one", - "PronType": "Prs", - }, - "PRON_Case=Dat|Number=Plur|Person=2|Polite=Infm|PronType=Prs": { - POS: PRON, - "Case": "Dat", - "Number": 
"Plur", - "Person": "two", - "PronType": "Prs", - }, - "PRON_Case=Dat|Number=Plur|Person=3|PronType=Emp|Reflex=Yes": { - POS: PRON, - "Case": "Dat", - "Number": "Plur", - "Person": "three", - "Reflex": "Yes", - }, - "PRON_Case=Dat|Number=Plur|Person=3|PronType=Prs": { - POS: PRON, - "Case": "Dat", - "Number": "Plur", - "Person": "three", - "PronType": "Prs", - }, - "PRON_Case=Dat|Number=Plur|PronType=Rcp": { - POS: PRON, - "Case": "Dat", - "Number": "Plur", - "PronType": "Rcp", - }, - "PRON_Case=Dat|Number=Sing|Person=1|PronType=Prs": { - POS: PRON, - "Case": "Dat", - "Number": "Sing", - "Person": "one", - "PronType": "Prs", - }, - "PRON_Case=Dat|Number=Sing|Person=2|Polite=Infm|PronType=Prs": { - POS: PRON, - "Case": "Dat", - "Number": "Sing", - "Person": "two", - "PronType": "Prs", - }, - "PRON_Case=Dat|Number=Sing|Person=3|PronType=Emp": { - POS: PRON, - "Case": "Dat", - "Number": "Sing", - "Person": "three", - }, - "PRON_Case=Dat|Number=Sing|Person=3|PronType=Emp|Reflex=Yes": { - POS: PRON, - "Case": "Dat", - "Number": "Sing", - "Person": "three", - "Reflex": "Yes", - }, - "PRON_Case=Dat|Number=Sing|PronType=Int": { - POS: PRON, - "Case": "Dat", - "Number": "Sing", - "PronType": "Int", - }, - "PRON_Case=Dat|Number=Sing|PronType=Rel": { - POS: PRON, - "Case": "Dat", - "Number": "Sing", - "PronType": "Rel", - }, - "PRON_Case=Dat|PronType=Tot": {POS: PRON, "Case": "Dat", "PronType": "Tot"}, - "PRON_Case=Gen|Distance=Med|Number=Sing|PronType=Dem": { - POS: PRON, - "Case": "Gen", - "Number": "Sing", - "PronType": "Dem", - }, - "PRON_Case=Gen|Number=Plur|Person=1|PronType=Prs": { - POS: PRON, - "Case": "Gen", - "Number": "Plur", - "Person": "one", - "PronType": "Prs", - }, - "PRON_Case=Gen|Number=Sing|Person=2|PronType=Prs": { - POS: PRON, - "Case": "Gen", - "Number": "Sing", - "Person": "two", - "PronType": "Prs", - }, - "PRON_Case=Gen|Number=Sing|Person=3|PronType=Prs": { - POS: PRON, - "Case": "Gen", - "Number": "Sing", - "Person": "three", - "PronType": "Prs", - }, - "PRON_Case=Gen|PronType=Tot": {POS: PRON, "Case": "Gen", "PronType": "Tot"}, - "PRON_Case=Ins|Definite=Ind|Number=Sing|PronType=Rel": { - POS: PRON, - "Case": "Ins", - "Definite": "Ind", - "Number": "Sing", - "PronType": "Rel", - }, - "PRON_Case=Ins|Distance=Med|Number=Sing|PronType=Dem": { - POS: PRON, - "Case": "Ins", - "Number": "Sing", - "PronType": "Dem", - }, - "PRON_Case=Loc|Definite=Ind|Number=Sing|PronType=Rel": { - POS: PRON, - "Case": "Loc", - "Definite": "Ind", - "Number": "Sing", - "PronType": "Rel", - }, - "PRON_Case=Loc|Distance=Med|Number=Sing|PronType=Dem": { - POS: PRON, - "Case": "Loc", - "Number": "Sing", - "PronType": "Dem", - }, - "PRON_Case=Nom|Definite=Def|Distance=Dist|Number=Plur|PronType=Dem": { - POS: PRON, - "Case": "Nom", - "Definite": "Def", - "Number": "Plur", - "PronType": "Dem", - }, - "PRON_Case=Nom|Definite=Def|Distance=Med|Number=Sing|PronType=Dem|Style=Coll": { - POS: PRON, - "Case": "Nom", - "Definite": "Def", - "Number": "Sing", - "PronType": "Dem", - "Style": "Coll", - }, - "PRON_Case=Nom|Definite=Def|Number=Sing|PronType=Int": { - POS: PRON, - "Case": "Nom", - "Definite": "Def", - "Number": "Sing", - "PronType": "Int", - }, - "PRON_Case=Nom|Definite=Def|Number=Sing|PronType=Rel": { - POS: PRON, - "Case": "Nom", - "Definite": "Def", - "Number": "Sing", - "PronType": "Rel", - }, - "PRON_Case=Nom|Definite=Ind|Number=Sing|PronType=Int": { - POS: PRON, - "Case": "Nom", - "Definite": "Ind", - "Number": "Sing", - "PronType": "Int", - }, - 
"PRON_Case=Nom|Definite=Ind|Number=Sing|PronType=Neg": { - POS: PRON, - "Case": "Nom", - "Definite": "Ind", - "Number": "Sing", - "PronType": "Neg", - }, - "PRON_Case=Nom|Definite=Ind|Number=Sing|PronType=Rel": { - POS: PRON, - "Case": "Nom", - "Definite": "Ind", - "Number": "Sing", - "PronType": "Rel", - }, - "PRON_Case=Nom|Distance=Dist|Number=Plur|Person=1|PronType=Dem": { - POS: PRON, - "Case": "Nom", - "Number": "Plur", - "Person": "one", - "PronType": "Dem", - }, - "PRON_Case=Nom|Distance=Med|Number=Plur|PronType=Dem": { - POS: PRON, - "Case": "Nom", - "Number": "Plur", - "PronType": "Dem", - }, - "PRON_Case=Nom|Distance=Med|Number=Sing|PronType=Dem": { - POS: PRON, - "Case": "Nom", - "Number": "Sing", - "PronType": "Dem", - }, - "PRON_Case=Nom|Distance=Prox|Number=Sing|PronType=Dem": { - POS: PRON, - "Case": "Nom", - "Number": "Sing", - "PronType": "Dem", - }, - "PRON_Case=Nom|Number=Plur|Person=1|PronType=Prs": { - POS: PRON, - "Case": "Nom", - "Number": "Plur", - "Person": "one", - "PronType": "Prs", - }, - "PRON_Case=Nom|Number=Plur|Person=3|PronType=Emp": { - POS: PRON, - "Case": "Nom", - "Number": "Plur", - "Person": "three", - }, - "PRON_Case=Nom|Number=Plur|Person=3|PronType=Prs": { - POS: PRON, - "Case": "Nom", - "Number": "Plur", - "Person": "three", - "PronType": "Prs", - }, - "PRON_Case=Nom|Number=Plur|PronType=Rel": { - POS: PRON, - "Case": "Nom", - "Number": "Plur", - "PronType": "Rel", - }, - "PRON_Case=Nom|Number=Sing|Number=Plur|Person=3|Person=1|PronType=Emp": { - POS: PRON, - "Case": "Nom", - # "Number": "Sing", - "Number": "Plur", - # "Person": "three", - "Person": "one", - }, - "PRON_Case=Nom|Number=Sing|Person=1|PronType=Int": { - POS: PRON, - "Case": "Nom", - "Number": "Sing", - "Person": "one", - "PronType": "Int", - }, - "PRON_Case=Nom|Number=Sing|Person=1|PronType=Prs": { - POS: PRON, - "Case": "Nom", - "Number": "Sing", - "Person": "one", - "PronType": "Prs", - }, - "PRON_Case=Nom|Number=Sing|Person=2|Polite=Infm|PronType=Prs": { - POS: PRON, - "Case": "Nom", - "Number": "Sing", - "Person": "two", - "PronType": "Prs", - }, - "PRON_Case=Nom|Number=Sing|Person=3|PronType=Emp": { - POS: PRON, - "Case": "Nom", - "Number": "Sing", - "Person": "three", - }, - "PRON_Case=Nom|Number=Sing|Person=3|PronType=Prs": { - POS: PRON, - "Case": "Nom", - "Number": "Sing", - "Person": "three", - "PronType": "Prs", - }, - "PRON_Case=Nom|Number=Sing|PronType=Int": { - POS: PRON, - "Case": "Nom", - "Number": "Sing", - "PronType": "Int", - }, - "PRON_Case=Nom|Number=Sing|PronType=Rel": { - POS: PRON, - "Case": "Nom", - "Number": "Sing", - "PronType": "Rel", - }, - "PRON_Case=Nom|Person=1|PronType=Tot": { - POS: PRON, - "Case": "Nom", - "Person": "one", - "PronType": "Tot", - }, - "PRON_Case=Nom|PronType=Ind": {POS: PRON, "Case": "Nom", "PronType": "Ind"}, - "PRON_Case=Nom|PronType=Tot": {POS: PRON, "Case": "Nom", "PronType": "Tot"}, - "PRON_Distance=Dist|Number=Sing|PronType=Dem": { - POS: PRON, - "Number": "Sing", - "PronType": "Dem", - }, - "PRON_Distance=Med|PronType=Dem|Style=Coll": { - POS: PRON, - "PronType": "Dem", - "Style": "Coll", - }, - "PRON_Distance=Prox|PronType=Dem|Style=Coll": { - POS: PRON, - "PronType": "Dem", - "Style": "Coll", - }, - "PRON_Number=Plur|PronType=Rel": {POS: PRON, "Number": "Plur", "PronType": "Rel"}, - "PROPN_Abbr=Yes|Animacy=Hum|Case=Nom|Definite=Ind|NameType=Giv|Number=Sing": { - POS: PROPN, - "Abbr": "Yes", - "Animacy": "Hum", - "Case": "Nom", - "Definite": "Ind", - "NameType": "Giv", - "Number": "Sing", - }, - 
"PROPN_Abbr=Yes|Animacy=Nhum|Case=Nom|Definite=Ind|NameType=Com|Number=Sing": { - POS: PROPN, - "Abbr": "Yes", - "Animacy": "Nhum", - "Case": "Nom", - "Definite": "Ind", - "NameType": "Com", - "Number": "Sing", - }, - "PROPN_Animacy=Hum|Case=Dat|Definite=Def|NameType=Sur|Number=Sing": { - POS: PROPN, - "Animacy": "Hum", - "Case": "Dat", - "Definite": "Def", - "NameType": "Sur", - "Number": "Sing", - }, - "PROPN_Animacy=Hum|Case=Dat|Definite=Ind|NameType=Prs|Number=Sing": { - POS: PROPN, - "Animacy": "Hum", - "Case": "Dat", - "Definite": "Ind", - "NameType": "Prs", - "Number": "Sing", - }, - "PROPN_Animacy=Hum|Case=Dat|Definite=Ind|NameType=Sur|Number=Sing": { - POS: PROPN, - "Animacy": "Hum", - "Case": "Dat", - "Definite": "Ind", - "NameType": "Sur", - "Number": "Sing", - }, - "PROPN_Animacy=Hum|Case=Nom|Definite=Def|NameType=Giv|Number=Sing": { - POS: PROPN, - "Animacy": "Hum", - "Case": "Nom", - "Definite": "Def", - "NameType": "Giv", - "Number": "Sing", - }, - "PROPN_Animacy=Hum|Case=Nom|Definite=Def|NameType=Sur|Number=Sing": { - POS: PROPN, - "Animacy": "Hum", - "Case": "Nom", - "Definite": "Def", - "NameType": "Sur", - "Number": "Sing", - }, - "PROPN_Animacy=Hum|Case=Nom|Definite=Ind|NameType=Giv|Number=Sing": { - POS: PROPN, - "Animacy": "Hum", - "Case": "Nom", - "Definite": "Ind", - "NameType": "Giv", - "Number": "Sing", - }, - "PROPN_Animacy=Hum|Case=Nom|Definite=Ind|NameType=Sur|Number=Sing": { - POS: PROPN, - "Animacy": "Hum", - "Case": "Nom", - "Definite": "Ind", - "NameType": "Sur", - "Number": "Sing", - }, - "PROPN_Animacy=Nhum|Case=Abl|Definite=Ind|NameType=Geo|Number=Coll": { - POS: PROPN, - "Animacy": "Nhum", - "Case": "Abl", - "Definite": "Ind", - "NameType": "Geo", - }, - "PROPN_Animacy=Nhum|Case=Abl|Definite=Ind|NameType=Geo|Number=Sing": { - POS: PROPN, - "Animacy": "Nhum", - "Case": "Abl", - "Definite": "Ind", - "NameType": "Geo", - "Number": "Sing", - }, - "PROPN_Animacy=Nhum|Case=Abl|Definite=Ind|Number=Plur": { - POS: PROPN, - "Animacy": "Nhum", - "Case": "Abl", - "Definite": "Ind", - "Number": "Plur", - }, - "PROPN_Animacy=Nhum|Case=Dat|Definite=Ind|NameType=Geo|Number=Sing": { - POS: PROPN, - "Animacy": "Nhum", - "Case": "Dat", - "Definite": "Ind", - "NameType": "Geo", - "Number": "Sing", - }, - "PROPN_Animacy=Nhum|Case=Dat|Definite=Ind|NameType=Geo|Number=Sing|Style=Coll": { - POS: PROPN, - "Animacy": "Nhum", - "Case": "Dat", - "Definite": "Ind", - "NameType": "Geo", - "Number": "Sing", - "Style": "Coll", - }, - "PROPN_Animacy=Nhum|Case=Loc|Definite=Ind|NameType=Geo|Number=Sing": { - POS: PROPN, - "Animacy": "Nhum", - "Case": "Loc", - "Definite": "Ind", - "NameType": "Geo", - "Number": "Sing", - }, - "PROPN_Animacy=Nhum|Case=Nom|Definite=Def|NameType=Geo|Number=Sing": { - POS: PROPN, - "Animacy": "Nhum", - "Case": "Nom", - "Definite": "Def", - "NameType": "Geo", - "Number": "Sing", - }, - "PROPN_Animacy=Nhum|Case=Nom|Definite=Def|NameType=Pro|Number=Sing|Style=Coll": { - POS: PROPN, - "Animacy": "Nhum", - "Case": "Nom", - "Definite": "Def", - "NameType": "Pro", - "Number": "Sing", - "Style": "Coll", - }, - "PROPN_Animacy=Nhum|Case=Nom|Definite=Ind|NameType=Geo|Number=Coll": { - POS: PROPN, - "Animacy": "Nhum", - "Case": "Nom", - "Definite": "Ind", - "NameType": "Geo", - }, - "PROPN_Animacy=Nhum|Case=Nom|Definite=Ind|NameType=Geo|Number=Sing": { - POS: PROPN, - "Animacy": "Nhum", - "Case": "Nom", - "Definite": "Ind", - "NameType": "Geo", - "Number": "Sing", - }, - "PROPN_Animacy=Nhum|Case=Nom|Definite=Ind|NameType=Geo|Number=Sing|Style=Vrnc": { - POS: PROPN, - 
"Animacy": "Nhum", - "Case": "Nom", - "Definite": "Ind", - "NameType": "Geo", - "Number": "Sing", - "Style": "Vrnc", - }, - "SCONJ_Style=Coll": {POS: SCONJ, "Style": "Coll"}, - "VERB_Aspect=Dur|Polarity=Neg|Subcat=Intr|VerbForm=Part|Voice=Pass": { - POS: VERB, - "Polarity": "Neg", - "VerbForm": "Part", - "Voice": "Pass", - }, - "VERB_Aspect=Dur|Polarity=Pos|Subcat=Intr|VerbForm=Part|Voice=Mid": { - POS: VERB, - "Polarity": "Pos", - "VerbForm": "Part", - "Voice": "Mid", - }, - "VERB_Aspect=Dur|Polarity=Pos|Subcat=Intr|VerbForm=Part|Voice=Pass": { - POS: VERB, - "Polarity": "Pos", - "VerbForm": "Part", - "Voice": "Pass", - }, - "VERB_Aspect=Dur|Polarity=Pos|Subcat=Tran|VerbForm=Part|Voice=Act": { - POS: VERB, - "Polarity": "Pos", - "VerbForm": "Part", - "Voice": "Act", - }, - "VERB_Aspect=Dur|Polarity=Pos|Subcat=Tran|VerbForm=Part|Voice=Mid": { - POS: VERB, - "Polarity": "Pos", - "VerbForm": "Part", - "Voice": "Mid", - }, - "VERB_Aspect=Imp|Mood=Ind|Number=Plur|Person=1|Polarity=Neg|Subcat=Tran|Tense=Pres|VerbForm=Fin|Voice=Act": { - POS: VERB, - "Aspect": "Imp", - "Mood": "Ind", - "Number": "Plur", - "Person": "one", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Fin", - "Voice": "Act", - }, - "VERB_Aspect=Imp|Mood=Ind|Number=Plur|Person=1|Polarity=Pos|Subcat=Tran|Tense=Pres|VerbForm=Fin|Voice=Act": { - POS: VERB, - "Aspect": "Imp", - "Mood": "Ind", - "Number": "Plur", - "Person": "one", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Fin", - "Voice": "Act", - }, - "VERB_Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Polarity=Neg|Subcat=Intr|Tense=Pres|VerbForm=Fin|Voice=Mid": { - POS: VERB, - "Aspect": "Imp", - "Mood": "Ind", - "Number": "Plur", - "Person": "three", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Fin", - "Voice": "Mid", - }, - "VERB_Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Polarity=Pos|Subcat=Intr|Tense=Pres|VerbForm=Fin|Voice=Mid": { - POS: VERB, - "Aspect": "Imp", - "Mood": "Ind", - "Number": "Plur", - "Person": "three", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Fin", - "Voice": "Mid", - }, - "VERB_Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Polarity=Pos|Subcat=Tran|Tense=Imp|VerbForm=Fin|Voice=Act": { - POS: VERB, - "Aspect": "Imp", - "Mood": "Ind", - "Number": "Sing", - "Person": "one", - "Polarity": "Pos", - "Tense": "Imp", - "VerbForm": "Fin", - "Voice": "Act", - }, - "VERB_Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Polarity=Neg|Subcat=Tran|Tense=Pres|VerbForm=Fin|Voice=Act": { - POS: VERB, - "Aspect": "Imp", - "Mood": "Ind", - "Number": "Sing", - "Person": "two", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Fin", - "Voice": "Act", - }, - "VERB_Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Polarity=Neg|Subcat=Intr|Tense=Pres|VerbForm=Fin|Voice=Mid": { - POS: VERB, - "Aspect": "Imp", - "Mood": "Ind", - "Number": "Sing", - "Person": "three", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Fin", - "Voice": "Mid", - }, - "VERB_Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Polarity=Neg|Subcat=Tran|Tense=Pres|VerbForm=Fin|Voice=Act": { - POS: VERB, - "Aspect": "Imp", - "Mood": "Ind", - "Number": "Sing", - "Person": "three", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Fin", - "Voice": "Act", - }, - "VERB_Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Subcat=Intr|Tense=Pres|VerbForm=Fin|Voice=Mid": { - POS: VERB, - "Aspect": "Imp", - "Mood": "Ind", - "Number": "Sing", - "Person": "three", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Fin", - "Voice": "Mid", - }, - 
"VERB_Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Subcat=Tran|Tense=Imp|VerbForm=Fin|Voice=Act": { - POS: VERB, - "Aspect": "Imp", - "Mood": "Ind", - "Number": "Sing", - "Person": "three", - "Polarity": "Pos", - "Tense": "Imp", - "VerbForm": "Fin", - "Voice": "Act", - }, - "VERB_Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Subcat=Tran|Tense=Pres|VerbForm=Fin|Voice=Act": { - POS: VERB, - "Aspect": "Imp", - "Mood": "Ind", - "Number": "Sing", - "Person": "three", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Fin", - "Voice": "Act", - }, - "VERB_Aspect=Imp|Style=Coll|Subcat=Intr|VerbForm=Part|Voice=Mid": { - POS: VERB, - "Aspect": "Imp", - "Style": "Coll", - "VerbForm": "Part", - "Voice": "Mid", - }, - "VERB_Aspect=Imp|Style=Vrnc|Subcat=Intr|VerbForm=Part|Voice=Mid": { - POS: VERB, - "Aspect": "Imp", - "Style": "Vrnc", - "VerbForm": "Part", - "Voice": "Mid", - }, - "VERB_Aspect=Imp|Subcat=Intr|VerbForm=Part": { - POS: VERB, - "Aspect": "Imp", - "VerbForm": "Part", - }, - "VERB_Aspect=Imp|Subcat=Intr|VerbForm=Part|Voice=Act": { - POS: VERB, - "Aspect": "Imp", - "VerbForm": "Part", - "Voice": "Act", - }, - "VERB_Aspect=Imp|Subcat=Intr|VerbForm=Part|Voice=Mid": { - POS: VERB, - "Aspect": "Imp", - "VerbForm": "Part", - "Voice": "Mid", - }, - "VERB_Aspect=Imp|Subcat=Intr|VerbForm=Part|Voice=Pass": { - POS: VERB, - "Aspect": "Imp", - "VerbForm": "Part", - "Voice": "Pass", - }, - "VERB_Aspect=Imp|Subcat=Tran|VerbForm=Part|Voice=Act": { - POS: VERB, - "Aspect": "Imp", - "VerbForm": "Part", - "Voice": "Act", - }, - "VERB_Aspect=Imp|Subcat=Tran|VerbForm=Part|Voice=Cau": { - POS: VERB, - "Aspect": "Imp", - "VerbForm": "Part", - "Voice": "Cau", - }, - "VERB_Aspect=Iter|Case=Ins|Definite=Ind|Number=Coll|Polarity=Pos|Subcat=Intr|VerbForm=Gdv|Voice=Mid": { - POS: VERB, - "Aspect": "Iter", - "Case": "Ins", - "Definite": "Ind", - "Polarity": "Pos", - "VerbForm": "Gdv", - "Voice": "Mid", - }, - "VERB_Aspect=Iter|Case=Ins|Definite=Ind|Number=Coll|Polarity=Pos|Subcat=Tran|VerbForm=Gdv|Voice=Act": { - POS: VERB, - "Aspect": "Iter", - "Case": "Ins", - "Definite": "Ind", - "Polarity": "Pos", - "VerbForm": "Gdv", - "Voice": "Act", - }, - "VERB_Aspect=Iter": {POS: VERB, "Aspect": "Iter"}, - "VERB_Aspect=Perf|Mood=Ind|Number=Plur|Person=3|Polarity=Neg|Subcat=Intr|Tense=Past|VerbForm=Fin|Voice=Mid": { - POS: VERB, - "Aspect": "Perf", - "Mood": "Ind", - "Number": "Plur", - "Person": "three", - "Polarity": "Neg", - "Tense": "Past", - "VerbForm": "Fin", - "Voice": "Mid", - }, - "VERB_Aspect=Perf|Mood=Ind|Number=Plur|Person=3|Polarity=Pos|Subcat=Intr|Tense=Past|VerbForm=Fin|Voice=Mid": { - POS: VERB, - "Aspect": "Perf", - "Mood": "Ind", - "Number": "Plur", - "Person": "three", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Fin", - "Voice": "Mid", - }, - "VERB_Aspect=Perf|Mood=Ind|Number=Plur|Person=3|Polarity=Pos|Subcat=Tran|Tense=Past|VerbForm=Fin|Voice=Act": { - POS: VERB, - "Aspect": "Perf", - "Mood": "Ind", - "Number": "Plur", - "Person": "three", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Fin", - "Voice": "Act", - }, - "VERB_Aspect=Perf|Mood=Ind|Number=Sing|Person=1|Polarity=Neg|Subcat=Intr|Tense=Past|VerbForm=Fin|Voice=Mid": { - POS: VERB, - "Aspect": "Perf", - "Mood": "Ind", - "Number": "Sing", - "Person": "one", - "Polarity": "Neg", - "Tense": "Past", - "VerbForm": "Fin", - "Voice": "Mid", - }, - "VERB_Aspect=Perf|Mood=Ind|Number=Sing|Person=1|Polarity=Pos|Style=Vrnc|Subcat=Tran|Tense=Past|VerbForm=Fin|Voice=Act": { - POS: VERB, - "Aspect": "Perf", - "Mood": "Ind", - 
"Number": "Sing", - "Person": "one", - "Polarity": "Pos", - "Style": "Vrnc", - "Tense": "Past", - "VerbForm": "Fin", - "Voice": "Act", - }, - "VERB_Aspect=Perf|Mood=Ind|Number=Sing|Person=1|Polarity=Pos|Subcat=Intr|Tense=Past|VerbForm=Fin|Voice=Mid": { - POS: VERB, - "Aspect": "Perf", - "Mood": "Ind", - "Number": "Sing", - "Person": "one", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Fin", - "Voice": "Mid", - }, - "VERB_Aspect=Perf|Mood=Ind|Number=Sing|Person=1|Polarity=Pos|Subcat=Tran|Tense=Past|VerbForm=Fin|Voice=Act": { - POS: VERB, - "Aspect": "Perf", - "Mood": "Ind", - "Number": "Sing", - "Person": "one", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Fin", - "Voice": "Act", - }, - "VERB_Aspect=Perf|Mood=Ind|Number=Sing|Person=2|Polarity=Pos|Subcat=Tran|Tense=Past|VerbForm=Fin|Voice=Act": { - POS: VERB, - "Aspect": "Perf", - "Mood": "Ind", - "Number": "Sing", - "Person": "two", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Fin", - "Voice": "Act", - }, - "VERB_Aspect=Perf|Mood=Ind|Number=Sing|Person=3|Polarity=Neg|Style=Vrnc|Subcat=Intr|Tense=Past|VerbForm=Fin|Voice=Mid": { - POS: VERB, - "Aspect": "Perf", - "Mood": "Ind", - "Number": "Sing", - "Person": "three", - "Polarity": "Neg", - "Style": "Vrnc", - "Tense": "Past", - "VerbForm": "Fin", - "Voice": "Mid", - }, - "VERB_Aspect=Perf|Mood=Ind|Number=Sing|Person=3|Polarity=Neg|Subcat=Tran|Tense=Past|VerbForm=Fin|Voice=Act": { - POS: VERB, - "Aspect": "Perf", - "Mood": "Ind", - "Number": "Sing", - "Person": "three", - "Polarity": "Neg", - "Tense": "Past", - "VerbForm": "Fin", - "Voice": "Act", - }, - "VERB_Aspect=Perf|Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Subcat=Intr|Tense=Past|VerbForm=Fin|Voice=Mid": { - POS: VERB, - "Aspect": "Perf", - "Mood": "Ind", - "Number": "Sing", - "Person": "three", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Fin", - "Voice": "Mid", - }, - "VERB_Aspect=Perf|Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Subcat=Tran|Tense=Past|VerbForm=Fin|Voice=Act": { - POS: VERB, - "Aspect": "Perf", - "Mood": "Ind", - "Number": "Sing", - "Person": "three", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Fin", - "Voice": "Act", - }, - "VERB_Aspect=Perf|Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Subcat=Tran|Tense=Past|VerbForm=Fin|Voice=Mid": { - POS: VERB, - "Aspect": "Perf", - "Mood": "Ind", - "Number": "Sing", - "Person": "three", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Fin", - "Voice": "Mid", - }, - "VERB_Aspect=Perf|Polarity=Neg|Subcat=Intr|VerbForm=Part|Voice=Pass": { - POS: VERB, - "Aspect": "Perf", - "Polarity": "Neg", - "VerbForm": "Part", - "Voice": "Pass", - }, - "VERB_Aspect=Perf|Polarity=Pos|Subcat=Intr|VerbForm=Part|Voice=Mid": { - POS: VERB, - "Aspect": "Perf", - "Polarity": "Pos", - "VerbForm": "Part", - "Voice": "Mid", - }, - "VERB_Aspect=Perf|Polarity=Pos|Subcat=Intr|VerbForm=Part|Voice=Pass": { - POS: VERB, - "Aspect": "Perf", - "Polarity": "Pos", - "VerbForm": "Part", - "Voice": "Pass", - }, - "VERB_Aspect=Perf|Polarity=Pos|Subcat=Tran|VerbForm=Part|Voice=Act": { - POS: VERB, - "Aspect": "Perf", - "Polarity": "Pos", - "VerbForm": "Part", - "Voice": "Act", - }, - "VERB_Aspect=Perf|Polarity=Pos|Subcat=Tran|VerbForm=Part|Voice=Pass": { - POS: VERB, - "Aspect": "Perf", - "Polarity": "Pos", - "VerbForm": "Part", - "Voice": "Pass", - }, - "VERB_Aspect=Perf|Polarity=Pos|VerbForm=Part|Voice=Act": { - POS: VERB, - "Aspect": "Perf", - "Polarity": "Pos", - "VerbForm": "Part", - "Voice": "Act", - }, - "VERB_Aspect=Perf|Subcat=Intr|VerbForm=Part|Voice=Mid": { 
- POS: VERB, - "Aspect": "Perf", - "VerbForm": "Part", - "Voice": "Mid", - }, - "VERB_Aspect=Perf|Subcat=Intr|VerbForm=Part|Voice=Pass": { - POS: VERB, - "Aspect": "Perf", - "VerbForm": "Part", - "Voice": "Pass", - }, - "VERB_Aspect=Perf|Subcat=Tran|VerbForm=Part|Voice=Act": { - POS: VERB, - "Aspect": "Perf", - "VerbForm": "Part", - "Voice": "Act", - }, - "VERB_Aspect=Perf|Subcat=Tran|VerbForm=Part|Voice=Cau": { - POS: VERB, - "Aspect": "Perf", - "VerbForm": "Part", - "Voice": "Cau", - }, - "VERB_Aspect=Prog|Subcat=Intr|VerbForm=Conv|Voice=Mid": { - POS: VERB, - "Aspect": "Prog", - "VerbForm": "Conv", - "Voice": "Mid", - }, - "VERB_Aspect=Prosp|Connegative=Yes|Mood=Cnd|Subcat=Tran|VerbForm=Fin|Voice=Act": { - POS: VERB, - "Aspect": "Prosp", - "Connegative": "Yes", - "Mood": "Cnd", - "VerbForm": "Fin", - "Voice": "Act", - }, - "VERB_Aspect=Prosp|Mood=Cnd|Number=Plur|Person=3|Polarity=Pos|Style=Vrnc|Subcat=Tran|Tense=Pres|VerbForm=Fin|Voice=Act": { - POS: VERB, - "Aspect": "Prosp", - "Mood": "Cnd", - "Number": "Plur", - "Person": "three", - "Polarity": "Pos", - "Style": "Vrnc", - "Tense": "Pres", - "VerbForm": "Fin", - "Voice": "Act", - }, - "VERB_Aspect=Prosp|Mood=Cnd|Number=Plur|Person=3|Polarity=Pos|Subcat=Intr|Tense=Pres|VerbForm=Fin|Voice=Mid": { - POS: VERB, - "Aspect": "Prosp", - "Mood": "Cnd", - "Number": "Plur", - "Person": "three", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Fin", - "Voice": "Mid", - }, - "VERB_Aspect=Prosp|Mood=Cnd|Number=Sing|Person=1|Polarity=Pos|Subcat=Intr|Tense=Pres|VerbForm=Fin|Voice=Mid": { - POS: VERB, - "Aspect": "Prosp", - "Mood": "Cnd", - "Number": "Sing", - "Person": "one", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Fin", - "Voice": "Mid", - }, - "VERB_Aspect=Prosp|Mood=Cnd|Number=Sing|Person=2|Polarity=Pos|Subcat=Tran|Tense=Pres|VerbForm=Fin|Voice=Act": { - POS: VERB, - "Aspect": "Prosp", - "Mood": "Cnd", - "Number": "Sing", - "Person": "two", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Fin", - "Voice": "Act", - }, - "VERB_Aspect=Prosp|Mood=Cnd|Number=Sing|Person=3|Polarity=Pos|Subcat=Intr|Tense=Pres|VerbForm=Fin|Voice=Mid": { - POS: VERB, - "Aspect": "Prosp", - "Mood": "Cnd", - "Number": "Sing", - "Person": "three", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Fin", - "Voice": "Mid", - }, - "VERB_Aspect=Prosp|Mood=Cnd|Number=Sing|Person=3|Polarity=Pos|Subcat=Intr|Tense=Pres|VerbForm=Fin|Voice=Pass": { - POS: VERB, - "Aspect": "Prosp", - "Mood": "Cnd", - "Number": "Sing", - "Person": "three", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Fin", - "Voice": "Pass", - }, - "VERB_Aspect=Prosp|Mood=Cnd|Number=Sing|Person=3|Polarity=Pos|Subcat=Tran|Tense=Imp|VerbForm=Fin|Voice=Act": { - POS: VERB, - "Aspect": "Prosp", - "Mood": "Cnd", - "Number": "Sing", - "Person": "three", - "Polarity": "Pos", - "Tense": "Imp", - "VerbForm": "Fin", - "Voice": "Act", - }, - "VERB_Aspect=Prosp|Mood=Cnd|Number=Sing|Person=3|Polarity=Pos|Subcat=Tran|Tense=Pres|VerbForm=Fin|Voice=Act": { - POS: VERB, - "Aspect": "Prosp", - "Mood": "Cnd", - "Number": "Sing", - "Person": "three", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Fin", - "Voice": "Act", - }, - "VERB_Aspect=Prosp|Mood=Imp|Number=Sing|Person=2|Subcat=Intr|VerbForm=Fin|Voice=Mid": { - POS: VERB, - "Aspect": "Prosp", - "Mood": "Imp", - "Number": "Sing", - "Person": "two", - "VerbForm": "Fin", - "Voice": "Mid", - }, - "VERB_Aspect=Prosp|Mood=Imp|Number=Sing|Person=2|Subcat=Tran|VerbForm=Fin|Voice=Act": { - POS: VERB, - "Aspect": "Prosp", - "Mood": "Imp", - 
"Number": "Sing", - "Person": "two", - "VerbForm": "Fin", - "Voice": "Act", - }, - "VERB_Aspect=Prosp|Mood=Sub|Number=Plur|Person=1|Polarity=Pos|Subcat=Intr|Tense=Pres|VerbForm=Fin|Voice=Mid": { - POS: VERB, - "Aspect": "Prosp", - "Mood": "Sub", - "Number": "Plur", - "Person": "one", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Fin", - "Voice": "Mid", - }, - "VERB_Aspect=Prosp|Mood=Sub|Number=Plur|Person=3|Polarity=Neg|Subcat=Intr|Tense=Pres|VerbForm=Fin|Voice=Mid": { - POS: VERB, - "Aspect": "Prosp", - "Mood": "Sub", - "Number": "Plur", - "Person": "three", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Fin", - "Voice": "Mid", - }, - "VERB_Aspect=Prosp|Mood=Sub|Number=Plur|Person=3|Polarity=Pos|Subcat=Tran|Tense=Pres|VerbForm=Fin|Voice=Act": { - POS: VERB, - "Aspect": "Prosp", - "Mood": "Sub", - "Number": "Plur", - "Person": "three", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Fin", - "Voice": "Act", - }, - "VERB_Aspect=Prosp|Mood=Sub|Number=Sing|Person=1|Polarity=Neg|Subcat=Intr|Tense=Pres|VerbForm=Fin|Voice=Mid": { - POS: VERB, - "Aspect": "Prosp", - "Mood": "Sub", - "Number": "Sing", - "Person": "one", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Fin", - "Voice": "Mid", - }, - "VERB_Aspect=Prosp|Mood=Sub|Number=Sing|Person=1|Polarity=Neg|Subcat=Tran|Tense=Pres|VerbForm=Fin|Voice=Act": { - POS: VERB, - "Aspect": "Prosp", - "Mood": "Sub", - "Number": "Sing", - "Person": "one", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Fin", - "Voice": "Act", - }, - "VERB_Aspect=Prosp|Mood=Sub|Number=Sing|Person=1|Polarity=Pos|Subcat=Intr|Tense=Pres|VerbForm=Fin|Voice=Mid": { - POS: VERB, - "Aspect": "Prosp", - "Mood": "Sub", - "Number": "Sing", - "Person": "one", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Fin", - "Voice": "Mid", - }, - "VERB_Aspect=Prosp|Mood=Sub|Number=Sing|Person=1|Polarity=Pos|Subcat=Tran|Tense=Pres|VerbForm=Fin|Voice=Act": { - POS: VERB, - "Aspect": "Prosp", - "Mood": "Sub", - "Number": "Sing", - "Person": "one", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Fin", - "Voice": "Act", - }, - "VERB_Aspect=Prosp|Mood=Sub|Number=Sing|Person=2|Polarity=Pos|Subcat=Tran|Tense=Imp|VerbForm=Fin|Voice=Act": { - POS: VERB, - "Aspect": "Prosp", - "Mood": "Sub", - "Number": "Sing", - "Person": "two", - "Polarity": "Pos", - "Tense": "Imp", - "VerbForm": "Fin", - "Voice": "Act", - }, - "VERB_Aspect=Prosp|Mood=Sub|Number=Sing|Person=2|Polarity=Pos|Subcat=Tran|Tense=Pres|VerbForm=Fin|Voice=Act": { - POS: VERB, - "Aspect": "Prosp", - "Mood": "Sub", - "Number": "Sing", - "Person": "two", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Fin", - "Voice": "Act", - }, - "VERB_Aspect=Prosp|Mood=Sub|Number=Sing|Person=3|Polarity=Pos|Subcat=Intr|Tense=Imp|VerbForm=Fin|Voice=Mid": { - POS: VERB, - "Aspect": "Prosp", - "Mood": "Sub", - "Number": "Sing", - "Person": "three", - "Polarity": "Pos", - "Tense": "Imp", - "VerbForm": "Fin", - "Voice": "Mid", - }, - "VERB_Aspect=Prosp|Mood=Sub|Number=Sing|Person=3|Polarity=Pos|Subcat=Intr|Tense=Pres|VerbForm=Fin|Voice=Mid": { - POS: VERB, - "Aspect": "Prosp", - "Mood": "Sub", - "Number": "Sing", - "Person": "three", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Fin", - "Voice": "Mid", - }, - "VERB_Aspect=Prosp|Mood=Sub|Number=Sing|Person=3|Polarity=Pos|Subcat=Intr|VerbForm=Fin|Voice=Pass": { - POS: VERB, - "Aspect": "Prosp", - "Mood": "Sub", - "Number": "Sing", - "Person": "three", - "Polarity": "Pos", - "VerbForm": "Fin", - "Voice": "Pass", - }, - 
"VERB_Aspect=Prosp|Mood=Sub|Number=Sing|Person=3|Polarity=Pos|Subcat=Tran|Tense=Imp|VerbForm=Fin|Voice=Act": { - POS: VERB, - "Aspect": "Prosp", - "Mood": "Sub", - "Number": "Sing", - "Person": "three", - "Polarity": "Pos", - "Tense": "Imp", - "VerbForm": "Fin", - "Voice": "Act", - }, - "VERB_Aspect=Prosp|Mood=Sub|Number=Sing|Person=3|Polarity=Pos|Subcat=Tran|Tense=Pres|VerbForm=Fin|Voice=Act": { - POS: VERB, - "Aspect": "Prosp", - "Mood": "Sub", - "Number": "Sing", - "Person": "three", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Fin", - "Voice": "Act", - }, - "VERB_Aspect=Prosp|Mood=Sub|Person=1|Polarity=Neg|Subcat=Tran|Tense=Pres|VerbForm=Fin|Voice=Act": { - POS: VERB, - "Aspect": "Prosp", - "Mood": "Sub", - "Person": "one", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Fin", - "Voice": "Act", - }, - "VERB_Aspect=Prosp|Polarity=Pos|Subcat=Intr|VerbForm=Part|Voice=Mid": { - POS: VERB, - "Aspect": "Prosp", - "Polarity": "Pos", - "VerbForm": "Part", - "Voice": "Mid", - }, - "VERB_Aspect=Prosp|Polarity=Pos|Subcat=Tran|VerbForm=Part|Voice=Act": { - POS: VERB, - "Aspect": "Prosp", - "Polarity": "Pos", - "VerbForm": "Part", - "Voice": "Act", - }, - "VERB_Aspect=Prosp|Subcat=Intr|VerbForm=Part|Voice=Mid": { - POS: VERB, - "Aspect": "Prosp", - "VerbForm": "Part", - "Voice": "Mid", - }, - "VERB_Aspect=Prosp|Subcat=Intr|VerbForm=Part|Voice=Pass": { - POS: VERB, - "Aspect": "Prosp", - "VerbForm": "Part", - "Voice": "Pass", - }, - "VERB_Aspect=Prosp|Subcat=Tran|VerbForm=Part|Voice=Act": { - POS: VERB, - "Aspect": "Prosp", - "VerbForm": "Part", - "Voice": "Act", - }, - "VERB_Case=Abl|Definite=Ind|Number=Coll|Polarity=Pos|Subcat=Intr|VerbForm=Gdv|Voice=Mid": { - POS: VERB, - "Case": "Abl", - "Definite": "Ind", - "Polarity": "Pos", - "VerbForm": "Gdv", - "Voice": "Mid", - }, - "VERB_Case=Abl|Definite=Ind|Number=Coll|Polarity=Pos|Subcat=Intr|VerbForm=Gdv|Voice=Pass": { - POS: VERB, - "Case": "Abl", - "Definite": "Ind", - "Polarity": "Pos", - "VerbForm": "Gdv", - "Voice": "Pass", - }, - "VERB_Case=Abl|Definite=Ind|Number=Coll|Polarity=Pos|Subcat=Tran|VerbForm=Gdv|Voice=Act": { - POS: VERB, - "Case": "Abl", - "Definite": "Ind", - "Polarity": "Pos", - "VerbForm": "Gdv", - "Voice": "Act", - }, - "VERB_Case=Dat|Definite=Def|Number=Coll|Polarity=Pos|Subcat=Intr|VerbForm=Gdv|Voice=Mid": { - POS: VERB, - "Case": "Dat", - "Definite": "Def", - "Polarity": "Pos", - "VerbForm": "Gdv", - "Voice": "Mid", - }, - "VERB_Case=Dat|Definite=Ind|Number=Coll|Polarity=Neg|Subcat=Intr|VerbForm=Gdv|Voice=Pass": { - POS: VERB, - "Case": "Dat", - "Definite": "Ind", - "Polarity": "Neg", - "VerbForm": "Gdv", - "Voice": "Pass", - }, - "VERB_Case=Dat|Definite=Ind|Number=Coll|Polarity=Pos|Subcat=Intr|VerbForm=Gdv|Voice=Mid": { - POS: VERB, - "Case": "Dat", - "Definite": "Ind", - "Polarity": "Pos", - "VerbForm": "Gdv", - "Voice": "Mid", - }, - "VERB_Case=Dat|Definite=Ind|Number=Coll|Polarity=Pos|Subcat=Tran|VerbForm=Gdv|Voice=Act": { - POS: VERB, - "Case": "Dat", - "Definite": "Ind", - "Polarity": "Pos", - "VerbForm": "Gdv", - "Voice": "Act", - }, - "VERB_Case=Ins|Definite=Ind|Number=Coll|Polarity=Pos|Subcat=Intr|VerbForm=Gdv|Voice=Mid": { - POS: VERB, - "Case": "Ins", - "Definite": "Ind", - "Polarity": "Pos", - "VerbForm": "Gdv", - "Voice": "Mid", - }, - "VERB_Case=Ins|Definite=Ind|Number=Coll|Polarity=Pos|Subcat=Tran|VerbForm=Gdv|Voice=Act": { - POS: VERB, - "Case": "Ins", - "Definite": "Ind", - "Polarity": "Pos", - "VerbForm": "Gdv", - "Voice": "Act", - }, - 
"VERB_Case=Nom|Definite=Def|Number=Coll|Polarity=Pos|Subcat=Intr|VerbForm=Gdv|Voice=Mid": { - POS: VERB, - "Case": "Nom", - "Definite": "Def", - "Polarity": "Pos", - "VerbForm": "Gdv", - "Voice": "Mid", - }, - "VERB_Case=Nom|Definite=Def|Number=Coll|Polarity=Pos|Subcat=Tran|VerbForm=Gdv|Voice=Act": { - POS: VERB, - "Case": "Nom", - "Definite": "Def", - "Polarity": "Pos", - "VerbForm": "Gdv", - "Voice": "Act", - }, - "VERB_Mood=Imp|Number=Sing|Person=2|Subcat=Intr|VerbForm=Fin|Voice=Mid": { - POS: VERB, - "Mood": "Imp", - "Number": "Sing", - "Person": "two", - "VerbForm": "Fin", - "Voice": "Mid", - }, - "VERB_Polarity=Neg|Subcat=Intr|VerbForm=Inf|Voice=Mid": { - POS: VERB, - "Polarity": "Neg", - "VerbForm": "Inf", - "Voice": "Mid", - }, - "VERB_Polarity=Pos|Style=Coll|Subcat=Tran|VerbForm=Inf|Voice=Act": { - POS: VERB, - "Polarity": "Pos", - "Style": "Coll", - "VerbForm": "Inf", - "Voice": "Act", - }, - "VERB_Polarity=Pos|Style=Vrnc|Subcat=Tran|VerbForm=Inf|Voice=Act": { - POS: VERB, - "Polarity": "Pos", - "Style": "Vrnc", - "VerbForm": "Inf", - "Voice": "Act", - }, - "VERB_Polarity=Pos|Subcat=Intr|VerbForm=Inf|Voice=Mid": { - POS: VERB, - "Polarity": "Pos", - "VerbForm": "Inf", - "Voice": "Mid", - }, - "VERB_Polarity=Pos|Subcat=Intr|VerbForm=Inf|Voice=Pass": { - POS: VERB, - "Polarity": "Pos", - "VerbForm": "Inf", - "Voice": "Pass", - }, - "VERB_Polarity=Pos|Subcat=Tran|Typo=Yes|VerbForm=Inf|Voice=Act": { - POS: VERB, - "Polarity": "Pos", - "Typo": "Yes", - "VerbForm": "Inf", - "Voice": "Act", - }, - "VERB_Polarity=Pos|Subcat=Tran|VerbForm=Inf|Voice=Act": { - POS: VERB, - "Polarity": "Pos", - "VerbForm": "Inf", - "Voice": "Act", - }, - "X_Foreign=Yes": {POS: X, "Foreign": "Yes"}, - "X_Style=Vrnc": {POS: X, "Style": "Vrnc"}, -} diff --git a/spacy/lang/id/__init__.py b/spacy/lang/id/__init__.py index 8e2266a40..87373551c 100644 --- a/spacy/lang/id/__init__.py +++ b/spacy/lang/id/__init__.py @@ -1,30 +1,19 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from .punctuation import TOKENIZER_SUFFIXES, TOKENIZER_PREFIXES, TOKENIZER_INFIXES from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .lex_attrs import LEX_ATTRS from .syntax_iterators import SYNTAX_ITERATORS -from .tag_map import TAG_MAP - -from ..tokenizer_exceptions import BASE_EXCEPTIONS from ...language import Language -from ...attrs import LANG -from ...util import update_exc class IndonesianDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = lambda text: "id" - lex_attr_getters.update(LEX_ATTRS) - tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) - stop_words = STOP_WORDS + tokenizer_exceptions = TOKENIZER_EXCEPTIONS prefixes = TOKENIZER_PREFIXES suffixes = TOKENIZER_SUFFIXES infixes = TOKENIZER_INFIXES syntax_iterators = SYNTAX_ITERATORS - tag_map = TAG_MAP + lex_attr_getters = LEX_ATTRS + stop_words = STOP_WORDS class Indonesian(Language): diff --git a/spacy/lang/id/_tokenizer_exceptions_list.py b/spacy/lang/id/_tokenizer_exceptions_list.py index fec878d5a..a0b35fa1a 100644 --- a/spacy/lang/id/_tokenizer_exceptions_list.py +++ b/spacy/lang/id/_tokenizer_exceptions_list.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - ID_BASE_EXCEPTIONS = set( """ aba-aba diff --git a/spacy/lang/id/examples.py b/spacy/lang/id/examples.py index 7b4a4e513..d35271551 100644 --- a/spacy/lang/id/examples.py +++ b/spacy/lang/id/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from 
__future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/id/lex_attrs.py b/spacy/lang/id/lex_attrs.py index 1d4584ae3..3167f4659 100644 --- a/spacy/lang/id/lex_attrs.py +++ b/spacy/lang/id/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import unicodedata from .punctuation import LIST_CURRENCY diff --git a/spacy/lang/id/punctuation.py b/spacy/lang/id/punctuation.py index e4794d42b..f6c2387d8 100644 --- a/spacy/lang/id/punctuation.py +++ b/spacy/lang/id/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES, TOKENIZER_INFIXES from ..char_classes import ALPHA, merge_chars, split_chars, _currency, _units diff --git a/spacy/lang/id/stop_words.py b/spacy/lang/id/stop_words.py index 0a9f91947..b1bfaea79 100644 --- a/spacy/lang/id/stop_words.py +++ b/spacy/lang/id/stop_words.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - STOP_WORDS = set( """ ada adalah adanya adapun agak agaknya agar akan akankah akhir akhiri akhirnya diff --git a/spacy/lang/id/syntax_iterators.py b/spacy/lang/id/syntax_iterators.py index d6c12e69f..0f29bfe16 100644 --- a/spacy/lang/id/syntax_iterators.py +++ b/spacy/lang/id/syntax_iterators.py @@ -1,29 +1,20 @@ -# coding: utf8 -from __future__ import unicode_literals +from typing import Union, Iterator from ...symbols import NOUN, PROPN, PRON from ...errors import Errors +from ...tokens import Doc, Span -def noun_chunks(doclike): +def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Span]: """ Detect base noun phrases from a dependency parse. Works on both Doc and Span. """ - labels = [ - "nsubj", - "nsubj:pass", - "obj", - "iobj", - "ROOT", - "appos", - "nmod", - "nmod:poss", - ] + # fmt: off + labels = ["nsubj", "nsubj:pass", "obj", "iobj", "ROOT", "appos", "nmod", "nmod:poss"] + # fmt: on doc = doclike.doc # Ensure works on both Doc and Span. 
- - if not doc.is_parsed: + if not doc.has_annotation("DEP"): raise ValueError(Errors.E029) - np_deps = [doc.vocab.strings[label] for label in labels] conj = doc.vocab.strings.add("conj") np_label = doc.vocab.strings.add("NP") diff --git a/spacy/lang/id/tag_map.py b/spacy/lang/id/tag_map.py deleted file mode 100644 index 16391a840..000000000 --- a/spacy/lang/id/tag_map.py +++ /dev/null @@ -1,95 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import POS, PUNCT, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB -from ...symbols import NOUN, PRON, AUX, SCONJ, INTJ, PART, PROPN - - -# POS explanations for indonesian available from https://www.aclweb.org/anthology/Y12-1014 -TAG_MAP = { - "NSD": {POS: NOUN}, - "Z--": {POS: PUNCT}, - "VSA": {POS: VERB}, - "CC-": {POS: NUM}, - "R--": {POS: ADP}, - "D--": {POS: ADV}, - "ASP": {POS: ADJ}, - "S--": {POS: SCONJ}, - "VSP": {POS: VERB}, - "H--": {POS: CCONJ}, - "F--": {POS: X}, - "B--": {POS: DET}, - "CO-": {POS: NUM}, - "G--": {POS: ADV}, - "PS3": {POS: PRON}, - "W--": {POS: ADV}, - "O--": {POS: AUX}, - "PP1": {POS: PRON}, - "ASS": {POS: ADJ}, - "PS1": {POS: PRON}, - "APP": {POS: ADJ}, - "CD-": {POS: NUM}, - "VPA": {POS: VERB}, - "VPP": {POS: VERB}, - "X--": {POS: X}, - "CO-+PS3": {POS: NUM}, - "NSD+PS3": {POS: NOUN}, - "ASP+PS3": {POS: ADJ}, - "M--": {POS: AUX}, - "VSA+PS3": {POS: VERB}, - "R--+PS3": {POS: ADP}, - "W--+T--": {POS: ADV}, - "PS2": {POS: PRON}, - "NSD+PS1": {POS: NOUN}, - "PP3": {POS: PRON}, - "VSA+T--": {POS: VERB}, - "D--+T--": {POS: ADV}, - "VSP+PS3": {POS: VERB}, - "F--+PS3": {POS: X}, - "M--+T--": {POS: AUX}, - "F--+T--": {POS: X}, - "PUNCT": {POS: PUNCT}, - "PROPN": {POS: PROPN}, - "I--": {POS: INTJ}, - "S--+PS3": {POS: SCONJ}, - "ASP+T--": {POS: ADJ}, - "CC-+PS3": {POS: NUM}, - "NSD+PS2": {POS: NOUN}, - "B--+T--": {POS: DET}, - "H--+T--": {POS: CCONJ}, - "VSA+PS2": {POS: VERB}, - "NSF": {POS: NOUN}, - "PS1+VSA": {POS: PRON}, - "NPD": {POS: NOUN}, - "PP2": {POS: PRON}, - "VSA+PS1": {POS: VERB}, - "T--": {POS: PART}, - "NSM": {POS: NOUN}, - "NUM": {POS: NUM}, - "ASP+PS2": {POS: ADJ}, - "G--+T--": {POS: PART}, - "D--+PS3": {POS: ADV}, - "R--+PS2": {POS: ADP}, - "NSM+PS3": {POS: NOUN}, - "VSP+T--": {POS: VERB}, - "M--+PS3": {POS: AUX}, - "ASS+PS3": {POS: ADJ}, - "G--+PS3": {POS: PART}, - "F--+PS1": {POS: X}, - "NSD+T--": {POS: NOUN}, - "PP1+T--": {POS: PRON}, - "B--+PS3": {POS: DET}, - "NOUN": {POS: NOUN}, - "NPD+PS3": {POS: NOUN}, - "R--+PS1": {POS: ADP}, - "F--+PS2": {POS: X}, - "CD-+PS3": {POS: NUM}, - "PS1+VSA+T--": {POS: VERB}, - "PS2+VSA": {POS: VERB}, - "VERB": {POS: VERB}, - "CC-+T--": {POS: NUM}, - "NPD+PS2": {POS: NOUN}, - "D--+PS2": {POS: ADV}, - "PP3+T--": {POS: PRON}, - "X": {POS: X}, -} diff --git a/spacy/lang/id/tokenizer_exceptions.py b/spacy/lang/id/tokenizer_exceptions.py index 86fe611bf..ff77ede9f 100644 --- a/spacy/lang/id/tokenizer_exceptions.py +++ b/spacy/lang/id/tokenizer_exceptions.py @@ -1,8 +1,8 @@ -# coding: utf8 -from __future__ import unicode_literals - +from ..tokenizer_exceptions import BASE_EXCEPTIONS from ._tokenizer_exceptions_list import ID_BASE_EXCEPTIONS -from ...symbols import ORTH, LEMMA, NORM +from ...symbols import ORTH, NORM +from ...util import update_exc + # Daftar singkatan dan Akronim dari: # https://id.wiktionary.org/wiki/Wiktionary:Daftar_singkatan_dan_akronim_bahasa_Indonesia#A @@ -11,53 +11,47 @@ _exc = {} for orth in ID_BASE_EXCEPTIONS: _exc[orth] = [{ORTH: orth}] - orth_title = orth.title() _exc[orth_title] = [{ORTH: orth_title}] - orth_caps = orth.upper() 
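# [Editor's sketch, not part of the diff] The noun_chunks hunk above replaces the
# removed Doc.is_parsed flag with the v3 Doc.has_annotation("DEP") check. A minimal
# illustration of the caller-side effect, assuming only a blank Indonesian pipeline
# (no dependency parser), so the guard is False:
import spacy

nlp = spacy.blank("id")
doc = nlp("Ini sebuah contoh.")
if doc.has_annotation("DEP"):
    print(list(doc.noun_chunks))
else:
    print("no dependency parse; doc.noun_chunks would raise E029")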
_exc[orth_caps] = [{ORTH: orth_caps}] - orth_lower = orth.lower() _exc[orth_lower] = [{ORTH: orth_lower}] - orth_first_upper = orth[0].upper() + orth[1:] _exc[orth_first_upper] = [{ORTH: orth_first_upper}] - if "-" in orth: orth_title = "-".join([part.title() for part in orth.split("-")]) _exc[orth_title] = [{ORTH: orth_title}] - orth_caps = "-".join([part.upper() for part in orth.split("-")]) _exc[orth_caps] = [{ORTH: orth_caps}] for exc_data in [ - {ORTH: "Jan.", LEMMA: "Januari", NORM: "Januari"}, - {ORTH: "Feb.", LEMMA: "Februari", NORM: "Februari"}, - {ORTH: "Mar.", LEMMA: "Maret", NORM: "Maret"}, - {ORTH: "Apr.", LEMMA: "April", NORM: "April"}, - {ORTH: "Jun.", LEMMA: "Juni", NORM: "Juni"}, - {ORTH: "Jul.", LEMMA: "Juli", NORM: "Juli"}, - {ORTH: "Agu.", LEMMA: "Agustus", NORM: "Agustus"}, - {ORTH: "Ags.", LEMMA: "Agustus", NORM: "Agustus"}, - {ORTH: "Sep.", LEMMA: "September", NORM: "September"}, - {ORTH: "Okt.", LEMMA: "Oktober", NORM: "Oktober"}, - {ORTH: "Nov.", LEMMA: "November", NORM: "November"}, - {ORTH: "Des.", LEMMA: "Desember", NORM: "Desember"}, + {ORTH: "Jan.", NORM: "Januari"}, + {ORTH: "Feb.", NORM: "Februari"}, + {ORTH: "Mar.", NORM: "Maret"}, + {ORTH: "Apr.", NORM: "April"}, + {ORTH: "Jun.", NORM: "Juni"}, + {ORTH: "Jul.", NORM: "Juli"}, + {ORTH: "Agu.", NORM: "Agustus"}, + {ORTH: "Ags.", NORM: "Agustus"}, + {ORTH: "Sep.", NORM: "September"}, + {ORTH: "Okt.", NORM: "Oktober"}, + {ORTH: "Nov.", NORM: "November"}, + {ORTH: "Des.", NORM: "Desember"}, ]: _exc[exc_data[ORTH]] = [exc_data] _other_exc = { - "do'a": [{ORTH: "do'a", LEMMA: "doa", NORM: "doa"}], - "jum'at": [{ORTH: "jum'at", LEMMA: "Jumat", NORM: "Jumat"}], - "Jum'at": [{ORTH: "Jum'at", LEMMA: "Jumat", NORM: "Jumat"}], - "la'nat": [{ORTH: "la'nat", LEMMA: "laknat", NORM: "laknat"}], - "ma'af": [{ORTH: "ma'af", LEMMA: "maaf", NORM: "maaf"}], - "mu'jizat": [{ORTH: "mu'jizat", LEMMA: "mukjizat", NORM: "mukjizat"}], - "Mu'jizat": [{ORTH: "Mu'jizat", LEMMA: "mukjizat", NORM: "mukjizat"}], - "ni'mat": [{ORTH: "ni'mat", LEMMA: "nikmat", NORM: "nikmat"}], - "raka'at": [{ORTH: "raka'at", LEMMA: "rakaat", NORM: "rakaat"}], - "ta'at": [{ORTH: "ta'at", LEMMA: "taat", NORM: "taat"}], + "do'a": [{ORTH: "do'a", NORM: "doa"}], + "jum'at": [{ORTH: "jum'at", NORM: "Jumat"}], + "Jum'at": [{ORTH: "Jum'at", NORM: "Jumat"}], + "la'nat": [{ORTH: "la'nat", NORM: "laknat"}], + "ma'af": [{ORTH: "ma'af", NORM: "maaf"}], + "mu'jizat": [{ORTH: "mu'jizat", NORM: "mukjizat"}], + "Mu'jizat": [{ORTH: "Mu'jizat", NORM: "mukjizat"}], + "ni'mat": [{ORTH: "ni'mat", NORM: "nikmat"}], + "raka'at": [{ORTH: "raka'at", NORM: "rakaat"}], + "ta'at": [{ORTH: "ta'at", NORM: "taat"}], } _exc.update(_other_exc) @@ -224,4 +218,4 @@ for orth in [ ]: _exc[orth] = [{ORTH: orth}] -TOKENIZER_EXCEPTIONS = _exc +TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc) diff --git a/spacy/lang/is/__init__.py b/spacy/lang/is/__init__.py index 18e41432d..be5de5981 100644 --- a/spacy/lang/is/__init__.py +++ b/spacy/lang/is/__init__.py @@ -1,14 +1,8 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from ...language import Language -from ...attrs import LANG class IcelandicDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = lambda text: "is" stop_words = STOP_WORDS diff --git a/spacy/lang/is/stop_words.py b/spacy/lang/is/stop_words.py index e4ae0498b..917fb6df4 100644 --- a/spacy/lang/is/stop_words.py +++ b/spacy/lang/is/stop_words.py @@ -1,7 +1,3 @@ -# coding: 
utf8 -from __future__ import unicode_literals - - # Source: https://github.com/Xangis/extra-stopwords STOP_WORDS = set( diff --git a/spacy/lang/it/__init__.py b/spacy/lang/it/__init__.py index 06d146748..25cbaa651 100644 --- a/spacy/lang/it/__init__.py +++ b/spacy/lang/it/__init__.py @@ -1,27 +1,12 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS -from .tag_map import TAG_MAP from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES - -from ..tokenizer_exceptions import BASE_EXCEPTIONS -from ..norm_exceptions import BASE_NORMS from ...language import Language -from ...attrs import LANG, NORM -from ...util import update_exc, add_lookups class ItalianDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = lambda text: "it" - lex_attr_getters[NORM] = add_lookups( - Language.Defaults.lex_attr_getters[NORM], BASE_NORMS - ) - tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) + tokenizer_exceptions = TOKENIZER_EXCEPTIONS stop_words = STOP_WORDS - tag_map = TAG_MAP prefixes = TOKENIZER_PREFIXES infixes = TOKENIZER_INFIXES diff --git a/spacy/lang/it/examples.py b/spacy/lang/it/examples.py index af66b7eca..506721276 100644 --- a/spacy/lang/it/examples.py +++ b/spacy/lang/it/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/it/punctuation.py b/spacy/lang/it/punctuation.py index 1d641f144..f01ab4f0d 100644 --- a/spacy/lang/it/punctuation.py +++ b/spacy/lang/it/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..punctuation import TOKENIZER_PREFIXES as BASE_TOKENIZER_PREFIXES from ..char_classes import LIST_ELLIPSES, LIST_ICONS from ..char_classes import ALPHA, HYPHENS, CONCAT_QUOTES diff --git a/spacy/lang/it/stop_words.py b/spacy/lang/it/stop_words.py index 84233d381..e97613912 100644 --- a/spacy/lang/it/stop_words.py +++ b/spacy/lang/it/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ a abbastanza abbia abbiamo abbiano abbiate accidenti ad adesso affinche agl diff --git a/spacy/lang/it/tag_map.py b/spacy/lang/it/tag_map.py deleted file mode 100644 index 798c45d80..000000000 --- a/spacy/lang/it/tag_map.py +++ /dev/null @@ -1,323 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import POS, PUNCT, SYM, ADJ, NUM, DET, ADV, ADP, X, VERB -from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, SCONJ, AUX, CONJ - - -TAG_MAP = { - "AP__Gender=Fem|Number=Plur|Poss=Yes|PronType=Prs": {POS: DET}, - "AP__Gender=Fem|Number=Sing|Poss=Yes|PronType=Prs": {POS: DET}, - "AP__Gender=Masc|Number=Plur|Poss=Yes|PronType=Prs": {POS: DET}, - "AP__Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs": {POS: DET}, - "AP__Gender=Masc|Poss=Yes|PronType=Prs": {POS: DET}, - "AP__Number=Sing|Poss=Yes|PronType=Prs": {POS: DET}, - "AP__Poss=Yes|PronType=Prs": {POS: DET}, - "A__Degree=Abs|Gender=Fem|Number=Plur": {POS: ADJ}, - "A__Degree=Abs|Gender=Fem|Number=Sing": {POS: ADJ}, - "A__Degree=Abs|Gender=Masc|Number=Plur": {POS: ADJ}, - "A__Degree=Abs|Gender=Masc|Number=Sing": {POS: ADJ}, - "A__Degree=Cmp": {POS: ADJ}, - "A__Degree=Cmp|Number=Plur": {POS: ADJ}, - "A__Degree=Cmp|Number=Sing": {POS: ADJ}, - "A__Gender=Fem|Number=Plur": {POS: ADJ}, - "A__Gender=Fem|Number=Sing": {POS: 
ADJ}, - "A__Gender=Fem|Number=Sing|Poss=Yes|PronType=Prs": {POS: ADJ}, - "A__Gender=Masc": {POS: ADJ}, - "A__Gender=Masc|Number=Plur": {POS: ADJ}, - "A__Gender=Masc|Number=Sing": {POS: ADJ}, - "A__Number=Plur": {POS: ADJ}, - "A__Number=Sing": {POS: ADJ}, - "A___": {POS: ADJ}, - "BN__PronType=Neg": {POS: ADV}, - "B__Degree=Abs": {POS: ADV}, - "B__Degree=Abs|Gender=Masc|Number=Sing": {POS: ADV}, - "B___": {POS: ADV}, - "CC___": {POS: CONJ}, - "CS___": {POS: SCONJ}, - "DD__Gender=Fem|Number=Plur|PronType=Dem": {POS: DET}, - "DD__Gender=Fem|Number=Sing|PronType=Dem": {POS: DET}, - "DD__Gender=Masc|Number=Plur|PronType=Dem": {POS: DET}, - "DD__Gender=Masc|Number=Sing|PronType=Dem": {POS: DET}, - "DD__Gender=Masc|PronType=Dem": {POS: DET}, - "DD__Number=Plur|PronType=Dem": {POS: DET}, - "DD__Number=Sing|PronType=Dem": {POS: DET}, - "DE__PronType=Exc": {POS: DET}, - "DI__Definite=Def|Gender=Fem|Number=Plur|PronType=Art": {POS: DET}, - "DI__Gender=Fem|Number=Plur": {POS: DET}, - "DI__Gender=Fem|Number=Plur|PronType=Ind": {POS: DET}, - "DI__Gender=Fem|Number=Sing|PronType=Ind": {POS: DET}, - "DI__Gender=Masc|Number=Plur": {POS: DET}, - "DI__Gender=Masc|Number=Plur|PronType=Ind": {POS: DET}, - "DI__Gender=Masc|Number=Sing|PronType=Ind": {POS: DET}, - "DI__Number=Sing|PronType=Art": {POS: DET}, - "DI__Number=Sing|PronType=Ind": {POS: DET}, - "DI__PronType=Ind": {POS: DET}, - "DQ__Gender=Fem|Number=Plur|PronType=Int": {POS: DET}, - "DQ__Gender=Fem|Number=Sing|PronType=Int": {POS: DET}, - "DQ__Gender=Masc|Number=Plur|PronType=Int": {POS: DET}, - "DQ__Gender=Masc|Number=Sing|PronType=Int": {POS: DET}, - "DQ__Number=Plur|PronType=Int": {POS: DET}, - "DQ__Number=Sing|PronType=Int": {POS: DET}, - "DQ__PronType=Int": {POS: DET}, - "DQ___": {POS: DET}, - "DR__Number=Plur|PronType=Rel": {POS: DET}, - "DR__PronType=Rel": {POS: DET}, - "E__Gender=Masc|Number=Sing": {POS: ADP}, - "E___": {POS: ADP}, - "FB___": {POS: PUNCT}, - "FC___": {POS: PUNCT}, - "FF___": {POS: PUNCT}, - "FS___": {POS: PUNCT}, - "I__Polarity=Neg": {POS: INTJ}, - "I__Polarity=Pos": {POS: INTJ}, - "I___": {POS: INTJ}, - "NO__Gender=Fem|Number=Plur|NumType=Ord": {POS: ADJ}, - "NO__Gender=Fem|Number=Sing|NumType=Ord": {POS: ADJ}, - "NO__Gender=Masc|Number=Plur": {POS: ADJ}, - "NO__Gender=Masc|Number=Plur|NumType=Ord": {POS: ADJ}, - "NO__Gender=Masc|Number=Sing|NumType=Ord": {POS: ADJ}, - "NO__NumType=Ord": {POS: ADJ}, - "NO__Number=Sing|NumType=Ord": {POS: ADJ}, - "NO___": {POS: ADJ}, - "N__Gender=Masc|Number=Sing": {POS: NUM}, - "N__NumType=Card": {POS: NUM}, - "N__NumType=Range": {POS: NUM}, - "N___": {POS: NUM}, - "PART___": {POS: PART}, - "PC__Clitic=Yes|Definite=Def|Gender=Fem|Number=Plur|PronType=Art": {POS: PRON}, - "PC__Clitic=Yes|Gender=Fem|Number=Plur|Person=3|PronType=Prs": {POS: PRON}, - "PC__Clitic=Yes|Gender=Fem|Number=Plur|PronType=Prs": {POS: PRON}, - "PC__Clitic=Yes|Gender=Fem|Number=Sing|Person=3|PronType=Prs": {POS: PRON}, - "PC__Clitic=Yes|Gender=Fem|Person=3|PronType=Prs": {POS: PRON}, - "PC__Clitic=Yes|Gender=Masc|Number=Plur|Person=3|PronType=Prs": {POS: PRON}, - "PC__Clitic=Yes|Gender=Masc|Number=Sing|Person=3|PronType=Prs": {POS: PRON}, - "PC__Clitic=Yes|Gender=Masc|Number=Sing|PronType=Prs": {POS: PRON}, - "PC__Clitic=Yes|Number=Plur|Person=1|PronType=Prs": {POS: PRON}, - "PC__Clitic=Yes|Number=Plur|Person=2|PronType=Prs": {POS: PRON}, - "PC__Clitic=Yes|Number=Plur|Person=3|PronType=Prs": {POS: PRON}, - "PC__Clitic=Yes|Number=Plur|PronType=Prs": {POS: PRON}, - "PC__Clitic=Yes|Number=Sing|Person=1|PronType=Prs": {POS: 
PRON}, - "PC__Clitic=Yes|Number=Sing|Person=2|PronType=Prs": {POS: PRON}, - "PC__Clitic=Yes|Number=Sing|Person=3|PronType=Prs": {POS: PRON}, - "PC__Clitic=Yes|Person=3|PronType=Prs": {POS: PRON}, - "PC__Clitic=Yes|PronType=Prs": {POS: PRON}, - "PD__Gender=Fem|Number=Plur|PronType=Dem": {POS: PRON}, - "PD__Gender=Fem|Number=Sing|PronType=Dem": {POS: PRON}, - "PD__Gender=Masc|Number=Plur|PronType=Dem": {POS: PRON}, - "PD__Gender=Masc|Number=Sing|PronType=Dem": {POS: PRON}, - "PD__Number=Plur|PronType=Dem": {POS: PRON}, - "PD__Number=Sing|PronType=Dem": {POS: PRON}, - "PD__PronType=Dem": {POS: PRON}, - "PE__Gender=Fem|Number=Plur|Person=3|PronType=Prs": {POS: PRON}, - "PE__Gender=Fem|Number=Sing|Person=3|PronType=Prs": {POS: PRON}, - "PE__Gender=Masc|Number=Plur|Person=3|PronType=Prs": {POS: PRON}, - "PE__Gender=Masc|Number=Sing|Person=3|PronType=Prs": {POS: PRON}, - "PE__Number=Plur|Person=1|PronType=Prs": {POS: PRON}, - "PE__Number=Plur|Person=2|PronType=Prs": {POS: PRON}, - "PE__Number=Plur|Person=3|PronType=Prs": {POS: PRON}, - "PE__Number=Sing|Person=1|PronType=Prs": {POS: PRON}, - "PE__Number=Sing|Person=2|PronType=Prs": {POS: PRON}, - "PE__Number=Sing|Person=3|PronType=Prs": {POS: PRON}, - "PE__Person=3|PronType=Prs": {POS: PRON}, - "PE__PronType=Prs": {POS: PRON}, - "PI__Gender=Fem|Number=Plur|PronType=Ind": {POS: PRON}, - "PI__Gender=Fem|Number=Sing|PronType=Ind": {POS: PRON}, - "PI__Gender=Masc|Number=Plur|PronType=Ind": {POS: PRON}, - "PI__Gender=Masc|Number=Sing": {POS: PRON}, - "PI__Gender=Masc|Number=Sing|PronType=Ind": {POS: PRON}, - "PI__Number=Plur|PronType=Ind": {POS: PRON}, - "PI__Number=Sing|PronType=Ind": {POS: PRON}, - "PI__PronType=Ind": {POS: PRON}, - "PP__Gender=Fem|Number=Sing|Poss=Yes|PronType=Prs": {POS: PRON}, - "PP__Gender=Masc|Number=Plur|Poss=Yes|PronType=Prs": {POS: PRON}, - "PP__Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs": {POS: PRON}, - "PP__Number=Plur|Poss=Yes|PronType=Prs": {POS: PRON}, - "PP__Number=Sing|Poss=Yes|PronType=Prs": {POS: PRON}, - "PQ__Gender=Fem|Number=Plur|PronType=Int": {POS: PRON}, - "PQ__Gender=Fem|Number=Sing|PronType=Int": {POS: PRON}, - "PQ__Gender=Masc|Number=Plur|PronType=Int": {POS: PRON}, - "PQ__Gender=Masc|Number=Sing|PronType=Int": {POS: PRON}, - "PQ__Number=Plur|PronType=Int": {POS: PRON}, - "PQ__Number=Sing|PronType=Int": {POS: PRON}, - "PQ__PronType=Int": {POS: PRON}, - "PR__Gender=Masc|Number=Plur|PronType=Rel": {POS: PRON}, - "PR__Gender=Masc|Number=Sing|PronType=Rel": {POS: PRON}, - "PR__Gender=Masc|PronType=Rel": {POS: PRON}, - "PR__Number=Plur|PronType=Rel": {POS: PRON}, - "PR__Number=Sing|PronType=Rel": {POS: PRON}, - "PR__Person=3|PronType=Rel": {POS: PRON}, - "PR__PronType=Rel": {POS: PRON}, - "RD__Definite=Def": {POS: DET}, - "RD__Definite=Def|Gender=Fem": {POS: DET}, - "RD__Definite=Def|Gender=Fem|Number=Plur|PronType=Art": {POS: DET}, - "RD__Definite=Def|Gender=Fem|Number=Sing|PronType=Art": {POS: DET}, - "RD__Definite=Def|Gender=Masc|Number=Plur|PronType=Art": {POS: DET}, - "RD__Definite=Def|Gender=Masc|Number=Sing|PronType=Art": {POS: DET}, - "RD__Definite=Def|Number=Plur|PronType=Art": {POS: DET}, - "RD__Definite=Def|Number=Sing|PronType=Art": {POS: DET}, - "RD__Definite=Def|PronType=Art": {POS: DET}, - "RD__Gender=Fem|Number=Sing": {POS: DET}, - "RD__Gender=Masc|Number=Sing": {POS: DET}, - "RD__Number=Sing": {POS: DET}, - "RD__Number=Sing|PronType=Art": {POS: DET}, - "RI__Definite=Ind|Gender=Fem|Number=Plur|PronType=Art": {POS: DET}, - "RI__Definite=Ind|Gender=Fem|Number=Sing|PronType=Art": {POS: DET}, - 
"RI__Definite=Ind|Gender=Masc|Number=Plur|PronType=Art": {POS: DET}, - "RI__Definite=Ind|Gender=Masc|Number=Sing|PronType=Art": {POS: DET}, - "RI__Definite=Ind|Number=Sing|PronType=Art": {POS: DET}, - "RI__Definite=Ind|PronType=Art": {POS: DET}, - "SP__Gender=Fem|Number=Plur": {POS: PROPN}, - "SP__NumType=Card": {POS: PROPN}, - "SP___": {POS: PROPN}, - "SW__Foreign=Yes": {POS: X}, - "SW__Foreign=Yes|Gender=Masc": {POS: X}, - "SW__Foreign=Yes|Number=Sing": {POS: X}, - "SYM___": {POS: SYM}, - "S__Gender=Fem": {POS: NOUN}, - "S__Gender=Fem|Number=Plur": {POS: NOUN}, - "S__Gender=Fem|Number=Sing": {POS: NOUN}, - "S__Gender=Masc": {POS: NOUN}, - "S__Gender=Masc|Number=Plur": {POS: NOUN}, - "S__Gender=Masc|Number=Sing": {POS: NOUN}, - "S__Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part": {POS: NOUN}, - "S__Number=Plur": {POS: NOUN}, - "S__Number=Sing": {POS: NOUN}, - "S___": {POS: NOUN}, - "Sw___": {POS: X}, - "T__Gender=Fem|Number=Plur|PronType=Tot": {POS: DET}, - "T__Gender=Fem|Number=Sing": {POS: DET}, - "T__Gender=Fem|Number=Sing|PronType=Tot": {POS: DET}, - "T__Gender=Masc|Number=Plur|PronType=Tot": {POS: DET}, - "T__Gender=Masc|Number=Sing|PronType=Tot": {POS: DET}, - "T__Number=Plur|PronType=Tot": {POS: DET}, - "T__PronType=Tot": {POS: DET}, - "VA__Gender=Fem|Number=Plur|Tense=Past|VerbForm=Part": {POS: AUX}, - "VA__Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part": {POS: AUX}, - "VA__Gender=Masc|Number=Plur|Tense=Past|VerbForm=Part": {POS: AUX}, - "VA__Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part": {POS: AUX}, - "VA__Mood=Cnd|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "VA__Mood=Cnd|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "VA__Mood=Cnd|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "VA__Mood=Cnd|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "VA__Mood=Ind|Number=Plur|Person=1|Tense=Fut|VerbForm=Fin": {POS: AUX}, - "VA__Mood=Ind|Number=Plur|Person=1|Tense=Imp|VerbForm=Fin": {POS: AUX}, - "VA__Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "VA__Mood=Ind|Number=Plur|Person=2|Tense=Fut|VerbForm=Fin": {POS: AUX}, - "VA__Mood=Ind|Number=Plur|Person=2|Tense=Imp|VerbForm=Fin": {POS: AUX}, - "VA__Mood=Ind|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "VA__Mood=Ind|Number=Plur|Person=3|Tense=Fut|VerbForm=Fin": {POS: AUX}, - "VA__Mood=Ind|Number=Plur|Person=3|Tense=Imp|VerbForm=Fin": {POS: AUX}, - "VA__Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin": {POS: AUX}, - "VA__Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "VA__Mood=Ind|Number=Sing|Person=1|Tense=Fut|VerbForm=Fin": {POS: AUX}, - "VA__Mood=Ind|Number=Sing|Person=1|Tense=Imp|VerbForm=Fin": {POS: AUX}, - "VA__Mood=Ind|Number=Sing|Person=1|Tense=Past|VerbForm=Fin": {POS: AUX}, - "VA__Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "VA__Mood=Ind|Number=Sing|Person=2|Tense=Fut|VerbForm=Fin": {POS: AUX}, - "VA__Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "VA__Mood=Ind|Number=Sing|Person=3|Tense=Fut|VerbForm=Fin": {POS: AUX}, - "VA__Mood=Ind|Number=Sing|Person=3|Tense=Imp|VerbForm=Fin": {POS: AUX}, - "VA__Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin": {POS: AUX}, - "VA__Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "VA__Mood=Sub|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "VA__Mood=Sub|Number=Plur|Person=3|Tense=Imp|VerbForm=Fin": {POS: AUX}, - "VA__Mood=Sub|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin": {POS: AUX}, 
- "VA__Mood=Sub|Number=Sing|Person=1|Tense=Imp|VerbForm=Fin": {POS: AUX}, - "VA__Mood=Sub|Number=Sing|Person=3|Tense=Imp|VerbForm=Fin": {POS: AUX}, - "VA__Mood=Sub|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "VA__VerbForm=Ger": {POS: AUX}, - "VA__VerbForm=Inf": {POS: AUX}, - "VM__Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part": {POS: AUX}, - "VM__Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part": {POS: AUX}, - "VM__Mood=Cnd|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "VM__Mood=Cnd|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "VM__Mood=Cnd|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "VM__Mood=Cnd|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "VM__Mood=Cnd|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "VM__Mood=Cnd|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "VM__Mood=Imp|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "VM__Mood=Imp|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "VM__Mood=Ind|Number=Plur|Person=1|Tense=Fut|VerbForm=Fin": {POS: AUX}, - "VM__Mood=Ind|Number=Plur|Person=1|Tense=Imp|VerbForm=Fin": {POS: AUX}, - "VM__Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "VM__Mood=Ind|Number=Plur|Person=2|Tense=Fut|VerbForm=Fin": {POS: AUX}, - "VM__Mood=Ind|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "VM__Mood=Ind|Number=Plur|Person=3|Tense=Fut|VerbForm=Fin": {POS: AUX}, - "VM__Mood=Ind|Number=Plur|Person=3|Tense=Imp|VerbForm=Fin": {POS: AUX}, - "VM__Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin": {POS: AUX}, - "VM__Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "VM__Mood=Ind|Number=Sing|Person=1|Tense=Imp|VerbForm=Fin": {POS: AUX}, - "VM__Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "VM__Mood=Ind|Number=Sing|Person=2|Tense=Fut|VerbForm=Fin": {POS: AUX}, - "VM__Mood=Ind|Number=Sing|Person=2|Tense=Imp|VerbForm=Fin": {POS: AUX}, - "VM__Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "VM__Mood=Ind|Number=Sing|Person=3|Tense=Fut|VerbForm=Fin": {POS: AUX}, - "VM__Mood=Ind|Number=Sing|Person=3|Tense=Imp|VerbForm=Fin": {POS: AUX}, - "VM__Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin": {POS: AUX}, - "VM__Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "VM__Mood=Sub|Number=Plur|Person=1|Tense=Imp|VerbForm=Fin": {POS: AUX}, - "VM__Mood=Sub|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "VM__Mood=Sub|Number=Plur|Person=3|Tense=Imp|VerbForm=Fin": {POS: AUX}, - "VM__Mood=Sub|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "VM__Mood=Sub|Number=Sing|Person=3|Tense=Imp|VerbForm=Fin": {POS: AUX}, - "VM__Mood=Sub|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin": {POS: AUX}, - "VM__VerbForm=Ger": {POS: AUX}, - "VM__VerbForm=Inf": {POS: AUX}, - "V__Gender=Fem|Number=Plur|Tense=Past|VerbForm=Part": {POS: VERB}, - "V__Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part": {POS: VERB}, - "V__Gender=Masc|Number=Plur|Tense=Past|VerbForm=Fin": {POS: VERB}, - "V__Gender=Masc|Number=Plur|Tense=Past|VerbForm=Part": {POS: VERB}, - "V__Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part": {POS: VERB}, - "V__Mood=Cnd|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "V__Mood=Cnd|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "V__Mood=Cnd|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "V__Mood=Cnd|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin": {POS: VERB}, - 
"V__Mood=Cnd|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "V__Mood=Cnd|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "V__Mood=Imp|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "V__Mood=Imp|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "V__Mood=Imp|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "V__Mood=Imp|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "V__Mood=Ind|Number=Plur|Person=1|Tense=Fut|VerbForm=Fin": {POS: VERB}, - "V__Mood=Ind|Number=Plur|Person=1|Tense=Imp|VerbForm=Fin": {POS: VERB}, - "V__Mood=Ind|Number=Plur|Person=1|Tense=Past|VerbForm=Fin": {POS: VERB}, - "V__Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "V__Mood=Ind|Number=Plur|Person=2|Tense=Fut|VerbForm=Fin": {POS: VERB}, - "V__Mood=Ind|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "V__Mood=Ind|Number=Plur|Person=3|Tense=Fut|VerbForm=Fin": {POS: VERB}, - "V__Mood=Ind|Number=Plur|Person=3|Tense=Imp|VerbForm=Fin": {POS: VERB}, - "V__Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin": {POS: VERB}, - "V__Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "V__Mood=Ind|Number=Sing|Person=1|Tense=Fut|VerbForm=Fin": {POS: VERB}, - "V__Mood=Ind|Number=Sing|Person=1|Tense=Imp|VerbForm=Fin": {POS: VERB}, - "V__Mood=Ind|Number=Sing|Person=1|Tense=Past|VerbForm=Fin": {POS: VERB}, - "V__Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "V__Mood=Ind|Number=Sing|Person=2|Tense=Fut|VerbForm=Fin": {POS: VERB}, - "V__Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "V__Mood=Ind|Number=Sing|Person=3|Tense=Fut|VerbForm=Fin": {POS: VERB}, - "V__Mood=Ind|Number=Sing|Person=3|Tense=Imp|VerbForm=Fin": {POS: VERB}, - "V__Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin": {POS: VERB}, - "V__Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "V__Mood=Ind|Person=3|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "V__Mood=Ind|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "V__Mood=Sub|Number=Plur|Person=1|Tense=Imp|VerbForm=Fin": {POS: VERB}, - "V__Mood=Sub|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "V__Mood=Sub|Number=Plur|Person=2|Tense=Imp|VerbForm=Fin": {POS: VERB}, - "V__Mood=Sub|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "V__Mood=Sub|Number=Plur|Person=3|Tense=Imp|VerbForm=Fin": {POS: VERB}, - "V__Mood=Sub|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "V__Mood=Sub|Number=Sing|Person=1|Tense=Imp|VerbForm=Fin": {POS: VERB}, - "V__Mood=Sub|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "V__Mood=Sub|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "V__Mood=Sub|Number=Sing|Person=3|Tense=Imp|VerbForm=Fin": {POS: VERB}, - "V__Mood=Sub|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin": {POS: VERB}, - "V__Mood=Sub|Number=Sing|Person=3|VerbForm=Fin": {POS: VERB}, - "V__Number=Plur|Tense=Pres|VerbForm=Part": {POS: VERB}, - "V__Number=Sing|Tense=Pres|VerbForm=Part": {POS: VERB}, - "V__Tense=Past|VerbForm=Part": {POS: VERB}, - "V__VerbForm=Ger": {POS: VERB}, - "V__VerbForm=Inf": {POS: VERB}, - "X___": {POS: X}, - "_SP": {POS: SPACE}, -} diff --git a/spacy/lang/it/tokenizer_exceptions.py b/spacy/lang/it/tokenizer_exceptions.py index 70519ba6a..0c9968bc6 100644 --- a/spacy/lang/it/tokenizer_exceptions.py +++ b/spacy/lang/it/tokenizer_exceptions.py @@ -1,6 +1,7 @@ -# coding: utf8 -from __future__ import unicode_literals -from ...symbols import ORTH, LEMMA +from 
..tokenizer_exceptions import BASE_EXCEPTIONS +from ...symbols import ORTH +from ...util import update_exc + _exc = { "all'art.": [{ORTH: "all'"}, {ORTH: "art."}], @@ -9,7 +10,7 @@ _exc = { "L'art.": [{ORTH: "L'"}, {ORTH: "art."}], "l'art.": [{ORTH: "l'"}, {ORTH: "art."}], "nell'art.": [{ORTH: "nell'"}, {ORTH: "art."}], - "po'": [{ORTH: "po'", LEMMA: "poco"}], + "po'": [{ORTH: "po'"}], "sett..": [{ORTH: "sett."}, {ORTH: "."}], } @@ -54,4 +55,4 @@ for orth in [ ]: _exc[orth] = [{ORTH: orth}] -TOKENIZER_EXCEPTIONS = _exc +TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc) diff --git a/spacy/lang/ja/__init__.py b/spacy/lang/ja/__init__.py index 80cb7a837..4e6bf9d3c 100644 --- a/spacy/lang/ja/__init__.py +++ b/spacy/lang/ja/__init__.py @@ -1,26 +1,185 @@ -# encoding: utf8 -from __future__ import unicode_literals, print_function - +from typing import Optional, Union, Dict, Any +from pathlib import Path import srsly -from collections import namedtuple, OrderedDict +from collections import namedtuple from .stop_words import STOP_WORDS from .syntax_iterators import SYNTAX_ITERATORS from .tag_map import TAG_MAP from .tag_orth_map import TAG_ORTH_MAP from .tag_bigram_map import TAG_BIGRAM_MAP -from ...attrs import LANG from ...compat import copy_reg from ...errors import Errors from ...language import Language +from ...scorer import Scorer from ...symbols import POS from ...tokens import Doc -from ...util import DummyTokenizer +from ...training import validate_examples +from ...util import DummyTokenizer, registry, load_config_from_str from ... import util +DEFAULT_CONFIG = """ +[nlp] + +[nlp.tokenizer] +@tokenizers = "spacy.ja.JapaneseTokenizer" +split_mode = null +""" + + +@registry.tokenizers("spacy.ja.JapaneseTokenizer") +def create_tokenizer(split_mode: Optional[str] = None): + def japanese_tokenizer_factory(nlp): + return JapaneseTokenizer(nlp, split_mode=split_mode) + + return japanese_tokenizer_factory + + +class JapaneseTokenizer(DummyTokenizer): + def __init__(self, nlp: Language, split_mode: Optional[str] = None) -> None: + self.vocab = nlp.vocab + self.split_mode = split_mode + self.tokenizer = try_sudachi_import(self.split_mode) + + def __call__(self, text: str) -> Doc: + # convert sudachipy.morpheme.Morpheme to DetailedToken and merge continuous spaces + sudachipy_tokens = self.tokenizer.tokenize(text) + dtokens = self._get_dtokens(sudachipy_tokens) + dtokens, spaces = get_dtokens_and_spaces(dtokens, text) + + # create Doc with tag bi-gram based part-of-speech identification rules + words, tags, inflections, lemmas, readings, sub_tokens_list = ( + zip(*dtokens) if dtokens else [[]] * 6 + ) + sub_tokens_list = list(sub_tokens_list) + doc = Doc(self.vocab, words=words, spaces=spaces) + next_pos = None # for bi-gram rules + for idx, (token, dtoken) in enumerate(zip(doc, dtokens)): + token.tag_ = dtoken.tag + if next_pos: # already identified in previous iteration + token.pos = next_pos + next_pos = None + else: + token.pos, next_pos = resolve_pos( + token.orth_, + dtoken.tag, + tags[idx + 1] if idx + 1 < len(tags) else None, + ) + # if there's no lemma info (it's an unk) just use the surface + token.lemma_ = dtoken.lemma if dtoken.lemma else dtoken.surface + doc.user_data["inflections"] = inflections + doc.user_data["reading_forms"] = readings + doc.user_data["sub_tokens"] = sub_tokens_list + return doc + + def _get_dtokens(self, sudachipy_tokens, need_sub_tokens: bool = True): + sub_tokens_list = ( + self._get_sub_tokens(sudachipy_tokens) if need_sub_tokens else None + ) + dtokens 
= [ + DetailedToken( + token.surface(), # orth + "-".join([xx for xx in token.part_of_speech()[:4] if xx != "*"]), # tag + ",".join([xx for xx in token.part_of_speech()[4:] if xx != "*"]), # inf + token.dictionary_form(), # lemma + token.reading_form(), # user_data['reading_forms'] + sub_tokens_list[idx] + if sub_tokens_list + else None, # user_data['sub_tokens'] + ) + for idx, token in enumerate(sudachipy_tokens) + if len(token.surface()) > 0 + # remove empty tokens which can be produced with characters like … that + ] + # Sudachi normalizes internally and outputs each space char as a token. + # This is the preparation for get_dtokens_and_spaces() to merge the continuous space tokens + return [ + t + for idx, t in enumerate(dtokens) + if idx == 0 + or not t.surface.isspace() + or t.tag != "空白" + or not dtokens[idx - 1].surface.isspace() + or dtokens[idx - 1].tag != "空白" + ] + + def _get_sub_tokens(self, sudachipy_tokens): + if ( + self.split_mode is None or self.split_mode == "A" + ): # do nothing for default split mode + return None + + sub_tokens_list = [] # list of (list of list of DetailedToken | None) + for token in sudachipy_tokens: + sub_a = token.split(self.tokenizer.SplitMode.A) + if len(sub_a) == 1: # no sub tokens + sub_tokens_list.append(None) + elif self.split_mode == "B": + sub_tokens_list.append([self._get_dtokens(sub_a, False)]) + else: # "C" + sub_b = token.split(self.tokenizer.SplitMode.B) + if len(sub_a) == len(sub_b): + dtokens = self._get_dtokens(sub_a, False) + sub_tokens_list.append([dtokens, dtokens]) + else: + sub_tokens_list.append( + [ + self._get_dtokens(sub_a, False), + self._get_dtokens(sub_b, False), + ] + ) + return sub_tokens_list + + def score(self, examples): + validate_examples(examples, "JapaneseTokenizer.score") + return Scorer.score_tokenization(examples) + + def _get_config(self) -> Dict[str, Any]: + return {"split_mode": self.split_mode} + + def _set_config(self, config: Dict[str, Any] = {}) -> None: + self.split_mode = config.get("split_mode", None) + + def to_bytes(self, **kwargs) -> bytes: + serializers = {"cfg": lambda: srsly.json_dumps(self._get_config())} + return util.to_bytes(serializers, []) + + def from_bytes(self, data: bytes, **kwargs) -> "JapaneseTokenizer": + deserializers = {"cfg": lambda b: self._set_config(srsly.json_loads(b))} + util.from_bytes(data, deserializers, []) + self.tokenizer = try_sudachi_import(self.split_mode) + return self + + def to_disk(self, path: Union[str, Path], **kwargs) -> None: + path = util.ensure_path(path) + serializers = {"cfg": lambda p: srsly.write_json(p, self._get_config())} + return util.to_disk(path, serializers, []) + + def from_disk(self, path: Union[str, Path], **kwargs) -> "JapaneseTokenizer": + path = util.ensure_path(path) + serializers = {"cfg": lambda p: self._set_config(srsly.read_json(p))} + util.from_disk(path, serializers, []) + self.tokenizer = try_sudachi_import(self.split_mode) + return self + + +class JapaneseDefaults(Language.Defaults): + config = load_config_from_str(DEFAULT_CONFIG) + stop_words = STOP_WORDS + syntax_iterators = SYNTAX_ITERATORS + writing_system = {"direction": "ltr", "has_case": False, "has_letters": False} + + +class Japanese(Language): + lang = "ja" + Defaults = JapaneseDefaults + + # Hold the attributes we need with convenient names -DetailedToken = namedtuple("DetailedToken", ["surface", "tag", "inf", "lemma", "reading", "sub_tokens"]) +DetailedToken = namedtuple( + "DetailedToken", ["surface", "tag", "inf", "lemma", "reading", "sub_tokens"] +) def 
try_sudachi_import(split_mode="A"): @@ -29,15 +188,14 @@ def try_sudachi_import(split_mode="A"): split_mode should be one of these values: "A", "B", "C", None->"A".""" try: from sudachipy import dictionary, tokenizer + split_mode = { None: tokenizer.Tokenizer.SplitMode.A, "A": tokenizer.Tokenizer.SplitMode.A, "B": tokenizer.Tokenizer.SplitMode.B, "C": tokenizer.Tokenizer.SplitMode.C, }[split_mode] - tok = dictionary.Dictionary().create( - mode=split_mode - ) + tok = dictionary.Dictionary().create(mode=split_mode) return tok except ImportError: raise ImportError( @@ -45,7 +203,7 @@ def try_sudachi_import(split_mode="A"): "(https://github.com/WorksApplications/SudachiPy). " "Install with `pip install sudachipy sudachidict_core` or " "install spaCy with `pip install spacy[ja]`." - ) + ) from None def resolve_pos(orth, tag, next_tag): @@ -71,7 +229,10 @@ def resolve_pos(orth, tag, next_tag): if tag_bigram in TAG_BIGRAM_MAP: current_pos, next_pos = TAG_BIGRAM_MAP[tag_bigram] if current_pos is None: # apply tag uni-gram mapping for current_pos - return TAG_MAP[tag][POS], next_pos # only next_pos is identified by tag bi-gram mapping + return ( + TAG_MAP[tag][POS], + next_pos, + ) # only next_pos is identified by tag bi-gram mapping else: return current_pos, next_pos @@ -93,7 +254,7 @@ def get_dtokens_and_spaces(dtokens, text, gap_tag="空白"): return text_dtokens, text_spaces elif len([word for word in words if not word.isspace()]) == 0: assert text.isspace() - text_dtokens = [DetailedToken(text, gap_tag, '', text, None, None)] + text_dtokens = [DetailedToken(text, gap_tag, "", text, None, None)] text_spaces = [False] return text_dtokens, text_spaces @@ -105,12 +266,12 @@ def get_dtokens_and_spaces(dtokens, text, gap_tag="空白"): try: word_start = text[text_pos:].index(word) except ValueError: - raise ValueError(Errors.E194.format(text=text, words=words)) + raise ValueError(Errors.E194.format(text=text, words=words)) from None # space token if word_start > 0: - w = text[text_pos:text_pos + word_start] - text_dtokens.append(DetailedToken(w, gap_tag, '', w, None, None)) + w = text[text_pos : text_pos + word_start] + text_dtokens.append(DetailedToken(w, gap_tag, "", w, None, None)) text_spaces.append(False) text_pos += word_start @@ -126,162 +287,12 @@ def get_dtokens_and_spaces(dtokens, text, gap_tag="空白"): # trailing space token if text_pos < len(text): w = text[text_pos:] - text_dtokens.append(DetailedToken(w, gap_tag, '', w, None, None)) + text_dtokens.append(DetailedToken(w, gap_tag, "", w, None, None)) text_spaces.append(False) return text_dtokens, text_spaces -class JapaneseTokenizer(DummyTokenizer): - def __init__(self, cls, nlp=None, config={}): - self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp) - self.split_mode = config.get("split_mode", None) - self.tokenizer = try_sudachi_import(self.split_mode) - - def __call__(self, text): - # convert sudachipy.morpheme.Morpheme to DetailedToken and merge continuous spaces - sudachipy_tokens = self.tokenizer.tokenize(text) - dtokens = self._get_dtokens(sudachipy_tokens) - dtokens, spaces = get_dtokens_and_spaces(dtokens, text) - - # create Doc with tag bi-gram based part-of-speech identification rules - words, tags, inflections, lemmas, readings, sub_tokens_list = zip(*dtokens) if dtokens else [[]] * 6 - sub_tokens_list = list(sub_tokens_list) - doc = Doc(self.vocab, words=words, spaces=spaces) - next_pos = None # for bi-gram rules - for idx, (token, dtoken) in enumerate(zip(doc, dtokens)): - token.tag_ = dtoken.tag - if next_pos: # 
already identified in previous iteration - token.pos = next_pos - next_pos = None - else: - token.pos, next_pos = resolve_pos( - token.orth_, - dtoken.tag, - tags[idx + 1] if idx + 1 < len(tags) else None - ) - # if there's no lemma info (it's an unk) just use the surface - token.lemma_ = dtoken.lemma if dtoken.lemma else dtoken.surface - - doc.user_data["inflections"] = inflections - doc.user_data["reading_forms"] = readings - doc.user_data["sub_tokens"] = sub_tokens_list - doc.is_tagged = True - - return doc - - def _get_dtokens(self, sudachipy_tokens, need_sub_tokens=True): - sub_tokens_list = self._get_sub_tokens(sudachipy_tokens) if need_sub_tokens else None - dtokens = [ - DetailedToken( - token.surface(), # orth - '-'.join([xx for xx in token.part_of_speech()[:4] if xx != '*']), # tag - ','.join([xx for xx in token.part_of_speech()[4:] if xx != '*']), # inf - token.dictionary_form(), # lemma - token.reading_form(), # user_data['reading_forms'] - sub_tokens_list[idx] if sub_tokens_list else None, # user_data['sub_tokens'] - ) for idx, token in enumerate(sudachipy_tokens) if len(token.surface()) > 0 - # remove empty tokens which can be produced with characters like … that - ] - # Sudachi normalizes internally and outputs each space char as a token. - # This is the preparation for get_dtokens_and_spaces() to merge the continuous space tokens - return [ - t for idx, t in enumerate(dtokens) if - idx == 0 or - not t.surface.isspace() or t.tag != '空白' or - not dtokens[idx - 1].surface.isspace() or dtokens[idx - 1].tag != '空白' - ] - - def _get_sub_tokens(self, sudachipy_tokens): - if self.split_mode is None or self.split_mode == "A": # do nothing for default split mode - return None - - sub_tokens_list = [] # list of (list of list of DetailedToken | None) - for token in sudachipy_tokens: - sub_a = token.split(self.tokenizer.SplitMode.A) - if len(sub_a) == 1: # no sub tokens - sub_tokens_list.append(None) - elif self.split_mode == "B": - sub_tokens_list.append([self._get_dtokens(sub_a, False)]) - else: # "C" - sub_b = token.split(self.tokenizer.SplitMode.B) - if len(sub_a) == len(sub_b): - dtokens = self._get_dtokens(sub_a, False) - sub_tokens_list.append([dtokens, dtokens]) - else: - sub_tokens_list.append([self._get_dtokens(sub_a, False), self._get_dtokens(sub_b, False)]) - return sub_tokens_list - - def _get_config(self): - config = OrderedDict( - ( - ("split_mode", self.split_mode), - ) - ) - return config - - def _set_config(self, config={}): - self.split_mode = config.get("split_mode", None) - - def to_bytes(self, **kwargs): - serializers = OrderedDict( - ( - ("cfg", lambda: srsly.json_dumps(self._get_config())), - ) - ) - return util.to_bytes(serializers, []) - - def from_bytes(self, data, **kwargs): - deserializers = OrderedDict( - ( - ("cfg", lambda b: self._set_config(srsly.json_loads(b))), - ) - ) - util.from_bytes(data, deserializers, []) - self.tokenizer = try_sudachi_import(self.split_mode) - return self - - def to_disk(self, path, **kwargs): - path = util.ensure_path(path) - serializers = OrderedDict( - ( - ("cfg", lambda p: srsly.write_json(p, self._get_config())), - ) - ) - return util.to_disk(path, serializers, []) - - def from_disk(self, path, **kwargs): - path = util.ensure_path(path) - serializers = OrderedDict( - ( - ("cfg", lambda p: self._set_config(srsly.read_json(p))), - ) - ) - util.from_disk(path, serializers, []) - self.tokenizer = try_sudachi_import(self.split_mode) - - -class JapaneseDefaults(Language.Defaults): - lex_attr_getters = 
dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = lambda _text: "ja" - stop_words = STOP_WORDS - tag_map = TAG_MAP - syntax_iterators = SYNTAX_ITERATORS - writing_system = {"direction": "ltr", "has_case": False, "has_letters": False} - - @classmethod - def create_tokenizer(cls, nlp=None, config={}): - return JapaneseTokenizer(cls, nlp, config) - - -class Japanese(Language): - lang = "ja" - Defaults = JapaneseDefaults - - def make_doc(self, text): - return self.tokenizer(text) - - def pickle_japanese(instance): return Japanese, tuple() diff --git a/spacy/lang/ja/examples.py b/spacy/lang/ja/examples.py index e00001ed5..c3a011862 100644 --- a/spacy/lang/ja/examples.py +++ b/spacy/lang/ja/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/ja/stop_words.py b/spacy/lang/ja/stop_words.py index bb232a2d2..98560d7e2 100644 --- a/spacy/lang/ja/stop_words.py +++ b/spacy/lang/ja/stop_words.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - # This list was created by taking the top 2000 words from a Wikipedia dump and # filtering out everything that wasn't hiragana. ー (one) was also added. # Considered keeping some non-hiragana words but too many place names were diff --git a/spacy/lang/ja/syntax_iterators.py b/spacy/lang/ja/syntax_iterators.py index cd1e4fde7..cca4902ab 100644 --- a/spacy/lang/ja/syntax_iterators.py +++ b/spacy/lang/ja/syntax_iterators.py @@ -1,35 +1,23 @@ -# coding: utf8 -from __future__ import unicode_literals +from typing import Union, Iterator from ...symbols import NOUN, PROPN, PRON, VERB +from ...tokens import Doc, Span -# XXX this can probably be pruned a bit -labels = [ - "nsubj", - "nmod", - "dobj", - "nsubjpass", - "pcomp", - "pobj", - "obj", - "obl", - "dative", - "appos", - "attr", - "ROOT", -] -def noun_chunks(obj): - """ - Detect base noun phrases from a dependency parse. Works on both Doc and Span. - """ +# TODO: this can probably be pruned a bit +# fmt: off +labels = ["nsubj", "nmod", "ddoclike", "nsubjpass", "pcomp", "pdoclike", "doclike", "obl", "dative", "appos", "attr", "ROOT"] +# fmt: on - doc = obj.doc # Ensure works on both Doc and Span. + +def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Span]: + """Detect base noun phrases from a dependency parse. Works on Doc and Span.""" + doc = doclike.doc # Ensure works on both Doc and Span. np_deps = [doc.vocab.strings.add(label) for label in labels] - conj = doc.vocab.strings.add("conj") + doc.vocab.strings.add("conj") np_label = doc.vocab.strings.add("NP") seen = set() - for i, word in enumerate(obj): + for i, word in enumerate(doclike): if word.pos not in (NOUN, PROPN, PRON): continue # Prevent nested chunks from being produced @@ -39,12 +27,10 @@ def noun_chunks(obj): unseen = [w.i for w in word.subtree if w.i not in seen] if not unseen: continue - # this takes care of particles etc. 
seen.update(j.i for j in word.subtree) # This avoids duplicating embedded clauses seen.update(range(word.i + 1)) - # if the head of this is a verb, mark that and rights seen # Don't do the subtree as that can hide other phrases if word.head.pos == VERB: @@ -52,4 +38,5 @@ def noun_chunks(obj): seen.update(w.i for w in word.head.rights) yield unseen[0], word.i + 1, np_label + SYNTAX_ITERATORS = {"noun_chunks": noun_chunks} diff --git a/spacy/lang/ja/tag_bigram_map.py b/spacy/lang/ja/tag_bigram_map.py index 5ed9aec89..9d15fc520 100644 --- a/spacy/lang/ja/tag_bigram_map.py +++ b/spacy/lang/ja/tag_bigram_map.py @@ -1,21 +1,15 @@ -# encoding: utf8 -from __future__ import unicode_literals - -from ...symbols import POS, ADJ, AUX, NOUN, PART, VERB +from ...symbols import ADJ, AUX, NOUN, PART, VERB # mapping from tag bi-gram to pos of previous token TAG_BIGRAM_MAP = { # This covers only small part of AUX. ("形容詞-非自立可能", "助詞-終助詞"): (AUX, None), - ("名詞-普通名詞-形状詞可能", "助動詞"): (ADJ, None), # ("副詞", "名詞-普通名詞-形状詞可能"): (None, ADJ), - # This covers acl, advcl, obl and root, but has side effect for compound. ("名詞-普通名詞-サ変可能", "動詞-非自立可能"): (VERB, AUX), # This covers almost all of the deps ("名詞-普通名詞-サ変形状詞可能", "動詞-非自立可能"): (VERB, AUX), - ("名詞-普通名詞-副詞可能", "動詞-非自立可能"): (None, VERB), ("副詞", "動詞-非自立可能"): (None, VERB), ("形容詞-一般", "動詞-非自立可能"): (None, VERB), @@ -25,12 +19,9 @@ TAG_BIGRAM_MAP = { ("助詞-副助詞", "動詞-非自立可能"): (None, VERB), ("助詞-格助詞", "動詞-非自立可能"): (None, VERB), ("補助記号-読点", "動詞-非自立可能"): (None, VERB), - ("形容詞-一般", "接尾辞-名詞的-一般"): (None, PART), - ("助詞-格助詞", "形状詞-助動詞語幹"): (None, NOUN), ("連体詞", "形状詞-助動詞語幹"): (None, NOUN), - ("動詞-一般", "助詞-副助詞"): (None, PART), ("動詞-非自立可能", "助詞-副助詞"): (None, PART), ("助動詞", "助詞-副助詞"): (None, PART), diff --git a/spacy/lang/ja/tag_map.py b/spacy/lang/ja/tag_map.py index ad416e109..c6de3831a 100644 --- a/spacy/lang/ja/tag_map.py +++ b/spacy/lang/ja/tag_map.py @@ -1,8 +1,5 @@ -# encoding: utf8 -from __future__ import unicode_literals - -from ...symbols import POS, PUNCT, INTJ, X, ADJ, AUX, ADP, PART, CCONJ, SCONJ, NOUN -from ...symbols import SYM, PRON, VERB, ADV, PROPN, NUM, DET, SPACE +from ...symbols import POS, PUNCT, INTJ, ADJ, AUX, ADP, PART, SCONJ, NOUN +from ...symbols import SYM, PRON, VERB, ADV, PROPN, NUM, DET, SPACE, CCONJ TAG_MAP = { @@ -11,94 +8,61 @@ TAG_MAP = { # Universal Dependencies Mapping: (Some of the entries in this mapping are updated to v2.6 in the list below) # http://universaldependencies.org/ja/overview/morphology.html # http://universaldependencies.org/ja/pos/all.html - "記号-一般": { - POS: NOUN - }, # this includes characters used to represent sounds like ドレミ + "記号-一般": {POS: NOUN}, # this includes characters used to represent sounds like ドレミ "記号-文字": { POS: NOUN }, # this is for Greek and Latin characters having some meanings, or used as symbols, as in math "感動詞-フィラー": {POS: INTJ}, "感動詞-一般": {POS: INTJ}, - "空白": {POS: SPACE}, - "形状詞-一般": {POS: ADJ}, "形状詞-タリ": {POS: ADJ}, "形状詞-助動詞語幹": {POS: AUX}, - "形容詞-一般": {POS: ADJ}, - "形容詞-非自立可能": {POS: ADJ}, # XXX ADJ if alone, AUX otherwise - "助詞-格助詞": {POS: ADP}, - "助詞-係助詞": {POS: ADP}, - "助詞-終助詞": {POS: PART}, "助詞-準体助詞": {POS: SCONJ}, # の as in 走るのが速い "助詞-接続助詞": {POS: SCONJ}, # verb ending て0 - "助詞-副助詞": {POS: ADP}, # ばかり, つつ after a verb - "助動詞": {POS: AUX}, - "接続詞": {POS: CCONJ}, # XXX: might need refinement "接頭辞": {POS: NOUN}, "接尾辞-形状詞的": {POS: PART}, # がち, チック - "接尾辞-形容詞的": {POS: AUX}, # -らしい - "接尾辞-動詞的": {POS: PART}, # -じみ "接尾辞-名詞的-サ変可能": {POS: NOUN}, # XXX see 名詞,普通名詞,サ変可能,* "接尾辞-名詞的-一般": {POS: NOUN}, "接尾辞-名詞的-助数詞": 
{POS: NOUN}, "接尾辞-名詞的-副詞可能": {POS: NOUN}, # -後, -過ぎ - "代名詞": {POS: PRON}, - "動詞-一般": {POS: VERB}, - "動詞-非自立可能": {POS: AUX}, # XXX VERB if alone, AUX otherwise - "副詞": {POS: ADV}, - "補助記号-AA-一般": {POS: SYM}, # text art "補助記号-AA-顔文字": {POS: PUNCT}, # kaomoji - "補助記号-一般": {POS: SYM}, - "補助記号-括弧開": {POS: PUNCT}, # open bracket "補助記号-括弧閉": {POS: PUNCT}, # close bracket "補助記号-句点": {POS: PUNCT}, # period or other EOS marker "補助記号-読点": {POS: PUNCT}, # comma - "名詞-固有名詞-一般": {POS: PROPN}, # general proper noun "名詞-固有名詞-人名-一般": {POS: PROPN}, # person's name "名詞-固有名詞-人名-姓": {POS: PROPN}, # surname "名詞-固有名詞-人名-名": {POS: PROPN}, # first name "名詞-固有名詞-地名-一般": {POS: PROPN}, # place name "名詞-固有名詞-地名-国": {POS: PROPN}, # country name - "名詞-助動詞語幹": {POS: AUX}, "名詞-数詞": {POS: NUM}, # includes Chinese numerals - "名詞-普通名詞-サ変可能": {POS: NOUN}, # XXX: sometimes VERB in UDv2; suru-verb noun - "名詞-普通名詞-サ変形状詞可能": {POS: NOUN}, - "名詞-普通名詞-一般": {POS: NOUN}, - "名詞-普通名詞-形状詞可能": {POS: NOUN}, # XXX: sometimes ADJ in UDv2 - "名詞-普通名詞-助数詞可能": {POS: NOUN}, # counter / unit - "名詞-普通名詞-副詞可能": {POS: NOUN}, - "連体詞": {POS: DET}, # XXX this has exceptions based on literal token - # GSD tags. These aren't in Unidic, but we need them for the GSD data. "外国語": {POS: PROPN}, # Foreign words - "絵文字・記号等": {POS: SYM}, # emoji / kaomoji ^^; - } diff --git a/spacy/lang/ja/tag_orth_map.py b/spacy/lang/ja/tag_orth_map.py index 355cc655b..9d32cdea7 100644 --- a/spacy/lang/ja/tag_orth_map.py +++ b/spacy/lang/ja/tag_orth_map.py @@ -1,17 +1,9 @@ -# encoding: utf8 -from __future__ import unicode_literals - -from ...symbols import POS, ADJ, AUX, DET, PART, PRON, SPACE ,X +from ...symbols import DET, PART, PRON, SPACE, X # mapping from tag bi-gram to pos of previous token TAG_ORTH_MAP = { - "空白": { - " ": SPACE, - " ": X, - }, - "助詞-副助詞": { - "たり": PART, - }, + "空白": {" ": SPACE, " ": X}, + "助詞-副助詞": {"たり": PART}, "連体詞": { "あの": DET, "かの": DET, diff --git a/spacy/lang/kn/__init__.py b/spacy/lang/kn/__init__.py index c86354248..8e53989e6 100644 --- a/spacy/lang/kn/__init__.py +++ b/spacy/lang/kn/__init__.py @@ -1,14 +1,8 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from ...language import Language -from ...attrs import LANG class KannadaDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = lambda text: "kn" stop_words = STOP_WORDS diff --git a/spacy/lang/kn/examples.py b/spacy/lang/kn/examples.py index d82630432..3e055752e 100644 --- a/spacy/lang/kn/examples.py +++ b/spacy/lang/kn/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
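The Japanese hunks above move SudachiPy's split mode into the tokenizer config and pack each morpheme into a DetailedToken. The sketch below, assuming `sudachipy` and `sudachidict_core` are installed, simply mirrors the calls made in try_sudachi_import() and _get_dtokens(); the sample sentence is illustrative only.

from sudachipy import dictionary, tokenizer

# SplitMode.A gives the shortest units, SplitMode.C the longest; None maps to A above.
split_mode = tokenizer.Tokenizer.SplitMode.C
tok = dictionary.Dictionary().create(mode=split_mode)

for morpheme in tok.tokenize("選挙管理委員会"):
    print(
        morpheme.surface(),                                              # orth
        "-".join(p for p in morpheme.part_of_speech()[:4] if p != "*"),  # tag
        ",".join(p for p in morpheme.part_of_speech()[4:] if p != "*"),  # inflection
        morpheme.dictionary_form(),                                      # lemma
        morpheme.reading_form(),                                         # reading
    )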
diff --git a/spacy/lang/kn/stop_words.py b/spacy/lang/kn/stop_words.py index 652341e73..dba9740af 100644 --- a/spacy/lang/kn/stop_words.py +++ b/spacy/lang/kn/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ ಹಲವು diff --git a/spacy/lang/ko/__init__.py b/spacy/lang/ko/__init__.py index 21a754168..83c9f4962 100644 --- a/spacy/lang/ko/__init__.py +++ b/spacy/lang/ko/__init__.py @@ -1,16 +1,86 @@ -# encoding: utf8 -from __future__ import unicode_literals, print_function +from typing import Optional, Any, Dict from .stop_words import STOP_WORDS from .tag_map import TAG_MAP -from ...attrs import LANG +from .lex_attrs import LEX_ATTRS from ...language import Language from ...tokens import Doc from ...compat import copy_reg -from ...util import DummyTokenizer +from ...scorer import Scorer +from ...symbols import POS +from ...training import validate_examples +from ...util import DummyTokenizer, registry, load_config_from_str -def try_mecab_import(): +DEFAULT_CONFIG = """ +[nlp] + +[nlp.tokenizer] +@tokenizers = "spacy.ko.KoreanTokenizer" +""" + + +@registry.tokenizers("spacy.ko.KoreanTokenizer") +def create_tokenizer(): + def korean_tokenizer_factory(nlp): + return KoreanTokenizer(nlp) + + return korean_tokenizer_factory + + +class KoreanTokenizer(DummyTokenizer): + def __init__(self, nlp: Optional[Language] = None): + self.vocab = nlp.vocab + MeCab = try_mecab_import() + self.mecab_tokenizer = MeCab("-F%f[0],%f[7]") + + def __del__(self): + self.mecab_tokenizer.__del__() + + def __call__(self, text: str) -> Doc: + dtokens = list(self.detailed_tokens(text)) + surfaces = [dt["surface"] for dt in dtokens] + doc = Doc(self.vocab, words=surfaces, spaces=list(check_spaces(text, surfaces))) + for token, dtoken in zip(doc, dtokens): + first_tag, sep, eomi_tags = dtoken["tag"].partition("+") + token.tag_ = first_tag # stem(어간) or pre-final(선어말 어미) + token.pos = TAG_MAP[token.tag_][POS] + token.lemma_ = dtoken["lemma"] + doc.user_data["full_tags"] = [dt["tag"] for dt in dtokens] + return doc + + def detailed_tokens(self, text: str) -> Dict[str, Any]: + # 품사 태그(POS)[0], 의미 부류(semantic class)[1], 종성 유무(jongseong)[2], 읽기(reading)[3], + # 타입(type)[4], 첫번째 품사(start pos)[5], 마지막 품사(end pos)[6], 표현(expression)[7], * + for node in self.mecab_tokenizer.parse(text, as_nodes=True): + if node.is_eos(): + break + surface = node.surface + feature = node.feature + tag, _, expr = feature.partition(",") + lemma, _, remainder = expr.partition("/") + if lemma == "*": + lemma = surface + yield {"surface": surface, "lemma": lemma, "tag": tag} + + def score(self, examples): + validate_examples(examples, "KoreanTokenizer.score") + return Scorer.score_tokenization(examples) + + +class KoreanDefaults(Language.Defaults): + config = load_config_from_str(DEFAULT_CONFIG) + lex_attr_getters = LEX_ATTRS + stop_words = STOP_WORDS + writing_system = {"direction": "ltr", "has_case": False, "has_letters": False} + + +class Korean(Language): + lang = "ko" + Defaults = KoreanDefaults + + +def try_mecab_import() -> None: try: from natto import MeCab @@ -20,10 +90,7 @@ def try_mecab_import(): "Korean support requires [mecab-ko](https://bitbucket.org/eunjeon/mecab-ko/src/master/README.md), " "[mecab-ko-dic](https://bitbucket.org/eunjeon/mecab-ko-dic), " "and [natto-py](https://github.com/buruzaemon/natto-py)" - ) - - -# fmt: on + ) from None def check_spaces(text, tokens): @@ -39,61 +106,6 @@ def check_spaces(text, tokens): yield False -class KoreanTokenizer(DummyTokenizer): - def 
__init__(self, cls, nlp=None): - self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp) - MeCab = try_mecab_import() - self.mecab_tokenizer = MeCab("-F%f[0],%f[7]") - - def __del__(self): - self.mecab_tokenizer.__del__() - - def __call__(self, text): - dtokens = list(self.detailed_tokens(text)) - surfaces = [dt["surface"] for dt in dtokens] - doc = Doc(self.vocab, words=surfaces, spaces=list(check_spaces(text, surfaces))) - for token, dtoken in zip(doc, dtokens): - first_tag, sep, eomi_tags = dtoken["tag"].partition("+") - token.tag_ = first_tag # stem(어간) or pre-final(선어말 어미) - token.lemma_ = dtoken["lemma"] - doc.user_data["full_tags"] = [dt["tag"] for dt in dtokens] - return doc - - def detailed_tokens(self, text): - # 품사 태그(POS)[0], 의미 부류(semantic class)[1], 종성 유무(jongseong)[2], 읽기(reading)[3], - # 타입(type)[4], 첫번째 품사(start pos)[5], 마지막 품사(end pos)[6], 표현(expression)[7], * - for node in self.mecab_tokenizer.parse(text, as_nodes=True): - if node.is_eos(): - break - surface = node.surface - feature = node.feature - tag, _, expr = feature.partition(",") - lemma, _, remainder = expr.partition("/") - if lemma == "*": - lemma = surface - yield {"surface": surface, "lemma": lemma, "tag": tag} - - -class KoreanDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = lambda _text: "ko" - stop_words = STOP_WORDS - tag_map = TAG_MAP - writing_system = {"direction": "ltr", "has_case": False, "has_letters": False} - - @classmethod - def create_tokenizer(cls, nlp=None): - return KoreanTokenizer(cls, nlp) - - -class Korean(Language): - lang = "ko" - Defaults = KoreanDefaults - - def make_doc(self, text): - return self.tokenizer(text) - - def pickle_korean(instance): return Korean, tuple() diff --git a/spacy/lang/ko/examples.py b/spacy/lang/ko/examples.py index 0306e5db8..edb755eaa 100644 --- a/spacy/lang/ko/examples.py +++ b/spacy/lang/ko/examples.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - """ Example sentences to test spaCy and its language models. 
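The Korean rewrite above selects the tokenizer through the [nlp.tokenizer] config block and a factory registered as "spacy.ko.KoreanTokenizer". The sketch below shows the same registration pattern with a hypothetical whitespace tokenizer standing in for mecab-ko (which needs native dependencies), assuming spaCy v3's config-driven API.

import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    # Stand-in for KoreanTokenizer: any callable returning a Doc will do.
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        spaces = [True] * (len(words) - 1) + [False]
        return Doc(self.vocab, words=words, spaces=spaces)

@spacy.registry.tokenizers("demo.WhitespaceTokenizer")
def create_whitespace_tokenizer():
    def tokenizer_factory(nlp):
        return WhitespaceTokenizer(nlp.vocab)
    return tokenizer_factory

config = {"nlp": {"tokenizer": {"@tokenizers": "demo.WhitespaceTokenizer"}}}
nlp = spacy.blank("ko", config=config)
print([t.text for t in nlp("안녕하세요 세계")])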
diff --git a/spacy/lang/ko/lex_attrs.py b/spacy/lang/ko/lex_attrs.py index 1904a0ece..ac5bc7e48 100644 --- a/spacy/lang/ko/lex_attrs.py +++ b/spacy/lang/ko/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/ko/stop_words.py b/spacy/lang/ko/stop_words.py index 676dca1b4..3eba9fc82 100644 --- a/spacy/lang/ko/stop_words.py +++ b/spacy/lang/ko/stop_words.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - STOP_WORDS = set( """ 이 diff --git a/spacy/lang/ko/tag_map.py b/spacy/lang/ko/tag_map.py index 57317c969..26a8c56b9 100644 --- a/spacy/lang/ko/tag_map.py +++ b/spacy/lang/ko/tag_map.py @@ -1,6 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - from ...symbols import POS, PUNCT, INTJ, X, SYM, ADJ, AUX, ADP, CONJ, NOUN, PRON from ...symbols import VERB, ADV, PROPN, NUM, DET diff --git a/spacy/lang/lb/__init__.py b/spacy/lang/lb/__init__.py index 8d85b8fc7..da6fe55d7 100644 --- a/spacy/lang/lb/__init__.py +++ b/spacy/lang/lb/__init__.py @@ -1,26 +1,15 @@ -# coding: utf8 -from __future__ import unicode_literals - from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .punctuation import TOKENIZER_INFIXES from .lex_attrs import LEX_ATTRS -from .tag_map import TAG_MAP from .stop_words import STOP_WORDS - -from ..tokenizer_exceptions import BASE_EXCEPTIONS from ...language import Language -from ...attrs import LANG -from ...util import update_exc class LuxembourgishDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters.update(LEX_ATTRS) - lex_attr_getters[LANG] = lambda text: "lb" - tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) - stop_words = STOP_WORDS - tag_map = TAG_MAP + tokenizer_exceptions = TOKENIZER_EXCEPTIONS infixes = TOKENIZER_INFIXES + lex_attr_getters = LEX_ATTRS + stop_words = STOP_WORDS class Luxembourgish(Language): diff --git a/spacy/lang/lb/examples.py b/spacy/lang/lb/examples.py index 3cbba31d9..a7a10489c 100644 --- a/spacy/lang/lb/examples.py +++ b/spacy/lang/lb/examples.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - """ Example sentences to test spaCy and its language models. 
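With the LANG getter gone, the Luxembourgish pipeline above is driven entirely by plain class attributes, and the NORM-only exceptions in the lb/tokenizer_exceptions.py hunk below surface as token norms. A small usage sketch, assuming spaCy v3 with this patch applied; the sample sentence is illustrative.

import spacy

nlp = spacy.blank("lb")
doc = nlp("'t ass wichteg, asw.")
# "'t" keeps NORM "et" and "asw." keeps NORM "an sou weider" via the exceptions
print([(t.text, t.norm_) for t in doc])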
diff --git a/spacy/lang/lb/lex_attrs.py b/spacy/lang/lb/lex_attrs.py index e38c74974..d2d50d9dc 100644 --- a/spacy/lang/lb/lex_attrs.py +++ b/spacy/lang/lb/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/lb/punctuation.py b/spacy/lang/lb/punctuation.py index 2a4587856..e382c56c5 100644 --- a/spacy/lang/lb/punctuation.py +++ b/spacy/lang/lb/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..char_classes import LIST_ELLIPSES, LIST_ICONS, ALPHA, ALPHA_LOWER, ALPHA_UPPER ELISION = " ' ’ ".strip().replace(" ", "") diff --git a/spacy/lang/lb/stop_words.py b/spacy/lang/lb/stop_words.py index 41e6f79d2..8f22ea6e6 100644 --- a/spacy/lang/lb/stop_words.py +++ b/spacy/lang/lb/stop_words.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - STOP_WORDS = set( """ a diff --git a/spacy/lang/lb/tag_map.py b/spacy/lang/lb/tag_map.py deleted file mode 100644 index 424a83bb4..000000000 --- a/spacy/lang/lb/tag_map.py +++ /dev/null @@ -1,28 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import POS, PUNCT, ADJ, CONJ, NUM, DET, ADV, ADP, X, VERB -from ...symbols import NOUN, PART, SPACE, AUX - -# TODO: tag map is still using POS tags from an internal training set. -# These POS tags have to be modified to match those from Universal Dependencies - -TAG_MAP = { - "$": {POS: PUNCT}, - "ADJ": {POS: ADJ}, - "AV": {POS: ADV}, - "APPR": {POS: ADP, "AdpType": "prep"}, - "APPRART": {POS: ADP, "AdpType": "prep", "PronType": "art"}, - "D": {POS: DET, "PronType": "art"}, - "KO": {POS: CONJ}, - "N": {POS: NOUN}, - "P": {POS: ADV}, - "TRUNC": {POS: X, "Hyph": "yes"}, - "AUX": {POS: AUX}, - "V": {POS: VERB}, - "MV": {POS: VERB, "VerbType": "mod"}, - "PTK": {POS: PART}, - "INTER": {POS: PART}, - "NUM": {POS: NUM}, - "_SP": {POS: SPACE}, -} diff --git a/spacy/lang/lb/tokenizer_exceptions.py b/spacy/lang/lb/tokenizer_exceptions.py index 1c9b2dde3..d00dc9610 100644 --- a/spacy/lang/lb/tokenizer_exceptions.py +++ b/spacy/lang/lb/tokenizer_exceptions.py @@ -1,7 +1,7 @@ -# coding: utf8 -from __future__ import unicode_literals +from ..tokenizer_exceptions import BASE_EXCEPTIONS +from ...symbols import ORTH, NORM +from ...util import update_exc -from ...symbols import ORTH, LEMMA, NORM # TODO # treat other apostrophes within words as part of the word: [op d'mannst], [fir d'éischt] (= exceptions) @@ -10,19 +10,19 @@ _exc = {} # translate / delete what is not necessary for exc_data in [ - {ORTH: "’t", LEMMA: "et", NORM: "et"}, - {ORTH: "’T", LEMMA: "et", NORM: "et"}, - {ORTH: "'t", LEMMA: "et", NORM: "et"}, - {ORTH: "'T", LEMMA: "et", NORM: "et"}, - {ORTH: "wgl.", LEMMA: "wannechgelift", NORM: "wannechgelift"}, - {ORTH: "M.", LEMMA: "Monsieur", NORM: "Monsieur"}, - {ORTH: "Mme.", LEMMA: "Madame", NORM: "Madame"}, - {ORTH: "Dr.", LEMMA: "Dokter", NORM: "Dokter"}, - {ORTH: "Tel.", LEMMA: "Telefon", NORM: "Telefon"}, - {ORTH: "asw.", LEMMA: "an sou weider", NORM: "an sou weider"}, - {ORTH: "etc.", LEMMA: "et cetera", NORM: "et cetera"}, - {ORTH: "bzw.", LEMMA: "bezéiungsweis", NORM: "bezéiungsweis"}, - {ORTH: "Jan.", LEMMA: "Januar", NORM: "Januar"}, + {ORTH: "’t", NORM: "et"}, + {ORTH: "’T", NORM: "et"}, + {ORTH: "'t", NORM: "et"}, + {ORTH: "'T", NORM: "et"}, + {ORTH: "wgl.", NORM: "wannechgelift"}, + {ORTH: "M.", NORM: "Monsieur"}, + {ORTH: "Mme.", NORM: "Madame"}, + {ORTH: "Dr.", NORM: "Dokter"}, + {ORTH: "Tel.", NORM: "Telefon"}, + {ORTH: 
"asw.", NORM: "an sou weider"}, + {ORTH: "etc.", NORM: "et cetera"}, + {ORTH: "bzw.", NORM: "bezéiungsweis"}, + {ORTH: "Jan.", NORM: "Januar"}, ]: _exc[exc_data[ORTH]] = [exc_data] @@ -50,4 +50,4 @@ for orth in [ ]: _exc[orth] = [{ORTH: orth}] -TOKENIZER_EXCEPTIONS = _exc +TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc) diff --git a/spacy/lang/lex_attrs.py b/spacy/lang/lex_attrs.py index 254f8706d..12016c273 100644 --- a/spacy/lang/lex_attrs.py +++ b/spacy/lang/lex_attrs.py @@ -1,6 +1,4 @@ -# coding: utf8 -from __future__ import unicode_literals - +from typing import Set import unicodedata import re @@ -24,21 +22,21 @@ _tlds = set( ) -def is_punct(text): +def is_punct(text: str) -> bool: for char in text: if not unicodedata.category(char).startswith("P"): return False return True -def is_ascii(text): +def is_ascii(text: str) -> bool: for char in text: if ord(char) >= 128: return False return True -def like_num(text): +def like_num(text: str) -> bool: if text.startswith(("+", "-", "±", "~")): text = text[1:] # can be overwritten by lang with list of number words @@ -52,64 +50,31 @@ def like_num(text): return False -def is_bracket(text): +def is_bracket(text: str) -> bool: brackets = ("(", ")", "[", "]", "{", "}", "<", ">") return text in brackets -def is_quote(text): - quotes = ( - '"', - "'", - "`", - "«", - "»", - "‘", - "’", - "‚", - "‛", - "“", - "”", - "„", - "‟", - "‹", - "›", - "❮", - "❯", - "''", - "``", - ) +def is_quote(text: str) -> bool: + # fmt: off + quotes = ('"', "'", "`", "«", "»", "‘", "’", "‚", "‛", "“", "”", "„", "‟", "‹", "›", "❮", "❯", "''", "``") + # fmt: on return text in quotes -def is_left_punct(text): - left_punct = ( - "(", - "[", - "{", - "<", - '"', - "'", - "«", - "‘", - "‚", - "‛", - "“", - "„", - "‟", - "‹", - "❮", - "``", - ) +def is_left_punct(text: str) -> bool: + # fmt: off + left_punct = ("(", "[", "{", "<", '"', "'", "«", "‘", "‚", "‛", "“", "„", "‟", "‹", "❮", "``") + # fmt: on return text in left_punct -def is_right_punct(text): +def is_right_punct(text: str) -> bool: right_punct = (")", "]", "}", ">", '"', "'", "»", "’", "”", "›", "❯", "''") return text in right_punct -def is_currency(text): +def is_currency(text: str) -> bool: # can be overwritten by lang with list of currency words, e.g. dollar, euro for char in text: if unicodedata.category(char) != "Sc": @@ -117,11 +82,11 @@ def is_currency(text): return True -def like_email(text): +def like_email(text: str) -> bool: return bool(_like_email(text)) -def like_url(text): +def like_url(text: str) -> bool: # We're looking for things that function in text like URLs. So, valid URL # or not, anything they say http:// is going to be good. 
if text.startswith("http://") or text.startswith("https://"): @@ -147,7 +112,7 @@ def like_url(text): return False -def word_shape(text): +def word_shape(text: str) -> str: if len(text) >= 100: return "LONG" shape = [] @@ -174,46 +139,52 @@ def word_shape(text): return "".join(shape) -def lower(string): +def lower(string: str) -> str: return string.lower() -def prefix(string): +def prefix(string: str) -> str: return string[0] -def suffix(string): +def suffix(string: str) -> str: return string[-3:] -def is_alpha(string): +def is_alpha(string: str) -> bool: return string.isalpha() -def is_digit(string): +def is_digit(string: str) -> bool: return string.isdigit() -def is_lower(string): +def is_lower(string: str) -> bool: return string.islower() -def is_space(string): +def is_space(string: str) -> bool: return string.isspace() -def is_title(string): +def is_title(string: str) -> bool: return string.istitle() -def is_upper(string): +def is_upper(string: str) -> bool: return string.isupper() -def is_stop(string, stops=set()): +def is_stop(string: str, stops: Set[str] = set()) -> bool: return string.lower() in stops +def get_lang(text: str, lang: str = "") -> str: + # This function is partially applied so lang code can be passed in + # automatically while still allowing pickling + return lang + + LEX_ATTRS = { attrs.LOWER: lower, attrs.NORM: lower, diff --git a/spacy/lang/lij/__init__.py b/spacy/lang/lij/__init__.py index 9b4b29798..5ae280324 100644 --- a/spacy/lang/lij/__init__.py +++ b/spacy/lang/lij/__init__.py @@ -1,26 +1,13 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .punctuation import TOKENIZER_INFIXES - -from ..tokenizer_exceptions import BASE_EXCEPTIONS -from ..norm_exceptions import BASE_NORMS from ...language import Language -from ...attrs import LANG, NORM -from ...util import update_exc, add_lookups class LigurianDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = lambda text: "lij" - lex_attr_getters[NORM] = add_lookups( - Language.Defaults.lex_attr_getters[NORM], BASE_NORMS - ) - tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) - stop_words = STOP_WORDS + tokenizer_exceptions = TOKENIZER_EXCEPTIONS infixes = TOKENIZER_INFIXES + stop_words = STOP_WORDS class Ligurian(Language): diff --git a/spacy/lang/lij/examples.py b/spacy/lang/lij/examples.py index c4034ae7e..ba7fe43fd 100644 --- a/spacy/lang/lij/examples.py +++ b/spacy/lang/lij/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
diff --git a/spacy/lang/lij/punctuation.py b/spacy/lang/lij/punctuation.py index 4439376c8..d50b75589 100644 --- a/spacy/lang/lij/punctuation.py +++ b/spacy/lang/lij/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..punctuation import TOKENIZER_INFIXES from ..char_classes import ALPHA diff --git a/spacy/lang/lij/stop_words.py b/spacy/lang/lij/stop_words.py index ffd53370d..1d6f09d27 100644 --- a/spacy/lang/lij/stop_words.py +++ b/spacy/lang/lij/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ a à â a-a a-e a-i a-o aiva aloa an ancheu ancon apreuvo ascì atra atre atri atro avanti avei diff --git a/spacy/lang/lij/tokenizer_exceptions.py b/spacy/lang/lij/tokenizer_exceptions.py index 2109add62..52eae2c89 100644 --- a/spacy/lang/lij/tokenizer_exceptions.py +++ b/spacy/lang/lij/tokenizer_exceptions.py @@ -1,52 +1,50 @@ -# coding: utf8 -from __future__ import unicode_literals -from ...symbols import ORTH, LEMMA +from ..tokenizer_exceptions import BASE_EXCEPTIONS +from ...symbols import ORTH +from ...util import update_exc + _exc = {} -for raw, lemma in [ - ("a-a", "a-o"), - ("a-e", "a-o"), - ("a-o", "a-o"), - ("a-i", "a-o"), - ("co-a", "co-o"), - ("co-e", "co-o"), - ("co-i", "co-o"), - ("co-o", "co-o"), - ("da-a", "da-o"), - ("da-e", "da-o"), - ("da-i", "da-o"), - ("da-o", "da-o"), - ("pe-a", "pe-o"), - ("pe-e", "pe-o"), - ("pe-i", "pe-o"), - ("pe-o", "pe-o"), +for raw in [ + "a-e", + "a-o", + "a-i", + "a-a", + "co-a", + "co-e", + "co-i", + "co-o", + "da-a", + "da-e", + "da-i", + "da-o", + "pe-a", + "pe-e", + "pe-i", + "pe-o", ]: for orth in [raw, raw.capitalize()]: - _exc[orth] = [{ORTH: orth, LEMMA: lemma}] + _exc[orth] = [{ORTH: orth}] # Prefix + prepositions with à (e.g. 
"sott'a-o") -for prep, prep_lemma in [ - ("a-a", "a-o"), - ("a-e", "a-o"), - ("a-o", "a-o"), - ("a-i", "a-o"), +for prep in [ + "a-a", + "a-e", + "a-o", + "a-i", ]: - for prefix, prefix_lemma in [ - ("sott'", "sotta"), - ("sott’", "sotta"), - ("contr'", "contra"), - ("contr’", "contra"), - ("ch'", "che"), - ("ch’", "che"), - ("s'", "se"), - ("s’", "se"), + for prefix in [ + "sott'", + "sott’", + "contr'", + "contr’", + "ch'", + "ch’", + "s'", + "s’", ]: for prefix_orth in [prefix, prefix.capitalize()]: - _exc[prefix_orth + prep] = [ - {ORTH: prefix_orth, LEMMA: prefix_lemma}, - {ORTH: prep, LEMMA: prep_lemma}, - ] + _exc[prefix_orth + prep] = [{ORTH: prefix_orth}, {ORTH: prep}] -TOKENIZER_EXCEPTIONS = _exc +TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc) diff --git a/spacy/lang/lt/__init__.py b/spacy/lang/lt/__init__.py index ce2c8d6a4..e395a8f62 100644 --- a/spacy/lang/lt/__init__.py +++ b/spacy/lang/lt/__init__.py @@ -1,42 +1,16 @@ -# coding: utf8 -from __future__ import unicode_literals - from .punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS -from .tag_map import TAG_MAP -from .morph_rules import MORPH_RULES - -from ..tokenizer_exceptions import BASE_EXCEPTIONS -from ..norm_exceptions import BASE_NORMS from ...language import Language -from ...attrs import LANG, NORM -from ...util import update_exc, add_lookups - - -def _return_lt(_): - return "lt" class LithuanianDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = _return_lt - lex_attr_getters[NORM] = add_lookups( - Language.Defaults.lex_attr_getters[NORM], BASE_NORMS - ) - lex_attr_getters.update(LEX_ATTRS) - infixes = TOKENIZER_INFIXES suffixes = TOKENIZER_SUFFIXES - mod_base_exceptions = { - exc: val for exc, val in BASE_EXCEPTIONS.items() if not exc.endswith(".") - } - del mod_base_exceptions["8)"] - tokenizer_exceptions = update_exc(mod_base_exceptions, TOKENIZER_EXCEPTIONS) + tokenizer_exceptions = TOKENIZER_EXCEPTIONS stop_words = STOP_WORDS - tag_map = TAG_MAP - morph_rules = MORPH_RULES + lex_attr_getters = LEX_ATTRS class Lithuanian(Language): diff --git a/spacy/lang/lt/examples.py b/spacy/lang/lt/examples.py index 99dbe9d4d..eaf941f1a 100644 --- a/spacy/lang/lt/examples.py +++ b/spacy/lang/lt/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
diff --git a/spacy/lang/lt/lex_attrs.py b/spacy/lang/lt/lex_attrs.py index 81879948f..28894a59b 100644 --- a/spacy/lang/lt/lex_attrs.py +++ b/spacy/lang/lt/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM _num_words = { diff --git a/spacy/lang/lt/morph_rules.py b/spacy/lang/lt/morph_rules.py deleted file mode 100644 index 3bf26d9d8..000000000 --- a/spacy/lang/lt/morph_rules.py +++ /dev/null @@ -1,3075 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import LEMMA, PRON_LEMMA - - -_coordinating_conjunctions = [ - "ar", - "arba", - "bei", - "beigi", - "bet", - "betgi", - "ir", - "kadangi", - "kuo", - "ne", - "o", - "tad", - "tai", - "tačiau", - "tegul", - "tik", - "visgi", -] - -_subordinating_conjunctions = [ - "jei", - "jeigu", - "jog", - "kad", - "kai", - "kaip", - "kol", - "lyg", - "nebent", - "negu", - "nei", - "nes", - "nors", - "tarsi", - "tuo", - "užuot", -] - -MORPH_RULES = { - "Cg": dict( - [(word, {"POS": "CCONJ"}) for word in _coordinating_conjunctions] - + [(word, {"POS": "SCONJ"}) for word in _subordinating_conjunctions] - ), - "Pg--an": { - "keletą": {LEMMA: PRON_LEMMA, "POS": "PRON", "Case": "Acc", "PronType": "Ind"}, - "save": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Acc", - "PronType": "Prs", - "Reflex": "Yes", - }, - }, - "Pg--dn": { - "sau": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Dat", - "PronType": "Prs", - "Reflex": "Yes", - } - }, - "Pg--gn": { - "keleto": {LEMMA: PRON_LEMMA, "POS": "PRON", "Case": "Gen", "PronType": "Ind"}, - "savo": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Gen", - "PronType": "Prs", - "Reflex": "Yes", - }, - "savęs": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Gen", - "PronType": "Prs", - "Reflex": "Yes", - }, - }, - "Pg--in": { - "savimi": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Ins", - "PronType": "Prs", - "Reflex": "Yes", - } - }, - "Pg--nn": { - "keletas": {LEMMA: PRON_LEMMA, "POS": "PRON", "Case": "Nom", "PronType": "Ind"} - }, - "Pg-dnn": { - "mudu": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Nom", - "Number": "Dual", - "Person": "1", - "PronType": "Prs", - } - }, - "Pg-pa-": { - "jus": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Acc", - "Number": "Plur", - "Person": "2", - "PronType": "Prs", - } - }, - "Pg-pan": { - "jus": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Acc", - "Number": "Plur", - "Person": "2", - "PronType": "Prs", - }, - "mus": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Acc", - "Number": "Plur", - "Person": "1", - "PronType": "Prs", - }, - }, - "Pg-pdn": { - "jums": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Dat", - "Number": "Plur", - "Person": "2", - "PronType": "Prs", - }, - "mums": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Dat", - "Number": "Plur", - "Person": "1", - "PronType": "Prs", - }, - }, - "Pg-pgn": { - "jūsų": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Gen", - "Number": "Plur", - "Person": "2", - "PronType": "Prs", - }, - "mūsų": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Gen", - "Number": "Plur", - "Person": "1", - "PronType": "Prs", - }, - }, - "Pg-pin": { - "jumis": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Ins", - "Number": "Plur", - "Person": "2", - "PronType": "Prs", - }, - "mumis": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Ins", - "Number": "Plur", - "Person": "1", - "PronType": "Prs", - }, - }, - "Pg-pln": { - "jumyse": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Loc", - 
"Number": "Plur", - "Person": "2", - "PronType": "Prs", - } - }, - "Pg-pnn": { - "jūs": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Nom", - "Number": "Plur", - "Person": "2", - "PronType": "Prs", - }, - "mes": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Nom", - "Number": "Plur", - "Person": "1", - "PronType": "Prs", - }, - }, - "Pg-san": { - "mane": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Acc", - "Number": "Sing", - "Person": "1", - "PronType": "Prs", - }, - "tave": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Acc", - "Number": "Sing", - "Person": "2", - "PronType": "Prs", - }, - }, - "Pg-sd-": { - "tau": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Dat", - "Number": "Sing", - "Person": "2", - "PronType": "Prs", - } - }, - "Pg-sdn": { - "man": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Dat", - "Number": "Sing", - "Person": "1", - "PronType": "Prs", - }, - "sau": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Dat", - "Number": "Sing", - "PronType": "Prs", - "Reflex": "Yes", - }, - "tau": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Dat", - "Number": "Sing", - "Person": "2", - "PronType": "Prs", - }, - }, - "Pg-sgn": { - "mano": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Gen", - "Number": "Sing", - "Person": "1", - "PronType": "Prs", - }, - "manęs": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Gen", - "Number": "Sing", - "Person": "1", - "PronType": "Prs", - }, - "tavo": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Gen", - "Number": "Sing", - "Person": "2", - "PronType": "Prs", - }, - "tavęs": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Gen", - "Number": "Sing", - "Person": "2", - "PronType": "Prs", - }, - }, - "Pg-sin": { - "manimi": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Ins", - "Number": "Sing", - "Person": "1", - "PronType": "Prs", - }, - "tavim": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Ins", - "Number": "Sing", - "Person": "2", - "PronType": "Prs", - }, - "tavimi": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Ins", - "Number": "Sing", - "Person": "2", - "PronType": "Prs", - }, - }, - "Pg-sln": { - "manyje": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Loc", - "Number": "Sing", - "Person": "1", - "PronType": "Prs", - }, - "tavyje": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Loc", - "Number": "Sing", - "Person": "2", - "PronType": "Prs", - }, - }, - "Pg-snn": { - "aš": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Nom", - "Number": "Sing", - "Person": "1", - "PronType": "Prs", - }, - "tu": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Nom", - "Number": "Sing", - "Person": "2", - "PronType": "Prs", - }, - }, - "Pgf-an": { - "kelias": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Acc", - "Gender": "Fem", - "PronType": "Ind", - } - }, - "Pgf-dn": { - "kelioms": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Dat", - "Gender": "Fem", - "PronType": "Ind", - } - }, - "Pgf-nn": { - "kelios": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Nom", - "Gender": "Fem", - "PronType": "Ind", - } - }, - "Pgfdn-": { - "abi": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Nom", - "Gender": "Fem", - "Number": "Dual", - "PronType": "Ind", - } - }, - "Pgfpan": { - "jas": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Acc", - "Gender": "Fem", - "Number": "Plur", - "Person": "3", - "PronType": "Prs", - }, - "kelias": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Acc", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Ind", - }, - "kitas": { - LEMMA: 
PRON_LEMMA, - "POS": "PRON", - "Case": "Acc", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Ind", - }, - "kokias": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Acc", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Int", - }, - "kurias": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Acc", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Int", - }, - "savas": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Acc", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Ind", - }, - "tas": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Acc", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Dem", - }, - "tokias": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Acc", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Dem", - }, - }, - "Pgfpdn": { - "joms": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Dat", - "Gender": "Fem", - "Number": "Plur", - "Person": "3", - "PronType": "Prs", - }, - "kitoms": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Dat", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Ind", - }, - "kurioms": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Dat", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Int", - }, - "tokioms": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Dat", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Dem", - }, - }, - "Pgfpgn": { - "jokių": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Gen", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Neg", - }, - "jų": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Gen", - "Gender": "Fem", - "Number": "Plur", - "Person": "3", - "PronType": "Prs", - }, - "kelių": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Gen", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Ind", - }, - "kitų": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Gen", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Ind", - }, - "kurių": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Gen", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Int", - }, - "pačių": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Gen", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Emp", - }, - "tokių": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Gen", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Dem", - }, - "tų": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Gen", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Dem", - }, - }, - "Pgfpin": { - "jomis": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Ins", - "Gender": "Fem", - "Number": "Plur", - "Person": "3", - "PronType": "Prs", - }, - "kitokiomis": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Ins", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Ind", - }, - "kitomis": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Ins", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Ind", - }, - "kokiomis": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Ins", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Int", - }, - "kuriomis": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Ins", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Int", - }, - "pačiomis": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Ins", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Emp", - }, - "tomis": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Ins", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Dem", - }, - }, - "Pgfpln": { - "jose": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Loc", - "Gender": "Fem", - "Number": "Plur", - "Person": "3", - 
"PronType": "Prs", - }, - "kitose": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Loc", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Ind", - }, - "kuriose": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Loc", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Int", - }, - "tokiose": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Loc", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Dem", - }, - "tose": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Loc", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Dem", - }, - }, - "Pgfpnn": { - "jos": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Nom", - "Gender": "Fem", - "Number": "Plur", - "Person": "3", - "PronType": "Prs", - }, - "kitokios": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Nom", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Ind", - }, - "kitos": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Nom", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Ind", - }, - "kokios": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Nom", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Int", - }, - "kurios": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Nom", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Int", - }, - "pačios": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Nom", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Emp", - }, - "tokios": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Nom", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Dem", - }, - "tos": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Nom", - "Gender": "Fem", - "Number": "Plur", - "PronType": "Dem", - }, - }, - "Pgfsan": { - "ją": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Acc", - "Gender": "Fem", - "Number": "Sing", - "Person": "3", - "PronType": "Prs", - }, - "kiekvieną": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Acc", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Tot", - }, - "kitokią": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Acc", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Ind", - }, - "kitą": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Acc", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Ind", - }, - "kokią": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Acc", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Int", - }, - "kurią": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Acc", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Int", - }, - "pačią": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Acc", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Emp", - }, - "tokią": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Acc", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Dem", - }, - "tą": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Acc", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Dem", - }, - }, - "Pgfsdn": { - "jai": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Dat", - "Gender": "Fem", - "Number": "Sing", - "Person": "3", - "PronType": "Prs", - }, - "kiekvienai": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Dat", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Tot", - }, - "kitai": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Dat", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Ind", - }, - "pačiai": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Dat", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Emp", - }, - }, - "Pgfsgn": { - "jokios": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Gen", - "Gender": 
"Fem", - "Number": "Sing", - "PronType": "Neg", - }, - "jos": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Gen", - "Gender": "Fem", - "Number": "Sing", - "Person": "3", - "PronType": "Prs", - }, - "kiekvienos": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Gen", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Tot", - }, - "kokios": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Gen", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Int", - }, - "kurios": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Gen", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Int", - }, - "pačios": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Gen", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Emp", - }, - "tokios": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Gen", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Dem", - }, - "tos": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Gen", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Dem", - }, - }, - "Pgfsin": { - "ja": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Ins", - "Gender": "Fem", - "Number": "Sing", - "Person": "3", - "PronType": "Prs", - }, - "kiekviena": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Ins", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Tot", - }, - "kita": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Ins", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Ind", - }, - "kuria": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Ins", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Int", - }, - "ta": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Ins", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Dem", - }, - "tokia": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Ins", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Dem", - }, - }, - "Pgfsln": { - "joje": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Loc", - "Gender": "Fem", - "Number": "Sing", - "Person": "3", - "PronType": "Prs", - }, - "kiekvienoje": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Loc", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Tot", - }, - "kitoje": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Loc", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Ind", - }, - "kurioje": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Loc", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Int", - }, - "toje": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Loc", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Dem", - }, - "tokioje": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Loc", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Dem", - }, - }, - "Pgfsnn": { - "ji": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Nom", - "Gender": "Fem", - "Number": "Sing", - "Person": "3", - "PronType": "Prs", - }, - "kiekviena": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Nom", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Tot", - }, - "kita": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Nom", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Ind", - }, - "kokia": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Nom", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Int", - }, - "kuri": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Nom", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Int", - }, - "pati": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Nom", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Emp", - }, - "sava": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": 
"Nom", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Ind", - }, - "ta": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Nom", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Dem", - }, - "tokia": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Nom", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Dem", - }, - }, - "Pgfsny": { - "jinai": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Nom", - "Gender": "Fem", - "Number": "Sing", - "Person": "3", - "PronType": "Prs", - }, - "toji": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Nom", - "Gender": "Fem", - "Number": "Sing", - "PronType": "Dem", - }, - }, - "Pgfsny-": { - "jinai": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Nom", - "Gender": "Fem", - "Number": "Sing", - "Person": "3", - "PronType": "Prs", - } - }, - "Pgm-a-": { - "kelis": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Acc", - "Gender": "Masc", - "PronType": "Ind", - } - }, - "Pgm-an": { - "kelis": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Acc", - "Gender": "Masc", - "PronType": "Ind", - } - }, - "Pgm-dn": { - "keliems": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Dat", - "Gender": "Masc", - "PronType": "Ind", - } - }, - "Pgm-gn": { - "kelių": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Gen", - "Gender": "Masc", - "PronType": "Ind", - } - }, - "Pgm-nn": { - "keli": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Nom", - "Gender": "Masc", - "PronType": "Ind", - } - }, - "Pgmdan": { - "mudu": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Acc", - "Gender": "Masc", - "Number": "Dual", - "Person": "1", - "PronType": "Prs", - } - }, - "Pgmdgn": { - "mudviejų": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Gen", - "Gender": "Masc", - "Number": "Dual", - "Person": "1", - "PronType": "Prs", - } - }, - "Pgmdnn": { - "jiedu": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Nom", - "Gender": "Masc", - "Number": "Dual", - "Person": "3", - "PronType": "Prs", - }, - "mudu": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Nom", - "Gender": "Masc", - "Number": "Dual", - "Person": "1", - "PronType": "Prs", - }, - }, - "Pgmpan": { - "juos": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Acc", - "Gender": "Masc", - "Number": "Plur", - "Person": "3", - "PronType": "Prs", - }, - "jus": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Acc", - "Gender": "Masc", - "Number": "Plur", - "Person": "2", - "PronType": "Prs", - }, - "kitus": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Acc", - "Gender": "Masc", - "Number": "Plur", - "PronType": "Ind", - }, - "kokius": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Acc", - "Gender": "Masc", - "Number": "Plur", - "PronType": "Int", - }, - "kuriuos": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Acc", - "Gender": "Masc", - "Number": "Plur", - "PronType": "Int", - }, - "pačius": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Acc", - "Gender": "Masc", - "Number": "Plur", - "PronType": "Emp", - }, - "tokius": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Acc", - "Gender": "Masc", - "Number": "Plur", - "PronType": "Dem", - }, - "tuos": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Acc", - "Gender": "Masc", - "Number": "Plur", - "PronType": "Dem", - }, - }, - "Pgmpan-": { - "juos": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Acc", - "Gender": "Masc", - "Number": "Plur", - "Person": "3", - "PronType": "Prs", - } - }, - "Pgmpdn": { - "jiems": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Dat", - "Gender": "Masc", - "Number": "Plur", - "Person": 
"3", - "PronType": "Prs", - }, - "kitiems": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Dat", - "Gender": "Masc", - "Number": "Plur", - "PronType": "Ind", - }, - "kuriems": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Dat", - "Gender": "Masc", - "Number": "Plur", - "PronType": "Int", - }, - "patiems": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Dat", - "Gender": "Masc", - "Number": "Plur", - "PronType": "Emp", - }, - "tiems": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Dat", - "Gender": "Masc", - "Number": "Plur", - "PronType": "Dem", - }, - }, - "Pgmpgn": { - "jokių": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Gen", - "Gender": "Masc", - "Number": "Plur", - "PronType": "Neg", - }, - "jų": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Gen", - "Gender": "Masc", - "Number": "Plur", - "Person": "3", - "PronType": "Prs", - }, - "kitų": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Gen", - "Gender": "Masc", - "Number": "Plur", - "PronType": "Ind", - }, - "kokių": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Gen", - "Gender": "Masc", - "Number": "Plur", - "PronType": "Int", - }, - "kurių": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Gen", - "Gender": "Masc", - "Number": "Plur", - "PronType": "Int", - }, - "pačių": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Gen", - "Gender": "Masc", - "Number": "Plur", - "PronType": "Emp", - }, - "tokių": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Gen", - "Gender": "Masc", - "Number": "Plur", - "PronType": "Dem", - }, - "tų": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Gen", - "Gender": "Masc", - "Number": "Plur", - "PronType": "Dem", - }, - }, - "Pgmpin": { - "jais": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Ins", - "Gender": "Masc", - "Number": "Plur", - "Person": "3", - "PronType": "Prs", - }, - "jokiais": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Ins", - "Gender": "Masc", - "Number": "Plur", - "PronType": "Neg", - }, - "kitais": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Ins", - "Gender": "Masc", - "Number": "Plur", - "PronType": "Ind", - }, - "kokiais": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Ins", - "Gender": "Masc", - "Number": "Plur", - "PronType": "Int", - }, - "savais": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Ins", - "Gender": "Masc", - "Number": "Plur", - "PronType": "Ind", - }, - "tais": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Ins", - "Gender": "Masc", - "Number": "Plur", - "PronType": "Dem", - }, - "tokiais": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Ins", - "Gender": "Masc", - "Number": "Plur", - "PronType": "Dem", - }, - }, - "Pgmpln": { - "juose": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Loc", - "Gender": "Masc", - "Number": "Plur", - "Person": "3", - "PronType": "Prs", - }, - "kituose": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Loc", - "Gender": "Masc", - "Number": "Plur", - "PronType": "Ind", - }, - }, - "Pgmpnn": { - "jie": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Nom", - "Gender": "Masc", - "Number": "Plur", - "Person": "3", - "PronType": "Prs", - }, - "jūs": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Nom", - "Gender": "Masc", - "Number": "Plur", - "Person": "2", - "PronType": "Prs", - }, - "kiti": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Nom", - "Gender": "Masc", - "Number": "Plur", - "PronType": "Ind", - }, - "kokie": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Nom", - "Gender": "Masc", - "Number": "Plur", - "PronType": "Int", - }, - "kurie": { - LEMMA: 
PRON_LEMMA, - "POS": "DET", - "Case": "Nom", - "Gender": "Masc", - "Number": "Plur", - "PronType": "Int", - }, - "patys": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Nom", - "Gender": "Masc", - "Number": "Plur", - "PronType": "Emp", - }, - "tie": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Nom", - "Gender": "Masc", - "Number": "Plur", - "PronType": "Dem", - }, - "tokie": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Nom", - "Gender": "Masc", - "Number": "Plur", - "PronType": "Dem", - }, - }, - "Pgmsan": { - "jį": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Acc", - "Gender": "Masc", - "Number": "Sing", - "Person": "3", - "PronType": "Prs", - }, - "kiekvieną": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Acc", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Tot", - }, - "kitokį": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Acc", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Ind", - }, - "kitą": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Acc", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Ind", - }, - "kokį": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Acc", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Int", - }, - "kurį": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Acc", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Int", - }, - "tokį": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Acc", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Dem", - }, - "tą": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Acc", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Dem", - }, - }, - "Pgmsdn": { - "jam": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Dat", - "Gender": "Masc", - "Number": "Sing", - "Person": "3", - "PronType": "Prs", - }, - "kiekvienam": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Dat", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Tot", - }, - "kitam": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Dat", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Ind", - }, - "kokiam": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Dat", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Int", - }, - "kuriam": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Dat", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Int", - }, - "pačiam": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Dat", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Emp", - }, - "tam": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Dat", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Dem", - }, - }, - "Pgmsgn": { - "jo": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Gen", - "Gender": "Masc", - "Number": "Sing", - "Person": "3", - "PronType": "Prs", - }, - "jokio": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Gen", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Neg", - }, - "kiekvieno": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Gen", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Tot", - }, - "kito": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Gen", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Ind", - }, - "kokio": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Gen", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Int", - }, - "kurio": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Gen", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Int", - }, - "paties": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Gen", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Emp", - }, - 
"savo": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Gen", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Ind", - }, - "to": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Gen", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Dem", - }, - "tokio": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Gen", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Dem", - }, - }, - "Pgmsin": { - "juo": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Ins", - "Gender": "Masc", - "Number": "Sing", - "Person": "3", - "PronType": "Prs", - }, - "kitu": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Ins", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Ind", - }, - "kokiu": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Ins", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Int", - }, - "kuriuo": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Ins", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Int", - }, - "pačiu": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Ins", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Emp", - }, - "tokiu": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Ins", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Dem", - }, - "tuo": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Ins", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Dem", - }, - }, - "Pgmsln": { - "jame": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Loc", - "Gender": "Masc", - "Number": "Sing", - "Person": "3", - "PronType": "Prs", - }, - "kiekvienam": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Loc", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Tot", - }, - "kokiame": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Loc", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Int", - }, - "kuriame": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Loc", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Int", - }, - "tame": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Loc", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Dem", - }, - }, - "Pgmsnn": { - "jis": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Nom", - "Gender": "Masc", - "Number": "Sing", - "Person": "3", - "PronType": "Prs", - }, - "joks": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Nom", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Neg", - }, - "kiekvienas": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Nom", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Tot", - }, - "kitas": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Nom", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Ind", - }, - "kitoks": { - LEMMA: PRON_LEMMA, - "POS": "PRON", - "Case": "Nom", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Ind", - }, - "koks": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Nom", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Int", - }, - "kuris": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Nom", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Int", - }, - "pats": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Nom", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Emp", - }, - "tas": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Nom", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Dem", - }, - "toks": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Nom", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Dem", - }, - }, - "Pgmsny": { - "patsai": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Nom", - "Gender": "Masc", - "Number": 
"Sing", - "PronType": "Emp", - }, - "tasai": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Nom", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Dem", - }, - "toksai": { - LEMMA: PRON_LEMMA, - "POS": "DET", - "Case": "Nom", - "Gender": "Masc", - "Number": "Sing", - "PronType": "Dem", - }, - }, - "Pgn--n": { - "tai": {LEMMA: PRON_LEMMA, "POS": "DET", "Gender": "Neut", "PronType": "Dem"} - }, - "Pgnn--n": { - "tai": {LEMMA: PRON_LEMMA, "POS": "DET", "Gender": "Neut", "PronType": "Dem"} - }, - "Pgsmdn": { - "tam": {LEMMA: PRON_LEMMA, "POS": "DET", "Case": "Dat", "PronType": "Dem"} - }, - "Qg": {"tai": {LEMMA: "tas", "POS": "PART"}}, - "Vgap----n--n--": { - "esant": { - LEMMA: "būti", - "POS": "VERB", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Ger", - }, - "turint": { - LEMMA: "turėti", - "POS": "VERB", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Ger", - }, - }, - "Vgh--pm-n--n--": { - "būdami": { - LEMMA: "būti", - "POS": "VERB", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "POS", - "VerbForm": "Conv", - } - }, - "Vgh--sm-n--n--": { - "būdamas": { - LEMMA: "būti", - "POS": "VERB", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "POS", - "VerbForm": "Conv", - } - }, - "Vgi-----n--n--": { - "būti": {LEMMA: "būti", "POS": "VERB", "Polarity": "POS", "VerbForm": "Inf"}, - "daryti": { - LEMMA: "daryti", - "POS": "VERB", - "Polarity": "POS", - "VerbForm": "Inf", - }, - "turėti": { - LEMMA: "turėti", - "POS": "VERB", - "Polarity": "POS", - "VerbForm": "Inf", - }, - }, - "Vgm-1p--n--ns-": { - "turėtume": { - LEMMA: "turėti", - "POS": "VERB", - "Mood": "Cnd", - "Number": "Plur", - "Person": "1", - "Polarity": "POS", - "VerbForm": "Fin", - } - }, - "Vgm-2p--n--nm-": { - "būkite": { - LEMMA: "būti", - "POS": "VERB", - "Mood": "Imp", - "Number": "Plur", - "Person": "2", - "Polarity": "POS", - "VerbForm": "Fin", - }, - "darykit": { - LEMMA: "daryti", - "POS": "VERB", - "Mood": "Imp", - "Number": "Plur", - "Person": "2", - "Polarity": "POS", - "VerbForm": "Fin", - }, - "darykite": { - LEMMA: "daryti", - "POS": "VERB", - "Mood": "Imp", - "Number": "Plur", - "Person": "2", - "Polarity": "POS", - "VerbForm": "Fin", - }, - "turėkite": { - LEMMA: "turėti", - "POS": "VERB", - "Mood": "Imp", - "Number": "Plur", - "Person": "2", - "Polarity": "POS", - "VerbForm": "Fin", - }, - }, - "Vgm-2p--n--ns-": { - "turėtumėte": { - LEMMA: "turėti", - "POS": "VERB", - "Mood": "Cnd", - "Number": "Plur", - "Person": "2", - "Polarity": "POS", - "VerbForm": "Fin", - } - }, - "Vgm-2s--n--ns-": { - "turėtum": { - LEMMA: "turėti", - "POS": "VERB", - "Mood": "Cnd", - "Number": "Sing", - "Person": "2", - "Polarity": "POS", - "VerbForm": "Fin", - } - }, - "Vgm-3---n--ns-": { - "būtų": { - LEMMA: "būti", - "POS": "VERB", - "Mood": "Cnd", - "Person": "3", - "Polarity": "POS", - "VerbForm": "Fin", - }, - "turėtų": { - LEMMA: "turėti", - "POS": "VERB", - "Mood": "Cnd", - "Person": "3", - "Polarity": "POS", - "VerbForm": "Fin", - }, - }, - "Vgm-3p--n--ns-": { - "būtų": { - LEMMA: "būti", - "POS": "VERB", - "Mood": "Cnd", - "Number": "Plur", - "Person": "3", - "Polarity": "POS", - "VerbForm": "Fin", - }, - "turėtų": { - LEMMA: "turėti", - "POS": "VERB", - "Mood": "Cnd", - "Number": "Plur", - "Person": "3", - "Polarity": "POS", - "VerbForm": "Fin", - }, - }, - "Vgm-3s--n--ns-": { - "būtų": { - LEMMA: "būti", - "POS": "VERB", - "Mood": "Cnd", - "Number": "Sing", - "Person": "3", - "Polarity": "POS", - "VerbForm": "Fin", - }, - "turėtų": { - LEMMA: "turėti", - "POS": "VERB", - "Mood": "Cnd", - 
"Number": "Sing", - "Person": "3", - "Polarity": "POS", - "VerbForm": "Fin", - }, - }, - "Vgma1p--n--ni-": { - "turėjom": { - LEMMA: "turėti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Plur", - "Person": "1", - "Polarity": "POS", - "Tense": "Past", - "VerbForm": "Fin", - } - }, - "Vgma1s--n--ni-": { - "turėjau": { - LEMMA: "turėti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Sing", - "Person": "1", - "Polarity": "POS", - "Tense": "Past", - "VerbForm": "Fin", - } - }, - "Vgma3---n--ni-": { - "buvo": { - LEMMA: "būti", - "POS": "VERB", - "Mood": "Ind", - "Person": "3", - "Polarity": "POS", - "Tense": "Past", - "VerbForm": "Fin", - }, - "turėjo": { - LEMMA: "turėti", - "POS": "VERB", - "Mood": "Ind", - "Person": "3", - "Polarity": "POS", - "Tense": "Past", - "VerbForm": "Fin", - }, - }, - "Vgma3p--n--ni-": { - "buvo": { - LEMMA: "būti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Plur", - "Person": "3", - "Polarity": "POS", - "Tense": "Past", - "VerbForm": "Fin", - }, - "darė": { - LEMMA: "daryti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Plur", - "Person": "3", - "Polarity": "POS", - "Tense": "Past", - "VerbForm": "Fin", - }, - "turėjo": { - LEMMA: "turėti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Plur", - "Person": "3", - "Polarity": "POS", - "Tense": "Past", - "VerbForm": "Fin", - }, - }, - "Vgma3s--n--ni-": { - "buvo": { - LEMMA: "būti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Sing", - "Person": "3", - "Polarity": "POS", - "Tense": "Past", - "VerbForm": "Fin", - }, - "darė": { - LEMMA: "daryti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Sing", - "Person": "3", - "Polarity": "POS", - "Tense": "Past", - "VerbForm": "Fin", - }, - "turėjo": { - LEMMA: "turėti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Sing", - "Person": "3", - "Polarity": "POS", - "Tense": "Past", - "VerbForm": "Fin", - }, - }, - "Vgmf1s--n--ni-": { - "turėsiu": { - LEMMA: "turėti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Sing", - "Person": "1", - "Polarity": "POS", - "Tense": "Fut", - "VerbForm": "Fin", - } - }, - "Vgmf2p--n--ni-": { - "būsite": { - LEMMA: "būti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Plur", - "Person": "2", - "Polarity": "POS", - "Tense": "Fut", - "VerbForm": "Fin", - }, - "darysite": { - LEMMA: "daryti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Plur", - "Person": "2", - "Polarity": "POS", - "Tense": "Fut", - "VerbForm": "Fin", - }, - "turėsite": { - LEMMA: "turėti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Plur", - "Person": "2", - "Polarity": "POS", - "Tense": "Fut", - "VerbForm": "Fin", - }, - }, - "Vgmf3---n--ni-": { - "bus": { - LEMMA: "būti", - "POS": "VERB", - "Mood": "Ind", - "Person": "3", - "Polarity": "POS", - "Tense": "Fut", - "VerbForm": "Fin", - } - }, - "Vgmf3p--n--ni-": { - "bus": { - LEMMA: "būti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Plur", - "Person": "3", - "Polarity": "POS", - "Tense": "Fut", - "VerbForm": "Fin", - }, - "darys": { - LEMMA: "daryti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Plur", - "Person": "3", - "Polarity": "POS", - "Tense": "Fut", - "VerbForm": "Fin", - }, - "turės": { - LEMMA: "turėti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Plur", - "Person": "3", - "Polarity": "POS", - "Tense": "Fut", - "VerbForm": "Fin", - }, - }, - "Vgmf3s--n--ni-": { - "bus": { - LEMMA: "būti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Sing", - "Person": "3", - "Polarity": "POS", - "Tense": "Fut", - "VerbForm": "Fin", - }, - "turės": { - LEMMA: "turėti", - "POS": "VERB", - "Mood": "Ind", - 
"Number": "Sing", - "Person": "3", - "Polarity": "POS", - "Tense": "Fut", - "VerbForm": "Fin", - }, - }, - "Vgmp1p--n--ni-": { - "darome": { - LEMMA: "daryti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Plur", - "Person": "1", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "esame": { - LEMMA: "būti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Plur", - "Person": "1", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "turime": { - LEMMA: "turėti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Plur", - "Person": "1", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Fin", - }, - }, - "Vgmp1s--n--ni-": { - "būnu": { - LEMMA: "būti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Sing", - "Person": "1", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "esu": { - LEMMA: "būti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Sing", - "Person": "1", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "turiu": { - LEMMA: "turėti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Sing", - "Person": "1", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Fin", - }, - }, - "Vgmp2p--n--ni-": { - "esate": { - LEMMA: "būti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Plur", - "Person": "2", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "turite": { - LEMMA: "turėti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Plur", - "Person": "2", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Fin", - }, - }, - "Vgmp2s--n--ni-": { - "esi": { - LEMMA: "būti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Sing", - "Person": "2", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "turi": { - LEMMA: "turėti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Sing", - "Person": "2", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Fin", - }, - }, - "Vgmp3---n--ni-": { - "būna": { - LEMMA: "būti", - "POS": "VERB", - "Mood": "Ind", - "Person": "3", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "turi": { - LEMMA: "turėti", - "POS": "VERB", - "Mood": "Ind", - "Person": "3", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "yra": { - LEMMA: "būti", - "POS": "VERB", - "Mood": "Ind", - "Person": "3", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Fin", - }, - }, - "Vgmp3p--n--ni-": { - "būna": { - LEMMA: "būti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Plur", - "Person": "3", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "daro": { - LEMMA: "daryti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Plur", - "Person": "3", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "turi": { - LEMMA: "turėti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Plur", - "Person": "3", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "yra": { - LEMMA: "būti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Plur", - "Person": "3", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Fin", - }, - }, - "Vgmp3s--n--ni-": { - "būna": { - LEMMA: "būti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Sing", - "Person": "3", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "daro": { - LEMMA: "daryti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Sing", - "Person": "3", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "turi": { - LEMMA: "turėti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Sing", - "Person": "3", - "Polarity": "POS", - "Tense": "Pres", - 
"VerbForm": "Fin", - }, - "yra": { - LEMMA: "būti", - "POS": "VERB", - "Mood": "Ind", - "Number": "Sing", - "Person": "3", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Fin", - }, - }, - "Vgmq2s--n--ni-": { - "turėdavai": { - LEMMA: "turėti", - "POS": "VERB", - "Aspect": "Hab", - "Mood": "Ind", - "Number": "Sing", - "Person": "2", - "Polarity": "POS", - "Tense": "Past", - "VerbForm": "Fin", - } - }, - "Vgmq3---n--ni-": { - "būdavo": { - LEMMA: "būti", - "POS": "VERB", - "Aspect": "Hab", - "Mood": "Ind", - "Person": "3", - "Polarity": "POS", - "Tense": "Past", - "VerbForm": "Fin", - } - }, - "Vgmq3s--n--ni-": { - "turėdavo": { - LEMMA: "turėti", - "POS": "VERB", - "Aspect": "Hab", - "Mood": "Ind", - "Number": "Sing", - "Person": "3", - "Polarity": "POS", - "Tense": "Past", - "VerbForm": "Fin", - } - }, - "Vgp--pfnnnnn-p": { - "darytinos": { - LEMMA: "daryti", - "POS": "VERB", - "Case": "Nom", - "Degree": "POS", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "POS", - "VerbForm": "Part", - } - }, - "Vgpa--nann-n-p": { - "buvę": { - LEMMA: "būti", - "POS": "VERB", - "Degree": "POS", - "Gender": "Neut", - "Polarity": "POS", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - } - }, - "Vgpa-pmanngn-p": { - "buvusių": { - LEMMA: "būti", - "POS": "VERB", - "Case": "Gen", - "Degree": "POS", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "POS", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - } - }, - "Vgpa-smanngn-p": { - "buvusio": { - LEMMA: "būti", - "POS": "VERB", - "Case": "Gen", - "Degree": "POS", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "POS", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - } - }, - "Vgpa-smannnn-p": { - "buvęs": { - LEMMA: "būti", - "POS": "VERB", - "Case": "Nom", - "Degree": "POS", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "POS", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "turėjęs": { - LEMMA: "turėti", - "POS": "VERB", - "Case": "Nom", - "Degree": "POS", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "POS", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - }, - "Vgpa-smanyin-p": { - "buvusiuoju": { - LEMMA: "būti", - "POS": "VERB", - "Case": "Ins", - "Degree": "POS", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "POS", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - } - }, - "Vgpf-smpnnan-p": { - "būsimą": { - LEMMA: "būti", - "POS": "VERB", - "Case": "Acc", - "Degree": "POS", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "POS", - "Tense": "Fut", - "VerbForm": "Part", - "Voice": "Pass", - } - }, - "Vgpf-smpnndn-p": { - "būsimam": { - LEMMA: "būti", - "POS": "VERB", - "Case": "Dat", - "Degree": "POS", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "POS", - "Tense": "Fut", - "VerbForm": "Part", - "Voice": "Pass", - } - }, - "Vgpp--npnn-n-p": { - "esama": { - LEMMA: "būti", - "POS": "VERB", - "Degree": "POS", - "Gender": "Neut", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - } - }, - "Vgpp-pfannan-p": { - "esančias": { - LEMMA: "būti", - "POS": "VERB", - "Case": "Acc", - "Degree": "POS", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - } - }, - "Vgpp-pfanndn-p": { - "turinčioms": { - LEMMA: "turėti", - "POS": "VERB", - "Case": "Dat", - "Degree": "POS", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - } - }, - "Vgpp-pfannin-p": { - 
"esančiomis": { - LEMMA: "būti", - "POS": "VERB", - "Case": "Ins", - "Degree": "POS", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - } - }, - "Vgpp-pfpnnan-p": { - "daromas": { - LEMMA: "daryti", - "POS": "VERB", - "Case": "Acc", - "Degree": "POS", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "turimas": { - LEMMA: "turėti", - "POS": "VERB", - "Case": "Acc", - "Degree": "POS", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - }, - "Vgpp-pfpnnin-p": { - "turimomis": { - LEMMA: "turėti", - "POS": "VERB", - "Case": "Ins", - "Degree": "POS", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - } - }, - "Vgpp-pmannan-p": { - "turinčius": { - LEMMA: "turėti", - "POS": "VERB", - "Case": "Acc", - "Degree": "POS", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - } - }, - "Vgpp-pmanngn-p": { - "esančių": { - LEMMA: "būti", - "POS": "VERB", - "Case": "Gen", - "Degree": "POS", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - } - }, - "Vgpp-pmannin-p": { - "esančiais": { - LEMMA: "būti", - "POS": "VERB", - "Case": "Ins", - "Degree": "POS", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - } - }, - "Vgpp-pmannnn-p": { - "esantys": { - LEMMA: "būti", - "POS": "VERB", - "Case": "Nom", - "Degree": "POS", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - } - }, - "Vgpp-pmpnnan-p": { - "turimus": { - LEMMA: "turėti", - "POS": "VERB", - "Case": "Acc", - "Degree": "POS", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - } - }, - "Vgpp-pmpnngn-p": { - "esamų": { - LEMMA: "būti", - "POS": "VERB", - "Case": "Gen", - "Degree": "POS", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - } - }, - "Vgpp-sfanngn-p": { - "turinčios": { - LEMMA: "turėti", - "POS": "VERB", - "Case": "Gen", - "Degree": "POS", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - } - }, - "Vgpp-sfannln-p": { - "esančioje": { - LEMMA: "būti", - "POS": "VERB", - "Case": "Loc", - "Degree": "POS", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - } - }, - "Vgpp-sfannnn-p": { - "esanti": { - LEMMA: "būti", - "POS": "VERB", - "Case": "Nom", - "Degree": "POS", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - } - }, - "Vgpp-sfpnnnn-p": { - "daroma": { - LEMMA: "daryti", - "POS": "VERB", - "Case": "Nom", - "Degree": "POS", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - } - }, - "Vgpp-smanngn-p": { - "esančio": { - LEMMA: "būti", - "POS": "VERB", - "Case": "Gen", - "Degree": "POS", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - } - }, - "Vgpp-smannnn-p": { - "esantis": { - 
LEMMA: "būti", - "POS": "VERB", - "Case": "Nom", - "Degree": "POS", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "esąs": { - LEMMA: "būti", - "POS": "VERB", - "Case": "Nom", - "Degree": "POS", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "turintis": { - LEMMA: "turėti", - "POS": "VERB", - "Case": "Nom", - "Degree": "POS", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "POS", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - }, - "Vgps--npnn-n-p": { - "daryta": { - LEMMA: "daryti", - "POS": "VERB", - "Aspect": "Perf", - "Degree": "POS", - "Gender": "Neut", - "Polarity": "POS", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - } - }, - "Vgps-pmpnnnn-p": { - "daryti": { - LEMMA: "daryti", - "POS": "VERB", - "Aspect": "Perf", - "Case": "Nom", - "Degree": "POS", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "POS", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - } - }, -} - - -for tag, rules in MORPH_RULES.items(): - for key, attrs in dict(rules).items(): - rules[key.title()] = attrs diff --git a/spacy/lang/lt/punctuation.py b/spacy/lang/lt/punctuation.py index 5eedc8116..506aa8f32 100644 --- a/spacy/lang/lt/punctuation.py +++ b/spacy/lang/lt/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..char_classes import LIST_ICONS, LIST_ELLIPSES from ..char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA from ..char_classes import HYPHENS diff --git a/spacy/lang/lt/stop_words.py b/spacy/lang/lt/stop_words.py index fed05d80d..8c11b3f7b 100644 --- a/spacy/lang/lt/stop_words.py +++ b/spacy/lang/lt/stop_words.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - STOP_WORDS = { "a", "abejais", diff --git a/spacy/lang/lt/tag_map.py b/spacy/lang/lt/tag_map.py deleted file mode 100644 index 6ea4f8ae0..000000000 --- a/spacy/lang/lt/tag_map.py +++ /dev/null @@ -1,4798 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import POS, ADJ, ADP, ADV, CONJ, INTJ, NOUN, NUM, PART -from ...symbols import PRON, PROPN, PUNCT, SYM, VERB, X - - -TAG_MAP = { - "Agcfpan": { - POS: ADJ, - "Case": "Acc", - "Degree": "Cmp", - "Gender": "Fem", - "Number": "Plur", - }, - "Agcfpgn": { - POS: ADJ, - "Case": "Gen", - "Degree": "Cmp", - "Gender": "Fem", - "Number": "Plur", - }, - "Agcfpin": { - POS: ADJ, - "Case": "Ins", - "Degree": "Cmp", - "Gender": "Fem", - "Number": "Plur", - }, - "Agcfpln": { - POS: ADJ, - "Case": "Loc", - "Degree": "Cmp", - "Gender": "Fem", - "Number": "Plur", - }, - "Agcfpnn": { - POS: ADJ, - "Case": "Nom", - "Degree": "Cmp", - "Gender": "Fem", - "Number": "Plur", - }, - "Agcfsan": { - POS: ADJ, - "Case": "Acc", - "Degree": "Cmp", - "Gender": "Fem", - "Number": "Sing", - }, - "Agcfsnn": { - POS: ADJ, - "Case": "Nom", - "Degree": "Cmp", - "Gender": "Fem", - "Number": "Sing", - }, - "Agcfsny": { - POS: ADJ, - "Case": "Nom", - "Degree": "Cmp", - "Gender": "Fem", - "Number": "Sing", - }, - "Agcmpan": { - POS: ADJ, - "Case": "Acc", - "Degree": "Cmp", - "Gender": "Masc", - "Number": "Plur", - }, - "Agcmpgn": { - POS: ADJ, - "Case": "Gen", - "Degree": "Cmp", - "Gender": "Masc", - "Number": "Plur", - }, - "Agcmpin": { - POS: ADJ, - "Case": "Ins", - "Degree": "Cmp", - "Gender": "Masc", - "Number": "Plur", - }, - "Agcmpnn": { - POS: ADJ, - "Case": "Nom", - "Degree": "Cmp", - "Gender": 
"Masc", - "Number": "Plur", - }, - "Agcmsa-": { - POS: ADJ, - "Case": "Acc", - "Degree": "Cmp", - "Gender": "Masc", - "Number": "Sing", - }, - "Agcmsan": { - POS: ADJ, - "Case": "Acc", - "Degree": "Cmp", - "Gender": "Masc", - "Number": "Sing", - }, - "Agcmsay": { - POS: ADJ, - "Case": "Acc", - "Degree": "Cmp", - "Gender": "Masc", - "Number": "Sing", - }, - "Agcmsgn": { - POS: ADJ, - "Case": "Gen", - "Degree": "Cmp", - "Gender": "Masc", - "Number": "Sing", - }, - "Agcmsnn": { - POS: ADJ, - "Case": "Nom", - "Degree": "Cmp", - "Gender": "Masc", - "Number": "Sing", - }, - "Agpfpan": { - POS: ADJ, - "Case": "Acc", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - }, - "Agpfpay": { - POS: ADJ, - "Case": "Acc", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - }, - "Agpfpdn": { - POS: ADJ, - "Case": "Dat", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - }, - "Agpfpdy": { - POS: ADJ, - "Case": "Dat", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - }, - "Agpfpgn": { - POS: ADJ, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - }, - "Agpfpgy": { - POS: ADJ, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - }, - "Agpfpin": { - POS: ADJ, - "Case": "Ins", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - }, - "Agpfpiy": { - POS: ADJ, - "Case": "Ins", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - }, - "Agpfpln": { - POS: ADJ, - "Case": "Loc", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - }, - "Agpfpnn": { - POS: ADJ, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - }, - "Agpfpny": { - POS: ADJ, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - }, - "Agpfsan": { - POS: ADJ, - "Case": "Acc", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - }, - "Agpfsay": { - POS: ADJ, - "Case": "Acc", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - }, - "Agpfsdn": { - POS: ADJ, - "Case": "Dat", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - }, - "Agpfsdy": { - POS: ADJ, - "Case": "Dat", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - }, - "Agpfsgn": { - POS: ADJ, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - }, - "Agpfsgy": { - POS: ADJ, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - }, - "Agpfsin": { - POS: ADJ, - "Case": "Ins", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - }, - "Agpfsiy": { - POS: ADJ, - "Case": "Ins", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - }, - "Agpfsln": { - POS: ADJ, - "Case": "Loc", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - }, - "Agpfsly": { - POS: ADJ, - "Case": "Loc", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - }, - "Agpfsnn": { - POS: ADJ, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - }, - "Agpfsny": { - POS: ADJ, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - }, - "Agpmpan": { - POS: ADJ, - "Case": "Acc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - }, - "Agpmpay": { - POS: ADJ, - "Case": "Acc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - }, - "Agpmpdn": { - POS: ADJ, - "Case": "Dat", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - }, - "Agpmpdy": { - POS: ADJ, - "Case": "Dat", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - }, - "Agpmpgn": { - POS: ADJ, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Masc", - 
"Number": "Plur", - }, - "Agpmpgy": { - POS: ADJ, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - }, - "Agpmpin": { - POS: ADJ, - "Case": "Ins", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - }, - "Agpmpiy": { - POS: ADJ, - "Case": "Ins", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - }, - "Agpmpln": { - POS: ADJ, - "Case": "Loc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - }, - "Agpmply": { - POS: ADJ, - "Case": "Loc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - }, - "Agpmpnn": { - POS: ADJ, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - }, - "Agpmpny": { - POS: ADJ, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - }, - "Agpmsan": { - POS: ADJ, - "Case": "Acc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - }, - "Agpmsay": { - POS: ADJ, - "Case": "Acc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - }, - "Agpmsdn": { - POS: ADJ, - "Case": "Dat", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - }, - "Agpmsdy": { - POS: ADJ, - "Case": "Dat", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - }, - "Agpmsgn": { - POS: ADJ, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - }, - "Agpmsgy": { - POS: ADJ, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - }, - "Agpmsin": { - POS: ADJ, - "Case": "Ins", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - }, - "Agpmsiy": { - POS: ADJ, - "Case": "Ins", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - }, - "Agpmsln": { - POS: ADJ, - "Case": "Loc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - }, - "Agpmsly": { - POS: ADJ, - "Case": "Loc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - }, - "Agpmsnn": { - POS: ADJ, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - }, - "Agpmsny": { - POS: ADJ, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - }, - "Agpmsvn": { - POS: ADJ, - "Case": "Voc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - }, - "Agpn--n": {POS: ADJ, "Degree": "Pos", "Gender": "Neut"}, - "Agpn-nn": {POS: ADJ, "Case": "Nom", "Degree": "Pos", "Gender": "Neut"}, - "Agsfpan": { - POS: ADJ, - "Case": "Acc", - "Degree": "Sup", - "Gender": "Fem", - "Number": "Plur", - }, - "Agsfpdn": { - POS: ADJ, - "Case": "Dat", - "Degree": "Sup", - "Gender": "Fem", - "Number": "Plur", - }, - "Agsfpgn": { - POS: ADJ, - "Case": "Gen", - "Degree": "Sup", - "Gender": "Fem", - "Number": "Plur", - }, - "Agsfpin": { - POS: ADJ, - "Case": "Ins", - "Degree": "Sup", - "Gender": "Fem", - "Number": "Plur", - }, - "Agsfpnn": { - POS: ADJ, - "Case": "Nom", - "Degree": "Sup", - "Gender": "Fem", - "Number": "Plur", - }, - "Agsfsgn": { - POS: ADJ, - "Case": "Gen", - "Degree": "Sup", - "Gender": "Fem", - "Number": "Sing", - }, - "Agsfsgy": { - POS: ADJ, - "Case": "Gen", - "Degree": "Sup", - "Gender": "Fem", - "Number": "Sing", - }, - "Agsfsin": { - POS: ADJ, - "Case": "Ins", - "Degree": "Sup", - "Gender": "Fem", - "Number": "Sing", - }, - "Agsfsln": { - POS: ADJ, - "Case": "Loc", - "Degree": "Sup", - "Gender": "Fem", - "Number": "Sing", - }, - "Agsfsnn": { - POS: ADJ, - "Case": "Nom", - "Degree": "Sup", - "Gender": "Fem", - "Number": "Sing", - }, - "Agsfsny": { - POS: ADJ, - "Case": "Nom", - "Degree": "Sup", - "Gender": "Fem", - "Number": "Sing", - }, - "Agsmpan": { - POS: ADJ, - "Case": "Acc", - "Degree": 
"Sup", - "Gender": "Masc", - "Number": "Plur", - }, - "Agsmpgn": { - POS: ADJ, - "Case": "Gen", - "Degree": "Sup", - "Gender": "Masc", - "Number": "Plur", - }, - "Agsmpin": { - POS: ADJ, - "Case": "Ins", - "Degree": "Sup", - "Gender": "Masc", - "Number": "Plur", - }, - "Agsmpnn": { - POS: ADJ, - "Case": "Nom", - "Degree": "Sup", - "Gender": "Masc", - "Number": "Plur", - }, - "Agsmsan": { - POS: ADJ, - "Case": "Acc", - "Degree": "Sup", - "Gender": "Masc", - "Number": "Sing", - }, - "Agsmsgn": { - POS: ADJ, - "Case": "Gen", - "Degree": "Sup", - "Gender": "Masc", - "Number": "Sing", - }, - "Agsmsin": { - POS: ADJ, - "Case": "Ins", - "Degree": "Sup", - "Gender": "Masc", - "Number": "Sing", - }, - "Agsmsnn": { - POS: ADJ, - "Case": "Nom", - "Degree": "Sup", - "Gender": "Masc", - "Number": "Sing", - }, - "Agsn--n": {POS: ADJ, "Degree": "Sup", "Gender": "Neut"}, - "Cg": {POS: CONJ}, - "Ig": {POS: INTJ}, - "M----d-": {POS: NUM, "NumForm": "Digit"}, - "M----r-": {POS: NUM, "NumForm": "Roman"}, - "M----rn": {POS: NUM, "NumForm": "Roman"}, - "Mc---l-": {POS: NUM, "NumForm": "Word", "NumType": "Card"}, - "Mc--gl-": {POS: NUM, "Case": "Gen", "NumForm": "Word", "NumType": "Card"}, - "Mcf-al-": { - POS: NUM, - "Case": "Acc", - "Gender": "Fem", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcf-aln": { - POS: NUM, - "Case": "Acc", - "Gender": "Fem", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcf-dl-": { - POS: NUM, - "Case": "Dat", - "Gender": "Fem", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcf-gl-": { - POS: NUM, - "Case": "Gen", - "Gender": "Fem", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcf-gln": { - POS: NUM, - "Case": "Gen", - "Gender": "Fem", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcf-il-": { - POS: NUM, - "Case": "Ins", - "Gender": "Fem", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcf-iln": { - POS: NUM, - "Case": "Ins", - "Gender": "Fem", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcf-nl-": { - POS: NUM, - "Case": "Nom", - "Gender": "Fem", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcf-nln": { - POS: NUM, - "Case": "Nom", - "Gender": "Fem", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcfpnl-": { - POS: NUM, - "Case": "Nom", - "Gender": "Fem", - "Number": "Plur", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcfsal-": { - POS: NUM, - "Case": "Acc", - "Gender": "Fem", - "Number": "Sing", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcfsdl-": { - POS: NUM, - "Case": "Dat", - "Gender": "Fem", - "Number": "Sing", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcfsgl-": { - POS: NUM, - "Case": "Gen", - "Gender": "Fem", - "Number": "Sing", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcfsgln": { - POS: NUM, - "Case": "Gen", - "Gender": "Fem", - "Number": "Sing", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcfsil-": { - POS: NUM, - "Case": "Ins", - "Gender": "Fem", - "Number": "Sing", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcm-al-": { - POS: NUM, - "Case": "Acc", - "Gender": "Masc", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcm-aln": { - POS: NUM, - "Case": "Acc", - "Gender": "Masc", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcm-dl-": { - POS: NUM, - "Case": "Dat", - "Gender": "Masc", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcm-dln": { - POS: NUM, - "Case": "Dat", - "Gender": "Masc", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcm-gl-": { - POS: NUM, - "Case": "Gen", - "Gender": "Masc", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcm-gln": { - POS: NUM, - "Case": "Gen", 
- "Gender": "Masc", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcm-il-": { - POS: NUM, - "Case": "Ins", - "Gender": "Masc", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcm-nl-": { - POS: NUM, - "Case": "Nom", - "Gender": "Masc", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcm-nln": { - POS: NUM, - "Case": "Nom", - "Gender": "Masc", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcmpal-": { - POS: NUM, - "Case": "Acc", - "Gender": "Masc", - "Number": "Plur", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcmpaln": { - POS: NUM, - "Case": "Acc", - "Gender": "Masc", - "Number": "Plur", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcmpgl-": { - POS: NUM, - "Case": "Gen", - "Gender": "Masc", - "Number": "Plur", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcmpgln": { - POS: NUM, - "Case": "Gen", - "Gender": "Masc", - "Number": "Plur", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcmpnl-": { - POS: NUM, - "Case": "Nom", - "Gender": "Masc", - "Number": "Plur", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcmpnln": { - POS: NUM, - "Case": "Nom", - "Gender": "Masc", - "Number": "Plur", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcmsal-": { - POS: NUM, - "Case": "Acc", - "Gender": "Masc", - "Number": "Sing", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcmsaln": { - POS: NUM, - "Case": "Acc", - "Gender": "Masc", - "Number": "Sing", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcmsgl-": { - POS: NUM, - "Case": "Gen", - "Gender": "Masc", - "Number": "Sing", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcmsgln": { - POS: NUM, - "Case": "Gen", - "Gender": "Masc", - "Number": "Sing", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcmsnln": { - POS: NUM, - "Case": "Nom", - "Gender": "Masc", - "Number": "Sing", - "NumForm": "Word", - "NumType": "Card", - }, - "Mcnsnln": { - POS: NUM, - "Case": "Nom", - "Gender": "Neut", - "Number": "Sing", - "NumForm": "Word", - "NumType": "Card", - }, - "Ml--aln": {POS: NUM, "Case": "Acc", "NumForm": "Word", "NumType": "Card"}, - "Mmm-aln": { - POS: ADV, - "Case": "Acc", - "Gender": "Masc", - "NumForm": "Word", - "NumType": "Mult", - }, - "Mmm-dln": { - POS: ADV, - "Case": "Dat", - "Gender": "Masc", - "NumForm": "Word", - "NumType": "Mult", - }, - "Mmm-gl-": { - POS: ADV, - "Case": "Gen", - "Gender": "Masc", - "NumForm": "Word", - "NumType": "Mult", - }, - "Mmm-nln": { - POS: ADV, - "Case": "Nom", - "Gender": "Masc", - "NumForm": "Word", - "NumType": "Mult", - }, - "Mofpily": { - POS: ADJ, - "Case": "Ins", - "Gender": "Fem", - "Number": "Plur", - "NumForm": "Word", - "NumType": "Ord", - }, - "Mofsaly": { - POS: ADJ, - "Case": "Acc", - "Gender": "Fem", - "Number": "Sing", - "NumForm": "Word", - "NumType": "Ord", - }, - "Mofsamn": { - POS: ADJ, - "Case": "Acc", - "Gender": "Fem", - "Number": "Sing", - "NumForm": "Combi", - "NumType": "Ord", - }, - "Mofsily": { - POS: ADJ, - "Case": "Ins", - "Gender": "Fem", - "Number": "Sing", - "NumForm": "Word", - "NumType": "Ord", - }, - "Mofsnly": { - POS: ADJ, - "Case": "Nom", - "Gender": "Fem", - "Number": "Sing", - "NumForm": "Word", - "NumType": "Ord", - }, - "Mofsnmy": { - POS: ADJ, - "Case": "Nom", - "Gender": "Fem", - "Number": "Sing", - "NumForm": "Combi", - "NumType": "Ord", - }, - "Mompgln": { - POS: ADJ, - "Case": "Gen", - "Gender": "Masc", - "Number": "Plur", - "NumForm": "Word", - "NumType": "Ord", - }, - "Mompily": { - POS: ADJ, - "Case": "Ins", - "Gender": "Masc", - "Number": "Plur", - "NumForm": "Word", - "NumType": "Ord", - }, - "Mompnln": { - POS: 
ADJ, - "Case": "Nom", - "Gender": "Masc", - "Number": "Plur", - "NumForm": "Word", - "NumType": "Ord", - }, - "Mompnly": { - POS: ADJ, - "Case": "Nom", - "Gender": "Masc", - "Number": "Plur", - "NumForm": "Word", - "NumType": "Ord", - }, - "Momsaln": { - POS: ADJ, - "Case": "Acc", - "Gender": "Masc", - "Number": "Sing", - "NumForm": "Word", - "NumType": "Ord", - }, - "Momsaly": { - POS: ADJ, - "Case": "Acc", - "Gender": "Masc", - "Number": "Sing", - "NumForm": "Word", - "NumType": "Ord", - }, - "Momsgln": { - POS: ADJ, - "Case": "Gen", - "Gender": "Masc", - "Number": "Sing", - "NumForm": "Word", - "NumType": "Ord", - }, - "Momsgly": { - POS: ADJ, - "Case": "Gen", - "Gender": "Masc", - "Number": "Sing", - "NumForm": "Word", - "NumType": "Ord", - }, - "Momslly": { - POS: ADJ, - "Case": "Loc", - "Gender": "Masc", - "Number": "Sing", - "NumForm": "Word", - "NumType": "Ord", - }, - "Momsnln": { - POS: ADJ, - "Case": "Nom", - "Gender": "Masc", - "Number": "Sing", - "NumForm": "Word", - "NumType": "Ord", - }, - "Momsnly": { - POS: ADJ, - "Case": "Nom", - "Gender": "Masc", - "Number": "Sing", - "NumForm": "Word", - "NumType": "Ord", - }, - "Mon--ln": {POS: ADJ, "Gender": "Neut", "NumForm": "Word", "NumType": "Ord"}, - "Nccpnn-": {POS: NOUN, "Case": "Nom", "Number": "Plur"}, - "Nccsdn-": {POS: NOUN, "Case": "Dat", "Number": "Sing"}, - "Nccsgn-": {POS: NOUN, "Case": "Gen", "Number": "Sing"}, - "Nccsnn-": {POS: NOUN, "Case": "Nom", "Number": "Sing"}, - "Ncf--n-": {POS: NOUN, "Gender": "Fem"}, - "Ncfpan-": {POS: NOUN, "Case": "Acc", "Gender": "Fem", "Number": "Plur"}, - "Ncfpdn-": {POS: NOUN, "Case": "Dat", "Gender": "Fem", "Number": "Plur"}, - "Ncfpgn-": {POS: NOUN, "Case": "Gen", "Gender": "Fem", "Number": "Plur"}, - "Ncfpin-": {POS: NOUN, "Case": "Ins", "Gender": "Fem", "Number": "Plur"}, - "Ncfpln-": {POS: NOUN, "Case": "Loc", "Gender": "Fem", "Number": "Plur"}, - "Ncfpnn-": {POS: NOUN, "Case": "Nom", "Gender": "Fem", "Number": "Plur"}, - "Ncfsan-": {POS: NOUN, "Case": "Acc", "Gender": "Fem", "Number": "Sing"}, - "Ncfsdn-": {POS: NOUN, "Case": "Dat", "Gender": "Fem", "Number": "Sing"}, - "Ncfsgn-": {POS: NOUN, "Case": "Gen", "Gender": "Fem", "Number": "Sing"}, - "Ncfsin-": {POS: NOUN, "Case": "Ins", "Gender": "Fem", "Number": "Sing"}, - "Ncfsln-": {POS: NOUN, "Case": "Loc", "Gender": "Fem", "Number": "Sing"}, - "Ncfsnn-": {POS: NOUN, "Case": "Nom", "Gender": "Fem", "Number": "Sing"}, - "Ncfsvn-": {POS: NOUN, "Case": "Voc", "Gender": "Fem", "Number": "Sing"}, - "Ncfsxn-": {POS: NOUN, "Gender": "Fem", "Number": "Sing"}, - "Ncm--a-": {POS: NOUN, "Gender": "Masc"}, - "Ncm--n-": {POS: NOUN, "Gender": "Masc"}, - "Ncmpan-": {POS: NOUN, "Case": "Acc", "Gender": "Masc", "Number": "Plur"}, - "Ncmpdn-": {POS: NOUN, "Case": "Dat", "Gender": "Masc", "Number": "Plur"}, - "Ncmpgn-": {POS: NOUN, "Case": "Gen", "Gender": "Masc", "Number": "Plur"}, - "Ncmpin-": {POS: NOUN, "Case": "Ins", "Gender": "Masc", "Number": "Plur"}, - "Ncmpln-": {POS: NOUN, "Case": "Loc", "Gender": "Masc", "Number": "Plur"}, - "Ncmpnn-": {POS: NOUN, "Case": "Nom", "Gender": "Masc", "Number": "Plur"}, - "Ncmpny-": { - POS: NOUN, - "Case": "Nom", - "Gender": "Masc", - "Number": "Plur", - "Reflex": "Yes", - }, - "Ncmsan-": {POS: NOUN, "Case": "Acc", "Gender": "Masc", "Number": "Sing"}, - "Ncmsay-": { - POS: NOUN, - "Case": "Acc", - "Gender": "Masc", - "Number": "Sing", - "Reflex": "Yes", - }, - "Ncmsdn-": {POS: NOUN, "Case": "Dat", "Gender": "Masc", "Number": "Sing"}, - "Ncmsdy-": { - POS: NOUN, - "Case": "Dat", - "Gender": "Masc", - "Number": 
"Sing", - "Reflex": "Yes", - }, - "Ncmsgn-": {POS: NOUN, "Case": "Gen", "Gender": "Masc", "Number": "Sing"}, - "Ncmsgy-": { - POS: NOUN, - "Case": "Gen", - "Gender": "Masc", - "Number": "Sing", - "Reflex": "Yes", - }, - "Ncmsin-": {POS: NOUN, "Case": "Ins", "Gender": "Masc", "Number": "Sing"}, - "Ncmsiy-": { - POS: NOUN, - "Case": "Ins", - "Gender": "Masc", - "Number": "Sing", - "Reflex": "Yes", - }, - "Ncmsln-": {POS: NOUN, "Case": "Loc", "Gender": "Masc", "Number": "Sing"}, - "Ncmsnn-": {POS: NOUN, "Case": "Nom", "Gender": "Masc", "Number": "Sing"}, - "Ncmsny-": { - POS: NOUN, - "Case": "Nom", - "Gender": "Masc", - "Number": "Sing", - "Reflex": "Yes", - }, - "Ncmsvn-": {POS: NOUN, "Case": "Voc", "Gender": "Masc", "Number": "Sing"}, - "Ncmsxn-": {POS: NOUN, "Gender": "Masc", "Number": "Sing"}, - "Np---n-": {POS: PROPN}, - "Npc--n-": {POS: PROPN}, - "Npfpgn-": {POS: PROPN, "Case": "Gen", "Gender": "Fem", "Number": "Plur"}, - "Npfpgng": { - POS: PROPN, - "Case": "Gen", - "Gender": "Fem", - "NameType": "Geo", - "Number": "Plur", - }, - "Npfpln-": {POS: PROPN, "Case": "Loc", "Gender": "Fem", "Number": "Plur"}, - "Npfsan-": {POS: PROPN, "Case": "Acc", "Gender": "Fem", "Number": "Sing"}, - "Npfsanf": { - POS: PROPN, - "Case": "Acc", - "Gender": "Fem", - "NameType": "Giv", - "Number": "Sing", - }, - "Npfsang": { - POS: PROPN, - "Case": "Acc", - "Gender": "Fem", - "NameType": "Geo", - "Number": "Sing", - }, - "Npfsdn-": {POS: PROPN, "Case": "Dat", "Gender": "Fem", "Number": "Sing"}, - "Npfsdnf": { - POS: PROPN, - "Case": "Dat", - "Gender": "Fem", - "NameType": "Giv", - "Number": "Sing", - }, - "Npfsdng": { - POS: PROPN, - "Case": "Dat", - "Gender": "Fem", - "NameType": "Geo", - "Number": "Sing", - }, - "Npfsdns": { - POS: PROPN, - "Case": "Dat", - "Gender": "Fem", - "NameType": "Sur", - "Number": "Sing", - }, - "Npfsgn-": {POS: PROPN, "Case": "Gen", "Gender": "Fem", "Number": "Sing"}, - "Npfsgnf": { - POS: PROPN, - "Case": "Gen", - "Gender": "Fem", - "NameType": "Giv", - "Number": "Sing", - }, - "Npfsgng": { - POS: PROPN, - "Case": "Gen", - "Gender": "Fem", - "NameType": "Geo", - "Number": "Sing", - }, - "Npfsgns": { - POS: PROPN, - "Case": "Gen", - "Gender": "Fem", - "NameType": "Sur", - "Number": "Sing", - }, - "Npfsin-": {POS: PROPN, "Case": "Ins", "Gender": "Fem", "Number": "Sing"}, - "Npfsinf": { - POS: PROPN, - "Case": "Ins", - "Gender": "Fem", - "NameType": "Giv", - "Number": "Sing", - }, - "Npfsing": { - POS: PROPN, - "Case": "Ins", - "Gender": "Fem", - "NameType": "Geo", - "Number": "Sing", - }, - "Npfsins": { - POS: PROPN, - "Case": "Ins", - "Gender": "Fem", - "NameType": "Sur", - "Number": "Sing", - }, - "Npfslng": { - POS: PROPN, - "Case": "Loc", - "Gender": "Fem", - "NameType": "Geo", - "Number": "Sing", - }, - "Npfsnn-": {POS: PROPN, "Case": "Nom", "Gender": "Fem", "Number": "Sing"}, - "Npfsnnf": { - POS: PROPN, - "Case": "Nom", - "Gender": "Fem", - "NameType": "Giv", - "Number": "Sing", - }, - "Npfsnng": { - POS: PROPN, - "Case": "Nom", - "Gender": "Fem", - "NameType": "Geo", - "Number": "Sing", - }, - "Npfsnns": { - POS: PROPN, - "Case": "Nom", - "Gender": "Fem", - "NameType": "Sur", - "Number": "Sing", - }, - "Npm--nf": {POS: PROPN, "Gender": "Masc", "NameType": "Giv"}, - "Npmpgng": { - POS: PROPN, - "Case": "Gen", - "Gender": "Masc", - "NameType": "Geo", - "Number": "Plur", - }, - "Npmplng": { - POS: PROPN, - "Case": "Loc", - "Gender": "Masc", - "NameType": "Geo", - "Number": "Plur", - }, - "Npms-nf": {POS: PROPN, "Gender": "Masc", "NameType": "Giv", "Number": "Sing"}, - 
"Npmsan-": {POS: PROPN, "Case": "Acc", "Gender": "Masc", "Number": "Sing"}, - "Npmsanf": { - POS: PROPN, - "Case": "Acc", - "Gender": "Masc", - "NameType": "Giv", - "Number": "Sing", - }, - "Npmsang": { - POS: PROPN, - "Case": "Acc", - "Gender": "Masc", - "NameType": "Geo", - "Number": "Sing", - }, - "Npmsans": { - POS: PROPN, - "Case": "Acc", - "Gender": "Masc", - "NameType": "Sur", - "Number": "Sing", - }, - "Npmsdnf": { - POS: PROPN, - "Case": "Dat", - "Gender": "Masc", - "NameType": "Giv", - "Number": "Sing", - }, - "Npmsdng": { - POS: PROPN, - "Case": "Dat", - "Gender": "Masc", - "NameType": "Geo", - "Number": "Sing", - }, - "Npmsdns": { - POS: PROPN, - "Case": "Dat", - "Gender": "Masc", - "NameType": "Sur", - "Number": "Sing", - }, - "Npmsgn-": {POS: PROPN, "Case": "Gen", "Gender": "Masc", "Number": "Sing"}, - "Npmsgnf": { - POS: PROPN, - "Case": "Gen", - "Gender": "Masc", - "NameType": "Giv", - "Number": "Sing", - }, - "Npmsgng": { - POS: PROPN, - "Case": "Gen", - "Gender": "Masc", - "NameType": "Geo", - "Number": "Sing", - }, - "Npmsgns": { - POS: PROPN, - "Case": "Gen", - "Gender": "Masc", - "NameType": "Sur", - "Number": "Sing", - }, - "Npmsing": { - POS: PROPN, - "Case": "Ins", - "Gender": "Masc", - "NameType": "Geo", - "Number": "Sing", - }, - "Npmsins": { - POS: PROPN, - "Case": "Ins", - "Gender": "Masc", - "NameType": "Sur", - "Number": "Sing", - }, - "Npmslng": { - POS: PROPN, - "Case": "Loc", - "Gender": "Masc", - "NameType": "Geo", - "Number": "Sing", - }, - "Npmsngf": { - POS: PROPN, - "Case": "Nom", - "Gender": "Masc", - "NameType": "Giv", - "Number": "Sing", - }, - "Npmsnn-": {POS: PROPN, "Case": "Nom", "Gender": "Masc", "Number": "Sing"}, - "Npmsnnf": { - POS: PROPN, - "Case": "Nom", - "Gender": "Masc", - "NameType": "Giv", - "Number": "Sing", - }, - "Npmsnng": { - POS: PROPN, - "Case": "Nom", - "Gender": "Masc", - "NameType": "Geo", - "Number": "Sing", - }, - "Npmsnns": { - POS: PROPN, - "Case": "Nom", - "Gender": "Masc", - "NameType": "Sur", - "Number": "Sing", - }, - "Pg--an": {POS: PRON, "Case": "Acc"}, - "Pg--dn": {POS: PRON, "Case": "Dat"}, - "Pg--gn": {POS: PRON, "Case": "Gen"}, - "Pg--i-": {POS: PRON, "Case": "Ins"}, - "Pg--in": {POS: PRON, "Case": "Ins"}, - "Pg--nn": {POS: PRON, "Case": "Nom"}, - "Pg-dnn": {POS: PRON, "Case": "Nom", "Number": "Dual"}, - "Pg-pa-": {POS: PRON, "Case": "Acc", "Number": "Plur"}, - "Pg-pan": {POS: PRON, "Case": "Acc", "Number": "Plur"}, - "Pg-pdn": {POS: PRON, "Case": "Dat", "Number": "Plur"}, - "Pg-pgn": {POS: PRON, "Case": "Gen", "Number": "Plur"}, - "Pg-pin": {POS: PRON, "Case": "Ins", "Number": "Plur"}, - "Pg-pln": {POS: PRON, "Case": "Loc", "Number": "Plur"}, - "Pg-pnn": {POS: PRON, "Case": "Nom", "Number": "Plur"}, - "Pg-san": {POS: PRON, "Case": "Acc", "Number": "Sing"}, - "Pg-sd-": {POS: PRON, "Case": "Dat", "Number": "Sing"}, - "Pg-sdn": {POS: PRON, "Case": "Dat", "Number": "Sing"}, - "Pg-sgn": {POS: PRON, "Case": "Gen", "Number": "Sing"}, - "Pg-sin": {POS: PRON, "Case": "Ins", "Number": "Sing"}, - "Pg-sln": {POS: PRON, "Case": "Loc", "Number": "Sing"}, - "Pg-snn": {POS: PRON, "Case": "Nom", "Number": "Sing"}, - "Pgf-an": {POS: PRON, "Case": "Acc", "Gender": "Fem"}, - "Pgf-dn": {POS: PRON, "Case": "Dat", "Gender": "Fem"}, - "Pgf-nn": {POS: PRON, "Case": "Nom", "Gender": "Fem"}, - "Pgfpan": {POS: PRON, "Case": "Acc", "Gender": "Fem", "Number": "Plur"}, - "Pgfpdn": {POS: PRON, "Case": "Dat", "Gender": "Fem", "Number": "Plur"}, - "Pgfpgn": {POS: PRON, "Case": "Gen", "Gender": "Fem", "Number": "Plur"}, - "Pgfpin": {POS: PRON, 
"Case": "Ins", "Gender": "Fem", "Number": "Plur"}, - "Pgfpln": {POS: PRON, "Case": "Loc", "Gender": "Fem", "Number": "Plur"}, - "Pgfpnn": {POS: PRON, "Case": "Nom", "Gender": "Fem", "Number": "Plur"}, - "Pgfsan": {POS: PRON, "Case": "Acc", "Gender": "Fem", "Number": "Sing"}, - "Pgfsdn": {POS: PRON, "Case": "Dat", "Gender": "Fem", "Number": "Sing"}, - "Pgfsgn": {POS: PRON, "Case": "Gen", "Gender": "Fem", "Number": "Sing"}, - "Pgfsin": {POS: PRON, "Case": "Ins", "Gender": "Fem", "Number": "Sing"}, - "Pgfsln": {POS: PRON, "Case": "Loc", "Gender": "Fem", "Number": "Sing"}, - "Pgfsnn": {POS: PRON, "Case": "Nom", "Gender": "Fem", "Number": "Sing"}, - "Pgfsny": {POS: PRON, "Case": "Nom", "Gender": "Fem", "Number": "Sing"}, - "Pgfsny-": {POS: PRON, "Case": "Nom", "Gender": "Fem", "Number": "Sing"}, - "Pgm-a-": {POS: PRON, "Case": "Acc", "Gender": "Masc"}, - "Pgm-an": {POS: PRON, "Case": "Acc", "Gender": "Masc"}, - "Pgm-dn": {POS: PRON, "Case": "Dat", "Gender": "Masc"}, - "Pgm-gn": {POS: PRON, "Case": "Gen", "Gender": "Masc"}, - "Pgm-nn": {POS: PRON, "Case": "Nom", "Gender": "Masc"}, - "Pgmdan": {POS: PRON, "Case": "Acc", "Gender": "Masc", "Number": "Dual"}, - "Pgmdgn": {POS: PRON, "Case": "Gen", "Gender": "Masc", "Number": "Dual"}, - "Pgmdnn": {POS: PRON, "Case": "Nom", "Gender": "Masc", "Number": "Dual"}, - "Pgmpan": {POS: PRON, "Case": "Acc", "Gender": "Masc", "Number": "Plur"}, - "Pgmpan-": {POS: PRON, "Case": "Acc", "Gender": "Masc", "Number": "Plur"}, - "Pgmpdn": {POS: PRON, "Case": "Dat", "Gender": "Masc", "Number": "Plur"}, - "Pgmpgn": {POS: PRON, "Case": "Gen", "Gender": "Masc", "Number": "Plur"}, - "Pgmpin": {POS: PRON, "Case": "Ins", "Gender": "Masc", "Number": "Plur"}, - "Pgmpln": {POS: PRON, "Case": "Loc", "Gender": "Masc", "Number": "Plur"}, - "Pgmpnn": {POS: PRON, "Case": "Nom", "Gender": "Masc", "Number": "Plur"}, - "Pgmsan": {POS: PRON, "Case": "Acc", "Gender": "Masc", "Number": "Sing"}, - "Pgmsdn": {POS: PRON, "Case": "Dat", "Gender": "Masc", "Number": "Sing"}, - "Pgmsgn": {POS: PRON, "Case": "Gen", "Gender": "Masc", "Number": "Sing"}, - "Pgmsin": {POS: PRON, "Case": "Ins", "Gender": "Masc", "Number": "Sing"}, - "Pgmsln": {POS: PRON, "Case": "Loc", "Gender": "Masc", "Number": "Sing"}, - "Pgmsnn": {POS: PRON, "Case": "Nom", "Gender": "Masc", "Number": "Sing"}, - "Pgn--n": {POS: PRON, "Gender": "Neut"}, - "Pgnn--n": {POS: PRON, "Gender": "Neut"}, - "Pgsmdn": {POS: PRON, "Case": "Dat"}, - "Qg": {POS: PART}, - "Rgc": {POS: ADV, "Degree": "Cmp"}, - "Rgp": {POS: ADV, "Degree": "Pos"}, - "Rgs": {POS: ADV, "Degree": "Sup"}, - "Sag": {POS: ADP, "AdpType": "Prep", "Case": "Gen"}, - "Sga": {POS: ADP, "AdpType": "Prep", "Case": "Acc"}, - "Sgg": {POS: ADP, "AdpType": "Prep", "Case": "Gen"}, - "Sgi": {POS: ADP, "AdpType": "Prep", "Case": "Ins"}, - "Vgaa----n--n--": { - POS: VERB, - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Ger", - }, - "Vgaa----n--y--": { - POS: VERB, - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Ger", - }, - "Vgaa----y--n--": { - POS: VERB, - "Polarity": "Neg", - "Tense": "Past", - "VerbForm": "Ger", - }, - "Vgaa----y--y--": { - POS: VERB, - "Polarity": "Neg", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Ger", - }, - "Vgap----n--n--": { - POS: VERB, - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Ger", - }, - "Vgap----n--y": { - POS: VERB, - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Pres", - "VerbForm": "Ger", - }, - "Vgap----n--y--": { - POS: VERB, - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Pres", - 
"VerbForm": "Ger", - }, - "Vgap----y--n--": { - POS: VERB, - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Ger", - }, - "Vgap----y--y--": { - POS: VERB, - "Polarity": "Neg", - "Reflex": "Yes", - "Tense": "Pres", - "VerbForm": "Ger", - }, - "Vgas----n--y--": { - POS: VERB, - "Aspect": "Perf", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Ger", - }, - "Vgb-----n--n--": {POS: ADV, "Polarity": "Pos", "VerbForm": "Conv"}, - "Vgh--pf-n--n--": { - POS: VERB, - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Pos", - "VerbForm": "Conv", - }, - "Vgh--pf-y--n--": { - POS: VERB, - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Neg", - "VerbForm": "Conv", - }, - "Vgh--pm-n--n--": { - POS: VERB, - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "VerbForm": "Conv", - }, - "Vgh--pm-n--y--": { - POS: VERB, - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Reflex": "Yes", - "VerbForm": "Conv", - }, - "Vgh--pm-y--n--": { - POS: VERB, - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Neg", - "VerbForm": "Conv", - }, - "Vgh--sf-n--n--": { - POS: VERB, - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "VerbForm": "Conv", - }, - "Vgh--sf-n--y--": { - POS: VERB, - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Reflex": "Yes", - "VerbForm": "Conv", - }, - "Vgh--sf-y--n--": { - POS: VERB, - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Neg", - "VerbForm": "Conv", - }, - "Vgh--sm-n--n--": { - POS: VERB, - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "VerbForm": "Conv", - }, - "Vgh--sm-n--y--": { - POS: VERB, - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Reflex": "Yes", - "VerbForm": "Conv", - }, - "Vgh--sm-y--n--": { - POS: VERB, - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Neg", - "VerbForm": "Conv", - }, - "Vgi-----n--n--": {POS: VERB, "Polarity": "Pos", "VerbForm": "Inf"}, - "Vgi-----n--y--": { - POS: VERB, - "Polarity": "Pos", - "Reflex": "Yes", - "VerbForm": "Inf", - }, - "Vgi-----y--n--": {POS: VERB, "Polarity": "Neg", "VerbForm": "Inf"}, - "Vgi-----y--y--": { - POS: VERB, - "Polarity": "Neg", - "Reflex": "Yes", - "VerbForm": "Inf", - }, - "Vgm-1p--n--nm-": { - POS: VERB, - "Mood": "Imp", - "Number": "Plur", - "Person": "one", - "Polarity": "Pos", - "VerbForm": "Fin", - }, - "Vgm-1p--n--ns-": { - POS: VERB, - "Mood": "Cnd", - "Number": "Plur", - "Person": "one", - "Polarity": "Pos", - "VerbForm": "Fin", - }, - "Vgm-1p--n--ym-": { - POS: VERB, - "Mood": "Imp", - "Number": "Plur", - "Person": "one", - "Polarity": "Pos", - "Reflex": "Yes", - "VerbForm": "Fin", - }, - "Vgm-1p--y--nm-": { - POS: VERB, - "Mood": "Imp", - "Number": "Plur", - "Person": "one", - "Polarity": "Neg", - "VerbForm": "Fin", - }, - "Vgm-1p--y--ys-": { - POS: VERB, - "Mood": "Cnd", - "Number": "Plur", - "Person": "one", - "Polarity": "Neg", - "Reflex": "Yes", - "VerbForm": "Fin", - }, - "Vgm-1s--n--ns-": { - POS: VERB, - "Mood": "Cnd", - "Number": "Sing", - "Person": "one", - "Polarity": "Pos", - "VerbForm": "Fin", - }, - "Vgm-1s--n--ys-": { - POS: VERB, - "Mood": "Cnd", - "Number": "Sing", - "Person": "one", - "Polarity": "Pos", - "Reflex": "Yes", - "VerbForm": "Fin", - }, - "Vgm-1s--y--ns-": { - POS: VERB, - "Mood": "Cnd", - "Number": "Sing", - "Person": "one", - "Polarity": "Neg", - "VerbForm": "Fin", - }, - "Vgm-1s--y--ys-": { - POS: VERB, - "Mood": "Cnd", - "Number": "Sing", - "Person": "one", - "Polarity": "Neg", - "Reflex": "Yes", - "VerbForm": "Fin", - }, - "Vgm-2p--n--nm-": { - POS: 
VERB, - "Mood": "Imp", - "Number": "Plur", - "Person": "two", - "Polarity": "Pos", - "VerbForm": "Fin", - }, - "Vgm-2p--n--ns-": { - POS: VERB, - "Mood": "Cnd", - "Number": "Plur", - "Person": "two", - "Polarity": "Pos", - "VerbForm": "Fin", - }, - "Vgm-2p--n--ym-": { - POS: VERB, - "Mood": "Imp", - "Number": "Plur", - "Person": "two", - "Polarity": "Pos", - "Reflex": "Yes", - "VerbForm": "Fin", - }, - "Vgm-2p--y--nm-": { - POS: VERB, - "Mood": "Imp", - "Number": "Plur", - "Person": "two", - "Polarity": "Neg", - "VerbForm": "Fin", - }, - "Vgm-2p--y--ym-": { - POS: VERB, - "Mood": "Imp", - "Number": "Plur", - "Person": "two", - "Polarity": "Neg", - "Reflex": "Yes", - "VerbForm": "Fin", - }, - "Vgm-2s--n--nm-": { - POS: VERB, - "Mood": "Imp", - "Number": "Sing", - "Person": "two", - "Polarity": "Pos", - "VerbForm": "Fin", - }, - "Vgm-2s--n--ns-": { - POS: VERB, - "Mood": "Cnd", - "Number": "Sing", - "Person": "two", - "Polarity": "Pos", - "VerbForm": "Fin", - }, - "Vgm-2s--n--ym-": { - POS: VERB, - "Mood": "Imp", - "Number": "Sing", - "Person": "two", - "Polarity": "Pos", - "Reflex": "Yes", - "VerbForm": "Fin", - }, - "Vgm-2s--y--nm-": { - POS: VERB, - "Mood": "Imp", - "Number": "Sing", - "Person": "two", - "Polarity": "Neg", - "VerbForm": "Fin", - }, - "Vgm-2s--y--ns-": { - POS: VERB, - "Mood": "Cnd", - "Number": "Sing", - "Person": "two", - "Polarity": "Neg", - "VerbForm": "Fin", - }, - "Vgm-3---n--ns-": { - POS: VERB, - "Mood": "Cnd", - "Person": "three", - "Polarity": "Pos", - "VerbForm": "Fin", - }, - "Vgm-3---n--ys-": { - POS: VERB, - "Mood": "Cnd", - "Person": "three", - "Polarity": "Pos", - "Reflex": "Yes", - "VerbForm": "Fin", - }, - "Vgm-3---y--ns-": { - POS: VERB, - "Mood": "Cnd", - "Person": "three", - "Polarity": "Neg", - "VerbForm": "Fin", - }, - "Vgm-3---y--ys-": { - POS: VERB, - "Mood": "Cnd", - "Person": "three", - "Polarity": "Neg", - "Reflex": "Yes", - "VerbForm": "Fin", - }, - "Vgm-3p--n--ns-": { - POS: VERB, - "Mood": "Cnd", - "Number": "Plur", - "Person": "three", - "Polarity": "Pos", - "VerbForm": "Fin", - }, - "Vgm-3p--n--ys-": { - POS: VERB, - "Mood": "Cnd", - "Number": "Plur", - "Person": "three", - "Polarity": "Pos", - "Reflex": "Yes", - "VerbForm": "Fin", - }, - "Vgm-3p--y--ns-": { - POS: VERB, - "Mood": "Cnd", - "Number": "Plur", - "Person": "three", - "Polarity": "Neg", - "VerbForm": "Fin", - }, - "Vgm-3s--n--ns-": { - POS: VERB, - "Mood": "Cnd", - "Number": "Sing", - "Person": "three", - "Polarity": "Pos", - "VerbForm": "Fin", - }, - "Vgm-3s--n--ys-": { - POS: VERB, - "Mood": "Cnd", - "Number": "Sing", - "Person": "three", - "Polarity": "Pos", - "Reflex": "Yes", - "VerbForm": "Fin", - }, - "Vgm-3s--y--ns-": { - POS: VERB, - "Mood": "Cnd", - "Number": "Sing", - "Person": "three", - "Polarity": "Neg", - "VerbForm": "Fin", - }, - "Vgm-3s--y--ys-": { - POS: VERB, - "Mood": "Cnd", - "Number": "Sing", - "Person": "three", - "Polarity": "Neg", - "Reflex": "Yes", - "VerbForm": "Fin", - }, - "Vgma1p--n--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Plur", - "Person": "one", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgma1p--n--yi-": { - POS: VERB, - "Mood": "Ind", - "Number": "Plur", - "Person": "one", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgma1p--y--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Plur", - "Person": "one", - "Polarity": "Neg", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgma1p--y--yi-": { - POS: VERB, - "Mood": "Ind", - "Number": "Plur", - "Person": "one", - 
"Polarity": "Neg", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgma1s--n--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "one", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgma1s--n--yi-": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "one", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgma1s--y--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "one", - "Polarity": "Neg", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgma1s--y--yi-": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "one", - "Polarity": "Neg", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgma2p--n--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Plur", - "Person": "two", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgma2p--n--yi-": { - POS: VERB, - "Mood": "Ind", - "Number": "Plur", - "Person": "two", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgma2p--y--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Plur", - "Person": "two", - "Polarity": "Neg", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgma2s--n--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "two", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgma2s--y--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "two", - "Polarity": "Neg", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgma3---n--ni-": { - POS: VERB, - "Mood": "Ind", - "Person": "three", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgma3---n--yi-": { - POS: VERB, - "Mood": "Ind", - "Person": "three", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgma3---y--ni-": { - POS: VERB, - "Mood": "Ind", - "Person": "three", - "Polarity": "Neg", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgma3--y--ni-": { - POS: VERB, - "Case": "Nom", - "Person": "three", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgma3p--n--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Plur", - "Person": "three", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgma3p--n--yi-": { - POS: VERB, - "Mood": "Ind", - "Number": "Plur", - "Person": "three", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgma3p--y--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Plur", - "Person": "three", - "Polarity": "Neg", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgma3p--y--yi-": { - POS: VERB, - "Mood": "Ind", - "Number": "Plur", - "Person": "three", - "Polarity": "Neg", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgma3s--n--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "three", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgma3s--n--yi-": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "three", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgma3s--n--yi--": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "three", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgma3s--y--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "three", - "Polarity": "Neg", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgma3s--y--yi-": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "three", - 
"Polarity": "Neg", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgmf1p--n--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Plur", - "Person": "one", - "Polarity": "Pos", - "Tense": "Fut", - "VerbForm": "Fin", - }, - "Vgmf1p--n--yi-": { - POS: VERB, - "Mood": "Ind", - "Number": "Plur", - "Person": "one", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Fut", - "VerbForm": "Fin", - }, - "Vgmf1p--y--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Plur", - "Person": "one", - "Polarity": "Neg", - "Tense": "Fut", - "VerbForm": "Fin", - }, - "Vgmf1s--n--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "one", - "Polarity": "Pos", - "Tense": "Fut", - "VerbForm": "Fin", - }, - "Vgmf1s--n--yi-": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "one", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Fut", - "VerbForm": "Fin", - }, - "Vgmf1s--y--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "one", - "Polarity": "Neg", - "Tense": "Fut", - "VerbForm": "Fin", - }, - "Vgmf2p--n--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Plur", - "Person": "two", - "Polarity": "Pos", - "Tense": "Fut", - "VerbForm": "Fin", - }, - "Vgmf2p--n--yi-": { - POS: VERB, - "Mood": "Ind", - "Number": "Plur", - "Person": "two", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Fut", - "VerbForm": "Fin", - }, - "Vgmf2s--n--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "two", - "Polarity": "Pos", - "Tense": "Fut", - "VerbForm": "Fin", - }, - "Vgmf2s--n--yi-": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "two", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Fut", - "VerbForm": "Fin", - }, - "Vgmf2s--y--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "two", - "Polarity": "Neg", - "Tense": "Fut", - "VerbForm": "Fin", - }, - "Vgmf2s--y--yi-": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "two", - "Polarity": "Neg", - "Reflex": "Yes", - "Tense": "Fut", - "VerbForm": "Fin", - }, - "Vgmf3---n--ni-": { - POS: VERB, - "Mood": "Ind", - "Person": "three", - "Polarity": "Pos", - "Tense": "Fut", - "VerbForm": "Fin", - }, - "Vgmf3---y--ni-": { - POS: VERB, - "Mood": "Ind", - "Person": "three", - "Polarity": "Neg", - "Tense": "Fut", - "VerbForm": "Fin", - }, - "Vgmf3p--n--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Plur", - "Person": "three", - "Polarity": "Pos", - "Tense": "Fut", - "VerbForm": "Fin", - }, - "Vgmf3p--n--yi-": { - POS: VERB, - "Mood": "Ind", - "Number": "Plur", - "Person": "three", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Fut", - "VerbForm": "Fin", - }, - "Vgmf3p--y--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Plur", - "Person": "three", - "Polarity": "Neg", - "Tense": "Fut", - "VerbForm": "Fin", - }, - "Vgmf3s--n--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "three", - "Polarity": "Pos", - "Tense": "Fut", - "VerbForm": "Fin", - }, - "Vgmf3s--n--yi-": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "three", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Fut", - "VerbForm": "Fin", - }, - "Vgmf3s--y--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "three", - "Polarity": "Neg", - "Tense": "Fut", - "VerbForm": "Fin", - }, - "Vgmf3s--y--yi-": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "three", - "Polarity": "Neg", - "Reflex": "Yes", - "Tense": "Fut", - "VerbForm": "Fin", - }, - "Vgmp1p--n--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": 
"Plur", - "Person": "one", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vgmp1p--n--yi-": { - POS: VERB, - "Mood": "Ind", - "Number": "Plur", - "Person": "one", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vgmp1p--y--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Plur", - "Person": "one", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vgmp1p--y--yi-": { - POS: VERB, - "Mood": "Ind", - "Number": "Plur", - "Person": "one", - "Polarity": "Neg", - "Reflex": "Yes", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vgmp1s--n--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "one", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vgmp1s--n--ni--": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "one", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vgmp1s--n--yi-": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "one", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vgmp1s--y--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "one", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vgmp1s--y--yi-": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "one", - "Polarity": "Neg", - "Reflex": "Yes", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vgmp2p--n--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Plur", - "Person": "two", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vgmp2p--n--yi-": { - POS: VERB, - "Mood": "Ind", - "Number": "Plur", - "Person": "two", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vgmp2p--y--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Plur", - "Person": "two", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vgmp2p--y--yi-": { - POS: VERB, - "Mood": "Ind", - "Number": "Plur", - "Person": "two", - "Polarity": "Neg", - "Reflex": "Yes", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vgmp2s--n--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "two", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vgmp2s--n--yi-": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "two", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vgmp2s--y--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "two", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vgmp3---n--ni-": { - POS: VERB, - "Mood": "Ind", - "Person": "three", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vgmp3---n--yi-": { - POS: VERB, - "Mood": "Ind", - "Person": "three", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vgmp3---y--ni-": { - POS: VERB, - "Mood": "Ind", - "Person": "three", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vgmp3---y--yi-": { - POS: VERB, - "Mood": "Ind", - "Person": "three", - "Polarity": "Neg", - "Reflex": "Yes", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vgmp3p--n--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Plur", - "Person": "three", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vgmp3p--n--yi-": { - POS: VERB, - "Mood": "Ind", - "Number": "Plur", - "Person": "three", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vgmp3p--y--ni-": { - POS: VERB, - 
"Mood": "Ind", - "Number": "Plur", - "Person": "three", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vgmp3p--y--yi-": { - POS: VERB, - "Mood": "Ind", - "Number": "Plur", - "Person": "three", - "Polarity": "Neg", - "Reflex": "Yes", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vgmp3s--n--ni": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "three", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vgmp3s--n--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "three", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vgmp3s--n--ni--": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "three", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vgmp3s--n--yi-": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "three", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vgmp3s--y--ni-": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "three", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vgmp3s--y--yi-": { - POS: VERB, - "Mood": "Ind", - "Number": "Sing", - "Person": "three", - "Polarity": "Neg", - "Reflex": "Yes", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vgmq1s--n--ni-": { - POS: VERB, - "Aspect": "Hab", - "Mood": "Ind", - "Number": "Sing", - "Person": "one", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgmq1s--n--yi-": { - POS: VERB, - "Aspect": "Hab", - "Mood": "Ind", - "Number": "Sing", - "Person": "one", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgmq1s--y--ni-": { - POS: VERB, - "Aspect": "Hab", - "Mood": "Ind", - "Number": "Sing", - "Person": "one", - "Polarity": "Neg", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgmq2s--n--ni-": { - POS: VERB, - "Aspect": "Hab", - "Mood": "Ind", - "Number": "Sing", - "Person": "two", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgmq3---n--ni-": { - POS: VERB, - "Aspect": "Hab", - "Mood": "Ind", - "Person": "three", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgmq3p--n--ni-": { - POS: VERB, - "Aspect": "Hab", - "Mood": "Ind", - "Number": "Plur", - "Person": "three", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgmq3p--n--yi-": { - POS: VERB, - "Aspect": "Hab", - "Mood": "Ind", - "Number": "Plur", - "Person": "three", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgmq3s--n--ni-": { - POS: VERB, - "Aspect": "Hab", - "Mood": "Ind", - "Number": "Sing", - "Person": "three", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgmq3s--n--yi-": { - POS: VERB, - "Aspect": "Hab", - "Mood": "Ind", - "Number": "Sing", - "Person": "three", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgmq3s--y--ni-": { - POS: VERB, - "Aspect": "Hab", - "Mood": "Ind", - "Number": "Sing", - "Person": "three", - "Polarity": "Neg", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgms3s--n--ni-": { - POS: VERB, - "Aspect": "Perf", - "Mood": "Ind", - "Number": "Sing", - "Person": "three", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vgp---nnnn-n-p": { - POS: VERB, - "Degree": "Pos", - "Gender": "Neut", - "Polarity": "Pos", - "VerbForm": "Part", - }, - "Vgp---nnyn-n-p": { - POS: VERB, - "Degree": "Pos", - "Gender": "Neut", - "Polarity": "Neg", - "VerbForm": "Part", - }, - "Vgp--pfnnnnn-p": { - POS: 
VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Pos", - "VerbForm": "Part", - }, - "Vgp--sfnnnnn-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "VerbForm": "Part", - }, - "Vgp--smnnnvn-p": { - POS: VERB, - "Case": "Voc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "VerbForm": "Part", - }, - "Vgp--smnynnn-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Neg", - "VerbForm": "Part", - }, - "Vgpa--nann-n-p": { - POS: VERB, - "Degree": "Pos", - "Gender": "Neut", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa--nann-y-p": { - POS: VERB, - "Degree": "Pos", - "Gender": "Neut", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa--nayn-n-p": { - POS: VERB, - "Degree": "Pos", - "Gender": "Neut", - "Polarity": "Neg", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-pfannan-p": { - POS: VERB, - "Case": "Acc", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-pfannay-p": { - POS: VERB, - "Case": "Acc", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-pfanngn-p": { - POS: VERB, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-pfannin-p": { - POS: VERB, - "Case": "Ins", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-pfannnn-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-pfannny-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-pmannan-p": { - POS: VERB, - "Case": "Acc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-pmanndn-p": { - POS: VERB, - "Case": "Dat", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-pmanngn-p": { - POS: VERB, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-pmannin-p": { - POS: VERB, - "Case": "Ins", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-pmannnn-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-pmannny-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-pmanygn-p": { - POS: VERB, - "Case": 
"Gen", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-pmaynny-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Neg", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-pmpnnnn-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpa-sfannan-p": { - POS: VERB, - "Case": "Acc", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-sfannay-p": { - POS: VERB, - "Case": "Acc", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-sfanndn-p": { - POS: VERB, - "Case": "Dat", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-sfanngn-p": { - POS: VERB, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-sfanngy-p": { - POS: VERB, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-sfannin-p": { - POS: VERB, - "Case": "Ins", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-sfannnn-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-sfannny-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-sfannny-p-": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-sfanynn-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-sfaynnn-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Neg", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-smannan-p": { - POS: VERB, - "Case": "Acc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-smannay-p": { - POS: VERB, - "Case": "Acc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-smanngn-p": { - POS: VERB, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-smanngy-p": { - POS: VERB, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": 
"Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-smannin-p": { - POS: VERB, - "Case": "Ins", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-smanniy-p": { - POS: VERB, - "Case": "Ins", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-smannln-p": { - POS: VERB, - "Case": "Loc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-smannnn-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-smannny-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-smanygn-p": { - POS: VERB, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-smanyin-p": { - POS: VERB, - "Case": "Ins", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-smanynn-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpa-smaynnn-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Neg", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpf-smannln-p": { - POS: VERB, - "Case": "Loc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Fut", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpf-smpnnan-p": { - POS: VERB, - "Case": "Acc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Fut", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpf-smpnndn-p": { - POS: VERB, - "Case": "Dat", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Fut", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp--fpnn-n-p": { - POS: VERB, - "Degree": "Pos", - "Gender": "Fem", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp--npnn-n-p": { - POS: VERB, - "Degree": "Pos", - "Gender": "Neut", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp--npnn-y-p": { - POS: VERB, - "Degree": "Pos", - "Gender": "Neut", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp--npyn-n-p": { - POS: VERB, - "Degree": "Pos", - "Gender": "Neut", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp--npyn-y-p": { - POS: VERB, - "Degree": "Pos", - "Gender": "Neut", - "Polarity": "Neg", - "Reflex": "Yes", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-pfannan-p": { - POS: VERB, - "Case": "Acc", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-pfanndn-p": { - POS: VERB, - "Case": "Dat", - "Degree": "Pos", - 
"Gender": "Fem", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-pfanngn-p": { - POS: VERB, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-pfanngy-p": { - POS: VERB, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-pfannin-p": { - POS: VERB, - "Case": "Ins", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-pfannnn-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-pfannny-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-pfpnnan-p": { - POS: VERB, - "Case": "Acc", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-pfpnndn-p": { - POS: VERB, - "Case": "Dat", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-pfpnngn-p": { - POS: VERB, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-pfpnnin-p": { - POS: VERB, - "Case": "Ins", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-pfpnnln-p": { - POS: VERB, - "Case": "Loc", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-pfpnnnn-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-pfpnnny-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-pfpnygn-p": { - POS: VERB, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-pfpnyin-p": { - POS: VERB, - "Case": "Ins", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-pfpynan-p": { - POS: VERB, - "Case": "Acc", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-pfpyngn-p": { - POS: VERB, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-pfpynin-p": { - POS: VERB, - "Case": "Ins", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-pfpynnn-p": { - POS: VERB, - "Case": 
"Nom", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-pmannan-p": { - POS: VERB, - "Case": "Acc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-pmannay-p": { - POS: VERB, - "Case": "Acc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-pmanndn-p": { - POS: VERB, - "Case": "Dat", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-pmanngn-p": { - POS: VERB, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-pmanngy-p": { - POS: VERB, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-pmannin-p": { - POS: VERB, - "Case": "Ins", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-pmannln-p": { - POS: VERB, - "Case": "Loc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-pmannnn-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-pmannny-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-pmanyan-p": { - POS: VERB, - "Case": "Acc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-pmayndn-p": { - POS: VERB, - "Case": "Dat", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-pmaynin-p": { - POS: VERB, - "Case": "Ins", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-pmaynnn-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-pmpnnan-p": { - POS: VERB, - "Case": "Acc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-pmpnndn-p": { - POS: VERB, - "Case": "Dat", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-pmpnngn-p": { - POS: VERB, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-pmpnnin-p": { - POS: VERB, - "Case": "Ins", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - 
"Vgpp-pmpnniy-p": { - POS: VERB, - "Case": "Ins", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-pmpnnnn-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-pmpnygn-p": { - POS: VERB, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-pmpnyin-p": { - POS: VERB, - "Case": "Ins", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-pmpynan-p": { - POS: VERB, - "Case": "Acc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-pmpyngn-p": { - POS: VERB, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-pmpynnn-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-pmpyygn-p": { - POS: VERB, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-sfannan-p": { - POS: VERB, - "Case": "Acc", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-sfannay-p": { - POS: VERB, - "Case": "Acc", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-sfanndn-p": { - POS: VERB, - "Case": "Dat", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-sfanndn-p-": { - POS: VERB, - "Case": "Dat", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-sfanngn-p": { - POS: VERB, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-sfanngy-p": { - POS: VERB, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-sfannin-p": { - POS: VERB, - "Case": "Ins", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-sfannln-p": { - POS: VERB, - "Case": "Loc", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-sfannnn-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-sfanyny-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": 
"Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-sfaynin-p": { - POS: VERB, - "Case": "Ins", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-sfaynnn-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-sfpnnan-p": { - POS: VERB, - "Case": "Acc", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-sfpnndn-p": { - POS: VERB, - "Case": "Dat", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-sfpnngn-p": { - POS: VERB, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-sfpnnin-p": { - POS: VERB, - "Case": "Ins", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-sfpnnln-p": { - POS: VERB, - "Case": "Loc", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-sfpnnnn-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-sfpnyan-p": { - POS: VERB, - "Case": "Acc", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-sfpnygn-p": { - POS: VERB, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-sfpnyin-p": { - POS: VERB, - "Case": "Ins", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-sfpnynn-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-sfpyngn-p": { - POS: VERB, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-sfpynnn-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-smannan-p": { - POS: VERB, - "Case": "Acc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-smanndy-p": { - POS: VERB, - "Case": "Dat", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-smanngn-p": { - POS: VERB, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-smannin-p": { - POS: VERB, - "Case": "Ins", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Pres", - 
"VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-smannln-p": { - POS: VERB, - "Case": "Loc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-smannly-p": { - POS: VERB, - "Case": "Loc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-smannnn-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-smaynin-p": { - POS: VERB, - "Case": "Ins", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-smaynnn-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-smaynny-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Neg", - "Reflex": "Yes", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Act", - }, - "Vgpp-smpnnan-p": { - POS: VERB, - "Case": "Acc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-smpnndn-p": { - POS: VERB, - "Case": "Dat", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-smpnngn-p": { - POS: VERB, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-smpnnin-p": { - POS: VERB, - "Case": "Ins", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-smpnnnn-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-smpnygn-p": { - POS: VERB, - "Case": "Gen", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-smpnynn-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgpp-smpynnn-p": { - POS: VERB, - "Case": "Nom", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Neg", - "Tense": "Pres", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps--mpnngn-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Gen", - "Degree": "Pos", - "Gender": "Masc", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps--npnn-n-p": { - POS: VERB, - "Aspect": "Perf", - "Degree": "Pos", - "Gender": "Neut", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps--npnn-y-p": { - POS: VERB, - "Aspect": "Perf", - "Degree": "Pos", - "Gender": "Neut", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps--npyn-n-p": { - POS: VERB, - "Aspect": "Perf", - "Degree": "Pos", - "Gender": "Neut", - "Polarity": "Neg", - "Tense": "Past", - 
"VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-pfpnnan-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Acc", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-pfpnndn-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Dat", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-pfpnngn-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Gen", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-pfpnnin-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Ins", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-pfpnnln-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Loc", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-pfpnnnn-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Nom", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-pmpnnan-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Acc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-pmpnnay-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Acc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-pmpnndn-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Dat", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-pmpnngn-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Gen", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-pmpnnin-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Ins", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-pmpnnln-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Loc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-pmpnnnn-": { - POS: VERB, - "Aspect": "Perf", - "Case": "Nom", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-pmpnnnn-n": { - POS: VERB, - "Aspect": "Perf", - "Case": "Nom", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-pmpnnnn-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Nom", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-pmpnygn-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Gen", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-pmpnynn-p": { - POS: VERB, - "Aspect": "Perf", - "Case": 
"Nom", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-pmpynin-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Ins", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Neg", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-pmpynnn-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Nom", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Neg", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-pmsnnnn-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Nom", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - }, - "Vgps-sfpnnan-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Acc", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-sfpnndn-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Dat", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-sfpnngn-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Gen", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-sfpnnin-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Ins", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-sfpnnln-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Loc", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-sfpnnnn-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Nom", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-sfpynan-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Acc", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Neg", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-sfpyngn-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Gen", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Neg", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-sfpynnn-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Nom", - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - "Polarity": "Neg", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-smpnnan-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Acc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-smpnnay-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Acc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-smpnndn-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Dat", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-smpnngn-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Gen", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - 
"Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-smpnnin-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Ins", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-smpnnln-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Loc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-smpnnnn-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Nom", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-smpnnny-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Nom", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Reflex": "Yes", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-smpnynn-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Nom", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-smpynan-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Acc", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Neg", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-smpyngn-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Gen", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Neg", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-smpynnn-p": { - POS: VERB, - "Aspect": "Perf", - "Case": "Nom", - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - "Polarity": "Neg", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "Vgps-snpnn-n-p": { - POS: VERB, - "Aspect": "Perf", - "Degree": "Pos", - "Gender": "Neut", - "Number": "Sing", - "Polarity": "Pos", - "Tense": "Past", - "VerbForm": "Part", - "Voice": "Pass", - }, - "X-": {POS: X}, - "Xf": {POS: X, "Foreign": "Yes"}, - "Xh": {POS: SYM}, - "Ya": {POS: X, "Abbr": "Yes"}, - "Ys": {POS: X, "Abbr": "Yes"}, - "Z": {POS: PUNCT}, -} diff --git a/spacy/lang/lt/tokenizer_exceptions.py b/spacy/lang/lt/tokenizer_exceptions.py index 4287b26dd..118fb2190 100644 --- a/spacy/lang/lt/tokenizer_exceptions.py +++ b/spacy/lang/lt/tokenizer_exceptions.py @@ -1,270 +1,15 @@ -# coding: utf8 -from __future__ import unicode_literals - +from ..tokenizer_exceptions import BASE_EXCEPTIONS from ...symbols import ORTH +from ...util import update_exc + _exc = {} -for orth in [ - "n-tosios", - "?!", - # "G.", - # "J. E.", - # "J. 
Em.", - # "J.E.", - # "J.Em.", - # "K.", - # "N.", - # "V.", - # "Vt.", - # "a.", - # "a.k.", - # "a.s.", - # "adv.", - # "akad.", - # "aklg.", - # "akt.", - # "al.", - # "ang.", - # "angl.", - # "aps.", - # "apskr.", - # "apyg.", - # "arbat.", - # "asist.", - # "asm.", - # "asm.k.", - # "asmv.", - # "atk.", - # "atsak.", - # "atsisk.", - # "atsisk.sąsk.", - # "atv.", - # "aut.", - # "avd.", - # "b.k.", - # "baud.", - # "biol.", - # "bkl.", - # "bot.", - # "bt.", - # "buv.", - # "ch.", - # "chem.", - # "corp.", - # "d.", - # "dab.", - # "dail.", - # "dek.", - # "deš.", - # "dir.", - # "dirig.", - # "doc.", - # "dol.", - # "dr.", - # "drp.", - # "dvit.", - # "dėst.", - # "dš.", - # "dž.", - # "e.b.", - # "e.bankas", - # "e.p.", - # "e.parašas", - # "e.paštas", - # "e.v.", - # "e.valdžia", - # "egz.", - # "eil.", - # "ekon.", - # "el.", - # "el.bankas", - # "el.p.", - # "el.parašas", - # "el.paštas", - # "el.valdžia", - # "etc.", - # "ež.", - # "fak.", - # "faks.", - # "feat.", - # "filol.", - # "filos.", - # "g.", - # "gen.", - # "geol.", - # "gerb.", - # "gim.", - # "gr.", - # "gv.", - # "gyd.", - # "gyv.", - # "habil.", - # "inc.", - # "insp.", - # "inž.", - # "ir pan.", - # "ir t. t.", - # "isp.", - # "istor.", - # "it.", - # "just.", - # "k.", - # "k. a.", - # "k.a.", - # "kab.", - # "kand.", - # "kart.", - # "kat.", - # "ketv.", - # "kh.", - # "kl.", - # "kln.", - # "km.", - # "kn.", - # "koresp.", - # "kpt.", - # "kr.", - # "kt.", - # "kub.", - # "kun.", - # "kv.", - # "kyš.", - # "l. e. p.", - # "l.e.p.", - # "lenk.", - # "liet.", - # "lot.", - # "lt.", - # "ltd.", - # "ltn.", - # "m.", - # "m.e..", - # "m.m.", - # "mat.", - # "med.", - # "mgnt.", - # "mgr.", - # "min.", - # "mjr.", - # "ml.", - # "mln.", - # "mlrd.", - # "mob.", - # "mok.", - # "moksl.", - # "mokyt.", - # "mot.", - # "mr.", - # "mst.", - # "mstl.", - # "mėn.", - # "nkt.", - # "no.", - # "nr.", - # "ntk.", - # "nuotr.", - # "op.", - # "org.", - # "orig.", - # "p.", - # "p.d.", - # "p.m.e.", - # "p.s.", - # "pab.", - # "pan.", - # "past.", - # "pav.", - # "pavad.", - # "per.", - # "perd.", - # "pirm.", - # "pl.", - # "plg.", - # "plk.", - # "pr.", - # "pr.Kr.", - # "pranc.", - # "proc.", - # "prof.", - # "prom.", - # "prot.", - # "psl.", - # "pss.", - # "pvz.", - # "pšt.", - # "r.", - # "raj.", - # "red.", - # "rez.", - # "rež.", - # "rus.", - # "rš.", - # "s.", - # "sav.", - # "saviv.", - # "sek.", - # "sekr.", - # "sen.", - # "sh.", - # "sk.", - # "skg.", - # "skv.", - # "skyr.", - # "sp.", - # "spec.", - # "sr.", - # "st.", - # "str.", - # "stud.", - # "sąs.", - # "t.", - # "t. p.", - # "t. 
y.", - # "t.p.", - # "t.t.", - # "t.y.", - # "techn.", - # "tel.", - # "teol.", - # "th.", - # "tir.", - # "trit.", - # "trln.", - # "tšk.", - # "tūks.", - # "tūkst.", - # "up.", - # "upl.", - # "v.s.", - # "vad.", - # "val.", - # "valg.", - # "ved.", - # "vert.", - # "vet.", - # "vid.", - # "virš.", - # "vlsč.", - # "vnt.", - # "vok.", - # "vs.", - # "vtv.", - # "vv.", - # "vyr.", - # "vyresn.", - # "zool.", - # "Įn", - # "įl.", - # "š.m.", - # "šnek.", - # "šv.", - # "švč.", - # "ž.ū.", - # "žin.", - # "žml.", - # "žr.", -]: +for orth in ["n-tosios", "?!"]: _exc[orth] = [{ORTH: orth}] -TOKENIZER_EXCEPTIONS = _exc +mod_base_exceptions = { + exc: val for exc, val in BASE_EXCEPTIONS.items() if not exc.endswith(".") +} +del mod_base_exceptions["8)"] +TOKENIZER_EXCEPTIONS = update_exc(mod_base_exceptions, _exc) diff --git a/spacy/lang/lv/__init__.py b/spacy/lang/lv/__init__.py index bb8c0763b..142bc706e 100644 --- a/spacy/lang/lv/__init__.py +++ b/spacy/lang/lv/__init__.py @@ -1,14 +1,8 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from ...language import Language -from ...attrs import LANG class LatvianDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = lambda text: "lv" stop_words = STOP_WORDS diff --git a/spacy/lang/lv/stop_words.py b/spacy/lang/lv/stop_words.py index 075ad6347..2685c2430 100644 --- a/spacy/lang/lv/stop_words.py +++ b/spacy/lang/lv/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Source: https://github.com/stopwords-iso/stopwords-lv STOP_WORDS = set( diff --git a/spacy/lang/ml/__init__.py b/spacy/lang/ml/__init__.py index d052ded1b..cfad52261 100644 --- a/spacy/lang/ml/__init__.py +++ b/spacy/lang/ml/__init__.py @@ -1,12 +1,10 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS - +from .lex_attrs import LEX_ATTRS from ...language import Language class MalayalamDefaults(Language.Defaults): + lex_attr_getters = LEX_ATTRS stop_words = STOP_WORDS diff --git a/spacy/lang/ml/examples.py b/spacy/lang/ml/examples.py index a2a0ed10e..9794eab29 100644 --- a/spacy/lang/ml/examples.py +++ b/spacy/lang/ml/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
diff --git a/spacy/lang/ml/lex_attrs.py b/spacy/lang/ml/lex_attrs.py index 468ad88f8..9ac19b6a7 100644 --- a/spacy/lang/ml/lex_attrs.py +++ b/spacy/lang/ml/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/ml/stop_words.py b/spacy/lang/ml/stop_words.py index 8bd6a7e02..441e93586 100644 --- a/spacy/lang/ml/stop_words.py +++ b/spacy/lang/ml/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ അത് diff --git a/spacy/lang/mr/__init__.py b/spacy/lang/mr/__init__.py index fd95f9354..af0c49878 100644 --- a/spacy/lang/mr/__init__.py +++ b/spacy/lang/mr/__init__.py @@ -1,14 +1,8 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from ...language import Language -from ...attrs import LANG class MarathiDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = lambda text: "mr" stop_words = STOP_WORDS diff --git a/spacy/lang/mr/stop_words.py b/spacy/lang/mr/stop_words.py index 0b0cd035d..9b0cee951 100644 --- a/spacy/lang/mr/stop_words.py +++ b/spacy/lang/mr/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Source: https://github.com/stopwords-iso/stopwords-mr/blob/master/stopwords-mr.txt, https://github.com/6/stopwords-json/edit/master/dist/mr.json STOP_WORDS = set( """ diff --git a/spacy/lang/nb/__init__.py b/spacy/lang/nb/__init__.py index e6c58b7de..62d7707f3 100644 --- a/spacy/lang/nb/__init__.py +++ b/spacy/lang/nb/__init__.py @@ -1,35 +1,21 @@ -# coding: utf8 -from __future__ import unicode_literals - +from typing import Optional +from thinc.api import Model from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES from .punctuation import TOKENIZER_SUFFIXES from .stop_words import STOP_WORDS -from .morph_rules import MORPH_RULES from .syntax_iterators import SYNTAX_ITERATORS -from .tag_map import TAG_MAP - -from ..tokenizer_exceptions import BASE_EXCEPTIONS -from ..norm_exceptions import BASE_NORMS from ...language import Language -from ...attrs import LANG, NORM -from ...util import update_exc, add_lookups +from ...pipeline import Lemmatizer class NorwegianDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = lambda text: "nb" - lex_attr_getters[NORM] = add_lookups( - Language.Defaults.lex_attr_getters[NORM], BASE_NORMS - ) - tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) + tokenizer_exceptions = TOKENIZER_EXCEPTIONS prefixes = TOKENIZER_PREFIXES infixes = TOKENIZER_INFIXES suffixes = TOKENIZER_SUFFIXES - stop_words = STOP_WORDS - morph_rules = MORPH_RULES - tag_map = TAG_MAP syntax_iterators = SYNTAX_ITERATORS + stop_words = STOP_WORDS class Norwegian(Language): @@ -37,4 +23,14 @@ class Norwegian(Language): Defaults = NorwegianDefaults +@Norwegian.factory( + "lemmatizer", + assigns=["token.lemma"], + default_config={"model": None, "mode": "rule"}, + default_score_weights={"lemma_acc": 1.0}, +) +def make_lemmatizer(nlp: Language, model: Optional[Model], name: str, mode: str): + return Lemmatizer(nlp.vocab, model, name, mode=mode) + + __all__ = ["Norwegian"] diff --git a/spacy/lang/nb/examples.py b/spacy/lang/nb/examples.py index c15426ded..b1a63ad74 100644 --- a/spacy/lang/nb/examples.py +++ b/spacy/lang/nb/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 
-from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/nb/morph_rules.py b/spacy/lang/nb/morph_rules.py deleted file mode 100644 index e20814535..000000000 --- a/spacy/lang/nb/morph_rules.py +++ /dev/null @@ -1,668 +0,0 @@ -# encoding: utf8 -from __future__ import unicode_literals - -from ...symbols import LEMMA, PRON_LEMMA - -# This dict includes all the PRON and DET tag combinations found in the -# dataset developed by Schibsted, Nasjonalbiblioteket and LTG (to be published -# autumn 2018) and the rarely used polite form. - -MORPH_RULES = { - "PRON__Animacy=Anim|Case=Nom|Number=Sing|Person=1|PronType=Prs": { - "jeg": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "One", - "Number": "Sing", - "Case": "Nom", - } - }, - "PRON__Animacy=Anim|Case=Nom|Number=Sing|Person=2|PronType=Prs": { - "du": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Two", - "Number": "Sing", - "Case": "Nom", - }, - # polite form, not sure about the tag - "De": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Two", - "Number": "Sing", - "Case": "Nom", - "Polite": "Form", - }, - }, - "PRON__Animacy=Anim|Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs": { - "hun": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Gender": "Fem", - "Case": "Nom", - } - }, - "PRON__Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs": { - "han": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Gender": "Masc", - "Case": "Nom", - } - }, - "PRON__Gender=Neut|Number=Sing|Person=3|PronType=Prs": { - "det": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Gender": "Neut", - }, - "alt": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Gender": "Neut", - }, - "intet": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Gender": "Neut", - }, - "noe": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Number": "Sing", - "Person": "Three", - "Gender": "Neut", - }, - }, - "PRON__Animacy=Anim|Case=Nom|Number=Plur|Person=1|PronType=Prs": { - "vi": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "One", - "Number": "Plur", - "Case": "Nom", - } - }, - "PRON__Animacy=Anim|Case=Nom|Number=Plur|Person=2|PronType=Prs": { - "dere": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Two", - "Number": "Plur", - "Case": "Nom", - } - }, - "PRON__Case=Nom|Number=Plur|Person=3|PronType=Prs": { - "de": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Number": "Plur", - "Case": "Nom", - } - }, - "PRON__Animacy=Anim|Case=Acc|Number=Sing|Person=1|PronType=Prs": { - "meg": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "One", - "Number": "Sing", - "Case": "Acc", - } - }, - "PRON__Animacy=Anim|Case=Acc|Number=Sing|Person=2|PronType=Prs": { - "deg": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Two", - "Number": "Sing", - "Case": "Acc", - }, - # polite form, not sure about the tag - "Dem": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Two", - "Number": "Sing", - "Case": "Acc", - "Polite": "Form", - }, - }, - "PRON__Animacy=Anim|Case=Acc|Gender=Fem|Number=Sing|Person=3|PronType=Prs": { - "henne": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Gender": "Fem", - "Case": "Acc", - } - }, - "PRON__Animacy=Anim|Case=Acc|Gender=Masc|Number=Sing|Person=3|PronType=Prs": { - 
"ham": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Gender": "Masc", - "Case": "Acc", - }, - "han": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Gender": "Masc", - "Case": "Acc", - }, - }, - "PRON__Animacy=Anim|Case=Acc|Number=Plur|Person=1|PronType=Prs": { - "oss": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "One", - "Number": "Plur", - "Case": "Acc", - } - }, - "PRON__Animacy=Anim|Case=Acc|Number=Plur|Person=2|PronType=Prs": { - "dere": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Two", - "Number": "Plur", - "Case": "Acc", - } - }, - "PRON__Case=Acc|Number=Plur|Person=3|PronType=Prs": { - "dem": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Number": "Plur", - "Case": "Acc", - } - }, - "PRON__Case=Acc|Reflex=Yes": { - "seg": { - LEMMA: PRON_LEMMA, - "Person": "Three", - "Number": ("Sing", "Plur"), - "Reflex": "Yes", - } - }, - "PRON__Animacy=Anim|Case=Nom|Number=Sing|PronType=Prs": { - "man": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Number": "Sing", "Case": "Nom"} - }, - "DET__Gender=Masc|Number=Sing|Poss=Yes": { - "min": { - LEMMA: "min", - "Person": "One", - "Number": "Sing", - "Poss": "Yes", - "Gender": "Masc", - }, - "din": { - LEMMA: "din", - "Person": "Two", - "Number": "Sing", - "Poss": "Yes", - "Gender": "Masc", - }, - "hennes": { - LEMMA: "hennes", - "Person": "Three", - "Number": "Sing", - "Poss": "Yes", - "Gender": "Masc", - }, - "hans": { - LEMMA: "hans", - "Person": "Three", - "Number": "Sing", - "Poss": "Yes", - "Gender": "Masc", - }, - "sin": { - LEMMA: "sin", - "Person": "Three", - "Number": "Sing", - "Poss": "Yes", - "Gender": "Masc", - "Reflex": "Yes", - }, - "vår": { - LEMMA: "vår", - "Person": "One", - "Number": "Sing", - "Poss": "Yes", - "Gender": "Masc", - }, - "deres": { - LEMMA: "deres", - "Person": ("Two", "Three"), - "Number": "Sing", - "Poss": "Yes", - "Gender": "Masc", - }, - # polite form, not sure about the tag - "Deres": { - LEMMA: "Deres", - "Person": "Three", - "Number": "Sing", - "Poss": "Yes", - "Gender": "Masc", - "Polite": "Form", - }, - }, - "DET__Gender=Fem|Number=Sing|Poss=Yes": { - "mi": { - LEMMA: "min", - "Person": "One", - "Number": "Sing", - "Poss": "Yes", - "Gender": "Fem", - }, - "di": { - LEMMA: "din", - "Person": "Two", - "Number": "Sing", - "Poss": "Yes", - "Gender": "Fem", - }, - "hennes": { - LEMMA: "hennes", - "Person": "Three", - "Number": "Sing", - "Poss": "Yes", - "Gender": "Fem", - }, - "hans": { - LEMMA: "hans", - "Person": "Three", - "Number": "Sing", - "Poss": "Yes", - "Gender": "Fem", - }, - "si": { - LEMMA: "sin", - "Person": "Three", - "Number": "Sing", - "Poss": "Yes", - "Gender": "Fem", - "Reflex": "Yes", - }, - "vår": { - LEMMA: "vår", - "Person": "One", - "Number": "Sing", - "Poss": "Yes", - "Gender": "Fem", - }, - "deres": { - LEMMA: "deres", - "Person": ("Two", "Three"), - "Number": "Sing", - "Poss": "Yes", - "Gender": "Fem", - }, - # polite form, not sure about the tag - "Deres": { - LEMMA: "Deres", - "Person": "Three", - "Number": "Sing", - "Poss": "Yes", - "Gender": "Fem", - "Polite": "Form", - }, - }, - "DET__Gender=Neut|Number=Sing|Poss=Yes": { - "mitt": { - LEMMA: "min", - "Person": "One", - "Number": "Sing", - "Poss": "Yes", - "Gender": "Neut", - }, - "ditt": { - LEMMA: "din", - "Person": "Two", - "Number": "Sing", - "Poss": "Yes", - "Gender": "Neut", - }, - "hennes": { - LEMMA: "hennes", - "Person": "Three", - "Number": "Sing", - "Poss": "Yes", - "Gender": "Neut", - }, - "hans": { - 
LEMMA: "hans", - "Person": "Three", - "Number": "Sing", - "Poss": "Yes", - "Gender": "Neut", - }, - "sitt": { - LEMMA: "sin", - "Person": "Three", - "Number": "Sing", - "Poss": "Yes", - "Gender": "Neut", - "Reflex": "Yes", - }, - "vårt": { - LEMMA: "vår", - "Person": "One", - "Number": "Sing", - "Poss": "Yes", - "Gender": "Neut", - }, - "deres": { - LEMMA: "deres", - "Person": ("Two", "Three"), - "Number": "Sing", - "Poss": "Yes", - "Gender": "Neut", - }, - # polite form, not sure about the tag - "Deres": { - LEMMA: "Deres", - "Person": "Three", - "Number": "Sing", - "Poss": "Yes", - "Gender": "Neut", - "Polite": "Form", - }, - }, - "DET__Number=Plur|Poss=Yes": { - "mine": {LEMMA: "min", "Person": "One", "Number": "Plur", "Poss": "Yes"}, - "dine": {LEMMA: "din", "Person": "Two", "Number": "Plur", "Poss": "Yes"}, - "hennes": {LEMMA: "hennes", "Person": "Three", "Number": "Plur", "Poss": "Yes"}, - "hans": {LEMMA: "hans", "Person": "Three", "Number": "Plur", "Poss": "Yes"}, - "sine": { - LEMMA: "sin", - "Person": "Three", - "Number": "Plur", - "Poss": "Yes", - "Reflex": "Yes", - }, - "våre": {LEMMA: "vår", "Person": "One", "Number": "Plur", "Poss": "Yes"}, - "deres": { - LEMMA: "deres", - "Person": ("Two", "Three"), - "Number": "Plur", - "Poss": "Yes", - }, - }, - "PRON__Animacy=Anim|Number=Plur|PronType=Rcp": { - "hverandre": {LEMMA: PRON_LEMMA, "PronType": "Rcp", "Number": "Plur"} - }, - "DET__Number=Plur|Poss=Yes|PronType=Rcp": { - "hverandres": { - LEMMA: "hverandres", - "PronType": "Rcp", - "Number": "Plur", - "Poss": "Yes", - } - }, - "PRON___": {"som": {LEMMA: PRON_LEMMA}, "ikkenoe": {LEMMA: PRON_LEMMA}}, - "PRON__PronType=Int": {"hva": {LEMMA: PRON_LEMMA, "PronType": "Int"}}, - "PRON__Animacy=Anim|PronType=Int": {"hvem": {LEMMA: PRON_LEMMA, "PronType": "Int"}}, - "PRON__Animacy=Anim|Poss=Yes|PronType=Int": { - "hvis": {LEMMA: PRON_LEMMA, "PronType": "Int", "Poss": "Yes"} - }, - "PRON__Number=Plur|Person=3|PronType=Prs": { - "noen": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Number": "Plur", - "Person": "Three", - }, - "ingen": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Number": "Plur", - "Person": "Three", - }, - "alle": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Number": "Plur", - "Person": "Three", - }, - }, - "PRON__Gender=Fem,Masc|Number=Sing|Person=3|PronType=Prs": { - "noen": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Number": "Sing", - "Person": "Three", - "Gender": ("Fem", "Masc"), - }, - "den": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Number": "Sing", - "Person": "Three", - "Gender": ("Fem", "Masc"), - }, - "ingen": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Number": "Sing", - "Person": "Three", - "Gender": ("Fem", "Masc"), - "Polarity": "Neg", - }, - }, - "PRON__Number=Sing": {"ingenting": {LEMMA: PRON_LEMMA, "Number": "Sing"}}, - "PRON__Animacy=Anim|Number=Sing|PronType=Prs": { - "en": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Number": "Sing"} - }, - "PRON__Animacy=Anim|Case=Gen,Nom|Number=Sing|PronType=Prs": { - "ens": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Number": "Sing", - "Case": ("Gen", "Nom"), - } - }, - "PRON__Animacy=Anim|Case=Gen|Number=Sing|PronType=Prs": { - "ens": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Number": "Sing", "Case": "Gen"} - }, - "DET__Case=Gen|Gender=Masc|Number=Sing": { - "ens": {LEMMA: "en", "Number": "Sing", "Case": "Gen"} - }, - "DET__Gender=Masc|Number=Sing": { - "enhver": {LEMMA: "enhver", "Number": "Sing", "Gender": "Masc"}, - "all": {LEMMA: "all", "Number": "Sing", "Gender": "Masc"}, - "hver": {LEMMA: "hver", 
"Number": "Sing", "Gender": "Masc"}, - "noen": {LEMMA: "noen", "Gender": "Masc", "Number": "Sing"}, - "noe": {LEMMA: "noen", "Gender": "Masc", "Number": "Sing"}, - "en": {LEMMA: "en", "Number": "Sing", "Gender": "Neut"}, - "ingen": {LEMMA: "ingen", "Gender": "Masc", "Number": "Sing"}, - }, - "DET__Gender=Fem|Number=Sing": { - "enhver": {LEMMA: "enhver", "Number": "Sing", "Gender": "Fem"}, - "all": {LEMMA: "all", "Number": "Sing", "Gender": "Fem"}, - "hver": {LEMMA: "hver", "Number": "Sing", "Gender": "Fem"}, - "noen": {LEMMA: "noen", "Gender": "Fem", "Number": "Sing"}, - "noe": {LEMMA: "noen", "Gender": "Fem", "Number": "Sing"}, - "ei": {LEMMA: "en", "Number": "Sing", "Gender": "Fem"}, - }, - "DET__Gender=Neut|Number=Sing": { - "ethvert": {LEMMA: "enhver", "Number": "Sing", "Gender": "Neut"}, - "alt": {LEMMA: "all", "Number": "Sing", "Gender": "Neut"}, - "hvert": {LEMMA: "hver", "Number": "Sing", "Gender": "Neut"}, - "noe": {LEMMA: "noen", "Number": "Sing", "Gender": "Neut"}, - "intet": {LEMMA: "ingen", "Gender": "Neut", "Number": "Sing"}, - "et": {LEMMA: "en", "Number": "Sing", "Gender": "Neut"}, - }, - "DET__Gender=Neut|Number=Sing|PronType=Int": { - "hvilket": { - LEMMA: "hvilken", - "PronType": "Int", - "Number": "Sing", - "Gender": "Neut", - } - }, - "DET__Gender=Fem|Number=Sing|PronType=Int": { - "hvilken": { - LEMMA: "hvilken", - "PronType": "Int", - "Number": "Sing", - "Gender": "Fem", - } - }, - "DET__Gender=Masc|Number=Sing|PronType=Int": { - "hvilken": { - LEMMA: "hvilken", - "PronType": "Int", - "Number": "Sing", - "Gender": "Masc", - } - }, - "DET__Number=Plur|PronType=Int": { - "hvilke": {LEMMA: "hvilken", "PronType": "Int", "Number": "Plur"} - }, - "DET__Number=Plur": { - "alle": {LEMMA: "all", "Number": "Plur"}, - "noen": {LEMMA: "noen", "Number": "Plur"}, - "egne": {LEMMA: "egen", "Number": "Plur"}, - "ingen": {LEMMA: "ingen", "Number": "Plur"}, - }, - "DET__Gender=Masc|Number=Sing|PronType=Dem": { - "den": {LEMMA: "den", "PronType": "Dem", "Number": "Sing", "Gender": "Masc"}, - "slik": {LEMMA: "slik", "PronType": "Dem", "Number": "Sing", "Gender": "Masc"}, - "denne": { - LEMMA: "denne", - "PronType": "Dem", - "Number": "Sing", - "Gender": "Masc", - }, - }, - "DET__Gender=Fem|Number=Sing|PronType=Dem": { - "den": {LEMMA: "den", "PronType": "Dem", "Number": "Sing", "Gender": "Fem"}, - "slik": {LEMMA: "slik", "PronType": "Dem", "Number": "Sing", "Gender": "Fem"}, - "denne": {LEMMA: "denne", "PronType": "Dem", "Number": "Sing", "Gender": "Fem"}, - }, - "DET__Gender=Neut|Number=Sing|PronType=Dem": { - "det": {LEMMA: "det", "PronType": "Dem", "Number": "Sing", "Gender": "Neut"}, - "slikt": {LEMMA: "slik", "PronType": "Dem", "Number": "Sing", "Gender": "Neut"}, - "dette": { - LEMMA: "dette", - "PronType": "Dem", - "Number": "Sing", - "Gender": "Neut", - }, - }, - "DET__Number=Plur|PronType=Dem": { - "disse": {LEMMA: "disse", "PronType": "Dem", "Number": "Plur"}, - "andre": {LEMMA: "annen", "PronType": "Dem", "Number": "Plur"}, - "de": {LEMMA: "de", "PronType": "Dem", "Number": "Plur"}, - "slike": {LEMMA: "slik", "PronType": "Dem", "Number": "Plur"}, - }, - "DET__Definite=Ind|Gender=Masc|Number=Sing|PronType=Dem": { - "annen": {LEMMA: "annen", "PronType": "Dem", "Number": "Sing", "Gender": "Masc"} - }, - "DET__Definite=Ind|Gender=Fem|Number=Sing|PronType=Dem": { - "annen": {LEMMA: "annen", "PronType": "Dem", "Number": "Sing", "Gender": "Fem"} - }, - "DET__Definite=Ind|Gender=Neut|Number=Sing|PronType=Dem": { - "annet": {LEMMA: "annen", "PronType": "Dem", "Number": "Sing", 
"Gender": "Neut"} - }, - "DET__Case=Gen|Definite=Ind|Gender=Masc|Number=Sing|PronType=Dem": { - "annens": { - LEMMA: "annnen", - "PronType": "Dem", - "Number": "Sing", - "Gender": "Masc", - "Case": "Gen", - } - }, - "DET__Case=Gen|Number=Plur|PronType=Dem": { - "andres": {LEMMA: "annen", "PronType": "Dem", "Number": "Plur", "Case": "Gen"} - }, - "DET__Case=Gen|Gender=Fem|Number=Sing|PronType=Dem": { - "dens": { - LEMMA: "den", - "PronType": "Dem", - "Number": "Sing", - "Gender": "Fem", - "Case": "Gen", - } - }, - "DET__Case=Gen|Gender=Masc|Number=Sing|PronType=Dem": { - "hvis": { - LEMMA: "hvis", - "PronType": "Dem", - "Number": "Sing", - "Gender": "Masc", - "Case": "Gen", - }, - "dens": { - LEMMA: "den", - "PronType": "Dem", - "Number": "Sing", - "Gender": "Masc", - "Case": "Gen", - }, - }, - "DET__Case=Gen|Gender=Neut|Number=Sing|PronType=Dem": { - "dets": { - LEMMA: "det", - "PronType": "Dem", - "Number": "Sing", - "Gender": "Neut", - "Case": "Gen", - } - }, - "DET__Case=Gen|Number=Plur": { - "alles": {LEMMA: "all", "Number": "Plur", "Case": "Gen"} - }, - "DET__Definite=Def|Number=Sing|PronType=Dem": { - "andre": {LEMMA: "annen", "Number": "Sing", "PronType": "Dem"} - }, - "DET__Definite=Def|PronType=Dem": { - "samme": {LEMMA: "samme", "PronType": "Dem"}, - "forrige": {LEMMA: "forrige", "PronType": "Dem"}, - "neste": {LEMMA: "neste", "PronType": "Dem"}, - }, - "DET__Definite=Def": {"selve": {LEMMA: "selve"}, "selveste": {LEMMA: "selveste"}}, - "DET___": {"selv": {LEMMA: "selv"}, "endel": {LEMMA: "endel"}}, - "DET__Definite=Ind|Gender=Fem|Number=Sing": { - "egen": {LEMMA: "egen", "Gender": "Fem", "Number": "Sing"} - }, - "DET__Definite=Ind|Gender=Masc|Number=Sing": { - "egen": {LEMMA: "egen", "Gender": "Masc", "Number": "Sing"} - }, - "DET__Definite=Ind|Gender=Neut|Number=Sing": { - "eget": {LEMMA: "egen", "Gender": "Neut", "Number": "Sing"} - }, - # same wordform and pos (verb), have to specify the exact features in order to not mix them up - "VERB__Mood=Ind|Tense=Pres|VerbForm=Fin": { - "så": {LEMMA: "så", "VerbForm": "Fin", "Tense": "Pres", "Mood": "Ind"} - }, - "VERB__Mood=Ind|Tense=Past|VerbForm=Fin": { - "så": {LEMMA: "se", "VerbForm": "Fin", "Tense": "Past", "Mood": "Ind"} - }, -} - -# copied from the English morph_rules.py -for tag, rules in MORPH_RULES.items(): - for key, attrs in dict(rules).items(): - rules[key.title()] = attrs diff --git a/spacy/lang/nb/punctuation.py b/spacy/lang/nb/punctuation.py index 4c10b5a68..9b800029c 100644 --- a/spacy/lang/nb/punctuation.py +++ b/spacy/lang/nb/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..char_classes import LIST_ELLIPSES, LIST_ICONS, LIST_PUNCT, LIST_QUOTES from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER from ..char_classes import CURRENCY, PUNCT, UNITS, LIST_CURRENCY diff --git a/spacy/lang/nb/stop_words.py b/spacy/lang/nb/stop_words.py index caa2012e7..fd65dd788 100644 --- a/spacy/lang/nb/stop_words.py +++ b/spacy/lang/nb/stop_words.py @@ -1,7 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ alle allerede alt and andre annen annet at av diff --git a/spacy/lang/nb/syntax_iterators.py b/spacy/lang/nb/syntax_iterators.py index d6c12e69f..68117a54d 100644 --- a/spacy/lang/nb/syntax_iterators.py +++ b/spacy/lang/nb/syntax_iterators.py @@ -1,29 +1,18 @@ -# coding: utf8 -from __future__ import unicode_literals +from typing import Union, Iterator from ...symbols import NOUN, PROPN, PRON from ...errors import Errors 
+from ...tokens import Doc, Span -def noun_chunks(doclike): - """ - Detect base noun phrases from a dependency parse. Works on both Doc and Span. - """ - labels = [ - "nsubj", - "nsubj:pass", - "obj", - "iobj", - "ROOT", - "appos", - "nmod", - "nmod:poss", - ] +def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Span]: + """Detect base noun phrases from a dependency parse. Works on Doc and Span.""" + # fmt: off + labels = ["nsubj", "nsubj:pass", "obj", "iobj", "ROOT", "appos", "nmod", "nmod:poss"] + # fmt: on doc = doclike.doc # Ensure works on both Doc and Span. - - if not doc.is_parsed: + if not doc.has_annotation("DEP"): raise ValueError(Errors.E029) - np_deps = [doc.vocab.strings[label] for label in labels] conj = doc.vocab.strings.add("conj") np_label = doc.vocab.strings.add("NP") diff --git a/spacy/lang/nb/tag_map.py b/spacy/lang/nb/tag_map.py deleted file mode 100644 index ca0ece265..000000000 --- a/spacy/lang/nb/tag_map.py +++ /dev/null @@ -1,761 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import POS, PUNCT, ADJ, CONJ, CCONJ, SCONJ, SYM, NUM, DET, ADV, ADP, X -from ...symbols import VERB, NOUN, PROPN, PART, INTJ, PRON, AUX - - -# Tags are a combination of POS and morphological features from a -# https://github.com/ltgoslo/norne developed by Schibsted, Nasjonalbiblioteket and LTG. The -# data format is .conllu and follows the Universal Dependencies annotation. -# (There are some annotation differences compared to this dataset: -# https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal -# mainly in the way determiners and pronouns are tagged). -TAG_MAP = { - "ADJ__Case=Gen|Definite=Def|Degree=Pos|Number=Sing": { - "morph": "Case=Gen|Definite=Def|Degree=Pos|Number=Sing", - POS: ADJ, - }, - "ADJ__Case=Gen|Definite=Def|Number=Sing": { - "morph": "Case=Gen|Definite=Def|Number=Sing", - POS: ADJ, - }, - "ADJ__Case=Gen|Definite=Ind|Degree=Pos|Gender=Neut|Number=Sing": { - "morph": "Case=Gen|Definite=Ind|Degree=Pos|Gender=Neut|Number=Sing", - POS: ADJ, - }, - "ADJ__Case=Gen|Definite=Ind|Degree=Pos|Number=Sing": { - "morph": "Case=Gen|Definite=Ind|Degree=Pos|Number=Sing", - POS: ADJ, - }, - "ADJ__Case=Gen|Degree=Cmp": {"morph": "Case=Gen|Degree=Cmp", POS: ADJ}, - "ADJ__Case=Gen|Degree=Pos|Number=Plur": { - "morph": "Case=Gen|Degree=Pos|Number=Plur", - POS: ADJ, - }, - "ADJ__Definite=Def|Degree=Pos|Gender=Masc|Number=Sing": { - "morph": "Definite=Def|Degree=Pos|Gender=Masc|Number=Sing", - POS: ADJ, - }, - "ADJ__Definite=Def|Degree=Pos|Number=Sing": { - "morph": "Definite=Def|Degree=Pos|Number=Sing", - POS: ADJ, - }, - "ADJ__Definite=Def|Degree=Sup": {"morph": "Definite=Def|Degree=Sup", POS: ADJ}, - "ADJ__Definite=Def|Number=Sing": {"morph": "Definite=Def|Number=Sing", POS: ADJ}, - "ADJ__Definite=Ind|Degree=Pos": {"morph": "Definite=Ind|Degree=Pos", POS: ADJ}, - "ADJ__Definite=Ind|Degree=Pos|Gender=Masc|Number=Sing": { - "morph": "Definite=Ind|Degree=Pos|Gender=Masc|Number=Sing", - POS: ADJ, - }, - "ADJ__Definite=Ind|Degree=Pos|Gender=Neut|Number=Sing": { - "morph": "Definite=Ind|Degree=Pos|Gender=Neut|Number=Sing", - POS: ADJ, - }, - "ADJ__Definite=Ind|Degree=Pos|Number=Sing": { - "morph": "Definite=Ind|Degree=Pos|Number=Sing", - POS: ADJ, - }, - "ADJ__Definite=Ind|Degree=Sup": {"morph": "Definite=Ind|Degree=Sup", POS: ADJ}, - "ADJ__Definite=Ind|Gender=Masc|Number=Sing": { - "morph": "Definite=Ind|Gender=Masc|Number=Sing", - POS: ADJ, - }, - "ADJ__Definite=Ind|Gender=Neut|Number=Sing": { - "morph": "Definite=Ind|Gender=Neut|Number=Sing", - POS: 
ADJ, - }, - "ADJ__Definite=Ind|Number=Sing": {"morph": "Definite=Ind|Number=Sing", POS: ADJ}, - "ADJ__Degree=Cmp": {"morph": "Degree=Cmp", POS: ADJ}, - "ADJ__Degree=Pos": {"morph": "Degree=Pos", POS: ADJ}, - "ADJ__Degree=Pos|Number=Plur": {"morph": "Degree=Pos|Number=Plur", POS: ADJ}, - "ADJ__Degree=Sup": {"morph": "Degree=Sup", POS: ADJ}, - "ADJ__Number=Plur": {"morph": "Number=Plur", POS: ADJ}, - "ADJ__Number=Plur|VerbForm=Part": {"morph": "Number=Plur|VerbForm=Part", POS: ADJ}, - "ADJ__Number=Sing": {"morph": "Number=Sing", POS: ADJ}, - "ADJ___": {"morph": "_", POS: ADJ}, - "ADP___": {"morph": "_", POS: ADP}, - "ADV___": {"morph": "_", POS: ADV}, - "ADV__Gender=Masc": {"morph": "Gender=Masc", POS: ADV}, - "AUX__Mood=Imp|VerbForm=Fin": {"morph": "Mood=Imp|VerbForm=Fin", POS: AUX}, - "AUX__Mood=Ind|Tense=Past|VerbForm=Fin": { - "morph": "Mood=Ind|Tense=Past|VerbForm=Fin", - POS: AUX, - }, - "AUX__Mood=Ind|Tense=Pres|VerbForm=Fin": { - "morph": "Mood=Ind|Tense=Pres|VerbForm=Fin", - POS: AUX, - }, - "AUX__Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Pass": { - "morph": "Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Pass", - POS: AUX, - }, - "AUX__VerbForm=Inf": {"morph": "VerbForm=Inf", POS: AUX}, - "AUX__VerbForm=Inf|Voice=Pass": {"morph": "VerbForm=Inf|Voice=Pass", POS: AUX}, - "AUX__VerbForm=Part": {"morph": "VerbForm=Part", POS: AUX}, - "CONJ___": {"morph": "_", POS: CONJ}, - "DET__Case=Gen|Definite=Ind|Gender=Masc|Number=Sing|PronType=Dem": { - "morph": "Case=Gen|Definite=Ind|Gender=Masc|Number=Sing|PronType=Dem", - POS: DET, - }, - "DET__Case=Gen|Gender=Fem|Number=Sing|PronType=Dem": { - "morph": "Case=Gen|Gender=Fem|Number=Sing|PronType=Dem", - POS: DET, - }, - "DET__Case=Gen|Gender=Masc|Number=Sing": { - "morph": "Case=Gen|Gender=Masc|Number=Sing", - POS: DET, - }, - "DET__Case=Gen|Gender=Masc|Number=Sing|PronType=Dem": { - "morph": "Case=Gen|Gender=Masc|Number=Sing|PronType=Dem", - POS: DET, - }, - "DET__Case=Gen|Gender=Neut|Number=Sing|PronType=Dem": { - "morph": "Case=Gen|Gender=Neut|Number=Sing|PronType=Dem", - POS: DET, - }, - "DET__Case=Gen|Number=Plur": {"morph": "Case=Gen|Number=Plur", POS: DET}, - "DET__Case=Gen|Number=Plur|PronType=Dem": { - "morph": "Case=Gen|Number=Plur|PronType=Dem", - POS: DET, - }, - "DET__Definite=Def": {"morph": "Definite=Def", POS: DET}, - "DET__Definite=Def|Number=Sing|PronType=Dem": { - "morph": "Definite=Def|Number=Sing|PronType=Dem", - POS: DET, - }, - "DET__Definite=Def|PronType=Dem": {"morph": "Definite=Def|PronType=Dem", POS: DET}, - "DET__Definite=Ind|Gender=Fem|Number=Sing": { - "morph": "Definite=Ind|Gender=Fem|Number=Sing", - POS: DET, - }, - "DET__Definite=Ind|Gender=Fem|Number=Sing|PronType=Dem": { - "morph": "Definite=Ind|Gender=Fem|Number=Sing|PronType=Dem", - POS: DET, - }, - "DET__Definite=Ind|Gender=Masc|Number=Sing": { - "morph": "Definite=Ind|Gender=Masc|Number=Sing", - POS: DET, - }, - "DET__Definite=Ind|Gender=Masc|Number=Sing|PronType=Dem": { - "morph": "Definite=Ind|Gender=Masc|Number=Sing|PronType=Dem", - POS: DET, - }, - "DET__Definite=Ind|Gender=Neut|Number=Sing": { - "morph": "Definite=Ind|Gender=Neut|Number=Sing", - POS: DET, - }, - "DET__Definite=Ind|Gender=Neut|Number=Sing|PronType=Dem": { - "morph": "Definite=Ind|Gender=Neut|Number=Sing|PronType=Dem", - POS: DET, - }, - "DET__Degree=Pos|Number=Plur": {"morph": "Degree=Pos|Number=Plur", POS: DET}, - "DET__Gender=Fem|Number=Sing": {"morph": "Gender=Fem|Number=Sing", POS: DET}, - "DET__Gender=Fem|Number=Sing|Poss=Yes": { - "morph": "Gender=Fem|Number=Sing|Poss=Yes", - POS: DET, - }, 
- "DET__Gender=Fem|Number=Sing|PronType=Dem": { - "morph": "Gender=Fem|Number=Sing|PronType=Dem", - POS: DET, - }, - "DET__Gender=Fem|Number=Sing|PronType=Int": { - "morph": "Gender=Fem|Number=Sing|PronType=Int", - POS: DET, - }, - "DET__Gender=Masc|Number=Sing": {"morph": "Gender=Masc|Number=Sing", POS: DET}, - "DET__Gender=Masc|Number=Sing|Poss=Yes": { - "morph": "Gender=Masc|Number=Sing|Poss=Yes", - POS: DET, - }, - "DET__Gender=Masc|Number=Sing|PronType=Dem": { - "morph": "Gender=Masc|Number=Sing|PronType=Dem", - POS: DET, - }, - "DET__Gender=Masc|Number=Sing|PronType=Int": { - "morph": "Gender=Masc|Number=Sing|PronType=Int", - POS: DET, - }, - "DET__Gender=Neut|Number=Sing": {"morph": "Gender=Neut|Number=Sing", POS: DET}, - "DET__Gender=Neut|Number=Sing|Poss=Yes": { - "morph": "Gender=Neut|Number=Sing|Poss=Yes", - POS: DET, - }, - "DET__Gender=Neut|Number=Sing|PronType=Dem": { - "morph": "Gender=Neut|Number=Sing|PronType=Dem", - POS: DET, - }, - "DET__Gender=Neut|Number=Sing|PronType=Int": { - "morph": "Gender=Neut|Number=Sing|PronType=Int", - POS: DET, - }, - "DET__Number=Plur": {"morph": "Number=Plur", POS: DET}, - "DET__Number=Plur|Poss=Yes": {"morph": "Number=Plur|Poss=Yes", POS: DET}, - "DET__Number=Plur|Poss=Yes|PronType=Rcp": { - "morph": "Number=Plur|Poss=Yes|PronType=Rcp", - POS: DET, - }, - "DET__Number=Plur|PronType=Dem": {"morph": "Number=Plur|PronType=Dem", POS: DET}, - "DET__Number=Plur|PronType=Int": {"morph": "Number=Plur|PronType=Int", POS: DET}, - "DET___": {"morph": "_", POS: DET}, - "INTJ___": {"morph": "_", POS: INTJ}, - "NOUN__Case=Gen": {"morph": "Case=Gen", POS: NOUN}, - "NOUN__Case=Gen|Definite=Def|Gender=Fem|Number=Plur": { - "morph": "Case=Gen|Definite=Def|Gender=Fem|Number=Plur", - POS: NOUN, - }, - "NOUN__Case=Gen|Definite=Def|Gender=Fem|Number=Sing": { - "morph": "Case=Gen|Definite=Def|Gender=Fem|Number=Sing", - POS: NOUN, - }, - "NOUN__Case=Gen|Definite=Def|Gender=Masc|Number=Plur": { - "morph": "Case=Gen|Definite=Def|Gender=Masc|Number=Plur", - POS: NOUN, - }, - "NOUN__Case=Gen|Definite=Def|Gender=Masc|Number=Sing": { - "morph": "Case=Gen|Definite=Def|Gender=Masc|Number=Sing", - POS: NOUN, - }, - "NOUN__Case=Gen|Definite=Def|Gender=Neut|Number=Plur": { - "morph": "Case=Gen|Definite=Def|Gender=Neut|Number=Plur", - POS: NOUN, - }, - "NOUN__Case=Gen|Definite=Def|Gender=Neut|Number=Sing": { - "morph": "Case=Gen|Definite=Def|Gender=Neut|Number=Sing", - POS: NOUN, - }, - "NOUN__Case=Gen|Definite=Ind|Gender=Fem|Number=Plur": { - "morph": "Case=Gen|Definite=Ind|Gender=Fem|Number=Plur", - POS: NOUN, - }, - "NOUN__Case=Gen|Definite=Ind|Gender=Fem|Number=Sing": { - "morph": "Case=Gen|Definite=Ind|Gender=Fem|Number=Sing", - POS: NOUN, - }, - "NOUN__Case=Gen|Definite=Ind|Gender=Masc|Number=Plur": { - "morph": "Case=Gen|Definite=Ind|Gender=Masc|Number=Plur", - POS: NOUN, - }, - "NOUN__Case=Gen|Definite=Ind|Gender=Masc|Number=Sing": { - "morph": "Case=Gen|Definite=Ind|Gender=Masc|Number=Sing", - POS: NOUN, - }, - "NOUN__Case=Gen|Definite=Ind|Gender=Neut|Number=Plur": { - "morph": "Case=Gen|Definite=Ind|Gender=Neut|Number=Plur", - POS: NOUN, - }, - "NOUN__Case=Gen|Definite=Ind|Gender=Neut|Number=Sing": { - "morph": "Case=Gen|Definite=Ind|Gender=Neut|Number=Sing", - POS: NOUN, - }, - "NOUN__Case=Gen|Gender=Fem": {"morph": "Case=Gen|Gender=Fem", POS: NOUN}, - "NOUN__Definite=Def,Ind|Gender=Masc|Number=Plur,Sing": { - "morph": "Definite=Def", - POS: NOUN, - }, - "NOUN__Definite=Def,Ind|Gender=Masc|Number=Sing": { - "morph": "Definite=Def", - POS: NOUN, - }, - 
"NOUN__Definite=Def,Ind|Gender=Neut|Number=Plur,Sing": { - "morph": "Definite=Def", - POS: NOUN, - }, - "NOUN__Definite=Def|Gender=Fem|Number=Plur": { - "morph": "Definite=Def|Gender=Fem|Number=Plur", - POS: NOUN, - }, - "NOUN__Definite=Def|Gender=Fem|Number=Sing": { - "morph": "Definite=Def|Gender=Fem|Number=Sing", - POS: NOUN, - }, - "NOUN__Definite=Def|Gender=Masc|Number=Plur": { - "morph": "Definite=Def|Gender=Masc|Number=Plur", - POS: NOUN, - }, - "NOUN__Definite=Def|Gender=Masc|Number=Sing": { - "morph": "Definite=Def|Gender=Masc|Number=Sing", - POS: NOUN, - }, - "NOUN__Definite=Def|Gender=Neut|Number=Plur": { - "morph": "Definite=Def|Gender=Neut|Number=Plur", - POS: NOUN, - }, - "NOUN__Definite=Def|Gender=Neut|Number=Sing": { - "morph": "Definite=Def|Gender=Neut|Number=Sing", - POS: NOUN, - }, - "NOUN__Definite=Def|Number=Plur": {"morph": "Definite=Def|Number=Plur", POS: NOUN}, - "NOUN__Definite=Ind|Gender=Fem|Number=Plur": { - "morph": "Definite=Ind|Gender=Fem|Number=Plur", - POS: NOUN, - }, - "NOUN__Definite=Ind|Gender=Fem|Number=Sing": { - "morph": "Definite=Ind|Gender=Fem|Number=Sing", - POS: NOUN, - }, - "NOUN__Definite=Ind|Gender=Masc": {"morph": "Definite=Ind|Gender=Masc", POS: NOUN}, - "NOUN__Definite=Ind|Gender=Masc|Number=Plur": { - "morph": "Definite=Ind|Gender=Masc|Number=Plur", - POS: NOUN, - }, - "NOUN__Definite=Ind|Gender=Masc|Number=Sing": { - "morph": "Definite=Ind|Gender=Masc|Number=Sing", - POS: NOUN, - }, - "NOUN__Definite=Ind|Gender=Neut|Number=Plur": { - "morph": "Definite=Ind|Gender=Neut|Number=Plur", - POS: NOUN, - }, - "NOUN__Definite=Ind|Gender=Neut|Number=Sing": { - "morph": "Definite=Ind|Gender=Neut|Number=Sing", - POS: NOUN, - }, - "NOUN__Definite=Ind|Number=Plur": {"morph": "Definite=Ind|Number=Plur", POS: NOUN}, - "NOUN__Definite=Ind|Number=Sing": {"morph": "Definite=Ind|Number=Sing", POS: NOUN}, - "NOUN__Gender=Fem": {"morph": "Gender=Fem", POS: NOUN}, - "NOUN__Gender=Masc": {"morph": "Gender=Masc", POS: NOUN}, - "NOUN__Gender=Masc|Number=Sing": {"morph": "Gender=Masc|Number=Sing", POS: NOUN}, - "NOUN__Gender=Neut": {"morph": "Gender=Neut", POS: NOUN}, - "NOUN__Number=Plur": {"morph": "Number=Plur", POS: NOUN}, - "NOUN___": {"morph": "_", POS: NOUN}, - "NUM__Case=Gen|Number=Plur": {"morph": "Case=Gen|Number=Plur", POS: NUM}, - "NUM__Definite=Def": {"morph": "Definite=Def", POS: NUM}, - "NUM__Definite=Def|Number=Sing": {"morph": "Definite=Def|Number=Sing", POS: NUM}, - "NUM__Gender=Fem|Number=Sing": {"morph": "Gender=Fem|Number=Sing", POS: NUM}, - "NUM__Gender=Masc|Number=Sing": {"morph": "Gender=Masc|Number=Sing", POS: NUM}, - "NUM__Gender=Neut|Number=Sing": {"morph": "Gender=Neut|Number=Sing", POS: NUM}, - "NUM__Number=Plur": {"morph": "Number=Plur", POS: NUM}, - "NUM__Number=Sing": {"morph": "Number=Sing", POS: NUM}, - "NUM___": {"morph": "_", POS: NUM}, - "PART___": {"morph": "_", POS: PART}, - "PRON__Animacy=Anim|Case=Acc|Gender=Fem|Number=Sing|Person=3|PronType=Prs": { - "morph": "Animacy=Anim|Case=Acc|Gender=Fem|Number=Sing|Person=", - POS: PRON, - }, - "PRON__Animacy=Anim|Case=Acc|Gender=Masc|Number=Sing|Person=3|PronType=Prs": { - "morph": "Animacy=Anim|Case=Acc|Gender=Masc|Number=Sing|Person=", - POS: PRON, - }, - "PRON__Animacy=Anim|Case=Acc|Number=Plur|Person=1|PronType=Prs": { - "morph": "Animacy=Anim|Case=Acc|Number=Plur|Person=", - POS: PRON, - }, - "PRON__Animacy=Anim|Case=Acc|Number=Plur|Person=2|PronType=Prs": { - "morph": "Animacy=Anim|Case=Acc|Number=Plur|Person=", - POS: PRON, - }, - 
"PRON__Animacy=Anim|Case=Acc|Number=Sing|Person=1|PronType=Prs": { - "morph": "Animacy=Anim|Case=Acc|Number=Sing|Person=", - POS: PRON, - }, - "PRON__Animacy=Anim|Case=Acc|Number=Sing|Person=2|PronType=Prs": { - "morph": "Animacy=Anim|Case=Acc|Number=Sing|Person=", - POS: PRON, - }, - "PRON__Animacy=Anim|Case=Gen,Nom|Number=Sing|PronType=Prs": { - "morph": "Animacy=Anim|Case=Gen", - POS: PRON, - }, - "PRON__Animacy=Anim|Case=Gen|Number=Sing|PronType=Prs": { - "morph": "Animacy=Anim|Case=Gen|Number=Sing|PronType=Prs", - POS: PRON, - }, - "PRON__Animacy=Anim|Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs": { - "morph": "Animacy=Anim|Case=Nom|Gender=Fem|Number=Sing|Person=", - POS: PRON, - }, - "PRON__Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs": { - "morph": "Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing|Person=", - POS: PRON, - }, - "PRON__Animacy=Anim|Case=Nom|Number=Plur|Person=1|PronType=Prs": { - "morph": "Animacy=Anim|Case=Nom|Number=Plur|Person=", - POS: PRON, - }, - "PRON__Animacy=Anim|Case=Nom|Number=Plur|Person=2|PronType=Prs": { - "morph": "Animacy=Anim|Case=Nom|Number=Plur|Person=", - POS: PRON, - }, - "PRON__Animacy=Anim|Case=Nom|Number=Sing|Person=1|PronType=Prs": { - "morph": "Animacy=Anim|Case=Nom|Number=Sing|Person=", - POS: PRON, - }, - "PRON__Animacy=Anim|Case=Nom|Number=Sing|Person=2|PronType=Prs": { - "morph": "Animacy=Anim|Case=Nom|Number=Sing|Person=", - POS: PRON, - }, - "PRON__Animacy=Anim|Case=Nom|Number=Sing|PronType=Prs": { - "morph": "Animacy=Anim|Case=Nom|Number=Sing|PronType=Prs", - POS: PRON, - }, - "PRON__Animacy=Anim|Number=Plur|PronType=Rcp": { - "morph": "Animacy=Anim|Number=Plur|PronType=Rcp", - POS: PRON, - }, - "PRON__Animacy=Anim|Number=Sing|PronType=Prs": { - "morph": "Animacy=Anim|Number=Sing|PronType=Prs", - POS: PRON, - }, - "PRON__Animacy=Anim|Poss=Yes|PronType=Int": { - "morph": "Animacy=Anim|Poss=Yes|PronType=Int", - POS: PRON, - }, - "PRON__Animacy=Anim|PronType=Int": { - "morph": "Animacy=Anim|PronType=Int", - POS: PRON, - }, - "PRON__Case=Acc|Number=Plur|Person=3|PronType=Prs": { - "morph": "Case=Acc|Number=Plur|Person=", - POS: PRON, - }, - "PRON__Case=Acc|Reflex=Yes": {"morph": "Case=Acc|Reflex=Yes", POS: PRON}, - "PRON__Case=Nom|Number=Plur|Person=3|PronType=Prs": { - "morph": "Case=Nom|Number=Plur|Person=", - POS: PRON, - }, - "PRON__Case=Gen|Number=Plur|Person=3|PronType=Prs": { - "morph": "Case=Gen|Number=Plur|Person=3|PronType=Prs", - POS: PRON, - }, - "PRON__Gender=Fem,Masc|Number=Sing|Person=3|PronType=Prs": { - "morph": "Gender=Fem", - POS: PRON, - }, - "PRON__Gender=Neut|Number=Sing|Person=3|PronType=Prs": { - "morph": "Gender=Neut|Number=Sing|Person=", - POS: PRON, - }, - "PRON__Number=Plur|Person=3|PronType=Prs": { - "morph": "Number=Plur|Person=", - POS: PRON, - }, - "PRON__Number=Sing": {"morph": "Number=Sing", POS: PRON}, - "PRON__PronType=Int": {"morph": "PronType=Int", POS: PRON}, - "PRON___": {"morph": "_", POS: PRON}, - "PROPN__Case=Gen": {"morph": "Case=Gen", POS: PROPN}, - "PROPN__Case=Gen|Gender=Fem": {"morph": "Case=Gen|Gender=Fem", POS: PROPN}, - "PROPN__Case=Gen|Gender=Masc": {"morph": "Case=Gen|Gender=Masc", POS: PROPN}, - "PROPN__Case=Gen|Gender=Neut": {"morph": "Case=Gen|Gender=Neut", POS: PROPN}, - "PROPN__Gender=Fem": {"morph": "Gender=Fem", POS: PROPN}, - "PROPN__Gender=Masc": {"morph": "Gender=Masc", POS: PROPN}, - "PROPN__Gender=Neut": {"morph": "Gender=Neut", POS: PROPN}, - "PROPN___": {"morph": "_", POS: PROPN}, - "PUNCT___": {"morph": "_", POS: PUNCT}, - "SCONJ___": {"morph": 
"_", POS: SCONJ}, - "SYM___": {"morph": "_", POS: SYM}, - "VERB__Definite=Ind|Number=Sing": {"morph": "Definite=Ind|Number=Sing", POS: VERB}, - "VERB__Mood=Imp|VerbForm=Fin": {"morph": "Mood=Imp|VerbForm=Fin", POS: VERB}, - "VERB__Mood=Ind|Tense=Past|VerbForm=Fin": { - "morph": "Mood=Ind|Tense=Past|VerbForm=Fin", - POS: VERB, - }, - "VERB__Mood=Ind|Tense=Past|VerbForm=Fin|Voice=Pass": { - "morph": "Mood=Ind|Tense=Past|VerbForm=Fin|Voice=Pass", - POS: VERB, - }, - "VERB__Mood=Ind|Tense=Pres|VerbForm=Fin": { - "morph": "Mood=Ind|Tense=Pres|VerbForm=Fin", - POS: VERB, - }, - "VERB__Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Pass": { - "morph": "Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Pass", - POS: VERB, - }, - "VERB__VerbForm=Inf": {"morph": "VerbForm=Inf", POS: VERB}, - "VERB__VerbForm=Inf|Voice=Pass": {"morph": "VerbForm=Inf|Voice=Pass", POS: VERB}, - "VERB__VerbForm=Part": {"morph": "VerbForm=Part", POS: VERB}, - "VERB___": {"morph": "_", POS: VERB}, - "X___": {"morph": "_", POS: X}, - "CCONJ___": {"morph": "_", POS: CCONJ}, - "ADJ__Abbr=Yes": {"morph": "Abbr=Yes", POS: ADJ}, - "ADJ__Abbr=Yes|Degree=Pos": {"morph": "Abbr=Yes|Degree=Pos", POS: ADJ}, - "ADJ__Case=Gen|Definite=Def|Number=Sing|VerbForm=Part": { - "morph": "Case=Gen|Definite=Def|Number=Sing|VerbForm=Part", - POS: ADJ, - }, - "ADJ__Definite=Def|Number=Sing|VerbForm=Part": { - "morph": "Definite=Def|Number=Sing|VerbForm=Part", - POS: ADJ, - }, - "ADJ__Definite=Ind|Gender=Masc|Number=Sing|VerbForm=Part": { - "morph": "Definite=Ind|Gender=Masc|Number=Sing|VerbForm=Part", - POS: ADJ, - }, - "ADJ__Definite=Ind|Gender=Neut|Number=Sing|VerbForm=Part": { - "morph": "Definite=Ind|Gender=Neut|Number=Sing|VerbForm=Part", - POS: ADJ, - }, - "ADJ__Definite=Ind|Number=Sing|VerbForm=Part": { - "morph": "Definite=Ind|Number=Sing|VerbForm=Part", - POS: ADJ, - }, - "ADJ__Number=Sing|VerbForm=Part": {"morph": "Number=Sing|VerbForm=Part", POS: ADJ}, - "ADJ__VerbForm=Part": {"morph": "VerbForm=Part", POS: ADJ}, - "ADP__Abbr=Yes": {"morph": "Abbr=Yes", POS: ADP}, - "ADV__Abbr=Yes": {"morph": "Abbr=Yes", POS: ADV}, - "DET__Case=Gen|Gender=Masc|Number=Sing|PronType=Art": { - "morph": "Case=Gen|Gender=Masc|Number=Sing|PronType=Art", - POS: DET, - }, - "DET__Case=Gen|Number=Plur|PronType=Tot": { - "morph": "Case=Gen|Number=Plur|PronType=Tot", - POS: DET, - }, - "DET__Definite=Def|PronType=Prs": {"morph": "Definite=Def|PronType=Prs", POS: DET}, - "DET__Definite=Ind|Gender=Fem|Number=Sing|PronType=Prs": { - "morph": "Definite=Ind|Gender=Fem|Number=Sing|PronType=Prs", - POS: DET, - }, - "DET__Definite=Ind|Gender=Masc|Number=Sing|PronType=Prs": { - "morph": "Definite=Ind|Gender=Masc|Number=Sing|PronType=Prs", - POS: DET, - }, - "DET__Definite=Ind|Gender=Neut|Number=Sing|PronType=Prs": { - "morph": "Definite=Ind|Gender=Neut|Number=Sing|PronType=Prs", - POS: DET, - }, - "DET__Gender=Fem|Number=Sing|PronType=Art": { - "morph": "Gender=Fem|Number=Sing|PronType=Art", - POS: DET, - }, - "DET__Gender=Fem|Number=Sing|PronType=Ind": { - "morph": "Gender=Fem|Number=Sing|PronType=Ind", - POS: DET, - }, - "DET__Gender=Fem|Number=Sing|PronType=Prs": { - "morph": "Gender=Fem|Number=Sing|PronType=Prs", - POS: DET, - }, - "DET__Gender=Fem|Number=Sing|PronType=Tot": { - "morph": "Gender=Fem|Number=Sing|PronType=Tot", - POS: DET, - }, - "DET__Gender=Masc|Number=Sing|Polarity=Neg|PronType=Neg": { - "morph": "Gender=Masc|Number=Sing|Polarity=Neg|PronType=Neg", - POS: DET, - }, - "DET__Gender=Masc|Number=Sing|PronType=Art": { - "morph": "Gender=Masc|Number=Sing|PronType=Art", - POS: 
DET, - }, - "DET__Gender=Masc|Number=Sing|PronType=Ind": { - "morph": "Gender=Masc|Number=Sing|PronType=Ind", - POS: DET, - }, - "DET__Gender=Masc|Number=Sing|PronType=Tot": { - "morph": "Gender=Masc|Number=Sing|PronType=Tot", - POS: DET, - }, - "DET__Gender=Neut|Number=Sing|Polarity=Neg|PronType=Neg": { - "morph": "Gender=Neut|Number=Sing|Polarity=Neg|PronType=Neg", - POS: DET, - }, - "DET__Gender=Neut|Number=Sing|PronType=Art": { - "morph": "Gender=Neut|Number=Sing|PronType=Art", - POS: DET, - }, - "DET__Gender=Neut|Number=Sing|PronType=Dem,Ind": { - "morph": "Gender=Neut|Number=Sing|PronType=Dem,Ind", - POS: DET, - }, - "DET__Gender=Neut|Number=Sing|PronType=Ind": { - "morph": "Gender=Neut|Number=Sing|PronType=Ind", - POS: DET, - }, - "DET__Gender=Neut|Number=Sing|PronType=Tot": { - "morph": "Gender=Neut|Number=Sing|PronType=Tot", - POS: DET, - }, - "DET__Number=Plur|Polarity=Neg|PronType=Neg": { - "morph": "Number=Plur|Polarity=Neg|PronType=Neg", - POS: DET, - }, - "DET__Number=Plur|PronType=Art": {"morph": "Number=Plur|PronType=Art", POS: DET}, - "DET__Number=Plur|PronType=Ind": {"morph": "Number=Plur|PronType=Ind", POS: DET}, - "DET__Number=Plur|PronType=Prs": {"morph": "Number=Plur|PronType=Prs", POS: DET}, - "DET__Number=Plur|PronType=Tot": {"morph": "Number=Plur|PronType=Tot", POS: DET}, - "DET__PronType=Ind": {"morph": "PronType=Ind", POS: DET}, - "DET__PronType=Prs": {"morph": "PronType=Prs", POS: DET}, - "NOUN__Abbr=Yes": {"morph": "Abbr=Yes", POS: NOUN}, - "NOUN__Abbr=Yes|Case=Gen": {"morph": "Abbr=Yes|Case=Gen", POS: NOUN}, - "NOUN__Abbr=Yes|Definite=Def,Ind|Gender=Masc|Number=Plur,Sing": { - "morph": "Abbr=Yes|Definite=Def,Ind|Gender=Masc|Number=Plur,Sing", - POS: NOUN, - }, - "NOUN__Abbr=Yes|Definite=Def,Ind|Gender=Masc|Number=Sing": { - "morph": "Abbr=Yes|Definite=Def,Ind|Gender=Masc|Number=Sing", - POS: NOUN, - }, - "NOUN__Abbr=Yes|Definite=Def,Ind|Gender=Neut|Number=Plur,Sing": { - "morph": "Abbr=Yes|Definite=Def,Ind|Gender=Neut|Number=Plur,Sing", - POS: NOUN, - }, - "NOUN__Abbr=Yes|Gender=Masc": {"morph": "Abbr=Yes|Gender=Masc", POS: NOUN}, - "NUM__Case=Gen|Number=Plur|NumType=Card": { - "morph": "Case=Gen|Number=Plur|NumType=Card", - POS: NUM, - }, - "NUM__Definite=Def|Number=Sing|NumType=Card": { - "morph": "Definite=Def|Number=Sing|NumType=Card", - POS: NUM, - }, - "NUM__Definite=Def|NumType=Card": {"morph": "Definite=Def|NumType=Card", POS: NUM}, - "NUM__Gender=Fem|Number=Sing|NumType=Card": { - "morph": "Gender=Fem|Number=Sing|NumType=Card", - POS: NUM, - }, - "NUM__Gender=Masc|Number=Sing|NumType=Card": { - "morph": "Gender=Masc|Number=Sing|NumType=Card", - POS: NUM, - }, - "NUM__Gender=Neut|Number=Sing|NumType=Card": { - "morph": "Gender=Neut|Number=Sing|NumType=Card", - POS: NUM, - }, - "NUM__Number=Plur|NumType=Card": {"morph": "Number=Plur|NumType=Card", POS: NUM}, - "NUM__Number=Sing|NumType=Card": {"morph": "Number=Sing|NumType=Card", POS: NUM}, - "NUM__NumType=Card": {"morph": "NumType=Card", POS: NUM}, - "PART__Polarity=Neg": {"morph": "Polarity=Neg", POS: PART}, - "PRON__Animacy=Hum|Case=Acc|Gender=Fem|Number=Sing|Person=3|PronType=Prs": { - "morph": "Animacy=Hum|Case=Acc|Gender=Fem|Number=Sing|Person=3|PronType=Prs", - POS: PRON, - }, - "PRON__Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|Person=3|PronType=Prs": { - "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|Person=3|PronType=Prs", - POS: PRON, - }, - "PRON__Animacy=Hum|Case=Acc|Number=Plur|Person=1|PronType=Prs": { - "morph": "Animacy=Hum|Case=Acc|Number=Plur|Person=1|PronType=Prs", - POS: 
PRON, - }, - "PRON__Animacy=Hum|Case=Acc|Number=Plur|Person=2|PronType=Prs": { - "morph": "Animacy=Hum|Case=Acc|Number=Plur|Person=2|PronType=Prs", - POS: PRON, - }, - "PRON__Animacy=Hum|Case=Acc|Number=Sing|Person=1|PronType=Prs": { - "morph": "Animacy=Hum|Case=Acc|Number=Sing|Person=1|PronType=Prs", - POS: PRON, - }, - "PRON__Animacy=Hum|Case=Acc|Number=Sing|Person=2|PronType=Prs": { - "morph": "Animacy=Hum|Case=Acc|Number=Sing|Person=2|PronType=Prs", - POS: PRON, - }, - "PRON__Animacy=Hum|Case=Gen,Nom|Number=Sing|PronType=Art,Prs": { - "morph": "Animacy=Hum|Case=Gen,Nom|Number=Sing|PronType=Art,Prs", - POS: PRON, - }, - "PRON__Animacy=Hum|Case=Gen|Number=Sing|PronType=Art,Prs": { - "morph": "Animacy=Hum|Case=Gen|Number=Sing|PronType=Art,Prs", - POS: PRON, - }, - "PRON__Animacy=Hum|Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs": { - "morph": "Animacy=Hum|Case=Nom|Gender=Fem|Number=Sing|Person=3|PronType=Prs", - POS: PRON, - }, - "PRON__Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs": { - "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs", - POS: PRON, - }, - "PRON__Animacy=Hum|Case=Nom|Number=Plur|Person=1|PronType=Prs": { - "morph": "Animacy=Hum|Case=Nom|Number=Plur|Person=1|PronType=Prs", - POS: PRON, - }, - "PRON__Animacy=Hum|Case=Nom|Number=Plur|Person=2|PronType=Prs": { - "morph": "Animacy=Hum|Case=Nom|Number=Plur|Person=2|PronType=Prs", - POS: PRON, - }, - "PRON__Animacy=Hum|Case=Nom|Number=Sing|Person=1|PronType=Prs": { - "morph": "Animacy=Hum|Case=Nom|Number=Sing|Person=1|PronType=Prs", - POS: PRON, - }, - "PRON__Animacy=Hum|Case=Nom|Number=Sing|Person=2|PronType=Prs": { - "morph": "Animacy=Hum|Case=Nom|Number=Sing|Person=2|PronType=Prs", - POS: PRON, - }, - "PRON__Animacy=Hum|Case=Nom|Number=Sing|PronType=Prs": { - "morph": "Animacy=Hum|Case=Nom|Number=Sing|PronType=Prs", - POS: PRON, - }, - "PRON__Animacy=Hum|Number=Plur|PronType=Rcp": { - "morph": "Animacy=Hum|Number=Plur|PronType=Rcp", - POS: PRON, - }, - "PRON__Animacy=Hum|Number=Sing|PronType=Art,Prs": { - "morph": "Animacy=Hum|Number=Sing|PronType=Art,Prs", - POS: PRON, - }, - "PRON__Animacy=Hum|Poss=Yes|PronType=Int": { - "morph": "Animacy=Hum|Poss=Yes|PronType=Int", - POS: PRON, - }, - "PRON__Animacy=Hum|PronType=Int": {"morph": "Animacy=Hum|PronType=Int", POS: PRON}, - "PRON__Case=Acc|PronType=Prs|Reflex=Yes": { - "morph": "Case=Acc|PronType=Prs|Reflex=Yes", - POS: PRON, - }, - "PRON__Gender=Fem,Masc|Number=Sing|Person=3|Polarity=Neg|PronType=Neg,Prs": { - "morph": "Gender=Fem,Masc|Number=Sing|Person=3|Polarity=Neg|PronType=Neg,Prs", - POS: PRON, - }, - "PRON__Gender=Fem,Masc|Number=Sing|Person=3|PronType=Ind,Prs": { - "morph": "Gender=Fem,Masc|Number=Sing|Person=3|PronType=Ind,Prs", - POS: PRON, - }, - "PRON__Gender=Fem,Masc|Number=Sing|Person=3|PronType=Prs,Tot": { - "morph": "Gender=Fem,Masc|Number=Sing|Person=3|PronType=Prs,Tot", - POS: PRON, - }, - "PRON__Gender=Fem|Number=Sing|Poss=Yes|PronType=Prs": { - "morph": "Gender=Fem|Number=Sing|Poss=Yes|PronType=Prs", - POS: PRON, - }, - "PRON__Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs": { - "morph": "Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs", - POS: PRON, - }, - "PRON__Gender=Neut|Number=Sing|Person=3|PronType=Ind,Prs": { - "morph": "Gender=Neut|Number=Sing|Person=3|PronType=Ind,Prs", - POS: PRON, - }, - "PRON__Gender=Neut|Number=Sing|Poss=Yes|PronType=Prs": { - "morph": "Gender=Neut|Number=Sing|Poss=Yes|PronType=Prs", - POS: PRON, - }, - "PRON__Number=Plur|Person=3|Polarity=Neg|PronType=Neg,Prs": { - 
"morph": "Number=Plur|Person=3|Polarity=Neg|PronType=Neg,Prs", - POS: PRON, - }, - "PRON__Number=Plur|Person=3|PronType=Ind,Prs": { - "morph": "Number=Plur|Person=3|PronType=Ind,Prs", - POS: PRON, - }, - "PRON__Number=Plur|Person=3|PronType=Prs,Tot": { - "morph": "Number=Plur|Person=3|PronType=Prs,Tot", - POS: PRON, - }, - "PRON__Number=Plur|Poss=Yes|PronType=Prs": { - "morph": "Number=Plur|Poss=Yes|PronType=Prs", - POS: PRON, - }, - "PRON__Number=Plur|Poss=Yes|PronType=Rcp": { - "morph": "Number=Plur|Poss=Yes|PronType=Rcp", - POS: PRON, - }, - "PRON__Number=Sing|Polarity=Neg|PronType=Neg": { - "morph": "Number=Sing|Polarity=Neg|PronType=Neg", - POS: PRON, - }, - "PRON__PronType=Prs": {"morph": "PronType=Prs", POS: PRON}, - "PRON__PronType=Rel": {"morph": "PronType=Rel", POS: PRON}, - "PROPN__Abbr=Yes": {"morph": "Abbr=Yes", POS: PROPN}, - "PROPN__Abbr=Yes|Case=Gen": {"morph": "Abbr=Yes|Case=Gen", POS: PROPN}, - "VERB__Abbr=Yes|Mood=Ind|Tense=Pres|VerbForm=Fin": { - "morph": "Abbr=Yes|Mood=Ind|Tense=Pres|VerbForm=Fin", - POS: VERB, - }, - "VERB__Definite=Ind|Number=Sing|VerbForm=Part": { - "morph": "Definite=Ind|Number=Sing|VerbForm=Part", - POS: VERB, - }, -} diff --git a/spacy/lang/nb/tokenizer_exceptions.py b/spacy/lang/nb/tokenizer_exceptions.py index 3f4aa79f6..0be436ae4 100644 --- a/spacy/lang/nb/tokenizer_exceptions.py +++ b/spacy/lang/nb/tokenizer_exceptions.py @@ -1,24 +1,23 @@ -# encoding: utf8 -from __future__ import unicode_literals - -from ...symbols import ORTH, LEMMA +from ..tokenizer_exceptions import BASE_EXCEPTIONS +from ...symbols import ORTH, NORM +from ...util import update_exc _exc = {} for exc_data in [ - {ORTH: "jan.", LEMMA: "januar"}, - {ORTH: "feb.", LEMMA: "februar"}, - {ORTH: "mar.", LEMMA: "mars"}, - {ORTH: "apr.", LEMMA: "april"}, - {ORTH: "jun.", LEMMA: "juni"}, - {ORTH: "jul.", LEMMA: "juli"}, - {ORTH: "aug.", LEMMA: "august"}, - {ORTH: "sep.", LEMMA: "september"}, - {ORTH: "okt.", LEMMA: "oktober"}, - {ORTH: "nov.", LEMMA: "november"}, - {ORTH: "des.", LEMMA: "desember"}, + {ORTH: "jan.", NORM: "januar"}, + {ORTH: "feb.", NORM: "februar"}, + {ORTH: "mar.", NORM: "mars"}, + {ORTH: "apr.", NORM: "april"}, + {ORTH: "jun.", NORM: "juni"}, + {ORTH: "jul.", NORM: "juli"}, + {ORTH: "aug.", NORM: "august"}, + {ORTH: "sep.", NORM: "september"}, + {ORTH: "okt.", NORM: "oktober"}, + {ORTH: "nov.", NORM: "november"}, + {ORTH: "des.", NORM: "desember"}, ]: _exc[exc_data[ORTH]] = [exc_data] @@ -221,4 +220,4 @@ for orth in [ _exc[orth] = [{ORTH: orth}] -TOKENIZER_EXCEPTIONS = _exc +TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc) diff --git a/spacy/lang/ne/__init__.py b/spacy/lang/ne/__init__.py index 21556277d..68632e9ad 100644 --- a/spacy/lang/ne/__init__.py +++ b/spacy/lang/ne/__init__.py @@ -1,18 +1,11 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS - from ...language import Language -from ...attrs import LANG class NepaliDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters.update(LEX_ATTRS) - lex_attr_getters[LANG] = lambda text: "ne" # Nepali language ISO code stop_words = STOP_WORDS + lex_attr_getters = LEX_ATTRS class Nepali(Language): diff --git a/spacy/lang/ne/examples.py b/spacy/lang/ne/examples.py index b3c4f9e73..a29b77c2f 100644 --- a/spacy/lang/ne/examples.py +++ b/spacy/lang/ne/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its 
language models. diff --git a/spacy/lang/ne/lex_attrs.py b/spacy/lang/ne/lex_attrs.py index 652307577..7cb01c515 100644 --- a/spacy/lang/ne/lex_attrs.py +++ b/spacy/lang/ne/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..norm_exceptions import BASE_NORMS from ...attrs import NORM, LIKE_NUM diff --git a/spacy/lang/ne/stop_words.py b/spacy/lang/ne/stop_words.py index f008697d0..8470297b9 100644 --- a/spacy/lang/ne/stop_words.py +++ b/spacy/lang/ne/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Source: https://github.com/sanjaalcorps/NepaliStopWords/blob/master/NepaliStopWords.txt STOP_WORDS = set( diff --git a/spacy/lang/nl/__init__.py b/spacy/lang/nl/__init__.py index 407d23f73..a3591f1bf 100644 --- a/spacy/lang/nl/__init__.py +++ b/spacy/lang/nl/__init__.py @@ -1,40 +1,22 @@ -# coding: utf8 -from __future__ import unicode_literals +from typing import Optional +from thinc.api import Model from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS -from .tag_map import TAG_MAP from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES from .punctuation import TOKENIZER_SUFFIXES from .lemmatizer import DutchLemmatizer -from ..tokenizer_exceptions import BASE_EXCEPTIONS -from ..norm_exceptions import BASE_NORMS from ...language import Language -from ...lookups import Lookups -from ...attrs import LANG, NORM -from ...util import update_exc, add_lookups class DutchDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters.update(LEX_ATTRS) - lex_attr_getters[LANG] = lambda text: "nl" - lex_attr_getters[NORM] = add_lookups( - Language.Defaults.lex_attr_getters[NORM], BASE_NORMS - ) - tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) - stop_words = STOP_WORDS - tag_map = TAG_MAP + tokenizer_exceptions = TOKENIZER_EXCEPTIONS prefixes = TOKENIZER_PREFIXES infixes = TOKENIZER_INFIXES suffixes = TOKENIZER_SUFFIXES - - @classmethod - def create_lemmatizer(cls, nlp=None, lookups=None): - if lookups is None: - lookups = Lookups() - return DutchLemmatizer(lookups) + lex_attr_getters = LEX_ATTRS + stop_words = STOP_WORDS class Dutch(Language): @@ -42,4 +24,14 @@ class Dutch(Language): Defaults = DutchDefaults +@Dutch.factory( + "lemmatizer", + assigns=["token.lemma"], + default_config={"model": None, "mode": "rule"}, + default_score_weights={"lemma_acc": 1.0}, +) +def make_lemmatizer(nlp: Language, model: Optional[Model], name: str, mode: str): + return DutchLemmatizer(nlp.vocab, model, name, mode=mode) + + __all__ = ["Dutch"] diff --git a/spacy/lang/nl/examples.py b/spacy/lang/nl/examples.py index a459760f4..8c8c50c60 100644 --- a/spacy/lang/nl/examples.py +++ b/spacy/lang/nl/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/nl/lemmatizer.py b/spacy/lang/nl/lemmatizer.py index 9a92bee44..6c025dcf6 100644 --- a/spacy/lang/nl/lemmatizer.py +++ b/spacy/lang/nl/lemmatizer.py @@ -1,43 +1,28 @@ -# coding: utf8 -from __future__ import unicode_literals +from typing import List, Tuple -from ...lemmatizer import Lemmatizer -from ...symbols import NOUN, VERB, ADJ, NUM, DET, PRON, ADP, AUX, ADV +from ...pipeline import Lemmatizer +from ...tokens import Token class DutchLemmatizer(Lemmatizer): - # Note: CGN does not distinguish AUX verbs, so we treat AUX as VERB. 
- univ_pos_name_variants = { - NOUN: "noun", - "NOUN": "noun", - "noun": "noun", - VERB: "verb", - "VERB": "verb", - "verb": "verb", - AUX: "verb", - "AUX": "verb", - "aux": "verb", - ADJ: "adj", - "ADJ": "adj", - "adj": "adj", - ADV: "adv", - "ADV": "adv", - "adv": "adv", - PRON: "pron", - "PRON": "pron", - "pron": "pron", - DET: "det", - "DET": "det", - "det": "det", - ADP: "adp", - "ADP": "adp", - "adp": "adp", - NUM: "num", - "NUM": "num", - "num": "num", - } + @classmethod + def get_lookups_config(cls, mode: str) -> Tuple[List[str], List[str]]: + if mode == "rule": + required = ["lemma_lookup", "lemma_rules", "lemma_exc", "lemma_index"] + return (required, []) + else: + return super().get_lookups_config(mode) - def __call__(self, string, univ_pos, morphology=None): + def lookup_lemmatize(self, token: Token) -> List[str]: + """Overrides parent method so that a lowercased version of the string + is used to search the lookup table. This is necessary because our + lookup table consists entirely of lowercase keys.""" + lookup_table = self.lookups.get_table("lemma_lookup", {}) + string = token.text.lower() + return [lookup_table.get(string, string)] + + # Note: CGN does not distinguish AUX verbs, so we treat AUX as VERB. + def rule_lemmatize(self, token: Token) -> List[str]: # Difference 1: self.rules is assumed to be non-None, so no # 'is None' check required. # String lowercased from the get-go. All lemmatization results in @@ -45,68 +30,61 @@ class DutchLemmatizer(Lemmatizer): # any problems, and it keeps the exceptions indexes small. If this # creates problems for proper nouns, we can introduce a check for # univ_pos == "PROPN". - string = string.lower() - try: - univ_pos = self.univ_pos_name_variants[univ_pos] - except KeyError: - # Because PROPN not in self.univ_pos_name_variants, proper names - # are not lemmatized. They are lowercased, however. - return [string] - # if string in self.lemma_index.get(univ_pos) + cache_key = (token.lower, token.pos) + if cache_key in self.cache: + return self.cache[cache_key] + string = token.text + univ_pos = token.pos_.lower() + if univ_pos in ("", "eol", "space"): + forms = [string.lower()] + self.cache[cache_key] = forms + return forms + index_table = self.lookups.get_table("lemma_index", {}) + exc_table = self.lookups.get_table("lemma_exc", {}) + rules_table = self.lookups.get_table("lemma_rules", {}) + index = index_table.get(univ_pos, {}) + exceptions = exc_table.get(univ_pos, {}) + rules = rules_table.get(univ_pos, {}) + + string = string.lower() + if univ_pos not in ( + "noun", + "verb", + "aux", + "adj", + "adv", + "pron", + "det", + "adp", + "num", + ): + forms = [string] + self.cache[cache_key] = forms + return forms lemma_index = index_table.get(univ_pos, {}) # string is already lemma if string in lemma_index: - return [string] + forms = [string] + self.cache[cache_key] = forms + return forms exc_table = self.lookups.get_table("lemma_exc", {}) exceptions = exc_table.get(univ_pos, {}) # string is irregular token contained in exceptions index. 
try: - lemma = exceptions[string] - return [lemma[0]] + forms = [exceptions[string][0]] + self.cache[cache_key] = forms + return forms except KeyError: pass # string corresponds to key in lookup table lookup_table = self.lookups.get_table("lemma_lookup", {}) looked_up_lemma = lookup_table.get(string) if looked_up_lemma and looked_up_lemma in lemma_index: - return [looked_up_lemma] + forms = [looked_up_lemma] + self.cache[cache_key] = forms + return forms rules_table = self.lookups.get_table("lemma_rules", {}) - forms, is_known = self.lemmatize( - string, lemma_index, exceptions, rules_table.get(univ_pos, []) - ) - # Back-off through remaining return value candidates. - if forms: - if is_known: - return forms - else: - for form in forms: - if form in exceptions: - return [form] - if looked_up_lemma: - return [looked_up_lemma] - else: - return forms - elif looked_up_lemma: - return [looked_up_lemma] - else: - return [string] - - # Overrides parent method so that a lowercased version of the string is - # used to search the lookup table. This is necessary because our lookup - # table consists entirely of lowercase keys. - def lookup(self, string, orth=None): - lookup_table = self.lookups.get_table("lemma_lookup", {}) - string = string.lower() - if orth is not None: - return lookup_table.get(orth, string) - else: - return lookup_table.get(string, string) - - # Reimplemented to focus more on application of suffix rules and to return - # as early as possible. - def lemmatize(self, string, index, exceptions, rules): - # returns (forms, is_known: bool) oov_forms = [] for old, new in rules: if string.endswith(old): @@ -114,7 +92,31 @@ class DutchLemmatizer(Lemmatizer): if not form: pass elif form in index: - return [form], True # True = Is known (is lemma) + forms = [form] + self.cache[cache_key] = forms + return forms else: oov_forms.append(form) - return list(set(oov_forms)), False + forms = list(set(oov_forms)) + # Back-off through remaining return value candidates. 
+ if forms: + for form in forms: + if form in exceptions: + forms = [form] + self.cache[cache_key] = forms + return forms + if looked_up_lemma: + forms = [looked_up_lemma] + self.cache[cache_key] = forms + return forms + else: + self.cache[cache_key] = forms + return forms + elif looked_up_lemma: + forms = [looked_up_lemma] + self.cache[cache_key] = forms + return forms + else: + forms = [string] + self.cache[cache_key] = forms + return forms diff --git a/spacy/lang/nl/lex_attrs.py b/spacy/lang/nl/lex_attrs.py index 69343b589..f1acaefeb 100644 --- a/spacy/lang/nl/lex_attrs.py +++ b/spacy/lang/nl/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/nl/punctuation.py b/spacy/lang/nl/punctuation.py index e7207038b..d9dd2a6e3 100644 --- a/spacy/lang/nl/punctuation.py +++ b/spacy/lang/nl/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..char_classes import LIST_ELLIPSES, LIST_ICONS, LIST_UNITS, merge_chars from ..char_classes import LIST_PUNCT, LIST_QUOTES, CURRENCY, PUNCT from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER diff --git a/spacy/lang/nl/stop_words.py b/spacy/lang/nl/stop_words.py index 44551f2d4..a2c6198e7 100644 --- a/spacy/lang/nl/stop_words.py +++ b/spacy/lang/nl/stop_words.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - # The original stop words list (added in f46ffe3) was taken from # http://www.damienvanholten.com/downloads/dutch-stop-words.txt # and consisted of about 100 tokens. diff --git a/spacy/lang/nl/tag_map.py b/spacy/lang/nl/tag_map.py deleted file mode 100644 index 4fde5d39f..000000000 --- a/spacy/lang/nl/tag_map.py +++ /dev/null @@ -1,1028 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import POS, PUNCT, ADJ, NUM, DET, ADV, ADP, X, VERB -from ...symbols import NOUN, PROPN, SPACE, PRON, CONJ - - -TAG_MAP = { - "ADJ__Number=Sing": {POS: ADJ}, - "ADJ___": {POS: ADJ}, - "ADP__AdpType=Prep": {POS: ADP}, - "ADP__AdpType=Preppron|Gender=Fem|Number=Sing": {POS: ADP}, - "ADP__AdpType=Preppron|Gender=Masc|Number=Plur": {POS: ADP}, - "ADP__AdpType=Preppron|Gender=Masc|Number=Sing": {POS: ADP}, - "ADV__Number=Sing": {POS: ADV}, - "ADV__PunctType=Comm": {POS: ADV}, - "ADV___": {POS: ADV}, - "Adj_Adj_N_N__Degree=Pos|Number=Sing": {POS: ADJ}, - "Adj_Adj_N__Degree=Pos|Number=Plur|Variant=Short": {POS: ADJ}, - "Adj_Adj_N__Degree=Pos|Number=Sing": {POS: ADJ}, - "Adj_Adj__Case=Nom|Degree=Pos": {POS: ADJ}, - "Adj_Adj__Degree=Pos": {POS: ADJ}, - "Adj_Adj__Degree=Pos|Variant=Short": {POS: ADJ}, - "Adj_Adv__Degree=Pos|Variant=Short": {POS: ADJ}, - "Adj_Adv|adv|stell|onverv_deelv__Degree=Pos|Variant=Short": {POS: ADJ}, - "Adj_Art__Degree=Pos|Number=Sing": {POS: ADJ}, - "Adj_Art__Degree=Pos|Number=Sing|Variant=Short": {POS: ADJ}, - "Adj_Conj_V__Degree=Pos|Mood=Sub|VerbForm=Fin": {POS: ADJ}, - "Adj_Int|attr|stell|vervneut__Case=Nom|Degree=Pos": {POS: ADJ}, - "Adj_Misc_Misc__Degree=Pos": {POS: ADJ}, - "Adj_N_Conj_N__Degree=Pos|Number=Sing": {POS: ADJ}, - "Adj_N_N_N_N__Degree=Pos|Number=Sing": {POS: ADJ}, - "Adj_N_N_N__Degree=Pos|Number=Sing": {POS: ADJ}, - "Adj_N_N__Degree=Pos|Number=Sing": {POS: ADJ}, - "Adj_N_Num__Definite=Def|Degree=Pos|Number=Sing": {POS: ADJ}, - "Adj_N_Prep_Art_Adj_N__Degree=Pos|Gender=Neut|Number=Sing": {POS: ADJ}, - "Adj_N_Prep_N_Conj_N__Degree=Pos|Number=Sing": {POS: ADJ}, - "Adj_N_Prep_N_N__Degree=Pos|Number=Sing": {POS: ADJ}, - 
"Adj_N_Prep_N__Degree=Pos|Number=Sing": {POS: ADJ}, - "Adj_N_Punc__Degree=Pos|Number=Sing": {POS: ADJ}, - "Adj_N__Degree=Pos|Number=Plur": {POS: ADJ}, - "Adj_N__Degree=Pos|Number=Sing": {POS: ADJ}, - "Adj_N__Degree=Pos|Number=Sing|Variant=Short": {POS: ADJ}, - "Adj_Num__Definite=Def|Degree=Pos": {POS: ADJ}, - "Adj_Num__Definite=Def|Degree=Pos|Variant=Short": {POS: ADJ}, - "Adj_Prep|adv|stell|vervneut_voor__Degree=Pos|Variant=Short": {POS: ADJ}, - "Adj_Prep|adv|vergr|onverv_voor__Degree=Cmp|Variant=Short": {POS: ADJ}, - "Adj_V_Conj_V__Degree=Pos|VerbForm=Inf": {POS: ADJ}, - "Adj_V_N__Degree=Pos|Number=Sing|Tense=Past|VerbForm=Part": {POS: ADJ}, - "Adj_V|adv|stell|onverv_intrans|inf__Degree=Pos|Variant=Short|VerbForm=Inf": { - POS: ADJ - }, - "Adj_V|adv|stell|onverv_trans|imp__Degree=Pos|Mood=Imp|Variant=Short|VerbForm=Fin": { - POS: ADJ - }, - "Adj|adv|stell|onverv__Degree=Pos|Variant=Short": {POS: ADJ}, - "Adj|adv|stell|vervneut__Case=Nom|Degree=Pos|Variant=Short": {POS: ADJ}, - "Adj|adv|vergr|onverv__Degree=Cmp|Variant=Short": {POS: ADJ}, - "Adj|adv|vergr|vervneut__Case=Nom|Degree=Cmp|Variant=Short": {POS: ADJ}, - "Adj|attr|overtr|onverv__Degree=Sup": {POS: ADJ}, - "Adj|attr|overtr|vervneut__Case=Nom|Degree=Sup": {POS: ADJ}, - "Adj|attr|stell|onverv__Degree=Pos": {POS: ADJ}, - "Adj|attr|stell|vervgen__Case=Gen|Degree=Pos": {POS: ADJ}, - "Adj|attr|stell|vervneut__Case=Nom|Degree=Pos": {POS: ADJ}, - "Adj|attr|vergr|onverv__Degree=Cmp": {POS: ADJ}, - "Adj|attr|vergr|vervgen__Case=Gen|Degree=Cmp": {POS: ADJ}, - "Adj|attr|vergr|vervneut__Case=Nom|Degree=Cmp": {POS: ADJ}, - "Adj|zelfst|overtr|vervneut__Case=Nom|Degree=Sup": {POS: ADJ}, - "Adj|zelfst|stell|onverv__Degree=Pos": {POS: ADJ}, - "Adj|zelfst|stell|vervmv__Degree=Pos|Number=Plur": {POS: ADJ}, - "Adj|zelfst|stell|vervneut__Case=Nom|Degree=Pos": {POS: ADJ}, - "Adj|zelfst|vergr|vervneut__Case=Nom|Degree=Cmp": {POS: ADJ}, - "Adv_Adj_Conj__Degree=Pos": {POS: ADV}, - "Adv_Adj__Degree=Cmp": {POS: ADV}, - "Adv_Adj__Degree=Pos": {POS: ADV}, - "Adv_Adv_Conj_Adv__PronType=Dem": {POS: ADV}, - "Adv_Adv__AdpType=Prep": {POS: ADV}, - "Adv_Adv__Degree=Pos": {POS: ADV}, - "Adv_Adv__Degree=Pos|PronType=Dem": {POS: ADV}, - "Adv_Adv|pron|vrag_deeladv___": {POS: ADV}, - "Adv_Art__Degree=Pos|Number=Sing": {POS: ADV}, - "Adv_Art__Number=Sing": {POS: ADV}, - "Adv_Conj_Adv__AdpType=Preppron|Gender=Masc|Number=Sing": {POS: ADV}, - "Adv_Conj_Adv__Degree=Pos": {POS: ADV}, - "Adv_Conj_Adv|gew|aanw_neven_gew|aanw__PronType=Dem": {POS: ADV}, - "Adv_Conj_Adv|gew|onbep_neven_gew|onbep__PronType=Ind": {POS: ADV}, - "Adv_Conj_N__Degree=Pos|Number=Sing": {POS: ADV}, - "Adv_Conj__Degree=Pos": {POS: ADV}, - "Adv_N__Degree=Pos|Number=Sing": {POS: ADV}, - "Adv_Num__Degree=Cmp|PronType=Ind": {POS: ADV}, - "Adv_N|gew|aanw_soort|ev|neut__Number=Sing": {POS: ADV}, - "Adv_Prep_N__Case=Dat|Degree=Pos|Number=Sing": {POS: ADV}, - "Adv_Prep_Pron__AdpType=Preppron|Gender=Masc|Number=Sing": {POS: ADV}, - "Adv_Prep__Degree=Pos": {POS: ADV}, - "Adv_Prep|gew|aanw_voor__AdpType=Prep": {POS: ADV}, - "Adv_Prep|gew|aanw_voor___": {POS: ADV}, - "Adv_Pron__Degree=Pos": {POS: ADV}, - "Adv|deeladv__PartType=Vbp": {POS: ADV}, - "Adv|deelv__PartType=Vbp": {POS: ADV}, - "Adv|gew|aanw__PronType=Dem": {POS: ADV}, - "Adv|gew|betr__PronType=Rel": {POS: ADV}, - "Adv|gew|er__AdvType=Ex": {POS: ADV}, - "Adv|gew|geenfunc|overtr|onverv__Degree=Sup": {POS: ADV}, - "Adv|gew|geenfunc|stell|onverv__Degree=Pos": {POS: ADV}, - "Adv|gew|geenfunc|vergr|onverv__Degree=Cmp": {POS: ADV}, - 
"Adv|gew|onbep__PronType=Ind": {POS: ADV}, - "Adv|gew|vrag__PronType=Int": {POS: ADV}, - "Adv|pron|aanw__PronType=Dem": {POS: ADV}, - "Adv|pron|betr__PronType=Rel": {POS: ADV}, - "Adv|pron|er__AdvType=Ex": {POS: ADV}, - "Adv|pron|onbep__PronType=Ind": {POS: ADV}, - "Adv|pron|vrag__PronType=Int": {POS: ADV}, - "Art_Adj_N__AdpType=Prep": {POS: DET}, - "Art_Adj_N__Definite=Def|Degree=Sup|Gender=Neut|Number=Sing": {POS: DET}, - "Art_Adj__Case=Nom|Definite=Def|Degree=Cmp|Gender=Neut": {POS: DET}, - "Art_Adj__Case=Nom|Definite=Def|Degree=Sup|Gender=Neut": {POS: DET}, - "Art_Adj__Definite=Def|Degree=Cmp|Gender=Neut": {POS: DET}, - "Art_Adj__Definite=Def|Degree=Sup|Gender=Neut": {POS: DET}, - "Art_Adv__Definite=Def|Degree=Sup|Gender=Neut": {POS: DET}, - "Art_Conj_Pron__Number=Sing|PronType=Ind": {POS: DET}, - "Art_N_Conj_Art_N__Definite=Def|Gender=Neut|Number=Sing": {POS: DET}, - "Art_N_Conj_Art_V__AdpType=Prep": {POS: DET}, - "Art_N_Conj_Pron_N__Definite=Def|Gender=Neut|Number=Plur|Person=3": {POS: DET}, - "Art_N_Conj__Number=Sing|PronType=Ind": {POS: DET}, - "Art_N_N__AdpType=Prep": {POS: DET}, - "Art_N_Prep_Adj__Degree=Pos|Number=Sing|PronType=Ind": {POS: DET}, - "Art_N_Prep_Art_N__Number=Sing|PronType=Ind": {POS: DET}, - "Art_N_Prep_N__AdpType=Prep": {POS: DET}, - "Art_N_Prep_N__Definite=Def|Gender=Neut|Number=Sing": {POS: DET}, - "Art_N_Prep_N__Number=Sing|PronType=Ind": {POS: DET}, - "Art_N_Prep_Pron_N__AdpType=Prep": {POS: DET}, - "Art_N__AdpType=Prep": {POS: DET}, - "Art_N__Case=Gen|Definite=Def|Number=Sing": {POS: DET}, - "Art_N__Number=Sing|PronType=Ind": {POS: DET}, - "Art_Num_Art_Adj__AdpType=Prep": {POS: DET}, - "Art_Num_N__AdpType=Prep": {POS: DET}, - "Art_Num__Definite=Def|Degree=Sup|Gender=Neut|PronType=Ind": {POS: DET}, - "Art_Num__Definite=Def|Gender=Neut": {POS: DET}, - "Art_Num__Degree=Pos|Number=Sing|PronType=Ind": {POS: DET}, - "Art_N|bep|onzijd|neut_eigen|ev|neut__Definite=Def|Gender=Neut|Number=Sing": { - POS: DET - }, - "Art_N|bep|onzijd|neut_soort|ev|neut__Definite=Def|Gender=Neut|Number=Sing": { - POS: DET - }, - "Art_Pron_N__Case=Gen|Number=Plur|PronType=Ind": {POS: DET}, - "Art_Pron__Number=Sing|PronType=Ind": {POS: DET}, - "Art_V_N__AdpType=Prep": {POS: DET}, - "Art|bep|onzijd|neut__Definite=Def|Gender=Neut|PronType=Art": {POS: DET}, - "Art|bep|zijdofmv|gen__Case=Gen|Definite=Def|PronType=Art": {POS: DET}, - "Art|bep|zijdofmv|neut__Definite=Def|PronType=Art": {POS: DET}, - "Art|bep|zijdofonzijd|gen__Case=Gen|Definite=Def|Number=Sing|PronType=Art": { - POS: DET - }, - "Art|bep|zijd|dat__Case=Dat|Definite=Def|Gender=Com|PronType=Art": {POS: DET}, - "Art|onbep|zijdofonzijd|neut__Definite=Ind|Number=Sing|PronType=Art": {POS: DET}, - "CCONJ___": {POS: CONJ}, - "Conj_Adj|neven_adv|vergr|onverv__Degree=Cmp": {POS: CONJ}, - "Conj_Adj|neven_attr|stell|onverv__Degree=Pos": {POS: CONJ}, - "Conj_Adv_Adv__Degree=Pos": {POS: CONJ}, - "Conj_Adv__AdpType=Prep": {POS: CONJ}, - "Conj_Adv__AdpType=Preppron|Gender=Masc|Number=Plur": {POS: CONJ}, - "Conj_Adv__Degree=Pos": {POS: CONJ}, - "Conj_Adv|neven_gew|aanw__PronType=Dem": {POS: CONJ}, - "Conj_Art_N__AdpType=Preppron|Gender=Masc|Number=Plur": {POS: CONJ}, - "Conj_Art_N__Gender=Neut|Number=Sing": {POS: CONJ}, - "Conj_Conj|neven_onder|metfin___": {POS: CONJ}, - "Conj_Int|neven___": {POS: CONJ}, - "Conj_Int|onder|metfin___": {POS: CONJ}, - "Conj_N_Adv__AdpType=Preppron|Gender=Masc|Number=Plur": {POS: CONJ}, - "Conj_N_Prep__AdpType=Preppron|Gender=Masc|Number=Plur": {POS: CONJ}, - 
"Conj_N|onder|metfin_soort|ev|neut__AdpType=Preppron|Gender=Masc|Number=Plur": { - POS: CONJ - }, - "Conj_Pron_Adv__Degree=Pos|Number=Sing|Person=3": {POS: CONJ}, - "Conj_Pron_V__AdpType=Preppron|Gender=Masc|Number=Plur": {POS: CONJ}, - "Conj_Pron|neven_aanw|neut|zelfst__AdpType=Prep": {POS: CONJ}, - "Conj_Punc_Conj|neven_schuinstreep_neven__AdpType=Prep": {POS: CONJ}, - "Conj_V|onder|metfin_intrans|ott|3|ev__AdpType=Preppron|Gender=Masc|Number=Plur": { - POS: CONJ - }, - "Conj|neven___": {POS: CONJ}, - "Conj|onder|metfin___": {POS: CONJ}, - "Conj|onder|metinf___": {POS: CONJ}, - "DET__Degree=Cmp|NumType=Card|PronType=Ind": {POS: DET}, - "DET__Gender=Fem|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": { - POS: DET - }, - "DET__Gender=Fem|Number=Sing|PronType=Art": {POS: DET}, - "DET__Gender=Masc|Number=Plur|PronType=Art": {POS: DET}, - "DET__Gender=Masc|Number=Sing|PronType=Tot": {POS: DET}, - "Int_Adv|gew|aanw___": {POS: X}, - "Int_Int__NumType=Card": {POS: X}, - "Int_Int___": {POS: X}, - "Int_N_N_Misc_N___": {POS: X}, - "Int_N_Punc_Int_N__Number=Sing": {POS: X}, - "Int_Punc_Int|komma__PunctType=Comm": {POS: X}, - "Int___": {POS: X}, - "Misc_Misc_Misc_Misc_Misc_Misc_Misc_Misc_Misc___": {POS: X}, - "Misc_Misc_Misc_Misc_Misc_Misc_Misc___": {POS: X}, - "Misc_Misc_Misc_Misc_Misc_Misc_Punc_Misc_Misc_Misc___": {POS: X}, - "Misc_Misc_Misc_Misc_Misc_Misc___": {POS: X}, - "Misc_Misc_Misc_Misc_Misc_N_Misc_Misc_Misc_Misc_Misc_Misc___": {POS: X}, - "Misc_Misc_Misc_Misc|vreemd_vreemd_vreemd_vreemd__AdpType=Preppron|Gender=Masc|Number=Sing": { - POS: X - }, - "Misc_Misc_Misc_Misc|vreemd_vreemd_vreemd_vreemd___": {POS: X}, - "Misc_Misc_Misc_N__Number=Sing": {POS: X}, - "Misc_Misc_Misc|vreemd_vreemd_vreemd___": {POS: X}, - "Misc_Misc_N_N__Number=Sing": {POS: X}, - "Misc_Misc_N|vreemd_vreemd_soort|mv|neut__Number=Plur": {POS: X}, - "Misc_Misc_Punc_N_N__Number=Sing": {POS: X}, - "Misc_Misc|vreemd_vreemd__AdpType=Prep": {POS: X}, - "Misc_Misc|vreemd_vreemd__NumType=Card": {POS: X}, - "Misc_Misc|vreemd_vreemd___": {POS: X}, - "Misc_N_Misc_Misc__Number=Sing": {POS: X}, - "Misc_N_N__Number=Sing": {POS: X}, - "Misc_N|vreemd_eigen|ev|neut__Number=Sing": {POS: X}, - "Misc_N|vreemd_soort|ev|neut__Number=Sing": {POS: X}, - "Misc|vreemd__Foreign=Yes": {POS: X}, - "NUM__Case=Nom|Definite=Def|Degree=Pos|NumType=Card": {POS: NUM}, - "NUM__Definite=Def|Degree=Pos|NumType=Card": {POS: NUM}, - "NUM__Definite=Def|Degree=Pos|Number=Sing|NumType=Card": {POS: NUM}, - "NUM__Definite=Def|NumType=Card": {POS: NUM}, - "NUM__Definite=Def|Number=Plur|NumType=Card": {POS: NUM}, - "NUM__Definite=Def|Number=Sing|NumType=Card": {POS: NUM}, - "NUM__NumForm=Digit|NumType=Card": {POS: NUM}, - "NUM__NumType=Card": {POS: NUM}, - "N_Adj_N_Num__Definite=Def|Degree=Pos|Number=Sing": {POS: NOUN}, - "N_Adj_N__Degree=Pos|Number=Plur": {POS: NOUN}, - "N_Adj_N___": {POS: NOUN}, - "N_Adj__AdpType=Prep": {POS: NOUN}, - "N_Adj__Case=Nom|Degree=Pos|Number=Plur": {POS: NOUN}, - "N_Adj__Case=Nom|Degree=Pos|Number=Sing": {POS: NOUN}, - "N_Adj__Degree=Pos|Number=Plur": {POS: NOUN}, - "N_Adj__Degree=Pos|Number=Sing": {POS: NOUN}, - "N_Adj___": {POS: NOUN}, - "N_Adv_Punc_V_Pron_V__Aspect=Imp|Degree=Pos|Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Inf": { - POS: NOUN - }, - "N_Adv__Degree=Pos|Number=Sing": {POS: NOUN}, - "N_Adv___": {POS: NOUN}, - "N_Adv|soort|ev|neut_deelv__Number=Sing": {POS: NOUN}, - "N_Art_Adj_Prep_N___": {POS: NOUN}, - "N_Art_N__Case=Gen|Number=Sing": {POS: NOUN}, - "N_Art_N__Number=Plur": {POS: NOUN}, - 
"N_Art_N__Number=Sing": {POS: NOUN}, - "N_Art_N___": {POS: NOUN}, - "N_Conj_Adv__Degree=Pos|Number=Sing": {POS: NOUN}, - "N_Conj_Art_N___": {POS: NOUN}, - "N_Conj_N_N__Number=Sing": {POS: NOUN}, - "N_Conj_N_N___": {POS: NOUN}, - "N_Conj_N__Number=Plur": {POS: NOUN}, - "N_Conj_N__Number=Sing": {POS: NOUN}, - "N_Conj_N___": {POS: NOUN}, - "N_Conj|soort|ev|neut_neven__Number=Sing": {POS: NOUN}, - "N_Int_N|eigen|ev|neut_eigen|ev|neut___": {POS: NOUN}, - "N_Misc_Misc_Misc_Misc___": {POS: NOUN}, - "N_Misc_Misc_N___": {POS: NOUN}, - "N_Misc_Misc|eigen|ev|neut_vreemd_vreemd___": {POS: NOUN}, - "N_Misc_Misc|soort|mv|neut_vreemd_vreemd__Number=Plur": {POS: NOUN}, - "N_Misc_N_N_N_N___": {POS: NOUN}, - "N_Misc_N_N___": {POS: NOUN}, - "N_Misc_N___": {POS: NOUN}, - "N_Misc_Num___": {POS: NOUN}, - "N_Misc|eigen|ev|neut_vreemd___": {POS: NOUN}, - "N_Misc|soort|ev|neut_vreemd__Number=Sing": {POS: NOUN}, - "N_N_Adj_Art_N_N__Gender=Masc|Number=Plur|PronType=Art": {POS: NOUN}, - "N_N_Adj_N___": {POS: NOUN}, - "N_N_Adj__Degree=Pos|Number=Sing": {POS: NOUN}, - "N_N_Adj___": {POS: NOUN}, - "N_N_Art_Adv___": {POS: NOUN}, - "N_N_Art_N___": {POS: NOUN}, - "N_N_Conj_N_N_N_N_N___": {POS: NOUN}, - "N_N_Conj_N_N___": {POS: NOUN}, - "N_N_Conj_N__Number=Sing": {POS: NOUN}, - "N_N_Conj_N___": {POS: NOUN}, - "N_N_Conj___": {POS: NOUN}, - "N_N_Int_N_N___": {POS: NOUN}, - "N_N_Misc___": {POS: NOUN}, - "N_N_N_Adj_N___": {POS: NOUN}, - "N_N_N_Adv___": {POS: NOUN}, - "N_N_N_Int__AdpType=Prep": {POS: NOUN}, - "N_N_N_Misc___": {POS: NOUN}, - "N_N_N_N_Conj_N___": {POS: NOUN}, - "N_N_N_N_Misc___": {POS: NOUN}, - "N_N_N_N_N_N_Int__AdpType=Prep": {POS: NOUN}, - "N_N_N_N_N_N_N__AdpType=Prep": {POS: NOUN}, - "N_N_N_N_N_N_N__Gender=Fem|Number=Sing|PronType=Art": {POS: NOUN}, - "N_N_N_N_N_N_N___": {POS: NOUN}, - "N_N_N_N_N_N_Prep_N___": {POS: NOUN}, - "N_N_N_N_N_N__AdpType=Prep": {POS: NOUN}, - "N_N_N_N_N_N___": {POS: NOUN}, - "N_N_N_N_N_Prep_N___": {POS: NOUN}, - "N_N_N_N_N__AdpType=Prep": {POS: NOUN}, - "N_N_N_N_N__Number=Sing": {POS: NOUN}, - "N_N_N_N_N___": {POS: NOUN}, - "N_N_N_N_Prep_N___": {POS: NOUN}, - "N_N_N_N_Punc_N_Punc___": {POS: NOUN}, - "N_N_N_N_V___": {POS: NOUN}, - "N_N_N_N__Gender=Fem|Number=Plur|PronType=Art": {POS: NOUN}, - "N_N_N_N__Gender=Fem|Number=Sing|PronType=Art": {POS: NOUN}, - "N_N_N_N__NumType=Card": {POS: NOUN}, - "N_N_N_N__Number=Plur": {POS: NOUN}, - "N_N_N_N__Number=Sing": {POS: NOUN}, - "N_N_N_N___": {POS: NOUN}, - "N_N_N_Prep_Art_Adj_N___": {POS: NOUN}, - "N_N_N_Prep_N_N___": {POS: NOUN}, - "N_N_N_Prep_N___": {POS: NOUN}, - "N_N_N_Punc_N___": {POS: NOUN}, - "N_N_N_Punc___": {POS: NOUN}, - "N_N_N__AdpType=Prep": {POS: NOUN}, - "N_N_N__Gender=Fem|Number=Sing|PronType=Art": {POS: NOUN}, - "N_N_N__Gender=Masc|Number=Plur|PronType=Art": {POS: NOUN}, - "N_N_N__Number=Plur": {POS: NOUN}, - "N_N_N__Number=Sing": {POS: NOUN}, - "N_N_N___": {POS: NOUN}, - "N_N_Num_N___": {POS: NOUN}, - "N_N_Num__Definite=Def|Number=Sing": {POS: NOUN}, - "N_N_Num___": {POS: NOUN}, - "N_N_Prep_Art_Adj_N__Degree=Pos|Gender=Neut|Number=Sing": {POS: NOUN}, - "N_N_Prep_Art_N_Prep_Art_N___": {POS: NOUN}, - "N_N_Prep_Art_N___": {POS: NOUN}, - "N_N_Prep_N_N__AdpType=Prep": {POS: NOUN}, - "N_N_Prep_N_Prep_Adj_N___": {POS: NOUN}, - "N_N_Prep_N__AdpType=Prep": {POS: NOUN}, - "N_N_Prep_N__Number=Sing": {POS: NOUN}, - "N_N_Prep_N___": {POS: NOUN}, - "N_N_Punc_N_Punc___": {POS: NOUN}, - "N_Num_N_N__Definite=Def|Number=Sing": {POS: NOUN}, - "N_Num_N_Num___": {POS: NOUN}, - "N_Num_N___": {POS: NOUN}, - "N_Num_Num__Definite=Def|Number=Sing": 
{POS: NOUN}, - "N_Num__Definite=Def|Number=Plur": {POS: NOUN}, - "N_Num__Definite=Def|Number=Sing": {POS: NOUN}, - "N_Num___": {POS: NOUN}, - "N_N|eigen|ev|gen_eigen|ev|gen___": {POS: NOUN}, - "N_N|eigen|ev|gen_eigen|ev|neut___": {POS: NOUN}, - "N_N|eigen|ev|gen_soort|ev|neut___": {POS: NOUN}, - "N_N|eigen|ev|gen_soort|mv|neut___": {POS: NOUN}, - "N_N|eigen|ev|neut_eigen|ev|gen___": {POS: NOUN}, - "N_N|eigen|ev|neut_eigen|ev|neut__AdpType=Prep": {POS: NOUN}, - "N_N|eigen|ev|neut_eigen|ev|neut__AdpType=Preppron|Gender=Fem|Number=Plur": { - POS: NOUN - }, - "N_N|eigen|ev|neut_eigen|ev|neut__AdpType=Preppron|Gender=Masc|Number=Sing": { - POS: NOUN - }, - "N_N|eigen|ev|neut_eigen|ev|neut__Gender=Fem|Number=Plur|PronType=Art": {POS: NOUN}, - "N_N|eigen|ev|neut_eigen|ev|neut__Gender=Fem|Number=Sing|PronType=Art": {POS: NOUN}, - "N_N|eigen|ev|neut_eigen|ev|neut__Gender=Masc|Number=Plur|PronType=Art": { - POS: NOUN - }, - "N_N|eigen|ev|neut_eigen|ev|neut__Gender=Masc|Number=Sing|PronType=Art": { - POS: NOUN - }, - "N_N|eigen|ev|neut_eigen|ev|neut__NumType=Card": {POS: NOUN}, - "N_N|eigen|ev|neut_eigen|ev|neut__Number=Sing": {POS: NOUN}, - "N_N|eigen|ev|neut_eigen|ev|neut___": {POS: NOUN}, - "N_N|eigen|ev|neut_eigen|mv|neut___": {POS: NOUN}, - "N_N|eigen|ev|neut_soort|ev|neut__AdpType=Prep": {POS: NOUN}, - "N_N|eigen|ev|neut_soort|ev|neut___": {POS: NOUN}, - "N_N|eigen|ev|neut_soort|mv|neut___": {POS: NOUN}, - "N_N|eigen|mv|neut_eigen|mv|neut___": {POS: NOUN}, - "N_N|soort|ev|neut_eigen|ev|neut__Number=Sing": {POS: NOUN}, - "N_N|soort|ev|neut_soort|ev|neut__Gender=Masc|Number=Plur|PronType=Art": { - POS: NOUN - }, - "N_N|soort|ev|neut_soort|ev|neut__NumForm=Digit|NumType=Card": {POS: NOUN}, - "N_N|soort|ev|neut_soort|ev|neut__Number=Sing": {POS: NOUN}, - "N_N|soort|ev|neut_soort|mv|neut__Number=Plur": {POS: NOUN}, - "N_N|soort|mv|neut_eigen|ev|neut__Number=Sing": {POS: NOUN}, - "N_N|soort|mv|neut_soort|ev|neut__Number=Sing": {POS: NOUN}, - "N_N|soort|mv|neut_soort|mv|neut__Number=Plur": {POS: NOUN}, - "N_Prep_Adj_Adj_N__Degree=Pos|Number=Plur": {POS: NOUN}, - "N_Prep_Adj_N___": {POS: NOUN}, - "N_Prep_Art_N_Art_N__Number=Plur": {POS: NOUN}, - "N_Prep_Art_N_N__Number=Sing": {POS: NOUN}, - "N_Prep_Art_N_Prep_Art_N__Gender=Neut|Number=Sing": {POS: NOUN}, - "N_Prep_Art_N__Number=Plur": {POS: NOUN}, - "N_Prep_Art_N__Number=Sing": {POS: NOUN}, - "N_Prep_Art_N___": {POS: NOUN}, - "N_Prep_N_Art_Adj___": {POS: NOUN}, - "N_Prep_N_N__Number=Sing": {POS: NOUN}, - "N_Prep_N_N___": {POS: NOUN}, - "N_Prep_N_Prep_Art_N___": {POS: NOUN}, - "N_Prep_N_Prep_N_Conj_N_Prep_Art_N_N__Number=Sing": {POS: NOUN}, - "N_Prep_N_Punc_N_Conj_N__Number=Sing": {POS: NOUN}, - "N_Prep_N__Number=Plur": {POS: NOUN}, - "N_Prep_N__Number=Sing": {POS: NOUN}, - "N_Prep_N___": {POS: NOUN}, - "N_Prep_Num__Definite=Def|Number=Sing": {POS: NOUN}, - "N_Prep_Pron_N___": {POS: NOUN}, - "N_Prep|soort|ev|neut_voor__Number=Sing": {POS: NOUN}, - "N_Pron___": {POS: NOUN}, - "N_Punc_Adj_N___": {POS: NOUN}, - "N_Punc_Adj_Pron_Punc__Degree=Pos|Number=Sing|Person=2": {POS: NOUN}, - "N_Punc_Adv_V_Pron_N__Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin": { - POS: NOUN - }, - "N_Punc_Misc_Punc_N___": {POS: NOUN}, - "N_Punc_N_N_N_N__Number=Sing": {POS: NOUN}, - "N_Punc_N_Punc_N__Number=Sing": {POS: NOUN}, - "N_Punc_N_Punc__Number=Sing": {POS: NOUN}, - "N_Punc_N__Number=Sing": {POS: NOUN}, - "N_Punc_Punc_N_N_Punc_Punc_N___": {POS: NOUN}, - "N_V_N_N___": {POS: NOUN}, - "N_V_N___": {POS: NOUN}, - 
"N_V__Aspect=Imp|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin": {POS: NOUN}, - "N_V__Number=Sing|Tense=Past|VerbForm=Part": {POS: NOUN}, - "N_V___": {POS: NOUN}, - "N_V|eigen|ev|neut_trans|imp___": {POS: NOUN}, - "N_V|soort|ev|neut_hulpofkopp|conj__Mood=Sub|Number=Sing|VerbForm=Fin": {POS: NOUN}, - "N_V|soort|ev|neut_intrans|conj__Mood=Sub|Number=Sing|VerbForm=Fin": {POS: NOUN}, - "Num_Adj_Adj_N___": {POS: NUM}, - "Num_Adj_N___": {POS: NUM}, - "Num_Adj__Definite=Def|Degree=Pos|NumType=Card": {POS: NUM}, - "Num_Adj__NumForm=Digit|NumType=Card": {POS: NUM}, - "Num_Adj___": {POS: NUM}, - "Num_Conj_Adj__Case=Nom|Definite=Def|Degree=Pos|NumType=Card": {POS: NUM}, - "Num_Conj_Art_Adj__Definite=Def|Degree=Pos|Number=Sing|NumType=Card": {POS: NUM}, - "Num_Conj_Num_N__NumForm=Digit|NumType=Card": {POS: NUM}, - "Num_Conj_Num__Degree=Cmp|NumType=Card|PronType=Ind": {POS: NUM}, - "Num_N_N__Definite=Def|Number=Sing|NumType=Card": {POS: NUM}, - "Num_N_Num_Num_N__NumForm=Digit|NumType=Card": {POS: NUM}, - "Num_N_Num__Definite=Def|Number=Sing|NumType=Card": {POS: NUM}, - "Num_N_Num__NumForm=Digit|NumType=Card": {POS: NUM}, - "Num_N__Definite=Def|Number=Plur|NumType=Card": {POS: NUM}, - "Num_N__Definite=Def|Number=Sing|NumType=Card": {POS: NUM}, - "Num_N__NumForm=Digit|NumType=Card": {POS: NUM}, - "Num_N___": {POS: NUM}, - "Num_Num_N__NumForm=Digit|NumType=Card": {POS: NUM}, - "Num_Num__Definite=Def|NumType=Card": {POS: NUM}, - "Num_Num__NumForm=Digit|NumType=Card": {POS: NUM}, - "Num_Prep_Num__Definite=Def|NumType=Card": {POS: NUM}, - "Num_Punc_Num_N_N__NumForm=Digit|NumType=Card": {POS: NUM}, - "Num_Punc_Num__NumForm=Digit|NumType=Card": {POS: NUM}, - "Num_Punc__NumForm=Digit|NumType=Card": {POS: NUM}, - "Num__Case=Nom|Degree=Cmp|NumType=Card|PronType=Ind": {POS: NUM}, - "Num__Case=Nom|Degree=Pos|NumType=Card|PronType=Ind": {POS: NUM}, - "Num__Case=Nom|Degree=Sup|NumType=Card|PronType=Ind": {POS: NUM}, - "Num__Degree=Cmp|NumType=Card|PronType=Ind": {POS: NUM}, - "Num__Degree=Pos|NumType=Card|PronType=Ind": {POS: NUM}, - "Num__Degree=Pos|Number=Plur|NumType=Card|PronType=Ind": {POS: NUM}, - "Num__Degree=Sup|NumType=Card|PronType=Ind": {POS: NUM}, - "Num__Degree=Sup|Number=Plur|NumType=Card|PronType=Ind": {POS: NUM}, - "Num|hoofd|bep|attr|onverv__Definite=Def|NumType=Card": {POS: NUM}, - "Num|hoofd|bep|zelfst|onverv__Definite=Def|NumType=Card": {POS: NUM}, - "Num|hoofd|bep|zelfst|vervmv__Definite=Def|Number=Plur|NumType=Card": {POS: NUM}, - "Num|hoofd|onbep|attr|stell|onverv__Degree=Pos|NumType=Card|PronType=Ind": { - POS: NUM - }, - "Num|hoofd|onbep|attr|vergr|onverv__Degree=Cmp|NumType=Card|PronType=Ind": { - POS: NUM - }, - "Num|rang|bep|attr|onverv__Definite=Def|NumType=Ord": {POS: NUM}, - "Num|rang|bep|zelfst|onverv__Definite=Def|NumType=Ord": {POS: NUM}, - "N|eigen|ev|gen__Case=Gen|Number=Sing": {POS: NOUN}, - "N|eigen|ev|neut__Number=Sing": {POS: NOUN}, - "N|eigen|mv|neut__Number=Plur": {POS: NOUN}, - "N|soort|ev|dat__Case=Dat|Number=Sing": {POS: NOUN}, - "N|soort|ev|gen__Case=Gen|Number=Sing": {POS: NOUN}, - "N|soort|ev|neut__Number=Sing": {POS: NOUN}, - "N|soort|mv|neut__Number=Plur": {POS: NOUN}, - "PROPN___": {POS: PROPN}, - "PUNCT___": {POS: PUNCT}, - "Prep_Adj_Conj_Prep_N__Degree=Pos|Number=Sing": {POS: ADP}, - "Prep_Adj_N__Degree=Pos|Number=Plur": {POS: ADP}, - "Prep_Adj|voor_adv|vergr|vervneut__Case=Nom|Degree=Cmp": {POS: ADP}, - "Prep_Adj|voor_attr|stell|onverv__Degree=Pos": {POS: ADP}, - "Prep_Adj|voor_attr|stell|vervneut__Case=Nom|Degree=Pos": {POS: ADP}, - "Prep_Adv__AdpType=Prep": {POS: 
ADP}, - "Prep_Adv__Case=Nom|Degree=Pos": {POS: ADP}, - "Prep_Adv__Case=Nom|Degree=Sup": {POS: ADP}, - "Prep_Adv__Degree=Pos": {POS: ADP}, - "Prep_Adv|voor_gew|aanw__AdpType=Prep": {POS: ADP}, - "Prep_Adv|voor_gew|aanw__Gender=Masc|Number=Sing|PronType=Tot": {POS: ADP}, - "Prep_Adv|voor_gew|aanw__PronType=Dem": {POS: ADP}, - "Prep_Adv|voor_pron|vrag__PronType=Int": {POS: ADP}, - "Prep_Art_Adj_N__Degree=Pos|Number=Sing": {POS: ADP}, - "Prep_Art_Adj__AdpType=Prep": {POS: ADP}, - "Prep_Art_Adj__Case=Nom|Degree=Pos": {POS: ADP}, - "Prep_Art_Adj__Degree=Cmp|Gender=Neut": {POS: ADP}, - "Prep_Art_Misc_Misc___": {POS: ADP}, - "Prep_Art_N_Adv__Number=Sing": {POS: ADP}, - "Prep_Art_N_Adv__Number=Sing|PronType=Int": {POS: ADP}, - "Prep_Art_N_Art_N__AdpType=Prep": {POS: ADP}, - "Prep_Art_N_Prep_Art_N__AdpType=Prep": {POS: ADP}, - "Prep_Art_N_Prep__AdpType=Prep": {POS: ADP}, - "Prep_Art_N_Prep__Gender=Neut|Number=Sing": {POS: ADP}, - "Prep_Art_N_Prep__Number=Sing": {POS: ADP}, - "Prep_Art_N_V__Number=Plur|Tense=Past|VerbForm=Part": {POS: ADP}, - "Prep_Art_N__AdpType=Prep": {POS: ADP}, - "Prep_Art_N__Gender=Com|Number=Sing": {POS: ADP}, - "Prep_Art_N__Gender=Neut|Number=Sing": {POS: ADP}, - "Prep_Art_N__Number=Plur": {POS: ADP}, - "Prep_Art_N__Number=Sing": {POS: ADP}, - "Prep_Art_V__AdpType=Prep": {POS: ADP}, - "Prep_Art_V__Gender=Neut|VerbForm=Inf": {POS: ADP}, - "Prep_Art|voor_bep|onzijd|neut__Gender=Neut": {POS: ADP}, - "Prep_Art|voor_onbep|zijdofonzijd|neut__Number=Sing": {POS: ADP}, - "Prep_Conj_Prep|voor_neven_voor__Gender=Masc|Number=Sing|PronType=Tot": {POS: ADP}, - "Prep_Misc|voor_vreemd___": {POS: ADP}, - "Prep_N_Adv|voor_soort|ev|neut_deeladv__Number=Sing": {POS: ADP}, - "Prep_N_Adv|voor_soort|ev|neut_pron|aanw__AdpType=Prep": {POS: ADP}, - "Prep_N_Adv|voor_soort|ev|neut_pron|aanw__Number=Sing|PronType=Dem": {POS: ADP}, - "Prep_N_Adv|voor_soort|ev|neut_pron|vrag__Number=Sing|PronType=Int": {POS: ADP}, - "Prep_N_Adv|voor_soort|mv|neut_deelv__Gender=Masc|Number=Sing|PronType=Tot": { - POS: ADP - }, - "Prep_N_Conj_N__Number=Sing": {POS: ADP}, - "Prep_N_Conj__AdpType=Prep": {POS: ADP}, - "Prep_N_Prep_N__Number=Sing": {POS: ADP}, - "Prep_N_Prep|voor_soort|ev|dat_voor__Number=Sing": {POS: ADP}, - "Prep_N_Prep|voor_soort|ev|neut_voor__AdpType=Prep": {POS: ADP}, - "Prep_N_Prep|voor_soort|ev|neut_voor__Number=Sing": {POS: ADP}, - "Prep_N_Prep|voor_soort|mv|neut_voor__Number=Plur": {POS: ADP}, - "Prep_N_V__Case=Nom|Number=Sing|Tense=Past|VerbForm=Part": {POS: ADP}, - "Prep_Num_N__Definite=Def|Number=Sing": {POS: ADP}, - "Prep_Num__Case=Nom|Degree=Sup|PronType=Ind": {POS: ADP}, - "Prep_Num__Degree=Cmp|PronType=Ind": {POS: ADP}, - "Prep_N|voor_eigen|ev|neut__Number=Sing": {POS: ADP}, - "Prep_N|voor_soort|ev|dat__AdpType=Prep": {POS: ADP}, - "Prep_N|voor_soort|ev|dat__Case=Dat|Number=Sing": {POS: ADP}, - "Prep_N|voor_soort|ev|neut__AdpType=Prep": {POS: ADP}, - "Prep_N|voor_soort|ev|neut__Gender=Masc|Number=Sing|PronType=Tot": {POS: ADP}, - "Prep_N|voor_soort|ev|neut__Number=Sing": {POS: ADP}, - "Prep_N|voor_soort|mv|neut__AdpType=Prep": {POS: ADP}, - "Prep_N|voor_soort|mv|neut__Number=Plur": {POS: ADP}, - "Prep_Prep_Adj|voor_voor_adv|stell|onverv__Gender=Masc|Number=Sing|PronType=Tot": { - POS: ADP - }, - "Prep_Prep_Adv__Degree=Pos": {POS: ADP}, - "Prep_Pron_Adj__Degree=Cmp|Number=Sing|Person=3": {POS: ADP}, - "Prep_Pron_N_Adv__Number=Plur": {POS: ADP}, - "Prep_Pron_N__AdpType=Prep": {POS: ADP}, - "Prep_Pron_N__Case=Dat|Number=Sing": {POS: ADP}, - "Prep_Pron|voor_aanw|neut|zelfst___": {POS: ADP}, - 
"Prep_Pron|voor_onbep|neut|attr___": {POS: ADP}, - "Prep_Pron|voor_onbep|neut|zelfst___": {POS: ADP}, - "Prep_Pron|voor_rec|neut__AdpType=Prep": {POS: ADP}, - "Prep_Pron|voor_rec|neut___": {POS: ADP}, - "Prep_Pron|voor_ref|3|evofmv__Number=Plur,Sing|Person=3": {POS: ADP}, - "Prep_Punc_N_Conj_N__AdpType=Prep": {POS: ADP}, - "Prep_V_N__Number=Sing|Tense=Pres|VerbForm=Part": {POS: ADP}, - "Prep_V_Pron_Pron_Adv__Aspect=Imp|Mood=Ind|Number=Sing|Person=2|PronType=Dem|Tense=Pres|VerbForm=Fin": { - POS: ADP - }, - "Prep_V|voor_intrans|inf__VerbForm=Inf": {POS: ADP}, - "Prep_V|voorinf_trans|inf__VerbForm=Inf": {POS: ADP}, - "Prep|achter__AdpType=Post": {POS: ADP}, - "Prep|comb__AdpType=Circ": {POS: ADP}, - "Prep|voor__AdpType=Prep": {POS: ADP}, - "Prep|voorinf__AdpType=Prep|PartType=Inf": {POS: ADP}, - "Pron_Adj_N_Punc_Art_Adj_N_Prep_Art_Adj_N__NumType=Card": {POS: PRON}, - "Pron_Adj__Case=Nom|Degree=Sup|Number=Sing|Person=2|Poss=Yes|PronType=Prs": { - POS: PRON - }, - "Pron_Adj__Degree=Cmp|PronType=Ind": {POS: PRON}, - "Pron_Adv|vrag|neut|attr_deelv__PronType=Int": {POS: PRON}, - "Pron_Art_N_N__Number=Plur|PronType=Ind": {POS: PRON}, - "Pron_Art__Number=Sing|PronType=Int": {POS: PRON}, - "Pron_N_Adv__Number=Sing|PronType=Ind": {POS: PRON}, - "Pron_N_V_Adv_Num_Punc__Aspect=Imp|Definite=Def|Mood=Ind|Number=Sing|Person=3|PronType=Ind|Tense=Pres|VerbForm=Fin": { - POS: PRON - }, - "Pron_N_V_Conj_N__Aspect=Imp|Mood=Ind|Number=Sing|Person=3|PronType=Ind|Tense=Pres|VerbForm=Fin": { - POS: PRON - }, - "Pron_N__Case=Gen|Number=Sing|PronType=Ind": {POS: PRON}, - "Pron_N__Number=Sing|PronType=Ind": {POS: PRON}, - "Pron_N|aanw|gen|attr_soort|mv|neut__Case=Gen|Number=Plur|PronType=Dem": { - POS: PRON - }, - "Pron_N|onbep|neut|attr_soort|ev|neut__Number=Sing|PronType=Ind": {POS: PRON}, - "Pron_Prep_Art__Number=Sing|PronType=Int": {POS: PRON}, - "Pron_Prep_Art__Number=Sing|PronType=Rel": {POS: PRON}, - "Pron_Prep_N__Number=Plur|PronType=Int": {POS: PRON}, - "Pron_Prep|betr|neut|zelfst_voor__PronType=Rel": {POS: PRON}, - "Pron_Prep|onbep|neut|zelfst_voor__PronType=Ind": {POS: PRON}, - "Pron_Prep|vrag|neut|attr_voor__PronType=Int": {POS: PRON}, - "Pron_Pron_V__Aspect=Imp|Mood=Ind|Number=Sing|Person=2|PronType=Rel|Tense=Pres|VerbForm=Fin": { - POS: PRON - }, - "Pron_Pron__Person=3|PronType=Prs|Reflex=Yes": {POS: PRON}, - "Pron_V_V__Aspect=Imp|Mood=Ind|Person=3|PronType=Dem|Tense=Pres|VerbForm=Inf": { - POS: PRON - }, - "Pron_V__Case=Gen|Number=Sing|Person=3|Poss=Yes|PronType=Prs|VerbForm=Inf": { - POS: PRON - }, - "Pron_V__Number=Plur|Person=1|Poss=Yes|PronType=Prs|VerbForm=Inf": {POS: PRON}, - "Pron|aanw|dat|attr__Case=Dat|PronType=Dem": {POS: PRON}, - "Pron|aanw|gen|attr__Case=Gen|PronType=Dem": {POS: PRON}, - "Pron|aanw|neut|attr__PronType=Dem": {POS: PRON}, - "Pron|aanw|neut|attr|weigen__PronType=Dem": {POS: PRON}, - "Pron|aanw|neut|attr|wzelf__PronType=Dem": {POS: PRON}, - "Pron|aanw|neut|zelfst__PronType=Dem": {POS: PRON}, - "Pron|betr|gen|zelfst__Case=Gen|PronType=Rel": {POS: PRON}, - "Pron|betr|neut|attr__PronType=Rel": {POS: PRON}, - "Pron|betr|neut|zelfst__PronType=Rel": {POS: PRON}, - "Pron|bez|1|ev|neut|attr__Number=Sing|Person=1|Poss=Yes|PronType=Prs": {POS: PRON}, - "Pron|bez|1|mv|neut|attr__Number=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: PRON}, - "Pron|bez|2|ev|neut|attr__Number=Sing|Person=2|Poss=Yes|PronType=Prs": {POS: PRON}, - "Pron|bez|2|mv|neut|attr__Number=Plur|Person=2|Poss=Yes|PronType=Prs": {POS: PRON}, - "Pron|bez|3|ev|gen|attr__Case=Gen|Number=Sing|Person=3|Poss=Yes|PronType=Prs": { - 
POS: PRON - }, - "Pron|bez|3|ev|neut|attr__Number=Sing|Person=3|Poss=Yes|PronType=Prs": {POS: PRON}, - "Pron|bez|3|ev|neut|zelfst__Number=Sing|Person=3|Poss=Yes|PronType=Prs": { - POS: PRON - }, - "Pron|bez|3|mv|neut|attr__Number=Plur|Person=3|Poss=Yes|PronType=Prs": {POS: PRON}, - "Pron|onbep|gen|attr__Case=Gen|PronType=Ind": {POS: PRON}, - "Pron|onbep|gen|zelfst__Case=Gen|PronType=Ind": {POS: PRON}, - "Pron|onbep|neut|attr__PronType=Ind": {POS: PRON}, - "Pron|onbep|neut|zelfst__PronType=Ind": {POS: PRON}, - "Pron|per|1|ev|datofacc__Case=Acc,Dat|Number=Sing|Person=1|PronType=Prs": { - POS: PRON - }, - "Pron|per|1|ev|nom__Case=Nom|Number=Sing|Person=1|PronType=Prs": {POS: PRON}, - "Pron|per|1|mv|datofacc__Case=Acc,Dat|Number=Plur|Person=1|PronType=Prs": { - POS: PRON - }, - "Pron|per|1|mv|nom__Case=Nom|Number=Plur|Person=1|PronType=Prs": {POS: PRON}, - "Pron|per|2|ev|datofacc__Case=Acc,Dat|Number=Sing|Person=2|PronType=Prs": { - POS: PRON - }, - "Pron|per|2|ev|nom__Case=Nom|Number=Sing|Person=2|PronType=Prs": {POS: PRON}, - "Pron|per|2|mv|datofacc__Case=Acc,Dat|Number=Plur|Person=2|PronType=Prs": { - POS: PRON - }, - "Pron|per|2|mv|nom__Case=Nom|Number=Plur|Person=2|PronType=Prs": {POS: PRON}, - "Pron|per|3|evofmv|datofacc__Case=Acc,Dat|Number=Plur,Sing|Person=3|PronType=Prs": { - POS: PRON - }, - "Pron|per|3|evofmv|nom__Case=Nom|Number=Plur,Sing|Person=3|PronType=Prs": { - POS: PRON - }, - "Pron|per|3|ev|datofacc__Case=Acc,Dat|Number=Sing|Person=3|PronType=Prs": { - POS: PRON - }, - "Pron|per|3|ev|nom__Case=Nom|Number=Sing|Person=3|PronType=Prs": {POS: PRON}, - "Pron|per|3|mv|datofacc__Case=Acc,Dat|Number=Plur|Person=3|PronType=Prs": { - POS: PRON - }, - "Pron|rec|gen__Case=Gen|PronType=Rcp": {POS: PRON}, - "Pron|rec|neut__PronType=Rcp": {POS: PRON}, - "Pron|ref|1|ev__Number=Sing|Person=1|PronType=Prs|Reflex=Yes": {POS: PRON}, - "Pron|ref|1|mv__Number=Plur|Person=1|PronType=Prs|Reflex=Yes": {POS: PRON}, - "Pron|ref|2|ev__Number=Sing|Person=2|PronType=Prs|Reflex=Yes": {POS: PRON}, - "Pron|ref|3|evofmv__Number=Plur,Sing|Person=3|PronType=Prs|Reflex=Yes": {POS: PRON}, - "Pron|vrag|neut|attr__PronType=Int": {POS: PRON}, - "Pron|vrag|neut|zelfst__PronType=Int": {POS: PRON}, - "Punc_Int_Punc_N_N_N_Punc_Pron_V_Pron_Adj_V_Punc___": {POS: PUNCT}, - "Punc_N_Punc_N___": {POS: PUNCT}, - "Punc_Num_Num___": {POS: PUNCT}, - "Punc_Num___": {POS: PUNCT}, - "Punc|aanhaaldubb__PunctType=Quot": {POS: PUNCT}, - "Punc|aanhaalenk__PunctType=Quot": {POS: PUNCT}, - "Punc|dubbpunt__PunctType=Colo": {POS: PUNCT}, - "Punc|haakopen__PunctSide=Ini|PunctType=Brck": {POS: PUNCT}, - "Punc|haaksluit__PunctSide=Fin|PunctType=Brck": {POS: PUNCT}, - "Punc|hellip__PunctType=Peri": {POS: PUNCT}, - "Punc|isgelijk___": {POS: PUNCT}, - "Punc|komma__PunctType=Comm": {POS: PUNCT}, - "Punc|liggstreep___": {POS: PUNCT}, - "Punc|maal___": {POS: PUNCT}, - "Punc|punt__PunctType=Peri": {POS: PUNCT}, - "Punc|puntkomma__PunctType=Semi": {POS: PUNCT}, - "Punc|schuinstreep___": {POS: PUNCT}, - "Punc|uitroep__PunctType=Excl": {POS: PUNCT}, - "Punc|vraag__PunctType=Qest": {POS: PUNCT}, - "V_Adv_Art_N_Prep_Pron_N__Degree=Pos|Number=Plur|Person=2|Subcat=Tran": {POS: VERB}, - "V_Adv__Degree=Pos|Subcat=Tran": {POS: VERB}, - "V_Art_N_Num_N__Aspect=Imp|Definite=Def|Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin|VerbType=Mod": { - POS: VERB - }, - "V_Art_N__Number=Sing|Subcat=Tran": {POS: VERB}, - "V_Conj_N_N__Number=Sing|Subcat=Tran|Tense=Past|VerbForm=Part": {POS: VERB}, - "V_Conj_Pron__Subcat=Tran|Tense=Past|VerbForm=Part": {POS: VERB}, - 
"V_N_Conj_Adj_N_Prep_Art_N__Degree=Pos|Number=Sing|Subcat=Tran|Tense=Past|VerbForm=Part": { - POS: VERB - }, - "V_N_N__Number=Sing|Subcat=Intr|Tense=Pres|VerbForm=Part": {POS: VERB}, - "V_N_N__Number=Sing|Subcat=Tran|Tense=Past|VerbForm=Part": {POS: VERB}, - "V_N_V__Aspect=Imp|Mood=Ind|Number=Sing|Subcat=Intr|Tense=Pres|VerbForm=Inf": { - POS: VERB - }, - "V_N__Number=Plur|Subcat=Tran|Tense=Past|VerbForm=Part": {POS: VERB}, - "V_N|trans|imp_eigen|ev|neut__Number=Sing|Subcat=Tran": {POS: VERB}, - "V_Prep|intrans|verldw|onverv_voor__Subcat=Intr|Tense=Past|VerbForm=Part": { - POS: VERB - }, - "V_Pron_Adv_Adv_Pron_V__Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Subcat=Tran|Tense=Pres|VerbForm=Fin": { - POS: VERB - }, - "V_Pron_Adv__Aspect=Imp|Degree=Pos|Mood=Ind|Number=Sing|Person=2|Subcat=Tran|Tense=Pres|VerbForm=Fin": { - POS: VERB - }, - "V_Pron_V__Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Subcat=Tran|Tense=Pres|VerbForm=Fin": { - POS: VERB - }, - "V_Pron__VerbType=Aux,Cop": {POS: VERB}, - "V_V|hulp|imp_intrans|inf__VerbForm=Inf|VerbType=Mod": {POS: VERB}, - "V|hulpofkopp|conj__Mood=Sub|VerbForm=Fin": {POS: VERB}, - "V|hulpofkopp|conj__Mood=Sub|VerbForm=Fin|VerbType=Aux,Cop": {POS: VERB}, - "V|hulpofkopp|imp__Mood=Imp|VerbForm=Fin": {POS: VERB}, - "V|hulpofkopp|imp__Mood=Imp|VerbForm=Fin|VerbType=Aux,Cop": {POS: VERB}, - "V|hulpofkopp|inf__VerbForm=Inf": {POS: VERB}, - "V|hulpofkopp|inf__VerbForm=Inf|VerbType=Aux,Cop": {POS: VERB}, - "V|hulpofkopp|inf|subst__VerbForm=Inf": {POS: VERB}, - "V|hulpofkopp|ott|1of2of3|mv__Aspect=Imp|Mood=Ind|Number=Plur|Tense=Pres|VerbForm=Fin": { - POS: VERB - }, - "V|hulpofkopp|ott|1of2of3|mv__Aspect=Imp|Mood=Ind|Number=Plur|Tense=Pres|VerbForm=Fin|VerbType=Aux,Cop": { - POS: VERB - }, - "V|hulpofkopp|ott|1|ev__Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin": { - POS: VERB - }, - "V|hulpofkopp|ott|1|ev__Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin|VerbType=Aux,Cop": { - POS: VERB - }, - "V|hulpofkopp|ott|2|ev__Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin": { - POS: VERB - }, - "V|hulpofkopp|ott|2|ev__Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin|VerbType=Aux,Cop": { - POS: VERB - }, - "V|hulpofkopp|ott|3|ev__Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin": { - POS: VERB - }, - "V|hulpofkopp|ott|3|ev__Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|VerbType=Aux,Cop": { - POS: VERB - }, - "V|hulpofkopp|ovt|1of2of3|ev__Aspect=Imp|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin": { - POS: VERB - }, - "V|hulpofkopp|ovt|1of2of3|ev__Aspect=Imp|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|VerbType=Aux,Cop": { - POS: VERB - }, - "V|hulpofkopp|ovt|1of2of3|mv__Aspect=Imp|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin": { - POS: VERB - }, - "V|hulpofkopp|ovt|1of2of3|mv__Aspect=Imp|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|VerbType=Aux,Cop": { - POS: VERB - }, - "V|hulpofkopp|tegdw|vervneut__Case=Nom|Tense=Pres|VerbForm=Part": {POS: VERB}, - "V|hulpofkopp|tegdw|vervneut__Case=Nom|Tense=Pres|VerbForm=Part|VerbType=Aux,Cop": { - POS: VERB - }, - "V|hulpofkopp|verldw|onverv__Tense=Past|VerbForm=Part": {POS: VERB}, - "V|hulpofkopp|verldw|onverv__Tense=Past|VerbForm=Part|VerbType=Aux,Cop": { - POS: VERB - }, - "V|hulp|conj__Mood=Sub|VerbForm=Fin|VerbType=Mod": {POS: VERB}, - "V|hulp|inf__VerbForm=Inf": {POS: VERB}, - "V|hulp|inf__VerbForm=Inf|VerbType=Mod": {POS: VERB}, - "V|hulp|ott|1of2of3|mv__Aspect=Imp|Mood=Ind|Number=Plur|Tense=Pres|VerbForm=Fin": { - POS: VERB - }, - 
"V|hulp|ott|1of2of3|mv__Aspect=Imp|Mood=Ind|Number=Plur|Tense=Pres|VerbForm=Fin|VerbType=Mod": { - POS: VERB - }, - "V|hulp|ott|1|ev__Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin": { - POS: VERB - }, - "V|hulp|ott|1|ev__Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin|VerbType=Mod": { - POS: VERB - }, - "V|hulp|ott|2|ev__Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin": { - POS: VERB - }, - "V|hulp|ott|2|ev__Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin|VerbType=Mod": { - POS: VERB - }, - "V|hulp|ott|3|ev__Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin": { - POS: VERB - }, - "V|hulp|ott|3|ev__Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|VerbType=Mod": { - POS: VERB - }, - "V|hulp|ovt|1of2of3|ev__Aspect=Imp|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin": { - POS: VERB - }, - "V|hulp|ovt|1of2of3|ev__Aspect=Imp|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|VerbType=Mod": { - POS: VERB - }, - "V|hulp|ovt|1of2of3|mv__Aspect=Imp|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin": { - POS: VERB - }, - "V|hulp|ovt|1of2of3|mv__Aspect=Imp|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|VerbType=Mod": { - POS: VERB - }, - "V|hulp|verldw|onverv__Tense=Past|VerbForm=Part": {POS: VERB}, - "V|hulp|verldw|onverv__Tense=Past|VerbForm=Part|VerbType=Mod": {POS: VERB}, - "V|intrans|conj__Mood=Sub|Subcat=Intr|VerbForm=Fin": {POS: VERB}, - "V|intrans|imp__Mood=Imp|Subcat=Intr|VerbForm=Fin": {POS: VERB}, - "V|intrans|inf__Subcat=Intr|VerbForm=Inf": {POS: VERB}, - "V|intrans|inf|subst__Subcat=Intr|VerbForm=Inf": {POS: VERB}, - "V|intrans|ott|1of2of3|mv__Aspect=Imp|Mood=Ind|Number=Plur|Subcat=Intr|Tense=Pres|VerbForm=Fin": { - POS: VERB - }, - "V|intrans|ott|1|ev__Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Subcat=Intr|Tense=Pres|VerbForm=Fin": { - POS: VERB - }, - "V|intrans|ott|2|ev__Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Subcat=Intr|Tense=Pres|VerbForm=Fin": { - POS: VERB - }, - "V|intrans|ott|3|ev__Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Subcat=Intr|Tense=Pres|VerbForm=Fin": { - POS: VERB - }, - "V|intrans|ovt|1of2of3|ev__Aspect=Imp|Mood=Ind|Number=Sing|Subcat=Intr|Tense=Past|VerbForm=Fin": { - POS: VERB - }, - "V|intrans|ovt|1of2of3|mv__Aspect=Imp|Mood=Ind|Number=Plur|Subcat=Intr|Tense=Past|VerbForm=Fin": { - POS: VERB - }, - "V|intrans|tegdw|onverv__Subcat=Intr|Tense=Pres|VerbForm=Part": {POS: VERB}, - "V|intrans|tegdw|vervmv__Number=Plur|Subcat=Intr|Tense=Pres|VerbForm=Part": { - POS: VERB - }, - "V|intrans|tegdw|vervneut__Case=Nom|Subcat=Intr|Tense=Pres|VerbForm=Part": { - POS: VERB - }, - "V|intrans|tegdw|vervvergr__Degree=Cmp|Subcat=Intr|Tense=Pres|VerbForm=Part": { - POS: VERB - }, - "V|intrans|verldw|onverv__Subcat=Intr|Tense=Past|VerbForm=Part": {POS: VERB}, - "V|intrans|verldw|vervmv__Number=Plur|Subcat=Intr|Tense=Past|VerbForm=Part": { - POS: VERB - }, - "V|intrans|verldw|vervneut__Case=Nom|Subcat=Intr|Tense=Past|VerbForm=Part": { - POS: VERB - }, - "V|refl|imp__Mood=Imp|Reflex=Yes|VerbForm=Fin": {POS: VERB}, - "V|refl|inf__Reflex=Yes|VerbForm=Inf": {POS: VERB}, - "V|refl|inf|subst__Reflex=Yes|VerbForm=Inf": {POS: VERB}, - "V|refl|ott|1of2of3|mv__Aspect=Imp|Mood=Ind|Number=Plur|Reflex=Yes|Tense=Pres|VerbForm=Fin": { - POS: VERB - }, - "V|refl|ott|1|ev__Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Reflex=Yes|Tense=Pres|VerbForm=Fin": { - POS: VERB - }, - "V|refl|ott|2|ev__Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Reflex=Yes|Tense=Pres|VerbForm=Fin": { - POS: VERB - }, - 
"V|refl|ott|3|ev__Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Reflex=Yes|Tense=Pres|VerbForm=Fin": { - POS: VERB - }, - "V|refl|ovt|1of2of3|ev__Aspect=Imp|Mood=Ind|Number=Sing|Reflex=Yes|Tense=Past|VerbForm=Fin": { - POS: VERB - }, - "V|refl|ovt|1of2of3|mv__Aspect=Imp|Mood=Ind|Number=Plur|Reflex=Yes|Tense=Past|VerbForm=Fin": { - POS: VERB - }, - "V|refl|tegdw|vervneut__Case=Nom|Reflex=Yes|Tense=Pres|VerbForm=Part": {POS: VERB}, - "V|refl|verldw|onverv__Reflex=Yes|Tense=Past|VerbForm=Part": {POS: VERB}, - "V|trans|conj__Mood=Sub|Subcat=Tran|VerbForm=Fin": {POS: VERB}, - "V|trans|imp__Mood=Imp|Subcat=Tran|VerbForm=Fin": {POS: VERB}, - "V|trans|inf__Subcat=Tran|VerbForm=Inf": {POS: VERB}, - "V|trans|inf|subst__Subcat=Tran|VerbForm=Inf": {POS: VERB}, - "V|trans|ott|1of2of3|mv__Aspect=Imp|Mood=Ind|Number=Plur|Subcat=Tran|Tense=Pres|VerbForm=Fin": { - POS: VERB - }, - "V|trans|ott|1|ev__Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Subcat=Tran|Tense=Pres|VerbForm=Fin": { - POS: VERB - }, - "V|trans|ott|2|ev__Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Subcat=Tran|Tense=Pres|VerbForm=Fin": { - POS: VERB - }, - "V|trans|ott|3|ev__Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Subcat=Tran|Tense=Pres|VerbForm=Fin": { - POS: VERB - }, - "V|trans|ovt|1of2of3|ev__Aspect=Imp|Mood=Ind|Number=Sing|Subcat=Tran|Tense=Past|VerbForm=Fin": { - POS: VERB - }, - "V|trans|ovt|1of2of3|mv__Aspect=Imp|Mood=Ind|Number=Plur|Subcat=Tran|Tense=Past|VerbForm=Fin": { - POS: VERB - }, - "V|trans|tegdw|onverv__Subcat=Tran|Tense=Pres|VerbForm=Part": {POS: VERB}, - "V|trans|tegdw|vervneut__Case=Nom|Subcat=Tran|Tense=Pres|VerbForm=Part": { - POS: VERB - }, - "V|trans|verldw|onverv__Subcat=Tran|Tense=Past|VerbForm=Part": {POS: VERB}, - "V|trans|verldw|vervmv__Number=Plur|Subcat=Tran|Tense=Past|VerbForm=Part": { - POS: VERB - }, - "V|trans|verldw|vervneut__Case=Nom|Subcat=Tran|Tense=Past|VerbForm=Part": { - POS: VERB - }, - "V|trans|verldw|vervvergr__Degree=Cmp|Subcat=Tran|Tense=Past|VerbForm=Part": { - POS: VERB - }, - "X__Aspect=Imp|Definite=Def|Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin|VerbType=Mod": { - POS: X - }, - "X__Aspect=Imp|Definite=Def|Mood=Ind|Number=Sing|Person=3|PronType=Ind|Tense=Pres|VerbForm=Fin": { - POS: X - }, - "X__Aspect=Imp|Degree=Pos|Mood=Ind|Number=Sing|Person=2|Subcat=Tran|Tense=Pres|VerbForm=Fin": { - POS: X - }, - "X__Aspect=Imp|Degree=Pos|Mood=Ind|Number=Sing|Person=2|Tense=Past|VerbForm=Part": { - POS: X - }, - "X__Aspect=Imp|Degree=Pos|Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Inf": { - POS: X - }, - "X__Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin": {POS: X}, - "X__Aspect=Imp|Mood=Ind|Number=Sing|Person=2|PronType=Dem|Tense=Pres|VerbForm=Fin": { - POS: X - }, - "X__Aspect=Imp|Mood=Ind|Number=Sing|Person=2|PronType=Rel|Tense=Pres|VerbForm=Fin": { - POS: X - }, - "X__Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Subcat=Tran|Tense=Pres|VerbForm=Fin": { - POS: X - }, - "X__Aspect=Imp|Mood=Ind|Number=Sing|Person=3|PronType=Ind|Tense=Pres|VerbForm=Fin": { - POS: X - }, - "X__Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Subcat=Tran|Tense=Pres|VerbForm=Fin": { - POS: X - }, - "X__Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin": {POS: X}, - "X__Aspect=Imp|Mood=Ind|Number=Sing|Subcat=Intr|Tense=Pres|VerbForm=Inf": {POS: X}, - "X__Aspect=Imp|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin": {POS: X}, - "X__Aspect=Imp|Mood=Ind|Person=3|PronType=Dem|Tense=Pres|VerbForm=Inf": {POS: X}, - "X__Case=Dat|Degree=Pos|Number=Sing": {POS: X}, - "X__Case=Dat|Number=Sing": {POS: X}, - 
"X__Case=Gen|Definite=Def|Number=Sing": {POS: X}, - "X__Case=Gen|Number=Plur|PronType=Dem": {POS: X}, - "X__Case=Gen|Number=Plur|PronType=Ind": {POS: X}, - "X__Case=Gen|Number=Sing": {POS: X}, - "X__Case=Gen|Number=Sing|Person=3|Poss=Yes|PronType=Prs|VerbForm=Inf": {POS: X}, - "X__Case=Gen|Number=Sing|PronType=Ind": {POS: X}, - "X__Case=Nom|Definite=Def|Degree=Cmp|Gender=Neut": {POS: X}, - "X__Case=Nom|Definite=Def|Degree=Sup": {POS: X}, - "X__Case=Nom|Definite=Def|Degree=Sup|Gender=Neut": {POS: X}, - "X__Case=Nom|Degree=Cmp": {POS: X}, - "X__Case=Nom|Degree=Pos": {POS: X}, - "X__Case=Nom|Degree=Pos|Gender=Neut": {POS: X}, - "X__Case=Nom|Degree=Pos|Number=Plur": {POS: X}, - "X__Case=Nom|Degree=Pos|Number=Sing": {POS: X}, - "X__Case=Nom|Degree=Sup": {POS: X}, - "X__Case=Nom|Degree=Sup|Number=Sing|Person=2|Poss=Yes|PronType=Prs": {POS: X}, - "X__Case=Nom|Degree=Sup|PronType=Ind": {POS: X}, - "X__Case=Nom|Number=Sing|Tense=Past|VerbForm=Part": {POS: X}, - "X__Definite=Def": {POS: X}, - "X__Definite=Def|Degree=Cmp|Gender=Neut": {POS: X}, - "X__Definite=Def|Degree=Pos": {POS: X}, - "X__Definite=Def|Degree=Pos|Number=Sing": {POS: X}, - "X__Definite=Def|Degree=Pos|Variant=Short": {POS: X}, - "X__Definite=Def|Degree=Sup|Gender=Neut": {POS: X}, - "X__Definite=Def|Degree=Sup|Gender=Neut|Number=Sing": {POS: X}, - "X__Definite=Def|Degree=Sup|Gender=Neut|PronType=Ind": {POS: X}, - "X__Definite=Def|Gender=Neut": {POS: X}, - "X__Definite=Def|Gender=Neut|Number=Plur|Person=3": {POS: X}, - "X__Definite=Def|Gender=Neut|Number=Sing": {POS: X}, - "X__Definite=Def|Number=Plur": {POS: X}, - "X__Definite=Def|Number=Sing": {POS: X}, - "X__Definite=Def|Number=Sing|Person=1": {POS: X}, - "X__Definite=Def|Number=Sing|Tense=Past|VerbForm=Part": {POS: X}, - "X__Definite=Def|Number=Sing|Tense=Pres|VerbForm=Part": {POS: X}, - "X__Degree=Cmp": {POS: X}, - "X__Degree=Cmp|Gender=Neut": {POS: X}, - "X__Degree=Cmp|Number=Sing|Person=3": {POS: X}, - "X__Degree=Cmp|PronType=Ind": {POS: X}, - "X__Degree=Cmp|Variant=Short": {POS: X}, - "X__Degree=Pos": {POS: X}, - "X__Degree=Pos|Gender=Neut|Number=Sing": {POS: X}, - "X__Degree=Pos|Mood=Imp|Variant=Short|VerbForm=Fin": {POS: X}, - "X__Degree=Pos|Mood=Sub|VerbForm=Fin": {POS: X}, - "X__Degree=Pos|Number=Plur": {POS: X}, - "X__Degree=Pos|Number=Plur|Person=2|Subcat=Tran": {POS: X}, - "X__Degree=Pos|Number=Plur|Variant=Short": {POS: X}, - "X__Degree=Pos|Number=Sing": {POS: X}, - "X__Degree=Pos|Number=Sing|Person=1|Poss=Yes|PronType=Prs": {POS: X}, - "X__Degree=Pos|Number=Sing|Person=2": {POS: X}, - "X__Degree=Pos|Number=Sing|Person=3": {POS: X}, - "X__Degree=Pos|Number=Sing|PronType=Ind": {POS: X}, - "X__Degree=Pos|Number=Sing|Subcat=Tran|Tense=Past|VerbForm=Part": {POS: X}, - "X__Degree=Pos|Number=Sing|Tense=Past|VerbForm=Part": {POS: X}, - "X__Degree=Pos|Number=Sing|Variant=Short": {POS: X}, - "X__Degree=Pos|PronType=Dem": {POS: X}, - "X__Degree=Pos|Subcat=Tran": {POS: X}, - "X__Degree=Pos|Variant=Short": {POS: X}, - "X__Degree=Pos|Variant=Short|VerbForm=Inf": {POS: X}, - "X__Degree=Pos|VerbForm=Inf": {POS: X}, - "X__Gender=Com|Number=Sing": {POS: X}, - "X__Gender=Neut": {POS: X}, - "X__Gender=Neut|Number=Sing": {POS: X}, - "X__Gender=Neut|VerbForm=Inf": {POS: X}, - "X__Mood=Sub|Number=Sing|VerbForm=Fin": {POS: X}, - "X__Mood=Sub|VerbForm=Fin": {POS: X}, - "X__Number=Plur": {POS: X}, - "X__Number=Plur,Sing|Person=3": {POS: X}, - "X__Number=Plur|Person=1|Poss=Yes|PronType=Prs|VerbForm=Inf": {POS: X}, - "X__Number=Plur|PronType=Ind": {POS: X}, - "X__Number=Plur|PronType=Int": {POS: 
X}, - "X__Number=Plur|Subcat=Tran|Tense=Past|VerbForm=Part": {POS: X}, - "X__Number=Plur|Tense=Past|VerbForm=Part": {POS: X}, - "X__Number=Sing": {POS: X}, - "X__Number=Sing|Person=3": {POS: X}, - "X__Number=Sing|PronType=Dem": {POS: X}, - "X__Number=Sing|PronType=Ind": {POS: X}, - "X__Number=Sing|PronType=Int": {POS: X}, - "X__Number=Sing|PronType=Rel": {POS: X}, - "X__Number=Sing|Subcat=Intr|Tense=Pres|VerbForm=Part": {POS: X}, - "X__Number=Sing|Subcat=Tran": {POS: X}, - "X__Number=Sing|Subcat=Tran|Tense=Past|VerbForm=Part": {POS: X}, - "X__Number=Sing|Tense=Past|VerbForm=Part": {POS: X}, - "X__Number=Sing|Tense=Pres|VerbForm=Part": {POS: X}, - "X__Person=3|PronType=Prs|Reflex=Yes": {POS: X}, - "X__PronType=Dem": {POS: X}, - "X__PronType=Ind": {POS: X}, - "X__PronType=Int": {POS: X}, - "X__PronType=Rel": {POS: X}, - "X__Subcat=Intr|Tense=Past|VerbForm=Part": {POS: X}, - "X__Subcat=Tran|Tense=Past|VerbForm=Part": {POS: X}, - "X__VerbForm=Inf": {POS: X}, - "X__VerbForm=Inf|VerbType=Mod": {POS: X}, - "X__VerbType=Aux,Cop": {POS: X}, - "X___": {POS: X}, - "_SP": {POS: SPACE}, -} diff --git a/spacy/lang/nl/tokenizer_exceptions.py b/spacy/lang/nl/tokenizer_exceptions.py index c0915f127..489d10d71 100644 --- a/spacy/lang/nl/tokenizer_exceptions.py +++ b/spacy/lang/nl/tokenizer_exceptions.py @@ -1,7 +1,7 @@ -# coding: utf8 -from __future__ import unicode_literals - +from ..tokenizer_exceptions import BASE_EXCEPTIONS from ...symbols import ORTH +from ...util import update_exc + # Extensive list of both common and uncommon dutch abbreviations copied from # github.com/diasks2/pragmatic_segmenter, a Ruby library for rule-based @@ -1605,4 +1605,4 @@ for orth in abbrevs: _exc[i] = [{ORTH: i}] -TOKENIZER_EXCEPTIONS = _exc +TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc) diff --git a/spacy/lang/norm_exceptions.py b/spacy/lang/norm_exceptions.py index 341967a78..f35f613b1 100644 --- a/spacy/lang/norm_exceptions.py +++ b/spacy/lang/norm_exceptions.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # These exceptions are used to add NORM values based on a token's ORTH value. # Individual languages can also add their own exceptions and overwrite them - # for example, British vs. American spelling in English. 
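A minimal usage sketch for the "lemmatizer" factory registered in the Dutch diff above (not part of the patch; it assumes spaCy v3 and that the lemma lookup tables, e.g. from the spacy-lookups-data package, are available):

import spacy

# Sketch only: exercise the factory registered via @Dutch.factory above.
nlp = spacy.blank("nl")
# mode="rule" selects DutchLemmatizer.rule_lemmatize; it reads token.pos_,
# so a tagger/morphologizer normally runs earlier in the pipeline. Without
# one it backs off to the lowercased form, as the rule_lemmatize code shows.
nlp.add_pipe("lemmatizer", config={"mode": "rule"})
nlp.initialize()  # loads the required lemma_* lookup tables
doc = nlp("De katten liepen door de tuinen.")
print([(t.text, t.lemma_) for t in doc])

The Polish factory further below follows the same pattern with mode="pos_lookup" as its default.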
diff --git a/spacy/lang/pl/__init__.py b/spacy/lang/pl/__init__.py index 52b662a90..9e7303e83 100644 --- a/spacy/lang/pl/__init__.py +++ b/spacy/lang/pl/__init__.py @@ -1,43 +1,28 @@ -# coding: utf8 -from __future__ import unicode_literals +from typing import Optional + +from thinc.api import Model from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES from .punctuation import TOKENIZER_SUFFIXES -from .tag_map import TAG_MAP from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS from .lemmatizer import PolishLemmatizer - from ..tokenizer_exceptions import BASE_EXCEPTIONS -from ..norm_exceptions import BASE_NORMS from ...language import Language -from ...attrs import LANG, NORM -from ...util import add_lookups -from ...lookups import Lookups + + +TOKENIZER_EXCEPTIONS = { + exc: val for exc, val in BASE_EXCEPTIONS.items() if not exc.endswith(".") +} class PolishDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters.update(LEX_ATTRS) - lex_attr_getters[LANG] = lambda text: "pl" - lex_attr_getters[NORM] = add_lookups( - Language.Defaults.lex_attr_getters[NORM], BASE_NORMS - ) - mod_base_exceptions = { - exc: val for exc, val in BASE_EXCEPTIONS.items() if not exc.endswith(".") - } - tokenizer_exceptions = mod_base_exceptions - stop_words = STOP_WORDS - tag_map = TAG_MAP + tokenizer_exceptions = TOKENIZER_EXCEPTIONS prefixes = TOKENIZER_PREFIXES infixes = TOKENIZER_INFIXES suffixes = TOKENIZER_SUFFIXES - - @classmethod - def create_lemmatizer(cls, nlp=None, lookups=None): - if lookups is None: - lookups = Lookups() - return PolishLemmatizer(lookups) + lex_attr_getters = LEX_ATTRS + stop_words = STOP_WORDS class Polish(Language): @@ -45,4 +30,14 @@ class Polish(Language): Defaults = PolishDefaults +@Polish.factory( + "lemmatizer", + assigns=["token.lemma"], + default_config={"model": None, "mode": "pos_lookup"}, + default_score_weights={"lemma_acc": 1.0}, +) +def make_lemmatizer(nlp: Language, model: Optional[Model], name: str, mode: str): + return PolishLemmatizer(nlp.vocab, model, name, mode=mode) + + __all__ = ["Polish"] diff --git a/spacy/lang/pl/examples.py b/spacy/lang/pl/examples.py index 14b6c7030..b1ea5880f 100644 --- a/spacy/lang/pl/examples.py +++ b/spacy/lang/pl/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/pl/lemmatizer.py b/spacy/lang/pl/lemmatizer.py index 8b8d7fe27..059d0609a 100644 --- a/spacy/lang/pl/lemmatizer.py +++ b/spacy/lang/pl/lemmatizer.py @@ -1,8 +1,7 @@ -# coding: utf-8 -from __future__ import unicode_literals +from typing import List, Dict, Tuple -from ...lemmatizer import Lemmatizer -from ...parts_of_speech import NAMES +from ...pipeline import Lemmatizer +from ...tokens import Token class PolishLemmatizer(Lemmatizer): @@ -10,30 +9,42 @@ class PolishLemmatizer(Lemmatizer): # dictionary (morfeusz.sgjp.pl/en) by Institute of Computer Science PAS. # It utilizes some prefix based improvements for verb and adjectives # lemmatization, as well as case-sensitive lemmatization for nouns. 
- def __call__(self, string, univ_pos, morphology=None): - if isinstance(univ_pos, int): - univ_pos = NAMES.get(univ_pos, "X") - univ_pos = univ_pos.upper() + @classmethod + def get_lookups_config(cls, mode: str) -> Tuple[List[str], List[str]]: + if mode == "pos_lookup": + # fmt: off + required = [ + "lemma_lookup_adj", "lemma_lookup_adp", "lemma_lookup_adv", + "lemma_lookup_aux", "lemma_lookup_noun", "lemma_lookup_num", + "lemma_lookup_part", "lemma_lookup_pron", "lemma_lookup_verb" + ] + # fmt: on + return (required, []) + else: + return super().get_lookups_config(mode) + + def pos_lookup_lemmatize(self, token: Token) -> List[str]: + string = token.text + univ_pos = token.pos_ + morphology = token.morph.to_dict() lookup_pos = univ_pos.lower() if univ_pos == "PROPN": lookup_pos = "noun" lookup_table = self.lookups.get_table("lemma_lookup_" + lookup_pos, {}) - if univ_pos == "NOUN": return self.lemmatize_noun(string, morphology, lookup_table) - if univ_pos != "PROPN": string = string.lower() - if univ_pos == "ADJ": return self.lemmatize_adj(string, morphology, lookup_table) elif univ_pos == "VERB": return self.lemmatize_verb(string, morphology, lookup_table) - return [lookup_table.get(string, string.lower())] - def lemmatize_adj(self, string, morphology, lookup_table): + def lemmatize_adj( + self, string: str, morphology: dict, lookup_table: Dict[str, str] + ) -> List[str]: # this method utilizes different procedures for adjectives # with 'nie' and 'naj' prefixes if string[:3] == "nie": @@ -44,25 +55,26 @@ class PolishLemmatizer(Lemmatizer): return [lookup_table[naj_search_string]] if search_string in lookup_table: return [lookup_table[search_string]] - if string[:3] == "naj": naj_search_string = string[3:] if naj_search_string in lookup_table: return [lookup_table[naj_search_string]] - return [lookup_table.get(string, string)] - def lemmatize_verb(self, string, morphology, lookup_table): + def lemmatize_verb( + self, string: str, morphology: dict, lookup_table: Dict[str, str] + ) -> List[str]: # this method utilizes a different procedure for verbs # with 'nie' prefix if string[:3] == "nie": search_string = string[3:] if search_string in lookup_table: return [lookup_table[search_string]] - return [lookup_table.get(string, string)] - def lemmatize_noun(self, string, morphology, lookup_table): + def lemmatize_noun( + self, string: str, morphology: dict, lookup_table: Dict[str, str] + ) -> List[str]: # this method is case-sensitive, in order to work # for incorrectly tagged proper names if string != string.lower(): @@ -71,11 +83,4 @@ class PolishLemmatizer(Lemmatizer): elif string in lookup_table: return [lookup_table[string]] return [string.lower()] - return [lookup_table.get(string, string)] - - def lookup(self, string, orth=None): - return string.lower() - - def lemmatize(self, string, index, exceptions, rules): - raise NotImplementedError diff --git a/spacy/lang/pl/lex_attrs.py b/spacy/lang/pl/lex_attrs.py index f1379aa50..ce56e28a8 100644 --- a/spacy/lang/pl/lex_attrs.py +++ b/spacy/lang/pl/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/pl/punctuation.py b/spacy/lang/pl/punctuation.py index c87464b1b..31e56b9ae 100644 --- a/spacy/lang/pl/punctuation.py +++ b/spacy/lang/pl/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..char_classes import LIST_ELLIPSES, LIST_PUNCT, LIST_HYPHENS from ..char_classes import LIST_ICONS, LIST_QUOTES, CURRENCY, UNITS, 
PUNCT from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER diff --git a/spacy/lang/pl/stop_words.py b/spacy/lang/pl/stop_words.py index 11df67328..075aec391 100644 --- a/spacy/lang/pl/stop_words.py +++ b/spacy/lang/pl/stop_words.py @@ -1,7 +1,3 @@ -# encoding: utf8 - -from __future__ import unicode_literals - # sources: https://github.com/bieli/stopwords/blob/master/polish.stopwords.txt and https://github.com/stopwords-iso/stopwords-pl STOP_WORDS = set( diff --git a/spacy/lang/pl/tag_map.py b/spacy/lang/pl/tag_map.py deleted file mode 100644 index ed7d6487e..000000000 --- a/spacy/lang/pl/tag_map.py +++ /dev/null @@ -1,1649 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import ( - POS, - ADJ, - ADP, - ADV, - AUX, - CCONJ, - DET, - INTJ, - NOUN, - NUM, - PART, - PRON, - PROPN, - PUNCT, - SCONJ, - VERB, - X, -) - -# fmt: off -TAG_MAP = { - "adja": {POS: ADJ}, - "adjc": {POS: ADJ}, - "adjp": {POS: ADJ, "PrepCase": "pre"}, - "adj:pl:acc:m1.p1:com": {POS: ADJ, "Number": "plur", "Case": "acc", "Gender": "masc", "Degree": "cmp"}, - "adj:pl:acc:m1.p1:pos": {POS: ADJ, "Number": "plur", "Case": "acc", "Gender": "masc", "Degree": "pos"}, - "adj:pl:acc:m1.p1:sup": {POS: ADJ, "Number": "plur", "Case": "acc", "Gender": "masc", "Degree": "sup"}, - "adj:pl:acc:m2.m3.f.n1.n2.p2.p3:com": {POS: ADJ, "Number": "plur", "Case": "acc", "Gender": "masc|fem|neut", "Degree": "cmp"}, - "adj:pl:acc:m2.m3.f.n1.n2.p2.p3:pos": {POS: ADJ, "Number": "plur", "Case": "acc", "Gender": "masc|fem|neut", "Degree": "pos"}, - "adj:pl:acc:m2.m3.f.n1.n2.p2.p3:sup": {POS: ADJ, "Number": "plur", "Case": "acc", "Gender": "masc|fem|neut", "Degree": "sup"}, - "adj:pl:dat:m1.m2.m3.f.n1.n2.p1.p2.p3:com": {POS: ADJ, "Number": "plur", "Case": "dat", "Gender": "masc|fem|neut", "Degree": "cmp"}, - "adj:pl:dat:m1.m2.m3.f.n1.n2.p1.p2.p3:pos": {POS: ADJ, "Number": "plur", "Case": "dat", "Gender": "masc|fem|neut", "Degree": "pos"}, - "adj:pl:dat:m1.m2.m3.f.n1.n2.p1.p2.p3:sup": {POS: ADJ, "Number": "plur", "Case": "dat", "Gender": "masc|fem|neut", "Degree": "sup"}, - "adj:pl:gen:m1.m2.m3.f.n1.n2.p1.p2.p3:com": {POS: ADJ, "Number": "plur", "Case": "gen", "Gender": "masc|fem|neut", "Degree": "cmp"}, - "adj:pl:gen:m1.m2.m3.f.n1.n2.p1.p2.p3:pos": {POS: ADJ, "Number": "plur", "Case": "gen", "Gender": "masc|fem|neut", "Degree": "pos"}, - "adj:pl:gen:m1.m2.m3.f.n1.n2.p1.p2.p3:sup": {POS: ADJ, "Number": "plur", "Case": "gen", "Gender": "masc|fem|neut", "Degree": "sup"}, - "adj:pl:inst:m1.m2.m3.f.n1.n2.p1.p2.p3:com": {POS: ADJ, "Number": "plur", "Case": "ins", "Gender": "masc|fem|neut", "Degree": "cmp"}, - "adj:pl:inst:m1.m2.m3.f.n1.n2.p1.p2.p3:pos": {POS: ADJ, "Number": "plur", "Case": "ins", "Gender": "masc|fem|neut", "Degree": "pos"}, - "adj:pl:inst:m1.m2.m3.f.n1.n2.p1.p2.p3:sup": {POS: ADJ, "Number": "plur", "Case": "ins", "Gender": "masc|fem|neut", "Degree": "sup"}, - "adj:pl:loc:m1.m2.m3.f.n1.n2.p1.p2.p3:com": {POS: ADJ, "Number": "plur", "Case": "loc", "Gender": "masc|fem|neut", "Degree": "cmp"}, - "adj:pl:loc:m1.m2.m3.f.n1.n2.p1.p2.p3:pos": {POS: ADJ, "Number": "plur", "Case": "loc", "Gender": "masc|fem|neut", "Degree": "pos"}, - "adj:pl:loc:m1.m2.m3.f.n1.n2.p1.p2.p3:sup": {POS: ADJ, "Number": "plur", "Case": "loc", "Gender": "masc|fem|neut", "Degree": "sup"}, - "adj:pl:nom:m1.p1:pos": {POS: ADJ, "Number": "plur", "Case": "nom", "Gender": "masc", "Degree": "pos"}, - "adj:pl:nom:m2.m3.f.n1.n2.p2.p3:pos": {POS: ADJ, "Number": "plur", "Case": "nom", "Gender": "masc|fem|neut", "Degree": "pos"}, - 
"adj:pl:nom.voc:m1.p1:com": {POS: ADJ, "Number": "plur", "Case": "nom|voc", "Gender": "masc", "Degree": "cmp"}, - "adj:pl:nom.voc:m1.p1:pos": {POS: ADJ, "Number": "plur", "Case": "nom|voc", "Gender": "masc", "Degree": "pos"}, - "adj:pl:nom.voc:m1.p1:sup": {POS: ADJ, "Number": "plur", "Case": "nom|voc", "Gender": "masc", "Degree": "sup"}, - "adj:pl:nom.voc:m2.m3.f.n1.n2.p2.p3:com": {POS: ADJ, "Number": "plur", "Case": "nom|voc", "Gender": "masc|fem|neut", "Degree": "cmp"}, - "adj:pl:nom.voc:m2.m3.f.n1.n2.p2.p3:pos": {POS: ADJ, "Number": "plur", "Case": "nom|voc", "Gender": "masc|fem|neut", "Degree": "pos"}, - "adj:pl:nom.voc:m2.m3.f.n1.n2.p2.p3:sup": {POS: ADJ, "Number": "plur", "Case": "nom|voc", "Gender": "masc|fem|neut", "Degree": "sup"}, - "adj:sg:acc:f:com": {POS: ADJ, "Number": "sing", "Case": "acc", "Gender": "fem", "Degree": "cmp"}, - "adj:sg:acc:f:pos": {POS: ADJ, "Number": "sing", "Case": "acc", "Gender": "fem", "Degree": "pos"}, - "adj:sg:acc:f:sup": {POS: ADJ, "Number": "sing", "Case": "acc", "Gender": "fem", "Degree": "sup"}, - "adj:sg:acc:m1.m2:com": {POS: ADJ, "Number": "sing", "Case": "acc", "Gender": "Masc", "Animacy": "hum|anim", "Degree": "cmp"}, - "adj:sg:acc:m1.m2:pos": {POS: ADJ, "Number": "sing", "Case": "acc", "Gender": "Masc", "Animacy": "hum|anim", "Degree": "pos"}, - "adj:sg:acc:m1.m2:sup": {POS: ADJ, "Number": "sing", "Case": "acc", "Gender": "Masc", "Animacy": "hum|anim", "Degree": "sup"}, - "adj:sg:acc:m3:com": {POS: ADJ, "Number": "sing", "Case": "acc", "Gender": "masc", "Animacy": "inan", "Degree": "cmp"}, - "adj:sg:acc:m3:pos": {POS: ADJ, "Number": "sing", "Case": "acc", "Gender": "masc", "Animacy": "inan", "Degree": "pos"}, - "adj:sg:acc:m3:sup": {POS: ADJ, "Number": "sing", "Case": "acc", "Gender": "masc", "Animacy": "inan", "Degree": "sup"}, - "adj:sg:acc:n1.n2:com": {POS: ADJ, "Number": "sing", "Case": "acc", "Gender": "neut", "Degree": "cmp"}, - "adj:sg:acc:n1.n2:pos": {POS: ADJ, "Number": "sing", "Case": "acc", "Gender": "neut", "Degree": "pos"}, - "adj:sg:acc:n1.n2:sup": {POS: ADJ, "Number": "sing", "Case": "acc", "Gender": "neut", "Degree": "sup"}, - "adj:sg:dat:f:com": {POS: ADJ, "Number": "sing", "Case": "dat", "Gender": "fem", "Degree": "cmp"}, - "adj:sg:dat:f:pos": {POS: ADJ, "Number": "sing", "Case": "dat", "Gender": "fem", "Degree": "pos"}, - "adj:sg:dat:f:sup": {POS: ADJ, "Number": "sing", "Case": "dat", "Gender": "fem", "Degree": "sup"}, - "adj:sg:dat:m1.m2.m3.n1.n2:com": {POS: ADJ, "Number": "sing", "Case": "dat", "Gender": "masc|neut", "Degree": "cmp"}, - "adj:sg:dat:m1.m2.m3.n1.n2:pos": {POS: ADJ, "Number": "sing", "Case": "dat", "Gender": "masc|neut", "Degree": "pos"}, - "adj:sg:dat:m1.m2.m3.n1.n2:sup": {POS: ADJ, "Number": "sing", "Case": "dat", "Gender": "masc|neut", "Degree": "sup"}, - "adj:sg:gen:f:com": {POS: ADJ, "Number": "sing", "Case": "gen", "Gender": "fem", "Degree": "cmp"}, - "adj:sg:gen:f:pos": {POS: ADJ, "Number": "sing", "Case": "gen", "Gender": "fem", "Degree": "pos"}, - "adj:sg:gen:f:sup": {POS: ADJ, "Number": "sing", "Case": "gen", "Gender": "fem", "Degree": "sup"}, - "adj:sg:gen:m1.m2.m3.n1.n2:com": {POS: ADJ, "Number": "sing", "Case": "gen", "Gender": "masc|neut", "Degree": "cmp"}, - "adj:sg:gen:m1.m2.m3.n1.n2:pos": {POS: ADJ, "Number": "sing", "Case": "gen", "Gender": "masc|neut", "Degree": "pos"}, - "adj:sg:gen:m1.m2.m3.n1.n2:sup": {POS: ADJ, "Number": "sing", "Case": "gen", "Gender": "masc|neut", "Degree": "sup"}, - "adj:sg:inst:f:com": {POS: ADJ, "Number": "sing", "Case": "ins", "Gender": "fem", "Degree": "cmp"}, - 
"adj:sg:inst:f:pos": {POS: ADJ, "Number": "sing", "Case": "ins", "Gender": "fem", "Degree": "pos"}, - "adj:sg:inst:f:sup": {POS: ADJ, "Number": "sing", "Case": "ins", "Gender": "fem", "Degree": "sup"}, - "adj:sg:inst:m1.m2.m3.n1.n2:com": {POS: ADJ, "Number": "sing", "Case": "ins", "Gender": "masc|neut", "Degree": "cmp"}, - "adj:sg:inst:m1.m2.m3.n1.n2:pos": {POS: ADJ, "Number": "sing", "Case": "ins", "Gender": "masc|neut", "Degree": "pos"}, - "adj:sg:inst:m1.m2.m3.n1.n2:sup": {POS: ADJ, "Number": "sing", "Case": "ins", "Gender": "masc|neut", "Degree": "sup"}, - "adj:sg:loc:f:com": {POS: ADJ, "Number": "sing", "Case": "loc", "Gender": "fem", "Degree": "cmp"}, - "adj:sg:loc:f:pos": {POS: ADJ, "Number": "sing", "Case": "loc", "Gender": "fem", "Degree": "pos"}, - "adj:sg:loc:f:sup": {POS: ADJ, "Number": "sing", "Case": "loc", "Gender": "fem", "Degree": "sup"}, - "adj:sg:loc:m1.m2.m3.n1.n2:com": {POS: ADJ, "Number": "sing", "Case": "loc", "Gender": "masc|neut", "Degree": "cmp"}, - "adj:sg:loc:m1.m2.m3.n1.n2:pos": {POS: ADJ, "Number": "sing", "Case": "loc", "Gender": "masc|neut", "Degree": "pos"}, - "adj:sg:loc:m1.m2.m3.n1.n2:sup": {POS: ADJ, "Number": "sing", "Case": "loc", "Gender": "masc|neut", "Degree": "sup"}, - "adj:sg:nom:f:pos": {POS: ADJ, "Number": "sing", "Case": "nom", "Gender": "fem", "Degree": "pos"}, - "adj:sg:nom:m1.m2.m3:pos": {POS: ADJ, "Number": "sing", "Case": "nom", "Gender": "Masc", "Degree": "pos"}, - "adj:sg:nom:n1.n2:pos": {POS: ADJ, "Number": "sing", "Case": "nom", "Gender": "neut", "Degree": "pos"}, - "adj:sg:nom.voc:f:com": {POS: ADJ, "Number": "sing", "Case": "nom|voc", "Gender": "fem", "Degree": "cmp"}, - "adj:sg:nom.voc:f:pos": {POS: ADJ, "Number": "sing", "Case": "nom|voc", "Gender": "fem", "Degree": "pos"}, - "adj:sg:nom.voc:f:sup": {POS: ADJ, "Number": "sing", "Case": "nom|voc", "Gender": "fem", "Degree": "sup"}, - "adj:sg:nom.voc:m1.m2.m3:com": {POS: ADJ, "Number": "sing", "Case": "nom|voc", "Gender": "Masc", "Degree": "cmp"}, - "adj:sg:nom.voc:m1.m2.m3:pos": {POS: ADJ, "Number": "sing", "Case": "nom|voc", "Gender": "Masc", "Degree": "pos"}, - "adj:sg:nom.voc:m1.m2.m3:sup": {POS: ADJ, "Number": "sing", "Case": "nom|voc", "Gender": "Masc", "Degree": "sup"}, - "adj:sg:nom.voc:n1.n2:com": {POS: ADJ, "Number": "sing", "Case": "nom|voc", "Gender": "neut", "Degree": "cmp"}, - "adj:sg:nom.voc:n1.n2:pos": {POS: ADJ, "Number": "sing", "Case": "nom|voc", "Gender": "neut", "Degree": "pos"}, - "adj:sg:nom.voc:n1.n2:sup": {POS: ADJ, "Number": "sing", "Case": "nom|voc", "Gender": "neut", "Degree": "sup"}, - "adv": {POS: ADV}, - "adv:com": {POS: ADV, "Degree": "cmp"}, - "adv:pos": {POS: ADV, "Degree": "pos"}, - "adv:sup": {POS: ADV, "Degree": "sup"}, - "aglt:pl:pri:imperf:nwok": {POS: AUX, "Aspect": "imp", "Mood": "ind", "VerbForm": "fin", "Tense": "pres", "Number": "plur", "Person": "one", "Aspect": "imp", }, - "aglt:pl:pri:imperf:wok": {POS: AUX, "Aspect": "imp", "Mood": "ind", "VerbForm": "fin", "Tense": "pres", "Number": "plur", "Person": "one", "Aspect": "imp", }, - "aglt:pl:sec:imperf:nwok": {POS: AUX, "Aspect": "imp", "Mood": "ind", "VerbForm": "fin", "Tense": "pres", "Number": "plur", "Person": "two", "Aspect": "imp", }, - "aglt:pl:sec:imperf:wok": {POS: AUX, "Aspect": "imp", "Mood": "ind", "VerbForm": "fin", "Tense": "pres", "Number": "plur", "Person": "two", "Aspect": "imp", }, - "aglt:sg:pri:imperf:nwok": {POS: AUX, "Aspect": "imp", "Mood": "ind", "VerbForm": "fin", "Tense": "pres", "Number": "sing", "Person": "one", "Aspect": "imp", }, - "aglt:sg:pri:imperf:wok": 
{POS: AUX, "Aspect": "imp", "Mood": "ind", "VerbForm": "fin", "Tense": "pres", "Number": "sing", "Person": "one", "Aspect": "imp", }, - "aglt:sg:sec:imperf:nwok": {POS: AUX, "Aspect": "imp", "Mood": "ind", "VerbForm": "fin", "Tense": "pres", "Number": "sing", "Person": "two", "Aspect": "imp", }, - "aglt:sg:sec:imperf:wok": {POS: AUX, "Aspect": "imp", "Mood": "ind", "VerbForm": "fin", "Tense": "pres", "Number": "sing", "Person": "two", "Aspect": "imp", }, - "bedzie:pl:pri:imperf": {POS: AUX, "Aspect": "imp", "Mood": "ind", "VerbForm": "fin", "Tense": "fut", "Number": "plur", "Person": "one", "Aspect": "imp"}, - "bedzie:pl:sec:imperf": {POS: AUX, "Aspect": "imp", "Mood": "ind", "VerbForm": "fin", "Tense": "fut", "Number": "plur", "Person": "two", "Aspect": "imp"}, - "bedzie:pl:ter:imperf": {POS: AUX, "Aspect": "imp", "Mood": "ind", "VerbForm": "fin", "Tense": "fut", "Number": "plur", "Person": "three", "Aspect": "imp"}, - "bedzie:sg:pri:imperf": {POS: AUX, "Aspect": "imp", "Mood": "ind", "VerbForm": "fin", "Tense": "fut", "Number": "sing", "Person": "one", "Aspect": "imp"}, - "bedzie:sg:sec:imperf": {POS: AUX, "Aspect": "imp", "Mood": "ind", "VerbForm": "fin", "Tense": "fut", "Number": "sing", "Person": "two", "Aspect": "imp"}, - "bedzie:sg:ter:imperf": {POS: AUX, "Aspect": "imp", "Mood": "ind", "VerbForm": "fin", "Tense": "fut", "Number": "sing", "Person": "three", "Aspect": "imp"}, - "burk": {POS: X}, - "comp": {POS: SCONJ}, - "conj": {POS: CCONJ}, - "depr:pl:nom:m2": {POS: NOUN, "Animacy": "anim", "Number": "plur", "Case": "nom", "Gender": "masc", "Animacy": "anim"}, - "depr:pl:voc:m2": {POS: NOUN, "Animacy": "anim", "Number": "plur", "Case": "voc", "Gender": "masc", "Animacy": "anim"}, - "fin:pl:pri:imperf": {POS: VERB, "VerbForm": "fin", "Tense": "pres", "Mood": "ind", "Number": "plur", "Person": "one", "Aspect": "imp"}, - "fin:pl:pri:imperf.perf": {POS: VERB, "VerbForm": "fin", "Tense": "pres", "Mood": "ind", "Number": "plur", "Person": "one", "Aspect": "imp|perf"}, - "fin:pl:pri:perf": {POS: VERB, "VerbForm": "fin", "Tense": "pres", "Mood": "ind", "Number": "plur", "Person": "one", "Aspect": "perf"}, - "fin:pl:sec:imperf": {POS: VERB, "VerbForm": "fin", "Tense": "pres", "Mood": "ind", "Number": "plur", "Person": "two", "Aspect": "imp"}, - "fin:pl:sec:imperf.perf": {POS: VERB, "VerbForm": "fin", "Tense": "pres", "Mood": "ind", "Number": "plur", "Person": "two", "Aspect": "imp|perf"}, - "fin:pl:sec:perf": {POS: VERB, "VerbForm": "fin", "Tense": "pres", "Mood": "ind", "Number": "plur", "Person": "two", "Aspect": "perf"}, - "fin:pl:ter:imperf": {POS: VERB, "VerbForm": "fin", "Tense": "pres", "Mood": "ind", "Number": "plur", "Person": "three", "Aspect": "imp"}, - "fin:pl:ter:imperf.perf": {POS: VERB, "VerbForm": "fin", "Tense": "pres", "Mood": "ind", "Number": "plur", "Person": "three", "Aspect": "imp|perf"}, - "fin:pl:ter:perf": {POS: VERB, "VerbForm": "fin", "Tense": "pres", "Mood": "ind", "Number": "plur", "Person": "three", "Aspect": "perf"}, - "fin:sg:pri:imperf": {POS: VERB, "VerbForm": "fin", "Tense": "pres", "Mood": "ind", "Number": "sing", "Person": "one", "Aspect": "imp"}, - "fin:sg:pri:imperf.perf": {POS: VERB, "VerbForm": "fin", "Tense": "pres", "Mood": "ind", "Number": "sing", "Person": "one", "Aspect": "imp|perf"}, - "fin:sg:pri:perf": {POS: VERB, "VerbForm": "fin", "Tense": "pres", "Mood": "ind", "Number": "sing", "Person": "one", "Aspect": "perf"}, - "fin:sg:sec:imperf": {POS: VERB, "VerbForm": "fin", "Tense": "pres", "Mood": "ind", "Number": "sing", "Person": "two", 
"Aspect": "imp"}, - "fin:sg:sec:imperf.perf": {POS: VERB, "VerbForm": "fin", "Tense": "pres", "Mood": "ind", "Number": "sing", "Person": "two", "Aspect": "imp|perf"}, - "fin:sg:sec:perf": {POS: VERB, "VerbForm": "fin", "Tense": "pres", "Mood": "ind", "Number": "sing", "Person": "two", "Aspect": "perf"}, - "fin:sg:ter:imperf": {POS: VERB, "VerbForm": "fin", "Tense": "pres", "Mood": "ind", "Number": "sing", "Person": "three", "Aspect": "imp"}, - "fin:sg:ter:imperf.perf": {POS: VERB, "VerbForm": "fin", "Tense": "pres", "Mood": "ind", "Number": "sing", "Person": "three", "Aspect": "imp|perf"}, - "fin:sg:ter:perf": {POS: VERB, "VerbForm": "fin", "Tense": "pres", "Mood": "ind", "Number": "sing", "Person": "three", "Aspect": "perf"}, - "ger:sg:dat.loc:n2:imperf:aff": {POS: VERB, "Number": "sing", "Case": "dat|loc", "Gender": "neut", "Aspect": "imp", "Polarity": "pos"}, - "ger:sg:dat.loc:n2:imperf:neg": {POS: VERB, "Number": "sing", "Case": "dat|loc", "Gender": "neut", "Aspect": "imp", "Polarity": "neg"}, - "ger:sg:dat.loc:n2:imperf.perf:aff": {POS: VERB, "Number": "sing", "Case": "dat|loc", "Gender": "neut", "Aspect": "imp|perf", "Polarity": "pos"}, - "ger:sg:dat.loc:n2:imperf.perf:neg": {POS: VERB, "Number": "sing", "Case": "dat|loc", "Gender": "neut", "Aspect": "imp|perf", "Polarity": "neg"}, - "ger:sg:dat.loc:n2:perf:aff": {POS: VERB, "Number": "sing", "Case": "dat|loc", "Gender": "neut", "Aspect": "perf", "Polarity": "pos"}, - "ger:sg:dat.loc:n2:perf:neg": {POS: VERB, "Number": "sing", "Case": "dat|loc", "Gender": "neut", "Aspect": "perf", "Polarity": "neg"}, - "ger:sg:gen:n2:imperf:aff": {POS: VERB, "Number": "sing", "Case": "gen", "Gender": "neut", "Aspect": "imp", "Polarity": "pos"}, - "ger:sg:gen:n2:imperf:neg": {POS: VERB, "Number": "sing", "Case": "gen", "Gender": "neut", "Aspect": "imp", "Polarity": "neg"}, - "ger:sg:gen:n2:imperf.perf:aff": {POS: VERB, "Number": "sing", "Case": "gen", "Gender": "neut", "Aspect": "imp|perf", "Polarity": "pos"}, - "ger:sg:gen:n2:imperf.perf:neg": {POS: VERB, "Number": "sing", "Case": "gen", "Gender": "neut", "Aspect": "imp|perf", "Polarity": "neg"}, - "ger:sg:gen:n2:perf:aff": {POS: VERB, "Number": "sing", "Case": "gen", "Gender": "neut", "Aspect": "perf", "Polarity": "pos"}, - "ger:sg:gen:n2:perf:neg": {POS: VERB, "Number": "sing", "Case": "gen", "Gender": "neut", "Aspect": "perf", "Polarity": "neg"}, - "ger:sg:inst:n2:imperf:aff": {POS: VERB, "Number": "sing", "Case": "ins", "Gender": "neut", "Aspect": "imp", "Polarity": "pos"}, - "ger:sg:inst:n2:imperf:neg": {POS: VERB, "Number": "sing", "Case": "ins", "Gender": "neut", "Aspect": "imp", "Polarity": "neg"}, - "ger:sg:inst:n2:imperf.perf:aff": {POS: VERB, "Number": "sing", "Case": "ins", "Gender": "neut", "Aspect": "imp|perf", "Polarity": "pos"}, - "ger:sg:inst:n2:imperf.perf:neg": {POS: VERB, "Number": "sing", "Case": "ins", "Gender": "neut", "Aspect": "imp|perf", "Polarity": "neg"}, - "ger:sg:inst:n2:perf:aff": {POS: VERB, "Number": "sing", "Case": "ins", "Gender": "neut", "Aspect": "perf", "Polarity": "pos"}, - "ger:sg:inst:n2:perf:neg": {POS: VERB, "Number": "sing", "Case": "ins", "Gender": "neut", "Aspect": "perf", "Polarity": "neg"}, - "ger:sg:nom.acc:n2:imperf:aff": {POS: VERB, "Number": "sing", "Case": "nom|acc", "Gender": "neut", "Aspect": "imp", "Polarity": "pos"}, - "ger:sg:nom.acc:n2:imperf:neg": {POS: VERB, "Number": "sing", "Case": "nom|acc", "Gender": "neut", "Aspect": "imp", "Polarity": "neg"}, - "ger:sg:nom.acc:n2:imperf.perf:aff": {POS: VERB, "Number": "sing", "Case": "nom|acc", 
"Gender": "neut", "Aspect": "imp|perf", "Polarity": "pos"}, - "ger:sg:nom.acc:n2:imperf.perf:neg": {POS: VERB, "Number": "sing", "Case": "nom|acc", "Gender": "neut", "Aspect": "imp|perf", "Polarity": "neg"}, - "ger:sg:nom.acc:n2:perf:aff": {POS: VERB, "Number": "sing", "Case": "nom|acc", "Gender": "neut", "Aspect": "perf", "Polarity": "pos"}, - "ger:sg:nom.acc:n2:perf:neg": {POS: VERB, "Number": "sing", "Case": "nom|acc", "Gender": "neut", "Aspect": "perf", "Polarity": "neg"}, - "imps:imperf": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Aspect": "imp"}, - "imps:imperf.perf": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Aspect": "imp|perf"}, - "imps:perf": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Aspect": "perf"}, - "impt:pl:pri:imperf": {POS: VERB, "Mood": "imp", "VerbForm": "fin", "Number": "plur", "Person": "one", "Aspect": "imp"}, - "impt:pl:pri:imperf.perf": {POS: VERB, "Mood": "imp", "VerbForm": "fin", "Number": "plur", "Person": "one", "Aspect": "imp|perf"}, - "impt:pl:pri:perf": {POS: VERB, "Mood": "imp", "VerbForm": "fin", "Number": "plur", "Person": "one", "Aspect": "perf"}, - "impt:pl:sec:imperf": {POS: VERB, "Mood": "imp", "VerbForm": "fin", "Number": "plur", "Person": "two", "Aspect": "imp"}, - "impt:pl:sec:imperf.perf": {POS: VERB, "Mood": "imp", "VerbForm": "fin", "Number": "plur", "Person": "two", "Aspect": "imp|perf"}, - "impt:pl:sec:perf": {POS: VERB, "Mood": "imp", "VerbForm": "fin", "Number": "plur", "Person": "two", "Aspect": "perf"}, - "impt:sg:sec:imperf": {POS: VERB, "Mood": "imp", "VerbForm": "fin", "Number": "sing", "Person": "two", "Aspect": "imp"}, - "impt:sg:sec:imperf.perf": {POS: VERB, "Mood": "imp", "VerbForm": "fin", "Number": "sing", "Person": "two", "Aspect": "imp|perf"}, - "impt:sg:sec:perf": {POS: VERB, "Mood": "imp", "VerbForm": "fin", "Number": "sing", "Person": "two", "Aspect": "perf"}, - "inf:imperf": {POS: VERB, "VerbForm": "inf", "Aspect": "imp"}, - "inf:imperf.perf": {POS: VERB, "VerbForm": "inf", "Aspect": "imp|perf"}, - "inf:perf": {POS: VERB, "VerbForm": "inf", "Aspect": "perf"}, - "interj": {POS: INTJ}, - "num:comp": {POS: NUM}, - "num:pl:acc:m1:rec": {POS: NUM, "Number": "plur", "Case": "acc", "Gender": "Masc", "Animacy": "hum"}, - "num:pl:dat.loc:n1.p1.p2:congr.rec": {POS: NUM, "Number": "plur", "Case": "dat|loc", "Gender": "neut"}, - "num:pl:dat:m1.m2.m3.n2.f:congr": {POS: NUM, "Number": "plur", "Case": "dat", "Gender": "masc|fem|neut"}, - "num:pl:gen.dat.inst.loc:m1.m2.m3.f.n1.n2.p1.p2:congr": {POS: NUM, "Number": "plur", "Case": "gen|dat|ins|loc", "Gender": "masc|fem|neut"}, - "num:pl:gen.dat.inst.loc:m1.m2.m3.f.n2:congr": {POS: NUM, "Number": "plur", "Case": "gen|dat|ins|loc", "Gender": "masc|fem|neut"}, - "num:pl:gen.dat.loc:m1.m2.m3.n2.f:congr": {POS: NUM, "Number": "plur", "Case": "gen|dat|loc", "Gender": "masc|fem|neut"}, - "num:pl:gen.loc:m1.m2.m3.f.n1.n2.p1.p2:congr": {POS: NUM, "Number": "plur", "Case": "gen|loc", "Gender": "masc|fem|neut"}, - "num:pl:gen.loc:m1.m2.m3.n2.f:congr": {POS: NUM, "Number": "plur", "Case": "gen|loc", "Gender": "masc|fem|neut"}, - "num:pl:gen:n1.p1.p2:rec": {POS: NUM, "Number": "plur", "Case": "gen", "Gender": "neut"}, - "num:pl:inst:f:congr": {POS: NUM, "Number": "plur", "Case": "ins", "Gender": "fem"}, - "num:pl:inst:m1.m2.m3.f.n1.n2.p1.p2:congr": {POS: NUM, "Number": "plur", "Case": "ins", "Gender": "masc|fem|neut"}, - "num:pl:inst:m1.m2.m3.f.n2:congr": {POS: NUM, "Number": "plur", "Case": "ins", "Gender": "masc|fem|neut"}, - "num:pl:inst:m1.m2.m3.n2:congr": {POS: NUM, "Number": 
"plur", "Case": "ins", "Gender": "masc|neut"}, - "num:pl:inst:m1.m2.m3.n2.f:congr": {POS: NUM, "Number": "plur", "Case": "ins", "Gender": "masc|fem|neut"}, - "num:pl:inst:n1.p1.p2:rec": {POS: NUM, "Number": "plur", "Case": "ins", "Gender": "neut"}, - "num:pl:nom.acc:m1.m2.m3.f.n1.n2.p1.p2:rec": {POS: NUM, "Number": "plur", "Case": "nom|acc", "Gender": "masc|fem|neut"}, - "num:pl:nom.acc.voc:f:congr": {POS: NUM, "Number": "plur", "Case": "nom|acc|voc", "Gender": "fem"}, - "num:pl:nom.acc.voc:m1:rec": {POS: NUM, "Number": "plur", "Case": "nom|acc|voc", "Gender": "Masc", "Animacy": "hum"}, - "num:pl:nom.acc.voc:m2.m3.f.n1.n2.p1.p2:rec": {POS: NUM, "Number": "plur", "Case": "nom|acc|voc", "Gender": "masc|fem|neut"}, - "num:pl:nom.acc.voc:m2.m3.f.n2:rec": {POS: NUM, "Number": "plur", "Case": "nom|acc|voc", "Gender": "masc|fem|neut"}, - "num:pl:nom.acc.voc:m2.m3.n2:congr": {POS: NUM, "Number": "plur", "Case": "nom|acc|voc", "Gender": "masc|neut"}, - "num:pl:nom.acc.voc:m2.m3.n2.f:congr": {POS: NUM, "Number": "plur", "Case": "nom|acc|voc", "Gender": "masc|fem|neut"}, - "num:pl:nom.acc.voc:n1.p1.p2:rec": {POS: NUM, "Number": "plur", "Case": "nom|acc|voc", "Gender": "neut"}, - "num:pl:nom.gen.dat.inst.acc.loc.voc:m1.m2.m3.f.n1.n2.p1.p2:rec": {POS: NUM, "Number": "plur", "Gender": "masc|fem|neut"}, - "num:pl:nom.voc:m1:congr": {POS: NUM, "Number": "plur", "Case": "nom|voc", "Gender": "Masc", "Animacy": "hum"}, - "num:pl:nom.voc:m1:rec": {POS: NUM, "Number": "plur", "Case": "nom|voc", "Gender": "Masc", "Animacy": "hum"}, - "num:sg:nom.gen.dat.inst.acc.loc.voc:f:rec": {POS: NUM, "Number": "sing", "Gender": "fem"}, - "num:sg:nom.gen.dat.inst.acc.loc.voc:m1.m2.m3.n1.n2:rec": {POS: NUM, "Number": "sing", "Gender": "masc|neut"}, - "pact:pl:acc:m1.p1:imperf:aff": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "plur", "Case": "acc", "Gender": "masc", "Aspect": "imp", "Polarity": "pos"}, - "pact:pl:acc:m1.p1:imperf:neg": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "plur", "Case": "acc", "Gender": "masc", "Aspect": "imp", "Polarity": "neg"}, - "pact:pl:acc:m1.p1:imperf.perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "plur", "Case": "acc", "Gender": "masc", "Aspect": "imp|perf", "Polarity": "pos"}, - "pact:pl:acc:m1.p1:imperf.perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "plur", "Case": "acc", "Gender": "masc", "Aspect": "imp|perf", "Polarity": "neg"}, - "pact:pl:dat:m1.m2.m3.f.n1.n2.p1.p2.p3:imperf:aff": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "plur", "Case": "dat", "Gender": "masc|fem|neut", "Aspect": "imp", "Polarity": "pos"}, - "pact:pl:dat:m1.m2.m3.f.n1.n2.p1.p2.p3:imperf:neg": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "plur", "Case": "dat", "Gender": "masc|fem|neut", "Aspect": "imp", "Polarity": "neg"}, - "pact:pl:dat:m1.m2.m3.f.n1.n2.p1.p2.p3:imperf.perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "plur", "Case": "dat", "Gender": "masc|fem|neut", "Aspect": "imp|perf", "Polarity": "pos"}, - "pact:pl:dat:m1.m2.m3.f.n1.n2.p1.p2.p3:imperf.perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "plur", "Case": "dat", "Gender": "masc|fem|neut", "Aspect": "imp|perf", "Polarity": "neg"}, - "pact:pl:gen.loc:m1.m2.m3.f.n1.n2.p1.p2.p3:imperf:aff": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "plur", "Case": "gen|loc", "Gender": "masc|fem|neut", "Aspect": "imp", "Polarity": "pos"}, - "pact:pl:gen.loc:m1.m2.m3.f.n1.n2.p1.p2.p3:imperf:neg": {POS: VERB, "VerbForm": 
"part", "Voice": "act", "Number": "plur", "Case": "gen|loc", "Gender": "masc|fem|neut", "Aspect": "imp", "Polarity": "neg"}, - "pact:pl:gen.loc:m1.m2.m3.f.n1.n2.p1.p2.p3:imperf.perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "plur", "Case": "gen|loc", "Gender": "masc|fem|neut", "Aspect": "imp|perf", "Polarity": "pos"}, - "pact:pl:gen.loc:m1.m2.m3.f.n1.n2.p1.p2.p3:imperf.perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "plur", "Case": "gen|loc", "Gender": "masc|fem|neut", "Aspect": "imp|perf", "Polarity": "neg"}, - "pact:pl:inst:m1.m2.m3.f.n1.n2.p1.p2.p3:imperf:aff": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "plur", "Case": "ins", "Gender": "masc|fem|neut", "Aspect": "imp", "Polarity": "pos"}, - "pact:pl:inst:m1.m2.m3.f.n1.n2.p1.p2.p3:imperf:neg": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "plur", "Case": "ins", "Gender": "masc|fem|neut", "Aspect": "imp", "Polarity": "neg"}, - "pact:pl:inst:m1.m2.m3.f.n1.n2.p1.p2.p3:imperf.perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "plur", "Case": "ins", "Gender": "masc|fem|neut", "Aspect": "imp|perf", "Polarity": "pos"}, - "pact:pl:inst:m1.m2.m3.f.n1.n2.p1.p2.p3:imperf.perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "plur", "Case": "ins", "Gender": "masc|fem|neut", "Aspect": "imp|perf", "Polarity": "neg"}, - "pact:pl:nom.acc.voc:m2.m3.f.n1.n2.p2.p3:imperf:aff": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "plur", "Case": "nom|acc|voc", "Gender": "masc|fem|neut", "Aspect": "imp", "Polarity": "pos"}, - "pact:pl:nom.acc.voc:m2.m3.f.n1.n2.p2.p3:imperf:neg": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "plur", "Case": "nom|acc|voc", "Gender": "masc|fem|neut", "Aspect": "imp", "Polarity": "neg"}, - "pact:pl:nom.acc.voc:m2.m3.f.n1.n2.p2.p3:imperf.perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "plur", "Case": "nom|acc|voc", "Gender": "masc|fem|neut", "Aspect": "imp|perf", "Polarity": "pos"}, - "pact:pl:nom.acc.voc:m2.m3.f.n1.n2.p2.p3:imperf.perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "plur", "Case": "nom|acc|voc", "Gender": "masc|fem|neut", "Aspect": "imp|perf", "Polarity": "neg"}, - "pact:pl:nom.voc:m1.p1:imperf:aff": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "plur", "Case": "nom|voc", "Gender": "masc", "Aspect": "imp", "Polarity": "pos"}, - "pact:pl:nom.voc:m1.p1:imperf:neg": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "plur", "Case": "nom|voc", "Gender": "masc", "Aspect": "imp", "Polarity": "neg"}, - "pact:pl:nom.voc:m1.p1:imperf.perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "plur", "Case": "nom|voc", "Gender": "masc", "Aspect": "imp|perf", "Polarity": "pos"}, - "pact:pl:nom.voc:m1.p1:imperf.perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "plur", "Case": "nom|voc", "Gender": "masc", "Aspect": "imp|perf", "Polarity": "neg"}, - "pact:sg:acc.inst:f:imperf:aff": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "acc|ins", "Gender": "fem", "Aspect": "imp", "Polarity": "pos"}, - "pact:sg:acc.inst:f:imperf:neg": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "acc|ins", "Gender": "fem", "Aspect": "imp", "Polarity": "neg"}, - "pact:sg:acc.inst:f:imperf.perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "acc|ins", "Gender": "fem", "Aspect": "imp|perf", "Polarity": "pos"}, - "pact:sg:acc.inst:f:imperf.perf:neg": {POS: VERB, 
"VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "acc|ins", "Gender": "fem", "Aspect": "imp|perf", "Polarity": "neg"}, - "pact:sg:acc:m1.m2:imperf:aff": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "acc", "Gender": "Masc", "Animacy": "hum|anim", "Aspect": "imp", "Polarity": "pos"}, - "pact:sg:acc:m1.m2:imperf:neg": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "acc", "Gender": "Masc", "Animacy": "hum|anim", "Aspect": "imp", "Polarity": "neg"}, - "pact:sg:acc:m1.m2:imperf.perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "acc", "Gender": "Masc", "Animacy": "hum|anim", "Aspect": "imp|perf", "Polarity": "pos"}, - "pact:sg:acc:m1.m2:imperf.perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "acc", "Gender": "Masc", "Animacy": "hum|anim", "Aspect": "imp|perf", "Polarity": "neg"}, - "pact:sg:acc:m3:imperf:aff": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "acc", "Gender": "masc", "Animacy": "inan", "Aspect": "imp", "Polarity": "pos"}, - "pact:sg:acc:m3:imperf:neg": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "acc", "Gender": "masc", "Animacy": "inan", "Aspect": "imp", "Polarity": "neg"}, - "pact:sg:acc:m3:imperf.perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "acc", "Gender": "masc", "Animacy": "inan", "Aspect": "imp|perf", "Polarity": "pos"}, - "pact:sg:acc:m3:imperf.perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "acc", "Gender": "masc", "Animacy": "inan", "Aspect": "imp|perf", "Polarity": "neg"}, - "pact:sg:dat:m1.m2.m3.n1.n2:imperf:aff": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "dat", "Gender": "masc|neut", "Aspect": "imp", "Polarity": "pos"}, - "pact:sg:dat:m1.m2.m3.n1.n2:imperf:neg": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "dat", "Gender": "masc|neut", "Aspect": "imp", "Polarity": "neg"}, - "pact:sg:dat:m1.m2.m3.n1.n2:imperf.perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "dat", "Gender": "masc|neut", "Aspect": "imp|perf", "Polarity": "pos"}, - "pact:sg:dat:m1.m2.m3.n1.n2:imperf.perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "dat", "Gender": "masc|neut", "Aspect": "imp|perf", "Polarity": "neg"}, - "pact:sg:gen.dat.loc:f:imperf:aff": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "gen|dat|loc", "Gender": "fem", "Aspect": "imp", "Polarity": "pos"}, - "pact:sg:gen.dat.loc:f:imperf:neg": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "gen|dat|loc", "Gender": "fem", "Aspect": "imp", "Polarity": "neg"}, - "pact:sg:gen.dat.loc:f:imperf.perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "gen|dat|loc", "Gender": "fem", "Aspect": "imp|perf", "Polarity": "pos"}, - "pact:sg:gen.dat.loc:f:imperf.perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "gen|dat|loc", "Gender": "fem", "Aspect": "imp|perf", "Polarity": "neg"}, - "pact:sg:gen:m1.m2.m3.n1.n2:imperf:aff": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "gen", "Gender": "masc|neut", "Aspect": "imp", "Polarity": "pos"}, - "pact:sg:gen:m1.m2.m3.n1.n2:imperf:neg": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "gen", "Gender": "masc|neut", "Aspect": "imp", "Polarity": 
"neg"}, - "pact:sg:gen:m1.m2.m3.n1.n2:imperf.perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "gen", "Gender": "masc|neut", "Aspect": "imp|perf", "Polarity": "pos"}, - "pact:sg:gen:m1.m2.m3.n1.n2:imperf.perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "gen", "Gender": "masc|neut", "Aspect": "imp|perf", "Polarity": "neg"}, - "pact:sg:inst.loc:m1.m2.m3.n1.n2:imperf:aff": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "ins|loc", "Gender": "masc|neut", "Aspect": "imp", "Polarity": "pos"}, - "pact:sg:inst.loc:m1.m2.m3.n1.n2:imperf:neg": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "ins|loc", "Gender": "masc|neut", "Aspect": "imp", "Polarity": "neg"}, - "pact:sg:inst.loc:m1.m2.m3.n1.n2:imperf.perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "ins|loc", "Gender": "masc|neut", "Aspect": "imp|perf", "Polarity": "pos"}, - "pact:sg:inst.loc:m1.m2.m3.n1.n2:imperf.perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "ins|loc", "Gender": "masc|neut", "Aspect": "imp|perf", "Polarity": "neg"}, - "pact:sg:nom.acc.voc:n1.n2:imperf:aff": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "nom|acc|voc", "Gender": "neut", "Aspect": "imp", "Polarity": "pos"}, - "pact:sg:nom.acc.voc:n1.n2:imperf:neg": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "nom|acc|voc", "Gender": "neut", "Aspect": "imp", "Polarity": "neg"}, - "pact:sg:nom.acc.voc:n1.n2:imperf.perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "nom|acc|voc", "Gender": "neut", "Aspect": "imp|perf", "Polarity": "pos"}, - "pact:sg:nom.acc.voc:n1.n2:imperf.perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "nom|acc|voc", "Gender": "neut", "Aspect": "imp|perf", "Polarity": "neg"}, - "pact:sg:nom.voc:f:imperf:aff": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "nom|voc", "Gender": "fem", "Aspect": "imp", "Polarity": "pos"}, - "pact:sg:nom.voc:f:imperf:neg": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "nom|voc", "Gender": "fem", "Aspect": "imp", "Polarity": "neg"}, - "pact:sg:nom.voc:f:imperf.perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "nom|voc", "Gender": "fem", "Aspect": "imp|perf", "Polarity": "pos"}, - "pact:sg:nom.voc:f:imperf.perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "nom|voc", "Gender": "fem", "Aspect": "imp|perf", "Polarity": "neg"}, - "pact:sg:nom.voc:m1.m2.m3:imperf:aff": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "nom|voc", "Gender": "Masc", "Aspect": "imp", "Polarity": "pos"}, - "pact:sg:nom.voc:m1.m2.m3:imperf:neg": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "nom|voc", "Gender": "Masc", "Aspect": "imp", "Polarity": "neg"}, - "pact:sg:nom.voc:m1.m2.m3:imperf.perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "nom|voc", "Gender": "Masc", "Aspect": "imp|perf", "Polarity": "pos"}, - "pact:sg:nom.voc:m1.m2.m3:imperf.perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "act", "Number": "sing", "Case": "nom|voc", "Gender": "Masc", "Aspect": "imp|perf", "Polarity": "neg"}, - "pant:perf": {POS: VERB, "Tense": "past", "VerbForm": "conv", "Aspect": "perf"}, - "pcon:imperf": {POS: VERB, "Tense": "pres", "VerbForm": 
"conv", "Aspect": "imp"}, - "ppas:pl:acc:m1.p1:imperf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "acc", "Gender": "masc", "Aspect": "imp", "Polarity": "pos"}, - "ppas:pl:acc:m1.p1:imperf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "acc", "Gender": "masc", "Aspect": "imp", "Polarity": "neg"}, - "ppas:pl:acc:m1.p1:imperf.perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "acc", "Gender": "masc", "Aspect": "imp|perf", "Polarity": "pos"}, - "ppas:pl:acc:m1.p1:imperf.perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "acc", "Gender": "masc", "Aspect": "imp|perf", "Polarity": "neg"}, - "ppas:pl:acc:m1.p1:perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "acc", "Gender": "masc", "Aspect": "perf", "Polarity": "pos"}, - "ppas:pl:acc:m1.p1:perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "acc", "Gender": "masc", "Aspect": "perf", "Polarity": "neg"}, - "ppas:pl:dat:m1.m2.m3.f.n1.n2.p1.p2.p3:imperf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "dat", "Gender": "masc|fem|neut", "Aspect": "imp", "Polarity": "pos"}, - "ppas:pl:dat:m1.m2.m3.f.n1.n2.p1.p2.p3:imperf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "dat", "Gender": "masc|fem|neut", "Aspect": "imp", "Polarity": "neg"}, - "ppas:pl:dat:m1.m2.m3.f.n1.n2.p1.p2.p3:imperf.perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "dat", "Gender": "masc|fem|neut", "Aspect": "imp|perf", "Polarity": "pos"}, - "ppas:pl:dat:m1.m2.m3.f.n1.n2.p1.p2.p3:imperf.perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "dat", "Gender": "masc|fem|neut", "Aspect": "imp|perf", "Polarity": "neg"}, - "ppas:pl:dat:m1.m2.m3.f.n1.n2.p1.p2.p3:perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "dat", "Gender": "masc|fem|neut", "Aspect": "perf", "Polarity": "pos"}, - "ppas:pl:dat:m1.m2.m3.f.n1.n2.p1.p2.p3:perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "dat", "Gender": "masc|fem|neut", "Aspect": "perf", "Polarity": "neg"}, - "ppas:pl:gen.loc:m1.m2.m3.f.n1.n2.p1.p2.p3:imperf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "gen|loc", "Gender": "masc|fem|neut", "Aspect": "imp", "Polarity": "pos"}, - "ppas:pl:gen.loc:m1.m2.m3.f.n1.n2.p1.p2.p3:imperf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "gen|loc", "Gender": "masc|fem|neut", "Aspect": "imp", "Polarity": "neg"}, - "ppas:pl:gen.loc:m1.m2.m3.f.n1.n2.p1.p2.p3:imperf.perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "gen|loc", "Gender": "masc|fem|neut", "Aspect": "imp|perf", "Polarity": "pos"}, - "ppas:pl:gen.loc:m1.m2.m3.f.n1.n2.p1.p2.p3:imperf.perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "gen|loc", "Gender": "masc|fem|neut", "Aspect": "imp|perf", "Polarity": "neg"}, - "ppas:pl:gen.loc:m1.m2.m3.f.n1.n2.p1.p2.p3:perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "gen|loc", "Gender": "masc|fem|neut", "Aspect": "perf", "Polarity": "pos"}, - "ppas:pl:gen.loc:m1.m2.m3.f.n1.n2.p1.p2.p3:perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "gen|loc", "Gender": "masc|fem|neut", "Aspect": "perf", "Polarity": "neg"}, 
- "ppas:pl:inst:m1.m2.m3.f.n1.n2.p1.p2.p3:imperf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "ins", "Gender": "masc|fem|neut", "Aspect": "imp", "Polarity": "pos"}, - "ppas:pl:inst:m1.m2.m3.f.n1.n2.p1.p2.p3:imperf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "ins", "Gender": "masc|fem|neut", "Aspect": "imp", "Polarity": "neg"}, - "ppas:pl:inst:m1.m2.m3.f.n1.n2.p1.p2.p3:imperf.perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "ins", "Gender": "masc|fem|neut", "Aspect": "imp|perf", "Polarity": "pos"}, - "ppas:pl:inst:m1.m2.m3.f.n1.n2.p1.p2.p3:imperf.perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "ins", "Gender": "masc|fem|neut", "Aspect": "imp|perf", "Polarity": "neg"}, - "ppas:pl:inst:m1.m2.m3.f.n1.n2.p1.p2.p3:perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "ins", "Gender": "masc|fem|neut", "Aspect": "perf", "Polarity": "pos"}, - "ppas:pl:inst:m1.m2.m3.f.n1.n2.p1.p2.p3:perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "ins", "Gender": "masc|fem|neut", "Aspect": "perf", "Polarity": "neg"}, - "ppas:pl:nom.acc.voc:m2.m3.f.n1.n2.p2.p3:imperf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "nom|acc|voc", "Gender": "masc|fem|neut", "Aspect": "imp", "Polarity": "pos"}, - "ppas:pl:nom.acc.voc:m2.m3.f.n1.n2.p2.p3:imperf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "nom|acc|voc", "Gender": "masc|fem|neut", "Aspect": "imp", "Polarity": "neg"}, - "ppas:pl:nom.acc.voc:m2.m3.f.n1.n2.p2.p3:imperf.perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "nom|acc|voc", "Gender": "masc|fem|neut", "Aspect": "imp|perf", "Polarity": "pos"}, - "ppas:pl:nom.acc.voc:m2.m3.f.n1.n2.p2.p3:imperf.perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "nom|acc|voc", "Gender": "masc|fem|neut", "Aspect": "imp|perf", "Polarity": "neg"}, - "ppas:pl:nom.acc.voc:m2.m3.f.n1.n2.p2.p3:perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "nom|acc|voc", "Gender": "masc|fem|neut", "Aspect": "perf", "Polarity": "pos"}, - "ppas:pl:nom.acc.voc:m2.m3.f.n1.n2.p2.p3:perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "nom|acc|voc", "Gender": "masc|fem|neut", "Aspect": "perf", "Polarity": "neg"}, - "ppas:pl:nom.voc:m1.p1:imperf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "nom|voc", "Gender": "masc", "Aspect": "imp", "Polarity": "pos"}, - "ppas:pl:nom.voc:m1.p1:imperf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "nom|voc", "Gender": "masc", "Aspect": "imp", "Polarity": "neg"}, - "ppas:pl:nom.voc:m1.p1:imperf.perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "nom|voc", "Gender": "masc", "Aspect": "imp|perf", "Polarity": "pos"}, - "ppas:pl:nom.voc:m1.p1:imperf.perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "nom|voc", "Gender": "masc", "Aspect": "imp|perf", "Polarity": "neg"}, - "ppas:pl:nom.voc:m1.p1:perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "nom|voc", "Gender": "masc", "Aspect": "perf", "Polarity": "pos"}, - "ppas:pl:nom.voc:m1.p1:perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "plur", "Case": "nom|voc", "Gender": 
"masc", "Aspect": "perf", "Polarity": "neg"}, - "ppas:sg:acc.inst:f:imperf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "acc|ins", "Gender": "fem", "Aspect": "imp", "Polarity": "pos"}, - "ppas:sg:acc.inst:f:imperf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "acc|ins", "Gender": "fem", "Aspect": "imp", "Polarity": "neg"}, - "ppas:sg:acc.inst:f:imperf.perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "acc|ins", "Gender": "fem", "Aspect": "imp|perf", "Polarity": "pos"}, - "ppas:sg:acc.inst:f:imperf.perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "acc|ins", "Gender": "fem", "Aspect": "imp|perf", "Polarity": "neg"}, - "ppas:sg:acc.inst:f:perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "acc|ins", "Gender": "fem", "Aspect": "perf", "Polarity": "pos"}, - "ppas:sg:acc.inst:f:perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "acc|ins", "Gender": "fem", "Aspect": "perf", "Polarity": "neg"}, - "ppas:sg:acc:m1.m2:imperf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "acc", "Gender": "Masc", "Animacy": "hum|anim", "Aspect": "imp", "Polarity": "pos"}, - "ppas:sg:acc:m1.m2:imperf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "acc", "Gender": "Masc", "Animacy": "hum|anim", "Aspect": "imp", "Polarity": "neg"}, - "ppas:sg:acc:m1.m2:imperf.perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "acc", "Gender": "Masc", "Animacy": "hum|anim", "Aspect": "imp|perf", "Polarity": "pos"}, - "ppas:sg:acc:m1.m2:imperf.perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "acc", "Gender": "Masc", "Animacy": "hum|anim", "Aspect": "imp|perf", "Polarity": "neg"}, - "ppas:sg:acc:m1.m2:perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "acc", "Gender": "Masc", "Animacy": "hum|anim", "Aspect": "perf", "Polarity": "pos"}, - "ppas:sg:acc:m1.m2:perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "acc", "Gender": "Masc", "Animacy": "hum|anim", "Aspect": "perf", "Polarity": "neg"}, - "ppas:sg:acc:m3:imperf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "acc", "Gender": "masc", "Animacy": "inan", "Aspect": "imp", "Polarity": "pos"}, - "ppas:sg:acc:m3:imperf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "acc", "Gender": "masc", "Animacy": "inan", "Aspect": "imp", "Polarity": "neg"}, - "ppas:sg:acc:m3:imperf.perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "acc", "Gender": "masc", "Animacy": "inan", "Aspect": "imp|perf", "Polarity": "pos"}, - "ppas:sg:acc:m3:imperf.perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "acc", "Gender": "masc", "Animacy": "inan", "Aspect": "imp|perf", "Polarity": "neg"}, - "ppas:sg:acc:m3:perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "acc", "Gender": "masc", "Animacy": "inan", "Aspect": "perf", "Polarity": "pos"}, - "ppas:sg:acc:m3:perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "acc", "Gender": "masc", "Animacy": "inan", "Aspect": "perf", "Polarity": "neg"}, - "ppas:sg:dat:m1.m2.m3.n1.n2:imperf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": 
"dat", "Gender": "masc|neut", "Aspect": "imp", "Polarity": "pos"}, - "ppas:sg:dat:m1.m2.m3.n1.n2:imperf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "dat", "Gender": "masc|neut", "Aspect": "imp", "Polarity": "neg"}, - "ppas:sg:dat:m1.m2.m3.n1.n2:imperf.perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "dat", "Gender": "masc|neut", "Aspect": "imp|perf", "Polarity": "pos"}, - "ppas:sg:dat:m1.m2.m3.n1.n2:imperf.perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "dat", "Gender": "masc|neut", "Aspect": "imp|perf", "Polarity": "neg"}, - "ppas:sg:dat:m1.m2.m3.n1.n2:perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "dat", "Gender": "masc|neut", "Aspect": "perf", "Polarity": "pos"}, - "ppas:sg:dat:m1.m2.m3.n1.n2:perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "dat", "Gender": "masc|neut", "Aspect": "perf", "Polarity": "neg"}, - "ppas:sg:gen.dat.loc:f:imperf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "gen|dat|loc", "Gender": "fem", "Aspect": "imp", "Polarity": "pos"}, - "ppas:sg:gen.dat.loc:f:imperf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "gen|dat|loc", "Gender": "fem", "Aspect": "imp", "Polarity": "neg"}, - "ppas:sg:gen.dat.loc:f:imperf.perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "gen|dat|loc", "Gender": "fem", "Aspect": "imp|perf", "Polarity": "pos"}, - "ppas:sg:gen.dat.loc:f:imperf.perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "gen|dat|loc", "Gender": "fem", "Aspect": "imp|perf", "Polarity": "neg"}, - "ppas:sg:gen.dat.loc:f:perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "gen|dat|loc", "Gender": "fem", "Aspect": "perf", "Polarity": "pos"}, - "ppas:sg:gen.dat.loc:f:perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "gen|dat|loc", "Gender": "fem", "Aspect": "perf", "Polarity": "neg"}, - "ppas:sg:gen:m1.m2.m3.n1.n2:imperf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "gen", "Gender": "masc|neut", "Aspect": "imp", "Polarity": "pos"}, - "ppas:sg:gen:m1.m2.m3.n1.n2:imperf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "gen", "Gender": "masc|neut", "Aspect": "imp", "Polarity": "neg"}, - "ppas:sg:gen:m1.m2.m3.n1.n2:imperf.perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "gen", "Gender": "masc|neut", "Aspect": "imp|perf", "Polarity": "pos"}, - "ppas:sg:gen:m1.m2.m3.n1.n2:imperf.perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "gen", "Gender": "masc|neut", "Aspect": "imp|perf", "Polarity": "neg"}, - "ppas:sg:gen:m1.m2.m3.n1.n2:perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "gen", "Gender": "masc|neut", "Aspect": "perf", "Polarity": "pos"}, - "ppas:sg:gen:m1.m2.m3.n1.n2:perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "gen", "Gender": "masc|neut", "Aspect": "perf", "Polarity": "neg"}, - "ppas:sg:inst.loc:m1.m2.m3.n1.n2:imperf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "ins|loc", "Gender": "masc|neut", "Aspect": "imp", "Polarity": "pos"}, - "ppas:sg:inst.loc:m1.m2.m3.n1.n2:imperf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": 
"sing", "Case": "ins|loc", "Gender": "masc|neut", "Aspect": "imp", "Polarity": "neg"}, - "ppas:sg:inst.loc:m1.m2.m3.n1.n2:imperf.perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "ins|loc", "Gender": "masc|neut", "Aspect": "imp|perf", "Polarity": "pos"}, - "ppas:sg:inst.loc:m1.m2.m3.n1.n2:imperf.perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "ins|loc", "Gender": "masc|neut", "Aspect": "imp|perf", "Polarity": "neg"}, - "ppas:sg:inst.loc:m1.m2.m3.n1.n2:perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "ins|loc", "Gender": "masc|neut", "Aspect": "perf", "Polarity": "pos"}, - "ppas:sg:inst.loc:m1.m2.m3.n1.n2:perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "ins|loc", "Gender": "masc|neut", "Aspect": "perf", "Polarity": "neg"}, - "ppas:sg:nom.acc.voc:n1.n2:imperf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "nom|acc|voc", "Gender": "neut", "Aspect": "imp", "Polarity": "pos"}, - "ppas:sg:nom.acc.voc:n1.n2:imperf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "nom|acc|voc", "Gender": "neut", "Aspect": "imp", "Polarity": "neg"}, - "ppas:sg:nom.acc.voc:n1.n2:imperf.perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "nom|acc|voc", "Gender": "neut", "Aspect": "imp|perf", "Polarity": "pos"}, - "ppas:sg:nom.acc.voc:n1.n2:imperf.perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "nom|acc|voc", "Gender": "neut", "Aspect": "imp|perf", "Polarity": "neg"}, - "ppas:sg:nom.acc.voc:n1.n2:perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "nom|acc|voc", "Gender": "neut", "Aspect": "perf", "Polarity": "pos"}, - "ppas:sg:nom.acc.voc:n1.n2:perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "nom|acc|voc", "Gender": "neut", "Aspect": "perf", "Polarity": "neg"}, - "ppas:sg:nom.voc:f:imperf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "nom|voc", "Gender": "fem", "Aspect": "imp", "Polarity": "pos"}, - "ppas:sg:nom.voc:f:imperf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "nom|voc", "Gender": "fem", "Aspect": "imp", "Polarity": "neg"}, - "ppas:sg:nom.voc:f:imperf.perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "nom|voc", "Gender": "fem", "Aspect": "imp|perf", "Polarity": "pos"}, - "ppas:sg:nom.voc:f:imperf.perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "nom|voc", "Gender": "fem", "Aspect": "imp|perf", "Polarity": "neg"}, - "ppas:sg:nom.voc:f:perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "nom|voc", "Gender": "fem", "Aspect": "perf", "Polarity": "pos"}, - "ppas:sg:nom.voc:f:perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "nom|voc", "Gender": "fem", "Aspect": "perf", "Polarity": "neg"}, - "ppas:sg:nom.voc:m1.m2.m3:imperf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "nom|voc", "Gender": "Masc", "Aspect": "imp", "Polarity": "pos"}, - "ppas:sg:nom.voc:m1.m2.m3:imperf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "nom|voc", "Gender": "Masc", "Aspect": "imp", "Polarity": "neg"}, - "ppas:sg:nom.voc:m1.m2.m3:imperf.perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": 
"sing", "Case": "nom|voc", "Gender": "Masc", "Aspect": "imp|perf", "Polarity": "pos"}, - "ppas:sg:nom.voc:m1.m2.m3:imperf.perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "nom|voc", "Gender": "Masc", "Aspect": "imp|perf", "Polarity": "neg"}, - "ppas:sg:nom.voc:m1.m2.m3:perf:aff": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "nom|voc", "Gender": "Masc", "Aspect": "perf", "Polarity": "pos"}, - "ppas:sg:nom.voc:m1.m2.m3:perf:neg": {POS: VERB, "VerbForm": "part", "Voice": "pass", "Number": "sing", "Case": "nom|voc", "Gender": "Masc", "Aspect": "perf", "Polarity": "neg"}, - "ppron12:pl:acc:_:pri": {POS: PRON, "PronType": "prs", "Number": "plur", "Case": "acc", "Person": "one"}, - "ppron12:pl:acc:_:sec": {POS: PRON, "PronType": "prs", "Number": "plur", "Case": "acc", "Person": "two"}, - "ppron12:pl:dat:_:pri": {POS: PRON, "PronType": "prs", "Number": "plur", "Case": "dat", "Person": "one"}, - "ppron12:pl:dat:_:sec": {POS: PRON, "PronType": "prs", "Number": "plur", "Case": "dat", "Person": "two"}, - "ppron12:pl:gen:_:pri": {POS: PRON, "PronType": "prs", "Number": "plur", "Case": "gen", "Person": "one"}, - "ppron12:pl:gen:_:sec": {POS: PRON, "PronType": "prs", "Number": "plur", "Case": "gen", "Person": "two"}, - "ppron12:pl:inst:_:pri": {POS: PRON, "PronType": "prs", "Number": "plur", "Case": "ins", "Person": "one"}, - "ppron12:pl:inst:_:sec": {POS: PRON, "PronType": "prs", "Number": "plur", "Case": "ins", "Person": "two"}, - "ppron12:pl:loc:_:pri": {POS: PRON, "PronType": "prs", "Number": "plur", "Case": "loc", "Person": "one"}, - "ppron12:pl:loc:_:sec": {POS: PRON, "PronType": "prs", "Number": "plur", "Case": "loc", "Person": "two"}, - "ppron12:pl:nom:_:pri": {POS: PRON, "PronType": "prs", "Number": "plur", "Case": "nom", "Person": "one"}, - "ppron12:pl:nom:_:sec": {POS: PRON, "PronType": "prs", "Number": "plur", "Case": "nom", "Person": "two"}, - "ppron12:pl:voc:_:pri": {POS: PRON, "PronType": "prs", "Number": "plur", "Case": "voc", "Person": "one"}, - "ppron12:pl:voc:_:sec": {POS: PRON, "PronType": "prs", "Number": "plur", "Case": "voc", "Person": "two"}, - "ppron12:sg:acc:m1.m2.m3.f.n1.n2:pri:akc": {POS: PRON, "PronType": "prs", "Number": "sing", "Case": "acc", "Gender": "masc|fem|neut", "Person": "one", }, - "ppron12:sg:acc:m1.m2.m3.f.n1.n2:pri:nakc": {POS: PRON, "PronType": "prs", "Number": "sing", "Case": "acc", "Gender": "masc|fem|neut", "Person": "one", }, - "ppron12:sg:acc:m1.m2.m3.f.n1.n2:sec:akc": {POS: PRON, "PronType": "prs", "Number": "sing", "Case": "acc", "Gender": "masc|fem|neut", "Person": "two", }, - "ppron12:sg:acc:m1.m2.m3.f.n1.n2:sec:nakc": {POS: PRON, "PronType": "prs", "Number": "sing", "Case": "acc", "Gender": "masc|fem|neut", "Person": "two", }, - "ppron12:sg:dat:m1.m2.m3.f.n1.n2:pri:akc": {POS: PRON, "PronType": "prs", "Number": "sing", "Case": "dat", "Gender": "masc|fem|neut", "Person": "one", }, - "ppron12:sg:dat:m1.m2.m3.f.n1.n2:pri:nakc": {POS: PRON, "PronType": "prs", "Number": "sing", "Case": "dat", "Gender": "masc|fem|neut", "Person": "one", }, - "ppron12:sg:dat:m1.m2.m3.f.n1.n2:sec:akc": {POS: PRON, "PronType": "prs", "Number": "sing", "Case": "dat", "Gender": "masc|fem|neut", "Person": "two", }, - "ppron12:sg:dat:m1.m2.m3.f.n1.n2:sec:nakc": {POS: PRON, "PronType": "prs", "Number": "sing", "Case": "dat", "Gender": "masc|fem|neut", "Person": "two", }, - "ppron12:sg:gen:m1.m2.m3.f.n1.n2:pri:akc": {POS: PRON, "PronType": "prs", "Number": "sing", "Case": "gen", "Gender": "masc|fem|neut", "Person": 
"one", }, - "ppron12:sg:gen:m1.m2.m3.f.n1.n2:pri:nakc": {POS: PRON, "PronType": "prs", "Number": "sing", "Case": "gen", "Gender": "masc|fem|neut", "Person": "one", }, - "ppron12:sg:gen:m1.m2.m3.f.n1.n2:sec:akc": {POS: PRON, "PronType": "prs", "Number": "sing", "Case": "gen", "Gender": "masc|fem|neut", "Person": "two", }, - "ppron12:sg:gen:m1.m2.m3.f.n1.n2:sec:nakc": {POS: PRON, "PronType": "prs", "Number": "sing", "Case": "gen", "Gender": "masc|fem|neut", "Person": "two", }, - "ppron12:sg:inst:m1.m2.m3.f.n1.n2:pri": {POS: PRON, "PronType": "prs", "Number": "sing", "Case": "ins", "Gender": "masc|fem|neut", "Person": "one"}, - "ppron12:sg:inst:m1.m2.m3.f.n1.n2:sec": {POS: PRON, "PronType": "prs", "Number": "sing", "Case": "ins", "Gender": "masc|fem|neut", "Person": "two"}, - "ppron12:sg:loc:m1.m2.m3.f.n1.n2:pri": {POS: PRON, "PronType": "prs", "Number": "sing", "Case": "loc", "Gender": "masc|fem|neut", "Person": "one"}, - "ppron12:sg:loc:m1.m2.m3.f.n1.n2:sec": {POS: PRON, "PronType": "prs", "Number": "sing", "Case": "loc", "Gender": "masc|fem|neut", "Person": "two"}, - "ppron12:sg:nom:m1.m2.m3.f.n1.n2:pri": {POS: PRON, "PronType": "prs", "Number": "sing", "Case": "nom", "Gender": "masc|fem|neut", "Person": "one"}, - "ppron12:sg:nom:m1.m2.m3.f.n1.n2:sec": {POS: PRON, "PronType": "prs", "Number": "sing", "Case": "nom", "Gender": "masc|fem|neut", "Person": "two"}, - "ppron12:sg:voc:m1.m2.m3.f.n1.n2:pri": {POS: PRON, "PronType": "prs", "Number": "sing", "Case": "voc", "Gender": "masc|fem|neut", "Person": "one"}, - "ppron12:sg:voc:m1.m2.m3.f.n1.n2:sec": {POS: PRON, "PronType": "prs", "Number": "sing", "Case": "voc", "Gender": "masc|fem|neut", "Person": "two"}, - "ppron3:pl:acc:m1.p1:ter:_:npraep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "plur", "Case": "acc", "Gender": "masc", "Person": "three", "PrepCase": "npr"}, - "ppron3:pl:acc:m1.p1:ter:_:praep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "plur", "Case": "acc", "Gender": "masc", "Person": "three", "PrepCase": "pre"}, - "ppron3:pl:acc:m2.m3.f.n1.n2.p2.p3:ter:_:npraep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "plur", "Case": "acc", "Gender": "masc|fem|neut", "Person": "three", "PrepCase": "npr"}, - "ppron3:pl:acc:m2.m3.f.n1.n2.p2.p3:ter:_:praep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "plur", "Case": "acc", "Gender": "masc|fem|neut", "Person": "three", "PrepCase": "pre"}, - "ppron3:pl:dat:_:ter:_:npraep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "plur", "Case": "dat", "Person": "three", "PrepCase": "npr"}, - "ppron3:pl:dat:_:ter:_:praep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "plur", "Case": "dat", "Person": "three", "PrepCase": "pre"}, - "ppron3:pl:gen:_:ter:_:npraep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "plur", "Case": "gen", "Person": "three", "PrepCase": "npr"}, - "ppron3:pl:gen:_:ter:_:praep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "plur", "Case": "gen", "Person": "three", "PrepCase": "pre"}, - "ppron3:pl:inst:_:ter:_:_": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "plur", "Case": "ins", "Person": "three"}, - "ppron3:pl:loc:_:ter:_:_": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "plur", "Case": "loc", "Person": "three"}, - "ppron3:pl:nom:m1.p1:ter:_:_": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "plur", "Case": "nom", "Gender": "masc", "Person": "three"}, - "ppron3:pl:nom:m2.m3.f.n1.n2.p2.p3:ter:_:_": {POS: PRON, "PronType": 
"prs", "Person": "three", "Number": "plur", "Case": "nom", "Gender": "masc|fem|neut", "Person": "three"}, - "ppron3:sg:acc:f:ter:_:npraep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "acc", "Gender": "fem", "Person": "three", "PrepCase": "npr"}, - "ppron3:sg:acc:f:ter:_:praep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "acc", "Gender": "fem", "Person": "three", "PrepCase": "pre"}, - "ppron3:sg:acc:m1.m2.m3:ter:akc:npraep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "acc", "Gender": "Masc", "Person": "three", "PrepCase": "npr"}, - "ppron3:sg:acc:m1.m2.m3:ter:akc:praep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "acc", "Gender": "Masc", "Person": "three", "PrepCase": "pre"}, - "ppron3:sg:acc:m1.m2.m3:ter:nakc:npraep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "acc", "Gender": "Masc", "Person": "three", "PrepCase": "npr"}, - "ppron3:sg:acc:m1.m2.m3:ter:nakc:praep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "acc", "Gender": "Masc", "Person": "three", "PrepCase": "pre"}, - "ppron3:sg:acc:n1.n2:ter:_:npraep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "acc", "Gender": "neut", "Person": "three", "PrepCase": "npr"}, - "ppron3:sg:acc:n1.n2:ter:_:praep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "acc", "Gender": "neut", "Person": "three", "PrepCase": "pre"}, - "ppron3:sg:dat:f:ter:_:npraep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "dat", "Gender": "fem", "Person": "three", "PrepCase": "npr"}, - "ppron3:sg:dat:f:ter:_:praep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "dat", "Gender": "fem", "Person": "three", "PrepCase": "pre"}, - "ppron3:sg:dat:m1.m2.m3:ter:akc:npraep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "dat", "Gender": "Masc", "Person": "three", "PrepCase": "npr"}, - "ppron3:sg:dat:m1.m2.m3:ter:nakc:npraep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "dat", "Gender": "Masc", "Person": "three", "PrepCase": "npr"}, - "ppron3:sg:dat:m1.m2.m3:ter:_:praep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "dat", "Gender": "Masc", "Person": "three", "PrepCase": "pre"}, - "ppron3:sg:dat:n1.n2:ter:akc:npraep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "dat", "Gender": "neut", "Person": "three", "PrepCase": "npr"}, - "ppron3:sg:dat:n1.n2:ter:nakc:npraep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "dat", "Gender": "neut", "Person": "three", "PrepCase": "npr"}, - "ppron3:sg:dat:n1.n2:ter:_:praep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "dat", "Gender": "neut", "Person": "three", "PrepCase": "pre"}, - "ppron3:sg:gen:f:ter:_:npraep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "gen", "Gender": "fem", "Person": "three", "PrepCase": "npr"}, - "ppron3:sg:gen:f:ter:_:praep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "gen", "Gender": "fem", "Person": "three", "PrepCase": "pre"}, - "ppron3:sg:gen:m1.m2.m3:ter:akc:npraep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "gen", "Gender": "Masc", "Person": "three", "PrepCase": "npr"}, - "ppron3:sg:gen:m1.m2.m3:ter:akc:praep": {POS: PRON, 
"PronType": "prs", "Person": "three", "Number": "sing", "Case": "gen", "Gender": "Masc", "Person": "three", "PrepCase": "pre"}, - "ppron3:sg:gen:m1.m2.m3:ter:nakc:npraep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "gen", "Gender": "Masc", "Person": "three", "PrepCase": "npr"}, - "ppron3:sg:gen:m1.m2.m3:ter:nakc:praep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "gen", "Gender": "Masc", "Person": "three", "PrepCase": "pre"}, - "ppron3:sg:gen:n1.n2:ter:akc:npraep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "gen", "Gender": "neut", "Person": "three", "PrepCase": "npr"}, - "ppron3:sg:gen:n1.n2:ter:nakc:npraep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "gen", "Gender": "neut", "Person": "three", "PrepCase": "npr"}, - "ppron3:sg:gen:n1.n2:ter:_:praep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "gen", "Gender": "neut", "Person": "three", "PrepCase": "pre"}, - "ppron3:sg:inst:f:ter:_:praep": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "ins", "Gender": "fem", "Person": "three", "PrepCase": "pre"}, - "ppron3:sg:inst:m1.m2.m3:ter:_:_": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "ins", "Gender": "Masc", "Person": "three"}, - "ppron3:sg:inst:n1.n2:ter:_:_": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "ins", "Gender": "neut", "Person": "three"}, - "ppron3:sg:loc:f:ter:_:_": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "loc", "Gender": "fem", "Person": "three"}, - "ppron3:sg:loc:m1.m2.m3:ter:_:_": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "loc", "Gender": "Masc", "Person": "three"}, - "ppron3:sg:loc:n1.n2:ter:_:_": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "loc", "Gender": "neut", "Person": "three"}, - "ppron3:sg:nom:f:ter:_:_": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "nom", "Gender": "fem", "Person": "three"}, - "ppron3:sg:nom:m1.m2.m3:ter:_:_": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "nom", "Gender": "Masc", "Person": "three"}, - "ppron3:sg:nom:n1.n2:ter:_:_": {POS: PRON, "PronType": "prs", "Person": "three", "Number": "sing", "Case": "nom", "Gender": "neut", "Person": "three"}, - "praet:pl:m1.p1:imperf": {POS: VERB, "VerbForm": "fin", "Tense": "past", "Number": "plur", "Gender": "masc", "Aspect": "imp"}, - "praet:pl:m1.p1:imperf.perf": {POS: VERB, "VerbForm": "fin", "Tense": "past", "Number": "plur", "Gender": "masc", "Aspect": "imp|perf"}, - "praet:pl:m1.p1:perf": {POS: VERB, "VerbForm": "fin", "Tense": "past", "Number": "plur", "Gender": "masc", "Aspect": "perf"}, - "praet:pl:m2.m3.f.n1.n2.p2.p3:imperf": {POS: VERB, "VerbForm": "fin", "Tense": "past", "Number": "plur", "Gender": "masc|fem|neut", "Aspect": "imp"}, - "praet:pl:m2.m3.f.n1.n2.p2.p3:imperf.perf": {POS: VERB, "VerbForm": "fin", "Tense": "past", "Number": "plur", "Gender": "masc|fem|neut", "Aspect": "imp|perf"}, - "praet:pl:m2.m3.f.n1.n2.p2.p3:perf": {POS: VERB, "VerbForm": "fin", "Tense": "past", "Number": "plur", "Gender": "masc|fem|neut", "Aspect": "perf"}, - "praet:sg:f:imperf": {POS: VERB, "VerbForm": "fin", "Tense": "past", "Number": "sing", "Gender": "fem", "Aspect": "imp"}, - "praet:sg:f:imperf.perf": {POS: VERB, "VerbForm": "fin", "Tense": "past", "Number": "sing", "Gender": "fem", "Aspect": 
"imp|perf"}, - "praet:sg:f:perf": {POS: VERB, "VerbForm": "fin", "Tense": "past", "Number": "sing", "Gender": "fem", "Aspect": "perf"}, - "praet:sg:m1.m2.m3:imperf": {POS: VERB, "VerbForm": "fin", "Tense": "past", "Number": "sing", "Gender": "Masc", "Aspect": "imp"}, - "praet:sg:m1.m2.m3:imperf:agl": {POS: VERB, "VerbForm": "fin", "Tense": "past", "Number": "sing", "Gender": "Masc", "Aspect": "imp"}, - "praet:sg:m1.m2.m3:imperf:nagl": {POS: VERB, "VerbForm": "fin", "Tense": "past", "Number": "sing", "Gender": "Masc", "Aspect": "imp"}, - "praet:sg:m1.m2.m3:imperf.perf": {POS: VERB, "VerbForm": "fin", "Tense": "past", "Number": "sing", "Gender": "Masc", "Aspect": "imp|perf"}, - "praet:sg:m1.m2.m3:perf": {POS: VERB, "VerbForm": "fin", "Tense": "past", "Number": "sing", "Gender": "Masc", "Aspect": "perf"}, - "praet:sg:m1.m2.m3:perf:agl": {POS: VERB, "VerbForm": "fin", "Tense": "past", "Number": "sing", "Gender": "Masc", "Aspect": "perf"}, - "praet:sg:m1.m2.m3:perf:nagl": {POS: VERB, "VerbForm": "fin", "Tense": "past", "Number": "sing", "Gender": "Masc", "Aspect": "perf"}, - "praet:sg:n1.n2:imperf": {POS: VERB, "VerbForm": "fin", "Tense": "past", "Number": "sing", "Gender": "neut", "Aspect": "imp"}, - "praet:sg:n1.n2:imperf.perf": {POS: VERB, "VerbForm": "fin", "Tense": "past", "Number": "sing", "Gender": "neut", "Aspect": "imp|perf"}, - "praet:sg:n1.n2:perf": {POS: VERB, "VerbForm": "fin", "Tense": "past", "Number": "sing", "Gender": "neut", "Aspect": "perf"}, - "pred": {POS: VERB}, - "prep:acc": {POS: ADP, "AdpType": "prep", "Case": "acc"}, - "prep:acc:nwok": {POS: ADP, "AdpType": "prep", "Case": "acc", }, - "prep:acc:wok": {POS: ADP, "AdpType": "prep", "Case": "acc", }, - "prep:dat": {POS: ADP, "AdpType": "prep", "Case": "dat"}, - "prep:gen": {POS: ADP, "AdpType": "prep", "Case": "gen"}, - "prep:gen:nwok": {POS: ADP, "AdpType": "prep", "Case": "gen", }, - "prep:gen:wok": {POS: ADP, "AdpType": "prep", "Case": "gen", }, - "prep:inst": {POS: ADP, "AdpType": "prep", "Case": "ins"}, - "prep:inst:nwok": {POS: ADP, "AdpType": "prep", "Case": "ins", }, - "prep:inst:wok": {POS: ADP, "AdpType": "prep", "Case": "ins", }, - "prep:loc": {POS: ADP, "AdpType": "prep", "Case": "loc"}, - "prep:loc:nwok": {POS: ADP, "AdpType": "prep", "Case": "loc", }, - "prep:loc:wok": {POS: ADP, "AdpType": "prep", "Case": "loc", }, - "prep:nom": {POS: ADP, "AdpType": "prep", "Case": "nom"}, - "qub": {POS: PART}, - "subst:pl:acc:f": {POS: NOUN, "Number": "plur", "Case": "acc", "Gender": "fem"}, - "subst:pl:acc:m1": {POS: NOUN, "Number": "plur", "Case": "acc", "Gender": "Masc", "Animacy": "hum"}, - "subst:pl:acc:m2": {POS: NOUN, "Number": "plur", "Case": "acc", "Gender": "masc", "Animacy": "anim"}, - "subst:pl:acc:m3": {POS: NOUN, "Number": "plur", "Case": "acc", "Gender": "masc", "Animacy": "inan"}, - "subst:pl:acc:n1": {POS: NOUN, "Number": "plur", "Case": "acc", "Gender": "neut"}, - "subst:pl:acc:n2": {POS: NOUN, "Number": "plur", "Case": "acc", "Gender": "neut"}, - "subst:pl:acc:p1": {POS: NOUN, "Number": "plur", "Case": "acc", "Person": "one"}, - "subst:pl:acc:p2": {POS: NOUN, "Number": "plur", "Case": "acc", "Person": "two"}, - "subst:pl:acc:p3": {POS: NOUN, "Number": "plur", "Case": "acc", "Person": "three"}, - "subst:pl:dat:f": {POS: NOUN, "Number": "plur", "Case": "dat", "Gender": "fem"}, - "subst:pl:dat:m1": {POS: NOUN, "Number": "plur", "Case": "dat", "Gender": "Masc", "Animacy": "hum"}, - "subst:pl:dat:m2": {POS: NOUN, "Number": "plur", "Case": "dat", "Gender": "masc", "Animacy": "anim"}, - "subst:pl:dat:m3": 
{POS: NOUN, "Number": "plur", "Case": "dat", "Gender": "masc", "Animacy": "inan"}, - "subst:pl:dat:n1": {POS: NOUN, "Number": "plur", "Case": "dat", "Gender": "neut"}, - "subst:pl:dat:n2": {POS: NOUN, "Number": "plur", "Case": "dat", "Gender": "neut"}, - "subst:pl:dat:p1": {POS: NOUN, "Number": "plur", "Case": "dat", "Person": "one"}, - "subst:pl:dat:p2": {POS: NOUN, "Number": "plur", "Case": "dat", "Person": "two"}, - "subst:pl:dat:p3": {POS: NOUN, "Number": "plur", "Case": "dat", "Person": "three"}, - "subst:pl:gen:f": {POS: NOUN, "Number": "plur", "Case": "gen", "Gender": "fem"}, - "subst:pl:gen:m1": {POS: NOUN, "Number": "plur", "Case": "gen", "Gender": "Masc", "Animacy": "hum"}, - "subst:pl:gen:m2": {POS: NOUN, "Number": "plur", "Case": "gen", "Gender": "masc", "Animacy": "anim"}, - "subst:pl:gen:m3": {POS: NOUN, "Number": "plur", "Case": "gen", "Gender": "masc", "Animacy": "inan"}, - "subst:pl:gen:n1": {POS: NOUN, "Number": "plur", "Case": "gen", "Gender": "neut"}, - "subst:pl:gen:n2": {POS: NOUN, "Number": "plur", "Case": "gen", "Gender": "neut"}, - "subst:pl:gen:p1": {POS: NOUN, "Number": "plur", "Case": "gen", "Person": "one"}, - "subst:pl:gen:p2": {POS: NOUN, "Number": "plur", "Case": "gen", "Person": "two"}, - "subst:pl:gen:p3": {POS: NOUN, "Number": "plur", "Case": "gen", "Person": "three"}, - "subst:pl:inst:f": {POS: NOUN, "Number": "plur", "Case": "ins", "Gender": "fem"}, - "subst:pl:inst:m1": {POS: NOUN, "Number": "plur", "Case": "ins", "Gender": "Masc", "Animacy": "hum"}, - "subst:pl:inst:m2": {POS: NOUN, "Number": "plur", "Case": "ins", "Gender": "masc", "Animacy": "anim"}, - "subst:pl:inst:m3": {POS: NOUN, "Number": "plur", "Case": "ins", "Gender": "masc", "Animacy": "inan"}, - "subst:pl:inst:n1": {POS: NOUN, "Number": "plur", "Case": "ins", "Gender": "neut"}, - "subst:pl:inst:n2": {POS: NOUN, "Number": "plur", "Case": "ins", "Gender": "neut"}, - "subst:pl:inst:p1": {POS: NOUN, "Number": "plur", "Case": "ins", "Person": "one"}, - "subst:pl:inst:p2": {POS: NOUN, "Number": "plur", "Case": "ins", "Person": "two"}, - "subst:pl:inst:p3": {POS: NOUN, "Number": "plur", "Case": "ins", "Person": "three"}, - "subst:pl:loc:f": {POS: NOUN, "Number": "plur", "Case": "loc", "Gender": "fem"}, - "subst:pl:loc:m1": {POS: NOUN, "Number": "plur", "Case": "loc", "Gender": "Masc", "Animacy": "hum"}, - "subst:pl:loc:m2": {POS: NOUN, "Number": "plur", "Case": "loc", "Gender": "masc", "Animacy": "anim"}, - "subst:pl:loc:m3": {POS: NOUN, "Number": "plur", "Case": "loc", "Gender": "masc", "Animacy": "inan"}, - "subst:pl:loc:n1": {POS: NOUN, "Number": "plur", "Case": "loc", "Gender": "neut"}, - "subst:pl:loc:n2": {POS: NOUN, "Number": "plur", "Case": "loc", "Gender": "neut"}, - "subst:pl:loc:p1": {POS: NOUN, "Number": "plur", "Case": "loc", "Person": "one"}, - "subst:pl:loc:p2": {POS: NOUN, "Number": "plur", "Case": "loc", "Person": "two"}, - "subst:pl:loc:p3": {POS: NOUN, "Number": "plur", "Case": "loc", "Person": "three"}, - "subst:pl:nom:f": {POS: NOUN, "Number": "plur", "Case": "nom", "Gender": "fem"}, - "subst:pl:nom:m1": {POS: NOUN, "Number": "plur", "Case": "nom", "Gender": "Masc", "Animacy": "hum"}, - "subst:pl:nom:m2": {POS: NOUN, "Number": "plur", "Case": "nom", "Gender": "masc", "Animacy": "anim"}, - "subst:pl:nom:m3": {POS: NOUN, "Number": "plur", "Case": "nom", "Gender": "masc", "Animacy": "inan"}, - "subst:pl:nom:n1": {POS: NOUN, "Number": "plur", "Case": "nom", "Gender": "neut"}, - "subst:pl:nom:n2": {POS: NOUN, "Number": "plur", "Case": "nom", "Gender": "neut"}, - "subst:pl:nom:p1": 
{POS: NOUN, "Number": "plur", "Case": "nom", "Person": "one"}, - "subst:pl:nom:p2": {POS: NOUN, "Number": "plur", "Case": "nom", "Person": "two"}, - "subst:pl:nom:p3": {POS: NOUN, "Number": "plur", "Case": "nom", "Person": "three"}, - "subst:pl:voc:f": {POS: NOUN, "Number": "plur", "Case": "voc", "Gender": "fem"}, - "subst:pl:voc:m1": {POS: NOUN, "Number": "plur", "Case": "voc", "Gender": "Masc", "Animacy": "hum"}, - "subst:pl:voc:m2": {POS: NOUN, "Number": "plur", "Case": "voc", "Gender": "masc", "Animacy": "anim"}, - "subst:pl:voc:m3": {POS: NOUN, "Number": "plur", "Case": "voc", "Gender": "masc", "Animacy": "inan"}, - "subst:pl:voc:n1": {POS: NOUN, "Number": "plur", "Case": "voc", "Gender": "neut"}, - "subst:pl:voc:n2": {POS: NOUN, "Number": "plur", "Case": "voc", "Gender": "neut"}, - "subst:pl:voc:p1": {POS: NOUN, "Number": "plur", "Case": "voc", "Person": "one"}, - "subst:pl:voc:p2": {POS: NOUN, "Number": "plur", "Case": "voc", "Person": "two"}, - "subst:pl:voc:p3": {POS: NOUN, "Number": "plur", "Case": "voc", "Person": "three"}, - "subst:sg:acc:f": {POS: NOUN, "Number": "sing", "Case": "acc", "Gender": "fem"}, - "subst:sg:acc:m1": {POS: NOUN, "Number": "sing", "Case": "acc", "Gender": "Masc", "Animacy": "hum"}, - "subst:sg:acc:m2": {POS: NOUN, "Number": "sing", "Case": "acc", "Gender": "masc", "Animacy": "anim"}, - "subst:sg:acc:m3": {POS: NOUN, "Number": "sing", "Case": "acc", "Gender": "masc", "Animacy": "inan"}, - "subst:sg:acc:n1": {POS: NOUN, "Number": "sing", "Case": "acc", "Gender": "neut"}, - "subst:sg:acc:n2": {POS: NOUN, "Number": "sing", "Case": "acc", "Gender": "neut"}, - "subst:sg:dat:f": {POS: NOUN, "Number": "sing", "Case": "dat", "Gender": "fem"}, - "subst:sg:dat:m1": {POS: NOUN, "Number": "sing", "Case": "dat", "Gender": "Masc", "Animacy": "hum"}, - "subst:sg:dat:m2": {POS: NOUN, "Number": "sing", "Case": "dat", "Gender": "masc", "Animacy": "anim"}, - "subst:sg:dat:m3": {POS: NOUN, "Number": "sing", "Case": "dat", "Gender": "masc", "Animacy": "inan"}, - "subst:sg:dat:n1": {POS: NOUN, "Number": "sing", "Case": "dat", "Gender": "neut"}, - "subst:sg:dat:n2": {POS: NOUN, "Number": "sing", "Case": "dat", "Gender": "neut"}, - "subst:sg:gen:f": {POS: NOUN, "Number": "sing", "Case": "gen", "Gender": "fem"}, - "subst:sg:gen:m1": {POS: NOUN, "Number": "sing", "Case": "gen", "Gender": "Masc", "Animacy": "hum"}, - "subst:sg:gen:m2": {POS: NOUN, "Number": "sing", "Case": "gen", "Gender": "masc", "Animacy": "anim"}, - "subst:sg:gen:m3": {POS: NOUN, "Number": "sing", "Case": "gen", "Gender": "masc", "Animacy": "inan"}, - "subst:sg:gen:n1": {POS: NOUN, "Number": "sing", "Case": "gen", "Gender": "neut"}, - "subst:sg:gen:n2": {POS: NOUN, "Number": "sing", "Case": "gen", "Gender": "neut"}, - "subst:sg:inst:f": {POS: NOUN, "Number": "sing", "Case": "ins", "Gender": "fem"}, - "subst:sg:inst:m1": {POS: NOUN, "Number": "sing", "Case": "ins", "Gender": "Masc", "Animacy": "hum"}, - "subst:sg:inst:m2": {POS: NOUN, "Number": "sing", "Case": "ins", "Gender": "masc", "Animacy": "anim"}, - "subst:sg:inst:m3": {POS: NOUN, "Number": "sing", "Case": "ins", "Gender": "masc", "Animacy": "inan"}, - "subst:sg:inst:n1": {POS: NOUN, "Number": "sing", "Case": "ins", "Gender": "neut"}, - "subst:sg:inst:n2": {POS: NOUN, "Number": "sing", "Case": "ins", "Gender": "neut"}, - "subst:sg:loc:f": {POS: NOUN, "Number": "sing", "Case": "loc", "Gender": "fem"}, - "subst:sg:loc:m1": {POS: NOUN, "Number": "sing", "Case": "loc", "Gender": "Masc", "Animacy": "hum"}, - "subst:sg:loc:m2": {POS: NOUN, "Number": "sing", 
"Case": "loc", "Gender": "masc", "Animacy": "anim"}, - "subst:sg:loc:m3": {POS: NOUN, "Number": "sing", "Case": "loc", "Gender": "masc", "Animacy": "inan"}, - "subst:sg:loc:n1": {POS: NOUN, "Number": "sing", "Case": "loc", "Gender": "neut"}, - "subst:sg:loc:n2": {POS: NOUN, "Number": "sing", "Case": "loc", "Gender": "neut"}, - "subst:sg:nom:f": {POS: NOUN, "Number": "sing", "Case": "nom", "Gender": "fem"}, - "subst:sg:nom:m1": {POS: NOUN, "Number": "sing", "Case": "nom", "Gender": "Masc", "Animacy": "hum"}, - "subst:sg:nom:m2": {POS: NOUN, "Number": "sing", "Case": "nom", "Gender": "masc", "Animacy": "anim"}, - "subst:sg:nom:m3": {POS: NOUN, "Number": "sing", "Case": "nom", "Gender": "masc", "Animacy": "inan"}, - "subst:sg:nom:n1": {POS: NOUN, "Number": "sing", "Case": "nom", "Gender": "neut"}, - "subst:sg:nom:n2": {POS: NOUN, "Number": "sing", "Case": "nom", "Gender": "neut"}, - "subst:sg:voc:f": {POS: NOUN, "Number": "sing", "Case": "voc", "Gender": "fem"}, - "subst:sg:voc:m1": {POS: NOUN, "Number": "sing", "Case": "voc", "Gender": "Masc", "Animacy": "hum"}, - "subst:sg:voc:m2": {POS: NOUN, "Number": "sing", "Case": "voc", "Gender": "masc", "Animacy": "anim"}, - "subst:sg:voc:m3": {POS: NOUN, "Number": "sing", "Case": "voc", "Gender": "masc", "Animacy": "inan"}, - "subst:sg:voc:n1": {POS: NOUN, "Number": "sing", "Case": "voc", "Gender": "neut"}, - "subst:sg:voc:n2": {POS: NOUN, "Number": "sing", "Case": "voc", "Gender": "neut"}, - "winien:pl:m1.p1:imperf": {POS: ADJ, "Number": "plur", "Gender": "masc", "Aspect": "imp"}, - "winien:pl:m2.m3.f.n1.n2.p2.p3:imperf": {POS: ADJ, "Number": "plur", "Gender": "masc|fem|neut", "Aspect": "imp"}, - "winien:sg:f:imperf": {POS: ADJ, "Number": "sing", "Gender": "fem", "Aspect": "imp"}, - "winien:sg:m1.m2.m3:imperf": {POS: ADJ, "Number": "sing", "Gender": "Masc", "Aspect": "imp"}, - "winien:sg:n1.n2:imperf": {POS: ADJ, "Number": "sing", "Gender": "neut", "Aspect": "imp"}, - # UD - "ADJ__Animacy=Hum|Aspect=Imp|Case=Acc|Gender=Masc|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Animacy=Hum|Aspect=Imp|Case=Acc|Gender=Masc|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Animacy=Hum|Aspect=Imp|Case=Acc|Gender=Masc|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Animacy=Hum|Aspect=Imp|Case=Acc|Gender=Masc|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Animacy=Hum|Aspect=Imp|Case=Dat|Gender=Masc|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Animacy=Hum|Aspect=Imp|Case=Dat|Gender=Masc|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Animacy=Hum|Aspect=Imp|Case=Dat|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Hum|Aspect=Imp|Case=Dat|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Hum|Aspect=Imp|Case=Gen|Gender=Masc|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Animacy=Hum|Aspect=Imp|Case=Gen|Gender=Masc|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Animacy=Hum|Aspect=Imp|Case=Gen|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Hum|Aspect=Imp|Case=Gen|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Hum|Aspect=Imp|Case=Gen|Gender=Masc|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": 
"Animacy=Hum|Aspect=Imp|Case=Gen|Gender=Masc|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Animacy=Hum|Aspect=Imp|Case=Gen|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Hum|Aspect=Imp|Case=Gen|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Hum|Aspect=Imp|Case=Ins|Gender=Masc|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Animacy=Hum|Aspect=Imp|Case=Ins|Gender=Masc|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Animacy=Hum|Aspect=Imp|Case=Ins|Gender=Masc|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Animacy=Hum|Aspect=Imp|Case=Ins|Gender=Masc|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Animacy=Hum|Aspect=Imp|Case=Ins|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Hum|Aspect=Imp|Case=Ins|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Hum|Aspect=Imp|Case=Loc|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Hum|Aspect=Imp|Case=Loc|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Hum|Aspect=Imp|Case=Nom|Gender=Masc|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Animacy=Hum|Aspect=Imp|Case=Nom|Gender=Masc|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Animacy=Hum|Aspect=Imp|Case=Nom|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Hum|Aspect=Imp|Case=Nom|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Hum|Aspect=Imp|Case=Nom|Gender=Masc|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Animacy=Hum|Aspect=Imp|Case=Nom|Gender=Masc|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Animacy=Hum|Aspect=Imp|Case=Nom|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Hum|Aspect=Imp|Case=Nom|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Hum|Aspect=Imp|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Hum|Aspect=Imp|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Hum|Aspect=Imp|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Hum|Aspect=Imp|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Hum|Aspect=Perf|Case=Acc|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Hum|Aspect=Perf|Case=Acc|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Hum|Aspect=Perf|Case=Acc|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Hum|Aspect=Perf|Case=Acc|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Hum|Aspect=Perf|Case=Dat|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Hum|Aspect=Perf|Case=Dat|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Hum|Aspect=Perf|Case=Gen|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Hum|Aspect=Perf|Case=Gen|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Hum|Aspect=Perf|Case=Gen|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": 
"Animacy=Hum|Aspect=Perf|Case=Gen|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Hum|Aspect=Perf|Case=Ins|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Hum|Aspect=Perf|Case=Ins|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Hum|Aspect=Perf|Case=Ins|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Hum|Aspect=Perf|Case=Ins|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Hum|Aspect=Perf|Case=Loc|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Hum|Aspect=Perf|Case=Loc|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Hum|Aspect=Perf|Case=Nom|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Hum|Aspect=Perf|Case=Nom|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Hum|Aspect=Perf|Case=Nom|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Hum|Aspect=Perf|Case=Nom|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Hum|Case=Acc|Degree=Pos|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Hum|Case=Acc|Degree=Pos|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Hum|Case=Acc|Degree=Pos|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Hum|Case=Acc|Degree=Pos|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Hum|Case=Acc|Degree=Sup|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Hum|Case=Acc|Degree=Sup|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Hum|Case=Acc|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Hum|Case=Dat|Degree=Pos|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Hum|Case=Dat|Degree=Pos|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Hum|Case=Dat|Degree=Pos|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Hum|Case=Dat|Degree=Pos|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Hum|Case=Dat|Degree=Sup|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Hum|Case=Dat|Degree=Sup|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Hum|Case=Gen|Degree=Pos|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Hum|Case=Gen|Degree=Pos|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Hum|Case=Gen|Degree=Pos|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Hum|Case=Gen|Degree=Pos|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Hum|Case=Gen|Degree=Sup|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Hum|Case=Gen|Degree=Sup|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Hum|Case=Gen|Degree=Sup|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Hum|Case=Gen|Degree=Sup|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Hum|Case=Gen|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Hum|Case=Ins|Degree=Pos|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Hum|Case=Ins|Degree=Pos|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Hum|Case=Ins|Degree=Pos|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Hum|Case=Ins|Degree=Pos|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Hum|Case=Ins|Degree=Sup|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Hum|Case=Ins|Degree=Sup|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Hum|Case=Ins|Degree=Sup|Gender=Masc|Number=Sing": 
{POS: ADJ, "morph": "Animacy=Hum|Case=Ins|Degree=Sup|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Hum|Case=Ins|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Hum|Case=Ins|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Hum|Case=Loc|Degree=Pos|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Hum|Case=Loc|Degree=Pos|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Hum|Case=Loc|Degree=Pos|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Hum|Case=Loc|Degree=Pos|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Hum|Case=Loc|Degree=Sup|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Hum|Case=Loc|Degree=Sup|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Hum|Case=Nom|Degree=Pos|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Hum|Case=Nom|Degree=Pos|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Hum|Case=Nom|Degree=Pos|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Hum|Case=Nom|Degree=Pos|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Hum|Case=Nom|Degree=Sup|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Hum|Case=Nom|Degree=Sup|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Hum|Case=Nom|Degree=Sup|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Hum|Case=Nom|Degree=Sup|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Hum|Case=Nom|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Inan|Aspect=Imp|Case=Acc|Gender=Masc|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Imp|Case=Acc|Gender=Masc|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Animacy=Inan|Aspect=Imp|Case=Acc|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Imp|Case=Acc|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Inan|Aspect=Imp|Case=Acc|Gender=Masc|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Imp|Case=Acc|Gender=Masc|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Animacy=Inan|Aspect=Imp|Case=Acc|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Imp|Case=Acc|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Inan|Aspect=Imp|Case=Dat|Gender=Masc|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Imp|Case=Dat|Gender=Masc|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Animacy=Inan|Aspect=Imp|Case=Gen|Gender=Masc|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Imp|Case=Gen|Gender=Masc|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Animacy=Inan|Aspect=Imp|Case=Gen|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Imp|Case=Gen|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Inan|Aspect=Imp|Case=Gen|Gender=Masc|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Imp|Case=Gen|Gender=Masc|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Animacy=Inan|Aspect=Imp|Case=Gen|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": 
"Animacy=Inan|Aspect=Imp|Case=Gen|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Inan|Aspect=Imp|Case=Ins|Gender=Masc|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Imp|Case=Ins|Gender=Masc|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Animacy=Inan|Aspect=Imp|Case=Ins|Gender=Masc|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Imp|Case=Ins|Gender=Masc|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Animacy=Inan|Aspect=Imp|Case=Ins|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Imp|Case=Ins|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Inan|Aspect=Imp|Case=Loc|Gender=Masc|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Imp|Case=Loc|Gender=Masc|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Animacy=Inan|Aspect=Imp|Case=Loc|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Imp|Case=Loc|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Inan|Aspect=Imp|Case=Loc|Gender=Masc|Number=Sing|Polarity=Neg|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Imp|Case=Loc|Gender=Masc|Number=Sing|Polarity=Neg|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Inan|Aspect=Imp|Case=Loc|Gender=Masc|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Imp|Case=Loc|Gender=Masc|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Animacy=Inan|Aspect=Imp|Case=Loc|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Imp|Case=Loc|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Inan|Aspect=Imp|Case=Nom|Gender=Masc|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Imp|Case=Nom|Gender=Masc|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Animacy=Inan|Aspect=Imp|Case=Nom|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Imp|Case=Nom|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Inan|Aspect=Imp|Case=Nom|Gender=Masc|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Imp|Case=Nom|Gender=Masc|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Animacy=Inan|Aspect=Imp|Case=Nom|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Imp|Case=Nom|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Inan|Aspect=Imp|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Imp|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Inan|Aspect=Imp|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Imp|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Inan|Aspect=Perf|Case=Acc|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Perf|Case=Acc|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Inan|Aspect=Perf|Case=Acc|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": 
"Animacy=Inan|Aspect=Perf|Case=Acc|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Inan|Aspect=Perf|Case=Gen|Gender=Masc|Number=Plur|Polarity=Neg|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Perf|Case=Gen|Gender=Masc|Number=Plur|Polarity=Neg|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Inan|Aspect=Perf|Case=Gen|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Perf|Case=Gen|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Inan|Aspect=Perf|Case=Gen|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Perf|Case=Gen|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Inan|Aspect=Perf|Case=Ins|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Perf|Case=Ins|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Inan|Aspect=Perf|Case=Ins|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Perf|Case=Ins|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Inan|Aspect=Perf|Case=Loc|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Perf|Case=Loc|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Inan|Aspect=Perf|Case=Loc|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Perf|Case=Loc|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Inan|Aspect=Perf|Case=Nom|Gender=Masc|Number=Plur|Polarity=Neg|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Perf|Case=Nom|Gender=Masc|Number=Plur|Polarity=Neg|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Inan|Aspect=Perf|Case=Nom|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Perf|Case=Nom|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Inan|Aspect=Perf|Case=Nom|Gender=Masc|Number=Sing|Polarity=Neg|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Perf|Case=Nom|Gender=Masc|Number=Sing|Polarity=Neg|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Inan|Aspect=Perf|Case=Nom|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Inan|Aspect=Perf|Case=Nom|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Inan|Case=Acc|Degree=Pos|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Inan|Case=Acc|Degree=Pos|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Inan|Case=Acc|Degree=Pos|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Inan|Case=Acc|Degree=Pos|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Inan|Case=Acc|Degree=Sup|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Inan|Case=Acc|Degree=Sup|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Inan|Case=Acc|Degree=Sup|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Inan|Case=Acc|Degree=Sup|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Inan|Case=Acc|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Inan|Case=Dat|Degree=Pos|Gender=Masc|Number=Plur": {POS: ADJ, "morph": 
"Animacy=Inan|Case=Dat|Degree=Pos|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Inan|Case=Dat|Degree=Pos|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Inan|Case=Dat|Degree=Pos|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Inan|Case=Gen|Degree=Pos|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Inan|Case=Gen|Degree=Pos|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Inan|Case=Gen|Degree=Pos|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Inan|Case=Gen|Degree=Pos|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Inan|Case=Gen|Degree=Sup|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Inan|Case=Gen|Degree=Sup|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Inan|Case=Gen|Degree=Sup|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Inan|Case=Gen|Degree=Sup|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Inan|Case=Ins|Degree=Pos|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Inan|Case=Ins|Degree=Pos|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Inan|Case=Ins|Degree=Pos|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Inan|Case=Ins|Degree=Pos|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Inan|Case=Ins|Degree=Sup|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Inan|Case=Ins|Degree=Sup|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Inan|Case=Ins|Degree=Sup|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Inan|Case=Ins|Degree=Sup|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Inan|Case=Ins|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Inan|Case=Loc|Degree=Sup|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Inan|Case=Loc|Degree=Sup|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Inan|Case=Loc|Degree=Sup|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Inan|Case=Loc|Degree=Sup|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Inan|Case=Loc|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Inan|Case=Loc|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Inan|Case=Nom|Degree=Pos|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Inan|Case=Nom|Degree=Pos|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Inan|Case=Nom|Degree=Pos|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Inan|Case=Nom|Degree=Pos|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Inan|Case=Nom|Degree=Sup|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Inan|Case=Nom|Degree=Sup|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Inan|Case=Nom|Degree=Sup|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Inan|Case=Nom|Degree=Sup|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Inan|Case=Nom|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing"}, - 
"ADJ__Animacy=Nhum|Aspect=Imp|Case=Acc|Gender=Masc|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Animacy=Nhum|Aspect=Imp|Case=Acc|Gender=Masc|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Animacy=Nhum|Aspect=Imp|Case=Acc|Gender=Masc|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Animacy=Nhum|Aspect=Imp|Case=Acc|Gender=Masc|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Animacy=Nhum|Aspect=Imp|Case=Gen|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Nhum|Aspect=Imp|Case=Gen|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Nhum|Aspect=Imp|Case=Nom|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Nhum|Aspect=Imp|Case=Nom|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Nhum|Aspect=Imp|Case=Nom|Gender=Masc|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Animacy=Nhum|Aspect=Imp|Case=Nom|Gender=Masc|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Animacy=Nhum|Aspect=Imp|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Nhum|Aspect=Imp|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Nhum|Aspect=Perf|Case=Acc|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Nhum|Aspect=Perf|Case=Acc|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Nhum|Aspect=Perf|Case=Acc|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Nhum|Aspect=Perf|Case=Acc|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Nhum|Aspect=Perf|Case=Gen|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Nhum|Aspect=Perf|Case=Gen|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Nhum|Aspect=Perf|Case=Nom|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Nhum|Aspect=Perf|Case=Nom|Gender=Masc|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Nhum|Aspect=Perf|Case=Nom|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Animacy=Nhum|Aspect=Perf|Case=Nom|Gender=Masc|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Animacy=Nhum|Case=Acc|Degree=Pos|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Nhum|Case=Acc|Degree=Pos|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Nhum|Case=Acc|Degree=Pos|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Nhum|Case=Acc|Degree=Pos|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Nhum|Case=Acc|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Nhum|Case=Acc|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Nhum|Case=Gen|Degree=Pos|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Nhum|Case=Gen|Degree=Pos|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Nhum|Case=Gen|Degree=Pos|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Nhum|Case=Gen|Degree=Pos|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Nhum|Case=Ins|Degree=Pos|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Nhum|Case=Ins|Degree=Pos|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Nhum|Case=Ins|Degree=Pos|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Nhum|Case=Ins|Degree=Pos|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Nhum|Case=Loc|Degree=Pos|Gender=Masc|Number=Sing": {POS: ADJ, "morph": 
"Animacy=Nhum|Case=Loc|Degree=Pos|Gender=Masc|Number=Sing"}, - "ADJ__Animacy=Nhum|Case=Nom|Degree=Pos|Gender=Masc|Number=Plur": {POS: ADJ, "morph": "Animacy=Nhum|Case=Nom|Degree=Pos|Gender=Masc|Number=Plur"}, - "ADJ__Animacy=Nhum|Case=Nom|Degree=Pos|Gender=Masc|Number=Sing": {POS: ADJ, "morph": "Animacy=Nhum|Case=Nom|Degree=Pos|Gender=Masc|Number=Sing"}, - "ADJ__Aspect=Imp|Case=Acc|Gender=Fem|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Aspect=Imp|Case=Acc|Gender=Fem|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Aspect=Imp|Case=Acc|Gender=Fem|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Imp|Case=Acc|Gender=Fem|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Imp|Case=Acc|Gender=Fem|Number=Sing|Polarity=Neg|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Imp|Case=Acc|Gender=Fem|Number=Sing|Polarity=Neg|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Imp|Case=Acc|Gender=Fem|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Aspect=Imp|Case=Acc|Gender=Fem|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Aspect=Imp|Case=Acc|Gender=Fem|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Imp|Case=Acc|Gender=Fem|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Imp|Case=Acc|Gender=Neut|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Aspect=Imp|Case=Acc|Gender=Neut|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Aspect=Imp|Case=Acc|Gender=Neut|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Aspect=Imp|Case=Acc|Gender=Neut|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Aspect=Imp|Case=Acc|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Imp|Case=Acc|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Imp|Case=Dat|Gender=Fem|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Aspect=Imp|Case=Dat|Gender=Fem|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Aspect=Imp|Case=Dat|Gender=Fem|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Imp|Case=Dat|Gender=Fem|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Imp|Case=Dat|Gender=Fem|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Aspect=Imp|Case=Dat|Gender=Fem|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Aspect=Imp|Case=Gen|Gender=Fem|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Aspect=Imp|Case=Gen|Gender=Fem|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Aspect=Imp|Case=Gen|Gender=Fem|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Imp|Case=Gen|Gender=Fem|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Imp|Case=Gen|Gender=Fem|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Aspect=Imp|Case=Gen|Gender=Fem|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Aspect=Imp|Case=Gen|Gender=Fem|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Imp|Case=Gen|Gender=Fem|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - 
"ADJ__Aspect=Imp|Case=Gen|Gender=Neut|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Aspect=Imp|Case=Gen|Gender=Neut|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Aspect=Imp|Case=Gen|Gender=Neut|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Imp|Case=Gen|Gender=Neut|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Imp|Case=Gen|Gender=Neut|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Aspect=Imp|Case=Gen|Gender=Neut|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Aspect=Imp|Case=Gen|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Imp|Case=Gen|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Imp|Case=Ins|Gender=Fem|Number=Sing|Polarity=Neg|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Imp|Case=Ins|Gender=Fem|Number=Sing|Polarity=Neg|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Imp|Case=Ins|Gender=Fem|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Aspect=Imp|Case=Ins|Gender=Fem|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Aspect=Imp|Case=Ins|Gender=Fem|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Imp|Case=Ins|Gender=Fem|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Imp|Case=Ins|Gender=Neut|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Aspect=Imp|Case=Ins|Gender=Neut|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Aspect=Imp|Case=Ins|Gender=Neut|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Aspect=Imp|Case=Ins|Gender=Neut|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Aspect=Imp|Case=Loc|Gender=Fem|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Aspect=Imp|Case=Loc|Gender=Fem|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Aspect=Imp|Case=Loc|Gender=Fem|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Aspect=Imp|Case=Loc|Gender=Fem|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Aspect=Imp|Case=Loc|Gender=Neut|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Aspect=Imp|Case=Loc|Gender=Neut|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Aspect=Imp|Case=Loc|Gender=Neut|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Imp|Case=Loc|Gender=Neut|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Imp|Case=Loc|Gender=Neut|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Aspect=Imp|Case=Loc|Gender=Neut|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Aspect=Imp|Case=Loc|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Imp|Case=Loc|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Imp|Case=Nom|Gender=Fem|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Aspect=Imp|Case=Nom|Gender=Fem|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Aspect=Imp|Case=Nom|Gender=Fem|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Imp|Case=Nom|Gender=Fem|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - 
"ADJ__Aspect=Imp|Case=Nom|Gender=Fem|Number=Sing|Polarity=Neg|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Imp|Case=Nom|Gender=Fem|Number=Sing|Polarity=Neg|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Imp|Case=Nom|Gender=Fem|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Aspect=Imp|Case=Nom|Gender=Fem|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Aspect=Imp|Case=Nom|Gender=Fem|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Imp|Case=Nom|Gender=Fem|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Imp|Case=Nom|Gender=Neut|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Aspect=Imp|Case=Nom|Gender=Neut|Number=Plur|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Aspect=Imp|Case=Nom|Gender=Neut|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Imp|Case=Nom|Gender=Neut|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Imp|Case=Nom|Gender=Neut|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act": {POS: ADJ, "morph": "Aspect=Imp|Case=Nom|Gender=Neut|Number=Sing|Polarity=Pos|Tense=Pres|VerbForm=Part|Voice=Act"}, - "ADJ__Aspect=Imp|Case=Nom|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Imp|Case=Nom|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Imp|Gender=Fem|Number=Plur": {POS: ADJ, "morph": "Aspect=Imp|Gender=Fem|Number=Plur"}, - "ADJ__Aspect=Imp|Gender=Fem|Number=Sing": {POS: ADJ, "morph": "Aspect=Imp|Gender=Fem|Number=Sing"}, - "ADJ__Aspect=Imp|Gender=Neut|Number=Plur": {POS: ADJ, "morph": "Aspect=Imp|Gender=Neut|Number=Plur"}, - "ADJ__Aspect=Imp|Gender=Neut|Number=Sing": {POS: ADJ, "morph": "Aspect=Imp|Gender=Neut|Number=Sing"}, - "ADJ__Aspect=Perf|Case=Acc|Gender=Fem|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Perf|Case=Acc|Gender=Fem|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Perf|Case=Acc|Gender=Fem|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Perf|Case=Acc|Gender=Fem|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Perf|Case=Acc|Gender=Neut|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Perf|Case=Acc|Gender=Neut|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Perf|Case=Acc|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Perf|Case=Acc|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Perf|Case=Dat|Gender=Fem|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Perf|Case=Dat|Gender=Fem|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Perf|Case=Dat|Gender=Fem|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Perf|Case=Dat|Gender=Fem|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Perf|Case=Dat|Gender=Neut|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Perf|Case=Dat|Gender=Neut|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Perf|Case=Dat|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Perf|Case=Dat|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Perf|Case=Gen|Gender=Fem|Number=Plur|Polarity=Neg|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": 
"Aspect=Perf|Case=Gen|Gender=Fem|Number=Plur|Polarity=Neg|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Perf|Case=Gen|Gender=Fem|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Perf|Case=Gen|Gender=Fem|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Perf|Case=Gen|Gender=Fem|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Perf|Case=Gen|Gender=Fem|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Perf|Case=Gen|Gender=Neut|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Perf|Case=Gen|Gender=Neut|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Perf|Case=Gen|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Perf|Case=Gen|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Perf|Case=Ins|Gender=Fem|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Perf|Case=Ins|Gender=Fem|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Perf|Case=Ins|Gender=Fem|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Perf|Case=Ins|Gender=Fem|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Perf|Case=Ins|Gender=Neut|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Perf|Case=Ins|Gender=Neut|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Perf|Case=Ins|Gender=Neut|Number=Sing|Polarity=Neg|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Perf|Case=Ins|Gender=Neut|Number=Sing|Polarity=Neg|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Perf|Case=Ins|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Perf|Case=Ins|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Perf|Case=Loc|Gender=Fem|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Perf|Case=Loc|Gender=Fem|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Perf|Case=Loc|Gender=Fem|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Perf|Case=Loc|Gender=Fem|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Perf|Case=Loc|Gender=Neut|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Perf|Case=Loc|Gender=Neut|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Perf|Case=Loc|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Perf|Case=Loc|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Perf|Case=Nom|Gender=Fem|Number=Plur|Polarity=Neg|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Perf|Case=Nom|Gender=Fem|Number=Plur|Polarity=Neg|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Perf|Case=Nom|Gender=Fem|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Perf|Case=Nom|Gender=Fem|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Perf|Case=Nom|Gender=Fem|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Perf|Case=Nom|Gender=Fem|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Perf|Case=Nom|Gender=Neut|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Perf|Case=Nom|Gender=Neut|Number=Plur|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Aspect=Perf|Case=Nom|Gender=Neut|Number=Sing|Polarity=Pos": {POS: ADJ, 
"morph": "Aspect=Perf|Case=Nom|Gender=Neut|Number=Sing|Polarity=Pos"}, - "ADJ__Aspect=Perf|Case=Nom|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass": {POS: ADJ, "morph": "Aspect=Perf|Case=Nom|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Part|Voice=Pass"}, - "ADJ__Case=Acc|Degree=Pos|Gender=Fem|Number=Plur": {POS: ADJ, "morph": "Case=Acc|Degree=Pos|Gender=Fem|Number=Plur"}, - "ADJ__Case=Acc|Degree=Pos|Gender=Fem|Number=Sing": {POS: ADJ, "morph": "Case=Acc|Degree=Pos|Gender=Fem|Number=Sing"}, - "ADJ__Case=Acc|Degree=Pos|Gender=Neut|Number=Plur": {POS: ADJ, "morph": "Case=Acc|Degree=Pos|Gender=Neut|Number=Plur"}, - "ADJ__Case=Acc|Degree=Pos|Gender=Neut|Number=Sing": {POS: ADJ, "morph": "Case=Acc|Degree=Pos|Gender=Neut|Number=Sing"}, - "ADJ__Case=Acc|Degree=Sup|Gender=Fem|Number=Plur": {POS: ADJ, "morph": "Case=Acc|Degree=Sup|Gender=Fem|Number=Plur"}, - "ADJ__Case=Acc|Degree=Sup|Gender=Fem|Number=Sing": {POS: ADJ, "morph": "Case=Acc|Degree=Sup|Gender=Fem|Number=Sing"}, - "ADJ__Case=Acc|Degree=Sup|Gender=Neut|Number=Plur": {POS: ADJ, "morph": "Case=Acc|Degree=Sup|Gender=Neut|Number=Plur"}, - "ADJ__Case=Acc|Degree=Sup|Gender=Neut|Number=Sing": {POS: ADJ, "morph": "Case=Acc|Degree=Sup|Gender=Neut|Number=Sing"}, - "ADJ__Case=Acc|Gender=Fem|Number=Plur": {POS: ADJ, "morph": "Case=Acc|Gender=Fem|Number=Plur"}, - "ADJ__Case=Acc|Gender=Fem|Number=Sing": {POS: ADJ, "morph": "Case=Acc|Gender=Fem|Number=Sing"}, - "ADJ__Case=Acc|Gender=Neut|Number=Plur": {POS: ADJ, "morph": "Case=Acc|Gender=Neut|Number=Plur"}, - "ADJ__Case=Acc|Gender=Neut|Number=Sing": {POS: ADJ, "morph": "Case=Acc|Gender=Neut|Number=Sing"}, - "ADJ__Case=Dat|Degree=Pos|Gender=Fem|Number=Plur": {POS: ADJ, "morph": "Case=Dat|Degree=Pos|Gender=Fem|Number=Plur"}, - "ADJ__Case=Dat|Degree=Pos|Gender=Fem|Number=Sing": {POS: ADJ, "morph": "Case=Dat|Degree=Pos|Gender=Fem|Number=Sing"}, - "ADJ__Case=Dat|Degree=Pos|Gender=Neut|Number=Plur": {POS: ADJ, "morph": "Case=Dat|Degree=Pos|Gender=Neut|Number=Plur"}, - "ADJ__Case=Dat|Degree=Pos|Gender=Neut|Number=Sing": {POS: ADJ, "morph": "Case=Dat|Degree=Pos|Gender=Neut|Number=Sing"}, - "ADJ__Case=Dat|Degree=Sup|Gender=Neut|Number=Plur": {POS: ADJ, "morph": "Case=Dat|Degree=Sup|Gender=Neut|Number=Plur"}, - "ADJ__Case=Gen|Degree=Pos|Gender=Fem|Number=Plur": {POS: ADJ, "morph": "Case=Gen|Degree=Pos|Gender=Fem|Number=Plur"}, - "ADJ__Case=Gen|Degree=Pos|Gender=Fem|Number=Sing": {POS: ADJ, "morph": "Case=Gen|Degree=Pos|Gender=Fem|Number=Sing"}, - "ADJ__Case=Gen|Degree=Pos|Gender=Neut|Number=Plur": {POS: ADJ, "morph": "Case=Gen|Degree=Pos|Gender=Neut|Number=Plur"}, - "ADJ__Case=Gen|Degree=Pos|Gender=Neut|Number=Sing": {POS: ADJ, "morph": "Case=Gen|Degree=Pos|Gender=Neut|Number=Sing"}, - "ADJ__Case=Gen|Degree=Sup|Gender=Fem|Number=Plur": {POS: ADJ, "morph": "Case=Gen|Degree=Sup|Gender=Fem|Number=Plur"}, - "ADJ__Case=Gen|Degree=Sup|Gender=Fem|Number=Sing": {POS: ADJ, "morph": "Case=Gen|Degree=Sup|Gender=Fem|Number=Sing"}, - "ADJ__Case=Gen|Degree=Sup|Gender=Neut|Number=Plur": {POS: ADJ, "morph": "Case=Gen|Degree=Sup|Gender=Neut|Number=Plur"}, - "ADJ__Case=Gen|Degree=Sup|Gender=Neut|Number=Sing": {POS: ADJ, "morph": "Case=Gen|Degree=Sup|Gender=Neut|Number=Sing"}, - "ADJ__Case=Gen|Gender=Fem|Number=Plur": {POS: ADJ, "morph": "Case=Gen|Gender=Fem|Number=Plur"}, - "ADJ__Case=Gen|Gender=Fem|Number=Sing": {POS: ADJ, "morph": "Case=Gen|Gender=Fem|Number=Sing"}, - "ADJ__Case=Gen|Gender=Neut|Number=Plur": {POS: ADJ, "morph": "Case=Gen|Gender=Neut|Number=Plur"}, - "ADJ__Case=Gen|Gender=Neut|Number=Sing": {POS: 
ADJ, "morph": "Case=Gen|Gender=Neut|Number=Sing"}, - "ADJ__Case=Ins|Degree=Pos|Gender=Fem|Number=Plur": {POS: ADJ, "morph": "Case=Ins|Degree=Pos|Gender=Fem|Number=Plur"}, - "ADJ__Case=Ins|Degree=Pos|Gender=Fem|Number=Sing": {POS: ADJ, "morph": "Case=Ins|Degree=Pos|Gender=Fem|Number=Sing"}, - "ADJ__Case=Ins|Degree=Pos|Gender=Neut|Number=Plur": {POS: ADJ, "morph": "Case=Ins|Degree=Pos|Gender=Neut|Number=Plur"}, - "ADJ__Case=Ins|Degree=Pos|Gender=Neut|Number=Sing": {POS: ADJ, "morph": "Case=Ins|Degree=Pos|Gender=Neut|Number=Sing"}, - "ADJ__Case=Ins|Degree=Sup|Gender=Fem|Number=Sing": {POS: ADJ, "morph": "Case=Ins|Degree=Sup|Gender=Fem|Number=Sing"}, - "ADJ__Case=Ins|Degree=Sup|Gender=Neut|Number=Plur": {POS: ADJ, "morph": "Case=Ins|Degree=Sup|Gender=Neut|Number=Plur"}, - "ADJ__Case=Ins|Degree=Sup|Gender=Neut|Number=Sing": {POS: ADJ, "morph": "Case=Ins|Degree=Sup|Gender=Neut|Number=Sing"}, - "ADJ__Case=Ins|Gender=Fem|Number=Plur": {POS: ADJ, "morph": "Case=Ins|Gender=Fem|Number=Plur"}, - "ADJ__Case=Ins|Gender=Fem|Number=Sing": {POS: ADJ, "morph": "Case=Ins|Gender=Fem|Number=Sing"}, - "ADJ__Case=Ins|Gender=Neut|Number=Sing": {POS: ADJ, "morph": "Case=Ins|Gender=Neut|Number=Sing"}, - "ADJ__Case=Loc|Degree=Pos|Gender=Fem|Number=Plur": {POS: ADJ, "morph": "Case=Loc|Degree=Pos|Gender=Fem|Number=Plur"}, - "ADJ__Case=Loc|Degree=Pos|Gender=Fem|Number=Sing": {POS: ADJ, "morph": "Case=Loc|Degree=Pos|Gender=Fem|Number=Sing"}, - "ADJ__Case=Loc|Degree=Pos|Gender=Neut|Number=Plur": {POS: ADJ, "morph": "Case=Loc|Degree=Pos|Gender=Neut|Number=Plur"}, - "ADJ__Case=Loc|Degree=Pos|Gender=Neut|Number=Sing": {POS: ADJ, "morph": "Case=Loc|Degree=Pos|Gender=Neut|Number=Sing"}, - "ADJ__Case=Loc|Degree=Sup|Gender=Fem|Number=Plur": {POS: ADJ, "morph": "Case=Loc|Degree=Sup|Gender=Fem|Number=Plur"}, - "ADJ__Case=Loc|Degree=Sup|Gender=Neut|Number=Plur": {POS: ADJ, "morph": "Case=Loc|Degree=Sup|Gender=Neut|Number=Plur"}, - "ADJ__Case=Loc|Degree=Sup|Gender=Neut|Number=Sing": {POS: ADJ, "morph": "Case=Loc|Degree=Sup|Gender=Neut|Number=Sing"}, - "ADJ__Case=Loc|Gender=Fem|Number=Plur": {POS: ADJ, "morph": "Case=Loc|Gender=Fem|Number=Plur"}, - "ADJ__Case=Loc|Gender=Fem|Number=Sing": {POS: ADJ, "morph": "Case=Loc|Gender=Fem|Number=Sing"}, - "ADJ__Case=Loc|Gender=Neut|Number=Plur": {POS: ADJ, "morph": "Case=Loc|Gender=Neut|Number=Plur"}, - "ADJ__Case=Loc|Gender=Neut|Number=Sing": {POS: ADJ, "morph": "Case=Loc|Gender=Neut|Number=Sing"}, - "ADJ__Case=Nom|Degree=Pos|Gender=Fem|Number=Plur": {POS: ADJ, "morph": "Case=Nom|Degree=Pos|Gender=Fem|Number=Plur"}, - "ADJ__Case=Nom|Degree=Pos|Gender=Fem|Number=Sing": {POS: ADJ, "morph": "Case=Nom|Degree=Pos|Gender=Fem|Number=Sing"}, - "ADJ__Case=Nom|Degree=Pos|Gender=Neut|Number=Plur": {POS: ADJ, "morph": "Case=Nom|Degree=Pos|Gender=Neut|Number=Plur"}, - "ADJ__Case=Nom|Degree=Pos|Gender=Neut|Number=Sing": {POS: ADJ, "morph": "Case=Nom|Degree=Pos|Gender=Neut|Number=Sing"}, - "ADJ__Case=Nom|Degree=Sup|Gender=Fem|Number=Plur": {POS: ADJ, "morph": "Case=Nom|Degree=Sup|Gender=Fem|Number=Plur"}, - "ADJ__Case=Nom|Degree=Sup|Gender=Fem|Number=Sing": {POS: ADJ, "morph": "Case=Nom|Degree=Sup|Gender=Fem|Number=Sing"}, - "ADJ__Case=Nom|Degree=Sup|Gender=Neut|Number=Plur": {POS: ADJ, "morph": "Case=Nom|Degree=Sup|Gender=Neut|Number=Plur"}, - "ADJ__Case=Nom|Degree=Sup|Gender=Neut|Number=Sing": {POS: ADJ, "morph": "Case=Nom|Degree=Sup|Gender=Neut|Number=Sing"}, - "ADJ__Case=Nom|Gender=Fem|Number=Plur": {POS: ADJ, "morph": "Case=Nom|Gender=Fem|Number=Plur"}, - "ADJ__Case=Nom|Gender=Fem|Number=Sing": {POS: 
ADJ, "morph": "Case=Nom|Gender=Fem|Number=Sing"}, - "ADJ__Case=Nom|Gender=Neut|Number=Plur": {POS: ADJ, "morph": "Case=Nom|Gender=Neut|Number=Plur"}, - "ADJ__Case=Nom|Gender=Neut|Number=Sing": {POS: ADJ, "morph": "Case=Nom|Gender=Neut|Number=Sing"}, - "ADJ__Hyph=Yes": {POS: ADJ, "morph": "Hyph=Yes"}, - "ADJ__PrepCase=Pre": {POS: ADJ, "morph": "PrepCase=Pre"}, - "ADP__AdpType=Prep|Case=Acc": {POS: ADP, "morph": "AdpType=Prep|Case=Acc"}, - "ADP__AdpType=Prep|Case=Acc|Variant=Long": {POS: ADP, "morph": "AdpType=Prep|Case=Acc|Variant=Long"}, - "ADP__AdpType=Prep|Case=Acc|Variant=Short": {POS: ADP, "morph": "AdpType=Prep|Case=Acc|Variant=Short"}, - "ADP__AdpType=Prep|Case=Dat": {POS: ADP, "morph": "AdpType=Prep|Case=Dat"}, - "ADP__AdpType=Prep|Case=Gen": {POS: ADP, "morph": "AdpType=Prep|Case=Gen"}, - "ADP__AdpType=Prep|Case=Gen|Variant=Long": {POS: ADP, "morph": "AdpType=Prep|Case=Gen|Variant=Long"}, - "ADP__AdpType=Prep|Case=Gen|Variant=Short": {POS: ADP, "morph": "AdpType=Prep|Case=Gen|Variant=Short"}, - "ADP__AdpType=Prep|Case=Ins": {POS: ADP, "morph": "AdpType=Prep|Case=Ins"}, - "ADP__AdpType=Prep|Case=Ins|Variant=Long": {POS: ADP, "morph": "AdpType=Prep|Case=Ins|Variant=Long"}, - "ADP__AdpType=Prep|Case=Ins|Variant=Short": {POS: ADP, "morph": "AdpType=Prep|Case=Ins|Variant=Short"}, - "ADP__AdpType=Prep|Case=Loc": {POS: ADP, "morph": "AdpType=Prep|Case=Loc"}, - "ADP__AdpType=Prep|Case=Loc|Variant=Long": {POS: ADP, "morph": "AdpType=Prep|Case=Loc|Variant=Long"}, - "ADP__AdpType=Prep|Case=Loc|Variant=Short": {POS: ADP, "morph": "AdpType=Prep|Case=Loc|Variant=Short"}, - "ADP__AdpType=Prep|Case=Nom": {POS: ADP, "morph": "AdpType=Prep|Case=Nom"}, - "ADV___": {POS: ADV}, - "ADV__Degree=Pos": {POS: ADV, "morph": "Degree=Pos"}, - "ADV__Degree=Sup": {POS: ADV, "morph": "Degree=Sup"}, - "AUX___": {POS: AUX}, - "AUX__Animacy=Hum|Aspect=Imp|Gender=Masc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: AUX, "morph": "Animacy=Hum|Aspect=Imp|Gender=Masc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act"}, - "AUX__Animacy=Hum|Aspect=Imp|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: AUX, "morph": "Animacy=Hum|Aspect=Imp|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act"}, - "AUX__Animacy=Hum|Aspect=Perf|Gender=Masc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: AUX, "morph": "Animacy=Hum|Aspect=Perf|Gender=Masc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act"}, - "AUX__Animacy=Hum|Aspect=Perf|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: AUX, "morph": "Animacy=Hum|Aspect=Perf|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act"}, - "AUX__Animacy=Inan|Aspect=Imp|Gender=Masc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: AUX, "morph": "Animacy=Inan|Aspect=Imp|Gender=Masc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act"}, - "AUX__Animacy=Inan|Aspect=Imp|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: AUX, "morph": "Animacy=Inan|Aspect=Imp|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act"}, - "AUX__Animacy=Inan|Aspect=Perf|Gender=Masc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: AUX, "morph": "Animacy=Inan|Aspect=Perf|Gender=Masc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act"}, - "AUX__Animacy=Inan|Aspect=Perf|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: AUX, "morph": "Animacy=Inan|Aspect=Perf|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act"}, - "AUX__Animacy=Nhum|Aspect=Imp|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: AUX, "morph": 
"Animacy=Nhum|Aspect=Imp|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act"}, - "AUX__Animacy=Nhum|Aspect=Perf|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: AUX, "morph": "Animacy=Nhum|Aspect=Perf|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act"}, - "AUX__Aspect=Imp|Gender=Fem|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: AUX, "morph": "Aspect=Imp|Gender=Fem|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act"}, - "AUX__Aspect=Imp|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: AUX, "morph": "Aspect=Imp|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act"}, - "AUX__Aspect=Imp|Gender=Neut|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: AUX, "morph": "Aspect=Imp|Gender=Neut|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act"}, - "AUX__Aspect=Imp|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: AUX, "morph": "Aspect=Imp|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act"}, - "AUX__Aspect=Imp|Mood=Cnd|VerbForm=Fin": {POS: AUX, "morph": "Aspect=Imp|Mood=Cnd|VerbForm=Fin"}, - "AUX__Aspect=Imp|Mood=Imp|Number=Sing|Person=2|VerbForm=Fin": {POS: AUX, "morph": "Aspect=Imp|Mood=Imp|Number=Sing|Person=2|VerbForm=Fin"}, - "AUX__Aspect=Imp|Mood=Ind|Number=Plur|Person=1|Tense=Fut|VerbForm=Fin": {POS: AUX, "morph": "Aspect=Imp|Mood=Ind|Number=Plur|Person=1|Tense=Fut|VerbForm=Fin"}, - "AUX__Aspect=Imp|Mood=Ind|Number=Plur|Person=1|Tense=Pres|Variant=Short|VerbForm=Fin": {POS: AUX, "morph": "Aspect=Imp|Mood=Ind|Number=Plur|Person=1|Tense=Pres|Variant=Short|VerbForm=Fin"}, - "AUX__Aspect=Imp|Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin": {POS: AUX, "morph": "Aspect=Imp|Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin"}, - "AUX__Aspect=Imp|Mood=Ind|Number=Plur|Person=2|Tense=Pres|Variant=Short|VerbForm=Fin": {POS: AUX, "morph": "Aspect=Imp|Mood=Ind|Number=Plur|Person=2|Tense=Pres|Variant=Short|VerbForm=Fin"}, - "AUX__Aspect=Imp|Mood=Ind|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin": {POS: AUX, "morph": "Aspect=Imp|Mood=Ind|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin"}, - "AUX__Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Tense=Fut|VerbForm=Fin": {POS: AUX, "morph": "Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Tense=Fut|VerbForm=Fin"}, - "AUX__Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin": {POS: AUX, "morph": "Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin"}, - "AUX__Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Tense=Fut|VerbForm=Fin": {POS: AUX, "morph": "Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Tense=Fut|VerbForm=Fin"}, - "AUX__Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Tense=Pres|Variant=Long|VerbForm=Fin": {POS: AUX, "morph": "Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Tense=Pres|Variant=Long|VerbForm=Fin"}, - "AUX__Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Tense=Pres|Variant=Short|VerbForm=Fin": {POS: AUX, "morph": "Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Tense=Pres|Variant=Short|VerbForm=Fin"}, - "AUX__Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin": {POS: AUX, "morph": "Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin"}, - "AUX__Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Tense=Fut|VerbForm=Fin": {POS: AUX, "morph": "Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Tense=Fut|VerbForm=Fin"}, - "AUX__Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Tense=Pres|Variant=Long|VerbForm=Fin": {POS: AUX, "morph": "Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Tense=Pres|Variant=Long|VerbForm=Fin"}, - 
"AUX__Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Tense=Pres|Variant=Short|VerbForm=Fin": {POS: AUX, "morph": "Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Tense=Pres|Variant=Short|VerbForm=Fin"}, - "AUX__Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin": {POS: AUX, "morph": "Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin"}, - "AUX__Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Fut|VerbForm=Fin": {POS: AUX, "morph": "Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Fut|VerbForm=Fin"}, - "AUX__Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin": {POS: AUX, "morph": "Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin"}, - "AUX__Aspect=Imp|VerbForm=Inf": {POS: AUX, "morph": "Aspect=Imp|VerbForm=Inf"}, - "AUX__Aspect=Perf|Gender=Fem|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: AUX, "morph": "Aspect=Perf|Gender=Fem|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act"}, - "AUX__Aspect=Perf|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: AUX, "morph": "Aspect=Perf|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act"}, - "AUX__Aspect=Perf|Gender=Neut|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: AUX, "morph": "Aspect=Perf|Gender=Neut|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act"}, - "AUX__Aspect=Perf|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: AUX, "morph": "Aspect=Perf|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act"}, - "AUX__Aspect=Perf|Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin": {POS: AUX, "morph": "Aspect=Perf|Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin"}, - "AUX__Aspect=Perf|Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin": {POS: AUX, "morph": "Aspect=Perf|Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin"}, - "AUX__Aspect=Perf|Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin": {POS: AUX, "morph": "Aspect=Perf|Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin"}, - "AUX__Aspect=Perf|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin": {POS: AUX, "morph": "Aspect=Perf|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin"}, - "AUX__Aspect=Perf|VerbForm=Inf": {POS: AUX, "morph": "Aspect=Perf|VerbForm=Inf"}, - "CCONJ___": {POS: CCONJ}, - "DET__Animacy=Hum|Case=Acc|Gender=Masc|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Hum|Case=Acc|Gender=Masc|Number=Plur|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Plur|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Hum|Case=Acc|Gender=Masc|Number=Plur|NumType=Card|PronType=Ind": {POS: DET, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Plur|NumType=Card|PronType=Ind"}, - "DET__Animacy=Hum|Case=Acc|Gender=Masc|Number=Plur|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Plur|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Animacy=Hum|Case=Acc|Gender=Masc|Number=Plur|PronType=Dem": {POS: DET, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Plur|PronType=Dem"}, - "DET__Animacy=Hum|Case=Acc|Gender=Masc|Number=Plur|PronType=Int,Rel": {POS: DET, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Plur|PronType=Int,Rel"}, - "DET__Animacy=Hum|Case=Acc|Gender=Masc|Number=Plur|PronType=Tot": {POS: DET, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Plur|PronType=Tot"}, - 
"DET__Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|PronType=Dem": {POS: DET, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|PronType=Dem"}, - "DET__Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|PronType=Ind": {POS: DET, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|PronType=Ind"}, - "DET__Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|PronType=Int,Rel": {POS: DET, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|PronType=Int,Rel"}, - "DET__Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|PronType=Tot": {POS: DET, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|PronType=Tot"}, - "DET__Animacy=Hum|Case=Dat|Gender=Masc|Number=Plur|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Animacy=Hum|Case=Dat|Gender=Masc|Number=Plur|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Animacy=Hum|Case=Dat|Gender=Masc|Number=Plur|PronType=Dem": {POS: DET, "morph": "Animacy=Hum|Case=Dat|Gender=Masc|Number=Plur|PronType=Dem"}, - "DET__Animacy=Hum|Case=Dat|Gender=Masc|Number=Plur|PronType=Int,Rel": {POS: DET, "morph": "Animacy=Hum|Case=Dat|Gender=Masc|Number=Plur|PronType=Int,Rel"}, - "DET__Animacy=Hum|Case=Dat|Gender=Masc|Number=Plur|PronType=Tot": {POS: DET, "morph": "Animacy=Hum|Case=Dat|Gender=Masc|Number=Plur|PronType=Tot"}, - "DET__Animacy=Hum|Case=Dat|Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Animacy=Hum|Case=Dat|Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Animacy=Hum|Case=Dat|Gender=Masc|Number=Sing|PronType=Int,Rel": {POS: DET, "morph": "Animacy=Hum|Case=Dat|Gender=Masc|Number=Sing|PronType=Int,Rel"}, - "DET__Animacy=Hum|Case=Dat|Gender=Masc|Number=Sing|PronType=Tot": {POS: DET, "morph": "Animacy=Hum|Case=Dat|Gender=Masc|Number=Sing|PronType=Tot"}, - "DET__Animacy=Hum|Case=Gen|Gender=Masc|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Hum|Case=Gen|Gender=Masc|Number=Plur|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Plur|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Hum|Case=Gen|Gender=Masc|Number=Plur|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Plur|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Animacy=Hum|Case=Gen|Gender=Masc|Number=Plur|PronType=Dem": {POS: DET, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Plur|PronType=Dem"}, - "DET__Animacy=Hum|Case=Gen|Gender=Masc|Number=Plur|PronType=Int,Rel": {POS: DET, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Plur|PronType=Int,Rel"}, - "DET__Animacy=Hum|Case=Gen|Gender=Masc|Number=Plur|PronType=Tot": {POS: DET, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Plur|PronType=Tot"}, - 
"DET__Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing|PronType=Dem": {POS: DET, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing|PronType=Dem"}, - "DET__Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing|PronType=Ind": {POS: DET, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing|PronType=Ind"}, - "DET__Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing|PronType=Int,Rel": {POS: DET, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing|PronType=Int,Rel"}, - "DET__Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing|PronType=Neg": {POS: DET, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing|PronType=Neg"}, - "DET__Animacy=Hum|Case=Ins|Gender=Masc|Number=Plur|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Hum|Case=Ins|Gender=Masc|Number=Plur|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Hum|Case=Ins|Gender=Masc|Number=Plur|NumType=Card|PronType=Ind": {POS: DET, "morph": "Animacy=Hum|Case=Ins|Gender=Masc|Number=Plur|NumType=Card|PronType=Ind"}, - "DET__Animacy=Hum|Case=Ins|Gender=Masc|Number=Plur|PronType=Dem": {POS: DET, "morph": "Animacy=Hum|Case=Ins|Gender=Masc|Number=Plur|PronType=Dem"}, - "DET__Animacy=Hum|Case=Ins|Gender=Masc|Number=Plur|PronType=Int,Rel": {POS: DET, "morph": "Animacy=Hum|Case=Ins|Gender=Masc|Number=Plur|PronType=Int,Rel"}, - "DET__Animacy=Hum|Case=Ins|Gender=Masc|Number=Plur|PronType=Tot": {POS: DET, "morph": "Animacy=Hum|Case=Ins|Gender=Masc|Number=Plur|PronType=Tot"}, - "DET__Animacy=Hum|Case=Ins|Gender=Masc|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Hum|Case=Ins|Gender=Masc|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Hum|Case=Ins|Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Animacy=Hum|Case=Ins|Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Animacy=Hum|Case=Ins|Gender=Masc|Number=Sing|PronType=Dem": {POS: DET, "morph": "Animacy=Hum|Case=Ins|Gender=Masc|Number=Sing|PronType=Dem"}, - "DET__Animacy=Hum|Case=Ins|Gender=Masc|Number=Sing|PronType=Int,Rel": {POS: DET, "morph": "Animacy=Hum|Case=Ins|Gender=Masc|Number=Sing|PronType=Int,Rel"}, - "DET__Animacy=Hum|Case=Ins|Gender=Masc|Number=Sing|PronType=Tot": {POS: DET, "morph": "Animacy=Hum|Case=Ins|Gender=Masc|Number=Sing|PronType=Tot"}, - "DET__Animacy=Hum|Case=Loc|Gender=Masc|Number=Plur|PronType=Tot": {POS: DET, "morph": "Animacy=Hum|Case=Loc|Gender=Masc|Number=Plur|PronType=Tot"}, - "DET__Animacy=Hum|Case=Loc|Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Animacy=Hum|Case=Loc|Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Animacy=Hum|Case=Loc|Gender=Masc|Number=Sing|PronType=Dem": {POS: DET, "morph": "Animacy=Hum|Case=Loc|Gender=Masc|Number=Sing|PronType=Dem"}, - "DET__Animacy=Hum|Case=Loc|Gender=Masc|Number=Sing|PronType=Int,Rel": {POS: DET, "morph": 
"Animacy=Hum|Case=Loc|Gender=Masc|Number=Sing|PronType=Int,Rel"}, - "DET__Animacy=Hum|Case=Nom|Gender=Masc|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Hum|Case=Nom|Gender=Masc|Number=Plur|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Plur|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Hum|Case=Nom|Gender=Masc|Number=Plur|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Plur|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Hum|Case=Nom|Gender=Masc|Number=Plur|NumType=Card|PronType=Ind": {POS: DET, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Plur|NumType=Card|PronType=Ind"}, - "DET__Animacy=Hum|Case=Nom|Gender=Masc|Number=Plur|PronType=Dem": {POS: DET, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Plur|PronType=Dem"}, - "DET__Animacy=Hum|Case=Nom|Gender=Masc|Number=Plur|PronType=Ind": {POS: DET, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Plur|PronType=Ind"}, - "DET__Animacy=Hum|Case=Nom|Gender=Masc|Number=Plur|PronType=Int,Rel": {POS: DET, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Plur|PronType=Int,Rel"}, - "DET__Animacy=Hum|Case=Nom|Gender=Masc|Number=Plur|PronType=Tot": {POS: DET, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Plur|PronType=Tot"}, - "DET__Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|Number[psor]=Plur|Person=2|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|Number[psor]=Plur|Person=2|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|PronType=Dem": {POS: DET, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|PronType=Dem"}, - "DET__Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|PronType=Ind": {POS: DET, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|PronType=Ind"}, - "DET__Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|PronType=Int,Rel": {POS: DET, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|PronType=Int,Rel"}, - "DET__Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|PronType=Neg": {POS: DET, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|PronType=Neg"}, - "DET__Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|PronType=Tot": {POS: DET, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|PronType=Tot"}, - "DET__Animacy=Hum|Case=Voc|Gender=Masc|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": 
"Animacy=Hum|Case=Voc|Gender=Masc|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Inan|Case=Acc|Gender=Masc|Number=Plur|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|Number=Plur|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Inan|Case=Acc|Gender=Masc|Number=Plur|NumType=Card|PronType=Ind": {POS: DET, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|Number=Plur|NumType=Card|PronType=Ind"}, - "DET__Animacy=Inan|Case=Acc|Gender=Masc|Number=Plur|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|Number=Plur|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Animacy=Inan|Case=Acc|Gender=Masc|Number=Plur|PronType=Dem": {POS: DET, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|Number=Plur|PronType=Dem"}, - "DET__Animacy=Inan|Case=Acc|Gender=Masc|Number=Plur|PronType=Ind": {POS: DET, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|Number=Plur|PronType=Ind"}, - "DET__Animacy=Inan|Case=Acc|Gender=Masc|Number=Plur|PronType=Int,Rel": {POS: DET, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|Number=Plur|PronType=Int,Rel"}, - "DET__Animacy=Inan|Case=Acc|Gender=Masc|Number=Plur|PronType=Neg": {POS: DET, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|Number=Plur|PronType=Neg"}, - "DET__Animacy=Inan|Case=Acc|Gender=Masc|Number=Plur|PronType=Tot": {POS: DET, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|Number=Plur|PronType=Tot"}, - "DET__Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing|Number[psor]=Plur|Person=2|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing|Number[psor]=Plur|Person=2|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing|PronType=Dem": {POS: DET, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing|PronType=Dem"}, - "DET__Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing|PronType=Ind": {POS: DET, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing|PronType=Ind"}, - "DET__Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing|PronType=Int,Rel": {POS: DET, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing|PronType=Int,Rel"}, - "DET__Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing|PronType=Neg": {POS: DET, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing|PronType=Neg"}, - "DET__Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing|PronType=Tot": {POS: DET, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing|PronType=Tot"}, - "DET__Animacy=Inan|Case=Dat|Gender=Masc|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Inan|Case=Dat|Gender=Masc|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - 
"DET__Animacy=Inan|Case=Dat|Gender=Masc|Number=Plur|PronType=Dem": {POS: DET, "morph": "Animacy=Inan|Case=Dat|Gender=Masc|Number=Plur|PronType=Dem"}, - "DET__Animacy=Inan|Case=Dat|Gender=Masc|Number=Plur|PronType=Int,Rel": {POS: DET, "morph": "Animacy=Inan|Case=Dat|Gender=Masc|Number=Plur|PronType=Int,Rel"}, - "DET__Animacy=Inan|Case=Dat|Gender=Masc|Number=Sing|PronType=Dem": {POS: DET, "morph": "Animacy=Inan|Case=Dat|Gender=Masc|Number=Sing|PronType=Dem"}, - "DET__Animacy=Inan|Case=Dat|Gender=Masc|Number=Sing|PronType=Int,Rel": {POS: DET, "morph": "Animacy=Inan|Case=Dat|Gender=Masc|Number=Sing|PronType=Int,Rel"}, - "DET__Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur|NumType=Card|PronType=Ind": {POS: DET, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur|NumType=Card|PronType=Ind"}, - "DET__Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur|PronType=Dem": {POS: DET, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur|PronType=Dem"}, - "DET__Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur|PronType=Ind": {POS: DET, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur|PronType=Ind"}, - "DET__Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur|PronType=Int,Rel": {POS: DET, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur|PronType=Int,Rel"}, - "DET__Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur|PronType=Neg": {POS: DET, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur|PronType=Neg"}, - "DET__Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur|PronType=Tot": {POS: DET, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur|PronType=Tot"}, - "DET__Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing|PronType=Dem": {POS: DET, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing|PronType=Dem"}, - "DET__Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing|PronType=Ind": {POS: DET, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing|PronType=Ind"}, - "DET__Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing|PronType=Int,Rel": {POS: DET, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing|PronType=Int,Rel"}, - "DET__Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing|PronType=Neg": {POS: DET, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing|PronType=Neg"}, - "DET__Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing|PronType=Tot": {POS: DET, "morph": 
"Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing|PronType=Tot"}, - "DET__Animacy=Inan|Case=Ins|Gender=Masc|Number=Plur|PronType=Dem": {POS: DET, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|Number=Plur|PronType=Dem"}, - "DET__Animacy=Inan|Case=Ins|Gender=Masc|Number=Plur|PronType=Int,Rel": {POS: DET, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|Number=Plur|PronType=Int,Rel"}, - "DET__Animacy=Inan|Case=Ins|Gender=Masc|Number=Plur|PronType=Tot": {POS: DET, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|Number=Plur|PronType=Tot"}, - "DET__Animacy=Inan|Case=Ins|Gender=Masc|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Inan|Case=Ins|Gender=Masc|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Inan|Case=Ins|Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Animacy=Inan|Case=Ins|Gender=Masc|Number=Sing|PronType=Dem": {POS: DET, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|Number=Sing|PronType=Dem"}, - "DET__Animacy=Inan|Case=Ins|Gender=Masc|Number=Sing|PronType=Int,Rel": {POS: DET, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|Number=Sing|PronType=Int,Rel"}, - "DET__Animacy=Inan|Case=Ins|Gender=Masc|Number=Sing|PronType=Neg": {POS: DET, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|Number=Sing|PronType=Neg"}, - "DET__Animacy=Inan|Case=Ins|Gender=Masc|Number=Sing|PronType=Tot": {POS: DET, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|Number=Sing|PronType=Tot"}, - "DET__Animacy=Inan|Case=Loc|Gender=Masc|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Inan|Case=Loc|Gender=Masc|Number=Plur|NumType=Card|PronType=Ind": {POS: DET, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|Number=Plur|NumType=Card|PronType=Ind"}, - "DET__Animacy=Inan|Case=Loc|Gender=Masc|Number=Plur|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|Number=Plur|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Animacy=Inan|Case=Loc|Gender=Masc|Number=Plur|PronType=Dem": {POS: DET, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|Number=Plur|PronType=Dem"}, - "DET__Animacy=Inan|Case=Loc|Gender=Masc|Number=Plur|PronType=Ind": {POS: DET, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|Number=Plur|PronType=Ind"}, - "DET__Animacy=Inan|Case=Loc|Gender=Masc|Number=Plur|PronType=Int,Rel": {POS: DET, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|Number=Plur|PronType=Int,Rel"}, - "DET__Animacy=Inan|Case=Loc|Gender=Masc|Number=Plur|PronType=Tot": {POS: DET, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|Number=Plur|PronType=Tot"}, - "DET__Animacy=Inan|Case=Loc|Gender=Masc|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Inan|Case=Loc|Gender=Masc|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Inan|Case=Loc|Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": 
"Animacy=Inan|Case=Loc|Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Animacy=Inan|Case=Loc|Gender=Masc|Number=Sing|PronType=Dem": {POS: DET, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|Number=Sing|PronType=Dem"}, - "DET__Animacy=Inan|Case=Loc|Gender=Masc|Number=Sing|PronType=Ind": {POS: DET, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|Number=Sing|PronType=Ind"}, - "DET__Animacy=Inan|Case=Loc|Gender=Masc|Number=Sing|PronType=Int,Rel": {POS: DET, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|Number=Sing|PronType=Int,Rel"}, - "DET__Animacy=Inan|Case=Loc|Gender=Masc|Number=Sing|PronType=Tot": {POS: DET, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|Number=Sing|PronType=Tot"}, - "DET__Animacy=Inan|Case=Nom|Gender=Masc|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Inan|Case=Nom|Gender=Masc|Number=Plur|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|Number=Plur|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Inan|Case=Nom|Gender=Masc|Number=Plur|NumType=Card|PronType=Ind": {POS: DET, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|Number=Plur|NumType=Card|PronType=Ind"}, - "DET__Animacy=Inan|Case=Nom|Gender=Masc|Number=Plur|PronType=Dem": {POS: DET, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|Number=Plur|PronType=Dem"}, - "DET__Animacy=Inan|Case=Nom|Gender=Masc|Number=Plur|PronType=Ind": {POS: DET, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|Number=Plur|PronType=Ind"}, - "DET__Animacy=Inan|Case=Nom|Gender=Masc|Number=Plur|PronType=Int,Rel": {POS: DET, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|Number=Plur|PronType=Int,Rel"}, - "DET__Animacy=Inan|Case=Nom|Gender=Masc|Number=Plur|PronType=Tot": {POS: DET, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|Number=Plur|PronType=Tot"}, - "DET__Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing|PronType=Dem": {POS: DET, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing|PronType=Dem"}, - "DET__Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing|PronType=Ind": {POS: DET, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing|PronType=Ind"}, - "DET__Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing|PronType=Int,Rel": {POS: DET, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing|PronType=Int,Rel"}, - "DET__Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing|PronType=Neg": {POS: DET, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing|PronType=Neg"}, - "DET__Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing|PronType=Tot": {POS: DET, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing|PronType=Tot"}, - "DET__Animacy=Nhum|Case=Acc|Gender=Masc|Number=Plur|NumType=Card|PronType=Ind": {POS: DET, "morph": 
"Animacy=Nhum|Case=Acc|Gender=Masc|Number=Plur|NumType=Card|PronType=Ind"}, - "DET__Animacy=Nhum|Case=Acc|Gender=Masc|Number=Plur|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Animacy=Nhum|Case=Acc|Gender=Masc|Number=Plur|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Animacy=Nhum|Case=Acc|Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Animacy=Nhum|Case=Acc|Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Animacy=Nhum|Case=Acc|Gender=Masc|Number=Sing|PronType=Dem": {POS: DET, "morph": "Animacy=Nhum|Case=Acc|Gender=Masc|Number=Sing|PronType=Dem"}, - "DET__Animacy=Nhum|Case=Acc|Gender=Masc|Number=Sing|PronType=Tot": {POS: DET, "morph": "Animacy=Nhum|Case=Acc|Gender=Masc|Number=Sing|PronType=Tot"}, - "DET__Animacy=Nhum|Case=Dat|Gender=Masc|Number=Plur|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Animacy=Nhum|Case=Dat|Gender=Masc|Number=Plur|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Animacy=Nhum|Case=Gen|Gender=Masc|Number=Plur|NumType=Card|PronType=Ind": {POS: DET, "morph": "Animacy=Nhum|Case=Gen|Gender=Masc|Number=Plur|NumType=Card|PronType=Ind"}, - "DET__Animacy=Nhum|Case=Gen|Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Animacy=Nhum|Case=Gen|Gender=Masc|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Animacy=Nhum|Case=Gen|Gender=Masc|Number=Sing|PronType=Tot": {POS: DET, "morph": "Animacy=Nhum|Case=Gen|Gender=Masc|Number=Sing|PronType=Tot"}, - "DET__Animacy=Nhum|Case=Ins|Gender=Masc|Number=Sing|PronType=Int,Rel": {POS: DET, "morph": "Animacy=Nhum|Case=Ins|Gender=Masc|Number=Sing|PronType=Int,Rel"}, - "DET__Animacy=Nhum|Case=Nom|Gender=Masc|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Nhum|Case=Nom|Gender=Masc|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Nhum|Case=Nom|Gender=Masc|Number=Plur|PronType=Dem": {POS: DET, "morph": "Animacy=Nhum|Case=Nom|Gender=Masc|Number=Plur|PronType=Dem"}, - "DET__Animacy=Nhum|Case=Nom|Gender=Masc|Number=Plur|PronType=Int,Rel": {POS: DET, "morph": "Animacy=Nhum|Case=Nom|Gender=Masc|Number=Plur|PronType=Int,Rel"}, - "DET__Animacy=Nhum|Case=Nom|Gender=Masc|Number=Plur|PronType=Tot": {POS: DET, "morph": "Animacy=Nhum|Case=Nom|Gender=Masc|Number=Plur|PronType=Tot"}, - "DET__Animacy=Nhum|Case=Nom|Gender=Masc|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Animacy=Nhum|Case=Nom|Gender=Masc|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs"}, - "DET__Animacy=Nhum|Case=Nom|Gender=Masc|Number=Sing|PronType=Dem": {POS: DET, "morph": "Animacy=Nhum|Case=Nom|Gender=Masc|Number=Sing|PronType=Dem"}, - "DET__Animacy=Nhum|Case=Nom|Gender=Masc|Number=Sing|PronType=Ind": {POS: DET, "morph": "Animacy=Nhum|Case=Nom|Gender=Masc|Number=Sing|PronType=Ind"}, - "DET__Animacy=Nhum|Case=Nom|Gender=Masc|Number=Sing|PronType=Int,Rel": {POS: DET, "morph": "Animacy=Nhum|Case=Nom|Gender=Masc|Number=Sing|PronType=Int,Rel"}, - "DET__Case=Acc|Gender=Fem|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Acc|Gender=Fem|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Case=Acc|Gender=Fem|Number=Plur|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Acc|Gender=Fem|Number=Plur|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs"}, - "DET__Case=Acc|Gender=Fem|Number=Plur|NumType=Card|PronType=Ind": {POS: DET, "morph": 
"Case=Acc|Gender=Fem|Number=Plur|NumType=Card|PronType=Ind"}, - "DET__Case=Acc|Gender=Fem|Number=Plur|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Case=Acc|Gender=Fem|Number=Plur|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Case=Acc|Gender=Fem|Number=Plur|PronType=Dem": {POS: DET, "morph": "Case=Acc|Gender=Fem|Number=Plur|PronType=Dem"}, - "DET__Case=Acc|Gender=Fem|Number=Plur|PronType=Ind": {POS: DET, "morph": "Case=Acc|Gender=Fem|Number=Plur|PronType=Ind"}, - "DET__Case=Acc|Gender=Fem|Number=Plur|PronType=Int,Rel": {POS: DET, "morph": "Case=Acc|Gender=Fem|Number=Plur|PronType=Int,Rel"}, - "DET__Case=Acc|Gender=Fem|Number=Plur|PronType=Tot": {POS: DET, "morph": "Case=Acc|Gender=Fem|Number=Plur|PronType=Tot"}, - "DET__Case=Acc|Gender=Fem|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Acc|Gender=Fem|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Case=Acc|Gender=Fem|Number=Sing|Number[psor]=Plur|Person=2|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Acc|Gender=Fem|Number=Sing|Number[psor]=Plur|Person=2|Poss=Yes|PronType=Prs"}, - "DET__Case=Acc|Gender=Fem|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Acc|Gender=Fem|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Case=Acc|Gender=Fem|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Acc|Gender=Fem|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs"}, - "DET__Case=Acc|Gender=Fem|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Case=Acc|Gender=Fem|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Case=Acc|Gender=Fem|Number=Sing|PronType=Dem": {POS: DET, "morph": "Case=Acc|Gender=Fem|Number=Sing|PronType=Dem"}, - "DET__Case=Acc|Gender=Fem|Number=Sing|PronType=Ind": {POS: DET, "morph": "Case=Acc|Gender=Fem|Number=Sing|PronType=Ind"}, - "DET__Case=Acc|Gender=Fem|Number=Sing|PronType=Int,Rel": {POS: DET, "morph": "Case=Acc|Gender=Fem|Number=Sing|PronType=Int,Rel"}, - "DET__Case=Acc|Gender=Fem|Number=Sing|PronType=Tot": {POS: DET, "morph": "Case=Acc|Gender=Fem|Number=Sing|PronType=Tot"}, - "DET__Case=Acc|Gender=Neut|Number=Plur|Number[psor]=Plur|Person=2|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Acc|Gender=Neut|Number=Plur|Number[psor]=Plur|Person=2|Poss=Yes|PronType=Prs"}, - "DET__Case=Acc|Gender=Neut|Number=Plur|NumType=Card|PronType=Ind": {POS: DET, "morph": "Case=Acc|Gender=Neut|Number=Plur|NumType=Card|PronType=Ind"}, - "DET__Case=Acc|Gender=Neut|Number=Plur|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Case=Acc|Gender=Neut|Number=Plur|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Case=Acc|Gender=Neut|Number=Plur|PronType=Dem": {POS: DET, "morph": "Case=Acc|Gender=Neut|Number=Plur|PronType=Dem"}, - "DET__Case=Acc|Gender=Neut|Number=Plur|PronType=Ind": {POS: DET, "morph": "Case=Acc|Gender=Neut|Number=Plur|PronType=Ind"}, - "DET__Case=Acc|Gender=Neut|Number=Plur|PronType=Int,Rel": {POS: DET, "morph": "Case=Acc|Gender=Neut|Number=Plur|PronType=Int,Rel"}, - "DET__Case=Acc|Gender=Neut|Number=Plur|PronType=Neg": {POS: DET, "morph": "Case=Acc|Gender=Neut|Number=Plur|PronType=Neg"}, - "DET__Case=Acc|Gender=Neut|Number=Plur|PronType=Tot": {POS: DET, "morph": "Case=Acc|Gender=Neut|Number=Plur|PronType=Tot"}, - "DET__Case=Acc|Gender=Neut|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Acc|Gender=Neut|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs"}, - 
"DET__Case=Acc|Gender=Neut|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Case=Acc|Gender=Neut|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Case=Acc|Gender=Neut|Number=Sing|PronType=Dem": {POS: DET, "morph": "Case=Acc|Gender=Neut|Number=Sing|PronType=Dem"}, - "DET__Case=Acc|Gender=Neut|Number=Sing|PronType=Ind": {POS: DET, "morph": "Case=Acc|Gender=Neut|Number=Sing|PronType=Ind"}, - "DET__Case=Acc|Gender=Neut|Number=Sing|PronType=Int,Rel": {POS: DET, "morph": "Case=Acc|Gender=Neut|Number=Sing|PronType=Int,Rel"}, - "DET__Case=Acc|Gender=Neut|Number=Sing|PronType=Tot": {POS: DET, "morph": "Case=Acc|Gender=Neut|Number=Sing|PronType=Tot"}, - "DET__Case=Dat|Gender=Fem|Number=Plur|NumType=Card|PronType=Ind": {POS: DET, "morph": "Case=Dat|Gender=Fem|Number=Plur|NumType=Card|PronType=Ind"}, - "DET__Case=Dat|Gender=Fem|Number=Plur|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Case=Dat|Gender=Fem|Number=Plur|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Case=Dat|Gender=Fem|Number=Plur|PronType=Dem": {POS: DET, "morph": "Case=Dat|Gender=Fem|Number=Plur|PronType=Dem"}, - "DET__Case=Dat|Gender=Fem|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Dat|Gender=Fem|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Case=Dat|Gender=Fem|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Dat|Gender=Fem|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Case=Dat|Gender=Fem|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Dat|Gender=Fem|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs"}, - "DET__Case=Dat|Gender=Fem|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Case=Dat|Gender=Fem|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Case=Dat|Gender=Fem|Number=Sing|PronType=Dem": {POS: DET, "morph": "Case=Dat|Gender=Fem|Number=Sing|PronType=Dem"}, - "DET__Case=Dat|Gender=Fem|Number=Sing|PronType=Int,Rel": {POS: DET, "morph": "Case=Dat|Gender=Fem|Number=Sing|PronType=Int,Rel"}, - "DET__Case=Dat|Gender=Neut|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Dat|Gender=Neut|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Case=Dat|Gender=Neut|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Case=Dat|Gender=Neut|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Case=Dat|Gender=Neut|Number=Sing|PronType=Int,Rel": {POS: DET, "morph": "Case=Dat|Gender=Neut|Number=Sing|PronType=Int,Rel"}, - "DET__Case=Gen|Gender=Fem|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Gen|Gender=Fem|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Case=Gen|Gender=Fem|Number=Plur|NumType=Card|PronType=Ind": {POS: DET, "morph": "Case=Gen|Gender=Fem|Number=Plur|NumType=Card|PronType=Ind"}, - "DET__Case=Gen|Gender=Fem|Number=Plur|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Case=Gen|Gender=Fem|Number=Plur|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Case=Gen|Gender=Fem|Number=Plur|PronType=Dem": {POS: DET, "morph": "Case=Gen|Gender=Fem|Number=Plur|PronType=Dem"}, - "DET__Case=Gen|Gender=Fem|Number=Plur|PronType=Ind": {POS: DET, "morph": "Case=Gen|Gender=Fem|Number=Plur|PronType=Ind"}, - "DET__Case=Gen|Gender=Fem|Number=Plur|PronType=Int,Rel": {POS: DET, "morph": "Case=Gen|Gender=Fem|Number=Plur|PronType=Int,Rel"}, - 
"DET__Case=Gen|Gender=Fem|Number=Plur|PronType=Neg": {POS: DET, "morph": "Case=Gen|Gender=Fem|Number=Plur|PronType=Neg"}, - "DET__Case=Gen|Gender=Fem|Number=Plur|PronType=Tot": {POS: DET, "morph": "Case=Gen|Gender=Fem|Number=Plur|PronType=Tot"}, - "DET__Case=Gen|Gender=Fem|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Gen|Gender=Fem|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Case=Gen|Gender=Fem|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Gen|Gender=Fem|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Case=Gen|Gender=Fem|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Gen|Gender=Fem|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs"}, - "DET__Case=Gen|Gender=Fem|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Case=Gen|Gender=Fem|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Case=Gen|Gender=Fem|Number=Sing|PronType=Dem": {POS: DET, "morph": "Case=Gen|Gender=Fem|Number=Sing|PronType=Dem"}, - "DET__Case=Gen|Gender=Fem|Number=Sing|PronType=Ind": {POS: DET, "morph": "Case=Gen|Gender=Fem|Number=Sing|PronType=Ind"}, - "DET__Case=Gen|Gender=Fem|Number=Sing|PronType=Int,Rel": {POS: DET, "morph": "Case=Gen|Gender=Fem|Number=Sing|PronType=Int,Rel"}, - "DET__Case=Gen|Gender=Fem|Number=Sing|PronType=Neg": {POS: DET, "morph": "Case=Gen|Gender=Fem|Number=Sing|PronType=Neg"}, - "DET__Case=Gen|Gender=Fem|Number=Sing|PronType=Tot": {POS: DET, "morph": "Case=Gen|Gender=Fem|Number=Sing|PronType=Tot"}, - "DET__Case=Gen|Gender=Neut|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Gen|Gender=Neut|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Case=Gen|Gender=Neut|Number=Plur|NumType=Card|PronType=Ind": {POS: DET, "morph": "Case=Gen|Gender=Neut|Number=Plur|NumType=Card|PronType=Ind"}, - "DET__Case=Gen|Gender=Neut|Number=Plur|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Case=Gen|Gender=Neut|Number=Plur|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Case=Gen|Gender=Neut|Number=Plur|PronType=Dem": {POS: DET, "morph": "Case=Gen|Gender=Neut|Number=Plur|PronType=Dem"}, - "DET__Case=Gen|Gender=Neut|Number=Plur|PronType=Ind": {POS: DET, "morph": "Case=Gen|Gender=Neut|Number=Plur|PronType=Ind"}, - "DET__Case=Gen|Gender=Neut|Number=Plur|PronType=Int,Rel": {POS: DET, "morph": "Case=Gen|Gender=Neut|Number=Plur|PronType=Int,Rel"}, - "DET__Case=Gen|Gender=Neut|Number=Plur|PronType=Neg": {POS: DET, "morph": "Case=Gen|Gender=Neut|Number=Plur|PronType=Neg"}, - "DET__Case=Gen|Gender=Neut|Number=Plur|PronType=Tot": {POS: DET, "morph": "Case=Gen|Gender=Neut|Number=Plur|PronType=Tot"}, - "DET__Case=Gen|Gender=Neut|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Gen|Gender=Neut|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Case=Gen|Gender=Neut|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Gen|Gender=Neut|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Case=Gen|Gender=Neut|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Gen|Gender=Neut|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs"}, - "DET__Case=Gen|Gender=Neut|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Case=Gen|Gender=Neut|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes"}, - 
"DET__Case=Gen|Gender=Neut|Number=Sing|PronType=Dem": {POS: DET, "morph": "Case=Gen|Gender=Neut|Number=Sing|PronType=Dem"}, - "DET__Case=Gen|Gender=Neut|Number=Sing|PronType=Ind": {POS: DET, "morph": "Case=Gen|Gender=Neut|Number=Sing|PronType=Ind"}, - "DET__Case=Gen|Gender=Neut|Number=Sing|PronType=Int,Rel": {POS: DET, "morph": "Case=Gen|Gender=Neut|Number=Sing|PronType=Int,Rel"}, - "DET__Case=Gen|Gender=Neut|Number=Sing|PronType=Neg": {POS: DET, "morph": "Case=Gen|Gender=Neut|Number=Sing|PronType=Neg"}, - "DET__Case=Gen|Gender=Neut|Number=Sing|PronType=Tot": {POS: DET, "morph": "Case=Gen|Gender=Neut|Number=Sing|PronType=Tot"}, - "DET__Case=Ins|Gender=Fem|Number=Plur|NumType=Card|PronType=Ind": {POS: DET, "morph": "Case=Ins|Gender=Fem|Number=Plur|NumType=Card|PronType=Ind"}, - "DET__Case=Ins|Gender=Fem|Number=Plur|PronType=Dem": {POS: DET, "morph": "Case=Ins|Gender=Fem|Number=Plur|PronType=Dem"}, - "DET__Case=Ins|Gender=Fem|Number=Plur|PronType=Tot": {POS: DET, "morph": "Case=Ins|Gender=Fem|Number=Plur|PronType=Tot"}, - "DET__Case=Ins|Gender=Fem|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Ins|Gender=Fem|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Case=Ins|Gender=Fem|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Ins|Gender=Fem|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Case=Ins|Gender=Fem|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Ins|Gender=Fem|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs"}, - "DET__Case=Ins|Gender=Fem|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Case=Ins|Gender=Fem|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Case=Ins|Gender=Fem|Number=Sing|PronType=Dem": {POS: DET, "morph": "Case=Ins|Gender=Fem|Number=Sing|PronType=Dem"}, - "DET__Case=Ins|Gender=Fem|Number=Sing|PronType=Ind": {POS: DET, "morph": "Case=Ins|Gender=Fem|Number=Sing|PronType=Ind"}, - "DET__Case=Ins|Gender=Fem|Number=Sing|PronType=Int,Rel": {POS: DET, "morph": "Case=Ins|Gender=Fem|Number=Sing|PronType=Int,Rel"}, - "DET__Case=Ins|Gender=Fem|Number=Sing|PronType=Neg": {POS: DET, "morph": "Case=Ins|Gender=Fem|Number=Sing|PronType=Neg"}, - "DET__Case=Ins|Gender=Neut|Number=Plur|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Ins|Gender=Neut|Number=Plur|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Case=Ins|Gender=Neut|Number=Plur|NumType=Card|PronType=Ind": {POS: DET, "morph": "Case=Ins|Gender=Neut|Number=Plur|NumType=Card|PronType=Ind"}, - "DET__Case=Ins|Gender=Neut|Number=Plur|PronType=Dem": {POS: DET, "morph": "Case=Ins|Gender=Neut|Number=Plur|PronType=Dem"}, - "DET__Case=Ins|Gender=Neut|Number=Plur|PronType=Int,Rel": {POS: DET, "morph": "Case=Ins|Gender=Neut|Number=Plur|PronType=Int,Rel"}, - "DET__Case=Ins|Gender=Neut|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Ins|Gender=Neut|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Case=Ins|Gender=Neut|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Ins|Gender=Neut|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Case=Ins|Gender=Neut|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Ins|Gender=Neut|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs"}, - 
"DET__Case=Ins|Gender=Neut|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Case=Ins|Gender=Neut|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Case=Ins|Gender=Neut|Number=Sing|PronType=Ind": {POS: DET, "morph": "Case=Ins|Gender=Neut|Number=Sing|PronType=Ind"}, - "DET__Case=Ins|Gender=Neut|Number=Sing|PronType=Int,Rel": {POS: DET, "morph": "Case=Ins|Gender=Neut|Number=Sing|PronType=Int,Rel"}, - "DET__Case=Loc|Gender=Fem|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Loc|Gender=Fem|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Case=Loc|Gender=Fem|Number=Plur|NumType=Card|PronType=Ind": {POS: DET, "morph": "Case=Loc|Gender=Fem|Number=Plur|NumType=Card|PronType=Ind"}, - "DET__Case=Loc|Gender=Fem|Number=Plur|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Case=Loc|Gender=Fem|Number=Plur|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Case=Loc|Gender=Fem|Number=Plur|PronType=Dem": {POS: DET, "morph": "Case=Loc|Gender=Fem|Number=Plur|PronType=Dem"}, - "DET__Case=Loc|Gender=Fem|Number=Plur|PronType=Int,Rel": {POS: DET, "morph": "Case=Loc|Gender=Fem|Number=Plur|PronType=Int,Rel"}, - "DET__Case=Loc|Gender=Fem|Number=Plur|PronType=Neg": {POS: DET, "morph": "Case=Loc|Gender=Fem|Number=Plur|PronType=Neg"}, - "DET__Case=Loc|Gender=Fem|Number=Plur|PronType=Tot": {POS: DET, "morph": "Case=Loc|Gender=Fem|Number=Plur|PronType=Tot"}, - "DET__Case=Loc|Gender=Fem|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Loc|Gender=Fem|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Case=Loc|Gender=Fem|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Loc|Gender=Fem|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Case=Loc|Gender=Fem|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Case=Loc|Gender=Fem|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Case=Loc|Gender=Fem|Number=Sing|PronType=Dem": {POS: DET, "morph": "Case=Loc|Gender=Fem|Number=Sing|PronType=Dem"}, - "DET__Case=Loc|Gender=Fem|Number=Sing|PronType=Ind": {POS: DET, "morph": "Case=Loc|Gender=Fem|Number=Sing|PronType=Ind"}, - "DET__Case=Loc|Gender=Fem|Number=Sing|PronType=Int,Rel": {POS: DET, "morph": "Case=Loc|Gender=Fem|Number=Sing|PronType=Int,Rel"}, - "DET__Case=Loc|Gender=Fem|Number=Sing|PronType=Tot": {POS: DET, "morph": "Case=Loc|Gender=Fem|Number=Sing|PronType=Tot"}, - "DET__Case=Loc|Gender=Neut|Number=Plur|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Case=Loc|Gender=Neut|Number=Plur|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Case=Loc|Gender=Neut|Number=Plur|PronType=Dem": {POS: DET, "morph": "Case=Loc|Gender=Neut|Number=Plur|PronType=Dem"}, - "DET__Case=Loc|Gender=Neut|Number=Plur|PronType=Ind": {POS: DET, "morph": "Case=Loc|Gender=Neut|Number=Plur|PronType=Ind"}, - "DET__Case=Loc|Gender=Neut|Number=Plur|PronType=Int,Rel": {POS: DET, "morph": "Case=Loc|Gender=Neut|Number=Plur|PronType=Int,Rel"}, - "DET__Case=Loc|Gender=Neut|Number=Plur|PronType=Tot": {POS: DET, "morph": "Case=Loc|Gender=Neut|Number=Plur|PronType=Tot"}, - "DET__Case=Loc|Gender=Neut|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Loc|Gender=Neut|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Case=Loc|Gender=Neut|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": 
"Case=Loc|Gender=Neut|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Case=Loc|Gender=Neut|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Loc|Gender=Neut|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs"}, - "DET__Case=Loc|Gender=Neut|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes": {POS: DET, "morph": "Case=Loc|Gender=Neut|Number=Sing|Poss=Yes|PronType=Prs|Reflex=Yes"}, - "DET__Case=Loc|Gender=Neut|Number=Sing|PronType=Dem": {POS: DET, "morph": "Case=Loc|Gender=Neut|Number=Sing|PronType=Dem"}, - "DET__Case=Loc|Gender=Neut|Number=Sing|PronType=Int,Rel": {POS: DET, "morph": "Case=Loc|Gender=Neut|Number=Sing|PronType=Int,Rel"}, - "DET__Case=Loc|Gender=Neut|Number=Sing|PronType=Tot": {POS: DET, "morph": "Case=Loc|Gender=Neut|Number=Sing|PronType=Tot"}, - "DET__Case=Nom|Gender=Fem|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Nom|Gender=Fem|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Case=Nom|Gender=Fem|Number=Plur|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Nom|Gender=Fem|Number=Plur|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Case=Nom|Gender=Fem|Number=Plur|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Nom|Gender=Fem|Number=Plur|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs"}, - "DET__Case=Nom|Gender=Fem|Number=Plur|NumType=Card|PronType=Ind": {POS: DET, "morph": "Case=Nom|Gender=Fem|Number=Plur|NumType=Card|PronType=Ind"}, - "DET__Case=Nom|Gender=Fem|Number=Plur|PronType=Dem": {POS: DET, "morph": "Case=Nom|Gender=Fem|Number=Plur|PronType=Dem"}, - "DET__Case=Nom|Gender=Fem|Number=Plur|PronType=Ind": {POS: DET, "morph": "Case=Nom|Gender=Fem|Number=Plur|PronType=Ind"}, - "DET__Case=Nom|Gender=Fem|Number=Plur|PronType=Int,Rel": {POS: DET, "morph": "Case=Nom|Gender=Fem|Number=Plur|PronType=Int,Rel"}, - "DET__Case=Nom|Gender=Fem|Number=Plur|PronType=Tot": {POS: DET, "morph": "Case=Nom|Gender=Fem|Number=Plur|PronType=Tot"}, - "DET__Case=Nom|Gender=Fem|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Nom|Gender=Fem|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Case=Nom|Gender=Fem|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Nom|Gender=Fem|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Case=Nom|Gender=Fem|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Nom|Gender=Fem|Number=Sing|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs"}, - "DET__Case=Nom|Gender=Fem|Number=Sing|PronType=Dem": {POS: DET, "morph": "Case=Nom|Gender=Fem|Number=Sing|PronType=Dem"}, - "DET__Case=Nom|Gender=Fem|Number=Sing|PronType=Ind": {POS: DET, "morph": "Case=Nom|Gender=Fem|Number=Sing|PronType=Ind"}, - "DET__Case=Nom|Gender=Fem|Number=Sing|PronType=Int,Rel": {POS: DET, "morph": "Case=Nom|Gender=Fem|Number=Sing|PronType=Int,Rel"}, - "DET__Case=Nom|Gender=Fem|Number=Sing|PronType=Neg": {POS: DET, "morph": "Case=Nom|Gender=Fem|Number=Sing|PronType=Neg"}, - "DET__Case=Nom|Gender=Fem|Number=Sing|PronType=Tot": {POS: DET, "morph": "Case=Nom|Gender=Fem|Number=Sing|PronType=Tot"}, - "DET__Case=Nom|Gender=Neut|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Nom|Gender=Neut|Number=Plur|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - 
"DET__Case=Nom|Gender=Neut|Number=Plur|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Nom|Gender=Neut|Number=Plur|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Case=Nom|Gender=Neut|Number=Plur|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Nom|Gender=Neut|Number=Plur|Number[psor]=Sing|Person=2|Poss=Yes|PronType=Prs"}, - "DET__Case=Nom|Gender=Neut|Number=Plur|NumType=Card|PronType=Ind": {POS: DET, "morph": "Case=Nom|Gender=Neut|Number=Plur|NumType=Card|PronType=Ind"}, - "DET__Case=Nom|Gender=Neut|Number=Plur|PronType=Dem": {POS: DET, "morph": "Case=Nom|Gender=Neut|Number=Plur|PronType=Dem"}, - "DET__Case=Nom|Gender=Neut|Number=Plur|PronType=Ind": {POS: DET, "morph": "Case=Nom|Gender=Neut|Number=Plur|PronType=Ind"}, - "DET__Case=Nom|Gender=Neut|Number=Plur|PronType=Int,Rel": {POS: DET, "morph": "Case=Nom|Gender=Neut|Number=Plur|PronType=Int,Rel"}, - "DET__Case=Nom|Gender=Neut|Number=Plur|PronType=Neg": {POS: DET, "morph": "Case=Nom|Gender=Neut|Number=Plur|PronType=Neg"}, - "DET__Case=Nom|Gender=Neut|Number=Plur|PronType=Tot": {POS: DET, "morph": "Case=Nom|Gender=Neut|Number=Plur|PronType=Tot"}, - "DET__Case=Nom|Gender=Neut|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Nom|Gender=Neut|Number=Sing|Number[psor]=Plur|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Case=Nom|Gender=Neut|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs": {POS: DET, "morph": "Case=Nom|Gender=Neut|Number=Sing|Number[psor]=Sing|Person=1|Poss=Yes|PronType=Prs"}, - "DET__Case=Nom|Gender=Neut|Number=Sing|PronType=Dem": {POS: DET, "morph": "Case=Nom|Gender=Neut|Number=Sing|PronType=Dem"}, - "DET__Case=Nom|Gender=Neut|Number=Sing|PronType=Int,Rel": {POS: DET, "morph": "Case=Nom|Gender=Neut|Number=Sing|PronType=Int,Rel"}, - "DET__Case=Nom|Gender=Neut|Number=Sing|PronType=Neg": {POS: DET, "morph": "Case=Nom|Gender=Neut|Number=Sing|PronType=Neg"}, - "DET__Case=Nom|Gender=Neut|Number=Sing|PronType=Tot": {POS: DET, "morph": "Case=Nom|Gender=Neut|Number=Sing|PronType=Tot"}, - "NOUN__Animacy=Hum|Case=Acc|Gender=Masc|Number=Plur": {POS: NOUN, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Plur"}, - "NOUN__Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing": {POS: NOUN, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing"}, - "NOUN__Animacy=Hum|Case=Dat|Gender=Masc|Number=Plur": {POS: NOUN, "morph": "Animacy=Hum|Case=Dat|Gender=Masc|Number=Plur"}, - "NOUN__Animacy=Hum|Case=Dat|Gender=Masc|Number=Sing": {POS: NOUN, "morph": "Animacy=Hum|Case=Dat|Gender=Masc|Number=Sing"}, - "NOUN__Animacy=Hum|Case=Gen|Gender=Masc|Number=Plur": {POS: NOUN, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Plur"}, - "NOUN__Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing": {POS: NOUN, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing"}, - "NOUN__Animacy=Hum|Case=Ins|Gender=Masc|Number=Plur": {POS: NOUN, "morph": "Animacy=Hum|Case=Ins|Gender=Masc|Number=Plur"}, - "NOUN__Animacy=Hum|Case=Ins|Gender=Masc|Number=Sing": {POS: NOUN, "morph": "Animacy=Hum|Case=Ins|Gender=Masc|Number=Sing"}, - "NOUN__Animacy=Hum|Case=Loc|Gender=Masc|Number=Plur": {POS: NOUN, "morph": "Animacy=Hum|Case=Loc|Gender=Masc|Number=Plur"}, - "NOUN__Animacy=Hum|Case=Loc|Gender=Masc|Number=Sing": {POS: NOUN, "morph": "Animacy=Hum|Case=Loc|Gender=Masc|Number=Sing"}, - "NOUN__Animacy=Hum|Case=Nom|Gender=Masc|Number=Plur": {POS: NOUN, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Plur"}, - "NOUN__Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing": {POS: NOUN, "morph": 
"Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing"}, - "NOUN__Animacy=Hum|Case=Voc|Gender=Masc|Number=Plur": {POS: NOUN, "morph": "Animacy=Hum|Case=Voc|Gender=Masc|Number=Plur"}, - "NOUN__Animacy=Hum|Case=Voc|Gender=Masc|Number=Sing": {POS: NOUN, "morph": "Animacy=Hum|Case=Voc|Gender=Masc|Number=Sing"}, - "NOUN__Animacy=Inan|Case=Acc|Gender=Masc|Number=Plur": {POS: NOUN, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|Number=Plur"}, - "NOUN__Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing": {POS: NOUN, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing"}, - "NOUN__Animacy=Inan|Case=Dat|Gender=Masc|Number=Plur": {POS: NOUN, "morph": "Animacy=Inan|Case=Dat|Gender=Masc|Number=Plur"}, - "NOUN__Animacy=Inan|Case=Dat|Gender=Masc|Number=Sing": {POS: NOUN, "morph": "Animacy=Inan|Case=Dat|Gender=Masc|Number=Sing"}, - "NOUN__Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur": {POS: NOUN, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur"}, - "NOUN__Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing": {POS: NOUN, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing"}, - "NOUN__Animacy=Inan|Case=Ins|Gender=Masc|Number=Plur": {POS: NOUN, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|Number=Plur"}, - "NOUN__Animacy=Inan|Case=Ins|Gender=Masc|Number=Sing": {POS: NOUN, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|Number=Sing"}, - "NOUN__Animacy=Inan|Case=Loc|Gender=Masc|Number=Plur": {POS: NOUN, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|Number=Plur"}, - "NOUN__Animacy=Inan|Case=Loc|Gender=Masc|Number=Sing": {POS: NOUN, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|Number=Sing"}, - "NOUN__Animacy=Inan|Case=Nom|Gender=Masc|Number=Plur": {POS: NOUN, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|Number=Plur"}, - "NOUN__Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing": {POS: NOUN, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing"}, - "NOUN__Animacy=Nhum|Case=Acc|Gender=Masc|Number=Plur": {POS: NOUN, "morph": "Animacy=Nhum|Case=Acc|Gender=Masc|Number=Plur"}, - "NOUN__Animacy=Nhum|Case=Acc|Gender=Masc|Number=Sing": {POS: NOUN, "morph": "Animacy=Nhum|Case=Acc|Gender=Masc|Number=Sing"}, - "NOUN__Animacy=Nhum|Case=Dat|Gender=Masc|Number=Plur": {POS: NOUN, "morph": "Animacy=Nhum|Case=Dat|Gender=Masc|Number=Plur"}, - "NOUN__Animacy=Nhum|Case=Dat|Gender=Masc|Number=Sing": {POS: NOUN, "morph": "Animacy=Nhum|Case=Dat|Gender=Masc|Number=Sing"}, - "NOUN__Animacy=Nhum|Case=Gen|Gender=Masc|Number=Plur": {POS: NOUN, "morph": "Animacy=Nhum|Case=Gen|Gender=Masc|Number=Plur"}, - "NOUN__Animacy=Nhum|Case=Gen|Gender=Masc|Number=Sing": {POS: NOUN, "morph": "Animacy=Nhum|Case=Gen|Gender=Masc|Number=Sing"}, - "NOUN__Animacy=Nhum|Case=Ins|Gender=Masc|Number=Plur": {POS: NOUN, "morph": "Animacy=Nhum|Case=Ins|Gender=Masc|Number=Plur"}, - "NOUN__Animacy=Nhum|Case=Ins|Gender=Masc|Number=Sing": {POS: NOUN, "morph": "Animacy=Nhum|Case=Ins|Gender=Masc|Number=Sing"}, - "NOUN__Animacy=Nhum|Case=Loc|Gender=Masc|Number=Sing": {POS: NOUN, "morph": "Animacy=Nhum|Case=Loc|Gender=Masc|Number=Sing"}, - "NOUN__Animacy=Nhum|Case=Nom|Gender=Masc|Number=Plur": {POS: NOUN, "morph": "Animacy=Nhum|Case=Nom|Gender=Masc|Number=Plur"}, - "NOUN__Animacy=Nhum|Case=Nom|Gender=Masc|Number=Sing": {POS: NOUN, "morph": "Animacy=Nhum|Case=Nom|Gender=Masc|Number=Sing"}, - "NOUN__Animacy=Nhum|Case=Voc|Gender=Masc|Number=Plur": {POS: NOUN, "morph": "Animacy=Nhum|Case=Voc|Gender=Masc|Number=Plur"}, - "NOUN__Aspect=Imp|Case=Acc|Gender=Neut|Number=Sing|Polarity=Neg|VerbForm=Vnoun": {POS: NOUN, "morph": "Aspect=Imp|Case=Acc|Gender=Neut|Number=Sing|Polarity=Neg|VerbForm=Vnoun"}, - 
"NOUN__Aspect=Imp|Case=Acc|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Vnoun": {POS: NOUN, "morph": "Aspect=Imp|Case=Acc|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Vnoun"}, - "NOUN__Aspect=Imp|Case=Dat|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Vnoun": {POS: NOUN, "morph": "Aspect=Imp|Case=Dat|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Vnoun"}, - "NOUN__Aspect=Imp|Case=Gen|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Vnoun": {POS: NOUN, "morph": "Aspect=Imp|Case=Gen|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Vnoun"}, - "NOUN__Aspect=Imp|Case=Ins|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Vnoun": {POS: NOUN, "morph": "Aspect=Imp|Case=Ins|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Vnoun"}, - "NOUN__Aspect=Imp|Case=Loc|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Vnoun": {POS: NOUN, "morph": "Aspect=Imp|Case=Loc|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Vnoun"}, - "NOUN__Aspect=Imp|Case=Nom|Gender=Neut|Number=Plur|Polarity=Pos|VerbForm=Vnoun": {POS: NOUN, "morph": "Aspect=Imp|Case=Nom|Gender=Neut|Number=Plur|Polarity=Pos|VerbForm=Vnoun"}, - "NOUN__Aspect=Imp|Case=Nom|Gender=Neut|Number=Sing|Polarity=Neg|VerbForm=Vnoun": {POS: NOUN, "morph": "Aspect=Imp|Case=Nom|Gender=Neut|Number=Sing|Polarity=Neg|VerbForm=Vnoun"}, - "NOUN__Aspect=Imp|Case=Nom|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Vnoun": {POS: NOUN, "morph": "Aspect=Imp|Case=Nom|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Vnoun"}, - "NOUN__Aspect=Perf|Case=Acc|Gender=Neut|Number=Sing|Polarity=Neg|VerbForm=Vnoun": {POS: NOUN, "morph": "Aspect=Perf|Case=Acc|Gender=Neut|Number=Sing|Polarity=Neg|VerbForm=Vnoun"}, - "NOUN__Aspect=Perf|Case=Acc|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Vnoun": {POS: NOUN, "morph": "Aspect=Perf|Case=Acc|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Vnoun"}, - "NOUN__Aspect=Perf|Case=Dat|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Vnoun": {POS: NOUN, "morph": "Aspect=Perf|Case=Dat|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Vnoun"}, - "NOUN__Aspect=Perf|Case=Gen|Gender=Neut|Number=Sing|Polarity=Neg|VerbForm=Vnoun": {POS: NOUN, "morph": "Aspect=Perf|Case=Gen|Gender=Neut|Number=Sing|Polarity=Neg|VerbForm=Vnoun"}, - "NOUN__Aspect=Perf|Case=Gen|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Vnoun": {POS: NOUN, "morph": "Aspect=Perf|Case=Gen|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Vnoun"}, - "NOUN__Aspect=Perf|Case=Ins|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Vnoun": {POS: NOUN, "morph": "Aspect=Perf|Case=Ins|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Vnoun"}, - "NOUN__Aspect=Perf|Case=Loc|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Vnoun": {POS: NOUN, "morph": "Aspect=Perf|Case=Loc|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Vnoun"}, - "NOUN__Aspect=Perf|Case=Nom|Gender=Neut|Number=Sing|Polarity=Neg|VerbForm=Vnoun": {POS: NOUN, "morph": "Aspect=Perf|Case=Nom|Gender=Neut|Number=Sing|Polarity=Neg|VerbForm=Vnoun"}, - "NOUN__Aspect=Perf|Case=Nom|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Vnoun": {POS: NOUN, "morph": "Aspect=Perf|Case=Nom|Gender=Neut|Number=Sing|Polarity=Pos|VerbForm=Vnoun"}, - "NOUN__Case=Acc|Gender=Fem|Number=Plur": {POS: NOUN, "morph": "Case=Acc|Gender=Fem|Number=Plur"}, - "NOUN__Case=Acc|Gender=Fem|Number=Sing": {POS: NOUN, "morph": "Case=Acc|Gender=Fem|Number=Sing"}, - "NOUN__Case=Acc|Gender=Neut|Number=Plur": {POS: NOUN, "morph": "Case=Acc|Gender=Neut|Number=Plur"}, - "NOUN__Case=Acc|Gender=Neut|Number=Sing": {POS: NOUN, "morph": "Case=Acc|Gender=Neut|Number=Sing"}, - "NOUN__Case=Dat|Gender=Fem|Number=Plur": {POS: NOUN, "morph": 
"Case=Dat|Gender=Fem|Number=Plur"}, - "NOUN__Case=Dat|Gender=Fem|Number=Sing": {POS: NOUN, "morph": "Case=Dat|Gender=Fem|Number=Sing"}, - "NOUN__Case=Dat|Gender=Neut|Number=Plur": {POS: NOUN, "morph": "Case=Dat|Gender=Neut|Number=Plur"}, - "NOUN__Case=Dat|Gender=Neut|Number=Sing": {POS: NOUN, "morph": "Case=Dat|Gender=Neut|Number=Sing"}, - "NOUN__Case=Gen|Gender=Fem|Number=Plur": {POS: NOUN, "morph": "Case=Gen|Gender=Fem|Number=Plur"}, - "NOUN__Case=Gen|Gender=Fem|Number=Sing": {POS: NOUN, "morph": "Case=Gen|Gender=Fem|Number=Sing"}, - "NOUN__Case=Gen|Gender=Neut|Number=Plur": {POS: NOUN, "morph": "Case=Gen|Gender=Neut|Number=Plur"}, - "NOUN__Case=Gen|Gender=Neut|Number=Sing": {POS: NOUN, "morph": "Case=Gen|Gender=Neut|Number=Sing"}, - "NOUN__Case=Ins|Gender=Fem|Number=Plur": {POS: NOUN, "morph": "Case=Ins|Gender=Fem|Number=Plur"}, - "NOUN__Case=Ins|Gender=Fem|Number=Sing": {POS: NOUN, "morph": "Case=Ins|Gender=Fem|Number=Sing"}, - "NOUN__Case=Ins|Gender=Neut|Number=Plur": {POS: NOUN, "morph": "Case=Ins|Gender=Neut|Number=Plur"}, - "NOUN__Case=Ins|Gender=Neut|Number=Sing": {POS: NOUN, "morph": "Case=Ins|Gender=Neut|Number=Sing"}, - "NOUN__Case=Loc|Gender=Fem|Number=Plur": {POS: NOUN, "morph": "Case=Loc|Gender=Fem|Number=Plur"}, - "NOUN__Case=Loc|Gender=Fem|Number=Sing": {POS: NOUN, "morph": "Case=Loc|Gender=Fem|Number=Sing"}, - "NOUN__Case=Loc|Gender=Neut|Number=Plur": {POS: NOUN, "morph": "Case=Loc|Gender=Neut|Number=Plur"}, - "NOUN__Case=Loc|Gender=Neut|Number=Sing": {POS: NOUN, "morph": "Case=Loc|Gender=Neut|Number=Sing"}, - "NOUN__Case=Nom|Gender=Fem|Number=Plur": {POS: NOUN, "morph": "Case=Nom|Gender=Fem|Number=Plur"}, - "NOUN__Case=Nom|Gender=Fem|Number=Sing": {POS: NOUN, "morph": "Case=Nom|Gender=Fem|Number=Sing"}, - "NOUN__Case=Nom|Gender=Neut|Number=Plur": {POS: NOUN, "morph": "Case=Nom|Gender=Neut|Number=Plur"}, - "NOUN__Case=Nom|Gender=Neut|Number=Sing": {POS: NOUN, "morph": "Case=Nom|Gender=Neut|Number=Sing"}, - "NOUN__Case=Voc|Gender=Fem|Number=Sing": {POS: NOUN, "morph": "Case=Voc|Gender=Fem|Number=Sing"}, - "NOUN__Case=Voc|Gender=Neut|Number=Plur": {POS: NOUN, "morph": "Case=Voc|Gender=Neut|Number=Plur"}, - "NOUN__Case=Voc|Gender=Neut|Number=Sing": {POS: NOUN, "morph": "Case=Voc|Gender=Neut|Number=Sing"}, - "NUM__Animacy=Hum|Case=Acc|Gender=Masc|Number=Plur": {POS: NUM, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Plur"}, - "NUM__Animacy=Hum|Case=Dat|Gender=Masc|Number=Plur": {POS: NUM, "morph": "Animacy=Hum|Case=Dat|Gender=Masc|Number=Plur"}, - "NUM__Animacy=Hum|Case=Gen|Gender=Masc|Number=Plur": {POS: NUM, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Plur"}, - "NUM__Animacy=Hum|Case=Ins|Gender=Masc|Number=Plur": {POS: NUM, "morph": "Animacy=Hum|Case=Ins|Gender=Masc|Number=Plur"}, - "NUM__Animacy=Hum|Case=Nom|Gender=Masc|Number=Plur": {POS: NUM, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Plur"}, - "NUM__Animacy=Inan|Case=Acc|Gender=Masc|Number=Plur": {POS: NUM, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|Number=Plur"}, - "NUM__Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing": {POS: NUM, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing"}, - "NUM__Animacy=Inan|Case=Dat|Gender=Masc|Number=Plur": {POS: NUM, "morph": "Animacy=Inan|Case=Dat|Gender=Masc|Number=Plur"}, - "NUM__Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur": {POS: NUM, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur"}, - "NUM__Animacy=Inan|Case=Ins|Gender=Masc|Number=Plur": {POS: NUM, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|Number=Plur"}, - 
"NUM__Animacy=Inan|Case=Loc|Gender=Masc|Number=Plur": {POS: NUM, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|Number=Plur"}, - "NUM__Animacy=Inan|Case=Loc|Gender=Masc|Number=Sing": {POS: NUM, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|Number=Sing"}, - "NUM__Animacy=Inan|Case=Nom|Gender=Masc|Number=Plur": {POS: NUM, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|Number=Plur"}, - "NUM__Animacy=Nhum|Case=Acc|Gender=Masc|Number=Plur": {POS: NUM, "morph": "Animacy=Nhum|Case=Acc|Gender=Masc|Number=Plur"}, - "NUM__Animacy=Nhum|Case=Gen|Gender=Masc|Number=Plur": {POS: NUM, "morph": "Animacy=Nhum|Case=Gen|Gender=Masc|Number=Plur"}, - "NUM__Animacy=Nhum|Case=Ins|Gender=Masc|Number=Plur": {POS: NUM, "morph": "Animacy=Nhum|Case=Ins|Gender=Masc|Number=Plur"}, - "NUM__Animacy=Nhum|Case=Nom|Gender=Masc|Number=Plur": {POS: NUM, "morph": "Animacy=Nhum|Case=Nom|Gender=Masc|Number=Plur"}, - "NUM__Case=Acc|Gender=Fem|Number=Plur": {POS: NUM, "morph": "Case=Acc|Gender=Fem|Number=Plur"}, - "NUM__Case=Acc|Gender=Fem|Number=Sing": {POS: NUM, "morph": "Case=Acc|Gender=Fem|Number=Sing"}, - "NUM__Case=Acc|Gender=Neut|Number=Plur": {POS: NUM, "morph": "Case=Acc|Gender=Neut|Number=Plur"}, - "NUM__Case=Dat|Gender=Fem|Number=Plur": {POS: NUM, "morph": "Case=Dat|Gender=Fem|Number=Plur"}, - "NUM__Case=Dat|Gender=Neut|Number=Plur": {POS: NUM, "morph": "Case=Dat|Gender=Neut|Number=Plur"}, - "NUM__Case=Gen|Gender=Fem|Number=Plur": {POS: NUM, "morph": "Case=Gen|Gender=Fem|Number=Plur"}, - "NUM__Case=Gen|Gender=Neut|Number=Plur": {POS: NUM, "morph": "Case=Gen|Gender=Neut|Number=Plur"}, - "NUM__Case=Ins|Gender=Fem|Number=Plur": {POS: NUM, "morph": "Case=Ins|Gender=Fem|Number=Plur"}, - "NUM__Case=Ins|Gender=Neut|Number=Plur": {POS: NUM, "morph": "Case=Ins|Gender=Neut|Number=Plur"}, - "NUM__Case=Loc|Gender=Fem|Number=Plur": {POS: NUM, "morph": "Case=Loc|Gender=Fem|Number=Plur"}, - "NUM__Case=Loc|Gender=Neut|Number=Plur": {POS: NUM, "morph": "Case=Loc|Gender=Neut|Number=Plur"}, - "NUM__Case=Nom|Gender=Fem|Number=Plur": {POS: NUM, "morph": "Case=Nom|Gender=Fem|Number=Plur"}, - "NUM__Case=Nom|Gender=Neut|Number=Plur": {POS: NUM, "morph": "Case=Nom|Gender=Neut|Number=Plur"}, - "NUM__Case=Nom|Number=Plur": {POS: NUM, "morph": "Case=Nom|Number=Plur"}, - "PART___": {POS: PART}, - "PRON__Animacy=Hum|Case=Acc|Gender=Masc|Number=Plur|Person=1|PronType=Prs": {POS: PRON, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Plur|Person=1|PronType=Prs"}, - "PRON__Animacy=Hum|Case=Acc|Gender=Masc|Number=Plur|Person=2|PronType=Prs": {POS: PRON, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Plur|Person=2|PronType=Prs"}, - "PRON__Animacy=Hum|Case=Acc|Gender=Masc|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Hum|Case=Acc|Gender=Masc|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Hum|Case=Acc|Gender=Masc|Number=Plur|PronType=Tot": {POS: PRON, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Plur|PronType=Tot"}, - "PRON__Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|Person=1|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|Person=1|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|Person=2|PronType=Prs|Variant=Long": {POS: PRON, "morph": 
"Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|Person=2|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|Person=2|PronType=Prs|Variant=Short": {POS: PRON, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|Person=2|PronType=Prs|Variant=Short"}, - "PRON__Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Short": {POS: PRON, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Short"}, - "PRON__Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Short": {POS: PRON, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Short"}, - "PRON__Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|PronType=Ind": {POS: PRON, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|PronType=Ind"}, - "PRON__Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|PronType=Int,Rel": {POS: PRON, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|PronType=Int,Rel"}, - "PRON__Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|PronType=Neg": {POS: PRON, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing|PronType=Neg"}, - "PRON__Animacy=Hum|Case=Dat|Gender=Masc|Number=Plur|Person=1|PronType=Prs": {POS: PRON, "morph": "Animacy=Hum|Case=Dat|Gender=Masc|Number=Plur|Person=1|PronType=Prs"}, - "PRON__Animacy=Hum|Case=Dat|Gender=Masc|Number=Plur|Person=2|PronType=Prs": {POS: PRON, "morph": "Animacy=Hum|Case=Dat|Gender=Masc|Number=Plur|Person=2|PronType=Prs"}, - "PRON__Animacy=Hum|Case=Dat|Gender=Masc|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Hum|Case=Dat|Gender=Masc|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Hum|Case=Dat|Gender=Masc|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Hum|Case=Dat|Gender=Masc|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Hum|Case=Dat|Gender=Masc|Number=Plur|PronType=Tot": {POS: PRON, "morph": "Animacy=Hum|Case=Dat|Gender=Masc|Number=Plur|PronType=Tot"}, - "PRON__Animacy=Hum|Case=Dat|Gender=Masc|Number=Sing|Person=1|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Hum|Case=Dat|Gender=Masc|Number=Sing|Person=1|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Hum|Case=Dat|Gender=Masc|Number=Sing|Person=1|PronType=Prs|Variant=Short": {POS: PRON, "morph": "Animacy=Hum|Case=Dat|Gender=Masc|Number=Sing|Person=1|PronType=Prs|Variant=Short"}, - "PRON__Animacy=Hum|Case=Dat|Gender=Masc|Number=Sing|Person=2|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Hum|Case=Dat|Gender=Masc|Number=Sing|Person=2|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Hum|Case=Dat|Gender=Masc|Number=Sing|Person=2|PronType=Prs|Variant=Short": {POS: PRON, "morph": "Animacy=Hum|Case=Dat|Gender=Masc|Number=Sing|Person=2|PronType=Prs|Variant=Short"}, - "PRON__Animacy=Hum|Case=Dat|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Hum|Case=Dat|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Hum|Case=Dat|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Short": {POS: PRON, "morph": 
"Animacy=Hum|Case=Dat|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Short"}, - "PRON__Animacy=Hum|Case=Dat|Gender=Masc|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Hum|Case=Dat|Gender=Masc|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Hum|Case=Dat|Gender=Masc|Number=Sing|PronType=Ind": {POS: PRON, "morph": "Animacy=Hum|Case=Dat|Gender=Masc|Number=Sing|PronType=Ind"}, - "PRON__Animacy=Hum|Case=Dat|Gender=Masc|Number=Sing|PronType=Int,Rel": {POS: PRON, "morph": "Animacy=Hum|Case=Dat|Gender=Masc|Number=Sing|PronType=Int,Rel"}, - "PRON__Animacy=Hum|Case=Dat|Gender=Masc|Number=Sing|PronType=Neg": {POS: PRON, "morph": "Animacy=Hum|Case=Dat|Gender=Masc|Number=Sing|PronType=Neg"}, - "PRON__Animacy=Hum|Case=Gen|Gender=Masc|Number=Plur|Person=1|PronType=Prs": {POS: PRON, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Plur|Person=1|PronType=Prs"}, - "PRON__Animacy=Hum|Case=Gen|Gender=Masc|Number=Plur|Person=2|PronType=Prs": {POS: PRON, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Plur|Person=2|PronType=Prs"}, - "PRON__Animacy=Hum|Case=Gen|Gender=Masc|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Hum|Case=Gen|Gender=Masc|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Hum|Case=Gen|Gender=Masc|Number=Plur|PronType=Tot": {POS: PRON, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Plur|PronType=Tot"}, - "PRON__Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing|Person=1|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing|Person=1|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing|Person=2|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing|Person=2|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing|Person=2|PronType=Prs|Variant=Short": {POS: PRON, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing|Person=2|PronType=Prs|Variant=Short"}, - "PRON__Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Short": {POS: PRON, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Short"}, - "PRON__Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing|PronType=Ind": {POS: PRON, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing|PronType=Ind"}, - "PRON__Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing|PronType=Int,Rel": {POS: PRON, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing|PronType=Int,Rel"}, - "PRON__Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing|PronType=Neg": {POS: PRON, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing|PronType=Neg"}, - "PRON__Animacy=Hum|Case=Ins|Gender=Masc|Number=Plur|Person=1|PronType=Prs": {POS: PRON, "morph": 
"Animacy=Hum|Case=Ins|Gender=Masc|Number=Plur|Person=1|PronType=Prs"}, - "PRON__Animacy=Hum|Case=Ins|Gender=Masc|Number=Plur|Person=2|PronType=Prs": {POS: PRON, "morph": "Animacy=Hum|Case=Ins|Gender=Masc|Number=Plur|Person=2|PronType=Prs"}, - "PRON__Animacy=Hum|Case=Ins|Gender=Masc|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Hum|Case=Ins|Gender=Masc|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Hum|Case=Ins|Gender=Masc|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Short": {POS: PRON, "morph": "Animacy=Hum|Case=Ins|Gender=Masc|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Short"}, - "PRON__Animacy=Hum|Case=Ins|Gender=Masc|Number=Sing|Person=1|PronType=Prs": {POS: PRON, "morph": "Animacy=Hum|Case=Ins|Gender=Masc|Number=Sing|Person=1|PronType=Prs"}, - "PRON__Animacy=Hum|Case=Ins|Gender=Masc|Number=Sing|Person=2|PronType=Prs": {POS: PRON, "morph": "Animacy=Hum|Case=Ins|Gender=Masc|Number=Sing|Person=2|PronType=Prs"}, - "PRON__Animacy=Hum|Case=Ins|Gender=Masc|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Hum|Case=Ins|Gender=Masc|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Hum|Case=Ins|Gender=Masc|Number=Sing|PronType=Int,Rel": {POS: PRON, "morph": "Animacy=Hum|Case=Ins|Gender=Masc|Number=Sing|PronType=Int,Rel"}, - "PRON__Animacy=Hum|Case=Loc|Gender=Masc|Number=Sing|Person=1|PronType=Prs": {POS: PRON, "morph": "Animacy=Hum|Case=Loc|Gender=Masc|Number=Sing|Person=1|PronType=Prs"}, - "PRON__Animacy=Hum|Case=Loc|Gender=Masc|Number=Sing|Person=2|PronType=Prs": {POS: PRON, "morph": "Animacy=Hum|Case=Loc|Gender=Masc|Number=Sing|Person=2|PronType=Prs"}, - "PRON__Animacy=Hum|Case=Loc|Gender=Masc|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Hum|Case=Loc|Gender=Masc|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Hum|Case=Loc|Gender=Masc|Number=Sing|PronType=Ind": {POS: PRON, "morph": "Animacy=Hum|Case=Loc|Gender=Masc|Number=Sing|PronType=Ind"}, - "PRON__Animacy=Hum|Case=Nom|Gender=Masc|Number=Plur|Person=1|PronType=Prs": {POS: PRON, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Plur|Person=1|PronType=Prs"}, - "PRON__Animacy=Hum|Case=Nom|Gender=Masc|Number=Plur|Person=2|PronType=Prs": {POS: PRON, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Plur|Person=2|PronType=Prs"}, - "PRON__Animacy=Hum|Case=Nom|Gender=Masc|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Hum|Case=Nom|Gender=Masc|Number=Plur|PronType=Tot": {POS: PRON, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Plur|PronType=Tot"}, - "PRON__Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|Person=1|PronType=Prs": {POS: PRON, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|Person=1|PronType=Prs"}, - "PRON__Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|Person=2|PronType=Prs": {POS: PRON, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|Person=2|PronType=Prs"}, - "PRON__Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|PronType=Ind": {POS: PRON, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|PronType=Ind"}, - 
"PRON__Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|PronType=Int,Rel": {POS: PRON, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|PronType=Int,Rel"}, - "PRON__Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|PronType=Neg": {POS: PRON, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing|PronType=Neg"}, - "PRON__Animacy=Hum|Case=Voc|Gender=Masc|Number=Sing|Person=2|PronType=Prs": {POS: PRON, "morph": "Animacy=Hum|Case=Voc|Gender=Masc|Number=Sing|Person=2|PronType=Prs"}, - "PRON__Animacy=Inan|Case=Acc|Gender=Masc|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Inan|Case=Acc|Gender=Masc|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Short": {POS: PRON, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Short"}, - "PRON__Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Inan|Case=Dat|Gender=Masc|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Inan|Case=Dat|Gender=Masc|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Inan|Case=Dat|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Short": {POS: PRON, "morph": "Animacy=Inan|Case=Dat|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Short"}, - "PRON__Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Short": {POS: PRON, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Short"}, - "PRON__Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Inan|Case=Ins|Gender=Masc|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Inan|Case=Ins|Gender=Masc|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Inan|Case=Loc|Gender=Masc|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": 
"Animacy=Inan|Case=Loc|Gender=Masc|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Inan|Case=Loc|Gender=Masc|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Inan|Case=Nom|Gender=Masc|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Nhum|Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Short": {POS: PRON, "morph": "Animacy=Nhum|Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Short"}, - "PRON__Animacy=Nhum|Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Nhum|Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Nhum|Case=Dat|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Short": {POS: PRON, "morph": "Animacy=Nhum|Case=Dat|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Short"}, - "PRON__Animacy=Nhum|Case=Gen|Gender=Masc|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Nhum|Case=Gen|Gender=Masc|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Nhum|Case=Gen|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Nhum|Case=Gen|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Nhum|Case=Gen|Gender=Masc|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Nhum|Case=Gen|Gender=Masc|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Nhum|Case=Ins|Gender=Masc|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Nhum|Case=Ins|Gender=Masc|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Nhum|Case=Ins|Gender=Masc|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Nhum|Case=Ins|Gender=Masc|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Animacy=Nhum|Case=Loc|Gender=Masc|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Animacy=Nhum|Case=Loc|Gender=Masc|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Case=Acc|Gender=Fem|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Acc|Gender=Fem|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long"}, - "PRON__Case=Acc|Gender=Fem|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Acc|Gender=Fem|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Case=Acc|Gender=Fem|Number=Sing|Person=1|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Acc|Gender=Fem|Number=Sing|Person=1|PronType=Prs|Variant=Long"}, - "PRON__Case=Acc|Gender=Fem|Number=Sing|Person=2|PronType=Prs|Variant=Short": {POS: PRON, "morph": "Case=Acc|Gender=Fem|Number=Sing|Person=2|PronType=Prs|Variant=Short"}, - 
"PRON__Case=Acc|Gender=Fem|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Acc|Gender=Fem|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long"}, - "PRON__Case=Acc|Gender=Fem|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Acc|Gender=Fem|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Case=Acc|Gender=Neut|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Acc|Gender=Neut|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long"}, - "PRON__Case=Acc|Gender=Neut|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Acc|Gender=Neut|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Case=Acc|Gender=Neut|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Acc|Gender=Neut|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long"}, - "PRON__Case=Acc|Gender=Neut|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Acc|Gender=Neut|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Case=Acc|Gender=Neut|Number=Sing|PronType=Dem": {POS: PRON, "morph": "Case=Acc|Gender=Neut|Number=Sing|PronType=Dem"}, - "PRON__Case=Acc|Gender=Neut|Number=Sing|PronType=Ind": {POS: PRON, "morph": "Case=Acc|Gender=Neut|Number=Sing|PronType=Ind"}, - "PRON__Case=Acc|Gender=Neut|Number=Sing|PronType=Int,Rel": {POS: PRON, "morph": "Case=Acc|Gender=Neut|Number=Sing|PronType=Int,Rel"}, - "PRON__Case=Acc|Gender=Neut|Number=Sing|PronType=Neg": {POS: PRON, "morph": "Case=Acc|Gender=Neut|Number=Sing|PronType=Neg"}, - "PRON__Case=Acc|Gender=Neut|Number=Sing|PronType=Tot": {POS: PRON, "morph": "Case=Acc|Gender=Neut|Number=Sing|PronType=Tot"}, - "PRON__Case=Acc|PronType=Prs|Reflex=Yes": {POS: PRON, "morph": "Case=Acc|PronType=Prs|Reflex=Yes"}, - "PRON__Case=Dat|Gender=Fem|Number=Plur|Person=1|PronType=Prs": {POS: PRON, "morph": "Case=Dat|Gender=Fem|Number=Plur|Person=1|PronType=Prs"}, - "PRON__Case=Dat|Gender=Fem|Number=Plur|Person=2|PronType=Prs": {POS: PRON, "morph": "Case=Dat|Gender=Fem|Number=Plur|Person=2|PronType=Prs"}, - "PRON__Case=Dat|Gender=Fem|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Dat|Gender=Fem|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long"}, - "PRON__Case=Dat|Gender=Fem|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Dat|Gender=Fem|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Case=Dat|Gender=Fem|Number=Sing|Person=1|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Dat|Gender=Fem|Number=Sing|Person=1|PronType=Prs|Variant=Long"}, - "PRON__Case=Dat|Gender=Fem|Number=Sing|Person=1|PronType=Prs|Variant=Short": {POS: PRON, "morph": "Case=Dat|Gender=Fem|Number=Sing|Person=1|PronType=Prs|Variant=Short"}, - "PRON__Case=Dat|Gender=Fem|Number=Sing|Person=2|PronType=Prs|Variant=Short": {POS: PRON, "morph": "Case=Dat|Gender=Fem|Number=Sing|Person=2|PronType=Prs|Variant=Short"}, - "PRON__Case=Dat|Gender=Fem|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Dat|Gender=Fem|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long"}, - "PRON__Case=Dat|Gender=Neut|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Dat|Gender=Neut|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long"}, - 
"PRON__Case=Dat|Gender=Neut|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Short": {POS: PRON, "morph": "Case=Dat|Gender=Neut|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Short"}, - "PRON__Case=Dat|Gender=Neut|Number=Sing|PronType=Dem": {POS: PRON, "morph": "Case=Dat|Gender=Neut|Number=Sing|PronType=Dem"}, - "PRON__Case=Dat|Gender=Neut|Number=Sing|PronType=Int,Rel": {POS: PRON, "morph": "Case=Dat|Gender=Neut|Number=Sing|PronType=Int,Rel"}, - "PRON__Case=Dat|PronType=Prs|Reflex=Yes": {POS: PRON, "morph": "Case=Dat|PronType=Prs|Reflex=Yes"}, - "PRON__Case=Gen|Gender=Fem|Number=Plur|Person=1|PronType=Prs": {POS: PRON, "morph": "Case=Gen|Gender=Fem|Number=Plur|Person=1|PronType=Prs"}, - "PRON__Case=Gen|Gender=Fem|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Gen|Gender=Fem|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long"}, - "PRON__Case=Gen|Gender=Fem|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Gen|Gender=Fem|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Case=Gen|Gender=Fem|Number=Sing|Person=1|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Gen|Gender=Fem|Number=Sing|Person=1|PronType=Prs|Variant=Long"}, - "PRON__Case=Gen|Gender=Fem|Number=Sing|Person=2|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Gen|Gender=Fem|Number=Sing|Person=2|PronType=Prs|Variant=Long"}, - "PRON__Case=Gen|Gender=Fem|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Gen|Gender=Fem|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long"}, - "PRON__Case=Gen|Gender=Fem|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Gen|Gender=Fem|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Case=Gen|Gender=Neut|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Gen|Gender=Neut|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long"}, - "PRON__Case=Gen|Gender=Neut|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Gen|Gender=Neut|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Case=Gen|Gender=Neut|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Gen|Gender=Neut|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long"}, - "PRON__Case=Gen|Gender=Neut|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Short": {POS: PRON, "morph": "Case=Gen|Gender=Neut|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Short"}, - "PRON__Case=Gen|Gender=Neut|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Gen|Gender=Neut|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Case=Gen|Gender=Neut|Number=Sing|PronType=Dem": {POS: PRON, "morph": "Case=Gen|Gender=Neut|Number=Sing|PronType=Dem"}, - "PRON__Case=Gen|Gender=Neut|Number=Sing|PronType=Ind": {POS: PRON, "morph": "Case=Gen|Gender=Neut|Number=Sing|PronType=Ind"}, - "PRON__Case=Gen|Gender=Neut|Number=Sing|PronType=Int,Rel": {POS: PRON, "morph": "Case=Gen|Gender=Neut|Number=Sing|PronType=Int,Rel"}, - "PRON__Case=Gen|Gender=Neut|Number=Sing|PronType=Neg": {POS: PRON, "morph": "Case=Gen|Gender=Neut|Number=Sing|PronType=Neg"}, - "PRON__Case=Gen|Gender=Neut|Number=Sing|PronType=Tot": {POS: PRON, "morph": "Case=Gen|Gender=Neut|Number=Sing|PronType=Tot"}, - "PRON__Case=Gen|PronType=Prs|Reflex=Yes": {POS: PRON, "morph": 
"Case=Gen|PronType=Prs|Reflex=Yes"}, - "PRON__Case=Ins|Gender=Fem|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Ins|Gender=Fem|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long"}, - "PRON__Case=Ins|Gender=Fem|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Ins|Gender=Fem|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Case=Ins|Gender=Fem|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Short": {POS: PRON, "morph": "Case=Ins|Gender=Fem|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Short"}, - "PRON__Case=Ins|Gender=Fem|Number=Sing|Person=1|PronType=Prs": {POS: PRON, "morph": "Case=Ins|Gender=Fem|Number=Sing|Person=1|PronType=Prs"}, - "PRON__Case=Ins|Gender=Fem|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Ins|Gender=Fem|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Case=Ins|Gender=Neut|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Ins|Gender=Neut|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long"}, - "PRON__Case=Ins|Gender=Neut|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Ins|Gender=Neut|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Case=Ins|Gender=Neut|Number=Sing|PronType=Dem": {POS: PRON, "morph": "Case=Ins|Gender=Neut|Number=Sing|PronType=Dem"}, - "PRON__Case=Ins|Gender=Neut|Number=Sing|PronType=Ind": {POS: PRON, "morph": "Case=Ins|Gender=Neut|Number=Sing|PronType=Ind"}, - "PRON__Case=Ins|Gender=Neut|Number=Sing|PronType=Int,Rel": {POS: PRON, "morph": "Case=Ins|Gender=Neut|Number=Sing|PronType=Int,Rel"}, - "PRON__Case=Ins|Gender=Neut|Number=Sing|PronType=Tot": {POS: PRON, "morph": "Case=Ins|Gender=Neut|Number=Sing|PronType=Tot"}, - "PRON__Case=Ins|PronType=Prs|Reflex=Yes": {POS: PRON, "morph": "Case=Ins|PronType=Prs|Reflex=Yes"}, - "PRON__Case=Loc|Gender=Fem|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Loc|Gender=Fem|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Case=Loc|Gender=Fem|Number=Sing|Person=1|PronType=Prs": {POS: PRON, "morph": "Case=Loc|Gender=Fem|Number=Sing|Person=1|PronType=Prs"}, - "PRON__Case=Loc|Gender=Fem|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Loc|Gender=Fem|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Case=Loc|Gender=Neut|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Loc|Gender=Neut|Number=Plur|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Case=Loc|Gender=Neut|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Loc|Gender=Neut|Number=Sing|Person=3|PrepCase=Pre|PronType=Prs|Variant=Long"}, - "PRON__Case=Loc|Gender=Neut|Number=Sing|PronType=Dem": {POS: PRON, "morph": "Case=Loc|Gender=Neut|Number=Sing|PronType=Dem"}, - "PRON__Case=Loc|Gender=Neut|Number=Sing|PronType=Int,Rel": {POS: PRON, "morph": "Case=Loc|Gender=Neut|Number=Sing|PronType=Int,Rel"}, - "PRON__Case=Loc|Gender=Neut|Number=Sing|PronType=Neg": {POS: PRON, "morph": "Case=Loc|Gender=Neut|Number=Sing|PronType=Neg"}, - "PRON__Case=Loc|Gender=Neut|Number=Sing|PronType=Tot": {POS: PRON, "morph": "Case=Loc|Gender=Neut|Number=Sing|PronType=Tot"}, - "PRON__Case=Loc|PronType=Prs|Reflex=Yes": {POS: PRON, "morph": "Case=Loc|PronType=Prs|Reflex=Yes"}, - 
"PRON__Case=Nom|Gender=Fem|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Nom|Gender=Fem|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long"}, - "PRON__Case=Nom|Gender=Fem|Number=Sing|Person=1|PronType=Prs": {POS: PRON, "morph": "Case=Nom|Gender=Fem|Number=Sing|Person=1|PronType=Prs"}, - "PRON__Case=Nom|Gender=Fem|Number=Sing|Person=2|PronType=Prs": {POS: PRON, "morph": "Case=Nom|Gender=Fem|Number=Sing|Person=2|PronType=Prs"}, - "PRON__Case=Nom|Gender=Fem|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Nom|Gender=Fem|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long"}, - "PRON__Case=Nom|Gender=Neut|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Nom|Gender=Neut|Number=Plur|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long"}, - "PRON__Case=Nom|Gender=Neut|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long": {POS: PRON, "morph": "Case=Nom|Gender=Neut|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|Variant=Long"}, - "PRON__Case=Nom|Gender=Neut|Number=Sing|PronType=Dem": {POS: PRON, "morph": "Case=Nom|Gender=Neut|Number=Sing|PronType=Dem"}, - "PRON__Case=Nom|Gender=Neut|Number=Sing|PronType=Ind": {POS: PRON, "morph": "Case=Nom|Gender=Neut|Number=Sing|PronType=Ind"}, - "PRON__Case=Nom|Gender=Neut|Number=Sing|PronType=Int,Rel": {POS: PRON, "morph": "Case=Nom|Gender=Neut|Number=Sing|PronType=Int,Rel"}, - "PRON__Case=Nom|Gender=Neut|Number=Sing|PronType=Neg": {POS: PRON, "morph": "Case=Nom|Gender=Neut|Number=Sing|PronType=Neg"}, - "PRON__Case=Nom|Gender=Neut|Number=Sing|PronType=Tot": {POS: PRON, "morph": "Case=Nom|Gender=Neut|Number=Sing|PronType=Tot"}, - "PRON__PronType=Prs|Reflex=Yes": {POS: PRON, "morph": "PronType=Prs|Reflex=Yes"}, - "PRON__PronType=Prs|Reflex=Yes|Typo=Yes": {POS: PRON, "morph": "PronType=Prs|Reflex=Yes|Typo=Yes"}, - "PROPN__Animacy=Hum|Case=Acc|Gender=Masc|Number=Plur": {POS: PROPN, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Plur"}, - "PROPN__Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing": {POS: PROPN, "morph": "Animacy=Hum|Case=Acc|Gender=Masc|Number=Sing"}, - "PROPN__Animacy=Hum|Case=Dat|Gender=Masc|Number=Plur": {POS: PROPN, "morph": "Animacy=Hum|Case=Dat|Gender=Masc|Number=Plur"}, - "PROPN__Animacy=Hum|Case=Dat|Gender=Masc|Number=Sing": {POS: PROPN, "morph": "Animacy=Hum|Case=Dat|Gender=Masc|Number=Sing"}, - "PROPN__Animacy=Hum|Case=Gen|Gender=Masc|Number=Plur": {POS: PROPN, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Plur"}, - "PROPN__Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing": {POS: PROPN, "morph": "Animacy=Hum|Case=Gen|Gender=Masc|Number=Sing"}, - "PROPN__Animacy=Hum|Case=Ins|Gender=Masc|Number=Sing": {POS: PROPN, "morph": "Animacy=Hum|Case=Ins|Gender=Masc|Number=Sing"}, - "PROPN__Animacy=Hum|Case=Loc|Gender=Masc|Number=Plur": {POS: PROPN, "morph": "Animacy=Hum|Case=Loc|Gender=Masc|Number=Plur"}, - "PROPN__Animacy=Hum|Case=Loc|Gender=Masc|Number=Sing": {POS: PROPN, "morph": "Animacy=Hum|Case=Loc|Gender=Masc|Number=Sing"}, - "PROPN__Animacy=Hum|Case=Nom|Gender=Masc|Number=Plur": {POS: PROPN, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Plur"}, - "PROPN__Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing": {POS: PROPN, "morph": "Animacy=Hum|Case=Nom|Gender=Masc|Number=Sing"}, - "PROPN__Animacy=Hum|Case=Voc|Gender=Masc|Number=Plur": {POS: PROPN, "morph": "Animacy=Hum|Case=Voc|Gender=Masc|Number=Plur"}, - "PROPN__Animacy=Hum|Case=Voc|Gender=Masc|Number=Sing": {POS: PROPN, "morph": 
"Animacy=Hum|Case=Voc|Gender=Masc|Number=Sing"}, - "PROPN__Animacy=Inan|Case=Acc|Gender=Masc|Number=Plur": {POS: PROPN, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|Number=Plur"}, - "PROPN__Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing": {POS: PROPN, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing"}, - "PROPN__Animacy=Inan|Case=Dat|Gender=Masc|Number=Sing": {POS: PROPN, "morph": "Animacy=Inan|Case=Dat|Gender=Masc|Number=Sing"}, - "PROPN__Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur": {POS: PROPN, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur"}, - "PROPN__Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing": {POS: PROPN, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing"}, - "PROPN__Animacy=Inan|Case=Ins|Gender=Masc|Number=Sing": {POS: PROPN, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|Number=Sing"}, - "PROPN__Animacy=Inan|Case=Loc|Gender=Masc|Number=Sing": {POS: PROPN, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|Number=Sing"}, - "PROPN__Animacy=Inan|Case=Nom|Gender=Masc|Number=Plur": {POS: PROPN, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|Number=Plur"}, - "PROPN__Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing": {POS: PROPN, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing"}, - "PROPN__Animacy=Nhum|Case=Acc|Gender=Masc|Number=Sing": {POS: PROPN, "morph": "Animacy=Nhum|Case=Acc|Gender=Masc|Number=Sing"}, - "PROPN__Animacy=Nhum|Case=Gen|Gender=Masc|Number=Plur": {POS: PROPN, "morph": "Animacy=Nhum|Case=Gen|Gender=Masc|Number=Plur"}, - "PROPN__Animacy=Nhum|Case=Gen|Gender=Masc|Number=Sing": {POS: PROPN, "morph": "Animacy=Nhum|Case=Gen|Gender=Masc|Number=Sing"}, - "PROPN__Animacy=Nhum|Case=Ins|Gender=Masc|Number=Sing": {POS: PROPN, "morph": "Animacy=Nhum|Case=Ins|Gender=Masc|Number=Sing"}, - "PROPN__Animacy=Nhum|Case=Loc|Gender=Masc|Number=Sing": {POS: PROPN, "morph": "Animacy=Nhum|Case=Loc|Gender=Masc|Number=Sing"}, - "PROPN__Animacy=Nhum|Case=Nom|Gender=Masc|Number=Plur": {POS: PROPN, "morph": "Animacy=Nhum|Case=Nom|Gender=Masc|Number=Plur"}, - "PROPN__Animacy=Nhum|Case=Nom|Gender=Masc|Number=Sing": {POS: PROPN, "morph": "Animacy=Nhum|Case=Nom|Gender=Masc|Number=Sing"}, - "PROPN__Case=Acc|Gender=Fem|Number=Plur": {POS: PROPN, "morph": "Case=Acc|Gender=Fem|Number=Plur"}, - "PROPN__Case=Acc|Gender=Fem|Number=Sing": {POS: PROPN, "morph": "Case=Acc|Gender=Fem|Number=Sing"}, - "PROPN__Case=Acc|Gender=Neut|Number=Plur": {POS: PROPN, "morph": "Case=Acc|Gender=Neut|Number=Plur"}, - "PROPN__Case=Acc|Gender=Neut|Number=Sing": {POS: PROPN, "morph": "Case=Acc|Gender=Neut|Number=Sing"}, - "PROPN__Case=Dat|Gender=Fem|Number=Plur": {POS: PROPN, "morph": "Case=Dat|Gender=Fem|Number=Plur"}, - "PROPN__Case=Dat|Gender=Fem|Number=Sing": {POS: PROPN, "morph": "Case=Dat|Gender=Fem|Number=Sing"}, - "PROPN__Case=Dat|Gender=Neut|Number=Sing": {POS: PROPN, "morph": "Case=Dat|Gender=Neut|Number=Sing"}, - "PROPN__Case=Gen|Gender=Fem|Number=Plur": {POS: PROPN, "morph": "Case=Gen|Gender=Fem|Number=Plur"}, - "PROPN__Case=Gen|Gender=Fem|Number=Sing": {POS: PROPN, "morph": "Case=Gen|Gender=Fem|Number=Sing"}, - "PROPN__Case=Gen|Gender=Neut|Number=Plur": {POS: PROPN, "morph": "Case=Gen|Gender=Neut|Number=Plur"}, - "PROPN__Case=Gen|Gender=Neut|Number=Sing": {POS: PROPN, "morph": "Case=Gen|Gender=Neut|Number=Sing"}, - "PROPN__Case=Ins|Gender=Fem|Number=Plur": {POS: PROPN, "morph": "Case=Ins|Gender=Fem|Number=Plur"}, - "PROPN__Case=Ins|Gender=Fem|Number=Sing": {POS: PROPN, "morph": "Case=Ins|Gender=Fem|Number=Sing"}, - "PROPN__Case=Ins|Gender=Neut|Number=Plur": {POS: PROPN, "morph": 
"Case=Ins|Gender=Neut|Number=Plur"}, - "PROPN__Case=Ins|Gender=Neut|Number=Sing": {POS: PROPN, "morph": "Case=Ins|Gender=Neut|Number=Sing"}, - "PROPN__Case=Loc|Gender=Fem|Number=Sing": {POS: PROPN, "morph": "Case=Loc|Gender=Fem|Number=Sing"}, - "PROPN__Case=Loc|Gender=Neut|Number=Plur": {POS: PROPN, "morph": "Case=Loc|Gender=Neut|Number=Plur"}, - "PROPN__Case=Loc|Gender=Neut|Number=Sing": {POS: PROPN, "morph": "Case=Loc|Gender=Neut|Number=Sing"}, - "PROPN__Case=Nom|Gender=Fem|Number=Sing": {POS: PROPN, "morph": "Case=Nom|Gender=Fem|Number=Sing"}, - "PROPN__Case=Nom|Gender=Neut|Number=Plur": {POS: PROPN, "morph": "Case=Nom|Gender=Neut|Number=Plur"}, - "PROPN__Case=Nom|Gender=Neut|Number=Sing": {POS: PROPN, "morph": "Case=Nom|Gender=Neut|Number=Sing"}, - "PROPN__Case=Voc|Gender=Fem|Number=Sing": {POS: PROPN, "morph": "Case=Voc|Gender=Fem|Number=Sing"}, - "PROPN__Case=Voc|Gender=Neut|Number=Plur": {POS: PROPN, "morph": "Case=Voc|Gender=Neut|Number=Plur"}, - "PUNCT___": {POS: PUNCT}, - "SCONJ___": {POS: SCONJ}, - "VERB___": {POS: VERB}, - "VERB__Animacy=Hum|Aspect=Imp|Gender=Masc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "morph": "Animacy=Hum|Aspect=Imp|Gender=Masc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act"}, - "VERB__Animacy=Hum|Aspect=Imp|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "morph": "Animacy=Hum|Aspect=Imp|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act"}, - "VERB__Animacy=Hum|Aspect=Perf|Gender=Masc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "morph": "Animacy=Hum|Aspect=Perf|Gender=Masc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act"}, - "VERB__Animacy=Hum|Aspect=Perf|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "morph": "Animacy=Hum|Aspect=Perf|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act"}, - "VERB__Animacy=Inan|Aspect=Imp|Gender=Masc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "morph": "Animacy=Inan|Aspect=Imp|Gender=Masc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act"}, - "VERB__Animacy=Inan|Aspect=Imp|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "morph": "Animacy=Inan|Aspect=Imp|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act"}, - "VERB__Animacy=Inan|Aspect=Perf|Gender=Masc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "morph": "Animacy=Inan|Aspect=Perf|Gender=Masc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act"}, - "VERB__Animacy=Inan|Aspect=Perf|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "morph": "Animacy=Inan|Aspect=Perf|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act"}, - "VERB__Animacy=Nhum|Aspect=Imp|Gender=Masc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "morph": "Animacy=Nhum|Aspect=Imp|Gender=Masc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act"}, - "VERB__Animacy=Nhum|Aspect=Imp|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "morph": "Animacy=Nhum|Aspect=Imp|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act"}, - "VERB__Animacy=Nhum|Aspect=Perf|Gender=Masc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "morph": "Animacy=Nhum|Aspect=Perf|Gender=Masc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act"}, - "VERB__Animacy=Nhum|Aspect=Perf|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "morph": "Animacy=Nhum|Aspect=Perf|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act"}, - 
"VERB__Aspect=Imp|Gender=Fem|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "morph": "Aspect=Imp|Gender=Fem|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act"}, - "VERB__Aspect=Imp|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "morph": "Aspect=Imp|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act"}, - "VERB__Aspect=Imp|Gender=Neut|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "morph": "Aspect=Imp|Gender=Neut|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act"}, - "VERB__Aspect=Imp|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "morph": "Aspect=Imp|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act"}, - "VERB__Aspect=Imp|Mood=Imp|Number=Plur|Person=1|VerbForm=Fin": {POS: VERB, "morph": "Aspect=Imp|Mood=Imp|Number=Plur|Person=1|VerbForm=Fin"}, - "VERB__Aspect=Imp|Mood=Imp|Number=Plur|Person=2|VerbForm=Fin": {POS: VERB, "morph": "Aspect=Imp|Mood=Imp|Number=Plur|Person=2|VerbForm=Fin"}, - "VERB__Aspect=Imp|Mood=Imp|Number=Sing|Person=2|VerbForm=Fin": {POS: VERB, "morph": "Aspect=Imp|Mood=Imp|Number=Sing|Person=2|VerbForm=Fin"}, - "VERB__Aspect=Imp|Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin": {POS: VERB, "morph": "Aspect=Imp|Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin"}, - "VERB__Aspect=Imp|Mood=Ind|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin": {POS: VERB, "morph": "Aspect=Imp|Mood=Ind|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin"}, - "VERB__Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin": {POS: VERB, "morph": "Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin"}, - "VERB__Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin": {POS: VERB, "morph": "Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin"}, - "VERB__Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin": {POS: VERB, "morph": "Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin"}, - "VERB__Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin": {POS: VERB, "morph": "Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin"}, - "VERB__Aspect=Imp|Tense=Pres|VerbForm=Conv": {POS: VERB, "morph": "Aspect=Imp|Tense=Pres|VerbForm=Conv"}, - "VERB__Aspect=Imp|VerbForm=Inf": {POS: VERB, "morph": "Aspect=Imp|VerbForm=Inf"}, - "VERB__Aspect=Perf|Gender=Fem|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "morph": "Aspect=Perf|Gender=Fem|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act"}, - "VERB__Aspect=Perf|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "morph": "Aspect=Perf|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act"}, - "VERB__Aspect=Perf|Gender=Neut|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "morph": "Aspect=Perf|Gender=Neut|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act"}, - "VERB__Aspect=Perf|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "morph": "Aspect=Perf|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act"}, - "VERB__Aspect=Perf|Mood=Imp|Number=Plur|Person=1|VerbForm=Fin": {POS: VERB, "morph": "Aspect=Perf|Mood=Imp|Number=Plur|Person=1|VerbForm=Fin"}, - "VERB__Aspect=Perf|Mood=Imp|Number=Plur|Person=2|VerbForm=Fin": {POS: VERB, "morph": "Aspect=Perf|Mood=Imp|Number=Plur|Person=2|VerbForm=Fin"}, - "VERB__Aspect=Perf|Mood=Imp|Number=Sing|Person=2|VerbForm=Fin": {POS: VERB, "morph": "Aspect=Perf|Mood=Imp|Number=Sing|Person=2|VerbForm=Fin"}, - "VERB__Aspect=Perf|Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin": {POS: VERB, 
"morph": "Aspect=Perf|Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin"}, - "VERB__Aspect=Perf|Mood=Ind|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin": {POS: VERB, "morph": "Aspect=Perf|Mood=Ind|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin"}, - "VERB__Aspect=Perf|Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin": {POS: VERB, "morph": "Aspect=Perf|Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin"}, - "VERB__Aspect=Perf|Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin": {POS: VERB, "morph": "Aspect=Perf|Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin"}, - "VERB__Aspect=Perf|Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin": {POS: VERB, "morph": "Aspect=Perf|Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin"}, - "VERB__Aspect=Perf|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin": {POS: VERB, "morph": "Aspect=Perf|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin"}, - "VERB__Aspect=Perf|Tense=Past|VerbForm=Conv": {POS: VERB, "morph": "Aspect=Perf|Tense=Past|VerbForm=Conv"}, - "VERB__Aspect=Perf|VerbForm=Inf": {POS: VERB, "morph": "Aspect=Perf|VerbForm=Inf"}, - "X___": {POS: X}, - "X__Abbr=Yes": {POS: X, "morph": "Abbr=Yes"} -} -# fmt: on diff --git a/spacy/lang/pt/__init__.py b/spacy/lang/pt/__init__.py index c09996126..0447099f0 100644 --- a/spacy/lang/pt/__init__.py +++ b/spacy/lang/pt/__init__.py @@ -1,27 +1,16 @@ -# coding: utf8 -from __future__ import unicode_literals - from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS -from .tag_map import TAG_MAP - -from ..tokenizer_exceptions import BASE_EXCEPTIONS from .punctuation import TOKENIZER_INFIXES, TOKENIZER_PREFIXES from ...language import Language -from ...attrs import LANG -from ...util import update_exc class PortugueseDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = lambda text: "pt" - lex_attr_getters.update(LEX_ATTRS) - tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) - stop_words = STOP_WORDS - tag_map = TAG_MAP + tokenizer_exceptions = TOKENIZER_EXCEPTIONS infixes = TOKENIZER_INFIXES prefixes = TOKENIZER_PREFIXES + lex_attr_getters = LEX_ATTRS + stop_words = STOP_WORDS class Portuguese(Language): diff --git a/spacy/lang/pt/examples.py b/spacy/lang/pt/examples.py index b7206ffd7..13f3512cf 100644 --- a/spacy/lang/pt/examples.py +++ b/spacy/lang/pt/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
diff --git a/spacy/lang/pt/lex_attrs.py b/spacy/lang/pt/lex_attrs.py index 4ad0eeecb..3c6979ab4 100644 --- a/spacy/lang/pt/lex_attrs.py +++ b/spacy/lang/pt/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/pt/punctuation.py b/spacy/lang/pt/punctuation.py index 370e6aaad..08e31f9d0 100644 --- a/spacy/lang/pt/punctuation.py +++ b/spacy/lang/pt/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..punctuation import TOKENIZER_PREFIXES as BASE_TOKENIZER_PREFIXES from ..punctuation import TOKENIZER_SUFFIXES as BASE_TOKENIZER_SUFFIXES from ..punctuation import TOKENIZER_INFIXES as BASE_TOKENIZER_INFIXES diff --git a/spacy/lang/pt/stop_words.py b/spacy/lang/pt/stop_words.py index 774b06809..ff45ad3a7 100644 --- a/spacy/lang/pt/stop_words.py +++ b/spacy/lang/pt/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ à às área acerca ademais adeus agora ainda algo algumas alguns ali além ambas ambos antes diff --git a/spacy/lang/pt/tag_map.py b/spacy/lang/pt/tag_map.py deleted file mode 100644 index cdc7de57e..000000000 --- a/spacy/lang/pt/tag_map.py +++ /dev/null @@ -1,5057 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import POS, PUNCT, SYM, ADJ, NUM, DET, ADV, ADP, X, VERB, CCONJ -from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, SCONJ, AUX - - -TAG_MAP = { - "<-sam>||DET|F|P|@P<": {POS: PRON}, - "<-sam>||DET|M|P|@P<": {POS: PRON}, - "<-sam>||ART|@>A": {POS: DET}, - "<-sam>||ART|@>N": {POS: DET}, - "<-sam>||ART|F|P|@>N": {POS: DET}, - "<-sam>||ART|F|S|@>N": {POS: DET}, - "<-sam>||ART|F|S|@P<": {POS: DET}, - "<-sam>||ART|M|P|@>A": {POS: DET}, - "<-sam>||ART|M|P|@>N": {POS: DET}, - "<-sam>||ART|M|S|@||ART|M|S|@>A": {POS: DET}, - "<-sam>||ART|M|S|@>N": {POS: DET}, - "<-sam>||ART|M|S|@N<": {POS: DET}, - "<-sam>||ART|M|S|@P<": {POS: DET}, - "<-sam>||DET|F|P|@>N": {POS: DET}, - "<-sam>||DET|F|S|@>N": {POS: DET}, - "<-sam>||DET|M|P|@>N": {POS: DET}, - "<-sam>||DET|M|S/P|@>N": {POS: DET}, - "<-sam>||DET|M|S|@>N": {POS: DET}, - "<-sam>||DET|M|S|@P<": {POS: PRON}, - "<-sam>||ART|F|S|@>N": {POS: DET}, - "<-sam>||ART|M|S|@>N": {POS: DET}, - "<-sam>||DET|F|S|@>N": {POS: DET}, - "<-sam>||DET|M|S|@>N": {POS: DET}, - "<-sam>||NUM|M|S|@P<": {POS: NUM}, - "<-sam>|||DET|M|S|@P<": {POS: PRON}, - "<-sam>||DET|F|P|@>N": {POS: DET}, - "<-sam>||DET|F|P|@P<": {POS: PRON}, - "<-sam>||DET|F|S|@>N": {POS: DET}, - "<-sam>||DET|F|S|@P<": {POS: PRON}, - "<-sam>||DET|F|S|@SUBJ>": {POS: PRON}, - "<-sam>||DET|M|P|@>N": {POS: DET}, - "<-sam>||DET|M|P|@P<": {POS: PRON}, - "<-sam>||DET|M|S|@>N": {POS: DET}, - "<-sam>||DET|M|S|@P<": {POS: PRON}, - "<-sam>||INDP|M|S|@P<": {POS: PRON}, - "<-sam>||DET|F|P|@>N": {POS: DET}, - "<-sam>||DET|F|P|@P<": {POS: PRON}, - "<-sam>||DET|F|S|@>N": {POS: DET}, - "<-sam>||DET|M|P|@>N": {POS: DET}, - "<-sam>||DET|M|S|@>N": {POS: DET}, - "<-sam>||DET|M|S|@P<": {POS: PRON}, - "<-sam>||DET|F|P|@>N": {POS: DET}, - "<-sam>||DET|F|P|@P<": {POS: PRON}, - "<-sam>||DET|M|P|@>N": {POS: DET}, - "<-sam>||PERS|F|3S|PIV|@P<": {POS: PRON}, - "<-sam>||PERS|M|3S|PIV|@P<": {POS: PRON}, - "<-sam>||INDP|M|P|@SUBJ>": {POS: PRON}, - "<-sam>||INDP|M|S|@P<": {POS: PRON}, - "<-sam>|ADV|@ADVL>": {POS: ADV}, - "<-sam>|ADV|@P<": {POS: ADV}, - "<-sam>|ART|@>N": {POS: DET}, - "<-sam>|ART|F|P|@>N": {POS: DET}, - "<-sam>|ART|F|S|@>N": {POS: DET}, - "<-sam>|ART|M|P|@>N": {POS: DET}, - 
"<-sam>|ART|M|S|@>N": {POS: DET}, - "<-sam>|DET|@>N": {POS: DET}, - "<-sam>|DET|F|P|@P<": {POS: PRON}, - "<-sam>|DET|F|S|@>N": {POS: DET}, - "<-sam>|DET|F|S|@P<": {POS: PRON}, - "<-sam>|DET|M|P|@P<": {POS: PRON}, - "<-sam>|DET|M|S|@>A": {POS: DET}, - "<-sam>|DET|M|S|@>N": {POS: DET}, - "<-sam>|DET|M|S|@P<": {POS: PRON}, - "<-sam>|INDP|M|S|@P<": {POS: PRON}, - "<-sam>|INDP|M|S|@SUBJ>": {POS: PRON}, - "<-sam>|PERS|F|1P|PIV|@P<": {POS: PRON}, - "<-sam>|PERS|F|1S|PIV|@P<": {POS: PRON}, - "<-sam>|PERS|F|3P|NOM/PIV|@P<": {POS: PRON}, - "<-sam>|PERS|F|3P|NOM|@P<": {POS: PRON}, - "<-sam>|PERS|F|3P|PIV|@P<": {POS: PRON}, - "<-sam>|PERS|F|3S|ACC|@ACC>": {POS: PRON}, - "<-sam>|PERS|F|3S|NOM/PIV|@P<": {POS: PRON}, - "<-sam>|PERS|F|3S|NOM|@SUBJ>": {POS: PRON}, - "<-sam>|PERS|F|3S|PIV|@P<": {POS: PRON}, - "<-sam>|PERS|M/F|2P|PIV|@P<": {POS: PRON}, - "<-sam>|PERS|M|3P|NOM/PIV|@P<": {POS: PRON}, - "<-sam>|PERS|M|3P|NOM|@P<": {POS: PRON}, - "<-sam>|PERS|M|3P|PIV|@P<": {POS: PRON}, - "<-sam>|PERS|M|3S|ACC|@NPHR": {POS: PRON}, - "<-sam>|PERS|M|3S|NOM/PIV|@P<": {POS: PRON}, - "<-sam>|PERS|M|3S|NOM|@P<": {POS: PRON}, - "<-sam>|PERS|M|3S|NOM|@SUBJ>": {POS: PRON}, - "<-sam>|PERS|M|3S|PIV|@P<": {POS: PRON}, - "<-sam>|PRP|@N<": {POS: ADP}, - "|ADJ|F|P|@|ADJ|F|P|@|ADJ|F|P|@>N": {POS: ADJ}, - "|ADJ|F|P|@N<": {POS: ADJ}, - "|ADJ|F|P|@P<": {POS: ADJ}, - "|ADJ|F|S|@|ADJ|F|S|@|ADJ|F|S|@>N": {POS: ADJ}, - "|ADJ|F|S|@N<": {POS: ADJ}, - "|ADJ|F|S|@N|ADJ|F|S|@P<": {POS: ADJ}, - "|ADJ|F|S|@SC>": {POS: ADJ}, - "|ADJ|M/F|S|@|ADJ|M|P|@|ADJ|M|P|@>N": {POS: ADJ}, - "|ADJ|M|P|@ADVL>": {POS: ADJ}, - "|ADJ|M|P|@N<": {POS: ADJ}, - "|ADJ|M|S|@|ADJ|M|S|@|ADJ|M|S|@|ADJ|M|S|@>A": {POS: ADJ}, - "|ADJ|M|S|@>N": {POS: ADJ}, - "|ADJ|M|S|@ADVL>": {POS: ADJ}, - "|ADJ|M|S|@AS<": {POS: ADJ}, - "|ADJ|M|S|@N<": {POS: ADJ}, - "|ADJ|M|S|@P<": {POS: ADJ}, - "|ADJ|M|S|@SC>": {POS: ADJ}, - "||PRP|@||ADJ|M|P|@N<": {POS: ADJ}, - "||DET|F|S|@P<": {POS: PRON}, - "||DET|M|P|@P<": {POS: PRON}, - "||N|F|S|@SUBJ>": {POS: NOUN}, - "|||ADJ|F|S|@P<": {POS: ADJ}, - "||N|M|P|@||N|F|P|@|ADV|@ICL-N<": {POS: ADV}, - "|ADV|@N|ADV|@|ADV|@|ADV|@>A": {POS: ADV}, - "|ADV|@ADVL>": {POS: ADV}, - "|ADV|@P<": {POS: ADV}, - "||ADJ|F|P|@||ADJ|F|P|@>N": {POS: ADJ}, - "||ADJ|F|P|@N<": {POS: ADJ}, - "||ADJ|F|S|@||ADJ|F|S|@>N": {POS: ADJ}, - "||ADJ|F|S|@N<": {POS: ADJ}, - "||ADJ|F|S|@N||ADJ|F|S|@SC>": {POS: ADJ}, - "||ADJ|M/F|S|@||ADJ|M/F|S|@||ADJ|M|P|@||ADJ|M|P|@||ADJ|M|P|@>N": {POS: ADJ}, - "||ADJ|M|P|@N<": {POS: ADJ}, - "||ADJ|M|P|@N||ADJ|M|P|@P<": {POS: ADJ}, - "||ADJ|M|S|@||ADJ|M|S|@||ADJ|M|S|@||ADJ|M|S|@>N": {POS: ADJ}, - "||ADJ|M|S|@N<": {POS: ADJ}, - "||ADJ|M|S|@N||ADJ|M|S|@P<": {POS: ADJ}, - "||ADJ|M|S|@PRED>": {POS: ADJ}, - "||ADJ|M|S|@SC>": {POS: ADJ}, - "||ADV|@||ADV|@>N": {POS: ADV}, - "||ADV|@ADVL>": {POS: ADV}, - "|||ADJ|F|P|@>N": {POS: ADJ}, - "|||ADJ|F|S|@>N": {POS: ADJ}, - "|||ADJ|F|S|@N<": {POS: ADJ}, - "|||ADJ|M|P|@|||ADJ|M|P|@>N": {POS: ADJ}, - "|||ADJ|M|S|@|||ADJ|M|S|@>N": {POS: ADJ}, - "|||ADJ|M|S|@N<": {POS: ADJ}, - "|||ADJ|M|S|@SC>": {POS: ADJ}, - "|||ADV|@|||ADV|@ADVL>": {POS: ADV}, - "|||||ADJ|M|S|@|||||ADJ|M|S|@SC>": {POS: ADJ}, - "|||DET|M|P|@P<": {POS: PRON}, - "||DET|F|P|@P<": {POS: PRON}, - "||DET|F|S|@P<": {POS: PRON}, - "||DET|M|S|@>N": {POS: DET}, - "||DET|M|S|@P<": {POS: PRON}, - "|||ADJ|F|S|@>N": {POS: ADJ}, - "|||ADJ|F|S|@N<": {POS: ADJ}, - "|||ADJ|F|S|@|||ADJ|M/F|P|@P<": {POS: ADJ}, - "||||ADJ|F|S|@P<": {POS: ADJ}, - "|||DET|M|S|@P<": {POS: PRON}, - "||ADV|@||ADV|@>A": {POS: ADV}, - "||ADV|@>P": {POS: ADV}, - "||ADV|@ADVL>": {POS: ADV}, - 
"||ADV|@P<": {POS: ADV}, - "||DET|F|P|@>N": {POS: DET}, - "||DET|M|S|@||N|M|P|@P<": {POS: NOUN}, - "||||ADJ|@SUBJ>": {POS: ADJ}, - "||||ADJ|M|S|@||ADJ|F|S|@>N": {POS: ADJ}, - "||ADJ|F|S|@>N": {POS: ADJ}, - "||ADJ|M|S|@>N": {POS: ADJ}, - "||ADJ|M|S|@N<": {POS: ADJ}, - "|||ADJ|F|P|@|||ADJ|F|S|@|||ADJ|F|S|@APP": {POS: ADJ}, - "|||ADJ|F|S|@P<": {POS: ADJ}, - "|||ADJ|F|S|@SUBJ>": {POS: ADJ}, - "|||ADJ|M|P|@|||ADJ|M|P|@P<": {POS: ADJ}, - "|||ADJ|M|P|@SUBJ>": {POS: ADJ}, - "|||ADJ|M|S|@|||ADJ|M|S|@|||ADJ|M|S|@SUBJ>": {POS: ADJ}, - "||ADJ|F|S|@>N": {POS: ADJ}, - "||ADJ|M|P|@>N": {POS: ADJ}, - "||ADJ|M|S|@P<": {POS: ADJ}, - "||ADJ|F|S|@N|ADJ|@A<": {POS: ADJ}, - "|ADJ|F|P|@>N": {POS: ADJ}, - "|ADJ|F|P|@N<": {POS: ADJ}, - "|ADJ|F|S|@>N": {POS: ADJ}, - "|ADJ|F|S|@N<": {POS: ADJ}, - "|ADJ|F|S|@N|ADJ|M|P|@>N": {POS: ADJ}, - "|ADJ|M|P|@SUBJ>": {POS: ADJ}, - "|ADJ|M|S|@|ADJ|M|S|@>N": {POS: ADJ}, - "|ADJ|M|S|@ADVL>": {POS: ADJ}, - "|ADJ|M|S|@N<": {POS: ADJ}, - "|ADJ|M|S|@P<": {POS: ADJ}, - "|||ADJ|F|P|@|||ADJ|F|P|@>N": {POS: ADJ}, - "|||ADJ|F|S|@|||ADJ|F|S|@>N": {POS: ADJ}, - "|||ADJ|F|S|@N<": {POS: ADJ}, - "|||ADJ|M/F|S|@|||ADJ|M|P|@>N": {POS: ADJ}, - "|||ADJ|M|S|@>N": {POS: ADJ}, - "|||ADJ|M|S|@N<": {POS: ADJ}, - "|||||ADJ|M|S|@P<": {POS: ADJ}, - "||||ADJ|M|S|@P<": {POS: ADJ}, - "|||ADJ|M|S|@P<": {POS: ADJ}, - "|ADJ|F|P|@>N": {POS: ADJ}, - "|ADJ|F|P|@N<": {POS: ADJ}, - "|ADJ|F|S|@>N": {POS: ADJ}, - "|ADJ|F|S|@N<": {POS: ADJ}, - "|ADJ|M|P|@>N": {POS: ADJ}, - "|ADJ|M|P|@N<": {POS: ADJ}, - "|ADJ|M|S|@|ADJ|M|S|@>N": {POS: ADJ}, - "|ADJ|M|S|@N<": {POS: ADJ}, - "|ADJ|M|S|@|<-sam>|ART|@>N": {POS: DET}, - "|<-sam>|ART|F|P|@>A": {POS: DET}, - "|<-sam>|ART|F|P|@>N": {POS: DET}, - "|<-sam>|ART|F|S|@>N": {POS: DET}, - "|<-sam>|ART|M|P|@>N": {POS: DET}, - "|<-sam>|ART|M|S|@>N": {POS: DET}, - "|<-sam>|DET|F|P|@>N": {POS: DET}, - "|<-sam>|DET|F|P|@P<": {POS: PRON}, - "|<-sam>|DET|F|S|@>N": {POS: DET}, - "|<-sam>|DET|M|P|@>N": {POS: DET}, - "|<-sam>|DET|M|S|@>N": {POS: DET}, - "||ART|M|S|@SC>": {POS: DET}, - "||ART|F|S|@>N": {POS: DET}, - "||ART|||N|S|@>N": {POS: DET}, - "||DET|M|S|@SUBJ>": {POS: PRON}, - "|ART|@>N": {POS: DET}, - "|ART|F|P|@>N": {POS: DET}, - "|ART|F|S|@|ART|F|S|@|ART|F|S|@>A": {POS: DET}, - "|ART|F|S|@>N": {POS: DET}, - "|ART|F|S|@SUBJ>": {POS: DET}, - "|ART|M|P|@>A": {POS: DET}, - "|ART|M|P|@>N": {POS: DET}, - "|ART|M|P|@P<": {POS: DET}, - "|ART|M|S|@>A": {POS: DET}, - "|ART|M|S|@>N": {POS: DET}, - "|ART|M|S|@KOMP<": {POS: DET}, - "|DET|F|P|@>N": {POS: DET}, - "|DET|F|P|@A<": {POS: DET}, - "|DET|F|S|@|DET|F|S|@>N": {POS: DET}, - "|DET|F|S|@A<": {POS: DET}, - "|DET|M|P|@|DET|M|P|@>N": {POS: DET}, - "|DET|M|P|@A<": {POS: DET}, - "|DET|M|P|@SUBJ>": {POS: PRON}, - "|DET|M|S|@|DET|M|S|@>N": {POS: DET}, - "|DET|M|S|@A<": {POS: DET}, - "|DET|M|S|@P<": {POS: PRON}, - "|DET|M|S|@SUBJ>": {POS: PRON}, - "|INDP|F|S|@>N": {POS: PRON}, - "|INDP|M|S|@ACC>": {POS: PRON}, - "|INDP|M|S|@P<": {POS: PRON}, - "|<-sam>|ART|M|S|@>N": {POS: DET}, - "||ART|F|S|@>N": {POS: DET}, - "|ART|F|S|@>N": {POS: DET}, - "|ART|M|S|@>N": {POS: DET}, - "|DET|F|S@>N": {POS: DET}, - "|DET|F|S|@>N": {POS: DET}, - "|DET|M|S|@>N": {POS: DET}, - "|ADV|@|ADV|@A<": {POS: ADV}, - "|ADV|@ADVL>": {POS: ADV}, - "|||V|INF|@ICL-AUX<": {POS: AUX}, - "|||V|PCP|@ICL-AUX<": {POS: AUX}, - "|||V|PS|1S|IND|@FS-N<": {POS: AUX}, - "|||V|PR|3S|IND|@FS-|||V|PR|3S|IND|@FS-STA": {POS: AUX}, - "|||V|PCP|F|S|@ICL-N||V|COND|3S|@FS-N<": {POS: AUX}, - "||V|INF|@ICL-AUX<": {POS: AUX}, - "||V|PCP|@ICL-AUX<": {POS: AUX}, - "||V|PCP|M|S|@ICL-AUX<": {POS: AUX}, - 
"||V|PR|3P|IND|@FS-N<": {POS: AUX}, - "||V|PR|3S|IND|@FS-N<": {POS: AUX}, - "|||V|PS|3S|IND|@FS-STA": {POS: AUX}, - "||||V|INF|@ICL-AUX<": {POS: AUX}, - "|||V|GER|@ICL-ADVL>": {POS: AUX}, - "||V|INF|@ICL-AUX<": {POS: AUX}, - "||V|PCP|@ICL-AUX<": {POS: AUX}, - "||V|PCP|F|P|@ICL-AUX<": {POS: AUX}, - "||V|PCP|F|S|@ICL-AUX<": {POS: AUX}, - "||V|PCP|M|P|@ICL-AUX<": {POS: AUX}, - "||V|PCP|M|S|@ICL-AUX<": {POS: AUX}, - "||V|PR|3S|IND|@FS-N<": {POS: AUX}, - "||V|FUT|3S|IND|@FS-QUE": {POS: AUX}, - "||V|FUT|3S|IND|@FS-STA": {POS: AUX}, - "||V|IMPF|3S|IND|@FS-STA": {POS: AUX}, - "||V|INF|@ICL-AUX<": {POS: AUX}, - "||V|INF|@ICL-P<": {POS: AUX}, - "||V|PR|3S|IND|@FS-ACC>": {POS: AUX}, - "||V|PR|3S|IND|@FS-STA": {POS: AUX}, - "||V|PS|3S|IND|@FS-ACC>": {POS: AUX}, - "||V|PS|3S|IND|@FS-STA": {POS: AUX}, - "||V|PCP|@ICL-AUX<": {POS: AUX}, - "||V|PCP|M|S|@ICL-AUX<": {POS: AUX}, - "||V|PCP|@ICL-AUX<": {POS: AUX}, - "||V|PCP|@ICL-AUX<": {POS: AUX}, - "||V|PCP|@ICL-AUX<": {POS: AUX}, - "||V|FUT|3P|IND|@FS-N||V|FUT|3S|IND|@FS-||V|FUT|3S|IND|@FS-QUE": {POS: AUX}, - "||V|IMPF|3P|IND|@FS-||V|IMPF|3P|IND|@FS-N<": {POS: AUX}, - "||V|IMPF|3S|IND|@FS-N||V|IMPF|3S|IND|@FS-STA": {POS: AUX}, - "||V|INF|3P|@ICL-UTT": {POS: AUX}, - "||V|INF|3S|@ICL-||V|INF|@ICL-P<": {POS: AUX}, - "||V|PR|3P|IND|@FS-||V|PR|3P|IND|@FS-||V|PR|3P|IND|@FS-ACC>": {POS: AUX}, - "||V|PR|3P|IND|@FS-ADVL>": {POS: AUX}, - "||V|PR|3P|IND|@FS-N<": {POS: AUX}, - "||V|PR|3P|IND|@FS-N||V|PR|3P|IND|@FS-QUE": {POS: AUX}, - "||V|PR|3S|IND|@FS-||V|PR|3S|IND|@FS-||V|PR|3S|IND|@FS-ACC>": {POS: AUX}, - "||V|PR|3S|IND|@FS-N<": {POS: AUX}, - "||V|PR|3S|IND|@FS-N||V|PR|3S|IND|@FS-STA": {POS: AUX}, - "||V|PR|3S|SUBJ|@FS-||V|PR|3S|SUBJ|@FS-ADVL>": {POS: AUX}, - "||V|PR|3S|SUBJ|@FS-P<": {POS: AUX}, - "||V|PS|3P|IND|@FS-STA": {POS: AUX}, - "||V|PS|3S|IND|@FS-N|V|COND|1S|@FS-N<": {POS: AUX}, - "|V|COND|3P|@FS-|V|COND|3P|@FS-|V|COND|3P|@FS-N<": {POS: AUX}, - "|V|COND|3P|@FS-N|V|COND|3P|@FS-S<": {POS: AUX}, - "|V|COND|3P|@FS-STA": {POS: AUX}, - "|V|COND|3S|@FS-|V|COND|3S|@FS-|V|COND|3S|@FS-|V|COND|3S|@FS-|V|COND|3S|@FS-ACC>": {POS: AUX}, - "|V|COND|3S|@FS-ADVL>": {POS: AUX}, - "|V|COND|3S|@FS-KOMP<": {POS: AUX}, - "|V|COND|3S|@FS-N<": {POS: AUX}, - "|V|COND|3S|@FS-N|V|COND|3S|@FS-P<": {POS: AUX}, - "|V|COND|3S|@FS-STA": {POS: AUX}, - "|V|COND|3S|@P<": {POS: AUX}, - "|V|FUT|1P|IND|@FS-|V|FUT|1P|IND|@FS-N<": {POS: AUX}, - "|V|FUT|1P|IND|@FS-STA": {POS: AUX}, - "|V|FUT|1S|IND|@FS-STA": {POS: AUX}, - "|V|FUT|1S|SUBJ|@FS-ADVL>": {POS: AUX}, - "|V|FUT|3P|IND|@FS-|V|FUT|3P|IND|@FS-|V|FUT|3P|IND|@FS-|V|FUT|3P|IND|@FS-ACC>": {POS: AUX}, - "|V|FUT|3P|IND|@FS-N<": {POS: AUX}, - "|V|FUT|3P|IND|@FS-N|V|FUT|3P|IND|@FS-P<": {POS: AUX}, - "|V|FUT|3P|IND|@FS-QUE": {POS: AUX}, - "|V|FUT|3P|IND|@FS-STA": {POS: AUX}, - "|V|FUT|3P|SUBJ|@FS-|V|FUT|3P|SUBJ|@FS-ADVL>": {POS: AUX}, - "|V|FUT|3P|SUBJ|@FS-N<": {POS: AUX}, - "|V|FUT|3S|IND|@FS-|V|FUT|3S|IND|@FS-|V|FUT|3S|IND|@FS-|V|FUT|3S|IND|@FS-|V|FUT|3S|IND|@FS-A<": {POS: AUX}, - "|V|FUT|3S|IND|@FS-ACC>": {POS: AUX}, - "|V|FUT|3S|IND|@FS-ADVL>": {POS: AUX}, - "|V|FUT|3S|IND|@FS-N<": {POS: AUX}, - "|V|FUT|3S|IND|@FS-N|V|FUT|3S|IND|@FS-P<": {POS: AUX}, - "|V|FUT|3S|IND|@FS-QUE": {POS: AUX}, - "|V|FUT|3S|IND|@FS-S<": {POS: AUX}, - "|V|FUT|3S|IND|@FS-STA": {POS: AUX}, - "|V|FUT|3S|SUBJ|@FS-|V|FUT|3S|SUBJ|@FS-ADVL>": {POS: AUX}, - "|V|FUT|3S|SUBJ|@FS-N<": {POS: AUX}, - "|V|FUT|3S|SUBJ|@FS-N|V|FUT|3S|SUBJ|@FS-P<": {POS: AUX}, - "|V|FUT|3S|SUBJ|@FS-UTT": {POS: AUX}, - "|V|GER|@ICL-|V|GER|@ICL-ADVL>": {POS: AUX}, - "|V|GER|@ICL-AUX": {POS: AUX}, - 
"|V|GER|@ICL-AUX<": {POS: AUX}, - "|V|GER|@ICL-N<": {POS: AUX}, - "|V|GER|@ICL-N|V|GER|@ICL-P<": {POS: AUX}, - "|V|GER|@ICL-PRED>": {POS: AUX}, - "|V|IMPF|1/3S|IND|@FS-STA": {POS: AUX}, - "|V|IMPF|1P|IND|@FS-|V|IMPF|1P|IND|@FS-ACC>": {POS: AUX}, - "|V|IMPF|1P|IND|@FS-STA": {POS: AUX}, - "|V|IMPF|1P|SUBJ|@FS-|V|IMPF|1S|IND|@FS-|V|IMPF|1S|IND|@FS-N<": {POS: AUX}, - "|V|IMPF|1S|IND|@FS-STA": {POS: AUX}, - "|V|IMPF|1S|IND|@FS-SUBJ>": {POS: AUX}, - "|V|IMPF|1S|SUBJ|@FS-|V|IMPF|3P|IND|@FS-|V|IMPF|3P|IND|@FS-|V|IMPF|3P|IND|@FS-|V|IMPF|3P|IND|@FS-ACC>": {POS: AUX}, - "|V|IMPF|3P|IND|@FS-ADVL>": {POS: AUX}, - "|V|IMPF|3P|IND|@FS-N<": {POS: AUX}, - "|V|IMPF|3P|IND|@FS-N|V|IMPF|3P|IND|@FS-P<": {POS: AUX}, - "|V|IMPF|3P|IND|@FS-STA": {POS: AUX}, - "|V|IMPF|3P|SUBJ|@FS-|V|IMPF|3P|SUBJ|@FS-|V|IMPF|3P|SUBJ|@FS-|V|IMPF|3P|SUBJ|@FS-N<": {POS: AUX}, - "|V|IMPF|3P|SUBJ|@FS-N|V|IMPF|3P|SUBJ|@FS-P<": {POS: AUX}, - "|V|IMPF|3S|IND|@FS-|V|IMPF|3S|IND|@FS-|V|IMPF|3S|IND|@FS-|V|IMPF|3S|IND|@FS-ACC>": {POS: AUX}, - "|V|IMPF|3S|IND|@FS-ADVL>": {POS: AUX}, - "|V|IMPF|3S|IND|@FS-N<": {POS: AUX}, - "|V|IMPF|3S|IND|@FS-N|V|IMPF|3S|IND|@FS-STA": {POS: AUX}, - "|V|IMPF|3S|SUBJ|@FS-|V|IMPF|3S|SUBJ|@FS-|V|IMPF|3S|SUBJ|@FS-|V|IMPF|3S|SUBJ|@FS-ADVL>": {POS: AUX}, - "|V|IMPF|3S|SUBJ|@FS-N<": {POS: AUX}, - "|V|IMPF|3S|SUBJ|@FS-P<": {POS: AUX}, - "|V|INF|1P|@ICL-P<": {POS: AUX}, - "|V|INF|3P|@ICL-|V|INF|3P|@ICL-|V|INF|3P|@ICL-A<": {POS: AUX}, - "|V|INF|3P|@ICL-N<": {POS: AUX}, - "|V|INF|3P|@ICL-P<": {POS: AUX}, - "|V|INF|3S|@ICL-|V|INF|3S|@ICL-|V|INF|3S|@ICL-N<": {POS: AUX}, - "|V|INF|3S|@ICL-P<": {POS: AUX}, - "|V|INF|@ICL-|V|INF|@ICL-|V|INF|@ICL-ADVL>": {POS: AUX}, - "|V|INF|@ICL-APP": {POS: AUX}, - "|V|INF|@ICL-AUX<": {POS: AUX}, - "|V|INF|@ICL-P<": {POS: AUX}, - "|V|INF|@ICL-QUE": {POS: AUX}, - "|V|INF|@ICL-SUBJ>": {POS: AUX}, - "|V|INF|FUT|3S|IND|@FS-ICL-STA": {POS: AUX}, - "|V|MQP|3P|IND|@FS-STA": {POS: AUX}, - "|V|MQP|3S|IND|@FS-|V|MQP|3S|IND|@FS-STA": {POS: AUX}, - "|V|PCP|@ICL-AUX<": {POS: AUX}, - "|V|PCP|@ICL-P<": {POS: AUX}, - "|V|PCP|F|P|@ICL-AUX<": {POS: AUX}, - "|V|PCP|M|P|@ICL-N<": {POS: AUX}, - "|V|PCP|M|S|@ICL-AUX<": {POS: AUX}, - "|V|PCP|M|S|@ICL-PRED>": {POS: AUX}, - "|V|PR|1P|IND|@FS-|V|PR|1P|IND|@FS-|V|PR|1P|IND|@FS-|V|PR|1P|IND|@FS-ACC>": {POS: AUX}, - "|V|PR|1P|IND|@FS-ADVL>": {POS: AUX}, - "|V|PR|1P|IND|@FS-N<": {POS: AUX}, - "|V|PR|1P|IND|@FS-QUE": {POS: AUX}, - "|V|PR|1P|IND|@FS-STA": {POS: AUX}, - "|V|PR|1P|SUBJ|@FS-|V|PR|1P|SUBJ|@FS-STA": {POS: AUX}, - "|V|PR|1S|IND|@FS-|V|PR|1S|IND|@FS-ACC>": {POS: AUX}, - "|V|PR|1S|IND|@FS-ADVL>": {POS: AUX}, - "|V|PR|1S|IND|@FS-EXC": {POS: AUX}, - "|V|PR|1S|IND|@FS-N<": {POS: AUX}, - "|V|PR|1S|IND|@FS-QUE": {POS: AUX}, - "|V|PR|1S|IND|@FS-STA": {POS: AUX}, - "|V|PR|1S|SUBJ|@FS-|V|PR|3P|IND|@FS-|V|PR|3P|IND|@FS-|V|PR|3P|IND|@FS-|V|PR|3P|IND|@FS-|V|PR|3P|IND|@FS-A<": {POS: AUX}, - "|V|PR|3P|IND|@FS-ACC>": {POS: AUX}, - "|V|PR|3P|IND|@FS-ADVL>": {POS: AUX}, - "|V|PR|3P|IND|@FS-APP": {POS: AUX}, - "|V|PR|3P|IND|@FS-KOMP<": {POS: AUX}, - "|V|PR|3P|IND|@FS-N<": {POS: AUX}, - "|V|PR|3P|IND|@FS-N|V|PR|3P|IND|@FS-P<": {POS: AUX}, - "|V|PR|3P|IND|@FS-STA": {POS: AUX}, - "|V|PR|3P|IND|@FS-SUBJ>": {POS: AUX}, - "|V|PR|3P|IND|@FS-UTT": {POS: AUX}, - "|V|PR|3P|SUBJ|@FS-|V|PR|3P|SUBJ|@FS-|V|PR|3P|SUBJ|@FS-|V|PR|3P|SUBJ|@FS-ADVL>": {POS: AUX}, - "|V|PR|3P|SUBJ|@FS-N<": {POS: AUX}, - "|V|PR|3P|SUBJ|@FS-N|V|PR|3P|SUBJ|@FS-P<": {POS: AUX}, - "|V|PR|3S|IND|@FS-|V|PR|3S|IND|@FS-|V|PR|3S|IND|@FS-|V|PR|3S|IND|@FS-|V|PR|3S|IND|@FS-A<": {POS: AUX}, - "|V|PR|3S|IND|@FS-ACC>": {POS: AUX}, - 
"|V|PR|3S|IND|@FS-ADVL>": {POS: AUX}, - "|V|PR|3S|IND|@FS-APP": {POS: AUX}, - "|V|PR|3S|IND|@FS-EXC": {POS: AUX}, - "|V|PR|3S|IND|@FS-KOMP<": {POS: AUX}, - "|V|PR|3S|IND|@FS-N<": {POS: AUX}, - "|V|PR|3S|IND|@FS-N|V|PR|3S|IND|@FS-P<": {POS: AUX}, - "|V|PR|3S|IND|@FS-QUE": {POS: AUX}, - "|V|PR|3S|IND|@FS-S<": {POS: AUX}, - "|V|PR|3S|IND|@FS-STA": {POS: AUX}, - "|V|PR|3S|IND|@FS-SUBJ>": {POS: AUX}, - "|V|PR|3S|IND|@FS-UTT": {POS: AUX}, - "|V|PR|3S|SUBJ|@FS-|V|PR|3S|SUBJ|@FS-|V|PR|3S|SUBJ|@FS-|V|PR|3S|SUBJ|@FS-|V|PR|3S|SUBJ|@FS-A<": {POS: AUX}, - "|V|PR|3S|SUBJ|@FS-ACC>": {POS: AUX}, - "|V|PR|3S|SUBJ|@FS-ADVL>": {POS: AUX}, - "|V|PR|3S|SUBJ|@FS-N<": {POS: AUX}, - "|V|PR|3S|SUBJ|@FS-P<": {POS: AUX}, - "|V|PR|3S|SUBJ|@FS-STA": {POS: AUX}, - "|V|PR|3S|SUBJ|@FS-SUBJ>": {POS: AUX}, - "|V|PS/MQP|3P|IND|@FS-|V|PS/MQP|3P|IND|@FS-|V|PS/MQP|3P|IND|@FS-ACC>": {POS: AUX}, - "|V|PS/MQP|3P|IND|@FS-N<": {POS: AUX}, - "|V|PS/MQP|3P|IND|@FS-N|V|PS/MQP|3P|IND|@FS-STA": {POS: AUX}, - "|V|PS|1P|IND|@FS-STA": {POS: AUX}, - "|V|PS|1S|IND|@FS-|V|PS|1S|IND|@FS-ADVL>": {POS: AUX}, - "|V|PS|1S|IND|@FS-STA": {POS: AUX}, - "|V|PS|3P|IND|@FS-|V|PS|3P|IND|@FS-|V|PS|3P|IND|@FS-ACC>": {POS: AUX}, - "|V|PS|3P|IND|@FS-ADVL>": {POS: AUX}, - "|V|PS|3P|IND|@FS-N<": {POS: AUX}, - "|V|PS|3P|IND|@FS-N|V|PS|3P|IND|@FS-STA": {POS: AUX}, - "|V|PS|3S|IND|@FS-|V|PS|3S|IND|@FS-|V|PS|3S|IND|@FS-A<": {POS: AUX}, - "|V|PS|3S|IND|@FS-ACC>": {POS: AUX}, - "|V|PS|3S|IND|@FS-ADVL>": {POS: AUX}, - "|V|PS|3S|IND|@FS-EXC": {POS: AUX}, - "|V|PS|3S|IND|@FS-KOMP<": {POS: AUX}, - "|V|PS|3S|IND|@FS-N<": {POS: AUX}, - "|V|PS|3S|IND|@FS-N|V|PS|3S|IND|@FS-P<": {POS: AUX}, - "|V|PS|3S|IND|@FS-S<": {POS: AUX}, - "|V|PS|3S|IND|@FS-STA": {POS: AUX}, - "|V|PS|3S|SUBJ|@FS-STA": {POS: AUX}, - "||NUM|@N||NUM|F|P|@P<": {POS: NUM}, - "||NUM|M|P|@||NUM|M|P|@||NUM|M|P|@N||NUM|M|P|@P<": {POS: NUM}, - "|||NUM|M|P|@>N": {POS: NUM}, - "||NUM|M|S|@P<": {POS: NUM}, - "||NUM|F|S|@||NUM|M|P|@||NUM|M|P|@N||NUM|M|P|@P<": {POS: NUM}, - "||NUM|M|P|@P<": {POS: NUM}, - "||NUM|M|P|@SUBJ>": {POS: NUM}, - "||NUM|M|S|@P<": {POS: NUM}, - "||NUM|F|S|@>N": {POS: NUM}, - "||NUM|F|P|@||NUM|F|P|@P<": {POS: NUM}, - "||NUM|F|S|@P<": {POS: NUM}, - "||NUM|M/F|P|@||NUM|M|P|@||NUM|M|P|@||NUM|M|P|@>A": {POS: NUM}, - "||NUM|M|P|@N<": {POS: NUM}, - "||NUM|M|P|@P<": {POS: NUM}, - "||NUM|M|P|@SUBJ>": {POS: NUM}, - "||NUM|M|S|@P<": {POS: NUM}, - "||N|M|P|@P<": {POS: NOUN}, - "||||NUM|M|S||P|@>": {POS: NUM}, - "|||NUM|M|S||M|P|@>": {POS: NUM}, - "|||NUM|M|S||P|@>": {POS: NUM}, - "||||NUM|M|S||P|@>": {POS: NUM}, - "||||NUM|M|S|@P<": {POS: NUM}, - "|ADJ|M|S|@>N": {POS: ADJ}, - "|ART|F|S|@>N": {POS: DET}, - "|NUM|F|P|@|NUM|F|P|@|NUM|F|P|@|NUM|F|P|@>A": {POS: NUM}, - "|NUM|F|P|@>N": {POS: NUM}, - "|NUM|F|P|@APP": {POS: NUM}, - "|NUM|F|P|@N<": {POS: NUM}, - "|NUM|F|P|@N|NUM|F|P|@P<": {POS: NUM}, - "|NUM|F|P|@SUBJ>": {POS: NUM}, - "|NUM|F|S|@|NUM|F|S|@|NUM|F|S|@|NUM|F|S|@>N": {POS: NUM}, - "|NUM|F|S|@APP": {POS: NUM}, - "|NUM|F|S|@N<": {POS: NUM}, - "|NUM|F|S|@N|NUM|F|S|@P<": {POS: NUM}, - "|NUM|F|S|@PRED>": {POS: NUM}, - "|NUM|F|S|@SUBJ>": {POS: NUM}, - "|NUM|M/F|P|@|NUM|M/F|P|@|NUM|M/F|P|@>A": {POS: NUM}, - "|NUM|M/F|P|@>N": {POS: NUM}, - "|NUM|M/F|P|@P<": {POS: NUM}, - "|NUM|M/F|S|@P<": {POS: NUM}, - "|NUM|M|P|@|NUM|M|P|@|NUM|M|P|@|NUM|M|P|@|NUM|M|P|@>A": {POS: NUM}, - "|NUM|M|P|@>N": {POS: NUM}, - "|NUM|M|P|@A<": {POS: NUM}, - "|NUM|M|P|@ACC>": {POS: NUM}, - "|NUM|M|P|@ADVL>": {POS: NUM}, - "|NUM|M|P|@APP": {POS: NUM}, - "|NUM|M|P|@AUX<": {POS: NUM}, - "|NUM|M|P|@N<": {POS: NUM}, - 
"|NUM|M|P|@N|NUM|M|P|@P<": {POS: NUM}, - "|NUM|M|P|@SUBJ>": {POS: NUM}, - "|NUM|M|S|@|NUM|M|S|@|NUM|M|S|@|NUM|M|S|@|NUM|M|S|@|NUM|M|S|@>N": {POS: NUM}, - "|NUM|M|S|@ADVL>": {POS: NUM}, - "|NUM|M|S|@APP": {POS: NUM}, - "|NUM|M|S|@N<": {POS: NUM}, - "|NUM|M|S|@N|NUM|M|S|@NPHR": {POS: NUM}, - "|NUM|M|S|@P<": {POS: NUM}, - "|NUM|M|S|@SC>": {POS: NUM}, - "|NUM|M|S|@SUBJ>": {POS: NUM}, - "|N|M|P|@>N": {POS: NOUN}, - "|PROP|M|P|@P<": {POS: PROPN}, - "||ADJ|F|P|@>N": {POS: ADJ}, - "||ADJ|F|S|@ICL-N<": {POS: ADJ}, - "||ADJ|M|P|@||ADJ|M|P|@N<": {POS: ADJ}, - "||ADJ|M|S|@||ADJ|M|S|@>N": {POS: ADJ}, - "||ADJ|M|S|@N<": {POS: ADJ}, - "|||NUM|M|P|@|||DET|F|P|@P<": {POS: PRON}, - "|||DET|M|S|@P<": {POS: PRON}, - "||||ADJ|M|S|@P<": {POS: ADJ}, - "||ADV|@ADVL>": {POS: ADV}, - "||ADV|@FS-N<": {POS: ADV}, - "||ADV|@N||NUM|F|P|@||NUM|M|P|@P<": {POS: NUM}, - "||ADV|@||PRP|@||PRP|@PASS": {POS: ADP}, - "|||ADJ|F|S|@>N": {POS: ADJ}, - "|||ADJ|F|S|@N<": {POS: ADJ}, - "|||ADJ|M|S|@|||ADJ|M|S|@>N": {POS: ADJ}, - "||||ADJ|M|S|@N<": {POS: ADJ}, - "||||DET|F|P|@FS-STA": {POS: DET}, - "||||DET|M|P|@P<": {POS: PRON}, - "|||DET|F|P|@|||DET|F|S|@NPHR": {POS: DET}, - "|||DET|F|S|@P<": {POS: PRON}, - "|||DET|M|P|@|||DET|M|S|@|||DET|M|S|@APP": {POS: DET}, - "|||DET|M|S|@N|||ADJ|M|S|@|||ADV|@P<": {POS: ADV}, - "|||ADJ|F|S|@APP": {POS: ADJ}, - "||||ADJ|M|S|@||||ADJ|M|S|@P<": {POS: ADJ}, - "||||ADJ|M|S|@SUBJ>": {POS: ADJ}, - "||||N|F|S|@ADVL": {POS: NOUN}, - "|||ADJ|F|S|@N|||ADJ|M|S|@ADVL>": {POS: ADJ}, - "|||ADJ|M|S|@N<": {POS: ADJ}, - "||ADJ|F|S|@>N": {POS: ADJ}, - "||ADJ|F|S|@APP": {POS: ADJ}, - "||ADJ|M|S|@||ADJ|M|S|@>N": {POS: ADJ}, - "||ADJ|M|S|@FS-UTT": {POS: ADJ}, - "||ADJ|M|S|@N<": {POS: ADJ}, - "||ADJ|M|S|@N||||ADJ|M|S|@N<": {POS: ADJ}, - "|||ADJ|M|S|@>N": {POS: ADJ}, - "|||ADJ|M|S|@P<": {POS: ADJ}, - "||DET|M|P|@>N": {POS: DET}, - "|||V|PCP|@ICL-P<": {POS: AUX}, - "|||V|FUT|3S|IND|@FS-N<": {POS: AUX}, - "|||V|INF|@ICL-P<": {POS: AUX}, - "|||V|PR|3P|IND|@FS-||V|COND|3P|@FS-||V|COND|3P|@FS-APP": {POS: AUX}, - "||V|COND|3P|@FS-STA": {POS: AUX}, - "||V|COND|3S|@FS-||V|COND|3S|@FS-N||V|COND|3S|@FS-STA": {POS: AUX}, - "||V|FUT|3P|IND|@FS-||V|FUT|3P|IND|@FS-||V|FUT|3P|IND|@FS-N||V|FUT|3P|IND|@FS-STA": {POS: AUX}, - "||V|FUT|3P|IND|@ICL-N<": {POS: AUX}, - "||V|FUT|3P|SUBJ|@FS-ADVL>": {POS: AUX}, - "||V|FUT|3S|IND|@FS-||V|FUT|3S|IND|@FS-||V|FUT|3S|IND|@FS-N<": {POS: AUX}, - "||V|FUT|3S|IND|@FS-N||V|FUT|3S|IND|@FS-STA": {POS: AUX}, - "||V|FUT|3S|IND|@ICL-||V|FUT|3S|IND|@N||V|FUT|3S|SUBJ|@FS-||V|GER|@ICL-||V|GER|@ICL-ADVL>": {POS: AUX}, - "||V|IMPF|1S|IND|@FS-STA": {POS: AUX}, - "||V|IMPF|3P|IND|@FS-STA": {POS: AUX}, - "||V|IMPF|3S|IND|@FS-||V|IMPF|3S|IND|@FS-N<": {POS: AUX}, - "||V|IMPF|3S|IND|@FS-N||V|IMPF|3S|IND|@FS-STA": {POS: AUX}, - "||V|IMPF|3S|SUBJ|@FS-STA": {POS: AUX}, - "||V|INF|@ICL-||V|INF|@ICL-AUX<": {POS: AUX}, - "||V|INF|@ICL-P<": {POS: AUX}, - "||V|MQP|3S|IND|@FS-STA": {POS: AUX}, - "||V|PCP|@ICL-AUX<": {POS: AUX}, - "||V|PR|1P|IND|@FS-ACC>": {POS: AUX}, - "||V|PR|1P|IND|@FS-STA": {POS: AUX}, - "||V|PR|1S|IND|@FS-||V|PR|1S|IND|@FS-ACC>": {POS: AUX}, - "||V|PR|1S|IND|@FS-STA": {POS: AUX}, - "||V|PR|2S|IND|@FS-STA": {POS: AUX}, - "||V|PR|3P|IND|@FS-||V|PR|3P|IND|@FS-N||V|PR|3P|IND|@FS-P<": {POS: AUX}, - "||V|PR|3P|IND|@FS-STA": {POS: AUX}, - "||V|PR|3P|SUBJ|@FS-||V|PR|3P|SUBJ|@FS-||V|PR|3P|SUBJ|@FS-ADVL>": {POS: AUX}, - "||V|PR|3S|IND|@FS-||V|PR|3S|IND|@FS-||V|PR|3S|IND|@FS-||V|PR|3S|IND|@FS-ACC>": {POS: AUX}, - "||V|PR|3S|IND|@FS-ADVL>": {POS: AUX}, - "||V|PR|3S|IND|@FS-APP": {POS: AUX}, - 
"||V|PR|3S|IND|@FS-KOMP<": {POS: AUX}, - "||V|PR|3S|IND|@FS-N<": {POS: AUX}, - "||V|PR|3S|IND|@FS-N||V|PR|3S|IND|@FS-P<": {POS: AUX}, - "||V|PR|3S|IND|@FS-QUE": {POS: AUX}, - "||V|PR|3S|IND|@FS-SC>": {POS: AUX}, - "||V|PR|3S|IND|@FS-STA": {POS: AUX}, - "||V|PR|3S|IND|@ICL-||V|PR|3S|IND|@ICL-N<": {POS: AUX}, - "||V|PR|3S|SUBJ|@ADVL>": {POS: AUX}, - "||V|PR|3S|SUBJ|@FS-||V|PR|3S|SUBJ|@FS-||V|PR|3S|SUBJ|@FS-||V|PR|3S|SUBJ|@FS-P<": {POS: AUX}, - "||V|PR|3S|SUBJ|@FS-STA": {POS: AUX}, - "||V|PS/MQP|3P|IND|@FS-||V|PS/MQP|3P|IND|@FS-STA": {POS: AUX}, - "||V|PS|1P|IND|@FS-STA": {POS: AUX}, - "||V|PS|1S|IND|@FS-ACC>": {POS: AUX}, - "||V|PS|1S|IND|@FS-STA": {POS: AUX}, - "||V|PS|3P|IND|@FS-||V|PS|3P|IND|@FS-ACC>": {POS: AUX}, - "||V|PS|3P|IND|@FS-STA": {POS: AUX}, - "||V|PS|3P|IND|@N||V|PS|3S|IND|@FS-||V|PS|3S|IND|@FS-ADVL>": {POS: AUX}, - "||V|PS|3S|IND|@FS-N<": {POS: AUX}, - "||V|PS|3S|IND|@FS-N||V|PS|3S|IND|@FS-STA": {POS: AUX}, - "||V|PS|3S|IND|@ICL-STA": {POS: AUX}, - "|||NUM|M|S|@N<": {POS: NUM}, - "|||NUM|M|S|@APP": {POS: NUM}, - "|||NUM|M|P|@P<": {POS: NUM}, - "||NUM|F|P|@||NUM|F|P|@>A": {POS: NUM}, - "||NUM|F|P|@>N": {POS: NUM}, - "||NUM|F|P|@N<": {POS: NUM}, - "||NUM|F|P|@N||NUM|F|P|@NPHR": {POS: NUM}, - "||NUM|F|P|@P<": {POS: NUM}, - "||NUM|F|S|@||NUM|F|S|@N<": {POS: NUM}, - "||NUM|F|S|@P<": {POS: NUM}, - "||NUM|M/F|P|@N<": {POS: NUM}, - "||NUM|M|P|@||NUM|M|P|@||NUM|M|P|@||NUM|M|P|@>N": {POS: NUM}, - "||NUM|M|P|@APP": {POS: NUM}, - "||NUM|M|P|@N<": {POS: NUM}, - "||NUM|M|P|@P<": {POS: NUM}, - "||NUM|M|S|@>N": {POS: NUM}, - "||NUM|M|S|@N<": {POS: NUM}, - "||NUM|M|S|@N||NUM|M|S|@P<": {POS: NUM}, - "||NUM|M|S|@PRED>": {POS: NUM}, - "|||N|M|P|@P<": {POS: NOUN}, - "||KC|@||PRP|@||PRP|@PRED>": {POS: ADP}, - "|||INDP|M|S|@KOMP<": {POS: PRON}, - "|||DET|F|P|@|||DET|F|S|@N|||DET|M|P|@||DET|F|P|@||DET|F|P|@||DET|F|P|@P<": {POS: PRON}, - "||DET|F|P|@SUBJ>": {POS: PRON}, - "||DET|F|S|@APP": {POS: DET}, - "||DET|M|P|@||DET|M|P|@P<": {POS: PRON}, - "||DET|M|P|@SUBJ>": {POS: PRON}, - "||DET|M|S|@||DET|M|S|@APP": {POS: DET}, - "||DET|M|S|@P<": {POS: PRON}, - "||DET|M|S|@SUBJ>": {POS: PRON}, - "||||DET|F|S|@SUBJ>": {POS: PRON}, - "|||DET|F|P|@P<": {POS: PRON}, - "|||DET|F|S|@P<": {POS: PRON}, - "|||DET|M|P|@P<": {POS: PRON}, - "||DET|F|S|@P<": {POS: PRON}, - "||DET|M|S|@N||DET|M|S|@SUBJ>": {POS: PRON}, - "|||ADJ|M|S|@N<": {POS: ADJ}, - "||||ADJ|M|S|@||||ADJ|M|S|@P<": {POS: ADJ}, - "|||N|F|P|@P<": {POS: NOUN}, - "|||N|F|S|@P<": {POS: NOUN}, - "|||N|F|S|@SUBJ>": {POS: NOUN}, - "|||N|M|P|@|||N|M|P|@|||N|M|S|@|||N|F|P|@P<": {POS: NOUN}, - "|||N|F|S|@|||N|F|S|@APP": {POS: NOUN}, - "|||N|F|S|@N|||N|F|S|@P<": {POS: NOUN}, - "|||N|M|P|@|||N|M|P|@APP": {POS: NOUN}, - "|||N|M|P|@P<": {POS: NOUN}, - "|||N|M|S|@APP": {POS: NOUN}, - "|||N|M|S|@N|||N|M|S|@P<": {POS: NOUN}, - "|||PRP|@N<": {POS: ADP}, - "||ADJ|F|S|@N||ADJ|F|S|@NPHR": {POS: ADJ}, - "||ADJ|M|P|@FS-STA": {POS: ADJ}, - "||ADJ|M|S|@N||ADV|@ADVL": {POS: ADV}, - "||PERS|F|3S|NOM|@N||PROP|F|S|@APP": {POS: PROPN}, - "||PROP|F|S|@NPHR": {POS: PROPN}, - "||PROP|M|S|@||PROP|M|S|@||PROP|M|S|@APP": {POS: PROPN}, - "||PROP|M|S|@N||PROP|M|S|@SUBJ>": {POS: PROPN}, - "||PRP|@||PRP|@||PRP|@N<": {POS: ADP}, - "||ADV|@||ADV|@ADVL>": {POS: ADV}, - "|||ADV|@ADVL>": {POS: ADV}, - "||||N|F|S|@||||N|M|S|@P<": {POS: NOUN}, - "|||V|FUT|3S|SUBJ|@FS-|||V|IMPF|3P|IND|@FS-N<": {POS: VERB}, - "|||V|IMPF|3P|IND|@FS-STA": {POS: VERB}, - "|||V|IMPF|3S|IND|@FS-N|||V|IMPF|3S|IND|@ICL-N<": {POS: VERB}, - "|||V|IMPF|3S|SUBJ|@FS-P<": {POS: VERB}, - "|||V|INF|3P|@ICL-P<": {POS: VERB}, - 
"|||V|MQP|3S|IND|@FS-N|||V|PR|3P|IND|@FS-|||V|PR|3P|IND|@FS-N<": {POS: VERB}, - "|||V|PR|3P|IND|@FS-N|||V|PR|3P|IND|@FS-STA": {POS: VERB}, - "|||V|PR|3P|SUBJ|@FS-|||V|PR|3P|SUBJ|@FS-|||V|PR|3S|IND|@FS-N<": {POS: VERB}, - "|||V|PR|3S|IND|@FS-N|||V|PR|3S|IND|@FS-STA": {POS: VERB}, - "|||V|PS/MQP|3P|IND|@FS-STA": {POS: VERB}, - "|||V|PS|3P|IND|@FS-N<": {POS: VERB}, - "|||V|PS|3P|IND|@FS-QUE": {POS: VERB}, - "|||V|PS|3S|IND|@FS-|||V|PS|3S|IND|@FS-|||V|PS|3S|IND|@FS-N<": {POS: VERB}, - "|||V|PS|3S|IND|@FS-N|||V|PS|3S|IND|@FS-STA": {POS: VERB}, - "|||V|PS|3S|IND|@N||V|COND|1P|@FS-||V|COND|1S|@FS-STA": {POS: VERB}, - "||V|COND|3P|@FS-N||V|COND|3P|@FS-P<": {POS: VERB}, - "||V|COND|3S|@FS-||V|COND|3S|@FS-ACC>": {POS: AUX}, - "||V|COND|3S|@FS-P<": {POS: VERB}, - "||V|COND|3S|@FS-QUE": {POS: VERB}, - "||V|COND|3S|@FS-STA": {POS: VERB}, - "||V|COND|3S|@N||V|FUT|1S|IND|@FS-STA": {POS: VERB}, - "||V|FUT|3P|IND|@FS-||V|FUT|3P|IND|@FS-||V|FUT|3P|IND|@FS-N<": {POS: VERB}, - "||V|FUT|3P|IND|@FS-N||V|FUT|3P|IND|@FS-STA": {POS: VERB}, - "||V|FUT|3P|IND|@N<": {POS: VERB}, - "||V|FUT|3P|SUBJ|@FS-||V|FUT|3P|SUBJ|@FS-ADVL>": {POS: VERB}, - "||V|FUT|3S|IND|@FS-||V|FUT|3S|IND|@FS-||V|FUT|3S|IND|@FS-N<": {POS: VERB}, - "||V|FUT|3S|IND|@FS-N||V|FUT|3S|IND|@FS-STA": {POS: VERB}, - "||V|FUT|3S|IND|@FS-UTT": {POS: VERB}, - "||V|FUT|3S|IND|@N||V|FUT|3S|SUBJ|@FS-||V|FUT|3S|SUBJ|@FS-ADVL>": {POS: VERB}, - "||V|FUT|3S|SUBJ|@ICL-P<": {POS: VERB}, - "||V|GER|@ADVL>": {POS: VERB}, - "||V|GER|@ICL-||V|GER|@ICL-ADVL>": {POS: VERB}, - "||V|GER|@ICL-N||V|GER|@N<": {POS: VERB}, - "||V|IMPF|1P|IND|@FS-STA": {POS: VERB}, - "||V|IMPF|1S|IND|@FS-A<": {POS: VERB}, - "||V|IMPF|1S|IND|@FS-ACC>": {POS: VERB}, - "||V|IMPF|1S|IND|@FS-KOMP<": {POS: VERB}, - "||V|IMPF|1S|IND|@FS-STA": {POS: VERB}, - "||V|IMPF|3P|IND|@ADVL>": {POS: AUX}, - "||V|IMPF|3P|IND|@FS-||V|IMPF|3P|IND|@FS-||V|IMPF|3P|IND|@FS-ADVL>": {POS: VERB}, - "||V|IMPF|3P|IND|@FS-KOMP<": {POS: VERB}, - "||V|IMPF|3P|IND|@FS-N<": {POS: VERB}, - "||V|IMPF|3P|IND|@FS-N||V|IMPF|3P|IND|@FS-P<": {POS: VERB}, - "||V|IMPF|3P|IND|@FS-QUE": {POS: AUX}, - "||V|IMPF|3P|IND|@FS-STA": {POS: VERB}, - "||V|IMPF|3P|IND|@ICL-STA": {POS: VERB}, - "||V|IMPF|3P|SUBJ|@FS-||V|IMPF|3S|IND|@FS-||V|IMPF|3S|IND|@FS-||V|IMPF|3S|IND|@FS-ACC>": {POS: AUX}, - "||V|IMPF|3S|IND|@FS-N<": {POS: VERB}, - "||V|IMPF|3S|IND|@FS-N||V|IMPF|3S|IND|@FS-P<": {POS: VERB}, - "||V|IMPF|3S|IND|@FS-STA": {POS: VERB}, - "||V|IMPF|3S|IND|@FS-SUBJ>": {POS: VERB}, - "||V|IMPF|3S|IND|@ICL-N<": {POS: VERB}, - "||V|IMPF|3S|IND|@N<": {POS: VERB}, - "||V|IMPF|3S|SUBJ|@FS-||V|INF|1P|@ICL-||V|INF|3P|@ICL-||V|INF|3P|@ICL-P<": {POS: VERB}, - "||V|INF|3S|@FS-STA": {POS: VERB}, - "||V|INF|3S|@ICL-||V|INF|3S|@ICL-P<": {POS: VERB}, - "||V|INF|@FS-QUE": {POS: VERB}, - "||V|INF|@FS-STA": {POS: VERB}, - "||V|INF|@ICL-||V|INF|@ICL-||V|INF|@ICL-AUX<": {POS: VERB}, - "||V|INF|@ICL-KOMP<": {POS: VERB}, - "||V|INF|@ICL-N||V|INF|@ICL-P<": {POS: VERB}, - "||V|INF|@P<": {POS: VERB}, - "||V|MQP|3S|IND|@FS-N||V|MQP|3S|IND|@FS-P<": {POS: VERB}, - "||V|MQP|3S|IND|@FS-STA": {POS: VERB}, - "||V|PCP|@ICL-AUX<": {POS: VERB}, - "||V|PCP|F|P|@ICL-AUX<": {POS: VERB}, - "||V|PCP|F|P|@ICL-N<": {POS: VERB}, - "||V|PCP|F|P|@ICL-N||V|PCP|F|P|@N<": {POS: ADJ}, - "||V|PCP|F|S|@ICL-||V|PCP|F|S|@ICL-||V|PCP|F|S|@ICL-AUX<": {POS: VERB}, - "||V|PCP|F|S|@ICL-N<": {POS: VERB}, - "||V|PCP|F|S|@ICL-N||V|PCP|F|S|@ICL-PRED>": {POS: VERB}, - "||V|PCP|F|S|@N<": {POS: VERB}, - "||V|PCP|F|S|@PRED>": {POS: VERB}, - "||V|PCP|M|P|@FS-ACC>": {POS: VERB}, - "||V|PCP|M|P|@FS-STA": {POS: VERB}, - 
"||V|PCP|M|P|@ICL-||V|PCP|M|P|@ICL-AUX<": {POS: VERB}, - "||V|PCP|M|P|@ICL-N<": {POS: VERB}, - "||V|PCP|M|P|@ICL-N||V|PCP|M|P|@ICL-P<": {POS: VERB}, - "||V|PCP|M|P|@N<": {POS: VERB}, - "||V|PCP|M|S|@FS-N||V|PCP|M|S|@FS-STA": {POS: VERB}, - "||V|PCP|M|S|@ICL-||V|PCP|M|S|@ICL-||V|PCP|M|S|@ICL-||V|PCP|M|S|@ICL-AUX<": {POS: VERB}, - "||V|PCP|M|S|@ICL-N||V|PCP|M|S|@ICL-PRED>": {POS: VERB}, - "||V|PCP|M|S|@N<": {POS: VERB}, - "||V|PCP|M|S|@N||V|PR|1P|IND|@FS-ACC>": {POS: VERB}, - "||V|PR|1P|IND|@FS-ADVL>": {POS: VERB}, - "||V|PR|1P|IND|@FS-EXC": {POS: VERB}, - "||V|PR|1P|IND|@FS-N<": {POS: VERB}, - "||V|PR|1P|IND|@FS-N||V|PR|1P|IND|@FS-STA": {POS: VERB}, - "||V|PR|1P|SUBJ|@FS-N||V|PR|1S|IND|@FS-||V|PR|1S|IND|@FS-ACC>": {POS: VERB}, - "||V|PR|1S|IND|@FS-STA": {POS: VERB}, - "||V|PR|3P|IND|@||V|PR|3P|IND|@FS-||V|PR|3P|IND|@FS-||V|PR|3P|IND|@FS-||V|PR|3P|IND|@FS-ACC>": {POS: VERB}, - "||V|PR|3P|IND|@FS-ADVL>": {POS: VERB}, - "||V|PR|3P|IND|@FS-APP": {POS: VERB}, - "||V|PR|3P|IND|@FS-KOMP<": {POS: VERB}, - "||V|PR|3P|IND|@FS-N<": {POS: VERB}, - "||V|PR|3P|IND|@FS-N||V|PR|3P|IND|@FS-P<": {POS: VERB}, - "||V|PR|3P|IND|@FS-QUE": {POS: AUX}, - "||V|PR|3P|IND|@FS-STA": {POS: VERB}, - "||V|PR|3P|IND|@N<": {POS: VERB}, - "||V|PR|3P|IND|@NPHR": {POS: AUX}, - "||V|PR|3P|SUBJ|@FS-||V|PR|3P|SUBJ|@FS-||V|PR|3P|SUBJ|@FS-N<": {POS: VERB}, - "||V|PR|3P|SUBJ|@FS-P<": {POS: VERB}, - "||V|PR|3P|SUBJ|@FS-STA": {POS: VERB}, - "||V|PR|3S|IND|@FS-||V|PR|3S|IND|@FS-||V|PR|3S|IND|@FS-||V|PR|3S|IND|@FS-||V|PR|3S|IND|@FS-||V|PR|3S|IND|@FS-A<": {POS: VERB}, - "||V|PR|3S|IND|@FS-ACC>": {POS: VERB}, - "||V|PR|3S|IND|@FS-ADVL>": {POS: VERB}, - "||V|PR|3S|IND|@FS-APP": {POS: AUX}, - "||V|PR|3S|IND|@FS-N<": {POS: VERB}, - "||V|PR|3S|IND|@FS-N||V|PR|3S|IND|@FS-P<": {POS: VERB}, - "||V|PR|3S|IND|@FS-QUE": {POS: VERB}, - "||V|PR|3S|IND|@FS-S<": {POS: VERB}, - "||V|PR|3S|IND|@FS-STA": {POS: VERB}, - "||V|PR|3S|IND|@FS-SUBJ>": {POS: VERB}, - "||V|PR|3S|IND|@ICL-ADVL>": {POS: AUX}, - "||V|PR|3S|IND|@ICL-AUX<": {POS: VERB}, - "||V|PR|3S|IND|@N||V|PR|3S|IND|@N||V|PR|3S|IND|@NPHR": {POS: VERB}, - "||V|PR|3S|IND|@P<": {POS: VERB}, - "||V|PR|3S|IND|@STA": {POS: AUX}, - "||V|PR|3S|SUBJ|@FS-||V|PR|3S|SUBJ|@FS-||V|PR|3S|SUBJ|@FS-||V|PR|3S|SUBJ|@FS-||V|PR|3S|SUBJ|@FS-COM<": {POS: VERB}, - "||V|PR|3S|SUBJ|@FS-N<": {POS: VERB}, - "||V|PR|3S|SUBJ|@FS-P<": {POS: VERB}, - "||V|PR|3S|SUBJ|@FS-STA": {POS: VERB}, - "||V|PR|3S|SUBJ|@ICL-N<": {POS: VERB}, - "||V|PR|3S|SUBJ|@ICL-P<": {POS: VERB}, - "||V|PR|3S|SUBJ|@N<": {POS: VERB}, - "||V|PS/MQP|3P|IND|@FS-||V|PS/MQP|3P|IND|@FS-N||V|PS/MQP|3P|IND|@FS-STA": {POS: VERB}, - "||V|PS/MQP|3P|IND|@ICL-N||V|PS/MQP|3P|IND|@N||V|PS|1P|IND|@FS-||V|PS|1P|IND|@FS-STA": {POS: VERB}, - "||V|PS|1S|IND|@FS-||V|PS|1S|IND|@FS-||V|PS|1S|IND|@FS-ACC>": {POS: VERB}, - "||V|PS|1S|IND|@FS-ADVL>": {POS: VERB}, - "||V|PS|1S|IND|@FS-N<": {POS: VERB}, - "||V|PS|1S|IND|@FS-STA": {POS: VERB}, - "||V|PS|3P|IND|@FS-||V|PS|3P|IND|@FS-||V|PS|3P|IND|@FS-ACC>": {POS: VERB}, - "||V|PS|3P|IND|@FS-N<": {POS: VERB}, - "||V|PS|3P|IND|@FS-QUE": {POS: AUX}, - "||V|PS|3P|IND|@FS-STA": {POS: VERB}, - "||V|PS|3S|IND|@||V|PS|3S|IND|@FS-||V|PS|3S|IND|@FS-||V|PS|3S|IND|@FS-ACC>": {POS: VERB}, - "||V|PS|3S|IND|@FS-ADVL>": {POS: VERB}, - "||V|PS|3S|IND|@FS-KOMP<": {POS: VERB}, - "||V|PS|3S|IND|@FS-N<": {POS: VERB}, - "||V|PS|3S|IND|@FS-N||V|PS|3S|IND|@FS-P<": {POS: VERB}, - "||V|PS|3S|IND|@FS-S<": {POS: VERB}, - "||V|PS|3S|IND|@FS-STA": {POS: VERB}, - "||V|PS|3S|IND|@FS-UTT": {POS: VERB}, - "||V|PS|3S|IND|@ICL-N<": {POS: VERB}, - 
"||V|PS|3S|IND|@ICL-N||V|PS|3S|IND|@ICL-QUE": {POS: VERB}, - "||V|PS|3S|IND|@ICL-STA": {POS: VERB}, - "||V|PS|3S|IND|@N|||ADJ|M|S|@P<": {POS: ADJ}, - "|||ADJ|F|P|@|||ADJ|F|S|@|||ADJ|M|P|@|||ADJ|M|P|@|||ADJ|M|P|@ACC>": {POS: ADJ}, - "|||ADJ|M|P|@N|||ADJ|M|P|@P<": {POS: ADJ}, - "|||ADJ|M|S|@|||ADJ|M|S|@|||ADJ|M|S|@APP": {POS: ADJ}, - "|||ADJ|M|S|@NPHR": {POS: ADJ}, - "|||ADJ|M|S|@P<": {POS: ADJ}, - "|||ADJ|F|S|@N|||ADJ|M|S|@SC>": {POS: ADJ}, - "||ADJ|F|P|@SUBJ>": {POS: ADJ}, - "||ADJ|F|S|@N||ADJ|F|S|@P<": {POS: ADJ}, - "||ADJ|M/F|P|@SUBJ>": {POS: ADJ}, - "||ADJ|M|P|@||ADJ|M|P|@APP": {POS: ADJ}, - "||ADJ|M|P|@P<": {POS: ADJ}, - "||ADJ|M|P|@SUBJ>": {POS: ADJ}, - "||ADJ|M|S|@P<": {POS: ADJ}, - "||ADJ|M|S|@SUBJ>": {POS: ADJ}, - "||V|PCP|M|P|@||V|PCP|M|P|@N||V|PCP|M|P|@P<": {POS: VERB}, - "||V|PCP|M|P|@SUBJ>": {POS: VERB}, - "||V|PCP|M|S|@||ADJ|M|S|@N||N|F|P|@||N|F|P|@||N|F|P|@||N|F|P|@ACC>": {POS: NOUN}, - "||N|F|P|@APP": {POS: NOUN}, - "||N|F|P|@N||N|F|P|@P<": {POS: NOUN}, - "||N|F|P|@SUBJ>": {POS: NOUN}, - "||N|F|S|@||N|F|S|@||N|F|S|@||N|F|S|@||N|F|S|@||N|F|S|@ACC>": {POS: NOUN}, - "||N|F|S|@APP": {POS: NOUN}, - "||N|F|S|@ICL-APP": {POS: NOUN}, - "||N|F|S|@N<": {POS: NOUN}, - "||N|F|S|@N||N|F|S|@P<": {POS: NOUN}, - "||N|F|S|@SUBJ>": {POS: NOUN}, - "||N|M|P|@||N|M|P|@||N|M|P|@||N|M|P|@||N|M|P|@APP": {POS: NOUN}, - "||N|M|P|@FS-N<": {POS: NOUN}, - "||N|M|P|@N<": {POS: NOUN}, - "||N|M|P|@N||N|M|P|@P<": {POS: SYM}, - "||N|M|P|@SUBJ>": {POS: NOUN}, - "||N|M|S|@||N|M|S|@||N|M|S|@||N|M|S|@||N|M|S|@ACC>": {POS: NOUN}, - "||N|M|S|@APP": {POS: NOUN}, - "||N|M|S|@FS-STA": {POS: NOUN}, - "||N|M|S|@ICL-||N|M|S|@ICL-PRED>": {POS: NOUN}, - "||N|M|S|@KOMP<": {POS: NOUN}, - "||N|M|S|@N||N|M|S|@NPHR": {POS: NOUN}, - "||N|M|S|@P<": {POS: NOUN}, - "||N|M|S|@SC>": {POS: NOUN}, - "||N|M|S|@SUBJ>": {POS: NOUN}, - "||ADJ|M|P|@||ADJ|M|S|@N||N|F|P|@||N|F|P|@||N|F|P|@||N|F|P|@||N|F|P|@||N|F|P|@ACC>": {POS: NOUN}, - "||N|F|P|@APP": {POS: NOUN}, - "||N|F|P|@N<": {POS: NOUN}, - "||N|F|P|@N||N|F|P|@NPHR": {POS: NOUN}, - "||N|F|P|@P<": {POS: NOUN}, - "||N|F|P|@PASS": {POS: NOUN}, - "||N|F|P|@SUBJ>": {POS: NOUN}, - "||N|F|S|@||N|F|S|@||N|F|S|@||N|F|S|@||N|F|S|@||N|F|S|@ACC>": {POS: NOUN}, - "||N|F|S|@ADVL": {POS: NOUN}, - "||N|F|S|@APP": {POS: NOUN}, - "||N|F|S|@FS-S<": {POS: NOUN}, - "||N|F|S|@ICL-APP": {POS: NOUN}, - "||N|F|S|@N<": {POS: NOUN}, - "||N|F|S|@N||N|F|S|@NPHR": {POS: NOUN}, - "||N|F|S|@P<": {POS: NOUN}, - "||N|F|S|@PRED>": {POS: NOUN}, - "||N|F|S|@SC>": {POS: NOUN}, - "||N|F|S|@SUBJ>": {POS: NOUN}, - "||N|F|S|@VOK": {POS: NOUN}, - "||N|M/F|P|@P<": {POS: NOUN}, - "||N|M|P|@||N|M|P|@||N|M|P|@||N|M|P|@||N|M|P|@ACC>": {POS: NOUN}, - "||N|M|P|@APP": {POS: NOUN}, - "||N|M|P|@ICL-||N|M|P|@ICL-N<": {POS: NOUN}, - "||N|M|P|@N<": {POS: NOUN}, - "||N|M|P|@N||N|M|P|@NPHR": {POS: NOUN}, - "||N|M|P|@P<": {POS: NOUN}, - "||N|M|P|@SUBJ>": {POS: NOUN}, - "||N|M|S|@||N|M|S|@||N|M|S|@||N|M|S|@||N|M|S|@ACC>": {POS: NOUN}, - "||N|M|S|@ADVL": {POS: NOUN}, - "||N|M|S|@ADVL>": {POS: NOUN}, - "||N|M|S|@APP": {POS: NOUN}, - "||N|M|S|@N<": {POS: NOUN}, - "||N|M|S|@N||N|M|S|@NPHR": {POS: NOUN}, - "||N|M|S|@P<": {POS: NOUN}, - "||N|M|S|@PRED>": {POS: NOUN}, - "||N|M|S|@STA": {POS: NOUN}, - "||N|M|S|@SUBJ>": {POS: NOUN}, - "||DET|F|S|@>N": {POS: DET}, - "||DET|M|S|@P<": {POS: PRON}, - "||PRP|@||PRP|@||PRP|@|||N|F|S|@SUBJ>": {POS: NOUN}, - "|||N|M|S|@|||N|M|S|@|||N|M|S|@NPHR": {POS: NOUN}, - "|||N|M|S|@P<": {POS: NOUN}, - "|||N|F|S|@N<": {POS: NOUN}, - "|||N|F|S|@P<": {POS: NOUN}, - "|||N|F|S|@SUBJ>": {POS: NOUN}, - "|||N|M|S|@N<": {POS: 
-    [large block of deleted tag-map entries: mappings from fine-grained morphological tag strings (e.g. "V|PR|3S|IND|@FS-STA", "N|M|S|@SUBJ>", "PRP|@ADVL>") to Universal POS values such as {POS: NOUN}, {POS: VERB}, {POS: AUX}, {POS: ADJ}, {POS: ADV}, {POS: DET}, {POS: PRON}, {POS: PROPN}, {POS: ADP}, {POS: NUM}, {POS: CCONJ}, {POS: SCONJ}, {POS: PART}, {POS: INTJ}, {POS: PUNCT}, {POS: SYM}, {POS: X}; the angle-bracketed segments of the original keys were lost in extraction and cannot be reconstructed here]
NOUN}, - "|N|M/F|S|@P<": {POS: NOUN}, - "|N|M|P|@|N|M|P|@|N|M|P|@|N|M|P|@|N|M|P|@|N|M|P|@|N|M|P|@|N|M|P|@>A": {POS: NOUN}, - "|N|M|P|@>N": {POS: NOUN}, - "|N|M|P|@A<": {POS: NOUN}, - "|N|M|P|@ACC>": {POS: NOUN}, - "|N|M|P|@ADVL>": {POS: NOUN}, - "|N|M|P|@APP": {POS: NOUN}, - "|N|M|P|@AUX<": {POS: NOUN}, - "|N|M|P|@ICL-|N|M|P|@ICL-P<": {POS: NOUN}, - "|N|M|P|@N<": {POS: NOUN}, - "|N|M|P|@N|N|M|P|@NPHR": {POS: NOUN}, - "|N|M|P|@P<": {POS: NOUN}, - "|N|M|P|@PRED>": {POS: NOUN}, - "|N|M|P|@SUBJ>": {POS: PROPN}, - "|N|M|P|@TOP": {POS: NOUN}, - "|N|M|R|@|N|M|S|@|N|M|S|@|N|M|S|@|N|M|S|@|N|M|S|@|N|M|S|@|N|M|S|@|N|M|S|@|N|M|S|@|N|M|S|@>A": {POS: NOUN}, - "|N|M|S|@>N": {POS: NOUN}, - "|N|M|S|@A<": {POS: NOUN}, - "|N|M|S|@ACC>": {POS: NOUN}, - "|N|M|S|@ADVL": {POS: NOUN}, - "|N|M|S|@ADVL>": {POS: NOUN}, - "|N|M|S|@APP": {POS: NOUN}, - "|N|M|S|@CO": {POS: NOUN}, - "|N|M|S|@ICL-|N|M|S|@ICL-P<": {POS: NOUN}, - "|N|M|S|@KOMP<": {POS: NOUN}, - "|N|M|S|@N<": {POS: NOUN}, - "|N|M|S|@N|N|M|S|@NPHR": {POS: NOUN}, - "|N|M|S|@P<": {POS: NOUN}, - "|N|M|S|@PRED>": {POS: NOUN}, - "|N|M|S|@S<": {POS: NOUN}, - "|N|M|S|@SC>": {POS: NOUN}, - "|N|M|S|@SUBJ>": {POS: NOUN}, - "|N|M|S|@VOC": {POS: NOUN}, - "|N|M|S|@VOK": {POS: NOUN}, - "|N|M|s|@P<": {POS: NOUN}, - "|PROP|||N|F|S|@APP": {POS: PROPN}, - "|PROP|M|S|@P<": {POS: PROPN}, - "|PERS|F|3P|ACC|@|PERS|F|3P|ACC|@ACC>": {POS: PRON}, - "|PERS|F|3S|ACC|@|PERS|F|3S|ACC|@ACC>": {POS: PRON}, - "|PERS|M/F|3S|ACC|@|PERS|M|3P|ACC|@|PERS|M|3P|ACC|@ACC>": {POS: PRON}, - "|PERS|M|3S|ACC|@|PERS|M|3S|ACC|@ACC>": {POS: PRON}, - "|ADV|@CO": {POS: ADV}, - "|KC|@CO": {POS: CCONJ}, - "||KC|@CO": {POS: CCONJ}, - "|KC|@CO": {POS: CCONJ}, - "|||V|PCP|F|P|@ICL-AUX<": {POS: VERB}, - "|||V|PCP|F|S|@ICL-AUX<": {POS: VERB}, - "|||V|PCP|M|S|@ICL-AUX<": {POS: VERB}, - "||V|PCP|F|P|@ICL-AUX<": {POS: VERB}, - "||V|PCP|F|S|@ICL-AUX<": {POS: VERB}, - "||V|PCP|MVF|S|@ICL-AUX<": {POS: VERB}, - "||V|PCP|M|P|@ICL-AUX<": {POS: VERB}, - "||V|PCP|M|S|@ICL-AUX<": {POS: VERB}, - "||DET|M|P|@P<": {POS: PRON}, - "||DET|M|P|@SUBJ>": {POS: PRON}, - "||DET|M|S|@N<": {POS: DET}, - "||DET|F|S|@N<": {POS: DET}, - "|||DET|M|P|@>A": {POS: DET}, - "||DET|F|P|@>N": {POS: DET}, - "||DET|F|S|@>N": {POS: DET}, - "||DET|M|P|@>N": {POS: DET}, - "||DET|M|S|@>N": {POS: DET}, - "|DET|F|P|@>N": {POS: DET}, - "|DET|F|P|@N<": {POS: DET}, - "|DET|F|S|@|DET|F|S|@>N": {POS: DET}, - "|DET|F|S|@N<": {POS: DET}, - "|DET|M|P|@>N": {POS: DET}, - "|DET|M|P|@N<": {POS: DET}, - "|DET|M|S|@|DET|M|S|@>N": {POS: DET}, - "|DET|M|S|@N<": {POS: DET}, - "|ADV|@|ADV|@|ADV|@|PRP|@|PRP|@|PRP|@|PRP|@ADVL>": {POS: ADP}, - "|PRP|@COM": {POS: ADP}, - "|PRP|@N<": {POS: ADP}, - "|PRP|@N|PRP|@N|PRP|@OC": {POS: ADP}, - "|PRP|@OC>": {POS: ADP}, - "|PRP|@P<": {POS: ADP}, - "|PRP|@PRED>": {POS: ADP}, - "||ADV|@ADVL>": {POS: ADV}, - "||ADV|@ADVL>": {POS: ADV}, - "|ADV|@|ADV|@ADVL>": {POS: ADV}, - "||||ADJ|M|S|@||||ADJ|M|S|@||||ADJ|M|S|@P<": {POS: ADJ}, - "||ADJ|M|S|@N<": {POS: ADJ}, - "|||NUM|M|P|@P<": {POS: NUM}, - "||NUM|F|P|@>N": {POS: NUM}, - "||NUM|M|P|@>N": {POS: NUM}, - "||||NUM|M|P|@P<": {POS: NUM}, - "|||ADJ|M|S|@|||ADJ|M|S|@P<": {POS: ADJ}, - "|||ADJ|M|S|@SUBJ>": {POS: ADJ}, - "|||ADJ|M|S|@P<": {POS: ADJ}, - "||ADJ|M|S|@P<": {POS: ADJ}, - "||N|F|P|@||N|F|P|@P<": {POS: NOUN}, - "||N|F|P|@SUBJ>": {POS: NOUN}, - "||N|F|S|@||N|F|S|@||N|F|S|@N||N|F|S|@P<": {POS: NOUN}, - "||N|F|S|@SUBJ>": {POS: NOUN}, - "||N|M|P|@||N|M|P|@||N|M|P|@P<": {POS: NOUN}, - "||N|M|P|@SUBJ>": {POS: NOUN}, - "||N|M|S|@||N|M|S|@||N|M|S|@||N|M|S|@APP": {POS: NOUN}, - "||N|M|S|@NPHR": 
{POS: NOUN}, - "||N|M|S|@P<": {POS: NOUN}, - "||N|M|S|@SUBJ>": {POS: NOUN}, - "||N|@P<": {POS: NOUN}, - "||N|F|P|@P<": {POS: NOUN}, - "||N|F|S|@||N|F|S|@||N|F|S|@||N|F|S|@N<": {POS: NOUN}, - "||N|F|S|@N||N|F|S|@NPHR": {POS: NOUN}, - "||N|F|S|@P<": {POS: NOUN}, - "||N|F|S|@SUBJ>": {POS: NOUN}, - "||N|M/F|S|@P<": {POS: NOUN}, - "||N|M|P|@||N|M|P|@||N|M|P|@P<": {POS: NOUN}, - "||N|M|S|@||N|M|S|@||N|M|S|@||N|M|S|@N<": {POS: NOUN}, - "||N|M|S|@N||N|M|S|@NPHR": {POS: NOUN}, - "||N|M|S|@P<": {POS: NOUN}, - "||N|M|S|@SUBJ>": {POS: NOUN}, - "|ADJ|F|P|@N<": {POS: ADJ}, - "|ADJ|F|S|@N<": {POS: ADJ}, - "|ADJ|M/F|S|@N<": {POS: ADJ}, - "|ADJ|M|P|@N<": {POS: ADJ}, - "|ADJ|M|S|@N<": {POS: ADJ}, - "||ADV|@||PRP|@|||PRP|@||PRP|@||PRP|@ADVL>": {POS: ADP}, - "|PRP|@|PRP|@|PRP|@A<": {POS: ADP}, - "|||ADV|@|||ADV|@|||ADV|@|||ADV|@>A": {POS: ADV}, - "|||ADV|@>N": {POS: ADV}, - "|||ADV|@A<": {POS: ADV}, - "|||ADV|@ACC>": {POS: ADV}, - "|||ADV|@ADVL": {POS: ADV}, - "|||ADV|@ADVL>": {POS: ADV}, - "|||ADV|@CO": {POS: ADV}, - "|||ADV|@N<": {POS: ADV}, - "|||ADV|@N|||ADV|@P<": {POS: ADV}, - "|||ADV|F|P|@|||DET|F|P|@|||DET|F|P|@|||DET|F|P|@>N": {POS: DET}, - "|||DET|F|S|@>N": {POS: DET}, - "|||DET|F|S|@SUBJ>": {POS: PRON}, - "|||DET|M/F|S/P|@|||DET|M/F|S/P|@P<": {POS: PRON}, - "|||DET|M|P|@>N": {POS: DET}, - "|||DET|M|P|@P<": {POS: PRON}, - "|||DET|M|P|@SUBJ>": {POS: PRON}, - "|||DET|M|S/P|@|||DET|M|S|@|||DET|M|S|@>N": {POS: DET}, - "|||DET|M|S|@P<": {POS: PRON}, - "|||DET|M|S|@SUBJ>": {POS: PRON}, - "||||DET|M|P|@P<": {POS: PRON}, - "||ADV|@>A": {POS: ADV}, - "||ADV|@CO": {POS: ADV}, - "||DET|M/F|S/P|@>A": {POS: DET}, - "||DET|M|P|@>A": {POS: DET}, - "||DET|M|S|@>A": {POS: DET}, - "||DET|M|S|@>N": {POS: DET}, - "||ADV|@||ADV|@>A": {POS: ADV}, - "||ADV|@>N": {POS: ADV}, - "||ADV|@ADVL>": {POS: ADV}, - "||ADV|@P<": {POS: ADV}, - "||INDP|M|S|@P<": {POS: PRON}, - "||ADV|@>N": {POS: ADV}, - "||ADV|@P<": {POS: ADV}, - "||DET|M|P|@||DET|M|S|@P<": {POS: PRON}, - "||DET|M|S|@SUBJ>": {POS: PRON}, - "||DET|F|S|@SUBJ>": {POS: PRON}, - "||DET|M|P|@P<": {POS: PRON}, - "||DET|M|P|@SUBJ>": {POS: PRON}, - "||DET|M|S|@ACC>": {POS: PRON}, - "||DET|M|S|@P<": {POS: PRON}, - "|||ADV|@CO": {POS: ADV}, - "|||KC|@CO": {POS: CCONJ}, - "|ADJ|F|P|@N<": {POS: ADJ}, - "|ADJ|F|S|@|ADJ|F|S|@>N": {POS: ADJ}, - "|ADJ|F|S|@N<": {POS: ADJ}, - "|ADJ|M|P|@|ADJ|M|P|@N<": {POS: ADJ}, - "|ADJ|M|S|@|ADV|@|ADV|@|ADV|@|ADV|@|ADV|@>A": {POS: ADV}, - "|ADV|@>N": {POS: ADV}, - "|ADV|@>P": {POS: ADV}, - "|ADV|@A<": {POS: ADV}, - "|ADV|@ADVL>": {POS: ADV}, - "|ADV|@CO": {POS: ADV}, - "|ADV|@FS-STA": {POS: ADV}, - "|ADV|@N<": {POS: ADV}, - "|ADV|@N<|": {POS: ADV}, - "|ADV|@P<": {POS: ADV}, - "|ART|F|S|@>N": {POS: DET}, - "|ART|M|S|@>N": {POS: DET}, - "|DET|@>A": {POS: DET}, - "|DET|F|P|@|DET|F|P|@|DET|F|P|@|DET|F|P|@>N": {POS: DET}, - "|DET|F|P|@N<": {POS: DET}, - "|DET|F|P|@N|DET|F|P|@NPHR": {POS: DET}, - "|DET|F|P|@P<": {POS: PRON}, - "|DET|F|P|@SUBJ>": {POS: PRON}, - "|DET|F|S|@|DET|F|S|@|DET|F|S|@>A": {POS: DET}, - "|DET|F|S|@>N": {POS: DET}, - "|DET|F|S|@A<": {POS: DET}, - "|DET|F|S|@N<": {POS: DET}, - "|DET|F|S|@N|DET|F|S|@P<": {POS: PRON}, - "|DET|F|S|@SUBJ>": {POS: PRON}, - "|DET|M/F|P|@>N": {POS: DET}, - "|DET|M/F|S/P|@|DET|M/F|S|@>N": {POS: DET}, - "|DET|M|P|@|DET|M|P|@|DET|M|P|@|DET|M|P|@>A": {POS: DET}, - "|DET|M|P|@>N": {POS: DET}, - "|DET|M|P|@N<": {POS: DET}, - "|DET|M|P|@N|DET|M|P|@P<": {POS: PRON}, - "|DET|M|P|@PRED>": {POS: DET}, - "|DET|M|P|@SC>": {POS: PRON}, - "|DET|M|P|@SUBJ>": {POS: PRON}, - "|DET|M|S|@|DET|M|S|@|DET|M|S|@>A": {POS: DET}, - 
"|DET|M|S|@>N": {POS: DET}, - "|DET|M|S|@>P": {POS: DET}, - "|DET|M|S|@ADVL>": {POS: DET}, - "|DET|M|S|@N<": {POS: DET}, - "|DET|M|S|@N|DET|M|S|@P<": {POS: PRON}, - "|DET|M|S|@SUBJ>": {POS: PRON}, - "|INDP|M/F|S|@SUBJ>": {POS: PRON}, - "|INDP|M|S|@|INDP|M|S|@|INDP|M|S|@|INDP|M|S|@>A": {POS: PRON}, - "|INDP|M|S|@>N": {POS: PRON}, - "|INDP|M|S|@ACC>": {POS: PRON}, - "|INDP|M|S|@N<": {POS: PRON}, - "|INDP|M|S|@N|INDP|M|S|@NPHR": {POS: PRON}, - "|INDP|M|S|@P<": {POS: PRON}, - "|INDP|M|S|@S<": {POS: PRON}, - "|INDP|M|S|@SC>": {POS: PRON}, - "|INDP|M|S|@SUBJ>": {POS: PRON}, - "|PERS|F|3P|ACC|@ACC>": {POS: PRON}, - "|PERS|M|3P|ACC|@ACC>": {POS: PRON}, - "||PERS|M|1S|DAT|@ACC>": {POS: PRON}, - "|PERS|3S|ACC|@ACC>": {POS: PRON}, - "|PERS|3S|PIV|@P<": {POS: PRON}, - "|PERS|F|1S|ACC|@|PERS|F|1S|ACC|@ACC>": {POS: PRON}, - "|PERS|F|1S|DAT|@|PERS|F|1S|DAT|@DAT>": {POS: PRON}, - "|PERS|F|3P|ACC|@|PERS|F|3S|ACC|@|PERS|F|3S|ACC|@ACC-PASS": {POS: PRON}, - "|PERS|F|3S|DAT|@|PERS|F|3S|DAT|@DAT>": {POS: PRON}, - "|PERS|F|3S|PIV|@P<": {POS: PRON}, - "|PERS|M/F|1P|ACC|@|PERS|M/F|1P|ACC|@|PERS|M/F|1P|ACC|@ACC>": {POS: PRON}, - "|PERS|M/F|1P|DAT|@|PERS|M/F|1P|DAT|@|PERS|M/F|1P|DAT|@DAT>": {POS: PRON}, - "|PERS|M/F|1S|ACC|@|PERS|M/F|1S|ACC|@|PERS|M/F|1S|ACC|@ACC>": {POS: PRON}, - "|PERS|M/F|1S|DAT|@|PERS|M/F|1S|DAT|@ACC>": {POS: PRON}, - "|PERS|M/F|1S|DAT|@DAT>": {POS: PRON}, - "|PERS|M/F|2P|ACC|@|PERS|M/F|2P|DAT|@DAT>": {POS: PRON}, - "|PERS|M/F|3P|ACC|@|PERS|M/F|3S/P|ACC/DAT|@VOC": {POS: PRON}, - "|PERS|M/F|3S/P|ACC|@VOC": {POS: PRON}, - "|PERS|M/F|3S/P|DAT|@DAT>": {POS: PRON}, - "|PERS|M/F|3S/P|PIV|@P<": {POS: PRON}, - "|PERS|M/F|3S|PIV|@P<": {POS: PRON}, - "|PERS|M|1P|ACC|@|PERS|M|1P|ACC|@ACC>": {POS: PRON}, - "|PERS|M|1P|DAT|@DAT>": {POS: PRON}, - "|PERS|M|1S|ACC|@|PERS|M|1S|ACC|@ACC>": {POS: PRON}, - "|PERS|M|1S|DAT|@|PERS|M|1S|DAT|@DAT>": {POS: PRON}, - "|PERS|M|2S|ACC|@|PERS|M|2S|ACC|@ACC>": {POS: PRON}, - "|PERS|M|2S|ACC|@DAT>": {POS: PRON}, - "|PERS|M|3P|ACC|@|PERS|M|3P|ACC|@ACC-PASS": {POS: PRON}, - "|PERS|M|3P|DAT|@|PERS|M|3P|DAT|@DAT>": {POS: PRON}, - "|PERS|M|3P|PIV|@P<": {POS: PRON}, - "|PERS|M|3S|ACC|@|PERS|M|3S|ACC|@|PERS|M|3S|ACC|@ACC>": {POS: PRON}, - "|PERS|M|3S|DAT|@|PERS|M|3S|DAT|@DAT>": {POS: PRON}, - "|PERS|M|3S|PIV|@P<": {POS: PRON}, - "||ADV|@||ADV|@ADVL>": {POS: ADV}, - "||ADV|@COM": {POS: ADV}, - "||ADV|@N||PRP|@ADVL>": {POS: ADP}, - "||PRP|@KOMP<": {POS: ADP}, - "||PRP|@N<": {POS: ADP}, - "||PRP|@N||PRP|@ADVL>": {POS: ADP}, - "||DET|F|P|@P<": {POS: PRON}, - "||DET|F|S|@P<": {POS: PRON}, - "||DET|M|S|@P<": {POS: PRON}, - "||ADV|@N<": {POS: ADV}, - "||PRP|@|||ADV|@ADVL>": {POS: ADV}, - "||ADV|@>A": {POS: ADV}, - "||ADV|@COM": {POS: ADV}, - "||DET|M|S|@N<|": {POS: DET}, - "||INDP|M|S|@ACC>": {POS: PRON}, - "||INDP|M|S|@SUBJ>": {POS: PRON}, - "|ADV|@|ADV|@>A": {POS: ADV}, - "|ADV|@>N": {POS: ADV}, - "|ADV|@ADVL>": {POS: ADV}, - "|ADV|@COM": {POS: ADV}, - "|ADV|@P<": {POS: ADV}, - "|ADV|@SA>": {POS: ADV}, - "|ADV|@SUB": {POS: ADV}, - "|DET|F|P|@>N": {POS: DET}, - "|DET|F|P|@SUBJ>": {POS: PRON}, - "|DET|F|S|@>N": {POS: DET}, - "|DET|F|S|@ADVL>": {POS: DET}, - "|DET|F|S|@SC>": {POS: PRON}, - "|DET|M|P|@>N": {POS: DET}, - "|DET|M|P|@ACC>": {POS: PRON}, - "|DET|M|P|@SUBJ>": {POS: PRON}, - "|DET|M|S|@>N": {POS: DET}, - "|DET|M|S|@ACC>": {POS: PRON}, - "|DET|M|S|@P<": {POS: PRON}, - "|INDP|@SUBJ>": {POS: PRON}, - "|INDP|F|@SUBJ>": {POS: PRON}, - "|INDP|F|P|@ACC>": {POS: PRON}, - "|INDP|F|P|@P<": {POS: PRON}, - "|INDP|F|P|@PIV>": {POS: PRON}, - "|INDP|F|P|@SUBJ>": {POS: PRON}, - "|INDP|F|S|@>N": 
{POS: PRON}, - "|INDP|F|S|@ACC>": {POS: PRON}, - "|INDP|F|S|@ADVL>": {POS: PRON}, - "|INDP|F|S|@P<": {POS: PRON}, - "|INDP|F|S|@SC>": {POS: PRON}, - "|INDP|F|S|@SUB": {POS: PRON}, - "|INDP|F|S|@SUBJ>": {POS: PRON}, - "|INDP|M/F|P|@P<": {POS: PRON}, - "|INDP|M/F|P|@SC>": {POS: PRON}, - "|INDP|M/F|P|@SUBJ>": {POS: PRON}, - "|INDP|M/F|S/P|@ACC>": {POS: PRON}, - "|INDP|M/F|S/P|@P<": {POS: PRON}, - "|INDP|M/F|S/P|@SUB": {POS: PRON}, - "|INDP|M/F|S/P|@SUBJ>": {POS: PRON}, - "|INDP|M/F|S|@|INDP|M/F|S|@ACC>": {POS: PRON}, - "|INDP|M/F|S|@P<": {POS: PRON}, - "|INDP|M/F|S|@SUBJ>": {POS: PRON}, - "|INDP|M|P|@ACC>": {POS: PRON}, - "|INDP|M|P|@P<": {POS: PRON}, - "|INDP|M|P|@SUB": {POS: PRON}, - "|INDP|M|P|@SUBJ>": {POS: PRON}, - "|INDP|M|S|@ACC>": {POS: PRON}, - "|INDP|M|S|@N<": {POS: PRON}, - "|INDP|M|S|@P<": {POS: PRON}, - "|INDP|M|S|@SC>": {POS: PRON}, - "|INDP|M|S|@SUB": {POS: PRON}, - "|INDP|M|S|@SUBJ>": {POS: PRON}, - "|INDP|S/P|@SUBJ>": {POS: PRON}, - "|PRP|@|PRP|@ADVL>": {POS: ADP}, - "||PRP|@||PRP|@||PRP|@||PRP|@||PRP|@>N": {POS: ADP}, - "||PRP|@A<": {POS: ADP}, - "||PRP|@ADVL>": {POS: ADP}, - "||PRP|@N<": {POS: ADP}, - "||PRP|@N||PRP|@P<": {POS: ADP}, - "||PRP|@PASS": {POS: ADP}, - "||PRP|@PIV>": {POS: ADP}, - "||PRP|@PRED>": {POS: ADP}, - "||PRP|@UTT": {POS: ADP}, - "||PRP|@||PRP|@||PRP|@||PRP|@||PRP|@||PRP|@||PRP|@>N": {POS: ADP}, - "||PRP|@A<": {POS: ADP}, - "||PRP|@ADVL": {POS: ADP}, - "||PRP|@ADVL>": {POS: ADP}, - "||PRP|@KOMP<": {POS: ADP}, - "||PRP|@N<": {POS: ADP}, - "||PRP|@N||PRP|@N||PRP|@P<": {POS: ADP}, - "||PRP|@PASS": {POS: ADP}, - "||PRP|@PIV>": {POS: ADP}, - "||PRP|@PRED>": {POS: ADP}, - "||PRP|@STA": {POS: ADP}, - "||PRP|@UTT": {POS: ADP}, - "||PRP|@KOMP<": {POS: ADP}, - "|ADV|@ADVL": {POS: ADV}, - "|PERS|M/F|3S|DAT|@DAT>": {POS: PRON}, - "|PRP|@-H": {POS: ADP}, - "|PRP|@|PRP|@|PRP|@|PRP|@|PRP|@|PRP|@|PRP|@|PRP|@|PRP|@|PRP|@>A": {POS: ADP}, - "|PRP|@>N": {POS: ADP}, - "|PRP|@>P": {POS: ADP}, - "|PRP|@A<": {POS: ADP}, - "|PRP|@A|PRP|@A|PRP|@ADVL": {POS: ADP}, - "|PRP|@ADVL>": {POS: ADP}, - "|PRP|@COM": {POS: ADP}, - "|PRP|@KOMP<": {POS: ADP}, - "|PRP|@N<": {POS: ADP}, - "|PRP|@N|PRP|@N|PRP|@OA>": {POS: ADP}, - "|PRP|@P<": {POS: ADP}, - "|PRP|@PASS": {POS: ADP}, - "|PRP|@PIV>": {POS: ADP}, - "|PRP|@PRED>": {POS: ADP}, - "|PRP|@SA>": {POS: ADP}, - "|PRP|@SUB": {POS: ADP}, - "|PRP|@UTT": {POS: ADP}, - "|ADV|@>A": {POS: ADV}, - "|ADV|@>N": {POS: ADV}, - "|ADV|@A<": {POS: ADV}, - "|ADV|@N<": {POS: ADV}, - "|ADV|M|P|@>N": {POS: ADV}, - "|ADV|M|S|@>A": {POS: ADV}, - "|||NUM|M|P|@|||NUM|M|P|@APP": {POS: NUM}, - "|||NUM|M|P|@N<": {POS: NUM}, - "|||NUM|M|P|@N|||NUM|M|P|@P<": {POS: NUM}, - "|||NUM|M|P|@SUBJ>": {POS: NUM}, - "|||NUM|M|S|@ADVL>": {POS: NUM}, - "|||NUM|M|S|@N<": {POS: NUM}, - "|||NUM|M|S|@N|||NUM|M|S|@P<": {POS: NUM}, - "|||NUM|M|S|@SUBJ>": {POS: NUM}, - "||||NUM|M|S|@P<": {POS: NUM}, - "||NUM|M|P|@P<": {POS: NUM}, - "||NUM|M|S|@N||NUM|M|S|@P<": {POS: NUM}, - "||||NUM|M|P|@P<": {POS: NUM}, - "||||NUM|M|S|@N<": {POS: NUM}, - "||||NUM|M|S|@P<": {POS: NUM}, - "|||NUM|M|P|@P<": {POS: NUM}, - "|||NUM|M|S|@P<": {POS: NUM}, - "|||NUM|M|P|@P<": {POS: NUM}, - "ADJ|@N": {POS: ADJ}, - "ADJ|F|P|@A<": {POS: ADJ}, - "ADJ|F|P|@N<": {POS: ADJ}, - "ADJ|F|P|@N": {POS: ADJ}, - "ADJ|F|S|@N": {POS: ADJ}, - "ADJ|F|S|@ICL-": {POS: ADJ}, - "ADJ|F|S|@SC>": {POS: ADJ}, - "ADJ|M/F|P|@": {POS: ADJ}, - "ADJ|M|P|@N": {POS: ADJ}, - "ADJ|M|P|@A<": {POS: ADJ}, - "ADJ|M|P|@ADVL>": {POS: ADJ}, - "ADJ|M|P|@APP": {POS: ADJ}, - "ADJ|M|P|@ICL-": {POS: ADJ}, - "ADJ|M|S|@A": {POS: ADJ}, - "ADJ|M|S|@>N": {POS: 
ADJ}, - "ADJ|M|S|@A<": {POS: ADJ}, - "ADJ|M|S|@APP": {POS: ADJ}, - "ADJ|M|S|@ICL-N<": {POS: ADJ}, - "ADJ|M|S|@ICL-N": {POS: ADJ}, - "ADJ|M|S|@P<": {POS: ADJ}, - "ADJ|M|S|@PRED>": {POS: ADJ}, - "ADJ|M|S|@SC>": {POS: ADJ}, - "ADJ|M|S|@SUBJ>": {POS: ADJ}, - "ADP": {POS: ADP}, - "ADV|@A": {POS: ADV}, - "ADV|@>N": {POS: ADV}, - "ADV|@>P": {POS: ADV}, - "ADV|@>S": {POS: ADV}, - "ADV|@A<": {POS: ADV}, - "ADV|@ADVL": {POS: ADV}, - "ADV|@ADVL>": {POS: ADV}, - "ADV|@APP": {POS: ADV}, - "ADV|@AS-": {POS: ADV}, - "ADV|@AS-KOMP<": {POS: ADV}, - "ADV|@CO": {POS: ADV}, - "ADV|@FOC>": {POS: ADV}, - "ADV|@FS-N<": {POS: ADV}, - "ADV|@ICL-AUX<": {POS: ADV}, - "ADV|@KOMP<": {POS: ADV}, - "ADV|@N<": {POS: ADV}, - "ADV|@N": {POS: ADV}, - "ADV|@PRT-AUX<": {POS: ADV}, - "ADV|@PU": {POS: ADV}, - "ADV|@S<": {POS: ADV}, - "ADV|@SA>": {POS: ADV}, - "ADV|@SC>": {POS: ADV}, - "ADV|@STA": {POS: ADV}, - "ADV|@SUB": {POS: ADV}, - "ADV|@SUBJ>": {POS: ADV}, - "ADV|M|P|@N": {POS: DET}, - "ART|M|P|@>N": {POS: DET}, - "ART|M|S|@>A": {POS: DET}, - "ART|M|S|@>N": {POS: DET}, - "CONJ": {POS: CCONJ}, - "DET|F|P|@>N": {POS: DET}, - "DET|F|P|@P<": {POS: PRON}, - "DET|F|P|@SUBJ>": {POS: PRON}, - "DET|F|S|@>A": {POS: DET}, - "DET|F|S|@>N": {POS: DET}, - "DET|F|S|@P<": {POS: PRON}, - "DET|F|S|@SUB": {POS: DET}, - "DET|F|S|@SUBJ>": {POS: PRON}, - "DET|M/F|S|@A<": {POS: DET}, - "DET|M/F|S|@SUBJ>": {POS: PRON}, - "DET|M|P|@N": {POS: DET}, - "DET|M|P|@P<": {POS: PRON}, - "DET|M|P|@SUB": {POS: DET}, - "DET|M|S|@A": {POS: DET}, - "DET|M|S|@>N": {POS: DET}, - "DET|M|S|@>P": {POS: DET}, - "DET|M|S|@ADVL>": {POS: DET}, - "DET|M|S|@N": {POS: PRON}, - "EC|@>N": {POS: PART}, - "INDP|F|S|@": {POS: PRON}, - "INDP|M/F|S/P|@P<": {POS: PRON}, - "INDP|M/F|S|@SUBJ>": {POS: PRON}, - "INDP|M|P|@ACC>": {POS: PRON}, - "INDP|M|P|@SUBJ>": {POS: PRON}, - "INDP|M|S/P|@P<": {POS: PRON}, - "INDP|M|S|@N": {POS: PRON}, - "INDP|M|S|@ACC>": {POS: PRON}, - "INDP|M|S|@P<": {POS: PRON}, - "INDP|M|S|@S<": {POS: PRON}, - "INDP|M|S|@SUBJ>": {POS: PRON}, - "INDP|S/P|@SUBJ>": {POS: PRON}, - "IN|@": {POS: INTJ}, - "IN|@ADVL>": {POS: INTJ}, - "IN|@EXC": {POS: INTJ}, - "IN|@P<": {POS: INTJ}, - "IN|@UTT": {POS: INTJ}, - "IN|F|S|@S": {POS: SCONJ}, - "KS|@A<": {POS: SCONJ}, - "KS|@ADVL>": {POS: SCONJ}, - "KS|@COM": {POS: SCONJ}, - "KS|@KOMP<": {POS: SCONJ}, - "KS|@P<": {POS: SCONJ}, - "KS|@PRT-AUX<": {POS: SCONJ}, - "KS|@SUB": {POS: SCONJ}, - "KS|@SUBJ>": {POS: SCONJ}, - "NOUN": {POS: NOUN}, - "NUM|@A<": {POS: NUM}, - "NUM|@ADVL>": {POS: NUM}, - "NUM|@NPHR": {POS: NUM}, - "NUM|F|P|@A": {POS: NUM}, - "NUM|M|P|@>N": {POS: NUM}, - "NUM|M|P|@AN": {POS: NUM}, - "NUM|M|S|@ADVL>": {POS: NUM}, - "NUM|M|S|@NN": {POS: NUM}, - "NUM|P|@A<": {POS: NUM}, - "NUM|P|@P<": {POS: NUM}, - "N|@N": {POS: NOUN}, - "N|@N": {POS: NOUN}, - "N|@SUBJ>": {POS: NOUN}, - "N|F|P|@>N": {POS: NOUN}, - "N|F|P|@NN": {POS: NOUN}, - "N|F|S|@>S": {POS: NOUN}, - "N|F|S|@ACC>": {POS: NOUN}, - "N|F|S|@ADVL>": {POS: NOUN}, - "N|F|S|@P<": {POS: NOUN}, - "N|M|P|@>N": {POS: NOUN}, - "N|M|P|@P<": {POS: NOUN}, - "N|M|P|@SUBJ>": {POS: NOUN}, - "N|M|S|@A": {POS: NOUN}, - "N|M|S|@>N": {POS: NOUN}, - "N|M|S|@ADVL>": {POS: NOUN}, - "N|M|S|@AS<": {POS: NOUN}, - "N|M|S|@FS-STA": {POS: NOUN}, - "N|M|S|@N<": {POS: NOUN}, - "N|M|S|@N": {POS: NOUN}, - "PERS|F/M|3S/P|ACC|@": {POS: PRON}, - "PERS|F|1S|PIV|@P<": {POS: PRON}, - "PERS|F|3P|ACC|@": {POS: PRON}, - "PERS|F|3P|ACC|@ACC>-PASS": {POS: PRON}, - "PERS|F|3P|DAT|@": {POS: PRON}, - "PERS|F|3P|NOM/PIV|@P<": {POS: PRON}, - "PERS|F|3P|NOM|@SUBJ>": {POS: PRON}, - "PERS|F|3P|PIV|@P<": {POS: 
PRON}, - "PERS|F|3S/P|ACC|@": {POS: PRON}, - "PERS|F|3S|ACC|@N": {POS: PRON}, - "PERS|F|3S|ACC|@ACC-PASS": {POS: PRON}, - "PERS|F|3S|ACC|@ACC>": {POS: PRON}, - "PERS|F|3S|ACC|@ACC>-PASS": {POS: PRON}, - "PERS|F|3S|ACC|@SUBJ>": {POS: PRON}, - "PERS|F|3S|DAT|@-PASS": {POS: PRON}, - "PERS|F|3S|DAT|@DAT>": {POS: PRON}, - "PERS|F|3S|NOM/PIV|@P<": {POS: PRON}, - "PERS|F|3S|NOM|@": {POS: PRON}, - "PERS|F|3S|PIV|@P<": {POS: PRON}, - "PERS|F|P|@ACC>-PASS": {POS: PRON}, - "PERS|F|S|@ACC>": {POS: PRON}, - "PERS|F|S|@ACC>-PASS": {POS: PRON}, - "PERS|F|S|ACC|@": {POS: PRON}, - "PERS|M/F|1P|DAT|@": {POS: PRON}, - "PERS|M/F|1P|NOM/PIV|@P<": {POS: PRON}, - "PERS|M/F|1P|NOM|@": {POS: PRON}, - "PERS|M/F|1P|PIV|@P<": {POS: PRON}, - "PERS|M/F|1S|ACC|@SUBJ>": {POS: PRON}, - "PERS|M/F|1S|DAT|@": {POS: PRON}, - "PERS|M/F|1S|PIV|@P<": {POS: PRON}, - "PERS|M/F|2P|NOM|@-PASS": {POS: PRON}, - "PERS|M/F|3P|DAT|@": {POS: PRON}, - "PERS|M/F|3P|NOM|@SUBJ>": {POS: PRON}, - "PERS|M/F|3S/P|ACC|@": {POS: PRON}, - "PERS|M/F|3S/P|ACC|@ACC>-PASS": {POS: PRON}, - "PERS|M/F|3S/P|ACC|@SUBJ>": {POS: PRON}, - "PERS|M/F|3S/P|ACC|@VOC": {POS: PRON}, - "PERS|M/F|3S|ACC|@-PASS": {POS: PRON}, - "PERS|M/F|3S|ACC|@SUBJ>": {POS: PRON}, - "PERS|M/F|3S|DAT|@": {POS: PRON}, - "PERS|M/F|3S|NOM/PIV|@P<": {POS: PRON}, - "PERS|M/F|3S|NOM|@": {POS: PRON}, - "PERS|M|1P|DAT|@DAT>": {POS: PRON}, - "PERS|M|1P|NOM/PIV|@P<": {POS: PRON}, - "PERS|M|1P|NOM|@": {POS: PRON}, - "PERS|M|1S|DAT|@": {POS: PRON}, - "PERS|M|1S|PIV|@P<": {POS: PRON}, - "PERS|M|2S|PIV|@P<": {POS: PRON}, - "PERS|M|3P|ACC|@": {POS: PRON}, - "PERS|M|3P|ACC|@ACC>-PASS": {POS: PRON}, - "PERS|M|3P|ACC|@SUBJ>": {POS: PRON}, - "PERS|M|3P|DAT|@": {POS: PRON}, - "PERS|M|3P|NOM/PIV|@NPHR": {POS: PRON}, - "PERS|M|3P|NOM/PIV|@P<": {POS: PRON}, - "PERS|M|3P|NOM|@": {POS: PRON}, - "PERS|M|3P|PIV|@P<": {POS: PRON}, - "PERS|M|3S/P|ACC|@-PASS": {POS: PRON}, - "PERS|M|3S/P|ACC|@SUBJ>": {POS: PRON}, - "PERS|M|3S|ACC|@": {POS: PRON}, - "PERS|M|3S|ACC|@ACC>-PASS": {POS: PRON}, - "PERS|M|3S|ACC|@DAT>": {POS: PRON}, - "PERS|M|3S|ACC|@SC>": {POS: PRON}, - "PERS|M|3S|ACC|@SUBJ>": {POS: PRON}, - "PERS|M|3S|DAT|@": {POS: PRON}, - "PERS|M|3S|NOM/PIV|@P<": {POS: PRON}, - "PERS|M|3S|NOM|@": {POS: PRON}, - "PERS|M|3S|NOM|@TOP": {POS: PRON}, - "PERS|M|3S|PIV|@P<": {POS: PRON}, - "PERS|M|P|@ACC-PASS": {POS: PRON}, - "PERS|M|P|@ACC>-PASS": {POS: PRON}, - "PERS|M|P|ACC|@ACC>-PASS": {POS: PRON}, - "PERS|M|S|@ACC>": {POS: PRON}, - "PERS|M|S|@ACC>-PASS": {POS: PRON}, - "PERS|M|S|@SUBJ>": {POS: PRON}, - "PERS|M|S|ACC|@SUBJ>": {POS: PRON}, - "PROP": {POS: PROPN}, - "PROPN": {POS: PROPN}, - "PROP|@": {POS: PROPN}, - "PROP|F|S|@N": {POS: PROPN}, - "PROP|F|S|@ADVL>": {POS: PROPN}, - "PROP|F|S|@APP": {POS: PROPN}, - "PROP|F|S|@KOMP<": {POS: PROPN}, - "PROP|F|S|@N<": {POS: PROPN}, - "PROP|F|S|@N": {POS: PROPN}, - "PROP|F|S|@UTT": {POS: PROPN}, - "PROP|F|S|@VOK": {POS: PROPN}, - "PROP|M/F|P|@P<": {POS: PROPN}, - "PROP|M/F|S|@": {POS: PROPN}, - "PROP|M|P|@": {POS: PROPN}, - "PROP|M|P|@UTT": {POS: PROPN}, - "PROP|M|S|@": {POS: PROPN}, - "PROP|M|S|@ADVL>": {POS: PROPN}, - "PROP|M|S|@APP": {POS: PROPN}, - "PROP|M|S|@N<": {POS: PROPN}, - "PROP|M|S|@N": {POS: PROPN}, - "PROP|M|S|@SUBJ>": {POS: PROPN}, - "PROP|M|S|@UTT": {POS: PROPN}, - "PRP|@A": {POS: ADP}, - "PRP|@>N": {POS: ADP}, - "PRP|@>P": {POS: ADP}, - "PRP|@>S": {POS: ADP}, - "PRP|@A<": {POS: ADP}, - "PRP|@A": {POS: ADP}, - "PRP|@ADVL": {POS: ADP}, - "PRP|@ADVL>": {POS: ADP}, - "PRP|@ADVL>>": {POS: ADP}, - "PRP|@AS-ADVL>": {POS: ADP}, - "PRP|@AS<": {POS: ADP}, - "PRP|@CO": 
{POS: ADP}, - "PRP|@COM": {POS: ADP}, - "PRP|@EXC": {POS: ADP}, - "PRP|@ICL-N<": {POS: ADP}, - "PRP|@ICL-N": {POS: ADP}, - "PRP|@P<": {POS: ADP}, - "PRP|@PASS": {POS: ADP}, - "PRP|@PIV>": {POS: ADP}, - "PRP|@PRED>": {POS: ADP}, - "PRP|@PRT-AUX<": {POS: ADP}, - "PRP|@QUE": {POS: ADP}, - "PRP|@SA>": {POS: ADP}, - "PRP|@SC>": {POS: ADP}, - "PRP|@SUB": {POS: ADP}, - "PRP|@SUBJ>": {POS: ADP}, - "PRP|@UTT": {POS: ADP}, - "PU|@PU": {POS: PUNCT}, - "V|GER|@SUB": {POS: VERB}, - "V|PCP|F|P|@ICL-OC>": {POS: VERB}, - "V|PCP|F|P|@NN": {POS: VERB}, - "V|PCP|M|S|@>N": {POS: VERB}, - "V|PCP|M|S|@ICL-CO": {POS: VERB}, - "V|PCP|M|S|@N<": {POS: ADJ}, - "V|PCP|M|S|@P<": {POS: ADJ}, - "V|PR|3S|IND|@FS-P<": {POS: VERB}, - "_": {POS: X}, - "adj|F|S": {POS: ADJ}, - "ADV": {POS: ADV}, - "art|<-sam>||F|S": {POS: DET}, - "art||M|P": {POS: DET}, - "n|F|S": {POS: NOUN}, - "n|M|P": {POS: NOUN}, - "n|M|S": {POS: NOUN}, - "prop|F|S": {POS: PROPN}, - "prop|M|P": {POS: PROPN}, - "prop|M|S": {POS: PROPN}, - "prp": {POS: ADP}, - "prp|": {POS: ADP}, - "punc": {POS: PUNCT}, - "v-pcp|M|P": {POS: VERB}, - "v-pcp|M|S": {POS: VERB}, - "ADJ": {POS: ADJ}, - "AUX": {POS: AUX}, - "CCONJ": {POS: CCONJ}, - "DET": {POS: DET}, - "INTJ": {POS: INTJ}, - "NUM": {POS: NUM}, - "PART": {POS: PART}, - "PRON": {POS: PRON}, - "PUNCT": {POS: PUNCT}, - "SCONJ": {POS: SCONJ}, - "SYM": {POS: SYM}, - "VERB": {POS: VERB}, - "X": {POS: X}, - "adv": {POS: ADV}, - "_SP": {POS: SPACE}, -} diff --git a/spacy/lang/pt/tokenizer_exceptions.py b/spacy/lang/pt/tokenizer_exceptions.py index 981c0624b..187fc65ea 100644 --- a/spacy/lang/pt/tokenizer_exceptions.py +++ b/spacy/lang/pt/tokenizer_exceptions.py @@ -1,7 +1,6 @@ -# coding: utf8 -from __future__ import unicode_literals - +from ..tokenizer_exceptions import BASE_EXCEPTIONS from ...symbols import ORTH +from ...util import update_exc _exc = {} @@ -53,4 +52,4 @@ for orth in [ _exc[orth] = [{ORTH: orth}] -TOKENIZER_EXCEPTIONS = _exc +TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc) diff --git a/spacy/lang/punctuation.py b/spacy/lang/punctuation.py index ccb72de28..e712e71d6 100644 --- a/spacy/lang/punctuation.py +++ b/spacy/lang/punctuation.py @@ -1,12 +1,9 @@ -# coding: utf8 -from __future__ import unicode_literals - from .char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, LIST_CURRENCY from .char_classes import LIST_ICONS, HYPHENS, CURRENCY, UNITS from .char_classes import CONCAT_QUOTES, ALPHA_LOWER, ALPHA_UPPER, ALPHA, PUNCT -_prefixes = ( +TOKENIZER_PREFIXES = ( ["§", "%", "=", "—", "–", r"\+(?![0-9])"] + LIST_PUNCT + LIST_ELLIPSES @@ -16,7 +13,7 @@ _prefixes = ( ) -_suffixes = ( +TOKENIZER_SUFFIXES = ( LIST_PUNCT + LIST_ELLIPSES + LIST_QUOTES @@ -34,7 +31,7 @@ _suffixes = ( ] ) -_infixes = ( +TOKENIZER_INFIXES = ( LIST_ELLIPSES + LIST_ICONS + [ @@ -47,7 +44,3 @@ _infixes = ( r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA), ] ) - -TOKENIZER_PREFIXES = _prefixes -TOKENIZER_SUFFIXES = _suffixes -TOKENIZER_INFIXES = _infixes diff --git a/spacy/lang/ro/__init__.py b/spacy/lang/ro/__init__.py index c7b744ca5..f0d8d8d31 100644 --- a/spacy/lang/ro/__init__.py +++ b/spacy/lang/ro/__init__.py @@ -1,17 +1,9 @@ -# coding: utf8 -from __future__ import unicode_literals - from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .stop_words import STOP_WORDS from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES from .punctuation import TOKENIZER_SUFFIXES - -from ..tokenizer_exceptions import BASE_EXCEPTIONS -from ..norm_exceptions import BASE_NORMS +from .lex_attrs import LEX_ATTRS from 
...language import Language -from ...attrs import LANG, NORM -from ...util import update_exc, add_lookups -from .tag_map import TAG_MAP # Lemma data note: # Original pairs downloaded from http://www.lexiconista.com/datasets/lemmatization/ @@ -19,17 +11,12 @@ from .tag_map import TAG_MAP class RomanianDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = lambda text: "ro" - lex_attr_getters[NORM] = add_lookups( - Language.Defaults.lex_attr_getters[NORM], BASE_NORMS - ) - tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) - stop_words = STOP_WORDS + tokenizer_exceptions = TOKENIZER_EXCEPTIONS prefixes = TOKENIZER_PREFIXES suffixes = TOKENIZER_SUFFIXES infixes = TOKENIZER_INFIXES - tag_map = TAG_MAP + lex_attr_getters = LEX_ATTRS + stop_words = STOP_WORDS class Romanian(Language): diff --git a/spacy/lang/ro/examples.py b/spacy/lang/ro/examples.py index a372d7cb2..bfa258ffc 100644 --- a/spacy/lang/ro/examples.py +++ b/spacy/lang/ro/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/ro/lex_attrs.py b/spacy/lang/ro/lex_attrs.py index bb8391ad1..0f86f53cd 100644 --- a/spacy/lang/ro/lex_attrs.py +++ b/spacy/lang/ro/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/ro/punctuation.py b/spacy/lang/ro/punctuation.py index 87f9a1248..529e1c977 100644 --- a/spacy/lang/ro/punctuation.py +++ b/spacy/lang/ro/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import itertools from ..char_classes import LIST_PUNCT, LIST_ELLIPSES, LIST_QUOTES, LIST_CURRENCY diff --git a/spacy/lang/ro/stop_words.py b/spacy/lang/ro/stop_words.py index b5ba73458..1d90be85d 100644 --- a/spacy/lang/ro/stop_words.py +++ b/spacy/lang/ro/stop_words.py @@ -1,7 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - - # Source: https://github.com/stopwords-iso/stopwords-ro STOP_WORDS = set( """ diff --git a/spacy/lang/ro/tag_map.py b/spacy/lang/ro/tag_map.py deleted file mode 100644 index 5136793ef..000000000 --- a/spacy/lang/ro/tag_map.py +++ /dev/null @@ -1,1654 +0,0 @@ -from __future__ import unicode_literals - -from ...symbols import POS, ADJ, ADP, ADV, INTJ, NOUN, NUM, PART -from ...symbols import PRON, PROPN, PUNCT, SYM, VERB, X, CCONJ, SCONJ, DET, AUX - -TAG_MAP = { - "Afcfson": { - - "Degree": "Cmp", - "Gender": "Fem", - "Number": "Sing", - POS: ADJ, - }, - "Afcfsrn": { - - "Degree": "Cmp", - "Gender": "Fem", - "Number": "Sing", - POS: ADJ, - }, - "Afp": {"Degree": "Pos", POS: ADJ}, - "Afp-p-n": {"Degree": "Pos", "Number": "Plur", POS: ADJ}, - "Afp-p-ny": {"Degree": "Pos", "Number": "Plur", POS: ADJ}, - "Afp-poy": { "Degree": "Pos", "Number": "Plur", POS: ADJ}, - "Afpf--n": {"Degree": "Pos", "Gender": "Fem", POS: ADJ}, - "Afpfp-n": {"Degree": "Pos", "Gender": "Fem", "Number": "Plur", POS: ADJ}, - "Afpfpoy": { - - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - POS: ADJ, - }, - "Afpfpry": { - - "Degree": "Pos", - "Gender": "Fem", - "Number": "Plur", - POS: ADJ, - }, - "Afpfson": { - - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - POS: ADJ, - }, - "Afpfsoy": { - - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - POS: ADJ, - }, - "Afpfsrn": { - - "Degree": "Pos", - "Gender": "Fem", - "Number": "Sing", - POS: ADJ, - }, - "Afpfsry": { - - "Degree": "Pos", 
- "Gender": "Fem", - "Number": "Sing", - POS: ADJ, - }, - "Afpmp-n": {"Degree": "Pos", "Gender": "Masc", "Number": "Plur", POS: ADJ}, - "Afpmpoy": { - - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - POS: ADJ, - }, - "Afpmpry": { - - "Degree": "Pos", - "Gender": "Masc", - "Number": "Plur", - POS: ADJ, - }, - "Afpms-n": {"Degree": "Pos", "Gender": "Masc", "Number": "Sing", POS: ADJ}, - "Afpmsoy": { - - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - POS: ADJ, - }, - "Afpmsry": { - - "Degree": "Pos", - "Gender": "Masc", - "Number": "Sing", - POS: ADJ, - }, - "COLON": {POS: PUNCT}, - "COMMA": {POS: PUNCT}, - "Ccssp": {POS: CCONJ, "Polarity": "Pos"}, - "Crssp": {POS: CCONJ, "Polarity": "Pos"}, - "Csssp": {POS: SCONJ, "Polarity": "Pos"}, - "Cssspy": {POS: SCONJ, "Polarity": "Pos"}, - "DASH": {POS: PUNCT}, - "DBLQ": {POS: PUNCT}, - "Dd3-po---e": { - - "Number": "Plur", - POS: DET, - "Person": "three", - "PronType": "Dem", - }, - "Dd3fpr": { - - "Gender": "Fem", - "Number": "Plur", - POS: DET, - "Person": "three", - "PronType": "Dem", - }, - "Dd3fpr---e": { - - "Gender": "Fem", - "Number": "Plur", - POS: DET, - "Person": "three", - "PronType": "Dem", - }, - "Dd3fso---e": { - - "Gender": "Fem", - "Number": "Sing", - POS: DET, - "Person": "three", - "PronType": "Dem", - }, - "Dd3fso---o": { - - "Gender": "Fem", - "Number": "Sing", - POS: DET, - "Person": "three", - "PronType": "Dem", - }, - "Dd3fsr": { - - "Gender": "Fem", - "Number": "Sing", - POS: DET, - "Person": "three", - "PronType": "Dem", - }, - "Dd3fsr---e": { - - "Gender": "Fem", - "Number": "Sing", - POS: DET, - "Person": "three", - "PronType": "Dem", - }, - "Dd3fsr---o": { - - "Gender": "Fem", - "Number": "Sing", - POS: DET, - "Person": "three", - "PronType": "Dem", - }, - "Dd3mpo": { - - "Gender": "Masc", - "Number": "Plur", - POS: DET, - "Person": "three", - "PronType": "Dem", - }, - "Dd3mpr---e": { - - "Gender": "Masc", - "Number": "Plur", - POS: DET, - "Person": "three", - "PronType": "Dem", - }, - "Dd3mso---e": { - - "Gender": "Masc", - "Number": "Sing", - POS: DET, - "Person": "three", - "PronType": "Dem", - }, - "Dd3msr---e": { - - "Gender": "Masc", - "Number": "Sing", - POS: DET, - "Person": "three", - "PronType": "Dem", - }, - "Dd3msr---o": { - - "Gender": "Masc", - "Number": "Sing", - POS: DET, - "Person": "three", - "PronType": "Dem", - }, - "Dh3fsr": { - - "Gender": "Fem", - "Number": "Sing", - POS: DET, - "Person": "three", - - }, - "Dh3mp": { - "Gender": "Masc", - "Number": "Plur", - POS: DET, - "Person": "three", - - }, - "Dh3ms": { - "Gender": "Masc", - "Number": "Sing", - POS: DET, - "Person": "three", - - }, - "Di3": {POS: DET, "Person": "three", "PronType": "Ind"}, - "Di3--r---e": { POS: DET, "Person": "three", "PronType": "Ind"}, - "Di3-po": { - - "Number": "Plur", - POS: DET, - "Person": "three", - "PronType": "Ind", - }, - "Di3-po---e": { - - "Number": "Plur", - POS: DET, - "Person": "three", - "PronType": "Ind", - }, - "Di3-sr": { - - "Number": "Sing", - POS: DET, - "Person": "three", - "PronType": "Ind", - }, - "Di3-sr---e": { - - "Number": "Sing", - POS: DET, - "Person": "three", - "PronType": "Ind", - }, - "Di3fp": { - "Gender": "Fem", - "Number": "Plur", - POS: DET, - "Person": "three", - "PronType": "Ind", - }, - "Di3fpr": { - - "Gender": "Fem", - "Number": "Plur", - POS: DET, - "Person": "three", - "PronType": "Ind", - }, - "Di3fpr---e": { - - "Gender": "Fem", - "Number": "Plur", - POS: DET, - "Person": "three", - "PronType": "Ind", - }, - "Di3fso---e": { - - "Gender": "Fem", - 
"Number": "Sing", - POS: DET, - "Person": "three", - "PronType": "Ind", - }, - "Di3fsr": { - - "Gender": "Fem", - "Number": "Sing", - POS: DET, - "Person": "three", - "PronType": "Ind", - }, - "Di3fsr---e": { - - "Gender": "Fem", - "Number": "Sing", - POS: DET, - "Person": "three", - "PronType": "Ind", - }, - "Di3mp": { - "Gender": "Masc", - "Number": "Plur", - POS: DET, - "Person": "three", - "PronType": "Ind", - }, - "Di3mpr": { - - "Gender": "Masc", - "Number": "Plur", - POS: DET, - "Person": "three", - "PronType": "Ind", - }, - "Di3mpr---e": { - - "Gender": "Masc", - "Number": "Plur", - POS: DET, - "Person": "three", - "PronType": "Ind", - }, - "Di3ms": { - "Gender": "Masc", - "Number": "Sing", - POS: DET, - "Person": "three", - "PronType": "Ind", - }, - "Di3ms----e": { - "Gender": "Masc", - "Number": "Sing", - POS: DET, - "Person": "three", - "PronType": "Ind", - }, - "Di3mso---e": { - - "Gender": "Masc", - "Number": "Sing", - POS: DET, - "Person": "three", - "PronType": "Ind", - }, - "Di3msr": { - - "Gender": "Masc", - "Number": "Sing", - POS: DET, - "Person": "three", - "PronType": "Ind", - }, - "Di3msr---e": { - - "Gender": "Masc", - "Number": "Sing", - POS: DET, - "Person": "three", - "PronType": "Ind", - }, - "Ds1fp-s": { - "Gender": "Fem", - "Number": "Plur", - POS: DET, - "Person": "one", - "Poss": "Yes", - "PronType": "Prs", - }, - "Ds1fsos": { - - "Gender": "Fem", - "Number": "Sing", - POS: DET, - "Person": "one", - "Poss": "Yes", - "PronType": "Prs", - }, - "Ds1fsrp": { - - "Gender": "Fem", - "Number": "Sing", - POS: DET, - "Person": "one", - "Poss": "Yes", - "PronType": "Prs", - }, - "Ds1fsrs": { - - "Gender": "Fem", - "Number": "Sing", - POS: DET, - "Person": "one", - "Poss": "Yes", - "PronType": "Prs", - }, - "Ds1ms-p": { - "Gender": "Masc", - "Number": "Sing", - POS: DET, - "Person": "one", - "Poss": "Yes", - "PronType": "Prs", - }, - "Ds1ms-s": { - "Gender": "Masc", - "Number": "Sing", - POS: DET, - "Person": "one", - "Poss": "Yes", - "PronType": "Prs", - }, - "Ds2---s": {POS: DET, "Person": "two", "Poss": "Yes", "PronType": "Prs"}, - "Ds2fsrs": { - - "Gender": "Fem", - "Number": "Sing", - POS: DET, - "Person": "two", - "Poss": "Yes", - "PronType": "Prs", - }, - "Ds3---p": {POS: DET, "Person": "three", "Poss": "Yes", "PronType": "Prs"}, - "Ds3---s": {POS: DET, "Person": "three", "Poss": "Yes", "PronType": "Prs"}, - "Ds3fp-s": { - "Gender": "Fem", - "Number": "Plur", - POS: DET, - "Person": "three", - "Poss": "Yes", - "PronType": "Prs", - }, - "Ds3fsos": { - - "Gender": "Fem", - "Number": "Sing", - POS: DET, - "Person": "three", - "Poss": "Yes", - "PronType": "Prs", - }, - "Ds3fsrs": { - - "Gender": "Fem", - "Number": "Sing", - POS: DET, - "Person": "three", - "Poss": "Yes", - "PronType": "Prs", - }, - "Ds3ms-s": { - "Gender": "Masc", - "Number": "Sing", - POS: DET, - "Person": "three", - "Poss": "Yes", - "PronType": "Prs", - }, - "Dw3--r---e": { POS: DET, "Person": "three"}, - "Dw3fpr": { - - "Gender": "Fem", - "Number": "Plur", - POS: DET, - "Person": "three", - - }, - "Dw3mso---e": { - - "Gender": "Masc", - "Number": "Sing", - POS: DET, - "Person": "three", - - }, - "Dz3fsr---e": { - - "Gender": "Fem", - "Number": "Sing", - POS: DET, - "Person": "three", - "PronType": "Neg", - }, - "Dz3msr---e": { - - "Gender": "Masc", - "Number": "Sing", - POS: DET, - "Person": "three", - "PronType": "Neg", - }, - "EQUAL": {POS: SYM}, - "EXCL": {POS: PUNCT}, - "GT": {POS: SYM}, - "I": {POS: INTJ}, - "LPAR": {POS: PUNCT}, - "Mc": {"NumType": "Card", POS: NUM}, - "Mc-p-d": {"NumForm": 
"Digit", "NumType": "Card", "Number": "Plur", POS: NUM}, - "Mc-p-l": {"NumForm": "Word", "NumType": "Card", "Number": "Plur", POS: NUM}, - "Mcfp-l": { - "Gender": "Fem", - "NumForm": "Word", - "NumType": "Card", - "Number": "Plur", - POS: NUM, - }, - "Mcfp-ln": { - "Gender": "Fem", - "NumForm": "Word", - "NumType": "Card", - "Number": "Plur", - POS: NUM, - }, - "Mcfsrln": { - - "Gender": "Fem", - "NumForm": "Word", - "NumType": "Card", - "Number": "Sing", - POS: NUM, - }, - "Mcmp-l": { - "Gender": "Masc", - "NumForm": "Word", - "NumType": "Card", - "Number": "Plur", - POS: NUM, - }, - "Mcmsrl": { - - "Gender": "Masc", - "NumForm": "Word", - "NumType": "Card", - "Number": "Sing", - POS: NUM, - }, - "Mffprln": { - - "Gender": "Fem", - "NumForm": "Word", - "NumType": "Card", - "Number": "Plur", - POS: NUM, - }, - "Mlfpo": { - - "Gender": "Fem", - "NumType": "Card", - "Number": "Plur", - POS: NUM, - "PronType": "Tot", - }, - "Mlfpr": { - - "Gender": "Fem", - "NumType": "Card", - "Number": "Plur", - POS: NUM, - "PronType": "Tot", - }, - "Mlmpr": { - - "Gender": "Masc", - "NumType": "Card", - "Number": "Plur", - POS: NUM, - "PronType": "Tot", - }, - "Mo---l": {"NumForm": "Word", "NumType": "Ord", POS: NUM}, - "Mo-s-r": {"NumForm": "Roman", "NumType": "Ord", "Number": "Sing", POS: NUM}, - "Mofp-ln": { - "Gender": "Fem", - "NumForm": "Word", - "NumType": "Ord", - "Number": "Plur", - POS: NUM, - }, - "Mofprly": { - - "Gender": "Fem", - "NumForm": "Word", - "NumType": "Ord", - "Number": "Plur", - POS: NUM, - }, - "Mofs-l": { - "Gender": "Fem", - "NumForm": "Word", - "NumType": "Ord", - "Number": "Sing", - POS: NUM, - }, - "Mofsrln": { - - "Gender": "Fem", - "NumForm": "Word", - "NumType": "Ord", - "Number": "Sing", - POS: NUM, - }, - "Mofsrly": { - - "Gender": "Fem", - "NumForm": "Word", - "NumType": "Ord", - "Number": "Sing", - POS: NUM, - }, - "Momprly": { - - "Gender": "Masc", - "NumForm": "Word", - "NumType": "Ord", - "Number": "Plur", - POS: NUM, - }, - "Moms-l": { - "Gender": "Masc", - "NumForm": "Word", - "NumType": "Ord", - "Number": "Sing", - POS: NUM, - }, - "Moms-ln": { - "Gender": "Masc", - "NumForm": "Word", - "NumType": "Ord", - "Number": "Sing", - POS: NUM, - }, - "Momsoly": { - - "Gender": "Masc", - "NumForm": "Word", - "NumType": "Ord", - "Number": "Sing", - POS: NUM, - }, - "Momsrly": { - - "Gender": "Masc", - "NumForm": "Word", - "NumType": "Ord", - "Number": "Sing", - POS: NUM, - }, - "Nc": {POS: NOUN}, - "Ncf--n": {"Gender": "Fem", POS: NOUN}, - "Ncfp-n": {"Gender": "Fem", "Number": "Plur", POS: NOUN}, - "Ncfpoy": { "Gender": "Fem", "Number": "Plur", POS: NOUN}, - "Ncfpry": { "Gender": "Fem", "Number": "Plur", POS: NOUN}, - "Ncfson": { "Gender": "Fem", "Number": "Sing", POS: NOUN}, - "Ncfsoy": { "Gender": "Fem", "Number": "Sing", POS: NOUN}, - "Ncfsrn": { "Gender": "Fem", "Number": "Sing", POS: NOUN}, - "Ncfsry": { "Gender": "Fem", "Number": "Sing", POS: NOUN}, - "Ncm--n": {"Gender": "Masc", POS: NOUN}, - "Ncmp-n": {"Gender": "Masc", "Number": "Plur", POS: NOUN}, - "Ncmpoy": { "Gender": "Masc", "Number": "Plur", POS: NOUN}, - "Ncmpry": { "Gender": "Masc", "Number": "Plur", POS: NOUN}, - "Ncms-n": {"Gender": "Masc", "Number": "Sing", POS: NOUN}, - "Ncms-ny": {"Gender": "Masc", "Number": "Sing", POS: NOUN}, - "Ncmsoy": { "Gender": "Masc", "Number": "Sing", POS: NOUN}, - "Ncmsrn": { "Gender": "Masc", "Number": "Sing", POS: NOUN}, - "Ncmsry": { "Gender": "Masc", "Number": "Sing", POS: NOUN}, - "Np": {POS: PROPN}, - "Npfsoy": { "Gender": "Fem", "Number": "Sing", POS: PROPN}, - 
"Npfsry": { "Gender": "Fem", "Number": "Sing", POS: PROPN}, - "Npmsoy": { "Gender": "Masc", "Number": "Sing", POS: PROPN}, - "Npmsry": { "Gender": "Masc", "Number": "Sing", POS: PROPN}, - "PERCENT": {POS: SYM}, - "PERIOD": {POS: PUNCT}, - "PLUSMINUS": {POS: SYM}, - "Pd3-po": { - - "Number": "Plur", - POS: PRON, - "Person": "three", - "PronType": "Dem", - }, - "Pd3fpr": { - - "Gender": "Fem", - "Number": "Plur", - POS: PRON, - "Person": "three", - "PronType": "Dem", - }, - "Pd3fso": { - - "Gender": "Fem", - "Number": "Sing", - POS: PRON, - "Person": "three", - "PronType": "Dem", - }, - "Pd3fsr": { - - "Gender": "Fem", - "Number": "Sing", - POS: PRON, - "Person": "three", - "PronType": "Dem", - }, - "Pd3mpr": { - - "Gender": "Masc", - "Number": "Plur", - POS: PRON, - "Person": "three", - "PronType": "Dem", - }, - "Pd3mso": { - - "Gender": "Masc", - "Number": "Sing", - POS: PRON, - "Person": "three", - "PronType": "Dem", - }, - "Pd3msr": { - - "Gender": "Masc", - "Number": "Sing", - POS: PRON, - "Person": "three", - "PronType": "Dem", - }, - "Pi3--r": { POS: PRON, "Person": "three", "PronType": "Ind"}, - "Pi3-po": { - - "Number": "Plur", - POS: PRON, - "Person": "three", - "PronType": "Ind", - }, - "Pi3-so": { - - "Number": "Sing", - POS: PRON, - "Person": "three", - "PronType": "Ind", - }, - "Pi3-sr": { - - "Number": "Sing", - POS: PRON, - "Person": "three", - "PronType": "Ind", - }, - "Pi3fpr": { - - "Gender": "Fem", - "Number": "Plur", - POS: PRON, - "Person": "three", - "PronType": "Ind", - }, - "Pi3fso": { - - "Gender": "Fem", - "Number": "Sing", - POS: PRON, - "Person": "three", - "PronType": "Ind", - }, - "Pi3fsr": { - - "Gender": "Fem", - "Number": "Sing", - POS: PRON, - "Person": "three", - "PronType": "Ind", - }, - "Pi3mpr": { - - "Gender": "Masc", - "Number": "Plur", - POS: PRON, - "Person": "three", - "PronType": "Ind", - }, - "Pi3msr": { - - "Gender": "Masc", - "Number": "Sing", - POS: PRON, - "Person": "three", - "PronType": "Ind", - }, - "Pi3msr--y": { - - "Gender": "Masc", - "Number": "Sing", - POS: PRON, - "Person": "three", - "PronType": "Ind", - - }, - "Pp1-pa--------w": { - "Case": "Acc", - "Number": "Plur", - POS: PRON, - "Person": "one", - "PronType": "Prs", - }, - "Pp1-pa--y-----w": { - "Case": "Acc", - "Number": "Plur", - POS: PRON, - "Person": "one", - "PronType": "Prs", - - }, - "Pp1-pd--------w": { - "Case": "Dat", - "Number": "Plur", - POS: PRON, - "Person": "one", - "PronType": "Prs", - }, - "Pp1-pr--------s": { - - "Number": "Plur", - POS: PRON, - "Person": "one", - "PronType": "Prs", - }, - "Pp1-sa--------s": { - "Case": "Acc", - "Number": "Sing", - POS: PRON, - "Person": "one", - "PronType": "Prs", - }, - "Pp1-sa--------w": { - "Case": "Acc", - "Number": "Sing", - POS: PRON, - "Person": "one", - "PronType": "Prs", - }, - "Pp1-sa--y-----w": { - "Case": "Acc", - "Number": "Sing", - POS: PRON, - "Person": "one", - "PronType": "Prs", - - }, - "Pp1-sd--------w": { - "Case": "Dat", - "Number": "Sing", - POS: PRON, - "Person": "one", - "PronType": "Prs", - }, - "Pp1-sd--y-----w": { - "Case": "Dat", - "Number": "Sing", - POS: PRON, - "Person": "one", - "PronType": "Prs", - - }, - "Pp1-sn--------s": { - "Case": "Nom", - "Number": "Sing", - POS: PRON, - "Person": "one", - "PronType": "Prs", - }, - "Pp2-----------s": {POS: PRON, "Person": "two", "PronType": "Prs"}, - "Pp2-pa--------w": { - "Case": "Acc", - "Number": "Plur", - POS: PRON, - "Person": "two", - "PronType": "Prs", - }, - "Pp2-pa--y-----w": { - "Case": "Acc", - "Number": "Plur", - POS: PRON, - "Person": "two", 
- "PronType": "Prs", - - }, - "Pp2-pd--------w": { - "Case": "Dat", - "Number": "Plur", - POS: PRON, - "Person": "two", - "PronType": "Prs", - }, - "Pp2-pr--------s": { - - "Number": "Plur", - POS: PRON, - "Person": "two", - "PronType": "Prs", - }, - "Pp2-sa--------s": { - "Case": "Acc", - "Number": "Sing", - POS: PRON, - "Person": "two", - "PronType": "Prs", - }, - "Pp2-sa--------w": { - "Case": "Acc", - "Number": "Sing", - POS: PRON, - "Person": "two", - "PronType": "Prs", - }, - "Pp2-sa--y-----w": { - "Case": "Acc", - "Number": "Sing", - POS: PRON, - "Person": "two", - "PronType": "Prs", - - }, - "Pp2-sd--y-----w": { - "Case": "Dat", - "Number": "Sing", - POS: PRON, - "Person": "two", - "PronType": "Prs", - - }, - "Pp2-sn--------s": { - "Case": "Nom", - "Number": "Sing", - POS: PRON, - "Person": "two", - "PronType": "Prs", - }, - "Pp3-pd--------w": { - "Case": "Dat", - "Number": "Plur", - POS: PRON, - "Person": "three", - "PronType": "Prs", - }, - "Pp3-pd--y-----w": { - "Case": "Dat", - "Number": "Plur", - POS: PRON, - "Person": "three", - "PronType": "Prs", - - }, - "Pp3-po--------s": { - - "Number": "Plur", - POS: PRON, - "Person": "three", - "PronType": "Prs", - }, - "Pp3-sd--------w": { - "Case": "Dat", - "Number": "Sing", - POS: PRON, - "Person": "three", - "PronType": "Prs", - }, - "Pp3-sd--y-----w": { - "Case": "Dat", - "Number": "Sing", - POS: PRON, - "Person": "three", - "PronType": "Prs", - - }, - "Pp3fpa--------w": { - "Case": "Acc", - "Gender": "Fem", - "Number": "Plur", - POS: PRON, - "Person": "three", - "PronType": "Prs", - }, - "Pp3fpa--y-----w": { - "Case": "Acc", - "Gender": "Fem", - "Number": "Plur", - POS: PRON, - "Person": "three", - "PronType": "Prs", - - }, - "Pp3fpr--------s": { - - "Gender": "Fem", - "Number": "Plur", - POS: PRON, - "Person": "three", - "PronType": "Prs", - }, - "Pp3fsa--------w": { - "Case": "Acc", - "Gender": "Fem", - "Number": "Sing", - POS: PRON, - "Person": "three", - "PronType": "Prs", - }, - "Pp3fsa--y-----w": { - "Case": "Acc", - "Gender": "Fem", - "Number": "Sing", - POS: PRON, - "Person": "three", - "PronType": "Prs", - - }, - "Pp3fsr--------s": { - - "Gender": "Fem", - "Number": "Sing", - POS: PRON, - "Person": "three", - "PronType": "Prs", - }, - "Pp3mpa--------w": { - "Case": "Acc", - "Gender": "Masc", - "Number": "Plur", - POS: PRON, - "Person": "three", - "PronType": "Prs", - }, - "Pp3mpa--y-----w": { - "Case": "Acc", - "Gender": "Masc", - "Number": "Plur", - POS: PRON, - "Person": "three", - "PronType": "Prs", - - }, - "Pp3mpr--------s": { - - "Gender": "Masc", - "Number": "Plur", - POS: PRON, - "Person": "three", - "PronType": "Prs", - }, - "Pp3msa--------w": { - "Case": "Acc", - "Gender": "Masc", - "Number": "Sing", - POS: PRON, - "Person": "three", - "PronType": "Prs", - }, - "Pp3msa--y-----w": { - "Case": "Acc", - "Gender": "Masc", - "Number": "Sing", - POS: PRON, - "Person": "three", - "PronType": "Prs", - - }, - "Pp3mso--------s": { - - "Gender": "Masc", - "Number": "Sing", - POS: PRON, - "Person": "three", - "PronType": "Prs", - }, - "Pp3msr--------s": { - - "Gender": "Masc", - "Number": "Sing", - POS: PRON, - "Person": "three", - "PronType": "Prs", - }, - "Ps1mp-s": { - "Gender": "Masc", - "Number": "Plur", - POS: PRON, - "Person": "one", - "Poss": "Yes", - "PronType": "Prs", - }, - "Ps3---p": {POS: PRON, "Person": "three", "Poss": "Yes", "PronType": "Prs"}, - "Ps3---s": {POS: PRON, "Person": "three", "Poss": "Yes", "PronType": "Prs"}, - "Ps3fp-s": { - "Gender": "Fem", - "Number": "Plur", - POS: PRON, - "Person": "three", 
- "Poss": "Yes", - "PronType": "Prs", - }, - "Pw3--r": { POS: PRON, "Person": "three"}, - "Pw3-po": { - - "Number": "Plur", - POS: PRON, - "Person": "three", - - }, - "Pw3fso": { - - "Gender": "Fem", - "Number": "Sing", - POS: PRON, - "Person": "three", - - }, - "Pw3mpr": { - - "Gender": "Masc", - "Number": "Plur", - POS: PRON, - "Person": "three", - - }, - "Px3--a--------s": { - "Case": "Acc", - POS: PRON, - "Person": "three", - "PronType": "Prs", - "Reflex": "Yes", - }, - "Px3--a--------w": { - "Case": "Acc", - POS: PRON, - "Person": "three", - "PronType": "Prs", - "Reflex": "Yes", - }, - "Px3--a--y-----w": { - "Case": "Acc", - POS: PRON, - "Person": "three", - "PronType": "Prs", - "Reflex": "Yes", - - }, - "Px3--d--------w": { - "Case": "Dat", - POS: PRON, - "Person": "three", - "PronType": "Prs", - "Reflex": "Yes", - }, - "Px3--d--y-----w": { - "Case": "Dat", - POS: PRON, - "Person": "three", - "PronType": "Prs", - "Reflex": "Yes", - - }, - "Pz3-sr": { - - "Number": "Sing", - POS: PRON, - "Person": "three", - "PronType": "Neg", - }, - "Pz3msr": { - - "Gender": "Masc", - "Number": "Sing", - POS: PRON, - "Person": "three", - "PronType": "Neg", - }, - "QUEST": {POS: PUNCT}, - "QUOT": {POS: PUNCT}, - "Qn": {POS: PART, "PartType": "Inf"}, - "Qs": {"Mood": "Sub", POS: PART}, - "Qs-y": {"Mood": "Sub", POS: PART}, - "Qz": {POS: PART, "Polarity": "Neg"}, - "Qz-y": {POS: PART, "Polarity": "Neg"}, - "RPAR": {POS: PUNCT}, - "Rc": {POS: ADV}, - "Rgp": {"Degree": "Pos", POS: ADV}, - "Rgpy": {"Degree": "Pos", POS: ADV}, - "Rgs": {"Degree": "Sup", POS: ADV}, - "Rp": {POS: ADV}, - "Rw": {POS: ADV}, - "Rz": {POS: ADV, "PronType": "Neg"}, - "SCOLON": {"AdpType": "Prep", POS: PUNCT}, - "SLASH": {"AdpType": "Prep", POS: SYM}, - "Spsa": {"AdpType": "Prep", "Case": "Acc", POS: ADP}, - "Spsay": {"AdpType": "Prep", "Case": "Acc", POS: ADP}, - "Spsd": {"AdpType": "Prep", "Case": "Dat", POS: ADP}, - "Spsg": {"AdpType": "Prep", "Case": "Gen", POS: ADP}, - "Spsgy": {"AdpType": "Prep", "Case": "Gen", POS: ADP}, - "Td-po": { "Number": "Plur", POS: DET, "PronType": "Dem"}, - "Tdfpr": { - - "Gender": "Fem", - "Number": "Plur", - POS: DET, - "PronType": "Dem", - }, - "Tdfso": { - - "Gender": "Fem", - "Number": "Sing", - POS: DET, - "PronType": "Dem", - }, - "Tdfsr": { - - "Gender": "Fem", - "Number": "Sing", - POS: DET, - "PronType": "Dem", - }, - "Tdmpr": { - - "Gender": "Masc", - "Number": "Plur", - POS: DET, - "PronType": "Dem", - }, - "Tdmso": { - - "Gender": "Masc", - "Number": "Sing", - POS: DET, - "PronType": "Dem", - }, - "Tdmsr": { - - "Gender": "Masc", - "Number": "Sing", - POS: DET, - "PronType": "Dem", - }, - "Tf-so": { "Number": "Sing", POS: DET, "PronType": "Art"}, - "Tffs-y": { - "Gender": "Fem", - "Number": "Sing", - POS: DET, - "PronType": "Art", - - }, - "Tfms-y": { - "Gender": "Masc", - "Number": "Sing", - POS: DET, - "PronType": "Art", - - }, - "Tfmsoy": { - - "Gender": "Masc", - "Number": "Sing", - POS: DET, - "PronType": "Art", - - }, - "Tfmsry": { - - "Gender": "Masc", - "Number": "Sing", - POS: DET, - "PronType": "Art", - - }, - "Ti-po": { "Number": "Plur", POS: DET, "PronType": "Ind"}, - "Tifp-y": { - "Gender": "Fem", - "Number": "Plur", - POS: DET, - "PronType": "Ind", - - }, - "Tifso": { - - "Gender": "Fem", - "Number": "Sing", - POS: DET, - "PronType": "Ind", - }, - "Tifsr": { - - "Gender": "Fem", - "Number": "Sing", - POS: DET, - "PronType": "Ind", - }, - "Timso": { - - "Gender": "Masc", - "Number": "Sing", - POS: DET, - "PronType": "Ind", - }, - "Timsr": { - - "Gender": "Masc", - "Number": 
"Sing", - POS: DET, - "PronType": "Ind", - }, - "Tsfp": { - "Gender": "Fem", - "Number": "Plur", - POS: DET, - "Poss": "Yes", - "PronType": "Prs", - }, - "Tsfs": { - "Gender": "Fem", - "Number": "Sing", - POS: DET, - "Poss": "Yes", - "PronType": "Prs", - }, - "Tsmp": { - "Gender": "Masc", - "Number": "Plur", - POS: DET, - "Poss": "Yes", - "PronType": "Prs", - }, - "Tsms": { - "Gender": "Masc", - "Number": "Sing", - POS: DET, - "Poss": "Yes", - "PronType": "Prs", - }, - "Va--1": {POS: AUX, "Person": "one"}, - "Va--1p": {"Number": "Plur", POS: AUX, "Person": "one"}, - "Va--1s": {"Number": "Sing", POS: AUX, "Person": "one"}, - "Va--2p": {"Number": "Plur", POS: AUX, "Person": "two"}, - "Va--2s": {"Number": "Sing", POS: AUX, "Person": "two"}, - "Va--3": {POS: AUX, "Person": "three"}, - "Va--3-----y": {POS: AUX, "Person": "three"}, - "Va--3p": {"Number": "Plur", POS: AUX, "Person": "three"}, - "Va--3p----y": {"Number": "Plur", POS: AUX, "Person": "three"}, - "Va--3s": {"Number": "Sing", POS: AUX, "Person": "three"}, - "Va--3s----y": {"Number": "Sing", POS: AUX, "Person": "three"}, - "Vag": {POS: AUX, "VerbForm": "Ger"}, - "Vaii3p": { - "Mood": "Ind", - "Number": "Plur", - POS: AUX, - "Person": "three", - "Tense": "Imp", - "VerbForm": "Fin", - }, - "Vaii3s": { - "Mood": "Ind", - "Number": "Sing", - POS: AUX, - "Person": "three", - "Tense": "Imp", - "VerbForm": "Fin", - }, - "Vail3s": { - "Mood": "Ind", - "Number": "Sing", - POS: AUX, - "Person": "three", - - "VerbForm": "Fin", - }, - "Vaip1s": { - "Mood": "Ind", - "Number": "Sing", - POS: AUX, - "Person": "one", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vaip2s": { - "Mood": "Ind", - "Number": "Sing", - POS: AUX, - "Person": "two", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vaip3p": { - "Mood": "Ind", - "Number": "Plur", - POS: AUX, - "Person": "three", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vaip3s": { - "Mood": "Ind", - "Number": "Sing", - POS: AUX, - "Person": "three", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vanp": {POS: AUX, "Tense": "Pres", "VerbForm": "Inf"}, - "Vap--sm": {"Gender": "Masc", "Number": "Sing", POS: AUX, "VerbForm": "Part"}, - "Vasp3": { - "Mood": "Sub", - POS: AUX, - "Person": "three", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vmg": {POS: VERB, "VerbForm": "Ger"}, - "Vmg-------y": {POS: VERB, "VerbForm": "Ger"}, - "Vmii1": { - "Mood": "Ind", - POS: VERB, - "Person": "one", - "Tense": "Imp", - "VerbForm": "Fin", - }, - "Vmii1-----y": { - "Mood": "Ind", - POS: VERB, - "Person": "one", - "Tense": "Imp", - - "VerbForm": "Fin", - }, - "Vmii2p": { - "Mood": "Ind", - "Number": "Plur", - POS: VERB, - "Person": "two", - "Tense": "Imp", - "VerbForm": "Fin", - }, - "Vmii2s": { - "Mood": "Ind", - "Number": "Sing", - POS: VERB, - "Person": "two", - "Tense": "Imp", - "VerbForm": "Fin", - }, - "Vmii3p": { - "Mood": "Ind", - "Number": "Plur", - POS: VERB, - "Person": "three", - "Tense": "Imp", - "VerbForm": "Fin", - }, - "Vmii3p----y": { - "Mood": "Ind", - "Number": "Plur", - POS: VERB, - "Person": "three", - "Tense": "Imp", - - "VerbForm": "Fin", - }, - "Vmii3s": { - "Mood": "Ind", - "Number": "Sing", - POS: VERB, - "Person": "three", - "Tense": "Imp", - "VerbForm": "Fin", - }, - "Vmil3p": { - "Mood": "Ind", - "Number": "Plur", - POS: VERB, - "Person": "three", - - "VerbForm": "Fin", - }, - "Vmil3s": { - "Mood": "Ind", - "Number": "Sing", - POS: VERB, - "Person": "three", - - "VerbForm": "Fin", - }, - "Vmip1p": { - "Mood": "Ind", - "Number": "Plur", - POS: VERB, - "Person": "one", - "Tense": "Pres", - 
"VerbForm": "Fin", - }, - "Vmip1s": { - "Mood": "Ind", - "Number": "Sing", - POS: VERB, - "Person": "one", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vmip1s----y": { - "Mood": "Ind", - "Number": "Sing", - POS: VERB, - "Person": "one", - "Tense": "Pres", - - "VerbForm": "Fin", - }, - "Vmip2p": { - "Mood": "Ind", - "Number": "Plur", - POS: VERB, - "Person": "two", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vmip2s": { - "Mood": "Ind", - "Number": "Sing", - POS: VERB, - "Person": "two", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vmip3": { - "Mood": "Ind", - POS: VERB, - "Person": "three", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vmip3-----y": { - "Mood": "Ind", - POS: VERB, - "Person": "three", - "Tense": "Pres", - - "VerbForm": "Fin", - }, - "Vmip3p": { - "Mood": "Ind", - "Number": "Plur", - POS: AUX, - "Person": "three", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vmip3s": { - "Mood": "Ind", - "Number": "Sing", - POS: VERB, - "Person": "three", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vmip3s----y": { - "Mood": "Ind", - "Number": "Sing", - POS: AUX, - "Person": "three", - "Tense": "Pres", - - "VerbForm": "Fin", - }, - "Vmis1p": { - "Mood": "Ind", - "Number": "Plur", - POS: VERB, - "Person": "one", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vmis1s": { - "Mood": "Ind", - "Number": "Sing", - POS: VERB, - "Person": "one", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vmis3p": { - "Mood": "Ind", - "Number": "Plur", - POS: VERB, - "Person": "three", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vmis3s": { - "Mood": "Ind", - "Number": "Sing", - POS: VERB, - "Person": "three", - "Tense": "Past", - "VerbForm": "Fin", - }, - "Vmm-2p": { - "Mood": "Imp", - "Number": "Plur", - POS: VERB, - "Person": "two", - "VerbForm": "Fin", - }, - "Vmm-2s": { - "Mood": "Imp", - "Number": "Sing", - POS: VERB, - "Person": "two", - "VerbForm": "Fin", - }, - "Vmnp": {POS: VERB, "Tense": "Pres", "VerbForm": "Inf"}, - "Vmp--pf": {"Gender": "Fem", "Number": "Plur", POS: VERB, "VerbForm": "Part"}, - "Vmp--pm": {"Gender": "Masc", "Number": "Plur", POS: VERB, "VerbForm": "Part"}, - "Vmp--sf": {"Gender": "Fem", "Number": "Sing", POS: VERB, "VerbForm": "Part"}, - "Vmp--sm": {"Gender": "Masc", "Number": "Sing", POS: VERB, "VerbForm": "Part"}, - "Vmsp3": { - "Mood": "Sub", - POS: VERB, - "Person": "three", - "Tense": "Pres", - "VerbForm": "Fin", - }, - "Vmsp3-----y": { - "Mood": "Sub", - POS: VERB, - "Person": "three", - "Tense": "Pres", - - "VerbForm": "Fin", - }, - "X": {POS: X}, - "Y": {"Abbr": "Yes", POS: X}, - "Yn": {"Abbr": "Yes", POS: NOUN}, - "Ynmsry": { - "Abbr": "Yes", - - "Gender": "Masc", - "Number": "Sing", - POS: NOUN, - }, -} diff --git a/spacy/lang/ro/tokenizer_exceptions.py b/spacy/lang/ro/tokenizer_exceptions.py index b27344d2a..b8af0b1d6 100644 --- a/spacy/lang/ro/tokenizer_exceptions.py +++ b/spacy/lang/ro/tokenizer_exceptions.py @@ -1,7 +1,6 @@ -# coding: utf8 -from __future__ import unicode_literals - +from ..tokenizer_exceptions import BASE_EXCEPTIONS from ...symbols import ORTH +from ...util import update_exc from .punctuation import _make_ro_variants @@ -94,4 +93,4 @@ for orth in [ _exc[variant] = [{ORTH: variant}] -TOKENIZER_EXCEPTIONS = _exc +TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc) diff --git a/spacy/lang/ru/__init__.py b/spacy/lang/ru/__init__.py index f0e77d811..2f3965fcc 100644 --- a/spacy/lang/ru/__init__.py +++ b/spacy/lang/ru/__init__.py @@ -1,32 +1,17 @@ -# encoding: utf8 -from __future__ import unicode_literals, print_function +from typing 
import Optional +from thinc.api import Model from .stop_words import STOP_WORDS from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .lex_attrs import LEX_ATTRS -from .tag_map import TAG_MAP from .lemmatizer import RussianLemmatizer - -from ..tokenizer_exceptions import BASE_EXCEPTIONS -from ...util import update_exc from ...language import Language -from ...lookups import Lookups -from ...attrs import LANG class RussianDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters.update(LEX_ATTRS) - lex_attr_getters[LANG] = lambda text: "ru" - tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) + tokenizer_exceptions = TOKENIZER_EXCEPTIONS + lex_attr_getters = LEX_ATTRS stop_words = STOP_WORDS - tag_map = TAG_MAP - - @classmethod - def create_lemmatizer(cls, nlp=None, lookups=None): - if lookups is None: - lookups = Lookups() - return RussianLemmatizer(lookups) class Russian(Language): @@ -34,4 +19,20 @@ class Russian(Language): Defaults = RussianDefaults +@Russian.factory( + "lemmatizer", + assigns=["token.lemma"], + default_config={"model": None, "mode": "pymorphy2"}, + default_score_weights={"lemma_acc": 1.0}, +) +def make_lemmatizer( + nlp: Language, + model: Optional[Model], + name: str, + mode: str, + overwrite: bool = False, +): + return RussianLemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite) + + __all__ = ["Russian"] diff --git a/spacy/lang/ru/examples.py b/spacy/lang/ru/examples.py index 2db621dac..adb007625 100644 --- a/spacy/lang/ru/examples.py +++ b/spacy/lang/ru/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/ru/lemmatizer.py b/spacy/lang/ru/lemmatizer.py index 96d32f59c..3bcac8730 100644 --- a/spacy/lang/ru/lemmatizer.py +++ b/spacy/lang/ru/lemmatizer.py @@ -1,16 +1,30 @@ -# coding: utf8 -from __future__ import unicode_literals +from typing import Optional, List, Dict, Tuple -from ...symbols import ADJ, DET, NOUN, NUM, PRON, PROPN, PUNCT, VERB, POS -from ...lemmatizer import Lemmatizer -from ...compat import unicode_ +from thinc.api import Model + +from ...pipeline import Lemmatizer +from ...symbols import POS +from ...tokens import Token +from ...vocab import Vocab + + +PUNCT_RULES = {"«": '"', "»": '"'} class RussianLemmatizer(Lemmatizer): _morph = None - def __init__(self, lookups=None): - super(RussianLemmatizer, self).__init__(lookups) + def __init__( + self, + vocab: Vocab, + model: Optional[Model], + name: str = "lemmatizer", + *, + mode: str = "pymorphy2", + overwrite: bool = False, + ) -> None: + super().__init__(vocab, model, name, mode=mode, overwrite=overwrite) + try: from pymorphy2 import MorphAnalyzer except ImportError: @@ -19,19 +33,19 @@ class RussianLemmatizer(Lemmatizer): 'try to fix it with "pip install pymorphy2==0.8" ' 'or "pip install git+https://github.com/kmike/pymorphy2.git pymorphy2-dicts-uk"' "if you need Ukrainian too" - ) + ) from None if RussianLemmatizer._morph is None: RussianLemmatizer._morph = MorphAnalyzer() - def __call__(self, string, univ_pos, morphology=None): - univ_pos = self.normalize_univ_pos(univ_pos) + def pymorphy2_lemmatize(self, token: Token) -> List[str]: + string = token.text + univ_pos = token.pos_ + morphology = token.morph.to_dict() if univ_pos == "PUNCT": return [PUNCT_RULES.get(string, string)] - if univ_pos not in ("ADJ", "DET", "NOUN", "NUM", "PRON", "PROPN", "VERB"): # Skip unchangeable pos return 
[string.lower()] - analyses = self._morph.parse(string) filtered_analyses = [] for analysis in analyses: @@ -43,12 +57,10 @@ class RussianLemmatizer(Lemmatizer): analysis_pos in ("NOUN", "PROPN") and univ_pos in ("NOUN", "PROPN") ): filtered_analyses.append(analysis) - if not len(filtered_analyses): return [string.lower()] if morphology is None or (len(morphology) == 1 and POS in morphology): return list(set([analysis.normal_form for analysis in filtered_analyses])) - if univ_pos in ("ADJ", "DET", "NOUN", "PROPN"): features_to_compare = ["Case", "Number", "Gender"] elif univ_pos == "NUM": @@ -65,7 +77,6 @@ class RussianLemmatizer(Lemmatizer): "VerbForm", "Voice", ] - analyses, filtered_analyses = filtered_analyses, [] for analysis in analyses: _, analysis_morph = oc2ud(str(analysis.tag)) @@ -78,38 +89,19 @@ class RussianLemmatizer(Lemmatizer): break else: filtered_analyses.append(analysis) - if not len(filtered_analyses): return [string.lower()] return list(set([analysis.normal_form for analysis in filtered_analyses])) - @staticmethod - def normalize_univ_pos(univ_pos): - if isinstance(univ_pos, unicode_): - return univ_pos.upper() - - symbols_to_str = { - ADJ: "ADJ", - DET: "DET", - NOUN: "NOUN", - NUM: "NUM", - PRON: "PRON", - PROPN: "PROPN", - PUNCT: "PUNCT", - VERB: "VERB", - } - if univ_pos in symbols_to_str: - return symbols_to_str[univ_pos] - return None - - def lookup(self, string, orth=None): + def lookup_lemmatize(self, token: Token) -> List[str]: + string = token.text analyses = self._morph.parse(string) if len(analyses) == 1: return analyses[0].normal_form return string -def oc2ud(oc_tag): +def oc2ud(oc_tag: str) -> Tuple[str, Dict[str, str]]: gram_map = { "_POS": { "ADJF": "ADJ", @@ -164,11 +156,9 @@ def oc2ud(oc_tag): "Voice": {"actv": "Act", "pssv": "Pass"}, "Abbr": {"Abbr": "Yes"}, } - pos = "X" morphology = dict() unmatched = set() - grams = oc_tag.replace(" ", ",").split(",") for gram in grams: match = False @@ -181,7 +171,6 @@ def oc2ud(oc_tag): morphology[categ] = gmap[gram] if not match: unmatched.add(gram) - while len(unmatched) > 0: gram = unmatched.pop() if gram in ("Name", "Patr", "Surn", "Geox", "Orgn"): @@ -190,8 +179,4 @@ def oc2ud(oc_tag): pos = "AUX" elif gram == "Pltm": morphology["Number"] = "Ptan" - return pos, morphology - - -PUNCT_RULES = {"«": '"', "»": '"'} diff --git a/spacy/lang/ru/lex_attrs.py b/spacy/lang/ru/lex_attrs.py index 448c5b285..7979c7ea6 100644 --- a/spacy/lang/ru/lex_attrs.py +++ b/spacy/lang/ru/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/ru/stop_words.py b/spacy/lang/ru/stop_words.py index 89069b3cf..16cb55ef9 100644 --- a/spacy/lang/ru/stop_words.py +++ b/spacy/lang/ru/stop_words.py @@ -1,7 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ а diff --git a/spacy/lang/ru/tag_map.py b/spacy/lang/ru/tag_map.py deleted file mode 100644 index b6ca314b6..000000000 --- a/spacy/lang/ru/tag_map.py +++ /dev/null @@ -1,746 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import POS, PUNCT, SYM, ADJ, NUM, DET, ADV, ADP, X, VERB, NOUN -from ...symbols import PROPN, PART, INTJ, PRON, SCONJ, AUX, CCONJ - -# fmt: off -TAG_MAP = { - "ADJ__Animacy=Anim|Case=Acc|Degree=Pos|Gender=Masc|Number=Sing": {POS: ADJ, "Animacy": "Anim", "Case": "Acc", "Degree": "Pos", "Gender": "Masc", "Number": "Sing"}, - "ADJ__Animacy=Anim|Case=Acc|Degree=Pos|Number=Plur": {POS: ADJ, "Animacy": "Anim", 
"Case": "Acc", "Degree": "Pos", "Number": "Plur"}, - "ADJ__Animacy=Anim|Case=Acc|Degree=Sup|Gender=Masc|Number=Sing": {POS: ADJ, "Animacy": "Anim", "Case": "Acc", "Degree": "Sup", "Gender": "Masc", "Number": "Sing"}, - "ADJ__Animacy=Anim|Case=Nom|Degree=Pos|Number=Plur": {POS: ADJ, "Animacy": "Anim", "Case": "Nom", "Degree": "Pos", "Number": "Plur"}, - "ADJ__Animacy=Inan|Case=Acc|Degree=Pos|Gender=Masc|Number=Sing": {POS: ADJ, "Animacy": "Inan", "Case": "Acc", "Degree": "Pos", "Gender": "Masc", "Number": "Sing"}, - "ADJ__Animacy=Inan|Case=Acc|Degree=Pos|Gender=Neut|Number=Sing": {POS: ADJ, "Animacy": "Inan", "Case": "Acc", "Degree": "Pos", "Gender": "Neut", "Number": "Sing"}, - "ADJ__Animacy=Inan|Case=Acc|Degree=Pos|Number=Plur": {POS: ADJ, "Animacy": "Inan", "Case": "Acc", "Degree": "Pos", "Number": "Plur"}, - "ADJ__Animacy=Inan|Case=Acc|Degree=Sup|Gender=Masc|Number=Sing": {POS: ADJ, "Animacy": "Inan", "Case": "Acc", "Degree": "Sup", "Gender": "Masc", "Number": "Sing"}, - "ADJ__Animacy=Inan|Case=Acc|Degree=Sup|Number=Plur": {POS: ADJ, "Animacy": "Inan", "Case": "Acc", "Degree": "Sup", "Number": "Plur"}, - "ADJ__Animacy=Inan|Case=Acc|Gender=Fem|Number=Sing": {POS: ADJ, "Animacy": "Inan", "Case": "Acc", "Gender": "Fem", "Number": "Sing"}, - "ADJ__Animacy=Inan|Case=Nom|Degree=Pos|Gender=Fem|Number=Sing": {POS: ADJ, "Animacy": "Inan", "Case": "Nom", "Degree": "Pos", "Gender": "Fem", "Number": "Sing"}, - "ADJ__Case=Acc|Degree=Pos|Gender=Fem|Number=Sing": {POS: ADJ, "Case": "Acc", "Degree": "Pos", "Gender": "Fem", "Number": "Sing"}, - "ADJ__Case=Acc|Degree=Pos|Gender=Neut|Number=Sing": {POS: ADJ, "Case": "Acc", "Degree": "Pos", "Gender": "Neut", "Number": "Sing"}, - "ADJ__Case=Acc|Degree=Sup|Gender=Fem|Number=Sing": {POS: ADJ, "Case": "Acc", "Degree": "Sup", "Gender": "Fem", "Number": "Sing"}, - "ADJ__Case=Acc|Degree=Sup|Gender=Neut|Number=Sing": {POS: ADJ, "Case": "Acc", "Degree": "Sup", "Gender": "Neut", "Number": "Sing"}, - "ADJ__Case=Dat|Degree=Pos|Gender=Fem|Number=Sing": {POS: ADJ, "Case": "Dat", "Degree": "Pos", "Gender": "Fem", "Number": "Sing"}, - "ADJ__Case=Dat|Degree=Pos|Gender=Masc|Number=Sing": {POS: ADJ, "Case": "Dat", "Degree": "Pos", "Gender": "Masc", "Number": "Sing"}, - "ADJ__Case=Dat|Degree=Pos|Gender=Neut|Number=Sing": {POS: ADJ, "Case": "Dat", "Degree": "Pos", "Gender": "Neut", "Number": "Sing"}, - "ADJ__Case=Dat|Degree=Pos|Number=Plur": {POS: ADJ, "Case": "Dat", "Degree": "Pos", "Number": "Plur"}, - "ADJ__Case=Dat|Degree=Sup|Gender=Masc|Number=Sing": {POS: ADJ, "Case": "Dat", "Degree": "Sup", "Gender": "Masc", "Number": "Sing"}, - "ADJ__Case=Dat|Degree=Sup|Gender=Neut|Number=Sing": {POS: ADJ, "Case": "Dat", "Degree": "Sup", "Gender": "Neut", "Number": "Sing"}, - "ADJ__Case=Dat|Degree=Sup|Number=Plur": {POS: ADJ, "Case": "Dat", "Degree": "Sup", "Number": "Plur"}, - "ADJ__Case=Gen|Degree=Pos|Gender=Fem|Number=Sing": {POS: ADJ, "Case": "Gen", "Degree": "Pos", "Gender": "Fem", "Number": "Sing"}, - "ADJ__Case=Gen|Degree=Pos|Gender=Fem|Number=Sing|Variant=Short": {POS: ADJ, "Case": "Gen", "Degree": "Pos", "Gender": "Fem", "Number": "Sing", }, - "ADJ__Case=Gen|Degree=Pos|Gender=Masc|Number=Sing": {POS: ADJ, "Case": "Gen", "Degree": "Pos", "Gender": "Masc", "Number": "Sing"}, - "ADJ__Case=Gen|Degree=Pos|Gender=Neut|Number=Sing": {POS: ADJ, "Case": "Gen", "Degree": "Pos", "Gender": "Neut", "Number": "Sing"}, - "ADJ__Case=Gen|Degree=Pos|Number=Plur": {POS: ADJ, "Case": "Gen", "Degree": "Pos", "Number": "Plur"}, - "ADJ__Case=Gen|Degree=Sup|Gender=Fem|Number=Sing": {POS: ADJ, "Case": 
"Gen", "Degree": "Sup", "Gender": "Fem", "Number": "Sing"}, - "ADJ__Case=Gen|Degree=Sup|Gender=Masc|Number=Sing": {POS: ADJ, "Case": "Gen", "Degree": "Sup", "Gender": "Masc", "Number": "Sing"}, - "ADJ__Case=Gen|Degree=Sup|Gender=Neut|Number=Sing": {POS: ADJ, "Case": "Gen", "Degree": "Sup", "Gender": "Neut", "Number": "Sing"}, - "ADJ__Case=Gen|Degree=Sup|Number=Plur": {POS: ADJ, "Case": "Gen", "Degree": "Sup", "Number": "Plur"}, - "ADJ__Case=Ins|Degree=Pos|Gender=Fem|Number=Sing": {POS: ADJ, "Case": "Ins", "Degree": "Pos", "Gender": "Fem", "Number": "Sing"}, - "ADJ__Case=Ins|Degree=Pos|Gender=Masc|Number=Sing": {POS: ADJ, "Case": "Ins", "Degree": "Pos", "Gender": "Masc", "Number": "Sing"}, - "ADJ__Case=Ins|Degree=Pos|Gender=Neut|Number=Sing": {POS: ADJ, "Case": "Ins", "Degree": "Pos", "Gender": "Neut", "Number": "Sing"}, - "ADJ__Case=Ins|Degree=Pos|Number=Plur": {POS: ADJ, "Case": "Ins", "Degree": "Pos", "Number": "Plur"}, - "ADJ__Case=Ins|Degree=Sup|Gender=Fem|Number=Sing": {POS: ADJ, "Case": "Ins", "Degree": "Sup", "Gender": "Fem", "Number": "Sing"}, - "ADJ__Case=Ins|Degree=Sup|Gender=Masc|Number=Sing": {POS: ADJ, "Case": "Ins", "Degree": "Sup", "Gender": "Masc", "Number": "Sing"}, - "ADJ__Case=Ins|Degree=Sup|Gender=Neut|Number=Sing": {POS: ADJ, "Case": "Ins", "Degree": "Sup", "Gender": "Neut", "Number": "Sing"}, - "ADJ__Case=Ins|Degree=Sup|Number=Plur": {POS: ADJ, "Case": "Ins", "Degree": "Sup", "Number": "Plur"}, - "ADJ__Case=Loc|Degree=Pos|Gender=Fem|Number=Sing": {POS: ADJ, "Case": "Loc", "Degree": "Pos", "Gender": "Fem", "Number": "Sing"}, - "ADJ__Case=Loc|Degree=Pos|Gender=Masc|Number=Sing": {POS: ADJ, "Case": "Loc", "Degree": "Pos", "Gender": "Masc", "Number": "Sing"}, - "ADJ__Case=Loc|Degree=Pos|Gender=Neut|Number=Sing": {POS: ADJ, "Case": "Loc", "Degree": "Pos", "Gender": "Neut", "Number": "Sing"}, - "ADJ__Case=Loc|Degree=Pos|Number=Plur": {POS: ADJ, "Case": "Loc", "Degree": "Pos", "Number": "Plur"}, - "ADJ__Case=Loc|Degree=Sup|Gender=Fem|Number=Sing": {POS: ADJ, "Case": "Loc", "Degree": "Sup", "Gender": "Fem", "Number": "Sing"}, - "ADJ__Case=Loc|Degree=Sup|Gender=Masc|Number=Sing": {POS: ADJ, "Case": "Loc", "Degree": "Sup", "Gender": "Masc", "Number": "Sing"}, - "ADJ__Case=Loc|Degree=Sup|Gender=Neut|Number=Sing": {POS: ADJ, "Case": "Loc", "Degree": "Sup", "Gender": "Neut", "Number": "Sing"}, - "ADJ__Case=Loc|Degree=Sup|Number=Plur": {POS: ADJ, "Case": "Loc", "Degree": "Sup", "Number": "Plur"}, - "ADJ__Case=Nom|Degree=Pos|Gender=Fem|Number=Sing": {POS: ADJ, "Case": "Nom", "Degree": "Pos", "Gender": "Fem", "Number": "Sing"}, - "ADJ__Case=Nom|Degree=Pos|Gender=Masc|Number=Sing": {POS: ADJ, "Case": "Nom", "Degree": "Pos", "Gender": "Masc", "Number": "Sing"}, - "ADJ__Case=Nom|Degree=Pos|Gender=Neut|Number=Sing": {POS: ADJ, "Case": "Nom", "Degree": "Pos", "Gender": "Neut", "Number": "Sing"}, - "ADJ__Case=Nom|Degree=Pos|Number=Plur": {POS: ADJ, "Case": "Nom", "Degree": "Pos", "Number": "Plur"}, - "ADJ__Case=Nom|Degree=Sup|Gender=Fem|Number=Sing": {POS: ADJ, "Case": "Nom", "Degree": "Sup", "Gender": "Fem", "Number": "Sing"}, - "ADJ__Case=Nom|Degree=Sup|Gender=Masc|Number=Sing": {POS: ADJ, "Case": "Nom", "Degree": "Sup", "Gender": "Masc", "Number": "Sing"}, - "ADJ__Case=Nom|Degree=Sup|Gender=Neut|Number=Sing": {POS: ADJ, "Case": "Nom", "Degree": "Sup", "Gender": "Neut", "Number": "Sing"}, - "ADJ__Case=Nom|Degree=Sup|Number=Plur": {POS: ADJ, "Case": "Nom", "Degree": "Sup", "Number": "Plur"}, - "ADJ__Degree=Cmp": {POS: ADJ, "Degree": "Cmp"}, - "ADJ__Degree=Pos": {POS: ADJ, "Degree": 
"Pos"}, - "ADJ__Degree=Pos|Gender=Fem|Number=Sing|Variant=Short": {POS: ADJ, "Degree": "Pos", "Gender": "Fem", "Number": "Sing", }, - "ADJ__Degree=Pos|Gender=Masc|Number=Sing|Variant=Short": {POS: ADJ, "Degree": "Pos", "Gender": "Masc", "Number": "Sing", }, - "ADJ__Degree=Pos|Gender=Neut|Number=Sing|Variant=Short": {POS: ADJ, "Degree": "Pos", "Gender": "Neut", "Number": "Sing", }, - "ADJ__Degree=Pos|Number=Plur|Variant=Short": {POS: ADJ, "Degree": "Pos", "Number": "Plur", }, - "ADJ__Foreign=Yes": {POS: ADJ, "Foreign": "Yes"}, - "ADJ___": {POS: ADJ}, - "ADJ": {POS: ADJ}, - "ADP___": {POS: ADP}, - "ADP": {POS: ADP}, - "ADV__Degree=Cmp": {POS: ADV, "Degree": "Cmp"}, - "ADV__Degree=Pos": {POS: ADV, "Degree": "Pos"}, - "ADV__Polarity=Neg": {POS: ADV, "Polarity": "Neg"}, - "AUX__Aspect=Imp|Case=Loc|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: AUX, "Aspect": "Imp", "Case": "Loc", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "AUX__Aspect=Imp|Case=Nom|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: AUX, "Aspect": "Imp", "Case": "Nom", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "AUX__Aspect=Imp|Case=Nom|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: AUX, "Aspect": "Imp", "Case": "Nom", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "AUX__Aspect=Imp|Gender=Fem|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act": {POS: AUX, "Aspect": "Imp", "Gender": "Fem", "Mood": "Ind", "Number": "Sing", "Tense": "Past", "VerbForm": "Fin", "Voice": "Act"}, - "AUX__Aspect=Imp|Gender=Masc|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act": {POS: AUX, "Aspect": "Imp", "Gender": "Masc", "Mood": "Ind", "Number": "Sing", "Tense": "Past", "VerbForm": "Fin", "Voice": "Act"}, - "AUX__Aspect=Imp|Gender=Neut|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act": {POS: AUX, "Aspect": "Imp", "Gender": "Neut", "Mood": "Ind", "Number": "Sing", "Tense": "Past", "VerbForm": "Fin", "Voice": "Act"}, - "AUX__Aspect=Imp|Mood=Imp|Number=Plur|Person=2|VerbForm=Fin|Voice=Act": {POS: AUX, "Aspect": "Imp", "Mood": "Imp", "Number": "Plur", "Person": "two", "VerbForm": "Fin", "Voice": "Act"}, - "AUX__Aspect=Imp|Mood=Imp|Number=Sing|Person=2|VerbForm=Fin|Voice=Act": {POS: AUX, "Aspect": "Imp", "Mood": "Imp", "Number": "Sing", "Person": "two", "VerbForm": "Fin", "Voice": "Act"}, - "AUX__Aspect=Imp|Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin|Voice=Act": {POS: AUX, "Aspect": "Imp", "Mood": "Ind", "Number": "Plur", "Person": "one", "Tense": "Pres", "VerbForm": "Fin", "Voice": "Act"}, - "AUX__Aspect=Imp|Mood=Ind|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin|Voice=Act": {POS: AUX, "Aspect": "Imp", "Mood": "Ind", "Number": "Plur", "Person": "two", "Tense": "Pres", "VerbForm": "Fin", "Voice": "Act"}, - "AUX__Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act": {POS: AUX, "Aspect": "Imp", "Mood": "Ind", "Number": "Plur", "Person": "three", "Tense": "Pres", "VerbForm": "Fin", "Voice": "Act"}, - "AUX__Aspect=Imp|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Act": {POS: AUX, "Aspect": "Imp", "Mood": "Ind", "Number": "Plur", "Tense": "Past", "VerbForm": "Fin", "Voice": "Act"}, - "AUX__Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin|Voice=Act": {POS: AUX, "Aspect": "Imp", "Mood": "Ind", "Number": "Sing", "Person": "one", "Tense": "Pres", "VerbForm": "Fin", "Voice": "Act"}, - 
"AUX__Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin|Voice=Act": {POS: AUX, "Aspect": "Imp", "Mood": "Ind", "Number": "Sing", "Person": "two", "Tense": "Pres", "VerbForm": "Fin", "Voice": "Act"}, - "AUX__Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act": {POS: AUX, "Aspect": "Imp", "Mood": "Ind", "Number": "Sing", "Person": "three", "Tense": "Pres", "VerbForm": "Fin", "Voice": "Act"}, - "AUX__Aspect=Imp|Tense=Pres|VerbForm=Conv|Voice=Act": {POS: AUX, "Aspect": "Imp", "Tense": "Pres", "VerbForm": "Conv", "Voice": "Act"}, - "AUX__Aspect=Imp|VerbForm=Inf|Voice=Act": {POS: AUX, "Aspect": "Imp", "VerbForm": "Inf", "Voice": "Act"}, - "CCONJ___": {POS: CCONJ}, - "CCONJ": {POS: CCONJ}, - "DET__Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing": {POS: DET, "Animacy": "Inan", "Case": "Acc", "Gender": "Masc", "Number": "Sing"}, - "DET__Animacy=Inan|Case=Acc|Gender=Neut|Number=Sing": {POS: DET, "Animacy": "Inan", "Case": "Acc", "Gender": "Neut", "Number": "Sing"}, - "DET__Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing": {POS: DET, "Animacy": "Inan", "Case": "Gen", "Gender": "Fem", "Number": "Sing"}, - "DET__Animacy=Inan|Case=Gen|Number=Plur": {POS: DET, "Animacy": "Inan", "Case": "Gen", "Number": "Plur"}, - "DET__Case=Acc|Degree=Pos|Number=Plur": {POS: DET, "Case": "Acc", "Degree": "Pos", "Number": "Plur"}, - "DET__Case=Acc|Gender=Fem|Number=Sing": {POS: DET, "Case": "Acc", "Gender": "Fem", "Number": "Sing"}, - "DET__Case=Acc|Gender=Masc|Number=Sing": {POS: DET, "Case": "Acc", "Gender": "Masc", "Number": "Sing"}, - "DET__Case=Acc|Gender=Neut|Number=Sing": {POS: DET, "Case": "Acc", "Gender": "Neut", "Number": "Sing"}, - "DET__Case=Acc|Number=Plur": {POS: DET, "Case": "Acc", "Number": "Plur"}, - "DET__Case=Dat|Gender=Fem|Number=Sing": {POS: DET, "Case": "Dat", "Gender": "Fem", "Number": "Sing"}, - "DET__Case=Dat|Gender=Masc|Number=Plur": {POS: DET, "Case": "Dat", "Gender": "Masc", "Number": "Plur"}, - "DET__Case=Dat|Gender=Masc|Number=Sing": {POS: DET, "Case": "Dat", "Gender": "Masc", "Number": "Sing"}, - "DET__Case=Dat|Gender=Neut|Number=Sing": {POS: DET, "Case": "Dat", "Gender": "Neut", "Number": "Sing"}, - "DET__Case=Dat|Number=Plur": {POS: DET, "Case": "Dat", "Number": "Plur"}, - "DET__Case=Gen|Gender=Fem|Number=Sing": {POS: DET, "Case": "Gen", "Gender": "Fem", "Number": "Sing"}, - "DET__Case=Gen|Gender=Masc|Number=Sing": {POS: DET, "Case": "Gen", "Gender": "Masc", "Number": "Sing"}, - "DET__Case=Gen|Gender=Neut|Number=Sing": {POS: DET, "Case": "Gen", "Gender": "Neut", "Number": "Sing"}, - "DET__Case=Gen|Number=Plur": {POS: DET, "Case": "Gen", "Number": "Plur"}, - "DET__Case=Ins|Gender=Fem|Number=Sing": {POS: DET, "Case": "Ins", "Gender": "Fem", "Number": "Sing"}, - "DET__Case=Ins|Gender=Masc|Number=Sing": {POS: DET, "Case": "Ins", "Gender": "Masc", "Number": "Sing"}, - "DET__Case=Ins|Gender=Neut|Number=Sing": {POS: DET, "Case": "Ins", "Gender": "Neut", "Number": "Sing"}, - "DET__Case=Ins|Number=Plur": {POS: DET, "Case": "Ins", "Number": "Plur"}, - "DET__Case=Loc|Gender=Fem|Number=Sing": {POS: DET, "Case": "Loc", "Gender": "Fem", "Number": "Sing"}, - "DET__Case=Loc|Gender=Masc|Number=Sing": {POS: DET, "Case": "Loc", "Gender": "Masc", "Number": "Sing"}, - "DET__Case=Loc|Gender=Neut|Number=Sing": {POS: DET, "Case": "Loc", "Gender": "Neut", "Number": "Sing"}, - "DET__Case=Loc|Number=Plur": {POS: DET, "Case": "Loc", "Number": "Plur"}, - "DET__Case=Nom|Gender=Fem|Number=Sing": {POS: DET, "Case": "Nom", "Gender": "Fem", "Number": "Sing"}, - 
"DET__Case=Nom|Gender=Masc|Number=Plur": {POS: DET, "Case": "Nom", "Gender": "Masc", "Number": "Plur"}, - "DET__Case=Nom|Gender=Masc|Number=Sing": {POS: DET, "Case": "Nom", "Gender": "Masc", "Number": "Sing"}, - "DET__Case=Nom|Gender=Neut|Number=Sing": {POS: DET, "Case": "Nom", "Gender": "Neut", "Number": "Sing"}, - "DET__Case=Nom|Number=Plur": {POS: DET, "Case": "Nom", "Number": "Plur"}, - "DET__Gender=Masc|Number=Sing": {POS: DET, "Gender": "Masc", "Number": "Sing"}, - "INTJ___": {POS: INTJ}, - "INTJ": {POS: INTJ}, - "NOUN__Animacy=Anim|Case=Acc|Gender=Fem|Number=Plur": {POS: NOUN, "Animacy": "Anim", "Case": "Acc", "Gender": "Fem", "Number": "Plur"}, - "NOUN__Animacy=Anim|Case=Acc|Gender=Fem|Number=Sing": {POS: NOUN, "Animacy": "Anim", "Case": "Acc", "Gender": "Fem", "Number": "Sing"}, - "NOUN__Animacy=Anim|Case=Acc|Gender=Masc|Number=Plur": {POS: NOUN, "Animacy": "Anim", "Case": "Acc", "Gender": "Masc", "Number": "Plur"}, - "NOUN__Animacy=Anim|Case=Acc|Gender=Masc|Number=Sing": {POS: NOUN, "Animacy": "Anim", "Case": "Acc", "Gender": "Masc", "Number": "Sing"}, - "NOUN__Animacy=Anim|Case=Acc|Gender=Neut|Number=Plur": {POS: NOUN, "Animacy": "Anim", "Case": "Acc", "Gender": "Neut", "Number": "Plur"}, - "NOUN__Animacy=Anim|Case=Acc|Gender=Neut|Number=Sing": {POS: NOUN, "Animacy": "Anim", "Case": "Acc", "Gender": "Neut", "Number": "Sing"}, - "NOUN__Animacy=Anim|Case=Acc|Number=Plur": {POS: NOUN, "Animacy": "Anim", "Case": "Acc", "Number": "Plur"}, - "NOUN__Animacy=Anim|Case=Dat|Gender=Fem|Number=Plur": {POS: NOUN, "Animacy": "Anim", "Case": "Dat", "Gender": "Fem", "Number": "Plur"}, - "NOUN__Animacy=Anim|Case=Dat|Gender=Fem|Number=Sing": {POS: NOUN, "Animacy": "Anim", "Case": "Dat", "Gender": "Fem", "Number": "Sing"}, - "NOUN__Animacy=Anim|Case=Dat|Gender=Masc|Number=Plur": {POS: NOUN, "Animacy": "Anim", "Case": "Dat", "Gender": "Masc", "Number": "Plur"}, - "NOUN__Animacy=Anim|Case=Dat|Gender=Masc|Number=Sing": {POS: NOUN, "Animacy": "Anim", "Case": "Dat", "Gender": "Masc", "Number": "Sing"}, - "NOUN__Animacy=Anim|Case=Dat|Gender=Neut|Number=Plur": {POS: NOUN, "Animacy": "Anim", "Case": "Dat", "Gender": "Neut", "Number": "Plur"}, - "NOUN__Animacy=Anim|Case=Dat|Gender=Neut|Number=Sing": {POS: NOUN, "Animacy": "Anim", "Case": "Dat", "Gender": "Neut", "Number": "Sing"}, - "NOUN__Animacy=Anim|Case=Dat|Number=Plur": {POS: NOUN, "Animacy": "Anim", "Case": "Dat", "Number": "Plur"}, - "NOUN__Animacy=Anim|Case=Gen|Gender=Fem|Number=Plur": {POS: NOUN, "Animacy": "Anim", "Case": "Gen", "Gender": "Fem", "Number": "Plur"}, - "NOUN__Animacy=Anim|Case=Gen|Gender=Fem|Number=Sing": {POS: NOUN, "Animacy": "Anim", "Case": "Gen", "Gender": "Fem", "Number": "Sing"}, - "NOUN__Animacy=Anim|Case=Gen|Gender=Masc|Number=Plur": {POS: NOUN, "Animacy": "Anim", "Case": "Gen", "Gender": "Masc", "Number": "Plur"}, - "NOUN__Animacy=Anim|Case=Gen|Gender=Masc|Number=Sing": {POS: NOUN, "Animacy": "Anim", "Case": "Gen", "Gender": "Masc", "Number": "Sing"}, - "NOUN__Animacy=Anim|Case=Gen|Gender=Neut|Number=Plur": {POS: NOUN, "Animacy": "Anim", "Case": "Gen", "Gender": "Neut", "Number": "Plur"}, - "NOUN__Animacy=Anim|Case=Gen|Gender=Neut|Number=Sing": {POS: NOUN, "Animacy": "Anim", "Case": "Gen", "Gender": "Neut", "Number": "Sing"}, - "NOUN__Animacy=Anim|Case=Gen|Number=Plur": {POS: NOUN, "Animacy": "Anim", "Case": "Gen", "Number": "Plur"}, - "NOUN__Animacy=Anim|Case=Ins|Gender=Fem|Number=Plur": {POS: NOUN, "Animacy": "Anim", "Case": "Ins", "Gender": "Fem", "Number": "Plur"}, - "NOUN__Animacy=Anim|Case=Ins|Gender=Fem|Number=Sing": 
{POS: NOUN, "Animacy": "Anim", "Case": "Ins", "Gender": "Fem", "Number": "Sing"}, - "NOUN__Animacy=Anim|Case=Ins|Gender=Masc|Number=Plur": {POS: NOUN, "Animacy": "Anim", "Case": "Ins", "Gender": "Masc", "Number": "Plur"}, - "NOUN__Animacy=Anim|Case=Ins|Gender=Masc|Number=Sing": {POS: NOUN, "Animacy": "Anim", "Case": "Ins", "Gender": "Masc", "Number": "Sing"}, - "NOUN__Animacy=Anim|Case=Ins|Gender=Neut|Number=Plur": {POS: NOUN, "Animacy": "Anim", "Case": "Ins", "Gender": "Neut", "Number": "Plur"}, - "NOUN__Animacy=Anim|Case=Ins|Gender=Neut|Number=Sing": {POS: NOUN, "Animacy": "Anim", "Case": "Ins", "Gender": "Neut", "Number": "Sing"}, - "NOUN__Animacy=Anim|Case=Ins|Number=Plur": {POS: NOUN, "Animacy": "Anim", "Case": "Ins", "Number": "Plur"}, - "NOUN__Animacy=Anim|Case=Loc|Gender=Fem|Number=Plur": {POS: NOUN, "Animacy": "Anim", "Case": "Loc", "Gender": "Fem", "Number": "Plur"}, - "NOUN__Animacy=Anim|Case=Loc|Gender=Fem|Number=Sing": {POS: NOUN, "Animacy": "Anim", "Case": "Loc", "Gender": "Fem", "Number": "Sing"}, - "NOUN__Animacy=Anim|Case=Loc|Gender=Masc|Number=Plur": {POS: NOUN, "Animacy": "Anim", "Case": "Loc", "Gender": "Masc", "Number": "Plur"}, - "NOUN__Animacy=Anim|Case=Loc|Gender=Masc|Number=Sing": {POS: NOUN, "Animacy": "Anim", "Case": "Loc", "Gender": "Masc", "Number": "Sing"}, - "NOUN__Animacy=Anim|Case=Loc|Gender=Neut|Number=Plur": {POS: NOUN, "Animacy": "Anim", "Case": "Loc", "Gender": "Neut", "Number": "Plur"}, - "NOUN__Animacy=Anim|Case=Loc|Gender=Neut|Number=Sing": {POS: NOUN, "Animacy": "Anim", "Case": "Loc", "Gender": "Neut", "Number": "Sing"}, - "NOUN__Animacy=Anim|Case=Loc|Number=Plur": {POS: NOUN, "Animacy": "Anim", "Case": "Loc", "Number": "Plur"}, - "NOUN__Animacy=Anim|Case=Nom|Gender=Fem|Number=Plur": {POS: NOUN, "Animacy": "Anim", "Case": "Nom", "Gender": "Fem", "Number": "Plur"}, - "NOUN__Animacy=Anim|Case=Nom|Gender=Fem|Number=Sing": {POS: NOUN, "Animacy": "Anim", "Case": "Nom", "Gender": "Fem", "Number": "Sing"}, - "NOUN__Animacy=Anim|Case=Nom|Gender=Masc|Number=Plur": {POS: NOUN, "Animacy": "Anim", "Case": "Nom", "Gender": "Masc", "Number": "Plur"}, - "NOUN__Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing": {POS: NOUN, "Animacy": "Anim", "Case": "Nom", "Gender": "Masc", "Number": "Sing"}, - "NOUN__Animacy=Anim|Case=Nom|Gender=Neut|Number=Plur": {POS: NOUN, "Animacy": "Anim", "Case": "Nom", "Gender": "Neut", "Number": "Plur"}, - "NOUN__Animacy=Anim|Case=Nom|Gender=Neut|Number=Sing": {POS: NOUN, "Animacy": "Anim", "Case": "Nom", "Gender": "Neut", "Number": "Sing"}, - "NOUN__Animacy=Anim|Case=Nom|Number=Plur": {POS: NOUN, "Animacy": "Anim", "Case": "Nom", "Number": "Plur"}, - "NOUN__Animacy=Anim|Case=Voc|Gender=Masc|Number=Sing": {POS: NOUN, "Animacy": "Anim", "Case": "Voc", "Gender": "Masc", "Number": "Sing"}, - "NOUN__Animacy=Inan|Case=Acc|Gender=Fem|Number=Plur": {POS: NOUN, "Animacy": "Inan", "Case": "Acc", "Gender": "Fem", "Number": "Plur"}, - "NOUN__Animacy=Inan|Case=Acc|Gender=Fem|Number=Sing": {POS: NOUN, "Animacy": "Inan", "Case": "Acc", "Gender": "Fem", "Number": "Sing"}, - "NOUN__Animacy=Inan|Case=Acc|Gender=Masc|Number=Plur": {POS: NOUN, "Animacy": "Inan", "Case": "Acc", "Gender": "Masc", "Number": "Plur"}, - "NOUN__Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing": {POS: NOUN, "Animacy": "Inan", "Case": "Acc", "Gender": "Masc", "Number": "Sing"}, - "NOUN__Animacy=Inan|Case=Acc|Gender=Neut|Number=Plur": {POS: NOUN, "Animacy": "Inan", "Case": "Acc", "Gender": "Neut", "Number": "Plur"}, - "NOUN__Animacy=Inan|Case=Acc|Gender=Neut|Number=Sing": {POS: NOUN, 
"Animacy": "Inan", "Case": "Acc", "Gender": "Neut", "Number": "Sing"}, - "NOUN__Animacy=Inan|Case=Acc|Number=Plur": {POS: NOUN, "Animacy": "Inan", "Case": "Acc", "Number": "Plur"}, - "NOUN__Animacy=Inan|Case=Dat|Gender=Fem|Number=Plur": {POS: NOUN, "Animacy": "Inan", "Case": "Dat", "Gender": "Fem", "Number": "Plur"}, - "NOUN__Animacy=Inan|Case=Dat|Gender=Fem|Number=Sing": {POS: NOUN, "Animacy": "Inan", "Case": "Dat", "Gender": "Fem", "Number": "Sing"}, - "NOUN__Animacy=Inan|Case=Dat|Gender=Masc|Number=Plur": {POS: NOUN, "Animacy": "Inan", "Case": "Dat", "Gender": "Masc", "Number": "Plur"}, - "NOUN__Animacy=Inan|Case=Dat|Gender=Masc|Number=Sing": {POS: NOUN, "Animacy": "Inan", "Case": "Dat", "Gender": "Masc", "Number": "Sing"}, - "NOUN__Animacy=Inan|Case=Dat|Gender=Neut|Number=Plur": {POS: NOUN, "Animacy": "Inan", "Case": "Dat", "Gender": "Neut", "Number": "Plur"}, - "NOUN__Animacy=Inan|Case=Dat|Gender=Neut|Number=Sing": {POS: NOUN, "Animacy": "Inan", "Case": "Dat", "Gender": "Neut", "Number": "Sing"}, - "NOUN__Animacy=Inan|Case=Dat|Number=Plur": {POS: NOUN, "Animacy": "Inan", "Case": "Dat", "Number": "Plur"}, - "NOUN__Animacy=Inan|Case=Gen|Gender=Fem|Number=Plur": {POS: NOUN, "Animacy": "Inan", "Case": "Gen", "Gender": "Fem", "Number": "Plur"}, - "NOUN__Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing": {POS: NOUN, "Animacy": "Inan", "Case": "Gen", "Gender": "Fem", "Number": "Sing"}, - "NOUN__Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur": {POS: NOUN, "Animacy": "Inan", "Case": "Gen", "Gender": "Masc", "Number": "Plur"}, - "NOUN__Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing": {POS: NOUN, "Animacy": "Inan", "Case": "Gen", "Gender": "Masc", "Number": "Sing"}, - "NOUN__Animacy=Inan|Case=Gen|Gender=Neut|Number=Plur": {POS: NOUN, "Animacy": "Inan", "Case": "Gen", "Gender": "Neut", "Number": "Plur"}, - "NOUN__Animacy=Inan|Case=Gen|Gender=Neut|Number=Sing": {POS: NOUN, "Animacy": "Inan", "Case": "Gen", "Gender": "Neut", "Number": "Sing"}, - "NOUN__Animacy=Inan|Case=Gen|Number=Plur": {POS: NOUN, "Animacy": "Inan", "Case": "Gen", "Number": "Plur"}, - "NOUN__Animacy=Inan|Case=Ins|Gender=Fem|Number=Plur": {POS: NOUN, "Animacy": "Inan", "Case": "Ins", "Gender": "Fem", "Number": "Plur"}, - "NOUN__Animacy=Inan|Case=Ins|Gender=Fem|Number=Sing": {POS: NOUN, "Animacy": "Inan", "Case": "Ins", "Gender": "Fem", "Number": "Sing"}, - "NOUN__Animacy=Inan|Case=Ins|Gender=Masc|Number=Plur": {POS: NOUN, "Animacy": "Inan", "Case": "Ins", "Gender": "Masc", "Number": "Plur"}, - "NOUN__Animacy=Inan|Case=Ins|Gender=Masc|Number=Sing": {POS: NOUN, "Animacy": "Inan", "Case": "Ins", "Gender": "Masc", "Number": "Sing"}, - "NOUN__Animacy=Inan|Case=Ins|Gender=Neut|Number=Plur": {POS: NOUN, "Animacy": "Inan", "Case": "Ins", "Gender": "Neut", "Number": "Plur"}, - "NOUN__Animacy=Inan|Case=Ins|Gender=Neut|Number=Sing": {POS: NOUN, "Animacy": "Inan", "Case": "Ins", "Gender": "Neut", "Number": "Sing"}, - "NOUN__Animacy=Inan|Case=Ins|Number=Plur": {POS: NOUN, "Animacy": "Inan", "Case": "Ins", "Number": "Plur"}, - "NOUN__Animacy=Inan|Case=Loc|Gender=Fem|Number=Plur": {POS: NOUN, "Animacy": "Inan", "Case": "Loc", "Gender": "Fem", "Number": "Plur"}, - "NOUN__Animacy=Inan|Case=Loc|Gender=Fem|Number=Sing": {POS: NOUN, "Animacy": "Inan", "Case": "Loc", "Gender": "Fem", "Number": "Sing"}, - "NOUN__Animacy=Inan|Case=Loc|Gender=Masc|Number=Plur": {POS: NOUN, "Animacy": "Inan", "Case": "Loc", "Gender": "Masc", "Number": "Plur"}, - "NOUN__Animacy=Inan|Case=Loc|Gender=Masc|Number=Sing": {POS: NOUN, "Animacy": "Inan", "Case": "Loc", "Gender": 
"Masc", "Number": "Sing"}, - "NOUN__Animacy=Inan|Case=Loc|Gender=Neut|Number=Plur": {POS: NOUN, "Animacy": "Inan", "Case": "Loc", "Gender": "Neut", "Number": "Plur"}, - "NOUN__Animacy=Inan|Case=Loc|Gender=Neut|Number=Sing": {POS: NOUN, "Animacy": "Inan", "Case": "Loc", "Gender": "Neut", "Number": "Sing"}, - "NOUN__Animacy=Inan|Case=Loc|Number=Plur": {POS: NOUN, "Animacy": "Inan", "Case": "Loc", "Number": "Plur"}, - "NOUN__Animacy=Inan|Case=Nom|Gender=Fem|Number=Plur": {POS: NOUN, "Animacy": "Inan", "Case": "Nom", "Gender": "Fem", "Number": "Plur"}, - "NOUN__Animacy=Inan|Case=Nom|Gender=Fem|Number=Sing": {POS: NOUN, "Animacy": "Inan", "Case": "Nom", "Gender": "Fem", "Number": "Sing"}, - "NOUN__Animacy=Inan|Case=Nom|Gender=Masc|Number=Plur": {POS: NOUN, "Animacy": "Inan", "Case": "Nom", "Gender": "Masc", "Number": "Plur"}, - "NOUN__Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing": {POS: NOUN, "Animacy": "Inan", "Case": "Nom", "Gender": "Masc", "Number": "Sing"}, - "NOUN__Animacy=Inan|Case=Nom|Gender=Neut|Number=Plur": {POS: NOUN, "Animacy": "Inan", "Case": "Nom", "Gender": "Neut", "Number": "Plur"}, - "NOUN__Animacy=Inan|Case=Nom|Gender=Neut|Number=Sing": {POS: NOUN, "Animacy": "Inan", "Case": "Nom", "Gender": "Neut", "Number": "Sing"}, - "NOUN__Animacy=Inan|Case=Nom|Number=Plur": {POS: NOUN, "Animacy": "Inan", "Case": "Nom", "Number": "Plur"}, - "NOUN__Animacy=Inan|Case=Par|Gender=Masc|Number=Sing": {POS: NOUN, "Animacy": "Inan", "Case": "Par", "Gender": "Masc", "Number": "Sing"}, - "NOUN__Animacy=Inan|Gender=Fem": {POS: NOUN, "Animacy": "Inan", "Gender": "Fem"}, - "NOUN__Animacy=Inan|Gender=Masc": {POS: NOUN, "Animacy": "Inan", "Gender": "Masc"}, - "NOUN__Animacy=Inan|Gender=Neut": {POS: NOUN, "Animacy": "Inan", "Gender": "Neut"}, - "NOUN__Case=Gen|Degree=Pos|Gender=Fem|Number=Sing": {POS: NOUN, "Case": "Gen", "Degree": "Pos", "Gender": "Fem", "Number": "Sing"}, - "NOUN__Foreign=Yes": {POS: NOUN, "Foreign": "Yes"}, - "NOUN___": {POS: NOUN}, - "NOUN": {POS: NOUN}, - "NUM__Animacy=Anim|Case=Acc": {POS: NUM, "Animacy": "Anim", "Case": "Acc"}, - "NUM__Animacy=Anim|Case=Acc|Gender=Fem": {POS: NUM, "Animacy": "Anim", "Case": "Acc", "Gender": "Fem"}, - "NUM__Animacy=Anim|Case=Acc|Gender=Masc": {POS: NUM, "Animacy": "Anim", "Case": "Acc", "Gender": "Masc"}, - "NUM__Animacy=Inan|Case=Acc": {POS: NUM, "Animacy": "Inan", "Case": "Acc"}, - "NUM__Animacy=Inan|Case=Acc|Gender=Fem": {POS: NUM, "Animacy": "Inan", "Case": "Acc", "Gender": "Fem"}, - "NUM__Animacy=Inan|Case=Acc|Gender=Masc": {POS: NUM, "Animacy": "Inan", "Case": "Acc", "Gender": "Masc"}, - "NUM__Case=Acc": {POS: NUM, "Case": "Acc"}, - "NUM__Case=Acc|Gender=Fem": {POS: NUM, "Case": "Acc", "Gender": "Fem"}, - "NUM__Case=Acc|Gender=Masc": {POS: NUM, "Case": "Acc", "Gender": "Masc"}, - "NUM__Case=Acc|Gender=Neut": {POS: NUM, "Case": "Acc", "Gender": "Neut"}, - "NUM__Case=Dat": {POS: NUM, "Case": "Dat"}, - "NUM__Case=Dat|Gender=Fem": {POS: NUM, "Case": "Dat", "Gender": "Fem"}, - "NUM__Case=Dat|Gender=Masc": {POS: NUM, "Case": "Dat", "Gender": "Masc"}, - "NUM__Case=Dat|Gender=Neut": {POS: NUM, "Case": "Dat", "Gender": "Neut"}, - "NUM__Case=Gen": {POS: NUM, "Case": "Gen"}, - "NUM__Case=Gen|Gender=Fem": {POS: NUM, "Case": "Gen", "Gender": "Fem"}, - "NUM__Case=Gen|Gender=Masc": {POS: NUM, "Case": "Gen", "Gender": "Masc"}, - "NUM__Case=Gen|Gender=Neut": {POS: NUM, "Case": "Gen", "Gender": "Neut"}, - "NUM__Case=Ins": {POS: NUM, "Case": "Ins"}, - "NUM__Case=Ins|Gender=Fem": {POS: NUM, "Case": "Ins", "Gender": "Fem"}, - "NUM__Case=Ins|Gender=Masc": {POS: 
NUM, "Case": "Ins", "Gender": "Masc"}, - "NUM__Case=Ins|Gender=Neut": {POS: NUM, "Case": "Ins", "Gender": "Neut"}, - "NUM__Case=Loc": {POS: NUM, "Case": "Loc"}, - "NUM__Case=Loc|Gender=Fem": {POS: NUM, "Case": "Loc", "Gender": "Fem"}, - "NUM__Case=Loc|Gender=Masc": {POS: NUM, "Case": "Loc", "Gender": "Masc"}, - "NUM__Case=Loc|Gender=Neut": {POS: NUM, "Case": "Loc", "Gender": "Neut"}, - "NUM__Case=Nom": {POS: NUM, "Case": "Nom"}, - "NUM__Case=Nom|Gender=Fem": {POS: NUM, "Case": "Nom", "Gender": "Fem"}, - "NUM__Case=Nom|Gender=Masc": {POS: NUM, "Case": "Nom", "Gender": "Masc"}, - "NUM__Case=Nom|Gender=Neut": {POS: NUM, "Case": "Nom", "Gender": "Neut"}, - "NUM___": {POS: NUM}, - "NUM": {POS: NUM}, - "PART__Mood=Cnd": {POS: PART, "Mood": "Cnd"}, - "PART__Polarity=Neg": {POS: PART, "Polarity": "Neg"}, - "PART___": {POS: PART}, - "PART": {POS: PART}, - "PRON__Animacy=Anim|Case=Acc|Gender=Masc|Number=Plur": {POS: PRON, "Animacy": "Anim", "Case": "Acc", "Gender": "Masc", "Number": "Plur"}, - "PRON__Animacy=Anim|Case=Acc|Number=Plur": {POS: PRON, "Animacy": "Anim", "Case": "Acc", "Number": "Plur"}, - "PRON__Animacy=Anim|Case=Dat|Gender=Masc|Number=Sing": {POS: PRON, "Animacy": "Anim", "Case": "Dat", "Gender": "Masc", "Number": "Sing"}, - "PRON__Animacy=Anim|Case=Dat|Number=Plur": {POS: PRON, "Animacy": "Anim", "Case": "Dat", "Number": "Plur"}, - "PRON__Animacy=Anim|Case=Gen|Number=Plur": {POS: PRON, "Animacy": "Anim", "Case": "Gen", "Number": "Plur"}, - "PRON__Animacy=Anim|Case=Ins|Gender=Masc|Number=Sing": {POS: PRON, "Animacy": "Anim", "Case": "Ins", "Gender": "Masc", "Number": "Sing"}, - "PRON__Animacy=Anim|Case=Ins|Number=Plur": {POS: PRON, "Animacy": "Anim", "Case": "Ins", "Number": "Plur"}, - "PRON__Animacy=Anim|Case=Loc|Number=Plur": {POS: PRON, "Animacy": "Anim", "Case": "Loc", "Number": "Plur"}, - "PRON__Animacy=Anim|Case=Nom|Gender=Masc|Number=Plur": {POS: PRON, "Animacy": "Anim", "Case": "Nom", "Gender": "Masc", "Number": "Plur"}, - "PRON__Animacy=Anim|Case=Nom|Number=Plur": {POS: PRON, "Animacy": "Anim", "Case": "Nom", "Number": "Plur"}, - "PRON__Animacy=Anim|Gender=Masc|Number=Plur": {POS: PRON, "Animacy": "Anim", "Gender": "Masc", "Number": "Plur"}, - "PRON__Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing": {POS: PRON, "Animacy": "Inan", "Case": "Acc", "Gender": "Masc", "Number": "Sing"}, - "PRON__Animacy=Inan|Case=Acc|Gender=Neut|Number=Sing": {POS: PRON, "Animacy": "Inan", "Case": "Acc", "Gender": "Neut", "Number": "Sing"}, - "PRON__Animacy=Inan|Case=Dat|Gender=Neut|Number=Sing": {POS: PRON, "Animacy": "Inan", "Case": "Dat", "Gender": "Neut", "Number": "Sing"}, - "PRON__Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing": {POS: PRON, "Animacy": "Inan", "Case": "Gen", "Gender": "Masc", "Number": "Sing"}, - "PRON__Animacy=Inan|Case=Gen|Gender=Neut|Number=Sing": {POS: PRON, "Animacy": "Inan", "Case": "Gen", "Gender": "Neut", "Number": "Sing"}, - "PRON__Animacy=Inan|Case=Ins|Gender=Fem|Number=Sing": {POS: PRON, "Animacy": "Inan", "Case": "Ins", "Gender": "Fem", "Number": "Sing"}, - "PRON__Animacy=Inan|Case=Ins|Gender=Neut|Number=Sing": {POS: PRON, "Animacy": "Inan", "Case": "Ins", "Gender": "Neut", "Number": "Sing"}, - "PRON__Animacy=Inan|Case=Loc|Gender=Neut|Number=Sing": {POS: PRON, "Animacy": "Inan", "Case": "Loc", "Gender": "Neut", "Number": "Sing"}, - "PRON__Animacy=Inan|Case=Nom|Gender=Neut|Number=Sing": {POS: PRON, "Animacy": "Inan", "Case": "Nom", "Gender": "Neut", "Number": "Sing"}, - "PRON__Animacy=Inan|Gender=Neut|Number=Sing": {POS: PRON, "Animacy": "Inan", "Gender": "Neut", 
"Number": "Sing"}, - "PRON__Case=Acc": {POS: PRON, "Case": "Acc"}, - "PRON__Case=Acc|Gender=Fem|Number=Sing|Person=3": {POS: PRON, "Case": "Acc", "Gender": "Fem", "Number": "Sing", "Person": "three"}, - "PRON__Case=Acc|Gender=Masc|Number=Sing|Person=3": {POS: PRON, "Case": "Acc", "Gender": "Masc", "Number": "Sing", "Person": "three"}, - "PRON__Case=Acc|Gender=Neut|Number=Sing|Person=3": {POS: PRON, "Case": "Acc", "Gender": "Neut", "Number": "Sing", "Person": "three"}, - "PRON__Case=Acc|Number=Plur|Person=1": {POS: PRON, "Case": "Acc", "Number": "Plur", "Person": "one"}, - "PRON__Case=Acc|Number=Plur|Person=2": {POS: PRON, "Case": "Acc", "Number": "Plur", "Person": "two"}, - "PRON__Case=Acc|Number=Plur|Person=3": {POS: PRON, "Case": "Acc", "Number": "Plur", "Person": "three"}, - "PRON__Case=Acc|Number=Sing|Person=1": {POS: PRON, "Case": "Acc", "Number": "Sing", "Person": "one"}, - "PRON__Case=Acc|Number=Sing|Person=2": {POS: PRON, "Case": "Acc", "Number": "Sing", "Person": "two"}, - "PRON__Case=Dat": {POS: PRON, "Case": "Dat"}, - "PRON__Case=Dat|Gender=Fem|Number=Sing|Person=3": {POS: PRON, "Case": "Dat", "Gender": "Fem", "Number": "Sing", "Person": "three"}, - "PRON__Case=Dat|Gender=Masc|Number=Sing|Person=3": {POS: PRON, "Case": "Dat", "Gender": "Masc", "Number": "Sing", "Person": "three"}, - "PRON__Case=Dat|Gender=Neut|Number=Sing|Person=3": {POS: PRON, "Case": "Dat", "Gender": "Neut", "Number": "Sing", "Person": "three"}, - "PRON__Case=Dat|Number=Plur|Person=1": {POS: PRON, "Case": "Dat", "Number": "Plur", "Person": "one"}, - "PRON__Case=Dat|Number=Plur|Person=2": {POS: PRON, "Case": "Dat", "Number": "Plur", "Person": "two"}, - "PRON__Case=Dat|Number=Plur|Person=3": {POS: PRON, "Case": "Dat", "Number": "Plur", "Person": "three"}, - "PRON__Case=Dat|Number=Sing|Person=1": {POS: PRON, "Case": "Dat", "Number": "Sing", "Person": "one"}, - "PRON__Case=Dat|Number=Sing|Person=2": {POS: PRON, "Case": "Dat", "Number": "Sing", "Person": "two"}, - "PRON__Case=Gen": {POS: PRON, "Case": "Gen"}, - "PRON__Case=Gen|Gender=Fem|Number=Sing|Person=3": {POS: PRON, "Case": "Gen", "Gender": "Fem", "Number": "Sing", "Person": "three"}, - "PRON__Case=Gen|Gender=Masc|Number=Sing|Person=3": {POS: PRON, "Case": "Gen", "Gender": "Masc", "Number": "Sing", "Person": "three"}, - "PRON__Case=Gen|Gender=Neut|Number=Sing|Person=3": {POS: PRON, "Case": "Gen", "Gender": "Neut", "Number": "Sing", "Person": "three"}, - "PRON__Case=Gen|Number=Plur|Person=1": {POS: PRON, "Case": "Gen", "Number": "Plur", "Person": "one"}, - "PRON__Case=Gen|Number=Plur|Person=2": {POS: PRON, "Case": "Gen", "Number": "Plur", "Person": "two"}, - "PRON__Case=Gen|Number=Plur|Person=3": {POS: PRON, "Case": "Gen", "Number": "Plur", "Person": "three"}, - "PRON__Case=Gen|Number=Sing|Person=1": {POS: PRON, "Case": "Gen", "Number": "Sing", "Person": "one"}, - "PRON__Case=Gen|Number=Sing|Person=2": {POS: PRON, "Case": "Gen", "Number": "Sing", "Person": "two"}, - "PRON__Case=Ins": {POS: PRON, "Case": "Ins"}, - "PRON__Case=Ins|Gender=Fem|Number=Sing|Person=3": {POS: PRON, "Case": "Ins", "Gender": "Fem", "Number": "Sing", "Person": "three"}, - "PRON__Case=Ins|Gender=Masc|Number=Sing|Person=3": {POS: PRON, "Case": "Ins", "Gender": "Masc", "Number": "Sing", "Person": "three"}, - "PRON__Case=Ins|Gender=Neut|Number=Sing|Person=3": {POS: PRON, "Case": "Ins", "Gender": "Neut", "Number": "Sing", "Person": "three"}, - "PRON__Case=Ins|Number=Plur|Person=1": {POS: PRON, "Case": "Ins", "Number": "Plur", "Person": "one"}, - "PRON__Case=Ins|Number=Plur|Person=2": {POS: 
PRON, "Case": "Ins", "Number": "Plur", "Person": "two"}, - "PRON__Case=Ins|Number=Plur|Person=3": {POS: PRON, "Case": "Ins", "Number": "Plur", "Person": "three"}, - "PRON__Case=Ins|Number=Sing|Person=1": {POS: PRON, "Case": "Ins", "Number": "Sing", "Person": "one"}, - "PRON__Case=Ins|Number=Sing|Person=2": {POS: PRON, "Case": "Ins", "Number": "Sing", "Person": "two"}, - "PRON__Case=Loc": {POS: PRON, "Case": "Loc"}, - "PRON__Case=Loc|Gender=Fem|Number=Sing|Person=3": {POS: PRON, "Case": "Loc", "Gender": "Fem", "Number": "Sing", "Person": "three"}, - "PRON__Case=Loc|Gender=Masc|Number=Sing|Person=3": {POS: PRON, "Case": "Loc", "Gender": "Masc", "Number": "Sing", "Person": "three"}, - "PRON__Case=Loc|Gender=Neut|Number=Sing|Person=3": {POS: PRON, "Case": "Loc", "Gender": "Neut", "Number": "Sing", "Person": "three"}, - "PRON__Case=Loc|Number=Plur|Person=1": {POS: PRON, "Case": "Loc", "Number": "Plur", "Person": "one"}, - "PRON__Case=Loc|Number=Plur|Person=2": {POS: PRON, "Case": "Loc", "Number": "Plur", "Person": "two"}, - "PRON__Case=Loc|Number=Plur|Person=3": {POS: PRON, "Case": "Loc", "Number": "Plur", "Person": "three"}, - "PRON__Case=Loc|Number=Sing|Person=1": {POS: PRON, "Case": "Loc", "Number": "Sing", "Person": "one"}, - "PRON__Case=Loc|Number=Sing|Person=2": {POS: PRON, "Case": "Loc", "Number": "Sing", "Person": "two"}, - "PRON__Case=Nom": {POS: PRON, "Case": "Nom"}, - "PRON__Case=Nom|Gender=Fem|Number=Sing|Person=3": {POS: PRON, "Case": "Nom", "Gender": "Fem", "Number": "Sing", "Person": "three"}, - "PRON__Case=Nom|Gender=Masc|Number=Sing|Person=3": {POS: PRON, "Case": "Nom", "Gender": "Masc", "Number": "Sing", "Person": "three"}, - "PRON__Case=Nom|Gender=Neut|Number=Sing|Person=3": {POS: PRON, "Case": "Nom", "Gender": "Neut", "Number": "Sing", "Person": "three"}, - "PRON__Case=Nom|Number=Plur|Person=1": {POS: PRON, "Case": "Nom", "Number": "Plur", "Person": "one"}, - "PRON__Case=Nom|Number=Plur|Person=2": {POS: PRON, "Case": "Nom", "Number": "Plur", "Person": "two"}, - "PRON__Case=Nom|Number=Plur|Person=3": {POS: PRON, "Case": "Nom", "Number": "Plur", "Person": "three"}, - "PRON__Case=Nom|Number=Sing|Person=1": {POS: PRON, "Case": "Nom", "Number": "Sing", "Person": "one"}, - "PRON__Case=Nom|Number=Sing|Person=2": {POS: PRON, "Case": "Nom", "Number": "Sing", "Person": "two"}, - "PRON__Number=Sing|Person=1": {POS: PRON, "Number": "Sing", "Person": "one"}, - "PRON___": {POS: PRON}, - "PRON": {POS: PRON}, - "PROPN__Animacy=Anim|Case=Acc|Gender=Fem|Number=Plur": {POS: PROPN, "Animacy": "Anim", "Case": "Acc", "Gender": "Fem", "Number": "Plur"}, - "PROPN__Animacy=Anim|Case=Acc|Gender=Fem|Number=Sing": {POS: PROPN, "Animacy": "Anim", "Case": "Acc", "Gender": "Fem", "Number": "Sing"}, - "PROPN__Animacy=Anim|Case=Acc|Gender=Masc|Number=Plur": {POS: PROPN, "Animacy": "Anim", "Case": "Acc", "Gender": "Masc", "Number": "Plur"}, - "PROPN__Animacy=Anim|Case=Acc|Gender=Masc|Number=Sing": {POS: PROPN, "Animacy": "Anim", "Case": "Acc", "Gender": "Masc", "Number": "Sing"}, - "PROPN__Animacy=Anim|Case=Acc|Gender=Neut|Number=Plur": {POS: PROPN, "Animacy": "Anim", "Case": "Acc", "Gender": "Neut", "Number": "Plur"}, - "PROPN__Animacy=Anim|Case=Dat|Gender=Fem|Number=Plur": {POS: PROPN, "Animacy": "Anim", "Case": "Dat", "Gender": "Fem", "Number": "Plur"}, - "PROPN__Animacy=Anim|Case=Dat|Gender=Fem|Number=Sing": {POS: PROPN, "Animacy": "Anim", "Case": "Dat", "Gender": "Fem", "Number": "Sing"}, - "PROPN__Animacy=Anim|Case=Dat|Gender=Masc|Number=Plur": {POS: PROPN, "Animacy": "Anim", "Case": "Dat", "Gender": 
"Masc", "Number": "Plur"}, - "PROPN__Animacy=Anim|Case=Dat|Gender=Masc|Number=Sing": {POS: PROPN, "Animacy": "Anim", "Case": "Dat", "Gender": "Masc", "Number": "Sing"}, - "PROPN__Animacy=Anim|Case=Dat|Gender=Neut|Number=Plur": {POS: PROPN, "Animacy": "Anim", "Case": "Dat", "Gender": "Neut", "Number": "Plur"}, - "PROPN__Animacy=Anim|Case=Gen|Foreign=Yes|Gender=Masc|Number=Sing": {POS: PROPN, "Animacy": "Anim", "Case": "Gen", "Foreign": "Yes", "Gender": "Masc", "Number": "Sing"}, - "PROPN__Animacy=Anim|Case=Gen|Gender=Fem|Number=Plur": {POS: PROPN, "Animacy": "Anim", "Case": "Gen", "Gender": "Fem", "Number": "Plur"}, - "PROPN__Animacy=Anim|Case=Gen|Gender=Fem|Number=Sing": {POS: PROPN, "Animacy": "Anim", "Case": "Gen", "Gender": "Fem", "Number": "Sing"}, - "PROPN__Animacy=Anim|Case=Gen|Gender=Masc|Number=Plur": {POS: PROPN, "Animacy": "Anim", "Case": "Gen", "Gender": "Masc", "Number": "Plur"}, - "PROPN__Animacy=Anim|Case=Gen|Gender=Masc|Number=Sing": {POS: PROPN, "Animacy": "Anim", "Case": "Gen", "Gender": "Masc", "Number": "Sing"}, - "PROPN__Animacy=Anim|Case=Ins|Gender=Fem|Number=Sing": {POS: PROPN, "Animacy": "Anim", "Case": "Ins", "Gender": "Fem", "Number": "Sing"}, - "PROPN__Animacy=Anim|Case=Ins|Gender=Masc|Number=Plur": {POS: PROPN, "Animacy": "Anim", "Case": "Ins", "Gender": "Masc", "Number": "Plur"}, - "PROPN__Animacy=Anim|Case=Ins|Gender=Masc|Number=Sing": {POS: PROPN, "Animacy": "Anim", "Case": "Ins", "Gender": "Masc", "Number": "Sing"}, - "PROPN__Animacy=Anim|Case=Ins|Gender=Neut|Number=Sing": {POS: PROPN, "Animacy": "Anim", "Case": "Ins", "Gender": "Neut", "Number": "Sing"}, - "PROPN__Animacy=Anim|Case=Loc|Gender=Fem|Number=Sing": {POS: PROPN, "Animacy": "Anim", "Case": "Loc", "Gender": "Fem", "Number": "Sing"}, - "PROPN__Animacy=Anim|Case=Loc|Gender=Masc|Number=Plur": {POS: PROPN, "Animacy": "Anim", "Case": "Loc", "Gender": "Masc", "Number": "Plur"}, - "PROPN__Animacy=Anim|Case=Loc|Gender=Masc|Number=Sing": {POS: PROPN, "Animacy": "Anim", "Case": "Loc", "Gender": "Masc", "Number": "Sing"}, - "PROPN__Animacy=Anim|Case=Nom|Foreign=Yes|Gender=Masc|Number=Sing": {POS: PROPN, "Animacy": "Anim", "Case": "Nom", "Foreign": "Yes", "Gender": "Masc", "Number": "Sing"}, - "PROPN__Animacy=Anim|Case=Nom|Gender=Fem|Number=Plur": {POS: PROPN, "Animacy": "Anim", "Case": "Nom", "Gender": "Fem", "Number": "Plur"}, - "PROPN__Animacy=Anim|Case=Nom|Gender=Fem|Number=Sing": {POS: PROPN, "Animacy": "Anim", "Case": "Nom", "Gender": "Fem", "Number": "Sing"}, - "PROPN__Animacy=Anim|Case=Nom|Gender=Masc|Number=Plur": {POS: PROPN, "Animacy": "Anim", "Case": "Nom", "Gender": "Masc", "Number": "Plur"}, - "PROPN__Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing": {POS: PROPN, "Animacy": "Anim", "Case": "Nom", "Gender": "Masc", "Number": "Sing"}, - "PROPN__Animacy=Anim|Case=Nom|Gender=Neut|Number=Plur": {POS: PROPN, "Animacy": "Anim", "Case": "Nom", "Gender": "Neut", "Number": "Plur"}, - "PROPN__Animacy=Anim|Case=Voc|Gender=Masc|Number=Sing": {POS: PROPN, "Animacy": "Anim", "Case": "Voc", "Gender": "Masc", "Number": "Sing"}, - "PROPN__Animacy=Anim|Gender=Masc|Number=Sing": {POS: PROPN, "Animacy": "Anim", "Gender": "Masc", "Number": "Sing"}, - "PROPN__Animacy=Inan|Case=Acc|Gender=Fem|Number=Plur": {POS: PROPN, "Animacy": "Inan", "Case": "Acc", "Gender": "Fem", "Number": "Plur"}, - "PROPN__Animacy=Inan|Case=Acc|Gender=Fem|Number=Sing": {POS: PROPN, "Animacy": "Inan", "Case": "Acc", "Gender": "Fem", "Number": "Sing"}, - "PROPN__Animacy=Inan|Case=Acc|Gender=Masc|Number=Plur": {POS: PROPN, "Animacy": "Inan", "Case": 
"Acc", "Gender": "Masc", "Number": "Plur"}, - "PROPN__Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing": {POS: PROPN, "Animacy": "Inan", "Case": "Acc", "Gender": "Masc", "Number": "Sing"}, - "PROPN__Animacy=Inan|Case=Acc|Gender=Neut|Number=Plur": {POS: PROPN, "Animacy": "Inan", "Case": "Acc", "Gender": "Neut", "Number": "Plur"}, - "PROPN__Animacy=Inan|Case=Acc|Gender=Neut|Number=Sing": {POS: PROPN, "Animacy": "Inan", "Case": "Acc", "Gender": "Neut", "Number": "Sing"}, - "PROPN__Animacy=Inan|Case=Acc|Number=Plur": {POS: PROPN, "Animacy": "Inan", "Case": "Acc", "Number": "Plur"}, - "PROPN__Animacy=Inan|Case=Dat|Gender=Fem|Number=Plur": {POS: PROPN, "Animacy": "Inan", "Case": "Dat", "Gender": "Fem", "Number": "Plur"}, - "PROPN__Animacy=Inan|Case=Dat|Gender=Fem|Number=Sing": {POS: PROPN, "Animacy": "Inan", "Case": "Dat", "Gender": "Fem", "Number": "Sing"}, - "PROPN__Animacy=Inan|Case=Dat|Gender=Masc|Number=Plur": {POS: PROPN, "Animacy": "Inan", "Case": "Dat", "Gender": "Masc", "Number": "Plur"}, - "PROPN__Animacy=Inan|Case=Dat|Gender=Masc|Number=Sing": {POS: PROPN, "Animacy": "Inan", "Case": "Dat", "Gender": "Masc", "Number": "Sing"}, - "PROPN__Animacy=Inan|Case=Dat|Gender=Neut|Number=Plur": {POS: PROPN, "Animacy": "Inan", "Case": "Dat", "Gender": "Neut", "Number": "Plur"}, - "PROPN__Animacy=Inan|Case=Dat|Gender=Neut|Number=Sing": {POS: PROPN, "Animacy": "Inan", "Case": "Dat", "Gender": "Neut", "Number": "Sing"}, - "PROPN__Animacy=Inan|Case=Dat|Number=Plur": {POS: PROPN, "Animacy": "Inan", "Case": "Dat", "Number": "Plur"}, - "PROPN__Animacy=Inan|Case=Gen|Foreign=Yes|Gender=Fem|Number=Sing": {POS: PROPN, "Animacy": "Inan", "Case": "Gen", "Foreign": "Yes", "Gender": "Fem", "Number": "Sing"}, - "PROPN__Animacy=Inan|Case=Gen|Gender=Fem|Number=Plur": {POS: PROPN, "Animacy": "Inan", "Case": "Gen", "Gender": "Fem", "Number": "Plur"}, - "PROPN__Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing": {POS: PROPN, "Animacy": "Inan", "Case": "Gen", "Gender": "Fem", "Number": "Sing"}, - "PROPN__Animacy=Inan|Case=Gen|Gender=Masc|Number=Plur": {POS: PROPN, "Animacy": "Inan", "Case": "Gen", "Gender": "Masc", "Number": "Plur"}, - "PROPN__Animacy=Inan|Case=Gen|Gender=Masc|Number=Sing": {POS: PROPN, "Animacy": "Inan", "Case": "Gen", "Gender": "Masc", "Number": "Sing"}, - "PROPN__Animacy=Inan|Case=Gen|Gender=Neut|Number=Plur": {POS: PROPN, "Animacy": "Inan", "Case": "Gen", "Gender": "Neut", "Number": "Plur"}, - "PROPN__Animacy=Inan|Case=Gen|Gender=Neut|Number=Sing": {POS: PROPN, "Animacy": "Inan", "Case": "Gen", "Gender": "Neut", "Number": "Sing"}, - "PROPN__Animacy=Inan|Case=Gen|Number=Plur": {POS: PROPN, "Animacy": "Inan", "Case": "Gen", "Number": "Plur"}, - "PROPN__Animacy=Inan|Case=Ins|Gender=Fem|Number=Plur": {POS: PROPN, "Animacy": "Inan", "Case": "Ins", "Gender": "Fem", "Number": "Plur"}, - "PROPN__Animacy=Inan|Case=Ins|Gender=Fem|Number=Sing": {POS: PROPN, "Animacy": "Inan", "Case": "Ins", "Gender": "Fem", "Number": "Sing"}, - "PROPN__Animacy=Inan|Case=Ins|Gender=Masc|Number=Plur": {POS: PROPN, "Animacy": "Inan", "Case": "Ins", "Gender": "Masc", "Number": "Plur"}, - "PROPN__Animacy=Inan|Case=Ins|Gender=Masc|Number=Sing": {POS: PROPN, "Animacy": "Inan", "Case": "Ins", "Gender": "Masc", "Number": "Sing"}, - "PROPN__Animacy=Inan|Case=Ins|Gender=Neut|Number=Plur": {POS: PROPN, "Animacy": "Inan", "Case": "Ins", "Gender": "Neut", "Number": "Plur"}, - "PROPN__Animacy=Inan|Case=Ins|Gender=Neut|Number=Sing": {POS: PROPN, "Animacy": "Inan", "Case": "Ins", "Gender": "Neut", "Number": "Sing"}, - 
"PROPN__Animacy=Inan|Case=Ins|Number=Plur": {POS: PROPN, "Animacy": "Inan", "Case": "Ins", "Number": "Plur"}, - "PROPN__Animacy=Inan|Case=Loc|Gender=Fem|Number=Plur": {POS: PROPN, "Animacy": "Inan", "Case": "Loc", "Gender": "Fem", "Number": "Plur"}, - "PROPN__Animacy=Inan|Case=Loc|Gender=Fem|Number=Sing": {POS: PROPN, "Animacy": "Inan", "Case": "Loc", "Gender": "Fem", "Number": "Sing"}, - "PROPN__Animacy=Inan|Case=Loc|Gender=Masc|Number=Plur": {POS: PROPN, "Animacy": "Inan", "Case": "Loc", "Gender": "Masc", "Number": "Plur"}, - "PROPN__Animacy=Inan|Case=Loc|Gender=Masc|Number=Sing": {POS: PROPN, "Animacy": "Inan", "Case": "Loc", "Gender": "Masc", "Number": "Sing"}, - "PROPN__Animacy=Inan|Case=Loc|Gender=Neut|Number=Plur": {POS: PROPN, "Animacy": "Inan", "Case": "Loc", "Gender": "Neut", "Number": "Plur"}, - "PROPN__Animacy=Inan|Case=Loc|Gender=Neut|Number=Sing": {POS: PROPN, "Animacy": "Inan", "Case": "Loc", "Gender": "Neut", "Number": "Sing"}, - "PROPN__Animacy=Inan|Case=Loc|Number=Plur": {POS: PROPN, "Animacy": "Inan", "Case": "Loc", "Number": "Plur"}, - "PROPN__Animacy=Inan|Case=Nom|Foreign=Yes|Gender=Fem|Number=Sing": {POS: PROPN, "Animacy": "Inan", "Case": "Nom", "Foreign": "Yes", "Gender": "Fem", "Number": "Sing"}, - "PROPN__Animacy=Inan|Case=Nom|Foreign=Yes|Gender=Masc|Number=Sing": {POS: PROPN, "Animacy": "Inan", "Case": "Nom", "Foreign": "Yes", "Gender": "Masc", "Number": "Sing"}, - "PROPN__Animacy=Inan|Case=Nom|Foreign=Yes|Gender=Neut|Number=Sing": {POS: PROPN, "Animacy": "Inan", "Case": "Nom", "Foreign": "Yes", "Gender": "Neut", "Number": "Sing"}, - "PROPN__Animacy=Inan|Case=Nom|Gender=Fem|Number=Plur": {POS: PROPN, "Animacy": "Inan", "Case": "Nom", "Gender": "Fem", "Number": "Plur"}, - "PROPN__Animacy=Inan|Case=Nom|Gender=Fem|Number=Sing": {POS: PROPN, "Animacy": "Inan", "Case": "Nom", "Gender": "Fem", "Number": "Sing"}, - "PROPN__Animacy=Inan|Case=Nom|Gender=Masc|Number=Plur": {POS: PROPN, "Animacy": "Inan", "Case": "Nom", "Gender": "Masc", "Number": "Plur"}, - "PROPN__Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing": {POS: PROPN, "Animacy": "Inan", "Case": "Nom", "Gender": "Masc", "Number": "Sing"}, - "PROPN__Animacy=Inan|Case=Nom|Gender=Neut|Number=Plur": {POS: PROPN, "Animacy": "Inan", "Case": "Nom", "Gender": "Neut", "Number": "Plur"}, - "PROPN__Animacy=Inan|Case=Nom|Gender=Neut|Number=Sing": {POS: PROPN, "Animacy": "Inan", "Case": "Nom", "Gender": "Neut", "Number": "Sing"}, - "PROPN__Animacy=Inan|Case=Nom|Number=Plur": {POS: PROPN, "Animacy": "Inan", "Case": "Nom", "Number": "Plur"}, - "PROPN__Animacy=Inan|Case=Par|Gender=Masc|Number=Sing": {POS: PROPN, "Animacy": "Inan", "Case": "Par", "Gender": "Masc", "Number": "Sing"}, - "PROPN__Animacy=Inan|Gender=Fem": {POS: PROPN, "Animacy": "Inan", "Gender": "Fem"}, - "PROPN__Animacy=Inan|Gender=Masc": {POS: PROPN, "Animacy": "Inan", "Gender": "Masc"}, - "PROPN__Animacy=Inan|Gender=Masc|Number=Plur": {POS: PROPN, "Animacy": "Inan", "Gender": "Masc", "Number": "Plur"}, - "PROPN__Animacy=Inan|Gender=Masc|Number=Sing": {POS: PROPN, "Animacy": "Inan", "Gender": "Masc", "Number": "Sing"}, - "PROPN__Animacy=Inan|Gender=Neut|Number=Sing": {POS: PROPN, "Animacy": "Inan", "Gender": "Neut", "Number": "Sing"}, - "PROPN__Case=Acc|Degree=Pos|Gender=Fem|Number=Sing": {POS: PROPN, "Case": "Acc", "Degree": "Pos", "Gender": "Fem", "Number": "Sing"}, - "PROPN__Case=Dat|Degree=Pos|Gender=Masc|Number=Sing": {POS: PROPN, "Case": "Dat", "Degree": "Pos", "Gender": "Masc", "Number": "Sing"}, - "PROPN__Case=Ins|Degree=Pos|Gender=Fem|Number=Sing": {POS: 
PROPN, "Case": "Ins", "Degree": "Pos", "Gender": "Fem", "Number": "Sing"}, - "PROPN__Case=Ins|Degree=Pos|Number=Plur": {POS: PROPN, "Case": "Ins", "Degree": "Pos", "Number": "Plur"}, - "PROPN__Case=Nom|Degree=Pos|Gender=Fem|Number=Sing": {POS: PROPN, "Case": "Nom", "Degree": "Pos", "Gender": "Fem", "Number": "Sing"}, - "PROPN__Case=Nom|Degree=Pos|Gender=Masc|Number=Sing": {POS: PROPN, "Case": "Nom", "Degree": "Pos", "Gender": "Masc", "Number": "Sing"}, - "PROPN__Case=Nom|Degree=Pos|Gender=Neut|Number=Sing": {POS: PROPN, "Case": "Nom", "Degree": "Pos", "Gender": "Neut", "Number": "Sing"}, - "PROPN__Case=Nom|Degree=Pos|Number=Plur": {POS: PROPN, "Case": "Nom", "Degree": "Pos", "Number": "Plur"}, - "PROPN__Degree=Pos|Gender=Neut|Number=Sing|Variant=Short": {POS: PROPN, "Degree": "Pos", "Gender": "Neut", "Number": "Sing", }, - "PROPN__Degree=Pos|Number=Plur|Variant=Short": {POS: PROPN, "Degree": "Pos", "Number": "Plur", }, - "PROPN__Foreign=Yes": {POS: PROPN, "Foreign": "Yes"}, - "PROPN__Number=Sing": {POS: PROPN, "Number": "Sing"}, - "PROPN___": {POS: PROPN}, - "PROPN": {POS: PROPN}, - "PUNCT___": {POS: PUNCT}, - "PUNCT": {POS: PUNCT}, - "SCONJ__Mood=Cnd": {POS: SCONJ, "Mood": "Cnd"}, - "SCONJ___": {POS: SCONJ}, - "SCONJ": {POS: SCONJ}, - "SYM___": {POS: SYM}, - "SYM": {POS: SYM}, - "VERB__Animacy=Anim|Aspect=Imp|Case=Acc|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Animacy": "Anim", "Aspect": "Imp", "Case": "Acc", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Animacy=Anim|Aspect=Imp|Case=Acc|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Animacy": "Anim", "Aspect": "Imp", "Case": "Acc", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Animacy=Anim|Aspect=Imp|Case=Acc|Gender=Masc|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Act": {POS: VERB, "Animacy": "Anim", "Aspect": "Imp", "Case": "Acc", "Gender": "Masc", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Animacy=Anim|Aspect=Imp|Case=Acc|Gender=Masc|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Mid": {POS: VERB, "Animacy": "Anim", "Aspect": "Imp", "Case": "Acc", "Gender": "Masc", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Animacy=Anim|Aspect=Imp|Case=Acc|Gender=Masc|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Pass": {POS: VERB, "Animacy": "Anim", "Aspect": "Imp", "Case": "Acc", "Gender": "Masc", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Animacy=Anim|Aspect=Imp|Case=Acc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Animacy": "Anim", "Aspect": "Imp", "Case": "Acc", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Animacy=Anim|Aspect=Imp|Case=Acc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Animacy": "Anim", "Aspect": "Imp", "Case": "Acc", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Animacy=Anim|Aspect=Imp|Case=Acc|Number=Plur|Tense=Pres|VerbForm=Part|Voice=Act": {POS: VERB, "Animacy": "Anim", "Aspect": "Imp", "Case": "Acc", "Number": "Plur", "Tense": "Pres", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Animacy=Anim|Aspect=Imp|Case=Acc|Number=Plur|Tense=Pres|VerbForm=Part|Voice=Mid": {POS: VERB, "Animacy": "Anim", "Aspect": "Imp", "Case": "Acc", "Number": "Plur", "Tense": "Pres", "VerbForm": "Part", "Voice": "Mid"}, - 
"VERB__Animacy=Anim|Aspect=Imp|Case=Acc|Number=Plur|Tense=Pres|VerbForm=Part|Voice=Pass": {POS: VERB, "Animacy": "Anim", "Aspect": "Imp", "Case": "Acc", "Number": "Plur", "Tense": "Pres", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Animacy=Anim|Aspect=Perf|Case=Acc|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Animacy": "Anim", "Aspect": "Perf", "Case": "Acc", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Animacy=Anim|Aspect=Perf|Case=Acc|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Animacy": "Anim", "Aspect": "Perf", "Case": "Acc", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Animacy=Anim|Aspect=Perf|Case=Acc|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Animacy": "Anim", "Aspect": "Perf", "Case": "Acc", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Animacy=Anim|Aspect=Perf|Case=Acc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Animacy": "Anim", "Aspect": "Perf", "Case": "Acc", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Animacy=Anim|Aspect=Perf|Case=Acc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Animacy": "Anim", "Aspect": "Perf", "Case": "Acc", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Animacy=Anim|Aspect=Perf|Case=Acc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Animacy": "Anim", "Aspect": "Perf", "Case": "Acc", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Animacy=Inan|Aspect=Imp|Case=Acc|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Animacy": "Inan", "Aspect": "Imp", "Case": "Acc", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Animacy=Inan|Aspect=Imp|Case=Acc|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Animacy": "Inan", "Aspect": "Imp", "Case": "Acc", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Animacy=Inan|Aspect=Imp|Case=Acc|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Animacy": "Inan", "Aspect": "Imp", "Case": "Acc", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Animacy=Inan|Aspect=Imp|Case=Acc|Gender=Masc|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Act": {POS: VERB, "Animacy": "Inan", "Aspect": "Imp", "Case": "Acc", "Gender": "Masc", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Animacy=Inan|Aspect=Imp|Case=Acc|Gender=Masc|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Mid": {POS: VERB, "Animacy": "Inan", "Aspect": "Imp", "Case": "Acc", "Gender": "Masc", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Animacy=Inan|Aspect=Imp|Case=Acc|Gender=Masc|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Pass": {POS: VERB, "Animacy": "Inan", "Aspect": "Imp", "Case": "Acc", "Gender": "Masc", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Animacy=Inan|Aspect=Imp|Case=Acc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Animacy": "Inan", "Aspect": "Imp", "Case": "Acc", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Animacy=Inan|Aspect=Imp|Case=Acc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Mid": {POS: 
VERB, "Animacy": "Inan", "Aspect": "Imp", "Case": "Acc", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Animacy=Inan|Aspect=Imp|Case=Acc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Animacy": "Inan", "Aspect": "Imp", "Case": "Acc", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Animacy=Inan|Aspect=Imp|Case=Acc|Number=Plur|Tense=Pres|VerbForm=Part|Voice=Act": {POS: VERB, "Animacy": "Inan", "Aspect": "Imp", "Case": "Acc", "Number": "Plur", "Tense": "Pres", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Animacy=Inan|Aspect=Imp|Case=Acc|Number=Plur|Tense=Pres|VerbForm=Part|Voice=Mid": {POS: VERB, "Animacy": "Inan", "Aspect": "Imp", "Case": "Acc", "Number": "Plur", "Tense": "Pres", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Animacy=Inan|Aspect=Imp|Case=Acc|Number=Plur|Tense=Pres|VerbForm=Part|Voice=Pass": {POS: VERB, "Animacy": "Inan", "Aspect": "Imp", "Case": "Acc", "Number": "Plur", "Tense": "Pres", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Animacy=Inan|Aspect=Perf|Case=Acc|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Animacy": "Inan", "Aspect": "Perf", "Case": "Acc", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Animacy=Inan|Aspect=Perf|Case=Acc|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Animacy": "Inan", "Aspect": "Perf", "Case": "Acc", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Animacy=Inan|Aspect=Perf|Case=Acc|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Animacy": "Inan", "Aspect": "Perf", "Case": "Acc", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Animacy=Inan|Aspect=Perf|Case=Acc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Animacy": "Inan", "Aspect": "Perf", "Case": "Acc", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Animacy=Inan|Aspect=Perf|Case=Acc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Animacy": "Inan", "Aspect": "Perf", "Case": "Acc", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Animacy=Inan|Aspect=Perf|Case=Acc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Animacy": "Inan", "Aspect": "Perf", "Case": "Acc", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Acc|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Acc", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Acc|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Acc", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Acc|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Acc", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Acc|Gender=Fem|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Acc", "Gender": "Fem", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Acc|Gender=Fem|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Acc", "Gender": "Fem", "Number": 
"Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Acc|Gender=Fem|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Acc", "Gender": "Fem", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Acc|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Acc", "Gender": "Neut", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Acc|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Acc", "Gender": "Neut", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Acc|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Acc", "Gender": "Neut", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Acc|Gender=Neut|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Acc", "Gender": "Neut", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Acc|Gender=Neut|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Acc", "Gender": "Neut", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Acc|Gender=Neut|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Acc", "Gender": "Neut", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Dat|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Dat", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Dat|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Dat", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Dat|Gender=Fem|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Dat", "Gender": "Fem", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Dat|Gender=Fem|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Dat", "Gender": "Fem", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Dat|Gender=Fem|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Dat", "Gender": "Fem", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Dat|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Dat", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Dat|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Dat", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Dat|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Dat", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Dat|Gender=Masc|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Dat", 
"Gender": "Masc", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Dat|Gender=Masc|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Dat", "Gender": "Masc", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Dat|Gender=Masc|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Dat", "Gender": "Masc", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Dat|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Dat", "Gender": "Neut", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Dat|Gender=Neut|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Dat", "Gender": "Neut", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Dat|Gender=Neut|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Dat", "Gender": "Neut", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Dat|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Dat", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Dat|Number=Plur|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Dat", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Dat|Number=Plur|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Dat", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Dat|Number=Plur|Tense=Pres|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Dat", "Number": "Plur", "Tense": "Pres", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Dat|Number=Plur|Tense=Pres|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Dat", "Number": "Plur", "Tense": "Pres", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Dat|Number=Plur|Tense=Pres|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Dat", "Number": "Plur", "Tense": "Pres", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Gen|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Gen", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Gen|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Gen", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Gen|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Gen", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Gen|Gender=Fem|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Gen", "Gender": "Fem", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Gen|Gender=Fem|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Gen", "Gender": "Fem", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Mid"}, - 
"VERB__Aspect=Imp|Case=Gen|Gender=Fem|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Gen", "Gender": "Fem", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Gen|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Gen", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Gen|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Gen", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Gen|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Gen", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Gen|Gender=Masc|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Gen", "Gender": "Masc", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Gen|Gender=Masc|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Gen", "Gender": "Masc", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Gen|Gender=Masc|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Gen", "Gender": "Masc", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Gen|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Gen", "Gender": "Neut", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Gen|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Gen", "Gender": "Neut", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Gen|Gender=Neut|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Gen", "Gender": "Neut", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Gen|Gender=Neut|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Gen", "Gender": "Neut", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Gen|Gender=Neut|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Gen", "Gender": "Neut", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Gen|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Gen", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Gen|Number=Plur|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Gen", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Gen|Number=Plur|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Gen", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Gen|Number=Plur|Tense=Pres|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Gen", "Number": "Plur", "Tense": "Pres", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Gen|Number=Plur|Tense=Pres|VerbForm=Part|Voice=Mid": {POS: 
VERB, "Aspect": "Imp", "Case": "Gen", "Number": "Plur", "Tense": "Pres", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Gen|Number=Plur|Tense=Pres|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Gen", "Number": "Plur", "Tense": "Pres", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Ins|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Ins", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Ins|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Ins", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Ins|Gender=Fem|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Ins", "Gender": "Fem", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Ins|Gender=Fem|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Ins", "Gender": "Fem", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Ins|Gender=Fem|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Ins", "Gender": "Fem", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Ins|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Ins", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Ins|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Ins", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Ins|Gender=Masc|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Ins", "Gender": "Masc", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Ins|Gender=Masc|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Ins", "Gender": "Masc", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Ins|Gender=Masc|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Ins", "Gender": "Masc", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Ins|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Ins", "Gender": "Neut", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Ins|Gender=Neut|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Ins", "Gender": "Neut", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Ins|Gender=Neut|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Ins", "Gender": "Neut", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Ins|Gender=Neut|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Ins", "Gender": "Neut", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Ins|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": 
"Ins", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Ins|Number=Plur|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Ins", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Ins|Number=Plur|Tense=Pres|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Ins", "Number": "Plur", "Tense": "Pres", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Ins|Number=Plur|Tense=Pres|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Ins", "Number": "Plur", "Tense": "Pres", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Ins|Number=Plur|Tense=Pres|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Ins", "Number": "Plur", "Tense": "Pres", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Loc|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Loc", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Loc|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Loc", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Loc|Gender=Fem|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Loc", "Gender": "Fem", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Loc|Gender=Fem|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Loc", "Gender": "Fem", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Loc|Gender=Fem|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Loc", "Gender": "Fem", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Loc|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Loc", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Loc|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Loc", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Loc|Gender=Masc|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Loc", "Gender": "Masc", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Loc|Gender=Masc|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Loc", "Gender": "Masc", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Loc|Gender=Masc|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Loc", "Gender": "Masc", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Loc|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Loc", "Gender": "Neut", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Loc|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Loc", "Gender": "Neut", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - 
"VERB__Aspect=Imp|Case=Loc|Gender=Neut|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Loc", "Gender": "Neut", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Loc|Gender=Neut|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Loc", "Gender": "Neut", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Loc|Gender=Neut|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Loc", "Gender": "Neut", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Loc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Loc", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Loc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Loc", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Loc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Loc", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Loc|Number=Plur|Tense=Pres|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Loc", "Number": "Plur", "Tense": "Pres", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Loc|Number=Plur|Tense=Pres|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Loc", "Number": "Plur", "Tense": "Pres", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Loc|Number=Plur|Tense=Pres|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Loc", "Number": "Plur", "Tense": "Pres", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Nom|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Nom", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Nom|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Nom", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Nom|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Nom", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Nom|Gender=Fem|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Nom", "Gender": "Fem", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Nom|Gender=Fem|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Nom", "Gender": "Fem", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Nom|Gender=Fem|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Nom", "Gender": "Fem", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Nom|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Nom", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Nom|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Nom", "Gender": "Masc", 
"Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Nom|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Nom", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Nom|Gender=Masc|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Nom", "Gender": "Masc", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Nom|Gender=Masc|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Nom", "Gender": "Masc", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Nom|Gender=Masc|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Nom", "Gender": "Masc", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Nom|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Nom", "Gender": "Neut", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Nom|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Nom", "Gender": "Neut", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Nom|Gender=Neut|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Nom", "Gender": "Neut", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Nom|Gender=Neut|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Nom", "Gender": "Neut", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Nom|Gender=Neut|Number=Sing|Tense=Pres|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Nom", "Gender": "Neut", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Nom|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Nom", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Nom|Number=Plur|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Nom", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Nom|Number=Plur|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Nom", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Case=Nom|Number=Plur|Tense=Pres|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Imp", "Case": "Nom", "Number": "Plur", "Tense": "Pres", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Imp|Case=Nom|Number=Plur|Tense=Pres|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Case": "Nom", "Number": "Plur", "Tense": "Pres", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Imp|Case=Nom|Number=Plur|Tense=Pres|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Case": "Nom", "Number": "Plur", "Tense": "Pres", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Gender=Fem|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act": {POS: VERB, "Aspect": "Imp", "Gender": "Fem", "Mood": "Ind", "Number": "Sing", "Tense": "Past", "VerbForm": "Fin", "Voice": "Act"}, - 
"VERB__Aspect=Imp|Gender=Fem|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Gender": "Fem", "Mood": "Ind", "Number": "Sing", "Tense": "Past", "VerbForm": "Fin", "Voice": "Mid"}, - "VERB__Aspect=Imp|Gender=Fem|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Gender": "Fem", "Mood": "Ind", "Number": "Sing", "Tense": "Past", "VerbForm": "Fin", "Voice": "Pass"}, - "VERB__Aspect=Imp|Gender=Fem|Number=Sing|Tense=Past|Variant=Short|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Gender=Fem|Number=Sing|Tense=Pres|Variant=Short|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Gender": "Fem", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Gender=Masc|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act": {POS: VERB, "Aspect": "Imp", "Gender": "Masc", "Mood": "Ind", "Number": "Sing", "Tense": "Past", "VerbForm": "Fin", "Voice": "Act"}, - "VERB__Aspect=Imp|Gender=Masc|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Gender": "Masc", "Mood": "Ind", "Number": "Sing", "Tense": "Past", "VerbForm": "Fin", "Voice": "Mid"}, - "VERB__Aspect=Imp|Gender=Masc|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Gender": "Masc", "Mood": "Ind", "Number": "Sing", "Tense": "Past", "VerbForm": "Fin", "Voice": "Pass"}, - "VERB__Aspect=Imp|Gender=Masc|Number=Sing|Tense=Past|Variant=Short|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Gender=Masc|Number=Sing|Tense=Pres|Variant=Short|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Gender": "Masc", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Gender=Neut|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act": {POS: VERB, "Aspect": "Imp", "Gender": "Neut", "Mood": "Ind", "Number": "Sing", "Tense": "Past", "VerbForm": "Fin", "Voice": "Act"}, - "VERB__Aspect=Imp|Gender=Neut|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Gender": "Neut", "Mood": "Ind", "Number": "Sing", "Tense": "Past", "VerbForm": "Fin", "Voice": "Mid"}, - "VERB__Aspect=Imp|Gender=Neut|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Gender": "Neut", "Mood": "Ind", "Number": "Sing", "Tense": "Past", "VerbForm": "Fin", "Voice": "Pass"}, - "VERB__Aspect=Imp|Gender=Neut|Number=Sing|Tense=Past|Variant=Short|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Gender": "Neut", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Gender=Neut|Number=Sing|Tense=Pres|Variant=Short|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Gender": "Neut", "Number": "Sing", "Tense": "Pres", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Mood=Imp|Number=Plur|Person=2|VerbForm=Fin|Voice=Act": {POS: VERB, "Aspect": "Imp", "Mood": "Imp", "Number": "Plur", "Person": "two", "VerbForm": "Fin", "Voice": "Act"}, - "VERB__Aspect=Imp|Mood=Imp|Number=Plur|Person=2|VerbForm=Fin|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Mood": "Imp", "Number": "Plur", "Person": "two", "VerbForm": "Fin", "Voice": "Mid"}, - "VERB__Aspect=Imp|Mood=Imp|Number=Sing|Person=2|VerbForm=Fin|Voice=Act": {POS: VERB, "Aspect": "Imp", 
"Mood": "Imp", "Number": "Sing", "Person": "two", "VerbForm": "Fin", "Voice": "Act"}, - "VERB__Aspect=Imp|Mood=Imp|Number=Sing|Person=2|VerbForm=Fin|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Mood": "Imp", "Number": "Sing", "Person": "two", "VerbForm": "Fin", "Voice": "Mid"}, - "VERB__Aspect=Imp|Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin|Voice=Act": {POS: VERB, "Aspect": "Imp", "Mood": "Ind", "Number": "Plur", "Person": "one", "Tense": "Pres", "VerbForm": "Fin", "Voice": "Act"}, - "VERB__Aspect=Imp|Mood=Ind|Number=Plur|Person=1|Tense=Pres|VerbForm=Fin|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Mood": "Ind", "Number": "Plur", "Person": "one", "Tense": "Pres", "VerbForm": "Fin", "Voice": "Mid"}, - "VERB__Aspect=Imp|Mood=Ind|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin|Voice=Act": {POS: VERB, "Aspect": "Imp", "Mood": "Ind", "Number": "Plur", "Person": "two", "Tense": "Pres", "VerbForm": "Fin", "Voice": "Act"}, - "VERB__Aspect=Imp|Mood=Ind|Number=Plur|Person=2|Tense=Pres|VerbForm=Fin|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Mood": "Ind", "Number": "Plur", "Person": "two", "Tense": "Pres", "VerbForm": "Fin", "Voice": "Mid"}, - "VERB__Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act": {POS: VERB, "Aspect": "Imp", "Mood": "Ind", "Number": "Plur", "Person": "three", "Tense": "Pres", "VerbForm": "Fin", "Voice": "Act"}, - "VERB__Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Mood": "Ind", "Number": "Plur", "Person": "three", "Tense": "Pres", "VerbForm": "Fin", "Voice": "Mid"}, - "VERB__Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Mood": "Ind", "Number": "Plur", "Person": "three", "Tense": "Pres", "VerbForm": "Fin", "Voice": "Pass"}, - "VERB__Aspect=Imp|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Act": {POS: VERB, "Aspect": "Imp", "Mood": "Ind", "Number": "Plur", "Tense": "Past", "VerbForm": "Fin", "Voice": "Act"}, - "VERB__Aspect=Imp|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Mood": "Ind", "Number": "Plur", "Tense": "Past", "VerbForm": "Fin", "Voice": "Mid"}, - "VERB__Aspect=Imp|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Mood": "Ind", "Number": "Plur", "Tense": "Past", "VerbForm": "Fin", "Voice": "Pass"}, - "VERB__Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin|Voice=Act": {POS: VERB, "Aspect": "Imp", "Mood": "Ind", "Number": "Sing", "Person": "one", "Tense": "Pres", "VerbForm": "Fin", "Voice": "Act"}, - "VERB__Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Mood": "Ind", "Number": "Sing", "Person": "one", "Tense": "Pres", "VerbForm": "Fin", "Voice": "Mid"}, - "VERB__Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin|Voice=Act": {POS: VERB, "Aspect": "Imp", "Mood": "Ind", "Number": "Sing", "Person": "two", "Tense": "Pres", "VerbForm": "Fin", "Voice": "Act"}, - "VERB__Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Tense=Pres|VerbForm=Fin|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Mood": "Ind", "Number": "Sing", "Person": "two", "Tense": "Pres", "VerbForm": "Fin", "Voice": "Mid"}, - "VERB__Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act": {POS: VERB, "Aspect": "Imp", "Mood": "Ind", "Number": "Sing", "Person": "three", "Tense": "Pres", "VerbForm": "Fin", "Voice": "Act"}, - 
"VERB__Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Mood": "Ind", "Number": "Sing", "Person": "three", "Tense": "Pres", "VerbForm": "Fin", "Voice": "Mid"}, - "VERB__Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Mood": "Ind", "Number": "Sing", "Person": "three", "Tense": "Pres", "VerbForm": "Fin", "Voice": "Pass"}, - "VERB__Aspect=Imp|Number=Plur|Tense=Past|Variant=Short|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Number=Plur|Tense=Pres|Variant=Short|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Number": "Plur", "Tense": "Pres", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Imp|Tense=Past|VerbForm=Conv|Voice=Act": {POS: VERB, "Aspect": "Imp", "Tense": "Past", "VerbForm": "Conv", "Voice": "Act"}, - "VERB__Aspect=Imp|Tense=Pres|VerbForm=Conv|Voice=Act": {POS: VERB, "Aspect": "Imp", "Tense": "Pres", "VerbForm": "Conv", "Voice": "Act"}, - "VERB__Aspect=Imp|Tense=Pres|VerbForm=Conv|Voice=Mid": {POS: VERB, "Aspect": "Imp", "Tense": "Pres", "VerbForm": "Conv", "Voice": "Mid"}, - "VERB__Aspect=Imp|Tense=Pres|VerbForm=Conv|Voice=Pass": {POS: VERB, "Aspect": "Imp", "Tense": "Pres", "VerbForm": "Conv", "Voice": "Pass"}, - "VERB__Aspect=Imp|VerbForm=Inf|Voice=Act": {POS: VERB, "Aspect": "Imp", "VerbForm": "Inf", "Voice": "Act"}, - "VERB__Aspect=Imp|VerbForm=Inf|Voice=Mid": {POS: VERB, "Aspect": "Imp", "VerbForm": "Inf", "Voice": "Mid"}, - "VERB__Aspect=Imp|VerbForm=Inf|Voice=Pass": {POS: VERB, "Aspect": "Imp", "VerbForm": "Inf", "Voice": "Pass"}, - "VERB__Aspect=Perf|Case=Acc|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Perf", "Case": "Acc", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Perf|Case=Acc|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Case": "Acc", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Perf|Case=Acc|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Perf", "Case": "Acc", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Perf|Case=Acc|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Perf", "Case": "Acc", "Gender": "Neut", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Perf|Case=Acc|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Case": "Acc", "Gender": "Neut", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Perf|Case=Acc|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Perf", "Case": "Acc", "Gender": "Neut", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Perf|Case=Dat|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Perf", "Case": "Dat", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Perf|Case=Dat|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Case": "Dat", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - 
"VERB__Aspect=Perf|Case=Dat|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Perf", "Case": "Dat", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Perf|Case=Dat|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Perf", "Case": "Dat", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Perf|Case=Dat|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Case": "Dat", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Perf|Case=Dat|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Perf", "Case": "Dat", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Perf|Case=Dat|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Perf", "Case": "Dat", "Gender": "Neut", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Perf|Case=Dat|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Case": "Dat", "Gender": "Neut", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Perf|Case=Dat|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Perf", "Case": "Dat", "Gender": "Neut", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Perf|Case=Dat|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Perf", "Case": "Dat", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Perf|Case=Dat|Number=Plur|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Case": "Dat", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Perf|Case=Dat|Number=Plur|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Perf", "Case": "Dat", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Perf|Case=Gen|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Perf", "Case": "Gen", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Perf|Case=Gen|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Case": "Gen", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Perf|Case=Gen|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Perf", "Case": "Gen", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Perf|Case=Gen|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Perf", "Case": "Gen", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Perf|Case=Gen|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Case": "Gen", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Perf|Case=Gen|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Perf", "Case": "Gen", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - 
"VERB__Aspect=Perf|Case=Gen|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Perf", "Case": "Gen", "Gender": "Neut", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Perf|Case=Gen|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Case": "Gen", "Gender": "Neut", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Perf|Case=Gen|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Perf", "Case": "Gen", "Gender": "Neut", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Perf|Case=Gen|Number=Plur|Tense=Fut|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Perf", "Case": "Gen", "Number": "Plur", "Tense": "Fut", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Perf|Case=Gen|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Perf", "Case": "Gen", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Perf|Case=Gen|Number=Plur|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Case": "Gen", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Perf|Case=Gen|Number=Plur|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Perf", "Case": "Gen", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Perf|Case=Ins|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Perf", "Case": "Ins", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Perf|Case=Ins|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Case": "Ins", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Perf|Case=Ins|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Perf", "Case": "Ins", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Perf|Case=Ins|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Perf", "Case": "Ins", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Perf|Case=Ins|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Case": "Ins", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Perf|Case=Ins|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Perf", "Case": "Ins", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Perf|Case=Ins|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Perf", "Case": "Ins", "Gender": "Neut", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Perf|Case=Ins|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Case": "Ins", "Gender": "Neut", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Perf|Case=Ins|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Perf", "Case": "Ins", "Gender": "Neut", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - 
"VERB__Aspect=Perf|Case=Ins|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Perf", "Case": "Ins", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Perf|Case=Ins|Number=Plur|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Case": "Ins", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Perf|Case=Ins|Number=Plur|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Perf", "Case": "Ins", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Perf|Case=Loc|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Perf", "Case": "Loc", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Perf|Case=Loc|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Case": "Loc", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Perf|Case=Loc|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Perf", "Case": "Loc", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Perf|Case=Loc|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Perf", "Case": "Loc", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Perf|Case=Loc|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Case": "Loc", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Perf|Case=Loc|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Perf", "Case": "Loc", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Perf|Case=Loc|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Perf", "Case": "Loc", "Gender": "Neut", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Perf|Case=Loc|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Case": "Loc", "Gender": "Neut", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Perf|Case=Loc|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Perf", "Case": "Loc", "Gender": "Neut", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Perf|Case=Loc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Perf", "Case": "Loc", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Perf|Case=Loc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Case": "Loc", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Perf|Case=Loc|Number=Plur|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Perf", "Case": "Loc", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Perf|Case=Nom|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Perf", "Case": "Nom", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Perf|Case=Nom|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": 
"Perf", "Case": "Nom", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Perf|Case=Nom|Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Perf", "Case": "Nom", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Perf|Case=Nom|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Perf", "Case": "Nom", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Perf|Case=Nom|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Case": "Nom", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Perf|Case=Nom|Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Perf", "Case": "Nom", "Gender": "Masc", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Perf|Case=Nom|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Perf", "Case": "Nom", "Gender": "Neut", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Perf|Case=Nom|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Case": "Nom", "Gender": "Neut", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Perf|Case=Nom|Gender=Neut|Number=Sing|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Perf", "Case": "Nom", "Gender": "Neut", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Perf|Case=Nom|Number=Plur|Tense=Past|VerbForm=Part|Voice=Act": {POS: VERB, "Aspect": "Perf", "Case": "Nom", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Act"}, - "VERB__Aspect=Perf|Case=Nom|Number=Plur|Tense=Past|VerbForm=Part|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Case": "Nom", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Mid"}, - "VERB__Aspect=Perf|Case=Nom|Number=Plur|Tense=Past|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Perf", "Case": "Nom", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Perf|Gender=Fem|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act": {POS: VERB, "Aspect": "Perf", "Gender": "Fem", "Mood": "Ind", "Number": "Sing", "Tense": "Past", "VerbForm": "Fin", "Voice": "Act"}, - "VERB__Aspect=Perf|Gender=Fem|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Gender": "Fem", "Mood": "Ind", "Number": "Sing", "Tense": "Past", "VerbForm": "Fin", "Voice": "Mid"}, - "VERB__Aspect=Perf|Gender=Fem|Number=Sing|Tense=Past|Variant=Short|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Perf", "Gender": "Fem", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Perf|Gender=Masc|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act": {POS: VERB, "Aspect": "Perf", "Gender": "Masc", "Mood": "Ind", "Number": "Sing", "Tense": "Past", "VerbForm": "Fin", "Voice": "Act"}, - "VERB__Aspect=Perf|Gender=Masc|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Gender": "Masc", "Mood": "Ind", "Number": "Sing", "Tense": "Past", "VerbForm": "Fin", "Voice": "Mid"}, - "VERB__Aspect=Perf|Gender=Masc|Number=Sing|Tense=Past|Variant=Short|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Perf", "Gender": "Masc", "Number": 
"Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Perf|Gender=Neut|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act": {POS: VERB, "Aspect": "Perf", "Gender": "Neut", "Mood": "Ind", "Number": "Sing", "Tense": "Past", "VerbForm": "Fin", "Voice": "Act"}, - "VERB__Aspect=Perf|Gender=Neut|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Gender": "Neut", "Mood": "Ind", "Number": "Sing", "Tense": "Past", "VerbForm": "Fin", "Voice": "Mid"}, - "VERB__Aspect=Perf|Gender=Neut|Number=Sing|Tense=Past|Variant=Short|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Perf", "Gender": "Neut", "Number": "Sing", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Perf|Mood=Imp|Number=Plur|Person=1|VerbForm=Fin|Voice=Act": {POS: VERB, "Aspect": "Perf", "Mood": "Imp", "Number": "Plur", "Person": "one", "VerbForm": "Fin", "Voice": "Act"}, - "VERB__Aspect=Perf|Mood=Imp|Number=Plur|Person=2|VerbForm=Fin|Voice=Act": {POS: VERB, "Aspect": "Perf", "Mood": "Imp", "Number": "Plur", "Person": "two", "VerbForm": "Fin", "Voice": "Act"}, - "VERB__Aspect=Perf|Mood=Imp|Number=Plur|Person=2|VerbForm=Fin|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Mood": "Imp", "Number": "Plur", "Person": "two", "VerbForm": "Fin", "Voice": "Mid"}, - "VERB__Aspect=Perf|Mood=Imp|Number=Sing|Person=2|VerbForm=Fin|Voice=Act": {POS: VERB, "Aspect": "Perf", "Mood": "Imp", "Number": "Sing", "Person": "two", "VerbForm": "Fin", "Voice": "Act"}, - "VERB__Aspect=Perf|Mood=Imp|Number=Sing|Person=2|VerbForm=Fin|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Mood": "Imp", "Number": "Sing", "Person": "two", "VerbForm": "Fin", "Voice": "Mid"}, - "VERB__Aspect=Perf|Mood=Ind|Number=Plur|Person=1|Tense=Fut|VerbForm=Fin|Voice=Act": {POS: VERB, "Aspect": "Perf", "Mood": "Ind", "Number": "Plur", "Person": "one", "Tense": "Fut", "VerbForm": "Fin", "Voice": "Act"}, - "VERB__Aspect=Perf|Mood=Ind|Number=Plur|Person=1|Tense=Fut|VerbForm=Fin|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Mood": "Ind", "Number": "Plur", "Person": "one", "Tense": "Fut", "VerbForm": "Fin", "Voice": "Mid"}, - "VERB__Aspect=Perf|Mood=Ind|Number=Plur|Person=2|Tense=Fut|VerbForm=Fin|Voice=Act": {POS: VERB, "Aspect": "Perf", "Mood": "Ind", "Number": "Plur", "Person": "two", "Tense": "Fut", "VerbForm": "Fin", "Voice": "Act"}, - "VERB__Aspect=Perf|Mood=Ind|Number=Plur|Person=2|Tense=Fut|VerbForm=Fin|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Mood": "Ind", "Number": "Plur", "Person": "two", "Tense": "Fut", "VerbForm": "Fin", "Voice": "Mid"}, - "VERB__Aspect=Perf|Mood=Ind|Number=Plur|Person=3|Tense=Fut|VerbForm=Fin|Voice=Act": {POS: VERB, "Aspect": "Perf", "Mood": "Ind", "Number": "Plur", "Person": "three", "Tense": "Fut", "VerbForm": "Fin", "Voice": "Act"}, - "VERB__Aspect=Perf|Mood=Ind|Number=Plur|Person=3|Tense=Fut|VerbForm=Fin|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Mood": "Ind", "Number": "Plur", "Person": "three", "Tense": "Fut", "VerbForm": "Fin", "Voice": "Mid"}, - "VERB__Aspect=Perf|Mood=Ind|Number=Plur|Person=3|Tense=Fut|VerbForm=Fin|Voice=Pass": {POS: VERB, "Aspect": "Perf", "Mood": "Ind", "Number": "Plur", "Person": "three", "Tense": "Fut", "VerbForm": "Fin", "Voice": "Pass"}, - "VERB__Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Act": {POS: VERB, "Aspect": "Perf", "Mood": "Ind", "Number": "Plur", "Tense": "Past", "VerbForm": "Fin", "Voice": "Act"}, - "VERB__Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Mood": "Ind", "Number": 
"Plur", "Tense": "Past", "VerbForm": "Fin", "Voice": "Mid"}, - "VERB__Aspect=Perf|Mood=Ind|Number=Sing|Person=1|Tense=Fut|VerbForm=Fin|Voice=Act": {POS: VERB, "Aspect": "Perf", "Mood": "Ind", "Number": "Sing", "Person": "one", "Tense": "Fut", "VerbForm": "Fin", "Voice": "Act"}, - "VERB__Aspect=Perf|Mood=Ind|Number=Sing|Person=1|Tense=Fut|VerbForm=Fin|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Mood": "Ind", "Number": "Sing", "Person": "one", "Tense": "Fut", "VerbForm": "Fin", "Voice": "Mid"}, - "VERB__Aspect=Perf|Mood=Ind|Number=Sing|Person=2|Tense=Fut|VerbForm=Fin|Voice=Act": {POS: VERB, "Aspect": "Perf", "Mood": "Ind", "Number": "Sing", "Person": "two", "Tense": "Fut", "VerbForm": "Fin", "Voice": "Act"}, - "VERB__Aspect=Perf|Mood=Ind|Number=Sing|Person=2|Tense=Fut|VerbForm=Fin|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Mood": "Ind", "Number": "Sing", "Person": "two", "Tense": "Fut", "VerbForm": "Fin", "Voice": "Mid"}, - "VERB__Aspect=Perf|Mood=Ind|Number=Sing|Person=3|Tense=Fut|VerbForm=Fin|Voice=Act": {POS: VERB, "Aspect": "Perf", "Mood": "Ind", "Number": "Sing", "Person": "three", "Tense": "Fut", "VerbForm": "Fin", "Voice": "Act"}, - "VERB__Aspect=Perf|Mood=Ind|Number=Sing|Person=3|Tense=Fut|VerbForm=Fin|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Mood": "Ind", "Number": "Sing", "Person": "three", "Tense": "Fut", "VerbForm": "Fin", "Voice": "Mid"}, - "VERB__Aspect=Perf|Number=Plur|Tense=Past|Variant=Short|VerbForm=Part|Voice=Pass": {POS: VERB, "Aspect": "Perf", "Number": "Plur", "Tense": "Past", "VerbForm": "Part", "Voice": "Pass"}, - "VERB__Aspect=Perf|Tense=Past|VerbForm=Conv|Voice=Act": {POS: VERB, "Aspect": "Perf", "Tense": "Past", "VerbForm": "Conv", "Voice": "Act"}, - "VERB__Aspect=Perf|Tense=Past|VerbForm=Conv|Voice=Mid": {POS: VERB, "Aspect": "Perf", "Tense": "Past", "VerbForm": "Conv", "Voice": "Mid"}, - "VERB__Aspect=Perf|VerbForm=Inf|Voice=Act": {POS: VERB, "Aspect": "Perf", "VerbForm": "Inf", "Voice": "Act"}, - "VERB__Aspect=Perf|VerbForm=Inf|Voice=Mid": {POS: VERB, "Aspect": "Perf", "VerbForm": "Inf", "Voice": "Mid"}, - "VERB__Voice=Act": {POS: VERB, "Voice": "Act"}, - "VERB___": {POS: VERB}, - "VERB": {POS: VERB}, - "X__Foreign=Yes": {POS: X, "Foreign": "Yes"}, - "X___": {POS: X}, - "X": {POS: X}, -} -# fmt: on diff --git a/spacy/lang/ru/tokenizer_exceptions.py b/spacy/lang/ru/tokenizer_exceptions.py index ea7b5b20d..1dc363fae 100644 --- a/spacy/lang/ru/tokenizer_exceptions.py +++ b/spacy/lang/ru/tokenizer_exceptions.py @@ -1,69 +1,66 @@ -# encoding: utf8 -from __future__ import unicode_literals - -from ...symbols import ORTH, LEMMA, NORM +from ..tokenizer_exceptions import BASE_EXCEPTIONS +from ...symbols import ORTH, NORM +from ...util import update_exc _exc = {} _abbrev_exc = [ # Weekdays abbreviations - {ORTH: "пн", LEMMA: "понедельник", NORM: "понедельник"}, - {ORTH: "вт", LEMMA: "вторник", NORM: "вторник"}, - {ORTH: "ср", LEMMA: "среда", NORM: "среда"}, - {ORTH: "чт", LEMMA: "четверг", NORM: "четверг"}, - {ORTH: "чтв", LEMMA: "четверг", NORM: "четверг"}, - {ORTH: "пт", LEMMA: "пятница", NORM: "пятница"}, - {ORTH: "сб", LEMMA: "суббота", NORM: "суббота"}, - {ORTH: "сбт", LEMMA: "суббота", NORM: "суббота"}, - {ORTH: "вс", LEMMA: "воскресенье", NORM: "воскресенье"}, - {ORTH: "вскр", LEMMA: "воскресенье", NORM: "воскресенье"}, - {ORTH: "воскр", LEMMA: "воскресенье", NORM: "воскресенье"}, + {ORTH: "пн", NORM: "понедельник"}, + {ORTH: "вт", NORM: "вторник"}, + {ORTH: "ср", NORM: "среда"}, + {ORTH: "чт", NORM: "четверг"}, + {ORTH: "чтв", NORM: "четверг"}, + {ORTH: "пт", 
NORM: "пятница"}, + {ORTH: "сб", NORM: "суббота"}, + {ORTH: "сбт", NORM: "суббота"}, + {ORTH: "вс", NORM: "воскресенье"}, + {ORTH: "вскр", NORM: "воскресенье"}, + {ORTH: "воскр", NORM: "воскресенье"}, # Months abbreviations - {ORTH: "янв", LEMMA: "январь", NORM: "январь"}, - {ORTH: "фев", LEMMA: "февраль", NORM: "февраль"}, - {ORTH: "февр", LEMMA: "февраль", NORM: "февраль"}, - {ORTH: "мар", LEMMA: "март", NORM: "март"}, - # {ORTH: "март", LEMMA: "март", NORM: "март"}, - {ORTH: "мрт", LEMMA: "март", NORM: "март"}, - {ORTH: "апр", LEMMA: "апрель", NORM: "апрель"}, - # {ORTH: "май", LEMMA: "май", NORM: "май"}, - {ORTH: "июн", LEMMA: "июнь", NORM: "июнь"}, - # {ORTH: "июнь", LEMMA: "июнь", NORM: "июнь"}, - {ORTH: "июл", LEMMA: "июль", NORM: "июль"}, - # {ORTH: "июль", LEMMA: "июль", NORM: "июль"}, - {ORTH: "авг", LEMMA: "август", NORM: "август"}, - {ORTH: "сен", LEMMA: "сентябрь", NORM: "сентябрь"}, - {ORTH: "сент", LEMMA: "сентябрь", NORM: "сентябрь"}, - {ORTH: "окт", LEMMA: "октябрь", NORM: "октябрь"}, - {ORTH: "октб", LEMMA: "октябрь", NORM: "октябрь"}, - {ORTH: "ноя", LEMMA: "ноябрь", NORM: "ноябрь"}, - {ORTH: "нояб", LEMMA: "ноябрь", NORM: "ноябрь"}, - {ORTH: "нбр", LEMMA: "ноябрь", NORM: "ноябрь"}, - {ORTH: "дек", LEMMA: "декабрь", NORM: "декабрь"}, + {ORTH: "янв", NORM: "январь"}, + {ORTH: "фев", NORM: "февраль"}, + {ORTH: "февр", NORM: "февраль"}, + {ORTH: "мар", NORM: "март"}, + # {ORTH: "март", NORM: "март"}, + {ORTH: "мрт", NORM: "март"}, + {ORTH: "апр", NORM: "апрель"}, + # {ORTH: "май", NORM: "май"}, + {ORTH: "июн", NORM: "июнь"}, + # {ORTH: "июнь", NORM: "июнь"}, + {ORTH: "июл", NORM: "июль"}, + # {ORTH: "июль", NORM: "июль"}, + {ORTH: "авг", NORM: "август"}, + {ORTH: "сен", NORM: "сентябрь"}, + {ORTH: "сент", NORM: "сентябрь"}, + {ORTH: "окт", NORM: "октябрь"}, + {ORTH: "октб", NORM: "октябрь"}, + {ORTH: "ноя", NORM: "ноябрь"}, + {ORTH: "нояб", NORM: "ноябрь"}, + {ORTH: "нбр", NORM: "ноябрь"}, + {ORTH: "дек", NORM: "декабрь"}, ] for abbrev_desc in _abbrev_exc: abbrev = abbrev_desc[ORTH] for orth in (abbrev, abbrev.capitalize(), abbrev.upper()): - _exc[orth] = [{ORTH: orth, LEMMA: abbrev_desc[LEMMA], NORM: abbrev_desc[NORM]}] - _exc[orth + "."] = [ - {ORTH: orth + ".", LEMMA: abbrev_desc[LEMMA], NORM: abbrev_desc[NORM]} - ] + _exc[orth] = [{ORTH: orth, NORM: abbrev_desc[NORM]}] + _exc[orth + "."] = [{ORTH: orth + ".", NORM: abbrev_desc[NORM]}] _slang_exc = [ - {ORTH: "2к15", LEMMA: "2015", NORM: "2015"}, - {ORTH: "2к16", LEMMA: "2016", NORM: "2016"}, - {ORTH: "2к17", LEMMA: "2017", NORM: "2017"}, - {ORTH: "2к18", LEMMA: "2018", NORM: "2018"}, - {ORTH: "2к19", LEMMA: "2019", NORM: "2019"}, - {ORTH: "2к20", LEMMA: "2020", NORM: "2020"}, + {ORTH: "2к15", NORM: "2015"}, + {ORTH: "2к16", NORM: "2016"}, + {ORTH: "2к17", NORM: "2017"}, + {ORTH: "2к18", NORM: "2018"}, + {ORTH: "2к19", NORM: "2019"}, + {ORTH: "2к20", NORM: "2020"}, ] for slang_desc in _slang_exc: _exc[slang_desc[ORTH]] = [slang_desc] -TOKENIZER_EXCEPTIONS = _exc +TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc) diff --git a/spacy/lang/sa/__init__.py b/spacy/lang/sa/__init__.py index 8a4533341..345137817 100644 --- a/spacy/lang/sa/__init__.py +++ b/spacy/lang/sa/__init__.py @@ -1,18 +1,10 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS - from ...language import Language -from ...attrs import LANG class SanskritDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - 
lex_attr_getters.update(LEX_ATTRS) - lex_attr_getters[LANG] = lambda text: "sa" - + lex_attr_getters = LEX_ATTRS stop_words = STOP_WORDS diff --git a/spacy/lang/sa/examples.py b/spacy/lang/sa/examples.py index 9d4fa1e49..60243c04c 100644 --- a/spacy/lang/sa/examples.py +++ b/spacy/lang/sa/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/sa/lex_attrs.py b/spacy/lang/sa/lex_attrs.py index c33be2ce4..bdceb7ec2 100644 --- a/spacy/lang/sa/lex_attrs.py +++ b/spacy/lang/sa/lex_attrs.py @@ -1,9 +1,5 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM - # reference 1: https://en.wikibooks.org/wiki/Sanskrit/Numbers _num_words = [ @@ -106,26 +102,26 @@ _num_words = [ "सप्तनवतिः", "अष्टनवतिः", "एकोनशतम्", - "शतम्" + "शतम्", ] def like_num(text): - """ - Check if text resembles a number - """ - if text.startswith(("+", "-", "±", "~")): - text = text[1:] - text = text.replace(",", "").replace(".", "") - if text.isdigit(): - return True - if text.count("/") == 1: - num, denom = text.split("/") - if num.isdigit() and denom.isdigit(): - return True - if text in _num_words: - return True - return False + """ + Check if text resembles a number + """ + if text.startswith(("+", "-", "±", "~")): + text = text[1:] + text = text.replace(",", "").replace(".", "") + if text.isdigit(): + return True + if text.count("/") == 1: + num, denom = text.split("/") + if num.isdigit() and denom.isdigit(): + return True + if text in _num_words: + return True + return False LEX_ATTRS = {LIKE_NUM: like_num} diff --git a/spacy/lang/sa/stop_words.py b/spacy/lang/sa/stop_words.py index aa51ceae0..30302a14d 100644 --- a/spacy/lang/sa/stop_words.py +++ b/spacy/lang/sa/stop_words.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - # Source: https://gist.github.com/Akhilesh28/fe8b8e180f64b72e64751bc31cb6d323 STOP_WORDS = set( diff --git a/spacy/lang/si/__init__.py b/spacy/lang/si/__init__.py index a58a63f03..d77e3bb8b 100644 --- a/spacy/lang/si/__init__.py +++ b/spacy/lang/si/__init__.py @@ -1,17 +1,10 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS - from ...language import Language -from ...attrs import LANG class SinhalaDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters.update(LEX_ATTRS) - lex_attr_getters[LANG] = lambda text: "si" + lex_attr_getters = LEX_ATTRS stop_words = STOP_WORDS diff --git a/spacy/lang/si/examples.py b/spacy/lang/si/examples.py index 842dfdd7e..b34051d00 100644 --- a/spacy/lang/si/examples.py +++ b/spacy/lang/si/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
diff --git a/spacy/lang/si/lex_attrs.py b/spacy/lang/si/lex_attrs.py index 5d5f06187..aa061852d 100644 --- a/spacy/lang/si/lex_attrs.py +++ b/spacy/lang/si/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM _num_words = [ diff --git a/spacy/lang/si/stop_words.py b/spacy/lang/si/stop_words.py index 8bbdec6b7..bde662bf7 100644 --- a/spacy/lang/si/stop_words.py +++ b/spacy/lang/si/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ අතර diff --git a/spacy/lang/sk/__init__.py b/spacy/lang/sk/__init__.py index cb17c0b6d..4003c7340 100644 --- a/spacy/lang/sk/__init__.py +++ b/spacy/lang/sk/__init__.py @@ -1,19 +1,10 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS -from .tag_map import TAG_MAP from .lex_attrs import LEX_ATTRS - from ...language import Language -from ...attrs import LANG class SlovakDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters.update(LEX_ATTRS) - lex_attr_getters[LANG] = lambda text: "sk" - tag_map = TAG_MAP + lex_attr_getters = LEX_ATTRS stop_words = STOP_WORDS diff --git a/spacy/lang/sk/examples.py b/spacy/lang/sk/examples.py index 486ea375e..736109a7c 100644 --- a/spacy/lang/sk/examples.py +++ b/spacy/lang/sk/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/sk/lex_attrs.py b/spacy/lang/sk/lex_attrs.py index 3dea4d8f0..0caf62e8e 100644 --- a/spacy/lang/sk/lex_attrs.py +++ b/spacy/lang/sk/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM _num_words = [ diff --git a/spacy/lang/sk/stop_words.py b/spacy/lang/sk/stop_words.py index 3e78acb10..017e7beef 100644 --- a/spacy/lang/sk/stop_words.py +++ b/spacy/lang/sk/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Source: https://github.com/Ardevop-sk/stopwords-sk STOP_WORDS = set( diff --git a/spacy/lang/sk/tag_map.py b/spacy/lang/sk/tag_map.py deleted file mode 100644 index 28b36d3c1..000000000 --- a/spacy/lang/sk/tag_map.py +++ /dev/null @@ -1,1467 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import POS, AUX, ADJ, CCONJ, NUM, ADV, ADP, X, VERB -from ...symbols import NOUN, PART, INTJ, PRON - -# Source https://universaldependencies.org/tagset-conversion/sk-snk-uposf.html -# fmt: off -TAG_MAP = { - "AAfp1x": {POS: ADJ, "morph": "Case=Nom|Degree=Pos|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "AAfp1y": {POS: ADJ, "morph": "Case=Nom|Degree=Cmp|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "AAfp1z": {POS: ADJ, "morph": "Case=Nom|Degree=Sup|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "AAfp2x": {POS: ADJ, "morph": "Case=Gen|Degree=Pos|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "AAfp2y": {POS: ADJ, "morph": "Case=Gen|Degree=Cmp|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "AAfp2z": {POS: ADJ, "morph": "Case=Gen|Degree=Sup|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "AAfp3x": {POS: ADJ, "morph": "Case=Dat|Degree=Pos|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "AAfp3y": {POS: ADJ, "morph": "Case=Dat|Degree=Cmp|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "AAfp3z": {POS: ADJ, "morph": "Case=Dat|Degree=Sup|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "AAfp4x": {POS: ADJ, "morph": "Case=Acc|Degree=Pos|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "AAfp4y": {POS: ADJ, "morph": 
"Case=Acc|Degree=Cmp|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "AAfp4z": {POS: ADJ, "morph": "Case=Acc|Degree=Sup|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "AAfp5x": {POS: ADJ, "morph": "Case=Voc|Degree=Pos|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "AAfp5y": {POS: ADJ, "morph": "Case=Voc|Degree=Cmp|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "AAfp5z": {POS: ADJ, "morph": "Case=Voc|Degree=Sup|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "AAfp6x": {POS: ADJ, "morph": "Case=Loc|Degree=Pos|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "AAfp6y": {POS: ADJ, "morph": "Case=Loc|Degree=Cmp|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "AAfp6z": {POS: ADJ, "morph": "Case=Loc|Degree=Sup|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "AAfp7x": {POS: ADJ, "morph": "Case=Ins|Degree=Pos|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "AAfp7y": {POS: ADJ, "morph": "Case=Ins|Degree=Cmp|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "AAfp7z": {POS: ADJ, "morph": "Case=Ins|Degree=Sup|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "AAfs1x": {POS: ADJ, "morph": "Case=Nom|Degree=Pos|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "AAfs1y": {POS: ADJ, "morph": "Case=Nom|Degree=Cmp|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "AAfs1z": {POS: ADJ, "morph": "Case=Nom|Degree=Sup|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "AAfs2x": {POS: ADJ, "morph": "Case=Gen|Degree=Pos|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "AAfs2y": {POS: ADJ, "morph": "Case=Gen|Degree=Cmp|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "AAfs2z": {POS: ADJ, "morph": "Case=Gen|Degree=Sup|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "AAfs3x": {POS: ADJ, "morph": "Case=Dat|Degree=Pos|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "AAfs3y": {POS: ADJ, "morph": "Case=Dat|Degree=Cmp|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "AAfs3z": {POS: ADJ, "morph": "Case=Dat|Degree=Sup|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "AAfs4x": {POS: ADJ, "morph": "Case=Acc|Degree=Pos|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "AAfs4y": {POS: ADJ, "morph": "Case=Acc|Degree=Cmp|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "AAfs4z": {POS: ADJ, "morph": "Case=Acc|Degree=Sup|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "AAfs5x": {POS: ADJ, "morph": "Case=Voc|Degree=Pos|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "AAfs5y": {POS: ADJ, "morph": "Case=Voc|Degree=Cmp|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "AAfs5z": {POS: ADJ, "morph": "Case=Voc|Degree=Sup|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "AAfs6x": {POS: ADJ, "morph": "Case=Loc|Degree=Pos|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "AAfs6y": {POS: ADJ, "morph": "Case=Loc|Degree=Cmp|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "AAfs6z": {POS: ADJ, "morph": "Case=Loc|Degree=Sup|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "AAfs7x": {POS: ADJ, "morph": "Case=Ins|Degree=Pos|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "AAfs7y": {POS: ADJ, "morph": "Case=Ins|Degree=Cmp|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "AAfs7z": {POS: ADJ, "morph": "Case=Ins|Degree=Sup|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "AAip1x": {POS: ADJ, "morph": "Animacy=Inan|Case=Nom|Degree=Pos|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAip1y": {POS: ADJ, "morph": "Animacy=Inan|Case=Nom|Degree=Cmp|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAip1z": {POS: ADJ, "morph": "Animacy=Inan|Case=Nom|Degree=Sup|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAip2x": {POS: ADJ, "morph": "Animacy=Inan|Case=Gen|Degree=Pos|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAip2y": {POS: ADJ, "morph": "Animacy=Inan|Case=Gen|Degree=Cmp|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAip2z": {POS: ADJ, "morph": "Animacy=Inan|Case=Gen|Degree=Sup|Gender=Masc|MorphPos=Adj|Number=Plur"}, - 
"AAip3x": {POS: ADJ, "morph": "Animacy=Inan|Case=Dat|Degree=Pos|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAip3y": {POS: ADJ, "morph": "Animacy=Inan|Case=Dat|Degree=Cmp|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAip3z": {POS: ADJ, "morph": "Animacy=Inan|Case=Dat|Degree=Sup|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAip4x": {POS: ADJ, "morph": "Animacy=Inan|Case=Acc|Degree=Pos|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAip4y": {POS: ADJ, "morph": "Animacy=Inan|Case=Acc|Degree=Cmp|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAip4z": {POS: ADJ, "morph": "Animacy=Inan|Case=Acc|Degree=Sup|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAip5x": {POS: ADJ, "morph": "Animacy=Inan|Case=Voc|Degree=Pos|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAip5y": {POS: ADJ, "morph": "Animacy=Inan|Case=Voc|Degree=Cmp|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAip5z": {POS: ADJ, "morph": "Animacy=Inan|Case=Voc|Degree=Sup|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAip6x": {POS: ADJ, "morph": "Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAip6y": {POS: ADJ, "morph": "Animacy=Inan|Case=Loc|Degree=Cmp|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAip6z": {POS: ADJ, "morph": "Animacy=Inan|Case=Loc|Degree=Sup|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAip7x": {POS: ADJ, "morph": "Animacy=Inan|Case=Ins|Degree=Pos|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAip7y": {POS: ADJ, "morph": "Animacy=Inan|Case=Ins|Degree=Cmp|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAip7z": {POS: ADJ, "morph": "Animacy=Inan|Case=Ins|Degree=Sup|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAis1x": {POS: ADJ, "morph": "Animacy=Inan|Case=Nom|Degree=Pos|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAis1y": {POS: ADJ, "morph": "Animacy=Inan|Case=Nom|Degree=Cmp|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAis1z": {POS: ADJ, "morph": "Animacy=Inan|Case=Nom|Degree=Sup|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAis2x": {POS: ADJ, "morph": "Animacy=Inan|Case=Gen|Degree=Pos|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAis2y": {POS: ADJ, "morph": "Animacy=Inan|Case=Gen|Degree=Cmp|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAis2z": {POS: ADJ, "morph": "Animacy=Inan|Case=Gen|Degree=Sup|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAis3x": {POS: ADJ, "morph": "Animacy=Inan|Case=Dat|Degree=Pos|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAis3y": {POS: ADJ, "morph": "Animacy=Inan|Case=Dat|Degree=Cmp|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAis3z": {POS: ADJ, "morph": "Animacy=Inan|Case=Dat|Degree=Sup|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAis4x": {POS: ADJ, "morph": "Animacy=Inan|Case=Acc|Degree=Pos|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAis4y": {POS: ADJ, "morph": "Animacy=Inan|Case=Acc|Degree=Cmp|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAis4z": {POS: ADJ, "morph": "Animacy=Inan|Case=Acc|Degree=Sup|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAis5x": {POS: ADJ, "morph": "Animacy=Inan|Case=Voc|Degree=Pos|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAis5y": {POS: ADJ, "morph": "Animacy=Inan|Case=Voc|Degree=Cmp|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAis5z": {POS: ADJ, "morph": "Animacy=Inan|Case=Voc|Degree=Sup|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAis6x": {POS: ADJ, "morph": "Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAis6y": {POS: ADJ, "morph": "Animacy=Inan|Case=Loc|Degree=Cmp|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAis6z": {POS: ADJ, "morph": "Animacy=Inan|Case=Loc|Degree=Sup|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAis7x": {POS: ADJ, "morph": 
"Animacy=Inan|Case=Ins|Degree=Pos|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAis7y": {POS: ADJ, "morph": "Animacy=Inan|Case=Ins|Degree=Cmp|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAis7z": {POS: ADJ, "morph": "Animacy=Inan|Case=Ins|Degree=Sup|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAmp1x": {POS: ADJ, "morph": "Animacy=Anim|Case=Nom|Degree=Pos|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAmp1y": {POS: ADJ, "morph": "Animacy=Anim|Case=Nom|Degree=Cmp|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAmp1z": {POS: ADJ, "morph": "Animacy=Anim|Case=Nom|Degree=Sup|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAmp2x": {POS: ADJ, "morph": "Animacy=Anim|Case=Gen|Degree=Pos|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAmp2y": {POS: ADJ, "morph": "Animacy=Anim|Case=Gen|Degree=Cmp|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAmp2z": {POS: ADJ, "morph": "Animacy=Anim|Case=Gen|Degree=Sup|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAmp3x": {POS: ADJ, "morph": "Animacy=Anim|Case=Dat|Degree=Pos|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAmp3y": {POS: ADJ, "morph": "Animacy=Anim|Case=Dat|Degree=Cmp|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAmp3z": {POS: ADJ, "morph": "Animacy=Anim|Case=Dat|Degree=Sup|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAmp4x": {POS: ADJ, "morph": "Animacy=Anim|Case=Acc|Degree=Pos|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAmp4y": {POS: ADJ, "morph": "Animacy=Anim|Case=Acc|Degree=Cmp|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAmp4z": {POS: ADJ, "morph": "Animacy=Anim|Case=Acc|Degree=Sup|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAmp5x": {POS: ADJ, "morph": "Animacy=Anim|Case=Voc|Degree=Pos|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAmp5y": {POS: ADJ, "morph": "Animacy=Anim|Case=Voc|Degree=Cmp|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAmp5z": {POS: ADJ, "morph": "Animacy=Anim|Case=Voc|Degree=Sup|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAmp6x": {POS: ADJ, "morph": "Animacy=Anim|Case=Loc|Degree=Pos|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAmp6y": {POS: ADJ, "morph": "Animacy=Anim|Case=Loc|Degree=Cmp|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAmp6z": {POS: ADJ, "morph": "Animacy=Anim|Case=Loc|Degree=Sup|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAmp7x": {POS: ADJ, "morph": "Animacy=Anim|Case=Ins|Degree=Pos|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAmp7y": {POS: ADJ, "morph": "Animacy=Anim|Case=Ins|Degree=Cmp|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAmp7z": {POS: ADJ, "morph": "Animacy=Anim|Case=Ins|Degree=Sup|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "AAms1x": {POS: ADJ, "morph": "Animacy=Anim|Case=Nom|Degree=Pos|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAms1y": {POS: ADJ, "morph": "Animacy=Anim|Case=Nom|Degree=Cmp|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAms1z": {POS: ADJ, "morph": "Animacy=Anim|Case=Nom|Degree=Sup|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAms2x": {POS: ADJ, "morph": "Animacy=Anim|Case=Gen|Degree=Pos|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAms2y": {POS: ADJ, "morph": "Animacy=Anim|Case=Gen|Degree=Cmp|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAms2z": {POS: ADJ, "morph": "Animacy=Anim|Case=Gen|Degree=Sup|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAms3x": {POS: ADJ, "morph": "Animacy=Anim|Case=Dat|Degree=Pos|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAms3y": {POS: ADJ, "morph": "Animacy=Anim|Case=Dat|Degree=Cmp|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAms3z": {POS: ADJ, "morph": "Animacy=Anim|Case=Dat|Degree=Sup|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAms4x": {POS: ADJ, "morph": 
"Animacy=Anim|Case=Acc|Degree=Pos|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAms4y": {POS: ADJ, "morph": "Animacy=Anim|Case=Acc|Degree=Cmp|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAms4z": {POS: ADJ, "morph": "Animacy=Anim|Case=Acc|Degree=Sup|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAms5x": {POS: ADJ, "morph": "Animacy=Anim|Case=Voc|Degree=Pos|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAms5y": {POS: ADJ, "morph": "Animacy=Anim|Case=Voc|Degree=Cmp|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAms5z": {POS: ADJ, "morph": "Animacy=Anim|Case=Voc|Degree=Sup|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAms6x": {POS: ADJ, "morph": "Animacy=Anim|Case=Loc|Degree=Pos|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAms6y": {POS: ADJ, "morph": "Animacy=Anim|Case=Loc|Degree=Cmp|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAms6z": {POS: ADJ, "morph": "Animacy=Anim|Case=Loc|Degree=Sup|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAms7x": {POS: ADJ, "morph": "Animacy=Anim|Case=Ins|Degree=Pos|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAms7y": {POS: ADJ, "morph": "Animacy=Anim|Case=Ins|Degree=Cmp|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAms7z": {POS: ADJ, "morph": "Animacy=Anim|Case=Ins|Degree=Sup|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "AAnp1x": {POS: ADJ, "morph": "Case=Nom|Degree=Pos|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "AAnp1y": {POS: ADJ, "morph": "Case=Nom|Degree=Cmp|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "AAnp1z": {POS: ADJ, "morph": "Case=Nom|Degree=Sup|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "AAnp2x": {POS: ADJ, "morph": "Case=Gen|Degree=Pos|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "AAnp2y": {POS: ADJ, "morph": "Case=Gen|Degree=Cmp|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "AAnp2z": {POS: ADJ, "morph": "Case=Gen|Degree=Sup|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "AAnp3x": {POS: ADJ, "morph": "Case=Dat|Degree=Pos|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "AAnp3y": {POS: ADJ, "morph": "Case=Dat|Degree=Cmp|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "AAnp3z": {POS: ADJ, "morph": "Case=Dat|Degree=Sup|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "AAnp4x": {POS: ADJ, "morph": "Case=Acc|Degree=Pos|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "AAnp4y": {POS: ADJ, "morph": "Case=Acc|Degree=Cmp|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "AAnp4z": {POS: ADJ, "morph": "Case=Acc|Degree=Sup|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "AAnp5x": {POS: ADJ, "morph": "Case=Voc|Degree=Pos|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "AAnp5y": {POS: ADJ, "morph": "Case=Voc|Degree=Cmp|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "AAnp5z": {POS: ADJ, "morph": "Case=Voc|Degree=Sup|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "AAnp6x": {POS: ADJ, "morph": "Case=Loc|Degree=Pos|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "AAnp6y": {POS: ADJ, "morph": "Case=Loc|Degree=Cmp|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "AAnp6z": {POS: ADJ, "morph": "Case=Loc|Degree=Sup|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "AAnp7x": {POS: ADJ, "morph": "Case=Ins|Degree=Pos|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "AAnp7y": {POS: ADJ, "morph": "Case=Ins|Degree=Cmp|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "AAnp7z": {POS: ADJ, "morph": "Case=Ins|Degree=Sup|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "AAns1x": {POS: ADJ, "morph": "Case=Nom|Degree=Pos|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "AAns1y": {POS: ADJ, "morph": "Case=Nom|Degree=Cmp|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "AAns1z": {POS: ADJ, "morph": "Case=Nom|Degree=Sup|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "AAns2x": {POS: ADJ, "morph": 
"Case=Gen|Degree=Pos|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "AAns2y": {POS: ADJ, "morph": "Case=Gen|Degree=Cmp|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "AAns2z": {POS: ADJ, "morph": "Case=Gen|Degree=Sup|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "AAns3x": {POS: ADJ, "morph": "Case=Dat|Degree=Pos|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "AAns3y": {POS: ADJ, "morph": "Case=Dat|Degree=Cmp|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "AAns3z": {POS: ADJ, "morph": "Case=Dat|Degree=Sup|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "AAns4x": {POS: ADJ, "morph": "Case=Acc|Degree=Pos|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "AAns4y": {POS: ADJ, "morph": "Case=Acc|Degree=Cmp|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "AAns4z": {POS: ADJ, "morph": "Case=Acc|Degree=Sup|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "AAns5x": {POS: ADJ, "morph": "Case=Voc|Degree=Pos|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "AAns5y": {POS: ADJ, "morph": "Case=Voc|Degree=Cmp|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "AAns5z": {POS: ADJ, "morph": "Case=Voc|Degree=Sup|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "AAns6x": {POS: ADJ, "morph": "Case=Loc|Degree=Pos|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "AAns6y": {POS: ADJ, "morph": "Case=Loc|Degree=Cmp|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "AAns6z": {POS: ADJ, "morph": "Case=Loc|Degree=Sup|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "AAns7x": {POS: ADJ, "morph": "Case=Ins|Degree=Pos|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "AAns7y": {POS: ADJ, "morph": "Case=Ins|Degree=Cmp|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "AAns7z": {POS: ADJ, "morph": "Case=Ins|Degree=Sup|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "AFfp1x": {POS: ADJ, "morph": "Case=Nom|Degree=Pos|Gender=Fem|MorphPos=Mix|Number=Plur"}, - "AFfp2x": {POS: ADJ, "morph": "Case=Gen|Degree=Pos|Gender=Fem|MorphPos=Mix|Number=Plur"}, - "AFfp3x": {POS: ADJ, "morph": "Case=Dat|Degree=Pos|Gender=Fem|MorphPos=Mix|Number=Plur"}, - "AFfp4x": {POS: ADJ, "morph": "Case=Acc|Degree=Pos|Gender=Fem|MorphPos=Mix|Number=Plur"}, - "AFfp5x": {POS: ADJ, "morph": "Case=Voc|Degree=Pos|Gender=Fem|MorphPos=Mix|Number=Plur"}, - "AFfp6x": {POS: ADJ, "morph": "Case=Loc|Degree=Pos|Gender=Fem|MorphPos=Mix|Number=Plur"}, - "AFfp7x": {POS: ADJ, "morph": "Case=Ins|Degree=Pos|Gender=Fem|MorphPos=Mix|Number=Plur"}, - "AFfs1x": {POS: ADJ, "morph": "Case=Nom|Degree=Pos|Gender=Fem|MorphPos=Mix|Number=Sing"}, - "AFfs2x": {POS: ADJ, "morph": "Case=Gen|Degree=Pos|Gender=Fem|MorphPos=Mix|Number=Sing"}, - "AFfs3x": {POS: ADJ, "morph": "Case=Dat|Degree=Pos|Gender=Fem|MorphPos=Mix|Number=Sing"}, - "AFfs4x": {POS: ADJ, "morph": "Case=Acc|Degree=Pos|Gender=Fem|MorphPos=Mix|Number=Sing"}, - "AFfs5x": {POS: ADJ, "morph": "Case=Voc|Degree=Pos|Gender=Fem|MorphPos=Mix|Number=Sing"}, - "AFfs6x": {POS: ADJ, "morph": "Case=Loc|Degree=Pos|Gender=Fem|MorphPos=Mix|Number=Sing"}, - "AFfs7x": {POS: ADJ, "morph": "Case=Ins|Degree=Pos|Gender=Fem|MorphPos=Mix|Number=Sing"}, - "AFip1x": {POS: ADJ, "morph": "Animacy=Inan|Case=Nom|Degree=Pos|Gender=Masc|MorphPos=Mix|Number=Plur"}, - "AFip2x": {POS: ADJ, "morph": "Animacy=Inan|Case=Gen|Degree=Pos|Gender=Masc|MorphPos=Mix|Number=Plur"}, - "AFip3x": {POS: ADJ, "morph": "Animacy=Inan|Case=Dat|Degree=Pos|Gender=Masc|MorphPos=Mix|Number=Plur"}, - "AFip4x": {POS: ADJ, "morph": "Animacy=Inan|Case=Acc|Degree=Pos|Gender=Masc|MorphPos=Mix|Number=Plur"}, - "AFip5x": {POS: ADJ, "morph": "Animacy=Inan|Case=Voc|Degree=Pos|Gender=Masc|MorphPos=Mix|Number=Plur"}, - "AFip6x": {POS: ADJ, "morph": 
"Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|MorphPos=Mix|Number=Plur"}, - "AFip7x": {POS: ADJ, "morph": "Animacy=Inan|Case=Ins|Degree=Pos|Gender=Masc|MorphPos=Mix|Number=Plur"}, - "AFis1x": {POS: ADJ, "morph": "Animacy=Inan|Case=Nom|Degree=Pos|Gender=Masc|MorphPos=Mix|Number=Sing"}, - "AFis2x": {POS: ADJ, "morph": "Animacy=Inan|Case=Gen|Degree=Pos|Gender=Masc|MorphPos=Mix|Number=Sing"}, - "AFis3x": {POS: ADJ, "morph": "Animacy=Inan|Case=Dat|Degree=Pos|Gender=Masc|MorphPos=Mix|Number=Sing"}, - "AFis4x": {POS: ADJ, "morph": "Animacy=Inan|Case=Acc|Degree=Pos|Gender=Masc|MorphPos=Mix|Number=Sing"}, - "AFis5x": {POS: ADJ, "morph": "Animacy=Inan|Case=Voc|Degree=Pos|Gender=Masc|MorphPos=Mix|Number=Sing"}, - "AFis6x": {POS: ADJ, "morph": "Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|MorphPos=Mix|Number=Sing"}, - "AFis7x": {POS: ADJ, "morph": "Animacy=Inan|Case=Ins|Degree=Pos|Gender=Masc|MorphPos=Mix|Number=Sing"}, - "AFmp1x": {POS: ADJ, "morph": "Animacy=Anim|Case=Nom|Degree=Pos|Gender=Masc|MorphPos=Mix|Number=Plur"}, - "AFmp2x": {POS: ADJ, "morph": "Animacy=Anim|Case=Gen|Degree=Pos|Gender=Masc|MorphPos=Mix|Number=Plur"}, - "AFmp3x": {POS: ADJ, "morph": "Animacy=Anim|Case=Dat|Degree=Pos|Gender=Masc|MorphPos=Mix|Number=Plur"}, - "AFmp4x": {POS: ADJ, "morph": "Animacy=Anim|Case=Acc|Degree=Pos|Gender=Masc|MorphPos=Mix|Number=Plur"}, - "AFmp5x": {POS: ADJ, "morph": "Animacy=Anim|Case=Voc|Degree=Pos|Gender=Masc|MorphPos=Mix|Number=Plur"}, - "AFmp6x": {POS: ADJ, "morph": "Animacy=Anim|Case=Loc|Degree=Pos|Gender=Masc|MorphPos=Mix|Number=Plur"}, - "AFmp7x": {POS: ADJ, "morph": "Animacy=Anim|Case=Ins|Degree=Pos|Gender=Masc|MorphPos=Mix|Number=Plur"}, - "AFms1x": {POS: ADJ, "morph": "Animacy=Anim|Case=Nom|Degree=Pos|Gender=Masc|MorphPos=Mix|Number=Sing"}, - "AFms2x": {POS: ADJ, "morph": "Animacy=Anim|Case=Gen|Degree=Pos|Gender=Masc|MorphPos=Mix|Number=Sing"}, - "AFms3x": {POS: ADJ, "morph": "Animacy=Anim|Case=Dat|Degree=Pos|Gender=Masc|MorphPos=Mix|Number=Sing"}, - "AFms4x": {POS: ADJ, "morph": "Animacy=Anim|Case=Acc|Degree=Pos|Gender=Masc|MorphPos=Mix|Number=Sing"}, - "AFms5x": {POS: ADJ, "morph": "Animacy=Anim|Case=Voc|Degree=Pos|Gender=Masc|MorphPos=Mix|Number=Sing"}, - "AFms6x": {POS: ADJ, "morph": "Animacy=Anim|Case=Loc|Degree=Pos|Gender=Masc|MorphPos=Mix|Number=Sing"}, - "AFms7x": {POS: ADJ, "morph": "Animacy=Anim|Case=Ins|Degree=Pos|Gender=Masc|MorphPos=Mix|Number=Sing"}, - "AFnp1x": {POS: ADJ, "morph": "Case=Nom|Degree=Pos|Gender=Neut|MorphPos=Mix|Number=Plur"}, - "AFnp2x": {POS: ADJ, "morph": "Case=Gen|Degree=Pos|Gender=Neut|MorphPos=Mix|Number=Plur"}, - "AFnp3x": {POS: ADJ, "morph": "Case=Dat|Degree=Pos|Gender=Neut|MorphPos=Mix|Number=Plur"}, - "AFnp4x": {POS: ADJ, "morph": "Case=Acc|Degree=Pos|Gender=Neut|MorphPos=Mix|Number=Plur"}, - "AFnp5x": {POS: ADJ, "morph": "Case=Voc|Degree=Pos|Gender=Neut|MorphPos=Mix|Number=Plur"}, - "AFnp6x": {POS: ADJ, "morph": "Case=Loc|Degree=Pos|Gender=Neut|MorphPos=Mix|Number=Plur"}, - "AFnp7x": {POS: ADJ, "morph": "Case=Ins|Degree=Pos|Gender=Neut|MorphPos=Mix|Number=Plur"}, - "AFns1x": {POS: ADJ, "morph": "Case=Nom|Degree=Pos|Gender=Neut|MorphPos=Mix|Number=Sing"}, - "AFns2x": {POS: ADJ, "morph": "Case=Gen|Degree=Pos|Gender=Neut|MorphPos=Mix|Number=Sing"}, - "AFns3x": {POS: ADJ, "morph": "Case=Dat|Degree=Pos|Gender=Neut|MorphPos=Mix|Number=Sing"}, - "AFns4x": {POS: ADJ, "morph": "Case=Acc|Degree=Pos|Gender=Neut|MorphPos=Mix|Number=Sing"}, - "AFns5x": {POS: ADJ, "morph": "Case=Voc|Degree=Pos|Gender=Neut|MorphPos=Mix|Number=Sing"}, - "AFns6x": {POS: ADJ, "morph": 
"Case=Loc|Degree=Pos|Gender=Neut|MorphPos=Mix|Number=Sing"}, - "AFns7x": {POS: ADJ, "morph": "Case=Ins|Degree=Pos|Gender=Neut|MorphPos=Mix|Number=Sing"}, - "AUfp1x": {POS: ADJ, "morph": "Case=Nom|Degree=Pos|Gender=Fem|MorphPos=Def|Number=Plur"}, - "AUfp1y": {POS: ADJ, "morph": "Case=Nom|Degree=Cmp|Gender=Fem|MorphPos=Def|Number=Plur"}, - "AUfp1z": {POS: ADJ, "morph": "Case=Nom|Degree=Sup|Gender=Fem|MorphPos=Def|Number=Plur"}, - "AUfp2x": {POS: ADJ, "morph": "Case=Gen|Degree=Pos|Gender=Fem|MorphPos=Def|Number=Plur"}, - "AUfp3x": {POS: ADJ, "morph": "Case=Dat|Degree=Pos|Gender=Fem|MorphPos=Def|Number=Plur"}, - "AUfp4x": {POS: ADJ, "morph": "Case=Acc|Degree=Pos|Gender=Fem|MorphPos=Def|Number=Plur"}, - "AUfp5x": {POS: ADJ, "morph": "Case=Voc|Degree=Pos|Gender=Fem|MorphPos=Def|Number=Plur"}, - "AUfp6x": {POS: ADJ, "morph": "Case=Loc|Degree=Pos|Gender=Fem|MorphPos=Def|Number=Plur"}, - "AUfp7x": {POS: ADJ, "morph": "Case=Ins|Degree=Pos|Gender=Fem|MorphPos=Def|Number=Plur"}, - "AUfs1x": {POS: ADJ, "morph": "Case=Nom|Degree=Pos|Gender=Fem|MorphPos=Def|Number=Sing"}, - "AUfs1y": {POS: ADJ, "morph": "Case=Nom|Degree=Cmp|Gender=Fem|MorphPos=Def|Number=Sing"}, - "AUfs1z": {POS: ADJ, "morph": "Case=Nom|Degree=Sup|Gender=Fem|MorphPos=Def|Number=Sing"}, - "AUfs2x": {POS: ADJ, "morph": "Case=Gen|Degree=Pos|Gender=Fem|MorphPos=Def|Number=Sing"}, - "AUfs3x": {POS: ADJ, "morph": "Case=Dat|Degree=Pos|Gender=Fem|MorphPos=Def|Number=Sing"}, - "AUfs4x": {POS: ADJ, "morph": "Case=Acc|Degree=Pos|Gender=Fem|MorphPos=Def|Number=Sing"}, - "AUfs5x": {POS: ADJ, "morph": "Case=Voc|Degree=Pos|Gender=Fem|MorphPos=Def|Number=Sing"}, - "AUfs6x": {POS: ADJ, "morph": "Case=Loc|Degree=Pos|Gender=Fem|MorphPos=Def|Number=Sing"}, - "AUfs7x": {POS: ADJ, "morph": "Case=Ins|Degree=Pos|Gender=Fem|MorphPos=Def|Number=Sing"}, - "AUip1x": {POS: ADJ, "morph": "Animacy=Inan|Case=Nom|Degree=Pos|Gender=Masc|MorphPos=Def|Number=Plur"}, - "AUip1y": {POS: ADJ, "morph": "Animacy=Inan|Case=Nom|Degree=Cmp|Gender=Masc|MorphPos=Def|Number=Plur"}, - "AUip1z": {POS: ADJ, "morph": "Animacy=Inan|Case=Nom|Degree=Sup|Gender=Masc|MorphPos=Def|Number=Plur"}, - "AUip2x": {POS: ADJ, "morph": "Animacy=Inan|Case=Gen|Degree=Pos|Gender=Masc|MorphPos=Def|Number=Plur"}, - "AUip3x": {POS: ADJ, "morph": "Animacy=Inan|Case=Dat|Degree=Pos|Gender=Masc|MorphPos=Def|Number=Plur"}, - "AUip4x": {POS: ADJ, "morph": "Animacy=Inan|Case=Acc|Degree=Pos|Gender=Masc|MorphPos=Def|Number=Plur"}, - "AUip5x": {POS: ADJ, "morph": "Animacy=Inan|Case=Voc|Degree=Pos|Gender=Masc|MorphPos=Def|Number=Plur"}, - "AUip6x": {POS: ADJ, "morph": "Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|MorphPos=Def|Number=Plur"}, - "AUip7x": {POS: ADJ, "morph": "Animacy=Inan|Case=Ins|Degree=Pos|Gender=Masc|MorphPos=Def|Number=Plur"}, - "AUis1x": {POS: ADJ, "morph": "Animacy=Inan|Case=Nom|Degree=Pos|Gender=Masc|MorphPos=Def|Number=Sing"}, - "AUis1y": {POS: ADJ, "morph": "Animacy=Inan|Case=Nom|Degree=Cmp|Gender=Masc|MorphPos=Def|Number=Sing"}, - "AUis1z": {POS: ADJ, "morph": "Animacy=Inan|Case=Nom|Degree=Sup|Gender=Masc|MorphPos=Def|Number=Sing"}, - "AUis2x": {POS: ADJ, "morph": "Animacy=Inan|Case=Gen|Degree=Pos|Gender=Masc|MorphPos=Def|Number=Sing"}, - "AUis3x": {POS: ADJ, "morph": "Animacy=Inan|Case=Dat|Degree=Pos|Gender=Masc|MorphPos=Def|Number=Sing"}, - "AUis4x": {POS: ADJ, "morph": "Animacy=Inan|Case=Acc|Degree=Pos|Gender=Masc|MorphPos=Def|Number=Sing"}, - "AUis5x": {POS: ADJ, "morph": "Animacy=Inan|Case=Voc|Degree=Pos|Gender=Masc|MorphPos=Def|Number=Sing"}, - "AUis6x": {POS: ADJ, "morph": 
"Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|MorphPos=Def|Number=Sing"}, - "AUis7x": {POS: ADJ, "morph": "Animacy=Inan|Case=Ins|Degree=Pos|Gender=Masc|MorphPos=Def|Number=Sing"}, - "AUmp1x": {POS: ADJ, "morph": "Animacy=Anim|Case=Nom|Degree=Pos|Gender=Masc|MorphPos=Def|Number=Plur"}, - "AUmp1y": {POS: ADJ, "morph": "Animacy=Anim|Case=Nom|Degree=Cmp|Gender=Masc|MorphPos=Def|Number=Plur"}, - "AUmp1z": {POS: ADJ, "morph": "Animacy=Anim|Case=Nom|Degree=Sup|Gender=Masc|MorphPos=Def|Number=Plur"}, - "AUmp2x": {POS: ADJ, "morph": "Animacy=Anim|Case=Gen|Degree=Pos|Gender=Masc|MorphPos=Def|Number=Plur"}, - "AUmp3x": {POS: ADJ, "morph": "Animacy=Anim|Case=Dat|Degree=Pos|Gender=Masc|MorphPos=Def|Number=Plur"}, - "AUmp4x": {POS: ADJ, "morph": "Animacy=Anim|Case=Acc|Degree=Pos|Gender=Masc|MorphPos=Def|Number=Plur"}, - "AUmp5x": {POS: ADJ, "morph": "Animacy=Anim|Case=Voc|Degree=Pos|Gender=Masc|MorphPos=Def|Number=Plur"}, - "AUmp6x": {POS: ADJ, "morph": "Animacy=Anim|Case=Loc|Degree=Pos|Gender=Masc|MorphPos=Def|Number=Plur"}, - "AUmp7x": {POS: ADJ, "morph": "Animacy=Anim|Case=Ins|Degree=Pos|Gender=Masc|MorphPos=Def|Number=Plur"}, - "AUms1x": {POS: ADJ, "morph": "Animacy=Anim|Case=Nom|Degree=Pos|Gender=Masc|MorphPos=Def|Number=Sing"}, - "AUms1y": {POS: ADJ, "morph": "Animacy=Anim|Case=Nom|Degree=Cmp|Gender=Masc|MorphPos=Def|Number=Sing"}, - "AUms1z": {POS: ADJ, "morph": "Animacy=Anim|Case=Nom|Degree=Sup|Gender=Masc|MorphPos=Def|Number=Sing"}, - "AUms2x": {POS: ADJ, "morph": "Animacy=Anim|Case=Gen|Degree=Pos|Gender=Masc|MorphPos=Def|Number=Sing"}, - "AUms3x": {POS: ADJ, "morph": "Animacy=Anim|Case=Dat|Degree=Pos|Gender=Masc|MorphPos=Def|Number=Sing"}, - "AUms4x": {POS: ADJ, "morph": "Animacy=Anim|Case=Acc|Degree=Pos|Gender=Masc|MorphPos=Def|Number=Sing"}, - "AUms5x": {POS: ADJ, "morph": "Animacy=Anim|Case=Voc|Degree=Pos|Gender=Masc|MorphPos=Def|Number=Sing"}, - "AUms6x": {POS: ADJ, "morph": "Animacy=Anim|Case=Loc|Degree=Pos|Gender=Masc|MorphPos=Def|Number=Sing"}, - "AUms7x": {POS: ADJ, "morph": "Animacy=Anim|Case=Ins|Degree=Pos|Gender=Masc|MorphPos=Def|Number=Sing"}, - "AUnp1x": {POS: ADJ, "morph": "Case=Nom|Degree=Pos|Gender=Neut|MorphPos=Def|Number=Plur"}, - "AUnp1y": {POS: ADJ, "morph": "Case=Nom|Degree=Cmp|Gender=Neut|MorphPos=Def|Number=Plur"}, - "AUnp1z": {POS: ADJ, "morph": "Case=Nom|Degree=Sup|Gender=Neut|MorphPos=Def|Number=Plur"}, - "AUnp2x": {POS: ADJ, "morph": "Case=Gen|Degree=Pos|Gender=Neut|MorphPos=Def|Number=Plur"}, - "AUnp3x": {POS: ADJ, "morph": "Case=Dat|Degree=Pos|Gender=Neut|MorphPos=Def|Number=Plur"}, - "AUnp4x": {POS: ADJ, "morph": "Case=Acc|Degree=Pos|Gender=Neut|MorphPos=Def|Number=Plur"}, - "AUnp5x": {POS: ADJ, "morph": "Case=Voc|Degree=Pos|Gender=Neut|MorphPos=Def|Number=Plur"}, - "AUnp6x": {POS: ADJ, "morph": "Case=Loc|Degree=Pos|Gender=Neut|MorphPos=Def|Number=Plur"}, - "AUnp7x": {POS: ADJ, "morph": "Case=Ins|Degree=Pos|Gender=Neut|MorphPos=Def|Number=Plur"}, - "AUns1x": {POS: ADJ, "morph": "Case=Nom|Degree=Pos|Gender=Neut|MorphPos=Def|Number=Sing"}, - "AUns1y": {POS: ADJ, "morph": "Case=Nom|Degree=Cmp|Gender=Neut|MorphPos=Def|Number=Sing"}, - "AUns1z": {POS: ADJ, "morph": "Case=Nom|Degree=Sup|Gender=Neut|MorphPos=Def|Number=Sing"}, - "AUns2x": {POS: ADJ, "morph": "Case=Gen|Degree=Pos|Gender=Neut|MorphPos=Def|Number=Sing"}, - "AUns3x": {POS: ADJ, "morph": "Case=Dat|Degree=Pos|Gender=Neut|MorphPos=Def|Number=Sing"}, - "AUns4x": {POS: ADJ, "morph": "Case=Acc|Degree=Pos|Gender=Neut|MorphPos=Def|Number=Sing"}, - "AUns5x": {POS: ADJ, "morph": 
"Case=Voc|Degree=Pos|Gender=Neut|MorphPos=Def|Number=Sing"}, - "AUns6x": {POS: ADJ, "morph": "Case=Loc|Degree=Pos|Gender=Neut|MorphPos=Def|Number=Sing"}, - "AUns7x": {POS: ADJ, "morph": "Case=Ins|Degree=Pos|Gender=Neut|MorphPos=Def|Number=Sing"}, - "Dx": {POS: ADV, "morph": "Degree=Pos"}, - "Dy": {POS: ADV, "morph": "Degree=Cmp"}, - "Dz": {POS: ADV, "morph": "Degree=Sup"}, - "Eu1": {POS: ADP, "morph": "AdpType=Prep|Case=Nom"}, - "Eu2": {POS: ADP, "morph": "AdpType=Prep|Case=Gen"}, - "Eu3": {POS: ADP, "morph": "AdpType=Prep|Case=Dat"}, - "Eu4": {POS: ADP, "morph": "AdpType=Prep|Case=Acc"}, - "Eu6": {POS: ADP, "morph": "AdpType=Prep|Case=Loc"}, - "Eu7": {POS: ADP, "morph": "AdpType=Prep|Case=Ins"}, - "Ev2": {POS: ADP, "morph": "AdpType=Voc|Case=Gen"}, - "Ev3": {POS: ADP, "morph": "AdpType=Voc|Case=Dat"}, - "Ev4": {POS: ADP, "morph": "AdpType=Voc|Case=Acc"}, - "Ev6": {POS: ADP, "morph": "AdpType=Voc|Case=Loc"}, - "Ev7": {POS: ADP, "morph": "AdpType=Voc|Case=Ins"}, - "Gkfp1x": {POS: VERB, "morph": "Case=Nom|Degree=Pos|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkfp1y": {POS: VERB, "morph": "Case=Nom|Degree=Cmp|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkfp1z": {POS: VERB, "morph": "Case=Nom|Degree=Sup|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkfp2x": {POS: VERB, "morph": "Case=Gen|Degree=Pos|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkfp2y": {POS: VERB, "morph": "Case=Gen|Degree=Cmp|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkfp2z": {POS: VERB, "morph": "Case=Gen|Degree=Sup|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkfp3x": {POS: VERB, "morph": "Case=Dat|Degree=Pos|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkfp3y": {POS: VERB, "morph": "Case=Dat|Degree=Cmp|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkfp3z": {POS: VERB, "morph": "Case=Dat|Degree=Sup|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkfp4x": {POS: VERB, "morph": "Case=Acc|Degree=Pos|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkfp4y": {POS: VERB, "morph": "Case=Acc|Degree=Cmp|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkfp4z": {POS: VERB, "morph": "Case=Acc|Degree=Sup|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkfp5x": {POS: VERB, "morph": "Case=Voc|Degree=Pos|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkfp5y": {POS: VERB, "morph": "Case=Voc|Degree=Cmp|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkfp5z": {POS: VERB, "morph": "Case=Voc|Degree=Sup|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkfp6x": {POS: VERB, "morph": "Case=Loc|Degree=Pos|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkfp6y": {POS: VERB, "morph": "Case=Loc|Degree=Cmp|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkfp6z": {POS: VERB, "morph": "Case=Loc|Degree=Sup|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkfp7x": {POS: VERB, "morph": "Case=Ins|Degree=Pos|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkfp7y": {POS: VERB, "morph": "Case=Ins|Degree=Cmp|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkfp7z": {POS: VERB, "morph": "Case=Ins|Degree=Sup|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkfs1x": {POS: VERB, "morph": "Case=Nom|Degree=Pos|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkfs1y": {POS: VERB, "morph": "Case=Nom|Degree=Cmp|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkfs1z": {POS: VERB, "morph": "Case=Nom|Degree=Sup|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkfs2x": {POS: VERB, "morph": 
"Case=Gen|Degree=Pos|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkfs2y": {POS: VERB, "morph": "Case=Gen|Degree=Cmp|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkfs2z": {POS: VERB, "morph": "Case=Gen|Degree=Sup|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkfs3x": {POS: VERB, "morph": "Case=Dat|Degree=Pos|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkfs3y": {POS: VERB, "morph": "Case=Dat|Degree=Cmp|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkfs3z": {POS: VERB, "morph": "Case=Dat|Degree=Sup|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkfs4x": {POS: VERB, "morph": "Case=Acc|Degree=Pos|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkfs4y": {POS: VERB, "morph": "Case=Acc|Degree=Cmp|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkfs4z": {POS: VERB, "morph": "Case=Acc|Degree=Sup|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkfs5x": {POS: VERB, "morph": "Case=Voc|Degree=Pos|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkfs5y": {POS: VERB, "morph": "Case=Voc|Degree=Cmp|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkfs5z": {POS: VERB, "morph": "Case=Voc|Degree=Sup|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkfs6x": {POS: VERB, "morph": "Case=Loc|Degree=Pos|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkfs6y": {POS: VERB, "morph": "Case=Loc|Degree=Cmp|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkfs6z": {POS: VERB, "morph": "Case=Loc|Degree=Sup|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkfs7x": {POS: VERB, "morph": "Case=Ins|Degree=Pos|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkfs7y": {POS: VERB, "morph": "Case=Ins|Degree=Cmp|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkfs7z": {POS: VERB, "morph": "Case=Ins|Degree=Sup|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkip1x": {POS: VERB, "morph": "Animacy=Inan|Case=Nom|Degree=Pos|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkip1y": {POS: VERB, "morph": "Animacy=Inan|Case=Nom|Degree=Cmp|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkip1z": {POS: VERB, "morph": "Animacy=Inan|Case=Nom|Degree=Sup|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkip2x": {POS: VERB, "morph": "Animacy=Inan|Case=Gen|Degree=Pos|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkip2y": {POS: VERB, "morph": "Animacy=Inan|Case=Gen|Degree=Cmp|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkip2z": {POS: VERB, "morph": "Animacy=Inan|Case=Gen|Degree=Sup|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkip3x": {POS: VERB, "morph": "Animacy=Inan|Case=Dat|Degree=Pos|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkip3y": {POS: VERB, "morph": "Animacy=Inan|Case=Dat|Degree=Cmp|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkip3z": {POS: VERB, "morph": "Animacy=Inan|Case=Dat|Degree=Sup|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkip4x": {POS: VERB, "morph": "Animacy=Inan|Case=Acc|Degree=Pos|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkip4y": {POS: VERB, "morph": "Animacy=Inan|Case=Acc|Degree=Cmp|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkip4z": {POS: VERB, "morph": "Animacy=Inan|Case=Acc|Degree=Sup|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkip5x": {POS: VERB, "morph": "Animacy=Inan|Case=Voc|Degree=Pos|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkip5y": {POS: VERB, "morph": "Animacy=Inan|Case=Voc|Degree=Cmp|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkip5z": {POS: VERB, "morph": 
"Animacy=Inan|Case=Voc|Degree=Sup|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkip6x": {POS: VERB, "morph": "Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkip6y": {POS: VERB, "morph": "Animacy=Inan|Case=Loc|Degree=Cmp|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkip6z": {POS: VERB, "morph": "Animacy=Inan|Case=Loc|Degree=Sup|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkip7x": {POS: VERB, "morph": "Animacy=Inan|Case=Ins|Degree=Pos|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkip7y": {POS: VERB, "morph": "Animacy=Inan|Case=Ins|Degree=Cmp|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkip7z": {POS: VERB, "morph": "Animacy=Inan|Case=Ins|Degree=Sup|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkis1x": {POS: VERB, "morph": "Animacy=Inan|Case=Nom|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkis1y": {POS: VERB, "morph": "Animacy=Inan|Case=Nom|Degree=Cmp|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkis1z": {POS: VERB, "morph": "Animacy=Inan|Case=Nom|Degree=Sup|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkis2x": {POS: VERB, "morph": "Animacy=Inan|Case=Gen|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkis2y": {POS: VERB, "morph": "Animacy=Inan|Case=Gen|Degree=Cmp|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkis2z": {POS: VERB, "morph": "Animacy=Inan|Case=Gen|Degree=Sup|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkis3x": {POS: VERB, "morph": "Animacy=Inan|Case=Dat|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkis3y": {POS: VERB, "morph": "Animacy=Inan|Case=Dat|Degree=Cmp|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkis3z": {POS: VERB, "morph": "Animacy=Inan|Case=Dat|Degree=Sup|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkis4x": {POS: VERB, "morph": "Animacy=Inan|Case=Acc|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkis4y": {POS: VERB, "morph": "Animacy=Inan|Case=Acc|Degree=Cmp|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkis4z": {POS: VERB, "morph": "Animacy=Inan|Case=Acc|Degree=Sup|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkis5x": {POS: VERB, "morph": "Animacy=Inan|Case=Voc|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkis5y": {POS: VERB, "morph": "Animacy=Inan|Case=Voc|Degree=Cmp|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkis5z": {POS: VERB, "morph": "Animacy=Inan|Case=Voc|Degree=Sup|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkis6x": {POS: VERB, "morph": "Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkis6y": {POS: VERB, "morph": "Animacy=Inan|Case=Loc|Degree=Cmp|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkis6z": {POS: VERB, "morph": "Animacy=Inan|Case=Loc|Degree=Sup|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkis7x": {POS: VERB, "morph": "Animacy=Inan|Case=Ins|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkis7y": {POS: VERB, "morph": "Animacy=Inan|Case=Ins|Degree=Cmp|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkis7z": {POS: VERB, "morph": "Animacy=Inan|Case=Ins|Degree=Sup|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkmp1x": {POS: VERB, "morph": "Animacy=Anim|Case=Nom|Degree=Pos|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkmp1y": {POS: VERB, "morph": "Animacy=Anim|Case=Nom|Degree=Cmp|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkmp1z": {POS: VERB, "morph": 
"Animacy=Anim|Case=Nom|Degree=Sup|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkmp2x": {POS: VERB, "morph": "Animacy=Anim|Case=Gen|Degree=Pos|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkmp2y": {POS: VERB, "morph": "Animacy=Anim|Case=Gen|Degree=Cmp|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkmp2z": {POS: VERB, "morph": "Animacy=Anim|Case=Gen|Degree=Sup|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkmp3x": {POS: VERB, "morph": "Animacy=Anim|Case=Dat|Degree=Pos|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkmp3y": {POS: VERB, "morph": "Animacy=Anim|Case=Dat|Degree=Cmp|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkmp3z": {POS: VERB, "morph": "Animacy=Anim|Case=Dat|Degree=Sup|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkmp4x": {POS: VERB, "morph": "Animacy=Anim|Case=Acc|Degree=Pos|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkmp4y": {POS: VERB, "morph": "Animacy=Anim|Case=Acc|Degree=Cmp|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkmp4z": {POS: VERB, "morph": "Animacy=Anim|Case=Acc|Degree=Sup|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkmp5x": {POS: VERB, "morph": "Animacy=Anim|Case=Voc|Degree=Pos|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkmp5y": {POS: VERB, "morph": "Animacy=Anim|Case=Voc|Degree=Cmp|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkmp5z": {POS: VERB, "morph": "Animacy=Anim|Case=Voc|Degree=Sup|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkmp6x": {POS: VERB, "morph": "Animacy=Anim|Case=Loc|Degree=Pos|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkmp6y": {POS: VERB, "morph": "Animacy=Anim|Case=Loc|Degree=Cmp|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkmp6z": {POS: VERB, "morph": "Animacy=Anim|Case=Loc|Degree=Sup|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkmp7x": {POS: VERB, "morph": "Animacy=Anim|Case=Ins|Degree=Pos|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkmp7y": {POS: VERB, "morph": "Animacy=Anim|Case=Ins|Degree=Cmp|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkmp7z": {POS: VERB, "morph": "Animacy=Anim|Case=Ins|Degree=Sup|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkms1x": {POS: VERB, "morph": "Animacy=Anim|Case=Nom|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkms1y": {POS: VERB, "morph": "Animacy=Anim|Case=Nom|Degree=Cmp|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkms1z": {POS: VERB, "morph": "Animacy=Anim|Case=Nom|Degree=Sup|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkms2x": {POS: VERB, "morph": "Animacy=Anim|Case=Gen|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkms2y": {POS: VERB, "morph": "Animacy=Anim|Case=Gen|Degree=Cmp|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkms2z": {POS: VERB, "morph": "Animacy=Anim|Case=Gen|Degree=Sup|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkms3x": {POS: VERB, "morph": "Animacy=Anim|Case=Dat|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkms3y": {POS: VERB, "morph": "Animacy=Anim|Case=Dat|Degree=Cmp|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkms3z": {POS: VERB, "morph": "Animacy=Anim|Case=Dat|Degree=Sup|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkms4x": {POS: VERB, "morph": "Animacy=Anim|Case=Acc|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkms4y": {POS: VERB, "morph": "Animacy=Anim|Case=Acc|Degree=Cmp|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkms4z": {POS: VERB, "morph": 
"Animacy=Anim|Case=Acc|Degree=Sup|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkms5x": {POS: VERB, "morph": "Animacy=Anim|Case=Voc|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkms5y": {POS: VERB, "morph": "Animacy=Anim|Case=Voc|Degree=Cmp|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkms5z": {POS: VERB, "morph": "Animacy=Anim|Case=Voc|Degree=Sup|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkms6x": {POS: VERB, "morph": "Animacy=Anim|Case=Loc|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkms6y": {POS: VERB, "morph": "Animacy=Anim|Case=Loc|Degree=Cmp|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkms6z": {POS: VERB, "morph": "Animacy=Anim|Case=Loc|Degree=Sup|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkms7x": {POS: VERB, "morph": "Animacy=Anim|Case=Ins|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkms7y": {POS: VERB, "morph": "Animacy=Anim|Case=Ins|Degree=Cmp|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkms7z": {POS: VERB, "morph": "Animacy=Anim|Case=Ins|Degree=Sup|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gknp1x": {POS: VERB, "morph": "Case=Nom|Degree=Pos|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gknp1y": {POS: VERB, "morph": "Case=Nom|Degree=Cmp|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gknp1z": {POS: VERB, "morph": "Case=Nom|Degree=Sup|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gknp2x": {POS: VERB, "morph": "Case=Gen|Degree=Pos|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gknp2y": {POS: VERB, "morph": "Case=Gen|Degree=Cmp|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gknp2z": {POS: VERB, "morph": "Case=Gen|Degree=Sup|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gknp3x": {POS: VERB, "morph": "Case=Dat|Degree=Pos|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gknp3y": {POS: VERB, "morph": "Case=Dat|Degree=Cmp|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gknp3z": {POS: VERB, "morph": "Case=Dat|Degree=Sup|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gknp4x": {POS: VERB, "morph": "Case=Acc|Degree=Pos|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gknp4y": {POS: VERB, "morph": "Case=Acc|Degree=Cmp|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gknp4z": {POS: VERB, "morph": "Case=Acc|Degree=Sup|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gknp5x": {POS: VERB, "morph": "Case=Voc|Degree=Pos|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gknp5y": {POS: VERB, "morph": "Case=Voc|Degree=Cmp|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gknp5z": {POS: VERB, "morph": "Case=Voc|Degree=Sup|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gknp6x": {POS: VERB, "morph": "Case=Loc|Degree=Pos|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gknp6y": {POS: VERB, "morph": "Case=Loc|Degree=Cmp|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gknp6z": {POS: VERB, "morph": "Case=Loc|Degree=Sup|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gknp7x": {POS: VERB, "morph": "Case=Ins|Degree=Pos|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gknp7y": {POS: VERB, "morph": "Case=Ins|Degree=Cmp|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gknp7z": {POS: VERB, "morph": "Case=Ins|Degree=Sup|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Act"}, - "Gkns1x": {POS: VERB, "morph": "Case=Nom|Degree=Pos|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkns1y": {POS: VERB, "morph": 
"Case=Nom|Degree=Cmp|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkns1z": {POS: VERB, "morph": "Case=Nom|Degree=Sup|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkns2x": {POS: VERB, "morph": "Case=Gen|Degree=Pos|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkns2y": {POS: VERB, "morph": "Case=Gen|Degree=Cmp|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkns2z": {POS: VERB, "morph": "Case=Gen|Degree=Sup|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkns3x": {POS: VERB, "morph": "Case=Dat|Degree=Pos|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkns3y": {POS: VERB, "morph": "Case=Dat|Degree=Cmp|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkns3z": {POS: VERB, "morph": "Case=Dat|Degree=Sup|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkns4x": {POS: VERB, "morph": "Case=Acc|Degree=Pos|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkns4y": {POS: VERB, "morph": "Case=Acc|Degree=Cmp|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkns4z": {POS: VERB, "morph": "Case=Acc|Degree=Sup|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkns5x": {POS: VERB, "morph": "Case=Voc|Degree=Pos|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkns5y": {POS: VERB, "morph": "Case=Voc|Degree=Cmp|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkns5z": {POS: VERB, "morph": "Case=Voc|Degree=Sup|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkns6x": {POS: VERB, "morph": "Case=Loc|Degree=Pos|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkns6y": {POS: VERB, "morph": "Case=Loc|Degree=Cmp|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkns6z": {POS: VERB, "morph": "Case=Loc|Degree=Sup|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkns7x": {POS: VERB, "morph": "Case=Ins|Degree=Pos|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkns7y": {POS: VERB, "morph": "Case=Ins|Degree=Cmp|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gkns7z": {POS: VERB, "morph": "Case=Ins|Degree=Sup|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Act"}, - "Gtfp1x": {POS: VERB, "morph": "Case=Nom|Degree=Pos|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtfp1y": {POS: VERB, "morph": "Case=Nom|Degree=Cmp|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtfp1z": {POS: VERB, "morph": "Case=Nom|Degree=Sup|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtfp2x": {POS: VERB, "morph": "Case=Gen|Degree=Pos|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtfp2y": {POS: VERB, "morph": "Case=Gen|Degree=Cmp|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtfp2z": {POS: VERB, "morph": "Case=Gen|Degree=Sup|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtfp3x": {POS: VERB, "morph": "Case=Dat|Degree=Pos|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtfp3y": {POS: VERB, "morph": "Case=Dat|Degree=Cmp|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtfp3z": {POS: VERB, "morph": "Case=Dat|Degree=Sup|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtfp4x": {POS: VERB, "morph": "Case=Acc|Degree=Pos|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtfp4y": {POS: VERB, "morph": "Case=Acc|Degree=Cmp|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtfp4z": {POS: VERB, "morph": "Case=Acc|Degree=Sup|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtfp5x": {POS: VERB, "morph": "Case=Voc|Degree=Pos|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtfp5y": {POS: VERB, "morph": "Case=Voc|Degree=Cmp|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtfp5z": {POS: 
VERB, "morph": "Case=Voc|Degree=Sup|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtfp6x": {POS: VERB, "morph": "Case=Loc|Degree=Pos|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtfp6y": {POS: VERB, "morph": "Case=Loc|Degree=Cmp|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtfp6z": {POS: VERB, "morph": "Case=Loc|Degree=Sup|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtfp7x": {POS: VERB, "morph": "Case=Ins|Degree=Pos|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtfp7y": {POS: VERB, "morph": "Case=Ins|Degree=Cmp|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtfp7z": {POS: VERB, "morph": "Case=Ins|Degree=Sup|Gender=Fem|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtfs1x": {POS: VERB, "morph": "Case=Nom|Degree=Pos|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtfs1y": {POS: VERB, "morph": "Case=Nom|Degree=Cmp|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtfs1z": {POS: VERB, "morph": "Case=Nom|Degree=Sup|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtfs2x": {POS: VERB, "morph": "Case=Gen|Degree=Pos|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtfs2y": {POS: VERB, "morph": "Case=Gen|Degree=Cmp|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtfs2z": {POS: VERB, "morph": "Case=Gen|Degree=Sup|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtfs3x": {POS: VERB, "morph": "Case=Dat|Degree=Pos|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtfs3y": {POS: VERB, "morph": "Case=Dat|Degree=Cmp|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtfs3z": {POS: VERB, "morph": "Case=Dat|Degree=Sup|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtfs4x": {POS: VERB, "morph": "Case=Acc|Degree=Pos|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtfs4y": {POS: VERB, "morph": "Case=Acc|Degree=Cmp|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtfs4z": {POS: VERB, "morph": "Case=Acc|Degree=Sup|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtfs5x": {POS: VERB, "morph": "Case=Voc|Degree=Pos|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtfs5y": {POS: VERB, "morph": "Case=Voc|Degree=Cmp|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtfs5z": {POS: VERB, "morph": "Case=Voc|Degree=Sup|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtfs6x": {POS: VERB, "morph": "Case=Loc|Degree=Pos|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtfs6y": {POS: VERB, "morph": "Case=Loc|Degree=Cmp|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtfs6z": {POS: VERB, "morph": "Case=Loc|Degree=Sup|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtfs7x": {POS: VERB, "morph": "Case=Ins|Degree=Pos|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtfs7y": {POS: VERB, "morph": "Case=Ins|Degree=Cmp|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtfs7z": {POS: VERB, "morph": "Case=Ins|Degree=Sup|Gender=Fem|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtip1x": {POS: VERB, "morph": "Animacy=Inan|Case=Nom|Degree=Pos|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtip1y": {POS: VERB, "morph": "Animacy=Inan|Case=Nom|Degree=Cmp|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtip1z": {POS: VERB, "morph": "Animacy=Inan|Case=Nom|Degree=Sup|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtip2x": {POS: VERB, "morph": "Animacy=Inan|Case=Gen|Degree=Pos|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtip2y": {POS: VERB, "morph": "Animacy=Inan|Case=Gen|Degree=Cmp|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtip2z": {POS: VERB, "morph": 
"Animacy=Inan|Case=Gen|Degree=Sup|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtip3x": {POS: VERB, "morph": "Animacy=Inan|Case=Dat|Degree=Pos|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtip3y": {POS: VERB, "morph": "Animacy=Inan|Case=Dat|Degree=Cmp|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtip3z": {POS: VERB, "morph": "Animacy=Inan|Case=Dat|Degree=Sup|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtip4x": {POS: VERB, "morph": "Animacy=Inan|Case=Acc|Degree=Pos|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtip4y": {POS: VERB, "morph": "Animacy=Inan|Case=Acc|Degree=Cmp|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtip4z": {POS: VERB, "morph": "Animacy=Inan|Case=Acc|Degree=Sup|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtip5x": {POS: VERB, "morph": "Animacy=Inan|Case=Voc|Degree=Pos|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtip5y": {POS: VERB, "morph": "Animacy=Inan|Case=Voc|Degree=Cmp|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtip5z": {POS: VERB, "morph": "Animacy=Inan|Case=Voc|Degree=Sup|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtip6x": {POS: VERB, "morph": "Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtip6y": {POS: VERB, "morph": "Animacy=Inan|Case=Loc|Degree=Cmp|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtip6z": {POS: VERB, "morph": "Animacy=Inan|Case=Loc|Degree=Sup|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtip7x": {POS: VERB, "morph": "Animacy=Inan|Case=Ins|Degree=Pos|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtip7y": {POS: VERB, "morph": "Animacy=Inan|Case=Ins|Degree=Cmp|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtip7z": {POS: VERB, "morph": "Animacy=Inan|Case=Ins|Degree=Sup|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtis1x": {POS: VERB, "morph": "Animacy=Inan|Case=Nom|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtis1y": {POS: VERB, "morph": "Animacy=Inan|Case=Nom|Degree=Cmp|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtis1z": {POS: VERB, "morph": "Animacy=Inan|Case=Nom|Degree=Sup|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtis2x": {POS: VERB, "morph": "Animacy=Inan|Case=Gen|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtis2y": {POS: VERB, "morph": "Animacy=Inan|Case=Gen|Degree=Cmp|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtis2z": {POS: VERB, "morph": "Animacy=Inan|Case=Gen|Degree=Sup|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtis3x": {POS: VERB, "morph": "Animacy=Inan|Case=Dat|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtis3y": {POS: VERB, "morph": "Animacy=Inan|Case=Dat|Degree=Cmp|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtis3z": {POS: VERB, "morph": "Animacy=Inan|Case=Dat|Degree=Sup|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtis4x": {POS: VERB, "morph": "Animacy=Inan|Case=Acc|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtis4y": {POS: VERB, "morph": "Animacy=Inan|Case=Acc|Degree=Cmp|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtis4z": {POS: VERB, "morph": "Animacy=Inan|Case=Acc|Degree=Sup|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtis5x": {POS: VERB, "morph": "Animacy=Inan|Case=Voc|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtis5y": {POS: VERB, "morph": "Animacy=Inan|Case=Voc|Degree=Cmp|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtis5z": {POS: 
VERB, "morph": "Animacy=Inan|Case=Voc|Degree=Sup|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtis6x": {POS: VERB, "morph": "Animacy=Inan|Case=Loc|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtis6y": {POS: VERB, "morph": "Animacy=Inan|Case=Loc|Degree=Cmp|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtis6z": {POS: VERB, "morph": "Animacy=Inan|Case=Loc|Degree=Sup|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtis7x": {POS: VERB, "morph": "Animacy=Inan|Case=Ins|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtis7y": {POS: VERB, "morph": "Animacy=Inan|Case=Ins|Degree=Cmp|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtis7z": {POS: VERB, "morph": "Animacy=Inan|Case=Ins|Degree=Sup|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtmp1x": {POS: VERB, "morph": "Animacy=Anim|Case=Nom|Degree=Pos|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtmp1y": {POS: VERB, "morph": "Animacy=Anim|Case=Nom|Degree=Cmp|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtmp1z": {POS: VERB, "morph": "Animacy=Anim|Case=Nom|Degree=Sup|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtmp2x": {POS: VERB, "morph": "Animacy=Anim|Case=Gen|Degree=Pos|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtmp2y": {POS: VERB, "morph": "Animacy=Anim|Case=Gen|Degree=Cmp|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtmp2z": {POS: VERB, "morph": "Animacy=Anim|Case=Gen|Degree=Sup|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtmp3x": {POS: VERB, "morph": "Animacy=Anim|Case=Dat|Degree=Pos|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtmp3y": {POS: VERB, "morph": "Animacy=Anim|Case=Dat|Degree=Cmp|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtmp3z": {POS: VERB, "morph": "Animacy=Anim|Case=Dat|Degree=Sup|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtmp4x": {POS: VERB, "morph": "Animacy=Anim|Case=Acc|Degree=Pos|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtmp4y": {POS: VERB, "morph": "Animacy=Anim|Case=Acc|Degree=Cmp|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtmp4z": {POS: VERB, "morph": "Animacy=Anim|Case=Acc|Degree=Sup|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtmp5x": {POS: VERB, "morph": "Animacy=Anim|Case=Voc|Degree=Pos|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtmp5y": {POS: VERB, "morph": "Animacy=Anim|Case=Voc|Degree=Cmp|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtmp5z": {POS: VERB, "morph": "Animacy=Anim|Case=Voc|Degree=Sup|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtmp6x": {POS: VERB, "morph": "Animacy=Anim|Case=Loc|Degree=Pos|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtmp6y": {POS: VERB, "morph": "Animacy=Anim|Case=Loc|Degree=Cmp|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtmp6z": {POS: VERB, "morph": "Animacy=Anim|Case=Loc|Degree=Sup|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtmp7x": {POS: VERB, "morph": "Animacy=Anim|Case=Ins|Degree=Pos|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtmp7y": {POS: VERB, "morph": "Animacy=Anim|Case=Ins|Degree=Cmp|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtmp7z": {POS: VERB, "morph": "Animacy=Anim|Case=Ins|Degree=Sup|Gender=Masc|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtms1x": {POS: VERB, "morph": "Animacy=Anim|Case=Nom|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtms1y": {POS: VERB, "morph": "Animacy=Anim|Case=Nom|Degree=Cmp|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - 
"Gtms1z": {POS: VERB, "morph": "Animacy=Anim|Case=Nom|Degree=Sup|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtms2x": {POS: VERB, "morph": "Animacy=Anim|Case=Gen|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtms2y": {POS: VERB, "morph": "Animacy=Anim|Case=Gen|Degree=Cmp|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtms2z": {POS: VERB, "morph": "Animacy=Anim|Case=Gen|Degree=Sup|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtms3x": {POS: VERB, "morph": "Animacy=Anim|Case=Dat|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtms3y": {POS: VERB, "morph": "Animacy=Anim|Case=Dat|Degree=Cmp|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtms3z": {POS: VERB, "morph": "Animacy=Anim|Case=Dat|Degree=Sup|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtms4x": {POS: VERB, "morph": "Animacy=Anim|Case=Acc|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtms4y": {POS: VERB, "morph": "Animacy=Anim|Case=Acc|Degree=Cmp|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtms4z": {POS: VERB, "morph": "Animacy=Anim|Case=Acc|Degree=Sup|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtms5x": {POS: VERB, "morph": "Animacy=Anim|Case=Voc|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtms5y": {POS: VERB, "morph": "Animacy=Anim|Case=Voc|Degree=Cmp|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtms5z": {POS: VERB, "morph": "Animacy=Anim|Case=Voc|Degree=Sup|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtms6x": {POS: VERB, "morph": "Animacy=Anim|Case=Loc|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtms6y": {POS: VERB, "morph": "Animacy=Anim|Case=Loc|Degree=Cmp|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtms6z": {POS: VERB, "morph": "Animacy=Anim|Case=Loc|Degree=Sup|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtms7x": {POS: VERB, "morph": "Animacy=Anim|Case=Ins|Degree=Pos|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtms7y": {POS: VERB, "morph": "Animacy=Anim|Case=Ins|Degree=Cmp|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtms7z": {POS: VERB, "morph": "Animacy=Anim|Case=Ins|Degree=Sup|Gender=Masc|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtnp1x": {POS: VERB, "morph": "Case=Nom|Degree=Pos|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtnp1y": {POS: VERB, "morph": "Case=Nom|Degree=Cmp|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtnp1z": {POS: VERB, "morph": "Case=Nom|Degree=Sup|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtnp2x": {POS: VERB, "morph": "Case=Gen|Degree=Pos|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtnp2y": {POS: VERB, "morph": "Case=Gen|Degree=Cmp|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtnp2z": {POS: VERB, "morph": "Case=Gen|Degree=Sup|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtnp3x": {POS: VERB, "morph": "Case=Dat|Degree=Pos|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtnp3y": {POS: VERB, "morph": "Case=Dat|Degree=Cmp|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtnp3z": {POS: VERB, "morph": "Case=Dat|Degree=Sup|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtnp4x": {POS: VERB, "morph": "Case=Acc|Degree=Pos|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtnp4y": {POS: VERB, "morph": "Case=Acc|Degree=Cmp|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtnp4z": {POS: VERB, "morph": "Case=Acc|Degree=Sup|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtnp5x": {POS: VERB, 
"morph": "Case=Voc|Degree=Pos|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtnp5y": {POS: VERB, "morph": "Case=Voc|Degree=Cmp|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtnp5z": {POS: VERB, "morph": "Case=Voc|Degree=Sup|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtnp6x": {POS: VERB, "morph": "Case=Loc|Degree=Pos|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtnp6y": {POS: VERB, "morph": "Case=Loc|Degree=Cmp|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtnp6z": {POS: VERB, "morph": "Case=Loc|Degree=Sup|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtnp7x": {POS: VERB, "morph": "Case=Ins|Degree=Pos|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtnp7y": {POS: VERB, "morph": "Case=Ins|Degree=Cmp|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtnp7z": {POS: VERB, "morph": "Case=Ins|Degree=Sup|Gender=Neut|Number=Plur|VerbForm=Part|Voice=Pass"}, - "Gtns1x": {POS: VERB, "morph": "Case=Nom|Degree=Pos|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtns1y": {POS: VERB, "morph": "Case=Nom|Degree=Cmp|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtns1z": {POS: VERB, "morph": "Case=Nom|Degree=Sup|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtns2x": {POS: VERB, "morph": "Case=Gen|Degree=Pos|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtns2y": {POS: VERB, "morph": "Case=Gen|Degree=Cmp|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtns2z": {POS: VERB, "morph": "Case=Gen|Degree=Sup|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtns3x": {POS: VERB, "morph": "Case=Dat|Degree=Pos|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtns3y": {POS: VERB, "morph": "Case=Dat|Degree=Cmp|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtns3z": {POS: VERB, "morph": "Case=Dat|Degree=Sup|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtns4x": {POS: VERB, "morph": "Case=Acc|Degree=Pos|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtns4y": {POS: VERB, "morph": "Case=Acc|Degree=Cmp|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtns4z": {POS: VERB, "morph": "Case=Acc|Degree=Sup|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtns5x": {POS: VERB, "morph": "Case=Voc|Degree=Pos|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtns5y": {POS: VERB, "morph": "Case=Voc|Degree=Cmp|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtns5z": {POS: VERB, "morph": "Case=Voc|Degree=Sup|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtns6x": {POS: VERB, "morph": "Case=Loc|Degree=Pos|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtns6y": {POS: VERB, "morph": "Case=Loc|Degree=Cmp|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtns6z": {POS: VERB, "morph": "Case=Loc|Degree=Sup|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtns7x": {POS: VERB, "morph": "Case=Ins|Degree=Pos|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtns7y": {POS: VERB, "morph": "Case=Ins|Degree=Cmp|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Pass"}, - "Gtns7z": {POS: VERB, "morph": "Case=Ins|Degree=Sup|Gender=Neut|Number=Sing|VerbForm=Part|Voice=Pass"}, - "J": {POS: INTJ, "morph": "_"}, - "NAfp1": {POS: NUM, "morph": "Case=Nom|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "NAfp2": {POS: NUM, "morph": "Case=Gen|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "NAfp3": {POS: NUM, "morph": "Case=Dat|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "NAfp4": {POS: NUM, "morph": "Case=Acc|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "NAfp5": {POS: NUM, "morph": 
"Case=Voc|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "NAfp6": {POS: NUM, "morph": "Case=Loc|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "NAfp7": {POS: NUM, "morph": "Case=Ins|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "NAfs1": {POS: NUM, "morph": "Case=Nom|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "NAfs2": {POS: NUM, "morph": "Case=Gen|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "NAfs3": {POS: NUM, "morph": "Case=Dat|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "NAfs4": {POS: NUM, "morph": "Case=Acc|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "NAfs5": {POS: NUM, "morph": "Case=Voc|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "NAfs6": {POS: NUM, "morph": "Case=Loc|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "NAfs7": {POS: NUM, "morph": "Case=Ins|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "NAip1": {POS: NUM, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "NAip2": {POS: NUM, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "NAip3": {POS: NUM, "morph": "Animacy=Inan|Case=Dat|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "NAip4": {POS: NUM, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "NAip5": {POS: NUM, "morph": "Animacy=Inan|Case=Voc|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "NAip6": {POS: NUM, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "NAip7": {POS: NUM, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "NAis1": {POS: NUM, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "NAis2": {POS: NUM, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "NAis3": {POS: NUM, "morph": "Animacy=Inan|Case=Dat|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "NAis4": {POS: NUM, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "NAis5": {POS: NUM, "morph": "Animacy=Inan|Case=Voc|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "NAis6": {POS: NUM, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "NAis7": {POS: NUM, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "NAmp1": {POS: NUM, "morph": "Animacy=Anim|Case=Nom|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "NAmp2": {POS: NUM, "morph": "Animacy=Anim|Case=Gen|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "NAmp3": {POS: NUM, "morph": "Animacy=Anim|Case=Dat|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "NAmp4": {POS: NUM, "morph": "Animacy=Anim|Case=Acc|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "NAmp5": {POS: NUM, "morph": "Animacy=Anim|Case=Voc|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "NAmp6": {POS: NUM, "morph": "Animacy=Anim|Case=Loc|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "NAmp7": {POS: NUM, "morph": "Animacy=Anim|Case=Ins|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "NAms1": {POS: NUM, "morph": "Animacy=Anim|Case=Nom|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "NAms2": {POS: NUM, "morph": "Animacy=Anim|Case=Gen|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "NAms3": {POS: NUM, "morph": "Animacy=Anim|Case=Dat|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "NAms4": {POS: NUM, "morph": "Animacy=Anim|Case=Acc|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "NAms5": {POS: NUM, "morph": "Animacy=Anim|Case=Voc|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "NAms6": {POS: NUM, "morph": "Animacy=Anim|Case=Loc|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "NAms7": {POS: NUM, "morph": "Animacy=Anim|Case=Ins|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "NAnp1": {POS: NUM, "morph": "Case=Nom|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "NAnp2": {POS: NUM, "morph": 
"Case=Gen|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "NAnp3": {POS: NUM, "morph": "Case=Dat|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "NAnp4": {POS: NUM, "morph": "Case=Acc|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "NAnp5": {POS: NUM, "morph": "Case=Voc|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "NAnp6": {POS: NUM, "morph": "Case=Loc|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "NAnp7": {POS: NUM, "morph": "Case=Ins|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "NAns1": {POS: NUM, "morph": "Case=Nom|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "NAns2": {POS: NUM, "morph": "Case=Gen|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "NAns3": {POS: NUM, "morph": "Case=Dat|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "NAns4": {POS: NUM, "morph": "Case=Acc|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "NAns5": {POS: NUM, "morph": "Case=Voc|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "NAns6": {POS: NUM, "morph": "Case=Loc|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "NAns7": {POS: NUM, "morph": "Case=Ins|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "ND": {POS: NUM, "morph": "MorphPos=Adv"}, - "NFfp1": {POS: NUM, "morph": "Case=Nom|Gender=Fem|MorphPos=Mix|Number=Plur"}, - "NFfp2": {POS: NUM, "morph": "Case=Gen|Gender=Fem|MorphPos=Mix|Number=Plur"}, - "NFfp3": {POS: NUM, "morph": "Case=Dat|Gender=Fem|MorphPos=Mix|Number=Plur"}, - "NFfp4": {POS: NUM, "morph": "Case=Acc|Gender=Fem|MorphPos=Mix|Number=Plur"}, - "NFfp5": {POS: NUM, "morph": "Case=Voc|Gender=Fem|MorphPos=Mix|Number=Plur"}, - "NFfp6": {POS: NUM, "morph": "Case=Loc|Gender=Fem|MorphPos=Mix|Number=Plur"}, - "NFfp7": {POS: NUM, "morph": "Case=Ins|Gender=Fem|MorphPos=Mix|Number=Plur"}, - "NFfs1": {POS: NUM, "morph": "Case=Nom|Gender=Fem|MorphPos=Mix|Number=Sing"}, - "NFfs2": {POS: NUM, "morph": "Case=Gen|Gender=Fem|MorphPos=Mix|Number=Sing"}, - "NFfs3": {POS: NUM, "morph": "Case=Dat|Gender=Fem|MorphPos=Mix|Number=Sing"}, - "NFfs4": {POS: NUM, "morph": "Case=Acc|Gender=Fem|MorphPos=Mix|Number=Sing"}, - "NFfs5": {POS: NUM, "morph": "Case=Voc|Gender=Fem|MorphPos=Mix|Number=Sing"}, - "NFfs6": {POS: NUM, "morph": "Case=Loc|Gender=Fem|MorphPos=Mix|Number=Sing"}, - "NFfs7": {POS: NUM, "morph": "Case=Ins|Gender=Fem|MorphPos=Mix|Number=Sing"}, - "NFip1": {POS: NUM, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|MorphPos=Mix|Number=Plur"}, - "NFip2": {POS: NUM, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|MorphPos=Mix|Number=Plur"}, - "NFip3": {POS: NUM, "morph": "Animacy=Inan|Case=Dat|Gender=Masc|MorphPos=Mix|Number=Plur"}, - "NFip4": {POS: NUM, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|MorphPos=Mix|Number=Plur"}, - "NFip5": {POS: NUM, "morph": "Animacy=Inan|Case=Voc|Gender=Masc|MorphPos=Mix|Number=Plur"}, - "NFip6": {POS: NUM, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|MorphPos=Mix|Number=Plur"}, - "NFip7": {POS: NUM, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|MorphPos=Mix|Number=Plur"}, - "NFis1": {POS: NUM, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|MorphPos=Mix|Number=Sing"}, - "NFis2": {POS: NUM, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|MorphPos=Mix|Number=Sing"}, - "NFis3": {POS: NUM, "morph": "Animacy=Inan|Case=Dat|Gender=Masc|MorphPos=Mix|Number=Sing"}, - "NFis4": {POS: NUM, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|MorphPos=Mix|Number=Sing"}, - "NFis5": {POS: NUM, "morph": "Animacy=Inan|Case=Voc|Gender=Masc|MorphPos=Mix|Number=Sing"}, - "NFis6": {POS: NUM, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|MorphPos=Mix|Number=Sing"}, - "NFis7": {POS: NUM, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|MorphPos=Mix|Number=Sing"}, - "NFmp1": {POS: NUM, "morph": 
"Animacy=Anim|Case=Nom|Gender=Masc|MorphPos=Mix|Number=Plur"}, - "NFmp2": {POS: NUM, "morph": "Animacy=Anim|Case=Gen|Gender=Masc|MorphPos=Mix|Number=Plur"}, - "NFmp3": {POS: NUM, "morph": "Animacy=Anim|Case=Dat|Gender=Masc|MorphPos=Mix|Number=Plur"}, - "NFmp4": {POS: NUM, "morph": "Animacy=Anim|Case=Acc|Gender=Masc|MorphPos=Mix|Number=Plur"}, - "NFmp5": {POS: NUM, "morph": "Animacy=Anim|Case=Voc|Gender=Masc|MorphPos=Mix|Number=Plur"}, - "NFmp6": {POS: NUM, "morph": "Animacy=Anim|Case=Loc|Gender=Masc|MorphPos=Mix|Number=Plur"}, - "NFmp7": {POS: NUM, "morph": "Animacy=Anim|Case=Ins|Gender=Masc|MorphPos=Mix|Number=Plur"}, - "NFms1": {POS: NUM, "morph": "Animacy=Anim|Case=Nom|Gender=Masc|MorphPos=Mix|Number=Sing"}, - "NFms2": {POS: NUM, "morph": "Animacy=Anim|Case=Gen|Gender=Masc|MorphPos=Mix|Number=Sing"}, - "NFms3": {POS: NUM, "morph": "Animacy=Anim|Case=Dat|Gender=Masc|MorphPos=Mix|Number=Sing"}, - "NFms4": {POS: NUM, "morph": "Animacy=Anim|Case=Acc|Gender=Masc|MorphPos=Mix|Number=Sing"}, - "NFms5": {POS: NUM, "morph": "Animacy=Anim|Case=Voc|Gender=Masc|MorphPos=Mix|Number=Sing"}, - "NFms6": {POS: NUM, "morph": "Animacy=Anim|Case=Loc|Gender=Masc|MorphPos=Mix|Number=Sing"}, - "NFms7": {POS: NUM, "morph": "Animacy=Anim|Case=Ins|Gender=Masc|MorphPos=Mix|Number=Sing"}, - "NFnp1": {POS: NUM, "morph": "Case=Nom|Gender=Neut|MorphPos=Mix|Number=Plur"}, - "NFnp2": {POS: NUM, "morph": "Case=Gen|Gender=Neut|MorphPos=Mix|Number=Plur"}, - "NFnp3": {POS: NUM, "morph": "Case=Dat|Gender=Neut|MorphPos=Mix|Number=Plur"}, - "NFnp4": {POS: NUM, "morph": "Case=Acc|Gender=Neut|MorphPos=Mix|Number=Plur"}, - "NFnp5": {POS: NUM, "morph": "Case=Voc|Gender=Neut|MorphPos=Mix|Number=Plur"}, - "NFnp6": {POS: NUM, "morph": "Case=Loc|Gender=Neut|MorphPos=Mix|Number=Plur"}, - "NFnp7": {POS: NUM, "morph": "Case=Ins|Gender=Neut|MorphPos=Mix|Number=Plur"}, - "NFns1": {POS: NUM, "morph": "Case=Nom|Gender=Neut|MorphPos=Mix|Number=Sing"}, - "NFns2": {POS: NUM, "morph": "Case=Gen|Gender=Neut|MorphPos=Mix|Number=Sing"}, - "NFns3": {POS: NUM, "morph": "Case=Dat|Gender=Neut|MorphPos=Mix|Number=Sing"}, - "NFns4": {POS: NUM, "morph": "Case=Acc|Gender=Neut|MorphPos=Mix|Number=Sing"}, - "NFns5": {POS: NUM, "morph": "Case=Voc|Gender=Neut|MorphPos=Mix|Number=Sing"}, - "NFns6": {POS: NUM, "morph": "Case=Loc|Gender=Neut|MorphPos=Mix|Number=Sing"}, - "NFns7": {POS: NUM, "morph": "Case=Ins|Gender=Neut|MorphPos=Mix|Number=Sing"}, - "NNfp1": {POS: NUM, "morph": "Case=Nom|Gender=Fem|MorphPos=Num|Number=Plur"}, - "NNfp2": {POS: NUM, "morph": "Case=Gen|Gender=Fem|MorphPos=Num|Number=Plur"}, - "NNfp3": {POS: NUM, "morph": "Case=Dat|Gender=Fem|MorphPos=Num|Number=Plur"}, - "NNfp4": {POS: NUM, "morph": "Case=Acc|Gender=Fem|MorphPos=Num|Number=Plur"}, - "NNfp5": {POS: NUM, "morph": "Case=Voc|Gender=Fem|MorphPos=Num|Number=Plur"}, - "NNfp6": {POS: NUM, "morph": "Case=Loc|Gender=Fem|MorphPos=Num|Number=Plur"}, - "NNfp7": {POS: NUM, "morph": "Case=Ins|Gender=Fem|MorphPos=Num|Number=Plur"}, - "NNip1": {POS: NUM, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|MorphPos=Num|Number=Plur"}, - "NNip2": {POS: NUM, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|MorphPos=Num|Number=Plur"}, - "NNip3": {POS: NUM, "morph": "Animacy=Inan|Case=Dat|Gender=Masc|MorphPos=Num|Number=Plur"}, - "NNip4": {POS: NUM, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|MorphPos=Num|Number=Plur"}, - "NNip5": {POS: NUM, "morph": "Animacy=Inan|Case=Voc|Gender=Masc|MorphPos=Num|Number=Plur"}, - "NNip6": {POS: NUM, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|MorphPos=Num|Number=Plur"}, - "NNip7": 
{POS: NUM, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|MorphPos=Num|Number=Plur"}, - "NNmp1": {POS: NUM, "morph": "Animacy=Anim|Case=Nom|Gender=Masc|MorphPos=Num|Number=Plur"}, - "NNmp2": {POS: NUM, "morph": "Animacy=Anim|Case=Gen|Gender=Masc|MorphPos=Num|Number=Plur"}, - "NNmp3": {POS: NUM, "morph": "Animacy=Anim|Case=Dat|Gender=Masc|MorphPos=Num|Number=Plur"}, - "NNmp4": {POS: NUM, "morph": "Animacy=Anim|Case=Acc|Gender=Masc|MorphPos=Num|Number=Plur"}, - "NNmp5": {POS: NUM, "morph": "Animacy=Anim|Case=Voc|Gender=Masc|MorphPos=Num|Number=Plur"}, - "NNmp6": {POS: NUM, "morph": "Animacy=Anim|Case=Loc|Gender=Masc|MorphPos=Num|Number=Plur"}, - "NNmp7": {POS: NUM, "morph": "Animacy=Anim|Case=Ins|Gender=Masc|MorphPos=Num|Number=Plur"}, - "NNnp1": {POS: NUM, "morph": "Case=Nom|Gender=Neut|MorphPos=Num|Number=Plur"}, - "NNnp2": {POS: NUM, "morph": "Case=Gen|Gender=Neut|MorphPos=Num|Number=Plur"}, - "NNnp3": {POS: NUM, "morph": "Case=Dat|Gender=Neut|MorphPos=Num|Number=Plur"}, - "NNnp4": {POS: NUM, "morph": "Case=Acc|Gender=Neut|MorphPos=Num|Number=Plur"}, - "NNnp5": {POS: NUM, "morph": "Case=Voc|Gender=Neut|MorphPos=Num|Number=Plur"}, - "NNnp6": {POS: NUM, "morph": "Case=Loc|Gender=Neut|MorphPos=Num|Number=Plur"}, - "NNnp7": {POS: NUM, "morph": "Case=Ins|Gender=Neut|MorphPos=Num|Number=Plur"}, - "NSfp1": {POS: NUM, "morph": "Case=Nom|Gender=Fem|MorphPos=Noun|Number=Plur"}, - "NSfp2": {POS: NUM, "morph": "Case=Gen|Gender=Fem|MorphPos=Noun|Number=Plur"}, - "NSfp3": {POS: NUM, "morph": "Case=Dat|Gender=Fem|MorphPos=Noun|Number=Plur"}, - "NSfp4": {POS: NUM, "morph": "Case=Acc|Gender=Fem|MorphPos=Noun|Number=Plur"}, - "NSfp5": {POS: NUM, "morph": "Case=Voc|Gender=Fem|MorphPos=Noun|Number=Plur"}, - "NSfp6": {POS: NUM, "morph": "Case=Loc|Gender=Fem|MorphPos=Noun|Number=Plur"}, - "NSfp7": {POS: NUM, "morph": "Case=Ins|Gender=Fem|MorphPos=Noun|Number=Plur"}, - "NSfs1": {POS: NUM, "morph": "Case=Nom|Gender=Fem|MorphPos=Noun|Number=Sing"}, - "NSfs2": {POS: NUM, "morph": "Case=Gen|Gender=Fem|MorphPos=Noun|Number=Sing"}, - "NSfs3": {POS: NUM, "morph": "Case=Dat|Gender=Fem|MorphPos=Noun|Number=Sing"}, - "NSfs4": {POS: NUM, "morph": "Case=Acc|Gender=Fem|MorphPos=Noun|Number=Sing"}, - "NSfs5": {POS: NUM, "morph": "Case=Voc|Gender=Fem|MorphPos=Noun|Number=Sing"}, - "NSfs6": {POS: NUM, "morph": "Case=Loc|Gender=Fem|MorphPos=Noun|Number=Sing"}, - "NSfs7": {POS: NUM, "morph": "Case=Ins|Gender=Fem|MorphPos=Noun|Number=Sing"}, - "NSip1": {POS: NUM, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|MorphPos=Noun|Number=Plur"}, - "NSip2": {POS: NUM, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|MorphPos=Noun|Number=Plur"}, - "NSip3": {POS: NUM, "morph": "Animacy=Inan|Case=Dat|Gender=Masc|MorphPos=Noun|Number=Plur"}, - "NSip4": {POS: NUM, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|MorphPos=Noun|Number=Plur"}, - "NSip5": {POS: NUM, "morph": "Animacy=Inan|Case=Voc|Gender=Masc|MorphPos=Noun|Number=Plur"}, - "NSip6": {POS: NUM, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|MorphPos=Noun|Number=Plur"}, - "NSip7": {POS: NUM, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|MorphPos=Noun|Number=Plur"}, - "NSis1": {POS: NUM, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|MorphPos=Noun|Number=Sing"}, - "NSis2": {POS: NUM, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|MorphPos=Noun|Number=Sing"}, - "NSis3": {POS: NUM, "morph": "Animacy=Inan|Case=Dat|Gender=Masc|MorphPos=Noun|Number=Sing"}, - "NSis4": {POS: NUM, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|MorphPos=Noun|Number=Sing"}, - "NSis5": {POS: NUM, "morph": 
"Animacy=Inan|Case=Voc|Gender=Masc|MorphPos=Noun|Number=Sing"}, - "NSis6": {POS: NUM, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|MorphPos=Noun|Number=Sing"}, - "NSis7": {POS: NUM, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|MorphPos=Noun|Number=Sing"}, - "NUfp1": {POS: NUM, "morph": "Case=Nom|Gender=Fem|MorphPos=Def|Number=Plur"}, - "NUfp2": {POS: NUM, "morph": "Case=Gen|Gender=Fem|MorphPos=Def|Number=Plur"}, - "NUfp3": {POS: NUM, "morph": "Case=Dat|Gender=Fem|MorphPos=Def|Number=Plur"}, - "NUfp4": {POS: NUM, "morph": "Case=Acc|Gender=Fem|MorphPos=Def|Number=Plur"}, - "NUfp5": {POS: NUM, "morph": "Case=Voc|Gender=Fem|MorphPos=Def|Number=Plur"}, - "NUfp6": {POS: NUM, "morph": "Case=Loc|Gender=Fem|MorphPos=Def|Number=Plur"}, - "NUfp7": {POS: NUM, "morph": "Case=Ins|Gender=Fem|MorphPos=Def|Number=Plur"}, - "NUip1": {POS: NUM, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|MorphPos=Def|Number=Plur"}, - "NUip2": {POS: NUM, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|MorphPos=Def|Number=Plur"}, - "NUip3": {POS: NUM, "morph": "Animacy=Inan|Case=Dat|Gender=Masc|MorphPos=Def|Number=Plur"}, - "NUip4": {POS: NUM, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|MorphPos=Def|Number=Plur"}, - "NUip5": {POS: NUM, "morph": "Animacy=Inan|Case=Voc|Gender=Masc|MorphPos=Def|Number=Plur"}, - "NUip6": {POS: NUM, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|MorphPos=Def|Number=Plur"}, - "NUip7": {POS: NUM, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|MorphPos=Def|Number=Plur"}, - "NUis1": {POS: NUM, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|MorphPos=Def|Number=Sing"}, - "NUis2": {POS: NUM, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|MorphPos=Def|Number=Sing"}, - "NUis3": {POS: NUM, "morph": "Animacy=Inan|Case=Dat|Gender=Masc|MorphPos=Def|Number=Sing"}, - "NUis4": {POS: NUM, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|MorphPos=Def|Number=Sing"}, - "NUis5": {POS: NUM, "morph": "Animacy=Inan|Case=Voc|Gender=Masc|MorphPos=Def|Number=Sing"}, - "NUis6": {POS: NUM, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|MorphPos=Def|Number=Sing"}, - "NUis7": {POS: NUM, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|MorphPos=Def|Number=Sing"}, - "NUmp1": {POS: NUM, "morph": "Animacy=Anim|Case=Nom|Gender=Masc|MorphPos=Def|Number=Plur"}, - "NUmp2": {POS: NUM, "morph": "Animacy=Anim|Case=Gen|Gender=Masc|MorphPos=Def|Number=Plur"}, - "NUmp3": {POS: NUM, "morph": "Animacy=Anim|Case=Dat|Gender=Masc|MorphPos=Def|Number=Plur"}, - "NUmp4": {POS: NUM, "morph": "Animacy=Anim|Case=Acc|Gender=Masc|MorphPos=Def|Number=Plur"}, - "NUmp5": {POS: NUM, "morph": "Animacy=Anim|Case=Voc|Gender=Masc|MorphPos=Def|Number=Plur"}, - "NUmp6": {POS: NUM, "morph": "Animacy=Anim|Case=Loc|Gender=Masc|MorphPos=Def|Number=Plur"}, - "NUmp7": {POS: NUM, "morph": "Animacy=Anim|Case=Ins|Gender=Masc|MorphPos=Def|Number=Plur"}, - "NUnp1": {POS: NUM, "morph": "Case=Nom|Gender=Neut|MorphPos=Def|Number=Plur"}, - "NUnp2": {POS: NUM, "morph": "Case=Gen|Gender=Neut|MorphPos=Def|Number=Plur"}, - "NUnp3": {POS: NUM, "morph": "Case=Dat|Gender=Neut|MorphPos=Def|Number=Plur"}, - "NUnp4": {POS: NUM, "morph": "Case=Acc|Gender=Neut|MorphPos=Def|Number=Plur"}, - "NUnp5": {POS: NUM, "morph": "Case=Voc|Gender=Neut|MorphPos=Def|Number=Plur"}, - "NUnp6": {POS: NUM, "morph": "Case=Loc|Gender=Neut|MorphPos=Def|Number=Plur"}, - "NUnp7": {POS: NUM, "morph": "Case=Ins|Gender=Neut|MorphPos=Def|Number=Plur"}, - "NUns1": {POS: NUM, "morph": "Case=Nom|Gender=Neut|MorphPos=Def|Number=Sing"}, - "NUns2": {POS: NUM, "morph": "Case=Gen|Gender=Neut|MorphPos=Def|Number=Sing"}, - "NUns3": {POS: NUM, "morph": 
"Case=Dat|Gender=Neut|MorphPos=Def|Number=Sing"}, - "NUns4": {POS: NUM, "morph": "Case=Acc|Gender=Neut|MorphPos=Def|Number=Sing"}, - "NUns5": {POS: NUM, "morph": "Case=Voc|Gender=Neut|MorphPos=Def|Number=Sing"}, - "NUns6": {POS: NUM, "morph": "Case=Loc|Gender=Neut|MorphPos=Def|Number=Sing"}, - "NUns7": {POS: NUM, "morph": "Case=Ins|Gender=Neut|MorphPos=Def|Number=Sing"}, - "O": {POS: CCONJ, "morph": "_"}, - "OY": {POS: CCONJ, "morph": "Mood=Cnd"}, - "PAfp1": {POS: PRON, "morph": "Case=Nom|Gender=Fem|MorphPos=Adj|Number=Plur|PronType=Prs"}, - "PAfp2": {POS: PRON, "morph": "Case=Gen|Gender=Fem|MorphPos=Adj|Number=Plur|PronType=Prs"}, - "PAfp3": {POS: PRON, "morph": "Case=Dat|Gender=Fem|MorphPos=Adj|Number=Plur|PronType=Prs"}, - "PAfp4": {POS: PRON, "morph": "Case=Acc|Gender=Fem|MorphPos=Adj|Number=Plur|PronType=Prs"}, - "PAfp5": {POS: PRON, "morph": "Case=Voc|Gender=Fem|MorphPos=Adj|Number=Plur|PronType=Prs"}, - "PAfp6": {POS: PRON, "morph": "Case=Loc|Gender=Fem|MorphPos=Adj|Number=Plur|PronType=Prs"}, - "PAfp7": {POS: PRON, "morph": "Case=Ins|Gender=Fem|MorphPos=Adj|Number=Plur|PronType=Prs"}, - "PAfs1": {POS: PRON, "morph": "Case=Nom|Gender=Fem|MorphPos=Adj|Number=Sing|PronType=Prs"}, - "PAfs2": {POS: PRON, "morph": "Case=Gen|Gender=Fem|MorphPos=Adj|Number=Sing|PronType=Prs"}, - "PAfs3": {POS: PRON, "morph": "Case=Dat|Gender=Fem|MorphPos=Adj|Number=Sing|PronType=Prs"}, - "PAfs4": {POS: PRON, "morph": "Case=Acc|Gender=Fem|MorphPos=Adj|Number=Sing|PronType=Prs"}, - "PAfs5": {POS: PRON, "morph": "Case=Voc|Gender=Fem|MorphPos=Adj|Number=Sing|PronType=Prs"}, - "PAfs6": {POS: PRON, "morph": "Case=Loc|Gender=Fem|MorphPos=Adj|Number=Sing|PronType=Prs"}, - "PAfs7": {POS: PRON, "morph": "Case=Ins|Gender=Fem|MorphPos=Adj|Number=Sing|PronType=Prs"}, - "PAip1": {POS: PRON, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|MorphPos=Adj|Number=Plur|PronType=Prs"}, - "PAip2": {POS: PRON, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|MorphPos=Adj|Number=Plur|PronType=Prs"}, - "PAip3": {POS: PRON, "morph": "Animacy=Inan|Case=Dat|Gender=Masc|MorphPos=Adj|Number=Plur|PronType=Prs"}, - "PAip4": {POS: PRON, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|MorphPos=Adj|Number=Plur|PronType=Prs"}, - "PAip5": {POS: PRON, "morph": "Animacy=Inan|Case=Voc|Gender=Masc|MorphPos=Adj|Number=Plur|PronType=Prs"}, - "PAip6": {POS: PRON, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|MorphPos=Adj|Number=Plur|PronType=Prs"}, - "PAip7": {POS: PRON, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|MorphPos=Adj|Number=Plur|PronType=Prs"}, - "PAis1": {POS: PRON, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|MorphPos=Adj|Number=Sing|PronType=Prs"}, - "PAis2": {POS: PRON, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|MorphPos=Adj|Number=Sing|PronType=Prs"}, - "PAis3": {POS: PRON, "morph": "Animacy=Inan|Case=Dat|Gender=Masc|MorphPos=Adj|Number=Sing|PronType=Prs"}, - "PAis4": {POS: PRON, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|MorphPos=Adj|Number=Sing|PronType=Prs"}, - "PAis5": {POS: PRON, "morph": "Animacy=Inan|Case=Voc|Gender=Masc|MorphPos=Adj|Number=Sing|PronType=Prs"}, - "PAis6": {POS: PRON, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|MorphPos=Adj|Number=Sing|PronType=Prs"}, - "PAis7": {POS: PRON, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|MorphPos=Adj|Number=Sing|PronType=Prs"}, - "PAmp1": {POS: PRON, "morph": "Animacy=Anim|Case=Nom|Gender=Masc|MorphPos=Adj|Number=Plur|PronType=Prs"}, - "PAmp2": {POS: PRON, "morph": "Animacy=Anim|Case=Gen|Gender=Masc|MorphPos=Adj|Number=Plur|PronType=Prs"}, - "PAmp3": {POS: PRON, "morph": 
"Animacy=Anim|Case=Dat|Gender=Masc|MorphPos=Adj|Number=Plur|PronType=Prs"}, - "PAmp4": {POS: PRON, "morph": "Animacy=Anim|Case=Acc|Gender=Masc|MorphPos=Adj|Number=Plur|PronType=Prs"}, - "PAmp5": {POS: PRON, "morph": "Animacy=Anim|Case=Voc|Gender=Masc|MorphPos=Adj|Number=Plur|PronType=Prs"}, - "PAmp6": {POS: PRON, "morph": "Animacy=Anim|Case=Loc|Gender=Masc|MorphPos=Adj|Number=Plur|PronType=Prs"}, - "PAmp7": {POS: PRON, "morph": "Animacy=Anim|Case=Ins|Gender=Masc|MorphPos=Adj|Number=Plur|PronType=Prs"}, - "PAms1": {POS: PRON, "morph": "Animacy=Anim|Case=Nom|Gender=Masc|MorphPos=Adj|Number=Sing|PronType=Prs"}, - "PAms2": {POS: PRON, "morph": "Animacy=Anim|Case=Gen|Gender=Masc|MorphPos=Adj|Number=Sing|PronType=Prs"}, - "PAms3": {POS: PRON, "morph": "Animacy=Anim|Case=Dat|Gender=Masc|MorphPos=Adj|Number=Sing|PronType=Prs"}, - "PAms4": {POS: PRON, "morph": "Animacy=Anim|Case=Acc|Gender=Masc|MorphPos=Adj|Number=Sing|PronType=Prs"}, - "PAms5": {POS: PRON, "morph": "Animacy=Anim|Case=Voc|Gender=Masc|MorphPos=Adj|Number=Sing|PronType=Prs"}, - "PAms6": {POS: PRON, "morph": "Animacy=Anim|Case=Loc|Gender=Masc|MorphPos=Adj|Number=Sing|PronType=Prs"}, - "PAms7": {POS: PRON, "morph": "Animacy=Anim|Case=Ins|Gender=Masc|MorphPos=Adj|Number=Sing|PronType=Prs"}, - "PAnp1": {POS: PRON, "morph": "Case=Nom|Gender=Neut|MorphPos=Adj|Number=Plur|PronType=Prs"}, - "PAnp2": {POS: PRON, "morph": "Case=Gen|Gender=Neut|MorphPos=Adj|Number=Plur|PronType=Prs"}, - "PAnp3": {POS: PRON, "morph": "Case=Dat|Gender=Neut|MorphPos=Adj|Number=Plur|PronType=Prs"}, - "PAnp4": {POS: PRON, "morph": "Case=Acc|Gender=Neut|MorphPos=Adj|Number=Plur|PronType=Prs"}, - "PAnp5": {POS: PRON, "morph": "Case=Voc|Gender=Neut|MorphPos=Adj|Number=Plur|PronType=Prs"}, - "PAnp6": {POS: PRON, "morph": "Case=Loc|Gender=Neut|MorphPos=Adj|Number=Plur|PronType=Prs"}, - "PAnp7": {POS: PRON, "morph": "Case=Ins|Gender=Neut|MorphPos=Adj|Number=Plur|PronType=Prs"}, - "PAns1": {POS: PRON, "morph": "Case=Nom|Gender=Neut|MorphPos=Adj|Number=Sing|PronType=Prs"}, - "PAns2": {POS: PRON, "morph": "Case=Gen|Gender=Neut|MorphPos=Adj|Number=Sing|PronType=Prs"}, - "PAns3": {POS: PRON, "morph": "Case=Dat|Gender=Neut|MorphPos=Adj|Number=Sing|PronType=Prs"}, - "PAns4": {POS: PRON, "morph": "Case=Acc|Gender=Neut|MorphPos=Adj|Number=Sing|PronType=Prs"}, - "PAns5": {POS: PRON, "morph": "Case=Voc|Gender=Neut|MorphPos=Adj|Number=Sing|PronType=Prs"}, - "PAns6": {POS: PRON, "morph": "Case=Loc|Gender=Neut|MorphPos=Adj|Number=Sing|PronType=Prs"}, - "PAns7": {POS: PRON, "morph": "Case=Ins|Gender=Neut|MorphPos=Adj|Number=Sing|PronType=Prs"}, - "PD": {POS: PRON, "morph": "MorphPos=Adv|PronType=Prs"}, - "PFfp1": {POS: PRON, "morph": "Case=Nom|Gender=Fem|MorphPos=Mix|Number=Plur|PronType=Prs"}, - "PFfp2": {POS: PRON, "morph": "Case=Gen|Gender=Fem|MorphPos=Mix|Number=Plur|PronType=Prs"}, - "PFfp3": {POS: PRON, "morph": "Case=Dat|Gender=Fem|MorphPos=Mix|Number=Plur|PronType=Prs"}, - "PFfp4": {POS: PRON, "morph": "Case=Acc|Gender=Fem|MorphPos=Mix|Number=Plur|PronType=Prs"}, - "PFfp5": {POS: PRON, "morph": "Case=Voc|Gender=Fem|MorphPos=Mix|Number=Plur|PronType=Prs"}, - "PFfp6": {POS: PRON, "morph": "Case=Loc|Gender=Fem|MorphPos=Mix|Number=Plur|PronType=Prs"}, - "PFfp7": {POS: PRON, "morph": "Case=Ins|Gender=Fem|MorphPos=Mix|Number=Plur|PronType=Prs"}, - "PFfs1": {POS: PRON, "morph": "Case=Nom|Gender=Fem|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFfs2": {POS: PRON, "morph": "Case=Gen|Gender=Fem|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFfs3": {POS: PRON, "morph": 
"Case=Dat|Gender=Fem|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFfs4": {POS: PRON, "morph": "Case=Acc|Gender=Fem|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFfs5": {POS: PRON, "morph": "Case=Voc|Gender=Fem|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFfs6": {POS: PRON, "morph": "Case=Loc|Gender=Fem|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFfs7": {POS: PRON, "morph": "Case=Ins|Gender=Fem|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFip1": {POS: PRON, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|MorphPos=Mix|Number=Plur|PronType=Prs"}, - "PFip2": {POS: PRON, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|MorphPos=Mix|Number=Plur|PronType=Prs"}, - "PFip3": {POS: PRON, "morph": "Animacy=Inan|Case=Dat|Gender=Masc|MorphPos=Mix|Number=Plur|PronType=Prs"}, - "PFip4": {POS: PRON, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|MorphPos=Mix|Number=Plur|PronType=Prs"}, - "PFip5": {POS: PRON, "morph": "Animacy=Inan|Case=Voc|Gender=Masc|MorphPos=Mix|Number=Plur|PronType=Prs"}, - "PFip6": {POS: PRON, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|MorphPos=Mix|Number=Plur|PronType=Prs"}, - "PFip7": {POS: PRON, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|MorphPos=Mix|Number=Plur|PronType=Prs"}, - "PFis1": {POS: PRON, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFis2": {POS: PRON, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFis2g": {POS: PRON, "morph": "AdpType=Preppron|Animacy=Inan|Case=Gen|Gender=Masc|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFis3": {POS: PRON, "morph": "Animacy=Inan|Case=Dat|Gender=Masc|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFis4": {POS: PRON, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFis4g": {POS: PRON, "morph": "AdpType=Preppron|Animacy=Inan|Case=Acc|Gender=Masc|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFis5": {POS: PRON, "morph": "Animacy=Inan|Case=Voc|Gender=Masc|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFis6": {POS: PRON, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFis7": {POS: PRON, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFmp1": {POS: PRON, "morph": "Animacy=Anim|Case=Nom|Gender=Masc|MorphPos=Mix|Number=Plur|PronType=Prs"}, - "PFmp2": {POS: PRON, "morph": "Animacy=Anim|Case=Gen|Gender=Masc|MorphPos=Mix|Number=Plur|PronType=Prs"}, - "PFmp3": {POS: PRON, "morph": "Animacy=Anim|Case=Dat|Gender=Masc|MorphPos=Mix|Number=Plur|PronType=Prs"}, - "PFmp4": {POS: PRON, "morph": "Animacy=Anim|Case=Acc|Gender=Masc|MorphPos=Mix|Number=Plur|PronType=Prs"}, - "PFmp5": {POS: PRON, "morph": "Animacy=Anim|Case=Voc|Gender=Masc|MorphPos=Mix|Number=Plur|PronType=Prs"}, - "PFmp6": {POS: PRON, "morph": "Animacy=Anim|Case=Loc|Gender=Masc|MorphPos=Mix|Number=Plur|PronType=Prs"}, - "PFmp7": {POS: PRON, "morph": "Animacy=Anim|Case=Ins|Gender=Masc|MorphPos=Mix|Number=Plur|PronType=Prs"}, - "PFms1": {POS: PRON, "morph": "Animacy=Anim|Case=Nom|Gender=Masc|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFms2": {POS: PRON, "morph": "Animacy=Anim|Case=Gen|Gender=Masc|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFms2g": {POS: PRON, "morph": "AdpType=Preppron|Animacy=Anim|Case=Gen|Gender=Masc|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFms3": {POS: PRON, "morph": "Animacy=Anim|Case=Dat|Gender=Masc|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFms4": {POS: PRON, "morph": "Animacy=Anim|Case=Acc|Gender=Masc|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFms4g": {POS: PRON, "morph": 
"AdpType=Preppron|Animacy=Anim|Case=Acc|Gender=Masc|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFms5": {POS: PRON, "morph": "Animacy=Anim|Case=Voc|Gender=Masc|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFms6": {POS: PRON, "morph": "Animacy=Anim|Case=Loc|Gender=Masc|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFms7": {POS: PRON, "morph": "Animacy=Anim|Case=Ins|Gender=Masc|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFnp1": {POS: PRON, "morph": "Case=Nom|Gender=Neut|MorphPos=Mix|Number=Plur|PronType=Prs"}, - "PFnp2": {POS: PRON, "morph": "Case=Gen|Gender=Neut|MorphPos=Mix|Number=Plur|PronType=Prs"}, - "PFnp3": {POS: PRON, "morph": "Case=Dat|Gender=Neut|MorphPos=Mix|Number=Plur|PronType=Prs"}, - "PFnp4": {POS: PRON, "morph": "Case=Acc|Gender=Neut|MorphPos=Mix|Number=Plur|PronType=Prs"}, - "PFnp5": {POS: PRON, "morph": "Case=Voc|Gender=Neut|MorphPos=Mix|Number=Plur|PronType=Prs"}, - "PFnp6": {POS: PRON, "morph": "Case=Loc|Gender=Neut|MorphPos=Mix|Number=Plur|PronType=Prs"}, - "PFnp7": {POS: PRON, "morph": "Case=Ins|Gender=Neut|MorphPos=Mix|Number=Plur|PronType=Prs"}, - "PFns1": {POS: PRON, "morph": "Case=Nom|Gender=Neut|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFns2": {POS: PRON, "morph": "Case=Gen|Gender=Neut|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFns2g": {POS: PRON, "morph": "AdpType=Preppron|Case=Gen|Gender=Neut|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFns3": {POS: PRON, "morph": "Case=Dat|Gender=Neut|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFns4": {POS: PRON, "morph": "Case=Acc|Gender=Neut|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFns4g": {POS: PRON, "morph": "AdpType=Preppron|Case=Acc|Gender=Neut|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFns5": {POS: PRON, "morph": "Case=Voc|Gender=Neut|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFns6": {POS: PRON, "morph": "Case=Loc|Gender=Neut|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PFns7": {POS: PRON, "morph": "Case=Ins|Gender=Neut|MorphPos=Mix|Number=Sing|PronType=Prs"}, - "PPhp1": {POS: PRON, "morph": "Case=Nom|MorphPos=Pron|Number=Plur|PronType=Prs"}, - "PPhp2": {POS: PRON, "morph": "Case=Gen|MorphPos=Pron|Number=Plur|PronType=Prs"}, - "PPhp3": {POS: PRON, "morph": "Case=Dat|MorphPos=Pron|Number=Plur|PronType=Prs"}, - "PPhp4": {POS: PRON, "morph": "Case=Acc|MorphPos=Pron|Number=Plur|PronType=Prs"}, - "PPhp5": {POS: PRON, "morph": "Case=Voc|MorphPos=Pron|Number=Plur|PronType=Prs"}, - "PPhp6": {POS: PRON, "morph": "Case=Loc|MorphPos=Pron|Number=Plur|PronType=Prs"}, - "PPhp7": {POS: PRON, "morph": "Case=Ins|MorphPos=Pron|Number=Plur|PronType=Prs"}, - "PPhs1": {POS: PRON, "morph": "Case=Nom|MorphPos=Pron|Number=Sing|PronType=Prs"}, - "PPhs2": {POS: PRON, "morph": "Case=Gen|MorphPos=Pron|Number=Sing|PronType=Prs"}, - "PPhs3": {POS: PRON, "morph": "Case=Dat|MorphPos=Pron|Number=Sing|PronType=Prs"}, - "PPhs4": {POS: PRON, "morph": "Case=Acc|MorphPos=Pron|Number=Sing|PronType=Prs"}, - "PPhs5": {POS: PRON, "morph": "Case=Voc|MorphPos=Pron|Number=Sing|PronType=Prs"}, - "PPhs6": {POS: PRON, "morph": "Case=Loc|MorphPos=Pron|Number=Sing|PronType=Prs"}, - "PPhs7": {POS: PRON, "morph": "Case=Ins|MorphPos=Pron|Number=Sing|PronType=Prs"}, - "PSfp1": {POS: PRON, "morph": "Case=Nom|Gender=Fem|MorphPos=Noun|Number=Plur|PronType=Prs"}, - "PSfp2": {POS: PRON, "morph": "Case=Gen|Gender=Fem|MorphPos=Noun|Number=Plur|PronType=Prs"}, - "PSfp3": {POS: PRON, "morph": "Case=Dat|Gender=Fem|MorphPos=Noun|Number=Plur|PronType=Prs"}, - "PSfp4": {POS: PRON, "morph": "Case=Acc|Gender=Fem|MorphPos=Noun|Number=Plur|PronType=Prs"}, - "PSfp5": {POS: PRON, 
"morph": "Case=Voc|Gender=Fem|MorphPos=Noun|Number=Plur|PronType=Prs"}, - "PSfp6": {POS: PRON, "morph": "Case=Loc|Gender=Fem|MorphPos=Noun|Number=Plur|PronType=Prs"}, - "PSfp7": {POS: PRON, "morph": "Case=Ins|Gender=Fem|MorphPos=Noun|Number=Plur|PronType=Prs"}, - "PSfs1": {POS: PRON, "morph": "Case=Nom|Gender=Fem|MorphPos=Noun|Number=Sing|PronType=Prs"}, - "PSfs2": {POS: PRON, "morph": "Case=Gen|Gender=Fem|MorphPos=Noun|Number=Sing|PronType=Prs"}, - "PSfs3": {POS: PRON, "morph": "Case=Dat|Gender=Fem|MorphPos=Noun|Number=Sing|PronType=Prs"}, - "PSfs4": {POS: PRON, "morph": "Case=Acc|Gender=Fem|MorphPos=Noun|Number=Sing|PronType=Prs"}, - "PSfs5": {POS: PRON, "morph": "Case=Voc|Gender=Fem|MorphPos=Noun|Number=Sing|PronType=Prs"}, - "PSfs6": {POS: PRON, "morph": "Case=Loc|Gender=Fem|MorphPos=Noun|Number=Sing|PronType=Prs"}, - "PSfs7": {POS: PRON, "morph": "Case=Ins|Gender=Fem|MorphPos=Noun|Number=Sing|PronType=Prs"}, - "PSns1": {POS: PRON, "morph": "Case=Nom|Gender=Neut|MorphPos=Noun|Number=Sing|PronType=Prs"}, - "PSns2": {POS: PRON, "morph": "Case=Gen|Gender=Neut|MorphPos=Noun|Number=Sing|PronType=Prs"}, - "PSns3": {POS: PRON, "morph": "Case=Dat|Gender=Neut|MorphPos=Noun|Number=Sing|PronType=Prs"}, - "PSns4": {POS: PRON, "morph": "Case=Acc|Gender=Neut|MorphPos=Noun|Number=Sing|PronType=Prs"}, - "PSns5": {POS: PRON, "morph": "Case=Voc|Gender=Neut|MorphPos=Noun|Number=Sing|PronType=Prs"}, - "PSns6": {POS: PRON, "morph": "Case=Loc|Gender=Neut|MorphPos=Noun|Number=Sing|PronType=Prs"}, - "PSns7": {POS: PRON, "morph": "Case=Ins|Gender=Neut|MorphPos=Noun|Number=Sing|PronType=Prs"}, - "PUfp1": {POS: PRON, "morph": "Case=Nom|Gender=Fem|MorphPos=Def|Number=Plur|PronType=Prs"}, - "PUfp2": {POS: PRON, "morph": "Case=Gen|Gender=Fem|MorphPos=Def|Number=Plur|PronType=Prs"}, - "PUfp3": {POS: PRON, "morph": "Case=Dat|Gender=Fem|MorphPos=Def|Number=Plur|PronType=Prs"}, - "PUfp4": {POS: PRON, "morph": "Case=Acc|Gender=Fem|MorphPos=Def|Number=Plur|PronType=Prs"}, - "PUfp5": {POS: PRON, "morph": "Case=Voc|Gender=Fem|MorphPos=Def|Number=Plur|PronType=Prs"}, - "PUfp6": {POS: PRON, "morph": "Case=Loc|Gender=Fem|MorphPos=Def|Number=Plur|PronType=Prs"}, - "PUfp7": {POS: PRON, "morph": "Case=Ins|Gender=Fem|MorphPos=Def|Number=Plur|PronType=Prs"}, - "PUfs1": {POS: PRON, "morph": "Case=Nom|Gender=Fem|MorphPos=Def|Number=Sing|PronType=Prs"}, - "PUfs2": {POS: PRON, "morph": "Case=Gen|Gender=Fem|MorphPos=Def|Number=Sing|PronType=Prs"}, - "PUfs3": {POS: PRON, "morph": "Case=Dat|Gender=Fem|MorphPos=Def|Number=Sing|PronType=Prs"}, - "PUfs4": {POS: PRON, "morph": "Case=Acc|Gender=Fem|MorphPos=Def|Number=Sing|PronType=Prs"}, - "PUfs5": {POS: PRON, "morph": "Case=Voc|Gender=Fem|MorphPos=Def|Number=Sing|PronType=Prs"}, - "PUfs6": {POS: PRON, "morph": "Case=Loc|Gender=Fem|MorphPos=Def|Number=Sing|PronType=Prs"}, - "PUfs7": {POS: PRON, "morph": "Case=Ins|Gender=Fem|MorphPos=Def|Number=Sing|PronType=Prs"}, - "PUip1": {POS: PRON, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|MorphPos=Def|Number=Plur|PronType=Prs"}, - "PUip2": {POS: PRON, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|MorphPos=Def|Number=Plur|PronType=Prs"}, - "PUip3": {POS: PRON, "morph": "Animacy=Inan|Case=Dat|Gender=Masc|MorphPos=Def|Number=Plur|PronType=Prs"}, - "PUip4": {POS: PRON, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|MorphPos=Def|Number=Plur|PronType=Prs"}, - "PUip5": {POS: PRON, "morph": "Animacy=Inan|Case=Voc|Gender=Masc|MorphPos=Def|Number=Plur|PronType=Prs"}, - "PUip6": {POS: PRON, "morph": 
"Animacy=Inan|Case=Loc|Gender=Masc|MorphPos=Def|Number=Plur|PronType=Prs"}, - "PUip7": {POS: PRON, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|MorphPos=Def|Number=Plur|PronType=Prs"}, - "PUis1": {POS: PRON, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|MorphPos=Def|Number=Sing|PronType=Prs"}, - "PUis2": {POS: PRON, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|MorphPos=Def|Number=Sing|PronType=Prs"}, - "PUis3": {POS: PRON, "morph": "Animacy=Inan|Case=Dat|Gender=Masc|MorphPos=Def|Number=Sing|PronType=Prs"}, - "PUis4": {POS: PRON, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|MorphPos=Def|Number=Sing|PronType=Prs"}, - "PUis5": {POS: PRON, "morph": "Animacy=Inan|Case=Voc|Gender=Masc|MorphPos=Def|Number=Sing|PronType=Prs"}, - "PUis6": {POS: PRON, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|MorphPos=Def|Number=Sing|PronType=Prs"}, - "PUis7": {POS: PRON, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|MorphPos=Def|Number=Sing|PronType=Prs"}, - "PUmp1": {POS: PRON, "morph": "Animacy=Anim|Case=Nom|Gender=Masc|MorphPos=Def|Number=Plur|PronType=Prs"}, - "PUmp2": {POS: PRON, "morph": "Animacy=Anim|Case=Gen|Gender=Masc|MorphPos=Def|Number=Plur|PronType=Prs"}, - "PUmp3": {POS: PRON, "morph": "Animacy=Anim|Case=Dat|Gender=Masc|MorphPos=Def|Number=Plur|PronType=Prs"}, - "PUmp4": {POS: PRON, "morph": "Animacy=Anim|Case=Acc|Gender=Masc|MorphPos=Def|Number=Plur|PronType=Prs"}, - "PUmp5": {POS: PRON, "morph": "Animacy=Anim|Case=Voc|Gender=Masc|MorphPos=Def|Number=Plur|PronType=Prs"}, - "PUmp6": {POS: PRON, "morph": "Animacy=Anim|Case=Loc|Gender=Masc|MorphPos=Def|Number=Plur|PronType=Prs"}, - "PUmp7": {POS: PRON, "morph": "Animacy=Anim|Case=Ins|Gender=Masc|MorphPos=Def|Number=Plur|PronType=Prs"}, - "PUms1": {POS: PRON, "morph": "Animacy=Anim|Case=Nom|Gender=Masc|MorphPos=Def|Number=Sing|PronType=Prs"}, - "PUms2": {POS: PRON, "morph": "Animacy=Anim|Case=Gen|Gender=Masc|MorphPos=Def|Number=Sing|PronType=Prs"}, - "PUms3": {POS: PRON, "morph": "Animacy=Anim|Case=Dat|Gender=Masc|MorphPos=Def|Number=Sing|PronType=Prs"}, - "PUms4": {POS: PRON, "morph": "Animacy=Anim|Case=Acc|Gender=Masc|MorphPos=Def|Number=Sing|PronType=Prs"}, - "PUms5": {POS: PRON, "morph": "Animacy=Anim|Case=Voc|Gender=Masc|MorphPos=Def|Number=Sing|PronType=Prs"}, - "PUms6": {POS: PRON, "morph": "Animacy=Anim|Case=Loc|Gender=Masc|MorphPos=Def|Number=Sing|PronType=Prs"}, - "PUms7": {POS: PRON, "morph": "Animacy=Anim|Case=Ins|Gender=Masc|MorphPos=Def|Number=Sing|PronType=Prs"}, - "PUnp1": {POS: PRON, "morph": "Case=Nom|Gender=Neut|MorphPos=Def|Number=Plur|PronType=Prs"}, - "PUnp2": {POS: PRON, "morph": "Case=Gen|Gender=Neut|MorphPos=Def|Number=Plur|PronType=Prs"}, - "PUnp3": {POS: PRON, "morph": "Case=Dat|Gender=Neut|MorphPos=Def|Number=Plur|PronType=Prs"}, - "PUnp4": {POS: PRON, "morph": "Case=Acc|Gender=Neut|MorphPos=Def|Number=Plur|PronType=Prs"}, - "PUnp5": {POS: PRON, "morph": "Case=Voc|Gender=Neut|MorphPos=Def|Number=Plur|PronType=Prs"}, - "PUnp6": {POS: PRON, "morph": "Case=Loc|Gender=Neut|MorphPos=Def|Number=Plur|PronType=Prs"}, - "PUnp7": {POS: PRON, "morph": "Case=Ins|Gender=Neut|MorphPos=Def|Number=Plur|PronType=Prs"}, - "PUns1": {POS: PRON, "morph": "Case=Nom|Gender=Neut|MorphPos=Def|Number=Sing|PronType=Prs"}, - "PUns2": {POS: PRON, "morph": "Case=Gen|Gender=Neut|MorphPos=Def|Number=Sing|PronType=Prs"}, - "PUns3": {POS: PRON, "morph": "Case=Dat|Gender=Neut|MorphPos=Def|Number=Sing|PronType=Prs"}, - "PUns4": {POS: PRON, "morph": "Case=Acc|Gender=Neut|MorphPos=Def|Number=Sing|PronType=Prs"}, - "PUns5": {POS: PRON, "morph": 
"Case=Voc|Gender=Neut|MorphPos=Def|Number=Sing|PronType=Prs"}, - "PUns6": {POS: PRON, "morph": "Case=Loc|Gender=Neut|MorphPos=Def|Number=Sing|PronType=Prs"}, - "PUns7": {POS: PRON, "morph": "Case=Ins|Gender=Neut|MorphPos=Def|Number=Sing|PronType=Prs"}, - "Q": {POS: X, "morph": "Hyph=Yes"}, - "R": {POS: PRON, "morph": "PronType=Prs|Reflex=Yes"}, - "SAfp1": {POS: NOUN, "morph": "Case=Nom|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "SAfp2": {POS: NOUN, "morph": "Case=Gen|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "SAfp3": {POS: NOUN, "morph": "Case=Dat|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "SAfp4": {POS: NOUN, "morph": "Case=Acc|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "SAfp5": {POS: NOUN, "morph": "Case=Voc|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "SAfp6": {POS: NOUN, "morph": "Case=Loc|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "SAfp7": {POS: NOUN, "morph": "Case=Ins|Gender=Fem|MorphPos=Adj|Number=Plur"}, - "SAfs1": {POS: NOUN, "morph": "Case=Nom|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "SAfs2": {POS: NOUN, "morph": "Case=Gen|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "SAfs3": {POS: NOUN, "morph": "Case=Dat|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "SAfs4": {POS: NOUN, "morph": "Case=Acc|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "SAfs5": {POS: NOUN, "morph": "Case=Voc|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "SAfs6": {POS: NOUN, "morph": "Case=Loc|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "SAfs7": {POS: NOUN, "morph": "Case=Ins|Gender=Fem|MorphPos=Adj|Number=Sing"}, - "SAip1": {POS: NOUN, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "SAip2": {POS: NOUN, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "SAip3": {POS: NOUN, "morph": "Animacy=Inan|Case=Dat|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "SAip4": {POS: NOUN, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "SAip5": {POS: NOUN, "morph": "Animacy=Inan|Case=Voc|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "SAip6": {POS: NOUN, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "SAip7": {POS: NOUN, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "SAis1": {POS: NOUN, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "SAis2": {POS: NOUN, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "SAis3": {POS: NOUN, "morph": "Animacy=Inan|Case=Dat|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "SAis4": {POS: NOUN, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "SAis5": {POS: NOUN, "morph": "Animacy=Inan|Case=Voc|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "SAis6": {POS: NOUN, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "SAis7": {POS: NOUN, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "SAmp1": {POS: NOUN, "morph": "Animacy=Anim|Case=Nom|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "SAmp2": {POS: NOUN, "morph": "Animacy=Anim|Case=Gen|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "SAmp3": {POS: NOUN, "morph": "Animacy=Anim|Case=Dat|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "SAmp4": {POS: NOUN, "morph": "Animacy=Anim|Case=Acc|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "SAmp5": {POS: NOUN, "morph": "Animacy=Anim|Case=Voc|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "SAmp6": {POS: NOUN, "morph": "Animacy=Anim|Case=Loc|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "SAmp7": {POS: NOUN, "morph": "Animacy=Anim|Case=Ins|Gender=Masc|MorphPos=Adj|Number=Plur"}, - "SAms1": {POS: NOUN, "morph": 
"Animacy=Anim|Case=Nom|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "SAms2": {POS: NOUN, "morph": "Animacy=Anim|Case=Gen|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "SAms3": {POS: NOUN, "morph": "Animacy=Anim|Case=Dat|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "SAms4": {POS: NOUN, "morph": "Animacy=Anim|Case=Acc|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "SAms5": {POS: NOUN, "morph": "Animacy=Anim|Case=Voc|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "SAms6": {POS: NOUN, "morph": "Animacy=Anim|Case=Loc|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "SAms7": {POS: NOUN, "morph": "Animacy=Anim|Case=Ins|Gender=Masc|MorphPos=Adj|Number=Sing"}, - "SAnp1": {POS: NOUN, "morph": "Case=Nom|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "SAnp2": {POS: NOUN, "morph": "Case=Gen|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "SAnp3": {POS: NOUN, "morph": "Case=Dat|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "SAnp4": {POS: NOUN, "morph": "Case=Acc|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "SAnp5": {POS: NOUN, "morph": "Case=Voc|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "SAnp6": {POS: NOUN, "morph": "Case=Loc|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "SAnp7": {POS: NOUN, "morph": "Case=Ins|Gender=Neut|MorphPos=Adj|Number=Plur"}, - "SAns1": {POS: NOUN, "morph": "Case=Nom|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "SAns2": {POS: NOUN, "morph": "Case=Gen|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "SAns3": {POS: NOUN, "morph": "Case=Dat|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "SAns4": {POS: NOUN, "morph": "Case=Acc|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "SAns5": {POS: NOUN, "morph": "Case=Voc|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "SAns6": {POS: NOUN, "morph": "Case=Loc|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "SAns7": {POS: NOUN, "morph": "Case=Ins|Gender=Neut|MorphPos=Adj|Number=Sing"}, - "SFfp1": {POS: NOUN, "morph": "Case=Nom|Gender=Fem|MorphPos=Mix|Number=Plur"}, - "SFfp2": {POS: NOUN, "morph": "Case=Gen|Gender=Fem|MorphPos=Mix|Number=Plur"}, - "SFfp3": {POS: NOUN, "morph": "Case=Dat|Gender=Fem|MorphPos=Mix|Number=Plur"}, - "SFfp4": {POS: NOUN, "morph": "Case=Acc|Gender=Fem|MorphPos=Mix|Number=Plur"}, - "SFfp5": {POS: NOUN, "morph": "Case=Voc|Gender=Fem|MorphPos=Mix|Number=Plur"}, - "SFfp6": {POS: NOUN, "morph": "Case=Loc|Gender=Fem|MorphPos=Mix|Number=Plur"}, - "SFfp7": {POS: NOUN, "morph": "Case=Ins|Gender=Fem|MorphPos=Mix|Number=Plur"}, - "SFfs1": {POS: NOUN, "morph": "Case=Nom|Gender=Fem|MorphPos=Mix|Number=Sing"}, - "SFfs2": {POS: NOUN, "morph": "Case=Gen|Gender=Fem|MorphPos=Mix|Number=Sing"}, - "SFfs3": {POS: NOUN, "morph": "Case=Dat|Gender=Fem|MorphPos=Mix|Number=Sing"}, - "SFfs4": {POS: NOUN, "morph": "Case=Acc|Gender=Fem|MorphPos=Mix|Number=Sing"}, - "SFfs5": {POS: NOUN, "morph": "Case=Voc|Gender=Fem|MorphPos=Mix|Number=Sing"}, - "SFfs6": {POS: NOUN, "morph": "Case=Loc|Gender=Fem|MorphPos=Mix|Number=Sing"}, - "SFfs7": {POS: NOUN, "morph": "Case=Ins|Gender=Fem|MorphPos=Mix|Number=Sing"}, - "SSfp1": {POS: NOUN, "morph": "Case=Nom|Gender=Fem|MorphPos=Noun|Number=Plur"}, - "SSfp2": {POS: NOUN, "morph": "Case=Gen|Gender=Fem|MorphPos=Noun|Number=Plur"}, - "SSfp3": {POS: NOUN, "morph": "Case=Dat|Gender=Fem|MorphPos=Noun|Number=Plur"}, - "SSfp4": {POS: NOUN, "morph": "Case=Acc|Gender=Fem|MorphPos=Noun|Number=Plur"}, - "SSfp5": {POS: NOUN, "morph": "Case=Voc|Gender=Fem|MorphPos=Noun|Number=Plur"}, - "SSfp6": {POS: NOUN, "morph": "Case=Loc|Gender=Fem|MorphPos=Noun|Number=Plur"}, - "SSfp7": {POS: NOUN, "morph": "Case=Ins|Gender=Fem|MorphPos=Noun|Number=Plur"}, - "SSfs1": {POS: NOUN, "morph": 
"Case=Nom|Gender=Fem|MorphPos=Noun|Number=Sing"}, - "SSfs2": {POS: NOUN, "morph": "Case=Gen|Gender=Fem|MorphPos=Noun|Number=Sing"}, - "SSfs3": {POS: NOUN, "morph": "Case=Dat|Gender=Fem|MorphPos=Noun|Number=Sing"}, - "SSfs4": {POS: NOUN, "morph": "Case=Acc|Gender=Fem|MorphPos=Noun|Number=Sing"}, - "SSfs5": {POS: NOUN, "morph": "Case=Voc|Gender=Fem|MorphPos=Noun|Number=Sing"}, - "SSfs6": {POS: NOUN, "morph": "Case=Loc|Gender=Fem|MorphPos=Noun|Number=Sing"}, - "SSfs7": {POS: NOUN, "morph": "Case=Ins|Gender=Fem|MorphPos=Noun|Number=Sing"}, - "SSip1": {POS: NOUN, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|MorphPos=Noun|Number=Plur"}, - "SSip2": {POS: NOUN, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|MorphPos=Noun|Number=Plur"}, - "SSip3": {POS: NOUN, "morph": "Animacy=Inan|Case=Dat|Gender=Masc|MorphPos=Noun|Number=Plur"}, - "SSip4": {POS: NOUN, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|MorphPos=Noun|Number=Plur"}, - "SSip5": {POS: NOUN, "morph": "Animacy=Inan|Case=Voc|Gender=Masc|MorphPos=Noun|Number=Plur"}, - "SSip6": {POS: NOUN, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|MorphPos=Noun|Number=Plur"}, - "SSip7": {POS: NOUN, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|MorphPos=Noun|Number=Plur"}, - "SSis1": {POS: NOUN, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|MorphPos=Noun|Number=Sing"}, - "SSis2": {POS: NOUN, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|MorphPos=Noun|Number=Sing"}, - "SSis3": {POS: NOUN, "morph": "Animacy=Inan|Case=Dat|Gender=Masc|MorphPos=Noun|Number=Sing"}, - "SSis4": {POS: NOUN, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|MorphPos=Noun|Number=Sing"}, - "SSis5": {POS: NOUN, "morph": "Animacy=Inan|Case=Voc|Gender=Masc|MorphPos=Noun|Number=Sing"}, - "SSis6": {POS: NOUN, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|MorphPos=Noun|Number=Sing"}, - "SSis7": {POS: NOUN, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|MorphPos=Noun|Number=Sing"}, - "SSmp1": {POS: NOUN, "morph": "Animacy=Anim|Case=Nom|Gender=Masc|MorphPos=Noun|Number=Plur"}, - "SSmp2": {POS: NOUN, "morph": "Animacy=Anim|Case=Gen|Gender=Masc|MorphPos=Noun|Number=Plur"}, - "SSmp3": {POS: NOUN, "morph": "Animacy=Anim|Case=Dat|Gender=Masc|MorphPos=Noun|Number=Plur"}, - "SSmp4": {POS: NOUN, "morph": "Animacy=Anim|Case=Acc|Gender=Masc|MorphPos=Noun|Number=Plur"}, - "SSmp5": {POS: NOUN, "morph": "Animacy=Anim|Case=Voc|Gender=Masc|MorphPos=Noun|Number=Plur"}, - "SSmp6": {POS: NOUN, "morph": "Animacy=Anim|Case=Loc|Gender=Masc|MorphPos=Noun|Number=Plur"}, - "SSmp7": {POS: NOUN, "morph": "Animacy=Anim|Case=Ins|Gender=Masc|MorphPos=Noun|Number=Plur"}, - "SSms1": {POS: NOUN, "morph": "Animacy=Anim|Case=Nom|Gender=Masc|MorphPos=Noun|Number=Sing"}, - "SSms2": {POS: NOUN, "morph": "Animacy=Anim|Case=Gen|Gender=Masc|MorphPos=Noun|Number=Sing"}, - "SSms3": {POS: NOUN, "morph": "Animacy=Anim|Case=Dat|Gender=Masc|MorphPos=Noun|Number=Sing"}, - "SSms4": {POS: NOUN, "morph": "Animacy=Anim|Case=Acc|Gender=Masc|MorphPos=Noun|Number=Sing"}, - "SSms5": {POS: NOUN, "morph": "Animacy=Anim|Case=Voc|Gender=Masc|MorphPos=Noun|Number=Sing"}, - "SSms6": {POS: NOUN, "morph": "Animacy=Anim|Case=Loc|Gender=Masc|MorphPos=Noun|Number=Sing"}, - "SSms7": {POS: NOUN, "morph": "Animacy=Anim|Case=Ins|Gender=Masc|MorphPos=Noun|Number=Sing"}, - "SSnp1": {POS: NOUN, "morph": "Case=Nom|Gender=Neut|MorphPos=Noun|Number=Plur"}, - "SSnp2": {POS: NOUN, "morph": "Case=Gen|Gender=Neut|MorphPos=Noun|Number=Plur"}, - "SSnp3": {POS: NOUN, "morph": "Case=Dat|Gender=Neut|MorphPos=Noun|Number=Plur"}, - "SSnp4": {POS: NOUN, "morph": 
"Case=Acc|Gender=Neut|MorphPos=Noun|Number=Plur"}, - "SSnp5": {POS: NOUN, "morph": "Case=Voc|Gender=Neut|MorphPos=Noun|Number=Plur"}, - "SSnp6": {POS: NOUN, "morph": "Case=Loc|Gender=Neut|MorphPos=Noun|Number=Plur"}, - "SSnp7": {POS: NOUN, "morph": "Case=Ins|Gender=Neut|MorphPos=Noun|Number=Plur"}, - "SSns1": {POS: NOUN, "morph": "Case=Nom|Gender=Neut|MorphPos=Noun|Number=Sing"}, - "SSns2": {POS: NOUN, "morph": "Case=Gen|Gender=Neut|MorphPos=Noun|Number=Sing"}, - "SSns3": {POS: NOUN, "morph": "Case=Dat|Gender=Neut|MorphPos=Noun|Number=Sing"}, - "SSns4": {POS: NOUN, "morph": "Case=Acc|Gender=Neut|MorphPos=Noun|Number=Sing"}, - "SSns5": {POS: NOUN, "morph": "Case=Voc|Gender=Neut|MorphPos=Noun|Number=Sing"}, - "SSns6": {POS: NOUN, "morph": "Case=Loc|Gender=Neut|MorphPos=Noun|Number=Sing"}, - "SSns7": {POS: NOUN, "morph": "Case=Ins|Gender=Neut|MorphPos=Noun|Number=Sing"}, - "SUfp1": {POS: NOUN, "morph": "Case=Nom|Gender=Fem|MorphPos=Def|Number=Plur"}, - "SUfp2": {POS: NOUN, "morph": "Case=Gen|Gender=Fem|MorphPos=Def|Number=Plur"}, - "SUfp3": {POS: NOUN, "morph": "Case=Dat|Gender=Fem|MorphPos=Def|Number=Plur"}, - "SUfp4": {POS: NOUN, "morph": "Case=Acc|Gender=Fem|MorphPos=Def|Number=Plur"}, - "SUfp5": {POS: NOUN, "morph": "Case=Voc|Gender=Fem|MorphPos=Def|Number=Plur"}, - "SUfp6": {POS: NOUN, "morph": "Case=Loc|Gender=Fem|MorphPos=Def|Number=Plur"}, - "SUfp7": {POS: NOUN, "morph": "Case=Ins|Gender=Fem|MorphPos=Def|Number=Plur"}, - "SUfs1": {POS: NOUN, "morph": "Case=Nom|Gender=Fem|MorphPos=Def|Number=Sing"}, - "SUfs2": {POS: NOUN, "morph": "Case=Gen|Gender=Fem|MorphPos=Def|Number=Sing"}, - "SUfs3": {POS: NOUN, "morph": "Case=Dat|Gender=Fem|MorphPos=Def|Number=Sing"}, - "SUfs4": {POS: NOUN, "morph": "Case=Acc|Gender=Fem|MorphPos=Def|Number=Sing"}, - "SUfs5": {POS: NOUN, "morph": "Case=Voc|Gender=Fem|MorphPos=Def|Number=Sing"}, - "SUfs6": {POS: NOUN, "morph": "Case=Loc|Gender=Fem|MorphPos=Def|Number=Sing"}, - "SUfs7": {POS: NOUN, "morph": "Case=Ins|Gender=Fem|MorphPos=Def|Number=Sing"}, - "SUip1": {POS: NOUN, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|MorphPos=Def|Number=Plur"}, - "SUip2": {POS: NOUN, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|MorphPos=Def|Number=Plur"}, - "SUip3": {POS: NOUN, "morph": "Animacy=Inan|Case=Dat|Gender=Masc|MorphPos=Def|Number=Plur"}, - "SUip4": {POS: NOUN, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|MorphPos=Def|Number=Plur"}, - "SUip5": {POS: NOUN, "morph": "Animacy=Inan|Case=Voc|Gender=Masc|MorphPos=Def|Number=Plur"}, - "SUip6": {POS: NOUN, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|MorphPos=Def|Number=Plur"}, - "SUip7": {POS: NOUN, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|MorphPos=Def|Number=Plur"}, - "SUis1": {POS: NOUN, "morph": "Animacy=Inan|Case=Nom|Gender=Masc|MorphPos=Def|Number=Sing"}, - "SUis2": {POS: NOUN, "morph": "Animacy=Inan|Case=Gen|Gender=Masc|MorphPos=Def|Number=Sing"}, - "SUis3": {POS: NOUN, "morph": "Animacy=Inan|Case=Dat|Gender=Masc|MorphPos=Def|Number=Sing"}, - "SUis4": {POS: NOUN, "morph": "Animacy=Inan|Case=Acc|Gender=Masc|MorphPos=Def|Number=Sing"}, - "SUis5": {POS: NOUN, "morph": "Animacy=Inan|Case=Voc|Gender=Masc|MorphPos=Def|Number=Sing"}, - "SUis6": {POS: NOUN, "morph": "Animacy=Inan|Case=Loc|Gender=Masc|MorphPos=Def|Number=Sing"}, - "SUis7": {POS: NOUN, "morph": "Animacy=Inan|Case=Ins|Gender=Masc|MorphPos=Def|Number=Sing"}, - "SUmp1": {POS: NOUN, "morph": "Animacy=Anim|Case=Nom|Gender=Masc|MorphPos=Def|Number=Plur"}, - "SUmp2": {POS: NOUN, "morph": "Animacy=Anim|Case=Gen|Gender=Masc|MorphPos=Def|Number=Plur"}, - "SUmp3": {POS: 
NOUN, "morph": "Animacy=Anim|Case=Dat|Gender=Masc|MorphPos=Def|Number=Plur"}, - "SUmp4": {POS: NOUN, "morph": "Animacy=Anim|Case=Acc|Gender=Masc|MorphPos=Def|Number=Plur"}, - "SUmp5": {POS: NOUN, "morph": "Animacy=Anim|Case=Voc|Gender=Masc|MorphPos=Def|Number=Plur"}, - "SUmp6": {POS: NOUN, "morph": "Animacy=Anim|Case=Loc|Gender=Masc|MorphPos=Def|Number=Plur"}, - "SUmp7": {POS: NOUN, "morph": "Animacy=Anim|Case=Ins|Gender=Masc|MorphPos=Def|Number=Plur"}, - "SUms1": {POS: NOUN, "morph": "Animacy=Anim|Case=Nom|Gender=Masc|MorphPos=Def|Number=Sing"}, - "SUms2": {POS: NOUN, "morph": "Animacy=Anim|Case=Gen|Gender=Masc|MorphPos=Def|Number=Sing"}, - "SUms3": {POS: NOUN, "morph": "Animacy=Anim|Case=Dat|Gender=Masc|MorphPos=Def|Number=Sing"}, - "SUms4": {POS: NOUN, "morph": "Animacy=Anim|Case=Acc|Gender=Masc|MorphPos=Def|Number=Sing"}, - "SUms5": {POS: NOUN, "morph": "Animacy=Anim|Case=Voc|Gender=Masc|MorphPos=Def|Number=Sing"}, - "SUms6": {POS: NOUN, "morph": "Animacy=Anim|Case=Loc|Gender=Masc|MorphPos=Def|Number=Sing"}, - "SUms7": {POS: NOUN, "morph": "Animacy=Anim|Case=Ins|Gender=Masc|MorphPos=Def|Number=Sing"}, - "SUnp1": {POS: NOUN, "morph": "Case=Nom|Gender=Neut|MorphPos=Def|Number=Plur"}, - "SUnp2": {POS: NOUN, "morph": "Case=Gen|Gender=Neut|MorphPos=Def|Number=Plur"}, - "SUnp3": {POS: NOUN, "morph": "Case=Dat|Gender=Neut|MorphPos=Def|Number=Plur"}, - "SUnp4": {POS: NOUN, "morph": "Case=Acc|Gender=Neut|MorphPos=Def|Number=Plur"}, - "SUnp5": {POS: NOUN, "morph": "Case=Voc|Gender=Neut|MorphPos=Def|Number=Plur"}, - "SUnp6": {POS: NOUN, "morph": "Case=Loc|Gender=Neut|MorphPos=Def|Number=Plur"}, - "SUnp7": {POS: NOUN, "morph": "Case=Ins|Gender=Neut|MorphPos=Def|Number=Plur"}, - "SUns1": {POS: NOUN, "morph": "Case=Nom|Gender=Neut|MorphPos=Def|Number=Sing"}, - "SUns2": {POS: NOUN, "morph": "Case=Gen|Gender=Neut|MorphPos=Def|Number=Sing"}, - "SUns3": {POS: NOUN, "morph": "Case=Dat|Gender=Neut|MorphPos=Def|Number=Sing"}, - "SUns4": {POS: NOUN, "morph": "Case=Acc|Gender=Neut|MorphPos=Def|Number=Sing"}, - "SUns5": {POS: NOUN, "morph": "Case=Voc|Gender=Neut|MorphPos=Def|Number=Sing"}, - "SUns6": {POS: NOUN, "morph": "Case=Loc|Gender=Neut|MorphPos=Def|Number=Sing"}, - "SUns7": {POS: NOUN, "morph": "Case=Ins|Gender=Neut|MorphPos=Def|Number=Sing"}, - "T": {POS: PART, "morph": "_"}, - "TY": {POS: PART, "morph": "Mood=Cnd"}, - "VBepa-": {POS: VERB, "morph": "Aspect=Imp|Mood=Ind|Number=Plur|Person=1|Polarity=Neg|Tense=Fut|VerbForm=Fin"}, - "VBepa+": {POS: VERB, "morph": "Aspect=Imp|Mood=Ind|Number=Plur|Person=1|Polarity=Pos|Tense=Fut|VerbForm=Fin"}, - "VBepb-": {POS: VERB, "morph": "Aspect=Imp|Mood=Ind|Number=Plur|Person=2|Polarity=Neg|Tense=Fut|VerbForm=Fin"}, - "VBepb+": {POS: VERB, "morph": "Aspect=Imp|Mood=Ind|Number=Plur|Person=2|Polarity=Pos|Tense=Fut|VerbForm=Fin"}, - "VBepc-": {POS: VERB, "morph": "Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Polarity=Neg|Tense=Fut|VerbForm=Fin"}, - "VBepc+": {POS: VERB, "morph": "Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Polarity=Pos|Tense=Fut|VerbForm=Fin"}, - "VBesa-": {POS: VERB, "morph": "Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Polarity=Neg|Tense=Fut|VerbForm=Fin"}, - "VBesa+": {POS: VERB, "morph": "Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Polarity=Pos|Tense=Fut|VerbForm=Fin"}, - "VBesb-": {POS: VERB, "morph": "Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Polarity=Neg|Tense=Fut|VerbForm=Fin"}, - "VBesb+": {POS: VERB, "morph": "Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Polarity=Pos|Tense=Fut|VerbForm=Fin"}, - "VBesc-": {POS: VERB, "morph": 
"Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Polarity=Neg|Tense=Fut|VerbForm=Fin"}, - "VBesc+": {POS: VERB, "morph": "Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Fut|VerbForm=Fin"}, - "VBjpa-": {POS: VERB, "morph": "Aspect=Imp,Perf|Mood=Ind|Number=Plur|Person=1|Polarity=Neg|Tense=Fut|VerbForm=Fin"}, - "VBjpa+": {POS: VERB, "morph": "Aspect=Imp,Perf|Mood=Ind|Number=Plur|Person=1|Polarity=Pos|Tense=Fut|VerbForm=Fin"}, - "VBjpb-": {POS: VERB, "morph": "Aspect=Imp,Perf|Mood=Ind|Number=Plur|Person=2|Polarity=Neg|Tense=Fut|VerbForm=Fin"}, - "VBjpb+": {POS: VERB, "morph": "Aspect=Imp,Perf|Mood=Ind|Number=Plur|Person=2|Polarity=Pos|Tense=Fut|VerbForm=Fin"}, - "VBjpc-": {POS: VERB, "morph": "Aspect=Imp,Perf|Mood=Ind|Number=Plur|Person=3|Polarity=Neg|Tense=Fut|VerbForm=Fin"}, - "VBjpc+": {POS: VERB, "morph": "Aspect=Imp,Perf|Mood=Ind|Number=Plur|Person=3|Polarity=Pos|Tense=Fut|VerbForm=Fin"}, - "VBjsa-": {POS: VERB, "morph": "Aspect=Imp,Perf|Mood=Ind|Number=Sing|Person=1|Polarity=Neg|Tense=Fut|VerbForm=Fin"}, - "VBjsa+": {POS: VERB, "morph": "Aspect=Imp,Perf|Mood=Ind|Number=Sing|Person=1|Polarity=Pos|Tense=Fut|VerbForm=Fin"}, - "VBjsb-": {POS: VERB, "morph": "Aspect=Imp,Perf|Mood=Ind|Number=Sing|Person=2|Polarity=Neg|Tense=Fut|VerbForm=Fin"}, - "VBjsb+": {POS: VERB, "morph": "Aspect=Imp,Perf|Mood=Ind|Number=Sing|Person=2|Polarity=Pos|Tense=Fut|VerbForm=Fin"}, - "VBjsc-": {POS: VERB, "morph": "Aspect=Imp,Perf|Mood=Ind|Number=Sing|Person=3|Polarity=Neg|Tense=Fut|VerbForm=Fin"}, - "VBjsc+": {POS: VERB, "morph": "Aspect=Imp,Perf|Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Fut|VerbForm=Fin"}, - "VHd-": {POS: VERB, "morph": "Aspect=Perf|Polarity=Neg|VerbForm=Conv"}, - "VHd+": {POS: VERB, "morph": "Aspect=Perf|Polarity=Pos|VerbForm=Conv"}, - "VHe-": {POS: VERB, "morph": "Aspect=Imp|Polarity=Neg|VerbForm=Conv"}, - "VHe+": {POS: VERB, "morph": "Aspect=Imp|Polarity=Pos|VerbForm=Conv"}, - "VHj-": {POS: VERB, "morph": "Aspect=Imp,Perf|Polarity=Neg|VerbForm=Conv"}, - "VHj+": {POS: VERB, "morph": "Aspect=Imp,Perf|Polarity=Pos|VerbForm=Conv"}, - "VId-": {POS: VERB, "morph": "Aspect=Perf|Polarity=Neg|VerbForm=Inf"}, - "VId+": {POS: VERB, "morph": "Aspect=Perf|Polarity=Pos|VerbForm=Inf"}, - "VIe-": {POS: VERB, "morph": "Aspect=Imp|Polarity=Neg|VerbForm=Inf"}, - "VIe+": {POS: VERB, "morph": "Aspect=Imp|Polarity=Pos|VerbForm=Inf"}, - "VIj-": {POS: VERB, "morph": "Aspect=Imp,Perf|Polarity=Neg|VerbForm=Inf"}, - "VIj+": {POS: VERB, "morph": "Aspect=Imp,Perf|Polarity=Pos|VerbForm=Inf"}, - "VKdpa-": {POS: VERB, "morph": "Aspect=Perf|Mood=Ind|Number=Plur|Person=1|Polarity=Neg|Tense=Pres|VerbForm=Fin"}, - "VKdpa+": {POS: VERB, "morph": "Aspect=Perf|Mood=Ind|Number=Plur|Person=1|Polarity=Pos|Tense=Pres|VerbForm=Fin"}, - "VKdpb-": {POS: VERB, "morph": "Aspect=Perf|Mood=Ind|Number=Plur|Person=2|Polarity=Neg|Tense=Pres|VerbForm=Fin"}, - "VKdpb+": {POS: VERB, "morph": "Aspect=Perf|Mood=Ind|Number=Plur|Person=2|Polarity=Pos|Tense=Pres|VerbForm=Fin"}, - "VKdpc-": {POS: VERB, "morph": "Aspect=Perf|Mood=Ind|Number=Plur|Person=3|Polarity=Neg|Tense=Pres|VerbForm=Fin"}, - "VKdpc+": {POS: VERB, "morph": "Aspect=Perf|Mood=Ind|Number=Plur|Person=3|Polarity=Pos|Tense=Pres|VerbForm=Fin"}, - "VKdsa-": {POS: VERB, "morph": "Aspect=Perf|Mood=Ind|Number=Sing|Person=1|Polarity=Neg|Tense=Pres|VerbForm=Fin"}, - "VKdsa+": {POS: VERB, "morph": "Aspect=Perf|Mood=Ind|Number=Sing|Person=1|Polarity=Pos|Tense=Pres|VerbForm=Fin"}, - "VKdsb-": {POS: VERB, "morph": 
"Aspect=Perf|Mood=Ind|Number=Sing|Person=2|Polarity=Neg|Tense=Pres|VerbForm=Fin"}, - "VKdsb+": {POS: VERB, "morph": "Aspect=Perf|Mood=Ind|Number=Sing|Person=2|Polarity=Pos|Tense=Pres|VerbForm=Fin"}, - "VKdsc-": {POS: VERB, "morph": "Aspect=Perf|Mood=Ind|Number=Sing|Person=3|Polarity=Neg|Tense=Pres|VerbForm=Fin"}, - "VKdsc+": {POS: VERB, "morph": "Aspect=Perf|Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Pres|VerbForm=Fin"}, - "VKe-": {POS: VERB, "morph": "Aspect=Imp|Mood=Ind|Polarity=Neg|Tense=Pres|VerbForm=Fin"}, - "VKepa-": {POS: VERB, "morph": "Aspect=Imp|Mood=Ind|Number=Plur|Person=1|Polarity=Neg|Tense=Pres|VerbForm=Fin"}, - "VKepa+": {POS: VERB, "morph": "Aspect=Imp|Mood=Ind|Number=Plur|Person=1|Polarity=Pos|Tense=Pres|VerbForm=Fin"}, - "VKepb-": {POS: VERB, "morph": "Aspect=Imp|Mood=Ind|Number=Plur|Person=2|Polarity=Neg|Tense=Pres|VerbForm=Fin"}, - "VKepb+": {POS: VERB, "morph": "Aspect=Imp|Mood=Ind|Number=Plur|Person=2|Polarity=Pos|Tense=Pres|VerbForm=Fin"}, - "VKepc-": {POS: VERB, "morph": "Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Polarity=Neg|Tense=Pres|VerbForm=Fin"}, - "VKepc+": {POS: VERB, "morph": "Aspect=Imp|Mood=Ind|Number=Plur|Person=3|Polarity=Pos|Tense=Pres|VerbForm=Fin"}, - "VKesa-": {POS: VERB, "morph": "Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Polarity=Neg|Tense=Pres|VerbForm=Fin"}, - "VKesa+": {POS: VERB, "morph": "Aspect=Imp|Mood=Ind|Number=Sing|Person=1|Polarity=Pos|Tense=Pres|VerbForm=Fin"}, - "VKesb-": {POS: VERB, "morph": "Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Polarity=Neg|Tense=Pres|VerbForm=Fin"}, - "VKesb+": {POS: VERB, "morph": "Aspect=Imp|Mood=Ind|Number=Sing|Person=2|Polarity=Pos|Tense=Pres|VerbForm=Fin"}, - "VKesc-": {POS: VERB, "morph": "Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Polarity=Neg|Tense=Pres|VerbForm=Fin"}, - "VKesc+": {POS: VERB, "morph": "Aspect=Imp|Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Pres|VerbForm=Fin"}, - "VKjpa-": {POS: VERB, "morph": "Aspect=Imp,Perf|Mood=Ind|Number=Plur|Person=1|Polarity=Neg|Tense=Pres|VerbForm=Fin"}, - "VKjpa+": {POS: VERB, "morph": "Aspect=Imp,Perf|Mood=Ind|Number=Plur|Person=1|Polarity=Pos|Tense=Pres|VerbForm=Fin"}, - "VKjpb-": {POS: VERB, "morph": "Aspect=Imp,Perf|Mood=Ind|Number=Plur|Person=2|Polarity=Neg|Tense=Pres|VerbForm=Fin"}, - "VKjpb+": {POS: VERB, "morph": "Aspect=Imp,Perf|Mood=Ind|Number=Plur|Person=2|Polarity=Pos|Tense=Pres|VerbForm=Fin"}, - "VKjpc-": {POS: VERB, "morph": "Aspect=Imp,Perf|Mood=Ind|Number=Plur|Person=3|Polarity=Neg|Tense=Pres|VerbForm=Fin"}, - "VKjpc+": {POS: VERB, "morph": "Aspect=Imp,Perf|Mood=Ind|Number=Plur|Person=3|Polarity=Pos|Tense=Pres|VerbForm=Fin"}, - "VKjsa-": {POS: VERB, "morph": "Aspect=Imp,Perf|Mood=Ind|Number=Sing|Person=1|Polarity=Neg|Tense=Pres|VerbForm=Fin"}, - "VKjsa+": {POS: VERB, "morph": "Aspect=Imp,Perf|Mood=Ind|Number=Sing|Person=1|Polarity=Pos|Tense=Pres|VerbForm=Fin"}, - "VKjsb-": {POS: VERB, "morph": "Aspect=Imp,Perf|Mood=Ind|Number=Sing|Person=2|Polarity=Neg|Tense=Pres|VerbForm=Fin"}, - "VKjsb+": {POS: VERB, "morph": "Aspect=Imp,Perf|Mood=Ind|Number=Sing|Person=2|Polarity=Pos|Tense=Pres|VerbForm=Fin"}, - "VKjsc-": {POS: VERB, "morph": "Aspect=Imp,Perf|Mood=Ind|Number=Sing|Person=3|Polarity=Neg|Tense=Pres|VerbForm=Fin"}, - "VKjsc+": {POS: VERB, "morph": "Aspect=Imp,Perf|Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Pres|VerbForm=Fin"}, - "VLdpah-": {POS: VERB, "morph": "Aspect=Perf|Number=Plur|Person=1|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLdpah+": {POS: VERB, "morph": 
"Aspect=Perf|Number=Plur|Person=1|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLdpbh-": {POS: VERB, "morph": "Aspect=Perf|Number=Plur|Person=2|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLdpbh+": {POS: VERB, "morph": "Aspect=Perf|Number=Plur|Person=2|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLdpcf-": {POS: VERB, "morph": "Aspect=Perf|Gender=Fem|Number=Plur|Person=3|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLdpcf+": {POS: VERB, "morph": "Aspect=Perf|Gender=Fem|Number=Plur|Person=3|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLdpci-": {POS: VERB, "morph": "Animacy=Inan|Aspect=Perf|Gender=Masc|Number=Plur|Person=3|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLdpci+": {POS: VERB, "morph": "Animacy=Inan|Aspect=Perf|Gender=Masc|Number=Plur|Person=3|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLdpcm-": {POS: VERB, "morph": "Animacy=Anim|Aspect=Perf|Gender=Masc|Number=Plur|Person=3|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLdpcm+": {POS: VERB, "morph": "Animacy=Anim|Aspect=Perf|Gender=Masc|Number=Plur|Person=3|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLdpcn-": {POS: VERB, "morph": "Aspect=Perf|Gender=Neut|Number=Plur|Person=3|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLdpcn+": {POS: VERB, "morph": "Aspect=Perf|Gender=Neut|Number=Plur|Person=3|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLdsaf-": {POS: VERB, "morph": "Aspect=Perf|Gender=Fem|Number=Sing|Person=1|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLdsaf+": {POS: VERB, "morph": "Aspect=Perf|Gender=Fem|Number=Sing|Person=1|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLdsai-": {POS: VERB, "morph": "Animacy=Inan|Aspect=Perf|Gender=Masc|Number=Sing|Person=1|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLdsai+": {POS: VERB, "morph": "Animacy=Inan|Aspect=Perf|Gender=Masc|Number=Sing|Person=1|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLdsam-": {POS: VERB, "morph": "Animacy=Anim|Aspect=Perf|Gender=Masc|Number=Sing|Person=1|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLdsam+": {POS: VERB, "morph": "Animacy=Anim|Aspect=Perf|Gender=Masc|Number=Sing|Person=1|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLdsan-": {POS: VERB, "morph": "Aspect=Perf|Gender=Neut|Number=Sing|Person=1|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLdsan+": {POS: VERB, "morph": "Aspect=Perf|Gender=Neut|Number=Sing|Person=1|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLdsbf-": {POS: VERB, "morph": "Aspect=Perf|Gender=Fem|Number=Sing|Person=2|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLdsbf+": {POS: VERB, "morph": "Aspect=Perf|Gender=Fem|Number=Sing|Person=2|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLdsbi-": {POS: VERB, "morph": "Animacy=Inan|Aspect=Perf|Gender=Masc|Number=Sing|Person=2|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLdsbi+": {POS: VERB, "morph": "Animacy=Inan|Aspect=Perf|Gender=Masc|Number=Sing|Person=2|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLdsbm-": {POS: VERB, "morph": "Animacy=Anim|Aspect=Perf|Gender=Masc|Number=Sing|Person=2|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLdsbm+": {POS: VERB, "morph": "Animacy=Anim|Aspect=Perf|Gender=Masc|Number=Sing|Person=2|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLdsbn-": {POS: VERB, "morph": "Aspect=Perf|Gender=Neut|Number=Sing|Person=2|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLdsbn+": {POS: VERB, "morph": "Aspect=Perf|Gender=Neut|Number=Sing|Person=2|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLdscf-": {POS: VERB, "morph": "Aspect=Perf|Gender=Fem|Number=Sing|Person=3|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLdscf+": {POS: VERB, "morph": 
"Aspect=Perf|Gender=Fem|Number=Sing|Person=3|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLdsci-": {POS: VERB, "morph": "Animacy=Inan|Aspect=Perf|Gender=Masc|Number=Sing|Person=3|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLdsci+": {POS: VERB, "morph": "Animacy=Inan|Aspect=Perf|Gender=Masc|Number=Sing|Person=3|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLdscm-": {POS: VERB, "morph": "Animacy=Anim|Aspect=Perf|Gender=Masc|Number=Sing|Person=3|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLdscm+": {POS: VERB, "morph": "Animacy=Anim|Aspect=Perf|Gender=Masc|Number=Sing|Person=3|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLdscn-": {POS: VERB, "morph": "Aspect=Perf|Gender=Neut|Number=Sing|Person=3|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLdscn+": {POS: VERB, "morph": "Aspect=Perf|Gender=Neut|Number=Sing|Person=3|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLepah-": {POS: VERB, "morph": "Aspect=Imp|Number=Plur|Person=1|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLepah+": {POS: VERB, "morph": "Aspect=Imp|Number=Plur|Person=1|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLepbh-": {POS: VERB, "morph": "Aspect=Imp|Number=Plur|Person=2|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLepbh+": {POS: VERB, "morph": "Aspect=Imp|Number=Plur|Person=2|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLepcf-": {POS: VERB, "morph": "Aspect=Imp|Gender=Fem|Number=Plur|Person=3|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLepcf+": {POS: VERB, "morph": "Aspect=Imp|Gender=Fem|Number=Plur|Person=3|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLepci-": {POS: VERB, "morph": "Animacy=Inan|Aspect=Imp|Gender=Masc|Number=Plur|Person=3|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLepci+": {POS: VERB, "morph": "Animacy=Inan|Aspect=Imp|Gender=Masc|Number=Plur|Person=3|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLepcm-": {POS: VERB, "morph": "Animacy=Anim|Aspect=Imp|Gender=Masc|Number=Plur|Person=3|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLepcm+": {POS: VERB, "morph": "Animacy=Anim|Aspect=Imp|Gender=Masc|Number=Plur|Person=3|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLepcn-": {POS: VERB, "morph": "Aspect=Imp|Gender=Neut|Number=Plur|Person=3|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLepcn+": {POS: VERB, "morph": "Aspect=Imp|Gender=Neut|Number=Plur|Person=3|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLesaf-": {POS: VERB, "morph": "Aspect=Imp|Gender=Fem|Number=Sing|Person=1|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLesaf+": {POS: VERB, "morph": "Aspect=Imp|Gender=Fem|Number=Sing|Person=1|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLesai-": {POS: VERB, "morph": "Animacy=Inan|Aspect=Imp|Gender=Masc|Number=Sing|Person=1|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLesai+": {POS: VERB, "morph": "Animacy=Inan|Aspect=Imp|Gender=Masc|Number=Sing|Person=1|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLesam-": {POS: VERB, "morph": "Animacy=Anim|Aspect=Imp|Gender=Masc|Number=Sing|Person=1|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLesam+": {POS: VERB, "morph": "Animacy=Anim|Aspect=Imp|Gender=Masc|Number=Sing|Person=1|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLesan-": {POS: VERB, "morph": "Aspect=Imp|Gender=Neut|Number=Sing|Person=1|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLesan+": {POS: VERB, "morph": "Aspect=Imp|Gender=Neut|Number=Sing|Person=1|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLesbf-": {POS: VERB, "morph": "Aspect=Imp|Gender=Fem|Number=Sing|Person=2|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLesbf+": {POS: VERB, "morph": "Aspect=Imp|Gender=Fem|Number=Sing|Person=2|Polarity=Pos|Tense=Past|VerbForm=Part"}, 
- "VLesbi-": {POS: VERB, "morph": "Animacy=Inan|Aspect=Imp|Gender=Masc|Number=Sing|Person=2|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLesbi+": {POS: VERB, "morph": "Animacy=Inan|Aspect=Imp|Gender=Masc|Number=Sing|Person=2|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLesbm-": {POS: VERB, "morph": "Animacy=Anim|Aspect=Imp|Gender=Masc|Number=Sing|Person=2|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLesbm+": {POS: VERB, "morph": "Animacy=Anim|Aspect=Imp|Gender=Masc|Number=Sing|Person=2|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLesbn-": {POS: VERB, "morph": "Aspect=Imp|Gender=Neut|Number=Sing|Person=2|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLesbn+": {POS: VERB, "morph": "Aspect=Imp|Gender=Neut|Number=Sing|Person=2|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLescf-": {POS: VERB, "morph": "Aspect=Imp|Gender=Fem|Number=Sing|Person=3|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLescf+": {POS: VERB, "morph": "Aspect=Imp|Gender=Fem|Number=Sing|Person=3|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLesci-": {POS: VERB, "morph": "Animacy=Inan|Aspect=Imp|Gender=Masc|Number=Sing|Person=3|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLesci+": {POS: VERB, "morph": "Animacy=Inan|Aspect=Imp|Gender=Masc|Number=Sing|Person=3|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLescm-": {POS: VERB, "morph": "Animacy=Anim|Aspect=Imp|Gender=Masc|Number=Sing|Person=3|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLescm+": {POS: VERB, "morph": "Animacy=Anim|Aspect=Imp|Gender=Masc|Number=Sing|Person=3|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLescn-": {POS: VERB, "morph": "Aspect=Imp|Gender=Neut|Number=Sing|Person=3|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLescn+": {POS: VERB, "morph": "Aspect=Imp|Gender=Neut|Number=Sing|Person=3|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLjpah-": {POS: VERB, "morph": "Aspect=Imp,Perf|Number=Plur|Person=1|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLjpah+": {POS: VERB, "morph": "Aspect=Imp,Perf|Number=Plur|Person=1|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLjpbh-": {POS: VERB, "morph": "Aspect=Imp,Perf|Number=Plur|Person=2|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLjpbh+": {POS: VERB, "morph": "Aspect=Imp,Perf|Number=Plur|Person=2|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLjpcf-": {POS: VERB, "morph": "Aspect=Imp,Perf|Gender=Fem|Number=Plur|Person=3|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLjpcf+": {POS: VERB, "morph": "Aspect=Imp,Perf|Gender=Fem|Number=Plur|Person=3|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLjpci-": {POS: VERB, "morph": "Animacy=Inan|Aspect=Imp,Perf|Gender=Masc|Number=Plur|Person=3|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLjpci+": {POS: VERB, "morph": "Animacy=Inan|Aspect=Imp,Perf|Gender=Masc|Number=Plur|Person=3|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLjpcm-": {POS: VERB, "morph": "Animacy=Anim|Aspect=Imp,Perf|Gender=Masc|Number=Plur|Person=3|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLjpcm+": {POS: VERB, "morph": "Animacy=Anim|Aspect=Imp,Perf|Gender=Masc|Number=Plur|Person=3|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLjpcn-": {POS: VERB, "morph": "Aspect=Imp,Perf|Gender=Neut|Number=Plur|Person=3|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLjpcn+": {POS: VERB, "morph": "Aspect=Imp,Perf|Gender=Neut|Number=Plur|Person=3|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLjsaf-": {POS: VERB, "morph": "Aspect=Imp,Perf|Gender=Fem|Number=Sing|Person=1|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLjsaf+": {POS: VERB, "morph": "Aspect=Imp,Perf|Gender=Fem|Number=Sing|Person=1|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLjsai-": {POS: 
VERB, "morph": "Animacy=Inan|Aspect=Imp,Perf|Gender=Masc|Number=Sing|Person=1|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLjsai+": {POS: VERB, "morph": "Animacy=Inan|Aspect=Imp,Perf|Gender=Masc|Number=Sing|Person=1|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLjsam-": {POS: VERB, "morph": "Animacy=Anim|Aspect=Imp,Perf|Gender=Masc|Number=Sing|Person=1|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLjsam+": {POS: VERB, "morph": "Animacy=Anim|Aspect=Imp,Perf|Gender=Masc|Number=Sing|Person=1|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLjsan-": {POS: VERB, "morph": "Aspect=Imp,Perf|Gender=Neut|Number=Sing|Person=1|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLjsan+": {POS: VERB, "morph": "Aspect=Imp,Perf|Gender=Neut|Number=Sing|Person=1|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLjsbf-": {POS: VERB, "morph": "Aspect=Imp,Perf|Gender=Fem|Number=Sing|Person=2|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLjsbf+": {POS: VERB, "morph": "Aspect=Imp,Perf|Gender=Fem|Number=Sing|Person=2|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLjsbi-": {POS: VERB, "morph": "Animacy=Inan|Aspect=Imp,Perf|Gender=Masc|Number=Sing|Person=2|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLjsbi+": {POS: VERB, "morph": "Animacy=Inan|Aspect=Imp,Perf|Gender=Masc|Number=Sing|Person=2|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLjsbm-": {POS: VERB, "morph": "Animacy=Anim|Aspect=Imp,Perf|Gender=Masc|Number=Sing|Person=2|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLjsbm+": {POS: VERB, "morph": "Animacy=Anim|Aspect=Imp,Perf|Gender=Masc|Number=Sing|Person=2|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLjsbn-": {POS: VERB, "morph": "Aspect=Imp,Perf|Gender=Neut|Number=Sing|Person=2|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLjsbn+": {POS: VERB, "morph": "Aspect=Imp,Perf|Gender=Neut|Number=Sing|Person=2|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLjscf-": {POS: VERB, "morph": "Aspect=Imp,Perf|Gender=Fem|Number=Sing|Person=3|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLjscf+": {POS: VERB, "morph": "Aspect=Imp,Perf|Gender=Fem|Number=Sing|Person=3|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLjsci-": {POS: VERB, "morph": "Animacy=Inan|Aspect=Imp,Perf|Gender=Masc|Number=Sing|Person=3|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLjsci+": {POS: VERB, "morph": "Animacy=Inan|Aspect=Imp,Perf|Gender=Masc|Number=Sing|Person=3|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLjscm-": {POS: VERB, "morph": "Animacy=Anim|Aspect=Imp,Perf|Gender=Masc|Number=Sing|Person=3|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLjscm+": {POS: VERB, "morph": "Animacy=Anim|Aspect=Imp,Perf|Gender=Masc|Number=Sing|Person=3|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VLjscn-": {POS: VERB, "morph": "Aspect=Imp,Perf|Gender=Neut|Number=Sing|Person=3|Polarity=Neg|Tense=Past|VerbForm=Part"}, - "VLjscn+": {POS: VERB, "morph": "Aspect=Imp,Perf|Gender=Neut|Number=Sing|Person=3|Polarity=Pos|Tense=Past|VerbForm=Part"}, - "VMdpa-": {POS: VERB, "morph": "Aspect=Perf|Mood=Imp|Number=Plur|Person=1|Polarity=Neg|VerbForm=Fin"}, - "VMdpa+": {POS: VERB, "morph": "Aspect=Perf|Mood=Imp|Number=Plur|Person=1|Polarity=Pos|VerbForm=Fin"}, - "VMdpb-": {POS: VERB, "morph": "Aspect=Perf|Mood=Imp|Number=Plur|Person=2|Polarity=Neg|VerbForm=Fin"}, - "VMdpb+": {POS: VERB, "morph": "Aspect=Perf|Mood=Imp|Number=Plur|Person=2|Polarity=Pos|VerbForm=Fin"}, - "VMdsb-": {POS: VERB, "morph": "Aspect=Perf|Mood=Imp|Number=Sing|Person=2|Polarity=Neg|VerbForm=Fin"}, - "VMdsb+": {POS: VERB, "morph": "Aspect=Perf|Mood=Imp|Number=Sing|Person=2|Polarity=Pos|VerbForm=Fin"}, - "VMepa-": {POS: VERB, "morph": 
"Aspect=Imp|Mood=Imp|Number=Plur|Person=1|Polarity=Neg|VerbForm=Fin"}, - "VMepa+": {POS: VERB, "morph": "Aspect=Imp|Mood=Imp|Number=Plur|Person=1|Polarity=Pos|VerbForm=Fin"}, - "VMepb-": {POS: VERB, "morph": "Aspect=Imp|Mood=Imp|Number=Plur|Person=2|Polarity=Neg|VerbForm=Fin"}, - "VMepb+": {POS: VERB, "morph": "Aspect=Imp|Mood=Imp|Number=Plur|Person=2|Polarity=Pos|VerbForm=Fin"}, - "VMesb-": {POS: VERB, "morph": "Aspect=Imp|Mood=Imp|Number=Sing|Person=2|Polarity=Neg|VerbForm=Fin"}, - "VMesb+": {POS: VERB, "morph": "Aspect=Imp|Mood=Imp|Number=Sing|Person=2|Polarity=Pos|VerbForm=Fin"}, - "VMjpa-": {POS: VERB, "morph": "Aspect=Imp,Perf|Mood=Imp|Number=Plur|Person=1|Polarity=Neg|VerbForm=Fin"}, - "VMjpa+": {POS: VERB, "morph": "Aspect=Imp,Perf|Mood=Imp|Number=Plur|Person=1|Polarity=Pos|VerbForm=Fin"}, - "VMjpb-": {POS: VERB, "morph": "Aspect=Imp,Perf|Mood=Imp|Number=Plur|Person=2|Polarity=Neg|VerbForm=Fin"}, - "VMjpb+": {POS: VERB, "morph": "Aspect=Imp,Perf|Mood=Imp|Number=Plur|Person=2|Polarity=Pos|VerbForm=Fin"}, - "VMjsb-": {POS: VERB, "morph": "Aspect=Imp,Perf|Mood=Imp|Number=Sing|Person=2|Polarity=Neg|VerbForm=Fin"}, - "VMjsb+": {POS: VERB, "morph": "Aspect=Imp,Perf|Mood=Imp|Number=Sing|Person=2|Polarity=Pos|VerbForm=Fin"}, - "W": {POS: X, "morph": "Abbr=Yes"}, - "Y": {POS: AUX, "morph": "Mood=Cnd"}, -} diff --git a/spacy/lang/sl/__init__.py b/spacy/lang/sl/__init__.py index 2d4977bdf..0330cc4d0 100644 --- a/spacy/lang/sl/__init__.py +++ b/spacy/lang/sl/__init__.py @@ -1,14 +1,8 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from ...language import Language -from ...attrs import LANG class SlovenianDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = lambda text: "sl" stop_words = STOP_WORDS diff --git a/spacy/lang/sl/stop_words.py b/spacy/lang/sl/stop_words.py index 187e95876..6fb01a183 100644 --- a/spacy/lang/sl/stop_words.py +++ b/spacy/lang/sl/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Source: https://github.com/stopwords-iso/stopwords-sl # TODO: probably needs to be tidied up – the list seems to have month names in # it, which shouldn't be considered stop words. diff --git a/spacy/lang/sq/__init__.py b/spacy/lang/sq/__init__.py index 6f33b37c2..a4bacfa49 100644 --- a/spacy/lang/sq/__init__.py +++ b/spacy/lang/sq/__init__.py @@ -1,14 +1,8 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from ...language import Language -from ...attrs import LANG class AlbanianDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = lambda text: "sq" stop_words = STOP_WORDS diff --git a/spacy/lang/sq/examples.py b/spacy/lang/sq/examples.py index c51a0da39..06ed20fa1 100644 --- a/spacy/lang/sq/examples.py +++ b/spacy/lang/sq/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
diff --git a/spacy/lang/sq/stop_words.py b/spacy/lang/sq/stop_words.py index f91861ca1..f2b1a4f4a 100644 --- a/spacy/lang/sq/stop_words.py +++ b/spacy/lang/sq/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Source: https://github.com/andrixh/index-albanian STOP_WORDS = set( diff --git a/spacy/lang/sr/__init__.py b/spacy/lang/sr/__init__.py index 286d6693b..165e54975 100644 --- a/spacy/lang/sr/__init__.py +++ b/spacy/lang/sr/__init__.py @@ -1,20 +1,12 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .lex_attrs import LEX_ATTRS -from ..tokenizer_exceptions import BASE_EXCEPTIONS from ...language import Language -from ...attrs import LANG -from ...util import update_exc class SerbianDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters.update(LEX_ATTRS) - lex_attr_getters[LANG] = lambda text: "sr" - tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) + tokenizer_exceptions = TOKENIZER_EXCEPTIONS + lex_attr_getters = LEX_ATTRS stop_words = STOP_WORDS diff --git a/spacy/lang/sr/examples.py b/spacy/lang/sr/examples.py index d636220c3..ec7f57ced 100644 --- a/spacy/lang/sr/examples.py +++ b/spacy/lang/sr/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/sr/lex_attrs.py b/spacy/lang/sr/lex_attrs.py index c90dc0da7..dc48909bc 100644 --- a/spacy/lang/sr/lex_attrs.py +++ b/spacy/lang/sr/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/sr/stop_words.py b/spacy/lang/sr/stop_words.py index 9712327f8..5df5509d2 100644 --- a/spacy/lang/sr/stop_words.py +++ b/spacy/lang/sr/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ а diff --git a/spacy/lang/sr/tokenizer_exceptions.py b/spacy/lang/sr/tokenizer_exceptions.py index 8fca346a3..dcaa3e239 100755 --- a/spacy/lang/sr/tokenizer_exceptions.py +++ b/spacy/lang/sr/tokenizer_exceptions.py @@ -1,96 +1,93 @@ -# encoding: utf8 -from __future__ import unicode_literals - -from ...symbols import ORTH, LEMMA, NORM +from ..tokenizer_exceptions import BASE_EXCEPTIONS +from ...symbols import ORTH, NORM +from ...util import update_exc _exc = {} _abbrev_exc = [ # Weekdays abbreviations - {ORTH: "пoн", LEMMA: "понедељак", NORM: "понедељак"}, - {ORTH: "уто", LEMMA: "уторак", NORM: "уторак"}, - {ORTH: "сре", LEMMA: "среда", NORM: "среда"}, - {ORTH: "чет", LEMMA: "четвртак", NORM: "четвртак"}, - {ORTH: "пет", LEMMA: "петак", NORM: "петак"}, - {ORTH: "суб", LEMMA: "субота", NORM: "субота"}, - {ORTH: "нед", LEMMA: "недеља", NORM: "недеља"}, + {ORTH: "пoн", NORM: "понедељак"}, + {ORTH: "уто", NORM: "уторак"}, + {ORTH: "сре", NORM: "среда"}, + {ORTH: "чет", NORM: "четвртак"}, + {ORTH: "пет", NORM: "петак"}, + {ORTH: "суб", NORM: "субота"}, + {ORTH: "нед", NORM: "недеља"}, # Months abbreviations - {ORTH: "јан", LEMMA: "јануар", NORM: "јануар"}, - {ORTH: "феб", LEMMA: "фебруар", NORM: "фебруар"}, - {ORTH: "мар", LEMMA: "март", NORM: "март"}, - {ORTH: "апр", LEMMA: "април", NORM: "април"}, - {ORTH: "јуни", LEMMA: "јун", NORM: "јун"}, - {ORTH: "јули", LEMMA: "јул", NORM: "јул"}, - {ORTH: "авг", LEMMA: "август", NORM: "август"}, - {ORTH: "сеп", LEMMA: "септембар", NORM: "септембар"}, - 
{ORTH: "септ", LEMMA: "септембар", NORM: "септембар"}, - {ORTH: "окт", LEMMA: "октобар", NORM: "октобар"}, - {ORTH: "нов", LEMMA: "новембар", NORM: "новембар"}, - {ORTH: "дец", LEMMA: "децембар", NORM: "децембар"}, + {ORTH: "јан", NORM: "јануар"}, + {ORTH: "феб", NORM: "фебруар"}, + {ORTH: "мар", NORM: "март"}, + {ORTH: "апр", NORM: "април"}, + {ORTH: "јуни", NORM: "јун"}, + {ORTH: "јули", NORM: "јул"}, + {ORTH: "авг", NORM: "август"}, + {ORTH: "сеп", NORM: "септембар"}, + {ORTH: "септ", NORM: "септембар"}, + {ORTH: "окт", NORM: "октобар"}, + {ORTH: "нов", NORM: "новембар"}, + {ORTH: "дец", NORM: "децембар"}, ] for abbrev_desc in _abbrev_exc: abbrev = abbrev_desc[ORTH] for orth in (abbrev, abbrev.capitalize(), abbrev.upper()): - _exc[orth] = [{ORTH: orth, LEMMA: abbrev_desc[LEMMA], NORM: abbrev_desc[NORM]}] - _exc[orth + "."] = [ - {ORTH: orth + ".", LEMMA: abbrev_desc[LEMMA], NORM: abbrev_desc[NORM]} - ] + _exc[orth] = [{ORTH: orth, NORM: abbrev_desc[NORM]}] + _exc[orth + "."] = [{ORTH: orth + ".", NORM: abbrev_desc[NORM]}] # common abbreviations _slang_exc = [ # without dot - {ORTH: "др", LEMMA: "доктор", NORM: "доктор"}, - {ORTH: "гдин", LEMMA: "господин", NORM: "господин"}, - {ORTH: "гђа", LEMMA: "госпођа", NORM: "госпођа"}, - {ORTH: "гђица", LEMMA: "госпођица", NORM: "госпођица"}, - {ORTH: "мр", LEMMA: "магистар", NORM: "магистар"}, - {ORTH: "Бгд", LEMMA: "Београд", NORM: "београд"}, - {ORTH: "цм", LEMMA: "центиметар", NORM: "центиметар"}, - {ORTH: "м", LEMMA: "метар", NORM: "метар"}, - {ORTH: "км", LEMMA: "километар", NORM: "километар"}, - {ORTH: "мг", LEMMA: "милиграм", NORM: "милиграм"}, - {ORTH: "кг", LEMMA: "килограм", NORM: "килограм"}, - {ORTH: "дл", LEMMA: "децилитар", NORM: "децилитар"}, - {ORTH: "хл", LEMMA: "хектолитар", NORM: "хектолитар"}, + {ORTH: "др", NORM: "доктор"}, + {ORTH: "гдин", NORM: "господин"}, + {ORTH: "гђа", NORM: "госпођа"}, + {ORTH: "гђица", NORM: "госпођица"}, + {ORTH: "мр", NORM: "магистар"}, + {ORTH: "Бгд", NORM: "београд"}, + {ORTH: "цм", NORM: "центиметар"}, + {ORTH: "м", NORM: "метар"}, + {ORTH: "км", NORM: "километар"}, + {ORTH: "мг", NORM: "милиграм"}, + {ORTH: "кг", NORM: "килограм"}, + {ORTH: "дл", NORM: "децилитар"}, + {ORTH: "хл", NORM: "хектолитар"}, # with dot - {ORTH: "ул.", LEMMA: "улица", NORM: "улица"}, - {ORTH: "бр.", LEMMA: "број", NORM: "број"}, - {ORTH: "нпр.", LEMMA: "на пример", NORM: "на пример"}, - {ORTH: "тзв.", LEMMA: "такозван", NORM: "такозван"}, - {ORTH: "проф.", LEMMA: "професор", NORM: "професор"}, - {ORTH: "стр.", LEMMA: "страна", NORM: "страна"}, - {ORTH: "једн.", LEMMA: "једнина", NORM: "једнина"}, - {ORTH: "мн.", LEMMA: "множина", NORM: "множина"}, - {ORTH: "уч.", LEMMA: "ученик", NORM: "ученик"}, - {ORTH: "разр.", LEMMA: "разред", NORM: "разред"}, - {ORTH: "инж.", LEMMA: "инжењер", NORM: "инжењер"}, - {ORTH: "гимн.", LEMMA: "гимназија", NORM: "гимназија"}, - {ORTH: "год.", LEMMA: "година", NORM: "година"}, - {ORTH: "мед.", LEMMA: "медицина", NORM: "медицина"}, - {ORTH: "гимн.", LEMMA: "гимназија", NORM: "гимназија"}, - {ORTH: "акад.", LEMMA: "академик", NORM: "академик"}, - {ORTH: "доц.", LEMMA: "доцент", NORM: "доцент"}, - {ORTH: "итд.", LEMMA: "и тако даље", NORM: "и тако даље"}, - {ORTH: "и сл.", LEMMA: "и слично", NORM: "и слично"}, - {ORTH: "н.е.", LEMMA: "нова ера", NORM: "нове ере"}, - {ORTH: "о.г.", LEMMA: "ова година", NORM: "ове године"}, - {ORTH: "л.к.", LEMMA: "лична карта", NORM: "лична карта"}, - {ORTH: "в.д.", LEMMA: "вршилац дужности", NORM: "вршилац дужности"}, - {ORTH: "стр.", LEMMA: "страна", NORM: 
"страна"}, + {ORTH: "ул.", NORM: "улица"}, + {ORTH: "бр.", NORM: "број"}, + {ORTH: "нпр.", NORM: "на пример"}, + {ORTH: "тзв.", NORM: "такозван"}, + {ORTH: "проф.", NORM: "професор"}, + {ORTH: "стр.", NORM: "страна"}, + {ORTH: "једн.", NORM: "једнина"}, + {ORTH: "мн.", NORM: "множина"}, + {ORTH: "уч.", NORM: "ученик"}, + {ORTH: "разр.", NORM: "разред"}, + {ORTH: "инж.", NORM: "инжењер"}, + {ORTH: "гимн.", NORM: "гимназија"}, + {ORTH: "год.", NORM: "година"}, + {ORTH: "мед.", NORM: "медицина"}, + {ORTH: "гимн.", NORM: "гимназија"}, + {ORTH: "акад.", NORM: "академик"}, + {ORTH: "доц.", NORM: "доцент"}, + {ORTH: "итд.", NORM: "и тако даље"}, + {ORTH: "и сл.", NORM: "и слично"}, + {ORTH: "н.е.", NORM: "нове ере"}, + {ORTH: "о.г.", NORM: "ове године"}, + {ORTH: "л.к.", NORM: "лична карта"}, + {ORTH: "в.д.", NORM: "вршилац дужности"}, + {ORTH: "стр.", NORM: "страна"}, # with qoute - {ORTH: "ал'", LEMMA: "али", NORM: "али"}, - {ORTH: "ил'", LEMMA: "или", NORM: "или"}, - {ORTH: "је л'", LEMMA: "је ли", NORM: "је ли"}, - {ORTH: "да л'", LEMMA: "да ли", NORM: "да ли"}, - {ORTH: "држ'те", LEMMA: "држати", NORM: "држите"}, + {ORTH: "ал'", NORM: "али"}, + {ORTH: "ил'", NORM: "или"}, + {ORTH: "је л'", NORM: "је ли"}, + {ORTH: "да л'", NORM: "да ли"}, + {ORTH: "држ'те", NORM: "држите"}, ] for slang_desc in _slang_exc: _exc[slang_desc[ORTH]] = [slang_desc] -TOKENIZER_EXCEPTIONS = _exc +TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc) diff --git a/spacy/lang/sv/__init__.py b/spacy/lang/sv/__init__.py index 3a749eeee..2490eb9ec 100644 --- a/spacy/lang/sv/__init__.py +++ b/spacy/lang/sv/__init__.py @@ -1,38 +1,24 @@ -# coding: utf8 -from __future__ import unicode_literals - +from typing import Optional +from thinc.api import Model from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS -from .tag_map import TAG_MAP from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS -from .morph_rules import MORPH_RULES +from .syntax_iterators import SYNTAX_ITERATORS +from ...language import Language +from ...pipeline import Lemmatizer + # Punctuation stolen from Danish from ..da.punctuation import TOKENIZER_INFIXES, TOKENIZER_SUFFIXES -from ..tokenizer_exceptions import BASE_EXCEPTIONS -from ..norm_exceptions import BASE_NORMS -from ...language import Language -from ...attrs import LANG, NORM -from ...util import update_exc, add_lookups -from .syntax_iterators import SYNTAX_ITERATORS - class SwedishDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters.update(LEX_ATTRS) - lex_attr_getters[LANG] = lambda text: "sv" - lex_attr_getters[NORM] = add_lookups( - Language.Defaults.lex_attr_getters[NORM], BASE_NORMS - ) - tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) - morph_rules = MORPH_RULES - tag_map = TAG_MAP + tokenizer_exceptions = TOKENIZER_EXCEPTIONS infixes = TOKENIZER_INFIXES suffixes = TOKENIZER_SUFFIXES - stop_words = STOP_WORDS - morph_rules = MORPH_RULES + lex_attr_getters = LEX_ATTRS syntax_iterators = SYNTAX_ITERATORS + stop_words = STOP_WORDS class Swedish(Language): @@ -40,4 +26,14 @@ class Swedish(Language): Defaults = SwedishDefaults +@Swedish.factory( + "lemmatizer", + assigns=["token.lemma"], + default_config={"model": None, "mode": "rule"}, + default_score_weights={"lemma_acc": 1.0}, +) +def make_lemmatizer(nlp: Language, model: Optional[Model], name: str, mode: str): + return Lemmatizer(nlp.vocab, model, name, mode=mode) + + __all__ = ["Swedish"] diff --git a/spacy/lang/sv/examples.py 
b/spacy/lang/sv/examples.py index 58e095195..bc6cd7a54 100644 --- a/spacy/lang/sv/examples.py +++ b/spacy/lang/sv/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/sv/lex_attrs.py b/spacy/lang/sv/lex_attrs.py index 24d06a97a..f8ada9e2e 100644 --- a/spacy/lang/sv/lex_attrs.py +++ b/spacy/lang/sv/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/sv/morph_rules.py b/spacy/lang/sv/morph_rules.py deleted file mode 100644 index a131ce49d..000000000 --- a/spacy/lang/sv/morph_rules.py +++ /dev/null @@ -1,302 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import LEMMA, PRON_LEMMA - - -# Used the table of pronouns at https://sv.wiktionary.org/wiki/deras -MORPH_RULES = { - "PRP": { - "jag": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "One", - "Number": "Sing", - "Case": "Nom", - }, - "mig": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "One", - "Number": "Sing", - "Case": "Acc", - }, - "mej": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "One", - "Number": "Sing", - "Case": "Acc", - }, - "du": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Two", - "Number": "Sing", - "Case": "Nom", - }, - "dig": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Two", - "Number": "Sing", - "Case": "Acc", - }, - "dej": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Two", - "Number": "Sing", - "Case": "Acc", - }, - "han": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Gender": "Masc", - "Case": "Nom", - }, - "honom": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Gender": "Masc", - "Case": "Acc", - }, - "hon": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Gender": "Fem", - "Case": "Nom", - }, - "henne": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Gender": "Fem", - "Case": "Acc", - }, - "det": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Number": "Sing", - "Gender": "Neut", - }, - "vi": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "One", - "Number": "Plur", - "Case": "Nom", - }, - "oss": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "One", - "Number": "Plur", - "Case": "Acc", - }, - "ni": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Two", - "Number": "Plur", - "Case": "Nom", - }, - "er": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur"}, - "de": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Number": "Plur", - "Case": "Nom", - }, - "dom": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Number": "Plur", - "Case": ("Nom", "Acc"), - }, - "dem": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Number": "Plur", - "Case": "Acc", - }, - "min": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "One", - "Number": "Sing", - "Poss": "Yes", - "Reflex": "Yes", - }, - "mitt": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "One", - "Number": "Sing", - "Poss": "Yes", - "Reflex": "Yes", - }, - "mina": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "One", - "Number": "Plur", - "Poss": "Yes", - "Reflex": "Yes", - }, - "din": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Two", - "Number": "Sing", - 
"Poss": "Yes", - "Reflex": "Yes", - }, - "ditt": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Two", - "Number": "Sing", - "Poss": "Yes", - "Reflex": "Yes", - }, - "dina": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Two", - "Number": "Plur", - "Poss": "Yes", - "Reflex": "Yes", - }, - "hans": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Two", - "Number": ("Sing", "Plur"), - "Gender": "Masc", - "Poss": "Yes", - "Reflex": "Yes", - }, - "hennes": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Two", - "Number": ("Sing", "Plur"), - "Gender": "Fem", - "Poss": "Yes", - "Reflex": "Yes", - }, - "dess": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Two", - "Number": ("Sing", "Plur"), - "Poss": "Yes", - "Reflex": "Yes", - }, - "vår": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "One", - "Number": "Plur", - "Poss": "Yes", - "Reflex": "Yes", - }, - "våran": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "One", - "Number": "Plur", - "Poss": "Yes", - "Reflex": "Yes", - }, - "vårt": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "One", - "Number": "Plur", - "Poss": "Yes", - "Reflex": "Yes", - }, - "vårat": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "One", - "Number": "Plur", - "Poss": "Yes", - "Reflex": "Yes", - }, - "våra": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "One", - "Number": "Plur", - "Poss": "Yes", - "Reflex": "Yes", - }, - "eran": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Two", - "Number": "Plur", - "Poss": "Yes", - "Reflex": "Yes", - }, - "ert": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Two", - "Number": "Plur", - "Poss": "Yes", - "Reflex": "Yes", - }, - "erat": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Two", - "Number": "Plur", - "Poss": "Yes", - "Reflex": "Yes", - }, - "era": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Two", - "Number": "Plur", - "Poss": "Yes", - "Reflex": "Yes", - }, - "deras": { - LEMMA: PRON_LEMMA, - "PronType": "Prs", - "Person": "Three", - "Number": "Plur", - "Poss": "Yes", - "Reflex": "Yes", - }, - }, - "VBZ": { - "är": { - "VerbForm": "Fin", - "Person": ("One", "Two", "Three"), - "Tense": "Pres", - "Mood": "Ind", - } - }, - "VBP": {"är": {"VerbForm": "Fin", "Tense": "Pres", "Mood": "Ind"}}, - "VBD": { - "var": {"VerbForm": "Fin", "Tense": "Past", "Number": "Sing"}, - "vart": {"VerbForm": "Fin", "Tense": "Past", "Number": "Plur"}, - }, -} diff --git a/spacy/lang/sv/stop_words.py b/spacy/lang/sv/stop_words.py index 206abce5a..2422b2a9e 100644 --- a/spacy/lang/sv/stop_words.py +++ b/spacy/lang/sv/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """ aderton adertonde adjö aldrig alla allas allt alltid alltså än andra andras diff --git a/spacy/lang/sv/syntax_iterators.py b/spacy/lang/sv/syntax_iterators.py index 84d295f96..d5ae47853 100644 --- a/spacy/lang/sv/syntax_iterators.py +++ b/spacy/lang/sv/syntax_iterators.py @@ -1,30 +1,18 @@ -# coding: utf8 -from __future__ import unicode_literals +from typing import Union, Iterator from ...symbols import NOUN, PROPN, PRON from ...errors import Errors +from ...tokens import Doc, Span -def noun_chunks(doclike): - """ - Detect base noun phrases from a dependency parse. Works on both Doc and Span. 
- """ - labels = [ - "nsubj", - "nsubj:pass", - "dobj", - "obj", - "iobj", - "ROOT", - "appos", - "nmod", - "nmod:poss", - ] +def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Span]: + """Detect base noun phrases from a dependency parse. Works on Doc and Span.""" + # fmt: off + labels = ["nsubj", "nsubj:pass", "dobj", "obj", "iobj", "ROOT", "appos", "nmod", "nmod:poss"] + # fmt: on doc = doclike.doc # Ensure works on both Doc and Span. - - if not doc.is_parsed: + if not doc.has_annotation("DEP"): raise ValueError(Errors.E029) - np_deps = [doc.vocab.strings[label] for label in labels] conj = doc.vocab.strings.add("conj") np_label = doc.vocab.strings.add("NP") diff --git a/spacy/lang/sv/tag_map.py b/spacy/lang/sv/tag_map.py deleted file mode 100644 index 7d4e29030..000000000 --- a/spacy/lang/sv/tag_map.py +++ /dev/null @@ -1,191 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import POS, PUNCT, ADJ, CCONJ, SCONJ, NUM, DET, ADV -from ...symbols import ADP, X, VERB, NOUN, PROPN, PART, INTJ, PRON - - -# Tag mappings according to https://universaldependencies.org/tagset-conversion/sv-suc-uposf.html -# for https://github.com/UniversalDependencies/UD_Swedish-Talbanken - -TAG_MAP = { - "AB": {POS: ADV}, # inte, också, så, bara, nu - "AB|AN": {POS: ADV}, # t.ex., ca, t_ex, bl.a., s_k - "AB|KOM": {POS: ADV}, # mer, tidigare, mindre, vidare, mera - "AB|POS": {POS: ADV}, # mycket, helt, ofta, länge, långt - "AB|SMS": {POS: ADV}, # över-, in- - "AB|SUV": {POS: ADV}, # minst, mest, högst, främst, helst - "DT|MAS|SIN|DEF": {POS: DET}, - "DT|MAS|SIN|IND": {POS: DET}, - "DT|NEU|SIN|DEF": {POS: DET}, # det, detta - "DT|NEU|SIN|IND": {POS: DET}, # ett, något, inget, vart, vartannat - "DT|NEU|SIN|IND/DEF": {POS: DET}, # allt - "DT|UTR/NEU|PLU|DEF": {POS: DET}, # de, dessa, bägge, dom - "DT|UTR/NEU|PLU|IND": {POS: DET}, # några, inga - "DT|UTR/NEU|PLU|IND/DEF": {POS: DET}, # alla - "DT|UTR/NEU|SIN/PLU|IND": {POS: DET}, # samma - "DT|UTR/NEU|SIN|DEF": {POS: DET}, # vardera - "DT|UTR/NEU|SIN|IND": {POS: DET}, # varje, varenda - "DT|UTR|SIN|DEF": {POS: DET}, # den, denna - "DT|UTR|SIN|IND": {POS: DET}, # en, någon, ingen, var, varannan - "DT|UTR|SIN|IND/DEF": {POS: DET}, # all - "HA": {POS: ADV}, # när, där, hur, som, då - "HD|NEU|SIN|IND": {POS: DET}, # vilket - "HD|UTR/NEU|PLU|IND": {POS: DET}, # vilka - "HD|UTR|SIN|IND": {POS: DET}, # vilken - "HP|-|-|-": {POS: PRON}, # som - "HP|NEU|SIN|IND": {POS: PRON}, # vad, vilket - "HP|NEU|SIN|IND|SMS": {POS: PRON}, - "HP|UTR/NEU|PLU|IND": {POS: PRON}, # vilka - "HP|UTR|SIN|IND": {POS: PRON}, # vilken, vem - "HS|DEF": {POS: DET}, # vars, vilkas, Vems - "IE": {POS: PART}, # att - "IN": {POS: INTJ}, # Jo, ja, nej, fan, visst - "JJ|AN": {POS: ADJ}, # ev, S:t, Kungl, Kungl., Teol - "JJ|KOM|UTR/NEU|SIN/PLU|IND/DEF|GEN": {POS: ADJ}, # äldres - "JJ|KOM|UTR/NEU|SIN/PLU|IND/DEF|NOM": { - POS: ADJ - }, # större, högre, mindre, bättre, äldre - "JJ|KOM|UTR/NEU|SIN/PLU|IND/DEF|SMS": {POS: ADJ}, - "JJ|POS|MAS|SIN|DEF|GEN": {POS: ADJ}, # enskildes, sjukes, andres - "JJ|POS|MAS|SIN|DEF|NOM": {POS: ADJ}, # enskilde, sjuke, andre, unge, ene - "JJ|POS|NEU|SIN|IND/DEF|NOM": {POS: ADJ}, # eget - "JJ|POS|NEU|SIN|IND|GEN": {POS: ADJ}, - "JJ|POS|NEU|SIN|IND|NOM": {POS: ADJ}, # annat, svårt, möjligt, nytt, sådant - "JJ|POS|UTR/NEU|PLU|IND/DEF|GEN": { - POS: ADJ - }, # ogiftas, ungas, frånskildas, efterkommandes, färgblindas - "JJ|POS|UTR/NEU|PLU|IND/DEF|NOM": {POS: ADJ}, # olika, andra, många, stora, vissa - "JJ|POS|UTR/NEU|PLU|IND|NOM": {POS: ADJ}, # 
flera, sådana, fler, få, samtliga - "JJ|POS|UTR/NEU|SIN/PLU|IND|NOM": {POS: ADJ}, - "JJ|POS|UTR/NEU|SIN/PLU|IND/DEF|NOM": {POS: ADJ}, # bra, ena, enda, nästa, ringa - "JJ|POS|UTR/NEU|SIN|DEF|GEN": {POS: ADJ}, - "JJ|POS|UTR/NEU|SIN|DEF|NOM": {POS: ADJ}, # hela, nya, andra, svenska, ekonomiska - "JJ|POS|UTR|-|-|SMS": {POS: ADJ}, # fri-, låg-, sexual- - "JJ|POS|UTR|SIN|IND/DEF|NOM": {POS: ADJ}, # egen - "JJ|POS|UTR|SIN|IND|GEN": {POS: ADJ}, # enskilds - "JJ|POS|UTR|SIN|IND|NOM": {POS: ADJ}, # stor, annan, själv, sådan, viss - "JJ|SUV|MAS|SIN|DEF|GEN": {POS: ADJ}, - "JJ|SUV|MAS|SIN|DEF|NOM": {POS: ADJ}, # störste, främste, äldste, minste - "JJ|SUV|UTR/NEU|PLU|DEF|NOM": {POS: ADJ}, # flesta - "JJ|SUV|UTR/NEU|PLU|IND|NOM": {POS: ADJ}, - "JJ|SUV|UTR/NEU|SIN/PLU|DEF|NOM": { - POS: ADJ - }, # bästa, största, närmaste, viktigaste, högsta - "JJ|SUV|UTR/NEU|SIN/PLU|IND|NOM": { - POS: ADJ - }, # störst, bäst, tidigast, högst, fattigast - "KN": {POS: CCONJ}, # och, eller, som, än, men - "KN|AN": {POS: CCONJ}, - "MAD": {POS: PUNCT}, # ., ?, :, !, ... - "MID": {POS: PUNCT}, # ,, -, :, *, ; - "NN|-|-|-|-": {POS: NOUN}, # godo, fjol, fullo, somras, måtto - "NN|AN": {POS: NOUN}, # kr, %, s., dr, kap. - "NN|NEU|-|-|-": {POS: NOUN}, - "NN|NEU|-|-|SMS": {POS: NOUN}, # yrkes-, barn-, hem-, fack-, vatten- - "NN|NEU|PLU|DEF|GEN": { - POS: NOUN - }, # barnens, årens, u-ländernas, företagens, århundradenas - "NN|NEU|PLU|DEF|NOM": {POS: NOUN}, # barnen, u-länderna, åren, länderna, könen - "NN|NEU|PLU|IND|GEN": {POS: NOUN}, # slags, års, barns, länders, tusentals - "NN|NEU|PLU|IND|NOM": {POS: NOUN}, # barn, år, fall, länder, problem - "NN|NEU|SIN|DEF|GEN": { - POS: NOUN - }, # äktenskapets, samhällets, barnets, 1800-talets, 1960-talets - "NN|NEU|SIN|DEF|NOM": { - POS: NOUN - }, # äktenskapet, samhället, barnet, stället, hemmet - "NN|NEU|SIN|IND|GEN": {POS: NOUN}, # års, slags, lands, havs, företags - "NN|NEU|SIN|IND|NOM": {POS: NOUN}, # år, arbete, barn, sätt, äktenskap - "NN|SMS": {POS: NOUN}, # PCB-, Syd- - "NN|UTR|-|-|-": {POS: NOUN}, # dags, rätta - "NN|UTR|-|-|SMS": {POS: NOUN}, # far-, kibbutz-, röntgen-, barna-, hälso- - "NN|UTR|PLU|DEF|GEN": { - POS: NOUN - }, # föräldrarnas, kvinnornas, elevernas, kibbutzernas, makarnas - "NN|UTR|PLU|DEF|NOM": { - POS: NOUN - }, # kvinnorna, föräldrarna, makarna, männen, hyrorna - "NN|UTR|PLU|IND|GEN": {POS: NOUN}, # människors, kvinnors, dagars, tiders, månaders - "NN|UTR|PLU|IND|NOM": {POS: NOUN}, # procent, människor, kvinnor, miljoner, kronor - "NN|UTR|SIN|DEF|GEN": {POS: NOUN}, # kvinnans, världens, familjens, dagens, jordens - "NN|UTR|SIN|DEF|NOM": {POS: NOUN}, # familjen, kvinnan, mannen, världen, skolan - "NN|UTR|SIN|IND|GEN": {POS: NOUN}, # sorts, medelålders, makes, kvinnas, veckas - "NN|UTR|SIN|IND|NOM": {POS: NOUN}, # del, tid, dag, fråga, man - "PAD": {POS: PUNCT}, # , ), ( - "PC|AN": {POS: VERB}, - "PC|PRF|MAS|SIN|DEF|GEN": {POS: VERB}, # avlidnes - "PC|PRF|MAS|SIN|DEF|NOM": {POS: VERB}, - "PC|PRF|NEU|SIN|IND|NOM": {POS: VERB}, # taget, sett, särskilt, förbjudet, ökat - "PC|PRF|UTR/NEU|PLU|IND/DEF|GEN": {POS: VERB}, # försäkrades, anställdas - "PC|PRF|UTR/NEU|PLU|IND/DEF|NOM": { - POS: VERB - }, # särskilda, gifta, ökade, handikappade, skilda - "PC|PRF|UTR/NEU|SIN|DEF|GEN": {POS: VERB}, - "PC|PRF|UTR/NEU|SIN|DEF|NOM": {POS: VERB}, # ökade, gifta, nämnda, nedärvda, dolda - "PC|PRF|UTR|SIN|IND|GEN": {POS: VERB}, - "PC|PRF|UTR|SIN|IND|NOM": {POS: VERB}, # särskild, ökad, beredd, gift, oförändrad - "PC|PRS|UTR/NEU|SIN/PLU|IND/DEF|GEN": { - POS: VERB - }, # 
studerandes, sammanboendes, dubbelarbetandes - "PC|PRS|UTR/NEU|SIN/PLU|IND/DEF|NOM": { - POS: VERB - }, # följande, beroende, nuvarande, motsvarande, liknande - "PL": {POS: PART}, # ut, upp, in, till, med - "PL|SMS": {POS: PART}, - "PM": {POS: PROPN}, # F, N, Liechtenstein, Danmark, DK - "PM|GEN": {POS: PROPN}, # Sveriges, EEC:s, Guds, Stockholms, Kristi - "PM|NOM": {POS: PROPN}, # Sverige, EEC, Stockholm, USA, ATP - "PM|SMS": {POS: PROPN}, # Göteborgs-, Nord-, Väst- - "PN|MAS|SIN|DEF|SUB/OBJ": {POS: PRON}, # denne - "PN|NEU|SIN|DEF|SUB/OBJ": {POS: PRON}, # det, detta, detsamma - "PN|NEU|SIN|IND|SUB/OBJ": {POS: PRON}, # något, allt, mycket, annat, ingenting - "PN|UTR/NEU|PLU|DEF|OBJ": {POS: PRON}, # dem, varandra, varann - "PN|UTR/NEU|PLU|DEF|SUB": {POS: PRON}, # de, bägge - "PN|UTR/NEU|PLU|DEF|SUB/OBJ": {POS: PRON}, # dessa, dom, båda, den, bådadera - "PN|UTR/NEU|PLU|IND|SUB/OBJ": {POS: PRON}, # andra, alla, många, sådana, några - "PN|UTR/NEU|SIN/PLU|DEF|OBJ": {POS: PRON}, # sig, sej - "PN|UTR|PLU|DEF|OBJ": {POS: PRON}, # oss, er, eder - "PN|UTR|PLU|DEF|SUB": {POS: PRON}, # vi - "PN|UTR|SIN|DEF|OBJ": {POS: PRON}, # dig, mig, henne, honom, Er - "PN|UTR|SIN|DEF|SUB": {POS: PRON}, # du, han, hon, jag, ni - "PN|UTR|SIN|DEF|SUB/OBJ": {POS: PRON}, # den, denna, densamma - "PN|UTR|SIN|IND|SUB": {POS: PRON}, # man - "PN|UTR|SIN|IND|SUB/OBJ": {POS: PRON}, # en, var, någon, ingen, Varannan - "PP": {POS: ADP}, # i, av, på, för, till - "PP|AN": {POS: ADP}, # f - "PS|AN": {POS: DET}, - "PS|NEU|SIN|DEF": {POS: DET}, # sitt, vårt, ditt, mitt, ert - "PS|UTR/NEU|PLU|DEF": {POS: DET}, # sina, våra, dina, mina - "PS|UTR/NEU|SIN/PLU|DEF": {POS: DET}, # deras, dess, hans, hennes, varandras - "PS|UTR|SIN|DEF": {POS: DET}, # sin, vår, din, min, er - "RG": {POS: NUM}, # 2, 17, 20, 1, 18 - "RG|GEN": {POS: NUM}, - "RG|MAS|SIN|DEF|NOM": {POS: NUM}, - "RG|NEU|SIN|IND|NOM": {POS: NUM}, # ett - "RG|NOM": {POS: NUM}, # två, tre, 1, 20, 2 - "RG|SMS": {POS: NUM}, # ett-, 1950-, två-, tre-, 1700- - "RG|UTR/NEU|SIN|DEF|NOM": {POS: NUM}, - "RG|UTR|SIN|IND|NOM": {POS: NUM}, # en - "RO|MAS|SIN|IND/DEF|GEN": {POS: ADJ}, - "RO|MAS|SIN|IND/DEF|NOM": {POS: ADJ}, # förste - "RO|GEN": {POS: ADJ}, - "RO|NOM": {POS: ADJ}, # första, andra, tredje, fjärde, femte - "SN": {POS: SCONJ}, # att, om, innan, eftersom, medan - "UO": {POS: X}, # companionship, vice, versa, family, capita - "VB|AN": {POS: VERB}, # jfr - "VB|IMP|AKT": {POS: VERB}, # se, Diskutera, låt, Läs, Gå - "VB|IMP|SFO": {POS: VERB}, # tas - "VB|INF|AKT": {POS: VERB}, # vara, få, ha, bli, kunna - "VB|INF|SFO": {POS: VERB}, # användas, finnas, göras, tas, ses - "VB|KON|PRS|AKT": {POS: VERB}, # vare, Gånge - "VB|KON|PRT|AKT": {POS: VERB}, # vore, finge - "VB|KON|PRT|SFO": {POS: VERB}, - "VB|PRS|AKT": {POS: VERB}, # är, har, kan, får, måste - "VB|PRS|SFO": {POS: VERB}, # finns, kallas, behövs, beräknas, används - "VB|PRT|AKT": {POS: VERB}, # skulle, var, hade, kunde, fick - "VB|PRT|SFO": {POS: VERB}, # fanns, gjordes, höjdes, användes, infördes - "VB|SMS": {POS: VERB}, # läs- - "VB|SUP|AKT": {POS: VERB}, # varit, fått, blivit, haft, kommit - "VB|SUP|SFO": {POS: VERB}, # nämnts, gjorts, förändrats, sagts, framhållits -} diff --git a/spacy/lang/sv/tokenizer_exceptions.py b/spacy/lang/sv/tokenizer_exceptions.py index e95c67f37..ce7db895a 100644 --- a/spacy/lang/sv/tokenizer_exceptions.py +++ b/spacy/lang/sv/tokenizer_exceptions.py @@ -1,7 +1,6 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import LEMMA, NORM, ORTH, PRON_LEMMA +from 
..tokenizer_exceptions import BASE_EXCEPTIONS +from ...symbols import NORM, ORTH +from ...util import update_exc _exc = {} @@ -11,61 +10,58 @@ _exc = {} for verb_data in [ {ORTH: "driver"}, {ORTH: "kör"}, - {ORTH: "hörr", LEMMA: "hör"}, + {ORTH: "hörr"}, {ORTH: "fattar"}, - {ORTH: "hajar", LEMMA: "förstår"}, + {ORTH: "hajar"}, {ORTH: "lever"}, - {ORTH: "serr", LEMMA: "ser"}, + {ORTH: "serr"}, {ORTH: "fixar"}, ]: verb_data_tc = dict(verb_data) verb_data_tc[ORTH] = verb_data_tc[ORTH].title() for data in [verb_data, verb_data_tc]: - _exc[data[ORTH] + "u"] = [ - dict(data), - {ORTH: "u", LEMMA: PRON_LEMMA, NORM: "du"}, - ] + _exc[data[ORTH] + "u"] = [data, {ORTH: "u", NORM: "du"}] # Abbreviations for weekdays "sön." (for "söndag" / "söner") # are left out because they are ambiguous. The same is the case # for abbreviations "jul." and "Jul." ("juli" / "jul"). for exc_data in [ - {ORTH: "jan.", LEMMA: "januari"}, - {ORTH: "febr.", LEMMA: "februari"}, - {ORTH: "feb.", LEMMA: "februari"}, - {ORTH: "apr.", LEMMA: "april"}, - {ORTH: "jun.", LEMMA: "juni"}, - {ORTH: "aug.", LEMMA: "augusti"}, - {ORTH: "sept.", LEMMA: "september"}, - {ORTH: "sep.", LEMMA: "september"}, - {ORTH: "okt.", LEMMA: "oktober"}, - {ORTH: "nov.", LEMMA: "november"}, - {ORTH: "dec.", LEMMA: "december"}, - {ORTH: "mån.", LEMMA: "måndag"}, - {ORTH: "tis.", LEMMA: "tisdag"}, - {ORTH: "ons.", LEMMA: "onsdag"}, - {ORTH: "tors.", LEMMA: "torsdag"}, - {ORTH: "fre.", LEMMA: "fredag"}, - {ORTH: "lör.", LEMMA: "lördag"}, - {ORTH: "Jan.", LEMMA: "Januari"}, - {ORTH: "Febr.", LEMMA: "Februari"}, - {ORTH: "Feb.", LEMMA: "Februari"}, - {ORTH: "Apr.", LEMMA: "April"}, - {ORTH: "Jun.", LEMMA: "Juni"}, - {ORTH: "Aug.", LEMMA: "Augusti"}, - {ORTH: "Sept.", LEMMA: "September"}, - {ORTH: "Sep.", LEMMA: "September"}, - {ORTH: "Okt.", LEMMA: "Oktober"}, - {ORTH: "Nov.", LEMMA: "November"}, - {ORTH: "Dec.", LEMMA: "December"}, - {ORTH: "Mån.", LEMMA: "Måndag"}, - {ORTH: "Tis.", LEMMA: "Tisdag"}, - {ORTH: "Ons.", LEMMA: "Onsdag"}, - {ORTH: "Tors.", LEMMA: "Torsdag"}, - {ORTH: "Fre.", LEMMA: "Fredag"}, - {ORTH: "Lör.", LEMMA: "Lördag"}, - {ORTH: "sthlm", LEMMA: "Stockholm"}, - {ORTH: "gbg", LEMMA: "Göteborg"}, + {ORTH: "jan.", NORM: "januari"}, + {ORTH: "febr.", NORM: "februari"}, + {ORTH: "feb.", NORM: "februari"}, + {ORTH: "apr.", NORM: "april"}, + {ORTH: "jun.", NORM: "juni"}, + {ORTH: "aug.", NORM: "augusti"}, + {ORTH: "sept.", NORM: "september"}, + {ORTH: "sep.", NORM: "september"}, + {ORTH: "okt.", NORM: "oktober"}, + {ORTH: "nov.", NORM: "november"}, + {ORTH: "dec.", NORM: "december"}, + {ORTH: "mån.", NORM: "måndag"}, + {ORTH: "tis.", NORM: "tisdag"}, + {ORTH: "ons.", NORM: "onsdag"}, + {ORTH: "tors.", NORM: "torsdag"}, + {ORTH: "fre.", NORM: "fredag"}, + {ORTH: "lör.", NORM: "lördag"}, + {ORTH: "Jan.", NORM: "Januari"}, + {ORTH: "Febr.", NORM: "Februari"}, + {ORTH: "Feb.", NORM: "Februari"}, + {ORTH: "Apr.", NORM: "April"}, + {ORTH: "Jun.", NORM: "Juni"}, + {ORTH: "Aug.", NORM: "Augusti"}, + {ORTH: "Sept.", NORM: "September"}, + {ORTH: "Sep.", NORM: "September"}, + {ORTH: "Okt.", NORM: "Oktober"}, + {ORTH: "Nov.", NORM: "November"}, + {ORTH: "Dec.", NORM: "December"}, + {ORTH: "Mån.", NORM: "Måndag"}, + {ORTH: "Tis.", NORM: "Tisdag"}, + {ORTH: "Ons.", NORM: "Onsdag"}, + {ORTH: "Tors.", NORM: "Torsdag"}, + {ORTH: "Fre.", NORM: "Fredag"}, + {ORTH: "Lör.", NORM: "Lördag"}, + {ORTH: "sthlm", NORM: "Stockholm"}, + {ORTH: "gbg", NORM: "Göteborg"}, ]: _exc[exc_data[ORTH]] = [exc_data] @@ -155,6 +151,6 @@ for orth in ABBREVIATIONS: # Sentences ending 
in "i." (as in "... peka i."), "m." (as in "...än 2000 m."), # should be tokenized as two separate tokens. for orth in ["i", "m"]: - _exc[orth + "."] = [{ORTH: orth, LEMMA: orth, NORM: orth}, {ORTH: "."}] + _exc[orth + "."] = [{ORTH: orth, NORM: orth}, {ORTH: "."}] -TOKENIZER_EXCEPTIONS = _exc +TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc) diff --git a/spacy/lang/ta/__init__.py b/spacy/lang/ta/__init__.py index cb23339e6..ac5fc7124 100644 --- a/spacy/lang/ta/__init__.py +++ b/spacy/lang/ta/__init__.py @@ -1,17 +1,10 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS - from ...language import Language -from ...attrs import LANG class TamilDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = lambda text: "ta" - lex_attr_getters.update(LEX_ATTRS) + lex_attr_getters = LEX_ATTRS stop_words = STOP_WORDS diff --git a/spacy/lang/ta/examples.py b/spacy/lang/ta/examples.py index 4700e0c7f..e68dc6237 100644 --- a/spacy/lang/ta/examples.py +++ b/spacy/lang/ta/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. @@ -24,5 +20,5 @@ sentences = [ "நடைபாதை விநியோக ரோபோக்களை தடை செய்வதை சான் பிரான்சிஸ்கோ கருதுகிறது", "லண்டன் ஐக்கிய இராச்சியத்தில் ஒரு பெரிய நகரம்.", "என்ன வேலை செய்கிறீர்கள்?", - "எந்த கல்லூரியில் படிக்கிறாய்?" + "எந்த கல்லூரியில் படிக்கிறாய்?", ] diff --git a/spacy/lang/ta/lex_attrs.py b/spacy/lang/ta/lex_attrs.py index 40158ad7a..f830f4ac9 100644 --- a/spacy/lang/ta/lex_attrs.py +++ b/spacy/lang/ta/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/ta/stop_words.py b/spacy/lang/ta/stop_words.py index 91ebe8fd8..abbff949d 100644 --- a/spacy/lang/ta/stop_words.py +++ b/spacy/lang/ta/stop_words.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - # Stop words STOP_WORDS = set( diff --git a/spacy/lang/tag_map.py b/spacy/lang/tag_map.py deleted file mode 100644 index 3a744f180..000000000 --- a/spacy/lang/tag_map.py +++ /dev/null @@ -1,28 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ..symbols import POS, ADV, NOUN, ADP, PRON, SCONJ, PROPN, DET, SYM, INTJ -from ..symbols import PUNCT, NUM, AUX, X, CONJ, ADJ, VERB, PART, SPACE, CCONJ - - -TAG_MAP = { - "ADV": {POS: ADV}, - "NOUN": {POS: NOUN}, - "ADP": {POS: ADP}, - "PRON": {POS: PRON}, - "SCONJ": {POS: SCONJ}, - "PROPN": {POS: PROPN}, - "DET": {POS: DET}, - "SYM": {POS: SYM}, - "INTJ": {POS: INTJ}, - "PUNCT": {POS: PUNCT}, - "NUM": {POS: NUM}, - "AUX": {POS: AUX}, - "X": {POS: X}, - "CONJ": {POS: CONJ}, - "CCONJ": {POS: CCONJ}, - "ADJ": {POS: ADJ}, - "VERB": {POS: VERB}, - "PART": {POS: PART}, - "_SP": {POS: SPACE}, -} diff --git a/spacy/lang/te/__init__.py b/spacy/lang/te/__init__.py index a4709177d..e6dc80e28 100644 --- a/spacy/lang/te/__init__.py +++ b/spacy/lang/te/__init__.py @@ -1,17 +1,10 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS - from ...language import Language -from ...attrs import LANG class TeluguDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters.update(LEX_ATTRS) - lex_attr_getters[LANG] = lambda text: "te" + lex_attr_getters = LEX_ATTRS stop_words = STOP_WORDS diff --git a/spacy/lang/te/examples.py 
b/spacy/lang/te/examples.py index 815ec8227..cff7d3cb0 100644 --- a/spacy/lang/te/examples.py +++ b/spacy/lang/te/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/te/lex_attrs.py b/spacy/lang/te/lex_attrs.py index 6da766dca..ae11827f6 100644 --- a/spacy/lang/te/lex_attrs.py +++ b/spacy/lang/te/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM _num_words = [ diff --git a/spacy/lang/te/stop_words.py b/spacy/lang/te/stop_words.py index 11e157177..b18dab697 100644 --- a/spacy/lang/te/stop_words.py +++ b/spacy/lang/te/stop_words.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - # Source: https://github.com/Xangis/extra-stopwords (MIT License) STOP_WORDS = set( diff --git a/spacy/lang/th/__init__.py b/spacy/lang/th/__init__.py index 512be0c59..219c50c1a 100644 --- a/spacy/lang/th/__init__.py +++ b/spacy/lang/th/__init__.py @@ -1,55 +1,53 @@ -# coding: utf8 -from __future__ import unicode_literals - -from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS -from .tag_map import TAG_MAP from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS - -from ...attrs import LANG from ...language import Language from ...tokens import Doc -from ...util import DummyTokenizer +from ...util import DummyTokenizer, registry, load_config_from_str + + +DEFAULT_CONFIG = """ +[nlp] + +[nlp.tokenizer] +@tokenizers = "spacy.th.ThaiTokenizer" +""" + + +@registry.tokenizers("spacy.th.ThaiTokenizer") +def create_thai_tokenizer(): + def thai_tokenizer_factory(nlp): + return ThaiTokenizer(nlp) + + return thai_tokenizer_factory class ThaiTokenizer(DummyTokenizer): - def __init__(self, cls, nlp=None): + def __init__(self, nlp: Language) -> None: try: from pythainlp.tokenize import word_tokenize except ImportError: raise ImportError( "The Thai tokenizer requires the PyThaiNLP library: " "https://github.com/PyThaiNLP/pythainlp" - ) - + ) from None self.word_tokenize = word_tokenize - self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp) + self.vocab = nlp.vocab - def __call__(self, text): + def __call__(self, text: str) -> Doc: words = list(self.word_tokenize(text)) spaces = [False] * len(words) return Doc(self.vocab, words=words, spaces=spaces) class ThaiDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters.update(LEX_ATTRS) - lex_attr_getters[LANG] = lambda _text: "th" - tokenizer_exceptions = dict(TOKENIZER_EXCEPTIONS) - tag_map = TAG_MAP + config = load_config_from_str(DEFAULT_CONFIG) + lex_attr_getters = LEX_ATTRS stop_words = STOP_WORDS - @classmethod - def create_tokenizer(cls, nlp=None): - return ThaiTokenizer(cls, nlp) - class Thai(Language): lang = "th" Defaults = ThaiDefaults - def make_doc(self, text): - return self.tokenizer(text) - __all__ = ["Thai"] diff --git a/spacy/lang/th/lex_attrs.py b/spacy/lang/th/lex_attrs.py index 047d046c2..bc4e5293e 100644 --- a/spacy/lang/th/lex_attrs.py +++ b/spacy/lang/th/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/th/tag_map.py b/spacy/lang/th/tag_map.py deleted file mode 100644 index 3c0d3479b..000000000 --- a/spacy/lang/th/tag_map.py +++ /dev/null @@ -1,102 +0,0 @@ -# encoding: utf8 -from __future__ import unicode_literals - -from ...symbols import POS, NOUN, PRON, ADJ, ADV, INTJ, PROPN, 
DET, NUM, AUX, VERB -from ...symbols import ADP, CCONJ, PART, PUNCT, SPACE, SCONJ - -# Source: Korakot Chaovavanich -# https://www.facebook.com/photo.php?fbid=390564854695031&set=p.390564854695031&type=3&permPage=1&ifg=1 -TAG_MAP = { - # NOUN - "NOUN": {POS: NOUN}, - "NCMN": {POS: NOUN}, - "NTTL": {POS: NOUN}, - "CNIT": {POS: NOUN}, - "CLTV": {POS: NOUN}, - "CMTR": {POS: NOUN}, - "CFQC": {POS: NOUN}, - "CVBL": {POS: NOUN}, - "CL": {POS: NOUN}, - "FX": {POS: NOUN}, - "NN": {POS: NOUN}, - # VERB - "VACT": {POS: VERB}, - "VSTA": {POS: VERB}, - "VV": {POS: VERB}, - # PRON - "PRON": {POS: PRON}, - "NPRP": {POS: PRON}, - "PR": {POS: PRON}, - # ADJ - "ADJ": {POS: ADJ}, - "NONM": {POS: ADJ}, - "VATT": {POS: ADJ}, - "DONM": {POS: ADJ}, - "AJ": {POS: ADJ}, - # ADV - "ADV": {POS: ADV}, - "ADVN": {POS: ADV}, - "ADVI": {POS: ADV}, - "ADVP": {POS: ADV}, - "ADVS": {POS: ADV}, - "AV": {POS: ADV}, - # INTJ - "INT": {POS: INTJ}, - "IJ": {POS: INTJ}, - # PRON - "PROPN": {POS: PROPN}, - "PPRS": {POS: PROPN}, - "PDMN": {POS: PROPN}, - "PNTR": {POS: PROPN}, - # DET - "DET": {POS: DET}, - "DDAN": {POS: DET}, - "DDAC": {POS: DET}, - "DDBQ": {POS: DET}, - "DDAQ": {POS: DET}, - "DIAC": {POS: DET}, - "DIBQ": {POS: DET}, - "DIAQ": {POS: DET}, - # TODO: resolve duplicate (see below) - # "DCNM": {POS: DET}, - # NUM - "NUM": {POS: NUM}, - "NCNM": {POS: NUM}, - "NLBL": {POS: NUM}, - "DCNM": {POS: NUM}, - "NU": {POS: NUM}, - # AUX - "AUX": {POS: AUX}, - "XVBM": {POS: AUX}, - "XVAM": {POS: AUX}, - "XVMM": {POS: AUX}, - "XVBB": {POS: AUX}, - "XVAE": {POS: AUX}, - "AX": {POS: AUX}, - # ADP - "ADP": {POS: ADP}, - "RPRE": {POS: ADP}, - "PS": {POS: ADP}, - # CCONJ - "CCONJ": {POS: CCONJ}, - "JCRG": {POS: CCONJ}, - "CC": {POS: CCONJ}, - # SCONJ - "SCONJ": {POS: SCONJ}, - "PREL": {POS: SCONJ}, - "JSBR": {POS: SCONJ}, - "JCMP": {POS: SCONJ}, - # PART - "PART": {POS: PART}, - "FIXN": {POS: PART}, - "FIXV": {POS: PART}, - "EAFF": {POS: PART}, - "AITT": {POS: PART}, - "NEG": {POS: PART}, - "EITT": {POS: PART}, - "PA": {POS: PART}, - # PUNCT - "PUNCT": {POS: PUNCT}, - "PUNC": {POS: PUNCT}, - "_SP": {POS: SPACE}, -} diff --git a/spacy/lang/th/tokenizer_exceptions.py b/spacy/lang/th/tokenizer_exceptions.py index 4de0f1195..92116d474 100644 --- a/spacy/lang/th/tokenizer_exceptions.py +++ b/spacy/lang/th/tokenizer_exceptions.py @@ -1,472 +1,438 @@ -# encoding: utf8 -from __future__ import unicode_literals - -from ...symbols import ORTH, LEMMA +from ...symbols import ORTH _exc = { # หน่วยงานรัฐ / government agency - "กกต.": [{ORTH: "กกต.", LEMMA: "คณะกรรมการการเลือกตั้ง"}], - "กทท.": [{ORTH: "กทท.", LEMMA: "การท่าเรือแห่งประเทศไทย"}], - "กทพ.": [{ORTH: "กทพ.", LEMMA: "การทางพิเศษแห่งประเทศไทย"}], - "กบข.": [{ORTH: "กบข.", LEMMA: "กองทุนบำเหน็จบำนาญข้าราชการพลเรือน"}], - "กบว.": [{ORTH: "กบว.", LEMMA: "คณะกรรมการบริหารวิทยุกระจายเสียงและวิทยุโทรทัศน์"}], - "กปน.": [{ORTH: "กปน.", LEMMA: "การประปานครหลวง"}], - "กปภ.": [{ORTH: "กปภ.", LEMMA: "การประปาส่วนภูมิภาค"}], - "กปส.": [{ORTH: "กปส.", LEMMA: "กรมประชาสัมพันธ์"}], - "กผม.": [{ORTH: "กผม.", LEMMA: "กองผังเมือง"}], - "กฟน.": [{ORTH: "กฟน.", LEMMA: "การไฟฟ้านครหลวง"}], - "กฟผ.": [{ORTH: "กฟผ.", LEMMA: "การไฟฟ้าฝ่ายผลิตแห่งประเทศไทย"}], - "กฟภ.": [{ORTH: "กฟภ.", LEMMA: "การไฟฟ้าส่วนภูมิภาค"}], - "ก.ช.น.": [{ORTH: "ก.ช.น.", LEMMA: "คณะกรรมการช่วยเหลือชาวนาชาวไร่"}], - "กยศ.": [{ORTH: "กยศ.", LEMMA: "กองทุนเงินให้กู้ยืมเพื่อการศึกษา"}], - "ก.ล.ต.": [{ORTH: "ก.ล.ต.", LEMMA: "คณะกรรมการกำกับหลักทรัพย์และตลาดหลักทรัพย์"}], - "กศ.บ.": [{ORTH: "กศ.บ.", LEMMA: "การศึกษาบัณฑิต"}], - "กศน.": [{ORTH: 
"กศน.", LEMMA: "กรมการศึกษานอกโรงเรียน"}], - "กสท.": [{ORTH: "กสท.", LEMMA: "การสื่อสารแห่งประเทศไทย"}], - "กอ.รมน.": [{ORTH: "กอ.รมน.", LEMMA: "กองอำนวยการรักษาความมั่นคงภายใน"}], - "กร.": [{ORTH: "กร.", LEMMA: "กองเรือยุทธการ"}], - "ขสมก.": [{ORTH: "ขสมก.", LEMMA: "องค์การขนส่งมวลชนกรุงเทพ"}], - "คตง.": [{ORTH: "คตง.", LEMMA: "คณะกรรมการตรวจเงินแผ่นดิน"}], - "ครม.": [{ORTH: "ครม.", LEMMA: "คณะรัฐมนตรี"}], - "คมช.": [{ORTH: "คมช.", LEMMA: "คณะมนตรีความมั่นคงแห่งชาติ"}], - "ตชด.": [{ORTH: "ตชด.", LEMMA: "ตำรวจตะเวนชายเดน"}], - "ตม.": [{ORTH: "ตม.", LEMMA: "กองตรวจคนเข้าเมือง"}], - "ตร.": [{ORTH: "ตร.", LEMMA: "ตำรวจ"}], - "ททท.": [{ORTH: "ททท.", LEMMA: "การท่องเที่ยวแห่งประเทศไทย"}], - "ททบ.": [{ORTH: "ททบ.", LEMMA: "สถานีวิทยุโทรทัศน์กองทัพบก"}], - "ทบ.": [{ORTH: "ทบ.", LEMMA: "กองทัพบก"}], - "ทร.": [{ORTH: "ทร.", LEMMA: "กองทัพเรือ"}], - "ทอ.": [{ORTH: "ทอ.", LEMMA: "กองทัพอากาศ"}], - "ทอท.": [{ORTH: "ทอท.", LEMMA: "การท่าอากาศยานแห่งประเทศไทย"}], - "ธ.ก.ส.": [{ORTH: "ธ.ก.ส.", LEMMA: "ธนาคารเพื่อการเกษตรและสหกรณ์การเกษตร"}], - "ธปท.": [{ORTH: "ธปท.", LEMMA: "ธนาคารแห่งประเทศไทย"}], - "ธอส.": [{ORTH: "ธอส.", LEMMA: "ธนาคารอาคารสงเคราะห์"}], - "นย.": [{ORTH: "นย.", LEMMA: "นาวิกโยธิน"}], - "ปตท.": [{ORTH: "ปตท.", LEMMA: "การปิโตรเลียมแห่งประเทศไทย"}], - "ป.ป.ช.": [ - { - ORTH: "ป.ป.ช.", - LEMMA: "คณะกรรมการป้องกันและปราบปรามการทุจริตและประพฤติมิชอบในวงราชการ", - } - ], - "ป.ป.ส.": [{ORTH: "ป.ป.ส.", LEMMA: "คณะกรรมการป้องกันและปราบปรามยาเสพติด"}], - "บพร.": [{ORTH: "บพร.", LEMMA: "กรมการบินพลเรือน"}], - "บย.": [{ORTH: "บย.", LEMMA: "กองบินยุทธการ"}], - "พสวท.": [ - { - ORTH: "พสวท.", - LEMMA: "โครงการพัฒนาและส่งเสริมผู้มีความรู้ความสามารถพิเศษทางวิทยาศาสตร์และเทคโนโลยี", - } - ], - "มอก.": [{ORTH: "มอก.", LEMMA: "สำนักงานมาตรฐานผลิตภัณฑ์อุตสาหกรรม"}], - "ยธ.": [{ORTH: "ยธ.", LEMMA: "กรมโยธาธิการ"}], - "รพช.": [{ORTH: "รพช.", LEMMA: "สำนักงานเร่งรัดพัฒนาชนบท"}], - "รฟท.": [{ORTH: "รฟท.", LEMMA: "การรถไฟแห่งประเทศไทย"}], - "รฟม.": [{ORTH: "รฟม.", LEMMA: "การรถไฟฟ้าขนส่งมวลชนแห่งประเทศไทย"}], - "ศธ.": [{ORTH: "ศธ.", LEMMA: "กระทรวงศึกษาธิการ"}], - "ศนธ.": [{ORTH: "ศนธ.", LEMMA: "ศูนย์กลางนิสิตนักศึกษาแห่งประเทศไทย"}], - "สกจ.": [{ORTH: "สกจ.", LEMMA: "สหกรณ์จังหวัด"}], - "สกท.": [{ORTH: "สกท.", LEMMA: "สำนักงานคณะกรรมการส่งเสริมการลงทุน"}], - "สกว.": [{ORTH: "สกว.", LEMMA: "สำนักงานกองทุนสนับสนุนการวิจัย"}], - "สคบ.": [{ORTH: "สคบ.", LEMMA: "สำนักงานคณะกรรมการคุ้มครองผู้บริโภค"}], - "สจร.": [{ORTH: "สจร.", LEMMA: "สำนักงานคณะกรรมการจัดระบบการจราจรทางบก"}], - "สตง.": [{ORTH: "สตง.", LEMMA: "สำนักงานตรวจเงินแผ่นดิน"}], - "สทท.": [{ORTH: "สทท.", LEMMA: "สถานีวิทยุโทรทัศน์แห่งประเทศไทย"}], - "สทร.": [{ORTH: "สทร.", LEMMA: "สำนักงานกลางทะเบียนราษฎร์"}], - "สธ": [{ORTH: "สธ", LEMMA: "กระทรวงสาธารณสุข"}], - "สนช.": [{ORTH: "สนช.", LEMMA: "สภานิติบัญญัติแห่งชาติ,สำนักงานนวัตกรรมแห่งชาติ"}], - "สนนท.": [{ORTH: "สนนท.", LEMMA: "สหพันธ์นิสิตนักศึกษาแห่งประเทศไทย"}], - "สปก.": [{ORTH: "สปก.", LEMMA: "สำนักงานการปฏิรูปที่ดินเพื่อเกษตรกรรม"}], - "สปช.": [{ORTH: "สปช.", LEMMA: "สำนักงานคณะกรรมการการประถมศึกษาแห่งชาติ"}], - "สปอ.": [{ORTH: "สปอ.", LEMMA: "สำนักงานการประถมศึกษาอำเภอ"}], - "สพช.": [{ORTH: "สพช.", LEMMA: "สำนักงานคณะกรรมการนโยบายพลังงานแห่งชาติ"}], - "สยช.": [ - {ORTH: "สยช.", LEMMA: "สำนักงานคณะกรรมการส่งเสริมและประสานงานเยาวชนแห่งชาติ"} - ], - "สวช.": [{ORTH: "สวช.", LEMMA: "สำนักงานคณะกรรมการวัฒนธรรมแห่งชาติ"}], - "สวท.": [{ORTH: "สวท.", LEMMA: "สถานีวิทยุกระจายเสียงแห่งประเทศไทย"}], - "สวทช.": [{ORTH: "สวทช.", LEMMA: "สำนักงานพัฒนาวิทยาศาสตร์และเทคโนโลยีแห่งชาติ"}], - "สคช.": [ - {ORTH: 
"สคช.", LEMMA: "สำนักงานคณะกรรมการพัฒนาการเศรษฐกิจและสังคมแห่งชาติ"} - ], - "สสว.": [{ORTH: "สสว.", LEMMA: "สำนักงานส่งเสริมวิสาหกิจขนาดกลางและขนาดย่อม"}], - "สสส.": [{ORTH: "สสส.", LEMMA: "สำนักงานกองทุนสนับสนุนการสร้างเสริมสุขภาพ"}], - "สสวท.": [{ORTH: "สสวท.", LEMMA: "สถาบันส่งเสริมการสอนวิทยาศาสตร์และเทคโนโลยี"}], - "อตก.": [{ORTH: "อตก.", LEMMA: "องค์การตลาดเพื่อเกษตรกร"}], - "อบจ.": [{ORTH: "อบจ.", LEMMA: "องค์การบริหารส่วนจังหวัด"}], - "อบต.": [{ORTH: "อบต.", LEMMA: "องค์การบริหารส่วนตำบล"}], - "อปพร.": [{ORTH: "อปพร.", LEMMA: "อาสาสมัครป้องกันภัยฝ่ายพลเรือน"}], - "อย.": [{ORTH: "อย.", LEMMA: "สำนักงานคณะกรรมการอาหารและยา"}], - "อ.ส.ม.ท.": [{ORTH: "อ.ส.ม.ท.", LEMMA: "องค์การสื่อสารมวลชนแห่งประเทศไทย"}], + "กกต.": [{ORTH: "กกต."}], + "กทท.": [{ORTH: "กทท."}], + "กทพ.": [{ORTH: "กทพ."}], + "กบข.": [{ORTH: "กบข."}], + "กบว.": [{ORTH: "กบว."}], + "กปน.": [{ORTH: "กปน."}], + "กปภ.": [{ORTH: "กปภ."}], + "กปส.": [{ORTH: "กปส."}], + "กผม.": [{ORTH: "กผม."}], + "กฟน.": [{ORTH: "กฟน."}], + "กฟผ.": [{ORTH: "กฟผ."}], + "กฟภ.": [{ORTH: "กฟภ."}], + "ก.ช.น.": [{ORTH: "ก.ช.น."}], + "กยศ.": [{ORTH: "กยศ."}], + "ก.ล.ต.": [{ORTH: "ก.ล.ต."}], + "กศ.บ.": [{ORTH: "กศ.บ."}], + "กศน.": [{ORTH: "กศน."}], + "กสท.": [{ORTH: "กสท."}], + "กอ.รมน.": [{ORTH: "กอ.รมน."}], + "กร.": [{ORTH: "กร."}], + "ขสมก.": [{ORTH: "ขสมก."}], + "คตง.": [{ORTH: "คตง."}], + "ครม.": [{ORTH: "ครม."}], + "คมช.": [{ORTH: "คมช."}], + "ตชด.": [{ORTH: "ตชด."}], + "ตม.": [{ORTH: "ตม."}], + "ตร.": [{ORTH: "ตร."}], + "ททท.": [{ORTH: "ททท."}], + "ททบ.": [{ORTH: "ททบ."}], + "ทบ.": [{ORTH: "ทบ."}], + "ทร.": [{ORTH: "ทร."}], + "ทอ.": [{ORTH: "ทอ."}], + "ทอท.": [{ORTH: "ทอท."}], + "ธ.ก.ส.": [{ORTH: "ธ.ก.ส."}], + "ธปท.": [{ORTH: "ธปท."}], + "ธอส.": [{ORTH: "ธอส."}], + "นย.": [{ORTH: "นย."}], + "ปตท.": [{ORTH: "ปตท."}], + "ป.ป.ช.": [{ORTH: "ป.ป.ช."}], + "ป.ป.ส.": [{ORTH: "ป.ป.ส."}], + "บพร.": [{ORTH: "บพร."}], + "บย.": [{ORTH: "บย."}], + "พสวท.": [{ORTH: "พสวท."}], + "มอก.": [{ORTH: "มอก."}], + "ยธ.": [{ORTH: "ยธ."}], + "รพช.": [{ORTH: "รพช."}], + "รฟท.": [{ORTH: "รฟท."}], + "รฟม.": [{ORTH: "รฟม."}], + "ศธ.": [{ORTH: "ศธ."}], + "ศนธ.": [{ORTH: "ศนธ."}], + "สกจ.": [{ORTH: "สกจ."}], + "สกท.": [{ORTH: "สกท."}], + "สกว.": [{ORTH: "สกว."}], + "สคบ.": [{ORTH: "สคบ."}], + "สจร.": [{ORTH: "สจร."}], + "สตง.": [{ORTH: "สตง."}], + "สทท.": [{ORTH: "สทท."}], + "สทร.": [{ORTH: "สทร."}], + "สธ": [{ORTH: "สธ"}], + "สนช.": [{ORTH: "สนช."}], + "สนนท.": [{ORTH: "สนนท."}], + "สปก.": [{ORTH: "สปก."}], + "สปช.": [{ORTH: "สปช."}], + "สปอ.": [{ORTH: "สปอ."}], + "สพช.": [{ORTH: "สพช."}], + "สยช.": [{ORTH: "สยช."}], + "สวช.": [{ORTH: "สวช."}], + "สวท.": [{ORTH: "สวท."}], + "สวทช.": [{ORTH: "สวทช."}], + "สคช.": [{ORTH: "สคช."}], + "สสว.": [{ORTH: "สสว."}], + "สสส.": [{ORTH: "สสส."}], + "สสวท.": [{ORTH: "สสวท."}], + "อตก.": [{ORTH: "อตก."}], + "อบจ.": [{ORTH: "อบจ."}], + "อบต.": [{ORTH: "อบต."}], + "อปพร.": [{ORTH: "อปพร."}], + "อย.": [{ORTH: "อย."}], + "อ.ส.ม.ท.": [{ORTH: "อ.ส.ม.ท."}], # มหาวิทยาลัย / สถานศึกษา / university / college - "มทส.": [{ORTH: "มทส.", LEMMA: "มหาวิทยาลัยเทคโนโลยีสุรนารี"}], - "มธ.": [{ORTH: "มธ.", LEMMA: "มหาวิทยาลัยธรรมศาสตร์"}], - "ม.อ.": [{ORTH: "ม.อ.", LEMMA: "มหาวิทยาลัยสงขลานครินทร์"}], - "มทร.": [{ORTH: "มทร.", LEMMA: "มหาวิทยาลัยเทคโนโลยีราชมงคล"}], - "มมส.": [{ORTH: "มมส.", LEMMA: "มหาวิทยาลัยมหาสารคาม"}], - "วท.": [{ORTH: "วท.", LEMMA: "วิทยาลัยเทคนิค"}], - "สตม.": [{ORTH: "สตม.", LEMMA: "สำนักงานตรวจคนเข้าเมือง (ตำรวจ)"}], + "มทส.": [{ORTH: "มทส."}], + "มธ.": [{ORTH: "มธ."}], + "ม.อ.": [{ORTH: "ม.อ."}], + "มทร.": [{ORTH: "มทร."}], + 
"มมส.": [{ORTH: "มมส."}], + "วท.": [{ORTH: "วท."}], + "สตม.": [{ORTH: "สตม."}], # ยศ / rank - "ดร.": [{ORTH: "ดร.", LEMMA: "ดอกเตอร์"}], - "ด.ต.": [{ORTH: "ด.ต.", LEMMA: "ดาบตำรวจ"}], - "จ.ต.": [{ORTH: "จ.ต.", LEMMA: "จ่าตรี"}], - "จ.ท.": [{ORTH: "จ.ท.", LEMMA: "จ่าโท"}], - "จ.ส.ต.": [{ORTH: "จ.ส.ต.", LEMMA: "จ่าสิบตรี (ทหารบก)"}], - "จสต.": [{ORTH: "จสต.", LEMMA: "จ่าสิบตำรวจ"}], - "จ.ส.ท.": [{ORTH: "จ.ส.ท.", LEMMA: "จ่าสิบโท"}], - "จ.ส.อ.": [{ORTH: "จ.ส.อ.", LEMMA: "จ่าสิบเอก"}], - "จ.อ.": [{ORTH: "จ.อ.", LEMMA: "จ่าเอก"}], - "ทพญ.": [{ORTH: "ทพญ.", LEMMA: "ทันตแพทย์หญิง"}], - "ทนพ.": [{ORTH: "ทนพ.", LEMMA: "เทคนิคการแพทย์"}], - "นจอ.": [{ORTH: "นจอ.", LEMMA: "นักเรียนจ่าอากาศ"}], - "น.ช.": [{ORTH: "น.ช.", LEMMA: "นักโทษชาย"}], - "น.ญ.": [{ORTH: "น.ญ.", LEMMA: "นักโทษหญิง"}], - "น.ต.": [{ORTH: "น.ต.", LEMMA: "นาวาตรี"}], - "น.ท.": [{ORTH: "น.ท.", LEMMA: "นาวาโท"}], - "นตท.": [{ORTH: "นตท.", LEMMA: "นักเรียนเตรียมทหาร"}], - "นนส.": [{ORTH: "นนส.", LEMMA: "นักเรียนนายสิบทหารบก"}], - "นนร.": [{ORTH: "นนร.", LEMMA: "นักเรียนนายร้อย"}], - "นนอ.": [{ORTH: "นนอ.", LEMMA: "นักเรียนนายเรืออากาศ"}], - "นพ.": [{ORTH: "นพ.", LEMMA: "นายแพทย์"}], - "นพท.": [{ORTH: "นพท.", LEMMA: "นายแพทย์ทหาร"}], - "นรจ.": [{ORTH: "นรจ.", LEMMA: "นักเรียนจ่าทหารเรือ"}], - "นรต.": [{ORTH: "นรต.", LEMMA: "นักเรียนนายร้อยตำรวจ"}], - "นศพ.": [{ORTH: "นศพ.", LEMMA: "นักศึกษาแพทย์"}], - "นศท.": [{ORTH: "นศท.", LEMMA: "นักศึกษาวิชาทหาร"}], - "น.สพ.": [{ORTH: "น.สพ.", LEMMA: "นายสัตวแพทย์ (พ.ร.บ.วิชาชีพการสัตวแพทย์)"}], - "น.อ.": [{ORTH: "น.อ.", LEMMA: "นาวาเอก"}], - "บช.ก.": [{ORTH: "บช.ก.", LEMMA: "กองบัญชาการตำรวจสอบสวนกลาง"}], - "บช.น.": [{ORTH: "บช.น.", LEMMA: "กองบัญชาการตำรวจนครบาล"}], - "ผกก.": [{ORTH: "ผกก.", LEMMA: "ผู้กำกับการ"}], - "ผกก.ภ.": [{ORTH: "ผกก.ภ.", LEMMA: "ผู้กำกับการตำรวจภูธร"}], - "ผจก.": [{ORTH: "ผจก.", LEMMA: "ผู้จัดการ"}], - "ผช.": [{ORTH: "ผช.", LEMMA: "ผู้ช่วย"}], - "ผชก.": [{ORTH: "ผชก.", LEMMA: "ผู้ชำนาญการ"}], - "ผช.ผอ.": [{ORTH: "ผช.ผอ.", LEMMA: "ผู้ช่วยผู้อำนวยการ"}], - "ผญบ.": [{ORTH: "ผญบ.", LEMMA: "ผู้ใหญ่บ้าน"}], - "ผบ.": [{ORTH: "ผบ.", LEMMA: "ผู้บังคับบัญชา"}], - "ผบก.": [{ORTH: "ผบก.", LEMMA: "ผู้บังคับบัญชาการ (ตำรวจ)"}], - "ผบก.น.": [{ORTH: "ผบก.น.", LEMMA: "ผู้บังคับการตำรวจนครบาล"}], - "ผบก.ป.": [{ORTH: "ผบก.ป.", LEMMA: "ผู้บังคับการตำรวจกองปราบปราม"}], - "ผบก.ปค.": [ - { - ORTH: "ผบก.ปค.", - LEMMA: "ผู้บังคับการ กองบังคับการปกครอง (โรงเรียนนายร้อยตำรวจ)", - } - ], - "ผบก.ปม.": [{ORTH: "ผบก.ปม.", LEMMA: "ผู้บังคับการตำรวจป่าไม้"}], - "ผบก.ภ.": [{ORTH: "ผบก.ภ.", LEMMA: "ผู้บังคับการตำรวจภูธร"}], - "ผบช.": [{ORTH: "ผบช.", LEMMA: "ผู้บัญชาการ (ตำรวจ)"}], - "ผบช.ก.": [{ORTH: "ผบช.ก.", LEMMA: "ผู้บัญชาการตำรวจสอบสวนกลาง"}], - "ผบช.ตชด.": [{ORTH: "ผบช.ตชด.", LEMMA: "ผู้บัญชาการตำรวจตระเวนชายแดน"}], - "ผบช.น.": [{ORTH: "ผบช.น.", LEMMA: "ผู้บัญชาการตำรวจนครบาล"}], - "ผบช.ภ.": [{ORTH: "ผบช.ภ.", LEMMA: "ผู้บัญชาการตำรวจภูธร"}], - "ผบ.ทบ.": [{ORTH: "ผบ.ทบ.", LEMMA: "ผู้บัญชาการทหารบก"}], - "ผบ.ตร.": [{ORTH: "ผบ.ตร.", LEMMA: "ผู้บัญชาการตำรวจแห่งชาติ"}], - "ผบ.ทร.": [{ORTH: "ผบ.ทร.", LEMMA: "ผู้บัญชาการทหารเรือ"}], - "ผบ.ทอ.": [{ORTH: "ผบ.ทอ.", LEMMA: "ผู้บัญชาการทหารอากาศ"}], - "ผบ.ทสส.": [{ORTH: "ผบ.ทสส.", LEMMA: "ผู้บัญชาการทหารสูงสุด"}], - "ผวจ.": [{ORTH: "ผวจ.", LEMMA: "ผู้ว่าราชการจังหวัด"}], - "ผู้ว่าฯ": [{ORTH: "ผู้ว่าฯ", LEMMA: "ผู้ว่าราชการจังหวัด"}], - "พ.จ.ต.": [{ORTH: "พ.จ.ต.", LEMMA: "พันจ่าตรี"}], - "พ.จ.ท.": [{ORTH: "พ.จ.ท.", LEMMA: "พันจ่าโท"}], - "พ.จ.อ.": [{ORTH: "พ.จ.อ.", LEMMA: "พันจ่าเอก"}], - "พญ.": [{ORTH: "พญ.", LEMMA: "แพทย์หญิง"}], - "ฯพณฯ": [{ORTH: "ฯพณฯ", LEMMA: 
"พณท่าน"}], - "พ.ต.": [{ORTH: "พ.ต.", LEMMA: "พันตรี"}], - "พ.ท.": [{ORTH: "พ.ท.", LEMMA: "พันโท"}], - "พ.อ.": [{ORTH: "พ.อ.", LEMMA: "พันเอก"}], - "พ.ต.อ.พิเศษ": [{ORTH: "พ.ต.อ.พิเศษ", LEMMA: "พันตำรวจเอกพิเศษ"}], - "พลฯ": [{ORTH: "พลฯ", LEMMA: "พลทหาร"}], - "พล.๑ รอ.": [{ORTH: "พล.๑ รอ.", LEMMA: "กองพลที่ ๑ รักษาพระองค์ กองทัพบก"}], - "พล.ต.": [{ORTH: "พล.ต.", LEMMA: "พลตรี"}], - "พล.ต.ต.": [{ORTH: "พล.ต.ต.", LEMMA: "พลตำรวจตรี"}], - "พล.ต.ท.": [{ORTH: "พล.ต.ท.", LEMMA: "พลตำรวจโท"}], - "พล.ต.อ.": [{ORTH: "พล.ต.อ.", LEMMA: "พลตำรวจเอก"}], - "พล.ท.": [{ORTH: "พล.ท.", LEMMA: "พลโท"}], - "พล.ปตอ.": [{ORTH: "พล.ปตอ.", LEMMA: "กองพลทหารปืนใหญ่ต่อสู่อากาศยาน"}], - "พล.ม.": [{ORTH: "พล.ม.", LEMMA: "กองพลทหารม้า"}], - "พล.ม.๒": [{ORTH: "พล.ม.๒", LEMMA: "กองพลทหารม้าที่ ๒"}], - "พล.ร.ต.": [{ORTH: "พล.ร.ต.", LEMMA: "พลเรือตรี"}], - "พล.ร.ท.": [{ORTH: "พล.ร.ท.", LEMMA: "พลเรือโท"}], - "พล.ร.อ.": [{ORTH: "พล.ร.อ.", LEMMA: "พลเรือเอก"}], - "พล.อ.": [{ORTH: "พล.อ.", LEMMA: "พลเอก"}], - "พล.อ.ต.": [{ORTH: "พล.อ.ต.", LEMMA: "พลอากาศตรี"}], - "พล.อ.ท.": [{ORTH: "พล.อ.ท.", LEMMA: "พลอากาศโท"}], - "พล.อ.อ.": [{ORTH: "พล.อ.อ.", LEMMA: "พลอากาศเอก"}], - "พ.อ.พิเศษ": [{ORTH: "พ.อ.พิเศษ", LEMMA: "พันเอกพิเศษ"}], - "พ.อ.ต.": [{ORTH: "พ.อ.ต.", LEMMA: "พันจ่าอากาศตรี"}], - "พ.อ.ท.": [{ORTH: "พ.อ.ท.", LEMMA: "พันจ่าอากาศโท"}], - "พ.อ.อ.": [{ORTH: "พ.อ.อ.", LEMMA: "พันจ่าอากาศเอก"}], - "ภกญ.": [{ORTH: "ภกญ.", LEMMA: "เภสัชกรหญิง"}], - "ม.จ.": [{ORTH: "ม.จ.", LEMMA: "หม่อมเจ้า"}], - "มท1": [{ORTH: "มท1", LEMMA: "รัฐมนตรีว่าการกระทรวงมหาดไทย"}], - "ม.ร.ว.": [{ORTH: "ม.ร.ว.", LEMMA: "หม่อมราชวงศ์"}], - "มล.": [{ORTH: "มล.", LEMMA: "หม่อมหลวง"}], - "ร.ต.": [{ORTH: "ร.ต.", LEMMA: "ร้อยตรี,เรือตรี,เรืออากาศตรี"}], - "ร.ต.ต.": [{ORTH: "ร.ต.ต.", LEMMA: "ร้อยตำรวจตรี"}], - "ร.ต.ท.": [{ORTH: "ร.ต.ท.", LEMMA: "ร้อยตำรวจโท"}], - "ร.ต.อ.": [{ORTH: "ร.ต.อ.", LEMMA: "ร้อยตำรวจเอก"}], - "ร.ท.": [{ORTH: "ร.ท.", LEMMA: "ร้อยโท,เรือโท,เรืออากาศโท"}], - "รมช.": [{ORTH: "รมช.", LEMMA: "รัฐมนตรีช่วยว่าการกระทรวง"}], - "รมต.": [{ORTH: "รมต.", LEMMA: "รัฐมนตรี"}], - "รมว.": [{ORTH: "รมว.", LEMMA: "รัฐมนตรีว่าการกระทรวง"}], - "รศ.": [{ORTH: "รศ.", LEMMA: "รองศาสตราจารย์"}], - "ร.อ.": [{ORTH: "ร.อ.", LEMMA: "ร้อยเอก,เรือเอก,เรืออากาศเอก"}], - "ศ.": [{ORTH: "ศ.", LEMMA: "ศาสตราจารย์"}], - "ส.ต.": [{ORTH: "ส.ต.", LEMMA: "สิบตรี"}], - "ส.ต.ต.": [{ORTH: "ส.ต.ต.", LEMMA: "สิบตำรวจตรี"}], - "ส.ต.ท.": [{ORTH: "ส.ต.ท.", LEMMA: "สิบตำรวจโท"}], - "ส.ต.อ.": [{ORTH: "ส.ต.อ.", LEMMA: "สิบตำรวจเอก"}], - "ส.ท.": [{ORTH: "ส.ท.", LEMMA: "สิบโท"}], - "สพ.": [{ORTH: "สพ.", LEMMA: "สัตวแพทย์"}], - "สพ.ญ.": [{ORTH: "สพ.ญ.", LEMMA: "สัตวแพทย์หญิง"}], - "สพ.ช.": [{ORTH: "สพ.ช.", LEMMA: "สัตวแพทย์ชาย"}], - "ส.อ.": [{ORTH: "ส.อ.", LEMMA: "สิบเอก"}], - "อจ.": [{ORTH: "อจ.", LEMMA: "อาจารย์"}], - "อจญ.": [{ORTH: "อจญ.", LEMMA: "อาจารย์ใหญ่"}], + "ดร.": [{ORTH: "ดร."}], + "ด.ต.": [{ORTH: "ด.ต."}], + "จ.ต.": [{ORTH: "จ.ต."}], + "จ.ท.": [{ORTH: "จ.ท."}], + "จ.ส.ต.": [{ORTH: "จ.ส.ต."}], + "จสต.": [{ORTH: "จสต."}], + "จ.ส.ท.": [{ORTH: "จ.ส.ท."}], + "จ.ส.อ.": [{ORTH: "จ.ส.อ."}], + "จ.อ.": [{ORTH: "จ.อ."}], + "ทพญ.": [{ORTH: "ทพญ."}], + "ทนพ.": [{ORTH: "ทนพ."}], + "นจอ.": [{ORTH: "นจอ."}], + "น.ช.": [{ORTH: "น.ช."}], + "น.ญ.": [{ORTH: "น.ญ."}], + "น.ต.": [{ORTH: "น.ต."}], + "น.ท.": [{ORTH: "น.ท."}], + "นตท.": [{ORTH: "นตท."}], + "นนส.": [{ORTH: "นนส."}], + "นนร.": [{ORTH: "นนร."}], + "นนอ.": [{ORTH: "นนอ."}], + "นพ.": [{ORTH: "นพ."}], + "นพท.": [{ORTH: "นพท."}], + "นรจ.": [{ORTH: "นรจ."}], + "นรต.": [{ORTH: "นรต."}], + "นศพ.": [{ORTH: "นศพ."}], + "นศท.": [{ORTH: "นศท."}], + 
"น.สพ.": [{ORTH: "น.สพ."}], + "น.อ.": [{ORTH: "น.อ."}], + "บช.ก.": [{ORTH: "บช.ก."}], + "บช.น.": [{ORTH: "บช.น."}], + "ผกก.": [{ORTH: "ผกก."}], + "ผกก.ภ.": [{ORTH: "ผกก.ภ."}], + "ผจก.": [{ORTH: "ผจก."}], + "ผช.": [{ORTH: "ผช."}], + "ผชก.": [{ORTH: "ผชก."}], + "ผช.ผอ.": [{ORTH: "ผช.ผอ."}], + "ผญบ.": [{ORTH: "ผญบ."}], + "ผบ.": [{ORTH: "ผบ."}], + "ผบก.": [{ORTH: "ผบก."}], + "ผบก.น.": [{ORTH: "ผบก.น."}], + "ผบก.ป.": [{ORTH: "ผบก.ป."}], + "ผบก.ปค.": [{ORTH: "ผบก.ปค."}], + "ผบก.ปม.": [{ORTH: "ผบก.ปม."}], + "ผบก.ภ.": [{ORTH: "ผบก.ภ."}], + "ผบช.": [{ORTH: "ผบช."}], + "ผบช.ก.": [{ORTH: "ผบช.ก."}], + "ผบช.ตชด.": [{ORTH: "ผบช.ตชด."}], + "ผบช.น.": [{ORTH: "ผบช.น."}], + "ผบช.ภ.": [{ORTH: "ผบช.ภ."}], + "ผบ.ทบ.": [{ORTH: "ผบ.ทบ."}], + "ผบ.ตร.": [{ORTH: "ผบ.ตร."}], + "ผบ.ทร.": [{ORTH: "ผบ.ทร."}], + "ผบ.ทอ.": [{ORTH: "ผบ.ทอ."}], + "ผบ.ทสส.": [{ORTH: "ผบ.ทสส."}], + "ผวจ.": [{ORTH: "ผวจ."}], + "ผู้ว่าฯ": [{ORTH: "ผู้ว่าฯ"}], + "พ.จ.ต.": [{ORTH: "พ.จ.ต."}], + "พ.จ.ท.": [{ORTH: "พ.จ.ท."}], + "พ.จ.อ.": [{ORTH: "พ.จ.อ."}], + "พญ.": [{ORTH: "พญ."}], + "ฯพณฯ": [{ORTH: "ฯพณฯ"}], + "พ.ต.": [{ORTH: "พ.ต."}], + "พ.ท.": [{ORTH: "พ.ท."}], + "พ.อ.": [{ORTH: "พ.อ."}], + "พ.ต.อ.พิเศษ": [{ORTH: "พ.ต.อ.พิเศษ"}], + "พลฯ": [{ORTH: "พลฯ"}], + "พล.๑ รอ.": [{ORTH: "พล.๑ รอ."}], + "พล.ต.": [{ORTH: "พล.ต."}], + "พล.ต.ต.": [{ORTH: "พล.ต.ต."}], + "พล.ต.ท.": [{ORTH: "พล.ต.ท."}], + "พล.ต.อ.": [{ORTH: "พล.ต.อ."}], + "พล.ท.": [{ORTH: "พล.ท."}], + "พล.ปตอ.": [{ORTH: "พล.ปตอ."}], + "พล.ม.": [{ORTH: "พล.ม."}], + "พล.ม.๒": [{ORTH: "พล.ม.๒"}], + "พล.ร.ต.": [{ORTH: "พล.ร.ต."}], + "พล.ร.ท.": [{ORTH: "พล.ร.ท."}], + "พล.ร.อ.": [{ORTH: "พล.ร.อ."}], + "พล.อ.": [{ORTH: "พล.อ."}], + "พล.อ.ต.": [{ORTH: "พล.อ.ต."}], + "พล.อ.ท.": [{ORTH: "พล.อ.ท."}], + "พล.อ.อ.": [{ORTH: "พล.อ.อ."}], + "พ.อ.พิเศษ": [{ORTH: "พ.อ.พิเศษ"}], + "พ.อ.ต.": [{ORTH: "พ.อ.ต."}], + "พ.อ.ท.": [{ORTH: "พ.อ.ท."}], + "พ.อ.อ.": [{ORTH: "พ.อ.อ."}], + "ภกญ.": [{ORTH: "ภกญ."}], + "ม.จ.": [{ORTH: "ม.จ."}], + "มท1": [{ORTH: "มท1"}], + "ม.ร.ว.": [{ORTH: "ม.ร.ว."}], + "มล.": [{ORTH: "มล."}], + "ร.ต.": [{ORTH: "ร.ต."}], + "ร.ต.ต.": [{ORTH: "ร.ต.ต."}], + "ร.ต.ท.": [{ORTH: "ร.ต.ท."}], + "ร.ต.อ.": [{ORTH: "ร.ต.อ."}], + "ร.ท.": [{ORTH: "ร.ท."}], + "รมช.": [{ORTH: "รมช."}], + "รมต.": [{ORTH: "รมต."}], + "รมว.": [{ORTH: "รมว."}], + "รศ.": [{ORTH: "รศ."}], + "ร.อ.": [{ORTH: "ร.อ."}], + "ศ.": [{ORTH: "ศ."}], + "ส.ต.": [{ORTH: "ส.ต."}], + "ส.ต.ต.": [{ORTH: "ส.ต.ต."}], + "ส.ต.ท.": [{ORTH: "ส.ต.ท."}], + "ส.ต.อ.": [{ORTH: "ส.ต.อ."}], + "ส.ท.": [{ORTH: "ส.ท."}], + "สพ.": [{ORTH: "สพ."}], + "สพ.ญ.": [{ORTH: "สพ.ญ."}], + "สพ.ช.": [{ORTH: "สพ.ช."}], + "ส.อ.": [{ORTH: "ส.อ."}], + "อจ.": [{ORTH: "อจ."}], + "อจญ.": [{ORTH: "อจญ."}], # วุฒิ / bachelor degree - "ป.": [{ORTH: "ป.", LEMMA: "ประถมศึกษา"}], - "ป.กศ.": [{ORTH: "ป.กศ.", LEMMA: "ประกาศนียบัตรวิชาการศึกษา"}], - "ป.กศ.สูง": [{ORTH: "ป.กศ.สูง", LEMMA: "ประกาศนียบัตรวิชาการศึกษาชั้นสูง"}], - "ปวช.": [{ORTH: "ปวช.", LEMMA: "ประกาศนียบัตรวิชาชีพ"}], - "ปวท.": [{ORTH: "ปวท.", LEMMA: "ประกาศนียบัตรวิชาชีพเทคนิค"}], - "ปวส.": [{ORTH: "ปวส.", LEMMA: "ประกาศนียบัตรวิชาชีพชั้นสูง"}], - "ปทส.": [{ORTH: "ปทส.", LEMMA: "ประกาศนียบัตรครูเทคนิคชั้นสูง"}], - "กษ.บ.": [{ORTH: "กษ.บ.", LEMMA: "เกษตรศาสตรบัณฑิต"}], - "กษ.ม.": [{ORTH: "กษ.ม.", LEMMA: "เกษตรศาสตรมหาบัณฑิต"}], - "กษ.ด.": [{ORTH: "กษ.ด.", LEMMA: "เกษตรศาสตรดุษฎีบัณฑิต"}], - "ค.บ.": [{ORTH: "ค.บ.", LEMMA: "ครุศาสตรบัณฑิต"}], - "คศ.บ.": [{ORTH: "คศ.บ.", LEMMA: "คหกรรมศาสตรบัณฑิต"}], - "คศ.ม.": [{ORTH: "คศ.ม.", LEMMA: "คหกรรมศาสตรมหาบัณฑิต"}], - "คศ.ด.": [{ORTH: "คศ.ด.", LEMMA: "คหกรรมศาสตรดุษฎีบัณฑิต"}], - 
"ค.อ.บ.": [{ORTH: "ค.อ.บ.", LEMMA: "ครุศาสตรอุตสาหกรรมบัณฑิต"}], - "ค.อ.ม.": [{ORTH: "ค.อ.ม.", LEMMA: "ครุศาสตรอุตสาหกรรมมหาบัณฑิต"}], - "ค.อ.ด.": [{ORTH: "ค.อ.ด.", LEMMA: "ครุศาสตรอุตสาหกรรมดุษฎีบัณฑิต"}], - "ทก.บ.": [{ORTH: "ทก.บ.", LEMMA: "เทคโนโลยีการเกษตรบัณฑิต"}], - "ทก.ม.": [{ORTH: "ทก.ม.", LEMMA: "เทคโนโลยีการเกษตรมหาบัณฑิต"}], - "ทก.ด.": [{ORTH: "ทก.ด.", LEMMA: "เทคโนโลยีการเกษตรดุษฎีบัณฑิต"}], - "ท.บ.": [{ORTH: "ท.บ.", LEMMA: "ทันตแพทยศาสตรบัณฑิต"}], - "ท.ม.": [{ORTH: "ท.ม.", LEMMA: "ทันตแพทยศาสตรมหาบัณฑิต"}], - "ท.ด.": [{ORTH: "ท.ด.", LEMMA: "ทันตแพทยศาสตรดุษฎีบัณฑิต"}], - "น.บ.": [{ORTH: "น.บ.", LEMMA: "นิติศาสตรบัณฑิต"}], - "น.ม.": [{ORTH: "น.ม.", LEMMA: "นิติศาสตรมหาบัณฑิต"}], - "น.ด.": [{ORTH: "น.ด.", LEMMA: "นิติศาสตรดุษฎีบัณฑิต"}], - "นศ.บ.": [{ORTH: "นศ.บ.", LEMMA: "นิเทศศาสตรบัณฑิต"}], - "นศ.ม.": [{ORTH: "นศ.ม.", LEMMA: "นิเทศศาสตรมหาบัณฑิต"}], - "นศ.ด.": [{ORTH: "นศ.ด.", LEMMA: "นิเทศศาสตรดุษฎีบัณฑิต"}], - "บช.บ.": [{ORTH: "บช.บ.", LEMMA: "บัญชีบัณฑิต"}], - "บช.ม.": [{ORTH: "บช.ม.", LEMMA: "บัญชีมหาบัณฑิต"}], - "บช.ด.": [{ORTH: "บช.ด.", LEMMA: "บัญชีดุษฎีบัณฑิต"}], - "บธ.บ.": [{ORTH: "บธ.บ.", LEMMA: "บริหารธุรกิจบัณฑิต"}], - "บธ.ม.": [{ORTH: "บธ.ม.", LEMMA: "บริหารธุรกิจมหาบัณฑิต"}], - "บธ.ด.": [{ORTH: "บธ.ด.", LEMMA: "บริหารธุรกิจดุษฎีบัณฑิต"}], - "พณ.บ.": [{ORTH: "พณ.บ.", LEMMA: "พาณิชยศาสตรบัณฑิต"}], - "พณ.ม.": [{ORTH: "พณ.ม.", LEMMA: "พาณิชยศาสตรมหาบัณฑิต"}], - "พณ.ด.": [{ORTH: "พณ.ด.", LEMMA: "พาณิชยศาสตรดุษฎีบัณฑิต"}], - "พ.บ.": [{ORTH: "พ.บ.", LEMMA: "แพทยศาสตรบัณฑิต"}], - "พ.ม.": [{ORTH: "พ.ม.", LEMMA: "แพทยศาสตรมหาบัณฑิต"}], - "พ.ด.": [{ORTH: "พ.ด.", LEMMA: "แพทยศาสตรดุษฎีบัณฑิต"}], - "พธ.บ.": [{ORTH: "พธ.บ.", LEMMA: "พุทธศาสตรบัณฑิต"}], - "พธ.ม.": [{ORTH: "พธ.ม.", LEMMA: "พุทธศาสตรมหาบัณฑิต"}], - "พธ.ด.": [{ORTH: "พธ.ด.", LEMMA: "พุทธศาสตรดุษฎีบัณฑิต"}], - "พบ.บ.": [{ORTH: "พบ.บ.", LEMMA: "พัฒนบริหารศาสตรบัณฑิต"}], - "พบ.ม.": [{ORTH: "พบ.ม.", LEMMA: "พัฒนบริหารศาสตรมหาบัณฑิต"}], - "พบ.ด.": [{ORTH: "พบ.ด.", LEMMA: "พัฒนบริหารศาสตรดุษฎีบัณฑิต"}], - "พย.บ.": [{ORTH: "พย.บ.", LEMMA: "พยาบาลศาสตรดุษฎีบัณฑิต"}], - "พย.ม.": [{ORTH: "พย.ม.", LEMMA: "พยาบาลศาสตรมหาบัณฑิต"}], - "พย.ด.": [{ORTH: "พย.ด.", LEMMA: "พยาบาลศาสตรดุษฎีบัณฑิต"}], - "พศ.บ.": [{ORTH: "พศ.บ.", LEMMA: "พาณิชยศาสตรบัณฑิต"}], - "พศ.ม.": [{ORTH: "พศ.ม.", LEMMA: "พาณิชยศาสตรมหาบัณฑิต"}], - "พศ.ด.": [{ORTH: "พศ.ด.", LEMMA: "พาณิชยศาสตรดุษฎีบัณฑิต"}], - "ภ.บ.": [{ORTH: "ภ.บ.", LEMMA: "เภสัชศาสตรบัณฑิต"}], - "ภ.ม.": [{ORTH: "ภ.ม.", LEMMA: "เภสัชศาสตรมหาบัณฑิต"}], - "ภ.ด.": [{ORTH: "ภ.ด.", LEMMA: "เภสัชศาสตรดุษฎีบัณฑิต"}], - "ภ.สถ.บ.": [{ORTH: "ภ.สถ.บ.", LEMMA: "ภูมิสถาปัตยกรรมศาสตรบัณฑิต"}], - "รป.บ.": [{ORTH: "รป.บ.", LEMMA: "รัฐประศาสนศาสตร์บัณฑิต"}], - "รป.ม.": [{ORTH: "รป.ม.", LEMMA: "รัฐประศาสนศาสตร์มหาบัณฑิต"}], - "วท.บ.": [{ORTH: "วท.บ.", LEMMA: "วิทยาศาสตรบัณฑิต"}], - "วท.ม.": [{ORTH: "วท.ม.", LEMMA: "วิทยาศาสตรมหาบัณฑิต"}], - "วท.ด.": [{ORTH: "วท.ด.", LEMMA: "วิทยาศาสตรดุษฎีบัณฑิต"}], - "ศ.บ.": [{ORTH: "ศ.บ.", LEMMA: "ศิลปบัณฑิต"}], - "ศศ.บ.": [{ORTH: "ศศ.บ.", LEMMA: "ศิลปศาสตรบัณฑิต"}], - "ศษ.บ.": [{ORTH: "ศษ.บ.", LEMMA: "ศึกษาศาสตรบัณฑิต"}], - "ศส.บ.": [{ORTH: "ศส.บ.", LEMMA: "เศรษฐศาสตรบัณฑิต"}], - "สถ.บ.": [{ORTH: "สถ.บ.", LEMMA: "สถาปัตยกรรมศาสตรบัณฑิต"}], - "สถ.ม.": [{ORTH: "สถ.ม.", LEMMA: "สถาปัตยกรรมศาสตรมหาบัณฑิต"}], - "สถ.ด.": [{ORTH: "สถ.ด.", LEMMA: "สถาปัตยกรรมศาสตรดุษฎีบัณฑิต"}], - "สพ.บ.": [{ORTH: "สพ.บ.", LEMMA: "สัตวแพทยศาสตรบัณฑิต"}], - "อ.บ.": [{ORTH: "อ.บ.", LEMMA: "อักษรศาสตรบัณฑิต"}], - "อ.ม.": [{ORTH: "อ.ม.", LEMMA: "อักษรศาสตรมหาบัณฑิต"}], - "อ.ด.": [{ORTH: "อ.ด.", LEMMA: 
"อักษรศาสตรดุษฎีบัณฑิต"}], + "ป.": [{ORTH: "ป."}], + "ป.กศ.": [{ORTH: "ป.กศ."}], + "ป.กศ.สูง": [{ORTH: "ป.กศ.สูง"}], + "ปวช.": [{ORTH: "ปวช."}], + "ปวท.": [{ORTH: "ปวท."}], + "ปวส.": [{ORTH: "ปวส."}], + "ปทส.": [{ORTH: "ปทส."}], + "กษ.บ.": [{ORTH: "กษ.บ."}], + "กษ.ม.": [{ORTH: "กษ.ม."}], + "กษ.ด.": [{ORTH: "กษ.ด."}], + "ค.บ.": [{ORTH: "ค.บ."}], + "คศ.บ.": [{ORTH: "คศ.บ."}], + "คศ.ม.": [{ORTH: "คศ.ม."}], + "คศ.ด.": [{ORTH: "คศ.ด."}], + "ค.อ.บ.": [{ORTH: "ค.อ.บ."}], + "ค.อ.ม.": [{ORTH: "ค.อ.ม."}], + "ค.อ.ด.": [{ORTH: "ค.อ.ด."}], + "ทก.บ.": [{ORTH: "ทก.บ."}], + "ทก.ม.": [{ORTH: "ทก.ม."}], + "ทก.ด.": [{ORTH: "ทก.ด."}], + "ท.บ.": [{ORTH: "ท.บ."}], + "ท.ม.": [{ORTH: "ท.ม."}], + "ท.ด.": [{ORTH: "ท.ด."}], + "น.บ.": [{ORTH: "น.บ."}], + "น.ม.": [{ORTH: "น.ม."}], + "น.ด.": [{ORTH: "น.ด."}], + "นศ.บ.": [{ORTH: "นศ.บ."}], + "นศ.ม.": [{ORTH: "นศ.ม."}], + "นศ.ด.": [{ORTH: "นศ.ด."}], + "บช.บ.": [{ORTH: "บช.บ."}], + "บช.ม.": [{ORTH: "บช.ม."}], + "บช.ด.": [{ORTH: "บช.ด."}], + "บธ.บ.": [{ORTH: "บธ.บ."}], + "บธ.ม.": [{ORTH: "บธ.ม."}], + "บธ.ด.": [{ORTH: "บธ.ด."}], + "พณ.บ.": [{ORTH: "พณ.บ."}], + "พณ.ม.": [{ORTH: "พณ.ม."}], + "พณ.ด.": [{ORTH: "พณ.ด."}], + "พ.บ.": [{ORTH: "พ.บ."}], + "พ.ม.": [{ORTH: "พ.ม."}], + "พ.ด.": [{ORTH: "พ.ด."}], + "พธ.บ.": [{ORTH: "พธ.บ."}], + "พธ.ม.": [{ORTH: "พธ.ม."}], + "พธ.ด.": [{ORTH: "พธ.ด."}], + "พบ.บ.": [{ORTH: "พบ.บ."}], + "พบ.ม.": [{ORTH: "พบ.ม."}], + "พบ.ด.": [{ORTH: "พบ.ด."}], + "พย.บ.": [{ORTH: "พย.บ."}], + "พย.ม.": [{ORTH: "พย.ม."}], + "พย.ด.": [{ORTH: "พย.ด."}], + "พศ.บ.": [{ORTH: "พศ.บ."}], + "พศ.ม.": [{ORTH: "พศ.ม."}], + "พศ.ด.": [{ORTH: "พศ.ด."}], + "ภ.บ.": [{ORTH: "ภ.บ."}], + "ภ.ม.": [{ORTH: "ภ.ม."}], + "ภ.ด.": [{ORTH: "ภ.ด."}], + "ภ.สถ.บ.": [{ORTH: "ภ.สถ.บ."}], + "รป.บ.": [{ORTH: "รป.บ."}], + "รป.ม.": [{ORTH: "รป.ม."}], + "วท.บ.": [{ORTH: "วท.บ."}], + "วท.ม.": [{ORTH: "วท.ม."}], + "วท.ด.": [{ORTH: "วท.ด."}], + "ศ.บ.": [{ORTH: "ศ.บ."}], + "ศศ.บ.": [{ORTH: "ศศ.บ."}], + "ศษ.บ.": [{ORTH: "ศษ.บ."}], + "ศส.บ.": [{ORTH: "ศส.บ."}], + "สถ.บ.": [{ORTH: "สถ.บ."}], + "สถ.ม.": [{ORTH: "สถ.ม."}], + "สถ.ด.": [{ORTH: "สถ.ด."}], + "สพ.บ.": [{ORTH: "สพ.บ."}], + "อ.บ.": [{ORTH: "อ.บ."}], + "อ.ม.": [{ORTH: "อ.ม."}], + "อ.ด.": [{ORTH: "อ.ด."}], # ปี / เวลา / year / time - "ชม.": [{ORTH: "ชม.", LEMMA: "ชั่วโมง"}], - "จ.ศ.": [{ORTH: "จ.ศ.", LEMMA: "จุลศักราช"}], - "ค.ศ.": [{ORTH: "ค.ศ.", LEMMA: "คริสต์ศักราช"}], - "ฮ.ศ.": [{ORTH: "ฮ.ศ.", LEMMA: "ฮิจเราะห์ศักราช"}], - "ว.ด.ป.": [{ORTH: "ว.ด.ป.", LEMMA: "วัน เดือน ปี"}], + "ชม.": [{ORTH: "ชม."}], + "จ.ศ.": [{ORTH: "จ.ศ."}], + "ค.ศ.": [{ORTH: "ค.ศ."}], + "ฮ.ศ.": [{ORTH: "ฮ.ศ."}], + "ว.ด.ป.": [{ORTH: "ว.ด.ป."}], # ระยะทาง / distance - "ฮม.": [{ORTH: "ฮม.", LEMMA: "เฮกโตเมตร"}], - "ดคม.": [{ORTH: "ดคม.", LEMMA: "เดคาเมตร"}], - "ดม.": [{ORTH: "ดม.", LEMMA: "เดซิเมตร"}], - "มม.": [{ORTH: "มม.", LEMMA: "มิลลิเมตร"}], - "ซม.": [{ORTH: "ซม.", LEMMA: "เซนติเมตร"}], - "กม.": [{ORTH: "กม.", LEMMA: "กิโลเมตร"}], + "ฮม.": [{ORTH: "ฮม."}], + "ดคม.": [{ORTH: "ดคม."}], + "ดม.": [{ORTH: "ดม."}], + "มม.": [{ORTH: "มม."}], + "ซม.": [{ORTH: "ซม."}], + "กม.": [{ORTH: "กม."}], # น้ำหนัก / weight - "น.น.": [{ORTH: "น.น.", LEMMA: "น้ำหนัก"}], - "ฮก.": [{ORTH: "ฮก.", LEMMA: "เฮกโตกรัม"}], - "ดคก.": [{ORTH: "ดคก.", LEMMA: "เดคากรัม"}], - "ดก.": [{ORTH: "ดก.", LEMMA: "เดซิกรัม"}], - "ซก.": [{ORTH: "ซก.", LEMMA: "เซนติกรัม"}], - "มก.": [{ORTH: "มก.", LEMMA: "มิลลิกรัม"}], - "ก.": [{ORTH: "ก.", LEMMA: "กรัม"}], - "กก.": [{ORTH: "กก.", LEMMA: "กิโลกรัม"}], + "น.น.": [{ORTH: "น.น."}], + "ฮก.": [{ORTH: "ฮก."}], + "ดคก.": [{ORTH: "ดคก."}], + "ดก.": [{ORTH: "ดก."}], + "ซก.": 
[{ORTH: "ซก."}], + "มก.": [{ORTH: "มก."}], + "ก.": [{ORTH: "ก."}], + "กก.": [{ORTH: "กก."}], # ปริมาตร / volume - "ฮล.": [{ORTH: "ฮล.", LEMMA: "เฮกโตลิตร"}], - "ดคล.": [{ORTH: "ดคล.", LEMMA: "เดคาลิตร"}], - "ดล.": [{ORTH: "ดล.", LEMMA: "เดซิลิตร"}], - "ซล.": [{ORTH: "ซล.", LEMMA: "เซนติลิตร"}], - "ล.": [{ORTH: "ล.", LEMMA: "ลิตร"}], - "กล.": [{ORTH: "กล.", LEMMA: "กิโลลิตร"}], - "ลบ.": [{ORTH: "ลบ.", LEMMA: "ลูกบาศก์"}], + "ฮล.": [{ORTH: "ฮล."}], + "ดคล.": [{ORTH: "ดคล."}], + "ดล.": [{ORTH: "ดล."}], + "ซล.": [{ORTH: "ซล."}], + "ล.": [{ORTH: "ล."}], + "กล.": [{ORTH: "กล."}], + "ลบ.": [{ORTH: "ลบ."}], # พื้นที่ / area - "ตร.ซม.": [{ORTH: "ตร.ซม.", LEMMA: "ตารางเซนติเมตร"}], - "ตร.ม.": [{ORTH: "ตร.ม.", LEMMA: "ตารางเมตร"}], - "ตร.ว.": [{ORTH: "ตร.ว.", LEMMA: "ตารางวา"}], - "ตร.กม.": [{ORTH: "ตร.กม.", LEMMA: "ตารางกิโลเมตร"}], + "ตร.ซม.": [{ORTH: "ตร.ซม."}], + "ตร.ม.": [{ORTH: "ตร.ม."}], + "ตร.ว.": [{ORTH: "ตร.ว."}], + "ตร.กม.": [{ORTH: "ตร.กม."}], # เดือน / month - "ม.ค.": [{ORTH: "ม.ค.", LEMMA: "มกราคม"}], - "ก.พ.": [{ORTH: "ก.พ.", LEMMA: "กุมภาพันธ์"}], - "มี.ค.": [{ORTH: "มี.ค.", LEMMA: "มีนาคม"}], - "เม.ย.": [{ORTH: "เม.ย.", LEMMA: "เมษายน"}], - "พ.ค.": [{ORTH: "พ.ค.", LEMMA: "พฤษภาคม"}], - "มิ.ย.": [{ORTH: "มิ.ย.", LEMMA: "มิถุนายน"}], - "ก.ค.": [{ORTH: "ก.ค.", LEMMA: "กรกฎาคม"}], - "ส.ค.": [{ORTH: "ส.ค.", LEMMA: "สิงหาคม"}], - "ก.ย.": [{ORTH: "ก.ย.", LEMMA: "กันยายน"}], - "ต.ค.": [{ORTH: "ต.ค.", LEMMA: "ตุลาคม"}], - "พ.ย.": [{ORTH: "พ.ย.", LEMMA: "พฤศจิกายน"}], - "ธ.ค.": [{ORTH: "ธ.ค.", LEMMA: "ธันวาคม"}], + "ม.ค.": [{ORTH: "ม.ค."}], + "ก.พ.": [{ORTH: "ก.พ."}], + "มี.ค.": [{ORTH: "มี.ค."}], + "เม.ย.": [{ORTH: "เม.ย."}], + "พ.ค.": [{ORTH: "พ.ค."}], + "มิ.ย.": [{ORTH: "มิ.ย."}], + "ก.ค.": [{ORTH: "ก.ค."}], + "ส.ค.": [{ORTH: "ส.ค."}], + "ก.ย.": [{ORTH: "ก.ย."}], + "ต.ค.": [{ORTH: "ต.ค."}], + "พ.ย.": [{ORTH: "พ.ย."}], + "ธ.ค.": [{ORTH: "ธ.ค."}], # เพศ / gender - "ช.": [{ORTH: "ช.", LEMMA: "ชาย"}], - "ญ.": [{ORTH: "ญ.", LEMMA: "หญิง"}], - "ด.ช.": [{ORTH: "ด.ช.", LEMMA: "เด็กชาย"}], - "ด.ญ.": [{ORTH: "ด.ญ.", LEMMA: "เด็กหญิง"}], + "ช.": [{ORTH: "ช."}], + "ญ.": [{ORTH: "ญ."}], + "ด.ช.": [{ORTH: "ด.ช."}], + "ด.ญ.": [{ORTH: "ด.ญ."}], # ที่อยู่ / address - "ถ.": [{ORTH: "ถ.", LEMMA: "ถนน"}], - "ต.": [{ORTH: "ต.", LEMMA: "ตำบล"}], - "อ.": [{ORTH: "อ.", LEMMA: "อำเภอ"}], - "จ.": [{ORTH: "จ.", LEMMA: "จังหวัด"}], + "ถ.": [{ORTH: "ถ."}], + "ต.": [{ORTH: "ต."}], + "อ.": [{ORTH: "อ."}], + "จ.": [{ORTH: "จ."}], # สรรพนาม / pronoun - "ข้าฯ": [{ORTH: "ข้าฯ", LEMMA: "ข้าพระพุทธเจ้า"}], - "ทูลเกล้าฯ": [{ORTH: "ทูลเกล้าฯ", LEMMA: "ทูลเกล้าทูลกระหม่อม"}], - "น้อมเกล้าฯ": [{ORTH: "น้อมเกล้าฯ", LEMMA: "น้อมเกล้าน้อมกระหม่อม"}], - "โปรดเกล้าฯ": [{ORTH: "โปรดเกล้าฯ", LEMMA: "โปรดเกล้าโปรดกระหม่อม"}], + "ข้าฯ": [{ORTH: "ข้าฯ"}], + "ทูลเกล้าฯ": [{ORTH: "ทูลเกล้าฯ"}], + "น้อมเกล้าฯ": [{ORTH: "น้อมเกล้าฯ"}], + "โปรดเกล้าฯ": [{ORTH: "โปรดเกล้าฯ"}], # การเมือง / politic - "ขจก.": [{ORTH: "ขจก.", LEMMA: "ขบวนการโจรก่อการร้าย"}], - "ขบด.": [{ORTH: "ขบด.", LEMMA: "ขบวนการแบ่งแยกดินแดน"}], - "นปช.": [{ORTH: "นปช.", LEMMA: "แนวร่วมประชาธิปไตยขับไล่เผด็จการ"}], - "ปชป.": [{ORTH: "ปชป.", LEMMA: "พรรคประชาธิปัตย์"}], - "ผกค.": [{ORTH: "ผกค.", LEMMA: "ผู้ก่อการร้ายคอมมิวนิสต์"}], - "พท.": [{ORTH: "พท.", LEMMA: "พรรคเพื่อไทย"}], - "พ.ร.ก.": [{ORTH: "พ.ร.ก.", LEMMA: "พระราชกำหนด"}], - "พ.ร.ฎ.": [{ORTH: "พ.ร.ฎ.", LEMMA: "พระราชกฤษฎีกา"}], - "พ.ร.บ.": [{ORTH: "พ.ร.บ.", LEMMA: "พระราชบัญญัติ"}], - "รธน.": [{ORTH: "รธน.", LEMMA: "รัฐธรรมนูญ"}], - "รบ.": [{ORTH: "รบ.", LEMMA: "รัฐบาล"}], - "รสช.": [{ORTH: "รสช.", LEMMA: 
"คณะรักษาความสงบเรียบร้อยแห่งชาติ"}], - "ส.ก.": [{ORTH: "ส.ก.", LEMMA: "สมาชิกสภากรุงเทพมหานคร"}], - "สจ.": [{ORTH: "สจ.", LEMMA: "สมาชิกสภาจังหวัด"}], - "สว.": [{ORTH: "สว.", LEMMA: "สมาชิกวุฒิสภา"}], - "ส.ส.": [{ORTH: "ส.ส.", LEMMA: "สมาชิกสภาผู้แทนราษฎร"}], + "ขจก.": [{ORTH: "ขจก."}], + "ขบด.": [{ORTH: "ขบด."}], + "นปช.": [{ORTH: "นปช."}], + "ปชป.": [{ORTH: "ปชป."}], + "ผกค.": [{ORTH: "ผกค."}], + "พท.": [{ORTH: "พท."}], + "พ.ร.ก.": [{ORTH: "พ.ร.ก."}], + "พ.ร.ฎ.": [{ORTH: "พ.ร.ฎ."}], + "พ.ร.บ.": [{ORTH: "พ.ร.บ."}], + "รธน.": [{ORTH: "รธน."}], + "รบ.": [{ORTH: "รบ."}], + "รสช.": [{ORTH: "รสช."}], + "ส.ก.": [{ORTH: "ส.ก."}], + "สจ.": [{ORTH: "สจ."}], + "สว.": [{ORTH: "สว."}], + "ส.ส.": [{ORTH: "ส.ส."}], # ทั่วไป / general - "ก.ข.ค.": [{ORTH: "ก.ข.ค.", LEMMA: "ก้างขวางคอ"}], - "กทม.": [{ORTH: "กทม.", LEMMA: "กรุงเทพมหานคร"}], - "กรุงเทพฯ": [{ORTH: "กรุงเทพฯ", LEMMA: "กรุงเทพมหานคร"}], - "ขรก.": [{ORTH: "ขรก.", LEMMA: "ข้าราชการ"}], - "ขส": [{ORTH: "ขส.", LEMMA: "ขนส่ง"}], - "ค.ร.น.": [{ORTH: "ค.ร.น.", LEMMA: "คูณร่วมน้อย"}], - "ค.ร.ม.": [{ORTH: "ค.ร.ม.", LEMMA: "คูณร่วมมาก"}], - "ง.ด.": [{ORTH: "ง.ด.", LEMMA: "เงินเดือน"}], - "งป.": [{ORTH: "งป.", LEMMA: "งบประมาณ"}], - "จก.": [{ORTH: "จก.", LEMMA: "จำกัด"}], - "จขกท.": [{ORTH: "จขกท.", LEMMA: "เจ้าของกระทู้"}], - "จนท.": [{ORTH: "จนท.", LEMMA: "เจ้าหน้าที่"}], - "จ.ป.ร.": [ - { - ORTH: "จ.ป.ร.", - LEMMA: "มหาจุฬาลงกรณ ปรมราชาธิราช (พระปรมาภิไธยในพระบาทสมเด็จพระจุลจอมเกล้าเจ้าอยู่หัว)", - } - ], - "จ.ม.": [{ORTH: "จ.ม.", LEMMA: "จดหมาย"}], - "จย.": [{ORTH: "จย.", LEMMA: "จักรยาน"}], - "จยย.": [{ORTH: "จยย.", LEMMA: "จักรยานยนต์"}], - "ตจว.": [{ORTH: "ตจว.", LEMMA: "ต่างจังหวัด"}], - "โทร.": [{ORTH: "โทร.", LEMMA: "โทรศัพท์"}], - "ธ.": [{ORTH: "ธ.", LEMMA: "ธนาคาร"}], - "น.ร.": [{ORTH: "น.ร.", LEMMA: "นักเรียน"}], - "น.ศ.": [{ORTH: "น.ศ.", LEMMA: "นักศึกษา"}], - "น.ส.": [{ORTH: "น.ส.", LEMMA: "นางสาว"}], - "น.ส.๓": [{ORTH: "น.ส.๓", LEMMA: "หนังสือรับรองการทำประโยชน์ในที่ดิน"}], - "น.ส.๓ ก.": [ - {ORTH: "น.ส.๓ ก", LEMMA: "หนังสือแสดงกรรมสิทธิ์ในที่ดิน (มีระวางกำหนด)"} - ], - "นสพ.": [{ORTH: "นสพ.", LEMMA: "หนังสือพิมพ์"}], - "บ.ก.": [{ORTH: "บ.ก.", LEMMA: "บรรณาธิการ"}], - "บจก.": [{ORTH: "บจก.", LEMMA: "บริษัทจำกัด"}], - "บงล.": [{ORTH: "บงล.", LEMMA: "บริษัทเงินทุนและหลักทรัพย์จำกัด"}], - "บบส.": [{ORTH: "บบส.", LEMMA: "บรรษัทบริหารสินทรัพย์สถาบันการเงิน"}], - "บมจ.": [{ORTH: "บมจ.", LEMMA: "บริษัทมหาชนจำกัด"}], - "บลจ.": [{ORTH: "บลจ.", LEMMA: "บริษัทหลักทรัพย์จัดการกองทุนรวมจำกัด"}], - "บ/ช": [{ORTH: "บ/ช", LEMMA: "บัญชี"}], - "บร.": [{ORTH: "บร.", LEMMA: "บรรณารักษ์"}], - "ปชช.": [{ORTH: "ปชช.", LEMMA: "ประชาชน"}], - "ปณ.": [{ORTH: "ปณ.", LEMMA: "ที่ทำการไปรษณีย์"}], - "ปณก.": [{ORTH: "ปณก.", LEMMA: "ที่ทำการไปรษณีย์กลาง"}], - "ปณส.": [{ORTH: "ปณส.", LEMMA: "ที่ทำการไปรษณีย์สาขา"}], - "ปธ.": [{ORTH: "ปธ.", LEMMA: "ประธาน"}], - "ปธน.": [{ORTH: "ปธน.", LEMMA: "ประธานาธิบดี"}], - "ปอ.": [{ORTH: "ปอ.", LEMMA: "รถยนต์โดยสารประจำทางปรับอากาศ"}], - "ปอ.พ.": [{ORTH: "ปอ.พ.", LEMMA: "รถยนต์โดยสารประจำทางปรับอากาศพิเศษ"}], - "พ.ก.ง.": [{ORTH: "พ.ก.ง.", LEMMA: "พัสดุเก็บเงินปลายทาง"}], - "พ.ก.ส.": [{ORTH: "พ.ก.ส.", LEMMA: "พนักงานเก็บค่าโดยสาร"}], - "พขร.": [{ORTH: "พขร.", LEMMA: "พนักงานขับรถ"}], - "ภ.ง.ด.": [{ORTH: "ภ.ง.ด.", LEMMA: "ภาษีเงินได้"}], - "ภ.ง.ด.๙": [{ORTH: "ภ.ง.ด.๙", LEMMA: "แบบแสดงรายการเสียภาษีเงินได้ของกรมสรรพากร"}], - "ภ.ป.ร.": [ - { - ORTH: "ภ.ป.ร.", - LEMMA: "ภูมิพลอดุยเดช ปรมราชาธิราช (พระปรมาภิไธยในพระบาทสมเด็จพระปรมินทรมหาภูมิพลอดุลยเดช)", - } - ], - "ภ.พ.": [{ORTH: "ภ.พ.", LEMMA: "ภาษีมูลค่าเพิ่ม"}], - "ร.": [{ORTH: "ร.", 
LEMMA: "รัชกาล"}], - "ร.ง.": [{ORTH: "ร.ง.", LEMMA: "โรงงาน"}], - "ร.ด.": [{ORTH: "ร.ด.", LEMMA: "รักษาดินแดน"}], - "รปภ.": [{ORTH: "รปภ.", LEMMA: "รักษาความปลอดภัย"}], - "รพ.": [{ORTH: "รพ.", LEMMA: "โรงพยาบาล"}], - "ร.พ.": [{ORTH: "ร.พ.", LEMMA: "โรงพิมพ์"}], - "รร.": [{ORTH: "รร.", LEMMA: "โรงเรียน,โรงแรม"}], - "รสก.": [{ORTH: "รสก.", LEMMA: "รัฐวิสาหกิจ"}], - "ส.ค.ส.": [{ORTH: "ส.ค.ส.", LEMMA: "ส่งความสุขปีใหม่"}], - "สต.": [{ORTH: "สต.", LEMMA: "สตางค์"}], - "สน.": [{ORTH: "สน.", LEMMA: "สถานีตำรวจ"}], - "สนข.": [{ORTH: "สนข.", LEMMA: "สำนักงานเขต"}], - "สนง.": [{ORTH: "สนง.", LEMMA: "สำนักงาน"}], - "สนญ.": [{ORTH: "สนญ.", LEMMA: "สำนักงานใหญ่"}], - "ส.ป.ช.": [{ORTH: "ส.ป.ช.", LEMMA: "สร้างเสริมประสบการณ์ชีวิต"}], - "สภ.": [{ORTH: "สภ.", LEMMA: "สถานีตำรวจภูธร"}], - "ส.ล.น.": [{ORTH: "ส.ล.น.", LEMMA: "สร้างเสริมลักษณะนิสัย"}], - "สวญ.": [{ORTH: "สวญ.", LEMMA: "สารวัตรใหญ่"}], - "สวป.": [{ORTH: "สวป.", LEMMA: "สารวัตรป้องกันปราบปราม"}], - "สว.สส.": [{ORTH: "สว.สส.", LEMMA: "สารวัตรสืบสวน"}], - "ส.ห.": [{ORTH: "ส.ห.", LEMMA: "สารวัตรทหาร"}], - "สอ.": [{ORTH: "สอ.", LEMMA: "สถานีอนามัย"}], - "สอท.": [{ORTH: "สอท.", LEMMA: "สถานเอกอัครราชทูต"}], - "เสธ.": [{ORTH: "เสธ.", LEMMA: "เสนาธิการ"}], - "หจก.": [{ORTH: "หจก.", LEMMA: "ห้างหุ้นส่วนจำกัด"}], - "ห.ร.ม.": [{ORTH: "ห.ร.ม.", LEMMA: "ตัวหารร่วมมาก"}], + "ก.ข.ค.": [{ORTH: "ก.ข.ค."}], + "กทม.": [{ORTH: "กทม."}], + "กรุงเทพฯ": [{ORTH: "กรุงเทพฯ"}], + "ขรก.": [{ORTH: "ขรก."}], + "ขส": [{ORTH: "ขส."}], + "ค.ร.น.": [{ORTH: "ค.ร.น."}], + "ค.ร.ม.": [{ORTH: "ค.ร.ม."}], + "ง.ด.": [{ORTH: "ง.ด."}], + "งป.": [{ORTH: "งป."}], + "จก.": [{ORTH: "จก."}], + "จขกท.": [{ORTH: "จขกท."}], + "จนท.": [{ORTH: "จนท."}], + "จ.ป.ร.": [{ORTH: "จ.ป.ร."}], + "จ.ม.": [{ORTH: "จ.ม."}], + "จย.": [{ORTH: "จย."}], + "จยย.": [{ORTH: "จยย."}], + "ตจว.": [{ORTH: "ตจว."}], + "โทร.": [{ORTH: "โทร."}], + "ธ.": [{ORTH: "ธ."}], + "น.ร.": [{ORTH: "น.ร."}], + "น.ศ.": [{ORTH: "น.ศ."}], + "น.ส.": [{ORTH: "น.ส."}], + "น.ส.๓": [{ORTH: "น.ส.๓"}], + "น.ส.๓ ก.": [{ORTH: "น.ส.๓ ก"}], + "นสพ.": [{ORTH: "นสพ."}], + "บ.ก.": [{ORTH: "บ.ก."}], + "บจก.": [{ORTH: "บจก."}], + "บงล.": [{ORTH: "บงล."}], + "บบส.": [{ORTH: "บบส."}], + "บมจ.": [{ORTH: "บมจ."}], + "บลจ.": [{ORTH: "บลจ."}], + "บ/ช": [{ORTH: "บ/ช"}], + "บร.": [{ORTH: "บร."}], + "ปชช.": [{ORTH: "ปชช."}], + "ปณ.": [{ORTH: "ปณ."}], + "ปณก.": [{ORTH: "ปณก."}], + "ปณส.": [{ORTH: "ปณส."}], + "ปธ.": [{ORTH: "ปธ."}], + "ปธน.": [{ORTH: "ปธน."}], + "ปอ.": [{ORTH: "ปอ."}], + "ปอ.พ.": [{ORTH: "ปอ.พ."}], + "พ.ก.ง.": [{ORTH: "พ.ก.ง."}], + "พ.ก.ส.": [{ORTH: "พ.ก.ส."}], + "พขร.": [{ORTH: "พขร."}], + "ภ.ง.ด.": [{ORTH: "ภ.ง.ด."}], + "ภ.ง.ด.๙": [{ORTH: "ภ.ง.ด.๙"}], + "ภ.ป.ร.": [{ORTH: "ภ.ป.ร."}], + "ภ.พ.": [{ORTH: "ภ.พ."}], + "ร.": [{ORTH: "ร."}], + "ร.ง.": [{ORTH: "ร.ง."}], + "ร.ด.": [{ORTH: "ร.ด."}], + "รปภ.": [{ORTH: "รปภ."}], + "รพ.": [{ORTH: "รพ."}], + "ร.พ.": [{ORTH: "ร.พ."}], + "รร.": [{ORTH: "รร."}], + "รสก.": [{ORTH: "รสก."}], + "ส.ค.ส.": [{ORTH: "ส.ค.ส."}], + "สต.": [{ORTH: "สต."}], + "สน.": [{ORTH: "สน."}], + "สนข.": [{ORTH: "สนข."}], + "สนง.": [{ORTH: "สนง."}], + "สนญ.": [{ORTH: "สนญ."}], + "ส.ป.ช.": [{ORTH: "ส.ป.ช."}], + "สภ.": [{ORTH: "สภ."}], + "ส.ล.น.": [{ORTH: "ส.ล.น."}], + "สวญ.": [{ORTH: "สวญ."}], + "สวป.": [{ORTH: "สวป."}], + "สว.สส.": [{ORTH: "สว.สส."}], + "ส.ห.": [{ORTH: "ส.ห."}], + "สอ.": [{ORTH: "สอ."}], + "สอท.": [{ORTH: "สอท."}], + "เสธ.": [{ORTH: "เสธ."}], + "หจก.": [{ORTH: "หจก."}], + "ห.ร.ม.": [{ORTH: "ห.ร.ม."}], } diff --git a/spacy/lang/tl/__init__.py b/spacy/lang/tl/__init__.py index 30ad93139..61530dc30 100644 --- 
a/spacy/lang/tl/__init__.py +++ b/spacy/lang/tl/__init__.py @@ -1,28 +1,12 @@ -# coding: utf8 -from __future__ import unicode_literals - from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS -from ..tokenizer_exceptions import BASE_EXCEPTIONS -from ..norm_exceptions import BASE_NORMS from ...language import Language -from ...attrs import LANG, NORM -from ...util import update_exc, add_lookups - - -def _return_tl(_): - return "tl" class TagalogDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = _return_tl - lex_attr_getters[NORM] = add_lookups( - Language.Defaults.lex_attr_getters[NORM], BASE_NORMS - ) - lex_attr_getters.update(LEX_ATTRS) - tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) + tokenizer_exceptions = TOKENIZER_EXCEPTIONS + lex_attr_getters = LEX_ATTRS stop_words = STOP_WORDS diff --git a/spacy/lang/tl/lex_attrs.py b/spacy/lang/tl/lex_attrs.py index 61dc9d4f3..60bdc923b 100644 --- a/spacy/lang/tl/lex_attrs.py +++ b/spacy/lang/tl/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/tl/stop_words.py b/spacy/lang/tl/stop_words.py index 510b3a418..2560cdaed 100644 --- a/spacy/lang/tl/stop_words.py +++ b/spacy/lang/tl/stop_words.py @@ -1,6 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - STOP_WORDS = set( """ akin diff --git a/spacy/lang/tl/tokenizer_exceptions.py b/spacy/lang/tl/tokenizer_exceptions.py index 77e1fb0c6..51ad12d9f 100644 --- a/spacy/lang/tl/tokenizer_exceptions.py +++ b/spacy/lang/tl/tokenizer_exceptions.py @@ -1,20 +1,19 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import ORTH, LEMMA +from ..tokenizer_exceptions import BASE_EXCEPTIONS +from ...symbols import ORTH, NORM +from ...util import update_exc _exc = { - "tayo'y": [{ORTH: "tayo", LEMMA: "tayo"}, {ORTH: "'y", LEMMA: "ay"}], - "isa'y": [{ORTH: "isa", LEMMA: "isa"}, {ORTH: "'y", LEMMA: "ay"}], - "baya'y": [{ORTH: "baya", LEMMA: "bayan"}, {ORTH: "'y", LEMMA: "ay"}], - "sa'yo": [{ORTH: "sa", LEMMA: "sa"}, {ORTH: "'yo", LEMMA: "iyo"}], - "ano'ng": [{ORTH: "ano", LEMMA: "ano"}, {ORTH: "'ng", LEMMA: "ang"}], - "siya'y": [{ORTH: "siya", LEMMA: "siya"}, {ORTH: "'y", LEMMA: "ay"}], - "nawa'y": [{ORTH: "nawa", LEMMA: "nawa"}, {ORTH: "'y", LEMMA: "ay"}], - "papa'no": [{ORTH: "papa'no", LEMMA: "papaano"}], - "'di": [{ORTH: "'di", LEMMA: "hindi"}], + "tayo'y": [{ORTH: "tayo"}, {ORTH: "'y", NORM: "ay"}], + "isa'y": [{ORTH: "isa"}, {ORTH: "'y", NORM: "ay"}], + "baya'y": [{ORTH: "baya"}, {ORTH: "'y", NORM: "ay"}], + "sa'yo": [{ORTH: "sa"}, {ORTH: "'yo", NORM: "iyo"}], + "ano'ng": [{ORTH: "ano"}, {ORTH: "'ng", NORM: "ang"}], + "siya'y": [{ORTH: "siya"}, {ORTH: "'y", NORM: "ay"}], + "nawa'y": [{ORTH: "nawa"}, {ORTH: "'y", NORM: "ay"}], + "papa'no": [{ORTH: "papa'no", NORM: "papaano"}], + "'di": [{ORTH: "'di", NORM: "hindi"}], } -TOKENIZER_EXCEPTIONS = _exc +TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc) diff --git a/spacy/lang/tokenizer_exceptions.py b/spacy/lang/tokenizer_exceptions.py index c903448b0..960302513 100644 --- a/spacy/lang/tokenizer_exceptions.py +++ b/spacy/lang/tokenizer_exceptions.py @@ -1,10 +1,7 @@ -# coding: utf8 -from __future__ import unicode_literals - import re -from .char_classes import ALPHA_LOWER, ALPHA -from ..symbols import ORTH, POS, TAG, LEMMA, SPACE +from .char_classes import ALPHA_LOWER 
+from ..symbols import ORTH, NORM # URL validation regex courtesy of: https://mathiasbynens.be/demo/url-regex @@ -37,13 +34,13 @@ URL_PATTERN = ( r"|" # host & domain names # mods: match is case-sensitive, so include [A-Z] - r"(?:" # noqa - r"(?:" - r"[A-Za-z0-9\u00a1-\uffff]" - r"[A-Za-z0-9\u00a1-\uffff_-]{0,62}" - r")?" - r"[A-Za-z0-9\u00a1-\uffff]\." - r")+" + r"(?:" # noqa: E131 + r"(?:" + r"[A-Za-z0-9\u00a1-\uffff]" + r"[A-Za-z0-9\u00a1-\uffff_-]{0,62}" + r")?" + r"[A-Za-z0-9\u00a1-\uffff]\." + r")+" # TLD identifier # mods: use ALPHA_LOWER instead of a wider range so that this doesn't match # strings like "lower.Upper", which can be split on "." by infixes in some @@ -58,7 +55,6 @@ URL_PATTERN = ( # fmt: on ).strip() -TOKEN_MATCH = None URL_MATCH = re.compile("(?u)" + URL_PATTERN).match @@ -66,13 +62,13 @@ BASE_EXCEPTIONS = {} for exc_data in [ - {ORTH: " ", POS: SPACE, TAG: "_SP"}, - {ORTH: "\t", POS: SPACE, TAG: "_SP"}, - {ORTH: "\\t", POS: SPACE, TAG: "_SP"}, - {ORTH: "\n", POS: SPACE, TAG: "_SP"}, - {ORTH: "\\n", POS: SPACE, TAG: "_SP"}, + {ORTH: " "}, + {ORTH: "\t"}, + {ORTH: "\\t"}, + {ORTH: "\n"}, + {ORTH: "\\n"}, {ORTH: "\u2014"}, - {ORTH: "\u00a0", POS: SPACE, LEMMA: " ", TAG: "_SP"}, + {ORTH: "\u00a0", NORM: " "}, ]: BASE_EXCEPTIONS[exc_data[ORTH]] = [exc_data] @@ -128,7 +124,6 @@ emoticons = set( (-: =) (= -") :] :-] [: diff --git a/spacy/lang/tr/__init__.py b/spacy/lang/tr/__init__.py index fb0883a68..788adb6fb 100644 --- a/spacy/lang/tr/__init__.py +++ b/spacy/lang/tr/__init__.py @@ -1,31 +1,15 @@ -# coding: utf8 -from __future__ import unicode_literals - from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .stop_words import STOP_WORDS from .syntax_iterators import SYNTAX_ITERATORS from .lex_attrs import LEX_ATTRS -from .morph_rules import MORPH_RULES - - -from ..tokenizer_exceptions import BASE_EXCEPTIONS -from ..norm_exceptions import BASE_NORMS from ...language import Language -from ...attrs import LANG, NORM -from ...util import update_exc, add_lookups class TurkishDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters.update(LEX_ATTRS) - lex_attr_getters[LANG] = lambda text: "tr" - lex_attr_getters[NORM] = add_lookups( - Language.Defaults.lex_attr_getters[NORM], BASE_NORMS - ) - tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) + tokenizer_exceptions = TOKENIZER_EXCEPTIONS + lex_attr_getters = LEX_ATTRS stop_words = STOP_WORDS syntax_iterators = SYNTAX_ITERATORS - morph_rules = MORPH_RULES class Turkish(Language): diff --git a/spacy/lang/tr/examples.py b/spacy/lang/tr/examples.py index a0464dfe3..dfb324a4e 100644 --- a/spacy/lang/tr/examples.py +++ b/spacy/lang/tr/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
>>> from spacy.lang.tr.examples import sentences diff --git a/spacy/lang/tr/lex_attrs.py b/spacy/lang/tr/lex_attrs.py index 366bda9e7..f7416837d 100644 --- a/spacy/lang/tr/lex_attrs.py +++ b/spacy/lang/tr/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM @@ -65,6 +62,7 @@ _ordinal_words = [ _ordinal_endings = ("inci", "ıncı", "nci", "ncı", "uncu", "üncü") + def like_num(text): if text.startswith(("+", "-", "±", "~")): text = text[1:] @@ -75,20 +73,16 @@ def like_num(text): num, denom = text.split("/") if num.isdigit() and denom.isdigit(): return True - text_lower = text.lower() - - #Check cardinal number + # Check cardinal number if text_lower in _num_words: return True - - #Check ordinal number + # Check ordinal number if text_lower in _ordinal_words: return True if text_lower.endswith(_ordinal_endings): if text_lower[:-3].isdigit() or text_lower[:-4].isdigit(): return True - return False diff --git a/spacy/lang/tr/stop_words.py b/spacy/lang/tr/stop_words.py index 65905499a..85dcff6a5 100644 --- a/spacy/lang/tr/stop_words.py +++ b/spacy/lang/tr/stop_words.py @@ -1,7 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - - # Source: https://github.com/stopwords-iso/stopwords-tr STOP_WORDS = set( """ diff --git a/spacy/lang/tr/syntax_iterators.py b/spacy/lang/tr/syntax_iterators.py index 6cab3b260..d9b342949 100644 --- a/spacy/lang/tr/syntax_iterators.py +++ b/spacy/lang/tr/syntax_iterators.py @@ -21,7 +21,7 @@ def noun_chunks(doclike): "ROOT", ] doc = doclike.doc # Ensure works on both Doc and Span. - if not doc.is_parsed: + if not doc.has_annotation("DEP"): raise ValueError(Errors.E029) np_deps = [doc.vocab.strings.add(label) for label in labels] @@ -49,11 +49,10 @@ def noun_chunks(doclike): prev_end = word.left_edge.i yield word.left_edge.i, extend_right(word), np_label elif word.dep == conj: - cc_token = word.left_edge + cc_token = word.left_edge prev_end = cc_token.i - yield cc_token.right_edge.i + 1, extend_right(word), np_label # Shave off cc tokens from the NP - - + # Shave off cc tokens from the NP + yield cc_token.right_edge.i + 1, extend_right(word), np_label SYNTAX_ITERATORS = {"noun_chunks": noun_chunks} diff --git a/spacy/lang/tr/tokenizer_exceptions.py b/spacy/lang/tr/tokenizer_exceptions.py index f48e035d4..b84ef89a2 100644 --- a/spacy/lang/tr/tokenizer_exceptions.py +++ b/spacy/lang/tr/tokenizer_exceptions.py @@ -1,7 +1,7 @@ -# coding: utf8 -from __future__ import unicode_literals - +from ..tokenizer_exceptions import BASE_EXCEPTIONS from ...symbols import ORTH, NORM +from ...util import update_exc + _exc = {"sağol": [{ORTH: "sağ"}, {ORTH: "ol", NORM: "olun"}]} @@ -116,4 +116,4 @@ for orth in ["Dr.", "yy."]: _exc[orth] = [{ORTH: orth}] -TOKENIZER_EXCEPTIONS = _exc +TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc) diff --git a/spacy/lang/tt/__init__.py b/spacy/lang/tt/__init__.py index 3655e6264..c8e293f29 100644 --- a/spacy/lang/tt/__init__.py +++ b/spacy/lang/tt/__init__.py @@ -1,25 +1,14 @@ -# coding: utf8 -from __future__ import unicode_literals - from .lex_attrs import LEX_ATTRS from .punctuation import TOKENIZER_INFIXES from .stop_words import STOP_WORDS from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS -from ..tokenizer_exceptions import BASE_EXCEPTIONS -from ...attrs import LANG from ...language import Language -from ...util import update_exc class TatarDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = 
lambda text: "tt" - - lex_attr_getters.update(LEX_ATTRS) - - tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) - infixes = tuple(TOKENIZER_INFIXES) - + tokenizer_exceptions = TOKENIZER_EXCEPTIONS + infixes = TOKENIZER_INFIXES + lex_attr_getters = LEX_ATTRS stop_words = STOP_WORDS diff --git a/spacy/lang/tt/examples.py b/spacy/lang/tt/examples.py index ac668a0c2..723fcdd15 100644 --- a/spacy/lang/tt/examples.py +++ b/spacy/lang/tt/examples.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - """ Example sentences to test spaCy and its language models. >>> from spacy.lang.tt.examples import sentences diff --git a/spacy/lang/tt/lex_attrs.py b/spacy/lang/tt/lex_attrs.py index ad3d6b9eb..a2ae03061 100644 --- a/spacy/lang/tt/lex_attrs.py +++ b/spacy/lang/tt/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM _num_words = [ diff --git a/spacy/lang/tt/punctuation.py b/spacy/lang/tt/punctuation.py index 9ee66a59e..f644a8ccb 100644 --- a/spacy/lang/tt/punctuation.py +++ b/spacy/lang/tt/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, CONCAT_QUOTES, HYPHENS from ..char_classes import LIST_ELLIPSES, LIST_ICONS diff --git a/spacy/lang/tt/stop_words.py b/spacy/lang/tt/stop_words.py index 9f6e9bb86..44169b757 100644 --- a/spacy/lang/tt/stop_words.py +++ b/spacy/lang/tt/stop_words.py @@ -1,6 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - # Tatar stopwords are from https://github.com/aliiae/stopwords-tt STOP_WORDS = set( diff --git a/spacy/lang/tt/tokenizer_exceptions.py b/spacy/lang/tt/tokenizer_exceptions.py index 89f7a990b..3b8cc86b5 100644 --- a/spacy/lang/tt/tokenizer_exceptions.py +++ b/spacy/lang/tt/tokenizer_exceptions.py @@ -1,41 +1,41 @@ -# coding: utf8 -from __future__ import unicode_literals +from ..tokenizer_exceptions import BASE_EXCEPTIONS +from ...symbols import ORTH, NORM +from ...util import update_exc -from ...symbols import ORTH, LEMMA, NORM _exc = {} _abbrev_exc = [ # Weekdays abbreviations - {ORTH: "дш", LEMMA: "дүшәмбе"}, - {ORTH: "сш", LEMMA: "сишәмбе"}, - {ORTH: "чш", LEMMA: "чәршәмбе"}, - {ORTH: "пш", LEMMA: "пәнҗешәмбе"}, - {ORTH: "җм", LEMMA: "җомга"}, - {ORTH: "шб", LEMMA: "шимбә"}, - {ORTH: "яш", LEMMA: "якшәмбе"}, + {ORTH: "дш", NORM: "дүшәмбе"}, + {ORTH: "сш", NORM: "сишәмбе"}, + {ORTH: "чш", NORM: "чәршәмбе"}, + {ORTH: "пш", NORM: "пәнҗешәмбе"}, + {ORTH: "җм", NORM: "җомга"}, + {ORTH: "шб", NORM: "шимбә"}, + {ORTH: "яш", NORM: "якшәмбе"}, # Months abbreviations - {ORTH: "гый", LEMMA: "гыйнвар"}, - {ORTH: "фев", LEMMA: "февраль"}, - {ORTH: "мар", LEMMA: "март"}, - {ORTH: "мар", LEMMA: "март"}, - {ORTH: "апр", LEMMA: "апрель"}, - {ORTH: "июн", LEMMA: "июнь"}, - {ORTH: "июл", LEMMA: "июль"}, - {ORTH: "авг", LEMMA: "август"}, - {ORTH: "сен", LEMMA: "сентябрь"}, - {ORTH: "окт", LEMMA: "октябрь"}, - {ORTH: "ноя", LEMMA: "ноябрь"}, - {ORTH: "дек", LEMMA: "декабрь"}, + {ORTH: "гый", NORM: "гыйнвар"}, + {ORTH: "фев", NORM: "февраль"}, + {ORTH: "мар", NORM: "март"}, + {ORTH: "мар", NORM: "март"}, + {ORTH: "апр", NORM: "апрель"}, + {ORTH: "июн", NORM: "июнь"}, + {ORTH: "июл", NORM: "июль"}, + {ORTH: "авг", NORM: "август"}, + {ORTH: "сен", NORM: "сентябрь"}, + {ORTH: "окт", NORM: "октябрь"}, + {ORTH: "ноя", NORM: "ноябрь"}, + {ORTH: "дек", NORM: "декабрь"}, # Number abbreviations - {ORTH: "млрд", LEMMA: "миллиард"}, - {ORTH: "млн", LEMMA: "миллион"}, + 
{ORTH: "млрд", NORM: "миллиард"}, + {ORTH: "млн", NORM: "миллион"}, ] for abbr in _abbrev_exc: for orth in (abbr[ORTH], abbr[ORTH].capitalize(), abbr[ORTH].upper()): - _exc[orth] = [{ORTH: orth, LEMMA: abbr[LEMMA], NORM: abbr[LEMMA]}] - _exc[orth + "."] = [{ORTH: orth + ".", LEMMA: abbr[LEMMA], NORM: abbr[LEMMA]}] + _exc[orth] = [{ORTH: orth, NORM: abbr[NORM]}] + _exc[orth + "."] = [{ORTH: orth + ".", NORM: abbr[NORM]}] for exc_data in [ # "etc." abbreviations {ORTH: "һ.б.ш.", NORM: "һәм башка шундыйлар"}, @@ -43,7 +43,6 @@ for exc_data in [ # "etc." abbreviations {ORTH: "б.э.к.", NORM: "безнең эрага кадәр"}, {ORTH: "б.э.", NORM: "безнең эра"}, ]: - exc_data[LEMMA] = exc_data[NORM] _exc[exc_data[ORTH]] = [exc_data] -TOKENIZER_EXCEPTIONS = _exc +TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc) diff --git a/spacy/lang/uk/__init__.py b/spacy/lang/uk/__init__.py index e74ff2d86..24c88e5a7 100644 --- a/spacy/lang/uk/__init__.py +++ b/spacy/lang/uk/__init__.py @@ -1,39 +1,35 @@ -# coding: utf8 -from __future__ import unicode_literals +from typing import Optional + +from thinc.api import Model from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS - -from ..tokenizer_exceptions import BASE_EXCEPTIONS -from ..norm_exceptions import BASE_NORMS -from ...util import update_exc, add_lookups -from ...language import Language -from ...lookups import Lookups -from ...attrs import LANG, NORM from .lemmatizer import UkrainianLemmatizer +from ...language import Language class UkrainianDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = lambda text: "uk" - lex_attr_getters[NORM] = add_lookups( - Language.Defaults.lex_attr_getters[NORM], BASE_NORMS - ) - lex_attr_getters.update(LEX_ATTRS) - tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) + tokenizer_exceptions = TOKENIZER_EXCEPTIONS + lex_attr_getters = LEX_ATTRS stop_words = STOP_WORDS - @classmethod - def create_lemmatizer(cls, nlp=None, lookups=None): - if lookups is None: - lookups = Lookups() - return UkrainianLemmatizer(lookups) - class Ukrainian(Language): lang = "uk" Defaults = UkrainianDefaults +@Ukrainian.factory( + "lemmatizer", + assigns=["token.lemma"], + default_config={"model": None, "mode": "pymorphy2"}, + default_score_weights={"lemma_acc": 1.0}, +) +def make_lemmatizer( + nlp: Language, model: Optional[Model], name: str, mode: str, overwrite: bool = False +): + return UkrainianLemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite) + + __all__ = ["Ukrainian"] diff --git a/spacy/lang/uk/examples.py b/spacy/lang/uk/examples.py index 4f2b034eb..f75d44488 100644 --- a/spacy/lang/uk/examples.py +++ b/spacy/lang/uk/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
diff --git a/spacy/lang/uk/lemmatizer.py b/spacy/lang/uk/lemmatizer.py index 3eeed5dd4..009ec5044 100644 --- a/spacy/lang/uk/lemmatizer.py +++ b/spacy/lang/uk/lemmatizer.py @@ -1,194 +1,29 @@ -# coding: utf8 -from ...symbols import ADJ, DET, NOUN, NUM, PRON, PROPN, PUNCT, VERB, POS -from ...lemmatizer import Lemmatizer +from typing import Optional + +from thinc.api import Model + +from ..ru.lemmatizer import RussianLemmatizer +from ...vocab import Vocab -class UkrainianLemmatizer(Lemmatizer): - _morph = None - - def __init__(self, lookups=None): - super(UkrainianLemmatizer, self).__init__(lookups) +class UkrainianLemmatizer(RussianLemmatizer): + def __init__( + self, + vocab: Vocab, + model: Optional[Model], + name: str = "lemmatizer", + *, + mode: str = "pymorphy2", + overwrite: bool = False, + ) -> None: + super().__init__(vocab, model, name, mode=mode, overwrite=overwrite) try: from pymorphy2 import MorphAnalyzer - - if UkrainianLemmatizer._morph is None: - UkrainianLemmatizer._morph = MorphAnalyzer(lang="uk") - except (ImportError, TypeError): + except ImportError: raise ImportError( "The Ukrainian lemmatizer requires the pymorphy2 library and " 'dictionaries: try to fix it with "pip uninstall pymorphy2" and' '"pip install git+https://github.com/kmike/pymorphy2.git pymorphy2-dicts-uk"' - ) - - def __call__(self, string, univ_pos, morphology=None): - univ_pos = self.normalize_univ_pos(univ_pos) - if univ_pos == "PUNCT": - return [PUNCT_RULES.get(string, string)] - - if univ_pos not in ("ADJ", "DET", "NOUN", "NUM", "PRON", "PROPN", "VERB"): - # Skip unchangeable pos - return [string.lower()] - - analyses = self._morph.parse(string) - filtered_analyses = [] - for analysis in analyses: - if not analysis.is_known: - # Skip suggested parse variant for unknown word for pymorphy - continue - analysis_pos, _ = oc2ud(str(analysis.tag)) - if analysis_pos == univ_pos or ( - analysis_pos in ("NOUN", "PROPN") and univ_pos in ("NOUN", "PROPN") - ): - filtered_analyses.append(analysis) - - if not len(filtered_analyses): - return [string.lower()] - if morphology is None or (len(morphology) == 1 and POS in morphology): - return list(set([analysis.normal_form for analysis in filtered_analyses])) - - if univ_pos in ("ADJ", "DET", "NOUN", "PROPN"): - features_to_compare = ["Case", "Number", "Gender"] - elif univ_pos == "NUM": - features_to_compare = ["Case", "Gender"] - elif univ_pos == "PRON": - features_to_compare = ["Case", "Number", "Gender", "Person"] - else: # VERB - features_to_compare = [ - "Aspect", - "Gender", - "Mood", - "Number", - "Tense", - "VerbForm", - "Voice", - ] - - analyses, filtered_analyses = filtered_analyses, [] - for analysis in analyses: - _, analysis_morph = oc2ud(str(analysis.tag)) - for feature in features_to_compare: - if ( - feature in morphology - and feature in analysis_morph - and morphology[feature].lower() != analysis_morph[feature].lower() - ): - break - else: - filtered_analyses.append(analysis) - - if not len(filtered_analyses): - return [string.lower()] - return list(set([analysis.normal_form for analysis in filtered_analyses])) - - @staticmethod - def normalize_univ_pos(univ_pos): - if isinstance(univ_pos, str): - return univ_pos.upper() - - symbols_to_str = { - ADJ: "ADJ", - DET: "DET", - NOUN: "NOUN", - NUM: "NUM", - PRON: "PRON", - PROPN: "PROPN", - PUNCT: "PUNCT", - VERB: "VERB", - } - if univ_pos in symbols_to_str: - return symbols_to_str[univ_pos] - return None - - def lookup(self, string, orth=None): - analyses = self._morph.parse(string) - if len(analyses) 
== 1: - return analyses[0].normal_form - return string - - -def oc2ud(oc_tag): - gram_map = { - "_POS": { - "ADJF": "ADJ", - "ADJS": "ADJ", - "ADVB": "ADV", - "Apro": "DET", - "COMP": "ADJ", # Can also be an ADV - unchangeable - "CONJ": "CCONJ", # Can also be a SCONJ - both unchangeable ones - "GRND": "VERB", - "INFN": "VERB", - "INTJ": "INTJ", - "NOUN": "NOUN", - "NPRO": "PRON", - "NUMR": "NUM", - "NUMB": "NUM", - "PNCT": "PUNCT", - "PRCL": "PART", - "PREP": "ADP", - "PRTF": "VERB", - "PRTS": "VERB", - "VERB": "VERB", - }, - "Animacy": {"anim": "Anim", "inan": "Inan"}, - "Aspect": {"impf": "Imp", "perf": "Perf"}, - "Case": { - "ablt": "Ins", - "accs": "Acc", - "datv": "Dat", - "gen1": "Gen", - "gen2": "Gen", - "gent": "Gen", - "loc2": "Loc", - "loct": "Loc", - "nomn": "Nom", - "voct": "Voc", - }, - "Degree": {"COMP": "Cmp", "Supr": "Sup"}, - "Gender": {"femn": "Fem", "masc": "Masc", "neut": "Neut"}, - "Mood": {"impr": "Imp", "indc": "Ind"}, - "Number": {"plur": "Plur", "sing": "Sing"}, - "NumForm": {"NUMB": "Digit"}, - "Person": {"1per": "1", "2per": "2", "3per": "3", "excl": "2", "incl": "1"}, - "Tense": {"futr": "Fut", "past": "Past", "pres": "Pres"}, - "Variant": {"ADJS": "Brev", "PRTS": "Brev"}, - "VerbForm": { - "GRND": "Conv", - "INFN": "Inf", - "PRTF": "Part", - "PRTS": "Part", - "VERB": "Fin", - }, - "Voice": {"actv": "Act", "pssv": "Pass"}, - "Abbr": {"Abbr": "Yes"}, - } - - pos = "X" - morphology = dict() - unmatched = set() - - grams = oc_tag.replace(" ", ",").split(",") - for gram in grams: - match = False - for categ, gmap in sorted(gram_map.items()): - if gram in gmap: - match = True - if categ == "_POS": - pos = gmap[gram] - else: - morphology[categ] = gmap[gram] - if not match: - unmatched.add(gram) - - while len(unmatched) > 0: - gram = unmatched.pop() - if gram in ("Name", "Patr", "Surn", "Geox", "Orgn"): - pos = "PROPN" - elif gram == "Auxt": - pos = "AUX" - elif gram == "Pltm": - morphology["Number"] = "Ptan" - - return pos, morphology - - -PUNCT_RULES = {"«": '"', "»": '"'} + ) from None + if UkrainianLemmatizer._morph is None: + UkrainianLemmatizer._morph = MorphAnalyzer(lang="uk") diff --git a/spacy/lang/uk/lex_attrs.py b/spacy/lang/uk/lex_attrs.py index 0ade751d6..510e5b85d 100644 --- a/spacy/lang/uk/lex_attrs.py +++ b/spacy/lang/uk/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM _num_words = [ diff --git a/spacy/lang/uk/stop_words.py b/spacy/lang/uk/stop_words.py index cdf24dd70..b11d7a044 100644 --- a/spacy/lang/uk/stop_words.py +++ b/spacy/lang/uk/stop_words.py @@ -1,7 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - - STOP_WORDS = set( """а або diff --git a/spacy/lang/uk/tag_map.py b/spacy/lang/uk/tag_map.py deleted file mode 100644 index 472e772ef..000000000 --- a/spacy/lang/uk/tag_map.py +++ /dev/null @@ -1,28 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ..symbols import POS, ADV, NOUN, ADP, PRON, SCONJ, PROPN, DET, SYM, INTJ -from ..symbols import PUNCT, NUM, AUX, X, CONJ, ADJ, VERB, PART, SPACE, CCONJ - - -TAG_MAP = { - "ADV": {POS: ADV}, - "NOUN": {POS: NOUN}, - "ADP": {POS: ADP}, - "PRON": {POS: PRON}, - "SCONJ": {POS: SCONJ}, - "PROPN": {POS: PROPN}, - "DET": {POS: DET}, - "SYM": {POS: SYM}, - "INTJ": {POS: INTJ}, - "PUNCT": {POS: PUNCT}, - "NUM": {POS: NUM}, - "AUX": {POS: AUX}, - "X": {POS: X}, - "CONJ": {POS: CONJ}, - "CCONJ": {POS: CCONJ}, - "ADJ": {POS: ADJ}, - "VERB": {POS: VERB}, - "PART": {POS: PART}, - "SP": {POS: SPACE}, -} diff 
--git a/spacy/lang/uk/tokenizer_exceptions.py b/spacy/lang/uk/tokenizer_exceptions.py index a94d77af3..94016fd52 100644 --- a/spacy/lang/uk/tokenizer_exceptions.py +++ b/spacy/lang/uk/tokenizer_exceptions.py @@ -1,27 +1,26 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import ORTH, LEMMA, POS, NORM, NOUN +from ..tokenizer_exceptions import BASE_EXCEPTIONS +from ...symbols import ORTH, NORM +from ...util import update_exc _exc = {} for exc_data in [ - {ORTH: "вул.", LEMMA: "вулиця", NORM: "вулиця", POS: NOUN}, - {ORTH: "ім.", LEMMA: "ім'я", NORM: "імені", POS: NOUN}, - {ORTH: "просп.", LEMMA: "проспект", NORM: "проспект", POS: NOUN}, - {ORTH: "бул.", LEMMA: "бульвар", NORM: "бульвар", POS: NOUN}, - {ORTH: "пров.", LEMMA: "провулок", NORM: "провулок", POS: NOUN}, - {ORTH: "пл.", LEMMA: "площа", NORM: "площа", POS: NOUN}, - {ORTH: "г.", LEMMA: "гора", NORM: "гора", POS: NOUN}, - {ORTH: "п.", LEMMA: "пан", NORM: "пан", POS: NOUN}, - {ORTH: "м.", LEMMA: "місто", NORM: "місто", POS: NOUN}, - {ORTH: "проф.", LEMMA: "професор", NORM: "професор", POS: NOUN}, - {ORTH: "акад.", LEMMA: "академік", NORM: "академік", POS: NOUN}, - {ORTH: "доц.", LEMMA: "доцент", NORM: "доцент", POS: NOUN}, - {ORTH: "оз.", LEMMA: "озеро", NORM: "озеро", POS: NOUN}, + {ORTH: "вул.", NORM: "вулиця"}, + {ORTH: "ім.", NORM: "імені"}, + {ORTH: "просп.", NORM: "проспект"}, + {ORTH: "бул.", NORM: "бульвар"}, + {ORTH: "пров.", NORM: "провулок"}, + {ORTH: "пл.", NORM: "площа"}, + {ORTH: "г.", NORM: "гора"}, + {ORTH: "п.", NORM: "пан"}, + {ORTH: "м.", NORM: "місто"}, + {ORTH: "проф.", NORM: "професор"}, + {ORTH: "акад.", NORM: "академік"}, + {ORTH: "доц.", NORM: "доцент"}, + {ORTH: "оз.", NORM: "озеро"}, ]: _exc[exc_data[ORTH]] = [exc_data] -TOKENIZER_EXCEPTIONS = _exc +TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc) diff --git a/spacy/lang/ur/__init__.py b/spacy/lang/ur/__init__.py index 6eea0cf3b..e3dee5805 100644 --- a/spacy/lang/ur/__init__.py +++ b/spacy/lang/ur/__init__.py @@ -1,25 +1,13 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS from .punctuation import TOKENIZER_SUFFIXES -from .tag_map import TAG_MAP - -from ..tokenizer_exceptions import BASE_EXCEPTIONS from ...language import Language -from ...attrs import LANG class UrduDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters.update(LEX_ATTRS) - lex_attr_getters[LANG] = lambda text: "ur" - - tokenizer_exceptions = BASE_EXCEPTIONS - tag_map = TAG_MAP - stop_words = STOP_WORDS suffixes = TOKENIZER_SUFFIXES + lex_attr_getters = LEX_ATTRS + stop_words = STOP_WORDS writing_system = {"direction": "rtl", "has_case": False, "has_letters": True} diff --git a/spacy/lang/ur/examples.py b/spacy/lang/ur/examples.py index f47c11600..e55b337be 100644 --- a/spacy/lang/ur/examples.py +++ b/spacy/lang/ur/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
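Note (not part of the diff): the Ukrainian tokenizer exceptions above now carry NORM instead of LEMMA and POS, so the expansions surface through Token.norm_ rather than Token.lemma_. A minimal check, assuming spaCy v3 is installed; the sample text is invented.

import spacy

nlp = spacy.blank("uk")
doc = nlp("вул. Хрещатик")
for token in doc:
    # For "вул." the exception above sets the norm to "вулиця"
    print(token.text, token.norm_)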
diff --git a/spacy/lang/ur/lex_attrs.py b/spacy/lang/ur/lex_attrs.py index 12d85be4b..e590ed3e3 100644 --- a/spacy/lang/ur/lex_attrs.py +++ b/spacy/lang/ur/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM # Source https://quizlet.com/4271889/1-100-urdu-number-wordsurdu-numerals-flash-cards/ diff --git a/spacy/lang/ur/punctuation.py b/spacy/lang/ur/punctuation.py index b8b1a1c83..5d35d0a25 100644 --- a/spacy/lang/ur/punctuation.py +++ b/spacy/lang/ur/punctuation.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ..punctuation import TOKENIZER_SUFFIXES diff --git a/spacy/lang/ur/stop_words.py b/spacy/lang/ur/stop_words.py index 73c159d5c..abfa36497 100644 --- a/spacy/lang/ur/stop_words.py +++ b/spacy/lang/ur/stop_words.py @@ -1,6 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - # Source: collected from different resource on internet STOP_WORDS = set( """ diff --git a/spacy/lang/ur/tag_map.py b/spacy/lang/ur/tag_map.py deleted file mode 100644 index aad548e9b..000000000 --- a/spacy/lang/ur/tag_map.py +++ /dev/null @@ -1,98 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON, AUX, SCONJ -from ...symbols import POS, PUNCT, SYM, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB - -TAG_MAP = { - "JJ-Ez": {POS: ADJ}, - "INJC": {POS: X}, - "QFC": {POS: DET}, - "UNK": {POS: X}, - "NSTC": {POS: ADV}, - "NST": {POS: ADV}, - "VMC": {POS: VERB}, - "PRPC": {POS: PRON}, - "RBC": {POS: ADV}, - "PSPC": {POS: ADP}, - "INJ": {POS: X}, - "JJZ": {POS: ADJ}, - "CCC": {POS: SCONJ}, - "NN-Ez": {POS: NOUN}, - "ECH": {POS: NOUN}, - "WQ": {POS: DET}, - "RDP": {POS: ADJ}, - "JJC": {POS: ADJ}, - "NEG": {POS: PART}, - "NNZ": {POS: NOUN}, - "QO": {POS: ADJ}, - "INTFC": {POS: ADV}, - "INTF": {POS: ADV}, - "NFC": {POS: ADP}, - "QCC": {POS: NUM}, - "QC": {POS: NUM}, - "QF": {POS: DET}, - "VAUX": {POS: AUX}, - "VM": {POS: VERB}, - "DEM": {POS: DET}, - "NNPC": {POS: PROPN}, - "NNC": {POS: NOUN}, - "PSP": {POS: ADP}, - ".": {POS: PUNCT}, - ",": {POS: PUNCT}, - "-LRB-": {POS: PUNCT}, - "-RRB-": {POS: PUNCT}, - "``": {POS: PUNCT}, - '""': {POS: PUNCT}, - "''": {POS: PUNCT}, - ":": {POS: PUNCT}, - "$": {POS: SYM}, - "#": {POS: SYM}, - "AFX": {POS: ADJ}, - "CC": {POS: CCONJ}, - "CD": {POS: NUM}, - "DT": {POS: DET}, - "EX": {POS: ADV}, - "FW": {POS: X}, - "HYPH": {POS: PUNCT}, - "IN": {POS: ADP}, - "JJ": {POS: ADJ}, - "JJR": {POS: ADJ}, - "JJS": {POS: ADJ}, - "LS": {POS: PUNCT}, - "MD": {POS: VERB}, - "NIL": {POS: ""}, - "NN": {POS: NOUN}, - "NNP": {POS: PROPN}, - "NNPS": {POS: PROPN}, - "NNS": {POS: NOUN}, - "PDT": {POS: ADJ}, - "POS": {POS: PART}, - "PRP": {POS: PRON}, - "PRP$": {POS: ADJ}, - "RB": {POS: ADV}, - "RBR": {POS: ADV}, - "RBS": {POS: ADV}, - "RP": {POS: PART}, - "SP": {POS: SPACE}, - "SYM": {POS: SYM}, - "TO": {POS: PART}, - "UH": {POS: INTJ}, - "VB": {POS: VERB}, - "VBD": {POS: VERB}, - "VBG": {POS: VERB}, - "VBN": {POS: VERB}, - "VBP": {POS: VERB}, - "VBZ": {POS: VERB}, - "WDT": {POS: ADJ}, - "WP": {POS: NOUN}, - "WP$": {POS: ADJ}, - "WRB": {POS: ADV}, - "ADD": {POS: X}, - "NFP": {POS: PUNCT}, - "GW": {POS: X}, - "XX": {POS: X}, - "BES": {POS: VERB}, - "HVS": {POS: VERB}, - "_SP": {POS: SPACE}, -} diff --git a/spacy/lang/vi/__init__.py b/spacy/lang/vi/__init__.py index 425f84e3d..1328de495 100644 --- a/spacy/lang/vi/__init__.py +++ b/spacy/lang/vi/__init__.py @@ -1,41 +1,46 @@ -# coding: utf8 -from __future__ import unicode_literals 
- -from ...attrs import LANG, NORM -from ..norm_exceptions import BASE_NORMS +from .stop_words import STOP_WORDS +from .lex_attrs import LEX_ATTRS from ...language import Language from ...tokens import Doc -from .stop_words import STOP_WORDS -from ...util import add_lookups -from .lex_attrs import LEX_ATTRS +from ...util import DummyTokenizer, registry, load_config_from_str -class VietnameseDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = lambda text: "vi" # for pickling - lex_attr_getters[NORM] = add_lookups( - Language.Defaults.lex_attr_getters[NORM], BASE_NORMS - ) - lex_attr_getters.update(LEX_ATTRS) - stop_words = STOP_WORDS - use_pyvi = True +DEFAULT_CONFIG = """ +[nlp] + +[nlp.tokenizer] +@tokenizers = "spacy.vi.VietnameseTokenizer" +use_pyvi = true +""" -class Vietnamese(Language): - lang = "vi" - Defaults = VietnameseDefaults # override defaults +@registry.tokenizers("spacy.vi.VietnameseTokenizer") +def create_vietnamese_tokenizer(use_pyvi: bool = True): + def vietnamese_tokenizer_factory(nlp): + return VietnameseTokenizer(nlp, use_pyvi=use_pyvi) - def make_doc(self, text): - if self.Defaults.use_pyvi: + return vietnamese_tokenizer_factory + + +class VietnameseTokenizer(DummyTokenizer): + def __init__(self, nlp: Language, use_pyvi: bool = False): + self.vocab = nlp.vocab + self.use_pyvi = use_pyvi + if self.use_pyvi: try: from pyvi import ViTokenizer + + self.ViTokenizer = ViTokenizer except ImportError: msg = ( - "Pyvi not installed. Either set Vietnamese.use_pyvi = False, " + "Pyvi not installed. Either set use_pyvi = False, " "or install it https://pypi.python.org/pypi/pyvi" ) - raise ImportError(msg) - words, spaces = ViTokenizer.spacy_tokenize(text) + raise ImportError(msg) from None + + def __call__(self, text: str) -> Doc: + if self.use_pyvi: + words, spaces = self.ViTokenizer.spacy_tokenize(text) return Doc(self.vocab, words=words, spaces=spaces) else: words = [] @@ -47,4 +52,15 @@ class Vietnamese(Language): return Doc(self.vocab, words=words, spaces=spaces) +class VietnameseDefaults(Language.Defaults): + config = load_config_from_str(DEFAULT_CONFIG) + lex_attr_getters = LEX_ATTRS + stop_words = STOP_WORDS + + +class Vietnamese(Language): + lang = "vi" + Defaults = VietnameseDefaults + + __all__ = ["Vietnamese"] diff --git a/spacy/lang/vi/lex_attrs.py b/spacy/lang/vi/lex_attrs.py index b6cd1188a..b3dbf2192 100644 --- a/spacy/lang/vi/lex_attrs.py +++ b/spacy/lang/vi/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...attrs import LIKE_NUM diff --git a/spacy/lang/vi/stop_words.py b/spacy/lang/vi/stop_words.py index 13284dc59..1d2ecdf8d 100644 --- a/spacy/lang/vi/stop_words.py +++ b/spacy/lang/vi/stop_words.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - # Source: https://github.com/stopwords/vietnamese-stopwords STOP_WORDS = set( """ diff --git a/spacy/lang/vi/tag_map.py b/spacy/lang/vi/tag_map.py deleted file mode 100644 index 472e772ef..000000000 --- a/spacy/lang/vi/tag_map.py +++ /dev/null @@ -1,28 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ..symbols import POS, ADV, NOUN, ADP, PRON, SCONJ, PROPN, DET, SYM, INTJ -from ..symbols import PUNCT, NUM, AUX, X, CONJ, ADJ, VERB, PART, SPACE, CCONJ - - -TAG_MAP = { - "ADV": {POS: ADV}, - "NOUN": {POS: NOUN}, - "ADP": {POS: ADP}, - "PRON": {POS: PRON}, - "SCONJ": {POS: SCONJ}, - "PROPN": {POS: PROPN}, - "DET": {POS: DET}, - "SYM": {POS: SYM}, - "INTJ": {POS: 
INTJ}, - "PUNCT": {POS: PUNCT}, - "NUM": {POS: NUM}, - "AUX": {POS: AUX}, - "X": {POS: X}, - "CONJ": {POS: CONJ}, - "CCONJ": {POS: CCONJ}, - "ADJ": {POS: ADJ}, - "VERB": {POS: VERB}, - "PART": {POS: PART}, - "SP": {POS: SPACE}, -} diff --git a/spacy/lang/xx/__init__.py b/spacy/lang/xx/__init__.py index 66d8c7917..aff8403ff 100644 --- a/spacy/lang/xx/__init__.py +++ b/spacy/lang/xx/__init__.py @@ -1,21 +1,4 @@ -# coding: utf8 -from __future__ import unicode_literals - - -from ..tokenizer_exceptions import BASE_EXCEPTIONS -from ..norm_exceptions import BASE_NORMS from ...language import Language -from ...attrs import LANG, NORM -from ...util import update_exc, add_lookups - - -class MultiLanguageDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = lambda text: "xx" - lex_attr_getters[NORM] = add_lookups( - Language.Defaults.lex_attr_getters[NORM], BASE_NORMS - ) - tokenizer_exceptions = update_exc(BASE_EXCEPTIONS) class MultiLanguage(Language): @@ -24,7 +7,6 @@ class MultiLanguage(Language): """ lang = "xx" - Defaults = MultiLanguageDefaults __all__ = ["MultiLanguage"] diff --git a/spacy/lang/xx/examples.py b/spacy/lang/xx/examples.py index 38cd5e0cd..8d63c3c20 100644 --- a/spacy/lang/xx/examples.py +++ b/spacy/lang/xx/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/yo/__init__.py b/spacy/lang/yo/__init__.py index f227203cc..df6bb7d4a 100644 --- a/spacy/lang/yo/__init__.py +++ b/spacy/lang/yo/__init__.py @@ -1,19 +1,11 @@ -# coding: utf8 -from __future__ import unicode_literals - from .stop_words import STOP_WORDS from .lex_attrs import LEX_ATTRS -from ..tokenizer_exceptions import BASE_EXCEPTIONS from ...language import Language -from ...attrs import LANG class YorubaDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters.update(LEX_ATTRS) - lex_attr_getters[LANG] = lambda text: "yo" + lex_attr_getters = LEX_ATTRS stop_words = STOP_WORDS - tokenizer_exceptions = BASE_EXCEPTIONS class Yoruba(Language): diff --git a/spacy/lang/yo/examples.py b/spacy/lang/yo/examples.py index 170ddc803..0a610f125 100644 --- a/spacy/lang/yo/examples.py +++ b/spacy/lang/yo/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. diff --git a/spacy/lang/yo/lex_attrs.py b/spacy/lang/yo/lex_attrs.py index a9f1b85f6..ead68ced2 100644 --- a/spacy/lang/yo/lex_attrs.py +++ b/spacy/lang/yo/lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import unicodedata from ...attrs import LIKE_NUM diff --git a/spacy/lang/yo/stop_words.py b/spacy/lang/yo/stop_words.py index 53d382ad3..5c7a7fc45 100644 --- a/spacy/lang/yo/stop_words.py +++ b/spacy/lang/yo/stop_words.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - # stop words as whitespace-separated list. 
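Note (not part of the diff): the pyvi toggle for the Vietnamese tokenizer above moves from the class attribute Vietnamese.use_pyvi into the tokenizer block of the nlp config. A minimal sketch of overriding it, assuming spaCy v3; spacy.blank's config argument is a standard library call not shown in this diff, the sample sentence is invented, and pyvi is only needed when use_pyvi stays true.

import spacy

config = {
    "nlp": {
        "tokenizer": {
            "@tokenizers": "spacy.vi.VietnameseTokenizer",
            "use_pyvi": False,  # fall back to the non-pyvi branch shown above
        }
    }
}
nlp = spacy.blank("vi", config=config)
doc = nlp("Xin chào thế giới")
print([token.text for token in doc])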
# Source: https://raw.githubusercontent.com/dohliam/more-stoplists/master/yo/yo.txt diff --git a/spacy/lang/zh/__init__.py b/spacy/lang/zh/__init__.py index 9f8a82c10..30560ed0d 100644 --- a/spacy/lang/zh/__init__.py +++ b/spacy/lang/zh/__init__.py @@ -1,140 +1,150 @@ -# coding: utf8 -from __future__ import unicode_literals - +from typing import Optional, List, Dict, Any, Callable, Iterable +from enum import Enum import tempfile import srsly +import warnings from pathlib import Path -from collections import OrderedDict -from ...attrs import LANG + +from ...errors import Warnings, Errors from ...language import Language +from ...scorer import Scorer from ...tokens import Doc -from ...util import DummyTokenizer -from ..tokenizer_exceptions import BASE_EXCEPTIONS +from ...training import validate_examples, Example +from ...util import DummyTokenizer, registry, load_config_from_str from .lex_attrs import LEX_ATTRS from .stop_words import STOP_WORDS -from .tag_map import TAG_MAP from ... import util -_PKUSEG_INSTALL_MSG = "install it with `pip install pkuseg==0.0.25` or from https://github.com/lancopku/pkuseg-python" +# fmt: off +_PKUSEG_INSTALL_MSG = "install spacy-pkuseg with `pip install spacy-pkuseg==0.0.26`" +# fmt: on + +DEFAULT_CONFIG = """ +[nlp] + +[nlp.tokenizer] +@tokenizers = "spacy.zh.ChineseTokenizer" +segmenter = "char" + +[initialize] + +[initialize.tokenizer] +pkuseg_model = null +pkuseg_user_dict = "default" +""" -def try_jieba_import(use_jieba): - try: - import jieba +class Segmenter(str, Enum): + char = "char" + jieba = "jieba" + pkuseg = "pkuseg" - # segment a short text to have jieba initialize its cache in advance - list(jieba.cut("作为", cut_all=False)) - - return jieba - except ImportError: - if use_jieba: - msg = ( - "Jieba not installed. Either set the default to False with " - "`from spacy.lang.zh import ChineseDefaults; ChineseDefaults.use_jieba = False`, " - "or install it with `pip install jieba` or from " - "https://github.com/fxsjy/jieba" - ) - raise ImportError(msg) + @classmethod + def values(cls): + return list(cls.__members__.keys()) -def try_pkuseg_import(use_pkuseg, pkuseg_model, pkuseg_user_dict): - try: - import pkuseg +@registry.tokenizers("spacy.zh.ChineseTokenizer") +def create_chinese_tokenizer(segmenter: Segmenter = Segmenter.char): + def chinese_tokenizer_factory(nlp): + return ChineseTokenizer(nlp, segmenter=segmenter) - if pkuseg_model: - return pkuseg.pkuseg(pkuseg_model, pkuseg_user_dict) - elif use_pkuseg: - msg = ( - "Chinese.use_pkuseg is True but no pkuseg model was specified. " - "Please provide the name of a pretrained model " - "or the path to a model with " - '`Chinese(meta={"tokenizer": {"config": {"pkuseg_model": name_or_path}}}).' - ) - raise ValueError(msg) - except ImportError: - if use_pkuseg: - msg = ( - "pkuseg not installed. 
Either set Chinese.use_pkuseg = False, " - "or " + _PKUSEG_INSTALL_MSG - ) - raise ImportError(msg) - except FileNotFoundError: - if use_pkuseg: - msg = "Unable to load pkuseg model from: " + pkuseg_model - raise FileNotFoundError(msg) + return chinese_tokenizer_factory class ChineseTokenizer(DummyTokenizer): - def __init__(self, cls, nlp=None, config={}): - self.use_jieba = config.get("use_jieba", cls.use_jieba) - self.use_pkuseg = config.get("use_pkuseg", cls.use_pkuseg) - self.require_pkuseg = config.get("require_pkuseg", False) - self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp) - self.jieba_seg = try_jieba_import(self.use_jieba) - self.pkuseg_seg = try_pkuseg_import( - self.use_pkuseg, - pkuseg_model=config.get("pkuseg_model", None), - pkuseg_user_dict=config.get("pkuseg_user_dict", "default"), - ) - # remove relevant settings from config so they're not also saved in - # Language.meta - for key in ["use_jieba", "use_pkuseg", "require_pkuseg", "pkuseg_model"]: - if key in config: - del config[key] - self.tokenizer = Language.Defaults().create_tokenizer(nlp) + def __init__(self, nlp: Language, segmenter: Segmenter = Segmenter.char): + self.vocab = nlp.vocab + if isinstance(segmenter, Segmenter): + segmenter = segmenter.value + self.segmenter = segmenter + self.pkuseg_seg = None + self.jieba_seg = None + if segmenter not in Segmenter.values(): + warn_msg = Warnings.W103.format( + lang="Chinese", + segmenter=segmenter, + supported=", ".join(Segmenter.values()), + default="'char' (character segmentation)", + ) + warnings.warn(warn_msg) + self.segmenter = Segmenter.char + if segmenter == Segmenter.jieba: + self.jieba_seg = try_jieba_import() - def __call__(self, text): - use_jieba = self.use_jieba - use_pkuseg = self.use_pkuseg - if self.require_pkuseg: - use_jieba = False - use_pkuseg = True - if use_jieba: + def initialize( + self, + get_examples: Optional[Callable[[], Iterable[Example]]] = None, + *, + nlp: Optional[Language] = None, + pkuseg_model: Optional[str] = None, + pkuseg_user_dict: Optional[str] = "default", + ): + if self.segmenter == Segmenter.pkuseg: + if pkuseg_user_dict is None: + pkuseg_user_dict = pkuseg_model + self.pkuseg_seg = try_pkuseg_import( + pkuseg_model=pkuseg_model, pkuseg_user_dict=pkuseg_user_dict + ) + + def __call__(self, text: str) -> Doc: + if self.segmenter == Segmenter.jieba: words = list([x for x in self.jieba_seg.cut(text, cut_all=False) if x]) (words, spaces) = util.get_words_and_spaces(words, text) return Doc(self.vocab, words=words, spaces=spaces) - elif use_pkuseg: + elif self.segmenter == Segmenter.pkuseg: + if self.pkuseg_seg is None: + raise ValueError(Errors.E1000) words = self.pkuseg_seg.cut(text) (words, spaces) = util.get_words_and_spaces(words, text) return Doc(self.vocab, words=words, spaces=spaces) - else: - # split into individual characters - words = list(text) - (words, spaces) = util.get_words_and_spaces(words, text) - return Doc(self.vocab, words=words, spaces=spaces) - def pkuseg_update_user_dict(self, words, reset=False): - if self.pkuseg_seg: + # warn if segmenter setting is not the only remaining option "char" + if self.segmenter != Segmenter.char: + warn_msg = Warnings.W103.format( + lang="Chinese", + segmenter=self.segmenter, + supported=", ".join(Segmenter.values()), + default="'char' (character segmentation)", + ) + warnings.warn(warn_msg) + + # split into individual characters + words = list(text) + (words, spaces) = util.get_words_and_spaces(words, text) + return Doc(self.vocab, words=words, 
spaces=spaces) + + def pkuseg_update_user_dict(self, words: List[str], reset: bool = False): + if self.segmenter == Segmenter.pkuseg: if reset: try: - import pkuseg + import spacy_pkuseg - self.pkuseg_seg.preprocesser = pkuseg.Preprocesser(None) + self.pkuseg_seg.preprocesser = spacy_pkuseg.Preprocesser(None) except ImportError: - if self.use_pkuseg: - msg = ( - "pkuseg not installed: unable to reset pkuseg " - "user dict. Please " + _PKUSEG_INSTALL_MSG - ) - raise ImportError(msg) + msg = ( + "spacy_pkuseg not installed: unable to reset pkuseg " + "user dict. Please " + _PKUSEG_INSTALL_MSG + ) + raise ImportError(msg) from None for word in words: self.pkuseg_seg.preprocesser.insert(word.strip(), "") + else: + warn_msg = Warnings.W104.format(target="pkuseg", current=self.segmenter) + warnings.warn(warn_msg) - def _get_config(self): - config = OrderedDict( - ( - ("use_jieba", self.use_jieba), - ("use_pkuseg", self.use_pkuseg), - ("require_pkuseg", self.require_pkuseg), - ) - ) - return config + def score(self, examples): + validate_examples(examples, "ChineseTokenizer.score") + return Scorer.score_tokenization(examples) - def _set_config(self, config={}): - self.use_jieba = config.get("use_jieba", False) - self.use_pkuseg = config.get("use_pkuseg", False) - self.require_pkuseg = config.get("require_pkuseg", False) + def _get_config(self) -> Dict[str, Any]: + return { + "segmenter": self.segmenter, + } + + def _set_config(self, config: Dict[str, Any] = {}) -> None: + self.segmenter = config.get("segmenter", Segmenter.char) def to_bytes(self, **kwargs): pkuseg_features_b = b"" @@ -145,7 +155,7 @@ class ChineseTokenizer(DummyTokenizer): self.pkuseg_seg.feature_extractor.save(tempdir) self.pkuseg_seg.model.save(tempdir) tempdir = Path(tempdir) - with open(tempdir / "features.pkl", "rb") as fileh: + with open(tempdir / "features.msgpack", "rb") as fileh: pkuseg_features_b = fileh.read() with open(tempdir / "weights.npz", "rb") as fileh: pkuseg_weights_b = fileh.read() @@ -155,17 +165,12 @@ class ChineseTokenizer(DummyTokenizer): sorted(list(self.pkuseg_seg.postprocesser.common_words)), sorted(list(self.pkuseg_seg.postprocesser.other_words)), ) - serializers = OrderedDict( - ( - ("cfg", lambda: srsly.json_dumps(self._get_config())), - ("pkuseg_features", lambda: pkuseg_features_b), - ("pkuseg_weights", lambda: pkuseg_weights_b), - ( - "pkuseg_processors", - lambda: srsly.msgpack_dumps(pkuseg_processors_data), - ), - ) - ) + serializers = { + "cfg": lambda: srsly.json_dumps(self._get_config()), + "pkuseg_features": lambda: pkuseg_features_b, + "pkuseg_weights": lambda: pkuseg_weights_b, + "pkuseg_processors": lambda: srsly.msgpack_dumps(pkuseg_processors_data), + } return util.to_bytes(serializers, []) def from_bytes(self, data, **kwargs): @@ -180,35 +185,33 @@ class ChineseTokenizer(DummyTokenizer): def deserialize_pkuseg_processors(b): pkuseg_data["processors_data"] = srsly.msgpack_loads(b) - deserializers = OrderedDict( - ( - ("cfg", lambda b: self._set_config(srsly.json_loads(b))), - ("pkuseg_features", deserialize_pkuseg_features), - ("pkuseg_weights", deserialize_pkuseg_weights), - ("pkuseg_processors", deserialize_pkuseg_processors), - ) - ) + deserializers = { + "cfg": lambda b: self._set_config(srsly.json_loads(b)), + "pkuseg_features": deserialize_pkuseg_features, + "pkuseg_weights": deserialize_pkuseg_weights, + "pkuseg_processors": deserialize_pkuseg_processors, + } util.from_bytes(data, deserializers, []) if pkuseg_data["features_b"] and pkuseg_data["weights_b"]: with 
tempfile.TemporaryDirectory() as tempdir: tempdir = Path(tempdir) - with open(tempdir / "features.pkl", "wb") as fileh: + with open(tempdir / "features.msgpack", "wb") as fileh: fileh.write(pkuseg_data["features_b"]) with open(tempdir / "weights.npz", "wb") as fileh: fileh.write(pkuseg_data["weights_b"]) try: - import pkuseg + import spacy_pkuseg except ImportError: raise ImportError( - "pkuseg not installed. To use this model, " + "spacy-pkuseg not installed. To use this model, " + _PKUSEG_INSTALL_MSG - ) - self.pkuseg_seg = pkuseg.pkuseg(str(tempdir)) + ) from None + self.pkuseg_seg = spacy_pkuseg.pkuseg(str(tempdir)) if pkuseg_data["processors_data"]: processors_data = pkuseg_data["processors_data"] (user_dict, do_process, common_words, other_words) = processors_data - self.pkuseg_seg.preprocesser = pkuseg.Preprocesser(user_dict) + self.pkuseg_seg.preprocesser = spacy_pkuseg.Preprocesser(user_dict) self.pkuseg_seg.postprocesser.do_process = do_process self.pkuseg_seg.postprocesser.common_words = set(common_words) self.pkuseg_seg.postprocesser.other_words = set(other_words) @@ -235,13 +238,11 @@ class ChineseTokenizer(DummyTokenizer): ) srsly.write_msgpack(path, data) - serializers = OrderedDict( - ( - ("cfg", lambda p: srsly.write_json(p, self._get_config())), - ("pkuseg_model", lambda p: save_pkuseg_model(p)), - ("pkuseg_processors", lambda p: save_pkuseg_processors(p)), - ) - ) + serializers = { + "cfg": lambda p: srsly.write_json(p, self._get_config()), + "pkuseg_model": lambda p: save_pkuseg_model(p), + "pkuseg_processors": lambda p: save_pkuseg_processors(p), + } return util.to_disk(path, serializers, []) def from_disk(self, path, **kwargs): @@ -249,62 +250,78 @@ class ChineseTokenizer(DummyTokenizer): def load_pkuseg_model(path): try: - import pkuseg + import spacy_pkuseg except ImportError: - if self.use_pkuseg: + if self.segmenter == Segmenter.pkuseg: raise ImportError( - "pkuseg not installed. To use this model, " + "spacy-pkuseg not installed. 
To use this model, " + _PKUSEG_INSTALL_MSG - ) + ) from None if path.exists(): - self.pkuseg_seg = pkuseg.pkuseg(path) + self.pkuseg_seg = spacy_pkuseg.pkuseg(path) def load_pkuseg_processors(path): try: - import pkuseg + import spacy_pkuseg except ImportError: - if self.use_pkuseg: - raise ImportError(self._pkuseg_install_msg) - if self.pkuseg_seg: + if self.segmenter == Segmenter.pkuseg: + raise ImportError(self._pkuseg_install_msg) from None + if self.segmenter == Segmenter.pkuseg: data = srsly.read_msgpack(path) (user_dict, do_process, common_words, other_words) = data - self.pkuseg_seg.preprocesser = pkuseg.Preprocesser(user_dict) + self.pkuseg_seg.preprocesser = spacy_pkuseg.Preprocesser(user_dict) self.pkuseg_seg.postprocesser.do_process = do_process self.pkuseg_seg.postprocesser.common_words = set(common_words) self.pkuseg_seg.postprocesser.other_words = set(other_words) - serializers = OrderedDict( - ( - ("cfg", lambda p: self._set_config(srsly.read_json(p))), - ("pkuseg_model", lambda p: load_pkuseg_model(p)), - ("pkuseg_processors", lambda p: load_pkuseg_processors(p)), - ) - ) + serializers = { + "cfg": lambda p: self._set_config(srsly.read_json(p)), + "pkuseg_model": lambda p: load_pkuseg_model(p), + "pkuseg_processors": lambda p: load_pkuseg_processors(p), + } util.from_disk(path, serializers, []) class ChineseDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters.update(LEX_ATTRS) - lex_attr_getters[LANG] = lambda text: "zh" - tokenizer_exceptions = BASE_EXCEPTIONS + config = load_config_from_str(DEFAULT_CONFIG) + lex_attr_getters = LEX_ATTRS stop_words = STOP_WORDS - tag_map = TAG_MAP writing_system = {"direction": "ltr", "has_case": False, "has_letters": False} - use_jieba = True - use_pkuseg = False - - @classmethod - def create_tokenizer(cls, nlp=None, config={}): - return ChineseTokenizer(cls, nlp, config=config) class Chinese(Language): lang = "zh" - Defaults = ChineseDefaults # override defaults + Defaults = ChineseDefaults - def make_doc(self, text): - return self.tokenizer(text) + +def try_jieba_import() -> None: + try: + import jieba + + # segment a short text to have jieba initialize its cache in advance + list(jieba.cut("作为", cut_all=False)) + + return jieba + except ImportError: + msg = ( + "Jieba not installed. To use jieba, install it with `pip " + " install jieba` or from https://github.com/fxsjy/jieba" + ) + raise ImportError(msg) from None + + +def try_pkuseg_import(pkuseg_model: str, pkuseg_user_dict: str) -> None: + try: + import spacy_pkuseg + + except ImportError: + msg = "spacy-pkuseg not installed. To use pkuseg, " + _PKUSEG_INSTALL_MSG + raise ImportError(msg) from None + try: + return spacy_pkuseg.pkuseg(pkuseg_model, pkuseg_user_dict) + except FileNotFoundError: + msg = "Unable to load pkuseg model from: " + pkuseg_model + raise FileNotFoundError(msg) from None def _get_pkuseg_trie_data(node, path=""): diff --git a/spacy/lang/zh/examples.py b/spacy/lang/zh/examples.py index b28215741..8be1336d2 100644 --- a/spacy/lang/zh/examples.py +++ b/spacy/lang/zh/examples.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - """ Example sentences to test spaCy and its language models. 
diff --git a/spacy/lang/zh/lex_attrs.py b/spacy/lang/zh/lex_attrs.py index 0b29c226e..08c8e3160 100644 --- a/spacy/lang/zh/lex_attrs.py +++ b/spacy/lang/zh/lex_attrs.py @@ -1,8 +1,8 @@ -# coding: utf8 -from __future__ import unicode_literals import re + from ...attrs import LIKE_NUM + _single_num_words = [ "〇", "一", diff --git a/spacy/lang/zh/stop_words.py b/spacy/lang/zh/stop_words.py index 0af4c1859..42ae4a1de 100644 --- a/spacy/lang/zh/stop_words.py +++ b/spacy/lang/zh/stop_words.py @@ -1,7 +1,3 @@ -# encoding: utf8 -from __future__ import unicode_literals - - # stop words as whitespace-separated list # Chinese stop words,maybe not enough STOP_WORDS = set( diff --git a/spacy/lang/zh/tag_map.py b/spacy/lang/zh/tag_map.py deleted file mode 100644 index f9b5389ac..000000000 --- a/spacy/lang/zh/tag_map.py +++ /dev/null @@ -1,49 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...symbols import POS, PUNCT, ADJ, SCONJ, CCONJ, NUM, DET, ADV, ADP, X -from ...symbols import NOUN, PART, INTJ, PRON, VERB, SPACE, PROPN - -# The Chinese part-of-speech tagger uses the OntoNotes 5 version of the Penn -# Treebank tag set. We also map the tags to the simpler Universal Dependencies -# v2 tag set. - -TAG_MAP = { - "AS": {POS: PART}, - "DEC": {POS: PART}, - "DEG": {POS: PART}, - "DER": {POS: PART}, - "DEV": {POS: PART}, - "ETC": {POS: PART}, - "LC": {POS: PART}, - "MSP": {POS: PART}, - "SP": {POS: PART}, - "BA": {POS: X}, - "FW": {POS: X}, - "IJ": {POS: INTJ}, - "LB": {POS: X}, - "ON": {POS: X}, - "SB": {POS: X}, - "X": {POS: X}, - "URL": {POS: X}, - "INF": {POS: X}, - "NN": {POS: NOUN}, - "NR": {POS: PROPN}, - "NT": {POS: NOUN}, - "VA": {POS: VERB}, - "VC": {POS: VERB}, - "VE": {POS: VERB}, - "VV": {POS: VERB}, - "CD": {POS: NUM}, - "M": {POS: NUM}, - "OD": {POS: NUM}, - "DT": {POS: DET}, - "CC": {POS: CCONJ}, - "CS": {POS: SCONJ}, - "AD": {POS: ADV}, - "JJ": {POS: ADJ}, - "P": {POS: ADP}, - "PN": {POS: PRON}, - "PU": {POS: PUNCT}, - "_SP": {POS: SPACE}, -} diff --git a/spacy/language.py b/spacy/language.py index ee46da3c1..dd790e85f 100644 --- a/spacy/language.py +++ b/spacy/language.py @@ -1,207 +1,205 @@ -# coding: utf8 -from __future__ import absolute_import, unicode_literals - +from typing import Optional, Any, Dict, Callable, Iterable, Union, List, Pattern +from typing import Tuple +from dataclasses import dataclass import random import itertools -import warnings -from thinc.extra import load_nlp -import weakref import functools -from collections import OrderedDict from contextlib import contextmanager -from copy import copy, deepcopy -from thinc.neural import Model +from copy import deepcopy +from pathlib import Path +import warnings +from thinc.api import Model, get_current_ops, Config, Optimizer import srsly import multiprocessing as mp from itertools import chain, cycle +from timeit import default_timer as timer -from .tokenizer import Tokenizer from .tokens.underscore import Underscore -from .vocab import Vocab -from .lemmatizer import Lemmatizer -from .lookups import Lookups -from .analysis import analyze_pipes, analyze_all_pipes, validate_attrs -from .compat import izip, basestring_, is_python2, class_types -from .gold import GoldParse +from .vocab import Vocab, create_vocab +from .pipe_analysis import validate_attrs, analyze_pipes, print_pipe_analysis +from .training import Example, validate_examples +from .training.initialize import init_vocab, init_tok2vec from .scorer import Scorer -from ._ml import link_vectors_to_models, create_default_optimizer -from .attrs 
import IS_STOP, LANG, NORM +from .util import registry, SimpleFrozenList, _pipe +from .util import SimpleFrozenDict, combine_score_weights, CONFIG_SECTION_ORDER +from .lang.tokenizer_exceptions import URL_MATCH, BASE_EXCEPTIONS from .lang.punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES from .lang.punctuation import TOKENIZER_INFIXES -from .lang.tokenizer_exceptions import TOKEN_MATCH, URL_MATCH -from .lang.norm_exceptions import BASE_NORMS -from .lang.tag_map import TAG_MAP from .tokens import Doc -from .lang.lex_attrs import LEX_ATTRS, is_stop +from .tokenizer import Tokenizer from .errors import Errors, Warnings +from .schemas import ConfigSchema, ConfigSchemaNlp, ConfigSchemaInit +from .schemas import ConfigSchemaPretrain, validate_init_settings from .git_info import GIT_VERSION from . import util from . import about +from .lookups import load_lookups -ENABLE_PIPELINE_ANALYSIS = False +# This is the base config will all settings (training etc.) +DEFAULT_CONFIG_PATH = Path(__file__).parent / "default_config.cfg" +DEFAULT_CONFIG = util.load_config(DEFAULT_CONFIG_PATH) +# This is the base config for the [pretraining] block and currently not included +# in the main config and only added via the 'init fill-config' command +DEFAULT_CONFIG_PRETRAIN_PATH = Path(__file__).parent / "default_config_pretraining.cfg" -class BaseDefaults(object): - @classmethod - def create_lemmatizer(cls, nlp=None, lookups=None): - if lookups is None: - lookups = cls.create_lookups(nlp=nlp) - return Lemmatizer(lookups=lookups, is_base_form=cls.is_base_form) +class BaseDefaults: + """Language data defaults, available via Language.Defaults. Can be + overwritten by language subclasses by defining their own subclasses of + Language.Defaults. + """ - @classmethod - def create_lookups(cls, nlp=None): - root = util.get_module_path(cls) - filenames = {name: root / filename for name, filename in cls.resources} - if LANG in cls.lex_attr_getters: - lang = cls.lex_attr_getters[LANG](None) - if lang in util.registry.lookups: - filenames.update(util.registry.lookups.get(lang)) - lookups = Lookups() - for name, filename in filenames.items(): - data = util.load_language_data(filename) - lookups.add_table(name, data) - return lookups + config: Config = Config(section_order=CONFIG_SECTION_ORDER) + tokenizer_exceptions: Dict[str, List[dict]] = BASE_EXCEPTIONS + prefixes: Optional[List[Union[str, Pattern]]] = TOKENIZER_PREFIXES + suffixes: Optional[List[Union[str, Pattern]]] = TOKENIZER_SUFFIXES + infixes: Optional[List[Union[str, Pattern]]] = TOKENIZER_INFIXES + token_match: Optional[Pattern] = None + url_match: Optional[Pattern] = URL_MATCH + syntax_iterators: Dict[str, Callable] = {} + lex_attr_getters: Dict[int, Callable[[str], Any]] = {} + stop_words = set() + writing_system = {"direction": "ltr", "has_case": True, "has_letters": True} - @classmethod - def create_vocab(cls, nlp=None): - lookups = cls.create_lookups(nlp) - lemmatizer = cls.create_lemmatizer(nlp, lookups=lookups) - lex_attr_getters = dict(cls.lex_attr_getters) - # This is messy, but it's the minimal working fix to Issue #639. 
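Since the old `create_*` classmethods are gone, language customization now happens by overriding these class attributes on a `Defaults` subclass. A rough sketch, with the subclass names and stop words invented for illustration:

# Rough sketch: overriding language defaults via class attributes.
# CustomEnglishDefaults / CustomEnglish and the stop words are illustrative only.
from spacy.lang.en import English


class CustomEnglishDefaults(English.Defaults):
    stop_words = {"custom", "stop"}


class CustomEnglish(English):
    lang = "en"
    Defaults = CustomEnglishDefaults


nlp = CustomEnglish()
print(nlp.Defaults.stop_words)
print(nlp.vocab["custom"].is_stop)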
- lex_attr_getters[IS_STOP] = functools.partial(is_stop, stops=cls.stop_words) - vocab = Vocab( - lex_attr_getters=lex_attr_getters, - tag_map=cls.tag_map, - lemmatizer=lemmatizer, - lookups=lookups, - ) - vocab.lex_attr_getters[NORM] = util.add_lookups( - vocab.lex_attr_getters.get(NORM, LEX_ATTRS[NORM]), - BASE_NORMS, - vocab.lookups.get_table("lexeme_norm"), - ) - for tag_str, exc in cls.morph_rules.items(): - for orth_str, attrs in exc.items(): - vocab.morphology.add_special_case(tag_str, orth_str, attrs) - return vocab - @classmethod - def create_tokenizer(cls, nlp=None): - rules = cls.tokenizer_exceptions - token_match = cls.token_match - url_match = cls.url_match - prefix_search = ( - util.compile_prefix_regex(cls.prefixes).search if cls.prefixes else None - ) - suffix_search = ( - util.compile_suffix_regex(cls.suffixes).search if cls.suffixes else None - ) - infix_finditer = ( - util.compile_infix_regex(cls.infixes).finditer if cls.infixes else None - ) - vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp) +@registry.tokenizers("spacy.Tokenizer.v1") +def create_tokenizer() -> Callable[["Language"], Tokenizer]: + """Registered function to create a tokenizer. Returns a factory that takes + the nlp object and returns a Tokenizer instance using the language detaults. + """ + + def tokenizer_factory(nlp: "Language") -> Tokenizer: + prefixes = nlp.Defaults.prefixes + suffixes = nlp.Defaults.suffixes + infixes = nlp.Defaults.infixes + prefix_search = util.compile_prefix_regex(prefixes).search if prefixes else None + suffix_search = util.compile_suffix_regex(suffixes).search if suffixes else None + infix_finditer = util.compile_infix_regex(infixes).finditer if infixes else None return Tokenizer( - vocab, - rules=rules, + nlp.vocab, + rules=nlp.Defaults.tokenizer_exceptions, prefix_search=prefix_search, suffix_search=suffix_search, infix_finditer=infix_finditer, - token_match=token_match, - url_match=url_match, + token_match=nlp.Defaults.token_match, + url_match=nlp.Defaults.url_match, ) - pipe_names = ["tagger", "parser", "ner"] - token_match = TOKEN_MATCH - url_match = URL_MATCH - prefixes = tuple(TOKENIZER_PREFIXES) - suffixes = tuple(TOKENIZER_SUFFIXES) - infixes = tuple(TOKENIZER_INFIXES) - tag_map = dict(TAG_MAP) - tokenizer_exceptions = {} - stop_words = set() - morph_rules = {} - is_base_form = None - lex_attr_getters = LEX_ATTRS - syntax_iterators = {} - resources = {} - writing_system = {"direction": "ltr", "has_case": True, "has_letters": True} - single_orth_variants = [] - paired_orth_variants = [] + return tokenizer_factory -class Language(object): +@registry.misc("spacy.LookupsDataLoader.v1") +def load_lookups_data(lang, tables): + util.logger.debug(f"Loading lookups from spacy-lookups-data: {tables}") + lookups = load_lookups(lang=lang, tables=tables) + return lookups + + +class Language: """A text-processing pipeline. Usually you'll load this once per process, and pass the instance around your application. Defaults (class): Settings, data and factory methods for creating the `nlp` object and processing pipeline. - lang (unicode): Two-letter language ID, i.e. ISO code. + lang (str): Two-letter language ID, i.e. ISO code. 
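The same registry mechanism used for `spacy.Tokenizer.v1` above can back a user-defined tokenizer. A sketch under assumed names (`WhitespaceTokenizer` and `"whitespace_tokenizer.v1"` are made up for illustration):

# Sketch: a custom tokenizer registered like the built-in tokenizer factory.
import spacy
from spacy.tokens import Doc


class WhitespaceTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        # Split on single spaces only; good enough for a demo
        words = text.split(" ")
        return Doc(self.vocab, words=words)


@spacy.registry.tokenizers("whitespace_tokenizer.v1")
def create_whitespace_tokenizer():
    def tokenizer_factory(nlp):
        return WhitespaceTokenizer(nlp.vocab)

    return tokenizer_factory


# The registered name can then be referenced from the config:
# [nlp.tokenizer]
# @tokenizers = "whitespace_tokenizer.v1"
nlp = spacy.blank("en")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
print([t.text for t in nlp("What's happened to me?")])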
- DOCS: https://spacy.io/api/language + DOCS: https://nightly.spacy.io/api/language """ Defaults = BaseDefaults - lang = None + lang: str = None + default_config = DEFAULT_CONFIG - factories = {"tokenizer": lambda nlp: nlp.Defaults.create_tokenizer(nlp)} + factories = SimpleFrozenDict(error=Errors.E957) + _factory_meta: Dict[str, "FactoryMeta"] = {} # meta by factory def __init__( - self, vocab=True, make_doc=True, max_length=10 ** 6, meta={}, **kwargs - ): + self, + vocab: Union[Vocab, bool] = True, + *, + max_length: int = 10 ** 6, + meta: Dict[str, Any] = {}, + create_tokenizer: Optional[Callable[["Language"], Callable[[str], Doc]]] = None, + **kwargs, + ) -> None: """Initialise a Language object. - vocab (Vocab): A `Vocab` object. If `True`, a vocab is created via - `Language.Defaults.create_vocab`. - make_doc (callable): A function that takes text and returns a `Doc` - object. Usually a `Tokenizer`. + vocab (Vocab): A `Vocab` object. If `True`, a vocab is created. meta (dict): Custom meta data for the Language class. Is written to by models to add model meta data. - max_length (int) : - Maximum number of characters in a single text. The current v2 models - may run out memory on extremely long texts, due to large internal - allocations. You should segment these texts into meaningful units, - e.g. paragraphs, subsections etc, before passing them to spaCy. - Default maximum length is 1,000,000 characters (1mb). As a rule of - thumb, if all pipeline components are enabled, spaCy's default - models currently requires roughly 1GB of temporary memory per + max_length (int): Maximum number of characters in a single text. The + current models may run out memory on extremely long texts, due to + large internal allocations. You should segment these texts into + meaningful units, e.g. paragraphs, subsections etc, before passing + them to spaCy. Default maximum length is 1,000,000 charas (1mb). As + a rule of thumb, if all pipeline components are enabled, spaCy's + default models currently requires roughly 1GB of temporary memory per 100,000 characters in one text. - RETURNS (Language): The newly constructed object. + create_tokenizer (Callable): Function that takes the nlp object and + returns a tokenizer. + + DOCS: https://nightly.spacy.io/api/language#init """ - user_factories = util.registry.factories.get_all() - self.factories.update(user_factories) + # We're only calling this to import all factories provided via entry + # points. The factory decorator applied to these functions takes care + # of the rest. 
+ util.registry._entry_point_factories.get_all() + + self._config = DEFAULT_CONFIG.merge(self.default_config) self._meta = dict(meta) self._path = None + self._optimizer = None + # Component meta and configs are only needed on the instance + self._pipe_meta: Dict[str, "FactoryMeta"] = {} # meta by component + self._pipe_configs: Dict[str, Config] = {} # config by component + + if not isinstance(vocab, Vocab) and vocab is not True: + raise ValueError(Errors.E918.format(vocab=vocab, vocab_type=type(Vocab))) if vocab is True: - factory = self.Defaults.create_vocab - vocab = factory(self, **meta.get("vocab", {})) - if vocab.vectors.name is None: - vocab.vectors.name = meta.get("vectors", {}).get("name") + vectors_name = meta.get("vectors", {}).get("name") + vocab = create_vocab(self.lang, self.Defaults, vectors_name=vectors_name) else: if (self.lang and vocab.lang) and (self.lang != vocab.lang): raise ValueError(Errors.E150.format(nlp=self.lang, vocab=vocab.lang)) - self.vocab = vocab - if make_doc is True: - factory = self.Defaults.create_tokenizer - make_doc = factory(self, **meta.get("tokenizer", {})) - self.tokenizer = make_doc - self.pipeline = [] + self.vocab: Vocab = vocab + if self.lang is None: + self.lang = self.vocab.lang + self._components = [] + self._disabled = set() self.max_length = max_length - self._optimizer = None + # Create the default tokenizer from the default config + if not create_tokenizer: + tokenizer_cfg = {"tokenizer": self._config["nlp"]["tokenizer"]} + create_tokenizer = registry.resolve(tokenizer_cfg)["tokenizer"] + self.tokenizer = create_tokenizer(self) + + def __init_subclass__(cls, **kwargs): + super().__init_subclass__(**kwargs) + cls.default_config = DEFAULT_CONFIG.merge(cls.Defaults.config) + cls.default_config["nlp"]["lang"] = cls.lang @property def path(self): return self._path @property - def meta(self): + def meta(self) -> Dict[str, Any]: + """Custom meta data of the language class. If a model is loaded, this + includes details from the model's meta.json. + + RETURNS (Dict[str, Any]): The meta. + + DOCS: https://nightly.spacy.io/api/language#meta + """ + spacy_version = util.get_model_version_range(about.__version__) if self.vocab.lang: self._meta.setdefault("lang", self.vocab.lang) else: self._meta.setdefault("lang", self.lang) - self._meta.setdefault("name", "model") + self._meta.setdefault("name", "pipeline") self._meta.setdefault("version", "0.0.0") - self._meta.setdefault("spacy_version", ">={}".format(about.__version__)) + self._meta.setdefault("spacy_version", spacy_version) self._meta.setdefault("description", "") self._meta.setdefault("author", "") self._meta.setdefault("email", "") @@ -214,225 +212,761 @@ class Language(object): "keys": self.vocab.vectors.n_keys, "name": self.vocab.vectors.name, } - self._meta["pipeline"] = self.pipe_names - self._meta["factories"] = self.pipe_factories - self._meta["labels"] = self.pipe_labels + self._meta["labels"] = dict(self.pipe_labels) + # TODO: Adding this back to prevent breaking people's code etc., but + # we should consider removing it + self._meta["pipeline"] = list(self.pipe_names) + self._meta["components"] = list(self.component_names) + self._meta["disabled"] = list(self.disabled) return self._meta @meta.setter - def meta(self, value): + def meta(self, value: Dict[str, Any]) -> None: self._meta = value - # Conveniences to access pipeline components - # Shouldn't be used anymore! 
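Both `meta` and the new `config` property are plain, JSON-serializable mappings, so they can be inspected or exported directly. A small sketch:

# Small sketch: inspecting the generated meta and config of a pipeline.
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
print(nlp.meta["lang"], nlp.meta["pipeline"])   # "en", ["sentencizer"]
print(nlp.config["nlp"]["pipeline"])            # ["sentencizer"]
nlp.config.to_disk("./config.cfg")              # writes the full config, incl. [components]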
@property - def tensorizer(self): - return self.get_pipe("tensorizer") + def config(self) -> Config: + """Trainable config for the current language instance. Includes the + current pipeline components, as well as default training config. - @property - def tagger(self): - return self.get_pipe("tagger") + RETURNS (thinc.api.Config): The config. - @property - def parser(self): - return self.get_pipe("parser") - - @property - def entity(self): - return self.get_pipe("ner") - - @property - def linker(self): - return self.get_pipe("entity_linker") - - @property - def matcher(self): - return self.get_pipe("matcher") - - @property - def pipe_names(self): - """Get names of available pipeline components. - - RETURNS (list): List of component name strings, in order. + DOCS: https://nightly.spacy.io/api/language#config """ - return [pipe_name for pipe_name, _ in self.pipeline] + self._config.setdefault("nlp", {}) + self._config.setdefault("training", {}) + self._config["nlp"]["lang"] = self.lang + # We're storing the filled config for each pipeline component and so + # we can populate the config again later + pipeline = {} + score_weights = [] + for pipe_name in self.component_names: + pipe_meta = self.get_pipe_meta(pipe_name) + pipe_config = self.get_pipe_config(pipe_name) + pipeline[pipe_name] = {"factory": pipe_meta.factory, **pipe_config} + if pipe_meta.default_score_weights: + score_weights.append(pipe_meta.default_score_weights) + self._config["nlp"]["pipeline"] = list(self.component_names) + self._config["nlp"]["disabled"] = list(self.disabled) + self._config["components"] = pipeline + # We're merging the existing score weights back into the combined + # weights to make sure we're preserving custom settings in the config + # but also reflect updates (e.g. new components added) + prev_weights = self._config["training"].get("score_weights", {}) + combined_score_weights = combine_score_weights(score_weights, prev_weights) + self._config["training"]["score_weights"] = combined_score_weights + if not srsly.is_json_serializable(self._config): + raise ValueError(Errors.E961.format(config=self._config)) + return self._config + + @config.setter + def config(self, value: Config) -> None: + self._config = value @property - def pipe_factories(self): + def disabled(self) -> List[str]: + """Get the names of all disabled components. + + RETURNS (List[str]): The disabled components. + """ + # Make sure the disabled components are returned in the order they + # appear in the pipeline (which isn't guaranteed by the set) + names = [name for name, _ in self._components if name in self._disabled] + return SimpleFrozenList(names, error=Errors.E926.format(attr="disabled")) + + @property + def factory_names(self) -> List[str]: + """Get names of all available factories. + + RETURNS (List[str]): The factory names. + """ + names = list(self.factories.keys()) + return SimpleFrozenList(names) + + @property + def components(self) -> List[Tuple[str, Callable[[Doc], Doc]]]: + """Get all (name, component) tuples in the pipeline, including the + currently disabled components. + """ + return SimpleFrozenList( + self._components, error=Errors.E926.format(attr="components") + ) + + @property + def component_names(self) -> List[str]: + """Get the names of the available pipeline components. Includes all + active and inactive pipeline components. + + RETURNS (List[str]): List of component name strings, in order. 
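In practice the distinction between `component_names`, `pipe_names` and `disabled` looks roughly like this (built-in factory names are used; any component would do):

# Sketch: active vs. disabled components on the same nlp object.
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("entity_ruler")
nlp.disable_pipe("entity_ruler")
print(nlp.component_names)   # ["sentencizer", "entity_ruler"]
print(nlp.pipe_names)        # ["sentencizer"]; disabled components are skipped
print(nlp.disabled)          # ["entity_ruler"]
nlp.enable_pipe("entity_ruler")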
+ """ + names = [pipe_name for pipe_name, _ in self._components] + return SimpleFrozenList(names, error=Errors.E926.format(attr="component_names")) + + @property + def pipeline(self) -> List[Tuple[str, Callable[[Doc], Doc]]]: + """The processing pipeline consisting of (name, component) tuples. The + components are called on the Doc in order as it passes through the + pipeline. + + RETURNS (List[Tuple[str, Callable[[Doc], Doc]]]): The pipeline. + """ + pipes = [(n, p) for n, p in self._components if n not in self._disabled] + return SimpleFrozenList(pipes, error=Errors.E926.format(attr="pipeline")) + + @property + def pipe_names(self) -> List[str]: + """Get names of available active pipeline components. + + RETURNS (List[str]): List of component name strings, in order. + """ + names = [pipe_name for pipe_name, _ in self.pipeline] + return SimpleFrozenList(names, error=Errors.E926.format(attr="pipe_names")) + + @property + def pipe_factories(self) -> Dict[str, str]: """Get the component factories for the available pipeline components. - RETURNS (dict): Factory names, keyed by component names. + RETURNS (Dict[str, str]): Factory names, keyed by component names. """ factories = {} - for pipe_name, pipe in self.pipeline: - factories[pipe_name] = getattr(pipe, "factory", pipe_name) - return factories + for pipe_name, pipe in self._components: + factories[pipe_name] = self.get_pipe_meta(pipe_name).factory + return SimpleFrozenDict(factories) @property - def pipe_labels(self): + def pipe_labels(self) -> Dict[str, List[str]]: """Get the labels set by the pipeline components, if available (if the component exposes a labels property). - RETURNS (dict): Labels keyed by component name. + RETURNS (Dict[str, List[str]]): Labels keyed by component name. """ - labels = OrderedDict() - for name, pipe in self.pipeline: + labels = {} + for name, pipe in self._components: if hasattr(pipe, "labels"): labels[name] = list(pipe.labels) - return labels + return SimpleFrozenDict(labels) - def get_pipe(self, name): + @classmethod + def has_factory(cls, name: str) -> bool: + """RETURNS (bool): Whether a factory of that name is registered.""" + internal_name = cls.get_factory_name(name) + return name in registry.factories or internal_name in registry.factories + + @classmethod + def get_factory_name(cls, name: str) -> str: + """Get the internal factory name based on the language subclass. + + name (str): The factory name. + RETURNS (str): The internal factory name. + """ + if cls.lang is None: + return name + return f"{cls.lang}.{name}" + + @classmethod + def get_factory_meta(cls, name: str) -> "FactoryMeta": + """Get the meta information for a given factory name. + + name (str): The component factory name. + RETURNS (FactoryMeta): The meta for the given factory name. + """ + internal_name = cls.get_factory_name(name) + if internal_name in cls._factory_meta: + return cls._factory_meta[internal_name] + if name in cls._factory_meta: + return cls._factory_meta[name] + raise ValueError(Errors.E967.format(meta="factory", name=name)) + + @classmethod + def set_factory_meta(cls, name: str, value: "FactoryMeta") -> None: + """Set the meta information for a given factory name. + + name (str): The component factory name. + value (FactoryMeta): The meta to set. + """ + cls._factory_meta[cls.get_factory_name(name)] = value + + def get_pipe_meta(self, name: str) -> "FactoryMeta": + """Get the meta information for a given component name. + + name (str): The component name. 
+ RETURNS (FactoryMeta): The meta for the given component name. + """ + if name not in self._pipe_meta: + raise ValueError(Errors.E967.format(meta="component", name=name)) + return self._pipe_meta[name] + + def get_pipe_config(self, name: str) -> Config: + """Get the config used to create a pipeline component. + + name (str): The component name. + RETURNS (Config): The config used to create the pipeline component. + """ + if name not in self._pipe_configs: + raise ValueError(Errors.E960.format(name=name)) + pipe_config = self._pipe_configs[name] + return pipe_config + + @classmethod + def factory( + cls, + name: str, + *, + default_config: Dict[str, Any] = SimpleFrozenDict(), + assigns: Iterable[str] = SimpleFrozenList(), + requires: Iterable[str] = SimpleFrozenList(), + retokenizes: bool = False, + default_score_weights: Dict[str, float] = SimpleFrozenDict(), + func: Optional[Callable] = None, + ) -> Callable: + """Register a new pipeline component factory. Can be used as a decorator + on a function or classmethod, or called as a function with the factory + provided as the func keyword argument. To create a component and add + it to the pipeline, you can use nlp.add_pipe(name). + + name (str): The name of the component factory. + default_config (Dict[str, Any]): Default configuration, describing the + default values of the factory arguments. + assigns (Iterable[str]): Doc/Token attributes assigned by this component, + e.g. "token.ent_id". Used for pipeline analyis. + requires (Iterable[str]): Doc/Token attributes required by this component, + e.g. "token.ent_id". Used for pipeline analyis. + retokenizes (bool): Whether the component changes the tokenization. + Used for pipeline analysis. + default_score_weights (Dict[str, float]): The scores to report during + training, and their default weight towards the final score used to + select the best model. Weights should sum to 1.0 per component and + will be combined and normalized for the whole pipeline. If None, + the score won't be shown in the logs or be weighted. + func (Optional[Callable]): Factory function if not used as a decorator. + + DOCS: https://nightly.spacy.io/api/language#factory + """ + if not isinstance(name, str): + raise ValueError(Errors.E963.format(decorator="factory")) + if not isinstance(default_config, dict): + err = Errors.E962.format( + style="default config", name=name, cfg_type=type(default_config) + ) + raise ValueError(err) + + def add_factory(factory_func: Callable) -> Callable: + internal_name = cls.get_factory_name(name) + if internal_name in registry.factories: + # We only check for the internal name here – it's okay if it's a + # subclass and the base class has a factory of the same name. We + # also only raise if the function is different to prevent raising + # if module is reloaded. + existing_func = registry.factories.get(internal_name) + if not util.is_same_func(factory_func, existing_func): + err = Errors.E004.format( + name=name, func=existing_func, new_func=factory_func + ) + raise ValueError(err) + + arg_names = util.get_arg_names(factory_func) + if "nlp" not in arg_names or "name" not in arg_names: + raise ValueError(Errors.E964.format(name=name)) + # Officially register the factory so we can later call + # registry.resolve and refer to it in the config as + # @factories = "spacy.Language.xyz". We use the class name here so + # different classes can have different factories. 
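A minimal end-to-end use of the factory decorator described above. The component name, class and setting are invented for illustration:

# Minimal sketch of @Language.factory; "animal_matcher" and its setting are made up.
import spacy
from spacy.language import Language
from spacy.tokens import Doc


@Language.factory("animal_matcher", default_config={"case_sensitive": False})
def create_animal_matcher(nlp: Language, name: str, case_sensitive: bool):
    return AnimalMatcher(name, case_sensitive)


class AnimalMatcher:
    def __init__(self, name: str, case_sensitive: bool):
        self.name = name
        self.case_sensitive = case_sensitive

    def __call__(self, doc: Doc) -> Doc:
        # A real component would add annotations here; this one just passes through
        return doc


nlp = spacy.blank("en")
nlp.add_pipe("animal_matcher", config={"case_sensitive": True})
print(nlp.pipe_names)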
+ registry.factories.register(internal_name, func=factory_func) + factory_meta = FactoryMeta( + factory=name, + default_config=default_config, + assigns=validate_attrs(assigns), + requires=validate_attrs(requires), + scores=list(default_score_weights.keys()), + default_score_weights=default_score_weights, + retokenizes=retokenizes, + ) + cls.set_factory_meta(name, factory_meta) + # We're overwriting the class attr with a frozen dict to handle + # backwards-compat (writing to Language.factories directly). This + # wouldn't work with an instance property and just produce a + # confusing error – here we can show a custom error + cls.factories = SimpleFrozenDict( + registry.factories.get_all(), error=Errors.E957 + ) + return factory_func + + if func is not None: # Support non-decorator use cases + return add_factory(func) + return add_factory + + @classmethod + def component( + cls, + name: Optional[str] = None, + *, + assigns: Iterable[str] = SimpleFrozenList(), + requires: Iterable[str] = SimpleFrozenList(), + retokenizes: bool = False, + func: Optional[Callable[[Doc], Doc]] = None, + ) -> Callable: + """Register a new pipeline component. Can be used for stateless function + components that don't require a separate factory. Can be used as a + decorator on a function or classmethod, or called as a function with the + factory provided as the func keyword argument. To create a component and + add it to the pipeline, you can use nlp.add_pipe(name). + + name (str): The name of the component factory. + assigns (Iterable[str]): Doc/Token attributes assigned by this component, + e.g. "token.ent_id". Used for pipeline analyis. + requires (Iterable[str]): Doc/Token attributes required by this component, + e.g. "token.ent_id". Used for pipeline analyis. + retokenizes (bool): Whether the component changes the tokenization. + Used for pipeline analysis. + func (Optional[Callable]): Factory function if not used as a decorator. + + DOCS: https://nightly.spacy.io/api/language#component + """ + if name is not None and not isinstance(name, str): + raise ValueError(Errors.E963.format(decorator="component")) + component_name = name if name is not None else util.get_object_name(func) + + def add_component(component_func: Callable[[Doc], Doc]) -> Callable: + if isinstance(func, type): # function is a class + raise ValueError(Errors.E965.format(name=component_name)) + + def factory_func(nlp: cls, name: str) -> Callable[[Doc], Doc]: + return component_func + + internal_name = cls.get_factory_name(name) + if internal_name in registry.factories: + # We only check for the internal name here – it's okay if it's a + # subclass and the base class has a factory of the same name. We + # also only raise if the function is different to prevent raising + # if module is reloaded. 
It's hacky, but we need to check the + # existing functure for a closure and whether that's identical + # to the component function (because factory_func created above + # will always be different, even for the same function) + existing_func = registry.factories.get(internal_name) + closure = existing_func.__closure__ + wrapped = [c.cell_contents for c in closure][0] if closure else None + if util.is_same_func(wrapped, component_func): + factory_func = existing_func # noqa: F811 + + cls.factory( + component_name, + assigns=assigns, + requires=requires, + retokenizes=retokenizes, + func=factory_func, + ) + return component_func + + if func is not None: # Support non-decorator use cases + return add_component(func) + return add_component + + def analyze_pipes( + self, + *, + keys: List[str] = ["assigns", "requires", "scores", "retokenizes"], + pretty: bool = False, + ) -> Optional[Dict[str, Any]]: + """Analyze the current pipeline components, print a summary of what + they assign or require and check that all requirements are met. + + keys (List[str]): The meta values to display in the table. Corresponds + to values in FactoryMeta, defined by @Language.factory decorator. + pretty (bool): Pretty-print the results. + RETURNS (dict): The data. + """ + analysis = analyze_pipes(self, keys=keys) + if pretty: + print_pipe_analysis(analysis, keys=keys) + return analysis + + def get_pipe(self, name: str) -> Callable[[Doc], Doc]: """Get a pipeline component for a given component name. - name (unicode): Name of pipeline component to get. + name (str): Name of pipeline component to get. RETURNS (callable): The pipeline component. - DOCS: https://spacy.io/api/language#get_pipe + DOCS: https://nightly.spacy.io/api/language#get_pipe """ - for pipe_name, component in self.pipeline: + for pipe_name, component in self._components: if pipe_name == name: return component - raise KeyError(Errors.E001.format(name=name, opts=self.pipe_names)) + raise KeyError(Errors.E001.format(name=name, opts=self.component_names)) - def create_pipe(self, name, config=dict()): - """Create a pipeline component from a factory. + def create_pipe( + self, + factory_name: str, + name: Optional[str] = None, + *, + config: Optional[Dict[str, Any]] = SimpleFrozenDict(), + raw_config: Optional[Config] = None, + validate: bool = True, + ) -> Callable[[Doc], Doc]: + """Create a pipeline component. Mostly used internally. To create and + add a component to the pipeline, you can use nlp.add_pipe. - name (unicode): Factory name to look up in `Language.factories`. - config (dict): Configuration parameters to initialise component. - RETURNS (callable): Pipeline component. + factory_name (str): Name of component factory. + name (Optional[str]): Optional name to assign to component instance. + Defaults to factory name if not set. + config (Optional[Dict[str, Any]]): Config parameters to use for this + component. Will be merged with default config, if available. + raw_config (Optional[Config]): Internals: the non-interpolated config. + validate (bool): Whether to validate the component config against the + arguments and types expected by the factory. + RETURNS (Callable[[Doc], Doc]): The pipeline component. 
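For stateless functions, the lighter `@Language.component` decorator plus `analyze_pipes` covers the common case. A short sketch with an invented component name:

# Sketch: a stateless component registered with @Language.component.
import spacy
from spacy.language import Language
from spacy.tokens import Doc


@Language.component("debug_lengths")
def debug_lengths(doc: Doc) -> Doc:
    print(f"{len(doc)} tokens")
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("debug_lengths", last=True)
nlp.analyze_pipes(pretty=True)   # summary of what each component assigns/requires
doc = nlp("This is a sentence. This is another one.")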
- DOCS: https://spacy.io/api/language#create_pipe + DOCS: https://nightly.spacy.io/api/language#create_pipe """ - if name not in self.factories: - if name == "sbd": - raise KeyError(Errors.E108.format(name=name)) - else: - raise KeyError(Errors.E002.format(name=name)) - factory = self.factories[name] - return factory(self, **config) + name = name if name is not None else factory_name + if not isinstance(config, dict): + err = Errors.E962.format(style="config", name=name, cfg_type=type(config)) + raise ValueError(err) + if not srsly.is_json_serializable(config): + raise ValueError(Errors.E961.format(config=config)) + if not self.has_factory(factory_name): + err = Errors.E002.format( + name=factory_name, + opts=", ".join(self.factory_names), + method="create_pipe", + lang=util.get_object_name(self), + lang_code=self.lang, + ) + raise ValueError(err) + pipe_meta = self.get_factory_meta(factory_name) + config = config or {} + # This is unideal, but the alternative would mean you always need to + # specify the full config settings, which is not really viable. + if pipe_meta.default_config: + config = Config(pipe_meta.default_config).merge(config) + # We need to create a top-level key because Thinc doesn't allow resolving + # top-level references to registered functions. Also gives nicer errors. + # The name allows components to know their pipe name and use it in the + # losses etc. (even if multiple instances of the same factory are used) + internal_name = self.get_factory_name(factory_name) + # If the language-specific factory doesn't exist, try again with the + # not-specific name + if internal_name not in registry.factories: + internal_name = factory_name + config = {"nlp": self, "name": name, **config, "@factories": internal_name} + cfg = {factory_name: config} + # We're calling the internal _fill here to avoid constructing the + # registered functions twice + resolved = registry.resolve(cfg, validate=validate) + filled = registry.fill({"cfg": cfg[factory_name]}, validate=validate)["cfg"] + filled = Config(filled) + filled["factory"] = factory_name + filled.pop("@factories", None) + # Remove the extra values we added because we don't want to keep passing + # them around, copying them etc. + filled.pop("nlp", None) + filled.pop("name", None) + # Merge the final filled config with the raw config (including non- + # interpolated variables) + if raw_config: + filled = filled.merge(raw_config) + self._pipe_configs[name] = filled + return resolved[factory_name] + + def create_pipe_from_source( + self, source_name: str, source: "Language", *, name: str + ) -> Tuple[Callable[[Doc], Doc], str]: + """Create a pipeline component by copying it from an existing model. + + source_name (str): Name of the component in the source pipeline. + source (Language): The source nlp object to copy from. + name (str): Optional alternative name to use in current pipeline. + RETURNS (Tuple[Callable, str]): The component and its factory name. + """ + # TODO: handle errors and mismatches (vectors etc.) 
+ if not isinstance(source, self.__class__): + raise ValueError(Errors.E945.format(name=source_name, source=type(source))) + if not source.has_pipe(source_name): + raise KeyError( + Errors.E944.format( + name=source_name, + model=f"{source.meta['lang']}_{source.meta['name']}", + opts=", ".join(source.pipe_names), + ) + ) + pipe = source.get_pipe(source_name) + # Make sure the source config is interpolated so we don't end up with + # orphaned variables in our final config + source_config = source.config.interpolate() + pipe_config = util.copy_config(source_config["components"][source_name]) + self._pipe_configs[name] = pipe_config + return pipe, pipe_config["factory"] def add_pipe( - self, component, name=None, before=None, after=None, first=None, last=None - ): + self, + factory_name: str, + name: Optional[str] = None, + *, + before: Optional[Union[str, int]] = None, + after: Optional[Union[str, int]] = None, + first: Optional[bool] = None, + last: Optional[bool] = None, + source: Optional["Language"] = None, + config: Optional[Dict[str, Any]] = SimpleFrozenDict(), + raw_config: Optional[Config] = None, + validate: bool = True, + ) -> Callable[[Doc], Doc]: """Add a component to the processing pipeline. Valid components are callables that take a `Doc` object, modify it and return it. Only one of before/after/first/last can be set. Default behaviour is "last". - component (callable): The pipeline component. - name (unicode): Name of pipeline component. Overwrites existing + factory_name (str): Name of the component factory. + name (str): Name of pipeline component. Overwrites existing component.name attribute if available. If no name is set and the component exposes no name attribute, component.__name__ is used. An error is raised if a name already exists in the pipeline. - before (unicode): Component name to insert component directly before. - after (unicode): Component name to insert component directly after. - first (bool): Insert component first / not first in the pipeline. - last (bool): Insert component last / not last in the pipeline. + before (Union[str, int]): Name or index of the component to insert new + component directly before. + after (Union[str, int]): Name or index of the component to insert new + component directly after. + first (bool): If True, insert component first in the pipeline. + last (bool): If True, insert component last in the pipeline. + source (Language): Optional loaded nlp object to copy the pipeline + component from. + config (Optional[Dict[str, Any]]): Config parameters to use for this + component. Will be merged with default config, if available. + raw_config (Optional[Config]): Internals: the non-interpolated config. + validate (bool): Whether to validate the component config against the + arguments and types expected by the factory. + RETURNS (Callable[[Doc], Doc]): The pipeline component. 
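Taken together, `add_pipe` now only accepts string factory names, combined with a position, an optional `config` and an optional `source` pipeline to copy from. Roughly:

# Sketch: adding components by factory name, with position, config and source.
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer", first=True)
nlp.add_pipe("entity_ruler", after="sentencizer", config={"overwrite_ents": True})

# Copying a trained component from another pipeline (requires that package to be installed):
# source_nlp = spacy.load("en_core_web_sm")
# nlp.add_pipe("ner", source=source_nlp)
print(nlp.pipe_names)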
- DOCS: https://spacy.io/api/language#add_pipe + DOCS: https://nightly.spacy.io/api/language#add_pipe """ - if not hasattr(component, "__call__"): - msg = Errors.E003.format(component=repr(component), name=name) - if isinstance(component, basestring_) and component in self.factories: - msg += Errors.E004.format(component=component) - raise ValueError(msg) - if name is None: - name = util.get_component_name(component) - if name in self.pipe_names: - raise ValueError(Errors.E007.format(name=name, opts=self.pipe_names)) - if sum([bool(before), bool(after), bool(first), bool(last)]) >= 2: - raise ValueError(Errors.E006) - pipe_index = 0 - pipe = (name, component) - if last or not any([first, before, after]): - pipe_index = len(self.pipeline) - self.pipeline.append(pipe) - elif first: - self.pipeline.insert(0, pipe) - elif before and before in self.pipe_names: - pipe_index = self.pipe_names.index(before) - self.pipeline.insert(self.pipe_names.index(before), pipe) - elif after and after in self.pipe_names: - pipe_index = self.pipe_names.index(after) + 1 - self.pipeline.insert(self.pipe_names.index(after) + 1, pipe) - else: - raise ValueError( - Errors.E001.format(name=before or after, opts=self.pipe_names) + if not isinstance(factory_name, str): + bad_val = repr(factory_name) + err = Errors.E966.format(component=bad_val, name=name) + raise ValueError(err) + name = name if name is not None else factory_name + if name in self.component_names: + raise ValueError(Errors.E007.format(name=name, opts=self.component_names)) + if source is not None: + # We're loading the component from a model. After loading the + # component, we know its real factory name + pipe_component, factory_name = self.create_pipe_from_source( + factory_name, source, name=name ) - if ENABLE_PIPELINE_ANALYSIS: - analyze_pipes(self.pipeline, name, component, pipe_index) + else: + if not self.has_factory(factory_name): + err = Errors.E002.format( + name=factory_name, + opts=", ".join(self.factory_names), + method="add_pipe", + lang=util.get_object_name(self), + lang_code=self.lang, + ) + pipe_component = self.create_pipe( + factory_name, + name=name, + config=config, + raw_config=raw_config, + validate=validate, + ) + pipe_index = self._get_pipe_index(before, after, first, last) + self._pipe_meta[name] = self.get_factory_meta(factory_name) + self._components.insert(pipe_index, (name, pipe_component)) + return pipe_component - def has_pipe(self, name): + def _get_pipe_index( + self, + before: Optional[Union[str, int]] = None, + after: Optional[Union[str, int]] = None, + first: Optional[bool] = None, + last: Optional[bool] = None, + ) -> int: + """Determine where to insert a pipeline component based on the before/ + after/first/last values. + + before (str): Name or index of the component to insert directly before. + after (str): Name or index of component to insert directly after. + first (bool): If True, insert component first in the pipeline. + last (bool): If True, insert component last in the pipeline. + RETURNS (int): The index of the new pipeline component. 
+ """ + all_args = {"before": before, "after": after, "first": first, "last": last} + if sum(arg is not None for arg in [before, after, first, last]) >= 2: + raise ValueError( + Errors.E006.format(args=all_args, opts=self.component_names) + ) + if last or not any(value is not None for value in [first, before, after]): + return len(self._components) + elif first: + return 0 + elif isinstance(before, str): + if before not in self.component_names: + raise ValueError( + Errors.E001.format(name=before, opts=self.component_names) + ) + return self.component_names.index(before) + elif isinstance(after, str): + if after not in self.component_names: + raise ValueError( + Errors.E001.format(name=after, opts=self.component_names) + ) + return self.component_names.index(after) + 1 + # We're only accepting indices referring to components that exist + # (can't just do isinstance here because bools are instance of int, too) + elif type(before) == int: + if before >= len(self._components) or before < 0: + err = Errors.E959.format( + dir="before", idx=before, opts=self.component_names + ) + raise ValueError(err) + return before + elif type(after) == int: + if after >= len(self._components) or after < 0: + err = Errors.E959.format( + dir="after", idx=after, opts=self.component_names + ) + raise ValueError(err) + return after + 1 + raise ValueError(Errors.E006.format(args=all_args, opts=self.component_names)) + + def has_pipe(self, name: str) -> bool: """Check if a component name is present in the pipeline. Equivalent to `name in nlp.pipe_names`. - name (unicode): Name of the component. + name (str): Name of the component. RETURNS (bool): Whether a component of the name exists in the pipeline. - DOCS: https://spacy.io/api/language#has_pipe + DOCS: https://nightly.spacy.io/api/language#has_pipe """ return name in self.pipe_names - def replace_pipe(self, name, component): + def replace_pipe( + self, + name: str, + factory_name: str, + *, + config: Dict[str, Any] = SimpleFrozenDict(), + validate: bool = True, + ) -> Callable[[Doc], Doc]: """Replace a component in the pipeline. - name (unicode): Name of the component to replace. - component (callable): Pipeline component. + name (str): Name of the component to replace. + factory_name (str): Factory name of replacement component. + config (Optional[Dict[str, Any]]): Config parameters to use for this + component. Will be merged with default config, if available. + validate (bool): Whether to validate the component config against the + arguments and types expected by the factory. + RETURNS (Callable[[Doc], Doc]): The new pipeline component. 
- DOCS: https://spacy.io/api/language#replace_pipe + DOCS: https://nightly.spacy.io/api/language#replace_pipe """ if name not in self.pipe_names: raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names)) - if not hasattr(component, "__call__"): - msg = Errors.E003.format(component=repr(component), name=name) - if isinstance(component, basestring_) and component in self.factories: - msg += Errors.E135.format(name=name) - raise ValueError(msg) - self.pipeline[self.pipe_names.index(name)] = (name, component) - if ENABLE_PIPELINE_ANALYSIS: - analyze_all_pipes(self.pipeline) + if hasattr(factory_name, "__call__"): + err = Errors.E968.format(component=repr(factory_name), name=name) + raise ValueError(err) + # We need to delegate to Language.add_pipe here instead of just writing + # to Language.pipeline to make sure the configs are handled correctly + pipe_index = self.pipe_names.index(name) + self.remove_pipe(name) + if not len(self._components) or pipe_index == len(self._components): + # we have no components to insert before/after, or we're replacing the last component + return self.add_pipe( + factory_name, name=name, config=config, validate=validate + ) + else: + return self.add_pipe( + factory_name, + name=name, + before=pipe_index, + config=config, + validate=validate, + ) - def rename_pipe(self, old_name, new_name): + def rename_pipe(self, old_name: str, new_name: str) -> None: """Rename a pipeline component. - old_name (unicode): Name of the component to rename. - new_name (unicode): New name of the component. + old_name (str): Name of the component to rename. + new_name (str): New name of the component. - DOCS: https://spacy.io/api/language#rename_pipe + DOCS: https://nightly.spacy.io/api/language#rename_pipe """ - if old_name not in self.pipe_names: - raise ValueError(Errors.E001.format(name=old_name, opts=self.pipe_names)) - if new_name in self.pipe_names: - raise ValueError(Errors.E007.format(name=new_name, opts=self.pipe_names)) - i = self.pipe_names.index(old_name) - self.pipeline[i] = (new_name, self.pipeline[i][1]) + if old_name not in self.component_names: + raise ValueError( + Errors.E001.format(name=old_name, opts=self.component_names) + ) + if new_name in self.component_names: + raise ValueError( + Errors.E007.format(name=new_name, opts=self.component_names) + ) + i = self.component_names.index(old_name) + self._components[i] = (new_name, self._components[i][1]) + self._pipe_meta[new_name] = self._pipe_meta.pop(old_name) + self._pipe_configs[new_name] = self._pipe_configs.pop(old_name) + # Make sure [initialize] config is adjusted + if old_name in self._config["initialize"]["components"]: + init_cfg = self._config["initialize"]["components"].pop(old_name) + self._config["initialize"]["components"][new_name] = init_cfg - def remove_pipe(self, name): + def remove_pipe(self, name: str) -> Tuple[str, Callable[[Doc], Doc]]: """Remove a component from the pipeline. - name (unicode): Name of the component to remove. + name (str): Name of the component to remove. RETURNS (tuple): A `(name, component)` tuple of the removed component. 
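Renaming and removing keep the per-component meta, config and the `[initialize]` block in sync; for example:

# Sketch: renaming and removing pipeline components.
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.rename_pipe("sentencizer", "sbd")
print(nlp.pipe_names)                       # ["sbd"]
name, component = nlp.remove_pipe("sbd")    # returns the (name, component) tuple
print(name, nlp.pipe_names)                 # "sbd", []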
- DOCS: https://spacy.io/api/language#remove_pipe + DOCS: https://nightly.spacy.io/api/language#remove_pipe """ - if name not in self.pipe_names: - raise ValueError(Errors.E001.format(name=name, opts=self.pipe_names)) - removed = self.pipeline.pop(self.pipe_names.index(name)) - if ENABLE_PIPELINE_ANALYSIS: - analyze_all_pipes(self.pipeline) + if name not in self.component_names: + raise ValueError(Errors.E001.format(name=name, opts=self.component_names)) + removed = self._components.pop(self.component_names.index(name)) + # We're only removing the component itself from the metas/configs here + # because factory may be used for something else + self._pipe_meta.pop(name) + self._pipe_configs.pop(name) + # Make sure name is removed from the [initialize] config + if name in self._config["initialize"]["components"]: + self._config["initialize"]["components"].pop(name) + # Make sure the name is also removed from the set of disabled components + if name in self.disabled: + self._disabled.remove(name) return removed - def __call__(self, text, disable=[], component_cfg=None): + def disable_pipe(self, name: str) -> None: + """Disable a pipeline component. The component will still exist on + the nlp object, but it won't be run as part of the pipeline. Does + nothing if the component is already disabled. + + name (str): The name of the component to disable. + """ + if name not in self.component_names: + raise ValueError(Errors.E001.format(name=name, opts=self.component_names)) + self._disabled.add(name) + + def enable_pipe(self, name: str) -> None: + """Enable a previously disabled pipeline component so it's run as part + of the pipeline. Does nothing if the component is already enabled. + + name (str): The name of the component to enable. + """ + if name not in self.component_names: + raise ValueError(Errors.E001.format(name=name, opts=self.component_names)) + if name in self.disabled: + self._disabled.remove(name) + + def __call__( + self, + text: str, + *, + disable: Iterable[str] = SimpleFrozenList(), + component_cfg: Optional[Dict[str, Dict[str, Any]]] = None, + ) -> Doc: """Apply the pipeline to some text. The text can span multiple sentences, and can contain arbitrary whitespace. Alignment into the original string is preserved. - text (unicode): The text to be processed. + text (str): The text to be processed. disable (list): Names of the pipeline components to disable. - component_cfg (dict): An optional dictionary with extra keyword arguments - for specific components. + component_cfg (Dict[str, dict]): An optional dictionary with extra + keyword arguments for specific components. RETURNS (Doc): A container for accessing the annotations. - DOCS: https://spacy.io/api/language#call + DOCS: https://nightly.spacy.io/api/language#call """ if len(text) > self.max_length: raise ValueError( @@ -446,332 +980,411 @@ class Language(object): continue if not hasattr(proc, "__call__"): raise ValueError(Errors.E003.format(component=type(proc), name=name)) - doc = proc(doc, **component_cfg.get(name, {})) + try: + doc = proc(doc, **component_cfg.get(name, {})) + except KeyError as e: + # This typically happens if a component is not initialized + raise ValueError(Errors.E109.format(name=name)) from e if doc is None: raise ValueError(Errors.E005.format(name=name)) return doc - def disable_pipes(self, *names): + def disable_pipes(self, *names) -> "DisabledPipes": """Disable one or more pipeline components. If used as a context manager, the pipeline will be restored to the initial state at the end of the block. 
Otherwise, a DisabledPipes object is returned, that has a `.restore()` method you can use to undo your changes. - DOCS: https://spacy.io/api/language#disable_pipes + This method has been deprecated since 3.0 """ + warnings.warn(Warnings.W096, DeprecationWarning) if len(names) == 1 and isinstance(names[0], (list, tuple)): names = names[0] # support list of names instead of spread - return DisabledPipes(self, *names) + return self.select_pipes(disable=names) - def make_doc(self, text): + def select_pipes( + self, + *, + disable: Optional[Union[str, Iterable[str]]] = None, + enable: Optional[Union[str, Iterable[str]]] = None, + ) -> "DisabledPipes": + """Disable one or more pipeline components. If used as a context + manager, the pipeline will be restored to the initial state at the end + of the block. Otherwise, a DisabledPipes object is returned, that has + a `.restore()` method you can use to undo your changes. + + disable (str or iterable): The name(s) of the pipes to disable + enable (str or iterable): The name(s) of the pipes to enable - all others will be disabled + + DOCS: https://nightly.spacy.io/api/language#select_pipes + """ + if enable is None and disable is None: + raise ValueError(Errors.E991) + if disable is not None and isinstance(disable, str): + disable = [disable] + if enable is not None: + if isinstance(enable, str): + enable = [enable] + to_disable = [pipe for pipe in self.pipe_names if pipe not in enable] + # raise an error if the enable and disable keywords are not consistent + if disable is not None and disable != to_disable: + raise ValueError( + Errors.E992.format( + enable=enable, disable=disable, names=self.pipe_names + ) + ) + disable = to_disable + # DisabledPipes will restore the pipes in 'disable' when it's done, so we need to exclude + # those pipes that were already disabled. + disable = [d for d in disable if d not in self._disabled] + return DisabledPipes(self, disable) + + def make_doc(self, text: str) -> Doc: + """Turn a text into a Doc object. + + text (str): The text to process. + RETURNS (Doc): The processed doc. + """ return self.tokenizer(text) - def _format_docs_and_golds(self, docs, golds): - """Format golds and docs before update models.""" - expected_keys = ("words", "tags", "heads", "deps", "entities", "cats", "links") - gold_objs = [] - doc_objs = [] - for doc, gold in zip(docs, golds): - if isinstance(doc, basestring_): - doc = self.make_doc(doc) - if not isinstance(gold, GoldParse): - unexpected = [k for k in gold if k not in expected_keys] - if unexpected: - err = Errors.E151.format(unexp=unexpected, exp=expected_keys) - raise ValueError(err) - gold = GoldParse(doc, **gold) - doc_objs.append(doc) - gold_objs.append(gold) - - return doc_objs, gold_objs - - def update(self, docs, golds, drop=0.0, sgd=None, losses=None, component_cfg=None): + def update( + self, + examples: Iterable[Example], + _: Optional[Any] = None, + *, + drop: float = 0.0, + sgd: Optional[Optimizer] = None, + losses: Optional[Dict[str, float]] = None, + component_cfg: Optional[Dict[str, Dict[str, Any]]] = None, + exclude: Iterable[str] = SimpleFrozenList(), + ): """Update the models in the pipeline. - docs (iterable): A batch of `Doc` objects. - golds (iterable): A batch of `GoldParse` objects. + examples (Iterable[Example]): A batch of examples + _: Should not be set - serves to catch backwards-incompatible scripts. drop (float): The dropout rate. - sgd (callable): An optimizer. - losses (dict): Dictionary to update with the loss, keyed by component. 
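The replacement for `disable_pipes` in user code is `select_pipes`, used either as a context manager or kept around for a manual `restore()`. A sketch, assuming a trained pipeline such as `en_core_web_sm` is installed and contains the named components:

# Sketch: temporarily disabling components with select_pipes.
# Assumes "en_core_web_sm" is installed and provides "tagger", "parser" and "ner".
import spacy

nlp = spacy.load("en_core_web_sm")

with nlp.select_pipes(disable=["tagger", "parser"]):
    doc = nlp("Net income was $9.4 million.")   # only the remaining components ran
    print(doc.ents)

disabled = nlp.select_pipes(disable="ner")       # non-context-manager form
nlp("Some text processed without NER")
disabled.restore()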
- component_cfg (dict): Config parameters for specific pipeline + sgd (Optimizer): An optimizer. + losses (Dict[str, float]): Dictionary to update with the loss, keyed by component. + component_cfg (Dict[str, Dict]): Config parameters for specific pipeline components, keyed by component name. + exclude (Iterable[str]): Names of components that shouldn't be updated. + RETURNS (Dict[str, float]): The updated losses dictionary - DOCS: https://spacy.io/api/language#update + DOCS: https://nightly.spacy.io/api/language#update """ - if len(docs) != len(golds): - raise IndexError(Errors.E009.format(n_docs=len(docs), n_golds=len(golds))) - if len(docs) == 0: - return + if _ is not None: + raise ValueError(Errors.E989) + if losses is None: + losses = {} + if len(examples) == 0: + return losses + validate_examples(examples, "Language.update") if sgd is None: if self._optimizer is None: - self._optimizer = create_default_optimizer(Model.ops) + self._optimizer = self.create_optimizer() sgd = self._optimizer - # Allow dict of args to GoldParse, instead of GoldParse objects. - docs, golds = self._format_docs_and_golds(docs, golds) - grads = {} - - def get_grads(W, dW, key=None): - grads[key] = (W, dW) - - get_grads.alpha = sgd.alpha - get_grads.b1 = sgd.b1 - get_grads.b2 = sgd.b2 - pipes = list(self.pipeline) - random.shuffle(pipes) if component_cfg is None: component_cfg = {} - for name, proc in pipes: - if not hasattr(proc, "update"): + for i, (name, proc) in enumerate(self.pipeline): + component_cfg.setdefault(name, {}) + component_cfg[name].setdefault("drop", drop) + component_cfg[name].setdefault("set_annotations", False) + for name, proc in self.pipeline: + if name in exclude or not hasattr(proc, "update"): continue - grads = {} - kwargs = component_cfg.get(name, {}) - kwargs.setdefault("drop", drop) - proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs) - for key, (W, dW) in grads.items(): - sgd(W, dW, key=key) + proc.update(examples, sgd=None, losses=losses, **component_cfg[name]) + if sgd not in (None, False): + for name, proc in self.pipeline: + if ( + name not in exclude + and hasattr(proc, "is_trainable") + and proc.is_trainable + and proc.model not in (True, False, None) + ): + proc.finish_update(sgd) + return losses - def rehearse(self, docs, sgd=None, losses=None, config=None): + def rehearse( + self, + examples: Iterable[Example], + *, + sgd: Optional[Optimizer] = None, + losses: Optional[Dict[str, float]] = None, + component_cfg: Optional[Dict[str, Dict[str, Any]]] = None, + exclude: Iterable[str] = SimpleFrozenList(), + ) -> Dict[str, float]: """Make a "rehearsal" update to the models in the pipeline, to prevent forgetting. Rehearsal updates run an initial copy of the model over some data, and update the model so its current predictions are more like the initial ones. This is useful for keeping a pretrained model on-track, even if you're updating it with a smaller set of examples. - docs (iterable): A batch of `Doc` objects. - drop (float): The dropout rate. - sgd (callable): An optimizer. + examples (Iterable[Example]): A batch of `Example` objects. + sgd (Optional[Optimizer]): An optimizer. + component_cfg (Dict[str, Dict]): Config parameters for specific pipeline + components, keyed by component name. + exclude (Iterable[str]): Names of components that shouldn't be updated. RETURNS (dict): Results from the update. 
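`update` now consumes `Example` objects instead of `(doc, gold)` pairs and returns the losses dict. A minimal sketch with made-up text and entity annotations:

# Minimal sketch: updating a pipeline with Example objects (text and label are made up).
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
nlp.add_pipe("ner")

doc = nlp.make_doc("I like my iPhone")
example = Example.from_dict(doc, {"entities": [(10, 16, "GADGET")]})

optimizer = nlp.initialize(lambda: [example])   # labels are inferred from the examples
losses = nlp.update([example], sgd=optimizer)
print(losses)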
EXAMPLE: >>> raw_text_batches = minibatch(raw_texts) - >>> for labelled_batch in minibatch(zip(train_docs, train_golds)): - >>> docs, golds = zip(*train_docs) - >>> nlp.update(docs, golds) - >>> raw_batch = [nlp.make_doc(text) for text in next(raw_text_batches)] + >>> for labelled_batch in minibatch(examples): + >>> nlp.update(labelled_batch) + >>> raw_batch = [Example.from_dict(nlp.make_doc(text), {}) for text in next(raw_text_batches)] >>> nlp.rehearse(raw_batch) + + DOCS: https://nightly.spacy.io/api/language#rehearse """ - # TODO: document - if len(docs) == 0: + if len(examples) == 0: return + validate_examples(examples, "Language.rehearse") if sgd is None: if self._optimizer is None: - self._optimizer = create_default_optimizer(Model.ops) + self._optimizer = self.create_optimizer() sgd = self._optimizer - docs = list(docs) - for i, doc in enumerate(docs): - if isinstance(doc, basestring_): - docs[i] = self.make_doc(doc) pipes = list(self.pipeline) random.shuffle(pipes) - if config is None: - config = {} + if component_cfg is None: + component_cfg = {} grads = {} def get_grads(W, dW, key=None): grads[key] = (W, dW) - get_grads.alpha = sgd.alpha + get_grads.learn_rate = sgd.learn_rate get_grads.b1 = sgd.b1 get_grads.b2 = sgd.b2 for name, proc in pipes: - if not hasattr(proc, "rehearse"): + if name in exclude or not hasattr(proc, "rehearse"): continue grads = {} - proc.rehearse(docs, sgd=get_grads, losses=losses, **config.get(name, {})) - for key, (W, dW) in grads.items(): - sgd(W, dW, key=key) + proc.rehearse( + examples, sgd=get_grads, losses=losses, **component_cfg.get(name, {}) + ) + for key, (W, dW) in grads.items(): + sgd(W, dW, key=key) return losses - def preprocess_gold(self, docs_golds): - """Can be called before training to pre-process gold data. By default, - it handles nonprojectivity and adds missing tags to the tag map. + def begin_training( + self, + get_examples: Optional[Callable[[], Iterable[Example]]] = None, + *, + sgd: Optional[Optimizer] = None, + ) -> Optimizer: + warnings.warn(Warnings.W089, DeprecationWarning) + return self.initialize(get_examples, sgd=sgd) - docs_golds (iterable): Tuples of `Doc` and `GoldParse` objects. - YIELDS (tuple): Tuples of preprocessed `Doc` and `GoldParse` objects. + def initialize( + self, + get_examples: Optional[Callable[[], Iterable[Example]]] = None, + *, + sgd: Optional[Optimizer] = None, + ) -> Optimizer: + """Initialize the pipe for training, using data examples if available. + + get_examples (Callable[[], Iterable[Example]]): Optional function that + returns gold-standard Example objects. + sgd (Optional[Optimizer]): An optimizer to use for updates. If not + provided, will be created using the .create_optimizer() method. + RETURNS (thinc.api.Optimizer): The optimizer. 
+ + DOCS: https://nightly.spacy.io/api/language#initialize """ + if get_examples is None: + util.logger.debug( + "No 'get_examples' callback provided to 'Language.initialize', creating dummy examples" + ) + doc = Doc(self.vocab, words=["x", "y", "z"]) + get_examples = lambda: [Example.from_dict(doc, {})] + if not hasattr(get_examples, "__call__"): + err = Errors.E930.format( + method="Language.initialize", obj=type(get_examples) + ) + raise TypeError(err) + # Make sure the config is interpolated so we can resolve subsections + config = self.config.interpolate() + # These are the settings provided in the [initialize] block in the config + I = registry.resolve(config["initialize"], schema=ConfigSchemaInit) + init_vocab( + self, data=I["vocab_data"], lookups=I["lookups"], vectors=I["vectors"] + ) + pretrain_cfg = config.get("pretraining") + if pretrain_cfg: + P = registry.resolve(pretrain_cfg, schema=ConfigSchemaPretrain) + init_tok2vec(self, P, I) + if self.vocab.vectors.data.shape[1] >= 1: + ops = get_current_ops() + self.vocab.vectors.data = ops.asarray(self.vocab.vectors.data) + if hasattr(self.tokenizer, "initialize"): + tok_settings = validate_init_settings( + self.tokenizer.initialize, + I["tokenizer"], + section="tokenizer", + name="tokenizer", + ) + self.tokenizer.initialize(get_examples, nlp=self, **tok_settings) for name, proc in self.pipeline: - if hasattr(proc, "preprocess_gold"): - docs_golds = proc.preprocess_gold(docs_golds) - for doc, gold in docs_golds: - yield doc, gold - - def begin_training(self, get_gold_tuples=None, sgd=None, component_cfg=None, **cfg): - """Allocate models, pre-process training data and acquire a trainer and - optimizer. Used as a contextmanager. - - get_gold_tuples (function): Function returning gold data - component_cfg (dict): Config parameters for specific components. - **cfg: Config parameters. - RETURNS: An optimizer. - - DOCS: https://spacy.io/api/language#begin_training - """ - if get_gold_tuples is None: - get_gold_tuples = lambda: [] - # Populate vocab - else: - for _, annots_brackets in get_gold_tuples(): - _ = annots_brackets.pop() - for annots, _ in annots_brackets: - for word in annots[1]: - _ = self.vocab[word] # noqa: F841 - if cfg.get("device", -1) >= 0: - util.use_gpu(cfg["device"]) - if self.vocab.vectors.data.shape[1] >= 1: - self.vocab.vectors.data = Model.ops.asarray(self.vocab.vectors.data) - link_vectors_to_models(self.vocab) - if self.vocab.vectors.data.shape[1]: - cfg["pretrained_vectors"] = self.vocab.vectors.name - cfg["pretrained_dims"] = self.vocab.vectors.data.shape[1] - if sgd is None: - sgd = create_default_optimizer(Model.ops) - self._optimizer = sgd - if component_cfg is None: - component_cfg = {} - for name, proc in self.pipeline: - if hasattr(proc, "begin_training"): - kwargs = component_cfg.get(name, {}) - kwargs.update(cfg) - proc.begin_training( - get_gold_tuples, - pipeline=self.pipeline, - sgd=self._optimizer, - **kwargs + if hasattr(proc, "initialize"): + p_settings = I["components"].get(name, {}) + p_settings = validate_init_settings( + proc.initialize, p_settings, section="components", name=name ) + proc.initialize(get_examples, nlp=self, **p_settings) + self._link_components() + self._optimizer = sgd + if sgd is not None: + self._optimizer = sgd + elif self._optimizer is None: + self._optimizer = self.create_optimizer() return self._optimizer - def resume_training(self, sgd=None, **cfg): + def resume_training(self, *, sgd: Optional[Optimizer] = None) -> Optimizer: """Continue training a pretrained model. 
Create and return an optimizer, and initialize "rehearsal" for any pipeline component that has a .rehearse() method. Rehearsal is used to prevent - models from "forgetting" their initialised "knowledge". To perform + models from "forgetting" their initialized "knowledge". To perform rehearsal, collect samples of text you want the models to retain performance - on, and call nlp.rehearse() with a batch of Doc objects. + on, and call nlp.rehearse() with a batch of Example objects. + + RETURNS (Optimizer): The optimizer. + + DOCS: https://nightly.spacy.io/api/language#resume_training """ - if cfg.get("device", -1) >= 0: - util.use_gpu(cfg["device"]) - if self.vocab.vectors.data.shape[1] >= 1: - self.vocab.vectors.data = Model.ops.asarray(self.vocab.vectors.data) - link_vectors_to_models(self.vocab) - if self.vocab.vectors.data.shape[1]: - cfg["pretrained_vectors"] = self.vocab.vectors.name - if sgd is None: - sgd = create_default_optimizer(Model.ops) - self._optimizer = sgd + ops = get_current_ops() + if self.vocab.vectors.data.shape[1] >= 1: + self.vocab.vectors.data = ops.asarray(self.vocab.vectors.data) for name, proc in self.pipeline: if hasattr(proc, "_rehearsal_model"): proc._rehearsal_model = deepcopy(proc.model) + if sgd is not None: + self._optimizer = sgd + elif self._optimizer is None: + self._optimizer = self.create_optimizer() return self._optimizer def evaluate( - self, docs_golds, verbose=False, batch_size=256, scorer=None, component_cfg=None - ): + self, + examples: Iterable[Example], + *, + batch_size: int = 256, + scorer: Optional[Scorer] = None, + component_cfg: Optional[Dict[str, Dict[str, Any]]] = None, + scorer_cfg: Optional[Dict[str, Any]] = None, + ) -> Dict[str, Union[float, dict]]: """Evaluate a model's pipeline components. - docs_golds (iterable): Tuples of `Doc` and `GoldParse` objects. - verbose (bool): Print debugging information. + examples (Iterable[Example]): `Example` objects. batch_size (int): Batch size to use. - scorer (Scorer): Optional `Scorer` to use. If not passed in, a new one + scorer (Optional[Scorer]): Scorer to use. If not passed in, a new one will be created. component_cfg (dict): An optional dictionary with extra keyword arguments for specific components. + scorer_cfg (dict): An optional dictionary with extra keyword arguments + for the scorer. RETURNS (Scorer): The scorer containing the evaluation results. 
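To make the reworked evaluate signature above concrete, a hedged sketch follows; it assumes a trained pipeline is installed (the en_core_web_sm name is an assumption) and that the score keys actually produced depend on the components present:

    import spacy
    from spacy.training import Example  # assumed v3 import path

    nlp = spacy.load("en_core_web_sm")  # any trained pipeline with a tagger
    dev_data = [
        ("I like green eggs", {"tags": ["PRON", "VERB", "ADJ", "NOUN"]}),
        ("Eat blue ham", {"tags": ["VERB", "ADJ", "NOUN"]}),
    ]
    examples = [Example.from_dict(nlp.make_doc(t), a) for t, a in dev_data]
    scores = nlp.evaluate(examples, batch_size=64)
    print(scores["speed"])  # words per second, computed by Language.evaluate
    print(scores)           # remaining keys come from the components' scorers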
- DOCS: https://spacy.io/api/language#evaluate + DOCS: https://nightly.spacy.io/api/language#evaluate """ - if scorer is None: - scorer = Scorer(pipeline=self.pipeline) + validate_examples(examples, "Language.evaluate") if component_cfg is None: component_cfg = {} - docs, golds = zip(*docs_golds) - docs = [ - self.make_doc(doc) if isinstance(doc, basestring_) else doc for doc in docs - ] - golds = list(golds) + if scorer_cfg is None: + scorer_cfg = {} + if scorer is None: + kwargs = dict(scorer_cfg) + kwargs.setdefault("nlp", self) + scorer = Scorer(**kwargs) + texts = [eg.reference.text for eg in examples] + docs = [eg.predicted for eg in examples] + start_time = timer() + # tokenize the texts only for timing purposes + if not hasattr(self.tokenizer, "pipe"): + _ = [self.tokenizer(text) for text in texts] # noqa: F841 + else: + _ = list(self.tokenizer.pipe(texts)) # noqa: F841 for name, pipe in self.pipeline: kwargs = component_cfg.get(name, {}) kwargs.setdefault("batch_size", batch_size) - if not hasattr(pipe, "pipe"): - docs = _pipe(docs, pipe, kwargs) - else: - docs = pipe.pipe(docs, **kwargs) - for doc, gold in zip(docs, golds): - if not isinstance(gold, GoldParse): - gold = GoldParse(doc, **gold) - if verbose: - print(doc) - kwargs = component_cfg.get("scorer", {}) - kwargs.setdefault("verbose", verbose) - scorer.score(doc, gold, **kwargs) - return scorer + docs = _pipe(docs, pipe, kwargs) + # iterate over the final generator + if len(self.pipeline): + docs = list(docs) + end_time = timer() + for i, (doc, eg) in enumerate(zip(docs, examples)): + util.logger.debug(doc) + eg.predicted = doc + results = scorer.score(examples) + n_words = sum(len(doc) for doc in docs) + results["speed"] = n_words / (end_time - start_time) + return results + + def create_optimizer(self): + """Create an optimizer, usually using the [training.optimizer] config.""" + subconfig = {"optimizer": self.config["training"]["optimizer"]} + return registry.resolve(subconfig)["optimizer"] @contextmanager - def use_params(self, params, **cfg): + def use_params(self, params: Optional[dict]): """Replace weights of models in the pipeline with those provided in the params dictionary. Can be used as a contextmanager, in which case, models go back to their original weights after the block. params (dict): A dictionary of parameters keyed by model ID. - **cfg: Config parameters. EXAMPLE: >>> with nlp.use_params(optimizer.averages): - >>> nlp.to_disk('/tmp/checkpoint') + >>> nlp.to_disk("/tmp/checkpoint") + + DOCS: https://nightly.spacy.io/api/language#use_params """ - contexts = [ - pipe.use_params(params) - for name, pipe in self.pipeline - if hasattr(pipe, "use_params") - ] - # TODO: Having trouble with contextlib - # Workaround: these aren't actually context managers atm. - for context in contexts: - try: - next(context) - except StopIteration: - pass - yield - for context in contexts: - try: - next(context) - except StopIteration: - pass + if not params: + yield + else: + contexts = [ + pipe.use_params(params) + for name, pipe in self.pipeline + if hasattr(pipe, "use_params") and hasattr(pipe, "model") + ] + # TODO: Having trouble with contextlib + # Workaround: these aren't actually context managers atm. 
+ for context in contexts: + try: + next(context) + except StopIteration: + pass + yield + for context in contexts: + try: + next(context) + except StopIteration: + pass def pipe( self, - texts, - as_tuples=False, - n_threads=-1, - batch_size=1000, - disable=[], - cleanup=False, - component_cfg=None, - n_process=1, + texts: Iterable[str], + *, + as_tuples: bool = False, + batch_size: int = 1000, + disable: Iterable[str] = SimpleFrozenList(), + component_cfg: Optional[Dict[str, Dict[str, Any]]] = None, + n_process: int = 1, ): """Process texts as a stream, and yield `Doc` objects in order. - texts (iterable): A sequence of texts to process. + texts (Iterable[str]): A sequence of texts to process. as_tuples (bool): If set to True, inputs should be a sequence of (text, context) tuples. Output will then be a sequence of (doc, context) tuples. Defaults to False. batch_size (int): The number of texts to buffer. - disable (list): Names of the pipeline components to disable. - cleanup (bool): If True, unneeded strings are freed to control memory - use. Experimental. - component_cfg (dict): An optional dictionary with extra keyword + disable (List[str]): Names of the pipeline components to disable. + component_cfg (Dict[str, Dict]): An optional dictionary with extra keyword arguments for specific components. - n_process (int): Number of processors to process texts, only supported - in Python3. If -1, set `multiprocessing.cpu_count()`. + n_process (int): Number of processors to process texts. If -1, set `multiprocessing.cpu_count()`. YIELDS (Doc): Documents in the order of the original text. - DOCS: https://spacy.io/api/language#pipe + DOCS: https://nightly.spacy.io/api/language#pipe """ - if is_python2 and n_process != 1: - warnings.warn(Warnings.W023) - n_process = 1 - if n_threads != -1: - warnings.warn(Warnings.W016, DeprecationWarning) if n_process == -1: n_process = mp.cpu_count() if as_tuples: @@ -785,7 +1398,7 @@ class Language(object): n_process=n_process, component_cfg=component_cfg, ) - for doc, context in izip(docs, contexts): + for doc, context in zip(docs, contexts): yield (doc, context) return if component_cfg is None: @@ -800,11 +1413,7 @@ class Language(object): kwargs = component_cfg.get(name, {}) # Allow component_cfg to overwrite the top-level kwargs. kwargs.setdefault("batch_size", batch_size) - if hasattr(proc, "pipe"): - f = functools.partial(proc.pipe, **kwargs) - else: - # Apply the function, but yield the doc - f = functools.partial(_pipe, proc=proc, kwargs=kwargs) + f = functools.partial(_pipe, proc=proc, kwargs=kwargs) pipes.append(f) if n_process != 1: @@ -814,38 +1423,16 @@ class Language(object): docs = (self.make_doc(text) for text in texts) for pipe in pipes: docs = pipe(docs) - - # Track weakrefs of "recent" documents, so that we can see when they - # expire from memory. When they do, we know we don't need old strings. - # This way, we avoid maintaining an unbounded growth in string entries - # in the string store. - recent_refs = weakref.WeakSet() - old_refs = weakref.WeakSet() - # Keep track of the original string data, so that if we flush old strings, - # we can recover the original ones. However, we only want to do this if we're - # really adding strings, to save up-front costs. 
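Since the n_threads and cleanup arguments are removed from pipe above, a brief hedged sketch of the remaining call surface; the blank pipeline and the context dicts are illustrative only:

    import spacy

    nlp = spacy.blank("en")
    data = [("A first text.", {"id": 1}), ("A second text.", {"id": 2})]
    # as_tuples=True threads arbitrary context values through unchanged
    for doc, context in nlp.pipe(data, as_tuples=True, batch_size=50):
        print(context["id"], len(doc))
    # n_process=-1 would use multiprocessing.cpu_count() worker processes
    docs = list(nlp.pipe((text for text, _ in data), n_process=1))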
- original_strings_data = None - nr_seen = 0 for doc in docs: yield doc - if cleanup: - recent_refs.add(doc) - if nr_seen < 10000: - old_refs.add(doc) - nr_seen += 1 - elif len(old_refs) == 0: - old_refs, recent_refs = recent_refs, old_refs - if original_strings_data is None: - original_strings_data = list(self.vocab.strings) - else: - keys, strings = self.vocab.strings._cleanup_stale_strings( - original_strings_data - ) - self.vocab._reset_cache(keys, strings) - self.tokenizer._reset_cache(keys) - nr_seen = 0 - def _multiprocessing_pipe(self, texts, pipes, n_process, batch_size): + def _multiprocessing_pipe( + self, + texts: Iterable[str], + pipes: Iterable[Callable[[Doc], Doc]], + n_process: int, + batch_size: int, + ) -> None: # raw_texts is used later to stop iteration. texts, raw_texts = itertools.tee(texts) # for sending texts to worker @@ -867,14 +1454,7 @@ class Language(object): procs = [ mp.Process( target=_apply_pipes, - args=( - self.make_doc, - pipes, - rch, - sch, - Underscore.get_state(), - load_nlp.VECTORS, - ), + args=(self.make_doc, pipes, rch, sch, Underscore.get_state()), ) for rch, sch in zip(texts_q, bytedocs_send_ch) ] @@ -895,29 +1475,177 @@ class Language(object): for proc in procs: proc.terminate() - def to_disk(self, path, exclude=tuple(), disable=None): + def _link_components(self) -> None: + """Register 'listeners' within pipeline components, to allow them to + effectively share weights. + """ + # I had though, "Why do we do this inside the Language object? Shouldn't + # it be the tok2vec/transformer/etc's job? + # The problem is we need to do it during deserialization...And the + # components don't receive the pipeline then. So this does have to be + # here :( + for i, (name1, proc1) in enumerate(self.pipeline): + if hasattr(proc1, "find_listeners"): + for name2, proc2 in self.pipeline[i + 1 :]: + if isinstance(getattr(proc2, "model", None), Model): + proc1.find_listeners(proc2.model) + + @classmethod + def from_config( + cls, + config: Union[Dict[str, Any], Config] = {}, + *, + vocab: Union[Vocab, bool] = True, + disable: Iterable[str] = SimpleFrozenList(), + exclude: Iterable[str] = SimpleFrozenList(), + meta: Dict[str, Any] = SimpleFrozenDict(), + auto_fill: bool = True, + validate: bool = True, + ) -> "Language": + """Create the nlp object from a loaded config. Will set up the tokenizer + and language data, add pipeline components etc. If no config is provided, + the default config of the given language is used. + + config (Dict[str, Any] / Config): The loaded config. + vocab (Vocab): A Vocab object. If True, a vocab is created. + disable (Iterable[str]): Names of pipeline components to disable. + Disabled pipes will be loaded but they won't be run unless you + explicitly enable them by calling nlp.enable_pipe. + exclude (Iterable[str]): Names of pipeline components to exclude. + Excluded components won't be loaded. + meta (Dict[str, Any]): Meta overrides for nlp.meta. + auto_fill (bool): Automatically fill in missing values in config based + on defaults and function argument annotations. + validate (bool): Validate the component config and arguments against + the types expected by the factory. + RETURNS (Language): The initialized Language class. 
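A minimal, hedged sketch of driving the new from_config constructor documented above; it reuses the default config of a blank pipeline rather than a hand-written config file, and the English subclass is chosen so the lang codes match:

    import spacy
    from spacy.lang.en import English

    config = spacy.blank("en").config          # a fully filled default config
    nlp = English.from_config(config, auto_fill=True, validate=True)
    print(nlp.pipe_names, nlp.config["nlp"]["lang"])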
+ + DOCS: https://nightly.spacy.io/api/language#from_config + """ + if auto_fill: + config = Config( + cls.default_config, section_order=CONFIG_SECTION_ORDER + ).merge(config) + if "nlp" not in config: + raise ValueError(Errors.E985.format(config=config)) + config_lang = config["nlp"].get("lang") + if config_lang is not None and config_lang != cls.lang: + raise ValueError( + Errors.E958.format( + bad_lang_code=config["nlp"]["lang"], + lang_code=cls.lang, + lang=util.get_object_name(cls), + ) + ) + config["nlp"]["lang"] = cls.lang + # This isn't very elegant, but we remove the [components] block here to prevent + # it from getting resolved (causes problems because we expect to pass in + # the nlp and name args for each component). If we're auto-filling, we're + # using the nlp.config with all defaults. + config = util.copy_config(config) + orig_pipeline = config.pop("components", {}) + config["components"] = {} + if auto_fill: + filled = registry.fill(config, validate=validate, schema=ConfigSchema) + else: + filled = config + filled["components"] = orig_pipeline + config["components"] = orig_pipeline + resolved_nlp = registry.resolve( + filled["nlp"], validate=validate, schema=ConfigSchemaNlp + ) + create_tokenizer = resolved_nlp["tokenizer"] + before_creation = resolved_nlp["before_creation"] + after_creation = resolved_nlp["after_creation"] + after_pipeline_creation = resolved_nlp["after_pipeline_creation"] + lang_cls = cls + if before_creation is not None: + lang_cls = before_creation(cls) + if ( + not isinstance(lang_cls, type) + or not issubclass(lang_cls, cls) + or lang_cls is not cls + ): + raise ValueError(Errors.E943.format(value=type(lang_cls))) + # Note that we don't load vectors here, instead they get loaded explicitly + # inside stuff like the spacy train function. If we loaded them here, + # then we would load them twice at runtime: once when we make from config, + # and then again when we load from disk. + nlp = lang_cls(vocab=vocab, create_tokenizer=create_tokenizer, meta=meta) + if after_creation is not None: + nlp = after_creation(nlp) + if not isinstance(nlp, cls): + raise ValueError(Errors.E942.format(name="creation", value=type(nlp))) + # To create the components we need to use the final interpolated config + # so all values are available (if component configs use variables). + # Later we replace the component config with the raw config again. 
+ interpolated = filled.interpolate() if not filled.is_interpolated else filled + pipeline = interpolated.get("components", {}) + # If components are loaded from a source (existing models), we cache + # them here so they're only loaded once + source_nlps = {} + for pipe_name in config["nlp"]["pipeline"]: + if pipe_name not in pipeline: + opts = ", ".join(pipeline.keys()) + raise ValueError(Errors.E956.format(name=pipe_name, opts=opts)) + pipe_cfg = util.copy_config(pipeline[pipe_name]) + raw_config = Config(filled["components"][pipe_name]) + if pipe_name not in exclude: + if "factory" not in pipe_cfg and "source" not in pipe_cfg: + err = Errors.E984.format(name=pipe_name, config=pipe_cfg) + raise ValueError(err) + if "factory" in pipe_cfg: + factory = pipe_cfg.pop("factory") + # The pipe name (key in the config) here is the unique name + # of the component, not necessarily the factory + nlp.add_pipe( + factory, + name=pipe_name, + config=pipe_cfg, + validate=validate, + raw_config=raw_config, + ) + else: + model = pipe_cfg["source"] + if model not in source_nlps: + # We only need the components here and we need to init + # model with the same vocab as the current nlp object + source_nlps[model] = util.load_model( + model, vocab=nlp.vocab, disable=["vocab", "tokenizer"] + ) + source_name = pipe_cfg.get("component", pipe_name) + nlp.add_pipe(source_name, source=source_nlps[model], name=pipe_name) + disabled_pipes = [*config["nlp"]["disabled"], *disable] + nlp._disabled = set(p for p in disabled_pipes if p not in exclude) + nlp.config = filled if auto_fill else config + if after_pipeline_creation is not None: + nlp = after_pipeline_creation(nlp) + if not isinstance(nlp, cls): + raise ValueError( + Errors.E942.format(name="pipeline_creation", value=type(nlp)) + ) + return nlp + + def to_disk( + self, path: Union[str, Path], *, exclude: Iterable[str] = SimpleFrozenList() + ) -> None: """Save the current state to a directory. If a model is loaded, this will include the model. - path (unicode or Path): Path to a directory, which will be created if + path (str / Path): Path to a directory, which will be created if it doesn't exist. exclude (list): Names of components or serialization fields to exclude. - DOCS: https://spacy.io/api/language#to_disk + DOCS: https://nightly.spacy.io/api/language#to_disk """ - if disable is not None: - warnings.warn(Warnings.W014, DeprecationWarning) - exclude = disable path = util.ensure_path(path) - serializers = OrderedDict() + serializers = {} serializers["tokenizer"] = lambda p: self.tokenizer.to_disk( p, exclude=["vocab"] ) serializers["meta.json"] = lambda p: srsly.write_json(p, self.meta) - - for name, proc in self.pipeline: - if not hasattr(proc, "name"): - continue + serializers["config.cfg"] = lambda p: self.config.to_disk(p) + for name, proc in self._components: if name in exclude: continue if not hasattr(proc, "to_disk"): @@ -926,18 +1654,21 @@ class Language(object): serializers["vocab"] = lambda p: self.vocab.to_disk(p) util.to_disk(path, serializers, exclude) - def from_disk(self, path, exclude=tuple(), disable=None): + def from_disk( + self, path: Union[str, Path], *, exclude: Iterable[str] = SimpleFrozenList() + ) -> "Language": """Loads state from a directory. Modifies the object in place and returns it. If the saved `Language` object contains a model, the model will be loaded. - path (unicode or Path): A path to a directory. + path (str / Path): A path to a directory. exclude (list): Names of components or serialization fields to exclude. 
RETURNS (Language): The modified `Language` object. - DOCS: https://spacy.io/api/language#from_disk + DOCS: https://nightly.spacy.io/api/language#from_disk """ - def deserialize_meta(path): + + def deserialize_meta(path: Path) -> None: if path.exists(): data = srsly.read_json(path) self.meta.update(data) @@ -945,22 +1676,22 @@ class Language(object): # from self.vocab.vectors, so set the name directly self.vocab.vectors.name = data.get("vectors", {}).get("name") - def deserialize_vocab(path): + def deserialize_vocab(path: Path) -> None: if path.exists(): self.vocab.from_disk(path) - _fix_pretrained_vectors_name(self) - if disable is not None: - warnings.warn(Warnings.W014, DeprecationWarning) - exclude = disable path = util.ensure_path(path) - deserializers = OrderedDict() + deserializers = {} + if Path(path / "config.cfg").exists(): + deserializers["config.cfg"] = lambda p: self.config.from_disk( + p, interpolate=False + ) deserializers["meta.json"] = deserialize_meta deserializers["vocab"] = deserialize_vocab deserializers["tokenizer"] = lambda p: self.tokenizer.from_disk( p, exclude=["vocab"] ) - for name, proc in self.pipeline: + for name, proc in self._components: if name in exclude: continue if not hasattr(proc, "from_disk"): @@ -973,43 +1704,42 @@ class Language(object): exclude = list(exclude) + ["vocab"] util.from_disk(path, deserializers, exclude) self._path = path + self._link_components() return self - def to_bytes(self, exclude=tuple(), disable=None, **kwargs): + def to_bytes(self, *, exclude: Iterable[str] = SimpleFrozenList()) -> bytes: """Serialize the current state to a binary string. exclude (list): Names of components or serialization fields to exclude. RETURNS (bytes): The serialized form of the `Language` object. - DOCS: https://spacy.io/api/language#to_bytes + DOCS: https://nightly.spacy.io/api/language#to_bytes """ - if disable is not None: - warnings.warn(Warnings.W014, DeprecationWarning) - exclude = disable - serializers = OrderedDict() + serializers = {} serializers["vocab"] = lambda: self.vocab.to_bytes() serializers["tokenizer"] = lambda: self.tokenizer.to_bytes(exclude=["vocab"]) - serializers["meta.json"] = lambda: srsly.json_dumps( - OrderedDict(sorted(self.meta.items())) - ) - for name, proc in self.pipeline: + serializers["meta.json"] = lambda: srsly.json_dumps(self.meta) + serializers["config.cfg"] = lambda: self.config.to_bytes() + for name, proc in self._components: if name in exclude: continue if not hasattr(proc, "to_bytes"): continue serializers[name] = lambda proc=proc: proc.to_bytes(exclude=["vocab"]) - exclude = util.get_serialization_exclude(serializers, exclude, kwargs) return util.to_bytes(serializers, exclude) - def from_bytes(self, bytes_data, exclude=tuple(), disable=None, **kwargs): + def from_bytes( + self, bytes_data: bytes, *, exclude: Iterable[str] = SimpleFrozenList() + ) -> "Language": """Load state from a binary string. bytes_data (bytes): The data to load from. exclude (list): Names of components or serialization fields to exclude. RETURNS (Language): The `Language` object. 
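For the updated serialization methods above, which now also write config.cfg next to meta.json, a hedged round-trip sketch; the temporary directory path is an assumption:

    import spacy

    nlp = spacy.blank("en")
    data = nlp.to_bytes()                      # vocab, tokenizer, meta and config
    nlp2 = spacy.blank("en").from_bytes(data)  # target must have a compatible pipeline
    nlp.to_disk("/tmp/blank_en")               # writes config.cfg alongside meta.json
    nlp3 = spacy.blank("en").from_disk("/tmp/blank_en")
    assert nlp2.config["nlp"]["lang"] == nlp3.config["nlp"]["lang"] == "en"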
- DOCS: https://spacy.io/api/language#from_bytes + DOCS: https://nightly.spacy.io/api/language#from_bytes """ + def deserialize_meta(b): data = srsly.json_loads(b) self.meta.update(data) @@ -1017,20 +1747,16 @@ class Language(object): # from self.vocab.vectors, so set the name directly self.vocab.vectors.name = data.get("vectors", {}).get("name") - def deserialize_vocab(b): - self.vocab.from_bytes(b) - _fix_pretrained_vectors_name(self) - - if disable is not None: - warnings.warn(Warnings.W014, DeprecationWarning) - exclude = disable - deserializers = OrderedDict() + deserializers = {} + deserializers["config.cfg"] = lambda b: self.config.from_bytes( + b, interpolate=False + ) deserializers["meta.json"] = deserialize_meta - deserializers["vocab"] = deserialize_vocab + deserializers["vocab"] = self.vocab.from_bytes deserializers["tokenizer"] = lambda b: self.tokenizer.from_bytes( b, exclude=["vocab"] ) - for name, proc in self.pipeline: + for name, proc in self._components: if name in exclude: continue if not hasattr(proc, "from_bytes"): @@ -1038,90 +1764,38 @@ class Language(object): deserializers[name] = lambda b, proc=proc: proc.from_bytes( b, exclude=["vocab"] ) - exclude = util.get_serialization_exclude(deserializers, exclude, kwargs) util.from_bytes(bytes_data, deserializers, exclude) + self._link_components() return self -class component(object): - """Decorator for pipeline components. Can decorate both function components - and class components and will automatically register components in the - Language.factories. If the component is a class and needs access to the - nlp object or config parameters, it can expose a from_nlp classmethod - that takes the nlp object and **cfg arguments and returns the initialized - component. +@dataclass +class FactoryMeta: + """Dataclass containing information about a component and its defaults + provided by the @Language.component or @Language.factory decorator. It's + created whenever a component is defined and stored on the Language class for + each component instance and factory instance. """ - # NB: This decorator needs to live here, because it needs to write to - # Language.factories. All other solutions would cause circular import. - - def __init__(self, name=None, assigns=tuple(), requires=tuple(), retokenizes=False): - """Decorate a pipeline component. - - name (unicode): Default component and factory name. - assigns (list): Attributes assigned by component, e.g. `["token.pos"]`. - requires (list): Attributes required by component, e.g. `["token.dep"]`. - retokenizes (bool): Whether the component changes the tokenization. 
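The old @component decorator removed here is superseded by class-level registration, which is what populates the FactoryMeta dataclass above; a hedged sketch of the replacement, with the component name invented for illustration:

    import spacy
    from spacy.language import Language

    @Language.component("debug_lengths", retokenizes=False)
    def debug_lengths(doc):
        # A stateless function component: receives and returns the Doc
        print("tokens:", len(doc))
        return doc

    nlp = spacy.blank("en")
    nlp.add_pipe("debug_lengths")
    doc = nlp("This runs the registered function component.")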
- """ - self.name = name - self.assigns = validate_attrs(assigns) - self.requires = validate_attrs(requires) - self.retokenizes = retokenizes - - def __call__(self, *args, **kwargs): - obj = args[0] - args = args[1:] - factory_name = self.name or util.get_component_name(obj) - obj.name = factory_name - obj.factory = factory_name - obj.assigns = self.assigns - obj.requires = self.requires - obj.retokenizes = self.retokenizes - - def factory(nlp, **cfg): - if hasattr(obj, "from_nlp"): - return obj.from_nlp(nlp, **cfg) - elif isinstance(obj, class_types): - return obj() - return obj - - Language.factories[obj.factory] = factory - return obj - - -def _fix_pretrained_vectors_name(nlp): - # TODO: Replace this once we handle vectors consistently as static - # data - if "vectors" in nlp.meta and "name" in nlp.meta["vectors"]: - nlp.vocab.vectors.name = nlp.meta["vectors"]["name"] - elif not nlp.vocab.vectors.size: - nlp.vocab.vectors.name = None - elif "name" in nlp.meta and "lang" in nlp.meta: - vectors_name = "%s_%s.vectors" % (nlp.meta["lang"], nlp.meta["name"]) - nlp.vocab.vectors.name = vectors_name - else: - raise ValueError(Errors.E092) - if nlp.vocab.vectors.size != 0: - link_vectors_to_models(nlp.vocab, skip_rank=True) - for name, proc in nlp.pipeline: - if not hasattr(proc, "cfg"): - continue - proc.cfg.setdefault("deprecation_fixes", {}) - proc.cfg["deprecation_fixes"]["vectors_name"] = nlp.vocab.vectors.name + factory: str + default_config: Optional[Dict[str, Any]] = None # noqa: E704 + assigns: Iterable[str] = tuple() + requires: Iterable[str] = tuple() + retokenizes: bool = False + scores: Iterable[str] = tuple() + default_score_weights: Optional[Dict[str, float]] = None # noqa: E704 class DisabledPipes(list): """Manager for temporary pipeline disabling.""" - def __init__(self, nlp, *names): + def __init__(self, nlp: Language, names: List[str]) -> None: self.nlp = nlp self.names = names - # Important! Not deep copy -- we just want the container (but we also - # want to support people providing arbitrarily typed nlp.pipeline - # objects.) - self.original_pipeline = copy(nlp.pipeline) + for name in self.names: + self.nlp.disable_pipe(name) list.__init__(self) - self.extend(nlp.remove_pipe(name) for name in names) + self.extend(self.names) def __enter__(self): return self @@ -1129,40 +1803,34 @@ class DisabledPipes(list): def __exit__(self, *args): self.restore() - def restore(self): + def restore(self) -> None: """Restore the pipeline to its state when DisabledPipes was created.""" - current, self.nlp.pipeline = self.nlp.pipeline, self.original_pipeline - unexpected = [name for name, pipe in current if not self.nlp.has_pipe(name)] - if unexpected: - # Don't change the pipeline if we're raising an error. - self.nlp.pipeline = current - raise ValueError(Errors.E008.format(names=unexpected)) + for name in self.names: + if name not in self.nlp.component_names: + raise ValueError(Errors.E008.format(name=name)) + self.nlp.enable_pipe(name) self[:] = [] -def _pipe(docs, proc, kwargs): - # We added some args for pipe that __call__ doesn't expect. 
- kwargs = dict(kwargs) - for arg in ["n_threads", "batch_size"]: - if arg in kwargs: - kwargs.pop(arg) - for doc in docs: - doc = proc(doc, **kwargs) - yield doc - - -def _apply_pipes(make_doc, pipes, receiver, sender, underscore_state, vectors): +def _apply_pipes( + make_doc: Callable[[str], Doc], + pipes: Iterable[Callable[[Doc], Doc]], + receiver, + sender, + underscore_state: Tuple[dict, dict, dict], +) -> None: """Worker for Language.pipe + make_doc (Callable[[str,] Doc]): Function to create Doc from text. + pipes (Iterable[Callable[[Doc], Doc]]): The components to apply. receiver (multiprocessing.Connection): Pipe to receive text. Usually created by `multiprocessing.Pipe()` sender (multiprocessing.Connection): Pipe to send doc. Usually created by `multiprocessing.Pipe()` - underscore_state (tuple): The data in the Underscore class of the parent - vectors (dict): The global vectors data, copied from the parent + underscore_state (Tuple[dict, dict, dict]): The data in the Underscore class + of the parent. """ Underscore.load_state(underscore_state) - load_nlp.VECTORS = vectors while True: texts = receiver.get() docs = (make_doc(text) for text in texts) @@ -1175,13 +1843,15 @@ def _apply_pipes(make_doc, pipes, receiver, sender, underscore_state, vectors): class _Sender: """Util for sending data to multiprocessing workers in Language.pipe""" - def __init__(self, data, queues, chunk_size): + def __init__( + self, data: Iterable[Any], queues: List[mp.Queue], chunk_size: int + ) -> None: self.data = iter(data) self.queues = iter(cycle(queues)) self.chunk_size = chunk_size self.count = 0 - def send(self): + def send(self) -> None: """Send chunk_size items from self.data to channels.""" for item, q in itertools.islice( zip(self.data, cycle(self.queues)), self.chunk_size @@ -1189,10 +1859,10 @@ class _Sender: # cycle channels so that distribute the texts evenly q.put(item) - def step(self): - """Tell sender that comsumed one item. - - Data is sent to the workers after every chunk_size calls.""" + def step(self) -> None: + """Tell sender that comsumed one item. Data is sent to the workers after + every chunk_size calls. + """ self.count += 1 if self.count >= self.chunk_size: self.count = 0 diff --git a/spacy/lemmatizer.py b/spacy/lemmatizer.py deleted file mode 100644 index 8b2375257..000000000 --- a/spacy/lemmatizer.py +++ /dev/null @@ -1,140 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from collections import OrderedDict - -from .symbols import NOUN, VERB, ADJ, PUNCT, PROPN -from .errors import Errors -from .lookups import Lookups -from .parts_of_speech import NAMES as UPOS_NAMES - - -class Lemmatizer(object): - """ - The Lemmatizer supports simple part-of-speech-sensitive suffix rules and - lookup tables. - - DOCS: https://spacy.io/api/lemmatizer - """ - - @classmethod - def load(cls, *args, **kwargs): - raise NotImplementedError(Errors.E172) - - def __init__(self, lookups, is_base_form=None, *args, **kwargs): - """Initialize a Lemmatizer. - - lookups (Lookups): The lookups object containing the (optional) tables - "lemma_rules", "lemma_index", "lemma_exc" and "lemma_lookup". - RETURNS (Lemmatizer): The newly constructed object. - """ - if args or kwargs or not isinstance(lookups, Lookups): - raise ValueError(Errors.E173) - self.lookups = lookups - self.is_base_form = is_base_form - - def __call__(self, string, univ_pos, morphology=None): - """Lemmatize a string. - - string (unicode): The string to lemmatize, e.g. the token text. 
- univ_pos (unicode / int): The token's universal part-of-speech tag. - morphology (dict): The token's morphological features following the - Universal Dependencies scheme. - RETURNS (list): The available lemmas for the string. - """ - lookup_table = self.lookups.get_table("lemma_lookup", {}) - if "lemma_rules" not in self.lookups: - return [lookup_table.get(string, string)] - if isinstance(univ_pos, int): - univ_pos = UPOS_NAMES.get(univ_pos, "X") - univ_pos = univ_pos.lower() - - if univ_pos in ("", "eol", "space"): - return [string.lower()] - # See Issue #435 for example of where this logic is requied. - if callable(self.is_base_form) and self.is_base_form(univ_pos, morphology): - return [string.lower()] - index_table = self.lookups.get_table("lemma_index", {}) - exc_table = self.lookups.get_table("lemma_exc", {}) - rules_table = self.lookups.get_table("lemma_rules", {}) - if not any((index_table.get(univ_pos), exc_table.get(univ_pos), rules_table.get(univ_pos))): - if univ_pos == "propn": - return [string] - else: - return [string.lower()] - lemmas = self.lemmatize( - string, - index_table.get(univ_pos, {}), - exc_table.get(univ_pos, {}), - rules_table.get(univ_pos, []), - ) - return lemmas - - def noun(self, string, morphology=None): - return self(string, "noun", morphology) - - def verb(self, string, morphology=None): - return self(string, "verb", morphology) - - def adj(self, string, morphology=None): - return self(string, "adj", morphology) - - def det(self, string, morphology=None): - return self(string, "det", morphology) - - def pron(self, string, morphology=None): - return self(string, "pron", morphology) - - def adp(self, string, morphology=None): - return self(string, "adp", morphology) - - def num(self, string, morphology=None): - return self(string, "num", morphology) - - def punct(self, string, morphology=None): - return self(string, "punct", morphology) - - def lookup(self, string, orth=None): - """Look up a lemma in the table, if available. If no lemma is found, - the original string is returned. - - string (unicode): The original string. - orth (int): Optional hash of the string to look up. If not set, the - string will be used and hashed. - RETURNS (unicode): The lemma if the string was found, otherwise the - original string. - """ - lookup_table = self.lookups.get_table("lemma_lookup", {}) - key = orth if orth is not None else string - if key in lookup_table: - return lookup_table[key] - return string - - def lemmatize(self, string, index, exceptions, rules): - orig = string - string = string.lower() - forms = [] - oov_forms = [] - for old, new in rules: - if string.endswith(old): - form = string[: len(string) - len(old)] + new - if not form: - pass - elif form in index or not form.isalpha(): - forms.append(form) - else: - oov_forms.append(form) - # Remove duplicates but preserve the ordering of applied "rules" - forms = list(OrderedDict.fromkeys(forms)) - # Put exceptions at the front of the list, so they get priority. - # This is a dodgy heuristic -- but it's the best we can do until we get - # frequencies on this. We can at least prune out problematic exceptions, - # if they shadow more frequent analyses. 
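Because the rule-based Lemmatizer module is deleted in this diff, a standalone sketch of the suffix-rule logic it implemented may help readers follow the removed code; the toy index/exceptions/rules tables below are illustrative only:

    def lemmatize(string, index, exceptions, rules):
        # Mirror of the removed algorithm: exceptions first, then rule-generated
        # forms found in the index, then out-of-vocabulary fallbacks.
        orig = string
        string = string.lower()
        forms, oov_forms = [], []
        for old, new in rules:
            if string.endswith(old):
                form = string[: len(string) - len(old)] + new
                if not form:
                    continue
                if form in index or not form.isalpha():
                    forms.append(form)
                else:
                    oov_forms.append(form)
        forms = list(dict.fromkeys(forms))        # dedupe, keep rule order
        for form in exceptions.get(string, []):   # exceptions take priority
            if form not in forms:
                forms.insert(0, form)
        if not forms:
            forms.extend(oov_forms)
        if not forms:
            forms.append(orig)
        return forms

    print(lemmatize("Ducks", {"duck"}, {}, [("s", "")]))  # ['duck']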
- for form in exceptions.get(string, []): - if form not in forms: - forms.insert(0, form) - if not forms: - forms.extend(oov_forms) - if not forms: - forms.append(orig) - return forms diff --git a/spacy/lexeme.pxd b/spacy/lexeme.pxd index 167f57462..8dea0d6a2 100644 --- a/spacy/lexeme.pxd +++ b/spacy/lexeme.pxd @@ -1,3 +1,5 @@ +from numpy cimport ndarray + from .typedefs cimport attr_t, hash_t, flags_t, len_t, tag_t from .attrs cimport attr_id_t from .attrs cimport ID, ORTH, LOWER, NORM, SHAPE, PREFIX, SUFFIX, LENGTH, LANG @@ -6,8 +8,6 @@ from .structs cimport LexemeC from .strings cimport StringStore from .vocab cimport Vocab -from numpy cimport ndarray - cdef LexemeC EMPTY_LEXEME cdef attr_t OOV_RANK @@ -18,11 +18,12 @@ cdef class Lexeme: cdef readonly attr_t orth @staticmethod - cdef inline Lexeme from_ptr(LexemeC* lex, Vocab vocab, int vector_length): + cdef inline Lexeme from_ptr(LexemeC* lex, Vocab vocab): cdef Lexeme self = Lexeme.__new__(Lexeme, vocab, lex.orth) self.c = lex self.vocab = vocab self.orth = lex.orth + return self @staticmethod cdef inline void set_struct_attr(LexemeC* lex, attr_id_t name, attr_t value) nogil: diff --git a/spacy/lexeme.pyx b/spacy/lexeme.pyx index 8042098d7..17ce574ce 100644 --- a/spacy/lexeme.pyx +++ b/spacy/lexeme.pyx @@ -1,7 +1,4 @@ # cython: embedsignature=True -# coding: utf8 -from __future__ import unicode_literals, print_function - # Compiler crashes on memory view coercion without this. Should report bug. from cython.view cimport array as cvarray from libc.string cimport memset @@ -9,8 +6,8 @@ cimport numpy as np np.import_array() import numpy +from thinc.api import get_array_module import warnings -from thinc.neural.util import get_array_module from .typedefs cimport attr_t, flags_t from .attrs cimport IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_SPACE @@ -33,7 +30,7 @@ cdef class Lexeme: tag, dependency parse, or lemma (lemmatization depends on the part-of-speech tag). - DOCS: https://spacy.io/api/lexeme + DOCS: https://nightly.spacy.io/api/lexeme """ def __init__(self, Vocab vocab, attr_t orth): """Create a Lexeme object. @@ -166,7 +163,7 @@ cdef class Lexeme: self.vocab.set_vector(self.c.orth, vector) property rank: - """RETURNS (unicode): Sequential ID of the lexemes's lexical type, used + """RETURNS (str): Sequential ID of the lexemes's lexical type, used to index into tables, e.g. for word vectors.""" def __get__(self): return self.c.id @@ -189,18 +186,18 @@ cdef class Lexeme: @property def orth_(self): - """RETURNS (unicode): The original verbatim text of the lexeme + """RETURNS (str): The original verbatim text of the lexeme (identical to `Lexeme.text`). 
Exists mostly for consistency with the other attributes.""" return self.vocab.strings[self.c.orth] @property def text(self): - """RETURNS (unicode): The original verbatim text of the lexeme.""" + """RETURNS (str): The original verbatim text of the lexeme.""" return self.orth_ property lower: - """RETURNS (unicode): Lowercase form of the lexeme.""" + """RETURNS (str): Lowercase form of the lexeme.""" def __get__(self): return self.c.lower @@ -254,11 +251,11 @@ cdef class Lexeme: property cluster: """RETURNS (int): Brown cluster ID.""" def __get__(self): - cluster_table = self.vocab.load_extra_lookups("lexeme_cluster") + cluster_table = self.vocab.lookups.get_table("lexeme_cluster", {}) return cluster_table.get(self.c.orth, 0) def __set__(self, int x): - cluster_table = self.vocab.load_extra_lookups("lexeme_cluster") + cluster_table = self.vocab.lookups.get_table("lexeme_cluster", {}) cluster_table[self.c.orth] = x property lang: @@ -273,17 +270,17 @@ cdef class Lexeme: """RETURNS (float): Smoothed log probability estimate of the lexeme's type.""" def __get__(self): - prob_table = self.vocab.load_extra_lookups("lexeme_prob") - settings_table = self.vocab.load_extra_lookups("lexeme_settings") + prob_table = self.vocab.lookups.get_table("lexeme_prob", {}) + settings_table = self.vocab.lookups.get_table("lexeme_settings", {}) default_oov_prob = settings_table.get("oov_prob", -20.0) return prob_table.get(self.c.orth, default_oov_prob) def __set__(self, float x): - prob_table = self.vocab.load_extra_lookups("lexeme_prob") + prob_table = self.vocab.lookups.get_table("lexeme_prob", {}) prob_table[self.c.orth] = x property lower_: - """RETURNS (unicode): Lowercase form of the word.""" + """RETURNS (str): Lowercase form of the word.""" def __get__(self): return self.vocab.strings[self.c.lower] @@ -291,7 +288,7 @@ cdef class Lexeme: self.c.lower = self.vocab.strings.add(x) property norm_: - """RETURNS (unicode): The lexemes's norm, i.e. a normalised form of the + """RETURNS (str): The lexemes's norm, i.e. a normalised form of the lexeme text. """ def __get__(self): @@ -301,7 +298,7 @@ cdef class Lexeme: self.norm = self.vocab.strings.add(x) property shape_: - """RETURNS (unicode): Transform of the word's string, to show + """RETURNS (str): Transform of the word's string, to show orthographic features. """ def __get__(self): @@ -311,7 +308,7 @@ cdef class Lexeme: self.c.shape = self.vocab.strings.add(x) property prefix_: - """RETURNS (unicode): Length-N substring from the start of the word. + """RETURNS (str): Length-N substring from the start of the word. Defaults to `N=1`. """ def __get__(self): @@ -321,7 +318,7 @@ cdef class Lexeme: self.c.prefix = self.vocab.strings.add(x) property suffix_: - """RETURNS (unicode): Length-N substring from the end of the word. + """RETURNS (str): Length-N substring from the end of the word. Defaults to `N=3`. 
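The Lexeme changes above reroute cluster and probability through vocab.lookups tables instead of the old load_extra_lookups helper; a hedged sketch of what that means at the Python level, with the word and probability values invented:

    import spacy

    nlp = spacy.blank("en")
    # The tables the properties read from; empty defaults are used otherwise
    nlp.vocab.lookups.add_table("lexeme_prob", {"the": -3.5})
    nlp.vocab.lookups.add_table("lexeme_settings", {"oov_prob": -20.0})
    print(nlp.vocab["the"].prob)    # -3.5, read from the lexeme_prob table
    print(nlp.vocab["zzzzz"].prob)  # falls back to the oov_prob setting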
""" def __get__(self): @@ -331,7 +328,7 @@ cdef class Lexeme: self.c.suffix = self.vocab.strings.add(x) property lang_: - """RETURNS (unicode): Language of the parent vocabulary.""" + """RETURNS (str): Language of the parent vocabulary.""" def __get__(self): return self.vocab.strings[self.c.lang] diff --git a/spacy/lookups.py b/spacy/lookups.py index d4947be9f..133cb0672 100644 --- a/spacy/lookups.py +++ b/spacy/lookups.py @@ -1,160 +1,46 @@ -# coding: utf-8 -from __future__ import unicode_literals - +from typing import Dict, Any, List, Union, Optional +from pathlib import Path import srsly -from collections import OrderedDict from preshed.bloom import BloomFilter +from collections import OrderedDict from .errors import Errors -from .util import SimpleFrozenDict, ensure_path +from .util import SimpleFrozenDict, ensure_path, registry, load_language_data from .strings import get_string_id UNSET = object() -class Lookups(object): - """Container for large lookup tables and dictionaries, e.g. lemmatization - data or tokenizer exception lists. Lookups are available via vocab.lookups, - so they can be accessed before the pipeline components are applied (e.g. - in the tokenizer and lemmatizer), as well as within the pipeline components - via doc.vocab.lookups. +def load_lookups( + lang: str, tables: List[str], strict: bool = True +) -> Optional[Dict[str, Any]]: + """Load the data from the spacy-lookups-data package for a given language, + if available. Returns an empty dict if there's no data or if the package + is not installed. + + lang (str): The language code (corresponds to entry point exposed by + the spacy-lookups-data package). + tables (List[str]): Name of tables to load, e.g. ["lemma_lookup", "lemma_exc"] + strict (bool): Whether to raise an error if a table doesn't exist. + RETURNS (Dict[str, Any]): The lookups, keyed by table name. """ - - def __init__(self): - """Initialize the Lookups object. - - RETURNS (Lookups): The newly created object. - - DOCS: https://spacy.io/api/lookups#init - """ - self._tables = OrderedDict() - - def __contains__(self, name): - """Check if the lookups contain a table of a given name. Delegates to - Lookups.has_table. - - name (unicode): Name of the table. - RETURNS (bool): Whether a table of that name is in the lookups. - """ - return self.has_table(name) - - def __len__(self): - """RETURNS (int): The number of tables in the lookups.""" - return len(self._tables) - - @property - def tables(self): - """RETURNS (list): Names of all tables in the lookups.""" - return list(self._tables.keys()) - - def add_table(self, name, data=SimpleFrozenDict()): - """Add a new table to the lookups. Raises an error if the table exists. - - name (unicode): Unique name of table. - data (dict): Optional data to add to the table. - RETURNS (Table): The newly added table. - - DOCS: https://spacy.io/api/lookups#add_table - """ - if name in self.tables: - raise ValueError(Errors.E158.format(name=name)) - table = Table(name=name, data=data) - self._tables[name] = table - return table - - def get_table(self, name, default=UNSET): - """Get a table. Raises an error if the table doesn't exist and no - default value is provided. - - name (unicode): Name of the table. - default: Optional default value to return if table doesn't exist. - RETURNS (Table): The table. 
- - DOCS: https://spacy.io/api/lookups#get_table - """ - if name not in self._tables: - if default == UNSET: - raise KeyError(Errors.E159.format(name=name, tables=self.tables)) - return default - return self._tables[name] - - def remove_table(self, name): - """Remove a table. Raises an error if the table doesn't exist. - - name (unicode): Name of the table to remove. - RETURNS (Table): The removed table. - - DOCS: https://spacy.io/api/lookups#remove_table - """ - if name not in self._tables: - raise KeyError(Errors.E159.format(name=name, tables=self.tables)) - return self._tables.pop(name) - - def has_table(self, name): - """Check if the lookups contain a table of a given name. - - name (unicode): Name of the table. - RETURNS (bool): Whether a table of that name exists. - - DOCS: https://spacy.io/api/lookups#has_table - """ - return name in self._tables - - def to_bytes(self, **kwargs): - """Serialize the lookups to a bytestring. - - RETURNS (bytes): The serialized Lookups. - - DOCS: https://spacy.io/api/lookups#to_bytes - """ - return srsly.msgpack_dumps(self._tables) - - def from_bytes(self, bytes_data, **kwargs): - """Load the lookups from a bytestring. - - bytes_data (bytes): The data to load. - RETURNS (Lookups): The loaded Lookups. - - DOCS: https://spacy.io/api/lookups#from_bytes - """ - self._tables = OrderedDict() - for key, value in srsly.msgpack_loads(bytes_data).items(): - self._tables[key] = Table(key, value) - return self - - def to_disk(self, path, filename="lookups.bin", **kwargs): - """Save the lookups to a directory as lookups.bin. Expects a path to a - directory, which will be created if it doesn't exist. - - path (unicode / Path): The file path. - - DOCS: https://spacy.io/api/lookups#to_disk - """ - if len(self._tables): - path = ensure_path(path) - if not path.exists(): - path.mkdir() - filepath = path / filename - with filepath.open("wb") as file_: - file_.write(self.to_bytes()) - - def from_disk(self, path, filename="lookups.bin", **kwargs): - """Load lookups from a directory containing a lookups.bin. Will skip - loading if the file doesn't exist. - - path (unicode / Path): The directory path. - RETURNS (Lookups): The loaded lookups. - - DOCS: https://spacy.io/api/lookups#from_disk - """ - path = ensure_path(path) - filepath = path / filename - if filepath.exists(): - with filepath.open("rb") as file_: - data = file_.read() - return self.from_bytes(data) - return self + # TODO: import spacy_lookups_data instead of going via entry points here? + lookups = Lookups() + if lang not in registry.lookups: + if strict and len(tables) > 0: + raise ValueError(Errors.E955.format(table=", ".join(tables), lang=lang)) + return lookups + data = registry.lookups.get(lang) + for table in tables: + if table not in data: + if strict: + raise ValueError(Errors.E955.format(table=table, lang=lang)) + language_data = {} + else: + language_data = load_language_data(data[table]) + lookups.add_table(table, language_data) + return lookups class Table(OrderedDict): @@ -165,27 +51,25 @@ class Table(OrderedDict): """ @classmethod - def from_dict(cls, data, name=None): + def from_dict(cls, data: dict, name: Optional[str] = None) -> "Table": """Initialize a new table from a dict. data (dict): The dictionary. - name (unicode): Optional table name for reference. - RETURNS (Table): The newly created object. + name (str): Optional table name for reference. 
- DOCS: https://spacy.io/api/lookups#table.from_dict + DOCS: https://nightly.spacy.io/api/lookups#table.from_dict """ self = cls(name=name) self.update(data) return self - def __init__(self, name=None, data=None): + def __init__(self, name: Optional[str] = None, data: Optional[dict] = None) -> None: """Initialize a new table. - name (unicode): Optional table name for reference. + name (str): Optional table name for reference. data (dict): Initial data, used to hint Bloom Filter. - RETURNS (Table): The newly created object. - DOCS: https://spacy.io/api/lookups#table.init + DOCS: https://nightly.spacy.io/api/lookups#table.init """ OrderedDict.__init__(self) self.name = name @@ -196,48 +80,48 @@ class Table(OrderedDict): if data: self.update(data) - def __setitem__(self, key, value): + def __setitem__(self, key: Union[str, int], value: Any) -> None: """Set new key/value pair. String keys will be hashed. - key (unicode / int): The key to set. + key (str / int): The key to set. value: The value to set. """ key = get_string_id(key) OrderedDict.__setitem__(self, key, value) self.bloom.add(key) - def set(self, key, value): + def set(self, key: Union[str, int], value: Any) -> None: """Set new key/value pair. String keys will be hashed. Same as table[key] = value. - key (unicode / int): The key to set. + key (str / int): The key to set. value: The value to set. """ self[key] = value - def __getitem__(self, key): + def __getitem__(self, key: Union[str, int]) -> Any: """Get the value for a given key. String keys will be hashed. - key (unicode / int): The key to get. + key (str / int): The key to get. RETURNS: The value. """ key = get_string_id(key) return OrderedDict.__getitem__(self, key) - def get(self, key, default=None): + def get(self, key: Union[str, int], default: Optional[Any] = None) -> Any: """Get the value for a given key. String keys will be hashed. - key (unicode / int): The key to get. + key (str / int): The key to get. default: The default value to return. RETURNS: The value. """ key = get_string_id(key) return OrderedDict.get(self, key, default) - def __contains__(self, key): + def __contains__(self, key: Union[str, int]) -> bool: """Check whether a key is in the table. String keys will be hashed. - key (unicode / int): The key to check. + key (str / int): The key to check. RETURNS (bool): Whether the key is in the table. """ key = get_string_id(key) @@ -246,27 +130,27 @@ class Table(OrderedDict): return False return OrderedDict.__contains__(self, key) - def to_bytes(self): + def to_bytes(self) -> bytes: """Serialize table to a bytestring. RETURNS (bytes): The serialized table. - DOCS: https://spacy.io/api/lookups#table.to_bytes + DOCS: https://nightly.spacy.io/api/lookups#table.to_bytes """ - data = [ - ("name", self.name), - ("dict", dict(self.items())), - ("bloom", self.bloom.to_bytes()), - ] - return srsly.msgpack_dumps(OrderedDict(data)) + data = { + "name": self.name, + "dict": dict(self.items()), + "bloom": self.bloom.to_bytes(), + } + return srsly.msgpack_dumps(data) - def from_bytes(self, bytes_data): + def from_bytes(self, bytes_data: bytes) -> "Table": """Load a table from a bytestring. bytes_data (bytes): The data to load. RETURNS (Table): The loaded table. 
- DOCS: https://spacy.io/api/lookups#table.from_bytes + DOCS: https://nightly.spacy.io/api/lookups#table.from_bytes """ loaded = srsly.msgpack_loads(bytes_data) data = loaded.get("dict", {}) @@ -275,3 +159,158 @@ class Table(OrderedDict): self.clear() self.update(data) return self + + +class Lookups: + """Container for large lookup tables and dictionaries, e.g. lemmatization + data or tokenizer exception lists. Lookups are available via vocab.lookups, + so they can be accessed before the pipeline components are applied (e.g. + in the tokenizer and lemmatizer), as well as within the pipeline components + via doc.vocab.lookups. + """ + + def __init__(self) -> None: + """Initialize the Lookups object. + + DOCS: https://nightly.spacy.io/api/lookups#init + """ + self._tables = {} + + def __contains__(self, name: str) -> bool: + """Check if the lookups contain a table of a given name. Delegates to + Lookups.has_table. + + name (str): Name of the table. + RETURNS (bool): Whether a table of that name is in the lookups. + """ + return self.has_table(name) + + def __len__(self) -> int: + """RETURNS (int): The number of tables in the lookups.""" + return len(self._tables) + + @property + def tables(self) -> List[str]: + """RETURNS (List[str]): Names of all tables in the lookups.""" + return list(self._tables.keys()) + + def add_table(self, name: str, data: dict = SimpleFrozenDict()) -> Table: + """Add a new table to the lookups. Raises an error if the table exists. + + name (str): Unique name of table. + data (dict): Optional data to add to the table. + RETURNS (Table): The newly added table. + + DOCS: https://nightly.spacy.io/api/lookups#add_table + """ + if name in self.tables: + raise ValueError(Errors.E158.format(name=name)) + table = Table(name=name, data=data) + self._tables[name] = table + return table + + def set_table(self, name: str, table: Table) -> None: + """Set a table. + + name (str): Name of the table to set. + table (Table): The Table to set. + + DOCS: https://nightly.spacy.io/api/lookups#set_table + """ + self._tables[name] = table + + def get_table(self, name: str, default: Any = UNSET) -> Table: + """Get a table. Raises an error if the table doesn't exist and no + default value is provided. + + name (str): Name of the table. + default (Any): Optional default value to return if table doesn't exist. + RETURNS (Table): The table. + + DOCS: https://nightly.spacy.io/api/lookups#get_table + """ + if name not in self._tables: + if default == UNSET: + raise KeyError(Errors.E159.format(name=name, tables=self.tables)) + return default + return self._tables[name] + + def remove_table(self, name: str) -> Table: + """Remove a table. Raises an error if the table doesn't exist. + + name (str): Name of the table to remove. + RETURNS (Table): The removed table. + + DOCS: https://nightly.spacy.io/api/lookups#remove_table + """ + if name not in self._tables: + raise KeyError(Errors.E159.format(name=name, tables=self.tables)) + return self._tables.pop(name) + + def has_table(self, name: str) -> bool: + """Check if the lookups contain a table of a given name. + + name (str): Name of the table. + RETURNS (bool): Whether a table of that name exists. + + DOCS: https://nightly.spacy.io/api/lookups#has_table + """ + return name in self._tables + + def to_bytes(self, **kwargs) -> bytes: + """Serialize the lookups to a bytestring. + + RETURNS (bytes): The serialized Lookups. 
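To tie the rewritten Lookups/Table API above together, a hedged usage sketch; the table name and contents are invented for illustration:

    from spacy.lookups import Lookups

    lookups = Lookups()
    table = lookups.add_table("lemma_exc", {"wolves": ["wolf"]})
    table.set("mice", ["mouse"])       # string keys are hashed on insertion
    assert "wolves" in table           # membership checks hash the key too
    assert lookups.has_table("lemma_exc")
    data = lookups.to_bytes()          # msgpack round-trip
    restored = Lookups().from_bytes(data)
    print(restored.get_table("lemma_exc").get("mice"))  # ['mouse']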
+ + DOCS: https://nightly.spacy.io/api/lookups#to_bytes + """ + return srsly.msgpack_dumps(self._tables) + + def from_bytes(self, bytes_data: bytes, **kwargs) -> "Lookups": + """Load the lookups from a bytestring. + + bytes_data (bytes): The data to load. + RETURNS (Lookups): The loaded Lookups. + + DOCS: https://nightly.spacy.io/api/lookups#from_bytes + """ + self._tables = {} + for key, value in srsly.msgpack_loads(bytes_data).items(): + self._tables[key] = Table(key, value) + return self + + def to_disk( + self, path: Union[str, Path], filename: str = "lookups.bin", **kwargs + ) -> None: + """Save the lookups to a directory as lookups.bin. Expects a path to a + directory, which will be created if it doesn't exist. + + path (str / Path): The file path. + + DOCS: https://nightly.spacy.io/api/lookups#to_disk + """ + path = ensure_path(path) + if not path.exists(): + path.mkdir() + filepath = path / filename + with filepath.open("wb") as file_: + file_.write(self.to_bytes()) + + def from_disk( + self, path: Union[str, Path], filename: str = "lookups.bin", **kwargs + ) -> "Lookups": + """Load lookups from a directory containing a lookups.bin. Will skip + loading if the file doesn't exist. + + path (str / Path): The directory path. + RETURNS (Lookups): The loaded lookups. + + DOCS: https://nightly.spacy.io/api/lookups#from_disk + """ + path = ensure_path(path) + filepath = path / filename + if filepath.exists(): + with filepath.open("rb") as file_: + data = file_.read() + return self.from_bytes(data) + return self diff --git a/spacy/matcher/__init__.py b/spacy/matcher/__init__.py index 91874ed43..286844787 100644 --- a/spacy/matcher/__init__.py +++ b/spacy/matcher/__init__.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from .matcher import Matcher from .phrasematcher import PhraseMatcher from .dependencymatcher import DependencyMatcher diff --git a/spacy/matcher/_schemas.py b/spacy/matcher/_schemas.py deleted file mode 100644 index 4ef7ae49a..000000000 --- a/spacy/matcher/_schemas.py +++ /dev/null @@ -1,204 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - - -TOKEN_PATTERN_SCHEMA = { - "$schema": "http://json-schema.org/draft-06/schema", - "definitions": { - "string_value": { - "anyOf": [ - {"type": "string"}, - { - "type": "object", - "properties": { - "REGEX": {"type": "string"}, - "IN": {"type": "array", "items": {"type": "string"}}, - "NOT_IN": {"type": "array", "items": {"type": "string"}}, - }, - "additionalProperties": False, - }, - ] - }, - "integer_value": { - "anyOf": [ - {"type": "integer"}, - { - "type": "object", - "properties": { - "REGEX": {"type": "string"}, - "IN": {"type": "array", "items": {"type": "integer"}}, - "NOT_IN": {"type": "array", "items": {"type": "integer"}}, - "==": {"type": "integer"}, - ">=": {"type": "integer"}, - "<=": {"type": "integer"}, - ">": {"type": "integer"}, - "<": {"type": "integer"}, - }, - "additionalProperties": False, - }, - ] - }, - "boolean_value": {"type": "boolean"}, - "underscore_value": { - "anyOf": [ - {"type": ["string", "integer", "number", "array", "boolean", "null"]}, - { - "type": "object", - "properties": { - "REGEX": {"type": "string"}, - "IN": { - "type": "array", - "items": {"type": ["string", "integer"]}, - }, - "NOT_IN": { - "type": "array", - "items": {"type": ["string", "integer"]}, - }, - "==": {"type": "integer"}, - ">=": {"type": "integer"}, - "<=": {"type": "integer"}, - ">": {"type": "integer"}, - "<": {"type": "integer"}, - }, - "additionalProperties": False, - }, - ] - 
}, - }, - "type": "array", - "items": { - "type": "object", - "properties": { - "ORTH": { - "title": "Verbatim token text", - "$ref": "#/definitions/string_value", - }, - "TEXT": { - "title": "Verbatim token text (spaCy v2.1+)", - "$ref": "#/definitions/string_value", - }, - "LOWER": { - "title": "Lowercase form of token text", - "$ref": "#/definitions/string_value", - }, - "POS": { - "title": "Coarse-grained part-of-speech tag", - "$ref": "#/definitions/string_value", - }, - "TAG": { - "title": "Fine-grained part-of-speech tag", - "$ref": "#/definitions/string_value", - }, - "DEP": {"title": "Dependency label", "$ref": "#/definitions/string_value"}, - "LEMMA": { - "title": "Lemma (base form)", - "$ref": "#/definitions/string_value", - }, - "SHAPE": { - "title": "Abstract token shape", - "$ref": "#/definitions/string_value", - }, - "ENT_TYPE": { - "title": "Entity label of single token", - "$ref": "#/definitions/string_value", - }, - "NORM": { - "title": "Normalized form of the token text", - "$ref": "#/definitions/string_value", - }, - "LENGTH": { - "title": "Token character length", - "$ref": "#/definitions/integer_value", - }, - "IS_ALPHA": { - "title": "Token consists of alphabetic characters", - "$ref": "#/definitions/boolean_value", - }, - "IS_ASCII": { - "title": "Token consists of ASCII characters", - "$ref": "#/definitions/boolean_value", - }, - "IS_DIGIT": { - "title": "Token consists of digits", - "$ref": "#/definitions/boolean_value", - }, - "IS_LOWER": { - "title": "Token is lowercase", - "$ref": "#/definitions/boolean_value", - }, - "IS_UPPER": { - "title": "Token is uppercase", - "$ref": "#/definitions/boolean_value", - }, - "IS_TITLE": { - "title": "Token is titlecase", - "$ref": "#/definitions/boolean_value", - }, - "IS_PUNCT": { - "title": "Token is punctuation", - "$ref": "#/definitions/boolean_value", - }, - "IS_SPACE": { - "title": "Token is whitespace", - "$ref": "#/definitions/boolean_value", - }, - "IS_BRACKET": { - "title": "Token is a bracket", - "$ref": "#/definitions/boolean_value", - }, - "IS_QUOTE": { - "title": "Token is a quotation mark", - "$ref": "#/definitions/boolean_value", - }, - "IS_LEFT_PUNCT": { - "title": "Token is a left punctuation mark", - "$ref": "#/definitions/boolean_value", - }, - "IS_RIGHT_PUNCT": { - "title": "Token is a right punctuation mark", - "$ref": "#/definitions/boolean_value", - }, - "IS_CURRENCY": { - "title": "Token is a currency symbol", - "$ref": "#/definitions/boolean_value", - }, - "IS_STOP": { - "title": "Token is stop word", - "$ref": "#/definitions/boolean_value", - }, - "IS_SENT_START": { - "title": "Token is the first in a sentence", - "$ref": "#/definitions/boolean_value", - }, - "SENT_START": { - "title": "Token is the first in a sentence", - "$ref": "#/definitions/boolean_value", - }, - "LIKE_NUM": { - "title": "Token resembles a number", - "$ref": "#/definitions/boolean_value", - }, - "LIKE_URL": { - "title": "Token resembles a URL", - "$ref": "#/definitions/boolean_value", - }, - "LIKE_EMAIL": { - "title": "Token resembles an email address", - "$ref": "#/definitions/boolean_value", - }, - "_": { - "title": "Custom extension token attributes (token._.)", - "type": "object", - "patternProperties": { - "^.*$": {"$ref": "#/definitions/underscore_value"} - }, - }, - "OP": { - "title": "Operators / quantifiers", - "type": "string", - "enum": ["+", "*", "?", "!"], - }, - }, - "additionalProperties": False, - }, -} diff --git a/spacy/matcher/dependencymatcher.pyx b/spacy/matcher/dependencymatcher.pyx index 
56d27024d..067b2167c 100644 --- a/spacy/matcher/dependencymatcher.pyx +++ b/spacy/matcher/dependencymatcher.pyx @@ -1,19 +1,17 @@ -# cython: infer_types=True -# cython: profile=True -from __future__ import unicode_literals +# cython: infer_types=True, profile=True +from typing import List + +import numpy from cymem.cymem cimport Pool -from preshed.maps cimport PreshMap from .matcher cimport Matcher from ..vocab cimport Vocab from ..tokens.doc cimport Doc -from .matcher import unpickle_matcher from ..errors import Errors +from ..tokens import Span -from libcpp cimport bool -import numpy DELIMITER = "||" INDEX_HEAD = 1 @@ -24,36 +22,52 @@ cdef class DependencyMatcher: """Match dependency parse tree based on pattern rules.""" cdef Pool mem cdef readonly Vocab vocab - cdef readonly Matcher token_matcher + cdef readonly Matcher matcher cdef public object _patterns + cdef public object _raw_patterns cdef public object _keys_to_token cdef public object _root - cdef public object _entities cdef public object _callbacks cdef public object _nodes cdef public object _tree + cdef public object _ops - def __init__(self, vocab): + def __init__(self, vocab, *, validate=False): """Create the DependencyMatcher. vocab (Vocab): The vocabulary object, which must be shared with the documents the matcher will operate on. - RETURNS (DependencyMatcher): The newly constructed object. + validate (bool): Whether patterns should be validated, passed to + Matcher as `validate` """ size = 20 - self.token_matcher = Matcher(vocab) + self.matcher = Matcher(vocab, validate=validate) self._keys_to_token = {} self._patterns = {} + self._raw_patterns = {} self._root = {} self._nodes = {} self._tree = {} - self._entities = {} self._callbacks = {} self.vocab = vocab self.mem = Pool() + self._ops = { + "<": self.dep, + ">": self.gov, + "<<": self.dep_chain, + ">>": self.gov_chain, + ".": self.imm_precede, + ".*": self.precede, + ";": self.imm_follow, + ";*": self.follow, + "$+": self.imm_right_sib, + "$-": self.imm_left_sib, + "$++": self.right_sib, + "$--": self.left_sib, + } def __reduce__(self): - data = (self.vocab, self._patterns,self._tree, self._callbacks) + data = (self.vocab, self._raw_patterns, self._callbacks) return (unpickle_matcher, data, None, None) def __len__(self): @@ -67,57 +81,70 @@ cdef class DependencyMatcher: def __contains__(self, key): """Check whether the matcher contains rules for a match ID. - key (unicode): The match ID. + key (str): The match ID. RETURNS (bool): Whether the matcher contains rules for this match ID. 
""" - return self._normalize_key(key) in self._patterns + return self.has_key(key) - def validateInput(self, pattern, key): + def validate_input(self, pattern, key): idx = 0 - visitedNodes = {} + visited_nodes = {} for relation in pattern: - if "PATTERN" not in relation or "SPEC" not in relation: + if not isinstance(relation, dict): + raise ValueError(Errors.E1008) + if "RIGHT_ATTRS" not in relation and "RIGHT_ID" not in relation: raise ValueError(Errors.E098.format(key=key)) if idx == 0: if not( - "NODE_NAME" in relation["SPEC"] - and "NBOR_RELOP" not in relation["SPEC"] - and "NBOR_NAME" not in relation["SPEC"] + "RIGHT_ID" in relation + and "REL_OP" not in relation + and "LEFT_ID" not in relation ): raise ValueError(Errors.E099.format(key=key)) - visitedNodes[relation["SPEC"]["NODE_NAME"]] = True + visited_nodes[relation["RIGHT_ID"]] = True else: if not( - "NODE_NAME" in relation["SPEC"] - and "NBOR_RELOP" in relation["SPEC"] - and "NBOR_NAME" in relation["SPEC"] + "RIGHT_ID" in relation + and "RIGHT_ATTRS" in relation + and "REL_OP" in relation + and "LEFT_ID" in relation ): raise ValueError(Errors.E100.format(key=key)) if ( - relation["SPEC"]["NODE_NAME"] in visitedNodes - or relation["SPEC"]["NBOR_NAME"] not in visitedNodes + relation["RIGHT_ID"] in visited_nodes + or relation["LEFT_ID"] not in visited_nodes ): raise ValueError(Errors.E101.format(key=key)) - visitedNodes[relation["SPEC"]["NODE_NAME"]] = True - visitedNodes[relation["SPEC"]["NBOR_NAME"]] = True + if relation["REL_OP"] not in self._ops: + raise ValueError(Errors.E1007.format(op=relation["REL_OP"])) + visited_nodes[relation["RIGHT_ID"]] = True + visited_nodes[relation["LEFT_ID"]] = True idx = idx + 1 - def add(self, key, patterns, *_patterns, on_match=None): - if patterns is None or hasattr(patterns, "__call__"): # old API - on_match = patterns - patterns = _patterns + def add(self, key, patterns, *, on_match=None): + """Add a new matcher rule to the matcher. + + key (str): The match ID. + patterns (list): The patterns to add for the given key. + on_match (callable): Optional callback executed on match. + """ + if on_match is not None and not hasattr(on_match, "__call__"): + raise ValueError(Errors.E171.format(arg_type=type(on_match))) + if patterns is None or not isinstance(patterns, List): # old API + raise ValueError(Errors.E948.format(arg_type=type(patterns))) for pattern in patterns: if len(pattern) == 0: raise ValueError(Errors.E012.format(key=key)) - self.validateInput(pattern,key) + self.validate_input(pattern, key) key = self._normalize_key(key) + self._raw_patterns.setdefault(key, []) + self._raw_patterns[key].extend(patterns) _patterns = [] for pattern in patterns: token_patterns = [] for i in range(len(pattern)): - token_pattern = [pattern[i]["PATTERN"]] + token_pattern = [pattern[i]["RIGHT_ATTRS"]] token_patterns.append(token_pattern) - # self.patterns.append(token_patterns) _patterns.append(token_patterns) self._patterns.setdefault(key, []) self._callbacks[key] = on_match @@ -131,7 +158,7 @@ cdef class DependencyMatcher: # TODO: Better ways to hash edges in pattern? 
for j in range(len(_patterns[i])): k = self._normalize_key(unicode(key) + DELIMITER + unicode(i) + DELIMITER + unicode(j)) - self.token_matcher.add(k, None, _patterns[i][j]) + self.matcher.add(k, [_patterns[i][j]]) _keys_to_token[k] = j _keys_to_token_list.append(_keys_to_token) self._keys_to_token.setdefault(key, []) @@ -140,14 +167,14 @@ cdef class DependencyMatcher: for pattern in patterns: nodes = {} for i in range(len(pattern)): - nodes[pattern[i]["SPEC"]["NODE_NAME"]] = i + nodes[pattern[i]["RIGHT_ID"]] = i _nodes_list.append(nodes) self._nodes.setdefault(key, []) self._nodes[key].extend(_nodes_list) # Create an object tree to traverse later on. This data structure # enables easy tree pattern match. Doc-Token based tree cannot be # reused since it is memory-heavy and tightly coupled with the Doc. - self.retrieve_tree(patterns, _nodes_list,key) + self.retrieve_tree(patterns, _nodes_list, key) def retrieve_tree(self, patterns, _nodes_list, key): _heads_list = [] @@ -157,13 +184,13 @@ cdef class DependencyMatcher: root = -1 for j in range(len(patterns[i])): token_pattern = patterns[i][j] - if ("NBOR_RELOP" not in token_pattern["SPEC"]): + if ("REL_OP" not in token_pattern): heads[j] = ('root', j) root = j else: heads[j] = ( - token_pattern["SPEC"]["NBOR_RELOP"], - _nodes_list[i][token_pattern["SPEC"]["NBOR_NAME"]] + token_pattern["REL_OP"], + _nodes_list[i][token_pattern["LEFT_ID"]] ) _heads_list.append(heads) _root_list.append(root) @@ -189,23 +216,45 @@ cdef class DependencyMatcher: key (string or int): The key to check. RETURNS (bool): Whether the matcher has the rule. """ - key = self._normalize_key(key) - return key in self._patterns + return self._normalize_key(key) in self._patterns def get(self, key, default=None): """Retrieve the pattern stored for a key. - key (unicode or int): The key to retrieve. + key (str / int): The key to retrieve. RETURNS (tuple): The rule, as an (on_match, patterns) tuple. """ key = self._normalize_key(key) - if key not in self._patterns: + if key not in self._raw_patterns: return default - return (self._callbacks[key], self._patterns[key]) + return (self._callbacks[key], self._raw_patterns[key]) - def __call__(self, Doc doc): + def remove(self, key): + key = self._normalize_key(key) + if not key in self._patterns: + raise ValueError(Errors.E175.format(key=key)) + self._patterns.pop(key) + self._raw_patterns.pop(key) + self._nodes.pop(key) + self._tree.pop(key) + self._root.pop(key) + + def __call__(self, object doclike): + """Find all token sequences matching the supplied pattern. + + doclike (Doc or Span): The document to match over. + RETURNS (list): A list of `(key, start, end)` tuples, + describing the matches. A match tuple describes a span + `doc[start:end]`. The `label_id` and `key` are both integers. 
+ """ + if isinstance(doclike, Doc): + doc = doclike + elif isinstance(doclike, Span): + doc = doclike.as_doc() + else: + raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doclike).__name__)) matched_key_trees = [] - matches = self.token_matcher(doc) + matches = self.matcher(doc) for key in list(self._patterns.keys()): _patterns_list = self._patterns[key] _keys_to_token_list = self._keys_to_token[key] @@ -234,41 +283,41 @@ cdef class DependencyMatcher: length = len(_nodes) matched_trees = [] - self.recurse(_tree,id_to_position,_node_operator_map,0,[],matched_trees) - matched_key_trees.append((key,matched_trees)) - - for i, (ent_id, nodes) in enumerate(matched_key_trees): - on_match = self._callbacks.get(ent_id) + self.recurse(_tree, id_to_position, _node_operator_map, 0, [], matched_trees) + for matched_tree in matched_trees: + matched_key_trees.append((key, matched_tree)) + for i, (match_id, nodes) in enumerate(matched_key_trees): + on_match = self._callbacks.get(match_id) if on_match is not None: on_match(self, doc, i, matched_key_trees) return matched_key_trees - def recurse(self,tree,id_to_position,_node_operator_map,int patternLength,visitedNodes,matched_trees): - cdef bool isValid; - if(patternLength == len(id_to_position.keys())): + def recurse(self, tree, id_to_position, _node_operator_map, int patternLength, visited_nodes, matched_trees): + cdef bint isValid; + if patternLength == len(id_to_position.keys()): isValid = True for node in range(patternLength): - if(node in tree): + if node in tree: for idx, (relop,nbor) in enumerate(tree[node]): - computed_nbors = numpy.asarray(_node_operator_map[visitedNodes[node]][relop]) + computed_nbors = numpy.asarray(_node_operator_map[visited_nodes[node]][relop]) isNbor = False for computed_nbor in computed_nbors: - if(computed_nbor.i == visitedNodes[nbor]): + if computed_nbor.i == visited_nodes[nbor]: isNbor = True isValid = isValid & isNbor if(isValid): - matched_trees.append(visitedNodes) + matched_trees.append(visited_nodes) return allPatternNodes = numpy.asarray(id_to_position[patternLength]) for patternNode in allPatternNodes: - self.recurse(tree,id_to_position,_node_operator_map,patternLength+1,visitedNodes+[patternNode],matched_trees) + self.recurse(tree, id_to_position, _node_operator_map, patternLength+1, visited_nodes+[patternNode], matched_trees) # Given a node and an edge operator, to return the list of nodes # from the doc that belong to node+operator. This is used to store # all the results beforehand to prevent unnecessary computation while # pattern matching # _node_operator_map[node][operator] = [...] 
- def get_node_operator_map(self,doc,tree,id_to_position,nodes,root): + def get_node_operator_map(self, doc, tree, id_to_position, nodes, root): _node_operator_map = {} all_node_indices = nodes.values() all_operators = [] @@ -285,24 +334,14 @@ cdef class DependencyMatcher: _node_operator_map[node] = {} for operator in all_operators: _node_operator_map[node][operator] = [] - # Used to invoke methods for each operator - switcher = { - "<": self.dep, - ">": self.gov, - "<<": self.dep_chain, - ">>": self.gov_chain, - ".": self.imm_precede, - "$+": self.imm_right_sib, - "$-": self.imm_left_sib, - "$++": self.right_sib, - "$--": self.left_sib - } for operator in all_operators: for node in all_nodes: - _node_operator_map[node][operator] = switcher.get(operator)(doc,node) + _node_operator_map[node][operator] = self._ops.get(operator)(doc, node) return _node_operator_map def dep(self, doc, node): + if doc[node].head == doc[node]: + return [] return [doc[node].head] def gov(self,doc,node): @@ -312,36 +351,51 @@ cdef class DependencyMatcher: return list(doc[node].ancestors) def gov_chain(self, doc, node): - return list(doc[node].subtree) + return [t for t in doc[node].subtree if t != doc[node]] def imm_precede(self, doc, node): - if node > 0: + sent = self._get_sent(doc[node]) + if node < len(doc) - 1 and doc[node + 1] in sent: + return [doc[node + 1]] + return [] + + def precede(self, doc, node): + sent = self._get_sent(doc[node]) + return [doc[i] for i in range(node + 1, sent.end)] + + def imm_follow(self, doc, node): + sent = self._get_sent(doc[node]) + if node > 0 and doc[node - 1] in sent: return [doc[node - 1]] return [] + def follow(self, doc, node): + sent = self._get_sent(doc[node]) + return [doc[i] for i in range(sent.start, node)] + def imm_right_sib(self, doc, node): for child in list(doc[node].head.children): - if child.i == node - 1: + if child.i == node + 1: return [doc[child.i]] return [] def imm_left_sib(self, doc, node): for child in list(doc[node].head.children): - if child.i == node + 1: + if child.i == node - 1: return [doc[child.i]] return [] def right_sib(self, doc, node): candidate_children = [] for child in list(doc[node].head.children): - if child.i < node: + if child.i > node: candidate_children.append(doc[child.i]) return candidate_children def left_sib(self, doc, node): candidate_children = [] for child in list(doc[node].head.children): - if child.i > node: + if child.i < node: candidate_children.append(doc[child.i]) return candidate_children @@ -350,3 +404,15 @@ cdef class DependencyMatcher: return self.vocab.strings.add(key) else: return key + + def _get_sent(self, token): + root = (list(token.ancestors) or [token])[-1] + return token.doc[root.left_edge.i:root.right_edge.i + 1] + + +def unpickle_matcher(vocab, patterns, callbacks): + matcher = DependencyMatcher(vocab) + for key, pattern in patterns.items(): + callback = callbacks.get(key, None) + matcher.add(key, pattern, on_match=callback) + return matcher diff --git a/spacy/matcher/matcher.pxd b/spacy/matcher/matcher.pxd index dd04153bf..e1f6bc773 100644 --- a/spacy/matcher/matcher.pxd +++ b/spacy/matcher/matcher.pxd @@ -63,9 +63,10 @@ cdef class Matcher: cdef Pool mem cdef vector[TokenPatternC*] patterns cdef readonly Vocab vocab - cdef public object validator + cdef public object validate cdef public object _patterns cdef public object _callbacks + cdef public object _filter cdef public object _extensions cdef public object _extra_predicates cdef public object _seen_attrs diff --git a/spacy/matcher/matcher.pyx 
b/spacy/matcher/matcher.pyx index 8fbfe305a..a4d20ec55 100644 --- a/spacy/matcher/matcher.pyx +++ b/spacy/matcher/matcher.pyx @@ -1,9 +1,9 @@ -# cython: infer_types=True -# cython: profile=True -from __future__ import unicode_literals +# cython: infer_types=True, cython: profile=True +from typing import List from libcpp.vector cimport vector from libc.stdint cimport int32_t +from libc.string cimport memset, memcmp from cymem.cymem cimport Pool from murmurhash.mrmr cimport hash64 @@ -17,10 +17,10 @@ from ..vocab cimport Vocab from ..tokens.doc cimport Doc, get_token_attr_for_matcher from ..tokens.span cimport Span from ..tokens.token cimport Token -from ..attrs cimport ID, attr_id_t, NULL_ATTR, ORTH, POS, TAG, DEP, LEMMA +from ..tokens.morphanalysis cimport MorphAnalysis +from ..attrs cimport ID, attr_id_t, NULL_ATTR, ORTH, POS, TAG, DEP, LEMMA, MORPH -from ._schemas import TOKEN_PATTERN_SCHEMA -from ..util import get_json_validator, validate_json +from ..schemas import validate_token_pattern from ..errors import Errors, MatchPatternError, Warnings from ..strings import get_string_id from ..attrs import IDS @@ -32,28 +32,25 @@ DEF PADDING = 5 cdef class Matcher: """Match sequences of tokens, based on pattern rules. - DOCS: https://spacy.io/api/matcher - USAGE: https://spacy.io/usage/rule-based-matching + DOCS: https://nightly.spacy.io/api/matcher + USAGE: https://nightly.spacy.io/usage/rule-based-matching """ - def __init__(self, vocab, validate=False): + def __init__(self, vocab, validate=True): """Create the Matcher. vocab (Vocab): The vocabulary object, which must be shared with the documents the matcher will operate on. - RETURNS (Matcher): The newly constructed object. """ self._extra_predicates = [] self._patterns = {} self._callbacks = {} + self._filter = {} self._extensions = {} self._seen_attrs = set() self.vocab = vocab self.mem = Pool() - if validate: - self.validator = get_json_validator(TOKEN_PATTERN_SCHEMA) - else: - self.validator = None + self.validate = validate def __reduce__(self): data = (self.vocab, self._patterns, self._callbacks) @@ -71,12 +68,12 @@ cdef class Matcher: def __contains__(self, key): """Check whether the matcher contains rules for a match ID. - key (unicode): The match ID. + key (str): The match ID. RETURNS (bool): Whether the matcher contains rules for this match ID. """ - return self._normalize_key(key) in self._patterns + return self.has_key(key) - def add(self, key, patterns, *_patterns, on_match=None): + def add(self, key, patterns, *, on_match=None, greedy: str=None): """Add a match-rule to the matcher. A match-rule consists of: an ID key, an on_match callback, and one or more patterns. @@ -94,59 +91,58 @@ cdef class Matcher: '+': Require the pattern to match 1 or more times. '*': Allow the pattern to zero or more times. - The + and * operators are usually interpretted "greedily", i.e. longer - matches are returned where possible. However, if you specify two '+' - and '*' patterns in a row and their matches overlap, the first - operator will behave non-greedily. This quirk in the semantics makes - the matcher more efficient, by avoiding the need for back-tracking. + The + and * operators return all possible matches (not just the greedy + ones). However, the "greedy" argument can filter the final matches + by returning a non-overlapping set per key, either taking preference to + the first greedy match ("FIRST"), or the longest ("LONGEST"). 
As of spaCy v2.2.2, Matcher.add supports the future API, which makes the patterns the second argument and a list (instead of a variable number of arguments). The on_match callback becomes an optional keyword argument. - key (unicode): The match ID. + key (str): The match ID. patterns (list): The patterns to add for the given key. on_match (callable): Optional callback executed on match. - *_patterns (list): For backwards compatibility: list of patterns to add - as variable arguments. Will be ignored if a list of patterns is - provided as the second argument. + greedy (str): Optional filter: "FIRST" or "LONGEST". """ errors = {} if on_match is not None and not hasattr(on_match, "__call__"): raise ValueError(Errors.E171.format(arg_type=type(on_match))) - if patterns is None or hasattr(patterns, "__call__"): # old API - on_match = patterns - patterns = _patterns + if patterns is None or not isinstance(patterns, List): # old API + raise ValueError(Errors.E948.format(arg_type=type(patterns))) + if greedy is not None and greedy not in ["FIRST", "LONGEST"]: + raise ValueError(Errors.E947.format(expected=["FIRST", "LONGEST"], arg=greedy)) for i, pattern in enumerate(patterns): if len(pattern) == 0: raise ValueError(Errors.E012.format(key=key)) if not isinstance(pattern, list): raise ValueError(Errors.E178.format(pat=pattern, key=key)) - if self.validator: - errors[i] = validate_json(pattern, self.validator) + if self.validate: + errors[i] = validate_token_pattern(pattern) if any(err for err in errors.values()): raise MatchPatternError(key, errors) key = self._normalize_key(key) for pattern in patterns: try: - specs = _preprocess_pattern(pattern, self.vocab.strings, + specs = _preprocess_pattern(pattern, self.vocab, self._extensions, self._extra_predicates) self.patterns.push_back(init_pattern(self.mem, key, specs)) for spec in specs: for attr, _ in spec[1]: self._seen_attrs.add(attr) except OverflowError, AttributeError: - raise ValueError(Errors.E154.format()) + raise ValueError(Errors.E154.format()) from None self._patterns.setdefault(key, []) self._callbacks[key] = on_match + self._filter[key] = greedy self._patterns[key].extend(patterns) def remove(self, key): """Remove a rule from the matcher. A KeyError is raised if the key does not exist. - key (unicode): The ID of the match rule. + key (str): The ID of the match rule. """ norm_key = self._normalize_key(key) if not norm_key in self._patterns: @@ -167,13 +163,12 @@ cdef class Matcher: key (string or int): The key to check. RETURNS (bool): Whether the matcher has the rule. """ - key = self._normalize_key(key) - return key in self._patterns + return self._normalize_key(key) in self._patterns def get(self, key, default=None): """Retrieve the pattern stored for a key. - key (unicode or int): The key to retrieve. + key (str / int): The key to retrieve. RETURNS (tuple): The rule, as an (on_match, patterns) tuple. """ key = self._normalize_key(key) @@ -181,23 +176,11 @@ cdef class Matcher: return default return (self._callbacks[key], self._patterns[key]) - def pipe(self, docs, batch_size=1000, n_threads=-1, return_matches=False, - as_tuples=False): - """Match a stream of documents, yielding them in turn. - - docs (iterable): A stream of documents. - batch_size (int): Number of documents to accumulate into a working set. - return_matches (bool): Yield the match lists along with the docs, making - results (doc, matches) tuples. - as_tuples (bool): Interpret the input stream as (doc, context) tuples, - and yield (result, context) tuples out. 
- If both return_matches and as_tuples are True, the output will - be a sequence of ((doc, matches), context) tuples. - YIELDS (Doc): Documents, in order. + def pipe(self, docs, batch_size=1000, return_matches=False, as_tuples=False): + """Match a stream of documents, yielding them in turn. Deprecated as of + spaCy v3.0. """ - if n_threads != -1: - warnings.warn(Warnings.W016, DeprecationWarning) - + warnings.warn(Warnings.W105.format(matcher="Matcher"), DeprecationWarning) if as_tuples: for doc, context in docs: matches = self(doc) @@ -213,13 +196,16 @@ cdef class Matcher: else: yield doc - def __call__(self, object doclike): + def __call__(self, object doclike, *, as_spans=False, allow_missing=False): """Find all token sequences matching the supplied pattern. doclike (Doc or Span): The document to match over. - RETURNS (list): A list of `(key, start, end)` tuples, + as_spans (bool): Return Span objects with labels instead of (match_id, + start, end) tuples. + RETURNS (list): A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span - `doc[start:end]`. The `label_id` and `key` are both integers. + `doc[start:end]`. The `match_id` is an integer. If as_spans is set + to True, a list of Span objects is returned. """ if isinstance(doclike, Doc): doc = doclike @@ -229,18 +215,61 @@ cdef class Matcher: length = doclike.end - doclike.start else: raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doclike).__name__)) - if len(set([LEMMA, POS, TAG]) & self._seen_attrs) > 0 \ - and not doc.is_tagged: - raise ValueError(Errors.E155.format()) - if DEP in self._seen_attrs and not doc.is_parsed: - raise ValueError(Errors.E156.format()) + cdef Pool tmp_pool = Pool() + if not allow_missing: + for attr in (TAG, POS, MORPH, LEMMA, DEP): + if attr in self._seen_attrs and not doc.has_annotation(attr): + if attr == TAG: + pipe = "tagger" + elif attr in (POS, MORPH): + pipe = "morphologizer" + elif attr == LEMMA: + pipe = "lemmatizer" + elif attr == DEP: + pipe = "parser" + error_msg = Errors.E155.format(pipe=pipe, attr=self.vocab.strings.as_string(attr)) + raise ValueError(error_msg) matches = find_matches(&self.patterns[0], self.patterns.size(), doclike, length, extensions=self._extensions, predicates=self._extra_predicates) - for i, (key, start, end) in enumerate(matches): + final_matches = [] + pairs_by_id = {} + # For each key, either add all matches, or only the filtered, non-overlapping ones + for (key, start, end) in matches: + span_filter = self._filter.get(key) + if span_filter is not None: + pairs = pairs_by_id.get(key, []) + pairs.append((start,end)) + pairs_by_id[key] = pairs + else: + final_matches.append((key, start, end)) + matched = tmp_pool.alloc(length, sizeof(char)) + empty = tmp_pool.alloc(length, sizeof(char)) + for key, pairs in pairs_by_id.items(): + memset(matched, 0, length * sizeof(matched[0])) + span_filter = self._filter.get(key) + if span_filter == "FIRST": + sorted_pairs = sorted(pairs, key=lambda x: (x[0], -x[1]), reverse=False) # sort by start + elif span_filter == "LONGEST": + sorted_pairs = sorted(pairs, key=lambda x: (x[1]-x[0], -x[0]), reverse=True) # reverse sort by length + else: + raise ValueError(Errors.E947.format(expected=["FIRST", "LONGEST"], arg=span_filter)) + for (start, end) in sorted_pairs: + assert 0 <= start < end # Defend against segfaults + span_len = end-start + # If no tokens in the span have matched + if memcmp(&matched[start], &empty[start], span_len * sizeof(matched[0])) == 0: + 
final_matches.append((key, start, end)) + # Mark tokens that have matched + memset(&matched[start], 1, span_len * sizeof(matched[0])) + # perform the callbacks on the filtered set of results + for i, (key, start, end) in enumerate(final_matches): on_match = self._callbacks.get(key, None) if on_match is not None: - on_match(self, doc, i, matches) - return matches + on_match(self, doc, i, final_matches) + if as_spans: + return [Span(doc, start, end, label=key) for key, start, end in final_matches] + else: + return final_matches def _normalize_key(self, key): if isinstance(key, basestring): @@ -251,9 +280,9 @@ cdef class Matcher: def unpickle_matcher(vocab, patterns, callbacks): matcher = Matcher(vocab) - for key, specs in patterns.items(): + for key, pattern in patterns.items(): callback = callbacks.get(key, None) - matcher.add(key, callback, *specs) + matcher.add(key, pattern, on_match=callback) return matcher @@ -635,7 +664,7 @@ cdef attr_t get_ent_id(const TokenPatternC* pattern) nogil: return id_attr.value -def _preprocess_pattern(token_specs, string_store, extensions_table, extra_predicates): +def _preprocess_pattern(token_specs, vocab, extensions_table, extra_predicates): """This function interprets the pattern, converting the various bits of syntactic sugar before we compile it into a struct with init_pattern. @@ -650,6 +679,7 @@ def _preprocess_pattern(token_specs, string_store, extensions_table, extra_predi extra_predicates. """ tokens = [] + string_store = vocab.strings for spec in token_specs: if not spec: # Signifier for 'any token' @@ -660,7 +690,7 @@ def _preprocess_pattern(token_specs, string_store, extensions_table, extra_predi ops = _get_operators(spec) attr_values = _get_attr_values(spec, string_store) extensions = _get_extensions(spec, string_store, extensions_table) - predicates = _get_extra_predicates(spec, extra_predicates) + predicates = _get_extra_predicates(spec, extra_predicates, vocab) for op in ops: tokens.append((op, list(attr_values), list(extensions), list(predicates))) return tokens @@ -679,8 +709,6 @@ def _get_attr_values(spec, string_store): attr = "ORTH" if attr == "IS_SENT_START": attr = "SENT_START" - if attr not in TOKEN_PATTERN_SCHEMA["items"]["properties"]: - raise ValueError(Errors.E152.format(attr=attr)) attr = IDS.get(attr) if isinstance(value, basestring): value = string_store.add(value) @@ -695,7 +723,7 @@ def _get_attr_values(spec, string_store): if attr is not None: attr_values.append((attr, value)) else: - # should be caught above using TOKEN_PATTERN_SCHEMA + # should be caught in validation raise ValueError(Errors.E152.format(attr=attr)) return attr_values @@ -703,10 +731,10 @@ def _get_attr_values(spec, string_store): # These predicate helper classes are used to match the REGEX, IN, >= etc # extensions to the matcher introduced in #3173. 
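# Editor's note: an illustrative sketch, not part of the patch. It exercises the
# two Matcher additions implemented above: the keyword-only greedy="FIRST"/"LONGEST"
# filter on add(), and as_spans=True on __call__. The example text and the key
# name "ALPHA_RUN" are made up.
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# "+" yields every possible match; greedy="LONGEST" keeps only the longest
# non-overlapping match per key.
matcher.add("ALPHA_RUN", [[{"IS_ALPHA": True, "OP": "+"}]], greedy="LONGEST")
doc = nlp("one two three")
for span in matcher(doc, as_spans=True):
    print(span.text, span.label_)  # "one two three" ALPHA_RUN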
-class _RegexPredicate(object): +class _RegexPredicate: operators = ("REGEX",) - def __init__(self, i, attr, value, predicate, is_extension=False): + def __init__(self, i, attr, value, predicate, is_extension=False, vocab=None): self.i = i self.attr = attr self.value = re.compile(value) @@ -724,13 +752,18 @@ class _RegexPredicate(object): return bool(self.value.search(value)) -class _SetMemberPredicate(object): - operators = ("IN", "NOT_IN") +class _SetPredicate: + operators = ("IN", "NOT_IN", "IS_SUBSET", "IS_SUPERSET") - def __init__(self, i, attr, value, predicate, is_extension=False): + def __init__(self, i, attr, value, predicate, is_extension=False, vocab=None): self.i = i self.attr = attr - self.value = set(get_string_id(v) for v in value) + self.vocab = vocab + if self.attr == MORPH: + # normalize morph strings + self.value = set(self.vocab.morphology.add(v) for v in value) + else: + self.value = set(get_string_id(v) for v in value) self.predicate = predicate self.is_extension = is_extension self.key = (attr, self.predicate, srsly.json_dumps(value, sort_keys=True)) @@ -742,19 +775,32 @@ class _SetMemberPredicate(object): value = get_string_id(token._.get(self.attr)) else: value = get_token_attr_for_matcher(token.c, self.attr) + + if self.predicate in ("IS_SUBSET", "IS_SUPERSET"): + if self.attr == MORPH: + # break up MORPH into individual Feat=Val values + value = set(get_string_id(v) for v in MorphAnalysis.from_id(self.vocab, value)) + else: + # IS_SUBSET for other attrs will be equivalent to "IN" + # IS_SUPERSET will only match for other attrs with 0 or 1 values + value = set([value]) if self.predicate == "IN": return value in self.value - else: + elif self.predicate == "NOT_IN": return value not in self.value + elif self.predicate == "IS_SUBSET": + return value <= self.value + elif self.predicate == "IS_SUPERSET": + return value >= self.value def __repr__(self): - return repr(("SetMemberPredicate", self.i, self.attr, self.value, self.predicate)) + return repr(("SetPredicate", self.i, self.attr, self.value, self.predicate)) -class _ComparisonPredicate(object): +class _ComparisonPredicate: operators = ("==", "!=", ">=", "<=", ">", "<") - def __init__(self, i, attr, value, predicate, is_extension=False): + def __init__(self, i, attr, value, predicate, is_extension=False, vocab=None): self.i = i self.attr = attr self.value = value @@ -783,11 +829,13 @@ class _ComparisonPredicate(object): return value < self.value -def _get_extra_predicates(spec, extra_predicates): +def _get_extra_predicates(spec, extra_predicates, vocab): predicate_types = { "REGEX": _RegexPredicate, - "IN": _SetMemberPredicate, - "NOT_IN": _SetMemberPredicate, + "IN": _SetPredicate, + "NOT_IN": _SetPredicate, + "IS_SUBSET": _SetPredicate, + "IS_SUPERSET": _SetPredicate, "==": _ComparisonPredicate, "!=": _ComparisonPredicate, ">=": _ComparisonPredicate, @@ -815,7 +863,7 @@ def _get_extra_predicates(spec, extra_predicates): value_with_upper_keys = {k.upper(): v for k, v in value.items()} for type_, cls in predicate_types.items(): if type_ in value_with_upper_keys: - predicate = cls(len(extra_predicates), attr, value_with_upper_keys[type_], type_) + predicate = cls(len(extra_predicates), attr, value_with_upper_keys[type_], type_, vocab=vocab) # Don't create a redundant predicates. # This helps with efficiency, as we're caching the results. 
if predicate.key in seen_predicates: diff --git a/spacy/matcher/phrasematcher.pxd b/spacy/matcher/phrasematcher.pxd index a8e5e5085..3b42f3fab 100644 --- a/spacy/matcher/phrasematcher.pxd +++ b/spacy/matcher/phrasematcher.pxd @@ -1,5 +1,4 @@ from libcpp.vector cimport vector - from cymem.cymem cimport Pool from preshed.maps cimport key_t, MapStruct diff --git a/spacy/matcher/phrasematcher.pyx b/spacy/matcher/phrasematcher.pyx index 00c3357f5..7e99859b5 100644 --- a/spacy/matcher/phrasematcher.pyx +++ b/spacy/matcher/phrasematcher.pyx @@ -1,19 +1,16 @@ -# cython: infer_types=True -# cython: profile=True -from __future__ import unicode_literals - +# cython: infer_types=True, profile=True from libc.stdint cimport uintptr_t - from preshed.maps cimport map_init, map_set, map_get, map_clear, map_iter import warnings -from ..attrs cimport ORTH, POS, TAG, DEP, LEMMA +from ..attrs cimport ORTH, POS, TAG, DEP, LEMMA, MORPH from ..structs cimport TokenC from ..tokens.token cimport Token +from ..tokens.span cimport Span from ..typedefs cimport attr_t -from ._schemas import TOKEN_PATTERN_SCHEMA +from ..schemas import TokenPattern from ..errors import Errors, Warnings @@ -22,26 +19,23 @@ cdef class PhraseMatcher: sequences based on lists of token descriptions, the `PhraseMatcher` accepts match patterns in the form of `Doc` objects. - DOCS: https://spacy.io/api/phrasematcher - USAGE: https://spacy.io/usage/rule-based-matching#phrasematcher + DOCS: https://nightly.spacy.io/api/phrasematcher + USAGE: https://nightly.spacy.io/usage/rule-based-matching#phrasematcher Adapted from FlashText: https://github.com/vi3k6i5/flashtext MIT License (see `LICENSE`) Copyright (c) 2017 Vikash Singh (vikash.duliajan@gmail.com) """ - def __init__(self, Vocab vocab, max_length=0, attr="ORTH", validate=False): + def __init__(self, Vocab vocab, attr="ORTH", validate=False): """Initialize the PhraseMatcher. vocab (Vocab): The shared vocabulary. - attr (int / unicode): Token attribute to match on. + attr (int / str): Token attribute to match on. validate (bool): Perform additional validation when patterns are added. - RETURNS (PhraseMatcher): The newly constructed object. - DOCS: https://spacy.io/api/phrasematcher#init + DOCS: https://nightly.spacy.io/api/phrasematcher#init """ - if max_length != 0: - warnings.warn(Warnings.W010, DeprecationWarning) self.vocab = vocab self._callbacks = {} self._docs = {} @@ -58,7 +52,7 @@ cdef class PhraseMatcher: attr = attr.upper() if attr == "TEXT": attr = "ORTH" - if attr not in TOKEN_PATTERN_SCHEMA["items"]["properties"]: + if attr.lower() not in TokenPattern().dict(): raise ValueError(Errors.E152.format(attr=attr)) self.attr = self.vocab.strings[attr] @@ -67,17 +61,17 @@ cdef class PhraseMatcher: RETURNS (int): The number of rules. - DOCS: https://spacy.io/api/phrasematcher#len + DOCS: https://nightly.spacy.io/api/phrasematcher#len """ return len(self._callbacks) def __contains__(self, key): """Check whether the matcher contains rules for a match ID. - key (unicode): The match ID. + key (str): The match ID. RETURNS (bool): Whether the matcher contains rules for this match ID. - DOCS: https://spacy.io/api/phrasematcher#contains + DOCS: https://nightly.spacy.io/api/phrasematcher#contains """ return key in self._callbacks @@ -89,9 +83,9 @@ cdef class PhraseMatcher: """Remove a rule from the matcher by match ID. A KeyError is raised if the key does not exist. - key (unicode): The match ID. + key (str): The match ID. 
- DOCS: https://spacy.io/api/phrasematcher#remove + DOCS: https://nightly.spacy.io/api/phrasematcher#remove """ if key not in self._docs: raise KeyError(key) @@ -163,14 +157,14 @@ cdef class PhraseMatcher: number of arguments). The on_match callback becomes an optional keyword argument. - key (unicode): The match ID. + key (str): The match ID. docs (list): List of `Doc` objects representing match patterns. on_match (callable): Callback executed on match. *_docs (Doc): For backwards compatibility: list of patterns to add as variable arguments. Will be ignored if a list of patterns is provided as the second argument. - DOCS: https://spacy.io/api/phrasematcher#add + DOCS: https://nightly.spacy.io/api/phrasematcher#add """ if docs is None or hasattr(docs, "__call__"): # old API on_match = docs @@ -190,12 +184,22 @@ cdef class PhraseMatcher: if len(doc) == 0: continue if isinstance(doc, Doc): - if self.attr in (POS, TAG, LEMMA) and not doc.is_tagged: - raise ValueError(Errors.E155.format()) - if self.attr == DEP and not doc.is_parsed: - raise ValueError(Errors.E156.format()) - if self._validate and (doc.is_tagged or doc.is_parsed) \ - and self.attr not in (DEP, POS, TAG, LEMMA): + attrs = (TAG, POS, MORPH, LEMMA, DEP) + has_annotation = {attr: doc.has_annotation(attr) for attr in attrs} + for attr in attrs: + if self.attr == attr and not has_annotation[attr]: + if attr == TAG: + pipe = "tagger" + elif attr in (POS, MORPH): + pipe = "morphologizer" + elif attr == LEMMA: + pipe = "lemmatizer" + elif attr == DEP: + pipe = "parser" + error_msg = Errors.E155.format(pipe=pipe, attr=self.vocab.strings.as_string(attr)) + raise ValueError(error_msg) + if self._validate and any(has_annotation.values()) \ + and self.attr not in attrs: string_attr = self.vocab.strings[self.attr] warnings.warn(Warnings.W012.format(key=key, attr=string_attr)) keyword = self._convert_to_array(doc) @@ -223,15 +227,18 @@ cdef class PhraseMatcher: result = internal_node map_set(self.mem, result, self.vocab.strings[key], NULL) - def __call__(self, doc): + def __call__(self, doc, *, as_spans=False): """Find all sequences matching the supplied patterns on the `Doc`. doc (Doc): The document to match over. - RETURNS (list): A list of `(key, start, end)` tuples, + as_spans (bool): Return Span objects with labels instead of (match_id, + start, end) tuples. + RETURNS (list): A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span - `doc[start:end]`. The `label_id` and `key` are both integers. + `doc[start:end]`. The `match_id` is an integer. If as_spans is set + to True, a list of Span objects is returned. - DOCS: https://spacy.io/api/phrasematcher#call + DOCS: https://nightly.spacy.io/api/phrasematcher#call """ matches = [] if doc is None or len(doc) == 0: @@ -246,7 +253,10 @@ cdef class PhraseMatcher: on_match = self._callbacks.get(self.vocab.strings[ent_id]) if on_match is not None: on_match(self, doc, i, matches) - return matches + if as_spans: + return [Span(doc, start, end, label=key) for key, start, end in matches] + else: + return matches cdef void find_matches(self, Doc doc, vector[SpanC] *matches) nogil: cdef MapStruct* current_node = self.c_map @@ -291,24 +301,11 @@ cdef class PhraseMatcher: current_node = self.c_map idx += 1 - def pipe(self, stream, batch_size=1000, n_threads=-1, return_matches=False, - as_tuples=False): - """Match a stream of documents, yielding them in turn. - - docs (iterable): A stream of documents. 
- batch_size (int): Number of documents to accumulate into a working set. - return_matches (bool): Yield the match lists along with the docs, making - results (doc, matches) tuples. - as_tuples (bool): Interpret the input stream as (doc, context) tuples, - and yield (result, context) tuples out. - If both return_matches and as_tuples are True, the output will - be a sequence of ((doc, matches), context) tuples. - YIELDS (Doc): Documents, in order. - - DOCS: https://spacy.io/api/phrasematcher#pipe + def pipe(self, stream, batch_size=1000, return_matches=False, as_tuples=False): + """Match a stream of documents, yielding them in turn. Deprecated as of + spaCy v3.0. """ - if n_threads != -1: - warnings.warn(Warnings.W016, DeprecationWarning) + warnings.warn(Warnings.W105.format(matcher="PhraseMatcher"), DeprecationWarning) if as_tuples: for doc, context in stream: matches = self(doc) diff --git a/spacy/ml/__init__.py b/spacy/ml/__init__.py index 57e7ef571..c382d915b 100644 --- a/spacy/ml/__init__.py +++ b/spacy/ml/__init__.py @@ -1,5 +1 @@ -# coding: utf8 -from __future__ import unicode_literals - -from .tok2vec import Tok2Vec # noqa: F401 -from .common import FeedForward, LayerNormalizedMaxout # noqa: F401 +from .models import * # noqa: F401, F403 diff --git a/spacy/ml/_character_embed.py b/spacy/ml/_character_embed.py new file mode 100644 index 000000000..f5c539c42 --- /dev/null +++ b/spacy/ml/_character_embed.py @@ -0,0 +1,57 @@ +from typing import List +from thinc.api import Model +from thinc.types import Floats2d + +from ..tokens import Doc + + +def CharacterEmbed(nM: int, nC: int) -> Model[List[Doc], List[Floats2d]]: + # nM: Number of dimensions per character. nC: Number of characters. + return Model( + "charembed", + forward, + init=init, + dims={"nM": nM, "nC": nC, "nO": nM * nC, "nV": 256}, + params={"E": None}, + ) + + +def init(model: Model, X=None, Y=None): + vectors_table = model.ops.alloc3f( + model.get_dim("nC"), model.get_dim("nV"), model.get_dim("nM") + ) + model.set_param("E", vectors_table) + + +def forward(model: Model, docs: List[Doc], is_train: bool): + if docs is None: + return [] + ids = [] + output = [] + E = model.get_param("E") + nC = model.get_dim("nC") + nM = model.get_dim("nM") + nO = model.get_dim("nO") + # This assists in indexing; it's like looping over this dimension. + # Still consider this weird witch craft...But thanks to Mark Neumann + # for the tip. + nCv = model.ops.xp.arange(nC) + for doc in docs: + doc_ids = model.ops.asarray(doc.to_utf8_array(nr_char=nC)) + doc_vectors = model.ops.alloc3f(len(doc), nC, nM) + # Let's say I have a 2d array of indices, and a 3d table of data. What numpy + # incantation do I chant to get + # output[i, j, k] == data[j, ids[i, j], k]? 
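# Editor's note: a tiny standalone numpy check, not part of the patch, answering
# the question in the comment above. With ids of shape (i, j) and a table of
# shape (j, v, k), indexing table[arange(j), ids[:, arange(j)]] gives an array
# out with out[i, j, k] == table[j, ids[i, j], k] -- exactly what the next line
# relies on. The sizes here are made-up toy values.
import numpy as np

nI, nJ, nV, nK = 4, 3, 10, 2           # tokens, chars per token, vocab, dims
table = np.random.rand(nJ, nV, nK)     # like the E table: (nC, nV, nM)
ids = np.random.randint(0, nV, (nI, nJ))
jv = np.arange(nJ)
out = table[jv, ids[:, jv]]            # shape (nI, nJ, nK)
assert all(
    out[i, j, k] == table[j, ids[i, j], k]
    for i in range(nI) for j in range(nJ) for k in range(nK)
)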
+ doc_vectors[:, nCv] = E[nCv, doc_ids[:, nCv]] + output.append(doc_vectors.reshape((len(doc), nO))) + ids.append(doc_ids) + + def backprop(d_output): + dE = model.ops.alloc(E.shape, dtype=E.dtype) + for doc_ids, d_doc_vectors in zip(ids, d_output): + d_doc_vectors = d_doc_vectors.reshape((len(doc_ids), nC, nM)) + dE[nCv, doc_ids[:, nCv]] += d_doc_vectors[:, nCv] + model.inc_grad("E", dE) + return [] + + return output, backprop diff --git a/spacy/ml/_legacy_tok2vec.py b/spacy/ml/_legacy_tok2vec.py deleted file mode 100644 index c4291b5d6..000000000 --- a/spacy/ml/_legacy_tok2vec.py +++ /dev/null @@ -1,140 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals -from thinc.v2v import Model, Maxout -from thinc.i2v import HashEmbed, StaticVectors -from thinc.t2t import ExtractWindow -from thinc.misc import Residual -from thinc.misc import LayerNorm as LN -from thinc.misc import FeatureExtracter -from thinc.api import layerize, chain, clone, concatenate, with_flatten -from thinc.api import uniqued, wrap, noop - -from ..attrs import ID, ORTH, NORM, PREFIX, SUFFIX, SHAPE - - -def Tok2Vec(width, embed_size, **kwargs): - # Circular imports :( - from .._ml import CharacterEmbed - from .._ml import PyTorchBiLSTM - - pretrained_vectors = kwargs.get("pretrained_vectors", None) - cnn_maxout_pieces = kwargs.get("cnn_maxout_pieces", 3) - subword_features = kwargs.get("subword_features", True) - char_embed = kwargs.get("char_embed", False) - if char_embed: - subword_features = False - conv_depth = kwargs.get("conv_depth", 4) - bilstm_depth = kwargs.get("bilstm_depth", 0) - cols = [ID, NORM, PREFIX, SUFFIX, SHAPE, ORTH] - with Model.define_operators({">>": chain, "|": concatenate, "**": clone}): - norm = HashEmbed(width, embed_size, column=cols.index(NORM), name="embed_norm", seed=6) - if subword_features: - prefix = HashEmbed( - width, embed_size // 2, column=cols.index(PREFIX), name="embed_prefix", seed=7 - ) - suffix = HashEmbed( - width, embed_size // 2, column=cols.index(SUFFIX), name="embed_suffix", seed=8 - ) - shape = HashEmbed( - width, embed_size // 2, column=cols.index(SHAPE), name="embed_shape", seed=9 - ) - else: - prefix, suffix, shape = (None, None, None) - if pretrained_vectors is not None: - glove = StaticVectors(pretrained_vectors, width, column=cols.index(ID)) - - if subword_features: - embed = uniqued( - (glove | norm | prefix | suffix | shape) - >> LN(Maxout(width, width * 5, pieces=3)), - column=cols.index(ORTH), - ) - elif char_embed: - embed = concatenate_lists( - CharacterEmbed(nM=64, nC=8), - FeatureExtracter(cols) >> with_flatten(glove), - ) - reduce_dimensions = LN( - Maxout(width, 64 * 8 + width, pieces=cnn_maxout_pieces) - ) - else: - embed = uniqued( - (glove | norm) >> LN(Maxout(width, width * 2, pieces=3)), - column=cols.index(ORTH), - ) - elif subword_features: - embed = uniqued( - (norm | prefix | suffix | shape) - >> LN(Maxout(width, width * 4, pieces=3)), - column=cols.index(ORTH), - ) - elif char_embed: - embed = concatenate_lists( - CharacterEmbed(nM=64, nC=8), - FeatureExtracter(cols) >> with_flatten(norm), - ) - reduce_dimensions = LN( - Maxout(width, 64 * 8 + width, pieces=cnn_maxout_pieces) - ) - else: - embed = norm - - convolution = Residual( - ExtractWindow(nW=1) - >> LN(Maxout(width, width * 3, pieces=cnn_maxout_pieces)) - ) - if char_embed: - tok2vec = embed >> with_flatten( - reduce_dimensions >> convolution ** conv_depth, pad=conv_depth - ) - else: - tok2vec = FeatureExtracter(cols) >> with_flatten( - embed - >> convolution ** conv_depth, 
pad=conv_depth - ) - - if bilstm_depth >= 1: - tok2vec = tok2vec >> PyTorchBiLSTM(width, width, bilstm_depth) - # Work around thinc API limitations :(. TODO: Revise in Thinc 7 - tok2vec.nO = width - tok2vec.embed = embed - return tok2vec - - -@layerize -def flatten(seqs, drop=0.0): - ops = Model.ops - lengths = ops.asarray([len(seq) for seq in seqs], dtype="i") - - def finish_update(d_X, sgd=None): - return ops.unflatten(d_X, lengths, pad=0) - - X = ops.flatten(seqs, pad=0) - return X, finish_update - - -def concatenate_lists(*layers, **kwargs): # pragma: no cover - """Compose two or more models `f`, `g`, etc, such that their outputs are - concatenated, i.e. `concatenate(f, g)(x)` computes `hstack(f(x), g(x))` - """ - if not layers: - return noop() - drop_factor = kwargs.get("drop_factor", 1.0) - ops = layers[0].ops - layers = [chain(layer, flatten) for layer in layers] - concat = concatenate(*layers) - - def concatenate_lists_fwd(Xs, drop=0.0): - if drop is not None: - drop *= drop_factor - lengths = ops.asarray([len(X) for X in Xs], dtype="i") - flat_y, bp_flat_y = concat.begin_update(Xs, drop=drop) - ys = ops.unflatten(flat_y, lengths) - - def concatenate_lists_bwd(d_ys, sgd=None): - return bp_flat_y(ops.flatten(d_ys), sgd=sgd) - - return ys, concatenate_lists_bwd - - model = wrap(concatenate_lists_fwd, concat) - return model diff --git a/spacy/ml/_precomputable_affine.py b/spacy/ml/_precomputable_affine.py new file mode 100644 index 000000000..f5e5cd8ad --- /dev/null +++ b/spacy/ml/_precomputable_affine.py @@ -0,0 +1,155 @@ +from thinc.api import Model, normal_init + + +def PrecomputableAffine(nO, nI, nF, nP, dropout=0.1): + model = Model( + "precomputable_affine", + forward, + init=init, + dims={"nO": nO, "nI": nI, "nF": nF, "nP": nP}, + params={"W": None, "b": None, "pad": None}, + attrs={"dropout_rate": dropout}, + ) + return model + + +def forward(model, X, is_train): + nF = model.get_dim("nF") + nO = model.get_dim("nO") + nP = model.get_dim("nP") + nI = model.get_dim("nI") + W = model.get_param("W") + Yf = model.ops.gemm(X, W.reshape((nF * nO * nP, nI)), trans2=True) + Yf = Yf.reshape((Yf.shape[0], nF, nO, nP)) + Yf = model.ops.xp.vstack((model.get_param("pad"), Yf)) + + def backward(dY_ids): + # This backprop is particularly tricky, because we get back a different + # thing from what we put out. We put out an array of shape: + # (nB, nF, nO, nP), and get back: + # (nB, nO, nP) and ids (nB, nF) + # The ids tell us the values of nF, so we would have: + # + # dYf = zeros((nB, nF, nO, nP)) + # for b in range(nB): + # for f in range(nF): + # dYf[b, ids[b, f]] += dY[b] + # + # However, we avoid building that array for efficiency -- and just pass + # in the indices. 
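# Editor's note: a small numpy check, not part of the patch, of the identity used
# in _backprop_precomputable_affine_padding below: accumulating dY over the rows
# whose feature id is missing (ids < 0) is the same as the single matrix product
# (ids < 0).T @ dY. The sizes are made-up toy values.
import numpy as np

nB, nF, nO, nP = 6, 4, 3, 2
ids = np.random.randint(-1, 5, (nB, nF))
dY = np.random.rand(nB, nO * nP)
d_pad_loop = np.zeros((nF, nO * nP))
for b in range(nB):
    for f in range(nF):
        if ids[b, f] < 0:
            d_pad_loop[f] += dY[b]
d_pad_gemm = (ids < 0).astype("f").T @ dY
assert np.allclose(d_pad_loop, d_pad_gemm)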
+ dY, ids = dY_ids + assert dY.ndim == 3 + assert dY.shape[1] == nO, dY.shape + assert dY.shape[2] == nP, dY.shape + # nB = dY.shape[0] + model.inc_grad("pad", _backprop_precomputable_affine_padding(model, dY, ids)) + Xf = X[ids] + Xf = Xf.reshape((Xf.shape[0], nF * nI)) + + model.inc_grad("b", dY.sum(axis=0)) + dY = dY.reshape((dY.shape[0], nO * nP)) + + Wopfi = W.transpose((1, 2, 0, 3)) + Wopfi = Wopfi.reshape((nO * nP, nF * nI)) + dXf = model.ops.gemm(dY.reshape((dY.shape[0], nO * nP)), Wopfi) + + dWopfi = model.ops.gemm(dY, Xf, trans1=True) + dWopfi = dWopfi.reshape((nO, nP, nF, nI)) + # (o, p, f, i) --> (f, o, p, i) + dWopfi = dWopfi.transpose((2, 0, 1, 3)) + model.inc_grad("W", dWopfi) + return dXf.reshape((dXf.shape[0], nF, nI)) + + return Yf, backward + + +def _backprop_precomputable_affine_padding(model, dY, ids): + nB = dY.shape[0] + nF = model.get_dim("nF") + nP = model.get_dim("nP") + nO = model.get_dim("nO") + # Backprop the "padding", used as a filler for missing values. + # Values that are missing are set to -1, and each state vector could + # have multiple missing values. The padding has different values for + # different missing features. The gradient of the padding vector is: + # + # for b in range(nB): + # for f in range(nF): + # if ids[b, f] < 0: + # d_pad[f] += dY[b] + # + # Which can be rewritten as: + # + # (ids < 0).T @ dY + mask = model.ops.asarray(ids < 0, dtype="f") + d_pad = model.ops.gemm(mask, dY.reshape(nB, nO * nP), trans1=True) + return d_pad.reshape((1, nF, nO, nP)) + + +def init(model, X=None, Y=None): + """This is like the 'layer sequential unit variance', but instead + of taking the actual inputs, we randomly generate whitened data. + + Why's this all so complicated? We have a huge number of inputs, + and the maxout unit makes guessing the dynamics tricky. Instead + we set the maxout weights to values that empirically result in + whitened outputs given whitened inputs. + """ + if model.has_param("W") and model.get_param("W").any(): + return + + nF = model.get_dim("nF") + nO = model.get_dim("nO") + nP = model.get_dim("nP") + nI = model.get_dim("nI") + W = model.ops.alloc4f(nF, nO, nP, nI) + b = model.ops.alloc2f(nO, nP) + pad = model.ops.alloc4f(1, nF, nO, nP) + + ops = model.ops + W = normal_init(ops, W.shape, mean=float(ops.xp.sqrt(1.0 / nF * nI))) + pad = normal_init(ops, pad.shape, mean=1.0) + model.set_param("W", W) + model.set_param("b", b) + model.set_param("pad", pad) + + ids = ops.alloc((5000, nF), dtype="f") + ids += ops.xp.random.uniform(0, 1000, ids.shape) + ids = ops.asarray(ids, dtype="i") + tokvecs = ops.alloc((5000, nI), dtype="f") + tokvecs += ops.xp.random.normal(loc=0.0, scale=1.0, size=tokvecs.size).reshape( + tokvecs.shape + ) + + def predict(ids, tokvecs): + # nS ids. nW tokvecs. Exclude the padding array. 
+ hiddens = model.predict(tokvecs[:-1]) # (nW, f, o, p) + vectors = model.ops.alloc((ids.shape[0], nO * nP), dtype="f") + # need nS vectors + hiddens = hiddens.reshape((hiddens.shape[0] * nF, nO * nP)) + model.ops.scatter_add(vectors, ids.flatten(), hiddens) + vectors = vectors.reshape((vectors.shape[0], nO, nP)) + vectors += b + vectors = model.ops.asarray(vectors) + if nP >= 2: + return model.ops.maxout(vectors)[0] + else: + return vectors * (vectors >= 0) + + tol_var = 0.01 + tol_mean = 0.01 + t_max = 10 + W = model.get_param("W").copy() + b = model.get_param("b").copy() + for t_i in range(t_max): + acts1 = predict(ids, tokvecs) + var = model.ops.xp.var(acts1) + mean = model.ops.xp.mean(acts1) + if abs(var - 1.0) >= tol_var: + W /= model.ops.xp.sqrt(var) + model.set_param("W", W) + elif abs(mean) >= tol_mean: + b -= mean + model.set_param("b", b) + else: + break diff --git a/spacy/ml/_wire.py b/spacy/ml/_wire.py deleted file mode 100644 index fa271b37c..000000000 --- a/spacy/ml/_wire.py +++ /dev/null @@ -1,42 +0,0 @@ -from __future__ import unicode_literals -from thinc.api import layerize, wrap, noop, chain, concatenate -from thinc.v2v import Model - - -def concatenate_lists(*layers, **kwargs): # pragma: no cover - """Compose two or more models `f`, `g`, etc, such that their outputs are - concatenated, i.e. `concatenate(f, g)(x)` computes `hstack(f(x), g(x))` - """ - if not layers: - return layerize(noop()) - drop_factor = kwargs.get("drop_factor", 1.0) - ops = layers[0].ops - layers = [chain(layer, flatten) for layer in layers] - concat = concatenate(*layers) - - def concatenate_lists_fwd(Xs, drop=0.0): - if drop is not None: - drop *= drop_factor - lengths = ops.asarray([len(X) for X in Xs], dtype="i") - flat_y, bp_flat_y = concat.begin_update(Xs, drop=drop) - ys = ops.unflatten(flat_y, lengths) - - def concatenate_lists_bwd(d_ys, sgd=None): - return bp_flat_y(ops.flatten(d_ys), sgd=sgd) - - return ys, concatenate_lists_bwd - - model = wrap(concatenate_lists_fwd, concat) - return model - - -@layerize -def flatten(seqs, drop=0.0): - ops = Model.ops - lengths = ops.asarray([len(seq) for seq in seqs], dtype="i") - - def finish_update(d_X, sgd=None): - return ops.unflatten(d_X, lengths, pad=0) - - X = ops.flatten(seqs, pad=0) - return X, finish_update diff --git a/spacy/ml/common.py b/spacy/ml/common.py deleted file mode 100644 index f90b53a15..000000000 --- a/spacy/ml/common.py +++ /dev/null @@ -1,23 +0,0 @@ -from __future__ import unicode_literals - -from thinc.api import chain -from thinc.v2v import Maxout -from thinc.misc import LayerNorm -from ..util import registry, make_layer - - -@registry.architectures.register("thinc.FeedForward.v1") -def FeedForward(config): - layers = [make_layer(layer_cfg) for layer_cfg in config["layers"]] - model = chain(*layers) - model.cfg = config - return model - - -@registry.architectures.register("spacy.LayerNormalizedMaxout.v1") -def LayerNormalizedMaxout(config): - width = config["width"] - pieces = config["pieces"] - layer = LayerNorm(Maxout(width, pieces=pieces)) - layer.nO = width - return layer diff --git a/spacy/ml/extract_ngrams.py b/spacy/ml/extract_ngrams.py new file mode 100644 index 000000000..bdc297232 --- /dev/null +++ b/spacy/ml/extract_ngrams.py @@ -0,0 +1,36 @@ +import numpy +from thinc.api import Model + +from ..attrs import LOWER + + +def extract_ngrams(ngram_size: int, attr: int = LOWER) -> Model: + model = Model("extract_ngrams", forward) + model.attrs["ngram_size"] = ngram_size + model.attrs["attr"] = attr + return model + + +def 
forward(model: Model, docs, is_train: bool): + batch_keys = [] + batch_vals = [] + for doc in docs: + unigrams = model.ops.asarray(doc.to_array([model.attrs["attr"]])) + ngrams = [unigrams] + for n in range(2, model.attrs["ngram_size"] + 1): + ngrams.append(model.ops.ngrams(n, unigrams)) + keys = model.ops.xp.concatenate(ngrams) + keys, vals = model.ops.xp.unique(keys, return_counts=True) + batch_keys.append(keys) + batch_vals.append(vals) + # The dtype here matches what thinc is expecting -- which differs per + # platform (by int definition). This should be fixed once the problem + # is fixed on Thinc's side. + lengths = model.ops.asarray([arr.shape[0] for arr in batch_keys], dtype=numpy.int_) + batch_keys = model.ops.xp.concatenate(batch_keys) + batch_vals = model.ops.asarray(model.ops.xp.concatenate(batch_vals), dtype="f") + + def backprop(dY): + return [] + + return (batch_keys, batch_vals, lengths), backprop diff --git a/spacy/ml/featureextractor.py b/spacy/ml/featureextractor.py new file mode 100644 index 000000000..ed2918f02 --- /dev/null +++ b/spacy/ml/featureextractor.py @@ -0,0 +1,28 @@ +from typing import List, Union, Callable, Tuple +from thinc.types import Ints2d +from thinc.api import Model, registry + +from ..tokens import Doc + + +@registry.layers("spacy.FeatureExtractor.v1") +def FeatureExtractor(columns: List[Union[int, str]]) -> Model[List[Doc], List[Ints2d]]: + return Model("extract_features", forward, attrs={"columns": columns}) + + +def forward( + model: Model[List[Doc], List[Ints2d]], docs, is_train: bool +) -> Tuple[List[Ints2d], Callable]: + columns = model.attrs["columns"] + features: List[Ints2d] = [] + for doc in docs: + if hasattr(doc, "to_array"): + attrs = doc.to_array(columns) + else: + attrs = doc.doc.to_array(columns)[doc.start : doc.end] + if attrs.ndim == 1: + attrs = attrs.reshape((attrs.shape[0], 1)) + features.append(model.ops.asarray2i(attrs, dtype="uint64")) + + backprop: Callable[[List[Ints2d]], List] = lambda d_features: [] + return features, backprop diff --git a/spacy/ml/models/__init__.py b/spacy/ml/models/__init__.py new file mode 100644 index 000000000..67e70421f --- /dev/null +++ b/spacy/ml/models/__init__.py @@ -0,0 +1,5 @@ +from .entity_linker import * # noqa +from .parser import * # noqa +from .tagger import * # noqa +from .textcat import * # noqa +from .tok2vec import * # noqa diff --git a/spacy/ml/models/entity_linker.py b/spacy/ml/models/entity_linker.py new file mode 100644 index 000000000..f37203b1b --- /dev/null +++ b/spacy/ml/models/entity_linker.py @@ -0,0 +1,48 @@ +from pathlib import Path +from typing import Optional, Callable, Iterable +from thinc.api import chain, clone, list2ragged, reduce_mean, residual +from thinc.api import Model, Maxout, Linear + +from ...util import registry +from ...kb import KnowledgeBase, Candidate, get_candidates +from ...vocab import Vocab + + +@registry.architectures.register("spacy.EntityLinker.v1") +def build_nel_encoder(tok2vec: Model, nO: Optional[int] = None) -> Model: + with Model.define_operators({">>": chain, "**": clone}): + token_width = tok2vec.get_dim("nO") + output_layer = Linear(nO=nO, nI=token_width) + model = ( + tok2vec + >> list2ragged() + >> reduce_mean() + >> residual(Maxout(nO=token_width, nI=token_width, nP=2, dropout=0.0)) + >> output_layer + ) + model.set_ref("output_layer", output_layer) + model.set_ref("tok2vec", tok2vec) + return model + + +@registry.misc.register("spacy.KBFromFile.v1") +def load_kb(kb_path: Path) -> Callable[[Vocab], KnowledgeBase]: + def 
kb_from_file(vocab): + kb = KnowledgeBase(vocab, entity_vector_length=1) + kb.from_disk(kb_path) + return kb + + return kb_from_file + + +@registry.misc.register("spacy.EmptyKB.v1") +def empty_kb(entity_vector_length: int) -> Callable[[Vocab], KnowledgeBase]: + def empty_kb_factory(vocab): + return KnowledgeBase(vocab=vocab, entity_vector_length=entity_vector_length) + + return empty_kb_factory + + +@registry.misc.register("spacy.CandidateGenerator.v1") +def create_candidates() -> Callable[[KnowledgeBase, "Span"], Iterable[Candidate]]: + return get_candidates diff --git a/spacy/ml/models/multi_task.py b/spacy/ml/models/multi_task.py new file mode 100644 index 000000000..ac990c015 --- /dev/null +++ b/spacy/ml/models/multi_task.py @@ -0,0 +1,168 @@ +from typing import Optional, Iterable, Tuple, List, TYPE_CHECKING +import numpy +from thinc.api import chain, Maxout, LayerNorm, Softmax, Linear, zero_init, Model +from thinc.api import MultiSoftmax, list2array + +if TYPE_CHECKING: + # This lets us add type hints for mypy etc. without causing circular imports + from ...vocab import Vocab # noqa: F401 + from ...tokens import Doc # noqa: F401 + + +def build_multi_task_model( + tok2vec: Model, + maxout_pieces: int, + token_vector_width: int, + nO: Optional[int] = None, +) -> Model: + softmax = Softmax(nO=nO, nI=token_vector_width * 2) + model = chain( + tok2vec, + Maxout( + nO=token_vector_width * 2, + nI=token_vector_width, + nP=maxout_pieces, + dropout=0.0, + ), + LayerNorm(token_vector_width * 2), + softmax, + ) + model.set_ref("tok2vec", tok2vec) + model.set_ref("output_layer", softmax) + return model + + +def build_cloze_multi_task_model( + vocab: "Vocab", + tok2vec: Model, + maxout_pieces: int, + hidden_size: int, + nO: Optional[int] = None, +) -> Model: + # nO = vocab.vectors.data.shape[1] + output_layer = chain( + list2array(), + Maxout( + nO=nO, + nI=tok2vec.get_dim("nO"), + nP=maxout_pieces, + normalize=True, + dropout=0.0, + ), + Linear(nO=nO, nI=nO, init_W=zero_init), + ) + model = chain(tok2vec, output_layer) + model = build_masked_language_model(vocab, model) + model.set_ref("tok2vec", tok2vec) + model.set_ref("output_layer", output_layer) + return model + + +def build_cloze_characters_multi_task_model( + vocab: "Vocab", tok2vec: Model, maxout_pieces: int, hidden_size: int, nr_char: int +) -> Model: + output_layer = chain( + list2array(), + Maxout(hidden_size, nP=maxout_pieces), + LayerNorm(nI=hidden_size), + MultiSoftmax([256] * nr_char, nI=hidden_size), + ) + model = build_masked_language_model(vocab, chain(tok2vec, output_layer)) + model.set_ref("tok2vec", tok2vec) + model.set_ref("output_layer", output_layer) + return model + + +def build_masked_language_model( + vocab: "Vocab", wrapped_model: Model, mask_prob: float = 0.15 +) -> Model: + """Convert a model into a BERT-style masked language model""" + random_words = _RandomWords(vocab) + + def mlm_forward(model, docs, is_train): + mask, docs = _apply_mask(docs, random_words, mask_prob=mask_prob) + mask = model.ops.asarray(mask).reshape((mask.shape[0], 1)) + output, backprop = model.layers[0](docs, is_train) + + def mlm_backward(d_output): + d_output *= 1 - mask + return backprop(d_output) + + return output, mlm_backward + + def mlm_initialize(model: Model, X=None, Y=None): + wrapped = model.layers[0] + wrapped.initialize(X=X, Y=Y) + for dim in wrapped.dim_names: + if wrapped.has_dim(dim): + model.set_dim(dim, wrapped.get_dim(dim)) + + mlm_model = Model( + "masked-language-model", + mlm_forward, + layers=[wrapped_model], + 
init=mlm_initialize, + refs={"wrapped": wrapped_model}, + dims={dim: None for dim in wrapped_model.dim_names}, + ) + mlm_model.set_ref("wrapped", wrapped_model) + return mlm_model + + +class _RandomWords: + def __init__(self, vocab: "Vocab") -> None: + self.words = [lex.text for lex in vocab if lex.prob != 0.0] + self.probs = [lex.prob for lex in vocab if lex.prob != 0.0] + self.words = self.words[:10000] + self.probs = self.probs[:10000] + self.probs = numpy.exp(numpy.array(self.probs, dtype="f")) + self.probs /= self.probs.sum() + self._cache = [] + + def next(self) -> str: + if not self._cache: + self._cache.extend( + numpy.random.choice(len(self.words), 10000, p=self.probs) + ) + index = self._cache.pop() + return self.words[index] + + +def _apply_mask( + docs: Iterable["Doc"], random_words: _RandomWords, mask_prob: float = 0.15 +) -> Tuple[numpy.ndarray, List["Doc"]]: + # This needs to be here to avoid circular imports + from ...tokens import Doc # noqa: F811 + + N = sum(len(doc) for doc in docs) + mask = numpy.random.uniform(0.0, 1.0, (N,)) + mask = mask >= mask_prob + i = 0 + masked_docs = [] + for doc in docs: + words = [] + for token in doc: + if not mask[i]: + word = _replace_word(token.text, random_words) + else: + word = token.text + words.append(word) + i += 1 + spaces = [bool(w.whitespace_) for w in doc] + # NB: If you change this implementation to instead modify + # the docs in place, take care that the IDs reflect the original + # words. Currently we use the original docs to make the vectors + # for the target, so we don't lose the original tokens. But if + # you modified the docs in place here, you would. + masked_docs.append(Doc(doc.vocab, words=words, spaces=spaces)) + return mask, masked_docs + + +def _replace_word(word: str, random_words: _RandomWords, mask: str = "[MASK]") -> str: + roll = numpy.random.random() + if roll < 0.8: + return mask + elif roll < 0.9: + return random_words.next() + else: + return word diff --git a/spacy/ml/models/parser.py b/spacy/ml/models/parser.py new file mode 100644 index 000000000..2c40bb3ab --- /dev/null +++ b/spacy/ml/models/parser.py @@ -0,0 +1,87 @@ +from typing import Optional, List +from thinc.api import Model, chain, list2array, Linear, zero_init, use_ops +from thinc.types import Floats2d + +from ...errors import Errors +from ...compat import Literal +from ...util import registry +from .._precomputable_affine import PrecomputableAffine +from ..tb_framework import TransitionModel +from ...tokens import Doc + + +@registry.architectures.register("spacy.TransitionBasedParser.v1") +def build_tb_parser_model( + tok2vec: Model[List[Doc], List[Floats2d]], + state_type: Literal["parser", "ner"], + extra_state_tokens: bool, + hidden_width: int, + maxout_pieces: int, + use_upper: bool = True, + nO: Optional[int] = None, +) -> Model: + """ + Build a transition-based parser model. Can apply to NER or dependency-parsing. + + Transition-based parsing is an approach to structured prediction where the + task of predicting the structure is mapped to a series of state transitions. + You might find this tutorial helpful as background: + https://explosion.ai/blog/parsing-english-in-python + + The neural network state prediction model consists of either two or three + subnetworks: + + * tok2vec: Map each token into a vector representations. This subnetwork + is run once for each batch. + * lower: Construct a feature-specific vector for each (token, feature) pair. + This is also run once for each batch. 
Constructing the state + representation is then simply a matter of summing the component features + and applying the non-linearity. + * upper (optional): A feed-forward network that predicts scores from the + state representation. If not present, the output from the lower model is + used as action scores directly. + + tok2vec (Model[List[Doc], List[Floats2d]]): + Subnetwork to map tokens into vector representations. + state_type (str): + String value denoting the type of parser model: "parser" or "ner" + extra_state_tokens (bool): Whether or not to use additional tokens in the context + to construct the state vector. Defaults to `False`, which means 3 and 8 + for the NER and parser respectively. When set to `True`, this would become 6 + feature sets (for the NER) or 13 (for the parser). + hidden_width (int): The width of the hidden layer. + maxout_pieces (int): How many pieces to use in the state prediction layer. + Recommended values are 1, 2 or 3. If 1, the maxout non-linearity + is replaced with a ReLu non-linearity if use_upper=True, and no + non-linearity if use_upper=False. + use_upper (bool): Whether to use an additional hidden layer after the state + vector in order to predict the action scores. It is recommended to set + this to False for large pretrained models such as transformers, and False + for smaller networks. The upper layer is computed on CPU, which becomes + a bottleneck on larger GPU-based models, where it's also less necessary. + nO (int or None): The number of actions the model will predict between. + Usually inferred from data at the beginning of training, or loaded from + disk. + """ + if state_type == "parser": + nr_feature_tokens = 13 if extra_state_tokens else 8 + elif state_type == "ner": + nr_feature_tokens = 6 if extra_state_tokens else 3 + else: + raise ValueError(Errors.E917.format(value=state_type)) + t2v_width = tok2vec.get_dim("nO") if tok2vec.has_dim("nO") else None + tok2vec = chain(tok2vec, list2array(), Linear(hidden_width, t2v_width)) + tok2vec.set_dim("nO", hidden_width) + lower = PrecomputableAffine( + nO=hidden_width if use_upper else nO, + nF=nr_feature_tokens, + nI=tok2vec.get_dim("nO"), + nP=maxout_pieces, + ) + if use_upper: + with use_ops("numpy"): + # Initialize weights at zero, as it's a classification layer. + upper = Linear(nO=nO, init_W=zero_init) + else: + upper = None + return TransitionModel(tok2vec, lower, upper) diff --git a/spacy/ml/models/tagger.py b/spacy/ml/models/tagger.py new file mode 100644 index 000000000..09405214c --- /dev/null +++ b/spacy/ml/models/tagger.py @@ -0,0 +1,28 @@ +from typing import Optional, List +from thinc.api import zero_init, with_array, Softmax, chain, Model +from thinc.types import Floats2d + +from ...util import registry +from ...tokens import Doc + + +@registry.architectures.register("spacy.Tagger.v1") +def build_tagger_model( + tok2vec: Model[List[Doc], List[Floats2d]], nO: Optional[int] = None +) -> Model[List[Doc], List[Floats2d]]: + """Build a tagger model, using a provided token-to-vector component. The tagger + model simply adds a linear layer with softmax activation to predict scores + given the token vectors. + + tok2vec (Model[List[Doc], List[Floats2d]]): The token-to-vector subnetwork. + nO (int or None): The number of tags to output. Inferred from the data if None. + """ + # TODO: glorot_uniform_init seems to work a bit better than zero_init here?! 
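# --- Illustrative sketch (not part of the patch) ---------------------------
# How the registered "spacy.TransitionBasedParser.v1" architecture above can
# be wired up with the HashEmbedCNN tok2vec defined later in this diff. The
# dimension values are arbitrary example settings, not recommended defaults.
from spacy.ml.models import build_hash_embed_cnn_tok2vec, build_tb_parser_model

tok2vec = build_hash_embed_cnn_tok2vec(
    width=96,
    depth=4,
    embed_size=2000,
    window_size=1,
    maxout_pieces=3,
    subword_features=True,
    pretrained_vectors=None,
)
parser_model = build_tb_parser_model(
    tok2vec=tok2vec,
    state_type="ner",           # or "parser"
    extra_state_tokens=False,   # 3 feature sets for NER, 8 for the parser
    hidden_width=64,
    maxout_pieces=2,
    use_upper=True,
    nO=None,                    # usually inferred from the data at training time
)
# ---------------------------------------------------------------------------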
+ t2v_width = tok2vec.get_dim("nO") if tok2vec.has_dim("nO") else None + output_layer = Softmax(nO, t2v_width, init_W=zero_init) + softmax = with_array(output_layer) + model = chain(tok2vec, softmax) + model.set_ref("tok2vec", tok2vec) + model.set_ref("softmax", output_layer) + model.set_ref("output_layer", output_layer) + return model diff --git a/spacy/ml/models/textcat.py b/spacy/ml/models/textcat.py new file mode 100644 index 000000000..ec8998e2d --- /dev/null +++ b/spacy/ml/models/textcat.py @@ -0,0 +1,180 @@ +from typing import Optional +from thinc.api import Model, reduce_mean, Linear, list2ragged, Logistic +from thinc.api import chain, concatenate, clone, Dropout, ParametricAttention +from thinc.api import SparseLinear, Softmax, softmax_activation, Maxout, reduce_sum +from thinc.api import HashEmbed, with_array, with_cpu, uniqued +from thinc.api import Relu, residual, expand_window + +from ...attrs import ID, ORTH, PREFIX, SUFFIX, SHAPE, LOWER +from ...util import registry +from ..extract_ngrams import extract_ngrams +from ..staticvectors import StaticVectors +from ..featureextractor import FeatureExtractor + + +@registry.architectures.register("spacy.TextCatCNN.v1") +def build_simple_cnn_text_classifier( + tok2vec: Model, exclusive_classes: bool, nO: Optional[int] = None +) -> Model: + """ + Build a simple CNN text classifier, given a token-to-vector model as inputs. + If exclusive_classes=True, a softmax non-linearity is applied, so that the + outputs sum to 1. If exclusive_classes=False, a logistic non-linearity + is applied instead, so that outputs are in the range [0, 1]. + """ + with Model.define_operators({">>": chain}): + if exclusive_classes: + output_layer = Softmax(nO=nO, nI=tok2vec.maybe_get_dim("nO")) + model = tok2vec >> list2ragged() >> reduce_mean() >> output_layer + model.set_ref("output_layer", output_layer) + else: + linear_layer = Linear(nO=nO, nI=tok2vec.maybe_get_dim("nO")) + model = ( + tok2vec >> list2ragged() >> reduce_mean() >> linear_layer >> Logistic() + ) + model.set_ref("output_layer", linear_layer) + model.set_ref("tok2vec", tok2vec) + model.set_dim("nO", nO) + model.attrs["multi_label"] = not exclusive_classes + return model + + +@registry.architectures.register("spacy.TextCatBOW.v1") +def build_bow_text_classifier( + exclusive_classes: bool, + ngram_size: int, + no_output_layer: bool, + nO: Optional[int] = None, +) -> Model: + # Don't document this yet, I'm not sure it's right. + with Model.define_operators({">>": chain}): + sparse_linear = SparseLinear(nO) + model = extract_ngrams(ngram_size, attr=ORTH) >> sparse_linear + model = with_cpu(model, model.ops) + if not no_output_layer: + output_layer = softmax_activation() if exclusive_classes else Logistic() + model = model >> with_cpu(output_layer, output_layer.ops) + model.set_ref("output_layer", sparse_linear) + model.attrs["multi_label"] = not exclusive_classes + return model + + +@registry.architectures.register("spacy.TextCatEnsemble.v1") +def build_text_classifier( + width: int, + embed_size: int, + pretrained_vectors: Optional[bool], + exclusive_classes: bool, + ngram_size: int, + window_size: int, + conv_depth: int, + dropout: Optional[float], + nO: Optional[int] = None, +) -> Model: + # Don't document this yet, I'm not sure it's right. 
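# --- Illustrative sketch (not part of the patch) ---------------------------
# Building the registered "spacy.TextCatBOW.v1" architecture above directly
# in Python. The hyper-parameters are arbitrary example values.
from spacy.ml.models import build_bow_text_classifier

bow_textcat = build_bow_text_classifier(
    exclusive_classes=False,    # multi-label: Logistic output instead of softmax
    ngram_size=1,               # unigrams only; higher values add n-gram features
    no_output_layer=False,
    nO=3,                       # number of labels, normally set by the pipe
)
# ---------------------------------------------------------------------------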
+ cols = [ORTH, LOWER, PREFIX, SUFFIX, SHAPE, ID] + with Model.define_operators({">>": chain, "|": concatenate, "**": clone}): + lower = HashEmbed( + nO=width, nV=embed_size, column=cols.index(LOWER), dropout=dropout, seed=10 + ) + prefix = HashEmbed( + nO=width // 2, + nV=embed_size, + column=cols.index(PREFIX), + dropout=dropout, + seed=11, + ) + suffix = HashEmbed( + nO=width // 2, + nV=embed_size, + column=cols.index(SUFFIX), + dropout=dropout, + seed=12, + ) + shape = HashEmbed( + nO=width // 2, + nV=embed_size, + column=cols.index(SHAPE), + dropout=dropout, + seed=13, + ) + width_nI = sum(layer.get_dim("nO") for layer in [lower, prefix, suffix, shape]) + trained_vectors = FeatureExtractor(cols) >> with_array( + uniqued( + (lower | prefix | suffix | shape) + >> Maxout(nO=width, nI=width_nI, normalize=True), + column=cols.index(ORTH), + ) + ) + if pretrained_vectors: + static_vectors = StaticVectors(width) + vector_layer = trained_vectors | static_vectors + vectors_width = width * 2 + else: + vector_layer = trained_vectors + vectors_width = width + tok2vec = vector_layer >> with_array( + Maxout(width, vectors_width, normalize=True) + >> residual( + ( + expand_window(window_size=window_size) + >> Maxout( + nO=width, nI=width * ((window_size * 2) + 1), normalize=True + ) + ) + ) + ** conv_depth, + pad=conv_depth, + ) + cnn_model = ( + tok2vec + >> list2ragged() + >> ParametricAttention(width) + >> reduce_sum() + >> residual(Maxout(nO=width, nI=width)) + >> Linear(nO=nO, nI=width) + >> Dropout(0.0) + ) + + linear_model = build_bow_text_classifier( + nO=nO, + ngram_size=ngram_size, + exclusive_classes=exclusive_classes, + no_output_layer=False, + ) + nO_double = nO * 2 if nO else None + if exclusive_classes: + output_layer = Softmax(nO=nO, nI=nO_double) + else: + output_layer = Linear(nO=nO, nI=nO_double) >> Dropout(0.0) >> Logistic() + model = (linear_model | cnn_model) >> output_layer + model.set_ref("tok2vec", tok2vec) + if model.has_dim("nO") is not False: + model.set_dim("nO", nO) + model.set_ref("output_layer", linear_model.get_ref("output_layer")) + model.attrs["multi_label"] = not exclusive_classes + return model + + +@registry.architectures.register("spacy.TextCatLowData.v1") +def build_text_classifier_lowdata( + width: int, + pretrained_vectors: Optional[bool], + dropout: Optional[float], + nO: Optional[int] = None, +) -> Model: + # Don't document this yet, I'm not sure it's right. 
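# --- Illustrative sketch (not part of the patch) ---------------------------
# The ensemble above is normally resolved through the architectures registry,
# which is what the config system does with "@architectures" entries. Example
# values only; pretrained_vectors=None skips the StaticVectors branch.
from spacy.util import registry

make_textcat = registry.architectures.get("spacy.TextCatEnsemble.v1")
ensemble = make_textcat(
    width=64,
    embed_size=2000,
    pretrained_vectors=None,
    exclusive_classes=False,
    ngram_size=1,
    window_size=1,
    conv_depth=2,
    dropout=None,
    nO=3,
)
# ---------------------------------------------------------------------------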
+ # Note, before v.3, this was the default if setting "low_data" and "pretrained_dims" + with Model.define_operators({">>": chain, "**": clone}): + model = ( + StaticVectors(width) + >> list2ragged() + >> ParametricAttention(width) + >> reduce_sum() + >> residual(Relu(width, width)) ** 2 + >> Linear(nO, width) + ) + if dropout: + model = model >> Dropout(dropout) + model = model >> Logistic() + return model diff --git a/spacy/ml/models/tok2vec.py b/spacy/ml/models/tok2vec.py new file mode 100644 index 000000000..95e200927 --- /dev/null +++ b/spacy/ml/models/tok2vec.py @@ -0,0 +1,317 @@ +from typing import Optional, List, Union +from thinc.types import Floats2d +from thinc.api import chain, clone, concatenate, with_array, with_padded +from thinc.api import Model, noop, list2ragged, ragged2list, HashEmbed +from thinc.api import expand_window, residual, Maxout, Mish, PyTorchLSTM + +from ...tokens import Doc +from ...util import registry +from ...errors import Errors +from ...ml import _character_embed +from ..staticvectors import StaticVectors +from ..featureextractor import FeatureExtractor +from ...pipeline.tok2vec import Tok2VecListener +from ...attrs import intify_attr + + +@registry.architectures.register("spacy.Tok2VecListener.v1") +def tok2vec_listener_v1(width: int, upstream: str = "*"): + tok2vec = Tok2VecListener(upstream_name=upstream, width=width) + return tok2vec + + +@registry.architectures.register("spacy.HashEmbedCNN.v1") +def build_hash_embed_cnn_tok2vec( + *, + width: int, + depth: int, + embed_size: int, + window_size: int, + maxout_pieces: int, + subword_features: bool, + pretrained_vectors: Optional[bool], +) -> Model[List[Doc], List[Floats2d]]: + """Build spaCy's 'standard' tok2vec layer, which uses hash embedding + with subword features and a CNN with layer-normalized maxout. + + width (int): The width of the input and output. These are required to be the + same, so that residual connections can be used. Recommended values are + 96, 128 or 300. + depth (int): The number of convolutional layers to use. Recommended values + are between 2 and 8. + window_size (int): The number of tokens on either side to concatenate during + the convolutions. The receptive field of the CNN will be + depth * (window_size * 2 + 1), so a 4-layer network with window_size of + 2 will be sensitive to 17 words at a time. Recommended value is 1. + embed_size (int): The number of rows in the hash embedding tables. This can + be surprisingly small, due to the use of the hash embeddings. Recommended + values are between 2000 and 10000. + maxout_pieces (int): The number of pieces to use in the maxout non-linearity. + If 1, the Mish non-linearity is used instead. Recommended values are 1-3. + subword_features (bool): Whether to also embed subword features, specifically + the prefix, suffix and word shape. This is recommended for alphabetic + languages like English, but not if single-character tokens are used for + a language such as Chinese. + pretrained_vectors (bool): Whether to also use static vectors. 
+ """ + if subword_features: + attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"] + row_sizes = [embed_size, embed_size // 2, embed_size // 2, embed_size // 2] + else: + attrs = ["NORM"] + row_sizes = [embed_size] + return build_Tok2Vec_model( + embed=MultiHashEmbed( + width=width, + rows=row_sizes, + attrs=attrs, + include_static_vectors=bool(pretrained_vectors), + ), + encode=MaxoutWindowEncoder( + width=width, + depth=depth, + window_size=window_size, + maxout_pieces=maxout_pieces, + ), + ) + + +@registry.architectures.register("spacy.Tok2Vec.v1") +def build_Tok2Vec_model( + embed: Model[List[Doc], List[Floats2d]], + encode: Model[List[Floats2d], List[Floats2d]], +) -> Model[List[Doc], List[Floats2d]]: + """Construct a tok2vec model out of embedding and encoding subnetworks. + See https://explosion.ai/blog/deep-learning-formula-nlp + + embed (Model[List[Doc], List[Floats2d]]): Embed tokens into context-independent + word vector representations. + encode (Model[List[Floats2d], List[Floats2d]]): Encode context into the + embeddings, using an architecture such as a CNN, BiLSTM or transformer. + """ + receptive_field = encode.attrs.get("receptive_field", 0) + tok2vec = chain(embed, with_array(encode, pad=receptive_field)) + tok2vec.set_dim("nO", encode.get_dim("nO")) + tok2vec.set_ref("embed", embed) + tok2vec.set_ref("encode", encode) + return tok2vec + + +@registry.architectures.register("spacy.MultiHashEmbed.v1") +def MultiHashEmbed( + width: int, + attrs: List[Union[str, int]], + rows: List[int], + include_static_vectors: bool, +) -> Model[List[Doc], List[Floats2d]]: + """Construct an embedding layer that separately embeds a number of lexical + attributes using hash embedding, concatenates the results, and passes it + through a feed-forward subnetwork to build a mixed representations. + + The features used can be configured with the 'attrs' argument. The suggested + attributes are NORM, PREFIX, SUFFIX and SHAPE. This lets the model take into + account some subword information, without constructing a fully character-based + representation. If pretrained vectors are available, they can be included in + the representation as well, with the vectors table will be kept static + (i.e. it's not updated). + + The `width` parameter specifies the output width of the layer and the widths + of all embedding tables. If static vectors are included, a learned linear + layer is used to map the vectors to the specified width before concatenating + it with the other embedding outputs. A single Maxout layer is then used to + reduce the concatenated vectors to the final width. + + The `rows` parameter controls the number of rows used by the `HashEmbed` + tables. The HashEmbed layer needs surprisingly few rows, due to its use of + the hashing trick. Generally between 2000 and 10000 rows is sufficient, + even for very large vocabularies. A number of rows must be specified for each + table, so the `rows` list must be of the same length as the `attrs` parameter. + + width (int): The output width. Also used as the width of the embedding tables. + Recommended values are between 64 and 300. + attrs (list of attr IDs): The token attributes to embed. A separate + embedding table will be constructed for each attribute. + rows (List[int]): The number of rows in the embedding tables. Must have the + same length as attrs. + include_static_vectors (bool): Whether to also use static word vectors. + Requires a vectors table to be loaded in the Doc objects' vocab. 
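# --- Illustrative sketch (not part of the patch) ---------------------------
# Composing "spacy.Tok2Vec.v1" explicitly from an embedding and an encoding
# subnetwork, mirroring what build_hash_embed_cnn_tok2vec does internally.
# Widths and row counts are example values only.
from spacy.ml.models.tok2vec import (
    build_Tok2Vec_model,
    MultiHashEmbed,
    MaxoutWindowEncoder,
)

embed = MultiHashEmbed(
    width=96,
    attrs=["NORM", "PREFIX", "SUFFIX", "SHAPE"],
    rows=[2000, 1000, 1000, 1000],
    include_static_vectors=False,
)
encode = MaxoutWindowEncoder(width=96, window_size=1, maxout_pieces=3, depth=4)
tok2vec = build_Tok2Vec_model(embed, encode)
assert tok2vec.get_dim("nO") == 96
# ---------------------------------------------------------------------------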
+ """ + if len(rows) != len(attrs): + raise ValueError(f"Mismatched lengths: {len(rows)} vs {len(attrs)}") + seed = 7 + + def make_hash_embed(index): + nonlocal seed + seed += 1 + return HashEmbed(width, rows[index], column=index, seed=seed, dropout=0.0) + + embeddings = [make_hash_embed(i) for i in range(len(attrs))] + concat_size = width * (len(embeddings) + include_static_vectors) + if include_static_vectors: + model = chain( + concatenate( + chain( + FeatureExtractor(attrs), + list2ragged(), + with_array(concatenate(*embeddings)), + ), + StaticVectors(width, dropout=0.0), + ), + with_array(Maxout(width, concat_size, nP=3, dropout=0.0, normalize=True)), + ragged2list(), + ) + else: + model = chain( + FeatureExtractor(list(attrs)), + list2ragged(), + with_array(concatenate(*embeddings)), + with_array(Maxout(width, concat_size, nP=3, dropout=0.0, normalize=True)), + ragged2list(), + ) + return model + + +@registry.architectures.register("spacy.CharacterEmbed.v1") +def CharacterEmbed( + width: int, + rows: int, + nM: int, + nC: int, + include_static_vectors: bool, + feature: Union[int, str] = "LOWER", +) -> Model[List[Doc], List[Floats2d]]: + """Construct an embedded representation based on character embeddings, using + a feed-forward network. A fixed number of UTF-8 byte characters are used for + each word, taken from the beginning and end of the word equally. Padding is + used in the centre for words that are too short. + + For instance, let's say nC=4, and the word is "jumping". The characters + used will be jung (two from the start, two from the end). If we had nC=8, + the characters would be "jumpping": 4 from the start, 4 from the end. This + ensures that the final character is always in the last position, instead + of being in an arbitrary position depending on the word length. + + The characters are embedded in a embedding table with a given number of rows, + and the vectors concatenated. A hash-embedded vector of the LOWER of the word is + also concatenated on, and the result is then passed through a feed-forward + network to construct a single vector to represent the information. + + feature (int or str): An attribute to embed, to concatenate with the characters. + width (int): The width of the output vector and the feature embedding. + rows (int): The number of rows in the LOWER hash embedding table. + nM (int): The dimensionality of the character embeddings. Recommended values + are between 16 and 64. + nC (int): The number of UTF-8 bytes to embed per word. Recommended values + are between 3 and 8, although it may depend on the length of words in the + language. + include_static_vectors (bool): Whether to also use static word vectors. + Requires a vectors table to be loaded in the Doc objects' vocab. 
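# --- Illustrative sketch (not part of the patch) ---------------------------
# Running the character-based embedding above over a small batch of Docs.
# English() is only used here to obtain a Vocab; all sizes are example values.
from spacy.lang.en import English
from spacy.ml.models.tok2vec import CharacterEmbed

nlp = English()
docs = [nlp.make_doc("The quick brown fox"), nlp.make_doc("jumped over it")]
char_embed = CharacterEmbed(
    width=96, rows=2000, nM=16, nC=4, include_static_vectors=False
)
char_embed.initialize(X=docs)
vectors = char_embed.predict(docs)   # one (n_tokens, 96) array per Doc
# ---------------------------------------------------------------------------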
+ """ + feature = intify_attr(feature) + if feature is None: + raise ValueError(Errors.E911(feat=feature)) + if include_static_vectors: + model = chain( + concatenate( + chain(_character_embed.CharacterEmbed(nM=nM, nC=nC), list2ragged()), + chain( + FeatureExtractor([feature]), + list2ragged(), + with_array(HashEmbed(nO=width, nV=rows, column=0, seed=5)), + ), + StaticVectors(width, dropout=0.0), + ), + with_array( + Maxout(width, nM * nC + (2 * width), nP=3, normalize=True, dropout=0.0) + ), + ragged2list(), + ) + else: + model = chain( + concatenate( + chain(_character_embed.CharacterEmbed(nM=nM, nC=nC), list2ragged()), + chain( + FeatureExtractor([feature]), + list2ragged(), + with_array(HashEmbed(nO=width, nV=rows, column=0, seed=5)), + ), + ), + with_array( + Maxout(width, nM * nC + width, nP=3, normalize=True, dropout=0.0) + ), + ragged2list(), + ) + return model + + +@registry.architectures.register("spacy.MaxoutWindowEncoder.v1") +def MaxoutWindowEncoder( + width: int, window_size: int, maxout_pieces: int, depth: int +) -> Model[List[Floats2d], List[Floats2d]]: + """Encode context using convolutions with maxout activation, layer + normalization and residual connections. + + width (int): The input and output width. These are required to be the same, + to allow residual connections. This value will be determined by the + width of the inputs. Recommended values are between 64 and 300. + window_size (int): The number of words to concatenate around each token + to construct the convolution. Recommended value is 1. + maxout_pieces (int): The number of maxout pieces to use. Recommended + values are 2 or 3. + depth (int): The number of convolutional layers. Recommended value is 4. + """ + cnn = chain( + expand_window(window_size=window_size), + Maxout( + nO=width, + nI=width * ((window_size * 2) + 1), + nP=maxout_pieces, + dropout=0.0, + normalize=True, + ), + ) + model = clone(residual(cnn), depth) + model.set_dim("nO", width) + model.attrs["receptive_field"] = window_size * depth + return model + + +@registry.architectures.register("spacy.MishWindowEncoder.v1") +def MishWindowEncoder( + width: int, window_size: int, depth: int +) -> Model[List[Floats2d], List[Floats2d]]: + """Encode context using convolutions with mish activation, layer + normalization and residual connections. + + width (int): The input and output width. These are required to be the same, + to allow residual connections. This value will be determined by the + width of the inputs. Recommended values are between 64 and 300. + window_size (int): The number of words to concatenate around each token + to construct the convolution. Recommended value is 1. + depth (int): The number of convolutional layers. Recommended value is 4. + """ + cnn = chain( + expand_window(window_size=window_size), + Mish(nO=width, nI=width * ((window_size * 2) + 1), dropout=0.0, normalize=True), + ) + model = clone(residual(cnn), depth) + model.set_dim("nO", width) + return model + + +@registry.architectures.register("spacy.TorchBiLSTMEncoder.v1") +def BiLSTMEncoder( + width: int, depth: int, dropout: float +) -> Model[List[Floats2d], List[Floats2d]]: + """Encode context using bidirectonal LSTM layers. Requires PyTorch. + + width (int): The input and output width. These are required to be the same, + to allow residual connections. This value will be determined by the + width of the inputs. Recommended values are between 64 and 300. + window_size (int): The number of words to concatenate around each token + to construct the convolution. 
Recommended value is 1. + depth (int): The number of convolutional layers. Recommended value is 4. + """ + if depth == 0: + return noop() + return with_padded(PyTorchLSTM(width, width, bi=True, depth=depth, dropout=dropout)) diff --git a/spacy/syntax/_parser_model.pxd b/spacy/ml/parser_model.pxd similarity index 88% rename from spacy/syntax/_parser_model.pxd rename to spacy/ml/parser_model.pxd index 9c72f3415..6582b3468 100644 --- a/spacy/syntax/_parser_model.pxd +++ b/spacy/ml/parser_model.pxd @@ -1,8 +1,6 @@ from libc.string cimport memset, memcpy -from libc.stdlib cimport calloc, free, realloc -from thinc.typedefs cimport weight_t, class_t, hash_t - -from ._state cimport StateC +from ..typedefs cimport weight_t, hash_t +from ..pipeline._parser_internals._state cimport StateC cdef struct SizesC: diff --git a/spacy/syntax/_parser_model.pyx b/spacy/ml/parser_model.pyx similarity index 61% rename from spacy/syntax/_parser_model.pyx rename to spacy/ml/parser_model.pyx index 8b6448a46..da937ca4f 100644 --- a/spacy/syntax/_parser_model.pyx +++ b/spacy/ml/parser_model.pyx @@ -1,40 +1,18 @@ -# cython: infer_types=True -# cython: cdivision=True -# cython: boundscheck=False -# coding: utf-8 -from __future__ import unicode_literals, print_function - -from collections import OrderedDict -import numpy -cimport cython.parallel -import numpy.random +# cython: infer_types=True, cdivision=True, boundscheck=False cimport numpy as np from libc.math cimport exp -from libcpp.vector cimport vector from libc.string cimport memset, memcpy from libc.stdlib cimport calloc, free, realloc -from cymem.cymem cimport Pool -from thinc.typedefs cimport weight_t, class_t, hash_t -from thinc.extra.search cimport Beam -from thinc.api import chain, clone -from thinc.v2v import Model, Maxout, Affine -from thinc.misc import LayerNorm -from thinc.neural.ops import CupyOps, NumpyOps -from thinc.neural.util import get_array_module -from thinc.linalg cimport Vec, VecVec +from thinc.backends.linalg cimport Vec, VecVec cimport blis.cy -from .._ml import zero_init, PrecomputableAffine, Tok2Vec, flatten -from .._ml import link_vectors_to_models, create_default_optimizer -from ..compat import copy_array -from ..tokens.doc cimport Doc -from ..gold cimport GoldParse -from ..errors import Errors, TempErrors +import numpy +import numpy.random +from thinc.api import Model, CupyOps, NumpyOps + from .. import util -from .stateclass cimport StateClass -from .transition_system cimport Transition -from . import _beam_utils -from . 
import nonproj +from ..typedefs cimport weight_t, class_t, hash_t +from ..pipeline._parser_internals.stateclass cimport StateClass cdef WeightsC get_c_weights(model) except *: @@ -48,8 +26,8 @@ cdef WeightsC get_c_weights(model) except *: output.hidden_weights = NULL output.hidden_bias = NULL else: - vec2scores_W = model.vec2scores.W - vec2scores_b = model.vec2scores.b + vec2scores_W = model.vec2scores.get_param("W") + vec2scores_b = model.vec2scores.get_param("b") output.hidden_weights = vec2scores_W.data output.hidden_bias = vec2scores_b.data cdef np.ndarray class_mask = model._class_mask @@ -61,12 +39,12 @@ cdef SizesC get_c_sizes(model, int batch_size) except *: cdef SizesC output output.states = batch_size if model.vec2scores is None: - output.classes = model.state2vec.nO + output.classes = model.state2vec.get_dim("nO") else: - output.classes = model.vec2scores.nO - output.hiddens = model.state2vec.nO - output.pieces = model.state2vec.nP - output.feats = model.state2vec.nF + output.classes = model.vec2scores.get_dim("nO") + output.hiddens = model.state2vec.get_dim("nO") + output.pieces = model.state2vec.get_dim("nP") + output.feats = model.state2vec.get_dim("nF") output.embed_width = model.tokvecs.shape[1] return output @@ -228,95 +206,47 @@ cdef int arg_max_if_valid(const weight_t* scores, const int* is_valid, int n) no return best -class ParserModel(Model): - def __init__(self, tok2vec, lower_model, upper_model, unseen_classes=None): - Model.__init__(self) - self._layers = [tok2vec, lower_model] - if upper_model is not None: - self._layers.append(upper_model) - self.unseen_classes = set() - if unseen_classes: - for class_ in unseen_classes: - self.unseen_classes.add(class_) - - def begin_update(self, docs, drop=0.): - step_model = ParserStepModel(docs, self._layers, drop=drop, - unseen_classes=self.unseen_classes) - def finish_parser_update(golds, sgd=None): - step_model.make_updates(sgd) - return None - return step_model, finish_parser_update - - def resize_output(self, new_output): - if len(self._layers) == 2: - return - if new_output == self.upper.nO: - return - smaller = self.upper - - with Model.use_device('cpu'): - larger = Affine(new_output, smaller.nI) - larger.W.fill(0.0) - larger.b.fill(0.0) - # It seems very unhappy if I pass these as smaller.W? - # Seems to segfault. Maybe it's a descriptor protocol thing? - smaller_W = smaller.W - larger_W = larger.W - smaller_b = smaller.b - larger_b = larger.b - # Weights are stored in (nr_out, nr_in) format, so we're basically - # just adding rows here. 
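# --- Illustrative sketch (not part of the patch) ---------------------------
# The Thinc 8 accessors this file is being ported to: parameters and
# dimensions are read with get_param()/get_dim() instead of the old attribute
# access (layer.W, layer.b, layer.nO).
from thinc.api import Linear
import numpy

layer = Linear(nO=4, nI=3)
layer.initialize(X=numpy.zeros((2, 3), dtype="f"))
W = layer.get_param("W")     # was: layer.W
b = layer.get_param("b")     # was: layer.b
nO = layer.get_dim("nO")     # was: layer.nO
# ---------------------------------------------------------------------------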
- larger_W[:smaller.nO] = smaller_W - larger_b[:smaller.nO] = smaller_b - self._layers[-1] = larger - for i in range(smaller.nO, new_output): - self.unseen_classes.add(i) - - def begin_training(self, X, y=None): - self.lower.begin_training(X, y=y) - - @property - def tok2vec(self): - return self._layers[0] - - @property - def lower(self): - return self._layers[1] - - @property - def upper(self): - return self._layers[2] - class ParserStepModel(Model): - def __init__(self, docs, layers, unseen_classes=None, drop=0.): - self.tokvecs, self.bp_tokvecs = layers[0].begin_update(docs, drop=drop) - if layers[1].nP >= 2: + def __init__(self, docs, layers, *, has_upper, unseen_classes=None, train=True, + dropout=0.1): + Model.__init__(self, name="parser_step_model", forward=step_forward) + self.attrs["has_upper"] = has_upper + self.attrs["dropout_rate"] = dropout + self.tokvecs, self.bp_tokvecs = layers[0](docs, is_train=train) + if layers[1].get_dim("nP") >= 2: activation = "maxout" - elif len(layers) == 2: + elif has_upper: activation = None else: activation = "relu" self.state2vec = precompute_hiddens(len(docs), self.tokvecs, layers[1], - activation=activation, drop=drop) - if len(layers) == 3: + activation=activation, train=train) + if has_upper: self.vec2scores = layers[-1] else: self.vec2scores = None self.cuda_stream = util.get_cuda_stream(non_blocking=True) self.backprops = [] - if self.vec2scores is None: - self._class_mask = numpy.zeros((self.state2vec.nO,), dtype='f') - else: - self._class_mask = numpy.zeros((self.vec2scores.nO,), dtype='f') + self._class_mask = numpy.zeros((self.nO,), dtype='f') self._class_mask.fill(1) if unseen_classes is not None: for class_ in unseen_classes: self._class_mask[class_] = 0. + def clear_memory(self): + del self.tokvecs + del self.bp_tokvecs + del self.state2vec + del self.backprops + del self._class_mask + @property def nO(self): - return self.state2vec.nO + if self.attrs["has_upper"]: + return self.vec2scores.get_dim("nO") + else: + return self.state2vec.get_dim("nO") def class_is_unseen(self, class_): return self._class_mask[class_] @@ -327,42 +257,7 @@ class ParserStepModel(Model): def mark_class_seen(self, class_): self._class_mask[class_] = 1 - def begin_update(self, states, drop=0.): - token_ids = self.get_token_ids(states) - vector, get_d_tokvecs = self.state2vec.begin_update(token_ids, drop=0.0) - if self.vec2scores is not None: - mask = self.vec2scores.ops.get_dropout_mask(vector.shape, drop) - if mask is not None: - vector *= mask - scores, get_d_vector = self.vec2scores.begin_update(vector, drop=drop) - else: - scores = NumpyOps().asarray(vector) - get_d_vector = lambda d_scores, sgd=None: d_scores - mask = None - # If the class is unseen, make sure its score is minimum - scores[:, self._class_mask == 0] = numpy.nanmin(scores) - - def backprop_parser_step(d_scores, sgd=None): - # Zero vectors for unseen classes - d_scores *= self._class_mask - d_vector = get_d_vector(d_scores, sgd=sgd) - if mask is not None: - d_vector *= mask - if isinstance(self.state2vec.ops, CupyOps) \ - and not isinstance(token_ids, self.state2vec.ops.xp.ndarray): - # Move token_ids and d_vector to GPU, asynchronously - self.backprops.append(( - util.get_async(self.cuda_stream, token_ids), - util.get_async(self.cuda_stream, d_vector), - get_d_tokvecs - )) - else: - self.backprops.append((token_ids, d_vector, get_d_tokvecs)) - return None - return scores, backprop_parser_step - - def get_token_ids(self, batch): - states = _beam_utils.collect_states(batch) + def 
get_token_ids(self, states): cdef StateClass state states = [state for state in states if not state.is_final()] cdef np.ndarray ids = numpy.zeros((len(states), self.state2vec.nF), @@ -374,24 +269,65 @@ class ParserStepModel(Model): c_ids += ids.shape[1] return ids - def make_updates(self, sgd): + def backprop_step(self, token_ids, d_vector, get_d_tokvecs): + if isinstance(self.state2vec.ops, CupyOps) \ + and not isinstance(token_ids, self.state2vec.ops.xp.ndarray): + # Move token_ids and d_vector to GPU, asynchronously + self.backprops.append(( + util.get_async(self.cuda_stream, token_ids), + util.get_async(self.cuda_stream, d_vector), + get_d_tokvecs + )) + else: + self.backprops.append((token_ids, d_vector, get_d_tokvecs)) + + + def finish_steps(self, golds): # Add a padding vector to the d_tokvecs gradient, so that missing # values don't affect the real gradient. - d_tokvecs = self.ops.allocate((self.tokvecs.shape[0]+1, self.tokvecs.shape[1])) + d_tokvecs = self.ops.alloc((self.tokvecs.shape[0]+1, self.tokvecs.shape[1])) # Tells CUDA to block, so our async copies complete. if self.cuda_stream is not None: self.cuda_stream.synchronize() for ids, d_vector, bp_vector in self.backprops: - d_state_features = bp_vector((d_vector, ids), sgd=sgd) + d_state_features = bp_vector((d_vector, ids)) ids = ids.flatten() d_state_features = d_state_features.reshape( (ids.size, d_state_features.shape[2])) self.ops.scatter_add(d_tokvecs, ids, d_state_features) # Padded -- see update() - self.bp_tokvecs(d_tokvecs[:-1], sgd=sgd) + self.bp_tokvecs(d_tokvecs[:-1]) return d_tokvecs +NUMPY_OPS = NumpyOps() + +def step_forward(model: ParserStepModel, states, is_train): + token_ids = model.get_token_ids(states) + vector, get_d_tokvecs = model.state2vec(token_ids, is_train) + mask = None + if model.attrs["has_upper"]: + dropout_rate = model.attrs["dropout_rate"] + if is_train and dropout_rate > 0: + mask = NUMPY_OPS.get_dropout_mask(vector.shape, 0.1) + vector *= mask + scores, get_d_vector = model.vec2scores(vector, is_train) + else: + scores = NumpyOps().asarray(vector) + get_d_vector = lambda d_scores: d_scores + # If the class is unseen, make sure its score is minimum + scores[:, model._class_mask == 0] = numpy.nanmin(scores) + + def backprop_parser_step(d_scores): + # Zero vectors for unseen classes + d_scores *= model._class_mask + d_vector = get_d_vector(d_scores) + if mask is not None: + d_vector *= mask + model.backprop_step(token_ids, d_vector, get_d_tokvecs) + return None + return scores, backprop_parser_step + cdef class precompute_hiddens: """Allow a model to be "primed" by pre-computing input features in bulk. 
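# --- Illustrative sketch (not part of the patch) ---------------------------
# The padding trick used in finish_steps() above, in plain numpy: gradients
# for missing tokens are scattered into one extra padding row (index -1),
# which is dropped again before backpropagating into the tok2vec output.
import numpy

n_tokens, width = 5, 4
d_tokvecs = numpy.zeros((n_tokens + 1, width), dtype="f")  # +1 padding row
ids = numpy.array([0, 2, -1])           # -1 points at the padding row
d_feats = numpy.ones((3, width), dtype="f")
numpy.add.at(d_tokvecs, ids, d_feats)   # numpy equivalent of ops.scatter_add
d_real = d_tokvecs[:-1]                 # the padding row is discarded
# ---------------------------------------------------------------------------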
@@ -413,6 +349,7 @@ cdef class precompute_hiddens: cdef readonly int nF, nO, nP cdef bint _is_synchronized cdef public object ops + cdef public object numpy_ops cdef np.ndarray _features cdef np.ndarray _cached cdef np.ndarray bias @@ -421,8 +358,8 @@ cdef class precompute_hiddens: cdef object activation def __init__(self, batch_size, tokvecs, lower_model, cuda_stream=None, - activation="maxout", drop=0.): - gpu_cached, bp_features = lower_model.begin_update(tokvecs, drop=drop) + activation="maxout", train=False): + gpu_cached, bp_features = lower_model(tokvecs, train) cdef np.ndarray cached if not isinstance(gpu_cached, numpy.ndarray): # Note the passing of cuda_stream here: it lets @@ -431,14 +368,18 @@ cdef class precompute_hiddens: cached = gpu_cached.get(stream=cuda_stream) else: cached = gpu_cached - if not isinstance(lower_model.b, numpy.ndarray): - self.bias = lower_model.b.get() + if not isinstance(lower_model.get_param("b"), numpy.ndarray): + self.bias = lower_model.get_param("b").get(stream=cuda_stream) else: - self.bias = lower_model.b + self.bias = lower_model.get_param("b") self.nF = cached.shape[1] - self.nP = getattr(lower_model, 'nP', 1) + if lower_model.has_dim("nP"): + self.nP = lower_model.get_dim("nP") + else: + self.nP = 1 self.nO = cached.shape[2] self.ops = lower_model.ops + self.numpy_ops = NumpyOps() assert activation in (None, "relu", "maxout") self.activation = activation self._is_synchronized = False @@ -452,10 +393,46 @@ cdef class precompute_hiddens: self._is_synchronized = True return self._cached.data - def __call__(self, X): - return self.begin_update(X, drop=None)[0] + def has_dim(self, name): + if name == "nF": + return self.nF if self.nF is not None else True + elif name == "nP": + return self.nP if self.nP is not None else True + elif name == "nO": + return self.nO if self.nO is not None else True + else: + return False - def begin_update(self, token_ids, drop=0.): + def get_dim(self, name): + if name == "nF": + return self.nF + elif name == "nP": + return self.nP + elif name == "nO": + return self.nO + else: + raise ValueError(f"Dimension {name} invalid -- only nO, nF, nP") + + def set_dim(self, name, value): + if name == "nF": + self.nF = value + elif name == "nP": + self.nP = value + elif name == "nO": + self.nO = value + else: + raise ValueError(f"Dimension {name} invalid -- only nO, nF, nP") + + def __call__(self, X, bint is_train): + if is_train: + return self.begin_update(X) + else: + return self.predict(X), lambda X: X + + def predict(self, X): + return self.begin_update(X)[0] + + def begin_update(self, token_ids): cdef np.ndarray state_vector = numpy.zeros( (token_ids.shape[0], self.nO, self.nP), dtype='f') # This is tricky, but (assuming GPU available); @@ -473,48 +450,40 @@ cdef class precompute_hiddens: state_vector += self.bias state_vector, bp_nonlinearity = self._nonlinearity(state_vector) - def backward(d_state_vector_ids, sgd=None): + def backward(d_state_vector_ids): d_state_vector, token_ids = d_state_vector_ids - d_state_vector = bp_nonlinearity(d_state_vector, sgd) - d_tokens = bp_hiddens((d_state_vector, token_ids), sgd) + d_state_vector = bp_nonlinearity(d_state_vector) + d_tokens = bp_hiddens((d_state_vector, token_ids)) return d_tokens return state_vector, backward def _nonlinearity(self, state_vector): - if isinstance(state_vector, numpy.ndarray): - ops = NumpyOps() - else: - ops = CupyOps() - if self.activation == "maxout": - state_vector, mask = ops.maxout(state_vector) + return self._maxout_nonlinearity(state_vector) else: 
- state_vector = state_vector.reshape(state_vector.shape[:-1]) - if self.activation == "relu": - mask = state_vector >= 0. - state_vector *= mask - else: - mask = None + return self._relu_nonlinearity(state_vector) - def backprop_nonlinearity(d_best, sgd=None): - if isinstance(d_best, numpy.ndarray): - ops = NumpyOps() - else: - ops = CupyOps() - if mask is not None: - mask_ = ops.asarray(mask) - # This will usually be on GPU - d_best = ops.asarray(d_best) - # Fix nans (which can occur from unseen classes.) - d_best[ops.xp.isnan(d_best)] = 0. - if self.activation == "maxout": - mask_ = ops.asarray(mask) - return ops.backprop_maxout(d_best, mask_, self.nP) - elif self.activation == "relu": - mask_ = ops.asarray(mask) - d_best *= mask_ - d_best = d_best.reshape((d_best.shape + (1,))) - return d_best - else: - return d_best.reshape((d_best.shape + (1,))) - return state_vector, backprop_nonlinearity + def _maxout_nonlinearity(self, state_vector): + state_vector, mask = self.numpy_ops.maxout(state_vector) + # We're outputting to CPU, but we need this variable on GPU for the + # backward pass. + mask = self.ops.asarray(mask) + + def backprop_maxout(d_best): + return self.ops.backprop_maxout(d_best, mask, self.nP) + + return state_vector, backprop_maxout + + def _relu_nonlinearity(self, state_vector): + state_vector = state_vector.reshape((state_vector.shape[0], -1)) + mask = state_vector >= 0. + state_vector *= mask + # We're outputting to CPU, but we need this variable on GPU for the + # backward pass. + mask = self.ops.asarray(mask) + + def backprop_relu(d_best): + d_best *= mask + return d_best.reshape((d_best.shape + (1,))) + + return state_vector, backprop_relu diff --git a/spacy/ml/staticvectors.py b/spacy/ml/staticvectors.py new file mode 100644 index 000000000..f0213a9b8 --- /dev/null +++ b/spacy/ml/staticvectors.py @@ -0,0 +1,95 @@ +from typing import List, Tuple, Callable, Optional, cast +from thinc.initializers import glorot_uniform_init +from thinc.util import partial +from thinc.types import Ragged, Floats2d, Floats1d +from thinc.api import Model, Ops, registry + +from ..tokens import Doc +from ..errors import Errors + + +@registry.layers("spacy.StaticVectors.v1") +def StaticVectors( + nO: Optional[int] = None, + nM: Optional[int] = None, + *, + dropout: Optional[float] = None, + init_W: Callable = glorot_uniform_init, + key_attr: str = "ORTH" +) -> Model[List[Doc], Ragged]: + """Embed Doc objects with their vocab's vectors table, applying a learned + linear projection to control the dimensionality. If a dropout rate is + specified, the dropout is applied per dimension over the whole batch. 
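# --- Illustrative sketch (not part of the patch) ---------------------------
# "Dropout per dimension over the whole batch", as described above: a single
# mask of shape (nO,) is broadcast over every row of the output, rather than
# drawing a fresh mask per token.
from thinc.api import NumpyOps
import numpy

ops = NumpyOps()
nO, n_tokens = 4, 3
mask = ops.get_dropout_mask((nO,), 0.5)        # shape (nO,), rescaled by 1/(1-rate)
output = numpy.ones((n_tokens, nO), dtype="f")
dropped = output * mask                        # the same columns are zeroed in every row
# ---------------------------------------------------------------------------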
+ """ + return Model( + "static_vectors", + forward, + init=partial(init, init_W), + params={"W": None}, + attrs={"key_attr": key_attr, "dropout_rate": dropout}, + dims={"nO": nO, "nM": nM}, + ) + + +def forward( + model: Model[List[Doc], Ragged], docs: List[Doc], is_train: bool +) -> Tuple[Ragged, Callable]: + if not sum(len(doc) for doc in docs): + return _handle_empty(model.ops, model.get_dim("nO")) + key_attr = model.attrs["key_attr"] + W = cast(Floats2d, model.ops.as_contig(model.get_param("W"))) + V = cast(Floats2d, docs[0].vocab.vectors.data) + rows = model.ops.flatten( + [doc.vocab.vectors.find(keys=doc.to_array(key_attr)) for doc in docs] + ) + output = Ragged( + model.ops.gemm(model.ops.as_contig(V[rows]), W, trans2=True), + model.ops.asarray([len(doc) for doc in docs], dtype="i"), + ) + mask = None + if is_train: + mask = _get_drop_mask(model.ops, W.shape[0], model.attrs.get("dropout_rate")) + if mask is not None: + output.data *= mask + + def backprop(d_output: Ragged) -> List[Doc]: + if mask is not None: + d_output.data *= mask + model.inc_grad( + "W", + model.ops.gemm(d_output.data, model.ops.as_contig(V[rows]), trans1=True), + ) + return [] + + return output, backprop + + +def init( + init_W: Callable, + model: Model[List[Doc], Ragged], + X: Optional[List[Doc]] = None, + Y: Optional[Ragged] = None, +) -> Model[List[Doc], Ragged]: + nM = model.get_dim("nM") if model.has_dim("nM") else None + nO = model.get_dim("nO") if model.has_dim("nO") else None + if X is not None and len(X): + nM = X[0].vocab.vectors.data.shape[1] + if Y is not None: + nO = Y.data.shape[1] + + if nM is None: + raise ValueError(Errors.E905) + if nO is None: + raise ValueError(Errors.E904) + model.set_dim("nM", nM) + model.set_dim("nO", nO) + model.set_param("W", init_W(model.ops, (nO, nM))) + return model + + +def _handle_empty(ops: Ops, nO: int): + return Ragged(ops.alloc2f(0, nO), ops.alloc1i(0)), lambda d_ragged: [] + + +def _get_drop_mask(ops: Ops, nO: int, rate: Optional[float]) -> Optional[Floats1d]: + return ops.get_dropout_mask((nO,), rate) if rate is not None else None diff --git a/spacy/ml/tb_framework.py b/spacy/ml/tb_framework.py new file mode 100644 index 000000000..8b542b7b9 --- /dev/null +++ b/spacy/ml/tb_framework.py @@ -0,0 +1,86 @@ +from thinc.api import Model, noop, use_ops, Linear +from .parser_model import ParserStepModel + + +def TransitionModel(tok2vec, lower, upper, dropout=0.2, unseen_classes=set()): + """Set up a stepwise transition-based model""" + if upper is None: + has_upper = False + upper = noop() + else: + has_upper = True + # don't define nO for this object, because we can't dynamically change it + return Model( + name="parser_model", + forward=forward, + dims={"nI": tok2vec.get_dim("nI") if tok2vec.has_dim("nI") else None}, + layers=[tok2vec, lower, upper], + refs={"tok2vec": tok2vec, "lower": lower, "upper": upper}, + init=init, + attrs={ + "has_upper": has_upper, + "unseen_classes": set(unseen_classes), + "resize_output": resize_output, + }, + ) + + +def forward(model, X, is_train): + step_model = ParserStepModel( + X, + model.layers, + unseen_classes=model.attrs["unseen_classes"], + train=is_train, + has_upper=model.attrs["has_upper"], + ) + + return step_model, step_model.finish_steps + + +def init(model, X=None, Y=None): + model.get_ref("tok2vec").initialize(X=X) + lower = model.get_ref("lower") + lower.initialize() + if model.attrs["has_upper"]: + statevecs = model.ops.alloc2f(2, lower.get_dim("nO")) + model.get_ref("upper").initialize(X=statevecs) + + +def 
resize_output(model, new_nO): + lower = model.get_ref("lower") + upper = model.get_ref("upper") + if not model.attrs["has_upper"]: + if lower.has_dim("nO") is None: + lower.set_dim("nO", new_nO) + return + elif upper.has_dim("nO") is None: + upper.set_dim("nO", new_nO) + return + elif new_nO == upper.get_dim("nO"): + return + smaller = upper + nI = None + if smaller.has_dim("nI"): + nI = smaller.get_dim("nI") + with use_ops("numpy"): + larger = Linear(nO=new_nO, nI=nI) + larger.init = smaller.init + # it could be that the model is not initialized yet, then skip this bit + if nI: + larger_W = larger.ops.alloc2f(new_nO, nI) + larger_b = larger.ops.alloc1f(new_nO) + smaller_W = smaller.get_param("W") + smaller_b = smaller.get_param("b") + # Weights are stored in (nr_out, nr_in) format, so we're basically + # just adding rows here. + if smaller.has_dim("nO"): + larger_W[: smaller.get_dim("nO")] = smaller_W + larger_b[: smaller.get_dim("nO")] = smaller_b + for i in range(smaller.get_dim("nO"), new_nO): + model.attrs["unseen_classes"].add(i) + + larger.set_param("W", larger_W) + larger.set_param("b", larger_b) + model._layers[-1] = larger + model.set_ref("upper", larger) + return model diff --git a/spacy/ml/tok2vec.py b/spacy/ml/tok2vec.py deleted file mode 100644 index 6949d83e2..000000000 --- a/spacy/ml/tok2vec.py +++ /dev/null @@ -1,176 +0,0 @@ -from __future__ import unicode_literals - -from thinc.api import chain, layerize, clone, concatenate, with_flatten, uniqued -from thinc.api import noop, with_square_sequences -from thinc.v2v import Maxout, Model -from thinc.i2v import HashEmbed, StaticVectors -from thinc.t2t import ExtractWindow -from thinc.misc import Residual, LayerNorm, FeatureExtracter -from ..util import make_layer, registry -from ._wire import concatenate_lists - - -@registry.architectures.register("spacy.Tok2Vec.v1") -def Tok2Vec(config): - doc2feats = make_layer(config["@doc2feats"]) - embed = make_layer(config["@embed"]) - encode = make_layer(config["@encode"]) - field_size = getattr(encode, "receptive_field", 0) - tok2vec = chain(doc2feats, with_flatten(chain(embed, encode), pad=field_size)) - tok2vec.cfg = config - tok2vec.nO = encode.nO - tok2vec.embed = embed - tok2vec.encode = encode - return tok2vec - - -@registry.architectures.register("spacy.Doc2Feats.v1") -def Doc2Feats(config): - columns = config["columns"] - return FeatureExtracter(columns) - - -@registry.architectures.register("spacy.MultiHashEmbed.v1") -def MultiHashEmbed(config): - # For backwards compatibility with models before the architecture registry, - # we have to be careful to get exactly the same model structure. One subtle - # trick is that when we define concatenation with the operator, the operator - # is actually binary associative. So when we write (a | b | c), we're actually - # getting concatenate(concatenate(a, b), c). That's why the implementation - # is a bit ugly here. 
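# --- Illustrative sketch (not part of the patch) ---------------------------
# The weight-copying idea behind resize_output() above, shown on a plain
# Linear layer: allocate a larger layer and copy the old weights into its
# first rows, so existing classes keep their parameters and new rows start
# at zero.
from thinc.api import Linear
import numpy

old = Linear(nO=3, nI=5)
old.initialize(X=numpy.zeros((2, 5), dtype="f"))
new_nO = 6
larger = Linear(nO=new_nO, nI=5)
larger_W = larger.ops.alloc2f(new_nO, 5)
larger_b = larger.ops.alloc1f(new_nO)
larger_W[:3] = old.get_param("W")   # weights are stored as (nr_out, nr_in)
larger_b[:3] = old.get_param("b")
larger.set_param("W", larger_W)
larger.set_param("b", larger_b)
# ---------------------------------------------------------------------------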
- cols = config["columns"] - width = config["width"] - rows = config["rows"] - - norm = HashEmbed(width, rows, column=cols.index("NORM"), name="embed_norm", seed=1) - if config["use_subwords"]: - prefix = HashEmbed( - width, rows // 2, column=cols.index("PREFIX"), name="embed_prefix", seed=2 - ) - suffix = HashEmbed( - width, rows // 2, column=cols.index("SUFFIX"), name="embed_suffix", seed=3 - ) - shape = HashEmbed( - width, rows // 2, column=cols.index("SHAPE"), name="embed_shape", seed=4 - ) - if config.get("@pretrained_vectors"): - glove = make_layer(config["@pretrained_vectors"]) - mix = make_layer(config["@mix"]) - - with Model.define_operators({">>": chain, "|": concatenate}): - if config["use_subwords"] and config["@pretrained_vectors"]: - mix._layers[0].nI = width * 5 - layer = uniqued( - (glove | norm | prefix | suffix | shape) >> mix, - column=cols.index("ORTH"), - ) - elif config["use_subwords"]: - mix._layers[0].nI = width * 4 - layer = uniqued( - (norm | prefix | suffix | shape) >> mix, column=cols.index("ORTH") - ) - elif config["@pretrained_vectors"]: - mix._layers[0].nI = width * 2 - layer = uniqued((glove | norm) >> mix, column=cols.index("ORTH"),) - else: - layer = norm - layer.cfg = config - return layer - - -@registry.architectures.register("spacy.CharacterEmbed.v1") -def CharacterEmbed(config): - from .. import _ml - - width = config["width"] - chars = config["chars"] - - chr_embed = _ml.CharacterEmbedModel(nM=width, nC=chars) - other_tables = make_layer(config["@embed_features"]) - mix = make_layer(config["@mix"]) - - model = chain(concatenate_lists(chr_embed, other_tables), mix) - model.cfg = config - return model - - -@registry.architectures.register("spacy.MaxoutWindowEncoder.v1") -def MaxoutWindowEncoder(config): - nO = config["width"] - nW = config["window_size"] - nP = config["pieces"] - depth = config["depth"] - - cnn = chain( - ExtractWindow(nW=nW), LayerNorm(Maxout(nO, nO * ((nW * 2) + 1), pieces=nP)) - ) - model = clone(Residual(cnn), depth) - model.nO = nO - model.receptive_field = nW * depth - return model - - -@registry.architectures.register("spacy.MishWindowEncoder.v1") -def MishWindowEncoder(config): - from thinc.v2v import Mish - - nO = config["width"] - nW = config["window_size"] - depth = config["depth"] - - cnn = chain(ExtractWindow(nW=nW), LayerNorm(Mish(nO, nO * ((nW * 2) + 1)))) - model = clone(Residual(cnn), depth) - model.nO = nO - return model - - -@registry.architectures.register("spacy.PretrainedVectors.v1") -def PretrainedVectors(config): - return StaticVectors(config["vectors_name"], config["width"], config["column"]) - - -@registry.architectures.register("spacy.TorchBiLSTMEncoder.v1") -def TorchBiLSTMEncoder(config): - import torch.nn - from thinc.extra.wrappers import PyTorchWrapperRNN - - width = config["width"] - depth = config["depth"] - if depth == 0: - return layerize(noop()) - return with_square_sequences( - PyTorchWrapperRNN(torch.nn.LSTM(width, width // 2, depth, bidirectional=True)) - ) - - -_EXAMPLE_CONFIG = { - "@doc2feats": { - "arch": "Doc2Feats", - "config": {"columns": ["ID", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"]}, - }, - "@embed": { - "arch": "spacy.MultiHashEmbed.v1", - "config": { - "width": 96, - "rows": 2000, - "columns": ["ID", "NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"], - "use_subwords": True, - "@pretrained_vectors": { - "arch": "TransformedStaticVectors", - "config": { - "vectors_name": "en_vectors_web_lg.vectors", - "width": 96, - "column": 0, - }, - }, - "@mix": { - "arch": "LayerNormalizedMaxout", - 
"config": {"width": 96, "pieces": 3}, - }, - }, - }, - "@encode": { - "arch": "MaxoutWindowEncode", - "config": {"width": 96, "window_size": 1, "depth": 4, "pieces": 3}, - }, -} diff --git a/spacy/morphology.pxd b/spacy/morphology.pxd index 1a3cedf97..4fe8f7428 100644 --- a/spacy/morphology.pxd +++ b/spacy/morphology.pxd @@ -2,40 +2,33 @@ from cymem.cymem cimport Pool from preshed.maps cimport PreshMap, PreshMapArray from libc.stdint cimport uint64_t from murmurhash cimport mrmr +cimport numpy as np from .structs cimport TokenC, MorphAnalysisC from .strings cimport StringStore from .typedefs cimport hash_t, attr_t, flags_t from .parts_of_speech cimport univ_pos_t - from . cimport symbols + cdef class Morphology: cdef readonly Pool mem cdef readonly StringStore strings cdef PreshMap tags # Keyed by hash, value is pointer to tag - + cdef public object lemmatizer cdef readonly object tag_map cdef readonly object tag_names cdef readonly object reverse_index - cdef readonly object exc - cdef readonly object _feat_map + cdef readonly object _exc cdef readonly PreshMapArray _cache cdef readonly int n_tags - cpdef update(self, hash_t morph, features) - cdef hash_t insert(self, MorphAnalysisC tag) except 0 - - cdef int assign_untagged(self, TokenC* token) except -1 - cdef int assign_tag(self, TokenC* token, tag) except -1 - cdef int assign_tag_id(self, TokenC* token, int tag_id) except -1 - - cdef int _assign_tag_from_exceptions(self, TokenC* token, int tag_id) except -1 + cdef MorphAnalysisC create_morph_tag(self, field_feature_pairs) except * + cdef int insert(self, MorphAnalysisC tag) except -1 -cdef int check_feature(const MorphAnalysisC* tag, attr_t feature) nogil -cdef attr_t get_field(const MorphAnalysisC* tag, int field) nogil -cdef list list_features(const MorphAnalysisC* tag) - -cdef tag_to_json(const MorphAnalysisC* tag) +cdef int check_feature(const MorphAnalysisC* morph, attr_t feature) nogil +cdef list list_features(const MorphAnalysisC* morph) +cdef np.ndarray get_by_field(const MorphAnalysisC* morph, attr_t field) +cdef int get_n_by_field(attr_t* results, const MorphAnalysisC* morph, attr_t field) nogil diff --git a/spacy/morphology.pyx b/spacy/morphology.pyx index 18bba0124..cc0f61cea 100644 --- a/spacy/morphology.pyx +++ b/spacy/morphology.pyx @@ -1,1111 +1,210 @@ # cython: infer_types -# coding: utf8 -from __future__ import unicode_literals - from libc.string cimport memset + import srsly from collections import Counter +import numpy +import warnings -from .compat import basestring_ -from .strings import get_string_id -from . 
import symbols from .attrs cimport POS, IS_SPACE -from .attrs import LEMMA, intify_attrs from .parts_of_speech cimport SPACE -from .parts_of_speech import IDS as POS_IDS from .lexeme cimport Lexeme -from .errors import Errors + +from .strings import get_string_id +from .attrs import LEMMA, intify_attrs +from .parts_of_speech import IDS as POS_IDS +from .errors import Errors, Warnings from .util import ensure_path - - -cdef enum univ_field_t: - Field_POS - Field_Abbr - Field_AdpType - Field_AdvType - Field_Animacy - Field_Aspect - Field_Case - Field_ConjType - Field_Connegative - Field_Definite - Field_Degree - Field_Derivation - Field_Echo - Field_Foreign - Field_Gender - Field_Hyph - Field_InfForm - Field_Mood - Field_NameType - Field_Negative - Field_NounType - Field_Number - Field_NumForm - Field_NumType - Field_NumValue - Field_PartForm - Field_PartType - Field_Person - Field_Polarity - Field_Polite - Field_Poss - Field_Prefix - Field_PrepCase - Field_PronType - Field_PunctSide - Field_PunctType - Field_Reflex - Field_Style - Field_StyleVariant - Field_Tense - Field_Typo - Field_VerbForm - Field_VerbType - Field_Voice - - -def _normalize_props(props): - """Transform deprecated string keys to correct names.""" - out = {} - props = dict(props) - for key in FIELDS: - if key in props: - value = str(props[key]).lower() - # We don't have support for disjunctive int|rel features, so - # just take the first one :( - if "|" in value: - value = value.split("|")[0] - attr = '%s_%s' % (key, value) - if attr in FEATURES: - props.pop(key) - props[attr] = True - for key, value in props.items(): - if key == POS: - if hasattr(value, 'upper'): - value = value.upper() - if value in POS_IDS: - value = POS_IDS[value] - out[key] = value - elif isinstance(key, int): - out[key] = value - elif value is True: - out[key] = value - elif key.lower() == 'pos': - out[POS] = POS_IDS[value.upper()] - elif key.lower() != 'morph': - out[key] = value - return out - - -class MorphologyClassMap(object): - def __init__(self, features): - self.features = tuple(features) - self.fields = [] - self.feat2field = {} - seen_fields = set() - for feature in features: - field = feature.split("_", 1)[0] - if field not in seen_fields: - self.fields.append(field) - seen_fields.add(field) - self.feat2field[feature] = FIELDS[field] - self.id2feat = {get_string_id(name): name for name in features} - self.field2feats = {"POS": []} - self.col2info = [] - self.attr2field = dict(LOWER_FIELDS.items()) - self.feat2offset = {} - self.field2col = {} - self.field2id = dict(FIELDS.items()) - self.fieldid2field = {field_id: field for field, field_id in FIELDS.items()} - for feature in features: - field = self.fields[self.feat2field[feature]] - if field not in self.field2col: - self.field2col[field] = len(self.col2info) - if field != "POS" and field not in self.field2feats: - self.col2info.append((field, 0, "NIL")) - self.field2feats.setdefault(field, ["NIL"]) - offset = len(self.field2feats[field]) - self.field2feats[field].append(feature) - self.col2info.append((field, offset, feature)) - self.feat2offset[feature] = offset - - @property - def field_sizes(self): - return [len(self.field2feats[field]) for field in self.fields] - - def get_field_offset(self, field): - return self.field2col[field] +from . import symbols cdef class Morphology: - '''Store the possible morphological analyses for a language, and index them + """Store the possible morphological analyses for a language, and index them by hash. 
- To save space on each token, tokens only know the hash of their morphological - analysis, so queries of morphological attributes are delegated + To save space on each token, tokens only know the hash of their + morphological analysis, so queries of morphological attributes are delegated to this class. - ''' - def __init__(self, StringStore string_store, tag_map, lemmatizer, exc=None): + """ + FEATURE_SEP = "|" + FIELD_SEP = "=" + VALUE_SEP = "," + # not an empty string so that the PreshMap key is not 0 + EMPTY_MORPH = symbols.NAMES[symbols._] + + def __init__(self, StringStore strings): self.mem = Pool() - self.strings = string_store + self.strings = strings self.tags = PreshMap() - self._feat_map = MorphologyClassMap(FEATURES) - self.load_tag_map(tag_map) - self.lemmatizer = lemmatizer - - self._cache = PreshMapArray(self.n_tags) - self.exc = {} - if exc is not None: - for (tag, orth), attrs in exc.items(): - attrs = _normalize_props(attrs) - self.add_special_case( - self.strings.as_string(tag), self.strings.as_string(orth), attrs) - - def load_tag_map(self, tag_map): - # Add special space symbol. We prefix with underscore, to make sure it - # always sorts to the end. - if '_SP' in tag_map: - space_attrs = tag_map.get('_SP') - else: - space_attrs = tag_map.get('SP', {POS: SPACE}) - if '_SP' not in tag_map: - self.strings.add('_SP') - tag_map = dict(tag_map) - tag_map['_SP'] = space_attrs - self.tag_map = {} - self.reverse_index = {} - for i, (tag_str, attrs) in enumerate(sorted(tag_map.items())): - attrs = _normalize_props(attrs) - self.add({self._feat_map.id2feat[feat] for feat in attrs - if feat in self._feat_map.id2feat}) - self.tag_map[tag_str] = dict(attrs) - self.reverse_index[self.strings.add(tag_str)] = i - self.tag_names = tuple(sorted(self.tag_map.keys())) - self.n_tags = len(self.tag_map) - self._cache = PreshMapArray(self.n_tags) def __reduce__(self): - return (Morphology, (self.strings, self.tag_map, self.lemmatizer, - self.exc), None, None) + tags = set([self.get(self.strings[s]) for s in self.strings]) + tags -= set([""]) + return (unpickle_morphology, (self.strings, sorted(tags)), None, None) def add(self, features): - """Insert a morphological analysis in the morphology table, if not already - present. Returns the hash of the new analysis. + """Insert a morphological analysis in the morphology table, if not + already present. The morphological analysis may be provided in the UD + FEATS format as a string or in the tag map dict format. + Returns the hash of the new analysis. 
""" - for f in features: - if isinstance(f, basestring_): - self.strings.add(f) - string_features = features - features = intify_features(features) - cdef attr_t feature - for feature in features: - if feature != 0 and feature not in self._feat_map.id2feat: - raise ValueError(Errors.E167.format(feat=self.strings[feature], feat_id=feature)) - cdef MorphAnalysisC tag - tag = create_rich_tag(features) - cdef hash_t key = self.insert(tag) - return key - - def get(self, hash_t morph): - tag = self.tags.get(morph) - if tag == NULL: - return [] + cdef MorphAnalysisC* tag_ptr + if isinstance(features, str): + if features == self.EMPTY_MORPH: + features = "" + tag_ptr = self.tags.get(self.strings[features]) + if tag_ptr != NULL: + return tag_ptr.key + features = self.feats_to_dict(features) + if not isinstance(features, dict): + warnings.warn(Warnings.W100.format(feature=features)) + features = {} + string_features = {self.strings.as_string(field): self.strings.as_string(values) for field, values in features.items()} + # intified ("Field", "Field=Value") pairs + field_feature_pairs = [] + for field in sorted(string_features): + values = string_features[field] + for value in values.split(self.VALUE_SEP): + field_feature_pairs.append(( + self.strings.add(field), + self.strings.add(field + self.FIELD_SEP + value), + )) + cdef MorphAnalysisC tag = self.create_morph_tag(field_feature_pairs) + # the hash key for the tag is either the hash of the normalized UFEATS + # string or the hash of an empty placeholder (using the empty string + # would give a hash key of 0, which is not good for PreshMap) + norm_feats_string = self.normalize_features(features) + if norm_feats_string: + tag.key = self.strings.add(norm_feats_string) else: - return tag_to_json(tag) + tag.key = self.strings.add(self.EMPTY_MORPH) + self.insert(tag) + return tag.key - cpdef update(self, hash_t morph, features): - """Update a morphological analysis with new feature values.""" - tag = (self.tags.get(morph))[0] - features = intify_features(features) - cdef attr_t feature - for feature in features: - field = FEATURE_FIELDS[FEATURE_NAMES[feature]] - set_feature(&tag, field, feature, 1) - morph = self.insert(tag) - return morph + def normalize_features(self, features): + """Create a normalized FEATS string from a features string or dict. - def lemmatize(self, const univ_pos_t univ_pos, attr_t orth, morphology): - if orth not in self.strings: - return orth - cdef unicode py_string = self.strings[orth] - if self.lemmatizer is None: - return self.strings.add(py_string.lower()) - cdef list lemma_strings - cdef unicode lemma_string - # Normalize features into a dict keyed by the field, to make life easier - # for the lemmatizer. Handles string-to-int conversion too. - string_feats = {} - for key, value in morphology.items(): - if value is True: - name, value = self.strings.as_string(key).split('_', 1) - string_feats[name] = value - else: - string_feats[self.strings.as_string(key)] = self.strings.as_string(value) - lemma_strings = self.lemmatizer(py_string, univ_pos, string_feats) - lemma_string = lemma_strings[0] - lemma = self.strings.add(lemma_string) - return lemma - - def add_special_case(self, unicode tag_str, unicode orth_str, attrs, - force=False): - """Add a special-case rule to the morphological analyser. Tokens whose - tag and orth match the rule will receive the specified properties. - - tag (unicode): The part-of-speech tag to key the exception. - orth (unicode): The word-form to key the exception. 
+ features (Union[dict, str]): Features as dict or UFEATS string. + RETURNS (str): Features as normalized UFEATS string. """ - attrs = dict(attrs) - attrs = _normalize_props(attrs) - self.add({self._feat_map.id2feat[feat] for feat in attrs - if feat in self._feat_map.id2feat}) - attrs = intify_attrs(attrs, self.strings, _do_deprecated=True) - self.exc[(tag_str, self.strings.add(orth_str))] = attrs + if isinstance(features, str): + features = self.feats_to_dict(features) + if not isinstance(features, dict): + warnings.warn(Warnings.W100.format(feature=features)) + features = {} + features = self.normalize_attrs(features) + string_features = {self.strings.as_string(field): self.strings.as_string(values) for field, values in features.items()} + # normalized UFEATS string with sorted fields and values + norm_feats_string = self.FEATURE_SEP.join(sorted([ + self.FIELD_SEP.join([field, values]) + for field, values in string_features.items() + ])) + return norm_feats_string or self.EMPTY_MORPH - cdef hash_t insert(self, MorphAnalysisC tag) except 0: - cdef hash_t key = hash_tag(tag) + def normalize_attrs(self, attrs): + """Convert attrs dict so that POS is always by ID, other features are + by string. Values separated by VALUE_SEP are sorted. + """ + out = {} + attrs = dict(attrs) + for key, value in attrs.items(): + # convert POS value to ID + if key == POS or (isinstance(key, str) and key.upper() == "POS"): + if isinstance(value, str) and value.upper() in POS_IDS: + value = POS_IDS[value.upper()] + elif isinstance(value, int) and value not in POS_IDS.values(): + warnings.warn(Warnings.W100.format(feature={key: value})) + continue + out[POS] = value + # accept any string or ID fields and values and convert to strings + elif isinstance(key, (int, str)) and isinstance(value, (int, str)): + key = self.strings.as_string(key) + value = self.strings.as_string(value) + # sort values + if self.VALUE_SEP in value: + value = self.VALUE_SEP.join(sorted(value.split(self.VALUE_SEP))) + out[key] = value + else: + warnings.warn(Warnings.W100.format(feature={key: value})) + return out + + cdef MorphAnalysisC create_morph_tag(self, field_feature_pairs) except *: + """Creates a MorphAnalysisC from a list of intified + ("Field", "Field=Value") tuples where fields with multiple values have + been split into individual tuples, e.g.: + [("Field1", "Field1=Value1"), ("Field1", "Field1=Value2"), + ("Field2", "Field2=Value3")] + """ + cdef MorphAnalysisC tag + tag.length = len(field_feature_pairs) + tag.fields = self.mem.alloc(tag.length, sizeof(attr_t)) + tag.features = self.mem.alloc(tag.length, sizeof(attr_t)) + for i, (field, feature) in enumerate(field_feature_pairs): + tag.fields[i] = field + tag.features[i] = feature + return tag + + cdef int insert(self, MorphAnalysisC tag) except -1: + cdef hash_t key = tag.key if self.tags.get(key) == NULL: tag_ptr = self.mem.alloc(1, sizeof(MorphAnalysisC)) tag_ptr[0] = tag self.tags.set(key, tag_ptr) - return key - cdef int assign_untagged(self, TokenC* token) except -1: - """Set morphological attributes on a token without a POS tag. Uses - the lemmatizer's lookup() method, which looks up the string in the - table provided by the language data as lemma_lookup (if available). 
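Worth spelling out what create_morph_tag() receives: add() above decomposes a FEATS string into intified ("Field", "Field=Value") pairs, splitting multi-value features on the comma separator. A minimal plain-Python sketch of that decomposition, with raw strings standing in for the StringStore hashes (illustrative only, not part of the patch):

# Sketch only: mirrors the pair construction in Morphology.add(), but keeps
# plain strings where the real code stores StringStore hashes.
FEATURE_SEP, FIELD_SEP, VALUE_SEP = "|", "=", ","

def field_feature_pairs(feats: str):
    pairs = []
    for feat in sorted(feats.split(FEATURE_SEP)):
        field, values = feat.split(FIELD_SEP)
        for value in sorted(values.split(VALUE_SEP)):
            pairs.append((field, field + FIELD_SEP + value))
    return pairs

assert field_feature_pairs("Number=Sing|Case=Nom,Acc") == [
    ("Case", "Case=Acc"),
    ("Case", "Case=Nom"),
    ("Number", "Number=Sing"),
]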
- """ - if token.lemma == 0: - orth_str = self.strings[token.lex.orth] - lemma = self.lemmatizer.lookup(orth_str, orth=token.lex.orth) - token.lemma = self.strings.add(lemma) - - cdef int assign_tag(self, TokenC* token, tag_str) except -1: - cdef attr_t tag = self.strings.as_int(tag_str) - if tag in self.reverse_index: - tag_id = self.reverse_index[tag] - self.assign_tag_id(token, tag_id) + def get(self, hash_t morph): + tag = self.tags.get(morph) + if tag == NULL: + return "" else: - token.tag = tag + return self.strings[tag.key] - cdef int assign_tag_id(self, TokenC* token, int tag_id) except -1: - if tag_id > self.n_tags: - raise ValueError(Errors.E014.format(tag=tag_id)) - # Ensure spaces get tagged as space. - # It seems pretty arbitrary to put this logic here, but there's really - # nowhere better. I guess the justification is that this is where the - # specific word and the tag interact. Still, we should have a better - # way to enforce this rule, or figure out why the statistical model fails. - # Related to Issue #220 - if Lexeme.c_check_flag(token.lex, IS_SPACE): - tag_id = self.reverse_index[self.strings.add('_SP')] - tag_str = self.tag_names[tag_id] - features = dict(self.tag_map.get(tag_str, {})) - if features: - pos = self.strings.as_int(features.pop(POS)) - else: - pos = 0 - cdef attr_t lemma = self._cache.get(tag_id, token.lex.orth) - if lemma == 0: - # Ugh, self.lemmatize has opposite arg order from self.lemmatizer :( - lemma = self.lemmatize(pos, token.lex.orth, features) - self._cache.set(tag_id, token.lex.orth, lemma) - token.lemma = lemma - token.pos = pos - token.tag = self.strings[tag_str] - token.morph = self.add(features) - if (self.tag_names[tag_id], token.lex.orth) in self.exc: - self._assign_tag_from_exceptions(token, tag_id) + @staticmethod + def feats_to_dict(feats): + if not feats or feats == Morphology.EMPTY_MORPH: + return {} + return {field: Morphology.VALUE_SEP.join(sorted(values.split(Morphology.VALUE_SEP))) for field, values in + [feat.split(Morphology.FIELD_SEP) for feat in feats.split(Morphology.FEATURE_SEP)]} - cdef int _assign_tag_from_exceptions(self, TokenC* token, int tag_id) except -1: - key = (self.tag_names[tag_id], token.lex.orth) - cdef dict attrs - attrs = self.exc[key] - token.pos = attrs.get(POS, token.pos) - token.lemma = attrs.get(LEMMA, token.lemma) - - def load_morph_exceptions(self, dict exc): - # Map (form, pos) to attributes - for tag_str, entries in exc.items(): - for form_str, attrs in entries.items(): - self.add_special_case(tag_str, form_str, attrs) - - @classmethod - def create_class_map(cls): - return MorphologyClassMap(FEATURES) + @staticmethod + def dict_to_feats(feats_dict): + if len(feats_dict) == 0: + return "" + return Morphology.FEATURE_SEP.join(sorted([Morphology.FIELD_SEP.join([field, Morphology.VALUE_SEP.join(sorted(values.split(Morphology.VALUE_SEP)))]) for field, values in feats_dict.items()])) -cpdef univ_pos_t get_int_tag(pos_): - return 0 - -cpdef intify_features(features): - return {get_string_id(feature) for feature in features} - -cdef hash_t hash_tag(MorphAnalysisC tag) nogil: - return mrmr.hash64(&tag, sizeof(tag), 0) +cdef int check_feature(const MorphAnalysisC* morph, attr_t feature) nogil: + cdef int i + for i in range(morph.length): + if morph.features[i] == feature: + return True + return False -cdef MorphAnalysisC create_rich_tag(features) except *: - cdef MorphAnalysisC tag - cdef attr_t feature - memset(&tag, 0, sizeof(tag)) - for feature in features: - field = FEATURE_FIELDS[FEATURE_NAMES[feature]] 
- set_feature(&tag, field, feature, 1) - return tag +cdef list list_features(const MorphAnalysisC* morph): + cdef int i + features = [] + for i in range(morph.length): + features.append(morph.features[i]) + return features -cdef tag_to_json(const MorphAnalysisC* tag): - return [FEATURE_NAMES[f] for f in list_features(tag)] +cdef np.ndarray get_by_field(const MorphAnalysisC* morph, attr_t field): + cdef np.ndarray results = numpy.zeros((morph.length,), dtype="uint64") + n = get_n_by_field(results.data, morph, field) + return results[:n] -cdef MorphAnalysisC tag_from_json(json_tag): - raise NotImplementedError +cdef int get_n_by_field(attr_t* results, const MorphAnalysisC* morph, attr_t field) nogil: + cdef int n_results = 0 + cdef int i + for i in range(morph.length): + if morph.fields[i] == field: + results[n_results] = morph.features[i] + n_results += 1 + return n_results - -cdef list list_features(const MorphAnalysisC* tag): - output = [] - if tag.abbr != 0: - output.append(tag.abbr) - if tag.adp_type != 0: - output.append(tag.adp_type) - if tag.adv_type != 0: - output.append(tag.adv_type) - if tag.animacy != 0: - output.append(tag.animacy) - if tag.aspect != 0: - output.append(tag.aspect) - if tag.case != 0: - output.append(tag.case) - if tag.conj_type != 0: - output.append(tag.conj_type) - if tag.connegative != 0: - output.append(tag.connegative) - if tag.definite != 0: - output.append(tag.definite) - if tag.degree != 0: - output.append(tag.degree) - if tag.derivation != 0: - output.append(tag.derivation) - if tag.echo != 0: - output.append(tag.echo) - if tag.foreign != 0: - output.append(tag.foreign) - if tag.gender != 0: - output.append(tag.gender) - if tag.hyph != 0: - output.append(tag.hyph) - if tag.inf_form != 0: - output.append(tag.inf_form) - if tag.mood != 0: - output.append(tag.mood) - if tag.negative != 0: - output.append(tag.negative) - if tag.number != 0: - output.append(tag.number) - if tag.name_type != 0: - output.append(tag.name_type) - if tag.noun_type != 0: - output.append(tag.noun_type) - if tag.part_form != 0: - output.append(tag.part_form) - if tag.part_type != 0: - output.append(tag.part_type) - if tag.person != 0: - output.append(tag.person) - if tag.polite != 0: - output.append(tag.polite) - if tag.polarity != 0: - output.append(tag.polarity) - if tag.poss != 0: - output.append(tag.poss) - if tag.prefix != 0: - output.append(tag.prefix) - if tag.prep_case != 0: - output.append(tag.prep_case) - if tag.pron_type != 0: - output.append(tag.pron_type) - if tag.punct_type != 0: - output.append(tag.punct_type) - if tag.reflex != 0: - output.append(tag.reflex) - if tag.style != 0: - output.append(tag.style) - if tag.style_variant != 0: - output.append(tag.style_variant) - if tag.typo != 0: - output.append(tag.typo) - if tag.verb_form != 0: - output.append(tag.verb_form) - if tag.voice != 0: - output.append(tag.voice) - if tag.verb_type != 0: - output.append(tag.verb_type) - return output - - -cdef attr_t get_field(const MorphAnalysisC* tag, int field_id) nogil: - field = field_id - if field == Field_POS: - return tag.pos - if field == Field_Abbr: - return tag.abbr - elif field == Field_AdpType: - return tag.adp_type - elif field == Field_AdvType: - return tag.adv_type - elif field == Field_Animacy: - return tag.animacy - elif field == Field_Aspect: - return tag.aspect - elif field == Field_Case: - return tag.case - elif field == Field_ConjType: - return tag.conj_type - elif field == Field_Connegative: - return tag.connegative - elif field == Field_Definite: - return 
tag.definite - elif field == Field_Degree: - return tag.degree - elif field == Field_Derivation: - return tag.derivation - elif field == Field_Echo: - return tag.echo - elif field == Field_Foreign: - return tag.foreign - elif field == Field_Gender: - return tag.gender - elif field == Field_Hyph: - return tag.hyph - elif field == Field_InfForm: - return tag.inf_form - elif field == Field_Mood: - return tag.mood - elif field == Field_Negative: - return tag.negative - elif field == Field_Number: - return tag.number - elif field == Field_NameType: - return tag.name_type - elif field == Field_NounType: - return tag.noun_type - elif field == Field_NumForm: - return tag.num_form - elif field == Field_NumType: - return tag.num_type - elif field == Field_NumValue: - return tag.num_value - elif field == Field_PartForm: - return tag.part_form - elif field == Field_PartType: - return tag.part_type - elif field == Field_Person: - return tag.person - elif field == Field_Polite: - return tag.polite - elif field == Field_Polarity: - return tag.polarity - elif field == Field_Poss: - return tag.poss - elif field == Field_Prefix: - return tag.prefix - elif field == Field_PrepCase: - return tag.prep_case - elif field == Field_PronType: - return tag.pron_type - elif field == Field_PunctSide: - return tag.punct_side - elif field == Field_PunctType: - return tag.punct_type - elif field == Field_Reflex: - return tag.reflex - elif field == Field_Style: - return tag.style - elif field == Field_StyleVariant: - return tag.style_variant - elif field == Field_Tense: - return tag.tense - elif field == Field_Typo: - return tag.typo - elif field == Field_VerbForm: - return tag.verb_form - elif field == Field_Voice: - return tag.voice - elif field == Field_VerbType: - return tag.verb_type - else: - raise ValueError(Errors.E168.format(field=field_id)) - - -cdef int check_feature(const MorphAnalysisC* tag, attr_t feature) nogil: - if tag.abbr == feature: - return 1 - elif tag.adp_type == feature: - return 1 - elif tag.adv_type == feature: - return 1 - elif tag.animacy == feature: - return 1 - elif tag.aspect == feature: - return 1 - elif tag.case == feature: - return 1 - elif tag.conj_type == feature: - return 1 - elif tag.connegative == feature: - return 1 - elif tag.definite == feature: - return 1 - elif tag.degree == feature: - return 1 - elif tag.derivation == feature: - return 1 - elif tag.echo == feature: - return 1 - elif tag.foreign == feature: - return 1 - elif tag.gender == feature: - return 1 - elif tag.hyph == feature: - return 1 - elif tag.inf_form == feature: - return 1 - elif tag.mood == feature: - return 1 - elif tag.negative == feature: - return 1 - elif tag.number == feature: - return 1 - elif tag.name_type == feature: - return 1 - elif tag.noun_type == feature: - return 1 - elif tag.num_form == feature: - return 1 - elif tag.num_type == feature: - return 1 - elif tag.num_value == feature: - return 1 - elif tag.part_form == feature: - return 1 - elif tag.part_type == feature: - return 1 - elif tag.person == feature: - return 1 - elif tag.polite == feature: - return 1 - elif tag.polarity == feature: - return 1 - elif tag.poss == feature: - return 1 - elif tag.prefix == feature: - return 1 - elif tag.prep_case == feature: - return 1 - elif tag.pron_type == feature: - return 1 - elif tag.punct_side == feature: - return 1 - elif tag.punct_type == feature: - return 1 - elif tag.reflex == feature: - return 1 - elif tag.style == feature: - return 1 - elif tag.style_variant == feature: - return 1 - elif tag.tense 
== feature: - return 1 - elif tag.typo == feature: - return 1 - elif tag.verb_form == feature: - return 1 - elif tag.voice == feature: - return 1 - elif tag.verb_type == feature: - return 1 - else: - return 0 - -cdef int set_feature(MorphAnalysisC* tag, - univ_field_t field, attr_t feature, int value) except -1: - if value == True: - value_ = feature - else: - value_ = 0 - prev_value = get_field(tag, field) - if prev_value != 0 and value_ == 0 and field != Field_POS: - tag.length -= 1 - elif prev_value == 0 and value_ != 0 and field != Field_POS: - tag.length += 1 - if feature == 0: - pass - elif field == Field_POS: - tag.pos = get_string_id(FEATURE_NAMES[value_].split('_')[1]) - elif field == Field_Abbr: - tag.abbr = value_ - elif field == Field_AdpType: - tag.adp_type = value_ - elif field == Field_AdvType: - tag.adv_type = value_ - elif field == Field_Animacy: - tag.animacy = value_ - elif field == Field_Aspect: - tag.aspect = value_ - elif field == Field_Case: - tag.case = value_ - elif field == Field_ConjType: - tag.conj_type = value_ - elif field == Field_Connegative: - tag.connegative = value_ - elif field == Field_Definite: - tag.definite = value_ - elif field == Field_Degree: - tag.degree = value_ - elif field == Field_Derivation: - tag.derivation = value_ - elif field == Field_Echo: - tag.echo = value_ - elif field == Field_Foreign: - tag.foreign = value_ - elif field == Field_Gender: - tag.gender = value_ - elif field == Field_Hyph: - tag.hyph = value_ - elif field == Field_InfForm: - tag.inf_form = value_ - elif field == Field_Mood: - tag.mood = value_ - elif field == Field_Negative: - tag.negative = value_ - elif field == Field_Number: - tag.number = value_ - elif field == Field_NameType: - tag.name_type = value_ - elif field == Field_NounType: - tag.noun_type = value_ - elif field == Field_NumForm: - tag.num_form = value_ - elif field == Field_NumType: - tag.num_type = value_ - elif field == Field_NumValue: - tag.num_value = value_ - elif field == Field_PartForm: - tag.part_form = value_ - elif field == Field_PartType: - tag.part_type = value_ - elif field == Field_Person: - tag.person = value_ - elif field == Field_Polite: - tag.polite = value_ - elif field == Field_Polarity: - tag.polarity = value_ - elif field == Field_Poss: - tag.poss = value_ - elif field == Field_Prefix: - tag.prefix = value_ - elif field == Field_PrepCase: - tag.prep_case = value_ - elif field == Field_PronType: - tag.pron_type = value_ - elif field == Field_PunctSide: - tag.punct_side = value_ - elif field == Field_PunctType: - tag.punct_type = value_ - elif field == Field_Reflex: - tag.reflex = value_ - elif field == Field_Style: - tag.style = value_ - elif field == Field_StyleVariant: - tag.style_variant = value_ - elif field == Field_Tense: - tag.tense = value_ - elif field == Field_Typo: - tag.typo = value_ - elif field == Field_VerbForm: - tag.verb_form = value_ - elif field == Field_Voice: - tag.voice = value_ - elif field == Field_VerbType: - tag.verb_type = value_ - else: - raise ValueError(Errors.E167.format(field=FEATURE_NAMES.get(feature), field_id=feature)) - - -FIELDS = { - 'POS': Field_POS, - 'Abbr': Field_Abbr, - 'AdpType': Field_AdpType, - 'AdvType': Field_AdvType, - 'Animacy': Field_Animacy, - 'Aspect': Field_Aspect, - 'Case': Field_Case, - 'ConjType': Field_ConjType, - 'Connegative': Field_Connegative, - 'Definite': Field_Definite, - 'Degree': Field_Degree, - 'Derivation': Field_Derivation, - 'Echo': Field_Echo, - 'Foreign': Field_Foreign, - 'Gender': Field_Gender, - 'Hyph': 
Field_Hyph, - 'InfForm': Field_InfForm, - 'Mood': Field_Mood, - 'NameType': Field_NameType, - 'Negative': Field_Negative, - 'NounType': Field_NounType, - 'Number': Field_Number, - 'NumForm': Field_NumForm, - 'NumType': Field_NumType, - 'NumValue': Field_NumValue, - 'PartForm': Field_PartForm, - 'PartType': Field_PartType, - 'Person': Field_Person, - 'Polite': Field_Polite, - 'Polarity': Field_Polarity, - 'Poss': Field_Poss, - 'Prefix': Field_Prefix, - 'PrepCase': Field_PrepCase, - 'PronType': Field_PronType, - 'PunctSide': Field_PunctSide, - 'PunctType': Field_PunctType, - 'Reflex': Field_Reflex, - 'Style': Field_Style, - 'StyleVariant': Field_StyleVariant, - 'Tense': Field_Tense, - 'Typo': Field_Typo, - 'VerbForm': Field_VerbForm, - 'VerbType': Field_VerbType, - 'Voice': Field_Voice, -} - -LOWER_FIELDS = { - 'pos': Field_POS, - 'abbr': Field_Abbr, - 'adp_type': Field_AdpType, - 'adv_type': Field_AdvType, - 'animacy': Field_Animacy, - 'aspect': Field_Aspect, - 'case': Field_Case, - 'conj_type': Field_ConjType, - 'connegative': Field_Connegative, - 'definite': Field_Definite, - 'degree': Field_Degree, - 'derivation': Field_Derivation, - 'echo': Field_Echo, - 'foreign': Field_Foreign, - 'gender': Field_Gender, - 'hyph': Field_Hyph, - 'inf_form': Field_InfForm, - 'mood': Field_Mood, - 'name_type': Field_NameType, - 'negative': Field_Negative, - 'noun_type': Field_NounType, - 'number': Field_Number, - 'num_form': Field_NumForm, - 'num_type': Field_NumType, - 'num_value': Field_NumValue, - 'part_form': Field_PartForm, - 'part_type': Field_PartType, - 'person': Field_Person, - 'polarity': Field_Polarity, - 'polite': Field_Polite, - 'poss': Field_Poss, - 'prefix': Field_Prefix, - 'prep_case': Field_PrepCase, - 'pron_type': Field_PronType, - 'punct_side': Field_PunctSide, - 'punct_type': Field_PunctType, - 'reflex': Field_Reflex, - 'style': Field_Style, - 'style_variant': Field_StyleVariant, - 'tense': Field_Tense, - 'typo': Field_Typo, - 'verb_form': Field_VerbForm, - 'verb_type': Field_VerbType, - 'voice': Field_Voice, -} - - -FEATURES = [ - "POS_ADJ", - "POS_ADP", - "POS_ADV", - "POS_AUX", - "POS_CONJ", - "POS_CCONJ", - "POS_DET", - "POS_INTJ", - "POS_NOUN", - "POS_NUM", - "POS_PART", - "POS_PRON", - "POS_PROPN", - "POS_PUNCT", - "POS_SCONJ", - "POS_SYM", - "POS_VERB", - "POS_X", - "POS_EOL", - "POS_SPACE", - "Abbr_yes", - "AdpType_circ", - "AdpType_comprep", - "AdpType_prep", - "AdpType_post", - "AdpType_voc", - "AdvType_adadj", - "AdvType_cau", - "AdvType_deg", - "AdvType_ex", - "AdvType_loc", - "AdvType_man", - "AdvType_mod", - "AdvType_sta", - "AdvType_tim", - "Animacy_anim", - "Animacy_hum", - "Animacy_inan", - "Animacy_nhum", - "Aspect_hab", - "Aspect_imp", - "Aspect_iter", - "Aspect_perf", - "Aspect_prog", - "Aspect_prosp", - "Aspect_none", - "Case_abe", - "Case_abl", - "Case_abs", - "Case_acc", - "Case_ade", - "Case_all", - "Case_cau", - "Case_com", - "Case_dat", - "Case_del", - "Case_dis", - "Case_ela", - "Case_ess", - "Case_gen", - "Case_ill", - "Case_ine", - "Case_ins", - "Case_loc", - "Case_lat", - "Case_nom", - "Case_par", - "Case_sub", - "Case_sup", - "Case_tem", - "Case_ter", - "Case_tra", - "Case_voc", - "ConjType_comp", - "ConjType_oper", - "Connegative_yes", - "Definite_cons", - "Definite_def", - "Definite_ind", - "Definite_red", - "Definite_two", - "Degree_abs", - "Degree_cmp", - "Degree_comp", - "Degree_none", - "Degree_pos", - "Degree_sup", - "Degree_com", - "Degree_dim", - "Derivation_minen", - "Derivation_sti", - "Derivation_inen", - "Derivation_lainen", - 
"Derivation_ja", - "Derivation_ton", - "Derivation_vs", - "Derivation_ttain", - "Derivation_ttaa", - "Echo_rdp", - "Echo_ech", - "Foreign_foreign", - "Foreign_fscript", - "Foreign_tscript", - "Foreign_yes", - "Gender_com", - "Gender_fem", - "Gender_masc", - "Gender_neut", - "Gender_dat_masc", - "Gender_dat_fem", - "Gender_erg_masc", - "Gender_erg_fem", - "Gender_psor_masc", - "Gender_psor_fem", - "Gender_psor_neut", - "Hyph_yes", - "InfForm_one", - "InfForm_two", - "InfForm_three", - "Mood_cnd", - "Mood_imp", - "Mood_ind", - "Mood_n", - "Mood_pot", - "Mood_sub", - "Mood_opt", - "NameType_geo", - "NameType_prs", - "NameType_giv", - "NameType_sur", - "NameType_nat", - "NameType_com", - "NameType_pro", - "NameType_oth", - "Negative_neg", - "Negative_pos", - "Negative_yes", - "NounType_com", - "NounType_prop", - "NounType_class", - "Number_com", - "Number_dual", - "Number_none", - "Number_plur", - "Number_sing", - "Number_ptan", - "Number_count", - "Number_abs_sing", - "Number_abs_plur", - "Number_dat_sing", - "Number_dat_plur", - "Number_erg_sing", - "Number_erg_plur", - "Number_psee_sing", - "Number_psee_plur", - "Number_psor_sing", - "Number_psor_plur", - "NumForm_digit", - "NumForm_roman", - "NumForm_word", - "NumForm_combi", - "NumType_card", - "NumType_dist", - "NumType_frac", - "NumType_gen", - "NumType_mult", - "NumType_none", - "NumType_ord", - "NumType_sets", - "NumType_dual", - "NumValue_one", - "NumValue_two", - "NumValue_three", - "PartForm_pres", - "PartForm_past", - "PartForm_agt", - "PartForm_neg", - "PartType_mod", - "PartType_emp", - "PartType_res", - "PartType_inf", - "PartType_vbp", - "Person_one", - "Person_two", - "Person_three", - "Person_none", - "Person_abs_one", - "Person_abs_two", - "Person_abs_three", - "Person_dat_one", - "Person_dat_two", - "Person_dat_three", - "Person_erg_one", - "Person_erg_two", - "Person_erg_three", - "Person_psor_one", - "Person_psor_two", - "Person_psor_three", - "Polarity_neg", - "Polarity_pos", - "Polite_inf", - "Polite_pol", - "Polite_abs_inf", - "Polite_abs_pol", - "Polite_erg_inf", - "Polite_erg_pol", - "Polite_dat_inf", - "Polite_dat_pol", - "Poss_yes", - "Prefix_yes", - "PrepCase_npr", - "PrepCase_pre", - "PronType_advPart", - "PronType_art", - "PronType_default", - "PronType_dem", - "PronType_ind", - "PronType_int", - "PronType_neg", - "PronType_prs", - "PronType_rcp", - "PronType_rel", - "PronType_tot", - "PronType_clit", - "PronType_exc", - "PunctSide_ini", - "PunctSide_fin", - "PunctType_peri", - "PunctType_qest", - "PunctType_excl", - "PunctType_quot", - "PunctType_brck", - "PunctType_comm", - "PunctType_colo", - "PunctType_semi", - "PunctType_dash", - "Reflex_yes", - "Style_arch", - "Style_rare", - "Style_poet", - "Style_norm", - "Style_coll", - "Style_vrnc", - "Style_sing", - "Style_expr", - "Style_derg", - "Style_vulg", - "Style_yes", - "StyleVariant_styleShort", - "StyleVariant_styleBound", - "Tense_fut", - "Tense_imp", - "Tense_past", - "Tense_pres", - "Typo_yes", - "VerbForm_fin", - "VerbForm_ger", - "VerbForm_inf", - "VerbForm_none", - "VerbForm_part", - "VerbForm_partFut", - "VerbForm_partPast", - "VerbForm_partPres", - "VerbForm_sup", - "VerbForm_trans", - "VerbForm_conv", - "VerbForm_gdv", - "VerbType_aux", - "VerbType_cop", - "VerbType_mod", - "VerbType_light", - "Voice_act", - "Voice_cau", - "Voice_pass", - "Voice_mid", - "Voice_int", -] - -FEATURE_NAMES = {get_string_id(f): f for f in FEATURES} -FEATURE_FIELDS = {f: FIELDS[f.split('_', 1)[0]] for f in FEATURES} +def unpickle_morphology(strings, tags): + cdef 
Morphology morphology = Morphology(strings) + for tag in tags: + morphology.add(tag) + return morphology diff --git a/spacy/parts_of_speech.pyx b/spacy/parts_of_speech.pyx index 3925a6738..e71fb917f 100644 --- a/spacy/parts_of_speech.pyx +++ b/spacy/parts_of_speech.pyx @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - IDS = { "": NO_TAG, diff --git a/spacy/pipe_analysis.py b/spacy/pipe_analysis.py new file mode 100644 index 000000000..d0362e7e1 --- /dev/null +++ b/spacy/pipe_analysis.py @@ -0,0 +1,135 @@ +from typing import List, Dict, Iterable, Optional, Union, TYPE_CHECKING +from wasabi import msg + +from .tokens import Doc, Token, Span +from .errors import Errors +from .util import dot_to_dict + +if TYPE_CHECKING: + # This lets us add type hints for mypy etc. without causing circular imports + from .language import Language # noqa: F401 + + +DEFAULT_KEYS = ["requires", "assigns", "scores", "retokenizes"] + + +def validate_attrs(values: Iterable[str]) -> Iterable[str]: + """Validate component attributes provided to "assigns", "requires" etc. + Raises error for invalid attributes and formatting. Doesn't check if + custom extension attributes are registered, since this is something the + user might want to do themselves later in the component. + + values (Iterable[str]): The string attributes to check, e.g. `["token.pos"]`. + RETURNS (Iterable[str]): The checked attributes. + """ + data = dot_to_dict({value: True for value in values}) + objs = {"doc": Doc, "token": Token, "span": Span} + for obj_key, attrs in data.items(): + if obj_key == "span": + # Support Span only for custom extension attributes + span_attrs = [attr for attr in values if attr.startswith("span.")] + span_attrs = [attr for attr in span_attrs if not attr.startswith("span._.")] + if span_attrs: + raise ValueError(Errors.E180.format(attrs=", ".join(span_attrs))) + if obj_key not in objs: # first element is not doc/token/span + invalid_attrs = ", ".join(a for a in values if a.startswith(obj_key)) + raise ValueError(Errors.E181.format(obj=obj_key, attrs=invalid_attrs)) + if not isinstance(attrs, dict): # attr is something like "doc" + raise ValueError(Errors.E182.format(attr=obj_key)) + for attr, value in attrs.items(): + if attr == "_": + if value is True: # attr is something like "doc._" + raise ValueError(Errors.E182.format(attr="{}._".format(obj_key))) + for ext_attr, ext_value in value.items(): + # We don't check whether the attribute actually exists + if ext_value is not True: # attr is something like doc._.x.y + good = f"{obj_key}._.{ext_attr}" + bad = f"{good}.{'.'.join(ext_value)}" + raise ValueError(Errors.E183.format(attr=bad, solution=good)) + continue # we can't validate those further + if attr.endswith("_"): # attr is something like "token.pos_" + raise ValueError(Errors.E184.format(attr=attr, solution=attr[:-1])) + if value is not True: # attr is something like doc.x.y + good = f"{obj_key}.{attr}" + bad = f"{good}.{'.'.join(value)}" + raise ValueError(Errors.E183.format(attr=bad, solution=good)) + obj = objs[obj_key] + if not hasattr(obj, attr): + raise ValueError(Errors.E185.format(obj=obj_key, attr=attr)) + return values + + +def get_attr_info(nlp: "Language", attr: str) -> Dict[str, List[str]]: + """Check which components in the pipeline assign or require an attribute. + + nlp (Language): The current nlp object. + attr (str): The attribute, e.g. "doc.tensor". + RETURNS (Dict[str, List[str]]): A dict keyed by "assigns" and "requires", + mapped to a list of component names. 
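validate_attrs() only checks that each string is a well-formed reference to a Doc, Token or Span attribute; whether a custom extension is actually registered is deliberately left to the component. A short sketch of the accepted and rejected forms, assuming a build of this branch is installed:

# Hypothetical usage of validate_attrs from the new spacy.pipe_analysis module.
from spacy.pipe_analysis import validate_attrs

# Built-in attributes and custom extensions pass through unchanged.
validate_attrs(["doc.ents", "token.pos", "doc._.my_attr"])

# Underscore variants like "token.pos_" are rejected with a hint to use "token.pos".
try:
    validate_attrs(["token.pos_"])
except ValueError as err:
    print(err)

# Span attributes are only allowed for custom extensions ("span._.*").
try:
    validate_attrs(["span.label"])
except ValueError as err:
    print(err)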
+ """ + result = {"assigns": [], "requires": []} + for pipe_name in nlp.pipe_names: + meta = nlp.get_pipe_meta(pipe_name) + if attr in meta.assigns: + result["assigns"].append(pipe_name) + if attr in meta.requires: + result["requires"].append(pipe_name) + return result + + +def analyze_pipes( + nlp: "Language", *, keys: List[str] = DEFAULT_KEYS +) -> Dict[str, Union[List[str], Dict[str, List[str]]]]: + """Print a formatted summary for the current nlp object's pipeline. Shows + a table with the pipeline components and why they assign and require, as + well as any problems if available. + + nlp (Language): The nlp object. + keys (List[str]): The meta keys to show in the table. + RETURNS (dict): A dict with "summary" and "problems". + """ + result = {"summary": {}, "problems": {}} + all_attrs = set() + for i, name in enumerate(nlp.pipe_names): + meta = nlp.get_pipe_meta(name) + all_attrs.update(meta.assigns) + all_attrs.update(meta.requires) + result["summary"][name] = {key: getattr(meta, key, None) for key in keys} + prev_pipes = nlp.pipeline[:i] + requires = {annot: False for annot in meta.requires} + if requires: + for prev_name, prev_pipe in prev_pipes: + prev_meta = nlp.get_pipe_meta(prev_name) + for annot in prev_meta.assigns: + requires[annot] = True + result["problems"][name] = [] + for annot, fulfilled in requires.items(): + if not fulfilled: + result["problems"][name].append(annot) + result["attrs"] = {attr: get_attr_info(nlp, attr) for attr in all_attrs} + return result + + +def print_pipe_analysis( + analysis: Dict[str, Union[List[str], Dict[str, List[str]]]], + *, + keys: List[str] = DEFAULT_KEYS, +) -> Optional[Dict[str, Union[List[str], Dict[str, List[str]]]]]: + """Print a formatted version of the pipe analysis produced by analyze_pipes. + + analysis (Dict[str, Union[List[str], Dict[str, List[str]]]]): The analysis. + keys (List[str]): The meta keys to show in the table. 
+ """ + msg.divider("Pipeline Overview") + header = ["#", "Component", *[key.capitalize() for key in keys]] + summary = analysis["summary"].items() + body = [[i, n, *[v for v in m.values()]] for i, (n, m) in enumerate(summary)] + msg.table(body, header=header, divider=True, multiline=True) + n_problems = sum(len(p) for p in analysis["problems"].values()) + if any(p for p in analysis["problems"].values()): + msg.divider(f"Problems ({n_problems})") + for name, problem in analysis["problems"].items(): + if problem: + msg.warn(f"'{name}' requirements not met: {', '.join(problem)}") + else: + msg.good("No problems found.") diff --git a/spacy/pipeline/__init__.py b/spacy/pipeline/__init__.py index 2f30fbbee..cec5b4eb5 100644 --- a/spacy/pipeline/__init__.py +++ b/spacy/pipeline/__init__.py @@ -1,26 +1,34 @@ -# coding: utf8 -from __future__ import unicode_literals - -from .pipes import Tagger, DependencyParser, EntityRecognizer, EntityLinker -from .pipes import TextCategorizer, Tensorizer, Pipe, Sentencizer -from .morphologizer import Morphologizer +from .attributeruler import AttributeRuler +from .dep_parser import DependencyParser +from .entity_linker import EntityLinker +from .ner import EntityRecognizer from .entityruler import EntityRuler -from .hooks import SentenceSegmenter, SimilarityHook +from .lemmatizer import Lemmatizer +from .morphologizer import Morphologizer +from .pipe import Pipe +from .trainable_pipe import TrainablePipe +from .senter import SentenceRecognizer +from .sentencizer import Sentencizer +from .tagger import Tagger +from .textcat import TextCategorizer +from .tok2vec import Tok2Vec from .functions import merge_entities, merge_noun_chunks, merge_subtokens __all__ = [ - "Tagger", + "AttributeRuler", "DependencyParser", - "EntityRecognizer", "EntityLinker", - "TextCategorizer", - "Tensorizer", - "Pipe", - "Morphologizer", + "EntityRecognizer", "EntityRuler", + "Morphologizer", + "Lemmatizer", + "TrainablePipe", + "Pipe", + "SentenceRecognizer", "Sentencizer", - "SentenceSegmenter", - "SimilarityHook", + "Tagger", + "TextCategorizer", + "Tok2Vec", "merge_entities", "merge_noun_chunks", "merge_subtokens", diff --git a/spacy/data/__init__.py b/spacy/pipeline/_parser_internals/__init__.py similarity index 100% rename from spacy/data/__init__.py rename to spacy/pipeline/_parser_internals/__init__.py diff --git a/spacy/syntax/_state.pxd b/spacy/pipeline/_parser_internals/_state.pxd similarity index 98% rename from spacy/syntax/_state.pxd rename to spacy/pipeline/_parser_internals/_state.pxd index 141d796a4..0d0dd8c05 100644 --- a/spacy/syntax/_state.pxd +++ b/spacy/pipeline/_parser_internals/_state.pxd @@ -1,17 +1,14 @@ -from libc.string cimport memcpy, memset, memmove -from libc.stdlib cimport malloc, calloc, free +from libc.string cimport memcpy, memset +from libc.stdlib cimport calloc, free from libc.stdint cimport uint32_t, uint64_t - from cpython.exc cimport PyErr_CheckSignals, PyErr_SetFromErrno - from murmurhash.mrmr cimport hash64 -from ..vocab cimport EMPTY_LEXEME -from ..structs cimport TokenC, SpanC -from ..lexeme cimport Lexeme -from ..symbols cimport punct -from ..attrs cimport IS_SPACE -from ..typedefs cimport attr_t +from ...vocab cimport EMPTY_LEXEME +from ...structs cimport TokenC, SpanC +from ...lexeme cimport Lexeme +from ...attrs cimport IS_SPACE +from ...typedefs cimport attr_t cdef inline bint is_space_token(const TokenC* token) nogil: diff --git a/spacy/syntax/_state.pyx b/spacy/pipeline/_parser_internals/_state.pyx similarity index 100% rename from 
spacy/syntax/_state.pyx rename to spacy/pipeline/_parser_internals/_state.pyx diff --git a/spacy/pipeline/_parser_internals/arc_eager.pxd b/spacy/pipeline/_parser_internals/arc_eager.pxd new file mode 100644 index 000000000..e05a34f56 --- /dev/null +++ b/spacy/pipeline/_parser_internals/arc_eager.pxd @@ -0,0 +1,11 @@ +from .stateclass cimport StateClass +from ...typedefs cimport weight_t, attr_t +from .transition_system cimport Transition, TransitionSystem + + +cdef class ArcEager(TransitionSystem): + pass + + +cdef weight_t push_cost(StateClass stcls, const void* _gold, int target) nogil +cdef weight_t arc_cost(StateClass stcls, const void* _gold, int head, int child) nogil diff --git a/spacy/pipeline/_parser_internals/arc_eager.pyx b/spacy/pipeline/_parser_internals/arc_eager.pyx new file mode 100644 index 000000000..69f015bda --- /dev/null +++ b/spacy/pipeline/_parser_internals/arc_eager.pyx @@ -0,0 +1,808 @@ +# cython: profile=True, cdivision=True, infer_types=True +from cymem.cymem cimport Pool, Address +from libc.stdint cimport int32_t + +from collections import defaultdict, Counter + +from ...typedefs cimport hash_t, attr_t +from ...strings cimport hash_string +from ...structs cimport TokenC +from ...tokens.doc cimport Doc, set_children_from_heads +from ...training.example cimport Example +from .stateclass cimport StateClass +from ._state cimport StateC + +from ...errors import Errors + +# Calculate cost as gold/not gold. We don't use scalar value anyway. +cdef int BINARY_COSTS = 1 +cdef weight_t MIN_SCORE = -90000 +cdef attr_t SUBTOK_LABEL = hash_string(u'subtok') + +DEF NON_MONOTONIC = True +DEF USE_BREAK = True + +# Break transition from here +# http://www.aclweb.org/anthology/P13-1074 +cdef enum: + SHIFT + REDUCE + LEFT + RIGHT + + BREAK + + N_MOVES + + +MOVE_NAMES = [None] * N_MOVES +MOVE_NAMES[SHIFT] = 'S' +MOVE_NAMES[REDUCE] = 'D' +MOVE_NAMES[LEFT] = 'L' +MOVE_NAMES[RIGHT] = 'R' +MOVE_NAMES[BREAK] = 'B' + + +cdef enum: + HEAD_IN_STACK = 0 + HEAD_IN_BUFFER + HEAD_UNKNOWN + IS_SENT_START + SENT_START_UNKNOWN + + +cdef struct GoldParseStateC: + char* state_bits + int32_t* n_kids_in_buffer + int32_t* n_kids_in_stack + int32_t* heads + attr_t* labels + int32_t** kids + int32_t* n_kids + int32_t length + int32_t stride + + +cdef GoldParseStateC create_gold_state(Pool mem, StateClass stcls, + heads, labels, sent_starts) except *: + cdef GoldParseStateC gs + gs.length = len(heads) + gs.stride = 1 + gs.labels = mem.alloc(gs.length, sizeof(gs.labels[0])) + gs.heads = mem.alloc(gs.length, sizeof(gs.heads[0])) + gs.n_kids = mem.alloc(gs.length, sizeof(gs.n_kids[0])) + gs.state_bits = mem.alloc(gs.length, sizeof(gs.state_bits[0])) + gs.n_kids_in_buffer = mem.alloc(gs.length, sizeof(gs.n_kids_in_buffer[0])) + gs.n_kids_in_stack = mem.alloc(gs.length, sizeof(gs.n_kids_in_stack[0])) + + for i, is_sent_start in enumerate(sent_starts): + if is_sent_start == True: + gs.state_bits[i] = set_state_flag( + gs.state_bits[i], + IS_SENT_START, + 1 + ) + gs.state_bits[i] = set_state_flag( + gs.state_bits[i], + SENT_START_UNKNOWN, + 0 + ) + + elif is_sent_start is None: + gs.state_bits[i] = set_state_flag( + gs.state_bits[i], + SENT_START_UNKNOWN, + 1 + ) + gs.state_bits[i] = set_state_flag( + gs.state_bits[i], + IS_SENT_START, + 0 + ) + else: + gs.state_bits[i] = set_state_flag( + gs.state_bits[i], + SENT_START_UNKNOWN, + 0 + ) + gs.state_bits[i] = set_state_flag( + gs.state_bits[i], + IS_SENT_START, + 0 + ) + + for i, (head, label) in enumerate(zip(heads, labels)): + if head is not None: + gs.heads[i] 
= head + gs.labels[i] = label + if i != head: + gs.n_kids[head] += 1 + gs.state_bits[i] = set_state_flag( + gs.state_bits[i], + HEAD_UNKNOWN, + 0 + ) + else: + gs.state_bits[i] = set_state_flag( + gs.state_bits[i], + HEAD_UNKNOWN, + 1 + ) + # Make an array of pointers, pointing into the gs_kids_flat array. + gs.kids = mem.alloc(gs.length, sizeof(int32_t*)) + for i in range(gs.length): + if gs.n_kids[i] != 0: + gs.kids[i] = mem.alloc(gs.n_kids[i], sizeof(int32_t)) + # This is a temporary buffer + js_addr = Address(gs.length, sizeof(int32_t)) + js = js_addr.ptr + for i in range(gs.length): + if not is_head_unknown(&gs, i): + head = gs.heads[i] + if head != i: + gs.kids[head][js[head]] = i + js[head] += 1 + return gs + + +cdef void update_gold_state(GoldParseStateC* gs, StateClass stcls) nogil: + for i in range(gs.length): + gs.state_bits[i] = set_state_flag( + gs.state_bits[i], + HEAD_IN_BUFFER, + 0 + ) + gs.state_bits[i] = set_state_flag( + gs.state_bits[i], + HEAD_IN_STACK, + 0 + ) + gs.n_kids_in_stack[i] = 0 + gs.n_kids_in_buffer[i] = 0 + + for i in range(stcls.stack_depth()): + s_i = stcls.S(i) + if not is_head_unknown(gs, s_i): + gs.n_kids_in_stack[gs.heads[s_i]] += 1 + for kid in gs.kids[s_i][:gs.n_kids[s_i]]: + gs.state_bits[kid] = set_state_flag( + gs.state_bits[kid], + HEAD_IN_STACK, + 1 + ) + for i in range(stcls.buffer_length()): + b_i = stcls.B(i) + if not is_head_unknown(gs, b_i): + gs.n_kids_in_buffer[gs.heads[b_i]] += 1 + for kid in gs.kids[b_i][:gs.n_kids[b_i]]: + gs.state_bits[kid] = set_state_flag( + gs.state_bits[kid], + HEAD_IN_BUFFER, + 1 + ) + + +cdef class ArcEagerGold: + cdef GoldParseStateC c + cdef Pool mem + + def __init__(self, ArcEager moves, StateClass stcls, Example example): + self.mem = Pool() + heads, labels = example.get_aligned_parse(projectivize=True) + labels = [label if label is not None else "" for label in labels] + labels = [example.x.vocab.strings.add(label) for label in labels] + sent_starts = example.get_aligned("SENT_START") + assert len(heads) == len(labels) == len(sent_starts) + self.c = create_gold_state(self.mem, stcls, heads, labels, sent_starts) + + def update(self, StateClass stcls): + update_gold_state(&self.c, stcls) + + +cdef int check_state_gold(char state_bits, char flag) nogil: + cdef char one = 1 + return state_bits & (one << flag) + + +cdef int set_state_flag(char state_bits, char flag, int value) nogil: + cdef char one = 1 + if value: + return state_bits | (one << flag) + else: + return state_bits & ~(one << flag) + + +cdef int is_head_in_stack(const GoldParseStateC* gold, int i) nogil: + return check_state_gold(gold.state_bits[i], HEAD_IN_STACK) + + +cdef int is_head_in_buffer(const GoldParseStateC* gold, int i) nogil: + return check_state_gold(gold.state_bits[i], HEAD_IN_BUFFER) + + +cdef int is_head_unknown(const GoldParseStateC* gold, int i) nogil: + return check_state_gold(gold.state_bits[i], HEAD_UNKNOWN) + +cdef int is_sent_start(const GoldParseStateC* gold, int i) nogil: + return check_state_gold(gold.state_bits[i], IS_SENT_START) + +cdef int is_sent_start_unknown(const GoldParseStateC* gold, int i) nogil: + return check_state_gold(gold.state_bits[i], SENT_START_UNKNOWN) + + +# Helper functions for the arc-eager oracle + +cdef weight_t push_cost(StateClass stcls, const void* _gold, int target) nogil: + gold = _gold + cdef weight_t cost = 0 + if is_head_in_stack(gold, target): + cost += 1 + cost += gold.n_kids_in_stack[target] + if Break.is_valid(stcls.c, 0) and Break.move_cost(stcls, gold) == 0: + cost += 1 + return cost + 
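The gold state packs five per-token booleans (head in stack, head in buffer, head unknown, sentence start, sentence start unknown) into a single char via set_state_flag() and check_state_gold(). A pure-Python sketch of the same bit layout, for illustration only:

# Sketch only: the flag indices follow the enum above (HEAD_IN_STACK = 0, ...).
HEAD_IN_STACK, HEAD_IN_BUFFER, HEAD_UNKNOWN, IS_SENT_START, SENT_START_UNKNOWN = range(5)

def set_state_flag(state_bits: int, flag: int, value: bool) -> int:
    return state_bits | (1 << flag) if value else state_bits & ~(1 << flag)

def check_state_flag(state_bits: int, flag: int) -> bool:
    return bool(state_bits & (1 << flag))

bits = 0
bits = set_state_flag(bits, HEAD_IN_BUFFER, True)
assert check_state_flag(bits, HEAD_IN_BUFFER)
assert not check_state_flag(bits, HEAD_IN_STACK)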
+ +cdef weight_t pop_cost(StateClass stcls, const void* _gold, int target) nogil: + gold = _gold + cdef weight_t cost = 0 + if is_head_in_buffer(gold, target): + cost += 1 + cost += gold[0].n_kids_in_buffer[target] + if Break.is_valid(stcls.c, 0) and Break.move_cost(stcls, gold) == 0: + cost += 1 + return cost + + +cdef weight_t arc_cost(StateClass stcls, const void* _gold, int head, int child) nogil: + gold = _gold + if arc_is_gold(gold, head, child): + return 0 + elif stcls.H(child) == gold.heads[child]: + return 1 + # Head in buffer + elif is_head_in_buffer(gold, child): + return 1 + else: + return 0 + + +cdef bint arc_is_gold(const GoldParseStateC* gold, int head, int child) nogil: + if is_head_unknown(gold, child): + return True + elif gold.heads[child] == head: + return True + else: + return False + + +cdef bint label_is_gold(const GoldParseStateC* gold, int head, int child, attr_t label) nogil: + if is_head_unknown(gold, child): + return True + elif label == 0: + return True + elif gold.labels[child] == label: + return True + else: + return False + + +cdef bint _is_gold_root(const GoldParseStateC* gold, int word) nogil: + return gold.heads[word] == word or is_head_unknown(gold, word) + + +cdef class Shift: + @staticmethod + cdef bint is_valid(const StateC* st, attr_t label) nogil: + sent_start = st._sent[st.B_(0).l_edge].sent_start + return st.buffer_length() >= 2 and not st.shifted[st.B(0)] and sent_start != 1 + + @staticmethod + cdef int transition(StateC* st, attr_t label) nogil: + st.push() + st.fast_forward() + + @staticmethod + cdef weight_t cost(StateClass st, const void* _gold, attr_t label) nogil: + gold = _gold + return Shift.move_cost(st, gold) + Shift.label_cost(st, gold, label) + + @staticmethod + cdef inline weight_t move_cost(StateClass s, const void* _gold) nogil: + gold = _gold + return push_cost(s, gold, s.B(0)) + + @staticmethod + cdef inline weight_t label_cost(StateClass s, const void* _gold, attr_t label) nogil: + return 0 + + +cdef class Reduce: + @staticmethod + cdef bint is_valid(const StateC* st, attr_t label) nogil: + return st.stack_depth() >= 2 + + @staticmethod + cdef int transition(StateC* st, attr_t label) nogil: + if st.has_head(st.S(0)): + st.pop() + else: + st.unshift() + st.fast_forward() + + @staticmethod + cdef weight_t cost(StateClass s, const void* _gold, attr_t label) nogil: + gold = _gold + return Reduce.move_cost(s, gold) + Reduce.label_cost(s, gold, label) + + @staticmethod + cdef inline weight_t move_cost(StateClass st, const void* _gold) nogil: + gold = _gold + s0 = st.S(0) + cost = pop_cost(st, gold, s0) + return_to_buffer = not st.has_head(s0) + if return_to_buffer: + # Decrement cost for the arcs we save, as we'll be putting this + # back to the buffer + if is_head_in_stack(gold, s0): + cost -= 1 + cost -= gold.n_kids_in_stack[s0] + if Break.is_valid(st.c, 0) and Break.move_cost(st, gold) == 0: + cost -= 1 + return cost + + @staticmethod + cdef inline weight_t label_cost(StateClass s, const void* gold, attr_t label) nogil: + return 0 + + +cdef class LeftArc: + @staticmethod + cdef bint is_valid(const StateC* st, attr_t label) nogil: + if label == SUBTOK_LABEL and st.S(0) != (st.B(0)-1): + return 0 + sent_start = st._sent[st.B_(0).l_edge].sent_start + return sent_start != 1 + + @staticmethod + cdef int transition(StateC* st, attr_t label) nogil: + st.add_arc(st.B(0), st.S(0), label) + st.pop() + st.fast_forward() + + @staticmethod + cdef inline weight_t cost(StateClass s, const void* _gold, attr_t label) nogil: + gold = _gold + return 
LeftArc.move_cost(s, gold) + LeftArc.label_cost(s, gold, label) + + @staticmethod + cdef inline weight_t move_cost(StateClass s, const GoldParseStateC* gold) nogil: + cdef weight_t cost = 0 + s0 = s.S(0) + b0 = s.B(0) + if arc_is_gold(gold, b0, s0): + # Have a negative cost if we 'recover' from the wrong dependency + return 0 if not s.has_head(s0) else -1 + else: + # Account for deps we might lose between S0 and stack + if not s.has_head(s0): + cost += gold.n_kids_in_stack[s0] + if is_head_in_buffer(gold, s0): + cost += 1 + return cost + pop_cost(s, gold, s.S(0)) + arc_cost(s, gold, s.B(0), s.S(0)) + + @staticmethod + cdef inline weight_t label_cost(StateClass s, const GoldParseStateC* gold, attr_t label) nogil: + return arc_is_gold(gold, s.B(0), s.S(0)) and not label_is_gold(gold, s.B(0), s.S(0), label) + + +cdef class RightArc: + @staticmethod + cdef bint is_valid(const StateC* st, attr_t label) nogil: + # If there's (perhaps partial) parse pre-set, don't allow cycle. + if label == SUBTOK_LABEL and st.S(0) != (st.B(0)-1): + return 0 + sent_start = st._sent[st.B_(0).l_edge].sent_start + return sent_start != 1 and st.H(st.S(0)) != st.B(0) + + @staticmethod + cdef int transition(StateC* st, attr_t label) nogil: + st.add_arc(st.S(0), st.B(0), label) + st.push() + st.fast_forward() + + @staticmethod + cdef inline weight_t cost(StateClass s, const void* _gold, attr_t label) nogil: + gold = _gold + return RightArc.move_cost(s, gold) + RightArc.label_cost(s, gold, label) + + @staticmethod + cdef inline weight_t move_cost(StateClass s, const void* _gold) nogil: + gold = _gold + if arc_is_gold(gold, s.S(0), s.B(0)): + return 0 + elif s.c.shifted[s.B(0)]: + return push_cost(s, gold, s.B(0)) + else: + return push_cost(s, gold, s.B(0)) + arc_cost(s, gold, s.S(0), s.B(0)) + + @staticmethod + cdef weight_t label_cost(StateClass s, const void* _gold, attr_t label) nogil: + gold = _gold + return arc_is_gold(gold, s.S(0), s.B(0)) and not label_is_gold(gold, s.S(0), s.B(0), label) + + +cdef class Break: + @staticmethod + cdef bint is_valid(const StateC* st, attr_t label) nogil: + cdef int i + if not USE_BREAK: + return False + elif st.at_break(): + return False + elif st.stack_depth() < 1: + return False + elif st.B_(0).l_edge < 0: + return False + elif st._sent[st.B_(0).l_edge].sent_start < 0: + return False + else: + return True + + @staticmethod + cdef int transition(StateC* st, attr_t label) nogil: + st.set_break(st.B_(0).l_edge) + st.fast_forward() + + @staticmethod + cdef weight_t cost(StateClass s, const void* _gold, attr_t label) nogil: + gold = _gold + return Break.move_cost(s, gold) + Break.label_cost(s, gold, label) + + @staticmethod + cdef inline weight_t move_cost(StateClass s, const void* _gold) nogil: + gold = _gold + cost = 0 + for i in range(s.stack_depth()): + S_i = s.S(i) + cost += gold.n_kids_in_buffer[S_i] + if is_head_in_buffer(gold, S_i): + cost += 1 + # It's weird not to check the gold sentence boundaries but if we do, + # we can't account for "sunk costs", i.e. situations where we're already + # wrong. 
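+        # _get_root() (defined below) follows the gold heads up to the root of
+        # each token's gold tree, or returns -1 when the chain hits an unknown
+        # head. If S0 and B0 sit under the same known root, a Break here would
+        # split a gold sentence, so it adds one to the cost counted above.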
+ s0_root = _get_root(s.S(0), gold) + b0_root = _get_root(s.B(0), gold) + if s0_root != b0_root or s0_root == -1 or b0_root == -1: + return cost + else: + return cost + 1 + + @staticmethod + cdef inline weight_t label_cost(StateClass s, const void* gold, attr_t label) nogil: + return 0 + +cdef int _get_root(int word, const GoldParseStateC* gold) nogil: + if is_head_unknown(gold, word): + return -1 + while gold.heads[word] != word and word >= 0: + word = gold.heads[word] + if is_head_unknown(gold, word): + return -1 + else: + return word + + +cdef void* _init_state(Pool mem, int length, void* tokens) except NULL: + st = new StateC(tokens, length) + for i in range(st.length): + if st._sent[i].dep == 0: + st._sent[i].l_edge = i + st._sent[i].r_edge = i + st._sent[i].head = 0 + st._sent[i].dep = 0 + st._sent[i].l_kids = 0 + st._sent[i].r_kids = 0 + st.fast_forward() + return st + + +cdef int _del_state(Pool mem, void* state, void* x) except -1: + cdef StateC* st = state + del st + + +cdef class ArcEager(TransitionSystem): + def __init__(self, *args, **kwargs): + TransitionSystem.__init__(self, *args, **kwargs) + + @classmethod + def get_actions(cls, **kwargs): + min_freq = kwargs.get('min_freq', None) + actions = defaultdict(lambda: Counter()) + actions[SHIFT][''] = 1 + actions[REDUCE][''] = 1 + for label in kwargs.get('left_labels', []): + actions[LEFT][label] = 1 + actions[SHIFT][label] = 1 + for label in kwargs.get('right_labels', []): + actions[RIGHT][label] = 1 + actions[REDUCE][label] = 1 + for example in kwargs.get('examples', []): + heads, labels = example.get_aligned_parse(projectivize=True) + for child, (head, label) in enumerate(zip(heads, labels)): + if head is None or label is None: + continue + if label.upper() == 'ROOT' : + label = 'ROOT' + if head == child: + actions[BREAK][label] += 1 + elif head < child: + actions[RIGHT][label] += 1 + actions[REDUCE][''] += 1 + elif head > child: + actions[LEFT][label] += 1 + actions[SHIFT][''] += 1 + if min_freq is not None: + for action, label_freqs in actions.items(): + for label, freq in list(label_freqs.items()): + if freq < min_freq: + label_freqs.pop(label) + # Ensure these actions are present + actions[BREAK].setdefault('ROOT', 0) + if kwargs.get("learn_tokens") is True: + actions[RIGHT].setdefault('subtok', 0) + actions[LEFT].setdefault('subtok', 0) + # Used for backoff + actions[RIGHT].setdefault('dep', 0) + actions[LEFT].setdefault('dep', 0) + return actions + + @property + def action_types(self): + return (SHIFT, REDUCE, LEFT, RIGHT, BREAK) + + def transition(self, StateClass state, action): + cdef Transition t = self.lookup_transition(action) + t.do(state.c, t.label) + return state + + def is_gold_parse(self, StateClass state, gold): + raise NotImplementedError + + def init_gold(self, StateClass state, Example example): + gold = ArcEagerGold(self, state, example) + self._replace_unseen_labels(gold) + return gold + + def init_gold_batch(self, examples): + all_states = self.init_batch([eg.predicted for eg in examples]) + golds = [] + states = [] + for state, eg in zip(all_states, examples): + if self.has_gold(eg) and not state.is_final(): + golds.append(self.init_gold(state, eg)) + states.append(state) + n_steps = sum([len(s.queue) for s in states]) + return states, golds, n_steps + + def _replace_unseen_labels(self, ArcEagerGold gold): + backoff_label = self.strings["dep"] + root_label = self.strings["ROOT"] + left_labels = self.labels[LEFT] + right_labels = self.labels[RIGHT] + break_labels = self.labels[BREAK] + for i in 
range(gold.c.length): + if not is_head_unknown(&gold.c, i): + head = gold.c.heads[i] + label = self.strings[gold.c.labels[i]] + if head > i and label not in left_labels: + gold.c.labels[i] = backoff_label + elif head < i and label not in right_labels: + gold.c.labels[i] = backoff_label + elif head == i and label not in break_labels: + gold.c.labels[i] = root_label + return gold + + cdef Transition lookup_transition(self, object name_or_id) except *: + if isinstance(name_or_id, int): + return self.c[name_or_id] + name = name_or_id + if '-' in name: + move_str, label_str = name.split('-', 1) + label = self.strings[label_str] + else: + move_str = name + label = 0 + move = MOVE_NAMES.index(move_str) + for i in range(self.n_moves): + if self.c[i].move == move and self.c[i].label == label: + return self.c[i] + raise KeyError(f"Unknown transition: {name}") + + def move_name(self, int move, attr_t label): + label_str = self.strings[label] + if label_str: + return MOVE_NAMES[move] + '-' + label_str + else: + return MOVE_NAMES[move] + + def class_name(self, int i): + return self.move_name(self.c[i].move, self.c[i].label) + + cdef Transition init_transition(self, int clas, int move, attr_t label) except *: + # TODO: Apparent Cython bug here when we try to use the Transition() + # constructor with the function pointers + cdef Transition t + t.score = 0 + t.clas = clas + t.move = move + t.label = label + if move == SHIFT: + t.is_valid = Shift.is_valid + t.do = Shift.transition + t.get_cost = Shift.cost + elif move == REDUCE: + t.is_valid = Reduce.is_valid + t.do = Reduce.transition + t.get_cost = Reduce.cost + elif move == LEFT: + t.is_valid = LeftArc.is_valid + t.do = LeftArc.transition + t.get_cost = LeftArc.cost + elif move == RIGHT: + t.is_valid = RightArc.is_valid + t.do = RightArc.transition + t.get_cost = RightArc.cost + elif move == BREAK: + t.is_valid = Break.is_valid + t.do = Break.transition + t.get_cost = Break.cost + else: + raise ValueError(Errors.E019.format(action=move, src='arc_eager')) + return t + + cdef int initialize_state(self, StateC* st) nogil: + for i in range(st.length): + if st._sent[i].dep == 0: + st._sent[i].l_edge = i + st._sent[i].r_edge = i + st._sent[i].head = 0 + st._sent[i].dep = 0 + st._sent[i].l_kids = 0 + st._sent[i].r_kids = 0 + st.fast_forward() + + cdef int finalize_state(self, StateC* st) nogil: + cdef int i + for i in range(st.length): + if st._sent[i].head == 0: + st._sent[i].dep = self.root_label + + def finalize_doc(self, Doc doc): + set_children_from_heads(doc.c, 0, doc.length) + + def has_gold(self, Example eg, start=0, end=None): + for word in eg.y[start:end]: + if word.dep != 0: + return True + else: + return False + + cdef int set_valid(self, int* output, const StateC* st) nogil: + cdef bint[N_MOVES] is_valid + is_valid[SHIFT] = Shift.is_valid(st, 0) + is_valid[REDUCE] = Reduce.is_valid(st, 0) + is_valid[LEFT] = LeftArc.is_valid(st, 0) + is_valid[RIGHT] = RightArc.is_valid(st, 0) + is_valid[BREAK] = Break.is_valid(st, 0) + cdef int i + for i in range(self.n_moves): + if self.c[i].label == SUBTOK_LABEL: + output[i] = self.c[i].is_valid(st, self.c[i].label) + else: + output[i] = is_valid[self.c[i].move] + + def get_cost(self, StateClass stcls, gold, int i): + if not isinstance(gold, ArcEagerGold): + raise TypeError(Errors.E909.format(name="ArcEagerGold")) + cdef ArcEagerGold gold_ = gold + gold_state = gold_.c + n_gold = 0 + if self.c[i].is_valid(stcls.c, self.c[i].label): + cost = self.c[i].get_cost(stcls, &gold_state, self.c[i].label) + else: + cost = 
9000 + return cost + + cdef int set_costs(self, int* is_valid, weight_t* costs, + StateClass stcls, gold) except -1: + if not isinstance(gold, ArcEagerGold): + raise TypeError(Errors.E909.format(name="ArcEagerGold")) + cdef ArcEagerGold gold_ = gold + gold_.update(stcls) + gold_state = gold_.c + cdef int n_gold = 0 + for i in range(self.n_moves): + if self.c[i].is_valid(stcls.c, self.c[i].label): + is_valid[i] = True + costs[i] = self.c[i].get_cost(stcls, &gold_state, self.c[i].label) + if costs[i] <= 0: + n_gold += 1 + else: + is_valid[i] = False + costs[i] = 9000 + if n_gold < 1: + raise ValueError + + def get_oracle_sequence_from_state(self, StateClass state, ArcEagerGold gold, _debug=None): + cdef int i + cdef Pool mem = Pool() + # n_moves should not be zero at this point, but make sure to avoid zero-length mem alloc + assert self.n_moves > 0 + costs = mem.alloc(self.n_moves, sizeof(float)) + is_valid = mem.alloc(self.n_moves, sizeof(int)) + + history = [] + debug_log = [] + failed = False + while not state.is_final(): + try: + self.set_costs(is_valid, costs, state, gold) + except ValueError: + failed = True + break + for i in range(self.n_moves): + if is_valid[i] and costs[i] <= 0: + action = self.c[i] + history.append(i) + s0 = state.S(0) + b0 = state.B(0) + if _debug: + example = _debug + debug_log.append(" ".join(( + self.get_class_name(i), + "S0=", (example.x[s0].text if s0 >= 0 else "__"), + "B0=", (example.x[b0].text if b0 >= 0 else "__"), + "S0 head?", str(state.has_head(state.S(0))), + ))) + action.do(state.c, action.label) + break + else: + failed = False + break + if failed: + example = _debug + print("Actions") + for i in range(self.n_moves): + print(self.get_class_name(i)) + print("Gold") + for token in example.y: + print(token.i, token.text, token.dep_, token.head.text) + aligned_heads, aligned_labels = example.get_aligned_parse() + print("Aligned heads") + for i, head in enumerate(aligned_heads): + print(example.x[i], example.x[head] if head is not None else "__") + + print("Predicted tokens") + print([(w.i, w.text) for w in example.x]) + s0 = state.S(0) + b0 = state.B(0) + debug_log.append(" ".join(( + "?", + "S0=", (example.x[s0].text if s0 >= 0 else "-"), + "B0=", (example.x[b0].text if b0 >= 0 else "-"), + "S0 head?", str(state.has_head(state.S(0))), + ))) + s0 = state.S(0) + b0 = state.B(0) + print("\n".join(debug_log)) + print("Arc is gold B0, S0?", arc_is_gold(&gold.c, b0, s0)) + print("Arc is gold S0, B0?", arc_is_gold(&gold.c, s0, b0)) + print("is_head_unknown(s0)", is_head_unknown(&gold.c, s0)) + print("is_head_unknown(b0)", is_head_unknown(&gold.c, b0)) + print("b0", b0, "gold.heads[s0]", gold.c.heads[s0]) + print("Stack", [example.x[i] for i in state.stack]) + print("Buffer", [example.x[i] for i in state.queue]) + raise ValueError(Errors.E024) + return history diff --git a/spacy/pipeline/_parser_internals/ner.pxd b/spacy/pipeline/_parser_internals/ner.pxd new file mode 100644 index 000000000..2264a1518 --- /dev/null +++ b/spacy/pipeline/_parser_internals/ner.pxd @@ -0,0 +1,5 @@ +from .transition_system cimport TransitionSystem + + +cdef class BiluoPushDown(TransitionSystem): + pass diff --git a/spacy/syntax/ner.pyx b/spacy/pipeline/_parser_internals/ner.pyx similarity index 79% rename from spacy/syntax/ner.pyx rename to spacy/pipeline/_parser_internals/ner.pyx index 9f8ad418c..4f142caaf 100644 --- a/spacy/syntax/ner.pyx +++ b/spacy/pipeline/_parser_internals/ner.pyx @@ -1,18 +1,17 @@ -# coding: utf-8 -from __future__ import unicode_literals +from libc.stdint 
cimport int32_t +from cymem.cymem cimport Pool -from thinc.typedefs cimport weight_t -from thinc.extra.search cimport Beam -from collections import OrderedDict, Counter +from collections import Counter +from ...typedefs cimport weight_t, attr_t +from ...lexeme cimport Lexeme +from ...attrs cimport IS_SPACE +from ...training.example cimport Example from .stateclass cimport StateClass from ._state cimport StateC -from .transition_system cimport Transition -from .transition_system cimport do_func_t -from ..gold cimport GoldParseC, GoldParse -from ..lexeme cimport Lexeme -from ..attrs cimport IS_SPACE -from ..errors import Errors +from .transition_system cimport Transition, do_func_t + +from ...errors import Errors cdef enum: @@ -36,6 +35,43 @@ MOVE_NAMES[OUT] = 'O' MOVE_NAMES[ISNT] = 'x' +cdef struct GoldNERStateC: + Transition* ner + int32_t length + + +cdef class BiluoGold: + cdef Pool mem + cdef GoldNERStateC c + + def __init__(self, BiluoPushDown moves, StateClass stcls, Example example): + self.mem = Pool() + self.c = create_gold_state(self.mem, moves, stcls, example) + + def update(self, StateClass stcls): + update_gold_state(&self.c, stcls) + + + +cdef GoldNERStateC create_gold_state( + Pool mem, + BiluoPushDown moves, + StateClass stcls, + Example example +) except *: + cdef GoldNERStateC gs + gs.ner = mem.alloc(example.x.length, sizeof(Transition)) + ner_tags = example.get_aligned_ner() + for i, ner_tag in enumerate(ner_tags): + gs.ner[i] = moves.lookup_transition(ner_tag) + return gs + + +cdef void update_gold_state(GoldNERStateC* gs, StateClass stcls) except *: + # We don't need to update each time, unlike the parser. + pass + + cdef do_func_t[N_MOVES] do_funcs @@ -72,13 +108,12 @@ cdef class BiluoPushDown(TransitionSystem): for action in (BEGIN, IN, LAST, UNIT): actions[action][entity_type] = 1 moves = ('M', 'B', 'I', 'L', 'U') - for raw_text, sents in kwargs.get('gold_parses', []): - for (ids, words, tags, heads, labels, biluo), _ in sents: - for i, ner_tag in enumerate(biluo): - if ner_tag != 'O' and ner_tag != '-': - _, label = ner_tag.split('-', 1) - for action in (BEGIN, IN, LAST, UNIT): - actions[action][label] += 1 + for example in kwargs.get('examples', []): + for token in example.y: + ent_type = token.ent_type_ + if ent_type: + for action in (BEGIN, IN, LAST, UNIT): + actions[action][ent_type] += 1 return actions @property @@ -93,52 +128,16 @@ cdef class BiluoPushDown(TransitionSystem): else: return MOVE_NAMES[move] + '-' + self.strings[label] - def has_gold(self, GoldParse gold, start=0, end=None): - end = end or len(gold.ner) - if all([tag in ('-', None) for tag in gold.ner[start:end]]): - return False - else: - return True - - def preprocess_gold(self, GoldParse gold): - if not self.has_gold(gold): - return None - for i in range(gold.length): - gold.c.ner[i] = self.lookup_transition(gold.ner[i]) - return gold - - def get_beam_annot(self, Beam beam): - entities = {} - probs = beam.probs - for i in range(beam.size): - state = beam.at(i) - if state.is_final(): - self.finalize_state(state) - prob = probs[i] - for j in range(state._e_i): - start = state._ents[j].start - end = state._ents[j].end - label = state._ents[j].label - entities.setdefault((start, end, label), 0.0) - entities[(start, end, label)] += prob - return entities - - def get_beam_parses(self, Beam beam): - parses = [] - probs = beam.probs - for i in range(beam.size): - state = beam.at(i) - if state.is_final(): - self.finalize_state(state) - prob = probs[i] - parse = [] - for j in range(state._e_i): - start = 
state._ents[j].start - end = state._ents[j].end - label = state._ents[j].label - parse.append((start, end, self.strings[label])) - parses.append((prob, parse)) - return parses + def init_gold_batch(self, examples): + all_states = self.init_batch([eg.predicted for eg in examples]) + golds = [] + states = [] + for state, eg in zip(all_states, examples): + if self.has_gold(eg) and not state.is_final(): + golds.append(self.init_gold(state, eg)) + states.append(state) + n_steps = sum([len(s.queue) for s in states]) + return states, golds, n_steps cdef Transition lookup_transition(self, object name) except *: cdef attr_t label @@ -239,6 +238,47 @@ cdef class BiluoPushDown(TransitionSystem): self.add_action(UNIT, st._sent[i].ent_type) self.add_action(LAST, st._sent[i].ent_type) + def init_gold(self, StateClass state, Example example): + return BiluoGold(self, state, example) + + def has_gold(self, Example eg, start=0, end=None): + for word in eg.y[start:end]: + if word.ent_iob != 0: + return True + else: + return False + + def get_cost(self, StateClass stcls, gold, int i): + if not isinstance(gold, BiluoGold): + raise TypeError(Errors.E909.format(name="BiluoGold")) + cdef BiluoGold gold_ = gold + gold_state = gold_.c + n_gold = 0 + if self.c[i].is_valid(stcls.c, self.c[i].label): + cost = self.c[i].get_cost(stcls, &gold_state, self.c[i].label) + else: + cost = 9000 + return cost + + cdef int set_costs(self, int* is_valid, weight_t* costs, + StateClass stcls, gold) except -1: + if not isinstance(gold, BiluoGold): + raise TypeError(Errors.E909.format(name="BiluoGold")) + cdef BiluoGold gold_ = gold + gold_.update(stcls) + gold_state = gold_.c + n_gold = 0 + for i in range(self.n_moves): + if self.c[i].is_valid(stcls.c, self.c[i].label): + is_valid[i] = 1 + costs[i] = self.c[i].get_cost(stcls, &gold_state, self.c[i].label) + n_gold += costs[i] <= 0 + else: + is_valid[i] = 0 + costs[i] = 9000 + if n_gold < 1: + raise ValueError + cdef class Missing: @staticmethod @@ -250,7 +290,7 @@ cdef class Missing: pass @staticmethod - cdef weight_t cost(StateClass s, const GoldParseC* gold, attr_t label) nogil: + cdef weight_t cost(StateClass s, const void* _gold, attr_t label) nogil: return 9000 @@ -302,7 +342,8 @@ cdef class Begin: st.pop() @staticmethod - cdef weight_t cost(StateClass s, const GoldParseC* gold, attr_t label) nogil: + cdef weight_t cost(StateClass s, const void* _gold, attr_t label) nogil: + gold = _gold cdef int g_act = gold.ner[s.B(0)].move cdef attr_t g_tag = gold.ner[s.B(0)].label @@ -365,7 +406,8 @@ cdef class In: st.pop() @staticmethod - cdef weight_t cost(StateClass s, const GoldParseC* gold, attr_t label) nogil: + cdef weight_t cost(StateClass s, const void* _gold, attr_t label) nogil: + gold = _gold move = IN cdef int next_act = gold.ner[s.B(1)].move if s.B(1) >= 0 else OUT cdef int g_act = gold.ner[s.B(0)].move @@ -431,7 +473,8 @@ cdef class Last: st.pop() @staticmethod - cdef weight_t cost(StateClass s, const GoldParseC* gold, attr_t label) nogil: + cdef weight_t cost(StateClass s, const void* _gold, attr_t label) nogil: + gold = _gold move = LAST cdef int g_act = gold.ner[s.B(0)].move @@ -499,7 +542,8 @@ cdef class Unit: st.pop() @staticmethod - cdef weight_t cost(StateClass s, const GoldParseC* gold, attr_t label) nogil: + cdef weight_t cost(StateClass s, const void* _gold, attr_t label) nogil: + gold = _gold cdef int g_act = gold.ner[s.B(0)].move cdef attr_t g_tag = gold.ner[s.B(0)].label @@ -539,7 +583,8 @@ cdef class Out: st.pop() @staticmethod - cdef weight_t cost(StateClass 
s, const GoldParseC* gold, attr_t label) nogil: + cdef weight_t cost(StateClass s, const void* _gold, attr_t label) nogil: + gold = _gold cdef int g_act = gold.ner[s.B(0)].move cdef attr_t g_tag = gold.ner[s.B(0)].label diff --git a/spacy/syntax/nonproj.pxd b/spacy/pipeline/_parser_internals/nonproj.pxd similarity index 100% rename from spacy/syntax/nonproj.pxd rename to spacy/pipeline/_parser_internals/nonproj.pxd diff --git a/spacy/syntax/nonproj.pyx b/spacy/pipeline/_parser_internals/nonproj.pyx similarity index 66% rename from spacy/syntax/nonproj.pyx rename to spacy/pipeline/_parser_internals/nonproj.pyx index 53e8a9cfe..82070cd27 100644 --- a/spacy/syntax/nonproj.pyx +++ b/spacy/pipeline/_parser_internals/nonproj.pyx @@ -1,16 +1,13 @@ -# coding: utf-8 -# cython: profile=True -# cython: infer_types=True +# cython: profile=True, infer_types=True """Implements the projectivize/deprojectivize mechanism in Nivre & Nilsson 2005 for doing pseudo-projective parsing implementation uses the HEAD decoration scheme. """ -from __future__ import unicode_literals - from copy import copy -from ..tokens.doc cimport Doc, set_children_from_heads -from ..errors import Errors +from ...tokens.doc cimport Doc, set_children_from_heads + +from ...errors import Errors DELIMITER = '||' @@ -52,8 +49,12 @@ def is_nonproj_arc(tokenid, heads): return False elif head is None: # unattached tokens cannot be non-projective return False - - start, end = (head+1, tokenid) if head < tokenid else (tokenid+1, head) + + cdef int start, end + if head < tokenid: + start, end = (head+1, tokenid) + else: + start, end = (tokenid+1, head) for k in range(start, end): for ancestor in ancestors(k, heads): if ancestor is None: # for unattached tokens/subtrees @@ -77,44 +78,21 @@ def decompose(label): def is_decorated(label): return DELIMITER in label -def count_decorated_labels(gold_tuples): +def count_decorated_labels(gold_data): freqs = {} - for raw_text, sents in gold_tuples: - for (ids, words, tags, heads, labels, iob), ctnts in sents: - proj_heads, deco_labels = projectivize(heads, labels) - # set the label to ROOT for each root dependent - deco_labels = ['ROOT' if head == i else deco_labels[i] - for i, head in enumerate(proj_heads)] - # count label frequencies - for label in deco_labels: - if is_decorated(label): - freqs[label] = freqs.get(label, 0) + 1 + for example in gold_data: + proj_heads, deco_deps = projectivize(example.get_aligned("HEAD"), + example.get_aligned("DEP")) + # set the label to ROOT for each root dependent + deco_deps = ['ROOT' if head == i else deco_deps[i] + for i, head in enumerate(proj_heads)] + # count label frequencies + for label in deco_deps: + if is_decorated(label): + freqs[label] = freqs.get(label, 0) + 1 return freqs -def preprocess_training_data(gold_tuples, label_freq_cutoff=30): - preprocessed = [] - freqs = {} - for raw_text, sents in gold_tuples: - prepro_sents = [] - for (ids, words, tags, heads, labels, iob), ctnts in sents: - proj_heads, deco_labels = projectivize(heads, labels) - # set the label to ROOT for each root dependent - deco_labels = ['ROOT' if head == i else deco_labels[i] - for i, head in enumerate(proj_heads)] - # count label frequencies - if label_freq_cutoff > 0: - for label in deco_labels: - if is_decorated(label): - freqs[label] = freqs.get(label, 0) + 1 - prepro_sents.append( - ((ids, words, tags, proj_heads, deco_labels, iob), ctnts)) - preprocessed.append((raw_text, prepro_sents)) - if label_freq_cutoff > 0: - return _filter_labels(preprocessed, label_freq_cutoff, 
freqs) - return preprocessed - - def projectivize(heads, labels): # Use the algorithm by Nivre & Nilsson 2005. Assumes heads to be a proper # tree, i.e. connected and cycle-free. Returns a new pair (heads, labels) @@ -141,7 +119,7 @@ cpdef deprojectivize(Doc doc): new_head = _find_new_head(doc[i], head_label) doc.c[i].head = new_head.i - i doc.c[i].dep = doc.vocab.strings.add(new_label) - set_children_from_heads(doc.c, doc.length) + set_children_from_heads(doc.c, 0, doc.length) return doc @@ -154,8 +132,7 @@ def _decorate(heads, proj_heads, labels): deco_labels = [] for tokenid, head in enumerate(heads): if head != proj_heads[tokenid]: - deco_labels.append( - '%s%s%s' % (labels[tokenid], DELIMITER, labels[head])) + deco_labels.append(f"{labels[tokenid]}{DELIMITER}{labels[head]}") else: deco_labels.append(labels[tokenid]) return deco_labels @@ -201,22 +178,3 @@ def _find_new_head(token, headlabel): next_queue.append(child) queue = next_queue return token.head - - -def _filter_labels(gold_tuples, cutoff, freqs): - # throw away infrequent decorated labels - # can't learn them reliably anyway and keeps label set smaller - filtered = [] - for raw_text, sents in gold_tuples: - filtered_sents = [] - for (ids, words, tags, heads, labels, iob), ctnts in sents: - filtered_labels = [] - for label in labels: - if is_decorated(label) and freqs.get(label, 0) < cutoff: - filtered_labels.append(decompose(label)[0]) - else: - filtered_labels.append(label) - filtered_sents.append( - ((ids, words, tags, heads, filtered_labels, iob), ctnts)) - filtered.append((raw_text, filtered_sents)) - return filtered diff --git a/spacy/syntax/stateclass.pxd b/spacy/pipeline/_parser_internals/stateclass.pxd similarity index 95% rename from spacy/syntax/stateclass.pxd rename to spacy/pipeline/_parser_internals/stateclass.pxd index 567982a3f..1d9f05538 100644 --- a/spacy/syntax/stateclass.pxd +++ b/spacy/pipeline/_parser_internals/stateclass.pxd @@ -1,12 +1,8 @@ -from libc.string cimport memcpy, memset - from cymem.cymem cimport Pool -cimport cython -from ..structs cimport TokenC, SpanC -from ..typedefs cimport attr_t +from ...structs cimport TokenC, SpanC +from ...typedefs cimport attr_t -from ..vocab cimport EMPTY_LEXEME from ._state cimport StateC diff --git a/spacy/syntax/stateclass.pyx b/spacy/pipeline/_parser_internals/stateclass.pyx similarity index 82% rename from spacy/syntax/stateclass.pyx rename to spacy/pipeline/_parser_internals/stateclass.pyx index 2a15a2de1..880cf6cc5 100644 --- a/spacy/syntax/stateclass.pyx +++ b/spacy/pipeline/_parser_internals/stateclass.pyx @@ -1,10 +1,7 @@ -# coding: utf-8 # cython: infer_types=True -from __future__ import unicode_literals - import numpy -from ..tokens.doc cimport Doc +from ...tokens.doc cimport Doc cdef class StateClass: @@ -49,9 +46,9 @@ cdef class StateClass: def print_state(self, words): words = list(words) + ['_'] - top = words[self.S(0)] + '_%d' % self.S_(0).head - second = words[self.S(1)] + '_%d' % self.S_(1).head - third = words[self.S(2)] + '_%d' % self.S_(2).head + top = f"{words[self.S(0)]}_{self.S_(0).head}" + second = f"{words[self.S(1)]}_{self.S_(1).head}" + third = f"{words[self.S(2)]}_{self.S_(2).head}" n0 = words[self.B(0)] n1 = words[self.B(1)] return ' '.join((third, second, top, '|', n0, n1)) diff --git a/spacy/syntax/transition_system.pxd b/spacy/pipeline/_parser_internals/transition_system.pxd similarity index 64% rename from spacy/syntax/transition_system.pxd rename to spacy/pipeline/_parser_internals/transition_system.pxd index 
a5fe55918..458f1d5f9 100644 --- a/spacy/syntax/transition_system.pxd +++ b/spacy/pipeline/_parser_internals/transition_system.pxd @@ -1,12 +1,9 @@ from cymem.cymem cimport Pool -from thinc.typedefs cimport weight_t - -from ..typedefs cimport attr_t -from ..structs cimport TokenC -from ..gold cimport GoldParse -from ..gold cimport GoldParseC -from ..strings cimport StringStore +from ...typedefs cimport attr_t, weight_t +from ...structs cimport TokenC +from ...strings cimport StringStore +from ...training.example cimport Example from .stateclass cimport StateClass from ._state cimport StateC @@ -19,14 +16,14 @@ cdef struct Transition: weight_t score bint (*is_valid)(const StateC* state, attr_t label) nogil - weight_t (*get_cost)(StateClass state, const GoldParseC* gold, attr_t label) nogil + weight_t (*get_cost)(StateClass state, const void* gold, attr_t label) nogil int (*do)(StateC* state, attr_t label) nogil -ctypedef weight_t (*get_cost_func_t)(StateClass state, const GoldParseC* gold, +ctypedef weight_t (*get_cost_func_t)(StateClass state, const void* gold, attr_tlabel) nogil -ctypedef weight_t (*move_cost_func_t)(StateClass state, const GoldParseC* gold) nogil -ctypedef weight_t (*label_cost_func_t)(StateClass state, const GoldParseC* +ctypedef weight_t (*move_cost_func_t)(StateClass state, const void* gold) nogil +ctypedef weight_t (*label_cost_func_t)(StateClass state, const void* gold, attr_t label) nogil ctypedef int (*do_func_t)(StateC* state, attr_t label) nogil @@ -43,8 +40,6 @@ cdef class TransitionSystem: cdef int _size cdef public attr_t root_label cdef public freqs - cdef init_state_t init_beam_state - cdef del_state_t del_beam_state cdef public object labels cdef int initialize_state(self, StateC* state) nogil @@ -57,4 +52,4 @@ cdef class TransitionSystem: cdef int set_valid(self, int* output, const StateC* st) nogil cdef int set_costs(self, int* is_valid, weight_t* costs, - StateClass state, GoldParse gold) except -1 + StateClass state, gold) except -1 diff --git a/spacy/syntax/transition_system.pyx b/spacy/pipeline/_parser_internals/transition_system.pyx similarity index 72% rename from spacy/syntax/transition_system.pyx rename to spacy/pipeline/_parser_internals/transition_system.pyx index 65097f114..7694e7f34 100644 --- a/spacy/syntax/transition_system.pyx +++ b/spacy/pipeline/_parser_internals/transition_system.pyx @@ -1,21 +1,17 @@ # cython: infer_types=True -# coding: utf-8 -from __future__ import unicode_literals - -from cpython.ref cimport Py_INCREF +from __future__ import print_function from cymem.cymem cimport Pool -from thinc.typedefs cimport weight_t -from thinc.extra.search cimport Beam -from collections import OrderedDict, Counter + +from collections import Counter import srsly -from . cimport _beam_utils -from ..tokens.doc cimport Doc -from ..structs cimport TokenC +from ...typedefs cimport weight_t, attr_t +from ...tokens.doc cimport Doc +from ...structs cimport TokenC from .stateclass cimport StateClass -from ..typedefs cimport attr_t -from ..errors import Errors -from .. import util + +from ...errors import Errors +from ... 
import util cdef weight_t MIN_SCORE = -90000 @@ -48,8 +44,6 @@ cdef class TransitionSystem: if labels_by_action: self.initialize_actions(labels_by_action, min_freq=min_freq) self.root_label = self.strings.add('ROOT') - self.init_beam_state = _init_state - self.del_beam_state = _del_state def __reduce__(self): return (self.__class__, (self.strings, self.labels), None, None) @@ -65,48 +59,62 @@ cdef class TransitionSystem: offset += len(doc) return states - def init_beams(self, docs, beam_width, beam_density=0.): - cdef Doc doc - beams = [] - cdef int offset = 0 + def get_oracle_sequence(self, Example example, _debug=False): + states, golds, _ = self.init_gold_batch([example]) + if not states: + return [] + state = states[0] + gold = golds[0] + if _debug: + return self.get_oracle_sequence_from_state(state, gold, _debug=example) + else: + return self.get_oracle_sequence_from_state(state, gold) - # Doc objects might contain labels that we need to register actions for. We need to check for that - # *before* we create any Beam objects, because the Beam object needs the correct number of - # actions. It's sort of dumb, but the best way is to just call init_batch() -- that triggers the additions, - # and it doesn't matter that we create and discard the state objects. - self.init_batch(docs) - - for doc in docs: - beam = Beam(self.n_moves, beam_width, min_density=beam_density) - beam.initialize(self.init_beam_state, self.del_beam_state, - doc.length, doc.c) - for i in range(beam.width): - state = beam.at(i) - state.offset = offset - offset += len(doc) - beam.check_done(_beam_utils.check_final_state, NULL) - beams.append(beam) - return beams - - def get_oracle_sequence(self, doc, GoldParse gold): + def get_oracle_sequence_from_state(self, StateClass state, gold, _debug=None): cdef Pool mem = Pool() # n_moves should not be zero at this point, but make sure to avoid zero-length mem alloc assert self.n_moves > 0 costs = mem.alloc(self.n_moves, sizeof(float)) is_valid = mem.alloc(self.n_moves, sizeof(int)) - cdef StateClass state = StateClass(doc, offset=0) - self.initialize_state(state.c) history = [] + debug_log = [] while not state.is_final(): self.set_costs(is_valid, costs, state, gold) for i in range(self.n_moves): if is_valid[i] and costs[i] <= 0: action = self.c[i] history.append(i) + if _debug: + s0 = state.S(0) + b0 = state.B(0) + example = _debug + debug_log.append(" ".join(( + self.get_class_name(i), + "S0=", (example.x[s0].text if s0 >= 0 else "__"), + "B0=", (example.x[b0].text if b0 >= 0 else "__"), + "S0 head?", str(state.has_head(state.S(0))), + ))) action.do(state.c, action.label) break else: + if _debug: + example = _debug + print("Actions") + for i in range(self.n_moves): + print(self.get_class_name(i)) + print("Gold") + for token in example.y: + print(token.text, token.dep_, token.head.text) + s0 = state.S(0) + b0 = state.B(0) + debug_log.append(" ".join(( + "?", + "S0=", (example.x[s0].text if s0 >= 0 else "-"), + "B0=", (example.x[b0].text if b0 >= 0 else "-"), + "S0 head?", str(state.has_head(state.S(0))), + ))) + print("\n".join(debug_log)) raise ValueError(Errors.E024) return history @@ -125,12 +133,6 @@ cdef class TransitionSystem: def finalize_doc(self, doc): pass - def preprocess_gold(self, GoldParse gold): - raise NotImplementedError - - def is_gold_parse(self, StateClass state, GoldParse gold): - raise NotImplementedError - cdef Transition lookup_transition(self, object name) except *: raise NotImplementedError @@ -149,18 +151,8 @@ cdef class TransitionSystem: is_valid[i] = 
self.c[i].is_valid(st, self.c[i].label) cdef int set_costs(self, int* is_valid, weight_t* costs, - StateClass stcls, GoldParse gold) except -1: - cdef int i - self.set_valid(is_valid, stcls.c) - cdef int n_gold = 0 - for i in range(self.n_moves): - if is_valid[i]: - costs[i] = self.c[i].get_cost(stcls, &gold.c, self.c[i].label) - n_gold += costs[i] <= 0 - else: - costs[i] = 9000 - if n_gold <= 0: - raise ValueError(Errors.E024) + StateClass stcls, gold) except -1: + raise NotImplementedError def get_class_name(self, int clas): act = self.c[clas] @@ -233,22 +225,20 @@ cdef class TransitionSystem: self.from_bytes(byte_data, **kwargs) return self - def to_bytes(self, exclude=tuple(), **kwargs): + def to_bytes(self, exclude=tuple()): transitions = [] serializers = { 'moves': lambda: srsly.json_dumps(self.labels), 'strings': lambda: self.strings.to_bytes() } - exclude = util.get_serialization_exclude(serializers, exclude, kwargs) return util.to_bytes(serializers, exclude) - def from_bytes(self, bytes_data, exclude=tuple(), **kwargs): + def from_bytes(self, bytes_data, exclude=tuple()): labels = {} deserializers = { 'moves': lambda b: labels.update(srsly.json_loads(b)), 'strings': lambda b: self.strings.from_bytes(b) } - exclude = util.get_serialization_exclude(deserializers, exclude, kwargs) msg = util.from_bytes(bytes_data, deserializers, exclude) self.initialize_actions(labels) return self diff --git a/spacy/pipeline/attributeruler.py b/spacy/pipeline/attributeruler.py new file mode 100644 index 000000000..e17d3be98 --- /dev/null +++ b/spacy/pipeline/attributeruler.py @@ -0,0 +1,330 @@ +from typing import List, Dict, Union, Iterable, Any, Optional, Callable +from typing import Tuple +import srsly +from pathlib import Path + +from .pipe import Pipe +from ..errors import Errors +from ..training import validate_examples, Example +from ..language import Language +from ..matcher import Matcher +from ..scorer import Scorer +from ..symbols import IDS, TAG, POS, MORPH, LEMMA +from ..tokens import Doc, Span +from ..tokens._retokenize import normalize_token_attrs, set_token_attrs +from ..vocab import Vocab +from ..util import SimpleFrozenList +from .. import util + + +MatcherPatternType = List[Dict[Union[int, str], Any]] +AttributeRulerPatternType = Dict[str, Union[MatcherPatternType, Dict, int]] +TagMapType = Dict[str, Dict[Union[int, str], Union[int, str]]] +MorphRulesType = Dict[str, Dict[str, Dict[Union[int, str], Union[int, str]]]] + + +@Language.factory("attribute_ruler", default_config={"validate": False}) +def make_attribute_ruler(nlp: Language, name: str, validate: bool): + return AttributeRuler(nlp.vocab, name, validate=validate) + + +class AttributeRuler(Pipe): + """Set token-level attributes for tokens matched by Matcher patterns. + Additionally supports importing patterns from tag maps and morph rules. + + DOCS: https://nightly.spacy.io/api/attributeruler + """ + + def __init__( + self, vocab: Vocab, name: str = "attribute_ruler", *, validate: bool = False + ) -> None: + """Create the AttributeRuler. After creation, you can add patterns + with the `.initialize()` or `.add_patterns()` methods, or load patterns + with `.from_bytes()` or `.from_disk()`. Loading patterns will remove + any patterns you've added previously. + + vocab (Vocab): The vocab. + name (str): The pipe name. Defaults to "attribute_ruler". + + RETURNS (AttributeRuler): The AttributeRuler component. 
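A minimal usage sketch of the pattern-adding API described in this docstring; the token text and attribute values below are illustrative assumptions, only the shape of the pattern dicts follows the `add()`/`add_patterns()` signatures in this file:

```python
from spacy.lang.en import English

nlp = English()
# "attribute_ruler" is the factory name registered in this module.
ruler = nlp.add_pipe("attribute_ruler")
# Each entry bundles Matcher patterns, the attrs to set, and an optional
# index of the token within the matched span.
ruler.add_patterns([
    {
        "patterns": [[{"ORTH": "cant"}]],                     # hypothetical token
        "attrs": {"LEMMA": "can", "MORPH": "VerbForm=Fin"},   # hypothetical attrs
        "index": 0,
    },
])
doc = nlp("I cant see")
```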
+ + DOCS: https://nightly.spacy.io/api/attributeruler#init + """ + self.name = name + self.vocab = vocab + self.matcher = Matcher(self.vocab, validate=validate) + self.validate = validate + self.attrs = [] + self._attrs_unnormed = [] # store for reference + self.indices = [] + + def clear(self) -> None: + """Reset all patterns.""" + self.matcher = Matcher(self.vocab, validate=self.validate) + self.attrs = [] + self._attrs_unnormed = [] + self.indices = [] + + def initialize( + self, + get_examples: Optional[Callable[[], Iterable[Example]]], + *, + nlp: Optional[Language] = None, + patterns: Optional[Iterable[AttributeRulerPatternType]] = None, + tag_map: Optional[TagMapType] = None, + morph_rules: Optional[MorphRulesType] = None, + ) -> None: + """Initialize the attribute ruler by adding zero or more patterns. + + Rules can be specified as a sequence of dicts using the `patterns` + keyword argument. You can also provide rules using the "tag map" or + "morph rules" formats supported by spaCy prior to v3. + """ + self.clear() + if patterns: + self.add_patterns(patterns) + if tag_map: + self.load_from_tag_map(tag_map) + if morph_rules: + self.load_from_morph_rules(morph_rules) + + def __call__(self, doc: Doc) -> Doc: + """Apply the AttributeRuler to a Doc and set all attribute exceptions. + + doc (Doc): The document to process. + RETURNS (Doc): The processed Doc. + + DOCS: https://nightly.spacy.io/api/attributeruler#call + """ + matches = self.matcher(doc, allow_missing=True) + # Sort by the attribute ID, so that later rules have precendence + matches = [ + (int(self.vocab.strings[m_id]), m_id, s, e) for m_id, s, e in matches + ] + matches.sort() + for attr_id, match_id, start, end in matches: + span = Span(doc, start, end, label=match_id) + attrs = self.attrs[attr_id] + index = self.indices[attr_id] + try: + # The index can be negative, which makes it annoying to do + # the boundscheck. Let Span do it instead. + token = span[index] # noqa: F841 + except IndexError: + # The original exception is just our conditional logic, so we + # raise from. + raise ValueError( + Errors.E1001.format( + patterns=self.matcher.get(span.label), + span=[t.text for t in span], + index=index, + ) + ) from None + set_token_attrs(span[index], attrs) + return doc + + def load_from_tag_map( + self, tag_map: Dict[str, Dict[Union[int, str], Union[int, str]]] + ) -> None: + """Load attribute ruler patterns from a tag map. + + tag_map (dict): The tag map that maps fine-grained tags to + coarse-grained tags and morphological features. + + DOCS: https://nightly.spacy.io/api/attributeruler#load_from_morph_rules + """ + for tag, attrs in tag_map.items(): + pattern = [{"TAG": tag}] + attrs, morph_attrs = _split_morph_attrs(attrs) + if "MORPH" not in attrs: + morph = self.vocab.morphology.add(morph_attrs) + attrs["MORPH"] = self.vocab.strings[morph] + else: + morph = self.vocab.morphology.add(attrs["MORPH"]) + attrs["MORPH"] = self.vocab.strings[morph] + self.add([pattern], attrs) + + def load_from_morph_rules( + self, morph_rules: Dict[str, Dict[str, Dict[Union[int, str], Union[int, str]]]] + ) -> None: + """Load attribute ruler patterns from morph rules. + + morph_rules (dict): The morph rules that map token text and + fine-grained tags to coarse-grained tags, lemmas and morphological + features. 
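A small sketch of the two legacy formats these loaders accept. The tags, words and feature values are made-up examples; only the nesting mirrors `load_from_tag_map()` and `load_from_morph_rules()` above:

```python
# Tag map: fine-grained tag -> token-level attrs plus MORPH features.
tag_map = {
    "NNS": {"POS": "NOUN", "Number": "plur"},
}

# Morph rules: fine-grained tag -> exact token text -> attrs.
morph_rules = {
    "VBZ": {
        "is": {"LEMMA": "be", "POS": "AUX", "VerbForm": "Fin"},
    },
}

# Assuming `ruler` is an AttributeRuler added to `nlp` as shown earlier.
ruler.initialize(lambda: [], nlp=nlp, tag_map=tag_map, morph_rules=morph_rules)
```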
+ + DOCS: https://nightly.spacy.io/api/attributeruler#load_from_morph_rules + """ + for tag in morph_rules: + for word in morph_rules[tag]: + pattern = [{"ORTH": word, "TAG": tag}] + attrs = morph_rules[tag][word] + attrs, morph_attrs = _split_morph_attrs(attrs) + if "MORPH" in attrs: + morph = self.vocab.morphology.add(attrs["MORPH"]) + attrs["MORPH"] = self.vocab.strings[morph] + elif morph_attrs: + morph = self.vocab.morphology.add(morph_attrs) + attrs["MORPH"] = self.vocab.strings[morph] + self.add([pattern], attrs) + + def add( + self, patterns: Iterable[MatcherPatternType], attrs: Dict, index: int = 0 + ) -> None: + """Add Matcher patterns for tokens that should be modified with the + provided attributes. The token at the specified index within the + matched span will be assigned the attributes. + + patterns (Iterable[List[Dict]]): A list of Matcher patterns. + attrs (Dict): The attributes to assign to the target token in the + matched span. + index (int): The index of the token in the matched span to modify. May + be negative to index from the end of the span. Defaults to 0. + + DOCS: https://nightly.spacy.io/api/attributeruler#add + """ + # We need to make a string here, because otherwise the ID we pass back + # will be interpreted as the hash of a string, rather than an ordinal. + key = str(len(self.attrs)) + self.matcher.add(self.vocab.strings.add(key), patterns) + self._attrs_unnormed.append(attrs) + attrs = normalize_token_attrs(self.vocab, attrs) + self.attrs.append(attrs) + self.indices.append(index) + + def add_patterns(self, patterns: Iterable[AttributeRulerPatternType]) -> None: + """Add patterns from a list of pattern dicts with the keys as the + arguments to AttributeRuler.add. + patterns (Iterable[dict]): A list of pattern dicts with the keys + as the arguments to AttributeRuler.add (patterns/attrs/index) to + add as patterns. + + DOCS: https://nightly.spacy.io/api/attributeruler#add_patterns + """ + for p in patterns: + self.add(**p) + + @property + def patterns(self) -> List[AttributeRulerPatternType]: + """All the added patterns.""" + all_patterns = [] + for i in range(len(self.attrs)): + p = {} + p["patterns"] = self.matcher.get(str(i))[1] + p["attrs"] = self._attrs_unnormed[i] + p["index"] = self.indices[i] + all_patterns.append(p) + return all_patterns + + def score(self, examples: Iterable[Example], **kwargs) -> Dict[str, Any]: + """Score a batch of examples. + + examples (Iterable[Example]): The examples to score. + RETURNS (Dict[str, Any]): The scores, produced by + Scorer.score_token_attr for the attributes "tag", "pos", "morph" + and "lemma" for the target token attributes. + + DOCS: https://nightly.spacy.io/api/tagger#score + """ + validate_examples(examples, "AttributeRuler.score") + results = {} + attrs = set() + for token_attrs in self.attrs: + attrs.update(token_attrs) + for attr in attrs: + if attr == TAG: + results.update(Scorer.score_token_attr(examples, "tag", **kwargs)) + elif attr == POS: + results.update(Scorer.score_token_attr(examples, "pos", **kwargs)) + elif attr == MORPH: + results.update(Scorer.score_token_attr(examples, "morph", **kwargs)) + elif attr == LEMMA: + results.update(Scorer.score_token_attr(examples, "lemma", **kwargs)) + return results + + def to_bytes(self, exclude: Iterable[str] = SimpleFrozenList()) -> bytes: + """Serialize the AttributeRuler to a bytestring. + + exclude (Iterable[str]): String names of serialization fields to exclude. + RETURNS (bytes): The serialized object. 
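A round-trip sketch for the byte serialization described here, assuming `ruler` is an AttributeRuler that already has patterns added and `nlp` is the pipeline it belongs to:

```python
from spacy.pipeline.attributeruler import AttributeRuler

bytes_data = ruler.to_bytes()
# Restoring into a fresh instance re-adds the stored patterns via add_patterns().
restored = AttributeRuler(nlp.vocab).from_bytes(bytes_data)
print(len(restored.patterns))  # same number of pattern dicts as the original
```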
+ + DOCS: https://nightly.spacy.io/api/attributeruler#to_bytes + """ + serialize = {} + serialize["vocab"] = self.vocab.to_bytes + serialize["patterns"] = lambda: srsly.msgpack_dumps(self.patterns) + return util.to_bytes(serialize, exclude) + + def from_bytes( + self, bytes_data: bytes, exclude: Iterable[str] = SimpleFrozenList() + ) -> "AttributeRuler": + """Load the AttributeRuler from a bytestring. + + bytes_data (bytes): The data to load. + exclude (Iterable[str]): String names of serialization fields to exclude. + returns (AttributeRuler): The loaded object. + + DOCS: https://nightly.spacy.io/api/attributeruler#from_bytes + """ + + def load_patterns(b): + self.add_patterns(srsly.msgpack_loads(b)) + + deserialize = { + "vocab": lambda b: self.vocab.from_bytes(b), + "patterns": load_patterns, + } + util.from_bytes(bytes_data, deserialize, exclude) + return self + + def to_disk( + self, path: Union[Path, str], exclude: Iterable[str] = SimpleFrozenList() + ) -> None: + """Serialize the AttributeRuler to disk. + + path (Union[Path, str]): A path to a directory. + exclude (Iterable[str]): String names of serialization fields to exclude. + + DOCS: https://nightly.spacy.io/api/attributeruler#to_disk + """ + serialize = { + "vocab": lambda p: self.vocab.to_disk(p), + "patterns": lambda p: srsly.write_msgpack(p, self.patterns), + } + util.to_disk(path, serialize, exclude) + + def from_disk( + self, path: Union[Path, str], exclude: Iterable[str] = SimpleFrozenList() + ) -> "AttributeRuler": + """Load the AttributeRuler from disk. + + path (Union[Path, str]): A path to a directory. + exclude (Iterable[str]): String names of serialization fields to exclude. + RETURNS (AttributeRuler): The loaded object. + + DOCS: https://nightly.spacy.io/api/attributeruler#from_disk + """ + + def load_patterns(p): + self.add_patterns(srsly.read_msgpack(p)) + + deserialize = { + "vocab": lambda p: self.vocab.from_disk(p), + "patterns": load_patterns, + } + util.from_disk(path, deserialize, exclude) + return self + + +def _split_morph_attrs(attrs: dict) -> Tuple[dict, dict]: + """Split entries from a tag map or morph rules dict into to two dicts, one + with the token-level features (POS, LEMMA) and one with the remaining + features, which are presumed to be individual MORPH features.""" + other_attrs = {} + morph_attrs = {} + for k, v in attrs.items(): + if k in "_" or k in IDS.keys() or k in IDS.values(): + other_attrs[k] = v + else: + morph_attrs[k] = v + return other_attrs, morph_attrs diff --git a/spacy/pipeline/dep_parser.pyx b/spacy/pipeline/dep_parser.pyx new file mode 100644 index 000000000..bdef332cc --- /dev/null +++ b/spacy/pipeline/dep_parser.pyx @@ -0,0 +1,169 @@ +# cython: infer_types=True, profile=True, binding=True +from typing import Optional, Iterable +from thinc.api import Model, Config + +from .transition_parser cimport Parser +from ._parser_internals.arc_eager cimport ArcEager + +from .functions import merge_subtokens +from ..language import Language +from ._parser_internals import nonproj +from ..scorer import Scorer +from ..training import validate_examples + + +default_model_config = """ +[model] +@architectures = "spacy.TransitionBasedParser.v1" +state_type = "parser" +extra_state_tokens = false +hidden_width = 64 +maxout_pieces = 2 + +[model.tok2vec] +@architectures = "spacy.HashEmbedCNN.v1" +pretrained_vectors = null +width = 96 +depth = 4 +embed_size = 2000 +window_size = 1 +maxout_pieces = 3 +subword_features = true +""" +DEFAULT_PARSER_MODEL = 
Config().from_str(default_model_config)["model"] + + +@Language.factory( + "parser", + assigns=["token.dep", "token.head", "token.is_sent_start", "doc.sents"], + default_config={ + "moves": None, + "update_with_oracle_cut_size": 100, + "learn_tokens": False, + "min_action_freq": 30, + "model": DEFAULT_PARSER_MODEL, + }, + default_score_weights={ + "dep_uas": 0.5, + "dep_las": 0.5, + "dep_las_per_type": None, + "sents_p": None, + "sents_r": None, + "sents_f": 0.0, + }, +) +def make_parser( + nlp: Language, + name: str, + model: Model, + moves: Optional[list], + update_with_oracle_cut_size: int, + learn_tokens: bool, + min_action_freq: int +): + """Create a transition-based DependencyParser component. The dependency parser + jointly learns sentence segmentation and labelled dependency parsing, and can + optionally learn to merge tokens that had been over-segmented by the tokenizer. + + The parser uses a variant of the non-monotonic arc-eager transition-system + described by Honnibal and Johnson (2014), with the addition of a "break" + transition to perform the sentence segmentation. Nivre's pseudo-projective + dependency transformation is used to allow the parser to predict + non-projective parses. + + The parser is trained using an imitation learning objective. The parser follows + the actions predicted by the current weights, and at each state, determines + which actions are compatible with the optimal parse that could be reached + from the current state. The weights such that the scores assigned to the + set of optimal actions is increased, while scores assigned to other + actions are decreased. Note that more than one action may be optimal for + a given state. + + model (Model): The model for the transition-based parser. The model needs + to have a specific substructure of named components --- see the + spacy.ml.tb_framework.TransitionModel for details. + moves (List[str]): A list of transition names. Inferred from the data if not + provided. + update_with_oracle_cut_size (int): + During training, cut long sequences into shorter segments by creating + intermediate states based on the gold-standard history. The model is + not very sensitive to this parameter, so you usually won't need to change + it. 100 is a good default. + learn_tokens (bool): Whether to learn to merge subtokens that are split + relative to the gold standard. Experimental. + min_action_freq (int): The minimum frequency of labelled actions to retain. + Rarer labelled actions have their label backed-off to "dep". While this + primarily affects the label accuracy, it can also affect the attachment + structure, as the labels are used to represent the pseudo-projectivity + transformation. + """ + return DependencyParser( + nlp.vocab, + model, + name, + moves=moves, + update_with_oracle_cut_size=update_with_oracle_cut_size, + multitasks=[], + learn_tokens=learn_tokens, + min_action_freq=min_action_freq + ) + + +cdef class DependencyParser(Parser): + """Pipeline component for dependency parsing. + + DOCS: https://nightly.spacy.io/api/dependencyparser + """ + TransitionSystem = ArcEager + + @property + def postprocesses(self): + output = [nonproj.deprojectivize] + if self.cfg.get("learn_tokens") is True: + output.append(merge_subtokens) + return tuple(output) + + def add_multitask_objective(self, mt_component): + self._multitasks.append(mt_component) + + def init_multitask_objectives(self, get_examples, nlp=None, **cfg): + # TODO: transfer self.model.get_ref("tok2vec") to the multitask's model ? 
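The factory above exposes the arc-eager parser as a regular pipeline component; a short sketch of adding it with a couple of the config keys from the decorator (the values shown simply repeat the defaults and are chosen only for illustration):

```python
import spacy

nlp = spacy.blank("en")
# Keys mirror the default_config of the "parser" factory above.
parser = nlp.add_pipe("parser", config={
    "learn_tokens": False,    # don't learn to merge over-segmented subtokens
    "min_action_freq": 30,    # back off rare labelled actions to "dep"
})
```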
+ for labeller in self._multitasks: + labeller.model.set_dim("nO", len(self.labels)) + if labeller.model.has_ref("output_layer"): + labeller.model.get_ref("output_layer").set_dim("nO", len(self.labels)) + labeller.initialize(get_examples, nlp=nlp) + + @property + def labels(self): + labels = set() + # Get the labels from the model by looking at the available moves + for move in self.move_names: + if "-" in move: + label = move.split("-")[1] + if "||" in label: + label = label.split("||")[1] + labels.add(label) + return tuple(sorted(labels)) + + def score(self, examples, **kwargs): + """Score a batch of examples. + + examples (Iterable[Example]): The examples to score. + RETURNS (Dict[str, Any]): The scores, produced by Scorer.score_spans + and Scorer.score_deps. + + DOCS: https://nightly.spacy.io/api/dependencyparser#score + """ + validate_examples(examples, "DependencyParser.score") + def dep_getter(token, attr): + dep = getattr(token, attr) + dep = token.vocab.strings.as_string(dep).lower() + return dep + results = {} + results.update(Scorer.score_spans(examples, "sents", **kwargs)) + kwargs.setdefault("getter", dep_getter) + kwargs.setdefault("ignore_labels", ("p", "punct")) + results.update(Scorer.score_deps(examples, "dep", **kwargs)) + del results["sents_per_type"] + return results diff --git a/spacy/pipeline/entity_linker.py b/spacy/pipeline/entity_linker.py new file mode 100644 index 000000000..3bb449b4d --- /dev/null +++ b/spacy/pipeline/entity_linker.py @@ -0,0 +1,491 @@ +from itertools import islice +from typing import Optional, Iterable, Callable, Dict, Iterator, Union, List +from pathlib import Path +import srsly +import random +from thinc.api import CosineDistance, Model, Optimizer, Config +from thinc.api import set_dropout_rate +import warnings + +from ..kb import KnowledgeBase, Candidate +from ..ml import empty_kb +from ..tokens import Doc +from .pipe import deserialize_config +from .trainable_pipe import TrainablePipe +from ..language import Language +from ..vocab import Vocab +from ..training import Example, validate_examples, validate_get_examples +from ..errors import Errors, Warnings +from ..util import SimpleFrozenList +from .. import util +from ..scorer import Scorer + + +default_model_config = """ +[model] +@architectures = "spacy.EntityLinker.v1" + +[model.tok2vec] +@architectures = "spacy.HashEmbedCNN.v1" +pretrained_vectors = null +width = 96 +depth = 2 +embed_size = 300 +window_size = 1 +maxout_pieces = 3 +subword_features = true +""" +DEFAULT_NEL_MODEL = Config().from_str(default_model_config)["model"] + + +@Language.factory( + "entity_linker", + requires=["doc.ents", "doc.sents", "token.ent_iob", "token.ent_type"], + assigns=["token.ent_kb_id"], + default_config={ + "model": DEFAULT_NEL_MODEL, + "labels_discard": [], + "incl_prior": True, + "incl_context": True, + "entity_vector_length": 64, + "get_candidates": {"@misc": "spacy.CandidateGenerator.v1"}, + }, + default_score_weights={ + "nel_micro_f": 1.0, + "nel_micro_r": None, + "nel_micro_p": None, + }, +) +def make_entity_linker( + nlp: Language, + name: str, + model: Model, + *, + labels_discard: Iterable[str], + incl_prior: bool, + incl_context: bool, + entity_vector_length: int, + get_candidates: Callable[[KnowledgeBase, "Span"], Iterable[Candidate]], +): + """Construct an EntityLinker component. + + model (Model[List[Doc], Floats2d]): A model that learns document vector + representations. Given a batch of Doc objects, it should return a single + array, with one row per item in the batch. 
+ labels_discard (Iterable[str]): NER labels that will automatically get a "NIL" prediction. + incl_prior (bool): Whether or not to include prior probabilities from the KB in the model. + incl_context (bool): Whether or not to include the local context in the model. + entity_vector_length (int): Size of encoding vectors in the KB. + get_candidates (Callable[[KnowledgeBase, "Span"], Iterable[Candidate]]): Function that + produces a list of candidates, given a certain knowledge base and a textual mention. + """ + return EntityLinker( + nlp.vocab, + model, + name, + labels_discard=labels_discard, + incl_prior=incl_prior, + incl_context=incl_context, + entity_vector_length=entity_vector_length, + get_candidates=get_candidates, + ) + + +class EntityLinker(TrainablePipe): + """Pipeline component for named entity linking. + + DOCS: https://nightly.spacy.io/api/entitylinker + """ + + NIL = "NIL" # string used to refer to a non-existing link + + def __init__( + self, + vocab: Vocab, + model: Model, + name: str = "entity_linker", + *, + labels_discard: Iterable[str], + incl_prior: bool, + incl_context: bool, + entity_vector_length: int, + get_candidates: Callable[[KnowledgeBase, "Span"], Iterable[Candidate]], + ) -> None: + """Initialize an entity linker. + + vocab (Vocab): The shared vocabulary. + model (thinc.api.Model): The Thinc Model powering the pipeline component. + name (str): The component instance name, used to add entries to the + losses during training. + labels_discard (Iterable[str]): NER labels that will automatically get a "NIL" prediction. + incl_prior (bool): Whether or not to include prior probabilities from the KB in the model. + incl_context (bool): Whether or not to include the local context in the model. + entity_vector_length (int): Size of encoding vectors in the KB. + get_candidates (Callable[[KnowledgeBase, "Span"], Iterable[Candidate]]): Function that + produces a list of candidates, given a certain knowledge base and a textual mention. + + DOCS: https://nightly.spacy.io/api/entitylinker#init + """ + self.vocab = vocab + self.model = model + self.name = name + cfg = { + "labels_discard": list(labels_discard), + "incl_prior": incl_prior, + "incl_context": incl_context, + "entity_vector_length": entity_vector_length, + } + self.get_candidates = get_candidates + self.cfg = dict(cfg) + self.distance = CosineDistance(normalize=False) + # how many neightbour sentences to take into account + self.n_sents = cfg.get("n_sents", 0) + # create an empty KB by default. If you want to load a predefined one, specify it in 'initialize'. + self.kb = empty_kb(entity_vector_length)(self.vocab) + + def set_kb(self, kb_loader: Callable[[Vocab], KnowledgeBase]): + """Define the KB of this pipe by providing a function that will + create it using this object's vocab.""" + self.kb = kb_loader(self.vocab) + self.cfg["entity_vector_length"] = self.kb.entity_vector_length + + def validate_kb(self) -> None: + # Raise an error if the knowledge base is not initialized. + if len(self.kb) == 0: + raise ValueError(Errors.E139.format(name=self.name)) + + def initialize( + self, + get_examples: Callable[[], Iterable[Example]], + *, + nlp: Optional[Language] = None, + kb_loader: Callable[[Vocab], KnowledgeBase] = None, + ): + """Initialize the pipe for training, using a representative set + of data examples. + + get_examples (Callable[[], Iterable[Example]]): Function that + returns a representative sample of gold-standard Example objects. + nlp (Language): The current nlp object the component is part of. 
+ kb_loader (Callable[[Vocab], KnowledgeBase]): A function that creates a KnowledgeBase from a Vocab instance. + Note that providing this argument, will overwrite all data accumulated in the current KB. + Use this only when loading a KB as-such from file. + + DOCS: https://nightly.spacy.io/api/entitylinker#initialize + """ + validate_get_examples(get_examples, "EntityLinker.initialize") + if kb_loader is not None: + self.set_kb(kb_loader) + self.validate_kb() + nO = self.kb.entity_vector_length + doc_sample = [] + vector_sample = [] + for example in islice(get_examples(), 10): + doc_sample.append(example.x) + vector_sample.append(self.model.ops.alloc1f(nO)) + assert len(doc_sample) > 0, Errors.E923.format(name=self.name) + assert len(vector_sample) > 0, Errors.E923.format(name=self.name) + self.model.initialize( + X=doc_sample, Y=self.model.ops.asarray(vector_sample, dtype="float32") + ) + + def update( + self, + examples: Iterable[Example], + *, + set_annotations: bool = False, + drop: float = 0.0, + sgd: Optional[Optimizer] = None, + losses: Optional[Dict[str, float]] = None, + ) -> Dict[str, float]: + """Learn from a batch of documents and gold-standard information, + updating the pipe's model. Delegates to predict and get_loss. + + examples (Iterable[Example]): A batch of Example objects. + drop (float): The dropout rate. + set_annotations (bool): Whether or not to update the Example objects + with the predictions. + sgd (thinc.api.Optimizer): The optimizer. + losses (Dict[str, float]): Optional record of the loss during training. + Updated using the component name as the key. + RETURNS (Dict[str, float]): The updated losses dictionary. + + DOCS: https://nightly.spacy.io/api/entitylinker#update + """ + self.validate_kb() + if losses is None: + losses = {} + losses.setdefault(self.name, 0.0) + if not examples: + return losses + validate_examples(examples, "EntityLinker.update") + sentence_docs = [] + docs = [eg.predicted for eg in examples] + if set_annotations: + # This seems simpler than other ways to get that exact output -- but + # it does run the model twice :( + predictions = self.model.predict(docs) + for eg in examples: + sentences = [s for s in eg.reference.sents] + kb_ids = eg.get_aligned("ENT_KB_ID", as_string=True) + for ent in eg.reference.ents: + # KB ID of the first token is the same as the whole span + kb_id = kb_ids[ent.start] + if kb_id: + try: + # find the sentence in the list of sentences. 
+ sent_index = sentences.index(ent.sent) + except AttributeError: + # Catch the exception when ent.sent is None and provide a user-friendly warning + raise RuntimeError(Errors.E030) from None + # get n previous sentences, if there are any + start_sentence = max(0, sent_index - self.n_sents) + # get n posterior sentences, or as many < n as there are + end_sentence = min(len(sentences) - 1, sent_index + self.n_sents) + # get token positions + start_token = sentences[start_sentence].start + end_token = sentences[end_sentence].end + # append that span as a doc to training + sent_doc = eg.predicted[start_token:end_token].as_doc() + sentence_docs.append(sent_doc) + set_dropout_rate(self.model, drop) + if not sentence_docs: + warnings.warn(Warnings.W093.format(name="Entity Linker")) + return losses + sentence_encodings, bp_context = self.model.begin_update(sentence_docs) + loss, d_scores = self.get_loss( + sentence_encodings=sentence_encodings, examples=examples + ) + bp_context(d_scores) + if sgd is not None: + self.finish_update(sgd) + losses[self.name] += loss + if set_annotations: + self.set_annotations(docs, predictions) + return losses + + def get_loss(self, examples: Iterable[Example], sentence_encodings): + validate_examples(examples, "EntityLinker.get_loss") + entity_encodings = [] + for eg in examples: + kb_ids = eg.get_aligned("ENT_KB_ID", as_string=True) + for ent in eg.reference.ents: + kb_id = kb_ids[ent.start] + if kb_id: + entity_encoding = self.kb.get_vector(kb_id) + entity_encodings.append(entity_encoding) + entity_encodings = self.model.ops.asarray(entity_encodings, dtype="float32") + if sentence_encodings.shape != entity_encodings.shape: + err = Errors.E147.format( + method="get_loss", msg="gold entities do not match up" + ) + raise RuntimeError(err) + gradients = self.distance.get_grad(sentence_encodings, entity_encodings) + loss = self.distance.get_loss(sentence_encodings, entity_encodings) + loss = loss / len(entity_encodings) + return loss, gradients + + def __call__(self, doc: Doc) -> Doc: + """Apply the pipe to a Doc. + + doc (Doc): The document to process. + RETURNS (Doc): The processed Doc. + + DOCS: https://nightly.spacy.io/api/entitylinker#call + """ + kb_ids = self.predict([doc]) + self.set_annotations([doc], kb_ids) + return doc + + def pipe(self, stream: Iterable[Doc], *, batch_size: int = 128) -> Iterator[Doc]: + """Apply the pipe to a stream of documents. This usually happens under + the hood when the nlp object is called on a text and all components are + applied to the Doc. + + stream (Iterable[Doc]): A stream of documents. + batch_size (int): The number of documents to buffer. + YIELDS (Doc): Processed documents in order. + + DOCS: https://nightly.spacy.io/api/entitylinker#pipe + """ + for docs in util.minibatch(stream, size=batch_size): + kb_ids = self.predict(docs) + self.set_annotations(docs, kb_ids) + yield from docs + + def predict(self, docs: Iterable[Doc]) -> List[str]: + """Apply the pipeline's model to a batch of docs, without modifying them. + Returns the KB IDs for each entity in each doc, including NIL if there is + no prediction. + + docs (Iterable[Doc]): The documents to predict. + RETURNS (List[int]): The models prediction for each document. 
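A setup sketch for the kb_loader-based initialization described above, using a tiny hand-built knowledge base. The entity ID, alias, texts and vector values are invented for illustration; the calls assume the v3 `KnowledgeBase` and `Example` APIs:

```python
import spacy
from spacy.kb import KnowledgeBase
from spacy.training import Example

nlp = spacy.blank("en")
entity_linker = nlp.add_pipe("entity_linker", config={"entity_vector_length": 64})

def kb_loader(vocab):
    # Build the KB from the shared vocab, as expected by initialize().
    kb = KnowledgeBase(vocab, entity_vector_length=64)
    kb.add_entity(entity="Q1", freq=10, entity_vector=[0.0] * 64)
    kb.add_alias(alias="Example", entities=["Q1"], probabilities=[1.0])
    return kb

doc = nlp.make_doc("Example is here")
example = Example.from_dict(
    doc,
    {"entities": [(0, 7, "ORG")], "links": {(0, 7): {"Q1": 1.0}}},
)
entity_linker.initialize(lambda: [example], kb_loader=kb_loader)
```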
+ + DOCS: https://nightly.spacy.io/api/entitylinker#predict + """ + self.validate_kb() + entity_count = 0 + final_kb_ids = [] + if not docs: + return final_kb_ids + if isinstance(docs, Doc): + docs = [docs] + for i, doc in enumerate(docs): + sentences = [s for s in doc.sents] + if len(doc) > 0: + # Looping through each sentence and each entity + # This may go wrong if there are entities across sentences - which shouldn't happen normally. + for sent_index, sent in enumerate(sentences): + if sent.ents: + # get n_neightbour sentences, clipped to the length of the document + start_sentence = max(0, sent_index - self.n_sents) + end_sentence = min( + len(sentences) - 1, sent_index + self.n_sents + ) + start_token = sentences[start_sentence].start + end_token = sentences[end_sentence].end + sent_doc = doc[start_token:end_token].as_doc() + # currently, the context is the same for each entity in a sentence (should be refined) + xp = self.model.ops.xp + if self.cfg.get("incl_context"): + sentence_encoding = self.model.predict([sent_doc])[0] + sentence_encoding_t = sentence_encoding.T + sentence_norm = xp.linalg.norm(sentence_encoding_t) + for ent in sent.ents: + entity_count += 1 + to_discard = self.cfg.get("labels_discard", []) + if to_discard and ent.label_ in to_discard: + # ignoring this entity - setting to NIL + final_kb_ids.append(self.NIL) + else: + candidates = self.get_candidates(self.kb, ent) + if not candidates: + # no prediction possible for this entity - setting to NIL + final_kb_ids.append(self.NIL) + elif len(candidates) == 1: + # shortcut for efficiency reasons: take the 1 candidate + # TODO: thresholding + final_kb_ids.append(candidates[0].entity_) + else: + random.shuffle(candidates) + # set all prior probabilities to 0 if incl_prior=False + prior_probs = xp.asarray( + [c.prior_prob for c in candidates] + ) + if not self.cfg.get("incl_prior"): + prior_probs = xp.asarray( + [0.0 for _ in candidates] + ) + scores = prior_probs + # add in similarity from the context + if self.cfg.get("incl_context"): + entity_encodings = xp.asarray( + [c.entity_vector for c in candidates] + ) + entity_norm = xp.linalg.norm( + entity_encodings, axis=1 + ) + if len(entity_encodings) != len(prior_probs): + raise RuntimeError( + Errors.E147.format( + method="predict", + msg="vectors not of equal length", + ) + ) + # cosine similarity + sims = xp.dot( + entity_encodings, sentence_encoding_t + ) / (sentence_norm * entity_norm) + if sims.shape != prior_probs.shape: + raise ValueError(Errors.E161) + scores = ( + prior_probs + sims - (prior_probs * sims) + ) + # TODO: thresholding + best_index = scores.argmax().item() + best_candidate = candidates[best_index] + final_kb_ids.append(best_candidate.entity_) + if not (len(final_kb_ids) == entity_count): + err = Errors.E147.format( + method="predict", msg="result variables not of equal length" + ) + raise RuntimeError(err) + return final_kb_ids + + def set_annotations(self, docs: Iterable[Doc], kb_ids: List[str]) -> None: + """Modify a batch of documents, using pre-computed scores. + + docs (Iterable[Doc]): The documents to modify. + kb_ids (List[str]): The IDs to set, produced by EntityLinker.predict. 
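+
+        Usage sketch, continuing the predict example above (assumes `kb_ids`
+        was produced by EntityLinker.predict on the same batch of docs):
+
+            entity_linker.set_annotations([doc], kb_ids)
+            linked = [(ent.text, ent.kb_id_) for ent in doc.ents]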
+ + DOCS: https://nightly.spacy.io/api/entitylinker#set_annotations + """ + count_ents = len([ent for doc in docs for ent in doc.ents]) + if count_ents != len(kb_ids): + raise ValueError(Errors.E148.format(ents=count_ents, ids=len(kb_ids))) + i = 0 + for doc in docs: + for ent in doc.ents: + kb_id = kb_ids[i] + i += 1 + for token in ent: + token.ent_kb_id_ = kb_id + + def score(self, examples, **kwargs): + """Score a batch of examples. + + examples (Iterable[Example]): The examples to score. + RETURNS (Dict[str, Any]): The scores. + + DOCS TODO: https://nightly.spacy.io/api/entity_linker#score + """ + validate_examples(examples, "EntityLinker.score") + return Scorer.score_links(examples, negative_labels=[self.NIL]) + + def to_disk( + self, path: Union[str, Path], *, exclude: Iterable[str] = SimpleFrozenList() + ) -> None: + """Serialize the pipe to disk. + + path (str / Path): Path to a directory. + exclude (Iterable[str]): String names of serialization fields to exclude. + + DOCS: https://nightly.spacy.io/api/entitylinker#to_disk + """ + serialize = {} + serialize["vocab"] = lambda p: self.vocab.to_disk(p) + serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg) + serialize["kb"] = lambda p: self.kb.to_disk(p) + serialize["model"] = lambda p: self.model.to_disk(p) + util.to_disk(path, serialize, exclude) + + def from_disk( + self, path: Union[str, Path], *, exclude: Iterable[str] = SimpleFrozenList() + ) -> "EntityLinker": + """Load the pipe from disk. Modifies the object in place and returns it. + + path (str / Path): Path to a directory. + exclude (Iterable[str]): String names of serialization fields to exclude. + RETURNS (EntityLinker): The modified EntityLinker object. + + DOCS: https://nightly.spacy.io/api/entitylinker#from_disk + """ + + def load_model(p): + try: + self.model.from_bytes(p.open("rb").read()) + except AttributeError: + raise ValueError(Errors.E149) from None + + deserialize = {} + deserialize["cfg"] = lambda p: self.cfg.update(deserialize_config(p)) + deserialize["kb"] = lambda p: self.kb.from_disk(p) + deserialize["model"] = load_model + util.from_disk(path, deserialize, exclude) + return self + + def rehearse(self, examples, *, sgd=None, losses=None, **config): + raise NotImplementedError + + def add_label(self, label): + raise NotImplementedError diff --git a/spacy/pipeline/entityruler.py b/spacy/pipeline/entityruler.py index 2abff62f1..382ca338d 100644 --- a/spacy/pipeline/entityruler.py +++ b/spacy/pipeline/entityruler.py @@ -1,55 +1,104 @@ -# coding: utf8 -from __future__ import unicode_literals - -from collections import defaultdict, OrderedDict +from typing import Optional, Union, List, Dict, Tuple, Iterable, Any, Callable, Sequence +from collections import defaultdict +from pathlib import Path import srsly -from ..language import component +from .pipe import Pipe +from ..training import Example +from ..language import Language from ..errors import Errors -from ..compat import basestring_ -from ..util import ensure_path, to_disk, from_disk +from ..util import ensure_path, to_disk, from_disk, SimpleFrozenList from ..tokens import Doc, Span from ..matcher import Matcher, PhraseMatcher +from ..scorer import Scorer +from ..training import validate_examples + DEFAULT_ENT_ID_SEP = "||" +PatternType = Dict[str, Union[str, List[Dict[str, Any]]]] -@component("entity_ruler", assigns=["doc.ents", "token.ent_type", "token.ent_iob"]) -class EntityRuler(object): +@Language.factory( + "entity_ruler", + assigns=["doc.ents", "token.ent_type", "token.ent_iob"], + 
default_config={ + "phrase_matcher_attr": None, + "validate": False, + "overwrite_ents": False, + "ent_id_sep": DEFAULT_ENT_ID_SEP, + }, + default_score_weights={ + "ents_f": 1.0, + "ents_p": 0.0, + "ents_r": 0.0, + "ents_per_type": None, + }, +) +def make_entity_ruler( + nlp: Language, + name: str, + phrase_matcher_attr: Optional[Union[int, str]], + validate: bool, + overwrite_ents: bool, + ent_id_sep: str, +): + return EntityRuler( + nlp, + name, + phrase_matcher_attr=phrase_matcher_attr, + validate=validate, + overwrite_ents=overwrite_ents, + ent_id_sep=ent_id_sep, + ) + + +class EntityRuler(Pipe): """The EntityRuler lets you add spans to the `Doc.ents` using token-based rules or exact phrase matches. It can be combined with the statistical `EntityRecognizer` to boost accuracy, or used on its own to implement a purely rule-based entity recognition system. After initialization, the component is typically added to the pipeline using `nlp.add_pipe`. - DOCS: https://spacy.io/api/entityruler - USAGE: https://spacy.io/usage/rule-based-matching#entityruler + DOCS: https://nightly.spacy.io/api/entityruler + USAGE: https://nightly.spacy.io/usage/rule-based-matching#entityruler """ - def __init__(self, nlp, phrase_matcher_attr=None, validate=False, **cfg): - """Initialize the entitiy ruler. If patterns are supplied here, they + def __init__( + self, + nlp: Language, + name: str = "entity_ruler", + *, + phrase_matcher_attr: Optional[Union[int, str]] = None, + validate: bool = False, + overwrite_ents: bool = False, + ent_id_sep: str = DEFAULT_ENT_ID_SEP, + patterns: Optional[List[PatternType]] = None, + ) -> None: + """Initialize the entity ruler. If patterns are supplied here, they need to be a list of dictionaries with a `"label"` and `"pattern"` key. A pattern can either be a token pattern (list) or a phrase pattern (string). For example: `{'label': 'ORG', 'pattern': 'Apple'}`. nlp (Language): The shared nlp object to pass the vocab to the matchers and process phrase patterns. - phrase_matcher_attr (int / unicode): Token attribute to match on, passed + name (str): Instance name of the current pipeline component. Typically + passed in automatically from the factory when the component is + added. Used to disable the current entity ruler while creating + phrase patterns with the nlp object. + phrase_matcher_attr (int / str): Token attribute to match on, passed to the internal PhraseMatcher as `attr` validate (bool): Whether patterns should be validated, passed to Matcher and PhraseMatcher as `validate` patterns (iterable): Optional patterns to load in. overwrite_ents (bool): If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. - **cfg: Other config parameters. If pipeline component is loaded as part - of a model pipeline, this will include all keyword arguments passed - to `spacy.load`. - RETURNS (EntityRuler): The newly constructed object. + ent_id_sep (str): Separator used internally for entity IDs. 
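+
+        A minimal usage sketch (assumes `nlp` is an existing pipeline object;
+        the patterns shown are purely illustrative):
+
+            ruler = nlp.add_pipe("entity_ruler")
+            ruler.add_patterns([
+                {"label": "ORG", "pattern": "Apple"},
+                {"label": "GPE", "pattern": [{"lower": "san"}, {"lower": "francisco"}]},
+            ])
+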
- DOCS: https://spacy.io/api/entityruler#init + DOCS: https://nightly.spacy.io/api/entityruler#init """ self.nlp = nlp - self.overwrite = cfg.get("overwrite_ents", False) + self.name = name + self.overwrite = overwrite_ents self.token_patterns = defaultdict(list) self.phrase_patterns = defaultdict(list) self.matcher = Matcher(nlp.vocab, validate=validate) @@ -63,33 +112,28 @@ class EntityRuler(object): else: self.phrase_matcher_attr = None self.phrase_matcher = PhraseMatcher(nlp.vocab, validate=validate) - self.ent_id_sep = cfg.get("ent_id_sep", DEFAULT_ENT_ID_SEP) + self.ent_id_sep = ent_id_sep self._ent_ids = defaultdict(dict) - patterns = cfg.get("patterns") if patterns is not None: self.add_patterns(patterns) - @classmethod - def from_nlp(cls, nlp, **cfg): - return cls(nlp, **cfg) - - def __len__(self): + def __len__(self) -> int: """The number of all patterns added to the entity ruler.""" n_token_patterns = sum(len(p) for p in self.token_patterns.values()) n_phrase_patterns = sum(len(p) for p in self.phrase_patterns.values()) return n_token_patterns + n_phrase_patterns - def __contains__(self, label): + def __contains__(self, label: str) -> bool: """Whether a label is present in the patterns.""" return label in self.token_patterns or label in self.phrase_patterns - def __call__(self, doc): + def __call__(self, doc: Doc) -> Doc: """Find matches in document and add them as entities. doc (Doc): The Doc object in the pipeline. RETURNS (Doc): The Doc with added entities, if available. - DOCS: https://spacy.io/api/entityruler#call + DOCS: https://nightly.spacy.io/api/entityruler#call """ matches = list(self.matcher(doc)) + list(self.phrase_matcher(doc)) matches = set( @@ -122,12 +166,12 @@ class EntityRuler(object): return doc @property - def labels(self): + def labels(self) -> Tuple[str, ...]: """All labels present in the match patterns. RETURNS (set): The string labels. - DOCS: https://spacy.io/api/entityruler#labels + DOCS: https://nightly.spacy.io/api/entityruler#labels """ keys = set(self.token_patterns.keys()) keys.update(self.phrase_patterns.keys()) @@ -141,13 +185,33 @@ class EntityRuler(object): all_labels.add(l) return tuple(all_labels) + def initialize( + self, + get_examples: Callable[[], Iterable[Example]], + *, + nlp: Optional[Language] = None, + patterns: Optional[Sequence[PatternType]] = None, + ): + """Initialize the pipe for training. + + get_examples (Callable[[], Iterable[Example]]): Function that + returns a representative sample of gold-standard Example objects. + nlp (Language): The current nlp object the component is part of. + patterns Optional[Iterable[PatternType]]: The list of patterns. + + DOCS: https://nightly.spacy.io/api/entityruler#initialize + """ + self.clear() + if patterns: + self.add_patterns(patterns) + @property - def ent_ids(self): + def ent_ids(self) -> Tuple[str, ...]: """All entity ids present in the match patterns `id` properties RETURNS (set): The string entity ids. - DOCS: https://spacy.io/api/entityruler#ent_ids + DOCS: https://nightly.spacy.io/api/entityruler#ent_ids """ keys = set(self.token_patterns.keys()) keys.update(self.phrase_patterns.keys()) @@ -160,12 +224,12 @@ class EntityRuler(object): return tuple(all_ent_ids) @property - def patterns(self): + def patterns(self) -> List[PatternType]: """Get all patterns that were added to the entity ruler. RETURNS (list): The original patterns, one dictionary per pattern. 
- DOCS: https://spacy.io/api/entityruler#patterns + DOCS: https://nightly.spacy.io/api/entityruler#patterns """ all_patterns = [] for label, patterns in self.token_patterns.items(): @@ -182,18 +246,17 @@ class EntityRuler(object): if ent_id: p["id"] = ent_id all_patterns.append(p) - return all_patterns - def add_patterns(self, patterns): - """Add patterns to the entitiy ruler. A pattern can either be a token + def add_patterns(self, patterns: List[PatternType]) -> None: + """Add patterns to the entity ruler. A pattern can either be a token pattern (list of dicts) or a phrase pattern (string). For example: {'label': 'ORG', 'pattern': 'Apple'} {'label': 'GPE', 'pattern': [{'lower': 'san'}, {'lower': 'francisco'}]} patterns (list): The patterns to add. - DOCS: https://spacy.io/api/entityruler#add_patterns + DOCS: https://nightly.spacy.io/api/entityruler#add_patterns """ # disable the nlp components after this one in case they hadn't been initialized / deserialised yet @@ -204,20 +267,18 @@ class EntityRuler(object): ] except ValueError: subsequent_pipes = [] - with self.nlp.disable_pipes(subsequent_pipes): + with self.nlp.select_pipes(disable=subsequent_pipes): token_patterns = [] phrase_pattern_labels = [] phrase_pattern_texts = [] phrase_pattern_ids = [] - for entry in patterns: - if isinstance(entry["pattern"], basestring_): + if isinstance(entry["pattern"], str): phrase_pattern_labels.append(entry["label"]) phrase_pattern_texts.append(entry["pattern"]) phrase_pattern_ids.append(entry.get("id")) elif isinstance(entry["pattern"], list): token_patterns.append(entry) - phrase_patterns = [] for label, pattern, ent_id in zip( phrase_pattern_labels, @@ -228,7 +289,6 @@ class EntityRuler(object): if ent_id: phrase_pattern["id"] = ent_id phrase_patterns.append(phrase_pattern) - for entry in token_patterns + phrase_patterns: label = entry["label"] if "id" in entry: @@ -236,7 +296,6 @@ class EntityRuler(object): label = self._create_label(label, entry["id"]) key = self.matcher._normalize_key(label) self._ent_ids[key] = (ent_label, entry["id"]) - pattern = entry["pattern"] if isinstance(pattern, Doc): self.phrase_patterns[label].append(pattern) @@ -249,11 +308,16 @@ class EntityRuler(object): for label, patterns in self.phrase_patterns.items(): self.phrase_matcher.add(label, patterns) - def _split_label(self, label): + def clear(self) -> None: + """Reset all patterns.""" + self.token_patterns = defaultdict(list) + self.phrase_patterns = defaultdict(list) + self._ent_ids = defaultdict(dict) + + def _split_label(self, label: str) -> Tuple[str, str]: """Split Entity label into ent_label and ent_id if it contains self.ent_id_sep label (str): The value of label in a pattern entry - RETURNS (tuple): ent_label, ent_id """ if self.ent_id_sep in label: @@ -261,32 +325,35 @@ class EntityRuler(object): else: ent_label = label ent_id = None - return ent_label, ent_id - def _create_label(self, label, ent_id): + def _create_label(self, label: str, ent_id: str) -> str: """Join Entity label with ent_id if the pattern has an `id` attribute label (str): The label to set for ent.label_ ent_id (str): The label - RETURNS (str): The ent_label joined with configured `ent_id_sep` """ - if isinstance(ent_id, basestring_): - label = "{}{}{}".format(label, self.ent_id_sep, ent_id) + if isinstance(ent_id, str): + label = f"{label}{self.ent_id_sep}{ent_id}" return label - def from_bytes(self, patterns_bytes, **kwargs): + def score(self, examples, **kwargs): + validate_examples(examples, "EntityRuler.score") + return 
Scorer.score_spans(examples, "ents", **kwargs) + + def from_bytes( + self, patterns_bytes: bytes, *, exclude: Iterable[str] = SimpleFrozenList() + ) -> "EntityRuler": """Load the entity ruler from a bytestring. patterns_bytes (bytes): The bytestring to load. - **kwargs: Other config paramters, mostly for consistency. - RETURNS (EntityRuler): The loaded entity ruler. - DOCS: https://spacy.io/api/entityruler#from_bytes + DOCS: https://nightly.spacy.io/api/entityruler#from_bytes """ cfg = srsly.msgpack_loads(patterns_bytes) + self.clear() if isinstance(cfg, dict): self.add_patterns(cfg.get("patterns", cfg)) self.overwrite = cfg.get("overwrite", False) @@ -300,36 +367,34 @@ class EntityRuler(object): self.add_patterns(cfg) return self - def to_bytes(self, **kwargs): + def to_bytes(self, *, exclude: Iterable[str] = SimpleFrozenList()) -> bytes: """Serialize the entity ruler patterns to a bytestring. RETURNS (bytes): The serialized patterns. - DOCS: https://spacy.io/api/entityruler#to_bytes + DOCS: https://nightly.spacy.io/api/entityruler#to_bytes """ - - serial = OrderedDict( - ( - ("overwrite", self.overwrite), - ("ent_id_sep", self.ent_id_sep), - ("phrase_matcher_attr", self.phrase_matcher_attr), - ("patterns", self.patterns), - ) - ) + serial = { + "overwrite": self.overwrite, + "ent_id_sep": self.ent_id_sep, + "phrase_matcher_attr": self.phrase_matcher_attr, + "patterns": self.patterns, + } return srsly.msgpack_dumps(serial) - def from_disk(self, path, **kwargs): + def from_disk( + self, path: Union[str, Path], *, exclude: Iterable[str] = SimpleFrozenList() + ) -> "EntityRuler": """Load the entity ruler from a file. Expects a file containing newline-delimited JSON (JSONL) with one entry per line. - path (unicode / Path): The JSONL file to load. - **kwargs: Other config paramters, mostly for consistency. - + path (str / Path): The JSONL file to load. RETURNS (EntityRuler): The loaded entity ruler. - DOCS: https://spacy.io/api/entityruler#from_disk + DOCS: https://nightly.spacy.io/api/entityruler#from_disk """ path = ensure_path(path) + self.clear() depr_patterns_path = path.with_suffix(".jsonl") if depr_patterns_path.is_file(): patterns = srsly.read_jsonl(depr_patterns_path) @@ -354,14 +419,15 @@ class EntityRuler(object): from_disk(path, deserializers_patterns, {}) return self - def to_disk(self, path, **kwargs): + def to_disk( + self, path: Union[str, Path], *, exclude: Iterable[str] = SimpleFrozenList() + ) -> None: """Save the entity ruler patterns to a directory. The patterns will be saved as newline-delimited JSON (JSONL). - path (unicode / Path): The JSONL file to save. - **kwargs: Other config paramters, mostly for consistency. + path (str / Path): The JSONL file to save. - DOCS: https://spacy.io/api/entityruler#to_disk + DOCS: https://nightly.spacy.io/api/entityruler#to_disk """ path = ensure_path(path) cfg = { diff --git a/spacy/pipeline/functions.py b/spacy/pipeline/functions.py index 69e638da2..614608b25 100644 --- a/spacy/pipeline/functions.py +++ b/spacy/pipeline/functions.py @@ -1,25 +1,23 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ..language import component +from ..language import Language from ..matcher import Matcher +from ..tokens import Doc from ..util import filter_spans -@component( +@Language.component( "merge_noun_chunks", requires=["token.dep", "token.tag", "token.pos"], retokenizes=True, ) -def merge_noun_chunks(doc): +def merge_noun_chunks(doc: Doc) -> Doc: """Merge noun chunks into a single token. doc (Doc): The Doc object. 
RETURNS (Doc): The Doc object with merged noun chunks. - DOCS: https://spacy.io/api/pipeline-functions#merge_noun_chunks + DOCS: https://nightly.spacy.io/api/pipeline-functions#merge_noun_chunks """ - if not doc.is_parsed: + if not doc.has_annotation("DEP"): return doc with doc.retokenize() as retokenizer: for np in doc.noun_chunks: @@ -28,18 +26,18 @@ def merge_noun_chunks(doc): return doc -@component( +@Language.component( "merge_entities", requires=["doc.ents", "token.ent_iob", "token.ent_type"], retokenizes=True, ) -def merge_entities(doc): +def merge_entities(doc: Doc): """Merge entities into a single token. doc (Doc): The Doc object. RETURNS (Doc): The Doc object with merged entities. - DOCS: https://spacy.io/api/pipeline-functions#merge_entities + DOCS: https://nightly.spacy.io/api/pipeline-functions#merge_entities """ with doc.retokenize() as retokenizer: for ent in doc.ents: @@ -48,18 +46,19 @@ def merge_entities(doc): return doc -@component("merge_subtokens", requires=["token.dep"], retokenizes=True) -def merge_subtokens(doc, label="subtok"): +@Language.component("merge_subtokens", requires=["token.dep"], retokenizes=True) +def merge_subtokens(doc: Doc, label: str = "subtok") -> Doc: """Merge subtokens into a single token. doc (Doc): The Doc object. - label (unicode): The subtoken dependency label. + label (str): The subtoken dependency label. RETURNS (Doc): The Doc object with merged subtokens. - DOCS: https://spacy.io/api/pipeline-functions#merge_subtokens + DOCS: https://nightly.spacy.io/api/pipeline-functions#merge_subtokens """ + # TODO: make stateful component with "label" config merger = Matcher(doc.vocab) - merger.add("SUBTOK", None, [{"DEP": label, "op": "+"}]) + merger.add("SUBTOK", [[{"DEP": label, "op": "+"}]]) matches = merger(doc) spans = filter_spans([doc[start : end + 1] for _, start, end in matches]) with doc.retokenize() as retokenizer: diff --git a/spacy/pipeline/hooks.py b/spacy/pipeline/hooks.py deleted file mode 100644 index b61a34c0e..000000000 --- a/spacy/pipeline/hooks.py +++ /dev/null @@ -1,99 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from thinc.t2v import Pooling, max_pool, mean_pool -from thinc.neural._classes.difference import Siamese, CauchySimilarity - -from .pipes import Pipe -from ..language import component -from .._ml import link_vectors_to_models - - -@component("sentencizer_hook", assigns=["doc.user_hooks"]) -class SentenceSegmenter(object): - """A simple spaCy hook, to allow custom sentence boundary detection logic - (that doesn't require the dependency parse). To change the sentence - boundary detection strategy, pass a generator function `strategy` on - initialization, or assign a new strategy to the .strategy attribute. - Sentence detection strategies should be generators that take `Doc` objects - and yield `Span` objects for each sentence. 
- """ - - def __init__(self, vocab, strategy=None): - self.vocab = vocab - if strategy is None or strategy == "on_punct": - strategy = self.split_on_punct - self.strategy = strategy - - def __call__(self, doc): - doc.user_hooks["sents"] = self.strategy - return doc - - @staticmethod - def split_on_punct(doc): - start = 0 - seen_period = False - for i, token in enumerate(doc): - if seen_period and not token.is_punct: - yield doc[start : token.i] - start = token.i - seen_period = False - elif token.text in [".", "!", "?"]: - seen_period = True - if start < len(doc): - yield doc[start : len(doc)] - - -@component("similarity", assigns=["doc.user_hooks"]) -class SimilarityHook(Pipe): - """ - Experimental: A pipeline component to install a hook for supervised - similarity into `Doc` objects. Requires a `Tensorizer` to pre-process - documents. The similarity model can be any object obeying the Thinc `Model` - interface. By default, the model concatenates the elementwise mean and - elementwise max of the two tensors, and compares them using the - Cauchy-like similarity function from Chen (2013): - - >>> similarity = 1. / (1. + (W * (vec1-vec2)**2).sum()) - - Where W is a vector of dimension weights, initialized to 1. - """ - - def __init__(self, vocab, model=True, **cfg): - self.vocab = vocab - self.model = model - self.cfg = dict(cfg) - - @classmethod - def Model(cls, length): - return Siamese(Pooling(max_pool, mean_pool), CauchySimilarity(length)) - - def __call__(self, doc): - """Install similarity hook""" - doc.user_hooks["similarity"] = self.predict - return doc - - def pipe(self, docs, **kwargs): - for doc in docs: - yield self(doc) - - def predict(self, doc1, doc2): - self.require_model() - return self.model.predict([(doc1, doc2)]) - - def update(self, doc1_doc2, golds, sgd=None, drop=0.0): - self.require_model() - sims, bp_sims = self.model.begin_update(doc1_doc2, drop=drop) - - def begin_training(self, _=tuple(), pipeline=None, sgd=None, **kwargs): - """Allocate model, using width from tensorizer in pipeline. - - gold_tuples (iterable): Gold-standard training data. - pipeline (list): The pipeline the model is part of. - """ - if self.model is True: - self.model = self.Model(pipeline[0].model.nO) - link_vectors_to_models(self.vocab) - if sgd is None: - sgd = self.create_optimizer() - return sgd diff --git a/spacy/pipeline/lemmatizer.py b/spacy/pipeline/lemmatizer.py new file mode 100644 index 000000000..9be596868 --- /dev/null +++ b/spacy/pipeline/lemmatizer.py @@ -0,0 +1,335 @@ +from typing import Optional, List, Dict, Any, Callable, Iterable, Iterator, Union +from typing import Tuple +from thinc.api import Model +from pathlib import Path + +from .pipe import Pipe +from ..errors import Errors +from ..language import Language +from ..training import Example +from ..lookups import Lookups, load_lookups +from ..scorer import Scorer +from ..tokens import Doc, Token +from ..vocab import Vocab +from ..training import validate_examples +from ..util import logger, SimpleFrozenList +from .. 
import util + + +@Language.factory( + "lemmatizer", + assigns=["token.lemma"], + default_config={"model": None, "mode": "lookup", "overwrite": False}, + default_score_weights={"lemma_acc": 1.0}, +) +def make_lemmatizer( + nlp: Language, + model: Optional[Model], + name: str, + mode: str, + overwrite: bool = False, +): + return Lemmatizer(nlp.vocab, model, name, mode=mode, overwrite=overwrite) + + +class Lemmatizer(Pipe): + """ + The Lemmatizer supports simple part-of-speech-sensitive suffix rules and + lookup tables. + + DOCS: https://nightly.spacy.io/api/lemmatizer + """ + + @classmethod + def get_lookups_config(cls, mode: str) -> Tuple[List[str], List[str]]: + """Returns the lookups configuration settings for a given mode for use + in Lemmatizer.load_lookups. + + mode (str): The lemmatizer mode. + RETURNS (Tuple[List[str], List[str]]): The required and optional + lookup tables for this mode. + """ + if mode == "lookup": + return (["lemma_lookup"], []) + elif mode == "rule": + return (["lemma_rules"], ["lemma_exc", "lemma_index"]) + return ([], []) + + def __init__( + self, + vocab: Vocab, + model: Optional[Model], + name: str = "lemmatizer", + *, + mode: str = "lookup", + overwrite: bool = False, + ) -> None: + """Initialize a Lemmatizer. + + vocab (Vocab): The vocab. + model (Model): A model (not yet implemented). + name (str): The component name. Defaults to "lemmatizer". + mode (str): The lemmatizer mode: "lookup", "rule". Defaults to "lookup". + overwrite (bool): Whether to overwrite existing lemmas. Defaults to + `False`. + + DOCS: https://nightly.spacy.io/api/lemmatizer#init + """ + self.vocab = vocab + self.model = model + self.name = name + self._mode = mode + self.lookups = Lookups() + self.overwrite = overwrite + self._validated = False + if self.mode == "lookup": + self.lemmatize = self.lookup_lemmatize + elif self.mode == "rule": + self.lemmatize = self.rule_lemmatize + else: + mode_attr = f"{self.mode}_lemmatize" + if not hasattr(self, mode_attr): + raise ValueError(Errors.E1003.format(mode=mode)) + self.lemmatize = getattr(self, mode_attr) + self.cache = {} + + @property + def mode(self): + return self._mode + + def __call__(self, doc: Doc) -> Doc: + """Apply the lemmatizer to one document. + + doc (Doc): The Doc to process. + RETURNS (Doc): The processed Doc. + + DOCS: https://nightly.spacy.io/api/lemmatizer#call + """ + if not self._validated: + self._validate_tables(Errors.E1004) + for token in doc: + if self.overwrite or token.lemma == 0: + token.lemma_ = self.lemmatize(token)[0] + return doc + + def initialize( + self, + get_examples: Optional[Callable[[], Iterable[Example]]] = None, + *, + nlp: Optional[Language] = None, + lookups: Optional[Lookups] = None, + ): + """Initialize the lemmatizer and load in data. + + get_examples (Callable[[], Iterable[Example]]): Function that + returns a representative sample of gold-standard Example objects. + nlp (Language): The current nlp object the component is part of. + lookups (Lookups): The lookups object containing the (optional) tables + such as "lemma_rules", "lemma_index", "lemma_exc" and + "lemma_lookup". Defaults to None. 
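+
+        A minimal sketch (assumes the spacy-lookups-data package is installed
+        so the required tables can be fetched for the pipeline's language):
+
+            lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "rule"})
+            lemmatizer.initialize()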
+ """ + required_tables, optional_tables = self.get_lookups_config(self.mode) + if lookups is None: + logger.debug("Lemmatizer: loading tables from spacy-lookups-data") + lookups = load_lookups(lang=self.vocab.lang, tables=required_tables) + optional_lookups = load_lookups( + lang=self.vocab.lang, tables=optional_tables, strict=False + ) + for table in optional_lookups.tables: + lookups.set_table(table, optional_lookups.get_table(table)) + self.lookups = lookups + self._validate_tables(Errors.E1004) + + def _validate_tables(self, error_message: str = Errors.E912) -> None: + """Check that the lookups are correct for the current mode.""" + required_tables, optional_tables = self.get_lookups_config(self.mode) + for table in required_tables: + if table not in self.lookups: + raise ValueError( + error_message.format( + mode=self.mode, + tables=required_tables, + found=self.lookups.tables, + ) + ) + self._validated = True + + def pipe(self, stream: Iterable[Doc], *, batch_size: int = 128) -> Iterator[Doc]: + """Apply the pipe to a stream of documents. This usually happens under + the hood when the nlp object is called on a text and all components are + applied to the Doc. + + stream (Iterable[Doc]): A stream of documents. + batch_size (int): The number of documents to buffer. + YIELDS (Doc): Processed documents in order. + + DOCS: https://nightly.spacy.io/api/lemmatizer#pipe + """ + for doc in stream: + doc = self(doc) + yield doc + + def lookup_lemmatize(self, token: Token) -> List[str]: + """Lemmatize using a lookup-based approach. + + token (Token): The token to lemmatize. + RETURNS (list): The available lemmas for the string. + + DOCS: https://nightly.spacy.io/api/lemmatizer#lookup_lemmatize + """ + lookup_table = self.lookups.get_table("lemma_lookup", {}) + result = lookup_table.get(token.text, token.text) + if isinstance(result, str): + result = [result] + return result + + def rule_lemmatize(self, token: Token) -> List[str]: + """Lemmatize using a rule-based approach. + + token (Token): The token to lemmatize. + RETURNS (list): The available lemmas for the string. + + DOCS: https://nightly.spacy.io/api/lemmatizer#rule_lemmatize + """ + cache_key = (token.orth, token.pos, token.morph) + if cache_key in self.cache: + return self.cache[cache_key] + string = token.text + univ_pos = token.pos_.lower() + if univ_pos in ("", "eol", "space"): + return [string.lower()] + # See Issue #435 for example of where this logic is requied. + if self.is_base_form(token): + return [string.lower()] + index_table = self.lookups.get_table("lemma_index", {}) + exc_table = self.lookups.get_table("lemma_exc", {}) + rules_table = self.lookups.get_table("lemma_rules", {}) + if not any( + ( + index_table.get(univ_pos), + exc_table.get(univ_pos), + rules_table.get(univ_pos), + ) + ): + if univ_pos == "propn": + return [string] + else: + return [string.lower()] + + index = index_table.get(univ_pos, {}) + exceptions = exc_table.get(univ_pos, {}) + rules = rules_table.get(univ_pos, {}) + orig = string + string = string.lower() + forms = [] + oov_forms = [] + for old, new in rules: + if string.endswith(old): + form = string[: len(string) - len(old)] + new + if not form: + pass + elif form in index or not form.isalpha(): + forms.append(form) + else: + oov_forms.append(form) + # Remove duplicates but preserve the ordering of applied "rules" + forms = list(dict.fromkeys(forms)) + # Put exceptions at the front of the list, so they get priority. 
+ # This is a dodgy heuristic -- but it's the best we can do until we get + # frequencies on this. We can at least prune out problematic exceptions, + # if they shadow more frequent analyses. + for form in exceptions.get(string, []): + if form not in forms: + forms.insert(0, form) + if not forms: + forms.extend(oov_forms) + if not forms: + forms.append(orig) + self.cache[cache_key] = forms + return forms + + def is_base_form(self, token: Token) -> bool: + """Check whether the token is a base form that does not need further + analysis for lemmatization. + + token (Token): The token. + RETURNS (bool): Whether the token is a base form. + + DOCS: https://nightly.spacy.io/api/lemmatizer#is_base_form + """ + return False + + def score(self, examples: Iterable[Example], **kwargs) -> Dict[str, Any]: + """Score a batch of examples. + + examples (Iterable[Example]): The examples to score. + RETURNS (Dict[str, Any]): The scores. + + DOCS: https://nightly.spacy.io/api/lemmatizer#score + """ + validate_examples(examples, "Lemmatizer.score") + return Scorer.score_token_attr(examples, "lemma", **kwargs) + + def to_disk( + self, path: Union[str, Path], *, exclude: Iterable[str] = SimpleFrozenList() + ): + """Serialize the pipe to disk. + + path (str / Path): Path to a directory. + exclude (Iterable[str]): String names of serialization fields to exclude. + + DOCS: https://nightly.spacy.io/api/lemmatizer#to_disk + """ + serialize = {} + serialize["vocab"] = lambda p: self.vocab.to_disk(p) + serialize["lookups"] = lambda p: self.lookups.to_disk(p) + util.to_disk(path, serialize, exclude) + + def from_disk( + self, path: Union[str, Path], *, exclude: Iterable[str] = SimpleFrozenList() + ) -> "Lemmatizer": + """Load the pipe from disk. Modifies the object in place and returns it. + + path (str / Path): Path to a directory. + exclude (Iterable[str]): String names of serialization fields to exclude. + RETURNS (Lemmatizer): The modified Lemmatizer object. + + DOCS: https://nightly.spacy.io/api/lemmatizer#from_disk + """ + deserialize = {} + deserialize["vocab"] = lambda p: self.vocab.from_disk(p) + deserialize["lookups"] = lambda p: self.lookups.from_disk(p) + util.from_disk(path, deserialize, exclude) + self._validate_tables() + return self + + def to_bytes(self, *, exclude: Iterable[str] = SimpleFrozenList()) -> bytes: + """Serialize the pipe to a bytestring. + + exclude (Iterable[str]): String names of serialization fields to exclude. + RETURNS (bytes): The serialized object. + + DOCS: https://nightly.spacy.io/api/lemmatizer#to_bytes + """ + serialize = {} + serialize["vocab"] = self.vocab.to_bytes + serialize["lookups"] = self.lookups.to_bytes + return util.to_bytes(serialize, exclude) + + def from_bytes( + self, bytes_data: bytes, *, exclude: Iterable[str] = SimpleFrozenList() + ) -> "Lemmatizer": + """Load the pipe from a bytestring. + + bytes_data (bytes): The serialized pipe. + exclude (Iterable[str]): String names of serialization fields to exclude. + RETURNS (Lemmatizer): The loaded Lemmatizer. 
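+
+        Round-trip sketch (assumes `lemmatizer` is an existing, initialized
+        instance):
+
+            lemma_bytes = lemmatizer.to_bytes()
+            lemmatizer.from_bytes(lemma_bytes)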
+ + DOCS: https://nightly.spacy.io/api/lemmatizer#from_bytes + """ + deserialize = {} + deserialize["vocab"] = lambda b: self.vocab.from_bytes(b) + deserialize["lookups"] = lambda b: self.lookups.from_bytes(b) + util.from_bytes(bytes_data, deserialize, exclude) + self._validate_tables() + return self diff --git a/spacy/pipeline/morphologizer.pyx b/spacy/pipeline/morphologizer.pyx index 72e31f120..ac111f28b 100644 --- a/spacy/pipeline/morphologizer.pyx +++ b/spacy/pipeline/morphologizer.pyx @@ -1,165 +1,260 @@ -from __future__ import unicode_literals -from collections import OrderedDict, defaultdict +# cython: infer_types=True, profile=True, binding=True +from typing import Optional, Union, Dict +import srsly +from thinc.api import SequenceCategoricalCrossentropy, Model, Config +from itertools import islice -import numpy -cimport numpy as np - -from thinc.api import chain -from thinc.neural.util import to_categorical, copy_array, get_array_module -from .. import util -from .pipes import Pipe -from ..language import component -from .._ml import Tok2Vec, build_morphologizer_model -from .._ml import link_vectors_to_models, zero_init, flatten -from .._ml import create_default_optimizer -from ..errors import Errors, TempErrors -from ..compat import basestring_ from ..tokens.doc cimport Doc from ..vocab cimport Vocab from ..morphology cimport Morphology +from ..parts_of_speech import IDS as POS_IDS +from ..symbols import POS +from ..language import Language +from ..errors import Errors +from .pipe import deserialize_config +from .tagger import Tagger +from .. import util +from ..scorer import Scorer +from ..training import validate_examples, validate_get_examples -@component("morphologizer", assigns=["token.morph", "token.pos"]) -class Morphologizer(Pipe): - @classmethod - def Model(cls, **cfg): - if cfg.get('pretrained_dims') and not cfg.get('pretrained_vectors'): - raise ValueError(TempErrors.T008) - class_map = Morphology.create_class_map() - return build_morphologizer_model(class_map.field_sizes, **cfg) +default_model_config = """ +[model] +@architectures = "spacy.Tagger.v1" - def __init__(self, vocab, model=True, **cfg): +[model.tok2vec] +@architectures = "spacy.Tok2Vec.v1" + +[model.tok2vec.embed] +@architectures = "spacy.CharacterEmbed.v1" +width = 128 +rows = 7000 +nM = 64 +nC = 8 +include_static_vectors = false + +[model.tok2vec.encode] +@architectures = "spacy.MaxoutWindowEncoder.v1" +width = 128 +depth = 4 +window_size = 1 +maxout_pieces = 3 +""" + +DEFAULT_MORPH_MODEL = Config().from_str(default_model_config)["model"] + + +@Language.factory( + "morphologizer", + assigns=["token.morph", "token.pos"], + default_config={"model": DEFAULT_MORPH_MODEL}, + default_score_weights={"pos_acc": 0.5, "morph_acc": 0.5, "morph_per_feat": None}, +) +def make_morphologizer( + nlp: Language, + model: Model, + name: str, +): + return Morphologizer(nlp.vocab, model, name) + + +class Morphologizer(Tagger): + POS_FEAT = "POS" + + def __init__( + self, + vocab: Vocab, + model: Model, + name: str = "morphologizer", + *, + labels_morph: Optional[dict] = None, + labels_pos: Optional[dict] = None, + ): + """Initialize a morphologizer. + + vocab (Vocab): The shared vocabulary. + model (thinc.api.Model): The Thinc Model powering the pipeline component. + name (str): The component instance name, used to add entries to the + losses during training. + labels_morph (dict): Mapping of morph + POS tags to morph labels. + labels_pos (dict): Mapping of morph + POS tags to POS tags. 
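+
+        For example, a single combined label might map as follows (an
+        illustrative sketch; the exact labels depend on the training data):
+
+            labels_morph = {"Number=Sing|POS=NOUN": "Number=Sing"}
+            labels_pos = {"Number=Sing|POS=NOUN": POS_IDS["NOUN"]}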
+ + DOCS: https://nightly.spacy.io/api/morphologizer#init + """ self.vocab = vocab self.model = model - self.cfg = OrderedDict(sorted(cfg.items())) - self.cfg.setdefault('cnn_maxout_pieces', 2) - self._class_map = self.vocab.morphology.create_class_map() + self.name = name + self._rehearsal_model = None + # to be able to set annotations without string operations on labels, + # store mappings from morph+POS labels to token-level annotations: + # 1) labels_morph stores a mapping from morph+POS->morph + # 2) labels_pos stores a mapping from morph+POS->POS + cfg = {"labels_morph": labels_morph or {}, "labels_pos": labels_pos or {}} + self.cfg = dict(sorted(cfg.items())) + # add mappings for empty morph + self.cfg["labels_morph"][Morphology.EMPTY_MORPH] = Morphology.EMPTY_MORPH + self.cfg["labels_pos"][Morphology.EMPTY_MORPH] = POS_IDS[""] @property def labels(self): - return self.vocab.morphology.tag_names + """RETURNS (Tuple[str]): The labels currently added to the component.""" + return tuple(self.cfg["labels_morph"].keys()) @property - def tok2vec(self): - if self.model in (None, True, False): - return None + def label_data(self) -> Dict[str, Dict[str, Union[str, float, int, None]]]: + """A dictionary with all labels data.""" + return {"morph": self.cfg["labels_morph"], "pos": self.cfg["labels_pos"]} + + def add_label(self, label): + """Add a new label to the pipe. + + label (str): The label to add. + RETURNS (int): 0 if label is already present, otherwise 1. + + DOCS: https://nightly.spacy.io/api/morphologizer#add_label + """ + if not isinstance(label, str): + raise ValueError(Errors.E187) + if label in self.labels: + return 0 + self._allow_extra_label() + # normalize label + norm_label = self.vocab.morphology.normalize_features(label) + # extract separate POS and morph tags + label_dict = Morphology.feats_to_dict(label) + pos = label_dict.get(self.POS_FEAT, "") + if self.POS_FEAT in label_dict: + label_dict.pop(self.POS_FEAT) + # normalize morph string and add to morphology table + norm_morph = self.vocab.strings[self.vocab.morphology.add(label_dict)] + # add label mappings + if norm_label not in self.cfg["labels_morph"]: + self.cfg["labels_morph"][norm_label] = norm_morph + self.cfg["labels_pos"][norm_label] = POS_IDS[pos] + return 1 + + def initialize(self, get_examples, *, nlp=None, labels=None): + """Initialize the pipe for training, using a representative set + of data examples. + + get_examples (Callable[[], Iterable[Example]]): Function that + returns a representative sample of gold-standard Example objects. + nlp (Language): The current nlp object the component is part of. 
+ + DOCS: https://nightly.spacy.io/api/morphologizer#initialize + """ + validate_get_examples(get_examples, "Morphologizer.initialize") + if labels is not None: + self.cfg["labels_morph"] = labels["morph"] + self.cfg["labels_pos"] = labels["pos"] else: - return chain(self.model.tok2vec, flatten) + # First, fetch all labels from the data + for example in get_examples(): + for i, token in enumerate(example.reference): + pos = token.pos_ + morph = str(token.morph) + # create and add the combined morph+POS label + morph_dict = Morphology.feats_to_dict(morph) + if pos: + morph_dict[self.POS_FEAT] = pos + norm_label = self.vocab.strings[self.vocab.morphology.add(morph_dict)] + # add label->morph and label->POS mappings + if norm_label not in self.cfg["labels_morph"]: + self.cfg["labels_morph"][norm_label] = morph + self.cfg["labels_pos"][norm_label] = POS_IDS[pos] + if len(self.labels) <= 1: + raise ValueError(Errors.E143.format(name=self.name)) + doc_sample = [] + label_sample = [] + for example in islice(get_examples(), 10): + gold_array = [] + for i, token in enumerate(example.reference): + pos = token.pos_ + morph = str(token.morph) + morph_dict = Morphology.feats_to_dict(morph) + if pos: + morph_dict[self.POS_FEAT] = pos + norm_label = self.vocab.strings[self.vocab.morphology.add(morph_dict)] + gold_array.append([1.0 if label == norm_label else 0.0 for label in self.labels]) + doc_sample.append(example.x) + label_sample.append(self.model.ops.asarray(gold_array, dtype="float32")) + assert len(doc_sample) > 0, Errors.E923.format(name=self.name) + assert len(label_sample) > 0, Errors.E923.format(name=self.name) + self.model.initialize(X=doc_sample, Y=label_sample) - def __call__(self, doc): - features, tokvecs = self.predict([doc]) - self.set_annotations([doc], features, tensors=tokvecs) - return doc + def set_annotations(self, docs, batch_tag_ids): + """Modify a batch of documents, using pre-computed scores. - def pipe(self, stream, batch_size=128, n_threads=-1): - for docs in util.minibatch(stream, size=batch_size): - docs = list(docs) - features, tokvecs = self.predict(docs) - self.set_annotations(docs, features, tensors=tokvecs) - yield from docs + docs (Iterable[Doc]): The documents to modify. + batch_tag_ids: The IDs to set, produced by Morphologizer.predict. - def predict(self, docs): - if not any(len(doc) for doc in docs): - # Handle case where there are no tokens in any docs. - n_labels = self.model.nO - guesses = [self.model.ops.allocate((0, n_labels)) for doc in docs] - tokvecs = self.model.ops.allocate((0, self.model.tok2vec.nO)) - return guesses, tokvecs - tokvecs = self.model.tok2vec(docs) - scores = self.model.softmax(tokvecs) - return scores, tokvecs - - def set_annotations(self, docs, batch_scores, tensors=None): + DOCS: https://nightly.spacy.io/api/morphologizer#set_annotations + """ if isinstance(docs, Doc): docs = [docs] cdef Doc doc cdef Vocab vocab = self.vocab - offsets = [self._class_map.get_field_offset(field) - for field in self._class_map.fields] for i, doc in enumerate(docs): - doc_scores = batch_scores[i] - doc_guesses = scores_to_guesses(doc_scores, self.model.softmax.out_sizes) - # Convert the neuron indices into feature IDs. - doc_feat_ids = numpy.zeros((len(doc), len(self._class_map.fields)), dtype='i') - for j in range(len(doc)): - for k, offset in enumerate(offsets): - if doc_guesses[j, k] == 0: - doc_feat_ids[j, k] = 0 - else: - doc_feat_ids[j, k] = offset + doc_guesses[j, k] - # Get the set of feature names. 
- feats = {self._class_map.col2info[f][2] for f in doc_feat_ids[j]} - if "NIL" in feats: - feats.remove("NIL") - # Now add the analysis, and set the hash. - doc.c[j].morph = self.vocab.morphology.add(feats) - if doc[j].morph.pos != 0: - doc.c[j].pos = doc[j].morph.pos + doc_tag_ids = batch_tag_ids[i] + if hasattr(doc_tag_ids, "get"): + doc_tag_ids = doc_tag_ids.get() + for j, tag_id in enumerate(doc_tag_ids): + morph = self.labels[tag_id] + doc.c[j].morph = self.vocab.morphology.add(self.cfg["labels_morph"][morph]) + doc.c[j].pos = self.cfg["labels_pos"][morph] - def update(self, docs, golds, drop=0., sgd=None, losses=None): - if losses is not None and self.name not in losses: - losses[self.name] = 0. + def get_loss(self, examples, scores): + """Find the loss and gradient of loss for the batch of documents and + their predicted scores. - tag_scores, bp_tag_scores = self.model.begin_update(docs, drop=drop) - loss, d_tag_scores = self.get_loss(docs, golds, tag_scores) - bp_tag_scores(d_tag_scores, sgd=sgd) + examples (Iterable[Examples]): The batch of examples. + scores: Scores representing the model's predictions. + RETURNS (Tuple[float, float]): The loss and the gradient. - if losses is not None: - losses[self.name] += loss - - def get_loss(self, docs, golds, scores): - guesses = [] - for doc_scores in scores: - guesses.append(scores_to_guesses(doc_scores, self.model.softmax.out_sizes)) - guesses = self.model.ops.xp.vstack(guesses) - scores = self.model.ops.xp.vstack(scores) - if not isinstance(scores, numpy.ndarray): - scores = scores.get() - if not isinstance(guesses, numpy.ndarray): - guesses = guesses.get() - cdef int idx = 0 - # Do this on CPU, as we can't vectorize easily. - target = numpy.zeros(scores.shape, dtype='f') - field_sizes = self.model.softmax.out_sizes - for doc, gold in zip(docs, golds): - for t, features in enumerate(gold.morphology): - if features is None: - target[idx] = scores[idx] - else: - gold_fields = {} - for feature in features: - field = self._class_map.feat2field[feature] - gold_fields[field] = self._class_map.feat2offset[feature] - for field in self._class_map.fields: - field_id = self._class_map.field2id[field] - col_offset = self._class_map.field2col[field] - if field_id in gold_fields: - target[idx, col_offset + gold_fields[field_id]] = 1. - else: - target[idx, col_offset] = 1. 
- #print(doc[t]) - #for col, info in enumerate(self._class_map.col2info): - # print(col, info, scores[idx, col], target[idx, col]) - idx += 1 - target = self.model.ops.asarray(target, dtype='f') - scores = self.model.ops.asarray(scores, dtype='f') - d_scores = scores - target - loss = (d_scores**2).sum() - d_scores = self.model.ops.unflatten(d_scores, [len(d) for d in docs]) + DOCS: https://nightly.spacy.io/api/morphologizer#get_loss + """ + validate_examples(examples, "Morphologizer.get_loss") + loss_func = SequenceCategoricalCrossentropy(names=self.labels, normalize=False) + truths = [] + for eg in examples: + eg_truths = [] + pos_tags = eg.get_aligned("POS", as_string=True) + morphs = eg.get_aligned("MORPH", as_string=True) + for i in range(len(morphs)): + pos = pos_tags[i] + morph = morphs[i] + # POS may align (same value for multiple tokens) when morph + # doesn't, so if either is None, treat both as None here so that + # truths doesn't end up with an unknown morph+POS combination + if pos is None or morph is None: + pos = None + morph = None + label_dict = Morphology.feats_to_dict(morph) + if pos: + label_dict[self.POS_FEAT] = pos + label = self.vocab.strings[self.vocab.morphology.add(label_dict)] + eg_truths.append(label) + truths.append(eg_truths) + d_scores, loss = loss_func(scores, truths) + if self.model.ops.xp.isnan(loss): + raise ValueError(Errors.E910.format(name=self.name)) return float(loss), d_scores - def use_params(self, params): - with self.model.use_params(params): - yield + def score(self, examples, **kwargs): + """Score a batch of examples. -def scores_to_guesses(scores, out_sizes): - xp = get_array_module(scores) - guesses = xp.zeros((scores.shape[0], len(out_sizes)), dtype='i') - offset = 0 - for i, size in enumerate(out_sizes): - slice_ = scores[:, offset : offset + size] - col_guesses = slice_.argmax(axis=1) - guesses[:, i] = col_guesses - offset += size - return guesses + examples (Iterable[Example]): The examples to score. + RETURNS (Dict[str, Any]): The scores, produced by + Scorer.score_token_attr for the attributes "pos" and "morph" and + Scorer.score_token_attr_per_feat for the attribute "morph". 
+ + DOCS: https://nightly.spacy.io/api/morphologizer#score + """ + validate_examples(examples, "Morphologizer.score") + results = {} + results.update(Scorer.score_token_attr(examples, "pos", **kwargs)) + results.update(Scorer.score_token_attr(examples, "morph", **kwargs)) + results.update(Scorer.score_token_attr_per_feat(examples, + "morph", **kwargs)) + return results diff --git a/spacy/pipeline/multitask.pyx b/spacy/pipeline/multitask.pyx new file mode 100644 index 000000000..e1ea49849 --- /dev/null +++ b/spacy/pipeline/multitask.pyx @@ -0,0 +1,218 @@ +# cython: infer_types=True, profile=True, binding=True +from typing import Optional +import numpy +from thinc.api import CosineDistance, to_categorical, Model, Config +from thinc.api import set_dropout_rate + +from ..tokens.doc cimport Doc + +from .trainable_pipe import TrainablePipe +from .tagger import Tagger +from ..training import validate_examples +from ..language import Language +from ._parser_internals import nonproj +from ..attrs import POS, ID +from ..errors import Errors + + +default_model_config = """ +[model] +@architectures = "spacy.MultiTask.v1" +maxout_pieces = 3 +token_vector_width = 96 + +[model.tok2vec] +@architectures = "spacy.HashEmbedCNN.v1" +pretrained_vectors = null +width = 96 +depth = 4 +embed_size = 2000 +window_size = 1 +maxout_pieces = 2 +subword_features = true +""" +DEFAULT_MT_MODEL = Config().from_str(default_model_config)["model"] + + +@Language.factory( + "nn_labeller", + default_config={"labels": None, "target": "dep_tag_offset", "model": DEFAULT_MT_MODEL} +) +def make_nn_labeller(nlp: Language, name: str, model: Model, labels: Optional[dict], target: str): + return MultitaskObjective(nlp.vocab, model, name) + + +class MultitaskObjective(Tagger): + """Experimental: Assist training of a parser or tagger, by training a + side-objective. 
+ """ + + def __init__(self, vocab, model, name="nn_labeller", *, labels, target): + self.vocab = vocab + self.model = model + self.name = name + if target == "dep": + self.make_label = self.make_dep + elif target == "tag": + self.make_label = self.make_tag + elif target == "ent": + self.make_label = self.make_ent + elif target == "dep_tag_offset": + self.make_label = self.make_dep_tag_offset + elif target == "ent_tag": + self.make_label = self.make_ent_tag + elif target == "sent_start": + self.make_label = self.make_sent_start + elif hasattr(target, "__call__"): + self.make_label = target + else: + raise ValueError(Errors.E016) + cfg = {"labels": labels or {}, "target": target} + self.cfg = dict(cfg) + + @property + def labels(self): + return self.cfg.setdefault("labels", {}) + + @labels.setter + def labels(self, value): + self.cfg["labels"] = value + + def set_annotations(self, docs, dep_ids): + pass + + def initialize(self, get_examples, nlp=None): + if not hasattr(get_examples, "__call__"): + err = Errors.E930.format(name="MultitaskObjective", obj=type(get_examples)) + raise ValueError(err) + for example in get_examples(): + for token in example.y: + label = self.make_label(token) + if label is not None and label not in self.labels: + self.labels[label] = len(self.labels) + self.model.initialize() # TODO: fix initialization by defining X and Y + + def predict(self, docs): + tokvecs = self.model.get_ref("tok2vec")(docs) + scores = self.model.get_ref("softmax")(tokvecs) + return tokvecs, scores + + def get_loss(self, examples, scores): + cdef int idx = 0 + correct = numpy.zeros((scores.shape[0],), dtype="i") + guesses = scores.argmax(axis=1) + docs = [eg.predicted for eg in examples] + for i, eg in enumerate(examples): + # Handles alignment for tokenization differences + doc_annots = eg.get_aligned() # TODO + for j in range(len(eg.predicted)): + tok_annots = {key: values[j] for key, values in tok_annots.items()} + label = self.make_label(j, tok_annots) + if label is None or label not in self.labels: + correct[idx] = guesses[idx] + else: + correct[idx] = self.labels[label] + idx += 1 + correct = self.model.ops.xp.array(correct, dtype="i") + d_scores = scores - to_categorical(correct, n_classes=scores.shape[1]) + loss = (d_scores**2).sum() + return float(loss), d_scores + + @staticmethod + def make_dep(token): + return token.dep_ + + @staticmethod + def make_tag(token): + return token.tag_ + + @staticmethod + def make_ent(token): + if token.ent_iob_ == "O": + return "O" + else: + return token.ent_iob_ + "-" + token.ent_type_ + + @staticmethod + def make_dep_tag_offset(token): + dep = token.dep_ + tag = token.tag_ + offset = token.head.i - token.i + offset = min(offset, 2) + offset = max(offset, -2) + return f"{dep}-{tag}:{offset}" + + @staticmethod + def make_ent_tag(token): + if token.ent_iob_ == "O": + ent = "O" + else: + ent = token.ent_iob_ + "-" + token.ent_type_ + tag = token.tag_ + return f"{tag}-{ent}" + + @staticmethod + def make_sent_start(token): + """A multi-task objective for representing sentence boundaries, + using BILU scheme. 
(O is impossible) + """ + if token.is_sent_start and token.is_sent_end: + return "U-SENT" + elif token.is_sent_start: + return "B-SENT" + else: + return "I-SENT" + + +class ClozeMultitask(TrainablePipe): + def __init__(self, vocab, model, **cfg): + self.vocab = vocab + self.model = model + self.cfg = cfg + self.distance = CosineDistance(ignore_zeros=True, normalize=False) # TODO: in config + + def set_annotations(self, docs, dep_ids): + pass + + def initialize(self, get_examples, nlp=None): + self.model.initialize() # TODO: fix initialization by defining X and Y + X = self.model.ops.alloc((5, self.model.get_ref("tok2vec").get_dim("nO"))) + self.model.output_layer.initialize(X) + + def predict(self, docs): + tokvecs = self.model.get_ref("tok2vec")(docs) + vectors = self.model.get_ref("output_layer")(tokvecs) + return tokvecs, vectors + + def get_loss(self, examples, vectors, prediction): + validate_examples(examples, "ClozeMultitask.get_loss") + # The simplest way to implement this would be to vstack the + # token.vector values, but that's a bit inefficient, especially on GPU. + # Instead we fetch the index into the vectors table for each of our tokens, + # and look them up all at once. This prevents data copying. + ids = self.model.ops.flatten([eg.predicted.to_array(ID).ravel() for eg in examples]) + target = vectors[ids] + gradient = self.distance.get_grad(prediction, target) + loss = self.distance.get_loss(prediction, target) + return loss, gradient + + def update(self, examples, *, drop=0., set_annotations=False, sgd=None, losses=None): + pass + + def rehearse(self, examples, drop=0., sgd=None, losses=None): + if losses is not None and self.name not in losses: + losses[self.name] = 0. + set_dropout_rate(self.model, drop) + validate_examples(examples, "ClozeMultitask.rehearse") + docs = [eg.predicted for eg in examples] + predictions, bp_predictions = self.model.begin_update() + loss, d_predictions = self.get_loss(examples, self.vocab.vectors.data, predictions) + bp_predictions(d_predictions) + if sgd is not None: + self.finish_update(sgd) + if losses is not None: + losses[self.name] += loss + return losses + + def add_label(self, label): + raise NotImplementedError diff --git a/spacy/pipeline/ner.pyx b/spacy/pipeline/ner.pyx new file mode 100644 index 000000000..6482d6125 --- /dev/null +++ b/spacy/pipeline/ner.pyx @@ -0,0 +1,134 @@ +# cython: infer_types=True, profile=True, binding=True +from typing import Optional, Iterable +from thinc.api import Model, Config + +from .transition_parser cimport Parser +from ._parser_internals.ner cimport BiluoPushDown + +from ..language import Language +from ..scorer import get_ner_prf, PRFScore +from ..training import validate_examples + + +default_model_config = """ +[model] +@architectures = "spacy.TransitionBasedParser.v1" +state_type = "ner" +extra_state_tokens = false +hidden_width = 64 +maxout_pieces = 2 + +[model.tok2vec] +@architectures = "spacy.HashEmbedCNN.v1" +pretrained_vectors = null +width = 96 +depth = 4 +embed_size = 2000 +window_size = 1 +maxout_pieces = 3 +subword_features = true +""" +DEFAULT_NER_MODEL = Config().from_str(default_model_config)["model"] + + +@Language.factory( + "ner", + assigns=["doc.ents", "token.ent_iob", "token.ent_type"], + default_config={ + "moves": None, + "update_with_oracle_cut_size": 100, + "model": DEFAULT_NER_MODEL, + }, + default_score_weights={"ents_f": 1.0, "ents_p": 0.0, "ents_r": 0.0, "ents_per_type": None}, + +) +def make_ner( + nlp: Language, + name: str, + model: Model, + moves: Optional[list], + 
update_with_oracle_cut_size: int, +): + """Create a transition-based EntityRecognizer component. The entity recognizer + identifies non-overlapping labelled spans of tokens. + + The transition-based algorithm used encodes certain assumptions that are + effective for "traditional" named entity recognition tasks, but may not be + a good fit for every span identification problem. Specifically, the loss + function optimizes for whole entity accuracy, so if your inter-annotator + agreement on boundary tokens is low, the component will likely perform poorly + on your problem. The transition-based algorithm also assumes that the most + decisive information about your entities will be close to their initial tokens. + If your entities are long and characterised by tokens in their middle, the + component will likely do poorly on your task. + + model (Model): The model for the transition-based parser. The model needs + to have a specific substructure of named components --- see the + spacy.ml.tb_framework.TransitionModel for details. + moves (list[str]): A list of transition names. Inferred from the data if not + provided. + update_with_oracle_cut_size (int): + During training, cut long sequences into shorter segments by creating + intermediate states based on the gold-standard history. The model is + not very sensitive to this parameter, so you usually won't need to change + it. 100 is a good default. + """ + return EntityRecognizer( + nlp.vocab, + model, + name, + moves=moves, + update_with_oracle_cut_size=update_with_oracle_cut_size, + multitasks=[], + min_action_freq=1, + learn_tokens=False, + ) + + +cdef class EntityRecognizer(Parser): + """Pipeline component for named entity recognition. + + DOCS: https://nightly.spacy.io/api/entityrecognizer + """ + TransitionSystem = BiluoPushDown + + def add_multitask_objective(self, mt_component): + """Register another component as a multi-task objective. Experimental.""" + self._multitasks.append(mt_component) + + def init_multitask_objectives(self, get_examples, nlp=None, **cfg): + """Setup multi-task objective components. Experimental and internal.""" + # TODO: transfer self.model.get_ref("tok2vec") to the multitask's model ? + for labeller in self._multitasks: + labeller.model.set_dim("nO", len(self.labels)) + if labeller.model.has_ref("output_layer"): + labeller.model.get_ref("output_layer").set_dim("nO", len(self.labels)) + labeller.initialize(get_examples, nlp=nlp) + + @property + def labels(self): + # Get the labels from the model by looking at the available moves, e.g. + # B-PERSON, I-PERSON, L-PERSON, U-PERSON + labels = set(move.split("-")[1] for move in self.move_names + if move[0] in ("B", "I", "L", "U")) + return tuple(sorted(labels)) + + def score(self, examples, **kwargs): + """Score a batch of examples. + + examples (Iterable[Example]): The examples to score. + RETURNS (Dict[str, Any]): The NER precision, recall and f-scores. 
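# A standalone sketch of the `labels` property above: entity types are recovered
# from the BILU move names of the transition system. The move names below are
# invented for illustration.
def labels_from_moves(move_names):
    labels = set(
        move.split("-")[1]
        for move in move_names
        if move and move[0] in ("B", "I", "L", "U")
    )
    return tuple(sorted(labels))

print(labels_from_moves(["O", "B-PERSON", "I-PERSON", "L-PERSON", "U-GPE"]))
# -> ('GPE', 'PERSON')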
+ + DOCS: https://nightly.spacy.io/api/entityrecognizer#score + """ + validate_examples(examples, "EntityRecognizer.score") + score_per_type = get_ner_prf(examples) + totals = PRFScore() + for prf in score_per_type.values(): + totals += prf + return { + "ents_p": totals.precision, + "ents_r": totals.recall, + "ents_f": totals.fscore, + "ents_per_type": {k: v.to_dict() for k, v in score_per_type.items()}, + } diff --git a/spacy/pipeline/pipe.pxd b/spacy/pipeline/pipe.pxd new file mode 100644 index 000000000..bb97f79d0 --- /dev/null +++ b/spacy/pipeline/pipe.pxd @@ -0,0 +1,2 @@ +cdef class Pipe: + cdef public str name diff --git a/spacy/pipeline/pipe.pyx b/spacy/pipeline/pipe.pyx new file mode 100644 index 000000000..afb59fdb3 --- /dev/null +++ b/spacy/pipeline/pipe.pyx @@ -0,0 +1,105 @@ +# cython: infer_types=True, profile=True +import warnings +from typing import Optional, Tuple, Iterable, Iterator, Callable, Union, Dict +import srsly + +from ..tokens.doc cimport Doc + +from ..training import Example +from ..errors import Errors, Warnings +from ..language import Language + +cdef class Pipe: + """This class is a base class and not instantiated directly. It provides + an interface for pipeline components to implement. + Trainable pipeline components like the EntityRecognizer or TextCategorizer + should inherit from the subclass 'TrainablePipe'. + + DOCS: https://nightly.spacy.io/api/pipe + """ + + @classmethod + def __init_subclass__(cls, **kwargs): + """Raise a warning if an inheriting class implements 'begin_training' + (from v2) instead of the new 'initialize' method (from v3)""" + if hasattr(cls, "begin_training"): + warnings.warn(Warnings.W088.format(name=cls.__name__)) + + def __call__(self, Doc doc) -> Doc: + """Apply the pipe to one document. The document is modified in place, + and returned. This usually happens under the hood when the nlp object + is called on a text and all components are applied to the Doc. + + docs (Doc): The Doc to process. + RETURNS (Doc): The processed Doc. + + DOCS: https://nightly.spacy.io/api/pipe#call + """ + raise NotImplementedError(Errors.E931.format(parent="Pipe", method="__call__", name=self.name)) + + def pipe(self, stream: Iterable[Doc], *, batch_size: int=128) -> Iterator[Doc]: + """Apply the pipe to a stream of documents. This usually happens under + the hood when the nlp object is called on a text and all components are + applied to the Doc. + + stream (Iterable[Doc]): A stream of documents. + batch_size (int): The number of documents to buffer. + YIELDS (Doc): Processed documents in order. + + DOCS: https://nightly.spacy.io/api/pipe#pipe + """ + for doc in stream: + doc = self(doc) + yield doc + + def initialize(self, get_examples: Callable[[], Iterable[Example]], *, nlp: Language=None): + """Initialize the pipe. For non-trainable components, this method + is optional. For trainable components, which should inherit + from the subclass TrainablePipe, the provided data examples + should be used to ensure that the internal model is initialized + properly and all input/output dimensions throughout the network are + inferred. + + get_examples (Callable[[], Iterable[Example]]): Function that + returns a representative sample of gold-standard Example objects. + nlp (Language): The current nlp object the component is part of. + + DOCS: https://nightly.spacy.io/api/pipe#initialize + """ + pass + + def score(self, examples: Iterable[Example], **kwargs) -> Dict[str, Union[float, Dict[str, float]]]: + """Score a batch of examples. 
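# A standalone sketch of what `score` above does when it sums the per-type
# PRFScore objects into overall ents_p/ents_r/ents_f: a micro-average over raw
# true/false positive and false negative counts. The PRF tuple and the counts
# are invented for the demonstration.
from collections import namedtuple

PRF = namedtuple("PRF", ["tp", "fp", "fn"])

def micro_prf(per_type):
    tp = sum(p.tp for p in per_type.values())
    fp = sum(p.fp for p in per_type.values())
    fn = sum(p.fn for p in per_type.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fscore = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"ents_p": precision, "ents_r": recall, "ents_f": fscore}

print(micro_prf({"PERSON": PRF(8, 2, 1), "GPE": PRF(3, 1, 2)}))
# -> ents_p = ents_r = ents_f = 11/14 for these particular counts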
+ + examples (Iterable[Example]): The examples to score. + RETURNS (Dict[str, Any]): The scores. + + DOCS: https://nightly.spacy.io/api/pipe#score + """ + return {} + + @property + def is_trainable(self) -> bool: + return False + + @property + def labels(self) -> Optional[Tuple[str]]: + return tuple() + + @property + def label_data(self): + """Optional JSON-serializable data that would be sufficient to recreate + the label set if provided to the `pipe.initialize()` method. + """ + return None + + def _require_labels(self) -> None: + """Raise an error if this component has no labels defined.""" + if not self.labels or list(self.labels) == [""]: + raise ValueError(Errors.E143.format(name=self.name)) + +def deserialize_config(path): + if path.exists(): + return srsly.read_json(path) + else: + return {} diff --git a/spacy/pipeline/pipes.pyx b/spacy/pipeline/pipes.pyx deleted file mode 100644 index a5b891b54..000000000 --- a/spacy/pipeline/pipes.pyx +++ /dev/null @@ -1,1651 +0,0 @@ -# cython: infer_types=True -# cython: profile=True -# coding: utf8 -from __future__ import unicode_literals - -import numpy -import srsly -import random -import warnings -from collections import OrderedDict -from thinc.api import chain -from thinc.v2v import Affine, Maxout, Softmax -from thinc.misc import LayerNorm -from thinc.neural.util import to_categorical -from thinc.neural.util import get_array_module - -from ..compat import basestring_ -from ..tokens.doc cimport Doc -from ..syntax.nn_parser cimport Parser -from ..syntax.ner cimport BiluoPushDown -from ..syntax.arc_eager cimport ArcEager -from ..morphology cimport Morphology -from ..vocab cimport Vocab - -from .functions import merge_subtokens -from ..language import Language, component -from ..syntax import nonproj -from ..attrs import POS, ID -from ..parts_of_speech import X -from ..kb import KnowledgeBase -from .._ml import Tok2Vec, build_tagger_model, cosine, get_cossim_loss -from .._ml import build_text_classifier, build_simple_cnn_text_classifier -from .._ml import build_bow_text_classifier, build_nel_encoder -from .._ml import link_vectors_to_models, zero_init, flatten -from .._ml import masked_language_model, create_default_optimizer, get_cossim_loss -from .._ml import MultiSoftmax, get_characters_loss -from ..errors import Errors, TempErrors, Warnings -from .. import util - - -def _load_cfg(path): - if path.exists(): - return srsly.read_json(path) - else: - return {} - - -class Pipe(object): - """This class is not instantiated directly. Components inherit from it, and - it defines the interface that components should follow to function as - components in a spaCy analysis pipeline. - """ - - name = None - - @classmethod - def Model(cls, *shape, **kwargs): - """Initialize a model for the pipe.""" - raise NotImplementedError - - @classmethod - def from_nlp(cls, nlp, **cfg): - return cls(nlp.vocab, **cfg) - - def __init__(self, vocab, model=True, **cfg): - """Create a new pipe instance.""" - raise NotImplementedError - - def __call__(self, doc): - """Apply the pipe to one document. The document is - modified in-place, and returned. - - Both __call__ and pipe should delegate to the `predict()` - and `set_annotations()` methods. 
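# A minimal sketch of the delegation pattern described above: both `__call__`
# and `pipe` route through `predict()` and `set_annotations()`. The component
# name and the "score" it writes into Doc.user_data are hypothetical.
class TemplateComponent:
    name = "template"

    def predict(self, docs):
        # Produce one prediction per doc without modifying the docs.
        return [0.0 for _ in docs]

    def set_annotations(self, docs, scores):
        # Write the precomputed predictions back onto the docs.
        for doc, score in zip(docs, scores):
            doc.user_data["template_score"] = score

    def __call__(self, doc):
        scores = self.predict([doc])
        self.set_annotations([doc], scores)
        return doc

    def pipe(self, stream, batch_size=128):
        for doc in stream:
            yield self(doc)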
- """ - self.require_model() - predictions = self.predict([doc]) - if isinstance(predictions, tuple) and len(predictions) == 2: - scores, tensors = predictions - self.set_annotations([doc], scores, tensors=tensors) - else: - self.set_annotations([doc], predictions) - return doc - - def require_model(self): - """Raise an error if the component's model is not initialized.""" - if getattr(self, "model", None) in (None, True, False): - raise ValueError(Errors.E109.format(name=self.name)) - - def pipe(self, stream, batch_size=128, n_threads=-1): - """Apply the pipe to a stream of documents. - - Both __call__ and pipe should delegate to the `predict()` - and `set_annotations()` methods. - """ - for docs in util.minibatch(stream, size=batch_size): - docs = list(docs) - predictions = self.predict(docs) - if isinstance(predictions, tuple) and len(tuple) == 2: - scores, tensors = predictions - self.set_annotations(docs, scores, tensors=tensors) - else: - self.set_annotations(docs, predictions) - yield from docs - - def predict(self, docs): - """Apply the pipeline's model to a batch of docs, without - modifying them. - """ - self.require_model() - raise NotImplementedError - - def set_annotations(self, docs, scores, tensors=None): - """Modify a batch of documents, using pre-computed scores.""" - raise NotImplementedError - - def update(self, docs, golds, drop=0.0, sgd=None, losses=None): - """Learn from a batch of documents and gold-standard information, - updating the pipe's model. - - Delegates to predict() and get_loss(). - """ - pass - - def rehearse(self, docs, sgd=None, losses=None, **config): - pass - - def get_loss(self, docs, golds, scores): - """Find the loss and gradient of loss for the batch of - documents and their predicted scores.""" - raise NotImplementedError - - def add_label(self, label): - """Add an output label, to be predicted by the model. - - It's possible to extend pretrained models with new labels, - but care should be taken to avoid the "catastrophic forgetting" - problem. - """ - raise NotImplementedError - - def create_optimizer(self): - return create_default_optimizer(self.model.ops, **self.cfg.get("optimizer", {})) - - def begin_training( - self, get_gold_tuples=lambda: [], pipeline=None, sgd=None, **kwargs - ): - """Initialize the pipe for training, using data exampes if available. - If no model has been initialized yet, the model is added.""" - if self.model is True: - self.model = self.Model(**self.cfg) - if hasattr(self, "vocab"): - link_vectors_to_models(self.vocab) - if sgd is None: - sgd = self.create_optimizer() - return sgd - - def use_params(self, params): - """Modify the pipe's model, to use the given parameter values.""" - with self.model.use_params(params): - yield - - def to_bytes(self, exclude=tuple(), **kwargs): - """Serialize the pipe to a bytestring. - - exclude (list): String names of serialization fields to exclude. - RETURNS (bytes): The serialized object. 
- """ - serialize = OrderedDict() - serialize["cfg"] = lambda: srsly.json_dumps(self.cfg) - if self.model not in (True, False, None): - serialize["model"] = self.model.to_bytes - if hasattr(self, "vocab"): - serialize["vocab"] = self.vocab.to_bytes - exclude = util.get_serialization_exclude(serialize, exclude, kwargs) - return util.to_bytes(serialize, exclude) - - def from_bytes(self, bytes_data, exclude=tuple(), **kwargs): - """Load the pipe from a bytestring.""" - - def load_model(b): - # TODO: Remove this once we don't have to handle previous models - if self.cfg.get("pretrained_dims") and "pretrained_vectors" not in self.cfg: - self.cfg["pretrained_vectors"] = self.vocab.vectors.name - if self.model is True: - self.model = self.Model(**self.cfg) - try: - self.model.from_bytes(b) - except AttributeError: - raise ValueError(Errors.E149) - - deserialize = OrderedDict() - deserialize["cfg"] = lambda b: self.cfg.update(srsly.json_loads(b)) - if hasattr(self, "vocab"): - deserialize["vocab"] = lambda b: self.vocab.from_bytes(b) - deserialize["model"] = load_model - exclude = util.get_serialization_exclude(deserialize, exclude, kwargs) - util.from_bytes(bytes_data, deserialize, exclude) - return self - - def to_disk(self, path, exclude=tuple(), **kwargs): - """Serialize the pipe to disk.""" - serialize = OrderedDict() - serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg) - serialize["vocab"] = lambda p: self.vocab.to_disk(p) - if self.model not in (None, True, False): - serialize["model"] = lambda p: self.model.to_disk(p) - exclude = util.get_serialization_exclude(serialize, exclude, kwargs) - util.to_disk(path, serialize, exclude) - - def from_disk(self, path, exclude=tuple(), **kwargs): - """Load the pipe from disk.""" - - def load_model(p): - # TODO: Remove this once we don't have to handle previous models - if self.cfg.get("pretrained_dims") and "pretrained_vectors" not in self.cfg: - self.cfg["pretrained_vectors"] = self.vocab.vectors.name - if self.model is True: - self.model = self.Model(**self.cfg) - try: - self.model.from_bytes(p.open("rb").read()) - except AttributeError: - raise ValueError(Errors.E149) - - deserialize = OrderedDict() - deserialize["cfg"] = lambda p: self.cfg.update(_load_cfg(p)) - deserialize["vocab"] = lambda p: self.vocab.from_disk(p) - deserialize["model"] = load_model - exclude = util.get_serialization_exclude(deserialize, exclude, kwargs) - util.from_disk(path, deserialize, exclude) - return self - - -@component("tensorizer", assigns=["doc.tensor"]) -class Tensorizer(Pipe): - """Pre-train position-sensitive vectors for tokens.""" - - @classmethod - def Model(cls, output_size=300, **cfg): - """Create a new statistical model for the class. - - width (int): Output size of the model. - embed_size (int): Number of vectors in the embedding table. - **cfg: Config parameters. - RETURNS (Model): A `thinc.neural.Model` or similar instance. - """ - input_size = util.env_opt("token_vector_width", cfg.get("input_size", 96)) - return zero_init(Affine(output_size, input_size, drop_factor=0.0)) - - def __init__(self, vocab, model=True, **cfg): - """Construct a new statistical model. Weights are not allocated on - initialisation. - - vocab (Vocab): A `Vocab` instance. The model must share the same - `Vocab` instance with the `Doc` objects it will process. - model (Model): A `Model` instance or `True` to allocate one later. - **cfg: Config parameters. 
- - EXAMPLE: - >>> from spacy.pipeline import TokenVectorEncoder - >>> tok2vec = TokenVectorEncoder(nlp.vocab) - >>> tok2vec.model = tok2vec.Model(128, 5000) - """ - self.vocab = vocab - self.model = model - self.input_models = [] - self.cfg = dict(cfg) - self.cfg.setdefault("cnn_maxout_pieces", 3) - - def __call__(self, doc): - """Add context-sensitive vectors to a `Doc`, e.g. from a CNN or LSTM - model. Vectors are set to the `Doc.tensor` attribute. - - docs (Doc or iterable): One or more documents to add vectors to. - RETURNS (dict or None): Intermediate computations. - """ - tokvecses = self.predict([doc]) - self.set_annotations([doc], tokvecses) - return doc - - def pipe(self, stream, batch_size=128, n_threads=-1): - """Process `Doc` objects as a stream. - - stream (iterator): A sequence of `Doc` objects to process. - batch_size (int): Number of `Doc` objects to group. - YIELDS (iterator): A sequence of `Doc` objects, in order of input. - """ - for docs in util.minibatch(stream, size=batch_size): - docs = list(docs) - tensors = self.predict(docs) - self.set_annotations(docs, tensors) - yield from docs - - def predict(self, docs): - """Return a single tensor for a batch of documents. - - docs (iterable): A sequence of `Doc` objects. - RETURNS (object): Vector representations for each token in the docs. - """ - self.require_model() - inputs = self.model.ops.flatten([doc.tensor for doc in docs]) - outputs = self.model(inputs) - return self.model.ops.unflatten(outputs, [len(d) for d in docs]) - - def set_annotations(self, docs, tensors): - """Set the tensor attribute for a batch of documents. - - docs (iterable): A sequence of `Doc` objects. - tensors (object): Vector representation for each token in the docs. - """ - for doc, tensor in zip(docs, tensors): - if tensor.shape[0] != len(doc): - raise ValueError(Errors.E076.format(rows=tensor.shape[0], words=len(doc))) - doc.tensor = tensor - - def update(self, docs, golds, state=None, drop=0.0, sgd=None, losses=None): - """Update the model. - - docs (iterable): A batch of `Doc` objects. - golds (iterable): A batch of `GoldParse` objects. - drop (float): The dropout rate. - sgd (callable): An optimizer. - RETURNS (dict): Results from the update. - """ - self.require_model() - if isinstance(docs, Doc): - docs = [docs] - inputs = [] - bp_inputs = [] - for tok2vec in self.input_models: - tensor, bp_tensor = tok2vec.begin_update(docs, drop=drop) - inputs.append(tensor) - bp_inputs.append(bp_tensor) - inputs = self.model.ops.xp.hstack(inputs) - scores, bp_scores = self.model.begin_update(inputs, drop=drop) - loss, d_scores = self.get_loss(docs, golds, scores) - d_inputs = bp_scores(d_scores, sgd=sgd) - d_inputs = self.model.ops.xp.split(d_inputs, len(self.input_models), axis=1) - for d_input, bp_input in zip(d_inputs, bp_inputs): - bp_input(d_input, sgd=sgd) - if losses is not None: - losses.setdefault(self.name, 0.0) - losses[self.name] += loss - return loss - - def get_loss(self, docs, golds, prediction): - ids = self.model.ops.flatten([doc.to_array(ID).ravel() for doc in docs]) - target = self.vocab.vectors.data[ids] - d_scores = (prediction - target) / prediction.shape[0] - loss = (d_scores ** 2).sum() - return loss, d_scores - - def begin_training(self, gold_tuples=lambda: [], pipeline=None, sgd=None, **kwargs): - """Allocate models, pre-process training data and acquire an - optimizer. - - gold_tuples (iterable): Gold-standard training data. - pipeline (list): The pipeline the model is part of. 
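# A small numpy sketch of the computation in `get_loss` above: the gradient is
# the difference between prediction and target rows scaled by the batch size,
# and the reported loss is the squared norm of that gradient. The arrays are
# invented for illustration.
import numpy

prediction = numpy.asarray([[1.0, 2.0], [0.0, 1.0]], dtype="f")
target = numpy.asarray([[1.5, 1.5], [0.0, 0.0]], dtype="f")
d_scores = (prediction - target) / prediction.shape[0]
loss = (d_scores ** 2).sum()
print(loss)  # 0.375 for these values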
- """ - if pipeline is not None: - for name, model in pipeline: - if getattr(model, "tok2vec", None): - self.input_models.append(model.tok2vec) - if self.model is True: - self.model = self.Model(**self.cfg) - link_vectors_to_models(self.vocab) - if sgd is None: - sgd = self.create_optimizer() - return sgd - - -@component("tagger", assigns=["token.tag", "token.pos", "token.lemma"]) -class Tagger(Pipe): - """Pipeline component for part-of-speech tagging. - - DOCS: https://spacy.io/api/tagger - """ - - def __init__(self, vocab, model=True, **cfg): - self.vocab = vocab - self.model = model - self._rehearsal_model = None - self.cfg = OrderedDict(sorted(cfg.items())) - self.cfg.setdefault("cnn_maxout_pieces", 2) - - @property - def labels(self): - return tuple(self.vocab.morphology.tag_names) - - @property - def tok2vec(self): - if self.model in (None, True, False): - return None - else: - return chain(self.model.tok2vec, flatten) - - def __call__(self, doc): - tags, tokvecs = self.predict([doc]) - self.set_annotations([doc], tags, tensors=tokvecs) - return doc - - def pipe(self, stream, batch_size=128, n_threads=-1): - for docs in util.minibatch(stream, size=batch_size): - docs = list(docs) - tag_ids, tokvecs = self.predict(docs) - self.set_annotations(docs, tag_ids, tensors=tokvecs) - yield from docs - - def predict(self, docs): - self.require_model() - if not any(len(doc) for doc in docs): - # Handle cases where there are no tokens in any docs. - n_labels = len(self.labels) - guesses = [self.model.ops.allocate((0, n_labels)) for doc in docs] - tokvecs = self.model.ops.allocate((0, self.model.tok2vec.nO)) - return guesses, tokvecs - tokvecs = self.model.tok2vec(docs) - scores = self.model.softmax(tokvecs) - guesses = [] - for doc_scores in scores: - doc_guesses = doc_scores.argmax(axis=1) - if not isinstance(doc_guesses, numpy.ndarray): - doc_guesses = doc_guesses.get() - guesses.append(doc_guesses) - return guesses, tokvecs - - def set_annotations(self, docs, batch_tag_ids, tensors=None): - if isinstance(docs, Doc): - docs = [docs] - cdef Doc doc - cdef int idx = 0 - cdef Vocab vocab = self.vocab - assign_morphology = self.cfg.get("set_morphology", True) - for i, doc in enumerate(docs): - doc_tag_ids = batch_tag_ids[i] - if hasattr(doc_tag_ids, "get"): - doc_tag_ids = doc_tag_ids.get() - for j, tag_id in enumerate(doc_tag_ids): - # Don't clobber preset POS tags - if doc.c[j].tag == 0: - if doc.c[j].pos == 0 and assign_morphology: - # Don't clobber preset lemmas - lemma = doc.c[j].lemma - vocab.morphology.assign_tag_id(&doc.c[j], tag_id) - if lemma != 0 and lemma != doc.c[j].lex.orth: - doc.c[j].lemma = lemma - else: - doc.c[j].tag = self.vocab.strings[self.labels[tag_id]] - idx += 1 - if tensors is not None and len(tensors): - if isinstance(doc.tensor, numpy.ndarray) \ - and not isinstance(tensors[i], numpy.ndarray): - doc.extend_tensor(tensors[i].get()) - else: - doc.extend_tensor(tensors[i]) - doc.is_tagged = True - - def update(self, docs, golds, drop=0., sgd=None, losses=None): - self.require_model() - if losses is not None and self.name not in losses: - losses[self.name] = 0. - - if not any(len(doc) for doc in docs): - # Handle cases where there are no tokens in any docs. 
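# A small numpy sketch of the prediction step above: per-token tag scores are
# reduced to tag ids with argmax and then mapped to tag strings. The tag
# inventory and scores are invented; on GPU the argmax output is first copied
# to CPU via `.get()`, as `predict` above does.
import numpy

labels = ("DET", "NOUN", "VERB")
scores = numpy.asarray([[0.1, 0.7, 0.2],
                        [0.2, 0.1, 0.7]], dtype="f")
guesses = scores.argmax(axis=1)          # array([1, 2])
print([labels[i] for i in guesses])      # ['NOUN', 'VERB']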
- return - - tag_scores, bp_tag_scores = self.model.begin_update(docs, drop=drop) - loss, d_tag_scores = self.get_loss(docs, golds, tag_scores) - bp_tag_scores(d_tag_scores, sgd=sgd) - - if losses is not None: - losses[self.name] += loss - - def rehearse(self, docs, drop=0., sgd=None, losses=None): - """Perform a 'rehearsal' update, where we try to match the output of - an initial model. - """ - if self._rehearsal_model is None: - return - if not any(len(doc) for doc in docs): - # Handle cases where there are no tokens in any docs. - return - guesses, backprop = self.model.begin_update(docs, drop=drop) - target = self._rehearsal_model(docs) - gradient = guesses - target - backprop(gradient, sgd=sgd) - if losses is not None: - losses.setdefault(self.name, 0.0) - losses[self.name] += (gradient**2).sum() - - def get_loss(self, docs, golds, scores): - scores = self.model.ops.flatten(scores) - tag_index = {tag: i for i, tag in enumerate(self.labels)} - cdef int idx = 0 - correct = numpy.zeros((scores.shape[0],), dtype="i") - guesses = scores.argmax(axis=1) - known_labels = numpy.ones((scores.shape[0], 1), dtype="f") - for gold in golds: - for tag in gold.tags: - if tag is None: - correct[idx] = guesses[idx] - elif tag in tag_index: - correct[idx] = tag_index[tag] - else: - correct[idx] = 0 - known_labels[idx] = 0. - idx += 1 - correct = self.model.ops.xp.array(correct, dtype="i") - d_scores = scores - to_categorical(correct, nb_classes=scores.shape[1]) - d_scores *= self.model.ops.asarray(known_labels) - loss = (d_scores**2).sum() - d_scores = self.model.ops.unflatten(d_scores, [len(d) for d in docs]) - return float(loss), d_scores - - def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None, - **kwargs): - lemma_tables = ["lemma_rules", "lemma_index", "lemma_exc", "lemma_lookup"] - if not any(table in self.vocab.lookups for table in lemma_tables): - warnings.warn(Warnings.W022) - if len(self.vocab.lookups.get_table("lexeme_norm", {})) == 0: - warnings.warn(Warnings.W033.format(model="part-of-speech tagger")) - try: - import spacy_lookups_data - except ImportError: - if self.vocab.lang in ("da", "de", "el", "en", "id", "lb", "pt", - "ru", "sr", "ta", "th"): - warnings.warn(Warnings.W034.format(lang=self.vocab.lang)) - orig_tag_map = dict(self.vocab.morphology.tag_map) - new_tag_map = OrderedDict() - for raw_text, annots_brackets in get_gold_tuples(): - for annots, brackets in annots_brackets: - ids, words, tags, heads, deps, ents = annots - for tag in tags: - if tag in orig_tag_map: - new_tag_map[tag] = orig_tag_map[tag] - else: - new_tag_map[tag] = {POS: X} - cdef Vocab vocab = self.vocab - if new_tag_map: - if "_SP" in orig_tag_map: - new_tag_map["_SP"] = orig_tag_map["_SP"] - vocab.morphology = Morphology(vocab.strings, new_tag_map, - vocab.morphology.lemmatizer, - exc=vocab.morphology.exc) - self.cfg["pretrained_vectors"] = kwargs.get("pretrained_vectors") - if self.model is True: - for hp in ["token_vector_width", "conv_depth"]: - if hp in kwargs: - self.cfg[hp] = kwargs[hp] - self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg) - link_vectors_to_models(self.vocab) - if sgd is None: - sgd = self.create_optimizer() - return sgd - - @classmethod - def Model(cls, n_tags, **cfg): - if cfg.get("pretrained_dims") and not cfg.get("pretrained_vectors"): - raise ValueError(TempErrors.T008) - return build_tagger_model(n_tags, **cfg) - - def add_label(self, label, values=None): - if not isinstance(label, basestring_): - raise ValueError(Errors.E187) - if label in 
self.labels: - return 0 - if self.model not in (True, False, None): - # Here's how the model resizing will work, once the - # neuron-to-tag mapping is no longer controlled by - # the Morphology class, which sorts the tag names. - # The sorting makes adding labels difficult. - # smaller = self.model._layers[-1] - # larger = Softmax(len(self.labels)+1, smaller.nI) - # copy_array(larger.W[:smaller.nO], smaller.W) - # copy_array(larger.b[:smaller.nO], smaller.b) - # self.model._layers[-1] = larger - raise ValueError(TempErrors.T003) - tag_map = dict(self.vocab.morphology.tag_map) - if values is None: - values = {POS: "X"} - tag_map[label] = values - self.vocab.morphology = Morphology( - self.vocab.strings, tag_map=tag_map, - lemmatizer=self.vocab.morphology.lemmatizer, - exc=self.vocab.morphology.exc) - return 1 - - def use_params(self, params): - with self.model.use_params(params): - yield - - def to_bytes(self, exclude=tuple(), **kwargs): - serialize = OrderedDict() - if self.model not in (None, True, False): - serialize["model"] = self.model.to_bytes - serialize["vocab"] = self.vocab.to_bytes - serialize["cfg"] = lambda: srsly.json_dumps(self.cfg) - tag_map = OrderedDict(sorted(self.vocab.morphology.tag_map.items())) - serialize["tag_map"] = lambda: srsly.msgpack_dumps(tag_map) - exclude = util.get_serialization_exclude(serialize, exclude, kwargs) - return util.to_bytes(serialize, exclude) - - def from_bytes(self, bytes_data, exclude=tuple(), **kwargs): - def load_model(b): - # TODO: Remove this once we don't have to handle previous models - if self.cfg.get("pretrained_dims") and "pretrained_vectors" not in self.cfg: - self.cfg["pretrained_vectors"] = self.vocab.vectors.name - if self.model is True: - token_vector_width = util.env_opt( - "token_vector_width", - self.cfg.get("token_vector_width", 96)) - self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg) - try: - self.model.from_bytes(b) - except AttributeError: - raise ValueError(Errors.E149) - - def load_tag_map(b): - tag_map = srsly.msgpack_loads(b) - self.vocab.morphology = Morphology( - self.vocab.strings, tag_map=tag_map, - lemmatizer=self.vocab.morphology.lemmatizer, - exc=self.vocab.morphology.exc) - - deserialize = OrderedDict(( - ("vocab", lambda b: self.vocab.from_bytes(b)), - ("tag_map", load_tag_map), - ("cfg", lambda b: self.cfg.update(srsly.json_loads(b))), - ("model", lambda b: load_model(b)), - )) - exclude = util.get_serialization_exclude(deserialize, exclude, kwargs) - util.from_bytes(bytes_data, deserialize, exclude) - return self - - def to_disk(self, path, exclude=tuple(), **kwargs): - tag_map = OrderedDict(sorted(self.vocab.morphology.tag_map.items())) - serialize = OrderedDict(( - ("vocab", lambda p: self.vocab.to_disk(p)), - ("tag_map", lambda p: srsly.write_msgpack(p, tag_map)), - ("model", lambda p: self.model.to_disk(p)), - ("cfg", lambda p: srsly.write_json(p, self.cfg)) - )) - exclude = util.get_serialization_exclude(serialize, exclude, kwargs) - util.to_disk(path, serialize, exclude) - - def from_disk(self, path, exclude=tuple(), **kwargs): - def load_model(p): - # TODO: Remove this once we don't have to handle previous models - if self.cfg.get("pretrained_dims") and "pretrained_vectors" not in self.cfg: - self.cfg["pretrained_vectors"] = self.vocab.vectors.name - if self.model is True: - self.model = self.Model(self.vocab.morphology.n_tags, **self.cfg) - with p.open("rb") as file_: - try: - self.model.from_bytes(file_.read()) - except AttributeError: - raise ValueError(Errors.E149) - - def 
load_tag_map(p): - tag_map = srsly.read_msgpack(p) - self.vocab.morphology = Morphology( - self.vocab.strings, tag_map=tag_map, - lemmatizer=self.vocab.morphology.lemmatizer, - exc=self.vocab.morphology.exc) - - deserialize = OrderedDict(( - ("cfg", lambda p: self.cfg.update(_load_cfg(p))), - ("vocab", lambda p: self.vocab.from_disk(p)), - ("tag_map", load_tag_map), - ("model", load_model), - )) - exclude = util.get_serialization_exclude(deserialize, exclude, kwargs) - util.from_disk(path, deserialize, exclude) - return self - - -@component("nn_labeller") -class MultitaskObjective(Tagger): - """Experimental: Assist training of a parser or tagger, by training a - side-objective. - """ - - def __init__(self, vocab, model=True, target='dep_tag_offset', **cfg): - self.vocab = vocab - self.model = model - if target == "dep": - self.make_label = self.make_dep - elif target == "tag": - self.make_label = self.make_tag - elif target == "ent": - self.make_label = self.make_ent - elif target == "dep_tag_offset": - self.make_label = self.make_dep_tag_offset - elif target == "ent_tag": - self.make_label = self.make_ent_tag - elif target == "sent_start": - self.make_label = self.make_sent_start - elif hasattr(target, "__call__"): - self.make_label = target - else: - raise ValueError(Errors.E016) - self.cfg = dict(cfg) - self.cfg.setdefault("cnn_maxout_pieces", 2) - - @property - def labels(self): - return self.cfg.setdefault("labels", {}) - - @labels.setter - def labels(self, value): - self.cfg["labels"] = value - - def set_annotations(self, docs, dep_ids, tensors=None): - pass - - def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, tok2vec=None, - sgd=None, **kwargs): - gold_tuples = nonproj.preprocess_training_data(get_gold_tuples()) - for raw_text, annots_brackets in gold_tuples: - for annots, brackets in annots_brackets: - ids, words, tags, heads, deps, ents = annots - for i in range(len(ids)): - label = self.make_label(i, words, tags, heads, deps, ents) - if label is not None and label not in self.labels: - self.labels[label] = len(self.labels) - if self.model is True: - token_vector_width = util.env_opt("token_vector_width") - self.model = self.Model(len(self.labels), tok2vec=tok2vec) - link_vectors_to_models(self.vocab) - if sgd is None: - sgd = self.create_optimizer() - return sgd - - @classmethod - def Model(cls, n_tags, tok2vec=None, **cfg): - token_vector_width = util.env_opt("token_vector_width", 96) - softmax = Softmax(n_tags, token_vector_width*2) - model = chain( - tok2vec, - LayerNorm(Maxout(token_vector_width*2, token_vector_width, pieces=3)), - softmax - ) - model.tok2vec = tok2vec - model.softmax = softmax - return model - - def predict(self, docs): - self.require_model() - tokvecs = self.model.tok2vec(docs) - scores = self.model.softmax(tokvecs) - return tokvecs, scores - - def get_loss(self, docs, golds, scores): - if len(docs) != len(golds): - raise ValueError(Errors.E077.format(value="loss", n_docs=len(docs), - n_golds=len(golds))) - cdef int idx = 0 - correct = numpy.zeros((scores.shape[0],), dtype="i") - guesses = scores.argmax(axis=1) - for i, gold in enumerate(golds): - for j in range(len(docs[i])): - # Handes alignment for tokenization differences - label = self.make_label(j, gold.words, gold.tags, - gold.heads, gold.labels, gold.ents) - if label is None or label not in self.labels: - correct[idx] = guesses[idx] - else: - correct[idx] = self.labels[label] - idx += 1 - correct = self.model.ops.xp.array(correct, dtype="i") - d_scores = scores - 
to_categorical(correct, nb_classes=scores.shape[1]) - loss = (d_scores**2).sum() - return float(loss), d_scores - - @staticmethod - def make_dep(i, words, tags, heads, deps, ents): - if deps[i] is None or heads[i] is None: - return None - return deps[i] - - @staticmethod - def make_tag(i, words, tags, heads, deps, ents): - return tags[i] - - @staticmethod - def make_ent(i, words, tags, heads, deps, ents): - if ents is None: - return None - return ents[i] - - @staticmethod - def make_dep_tag_offset(i, words, tags, heads, deps, ents): - if deps[i] is None or heads[i] is None: - return None - offset = heads[i] - i - offset = min(offset, 2) - offset = max(offset, -2) - return "%s-%s:%d" % (deps[i], tags[i], offset) - - @staticmethod - def make_ent_tag(i, words, tags, heads, deps, ents): - if ents is None or ents[i] is None: - return None - else: - return "%s-%s" % (tags[i], ents[i]) - - @staticmethod - def make_sent_start(target, words, tags, heads, deps, ents, cache=True, _cache={}): - """A multi-task objective for representing sentence boundaries, - using BILU scheme. (O is impossible) - - The implementation of this method uses an internal cache that relies - on the identity of the heads array, to avoid requiring a new piece - of gold data. You can pass cache=False if you know the cache will - do the wrong thing. - """ - assert len(words) == len(heads) - assert target < len(words), (target, len(words)) - if cache: - if id(heads) in _cache: - return _cache[id(heads)][target] - else: - for key in list(_cache.keys()): - _cache.pop(key) - sent_tags = ["I-SENT"] * len(words) - _cache[id(heads)] = sent_tags - else: - sent_tags = ["I-SENT"] * len(words) - - def _find_root(child): - seen = set([child]) - while child is not None and heads[child] != child: - seen.add(child) - child = heads[child] - return child - - sentences = {} - for i in range(len(words)): - root = _find_root(i) - if root is None: - sent_tags[i] = None - else: - sentences.setdefault(root, []).append(i) - for root, span in sorted(sentences.items()): - if len(span) == 1: - sent_tags[span[0]] = "U-SENT" - else: - sent_tags[span[0]] = "B-SENT" - sent_tags[span[-1]] = "L-SENT" - return sent_tags[target] - - -class ClozeMultitask(Pipe): - @classmethod - def Model(cls, vocab, tok2vec, **cfg): - if cfg["objective"] == "characters": - out_sizes = [256] * cfg.get("nr_char", 4) - output_layer = MultiSoftmax(out_sizes) - else: - output_size = vocab.vectors.data.shape[1] - output_layer = chain( - LayerNorm(Maxout(output_size, tok2vec.nO, pieces=3)), - zero_init(Affine(output_size, output_size, drop_factor=0.0)) - ) - model = chain(tok2vec, output_layer) - model = masked_language_model(vocab, model) - model.tok2vec = tok2vec - model.output_layer = output_layer - return model - - def __init__(self, vocab, model=True, **cfg): - self.vocab = vocab - self.model = model - self.cfg = cfg - self.cfg.setdefault("objective", "characters") - self.cfg.setdefault("nr_char", 4) - - def set_annotations(self, docs, dep_ids, tensors=None): - pass - - def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, - tok2vec=None, sgd=None, **kwargs): - link_vectors_to_models(self.vocab) - if self.model is True: - kwargs.update(self.cfg) - self.model = self.Model(self.vocab, tok2vec, **kwargs) - X = self.model.ops.allocate((5, self.model.tok2vec.nO)) - self.model.output_layer.begin_training(X) - if sgd is None: - sgd = self.create_optimizer() - return sgd - - def predict(self, docs): - self.require_model() - tokvecs = self.model.tok2vec(docs) - vectors = 
self.model.output_layer(tokvecs) - return tokvecs, vectors - - def get_loss(self, docs, vectors, prediction): - if self.cfg["objective"] == "characters": - loss, gradient = get_characters_loss(self.model.ops, docs, prediction) - else: - # The simplest way to implement this would be to vstack the - # token.vector values, but that's a bit inefficient, especially on GPU. - # Instead we fetch the index into the vectors table for each of our tokens, - # and look them up all at once. This prevents data copying. - ids = self.model.ops.flatten([doc.to_array(ID).ravel() for doc in docs]) - target = vectors[ids] - loss, gradient = get_cossim_loss(prediction, target, ignore_zeros=True) - return float(loss), gradient - - def update(self, docs, golds, drop=0., sgd=None, losses=None): - pass - - def rehearse(self, docs, drop=0., sgd=None, losses=None): - self.require_model() - if losses is not None and self.name not in losses: - losses[self.name] = 0. - predictions, bp_predictions = self.model.begin_update(docs, drop=drop) - loss, d_predictions = self.get_loss(docs, self.vocab.vectors.data, predictions) - bp_predictions(d_predictions, sgd=sgd) - - if losses is not None: - losses[self.name] += loss - - @staticmethod - def decode_utf8_predictions(char_array): - # The format alternates filling from start and end, and 255 is missing - words = [] - char_array = char_array.reshape((char_array.shape[0], -1, 256)) - nr_char = char_array.shape[1] - char_array = char_array.argmax(axis=-1) - for row in char_array: - starts = [chr(c) for c in row[::2] if c != 255] - ends = [chr(c) for c in row[1::2] if c != 255] - word = "".join(starts + list(reversed(ends))) - words.append(word) - return words - - -@component("textcat", assigns=["doc.cats"]) -class TextCategorizer(Pipe): - """Pipeline component for text classification. - - DOCS: https://spacy.io/api/textcategorizer - """ - - @classmethod - def Model(cls, nr_class=1, **cfg): - embed_size = util.env_opt("embed_size", 2000) - if "token_vector_width" in cfg: - token_vector_width = cfg["token_vector_width"] - else: - token_vector_width = util.env_opt("token_vector_width", 96) - if cfg.get("architecture") == "simple_cnn": - tok2vec = Tok2Vec(token_vector_width, embed_size, **cfg) - return build_simple_cnn_text_classifier(tok2vec, nr_class, **cfg) - elif cfg.get("architecture") == "bow": - return build_bow_text_classifier(nr_class, **cfg) - else: - return build_text_classifier(nr_class, **cfg) - - @property - def tok2vec(self): - if self.model in (None, True, False): - return None - else: - return self.model.tok2vec - - def __init__(self, vocab, model=True, **cfg): - self.vocab = vocab - self.model = model - self._rehearsal_model = None - self.cfg = dict(cfg) - - @property - def labels(self): - return tuple(self.cfg.setdefault("labels", [])) - - def require_labels(self): - """Raise an error if the component's model has no labels defined.""" - if not self.labels: - raise ValueError(Errors.E143.format(name=self.name)) - - @labels.setter - def labels(self, value): - self.cfg["labels"] = tuple(value) - - def pipe(self, stream, batch_size=128, n_threads=-1): - for docs in util.minibatch(stream, size=batch_size): - docs = list(docs) - scores, tensors = self.predict(docs) - self.set_annotations(docs, scores, tensors=tensors) - yield from docs - - def predict(self, docs): - self.require_model() - tensors = [doc.tensor for doc in docs] - - if not any(len(doc) for doc in docs): - # Handle cases where there are no tokens in any docs. 
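# A standalone sketch of the character packing that `decode_utf8_predictions`
# above reverses: slots alternate between characters taken from the start and
# the end of the word, and 255 marks a missing character. The encoder below is
# an assumption made only for the round-trip demonstration; the decoder mirrors
# the method above.
def encode_word(word, nr_char=4):
    row = [255] * nr_char
    left, right = 0, len(word) - 1
    for i in range(nr_char):
        if left > right:
            break                      # word shorter than nr_char: rest stays 255
        if i % 2 == 0:
            row[i] = ord(word[left])
            left += 1
        else:
            row[i] = ord(word[right])
            right -= 1
    return row

def decode_row(row):
    starts = [chr(c) for c in row[::2] if c != 255]
    ends = [chr(c) for c in row[1::2] if c != 255]
    return "".join(starts + list(reversed(ends)))

print(decode_row(encode_word("spam")))   # 'spam'
print(decode_row(encode_word("cat")))    # 'cat'
print(decode_row(encode_word("hello")))  # 'helo' -- middle characters are dropped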
- xp = get_array_module(tensors) - scores = xp.zeros((len(docs), len(self.labels))) - return scores, tensors - - scores = self.model(docs) - scores = self.model.ops.asarray(scores) - return scores, tensors - - def set_annotations(self, docs, scores, tensors=None): - for i, doc in enumerate(docs): - for j, label in enumerate(self.labels): - doc.cats[label] = float(scores[i, j]) - - def update(self, docs, golds, state=None, drop=0., sgd=None, losses=None): - self.require_model() - if not any(len(doc) for doc in docs): - # Handle cases where there are no tokens in any docs. - return - scores, bp_scores = self.model.begin_update(docs, drop=drop) - loss, d_scores = self.get_loss(docs, golds, scores) - bp_scores(d_scores, sgd=sgd) - if losses is not None: - losses.setdefault(self.name, 0.0) - losses[self.name] += loss - - def rehearse(self, docs, drop=0., sgd=None, losses=None): - if self._rehearsal_model is None: - return - if not any(len(doc) for doc in docs): - # Handle cases where there are no tokens in any docs. - return - scores, bp_scores = self.model.begin_update(docs, drop=drop) - target = self._rehearsal_model(docs) - gradient = scores - target - bp_scores(gradient, sgd=sgd) - if losses is not None: - losses.setdefault(self.name, 0.0) - losses[self.name] += (gradient**2).sum() - - def get_loss(self, docs, golds, scores): - truths = numpy.zeros((len(golds), len(self.labels)), dtype="f") - not_missing = numpy.ones((len(golds), len(self.labels)), dtype="f") - for i, gold in enumerate(golds): - for j, label in enumerate(self.labels): - if label in gold.cats: - truths[i, j] = gold.cats[label] - else: - not_missing[i, j] = 0. - truths = self.model.ops.asarray(truths) - not_missing = self.model.ops.asarray(not_missing) - d_scores = (scores-truths) / scores.shape[0] - d_scores *= not_missing - mean_square_error = (d_scores**2).sum(axis=1).mean() - return float(mean_square_error), d_scores - - def add_label(self, label): - if not isinstance(label, basestring_): - raise ValueError(Errors.E187) - if label in self.labels: - return 0 - if self.model not in (None, True, False): - # This functionality was available previously, but was broken. - # The problem is that we resize the last layer, but the last layer - # is actually just an ensemble. We're not resizing the child layers - # - a huge problem. - raise ValueError(Errors.E116) - # smaller = self.model._layers[-1] - # larger = Affine(len(self.labels)+1, smaller.nI) - # copy_array(larger.W[:smaller.nO], smaller.W) - # copy_array(larger.b[:smaller.nO], smaller.b) - # self.model._layers[-1] = larger - self.labels = tuple(list(self.labels) + [label]) - return 1 - - def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None, **kwargs): - for raw_text, annot_brackets in get_gold_tuples(): - for _, (cats, _2) in annot_brackets: - for cat in cats: - self.add_label(cat) - if self.model is True: - self.cfg["pretrained_vectors"] = kwargs.get("pretrained_vectors") - self.cfg["pretrained_dims"] = kwargs.get("pretrained_dims") - self.require_labels() - self.model = self.Model(len(self.labels), **self.cfg) - link_vectors_to_models(self.vocab) - if sgd is None: - sgd = self.create_optimizer() - return sgd - - -cdef class DependencyParser(Parser): - """Pipeline component for dependency parsing. 
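# A numpy sketch of the masking in `get_loss` above: gradients are zeroed for
# label cells that are missing from the gold annotation, so unannotated
# categories do not pull the model toward 0. The labels, gold cats and scores
# are invented.
import numpy

labels = ("POSITIVE", "NEGATIVE")
gold_cats = [{"POSITIVE": 1.0}, {}]       # second example has no gold annotation
scores = numpy.asarray([[0.8, 0.3], [0.6, 0.4]], dtype="f")
truths = numpy.zeros((len(gold_cats), len(labels)), dtype="f")
not_missing = numpy.ones((len(gold_cats), len(labels)), dtype="f")
for i, cats in enumerate(gold_cats):
    for j, label in enumerate(labels):
        if label in cats:
            truths[i, j] = cats[label]
        else:
            not_missing[i, j] = 0.0
d_scores = (scores - truths) / scores.shape[0]
d_scores *= not_missing                    # no gradient where the gold value is unknown
mean_square_error = (d_scores ** 2).sum(axis=1).mean()
print(mean_square_error)                   # 0.005 for these values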
- - DOCS: https://spacy.io/api/dependencyparser - """ - # cdef classes can't have decorators, so we're defining this here - name = "parser" - factory = "parser" - assigns = ["token.dep", "token.is_sent_start", "doc.sents"] - requires = [] - TransitionSystem = ArcEager - nr_feature = 8 - - @property - def postprocesses(self): - output = [nonproj.deprojectivize] - if self.cfg.get("learn_tokens") is True: - output.append(merge_subtokens) - return tuple(output) - - def add_multitask_objective(self, target): - if target == "cloze": - cloze = ClozeMultitask(self.vocab) - self._multitasks.append(cloze) - else: - labeller = MultitaskObjective(self.vocab, target=target) - self._multitasks.append(labeller) - - def init_multitask_objectives(self, get_gold_tuples, pipeline, sgd=None, **cfg): - for labeller in self._multitasks: - tok2vec = self.model.tok2vec - labeller.begin_training(get_gold_tuples, pipeline=pipeline, - tok2vec=tok2vec, sgd=sgd) - - def __reduce__(self): - return (DependencyParser, (self.vocab, self.moves, self.model), None, None) - - @property - def labels(self): - labels = set() - # Get the labels from the model by looking at the available moves - for move in self.move_names: - if "-" in move: - label = move.split("-")[1] - if "||" in label: - label = label.split("||")[1] - labels.add(label) - return tuple(sorted(labels)) - - -cdef class EntityRecognizer(Parser): - """Pipeline component for named entity recognition. - - DOCS: https://spacy.io/api/entityrecognizer - """ - name = "ner" - factory = "ner" - assigns = ["doc.ents", "token.ent_iob", "token.ent_type"] - requires = [] - TransitionSystem = BiluoPushDown - nr_feature = 6 - - def add_multitask_objective(self, target): - if target == "cloze": - cloze = ClozeMultitask(self.vocab) - self._multitasks.append(cloze) - else: - labeller = MultitaskObjective(self.vocab, target=target) - self._multitasks.append(labeller) - - def init_multitask_objectives(self, get_gold_tuples, pipeline, sgd=None, **cfg): - for labeller in self._multitasks: - tok2vec = self.model.tok2vec - labeller.begin_training(get_gold_tuples, pipeline=pipeline, - tok2vec=tok2vec) - - def __reduce__(self): - return (EntityRecognizer, (self.vocab, self.moves, self.model), - None, None) - - @property - def labels(self): - # Get the labels from the model by looking at the available moves, e.g. - # B-PERSON, I-PERSON, L-PERSON, U-PERSON - labels = set(move.split("-")[1] for move in self.move_names - if move[0] in ("B", "I", "L", "U")) - return tuple(sorted(labels)) - - -@component( - "entity_linker", - requires=["doc.ents", "doc.sents", "token.ent_iob", "token.ent_type"], - assigns=["token.ent_kb_id"] -) -class EntityLinker(Pipe): - """Pipeline component for named entity linking. - - DOCS: https://spacy.io/api/entitylinker - """ - NIL = "NIL" # string used to refer to a non-existing link - - @classmethod - def Model(cls, **cfg): - embed_width = cfg.get("embed_width", 300) - hidden_width = cfg.get("hidden_width", 128) - type_to_int = cfg.get("type_to_int", dict()) - - model = build_nel_encoder(embed_width=embed_width, hidden_width=hidden_width, ner_types=len(type_to_int), **cfg) - return model - - def __init__(self, vocab, **cfg): - self.vocab = vocab - self.model = True - self.kb = None - self.cfg = dict(cfg) - - # how many neighbour sentences to take into account - self.n_sents = cfg.get("n_sents", 0) - - def set_kb(self, kb): - self.kb = kb - - def require_model(self): - # Raise an error if the component's model is not initialized. 
- if getattr(self, "model", None) in (None, True, False): - raise ValueError(Errors.E109.format(name=self.name)) - - def require_kb(self): - # Raise an error if the knowledge base is not initialized. - if getattr(self, "kb", None) in (None, True, False): - raise ValueError(Errors.E139.format(name=self.name)) - - def begin_training(self, get_gold_tuples=lambda: [], pipeline=None, sgd=None, **kwargs): - self.require_kb() - self.cfg["entity_width"] = self.kb.entity_vector_length - - if self.model is True: - self.model = self.Model(**self.cfg) - - if sgd is None: - sgd = self.create_optimizer() - - return sgd - - def update(self, docs, golds, state=None, drop=0.0, sgd=None, losses=None): - self.require_model() - self.require_kb() - - if losses is not None: - losses.setdefault(self.name, 0.0) - - if not docs or not golds: - return 0 - - if len(docs) != len(golds): - raise ValueError(Errors.E077.format(value="EL training", n_docs=len(docs), - n_golds=len(golds))) - - if isinstance(docs, Doc): - docs = [docs] - golds = [golds] - - sentence_docs = [] - - for doc, gold in zip(docs, golds): - ents_by_offset = dict() - - sentences = [s for s in doc.sents] - - for ent in doc.ents: - ents_by_offset[(ent.start_char, ent.end_char)] = ent - - for entity, kb_dict in gold.links.items(): - start, end = entity - mention = doc.text[start:end] - - # the gold annotations should link to proper entities - if this fails, the dataset is likely corrupt - if not (start, end) in ents_by_offset: - raise RuntimeError(Errors.E188) - - ent = ents_by_offset[(start, end)] - - for kb_id, value in kb_dict.items(): - # Currently only training on the positive instances - if value: - try: - # find the sentence in the list of sentences. - sent_index = sentences.index(ent.sent) - - except AttributeError: - # Catch the exception when ent.sent is None and provide a user-friendly warning - raise RuntimeError(Errors.E030) - - # get n previous sentences, if there are any - start_sentence = max(0, sent_index - self.n_sents) - - # get n posterior sentences, or as many < n as there are - end_sentence = min(len(sentences) -1, sent_index + self.n_sents) - - # get token positions - start_token = sentences[start_sentence].start - end_token = sentences[end_sentence].end - - # append that span as a doc to training - sent_doc = doc[start_token:end_token].as_doc() - sentence_docs.append(sent_doc) - - sentence_encodings, bp_context = self.model.begin_update(sentence_docs, drop=drop) - loss, d_scores = self.get_similarity_loss(scores=sentence_encodings, golds=golds, docs=None) - bp_context(d_scores, sgd=sgd) - - if losses is not None: - losses[self.name] += loss - return loss - - def get_similarity_loss(self, docs, golds, scores): - entity_encodings = [] - for gold in golds: - for entity, kb_dict in gold.links.items(): - for kb_id, value in kb_dict.items(): - # this loss function assumes we're only using positive examples - if value: - entity_encoding = self.kb.get_vector(kb_id) - entity_encodings.append(entity_encoding) - - entity_encodings = self.model.ops.asarray(entity_encodings, dtype="float32") - - if scores.shape != entity_encodings.shape: - raise RuntimeError(Errors.E147.format(method="get_loss", msg="gold entities do not match up")) - - loss, gradients = get_cossim_loss(yh=scores, y=entity_encodings) - loss = loss / len(entity_encodings) - return loss, gradients - - def get_loss(self, docs, golds, scores): - cats = [] - for gold in golds: - for entity, kb_dict in gold.links.items(): - for kb_id, value in kb_dict.items(): - 
cats.append([value]) - - cats = self.model.ops.asarray(cats, dtype="float32") - if len(scores) != len(cats): - raise RuntimeError(Errors.E147.format(method="get_loss", msg="gold entities do not match up")) - - d_scores = (scores - cats) - loss = (d_scores ** 2).sum() - loss = loss / len(cats) - return loss, d_scores - - def __call__(self, doc): - kb_ids, tensors = self.predict([doc]) - self.set_annotations([doc], kb_ids, tensors=tensors) - return doc - - def pipe(self, stream, batch_size=128, n_threads=-1): - for docs in util.minibatch(stream, size=batch_size): - docs = list(docs) - kb_ids, tensors = self.predict(docs) - self.set_annotations(docs, kb_ids, tensors=tensors) - yield from docs - - def predict(self, docs): - """ Return the KB IDs for each entity in each doc, including NIL if there is no prediction """ - self.require_model() - self.require_kb() - - entity_count = 0 - final_kb_ids = [] - final_tensors = [] - - if not docs: - return final_kb_ids, final_tensors - - if isinstance(docs, Doc): - docs = [docs] - - - for i, doc in enumerate(docs): - sentences = [s for s in doc.sents] - - if len(doc) > 0: - # Looping through each sentence and each entity - # This may go wrong if there are entities across sentences - which shouldn't happen normally. - for sent_index, sent in enumerate(sentences): - if sent.ents: - # get n_neighbour sentences, clipped to the length of the document - start_sentence = max(0, sent_index - self.n_sents) - end_sentence = min(len(sentences) -1, sent_index + self.n_sents) - - start_token = sentences[start_sentence].start - end_token = sentences[end_sentence].end - - sent_doc = doc[start_token:end_token].as_doc() - - # currently, the context is the same for each entity in a sentence (should be refined) - sentence_encoding = self.model([sent_doc])[0] - xp = get_array_module(sentence_encoding) - sentence_encoding_t = sentence_encoding.T - sentence_norm = xp.linalg.norm(sentence_encoding_t) - - for ent in sent.ents: - entity_count += 1 - - to_discard = self.cfg.get("labels_discard", []) - if to_discard and ent.label_ in to_discard: - # ignoring this entity - setting to NIL - final_kb_ids.append(self.NIL) - final_tensors.append(sentence_encoding) - - else: - candidates = self.kb.get_candidates(ent.text) - if not candidates: - # no prediction possible for this entity - setting to NIL - final_kb_ids.append(self.NIL) - final_tensors.append(sentence_encoding) - - elif len(candidates) == 1: - # shortcut for efficiency reasons: take the 1 candidate - - # TODO: thresholding - final_kb_ids.append(candidates[0].entity_) - final_tensors.append(sentence_encoding) - - else: - random.shuffle(candidates) - - # this will set all prior probabilities to 0 if they should be excluded from the model - prior_probs = xp.asarray([c.prior_prob for c in candidates]) - if not self.cfg.get("incl_prior", True): - prior_probs = xp.asarray([0.0 for c in candidates]) - scores = prior_probs - - # add in similarity from the context - if self.cfg.get("incl_context", True): - entity_encodings = xp.asarray([c.entity_vector for c in candidates]) - entity_norm = xp.linalg.norm(entity_encodings, axis=1) - - if len(entity_encodings) != len(prior_probs): - raise RuntimeError(Errors.E147.format(method="predict", msg="vectors not of equal length")) - - # cosine similarity - sims = xp.dot(entity_encodings, sentence_encoding_t) / (sentence_norm * entity_norm) - if sims.shape != prior_probs.shape: - raise ValueError(Errors.E161) - scores = prior_probs + sims - (prior_probs*sims) - - # TODO: thresholding - 
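# A small numpy illustration of the combination used just above: prior
# probability and context similarity are merged with a probabilistic OR,
# score = prior + sim - prior * sim, so either a strong prior or a strong
# context match can carry a candidate. The numbers are invented.
import numpy

prior_probs = numpy.asarray([0.70, 0.20, 0.05])
sims = numpy.asarray([0.10, 0.90, 0.30])
combined = prior_probs + sims - (prior_probs * sims)   # [0.73, 0.92, 0.335]
print(combined.argmax())  # 1 -- the strong context match outranks the strong prior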
best_index = scores.argmax() - best_candidate = candidates[best_index] - final_kb_ids.append(best_candidate.entity_) - final_tensors.append(sentence_encoding) - - if not (len(final_tensors) == len(final_kb_ids) == entity_count): - raise RuntimeError(Errors.E147.format(method="predict", msg="result variables not of equal length")) - - return final_kb_ids, final_tensors - - def set_annotations(self, docs, kb_ids, tensors=None): - count_ents = len([ent for doc in docs for ent in doc.ents]) - if count_ents != len(kb_ids): - raise ValueError(Errors.E148.format(ents=count_ents, ids=len(kb_ids))) - - i=0 - for doc in docs: - for ent in doc.ents: - kb_id = kb_ids[i] - i += 1 - for token in ent: - token.ent_kb_id_ = kb_id - - def to_disk(self, path, exclude=tuple(), **kwargs): - serialize = OrderedDict() - serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg) - serialize["vocab"] = lambda p: self.vocab.to_disk(p) - serialize["kb"] = lambda p: self.kb.dump(p) - if self.model not in (None, True, False): - serialize["model"] = lambda p: self.model.to_disk(p) - exclude = util.get_serialization_exclude(serialize, exclude, kwargs) - util.to_disk(path, serialize, exclude) - - def from_disk(self, path, exclude=tuple(), **kwargs): - def load_model(p): - if self.model is True: - self.model = self.Model(**self.cfg) - try: - self.model.from_bytes(p.open("rb").read()) - except AttributeError: - raise ValueError(Errors.E149) - - def load_kb(p): - kb = KnowledgeBase(vocab=self.vocab, entity_vector_length=self.cfg["entity_width"]) - kb.load_bulk(p) - self.set_kb(kb) - - deserialize = OrderedDict() - deserialize["cfg"] = lambda p: self.cfg.update(_load_cfg(p)) - deserialize["vocab"] = lambda p: self.vocab.from_disk(p) - deserialize["kb"] = load_kb - deserialize["model"] = load_model - exclude = util.get_serialization_exclude(deserialize, exclude, kwargs) - util.from_disk(path, deserialize, exclude) - return self - - def rehearse(self, docs, sgd=None, losses=None, **config): - raise NotImplementedError - - def add_label(self, label): - raise NotImplementedError - - -@component("sentencizer", assigns=["token.is_sent_start", "doc.sents"]) -class Sentencizer(object): - """Segment the Doc into sentences using a rule-based strategy. - - DOCS: https://spacy.io/api/sentencizer - """ - - default_punct_chars = ['!', '.', '?', '։', '؟', '۔', '܀', '܁', '܂', '߹', - ':', ';', '؟', - '।', '॥', '၊', '။', '።', '፧', '፨', '᙮', '᜵', '᜶', '᠃', '᠉', '᥄', - '᥅', '᪨', '᪩', '᪪', '᪫', '᭚', '᭛', '᭞', '᭟', '᰻', '᰼', '᱾', '᱿', - '‼', '‽', '⁇', '⁈', '⁉', '⸮', '⸼', '꓿', '꘎', '꘏', '꛳', '꛷', '꡶', - '꡷', '꣎', '꣏', '꤯', '꧈', '꧉', '꩝', '꩞', '꩟', '꫰', '꫱', '꯫', '﹒', - '﹖', '﹗', '!', '.', '?', '𐩖', '𐩗', '𑁇', '𑁈', '𑂾', '𑂿', '𑃀', - '𑃁', '𑅁', '𑅂', '𑅃', '𑇅', '𑇆', '𑇍', '𑇞', '𑇟', '𑈸', '𑈹', '𑈻', '𑈼', - '𑊩', '𑑋', '𑑌', '𑗂', '𑗃', '𑗉', '𑗊', '𑗋', '𑗌', '𑗍', '𑗎', '𑗏', '𑗐', - '𑗑', '𑗒', '𑗓', '𑗔', '𑗕', '𑗖', '𑗗', '𑙁', '𑙂', '𑜼', '𑜽', '𑜾', '𑩂', - '𑩃', '𑪛', '𑪜', '𑱁', '𑱂', '𖩮', '𖩯', '𖫵', '𖬷', '𖬸', '𖭄', '𛲟', '𝪈', - '。', '。'] - - def __init__(self, punct_chars=None, **kwargs): - """Initialize the sentencizer. - - punct_chars (list): Punctuation characters to split on. Will be - serialized with the nlp object. - RETURNS (Sentencizer): The sentencizer component. 
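# A hedged usage sketch, assuming the "sentencizer" factory registered in the
# new spacy/pipeline/sentencizer.pyx later in this diff: the component is added
# by name and can be configured with a custom punct_chars list.
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer", config={"punct_chars": ["!", ".", "?"]})
doc = nlp("This is a sentence. This is another! And a third?")
print([sent.text for sent in doc.sents])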
- - DOCS: https://spacy.io/api/sentencizer#init - """ - if punct_chars: - self.punct_chars = set(punct_chars) - else: - self.punct_chars = set(self.default_punct_chars) - - @classmethod - def from_nlp(cls, nlp, **cfg): - return cls(**cfg) - - def __call__(self, doc): - """Apply the sentencizer to a Doc and set Token.is_sent_start. - - doc (Doc): The document to process. - RETURNS (Doc): The processed Doc. - - DOCS: https://spacy.io/api/sentencizer#call - """ - tags = self.predict([doc]) - self.set_annotations([doc], tags) - return doc - - def pipe(self, stream, batch_size=128, n_threads=-1): - for docs in util.minibatch(stream, size=batch_size): - docs = list(docs) - tag_ids = self.predict(docs) - self.set_annotations(docs, tag_ids) - yield from docs - - def predict(self, docs): - """Apply the pipeline's model to a batch of docs, without - modifying them. - """ - if not any(len(doc) for doc in docs): - # Handle cases where there are no tokens in any docs. - guesses = [[] for doc in docs] - return guesses - guesses = [] - for doc in docs: - doc_guesses = [False] * len(doc) - if len(doc) > 0: - start = 0 - seen_period = False - doc_guesses[0] = True - for i, token in enumerate(doc): - is_in_punct_chars = token.text in self.punct_chars - if seen_period and not token.is_punct and not is_in_punct_chars: - doc_guesses[start] = True - start = token.i - seen_period = False - elif is_in_punct_chars: - seen_period = True - if start < len(doc): - doc_guesses[start] = True - guesses.append(doc_guesses) - return guesses - - def set_annotations(self, docs, batch_tag_ids, tensors=None): - if isinstance(docs, Doc): - docs = [docs] - cdef Doc doc - cdef int idx = 0 - for i, doc in enumerate(docs): - doc_tag_ids = batch_tag_ids[i] - for j, tag_id in enumerate(doc_tag_ids): - # Don't clobber existing sentence boundaries - if doc.c[j].sent_start == 0: - if tag_id: - doc.c[j].sent_start = 1 - else: - doc.c[j].sent_start = -1 - - def to_bytes(self, **kwargs): - """Serialize the sentencizer to a bytestring. - - RETURNS (bytes): The serialized object. - - DOCS: https://spacy.io/api/sentencizer#to_bytes - """ - return srsly.msgpack_dumps({"punct_chars": list(self.punct_chars)}) - - def from_bytes(self, bytes_data, **kwargs): - """Load the sentencizer from a bytestring. - - bytes_data (bytes): The data to load. - returns (Sentencizer): The loaded object. - - DOCS: https://spacy.io/api/sentencizer#from_bytes - """ - cfg = srsly.msgpack_loads(bytes_data) - self.punct_chars = set(cfg.get("punct_chars", self.default_punct_chars)) - return self - - def to_disk(self, path, exclude=tuple(), **kwargs): - """Serialize the sentencizer to disk. - - DOCS: https://spacy.io/api/sentencizer#to_disk - """ - path = util.ensure_path(path) - path = path.with_suffix(".json") - srsly.write_json(path, {"punct_chars": list(self.punct_chars)}) - - - def from_disk(self, path, exclude=tuple(), **kwargs): - """Load the sentencizer from disk. 
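The rule the (old and new) Sentencizer applies can be seen in isolation with a small stand-alone sketch; plain strings stand in for tokens and token.is_punct is approximated by membership in punct_chars, so this is only an illustration.

def sentence_starts(words, punct_chars=(".", "!", "?")):
    # Mirrors Sentencizer.predict: the first token opens a sentence, and the
    # first non-punctuation token after sentence-final punctuation opens the next.
    starts = [False] * len(words)
    if not words:
        return starts
    starts[0] = True
    seen_period = False
    for i, word in enumerate(words):
        if seen_period and word not in punct_chars:
            starts[i] = True
            seen_period = False
        elif word in punct_chars:
            seen_period = True
    return starts

print(sentence_starts(["Hello", "world", ".", "This", "is", "it", "!"]))
# [True, False, False, True, False, False, False]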
- - DOCS: https://spacy.io/api/sentencizer#from_disk - """ - path = util.ensure_path(path) - path = path.with_suffix(".json") - cfg = srsly.read_json(path) - self.punct_chars = set(cfg.get("punct_chars", self.default_punct_chars)) - return self - - -# Cython classes can't be decorated, so we need to add the factories here -Language.factories["parser"] = lambda nlp, **cfg: DependencyParser.from_nlp(nlp, **cfg) -Language.factories["ner"] = lambda nlp, **cfg: EntityRecognizer.from_nlp(nlp, **cfg) - - -__all__ = ["Tagger", "DependencyParser", "EntityRecognizer", "Tensorizer", "TextCategorizer", "EntityLinker", "Sentencizer"] diff --git a/spacy/pipeline/sentencizer.pyx b/spacy/pipeline/sentencizer.pyx new file mode 100644 index 000000000..7656b330c --- /dev/null +++ b/spacy/pipeline/sentencizer.pyx @@ -0,0 +1,203 @@ +# cython: infer_types=True, profile=True, binding=True +import srsly +from typing import Optional, List + +from ..tokens.doc cimport Doc + +from .pipe import Pipe +from ..language import Language +from ..scorer import Scorer +from ..training import validate_examples +from .. import util + + +@Language.factory( + "sentencizer", + assigns=["token.is_sent_start", "doc.sents"], + default_config={"punct_chars": None}, + default_score_weights={"sents_f": 1.0, "sents_p": 0.0, "sents_r": 0.0}, +) +def make_sentencizer( + nlp: Language, + name: str, + punct_chars: Optional[List[str]] +): + return Sentencizer(name, punct_chars=punct_chars) + + +class Sentencizer(Pipe): + """Segment the Doc into sentences using a rule-based strategy. + + DOCS: https://nightly.spacy.io/api/sentencizer + """ + + default_punct_chars = ['!', '.', '?', '։', '؟', '۔', '܀', '܁', '܂', '߹', + '।', '॥', '၊', '။', '።', '፧', '፨', '᙮', '᜵', '᜶', '᠃', '᠉', '᥄', + '᥅', '᪨', '᪩', '᪪', '᪫', '᭚', '᭛', '᭞', '᭟', '᰻', '᰼', '᱾', '᱿', + '‼', '‽', '⁇', '⁈', '⁉', '⸮', '⸼', '꓿', '꘎', '꘏', '꛳', '꛷', '꡶', + '꡷', '꣎', '꣏', '꤯', '꧈', '꧉', '꩝', '꩞', '꩟', '꫰', '꫱', '꯫', '﹒', + '﹖', '﹗', '!', '.', '?', '𐩖', '𐩗', '𑁇', '𑁈', '𑂾', '𑂿', '𑃀', + '𑃁', '𑅁', '𑅂', '𑅃', '𑇅', '𑇆', '𑇍', '𑇞', '𑇟', '𑈸', '𑈹', '𑈻', '𑈼', + '𑊩', '𑑋', '𑑌', '𑗂', '𑗃', '𑗉', '𑗊', '𑗋', '𑗌', '𑗍', '𑗎', '𑗏', '𑗐', + '𑗑', '𑗒', '𑗓', '𑗔', '𑗕', '𑗖', '𑗗', '𑙁', '𑙂', '𑜼', '𑜽', '𑜾', '𑩂', + '𑩃', '𑪛', '𑪜', '𑱁', '𑱂', '𖩮', '𖩯', '𖫵', '𖬷', '𖬸', '𖭄', '𛲟', '𝪈', + '。', '。'] + + def __init__(self, name="sentencizer", *, punct_chars=None): + """Initialize the sentencizer. + + punct_chars (list): Punctuation characters to split on. Will be + serialized with the nlp object. + RETURNS (Sentencizer): The sentencizer component. + + DOCS: https://nightly.spacy.io/api/sentencizer#init + """ + self.name = name + if punct_chars: + self.punct_chars = set(punct_chars) + else: + self.punct_chars = set(self.default_punct_chars) + + def __call__(self, doc): + """Apply the sentencizer to a Doc and set Token.is_sent_start. + + doc (Doc): The document to process. + RETURNS (Doc): The processed Doc. + + DOCS: https://nightly.spacy.io/api/sentencizer#call + """ + start = 0 + seen_period = False + for i, token in enumerate(doc): + is_in_punct_chars = token.text in self.punct_chars + token.is_sent_start = i == 0 + if seen_period and not token.is_punct and not is_in_punct_chars: + doc[start].is_sent_start = True + start = token.i + seen_period = False + elif is_in_punct_chars: + seen_period = True + if start < len(doc): + doc[start].is_sent_start = True + return doc + + def pipe(self, stream, batch_size=128): + """Apply the pipe to a stream of documents. 
This usually happens under + the hood when the nlp object is called on a text and all components are + applied to the Doc. + + stream (Iterable[Doc]): A stream of documents. + batch_size (int): The number of documents to buffer. + YIELDS (Doc): Processed documents in order. + + DOCS: https://nightly.spacy.io/api/sentencizer#pipe + """ + for docs in util.minibatch(stream, size=batch_size): + predictions = self.predict(docs) + self.set_annotations(docs, predictions) + yield from docs + + def predict(self, docs): + """Apply the pipe to a batch of docs, without modifying them. + + docs (Iterable[Doc]): The documents to predict. + RETURNS: The predictions for each document. + """ + if not any(len(doc) for doc in docs): + # Handle cases where there are no tokens in any docs. + guesses = [[] for doc in docs] + return guesses + guesses = [] + for doc in docs: + doc_guesses = [False] * len(doc) + if len(doc) > 0: + start = 0 + seen_period = False + doc_guesses[0] = True + for i, token in enumerate(doc): + is_in_punct_chars = token.text in self.punct_chars + if seen_period and not token.is_punct and not is_in_punct_chars: + doc_guesses[start] = True + start = token.i + seen_period = False + elif is_in_punct_chars: + seen_period = True + if start < len(doc): + doc_guesses[start] = True + guesses.append(doc_guesses) + return guesses + + def set_annotations(self, docs, batch_tag_ids): + """Modify a batch of documents, using pre-computed scores. + + docs (Iterable[Doc]): The documents to modify. + scores: The tag IDs produced by Sentencizer.predict. + """ + if isinstance(docs, Doc): + docs = [docs] + cdef Doc doc + cdef int idx = 0 + for i, doc in enumerate(docs): + doc_tag_ids = batch_tag_ids[i] + for j, tag_id in enumerate(doc_tag_ids): + # Don't clobber existing sentence boundaries + if doc.c[j].sent_start == 0: + if tag_id: + doc.c[j].sent_start = 1 + else: + doc.c[j].sent_start = -1 + + def score(self, examples, **kwargs): + """Score a batch of examples. + + examples (Iterable[Example]): The examples to score. + RETURNS (Dict[str, Any]): The scores, produced by Scorer.score_spans. + + DOCS: https://nightly.spacy.io/api/sentencizer#score + """ + validate_examples(examples, "Sentencizer.score") + results = Scorer.score_spans(examples, "sents", **kwargs) + del results["sents_per_type"] + return results + + def to_bytes(self, *, exclude=tuple()): + """Serialize the sentencizer to a bytestring. + + RETURNS (bytes): The serialized object. + + DOCS: https://nightly.spacy.io/api/sentencizer#to_bytes + """ + return srsly.msgpack_dumps({"punct_chars": list(self.punct_chars)}) + + def from_bytes(self, bytes_data, *, exclude=tuple()): + """Load the sentencizer from a bytestring. + + bytes_data (bytes): The data to load. + returns (Sentencizer): The loaded object. + + DOCS: https://nightly.spacy.io/api/sentencizer#from_bytes + """ + cfg = srsly.msgpack_loads(bytes_data) + self.punct_chars = set(cfg.get("punct_chars", self.default_punct_chars)) + return self + + def to_disk(self, path, *, exclude=tuple()): + """Serialize the sentencizer to disk. + + DOCS: https://nightly.spacy.io/api/sentencizer#to_disk + """ + path = util.ensure_path(path) + path = path.with_suffix(".json") + srsly.write_json(path, {"punct_chars": list(self.punct_chars)}) + + + def from_disk(self, path, *, exclude=tuple()): + """Load the sentencizer from disk. 
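A brief usage sketch for the sentencizer factory registered above, assuming the v3 nightly add_pipe API; the punctuation list and example text are arbitrary.

import spacy

nlp = spacy.blank("en")
# "punct_chars" is the only setting exposed in the factory's default_config.
nlp.add_pipe("sentencizer", config={"punct_chars": [".", "!", "?", "。"]})
doc = nlp("This is a sentence. This is another one!")
print([sent.text for sent in doc.sents])
# ['This is a sentence.', 'This is another one!']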
+ + DOCS: https://nightly.spacy.io/api/sentencizer#from_disk + """ + path = util.ensure_path(path) + path = path.with_suffix(".json") + cfg = srsly.read_json(path) + self.punct_chars = set(cfg.get("punct_chars", self.default_punct_chars)) + return self diff --git a/spacy/pipeline/senter.pyx b/spacy/pipeline/senter.pyx new file mode 100644 index 000000000..15a21902a --- /dev/null +++ b/spacy/pipeline/senter.pyx @@ -0,0 +1,166 @@ +# cython: infer_types=True, profile=True, binding=True +from itertools import islice + +import srsly +from thinc.api import Model, SequenceCategoricalCrossentropy, Config + +from ..tokens.doc cimport Doc + +from .tagger import Tagger +from ..language import Language +from ..errors import Errors +from ..scorer import Scorer +from ..training import validate_examples, validate_get_examples +from .. import util + + +default_model_config = """ +[model] +@architectures = "spacy.Tagger.v1" + +[model.tok2vec] +@architectures = "spacy.HashEmbedCNN.v1" +pretrained_vectors = null +width = 12 +depth = 1 +embed_size = 2000 +window_size = 1 +maxout_pieces = 2 +subword_features = true +""" +DEFAULT_SENTER_MODEL = Config().from_str(default_model_config)["model"] + + +@Language.factory( + "senter", + assigns=["token.is_sent_start"], + default_config={"model": DEFAULT_SENTER_MODEL}, + default_score_weights={"sents_f": 1.0, "sents_p": 0.0, "sents_r": 0.0}, +) +def make_senter(nlp: Language, name: str, model: Model): + return SentenceRecognizer(nlp.vocab, model, name) + + +class SentenceRecognizer(Tagger): + """Pipeline component for sentence segmentation. + + DOCS: https://nightly.spacy.io/api/sentencerecognizer + """ + def __init__(self, vocab, model, name="senter"): + """Initialize a sentence recognizer. + + vocab (Vocab): The shared vocabulary. + model (thinc.api.Model): The Thinc Model powering the pipeline component. + name (str): The component instance name, used to add entries to the + losses during training. + + DOCS: https://nightly.spacy.io/api/sentencerecognizer#init + """ + self.vocab = vocab + self.model = model + self.name = name + self._rehearsal_model = None + self.cfg = {} + + @property + def labels(self): + """RETURNS (Tuple[str]): The labels.""" + # labels are numbered by index internally, so this matches GoldParse + # and Example where the sentence-initial tag is 1 and other positions + # are 0 + return tuple(["I", "S"]) + + @property + def label_data(self): + return None + + def set_annotations(self, docs, batch_tag_ids): + """Modify a batch of documents, using pre-computed scores. + + docs (Iterable[Doc]): The documents to modify. + batch_tag_ids: The IDs to set, produced by SentenceRecognizer.predict. + + DOCS: https://nightly.spacy.io/api/sentencerecognizer#set_annotations + """ + if isinstance(docs, Doc): + docs = [docs] + cdef Doc doc + for i, doc in enumerate(docs): + doc_tag_ids = batch_tag_ids[i] + if hasattr(doc_tag_ids, "get"): + doc_tag_ids = doc_tag_ids.get() + for j, tag_id in enumerate(doc_tag_ids): + # Don't clobber existing sentence boundaries + if doc.c[j].sent_start == 0: + if tag_id == 1: + doc.c[j].sent_start = 1 + else: + doc.c[j].sent_start = -1 + + def get_loss(self, examples, scores): + """Find the loss and gradient of loss for the batch of documents and + their predicted scores. + + examples (Iterable[Examples]): The batch of examples. + scores: Scores representing the model's predictions. + RETURNS (Tuple[float, float]): The loss and the gradient. 
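As a stand-alone illustration of the two labels defined above, the helper below (its name is invented for this sketch) mirrors how the get_loss body further down maps aligned SENT_START values onto "S" and "I".

def sent_start_to_labels(aligned_sent_start, labels=("I", "S")):
    # SENT_START == 1 marks a sentence-initial token ("S"); 0 and -1 become "I";
    # None means the alignment is unknown, so no gradient is produced there.
    truth = []
    for x in aligned_sent_start:
        if x is None:
            truth.append(None)
        elif x == 1:
            truth.append(labels[1])   # "S"
        else:
            truth.append(labels[0])   # "I"
    return truth

print(sent_start_to_labels([1, 0, 0, 1, -1]))
# ['S', 'I', 'I', 'S', 'I']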
+ + DOCS: https://nightly.spacy.io/api/sentencerecognizer#get_loss + """ + validate_examples(examples, "SentenceRecognizer.get_loss") + labels = self.labels + loss_func = SequenceCategoricalCrossentropy(names=labels, normalize=False) + truths = [] + for eg in examples: + eg_truth = [] + for x in eg.get_aligned("SENT_START"): + if x is None: + eg_truth.append(None) + elif x == 1: + eg_truth.append(labels[1]) + else: + # anything other than 1: 0, -1, -1 as uint64 + eg_truth.append(labels[0]) + truths.append(eg_truth) + d_scores, loss = loss_func(scores, truths) + if self.model.ops.xp.isnan(loss): + raise ValueError(Errors.E910.format(name=self.name)) + return float(loss), d_scores + + def initialize(self, get_examples, *, nlp=None): + """Initialize the pipe for training, using a representative set + of data examples. + + get_examples (Callable[[], Iterable[Example]]): Function that + returns a representative sample of gold-standard Example objects. + nlp (Language): The current nlp object the component is part of. + + DOCS: https://nightly.spacy.io/api/sentencerecognizer#initialize + """ + validate_get_examples(get_examples, "SentenceRecognizer.initialize") + doc_sample = [] + label_sample = [] + assert self.labels, Errors.E924.format(name=self.name) + for example in islice(get_examples(), 10): + doc_sample.append(example.x) + gold_tags = example.get_aligned("SENT_START") + gold_array = [[1.0 if tag == gold_tag else 0.0 for tag in self.labels] for gold_tag in gold_tags] + label_sample.append(self.model.ops.asarray(gold_array, dtype="float32")) + assert len(doc_sample) > 0, Errors.E923.format(name=self.name) + assert len(label_sample) > 0, Errors.E923.format(name=self.name) + self.model.initialize(X=doc_sample, Y=label_sample) + + def add_label(self, label, values=None): + raise NotImplementedError + + def score(self, examples, **kwargs): + """Score a batch of examples. + + examples (Iterable[Example]): The examples to score. + RETURNS (Dict[str, Any]): The scores, produced by Scorer.score_spans. + DOCS: https://nightly.spacy.io/api/sentencerecognizer#score + """ + validate_examples(examples, "SentenceRecognizer.score") + results = Scorer.score_spans(examples, "sents", **kwargs) + del results["sents_per_type"] + return results diff --git a/spacy/pipeline/tagger.pyx b/spacy/pipeline/tagger.pyx new file mode 100644 index 000000000..16633a7b8 --- /dev/null +++ b/spacy/pipeline/tagger.pyx @@ -0,0 +1,330 @@ +# cython: infer_types=True, profile=True, binding=True +from typing import List +import numpy +import srsly +from thinc.api import Model, set_dropout_rate, SequenceCategoricalCrossentropy, Config +from thinc.types import Floats2d +import warnings +from itertools import islice + +from ..tokens.doc cimport Doc +from ..morphology cimport Morphology +from ..vocab cimport Vocab + +from .trainable_pipe import TrainablePipe +from .pipe import deserialize_config +from ..language import Language +from ..attrs import POS, ID +from ..parts_of_speech import X +from ..errors import Errors, Warnings +from ..scorer import Scorer +from ..training import validate_examples, validate_get_examples +from .. 
import util + + +default_model_config = """ +[model] +@architectures = "spacy.Tagger.v1" + +[model.tok2vec] +@architectures = "spacy.HashEmbedCNN.v1" +pretrained_vectors = null +width = 96 +depth = 4 +embed_size = 2000 +window_size = 1 +maxout_pieces = 3 +subword_features = true +""" +DEFAULT_TAGGER_MODEL = Config().from_str(default_model_config)["model"] + + +@Language.factory( + "tagger", + assigns=["token.tag"], + default_config={"model": DEFAULT_TAGGER_MODEL}, + default_score_weights={"tag_acc": 1.0}, +) +def make_tagger(nlp: Language, name: str, model: Model): + """Construct a part-of-speech tagger component. + + model (Model[List[Doc], List[Floats2d]]): A model instance that predicts + the tag probabilities. The output vectors should match the number of tags + in size, and be normalized as probabilities (all scores between 0 and 1, + with the rows summing to 1). + """ + return Tagger(nlp.vocab, model, name) + + +class Tagger(TrainablePipe): + """Pipeline component for part-of-speech tagging. + + DOCS: https://nightly.spacy.io/api/tagger + """ + def __init__(self, vocab, model, name="tagger", *, labels=None): + """Initialize a part-of-speech tagger. + + vocab (Vocab): The shared vocabulary. + model (thinc.api.Model): The Thinc Model powering the pipeline component. + name (str): The component instance name, used to add entries to the + losses during training. + labels (List): The set of labels. Defaults to None. + + DOCS: https://nightly.spacy.io/api/tagger#init + """ + self.vocab = vocab + self.model = model + self.name = name + self._rehearsal_model = None + cfg = {"labels": labels or []} + self.cfg = dict(sorted(cfg.items())) + + @property + def labels(self): + """The labels currently added to the component. Note that even for a + blank component, this will always include the built-in coarse-grained + part-of-speech tags by default. + + RETURNS (Tuple[str]): The labels. + + DOCS: https://nightly.spacy.io/api/tagger#labels + """ + return tuple(self.cfg["labels"]) + + @property + def label_data(self): + """Data about the labels currently added to the component.""" + return tuple(self.cfg["labels"]) + + def __call__(self, doc): + """Apply the pipe to a Doc. + + doc (Doc): The document to process. + RETURNS (Doc): The processed Doc. + + DOCS: https://nightly.spacy.io/api/tagger#call + """ + tags = self.predict([doc]) + self.set_annotations([doc], tags) + return doc + + def pipe(self, stream, *, batch_size=128): + """Apply the pipe to a stream of documents. This usually happens under + the hood when the nlp object is called on a text and all components are + applied to the Doc. + + stream (Iterable[Doc]): A stream of documents. + batch_size (int): The number of documents to buffer. + YIELDS (Doc): Processed documents in order. + + DOCS: https://nightly.spacy.io/api/tagger#pipe + """ + for docs in util.minibatch(stream, size=batch_size): + tag_ids = self.predict(docs) + self.set_annotations(docs, tag_ids) + yield from docs + + def predict(self, docs): + """Apply the pipeline's model to a batch of docs, without modifying them. + + docs (Iterable[Doc]): The documents to predict. + RETURNS: The models prediction for each document. + + DOCS: https://nightly.spacy.io/api/tagger#predict + """ + if not any(len(doc) for doc in docs): + # Handle cases where there are no tokens in any docs. 
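In the non-empty branch below, the model's per-token probability rows are handed to _scores2guesses, which reduces each row to a tag ID with an argmax; a toy NumPy illustration with made-up scores:

import numpy

# One document, three tokens, three tag columns (each row sums to 1).
doc_scores = numpy.array([
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.2, 0.2, 0.6],
])
tag_ids = doc_scores.argmax(axis=1)
print(tag_ids)   # [1 0 2] -> indices into Tagger.labels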
+ n_labels = len(self.labels) + guesses = [self.model.ops.alloc((0, n_labels)) for doc in docs] + assert len(guesses) == len(docs) + return guesses + scores = self.model.predict(docs) + assert len(scores) == len(docs), (len(scores), len(docs)) + guesses = self._scores2guesses(scores) + assert len(guesses) == len(docs) + return guesses + + def _scores2guesses(self, scores): + guesses = [] + for doc_scores in scores: + doc_guesses = doc_scores.argmax(axis=1) + if not isinstance(doc_guesses, numpy.ndarray): + doc_guesses = doc_guesses.get() + guesses.append(doc_guesses) + return guesses + + def set_annotations(self, docs, batch_tag_ids): + """Modify a batch of documents, using pre-computed scores. + + docs (Iterable[Doc]): The documents to modify. + batch_tag_ids: The IDs to set, produced by Tagger.predict. + + DOCS: https://nightly.spacy.io/api/tagger#set_annotations + """ + if isinstance(docs, Doc): + docs = [docs] + cdef Doc doc + cdef Vocab vocab = self.vocab + for i, doc in enumerate(docs): + doc_tag_ids = batch_tag_ids[i] + if hasattr(doc_tag_ids, "get"): + doc_tag_ids = doc_tag_ids.get() + for j, tag_id in enumerate(doc_tag_ids): + # Don't clobber preset POS tags + if doc.c[j].tag == 0: + doc.c[j].tag = self.vocab.strings[self.labels[tag_id]] + + def update(self, examples, *, drop=0., sgd=None, losses=None, set_annotations=False): + """Learn from a batch of documents and gold-standard information, + updating the pipe's model. Delegates to predict and get_loss. + + examples (Iterable[Example]): A batch of Example objects. + drop (float): The dropout rate. + set_annotations (bool): Whether or not to update the Example objects + with the predictions. + sgd (thinc.api.Optimizer): The optimizer. + losses (Dict[str, float]): Optional record of the loss during training. + Updated using the component name as the key. + RETURNS (Dict[str, float]): The updated losses dictionary. + + DOCS: https://nightly.spacy.io/api/tagger#update + """ + if losses is None: + losses = {} + losses.setdefault(self.name, 0.0) + validate_examples(examples, "Tagger.update") + if not any(len(eg.predicted) if eg.predicted else 0 for eg in examples): + # Handle cases where there are no tokens in any docs. + return losses + set_dropout_rate(self.model, drop) + tag_scores, bp_tag_scores = self.model.begin_update([eg.predicted for eg in examples]) + for sc in tag_scores: + if self.model.ops.xp.isnan(sc.sum()): + raise ValueError(Errors.E940) + loss, d_tag_scores = self.get_loss(examples, tag_scores) + bp_tag_scores(d_tag_scores) + if sgd not in (None, False): + self.finish_update(sgd) + + losses[self.name] += loss + if set_annotations: + docs = [eg.predicted for eg in examples] + self.set_annotations(docs, self._scores2guesses(tag_scores)) + return losses + + def rehearse(self, examples, *, drop=0., sgd=None, losses=None): + """Perform a "rehearsal" update from a batch of data. Rehearsal updates + teach the current model to make predictions similar to an initial model, + to try to address the "catastrophic forgetting" problem. This feature is + experimental. + + examples (Iterable[Example]): A batch of Example objects. + drop (float): The dropout rate. + sgd (thinc.api.Optimizer): The optimizer. + losses (Dict[str, float]): Optional record of the loss during training. + Updated using the component name as the key. + RETURNS (Dict[str, float]): The updated losses dictionary. 
+ + DOCS: https://nightly.spacy.io/api/tagger#rehearse + """ + if losses is None: + losses = {} + losses.setdefault(self.name, 0.0) + validate_examples(examples, "Tagger.rehearse") + docs = [eg.predicted for eg in examples] + if self._rehearsal_model is None: + return losses + if not any(len(doc) for doc in docs): + # Handle cases where there are no tokens in any docs. + return losses + set_dropout_rate(self.model, drop) + guesses, backprop = self.model.begin_update(docs) + target = self._rehearsal_model(examples) + gradient = guesses - target + backprop(gradient) + self.finish_update(sgd) + losses[self.name] += (gradient**2).sum() + return losses + + def get_loss(self, examples, scores): + """Find the loss and gradient of loss for the batch of documents and + their predicted scores. + + examples (Iterable[Examples]): The batch of examples. + scores: Scores representing the model's predictions. + RETURNS (Tuple[float, float]): The loss and the gradient. + + DOCS: https://nightly.spacy.io/api/tagger#get_loss + """ + validate_examples(examples, "Tagger.get_loss") + loss_func = SequenceCategoricalCrossentropy(names=self.labels, normalize=False) + truths = [eg.get_aligned("TAG", as_string=True) for eg in examples] + d_scores, loss = loss_func(scores, truths) + if self.model.ops.xp.isnan(loss): + raise ValueError(Errors.E910.format(name=self.name)) + return float(loss), d_scores + + def initialize(self, get_examples, *, nlp=None, labels=None): + """Initialize the pipe for training, using a representative set + of data examples. + + get_examples (Callable[[], Iterable[Example]]): Function that + returns a representative sample of gold-standard Example objects.. + nlp (Language): The current nlp object the component is part of. + labels: The labels to add to the component, typically generated by the + `init labels` command. If no labels are provided, the get_examples + callback is used to extract the labels from the data. + + DOCS: https://nightly.spacy.io/api/tagger#initialize + """ + validate_get_examples(get_examples, "Tagger.initialize") + if labels is not None: + for tag in labels: + self.add_label(tag) + else: + tags = set() + for example in get_examples(): + for token in example.y: + if token.tag_: + tags.add(token.tag_) + for tag in sorted(tags): + self.add_label(tag) + doc_sample = [] + label_sample = [] + for example in islice(get_examples(), 10): + doc_sample.append(example.x) + gold_tags = example.get_aligned("TAG", as_string=True) + gold_array = [[1.0 if tag == gold_tag else 0.0 for tag in self.labels] for gold_tag in gold_tags] + label_sample.append(self.model.ops.asarray(gold_array, dtype="float32")) + assert len(doc_sample) > 0, Errors.E923.format(name=self.name) + assert len(label_sample) > 0, Errors.E923.format(name=self.name) + self.model.initialize(X=doc_sample, Y=label_sample) + + def add_label(self, label): + """Add a new label to the pipe. + + label (str): The label to add. + RETURNS (int): 0 if label is already present, otherwise 1. + + DOCS: https://nightly.spacy.io/api/tagger#add_label + """ + if not isinstance(label, str): + raise ValueError(Errors.E187) + if label in self.labels: + return 0 + self._allow_extra_label() + self.cfg["labels"].append(label) + self.vocab.strings.add(label) + return 1 + + def score(self, examples, **kwargs): + """Score a batch of examples. + + examples (Iterable[Example]): The examples to score. + RETURNS (Dict[str, Any]): The scores, produced by + Scorer.score_token_attr for the attributes "tag". 
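A hedged end-to-end sketch of the tagger factory registered above, using the v3 nightly training API; the tag names and the training sentence are invented.

import spacy
from spacy.training import Example

nlp = spacy.blank("en")
tagger = nlp.add_pipe("tagger")
train_example = Example.from_dict(
    nlp.make_doc("The dog barks"), {"tags": ["DT", "NN", "VB"]}
)
# initialize() collects the tag labels from the data when none are passed in.
optimizer = nlp.initialize(get_examples=lambda: [train_example])
losses = nlp.update([train_example], sgd=optimizer, losses={})
print(tagger.labels, losses["tagger"])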
+ + DOCS: https://nightly.spacy.io/api/tagger#score + """ + validate_examples(examples, "Tagger.score") + return Scorer.score_token_attr(examples, "tag", **kwargs) diff --git a/spacy/pipeline/textcat.py b/spacy/pipeline/textcat.py new file mode 100644 index 000000000..5ebe0e104 --- /dev/null +++ b/spacy/pipeline/textcat.py @@ -0,0 +1,367 @@ +from itertools import islice +from typing import Iterable, Tuple, Optional, Dict, List, Callable, Iterator, Any +from thinc.api import get_array_module, Model, Optimizer, set_dropout_rate, Config +from thinc.types import Floats2d +import numpy + +from .trainable_pipe import TrainablePipe +from ..language import Language +from ..training import Example, validate_examples, validate_get_examples +from ..errors import Errors +from ..scorer import Scorer +from .. import util +from ..tokens import Doc +from ..vocab import Vocab + + +default_model_config = """ +[model] +@architectures = "spacy.TextCatEnsemble.v1" +exclusive_classes = false +pretrained_vectors = null +width = 64 +conv_depth = 2 +embed_size = 2000 +window_size = 1 +ngram_size = 1 +dropout = null +""" +DEFAULT_TEXTCAT_MODEL = Config().from_str(default_model_config)["model"] + +bow_model_config = """ +[model] +@architectures = "spacy.TextCatBOW.v1" +exclusive_classes = false +ngram_size = 1 +no_output_layer = false +""" + +cnn_model_config = """ +[model] +@architectures = "spacy.TextCatCNN.v1" +exclusive_classes = false + +[model.tok2vec] +@architectures = "spacy.HashEmbedCNN.v1" +pretrained_vectors = null +width = 96 +depth = 4 +embed_size = 2000 +window_size = 1 +maxout_pieces = 3 +subword_features = true +""" + + +@Language.factory( + "textcat", + assigns=["doc.cats"], + default_config={"threshold": 0.5, "model": DEFAULT_TEXTCAT_MODEL}, + default_score_weights={ + "cats_score": 1.0, + "cats_score_desc": None, + "cats_p": None, + "cats_r": None, + "cats_f": None, + "cats_macro_f": None, + "cats_macro_auc": None, + "cats_f_per_type": None, + "cats_macro_auc_per_type": None, + }, +) +def make_textcat( + nlp: Language, name: str, model: Model[List[Doc], List[Floats2d]], threshold: float +) -> "TextCategorizer": + """Create a TextCategorizer compoment. The text categorizer predicts categories + over a whole document. It can learn one or more labels, and the labels can + be mutually exclusive (i.e. one true label per doc) or non-mutually exclusive + (i.e. zero or more labels may be true per doc). The multi-label setting is + controlled by the model instance that's provided. + + model (Model[List[Doc], List[Floats2d]]): A model instance that predicts + scores for each category. + threshold (float): Cutoff to consider a prediction "positive". + """ + return TextCategorizer(nlp.vocab, model, name, threshold=threshold) + + +class TextCategorizer(TrainablePipe): + """Pipeline component for text classification. + + DOCS: https://nightly.spacy.io/api/textcategorizer + """ + + def __init__( + self, vocab: Vocab, model: Model, name: str = "textcat", *, threshold: float + ) -> None: + """Initialize a text categorizer. + + vocab (Vocab): The shared vocabulary. + model (thinc.api.Model): The Thinc Model powering the pipeline component. + name (str): The component instance name, used to add entries to the + losses during training. + threshold (float): Cutoff to consider a prediction "positive". 
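A usage sketch for the textcat factory registered above, assuming the v3 nightly training API; the labels and example texts are invented.

import spacy
from spacy.training import Example

nlp = spacy.blank("en")
# "threshold" is the setting exposed by the factory's default_config.
textcat = nlp.add_pipe("textcat", config={"threshold": 0.5})
train_examples = [
    Example.from_dict(nlp.make_doc("I loved it"), {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
    Example.from_dict(nlp.make_doc("I hated it"), {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
]
# initialize() picks up the labels from the example cats.
nlp.initialize(get_examples=lambda: train_examples)
doc = nlp("It was great")
print(doc.cats)   # per-label scores, compared against the threshold downstream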
+ + DOCS: https://nightly.spacy.io/api/textcategorizer#init + """ + self.vocab = vocab + self.model = model + self.name = name + self._rehearsal_model = None + cfg = {"labels": [], "threshold": threshold, "positive_label": None} + self.cfg = dict(cfg) + + @property + def labels(self) -> Tuple[str]: + """RETURNS (Tuple[str]): The labels currently added to the component. + + DOCS: https://nightly.spacy.io/api/textcategorizer#labels + """ + return tuple(self.cfg["labels"]) + + @property + def label_data(self) -> List[str]: + """RETURNS (List[str]): Information about the component's labels.""" + return self.labels + + def pipe(self, stream: Iterable[Doc], *, batch_size: int = 128) -> Iterator[Doc]: + """Apply the pipe to a stream of documents. This usually happens under + the hood when the nlp object is called on a text and all components are + applied to the Doc. + + stream (Iterable[Doc]): A stream of documents. + batch_size (int): The number of documents to buffer. + YIELDS (Doc): Processed documents in order. + + DOCS: https://nightly.spacy.io/api/textcategorizer#pipe + """ + for docs in util.minibatch(stream, size=batch_size): + scores = self.predict(docs) + self.set_annotations(docs, scores) + yield from docs + + def predict(self, docs: Iterable[Doc]): + """Apply the pipeline's model to a batch of docs, without modifying them. + + docs (Iterable[Doc]): The documents to predict. + RETURNS: The models prediction for each document. + + DOCS: https://nightly.spacy.io/api/textcategorizer#predict + """ + if not any(len(doc) for doc in docs): + # Handle cases where there are no tokens in any docs. + tensors = [doc.tensor for doc in docs] + xp = get_array_module(tensors) + scores = xp.zeros((len(docs), len(self.labels))) + return scores + scores = self.model.predict(docs) + scores = self.model.ops.asarray(scores) + return scores + + def set_annotations(self, docs: Iterable[Doc], scores) -> None: + """Modify a batch of [`Doc`](/api/doc) objects, using pre-computed scores. + + docs (Iterable[Doc]): The documents to modify. + scores: The scores to set, produced by TextCategorizer.predict. + + DOCS: https://nightly.spacy.io/api/textcategorizer#set_annotations + """ + for i, doc in enumerate(docs): + for j, label in enumerate(self.labels): + doc.cats[label] = float(scores[i, j]) + + def update( + self, + examples: Iterable[Example], + *, + drop: float = 0.0, + set_annotations: bool = False, + sgd: Optional[Optimizer] = None, + losses: Optional[Dict[str, float]] = None, + ) -> Dict[str, float]: + """Learn from a batch of documents and gold-standard information, + updating the pipe's model. Delegates to predict and get_loss. + + examples (Iterable[Example]): A batch of Example objects. + drop (float): The dropout rate. + set_annotations (bool): Whether or not to update the Example objects + with the predictions. + sgd (thinc.api.Optimizer): The optimizer. + losses (Dict[str, float]): Optional record of the loss during training. + Updated using the component name as the key. + RETURNS (Dict[str, float]): The updated losses dictionary. + + DOCS: https://nightly.spacy.io/api/textcategorizer#update + """ + if losses is None: + losses = {} + losses.setdefault(self.name, 0.0) + validate_examples(examples, "TextCategorizer.update") + if not any(len(eg.predicted) if eg.predicted else 0 for eg in examples): + # Handle cases where there are no tokens in any docs. 
+ return losses + set_dropout_rate(self.model, drop) + scores, bp_scores = self.model.begin_update([eg.predicted for eg in examples]) + loss, d_scores = self.get_loss(examples, scores) + bp_scores(d_scores) + if sgd is not None: + self.finish_update(sgd) + losses[self.name] += loss + if set_annotations: + docs = [eg.predicted for eg in examples] + self.set_annotations(docs, scores=scores) + return losses + + def rehearse( + self, + examples: Iterable[Example], + *, + drop: float = 0.0, + sgd: Optional[Optimizer] = None, + losses: Optional[Dict[str, float]] = None, + ) -> Dict[str, float]: + """Perform a "rehearsal" update from a batch of data. Rehearsal updates + teach the current model to make predictions similar to an initial model, + to try to address the "catastrophic forgetting" problem. This feature is + experimental. + + examples (Iterable[Example]): A batch of Example objects. + drop (float): The dropout rate. + sgd (thinc.api.Optimizer): The optimizer. + losses (Dict[str, float]): Optional record of the loss during training. + Updated using the component name as the key. + RETURNS (Dict[str, float]): The updated losses dictionary. + + DOCS: https://nightly.spacy.io/api/textcategorizer#rehearse + """ + if losses is not None: + losses.setdefault(self.name, 0.0) + if self._rehearsal_model is None: + return losses + validate_examples(examples, "TextCategorizer.rehearse") + docs = [eg.predicted for eg in examples] + if not any(len(doc) for doc in docs): + # Handle cases where there are no tokens in any docs. + return losses + set_dropout_rate(self.model, drop) + scores, bp_scores = self.model.begin_update(docs) + target = self._rehearsal_model(examples) + gradient = scores - target + bp_scores(gradient) + if sgd is not None: + self.finish_update(sgd) + if losses is not None: + losses[self.name] += (gradient ** 2).sum() + return losses + + def _examples_to_truth( + self, examples: List[Example] + ) -> Tuple[numpy.ndarray, numpy.ndarray]: + truths = numpy.zeros((len(examples), len(self.labels)), dtype="f") + not_missing = numpy.ones((len(examples), len(self.labels)), dtype="f") + for i, eg in enumerate(examples): + for j, label in enumerate(self.labels): + if label in eg.reference.cats: + truths[i, j] = eg.reference.cats[label] + else: + not_missing[i, j] = 0.0 + truths = self.model.ops.asarray(truths) + return truths, not_missing + + def get_loss(self, examples: Iterable[Example], scores) -> Tuple[float, float]: + """Find the loss and gradient of loss for the batch of documents and + their predicted scores. + + examples (Iterable[Examples]): The batch of examples. + scores: Scores representing the model's predictions. + RETURNS (Tuple[float, float]): The loss and the gradient. + + DOCS: https://nightly.spacy.io/api/textcategorizer#get_loss + """ + validate_examples(examples, "TextCategorizer.get_loss") + truths, not_missing = self._examples_to_truth(examples) + not_missing = self.model.ops.asarray(not_missing) + d_scores = (scores - truths) / scores.shape[0] + d_scores *= not_missing + mean_square_error = (d_scores ** 2).sum(axis=1).mean() + return float(mean_square_error), d_scores + + def add_label(self, label: str) -> int: + """Add a new label to the pipe. + + label (str): The label to add. + RETURNS (int): 0 if label is already present, otherwise 1. 
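The get_loss gradient above is a masked mean-squared error; a toy recomputation with made-up numbers shows the effect of the not_missing mask.

import numpy

scores      = numpy.array([[0.8, 0.3], [0.2, 0.9]])
truths      = numpy.array([[1.0, 0.0], [0.0, 0.0]])
not_missing = numpy.array([[1.0, 1.0], [1.0, 0.0]])    # second label of example 2 is unannotated

d_scores = (scores - truths) / scores.shape[0]
d_scores *= not_missing                                # no gradient for missing gold values
mean_square_error = (d_scores ** 2).sum(axis=1).mean()
print(float(mean_square_error))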
+ + DOCS: https://nightly.spacy.io/api/textcategorizer#add_label + """ + if not isinstance(label, str): + raise ValueError(Errors.E187) + if label in self.labels: + return 0 + self._allow_extra_label() + self.cfg["labels"].append(label) + self.vocab.strings.add(label) + return 1 + + def initialize( + self, + get_examples: Callable[[], Iterable[Example]], + *, + nlp: Optional[Language] = None, + labels: Optional[Dict] = None, + positive_label: Optional[str] = None, + ): + """Initialize the pipe for training, using a representative set + of data examples. + + get_examples (Callable[[], Iterable[Example]]): Function that + returns a representative sample of gold-standard Example objects. + nlp (Language): The current nlp object the component is part of. + labels: The labels to add to the component, typically generated by the + `init labels` command. If no labels are provided, the get_examples + callback is used to extract the labels from the data. + + DOCS: https://nightly.spacy.io/api/textcategorizer#initialize + """ + validate_get_examples(get_examples, "TextCategorizer.initialize") + if labels is None: + for example in get_examples(): + for cat in example.y.cats: + self.add_label(cat) + else: + for label in labels: + self.add_label(label) + if positive_label is not None: + if positive_label not in self.labels: + err = Errors.E920.format(pos_label=positive_label, labels=self.labels) + raise ValueError(err) + if len(self.labels) != 2: + err = Errors.E919.format(pos_label=positive_label, labels=self.labels) + raise ValueError(err) + self.cfg["positive_label"] = positive_label + subbatch = list(islice(get_examples(), 10)) + doc_sample = [eg.reference for eg in subbatch] + label_sample, _ = self._examples_to_truth(subbatch) + self._require_labels() + assert len(doc_sample) > 0, Errors.E923.format(name=self.name) + assert len(label_sample) > 0, Errors.E923.format(name=self.name) + self.model.initialize(X=doc_sample, Y=label_sample) + + def score(self, examples: Iterable[Example], **kwargs) -> Dict[str, Any]: + """Score a batch of examples. + + examples (Iterable[Example]): The examples to score. + RETURNS (Dict[str, Any]): The scores, produced by Scorer.score_cats. 
+ + DOCS: https://nightly.spacy.io/api/textcategorizer#score + """ + validate_examples(examples, "TextCategorizer.score") + return Scorer.score_cats( + examples, + "cats", + labels=self.labels, + multi_label=self.model.attrs["multi_label"], + positive_label=self.cfg["positive_label"], + threshold=self.cfg["threshold"], + **kwargs, + ) diff --git a/spacy/pipeline/tok2vec.py b/spacy/pipeline/tok2vec.py new file mode 100644 index 000000000..0ad875035 --- /dev/null +++ b/spacy/pipeline/tok2vec.py @@ -0,0 +1,313 @@ +from typing import Iterator, Sequence, Iterable, Optional, Dict, Callable, List +from thinc.api import Model, set_dropout_rate, Optimizer, Config +from itertools import islice + +from .trainable_pipe import TrainablePipe +from ..training import Example, validate_examples, validate_get_examples +from ..tokens import Doc +from ..vocab import Vocab +from ..language import Language +from ..errors import Errors +from ..util import minibatch + + +default_model_config = """ +[model] +@architectures = "spacy.HashEmbedCNN.v1" +pretrained_vectors = null +width = 96 +depth = 4 +embed_size = 2000 +window_size = 1 +maxout_pieces = 3 +subword_features = true +""" +DEFAULT_TOK2VEC_MODEL = Config().from_str(default_model_config)["model"] + + +@Language.factory( + "tok2vec", assigns=["doc.tensor"], default_config={"model": DEFAULT_TOK2VEC_MODEL} +) +def make_tok2vec(nlp: Language, name: str, model: Model) -> "Tok2Vec": + return Tok2Vec(nlp.vocab, model, name) + + +class Tok2Vec(TrainablePipe): + """Apply a "token-to-vector" model and set its outputs in the doc.tensor + attribute. This is mostly useful to share a single subnetwork between multiple + components, e.g. to have one embedding and CNN network shared between a + parser, tagger and NER. + + In order to use the `Tok2Vec` predictions, subsequent components should use + the `Tok2VecListener` layer as the tok2vec subnetwork of their model. This + layer will read data from the `doc.tensor` attribute during prediction. + During training, the `Tok2Vec` component will save its prediction and backprop + callback for each batch, so that the subsequent components can backpropagate + to the shared weights. This implementation is used because it allows us to + avoid relying on object identity within the models to achieve the parameter + sharing. + """ + + def __init__(self, vocab: Vocab, model: Model, name: str = "tok2vec") -> None: + """Initialize a tok2vec component. + + vocab (Vocab): The shared vocabulary. + model (thinc.api.Model[List[Doc], List[Floats2d]]): + The Thinc Model powering the pipeline component. It should take + a list of Doc objects as input, and output a list of 2d float arrays. + name (str): The component instance name. + + DOCS: https://nightly.spacy.io/api/tok2vec#init + """ + self.vocab = vocab + self.model = model + self.name = name + self.listeners = [] + self.cfg = {} + + def add_listener(self, listener: "Tok2VecListener") -> None: + """Add a listener for a downstream component. Usually internals.""" + self.listeners.append(listener) + + def find_listeners(self, model: Model) -> None: + """Walk over a model, looking for layers that are Tok2vecListener + subclasses that have an upstream_name that matches this component. + Listeners can also set their upstream_name attribute to the wildcard + string '*' to match any `Tok2Vec`. + + You're unlikely to ever need multiple `Tok2Vec` components, so it's + fine to leave your listeners upstream_name on '*'. 
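A minimal usage sketch of the standalone tok2vec component registered above, assuming the v3 nightly API; the dummy example only exists so that initialization has something to sample from.

import spacy
from spacy.training import Example

nlp = spacy.blank("en")
nlp.add_pipe("tok2vec")   # DEFAULT_TOK2VEC_MODEL: HashEmbedCNN with width 96
nlp.initialize(get_examples=lambda: [Example.from_dict(nlp.make_doc("a b c"), {})])
doc = nlp("Context-sensitive vectors end up on the doc")
print(doc.tensor.shape)   # (number of tokens, 96)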
+ """ + for node in model.walk(): + if isinstance(node, Tok2VecListener) and node.upstream_name in ( + "*", + self.name, + ): + self.add_listener(node) + + def __call__(self, doc: Doc) -> Doc: + """Add context-sensitive embeddings to the Doc.tensor attribute, allowing + them to be used as features by downstream components. + + docs (Doc): The Doc to process. + RETURNS (Doc): The processed Doc. + + DOCS: https://nightly.spacy.io/api/tok2vec#call + """ + tokvecses = self.predict([doc]) + self.set_annotations([doc], tokvecses) + return doc + + def pipe(self, stream: Iterator[Doc], *, batch_size: int = 128) -> Iterator[Doc]: + """Apply the pipe to a stream of documents. This usually happens under + the hood when the nlp object is called on a text and all components are + applied to the Doc. + + stream (Iterable[Doc]): A stream of documents. + batch_size (int): The number of documents to buffer. + YIELDS (Doc): Processed documents in order. + + DOCS: https://nightly.spacy.io/api/tok2vec#pipe + """ + for docs in minibatch(stream, batch_size): + docs = list(docs) + tokvecses = self.predict(docs) + self.set_annotations(docs, tokvecses) + yield from docs + + def predict(self, docs: Iterable[Doc]): + """Apply the pipeline's model to a batch of docs, without modifying them. + Returns a single tensor for a batch of documents. + + docs (Iterable[Doc]): The documents to predict. + RETURNS: Vector representations for each token in the documents. + + DOCS: https://nightly.spacy.io/api/tok2vec#predict + """ + tokvecs = self.model.predict(docs) + batch_id = Tok2VecListener.get_batch_id(docs) + for listener in self.listeners: + listener.receive(batch_id, tokvecs, lambda dX: []) + return tokvecs + + def set_annotations(self, docs: Sequence[Doc], tokvecses) -> None: + """Modify a batch of documents, using pre-computed scores. + + docs (Iterable[Doc]): The documents to modify. + tokvecses: The tensors to set, produced by Tok2Vec.predict. + + DOCS: https://nightly.spacy.io/api/tok2vec#set_annotations + """ + for doc, tokvecs in zip(docs, tokvecses): + assert tokvecs.shape[0] == len(doc) + doc.tensor = tokvecs + + def update( + self, + examples: Iterable[Example], + *, + drop: float = 0.0, + sgd: Optional[Optimizer] = None, + losses: Optional[Dict[str, float]] = None, + set_annotations: bool = False, + ): + """Learn from a batch of documents and gold-standard information, + updating the pipe's model. + + examples (Iterable[Example]): A batch of Example objects. + drop (float): The dropout rate. + set_annotations (bool): Whether or not to update the Example objects + with the predictions. + sgd (thinc.api.Optimizer): The optimizer. + losses (Dict[str, float]): Optional record of the loss during training. + Updated using the component name as the key. + RETURNS (Dict[str, float]): The updated losses dictionary. + + DOCS: https://nightly.spacy.io/api/tok2vec#update + """ + if losses is None: + losses = {} + validate_examples(examples, "Tok2Vec.update") + docs = [eg.predicted for eg in examples] + set_dropout_rate(self.model, drop) + tokvecs, bp_tokvecs = self.model.begin_update(docs) + d_tokvecs = [self.model.ops.alloc2f(*t2v.shape) for t2v in tokvecs] + losses.setdefault(self.name, 0.0) + + def accumulate_gradient(one_d_tokvecs): + """Accumulate tok2vec loss and gradient. This is passed as a callback + to all but the last listener. Only the last one does the backprop. 
+ """ + nonlocal d_tokvecs + for i in range(len(one_d_tokvecs)): + d_tokvecs[i] += one_d_tokvecs[i] + losses[self.name] += float((one_d_tokvecs[i] ** 2).sum()) + + def backprop(one_d_tokvecs): + """Callback to actually do the backprop. Passed to last listener.""" + accumulate_gradient(one_d_tokvecs) + d_docs = bp_tokvecs(d_tokvecs) + if sgd is not None: + self.finish_update(sgd) + return d_docs + + batch_id = Tok2VecListener.get_batch_id(docs) + for listener in self.listeners[:-1]: + listener.receive(batch_id, tokvecs, accumulate_gradient) + if self.listeners: + self.listeners[-1].receive(batch_id, tokvecs, backprop) + if set_annotations: + self.set_annotations(docs, tokvecs) + return losses + + def get_loss(self, examples, scores) -> None: + pass + + def initialize( + self, + get_examples: Callable[[], Iterable[Example]], + *, + nlp: Optional[Language] = None, + ): + """Initialize the pipe for training, using a representative set + of data examples. + + get_examples (Callable[[], Iterable[Example]]): Function that + returns a representative sample of gold-standard Example objects. + nlp (Language): The current nlp object the component is part of. + + DOCS: https://nightly.spacy.io/api/tok2vec#initialize + """ + validate_get_examples(get_examples, "Tok2Vec.initialize") + doc_sample = [] + for example in islice(get_examples(), 10): + doc_sample.append(example.x) + assert doc_sample, Errors.E923.format(name=self.name) + self.model.initialize(X=doc_sample) + + def add_label(self, label): + raise NotImplementedError + + +class Tok2VecListener(Model): + """A layer that gets fed its answers from an upstream connection, + for instance from a component earlier in the pipeline. + + The Tok2VecListener layer is used as a sublayer within a component such + as a parser, NER or text categorizer. Usually you'll have multiple listeners + connecting to a single upstream Tok2Vec component, that's earlier in the + pipeline. The Tok2VecListener layers act as proxies, passing the predictions + from the Tok2Vec component into downstream components, and communicating + gradients back upstream. + """ + + name = "tok2vec-listener" + + def __init__(self, upstream_name: str, width: int) -> None: + """ + upstream_name (str): A string to identify the 'upstream' Tok2Vec component + to communicate with. The upstream name should either be the wildcard + string '*', or the name of the `Tok2Vec` component. You'll almost + never have multiple upstream Tok2Vec components, so the wildcard + string will almost always be fine. + width (int): + The width of the vectors produced by the upstream tok2vec component. + """ + Model.__init__(self, name=self.name, forward=forward, dims={"nO": width}) + self.upstream_name = upstream_name + self._batch_id = None + self._outputs = None + self._backprop = None + + @classmethod + def get_batch_id(cls, inputs: List[Doc]) -> int: + """Calculate a content-sensitive hash of the batch of documents, to check + whether the next batch of documents is unexpected. + """ + return sum(sum(token.orth for token in doc) for doc in inputs) + + def receive(self, batch_id: int, outputs, backprop) -> None: + """Store a batch of training predictions and a backprop callback. The + predictions and callback are produced by the upstream Tok2Vec component, + and later will be used when the listener's component's model is called. 
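The receive/verify_inputs pair implements a simple content-hash handshake between the upstream Tok2Vec and its listeners; the sketch below (the import path is assumed from this diff, and the stored outputs are a dummy placeholder) shows the two ends agreeing on a batch id.

import spacy
from spacy.pipeline.tok2vec import Tok2VecListener

nlp = spacy.blank("en")
docs = [nlp.make_doc("one two"), nlp.make_doc("three")]

listener = Tok2VecListener(upstream_name="*", width=96)
batch_id = Tok2VecListener.get_batch_id(docs)        # sum of token.orth hashes
listener.receive(batch_id, outputs=[], backprop=lambda dX: [])
print(listener.verify_inputs(docs))                  # True: same docs, same batch id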
+ """ + self._batch_id = batch_id + self._outputs = outputs + self._backprop = backprop + + def verify_inputs(self, inputs) -> bool: + """Check that the batch of Doc objects matches the ones we have a + prediction for. + """ + if self._batch_id is None and self._outputs is None: + raise ValueError(Errors.E954) + else: + batch_id = self.get_batch_id(inputs) + if batch_id != self._batch_id: + raise ValueError(Errors.E953.format(id1=batch_id, id2=self._batch_id)) + else: + return True + + +def forward(model: Tok2VecListener, inputs, is_train: bool): + """Supply the outputs from the upstream Tok2Vec component.""" + if is_train: + model.verify_inputs(inputs) + return model._outputs, model._backprop + else: + # This is pretty grim, but it's hard to do better :(. + # It's hard to avoid relying on the doc.tensor attribute, because the + # pipeline components can batch the data differently during prediction. + # That doesn't happen in update, where the nlp object works on batches + # of data. + # When the components batch differently, we don't receive a matching + # prediction from the upstream, so we can't predict. + if not all(doc.tensor.size for doc in inputs): + # But we do need to do *something* if the tensor hasn't been set. + # The compromise is to at least return data of the right shape, + # so the output is valid. + width = model.get_dim("nO") + outputs = [model.ops.alloc2f(len(doc), width) for doc in inputs] + else: + outputs = [doc.tensor for doc in inputs] + return outputs, lambda dX: [] diff --git a/spacy/pipeline/trainable_pipe.pxd b/spacy/pipeline/trainable_pipe.pxd new file mode 100644 index 000000000..d5cdbb511 --- /dev/null +++ b/spacy/pipeline/trainable_pipe.pxd @@ -0,0 +1,7 @@ +from .pipe cimport Pipe +from ..vocab cimport Vocab + +cdef class TrainablePipe(Pipe): + cdef public Vocab vocab + cdef public object model + cdef public object cfg diff --git a/spacy/pipeline/trainable_pipe.pyx b/spacy/pipeline/trainable_pipe.pyx new file mode 100644 index 000000000..6cd73d256 --- /dev/null +++ b/spacy/pipeline/trainable_pipe.pyx @@ -0,0 +1,332 @@ +# cython: infer_types=True, profile=True +from typing import Iterable, Iterator, Optional, Dict, Tuple, Callable +import srsly +from thinc.api import set_dropout_rate, Model, Optimizer + +from ..tokens.doc cimport Doc + +from ..training import validate_examples +from ..errors import Errors +from .pipe import Pipe, deserialize_config +from .. import util +from ..vocab import Vocab +from ..language import Language +from ..training import Example + + +cdef class TrainablePipe(Pipe): + """This class is a base class and not instantiated directly. Trainable + pipeline components like the EntityRecognizer or TextCategorizer inherit + from it and it defines the interface that components should follow to + function as trainable components in a spaCy pipeline. + + DOCS: https://nightly.spacy.io/api/pipe + """ + def __init__(self, vocab: Vocab, model: Model, name: str, **cfg): + """Initialize a pipeline component. + + vocab (Vocab): The shared vocabulary. + model (thinc.api.Model): The Thinc Model powering the pipeline component. + name (str): The component instance name. + **cfg: Additonal settings and config parameters. + + DOCS: https://nightly.spacy.io/api/pipe#init + """ + self.vocab = vocab + self.model = model + self.name = name + self.cfg = dict(cfg) + + def __call__(self, Doc doc) -> Doc: + """Apply the pipe to one document. The document is modified in place, + and returned. 
This usually happens under the hood when the nlp object + is called on a text and all components are applied to the Doc. + + docs (Doc): The Doc to process. + RETURNS (Doc): The processed Doc. + + DOCS: https://nightly.spacy.io/api/pipe#call + """ + scores = self.predict([doc]) + self.set_annotations([doc], scores) + return doc + + def pipe(self, stream: Iterable[Doc], *, batch_size: int=128) -> Iterator[Doc]: + """Apply the pipe to a stream of documents. This usually happens under + the hood when the nlp object is called on a text and all components are + applied to the Doc. + + stream (Iterable[Doc]): A stream of documents. + batch_size (int): The number of documents to buffer. + YIELDS (Doc): Processed documents in order. + + DOCS: https://nightly.spacy.io/api/pipe#pipe + """ + for docs in util.minibatch(stream, size=batch_size): + scores = self.predict(docs) + self.set_annotations(docs, scores) + yield from docs + + def predict(self, docs: Iterable[Doc]): + """Apply the pipeline's model to a batch of docs, without modifying them. + Returns a single tensor for a batch of documents. + + docs (Iterable[Doc]): The documents to predict. + RETURNS: Vector representations of the predictions. + + DOCS: https://nightly.spacy.io/api/pipe#predict + """ + raise NotImplementedError(Errors.E931.format(parent="TrainablePipe", method="predict", name=self.name)) + + def set_annotations(self, docs: Iterable[Doc], scores): + """Modify a batch of documents, using pre-computed scores. + + docs (Iterable[Doc]): The documents to modify. + scores: The scores to assign. + + DOCS: https://nightly.spacy.io/api/pipe#set_annotations + """ + raise NotImplementedError(Errors.E931.format(parent="TrainablePipe", method="set_annotations", name=self.name)) + + def update(self, + examples: Iterable["Example"], + *, drop: float=0.0, + set_annotations: bool=False, + sgd: Optimizer=None, + losses: Optional[Dict[str, float]]=None) -> Dict[str, float]: + """Learn from a batch of documents and gold-standard information, + updating the pipe's model. Delegates to predict and get_loss. + + examples (Iterable[Example]): A batch of Example objects. + drop (float): The dropout rate. + set_annotations (bool): Whether or not to update the Example objects + with the predictions. + sgd (thinc.api.Optimizer): The optimizer. + losses (Dict[str, float]): Optional record of the loss during training. + Updated using the component name as the key. + RETURNS (Dict[str, float]): The updated losses dictionary. + + DOCS: https://nightly.spacy.io/api/pipe#update + """ + if losses is None: + losses = {} + if not hasattr(self, "model") or self.model in (None, True, False): + return losses + losses.setdefault(self.name, 0.0) + validate_examples(examples, "TrainablePipe.update") + if not any(len(eg.predicted) if eg.predicted else 0 for eg in examples): + # Handle cases where there are no tokens in any docs. + return losses + set_dropout_rate(self.model, drop) + scores, bp_scores = self.model.begin_update([eg.predicted for eg in examples]) + loss, d_scores = self.get_loss(examples, scores) + bp_scores(d_scores) + if sgd not in (None, False): + self.finish_update(sgd) + losses[self.name] += loss + if set_annotations: + docs = [eg.predicted for eg in examples] + self.set_annotations(docs, scores=scores) + return losses + + def rehearse(self, + examples: Iterable[Example], + *, + sgd: Optimizer=None, + losses: Dict[str, float]=None, + **config) -> Dict[str, float]: + """Perform a "rehearsal" update from a batch of data. 
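Before the rehearsal details continue, here is a Thinc-only sketch of the cycle that the update() method above wraps: a forward pass with begin_update, a gradient in place of get_loss, backprop, then finish_update with the optimizer. The layer and the zero-filled toy batch are purely illustrative.

from thinc.api import Linear, Adam

model = Linear(nO=2, nI=4)
optimizer = Adam(0.001)
X = model.ops.alloc2f(3, 4)                  # toy batch: 3 inputs of width 4
Y_gold = model.ops.alloc2f(3, 2)
model.initialize(X=X, Y=Y_gold)

scores, bp_scores = model.begin_update(X)    # like self.model.begin_update(docs)
d_scores = scores - Y_gold                   # like get_loss()
bp_scores(d_scores)                          # accumulate gradients in the model
model.finish_update(optimizer)               # like self.finish_update(sgd)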
Rehearsal updates + teach the current model to make predictions similar to an initial model, + to try to address the "catastrophic forgetting" problem. This feature is + experimental. + + examples (Iterable[Example]): A batch of Example objects. + sgd (thinc.api.Optimizer): The optimizer. + losses (Dict[str, float]): Optional record of the loss during training. + Updated using the component name as the key. + RETURNS (Dict[str, float]): The updated losses dictionary. + + DOCS: https://nightly.spacy.io/api/pipe#rehearse + """ + pass + + def get_loss(self, examples: Iterable[Example], scores) -> Tuple[float, float]: + """Find the loss and gradient of loss for the batch of documents and + their predicted scores. + + examples (Iterable[Examples]): The batch of examples. + scores: Scores representing the model's predictions. + RETURNS (Tuple[float, float]): The loss and the gradient. + + DOCS: https://nightly.spacy.io/api/pipe#get_loss + """ + raise NotImplementedError(Errors.E931.format(parent="TrainablePipe", method="get_loss", name=self.name)) + + def create_optimizer(self) -> Optimizer: + """Create an optimizer for the pipeline component. + + RETURNS (thinc.api.Optimizer): The optimizer. + + DOCS: https://nightly.spacy.io/api/pipe#create_optimizer + """ + return util.create_default_optimizer() + + def initialize(self, get_examples: Callable[[], Iterable[Example]], *, nlp: Language=None): + """Initialize the pipe for training, using data examples if available. + This method needs to be implemented by each TrainablePipe component, + ensuring the internal model (if available) is initialized properly + using the provided sample of Example objects. + + get_examples (Callable[[], Iterable[Example]]): Function that + returns a representative sample of gold-standard Example objects. + nlp (Language): The current nlp object the component is part of. + + DOCS: https://nightly.spacy.io/api/pipe#initialize + """ + raise NotImplementedError(Errors.E931.format(parent="TrainablePipe", method="initialize", name=self.name)) + + def add_label(self, label: str) -> int: + """Add an output label. + For TrainablePipe components, it is possible to + extend pretrained models with new labels, but care should be taken to + avoid the "catastrophic forgetting" problem. + + label (str): The label to add. + RETURNS (int): 0 if label is already present, otherwise 1. + + DOCS: https://nightly.spacy.io/api/pipe#add_label + """ + raise NotImplementedError(Errors.E931.format(parent="Pipe", method="add_label", name=self.name)) + + @property + def is_trainable(self) -> bool: + return True + + @property + def is_resizable(self) -> bool: + return getattr(self, "model", None) and "resize_output" in self.model.attrs + + def _allow_extra_label(self) -> None: + """Raise an error if the component can not add any more labels.""" + if self.model.has_dim("nO") and self.model.get_dim("nO") == len(self.labels): + if not self.is_resizable: + raise ValueError(Errors.E922.format(name=self.name, nO=self.model.get_dim("nO"))) + + def set_output(self, nO: int) -> None: + if self.is_resizable: + self.model.attrs["resize_output"](self.model, nO) + else: + raise NotImplementedError(Errors.E921) + + def use_params(self, params: dict): + """Modify the pipe's model, to use the given parameter values. At the + end of the context, the original parameters are restored. + + params (dict): The parameter values to use in the model. 
+ + DOCS: https://nightly.spacy.io/api/pipe#use_params + """ + with self.model.use_params(params): + yield + + def finish_update(self, sgd: Optimizer) -> None: + """Update parameters using the current parameter gradients. + The Optimizer instance contains the functionality to perform + the stochastic gradient descent. + + sgd (thinc.api.Optimizer): The optimizer. + + DOCS: https://nightly.spacy.io/api/pipe#finish_update + """ + self.model.finish_update(sgd) + + def _validate_serialization_attrs(self): + """Check that the pipe implements the required attributes. If a subclass + implements a custom __init__ method but doesn't set these attributes, + they currently default to None, so we need to perform additonal checks. + """ + if not hasattr(self, "vocab") or self.vocab is None: + raise ValueError(Errors.E899.format(name=util.get_object_name(self))) + if not hasattr(self, "model") or self.model is None: + raise ValueError(Errors.E898.format(name=util.get_object_name(self))) + + def to_bytes(self, *, exclude=tuple()): + """Serialize the pipe to a bytestring. + + exclude (Iterable[str]): String names of serialization fields to exclude. + RETURNS (bytes): The serialized object. + + DOCS: https://nightly.spacy.io/api/pipe#to_bytes + """ + self._validate_serialization_attrs() + serialize = {} + if hasattr(self, "cfg") and self.cfg is not None: + serialize["cfg"] = lambda: srsly.json_dumps(self.cfg) + serialize["vocab"] = self.vocab.to_bytes + serialize["model"] = self.model.to_bytes + return util.to_bytes(serialize, exclude) + + def from_bytes(self, bytes_data, *, exclude=tuple()): + """Load the pipe from a bytestring. + + exclude (Iterable[str]): String names of serialization fields to exclude. + RETURNS (TrainablePipe): The loaded object. + + DOCS: https://nightly.spacy.io/api/pipe#from_bytes + """ + self._validate_serialization_attrs() + + def load_model(b): + try: + self.model.from_bytes(b) + except AttributeError: + raise ValueError(Errors.E149) from None + + deserialize = {} + if hasattr(self, "cfg") and self.cfg is not None: + deserialize["cfg"] = lambda b: self.cfg.update(srsly.json_loads(b)) + deserialize["vocab"] = lambda b: self.vocab.from_bytes(b) + deserialize["model"] = load_model + util.from_bytes(bytes_data, deserialize, exclude) + return self + + def to_disk(self, path, *, exclude=tuple()): + """Serialize the pipe to disk. + + path (str / Path): Path to a directory. + exclude (Iterable[str]): String names of serialization fields to exclude. + + DOCS: https://nightly.spacy.io/api/pipe#to_disk + """ + self._validate_serialization_attrs() + serialize = {} + if hasattr(self, "cfg") and self.cfg is not None: + serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg) + serialize["vocab"] = lambda p: self.vocab.to_disk(p) + serialize["model"] = lambda p: self.model.to_disk(p) + util.to_disk(path, serialize, exclude) + + def from_disk(self, path, *, exclude=tuple()): + """Load the pipe from disk. + + path (str / Path): Path to a directory. + exclude (Iterable[str]): String names of serialization fields to exclude. + RETURNS (TrainablePipe): The loaded object. 
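A hedged round-trip sketch for the serialization helpers above; the pipeline package, component choice, exclude values and path are illustrative only and assume a trained package such as en_core_web_sm is installed.

import spacy

nlp = spacy.load("en_core_web_sm")
tagger = nlp.get_pipe("tagger")                      # any TrainablePipe works the same way
tagger_bytes = tagger.to_bytes(exclude=["vocab"])    # cfg + model weights
tagger.from_bytes(tagger_bytes, exclude=["vocab"])   # reload into a compatible component
tagger.to_disk("/tmp/tagger_demo")                   # invented path
tagger.from_disk("/tmp/tagger_demo")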
+ + DOCS: https://nightly.spacy.io/api/pipe#from_disk + """ + self._validate_serialization_attrs() + + def load_model(p): + try: + self.model.from_bytes(p.open("rb").read()) + except AttributeError: + raise ValueError(Errors.E149) from None + + deserialize = {} + if hasattr(self, "cfg") and self.cfg is not None: + deserialize["cfg"] = lambda p: self.cfg.update(deserialize_config(p)) + deserialize["vocab"] = lambda p: self.vocab.from_disk(p) + deserialize["model"] = load_model + util.from_disk(path, deserialize, exclude) + return self diff --git a/spacy/pipeline/transition_parser.pxd b/spacy/pipeline/transition_parser.pxd new file mode 100644 index 000000000..bd5bad334 --- /dev/null +++ b/spacy/pipeline/transition_parser.pxd @@ -0,0 +1,19 @@ +from cymem.cymem cimport Pool + +from ..vocab cimport Vocab +from .trainable_pipe cimport TrainablePipe +from ._parser_internals.transition_system cimport Transition, TransitionSystem +from ._parser_internals._state cimport StateC +from ..ml.parser_model cimport WeightsC, ActivationsC, SizesC + + +cdef class Parser(TrainablePipe): + cdef public object _rehearsal_model + cdef readonly TransitionSystem moves + cdef public object _multitasks + + cdef void _parseC(self, StateC** states, + WeightsC weights, SizesC sizes) nogil + + cdef void c_transition_batch(self, StateC** states, const float* scores, + int nr_class, int batch_size) nogil diff --git a/spacy/pipeline/transition_parser.pyx b/spacy/pipeline/transition_parser.pyx new file mode 100644 index 000000000..63a8595cc --- /dev/null +++ b/spacy/pipeline/transition_parser.pyx @@ -0,0 +1,550 @@ +# cython: infer_types=True, cdivision=True, boundscheck=False, binding=True +from __future__ import print_function +from cymem.cymem cimport Pool +cimport numpy as np +from itertools import islice +from libcpp.vector cimport vector +from libc.string cimport memset +from libc.stdlib cimport calloc, free +import random +from typing import Optional + +import srsly +from thinc.api import set_dropout_rate +import numpy.random +import numpy +import warnings + +from ._parser_internals.stateclass cimport StateClass +from ..ml.parser_model cimport alloc_activations, free_activations +from ..ml.parser_model cimport predict_states, arg_max_if_valid +from ..ml.parser_model cimport WeightsC, ActivationsC, SizesC, cpu_log_loss +from ..ml.parser_model cimport get_c_weights, get_c_sizes +from ..tokens.doc cimport Doc +from .trainable_pipe import TrainablePipe + +from ..training import validate_examples, validate_get_examples +from ..errors import Errors, Warnings +from .. import util + + +cdef class Parser(TrainablePipe): + """ + Base class of the DependencyParser and EntityRecognizer. + """ + + def __init__( + self, + Vocab vocab, + model, + name="base_parser", + moves=None, + *, + update_with_oracle_cut_size, + multitasks=tuple(), + min_action_freq, + learn_tokens, + ): + """Create a Parser. + + vocab (Vocab): The vocabulary object. Must be shared with documents + to be processed. The value is set to the `.vocab` attribute. + **cfg: Configuration parameters. Set to the `.cfg` attribute. + If it doesn't include a value for 'moves', a new instance is + created with `self.TransitionSystem()`. This defines how the + parse-state is created, updated and evaluated. 
+ """ + self.vocab = vocab + self.name = name + cfg = { + "moves": moves, + "update_with_oracle_cut_size": update_with_oracle_cut_size, + "multitasks": list(multitasks), + "min_action_freq": min_action_freq, + "learn_tokens": learn_tokens + } + if moves is None: + # defined by EntityRecognizer as a BiluoPushDown + moves = self.TransitionSystem(self.vocab.strings) + self.moves = moves + self.model = model + if self.moves.n_moves != 0: + self.set_output(self.moves.n_moves) + self.cfg = cfg + self._multitasks = [] + for multitask in cfg["multitasks"]: + self.add_multitask_objective(multitask) + + self._rehearsal_model = None + + def __getnewargs_ex__(self): + """This allows pickling the Parser and its keyword-only init arguments""" + args = (self.vocab, self.model, self.name, self.moves) + return args, self.cfg + + @property + def move_names(self): + names = [] + for i in range(self.moves.n_moves): + name = self.moves.move_name(self.moves.c[i].move, self.moves.c[i].label) + # Explicitly removing the internal "U-" token used for blocking entities + if name != "U-": + names.append(name) + return names + + @property + def labels(self): + class_names = [self.moves.get_class_name(i) for i in range(self.moves.n_moves)] + return class_names + + @property + def label_data(self): + return self.moves.labels + + @property + def tok2vec(self): + """Return the embedding and convolutional layer of the model.""" + return self.model.get_ref("tok2vec") + + @property + def postprocesses(self): + # Available for subclasses, e.g. to deprojectivize + return [] + + def add_label(self, label): + resized = False + for action in self.moves.action_types: + added = self.moves.add_action(action, label) + if added: + resized = True + if resized: + self._resize() + self.vocab.strings.add(label) + return 1 + return 0 + + def _resize(self): + self.model.attrs["resize_output"](self.model, self.moves.n_moves) + if self._rehearsal_model not in (True, False, None): + self._rehearsal_model.attrs["resize_output"]( + self._rehearsal_model, self.moves.n_moves + ) + + def add_multitask_objective(self, target): + # Defined in subclasses, to avoid circular import + raise NotImplementedError + + def init_multitask_objectives(self, get_examples, pipeline, **cfg): + """Setup models for secondary objectives, to benefit from multi-task + learning. This method is intended to be overridden by subclasses. + + For instance, the dependency parser can benefit from sharing + an input representation with a label prediction model. These auxiliary + models are discarded after training. + """ + pass + + def use_params(self, params): + # Can't decorate cdef class :(. Workaround. + with self.model.use_params(params): + yield + + def __call__(self, Doc doc): + """Apply the parser or entity recognizer, setting the annotations onto + the `Doc` object. + + doc (Doc): The document to be processed. + """ + states = self.predict([doc]) + self.set_annotations([doc], states) + return doc + + def pipe(self, docs, *, int batch_size=256): + """Process a stream of documents. + + stream: The sequence of documents to process. + batch_size (int): Number of documents to accumulate into a working set. + YIELDS (Doc): Documents, in order. 
+ """ + cdef Doc doc + for batch in util.minibatch(docs, size=batch_size): + batch_in_order = list(batch) + by_length = sorted(batch, key=lambda doc: len(doc)) + for subbatch in util.minibatch(by_length, size=max(batch_size//4, 2)): + subbatch = list(subbatch) + parse_states = self.predict(subbatch) + self.set_annotations(subbatch, parse_states) + yield from batch_in_order + + def predict(self, docs): + if isinstance(docs, Doc): + docs = [docs] + if not any(len(doc) for doc in docs): + result = self.moves.init_batch(docs) + self._resize() + return result + return self.greedy_parse(docs, drop=0.0) + + def greedy_parse(self, docs, drop=0.): + cdef vector[StateC*] states + cdef StateClass state + set_dropout_rate(self.model, drop) + batch = self.moves.init_batch(docs) + # This is pretty dirty, but the NER can resize itself in init_batch, + # if labels are missing. We therefore have to check whether we need to + # expand our model output. + self._resize() + model = self.model.predict(docs) + weights = get_c_weights(model) + for state in batch: + if not state.is_final(): + states.push_back(state.c) + sizes = get_c_sizes(model, states.size()) + with nogil: + self._parseC(&states[0], + weights, sizes) + model.clear_memory() + del model + return batch + + cdef void _parseC(self, StateC** states, + WeightsC weights, SizesC sizes) nogil: + cdef int i, j + cdef vector[StateC*] unfinished + cdef ActivationsC activations = alloc_activations(sizes) + while sizes.states >= 1: + predict_states(&activations, + states, &weights, sizes) + # Validate actions, argmax, take action. + self.c_transition_batch(states, + activations.scores, sizes.classes, sizes.states) + for i in range(sizes.states): + if not states[i].is_final(): + unfinished.push_back(states[i]) + for i in range(unfinished.size()): + states[i] = unfinished[i] + sizes.states = unfinished.size() + unfinished.clear() + free_activations(&activations) + + def set_annotations(self, docs, states): + cdef StateClass state + cdef Doc doc + for i, (state, doc) in enumerate(zip(states, docs)): + self.moves.finalize_state(state.c) + for j in range(doc.length): + doc.c[j] = state.c._sent[j] + self.moves.finalize_doc(doc) + for hook in self.postprocesses: + hook(doc) + + def transition_states(self, states, float[:, ::1] scores): + cdef StateClass state + cdef float* c_scores = &scores[0, 0] + cdef vector[StateC*] c_states + for state in states: + c_states.push_back(state.c) + self.c_transition_batch(&c_states[0], c_scores, scores.shape[1], scores.shape[0]) + return [state for state in states if not state.c.is_final()] + + cdef void c_transition_batch(self, StateC** states, const float* scores, + int nr_class, int batch_size) nogil: + # n_moves should not be zero at this point, but make sure to avoid zero-length mem alloc + with gil: + assert self.moves.n_moves > 0, Errors.E924.format(name=self.name) + is_valid = calloc(self.moves.n_moves, sizeof(int)) + cdef int i, guess + cdef Transition action + for i in range(batch_size): + self.moves.set_valid(is_valid, states[i]) + guess = arg_max_if_valid(&scores[i*nr_class], is_valid, nr_class) + if guess == -1: + # This shouldn't happen, but it's hard to raise an error here, + # and we don't want to infinite loop. So, force to end state. 
+ states[i].force_final() + else: + action = self.moves.c[guess] + action.do(states[i], action.label) + states[i].push_hist(guess) + free(is_valid) + + def update(self, examples, *, drop=0., set_annotations=False, sgd=None, losses=None): + cdef StateClass state + if losses is None: + losses = {} + losses.setdefault(self.name, 0.) + validate_examples(examples, "Parser.update") + for multitask in self._multitasks: + multitask.update(examples, drop=drop, sgd=sgd) + n_examples = len([eg for eg in examples if self.moves.has_gold(eg)]) + if n_examples == 0: + return losses + set_dropout_rate(self.model, drop) + # Prepare the stepwise model, and get the callback for finishing the batch + model, backprop_tok2vec = self.model.begin_update( + [eg.predicted for eg in examples]) + max_moves = self.cfg["update_with_oracle_cut_size"] + if max_moves >= 1: + # Chop sequences into lengths of this many words, to make the + # batch uniform length. + max_moves = int(random.uniform(max_moves // 2, max_moves * 2)) + states, golds, _ = self._init_gold_batch( + examples, + max_length=max_moves + ) + else: + states, golds, _ = self.moves.init_gold_batch(examples) + if not states: + return losses + all_states = list(states) + states_golds = list(zip(states, golds)) + n_moves = 0 + while states_golds: + states, golds = zip(*states_golds) + scores, backprop = model.begin_update(states) + d_scores = self.get_batch_loss(states, golds, scores, losses) + # Note that the gradient isn't normalized by the batch size + # here, because our "samples" are really the states...But we + # can't normalize by the number of states either, as then we'd + # be getting smaller gradients for states in long sequences. + backprop(d_scores) + # Follow the predicted action + self.transition_states(states, scores) + states_golds = [(s, g) for (s, g) in zip(states, golds) if not s.is_final()] + if max_moves >= 1 and n_moves >= max_moves: + break + n_moves += 1 + + backprop_tok2vec(golds) + if sgd not in (None, False): + self.finish_update(sgd) + if set_annotations: + docs = [eg.predicted for eg in examples] + self.set_annotations(docs, all_states) + # Ugh, this is annoying. If we're working on GPU, we want to free the + # memory ASAP. It seems that Python doesn't necessarily get around to + # removing these in time if we don't explicitly delete? It's confusing. + del backprop + del backprop_tok2vec + model.clear_memory() + del model + return losses + + def rehearse(self, examples, sgd=None, losses=None, **cfg): + """Perform a "rehearsal" update, to prevent catastrophic forgetting.""" + if losses is None: + losses = {} + for multitask in self._multitasks: + if hasattr(multitask, 'rehearse'): + multitask.rehearse(examples, losses=losses, sgd=sgd) + if self._rehearsal_model is None: + return None + losses.setdefault(self.name, 0.) + validate_examples(examples, "Parser.rehearse") + docs = [eg.predicted for eg in examples] + states = self.moves.init_batch(docs) + # This is pretty dirty, but the NER can resize itself in init_batch, + # if labels are missing. We therefore have to check whether we need to + # expand our model output. + self._resize() + # Prepare the stepwise model, and get the callback for finishing the batch + set_dropout_rate(self._rehearsal_model, 0.0) + set_dropout_rate(self.model, 0.0) + tutor, _ = self._rehearsal_model.begin_update(docs) + model, backprop_tok2vec = self.model.begin_update(docs) + n_scores = 0. + loss = 0. 
+ while states: + targets, _ = tutor.begin_update(states) + guesses, backprop = model.begin_update(states) + d_scores = (guesses - targets) / targets.shape[0] + # If all weights for an output are 0 in the original model, don't + # supervise that output. This allows us to add classes. + loss += (d_scores**2).sum() + backprop(d_scores) + # Follow the predicted action + self.transition_states(states, guesses) + states = [state for state in states if not state.is_final()] + n_scores += d_scores.size + # Do the backprop + backprop_tok2vec(docs) + if sgd is not None: + self.finish_update(sgd) + losses[self.name] += loss / n_scores + del backprop + del backprop_tok2vec + model.clear_memory() + tutor.clear_memory() + del model + del tutor + return losses + + def get_batch_loss(self, states, golds, float[:, ::1] scores, losses): + cdef StateClass state + cdef Pool mem = Pool() + cdef int i + + # n_moves should not be zero at this point, but make sure to avoid zero-length mem alloc + assert self.moves.n_moves > 0, Errors.E924.format(name=self.name) + + is_valid = mem.alloc(self.moves.n_moves, sizeof(int)) + costs = mem.alloc(self.moves.n_moves, sizeof(float)) + cdef np.ndarray d_scores = numpy.zeros((len(states), self.moves.n_moves), + dtype='f', order='C') + c_d_scores = d_scores.data + unseen_classes = self.model.attrs["unseen_classes"] + for i, (state, gold) in enumerate(zip(states, golds)): + memset(is_valid, 0, self.moves.n_moves * sizeof(int)) + memset(costs, 0, self.moves.n_moves * sizeof(float)) + self.moves.set_costs(is_valid, costs, state, gold) + for j in range(self.moves.n_moves): + if costs[j] <= 0.0 and j in unseen_classes: + unseen_classes.remove(j) + cpu_log_loss(c_d_scores, + costs, is_valid, &scores[i, 0], d_scores.shape[1]) + c_d_scores += d_scores.shape[1] + # Note that we don't normalize this. See comment in update() for why. + if losses is not None: + losses.setdefault(self.name, 0.) 
+ losses[self.name] += (d_scores**2).sum() + return d_scores + + def set_output(self, nO): + self.model.attrs["resize_output"](self.model, nO) + + def initialize(self, get_examples, nlp=None, labels=None): + validate_get_examples(get_examples, "Parser.initialize") + lexeme_norms = self.vocab.lookups.get_table("lexeme_norm", {}) + if len(lexeme_norms) == 0 and self.vocab.lang in util.LEXEME_NORM_LANGS: + langs = ", ".join(util.LEXEME_NORM_LANGS) + util.logger.debug(Warnings.W033.format(model="parser or NER", langs=langs)) + if labels is not None: + actions = dict(labels) + else: + actions = self.moves.get_actions( + examples=get_examples(), + min_freq=self.cfg['min_action_freq'], + learn_tokens=self.cfg["learn_tokens"] + ) + for action, labels in self.moves.labels.items(): + actions.setdefault(action, {}) + for label, freq in labels.items(): + if label not in actions[action]: + actions[action][label] = freq + self.moves.initialize_actions(actions) + # make sure we resize so we have an appropriate upper layer + self._resize() + doc_sample = [] + if nlp is not None: + for name, component in nlp.pipeline: + if component is self: + break + # non-trainable components may have a pipe() implementation that refers to dummy + # predict and set_annotations methods + if hasattr(component, "pipe"): + doc_sample = list(component.pipe(doc_sample, batch_size=8)) + else: + doc_sample = [component(doc) for doc in doc_sample] + if not doc_sample: + for example in islice(get_examples(), 10): + doc_sample.append(example.predicted) + assert len(doc_sample) > 0, Errors.E923.format(name=self.name) + self.model.initialize(doc_sample) + if nlp is not None: + self.init_multitask_objectives(get_examples, nlp.pipeline) + + def to_disk(self, path, exclude=tuple()): + serializers = { + "model": lambda p: (self.model.to_disk(p) if self.model is not True else True), + "vocab": lambda p: self.vocab.to_disk(p), + "moves": lambda p: self.moves.to_disk(p, exclude=["strings"]), + "cfg": lambda p: srsly.write_json(p, self.cfg) + } + util.to_disk(path, serializers, exclude) + + def from_disk(self, path, exclude=tuple()): + deserializers = { + "vocab": lambda p: self.vocab.from_disk(p), + "moves": lambda p: self.moves.from_disk(p, exclude=["strings"]), + "cfg": lambda p: self.cfg.update(srsly.read_json(p)), + "model": lambda p: None, + } + util.from_disk(path, deserializers, exclude) + if "model" not in exclude: + path = util.ensure_path(path) + with (path / "model").open("rb") as file_: + bytes_data = file_.read() + try: + self._resize() + self.model.from_bytes(bytes_data) + except AttributeError: + raise ValueError(Errors.E149) from None + return self + + def to_bytes(self, exclude=tuple()): + serializers = { + "model": lambda: (self.model.to_bytes()), + "vocab": lambda: self.vocab.to_bytes(), + "moves": lambda: self.moves.to_bytes(exclude=["strings"]), + "cfg": lambda: srsly.json_dumps(self.cfg, indent=2, sort_keys=True) + } + return util.to_bytes(serializers, exclude) + + def from_bytes(self, bytes_data, exclude=tuple()): + deserializers = { + "vocab": lambda b: self.vocab.from_bytes(b), + "moves": lambda b: self.moves.from_bytes(b, exclude=["strings"]), + "cfg": lambda b: self.cfg.update(srsly.json_loads(b)), + "model": lambda b: None, + } + msg = util.from_bytes(bytes_data, deserializers, exclude) + if 'model' not in exclude: + if 'model' in msg: + try: + self.model.from_bytes(msg['model']) + except AttributeError: + raise ValueError(Errors.E149) from None + return self + + def _init_gold_batch(self, examples, max_length): 
+ """Make a square batch, of length equal to the shortest transition + sequence or a cap. A long + doc will get multiple states. Let's say we have a doc of length 2*N, + where N is the shortest doc. We'll make two states, one representing + long_doc[:N], and another representing long_doc[N:].""" + cdef: + StateClass start_state + StateClass state + Transition action + all_states = self.moves.init_batch([eg.predicted for eg in examples]) + states = [] + golds = [] + to_cut = [] + for state, eg in zip(all_states, examples): + if self.moves.has_gold(eg) and not state.is_final(): + gold = self.moves.init_gold(state, eg) + if len(eg.x) < max_length: + states.append(state) + golds.append(gold) + else: + oracle_actions = self.moves.get_oracle_sequence_from_state( + state.copy(), gold) + to_cut.append((eg, state, gold, oracle_actions)) + if not to_cut: + return states, golds, 0 + cdef int clas + for eg, state, gold, oracle_actions in to_cut: + for i in range(0, len(oracle_actions), max_length): + start_state = state.copy() + for clas in oracle_actions[i:i+max_length]: + action = self.moves.c[clas] + action.do(state.c, action.label) + state.c.push_hist(action.clas) + if state.is_final(): + break + if self.moves.has_gold(eg, start_state.B(0), state.B(0)): + states.append(start_state) + golds.append(gold) + if state.is_final(): + break + return states, golds, max_length diff --git a/spacy/schemas.py b/spacy/schemas.py new file mode 100644 index 000000000..f3664acff --- /dev/null +++ b/spacy/schemas.py @@ -0,0 +1,474 @@ +from typing import Dict, List, Union, Optional, Any, Callable, Type, Tuple +from typing import Iterable, TypeVar, TYPE_CHECKING +from enum import Enum +from pydantic import BaseModel, Field, ValidationError, validator, create_model +from pydantic import StrictStr, StrictInt, StrictFloat, StrictBool +from pydantic.main import ModelMetaclass +from thinc.api import Optimizer, ConfigValidationError +from thinc.config import Promise +from collections import defaultdict +import inspect + +from .attrs import NAMES +from .lookups import Lookups +from .util import is_cython_func + +if TYPE_CHECKING: + # This lets us add type hints for mypy etc. without causing circular imports + from .language import Language # noqa: F401 + from .training import Example # noqa: F401 + + +# fmt: off +ItemT = TypeVar("ItemT") +Batcher = Union[Callable[[Iterable[ItemT]], Iterable[List[ItemT]]], Promise] +Reader = Union[Callable[["Language", str], Iterable["Example"]], Promise] +Logger = Union[Callable[["Language"], Tuple[Callable[[Dict[str, Any]], None], Callable]], Promise] +# fmt: on + + +def validate(schema: Type[BaseModel], obj: Dict[str, Any]) -> List[str]: + """Validate data against a given pydantic schema. + + obj (Dict[str, Any]): JSON-serializable data to validate. + schema (pydantic.BaseModel): The schema to validate against. + RETURNS (List[str]): A list of error messages, if available. 
+ """ + try: + schema(**obj) + return [] + except ValidationError as e: + errors = e.errors() + data = defaultdict(list) + for error in errors: + err_loc = " -> ".join([str(p) for p in error.get("loc", [])]) + data[err_loc].append(error.get("msg")) + return [f"[{loc}] {', '.join(msg)}" for loc, msg in data.items()] + + +# Initialization + + +class ArgSchemaConfig: + extra = "forbid" + arbitrary_types_allowed = True + + +class ArgSchemaConfigExtra: + extra = "forbid" + arbitrary_types_allowed = True + + +def get_arg_model( + func: Callable, + *, + exclude: Iterable[str] = tuple(), + name: str = "ArgModel", + strict: bool = True, +) -> ModelMetaclass: + """Generate a pydantic model for function arguments. + + func (Callable): The function to generate the schema for. + exclude (Iterable[str]): Parameter names to ignore. + name (str): Name of created model class. + strict (bool): Don't allow extra arguments if no variable keyword arguments + are allowed on the function. + RETURNS (ModelMetaclass): A pydantic model. + """ + sig_args = {} + try: + sig = inspect.signature(func) + except ValueError: + # Typically happens if the method is part of a Cython module without + # binding=True. Here we just use an empty model that allows everything. + return create_model(name, __config__=ArgSchemaConfigExtra) + has_variable = False + for param in sig.parameters.values(): + if param.name in exclude: + continue + if param.kind == param.VAR_KEYWORD: + # The function allows variable keyword arguments so we shouldn't + # include **kwargs etc. in the schema and switch to non-strict + # mode and pass through all other values + has_variable = True + continue + # If no annotation is specified assume it's anything + annotation = param.annotation if param.annotation != param.empty else Any + # If no default value is specified assume that it's required. Cython + # functions/methods will have param.empty for default value None so we + # need to treat them differently + default_empty = None if is_cython_func(func) else ... + default = param.default if param.default != param.empty else default_empty + sig_args[param.name] = (annotation, default) + is_strict = strict and not has_variable + sig_args["__config__"] = ArgSchemaConfig if is_strict else ArgSchemaConfigExtra + return create_model(name, **sig_args) + + +def validate_init_settings( + func: Callable, + settings: Dict[str, Any], + *, + section: Optional[str] = None, + name: str = "", + exclude: Iterable[str] = ("get_examples", "nlp"), +) -> Dict[str, Any]: + """Validate initialization settings against the expected arguments in + the method signature. Will parse values if possible (e.g. int to string) + and return the updated settings dict. Will raise a ConfigValidationError + if types don't match or required values are missing. + + func (Callable): The initialize method of a given component etc. + settings (Dict[str, Any]): The settings from the respective [initialize] block. + section (str): Initialize section, for error message. + name (str): Name of the block in the section. + exclude (Iterable[str]): Parameter names to exclude from schema. + RETURNS (Dict[str, Any]): The validated settings. 
+ """ + schema = get_arg_model(func, exclude=exclude, name="InitArgModel") + try: + return schema(**settings).dict() + except ValidationError as e: + block = "initialize" if not section else f"initialize.{section}" + title = f"Error validating initialization settings in [{block}]" + raise ConfigValidationError( + title=title, errors=e.errors(), config=settings, parent=name + ) from None + + +# Matcher token patterns + + +def validate_token_pattern(obj: list) -> List[str]: + # Try to convert non-string keys (e.g. {ORTH: "foo"} -> {"ORTH": "foo"}) + get_key = lambda k: NAMES[k] if isinstance(k, int) and k < len(NAMES) else k + if isinstance(obj, list): + converted = [] + for pattern in obj: + if isinstance(pattern, dict): + pattern = {get_key(k): v for k, v in pattern.items()} + converted.append(pattern) + obj = converted + return validate(TokenPatternSchema, {"pattern": obj}) + + +class TokenPatternString(BaseModel): + REGEX: Optional[StrictStr] = Field(None, alias="regex") + IN: Optional[List[StrictStr]] = Field(None, alias="in") + NOT_IN: Optional[List[StrictStr]] = Field(None, alias="not_in") + IS_SUBSET: Optional[List[StrictStr]] = Field(None, alias="is_subset") + IS_SUPERSET: Optional[List[StrictStr]] = Field(None, alias="is_superset") + + class Config: + extra = "forbid" + allow_population_by_field_name = True # allow alias and field name + + @validator("*", pre=True, each_item=True, allow_reuse=True) + def raise_for_none(cls, v): + if v is None: + raise ValueError("None / null is not allowed") + return v + + +class TokenPatternNumber(BaseModel): + REGEX: Optional[StrictStr] = Field(None, alias="regex") + IN: Optional[List[StrictInt]] = Field(None, alias="in") + NOT_IN: Optional[List[StrictInt]] = Field(None, alias="not_in") + ISSUBSET: Optional[List[StrictInt]] = Field(None, alias="issubset") + ISSUPERSET: Optional[List[StrictInt]] = Field(None, alias="issuperset") + EQ: Union[StrictInt, StrictFloat] = Field(None, alias="==") + NEQ: Union[StrictInt, StrictFloat] = Field(None, alias="!=") + GEQ: Union[StrictInt, StrictFloat] = Field(None, alias=">=") + LEQ: Union[StrictInt, StrictFloat] = Field(None, alias="<=") + GT: Union[StrictInt, StrictFloat] = Field(None, alias=">") + LT: Union[StrictInt, StrictFloat] = Field(None, alias="<") + + class Config: + extra = "forbid" + allow_population_by_field_name = True # allow alias and field name + + @validator("*", pre=True, each_item=True, allow_reuse=True) + def raise_for_none(cls, v): + if v is None: + raise ValueError("None / null is not allowed") + return v + + +class TokenPatternOperator(str, Enum): + plus: StrictStr = "+" + start: StrictStr = "*" + question: StrictStr = "?" + exclamation: StrictStr = "!" 
+ + +StringValue = Union[TokenPatternString, StrictStr] +NumberValue = Union[TokenPatternNumber, StrictInt, StrictFloat] +UnderscoreValue = Union[ + TokenPatternString, TokenPatternNumber, str, int, float, list, bool +] + + +class TokenPattern(BaseModel): + orth: Optional[StringValue] = None + text: Optional[StringValue] = None + lower: Optional[StringValue] = None + pos: Optional[StringValue] = None + tag: Optional[StringValue] = None + morph: Optional[StringValue] = None + dep: Optional[StringValue] = None + lemma: Optional[StringValue] = None + shape: Optional[StringValue] = None + ent_type: Optional[StringValue] = None + norm: Optional[StringValue] = None + length: Optional[NumberValue] = None + spacy: Optional[StrictBool] = None + is_alpha: Optional[StrictBool] = None + is_ascii: Optional[StrictBool] = None + is_digit: Optional[StrictBool] = None + is_lower: Optional[StrictBool] = None + is_upper: Optional[StrictBool] = None + is_title: Optional[StrictBool] = None + is_punct: Optional[StrictBool] = None + is_space: Optional[StrictBool] = None + is_bracket: Optional[StrictBool] = None + is_quote: Optional[StrictBool] = None + is_left_punct: Optional[StrictBool] = None + is_right_punct: Optional[StrictBool] = None + is_currency: Optional[StrictBool] = None + is_stop: Optional[StrictBool] = None + is_sent_start: Optional[StrictBool] = None + sent_start: Optional[StrictBool] = None + like_num: Optional[StrictBool] = None + like_url: Optional[StrictBool] = None + like_email: Optional[StrictBool] = None + op: Optional[TokenPatternOperator] = None + underscore: Optional[Dict[StrictStr, UnderscoreValue]] = Field(None, alias="_") + + class Config: + extra = "forbid" + allow_population_by_field_name = True + alias_generator = lambda value: value.upper() + + @validator("*", pre=True, allow_reuse=True) + def raise_for_none(cls, v): + if v is None: + raise ValueError("None / null is not allowed") + return v + + +class TokenPatternSchema(BaseModel): + pattern: List[TokenPattern] = Field(..., minItems=1) + + class Config: + extra = "forbid" + + +# Model meta + + +class ModelMetaSchema(BaseModel): + # fmt: off + lang: StrictStr = Field(..., title="Two-letter language code, e.g. 'en'") + name: StrictStr = Field(..., title="Model name") + version: StrictStr = Field(..., title="Model version") + spacy_version: StrictStr = Field("", title="Compatible spaCy version identifier") + parent_package: StrictStr = Field("spacy", title="Name of parent spaCy package, e.g. spacy or spacy-nightly") + pipeline: List[StrictStr] = Field([], title="Names of pipeline components") + description: StrictStr = Field("", title="Model description") + license: StrictStr = Field("", title="Model license") + author: StrictStr = Field("", title="Model author name") + email: StrictStr = Field("", title="Model author email") + url: StrictStr = Field("", title="Model author URL") + sources: Optional[Union[List[StrictStr], List[Dict[str, str]]]] = Field(None, title="Training data sources") + vectors: Dict[str, Any] = Field({}, title="Included word vectors") + labels: Dict[str, List[str]] = Field({}, title="Component labels, keyed by component name") + performance: Dict[str, Any] = Field({}, title="Accuracy and speed numbers") + spacy_git_version: StrictStr = Field("", title="Commit of spaCy version used") + # fmt: on + + +# Config schema +# We're not setting any defaults here (which is too messy) and are making all +# fields required, so we can raise validation errors for missing values. 
To +# provide a default, we include a separate .cfg file with all values and +# check that against this schema in the test suite to make sure it's always +# up to date. + + +class ConfigSchemaTraining(BaseModel): + # fmt: off + dev_corpus: StrictStr = Field(..., title="Path in the config to the dev data") + train_corpus: StrictStr = Field(..., title="Path in the config to the training data") + batcher: Batcher = Field(..., title="Batcher for the training data") + dropout: StrictFloat = Field(..., title="Dropout rate") + patience: StrictInt = Field(..., title="How many steps to continue without improvement in evaluation score") + max_epochs: StrictInt = Field(..., title="Maximum number of epochs to train for") + max_steps: StrictInt = Field(..., title="Maximum number of update steps to train for") + eval_frequency: StrictInt = Field(..., title="How often to evaluate during training (steps)") + seed: Optional[StrictInt] = Field(..., title="Random seed") + gpu_allocator: Optional[StrictStr] = Field(..., title="Memory allocator when running on GPU") + accumulate_gradient: StrictInt = Field(..., title="Whether to divide the batch up into substeps") + score_weights: Dict[StrictStr, Optional[Union[StrictFloat, StrictInt]]] = Field(..., title="Scores to report and their weights for selecting final model") + optimizer: Optimizer = Field(..., title="The optimizer to use") + logger: Logger = Field(..., title="The logger to track training progress") + frozen_components: List[str] = Field(..., title="Pipeline components that shouldn't be updated during training") + before_to_disk: Optional[Callable[["Language"], "Language"]] = Field(..., title="Optional callback to modify nlp object after training, before it's saved to disk") + # fmt: on + + class Config: + extra = "forbid" + arbitrary_types_allowed = True + + +class ConfigSchemaNlp(BaseModel): + # fmt: off + lang: StrictStr = Field(..., title="The base language to use") + pipeline: List[StrictStr] = Field(..., title="The pipeline component names in order") + disabled: List[StrictStr] = Field(..., title="Pipeline components to disable by default") + tokenizer: Callable = Field(..., title="The tokenizer to use") + before_creation: Optional[Callable[[Type["Language"]], Type["Language"]]] = Field(..., title="Optional callback to modify Language class before initialization") + after_creation: Optional[Callable[["Language"], "Language"]] = Field(..., title="Optional callback to modify nlp object after creation and before the pipeline is constructed") + after_pipeline_creation: Optional[Callable[["Language"], "Language"]] = Field(..., title="Optional callback to modify nlp object after the pipeline is constructed") + # fmt: on + + class Config: + extra = "forbid" + arbitrary_types_allowed = True + + +class ConfigSchemaPretrainEmpty(BaseModel): + class Config: + extra = "forbid" + + +class ConfigSchemaPretrain(BaseModel): + # fmt: off + max_epochs: StrictInt = Field(..., title="Maximum number of epochs to train for") + dropout: StrictFloat = Field(..., title="Dropout rate") + n_save_every: Optional[StrictInt] = Field(..., title="Saving frequency") + optimizer: Optimizer = Field(..., title="The optimizer to use") + corpus: StrictStr = Field(..., title="Path in the config to the training data") + batcher: Batcher = Field(..., title="Batcher for the training data") + component: str = Field(..., title="Component to find the layer to pretrain") + layer: str = Field(..., title="Layer to pretrain. 
Whole model if empty.") + + # TODO: use a more detailed schema for this? + objective: Dict[str, Any] = Field(..., title="Pretraining objective") + # fmt: on + + class Config: + extra = "forbid" + arbitrary_types_allowed = True + + +class ConfigSchemaInit(BaseModel): + # fmt: off + vocab_data: Optional[StrictStr] = Field(..., title="Path to JSON-formatted vocabulary file") + lookups: Optional[Lookups] = Field(..., title="Vocabulary lookups, e.g. lexeme normalization") + vectors: Optional[StrictStr] = Field(..., title="Path to vectors") + init_tok2vec: Optional[StrictStr] = Field(..., title="Path to pretrained tok2vec weights") + tokenizer: Dict[StrictStr, Any] = Field(..., help="Arguments to be passed into Tokenizer.initialize") + components: Dict[StrictStr, Dict[StrictStr, Any]] = Field(..., help="Arguments for TrainablePipe.initialize methods of pipeline components, keyed by component") + # fmt: on + + class Config: + extra = "forbid" + arbitrary_types_allowed = True + + +class ConfigSchema(BaseModel): + training: ConfigSchemaTraining + nlp: ConfigSchemaNlp + pretraining: Union[ConfigSchemaPretrain, ConfigSchemaPretrainEmpty] = {} + components: Dict[str, Dict[str, Any]] + corpora: Dict[str, Reader] + initialize: ConfigSchemaInit + + class Config: + extra = "allow" + arbitrary_types_allowed = True + + +CONFIG_SCHEMAS = { + "nlp": ConfigSchemaNlp, + "training": ConfigSchemaTraining, + "pretraining": ConfigSchemaPretrain, + "initialize": ConfigSchemaInit, +} + + +# Project config Schema + + +class ProjectConfigAssetGitItem(BaseModel): + # fmt: off + repo: StrictStr = Field(..., title="URL of Git repo to download from") + path: StrictStr = Field(..., title="File path or sub-directory to download (used for sparse checkout)") + branch: StrictStr = Field("master", title="Branch to clone from") + # fmt: on + + +class ProjectConfigAssetURL(BaseModel): + # fmt: off + dest: StrictStr = Field(..., title="Destination of downloaded asset") + url: Optional[StrictStr] = Field(None, title="URL of asset") + checksum: str = Field(None, title="MD5 hash of file", regex=r"([a-fA-F\d]{32})") + description: StrictStr = Field("", title="Description of asset") + # fmt: on + + +class ProjectConfigAssetGit(BaseModel): + # fmt: off + git: ProjectConfigAssetGitItem = Field(..., title="Git repo information") + checksum: str = Field(None, title="MD5 hash of file", regex=r"([a-fA-F\d]{32})") + description: Optional[StrictStr] = Field(None, title="Description of asset") + # fmt: on + + +class ProjectConfigCommand(BaseModel): + # fmt: off + name: StrictStr = Field(..., title="Name of command") + help: Optional[StrictStr] = Field(None, title="Command description") + script: List[StrictStr] = Field([], title="List of CLI commands to run, in order") + deps: List[StrictStr] = Field([], title="File dependencies required by this command") + outputs: List[StrictStr] = Field([], title="Outputs produced by this command") + outputs_no_cache: List[StrictStr] = Field([], title="Outputs not tracked by DVC (DVC only)") + no_skip: bool = Field(False, title="Never skip this command, even if nothing changed") + # fmt: on + + class Config: + title = "A single named command specified in a project config" + extra = "forbid" + + +class ProjectConfigSchema(BaseModel): + # fmt: off + vars: Dict[StrictStr, Any] = Field({}, title="Optional variables to substitute in commands") + assets: List[Union[ProjectConfigAssetURL, ProjectConfigAssetGit]] = Field([], title="Data assets") + workflows: Dict[StrictStr, List[StrictStr]] = Field({}, title="Named 
workflows, mapped to list of project commands to run in order") + commands: List[ProjectConfigCommand] = Field([], title="Project command shortucts") + title: Optional[str] = Field(None, title="Project title") + spacy_version: Optional[StrictStr] = Field(None, title="spaCy version range that the project is compatible with") + # fmt: on + + class Config: + title = "Schema for project configuration file" + + +# Recommendations for init config workflows + + +class RecommendationTrfItem(BaseModel): + name: str + size_factor: int + + +class RecommendationTrf(BaseModel): + efficiency: RecommendationTrfItem + accuracy: RecommendationTrfItem + + +class RecommendationSchema(BaseModel): + word_vectors: Optional[str] = None + transformer: Optional[RecommendationTrf] = None + has_letters: bool = True diff --git a/spacy/scorer.py b/spacy/scorer.py index 25c660240..d1065f3a9 100644 --- a/spacy/scorer.py +++ b/spacy/scorer.py @@ -1,54 +1,73 @@ -# coding: utf8 -from __future__ import division, print_function, unicode_literals - +from typing import Optional, Iterable, Dict, Any, Callable, TYPE_CHECKING import numpy as np +from collections import defaultdict -from .gold import tags_to_entities, GoldParse +from .training import Example +from .tokens import Token, Doc, Span from .errors import Errors +from .util import get_lang_class, SimpleFrozenList +from .morphology import Morphology + +if TYPE_CHECKING: + # This lets us add type hints for mypy etc. without causing circular imports + from .language import Language # noqa: F401 -class PRFScore(object): - """ - A precision / recall / F score - """ +DEFAULT_PIPELINE = ["senter", "tagger", "morphologizer", "parser", "ner", "textcat"] - def __init__(self): + +class PRFScore: + """A precision / recall / F score.""" + + def __init__(self) -> None: self.tp = 0 self.fp = 0 self.fn = 0 - def score_set(self, cand, gold): + def __iadd__(self, other): + self.tp += other.tp + self.fp += other.fp + self.fn += other.fn + return self + + def __add__(self, other): + return PRFScore( + tp=self.tp + other.tp, fp=self.fp + other.fp, fn=self.fn + other.fn + ) + + def score_set(self, cand: set, gold: set) -> None: self.tp += len(cand.intersection(gold)) self.fp += len(cand - gold) self.fn += len(gold - cand) @property - def precision(self): + def precision(self) -> float: return self.tp / (self.tp + self.fp + 1e-100) @property - def recall(self): + def recall(self) -> float: return self.tp / (self.tp + self.fn + 1e-100) @property - def fscore(self): + def fscore(self) -> float: p = self.precision r = self.recall return 2 * ((p * r) / (p + r + 1e-100)) + def to_dict(self) -> Dict[str, float]: + return {"p": self.precision, "r": self.recall, "f": self.fscore} -class ROCAUCScore(object): - """ - An AUC ROC score. - """ - def __init__(self): +class ROCAUCScore: + """An AUC ROC score.""" + + def __init__(self) -> None: self.golds = [] self.cands = [] self.saved_score = 0.0 self.saved_score_at_len = 0 - def score_set(self, cand, gold): + def score_set(self, cand, gold) -> None: self.cands.append(cand) self.golds.append(gold) @@ -66,285 +85,570 @@ class ROCAUCScore(object): return self.saved_score -class Scorer(object): +class Scorer: """Compute evaluation scores.""" - def __init__(self, eval_punct=False, pipeline=None): + def __init__( + self, + nlp: Optional["Language"] = None, + default_lang: str = "xx", + default_pipeline=DEFAULT_PIPELINE, + **cfg, + ) -> None: """Initialize the Scorer. - eval_punct (bool): Evaluate the dependency attachments to and from - punctuation. 
- RETURNS (Scorer): The newly created object. - - DOCS: https://spacy.io/api/scorer#init + DOCS: https://nightly.spacy.io/api/scorer#init """ - self.tokens = PRFScore() - self.sbd = PRFScore() - self.unlabelled = PRFScore() - self.labelled = PRFScore() - self.labelled_per_dep = dict() - self.tags = PRFScore() - self.ner = PRFScore() - self.ner_per_ents = dict() - self.eval_punct = eval_punct - self.textcat = None - self.textcat_per_cat = dict() - self.textcat_positive_label = None - self.textcat_multilabel = False + self.nlp = nlp + self.cfg = cfg + if not nlp: + nlp = get_lang_class(default_lang)() + for pipe in default_pipeline: + nlp.add_pipe(pipe) + self.nlp = nlp - if pipeline: - for name, model in pipeline: - if name == "textcat": - self.textcat_positive_label = model.cfg.get("positive_label", None) - if self.textcat_positive_label: - self.textcat = PRFScore() - if not model.cfg.get("exclusive_classes", False): - self.textcat_multilabel = True - for label in model.cfg.get("labels", []): - self.textcat_per_cat[label] = ROCAUCScore() - else: - for label in model.cfg.get("labels", []): - self.textcat_per_cat[label] = PRFScore() + def score(self, examples: Iterable[Example]) -> Dict[str, Any]: + """Evaluate a list of Examples. - @property - def tags_acc(self): - """RETURNS (float): Part-of-speech tag accuracy (fine grained tags, - i.e. `Token.tag`). + examples (Iterable[Example]): The predicted annotations + correct annotations. + RETURNS (Dict): A dictionary of scores. + + DOCS: https://nightly.spacy.io/api/scorer#score """ - return self.tags.fscore * 100 + scores = {} + if hasattr(self.nlp.tokenizer, "score"): + scores.update(self.nlp.tokenizer.score(examples, **self.cfg)) + for name, component in self.nlp.pipeline: + if hasattr(component, "score"): + scores.update(component.score(examples, **self.cfg)) + return scores - @property - def token_acc(self): - """RETURNS (float): Tokenization accuracy.""" - return self.tokens.precision * 100 + @staticmethod + def score_tokenization(examples: Iterable[Example], **cfg) -> Dict[str, float]: + """Returns accuracy and PRF scores for tokenization. + * token_acc: # correct tokens / # gold tokens + * token_p/r/f: PRF for token character spans - @property - def uas(self): - """RETURNS (float): Unlabelled dependency score.""" - return self.unlabelled.fscore * 100 + examples (Iterable[Example]): Examples to score + RETURNS (Dict[str, float]): A dictionary containing the scores + token_acc/p/r/f. - @property - def las(self): - """RETURNS (float): Labelled dependency score.""" - return self.labelled.fscore * 100 - - @property - def las_per_type(self): - """RETURNS (dict): Scores per dependency label. 
+ DOCS: https://nightly.spacy.io/api/scorer#score_tokenization """ + acc_score = PRFScore() + prf_score = PRFScore() + for example in examples: + gold_doc = example.reference + pred_doc = example.predicted + align = example.alignment + gold_spans = set() + pred_spans = set() + for token in gold_doc: + if token.orth_.isspace(): + continue + gold_spans.add((token.idx, token.idx + len(token))) + for token in pred_doc: + if token.orth_.isspace(): + continue + pred_spans.add((token.idx, token.idx + len(token))) + if align.x2y.lengths[token.i] != 1: + acc_score.fp += 1 + else: + acc_score.tp += 1 + prf_score.score_set(pred_spans, gold_spans) return { - k: {"p": v.precision * 100, "r": v.recall * 100, "f": v.fscore * 100} - for k, v in self.labelled_per_dep.items() + "token_acc": acc_score.fscore, + "token_p": prf_score.precision, + "token_r": prf_score.recall, + "token_f": prf_score.fscore, } - @property - def ents_p(self): - """RETURNS (float): Named entity accuracy (precision).""" - return self.ner.precision * 100 + @staticmethod + def score_token_attr( + examples: Iterable[Example], + attr: str, + *, + getter: Callable[[Token, str], Any] = getattr, + **cfg, + ) -> Dict[str, float]: + """Returns an accuracy score for a token-level attribute. - @property - def ents_r(self): - """RETURNS (float): Named entity accuracy (recall).""" - return self.ner.recall * 100 + examples (Iterable[Example]): Examples to score + attr (str): The attribute to score. + getter (Callable[[Token, str], Any]): Defaults to getattr. If provided, + getter(token, attr) should return the value of the attribute for an + individual token. + RETURNS (Dict[str, float]): A dictionary containing the accuracy score + under the key attr_acc. - @property - def ents_f(self): - """RETURNS (float): Named entity accuracy (F-score).""" - return self.ner.fscore * 100 - - @property - def ents_per_type(self): - """RETURNS (dict): Scores per entity label. + DOCS: https://nightly.spacy.io/api/scorer#score_token_attr """ - return { - k: {"p": v.precision * 100, "r": v.recall * 100, "f": v.fscore * 100} - for k, v in self.ner_per_ents.items() - } + tag_score = PRFScore() + for example in examples: + gold_doc = example.reference + pred_doc = example.predicted + align = example.alignment + gold_tags = set() + for gold_i, token in enumerate(gold_doc): + gold_tags.add((gold_i, getter(token, attr))) + pred_tags = set() + for token in pred_doc: + if token.orth_.isspace(): + continue + if align.x2y.lengths[token.i] == 1: + gold_i = align.x2y[token.i].dataXd[0, 0] + pred_tags.add((gold_i, getter(token, attr))) + tag_score.score_set(pred_tags, gold_tags) + return {f"{attr}_acc": tag_score.fscore} - @property - def textcat_score(self): - """RETURNS (float): f-score on positive label for binary exclusive, - macro-averaged f-score for 3+ exclusive, - macro-averaged AUC ROC score for multilabel (-1 if undefined) + @staticmethod + def score_token_attr_per_feat( + examples: Iterable[Example], + attr: str, + *, + getter: Callable[[Token, str], Any] = getattr, + **cfg, + ): + """Return PRF scores per feat for a token attribute in UFEATS format. + + examples (Iterable[Example]): Examples to score + attr (str): The attribute to score. + getter (Callable[[Token, str], Any]): Defaults to getattr. If provided, + getter(token, attr) should return the value of the attribute for an + individual token. + RETURNS (dict): A dictionary containing the per-feat PRF scores unders + the key attr_per_feat. 
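A hedged sketch of how these per-token scorers are typically called; `examples` is assumed to be a list of spacy.training.Example objects whose reference docs carry gold tags and morphology.

from spacy.scorer import Scorer

tag_scores = Scorer.score_token_attr(examples, "tag")
print(tag_scores["tag_acc"])
morph_scores = Scorer.score_token_attr_per_feat(examples, "morph")
print(morph_scores["morph_per_feat"])  # per-field PRF, e.g. {"Number": {"p": ..., "r": ..., "f": ...}}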
""" - if not self.textcat_multilabel: - # binary multiclass - if self.textcat_positive_label: - return self.textcat.fscore * 100 - # other multiclass - return ( - sum([score.fscore for label, score in self.textcat_per_cat.items()]) - / (len(self.textcat_per_cat) + 1e-100) - * 100 - ) - # multilabel - return max( - sum([score.score for label, score in self.textcat_per_cat.items()]) - / (len(self.textcat_per_cat) + 1e-100), - -1, - ) + per_feat = {} + for example in examples: + pred_doc = example.predicted + gold_doc = example.reference + align = example.alignment + gold_per_feat = {} + for gold_i, token in enumerate(gold_doc): + morph = str(getter(token, attr)) + if morph: + for feat in morph.split(Morphology.FEATURE_SEP): + field, values = feat.split(Morphology.FIELD_SEP) + if field not in per_feat: + per_feat[field] = PRFScore() + if field not in gold_per_feat: + gold_per_feat[field] = set() + gold_per_feat[field].add((gold_i, feat)) + pred_per_feat = {} + for token in pred_doc: + if token.orth_.isspace(): + continue + if align.x2y.lengths[token.i] == 1: + gold_i = align.x2y[token.i].dataXd[0, 0] + morph = str(getter(token, attr)) + if morph: + for feat in morph.split("|"): + field, values = feat.split("=") + if field not in per_feat: + per_feat[field] = PRFScore() + if field not in pred_per_feat: + pred_per_feat[field] = set() + pred_per_feat[field].add((gold_i, feat)) + for field in per_feat: + per_feat[field].score_set( + pred_per_feat.get(field, set()), gold_per_feat.get(field, set()) + ) + result = {k: v.to_dict() for k, v in per_feat.items()} + return {f"{attr}_per_feat": result} - @property - def textcats_per_cat(self): - """RETURNS (dict): Scores per textcat label. + @staticmethod + def score_spans( + examples: Iterable[Example], + attr: str, + *, + getter: Callable[[Doc, str], Iterable[Span]] = getattr, + **cfg, + ) -> Dict[str, Any]: + """Returns PRF scores for labeled spans. + + examples (Iterable[Example]): Examples to score + attr (str): The attribute to score. + getter (Callable[[Doc, str], Iterable[Span]]): Defaults to getattr. If + provided, getter(doc, attr) should return the spans for the + individual doc. + RETURNS (Dict[str, Any]): A dictionary containing the PRF scores under + the keys attr_p/r/f and the per-type PRF scores under attr_per_type. + + DOCS: https://nightly.spacy.io/api/scorer#score_spans """ - if not self.textcat_multilabel: - return { - k: {"p": v.precision * 100, "r": v.recall * 100, "f": v.fscore * 100} - for k, v in self.textcat_per_cat.items() - } - return { - k: {"roc_auc_score": max(v.score, -1)} - for k, v in self.textcat_per_cat.items() - } - - @property - def scores(self): - """RETURNS (dict): All scores with keys `uas`, `las`, `ents_p`, - `ents_r`, `ents_f`, `tags_acc`, `token_acc`, and `textcat_score`. - """ - return { - "uas": self.uas, - "las": self.las, - "las_per_type": self.las_per_type, - "ents_p": self.ents_p, - "ents_r": self.ents_r, - "ents_f": self.ents_f, - "ents_per_type": self.ents_per_type, - "tags_acc": self.tags_acc, - "token_acc": self.token_acc, - "textcat_score": self.textcat_score, - "textcats_per_cat": self.textcats_per_cat, - } - - def score(self, doc, gold, verbose=False, punct_labels=("p", "punct")): - """Update the evaluation scores from a single Doc / GoldParse pair. - - doc (Doc): The predicted annotations. - gold (GoldParse): The correct annotations. - verbose (bool): Print debugging information. - punct_labels (tuple): Dependency labels for punctuation. 
Used to - evaluate dependency attachments to punctuation if `eval_punct` is - `True`. - - DOCS: https://spacy.io/api/scorer#score - """ - if len(doc) != len(gold): - gold = GoldParse.from_annot_tuples( - doc, zip(*gold.orig_annot), cats=gold.cats, - ) - gold_deps = set() - gold_deps_per_dep = {} - gold_tags = set() - gold_ents = set(tags_to_entities([annot[-1] for annot in gold.orig_annot])) - for id_, word, tag, head, dep, ner in gold.orig_annot: - gold_tags.add((id_, tag)) - if dep not in (None, "") and dep.lower() not in punct_labels: - gold_deps.add((id_, head, dep.lower())) - if dep.lower() not in self.labelled_per_dep: - self.labelled_per_dep[dep.lower()] = PRFScore() - if dep.lower() not in gold_deps_per_dep: - gold_deps_per_dep[dep.lower()] = set() - gold_deps_per_dep[dep.lower()].add((id_, head, dep.lower())) - cand_deps = set() - cand_deps_per_dep = {} - cand_tags = set() - for token in doc: - if token.orth_.isspace(): + score = PRFScore() + score_per_type = dict() + for example in examples: + pred_doc = example.predicted + gold_doc = example.reference + # TODO + # This is a temporary hack to work around the problem that the scorer + # fails if you have examples that are not fully annotated for all + # the tasks in your pipeline. For instance, you might have a corpus + # of NER annotations that does not set sentence boundaries, but the + # pipeline includes a parser or senter, and then the score_weights + # are used to evaluate that component. When the scorer attempts + # to read the sentences from the gold document, it fails. + try: + list(getter(gold_doc, attr)) + except ValueError: continue - gold_i = gold.cand_to_gold[token.i] - if gold_i is None: - self.tokens.fp += 1 - else: - self.tokens.tp += 1 - cand_tags.add((gold_i, token.tag_)) - if token.dep_.lower() not in punct_labels and token.orth_.strip(): - gold_head = gold.cand_to_gold[token.head.i] - # None is indistinct, so we can't just add it to the set - # Multiple (None, None) deps are possible - if gold_i is None or gold_head is None: - self.unlabelled.fp += 1 - self.labelled.fp += 1 - else: - cand_deps.add((gold_i, gold_head, token.dep_.lower())) - if token.dep_.lower() not in self.labelled_per_dep: - self.labelled_per_dep[token.dep_.lower()] = PRFScore() - if token.dep_.lower() not in cand_deps_per_dep: - cand_deps_per_dep[token.dep_.lower()] = set() - cand_deps_per_dep[token.dep_.lower()].add( - (gold_i, gold_head, token.dep_.lower()) - ) - if "-" not in [token[-1] for token in gold.orig_annot]: - # Find all NER labels in gold and doc - ent_labels = set([x[0] for x in gold_ents] + [k.label_ for k in doc.ents]) + # Find all labels in gold and doc + labels = set( + [k.label_ for k in getter(gold_doc, attr)] + + [k.label_ for k in getter(pred_doc, attr)] + ) # Set up all labels for per type scoring and prepare gold per type - gold_per_ents = {ent_label: set() for ent_label in ent_labels} - for ent_label in ent_labels: - if ent_label not in self.ner_per_ents: - self.ner_per_ents[ent_label] = PRFScore() - gold_per_ents[ent_label].update( - [x for x in gold_ents if x[0] == ent_label] + gold_per_type = {label: set() for label in labels} + for label in labels: + if label not in score_per_type: + score_per_type[label] = PRFScore() + # Find all predidate labels, for all and per type + gold_spans = set() + pred_spans = set() + for span in getter(gold_doc, attr): + gold_span = (span.label_, span.start, span.end - 1) + gold_spans.add(gold_span) + gold_per_type[span.label_].add((span.label_, span.start, span.end - 1)) + 
pred_per_type = {label: set() for label in labels} + for span in example.get_aligned_spans_x2y(getter(pred_doc, attr)): + pred_spans.add((span.label_, span.start, span.end - 1)) + pred_per_type[span.label_].add((span.label_, span.start, span.end - 1)) + # Scores per label + for k, v in score_per_type.items(): + if k in pred_per_type: + v.score_set(pred_per_type[k], gold_per_type[k]) + # Score for all labels + score.score_set(pred_spans, gold_spans) + results = { + f"{attr}_p": score.precision, + f"{attr}_r": score.recall, + f"{attr}_f": score.fscore, + f"{attr}_per_type": {k: v.to_dict() for k, v in score_per_type.items()}, + } + return results + + @staticmethod + def score_cats( + examples: Iterable[Example], + attr: str, + *, + getter: Callable[[Doc, str], Any] = getattr, + labels: Iterable[str] = SimpleFrozenList(), + multi_label: bool = True, + positive_label: Optional[str] = None, + threshold: Optional[float] = None, + **cfg, + ) -> Dict[str, Any]: + """Returns PRF and ROC AUC scores for a doc-level attribute with a + dict with scores for each label like Doc.cats. The reported overall + score depends on the scorer settings. + + examples (Iterable[Example]): Examples to score + attr (str): The attribute to score. + getter (Callable[[Doc, str], Any]): Defaults to getattr. If provided, + getter(doc, attr) should return the values for the individual doc. + labels (Iterable[str]): The set of possible labels. Defaults to []. + multi_label (bool): Whether the attribute allows multiple labels. + Defaults to True. + positive_label (str): The positive label for a binary task with + exclusive classes. Defaults to None. + threshold (float): Cutoff to consider a prediction "positive". Defaults + to 0.5 for multi-label, and 0.0 (i.e. whatever's highest scoring) + otherwise. + RETURNS (Dict[str, Any]): A dictionary containing the scores, with + inapplicable scores as None: + for all: + attr_score (one of attr_micro_f / attr_macro_f / attr_macro_auc), + attr_score_desc (text description of the overall score), + attr_micro_f, + attr_macro_f, + attr_auc, + attr_f_per_type, + attr_auc_per_type + + DOCS: https://nightly.spacy.io/api/scorer#score_cats + """ + if threshold is None: + threshold = 0.5 if multi_label else 0.0 + f_per_type = {label: PRFScore() for label in labels} + auc_per_type = {label: ROCAUCScore() for label in labels} + labels = set(labels) + if labels: + for eg in examples: + labels.update(eg.predicted.cats.keys()) + labels.update(eg.reference.cats.keys()) + for example in examples: + # Through this loop, None in the gold_cats indicates missing label. + pred_cats = getter(example.predicted, attr) + gold_cats = getter(example.reference, attr) + + # I think the AUC metric is applicable regardless of whether we're + # doing multi-label classification? Unsure. If not, move this into + # the elif pred_cats and gold_cats block below. + for label in labels: + pred_score = pred_cats.get(label, 0.0) + gold_score = gold_cats.get(label, 0.0) + if gold_score is not None: + auc_per_type[label].score_set(pred_score, gold_score) + if multi_label: + for label in labels: + pred_score = pred_cats.get(label, 0.0) + gold_score = gold_cats.get(label, 0.0) + if gold_score is not None: + if pred_score >= threshold and gold_score > 0: + f_per_type[label].tp += 1 + elif pred_score >= threshold and gold_score == 0: + f_per_type[label].fp += 1 + elif pred_score < threshold and gold_score > 0: + f_per_type[label].fn += 1 + elif pred_cats and gold_cats: + # Get the highest-scoring for each. 
+ pred_label, pred_score = max(pred_cats.items(), key=lambda it: it[1]) + gold_label, gold_score = max(gold_cats.items(), key=lambda it: it[1]) + if gold_score is not None: + if pred_label == gold_label and pred_score >= threshold: + f_per_type[pred_label].tp += 1 + else: + f_per_type[gold_label].fn += 1 + if pred_score >= threshold: + f_per_type[pred_label].fp += 1 + elif gold_cats: + gold_label, gold_score = max(gold_cats.items(), key=lambda it: it[1]) + if gold_score is not None and gold_score > 0: + f_per_type[gold_label].fn += 1 + else: + pred_label, pred_score = max(pred_cats.items(), key=lambda it: it[1]) + if pred_score >= threshold: + f_per_type[pred_label].fp += 1 + micro_prf = PRFScore() + for label_prf in f_per_type.values(): + micro_prf.tp += label_prf.tp + micro_prf.fn += label_prf.fn + micro_prf.fp += label_prf.fp + n_cats = len(f_per_type) + 1e-100 + macro_p = sum(prf.precision for prf in f_per_type.values()) / n_cats + macro_r = sum(prf.recall for prf in f_per_type.values()) / n_cats + macro_f = sum(prf.fscore for prf in f_per_type.values()) / n_cats + macro_auc = sum(auc.score for auc in auc_per_type.values()) / n_cats + results = { + f"{attr}_score": None, + f"{attr}_score_desc": None, + f"{attr}_micro_p": micro_prf.precision, + f"{attr}_micro_r": micro_prf.recall, + f"{attr}_micro_f": micro_prf.fscore, + f"{attr}_macro_p": macro_p, + f"{attr}_macro_r": macro_r, + f"{attr}_macro_f": macro_f, + f"{attr}_macro_auc": macro_auc, + f"{attr}_f_per_type": {k: v.to_dict() for k, v in f_per_type.items()}, + f"{attr}_auc_per_type": {k: v.score for k, v in auc_per_type.items()}, + } + if len(labels) == 2 and not multi_label and positive_label: + positive_label_f = results[f"{attr}_f_per_type"][positive_label]["f"] + results[f"{attr}_score"] = positive_label_f + results[f"{attr}_score_desc"] = f"F ({positive_label})" + elif not multi_label: + results[f"{attr}_score"] = results[f"{attr}_macro_f"] + results[f"{attr}_score_desc"] = "macro F" + else: + results[f"{attr}_score"] = results[f"{attr}_macro_auc"] + results[f"{attr}_score_desc"] = "macro AUC" + return results + + @staticmethod + def score_links( + examples: Iterable[Example], *, negative_labels: Iterable[str] + ) -> Dict[str, Any]: + """Returns PRF for predicted links on the entity level. + To disentangle the performance of the NEL from the NER, + this method only evaluates NEL links for entities that overlap + between the gold reference and the predictions. + + examples (Iterable[Example]): Examples to score + negative_labels (Iterable[str]): The string values that refer to no annotation (e.g. "NIL") + RETURNS (Dict[str, Any]): A dictionary containing the scores.
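A rough usage sketch for the entity-linking scorer added above (illustrative only: `examples` is assumed to be an iterable of spaCy `Example` objects carrying gold and predicted KB IDs, and "NIL" is assumed to be the only negative label in use):

    from spacy.scorer import Scorer

    # Score NEL predictions; only entities whose character offsets match
    # between the reference and the prediction are evaluated, as described
    # in the docstring above.
    nel_scores = Scorer.score_links(examples, negative_labels=["NIL"])
    print(nel_scores["nel_micro_f"])     # micro-averaged F over all labels
    print(nel_scores["nel_f_per_type"])  # per-entity-type PRF breakdown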
+ + DOCS (TODO): https://nightly.spacy.io/api/scorer#score_links + """ + f_per_type = {} + for example in examples: + gold_ent_by_offset = {} + for gold_ent in example.reference.ents: + gold_ent_by_offset[(gold_ent.start_char, gold_ent.end_char)] = gold_ent + + for pred_ent in example.predicted.ents: + gold_span = gold_ent_by_offset.get( + (pred_ent.start_char, pred_ent.end_char), None ) - # Find all candidate labels, for all and per type - cand_ents = set() - cand_per_ents = {ent_label: set() for ent_label in ent_labels} - for ent in doc.ents: - first = gold.cand_to_gold[ent.start] - last = gold.cand_to_gold[ent.end - 1] - if first is None or last is None: - self.ner.fp += 1 - self.ner_per_ents[ent.label_].fp += 1 + label = gold_span.label_ + if label not in f_per_type: + f_per_type[label] = PRFScore() + gold = gold_span.kb_id_ + # only evaluating entities that overlap between gold and pred, + # to disentangle the performance of the NEL from the NER + if gold is not None: + pred = pred_ent.kb_id_ + if gold in negative_labels and pred in negative_labels: + # ignore true negatives + pass + elif gold == pred: + f_per_type[label].tp += 1 + elif gold in negative_labels: + f_per_type[label].fp += 1 + elif pred in negative_labels: + f_per_type[label].fn += 1 + else: + # a wrong prediction (e.g. Q42 != Q3) counts as both a FP as well as a FN + f_per_type[label].fp += 1 + f_per_type[label].fn += 1 + micro_prf = PRFScore() + for label_prf in f_per_type.values(): + micro_prf.tp += label_prf.tp + micro_prf.fn += label_prf.fn + micro_prf.fp += label_prf.fp + n_labels = len(f_per_type) + 1e-100 + macro_p = sum(prf.precision for prf in f_per_type.values()) / n_labels + macro_r = sum(prf.recall for prf in f_per_type.values()) / n_labels + macro_f = sum(prf.fscore for prf in f_per_type.values()) / n_labels + results = { + f"nel_score": micro_prf.fscore, + f"nel_score_desc": "micro F", + f"nel_micro_p": micro_prf.precision, + f"nel_micro_r": micro_prf.recall, + f"nel_micro_f": micro_prf.fscore, + f"nel_macro_p": macro_p, + f"nel_macro_r": macro_r, + f"nel_macro_f": macro_f, + f"nel_f_per_type": {k: v.to_dict() for k, v in f_per_type.items()}, + } + return results + + @staticmethod + def score_deps( + examples: Iterable[Example], + attr: str, + *, + getter: Callable[[Token, str], Any] = getattr, + head_attr: str = "head", + head_getter: Callable[[Token, str], Token] = getattr, + ignore_labels: Iterable[str] = SimpleFrozenList(), + **cfg, + ) -> Dict[str, Any]: + """Returns the UAS, LAS, and LAS per type scores for dependency + parses. + + examples (Iterable[Example]): Examples to score + attr (str): The attribute containing the dependency label. + getter (Callable[[Token, str], Any]): Defaults to getattr. If provided, + getter(token, attr) should return the value of the attribute for an + individual token. + head_attr (str): The attribute containing the head token. Defaults to + 'head'. + head_getter (Callable[[Token, str], Token]): Defaults to getattr. If provided, + head_getter(token, attr) should return the value of the head for an + individual token. + ignore_labels (Tuple): Labels to ignore while scoring (e.g., punct). + RETURNS (Dict[str, Any]): A dictionary containing the scores: + attr_uas, attr_las, and attr_las_per_type. 
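A minimal sketch of calling the dependency scorer described above (assumptions: `examples` is an iterable of annotated `Example` objects; the lower-casing getter and the "punct" ignore label are illustrative choices, not requirements of the API):

    from spacy.scorer import Scorer

    def dep_getter(token, attr):
        # Return the lower-cased dependency label string for a token.
        return getattr(token, attr + "_").lower()

    dep_scores = Scorer.score_deps(
        examples,
        "dep",
        getter=dep_getter,
        ignore_labels=("punct",),  # skip punctuation attachments
    )
    print(dep_scores["dep_uas"], dep_scores["dep_las"])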
+ + DOCS: https://nightly.spacy.io/api/scorer#score_deps + """ + unlabelled = PRFScore() + labelled = PRFScore() + labelled_per_dep = dict() + for example in examples: + gold_doc = example.reference + pred_doc = example.predicted + align = example.alignment + gold_deps = set() + gold_deps_per_dep = {} + for gold_i, token in enumerate(gold_doc): + dep = getter(token, attr) + head = head_getter(token, head_attr) + if dep not in ignore_labels: + gold_deps.add((gold_i, head.i, dep)) + if dep not in labelled_per_dep: + labelled_per_dep[dep] = PRFScore() + if dep not in gold_deps_per_dep: + gold_deps_per_dep[dep] = set() + gold_deps_per_dep[dep].add((gold_i, head.i, dep)) + pred_deps = set() + pred_deps_per_dep = {} + for token in pred_doc: + if token.orth_.isspace(): + continue + if align.x2y.lengths[token.i] != 1: + gold_i = None else: - cand_ents.add((ent.label_, first, last)) - cand_per_ents[ent.label_].add((ent.label_, first, last)) - # Scores per ent - for k, v in self.ner_per_ents.items(): - if k in cand_per_ents: - v.score_set(cand_per_ents[k], gold_per_ents[k]) - # Score for all ents - self.ner.score_set(cand_ents, gold_ents) - self.tags.score_set(cand_tags, gold_tags) - self.labelled.score_set(cand_deps, gold_deps) - for dep in self.labelled_per_dep: - self.labelled_per_dep[dep].score_set( - cand_deps_per_dep.get(dep, set()), gold_deps_per_dep.get(dep, set()) - ) - self.unlabelled.score_set( - set(item[:2] for item in cand_deps), set(item[:2] for item in gold_deps) - ) - if ( - len(gold.cats) > 0 - and set(self.textcat_per_cat) == set(gold.cats) - and set(gold.cats) == set(doc.cats) - ): - goldcat = max(gold.cats, key=gold.cats.get) - candcat = max(doc.cats, key=doc.cats.get) - if self.textcat_positive_label: - self.textcat.score_set( - set([self.textcat_positive_label]) & set([candcat]), - set([self.textcat_positive_label]) & set([goldcat]), + gold_i = align.x2y[token.i].dataXd[0, 0] + dep = getter(token, attr) + head = head_getter(token, head_attr) + if dep not in ignore_labels and token.orth_.strip(): + if align.x2y.lengths[head.i] == 1: + gold_head = align.x2y[head.i].dataXd[0, 0] + else: + gold_head = None + # None is indistinct, so we can't just add it to the set + # Multiple (None, None) deps are possible + if gold_i is None or gold_head is None: + unlabelled.fp += 1 + labelled.fp += 1 + else: + pred_deps.add((gold_i, gold_head, dep)) + if dep not in labelled_per_dep: + labelled_per_dep[dep] = PRFScore() + if dep not in pred_deps_per_dep: + pred_deps_per_dep[dep] = set() + pred_deps_per_dep[dep].add((gold_i, gold_head, dep)) + labelled.score_set(pred_deps, gold_deps) + for dep in labelled_per_dep: + labelled_per_dep[dep].score_set( + pred_deps_per_dep.get(dep, set()), gold_deps_per_dep.get(dep, set()) ) - for label in self.textcat_per_cat: - if self.textcat_multilabel: - self.textcat_per_cat[label].score_set( - doc.cats[label], gold.cats[label] - ) - else: - self.textcat_per_cat[label].score_set( - set([label]) & set([candcat]), set([label]) & set([goldcat]) - ) - elif len(self.textcat_per_cat) > 0: - model_labels = set(self.textcat_per_cat) - eval_labels = set(gold.cats) - raise ValueError( - Errors.E162.format(model_labels=model_labels, eval_labels=eval_labels) + unlabelled.score_set( + set(item[:2] for item in pred_deps), set(item[:2] for item in gold_deps) ) - if verbose: - gold_words = [item[1] for item in gold.orig_annot] - for w_id, h_id, dep in cand_deps - gold_deps: - print("F", gold_words[w_id], dep, gold_words[h_id]) - for w_id, h_id, dep in gold_deps - cand_deps: - 
print("M", gold_words[w_id], dep, gold_words[h_id]) + return { + f"{attr}_uas": unlabelled.fscore, + f"{attr}_las": labelled.fscore, + f"{attr}_las_per_type": { + k: v.to_dict() for k, v in labelled_per_dep.items() + }, + } + + +def get_ner_prf(examples: Iterable[Example]) -> Dict[str, PRFScore]: + """Compute per-entity PRFScore objects for a sequence of examples. The + results are returned as a dictionary keyed by the entity type. You can + add the PRFScore objects to get micro-averaged total. + """ + scores = defaultdict(PRFScore) + for eg in examples: + if not eg.y.has_annotation("ENT_IOB"): + continue + golds = {(e.label_, e.start, e.end) for e in eg.y.ents} + align_x2y = eg.alignment.x2y + for pred_ent in eg.x.ents: + if pred_ent.label_ not in scores: + scores[pred_ent.label_] = PRFScore() + indices = align_x2y[pred_ent.start : pred_ent.end].dataXd.ravel() + if len(indices): + g_span = eg.y[indices[0] : indices[-1] + 1] + # Check we aren't missing annotation on this span. If so, + # our prediction is neither right nor wrong, we just + # ignore it. + if all(token.ent_iob != 0 for token in g_span): + key = (pred_ent.label_, indices[0], indices[-1] + 1) + if key in golds: + scores[pred_ent.label_].tp += 1 + golds.remove(key) + else: + scores[pred_ent.label_].fp += 1 + for label, start, end in golds: + scores[label].fn += 1 + return scores ############################################################################# @@ -601,7 +905,7 @@ def _auc(x, y): if np.all(dx <= 0): direction = -1 else: - raise ValueError(Errors.E164.format(x)) + raise ValueError(Errors.E164.format(x=x)) area = direction * np.trapz(y, x) if isinstance(area, np.memmap): diff --git a/spacy/strings.pxd b/spacy/strings.pxd index e436fb33b..07768d347 100644 --- a/spacy/strings.pxd +++ b/spacy/strings.pxd @@ -1,7 +1,6 @@ from libc.stdint cimport int64_t from libcpp.vector cimport vector from libcpp.set cimport set - from cymem.cymem cimport Pool from preshed.maps cimport PreshMap from murmurhash.mrmr cimport hash64 @@ -24,7 +23,6 @@ cdef class StringStore: cdef Pool mem cdef vector[hash_t] keys - cdef set[hash_t] hits cdef public PreshMap _map cdef const Utf8Str* intern_unicode(self, unicode py_string) diff --git a/spacy/strings.pyx b/spacy/strings.pyx index f3457e1a5..cd442729c 100644 --- a/spacy/strings.pyx +++ b/spacy/strings.pyx @@ -1,18 +1,16 @@ # cython: infer_types=True -# coding: utf8 -from __future__ import unicode_literals, absolute_import - cimport cython from libc.string cimport memcpy from libcpp.set cimport set from libc.stdint cimport uint32_t from murmurhash.mrmr cimport hash64, hash32 + import srsly -from .compat import basestring_ +from .typedefs cimport hash_t + from .symbols import IDS as SYMBOLS_BY_STR from .symbols import NAMES as SYMBOLS_BY_INT -from .typedefs cimport hash_t from .errors import Errors from . import util @@ -24,7 +22,7 @@ def get_string_id(key): This function optimises for convenience over performance, so shouldn't be used in tight loops. """ - if not isinstance(key, basestring_): + if not isinstance(key, str): return key elif key in SYMBOLS_BY_STR: return SYMBOLS_BY_STR[key] @@ -93,13 +91,12 @@ cdef Utf8Str* _allocate(Pool mem, const unsigned char* chars, uint32_t length) e cdef class StringStore: """Look up strings by 64-bit hashes. - DOCS: https://spacy.io/api/stringstore + DOCS: https://nightly.spacy.io/api/stringstore """ def __init__(self, strings=None, freeze=False): """Create the StringStore. strings (iterable): A sequence of unicode strings to add to the store. 
- RETURNS (StringStore): The newly constructed object. """ self.mem = Pool() self._map = PreshMap() @@ -111,7 +108,7 @@ cdef class StringStore: """Retrieve a string from a given hash, or vice versa. string_or_id (bytes, unicode or uint64): The value to encode. - Returns (unicode or uint64): The value to be retrieved. + Returns (str / uint64): The value to be retrieved. """ if isinstance(string_or_id, basestring) and len(string_or_id) == 0: return 0 @@ -130,7 +127,6 @@ cdef class StringStore: return SYMBOLS_BY_INT[string_or_id] else: key = string_or_id - self.hits.insert(key) utf8str = self._map.get(key) if utf8str is NULL: raise KeyError(Errors.E018.format(hash_value=string_or_id)) @@ -150,11 +146,11 @@ cdef class StringStore: return key else: return self[key] - + def add(self, string): """Add a string to the StringStore. - string (unicode): The string to add. + string (str): The string to add. RETURNS (uint64): The string's hash value. """ if isinstance(string, unicode): @@ -181,7 +177,7 @@ cdef class StringStore: def __contains__(self, string not None): """Check whether a string is in the store. - string (unicode): The string to check. + string (str): The string to check. RETURNS (bool): Whether the store contains the string. """ cdef hash_t key @@ -201,19 +197,17 @@ cdef class StringStore: if key < len(SYMBOLS_BY_INT): return True else: - self.hits.insert(key) return self._map.get(key) is not NULL def __iter__(self): """Iterate over the strings in the store, in order. - YIELDS (unicode): A string in the store. + YIELDS (str): A string in the store. """ cdef int i cdef hash_t key for i in range(self.keys.size()): key = self.keys[i] - self.hits.insert(key) utf8str = self._map.get(key) yield decode_Utf8Str(utf8str) # TODO: Iterate OOV here? @@ -225,7 +219,7 @@ cdef class StringStore: def to_disk(self, path): """Save the current state to a directory. - path (unicode or Path): A path to a directory, which will be created if + path (str / Path): A path to a directory, which will be created if it doesn't exist. Paths may be either strings or Path-like objects. """ path = util.ensure_path(path) @@ -236,7 +230,7 @@ cdef class StringStore: """Loads state from a directory. Modifies the object in place and returns it. - path (unicode or Path): A path to a directory. Paths may be either + path (str / Path): A path to a directory. Paths may be either strings or `Path`-like objects. RETURNS (StringStore): The modified `StringStore` object. """ @@ -272,41 +266,9 @@ cdef class StringStore: self.mem = Pool() self._map = PreshMap() self.keys.clear() - self.hits.clear() for string in strings: self.add(string) - def _cleanup_stale_strings(self, excepted): - """ - excepted (list): Strings that should not be removed. - RETURNS (keys, strings): Dropped strings and keys that can be dropped from other places - """ - if self.hits.size() == 0: - # If we don't have any hits, just skip cleanup - return - - cdef vector[hash_t] tmp - dropped_strings = [] - dropped_keys = [] - for i in range(self.keys.size()): - key = self.keys[i] - # Here we cannot use __getitem__ because it also set hit. 
- utf8str = self._map.get(key) - value = decode_Utf8Str(utf8str) - if self.hits.count(key) != 0 or value in excepted: - tmp.push_back(key) - else: - dropped_keys.append(key) - dropped_strings.append(value) - - self.keys.swap(tmp) - strings = list(self) - self._reset_and_load(strings) - # Here we have strings but hits to it should be reseted - self.hits.clear() - - return dropped_keys, dropped_strings - cdef const Utf8Str* intern_unicode(self, unicode py_string): # 0 means missing, but we don't bother offsetting the index. cdef bytes byte_string = py_string.encode("utf8") @@ -322,6 +284,5 @@ cdef class StringStore: return value value = _allocate(self.mem, utf8_string, length) self._map.set(key, value) - self.hits.insert(key) self.keys.push_back(key) return value diff --git a/spacy/structs.pxd b/spacy/structs.pxd index 1f5f32675..4a51bc9e0 100644 --- a/spacy/structs.pxd +++ b/spacy/structs.pxd @@ -1,11 +1,9 @@ from libc.stdint cimport uint8_t, uint32_t, int32_t, uint64_t - -from .typedefs cimport flags_t, attr_t, hash_t -from .parts_of_speech cimport univ_pos_t - from libcpp.vector cimport vector from libc.stdint cimport int32_t, int64_t +from .typedefs cimport flags_t, attr_t, hash_t +from .parts_of_speech cimport univ_pos_t cdef struct LexemeC: @@ -59,52 +57,12 @@ cdef struct TokenC: cdef struct MorphAnalysisC: - univ_pos_t pos + hash_t key int length - attr_t abbr - attr_t adp_type - attr_t adv_type - attr_t animacy - attr_t aspect - attr_t case - attr_t conj_type - attr_t connegative - attr_t definite - attr_t degree - attr_t derivation - attr_t echo - attr_t foreign - attr_t gender - attr_t hyph - attr_t inf_form - attr_t mood - attr_t negative - attr_t number - attr_t name_type - attr_t noun_type - attr_t num_form - attr_t num_type - attr_t num_value - attr_t part_form - attr_t part_type - attr_t person - attr_t polite - attr_t polarity - attr_t poss - attr_t prefix - attr_t prep_case - attr_t pron_type - attr_t punct_side - attr_t punct_type - attr_t reflex - attr_t style - attr_t style_variant - attr_t tense - attr_t typo - attr_t verb_form - attr_t voice - attr_t verb_type + attr_t* fields + attr_t* features + # Internal struct, for storage and disambiguation of entities. 
cdef struct KBEntryC: diff --git a/spacy/symbols.pxd b/spacy/symbols.pxd index ebb87c8d2..bc15d9b80 100644 --- a/spacy/symbols.pxd +++ b/spacy/symbols.pxd @@ -108,282 +108,282 @@ cdef enum symbol_t: EOL SPACE - Animacy_anim - Animacy_inan - Animacy_hum # U20 - Animacy_nhum - Aspect_freq - Aspect_imp - Aspect_mod - Aspect_none - Aspect_perf - Aspect_iter # U20 - Aspect_hab # U20 - Case_abe - Case_abl - Case_abs - Case_acc - Case_ade - Case_all - Case_cau - Case_com - Case_cmp # U20 - Case_dat - Case_del - Case_dis - Case_ela - Case_equ # U20 - Case_ess - Case_gen - Case_ill - Case_ine - Case_ins - Case_loc - Case_lat - Case_nom - Case_par - Case_sub - Case_sup - Case_tem - Case_ter - Case_tra - Case_voc - Definite_two - Definite_def - Definite_red - Definite_cons # U20 - Definite_ind - Definite_spec # U20 - Degree_cmp - Degree_comp - Degree_none - Degree_pos - Degree_sup - Degree_abs - Degree_com - Degree_dim # du - Degree_equ # U20 - Evident_nfh # U20 - Gender_com - Gender_fem - Gender_masc - Gender_neut - Mood_cnd - Mood_imp - Mood_ind - Mood_n - Mood_pot - Mood_sub - Mood_opt - Mood_prp # U20 - Mood_adm # U20 - Negative_neg - Negative_pos - Negative_yes - Polarity_neg # U20 - Polarity_pos # U20 - Number_com - Number_dual - Number_none - Number_plur - Number_sing - Number_ptan # bg - Number_count # bg, U20 - Number_tri # U20 - NumType_card - NumType_dist - NumType_frac - NumType_gen - NumType_mult - NumType_none - NumType_ord - NumType_sets - Person_one - Person_two - Person_three - Person_none - Poss_yes - PronType_advPart - PronType_art - PronType_default - PronType_dem - PronType_ind - PronType_int - PronType_neg - PronType_prs - PronType_rcp - PronType_rel - PronType_tot - PronType_clit - PronType_exc # es, ca, it, fa, U20 - PronType_emp # U20 - Reflex_yes - Tense_fut - Tense_imp - Tense_past - Tense_pres - VerbForm_fin - VerbForm_ger - VerbForm_inf - VerbForm_none - VerbForm_part - VerbForm_partFut - VerbForm_partPast - VerbForm_partPres - VerbForm_sup - VerbForm_trans - VerbForm_conv # U20 - VerbForm_gdv # la - VerbForm_vnoun # U20 - Voice_act - Voice_cau - Voice_pass - Voice_mid # gkc, U20 - Voice_int # hb - Voice_antip # U20 - Voice_dir # U20 - Voice_inv # U20 - Abbr_yes # cz, fi, sl, U - AdpType_prep # cz, U - AdpType_post # U - AdpType_voc # cz - AdpType_comprep # cz - AdpType_circ # U - AdvType_man - AdvType_loc - AdvType_tim - AdvType_deg - AdvType_cau - AdvType_mod - AdvType_sta - AdvType_ex - AdvType_adadj - ConjType_oper # cz, U - ConjType_comp # cz, U - Connegative_yes # fi - Derivation_minen # fi - Derivation_sti # fi - Derivation_inen # fi - Derivation_lainen # fi - Derivation_ja # fi - Derivation_ton # fi - Derivation_vs # fi - Derivation_ttain # fi - Derivation_ttaa # fi - Echo_rdp # U - Echo_ech # U - Foreign_foreign # cz, fi, U - Foreign_fscript # cz, fi, U - Foreign_tscript # cz, U - Foreign_yes # sl - Gender_dat_masc # bq, U - Gender_dat_fem # bq, U - Gender_erg_masc # bq - Gender_erg_fem # bq - Gender_psor_masc # cz, sl, U - Gender_psor_fem # cz, sl, U - Gender_psor_neut # sl - Hyph_yes # cz, U - InfForm_one # fi - InfForm_two # fi - InfForm_three # fi - NameType_geo # U, cz - NameType_prs # U, cz - NameType_giv # U, cz - NameType_sur # U, cz - NameType_nat # U, cz - NameType_com # U, cz - NameType_pro # U, cz - NameType_oth # U, cz - NounType_com # U - NounType_prop # U - NounType_class # U - Number_abs_sing # bq, U - Number_abs_plur # bq, U - Number_dat_sing # bq, U - Number_dat_plur # bq, U - Number_erg_sing # bq, U - Number_erg_plur # bq, U - Number_psee_sing 
# U - Number_psee_plur # U - Number_psor_sing # cz, fi, sl, U - Number_psor_plur # cz, fi, sl, U - Number_pauc # U20 - Number_grpa # U20 - Number_grpl # U20 - Number_inv # U20 - NumForm_digit # cz, sl, U - NumForm_roman # cz, sl, U - NumForm_word # cz, sl, U - NumValue_one # cz, U - NumValue_two # cz, U - NumValue_three # cz, U - PartForm_pres # fi - PartForm_past # fi - PartForm_agt # fi - PartForm_neg # fi - PartType_mod # U - PartType_emp # U - PartType_res # U - PartType_inf # U - PartType_vbp # U - Person_abs_one # bq, U - Person_abs_two # bq, U - Person_abs_three # bq, U - Person_dat_one # bq, U - Person_dat_two # bq, U - Person_dat_three # bq, U - Person_erg_one # bq, U - Person_erg_two # bq, U - Person_erg_three # bq, U - Person_psor_one # fi, U - Person_psor_two # fi, U - Person_psor_three # fi, U - Person_zero # U20 - Person_four # U20 - Polite_inf # bq, U - Polite_pol # bq, U - Polite_abs_inf # bq, U - Polite_abs_pol # bq, U - Polite_erg_inf # bq, U - Polite_erg_pol # bq, U - Polite_dat_inf # bq, U - Polite_dat_pol # bq, U - Polite_infm # U20 - Polite_form # U20 - Polite_form_elev # U20 - Polite_form_humb # U20 - Prefix_yes # U - PrepCase_npr # cz - PrepCase_pre # U - PunctSide_ini # U - PunctSide_fin # U - PunctType_peri # U - PunctType_qest # U - PunctType_excl # U - PunctType_quot # U - PunctType_brck # U - PunctType_comm # U - PunctType_colo # U - PunctType_semi # U - PunctType_dash # U - Style_arch # cz, fi, U - Style_rare # cz, fi, U - Style_poet # cz, U - Style_norm # cz, U - Style_coll # cz, U - Style_vrnc # cz, U - Style_sing # cz, U - Style_expr # cz, U - Style_derg # cz, U - Style_vulg # cz, U - Style_yes # fi, U - StyleVariant_styleShort # cz - StyleVariant_styleBound # cz, sl - VerbType_aux # U - VerbType_cop # U - VerbType_mod # U - VerbType_light # U + DEPRECATED001 + DEPRECATED002 + DEPRECATED003 + DEPRECATED004 + DEPRECATED005 + DEPRECATED006 + DEPRECATED007 + DEPRECATED008 + DEPRECATED009 + DEPRECATED010 + DEPRECATED011 + DEPRECATED012 + DEPRECATED013 + DEPRECATED014 + DEPRECATED015 + DEPRECATED016 + DEPRECATED017 + DEPRECATED018 + DEPRECATED019 + DEPRECATED020 + DEPRECATED021 + DEPRECATED022 + DEPRECATED023 + DEPRECATED024 + DEPRECATED025 + DEPRECATED026 + DEPRECATED027 + DEPRECATED028 + DEPRECATED029 + DEPRECATED030 + DEPRECATED031 + DEPRECATED032 + DEPRECATED033 + DEPRECATED034 + DEPRECATED035 + DEPRECATED036 + DEPRECATED037 + DEPRECATED038 + DEPRECATED039 + DEPRECATED040 + DEPRECATED041 + DEPRECATED042 + DEPRECATED043 + DEPRECATED044 + DEPRECATED045 + DEPRECATED046 + DEPRECATED047 + DEPRECATED048 + DEPRECATED049 + DEPRECATED050 + DEPRECATED051 + DEPRECATED052 + DEPRECATED053 + DEPRECATED054 + DEPRECATED055 + DEPRECATED056 + DEPRECATED057 + DEPRECATED058 + DEPRECATED059 + DEPRECATED060 + DEPRECATED061 + DEPRECATED062 + DEPRECATED063 + DEPRECATED064 + DEPRECATED065 + DEPRECATED066 + DEPRECATED067 + DEPRECATED068 + DEPRECATED069 + DEPRECATED070 + DEPRECATED071 + DEPRECATED072 + DEPRECATED073 + DEPRECATED074 + DEPRECATED075 + DEPRECATED076 + DEPRECATED077 + DEPRECATED078 + DEPRECATED079 + DEPRECATED080 + DEPRECATED081 + DEPRECATED082 + DEPRECATED083 + DEPRECATED084 + DEPRECATED085 + DEPRECATED086 + DEPRECATED087 + DEPRECATED088 + DEPRECATED089 + DEPRECATED090 + DEPRECATED091 + DEPRECATED092 + DEPRECATED093 + DEPRECATED094 + DEPRECATED095 + DEPRECATED096 + DEPRECATED097 + DEPRECATED098 + DEPRECATED099 + DEPRECATED100 + DEPRECATED101 + DEPRECATED102 + DEPRECATED103 + DEPRECATED104 + DEPRECATED105 + DEPRECATED106 + DEPRECATED107 + DEPRECATED108 + DEPRECATED109 + 
DEPRECATED110 + DEPRECATED111 + DEPRECATED112 + DEPRECATED113 + DEPRECATED114 + DEPRECATED115 + DEPRECATED116 + DEPRECATED117 + DEPRECATED118 + DEPRECATED119 + DEPRECATED120 + DEPRECATED121 + DEPRECATED122 + DEPRECATED123 + DEPRECATED124 + DEPRECATED125 + DEPRECATED126 + DEPRECATED127 + DEPRECATED128 + DEPRECATED129 + DEPRECATED130 + DEPRECATED131 + DEPRECATED132 + DEPRECATED133 + DEPRECATED134 + DEPRECATED135 + DEPRECATED136 + DEPRECATED137 + DEPRECATED138 + DEPRECATED139 + DEPRECATED140 + DEPRECATED141 + DEPRECATED142 + DEPRECATED143 + DEPRECATED144 + DEPRECATED145 + DEPRECATED146 + DEPRECATED147 + DEPRECATED148 + DEPRECATED149 + DEPRECATED150 + DEPRECATED151 + DEPRECATED152 + DEPRECATED153 + DEPRECATED154 + DEPRECATED155 + DEPRECATED156 + DEPRECATED157 + DEPRECATED158 + DEPRECATED159 + DEPRECATED160 + DEPRECATED161 + DEPRECATED162 + DEPRECATED163 + DEPRECATED164 + DEPRECATED165 + DEPRECATED166 + DEPRECATED167 + DEPRECATED168 + DEPRECATED169 + DEPRECATED170 + DEPRECATED171 + DEPRECATED172 + DEPRECATED173 + DEPRECATED174 + DEPRECATED175 + DEPRECATED176 + DEPRECATED177 + DEPRECATED178 + DEPRECATED179 + DEPRECATED180 + DEPRECATED181 + DEPRECATED182 + DEPRECATED183 + DEPRECATED184 + DEPRECATED185 + DEPRECATED186 + DEPRECATED187 + DEPRECATED188 + DEPRECATED189 + DEPRECATED190 + DEPRECATED191 + DEPRECATED192 + DEPRECATED193 + DEPRECATED194 + DEPRECATED195 + DEPRECATED196 + DEPRECATED197 + DEPRECATED198 + DEPRECATED199 + DEPRECATED200 + DEPRECATED201 + DEPRECATED202 + DEPRECATED203 + DEPRECATED204 + DEPRECATED205 + DEPRECATED206 + DEPRECATED207 + DEPRECATED208 + DEPRECATED209 + DEPRECATED210 + DEPRECATED211 + DEPRECATED212 + DEPRECATED213 + DEPRECATED214 + DEPRECATED215 + DEPRECATED216 + DEPRECATED217 + DEPRECATED218 + DEPRECATED219 + DEPRECATED220 + DEPRECATED221 + DEPRECATED222 + DEPRECATED223 + DEPRECATED224 + DEPRECATED225 + DEPRECATED226 + DEPRECATED227 + DEPRECATED228 + DEPRECATED229 + DEPRECATED230 + DEPRECATED231 + DEPRECATED232 + DEPRECATED233 + DEPRECATED234 + DEPRECATED235 + DEPRECATED236 + DEPRECATED237 + DEPRECATED238 + DEPRECATED239 + DEPRECATED240 + DEPRECATED241 + DEPRECATED242 + DEPRECATED243 + DEPRECATED244 + DEPRECATED245 + DEPRECATED246 + DEPRECATED247 + DEPRECATED248 + DEPRECATED249 + DEPRECATED250 + DEPRECATED251 + DEPRECATED252 + DEPRECATED253 + DEPRECATED254 + DEPRECATED255 + DEPRECATED256 + DEPRECATED257 + DEPRECATED258 + DEPRECATED259 + DEPRECATED260 + DEPRECATED261 + DEPRECATED262 + DEPRECATED263 + DEPRECATED264 + DEPRECATED265 + DEPRECATED266 + DEPRECATED267 + DEPRECATED268 + DEPRECATED269 + DEPRECATED270 + DEPRECATED271 + DEPRECATED272 + DEPRECATED273 + DEPRECATED274 + DEPRECATED275 + DEPRECATED276 PERSON NORP @@ -462,6 +462,8 @@ cdef enum symbol_t: acl ENT_KB_ID + MORPH ENT_ID IDX + _ diff --git a/spacy/symbols.pyx b/spacy/symbols.pyx index 83a9d0482..b0345c710 100644 --- a/spacy/symbols.pyx +++ b/spacy/symbols.pyx @@ -1,8 +1,4 @@ -# coding: utf8 -#cython: optimize.unpack_method_calls=False -from __future__ import unicode_literals - - +# cython: optimize.unpack_method_calls=False IDS = { "": NIL, "IS_ALPHA": IS_ALPHA, @@ -116,282 +112,282 @@ IDS = { "EOL": EOL, "SPACE": SPACE, - "Animacy_anim": Animacy_anim, - "Animacy_inam": Animacy_inan, - "Animacy_hum": Animacy_hum, # U20 - "Animacy_nhum": Animacy_nhum, - "Aspect_freq": Aspect_freq, - "Aspect_imp": Aspect_imp, - "Aspect_mod": Aspect_mod, - "Aspect_none": Aspect_none, - "Aspect_perf": Aspect_perf, - "Aspect_iter": Aspect_iter, # U20 - "Aspect_hab": Aspect_hab, # U20 - "Case_abe": Case_abe, - "Case_abl": Case_abl, 
- "Case_abs": Case_abs, - "Case_acc": Case_acc, - "Case_ade": Case_ade, - "Case_all": Case_all, - "Case_cau": Case_cau, - "Case_com": Case_com, - "Case_cmp": Case_cmp, # U20 - "Case_dat": Case_dat, - "Case_del": Case_del, - "Case_dis": Case_dis, - "Case_ela": Case_ela, - "Case_equ": Case_equ, # U20 - "Case_ess": Case_ess, - "Case_gen": Case_gen, - "Case_ill": Case_ill, - "Case_ine": Case_ine, - "Case_ins": Case_ins, - "Case_loc": Case_loc, - "Case_lat": Case_lat, - "Case_nom": Case_nom, - "Case_par": Case_par, - "Case_sub": Case_sub, - "Case_sup": Case_sup, - "Case_tem": Case_tem, - "Case_ter": Case_ter, - "Case_tra": Case_tra, - "Case_voc": Case_voc, - "Definite_two": Definite_two, - "Definite_def": Definite_def, - "Definite_red": Definite_red, - "Definite_cons": Definite_cons, # U20 - "Definite_ind": Definite_ind, - "Definite_spec": Definite_spec, # U20 - "Degree_cmp": Degree_cmp, - "Degree_comp": Degree_comp, - "Degree_none": Degree_none, - "Degree_pos": Degree_pos, - "Degree_sup": Degree_sup, - "Degree_abs": Degree_abs, - "Degree_com": Degree_com, - "Degree_dim": Degree_dim, # du - "Degree_equ": Degree_equ, # U20 - "Evident_nfh": Evident_nfh, # U20 - "Gender_com": Gender_com, - "Gender_fem": Gender_fem, - "Gender_masc": Gender_masc, - "Gender_neut": Gender_neut, - "Mood_cnd": Mood_cnd, - "Mood_imp": Mood_imp, - "Mood_ind": Mood_ind, - "Mood_n": Mood_n, - "Mood_pot": Mood_pot, - "Mood_sub": Mood_sub, - "Mood_opt": Mood_opt, - "Mood_prp": Mood_prp, # U20 - "Mood_adm": Mood_adm, # U20 - "Negative_neg": Negative_neg, - "Negative_pos": Negative_pos, - "Negative_yes": Negative_yes, - "Polarity_neg": Polarity_neg, # U20 - "Polarity_pos": Polarity_pos, # U20 - "Number_com": Number_com, - "Number_dual": Number_dual, - "Number_none": Number_none, - "Number_plur": Number_plur, - "Number_sing": Number_sing, - "Number_ptan": Number_ptan, # bg - "Number_count": Number_count, # bg, U20 - "Number_tri": Number_tri, # U20 - "NumType_card": NumType_card, - "NumType_dist": NumType_dist, - "NumType_frac": NumType_frac, - "NumType_gen": NumType_gen, - "NumType_mult": NumType_mult, - "NumType_none": NumType_none, - "NumType_ord": NumType_ord, - "NumType_sets": NumType_sets, - "Person_one": Person_one, - "Person_two": Person_two, - "Person_three": Person_three, - "Person_none": Person_none, - "Poss_yes": Poss_yes, - "PronType_advPart": PronType_advPart, - "PronType_art": PronType_art, - "PronType_default": PronType_default, - "PronType_dem": PronType_dem, - "PronType_ind": PronType_ind, - "PronType_int": PronType_int, - "PronType_neg": PronType_neg, - "PronType_prs": PronType_prs, - "PronType_rcp": PronType_rcp, - "PronType_rel": PronType_rel, - "PronType_tot": PronType_tot, - "PronType_clit": PronType_clit, - "PronType_exc": PronType_exc, # es, ca, it, fa, U20 - "PronType_emp": PronType_emp, # U20 - "Reflex_yes": Reflex_yes, - "Tense_fut": Tense_fut, - "Tense_imp": Tense_imp, - "Tense_past": Tense_past, - "Tense_pres": Tense_pres, - "VerbForm_fin": VerbForm_fin, - "VerbForm_ger": VerbForm_ger, - "VerbForm_inf": VerbForm_inf, - "VerbForm_none": VerbForm_none, - "VerbForm_part": VerbForm_part, - "VerbForm_partFut": VerbForm_partFut, - "VerbForm_partPast": VerbForm_partPast, - "VerbForm_partPres": VerbForm_partPres, - "VerbForm_sup": VerbForm_sup, - "VerbForm_trans": VerbForm_trans, - "VerbForm_conv": VerbForm_conv, # U20 - "VerbForm_gdv": VerbForm_gdv, # la, - "VerbForm_vnoun": VerbForm_vnoun, # U20 - "Voice_act": Voice_act, - "Voice_cau": Voice_cau, - "Voice_pass": Voice_pass, - "Voice_mid": Voice_mid, # gkc, 
U20 - "Voice_int": Voice_int, # hb, - "Voice_antip": Voice_antip, # U20 - "Voice_dir": Voice_dir, # U20 - "Voice_inv": Voice_inv, # U20 - "Abbr_yes": Abbr_yes, # cz, fi, sl, U, - "AdpType_prep": AdpType_prep, # cz, U, - "AdpType_post": AdpType_post, # U, - "AdpType_voc": AdpType_voc, # cz, - "AdpType_comprep": AdpType_comprep, # cz, - "AdpType_circ": AdpType_circ, # U, - "AdvType_man": AdvType_man, - "AdvType_loc": AdvType_loc, - "AdvType_tim": AdvType_tim, - "AdvType_deg": AdvType_deg, - "AdvType_cau": AdvType_cau, - "AdvType_mod": AdvType_mod, - "AdvType_sta": AdvType_sta, - "AdvType_ex": AdvType_ex, - "AdvType_adadj": AdvType_adadj, - "ConjType_oper": ConjType_oper, # cz, U, - "ConjType_comp": ConjType_comp, # cz, U, - "Connegative_yes": Connegative_yes, # fi, - "Derivation_minen": Derivation_minen, # fi, - "Derivation_sti": Derivation_sti, # fi, - "Derivation_inen": Derivation_inen, # fi, - "Derivation_lainen": Derivation_lainen, # fi, - "Derivation_ja": Derivation_ja, # fi, - "Derivation_ton": Derivation_ton, # fi, - "Derivation_vs": Derivation_vs, # fi, - "Derivation_ttain": Derivation_ttain, # fi, - "Derivation_ttaa": Derivation_ttaa, # fi, - "Echo_rdp": Echo_rdp, # U, - "Echo_ech": Echo_ech, # U, - "Foreign_foreign": Foreign_foreign, # cz, fi, U, - "Foreign_fscript": Foreign_fscript, # cz, fi, U, - "Foreign_tscript": Foreign_tscript, # cz, U, - "Foreign_yes": Foreign_yes, # sl, - "Gender_dat_masc": Gender_dat_masc, # bq, U, - "Gender_dat_fem": Gender_dat_fem, # bq, U, - "Gender_erg_masc": Gender_erg_masc, # bq, - "Gender_erg_fem": Gender_erg_fem, # bq, - "Gender_psor_masc": Gender_psor_masc, # cz, sl, U, - "Gender_psor_fem": Gender_psor_fem, # cz, sl, U, - "Gender_psor_neut": Gender_psor_neut, # sl, - "Hyph_yes": Hyph_yes, # cz, U, - "InfForm_one": InfForm_one, # fi, - "InfForm_two": InfForm_two, # fi, - "InfForm_three": InfForm_three, # fi, - "NameType_geo": NameType_geo, # U, cz, - "NameType_prs": NameType_prs, # U, cz, - "NameType_giv": NameType_giv, # U, cz, - "NameType_sur": NameType_sur, # U, cz, - "NameType_nat": NameType_nat, # U, cz, - "NameType_com": NameType_com, # U, cz, - "NameType_pro": NameType_pro, # U, cz, - "NameType_oth": NameType_oth, # U, cz, - "NounType_com": NounType_com, # U, - "NounType_prop": NounType_prop, # U, - "NounType_class": NounType_class, # U, - "Number_abs_sing": Number_abs_sing, # bq, U, - "Number_abs_plur": Number_abs_plur, # bq, U, - "Number_dat_sing": Number_dat_sing, # bq, U, - "Number_dat_plur": Number_dat_plur, # bq, U, - "Number_erg_sing": Number_erg_sing, # bq, U, - "Number_erg_plur": Number_erg_plur, # bq, U, - "Number_psee_sing": Number_psee_sing, # U, - "Number_psee_plur": Number_psee_plur, # U, - "Number_psor_sing": Number_psor_sing, # cz, fi, sl, U, - "Number_psor_plur": Number_psor_plur, # cz, fi, sl, U, - "Number_pauc": Number_pauc, # U20 - "Number_grpa": Number_grpa, # U20 - "Number_grpl": Number_grpl, # U20 - "Number_inv": Number_inv, # U20 - "NumForm_digit": NumForm_digit, # cz, sl, U, - "NumForm_roman": NumForm_roman, # cz, sl, U, - "NumForm_word": NumForm_word, # cz, sl, U, - "NumValue_one": NumValue_one, # cz, U, - "NumValue_two": NumValue_two, # cz, U, - "NumValue_three": NumValue_three, # cz, U, - "PartForm_pres": PartForm_pres, # fi, - "PartForm_past": PartForm_past, # fi, - "PartForm_agt": PartForm_agt, # fi, - "PartForm_neg": PartForm_neg, # fi, - "PartType_mod": PartType_mod, # U, - "PartType_emp": PartType_emp, # U, - "PartType_res": PartType_res, # U, - "PartType_inf": PartType_inf, # U, - "PartType_vbp": 
PartType_vbp, # U, - "Person_abs_one": Person_abs_one, # bq, U, - "Person_abs_two": Person_abs_two, # bq, U, - "Person_abs_three": Person_abs_three, # bq, U, - "Person_dat_one": Person_dat_one, # bq, U, - "Person_dat_two": Person_dat_two, # bq, U, - "Person_dat_three": Person_dat_three, # bq, U, - "Person_erg_one": Person_erg_one, # bq, U, - "Person_erg_two": Person_erg_two, # bq, U, - "Person_erg_three": Person_erg_three, # bq, U, - "Person_psor_one": Person_psor_one, # fi, U, - "Person_psor_two": Person_psor_two, # fi, U, - "Person_psor_three": Person_psor_three, # fi, U, - "Person_zero": Person_zero, # U20 - "Person_four": Person_four, # U20 - "Polite_inf": Polite_inf, # bq, U, - "Polite_pol": Polite_pol, # bq, U, - "Polite_abs_inf": Polite_abs_inf, # bq, U, - "Polite_abs_pol": Polite_abs_pol, # bq, U, - "Polite_erg_inf": Polite_erg_inf, # bq, U, - "Polite_erg_pol": Polite_erg_pol, # bq, U, - "Polite_dat_inf": Polite_dat_inf, # bq, U, - "Polite_dat_pol": Polite_dat_pol, # bq, U, - "Polite_infm": Polite_infm, # U20 - "Polite_form": Polite_form, # U20 - "Polite_form_elev": Polite_form_elev, # U20 - "Polite_form_humb": Polite_form_humb, # U20 - "Prefix_yes": Prefix_yes, # U, - "PrepCase_npr": PrepCase_npr, # cz, - "PrepCase_pre": PrepCase_pre, # U, - "PunctSide_ini": PunctSide_ini, # U, - "PunctSide_fin": PunctSide_fin, # U, - "PunctType_peri": PunctType_peri, # U, - "PunctType_qest": PunctType_qest, # U, - "PunctType_excl": PunctType_excl, # U, - "PunctType_quot": PunctType_quot, # U, - "PunctType_brck": PunctType_brck, # U, - "PunctType_comm": PunctType_comm, # U, - "PunctType_colo": PunctType_colo, # U, - "PunctType_semi": PunctType_semi, # U, - "PunctType_dash": PunctType_dash, # U, - "Style_arch": Style_arch, # cz, fi, U, - "Style_rare": Style_rare, # cz, fi, U, - "Style_poet": Style_poet, # cz, U, - "Style_norm": Style_norm, # cz, U, - "Style_coll": Style_coll, # cz, U, - "Style_vrnc": Style_vrnc, # cz, U, - "Style_sing": Style_sing, # cz, U, - "Style_expr": Style_expr, # cz, U, - "Style_derg": Style_derg, # cz, U, - "Style_vulg": Style_vulg, # cz, U, - "Style_yes": Style_yes, # fi, U, - "StyleVariant_styleShort": StyleVariant_styleShort, # cz, - "StyleVariant_styleBound": StyleVariant_styleBound, # cz, sl, - "VerbType_aux": VerbType_aux, # U, - "VerbType_cop": VerbType_cop, # U, - "VerbType_mod": VerbType_mod, # U, - "VerbType_light": VerbType_light, # U, + "DEPRECATED001": DEPRECATED001, + "DEPRECATED002": DEPRECATED002, + "DEPRECATED003": DEPRECATED003, + "DEPRECATED004": DEPRECATED004, + "DEPRECATED005": DEPRECATED005, + "DEPRECATED006": DEPRECATED006, + "DEPRECATED007": DEPRECATED007, + "DEPRECATED008": DEPRECATED008, + "DEPRECATED009": DEPRECATED009, + "DEPRECATED010": DEPRECATED010, + "DEPRECATED011": DEPRECATED011, + "DEPRECATED012": DEPRECATED012, + "DEPRECATED013": DEPRECATED013, + "DEPRECATED014": DEPRECATED014, + "DEPRECATED015": DEPRECATED015, + "DEPRECATED016": DEPRECATED016, + "DEPRECATED017": DEPRECATED017, + "DEPRECATED018": DEPRECATED018, + "DEPRECATED019": DEPRECATED019, + "DEPRECATED020": DEPRECATED020, + "DEPRECATED021": DEPRECATED021, + "DEPRECATED022": DEPRECATED022, + "DEPRECATED023": DEPRECATED023, + "DEPRECATED024": DEPRECATED024, + "DEPRECATED025": DEPRECATED025, + "DEPRECATED026": DEPRECATED026, + "DEPRECATED027": DEPRECATED027, + "DEPRECATED028": DEPRECATED028, + "DEPRECATED029": DEPRECATED029, + "DEPRECATED030": DEPRECATED030, + "DEPRECATED031": DEPRECATED031, + "DEPRECATED032": DEPRECATED032, + "DEPRECATED033": DEPRECATED033, + "DEPRECATED034": 
DEPRECATED034, + "DEPRECATED035": DEPRECATED035, + "DEPRECATED036": DEPRECATED036, + "DEPRECATED037": DEPRECATED037, + "DEPRECATED038": DEPRECATED038, + "DEPRECATED039": DEPRECATED039, + "DEPRECATED040": DEPRECATED040, + "DEPRECATED041": DEPRECATED041, + "DEPRECATED042": DEPRECATED042, + "DEPRECATED043": DEPRECATED043, + "DEPRECATED044": DEPRECATED044, + "DEPRECATED045": DEPRECATED045, + "DEPRECATED046": DEPRECATED046, + "DEPRECATED047": DEPRECATED047, + "DEPRECATED048": DEPRECATED048, + "DEPRECATED049": DEPRECATED049, + "DEPRECATED050": DEPRECATED050, + "DEPRECATED051": DEPRECATED051, + "DEPRECATED052": DEPRECATED052, + "DEPRECATED053": DEPRECATED053, + "DEPRECATED054": DEPRECATED054, + "DEPRECATED055": DEPRECATED055, + "DEPRECATED056": DEPRECATED056, + "DEPRECATED057": DEPRECATED057, + "DEPRECATED058": DEPRECATED058, + "DEPRECATED059": DEPRECATED059, + "DEPRECATED060": DEPRECATED060, + "DEPRECATED061": DEPRECATED061, + "DEPRECATED062": DEPRECATED062, + "DEPRECATED063": DEPRECATED063, + "DEPRECATED064": DEPRECATED064, + "DEPRECATED065": DEPRECATED065, + "DEPRECATED066": DEPRECATED066, + "DEPRECATED067": DEPRECATED067, + "DEPRECATED068": DEPRECATED068, + "DEPRECATED069": DEPRECATED069, + "DEPRECATED070": DEPRECATED070, + "DEPRECATED071": DEPRECATED071, + "DEPRECATED072": DEPRECATED072, + "DEPRECATED073": DEPRECATED073, + "DEPRECATED074": DEPRECATED074, + "DEPRECATED075": DEPRECATED075, + "DEPRECATED076": DEPRECATED076, + "DEPRECATED077": DEPRECATED077, + "DEPRECATED078": DEPRECATED078, + "DEPRECATED079": DEPRECATED079, + "DEPRECATED080": DEPRECATED080, + "DEPRECATED081": DEPRECATED081, + "DEPRECATED082": DEPRECATED082, + "DEPRECATED083": DEPRECATED083, + "DEPRECATED084": DEPRECATED084, + "DEPRECATED085": DEPRECATED085, + "DEPRECATED086": DEPRECATED086, + "DEPRECATED087": DEPRECATED087, + "DEPRECATED088": DEPRECATED088, + "DEPRECATED089": DEPRECATED089, + "DEPRECATED090": DEPRECATED090, + "DEPRECATED091": DEPRECATED091, + "DEPRECATED092": DEPRECATED092, + "DEPRECATED093": DEPRECATED093, + "DEPRECATED094": DEPRECATED094, + "DEPRECATED095": DEPRECATED095, + "DEPRECATED096": DEPRECATED096, + "DEPRECATED097": DEPRECATED097, + "DEPRECATED098": DEPRECATED098, + "DEPRECATED099": DEPRECATED099, + "DEPRECATED100": DEPRECATED100, + "DEPRECATED101": DEPRECATED101, + "DEPRECATED102": DEPRECATED102, + "DEPRECATED103": DEPRECATED103, + "DEPRECATED104": DEPRECATED104, + "DEPRECATED105": DEPRECATED105, + "DEPRECATED106": DEPRECATED106, + "DEPRECATED107": DEPRECATED107, + "DEPRECATED108": DEPRECATED108, + "DEPRECATED109": DEPRECATED109, + "DEPRECATED110": DEPRECATED110, + "DEPRECATED111": DEPRECATED111, + "DEPRECATED112": DEPRECATED112, + "DEPRECATED113": DEPRECATED113, + "DEPRECATED114": DEPRECATED114, + "DEPRECATED115": DEPRECATED115, + "DEPRECATED116": DEPRECATED116, + "DEPRECATED117": DEPRECATED117, + "DEPRECATED118": DEPRECATED118, + "DEPRECATED119": DEPRECATED119, + "DEPRECATED120": DEPRECATED120, + "DEPRECATED121": DEPRECATED121, + "DEPRECATED122": DEPRECATED122, + "DEPRECATED123": DEPRECATED123, + "DEPRECATED124": DEPRECATED124, + "DEPRECATED125": DEPRECATED125, + "DEPRECATED126": DEPRECATED126, + "DEPRECATED127": DEPRECATED127, + "DEPRECATED128": DEPRECATED128, + "DEPRECATED129": DEPRECATED129, + "DEPRECATED130": DEPRECATED130, + "DEPRECATED131": DEPRECATED131, + "DEPRECATED132": DEPRECATED132, + "DEPRECATED133": DEPRECATED133, + "DEPRECATED134": DEPRECATED134, + "DEPRECATED135": DEPRECATED135, + "DEPRECATED136": DEPRECATED136, + "DEPRECATED137": DEPRECATED137, + "DEPRECATED138": DEPRECATED138, + 
"DEPRECATED139": DEPRECATED139, + "DEPRECATED140": DEPRECATED140, + "DEPRECATED141": DEPRECATED141, + "DEPRECATED142": DEPRECATED142, + "DEPRECATED143": DEPRECATED143, + "DEPRECATED144": DEPRECATED144, + "DEPRECATED145": DEPRECATED145, + "DEPRECATED146": DEPRECATED146, + "DEPRECATED147": DEPRECATED147, + "DEPRECATED148": DEPRECATED148, + "DEPRECATED149": DEPRECATED149, + "DEPRECATED150": DEPRECATED150, + "DEPRECATED151": DEPRECATED151, + "DEPRECATED152": DEPRECATED152, + "DEPRECATED153": DEPRECATED153, + "DEPRECATED154": DEPRECATED154, + "DEPRECATED155": DEPRECATED155, + "DEPRECATED156": DEPRECATED156, + "DEPRECATED157": DEPRECATED157, + "DEPRECATED158": DEPRECATED158, + "DEPRECATED159": DEPRECATED159, + "DEPRECATED160": DEPRECATED160, + "DEPRECATED161": DEPRECATED161, + "DEPRECATED162": DEPRECATED162, + "DEPRECATED163": DEPRECATED163, + "DEPRECATED164": DEPRECATED164, + "DEPRECATED165": DEPRECATED165, + "DEPRECATED166": DEPRECATED166, + "DEPRECATED167": DEPRECATED167, + "DEPRECATED168": DEPRECATED168, + "DEPRECATED169": DEPRECATED169, + "DEPRECATED170": DEPRECATED170, + "DEPRECATED171": DEPRECATED171, + "DEPRECATED172": DEPRECATED172, + "DEPRECATED173": DEPRECATED173, + "DEPRECATED174": DEPRECATED174, + "DEPRECATED175": DEPRECATED175, + "DEPRECATED176": DEPRECATED176, + "DEPRECATED177": DEPRECATED177, + "DEPRECATED178": DEPRECATED178, + "DEPRECATED179": DEPRECATED179, + "DEPRECATED180": DEPRECATED180, + "DEPRECATED181": DEPRECATED181, + "DEPRECATED182": DEPRECATED182, + "DEPRECATED183": DEPRECATED183, + "DEPRECATED184": DEPRECATED184, + "DEPRECATED185": DEPRECATED185, + "DEPRECATED186": DEPRECATED186, + "DEPRECATED187": DEPRECATED187, + "DEPRECATED188": DEPRECATED188, + "DEPRECATED189": DEPRECATED189, + "DEPRECATED190": DEPRECATED190, + "DEPRECATED191": DEPRECATED191, + "DEPRECATED192": DEPRECATED192, + "DEPRECATED193": DEPRECATED193, + "DEPRECATED194": DEPRECATED194, + "DEPRECATED195": DEPRECATED195, + "DEPRECATED196": DEPRECATED196, + "DEPRECATED197": DEPRECATED197, + "DEPRECATED198": DEPRECATED198, + "DEPRECATED199": DEPRECATED199, + "DEPRECATED200": DEPRECATED200, + "DEPRECATED201": DEPRECATED201, + "DEPRECATED202": DEPRECATED202, + "DEPRECATED203": DEPRECATED203, + "DEPRECATED204": DEPRECATED204, + "DEPRECATED205": DEPRECATED205, + "DEPRECATED206": DEPRECATED206, + "DEPRECATED207": DEPRECATED207, + "DEPRECATED208": DEPRECATED208, + "DEPRECATED209": DEPRECATED209, + "DEPRECATED210": DEPRECATED210, + "DEPRECATED211": DEPRECATED211, + "DEPRECATED212": DEPRECATED212, + "DEPRECATED213": DEPRECATED213, + "DEPRECATED214": DEPRECATED214, + "DEPRECATED215": DEPRECATED215, + "DEPRECATED216": DEPRECATED216, + "DEPRECATED217": DEPRECATED217, + "DEPRECATED218": DEPRECATED218, + "DEPRECATED219": DEPRECATED219, + "DEPRECATED220": DEPRECATED220, + "DEPRECATED221": DEPRECATED221, + "DEPRECATED222": DEPRECATED222, + "DEPRECATED223": DEPRECATED223, + "DEPRECATED224": DEPRECATED224, + "DEPRECATED225": DEPRECATED225, + "DEPRECATED226": DEPRECATED226, + "DEPRECATED227": DEPRECATED227, + "DEPRECATED228": DEPRECATED228, + "DEPRECATED229": DEPRECATED229, + "DEPRECATED230": DEPRECATED230, + "DEPRECATED231": DEPRECATED231, + "DEPRECATED232": DEPRECATED232, + "DEPRECATED233": DEPRECATED233, + "DEPRECATED234": DEPRECATED234, + "DEPRECATED235": DEPRECATED235, + "DEPRECATED236": DEPRECATED236, + "DEPRECATED237": DEPRECATED237, + "DEPRECATED238": DEPRECATED238, + "DEPRECATED239": DEPRECATED239, + "DEPRECATED240": DEPRECATED240, + "DEPRECATED241": DEPRECATED241, + "DEPRECATED242": DEPRECATED242, + "DEPRECATED243": 
DEPRECATED243, + "DEPRECATED244": DEPRECATED244, + "DEPRECATED245": DEPRECATED245, + "DEPRECATED246": DEPRECATED246, + "DEPRECATED247": DEPRECATED247, + "DEPRECATED248": DEPRECATED248, + "DEPRECATED249": DEPRECATED249, + "DEPRECATED250": DEPRECATED250, + "DEPRECATED251": DEPRECATED251, + "DEPRECATED252": DEPRECATED252, + "DEPRECATED253": DEPRECATED253, + "DEPRECATED254": DEPRECATED254, + "DEPRECATED255": DEPRECATED255, + "DEPRECATED256": DEPRECATED256, + "DEPRECATED257": DEPRECATED257, + "DEPRECATED258": DEPRECATED258, + "DEPRECATED259": DEPRECATED259, + "DEPRECATED260": DEPRECATED260, + "DEPRECATED261": DEPRECATED261, + "DEPRECATED262": DEPRECATED262, + "DEPRECATED263": DEPRECATED263, + "DEPRECATED264": DEPRECATED264, + "DEPRECATED265": DEPRECATED265, + "DEPRECATED266": DEPRECATED266, + "DEPRECATED267": DEPRECATED267, + "DEPRECATED268": DEPRECATED268, + "DEPRECATED269": DEPRECATED269, + "DEPRECATED270": DEPRECATED270, + "DEPRECATED271": DEPRECATED271, + "DEPRECATED272": DEPRECATED272, + "DEPRECATED273": DEPRECATED273, + "DEPRECATED274": DEPRECATED274, + "DEPRECATED275": DEPRECATED275, + "DEPRECATED276": DEPRECATED276, "PERSON": PERSON, "NORP": NORP, @@ -468,6 +464,8 @@ IDS = { "acl": acl, "LAW": LAW, + "MORPH": MORPH, + "_": _, } @@ -475,7 +473,6 @@ def sort_nums(x): return x[1] -PRON_LEMMA = "-PRON-" NAMES = [it[0] for it in sorted(IDS.items(), key=sort_nums)] # Unfortunate hack here, to work around problem with long cpdef enum # (which is generating an enormous amount of C++ in Cython 0.24+) diff --git a/spacy/syntax/_beam_utils.pxd b/spacy/syntax/_beam_utils.pxd deleted file mode 100644 index 36b0c05da..000000000 --- a/spacy/syntax/_beam_utils.pxd +++ /dev/null @@ -1,9 +0,0 @@ -from thinc.typedefs cimport class_t, hash_t - -# These are passed as callbacks to thinc.search.Beam -cdef int transition_state(void* _dest, void* _src, class_t clas, void* _moves) except -1 - -cdef int check_final_state(void* _state, void* extra_args) except -1 - - -cdef hash_t hash_state(void* _state, void* _) except 0 diff --git a/spacy/syntax/_beam_utils.pyx b/spacy/syntax/_beam_utils.pyx deleted file mode 100644 index b1085c762..000000000 --- a/spacy/syntax/_beam_utils.pyx +++ /dev/null @@ -1,330 +0,0 @@ -# cython: infer_types=True -# cython: profile=True -cimport numpy as np -import numpy -from cpython.ref cimport PyObject, Py_XDECREF -from thinc.extra.search cimport Beam -from thinc.extra.search import MaxViolation -from thinc.typedefs cimport hash_t, class_t -from thinc.extra.search cimport MaxViolation - -from .transition_system cimport TransitionSystem, Transition -from ..gold cimport GoldParse -from ..errors import Errors -from .stateclass cimport StateC, StateClass - - -# These are passed as callbacks to thinc.search.Beam -cdef int transition_state(void* _dest, void* _src, class_t clas, void* _moves) except -1: - dest = _dest - src = _src - moves = _moves - dest.clone(src) - moves[clas].do(dest, moves[clas].label) - dest.push_hist(clas) - - -cdef int check_final_state(void* _state, void* extra_args) except -1: - state = _state - return state.is_final() - - -cdef hash_t hash_state(void* _state, void* _) except 0: - state = _state - if state.is_final(): - return 1 - else: - return state.hash() - - -def collect_states(beams): - cdef StateClass state - cdef Beam beam - states = [] - for state_or_beam in beams: - if isinstance(state_or_beam, StateClass): - states.append(state_or_beam) - else: - beam = state_or_beam - state = StateClass.borrow(beam.at(0)) - states.append(state) - return states - - -cdef 
class ParserBeam(object): - cdef public TransitionSystem moves - cdef public object states - cdef public object golds - cdef public object beams - cdef public object dones - - def __init__(self, TransitionSystem moves, states, golds, - int width, float density=0.): - self.moves = moves - self.states = states - self.golds = golds - self.beams = [] - cdef Beam beam - cdef StateClass state - cdef StateC* st - for state in states: - beam = Beam(self.moves.n_moves, width, min_density=density) - beam.initialize(self.moves.init_beam_state, - self.moves.del_beam_state, state.c.length, - state.c._sent) - for i in range(beam.width): - st = beam.at(i) - st.offset = state.c.offset - self.beams.append(beam) - self.dones = [False] * len(self.beams) - - @property - def is_done(self): - return all(b.is_done or self.dones[i] - for i, b in enumerate(self.beams)) - - def __getitem__(self, i): - return self.beams[i] - - def __len__(self): - return len(self.beams) - - def advance(self, scores, follow_gold=False): - cdef Beam beam - for i, beam in enumerate(self.beams): - if beam.is_done or not scores[i].size or self.dones[i]: - continue - self._set_scores(beam, scores[i]) - if self.golds is not None: - self._set_costs(beam, self.golds[i], follow_gold=follow_gold) - beam.advance(transition_state, hash_state, self.moves.c) - beam.check_done(check_final_state, NULL) - # This handles the non-monotonic stuff for the parser. - if beam.is_done and self.golds is not None: - for j in range(beam.size): - state = StateClass.borrow(beam.at(j)) - if state.is_final(): - try: - if self.moves.is_gold_parse(state, self.golds[i]): - beam._states[j].loss = 0.0 - except NotImplementedError: - break - - def _set_scores(self, Beam beam, float[:, ::1] scores): - cdef float* c_scores = &scores[0, 0] - cdef int nr_state = min(scores.shape[0], beam.size) - cdef int nr_class = scores.shape[1] - for i in range(nr_state): - state = beam.at(i) - if not state.is_final(): - for j in range(nr_class): - beam.scores[i][j] = c_scores[i * nr_class + j] - self.moves.set_valid(beam.is_valid[i], state) - else: - for j in range(beam.nr_class): - beam.scores[i][j] = 0 - beam.costs[i][j] = 0 - - def _set_costs(self, Beam beam, GoldParse gold, int follow_gold=False): - for i in range(beam.size): - state = StateClass.borrow(beam.at(i)) - if not state.is_final(): - self.moves.set_costs(beam.is_valid[i], beam.costs[i], - state, gold) - if follow_gold: - min_cost = 0 - for j in range(beam.nr_class): - if beam.is_valid[i][j] and beam.costs[i][j] < min_cost: - min_cost = beam.costs[i][j] - for j in range(beam.nr_class): - if beam.costs[i][j] > min_cost: - beam.is_valid[i][j] = 0 - - -def get_token_ids(states, int n_tokens): - cdef StateClass state - cdef np.ndarray ids = numpy.zeros((len(states), n_tokens), - dtype='int32', order='C') - c_ids = ids.data - for i, state in enumerate(states): - if not state.is_final(): - state.c.set_context_tokens(c_ids, n_tokens) - else: - ids[i] = -1 - c_ids += ids.shape[1] - return ids - - -nr_update = 0 - - -def update_beam(TransitionSystem moves, int nr_feature, int max_steps, - states, golds, - state2vec, vec2scores, - int width, losses=None, drop=0., - early_update=True, beam_density=0.0): - global nr_update - cdef MaxViolation violn - nr_update += 1 - pbeam = ParserBeam(moves, states, golds, width=width, density=beam_density) - gbeam = ParserBeam(moves, states, golds, width=width, density=beam_density) - cdef StateClass state - beam_maps = [] - backprops = [] - violns = [MaxViolation() for _ in range(len(states))] - for 
t in range(max_steps): - if pbeam.is_done and gbeam.is_done: - break - # The beam maps let us find the right row in the flattened scores - # arrays for each state. States are identified by (example id, - # history). We keep a different beam map for each step (since we'll - # have a flat scores array for each step). The beam map will let us - # take the per-state losses, and compute the gradient for each (step, - # state, class). - beam_maps.append({}) - # Gather all states from the two beams in a list. Some stats may occur - # in both beams. To figure out which beam each state belonged to, - # we keep two lists of indices, p_indices and g_indices - states, p_indices, g_indices = get_states(pbeam, gbeam, beam_maps[-1], - nr_update) - if not states: - break - # Now that we have our flat list of states, feed them through the model - token_ids = get_token_ids(states, nr_feature) - vectors, bp_vectors = state2vec.begin_update(token_ids, drop=drop) - scores, bp_scores = vec2scores.begin_update(vectors, drop=drop) - - # Store the callbacks for the backward pass - backprops.append((token_ids, bp_vectors, bp_scores)) - - # Unpack the flat scores into lists for the two beams. The indices arrays - # tell us which example and state the scores-row refers to. - p_scores = [numpy.ascontiguousarray(scores[indices], dtype='f') - for indices in p_indices] - g_scores = [numpy.ascontiguousarray(scores[indices], dtype='f') - for indices in g_indices] - # Now advance the states in the beams. The gold beam is constrained to - # to follow only gold analyses. - pbeam.advance(p_scores) - gbeam.advance(g_scores, follow_gold=True) - # Track the "maximum violation", to use in the update. - for i, violn in enumerate(violns): - violn.check_crf(pbeam[i], gbeam[i]) - histories = [] - losses = [] - for violn in violns: - if violn.p_hist: - histories.append(violn.p_hist + violn.g_hist) - losses.append(violn.p_probs + violn.g_probs) - else: - histories.append([]) - losses.append([]) - states_d_scores = get_gradient(moves.n_moves, beam_maps, histories, losses) - beams = list(pbeam.beams) + list(gbeam.beams) - return states_d_scores, backprops[:len(states_d_scores)], beams - - -def get_states(pbeams, gbeams, beam_map, nr_update): - seen = {} - states = [] - p_indices = [] - g_indices = [] - cdef Beam pbeam, gbeam - if len(pbeams) != len(gbeams): - raise ValueError(Errors.E079.format(pbeams=len(pbeams), gbeams=len(gbeams))) - for eg_id, (pbeam, gbeam) in enumerate(zip(pbeams, gbeams)): - p_indices.append([]) - g_indices.append([]) - for i in range(pbeam.size): - state = StateClass.borrow(pbeam.at(i)) - if not state.is_final(): - key = tuple([eg_id] + pbeam.histories[i]) - if key in seen: - raise ValueError(Errors.E080.format(key=key)) - seen[key] = len(states) - p_indices[-1].append(len(states)) - states.append(state) - beam_map.update(seen) - for i in range(gbeam.size): - state = StateClass.borrow(gbeam.at(i)) - if not state.is_final(): - key = tuple([eg_id] + gbeam.histories[i]) - if key in seen: - g_indices[-1].append(seen[key]) - else: - g_indices[-1].append(len(states)) - beam_map[key] = len(states) - states.append(state) - p_idx = [numpy.asarray(idx, dtype='i') for idx in p_indices] - g_idx = [numpy.asarray(idx, dtype='i') for idx in g_indices] - return states, p_idx, g_idx - - -def get_gradient(nr_class, beam_maps, histories, losses): - """The global model assigns a loss to each parse. The beam scores - are additive, so the same gradient is applied to each action - in the history. 
This gives the gradient of a single *action* - for a beam state -- so we have "the gradient of loss for taking - action i given history H." - - Histories: Each hitory is a list of actions - Each candidate has a history - Each beam has multiple candidates - Each batch has multiple beams - So history is list of lists of lists of ints - """ - grads = [] - nr_steps = [] - for eg_id, hists in enumerate(histories): - nr_step = 0 - for loss, hist in zip(losses[eg_id], hists): - if loss != 0.0 and not numpy.isnan(loss): - nr_step = max(nr_step, len(hist)) - nr_steps.append(nr_step) - for i in range(max(nr_steps)): - grads.append(numpy.zeros((max(beam_maps[i].values())+1, nr_class), - dtype='f')) - if len(histories) != len(losses): - raise ValueError(Errors.E081.format(n_hist=len(histories), losses=len(losses))) - for eg_id, hists in enumerate(histories): - for loss, hist in zip(losses[eg_id], hists): - if loss == 0.0 or numpy.isnan(loss): - continue - key = tuple([eg_id]) - # Adjust loss for length - # We need to do this because each state in a short path is scored - # multiple times, as we add in the average cost when we run out - # of actions. - avg_loss = loss / len(hist) - loss += avg_loss * (nr_steps[eg_id] - len(hist)) - for j, clas in enumerate(hist): - i = beam_maps[j][key] - # In step j, at state i action clas - # resulted in loss - grads[j][i, clas] += loss - key = key + tuple([clas]) - return grads - - -def cleanup_beam(Beam beam): - cdef StateC* state - # Once parsing has finished, states in beam may not be unique. Is this - # correct? - seen = set() - for i in range(beam.width): - addr = beam._parents[i].content - if addr not in seen: - state = addr - del state - seen.add(addr) - else: - raise ValueError(Errors.E023.format(addr=addr, i=i)) - addr = beam._states[i].content - if addr not in seen: - state = addr - del state - seen.add(addr) - else: - raise ValueError(Errors.E023.format(addr=addr, i=i)) - - diff --git a/spacy/syntax/arc_eager.pxd b/spacy/syntax/arc_eager.pxd deleted file mode 100644 index 972ad682a..000000000 --- a/spacy/syntax/arc_eager.pxd +++ /dev/null @@ -1,18 +0,0 @@ -from cymem.cymem cimport Pool - -from thinc.typedefs cimport weight_t - -from .stateclass cimport StateClass -from ..typedefs cimport attr_t - -from .transition_system cimport TransitionSystem, Transition -from ..gold cimport GoldParseC - - -cdef class ArcEager(TransitionSystem): - pass - - -cdef weight_t push_cost(StateClass stcls, const GoldParseC* gold, int target) nogil -cdef weight_t arc_cost(StateClass stcls, const GoldParseC* gold, int head, int child) nogil - diff --git a/spacy/syntax/arc_eager.pyx b/spacy/syntax/arc_eager.pyx deleted file mode 100644 index efe8573c1..000000000 --- a/spacy/syntax/arc_eager.pyx +++ /dev/null @@ -1,634 +0,0 @@ -# cython: profile=True -# cython: cdivision=True -# cython: infer_types=True -# coding: utf-8 -from __future__ import unicode_literals - -from cpython.ref cimport Py_INCREF -from cymem.cymem cimport Pool -from collections import OrderedDict, defaultdict, Counter -from thinc.extra.search cimport Beam -import json - -from .nonproj import is_nonproj_tree -from ..typedefs cimport hash_t, attr_t -from ..strings cimport hash_string -from .stateclass cimport StateClass -from ._state cimport StateC -from . 
import nonproj -from .transition_system cimport move_cost_func_t, label_cost_func_t -from ..gold cimport GoldParse, GoldParseC -from ..structs cimport TokenC -from ..errors import Errors -from ..tokens.doc cimport Doc, set_children_from_heads - -# Calculate cost as gold/not gold. We don't use scalar value anyway. -cdef int BINARY_COSTS = 1 -cdef weight_t MIN_SCORE = -90000 -cdef attr_t SUBTOK_LABEL = hash_string('subtok') - -DEF NON_MONOTONIC = True -DEF USE_BREAK = True - -# Break transition from here -# http://www.aclweb.org/anthology/P13-1074 -cdef enum: - SHIFT - REDUCE - LEFT - RIGHT - - BREAK - - N_MOVES - - -MOVE_NAMES = [None] * N_MOVES -MOVE_NAMES[SHIFT] = 'S' -MOVE_NAMES[REDUCE] = 'D' -MOVE_NAMES[LEFT] = 'L' -MOVE_NAMES[RIGHT] = 'R' -MOVE_NAMES[BREAK] = 'B' - - -# Helper functions for the arc-eager oracle - -cdef weight_t push_cost(StateClass stcls, const GoldParseC* gold, int target) nogil: - cdef weight_t cost = 0 - cdef int i, S_i - for i in range(stcls.stack_depth()): - S_i = stcls.S(i) - if gold.heads[target] == S_i: - cost += 1 - if gold.heads[S_i] == target and (NON_MONOTONIC or not stcls.has_head(S_i)): - cost += 1 - if BINARY_COSTS and cost >= 1: - return cost - cost += Break.is_valid(stcls.c, 0) and Break.move_cost(stcls, gold) == 0 - return cost - - -cdef weight_t pop_cost(StateClass stcls, const GoldParseC* gold, int target) nogil: - cdef weight_t cost = 0 - cdef int i, B_i - for i in range(stcls.buffer_length()): - B_i = stcls.B(i) - cost += gold.heads[B_i] == target - cost += gold.heads[target] == B_i - if gold.heads[B_i] == B_i or gold.heads[B_i] < target: - break - if BINARY_COSTS and cost >= 1: - return cost - if Break.is_valid(stcls.c, 0) and Break.move_cost(stcls, gold) == 0: - cost += 1 - return cost - - -cdef weight_t arc_cost(StateClass stcls, const GoldParseC* gold, int head, int child) nogil: - if arc_is_gold(gold, head, child): - return 0 - elif stcls.H(child) == gold.heads[child]: - return 1 - # Head in buffer - elif gold.heads[child] >= stcls.B(0) and stcls.B(1) != 0: - return 1 - else: - return 0 - - -cdef bint arc_is_gold(const GoldParseC* gold, int head, int child) nogil: - if not gold.has_dep[child]: - return True - elif gold.heads[child] == head: - return True - else: - return False - - -cdef bint label_is_gold(const GoldParseC* gold, int head, int child, attr_t label) nogil: - if not gold.has_dep[child]: - return True - elif label == 0: - return True - elif gold.labels[child] == label: - return True - else: - return False - - -cdef bint _is_gold_root(const GoldParseC* gold, int word) nogil: - return gold.heads[word] == word or not gold.has_dep[word] - -cdef class Shift: - @staticmethod - cdef bint is_valid(const StateC* st, attr_t label) nogil: - sent_start = st._sent[st.B_(0).l_edge].sent_start - return st.buffer_length() >= 2 and not st.shifted[st.B(0)] and sent_start != 1 - - @staticmethod - cdef int transition(StateC* st, attr_t label) nogil: - st.push() - st.fast_forward() - - @staticmethod - cdef weight_t cost(StateClass st, const GoldParseC* gold, attr_t label) nogil: - return Shift.move_cost(st, gold) + Shift.label_cost(st, gold, label) - - @staticmethod - cdef inline weight_t move_cost(StateClass s, const GoldParseC* gold) nogil: - return push_cost(s, gold, s.B(0)) - - @staticmethod - cdef inline weight_t label_cost(StateClass s, const GoldParseC* gold, attr_t label) nogil: - return 0 - - -cdef class Reduce: - @staticmethod - cdef bint is_valid(const StateC* st, attr_t label) nogil: - return st.stack_depth() >= 2 - - @staticmethod - cdef 
int transition(StateC* st, attr_t label) nogil: - if st.has_head(st.S(0)): - st.pop() - else: - st.unshift() - st.fast_forward() - - @staticmethod - cdef weight_t cost(StateClass s, const GoldParseC* gold, attr_t label) nogil: - return Reduce.move_cost(s, gold) + Reduce.label_cost(s, gold, label) - - @staticmethod - cdef inline weight_t move_cost(StateClass st, const GoldParseC* gold) nogil: - cost = pop_cost(st, gold, st.S(0)) - if not st.has_head(st.S(0)): - # Decrement cost for the arcs e save - for i in range(1, st.stack_depth()): - S_i = st.S(i) - if gold.heads[st.S(0)] == S_i: - cost -= 1 - if gold.heads[S_i] == st.S(0): - cost -= 1 - if Break.is_valid(st.c, 0) and Break.move_cost(st, gold) == 0: - cost -= 1 - return cost - - @staticmethod - cdef inline weight_t label_cost(StateClass s, const GoldParseC* gold, attr_t label) nogil: - return 0 - - -cdef class LeftArc: - @staticmethod - cdef bint is_valid(const StateC* st, attr_t label) nogil: - if label == SUBTOK_LABEL and st.S(0) != (st.B(0)-1): - return 0 - sent_start = st._sent[st.B_(0).l_edge].sent_start - return sent_start != 1 - - @staticmethod - cdef int transition(StateC* st, attr_t label) nogil: - st.add_arc(st.B(0), st.S(0), label) - st.pop() - st.fast_forward() - - @staticmethod - cdef weight_t cost(StateClass s, const GoldParseC* gold, attr_t label) nogil: - return LeftArc.move_cost(s, gold) + LeftArc.label_cost(s, gold, label) - - @staticmethod - cdef inline weight_t move_cost(StateClass s, const GoldParseC* gold) nogil: - cdef weight_t cost = 0 - if arc_is_gold(gold, s.B(0), s.S(0)): - # Have a negative cost if we 'recover' from the wrong dependency - return 0 if not s.has_head(s.S(0)) else -1 - else: - # Account for deps we might lose between S0 and stack - if not s.has_head(s.S(0)): - for i in range(1, s.stack_depth()): - cost += gold.heads[s.S(i)] == s.S(0) - cost += gold.heads[s.S(0)] == s.S(i) - return cost + pop_cost(s, gold, s.S(0)) + arc_cost(s, gold, s.B(0), s.S(0)) - - @staticmethod - cdef inline weight_t label_cost(StateClass s, const GoldParseC* gold, attr_t label) nogil: - return arc_is_gold(gold, s.B(0), s.S(0)) and not label_is_gold(gold, s.B(0), s.S(0), label) - - -cdef class RightArc: - @staticmethod - cdef bint is_valid(const StateC* st, attr_t label) nogil: - # If there's (perhaps partial) parse pre-set, don't allow cycle. 
- if label == SUBTOK_LABEL and st.S(0) != (st.B(0)-1): - return 0 - sent_start = st._sent[st.B_(0).l_edge].sent_start - return sent_start != 1 and st.H(st.S(0)) != st.B(0) - - @staticmethod - cdef int transition(StateC* st, attr_t label) nogil: - st.add_arc(st.S(0), st.B(0), label) - st.push() - st.fast_forward() - - @staticmethod - cdef inline weight_t cost(StateClass s, const GoldParseC* gold, attr_t label) nogil: - return RightArc.move_cost(s, gold) + RightArc.label_cost(s, gold, label) - - @staticmethod - cdef inline weight_t move_cost(StateClass s, const GoldParseC* gold) nogil: - if arc_is_gold(gold, s.S(0), s.B(0)): - return 0 - elif s.c.shifted[s.B(0)]: - return push_cost(s, gold, s.B(0)) - else: - return push_cost(s, gold, s.B(0)) + arc_cost(s, gold, s.S(0), s.B(0)) - - @staticmethod - cdef weight_t label_cost(StateClass s, const GoldParseC* gold, attr_t label) nogil: - return arc_is_gold(gold, s.S(0), s.B(0)) and not label_is_gold(gold, s.S(0), s.B(0), label) - - -cdef class Break: - @staticmethod - cdef bint is_valid(const StateC* st, attr_t label) nogil: - cdef int i - if not USE_BREAK: - return False - elif st.at_break(): - return False - elif st.stack_depth() < 1: - return False - elif st.B_(0).l_edge < 0: - return False - elif st._sent[st.B_(0).l_edge].sent_start < 0: - return False - else: - return True - - @staticmethod - cdef int transition(StateC* st, attr_t label) nogil: - st.set_break(st.B_(0).l_edge) - st.fast_forward() - - @staticmethod - cdef weight_t cost(StateClass s, const GoldParseC* gold, attr_t label) nogil: - return Break.move_cost(s, gold) + Break.label_cost(s, gold, label) - - @staticmethod - cdef inline weight_t move_cost(StateClass s, const GoldParseC* gold) nogil: - cdef weight_t cost = 0 - cdef int i, j, S_i, B_i - for i in range(s.stack_depth()): - S_i = s.S(i) - for j in range(s.buffer_length()): - B_i = s.B(j) - cost += gold.heads[S_i] == B_i - cost += gold.heads[B_i] == S_i - if cost != 0: - return cost - # Check for sentence boundary --- if it's here, we can't have any deps - # between stack and buffer, so rest of action is irrelevant. 
- s0_root = _get_root(s.S(0), gold) - b0_root = _get_root(s.B(0), gold) - if s0_root != b0_root or s0_root == -1 or b0_root == -1: - return cost - else: - return cost + 1 - - @staticmethod - cdef inline weight_t label_cost(StateClass s, const GoldParseC* gold, attr_t label) nogil: - return 0 - -cdef int _get_root(int word, const GoldParseC* gold) nogil: - while gold.heads[word] != word and gold.has_dep[word] and word >= 0: - word = gold.heads[word] - if not gold.has_dep[word]: - return -1 - else: - return word - - -cdef void* _init_state(Pool mem, int length, void* tokens) except NULL: - st = new StateC(tokens, length) - for i in range(st.length): - if st._sent[i].dep == 0: - st._sent[i].l_edge = i - st._sent[i].r_edge = i - st._sent[i].head = 0 - st._sent[i].dep = 0 - st._sent[i].l_kids = 0 - st._sent[i].r_kids = 0 - st.fast_forward() - return st - - -cdef int _del_state(Pool mem, void* state, void* x) except -1: - cdef StateC* st = state - del st - - -cdef class ArcEager(TransitionSystem): - def __init__(self, *args, **kwargs): - TransitionSystem.__init__(self, *args, **kwargs) - self.init_beam_state = _init_state - self.del_beam_state = _del_state - - @classmethod - def get_actions(cls, **kwargs): - min_freq = kwargs.get('min_freq', None) - actions = defaultdict(lambda: Counter()) - actions[SHIFT][''] = 1 - actions[REDUCE][''] = 1 - for label in kwargs.get('left_labels', []): - actions[LEFT][label] = 1 - actions[SHIFT][label] = 1 - for label in kwargs.get('right_labels', []): - actions[RIGHT][label] = 1 - actions[REDUCE][label] = 1 - for raw_text, sents in kwargs.get('gold_parses', []): - for (ids, words, tags, heads, labels, iob), ctnts in sents: - heads, labels = nonproj.projectivize(heads, labels) - for child, head, label in zip(ids, heads, labels): - if label.upper() == 'ROOT' : - label = 'ROOT' - if head == child: - actions[BREAK][label] += 1 - elif head < child: - actions[RIGHT][label] += 1 - actions[REDUCE][''] += 1 - elif head > child: - actions[LEFT][label] += 1 - actions[SHIFT][''] += 1 - if min_freq is not None: - for action, label_freqs in actions.items(): - for label, freq in list(label_freqs.items()): - if freq < min_freq: - label_freqs.pop(label) - # Ensure these actions are present - actions[BREAK].setdefault('ROOT', 0) - if kwargs.get("learn_tokens") is True: - actions[RIGHT].setdefault('subtok', 0) - actions[LEFT].setdefault('subtok', 0) - # Used for backoff - actions[RIGHT].setdefault('dep', 0) - actions[LEFT].setdefault('dep', 0) - return actions - - @property - def action_types(self): - return (SHIFT, REDUCE, LEFT, RIGHT, BREAK) - - def get_cost(self, StateClass state, GoldParse gold, action): - cdef Transition t = self.lookup_transition(action) - if not t.is_valid(state.c, t.label): - return 9000 - else: - return t.get_cost(state, &gold.c, t.label) - - def transition(self, StateClass state, action): - cdef Transition t = self.lookup_transition(action) - t.do(state.c, t.label) - return state - - def is_gold_parse(self, StateClass state, GoldParse gold): - predicted = set() - truth = set() - for i in range(gold.length): - if gold.cand_to_gold[i] is None: - continue - if state.safe_get(i).dep: - predicted.add((i, state.H(i), - self.strings[state.safe_get(i).dep])) - else: - predicted.add((i, state.H(i), 'ROOT')) - id_, word, tag, head, dep, ner = gold.orig_annot[gold.cand_to_gold[i]] - truth.add((id_, head, dep)) - return truth == predicted - - def has_gold(self, GoldParse gold, start=0, end=None): - end = end or len(gold.heads) - if all([tag is None for tag in 
gold.heads[start:end]]): - return False - else: - return True - - def preprocess_gold(self, GoldParse gold): - if not self.has_gold(gold): - return None - # Figure out whether we're using subtok - use_subtok = False - for action, labels in self.labels.items(): - if SUBTOK_LABEL in labels: - use_subtok = True - break - for i, (head, dep) in enumerate(zip(gold.heads, gold.labels)): - # Missing values - if head is None or dep is None: - gold.c.heads[i] = i - gold.c.has_dep[i] = False - elif dep == SUBTOK_LABEL and not use_subtok: - # If we're not doing the joint tokenization and parsing, - # regard these subtok labels as missing - gold.c.heads[i] = i - gold.c.labels[i] = 0 - gold.c.has_dep[i] = False - else: - if head > i: - action = LEFT - elif head < i: - action = RIGHT - else: - action = BREAK - if dep not in self.labels[action]: - if action == BREAK: - dep = 'ROOT' - elif nonproj.is_decorated(dep): - backoff = nonproj.decompose(dep)[0] - if backoff in self.labels[action]: - dep = backoff - else: - dep = 'dep' - else: - dep = 'dep' - gold.c.has_dep[i] = True - if dep.upper() == 'ROOT': - dep = 'ROOT' - gold.c.heads[i] = head - gold.c.labels[i] = self.strings.add(dep) - return gold - - def get_beam_parses(self, Beam beam): - parses = [] - probs = beam.probs - for i in range(beam.size): - state = beam.at(i) - if state.is_final(): - self.finalize_state(state) - prob = probs[i] - parse = [] - for j in range(state.length): - head = state.H(j) - label = self.strings[state._sent[j].dep] - parse.append((head, j, label)) - parses.append((prob, parse)) - return parses - - cdef Transition lookup_transition(self, object name_or_id) except *: - if isinstance(name_or_id, int): - return self.c[name_or_id] - name = name_or_id - if '-' in name: - move_str, label_str = name.split('-', 1) - label = self.strings[label_str] - else: - move_str = name - label = 0 - move = MOVE_NAMES.index(move_str) - for i in range(self.n_moves): - if self.c[i].move == move and self.c[i].label == label: - return self.c[i] - return Transition(clas=0, move=MISSING, label=0) - - def move_name(self, int move, attr_t label): - label_str = self.strings[label] - if label_str: - return MOVE_NAMES[move] + '-' + label_str - else: - return MOVE_NAMES[move] - - def class_name(self, int i): - return self.move_name(self.c[i].move, self.c[i].label) - - cdef Transition init_transition(self, int clas, int move, attr_t label) except *: - # TODO: Apparent Cython bug here when we try to use the Transition() - # constructor with the function pointers - cdef Transition t - t.score = 0 - t.clas = clas - t.move = move - t.label = label - if move == SHIFT: - t.is_valid = Shift.is_valid - t.do = Shift.transition - t.get_cost = Shift.cost - elif move == REDUCE: - t.is_valid = Reduce.is_valid - t.do = Reduce.transition - t.get_cost = Reduce.cost - elif move == LEFT: - t.is_valid = LeftArc.is_valid - t.do = LeftArc.transition - t.get_cost = LeftArc.cost - elif move == RIGHT: - t.is_valid = RightArc.is_valid - t.do = RightArc.transition - t.get_cost = RightArc.cost - elif move == BREAK: - t.is_valid = Break.is_valid - t.do = Break.transition - t.get_cost = Break.cost - else: - raise ValueError(Errors.E019.format(action=move, src='arc_eager')) - return t - - cdef int initialize_state(self, StateC* st) nogil: - for i in range(st.length): - if st._sent[i].dep == 0: - st._sent[i].l_edge = i - st._sent[i].r_edge = i - st._sent[i].head = 0 - st._sent[i].dep = 0 - st._sent[i].l_kids = 0 - st._sent[i].r_kids = 0 - st.fast_forward() - - cdef int finalize_state(self, 
StateC* st) nogil: - cdef int i - for i in range(st.length): - if st._sent[i].head == 0: - st._sent[i].dep = self.root_label - - def finalize_doc(self, Doc doc): - doc.is_parsed = True - set_children_from_heads(doc.c, doc.length) - - cdef int set_valid(self, int* output, const StateC* st) nogil: - cdef bint[N_MOVES] is_valid - is_valid[SHIFT] = Shift.is_valid(st, 0) - is_valid[REDUCE] = Reduce.is_valid(st, 0) - is_valid[LEFT] = LeftArc.is_valid(st, 0) - is_valid[RIGHT] = RightArc.is_valid(st, 0) - is_valid[BREAK] = Break.is_valid(st, 0) - cdef int i - for i in range(self.n_moves): - if self.c[i].label == SUBTOK_LABEL: - output[i] = self.c[i].is_valid(st, self.c[i].label) - else: - output[i] = is_valid[self.c[i].move] - - cdef int set_costs(self, int* is_valid, weight_t* costs, - StateClass stcls, GoldParse gold) except -1: - cdef int i, move - cdef attr_t label - cdef label_cost_func_t[N_MOVES] label_cost_funcs - cdef move_cost_func_t[N_MOVES] move_cost_funcs - cdef weight_t[N_MOVES] move_costs - for i in range(N_MOVES): - move_costs[i] = 9000 - move_cost_funcs[SHIFT] = Shift.move_cost - move_cost_funcs[REDUCE] = Reduce.move_cost - move_cost_funcs[LEFT] = LeftArc.move_cost - move_cost_funcs[RIGHT] = RightArc.move_cost - move_cost_funcs[BREAK] = Break.move_cost - - label_cost_funcs[SHIFT] = Shift.label_cost - label_cost_funcs[REDUCE] = Reduce.label_cost - label_cost_funcs[LEFT] = LeftArc.label_cost - label_cost_funcs[RIGHT] = RightArc.label_cost - label_cost_funcs[BREAK] = Break.label_cost - - cdef attr_t* labels = gold.c.labels - cdef int* heads = gold.c.heads - - n_gold = 0 - for i in range(self.n_moves): - if self.c[i].is_valid(stcls.c, self.c[i].label): - is_valid[i] = True - move = self.c[i].move - label = self.c[i].label - if move_costs[move] == 9000: - move_costs[move] = move_cost_funcs[move](stcls, &gold.c) - costs[i] = move_costs[move] + label_cost_funcs[move](stcls, &gold.c, label) - n_gold += costs[i] <= 0 - else: - is_valid[i] = False - costs[i] = 9000 - if n_gold < 1: - # Check projectivity --- leading cause - if is_nonproj_tree(gold.heads): - raise ValueError(Errors.E020) - else: - failure_state = stcls.print_state(gold.words) - raise ValueError(Errors.E021.format(n_actions=self.n_moves, - state=failure_state)) - - def get_beam_annot(self, Beam beam): - length = (beam.at(0)).length - heads = [{} for _ in range(length)] - deps = [{} for _ in range(length)] - probs = beam.probs - for i in range(beam.size): - state = beam.at(i) - self.finalize_state(state) - if state.is_final(): - prob = probs[i] - for j in range(state.length): - head = j + state._sent[j].head - dep = state._sent[j].dep - heads[j].setdefault(head, 0.0) - heads[j][head] += prob - deps[j].setdefault(dep, 0.0) - deps[j][dep] += prob - return heads, deps diff --git a/spacy/syntax/ner.pxd b/spacy/syntax/ner.pxd deleted file mode 100644 index 647f98fc0..000000000 --- a/spacy/syntax/ner.pxd +++ /dev/null @@ -1,8 +0,0 @@ -from .transition_system cimport TransitionSystem -from .transition_system cimport Transition -from ..gold cimport GoldParseC -from ..typedefs cimport attr_t - - -cdef class BiluoPushDown(TransitionSystem): - pass diff --git a/spacy/syntax/nn_parser.pxd b/spacy/syntax/nn_parser.pxd deleted file mode 100644 index 707c9654c..000000000 --- a/spacy/syntax/nn_parser.pxd +++ /dev/null @@ -1,24 +0,0 @@ -from thinc.typedefs cimport atom_t - -from .stateclass cimport StateClass -from .arc_eager cimport TransitionSystem -from ..vocab cimport Vocab -from ..tokens.doc cimport Doc -from ..structs cimport TokenC -from 
._state cimport StateC -from ._parser_model cimport WeightsC, ActivationsC, SizesC - - -cdef class Parser: - cdef readonly Vocab vocab - cdef public object model - cdef public object _rehearsal_model - cdef readonly TransitionSystem moves - cdef readonly object cfg - cdef public object _multitasks - - cdef void _parseC(self, StateC** states, - WeightsC weights, SizesC sizes) nogil - - cdef void c_transition_batch(self, StateC** states, const float* scores, - int nr_class, int batch_size) nogil diff --git a/spacy/syntax/nn_parser.pyx b/spacy/syntax/nn_parser.pyx deleted file mode 100644 index 145c382a5..000000000 --- a/spacy/syntax/nn_parser.pyx +++ /dev/null @@ -1,719 +0,0 @@ -# cython: infer_types=True -# cython: cdivision=True -# cython: boundscheck=False -# coding: utf-8 -from __future__ import unicode_literals, print_function - -from collections import OrderedDict -import numpy -cimport cython.parallel -import numpy.random -cimport numpy as np -from itertools import islice -from cpython.ref cimport PyObject, Py_XDECREF -from cpython.exc cimport PyErr_CheckSignals, PyErr_SetFromErrno -from libc.math cimport exp -from libcpp.vector cimport vector -from libc.string cimport memset, memcpy -from libc.stdlib cimport calloc, free -from cymem.cymem cimport Pool -from thinc.typedefs cimport weight_t, class_t, hash_t -from thinc.extra.search cimport Beam -from thinc.api import chain, clone -from thinc.v2v import Model, Maxout, Affine -from thinc.misc import LayerNorm -from thinc.neural.ops import NumpyOps, CupyOps -from thinc.neural.util import get_array_module -from thinc.linalg cimport Vec, VecVec -import srsly -import warnings - -from ._parser_model cimport alloc_activations, free_activations -from ._parser_model cimport predict_states, arg_max_if_valid -from ._parser_model cimport WeightsC, ActivationsC, SizesC, cpu_log_loss -from ._parser_model cimport get_c_weights, get_c_sizes -from ._parser_model import ParserModel -from .._ml import zero_init, PrecomputableAffine, Tok2Vec, flatten -from .._ml import link_vectors_to_models, create_default_optimizer -from ..compat import copy_array -from ..tokens.doc cimport Doc -from ..gold cimport GoldParse -from ..errors import Errors, TempErrors, Warnings -from .. import util -from .stateclass cimport StateClass -from ._state cimport StateC -from .transition_system cimport Transition -from . cimport _beam_utils -from . import _beam_utils -from . import nonproj - - -cdef class Parser: - """ - Base class of the DependencyParser and EntityRecognizer. 
- """ - @classmethod - def Model(cls, nr_class, **cfg): - depth = util.env_opt('parser_hidden_depth', cfg.get('hidden_depth', 1)) - subword_features = util.env_opt('subword_features', - cfg.get('subword_features', True)) - conv_depth = util.env_opt('conv_depth', cfg.get('conv_depth', 4)) - conv_window = util.env_opt('conv_window', cfg.get('conv_window', 1)) - t2v_pieces = util.env_opt('cnn_maxout_pieces', cfg.get('cnn_maxout_pieces', 3)) - bilstm_depth = util.env_opt('bilstm_depth', cfg.get('bilstm_depth', 0)) - self_attn_depth = util.env_opt('self_attn_depth', cfg.get('self_attn_depth', 0)) - nr_feature_tokens = cfg.get("nr_feature_tokens", cls.nr_feature) - if depth not in (0, 1): - raise ValueError(TempErrors.T004.format(value=depth)) - parser_maxout_pieces = util.env_opt('parser_maxout_pieces', - cfg.get('maxout_pieces', 2)) - token_vector_width = util.env_opt('token_vector_width', - cfg.get('token_vector_width', 96)) - hidden_width = util.env_opt('hidden_width', cfg.get('hidden_width', 64)) - if depth == 0: - hidden_width = nr_class - parser_maxout_pieces = 1 - embed_size = util.env_opt('embed_size', cfg.get('embed_size', 2000)) - pretrained_vectors = cfg.get('pretrained_vectors', None) - tok2vec = Tok2Vec(token_vector_width, embed_size, - conv_depth=conv_depth, - conv_window=conv_window, - cnn_maxout_pieces=t2v_pieces, - subword_features=subword_features, - pretrained_vectors=pretrained_vectors, - bilstm_depth=bilstm_depth) - tok2vec = chain(tok2vec, flatten) - tok2vec.nO = token_vector_width - lower = PrecomputableAffine(hidden_width, - nF=nr_feature_tokens, nI=token_vector_width, - nP=parser_maxout_pieces) - lower.nP = parser_maxout_pieces - if depth == 1: - with Model.use_device('cpu'): - upper = Affine(nr_class, hidden_width, drop_factor=0.0) - upper.W *= 0 - else: - upper = None - - cfg = { - 'nr_class': nr_class, - 'nr_feature_tokens': nr_feature_tokens, - 'hidden_depth': depth, - 'token_vector_width': token_vector_width, - 'hidden_width': hidden_width, - 'maxout_pieces': parser_maxout_pieces, - 'pretrained_vectors': pretrained_vectors, - 'bilstm_depth': bilstm_depth, - 'self_attn_depth': self_attn_depth, - 'conv_depth': conv_depth, - 'conv_window': conv_window, - 'embed_size': embed_size, - 'cnn_maxout_pieces': t2v_pieces - } - return ParserModel(tok2vec, lower, upper), cfg - - name = 'base_parser' - - def __init__(self, Vocab vocab, moves=True, model=True, **cfg): - """Create a Parser. - - vocab (Vocab): The vocabulary object. Must be shared with documents - to be processed. The value is set to the `.vocab` attribute. - moves (TransitionSystem): Defines how the parse-state is created, - updated and evaluated. The value is set to the .moves attribute - unless True (default), in which case a new instance is created with - `Parser.Moves()`. - model (object): Defines how the parse-state is created, updated and - evaluated. The value is set to the .model attribute. If set to True - (default), a new instance will be created with `Parser.Model()` - in parser.begin_training(), parser.from_disk() or parser.from_bytes(). - **cfg: Arbitrary configuration parameters. 
Set to the `.cfg` attribute - """ - self.vocab = vocab - if moves is True: - self.moves = self.TransitionSystem(self.vocab.strings) - else: - self.moves = moves - if 'beam_width' not in cfg: - cfg['beam_width'] = util.env_opt('beam_width', 1) - if 'beam_density' not in cfg: - cfg['beam_density'] = util.env_opt('beam_density', 0.0) - if 'beam_update_prob' not in cfg: - cfg['beam_update_prob'] = util.env_opt('beam_update_prob', 1.0) - cfg.setdefault('cnn_maxout_pieces', 3) - cfg.setdefault("nr_feature_tokens", self.nr_feature) - self.cfg = cfg - self.model = model - self._multitasks = [] - self._rehearsal_model = None - - @classmethod - def from_nlp(cls, nlp, **cfg): - return cls(nlp.vocab, **cfg) - - def __reduce__(self): - return (Parser, (self.vocab, self.moves, self.model), None, None) - - @property - def move_names(self): - names = [] - for i in range(self.moves.n_moves): - name = self.moves.move_name(self.moves.c[i].move, self.moves.c[i].label) - # Explicitly removing the internal "U-" token used for blocking entities - if name != "U-": - names.append(name) - return names - - nr_feature = 8 - - @property - def labels(self): - class_names = [self.moves.get_class_name(i) for i in range(self.moves.n_moves)] - return class_names - - @property - def tok2vec(self): - '''Return the embedding and convolutional layer of the model.''' - return None if self.model in (None, True, False) else self.model.tok2vec - - @property - def postprocesses(self): - # Available for subclasses, e.g. to deprojectivize - return [] - - def add_label(self, label): - resized = False - for action in self.moves.action_types: - added = self.moves.add_action(action, label) - if added: - resized = True - if resized: - self._resize() - - def _resize(self): - if "nr_class" in self.cfg: - self.cfg["nr_class"] = self.moves.n_moves - if self.model not in (True, False, None): - self.model.resize_output(self.moves.n_moves) - if self._rehearsal_model not in (True, False, None): - self._rehearsal_model.resize_output(self.moves.n_moves) - - def add_multitask_objective(self, target): - # Defined in subclasses, to avoid circular import - raise NotImplementedError - - def init_multitask_objectives(self, get_gold_tuples, pipeline, **cfg): - '''Setup models for secondary objectives, to benefit from multi-task - learning. This method is intended to be overridden by subclasses. - - For instance, the dependency parser can benefit from sharing - an input representation with a label prediction model. These auxiliary - models are discarded after training. - ''' - pass - - def preprocess_gold(self, docs_golds): - for doc, gold in docs_golds: - yield doc, gold - - def use_params(self, params): - # Can't decorate cdef class :(. Workaround. - with self.model.use_params(params): - yield - - def __call__(self, Doc doc, beam_width=None): - """Apply the parser or entity recognizer, setting the annotations onto - the `Doc` object. - - doc (Doc): The document to be processed. - """ - if beam_width is None: - beam_width = self.cfg.get('beam_width', 1) - beam_density = self.cfg.get('beam_density', 0.) - states = self.predict([doc], beam_width=beam_width, - beam_density=beam_density) - self.set_annotations([doc], states, tensors=None) - return doc - - def pipe(self, docs, int batch_size=256, int n_threads=-1, beam_width=None): - """Process a stream of documents. - - stream: The sequence of documents to process. - batch_size (int): Number of documents to accumulate into a working set. - YIELDS (Doc): Documents, in order. 
- """ - if beam_width is None: - beam_width = self.cfg.get('beam_width', 1) - beam_density = self.cfg.get('beam_density', 0.) - cdef Doc doc - for batch in util.minibatch(docs, size=batch_size): - batch_in_order = list(batch) - by_length = sorted(batch_in_order, key=lambda doc: len(doc)) - for subbatch in util.minibatch(by_length, size=max(batch_size//4, 2)): - subbatch = list(subbatch) - parse_states = self.predict(subbatch, beam_width=beam_width, - beam_density=beam_density) - self.set_annotations(subbatch, parse_states, tensors=None) - for doc in batch_in_order: - yield doc - - def require_model(self): - """Raise an error if the component's model is not initialized.""" - if getattr(self, 'model', None) in (None, True, False): - raise ValueError(Errors.E109.format(name=self.name)) - - def predict(self, docs, beam_width=1, beam_density=0.0, drop=0.): - self.require_model() - if isinstance(docs, Doc): - docs = [docs] - if not any(len(doc) for doc in docs): - result = self.moves.init_batch(docs) - self._resize() - return result - if beam_width < 2: - return self.greedy_parse(docs, drop=drop) - else: - return self.beam_parse(docs, beam_width=beam_width, - beam_density=beam_density, drop=drop) - - def greedy_parse(self, docs, drop=0.): - cdef vector[StateC*] states - cdef StateClass state - batch = self.moves.init_batch(docs) - # This is pretty dirty, but the NER can resize itself in init_batch, - # if labels are missing. We therefore have to check whether we need to - # expand our model output. - self._resize() - model = self.model(docs) - weights = get_c_weights(model) - for state in batch: - if not state.is_final(): - states.push_back(state.c) - sizes = get_c_sizes(model, states.size()) - with nogil: - self._parseC(&states[0], - weights, sizes) - return batch - - def beam_parse(self, docs, int beam_width, float drop=0., beam_density=0.): - cdef Beam beam - cdef Doc doc - cdef np.ndarray token_ids - beams = self.moves.init_beams(docs, beam_width, beam_density=beam_density) - # This is pretty dirty, but the NER can resize itself in init_batch, - # if labels are missing. We therefore have to check whether we need to - # expand our model output. - self._resize() - model = self.model(docs) - token_ids = numpy.zeros((len(docs) * beam_width, self.nr_feature), - dtype='i', order='C') - cdef int* c_ids - cdef int nr_feature = self.cfg["nr_feature_tokens"] - cdef int n_states - model = self.model(docs) - todo = [beam for beam in beams if not beam.is_done] - while todo: - token_ids.fill(-1) - c_ids = token_ids.data - n_states = 0 - for beam in todo: - for i in range(beam.size): - state = beam.at(i) - # This way we avoid having to score finalized states - # We do have to take care to keep indexes aligned, though - if not state.is_final(): - state.set_context_tokens(c_ids, nr_feature) - c_ids += nr_feature - n_states += 1 - if n_states == 0: - break - vectors = model.state2vec(token_ids[:n_states]) - scores = model.vec2scores(vectors) - todo = self.transition_beams(todo, scores) - return beams - - cdef void _parseC(self, StateC** states, - WeightsC weights, SizesC sizes) nogil: - cdef int i, j - cdef vector[StateC*] unfinished - cdef ActivationsC activations = alloc_activations(sizes) - while sizes.states >= 1: - predict_states(&activations, - states, &weights, sizes) - # Validate actions, argmax, take action. 
- self.c_transition_batch(states, - activations.scores, sizes.classes, sizes.states) - for i in range(sizes.states): - if not states[i].is_final(): - unfinished.push_back(states[i]) - for i in range(unfinished.size()): - states[i] = unfinished[i] - sizes.states = unfinished.size() - unfinished.clear() - free_activations(&activations) - - def set_annotations(self, docs, states_or_beams, tensors=None): - cdef StateClass state - cdef Beam beam - cdef Doc doc - states = [] - beams = [] - for state_or_beam in states_or_beams: - if isinstance(state_or_beam, StateClass): - states.append(state_or_beam) - else: - beam = state_or_beam - state = StateClass.borrow(beam.at(0)) - states.append(state) - beams.append(beam) - for i, (state, doc) in enumerate(zip(states, docs)): - self.moves.finalize_state(state.c) - for j in range(doc.length): - doc.c[j] = state.c._sent[j] - self.moves.finalize_doc(doc) - for hook in self.postprocesses: - hook(doc) - for beam in beams: - _beam_utils.cleanup_beam(beam) - - def transition_states(self, states, float[:, ::1] scores): - cdef StateClass state - cdef float* c_scores = &scores[0, 0] - cdef vector[StateC*] c_states - for state in states: - c_states.push_back(state.c) - self.c_transition_batch(&c_states[0], c_scores, scores.shape[1], scores.shape[0]) - return [state for state in states if not state.c.is_final()] - - cdef void c_transition_batch(self, StateC** states, const float* scores, - int nr_class, int batch_size) nogil: - # n_moves should not be zero at this point, but make sure to avoid zero-length mem alloc - with gil: - assert self.moves.n_moves > 0 - is_valid = calloc(self.moves.n_moves, sizeof(int)) - cdef int i, guess - cdef Transition action - for i in range(batch_size): - self.moves.set_valid(is_valid, states[i]) - guess = arg_max_if_valid(&scores[i*nr_class], is_valid, nr_class) - if guess == -1: - # This shouldn't happen, but it's hard to raise an error here, - # and we don't want to infinite loop. So, force to end state. - states[i].force_final() - else: - action = self.moves.c[guess] - action.do(states[i], action.label) - states[i].push_hist(guess) - free(is_valid) - - def transition_beams(self, beams, float[:, ::1] scores): - cdef Beam beam - cdef float* c_scores = &scores[0, 0] - for beam in beams: - for i in range(beam.size): - state = beam.at(i) - if not state.is_final(): - self.moves.set_valid(beam.is_valid[i], state) - memcpy(beam.scores[i], c_scores, scores.shape[1] * sizeof(float)) - c_scores += scores.shape[1] - beam.advance(_beam_utils.transition_state, _beam_utils.hash_state, self.moves.c) - beam.check_done(_beam_utils.check_final_state, NULL) - return [b for b in beams if not b.is_done] - - def update(self, docs, golds, drop=0., sgd=None, losses=None): - self.require_model() - if isinstance(docs, Doc) and isinstance(golds, GoldParse): - docs = [docs] - golds = [golds] - if len(docs) != len(golds): - raise ValueError(Errors.E077.format(value='update', n_docs=len(docs), - n_golds=len(golds))) - if losses is None: - losses = {} - losses.setdefault(self.name, 0.) 
- for multitask in self._multitasks: - multitask.update(docs, golds, drop=drop, sgd=sgd) - # The probability we use beam update, instead of falling back to - # a greedy update - beam_update_prob = self.cfg.get('beam_update_prob', 0.5) - if self.cfg.get('beam_width', 1) >= 2 and numpy.random.random() < beam_update_prob: - return self.update_beam(docs, golds, self.cfg.get('beam_width', 1), - drop=drop, sgd=sgd, losses=losses, - beam_density=self.cfg.get('beam_density', 0.001)) - # Chop sequences into lengths of this many transitions, to make the - # batch uniform length. - cut_gold = numpy.random.choice(range(20, 100)) - states, golds, max_steps = self._init_gold_batch(docs, golds, max_length=cut_gold) - states_golds = [(s, g) for (s, g) in zip(states, golds) - if not s.is_final() and g is not None] - - # Prepare the stepwise model, and get the callback for finishing the batch - model, finish_update = self.model.begin_update(docs, drop=drop) - for _ in range(max_steps): - if not states_golds: - break - states, golds = zip(*states_golds) - scores, backprop = model.begin_update(states, drop=drop) - d_scores = self.get_batch_loss(states, golds, scores, losses) - backprop(d_scores, sgd=sgd) - # Follow the predicted action - self.transition_states(states, scores) - states_golds = [eg for eg in states_golds if not eg[0].is_final()] - # Do the backprop - finish_update(golds, sgd=sgd) - return losses - - def rehearse(self, docs, sgd=None, losses=None, **cfg): - """Perform a "rehearsal" update, to prevent catastrophic forgetting.""" - if isinstance(docs, Doc): - docs = [docs] - if losses is None: - losses = {} - for multitask in self._multitasks: - if hasattr(multitask, 'rehearse'): - multitask.rehearse(docs, losses=losses, sgd=sgd) - if self._rehearsal_model is None: - return None - losses.setdefault(self.name, 0.) - - states = self.moves.init_batch(docs) - # This is pretty dirty, but the NER can resize itself in init_batch, - # if labels are missing. We therefore have to check whether we need to - # expand our model output. - self._resize() - # Prepare the stepwise model, and get the callback for finishing the batch - tutor, _ = self._rehearsal_model.begin_update(docs, drop=0.0) - model, finish_update = self.model.begin_update(docs, drop=0.0) - n_scores = 0. - loss = 0. - while states: - targets, _ = tutor.begin_update(states, drop=0.) - guesses, backprop = model.begin_update(states, drop=0.) - d_scores = (guesses - targets) / targets.shape[0] - # If all weights for an output are 0 in the original model, don't - # supervise that output. This allows us to add classes. 
- loss += (d_scores**2).sum() - backprop(d_scores, sgd=sgd) - # Follow the predicted action - self.transition_states(states, guesses) - states = [state for state in states if not state.is_final()] - n_scores += d_scores.size - # Do the backprop - finish_update(docs, sgd=sgd) - losses[self.name] += loss / n_scores - return losses - - def update_beam(self, docs, golds, width, drop=0., sgd=None, losses=None, - beam_density=0.0): - lengths = [len(d) for d in docs] - states = self.moves.init_batch(docs) - for gold in golds: - self.moves.preprocess_gold(gold) - model, finish_update = self.model.begin_update(docs, drop=drop) - states_d_scores, backprops, beams = _beam_utils.update_beam( - self.moves, self.cfg["nr_feature_tokens"], 10000, states, golds, model.state2vec, - model.vec2scores, width, drop=drop, losses=losses, - beam_density=beam_density) - for i, d_scores in enumerate(states_d_scores): - losses[self.name] += (d_scores**2).mean() - ids, bp_vectors, bp_scores = backprops[i] - d_vector = bp_scores(d_scores, sgd=sgd) - if isinstance(model.ops, CupyOps) \ - and not isinstance(ids, model.state2vec.ops.xp.ndarray): - model.backprops.append(( - util.get_async(model.cuda_stream, ids), - util.get_async(model.cuda_stream, d_vector), - bp_vectors)) - else: - model.backprops.append((ids, d_vector, bp_vectors)) - model.make_updates(sgd) - cdef Beam beam - for beam in beams: - _beam_utils.cleanup_beam(beam) - - def _init_gold_batch(self, whole_docs, whole_golds, min_length=5, max_length=500): - """Make a square batch, of length equal to the shortest doc. A long - doc will get multiple states. Let's say we have a doc of length 2*N, - where N is the shortest doc. We'll make two states, one representing - long_doc[:N], and another representing long_doc[N:].""" - cdef: - StateClass state - Transition action - whole_states = self.moves.init_batch(whole_docs) - max_length = max(min_length, min(max_length, min([len(doc) for doc in whole_docs]))) - max_moves = 0 - states = [] - golds = [] - for doc, state, gold in zip(whole_docs, whole_states, whole_golds): - gold = self.moves.preprocess_gold(gold) - if gold is None: - continue - oracle_actions = self.moves.get_oracle_sequence(doc, gold) - start = 0 - while start < len(doc): - state = state.copy() - n_moves = 0 - while state.B(0) < start and not state.is_final(): - action = self.moves.c[oracle_actions.pop(0)] - action.do(state.c, action.label) - state.c.push_hist(action.clas) - n_moves += 1 - has_gold = self.moves.has_gold(gold, start=start, - end=start+max_length) - if not state.is_final() and has_gold: - states.append(state) - golds.append(gold) - max_moves = max(max_moves, n_moves) - start += min(max_length, len(doc)-start) - max_moves = max(max_moves, len(oracle_actions)) - return states, golds, max_moves - - def get_batch_loss(self, states, golds, float[:, ::1] scores, losses): - cdef StateClass state - cdef GoldParse gold - cdef Pool mem = Pool() - cdef int i - - # n_moves should not be zero at this point, but make sure to avoid zero-length mem alloc - assert self.moves.n_moves > 0 - - is_valid = mem.alloc(self.moves.n_moves, sizeof(int)) - costs = mem.alloc(self.moves.n_moves, sizeof(float)) - cdef np.ndarray d_scores = numpy.zeros((len(states), self.moves.n_moves), - dtype='f', order='C') - c_d_scores = d_scores.data - for i, (state, gold) in enumerate(zip(states, golds)): - memset(is_valid, 0, self.moves.n_moves * sizeof(int)) - memset(costs, 0, self.moves.n_moves * sizeof(float)) - self.moves.set_costs(is_valid, costs, state, gold) - for j in 
range(self.moves.n_moves): - if costs[j] <= 0.0 and j in self.model.unseen_classes: - self.model.unseen_classes.remove(j) - cpu_log_loss(c_d_scores, - costs, is_valid, &scores[i, 0], d_scores.shape[1]) - c_d_scores += d_scores.shape[1] - if losses is not None: - losses.setdefault(self.name, 0.) - losses[self.name] += (d_scores**2).sum() - return d_scores - - def create_optimizer(self): - return create_default_optimizer(self.model.ops, - **self.cfg.get('optimizer', {})) - - def begin_training(self, get_gold_tuples, pipeline=None, sgd=None, **cfg): - if len(self.vocab.lookups.get_table("lexeme_norm", {})) == 0: - warnings.warn(Warnings.W033.format(model="parser or NER")) - try: - import spacy_lookups_data - except ImportError: - if self.vocab.lang in ("da", "de", "el", "en", "id", "lb", "pt", - "ru", "sr", "ta", "th"): - warnings.warn(Warnings.W034.format(lang=self.vocab.lang)) - if 'model' in cfg: - self.model = cfg['model'] - if not hasattr(get_gold_tuples, '__call__'): - gold_tuples = get_gold_tuples - get_gold_tuples = lambda: gold_tuples - actions = self.moves.get_actions(gold_parses=get_gold_tuples(), - min_freq=cfg.get('min_action_freq', 30), - learn_tokens=self.cfg.get("learn_tokens", False)) - for action, labels in self.moves.labels.items(): - actions.setdefault(action, {}) - for label, freq in labels.items(): - if label not in actions[action]: - actions[action][label] = freq - self.moves.initialize_actions(actions) - if self.model is True: - cfg.setdefault('min_action_freq', 30) - cfg.setdefault('token_vector_width', 96) - self.model, cfg = self.Model(self.moves.n_moves, **cfg) - if sgd is None: - sgd = self.create_optimizer() - doc_sample = [] - gold_sample = [] - for raw_text, annots_brackets in islice(get_gold_tuples(), 1000): - for annots, brackets in annots_brackets: - ids, words, tags, heads, deps, ents = annots - doc_sample.append(Doc(self.vocab, words=words)) - gold_sample.append(GoldParse(doc_sample[-1], words=words, tags=tags, - heads=heads, deps=deps, entities=ents)) - self.model.begin_training(doc_sample, gold_sample) - if pipeline is not None: - self.init_multitask_objectives(get_gold_tuples, pipeline, sgd=sgd, **cfg) - link_vectors_to_models(self.vocab) - self.cfg.update(cfg) - else: - if sgd is None: - sgd = self.create_optimizer() - self.model.begin_training([]) - return sgd - - def to_disk(self, path, exclude=tuple(), **kwargs): - serializers = { - 'model': lambda p: (self.model.to_disk(p) if self.model is not True else True), - 'vocab': lambda p: self.vocab.to_disk(p), - 'moves': lambda p: self.moves.to_disk(p, exclude=["strings"]), - 'cfg': lambda p: srsly.write_json(p, self.cfg) - } - exclude = util.get_serialization_exclude(serializers, exclude, kwargs) - util.to_disk(path, serializers, exclude) - - def from_disk(self, path, exclude=tuple(), **kwargs): - deserializers = { - 'vocab': lambda p: self.vocab.from_disk(p), - 'moves': lambda p: self.moves.from_disk(p, exclude=["strings"]), - 'cfg': lambda p: self.cfg.update(srsly.read_json(p)), - 'model': lambda p: None - } - exclude = util.get_serialization_exclude(deserializers, exclude, kwargs) - util.from_disk(path, deserializers, exclude) - if 'model' not in exclude: - path = util.ensure_path(path) - if self.model is True: - self.model, cfg = self.Model(**self.cfg) - else: - cfg = {} - with (path / 'model').open('rb') as file_: - bytes_data = file_.read() - try: - self.model.from_bytes(bytes_data) - except AttributeError: - raise ValueError(Errors.E149) - self.cfg.update(cfg) - return self - - def to_bytes(self, 
exclude=tuple(), **kwargs): - serializers = OrderedDict(( - ('model', lambda: (self.model.to_bytes() if self.model is not True else True)), - ('vocab', lambda: self.vocab.to_bytes()), - ('moves', lambda: self.moves.to_bytes(exclude=["strings"])), - ('cfg', lambda: srsly.json_dumps(self.cfg, indent=2, sort_keys=True)) - )) - exclude = util.get_serialization_exclude(serializers, exclude, kwargs) - return util.to_bytes(serializers, exclude) - - def from_bytes(self, bytes_data, exclude=tuple(), **kwargs): - deserializers = OrderedDict(( - ('vocab', lambda b: self.vocab.from_bytes(b)), - ('moves', lambda b: self.moves.from_bytes(b, exclude=["strings"])), - ('cfg', lambda b: self.cfg.update(srsly.json_loads(b))), - ('model', lambda b: None) - )) - exclude = util.get_serialization_exclude(deserializers, exclude, kwargs) - msg = util.from_bytes(bytes_data, deserializers, exclude) - if 'model' not in exclude: - # TODO: Remove this once we don't have to handle previous models - if self.cfg.get('pretrained_dims') and 'pretrained_vectors' not in self.cfg: - self.cfg['pretrained_vectors'] = self.vocab.vectors.name - if self.model is True: - self.model, cfg = self.Model(**self.cfg) - else: - cfg = {} - if 'model' in msg: - try: - self.model.from_bytes(msg['model']) - except AttributeError: - raise ValueError(Errors.E149) - self.cfg.update(cfg) - return self diff --git a/spacy/tests/README.md b/spacy/tests/README.md index 7aa7f6166..833dc9266 100644 --- a/spacy/tests/README.md +++ b/spacy/tests/README.md @@ -17,7 +17,6 @@ Tests for spaCy modules and classes live in their own directories of the same na 5. [Helpers and utilities](#helpers-and-utilities) 6. [Contributing to the tests](#contributing-to-the-tests) - ## Running the tests To show print statements, run the tests with `py.test -s`. To abort after the @@ -39,19 +38,17 @@ py.test spacy/tests/tokenizer/test_exceptions.py::test_tokenizer_handles_emoji # ## Dos and don'ts -To keep the behaviour of the tests consistent and predictable, we try to follow a few basic conventions: - -* **Test names** should follow a pattern of `test_[module]_[tested behaviour]`. For example: `test_tokenizer_keeps_email` or `test_spans_override_sentiment`. -* If you're testing for a bug reported in a specific issue, always create a **regression test**. Regression tests should be named `test_issue[ISSUE NUMBER]` and live in the [`regression`](regression) directory. -* Only use `@pytest.mark.xfail` for tests that **should pass, but currently fail**. To test for desired negative behaviour, use `assert not` in your test. -* Very **extensive tests** that take a long time to run should be marked with `@pytest.mark.slow`. If your slow test is testing important behaviour, consider adding an additional simpler version. -* If tests require **loading the models**, they should be added to the [`spacy-models`](https://github.com/explosion/spacy-models) tests. -* Before requiring the models, always make sure there is no other way to test the particular behaviour. In a lot of cases, it's sufficient to simply create a `Doc` object manually. See the section on [helpers and utility functions](#helpers-and-utilities) for more info on this. -* **Avoid unnecessary imports.** There should never be a need to explicitly import spaCy at the top of a file, and many components are available as [fixtures](#fixtures). You should also avoid wildcard imports (`from module import *`). -* If you're importing from spaCy, **always use absolute imports**. For example: `from spacy.language import Language`. 
-* Don't forget the **unicode declarations** at the top of each file. This way, unicode strings won't have to be prefixed with `u`. -* Try to keep the tests **readable and concise**. Use clear and descriptive variable names (`doc`, `tokens` and `text` are great), keep it short and only test for one behaviour at a time. +To keep the behavior of the tests consistent and predictable, we try to follow a few basic conventions: +- **Test names** should follow a pattern of `test_[module]_[tested behaviour]`. For example: `test_tokenizer_keeps_email` or `test_spans_override_sentiment`. +- If you're testing for a bug reported in a specific issue, always create a **regression test**. Regression tests should be named `test_issue[ISSUE NUMBER]` and live in the [`regression`](regression) directory. +- Only use `@pytest.mark.xfail` for tests that **should pass, but currently fail**. To test for desired negative behavior, use `assert not` in your test. +- Very **extensive tests** that take a long time to run should be marked with `@pytest.mark.slow`. If your slow test is testing important behavior, consider adding an additional simpler version. +- If tests require **loading the models**, they should be added to the [`spacy-models`](https://github.com/explosion/spacy-models) tests. +- Before requiring the models, always make sure there is no other way to test the particular behavior. In a lot of cases, it's sufficient to simply create a `Doc` object manually. See the section on [helpers and utility functions](#helpers-and-utilities) for more info on this. +- **Avoid unnecessary imports.** There should never be a need to explicitly import spaCy at the top of a file, and many components are available as [fixtures](#fixtures). You should also avoid wildcard imports (`from module import *`). +- If you're importing from spaCy, **always use absolute imports**. For example: `from spacy.language import Language`. +- Try to keep the tests **readable and concise**. Use clear and descriptive variable names (`doc`, `tokens` and `text` are great), keep it short and only test for one behavior at a time. ## Parameters @@ -64,7 +61,7 @@ def test_tokenizer_keep_urls(tokenizer, text): assert len(tokens) == 1 ``` -This will run the test once for each `text` value. Even if you're only testing one example, it's usually best to specify it as a parameter. This will later make it easier for others to quickly add additional test cases without having to modify the test. +This will run the test once for each `text` value. Even if you're only testing one example, it's usually best to specify it as a parameter. This will later make it easier for others to quickly add additional test cases without having to modify the test. You can also specify parameters as tuples to test with multiple values per test: @@ -79,8 +76,7 @@ To test for combinations of parameters, you can add several `parametrize` marker @pytest.mark.parametrize('punct', ['.', '!', '?']) ``` -This will run the test with all combinations of the two parameters `text` and `punct`. **Use this feature sparingly**, though, as it can easily cause unneccessary or undesired test bloat. - +This will run the test with all combinations of the two parameters `text` and `punct`. **Use this feature sparingly**, though, as it can easily cause unnecessary or undesired test bloat. 
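As a minimal illustration of the point above (the test name and values here are hypothetical, and the expected splits assume only the default punctuation rules of the basic `tokenizer` fixture), stacking two `parametrize` markers runs every combination of their values:

```python
import pytest

# Each marker contributes one parameter axis and pytest runs the
# cross-product, so this single test function executes 2 x 3 = 6 times.
@pytest.mark.parametrize("text", ["hello", "spaCy"])
@pytest.mark.parametrize("punct", [".", "!", "?"])
def test_tokenizer_splits_trailing_punct(tokenizer, text, punct):
    tokens = tokenizer(text + punct)
    assert tokens[0].text == text
    assert tokens[-1].text == punct
```

Because the number of runs is the product of all axes, adding a third marker with ten values would already produce 60 runs, which is why the guidance above recommends using combinations sparingly.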
 ## Fixtures
 
@@ -88,11 +84,11 @@ Fixtures to create instances of spaCy objects and other components should only b
 
 These are the main fixtures that are currently available:
 
-| Fixture | Description |
-| --- | --- |
-| `tokenizer` | Basic, language-independent tokenizer. Identical to the `xx` language class. |
-| `en_tokenizer`, `de_tokenizer`, ... | Creates an English, German etc. tokenizer. |
-| `en_vocab` | Creates an instance of the English `Vocab`. |
+| Fixture                             | Description                                                                   |
+| ----------------------------------- | ----------------------------------------------------------------------------- |
+| `tokenizer`                         | Basic, language-independent tokenizer. Identical to the `xx` language class.  |
+| `en_tokenizer`, `de_tokenizer`, ... | Creates an English, German etc. tokenizer.                                    |
+| `en_vocab`                          | Creates an instance of the English `Vocab`.                                   |
 
 The fixtures can be used in all tests by simply setting them as an argument, like this:
 
@@ -107,59 +103,32 @@ If all tests in a file require a specific configuration, or use the same complex
 
 Our new test setup comes with a few handy utility functions that can be imported from [`util.py`](util.py).
 
+### Constructing a `Doc` object manually
 
-### Constructing a `Doc` object manually with `get_doc()`
-
-Loading the models is expensive and not necessary if you're not actually testing the model performance. If all you need ia a `Doc` object with annotations like heads, POS tags or the dependency parse, you can use `get_doc()` to construct it manually.
+Loading the models is expensive and not necessary if you're not actually testing the model performance. If all you need is a `Doc` object with annotations like heads, POS tags or the dependency parse, you can construct it manually.
 
 ```python
-def test_doc_token_api_strings(en_tokenizer):
-    text = "Give it back! He pleaded."
+def test_doc_token_api_strings(en_vocab):
+    words = ["Give", "it", "back", "!", "He", "pleaded", "."]
     pos = ['VERB', 'PRON', 'PART', 'PUNCT', 'PRON', 'VERB', 'PUNCT']
-    heads = [0, -1, -2, -3, 1, 0, -1]
+    heads = [0, 0, 0, 0, 5, 5, 5]
     deps = ['ROOT', 'dobj', 'prt', 'punct', 'nsubj', 'ROOT', 'punct']
-    tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], pos=pos, heads=heads, deps=deps)
+    doc = Doc(en_vocab, words=words, pos=pos, heads=heads, deps=deps)
     assert doc[0].text == 'Give'
     assert doc[0].lower_ == 'give'
    assert doc[0].pos_ == 'VERB'
     assert doc[0].dep_ == 'ROOT'
 ```
-You can construct a `Doc` with the following arguments:
-
-| Argument | Description |
-| --- | --- |
-| `vocab` | `Vocab` instance to use. If you're tokenizing before creating a `Doc`, make sure to use the tokenizer's vocab. Otherwise, you can also use the `en_vocab` fixture. **(required)** |
-| `words` | List of words, for example `[t.text for t in tokens]`. **(required)** |
-| `heads` | List of heads as integers. |
-| `pos` | List of POS tags as text values. |
-| `tag` | List of tag names as text values. |
-| `dep` | List of dependencies as text values. |
-| `ents` | List of entity tuples with `start`, `end`, `label` (for example `(0, 2, 'PERSON')`). The `label` will be looked up in `vocab.strings[label]`. 
| - -Here's how to quickly get these values from within spaCy: - -```python -doc = nlp(u'Some text here') -print([token.head.i-token.i for token in doc]) -print([token.tag_ for token in doc]) -print([token.pos_ for token in doc]) -print([token.dep_ for token in doc]) -print([(ent.start, ent.end, ent.label_) for ent in doc.ents]) -``` - -**Note:** There's currently no way of setting the serializer data for the parser without loading the models. If this is relevant to your test, constructing the `Doc` via `get_doc()` won't work. - ### Other utilities -| Name | Description | -| --- | --- | -| `apply_transition_sequence(parser, doc, sequence)` | Perform a series of pre-specified transitions, to put the parser in a desired state. | -| `add_vecs_to_vocab(vocab, vectors)` | Add list of vector tuples (`[("text", [1, 2, 3])]`) to given vocab. All vectors need to have the same length. | -| `get_cosine(vec1, vec2)` | Get cosine for two given vectors. | -| `assert_docs_equal(doc1, doc2)` | Compare two `Doc` objects and `assert` that they're equal. Tests for tokens, tags, dependencies and entities. | +| Name | Description | +| -------------------------------------------------- | ------------------------------------------------------------------------------------------------------------- | +| `apply_transition_sequence(parser, doc, sequence)` | Perform a series of pre-specified transitions, to put the parser in a desired state. | +| `add_vecs_to_vocab(vocab, vectors)` | Add list of vector tuples (`[("text", [1, 2, 3])]`) to given vocab. All vectors need to have the same length. | +| `get_cosine(vec1, vec2)` | Get cosine for two given vectors. | +| `assert_docs_equal(doc1, doc2)` | Compare two `Doc` objects and `assert` that they're equal. Tests for tokens, tags, dependencies and entities. | ## Contributing to the tests diff --git a/spacy/tests/conftest.py b/spacy/tests/conftest.py index dc742ce30..2cbfa5ee2 100644 --- a/spacy/tests/conftest.py +++ b/spacy/tests/conftest.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.util import get_lang_class @@ -17,11 +14,11 @@ def pytest_runtest_setup(item): # recognize the option we're asking about. To avoid this, we need to # pass a default value. We default to False, i.e., we act like all the # options weren't given. 
- return item.config.getoption("--%s" % opt, False) + return item.config.getoption(f"--{opt}", False) for opt in ["slow"]: if opt in item.keywords and not getopt(opt): - pytest.skip("need --%s option to run" % opt) + pytest.skip(f"need --{opt} option to run") # Fixtures for language tokenizers (languages sorted alphabetically) @@ -29,52 +26,57 @@ def pytest_runtest_setup(item): @pytest.fixture(scope="module") def tokenizer(): - return get_lang_class("xx").Defaults.create_tokenizer() + return get_lang_class("xx")().tokenizer @pytest.fixture(scope="session") def ar_tokenizer(): - return get_lang_class("ar").Defaults.create_tokenizer() + return get_lang_class("ar")().tokenizer @pytest.fixture(scope="session") def bn_tokenizer(): - return get_lang_class("bn").Defaults.create_tokenizer() + return get_lang_class("bn")().tokenizer @pytest.fixture(scope="session") def ca_tokenizer(): - return get_lang_class("ca").Defaults.create_tokenizer() + return get_lang_class("ca")().tokenizer @pytest.fixture(scope="session") def cs_tokenizer(): - return get_lang_class("cs").Defaults.create_tokenizer() + return get_lang_class("cs")().tokenizer @pytest.fixture(scope="session") def da_tokenizer(): - return get_lang_class("da").Defaults.create_tokenizer() + return get_lang_class("da")().tokenizer @pytest.fixture(scope="session") def de_tokenizer(): - return get_lang_class("de").Defaults.create_tokenizer() + return get_lang_class("de")().tokenizer + + +@pytest.fixture(scope="session") +def de_vocab(): + return get_lang_class("de")().vocab @pytest.fixture(scope="session") def el_tokenizer(): - return get_lang_class("el").Defaults.create_tokenizer() + return get_lang_class("el")().tokenizer @pytest.fixture(scope="session") def en_tokenizer(): - return get_lang_class("en").Defaults.create_tokenizer() + return get_lang_class("en")().tokenizer @pytest.fixture(scope="session") def en_vocab(): - return get_lang_class("en").Defaults.create_vocab() + return get_lang_class("en")().vocab @pytest.fixture(scope="session") @@ -85,42 +87,42 @@ def en_parser(en_vocab): @pytest.fixture(scope="session") def es_tokenizer(): - return get_lang_class("es").Defaults.create_tokenizer() + return get_lang_class("es")().tokenizer @pytest.fixture(scope="session") def eu_tokenizer(): - return get_lang_class("eu").Defaults.create_tokenizer() + return get_lang_class("eu")().tokenizer @pytest.fixture(scope="session") def fa_tokenizer(): - return get_lang_class("fa").Defaults.create_tokenizer() + return get_lang_class("fa")().tokenizer @pytest.fixture(scope="session") def fi_tokenizer(): - return get_lang_class("fi").Defaults.create_tokenizer() + return get_lang_class("fi")().tokenizer @pytest.fixture(scope="session") def fr_tokenizer(): - return get_lang_class("fr").Defaults.create_tokenizer() + return get_lang_class("fr")().tokenizer @pytest.fixture(scope="session") def ga_tokenizer(): - return get_lang_class("ga").Defaults.create_tokenizer() + return get_lang_class("ga")().tokenizer @pytest.fixture(scope="session") def gu_tokenizer(): - return get_lang_class("gu").Defaults.create_tokenizer() + return get_lang_class("gu")().tokenizer @pytest.fixture(scope="session") def he_tokenizer(): - return get_lang_class("he").Defaults.create_tokenizer() + return get_lang_class("he")().tokenizer @pytest.fixture(scope="session") @@ -130,117 +132,122 @@ def hi_tokenizer(): @pytest.fixture(scope="session") def hr_tokenizer(): - return get_lang_class("hr").Defaults.create_tokenizer() + return get_lang_class("hr")().tokenizer @pytest.fixture def 
hu_tokenizer(): - return get_lang_class("hu").Defaults.create_tokenizer() + return get_lang_class("hu")().tokenizer @pytest.fixture(scope="session") def id_tokenizer(): - return get_lang_class("id").Defaults.create_tokenizer() + return get_lang_class("id")().tokenizer @pytest.fixture(scope="session") def it_tokenizer(): - return get_lang_class("it").Defaults.create_tokenizer() + return get_lang_class("it")().tokenizer @pytest.fixture(scope="session") def ja_tokenizer(): pytest.importorskip("sudachipy") - return get_lang_class("ja").Defaults.create_tokenizer() + return get_lang_class("ja")().tokenizer @pytest.fixture(scope="session") def ko_tokenizer(): pytest.importorskip("natto") - return get_lang_class("ko").Defaults.create_tokenizer() + return get_lang_class("ko")().tokenizer @pytest.fixture(scope="session") def lb_tokenizer(): - return get_lang_class("lb").Defaults.create_tokenizer() + return get_lang_class("lb")().tokenizer @pytest.fixture(scope="session") def lt_tokenizer(): - return get_lang_class("lt").Defaults.create_tokenizer() + return get_lang_class("lt")().tokenizer @pytest.fixture(scope="session") def ml_tokenizer(): - return get_lang_class("ml").Defaults.create_tokenizer() + return get_lang_class("ml")().tokenizer @pytest.fixture(scope="session") def nb_tokenizer(): - return get_lang_class("nb").Defaults.create_tokenizer() + return get_lang_class("nb")().tokenizer @pytest.fixture(scope="session") def ne_tokenizer(): - return get_lang_class("ne").Defaults.create_tokenizer() + return get_lang_class("ne")().tokenizer @pytest.fixture(scope="session") def nl_tokenizer(): - return get_lang_class("nl").Defaults.create_tokenizer() + return get_lang_class("nl")().tokenizer @pytest.fixture(scope="session") def pl_tokenizer(): - return get_lang_class("pl").Defaults.create_tokenizer() + return get_lang_class("pl")().tokenizer @pytest.fixture(scope="session") def pt_tokenizer(): - return get_lang_class("pt").Defaults.create_tokenizer() + return get_lang_class("pt")().tokenizer @pytest.fixture(scope="session") def ro_tokenizer(): - return get_lang_class("ro").Defaults.create_tokenizer() + return get_lang_class("ro")().tokenizer @pytest.fixture(scope="session") def ru_tokenizer(): pytest.importorskip("pymorphy2") - return get_lang_class("ru").Defaults.create_tokenizer() + return get_lang_class("ru")().tokenizer @pytest.fixture def ru_lemmatizer(): pytest.importorskip("pymorphy2") - return get_lang_class("ru").Defaults.create_lemmatizer() + return get_lang_class("ru")().add_pipe("lemmatizer") @pytest.fixture(scope="session") def sa_tokenizer(): - return get_lang_class("sa").Defaults.create_tokenizer() + return get_lang_class("sa")().tokenizer @pytest.fixture(scope="session") def sr_tokenizer(): - return get_lang_class("sr").Defaults.create_tokenizer() + return get_lang_class("sr")().tokenizer @pytest.fixture(scope="session") def sv_tokenizer(): - return get_lang_class("sv").Defaults.create_tokenizer() + return get_lang_class("sv")().tokenizer @pytest.fixture(scope="session") def th_tokenizer(): pytest.importorskip("pythainlp") - return get_lang_class("th").Defaults.create_tokenizer() + return get_lang_class("th")().tokenizer @pytest.fixture(scope="session") def tr_tokenizer(): - return get_lang_class("tr").Defaults.create_tokenizer() + return get_lang_class("tr")().tokenizer + + +@pytest.fixture(scope="session") +def tr_vocab(): + return get_lang_class("tr").Defaults.create_vocab() @pytest.fixture(scope="session") def tr_vocab(): @@ -248,47 +255,67 @@ def tr_vocab(): 
@pytest.fixture(scope="session") def tt_tokenizer(): - return get_lang_class("tt").Defaults.create_tokenizer() + return get_lang_class("tt")().tokenizer @pytest.fixture(scope="session") def uk_tokenizer(): pytest.importorskip("pymorphy2") - pytest.importorskip("pymorphy2.lang") - return get_lang_class("uk").Defaults.create_tokenizer() + return get_lang_class("uk")().tokenizer @pytest.fixture(scope="session") def ur_tokenizer(): - return get_lang_class("ur").Defaults.create_tokenizer() + return get_lang_class("ur")().tokenizer @pytest.fixture(scope="session") def yo_tokenizer(): - return get_lang_class("yo").Defaults.create_tokenizer() + return get_lang_class("yo")().tokenizer @pytest.fixture(scope="session") def zh_tokenizer_char(): - return get_lang_class("zh").Defaults.create_tokenizer( - config={"use_jieba": False, "use_pkuseg": False} - ) + nlp = get_lang_class("zh")() + return nlp.tokenizer @pytest.fixture(scope="session") def zh_tokenizer_jieba(): pytest.importorskip("jieba") - return get_lang_class("zh").Defaults.create_tokenizer() + config = { + "nlp": { + "tokenizer": { + "@tokenizers": "spacy.zh.ChineseTokenizer", + "segmenter": "jieba", + } + } + } + nlp = get_lang_class("zh").from_config(config) + return nlp.tokenizer @pytest.fixture(scope="session") def zh_tokenizer_pkuseg(): - pytest.importorskip("pkuseg") - return get_lang_class("zh").Defaults.create_tokenizer( - config={"pkuseg_model": "default", "use_jieba": False, "use_pkuseg": True} - ) + pytest.importorskip("spacy_pkuseg") + config = { + "nlp": { + "tokenizer": { + "@tokenizers": "spacy.zh.ChineseTokenizer", + "segmenter": "pkuseg", + } + }, + "initialize": { + "tokenizer": { + "pkuseg_model": "web", + } + }, + } + nlp = get_lang_class("zh").from_config(config) + nlp.initialize() + return nlp.tokenizer @pytest.fixture(scope="session") def hy_tokenizer(): - return get_lang_class("hy").Defaults.create_tokenizer() + return get_lang_class("hy")().tokenizer diff --git a/spacy/tests/doc/test_add_entities.py b/spacy/tests/doc/test_add_entities.py index 6c69e699a..fa0206fdd 100644 --- a/spacy/tests/doc/test_add_entities.py +++ b/spacy/tests/doc/test_add_entities.py @@ -1,43 +1,63 @@ -# coding: utf-8 -from __future__ import unicode_literals - +from spacy.pipeline.ner import DEFAULT_NER_MODEL +from spacy.training import Example from spacy.pipeline import EntityRecognizer -from spacy.tokens import Span +from spacy.tokens import Span, Doc +from spacy import registry import pytest -from ..util import get_doc + +def _ner_example(ner): + doc = Doc( + ner.vocab, + words=["Joe", "loves", "visiting", "London", "during", "the", "weekend"], + ) + gold = {"entities": [(0, 3, "PERSON"), (19, 25, "LOC")]} + return Example.from_dict(doc, gold) def test_doc_add_entities_set_ents_iob(en_vocab): text = ["This", "is", "a", "lion"] - doc = get_doc(en_vocab, text) - ner = EntityRecognizer(en_vocab) - ner.begin_training([]) + doc = Doc(en_vocab, words=text) + config = { + "learn_tokens": False, + "min_action_freq": 30, + "update_with_oracle_cut_size": 100, + } + cfg = {"model": DEFAULT_NER_MODEL} + model = registry.resolve(cfg, validate=True)["model"] + ner = EntityRecognizer(en_vocab, model, **config) + ner.initialize(lambda: [_ner_example(ner)]) ner(doc) - assert len(list(doc.ents)) == 0 - assert [w.ent_iob_ for w in doc] == (["O"] * len(doc)) - doc.ents = [(doc.vocab.strings["ANIMAL"], 3, 4)] + doc.ents = [("ANIMAL", 3, 4)] assert [w.ent_iob_ for w in doc] == ["O", "O", "O", "B"] - doc.ents = [(doc.vocab.strings["WORD"], 0, 2)] + doc.ents = 
[("WORD", 0, 2)] assert [w.ent_iob_ for w in doc] == ["B", "I", "O", "O"] def test_ents_reset(en_vocab): + """Ensure that resetting doc.ents does not change anything""" text = ["This", "is", "a", "lion"] - doc = get_doc(en_vocab, text) - ner = EntityRecognizer(en_vocab) - ner.begin_training([]) + doc = Doc(en_vocab, words=text) + config = { + "learn_tokens": False, + "min_action_freq": 30, + "update_with_oracle_cut_size": 100, + } + cfg = {"model": DEFAULT_NER_MODEL} + model = registry.resolve(cfg, validate=True)["model"] + ner = EntityRecognizer(en_vocab, model, **config) + ner.initialize(lambda: [_ner_example(ner)]) ner(doc) - assert [t.ent_iob_ for t in doc] == (["O"] * len(doc)) + orig_iobs = [t.ent_iob_ for t in doc] doc.ents = list(doc.ents) - assert [t.ent_iob_ for t in doc] == (["O"] * len(doc)) + assert [t.ent_iob_ for t in doc] == orig_iobs def test_add_overlapping_entities(en_vocab): text = ["Louisiana", "Office", "of", "Conservation"] - doc = get_doc(en_vocab, text) + doc = Doc(en_vocab, words=text) entity = Span(doc, 0, 4, label=391) doc.ents = [entity] diff --git a/spacy/tests/doc/test_array.py b/spacy/tests/doc/test_array.py index 09a6f9c4b..ef54c581c 100644 --- a/spacy/tests/doc/test_array.py +++ b/spacy/tests/doc/test_array.py @@ -1,11 +1,6 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.tokens import Doc -from spacy.attrs import ORTH, SHAPE, POS, DEP - -from ..util import get_doc +from spacy.attrs import ORTH, SHAPE, POS, DEP, MORPH def test_doc_array_attr_of_token(en_vocab): @@ -38,7 +33,7 @@ def test_doc_scalar_attr_of_token(en_vocab): def test_doc_array_tag(en_vocab): words = ["A", "nice", "sentence", "."] pos = ["DET", "ADJ", "NOUN", "PUNCT"] - doc = get_doc(en_vocab, words=words, pos=pos) + doc = Doc(en_vocab, words=words, pos=pos) assert doc[0].pos != doc[1].pos != doc[2].pos != doc[3].pos feats_array = doc.to_array((ORTH, POS)) assert feats_array[0][1] == doc[0].pos @@ -47,10 +42,24 @@ def test_doc_array_tag(en_vocab): assert feats_array[3][1] == doc[3].pos +def test_doc_array_morph(en_vocab): + words = ["Eat", "blue", "ham"] + morph = ["Feat=V", "Feat=J", "Feat=N"] + doc = Doc(en_vocab, words=words, morphs=morph) + assert morph[0] == str(doc[0].morph) + assert morph[1] == str(doc[1].morph) + assert morph[2] == str(doc[2].morph) + + feats_array = doc.to_array((ORTH, MORPH)) + assert feats_array[0][1] == doc[0].morph.key + assert feats_array[1][1] == doc[1].morph.key + assert feats_array[2][1] == doc[2].morph.key + + def test_doc_array_dep(en_vocab): words = ["A", "nice", "sentence", "."] deps = ["det", "amod", "ROOT", "punct"] - doc = get_doc(en_vocab, words=words, deps=deps) + doc = Doc(en_vocab, words=words, deps=deps) feats_array = doc.to_array((ORTH, DEP)) assert feats_array[0][1] == doc[0].dep assert feats_array[1][1] == doc[1].dep diff --git a/spacy/tests/doc/test_creation.py b/spacy/tests/doc/test_creation.py index 863a7c210..0dc6c4866 100644 --- a/spacy/tests/doc/test_creation.py +++ b/spacy/tests/doc/test_creation.py @@ -1,24 +1,12 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.vocab import Vocab from spacy.tokens import Doc -from spacy.lemmatizer import Lemmatizer -from spacy.lookups import Lookups from spacy import util @pytest.fixture -def lemmatizer(): - lookups = Lookups() - lookups.add_table("lemma_lookup", {"dogs": "dog", "boxen": "box", "mice": "mouse"}) - return Lemmatizer(lookups) - - -@pytest.fixture -def vocab(lemmatizer): - return Vocab(lemmatizer=lemmatizer) +def 
vocab(): + return Vocab() def test_empty_doc(vocab): @@ -33,14 +21,6 @@ def test_single_word(vocab): assert doc.text == "a" -def test_lookup_lemmatization(vocab): - doc = Doc(vocab, words=["dogs", "dogses"]) - assert doc[0].text == "dogs" - assert doc[0].lemma_ == "dog" - assert doc[1].text == "dogses" - assert doc[1].lemma_ == "dogses" - - def test_create_from_words_and_text(vocab): # no whitespace in words words = ["'", "dogs", "'", "run"] diff --git a/spacy/tests/doc/test_doc_api.py b/spacy/tests/doc/test_doc_api.py index 388cd78fe..db8a6d1c4 100644 --- a/spacy/tests/doc/test_doc_api.py +++ b/spacy/tests/doc/test_doc_api.py @@ -1,14 +1,27 @@ -# coding: utf-8 -from __future__ import unicode_literals - - import pytest import numpy from spacy.tokens import Doc, Span from spacy.vocab import Vocab -from spacy.attrs import ENT_TYPE, ENT_IOB, SENT_START, HEAD, DEP +from spacy.lexeme import Lexeme +from spacy.lang.en import English +from spacy.attrs import ENT_TYPE, ENT_IOB, SENT_START, HEAD, DEP, MORPH -from ..util import get_doc + +def test_doc_api_init(en_vocab): + words = ["a", "b", "c", "d"] + heads = [0, 0, 2, 2] + # set sent_start by sent_starts + doc = Doc(en_vocab, words=words, sent_starts=[True, False, True, False]) + assert [t.is_sent_start for t in doc] == [True, False, True, False] + + # set sent_start by heads + doc = Doc(en_vocab, words=words, heads=heads, deps=["dep"] * 4) + assert [t.is_sent_start for t in doc] == [True, False, True, False] + # heads override sent_starts + doc = Doc( + en_vocab, words=words, sent_starts=[True] * 4, heads=heads, deps=["dep"] * 4 + ) + assert [t.is_sent_start for t in doc] == [True, False, True, False] @pytest.mark.parametrize("text", [["one", "two", "three"]]) @@ -108,6 +121,7 @@ def test_doc_api_serialize(en_tokenizer, text): tokens = en_tokenizer(text) tokens[0].lemma_ = "lemma" tokens[0].norm_ = "norm" + tokens.ents = [(tokens.vocab.strings["PRODUCT"], 0, 1)] tokens[0].ent_kb_id_ = "ent_kb_id" new_tokens = Doc(tokens.vocab).from_bytes(tokens.to_bytes()) assert tokens.text == new_tokens.text @@ -138,7 +152,7 @@ def test_doc_api_set_ents(en_tokenizer): assert len(tokens.ents) == 0 tokens.ents = [(tokens.vocab.strings["PRODUCT"], 2, 4)] assert len(list(tokens.ents)) == 1 - assert [t.ent_iob for t in tokens] == [0, 0, 3, 1, 0, 0, 0, 0] + assert [t.ent_iob for t in tokens] == [2, 2, 3, 1, 2, 2, 2, 2] assert tokens.ents[0].label_ == "PRODUCT" assert tokens.ents[0].start == 2 assert tokens.ents[0].end == 4 @@ -146,7 +160,6 @@ def test_doc_api_set_ents(en_tokenizer): def test_doc_api_sents_empty_string(en_tokenizer): doc = en_tokenizer("") - doc.is_parsed = True sents = list(doc.sents) assert len(sents) == 0 @@ -160,7 +173,7 @@ def test_doc_api_runtime_error(en_tokenizer): "", "nummod", "nsubj", "prep", "det", "amod", "pobj", "aux", "neg", "ccomp", "amod", "dobj"] # fmt: on tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, words=[t.text for t in tokens], deps=deps) + doc = Doc(tokens.vocab, words=[t.text for t in tokens], deps=deps) nps = [] for np in doc.noun_chunks: while len(np) > 1 and np[0].dep_ not in ("advmod", "amod", "compound"): @@ -177,34 +190,24 @@ def test_doc_api_runtime_error(en_tokenizer): retokenizer.merge(np, attrs=attrs) -def test_doc_api_right_edge(en_tokenizer): +def test_doc_api_right_edge(en_vocab): """Test for bug occurring from Unshift action, causing incorrect right edge""" # fmt: off - text = "I have proposed to myself, for the sake of such as live under the government of the Romans, to translate those books into 
the Greek tongue." - heads = [2, 1, 0, -1, -1, -3, 15, 1, -2, -1, 1, -3, -1, -1, 1, -2, -1, 1, - -2, -7, 1, -19, 1, -2, -3, 2, 1, -3, -26] + words = [ + "I", "have", "proposed", "to", "myself", ",", "for", "the", "sake", + "of", "such", "as", "live", "under", "the", "government", "of", "the", + "Romans", ",", "to", "translate", "those", "books", "into", "the", + "Greek", "tongue", "." + ] + heads = [2, 2, 2, 2, 3, 2, 21, 8, 6, 8, 11, 8, 11, 12, 15, 13, 15, 18, 16, 12, 21, 2, 23, 21, 21, 27, 27, 24, 2] + deps = ["dep"] * len(heads) # fmt: on - - tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads) + doc = Doc(en_vocab, words=words, heads=heads, deps=deps) assert doc[6].text == "for" subtree = [w.text for w in doc[6].subtree] - assert subtree == [ - "for", - "the", - "sake", - "of", - "such", - "as", - "live", - "under", - "the", - "government", - "of", - "the", - "Romans", - ",", - ] + # fmt: off + assert subtree == ["for", "the", "sake", "of", "such", "as", "live", "under", "the", "government", "of", "the", "Romans", ","] + # fmt: on assert doc[6].right_edge.text == "," @@ -227,16 +230,16 @@ def test_doc_api_similarity_match(): @pytest.mark.parametrize( - "sentence,heads,lca_matrix", + "words,heads,lca_matrix", [ ( - "the lazy dog slept", - [2, 1, 1, 0], + ["the", "lazy", "dog", "slept"], + [2, 2, 3, 3], numpy.array([[0, 2, 2, 3], [2, 1, 2, 3], [2, 2, 2, 3], [3, 3, 3, 3]]), ), ( - "The lazy dog slept. The quick fox jumped", - [2, 1, 1, 0, -1, 2, 1, 1, 0], + ["The", "lazy", "dog", "slept", ".", "The", "quick", "fox", "jumped"], + [2, 2, 3, 3, 3, 7, 7, 8, 8], numpy.array( [ [0, 2, 2, 3, 3, -1, -1, -1, -1], @@ -253,9 +256,8 @@ def test_doc_api_similarity_match(): ), ], ) -def test_lowest_common_ancestor(en_tokenizer, sentence, heads, lca_matrix): - tokens = en_tokenizer(sentence) - doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads) +def test_lowest_common_ancestor(en_vocab, words, heads, lca_matrix): + doc = Doc(en_vocab, words, heads=heads, deps=["dep"] * len(heads)) lca = doc.get_lca_matrix() assert (lca == lca_matrix).all() assert lca[1, 1] == 1 @@ -266,54 +268,351 @@ def test_lowest_common_ancestor(en_tokenizer, sentence, heads, lca_matrix): def test_doc_is_nered(en_vocab): words = ["I", "live", "in", "New", "York"] doc = Doc(en_vocab, words=words) - assert not doc.is_nered + assert not doc.has_annotation("ENT_IOB") doc.ents = [Span(doc, 3, 5, label="GPE")] - assert doc.is_nered + assert doc.has_annotation("ENT_IOB") # Test creating doc from array with unknown values arr = numpy.array([[0, 0], [0, 0], [0, 0], [384, 3], [384, 1]], dtype="uint64") doc = Doc(en_vocab, words=words).from_array([ENT_TYPE, ENT_IOB], arr) - assert doc.is_nered + assert doc.has_annotation("ENT_IOB") # Test serialization new_doc = Doc(en_vocab).from_bytes(doc.to_bytes()) - assert new_doc.is_nered + assert new_doc.has_annotation("ENT_IOB") def test_doc_from_array_sent_starts(en_vocab): + # fmt: off words = ["I", "live", "in", "New", "York", ".", "I", "like", "cats", "."] heads = [0, 0, 0, 0, 0, 0, 6, 6, 6, 6] - # fmt: off - deps = ["ROOT", "dep", "dep", "dep", "dep", "dep", "ROOT", "dep", "dep", "dep", "dep"] + deps = ["ROOT", "dep", "dep", "dep", "dep", "dep", "ROOT", "dep", "dep", "dep"] # fmt: on - doc = Doc(en_vocab, words=words) - for i, (dep, head) in enumerate(zip(deps, heads)): - doc[i].dep_ = dep - doc[i].head = doc[head] - if head == i: - doc[i].is_sent_start = True - doc.is_parsed - + doc = Doc(en_vocab, words=words, heads=heads, deps=deps) 
+ # HEAD overrides SENT_START without warning attrs = [SENT_START, HEAD] arr = doc.to_array(attrs) new_doc = Doc(en_vocab, words=words) - with pytest.raises(ValueError): + new_doc.from_array(attrs, arr) + # no warning using default attrs + attrs = doc._get_array_attrs() + arr = doc.to_array(attrs) + with pytest.warns(None) as record: new_doc.from_array(attrs, arr) - - attrs = [SENT_START, DEP] + assert len(record) == 0 + # only SENT_START uses SENT_START + attrs = [SENT_START] arr = doc.to_array(attrs) new_doc = Doc(en_vocab, words=words) new_doc.from_array(attrs, arr) assert [t.is_sent_start for t in doc] == [t.is_sent_start for t in new_doc] - assert not new_doc.is_parsed - + assert not new_doc.has_annotation("DEP") + # only HEAD uses HEAD attrs = [HEAD, DEP] arr = doc.to_array(attrs) new_doc = Doc(en_vocab, words=words) new_doc.from_array(attrs, arr) assert [t.is_sent_start for t in doc] == [t.is_sent_start for t in new_doc] - assert new_doc.is_parsed + assert new_doc.has_annotation("DEP") + + +def test_doc_from_array_morph(en_vocab): + # fmt: off + words = ["I", "live", "in", "New", "York", "."] + morphs = ["Feat1=A", "Feat1=B", "Feat1=C", "Feat1=A|Feat2=D", "Feat2=E", "Feat3=F"] + # fmt: on + doc = Doc(en_vocab, words=words, morphs=morphs) + attrs = [MORPH] + arr = doc.to_array(attrs) + new_doc = Doc(en_vocab, words=words) + new_doc.from_array(attrs, arr) + assert [str(t.morph) for t in new_doc] == morphs + assert [str(t.morph) for t in doc] == [str(t.morph) for t in new_doc] + + +def test_doc_api_from_docs(en_tokenizer, de_tokenizer): + en_texts = ["Merging the docs is fun.", "", "They don't think alike."] + en_texts_without_empty = [t for t in en_texts if len(t)] + de_text = "Wie war die Frage?" + en_docs = [en_tokenizer(text) for text in en_texts] + docs_idx = en_texts[0].index("docs") + de_doc = de_tokenizer(de_text) + expected = (True, None, None, None) + en_docs[0].user_data[("._.", "is_ambiguous", docs_idx, None)] = expected + assert Doc.from_docs([]) is None + assert de_doc is not Doc.from_docs([de_doc]) + assert str(de_doc) == str(Doc.from_docs([de_doc])) + + with pytest.raises(ValueError): + Doc.from_docs(en_docs + [de_doc]) + + m_doc = Doc.from_docs(en_docs) + assert len(en_texts_without_empty) == len(list(m_doc.sents)) + assert len(str(m_doc)) > len(en_texts[0]) + len(en_texts[1]) + assert str(m_doc) == " ".join(en_texts_without_empty) + p_token = m_doc[len(en_docs[0]) - 1] + assert p_token.text == "." and bool(p_token.whitespace_) + en_docs_tokens = [t for doc in en_docs for t in doc] + assert len(m_doc) == len(en_docs_tokens) + think_idx = len(en_texts[0]) + 1 + en_texts[2].index("think") + assert m_doc[9].idx == think_idx + with pytest.raises(AttributeError): + # not callable, because it was not set via set_extension + m_doc[2]._.is_ambiguous + assert len(m_doc.user_data) == len(en_docs[0].user_data) # but it's there + + m_doc = Doc.from_docs(en_docs, ensure_whitespace=False) + assert len(en_texts_without_empty) == len(list(m_doc.sents)) + assert len(str(m_doc)) == sum(len(t) for t in en_texts) + assert str(m_doc) == "".join(en_texts) + p_token = m_doc[len(en_docs[0]) - 1] + assert p_token.text == "." 
and not bool(p_token.whitespace_) + en_docs_tokens = [t for doc in en_docs for t in doc] + assert len(m_doc) == len(en_docs_tokens) + think_idx = len(en_texts[0]) + 0 + en_texts[2].index("think") + assert m_doc[9].idx == think_idx + + m_doc = Doc.from_docs(en_docs, attrs=["lemma", "length", "pos"]) + assert len(str(m_doc)) > len(en_texts[0]) + len(en_texts[1]) + # space delimiter considered, although spacy attribute was missing + assert str(m_doc) == " ".join(en_texts_without_empty) + p_token = m_doc[len(en_docs[0]) - 1] + assert p_token.text == "." and bool(p_token.whitespace_) + en_docs_tokens = [t for doc in en_docs for t in doc] + assert len(m_doc) == len(en_docs_tokens) + think_idx = len(en_texts[0]) + 1 + en_texts[2].index("think") + assert m_doc[9].idx == think_idx + + +def test_doc_api_from_docs_ents(en_tokenizer): + texts = ["Merging the docs is fun.", "They don't think alike."] + docs = [en_tokenizer(t) for t in texts] + docs[0].ents = () + docs[1].ents = (Span(docs[1], 0, 1, label="foo"),) + doc = Doc.from_docs(docs) + assert len(doc.ents) == 1 def test_doc_lang(en_vocab): doc = Doc(en_vocab, words=["Hello", "world"]) assert doc.lang_ == "en" assert doc.lang == en_vocab.strings["en"] + assert doc[0].lang_ == "en" + assert doc[0].lang == en_vocab.strings["en"] + nlp = English() + doc = nlp("Hello world") + assert doc.lang_ == "en" + assert doc.lang == en_vocab.strings["en"] + assert doc[0].lang_ == "en" + assert doc[0].lang == en_vocab.strings["en"] + + +def test_token_lexeme(en_vocab): + """Test that tokens expose their lexeme.""" + token = Doc(en_vocab, words=["Hello", "world"])[0] + assert isinstance(token.lex, Lexeme) + assert token.lex.text == token.text + assert en_vocab[token.orth] == token.lex + + +def test_has_annotation(en_vocab): + doc = Doc(en_vocab, words=["Hello", "world"]) + attrs = ("TAG", "POS", "MORPH", "LEMMA", "DEP", "HEAD", "ENT_IOB", "ENT_TYPE") + for attr in attrs: + assert not doc.has_annotation(attr) + + doc[0].tag_ = "A" + doc[0].pos_ = "X" + doc[0].set_morph("Feat=Val") + doc[0].lemma_ = "a" + doc[0].dep_ = "dep" + doc[0].head = doc[1] + doc.set_ents([Span(doc, 0, 1, label="HELLO")], default="missing") + + for attr in attrs: + assert doc.has_annotation(attr) + assert not doc.has_annotation(attr, require_complete=True) + + doc[1].tag_ = "A" + doc[1].pos_ = "X" + doc[1].set_morph("") + doc[1].lemma_ = "a" + doc[1].dep_ = "dep" + doc.ents = [Span(doc, 0, 2, label="HELLO")] + + for attr in attrs: + assert doc.has_annotation(attr) + assert doc.has_annotation(attr, require_complete=True) + + +def test_is_flags_deprecated(en_tokenizer): + doc = en_tokenizer("test") + with pytest.deprecated_call(): + doc.is_tagged + with pytest.deprecated_call(): + doc.is_parsed + with pytest.deprecated_call(): + doc.is_nered + with pytest.deprecated_call(): + doc.is_sentenced + + +def test_doc_set_ents(en_tokenizer): + # set ents + doc = en_tokenizer("a b c d e") + doc.set_ents([Span(doc, 0, 1, 10), Span(doc, 1, 3, 11)]) + assert [t.ent_iob for t in doc] == [3, 3, 1, 2, 2] + assert [t.ent_type for t in doc] == [10, 11, 11, 0, 0] + + # add ents, invalid IOB repaired + doc = en_tokenizer("a b c d e") + doc.set_ents([Span(doc, 0, 1, 10), Span(doc, 1, 3, 11)]) + doc.set_ents([Span(doc, 0, 2, 12)], default="unmodified") + assert [t.ent_iob for t in doc] == [3, 1, 3, 2, 2] + assert [t.ent_type for t in doc] == [12, 12, 11, 0, 0] + + # missing ents + doc = en_tokenizer("a b c d e") + doc.set_ents([Span(doc, 0, 1, 10), Span(doc, 1, 3, 11)], missing=[doc[4:5]]) + assert [t.ent_iob for 
t in doc] == [3, 3, 1, 2, 0] + assert [t.ent_type for t in doc] == [10, 11, 11, 0, 0] + + # outside ents + doc = en_tokenizer("a b c d e") + doc.set_ents( + [Span(doc, 0, 1, 10), Span(doc, 1, 3, 11)], + outside=[doc[4:5]], + default="missing", + ) + assert [t.ent_iob for t in doc] == [3, 3, 1, 0, 2] + assert [t.ent_type for t in doc] == [10, 11, 11, 0, 0] + + # blocked ents + doc = en_tokenizer("a b c d e") + doc.set_ents([], blocked=[doc[1:2], doc[3:5]], default="unmodified") + assert [t.ent_iob for t in doc] == [0, 3, 0, 3, 3] + assert [t.ent_type for t in doc] == [0, 0, 0, 0, 0] + assert doc.ents == tuple() + + # invalid IOB repaired after blocked + doc.ents = [Span(doc, 3, 5, "ENT")] + assert [t.ent_iob for t in doc] == [2, 2, 2, 3, 1] + doc.set_ents([], blocked=[doc[3:4]], default="unmodified") + assert [t.ent_iob for t in doc] == [2, 2, 2, 3, 3] + + # all types + doc = en_tokenizer("a b c d e") + doc.set_ents( + [Span(doc, 0, 1, 10)], + blocked=[doc[1:2]], + missing=[doc[2:3]], + outside=[doc[3:4]], + default="unmodified", + ) + assert [t.ent_iob for t in doc] == [3, 3, 0, 2, 0] + assert [t.ent_type for t in doc] == [10, 0, 0, 0, 0] + + doc = en_tokenizer("a b c d e") + # single span instead of a list + with pytest.raises(ValueError): + doc.set_ents([], missing=doc[1:2]) + # invalid default mode + with pytest.raises(ValueError): + doc.set_ents([], missing=[doc[1:2]], default="none") + # conflicting/overlapping specifications + with pytest.raises(ValueError): + doc.set_ents([], missing=[doc[1:2]], outside=[doc[1:2]]) + + +def test_doc_ents_setter(): + """Test that both strings and integers can be used to set entities in + tuple format via doc.ents.""" + words = ["a", "b", "c", "d", "e"] + doc = Doc(Vocab(), words=words) + doc.ents = [("HELLO", 0, 2), (doc.vocab.strings.add("WORLD"), 3, 5)] + assert [e.label_ for e in doc.ents] == ["HELLO", "WORLD"] + vocab = Vocab() + ents = [("HELLO", 0, 2), (vocab.strings.add("WORLD"), 3, 5)] + ents = ["B-HELLO", "I-HELLO", "O", "B-WORLD", "I-WORLD"] + doc = Doc(vocab, words=words, ents=ents) + assert [e.label_ for e in doc.ents] == ["HELLO", "WORLD"] + + +def test_doc_morph_setter(en_tokenizer, de_tokenizer): + doc1 = en_tokenizer("a b") + doc1b = en_tokenizer("c d") + doc2 = de_tokenizer("a b") + + # unset values can be copied + doc1[0].morph = doc1[1].morph + assert doc1[0].morph.key == 0 + assert doc1[1].morph.key == 0 + + # morph values from the same vocab can be copied + doc1[0].set_morph("Feat=Val") + doc1[1].morph = doc1[0].morph + assert doc1[0].morph == doc1[1].morph + + # ... 
also across docs + doc1b[0].morph = doc1[0].morph + assert doc1[0].morph == doc1b[0].morph + + doc2[0].set_morph("Feat2=Val2") + + # the morph value must come from the same vocab + with pytest.raises(ValueError): + doc1[0].morph = doc2[0].morph + + +def test_doc_init_iob(): + """Test ents validation/normalization in Doc.__init__""" + words = ["a", "b", "c", "d", "e"] + ents = ["O"] * len(words) + doc = Doc(Vocab(), words=words, ents=ents) + assert doc.ents == () + + ents = ["B-PERSON", "I-PERSON", "O", "I-PERSON", "I-PERSON"] + doc = Doc(Vocab(), words=words, ents=ents) + assert len(doc.ents) == 2 + + ents = ["B-PERSON", "I-PERSON", "O", "I-PERSON", "I-GPE"] + doc = Doc(Vocab(), words=words, ents=ents) + assert len(doc.ents) == 3 + + # None is missing + ents = ["B-PERSON", "I-PERSON", "O", None, "I-GPE"] + doc = Doc(Vocab(), words=words, ents=ents) + assert len(doc.ents) == 2 + + # empty tag is missing + ents = ["", "B-PERSON", "O", "B-PERSON", "I-PERSON"] + doc = Doc(Vocab(), words=words, ents=ents) + assert len(doc.ents) == 2 + + # invalid IOB + ents = ["Q-PERSON", "I-PERSON", "O", "I-PERSON", "I-GPE"] + with pytest.raises(ValueError): + doc = Doc(Vocab(), words=words, ents=ents) + + # no dash + ents = ["OPERSON", "I-PERSON", "O", "I-PERSON", "I-GPE"] + with pytest.raises(ValueError): + doc = Doc(Vocab(), words=words, ents=ents) + + # no ent type + ents = ["O", "B-", "O", "I-PERSON", "I-GPE"] + with pytest.raises(ValueError): + doc = Doc(Vocab(), words=words, ents=ents) + + # not strings or None + ents = [0, "B-", "O", "I-PERSON", "I-GPE"] + with pytest.raises(ValueError): + doc = Doc(Vocab(), words=words, ents=ents) + + +def test_doc_set_ents_invalid_spans(en_tokenizer): + doc = en_tokenizer("Some text about Colombia and the Czech Republic") + spans = [Span(doc, 3, 4, label="GPE"), Span(doc, 6, 8, label="GPE")] + with doc.retokenize() as retokenizer: + for span in spans: + retokenizer.merge(span) + with pytest.raises(IndexError): + doc.ents = spans diff --git a/spacy/tests/doc/test_morphanalysis.py b/spacy/tests/doc/test_morphanalysis.py index 5d570af53..918d4acdc 100644 --- a/spacy/tests/doc/test_morphanalysis.py +++ b/spacy/tests/doc/test_morphanalysis.py @@ -1,33 +1,98 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest @pytest.fixture def i_has(en_tokenizer): doc = en_tokenizer("I has") - doc[0].tag_ = "PRP" - doc[1].tag_ = "VBZ" + doc[0].set_morph({"PronType": "prs"}) + doc[1].set_morph( + { + "VerbForm": "fin", + "Tense": "pres", + "Number": "sing", + "Person": "three", + } + ) + return doc -def test_token_morph_id(i_has): - assert i_has[0].morph.id - assert i_has[1].morph.id != 0 - assert i_has[0].morph.id != i_has[1].morph.id +def test_token_morph_eq(i_has): + assert i_has[0].morph is not i_has[0].morph + assert i_has[0].morph == i_has[0].morph + assert i_has[0].morph != i_has[1].morph + + +def test_token_morph_key(i_has): + assert i_has[0].morph.key != 0 + assert i_has[1].morph.key != 0 + assert i_has[0].morph.key == i_has[0].morph.key + assert i_has[0].morph.key != i_has[1].morph.key def test_morph_props(i_has): - assert i_has[0].morph.pron_type == i_has.vocab.strings["PronType_prs"] - assert i_has[0].morph.pron_type_ == "PronType_prs" - assert i_has[1].morph.pron_type == 0 + assert i_has[0].morph.get("PronType") == ["prs"] + assert i_has[1].morph.get("PronType") == [] def test_morph_iter(i_has): - assert list(i_has[0].morph) == ["PronType_prs"] - assert list(i_has[1].morph) == ["Number_sing", "Person_three", "VerbForm_fin"] + assert 
set(i_has[0].morph) == set(["PronType=prs"]) + assert set(i_has[1].morph) == set( + ["Number=sing", "Person=three", "Tense=pres", "VerbForm=fin"] + ) def test_morph_get(i_has): - assert i_has[0].morph.get("pron_type") == "PronType_prs" + assert i_has[0].morph.get("PronType") == ["prs"] + + +def test_morph_set(i_has): + assert i_has[0].morph.get("PronType") == ["prs"] + # set by string + i_has[0].set_morph("PronType=unk") + assert i_has[0].morph.get("PronType") == ["unk"] + # set by string, fields are alphabetized + i_has[0].set_morph("PronType=123|NounType=unk") + assert str(i_has[0].morph) == "NounType=unk|PronType=123" + # set by dict + i_has[0].set_morph({"AType": "123", "BType": "unk"}) + assert str(i_has[0].morph) == "AType=123|BType=unk" + # set by string with multiple values, fields and values are alphabetized + i_has[0].set_morph("BType=c|AType=b,a") + assert str(i_has[0].morph) == "AType=a,b|BType=c" + # set by dict with multiple values, fields and values are alphabetized + i_has[0].set_morph({"AType": "b,a", "BType": "c"}) + assert str(i_has[0].morph) == "AType=a,b|BType=c" + + +def test_morph_str(i_has): + assert str(i_has[0].morph) == "PronType=prs" + assert str(i_has[1].morph) == "Number=sing|Person=three|Tense=pres|VerbForm=fin" + + +def test_morph_property(tokenizer): + doc = tokenizer("a dog") + + # set through token.morph_ + doc[0].set_morph("PronType=prs") + assert str(doc[0].morph) == "PronType=prs" + assert doc.to_array(["MORPH"])[0] != 0 + + # unset with token.morph + doc[0].set_morph(None) + assert doc.to_array(["MORPH"])[0] == 0 + + # empty morph is equivalent to "_" + doc[0].set_morph("") + assert str(doc[0].morph) == "" + assert doc.to_array(["MORPH"])[0] == tokenizer.vocab.strings["_"] + + # "_" morph is also equivalent to empty morph + doc[0].set_morph("_") + assert str(doc[0].morph) == "" + assert doc.to_array(["MORPH"])[0] == tokenizer.vocab.strings["_"] + + # set through existing hash with token.morph + tokenizer.vocab.strings.add("Feat=Val") + doc[0].set_morph(tokenizer.vocab.strings.add("Feat=Val")) + assert str(doc[0].morph) == "Feat=Val" diff --git a/spacy/tests/doc/test_pickle_doc.py b/spacy/tests/doc/test_pickle_doc.py index 2b6970a38..28cb66714 100644 --- a/spacy/tests/doc/test_pickle_doc.py +++ b/spacy/tests/doc/test_pickle_doc.py @@ -1,8 +1,5 @@ -# coding: utf-8 -from __future__ import unicode_literals - from spacy.language import Language -from spacy.compat import pickle, unicode_ +from spacy.compat import pickle def test_pickle_single_doc(): @@ -16,9 +13,9 @@ def test_pickle_single_doc(): def test_list_of_docs_pickles_efficiently(): nlp = Language() for i in range(10000): - _ = nlp.vocab[unicode_(i)] # noqa: F841 + _ = nlp.vocab[str(i)] # noqa: F841 one_pickled = pickle.dumps(nlp("0"), -1) - docs = list(nlp.pipe(unicode_(i) for i in range(100))) + docs = list(nlp.pipe(str(i) for i in range(100))) many_pickled = pickle.dumps(docs, -1) assert len(many_pickled) < (len(one_pickled) * 2) many_unpickled = pickle.loads(many_pickled) diff --git a/spacy/tests/doc/test_retokenize_merge.py b/spacy/tests/doc/test_retokenize_merge.py index 636b7bb14..b483255c8 100644 --- a/spacy/tests/doc/test_retokenize_merge.py +++ b/spacy/tests/doc/test_retokenize_merge.py @@ -1,17 +1,17 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.attrs import LEMMA from spacy.vocab import Vocab from spacy.tokens import Doc, Token -from ..util import get_doc - def test_doc_retokenize_merge(en_tokenizer): text = "WKRO played songs by the beach boys 
all night" - attrs = {"tag": "NAMED", "lemma": "LEMMA", "ent_type": "TYPE"} + attrs = { + "tag": "NAMED", + "lemma": "LEMMA", + "ent_type": "TYPE", + "morph": "Number=Plur", + } doc = en_tokenizer(text) assert len(doc) == 9 with doc.retokenize() as retokenizer: @@ -21,9 +21,11 @@ def test_doc_retokenize_merge(en_tokenizer): assert doc[4].text == "the beach boys" assert doc[4].text_with_ws == "the beach boys " assert doc[4].tag_ == "NAMED" + assert str(doc[4].morph) == "Number=Plur" assert doc[5].text == "all night" assert doc[5].text_with_ws == "all night" assert doc[5].tag_ == "NAMED" + assert str(doc[5].morph) == "Number=Plur" def test_doc_retokenize_merge_children(en_tokenizer): @@ -84,9 +86,9 @@ def test_doc_retokenize_lex_attrs(en_tokenizer): def test_doc_retokenize_spans_merge_tokens(en_tokenizer): text = "Los Angeles start." - heads = [1, 1, 0, -1] + heads = [1, 2, 2, 2] tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads) + doc = Doc(tokens.vocab, words=[t.text for t in tokens], heads=heads) assert len(doc) == 4 assert doc[0].head.text == "Angeles" assert doc[1].head.text == "start" @@ -99,17 +101,12 @@ def test_doc_retokenize_spans_merge_tokens(en_tokenizer): assert doc[0].ent_type_ == "GPE" -def test_doc_retokenize_spans_merge_tokens_default_attrs(en_tokenizer): - text = "The players start." - heads = [1, 1, 0, -1] - tokens = en_tokenizer(text) - doc = get_doc( - tokens.vocab, - words=[t.text for t in tokens], - tags=["DT", "NN", "VBZ", "."], - pos=["DET", "NOUN", "VERB", "PUNCT"], - heads=heads, - ) +def test_doc_retokenize_spans_merge_tokens_default_attrs(en_vocab): + words = ["The", "players", "start", "."] + heads = [1, 2, 2, 2] + tags = ["DT", "NN", "VBZ", "."] + pos = ["DET", "NOUN", "VERB", "PUNCT"] + doc = Doc(en_vocab, words=words, tags=tags, pos=pos, heads=heads) assert len(doc) == 4 assert doc[0].text == "The" assert doc[0].tag_ == "DT" @@ -120,14 +117,7 @@ def test_doc_retokenize_spans_merge_tokens_default_attrs(en_tokenizer): assert doc[0].text == "The players" assert doc[0].tag_ == "NN" assert doc[0].pos_ == "NOUN" - assert doc[0].lemma_ == "The players" - doc = get_doc( - tokens.vocab, - words=[t.text for t in tokens], - tags=["DT", "NN", "VBZ", "."], - pos=["DET", "NOUN", "VERB", "PUNCT"], - heads=heads, - ) + doc = Doc(en_vocab, words=words, tags=tags, pos=pos, heads=heads) assert len(doc) == 4 assert doc[0].text == "The" assert doc[0].tag_ == "DT" @@ -139,18 +129,15 @@ def test_doc_retokenize_spans_merge_tokens_default_attrs(en_tokenizer): assert doc[0].text == "The players" assert doc[0].tag_ == "NN" assert doc[0].pos_ == "NOUN" - assert doc[0].lemma_ == "The players" assert doc[1].text == "start ." assert doc[1].tag_ == "VBZ" assert doc[1].pos_ == "VERB" - assert doc[1].lemma_ == "start ." -def test_doc_retokenize_spans_merge_heads(en_tokenizer): - text = "I found a pilates class near work." 
- heads = [1, 0, 2, 1, -3, -1, -1, -6] - tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads) +def test_doc_retokenize_spans_merge_heads(en_vocab): + words = ["I", "found", "a", "pilates", "class", "near", "work", "."] + heads = [1, 1, 4, 6, 1, 4, 5, 1] + doc = Doc(en_vocab, words=words, heads=heads) assert len(doc) == 8 with doc.retokenize() as retokenizer: attrs = {"tag": doc[4].tag_, "lemma": "pilates class", "ent_type": "O"} @@ -181,9 +168,9 @@ def test_doc_retokenize_spans_merge_non_disjoint(en_tokenizer): def test_doc_retokenize_span_np_merges(en_tokenizer): text = "displaCy is a parse tool built with Javascript" - heads = [1, 0, 2, 1, -3, -1, -1, -1] + heads = [1, 1, 4, 4, 1, 4, 5, 6] tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads) + doc = Doc(tokens.vocab, words=[t.text for t in tokens], heads=heads) assert doc[4].head.i == 1 with doc.retokenize() as retokenizer: attrs = {"tag": "NP", "lemma": "tool", "ent_type": "O"} @@ -191,18 +178,18 @@ def test_doc_retokenize_span_np_merges(en_tokenizer): assert doc[2].head.i == 1 text = "displaCy is a lightweight and modern dependency parse tree visualization tool built with CSS3 and JavaScript." - heads = [1, 0, 8, 3, -1, -2, 4, 3, 1, 1, -9, -1, -1, -1, -1, -2, -15] + heads = [1, 1, 10, 7, 3, 3, 7, 10, 9, 10, 1, 10, 11, 12, 13, 13, 1] tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads) + doc = Doc(tokens.vocab, words=[t.text for t in tokens], heads=heads) with doc.retokenize() as retokenizer: for ent in doc.ents: attrs = {"tag": ent.label_, "lemma": ent.lemma_, "ent_type": ent.label_} retokenizer.merge(ent, attrs=attrs) text = "One test with entities like New York City so the ents list is not void" - heads = [1, 11, -1, -1, -1, 1, 1, -3, 4, 2, 1, 1, 0, -1, -2] + heads = [1, 1, 1, 2, 3, 6, 7, 4, 12, 11, 11, 12, 1, 12, 12] tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads) + doc = Doc(tokens.vocab, words=[t.text for t in tokens], heads=heads) with doc.retokenize() as retokenizer: for ent in doc.ents: retokenizer.merge(ent) @@ -211,12 +198,18 @@ def test_doc_retokenize_span_np_merges(en_tokenizer): def test_doc_retokenize_spans_entity_merge(en_tokenizer): # fmt: off text = "Stewart Lee is a stand up comedian who lives in England and loves Joe Pasquale.\n" - heads = [1, 1, 0, 1, 2, -1, -4, 1, -2, -1, -1, -3, -10, 1, -2, -13, -1] + heads = [1, 2, 2, 4, 6, 4, 2, 8, 6, 8, 9, 8, 8, 14, 12, 2, 15] tags = ["NNP", "NNP", "VBZ", "DT", "VB", "RP", "NN", "WP", "VBZ", "IN", "NNP", "CC", "VBZ", "NNP", "NNP", ".", "SP"] - ents = [(0, 2, "PERSON"), (10, 11, "GPE"), (13, 15, "PERSON")] + ents = [("PERSON", 0, 2), ("GPE", 10, 11), ("PERSON", 13, 15)] + ents = ["O"] * len(heads) + ents[0] = "B-PERSON" + ents[1] = "I-PERSON" + ents[10] = "B-GPE" + ents[13] = "B-PERSON" + ents[14] = "I-PERSON" # fmt: on tokens = en_tokenizer(text) - doc = get_doc( + doc = Doc( tokens.vocab, words=[t.text for t in tokens], heads=heads, tags=tags, ents=ents ) assert len(doc) == 17 @@ -281,13 +274,17 @@ def test_doc_retokenize_spans_entity_merge_iob(en_vocab): # if there is a parse, span.root provides default values words = ["a", "b", "c", "d", "e", "f", "g", "h", "i"] - heads = [0, -1, 1, -3, -4, -5, -1, -7, -8] - ents = [(3, 5, "ent-de"), (5, 7, "ent-fg")] + heads = [0, 0, 3, 0, 0, 0, 5, 0, 0] + ents = ["O"] * len(words) + ents[3] = "B-ent-de" + ents[4] = "I-ent-de" + ents[5] = 
"B-ent-fg" + ents[6] = "I-ent-fg" deps = ["dep"] * len(words) en_vocab.strings.add("ent-de") en_vocab.strings.add("ent-fg") en_vocab.strings.add("dep") - doc = get_doc(en_vocab, words=words, heads=heads, deps=deps, ents=ents) + doc = Doc(en_vocab, words=words, heads=heads, deps=deps, ents=ents) assert doc[2:4].root == doc[3] # root of 'c d' is d assert doc[4:6].root == doc[4] # root is 'e f' is e with doc.retokenize() as retokenizer: @@ -304,10 +301,14 @@ def test_doc_retokenize_spans_entity_merge_iob(en_vocab): # check that B is preserved if span[start] is B words = ["a", "b", "c", "d", "e", "f", "g", "h", "i"] - heads = [0, -1, 1, 1, -4, -5, -1, -7, -8] - ents = [(3, 5, "ent-de"), (5, 7, "ent-de")] + heads = [0, 0, 3, 4, 0, 0, 5, 0, 0] + ents = ["O"] * len(words) + ents[3] = "B-ent-de" + ents[4] = "I-ent-de" + ents[5] = "B-ent-de" + ents[6] = "I-ent-de" deps = ["dep"] * len(words) - doc = get_doc(en_vocab, words=words, heads=heads, deps=deps, ents=ents) + doc = Doc(en_vocab, words=words, heads=heads, deps=deps, ents=ents) with doc.retokenize() as retokenizer: retokenizer.merge(doc[3:5]) retokenizer.merge(doc[5:7]) @@ -321,13 +322,13 @@ def test_doc_retokenize_spans_entity_merge_iob(en_vocab): def test_doc_retokenize_spans_sentence_update_after_merge(en_tokenizer): # fmt: off text = "Stewart Lee is a stand up comedian. He lives in England and loves Joe Pasquale." - heads = [1, 1, 0, 1, 2, -1, -4, -5, 1, 0, -1, -1, -3, -4, 1, -2, -7] + heads = [1, 2, 2, 4, 2, 4, 4, 2, 9, 9, 9, 10, 9, 9, 15, 13, 9] deps = ['compound', 'nsubj', 'ROOT', 'det', 'amod', 'prt', 'attr', 'punct', 'nsubj', 'ROOT', 'prep', 'pobj', 'cc', 'conj', 'compound', 'dobj', 'punct'] # fmt: on tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps) + doc = Doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps) sent1, sent2 = list(doc.sents) init_len = len(sent1) init_len2 = len(sent2) @@ -335,6 +336,7 @@ def test_doc_retokenize_spans_sentence_update_after_merge(en_tokenizer): attrs = {"lemma": "none", "ent_type": "none"} retokenizer.merge(doc[0:2], attrs=attrs) retokenizer.merge(doc[-2:], attrs=attrs) + sent1, sent2 = list(doc.sents) assert len(sent1) == init_len - 1 assert len(sent2) == init_len2 - 1 @@ -342,13 +344,13 @@ def test_doc_retokenize_spans_sentence_update_after_merge(en_tokenizer): def test_doc_retokenize_spans_subtree_size_check(en_tokenizer): # fmt: off text = "Stewart Lee is a stand up comedian who lives in England and loves Joe Pasquale" - heads = [1, 1, 0, 1, 2, -1, -4, 1, -2, -1, -1, -3, -10, 1, -2] + heads = [1, 2, 2, 4, 6, 4, 2, 8, 6, 8, 9, 8, 8, 14, 12] deps = ["compound", "nsubj", "ROOT", "det", "amod", "prt", "attr", "nsubj", "relcl", "prep", "pobj", "cc", "conj", "compound", "dobj"] # fmt: on tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps) + doc = Doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps) sent1 = list(doc.sents)[0] init_len = len(list(sent1.root.subtree)) with doc.retokenize() as retokenizer: diff --git a/spacy/tests/doc/test_retokenize_split.py b/spacy/tests/doc/test_retokenize_split.py index d84c846de..30f945165 100644 --- a/spacy/tests/doc/test_retokenize_split.py +++ b/spacy/tests/doc/test_retokenize_split.py @@ -1,17 +1,12 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.vocab import Vocab from spacy.tokens import Doc, Token -from ..util import get_doc - def 
test_doc_retokenize_split(en_vocab): words = ["LosAngeles", "start", "."] - heads = [1, 1, 0] - doc = get_doc(en_vocab, words=words, heads=heads) + heads = [1, 2, 2] + doc = Doc(en_vocab, words=words, heads=heads) assert len(doc) == 3 assert len(str(doc)) == 19 assert doc[0].head.text == "start" @@ -25,15 +20,18 @@ def test_doc_retokenize_split(en_vocab): "tag": ["NNP"] * 2, "lemma": ["Los", "Angeles"], "ent_type": ["GPE"] * 2, + "morph": ["Number=Sing"] * 2, }, ) assert len(doc) == 4 assert doc[0].text == "Los" assert doc[0].head.text == "Angeles" assert doc[0].idx == 0 + assert str(doc[0].morph) == "Number=Sing" assert doc[1].idx == 3 assert doc[1].text == "Angeles" assert doc[1].head.text == "start" + assert str(doc[1].morph) == "Number=Sing" assert doc[2].text == "start" assert doc[2].head.text == "." assert doc[3].text == "." @@ -88,11 +86,11 @@ def test_doc_retokenize_spans_sentence_update_after_split(en_vocab): # fmt: off words = ["StewartLee", "is", "a", "stand", "up", "comedian", ".", "He", "lives", "in", "England", "and", "loves", "JoePasquale", "."] - heads = [1, 0, 1, 2, -1, -4, -5, 1, 0, -1, -1, -3, -4, 1, -2] + heads = [1, 1, 3, 5, 3, 1, 1, 8, 8, 8, 9, 8, 8, 14, 12] deps = ["nsubj", "ROOT", "det", "amod", "prt", "attr", "punct", "nsubj", "ROOT", "prep", "pobj", "cc", "conj", "compound", "punct"] # fmt: on - doc = get_doc(en_vocab, words=words, heads=heads, deps=deps) + doc = Doc(en_vocab, words=words, heads=heads, deps=deps) sent1, sent2 = list(doc.sents) init_len = len(sent1) init_len2 = len(sent2) @@ -211,9 +209,13 @@ def test_doc_retokenizer_split_norm(en_vocab): # Retokenize to split out the words in the token at doc[2]. token = doc[2] with doc.retokenize() as retokenizer: - retokenizer.split(token, ["brown", "fox", "jumps", "over", "the"], heads=[(token, idx) for idx in range(5)]) + retokenizer.split( + token, + ["brown", "fox", "jumps", "over", "the"], + heads=[(token, idx) for idx in range(5)], + ) - assert doc[9].text == "w/" + assert doc[9].text == "w/" assert doc[9].norm_ == "with" - assert doc[5].text == "over" + assert doc[5].text == "over" assert doc[5].norm_ == "over" diff --git a/spacy/tests/doc/test_span.py b/spacy/tests/doc/test_span.py index df41aedf5..4c7f0c86b 100644 --- a/spacy/tests/doc/test_span.py +++ b/spacy/tests/doc/test_span.py @@ -1,25 +1,20 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.attrs import ORTH, LENGTH from spacy.tokens import Doc, Span from spacy.vocab import Vocab from spacy.util import filter_spans -from ..util import get_doc - @pytest.fixture def doc(en_tokenizer): # fmt: off text = "This is a sentence. This is another sentence. And a third." - heads = [1, 0, 1, -2, -3, 1, 0, 1, -2, -3, 0, 1, -2, -1] + heads = [1, 1, 3, 1, 1, 6, 6, 8, 6, 6, 12, 12, 12, 12] deps = ["nsubj", "ROOT", "det", "attr", "punct", "nsubj", "ROOT", "det", "attr", "punct", "ROOT", "det", "npadvmod", "punct"] # fmt: on tokens = en_tokenizer(text) - return get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps) + return Doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps) @pytest.fixture @@ -27,7 +22,6 @@ def doc_not_parsed(en_tokenizer): text = "This is a sentence. This is another sentence. And a third." 
tokens = en_tokenizer(text) doc = Doc(tokens.vocab, words=[t.text for t in tokens]) - doc.is_parsed = False return doc @@ -69,15 +63,14 @@ def test_spans_string_fn(doc): span = doc[0:4] assert len(span) == 4 assert span.text == "This is a sentence" - assert span.upper_ == "THIS IS A SENTENCE" - assert span.lower_ == "this is a sentence" def test_spans_root2(en_tokenizer): text = "through North and South Carolina" - heads = [0, 3, -1, -2, -4] + heads = [0, 4, 1, 1, 0] + deps = ["dep"] * len(heads) tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads) + doc = Doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps) assert doc[-2:].root.text == "Carolina" @@ -97,7 +90,12 @@ def test_spans_span_sent(doc, doc_not_parsed): def test_spans_lca_matrix(en_tokenizer): """Test span's lca matrix generation""" tokens = en_tokenizer("the lazy dog slept") - doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=[2, 1, 1, 0]) + doc = Doc( + tokens.vocab, + words=[t.text for t in tokens], + heads=[2, 2, 3, 3], + deps=["dep"] * 4, + ) lca = doc[:2].get_lca_matrix() assert lca.shape == (2, 2) assert lca[0, 0] == 0 # the & the -> the @@ -321,14 +319,14 @@ def test_span_boundaries(doc): for i in range(start, end): assert span[i - start] == doc[i] with pytest.raises(IndexError): - _ = span[-5] + span[-5] with pytest.raises(IndexError): - _ = span[5] + span[5] def test_sent(en_tokenizer): doc = en_tokenizer("Check span.sent raises error if doc is not sentencized.") span = doc[1:3] - assert not span.doc.is_sentenced + assert not span.doc.has_annotation("SENT_START") with pytest.raises(ValueError): span.sent diff --git a/spacy/tests/doc/test_to_json.py b/spacy/tests/doc/test_to_json.py index a063a6569..9ebee6c88 100644 --- a/spacy/tests/doc/test_to_json.py +++ b/spacy/tests/doc/test_to_json.py @@ -1,11 +1,5 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest -from spacy.cli._schemas import TRAINING_SCHEMA -from spacy.util import get_json_validator, validate_json from spacy.tokens import Doc -from ..util import get_doc @pytest.fixture() @@ -13,11 +7,19 @@ def doc(en_vocab): words = ["c", "d", "e"] pos = ["VERB", "NOUN", "NOUN"] tags = ["VBP", "NN", "NN"] - heads = [0, -1, -2] + heads = [0, 0, 0] deps = ["ROOT", "dobj", "dobj"] - ents = [(1, 2, "ORG")] - return get_doc( - en_vocab, words=words, pos=pos, tags=tags, heads=heads, deps=deps, ents=ents + ents = ["O", "B-ORG", "O"] + morphs = ["Feat1=A", "Feat1=B", "Feat1=A|Feat2=D"] + return Doc( + en_vocab, + words=words, + pos=pos, + tags=tags, + heads=heads, + deps=deps, + ents=ents, + morphs=morphs, ) @@ -58,10 +60,3 @@ def test_doc_to_json_underscore_error_serialize(doc): Doc.set_extension("json_test4", method=lambda doc: doc.text) with pytest.raises(ValueError): doc.to_json(underscore=["json_test4"]) - - -def test_doc_to_json_valid_training(doc): - json_doc = doc.to_json() - validator = get_json_validator(TRAINING_SCHEMA) - errors = validate_json([json_doc], validator) - assert not errors diff --git a/spacy/tests/doc/test_token_api.py b/spacy/tests/doc/test_token_api.py index 4dcd07ad9..3c5c063bd 100644 --- a/spacy/tests/doc/test_token_api.py +++ b/spacy/tests/doc/test_token_api.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest import numpy from spacy.attrs import IS_ALPHA, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_TITLE, IS_STOP @@ -8,31 +5,24 @@ from spacy.symbols import VERB from spacy.vocab import Vocab from spacy.tokens import 
Doc -from ..util import get_doc - @pytest.fixture -def doc(en_tokenizer): +def doc(en_vocab): # fmt: off - text = "This is a sentence. This is another sentence. And a third." - heads = [1, 0, 1, -2, -3, 1, 0, 1, -2, -3, 0, 1, -2, -1] + words = ["This", "is", "a", "sentence", ".", "This", "is", "another", "sentence", ".", "And", "a", "third", "."] + heads = [1, 1, 3, 1, 1, 6, 6, 8, 6, 6, 10, 12, 10, 12] deps = ["nsubj", "ROOT", "det", "attr", "punct", "nsubj", "ROOT", "det", "attr", "punct", "ROOT", "det", "npadvmod", "punct"] # fmt: on - tokens = en_tokenizer(text) - return get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps) + return Doc(en_vocab, words=words, heads=heads, deps=deps) -def test_doc_token_api_strings(en_tokenizer): - text = "Give it back! He pleaded." +def test_doc_token_api_strings(en_vocab): + words = ["Give", "it", "back", "!", "He", "pleaded", "."] pos = ["VERB", "PRON", "PART", "PUNCT", "PRON", "VERB", "PUNCT"] - heads = [0, -1, -2, -3, 1, 0, -1] + heads = [0, 0, 0, 0, 5, 5, 5] deps = ["ROOT", "dobj", "prt", "punct", "nsubj", "ROOT", "punct"] - - tokens = en_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps - ) + doc = Doc(en_vocab, words=words, pos=pos, heads=heads, deps=deps) assert doc[0].orth_ == "Give" assert doc[0].text == "Give" assert doc[0].text_with_ws == "Give " @@ -100,77 +90,91 @@ def test_doc_token_api_vectors(): assert doc[0].similarity(doc[1]) == cosine -def test_doc_token_api_ancestors(en_tokenizer): +def test_doc_token_api_ancestors(en_vocab): # the structure of this sentence depends on the English annotation scheme - text = "Yesterday I saw a dog that barked loudly." - heads = [2, 1, 0, 1, -2, 1, -2, -1, -6] - tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads) + words = ["Yesterday", "I", "saw", "a", "dog", "that", "barked", "loudly", "."] + heads = [2, 2, 2, 4, 2, 6, 4, 6, 2] + doc = Doc(en_vocab, words=words, heads=heads) assert [t.text for t in doc[6].ancestors] == ["dog", "saw"] assert [t.text for t in doc[1].ancestors] == ["saw"] assert [t.text for t in doc[2].ancestors] == [] - assert doc[2].is_ancestor(doc[7]) assert not doc[6].is_ancestor(doc[2]) -def test_doc_token_api_head_setter(en_tokenizer): - # the structure of this sentence depends on the English annotation scheme - text = "Yesterday I saw a dog that barked loudly." 
- heads = [2, 1, 0, 1, -2, 1, -2, -1, -6] - tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads) - +def test_doc_token_api_head_setter(en_vocab): + words = ["Yesterday", "I", "saw", "a", "dog", "that", "barked", "loudly", "."] + heads = [2, 2, 2, 4, 2, 6, 4, 6, 2] + deps = ["dep"] * len(heads) + doc = Doc(en_vocab, words=words, heads=heads, deps=deps) assert doc[6].n_lefts == 1 assert doc[6].n_rights == 1 assert doc[6].left_edge.i == 5 assert doc[6].right_edge.i == 7 - assert doc[4].n_lefts == 1 assert doc[4].n_rights == 1 assert doc[4].left_edge.i == 3 assert doc[4].right_edge.i == 7 - assert doc[3].n_lefts == 0 assert doc[3].n_rights == 0 assert doc[3].left_edge.i == 3 assert doc[3].right_edge.i == 3 - assert doc[2].left_edge.i == 0 assert doc[2].right_edge.i == 8 doc[6].head = doc[3] - assert doc[6].n_lefts == 1 assert doc[6].n_rights == 1 assert doc[6].left_edge.i == 5 assert doc[6].right_edge.i == 7 - assert doc[3].n_lefts == 0 assert doc[3].n_rights == 1 assert doc[3].left_edge.i == 3 assert doc[3].right_edge.i == 7 - assert doc[4].n_lefts == 1 assert doc[4].n_rights == 0 assert doc[4].left_edge.i == 3 assert doc[4].right_edge.i == 7 - assert doc[2].left_edge.i == 0 assert doc[2].right_edge.i == 8 doc[0].head = doc[5] - assert doc[5].left_edge.i == 0 assert doc[6].left_edge.i == 0 assert doc[3].left_edge.i == 0 assert doc[4].left_edge.i == 0 assert doc[2].left_edge.i == 0 - # head token must be from the same document - doc2 = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads) + doc2 = Doc(en_vocab, words=words, heads=heads) with pytest.raises(ValueError): doc[0].head = doc2[0] + # test sentence starts when two sentences are joined + # fmt: off + words = ["This", "is", "one", "sentence", ".", "This", "is", "another", "sentence", "."] + heads = [0, 0, 0, 0, 0, 5, 5, 5, 5, 5] + # fmt: on + doc = Doc(en_vocab, words=words, heads=heads, deps=["dep"] * len(heads)) + # initially two sentences + assert doc[0].is_sent_start + assert doc[5].is_sent_start + assert doc[0].left_edge == doc[0] + assert doc[0].right_edge == doc[4] + assert doc[5].left_edge == doc[5] + assert doc[5].right_edge == doc[9] + # modifying with a sentence doesn't change sent starts + doc[2].head = doc[3] + assert doc[0].is_sent_start + assert doc[5].is_sent_start + assert doc[0].left_edge == doc[0] + assert doc[0].right_edge == doc[4] + assert doc[5].left_edge == doc[5] + assert doc[5].right_edge == doc[9] + # attach the second sentence to the first, resulting in one sentence + doc[5].head = doc[0] + assert doc[0].is_sent_start + assert not doc[5].is_sent_start + assert doc[0].left_edge == doc[0] + assert doc[0].right_edge == doc[9] def test_is_sent_start(en_tokenizer): @@ -178,7 +182,6 @@ def test_is_sent_start(en_tokenizer): assert doc[5].is_sent_start is None doc[5].is_sent_start = True assert doc[5].is_sent_start is True - doc.is_parsed = True assert len(list(doc.sents)) == 2 @@ -187,7 +190,6 @@ def test_is_sent_end(en_tokenizer): assert doc[4].is_sent_end is None doc[5].is_sent_start = True assert doc[4].is_sent_end is True - doc.is_parsed = True assert len(list(doc.sents)) == 2 @@ -212,39 +214,39 @@ def test_token0_has_sent_start_true(): doc = Doc(Vocab(), words=["hello", "world"]) assert doc[0].is_sent_start is True assert doc[1].is_sent_start is None - assert not doc.is_sentenced + assert not doc.has_annotation("SENT_START") def test_tokenlast_has_sent_end_true(): doc = Doc(Vocab(), words=["hello", "world"]) assert doc[0].is_sent_end is None assert 
doc[1].is_sent_end is True - assert not doc.is_sentenced + assert not doc.has_annotation("SENT_START") def test_token_api_conjuncts_chain(en_vocab): - words = "The boy and the girl and the man went .".split() - heads = [1, 7, -1, 1, -3, -1, 1, -3, 0, -1] + words = ["The", "boy", "and", "the", "girl", "and", "the", "man", "went", "."] + heads = [1, 8, 1, 4, 1, 4, 7, 4, 8, 8] deps = ["det", "nsubj", "cc", "det", "conj", "cc", "det", "conj", "ROOT", "punct"] - doc = get_doc(en_vocab, words=words, heads=heads, deps=deps) + doc = Doc(en_vocab, words=words, heads=heads, deps=deps) assert [w.text for w in doc[1].conjuncts] == ["girl", "man"] assert [w.text for w in doc[4].conjuncts] == ["boy", "man"] assert [w.text for w in doc[7].conjuncts] == ["boy", "girl"] def test_token_api_conjuncts_simple(en_vocab): - words = "They came and went .".split() - heads = [1, 0, -1, -2, -1] + words = ["They", "came", "and", "went", "."] + heads = [1, 1, 1, 1, 3] deps = ["nsubj", "ROOT", "cc", "conj", "dep"] - doc = get_doc(en_vocab, words=words, heads=heads, deps=deps) + doc = Doc(en_vocab, words=words, heads=heads, deps=deps) assert [w.text for w in doc[1].conjuncts] == ["went"] assert [w.text for w in doc[3].conjuncts] == ["came"] def test_token_api_non_conjuncts(en_vocab): - words = "They came .".split() - heads = [1, 0, -1] + words = ["They", "came", "."] + heads = [1, 1, 1] deps = ["nsubj", "ROOT", "punct"] - doc = get_doc(en_vocab, words=words, heads=heads, deps=deps) + doc = Doc(en_vocab, words=words, heads=heads, deps=deps) assert [w.text for w in doc[0].conjuncts] == [] assert [w.text for w in doc[1].conjuncts] == [] diff --git a/spacy/tests/doc/test_underscore.py b/spacy/tests/doc/test_underscore.py index c1eff2c20..b934221af 100644 --- a/spacy/tests/doc/test_underscore.py +++ b/spacy/tests/doc/test_underscore.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from mock import Mock from spacy.tokens import Doc, Span, Token diff --git a/spacy/tests/lang/ar/test_exceptions.py b/spacy/tests/lang/ar/test_exceptions.py index 3cfc380d2..0129c3a19 100644 --- a/spacy/tests/lang/ar/test_exceptions.py +++ b/spacy/tests/lang/ar/test_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest @@ -15,7 +12,6 @@ def test_ar_tokenizer_handles_exc_in_text(ar_tokenizer): tokens = ar_tokenizer(text) assert len(tokens) == 7 assert tokens[6].text == "ق.م" - assert tokens[6].lemma_ == "قبل الميلاد" def test_ar_tokenizer_handles_exc_in_text_2(ar_tokenizer): diff --git a/spacy/tests/lang/ar/test_text.py b/spacy/tests/lang/ar/test_text.py index 109c3721a..c5ab376f1 100644 --- a/spacy/tests/lang/ar/test_text.py +++ b/spacy/tests/lang/ar/test_text.py @@ -1,7 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - - def test_ar_tokenizer_handles_long_text(ar_tokenizer): text = """نجيب محفوظ مؤلف و كاتب روائي عربي، يعد من أهم الأدباء العرب خلال القرن العشرين. 
ولد نجيب محفوظ في مدينة القاهرة، حيث ترعرع و تلقى تعليمه الجامعي في جامعتها، diff --git a/spacy/tests/lang/bn/test_tokenizer.py b/spacy/tests/lang/bn/test_tokenizer.py index 62dd52778..5b18c5269 100644 --- a/spacy/tests/lang/bn/test_tokenizer.py +++ b/spacy/tests/lang/bn/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/ca/test_exception.py b/spacy/tests/lang/ca/test_exception.py index 56156c328..cfb574b63 100644 --- a/spacy/tests/lang/ca/test_exception.py +++ b/spacy/tests/lang/ca/test_exception.py @@ -1,7 +1,3 @@ -# coding: utf-8 - -from __future__ import unicode_literals - import pytest @@ -12,7 +8,6 @@ import pytest def test_ca_tokenizer_handles_abbr(ca_tokenizer, text, lemma): tokens = ca_tokenizer(text) assert len(tokens) == 1 - assert tokens[0].lemma_ == lemma def test_ca_tokenizer_handles_exc_in_text(ca_tokenizer): @@ -20,4 +15,3 @@ def test_ca_tokenizer_handles_exc_in_text(ca_tokenizer): tokens = ca_tokenizer(text) assert len(tokens) == 15 assert tokens[7].text == "aprox." - assert tokens[7].lemma_ == "aproximadament" diff --git a/spacy/tests/lang/ca/test_prefix_suffix_infix.py b/spacy/tests/lang/ca/test_prefix_suffix_infix.py index 4583a62b9..83a75f056 100644 --- a/spacy/tests/lang/ca/test_prefix_suffix_infix.py +++ b/spacy/tests/lang/ca/test_prefix_suffix_infix.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/ca/test_text.py b/spacy/tests/lang/ca/test_text.py index 1506016d4..38f5fc708 100644 --- a/spacy/tests/lang/ca/test_text.py +++ b/spacy/tests/lang/ca/test_text.py @@ -1,10 +1,4 @@ -# coding: utf-8 - """Test that longer and mixed texts are tokenized correctly.""" - - -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/cs/test_text.py b/spacy/tests/lang/cs/test_text.py index d98961738..b834111b9 100644 --- a/spacy/tests/lang/cs/test_text.py +++ b/spacy/tests/lang/cs/test_text.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/da/test_exceptions.py b/spacy/tests/lang/da/test_exceptions.py index 503399ee4..bd9f2710e 100644 --- a/spacy/tests/lang/da/test_exceptions.py +++ b/spacy/tests/lang/da/test_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/da/test_prefix_suffix_infix.py b/spacy/tests/lang/da/test_prefix_suffix_infix.py index 8b43bf360..e36b3cdb9 100644 --- a/spacy/tests/lang/da/test_prefix_suffix_infix.py +++ b/spacy/tests/lang/da/test_prefix_suffix_infix.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/da/test_text.py b/spacy/tests/lang/da/test_text.py index 07b134e2d..3c6cca5ac 100644 --- a/spacy/tests/lang/da/test_text.py +++ b/spacy/tests/lang/da/test_text.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.lang.da.lex_attrs import like_num diff --git a/spacy/tests/lang/de/test_exceptions.py b/spacy/tests/lang/de/test_exceptions.py index 3b464e1ae..d51c33992 100644 --- a/spacy/tests/lang/de/test_exceptions.py +++ b/spacy/tests/lang/de/test_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest @@ -21,4 +18,3 @@ def test_de_tokenizer_handles_exc_in_text(de_tokenizer): tokens = de_tokenizer(text) assert len(tokens) == 6 assert tokens[2].text == "z.Zt." 
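Note (not part of the patch): the noun-chunk test hunks that follow simply drop the old `doc.is_parsed = False` line along with the long docstring explaining it. A minimal hedged sketch of why that line is no longer needed in spaCy v3, using a blank German pipeline; the `German()` construction and the `has_annotation("DEP")` check are my additions, not lines from the patch:

```python
# Illustrative sketch only, not part of the patch: a Doc produced by a blank
# pipeline (tokenizer only) carries no dependency parse in spaCy v3, so
# noun_chunks raises ValueError on its own; there is no is_parsed flag to force.
import pytest
from spacy.lang.de import German

nlp = German()                        # blank pipeline: tokenizer only
doc = nlp("Er lag auf seinem")        # same text as the German test below
assert not doc.has_annotation("DEP")  # no parse annotation is present
with pytest.raises(ValueError):
    list(doc.noun_chunks)
```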
- assert tokens[2].lemma_ == "zur Zeit" diff --git a/spacy/tests/lang/de/test_noun_chunks.py b/spacy/tests/lang/de/test_noun_chunks.py index 8d76ddd79..7b8b15b1c 100644 --- a/spacy/tests/lang/de/test_noun_chunks.py +++ b/spacy/tests/lang/de/test_noun_chunks.py @@ -1,16 +1,8 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest def test_noun_chunks_is_parsed_de(de_tokenizer): - """Test that noun_chunks raises Value Error for 'de' language if Doc is not parsed. - To check this test, we're constructing a Doc - with a new Vocab here and forcing is_parsed to 'False' - to make sure the noun chunks don't run. - """ + """Test that noun_chunks raises Value Error for 'de' language if Doc is not parsed.""" doc = de_tokenizer("Er lag auf seinem") - doc.is_parsed = False with pytest.raises(ValueError): list(doc.noun_chunks) diff --git a/spacy/tests/lang/de/test_parser.py b/spacy/tests/lang/de/test_parser.py index 5c8694da3..8c858a4cb 100644 --- a/spacy/tests/lang/de/test_parser.py +++ b/spacy/tests/lang/de/test_parser.py @@ -1,33 +1,26 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from ...util import get_doc +from spacy.tokens import Doc -def test_de_parser_noun_chunks_standard_de(de_tokenizer): - text = "Eine Tasse steht auf dem Tisch." - heads = [1, 1, 0, -1, 1, -2, -4] - tags = ["ART", "NN", "VVFIN", "APPR", "ART", "NN", "$."] +def test_de_parser_noun_chunks_standard_de(de_vocab): + words = ["Eine", "Tasse", "steht", "auf", "dem", "Tisch", "."] + heads = [1, 2, 2, 2, 5, 3, 2] + pos = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN", "PUNCT"] deps = ["nk", "sb", "ROOT", "mo", "nk", "nk", "punct"] - tokens = de_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads - ) + doc = Doc(de_vocab, words=words, pos=pos, deps=deps, heads=heads) chunks = list(doc.noun_chunks) assert len(chunks) == 2 assert chunks[0].text_with_ws == "Eine Tasse " assert chunks[1].text_with_ws == "dem Tisch " -def test_de_extended_chunk(de_tokenizer): - text = "Die Sängerin singt mit einer Tasse Kaffee Arien." 
- heads = [1, 1, 0, -1, 1, -2, -1, -5, -6] - tags = ["ART", "NN", "VVFIN", "APPR", "ART", "NN", "NN", "NN", "$."] +def test_de_extended_chunk(de_vocab): + # fmt: off + words = ["Die", "Sängerin", "singt", "mit", "einer", "Tasse", "Kaffee", "Arien", "."] + heads = [1, 2, 2, 2, 5, 3, 5, 2, 2] + pos = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN", "NOUN", "NOUN", "PUNCT"] deps = ["nk", "sb", "ROOT", "mo", "nk", "nk", "nk", "oa", "punct"] - tokens = de_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads - ) + # fmt: on + doc = Doc(de_vocab, words=words, pos=pos, deps=deps, heads=heads) chunks = list(doc.noun_chunks) assert len(chunks) == 3 assert chunks[0].text_with_ws == "Die Sängerin " diff --git a/spacy/tests/lang/de/test_prefix_suffix_infix.py b/spacy/tests/lang/de/test_prefix_suffix_infix.py index 13e109395..82bd8ed69 100644 --- a/spacy/tests/lang/de/test_prefix_suffix_infix.py +++ b/spacy/tests/lang/de/test_prefix_suffix_infix.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/de/test_text.py b/spacy/tests/lang/de/test_text.py index b3fb1eaa5..22711763e 100644 --- a/spacy/tests/lang/de/test_text.py +++ b/spacy/tests/lang/de/test_text.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/el/test_exception.py b/spacy/tests/lang/el/test_exception.py index b8d10fb69..a4656ea98 100644 --- a/spacy/tests/lang/el/test_exception.py +++ b/spacy/tests/lang/el/test_exception.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/el/test_noun_chunks.py b/spacy/tests/lang/el/test_noun_chunks.py index 4f24865d0..2684a5cfb 100644 --- a/spacy/tests/lang/el/test_noun_chunks.py +++ b/spacy/tests/lang/el/test_noun_chunks.py @@ -1,16 +1,8 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest def test_noun_chunks_is_parsed_el(el_tokenizer): - """Test that noun_chunks raises Value Error for 'el' language if Doc is not parsed. - To check this test, we're constructing a Doc - with a new Vocab here and forcing is_parsed to 'False' - to make sure the noun chunks don't run. 
- """ + """Test that noun_chunks raises Value Error for 'el' language if Doc is not parsed.""" doc = el_tokenizer("είναι χώρα της νοτιοανατολικής") - doc.is_parsed = False with pytest.raises(ValueError): list(doc.noun_chunks) diff --git a/spacy/tests/lang/el/test_text.py b/spacy/tests/lang/el/test_text.py index a6395ab4a..1b3ef6182 100644 --- a/spacy/tests/lang/el/test_text.py +++ b/spacy/tests/lang/el/test_text.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/en/test_customized_tokenizer.py b/spacy/tests/lang/en/test_customized_tokenizer.py index 7f939011f..f5302cb31 100644 --- a/spacy/tests/lang/en/test_customized_tokenizer.py +++ b/spacy/tests/lang/en/test_customized_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest import re from spacy.lang.en import English diff --git a/spacy/tests/lang/en/test_exceptions.py b/spacy/tests/lang/en/test_exceptions.py index 1ff64eff2..1b56a3b0f 100644 --- a/spacy/tests/lang/en/test_exceptions.py +++ b/spacy/tests/lang/en/test_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest @@ -52,7 +49,6 @@ def test_en_tokenizer_handles_ll_contraction(en_tokenizer, text): assert len(tokens) == 2 assert tokens[0].text == text.split("'")[0] assert tokens[1].text == "'ll" - assert tokens[1].lemma_ == "will" @pytest.mark.parametrize( @@ -107,7 +103,6 @@ def test_en_tokenizer_handles_exc_in_text(en_tokenizer): def test_en_tokenizer_handles_times(en_tokenizer, text): tokens = en_tokenizer(text) assert len(tokens) == 2 - assert tokens[1].lemma_ in ["a.m.", "p.m."] @pytest.mark.parametrize( diff --git a/spacy/tests/lang/en/test_indices.py b/spacy/tests/lang/en/test_indices.py index 8a7bc0323..93daeec30 100644 --- a/spacy/tests/lang/en/test_indices.py +++ b/spacy/tests/lang/en/test_indices.py @@ -1,7 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - - def test_en_simple_punct(en_tokenizer): text = "to walk, do foo" tokens = en_tokenizer(text) diff --git a/spacy/tests/lang/en/test_noun_chunks.py b/spacy/tests/lang/en/test_noun_chunks.py index ff67986a5..540f3ed84 100644 --- a/spacy/tests/lang/en/test_noun_chunks.py +++ b/spacy/tests/lang/en/test_noun_chunks.py @@ -1,34 +1,23 @@ -# coding: utf-8 -from __future__ import unicode_literals - import numpy from spacy.attrs import HEAD, DEP from spacy.symbols import nsubj, dobj, amod, nmod, conj, cc, root -from spacy.lang.en.syntax_iterators import SYNTAX_ITERATORS - +from spacy.lang.en.syntax_iterators import noun_chunks +from spacy.tokens import Doc import pytest -from ...util import get_doc - - def test_noun_chunks_is_parsed(en_tokenizer): - """Test that noun_chunks raises Value Error for 'en' language if Doc is not parsed. - To check this test, we're constructing a Doc - with a new Vocab here and forcing is_parsed to 'False' - to make sure the noun chunks don't run. 
- """ + """Test that noun_chunks raises Value Error for 'en' language if Doc is not parsed.""" doc = en_tokenizer("This is a sentence") - doc.is_parsed = False with pytest.raises(ValueError): list(doc.noun_chunks) def test_en_noun_chunks_not_nested(en_vocab): words = ["Peter", "has", "chronic", "command", "and", "control", "issues"] - heads = [1, 0, 4, 3, -1, -2, -5] + heads = [1, 1, 6, 6, 3, 3, 1] deps = ["nsubj", "ROOT", "amod", "nmod", "cc", "conj", "dobj"] - doc = get_doc(en_vocab, words=words, heads=heads, deps=deps) + doc = Doc(en_vocab, words=words, heads=heads, deps=deps) doc.from_array( [HEAD, DEP], numpy.asarray( @@ -44,7 +33,7 @@ def test_en_noun_chunks_not_nested(en_vocab): dtype="uint64", ), ) - doc.noun_chunks_iterator = SYNTAX_ITERATORS["noun_chunks"] + doc.noun_chunks_iterator = noun_chunks word_occurred = {} for chunk in doc.noun_chunks: for word in chunk: diff --git a/spacy/tests/lang/en/test_parser.py b/spacy/tests/lang/en/test_parser.py index ce696bc25..426605566 100644 --- a/spacy/tests/lang/en/test_parser.py +++ b/spacy/tests/lang/en/test_parser.py @@ -1,66 +1,51 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from ...util import get_doc +from spacy.tokens import Doc -def test_en_parser_noun_chunks_standard(en_tokenizer): - text = "A base phrase should be recognized." - heads = [2, 1, 3, 2, 1, 0, -1] - tags = ["DT", "JJ", "NN", "MD", "VB", "VBN", "."] +def test_en_parser_noun_chunks_standard(en_vocab): + words = ["A", "base", "phrase", "should", "be", "recognized", "."] + heads = [2, 2, 5, 5, 5, 5, 5] + pos = ["DET", "ADJ", "NOUN", "AUX", "VERB", "VERB", "PUNCT"] deps = ["det", "amod", "nsubjpass", "aux", "auxpass", "ROOT", "punct"] - tokens = en_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads - ) + doc = Doc(en_vocab, words=words, pos=pos, deps=deps, heads=heads) chunks = list(doc.noun_chunks) assert len(chunks) == 1 assert chunks[0].text_with_ws == "A base phrase " -def test_en_parser_noun_chunks_coordinated(en_tokenizer): +def test_en_parser_noun_chunks_coordinated(en_vocab): # fmt: off - text = "A base phrase and a good phrase are often the same." - heads = [2, 1, 5, -1, 2, 1, -4, 0, -1, 1, -3, -4] - tags = ["DT", "NN", "NN", "CC", "DT", "JJ", "NN", "VBP", "RB", "DT", "JJ", "."] + words = ["A", "base", "phrase", "and", "a", "good", "phrase", "are", "often", "the", "same", "."] + heads = [2, 2, 7, 2, 6, 6, 2, 7, 7, 10, 7, 7] + pos = ["DET", "NOUN", "NOUN", "CCONJ", "DET", "ADJ", "NOUN", "VERB", "ADV", "DET", "ADJ", "PUNCT"] deps = ["det", "compound", "nsubj", "cc", "det", "amod", "conj", "ROOT", "advmod", "det", "attr", "punct"] # fmt: on - tokens = en_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads - ) + doc = Doc(en_vocab, words=words, pos=pos, deps=deps, heads=heads) chunks = list(doc.noun_chunks) assert len(chunks) == 2 assert chunks[0].text_with_ws == "A base phrase " assert chunks[1].text_with_ws == "a good phrase " -def test_en_parser_noun_chunks_pp_chunks(en_tokenizer): - text = "A phrase with another phrase occurs." 
- heads = [1, 4, -1, 1, -2, 0, -1] - tags = ["DT", "NN", "IN", "DT", "NN", "VBZ", "."] +def test_en_parser_noun_chunks_pp_chunks(en_vocab): + words = ["A", "phrase", "with", "another", "phrase", "occurs", "."] + heads = [1, 5, 1, 4, 2, 5, 5] + pos = ["DET", "NOUN", "ADP", "DET", "NOUN", "VERB", "PUNCT"] deps = ["det", "nsubj", "prep", "det", "pobj", "ROOT", "punct"] - tokens = en_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads - ) + doc = Doc(en_vocab, words=words, pos=pos, deps=deps, heads=heads) chunks = list(doc.noun_chunks) assert len(chunks) == 2 assert chunks[0].text_with_ws == "A phrase " assert chunks[1].text_with_ws == "another phrase " -def test_en_parser_noun_chunks_appositional_modifiers(en_tokenizer): +def test_en_parser_noun_chunks_appositional_modifiers(en_vocab): # fmt: off - text = "Sam, my brother, arrived to the house." - heads = [5, -1, 1, -3, -4, 0, -1, 1, -2, -4] - tags = ["NNP", ",", "PRP$", "NN", ",", "VBD", "IN", "DT", "NN", "."] + words = ["Sam", ",", "my", "brother", ",", "arrived", "to", "the", "house", "."] + heads = [5, 0, 3, 0, 0, 5, 5, 8, 6, 5] + pos = ["PROPN", "PUNCT", "DET", "NOUN", "PUNCT", "VERB", "ADP", "DET", "NOUN", "PUNCT"] deps = ["nsubj", "punct", "poss", "appos", "punct", "ROOT", "prep", "det", "pobj", "punct"] # fmt: on - tokens = en_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads - ) + doc = Doc(en_vocab, words=words, pos=pos, deps=deps, heads=heads) chunks = list(doc.noun_chunks) assert len(chunks) == 3 assert chunks[0].text_with_ws == "Sam " @@ -68,15 +53,12 @@ def test_en_parser_noun_chunks_appositional_modifiers(en_tokenizer): assert chunks[2].text_with_ws == "the house " -def test_en_parser_noun_chunks_dative(en_tokenizer): - text = "She gave Bob a raise." - heads = [1, 0, -1, 1, -3, -4] - tags = ["PRP", "VBD", "NNP", "DT", "NN", "."] +def test_en_parser_noun_chunks_dative(en_vocab): + words = ["She", "gave", "Bob", "a", "raise", "."] + heads = [1, 1, 1, 4, 1, 1] + pos = ["PRON", "VERB", "PROPN", "DET", "NOUN", "PUNCT"] deps = ["nsubj", "ROOT", "dative", "det", "dobj", "punct"] - tokens = en_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads - ) + doc = Doc(en_vocab, words=words, pos=pos, deps=deps, heads=heads) chunks = list(doc.noun_chunks) assert len(chunks) == 3 assert chunks[0].text_with_ws == "She " diff --git a/spacy/tests/lang/en/test_prefix_suffix_infix.py b/spacy/tests/lang/en/test_prefix_suffix_infix.py index 3dccd6bcf..9dfb54fd6 100644 --- a/spacy/tests/lang/en/test_prefix_suffix_infix.py +++ b/spacy/tests/lang/en/test_prefix_suffix_infix.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest @@ -111,7 +108,6 @@ def test_en_tokenizer_splits_double_hyphen_infix(en_tokenizer): assert tokens[9].text == "people" -@pytest.mark.xfail def test_en_tokenizer_splits_period_abbr(en_tokenizer): text = "Today is Tuesday.Mr." tokens = en_tokenizer(text) @@ -123,9 +119,8 @@ def test_en_tokenizer_splits_period_abbr(en_tokenizer): assert tokens[4].text == "Mr." 
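Note (not part of the patch): the parser-test hunks above construct a `Doc` directly with absolute head indices instead of the relative offsets the removed `get_doc` helper expected, and they pass coarse-grained `pos` values instead of `tags`, which is what the noun-chunk iterators look at. A minimal hedged sketch of the new constructor convention, reusing the sentence from `test_en_parser_noun_chunks_standard`; the assertions are mine, not lines from the patch:

```python
# Illustrative sketch only, not part of the patch: spaCy v3's Doc constructor
# takes each token's head as an absolute token index, where the old test helper
# used per-token offsets (e.g. [2, 1, 3, 2, 1, 0, -1] for this same sentence).
from spacy.vocab import Vocab
from spacy.tokens import Doc

words = ["A", "base", "phrase", "should", "be", "recognized", "."]
heads = [2, 2, 5, 5, 5, 5, 5]          # index of each token's head
deps = ["det", "amod", "nsubjpass", "aux", "auxpass", "ROOT", "punct"]
doc = Doc(Vocab(), words=words, heads=heads, deps=deps)
assert doc[0].head.i == 2              # "A" attaches to "phrase"
assert doc[5].head.i == 5              # the root is its own head
assert len(list(doc.sents)) == 1       # heads/deps also define sentence boundaries
```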
-@pytest.mark.xfail +@pytest.mark.xfail(reason="Issue #225 - not yet implemented") def test_en_tokenizer_splits_em_dash_infix(en_tokenizer): - # Re Issue #225 tokens = en_tokenizer( """Will this road take me to Puddleton?\u2014No, """ """you'll have to walk there.\u2014Ariel.""" diff --git a/spacy/tests/lang/en/test_punct.py b/spacy/tests/lang/en/test_punct.py index 61274cf14..1d10478a1 100644 --- a/spacy/tests/lang/en/test_punct.py +++ b/spacy/tests/lang/en/test_punct.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.util import compile_prefix_regex from spacy.lang.punctuation import TOKENIZER_PREFIXES @@ -82,7 +79,6 @@ def test_en_tokenizer_splits_open_appostrophe(en_tokenizer, text): assert tokens[0].text == "'" -@pytest.mark.xfail @pytest.mark.parametrize("text", ["Hello''"]) def test_en_tokenizer_splits_double_end_quote(en_tokenizer, text): tokens = en_tokenizer(text) diff --git a/spacy/tests/lang/en/test_sbd.py b/spacy/tests/lang/en/test_sbd.py index 40bd110e8..39d8d3b59 100644 --- a/spacy/tests/lang/en/test_sbd.py +++ b/spacy/tests/lang/en/test_sbd.py @@ -1,34 +1,34 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest +from spacy.tokens import Doc -from ...util import get_doc, apply_transition_sequence +from ...util import apply_transition_sequence -@pytest.mark.parametrize("text", ["A test sentence"]) +@pytest.mark.parametrize("words", [["A", "test", "sentence"]]) @pytest.mark.parametrize("punct", [".", "!", "?", ""]) -def test_en_sbd_single_punct(en_tokenizer, text, punct): - heads = [2, 1, 0, -1] if punct else [2, 1, 0] - tokens = en_tokenizer(text + punct) - doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads) +def test_en_sbd_single_punct(en_vocab, words, punct): + heads = [2, 2, 2, 2] if punct else [2, 2, 2] + deps = ["dep"] * len(heads) + words = [*words, punct] if punct else words + doc = Doc(en_vocab, words=words, heads=heads, deps=deps) assert len(doc) == 4 if punct else 3 assert len(list(doc.sents)) == 1 assert sum(len(sent) for sent in doc.sents) == len(doc) -@pytest.mark.xfail -def test_en_sentence_breaks(en_tokenizer, en_parser): +@pytest.mark.skip( + reason="The step_through API was removed (but should be brought back)" +) +def test_en_sentence_breaks(en_vocab, en_parser): # fmt: off - text = "This is a sentence . This is another one ." - heads = [1, 0, 1, -2, -3, 1, 0, 1, -2, -3] + words = ["This", "is", "a", "sentence", ".", "This", "is", "another", "one", "."] + heads = [1, 1, 3, 1, 1, 6, 6, 8, 6, 6] deps = ["nsubj", "ROOT", "det", "attr", "punct", "nsubj", "ROOT", "det", "attr", "punct"] transition = ["L-nsubj", "S", "L-det", "R-attr", "D", "R-punct", "B-ROOT", "L-nsubj", "S", "L-attr", "R-attr", "D", "R-punct"] # fmt: on - tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps) + doc = Doc(en_vocab, words=words, heads=heads, deps=deps) apply_transition_sequence(en_parser, doc, transition) assert len(list(doc.sents)) == 2 for token in doc: diff --git a/spacy/tests/lang/en/test_tagger.py b/spacy/tests/lang/en/test_tagger.py deleted file mode 100644 index 567fd5a44..000000000 --- a/spacy/tests/lang/en/test_tagger.py +++ /dev/null @@ -1,15 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from ...util import get_doc - - -def test_en_tagger_load_morph_exc(en_tokenizer): - text = "I like his style." 
- tags = ["PRP", "VBP", "PRP$", "NN", "."] - morph_exc = {"VBP": {"like": {"lemma": "luck"}}} - en_tokenizer.vocab.morphology.load_morph_exceptions(morph_exc) - tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags) - assert doc[1].tag_ == "VBP" - assert doc[1].lemma_ == "luck" diff --git a/spacy/tests/lang/en/test_text.py b/spacy/tests/lang/en/test_text.py index 0db1a6419..733e814f7 100644 --- a/spacy/tests/lang/en/test_text.py +++ b/spacy/tests/lang/en/test_text.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.lang.en.lex_attrs import like_num @@ -29,9 +26,7 @@ untimely death" of the rapier-tongued Scottish barrister and parliamentarian. ("""Yes! "I'd rather have a walk", Ms. Comble sighed. """, 15), ("""'Me too!', Mr. P. Delaware cried. """, 11), ("They ran about 10km.", 6), - pytest.param( - "But then the 6,000-year ice age came...", 10, marks=pytest.mark.xfail() - ), + ("But then the 6,000-year ice age came...", 10), ], ) def test_en_tokenizer_handles_cnts(en_tokenizer, text, length): @@ -61,15 +56,7 @@ def test_lex_attrs_like_number(en_tokenizer, text, match): assert tokens[0].like_num == match -@pytest.mark.parametrize( - "word", - [ - "third", - "Millionth", - "100th", - "Hundredth", - ] -) +@pytest.mark.parametrize("word", ["third", "Millionth", "100th", "Hundredth"]) def test_en_lex_attrs_like_number_for_ordinal(word): assert like_num(word) diff --git a/spacy/tests/lang/es/test_exception.py b/spacy/tests/lang/es/test_exception.py index 8d6164058..07df5d69e 100644 --- a/spacy/tests/lang/es/test_exception.py +++ b/spacy/tests/lang/es/test_exception.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest @@ -16,7 +13,6 @@ import pytest def test_es_tokenizer_handles_abbr(es_tokenizer, text, lemma): tokens = es_tokenizer(text) assert len(tokens) == 1 - assert tokens[0].lemma_ == lemma def test_es_tokenizer_handles_exc_in_text(es_tokenizer): @@ -24,4 +20,3 @@ def test_es_tokenizer_handles_exc_in_text(es_tokenizer): tokens = es_tokenizer(text) assert len(tokens) == 7 assert tokens[4].text == "aprox." - assert tokens[4].lemma_ == "aproximadamente" diff --git a/spacy/tests/lang/es/test_noun_chunks.py b/spacy/tests/lang/es/test_noun_chunks.py index 66bbd8c3a..e5afd81c9 100644 --- a/spacy/tests/lang/es/test_noun_chunks.py +++ b/spacy/tests/lang/es/test_noun_chunks.py @@ -1,16 +1,8 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest def test_noun_chunks_is_parsed_es(es_tokenizer): - """Test that noun_chunks raises Value Error for 'es' language if Doc is not parsed. - To check this test, we're constructing a Doc - with a new Vocab here and forcing is_parsed to 'False' - to make sure the noun chunks don't run. 
- """ + """Test that noun_chunks raises Value Error for 'es' language if Doc is not parsed.""" doc = es_tokenizer("en Oxford este verano") - doc.is_parsed = False with pytest.raises(ValueError): list(doc.noun_chunks) diff --git a/spacy/tests/lang/es/test_text.py b/spacy/tests/lang/es/test_text.py index 999e788dd..96f6bcab5 100644 --- a/spacy/tests/lang/es/test_text.py +++ b/spacy/tests/lang/es/test_text.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.lang.es.lex_attrs import like_num diff --git a/spacy/tests/lang/eu/test_text.py b/spacy/tests/lang/eu/test_text.py index f448a7859..94d5ac91d 100644 --- a/spacy/tests/lang/eu/test_text.py +++ b/spacy/tests/lang/eu/test_text.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/fa/test_noun_chunks.py b/spacy/tests/lang/fa/test_noun_chunks.py index a98aae061..d2411e6d3 100644 --- a/spacy/tests/lang/fa/test_noun_chunks.py +++ b/spacy/tests/lang/fa/test_noun_chunks.py @@ -1,17 +1,9 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest def test_noun_chunks_is_parsed_fa(fa_tokenizer): - """Test that noun_chunks raises Value Error for 'fa' language if Doc is not parsed. - To check this test, we're constructing a Doc - with a new Vocab here and forcing is_parsed to 'False' - to make sure the noun chunks don't run. - """ + """Test that noun_chunks raises Value Error for 'fa' language if Doc is not parsed.""" doc = fa_tokenizer("این یک جمله نمونه می باشد.") - doc.is_parsed = False with pytest.raises(ValueError): list(doc.noun_chunks) diff --git a/spacy/tests/lang/fi/test_text.py b/spacy/tests/lang/fi/test_text.py index 2dd92597e..dbb67ad7a 100644 --- a/spacy/tests/lang/fi/test_text.py +++ b/spacy/tests/lang/fi/test_text.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/fi/test_tokenizer.py b/spacy/tests/lang/fi/test_tokenizer.py index 301b85d74..ae16c7eea 100644 --- a/spacy/tests/lang/fi/test_tokenizer.py +++ b/spacy/tests/lang/fi/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/fr/test_exceptions.py b/spacy/tests/lang/fr/test_exceptions.py index 93dbf0993..d75c653d0 100644 --- a/spacy/tests/lang/fr/test_exceptions.py +++ b/spacy/tests/lang/fr/test_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest @@ -19,8 +16,6 @@ import pytest "grand'hamien", "Châteauneuf-la-Forêt", "Château-Guibert", - "11-septembre", - "11-Septembre", "refox-trottâmes", # u"K-POP", # u"K-Pop", @@ -41,20 +36,10 @@ def test_fr_tokenizer_infix_exceptions(fr_tokenizer, text): assert len(tokens) == 1 -@pytest.mark.parametrize( - "text,lemma", - [ - ("janv.", "janvier"), - ("juill.", "juillet"), - ("Dr.", "docteur"), - ("av.", "avant"), - ("sept.", "septembre"), - ], -) -def test_fr_tokenizer_handles_abbr(fr_tokenizer, text, lemma): +@pytest.mark.parametrize("text", ["janv.", "juill.", "Dr.", "av.", "sept."]) +def test_fr_tokenizer_handles_abbr(fr_tokenizer, text): tokens = fr_tokenizer(text) assert len(tokens) == 1 - assert tokens[0].lemma_ == lemma def test_fr_tokenizer_handles_exc_in_text(fr_tokenizer): @@ -62,7 +47,6 @@ def test_fr_tokenizer_handles_exc_in_text(fr_tokenizer): tokens = fr_tokenizer(text) assert len(tokens) == 10 assert tokens[6].text == "janv." 
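Note (not part of the patch): several tokenizer-exception hunks in this patch (Arabic, Catalan, German, English, Spanish and the French hunks here) drop their `lemma_` assertions, because v3 tokenizer exceptions no longer set lemmas; lemmas now come from a separate pipeline component. A small sketch under that reading; the blank `French()` pipeline and the `pipe_names` check are illustrative additions, not lines from the patch:

```python
# Illustrative sketch only, not part of the patch: a blank v3 pipeline contains
# just the tokenizer, so exception entries like "janv." are still handled as one
# token, but no lemma is assigned until a lemmatizer component is added.
from spacy.lang.fr import French

nlp = French()                              # blank pipeline: tokenizer only
doc = nlp("janv.")
assert len(doc) == 1                        # the abbreviation stays one token
assert "lemmatizer" not in nlp.pipe_names   # lemmas require adding a component
```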
- assert tokens[6].lemma_ == "janvier" assert tokens[8].text == "prud’hommes" @@ -79,20 +63,16 @@ def test_fr_tokenizer_handles_title(fr_tokenizer): tokens = fr_tokenizer(text) assert len(tokens) == 6 assert tokens[0].text == "N'" - assert tokens[0].lemma_ == "ne" assert tokens[1].text == "est" - assert tokens[1].lemma_ == "être" assert tokens[2].text == "-ce" - assert tokens[2].lemma_ == "ce" -@pytest.mark.xfail def test_fr_tokenizer_handles_title_2(fr_tokenizer): text = "Est-ce pas génial?" tokens = fr_tokenizer(text) - assert len(tokens) == 6 + assert len(tokens) == 5 assert tokens[0].text == "Est" - assert tokens[0].lemma_ == "être" + assert tokens[1].text == "-ce" def test_fr_tokenizer_handles_title_3(fr_tokenizer): @@ -100,4 +80,3 @@ def test_fr_tokenizer_handles_title_3(fr_tokenizer): tokens = fr_tokenizer(text) assert len(tokens) == 7 assert tokens[0].text == "Qu'" - assert tokens[0].lemma_ == "que" diff --git a/spacy/tests/lang/fr/test_noun_chunks.py b/spacy/tests/lang/fr/test_noun_chunks.py index ea93a5a35..48ac88ead 100644 --- a/spacy/tests/lang/fr/test_noun_chunks.py +++ b/spacy/tests/lang/fr/test_noun_chunks.py @@ -1,16 +1,8 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest def test_noun_chunks_is_parsed_fr(fr_tokenizer): - """Test that noun_chunks raises Value Error for 'fr' language if Doc is not parsed. - To check this test, we're constructing a Doc - with a new Vocab here and forcing is_parsed to 'False' - to make sure the noun chunks don't run. - """ + """Test that noun_chunks raises Value Error for 'fr' language if Doc is not parsed.""" doc = fr_tokenizer("trouver des travaux antérieurs") - doc.is_parsed = False with pytest.raises(ValueError): list(doc.noun_chunks) diff --git a/spacy/tests/lang/fr/test_prefix_suffix_infix.py b/spacy/tests/lang/fr/test_prefix_suffix_infix.py index ca6bdbd87..2ead34069 100644 --- a/spacy/tests/lang/fr/test_prefix_suffix_infix.py +++ b/spacy/tests/lang/fr/test_prefix_suffix_infix.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.language import Language from spacy.lang.punctuation import TOKENIZER_INFIXES @@ -18,7 +15,7 @@ def test_issue768(text, expected_tokens): class Defaults(Language.Defaults): infixes = TOKENIZER_INFIXES + [SPLIT_INFIX] - fr_tokenizer_w_infix = FrenchTest.Defaults.create_tokenizer() + fr_tokenizer_w_infix = FrenchTest().tokenizer tokens = fr_tokenizer_w_infix(text) assert len(tokens) == 2 assert [t.text for t in tokens] == expected_tokens diff --git a/spacy/tests/lang/fr/test_text.py b/spacy/tests/lang/fr/test_text.py index 24b4c4532..01231f593 100644 --- a/spacy/tests/lang/fr/test_text.py +++ b/spacy/tests/lang/fr/test_text.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest from spacy.lang.fr.lex_attrs import like_num diff --git a/spacy/tests/lang/ga/test_tokenizer.py b/spacy/tests/lang/ga/test_tokenizer.py index 29bc1c759..78127ef7c 100644 --- a/spacy/tests/lang/ga/test_tokenizer.py +++ b/spacy/tests/lang/ga/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/gu/test_text.py b/spacy/tests/lang/gu/test_text.py index aa8d442a2..2d251166f 100644 --- a/spacy/tests/lang/gu/test_text.py +++ b/spacy/tests/lang/gu/test_text.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/he/test_tokenizer.py b/spacy/tests/lang/he/test_tokenizer.py index 
67ad964d8..3716f7e3b 100644 --- a/spacy/tests/lang/he/test_tokenizer.py +++ b/spacy/tests/lang/he/test_tokenizer.py @@ -1,8 +1,5 @@ -# encoding: utf8 -from __future__ import unicode_literals -from spacy.lang.he.lex_attrs import like_num - import pytest +from spacy.lang.he.lex_attrs import like_num @pytest.mark.parametrize( @@ -45,7 +42,6 @@ def test_he_tokenizer_handles_punct(he_tokenizer, text, expected_tokens): assert expected_tokens == [token.text for token in tokens] - @pytest.mark.parametrize( "text,match", [ @@ -68,16 +64,6 @@ def test_lex_attrs_like_number(he_tokenizer, text, match): assert tokens[0].like_num == match -@pytest.mark.parametrize( - "word", - [ - "שלישי", - "מליון", - "עשירי", - "מאה", - "עשר", - "אחד עשר", - ] -) +@pytest.mark.parametrize("word", ["שלישי", "מליון", "עשירי", "מאה", "עשר", "אחד עשר"]) def test_he_lex_attrs_like_number_for_ordinal(word): assert like_num(word) diff --git a/spacy/tests/lang/hu/test_tokenizer.py b/spacy/tests/lang/hu/test_tokenizer.py index 1ac6bfc76..fd3acd0a0 100644 --- a/spacy/tests/lang/hu/test_tokenizer.py +++ b/spacy/tests/lang/hu/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/hy/test_text.py b/spacy/tests/lang/hy/test_text.py index cbdb77e4e..ac0f1e128 100644 --- a/spacy/tests/lang/hy/test_text.py +++ b/spacy/tests/lang/hy/test_text.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest from spacy.lang.hy.lex_attrs import like_num diff --git a/spacy/tests/lang/hy/test_tokenizer.py b/spacy/tests/lang/hy/test_tokenizer.py index 3eeb8b54e..e9efb224a 100644 --- a/spacy/tests/lang/hy/test_tokenizer.py +++ b/spacy/tests/lang/hy/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/id/test_noun_chunks.py b/spacy/tests/lang/id/test_noun_chunks.py index add76f9b9..a39456581 100644 --- a/spacy/tests/lang/id/test_noun_chunks.py +++ b/spacy/tests/lang/id/test_noun_chunks.py @@ -1,16 +1,8 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest def test_noun_chunks_is_parsed_id(id_tokenizer): - """Test that noun_chunks raises Value Error for 'id' language if Doc is not parsed. - To check this test, we're constructing a Doc - with a new Vocab here and forcing is_parsed to 'False' - to make sure the noun chunks don't run. 
- """ + """Test that noun_chunks raises Value Error for 'id' language if Doc is not parsed.""" doc = id_tokenizer("sebelas") - doc.is_parsed = False with pytest.raises(ValueError): list(doc.noun_chunks) diff --git a/spacy/tests/lang/id/test_prefix_suffix_infix.py b/spacy/tests/lang/id/test_prefix_suffix_infix.py index e86a98ee3..2a81dab01 100644 --- a/spacy/tests/lang/id/test_prefix_suffix_infix.py +++ b/spacy/tests/lang/id/test_prefix_suffix_infix.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/id/test_text.py b/spacy/tests/lang/id/test_text.py index 915d268ae..ed6487b68 100644 --- a/spacy/tests/lang/id/test_text.py +++ b/spacy/tests/lang/id/test_text.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.lang.id.lex_attrs import like_num diff --git a/spacy/tests/lang/it/test_prefix_suffix_infix.py b/spacy/tests/lang/it/test_prefix_suffix_infix.py index f84351fd7..46f66b5e6 100644 --- a/spacy/tests/lang/it/test_prefix_suffix_infix.py +++ b/spacy/tests/lang/it/test_prefix_suffix_infix.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/ja/test_lemmatization.py b/spacy/tests/lang/ja/test_lemmatization.py index 58cd3f3bf..6041611e6 100644 --- a/spacy/tests/lang/ja/test_lemmatization.py +++ b/spacy/tests/lang/ja/test_lemmatization.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/ja/test_serialize.py b/spacy/tests/lang/ja/test_serialize.py index 018e645bb..e05a363bf 100644 --- a/spacy/tests/lang/ja/test_serialize.py +++ b/spacy/tests/lang/ja/test_serialize.py @@ -1,7 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - -import pytest from spacy.lang.ja import Japanese from ...util import make_tempdir @@ -11,7 +7,7 @@ def test_ja_tokenizer_serialize(ja_tokenizer): nlp = Japanese() nlp.tokenizer.from_bytes(tokenizer_bytes) assert tokenizer_bytes == nlp.tokenizer.to_bytes() - assert nlp.tokenizer.split_mode == None + assert nlp.tokenizer.split_mode is None with make_tempdir() as d: file_path = d / "tokenizer" @@ -19,10 +15,10 @@ def test_ja_tokenizer_serialize(ja_tokenizer): nlp = Japanese() nlp.tokenizer.from_disk(file_path) assert tokenizer_bytes == nlp.tokenizer.to_bytes() - assert nlp.tokenizer.split_mode == None + assert nlp.tokenizer.split_mode is None # split mode is (de)serialized correctly - nlp = Japanese(meta={"tokenizer": {"config": {"split_mode": "B"}}}) + nlp = Japanese.from_config({"nlp": {"tokenizer": {"split_mode": "B"}}}) nlp_r = Japanese() nlp_bytes = nlp.to_bytes() nlp_r.from_bytes(nlp_bytes) diff --git a/spacy/tests/lang/ja/test_tokenizer.py b/spacy/tests/lang/ja/test_tokenizer.py index 651e906eb..c8c85d655 100644 --- a/spacy/tests/lang/ja/test_tokenizer.py +++ b/spacy/tests/lang/ja/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from ...tokenizer.test_naughty_strings import NAUGHTY_STRINGS @@ -24,18 +21,36 @@ TAG_TESTS = [ ] POS_TESTS = [ - ('日本語だよ', ['fish', 'NOUN', 'AUX', 'PART']), - ('東京タワーの近くに住んでいます。', ['PROPN', 'NOUN', 'ADP', 'NOUN', 'ADP', 'VERB', 'SCONJ', 'VERB', 'AUX', 'PUNCT']), - ('吾輩は猫である。', ['PRON', 'ADP', 'NOUN', 'AUX', 'VERB', 'PUNCT']), + ('日本語だよ', ['PROPN', 'NOUN', 'AUX', 'PART']), + ('東京タワーの近くに住んでいます。', ['PROPN', 'NOUN', 'ADP', 'NOUN', 'ADP', 'VERB', 'SCONJ', 'AUX', 'AUX', 'PUNCT']), + ('吾輩は猫である。', ['PRON', 'ADP', 
'NOUN', 'AUX', 'AUX', 'PUNCT']), ('月に代わって、お仕置きよ!', ['NOUN', 'ADP', 'VERB', 'SCONJ', 'PUNCT', 'NOUN', 'NOUN', 'PART', 'PUNCT']), ('すもももももももものうち', ['NOUN', 'ADP', 'NOUN', 'ADP', 'NOUN', 'ADP', 'NOUN']) ] SENTENCE_TESTS = [ - ('あれ。これ。', ['あれ。', 'これ。']), - ('「伝染るんです。」という漫画があります。', - ['「伝染るんです。」という漫画があります。']), - ] + ("あれ。これ。", ["あれ。", "これ。"]), + ("「伝染るんです。」という漫画があります。", ["「伝染るんです。」という漫画があります。"]), +] + +tokens1 = [ + DetailedToken(surface="委員", tag="名詞-普通名詞-一般", inf="", lemma="委員", reading="イイン", sub_tokens=None), + DetailedToken(surface="会", tag="名詞-普通名詞-一般", inf="", lemma="会", reading="カイ", sub_tokens=None), +] +tokens2 = [ + DetailedToken(surface="選挙", tag="名詞-普通名詞-サ変可能", inf="", lemma="選挙", reading="センキョ", sub_tokens=None), + DetailedToken(surface="管理", tag="名詞-普通名詞-サ変可能", inf="", lemma="管理", reading="カンリ", sub_tokens=None), + DetailedToken(surface="委員", tag="名詞-普通名詞-一般", inf="", lemma="委員", reading="イイン", sub_tokens=None), + DetailedToken(surface="会", tag="名詞-普通名詞-一般", inf="", lemma="会", reading="カイ", sub_tokens=None), +] +tokens3 = [ + DetailedToken(surface="選挙", tag="名詞-普通名詞-サ変可能", inf="", lemma="選挙", reading="センキョ", sub_tokens=None), + DetailedToken(surface="管理", tag="名詞-普通名詞-サ変可能", inf="", lemma="管理", reading="カンリ", sub_tokens=None), + DetailedToken(surface="委員会", tag="名詞-普通名詞-一般", inf="", lemma="委員会", reading="イインカイ", sub_tokens=None), +] +SUB_TOKEN_TESTS = [ + ("選挙管理委員会", [None, None, None, None], [None, None, [tokens1]], [[tokens2, tokens3]]) +] # fmt: on @@ -51,7 +66,6 @@ def test_ja_tokenizer_tags(ja_tokenizer, text, expected_tags): assert tags == expected_tags -#XXX This isn't working? Always passes @pytest.mark.parametrize("text,expected_pos", POS_TESTS) def test_ja_tokenizer_pos(ja_tokenizer, text, expected_pos): pos = [token.pos_ for token in ja_tokenizer(text)] @@ -60,7 +74,7 @@ def test_ja_tokenizer_pos(ja_tokenizer, text, expected_pos): @pytest.mark.skip(reason="sentence segmentation in tokenizer is buggy") @pytest.mark.parametrize("text,expected_sents", SENTENCE_TESTS) -def test_ja_tokenizer_pos(ja_tokenizer, text, expected_sents): +def test_ja_tokenizer_sents(ja_tokenizer, text, expected_sents): sents = [str(sent) for sent in ja_tokenizer(text).sents] assert sents == expected_sents @@ -77,18 +91,19 @@ def test_ja_tokenizer_naughty_strings(ja_tokenizer, text): assert tokens.text_with_ws == text -@pytest.mark.parametrize("text,len_a,len_b,len_c", +@pytest.mark.parametrize( + "text,len_a,len_b,len_c", [ ("選挙管理委員会", 4, 3, 1), ("客室乗務員", 3, 2, 1), ("労働者協同組合", 4, 3, 1), ("機能性食品", 3, 2, 1), - ] + ], ) def test_ja_tokenizer_split_modes(ja_tokenizer, text, len_a, len_b, len_c): - nlp_a = Japanese(meta={"tokenizer": {"config": {"split_mode": "A"}}}) - nlp_b = Japanese(meta={"tokenizer": {"config": {"split_mode": "B"}}}) - nlp_c = Japanese(meta={"tokenizer": {"config": {"split_mode": "C"}}}) + nlp_a = Japanese.from_config({"nlp": {"tokenizer": {"split_mode": "A"}}}) + nlp_b = Japanese.from_config({"nlp": {"tokenizer": {"split_mode": "B"}}}) + nlp_c = Japanese.from_config({"nlp": {"tokenizer": {"split_mode": "C"}}}) assert len(ja_tokenizer(text)) == len_a assert len(nlp_a(text)) == len_a @@ -96,36 +111,15 @@ def test_ja_tokenizer_split_modes(ja_tokenizer, text, len_a, len_b, len_c): assert len(nlp_c(text)) == len_c -@pytest.mark.parametrize("text,sub_tokens_list_a,sub_tokens_list_b,sub_tokens_list_c", - [ - ( - "選挙管理委員会", - [None, None, None, None], - [None, None, [ - [ - DetailedToken(surface='委員', tag='名詞-普通名詞-一般', inf='', lemma='委員', reading='イイン', sub_tokens=None), - 
DetailedToken(surface='会', tag='名詞-普通名詞-一般', inf='', lemma='会', reading='カイ', sub_tokens=None), - ] - ]], - [[ - [ - DetailedToken(surface='選挙', tag='名詞-普通名詞-サ変可能', inf='', lemma='選挙', reading='センキョ', sub_tokens=None), - DetailedToken(surface='管理', tag='名詞-普通名詞-サ変可能', inf='', lemma='管理', reading='カンリ', sub_tokens=None), - DetailedToken(surface='委員', tag='名詞-普通名詞-一般', inf='', lemma='委員', reading='イイン', sub_tokens=None), - DetailedToken(surface='会', tag='名詞-普通名詞-一般', inf='', lemma='会', reading='カイ', sub_tokens=None), - ], [ - DetailedToken(surface='選挙', tag='名詞-普通名詞-サ変可能', inf='', lemma='選挙', reading='センキョ', sub_tokens=None), - DetailedToken(surface='管理', tag='名詞-普通名詞-サ変可能', inf='', lemma='管理', reading='カンリ', sub_tokens=None), - DetailedToken(surface='委員会', tag='名詞-普通名詞-一般', inf='', lemma='委員会', reading='イインカイ', sub_tokens=None), - ] - ]] - ), - ] +@pytest.mark.parametrize( + "text,sub_tokens_list_a,sub_tokens_list_b,sub_tokens_list_c", SUB_TOKEN_TESTS ) -def test_ja_tokenizer_sub_tokens(ja_tokenizer, text, sub_tokens_list_a, sub_tokens_list_b, sub_tokens_list_c): - nlp_a = Japanese(meta={"tokenizer": {"config": {"split_mode": "A"}}}) - nlp_b = Japanese(meta={"tokenizer": {"config": {"split_mode": "B"}}}) - nlp_c = Japanese(meta={"tokenizer": {"config": {"split_mode": "C"}}}) +def test_ja_tokenizer_sub_tokens( + ja_tokenizer, text, sub_tokens_list_a, sub_tokens_list_b, sub_tokens_list_c +): + nlp_a = Japanese.from_config({"nlp": {"tokenizer": {"split_mode": "A"}}}) + nlp_b = Japanese.from_config({"nlp": {"tokenizer": {"split_mode": "B"}}}) + nlp_c = Japanese.from_config({"nlp": {"tokenizer": {"split_mode": "C"}}}) assert ja_tokenizer(text).user_data["sub_tokens"] == sub_tokens_list_a assert nlp_a(text).user_data["sub_tokens"] == sub_tokens_list_a @@ -133,16 +127,19 @@ def test_ja_tokenizer_sub_tokens(ja_tokenizer, text, sub_tokens_list_a, sub_toke assert nlp_c(text).user_data["sub_tokens"] == sub_tokens_list_c -@pytest.mark.parametrize("text,inflections,reading_forms", +@pytest.mark.parametrize( + "text,inflections,reading_forms", [ ( "取ってつけた", ("五段-ラ行,連用形-促音便", "", "下一段-カ行,連用形-一般", "助動詞-タ,終止形-一般"), ("トッ", "テ", "ツケ", "タ"), ), - ] + ], ) -def test_ja_tokenizer_inflections_reading_forms(ja_tokenizer, text, inflections, reading_forms): +def test_ja_tokenizer_inflections_reading_forms( + ja_tokenizer, text, inflections, reading_forms +): assert ja_tokenizer(text).user_data["inflections"] == inflections assert ja_tokenizer(text).user_data["reading_forms"] == reading_forms diff --git a/spacy/tests/lang/ko/test_lemmatization.py b/spacy/tests/lang/ko/test_lemmatization.py index 42c306c11..7782ca4bc 100644 --- a/spacy/tests/lang/ko/test_lemmatization.py +++ b/spacy/tests/lang/ko/test_lemmatization.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/ko/test_tokenizer.py b/spacy/tests/lang/ko/test_tokenizer.py index b8fe7959c..eac309857 100644 --- a/spacy/tests/lang/ko/test_tokenizer.py +++ b/spacy/tests/lang/ko/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest # fmt: off diff --git a/spacy/tests/lang/lb/test_exceptions.py b/spacy/tests/lang/lb/test_exceptions.py index ebfab75cf..fc4b4fa7b 100644 --- a/spacy/tests/lang/lb/test_exceptions.py +++ b/spacy/tests/lang/lb/test_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest @@ -21,4 +18,3 @@ def test_lb_tokenizer_handles_exc_in_text(lb_tokenizer): tokens = lb_tokenizer(text) 
assert len(tokens) == 9 assert tokens[1].text == "'t" - assert tokens[1].lemma_ == "et" diff --git a/spacy/tests/lang/lb/test_prefix_suffix_infix.py b/spacy/tests/lang/lb/test_prefix_suffix_infix.py index d85f932be..3958d1543 100644 --- a/spacy/tests/lang/lb/test_prefix_suffix_infix.py +++ b/spacy/tests/lang/lb/test_prefix_suffix_infix.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/lb/test_text.py b/spacy/tests/lang/lb/test_text.py index 36464b379..b0ba76b6b 100644 --- a/spacy/tests/lang/lb/test_text.py +++ b/spacy/tests/lang/lb/test_text.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/lt/test_text.py b/spacy/tests/lang/lt/test_text.py index bb9c75383..9e2b612b9 100644 --- a/spacy/tests/lang/lt/test_text.py +++ b/spacy/tests/lang/lt/test_text.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/ml/test_text.py b/spacy/tests/lang/ml/test_text.py index 2883cf5bb..aced78461 100644 --- a/spacy/tests/lang/ml/test_text.py +++ b/spacy/tests/lang/ml/test_text.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/nb/test_noun_chunks.py b/spacy/tests/lang/nb/test_noun_chunks.py index 653491a64..dd259f2b7 100644 --- a/spacy/tests/lang/nb/test_noun_chunks.py +++ b/spacy/tests/lang/nb/test_noun_chunks.py @@ -1,16 +1,8 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest def test_noun_chunks_is_parsed_nb(nb_tokenizer): - """Test that noun_chunks raises Value Error for 'nb' language if Doc is not parsed. - To check this test, we're constructing a Doc - with a new Vocab here and forcing is_parsed to 'False' - to make sure the noun chunks don't run. - """ + """Test that noun_chunks raises Value Error for 'nb' language if Doc is not parsed.""" doc = nb_tokenizer("Smørsausen brukes bl.a. 
til") - doc.is_parsed = False with pytest.raises(ValueError): list(doc.noun_chunks) diff --git a/spacy/tests/lang/nb/test_tokenizer.py b/spacy/tests/lang/nb/test_tokenizer.py index f72d310e8..2da6e8d40 100644 --- a/spacy/tests/lang/nb/test_tokenizer.py +++ b/spacy/tests/lang/nb/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/ne/test_text.py b/spacy/tests/lang/ne/test_text.py index 926a7de04..e8a6c2e98 100644 --- a/spacy/tests/lang/ne/test_text.py +++ b/spacy/tests/lang/ne/test_text.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest @@ -11,9 +8,8 @@ def test_ne_tokenizer_handlers_long_text(ne_tokenizer): @pytest.mark.parametrize( - "text,length", - [("समय जान कति पनि बेर लाग्दैन ।", 7), ("म ठूलो हुँदै थिएँ ।", 5)], + "text,length", [("समय जान कति पनि बेर लाग्दैन ।", 7), ("म ठूलो हुँदै थिएँ ।", 5)] ) def test_ne_tokenizer_handles_cnts(ne_tokenizer, text, length): tokens = ne_tokenizer(text) - assert len(tokens) == length \ No newline at end of file + assert len(tokens) == length diff --git a/spacy/tests/lang/nl/test_text.py b/spacy/tests/lang/nl/test_text.py index 4045b1c39..8bc72cc6d 100644 --- a/spacy/tests/lang/nl/test_text.py +++ b/spacy/tests/lang/nl/test_text.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.lang.nl.lex_attrs import like_num diff --git a/spacy/tests/lang/pl/test_text.py b/spacy/tests/lang/pl/test_text.py index ec9b18084..e8654a498 100644 --- a/spacy/tests/lang/pl/test_text.py +++ b/spacy/tests/lang/pl/test_text.py @@ -1,9 +1,4 @@ -# coding: utf-8 """Words like numbers are recognized correctly.""" - - -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/pl/test_tokenizer.py b/spacy/tests/lang/pl/test_tokenizer.py index 9f4f5a38d..44b1be9a6 100644 --- a/spacy/tests/lang/pl/test_tokenizer.py +++ b/spacy/tests/lang/pl/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest DOT_TESTS = [ diff --git a/spacy/tests/lang/pt/test_text.py b/spacy/tests/lang/pt/test_text.py index 39dfff2c1..3a9162b80 100644 --- a/spacy/tests/lang/pt/test_text.py +++ b/spacy/tests/lang/pt/test_text.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.lang.pt.lex_attrs import like_num diff --git a/spacy/tests/lang/ro/test_tokenizer.py b/spacy/tests/lang/ro/test_tokenizer.py index a327174e5..64c072470 100644 --- a/spacy/tests/lang/ro/test_tokenizer.py +++ b/spacy/tests/lang/ro/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/ru/test_exceptions.py b/spacy/tests/lang/ru/test_exceptions.py index a8f0c3429..4fb417df8 100644 --- a/spacy/tests/lang/ru/test_exceptions.py +++ b/spacy/tests/lang/ru/test_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/ru/test_lemmatizer.py b/spacy/tests/lang/ru/test_lemmatizer.py index b228fded8..3810323bf 100644 --- a/spacy/tests/lang/ru/test_lemmatizer.py +++ b/spacy/tests/lang/ru/test_lemmatizer.py @@ -1,19 +1,17 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest - -from ...util import get_doc +from spacy.tokens import Doc -def test_ru_doc_lemmatization(ru_tokenizer): +def test_ru_doc_lemmatization(ru_lemmatizer): words = ["мама", "мыла", "раму"] - tags = [ - 
"NOUN__Animacy=Anim|Case=Nom|Gender=Fem|Number=Sing", - "VERB__Aspect=Imp|Gender=Fem|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act", - "NOUN__Animacy=Anim|Case=Acc|Gender=Fem|Number=Sing", + pos = ["NOUN", "VERB", "NOUN"] + morphs = [ + "Animacy=Anim|Case=Nom|Gender=Fem|Number=Sing", + "Aspect=Imp|Gender=Fem|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act", + "Animacy=Anim|Case=Acc|Gender=Fem|Number=Sing", ] - doc = get_doc(ru_tokenizer.vocab, words=words, tags=tags) + doc = Doc(ru_lemmatizer.vocab, words=words, pos=pos, morphs=morphs) + doc = ru_lemmatizer(doc) lemmas = [token.lemma_ for token in doc] assert lemmas == ["мама", "мыть", "рама"] @@ -29,43 +27,51 @@ def test_ru_doc_lemmatization(ru_tokenizer): ], ) def test_ru_lemmatizer_noun_lemmas(ru_lemmatizer, text, lemmas): - assert sorted(ru_lemmatizer.noun(text)) == lemmas + doc = Doc(ru_lemmatizer.vocab, words=[text], pos=["NOUN"]) + result_lemmas = ru_lemmatizer.pymorphy2_lemmatize(doc[0]) + assert sorted(result_lemmas) == lemmas @pytest.mark.parametrize( - "text,pos,morphology,lemma", + "text,pos,morph,lemma", [ - ("рой", "NOUN", None, "рой"), - ("рой", "VERB", None, "рыть"), - ("клей", "NOUN", None, "клей"), - ("клей", "VERB", None, "клеить"), - ("три", "NUM", None, "три"), - ("кос", "NOUN", {"Number": "Sing"}, "кос"), - ("кос", "NOUN", {"Number": "Plur"}, "коса"), - ("кос", "ADJ", None, "косой"), - ("потом", "NOUN", None, "пот"), - ("потом", "ADV", None, "потом"), + ("рой", "NOUN", "", "рой"), + ("рой", "VERB", "", "рыть"), + ("клей", "NOUN", "", "клей"), + ("клей", "VERB", "", "клеить"), + ("три", "NUM", "", "три"), + ("кос", "NOUN", "Number=Sing", "кос"), + ("кос", "NOUN", "Number=Plur", "коса"), + ("кос", "ADJ", "", "косой"), + ("потом", "NOUN", "", "пот"), + ("потом", "ADV", "", "потом"), ], ) def test_ru_lemmatizer_works_with_different_pos_homonyms( - ru_lemmatizer, text, pos, morphology, lemma + ru_lemmatizer, text, pos, morph, lemma ): - assert ru_lemmatizer(text, pos, morphology) == [lemma] + doc = Doc(ru_lemmatizer.vocab, words=[text], pos=[pos], morphs=[morph]) + result_lemmas = ru_lemmatizer.pymorphy2_lemmatize(doc[0]) + assert result_lemmas == [lemma] @pytest.mark.parametrize( - "text,morphology,lemma", + "text,morph,lemma", [ - ("гвоздики", {"Gender": "Fem"}, "гвоздика"), - ("гвоздики", {"Gender": "Masc"}, "гвоздик"), - ("вина", {"Gender": "Fem"}, "вина"), - ("вина", {"Gender": "Neut"}, "вино"), + ("гвоздики", "Gender=Fem", "гвоздика"), + ("гвоздики", "Gender=Masc", "гвоздик"), + ("вина", "Gender=Fem", "вина"), + ("вина", "Gender=Neut", "вино"), ], ) -def test_ru_lemmatizer_works_with_noun_homonyms(ru_lemmatizer, text, morphology, lemma): - assert ru_lemmatizer.noun(text, morphology) == [lemma] +def test_ru_lemmatizer_works_with_noun_homonyms(ru_lemmatizer, text, morph, lemma): + doc = Doc(ru_lemmatizer.vocab, words=[text], pos=["NOUN"], morphs=[morph]) + result_lemmas = ru_lemmatizer.pymorphy2_lemmatize(doc[0]) + assert result_lemmas == [lemma] def test_ru_lemmatizer_punct(ru_lemmatizer): - assert ru_lemmatizer.punct("«") == ['"'] - assert ru_lemmatizer.punct("»") == ['"'] + doc = Doc(ru_lemmatizer.vocab, words=["«"], pos=["PUNCT"]) + assert ru_lemmatizer.pymorphy2_lemmatize(doc[0]) == ['"'] + doc = Doc(ru_lemmatizer.vocab, words=["»"], pos=["PUNCT"]) + assert ru_lemmatizer.pymorphy2_lemmatize(doc[0]) == ['"'] diff --git a/spacy/tests/lang/ru/test_text.py b/spacy/tests/lang/ru/test_text.py index c5bff6973..b0eaf66bb 100644 --- a/spacy/tests/lang/ru/test_text.py +++ b/spacy/tests/lang/ru/test_text.py 
@@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.lang.ru.lex_attrs import like_num diff --git a/spacy/tests/lang/ru/test_tokenizer.py b/spacy/tests/lang/ru/test_tokenizer.py index 5507f9f09..1cfdc50ee 100644 --- a/spacy/tests/lang/ru/test_tokenizer.py +++ b/spacy/tests/lang/ru/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest @@ -80,7 +77,6 @@ def test_ru_tokenizer_splits_open_appostrophe(ru_tokenizer, text): assert tokens[0].text == "'" -@pytest.mark.xfail @pytest.mark.parametrize("text", ["Тест''"]) def test_ru_tokenizer_splits_double_end_quote(ru_tokenizer, text): tokens = ru_tokenizer(text) diff --git a/spacy/tests/lang/sa/test_text.py b/spacy/tests/lang/sa/test_text.py index 7c961bdae..daa8d20c0 100644 --- a/spacy/tests/lang/sa/test_text.py +++ b/spacy/tests/lang/sa/test_text.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest @@ -13,7 +10,7 @@ def test_sa_tokenizer_handles_long_text(sa_tokenizer): @pytest.mark.parametrize( "text,length", [ - ("श्री भगवानुवाच पश्य मे पार्थ रूपाणि शतशोऽथ सहस्रशः।", 9,), + ("श्री भगवानुवाच पश्य मे पार्थ रूपाणि शतशोऽथ सहस्रशः।", 9), ("गुणान् सर्वान् स्वभावो मूर्ध्नि वर्तते ।", 6), ], ) diff --git a/spacy/tests/lang/sr/test_exceptions.py b/spacy/tests/lang/sr/test_exceptions.py index 285e99996..fa92e5e2d 100644 --- a/spacy/tests/lang/sr/test_exceptions.py +++ b/spacy/tests/lang/sr/test_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/sr/test_tokenizer.py b/spacy/tests/lang/sr/test_tokenizer.py index c4672b3ef..fdcf790d8 100644 --- a/spacy/tests/lang/sr/test_tokenizer.py +++ b/spacy/tests/lang/sr/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest @@ -80,7 +77,6 @@ def test_sr_tokenizer_splits_open_appostrophe(sr_tokenizer, text): assert tokens[0].text == "'" -@pytest.mark.xfail @pytest.mark.parametrize("text", ["Тест''"]) def test_sr_tokenizer_splits_double_end_quote(sr_tokenizer, text): tokens = sr_tokenizer(text) diff --git a/spacy/tests/lang/sv/test_exceptions.py b/spacy/tests/lang/sv/test_exceptions.py index 7c6fd5464..e6cae4d2b 100644 --- a/spacy/tests/lang/sv/test_exceptions.py +++ b/spacy/tests/lang/sv/test_exceptions.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/sv/test_lex_attrs.py b/spacy/tests/lang/sv/test_lex_attrs.py index abe6b0f7b..656c4706b 100644 --- a/spacy/tests/lang/sv/test_lex_attrs.py +++ b/spacy/tests/lang/sv/test_lex_attrs.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.lang.sv.lex_attrs import like_num diff --git a/spacy/tests/lang/sv/test_noun_chunks.py b/spacy/tests/lang/sv/test_noun_chunks.py index a6283b65e..d2410156c 100644 --- a/spacy/tests/lang/sv/test_noun_chunks.py +++ b/spacy/tests/lang/sv/test_noun_chunks.py @@ -1,19 +1,10 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest - -from ...util import get_doc +from spacy.tokens import Doc def test_noun_chunks_is_parsed_sv(sv_tokenizer): - """Test that noun_chunks raises Value Error for 'sv' language if Doc is not parsed. - To check this test, we're constructing a Doc - with a new Vocab here and forcing is_parsed to 'False' - to make sure the noun chunks don't run. 
- """ + """Test that noun_chunks raises Value Error for 'sv' language if Doc is not parsed.""" doc = sv_tokenizer("Studenten läste den bästa boken") - doc.is_parsed = False with pytest.raises(ValueError): list(doc.noun_chunks) @@ -23,21 +14,21 @@ SV_NP_TEST_EXAMPLES = [ "En student läste en bok", # A student read a book ["DET", "NOUN", "VERB", "DET", "NOUN"], ["det", "nsubj", "ROOT", "det", "dobj"], - [1, 1, 0, 1, -2], + [1, 2, 2, 4, 2], ["En student", "en bok"], ), ( "Studenten läste den bästa boken.", # The student read the best book ["NOUN", "VERB", "DET", "ADJ", "NOUN", "PUNCT"], ["nsubj", "ROOT", "det", "amod", "dobj", "punct"], - [1, 0, 2, 1, -3, -4], + [1, 1, 4, 4, 1, 1], ["Studenten", "den bästa boken"], ), ( "De samvetslösa skurkarna hade stulit de största juvelerna på söndagen", # The remorseless crooks had stolen the largest jewels that sunday ["DET", "ADJ", "NOUN", "VERB", "VERB", "DET", "ADJ", "NOUN", "ADP", "NOUN"], ["det", "amod", "nsubj", "aux", "root", "det", "amod", "dobj", "case", "nmod"], - [2, 1, 2, 1, 0, 2, 1, -3, 1, -5], + [2, 2, 4, 4, 4, 7, 7, 4, 9, 4], ["De samvetslösa skurkarna", "de största juvelerna", "på söndagen"], ), ] @@ -48,12 +39,9 @@ SV_NP_TEST_EXAMPLES = [ ) def test_sv_noun_chunks(sv_tokenizer, text, pos, deps, heads, expected_noun_chunks): tokens = sv_tokenizer(text) - assert len(heads) == len(pos) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps, pos=pos - ) - + words = [t.text for t in tokens] + doc = Doc(tokens.vocab, words=words, heads=heads, deps=deps, pos=pos) noun_chunks = list(doc.noun_chunks) assert len(noun_chunks) == len(expected_noun_chunks) for i, np in enumerate(noun_chunks): diff --git a/spacy/tests/lang/sv/test_prefix_suffix_infix.py b/spacy/tests/lang/sv/test_prefix_suffix_infix.py index f3fdd9a9e..bbb0ff415 100644 --- a/spacy/tests/lang/sv/test_prefix_suffix_infix.py +++ b/spacy/tests/lang/sv/test_prefix_suffix_infix.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/sv/test_text.py b/spacy/tests/lang/sv/test_text.py index 9ea1851ae..1e26c45bc 100644 --- a/spacy/tests/lang/sv/test_text.py +++ b/spacy/tests/lang/sv/test_text.py @@ -1,7 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - - def test_sv_tokenizer_handles_long_text(sv_tokenizer): text = """Det var så härligt ute på landet. 
Det var sommar, majsen var gul, havren grön, höet var uppställt i stackar nere vid den gröna ängen, och där gick storken på sina långa, diff --git a/spacy/tests/lang/sv/test_tokenizer.py b/spacy/tests/lang/sv/test_tokenizer.py index 894b5aa6a..8871f4414 100644 --- a/spacy/tests/lang/sv/test_tokenizer.py +++ b/spacy/tests/lang/sv/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/test_attrs.py b/spacy/tests/lang/test_attrs.py index 4bb5aac70..b39109455 100644 --- a/spacy/tests/lang/test_attrs.py +++ b/spacy/tests/lang/test_attrs.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.attrs import intify_attrs, ORTH, NORM, LEMMA, IS_ALPHA from spacy.lang.lex_attrs import is_punct, is_ascii, is_currency, like_url, word_shape diff --git a/spacy/tests/lang/test_initialize.py b/spacy/tests/lang/test_initialize.py index 5c701fc22..de1871e64 100644 --- a/spacy/tests/lang/test_initialize.py +++ b/spacy/tests/lang/test_initialize.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.util import get_lang_class diff --git a/spacy/tests/lang/test_lemmatizers.py b/spacy/tests/lang/test_lemmatizers.py new file mode 100644 index 000000000..a49d70d6b --- /dev/null +++ b/spacy/tests/lang/test_lemmatizers.py @@ -0,0 +1,54 @@ +import pytest +from spacy import registry +from spacy.lookups import Lookups +from spacy.util import get_lang_class + + +# fmt: off +# Only include languages with no external dependencies +# excluded: ru, uk +# excluded for custom tables: pl +LANGUAGES = ["bn", "el", "en", "fa", "fr", "nb", "nl", "sv"] +# fmt: on + + +@pytest.mark.parametrize("lang", LANGUAGES) +def test_lemmatizer_initialize(lang, capfd): + @registry.misc("lemmatizer_init_lookups") + def lemmatizer_init_lookups(): + lookups = Lookups() + lookups.add_table("lemma_lookup", {"cope": "cope", "x": "y"}) + lookups.add_table("lemma_index", {"verb": ("cope", "cop")}) + lookups.add_table("lemma_exc", {"verb": {"coping": ("cope",)}}) + lookups.add_table("lemma_rules", {"verb": [["ing", ""]]}) + return lookups + + lang_cls = get_lang_class(lang) + # Test that languages can be initialized + nlp = lang_cls() + lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"}) + assert not lemmatizer.lookups.tables + nlp.config["initialize"]["components"]["lemmatizer"] = { + "lookups": {"@misc": "lemmatizer_init_lookups"} + } + with pytest.raises(ValueError): + nlp("x") + nlp.initialize() + assert lemmatizer.lookups.tables + doc = nlp("x") + # Check for stray print statements (see #3342) + captured = capfd.readouterr() + assert not captured.out + assert doc[0].lemma_ == "y" + + # Test initialization by calling .initialize() directly + nlp = lang_cls() + lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"}) + lemmatizer.initialize(lookups=lemmatizer_init_lookups()) + assert nlp("x")[0].lemma_ == "y" + + # Test lookups config format + for mode in ("rule", "lookup", "pos_lookup"): + required, optional = lemmatizer.get_lookups_config(mode) + assert isinstance(required, list) + assert isinstance(optional, list) diff --git a/spacy/tests/lang/th/test_tokenizer.py b/spacy/tests/lang/th/test_tokenizer.py index 265c7753d..1e1ba52dc 100644 --- a/spacy/tests/lang/th/test_tokenizer.py +++ b/spacy/tests/lang/th/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest diff --git 
a/spacy/tests/lang/tr/test_noun_chunks.py b/spacy/tests/lang/tr/test_noun_chunks.py index 98a1f355f..003e4f08e 100644 --- a/spacy/tests/lang/tr/test_noun_chunks.py +++ b/spacy/tests/lang/tr/test_noun_chunks.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest @@ -11,6 +8,5 @@ def test_noun_chunks_is_parsed(tr_tokenizer): to make sure the noun chunks don't run. """ doc = tr_tokenizer("Dün seni gördüm.") - doc.is_parsed = False with pytest.raises(ValueError): list(doc.noun_chunks) diff --git a/spacy/tests/lang/tr/test_parser.py b/spacy/tests/lang/tr/test_parser.py index 707b0183d..b23d0869c 100644 --- a/spacy/tests/lang/tr/test_parser.py +++ b/spacy/tests/lang/tr/test_parser.py @@ -1,17 +1,14 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from ...util import get_doc +from spacy.tokens import Doc def test_tr_noun_chunks_amod_simple(tr_tokenizer): text = "sarı kedi" - heads = [1, 0] + heads = [1, 1] deps = ["amod", "ROOT"] - tags = ["ADJ", "NOUN"] + pos = ["ADJ", "NOUN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 1 @@ -20,12 +17,12 @@ def test_tr_noun_chunks_amod_simple(tr_tokenizer): def test_tr_noun_chunks_nmod_simple(tr_tokenizer): text = "arkadaşımın kedisi" # my friend's cat - heads = [1, 0] + heads = [1, 1] deps = ["nmod", "ROOT"] - tags = ["NOUN", "NOUN"] + pos = ["NOUN", "NOUN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 1 @@ -34,12 +31,12 @@ def test_tr_noun_chunks_nmod_simple(tr_tokenizer): def test_tr_noun_chunks_determiner_simple(tr_tokenizer): text = "O kedi" # that cat - heads = [1, 0] + heads = [1, 1] deps = ["det", "ROOT"] - tags = ["DET", "NOUN"] + pos = ["DET", "NOUN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 1 @@ -48,12 +45,12 @@ def test_tr_noun_chunks_determiner_simple(tr_tokenizer): def test_tr_noun_chunks_nmod_amod(tr_tokenizer): text = "okulun eski müdürü" - heads = [2, 1, 0] + heads = [2, 2, 2] deps = ["nmod", "amod", "ROOT"] - tags = ["NOUN", "ADJ", "NOUN"] + pos = ["NOUN", "ADJ", "NOUN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 1 @@ -62,12 +59,12 @@ def test_tr_noun_chunks_nmod_amod(tr_tokenizer): def test_tr_noun_chunks_one_det_one_adj_simple(tr_tokenizer): text = "O sarı kedi" - heads = [2, 1, 0] + heads = [2, 2, 2] deps = ["det", "amod", "ROOT"] - tags = ["DET", "ADJ", "NOUN"] + pos = ["DET", "ADJ", "NOUN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 1 @@ -76,12 
+73,12 @@ def test_tr_noun_chunks_one_det_one_adj_simple(tr_tokenizer): def test_tr_noun_chunks_two_adjs_simple(tr_tokenizer): text = "beyaz tombik kedi" - heads = [2, 1, 0] + heads = [2, 2, 2] deps = ["amod", "amod", "ROOT"] - tags = ["ADJ", "ADJ", "NOUN"] + pos = ["ADJ", "ADJ", "NOUN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 1 @@ -90,12 +87,12 @@ def test_tr_noun_chunks_two_adjs_simple(tr_tokenizer): def test_tr_noun_chunks_one_det_two_adjs_simple(tr_tokenizer): text = "o beyaz tombik kedi" - heads = [3, 2, 1, 0] + heads = [3, 3, 3, 3] deps = ["det", "amod", "amod", "ROOT"] - tags = ["DET", "ADJ", "ADJ", "NOUN"] + pos = ["DET", "ADJ", "ADJ", "NOUN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 1 @@ -104,12 +101,12 @@ def test_tr_noun_chunks_one_det_two_adjs_simple(tr_tokenizer): def test_tr_noun_chunks_nmod_two(tr_tokenizer): text = "kızın saçının rengi" - heads = [1, 1, 0] + heads = [1, 2, 2] deps = ["nmod", "nmod", "ROOT"] - tags = ["NOUN", "NOUN", "NOUN"] + pos = ["NOUN", "NOUN", "NOUN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 1 @@ -118,12 +115,12 @@ def test_tr_noun_chunks_nmod_two(tr_tokenizer): def test_tr_noun_chunks_chain_nmod_with_adj(tr_tokenizer): text = "ev sahibinin tatlı köpeği" - heads = [1, 2, 1, 0] + heads = [1, 3, 3, 3] deps = ["nmod", "nmod", "amod", "ROOT"] - tags = ["NOUN", "NOUN", "ADJ", "NOUN"] + pos = ["NOUN", "NOUN", "ADJ", "NOUN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 1 @@ -132,12 +129,12 @@ def test_tr_noun_chunks_chain_nmod_with_adj(tr_tokenizer): def test_tr_noun_chunks_chain_nmod_with_acl(tr_tokenizer): text = "ev sahibinin gelen köpeği" - heads = [1, 2, 1, 0] + heads = [1, 3, 3, 3] deps = ["nmod", "nmod", "acl", "ROOT"] - tags = ["NOUN", "NOUN", "VERB", "NOUN"] + pos = ["NOUN", "NOUN", "VERB", "NOUN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 1 @@ -146,12 +143,12 @@ def test_tr_noun_chunks_chain_nmod_with_acl(tr_tokenizer): def test_tr_noun_chunks_chain_nmod_head_with_amod_acl(tr_tokenizer): text = "arabanın kırdığım sol aynası" - heads = [3, 2, 1, 0] + heads = [3, 3, 3, 3] deps = ["nmod", "acl", "amod", "ROOT"] - tags = ["NOUN", "VERB", "ADJ", "NOUN"] + pos = ["NOUN", "VERB", "ADJ", "NOUN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) 
chunks = list(doc.noun_chunks) assert len(chunks) == 1 @@ -160,12 +157,12 @@ def test_tr_noun_chunks_chain_nmod_head_with_amod_acl(tr_tokenizer): def test_tr_noun_chunks_nmod_three(tr_tokenizer): text = "güney Afrika ülkelerinden Mozambik" - heads = [1, 1, 1, 0] + heads = [1, 2, 3, 3] deps = ["nmod", "nmod", "nmod", "ROOT"] - tags = ["NOUN", "PROPN", "NOUN", "PROPN"] + pos = ["NOUN", "PROPN", "NOUN", "PROPN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 1 @@ -174,12 +171,12 @@ def test_tr_noun_chunks_nmod_three(tr_tokenizer): def test_tr_noun_chunks_det_amod_nmod(tr_tokenizer): text = "bazı eski oyun kuralları" - heads = [3, 2, 1, 0] + heads = [3, 3, 3, 3] deps = ["det", "nmod", "nmod", "ROOT"] - tags = ["DET", "ADJ", "NOUN", "NOUN"] + pos = ["DET", "ADJ", "NOUN", "NOUN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 1 @@ -188,12 +185,12 @@ def test_tr_noun_chunks_det_amod_nmod(tr_tokenizer): def test_tr_noun_chunks_acl_simple(tr_tokenizer): text = "bahçesi olan okul" - heads = [2, -1, 0] + heads = [2, 0, 2] deps = ["acl", "cop", "ROOT"] - tags = ["NOUN", "AUX", "NOUN"] + pos = ["NOUN", "AUX", "NOUN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 1 @@ -202,12 +199,12 @@ def test_tr_noun_chunks_acl_simple(tr_tokenizer): def test_tr_noun_chunks_acl_verb(tr_tokenizer): text = "sevdiğim sanatçılar" - heads = [1, 0] + heads = [1, 1] deps = ["acl", "ROOT"] - tags = ["VERB", "NOUN"] + pos = ["VERB", "NOUN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 1 @@ -216,26 +213,26 @@ def test_tr_noun_chunks_acl_verb(tr_tokenizer): def test_tr_noun_chunks_acl_nmod(tr_tokenizer): text = "en sevdiğim ses sanatçısı" - heads = [1, 2, 1, 0] + heads = [1, 3, 3, 3] deps = ["advmod", "acl", "nmod", "ROOT"] - tags = ["ADV", "VERB", "NOUN", "NOUN"] + pos = ["ADV", "VERB", "NOUN", "NOUN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 1 assert chunks[0].text_with_ws == "en sevdiğim ses sanatçısı " -def test_tr_noun_chunks_acl_nmod(tr_tokenizer): +def test_tr_noun_chunks_acl_nmod2(tr_tokenizer): text = "bildiğim bir turizm şirketi" - heads = [3, 2, 1, 0] + heads = [3, 3, 3, 3] deps = ["acl", "det", "nmod", "ROOT"] - tags = ["VERB", "DET", "NOUN", "NOUN"] + pos = ["VERB", "DET", "NOUN", "NOUN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, 
heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 1 @@ -244,12 +241,12 @@ def test_tr_noun_chunks_acl_nmod(tr_tokenizer): def test_tr_noun_chunks_np_recursive_nsubj_to_root(tr_tokenizer): text = "Simge'nin okuduğu kitap" - heads = [1, 1, 0] + heads = [1, 2, 2] deps = ["nsubj", "acl", "ROOT"] - tags = ["PROPN", "VERB", "NOUN"] + pos = ["PROPN", "VERB", "NOUN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 1 @@ -258,12 +255,12 @@ def test_tr_noun_chunks_np_recursive_nsubj_to_root(tr_tokenizer): def test_tr_noun_chunks_np_recursive_nsubj_attached_to_pron_root(tr_tokenizer): text = "Simge'nin konuşabileceği birisi" - heads = [1, 1, 0] + heads = [1, 2, 2] deps = ["nsubj", "acl", "ROOT"] - tags = ["PROPN", "VERB", "PRON"] + pos = ["PROPN", "VERB", "PRON"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 1 @@ -272,12 +269,12 @@ def test_tr_noun_chunks_np_recursive_nsubj_attached_to_pron_root(tr_tokenizer): def test_tr_noun_chunks_np_recursive_nsubj_in_subnp(tr_tokenizer): text = "Simge'nin yarın gideceği yer" - heads = [2, 1, 1, 0] + heads = [2, 2, 3, 3] deps = ["nsubj", "obl", "acl", "ROOT"] - tags = ["PROPN", "NOUN", "VERB", "NOUN"] + pos = ["PROPN", "NOUN", "VERB", "NOUN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 1 @@ -286,12 +283,12 @@ def test_tr_noun_chunks_np_recursive_nsubj_in_subnp(tr_tokenizer): def test_tr_noun_chunks_np_recursive_two_nmods(tr_tokenizer): text = "ustanın kapısını degiştireceği çamasır makinası" - heads = [2, 1, 2, 1, 0] + heads = [2, 2, 4, 4, 4] deps = ["nsubj", "obj", "acl", "nmod", "ROOT"] - tags = ["NOUN", "NOUN", "VERB", "NOUN", "NOUN"] + pos = ["NOUN", "NOUN", "VERB", "NOUN", "NOUN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 1 @@ -300,26 +297,26 @@ def test_tr_noun_chunks_np_recursive_two_nmods(tr_tokenizer): def test_tr_noun_chunks_np_recursive_four_nouns(tr_tokenizer): text = "kızına piyano dersi verdiğim hanım" - heads = [3, 1, 1, 1, 0] + heads = [3, 2, 3, 4, 4] deps = ["obl", "nmod", "obj", "acl", "ROOT"] - tags = ["NOUN", "NOUN", "NOUN", "VERB", "NOUN"] + pos = ["NOUN", "NOUN", "NOUN", "VERB", "NOUN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 1 assert chunks[0].text_with_ws == "kızına piyano dersi verdiğim hanım " - + def test_tr_noun_chunks_np_recursive_no_nmod(tr_tokenizer): text = "içine birkaç çiçek konmuş olan bir vazo" - heads = [3, 1, 1, 3, -1, 1, 0] + heads = [3, 2, 3, 6, 3, 6, 6] deps = 
["obl", "det", "nsubj", "acl", "aux", "det", "ROOT"] - tags = ["ADP", "DET", "NOUN", "VERB", "AUX", "DET", "NOUN"] + pos = ["ADP", "DET", "NOUN", "VERB", "AUX", "DET", "NOUN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 1 @@ -328,39 +325,43 @@ def test_tr_noun_chunks_np_recursive_no_nmod(tr_tokenizer): def test_tr_noun_chunks_np_recursive_long_two_acls(tr_tokenizer): text = "içine Simge'nin bahçesinden toplanmış birkaç çiçeğin konmuş olduğu bir vazo" - heads = [6, 1, 1, 2, 1, 1, 3, -1, 1, 0] - deps = ["obl", "nmod" , "obl", "acl", "det", "nsubj", "acl", "aux", "det", "ROOT"] - tags = ["ADP", "PROPN", "NOUN", "VERB", "DET", "NOUN", "VERB", "AUX", "DET", "NOUN"] + heads = [6, 2, 3, 5, 5, 6, 9, 6, 9, 9] + deps = ["obl", "nmod", "obl", "acl", "det", "nsubj", "acl", "aux", "det", "ROOT"] + pos = ["ADP", "PROPN", "NOUN", "VERB", "DET", "NOUN", "VERB", "AUX", "DET", "NOUN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 1 - assert chunks[0].text_with_ws == "içine Simge'nin bahçesinden toplanmış birkaç çiçeğin konmuş olduğu bir vazo " + assert ( + chunks[0].text_with_ws + == "içine Simge'nin bahçesinden toplanmış birkaç çiçeğin konmuş olduğu bir vazo " + ) def test_tr_noun_chunks_two_nouns_in_nmod(tr_tokenizer): text = "kız ve erkek çocuklar" - heads = [3, 1, -2, 0] + heads = [3, 2, 0, 3] deps = ["nmod", "cc", "conj", "ROOT"] - tags = ["NOUN", "CCONJ", "NOUN", "NOUN"] + pos = ["NOUN", "CCONJ", "NOUN", "NOUN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 1 assert chunks[0].text_with_ws == "kız ve erkek çocuklar " -def test_tr_noun_chunks_two_nouns_in_nmod(tr_tokenizer): + +def test_tr_noun_chunks_two_nouns_in_nmod2(tr_tokenizer): text = "tatlı ve gürbüz çocuklar" - heads = [3, 1, -2, 0] + heads = [3, 2, 0, 3] deps = ["amod", "cc", "conj", "ROOT"] - tags = ["ADJ", "CCONJ", "NOUN", "NOUN"] + pos = ["ADJ", "CCONJ", "NOUN", "NOUN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 1 @@ -369,26 +370,27 @@ def test_tr_noun_chunks_two_nouns_in_nmod(tr_tokenizer): def test_tr_noun_chunks_conj_simple(tr_tokenizer): text = "Sen ya da ben" - heads = [0, 2, -1, -3] + heads = [0, 3, 1, 0] deps = ["ROOT", "cc", "fixed", "conj"] - tags = ["PRON", "CCONJ", "CCONJ", "PRON"] + pos = ["PRON", "CCONJ", "CCONJ", "PRON"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 2 assert chunks[0].text_with_ws == "ben " assert chunks[1].text_with_ws == "Sen " + def test_tr_noun_chunks_conj_three(tr_tokenizer): 
text = "sen, ben ve ondan" - heads = [0, 1, -2, 1, -4] + heads = [0, 2, 0, 4, 0] deps = ["ROOT", "punct", "conj", "cc", "conj"] - tags = ["PRON", "PUNCT", "PRON", "CCONJ", "PRON"] + pos = ["PRON", "PUNCT", "PRON", "CCONJ", "PRON"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 3 @@ -397,14 +399,14 @@ def test_tr_noun_chunks_conj_three(tr_tokenizer): assert chunks[2].text_with_ws == "sen " -def test_tr_noun_chunks_conj_three(tr_tokenizer): +def test_tr_noun_chunks_conj_three2(tr_tokenizer): text = "ben ya da sen ya da onlar" - heads = [0, 2, -1, -3, 2, -1, -3] + heads = [0, 3, 1, 0, 6, 4, 3] deps = ["ROOT", "cc", "fixed", "conj", "cc", "fixed", "conj"] - tags = ["PRON", "CCONJ", "CCONJ", "PRON", "CCONJ", "CCONJ", "PRON"] + pos = ["PRON", "CCONJ", "CCONJ", "PRON", "CCONJ", "CCONJ", "PRON"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 3 @@ -415,12 +417,12 @@ def test_tr_noun_chunks_conj_three(tr_tokenizer): def test_tr_noun_chunks_conj_and_adj_phrase(tr_tokenizer): text = "ben ve akıllı çocuk" - heads = [0, 2, 1, -3] + heads = [0, 3, 3, 0] deps = ["ROOT", "cc", "amod", "conj"] - tags = ["PRON", "CCONJ", "ADJ", "NOUN"] + pos = ["PRON", "CCONJ", "ADJ", "NOUN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 2 @@ -430,12 +432,12 @@ def test_tr_noun_chunks_conj_and_adj_phrase(tr_tokenizer): def test_tr_noun_chunks_conj_fixed_adj_phrase(tr_tokenizer): text = "ben ya da akıllı çocuk" - heads = [0, 3, -1, 1, -4] + heads = [0, 4, 1, 4, 0] deps = ["ROOT", "cc", "fixed", "amod", "conj"] - tags = ["PRON", "CCONJ", "CCONJ", "ADJ", "NOUN"] + pos = ["PRON", "CCONJ", "CCONJ", "ADJ", "NOUN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 2 @@ -445,12 +447,12 @@ def test_tr_noun_chunks_conj_fixed_adj_phrase(tr_tokenizer): def test_tr_noun_chunks_conj_subject(tr_tokenizer): text = "Sen ve ben iyi anlaşıyoruz" - heads = [4, 1, -2, -1, 0] + heads = [4, 2, 0, 2, 4] deps = ["nsubj", "cc", "conj", "adv", "ROOT"] - tags = ["PRON", "CCONJ", "PRON", "ADV", "VERB"] + pos = ["PRON", "CCONJ", "PRON", "ADV", "VERB"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 2 @@ -460,12 +462,12 @@ def test_tr_noun_chunks_conj_subject(tr_tokenizer): def test_tr_noun_chunks_conj_noun_head_verb(tr_tokenizer): text = "Simge babasını görmüyormuş, annesini değil" - heads = [2, 1, 0, 1, -2, -1] + heads = [2, 2, 2, 4, 2, 4] deps = ["nsubj", "obj", "ROOT", "punct", "conj", "aux"] - tags = ["PROPN", "NOUN", "VERB", "PUNCT", 
"NOUN", "AUX"] + pos = ["PROPN", "NOUN", "VERB", "PUNCT", "NOUN", "AUX"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 3 @@ -476,12 +478,12 @@ def test_tr_noun_chunks_conj_noun_head_verb(tr_tokenizer): def test_tr_noun_chunks_flat_simple(tr_tokenizer): text = "New York" - heads = [0, -1] + heads = [0, 0] deps = ["ROOT", "flat"] - tags = ["PROPN", "PROPN"] + pos = ["PROPN", "PROPN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 1 @@ -490,26 +492,26 @@ def test_tr_noun_chunks_flat_simple(tr_tokenizer): def test_tr_noun_chunks_flat_names_and_title(tr_tokenizer): text = "Gazi Mustafa Kemal" - heads = [1, 0, -1] + heads = [1, 1, 1] deps = ["nmod", "ROOT", "flat"] - tags = ["PROPN", "PROPN", "PROPN"] + pos = ["PROPN", "PROPN", "PROPN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 1 assert chunks[0].text_with_ws == "Gazi Mustafa Kemal " -def test_tr_noun_chunks_flat_names_and_title(tr_tokenizer): +def test_tr_noun_chunks_flat_names_and_title2(tr_tokenizer): text = "Ahmet Vefik Paşa" - heads = [2, -1, 0] + heads = [2, 0, 2] deps = ["nmod", "flat", "ROOT"] - tags = ["PROPN", "PROPN", "PROPN"] + pos = ["PROPN", "PROPN", "PROPN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 1 @@ -518,12 +520,12 @@ def test_tr_noun_chunks_flat_names_and_title(tr_tokenizer): def test_tr_noun_chunks_flat_name_lastname_and_title(tr_tokenizer): text = "Cumhurbaşkanı Ahmet Necdet Sezer" - heads = [1, 0, -1, -2] + heads = [1, 1, 1, 1] deps = ["nmod", "ROOT", "flat", "flat"] - tags = ["NOUN", "PROPN", "PROPN", "PROPN"] + pos = ["NOUN", "PROPN", "PROPN", "PROPN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 1 @@ -532,12 +534,12 @@ def test_tr_noun_chunks_flat_name_lastname_and_title(tr_tokenizer): def test_tr_noun_chunks_flat_in_nmod(tr_tokenizer): text = "Ahmet Sezer adında bir ögrenci" - heads = [2, -1, 2, 1, 0] + heads = [2, 0, 4, 4, 4] deps = ["nmod", "flat", "nmod", "det", "ROOT"] - tags = ["PROPN", "PROPN", "NOUN", "DET", "NOUN"] + pos = ["PROPN", "PROPN", "NOUN", "DET", "NOUN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 1 @@ -546,12 +548,12 @@ def test_tr_noun_chunks_flat_in_nmod(tr_tokenizer): def test_tr_noun_chunks_flat_and_chain_nmod(tr_tokenizer): text = "Batı 
Afrika ülkelerinden Sierra Leone" - heads = [1, 1, 1, 0, -1] + heads = [1, 2, 3, 3, 3] deps = ["nmod", "nmod", "nmod", "ROOT", "flat"] - tags = ["NOUN", "PROPN", "NOUN", "PROPN", "PROPN"] + pos = ["NOUN", "PROPN", "NOUN", "PROPN", "PROPN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 1 @@ -560,12 +562,12 @@ def test_tr_noun_chunks_flat_and_chain_nmod(tr_tokenizer): def test_tr_noun_chunks_two_flats_conjed(tr_tokenizer): text = "New York ve Sierra Leone" - heads = [0, -1, 1, -3, -1] + heads = [0, 0, 3, 0, 3] deps = ["ROOT", "flat", "cc", "conj", "flat"] - tags = ["PROPN", "PROPN", "CCONJ", "PROPN", "PROPN"] + pos = ["PROPN", "PROPN", "CCONJ", "PROPN", "PROPN"] tokens = tr_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], tags=tags, heads=heads, deps=deps + doc = Doc( + tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps ) chunks = list(doc.noun_chunks) assert len(chunks) == 2 diff --git a/spacy/tests/lang/tr/test_text.py b/spacy/tests/lang/tr/test_text.py index 2fe638b5f..ed7dbb805 100644 --- a/spacy/tests/lang/tr/test_text.py +++ b/spacy/tests/lang/tr/test_text.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.lang.tr.lex_attrs import like_num @@ -18,8 +15,8 @@ from spacy.lang.tr.lex_attrs import like_num "üçüncü", "beşinci", "100üncü", - "8inci" - ] + "8inci", + ], ) def test_tr_lex_attrs_like_number_cardinal_ordinal(word): assert like_num(word) @@ -29,4 +26,3 @@ def test_tr_lex_attrs_like_number_cardinal_ordinal(word): def test_tr_lex_attrs_capitals(word): assert like_num(word) assert like_num(word.upper()) - diff --git a/spacy/tests/lang/tt/test_tokenizer.py b/spacy/tests/lang/tt/test_tokenizer.py index f6c68a401..246d2824d 100644 --- a/spacy/tests/lang/tt/test_tokenizer.py +++ b/spacy/tests/lang/tt/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/uk/test_tokenizer.py b/spacy/tests/lang/uk/test_tokenizer.py index f744b32b0..91ae057f8 100644 --- a/spacy/tests/lang/uk/test_tokenizer.py +++ b/spacy/tests/lang/uk/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest @@ -92,7 +89,7 @@ def test_uk_tokenizer_splits_open_appostrophe(uk_tokenizer, text): assert tokens[0].text == "'" -@pytest.mark.xfail(reason="See #3327") +@pytest.mark.skip(reason="See Issue #3327 and PR #3329") @pytest.mark.parametrize("text", ["Тест''"]) def test_uk_tokenizer_splits_double_end_quote(uk_tokenizer, text): tokens = uk_tokenizer(text) diff --git a/spacy/tests/lang/uk/test_tokenizer_exc.py b/spacy/tests/lang/uk/test_tokenizer_exc.py index 328e1d287..4fb4a6b31 100644 --- a/spacy/tests/lang/uk/test_tokenizer_exc.py +++ b/spacy/tests/lang/uk/test_tokenizer_exc.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/ur/test_prefix_suffix_infix.py b/spacy/tests/lang/ur/test_prefix_suffix_infix.py index de11c9b34..e9f3272f4 100644 --- a/spacy/tests/lang/ur/test_prefix_suffix_infix.py +++ b/spacy/tests/lang/ur/test_prefix_suffix_infix.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/ur/test_text.py 
b/spacy/tests/lang/ur/test_text.py index 546e79182..5da831cf8 100644 --- a/spacy/tests/lang/ur/test_text.py +++ b/spacy/tests/lang/ur/test_text.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/tests/lang/yo/test_text.py b/spacy/tests/lang/yo/test_text.py index ce6408b67..48b689f3d 100644 --- a/spacy/tests/lang/yo/test_text.py +++ b/spacy/tests/lang/yo/test_text.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest from spacy.lang.yo.lex_attrs import like_num diff --git a/spacy/tests/lang/zh/test_serialize.py b/spacy/tests/lang/zh/test_serialize.py index 56f092ed8..03cdbbe24 100644 --- a/spacy/tests/lang/zh/test_serialize.py +++ b/spacy/tests/lang/zh/test_serialize.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.lang.zh import Chinese from ...util import make_tempdir @@ -8,14 +5,14 @@ from ...util import make_tempdir def zh_tokenizer_serialize(zh_tokenizer): tokenizer_bytes = zh_tokenizer.to_bytes() - nlp = Chinese(meta={"tokenizer": {"config": {"use_jieba": False}}}) + nlp = Chinese() nlp.tokenizer.from_bytes(tokenizer_bytes) assert tokenizer_bytes == nlp.tokenizer.to_bytes() with make_tempdir() as d: file_path = d / "tokenizer" zh_tokenizer.to_disk(file_path) - nlp = Chinese(meta={"tokenizer": {"config": {"use_jieba": False}}}) + nlp = Chinese() nlp.tokenizer.from_disk(file_path) assert tokenizer_bytes == nlp.tokenizer.to_bytes() @@ -28,21 +25,21 @@ def test_zh_tokenizer_serialize_jieba(zh_tokenizer_jieba): zh_tokenizer_serialize(zh_tokenizer_jieba) -def test_zh_tokenizer_serialize_pkuseg(zh_tokenizer_pkuseg): - zh_tokenizer_serialize(zh_tokenizer_pkuseg) - - @pytest.mark.slow def test_zh_tokenizer_serialize_pkuseg_with_processors(zh_tokenizer_pkuseg): - nlp = Chinese( - meta={ + config = { + "nlp": { "tokenizer": { - "config": { - "use_jieba": False, - "use_pkuseg": True, - "pkuseg_model": "medicine", - } + "@tokenizers": "spacy.zh.ChineseTokenizer", + "segmenter": "pkuseg", } - } - ) + }, + "initialize": { + "tokenizer": { + "pkuseg_model": "medicine", + } + }, + } + nlp = Chinese.from_config(config) + nlp.initialize() zh_tokenizer_serialize(nlp.tokenizer) diff --git a/spacy/tests/lang/zh/test_text.py b/spacy/tests/lang/zh/test_text.py index 3a3ccbdde..148257329 100644 --- a/spacy/tests/lang/zh/test_text.py +++ b/spacy/tests/lang/zh/test_text.py @@ -1,7 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - - import pytest diff --git a/spacy/tests/lang/zh/test_tokenizer.py b/spacy/tests/lang/zh/test_tokenizer.py index 28240b6a9..741eb0ace 100644 --- a/spacy/tests/lang/zh/test_tokenizer.py +++ b/spacy/tests/lang/zh/test_tokenizer.py @@ -1,8 +1,6 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest -from spacy.lang.zh import _get_pkuseg_trie_data +from spacy.lang.zh import Chinese, _get_pkuseg_trie_data +from thinc.api import ConfigValidationError # fmt: off @@ -40,7 +38,7 @@ def test_zh_tokenizer_pkuseg(zh_tokenizer_pkuseg, text, expected_tokens): assert tokens == expected_tokens -def test_zh_tokenizer_pkuseg_user_dict(zh_tokenizer_pkuseg): +def test_zh_tokenizer_pkuseg_user_dict(zh_tokenizer_pkuseg, zh_tokenizer_char): user_dict = _get_pkuseg_trie_data(zh_tokenizer_pkuseg.pkuseg_seg.preprocesser.trie) zh_tokenizer_pkuseg.pkuseg_update_user_dict(["nonsense_asdf"]) updated_user_dict = _get_pkuseg_trie_data( @@ -55,8 +53,26 @@ def test_zh_tokenizer_pkuseg_user_dict(zh_tokenizer_pkuseg): ) assert 
len(reset_user_dict) == 0 + # warn if not relevant + with pytest.warns(UserWarning): + zh_tokenizer_char.pkuseg_update_user_dict(["nonsense_asdf"]) -def test_extra_spaces(zh_tokenizer_char): + +def test_zh_extra_spaces(zh_tokenizer_char): # note: three spaces after "I" tokens = zh_tokenizer_char("I like cheese.") assert tokens[1].orth_ == " " + + +def test_zh_unsupported_segmenter(): + config = {"nlp": {"tokenizer": {"segmenter": "unk"}}} + with pytest.raises(ConfigValidationError): + Chinese.from_config(config) + + +def test_zh_uninitialized_pkuseg(): + config = {"nlp": {"tokenizer": {"segmenter": "char"}}} + nlp = Chinese.from_config(config) + nlp.tokenizer.segmenter = "pkuseg" + with pytest.raises(ValueError): + nlp("test") diff --git a/spacy/tests/matcher/test_dependency_matcher.py b/spacy/tests/matcher/test_dependency_matcher.py new file mode 100644 index 000000000..e18a8f6d8 --- /dev/null +++ b/spacy/tests/matcher/test_dependency_matcher.py @@ -0,0 +1,331 @@ +import pytest +import pickle +import re +import copy +from mock import Mock +from spacy.matcher import DependencyMatcher +from spacy.tokens import Doc + + +@pytest.fixture +def doc(en_vocab): + words = ["The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "fox"] + heads = [3, 3, 3, 4, 4, 4, 8, 8, 5] + deps = ["det", "amod", "amod", "nsubj", "ROOT", "prep", "pobj", "det", "amod"] + return Doc(en_vocab, words=words, heads=heads, deps=deps) + + +@pytest.fixture +def patterns(en_vocab): + def is_brown_yellow(text): + return bool(re.compile(r"brown|yellow").match(text)) + + IS_BROWN_YELLOW = en_vocab.add_flag(is_brown_yellow) + + pattern1 = [ + {"RIGHT_ID": "fox", "RIGHT_ATTRS": {"ORTH": "fox"}}, + { + "LEFT_ID": "fox", + "REL_OP": ">", + "RIGHT_ID": "q", + "RIGHT_ATTRS": {"ORTH": "quick", "DEP": "amod"}, + }, + { + "LEFT_ID": "fox", + "REL_OP": ">", + "RIGHT_ID": "r", + "RIGHT_ATTRS": {IS_BROWN_YELLOW: True}, + }, + ] + + pattern2 = [ + {"RIGHT_ID": "jumped", "RIGHT_ATTRS": {"ORTH": "jumped"}}, + { + "LEFT_ID": "jumped", + "REL_OP": ">", + "RIGHT_ID": "fox1", + "RIGHT_ATTRS": {"ORTH": "fox"}, + }, + { + "LEFT_ID": "jumped", + "REL_OP": ".", + "RIGHT_ID": "over", + "RIGHT_ATTRS": {"ORTH": "over"}, + }, + ] + + pattern3 = [ + {"RIGHT_ID": "jumped", "RIGHT_ATTRS": {"ORTH": "jumped"}}, + { + "LEFT_ID": "jumped", + "REL_OP": ">", + "RIGHT_ID": "fox", + "RIGHT_ATTRS": {"ORTH": "fox"}, + }, + { + "LEFT_ID": "fox", + "REL_OP": ">>", + "RIGHT_ID": "r", + "RIGHT_ATTRS": {"ORTH": "brown"}, + }, + ] + + pattern4 = [ + {"RIGHT_ID": "jumped", "RIGHT_ATTRS": {"ORTH": "jumped"}}, + { + "LEFT_ID": "jumped", + "REL_OP": ">", + "RIGHT_ID": "fox", + "RIGHT_ATTRS": {"ORTH": "fox"}, + }, + ] + + pattern5 = [ + {"RIGHT_ID": "jumped", "RIGHT_ATTRS": {"ORTH": "jumped"}}, + { + "LEFT_ID": "jumped", + "REL_OP": ">>", + "RIGHT_ID": "fox", + "RIGHT_ATTRS": {"ORTH": "fox"}, + }, + ] + + return [pattern1, pattern2, pattern3, pattern4, pattern5] + + +@pytest.fixture +def dependency_matcher(en_vocab, patterns, doc): + matcher = DependencyMatcher(en_vocab) + mock = Mock() + for i in range(1, len(patterns) + 1): + if i == 1: + matcher.add("pattern1", [patterns[0]], on_match=mock) + else: + matcher.add("pattern" + str(i), [patterns[i - 1]]) + + return matcher + + +def test_dependency_matcher(dependency_matcher, doc, patterns): + assert len(dependency_matcher) == 5 + assert "pattern3" in dependency_matcher + assert dependency_matcher.get("pattern3") == (None, [patterns[2]]) + matches = dependency_matcher(doc) + assert len(matches) == 6 + assert 
matches[0][1] == [3, 1, 2] + assert matches[1][1] == [4, 3, 5] + assert matches[2][1] == [4, 3, 2] + assert matches[3][1] == [4, 3] + assert matches[4][1] == [4, 3] + assert matches[5][1] == [4, 8] + + span = doc[0:6] + matches = dependency_matcher(span) + assert len(matches) == 5 + assert matches[0][1] == [3, 1, 2] + assert matches[1][1] == [4, 3, 5] + assert matches[2][1] == [4, 3, 2] + assert matches[3][1] == [4, 3] + assert matches[4][1] == [4, 3] + + +def test_dependency_matcher_pickle(en_vocab, patterns, doc): + matcher = DependencyMatcher(en_vocab) + for i in range(1, len(patterns) + 1): + matcher.add("pattern" + str(i), [patterns[i - 1]]) + + matches = matcher(doc) + assert matches[0][1] == [3, 1, 2] + assert matches[1][1] == [4, 3, 5] + assert matches[2][1] == [4, 3, 2] + assert matches[3][1] == [4, 3] + assert matches[4][1] == [4, 3] + assert matches[5][1] == [4, 8] + + b = pickle.dumps(matcher) + matcher_r = pickle.loads(b) + + assert len(matcher) == len(matcher_r) + matches = matcher_r(doc) + assert matches[0][1] == [3, 1, 2] + assert matches[1][1] == [4, 3, 5] + assert matches[2][1] == [4, 3, 2] + assert matches[3][1] == [4, 3] + assert matches[4][1] == [4, 3] + assert matches[5][1] == [4, 8] + + +def test_dependency_matcher_pattern_validation(en_vocab): + pattern = [ + {"RIGHT_ID": "fox", "RIGHT_ATTRS": {"ORTH": "fox"}}, + { + "LEFT_ID": "fox", + "REL_OP": ">", + "RIGHT_ID": "q", + "RIGHT_ATTRS": {"ORTH": "quick", "DEP": "amod"}, + }, + { + "LEFT_ID": "fox", + "REL_OP": ">", + "RIGHT_ID": "r", + "RIGHT_ATTRS": {"ORTH": "brown"}, + }, + ] + + matcher = DependencyMatcher(en_vocab) + # original pattern is valid + matcher.add("FOUNDED", [pattern]) + # individual pattern not wrapped in a list + with pytest.raises(ValueError): + matcher.add("FOUNDED", pattern) + # no anchor node + with pytest.raises(ValueError): + matcher.add("FOUNDED", [pattern[1:]]) + # required keys missing + with pytest.raises(ValueError): + pattern2 = copy.deepcopy(pattern) + del pattern2[0]["RIGHT_ID"] + matcher.add("FOUNDED", [pattern2]) + with pytest.raises(ValueError): + pattern2 = copy.deepcopy(pattern) + del pattern2[1]["RIGHT_ID"] + matcher.add("FOUNDED", [pattern2]) + with pytest.raises(ValueError): + pattern2 = copy.deepcopy(pattern) + del pattern2[1]["RIGHT_ATTRS"] + matcher.add("FOUNDED", [pattern2]) + with pytest.raises(ValueError): + pattern2 = copy.deepcopy(pattern) + del pattern2[1]["LEFT_ID"] + matcher.add("FOUNDED", [pattern2]) + with pytest.raises(ValueError): + pattern2 = copy.deepcopy(pattern) + del pattern2[1]["REL_OP"] + matcher.add("FOUNDED", [pattern2]) + # invalid operator + with pytest.raises(ValueError): + pattern2 = copy.deepcopy(pattern) + pattern2[1]["REL_OP"] = "!!!" 
+ matcher.add("FOUNDED", [pattern2]) + # duplicate node name + with pytest.raises(ValueError): + pattern2 = copy.deepcopy(pattern) + pattern2[1]["RIGHT_ID"] = "fox" + matcher.add("FOUNDED", [pattern2]) + + +def test_dependency_matcher_callback(en_vocab, doc): + pattern = [ + {"RIGHT_ID": "quick", "RIGHT_ATTRS": {"ORTH": "quick"}}, + ] + + matcher = DependencyMatcher(en_vocab) + mock = Mock() + matcher.add("pattern", [pattern], on_match=mock) + matches = matcher(doc) + mock.assert_called_once_with(matcher, doc, 0, matches) + + # check that matches with and without callback are the same (#4590) + matcher2 = DependencyMatcher(en_vocab) + matcher2.add("pattern", [pattern]) + matches2 = matcher2(doc) + assert matches == matches2 + + +@pytest.mark.parametrize("op,num_matches", [(".", 8), (".*", 20), (";", 8), (";*", 20)]) +def test_dependency_matcher_precedence_ops(en_vocab, op, num_matches): + # two sentences to test that all matches are within the same sentence + doc = Doc( + en_vocab, + words=["a", "b", "c", "d", "e"] * 2, + heads=[0, 0, 0, 0, 0, 5, 5, 5, 5, 5], + deps=["dep"] * 10, + ) + match_count = 0 + for text in ["a", "b", "c", "d", "e"]: + pattern = [ + {"RIGHT_ID": "1", "RIGHT_ATTRS": {"ORTH": text}}, + {"LEFT_ID": "1", "REL_OP": op, "RIGHT_ID": "2", "RIGHT_ATTRS": {}}, + ] + matcher = DependencyMatcher(en_vocab) + matcher.add("A", [pattern]) + matches = matcher(doc) + match_count += len(matches) + for match in matches: + match_id, token_ids = match + # token_ids[0] op token_ids[1] + if op == ".": + assert token_ids[0] == token_ids[1] - 1 + elif op == ";": + assert token_ids[0] == token_ids[1] + 1 + elif op == ".*": + assert token_ids[0] < token_ids[1] + elif op == ";*": + assert token_ids[0] > token_ids[1] + # all tokens are within the same sentence + assert doc[token_ids[0]].sent == doc[token_ids[1]].sent + assert match_count == num_matches + + +@pytest.mark.parametrize( + "left,right,op,num_matches", + [ + ("fox", "jumped", "<", 1), + ("the", "lazy", "<", 0), + ("jumped", "jumped", "<", 0), + ("fox", "jumped", ">", 0), + ("fox", "lazy", ">", 1), + ("lazy", "lazy", ">", 0), + ("fox", "jumped", "<<", 2), + ("jumped", "fox", "<<", 0), + ("the", "fox", "<<", 2), + ("fox", "jumped", ">>", 0), + ("over", "the", ">>", 1), + ("fox", "the", ">>", 2), + ("fox", "jumped", ".", 1), + ("lazy", "fox", ".", 1), + ("the", "fox", ".", 0), + ("the", "the", ".", 0), + ("fox", "jumped", ";", 0), + ("lazy", "fox", ";", 0), + ("the", "fox", ";", 0), + ("the", "the", ";", 0), + ("quick", "fox", ".*", 2), + ("the", "fox", ".*", 3), + ("the", "the", ".*", 1), + ("fox", "jumped", ";*", 1), + ("quick", "fox", ";*", 0), + ("the", "fox", ";*", 1), + ("the", "the", ";*", 1), + ("quick", "brown", "$+", 1), + ("brown", "quick", "$+", 0), + ("brown", "brown", "$+", 0), + ("quick", "brown", "$-", 0), + ("brown", "quick", "$-", 1), + ("brown", "brown", "$-", 0), + ("the", "brown", "$++", 1), + ("brown", "the", "$++", 0), + ("brown", "brown", "$++", 0), + ("the", "brown", "$--", 0), + ("brown", "the", "$--", 1), + ("brown", "brown", "$--", 0), + ], +) +def test_dependency_matcher_ops(en_vocab, doc, left, right, op, num_matches): + right_id = right + if left == right: + right_id = right + "2" + pattern = [ + {"RIGHT_ID": left, "RIGHT_ATTRS": {"LOWER": left}}, + { + "LEFT_ID": left, + "REL_OP": op, + "RIGHT_ID": right_id, + "RIGHT_ATTRS": {"LOWER": right}, + }, + ] + + matcher = DependencyMatcher(en_vocab) + matcher.add("pattern", [pattern]) + matches = matcher(doc) + assert len(matches) == num_matches diff --git 
a/spacy/tests/matcher/test_matcher_api.py b/spacy/tests/matcher/test_matcher_api.py index 1112195da..77b09f376 100644 --- a/spacy/tests/matcher/test_matcher_api.py +++ b/spacy/tests/matcher/test_matcher_api.py @@ -1,11 +1,8 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest -import re from mock import Mock -from spacy.matcher import Matcher, DependencyMatcher -from spacy.tokens import Doc, Token +from spacy.matcher import Matcher +from spacy.tokens import Doc, Token, Span + from ..doc.test_underscore import clean_underscore # noqa: F401 @@ -66,18 +63,11 @@ def test_matcher_len_contains(matcher): assert "TEST2" not in matcher -def test_matcher_add_new_old_api(en_vocab): +def test_matcher_add_new_api(en_vocab): doc = Doc(en_vocab, words=["a", "b"]) patterns = [[{"TEXT": "a"}], [{"TEXT": "a"}, {"TEXT": "b"}]] matcher = Matcher(en_vocab) - matcher.add("OLD_API", None, *patterns) - assert len(matcher(doc)) == 2 - matcher = Matcher(en_vocab) on_match = Mock() - matcher.add("OLD_API_CALLBACK", on_match, *patterns) - assert len(matcher(doc)) == 2 - assert on_match.call_count == 2 - # New API: add(key: str, patterns: List[List[dict]], on_match: Callable) matcher = Matcher(en_vocab) matcher.add("NEW_API", patterns) assert len(matcher(doc)) == 2 @@ -179,11 +169,11 @@ def test_matcher_match_zero_plus(matcher): def test_matcher_match_one_plus(matcher): control = Matcher(matcher.vocab) - control.add("BasicPhilippe", None, [{"ORTH": "Philippe"}]) + control.add("BasicPhilippe", [[{"ORTH": "Philippe"}]]) doc = Doc(control.vocab, words=["Philippe", "Philippe"]) m = control(doc) assert len(m) == 2 - pattern = [{"ORTH": "Philippe", "OP": "1"}, {"ORTH": "Philippe", "OP": "+"}] + pattern = [{"ORTH": "Philippe"}, {"ORTH": "Philippe", "OP": "+"}] matcher.add("KleenePhilippe", [pattern]) m = matcher(doc) assert len(m) == 1 @@ -240,6 +230,106 @@ def test_matcher_set_value_operator(en_vocab): assert len(matches) == 1 +def test_matcher_subset_value_operator(en_vocab): + matcher = Matcher(en_vocab) + pattern = [{"MORPH": {"IS_SUBSET": ["Feat=Val", "Feat2=Val2"]}}] + matcher.add("M", [pattern]) + doc = Doc(en_vocab, words=["a", "b", "c"]) + assert len(matcher(doc)) == 3 + doc[0].set_morph("Feat=Val") + assert len(matcher(doc)) == 3 + doc[0].set_morph("Feat=Val|Feat2=Val2") + assert len(matcher(doc)) == 3 + doc[0].set_morph("Feat=Val|Feat2=Val2|Feat3=Val3") + assert len(matcher(doc)) == 2 + doc[0].set_morph("Feat=Val|Feat2=Val2|Feat3=Val3|Feat4=Val4") + assert len(matcher(doc)) == 2 + + # IS_SUBSET acts like "IN" for attrs other than MORPH + matcher = Matcher(en_vocab) + pattern = [{"TAG": {"IS_SUBSET": ["A", "B"]}}] + matcher.add("M", [pattern]) + doc = Doc(en_vocab, words=["a", "b", "c"]) + doc[0].tag_ = "A" + assert len(matcher(doc)) == 1 + + # IS_SUBSET with an empty list matches nothing + matcher = Matcher(en_vocab) + pattern = [{"TAG": {"IS_SUBSET": []}}] + matcher.add("M", [pattern]) + doc = Doc(en_vocab, words=["a", "b", "c"]) + doc[0].tag_ = "A" + assert len(matcher(doc)) == 0 + + +def test_matcher_superset_value_operator(en_vocab): + matcher = Matcher(en_vocab) + pattern = [{"MORPH": {"IS_SUPERSET": ["Feat=Val", "Feat2=Val2", "Feat3=Val3"]}}] + matcher.add("M", [pattern]) + doc = Doc(en_vocab, words=["a", "b", "c"]) + assert len(matcher(doc)) == 0 + doc[0].set_morph("Feat=Val|Feat2=Val2") + assert len(matcher(doc)) == 0 + doc[0].set_morph("Feat=Val|Feat2=Val2|Feat3=Val3") + assert len(matcher(doc)) == 1 + doc[0].set_morph("Feat=Val|Feat2=Val2|Feat3=Val3|Feat4=Val4") + assert 
len(matcher(doc)) == 1 + + # IS_SUPERSET with more than one value only matches for MORPH + matcher = Matcher(en_vocab) + pattern = [{"TAG": {"IS_SUPERSET": ["A", "B"]}}] + matcher.add("M", [pattern]) + doc = Doc(en_vocab, words=["a", "b", "c"]) + doc[0].tag_ = "A" + assert len(matcher(doc)) == 0 + + # IS_SUPERSET with one value is the same as == + matcher = Matcher(en_vocab) + pattern = [{"TAG": {"IS_SUPERSET": ["A"]}}] + matcher.add("M", [pattern]) + doc = Doc(en_vocab, words=["a", "b", "c"]) + doc[0].tag_ = "A" + assert len(matcher(doc)) == 1 + + # IS_SUPERSET with an empty value matches everything + matcher = Matcher(en_vocab) + pattern = [{"TAG": {"IS_SUPERSET": []}}] + matcher.add("M", [pattern]) + doc = Doc(en_vocab, words=["a", "b", "c"]) + doc[0].tag_ = "A" + assert len(matcher(doc)) == 3 + + +def test_matcher_morph_handling(en_vocab): + # order of features in pattern doesn't matter + matcher = Matcher(en_vocab) + pattern1 = [{"MORPH": {"IN": ["Feat1=Val1|Feat2=Val2"]}}] + pattern2 = [{"MORPH": {"IN": ["Feat2=Val2|Feat1=Val1"]}}] + matcher.add("M", [pattern1]) + matcher.add("N", [pattern2]) + doc = Doc(en_vocab, words=["a", "b", "c"]) + assert len(matcher(doc)) == 0 + + doc[0].set_morph("Feat2=Val2|Feat1=Val1") + assert len(matcher(doc)) == 2 + doc[0].set_morph("Feat1=Val1|Feat2=Val2") + assert len(matcher(doc)) == 2 + + # multiple values are split + matcher = Matcher(en_vocab) + pattern1 = [{"MORPH": {"IS_SUPERSET": ["Feat1=Val1", "Feat2=Val2"]}}] + pattern2 = [{"MORPH": {"IS_SUPERSET": ["Feat1=Val1", "Feat1=Val3", "Feat2=Val2"]}}] + matcher.add("M", [pattern1]) + matcher.add("N", [pattern2]) + doc = Doc(en_vocab, words=["a", "b", "c"]) + assert len(matcher(doc)) == 0 + + doc[0].set_morph("Feat2=Val2,Val3|Feat1=Val1") + assert len(matcher(doc)) == 1 + doc[0].set_morph("Feat1=Val1,Val3|Feat2=Val2") + assert len(matcher(doc)) == 2 + + def test_matcher_regex(en_vocab): matcher = Matcher(en_vocab) pattern = [{"ORTH": {"REGEX": r"(?:a|an)"}}] @@ -301,84 +391,6 @@ def test_matcher_extension_set_membership(en_vocab): assert len(matches) == 0 -@pytest.fixture -def text(): - return "The quick brown fox jumped over the lazy fox" - - -@pytest.fixture -def heads(): - return [3, 2, 1, 1, 0, -1, 2, 1, -3] - - -@pytest.fixture -def deps(): - return ["det", "amod", "amod", "nsubj", "prep", "pobj", "det", "amod"] - - -@pytest.fixture -def dependency_matcher(en_vocab): - def is_brown_yellow(text): - return bool(re.compile(r"brown|yellow|over").match(text)) - - IS_BROWN_YELLOW = en_vocab.add_flag(is_brown_yellow) - - pattern1 = [ - {"SPEC": {"NODE_NAME": "fox"}, "PATTERN": {"ORTH": "fox"}}, - { - "SPEC": {"NODE_NAME": "q", "NBOR_RELOP": ">", "NBOR_NAME": "fox"}, - "PATTERN": {"ORTH": "quick", "DEP": "amod"}, - }, - { - "SPEC": {"NODE_NAME": "r", "NBOR_RELOP": ">", "NBOR_NAME": "fox"}, - "PATTERN": {IS_BROWN_YELLOW: True}, - }, - ] - - pattern2 = [ - {"SPEC": {"NODE_NAME": "jumped"}, "PATTERN": {"ORTH": "jumped"}}, - { - "SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"}, - "PATTERN": {"ORTH": "fox"}, - }, - { - "SPEC": {"NODE_NAME": "quick", "NBOR_RELOP": ".", "NBOR_NAME": "jumped"}, - "PATTERN": {"ORTH": "fox"}, - }, - ] - - pattern3 = [ - {"SPEC": {"NODE_NAME": "jumped"}, "PATTERN": {"ORTH": "jumped"}}, - { - "SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"}, - "PATTERN": {"ORTH": "fox"}, - }, - { - "SPEC": {"NODE_NAME": "r", "NBOR_RELOP": ">>", "NBOR_NAME": "fox"}, - "PATTERN": {"ORTH": "brown"}, - }, - ] - - matcher = DependencyMatcher(en_vocab) - 
matcher.add("pattern1", [pattern1]) - matcher.add("pattern2", [pattern2]) - matcher.add("pattern3", [pattern3]) - - return matcher - - -def test_dependency_matcher_compile(dependency_matcher): - assert len(dependency_matcher) == 3 - - -# def test_dependency_matcher(dependency_matcher, text, heads, deps): -# doc = get_doc(dependency_matcher.vocab, text.split(), heads=heads, deps=deps) -# matches = dependency_matcher(doc) -# assert matches[0][1] == [[3, 1, 2]] -# assert matches[1][1] == [[4, 3, 3]] -# assert matches[2][1] == [[4, 3, 2]] - - def test_matcher_basic_check(en_vocab): matcher = Matcher(en_vocab) # Potential mistake: pass in pattern instead of list of patterns @@ -389,11 +401,14 @@ def test_matcher_basic_check(en_vocab): def test_attr_pipeline_checks(en_vocab): doc1 = Doc(en_vocab, words=["Test"]) - doc1.is_parsed = True + doc1[0].dep_ = "ROOT" doc2 = Doc(en_vocab, words=["Test"]) - doc2.is_tagged = True + doc2[0].tag_ = "TAG" + doc2[0].pos_ = "X" + doc2[0].set_morph("Feat=Val") + doc2[0].lemma_ = "LEMMA" doc3 = Doc(en_vocab, words=["Test"]) - # DEP requires is_parsed + # DEP requires DEP matcher = Matcher(en_vocab) matcher.add("TEST", [[{"DEP": "a"}]]) matcher(doc1) @@ -401,7 +416,10 @@ def test_attr_pipeline_checks(en_vocab): matcher(doc2) with pytest.raises(ValueError): matcher(doc3) - # TAG, POS, LEMMA require is_tagged + # errors can be suppressed if desired + matcher(doc2, allow_missing=True) + matcher(doc3, allow_missing=True) + # TAG, POS, LEMMA require those values for attr in ("TAG", "POS", "LEMMA"): matcher = Matcher(en_vocab) matcher.add("TEST", [[{attr: "a"}]]) @@ -479,3 +497,26 @@ def test_matcher_span(matcher): assert len(matcher(doc)) == 2 assert len(matcher(span_js)) == 1 assert len(matcher(span_java)) == 1 + + +def test_matcher_as_spans(matcher): + """Test the new as_spans=True API.""" + text = "JavaScript is good but Java is better" + doc = Doc(matcher.vocab, words=text.split()) + matches = matcher(doc, as_spans=True) + assert len(matches) == 2 + assert isinstance(matches[0], Span) + assert matches[0].text == "JavaScript" + assert matches[0].label_ == "JS" + assert isinstance(matches[1], Span) + assert matches[1].text == "Java" + assert matches[1].label_ == "Java" + + +def test_matcher_deprecated(matcher): + doc = Doc(matcher.vocab, words=["hello", "world"]) + with pytest.warns(DeprecationWarning) as record: + for _ in matcher.pipe([doc]): + pass + assert record.list + assert "spaCy v3.0" in str(record.list[0].message) diff --git a/spacy/tests/matcher/test_matcher_logic.py b/spacy/tests/matcher/test_matcher_logic.py index 240ace537..5f4c2991a 100644 --- a/spacy/tests/matcher/test_matcher_logic.py +++ b/spacy/tests/matcher/test_matcher_logic.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest import re @@ -9,19 +6,11 @@ from spacy.matcher import Matcher from spacy.tokens import Doc, Span -pattern1 = [{"ORTH": "A", "OP": "1"}, {"ORTH": "A", "OP": "*"}] -pattern2 = [{"ORTH": "A", "OP": "*"}, {"ORTH": "A", "OP": "1"}] -pattern3 = [{"ORTH": "A", "OP": "1"}, {"ORTH": "A", "OP": "1"}] -pattern4 = [ - {"ORTH": "B", "OP": "1"}, - {"ORTH": "A", "OP": "*"}, - {"ORTH": "B", "OP": "1"}, -] -pattern5 = [ - {"ORTH": "B", "OP": "*"}, - {"ORTH": "A", "OP": "*"}, - {"ORTH": "B", "OP": "1"}, -] +pattern1 = [{"ORTH": "A"}, {"ORTH": "A", "OP": "*"}] +pattern2 = [{"ORTH": "A", "OP": "*"}, {"ORTH": "A"}] +pattern3 = [{"ORTH": "A"}, {"ORTH": "A"}] +pattern4 = [{"ORTH": "B"}, {"ORTH": "A", "OP": "*"}, {"ORTH": "B"}] +pattern5 = [{"ORTH": "B", 
"OP": "*"}, {"ORTH": "A", "OP": "*"}, {"ORTH": "B"}] re_pattern1 = "AA*" re_pattern2 = "A*A" @@ -29,10 +18,16 @@ re_pattern3 = "AA" re_pattern4 = "BA*B" re_pattern5 = "B*A*B" +longest1 = "A A A A A" +longest2 = "A A A A A" +longest3 = "A A" +longest4 = "B A A A A A B" # "FIRST" would be "B B" +longest5 = "B B A A A A A B" + @pytest.fixture def text(): - return "(ABBAAAAAB)." + return "(BBAAAAAB)." @pytest.fixture @@ -44,25 +39,63 @@ def doc(en_tokenizer, text): @pytest.mark.parametrize( "pattern,re_pattern", [ - pytest.param(pattern1, re_pattern1, marks=pytest.mark.xfail()), - pytest.param(pattern2, re_pattern2, marks=pytest.mark.xfail()), - pytest.param(pattern3, re_pattern3, marks=pytest.mark.xfail()), + (pattern1, re_pattern1), + (pattern2, re_pattern2), + (pattern3, re_pattern3), (pattern4, re_pattern4), - pytest.param(pattern5, re_pattern5, marks=pytest.mark.xfail()), + (pattern5, re_pattern5), ], ) -def test_greedy_matching(doc, text, pattern, re_pattern): - """Test that the greedy matching behavior of the * op is consistant with +def test_greedy_matching_first(doc, text, pattern, re_pattern): + """Test that the greedy matching behavior "FIRST" is consistent with other re implementations.""" matcher = Matcher(doc.vocab) - matcher.add(re_pattern, [pattern]) + matcher.add(re_pattern, [pattern], greedy="FIRST") matches = matcher(doc) re_matches = [m.span() for m in re.finditer(re_pattern, text)] - for match, re_match in zip(matches, re_matches): - assert match[1:] == re_match + for (key, m_s, m_e), (re_s, re_e) in zip(matches, re_matches): + # matching the string, not the exact position + assert doc[m_s:m_e].text == doc[re_s:re_e].text + + +@pytest.mark.parametrize( + "pattern,longest", + [ + (pattern1, longest1), + (pattern2, longest2), + (pattern3, longest3), + (pattern4, longest4), + (pattern5, longest5), + ], +) +def test_greedy_matching_longest(doc, text, pattern, longest): + """Test the "LONGEST" greedy matching behavior""" + matcher = Matcher(doc.vocab) + matcher.add("RULE", [pattern], greedy="LONGEST") + matches = matcher(doc) + for (key, s, e) in matches: + assert doc[s:e].text == longest + + +def test_greedy_matching_longest_first(en_tokenizer): + """Test that "LONGEST" matching prefers the first of two equally long matches""" + doc = en_tokenizer(" ".join("CCC")) + matcher = Matcher(doc.vocab) + pattern = [{"ORTH": "C"}, {"ORTH": "C"}] + matcher.add("RULE", [pattern], greedy="LONGEST") + matches = matcher(doc) + # out of 0-2 and 1-3, the first should be picked + assert len(matches) == 1 + assert matches[0][1] == 0 + assert matches[0][2] == 2 + + +def test_invalid_greediness(doc, text): + matcher = Matcher(doc.vocab) + with pytest.raises(ValueError): + matcher.add("RULE", [pattern1], greedy="GREEDY") -@pytest.mark.xfail @pytest.mark.parametrize( "pattern,re_pattern", [ @@ -77,7 +110,7 @@ def test_match_consuming(doc, text, pattern, re_pattern): """Test that matcher.__call__ consumes tokens on a match similar to re.findall.""" matcher = Matcher(doc.vocab) - matcher.add(re_pattern, [pattern]) + matcher.add(re_pattern, [pattern], greedy="FIRST") matches = matcher(doc) re_matches = [m.span() for m in re.finditer(re_pattern, text)] assert len(matches) == len(re_matches) diff --git a/spacy/tests/matcher/test_pattern_validation.py b/spacy/tests/matcher/test_pattern_validation.py index ec2660ab4..4d21aea81 100644 --- a/spacy/tests/matcher/test_pattern_validation.py +++ b/spacy/tests/matcher/test_pattern_validation.py @@ -1,11 +1,7 @@ -# coding: utf-8 -from __future__ import 
unicode_literals - import pytest from spacy.matcher import Matcher -from spacy.matcher._schemas import TOKEN_PATTERN_SCHEMA from spacy.errors import MatchPatternError -from spacy.util import get_json_validator, validate_json +from spacy.schemas import validate_token_pattern # (pattern, num errors with validation, num errors identified with minimal # checks) @@ -18,12 +14,12 @@ TEST_PATTERNS = [ ('[{"TEXT": "foo"}, {"LOWER": "bar"}]', 1, 1), ([1, 2, 3], 3, 1), # Bad patterns flagged outside of Matcher - ([{"_": {"foo": "bar", "baz": {"IN": "foo"}}}], 1, 0), + ([{"_": {"foo": "bar", "baz": {"IN": "foo"}}}], 2, 0), # prev: (1, 0) # Bad patterns not flagged with minimal checks ([{"LENGTH": "2", "TEXT": 2}, {"LOWER": "test"}], 2, 0), - ([{"LENGTH": {"IN": [1, 2, "3"]}}, {"POS": {"IN": "VERB"}}], 2, 0), - ([{"LENGTH": {"VALUE": 5}}], 1, 0), - ([{"TEXT": {"VALUE": "foo"}}], 1, 0), + ([{"LENGTH": {"IN": [1, 2, "3"]}}, {"POS": {"IN": "VERB"}}], 4, 0), # prev: (2, 0) + ([{"LENGTH": {"VALUE": 5}}], 2, 0), # prev: (1, 0) + ([{"TEXT": {"VALUE": "foo"}}], 2, 0), # prev: (1, 0) ([{"IS_DIGIT": -1}], 1, 0), ([{"ORTH": -1}], 1, 0), # Good patterns @@ -34,17 +30,11 @@ TEST_PATTERNS = [ ([{"LOWER": {"REGEX": "^X", "NOT_IN": ["XXX", "XY"]}}], 0, 0), ([{"NORM": "a"}, {"POS": {"IN": ["NOUN"]}}], 0, 0), ([{"_": {"foo": {"NOT_IN": ["bar", "baz"]}, "a": 5, "b": {">": 10}}}], 0, 0), + ([{"orth": "foo"}], 0, 0), # prev: xfail ([{"IS_SENT_START": True}], 0, 0), ([{"SENT_START": True}], 0, 0), ] -XFAIL_TEST_PATTERNS = [([{"orth": "foo"}], 0, 0)] - - -@pytest.fixture -def validator(): - return get_json_validator(TOKEN_PATTERN_SCHEMA) - @pytest.mark.parametrize( "pattern", [[{"XX": "y"}, {"LENGTH": "2"}, {"TEXT": {"IN": 5}}]] @@ -56,15 +46,8 @@ def test_matcher_pattern_validation(en_vocab, pattern): @pytest.mark.parametrize("pattern,n_errors,_", TEST_PATTERNS) -def test_pattern_validation(validator, pattern, n_errors, _): - errors = validate_json(pattern, validator) - assert len(errors) == n_errors - - -@pytest.mark.xfail -@pytest.mark.parametrize("pattern,n_errors,_", XFAIL_TEST_PATTERNS) -def test_xfail_pattern_validation(validator, pattern, n_errors, _): - errors = validate_json(pattern, validator) +def test_pattern_validation(pattern, n_errors, _): + errors = validate_token_pattern(pattern) assert len(errors) == n_errors @@ -78,10 +61,10 @@ def test_minimal_pattern_validation(en_vocab, pattern, n_errors, n_min_errors): matcher.add("TEST", [pattern]) -def test_pattern_warnings(en_vocab): +def test_pattern_errors(en_vocab): matcher = Matcher(en_vocab) # normalize "regex" to upper like "text" matcher.add("TEST1", [[{"text": {"regex": "regex"}}]]) - # warn if subpattern attribute isn't recognized and processed - with pytest.warns(UserWarning): + # error if subpattern attribute isn't recognized and processed + with pytest.raises(MatchPatternError): matcher.add("TEST2", [[{"TEXT": {"XX": "xx"}}]]) diff --git a/spacy/tests/matcher/test_phrase_matcher.py b/spacy/tests/matcher/test_phrase_matcher.py index 60aa584ef..1b81fd780 100644 --- a/spacy/tests/matcher/test_phrase_matcher.py +++ b/spacy/tests/matcher/test_phrase_matcher.py @@ -1,12 +1,8 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest import srsly from mock import Mock from spacy.matcher import PhraseMatcher -from spacy.tokens import Doc -from ..util import get_doc +from spacy.tokens import Doc, Span def test_matcher_phrase_matcher(en_vocab): @@ -143,10 +139,10 @@ def test_phrase_matcher_string_attrs(en_vocab): pos1 = ["PRON", "VERB", "NOUN"] 
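The pattern-validation hunks above replace the old JSON-schema helpers with spacy.schemas.validate_token_pattern, and an unrecognized subpattern attribute now raises MatchPatternError instead of only warning. A minimal sketch of the new behaviour, based only on the assertions in these tests:

    from spacy.matcher import Matcher
    from spacy.errors import MatchPatternError
    from spacy.schemas import validate_token_pattern
    from spacy.vocab import Vocab

    # standalone validation returns a list of error messages (empty if valid)
    errors = validate_token_pattern([{"TEXT": {"VALUE": "foo"}}])
    assert errors  # "VALUE" is not a valid predicate, per TEST_PATTERNS above

    # the Matcher itself rejects unknown subpattern attributes
    matcher = Matcher(Vocab())
    try:
        matcher.add("TEST", [[{"TEXT": {"XX": "xx"}}]])
    except MatchPatternError:
        pass  # raised where older versions only emitted a UserWarning
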
words2 = ["Yes", ",", "you", "hate", "dogs", "very", "much"] pos2 = ["INTJ", "PUNCT", "PRON", "VERB", "NOUN", "ADV", "ADV"] - pattern = get_doc(en_vocab, words=words1, pos=pos1) + pattern = Doc(en_vocab, words=words1, pos=pos1) matcher = PhraseMatcher(en_vocab, attr="POS") matcher.add("TEST", [pattern]) - doc = get_doc(en_vocab, words=words2, pos=pos2) + doc = Doc(en_vocab, words=words2, pos=pos2) matches = matcher(doc) assert len(matches) == 1 match_id, start, end = matches[0] @@ -161,10 +157,10 @@ def test_phrase_matcher_string_attrs_negative(en_vocab): pos1 = ["PRON", "VERB", "NOUN"] words2 = ["matcher:POS-PRON", "matcher:POS-VERB", "matcher:POS-NOUN"] pos2 = ["X", "X", "X"] - pattern = get_doc(en_vocab, words=words1, pos=pos1) + pattern = Doc(en_vocab, words=words1, pos=pos1) matcher = PhraseMatcher(en_vocab, attr="POS") matcher.add("TEST", [pattern]) - doc = get_doc(en_vocab, words=words2, pos=pos2) + doc = Doc(en_vocab, words=words2, pos=pos2) matches = matcher(doc) assert len(matches) == 0 @@ -190,9 +186,11 @@ def test_phrase_matcher_bool_attrs(en_vocab): def test_phrase_matcher_validation(en_vocab): doc1 = Doc(en_vocab, words=["Test"]) - doc1.is_parsed = True + doc1[0].dep_ = "ROOT" doc2 = Doc(en_vocab, words=["Test"]) - doc2.is_tagged = True + doc2[0].tag_ = "TAG" + doc2[0].pos_ = "X" + doc2[0].set_morph("Feat=Val") doc3 = Doc(en_vocab, words=["Test"]) matcher = PhraseMatcher(en_vocab, validate=True) with pytest.warns(UserWarning): @@ -215,18 +213,21 @@ def test_attr_validation(en_vocab): def test_attr_pipeline_checks(en_vocab): doc1 = Doc(en_vocab, words=["Test"]) - doc1.is_parsed = True + doc1[0].dep_ = "ROOT" doc2 = Doc(en_vocab, words=["Test"]) - doc2.is_tagged = True + doc2[0].tag_ = "TAG" + doc2[0].pos_ = "X" + doc2[0].set_morph("Feat=Val") + doc2[0].lemma_ = "LEMMA" doc3 = Doc(en_vocab, words=["Test"]) - # DEP requires is_parsed + # DEP requires DEP matcher = PhraseMatcher(en_vocab, attr="DEP") matcher.add("TEST1", [doc1]) with pytest.raises(ValueError): matcher.add("TEST2", [doc2]) with pytest.raises(ValueError): matcher.add("TEST3", [doc3]) - # TAG, POS, LEMMA require is_tagged + # TAG, POS, LEMMA require those values for attr in ("TAG", "POS", "LEMMA"): matcher = PhraseMatcher(en_vocab, attr=attr) matcher.add("TEST2", [doc2]) @@ -290,3 +291,30 @@ def test_phrase_matcher_pickle(en_vocab): # clunky way to vaguely check that callback is unpickled (vocab, docs, callbacks, attr) = matcher_unpickled.__reduce__()[1] assert isinstance(callbacks.get("TEST2"), Mock) + + +def test_phrase_matcher_as_spans(en_vocab): + """Test the new as_spans=True API.""" + matcher = PhraseMatcher(en_vocab) + matcher.add("A", [Doc(en_vocab, words=["hello", "world"])]) + matcher.add("B", [Doc(en_vocab, words=["test"])]) + doc = Doc(en_vocab, words=["...", "hello", "world", "this", "is", "a", "test"]) + matches = matcher(doc, as_spans=True) + assert len(matches) == 2 + assert isinstance(matches[0], Span) + assert matches[0].text == "hello world" + assert matches[0].label_ == "A" + assert isinstance(matches[1], Span) + assert matches[1].text == "test" + assert matches[1].label_ == "B" + + +def test_phrase_matcher_deprecated(en_vocab): + matcher = PhraseMatcher(en_vocab) + matcher.add("TEST", [Doc(en_vocab, words=["helllo"])]) + doc = Doc(en_vocab, words=["hello", "world"]) + with pytest.warns(DeprecationWarning) as record: + for _ in matcher.pipe([doc]): + pass + assert record.list + assert "spaCy v3.0" in str(record.list[0].message) diff --git a/spacy/tests/morphology/test_morph_converters.py 
b/spacy/tests/morphology/test_morph_converters.py new file mode 100644 index 000000000..6973bf782 --- /dev/null +++ b/spacy/tests/morphology/test_morph_converters.py @@ -0,0 +1,21 @@ +from spacy.morphology import Morphology + + +def test_feats_converters(): + feats = "Case=dat,gen|Number=sing" + feats_dict = {"Case": "dat,gen", "Number": "sing"} + + # simple conversions + assert Morphology.dict_to_feats(feats_dict) == feats + assert Morphology.feats_to_dict(feats) == feats_dict + + # roundtrips + assert Morphology.dict_to_feats(Morphology.feats_to_dict(feats)) == feats + assert Morphology.feats_to_dict(Morphology.dict_to_feats(feats_dict)) == feats_dict + + # unsorted input is normalized + unsorted_feats = "Number=sing|Case=gen,dat" + unsorted_feats_dict = {"Case": "gen,dat", "Number": "sing"} + assert Morphology.feats_to_dict(unsorted_feats) == feats_dict + assert Morphology.dict_to_feats(unsorted_feats_dict) == feats + assert Morphology.dict_to_feats(Morphology.feats_to_dict(unsorted_feats)) == feats diff --git a/spacy/tests/morphology/test_morph_features.py b/spacy/tests/morphology/test_morph_features.py index 41f807143..0693da690 100644 --- a/spacy/tests/morphology/test_morph_features.py +++ b/spacy/tests/morphology/test_morph_features.py @@ -1,17 +1,11 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.morphology import Morphology from spacy.strings import StringStore, get_string_id -from spacy.lemmatizer import Lemmatizer -from spacy.lookups import Lookups @pytest.fixture def morphology(): - lemmatizer = Lemmatizer(Lookups()) - return Morphology(StringStore(), {}, lemmatizer) + return Morphology(StringStore()) def test_init(morphology): @@ -19,32 +13,37 @@ def test_init(morphology): def test_add_morphology_with_string_names(morphology): - morphology.add({"Case_gen", "Number_sing"}) + morphology.add({"Case": "gen", "Number": "sing"}) def test_add_morphology_with_int_ids(morphology): - morphology.add({get_string_id("Case_gen"), get_string_id("Number_sing")}) + morphology.strings.add("Case") + morphology.strings.add("gen") + morphology.strings.add("Number") + morphology.strings.add("sing") + morphology.add( + { + get_string_id("Case"): get_string_id("gen"), + get_string_id("Number"): get_string_id("sing"), + } + ) def test_add_morphology_with_mix_strings_and_ints(morphology): - morphology.add({get_string_id("PunctSide_ini"), "VerbType_aux"}) + morphology.strings.add("PunctSide") + morphology.strings.add("ini") + morphology.add( + {get_string_id("PunctSide"): get_string_id("ini"), "VerbType": "aux"} + ) def test_morphology_tags_hash_distinctly(morphology): - tag1 = morphology.add({"PunctSide_ini", "VerbType_aux"}) - tag2 = morphology.add({"Case_gen", "Number_sing"}) + tag1 = morphology.add({"PunctSide": "ini", "VerbType": "aux"}) + tag2 = morphology.add({"Case": "gen", "Number": "sing"}) assert tag1 != tag2 def test_morphology_tags_hash_independent_of_order(morphology): - tag1 = morphology.add({"Case_gen", "Number_sing"}) - tag2 = morphology.add({"Number_sing", "Case_gen"}) + tag1 = morphology.add({"Case": "gen", "Number": "sing"}) + tag2 = morphology.add({"Number": "sing", "Case": "gen"}) assert tag1 == tag2 - - -def test_update_morphology_tag(morphology): - tag1 = morphology.add({"Case_gen"}) - tag2 = morphology.update(tag1, {"Number_sing"}) - assert tag1 != tag2 - tag3 = morphology.add({"Number_sing", "Case_gen"}) - assert tag2 == tag3 diff --git a/spacy/tests/morphology/test_morph_pickle.py b/spacy/tests/morphology/test_morph_pickle.py new file 
mode 100644 index 000000000..d9b0e3476 --- /dev/null +++ b/spacy/tests/morphology/test_morph_pickle.py @@ -0,0 +1,21 @@ +import pytest +import pickle +from spacy.morphology import Morphology +from spacy.strings import StringStore + + +@pytest.fixture +def morphology(): + morphology = Morphology(StringStore()) + morphology.add("Feat1=Val1|Feat2=Val2") + morphology.add("Feat3=Val3|Feat4=Val4") + return morphology + + +def test_morphology_pickle_roundtrip(morphology): + b = pickle.dumps(morphology) + reloaded_morphology = pickle.loads(b) + feat = reloaded_morphology.get(morphology.strings["Feat1=Val1|Feat2=Val2"]) + assert feat == "Feat1=Val1|Feat2=Val2" + feat = reloaded_morphology.get(morphology.strings["Feat3=Val3|Feat4=Val4"]) + assert feat == "Feat3=Val3|Feat4=Val4" diff --git a/spacy/tests/package/test_requirements.py b/spacy/tests/package/test_requirements.py new file mode 100644 index 000000000..8145beba9 --- /dev/null +++ b/spacy/tests/package/test_requirements.py @@ -0,0 +1,83 @@ +import re +from pathlib import Path + + +def test_build_dependencies(): + # Check that library requirements are pinned exactly the same across different setup files. + libs_ignore_requirements = [ + "pytest", + "pytest-timeout", + "mock", + "flake8", + ] + # ignore language-specific packages that shouldn't be installed by all + libs_ignore_setup = [ + "fugashi", + "natto-py", + "pythainlp", + "sudachipy", + "sudachidict_core", + "spacy-pkuseg", + ] + + # check requirements.txt + req_dict = {} + + root_dir = Path(__file__).parent + req_file = root_dir / "requirements.txt" + with req_file.open() as f: + lines = f.readlines() + for line in lines: + line = line.strip() + if not line.startswith("#"): + lib, v = _parse_req(line) + if lib and lib not in libs_ignore_requirements: + req_dict[lib] = v + # check setup.cfg and compare to requirements.txt + # also fails when there are missing or additional libs + setup_file = root_dir / "setup.cfg" + with setup_file.open() as f: + lines = f.readlines() + + setup_keys = set() + for line in lines: + line = line.strip() + if not line.startswith("#"): + lib, v = _parse_req(line) + if lib and not lib.startswith("cupy") and lib not in libs_ignore_setup: + req_v = req_dict.get(lib, None) + assert ( + req_v is not None + ), "{} in setup.cfg but not in requirements.txt".format(lib) + assert (lib + v) == (lib + req_v), ( + "{} has different version in setup.cfg and in requirements.txt: " + "{} and {} respectively".format(lib, v, req_v) + ) + setup_keys.add(lib) + assert sorted(setup_keys) == sorted( + req_dict.keys() + ) # if fail: requirements.txt contains a lib not in setup.cfg + + # check pyproject.toml and compare the versions of the libs to requirements.txt + # does not fail when there are missing or additional libs + toml_file = root_dir / "pyproject.toml" + with toml_file.open() as f: + lines = f.readlines() + for line in lines: + line = line.strip().strip(",").strip('"') + if not line.startswith("#"): + lib, v = _parse_req(line) + if lib: + req_v = req_dict.get(lib, None) + assert (lib + v) == (lib + req_v), ( + "{} has different version in pyproject.toml and in requirements.txt: " + "{} and {} respectively".format(lib, v, req_v) + ) + + +def _parse_req(line): + lib = re.match(r"^[a-z0-9\-]*", line).group(0) + v = line.replace(lib, "").strip() + if not re.match(r"^[<>=][<>=].*", v): + return None, None + return lib, v diff --git a/spacy/tests/parser/test_add_label.py b/spacy/tests/parser/test_add_label.py index 4ab9c1e70..2f750b60c 100644 --- 
a/spacy/tests/parser/test_add_label.py +++ b/spacy/tests/parser/test_add_label.py @@ -1,15 +1,13 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest -from thinc.neural.optimizers import Adam -from thinc.neural.ops import NumpyOps +from thinc.api import Adam, fix_random_seed +from spacy import registry from spacy.attrs import NORM -from spacy.gold import GoldParse from spacy.vocab import Vocab +from spacy.training import Example from spacy.tokens import Doc from spacy.pipeline import DependencyParser, EntityRecognizer -from spacy.util import fix_random_seed +from spacy.pipeline.ner import DEFAULT_NER_MODEL +from spacy.pipeline.dep_parser import DEFAULT_PARSER_MODEL @pytest.fixture @@ -19,7 +17,14 @@ def vocab(): @pytest.fixture def parser(vocab): - parser = DependencyParser(vocab) + config = { + "learn_tokens": False, + "min_action_freq": 30, + "update_with_oracle_cut_size": 100, + } + cfg = {"model": DEFAULT_PARSER_MODEL} + model = registry.resolve(cfg, validate=True)["model"] + parser = DependencyParser(vocab, model, **config) return parser @@ -30,28 +35,40 @@ def test_init_parser(parser): def _train_parser(parser): fix_random_seed(1) parser.add_label("left") - parser.begin_training([], **parser.cfg) - sgd = Adam(NumpyOps(), 0.001) + parser.initialize(lambda: [_parser_example(parser)]) + sgd = Adam(0.001) for i in range(5): losses = {} doc = Doc(parser.vocab, words=["a", "b", "c", "d"]) - gold = GoldParse(doc, heads=[1, 1, 3, 3], deps=["left", "ROOT", "left", "ROOT"]) - parser.update([doc], [gold], sgd=sgd, losses=losses) + gold = {"heads": [1, 1, 3, 3], "deps": ["left", "ROOT", "left", "ROOT"]} + example = Example.from_dict(doc, gold) + parser.update([example], sgd=sgd, losses=losses) return parser +def _parser_example(parser): + doc = Doc(parser.vocab, words=["a", "b", "c", "d"]) + gold = {"heads": [1, 1, 3, 3], "deps": ["right", "ROOT", "left", "ROOT"]} + return Example.from_dict(doc, gold) + + +def _ner_example(ner): + doc = Doc( + ner.vocab, + words=["Joe", "loves", "visiting", "London", "during", "the", "weekend"], + ) + gold = {"entities": [(0, 3, "PERSON"), (19, 25, "LOC")]} + return Example.from_dict(doc, gold) + + def test_add_label(parser): parser = _train_parser(parser) parser.add_label("right") - sgd = Adam(NumpyOps(), 0.001) - for i in range(10): + sgd = Adam(0.001) + for i in range(100): losses = {} - doc = Doc(parser.vocab, words=["a", "b", "c", "d"]) - gold = GoldParse( - doc, heads=[1, 1, 3, 3], deps=["right", "ROOT", "left", "ROOT"] - ) - parser.update([doc], [gold], sgd=sgd, losses=losses) + parser.update([_parser_example(parser)], sgd=sgd, losses=losses) doc = Doc(parser.vocab, words=["a", "b", "c", "d"]) doc = parser(doc) assert doc[0].dep_ == "right" @@ -59,27 +76,48 @@ def test_add_label(parser): def test_add_label_deserializes_correctly(): - ner1 = EntityRecognizer(Vocab()) + config = { + "learn_tokens": False, + "min_action_freq": 30, + "update_with_oracle_cut_size": 100, + } + cfg = {"model": DEFAULT_NER_MODEL} + model = registry.resolve(cfg, validate=True)["model"] + ner1 = EntityRecognizer(Vocab(), model, **config) ner1.add_label("C") ner1.add_label("B") ner1.add_label("A") - ner1.begin_training([]) - ner2 = EntityRecognizer(Vocab()).from_bytes(ner1.to_bytes()) + ner1.initialize(lambda: [_ner_example(ner1)]) + ner2 = EntityRecognizer(Vocab(), model, **config) + + # the second model needs to be resized before we can call from_bytes + ner2.model.attrs["resize_output"](ner2.model, ner1.moves.n_moves) + ner2.from_bytes(ner1.to_bytes()) 
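The parser tests above drop GoldParse in favour of training Examples, and components constructed directly now take a registry-resolved model plus a config dict. A minimal sketch of that pattern, reusing the config, model and annotations from the fixtures in this file:

    from thinc.api import Adam
    from spacy import registry
    from spacy.pipeline import DependencyParser
    from spacy.pipeline.dep_parser import DEFAULT_PARSER_MODEL
    from spacy.tokens import Doc
    from spacy.training import Example
    from spacy.vocab import Vocab

    config = {
        "learn_tokens": False,
        "min_action_freq": 30,
        "update_with_oracle_cut_size": 100,
    }
    model = registry.resolve({"model": DEFAULT_PARSER_MODEL}, validate=True)["model"]
    parser = DependencyParser(Vocab(), model, **config)
    parser.add_label("left")

    def make_example():
        # Example.from_dict replaces the old GoldParse(doc, heads=..., deps=...)
        doc = Doc(parser.vocab, words=["a", "b", "c", "d"])
        gold = {"heads": [1, 1, 3, 3], "deps": ["left", "ROOT", "left", "ROOT"]}
        return Example.from_dict(doc, gold)

    parser.initialize(lambda: [make_example()])  # replaces begin_training()
    sgd = Adam(0.001)
    losses = {}
    parser.update([make_example()], sgd=sgd, losses=losses)
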
assert ner1.moves.n_moves == ner2.moves.n_moves for i in range(ner1.moves.n_moves): assert ner1.moves.get_class_name(i) == ner2.moves.get_class_name(i) @pytest.mark.parametrize( - "pipe_cls,n_moves", [(DependencyParser, 5), (EntityRecognizer, 4)] + "pipe_cls,n_moves,model_config", + [ + (DependencyParser, 5, DEFAULT_PARSER_MODEL), + (EntityRecognizer, 4, DEFAULT_NER_MODEL), + ], ) -def test_add_label_get_label(pipe_cls, n_moves): +def test_add_label_get_label(pipe_cls, n_moves, model_config): """Test that added labels are returned correctly. This test was added to test for a bug in DependencyParser.labels that'd cause it to fail when splitting the move names. """ labels = ["A", "B", "C"] - pipe = pipe_cls(Vocab()) + model = registry.resolve({"model": model_config}, validate=True)["model"] + config = { + "learn_tokens": False, + "min_action_freq": 30, + "update_with_oracle_cut_size": 100, + } + pipe = pipe_cls(Vocab(), model, **config) for label in labels: pipe.add_label(label) assert len(pipe.move_names) == len(labels) * n_moves diff --git a/spacy/tests/parser/test_arc_eager_oracle.py b/spacy/tests/parser/test_arc_eager_oracle.py index 41b7a4861..84070db73 100644 --- a/spacy/tests/parser/test_arc_eager_oracle.py +++ b/spacy/tests/parser/test_arc_eager_oracle.py @@ -1,23 +1,23 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest from spacy.vocab import Vocab +from spacy import registry +from spacy.training import Example from spacy.pipeline import DependencyParser from spacy.tokens import Doc -from spacy.gold import GoldParse -from spacy.syntax.nonproj import projectivize -from spacy.syntax.stateclass import StateClass -from spacy.syntax.arc_eager import ArcEager +from spacy.pipeline._parser_internals.nonproj import projectivize +from spacy.pipeline._parser_internals.arc_eager import ArcEager +from spacy.pipeline.dep_parser import DEFAULT_PARSER_MODEL def get_sequence_costs(M, words, heads, deps, transitions): doc = Doc(Vocab(), words=words) - gold = GoldParse(doc, heads=heads, deps=deps) - state = StateClass(doc) - M.preprocess_gold(gold) + example = Example.from_dict(doc, {"heads": heads, "deps": deps}) + states, golds, _ = M.init_gold_batch([example]) + state = states[0] + gold = golds[0] cost_history = [] for gold_action in transitions: + gold.update(state) state_costs = {} for i in range(M.n_moves): name = M.class_name(i) @@ -40,31 +40,13 @@ def arc_eager(vocab): return moves -@pytest.fixture -def words(): - return ["a", "b"] - - -@pytest.fixture -def doc(words, vocab): - if vocab is None: - vocab = Vocab() - return Doc(vocab, words=list(words)) - - -@pytest.fixture -def gold(doc, words): - if len(words) == 2: - return GoldParse(doc, words=["a", "b"], heads=[0, 0], deps=["ROOT", "right"]) - else: - raise NotImplementedError - - -@pytest.mark.xfail def test_oracle_four_words(arc_eager, vocab): words = ["a", "b", "c", "d"] heads = [1, 1, 3, 3] deps = ["left", "ROOT", "left", "ROOT"] + for dep in deps: + arc_eager.add_action(2, dep) # Left + arc_eager.add_action(3, dep) # Right actions = ["L-left", "B-ROOT", "L-left"] state, cost_history = get_sequence_costs(arc_eager, words, heads, deps, actions) assert state.is_final() @@ -73,7 +55,7 @@ def test_oracle_four_words(arc_eager, vocab): assert state_costs[actions[i]] == 0.0, actions[i] for other_action, cost in state_costs.items(): if other_action != actions[i]: - assert cost >= 1 + assert cost >= 1, (i, other_action) annot_tuples = [ @@ -130,19 +112,119 @@ annot_tuples = [ def test_get_oracle_actions(): + ids, words, 
tags, heads, deps, ents = [], [], [], [], [], [] + for id_, word, tag, head, dep, ent in annot_tuples: + ids.append(id_) + words.append(word) + tags.append(tag) + heads.append(head) + deps.append(dep) + ents.append(ent) doc = Doc(Vocab(), words=[t[1] for t in annot_tuples]) - parser = DependencyParser(doc.vocab) + config = { + "learn_tokens": False, + "min_action_freq": 0, + "update_with_oracle_cut_size": 100, + } + cfg = {"model": DEFAULT_PARSER_MODEL} + model = registry.resolve(cfg, validate=True)["model"] + parser = DependencyParser(doc.vocab, model, **config) parser.moves.add_action(0, "") parser.moves.add_action(1, "") parser.moves.add_action(1, "") parser.moves.add_action(4, "ROOT") - for i, (id_, word, tag, head, dep, ent) in enumerate(annot_tuples): + heads, deps = projectivize(heads, deps) + for i, (head, dep) in enumerate(zip(heads, deps)): if head > i: parser.moves.add_action(2, dep) elif head < i: parser.moves.add_action(3, dep) - ids, words, tags, heads, deps, ents = zip(*annot_tuples) - heads, deps = projectivize(heads, deps) - gold = GoldParse(doc, words=words, tags=tags, heads=heads, deps=deps) - parser.moves.preprocess_gold(gold) - parser.moves.get_oracle_sequence(doc, gold) + example = Example.from_dict( + doc, {"words": words, "tags": tags, "heads": heads, "deps": deps} + ) + parser.moves.get_oracle_sequence(example) + + +def test_oracle_dev_sentence(vocab, arc_eager): + words_deps_heads = """ + Rolls-Royce nn Inc. + Motor nn Inc. + Cars nn Inc. + Inc. nsubj said + said ROOT said + it nsubj expects + expects ccomp said + its poss sales + U.S. nn sales + sales nsubj steady + to aux steady + remain cop steady + steady xcomp expects + at prep steady + about quantmod 1,200 + 1,200 num cars + cars pobj at + in prep steady + 1990 pobj in + . punct said + """ + expected_transitions = [ + "S", # Shift 'Motor' + "S", # Shift 'Cars' + "L-nn", # Attach 'Cars' to 'Inc.' + "L-nn", # Attach 'Motor' to 'Inc.' + "L-nn", # Attach 'Rolls-Royce' to 'Inc.', force shift + "L-nsubj", # Attach 'Inc.' to 'said' + "S", # Shift 'it' + "L-nsubj", # Attach 'it.' to 'expects' + "R-ccomp", # Attach 'expects' to 'said' + "S", # Shift 'its' + "S", # Shift 'U.S.' + "L-nn", # Attach 'U.S.' to 'sales' + "L-poss", # Attach 'its' to 'sales' + "S", # Shift 'sales' + "S", # Shift 'to' + "S", # Shift 'remain' + "L-cop", # Attach 'remain' to 'steady' + "L-aux", # Attach 'to' to 'steady' + "L-nsubj", # Attach 'sales' to 'steady' + "R-xcomp", # Attach 'steady' to 'expects' + "R-prep", # Attach 'at' to 'steady' + "S", # Shift 'about' + "L-quantmod", # Attach "about" to "1,200" + "S", # Shift "1,200" + "L-num", # Attach "1,200" to "cars" + "R-pobj", # Attach "cars" to "at" + "D", # Reduce "cars" + "D", # Reduce "at" + "R-prep", # Attach "in" to "steady" + "R-pobj", # Attach "1990" to "in" + "D", # Reduce "1990" + "D", # Reduce "in" + "D", # Reduce "steady" + "D", # Reduce "expects" + "R-punct", # Attach "." 
to "said" + ] + + gold_words = [] + gold_deps = [] + gold_heads = [] + for line in words_deps_heads.strip().split("\n"): + line = line.strip() + if not line: + continue + word, dep, head = line.split() + gold_words.append(word) + gold_deps.append(dep) + gold_heads.append(head) + gold_heads = [gold_words.index(head) for head in gold_heads] + for dep in gold_deps: + arc_eager.add_action(2, dep) # Left + arc_eager.add_action(3, dep) # Right + + doc = Doc(Vocab(), words=gold_words) + example = Example.from_dict(doc, {"heads": gold_heads, "deps": gold_deps}) + + ae_oracle_actions = arc_eager.get_oracle_sequence(example) + ae_oracle_actions = [arc_eager.get_class_name(i) for i in ae_oracle_actions] + assert ae_oracle_actions == expected_transitions diff --git a/spacy/tests/parser/test_ner.py b/spacy/tests/parser/test_ner.py index dd623e07f..b4c22b48d 100644 --- a/spacy/tests/parser/test_ner.py +++ b/spacy/tests/parser/test_ner.py @@ -1,16 +1,24 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest -from spacy.lang.en import English +from numpy.testing import assert_equal +from spacy.attrs import ENT_IOB +from spacy import util +from spacy.lang.en import English from spacy.language import Language from spacy.lookups import Lookups -from spacy.pipeline import EntityRecognizer, EntityRuler -from spacy.vocab import Vocab -from spacy.syntax.ner import BiluoPushDown -from spacy.gold import GoldParse, minibatch +from spacy.pipeline._parser_internals.ner import BiluoPushDown +from spacy.training import Example from spacy.tokens import Doc +from spacy.vocab import Vocab +import logging + +from ..util import make_tempdir + + +TRAIN_DATA = [ + ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}), + ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}), +] @pytest.fixture @@ -45,51 +53,56 @@ def tsys(vocab, entity_types): def test_get_oracle_moves(tsys, doc, entity_annots): - gold = GoldParse(doc, entities=entity_annots) - tsys.preprocess_gold(gold) - act_classes = tsys.get_oracle_sequence(doc, gold) + example = Example.from_dict(doc, {"entities": entity_annots}) + act_classes = tsys.get_oracle_sequence(example) names = [tsys.get_class_name(act) for act in act_classes] assert names == ["U-PERSON", "O", "O", "B-GPE", "L-GPE", "O"] +@pytest.mark.filterwarnings("ignore::UserWarning") def test_get_oracle_moves_negative_entities(tsys, doc, entity_annots): entity_annots = [(s, e, "!" 
+ label) for s, e, label in entity_annots] - gold = GoldParse(doc, entities=entity_annots) - for i, tag in enumerate(gold.ner): + example = Example.from_dict(doc, {"entities": entity_annots}) + ex_dict = example.to_dict() + + for i, tag in enumerate(ex_dict["doc_annotation"]["entities"]): if tag == "L-!GPE": - gold.ner[i] = "-" - tsys.preprocess_gold(gold) - act_classes = tsys.get_oracle_sequence(doc, gold) + ex_dict["doc_annotation"]["entities"][i] = "-" + example = Example.from_dict(doc, ex_dict) + + act_classes = tsys.get_oracle_sequence(example) names = [tsys.get_class_name(act) for act in act_classes] assert names def test_get_oracle_moves_negative_entities2(tsys, vocab): doc = Doc(vocab, words=["A", "B", "C", "D"]) - gold = GoldParse(doc, entities=[]) - gold.ner = ["B-!PERSON", "L-!PERSON", "B-!PERSON", "L-!PERSON"] - tsys.preprocess_gold(gold) - act_classes = tsys.get_oracle_sequence(doc, gold) + entity_annots = ["B-!PERSON", "L-!PERSON", "B-!PERSON", "L-!PERSON"] + example = Example.from_dict(doc, {"entities": entity_annots}) + act_classes = tsys.get_oracle_sequence(example) names = [tsys.get_class_name(act) for act in act_classes] assert names +@pytest.mark.skip(reason="Maybe outdated? Unsure") def test_get_oracle_moves_negative_O(tsys, vocab): doc = Doc(vocab, words=["A", "B", "C", "D"]) - gold = GoldParse(doc, entities=[]) - gold.ner = ["O", "!O", "O", "!O"] - tsys.preprocess_gold(gold) - act_classes = tsys.get_oracle_sequence(doc, gold) + entity_annots = ["O", "!O", "O", "!O"] + example = Example.from_dict(doc, {"entities": entity_annots}) + act_classes = tsys.get_oracle_sequence(example) names = [tsys.get_class_name(act) for act in act_classes] assert names +# We can't easily represent this on a Doc object. Not sure what the best solution +# would be, but I don't think it's an important use case? +@pytest.mark.skip(reason="No longer supported") def test_oracle_moves_missing_B(en_vocab): words = ["B", "52", "Bomber"] biluo_tags = [None, None, "L-PRODUCT"] doc = Doc(en_vocab, words=words) - gold = GoldParse(doc, words=words, entities=biluo_tags) + example = Example.from_dict(doc, {"words": words, "entities": biluo_tags}) moves = BiluoPushDown(en_vocab.strings) move_types = ("M", "B", "I", "L", "U", "O") @@ -104,16 +117,18 @@ def test_oracle_moves_missing_B(en_vocab): moves.add_action(move_types.index("I"), label) moves.add_action(move_types.index("L"), label) moves.add_action(move_types.index("U"), label) - moves.preprocess_gold(gold) - moves.get_oracle_sequence(doc, gold) + moves.get_oracle_sequence(example) +# We can't easily represent this on a Doc object. Not sure what the best solution +# would be, but I don't think it's an important use case? +@pytest.mark.skip(reason="No longer supported") def test_oracle_moves_whitespace(en_vocab): words = ["production", "\n", "of", "Northrop", "\n", "Corp.", "\n", "'s", "radar"] biluo_tags = ["O", "O", "O", "B-ORG", None, "I-ORG", "L-ORG", "O", "O"] doc = Doc(en_vocab, words=words) - gold = GoldParse(doc, words=words, entities=biluo_tags) + example = Example.from_dict(doc, {"entities": biluo_tags}) moves = BiluoPushDown(en_vocab.strings) move_types = ("M", "B", "I", "L", "U", "O") @@ -125,8 +140,7 @@ def test_oracle_moves_whitespace(en_vocab): else: action, label = tag.split("-") moves.add_action(move_types.index(action), label) - moves.preprocess_gold(gold) - moves.get_oracle_sequence(doc, gold) + moves.get_oracle_sequence(example) def test_accept_blocked_token(): @@ -134,7 +148,8 @@ def test_accept_blocked_token(): # 1. 
test normal behaviour nlp1 = English() doc1 = nlp1("I live in New York") - ner1 = EntityRecognizer(doc1.vocab) + config = {} + ner1 = nlp1.create_pipe("ner", config=config) assert [token.ent_iob_ for token in doc1] == ["", "", "", "", ""] assert [token.ent_type_ for token in doc1] == ["", "", "", "", ""] @@ -152,10 +167,11 @@ def test_accept_blocked_token(): # 2. test blocking behaviour nlp2 = English() doc2 = nlp2("I live in New York") - ner2 = EntityRecognizer(doc2.vocab) + config = {} + ner2 = nlp2.create_pipe("ner", config=config) # set "New York" to a blocked entity - doc2.ents = [(0, 3, 5)] + doc2.set_ents([], blocked=[doc2[3:5]], default="unmodified") assert [token.ent_iob_ for token in doc2] == ["", "", "", "B", "B"] assert [token.ent_type_ for token in doc2] == ["", "", "", "", ""] @@ -184,36 +200,30 @@ def test_train_empty(): ] nlp = English() - ner = nlp.create_pipe("ner") + train_examples = [] + for t in train_data: + train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1])) + ner = nlp.add_pipe("ner", last=True) ner.add_label("PERSON") - nlp.add_pipe(ner, last=True) - - nlp.begin_training() + nlp.initialize() for itn in range(2): losses = {} - batches = minibatch(train_data) + batches = util.minibatch(train_examples, size=8) for batch in batches: - texts, annotations = zip(*batch) - nlp.update( - texts, # batch of texts - annotations, # batch of annotations - losses=losses, - ) + nlp.update(batch, losses=losses) def test_overwrite_token(): nlp = English() - ner1 = nlp.create_pipe("ner") - nlp.add_pipe(ner1, name="ner") - nlp.begin_training() - + nlp.add_pipe("ner") + nlp.initialize() # The untrained NER will predict O for each token doc = nlp("I live in New York") assert [token.ent_iob_ for token in doc] == ["O", "O", "O", "O", "O"] assert [token.ent_type_ for token in doc] == ["", "", "", "", ""] - # Check that a new ner can overwrite O - ner2 = EntityRecognizer(doc.vocab) + config = {} + ner2 = nlp.create_pipe("ner", config=config) ner2.moves.add_action(5, "") ner2.add_label("GPE") state = ner2.moves.init_batch([doc])[0] @@ -224,22 +234,30 @@ def test_overwrite_token(): assert ner2.moves.is_valid(state, "L-GPE") +def test_empty_ner(): + nlp = English() + ner = nlp.add_pipe("ner") + ner.add_label("MY_LABEL") + nlp.initialize() + doc = nlp("John is watching the news about Croatia's elections") + # if this goes wrong, the initialization of the parser's upper layer is probably broken + result = ["O", "O", "O", "O", "O", "O", "O", "O", "O"] + assert [token.ent_iob_ for token in doc] == result + + def test_ruler_before_ner(): """ Test that an NER works after an entity_ruler: the second can add annotations """ nlp = English() # 1 : Entity Ruler - should set "this" to B and everything else to empty - ruler = EntityRuler(nlp) patterns = [{"label": "THING", "pattern": "This"}] + ruler = nlp.add_pipe("entity_ruler") ruler.add_patterns(patterns) - nlp.add_pipe(ruler) # 2: untrained NER - should set everything else to O - untrained_ner = nlp.create_pipe("ner") + untrained_ner = nlp.add_pipe("ner") untrained_ner.add_label("MY_LABEL") - nlp.add_pipe(untrained_ner) - nlp.begin_training() - + nlp.initialize() doc = nlp("This is Antti Korhonen speaking in Finland") expected_iobs = ["B", "O", "O", "O", "O", "O", "O"] expected_types = ["THING", "", "", "", "", "", ""] @@ -252,16 +270,14 @@ def test_ner_before_ruler(): nlp = English() # 1: untrained NER - should set everything to O - untrained_ner = nlp.create_pipe("ner") + untrained_ner = nlp.add_pipe("ner", name="uner") 
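In these NER tests, components are added by their string names and the pipeline is prepared with nlp.initialize() rather than nlp.begin_training(). A minimal sketch of the pattern exercised by test_ruler_before_ner above, using the same labels and example text:

    from spacy.lang.en import English

    nlp = English()
    ruler = nlp.add_pipe("entity_ruler")  # add_pipe creates and returns the component
    ruler.add_patterns([{"label": "THING", "pattern": "This"}])
    ner = nlp.add_pipe("ner")
    ner.add_label("MY_LABEL")
    nlp.initialize()  # replaces nlp.begin_training()

    doc = nlp("This is Antti Korhonen speaking in Finland")
    assert doc[0].ent_type_ == "THING"  # the ruler's annotation is kept by the untrained NER
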
untrained_ner.add_label("MY_LABEL") - nlp.add_pipe(untrained_ner, name="uner") - nlp.begin_training() + nlp.initialize() # 2 : Entity Ruler - should set "this" to B and keep everything else O - ruler = EntityRuler(nlp) patterns = [{"label": "THING", "pattern": "This"}] + ruler = nlp.add_pipe("entity_ruler") ruler.add_patterns(patterns) - nlp.add_pipe(ruler) doc = nlp("This is Antti Korhonen speaking in Finland") expected_iobs = ["B", "O", "O", "O", "O", "O", "O"] @@ -274,11 +290,10 @@ def test_block_ner(): """ Test functionality for blocking tokens so they can't be in a named entity """ # block "Antti L Korhonen" from being a named entity nlp = English() - nlp.add_pipe(BlockerComponent1(2, 5)) - untrained_ner = nlp.create_pipe("ner") + nlp.add_pipe("blocker", config={"start": 2, "end": 5}) + untrained_ner = nlp.add_pipe("ner") untrained_ner.add_label("MY_LABEL") - nlp.add_pipe(untrained_ner, name="uner") - nlp.begin_training() + nlp.initialize() doc = nlp("This is Antti L Korhonen speaking in Finland") expected_iobs = ["O", "O", "B", "B", "B", "O", "O", "O"] expected_types = ["", "", "", "", "", "", "", ""] @@ -286,49 +301,78 @@ def test_block_ner(): assert [token.ent_type_ for token in doc] == expected_types -def test_change_number_features(): - # Test the default number features +def test_overfitting_IO(): + # Simple test to try and quickly overfit the NER component - ensuring the ML models work correctly nlp = English() - ner = nlp.create_pipe("ner") - nlp.add_pipe(ner) - ner.add_label("PERSON") - nlp.begin_training() - assert ner.model.lower.nF == ner.nr_feature - # Test we can change it - nlp = English() - ner = nlp.create_pipe("ner") - nlp.add_pipe(ner) - ner.add_label("PERSON") - nlp.begin_training( - component_cfg={"ner": {"nr_feature_tokens": 3, "token_vector_width": 128}} - ) - assert ner.model.lower.nF == 3 - # Test the model runs - nlp("hello world") + ner = nlp.add_pipe("ner") + train_examples = [] + for text, annotations in TRAIN_DATA: + train_examples.append(Example.from_dict(nlp.make_doc(text), annotations)) + for ent in annotations.get("entities"): + ner.add_label(ent[2]) + optimizer = nlp.initialize() + + for i in range(50): + losses = {} + nlp.update(train_examples, sgd=optimizer, losses=losses) + assert losses["ner"] < 0.00001 + + # test the trained model + test_text = "I like London." 
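Blocking tokens from being included in a named entity is now expressed through Doc.set_ents with the blocked argument, as shown in test_accept_blocked_token above and the blocker component registered further down in this file. A minimal sketch mirroring that test's assertions:

    from spacy.lang.en import English

    nlp = English()
    doc = nlp("I live in New York")
    # mark "New York" as blocked so a later NER component cannot annotate it
    doc.set_ents([], blocked=[doc[3:5]], default="unmodified")
    assert [t.ent_iob_ for t in doc] == ["", "", "", "B", "B"]
    assert [t.ent_type_ for t in doc] == ["", "", "", "", ""]
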
+ doc = nlp(test_text) + ents = doc.ents + assert len(ents) == 1 + assert ents[0].text == "London" + assert ents[0].label_ == "LOC" + + # Also test the results are still the same after IO + with make_tempdir() as tmp_dir: + nlp.to_disk(tmp_dir) + nlp2 = util.load_model_from_path(tmp_dir) + doc2 = nlp2(test_text) + ents2 = doc2.ents + assert len(ents2) == 1 + assert ents2[0].text == "London" + assert ents2[0].label_ == "LOC" + + # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions + texts = [ + "Just a sentence.", + "Then one more sentence about London.", + "Here is another one.", + "I like London.", + ] + batch_deps_1 = [doc.to_array([ENT_IOB]) for doc in nlp.pipe(texts)] + batch_deps_2 = [doc.to_array([ENT_IOB]) for doc in nlp.pipe(texts)] + no_batch_deps = [doc.to_array([ENT_IOB]) for doc in [nlp(text) for text in texts]] + assert_equal(batch_deps_1, batch_deps_2) + assert_equal(batch_deps_1, no_batch_deps) -def test_ner_warns_no_lookups(): - nlp = Language() +def test_ner_warns_no_lookups(caplog): + nlp = English() + assert nlp.lang in util.LEXEME_NORM_LANGS nlp.vocab.lookups = Lookups() assert not len(nlp.vocab.lookups) - ner = nlp.create_pipe("ner") - nlp.add_pipe(ner) - with pytest.warns(UserWarning): - nlp.begin_training() + nlp.add_pipe("ner") + with caplog.at_level(logging.DEBUG): + nlp.initialize() + assert "W033" in caplog.text + caplog.clear() nlp.vocab.lookups.add_table("lexeme_norm") nlp.vocab.lookups.get_table("lexeme_norm")["a"] = "A" - with pytest.warns(None) as record: - nlp.begin_training() - assert not record.list + with caplog.at_level(logging.DEBUG): + nlp.initialize() + assert "W033" not in caplog.text -class BlockerComponent1(object): - name = "my_blocker" - - def __init__(self, start, end): +@Language.factory("blocker") +class BlockerComponent1: + def __init__(self, nlp, start, end, name="my_blocker"): self.start = start self.end = end + self.name = name def __call__(self, doc): - doc.ents = [(0, self.start, self.end)] + doc.set_ents([], blocked=[doc[self.start : self.end]], default="unmodified") return doc diff --git a/spacy/tests/parser/test_neural_parser.py b/spacy/tests/parser/test_neural_parser.py index 062c76ae3..1bb5d4aa5 100644 --- a/spacy/tests/parser/test_neural_parser.py +++ b/spacy/tests/parser/test_neural_parser.py @@ -1,13 +1,14 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest -from spacy._ml import Tok2Vec + +from spacy import registry +from spacy.training import Example from spacy.vocab import Vocab -from spacy.syntax.arc_eager import ArcEager -from spacy.syntax.nn_parser import Parser +from spacy.pipeline._parser_internals.arc_eager import ArcEager +from spacy.pipeline.transition_parser import Parser from spacy.tokens.doc import Doc -from spacy.gold import GoldParse +from thinc.api import Model +from spacy.pipeline.tok2vec import DEFAULT_TOK2VEC_MODEL +from spacy.pipeline.dep_parser import DEFAULT_PARSER_MODEL @pytest.fixture @@ -23,17 +24,31 @@ def arc_eager(vocab): @pytest.fixture def tok2vec(): - return Tok2Vec(8, 100) + cfg = {"model": DEFAULT_TOK2VEC_MODEL} + tok2vec = registry.resolve(cfg, validate=True)["model"] + tok2vec.initialize() + return tok2vec @pytest.fixture def parser(vocab, arc_eager): - return Parser(vocab, moves=arc_eager, model=None) + config = { + "learn_tokens": False, + "min_action_freq": 30, + "update_with_oracle_cut_size": 100, + } + cfg = {"model": DEFAULT_PARSER_MODEL} + model = registry.resolve(cfg, validate=True)["model"] + return Parser(vocab, 
model, moves=arc_eager, **config) @pytest.fixture -def model(arc_eager, tok2vec): - return Parser.Model(arc_eager.n_moves, token_vector_width=tok2vec.nO)[0] +def model(arc_eager, tok2vec, vocab): + cfg = {"model": DEFAULT_PARSER_MODEL} + model = registry.resolve(cfg, validate=True)["model"] + model.attrs["resize_output"](model, arc_eager.n_moves) + model.initialize() + return model @pytest.fixture @@ -43,20 +58,27 @@ def doc(vocab): @pytest.fixture def gold(doc): - return GoldParse(doc, heads=[1, 1, 1], deps=["L", "ROOT", "R"]) + return {"heads": [1, 1, 1], "deps": ["L", "ROOT", "R"]} def test_can_init_nn_parser(parser): - assert parser.model is None + assert isinstance(parser.model, Model) -def test_build_model(parser): - parser.model = Parser.Model(parser.moves.n_moves, hist_size=0)[0] +def test_build_model(parser, vocab): + config = { + "learn_tokens": False, + "min_action_freq": 0, + "update_with_oracle_cut_size": 100, + } + cfg = {"model": DEFAULT_PARSER_MODEL} + model = registry.resolve(cfg, validate=True)["model"] + parser.model = Parser(vocab, model=model, moves=parser.moves, **config).model assert parser.model is not None def test_predict_doc(parser, tok2vec, model, doc): - doc.tensor = tok2vec([doc])[0] + doc.tensor = tok2vec.predict([doc])[0] parser.model = model parser(doc) @@ -64,23 +86,25 @@ def test_predict_doc(parser, tok2vec, model, doc): def test_update_doc(parser, model, doc, gold): parser.model = model - def optimize(weights, gradient, key=None): + def optimize(key, weights, gradient): weights -= 0.001 * gradient + return weights, gradient - parser.update([doc], [gold], sgd=optimize) + example = Example.from_dict(doc, gold) + parser.update([example], sgd=optimize) -@pytest.mark.xfail +@pytest.mark.skip(reason="No longer supported") def test_predict_doc_beam(parser, model, doc): parser.model = model parser(doc, beam_width=32, beam_density=0.001) -@pytest.mark.xfail +@pytest.mark.skip(reason="No longer supported") def test_update_doc_beam(parser, model, doc, gold): parser.model = model def optimize(weights, gradient, key=None): weights -= 0.001 * gradient - parser.update_beam([doc], [gold], sgd=optimize) + parser.update_beam((doc, gold), sgd=optimize) diff --git a/spacy/tests/parser/test_nn_beam.py b/spacy/tests/parser/test_nn_beam.py index 9dca99255..e69de29bb 100644 --- a/spacy/tests/parser/test_nn_beam.py +++ b/spacy/tests/parser/test_nn_beam.py @@ -1,103 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest -import numpy -from spacy.vocab import Vocab -from spacy.language import Language -from spacy.pipeline import DependencyParser -from spacy.syntax.arc_eager import ArcEager -from spacy.tokens import Doc -from spacy.syntax._beam_utils import ParserBeam -from spacy.syntax.stateclass import StateClass -from spacy.gold import GoldParse - - -@pytest.fixture -def vocab(): - return Vocab() - - -@pytest.fixture -def moves(vocab): - aeager = ArcEager(vocab.strings, {}) - aeager.add_action(2, "nsubj") - aeager.add_action(3, "dobj") - aeager.add_action(2, "aux") - return aeager - - -@pytest.fixture -def docs(vocab): - return [Doc(vocab, words=["Rats", "bite", "things"])] - - -@pytest.fixture -def states(docs): - return [StateClass(doc) for doc in docs] - - -@pytest.fixture -def tokvecs(docs, vector_size): - output = [] - for doc in docs: - vec = numpy.random.uniform(-0.1, 0.1, (len(doc), vector_size)) - output.append(numpy.asarray(vec)) - return output - - -@pytest.fixture -def golds(docs): - return [GoldParse(doc) for doc in docs] - - 
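The test_neural_parser.py fixtures above build their models by resolving the default config from the registry and resizing the output layer to the transition system's number of actions before initializing. A minimal sketch of that pattern (the move count here is an illustrative stand-in for arc_eager.n_moves):

    from spacy import registry
    from spacy.pipeline.dep_parser import DEFAULT_PARSER_MODEL

    cfg = {"model": DEFAULT_PARSER_MODEL}
    model = registry.resolve(cfg, validate=True)["model"]
    # grow the upper layer to match the transition system before initializing
    model.attrs["resize_output"](model, 12)  # 12 is a placeholder for arc_eager.n_moves
    model.initialize()
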
-@pytest.fixture -def batch_size(docs): - return len(docs) - - -@pytest.fixture -def beam_width(): - return 4 - - -@pytest.fixture -def vector_size(): - return 6 - - -@pytest.fixture -def beam(moves, states, golds, beam_width): - return ParserBeam(moves, states, golds, width=beam_width, density=0.0) - - -@pytest.fixture -def scores(moves, batch_size, beam_width): - return [ - numpy.asarray( - numpy.random.uniform(-0.1, 0.1, (batch_size, moves.n_moves)), dtype="f" - ) - for _ in range(batch_size) - ] - - -def test_create_beam(beam): - pass - - -def test_beam_advance(beam, scores): - beam.advance(scores) - - -def test_beam_advance_too_few_scores(beam, scores): - with pytest.raises(IndexError): - beam.advance(scores[:-1]) - - -def test_beam_parse(): - nlp = Language() - nlp.add_pipe(DependencyParser(nlp.vocab), name="parser") - nlp.parser.add_label("nsubj") - nlp.parser.begin_training([], token_vector_width=8, hidden_width=8) - doc = nlp.make_doc("Australia is a country") - nlp.parser(doc, beam_width=2) diff --git a/spacy/tests/parser/test_nonproj.py b/spacy/tests/parser/test_nonproj.py index 8bf8111c1..544701a4c 100644 --- a/spacy/tests/parser/test_nonproj.py +++ b/spacy/tests/parser/test_nonproj.py @@ -1,12 +1,8 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest -from spacy.syntax.nonproj import ancestors, contains_cycle, is_nonproj_arc -from spacy.syntax.nonproj import is_nonproj_tree -from spacy.syntax import nonproj - -from ..util import get_doc +from spacy.pipeline._parser_internals.nonproj import ancestors, contains_cycle +from spacy.pipeline._parser_internals.nonproj import is_nonproj_tree, is_nonproj_arc +from spacy.pipeline._parser_internals import nonproj +from spacy.tokens import Doc @pytest.fixture @@ -48,7 +44,7 @@ def test_parser_ancestors(tree, cyclic_tree, partial_tree, multirooted_tree): def test_parser_contains_cycle(tree, cyclic_tree, partial_tree, multirooted_tree): assert contains_cycle(tree) is None - assert contains_cycle(cyclic_tree) == set([3, 4, 5]) + assert contains_cycle(cyclic_tree) == {3, 4, 5} assert contains_cycle(partial_tree) is None assert contains_cycle(multirooted_tree) is None @@ -77,16 +73,10 @@ def test_parser_is_nonproj_tree( assert is_nonproj_tree(multirooted_tree) is True -def test_parser_pseudoprojectivity(en_tokenizer): +def test_parser_pseudoprojectivity(en_vocab): def deprojectivize(proj_heads, deco_labels): - tokens = en_tokenizer("whatever " * len(proj_heads)) - rel_proj_heads = [head - i for i, head in enumerate(proj_heads)] - doc = get_doc( - tokens.vocab, - words=[t.text for t in tokens], - deps=deco_labels, - heads=rel_proj_heads, - ) + words = ["whatever "] * len(proj_heads) + doc = Doc(en_vocab, words=words, deps=deco_labels, heads=proj_heads) nonproj.deprojectivize(doc) return [t.head.i for t in doc], [token.dep_ for token in doc] @@ -97,49 +87,39 @@ def test_parser_pseudoprojectivity(en_tokenizer): labels = ["det", "nsubj", "root", "det", "dobj", "aux", "nsubj", "acl", "punct"] labels2 = ["advmod", "root", "det", "nsubj", "advmod", "det", "dobj", "det", "nmod", "aux", "nmod", "advmod", "det", "amod", "punct"] # fmt: on - assert nonproj.decompose("X||Y") == ("X", "Y") assert nonproj.decompose("X") == ("X", "") assert nonproj.is_decorated("X||Y") is True assert nonproj.is_decorated("X") is False - nonproj._lift(0, tree) assert tree == [2, 2, 2] - assert nonproj._get_smallest_nonproj_arc(nonproj_tree) == 7 assert nonproj._get_smallest_nonproj_arc(nonproj_tree2) == 10 - # fmt: off proj_heads, deco_labels = 
nonproj.projectivize(nonproj_tree, labels) assert proj_heads == [1, 2, 2, 4, 5, 2, 7, 5, 2] assert deco_labels == ["det", "nsubj", "root", "det", "dobj", "aux", "nsubj", "acl||dobj", "punct"] - deproj_heads, undeco_labels = deprojectivize(proj_heads, deco_labels) assert deproj_heads == nonproj_tree assert undeco_labels == labels - proj_heads, deco_labels = nonproj.projectivize(nonproj_tree2, labels2) assert proj_heads == [1, 1, 3, 1, 5, 6, 9, 8, 6, 1, 9, 12, 13, 10, 1] assert deco_labels == ["advmod||aux", "root", "det", "nsubj", "advmod", "det", "dobj", "det", "nmod", "aux", "nmod||dobj", "advmod", "det", "amod", "punct"] - deproj_heads, undeco_labels = deprojectivize(proj_heads, deco_labels) assert deproj_heads == nonproj_tree2 assert undeco_labels == labels2 - # if decoration is wrong such that there is no head with the desired label # the structure is kept and the label is undecorated proj_heads = [1, 2, 2, 4, 5, 2, 7, 5, 2] deco_labels = ["det", "nsubj", "root", "det", "dobj", "aux", "nsubj", "acl||iobj", "punct"] - deproj_heads, undeco_labels = deprojectivize(proj_heads, deco_labels) assert deproj_heads == proj_heads assert undeco_labels == ["det", "nsubj", "root", "det", "dobj", "aux", "nsubj", "acl", "punct"] - # if there are two potential new heads, the first one is chosen even if # it"s wrong proj_heads = [1, 1, 3, 1, 5, 6, 9, 8, 6, 1, 9, 12, 13, 10, 1] diff --git a/spacy/tests/parser/test_parse.py b/spacy/tests/parser/test_parse.py index fb5301718..a914eb17a 100644 --- a/spacy/tests/parser/test_parse.py +++ b/spacy/tests/parser/test_parse.py @@ -1,53 +1,73 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest +from numpy.testing import assert_equal +from spacy.attrs import DEP -from ..util import get_doc, apply_transition_sequence +from spacy.lang.en import English +from spacy.training import Example +from spacy.tokens import Doc +from spacy import util + +from ..util import apply_transition_sequence, make_tempdir -def test_parser_root(en_tokenizer): - text = "i don't have other assistance" - heads = [3, 2, 1, 0, 1, -2] +TRAIN_DATA = [ + ( + "They trade mortgage-backed securities.", + { + "heads": [1, 1, 4, 4, 5, 1, 1], + "deps": ["nsubj", "ROOT", "compound", "punct", "nmod", "dobj", "punct"], + }, + ), + ( + "I like London and Berlin.", + { + "heads": [1, 1, 1, 2, 2, 1], + "deps": ["nsubj", "ROOT", "dobj", "cc", "conj", "punct"], + }, + ), +] + + +def test_parser_root(en_vocab): + words = ["i", "do", "n't", "have", "other", "assistance"] + heads = [3, 3, 3, 3, 5, 3] deps = ["nsubj", "aux", "neg", "ROOT", "amod", "dobj"] - tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps) + doc = Doc(en_vocab, words=words, heads=heads, deps=deps) for t in doc: assert t.dep != 0, t.text -@pytest.mark.xfail -@pytest.mark.parametrize("text", ["Hello"]) -def test_parser_parse_one_word_sentence(en_tokenizer, en_parser, text): - tokens = en_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], heads=[0], deps=["ROOT"] - ) - +@pytest.mark.skip( + reason="The step_through API was removed (but should be brought back)" +) +@pytest.mark.parametrize("words", [["Hello"]]) +def test_parser_parse_one_word_sentence(en_vocab, en_parser, words): + doc = Doc(en_vocab, words=words, heads=[0], deps=["ROOT"]) assert len(doc) == 1 with en_parser.step_through(doc) as _: # noqa: F841 pass assert doc[0].dep != 0 -@pytest.mark.xfail -def test_parser_initial(en_tokenizer, en_parser): - text = "I ate the pizza 
with anchovies." - # heads = [1, 0, 1, -2, -3, -1, -5] +@pytest.mark.skip( + reason="The step_through API was removed (but should be brought back)" +) +def test_parser_initial(en_vocab, en_parser): + words = ["I", "ate", "the", "pizza", "with", "anchovies", "."] transition = ["L-nsubj", "S", "L-det"] - tokens = en_tokenizer(text) - apply_transition_sequence(en_parser, tokens, transition) - assert tokens[0].head.i == 1 - assert tokens[1].head.i == 1 - assert tokens[2].head.i == 3 - assert tokens[3].head.i == 3 + doc = Doc(en_vocab, words=words) + apply_transition_sequence(en_parser, doc, transition) + assert doc[0].head.i == 1 + assert doc[1].head.i == 1 + assert doc[2].head.i == 3 + assert doc[3].head.i == 3 -def test_parser_parse_subtrees(en_tokenizer, en_parser): - text = "The four wheels on the bus turned quickly" - heads = [2, 1, 4, -1, 1, -2, 0, -1] - tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads) +def test_parser_parse_subtrees(en_vocab, en_parser): + words = ["The", "four", "wheels", "on", "the", "bus", "turned", "quickly"] + heads = [2, 2, 6, 2, 5, 3, 6, 6] + deps = ["dep"] * len(heads) + doc = Doc(en_vocab, words=words, heads=heads, deps=deps) assert len(list(doc[2].lefts)) == 2 assert len(list(doc[2].rights)) == 1 assert len(list(doc[2].children)) == 3 @@ -57,15 +77,12 @@ def test_parser_parse_subtrees(en_tokenizer, en_parser): assert len(list(doc[2].subtree)) == 6 -def test_parser_merge_pp(en_tokenizer): - text = "A phrase with another phrase occurs" - heads = [1, 4, -1, 1, -2, 0] +def test_parser_merge_pp(en_vocab): + words = ["A", "phrase", "with", "another", "phrase", "occurs"] + heads = [1, 5, 1, 4, 2, 5] deps = ["det", "nsubj", "prep", "det", "pobj", "ROOT"] - tags = ["DT", "NN", "IN", "DT", "NN", "VBZ"] - tokens = en_tokenizer(text) - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], deps=deps, heads=heads, tags=tags - ) + pos = ["DET", "NOUN", "ADP", "DET", "NOUN", "VERB"] + doc = Doc(en_vocab, words=words, deps=deps, heads=heads, pos=pos) with doc.retokenize() as retokenizer: for np in doc.noun_chunks: retokenizer.merge(np, attrs={"lemma": np.lemma_}) @@ -75,13 +92,14 @@ def test_parser_merge_pp(en_tokenizer): assert doc[3].text == "occurs" -@pytest.mark.xfail -def test_parser_arc_eager_finalize_state(en_tokenizer, en_parser): - text = "a b c d e" - +@pytest.mark.skip( + reason="The step_through API was removed (but should be brought back)" +) +def test_parser_arc_eager_finalize_state(en_vocab, en_parser): + words = ["a", "b", "c", "d", "e"] # right branching transition = ["R-nsubj", "D", "R-nsubj", "R-nsubj", "D", "R-ROOT"] - tokens = en_tokenizer(text) + tokens = Doc(en_vocab, words=words) apply_transition_sequence(en_parser, tokens, transition) assert tokens[0].n_lefts == 0 @@ -116,7 +134,7 @@ def test_parser_arc_eager_finalize_state(en_tokenizer, en_parser): # left branching transition = ["S", "S", "S", "L-nsubj", "L-nsubj", "L-nsubj", "L-nsubj"] - tokens = en_tokenizer(text) + tokens = Doc(en_vocab, words=words) apply_transition_sequence(en_parser, tokens, transition) assert tokens[0].n_lefts == 0 @@ -153,15 +171,58 @@ def test_parser_arc_eager_finalize_state(en_tokenizer, en_parser): def test_parser_set_sent_starts(en_vocab): # fmt: off words = ['Ein', 'Satz', '.', 'Außerdem', 'ist', 'Zimmer', 'davon', 'überzeugt', ',', 'dass', 'auch', 'epige-', '\n', 'netische', 'Mechanismen', 'eine', 'Rolle', 'spielen', ',', 'also', 'Vorgänge', ',', 'die', '\n', 'sich', 'darauf', 'auswirken', ',', 'welche', 
'Gene', 'abgelesen', 'werden', 'und', '\n', 'welche', 'nicht', '.', '\n'] - heads = [1, 0, -1, 27, 0, -1, 1, -3, -1, 8, 4, 3, -1, 1, 3, 1, 1, -11, -1, 1, -9, -1, 4, -1, 2, 1, -6, -1, 1, 2, 1, -6, -1, -1, -17, -31, -32, -1] + heads = [1, 1, 1, 30, 4, 4, 7, 4, 7, 17, 14, 14, 11, 14, 17, 16, 17, 6, 17, 20, 11, 20, 26, 22, 26, 26, 20, 26, 29, 31, 31, 25, 31, 32, 17, 4, 4, 36] deps = ['nk', 'ROOT', 'punct', 'mo', 'ROOT', 'sb', 'op', 'pd', 'punct', 'cp', 'mo', 'nk', '', 'nk', 'sb', 'nk', 'oa', 're', 'punct', 'mo', 'app', 'punct', 'sb', '', 'oa', 'op', 'rc', 'punct', 'nk', 'sb', 'oc', 're', 'cd', '', 'oa', 'ng', 'punct', ''] # fmt: on - doc = get_doc(en_vocab, words=words, deps=deps, heads=heads) + doc = Doc(en_vocab, words=words, deps=deps, heads=heads) for i in range(len(words)): if i == 0 or i == 3: assert doc[i].is_sent_start is True else: - assert doc[i].is_sent_start is None + assert doc[i].is_sent_start is False for sent in doc.sents: for token in sent: assert token.head in sent + + +def test_overfitting_IO(): + # Simple test to try and quickly overfit the dependency parser - ensuring the ML models work correctly + nlp = English() + parser = nlp.add_pipe("parser") + train_examples = [] + for text, annotations in TRAIN_DATA: + train_examples.append(Example.from_dict(nlp.make_doc(text), annotations)) + for dep in annotations.get("deps", []): + parser.add_label(dep) + optimizer = nlp.initialize() + for i in range(100): + losses = {} + nlp.update(train_examples, sgd=optimizer, losses=losses) + assert losses["parser"] < 0.0001 + # test the trained model + test_text = "I like securities." + doc = nlp(test_text) + assert doc[0].dep_ == "nsubj" + assert doc[2].dep_ == "dobj" + assert doc[3].dep_ == "punct" + # Also test the results are still the same after IO + with make_tempdir() as tmp_dir: + nlp.to_disk(tmp_dir) + nlp2 = util.load_model_from_path(tmp_dir) + doc2 = nlp2(test_text) + assert doc2[0].dep_ == "nsubj" + assert doc2[2].dep_ == "dobj" + assert doc2[3].dep_ == "punct" + + # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions + texts = [ + "Just a sentence.", + "Then one more sentence about London.", + "Here is another one.", + "I like London.", + ] + batch_deps_1 = [doc.to_array([DEP]) for doc in nlp.pipe(texts)] + batch_deps_2 = [doc.to_array([DEP]) for doc in nlp.pipe(texts)] + no_batch_deps = [doc.to_array([DEP]) for doc in [nlp(text) for text in texts]] + assert_equal(batch_deps_1, batch_deps_2) + assert_equal(batch_deps_1, no_batch_deps) diff --git a/spacy/tests/parser/test_parse_navigate.py b/spacy/tests/parser/test_parse_navigate.py index 41524d45e..8ca4039a2 100644 --- a/spacy/tests/parser/test_parse_navigate.py +++ b/spacy/tests/parser/test_parse_navigate.py @@ -1,62 +1,75 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest - -from ..util import get_doc +from spacy.tokens import Doc @pytest.fixture -def text(): - return """ -It was a bright cold day in April, and the clocks were striking thirteen. -Winston Smith, his chin nuzzled into his breast in an effort to escape the -vile wind, slipped quickly through the glass doors of Victory Mansions, -though not quickly enough to prevent a swirl of gritty dust from entering -along with him. - -The hallway smelt of boiled cabbage and old rag mats. At one end of it a -coloured poster, too large for indoor display, had been tacked to the wall. 
-It depicted simply an enormous face, more than a metre wide: the face of a -man of about forty-five, with a heavy black moustache and ruggedly handsome -features. Winston made for the stairs. It was no use trying the lift. Even at -the best of times it was seldom working, and at present the electric current -was cut off during daylight hours. It was part of the economy drive in -preparation for Hate Week. The flat was seven flights up, and Winston, who -was thirty-nine and had a varicose ulcer above his right ankle, went slowly, -resting several times on the way. On each landing, opposite the lift-shaft, -the poster with the enormous face gazed from the wall. It was one of those -pictures which are so contrived that the eyes follow you about when you move. -BIG BROTHER IS WATCHING YOU, the caption beneath it ran. -""" +def words(): + # fmt: off + return [ + "\n", "It", "was", "a", "bright", "cold", "day", "in", "April", ",", + "and", "the", "clocks", "were", "striking", "thirteen", ".", "\n", + "Winston", "Smith", ",", "his", "chin", "nuzzled", "into", "his", + "breast", "in", "an", "effort", "to", "escape", "the", "\n", "vile", + "wind", ",", "slipped", "quickly", "through", "the", "glass", "doors", + "of", "Victory", "Mansions", ",", "\n", "though", "not", "quickly", + "enough", "to", "prevent", "a", "swirl", "of", "gritty", "dust", + "from", "entering", "\n", "along", "with", "him", ".", "\n\n", "The", + "hallway", "smelt", "of", "boiled", "cabbage", "and", "old", "rag", + "mats", ".", "At", "one", "end", "of", "it", "a", "\n", "coloured", + "poster", ",", "too", "large", "for", "indoor", "display", ",", "had", + "been", "tacked", "to", "the", "wall", ".", "\n", "It", "depicted", + "simply", "an", "enormous", "face", ",", "more", "than", "a", "metre", + "wide", ":", "the", "face", "of", "a", "\n", "man", "of", "about", + "forty", "-", "five", ",", "with", "a", "heavy", "black", "moustache", + "and", "ruggedly", "handsome", "\n", "features", ".", "Winston", "made", + "for", "the", "stairs", ".", "It", "was", "no", "use", "trying", "the", + "lift", ".", "Even", "at", "\n", "the", "best", "of", "times", "it", + "was", "seldom", "working", ",", "and", "at", "present", "the", + "electric", "current", "\n", "was", "cut", "off", "during", "daylight", + "hours", ".", "It", "was", "part", "of", "the", "economy", "drive", + "in", "\n", "preparation", "for", "Hate", "Week", ".", "The", "flat", + "was", "seven", "flights", "up", ",", "and", "Winston", ",", "who", + "\n", "was", "thirty", "-", "nine", "and", "had", "a", "varicose", + "ulcer", "above", "his", "right", "ankle", ",", "went", "slowly", ",", + "\n", "resting", "several", "times", "on", "the", "way", ".", "On", + "each", "landing", ",", "opposite", "the", "lift", "-", "shaft", ",", + "\n", "the", "poster", "with", "the", "enormous", "face", "gazed", + "from", "the", "wall", ".", "It", "was", "one", "of", "those", "\n", + "pictures", "which", "are", "so", "contrived", "that", "the", "eyes", + "follow", "you", "about", "when", "you", "move", ".", "\n", "BIG", + "BROTHER", "IS", "WATCHING", "YOU", ",", "the", "caption", "beneath", + "it", "ran", ".", "\n", ] + # fmt: on @pytest.fixture def heads(): # fmt: off - return [1, 1, 0, 3, 2, 1, -4, -1, -1, -7, -8, 1, 2, 1, -12, -1, -2, - -1, 1, 4, 3, 1, 1, 0, -1, 1, -2, -4, 1, -2, 1, -2, 3, -1, 1, - -4, -13, -14, -1, -2, 2, 1, -3, -1, 1, -2, -9, -1, -11, 1, 1, -14, - 1, -2, 1, -2, -1, 1, -2, -6, -1, -1, -2, -1, -1, -42, -1, 1, 1, - 0, -1, 1, -2, -1, 2, 1, -4, -8, 18, 1, -2, -1, -1, 3, -1, 1, 10, - 
9, 1, 7, -1, 1, -2, 3, 2, 1, 0, -1, 1, -2, -4, -1, 1, 0, -1, - 2, 1, -4, -1, 2, 1, 1, 1, -6, -11, 1, 20, -1, 2, -1, -3, -1, - 3, 2, 1, -4, -10, -11, 3, 2, 1, -4, -1, 1, -3, -1, 0, -1, 1, 0, - -1, 1, -2, -4, 1, 0, 1, -2, -1, 1, -2, -6, 1, 9, -1, 1, 6, -1, - -1, 3, 2, 1, 0, -1, -2, 7, -1, 2, 1, 3, -1, 1, -10, -1, -2, 1, - -2, -5, 1, 0, -1, -1, 1, -2, -5, -1, -1, -2, -1, 1, -2, -12, 1, - 1, 0, 1, -2, -1, -4, -5, 18, -1, 2, -1, -4, 2, 1, -3, -4, -5, 2, - 1, -3, -1, 2, 1, -3, -17, -24, -1, -2, -1, -4, 1, -2, -3, 1, -2, - -10, 17, 1, -2, 14, 13, 3, 2, 1, -4, 8, -1, 1, 5, -1, 2, 1, -3, - 0, -1, 1, -2, -4, 1, 0, -1, -1, 2, -1, -3, 1, -2, 1, -2, 3, 1, - 1, -4, -1, -2, 2, 1, -3, -19, -1, 1, 1, 0, 0, 6, 5, 1, 3, -1, - -1, 0, -1, -1] + return [ + 1, 2, 2, 6, 6, 6, 2, 6, 7, 2, 2, 12, 14, 14, 2, 14, 14, 16, 19, 23, 23, + 22, 23, 23, 23, 26, 24, 23, 29, 27, 31, 29, 35, 32, 35, 31, 23, 23, 37, + 37, 42, 42, 39, 42, 45, 43, 37, 46, 37, 50, 51, 37, 53, 51, 55, 53, 55, + 58, 56, 53, 59, 60, 60, 62, 63, 23, 65, 68, 69, 69, 69, 72, 70, 72, 76, + 76, 72, 69, 96, 80, 78, 80, 81, 86, 83, 86, 96, 96, 89, 96, 89, 92, 90, + 96, 96, 96, 96, 96, 99, 97, 96, 100, 103, 103, 103, 107, 107, 103, 107, + 111, 111, 112, 113, 107, 103, 116, 136, 116, 120, 118, 117, 120, 125, + 125, 125, 121, 116, 116, 131, 131, 131, 127, 131, 134, 131, 134, 136, + 136, 139, 139, 139, 142, 140, 139, 145, 145, 147, 145, 147, 150, 148, + 145, 153, 162, 153, 156, 162, 156, 157, 162, 162, 162, 162, 162, 162, + 172, 165, 169, 169, 172, 169, 172, 162, 172, 172, 176, 174, 172, 179, + 179, 179, 180, 183, 181, 179, 184, 185, 185, 187, 190, 188, 179, 193, + 194, 194, 196, 194, 196, 194, 194, 218, 200, 204, 202, 200, 207, 207, + 204, 204, 204, 212, 212, 209, 212, 216, 216, 213, 200, 194, 218, 218, + 220, 218, 224, 222, 222, 227, 225, 218, 246, 231, 229, 246, 246, 237, + 237, 237, 233, 246, 238, 241, 246, 241, 245, 245, 242, 246, 246, 249, + 247, 246, 252, 252, 252, 253, 257, 255, 254, 259, 257, 261, 259, 265, + 264, 265, 261, 265, 265, 270, 270, 267, 252, 271, 274, 275, 275, 276, + 283, 283, 280, 283, 280, 281, 283, 283, 284] # fmt: on -def test_parser_parse_navigate_consistency(en_tokenizer, text, heads): - tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads) +def test_parser_parse_navigate_consistency(en_vocab, words, heads): + doc = Doc(en_vocab, words=words, heads=heads) for head in doc: for child in head.lefts: assert child.head == head @@ -64,10 +77,8 @@ def test_parser_parse_navigate_consistency(en_tokenizer, text, heads): assert child.head == head -def test_parser_parse_navigate_child_consistency(en_tokenizer, text, heads): - tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads) - +def test_parser_parse_navigate_child_consistency(en_vocab, words, heads): + doc = Doc(en_vocab, words=words, heads=heads, deps=["dep"] * len(heads)) lefts = {} rights = {} for head in doc: @@ -97,9 +108,8 @@ def test_parser_parse_navigate_child_consistency(en_tokenizer, text, heads): assert not children -def test_parser_parse_navigate_edges(en_tokenizer, text, heads): - tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads) +def test_parser_parse_navigate_edges(en_vocab, words, heads): + doc = Doc(en_vocab, words=words, heads=heads) for token in doc: subtree = list(token.subtree) debug = "\t".join((token.text, token.left_edge.text, subtree[0].text)) diff --git a/spacy/tests/parser/test_preset_sbd.py 
b/spacy/tests/parser/test_preset_sbd.py index 70beb2f60..ab58ac17b 100644 --- a/spacy/tests/parser/test_preset_sbd.py +++ b/spacy/tests/parser/test_preset_sbd.py @@ -1,12 +1,10 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest -from thinc.neural.optimizers import Adam -from thinc.neural.ops import NumpyOps +from thinc.api import Adam from spacy.attrs import NORM -from spacy.gold import GoldParse from spacy.vocab import Vocab +from spacy import registry +from spacy.training import Example +from spacy.pipeline.dep_parser import DEFAULT_PARSER_MODEL from spacy.tokens import Doc from spacy.pipeline import DependencyParser @@ -16,21 +14,36 @@ def vocab(): return Vocab(lex_attr_getters={NORM: lambda s: s}) +def _parser_example(parser): + doc = Doc(parser.vocab, words=["a", "b", "c", "d"]) + gold = {"heads": [1, 1, 3, 3], "deps": ["right", "ROOT", "left", "ROOT"]} + return Example.from_dict(doc, gold) + + @pytest.fixture def parser(vocab): - parser = DependencyParser(vocab) + config = { + "learn_tokens": False, + "min_action_freq": 30, + "update_with_oracle_cut_size": 100, + } + cfg = {"model": DEFAULT_PARSER_MODEL} + model = registry.resolve(cfg, validate=True)["model"] + parser = DependencyParser(vocab, model, **config) parser.cfg["token_vector_width"] = 4 parser.cfg["hidden_width"] = 32 # parser.add_label('right') parser.add_label("left") - parser.begin_training([], **parser.cfg) - sgd = Adam(NumpyOps(), 0.001) + parser.initialize(lambda: [_parser_example(parser)]) + sgd = Adam(0.001) for i in range(10): losses = {} doc = Doc(vocab, words=["a", "b", "c", "d"]) - gold = GoldParse(doc, heads=[1, 1, 3, 3], deps=["left", "ROOT", "left", "ROOT"]) - parser.update([doc], [gold], sgd=sgd, losses=losses) + example = Example.from_dict( + doc, {"heads": [1, 1, 3, 3], "deps": ["left", "ROOT", "left", "ROOT"]} + ) + parser.update([example], sgd=sgd, losses=losses) return parser diff --git a/spacy/tests/parser/test_space_attachment.py b/spacy/tests/parser/test_space_attachment.py index 945173faf..2b80272d6 100644 --- a/spacy/tests/parser/test_space_attachment.py +++ b/spacy/tests/parser/test_space_attachment.py @@ -1,42 +1,40 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest +from spacy.tokens import Doc -from spacy.tokens.doc import Doc - -from ..util import get_doc, apply_transition_sequence +from ..util import apply_transition_sequence -def test_parser_space_attachment(en_tokenizer): - text = "This is a test.\nTo ensure spaces are attached well." - heads = [1, 0, 1, -2, -3, -1, 1, 4, -1, 2, 1, 0, -1, -2] - tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads) +def test_parser_space_attachment(en_vocab): + # fmt: off + words = ["This", "is", "a", "test", ".", "\n", "To", "ensure", " ", "spaces", "are", "attached", "well", "."] + heads = [1, 1, 3, 1, 1, 4, 7, 11, 7, 11, 11, 11, 11, 11] + # fmt: on + deps = ["dep"] * len(heads) + doc = Doc(en_vocab, words=words, heads=heads, deps=deps) for sent in doc.sents: if len(sent) == 1: assert not sent[-1].is_space -def test_parser_sentence_space(en_tokenizer): +def test_parser_sentence_space(en_vocab): # fmt: off - text = "I look forward to using Thingamajig. I've been told it will make my life easier..." 
- heads = [1, 0, -1, -2, -1, -1, -5, -1, 3, 2, 1, 0, 2, 1, -3, 1, 1, -3, -7] + words = ["I", "look", "forward", "to", "using", "Thingamajig", ".", " ", "I", "'ve", "been", "told", "it", "will", "make", "my", "life", "easier", "..."] + heads = [1, 1, 1, 1, 3, 4, 1, 6, 11, 11, 11, 11, 14, 14, 11, 16, 17, 14, 11] deps = ["nsubj", "ROOT", "advmod", "prep", "pcomp", "dobj", "punct", "", "nsubjpass", "aux", "auxpass", "ROOT", "nsubj", "aux", "ccomp", "poss", "nsubj", "ccomp", "punct"] # fmt: on - tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps) + doc = Doc(en_vocab, words=words, heads=heads, deps=deps) assert len(list(doc.sents)) == 2 -@pytest.mark.xfail -def test_parser_space_attachment_leading(en_tokenizer, en_parser): - text = "\t \n This is a sentence ." - heads = [1, 1, 0, 1, -2, -3] - tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, words=text.split(" "), heads=heads) +@pytest.mark.skip( + reason="The step_through API was removed (but should be brought back)" +) +def test_parser_space_attachment_leading(en_vocab, en_parser): + words = ["\t", "\n", "This", "is", "a", "sentence", "."] + heads = [1, 2, 2, 4, 2, 2] + doc = Doc(en_vocab, words=words, heads=heads) assert doc[0].is_space assert doc[1].is_space assert doc[2].text == "This" @@ -47,19 +45,19 @@ def test_parser_space_attachment_leading(en_tokenizer, en_parser): assert stepwise.stack == set([2]) -@pytest.mark.xfail -def test_parser_space_attachment_intermediate_trailing(en_tokenizer, en_parser): - text = "This is \t a \t\n \n sentence . \n\n \n" - heads = [1, 0, -1, 2, -1, -4, -5, -1] +@pytest.mark.skip( + reason="The step_through API was removed (but should be brought back)" +) +def test_parser_space_attachment_intermediate_trailing(en_vocab, en_parser): + words = ["This", "is", "\t", "a", "\t\n", "\n", "sentence", ".", "\n\n", "\n"] + heads = [1, 1, 1, 5, 3, 1, 1, 6] transition = ["L-nsubj", "S", "L-det", "R-attr", "D", "R-punct"] - tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, words=text.split(" "), heads=heads) + doc = Doc(en_vocab, words=words, heads=heads) assert doc[2].is_space assert doc[4].is_space assert doc[5].is_space assert doc[8].is_space assert doc[9].is_space - apply_transition_sequence(en_parser, doc, transition) for token in doc: assert token.dep != 0 or token.is_space @@ -67,8 +65,10 @@ def test_parser_space_attachment_intermediate_trailing(en_tokenizer, en_parser): @pytest.mark.parametrize("text,length", [(["\n"], 1), (["\n", "\t", "\n\n", "\t"], 4)]) -@pytest.mark.xfail -def test_parser_space_attachment_space(en_tokenizer, en_parser, text, length): +@pytest.mark.skip( + reason="The step_through API was removed (but should be brought back)" +) +def test_parser_space_attachment_space(en_parser, text, length): doc = Doc(en_parser.vocab, words=text) assert len(doc) == length with en_parser.step_through(doc) as _: # noqa: F841 diff --git a/spacy/tests/pipeline/test_analysis.py b/spacy/tests/pipeline/test_analysis.py index 198f11bcd..df3d7dff5 100644 --- a/spacy/tests/pipeline/test_analysis.py +++ b/spacy/tests/pipeline/test_analysis.py @@ -1,123 +1,69 @@ -# coding: utf8 -from __future__ import unicode_literals - -import spacy.language -from spacy.language import Language, component -from spacy.analysis import print_summary, validate_attrs -from spacy.analysis import get_assigns_for_attr, get_requires_for_attr -from spacy.compat import is_python2 -from mock import Mock, ANY +from spacy.language import Language +from spacy.pipe_analysis 
import get_attr_info, validate_attrs +from mock import Mock import pytest -def test_component_decorator_function(): - @component(name="test") - def test_component(doc): - """docstring""" - return doc - - assert test_component.name == "test" - if not is_python2: - assert test_component.__doc__ == "docstring" - assert test_component("foo") == "foo" - - -def test_component_decorator_class(): - @component(name="test") - class TestComponent(object): - """docstring1""" - - foo = "bar" - - def __call__(self, doc): - """docstring2""" - return doc - - def custom(self, x): - """docstring3""" - return x - - assert TestComponent.name == "test" - assert TestComponent.foo == "bar" - assert hasattr(TestComponent, "custom") - test_component = TestComponent() - assert test_component.foo == "bar" - assert test_component("foo") == "foo" - assert hasattr(test_component, "custom") - assert test_component.custom("bar") == "bar" - if not is_python2: - assert TestComponent.__doc__ == "docstring1" - assert TestComponent.__call__.__doc__ == "docstring2" - assert TestComponent.custom.__doc__ == "docstring3" - assert test_component.__doc__ == "docstring1" - assert test_component.__call__.__doc__ == "docstring2" - assert test_component.custom.__doc__ == "docstring3" - - def test_component_decorator_assigns(): - spacy.language.ENABLE_PIPELINE_ANALYSIS = True - - @component("c1", assigns=["token.tag", "doc.tensor"]) + @Language.component("c1", assigns=["token.tag", "doc.tensor"]) def test_component1(doc): return doc - @component( + @Language.component( "c2", requires=["token.tag", "token.pos"], assigns=["token.lemma", "doc.tensor"] ) def test_component2(doc): return doc - @component("c3", requires=["token.lemma"], assigns=["token._.custom_lemma"]) + @Language.component( + "c3", requires=["token.lemma"], assigns=["token._.custom_lemma"] + ) def test_component3(doc): return doc - assert "c1" in Language.factories - assert "c2" in Language.factories - assert "c3" in Language.factories + assert Language.has_factory("c1") + assert Language.has_factory("c2") + assert Language.has_factory("c3") nlp = Language() - nlp.add_pipe(test_component1) - with pytest.warns(UserWarning): - nlp.add_pipe(test_component2) - nlp.add_pipe(test_component3) - assigns_tensor = get_assigns_for_attr(nlp.pipeline, "doc.tensor") - assert [name for name, _ in assigns_tensor] == ["c1", "c2"] - test_component4 = nlp.create_pipe("c1") - assert test_component4.name == "c1" - assert test_component4.factory == "c1" - nlp.add_pipe(test_component4, name="c4") + nlp.add_pipe("c1") + nlp.add_pipe("c2") + problems = nlp.analyze_pipes()["problems"] + assert problems["c2"] == ["token.pos"] + nlp.add_pipe("c3") + assert get_attr_info(nlp, "doc.tensor")["assigns"] == ["c1", "c2"] + nlp.add_pipe("c1", name="c4") + test_component4_meta = nlp.get_pipe_meta("c1") + assert test_component4_meta.factory == "c1" assert nlp.pipe_names == ["c1", "c2", "c3", "c4"] - assert "c4" not in Language.factories + assert not Language.has_factory("c4") assert nlp.pipe_factories["c1"] == "c1" assert nlp.pipe_factories["c4"] == "c1" - assigns_tensor = get_assigns_for_attr(nlp.pipeline, "doc.tensor") - assert [name for name, _ in assigns_tensor] == ["c1", "c2", "c4"] - requires_pos = get_requires_for_attr(nlp.pipeline, "token.pos") - assert [name for name, _ in requires_pos] == ["c2"] - assert print_summary(nlp, no_print=True) + assert get_attr_info(nlp, "doc.tensor")["assigns"] == ["c1", "c2", "c4"] + assert get_attr_info(nlp, "token.pos")["requires"] == ["c2"] assert nlp("hello world") 
-def test_component_factories_from_nlp(): +def test_component_factories_class_func(): """Test that class components can implement a from_nlp classmethod that gives them access to the nlp object and config via the factory.""" - class TestComponent5(object): + class TestComponent5: def __call__(self, doc): return doc mock = Mock() mock.return_value = TestComponent5() - TestComponent5.from_nlp = classmethod(mock) - TestComponent5 = component("c5")(TestComponent5) - assert "c5" in Language.factories + def test_componen5_factory(nlp, foo: str = "bar", name="c5"): + return mock(nlp, foo=foo) + + Language.factory("c5", func=test_componen5_factory) + assert Language.has_factory("c5") nlp = Language() - pipe = nlp.create_pipe("c5", config={"foo": "bar"}) - nlp.add_pipe(pipe) + nlp.add_pipe("c5", config={"foo": "bar"}) assert nlp("hello world") - # The first argument here is the class itself, so we're accepting any here - mock.assert_called_once_with(ANY, nlp, foo="bar") + mock.assert_called_once_with(nlp, foo="bar") def test_analysis_validate_attrs_valid(): @@ -149,20 +95,20 @@ def test_analysis_validate_attrs_invalid(attr): def test_analysis_validate_attrs_remove_pipe(): """Test that attributes are validated correctly on remove.""" - spacy.language.ENABLE_PIPELINE_ANALYSIS = True - @component("c1", assigns=["token.tag"]) + @Language.component("pipe_analysis_c6", assigns=["token.tag"]) def c1(doc): return doc - @component("c2", requires=["token.pos"]) + @Language.component("pipe_analysis_c7", requires=["token.pos"]) def c2(doc): return doc nlp = Language() - nlp.add_pipe(c1) - with pytest.warns(UserWarning): - nlp.add_pipe(c2) - with pytest.warns(None) as record: - nlp.remove_pipe("c2") - assert not record.list + nlp.add_pipe("pipe_analysis_c6") + nlp.add_pipe("pipe_analysis_c7") + problems = nlp.analyze_pipes()["problems"] + assert problems["pipe_analysis_c7"] == ["token.pos"] + nlp.remove_pipe("pipe_analysis_c7") + problems = nlp.analyze_pipes()["problems"] + assert all(p == [] for p in problems.values()) diff --git a/spacy/tests/pipeline/test_attributeruler.py b/spacy/tests/pipeline/test_attributeruler.py new file mode 100644 index 000000000..6c66469cc --- /dev/null +++ b/spacy/tests/pipeline/test_attributeruler.py @@ -0,0 +1,273 @@ +import pytest +import numpy +from spacy.training import Example +from spacy.lang.en import English +from spacy.pipeline import AttributeRuler +from spacy import util, registry +from spacy.tokens import Doc + +from ..util import make_tempdir + + +@pytest.fixture +def nlp(): + return English() + + +@pytest.fixture +def pattern_dicts(): + return [ + { + "patterns": [[{"ORTH": "a"}], [{"ORTH": "irrelevant"}]], + "attrs": {"LEMMA": "the", "MORPH": "Case=Nom|Number=Plur"}, + }, + # one pattern sets the lemma + {"patterns": [[{"ORTH": "test"}]], "attrs": {"LEMMA": "cat"}}, + # another pattern sets the morphology + { + "patterns": [[{"ORTH": "test"}]], + "attrs": {"MORPH": "Case=Nom|Number=Sing"}, + "index": 0, + }, + ] + + +@registry.misc("attribute_ruler_patterns") +def attribute_ruler_patterns(): + return [ + { + "patterns": [[{"ORTH": "a"}], [{"ORTH": "irrelevant"}]], + "attrs": {"LEMMA": "the", "MORPH": "Case=Nom|Number=Plur"}, + }, + # one pattern sets the lemma + {"patterns": [[{"ORTH": "test"}]], "attrs": {"LEMMA": "cat"}}, + # another pattern sets the morphology + { + "patterns": [[{"ORTH": "test"}]], + "attrs": {"MORPH": "Case=Nom|Number=Sing"}, + "index": 0, + }, + ] + + +@pytest.fixture +def tag_map(): + return { + ".": {"POS": "PUNCT", "PunctType": "peri"}, + 
",": {"POS": "PUNCT", "PunctType": "comm"}, + } + + +@pytest.fixture +def morph_rules(): + return {"DT": {"the": {"POS": "DET", "LEMMA": "a", "Case": "Nom"}}} + + +def check_tag_map(ruler): + doc = Doc( + ruler.vocab, + words=["This", "is", "a", "test", "."], + tags=["DT", "VBZ", "DT", "NN", "."], + ) + doc = ruler(doc) + for i in range(len(doc)): + if i == 4: + assert doc[i].pos_ == "PUNCT" + assert str(doc[i].morph) == "PunctType=peri" + else: + assert doc[i].pos_ == "" + assert str(doc[i].morph) == "" + + +def check_morph_rules(ruler): + doc = Doc( + ruler.vocab, + words=["This", "is", "the", "test", "."], + tags=["DT", "VBZ", "DT", "NN", "."], + ) + doc = ruler(doc) + for i in range(len(doc)): + if i != 2: + assert doc[i].pos_ == "" + assert str(doc[i].morph) == "" + else: + assert doc[2].pos_ == "DET" + assert doc[2].lemma_ == "a" + assert str(doc[2].morph) == "Case=Nom" + + +def test_attributeruler_init(nlp, pattern_dicts): + a = nlp.add_pipe("attribute_ruler") + for p in pattern_dicts: + a.add(**p) + doc = nlp("This is a test.") + assert doc[2].lemma_ == "the" + assert str(doc[2].morph) == "Case=Nom|Number=Plur" + assert doc[3].lemma_ == "cat" + assert str(doc[3].morph) == "Case=Nom|Number=Sing" + assert doc.has_annotation("LEMMA") + assert doc.has_annotation("MORPH") + + +def test_attributeruler_init_patterns(nlp, pattern_dicts): + # initialize with patterns + ruler = nlp.add_pipe("attribute_ruler") + ruler.initialize(lambda: [], patterns=pattern_dicts) + doc = nlp("This is a test.") + assert doc[2].lemma_ == "the" + assert str(doc[2].morph) == "Case=Nom|Number=Plur" + assert doc[3].lemma_ == "cat" + assert str(doc[3].morph) == "Case=Nom|Number=Sing" + assert doc.has_annotation("LEMMA") + assert doc.has_annotation("MORPH") + nlp.remove_pipe("attribute_ruler") + # initialize with patterns from misc registry + nlp.config["initialize"]["components"]["attribute_ruler"] = { + "patterns": {"@misc": "attribute_ruler_patterns"} + } + nlp.add_pipe("attribute_ruler") + nlp.initialize() + doc = nlp("This is a test.") + assert doc[2].lemma_ == "the" + assert str(doc[2].morph) == "Case=Nom|Number=Plur" + assert doc[3].lemma_ == "cat" + assert str(doc[3].morph) == "Case=Nom|Number=Sing" + assert doc.has_annotation("LEMMA") + assert doc.has_annotation("MORPH") + + +def test_attributeruler_init_clear(nlp, pattern_dicts): + """Test that initialization clears patterns.""" + ruler = nlp.add_pipe("attribute_ruler") + assert not len(ruler.matcher) + ruler.add_patterns(pattern_dicts) + assert len(ruler.matcher) + ruler.initialize(lambda: []) + assert not len(ruler.matcher) + + +def test_attributeruler_score(nlp, pattern_dicts): + # initialize with patterns + ruler = nlp.add_pipe("attribute_ruler") + ruler.initialize(lambda: [], patterns=pattern_dicts) + doc = nlp("This is a test.") + assert doc[2].lemma_ == "the" + assert str(doc[2].morph) == "Case=Nom|Number=Plur" + assert doc[3].lemma_ == "cat" + assert str(doc[3].morph) == "Case=Nom|Number=Sing" + doc = nlp.make_doc("This is a test.") + dev_examples = [Example.from_dict(doc, {"lemmas": ["this", "is", "a", "cat", "."]})] + scores = nlp.evaluate(dev_examples) + # "cat" is the only correct lemma + assert scores["lemma_acc"] == pytest.approx(0.2) + # the empty morphs are correct + assert scores["morph_acc"] == pytest.approx(0.6) + + +def test_attributeruler_rule_order(nlp): + a = AttributeRuler(nlp.vocab) + patterns = [ + {"patterns": [[{"TAG": "VBZ"}]], "attrs": {"POS": "VERB"}}, + {"patterns": [[{"TAG": "VBZ"}]], "attrs": {"POS": "NOUN"}}, + ] + 
a.add_patterns(patterns) + doc = Doc( + nlp.vocab, + words=["This", "is", "a", "test", "."], + tags=["DT", "VBZ", "DT", "NN", "."], + ) + doc = a(doc) + assert doc[1].pos_ == "NOUN" + + +def test_attributeruler_tag_map(nlp, tag_map): + ruler = AttributeRuler(nlp.vocab) + ruler.load_from_tag_map(tag_map) + check_tag_map(ruler) + + +def test_attributeruler_tag_map_initialize(nlp, tag_map): + ruler = nlp.add_pipe("attribute_ruler") + ruler.initialize(lambda: [], tag_map=tag_map) + check_tag_map(ruler) + + +def test_attributeruler_morph_rules(nlp, morph_rules): + ruler = AttributeRuler(nlp.vocab) + ruler.load_from_morph_rules(morph_rules) + check_morph_rules(ruler) + + +def test_attributeruler_morph_rules_initialize(nlp, morph_rules): + ruler = nlp.add_pipe("attribute_ruler") + ruler.initialize(lambda: [], morph_rules=morph_rules) + check_morph_rules(ruler) + + +def test_attributeruler_indices(nlp): + a = nlp.add_pipe("attribute_ruler") + a.add( + [[{"ORTH": "a"}, {"ORTH": "test"}]], + {"LEMMA": "the", "MORPH": "Case=Nom|Number=Plur"}, + index=0, + ) + a.add( + [[{"ORTH": "This"}, {"ORTH": "is"}]], + {"LEMMA": "was", "MORPH": "Case=Nom|Number=Sing"}, + index=1, + ) + a.add([[{"ORTH": "a"}, {"ORTH": "test"}]], {"LEMMA": "cat"}, index=-1) + + text = "This is a test." + doc = nlp(text) + for i in range(len(doc)): + if i == 1: + assert doc[i].lemma_ == "was" + assert str(doc[i].morph) == "Case=Nom|Number=Sing" + elif i == 2: + assert doc[i].lemma_ == "the" + assert str(doc[i].morph) == "Case=Nom|Number=Plur" + elif i == 3: + assert doc[i].lemma_ == "cat" + else: + assert str(doc[i].morph) == "" + # raises an error when trying to modify a token outside of the match + a.add([[{"ORTH": "a"}, {"ORTH": "test"}]], {"LEMMA": "cat"}, index=2) + with pytest.raises(ValueError): + doc = nlp(text) + # raises an error when trying to modify a token outside of the match + a.add([[{"ORTH": "a"}, {"ORTH": "test"}]], {"LEMMA": "cat"}, index=10) + with pytest.raises(ValueError): + doc = nlp(text) + + +def test_attributeruler_patterns_prop(nlp, pattern_dicts): + a = nlp.add_pipe("attribute_ruler") + a.add_patterns(pattern_dicts) + for p1, p2 in zip(pattern_dicts, a.patterns): + assert p1["patterns"] == p2["patterns"] + assert p1["attrs"] == p2["attrs"] + if p1.get("index"): + assert p1["index"] == p2["index"] + + +def test_attributeruler_serialize(nlp, pattern_dicts): + a = nlp.add_pipe("attribute_ruler") + a.add_patterns(pattern_dicts) + text = "This is a test." 
+ attrs = ["ORTH", "LEMMA", "MORPH"] + doc = nlp(text) + # bytes roundtrip + a_reloaded = AttributeRuler(nlp.vocab).from_bytes(a.to_bytes()) + assert a.to_bytes() == a_reloaded.to_bytes() + doc1 = a_reloaded(nlp.make_doc(text)) + numpy.array_equal(doc.to_array(attrs), doc1.to_array(attrs)) + assert a.patterns == a_reloaded.patterns + # disk roundtrip + with make_tempdir() as tmp_dir: + nlp.to_disk(tmp_dir) + nlp2 = util.load_model_from_path(tmp_dir) + doc2 = nlp2(text) + assert nlp2.get_pipe("attribute_ruler").to_bytes() == a.to_bytes() + assert numpy.array_equal(doc.to_array(attrs), doc2.to_array(attrs)) + assert a.patterns == nlp2.get_pipe("attribute_ruler").patterns diff --git a/spacy/tests/pipeline/test_entity_linker.py b/spacy/tests/pipeline/test_entity_linker.py index 8023f72a6..8ba2d0d3e 100644 --- a/spacy/tests/pipeline/test_entity_linker.py +++ b/spacy/tests/pipeline/test_entity_linker.py @@ -1,11 +1,17 @@ -# coding: utf-8 -from __future__ import unicode_literals - +from typing import Callable, Iterable import pytest +from numpy.testing import assert_equal +from spacy.attrs import ENT_KB_ID -from spacy.kb import KnowledgeBase +from spacy.kb import KnowledgeBase, get_candidates, Candidate +from spacy.vocab import Vocab + +from spacy import util, registry +from spacy.ml import load_kb +from spacy.scorer import Scorer +from spacy.training import Example from spacy.lang.en import English -from spacy.pipeline import EntityRuler +from spacy.tests.util import make_tempdir from spacy.tokens import Span @@ -106,9 +112,74 @@ def test_kb_invalid_entity_vector(nlp): mykb.add_entity(entity="Q2", freq=5, entity_vector=[2]) +def test_kb_default(nlp): + """Test that the default (empty) KB is loaded upon construction""" + entity_linker = nlp.add_pipe("entity_linker", config={}) + assert len(entity_linker.kb) == 0 + assert entity_linker.kb.get_size_entities() == 0 + assert entity_linker.kb.get_size_aliases() == 0 + # 64 is the default value from pipeline.entity_linker + assert entity_linker.kb.entity_vector_length == 64 + + +def test_kb_custom_length(nlp): + """Test that the default (empty) KB can be configured with a custom entity length""" + entity_linker = nlp.add_pipe("entity_linker", config={"entity_vector_length": 35}) + assert len(entity_linker.kb) == 0 + assert entity_linker.kb.get_size_entities() == 0 + assert entity_linker.kb.get_size_aliases() == 0 + assert entity_linker.kb.entity_vector_length == 35 + + +def test_kb_initialize_empty(nlp): + """Test that the EL can't initialize without examples""" + entity_linker = nlp.add_pipe("entity_linker") + with pytest.raises(TypeError): + entity_linker.initialize(lambda: []) + + +def test_kb_serialize(nlp): + """Test serialization of the KB""" + mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1) + with make_tempdir() as d: + # normal read-write behaviour + mykb.to_disk(d / "kb") + mykb.from_disk(d / "kb") + mykb.to_disk(d / "new" / "kb") + mykb.from_disk(d / "new" / "kb") + # allow overwriting an existing file + mykb.to_disk(d / "kb") + with pytest.raises(ValueError): + # can not read from an unknown file + mykb.from_disk(d / "unknown" / "kb") + + +def test_kb_serialize_vocab(nlp): + """Test serialization of the KB and custom strings""" + entity = "MyFunnyID" + assert entity not in nlp.vocab.strings + mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1) + assert not mykb.contains_entity(entity) + mykb.add_entity(entity, freq=342, entity_vector=[3]) + assert mykb.contains_entity(entity) + assert entity in mykb.vocab.strings + with 
make_tempdir() as d: + # normal read-write behaviour + mykb.to_disk(d / "kb") + mykb_new = KnowledgeBase(Vocab(), entity_vector_length=1) + mykb_new.from_disk(d / "kb") + assert entity in mykb_new.vocab.strings + + def test_candidate_generation(nlp): """Test correct candidate generation""" mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1) + doc = nlp("douglas adam Adam shrubbery") + + douglas_ent = doc[0:1] + adam_ent = doc[1:2] + Adam_ent = doc[2:3] + shrubbery_ent = doc[3:4] # adding entities mykb.add_entity(entity="Q1", freq=27, entity_vector=[1]) @@ -120,15 +191,96 @@ def test_candidate_generation(nlp): mykb.add_alias(alias="adam", entities=["Q2"], probabilities=[0.9]) # test the size of the relevant candidates - assert len(mykb.get_candidates("douglas")) == 2 - assert len(mykb.get_candidates("adam")) == 1 - assert len(mykb.get_candidates("shrubbery")) == 0 + assert len(get_candidates(mykb, douglas_ent)) == 2 + assert len(get_candidates(mykb, adam_ent)) == 1 + assert len(get_candidates(mykb, Adam_ent)) == 0 # default case sensitive + assert len(get_candidates(mykb, shrubbery_ent)) == 0 # test the content of the candidates - assert mykb.get_candidates("adam")[0].entity_ == "Q2" - assert mykb.get_candidates("adam")[0].alias_ == "adam" - assert_almost_equal(mykb.get_candidates("adam")[0].entity_freq, 12) - assert_almost_equal(mykb.get_candidates("adam")[0].prior_prob, 0.9) + assert get_candidates(mykb, adam_ent)[0].entity_ == "Q2" + assert get_candidates(mykb, adam_ent)[0].alias_ == "adam" + assert_almost_equal(get_candidates(mykb, adam_ent)[0].entity_freq, 12) + assert_almost_equal(get_candidates(mykb, adam_ent)[0].prior_prob, 0.9) + + +def test_el_pipe_configuration(nlp): + """Test correct candidate generation as part of the EL pipe""" + nlp.add_pipe("sentencizer") + pattern = {"label": "PERSON", "pattern": [{"LOWER": "douglas"}]} + ruler = nlp.add_pipe("entity_ruler") + ruler.add_patterns([pattern]) + + def create_kb(vocab): + kb = KnowledgeBase(vocab, entity_vector_length=1) + kb.add_entity(entity="Q2", freq=12, entity_vector=[2]) + kb.add_entity(entity="Q3", freq=5, entity_vector=[3]) + kb.add_alias(alias="douglas", entities=["Q2", "Q3"], probabilities=[0.8, 0.1]) + return kb + + # run an EL pipe without a trained context encoder, to check the candidate generation step only + entity_linker = nlp.add_pipe("entity_linker", config={"incl_context": False}) + entity_linker.set_kb(create_kb) + # With the default get_candidates function, matching is case-sensitive + text = "Douglas and douglas are not the same." 
+ doc = nlp(text) + assert doc[0].ent_kb_id_ == "NIL" + assert doc[1].ent_kb_id_ == "" + assert doc[2].ent_kb_id_ == "Q2" + + def get_lowercased_candidates(kb, span): + return kb.get_alias_candidates(span.text.lower()) + + @registry.misc.register("spacy.LowercaseCandidateGenerator.v1") + def create_candidates() -> Callable[[KnowledgeBase, "Span"], Iterable[Candidate]]: + return get_lowercased_candidates + + # replace the pipe with a new one with with a different candidate generator + entity_linker = nlp.replace_pipe( + "entity_linker", + "entity_linker", + config={ + "incl_context": False, + "get_candidates": {"@misc": "spacy.LowercaseCandidateGenerator.v1"}, + }, + ) + entity_linker.set_kb(create_kb) + doc = nlp(text) + assert doc[0].ent_kb_id_ == "Q2" + assert doc[1].ent_kb_id_ == "" + assert doc[2].ent_kb_id_ == "Q2" + + +def test_vocab_serialization(nlp): + """Test that string information is retained across storage""" + mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1) + + # adding entities + mykb.add_entity(entity="Q1", freq=27, entity_vector=[1]) + q2_hash = mykb.add_entity(entity="Q2", freq=12, entity_vector=[2]) + mykb.add_entity(entity="Q3", freq=5, entity_vector=[3]) + + # adding aliases + mykb.add_alias(alias="douglas", entities=["Q2", "Q3"], probabilities=[0.4, 0.1]) + adam_hash = mykb.add_alias(alias="adam", entities=["Q2"], probabilities=[0.9]) + + candidates = mykb.get_alias_candidates("adam") + assert len(candidates) == 1 + assert candidates[0].entity == q2_hash + assert candidates[0].entity_ == "Q2" + assert candidates[0].alias == adam_hash + assert candidates[0].alias_ == "adam" + + with make_tempdir() as d: + mykb.to_disk(d / "kb") + kb_new_vocab = KnowledgeBase(Vocab(), entity_vector_length=1) + kb_new_vocab.from_disk(d / "kb") + + candidates = kb_new_vocab.get_alias_candidates("adam") + assert len(candidates) == 1 + assert candidates[0].entity == q2_hash + assert candidates[0].entity_ == "Q2" + assert candidates[0].alias == adam_hash + assert candidates[0].alias_ == "adam" def test_append_alias(nlp): @@ -145,20 +297,20 @@ def test_append_alias(nlp): mykb.add_alias(alias="adam", entities=["Q2"], probabilities=[0.9]) # test the size of the relevant candidates - assert len(mykb.get_candidates("douglas")) == 2 + assert len(mykb.get_alias_candidates("douglas")) == 2 # append an alias mykb.append_alias(alias="douglas", entity="Q1", prior_prob=0.2) # test the size of the relevant candidates has been incremented - assert len(mykb.get_candidates("douglas")) == 3 + assert len(mykb.get_alias_candidates("douglas")) == 3 # append the same alias-entity pair again should not work (will throw a warning) with pytest.warns(UserWarning): mykb.append_alias(alias="douglas", entity="Q1", prior_prob=0.3) # test the size of the relevant candidates remained unchanged - assert len(mykb.get_candidates("douglas")) == 3 + assert len(mykb.get_alias_candidates("douglas")) == 3 def test_append_invalid_alias(nlp): @@ -181,34 +333,31 @@ def test_append_invalid_alias(nlp): def test_preserving_links_asdoc(nlp): """Test that Span.as_doc preserves the existing entity links""" - mykb = KnowledgeBase(nlp.vocab, entity_vector_length=1) + vector_length = 1 - # adding entities - mykb.add_entity(entity="Q1", freq=19, entity_vector=[1]) - mykb.add_entity(entity="Q2", freq=8, entity_vector=[1]) - - # adding aliases - mykb.add_alias(alias="Boston", entities=["Q1"], probabilities=[0.7]) - mykb.add_alias(alias="Denver", entities=["Q2"], probabilities=[0.6]) + def create_kb(vocab): + mykb = KnowledgeBase(vocab, 
entity_vector_length=vector_length) + # adding entities + mykb.add_entity(entity="Q1", freq=19, entity_vector=[1]) + mykb.add_entity(entity="Q2", freq=8, entity_vector=[1]) + # adding aliases + mykb.add_alias(alias="Boston", entities=["Q1"], probabilities=[0.7]) + mykb.add_alias(alias="Denver", entities=["Q2"], probabilities=[0.6]) + return mykb # set up pipeline with NER (Entity Ruler) and NEL (prior probability only, model not trained) - sentencizer = nlp.create_pipe("sentencizer") - nlp.add_pipe(sentencizer) - - ruler = EntityRuler(nlp) + nlp.add_pipe("sentencizer") patterns = [ {"label": "GPE", "pattern": "Boston"}, {"label": "GPE", "pattern": "Denver"}, ] + ruler = nlp.add_pipe("entity_ruler") ruler.add_patterns(patterns) - nlp.add_pipe(ruler) - - el_pipe = nlp.create_pipe(name="entity_linker") - el_pipe.set_kb(mykb) - el_pipe.begin_training() - el_pipe.incl_context = False - el_pipe.incl_prior = True - nlp.add_pipe(el_pipe, last=True) + config = {"incl_prior": False} + entity_linker = nlp.add_pipe("entity_linker", config=config, last=True) + entity_linker.set_kb(create_kb) + nlp.initialize() + assert entity_linker.model.get_dim("nO") == vector_length # test whether the entity links are preserved by the `as_doc()` function text = "She lives in Boston. He lives in Denver." @@ -248,3 +397,185 @@ def test_preserving_links_ents_2(nlp): assert len(list(doc.ents)) == 1 assert list(doc.ents)[0].label_ == "LOC" assert list(doc.ents)[0].kb_id_ == "Q1" + + +# fmt: off +TRAIN_DATA = [ + ("Russ Cochran captured his first major title with his son as caddie.", + {"links": {(0, 12): {"Q7381115": 0.0, "Q2146908": 1.0}}, + "entities": [(0, 12, "PERSON")], + "sent_starts": [1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}), + ("Russ Cochran his reprints include EC Comics.", + {"links": {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}}, + "entities": [(0, 12, "PERSON")], + "sent_starts": [1, -1, 0, 0, 0, 0, 0, 0]}), + ("Russ Cochran has been publishing comic art.", + {"links": {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}}, + "entities": [(0, 12, "PERSON")], + "sent_starts": [1, -1, 0, 0, 0, 0, 0, 0]}), + ("Russ Cochran was a member of University of Kentucky's golf team.", + {"links": {(0, 12): {"Q7381115": 0.0, "Q2146908": 1.0}}, + "entities": [(0, 12, "PERSON"), (43, 51, "LOC")], + "sent_starts": [1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}) +] +GOLD_entities = ["Q2146908", "Q7381115", "Q7381115", "Q2146908"] +# fmt: on + + +def test_overfitting_IO(): + # Simple test to try and quickly overfit the NEL component - ensuring the ML models work correctly + nlp = English() + vector_length = 3 + assert "Q2146908" not in nlp.vocab.strings + + # Convert the texts to docs to make sure we have doc.ents set for the training examples + train_examples = [] + for text, annotation in TRAIN_DATA: + doc = nlp(text) + train_examples.append(Example.from_dict(doc, annotation)) + + def create_kb(vocab): + # create artificial KB - assign same prior weight to the two russ cochran's + # Q2146908 (Russ Cochran): American golfer + # Q7381115 (Russ Cochran): publisher + mykb = KnowledgeBase(vocab, entity_vector_length=vector_length) + mykb.add_entity(entity="Q2146908", freq=12, entity_vector=[6, -4, 3]) + mykb.add_entity(entity="Q7381115", freq=12, entity_vector=[9, 1, -7]) + mykb.add_alias( + alias="Russ Cochran", + entities=["Q2146908", "Q7381115"], + probabilities=[0.5, 0.5], + ) + return mykb + + # Create the Entity Linker component and add it to the pipeline + entity_linker = nlp.add_pipe("entity_linker", last=True) + 
entity_linker.set_kb(create_kb) + assert "Q2146908" in entity_linker.vocab.strings + assert "Q2146908" in entity_linker.kb.vocab.strings + + # train the NEL pipe + optimizer = nlp.initialize(get_examples=lambda: train_examples) + assert entity_linker.model.get_dim("nO") == vector_length + assert entity_linker.model.get_dim("nO") == entity_linker.kb.entity_vector_length + + for i in range(50): + losses = {} + nlp.update(train_examples, sgd=optimizer, losses=losses) + assert losses["entity_linker"] < 0.001 + + # adding additional components that are required for the entity_linker + nlp.add_pipe("sentencizer", first=True) + + # Add a custom component to recognize "Russ Cochran" as an entity for the example training data + patterns = [ + {"label": "PERSON", "pattern": [{"LOWER": "russ"}, {"LOWER": "cochran"}]} + ] + ruler = nlp.add_pipe("entity_ruler", before="entity_linker") + ruler.add_patterns(patterns) + + # test the trained model + predictions = [] + for text, annotation in TRAIN_DATA: + doc = nlp(text) + for ent in doc.ents: + predictions.append(ent.kb_id_) + assert predictions == GOLD_entities + + # Also test the results are still the same after IO + with make_tempdir() as tmp_dir: + nlp.to_disk(tmp_dir) + nlp2 = util.load_model_from_path(tmp_dir) + assert nlp2.pipe_names == nlp.pipe_names + assert "Q2146908" in nlp2.vocab.strings + entity_linker2 = nlp2.get_pipe("entity_linker") + assert "Q2146908" in entity_linker2.vocab.strings + assert "Q2146908" in entity_linker2.kb.vocab.strings + predictions = [] + for text, annotation in TRAIN_DATA: + doc2 = nlp2(text) + for ent in doc2.ents: + predictions.append(ent.kb_id_) + assert predictions == GOLD_entities + + # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions + texts = [ + "Russ Cochran captured his first major title with his son as caddie.", + "Russ Cochran his reprints include EC Comics.", + "Russ Cochran has been publishing comic art.", + "Russ Cochran was a member of University of Kentucky's golf team.", + ] + batch_deps_1 = [doc.to_array([ENT_KB_ID]) for doc in nlp.pipe(texts)] + batch_deps_2 = [doc.to_array([ENT_KB_ID]) for doc in nlp.pipe(texts)] + no_batch_deps = [doc.to_array([ENT_KB_ID]) for doc in [nlp(text) for text in texts]] + assert_equal(batch_deps_1, batch_deps_2) + assert_equal(batch_deps_1, no_batch_deps) + + +def test_kb_serialization(): + # Test that the KB can be used in a pipeline with a different vocab + vector_length = 3 + with make_tempdir() as tmp_dir: + kb_dir = tmp_dir / "kb" + nlp1 = English() + assert "Q2146908" not in nlp1.vocab.strings + mykb = KnowledgeBase(nlp1.vocab, entity_vector_length=vector_length) + mykb.add_entity(entity="Q2146908", freq=12, entity_vector=[6, -4, 3]) + mykb.add_alias(alias="Russ Cochran", entities=["Q2146908"], probabilities=[0.8]) + assert "Q2146908" in nlp1.vocab.strings + mykb.to_disk(kb_dir) + + nlp2 = English() + assert "RandomWord" not in nlp2.vocab.strings + nlp2.vocab.strings.add("RandomWord") + assert "RandomWord" in nlp2.vocab.strings + assert "Q2146908" not in nlp2.vocab.strings + + # Create the Entity Linker component with the KB from file, and check the final vocab + entity_linker = nlp2.add_pipe("entity_linker", last=True) + entity_linker.set_kb(load_kb(kb_dir)) + assert "Q2146908" in nlp2.vocab.strings + assert "RandomWord" in nlp2.vocab.strings + + +def test_scorer_links(): + train_examples = [] + nlp = English() + ref1 = nlp("Julia lives in London happily.") + ref1.ents = [ + Span(ref1, 0, 1, label="PERSON", 
kb_id="Q2"), + Span(ref1, 3, 4, label="LOC", kb_id="Q3"), + ] + pred1 = nlp("Julia lives in London happily.") + pred1.ents = [ + Span(pred1, 0, 1, label="PERSON", kb_id="Q70"), + Span(pred1, 3, 4, label="LOC", kb_id="Q3"), + ] + train_examples.append(Example(pred1, ref1)) + + ref2 = nlp("She loves London.") + ref2.ents = [ + Span(ref2, 0, 1, label="PERSON", kb_id="Q2"), + Span(ref2, 2, 3, label="LOC", kb_id="Q13"), + ] + pred2 = nlp("She loves London.") + pred2.ents = [ + Span(pred2, 0, 1, label="PERSON", kb_id="Q2"), + Span(pred2, 2, 3, label="LOC", kb_id="NIL"), + ] + train_examples.append(Example(pred2, ref2)) + + ref3 = nlp("London is great.") + ref3.ents = [Span(ref3, 0, 1, label="LOC", kb_id="NIL")] + pred3 = nlp("London is great.") + pred3.ents = [Span(pred3, 0, 1, label="LOC", kb_id="NIL")] + train_examples.append(Example(pred3, ref3)) + + scores = Scorer().score_links(train_examples, negative_labels=["NIL"]) + assert scores["nel_f_per_type"]["PERSON"]["p"] == 1 / 2 + assert scores["nel_f_per_type"]["PERSON"]["r"] == 1 / 2 + assert scores["nel_f_per_type"]["LOC"]["p"] == 1 / 1 + assert scores["nel_f_per_type"]["LOC"]["r"] == 1 / 2 + + assert scores["nel_micro_p"] == 2 / 3 + assert scores["nel_micro_r"] == 2 / 4 diff --git a/spacy/tests/pipeline/test_entity_ruler.py b/spacy/tests/pipeline/test_entity_ruler.py index 9e22c9cc7..206f44719 100644 --- a/spacy/tests/pipeline/test_entity_ruler.py +++ b/spacy/tests/pipeline/test_entity_ruler.py @@ -1,7 +1,6 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest + +from spacy import registry from spacy.tokens import Span from spacy.language import Language from spacy.pipeline import EntityRuler @@ -14,6 +13,7 @@ def nlp(): @pytest.fixture +@registry.misc("entity_ruler_patterns") def patterns(): return [ {"label": "HELLO", "pattern": "hello world"}, @@ -25,13 +25,10 @@ def patterns(): ] -@pytest.fixture -def add_ent(): - def add_ent_component(doc): - doc.ents = [Span(doc, 0, 3, label=doc.vocab.strings["ORG"])] - return doc - - return add_ent_component +@Language.component("add_ent") +def add_ent_component(doc): + doc.ents = [Span(doc, 0, 3, label="ORG")] + return doc def test_entity_ruler_init(nlp, patterns): @@ -40,27 +37,60 @@ def test_entity_ruler_init(nlp, patterns): assert len(ruler.labels) == 4 assert "HELLO" in ruler assert "BYE" in ruler - nlp.add_pipe(ruler) + ruler = nlp.add_pipe("entity_ruler") + ruler.add_patterns(patterns) doc = nlp("hello world bye bye") assert len(doc.ents) == 2 assert doc.ents[0].label_ == "HELLO" assert doc.ents[1].label_ == "BYE" -def test_entity_ruler_existing(nlp, patterns, add_ent): - ruler = EntityRuler(nlp, patterns=patterns) - nlp.add_pipe(add_ent) - nlp.add_pipe(ruler) +def test_entity_ruler_init_patterns(nlp, patterns): + # initialize with patterns + ruler = nlp.add_pipe("entity_ruler") + assert len(ruler.labels) == 0 + ruler.initialize(lambda: [], patterns=patterns) + assert len(ruler.labels) == 4 + doc = nlp("hello world bye bye") + assert doc.ents[0].label_ == "HELLO" + assert doc.ents[1].label_ == "BYE" + nlp.remove_pipe("entity_ruler") + # initialize with patterns from misc registry + nlp.config["initialize"]["components"]["entity_ruler"] = { + "patterns": {"@misc": "entity_ruler_patterns"} + } + ruler = nlp.add_pipe("entity_ruler") + assert len(ruler.labels) == 0 + nlp.initialize() + assert len(ruler.labels) == 4 + doc = nlp("hello world bye bye") + assert doc.ents[0].label_ == "HELLO" + assert doc.ents[1].label_ == "BYE" + + +def test_entity_ruler_init_clear(nlp, 
patterns): + """Test that initialization clears patterns.""" + ruler = nlp.add_pipe("entity_ruler") + ruler.add_patterns(patterns) + assert len(ruler.labels) == 4 + ruler.initialize(lambda: []) + assert len(ruler.labels) == 0 + + +def test_entity_ruler_existing(nlp, patterns): + ruler = nlp.add_pipe("entity_ruler") + ruler.add_patterns(patterns) + nlp.add_pipe("add_ent", before="entity_ruler") doc = nlp("OH HELLO WORLD bye bye") assert len(doc.ents) == 2 assert doc.ents[0].label_ == "ORG" assert doc.ents[1].label_ == "BYE" -def test_entity_ruler_existing_overwrite(nlp, patterns, add_ent): - ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True) - nlp.add_pipe(add_ent) - nlp.add_pipe(ruler) +def test_entity_ruler_existing_overwrite(nlp, patterns): + ruler = nlp.add_pipe("entity_ruler", config={"overwrite_ents": True}) + ruler.add_patterns(patterns) + nlp.add_pipe("add_ent", before="entity_ruler") doc = nlp("OH HELLO WORLD bye bye") assert len(doc.ents) == 2 assert doc.ents[0].label_ == "HELLO" @@ -68,10 +98,10 @@ def test_entity_ruler_existing_overwrite(nlp, patterns, add_ent): assert doc.ents[1].label_ == "BYE" -def test_entity_ruler_existing_complex(nlp, patterns, add_ent): - ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True) - nlp.add_pipe(add_ent) - nlp.add_pipe(ruler) +def test_entity_ruler_existing_complex(nlp, patterns): + ruler = nlp.add_pipe("entity_ruler", config={"overwrite_ents": True}) + ruler.add_patterns(patterns) + nlp.add_pipe("add_ent", before="entity_ruler") doc = nlp("foo foo bye bye") assert len(doc.ents) == 2 assert doc.ents[0].label_ == "COMPLEX" @@ -81,8 +111,8 @@ def test_entity_ruler_existing_complex(nlp, patterns, add_ent): def test_entity_ruler_entity_id(nlp, patterns): - ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True) - nlp.add_pipe(ruler) + ruler = nlp.add_pipe("entity_ruler", config={"overwrite_ents": True}) + ruler.add_patterns(patterns) doc = nlp("Apple is a technology company") assert len(doc.ents) == 1 assert doc.ents[0].label_ == "TECH_ORG" @@ -90,9 +120,10 @@ def test_entity_ruler_entity_id(nlp, patterns): def test_entity_ruler_cfg_ent_id_sep(nlp, patterns): - ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True, ent_id_sep="**") + config = {"overwrite_ents": True, "ent_id_sep": "**"} + ruler = nlp.add_pipe("entity_ruler", config=config) + ruler.add_patterns(patterns) assert "TECH_ORG**a1" in ruler.phrase_patterns - nlp.add_pipe(ruler) doc = nlp("Apple is a technology company") assert len(doc.ents) == 1 assert doc.ents[0].label_ == "TECH_ORG" diff --git a/spacy/tests/pipeline/test_factories.py b/spacy/tests/pipeline/test_factories.py deleted file mode 100644 index 5efcc319a..000000000 --- a/spacy/tests/pipeline/test_factories.py +++ /dev/null @@ -1,50 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest -from spacy.language import Language -from spacy.tokens import Span - -from ..util import get_doc - - -@pytest.fixture -def doc(en_tokenizer): - text = "I like New York in Autumn." 
- heads = [1, 0, 1, -2, -3, -1, -5] - tags = ["PRP", "IN", "NNP", "NNP", "IN", "NNP", "."] - pos = ["PRON", "VERB", "PROPN", "PROPN", "ADP", "PROPN", "PUNCT"] - deps = ["ROOT", "prep", "compound", "pobj", "prep", "pobj", "punct"] - tokens = en_tokenizer(text) - doc = get_doc( - tokens.vocab, - words=[t.text for t in tokens], - heads=heads, - tags=tags, - pos=pos, - deps=deps, - ) - doc.ents = [Span(doc, 2, 4, doc.vocab.strings["GPE"])] - doc.is_parsed = True - doc.is_tagged = True - return doc - - -def test_factories_merge_noun_chunks(doc): - assert len(doc) == 7 - nlp = Language() - merge_noun_chunks = nlp.create_pipe("merge_noun_chunks") - merge_noun_chunks(doc) - assert len(doc) == 6 - assert doc[2].text == "New York" - - -def test_factories_merge_ents(doc): - assert len(doc) == 7 - assert len(list(doc.ents)) == 1 - nlp = Language() - merge_entities = nlp.create_pipe("merge_entities") - merge_entities(doc) - assert len(doc) == 6 - assert len(list(doc.ents)) == 1 - assert doc[2].text == "New York" diff --git a/spacy/tests/pipeline/test_functions.py b/spacy/tests/pipeline/test_functions.py index 5b5fcd2fd..025ac04af 100644 --- a/spacy/tests/pipeline/test_functions.py +++ b/spacy/tests/pipeline/test_functions.py @@ -1,34 +1,55 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.pipeline.functions import merge_subtokens -from ..util import get_doc +from spacy.language import Language +from spacy.tokens import Span, Doc @pytest.fixture -def doc(en_tokenizer): +def doc(en_vocab): # fmt: off - text = "This is a sentence. This is another sentence. And a third." - heads = [1, 0, 1, -2, -3, 1, 0, 1, -2, -3, 1, 1, 1, 0] + words = ["This", "is", "a", "sentence", ".", "This", "is", "another", "sentence", ".", "And", "a", "third", "."] + heads = [1, 1, 3, 1, 1, 6, 6, 8, 6, 6, 11, 12, 13, 13] deps = ["nsubj", "ROOT", "subtok", "attr", "punct", "nsubj", "ROOT", "subtok", "attr", "punct", "subtok", "subtok", "subtok", "ROOT"] # fmt: on - tokens = en_tokenizer(text) - return get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps) + return Doc(en_vocab, words=words, heads=heads, deps=deps) + + +@pytest.fixture +def doc2(en_vocab): + words = ["I", "like", "New", "York", "in", "Autumn", "."] + heads = [1, 1, 3, 1, 1, 4, 1] + tags = ["PRP", "IN", "NNP", "NNP", "IN", "NNP", "."] + pos = ["PRON", "VERB", "PROPN", "PROPN", "ADP", "PROPN", "PUNCT"] + deps = ["ROOT", "prep", "compound", "pobj", "prep", "pobj", "punct"] + doc = Doc(en_vocab, words=words, heads=heads, tags=tags, pos=pos, deps=deps) + doc.ents = [Span(doc, 2, 4, label="GPE")] + return doc def test_merge_subtokens(doc): doc = merge_subtokens(doc) - # get_doc() doesn't set spaces, so the result is "And a third ." - assert [t.text for t in doc] == [ - "This", - "is", - "a sentence", - ".", - "This", - "is", - "another sentence", - ".", - "And a third .", - ] + # Doc doesn't have spaces, so the result is "And a third ." 
+ # fmt: off + assert [t.text for t in doc] == ["This", "is", "a sentence", ".", "This", "is", "another sentence", ".", "And a third ."] + # fmt: on + + +def test_factories_merge_noun_chunks(doc2): + assert len(doc2) == 7 + nlp = Language() + merge_noun_chunks = nlp.create_pipe("merge_noun_chunks") + merge_noun_chunks(doc2) + assert len(doc2) == 6 + assert doc2[2].text == "New York" + + +def test_factories_merge_ents(doc2): + assert len(doc2) == 7 + assert len(list(doc2.ents)) == 1 + nlp = Language() + merge_entities = nlp.create_pipe("merge_entities") + merge_entities(doc2) + assert len(doc2) == 6 + assert len(list(doc2.ents)) == 1 + assert doc2[2].text == "New York" diff --git a/spacy/tests/pipeline/test_initialize.py b/spacy/tests/pipeline/test_initialize.py new file mode 100644 index 000000000..c9b514770 --- /dev/null +++ b/spacy/tests/pipeline/test_initialize.py @@ -0,0 +1,69 @@ +import pytest +from spacy.language import Language +from spacy.lang.en import English +from spacy.training import Example +from thinc.api import ConfigValidationError +from pydantic import StrictBool + + +def test_initialize_arguments(): + name = "test_initialize_arguments" + + class CustomTokenizer: + def __init__(self, tokenizer): + self.tokenizer = tokenizer + self.from_initialize = None + + def __call__(self, text): + return self.tokenizer(text) + + def initialize(self, get_examples, nlp, custom: int): + self.from_initialize = custom + + class Component: + def __init__(self): + self.from_initialize = None + + def initialize( + self, get_examples, nlp, custom1: str, custom2: StrictBool = False + ): + self.from_initialize = (custom1, custom2) + + Language.factory(name, func=lambda nlp, name: Component()) + + nlp = English() + nlp.tokenizer = CustomTokenizer(nlp.tokenizer) + example = Example.from_dict(nlp("x"), {}) + get_examples = lambda: [example] + nlp.add_pipe(name) + # The settings here will typically come from the [initialize] block + init_cfg = {"tokenizer": {"custom": 1}, "components": {name: {}}} + nlp.config["initialize"].update(init_cfg) + with pytest.raises(ConfigValidationError) as e: + # Empty config for component, no required custom1 argument + nlp.initialize(get_examples) + errors = e.value.errors + assert len(errors) == 1 + assert errors[0]["loc"] == ("custom1",) + assert errors[0]["type"] == "value_error.missing" + init_cfg = { + "tokenizer": {"custom": 1}, + "components": {name: {"custom1": "x", "custom2": 1}}, + } + nlp.config["initialize"].update(init_cfg) + with pytest.raises(ConfigValidationError) as e: + # Wrong type of custom 2 + nlp.initialize(get_examples) + errors = e.value.errors + assert len(errors) == 1 + assert errors[0]["loc"] == ("custom2",) + assert errors[0]["type"] == "value_error.strictbool" + init_cfg = { + "tokenizer": {"custom": 1}, + "components": {name: {"custom1": "x"}}, + } + nlp.config["initialize"].update(init_cfg) + nlp.initialize(get_examples) + assert nlp.tokenizer.from_initialize == 1 + pipe = nlp.get_pipe(name) + assert pipe.from_initialize == ("x", False) diff --git a/spacy/tests/pipeline/test_lemmatizer.py b/spacy/tests/pipeline/test_lemmatizer.py new file mode 100644 index 000000000..d37c87059 --- /dev/null +++ b/spacy/tests/pipeline/test_lemmatizer.py @@ -0,0 +1,100 @@ +import pytest +from spacy import util, registry +from spacy.lang.en import English +from spacy.lookups import Lookups + +from ..util import make_tempdir + + +@pytest.fixture +def nlp(): + @registry.misc("cope_lookups") + def cope_lookups(): + lookups = Lookups() + 
lookups.add_table("lemma_lookup", {"cope": "cope", "coped": "cope"}) + lookups.add_table("lemma_index", {"verb": ("cope", "cop")}) + lookups.add_table("lemma_exc", {"verb": {"coping": ("cope",)}}) + lookups.add_table("lemma_rules", {"verb": [["ing", ""]]}) + return lookups + + nlp = English() + nlp.config["initialize"]["components"]["lemmatizer"] = { + "lookups": {"@misc": "cope_lookups"} + } + return nlp + + +def test_lemmatizer_init(nlp): + lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"}) + assert isinstance(lemmatizer.lookups, Lookups) + assert not lemmatizer.lookups.tables + assert lemmatizer.mode == "lookup" + with pytest.raises(ValueError): + nlp("test") + nlp.initialize() + assert lemmatizer.lookups.tables + assert nlp("cope")[0].lemma_ == "cope" + assert nlp("coped")[0].lemma_ == "cope" + # replace any tables from spacy-lookups-data + lemmatizer.lookups = Lookups() + # lookup with no tables sets text as lemma + assert nlp("cope")[0].lemma_ == "cope" + assert nlp("coped")[0].lemma_ == "coped" + nlp.remove_pipe("lemmatizer") + lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"}) + with pytest.raises(ValueError): + # Can't initialize without required tables + lemmatizer.initialize(lookups=Lookups()) + lookups = Lookups() + lookups.add_table("lemma_lookup", {}) + lemmatizer.initialize(lookups=lookups) + + +def test_lemmatizer_config(nlp): + lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "rule"}) + nlp.initialize() + + doc = nlp.make_doc("coping") + doc[0].pos_ = "VERB" + assert doc[0].lemma_ == "" + doc = lemmatizer(doc) + assert doc[0].text == "coping" + assert doc[0].lemma_ == "cope" + + doc = nlp.make_doc("coping") + doc[0].pos_ = "VERB" + assert doc[0].lemma_ == "" + doc = lemmatizer(doc) + assert doc[0].text == "coping" + assert doc[0].lemma_ == "cope" + + +def test_lemmatizer_serialize(nlp): + lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "rule"}) + nlp.initialize() + + def cope_lookups(): + lookups = Lookups() + lookups.add_table("lemma_lookup", {"cope": "cope", "coped": "cope"}) + lookups.add_table("lemma_index", {"verb": ("cope", "cop")}) + lookups.add_table("lemma_exc", {"verb": {"coping": ("cope",)}}) + lookups.add_table("lemma_rules", {"verb": [["ing", ""]]}) + return lookups + + nlp2 = English() + lemmatizer2 = nlp2.add_pipe("lemmatizer", config={"mode": "rule"}) + lemmatizer2.initialize(lookups=cope_lookups()) + lemmatizer2.from_bytes(lemmatizer.to_bytes()) + assert lemmatizer.to_bytes() == lemmatizer2.to_bytes() + assert lemmatizer.lookups.tables == lemmatizer2.lookups.tables + + # Also test the results are still the same after IO + with make_tempdir() as tmp_dir: + nlp.to_disk(tmp_dir) + nlp2 = util.load_model_from_path(tmp_dir) + doc2 = nlp2.make_doc("coping") + doc2[0].pos_ = "VERB" + assert doc2[0].lemma_ == "" + doc2 = lemmatizer(doc2) + assert doc2[0].text == "coping" + assert doc2[0].lemma_ == "cope" diff --git a/spacy/tests/pipeline/test_models.py b/spacy/tests/pipeline/test_models.py new file mode 100644 index 000000000..d04ac9cd4 --- /dev/null +++ b/spacy/tests/pipeline/test_models.py @@ -0,0 +1,107 @@ +from typing import List + +import numpy +import pytest +from numpy.testing import assert_almost_equal +from spacy.vocab import Vocab +from thinc.api import NumpyOps, Model, data_validation +from thinc.types import Array2d, Ragged + +from spacy.lang.en import English +from spacy.ml import FeatureExtractor, StaticVectors +from spacy.ml._character_embed import CharacterEmbed +from spacy.tokens import Doc + + +OPS = 
NumpyOps() + +texts = ["These are 4 words", "Here just three"] +l0 = [[1, 2], [3, 4], [5, 6], [7, 8]] +l1 = [[9, 8], [7, 6], [5, 4]] +list_floats = [OPS.xp.asarray(l0, dtype="f"), OPS.xp.asarray(l1, dtype="f")] +list_ints = [OPS.xp.asarray(l0, dtype="i"), OPS.xp.asarray(l1, dtype="i")] +array = OPS.xp.asarray(l1, dtype="f") +ragged = Ragged(array, OPS.xp.asarray([2, 1], dtype="i")) + + +def get_docs(): + vocab = Vocab() + for t in texts: + for word in t.split(): + hash_id = vocab.strings.add(word) + vector = numpy.random.uniform(-1, 1, (7,)) + vocab.set_vector(hash_id, vector) + docs = [English(vocab)(t) for t in texts] + return docs + + +# Test components with a model of type Model[List[Doc], List[Floats2d]] +@pytest.mark.parametrize("name", ["tagger", "tok2vec", "morphologizer", "senter"]) +def test_components_batching_list(name): + nlp = English() + proc = nlp.create_pipe(name) + util_batch_unbatch_docs_list(proc.model, get_docs(), list_floats) + + +# Test components with a model of type Model[List[Doc], Floats2d] +@pytest.mark.parametrize("name", ["textcat"]) +def test_components_batching_array(name): + nlp = English() + proc = nlp.create_pipe(name) + util_batch_unbatch_docs_array(proc.model, get_docs(), array) + + +LAYERS = [ + (CharacterEmbed(nM=5, nC=3), get_docs(), list_floats), + (FeatureExtractor([100, 200]), get_docs(), list_ints), + (StaticVectors(), get_docs(), ragged), +] + + +@pytest.mark.parametrize("model,in_data,out_data", LAYERS) +def test_layers_batching_all(model, in_data, out_data): + # In = List[Doc] + if isinstance(in_data, list) and isinstance(in_data[0], Doc): + if isinstance(out_data, OPS.xp.ndarray) and out_data.ndim == 2: + util_batch_unbatch_docs_array(model, in_data, out_data) + elif ( + isinstance(out_data, list) + and isinstance(out_data[0], OPS.xp.ndarray) + and out_data[0].ndim == 2 + ): + util_batch_unbatch_docs_list(model, in_data, out_data) + elif isinstance(out_data, Ragged): + util_batch_unbatch_docs_ragged(model, in_data, out_data) + + +def util_batch_unbatch_docs_list( + model: Model[List[Doc], List[Array2d]], in_data: List[Doc], out_data: List[Array2d] +): + with data_validation(True): + model.initialize(in_data, out_data) + Y_batched = model.predict(in_data) + Y_not_batched = [model.predict([u])[0] for u in in_data] + for i in range(len(Y_batched)): + assert_almost_equal(Y_batched[i], Y_not_batched[i], decimal=4) + + +def util_batch_unbatch_docs_array( + model: Model[List[Doc], Array2d], in_data: List[Doc], out_data: Array2d +): + with data_validation(True): + model.initialize(in_data, out_data) + Y_batched = model.predict(in_data).tolist() + Y_not_batched = [model.predict([u])[0] for u in in_data] + assert_almost_equal(Y_batched, Y_not_batched, decimal=4) + + +def util_batch_unbatch_docs_ragged( + model: Model[List[Doc], Ragged], in_data: List[Doc], out_data: Ragged +): + with data_validation(True): + model.initialize(in_data, out_data) + Y_batched = model.predict(in_data) + Y_not_batched = [] + for u in in_data: + Y_not_batched.extend(model.predict([u]).data.tolist()) + assert_almost_equal(Y_batched.data, Y_not_batched, decimal=4) diff --git a/spacy/tests/pipeline/test_morphologizer.py b/spacy/tests/pipeline/test_morphologizer.py new file mode 100644 index 000000000..85d1d6c8b --- /dev/null +++ b/spacy/tests/pipeline/test_morphologizer.py @@ -0,0 +1,118 @@ +import pytest +from numpy.testing import assert_equal + +from spacy import util +from spacy.training import Example +from spacy.lang.en import English +from spacy.language import Language 
+from spacy.tests.util import make_tempdir +from spacy.morphology import Morphology +from spacy.attrs import MORPH + + +def test_label_types(): + nlp = Language() + morphologizer = nlp.add_pipe("morphologizer") + morphologizer.add_label("Feat=A") + with pytest.raises(ValueError): + morphologizer.add_label(9) + + +TRAIN_DATA = [ + ( + "I like green eggs", + { + "morphs": ["Feat=N", "Feat=V", "Feat=J", "Feat=N"], + "pos": ["NOUN", "VERB", "ADJ", "NOUN"], + }, + ), + # test combinations of morph+POS + ("Eat blue ham", {"morphs": ["Feat=V", "", ""], "pos": ["", "ADJ", ""]}), +] + + +def test_no_label(): + nlp = Language() + nlp.add_pipe("morphologizer") + with pytest.raises(ValueError): + nlp.initialize() + + +def test_implicit_label(): + nlp = Language() + nlp.add_pipe("morphologizer") + train_examples = [] + for t in TRAIN_DATA: + train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1])) + nlp.initialize(get_examples=lambda: train_examples) + + +def test_no_resize(): + nlp = Language() + morphologizer = nlp.add_pipe("morphologizer") + morphologizer.add_label("POS" + Morphology.FIELD_SEP + "NOUN") + morphologizer.add_label("POS" + Morphology.FIELD_SEP + "VERB") + nlp.initialize() + # this throws an error because the morphologizer can't be resized after initialization + with pytest.raises(ValueError): + morphologizer.add_label("POS" + Morphology.FIELD_SEP + "ADJ") + + +def test_initialize_examples(): + nlp = Language() + morphologizer = nlp.add_pipe("morphologizer") + morphologizer.add_label("POS" + Morphology.FIELD_SEP + "NOUN") + train_examples = [] + for t in TRAIN_DATA: + train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1])) + # you shouldn't really call this more than once, but for testing it should be fine + nlp.initialize() + nlp.initialize(get_examples=lambda: train_examples) + with pytest.raises(TypeError): + nlp.initialize(get_examples=lambda: None) + with pytest.raises(TypeError): + nlp.initialize(get_examples=train_examples) + + +def test_overfitting_IO(): + # Simple test to try and quickly overfit the morphologizer - ensuring the ML models work correctly + nlp = English() + nlp.add_pipe("morphologizer") + train_examples = [] + for inst in TRAIN_DATA: + train_examples.append(Example.from_dict(nlp.make_doc(inst[0]), inst[1])) + optimizer = nlp.initialize(get_examples=lambda: train_examples) + + for i in range(50): + losses = {} + nlp.update(train_examples, sgd=optimizer, losses=losses) + assert losses["morphologizer"] < 0.00001 + + # test the trained model + test_text = "I like blue ham" + doc = nlp(test_text) + gold_morphs = ["Feat=N", "Feat=V", "", ""] + gold_pos_tags = ["NOUN", "VERB", "ADJ", ""] + assert [str(t.morph) for t in doc] == gold_morphs + assert [t.pos_ for t in doc] == gold_pos_tags + + # Also test the results are still the same after IO + with make_tempdir() as tmp_dir: + nlp.to_disk(tmp_dir) + nlp2 = util.load_model_from_path(tmp_dir) + doc2 = nlp2(test_text) + assert [str(t.morph) for t in doc2] == gold_morphs + assert [t.pos_ for t in doc2] == gold_pos_tags + + # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions + texts = [ + "Just a sentence.", + "Then one more sentence about London.", + "Here is another one.", + "I like London.", + ] + batch_deps_1 = [doc.to_array([MORPH]) for doc in nlp.pipe(texts)] + batch_deps_2 = [doc.to_array([MORPH]) for doc in nlp.pipe(texts)] + no_batch_deps = [doc.to_array([MORPH]) for doc in [nlp(text) for text in texts]] + assert_equal(batch_deps_1, batch_deps_2) + 
assert_equal(batch_deps_1, no_batch_deps) diff --git a/spacy/tests/pipeline/test_pipe_factories.py b/spacy/tests/pipeline/test_pipe_factories.py new file mode 100644 index 000000000..cac394913 --- /dev/null +++ b/spacy/tests/pipeline/test_pipe_factories.py @@ -0,0 +1,492 @@ +import pytest +from spacy.language import Language +from spacy.lang.en import English +from spacy.lang.de import German +from spacy.tokens import Doc +from spacy.util import registry, SimpleFrozenDict, combine_score_weights +from thinc.api import Model, Linear, ConfigValidationError +from pydantic import StrictInt, StrictStr + +from ..util import make_tempdir + + +def test_pipe_function_component(): + name = "test_component" + + @Language.component(name) + def component(doc: Doc) -> Doc: + return doc + + assert name in registry.factories + nlp = Language() + with pytest.raises(ValueError): + nlp.add_pipe(component) + nlp.add_pipe(name) + assert name in nlp.pipe_names + assert nlp.pipe_factories[name] == name + assert Language.get_factory_meta(name) + assert nlp.get_pipe_meta(name) + pipe = nlp.get_pipe(name) + assert pipe == component + pipe = nlp.create_pipe(name) + assert pipe == component + + +def test_pipe_class_component_init(): + name1 = "test_class_component1" + name2 = "test_class_component2" + + @Language.factory(name1) + class Component1: + def __init__(self, nlp: Language, name: str): + self.nlp = nlp + + def __call__(self, doc: Doc) -> Doc: + return doc + + class Component2: + def __init__(self, nlp: Language, name: str): + self.nlp = nlp + + def __call__(self, doc: Doc) -> Doc: + return doc + + @Language.factory(name2) + def factory(nlp: Language, name=name2): + return Component2(nlp, name) + + nlp = Language() + for name, Component in [(name1, Component1), (name2, Component2)]: + assert name in registry.factories + with pytest.raises(ValueError): + nlp.add_pipe(Component(nlp, name)) + nlp.add_pipe(name) + assert name in nlp.pipe_names + assert nlp.pipe_factories[name] == name + assert Language.get_factory_meta(name) + assert nlp.get_pipe_meta(name) + pipe = nlp.get_pipe(name) + assert isinstance(pipe, Component) + assert isinstance(pipe.nlp, Language) + pipe = nlp.create_pipe(name) + assert isinstance(pipe, Component) + assert isinstance(pipe.nlp, Language) + + +def test_pipe_class_component_config(): + name = "test_class_component_config" + + @Language.factory(name) + class Component: + def __init__( + self, nlp: Language, name: str, value1: StrictInt, value2: StrictStr + ): + self.nlp = nlp + self.value1 = value1 + self.value2 = value2 + self.is_base = True + + def __call__(self, doc: Doc) -> Doc: + return doc + + @English.factory(name) + class ComponentEN: + def __init__( + self, nlp: Language, name: str, value1: StrictInt, value2: StrictStr + ): + self.nlp = nlp + self.value1 = value1 + self.value2 = value2 + self.is_base = False + + def __call__(self, doc: Doc) -> Doc: + return doc + + nlp = Language() + with pytest.raises(ConfigValidationError): # no config provided + nlp.add_pipe(name) + with pytest.raises(ConfigValidationError): # invalid config + nlp.add_pipe(name, config={"value1": "10", "value2": "hello"}) + nlp.add_pipe(name, config={"value1": 10, "value2": "hello"}) + pipe = nlp.get_pipe(name) + assert isinstance(pipe.nlp, Language) + assert pipe.value1 == 10 + assert pipe.value2 == "hello" + assert pipe.is_base is True + + nlp_en = English() + with pytest.raises(ConfigValidationError): # invalid config + nlp_en.add_pipe(name, config={"value1": "10", "value2": "hello"}) + 
nlp_en.add_pipe(name, config={"value1": 10, "value2": "hello"}) + pipe = nlp_en.get_pipe(name) + assert isinstance(pipe.nlp, English) + assert pipe.value1 == 10 + assert pipe.value2 == "hello" + assert pipe.is_base is False + + +def test_pipe_class_component_defaults(): + name = "test_class_component_defaults" + + @Language.factory(name) + class Component: + def __init__( + self, + nlp: Language, + name: str, + value1: StrictInt = 10, + value2: StrictStr = "hello", + ): + self.nlp = nlp + self.value1 = value1 + self.value2 = value2 + + def __call__(self, doc: Doc) -> Doc: + return doc + + nlp = Language() + nlp.add_pipe(name) + pipe = nlp.get_pipe(name) + assert isinstance(pipe.nlp, Language) + assert pipe.value1 == 10 + assert pipe.value2 == "hello" + + +def test_pipe_class_component_model(): + name = "test_class_component_model" + default_config = { + "model": { + "@architectures": "spacy.TextCatEnsemble.v1", + "exclusive_classes": False, + "pretrained_vectors": None, + "width": 64, + "embed_size": 2000, + "window_size": 1, + "conv_depth": 2, + "ngram_size": 1, + "dropout": None, + }, + "value1": 10, + } + + @Language.factory(name, default_config=default_config) + class Component: + def __init__(self, nlp: Language, model: Model, name: str, value1: StrictInt): + self.nlp = nlp + self.model = model + self.value1 = value1 + self.name = name + + def __call__(self, doc: Doc) -> Doc: + return doc + + nlp = Language() + nlp.add_pipe(name) + pipe = nlp.get_pipe(name) + assert isinstance(pipe.nlp, Language) + assert pipe.value1 == 10 + assert isinstance(pipe.model, Model) + + +def test_pipe_class_component_model_custom(): + name = "test_class_component_model_custom" + arch = f"{name}.arch" + default_config = {"value1": 1, "model": {"@architectures": arch, "nO": 0, "nI": 0}} + + @Language.factory(name, default_config=default_config) + class Component: + def __init__( + self, nlp: Language, model: Model, name: str, value1: StrictInt = 10 + ): + self.nlp = nlp + self.model = model + self.value1 = value1 + self.name = name + + def __call__(self, doc: Doc) -> Doc: + return doc + + @registry.architectures(arch) + def make_custom_arch(nO: StrictInt, nI: StrictInt): + return Linear(nO, nI) + + nlp = Language() + config = {"value1": 20, "model": {"@architectures": arch, "nO": 1, "nI": 2}} + nlp.add_pipe(name, config=config) + pipe = nlp.get_pipe(name) + assert isinstance(pipe.nlp, Language) + assert pipe.value1 == 20 + assert isinstance(pipe.model, Model) + assert pipe.model.name == "linear" + + nlp = Language() + with pytest.raises(ConfigValidationError): + config = {"value1": "20", "model": {"@architectures": arch, "nO": 1, "nI": 2}} + nlp.add_pipe(name, config=config) + with pytest.raises(ConfigValidationError): + config = {"value1": 20, "model": {"@architectures": arch, "nO": 1.0, "nI": 2.0}} + nlp.add_pipe(name, config=config) + + +def test_pipe_factories_wrong_formats(): + with pytest.raises(ValueError): + # Decorator is not called + @Language.component + def component(foo: int, bar: str): + ... + + with pytest.raises(ValueError): + # Decorator is not called + @Language.factory + def factory1(foo: int, bar: str): + ... + + with pytest.raises(ValueError): + # Factory function is missing "nlp" and "name" arguments + @Language.factory("test_pipe_factories_missing_args") + def factory2(foo: int, bar: str): + ... 
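The factory pattern exercised by the tests above, as a minimal standalone sketch (the `demo_component` name and `some_setting` option are illustrative placeholders, not names from this patch): a factory registered with `@Language.factory` receives `nlp`, the component `name` and any config values declared in `default_config`, and returns the callable that is added to the pipeline.

from spacy.language import Language
from spacy.tokens import Doc


@Language.factory("demo_component", default_config={"some_setting": True})
def create_demo_component(nlp: Language, name: str, some_setting: bool):
    # The factory receives nlp, the component name and any config values;
    # it returns the callable that will process each Doc in the pipeline.
    def demo_component(doc: Doc) -> Doc:
        return doc

    return demo_component


nlp = Language()
nlp.add_pipe("demo_component", config={"some_setting": False})

Overrides passed via `nlp.add_pipe(..., config=...)` are validated against the factory signature, which is what the `ConfigValidationError` assertions above rely on.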
+ + +def test_pipe_factory_meta_config_cleanup(): + """Test that component-specific meta and config entries are represented + correctly and cleaned up when pipes are removed, replaced or renamed.""" + nlp = Language() + nlp.add_pipe("ner", name="ner_component") + nlp.add_pipe("textcat") + assert nlp.get_factory_meta("ner") + assert nlp.get_pipe_meta("ner_component") + assert nlp.get_pipe_config("ner_component") + assert nlp.get_factory_meta("textcat") + assert nlp.get_pipe_meta("textcat") + assert nlp.get_pipe_config("textcat") + nlp.rename_pipe("textcat", "tc") + assert nlp.get_pipe_meta("tc") + assert nlp.get_pipe_config("tc") + with pytest.raises(ValueError): + nlp.remove_pipe("ner") + nlp.remove_pipe("ner_component") + assert "ner_component" not in nlp._pipe_meta + assert "ner_component" not in nlp._pipe_configs + with pytest.raises(ValueError): + nlp.replace_pipe("textcat", "parser") + nlp.replace_pipe("tc", "parser") + assert nlp.get_factory_meta("parser") + assert nlp.get_pipe_meta("tc").factory == "parser" + + +def test_pipe_factories_empty_dict_default(): + """Test that default config values can be empty dicts and that no config + validation error is raised.""" + # TODO: fix this + name = "test_pipe_factories_empty_dict_default" + + @Language.factory(name, default_config={"foo": {}}) + def factory(nlp: Language, name: str, foo: dict): + ... + + nlp = Language() + nlp.create_pipe(name) + + +def test_pipe_factories_language_specific(): + """Test that language sub-classes can have their own factories, with + fallbacks to the base factories.""" + name1 = "specific_component1" + name2 = "specific_component2" + Language.component(name1, func=lambda: "base") + English.component(name1, func=lambda: "en") + German.component(name2, func=lambda: "de") + + assert Language.has_factory(name1) + assert not Language.has_factory(name2) + assert English.has_factory(name1) + assert not English.has_factory(name2) + assert German.has_factory(name1) + assert German.has_factory(name2) + + nlp = Language() + assert nlp.create_pipe(name1)() == "base" + with pytest.raises(ValueError): + nlp.create_pipe(name2) + nlp_en = English() + assert nlp_en.create_pipe(name1)() == "en" + with pytest.raises(ValueError): + nlp_en.create_pipe(name2) + nlp_de = German() + assert nlp_de.create_pipe(name1)() == "base" + assert nlp_de.create_pipe(name2)() == "de" + + +def test_language_factories_invalid(): + """Test that assigning directly to Language.factories is now invalid and + raises a custom error.""" + assert isinstance(Language.factories, SimpleFrozenDict) + with pytest.raises(NotImplementedError): + Language.factories["foo"] = "bar" + nlp = Language() + assert isinstance(nlp.factories, SimpleFrozenDict) + assert len(nlp.factories) + with pytest.raises(NotImplementedError): + nlp.factories["foo"] = "bar" + + +@pytest.mark.parametrize( + "weights,expected", + [ + ([{"a": 1.0}, {"b": 1.0}, {"c": 1.0}], {"a": 0.33, "b": 0.33, "c": 0.33}), + ([{"a": 1.0}, {"b": 50}, {"c": 123}], {"a": 0.33, "b": 0.33, "c": 0.33}), + ( + [{"a": 0.7, "b": 0.3}, {"c": 1.0}, {"d": 0.5, "e": 0.5}], + {"a": 0.23, "b": 0.1, "c": 0.33, "d": 0.17, "e": 0.17}, + ), + ( + [{"a": 100, "b": 400}, {"c": 0.5, "d": 0.5}], + {"a": 0.1, "b": 0.4, "c": 0.25, "d": 0.25}, + ), + ([{"a": 0.5, "b": 0.5}, {"b": 1.0}], {"a": 0.25, "b": 0.75}), + ([{"a": 0.0, "b": 0.0}, {"c": 0.0}], {"a": 0.0, "b": 0.0, "c": 0.0}), + ], +) +def test_language_factories_combine_score_weights(weights, expected): + result = combine_score_weights(weights) + assert sum(result.values()) 
in (0.99, 1.0, 0.0) + assert result == expected + + +def test_language_factories_scores(): + name = "test_language_factories_scores" + func = lambda nlp, name: lambda doc: doc + weights1 = {"a1": 0.5, "a2": 0.5} + weights2 = {"b1": 0.2, "b2": 0.7, "b3": 0.1} + Language.factory(f"{name}1", default_score_weights=weights1, func=func) + Language.factory(f"{name}2", default_score_weights=weights2, func=func) + meta1 = Language.get_factory_meta(f"{name}1") + assert meta1.default_score_weights == weights1 + meta2 = Language.get_factory_meta(f"{name}2") + assert meta2.default_score_weights == weights2 + nlp = Language() + nlp._config["training"]["score_weights"] = {} + nlp.add_pipe(f"{name}1") + nlp.add_pipe(f"{name}2") + cfg = nlp.config["training"] + expected_weights = {"a1": 0.25, "a2": 0.25, "b1": 0.1, "b2": 0.35, "b3": 0.05} + assert cfg["score_weights"] == expected_weights + # Test with custom defaults + config = nlp.config.copy() + config["training"]["score_weights"]["a1"] = 0.0 + config["training"]["score_weights"]["b3"] = 1.0 + nlp = English.from_config(config) + score_weights = nlp.config["training"]["score_weights"] + expected = {"a1": 0.0, "a2": 0.5, "b1": 0.03, "b2": 0.12, "b3": 0.34} + assert score_weights == expected + # Test with null values + config = nlp.config.copy() + config["training"]["score_weights"]["a1"] = None + nlp = English.from_config(config) + score_weights = nlp.config["training"]["score_weights"] + expected = {"a1": None, "a2": 0.5, "b1": 0.03, "b2": 0.12, "b3": 0.35} + assert score_weights == expected + + +def test_pipe_factories_from_source(): + """Test adding components from a source model.""" + source_nlp = English() + source_nlp.add_pipe("tagger", name="my_tagger") + nlp = English() + with pytest.raises(ValueError): + nlp.add_pipe("my_tagger", source="en_core_web_sm") + nlp.add_pipe("my_tagger", source=source_nlp) + assert "my_tagger" in nlp.pipe_names + with pytest.raises(KeyError): + nlp.add_pipe("custom", source=source_nlp) + + +def test_pipe_factories_from_source_custom(): + """Test adding components from a source model with custom components.""" + name = "test_pipe_factories_from_source_custom" + + @Language.factory(name, default_config={"arg": "hello"}) + def test_factory(nlp, name, arg: str): + return lambda doc: doc + + source_nlp = English() + source_nlp.add_pipe("tagger") + source_nlp.add_pipe(name, config={"arg": "world"}) + nlp = English() + nlp.add_pipe(name, source=source_nlp) + assert name in nlp.pipe_names + assert nlp.get_pipe_meta(name).default_config["arg"] == "hello" + config = nlp.config["components"][name] + assert config["factory"] == name + assert config["arg"] == "world" + + +def test_pipe_factories_from_source_config(): + name = "test_pipe_factories_from_source_config" + + @Language.factory(name, default_config={"arg": "hello"}) + def test_factory(nlp, name, arg: str): + return lambda doc: doc + + source_nlp = English() + source_nlp.add_pipe("tagger") + source_nlp.add_pipe(name, name="yolo", config={"arg": "world"}) + dest_nlp_cfg = {"lang": "en", "pipeline": ["parser", "custom"]} + with make_tempdir() as tempdir: + source_nlp.to_disk(tempdir) + dest_components_cfg = { + "parser": {"factory": "parser"}, + "custom": {"source": str(tempdir), "component": "yolo"}, + } + dest_config = {"nlp": dest_nlp_cfg, "components": dest_components_cfg} + nlp = English.from_config(dest_config) + assert nlp.pipe_names == ["parser", "custom"] + assert nlp.pipe_factories == {"parser": "parser", "custom": name} + meta = nlp.get_pipe_meta("custom") + assert 
meta.factory == name + assert meta.default_config["arg"] == "hello" + config = nlp.config["components"]["custom"] + assert config["factory"] == name + assert config["arg"] == "world" + + +def test_pipe_factories_decorator_idempotent(): + """Check that decorator can be run multiple times if the function is the + same. This is especially relevant for live reloading because we don't + want spaCy to raise an error if a module registering components is reloaded. + """ + name = "test_pipe_factories_decorator_idempotent" + func = lambda nlp, name: lambda doc: doc + for i in range(5): + Language.factory(name, func=func) + nlp = Language() + nlp.add_pipe(name) + Language.factory(name, func=func) + # Make sure it also works for component decorator, which creates the + # factory function + name2 = f"{name}2" + func2 = lambda doc: doc + for i in range(5): + Language.component(name2, func=func2) + nlp = Language() + nlp.add_pipe(name) + Language.component(name2, func=func2) + + +def test_pipe_factories_config_excludes_nlp(): + """Test that the extra values we temporarily add to component config + blocks/functions are removed and not copied around. + """ + name = "test_pipe_factories_config_excludes_nlp" + func = lambda nlp, name: lambda doc: doc + Language.factory(name, func=func) + config = { + "nlp": {"lang": "en", "pipeline": [name]}, + "components": {name: {"factory": name}}, + } + nlp = English.from_config(config) + assert nlp.pipe_names == [name] + pipe_cfg = nlp.get_pipe_config(name) + pipe_cfg == {"factory": name} + assert nlp._pipe_configs[name] == {"factory": name} diff --git a/spacy/tests/pipeline/test_pipe_methods.py b/spacy/tests/pipeline/test_pipe_methods.py index 27fb57b18..6a21ddfaa 100644 --- a/spacy/tests/pipeline/test_pipe_methods.py +++ b/spacy/tests/pipeline/test_pipe_methods.py @@ -1,8 +1,7 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest from spacy.language import Language +from spacy.pipeline import TrainablePipe +from spacy.util import SimpleFrozenList, get_arg_names @pytest.fixture @@ -10,67 +9,90 @@ def nlp(): return Language() +@Language.component("new_pipe") def new_pipe(doc): return doc +@Language.component("other_pipe") +def other_pipe(doc): + return doc + + def test_add_pipe_no_name(nlp): - nlp.add_pipe(new_pipe) + nlp.add_pipe("new_pipe") assert "new_pipe" in nlp.pipe_names def test_add_pipe_duplicate_name(nlp): - nlp.add_pipe(new_pipe, name="duplicate_name") + nlp.add_pipe("new_pipe", name="duplicate_name") with pytest.raises(ValueError): - nlp.add_pipe(new_pipe, name="duplicate_name") + nlp.add_pipe("new_pipe", name="duplicate_name") @pytest.mark.parametrize("name", ["parser"]) def test_add_pipe_first(nlp, name): - nlp.add_pipe(new_pipe, name=name, first=True) + nlp.add_pipe("new_pipe", name=name, first=True) assert nlp.pipeline[0][0] == name @pytest.mark.parametrize("name1,name2", [("parser", "lambda_pipe")]) def test_add_pipe_last(nlp, name1, name2): - nlp.add_pipe(lambda doc: doc, name=name2) - nlp.add_pipe(new_pipe, name=name1, last=True) + Language.component("new_pipe2", func=lambda doc: doc) + nlp.add_pipe("new_pipe2", name=name2) + nlp.add_pipe("new_pipe", name=name1, last=True) assert nlp.pipeline[0][0] != name1 assert nlp.pipeline[-1][0] == name1 def test_cant_add_pipe_first_and_last(nlp): with pytest.raises(ValueError): - nlp.add_pipe(new_pipe, first=True, last=True) + nlp.add_pipe("new_pipe", first=True, last=True) @pytest.mark.parametrize("name", ["my_component"]) def test_get_pipe(nlp, name): with pytest.raises(KeyError): 
nlp.get_pipe(name) - nlp.add_pipe(new_pipe, name=name) + nlp.add_pipe("new_pipe", name=name) assert nlp.get_pipe(name) == new_pipe @pytest.mark.parametrize( - "name,replacement,not_callable", [("my_component", lambda doc: doc, {})] + "name,replacement,invalid_replacement", + [("my_component", "other_pipe", lambda doc: doc)], ) -def test_replace_pipe(nlp, name, replacement, not_callable): +def test_replace_pipe(nlp, name, replacement, invalid_replacement): with pytest.raises(ValueError): nlp.replace_pipe(name, new_pipe) - nlp.add_pipe(new_pipe, name=name) + nlp.add_pipe("new_pipe", name=name) with pytest.raises(ValueError): - nlp.replace_pipe(name, not_callable) + nlp.replace_pipe(name, invalid_replacement) nlp.replace_pipe(name, replacement) - assert nlp.get_pipe(name) != new_pipe - assert nlp.get_pipe(name) == replacement + assert nlp.get_pipe(name) == nlp.create_pipe(replacement) + + +def test_replace_last_pipe(nlp): + nlp.add_pipe("sentencizer") + nlp.add_pipe("ner") + assert nlp.pipe_names == ["sentencizer", "ner"] + nlp.replace_pipe("ner", "ner") + assert nlp.pipe_names == ["sentencizer", "ner"] + + +def test_replace_pipe_config(nlp): + nlp.add_pipe("entity_linker") + nlp.add_pipe("sentencizer") + assert nlp.get_pipe("entity_linker").cfg["incl_prior"] is True + nlp.replace_pipe("entity_linker", "entity_linker", config={"incl_prior": False}) + assert nlp.get_pipe("entity_linker").cfg["incl_prior"] is False @pytest.mark.parametrize("old_name,new_name", [("old_pipe", "new_pipe")]) def test_rename_pipe(nlp, old_name, new_name): with pytest.raises(ValueError): nlp.rename_pipe(old_name, new_name) - nlp.add_pipe(new_pipe, name=old_name) + nlp.add_pipe("new_pipe", name=old_name) nlp.rename_pipe(old_name, new_name) assert nlp.pipeline[0][0] == new_name @@ -79,7 +101,7 @@ def test_rename_pipe(nlp, old_name, new_name): def test_remove_pipe(nlp, name): with pytest.raises(ValueError): nlp.remove_pipe(name) - nlp.add_pipe(new_pipe, name=name) + nlp.add_pipe("new_pipe", name=name) assert len(nlp.pipeline) == 1 removed_name, removed_component = nlp.remove_pipe(name) assert not len(nlp.pipeline) @@ -89,40 +111,106 @@ def test_remove_pipe(nlp, name): @pytest.mark.parametrize("name", ["my_component"]) def test_disable_pipes_method(nlp, name): - nlp.add_pipe(new_pipe, name=name) + nlp.add_pipe("new_pipe", name=name) assert nlp.has_pipe(name) - disabled = nlp.disable_pipes(name) + disabled = nlp.select_pipes(disable=name) + assert not nlp.has_pipe(name) + disabled.restore() + + +@pytest.mark.parametrize("name", ["my_component"]) +def test_enable_pipes_method(nlp, name): + nlp.add_pipe("new_pipe", name=name) + assert nlp.has_pipe(name) + disabled = nlp.select_pipes(enable=[]) assert not nlp.has_pipe(name) disabled.restore() @pytest.mark.parametrize("name", ["my_component"]) def test_disable_pipes_context(nlp, name): - nlp.add_pipe(new_pipe, name=name) + """Test that an enabled component stays enabled after running the context manager.""" + nlp.add_pipe("new_pipe", name=name) assert nlp.has_pipe(name) - with nlp.disable_pipes(name): + with nlp.select_pipes(disable=name): assert not nlp.has_pipe(name) assert nlp.has_pipe(name) -def test_disable_pipes_list_arg(nlp): +@pytest.mark.parametrize("name", ["my_component"]) +def test_disable_pipes_context_restore(nlp, name): + """Test that a disabled component stays disabled after running the context manager.""" + nlp.add_pipe("new_pipe", name=name) + assert nlp.has_pipe(name) + nlp.disable_pipe(name) + assert not nlp.has_pipe(name) + with 
nlp.select_pipes(disable=name): + assert not nlp.has_pipe(name) + assert not nlp.has_pipe(name) + + +def test_select_pipes_list_arg(nlp): for name in ["c1", "c2", "c3"]: - nlp.add_pipe(new_pipe, name=name) + nlp.add_pipe("new_pipe", name=name) assert nlp.has_pipe(name) - with nlp.disable_pipes(["c1", "c2"]): + with nlp.select_pipes(disable=["c1", "c2"]): assert not nlp.has_pipe("c1") assert not nlp.has_pipe("c2") assert nlp.has_pipe("c3") + with nlp.select_pipes(enable="c3"): + assert not nlp.has_pipe("c1") + assert not nlp.has_pipe("c2") + assert nlp.has_pipe("c3") + with nlp.select_pipes(enable=["c1", "c2"], disable="c3"): + assert nlp.has_pipe("c1") + assert nlp.has_pipe("c2") + assert not nlp.has_pipe("c3") + with nlp.select_pipes(enable=[]): + assert not nlp.has_pipe("c1") + assert not nlp.has_pipe("c2") + assert not nlp.has_pipe("c3") + with nlp.select_pipes(enable=["c1", "c2", "c3"], disable=[]): + assert nlp.has_pipe("c1") + assert nlp.has_pipe("c2") + assert nlp.has_pipe("c3") + with nlp.select_pipes(disable=["c1", "c2", "c3"], enable=[]): + assert not nlp.has_pipe("c1") + assert not nlp.has_pipe("c2") + assert not nlp.has_pipe("c3") + + +def test_select_pipes_errors(nlp): + for name in ["c1", "c2", "c3"]: + nlp.add_pipe("new_pipe", name=name) + assert nlp.has_pipe(name) + + with pytest.raises(ValueError): + nlp.select_pipes() + + with pytest.raises(ValueError): + nlp.select_pipes(enable=["c1", "c2"], disable=["c1"]) + + with pytest.raises(ValueError): + nlp.select_pipes(enable=["c1", "c2"], disable=[]) + + with pytest.raises(ValueError): + nlp.select_pipes(enable=[], disable=["c3"]) + + disabled = nlp.select_pipes(disable=["c2"]) + nlp.remove_pipe("c2") + with pytest.raises(ValueError): + disabled.restore() @pytest.mark.parametrize("n_pipes", [100]) def test_add_lots_of_pipes(nlp, n_pipes): + Language.component("n_pipes", func=lambda doc: doc) for i in range(n_pipes): - nlp.add_pipe(lambda doc: doc, name="pipe_%d" % i) + nlp.add_pipe("n_pipes", name=f"pipe_{i}") assert len(nlp.pipe_names) == n_pipes -@pytest.mark.parametrize("component", ["ner", {"hello": "world"}]) +@pytest.mark.parametrize("component", [lambda doc: doc, {"hello": "world"}]) def test_raise_for_invalid_components(nlp, component): with pytest.raises(ValueError): nlp.add_pipe(component) @@ -146,11 +234,186 @@ def test_pipe_labels(nlp): "textcat": ["POSITIVE", "NEGATIVE"], } for name, labels in input_labels.items(): - pipe = nlp.create_pipe(name) + nlp.add_pipe(name) + pipe = nlp.get_pipe(name) for label in labels: pipe.add_label(label) assert len(pipe.labels) == len(labels) - nlp.add_pipe(pipe) + assert len(nlp.pipe_labels) == len(input_labels) for name, labels in nlp.pipe_labels.items(): assert sorted(input_labels[name]) == sorted(labels) + + +def test_add_pipe_before_after(): + """Test that before/after works with strings and ints.""" + nlp = Language() + nlp.add_pipe("ner") + with pytest.raises(ValueError): + nlp.add_pipe("textcat", before="parser") + nlp.add_pipe("textcat", before="ner") + assert nlp.pipe_names == ["textcat", "ner"] + with pytest.raises(ValueError): + nlp.add_pipe("parser", before=3) + with pytest.raises(ValueError): + nlp.add_pipe("parser", after=3) + nlp.add_pipe("parser", after=0) + assert nlp.pipe_names == ["textcat", "parser", "ner"] + nlp.add_pipe("tagger", before=2) + assert nlp.pipe_names == ["textcat", "parser", "tagger", "ner"] + with pytest.raises(ValueError): + nlp.add_pipe("entity_ruler", after=1, first=True) + with pytest.raises(ValueError): + nlp.add_pipe("entity_ruler", 
before="ner", after=2) + with pytest.raises(ValueError): + nlp.add_pipe("entity_ruler", before=True) + with pytest.raises(ValueError): + nlp.add_pipe("entity_ruler", first=False) + + +def test_disable_enable_pipes(): + name = "test_disable_enable_pipes" + results = {} + + def make_component(name): + results[name] = "" + + def component(doc): + nonlocal results + results[name] = doc.text + return doc + + return component + + c1 = Language.component(f"{name}1", func=make_component(f"{name}1")) + c2 = Language.component(f"{name}2", func=make_component(f"{name}2")) + + nlp = Language() + nlp.add_pipe(f"{name}1") + nlp.add_pipe(f"{name}2") + assert results[f"{name}1"] == "" + assert results[f"{name}2"] == "" + assert nlp.pipeline == [(f"{name}1", c1), (f"{name}2", c2)] + assert nlp.pipe_names == [f"{name}1", f"{name}2"] + nlp.disable_pipe(f"{name}1") + assert nlp.disabled == [f"{name}1"] + assert nlp.component_names == [f"{name}1", f"{name}2"] + assert nlp.pipe_names == [f"{name}2"] + assert nlp.config["nlp"]["disabled"] == [f"{name}1"] + nlp("hello") + assert results[f"{name}1"] == "" # didn't run + assert results[f"{name}2"] == "hello" # ran + nlp.enable_pipe(f"{name}1") + assert nlp.disabled == [] + assert nlp.pipe_names == [f"{name}1", f"{name}2"] + assert nlp.config["nlp"]["disabled"] == [] + nlp("world") + assert results[f"{name}1"] == "world" + assert results[f"{name}2"] == "world" + nlp.disable_pipe(f"{name}2") + nlp.remove_pipe(f"{name}2") + assert nlp.components == [(f"{name}1", c1)] + assert nlp.pipeline == [(f"{name}1", c1)] + assert nlp.component_names == [f"{name}1"] + assert nlp.pipe_names == [f"{name}1"] + assert nlp.disabled == [] + assert nlp.config["nlp"]["disabled"] == [] + nlp.rename_pipe(f"{name}1", name) + assert nlp.components == [(name, c1)] + assert nlp.component_names == [name] + nlp("!") + assert results[f"{name}1"] == "!" + assert results[f"{name}2"] == "world" + with pytest.raises(ValueError): + nlp.disable_pipe(f"{name}2") + nlp.disable_pipe(name) + assert nlp.component_names == [name] + assert nlp.pipe_names == [] + assert nlp.config["nlp"]["disabled"] == [name] + nlp("?") + assert results[f"{name}1"] == "!" + + +def test_pipe_methods_frozen(): + """Test that spaCy raises custom error messages if "frozen" properties are + accessed. 
We still want to use a list here to not break backwards + compatibility, but users should see an error if they're trying to append + to nlp.pipeline etc.""" + nlp = Language() + ner = nlp.add_pipe("ner") + assert nlp.pipe_names == ["ner"] + for prop in [ + nlp.pipeline, + nlp.pipe_names, + nlp.components, + nlp.component_names, + nlp.disabled, + nlp.factory_names, + ]: + assert isinstance(prop, list) + assert isinstance(prop, SimpleFrozenList) + with pytest.raises(NotImplementedError): + nlp.pipeline.append(("ner2", ner)) + with pytest.raises(NotImplementedError): + nlp.pipe_names.pop() + with pytest.raises(NotImplementedError): + nlp.components.sort() + with pytest.raises(NotImplementedError): + nlp.component_names.clear() + + +@pytest.mark.parametrize( + "pipe", ["tagger", "parser", "ner", "textcat", "morphologizer"] +) +def test_pipe_label_data_exports_labels(pipe): + nlp = Language() + pipe = nlp.add_pipe(pipe) + # Make sure pipe has pipe labels + assert getattr(pipe, "label_data", None) is not None + # Make sure pipe can be initialized with labels + initialize = getattr(pipe, "initialize", None) + assert initialize is not None + assert "labels" in get_arg_names(initialize) + + +@pytest.mark.parametrize("pipe", ["senter", "entity_linker"]) +def test_pipe_label_data_no_labels(pipe): + nlp = Language() + pipe = nlp.add_pipe(pipe) + assert getattr(pipe, "label_data", None) is None + initialize = getattr(pipe, "initialize", None) + if initialize is not None: + assert "labels" not in get_arg_names(initialize) + + +def test_warning_pipe_begin_training(): + with pytest.warns(UserWarning, match="begin_training"): + + class IncompatPipe(TrainablePipe): + def __init__(self): + ... + + def begin_training(*args, **kwargs): + ... + + +def test_pipe_methods_initialize(): + """Test that the [initialize] config reflects the components correctly.""" + nlp = Language() + nlp.add_pipe("tagger") + assert "tagger" not in nlp.config["initialize"]["components"] + nlp.config["initialize"]["components"]["tagger"] = {"labels": ["hello"]} + assert nlp.config["initialize"]["components"]["tagger"] == {"labels": ["hello"]} + nlp.remove_pipe("tagger") + assert "tagger" not in nlp.config["initialize"]["components"] + nlp.add_pipe("tagger") + assert "tagger" not in nlp.config["initialize"]["components"] + nlp.config["initialize"]["components"]["tagger"] = {"labels": ["hello"]} + nlp.rename_pipe("tagger", "my_tagger") + assert "tagger" not in nlp.config["initialize"]["components"] + assert nlp.config["initialize"]["components"]["my_tagger"] == {"labels": ["hello"]} + nlp.config["initialize"]["components"]["test"] = {"foo": "bar"} + nlp.add_pipe("ner", name="test") + assert "test" in nlp.config["initialize"]["components"] + nlp.remove_pipe("test") + assert "test" not in nlp.config["initialize"]["components"] diff --git a/spacy/tests/pipeline/test_sentencizer.py b/spacy/tests/pipeline/test_sentencizer.py index ee9220a29..5dd0fef43 100644 --- a/spacy/tests/pipeline/test_sentencizer.py +++ b/spacy/tests/pipeline/test_sentencizer.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest import spacy from spacy.pipeline import Sentencizer @@ -10,9 +7,9 @@ from spacy.lang.en import English def test_sentencizer(en_vocab): doc = Doc(en_vocab, words=["Hello", "!", "This", "is", "a", "test", "."]) - sentencizer = Sentencizer() + sentencizer = Sentencizer(punct_chars=None) doc = sentencizer(doc) - assert doc.is_sentenced + assert doc.has_annotation("SENT_START") sent_starts = [t.is_sent_start for 
t in doc] sent_ends = [t.is_sent_end for t in doc] assert sent_starts == [True, False, True, False, False, False, False] @@ -23,9 +20,15 @@ def test_sentencizer(en_vocab): def test_sentencizer_pipe(): texts = ["Hello! This is a test.", "Hi! This is a test."] nlp = English() - nlp.add_pipe(nlp.create_pipe("sentencizer")) + nlp.add_pipe("sentencizer") for doc in nlp.pipe(texts): - assert doc.is_sentenced + assert doc.has_annotation("SENT_START") + sent_starts = [t.is_sent_start for t in doc] + assert sent_starts == [True, False, True, False, False, False, False] + assert len(list(doc.sents)) == 2 + for ex in nlp.pipe(texts): + doc = ex.doc + assert doc.has_annotation("SENT_START") sent_starts = [t.is_sent_start for t in doc] assert sent_starts == [True, False, True, False, False, False, False] assert len(list(doc.sents)) == 2 @@ -36,10 +39,10 @@ def test_sentencizer_empty_docs(): many_empty_texts = ["", "", ""] some_empty_texts = ["hi", "", "This is a test. Here are two sentences.", ""] nlp = English() - nlp.add_pipe(nlp.create_pipe("sentencizer")) + nlp.add_pipe("sentencizer") for texts in [one_empty_text, many_empty_texts, some_empty_texts]: for doc in nlp.pipe(texts): - assert doc.is_sentenced + assert doc.has_annotation("SENT_START") sent_starts = [t.is_sent_start for t in doc] if len(doc) == 0: assert sent_starts == [] @@ -77,9 +80,9 @@ def test_sentencizer_empty_docs(): ) def test_sentencizer_complex(en_vocab, words, sent_starts, sent_ends, n_sents): doc = Doc(en_vocab, words=words) - sentencizer = Sentencizer() + sentencizer = Sentencizer(punct_chars=None) doc = sentencizer(doc) - assert doc.is_sentenced + assert doc.has_annotation("SENT_START") assert [t.is_sent_start for t in doc] == sent_starts assert [t.is_sent_end for t in doc] == sent_ends assert len(list(doc.sents)) == n_sents @@ -112,7 +115,7 @@ def test_sentencizer_custom_punct( doc = Doc(en_vocab, words=words) sentencizer = Sentencizer(punct_chars=punct_chars) doc = sentencizer(doc) - assert doc.is_sentenced + assert doc.has_annotation("SENT_START") assert [t.is_sent_start for t in doc] == sent_starts assert [t.is_sent_end for t in doc] == sent_ends assert len(list(doc.sents)) == n_sents @@ -123,7 +126,7 @@ def test_sentencizer_serialize_bytes(en_vocab): sentencizer = Sentencizer(punct_chars=punct_chars) assert sentencizer.punct_chars == set(punct_chars) bytes_data = sentencizer.to_bytes() - new_sentencizer = Sentencizer().from_bytes(bytes_data) + new_sentencizer = Sentencizer(punct_chars=None).from_bytes(bytes_data) assert new_sentencizer.punct_chars == set(punct_chars) @@ -144,7 +147,6 @@ def test_sentencizer_serialize_bytes(en_vocab): ) def test_sentencizer_across_scripts(lang, text): nlp = spacy.blank(lang) - sentencizer = Sentencizer() - nlp.add_pipe(sentencizer) + nlp.add_pipe("sentencizer") doc = nlp(text) assert len(list(doc.sents)) > 1 diff --git a/spacy/tests/pipeline/test_senter.py b/spacy/tests/pipeline/test_senter.py new file mode 100644 index 000000000..7a256f79b --- /dev/null +++ b/spacy/tests/pipeline/test_senter.py @@ -0,0 +1,99 @@ +import pytest +from numpy.testing import assert_equal +from spacy.attrs import SENT_START + +from spacy import util +from spacy.training import Example +from spacy.lang.en import English +from spacy.language import Language +from spacy.tests.util import make_tempdir + + +def test_label_types(): + nlp = Language() + senter = nlp.add_pipe("senter") + with pytest.raises(NotImplementedError): + senter.add_label("A") + + +SENT_STARTS = [0] * 14 +SENT_STARTS[0] = 1 +SENT_STARTS[5] = 1 
+SENT_STARTS[9] = 1 + +TRAIN_DATA = [ + ( + "I like green eggs. Eat blue ham. I like purple eggs.", + {"sent_starts": SENT_STARTS}, + ), + ( + "She likes purple eggs. They hate ham. You like yellow eggs.", + {"sent_starts": SENT_STARTS}, + ), +] + + +def test_initialize_examples(): + nlp = Language() + nlp.add_pipe("senter") + train_examples = [] + for t in TRAIN_DATA: + train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1])) + # you shouldn't really call this more than once, but for testing it should be fine + nlp.initialize() + nlp.initialize(get_examples=lambda: train_examples) + with pytest.raises(TypeError): + nlp.initialize(get_examples=lambda: None) + with pytest.raises(TypeError): + nlp.initialize(get_examples=train_examples) + + +def test_overfitting_IO(): + # Simple test to try and quickly overfit the senter - ensuring the ML models work correctly + nlp = English() + train_examples = [] + for t in TRAIN_DATA: + train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1])) + # add some cases where SENT_START == -1 + train_examples[0].reference[10].is_sent_start = False + train_examples[1].reference[1].is_sent_start = False + train_examples[1].reference[11].is_sent_start = False + + nlp.add_pipe("senter") + optimizer = nlp.initialize() + + for i in range(200): + losses = {} + nlp.update(train_examples, sgd=optimizer, losses=losses) + assert losses["senter"] < 0.001 + + # test the trained model + test_text = TRAIN_DATA[0][0] + doc = nlp(test_text) + gold_sent_starts = [0] * 14 + gold_sent_starts[0] = 1 + gold_sent_starts[5] = 1 + gold_sent_starts[9] = 1 + assert [int(t.is_sent_start) for t in doc] == gold_sent_starts + + # Also test the results are still the same after IO + with make_tempdir() as tmp_dir: + nlp.to_disk(tmp_dir) + nlp2 = util.load_model_from_path(tmp_dir) + doc2 = nlp2(test_text) + assert [int(t.is_sent_start) for t in doc2] == gold_sent_starts + + # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions + texts = [ + "Just a sentence.", + "Then one more sentence about London.", + "Here is another one.", + "I like London.", + ] + batch_deps_1 = [doc.to_array([SENT_START]) for doc in nlp.pipe(texts)] + batch_deps_2 = [doc.to_array([SENT_START]) for doc in nlp.pipe(texts)] + no_batch_deps = [ + doc.to_array([SENT_START]) for doc in [nlp(text) for text in texts] + ] + assert_equal(batch_deps_1, batch_deps_2) + assert_equal(batch_deps_1, no_batch_deps) diff --git a/spacy/tests/pipeline/test_tagger.py b/spacy/tests/pipeline/test_tagger.py index 1681ffeaa..885bdbce1 100644 --- a/spacy/tests/pipeline/test_tagger.py +++ b/spacy/tests/pipeline/test_tagger.py @@ -1,27 +1,141 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest +from numpy.testing import assert_equal +from spacy.attrs import TAG + +from spacy import util +from spacy.training import Example +from spacy.lang.en import English from spacy.language import Language -from spacy.symbols import POS, NOUN + +from ..util import make_tempdir def test_label_types(): nlp = Language() - nlp.add_pipe(nlp.create_pipe("tagger")) - nlp.get_pipe("tagger").add_label("A") + tagger = nlp.add_pipe("tagger") + tagger.add_label("A") with pytest.raises(ValueError): - nlp.get_pipe("tagger").add_label(9) + tagger.add_label(9) -def test_tagger_begin_training_tag_map(): - """Test that Tagger.begin_training() without gold tuples does not clobber +def test_tagger_initialize_tag_map(): + """Test that Tagger.initialize() without gold tuples does not clobber the tag 
map.""" nlp = Language() - tagger = nlp.create_pipe("tagger") + tagger = nlp.add_pipe("tagger") orig_tag_count = len(tagger.labels) - tagger.add_label("A", {"POS": "NOUN"}) - nlp.add_pipe(tagger) - nlp.begin_training() - assert nlp.vocab.morphology.tag_map["A"] == {POS: NOUN} + tagger.add_label("A") + nlp.initialize() assert orig_tag_count + 1 == len(nlp.get_pipe("tagger").labels) + + +TAGS = ("N", "V", "J") + +TRAIN_DATA = [ + ("I like green eggs", {"tags": ["N", "V", "J", "N"]}), + ("Eat blue ham", {"tags": ["V", "J", "N"]}), +] + + +def test_no_label(): + nlp = Language() + nlp.add_pipe("tagger") + with pytest.raises(ValueError): + nlp.initialize() + + +def test_no_resize(): + nlp = Language() + tagger = nlp.add_pipe("tagger") + tagger.add_label("N") + tagger.add_label("V") + assert tagger.labels == ("N", "V") + nlp.initialize() + assert tagger.model.get_dim("nO") == 2 + # this throws an error because the tagger can't be resized after initialization + with pytest.raises(ValueError): + tagger.add_label("J") + + +def test_implicit_label(): + nlp = Language() + nlp.add_pipe("tagger") + train_examples = [] + for t in TRAIN_DATA: + train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1])) + nlp.initialize(get_examples=lambda: train_examples) + + +def test_initialize_examples(): + nlp = Language() + tagger = nlp.add_pipe("tagger") + train_examples = [] + for tag in TAGS: + tagger.add_label(tag) + for t in TRAIN_DATA: + train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1])) + # you shouldn't really call this more than once, but for testing it should be fine + nlp.initialize() + nlp.initialize(get_examples=lambda: train_examples) + with pytest.raises(TypeError): + nlp.initialize(get_examples=lambda: None) + with pytest.raises(TypeError): + nlp.initialize(get_examples=lambda: train_examples[0]) + with pytest.raises(TypeError): + nlp.initialize(get_examples=lambda: []) + with pytest.raises(TypeError): + nlp.initialize(get_examples=train_examples) + + +def test_overfitting_IO(): + # Simple test to try and quickly overfit the tagger - ensuring the ML models work correctly + nlp = English() + tagger = nlp.add_pipe("tagger") + train_examples = [] + for t in TRAIN_DATA: + train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1])) + optimizer = nlp.initialize(get_examples=lambda: train_examples) + assert tagger.model.get_dim("nO") == len(TAGS) + + for i in range(50): + losses = {} + nlp.update(train_examples, sgd=optimizer, losses=losses) + assert losses["tagger"] < 0.00001 + + # test the trained model + test_text = "I like blue eggs" + doc = nlp(test_text) + assert doc[0].tag_ is "N" + assert doc[1].tag_ is "V" + assert doc[2].tag_ is "J" + assert doc[3].tag_ is "N" + + # Also test the results are still the same after IO + with make_tempdir() as tmp_dir: + nlp.to_disk(tmp_dir) + nlp2 = util.load_model_from_path(tmp_dir) + doc2 = nlp2(test_text) + assert doc2[0].tag_ is "N" + assert doc2[1].tag_ is "V" + assert doc2[2].tag_ is "J" + assert doc2[3].tag_ is "N" + + # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions + texts = [ + "Just a sentence.", + "I like green eggs.", + "Here is another one.", + "I eat ham.", + ] + batch_deps_1 = [doc.to_array([TAG]) for doc in nlp.pipe(texts)] + batch_deps_2 = [doc.to_array([TAG]) for doc in nlp.pipe(texts)] + no_batch_deps = [doc.to_array([TAG]) for doc in [nlp(text) for text in texts]] + assert_equal(batch_deps_1, batch_deps_2) + assert_equal(batch_deps_1, no_batch_deps) + + +def 
test_tagger_requires_labels(): + nlp = English() + nlp.add_pipe("tagger") + with pytest.raises(ValueError): + nlp.initialize() diff --git a/spacy/tests/pipeline/test_textcat.py b/spacy/tests/pipeline/test_textcat.py index b7db85056..91348b1b3 100644 --- a/spacy/tests/pipeline/test_textcat.py +++ b/spacy/tests/pipeline/test_textcat.py @@ -1,21 +1,43 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest import random import numpy.random +from numpy.testing import assert_equal +from thinc.api import fix_random_seed +from spacy import util +from spacy.lang.en import English from spacy.language import Language from spacy.pipeline import TextCategorizer from spacy.tokens import Doc -from spacy.gold import GoldParse +from spacy.pipeline.tok2vec import DEFAULT_TOK2VEC_MODEL +from spacy.scorer import Scorer +from spacy.training import Example + +from ..util import make_tempdir + + +TRAIN_DATA = [ + ("I'm so happy.", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}), + ("I'm so angry", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}), +] + + +def make_get_examples(nlp): + train_examples = [] + for t in TRAIN_DATA: + train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1])) + + def get_examples(): + return train_examples + + return get_examples @pytest.mark.skip(reason="Test is flakey when run with others") def test_simple_train(): nlp = Language() - nlp.add_pipe(nlp.create_pipe("textcat")) - nlp.get_pipe("textcat").add_label("answer") - nlp.begin_training() + textcat = nlp.add_pipe("textcat") + textcat.add_label("answer") + nlp.initialize() for i in range(5): for text, answer in [ ("aaaa", 1.0), @@ -24,7 +46,7 @@ def test_simple_train(): ("bbbbbbbbb", 0.0), ("aaaaaa", 1), ]: - nlp.update([text], [{"cats": {"answer": answer}}]) + nlp.update((text, {"cats": {"answer": answer}})) doc = nlp("aaa") assert "answer" in doc.cats assert doc.cats["answer"] >= 0.5 @@ -42,21 +64,20 @@ def test_textcat_learns_multilabel(): cats = {letter: float(w2 == letter) for letter in letters} docs.append((Doc(nlp.vocab, words=["d"] * 3 + [w1, w2] + ["d"] * 3), cats)) random.shuffle(docs) - model = TextCategorizer(nlp.vocab, width=8) + textcat = TextCategorizer(nlp.vocab, width=8) for letter in letters: - model.add_label(letter) - optimizer = model.begin_training() + textcat.add_label(letter) + optimizer = textcat.initialize(lambda: []) for i in range(30): losses = {} - Ys = [GoldParse(doc, cats=cats) for doc, cats in docs] - Xs = [doc for doc, cats in docs] - model.update(Xs, Ys, sgd=optimizer, losses=losses) + examples = [Example.from_dict(doc, {"cats": cats}) for doc, cat in docs] + textcat.update(examples, sgd=optimizer, losses=losses) random.shuffle(docs) for w1 in letters: for w2 in letters: doc = Doc(nlp.vocab, words=["d"] * 3 + [w1, w2] + ["d"] * 3) truth = {letter: w2 == letter for letter in letters} - model(doc) + textcat(doc) for cat, score in doc.cats.items(): if not truth[cat]: assert score < 0.5 @@ -66,7 +87,186 @@ def test_textcat_learns_multilabel(): def test_label_types(): nlp = Language() - nlp.add_pipe(nlp.create_pipe("textcat")) - nlp.get_pipe("textcat").add_label("answer") + textcat = nlp.add_pipe("textcat") + textcat.add_label("answer") with pytest.raises(ValueError): - nlp.get_pipe("textcat").add_label(9) + textcat.add_label(9) + + +def test_no_label(): + nlp = Language() + nlp.add_pipe("textcat") + with pytest.raises(ValueError): + nlp.initialize() + + +def test_implicit_label(): + nlp = Language() + nlp.add_pipe("textcat") + nlp.initialize(get_examples=make_get_examples(nlp)) 
+ + +def test_no_resize(): + nlp = Language() + textcat = nlp.add_pipe("textcat") + textcat.add_label("POSITIVE") + textcat.add_label("NEGATIVE") + nlp.initialize() + assert textcat.model.get_dim("nO") == 2 + # this throws an error because the textcat can't be resized after initialization + with pytest.raises(ValueError): + textcat.add_label("NEUTRAL") + + +def test_initialize_examples(): + nlp = Language() + textcat = nlp.add_pipe("textcat") + for text, annotations in TRAIN_DATA: + for label, value in annotations.get("cats").items(): + textcat.add_label(label) + # you shouldn't really call this more than once, but for testing it should be fine + nlp.initialize() + get_examples = make_get_examples(nlp) + nlp.initialize(get_examples=get_examples) + with pytest.raises(TypeError): + nlp.initialize(get_examples=lambda: None) + with pytest.raises(TypeError): + nlp.initialize(get_examples=get_examples()) + + +def test_overfitting_IO(): + # Simple test to try and quickly overfit the textcat component - ensuring the ML models work correctly + fix_random_seed(0) + nlp = English() + nlp.config["initialize"]["components"]["textcat"] = {"positive_label": "POSITIVE"} + # Set exclusive labels + config = {"model": {"exclusive_classes": True}} + textcat = nlp.add_pipe("textcat", config=config) + train_examples = [] + for text, annotations in TRAIN_DATA: + train_examples.append(Example.from_dict(nlp.make_doc(text), annotations)) + optimizer = nlp.initialize(get_examples=lambda: train_examples) + assert textcat.model.get_dim("nO") == 2 + + for i in range(50): + losses = {} + nlp.update(train_examples, sgd=optimizer, losses=losses) + assert losses["textcat"] < 0.01 + + # test the trained model + test_text = "I am happy." + doc = nlp(test_text) + cats = doc.cats + assert cats["POSITIVE"] > 0.9 + assert cats["POSITIVE"] + cats["NEGATIVE"] == pytest.approx(1.0, 0.001) + + # Also test the results are still the same after IO + with make_tempdir() as tmp_dir: + nlp.to_disk(tmp_dir) + nlp2 = util.load_model_from_path(tmp_dir) + doc2 = nlp2(test_text) + cats2 = doc2.cats + assert cats2["POSITIVE"] > 0.9 + assert cats2["POSITIVE"] + cats2["NEGATIVE"] == pytest.approx(1.0, 0.001) + + # Test scoring + scores = nlp.evaluate(train_examples) + assert scores["cats_micro_f"] == 1.0 + assert scores["cats_score"] == 1.0 + assert "cats_score_desc" in scores + + # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions + texts = ["Just a sentence.", "I like green eggs.", "I am happy.", "I eat ham."] + batch_deps_1 = [doc.cats for doc in nlp.pipe(texts)] + batch_deps_2 = [doc.cats for doc in nlp.pipe(texts)] + no_batch_deps = [doc.cats for doc in [nlp(text) for text in texts]] + assert_equal(batch_deps_1, batch_deps_2) + assert_equal(batch_deps_1, no_batch_deps) + + +# fmt: off +@pytest.mark.parametrize( + "textcat_config", + [ + {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "ngram_size": 1, "no_output_layer": False}, + {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "ngram_size": 4, "no_output_layer": False}, + {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": False, "ngram_size": 3, "no_output_layer": True}, + {"@architectures": "spacy.TextCatBOW.v1", "exclusive_classes": True, "ngram_size": 2, "no_output_layer": True}, + {"@architectures": "spacy.TextCatEnsemble.v1", "exclusive_classes": False, "ngram_size": 1, "pretrained_vectors": False, "width": 64, "conv_depth": 2, "embed_size": 2000, "window_size": 2, "dropout": None}, + 
{"@architectures": "spacy.TextCatEnsemble.v1", "exclusive_classes": True, "ngram_size": 5, "pretrained_vectors": False, "width": 128, "conv_depth": 2, "embed_size": 2000, "window_size": 1, "dropout": None}, + {"@architectures": "spacy.TextCatEnsemble.v1", "exclusive_classes": True, "ngram_size": 2, "pretrained_vectors": False, "width": 32, "conv_depth": 3, "embed_size": 500, "window_size": 3, "dropout": None}, + {"@architectures": "spacy.TextCatCNN.v1", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": True}, + {"@architectures": "spacy.TextCatCNN.v1", "tok2vec": DEFAULT_TOK2VEC_MODEL, "exclusive_classes": False}, + ], +) +# fmt: on +def test_textcat_configs(textcat_config): + pipe_config = {"model": textcat_config} + nlp = English() + textcat = nlp.add_pipe("textcat", config=pipe_config) + train_examples = [] + for text, annotations in TRAIN_DATA: + train_examples.append(Example.from_dict(nlp.make_doc(text), annotations)) + for label, value in annotations.get("cats").items(): + textcat.add_label(label) + optimizer = nlp.initialize() + for i in range(5): + losses = {} + nlp.update(train_examples, sgd=optimizer, losses=losses) + + +def test_positive_class(): + nlp = English() + textcat = nlp.add_pipe("textcat") + get_examples = make_get_examples(nlp) + textcat.initialize(get_examples, labels=["POS", "NEG"], positive_label="POS") + assert textcat.labels == ("POS", "NEG") + + +def test_positive_class_not_present(): + nlp = English() + textcat = nlp.add_pipe("textcat") + get_examples = make_get_examples(nlp) + with pytest.raises(ValueError): + textcat.initialize(get_examples, labels=["SOME", "THING"], positive_label="POS") + + +def test_positive_class_not_binary(): + nlp = English() + textcat = nlp.add_pipe("textcat") + get_examples = make_get_examples(nlp) + with pytest.raises(ValueError): + textcat.initialize( + get_examples, labels=["SOME", "THING", "POS"], positive_label="POS" + ) + + +def test_textcat_evaluation(): + train_examples = [] + nlp = English() + ref1 = nlp("one") + ref1.cats = {"winter": 1.0, "summer": 1.0, "spring": 1.0, "autumn": 1.0} + pred1 = nlp("one") + pred1.cats = {"winter": 1.0, "summer": 0.0, "spring": 1.0, "autumn": 1.0} + train_examples.append(Example(pred1, ref1)) + + ref2 = nlp("two") + ref2.cats = {"winter": 0.0, "summer": 0.0, "spring": 1.0, "autumn": 1.0} + pred2 = nlp("two") + pred2.cats = {"winter": 1.0, "summer": 0.0, "spring": 0.0, "autumn": 1.0} + train_examples.append(Example(pred2, ref2)) + + scores = Scorer().score_cats( + train_examples, "cats", labels=["winter", "summer", "spring", "autumn"] + ) + assert scores["cats_f_per_type"]["winter"]["p"] == 1 / 2 + assert scores["cats_f_per_type"]["winter"]["r"] == 1 / 1 + assert scores["cats_f_per_type"]["summer"]["p"] == 0 + assert scores["cats_f_per_type"]["summer"]["r"] == 0 / 1 + assert scores["cats_f_per_type"]["spring"]["p"] == 1 / 1 + assert scores["cats_f_per_type"]["spring"]["r"] == 1 / 2 + assert scores["cats_f_per_type"]["autumn"]["p"] == 2 / 2 + assert scores["cats_f_per_type"]["autumn"]["r"] == 2 / 2 + + assert scores["cats_micro_p"] == 4 / 5 + assert scores["cats_micro_r"] == 4 / 6 diff --git a/spacy/tests/pipeline/test_tok2vec.py b/spacy/tests/pipeline/test_tok2vec.py new file mode 100644 index 000000000..ec4ed17dd --- /dev/null +++ b/spacy/tests/pipeline/test_tok2vec.py @@ -0,0 +1,189 @@ +import pytest + +from spacy.ml.models.tok2vec import build_Tok2Vec_model +from spacy.ml.models.tok2vec import MultiHashEmbed, CharacterEmbed +from spacy.ml.models.tok2vec import MishWindowEncoder, 
MaxoutWindowEncoder +from spacy.pipeline.tok2vec import Tok2Vec, Tok2VecListener +from spacy.vocab import Vocab +from spacy.tokens import Doc +from spacy.training import Example +from spacy import util +from spacy.lang.en import English +from ..util import get_batch + +from thinc.api import Config + +from numpy.testing import assert_equal + + +def test_empty_doc(): + width = 128 + embed_size = 2000 + vocab = Vocab() + doc = Doc(vocab, words=[]) + tok2vec = build_Tok2Vec_model( + MultiHashEmbed( + width=width, + rows=[embed_size, embed_size, embed_size, embed_size], + include_static_vectors=False, + attrs=["NORM", "PREFIX", "SUFFIX", "SHAPE"], + ), + MaxoutWindowEncoder(width=width, depth=4, window_size=1, maxout_pieces=3), + ) + tok2vec.initialize() + vectors, backprop = tok2vec.begin_update([doc]) + assert len(vectors) == 1 + assert vectors[0].shape == (0, width) + + +@pytest.mark.parametrize( + "batch_size,width,embed_size", [[1, 128, 2000], [2, 128, 2000], [3, 8, 63]] +) +def test_tok2vec_batch_sizes(batch_size, width, embed_size): + batch = get_batch(batch_size) + tok2vec = build_Tok2Vec_model( + MultiHashEmbed( + width=width, + rows=[embed_size] * 4, + include_static_vectors=False, + attrs=["NORM", "PREFIX", "SUFFIX", "SHAPE"], + ), + MaxoutWindowEncoder(width=width, depth=4, window_size=1, maxout_pieces=3), + ) + tok2vec.initialize() + vectors, backprop = tok2vec.begin_update(batch) + assert len(vectors) == len(batch) + for doc_vec, doc in zip(vectors, batch): + assert doc_vec.shape == (len(doc), width) + + +# fmt: off +@pytest.mark.parametrize( + "width,embed_arch,embed_config,encode_arch,encode_config", + [ + (8, MultiHashEmbed, {"rows": [100, 100], "attrs": ["SHAPE", "LOWER"], "include_static_vectors": False}, MaxoutWindowEncoder, {"window_size": 1, "maxout_pieces": 3, "depth": 2}), + (8, MultiHashEmbed, {"rows": [100, 20], "attrs": ["ORTH", "PREFIX"], "include_static_vectors": False}, MishWindowEncoder, {"window_size": 1, "depth": 6}), + (8, CharacterEmbed, {"rows": 100, "nM": 64, "nC": 8, "include_static_vectors": False}, MaxoutWindowEncoder, {"window_size": 1, "maxout_pieces": 3, "depth": 3}), + (8, CharacterEmbed, {"rows": 100, "nM": 16, "nC": 2, "include_static_vectors": False}, MishWindowEncoder, {"window_size": 1, "depth": 3}), + ], +) +# fmt: on +def test_tok2vec_configs(width, embed_arch, embed_config, encode_arch, encode_config): + embed_config["width"] = width + encode_config["width"] = width + docs = get_batch(3) + tok2vec = build_Tok2Vec_model( + embed_arch(**embed_config), encode_arch(**encode_config) + ) + tok2vec.initialize(docs) + vectors, backprop = tok2vec.begin_update(docs) + assert len(vectors) == len(docs) + assert vectors[0].shape == (len(docs[0]), width) + backprop(vectors) + + +def test_init_tok2vec(): + # Simple test to initialize the default tok2vec + nlp = English() + tok2vec = nlp.add_pipe("tok2vec") + assert tok2vec.listeners == [] + nlp.initialize() + assert tok2vec.model.get_dim("nO") + + +cfg_string = """ + [nlp] + lang = "en" + pipeline = ["tok2vec","tagger"] + + [components] + + [components.tagger] + factory = "tagger" + + [components.tagger.model] + @architectures = "spacy.Tagger.v1" + nO = null + + [components.tagger.model.tok2vec] + @architectures = "spacy.Tok2VecListener.v1" + width = ${components.tok2vec.model.encode.width} + + [components.tok2vec] + factory = "tok2vec" + + [components.tok2vec.model] + @architectures = "spacy.Tok2Vec.v1" + + [components.tok2vec.model.embed] + @architectures = "spacy.MultiHashEmbed.v1" + width = 
${components.tok2vec.model.encode.width} + rows = [2000, 1000, 1000, 1000] + attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"] + include_static_vectors = false + + [components.tok2vec.model.encode] + @architectures = "spacy.MaxoutWindowEncoder.v1" + width = 96 + depth = 4 + window_size = 1 + maxout_pieces = 3 + """ + +TRAIN_DATA = [ + ("I like green eggs", {"tags": ["N", "V", "J", "N"]}), + ("Eat blue ham", {"tags": ["V", "J", "N"]}), +] + + +def test_tok2vec_listener(): + orig_config = Config().from_str(cfg_string) + nlp = util.load_model_from_config(orig_config, auto_fill=True, validate=True) + assert nlp.pipe_names == ["tok2vec", "tagger"] + tagger = nlp.get_pipe("tagger") + tok2vec = nlp.get_pipe("tok2vec") + tagger_tok2vec = tagger.model.get_ref("tok2vec") + assert isinstance(tok2vec, Tok2Vec) + assert isinstance(tagger_tok2vec, Tok2VecListener) + train_examples = [] + for t in TRAIN_DATA: + train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1])) + for tag in t[1]["tags"]: + tagger.add_label(tag) + + # Check that the Tok2Vec component finds it listeners + assert tok2vec.listeners == [] + optimizer = nlp.initialize(lambda: train_examples) + assert tok2vec.listeners == [tagger_tok2vec] + + for i in range(5): + losses = {} + nlp.update(train_examples, sgd=optimizer, losses=losses) + + doc = nlp("Running the pipeline as a whole.") + doc_tensor = tagger_tok2vec.predict([doc])[0] + assert_equal(doc.tensor, doc_tensor) + + # TODO: should this warn or error? + nlp.select_pipes(disable="tok2vec") + assert nlp.pipe_names == ["tagger"] + nlp("Running the pipeline with the Tok2Vec component disabled.") + + +def test_tok2vec_listener_callback(): + orig_config = Config().from_str(cfg_string) + nlp = util.load_model_from_config(orig_config, auto_fill=True, validate=True) + assert nlp.pipe_names == ["tok2vec", "tagger"] + tagger = nlp.get_pipe("tagger") + tok2vec = nlp.get_pipe("tok2vec") + nlp._link_components() + docs = [nlp.make_doc("A random sentence")] + tok2vec.model.initialize(X=docs) + gold_array = [[1.0 for tag in ["V", "Z"]] for word in docs] + label_sample = [tagger.model.ops.asarray(gold_array, dtype="float32")] + tagger.model.initialize(X=docs, Y=label_sample) + docs = [nlp.make_doc("Another entirely random sentence")] + tok2vec.update([Example.from_dict(x, {}) for x in docs]) + Y, get_dX = tagger.model.begin_update(docs) + # assure that the backprop call works (and doesn't hit a 'None' callback) + assert get_dX(Y) is not None diff --git a/spacy/tests/regression/test_issue1-1000.py b/spacy/tests/regression/test_issue1-1000.py index 38a99371e..6bb71f6f4 100644 --- a/spacy/tests/regression/test_issue1-1000.py +++ b/spacy/tests/regression/test_issue1-1000.py @@ -1,19 +1,15 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest import random +from spacy import util +from spacy.training import Example from spacy.matcher import Matcher from spacy.attrs import IS_PUNCT, ORTH, LOWER -from spacy.symbols import POS, VERB, VerbForm_inf from spacy.vocab import Vocab -from spacy.language import Language -from spacy.lemmatizer import Lemmatizer +from spacy.lang.en import English from spacy.lookups import Lookups from spacy.tokens import Doc, Span -from spacy.lang.en import EnglishDefaults -from ..util import get_doc, make_tempdir +from ..util import make_tempdir @pytest.mark.parametrize( @@ -92,13 +88,9 @@ def test_issue242(en_tokenizer): doc.ents += tuple(matches) -def test_issue309(en_tokenizer): +def test_issue309(en_vocab): """Test Issue #309: SBD fails on empty 
string""" - tokens = en_tokenizer(" ") - doc = get_doc( - tokens.vocab, words=[t.text for t in tokens], heads=[0], deps=["ROOT"] - ) - doc.is_parsed = True + doc = Doc(en_vocab, words=[" "], heads=[0], deps=["ROOT"]) assert len(doc) == 1 sents = list(doc.sents) assert len(sents) == 1 @@ -145,14 +137,6 @@ def test_issue588(en_vocab): matcher.add("TEST", [[]]) -@pytest.mark.xfail -def test_issue589(): - vocab = Vocab() - vocab.strings.set_frozen(True) - doc = Doc(vocab, words=["whata"]) - assert doc - - def test_issue590(en_vocab): """Test overlapping matches""" doc = Doc(en_vocab, words=["n", "=", "1", ";", "a", ":", "5", "%"]) @@ -165,16 +149,15 @@ def test_issue590(en_vocab): assert len(matches) == 2 +@pytest.mark.skip(reason="Old vocab-based lemmatization") def test_issue595(): """Test lemmatization of base forms""" words = ["Do", "n't", "feed", "the", "dog"] - tag_map = {"VB": {POS: VERB, VerbForm_inf: True}} lookups = Lookups() lookups.add_table("lemma_rules", {"verb": [["ed", "e"]]}) lookups.add_table("lemma_index", {"verb": {}}) lookups.add_table("lemma_exc", {"verb": {}}) - lemmatizer = Lemmatizer(lookups, is_base_form=EnglishDefaults.is_base_form) - vocab = Vocab(lemmatizer=lemmatizer, tag_map=tag_map) + vocab = Vocab() doc = Doc(vocab, words=words) doc[2].tag_ = "VB" assert doc[2].text == "feed" @@ -183,11 +166,9 @@ def test_issue595(): def test_issue599(en_vocab): doc = Doc(en_vocab) - doc.is_tagged = True - doc.is_parsed = True doc2 = Doc(doc.vocab) doc2.from_bytes(doc.to_bytes()) - assert doc2.is_parsed + assert doc2.has_annotation("DEP") def test_issue600(): @@ -289,7 +270,9 @@ def test_control_issue792(en_tokenizer, text): assert "".join([token.text_with_ws for token in doc]) == text -@pytest.mark.xfail +@pytest.mark.skip( + reason="Can not be fixed unless with variable-width lookbehinds, cf. PR #3218" +) @pytest.mark.parametrize( "text,tokens", [ @@ -395,6 +378,7 @@ def test_issue891(en_tokenizer, text): assert tokens[1].text == "/" +@pytest.mark.skip(reason="Old vocab-based lemmatization") @pytest.mark.parametrize( "text,tag,lemma", [("anus", "NN", "anus"), ("princess", "NN", "princess"), ("inner", "JJ", "inner")], @@ -421,8 +405,7 @@ def test_issue957(en_tokenizer): assert doc -@pytest.mark.xfail -def test_issue999(train_data): +def test_issue999(): """Test that adding entities and resuming training works passably OK. There are two issues here: 1) We have to re-add labels. This isn't very nice. 
@@ -436,27 +419,27 @@ def test_issue999(train_data): ["hello", []], ["hi", []], ["i'm looking for a place to eat", []], - ["i'm looking for a place in the north of town", [[31, 36, "LOCATION"]]], - ["show me chinese restaurants", [[8, 15, "CUISINE"]]], - ["show me chines restaurants", [[8, 14, "CUISINE"]]], + ["i'm looking for a place in the north of town", [(31, 36, "LOCATION")]], + ["show me chinese restaurants", [(8, 15, "CUISINE")]], + ["show me chines restaurants", [(8, 14, "CUISINE")]], ] - - nlp = Language() - ner = nlp.create_pipe("ner") - nlp.add_pipe(ner) + nlp = English() + ner = nlp.add_pipe("ner") for _, offsets in TRAIN_DATA: for start, end, label in offsets: ner.add_label(label) - nlp.begin_training() - ner.model.learn_rate = 0.001 - for itn in range(100): + nlp.initialize() + for itn in range(20): random.shuffle(TRAIN_DATA) for raw_text, entity_offsets in TRAIN_DATA: - nlp.update([raw_text], [{"entities": entity_offsets}]) + example = Example.from_dict( + nlp.make_doc(raw_text), {"entities": entity_offsets} + ) + nlp.update([example]) with make_tempdir() as model_dir: nlp.to_disk(model_dir) - nlp2 = Language().from_disk(model_dir) + nlp2 = util.load_model_from_path(model_dir) for raw_text, entity_offsets in TRAIN_DATA: doc = nlp2(raw_text) @@ -465,6 +448,6 @@ def test_issue999(train_data): if (start, end) in ents: assert ents[(start, end)] == label break - else: - if entity_offsets: - raise Exception(ents) + else: + if entity_offsets: + raise Exception(ents) diff --git a/spacy/tests/regression/test_issue1001-1500.py b/spacy/tests/regression/test_issue1001-1500.py index 924c5aa3e..d6a4600e3 100644 --- a/spacy/tests/regression/test_issue1001-1500.py +++ b/spacy/tests/regression/test_issue1001-1500.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest import re from spacy.tokens import Doc @@ -9,15 +6,13 @@ from spacy.lang.en import English from spacy.lang.lex_attrs import LEX_ATTRS from spacy.matcher import Matcher from spacy.tokenizer import Tokenizer -from spacy.lemmatizer import Lemmatizer -from spacy.lookups import Lookups -from spacy.symbols import ORTH, LEMMA, POS, VERB, VerbForm_part +from spacy.symbols import ORTH, LEMMA, POS def test_issue1061(): """Test special-case works after tokenizing. Was caching problem.""" text = "I like _MATH_ even _MATH_ when _MATH_, except when _MATH_ is _MATH_! but not _MATH_." - tokenizer = English.Defaults.create_tokenizer() + tokenizer = English().tokenizer doc = tokenizer(text) assert "MATH" in [w.text for w in doc] assert "_MATH_" not in [w.text for w in doc] @@ -28,15 +23,15 @@ def test_issue1061(): assert "MATH" not in [w.text for w in doc] # For sanity, check it works when pipeline is clean. 
- tokenizer = English.Defaults.create_tokenizer() + tokenizer = English().tokenizer tokenizer.add_special_case("_MATH_", [{ORTH: "_MATH_"}]) doc = tokenizer(text) assert "_MATH_" in [w.text for w in doc] assert "MATH" not in [w.text for w in doc] -@pytest.mark.xfail( - reason="g is split of as a unit, as the suffix regular expression can not look back further (variable-width)" +@pytest.mark.skip( + reason="Can not be fixed without variable-width look-behind (which we don't want)" ) def test_issue1235(): """Test that g is not split of if preceded by a number and a letter""" @@ -60,6 +55,7 @@ def test_issue1242(): assert len(docs[1]) == 1 +@pytest.mark.skip(reason="v3 no longer supports LEMMA/POS in tokenizer special cases") def test_issue1250(): """Test cached special cases.""" special_case = [{ORTH: "reimbur", LEMMA: "reimburse", POS: "VERB"}] @@ -90,20 +86,6 @@ def test_issue1375(): assert doc[1].nbor(1).text == "2" -def test_issue1387(): - tag_map = {"VBG": {POS: VERB, VerbForm_part: True}} - lookups = Lookups() - lookups.add_table("lemma_index", {"verb": ("cope", "cop")}) - lookups.add_table("lemma_exc", {"verb": {"coping": ("cope",)}}) - lookups.add_table("lemma_rules", {"verb": [["ing", ""]]}) - lemmatizer = Lemmatizer(lookups) - vocab = Vocab(lemmatizer=lemmatizer, tag_map=tag_map) - doc = Doc(vocab, words=["coping"]) - doc[0].tag_ = "VBG" - assert doc[0].text == "coping" - assert doc[0].lemma_ == "cope" - - def test_issue1434(): """Test matches occur when optional element at end of short doc.""" pattern = [{"ORTH": "Hello"}, {"IS_ALPHA": True, "OP": "?"}] diff --git a/spacy/tests/regression/test_issue1501-2000.py b/spacy/tests/regression/test_issue1501-2000.py index e498417d1..f85ec70e1 100644 --- a/spacy/tests/regression/test_issue1501-2000.py +++ b/spacy/tests/regression/test_issue1501-2000.py @@ -1,10 +1,9 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest import gc import numpy import copy + +from spacy.training import Example from spacy.lang.en import English from spacy.lang.en.stop_words import STOP_WORDS from spacy.lang.lex_attrs import is_stop @@ -12,7 +11,6 @@ from spacy.vectors import Vectors from spacy.vocab import Vocab from spacy.language import Language from spacy.tokens import Doc, Span, Token -from spacy.pipeline import Tagger, EntityRecognizer from spacy.attrs import HEAD, DEP from spacy.matcher import Matcher @@ -100,15 +98,20 @@ def test_issue1612(en_tokenizer): def test_issue1654(): nlp = Language(Vocab()) assert not nlp.pipeline - nlp.add_pipe(lambda doc: doc, name="1") - nlp.add_pipe(lambda doc: doc, name="2", after="1") - nlp.add_pipe(lambda doc: doc, name="3", after="2") + + @Language.component("component") + def component(doc): + return doc + + nlp.add_pipe("component", name="1") + nlp.add_pipe("component", name="2", after="1") + nlp.add_pipe("component", name="3", after="2") assert nlp.pipe_names == ["1", "2", "3"] nlp2 = Language(Vocab()) assert not nlp2.pipeline - nlp2.add_pipe(lambda doc: doc, name="3") - nlp2.add_pipe(lambda doc: doc, name="2", before="3") - nlp2.add_pipe(lambda doc: doc, name="1", before="2") + nlp2.add_pipe("component", name="3") + nlp2.add_pipe("component", name="2", before="3") + nlp2.add_pipe("component", name="1", before="2") assert nlp2.pipe_names == ["1", "2", "3"] @@ -122,17 +125,16 @@ def test_issue1698(en_tokenizer, text): def test_issue1727(): """Test that models with no pretrained vectors can be deserialized correctly after vectors are added.""" + nlp = Language(Vocab()) data = numpy.ones((3, 300), 
dtype="f") vectors = Vectors(data=data, keys=["I", "am", "Matt"]) - tagger = Tagger(Vocab()) + tagger = nlp.create_pipe("tagger") tagger.add_label("PRP") - with pytest.warns(UserWarning): - tagger.begin_training() assert tagger.cfg.get("pretrained_dims", 0) == 0 tagger.vocab.vectors = vectors with make_tempdir() as path: tagger.to_disk(path) - tagger = Tagger(Vocab()).from_disk(path) + tagger = nlp.create_pipe("tagger").from_disk(path) assert tagger.cfg.get("pretrained_dims", 0) == 0 @@ -153,8 +155,6 @@ def test_issue1758(en_tokenizer): """Test that "would've" is handled by the English tokenizer exceptions.""" tokens = en_tokenizer("would've") assert len(tokens) == 2 - assert tokens[0].tag_ == "MD" - assert tokens[1].lemma_ == "have" def test_issue1773(en_tokenizer): @@ -197,18 +197,24 @@ def test_issue1807(): def test_issue1834(): """Test that sentence boundaries & parse/tag flags are not lost during serialization.""" - string = "This is a first sentence . And another one" - doc = Doc(Vocab(), words=string.split()) - doc[6].sent_start = True + words = ["This", "is", "a", "first", "sentence", ".", "And", "another", "one"] + doc = Doc(Vocab(), words=words) + doc[6].is_sent_start = True new_doc = Doc(doc.vocab).from_bytes(doc.to_bytes()) assert new_doc[6].sent_start - assert not new_doc.is_parsed - assert not new_doc.is_tagged - doc.is_parsed = True - doc.is_tagged = True + assert not new_doc.has_annotation("DEP") + assert not new_doc.has_annotation("TAG") + doc = Doc( + Vocab(), + words=words, + tags=["TAG"] * len(words), + heads=[0, 0, 0, 0, 0, 0, 6, 6, 6], + deps=["dep"] * len(words), + ) new_doc = Doc(doc.vocab).from_bytes(doc.to_bytes()) - assert new_doc.is_parsed - assert new_doc.is_tagged + assert new_doc[6].sent_start + assert new_doc.has_annotation("DEP") + assert new_doc.has_annotation("TAG") def test_issue1868(): @@ -237,13 +243,14 @@ def test_issue1889(word): assert is_stop(word, STOP_WORDS) == is_stop(word.upper(), STOP_WORDS) +@pytest.mark.skip(reason="obsolete with the config refactor of v.3") def test_issue1915(): cfg = {"hidden_depth": 2} # should error out nlp = Language() - nlp.add_pipe(nlp.create_pipe("ner")) - nlp.get_pipe("ner").add_label("answer") + ner = nlp.add_pipe("ner") + ner.add_label("answer") with pytest.raises(ValueError): - nlp.begin_training(**cfg) + nlp.initialize(**cfg) def test_issue1945(): @@ -269,10 +276,21 @@ def test_issue1963(en_tokenizer): @pytest.mark.parametrize("label", ["U-JOB-NAME"]) def test_issue1967(label): - ner = EntityRecognizer(Vocab()) - entry = ([0], ["word"], ["tag"], [0], ["dep"], [label]) - gold_parses = [(None, [(entry, None)])] - ner.moves.get_actions(gold_parses=gold_parses) + nlp = Language() + config = {} + ner = nlp.create_pipe("ner", config=config) + example = Example.from_dict( + Doc(ner.vocab, words=["word"]), + { + "ids": [0], + "words": ["word"], + "tags": ["tag"], + "heads": [0], + "deps": ["dep"], + "entities": [label], + }, + ) + assert "JOB-NAME" in ner.moves.get_actions(examples=[example])[1] def test_issue1971(en_vocab): diff --git a/spacy/tests/regression/test_issue2001-2500.py b/spacy/tests/regression/test_issue2001-2500.py index 01f0f905c..09baab4d8 100644 --- a/spacy/tests/regression/test_issue2001-2500.py +++ b/spacy/tests/regression/test_issue2001-2500.py @@ -1,19 +1,18 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest import numpy from spacy.tokens import Doc from spacy.matcher import Matcher from spacy.displacy import render -from spacy.gold import iob_to_biluo +from spacy.training 
import iob_to_biluo from spacy.lang.it import Italian from spacy.lang.en import English -from ..util import add_vecs_to_vocab, get_doc +from ..util import add_vecs_to_vocab -@pytest.mark.xfail +@pytest.mark.skip( + reason="Can not be fixed without iterative looping between prefix/suffix and infix" +) def test_issue2070(): """Test that checks that a dot followed by a quote is handled appropriately. @@ -29,12 +28,14 @@ def test_issue2070(): def test_issue2179(): """Test that spurious 'extra_labels' aren't created when initializing NER.""" nlp = Italian() - ner = nlp.create_pipe("ner") + ner = nlp.add_pipe("ner") ner.add_label("CITIZENSHIP") - nlp.add_pipe(ner) - nlp.begin_training() + nlp.initialize() nlp2 = Italian() - nlp2.add_pipe(nlp2.create_pipe("ner")) + nlp2.add_pipe("ner") + assert len(nlp2.get_pipe("ner").labels) == 0 + model = nlp2.get_pipe("ner").model + model.attrs["resize_output"](model, nlp.get_pipe("ner").moves.n_moves) nlp2.from_bytes(nlp.to_bytes()) assert "extra_labels" not in nlp2.get_pipe("ner").cfg assert nlp2.get_pipe("ner").labels == ("CITIZENSHIP",) @@ -68,11 +69,10 @@ def test_issue2219(en_vocab): assert doc[0].similarity(doc[1]) == doc[1].similarity(doc[0]) -def test_issue2361(de_tokenizer): +def test_issue2361(de_vocab): chars = ("<", ">", "&", """) - doc = de_tokenizer('< > & " ') - doc.is_parsed = True - doc.is_tagged = True + words = ["<", ">", "&", '"'] + doc = Doc(de_vocab, words=words, deps=["dep"] * len(words)) html = render(doc) for char in chars: assert char in html @@ -106,7 +106,8 @@ def test_issue2385_biluo(tags): def test_issue2396(en_vocab): words = ["She", "created", "a", "test", "for", "spacy"] - heads = [1, 0, 1, -2, -1, -1] + heads = [1, 1, 3, 1, 3, 4] + deps = ["dep"] * len(heads) matrix = numpy.array( [ [0, 1, 1, 1, 1, 1], @@ -118,7 +119,7 @@ def test_issue2396(en_vocab): ], dtype=numpy.int32, ) - doc = get_doc(en_vocab, words=words, heads=heads) + doc = Doc(en_vocab, words=words, heads=heads, deps=deps) span = doc[:] assert (doc.get_lca_matrix() == matrix).all() assert (span.get_lca_matrix() == matrix).all() @@ -136,6 +137,6 @@ def test_issue2464(en_vocab): def test_issue2482(): """Test we can serialize and deserialize a blank NER or parser model.""" nlp = Italian() - nlp.add_pipe(nlp.create_pipe("ner")) + nlp.add_pipe("ner") b = nlp.to_bytes() Italian().from_bytes(b) diff --git a/spacy/tests/regression/test_issue2501-3000.py b/spacy/tests/regression/test_issue2501-3000.py index 622fc3635..4952a545d 100644 --- a/spacy/tests/regression/test_issue2501-3000.py +++ b/spacy/tests/regression/test_issue2501-3000.py @@ -1,8 +1,6 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest from spacy import displacy +from spacy.training import Example from spacy.lang.en import English from spacy.lang.ja import Japanese from spacy.lang.xx import MultiLanguage @@ -11,25 +9,21 @@ from spacy.matcher import Matcher from spacy.tokens import Doc, Span from spacy.vocab import Vocab from spacy.compat import pickle -from spacy._ml import link_vectors_to_models import numpy import random -from ..util import get_doc - def test_issue2564(): - """Test the tagger sets is_tagged correctly when used via Language.pipe.""" + """Test the tagger sets has_annotation("TAG") correctly when used via Language.pipe.""" nlp = Language() - tagger = nlp.create_pipe("tagger") - with pytest.warns(UserWarning): - tagger.begin_training() # initialise weights - nlp.add_pipe(tagger) + tagger = nlp.add_pipe("tagger") + tagger.add_label("A") + nlp.initialize() doc = 
nlp("hello world") - assert doc.is_tagged + assert doc.has_annotation("TAG") docs = nlp.pipe(["hello", "world"]) piped_doc = next(docs) - assert piped_doc.is_tagged + assert piped_doc.has_annotation("TAG") def test_issue2569(en_tokenizer): @@ -121,13 +115,15 @@ def test_issue2754(en_tokenizer): def test_issue2772(en_vocab): """Test that deprojectivization doesn't mess up sentence boundaries.""" - words = "When we write or communicate virtually , we can hide our true feelings .".split() + # fmt: off + words = ["When", "we", "write", "or", "communicate", "virtually", ",", "we", "can", "hide", "our", "true", "feelings", "."] + # fmt: on # A tree with a non-projective (i.e. crossing) arc # The arcs (0, 4) and (2, 9) cross. - heads = [4, 1, 7, -1, -2, -1, 3, 2, 1, 0, 2, 1, -3, -4] + heads = [4, 2, 9, 2, 2, 4, 9, 9, 9, 9, 12, 12, 9, 9] deps = ["dep"] * len(heads) - doc = get_doc(en_vocab, words=words, heads=heads, deps=deps) - assert doc[1].is_sent_start is None + doc = Doc(en_vocab, words=words, heads=heads, deps=deps) + assert doc[1].is_sent_start is False @pytest.mark.parametrize("text", ["-0.23", "+123,456", "±1"]) @@ -144,20 +140,21 @@ def test_issue2800(): """Test issue that arises when too many labels are added to NER model. Used to cause segfault. """ - train_data = [] - train_data.extend([("One sentence", {"entities": []})]) - entity_types = [str(i) for i in range(1000)] nlp = English() - ner = nlp.create_pipe("ner") - nlp.add_pipe(ner) + train_data = [] + train_data.extend( + [Example.from_dict(nlp.make_doc("One sentence"), {"entities": []})] + ) + entity_types = [str(i) for i in range(1000)] + ner = nlp.add_pipe("ner") for entity_type in list(entity_types): ner.add_label(entity_type) - optimizer = nlp.begin_training() + optimizer = nlp.initialize() for i in range(20): losses = {} random.shuffle(train_data) - for statement, entities in train_data: - nlp.update([statement], [entities], sgd=optimizer, losses=losses, drop=0.5) + for example in train_data: + nlp.update([example], sgd=optimizer, losses=losses, drop=0.5) def test_issue2822(it_tokenizer): @@ -167,7 +164,6 @@ def test_issue2822(it_tokenizer): assert doc[0].text == "Vuoi" assert doc[1].text == "un" assert doc[2].text == "po'" - assert doc[2].lemma_ == "poco" assert doc[3].text == "di" assert doc[4].text == "zucchero" assert doc[5].text == "?" 
@@ -192,7 +188,6 @@ def test_issue2871(): _ = vocab[word] # noqa: F841 vocab.set_vector(word, vector_data[0]) vocab.vectors.name = "dummy_vectors" - link_vectors_to_models(vocab) assert vocab["dog"].rank == 0 assert vocab["cat"].rank == 1 assert vocab["SUFFIX"].rank == 2 diff --git a/spacy/tests/regression/test_issue3001-3500.py b/spacy/tests/regression/test_issue3001-3500.py index a10225390..01f58ae77 100644 --- a/spacy/tests/regression/test_issue3001-3500.py +++ b/spacy/tests/regression/test_issue3001-3500.py @@ -1,22 +1,17 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest +from spacy import registry from spacy.lang.en import English from spacy.lang.de import German +from spacy.pipeline.ner import DEFAULT_NER_MODEL from spacy.pipeline import EntityRuler, EntityRecognizer from spacy.matcher import Matcher, PhraseMatcher from spacy.tokens import Doc from spacy.vocab import Vocab from spacy.attrs import ENT_IOB, ENT_TYPE -from spacy.compat import pickle, is_python2, unescape_unicode +from spacy.compat import pickle from spacy import displacy -from spacy.util import decaying -import numpy -import re - from spacy.vectors import Vectors -from ..util import get_doc +import numpy def test_issue3002(): @@ -31,16 +26,16 @@ def test_issue3002(): def test_issue3009(en_vocab): """Test problem with matcher quantifiers""" patterns = [ - [{"LEMMA": "have"}, {"LOWER": "to"}, {"LOWER": "do"}, {"TAG": "IN"}], + [{"ORTH": "has"}, {"LOWER": "to"}, {"LOWER": "do"}, {"TAG": "IN"}], [ - {"LEMMA": "have"}, + {"ORTH": "has"}, {"IS_ASCII": True, "IS_PUNCT": False, "OP": "*"}, {"LOWER": "to"}, {"LOWER": "do"}, {"TAG": "IN"}, ], [ - {"LEMMA": "have"}, + {"ORTH": "has"}, {"IS_ASCII": True, "IS_PUNCT": False, "OP": "?"}, {"LOWER": "to"}, {"LOWER": "do"}, @@ -49,7 +44,8 @@ def test_issue3009(en_vocab): ] words = ["also", "has", "to", "do", "with"] tags = ["RB", "VBZ", "TO", "VB", "IN"] - doc = get_doc(en_vocab, words=words, tags=tags) + pos = ["ADV", "VERB", "ADP", "VERB", "ADP"] + doc = Doc(en_vocab, words=words, tags=tags, pos=pos) matcher = Matcher(en_vocab) for i, pattern in enumerate(patterns): matcher.add(str(i), [pattern]) @@ -63,19 +59,15 @@ def test_issue3012(en_vocab): words = ["This", "is", "10", "%", "."] tags = ["DT", "VBZ", "CD", "NN", "."] pos = ["DET", "VERB", "NUM", "NOUN", "PUNCT"] - ents = [(2, 4, "PERCENT")] - doc = get_doc(en_vocab, words=words, tags=tags, pos=pos, ents=ents) - assert doc.is_tagged - + ents = ["O", "O", "B-PERCENT", "I-PERCENT", "O"] + doc = Doc(en_vocab, words=words, tags=tags, pos=pos, ents=ents) + assert doc.has_annotation("TAG") expected = ("10", "NUM", "CD", "PERCENT") assert (doc[2].text, doc[2].pos_, doc[2].tag_, doc[2].ent_type_) == expected - header = [ENT_IOB, ENT_TYPE] ent_array = doc.to_array(header) doc.from_array(header, ent_array) - assert (doc[2].text, doc[2].pos_, doc[2].tag_, doc[2].ent_type_) == expected - # Serializing then deserializing doc_bytes = doc.to_bytes() doc2 = Doc(en_vocab).from_bytes(doc_bytes) @@ -85,10 +77,10 @@ def test_issue3012(en_vocab): def test_issue3199(): """Test that Span.noun_chunks works correctly if no noun chunks iterator is available. To make this test future-proof, we're constructing a Doc - with a new Vocab here and setting is_parsed to make sure the noun chunks run. + with a new Vocab here and a parse tree to make sure the noun chunks run. 
""" - doc = Doc(Vocab(), words=["This", "is", "a", "sentence"]) - doc.is_parsed = True + words = ["This", "is", "a", "sentence"] + doc = Doc(Vocab(), words=words, heads=[0] * len(words), deps=["dep"] * len(words)) assert list(doc[0:3].noun_chunks) == [] @@ -98,17 +90,17 @@ def test_issue3209(): were added using ner.add_label(). """ nlp = English() - ner = nlp.create_pipe("ner") - nlp.add_pipe(ner) - + ner = nlp.add_pipe("ner") ner.add_label("ANIMAL") - nlp.begin_training() + nlp.initialize() move_names = ["O", "B-ANIMAL", "I-ANIMAL", "L-ANIMAL", "U-ANIMAL"] assert ner.move_names == move_names nlp2 = English() - nlp2.add_pipe(nlp2.create_pipe("ner")) + ner2 = nlp2.add_pipe("ner") + model = ner2.model + model.attrs["resize_output"](model, ner.moves.n_moves) nlp2.from_bytes(nlp.to_bytes()) - assert nlp2.get_pipe("ner").move_names == move_names + assert ner2.move_names == move_names def test_issue3248_1(): @@ -121,7 +113,6 @@ def test_issue3248_1(): assert len(matcher) == 2 -@pytest.mark.skipif(is_python2, reason="Can't pickle instancemethod for is_base_form") def test_issue3248_2(): """Test that the PhraseMatcher can be pickled correctly.""" nlp = English() @@ -146,9 +137,9 @@ def test_issue3288(en_vocab): """Test that retokenization works correctly via displaCy when punctuation is merged onto the preceeding token and tensor is resized.""" words = ["Hello", "World", "!", "When", "is", "this", "breaking", "?"] - heads = [1, 0, -1, 1, 0, 1, -2, -3] + heads = [1, 1, 1, 4, 4, 6, 4, 4] deps = ["intj", "ROOT", "punct", "advmod", "ROOT", "det", "nsubj", "punct"] - doc = get_doc(en_vocab, words=words, heads=heads, deps=deps) + doc = Doc(en_vocab, words=words, heads=heads, deps=deps) doc.tensor = numpy.zeros((len(words), 96), dtype="float32") displacy.render(doc) @@ -157,10 +148,10 @@ def test_issue3289(): """Test that Language.to_bytes handles serializing a pipeline component with an uninitialized model.""" nlp = English() - nlp.add_pipe(nlp.create_pipe("textcat")) + nlp.add_pipe("textcat") bytes_data = nlp.to_bytes() new_nlp = English() - new_nlp.add_pipe(nlp.create_pipe("textcat")) + new_nlp.add_pipe("textcat") new_nlp.from_bytes(bytes_data) @@ -198,7 +189,14 @@ def test_issue3345(): doc = Doc(nlp.vocab, words=["I", "live", "in", "New", "York"]) doc[4].is_sent_start = True ruler = EntityRuler(nlp, patterns=[{"label": "GPE", "pattern": "New York"}]) - ner = EntityRecognizer(doc.vocab) + config = { + "learn_tokens": False, + "min_action_freq": 30, + "update_with_oracle_cut_size": 100, + } + cfg = {"model": DEFAULT_NER_MODEL} + model = registry.resolve(cfg, validate=True)["model"] + ner = EntityRecognizer(doc.vocab, model, **config) # Add the OUT action. I wouldn't have thought this would be necessary... ner.moves.add_action(5, "") ner.add_label("GPE") @@ -212,88 +210,6 @@ def test_issue3345(): assert ner.moves.is_valid(state, "B-GPE") -if is_python2: - # If we have this test in Python 3, pytest chokes, as it can't print the - # string above in the xpass message. - prefix_search = ( - b"^\xc2\xa7|^%|^=|^\xe2\x80\x94|^\xe2\x80\x93|^\\+(?![0-9])" - b"|^\xe2\x80\xa6|^\xe2\x80\xa6\xe2\x80\xa6|^,|^:|^;|^\\!|^\\?" 
- b"|^\xc2\xbf|^\xd8\x9f|^\xc2\xa1|^\\(|^\\)|^\\[|^\\]|^\\{|^\\}" - b"|^<|^>|^_|^#|^\\*|^&|^\xe3\x80\x82|^\xef\xbc\x9f|^\xef\xbc\x81|" - b"^\xef\xbc\x8c|^\xe3\x80\x81|^\xef\xbc\x9b|^\xef\xbc\x9a|" - b"^\xef\xbd\x9e|^\xc2\xb7|^\xe0\xa5\xa4|^\xd8\x8c|^\xd8\x9b|" - b"^\xd9\xaa|^\\.\\.+|^\xe2\x80\xa6|^\\'|^\"|^\xe2\x80\x9d|" - b"^\xe2\x80\x9c|^`|^\xe2\x80\x98|^\xc2\xb4|^\xe2\x80\x99|" - b"^\xe2\x80\x9a|^,|^\xe2\x80\x9e|^\xc2\xbb|^\xc2\xab|^\xe3\x80\x8c|" - b"^\xe3\x80\x8d|^\xe3\x80\x8e|^\xe3\x80\x8f|^\xef\xbc\x88|" - b"^\xef\xbc\x89|^\xe3\x80\x94|^\xe3\x80\x95|^\xe3\x80\x90|" - b"^\xe3\x80\x91|^\xe3\x80\x8a|^\xe3\x80\x8b|^\xe3\x80\x88|" - b"^\xe3\x80\x89|^\\$|^\xc2\xa3|^\xe2\x82\xac|^\xc2\xa5|^\xe0\xb8\xbf|" - b"^US\\$|^C\\$|^A\\$|^\xe2\x82\xbd|^\xef\xb7\xbc|^\xe2\x82\xb4|" - b"^[\\u00A6\\u00A9\\u00AE\\u00B0\\u0482\\u058D\\u058E\\u060E\\u060F" - b"\\u06DE\\u06E9\\u06FD\\u06FE\\u07F6\\u09FA\\u0B70\\u0BF3-\\u0BF8" - b"\\u0BFA\\u0C7F\\u0D4F\\u0D79\\u0F01-\\u0F03\\u0F13\\u0F15-\\u0F17" - b"\\u0F1A-\\u0F1F\\u0F34\\u0F36\\u0F38\\u0FBE-\\u0FC5\\u0FC7-\\u0FCC" - b"\\u0FCE\\u0FCF\\u0FD5-\\u0FD8\\u109E\\u109F\\u1390-\\u1399\\u1940" - b"\\u19DE-\\u19FF\\u1B61-\\u1B6A\\u1B74-\\u1B7C\\u2100\\u2101\\u2103" - b"-\\u2106\\u2108\\u2109\\u2114\\u2116\\u2117\\u211E-\\u2123\\u2125" - b"\\u2127\\u2129\\u212E\\u213A\\u213B\\u214A\\u214C\\u214D\\u214F" - b"\\u218A\\u218B\\u2195-\\u2199\\u219C-\\u219F\\u21A1\\u21A2\\u21A4" - b"\\u21A5\\u21A7-\\u21AD\\u21AF-\\u21CD\\u21D0\\u21D1\\u21D3\\u21D5" - b"-\\u21F3\\u2300-\\u2307\\u230C-\\u231F\\u2322-\\u2328\\u232B" - b"-\\u237B\\u237D-\\u239A\\u23B4-\\u23DB\\u23E2-\\u2426\\u2440" - b"-\\u244A\\u249C-\\u24E9\\u2500-\\u25B6\\u25B8-\\u25C0\\u25C2" - b"-\\u25F7\\u2600-\\u266E\\u2670-\\u2767\\u2794-\\u27BF\\u2800" - b"-\\u28FF\\u2B00-\\u2B2F\\u2B45\\u2B46\\u2B4D-\\u2B73\\u2B76" - b"-\\u2B95\\u2B98-\\u2BC8\\u2BCA-\\u2BFE\\u2CE5-\\u2CEA\\u2E80" - b"-\\u2E99\\u2E9B-\\u2EF3\\u2F00-\\u2FD5\\u2FF0-\\u2FFB\\u3004" - b"\\u3012\\u3013\\u3020\\u3036\\u3037\\u303E\\u303F\\u3190\\u3191" - b"\\u3196-\\u319F\\u31C0-\\u31E3\\u3200-\\u321E\\u322A-\\u3247\\u3250" - b"\\u3260-\\u327F\\u328A-\\u32B0\\u32C0-\\u32FE\\u3300-\\u33FF\\u4DC0" - b"-\\u4DFF\\uA490-\\uA4C6\\uA828-\\uA82B\\uA836\\uA837\\uA839\\uAA77" - b"-\\uAA79\\uFDFD\\uFFE4\\uFFE8\\uFFED\\uFFEE\\uFFFC\\uFFFD\\U00010137" - b"-\\U0001013F\\U00010179-\\U00010189\\U0001018C-\\U0001018E" - b"\\U00010190-\\U0001019B\\U000101A0\\U000101D0-\\U000101FC\\U00010877" - b"\\U00010878\\U00010AC8\\U0001173F\\U00016B3C-\\U00016B3F\\U00016B45" - b"\\U0001BC9C\\U0001D000-\\U0001D0F5\\U0001D100-\\U0001D126\\U0001D129" - b"-\\U0001D164\\U0001D16A-\\U0001D16C\\U0001D183\\U0001D184\\U0001D18C" - b"-\\U0001D1A9\\U0001D1AE-\\U0001D1E8\\U0001D200-\\U0001D241\\U0001D245" - b"\\U0001D300-\\U0001D356\\U0001D800-\\U0001D9FF\\U0001DA37-\\U0001DA3A" - b"\\U0001DA6D-\\U0001DA74\\U0001DA76-\\U0001DA83\\U0001DA85\\U0001DA86" - b"\\U0001ECAC\\U0001F000-\\U0001F02B\\U0001F030-\\U0001F093\\U0001F0A0" - b"-\\U0001F0AE\\U0001F0B1-\\U0001F0BF\\U0001F0C1-\\U0001F0CF\\U0001F0D1" - b"-\\U0001F0F5\\U0001F110-\\U0001F16B\\U0001F170-\\U0001F1AC\\U0001F1E6" - b"-\\U0001F202\\U0001F210-\\U0001F23B\\U0001F240-\\U0001F248\\U0001F250" - b"\\U0001F251\\U0001F260-\\U0001F265\\U0001F300-\\U0001F3FA\\U0001F400" - b"-\\U0001F6D4\\U0001F6E0-\\U0001F6EC\\U0001F6F0-\\U0001F6F9\\U0001F700" - b"-\\U0001F773\\U0001F780-\\U0001F7D8\\U0001F800-\\U0001F80B\\U0001F810" - b"-\\U0001F847\\U0001F850-\\U0001F859\\U0001F860-\\U0001F887\\U0001F890" - 
b"-\\U0001F8AD\\U0001F900-\\U0001F90B\\U0001F910-\\U0001F93E\\U0001F940" - b"-\\U0001F970\\U0001F973-\\U0001F976\\U0001F97A\\U0001F97C-\\U0001F9A2" - b"\\U0001F9B0-\\U0001F9B9\\U0001F9C0-\\U0001F9C2\\U0001F9D0-\\U0001F9FF" - b"\\U0001FA60-\\U0001FA6D]" - ) - - def test_issue3356(): - pattern = re.compile(unescape_unicode(prefix_search.decode("utf8"))) - assert not pattern.search("hello") - - -def test_issue3410(): - texts = ["Hello world", "This is a test"] - nlp = English() - matcher = Matcher(nlp.vocab) - phrasematcher = PhraseMatcher(nlp.vocab) - with pytest.deprecated_call(): - docs = list(nlp.pipe(texts, n_threads=4)) - with pytest.deprecated_call(): - docs = list(nlp.tokenizer.pipe(texts, n_threads=4)) - with pytest.deprecated_call(): - list(matcher.pipe(docs, n_threads=4)) - with pytest.deprecated_call(): - list(phrasematcher.pipe(docs, n_threads=4)) - - def test_issue3412(): data = numpy.asarray([[0, 0, 0], [1, 2, 3], [9, 8, 7]], dtype="f") vectors = Vectors(data=data, keys=["A", "B", "C"]) @@ -303,20 +219,10 @@ def test_issue3412(): assert best_rows[0] == 2 -def test_issue3447(): - sizes = decaying(10.0, 1.0, 0.5) - size = next(sizes) - assert size == 10.0 - size = next(sizes) - assert size == 10.0 - 0.5 - size = next(sizes) - assert size == 10.0 - 0.5 - 0.5 - - -@pytest.mark.xfail(reason="default suffix rules avoid one upper-case letter before dot") +@pytest.mark.skip(reason="default suffix rules avoid one upper-case letter before dot") def test_issue3449(): nlp = English() - nlp.add_pipe(nlp.create_pipe("sentencizer")) + nlp.add_pipe("sentencizer") text1 = "He gave the ball to I. Do you want to go to the movies with I?" text2 = "He gave the ball to I. Do you want to go to the movies with I?" text3 = "He gave the ball to I.\nDo you want to go to the movies with I?" 
@@ -328,26 +234,26 @@ def test_issue3449(): assert t3[5].text == "I" -@pytest.mark.filterwarnings("ignore::UserWarning") def test_issue3456(): # this crashed because of a padding error in layer.ops.unflatten in thinc nlp = English() - nlp.add_pipe(nlp.create_pipe("tagger")) - nlp.begin_training() + tagger = nlp.add_pipe("tagger") + tagger.add_label("A") + nlp.initialize() list(nlp.pipe(["hi", ""])) def test_issue3468(): - """Test that sentence boundaries are set correctly so Doc.is_sentenced can + """Test that sentence boundaries are set correctly so Doc.has_annotation("SENT_START") can be restored after serialization.""" nlp = English() - nlp.add_pipe(nlp.create_pipe("sentencizer")) + nlp.add_pipe("sentencizer") doc = nlp("Hello world") assert doc[0].is_sent_start - assert doc.is_sentenced + assert doc.has_annotation("SENT_START") assert len(list(doc.sents)) == 1 doc_bytes = doc.to_bytes() new_doc = Doc(nlp.vocab).from_bytes(doc_bytes) assert new_doc[0].is_sent_start - assert new_doc.is_sentenced + assert new_doc.has_annotation("SENT_START") assert len(list(new_doc.sents)) == 1 diff --git a/spacy/tests/regression/test_issue3501-4000.py b/spacy/tests/regression/test_issue3501-4000.py new file mode 100644 index 000000000..0505571c2 --- /dev/null +++ b/spacy/tests/regression/test_issue3501-4000.py @@ -0,0 +1,476 @@ +import pytest +from spacy.language import Language +from spacy.vocab import Vocab +from spacy.pipeline import EntityRuler, DependencyParser +from spacy.pipeline.dep_parser import DEFAULT_PARSER_MODEL +from spacy import displacy, load +from spacy.displacy import parse_deps +from spacy.tokens import Doc, Token +from spacy.matcher import Matcher, PhraseMatcher +from spacy.errors import MatchPatternError +from spacy.util import minibatch +from spacy.training import Example +from spacy.lang.hi import Hindi +from spacy.lang.es import Spanish +from spacy.lang.en import English +from spacy.attrs import IS_ALPHA +from spacy import registry +from thinc.api import compounding +import spacy +import srsly +import numpy + +from ..util import make_tempdir + + +@pytest.mark.parametrize("word", ["don't", "don’t", "I'd", "I’d"]) +def test_issue3521(en_tokenizer, word): + tok = en_tokenizer(word)[1] + # 'not' and 'would' should be stopwords, also in their abbreviated forms + assert tok.is_stop + + +def test_issue_3526_1(en_vocab): + patterns = [ + {"label": "HELLO", "pattern": "hello world"}, + {"label": "BYE", "pattern": [{"LOWER": "bye"}, {"LOWER": "bye"}]}, + {"label": "HELLO", "pattern": [{"ORTH": "HELLO"}]}, + {"label": "COMPLEX", "pattern": [{"ORTH": "foo", "OP": "*"}]}, + {"label": "TECH_ORG", "pattern": "Apple", "id": "a1"}, + ] + nlp = Language(vocab=en_vocab) + ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True) + ruler_bytes = ruler.to_bytes() + assert len(ruler) == len(patterns) + assert len(ruler.labels) == 4 + assert ruler.overwrite + new_ruler = EntityRuler(nlp) + new_ruler = new_ruler.from_bytes(ruler_bytes) + assert len(new_ruler) == len(ruler) + assert len(new_ruler.labels) == 4 + assert new_ruler.overwrite == ruler.overwrite + assert new_ruler.ent_id_sep == ruler.ent_id_sep + + +def test_issue_3526_2(en_vocab): + patterns = [ + {"label": "HELLO", "pattern": "hello world"}, + {"label": "BYE", "pattern": [{"LOWER": "bye"}, {"LOWER": "bye"}]}, + {"label": "HELLO", "pattern": [{"ORTH": "HELLO"}]}, + {"label": "COMPLEX", "pattern": [{"ORTH": "foo", "OP": "*"}]}, + {"label": "TECH_ORG", "pattern": "Apple", "id": "a1"}, + ] + nlp = Language(vocab=en_vocab) + ruler = 
EntityRuler(nlp, patterns=patterns, overwrite_ents=True) + bytes_old_style = srsly.msgpack_dumps(ruler.patterns) + new_ruler = EntityRuler(nlp) + new_ruler = new_ruler.from_bytes(bytes_old_style) + assert len(new_ruler) == len(ruler) + for pattern in ruler.patterns: + assert pattern in new_ruler.patterns + assert new_ruler.overwrite is not ruler.overwrite + + +def test_issue_3526_3(en_vocab): + patterns = [ + {"label": "HELLO", "pattern": "hello world"}, + {"label": "BYE", "pattern": [{"LOWER": "bye"}, {"LOWER": "bye"}]}, + {"label": "HELLO", "pattern": [{"ORTH": "HELLO"}]}, + {"label": "COMPLEX", "pattern": [{"ORTH": "foo", "OP": "*"}]}, + {"label": "TECH_ORG", "pattern": "Apple", "id": "a1"}, + ] + nlp = Language(vocab=en_vocab) + ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True) + with make_tempdir() as tmpdir: + out_file = tmpdir / "entity_ruler" + srsly.write_jsonl(out_file.with_suffix(".jsonl"), ruler.patterns) + new_ruler = EntityRuler(nlp).from_disk(out_file) + for pattern in ruler.patterns: + assert pattern in new_ruler.patterns + assert len(new_ruler) == len(ruler) + assert new_ruler.overwrite is not ruler.overwrite + + +def test_issue_3526_4(en_vocab): + nlp = Language(vocab=en_vocab) + patterns = [{"label": "ORG", "pattern": "Apple"}] + config = {"overwrite_ents": True} + ruler = nlp.add_pipe("entity_ruler", config=config) + ruler.add_patterns(patterns) + with make_tempdir() as tmpdir: + nlp.to_disk(tmpdir) + ruler = nlp.get_pipe("entity_ruler") + assert ruler.patterns == [{"label": "ORG", "pattern": "Apple"}] + assert ruler.overwrite is True + nlp2 = load(tmpdir) + new_ruler = nlp2.get_pipe("entity_ruler") + assert new_ruler.patterns == [{"label": "ORG", "pattern": "Apple"}] + assert new_ruler.overwrite is True + + +def test_issue3531(): + """Test that displaCy renderer doesn't require "settings" key.""" + example_dep = { + "words": [ + {"text": "But", "tag": "CCONJ"}, + {"text": "Google", "tag": "PROPN"}, + {"text": "is", "tag": "VERB"}, + {"text": "starting", "tag": "VERB"}, + {"text": "from", "tag": "ADP"}, + {"text": "behind.", "tag": "ADV"}, + ], + "arcs": [ + {"start": 0, "end": 3, "label": "cc", "dir": "left"}, + {"start": 1, "end": 3, "label": "nsubj", "dir": "left"}, + {"start": 2, "end": 3, "label": "aux", "dir": "left"}, + {"start": 3, "end": 4, "label": "prep", "dir": "right"}, + {"start": 4, "end": 5, "label": "pcomp", "dir": "right"}, + ], + } + example_ent = { + "text": "But Google is starting from behind.", + "ents": [{"start": 4, "end": 10, "label": "ORG"}], + } + dep_html = displacy.render(example_dep, style="dep", manual=True) + assert dep_html + ent_html = displacy.render(example_ent, style="ent", manual=True) + assert ent_html + + +def test_issue3540(en_vocab): + words = ["I", "live", "in", "NewYork", "right", "now"] + tensor = numpy.asarray( + [[1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1], [5.0, 5.1], [6.0, 6.1]], + dtype="f", + ) + doc = Doc(en_vocab, words=words) + doc.tensor = tensor + gold_text = ["I", "live", "in", "NewYork", "right", "now"] + assert [token.text for token in doc] == gold_text + gold_lemma = ["I", "live", "in", "NewYork", "right", "now"] + for i, lemma in enumerate(gold_lemma): + doc[i].lemma_ = lemma + assert [token.lemma_ for token in doc] == gold_lemma + vectors_1 = [token.vector for token in doc] + assert len(vectors_1) == len(doc) + + with doc.retokenize() as retokenizer: + heads = [(doc[3], 1), doc[2]] + attrs = { + "POS": ["PROPN", "PROPN"], + "LEMMA": ["New", "York"], + "DEP": ["pobj", "compound"], + } + 
retokenizer.split(doc[3], ["New", "York"], heads=heads, attrs=attrs) + + gold_text = ["I", "live", "in", "New", "York", "right", "now"] + assert [token.text for token in doc] == gold_text + gold_lemma = ["I", "live", "in", "New", "York", "right", "now"] + assert [token.lemma_ for token in doc] == gold_lemma + vectors_2 = [token.vector for token in doc] + assert len(vectors_2) == len(doc) + assert vectors_1[0].tolist() == vectors_2[0].tolist() + assert vectors_1[1].tolist() == vectors_2[1].tolist() + assert vectors_1[2].tolist() == vectors_2[2].tolist() + assert vectors_1[4].tolist() == vectors_2[5].tolist() + assert vectors_1[5].tolist() == vectors_2[6].tolist() + + +def test_issue3549(en_vocab): + """Test that match pattern validation doesn't raise on empty errors.""" + matcher = Matcher(en_vocab, validate=True) + pattern = [{"LOWER": "hello"}, {"LOWER": "world"}] + matcher.add("GOOD", [pattern]) + with pytest.raises(MatchPatternError): + matcher.add("BAD", [[{"X": "Y"}]]) + + +@pytest.mark.skip("Matching currently only works on strings and integers") +def test_issue3555(en_vocab): + """Test that custom extensions with default None don't break matcher.""" + Token.set_extension("issue3555", default=None) + matcher = Matcher(en_vocab) + pattern = [{"ORTH": "have"}, {"_": {"issue3555": True}}] + matcher.add("TEST", [pattern]) + doc = Doc(en_vocab, words=["have", "apple"]) + matcher(doc) + + +def test_issue3611(): + """Test that adding n-grams in the textcat works even when n > token length of some docs.""" + unique_classes = ["offensive", "inoffensive"] + x_train = [ + "This is an offensive text", + "This is the second offensive text", + "inoff", + ] + y_train = ["offensive", "offensive", "inoffensive"] + nlp = spacy.blank("en") + # preparing the data + train_data = [] + for text, train_instance in zip(x_train, y_train): + cat_dict = {label: label == train_instance for label in unique_classes} + train_data.append(Example.from_dict(nlp.make_doc(text), {"cats": cat_dict})) + # add a text categorizer component + model = { + "@architectures": "spacy.TextCatBOW.v1", + "exclusive_classes": True, + "ngram_size": 2, + "no_output_layer": False, + } + textcat = nlp.add_pipe("textcat", config={"model": model}, last=True) + for label in unique_classes: + textcat.add_label(label) + # training the network + with nlp.select_pipes(enable="textcat"): + optimizer = nlp.initialize() + for i in range(3): + losses = {} + batches = minibatch(train_data, size=compounding(4.0, 32.0, 1.001)) + + for batch in batches: + nlp.update(examples=batch, sgd=optimizer, drop=0.1, losses=losses) + + +def test_issue3625(): + """Test that default punctuation rules apply to Hindi Unicode characters.""" + nlp = Hindi() + doc = nlp("hi. how हुए. होटल, होटल")
+ expected = ["hi", ".", "how", "हुए", ".", "होटल", ",", "होटल"] + assert [token.text for token in doc] == expected + + +def test_issue3803(): + """Test that Spanish num-like tokens have True for the like_num attribute.""" + nlp = Spanish() + text = "2 dos 1000 mil 12 doce" + doc = nlp(text) + + assert [t.like_num for t in doc] == [True, True, True, True, True, True] + + +def _parser_example(parser): + doc = Doc(parser.vocab, words=["a", "b", "c", "d"]) + gold = {"heads": [1, 1, 3, 3], "deps": ["right", "ROOT", "left", "ROOT"]} + return Example.from_dict(doc, gold) + + +def test_issue3830_no_subtok(): + """Test that the parser doesn't have the subtok label if not learn_tokens.""" + config = { + "learn_tokens": False, + "min_action_freq": 30, + "update_with_oracle_cut_size": 100, + } + model = registry.resolve({"model": DEFAULT_PARSER_MODEL}, validate=True)["model"] + parser = DependencyParser(Vocab(), model, **config) + parser.add_label("nsubj") + assert "subtok" not in parser.labels + parser.initialize(lambda: [_parser_example(parser)]) + assert "subtok" not in parser.labels + + +def test_issue3830_with_subtok(): + """Test that the parser does have the subtok label if learn_tokens=True.""" + config = { + "learn_tokens": True, + "min_action_freq": 30, + "update_with_oracle_cut_size": 100, + } + model = registry.resolve({"model": DEFAULT_PARSER_MODEL}, validate=True)["model"] + parser = DependencyParser(Vocab(), model, **config) + parser.add_label("nsubj") + assert "subtok" not in parser.labels + parser.initialize(lambda: [_parser_example(parser)]) + assert "subtok" in parser.labels + + +def test_issue3839(en_vocab): + """Test that match IDs returned by the matcher are correct and present in the string store.""" + doc = Doc(en_vocab, words=["terrific", "group", "of", "people"]) + matcher = Matcher(en_vocab) + match_id = "PATTERN" + pattern1 = [{"LOWER": "terrific"}, {"OP": "?"}, {"LOWER": "group"}] + pattern2 = [{"LOWER": "terrific"}, {"OP": "?"}, {"OP": "?"}, {"LOWER": "group"}] + matcher.add(match_id, [pattern1]) + matches = matcher(doc) + assert matches[0][0] == en_vocab.strings[match_id] + matcher = Matcher(en_vocab) + matcher.add(match_id, [pattern2]) + matches = matcher(doc) + assert matches[0][0] == en_vocab.strings[match_id] + + +@pytest.mark.parametrize( + "sentence", + [ + "The story was to the effect that a young American student recently called on Professor Christlieb with a letter of introduction.", + "The next month Barry Siddall joined Stoke City on a free transfer, after Chris Pearce had established himself as the Vale's #1.", + "The next month Barry Siddall joined Stoke City on a free transfer, after Chris Pearce had established himself as the Vale's number one", + "Indeed, making the one who remains do all the work has installed him into a position of such insolent tyranny, it will take a month at least to reduce him to his proper proportions.", + "It was a missed assignment, but it shouldn't have resulted in a turnover ...", + ], +) +def test_issue3869(sentence): + """Test that the Doc's count_by function works consistently.""" + nlp = English() + doc = nlp(sentence) + count = 0 + for token in doc: + count += token.is_alpha + assert count == doc.count_by(IS_ALPHA).get(1, 0) + + +def test_issue3879(en_vocab): + doc = Doc(en_vocab, words=["This", "is", "a", "test", "."]) + assert len(doc) == 5 + pattern = [{"ORTH": "This", "OP": "?"}, {"OP": "?"}, {"ORTH": "test"}] + matcher = Matcher(en_vocab) + matcher.add("TEST", [pattern]) + assert len(matcher(doc)) == 2 # fails because of a FP match
'is a test' + + +def test_issue3880(): + """Test that `nlp.pipe()` works when an empty string ends the batch. + + Fixed in v7.0.5 of Thinc. + """ + texts = ["hello", "world", "", ""] + nlp = English() + nlp.add_pipe("parser").add_label("dep") + nlp.add_pipe("ner").add_label("PERSON") + nlp.add_pipe("tagger").add_label("NN") + nlp.initialize() + for doc in nlp.pipe(texts): + pass + + +def test_issue3882(en_vocab): + """Test that displaCy doesn't serialize the doc.user_data when making a + copy of the Doc. + """ + doc = Doc(en_vocab, words=["Hello", "world"], deps=["dep", "dep"]) + doc.user_data["test"] = set() + parse_deps(doc) + + +def test_issue3951(en_vocab): + """Test that combinations of optional rules are matched correctly.""" + matcher = Matcher(en_vocab) + pattern = [ + {"LOWER": "hello"}, + {"LOWER": "this", "OP": "?"}, + {"OP": "?"}, + {"LOWER": "world"}, + ] + matcher.add("TEST", [pattern]) + doc = Doc(en_vocab, words=["Hello", "my", "new", "world"]) + matches = matcher(doc) + assert len(matches) == 0 + + +def test_issue3959(): + """ Ensure that a modified pos attribute is serialized correctly.""" + nlp = English() + doc = nlp( + "displaCy uses JavaScript, SVG and CSS to show you how computers understand language" + ) + assert doc[0].pos_ == "" + doc[0].pos_ = "NOUN" + assert doc[0].pos_ == "NOUN" + # usually this is already True when starting from proper models instead of blank English + with make_tempdir() as tmp_dir: + file_path = tmp_dir / "my_doc" + doc.to_disk(file_path) + doc2 = nlp("") + doc2.from_disk(file_path) + assert doc2[0].pos_ == "NOUN" + + +def test_issue3962(en_vocab): + """Ensure that as_doc does not result in out-of-bound access of tokens. + This is achieved by setting the head to itself if it would lie out of the span otherwise.""" + # fmt: off + words = ["He", "jests", "at", "scars", ",", "that", "never", "felt", "a", "wound", "."] + heads = [1, 7, 1, 2, 7, 7, 7, 7, 9, 7, 7] + deps = ["nsubj", "ccomp", "prep", "pobj", "punct", "nsubj", "neg", "ROOT", "det", "dobj", "punct"] + # fmt: on + doc = Doc(en_vocab, words=words, heads=heads, deps=deps) + span2 = doc[1:5] # "jests at scars ," + doc2 = span2.as_doc() + doc2_json = doc2.to_json() + assert doc2_json + # head set to itself, being the new artificial root + assert doc2[0].head.text == "jests" + assert doc2[0].dep_ == "dep" + assert doc2[1].head.text == "jests" + assert doc2[1].dep_ == "prep" + assert doc2[2].head.text == "at" + assert doc2[2].dep_ == "pobj" + assert doc2[3].head.text == "jests" # head set to the new artificial root + assert doc2[3].dep_ == "dep" + # We should still have 1 sentence + assert len(list(doc2.sents)) == 1 + span3 = doc[6:9] # "never felt a" + doc3 = span3.as_doc() + doc3_json = doc3.to_json() + assert doc3_json + assert doc3[0].head.text == "felt" + assert doc3[0].dep_ == "neg" + assert doc3[1].head.text == "felt" + assert doc3[1].dep_ == "ROOT" + assert doc3[2].head.text == "felt" # head set to ancestor + assert doc3[2].dep_ == "dep" + # We should still have 1 sentence as "a" can be attached to "felt" instead of "wound" + assert len(list(doc3.sents)) == 1 + + +def test_issue3962_long(en_vocab): + """Ensure that as_doc does not result in out-of-bound access of tokens. 
+ This is achieved by setting the head to itself if it would lie out of the span otherwise.""" + # fmt: off + words = ["He", "jests", "at", "scars", ".", "They", "never", "felt", "a", "wound", "."] + heads = [1, 1, 1, 2, 1, 7, 7, 7, 9, 7, 7] + deps = ["nsubj", "ROOT", "prep", "pobj", "punct", "nsubj", "neg", "ROOT", "det", "dobj", "punct"] + # fmt: on + two_sent_doc = Doc(en_vocab, words=words, heads=heads, deps=deps) + span2 = two_sent_doc[1:7] # "jests at scars. They never" + doc2 = span2.as_doc() + doc2_json = doc2.to_json() + assert doc2_json + # head set to itself, being the new artificial root (in sentence 1) + assert doc2[0].head.text == "jests" + assert doc2[0].dep_ == "ROOT" + assert doc2[1].head.text == "jests" + assert doc2[1].dep_ == "prep" + assert doc2[2].head.text == "at" + assert doc2[2].dep_ == "pobj" + assert doc2[3].head.text == "jests" + assert doc2[3].dep_ == "punct" + # head set to itself, being the new artificial root (in sentence 2) + assert doc2[4].head.text == "They" + assert doc2[4].dep_ == "dep" + # head set to the new artificial head (in sentence 2) + assert doc2[4].head.text == "They" + assert doc2[4].dep_ == "dep" + # We should still have 2 sentences + sents = list(doc2.sents) + assert len(sents) == 2 + assert sents[0].text == "jests at scars ." + assert sents[1].text == "They never" + + +def test_issue3972(en_vocab): + """Test that the PhraseMatcher returns duplicates for duplicate match IDs.""" + matcher = PhraseMatcher(en_vocab) + matcher.add("A", [Doc(en_vocab, words=["New", "York"])]) + matcher.add("B", [Doc(en_vocab, words=["New", "York"])]) + doc = Doc(en_vocab, words=["I", "live", "in", "New", "York"]) + matches = matcher(doc) + + assert len(matches) == 2 + + # We should have a match for each of the two rules + found_ids = [en_vocab.strings[ent_id] for (ent_id, _, _) in matches] + assert "A" in found_ids + assert "B" in found_ids diff --git a/spacy/tests/regression/test_issue3521.py b/spacy/tests/regression/test_issue3521.py deleted file mode 100644 index 35731ac12..000000000 --- a/spacy/tests/regression/test_issue3521.py +++ /dev/null @@ -1,11 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest - - -@pytest.mark.parametrize("word", ["don't", "don’t", "I'd", "I’d"]) -def test_issue3521(en_tokenizer, word): - tok = en_tokenizer(word)[1] - # 'not' and 'would' should be stopwords, also in their abbreviated forms - assert tok.is_stop diff --git a/spacy/tests/regression/test_issue3526.py b/spacy/tests/regression/test_issue3526.py deleted file mode 100644 index c6f513730..000000000 --- a/spacy/tests/regression/test_issue3526.py +++ /dev/null @@ -1,88 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest -from spacy.tokens import Span -from spacy.language import Language -from spacy.pipeline import EntityRuler -from spacy import load -import srsly - -from ..util import make_tempdir - - -@pytest.fixture -def patterns(): - return [ - {"label": "HELLO", "pattern": "hello world"}, - {"label": "BYE", "pattern": [{"LOWER": "bye"}, {"LOWER": "bye"}]}, - {"label": "HELLO", "pattern": [{"ORTH": "HELLO"}]}, - {"label": "COMPLEX", "pattern": [{"ORTH": "foo", "OP": "*"}]}, - {"label": "TECH_ORG", "pattern": "Apple", "id": "a1"}, - ] - - -@pytest.fixture -def add_ent(): - def add_ent_component(doc): - doc.ents = [Span(doc, 0, 3, label=doc.vocab.strings["ORG"])] - return doc - - return add_ent_component - - -def test_entity_ruler_existing_overwrite_serialize_bytes(patterns, en_vocab): - nlp = 
Language(vocab=en_vocab) - ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True) - ruler_bytes = ruler.to_bytes() - assert len(ruler) == len(patterns) - assert len(ruler.labels) == 4 - assert ruler.overwrite - new_ruler = EntityRuler(nlp) - new_ruler = new_ruler.from_bytes(ruler_bytes) - assert len(new_ruler) == len(ruler) - assert len(new_ruler.labels) == 4 - assert new_ruler.overwrite == ruler.overwrite - assert new_ruler.ent_id_sep == ruler.ent_id_sep - - -def test_entity_ruler_existing_bytes_old_format_safe(patterns, en_vocab): - nlp = Language(vocab=en_vocab) - ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True) - bytes_old_style = srsly.msgpack_dumps(ruler.patterns) - new_ruler = EntityRuler(nlp) - new_ruler = new_ruler.from_bytes(bytes_old_style) - assert len(new_ruler) == len(ruler) - for pattern in ruler.patterns: - assert pattern in new_ruler.patterns - assert new_ruler.overwrite is not ruler.overwrite - - -def test_entity_ruler_from_disk_old_format_safe(patterns, en_vocab): - nlp = Language(vocab=en_vocab) - ruler = EntityRuler(nlp, patterns=patterns, overwrite_ents=True) - with make_tempdir() as tmpdir: - out_file = tmpdir / "entity_ruler" - srsly.write_jsonl(out_file.with_suffix(".jsonl"), ruler.patterns) - new_ruler = EntityRuler(nlp).from_disk(out_file) - for pattern in ruler.patterns: - assert pattern in new_ruler.patterns - assert len(new_ruler) == len(ruler) - assert new_ruler.overwrite is not ruler.overwrite - - -def test_entity_ruler_in_pipeline_from_issue(patterns, en_vocab): - nlp = Language(vocab=en_vocab) - ruler = EntityRuler(nlp, overwrite_ents=True) - - ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}]) - nlp.add_pipe(ruler) - with make_tempdir() as tmpdir: - nlp.to_disk(tmpdir) - ruler = nlp.get_pipe("entity_ruler") - assert ruler.patterns == [{"label": "ORG", "pattern": "Apple"}] - assert ruler.overwrite is True - nlp2 = load(tmpdir) - new_ruler = nlp2.get_pipe("entity_ruler") - assert new_ruler.patterns == [{"label": "ORG", "pattern": "Apple"}] - assert new_ruler.overwrite is True diff --git a/spacy/tests/regression/test_issue3531.py b/spacy/tests/regression/test_issue3531.py deleted file mode 100644 index 7b9d0bd2a..000000000 --- a/spacy/tests/regression/test_issue3531.py +++ /dev/null @@ -1,33 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from spacy import displacy - - -def test_issue3531(): - """Test that displaCy renderer doesn't require "settings" key.""" - example_dep = { - "words": [ - {"text": "But", "tag": "CCONJ"}, - {"text": "Google", "tag": "PROPN"}, - {"text": "is", "tag": "VERB"}, - {"text": "starting", "tag": "VERB"}, - {"text": "from", "tag": "ADP"}, - {"text": "behind.", "tag": "ADV"}, - ], - "arcs": [ - {"start": 0, "end": 3, "label": "cc", "dir": "left"}, - {"start": 1, "end": 3, "label": "nsubj", "dir": "left"}, - {"start": 2, "end": 3, "label": "aux", "dir": "left"}, - {"start": 3, "end": 4, "label": "prep", "dir": "right"}, - {"start": 4, "end": 5, "label": "pcomp", "dir": "right"}, - ], - } - example_ent = { - "text": "But Google is starting from behind.", - "ents": [{"start": 4, "end": 10, "label": "ORG"}], - } - dep_html = displacy.render(example_dep, style="dep", manual=True) - assert dep_html - ent_html = displacy.render(example_ent, style="ent", manual=True) - assert ent_html diff --git a/spacy/tests/regression/test_issue3540.py b/spacy/tests/regression/test_issue3540.py deleted file mode 100644 index 19d89c797..000000000 --- a/spacy/tests/regression/test_issue3540.py +++ 
/dev/null @@ -1,47 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from spacy.tokens import Doc - -import numpy as np - - -def test_issue3540(en_vocab): - - words = ["I", "live", "in", "NewYork", "right", "now"] - tensor = np.asarray( - [[1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1], [5.0, 5.1], [6.0, 6.1]], - dtype="f", - ) - doc = Doc(en_vocab, words=words) - doc.tensor = tensor - - gold_text = ["I", "live", "in", "NewYork", "right", "now"] - assert [token.text for token in doc] == gold_text - - gold_lemma = ["I", "live", "in", "NewYork", "right", "now"] - assert [token.lemma_ for token in doc] == gold_lemma - - vectors_1 = [token.vector for token in doc] - assert len(vectors_1) == len(doc) - - with doc.retokenize() as retokenizer: - heads = [(doc[3], 1), doc[2]] - attrs = {"POS": ["PROPN", "PROPN"], "DEP": ["pobj", "compound"]} - retokenizer.split(doc[3], ["New", "York"], heads=heads, attrs=attrs) - - gold_text = ["I", "live", "in", "New", "York", "right", "now"] - assert [token.text for token in doc] == gold_text - - gold_lemma = ["I", "live", "in", "New", "York", "right", "now"] - assert [token.lemma_ for token in doc] == gold_lemma - - vectors_2 = [token.vector for token in doc] - assert len(vectors_2) == len(doc) - - assert vectors_1[0].tolist() == vectors_2[0].tolist() - assert vectors_1[1].tolist() == vectors_2[1].tolist() - assert vectors_1[2].tolist() == vectors_2[2].tolist() - - assert vectors_1[4].tolist() == vectors_2[5].tolist() - assert vectors_1[5].tolist() == vectors_2[6].tolist() diff --git a/spacy/tests/regression/test_issue3549.py b/spacy/tests/regression/test_issue3549.py deleted file mode 100644 index 587b3a857..000000000 --- a/spacy/tests/regression/test_issue3549.py +++ /dev/null @@ -1,15 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest -from spacy.matcher import Matcher -from spacy.errors import MatchPatternError - - -def test_issue3549(en_vocab): - """Test that match pattern validation doesn't raise on empty errors.""" - matcher = Matcher(en_vocab, validate=True) - pattern = [{"LOWER": "hello"}, {"LOWER": "world"}] - matcher.add("GOOD", [pattern]) - with pytest.raises(MatchPatternError): - matcher.add("BAD", [[{"X": "Y"}]]) diff --git a/spacy/tests/regression/test_issue3555.py b/spacy/tests/regression/test_issue3555.py deleted file mode 100644 index 8444f11f2..000000000 --- a/spacy/tests/regression/test_issue3555.py +++ /dev/null @@ -1,17 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest -from spacy.tokens import Doc, Token -from spacy.matcher import Matcher - - -@pytest.mark.xfail -def test_issue3555(en_vocab): - """Test that custom extensions with default None don't break matcher.""" - Token.set_extension("issue3555", default=None) - matcher = Matcher(en_vocab) - pattern = [{"LEMMA": "have"}, {"_": {"issue3555": True}}] - matcher.add("TEST", [pattern]) - doc = Doc(en_vocab, words=["have", "apple"]) - matcher(doc) diff --git a/spacy/tests/regression/test_issue3611.py b/spacy/tests/regression/test_issue3611.py deleted file mode 100644 index 3c4836264..000000000 --- a/spacy/tests/regression/test_issue3611.py +++ /dev/null @@ -1,51 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import spacy -from spacy.util import minibatch, compounding - - -def test_issue3611(): - """ Test whether adding n-grams in the textcat works even when n > token length of some docs """ - unique_classes = ["offensive", "inoffensive"] - x_train = [ - "This is an offensive text", - "This 
is the second offensive text", - "inoff", - ] - y_train = ["offensive", "offensive", "inoffensive"] - - # preparing the data - pos_cats = list() - for train_instance in y_train: - pos_cats.append({label: label == train_instance for label in unique_classes}) - train_data = list(zip(x_train, [{"cats": cats} for cats in pos_cats])) - - # set up the spacy model with a text categorizer component - nlp = spacy.blank("en") - - textcat = nlp.create_pipe( - "textcat", - config={"exclusive_classes": True, "architecture": "bow", "ngram_size": 2}, - ) - - for label in unique_classes: - textcat.add_label(label) - nlp.add_pipe(textcat, last=True) - - # training the network - with nlp.disable_pipes([p for p in nlp.pipe_names if p != "textcat"]): - optimizer = nlp.begin_training() - for i in range(3): - losses = {} - batches = minibatch(train_data, size=compounding(4.0, 32.0, 1.001)) - - for batch in batches: - texts, annotations = zip(*batch) - nlp.update( - docs=texts, - golds=annotations, - sgd=optimizer, - drop=0.1, - losses=losses, - ) diff --git a/spacy/tests/regression/test_issue3625.py b/spacy/tests/regression/test_issue3625.py deleted file mode 100644 index d935db17f..000000000 --- a/spacy/tests/regression/test_issue3625.py +++ /dev/null @@ -1,12 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from spacy.lang.hi import Hindi - - -def test_issue3625(): - """Test that default punctuation rules applies to hindi unicode characters""" - nlp = Hindi() - doc = nlp("hi. how हुए. होटल, होटल") - expected = ["hi", ".", "how", "हुए", ".", "होटल", ",", "होटल"] - assert [token.text for token in doc] == expected diff --git a/spacy/tests/regression/test_issue3803.py b/spacy/tests/regression/test_issue3803.py deleted file mode 100644 index 37d15a5cf..000000000 --- a/spacy/tests/regression/test_issue3803.py +++ /dev/null @@ -1,13 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from spacy.lang.es import Spanish - - -def test_issue3803(): - """Test that spanish num-like tokens have True for like_num attribute.""" - nlp = Spanish() - text = "2 dos 1000 mil 12 doce" - doc = nlp(text) - - assert [t.like_num for t in doc] == [True, True, True, True, True, True] diff --git a/spacy/tests/regression/test_issue3830.py b/spacy/tests/regression/test_issue3830.py deleted file mode 100644 index 54ce10924..000000000 --- a/spacy/tests/regression/test_issue3830.py +++ /dev/null @@ -1,20 +0,0 @@ -from spacy.pipeline.pipes import DependencyParser -from spacy.vocab import Vocab - - -def test_issue3830_no_subtok(): - """Test that the parser doesn't have subtok label if not learn_tokens""" - parser = DependencyParser(Vocab()) - parser.add_label("nsubj") - assert "subtok" not in parser.labels - parser.begin_training(lambda: []) - assert "subtok" not in parser.labels - - -def test_issue3830_with_subtok(): - """Test that the parser does have subtok label if learn_tokens=True.""" - parser = DependencyParser(Vocab(), learn_tokens=True) - parser.add_label("nsubj") - assert "subtok" not in parser.labels - parser.begin_training(lambda: []) - assert "subtok" in parser.labels diff --git a/spacy/tests/regression/test_issue3839.py b/spacy/tests/regression/test_issue3839.py deleted file mode 100644 index fe722a681..000000000 --- a/spacy/tests/regression/test_issue3839.py +++ /dev/null @@ -1,21 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from spacy.matcher import Matcher -from spacy.tokens import Doc - - -def test_issue3839(en_vocab): - """Test that match IDs returned by the matcher are 
correct, are in the string """ - doc = Doc(en_vocab, words=["terrific", "group", "of", "people"]) - matcher = Matcher(en_vocab) - match_id = "PATTERN" - pattern1 = [{"LOWER": "terrific"}, {"OP": "?"}, {"LOWER": "group"}] - pattern2 = [{"LOWER": "terrific"}, {"OP": "?"}, {"OP": "?"}, {"LOWER": "group"}] - matcher.add(match_id, [pattern1]) - matches = matcher(doc) - assert matches[0][0] == en_vocab.strings[match_id] - matcher = Matcher(en_vocab) - matcher.add(match_id, [pattern2]) - matches = matcher(doc) - assert matches[0][0] == en_vocab.strings[match_id] diff --git a/spacy/tests/regression/test_issue3869.py b/spacy/tests/regression/test_issue3869.py deleted file mode 100644 index 62e8eabd6..000000000 --- a/spacy/tests/regression/test_issue3869.py +++ /dev/null @@ -1,28 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest -from spacy.attrs import IS_ALPHA -from spacy.lang.en import English - - -@pytest.mark.parametrize( - "sentence", - [ - "The story was to the effect that a young American student recently called on Professor Christlieb with a letter of introduction.", - "The next month Barry Siddall joined Stoke City on a free transfer, after Chris Pearce had established himself as the Vale's #1.", - "The next month Barry Siddall joined Stoke City on a free transfer, after Chris Pearce had established himself as the Vale's number one", - "Indeed, making the one who remains do all the work has installed him into a position of such insolent tyranny, it will take a month at least to reduce him to his proper proportions.", - "It was a missed assignment, but it shouldn't have resulted in a turnover ...", - ], -) -def test_issue3869(sentence): - """Test that the Doc's count_by function works consistently""" - nlp = English() - doc = nlp(sentence) - - count = 0 - for token in doc: - count += token.is_alpha - - assert count == doc.count_by(IS_ALPHA).get(1, 0) diff --git a/spacy/tests/regression/test_issue3879.py b/spacy/tests/regression/test_issue3879.py deleted file mode 100644 index 5cd245231..000000000 --- a/spacy/tests/regression/test_issue3879.py +++ /dev/null @@ -1,14 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from spacy.matcher import Matcher -from spacy.tokens import Doc - - -def test_issue3879(en_vocab): - doc = Doc(en_vocab, words=["This", "is", "a", "test", "."]) - assert len(doc) == 5 - pattern = [{"ORTH": "This", "OP": "?"}, {"OP": "?"}, {"ORTH": "test"}] - matcher = Matcher(en_vocab) - matcher.add("TEST", [pattern]) - assert len(matcher(doc)) == 2 # fails because of a FP match 'is a test' diff --git a/spacy/tests/regression/test_issue3880.py b/spacy/tests/regression/test_issue3880.py deleted file mode 100644 index c060473f5..000000000 --- a/spacy/tests/regression/test_issue3880.py +++ /dev/null @@ -1,24 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from spacy.lang.en import English -import pytest - - -@pytest.mark.filterwarnings("ignore::UserWarning") -def test_issue3880(): - """Test that `nlp.pipe()` works when an empty string ends the batch. - - Fixed in v7.0.5 of Thinc. 
- """ - texts = ["hello", "world", "", ""] - nlp = English() - nlp.add_pipe(nlp.create_pipe("parser")) - nlp.add_pipe(nlp.create_pipe("ner")) - nlp.add_pipe(nlp.create_pipe("tagger")) - nlp.get_pipe("parser").add_label("dep") - nlp.get_pipe("ner").add_label("PERSON") - nlp.get_pipe("tagger").add_label("NN") - nlp.begin_training() - for doc in nlp.pipe(texts): - pass diff --git a/spacy/tests/regression/test_issue3882.py b/spacy/tests/regression/test_issue3882.py deleted file mode 100644 index 1b2dcea25..000000000 --- a/spacy/tests/regression/test_issue3882.py +++ /dev/null @@ -1,15 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from spacy.displacy import parse_deps -from spacy.tokens import Doc - - -def test_issue3882(en_vocab): - """Test that displaCy doesn't serialize the doc.user_data when making a - copy of the Doc. - """ - doc = Doc(en_vocab, words=["Hello", "world"]) - doc.is_parsed = True - doc.user_data["test"] = set() - parse_deps(doc) diff --git a/spacy/tests/regression/test_issue3951.py b/spacy/tests/regression/test_issue3951.py deleted file mode 100644 index 33230112f..000000000 --- a/spacy/tests/regression/test_issue3951.py +++ /dev/null @@ -1,20 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from spacy.matcher import Matcher -from spacy.tokens import Doc - - -def test_issue3951(en_vocab): - """Test that combinations of optional rules are matched correctly.""" - matcher = Matcher(en_vocab) - pattern = [ - {"LOWER": "hello"}, - {"LOWER": "this", "OP": "?"}, - {"OP": "?"}, - {"LOWER": "world"}, - ] - matcher.add("TEST", [pattern]) - doc = Doc(en_vocab, words=["Hello", "my", "new", "world"]) - matches = matcher(doc) - assert len(matches) == 0 diff --git a/spacy/tests/regression/test_issue3959.py b/spacy/tests/regression/test_issue3959.py deleted file mode 100644 index c1f7fe100..000000000 --- a/spacy/tests/regression/test_issue3959.py +++ /dev/null @@ -1,29 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from spacy.lang.en import English -from ..util import make_tempdir - - -def test_issue3959(): - """ Ensure that a modified pos attribute is serialized correctly.""" - nlp = English() - doc = nlp( - "displaCy uses JavaScript, SVG and CSS to show you how computers understand language" - ) - assert doc[0].pos_ == "" - - doc[0].pos_ = "NOUN" - assert doc[0].pos_ == "NOUN" - - # usually this is already True when starting from proper models instead of blank English - doc.is_tagged = True - - with make_tempdir() as tmp_dir: - file_path = tmp_dir / "my_doc" - doc.to_disk(file_path) - - doc2 = nlp("") - doc2.from_disk(file_path) - - assert doc2[0].pos_ == "NOUN" diff --git a/spacy/tests/regression/test_issue3962.py b/spacy/tests/regression/test_issue3962.py deleted file mode 100644 index ae60fa0fa..000000000 --- a/spacy/tests/regression/test_issue3962.py +++ /dev/null @@ -1,120 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest - -from ..util import get_doc - - -@pytest.fixture -def doc(en_tokenizer): - text = "He jests at scars, that never felt a wound." - heads = [1, 6, -1, -1, 3, 2, 1, 0, 1, -2, -3] - deps = [ - "nsubj", - "ccomp", - "prep", - "pobj", - "punct", - "nsubj", - "neg", - "ROOT", - "det", - "dobj", - "punct", - ] - tokens = en_tokenizer(text) - return get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps) - - -def test_issue3962(doc): - """ Ensure that as_doc does not result in out-of-bound access of tokens. 
- This is achieved by setting the head to itself if it would lie out of the span otherwise.""" - span2 = doc[1:5] # "jests at scars ," - doc2 = span2.as_doc() - doc2_json = doc2.to_json() - assert doc2_json - - assert ( - doc2[0].head.text == "jests" - ) # head set to itself, being the new artificial root - assert doc2[0].dep_ == "dep" - assert doc2[1].head.text == "jests" - assert doc2[1].dep_ == "prep" - assert doc2[2].head.text == "at" - assert doc2[2].dep_ == "pobj" - assert doc2[3].head.text == "jests" # head set to the new artificial root - assert doc2[3].dep_ == "dep" - - # We should still have 1 sentence - assert len(list(doc2.sents)) == 1 - - span3 = doc[6:9] # "never felt a" - doc3 = span3.as_doc() - doc3_json = doc3.to_json() - assert doc3_json - - assert doc3[0].head.text == "felt" - assert doc3[0].dep_ == "neg" - assert doc3[1].head.text == "felt" - assert doc3[1].dep_ == "ROOT" - assert doc3[2].head.text == "felt" # head set to ancestor - assert doc3[2].dep_ == "dep" - - # We should still have 1 sentence as "a" can be attached to "felt" instead of "wound" - assert len(list(doc3.sents)) == 1 - - -@pytest.fixture -def two_sent_doc(en_tokenizer): - text = "He jests at scars. They never felt a wound." - heads = [1, 0, -1, -1, -3, 2, 1, 0, 1, -2, -3] - deps = [ - "nsubj", - "ROOT", - "prep", - "pobj", - "punct", - "nsubj", - "neg", - "ROOT", - "det", - "dobj", - "punct", - ] - tokens = en_tokenizer(text) - return get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps) - - -def test_issue3962_long(two_sent_doc): - """ Ensure that as_doc does not result in out-of-bound access of tokens. - This is achieved by setting the head to itself if it would lie out of the span otherwise.""" - span2 = two_sent_doc[1:7] # "jests at scars. They never" - doc2 = span2.as_doc() - doc2_json = doc2.to_json() - assert doc2_json - - assert ( - doc2[0].head.text == "jests" - ) # head set to itself, being the new artificial root (in sentence 1) - assert doc2[0].dep_ == "ROOT" - assert doc2[1].head.text == "jests" - assert doc2[1].dep_ == "prep" - assert doc2[2].head.text == "at" - assert doc2[2].dep_ == "pobj" - assert doc2[3].head.text == "jests" - assert doc2[3].dep_ == "punct" - assert ( - doc2[4].head.text == "They" - ) # head set to itself, being the new artificial root (in sentence 2) - assert doc2[4].dep_ == "dep" - assert ( - doc2[4].head.text == "They" - ) # head set to the new artificial head (in sentence 2) - assert doc2[4].dep_ == "dep" - - # We should still have 2 sentences - sents = list(doc2.sents) - assert len(sents) == 2 - assert sents[0].text == "jests at scars ." - assert sents[1].text == "They never" diff --git a/spacy/tests/regression/test_issue3972.py b/spacy/tests/regression/test_issue3972.py deleted file mode 100644 index 22b8d486e..000000000 --- a/spacy/tests/regression/test_issue3972.py +++ /dev/null @@ -1,22 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from spacy.matcher import PhraseMatcher -from spacy.tokens import Doc - - -def test_issue3972(en_vocab): - """Test that the PhraseMatcher returns duplicates for duplicate match IDs. 
- """ - matcher = PhraseMatcher(en_vocab) - matcher.add("A", [Doc(en_vocab, words=["New", "York"])]) - matcher.add("B", [Doc(en_vocab, words=["New", "York"])]) - doc = Doc(en_vocab, words=["I", "live", "in", "New", "York"]) - matches = matcher(doc) - - assert len(matches) == 2 - - # We should have a match for each of the two rules - found_ids = [en_vocab.strings[ent_id] for (ent_id, _, _) in matches] - assert "A" in found_ids - assert "B" in found_ids diff --git a/spacy/tests/regression/test_issue4001-4500.py b/spacy/tests/regression/test_issue4001-4500.py new file mode 100644 index 000000000..73aea5b4b --- /dev/null +++ b/spacy/tests/regression/test_issue4001-4500.py @@ -0,0 +1,436 @@ +import pytest +from spacy.pipeline import TrainablePipe +from spacy.matcher import PhraseMatcher, Matcher +from spacy.tokens import Doc, Span, DocBin +from spacy.training import Example, Corpus +from spacy.training.converters import json_to_docs +from spacy.vocab import Vocab +from spacy.lang.en import English +from spacy.util import minibatch, ensure_path, load_model +from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex +from spacy.tokenizer import Tokenizer +from spacy.lang.el import Greek +from spacy.language import Language +import spacy +from thinc.api import compounding +from collections import defaultdict + +from ..util import make_tempdir + + +def test_issue4002(en_vocab): + """Test that the PhraseMatcher can match on overwritten NORM attributes.""" + matcher = PhraseMatcher(en_vocab, attr="NORM") + pattern1 = Doc(en_vocab, words=["c", "d"]) + assert [t.norm_ for t in pattern1] == ["c", "d"] + matcher.add("TEST", [pattern1]) + doc = Doc(en_vocab, words=["a", "b", "c", "d"]) + assert [t.norm_ for t in doc] == ["a", "b", "c", "d"] + matches = matcher(doc) + assert len(matches) == 1 + matcher = PhraseMatcher(en_vocab, attr="NORM") + pattern2 = Doc(en_vocab, words=["1", "2"]) + pattern2[0].norm_ = "c" + pattern2[1].norm_ = "d" + assert [t.norm_ for t in pattern2] == ["c", "d"] + matcher.add("TEST", [pattern2]) + matches = matcher(doc) + assert len(matches) == 1 + + +def test_issue4030(): + """ Test whether textcat works fine with empty doc """ + unique_classes = ["offensive", "inoffensive"] + x_train = [ + "This is an offensive text", + "This is the second offensive text", + "inoff", + ] + y_train = ["offensive", "offensive", "inoffensive"] + nlp = spacy.blank("en") + # preparing the data + train_data = [] + for text, train_instance in zip(x_train, y_train): + cat_dict = {label: label == train_instance for label in unique_classes} + train_data.append(Example.from_dict(nlp.make_doc(text), {"cats": cat_dict})) + # add a text categorizer component + model = { + "@architectures": "spacy.TextCatBOW.v1", + "exclusive_classes": True, + "ngram_size": 2, + "no_output_layer": False, + } + textcat = nlp.add_pipe("textcat", config={"model": model}, last=True) + for label in unique_classes: + textcat.add_label(label) + # training the network + with nlp.select_pipes(enable="textcat"): + optimizer = nlp.initialize() + for i in range(3): + losses = {} + batches = minibatch(train_data, size=compounding(4.0, 32.0, 1.001)) + + for batch in batches: + nlp.update(examples=batch, sgd=optimizer, drop=0.1, losses=losses) + # processing of an empty doc should result in 0.0 for all categories + doc = nlp("") + assert doc.cats["offensive"] == 0.0 + assert doc.cats["inoffensive"] == 0.0 + + +def test_issue4042(): + """Test that serialization of an EntityRuler before NER works fine.""" + nlp = 
English() + # add ner pipe + ner = nlp.add_pipe("ner") + ner.add_label("SOME_LABEL") + nlp.initialize() + # Add entity ruler + patterns = [ + {"label": "MY_ORG", "pattern": "Apple"}, + {"label": "MY_GPE", "pattern": [{"lower": "san"}, {"lower": "francisco"}]}, + ] + # works fine with "after" + ruler = nlp.add_pipe("entity_ruler", before="ner") + ruler.add_patterns(patterns) + doc1 = nlp("What do you think about Apple ?") + assert doc1.ents[0].label_ == "MY_ORG" + + with make_tempdir() as d: + output_dir = ensure_path(d) + if not output_dir.exists(): + output_dir.mkdir() + nlp.to_disk(output_dir) + nlp2 = load_model(output_dir) + doc2 = nlp2("What do you think about Apple ?") + assert doc2.ents[0].label_ == "MY_ORG" + + +def test_issue4042_bug2(): + """ + Test that serialization of an NER works fine when new labels were added. + This is the second bug of two bugs underlying the issue 4042. + """ + nlp1 = English() + # add ner pipe + ner1 = nlp1.add_pipe("ner") + ner1.add_label("SOME_LABEL") + nlp1.initialize() + # add a new label to the doc + doc1 = nlp1("What do you think about Apple ?") + assert len(ner1.labels) == 1 + assert "SOME_LABEL" in ner1.labels + apple_ent = Span(doc1, 5, 6, label="MY_ORG") + doc1.ents = list(doc1.ents) + [apple_ent] + # reapply the NER - at this point it should resize itself + ner1(doc1) + assert len(ner1.labels) == 2 + assert "SOME_LABEL" in ner1.labels + assert "MY_ORG" in ner1.labels + with make_tempdir() as d: + # assert IO goes fine + output_dir = ensure_path(d) + if not output_dir.exists(): + output_dir.mkdir() + ner1.to_disk(output_dir) + config = {} + ner2 = nlp1.create_pipe("ner", config=config) + ner2.from_disk(output_dir) + assert len(ner2.labels) == 2 + + +def test_issue4054(en_vocab): + """Test that a new blank model can be made with a vocab from file, + and that serialization does not drop the language at any point.""" + nlp1 = English() + vocab1 = nlp1.vocab + with make_tempdir() as d: + vocab_dir = ensure_path(d / "vocab") + if not vocab_dir.exists(): + vocab_dir.mkdir() + vocab1.to_disk(vocab_dir) + vocab2 = Vocab().from_disk(vocab_dir) + nlp2 = spacy.blank("en", vocab=vocab2) + nlp_dir = ensure_path(d / "nlp") + if not nlp_dir.exists(): + nlp_dir.mkdir() + nlp2.to_disk(nlp_dir) + nlp3 = load_model(nlp_dir) + assert nlp3.lang == "en" + + +def test_issue4120(en_vocab): + """Test that matches without a final {OP: ?} token are returned.""" + matcher = Matcher(en_vocab) + matcher.add("TEST", [[{"ORTH": "a"}, {"OP": "?"}]]) + doc1 = Doc(en_vocab, words=["a"]) + assert len(matcher(doc1)) == 1 # works + doc2 = Doc(en_vocab, words=["a", "b", "c"]) + assert len(matcher(doc2)) == 2 # fixed + matcher = Matcher(en_vocab) + matcher.add("TEST", [[{"ORTH": "a"}, {"OP": "?"}, {"ORTH": "b"}]]) + doc3 = Doc(en_vocab, words=["a", "b", "b", "c"]) + assert len(matcher(doc3)) == 2 # works + matcher = Matcher(en_vocab) + matcher.add("TEST", [[{"ORTH": "a"}, {"OP": "?"}, {"ORTH": "b", "OP": "?"}]]) + doc4 = Doc(en_vocab, words=["a", "b", "b", "c"]) + assert len(matcher(doc4)) == 3 # fixed + + +def test_issue4133(en_vocab): + nlp = English() + vocab_bytes = nlp.vocab.to_bytes() + words = ["Apple", "is", "looking", "at", "buying", "a", "startup"] + pos = ["NOUN", "VERB", "ADP", "VERB", "PROPN", "NOUN", "ADP"] + doc = Doc(en_vocab, words=words) + for i, token in enumerate(doc): + token.pos_ = pos[i] + # usually this is already True when starting from proper models instead of blank English + doc_bytes = doc.to_bytes() + vocab = Vocab() + vocab = 
vocab.from_bytes(vocab_bytes) + doc = Doc(vocab).from_bytes(doc_bytes) + actual = [] + for token in doc: + actual.append(token.pos_) + assert actual == pos + + +def test_issue4190(): + def customize_tokenizer(nlp): + prefix_re = compile_prefix_regex(nlp.Defaults.prefixes) + suffix_re = compile_suffix_regex(nlp.Defaults.suffixes) + infix_re = compile_infix_regex(nlp.Defaults.infixes) + # Remove all exceptions where a single letter is followed by a period (e.g. 'h.') + exceptions = { + k: v + for k, v in dict(nlp.Defaults.tokenizer_exceptions).items() + if not (len(k) == 2 and k[1] == ".") + } + new_tokenizer = Tokenizer( + nlp.vocab, + exceptions, + prefix_search=prefix_re.search, + suffix_search=suffix_re.search, + infix_finditer=infix_re.finditer, + token_match=nlp.tokenizer.token_match, + ) + nlp.tokenizer = new_tokenizer + + test_string = "Test c." + # Load default language + nlp_1 = English() + doc_1a = nlp_1(test_string) + result_1a = [token.text for token in doc_1a] # noqa: F841 + # Modify tokenizer + customize_tokenizer(nlp_1) + doc_1b = nlp_1(test_string) + result_1b = [token.text for token in doc_1b] + # Save and Reload + with make_tempdir() as model_dir: + nlp_1.to_disk(model_dir) + nlp_2 = load_model(model_dir) + # This should be the modified tokenizer + doc_2 = nlp_2(test_string) + result_2 = [token.text for token in doc_2] + assert result_1b == result_2 + + +def test_issue4267(): + """ Test that running an entity_ruler after ner gives consistent results""" + nlp = English() + ner = nlp.add_pipe("ner") + ner.add_label("PEOPLE") + nlp.initialize() + assert "ner" in nlp.pipe_names + # assert that we have correct IOB annotations + doc1 = nlp("hi") + assert doc1.has_annotation("ENT_IOB") + for token in doc1: + assert token.ent_iob == 2 + # add entity ruler and run again + patterns = [{"label": "SOFTWARE", "pattern": "spacy"}] + ruler = nlp.add_pipe("entity_ruler") + ruler.add_patterns(patterns) + assert "entity_ruler" in nlp.pipe_names + assert "ner" in nlp.pipe_names + # assert that we still have correct IOB annotations + doc2 = nlp("hi") + assert doc2.has_annotation("ENT_IOB") + for token in doc2: + assert token.ent_iob == 2 + + +@pytest.mark.skip(reason="lemmatizer lookups no longer in vocab") +def test_issue4272(): + """Test that lookup table can be accessed from Token.lemma if no POS tags + are available.""" + nlp = Greek() + doc = nlp("Χθες") + assert doc[0].lemma_ + + +def test_multiple_predictions(): + class DummyPipe(TrainablePipe): + def __init__(self): + self.model = "dummy_model" + + def predict(self, docs): + return ([1, 2, 3], [4, 5, 6]) + + def set_annotations(self, docs, scores): + return docs + + nlp = Language() + doc = nlp.make_doc("foo") + dummy_pipe = DummyPipe() + dummy_pipe(doc) + + +@pytest.mark.skip(reason="removed Beam stuff during the Example/GoldParse refactor") +def test_issue4313(): + """ This should not crash or exit with some strange error code """ + beam_width = 16 + beam_density = 0.0001 + nlp = English() + config = {} + ner = nlp.create_pipe("ner", config=config) + ner.add_label("SOME_LABEL") + ner.initialize(lambda: []) + # add a new label to the doc + doc = nlp("What do you think about Apple ?") + assert len(ner.labels) == 1 + assert "SOME_LABEL" in ner.labels + apple_ent = Span(doc, 5, 6, label="MY_ORG") + doc.ents = list(doc.ents) + [apple_ent] + + # ensure the beam_parse still works with the new label + docs = [doc] + beams = nlp.entity.beam_parse( + docs, beam_width=beam_width, beam_density=beam_density + ) + + for doc, beam in zip(docs, 
beams): + entity_scores = defaultdict(float) + for score, ents in nlp.entity.moves.get_beam_parses(beam): + for start, end, label in ents: + entity_scores[(start, end, label)] += score + + +def test_issue4348(): + """Test that training the tagger with empty data, doesn't throw errors""" + nlp = English() + example = Example.from_dict(nlp.make_doc(""), {"tags": []}) + TRAIN_DATA = [example, example] + tagger = nlp.add_pipe("tagger") + tagger.add_label("A") + optimizer = nlp.initialize() + for i in range(5): + losses = {} + batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001)) + for batch in batches: + nlp.update(batch, sgd=optimizer, losses=losses) + + +def test_issue4367(): + """Test that docbin init goes well""" + DocBin() + DocBin(attrs=["LEMMA"]) + DocBin(attrs=["LEMMA", "ENT_IOB", "ENT_TYPE"]) + + +def test_issue4373(): + """Test that PhraseMatcher.vocab can be accessed (like Matcher.vocab).""" + matcher = Matcher(Vocab()) + assert isinstance(matcher.vocab, Vocab) + matcher = PhraseMatcher(Vocab()) + assert isinstance(matcher.vocab, Vocab) + + +def test_issue4402(): + json_data = { + "id": 0, + "paragraphs": [ + { + "raw": "How should I cook bacon in an oven?\nI've heard of people cooking bacon in an oven.", + "sentences": [ + { + "tokens": [ + {"id": 0, "orth": "How", "ner": "O"}, + {"id": 1, "orth": "should", "ner": "O"}, + {"id": 2, "orth": "I", "ner": "O"}, + {"id": 3, "orth": "cook", "ner": "O"}, + {"id": 4, "orth": "bacon", "ner": "O"}, + {"id": 5, "orth": "in", "ner": "O"}, + {"id": 6, "orth": "an", "ner": "O"}, + {"id": 7, "orth": "oven", "ner": "O"}, + {"id": 8, "orth": "?", "ner": "O"}, + ], + "brackets": [], + }, + { + "tokens": [ + {"id": 9, "orth": "\n", "ner": "O"}, + {"id": 10, "orth": "I", "ner": "O"}, + {"id": 11, "orth": "'ve", "ner": "O"}, + {"id": 12, "orth": "heard", "ner": "O"}, + {"id": 13, "orth": "of", "ner": "O"}, + {"id": 14, "orth": "people", "ner": "O"}, + {"id": 15, "orth": "cooking", "ner": "O"}, + {"id": 16, "orth": "bacon", "ner": "O"}, + {"id": 17, "orth": "in", "ner": "O"}, + {"id": 18, "orth": "an", "ner": "O"}, + {"id": 19, "orth": "oven", "ner": "O"}, + {"id": 20, "orth": ".", "ner": "O"}, + ], + "brackets": [], + }, + ], + "cats": [ + {"label": "baking", "value": 1.0}, + {"label": "not_baking", "value": 0.0}, + ], + }, + { + "raw": "What is the difference between white and brown eggs?\n", + "sentences": [ + { + "tokens": [ + {"id": 0, "orth": "What", "ner": "O"}, + {"id": 1, "orth": "is", "ner": "O"}, + {"id": 2, "orth": "the", "ner": "O"}, + {"id": 3, "orth": "difference", "ner": "O"}, + {"id": 4, "orth": "between", "ner": "O"}, + {"id": 5, "orth": "white", "ner": "O"}, + {"id": 6, "orth": "and", "ner": "O"}, + {"id": 7, "orth": "brown", "ner": "O"}, + {"id": 8, "orth": "eggs", "ner": "O"}, + {"id": 9, "orth": "?", "ner": "O"}, + ], + "brackets": [], + }, + {"tokens": [{"id": 10, "orth": "\n", "ner": "O"}], "brackets": []}, + ], + "cats": [ + {"label": "baking", "value": 0.0}, + {"label": "not_baking", "value": 1.0}, + ], + }, + ], + } + nlp = English() + attrs = ["ORTH", "SENT_START", "ENT_IOB", "ENT_TYPE"] + with make_tempdir() as tmpdir: + output_file = tmpdir / "test4402.spacy" + docs = json_to_docs([json_data]) + data = DocBin(docs=docs, attrs=attrs).to_bytes() + with output_file.open("wb") as file_: + file_.write(data) + reader = Corpus(output_file) + train_data = list(reader(nlp)) + assert len(train_data) == 2 + + split_train_data = [] + for eg in train_data: + split_train_data.extend(eg.split_sents()) + assert 
len(split_train_data) == 4 diff --git a/spacy/tests/regression/test_issue4002.py b/spacy/tests/regression/test_issue4002.py deleted file mode 100644 index d075128aa..000000000 --- a/spacy/tests/regression/test_issue4002.py +++ /dev/null @@ -1,26 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from spacy.matcher import PhraseMatcher -from spacy.tokens import Doc - - -def test_issue4002(en_vocab): - """Test that the PhraseMatcher can match on overwritten NORM attributes. - """ - matcher = PhraseMatcher(en_vocab, attr="NORM") - pattern1 = Doc(en_vocab, words=["c", "d"]) - assert [t.norm_ for t in pattern1] == ["c", "d"] - matcher.add("TEST", [pattern1]) - doc = Doc(en_vocab, words=["a", "b", "c", "d"]) - assert [t.norm_ for t in doc] == ["a", "b", "c", "d"] - matches = matcher(doc) - assert len(matches) == 1 - matcher = PhraseMatcher(en_vocab, attr="NORM") - pattern2 = Doc(en_vocab, words=["1", "2"]) - pattern2[0].norm_ = "c" - pattern2[1].norm_ = "d" - assert [t.norm_ for t in pattern2] == ["c", "d"] - matcher.add("TEST", [pattern2]) - matches = matcher(doc) - assert len(matches) == 1 diff --git a/spacy/tests/regression/test_issue4030.py b/spacy/tests/regression/test_issue4030.py deleted file mode 100644 index ed219573f..000000000 --- a/spacy/tests/regression/test_issue4030.py +++ /dev/null @@ -1,56 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import spacy -from spacy.util import minibatch, compounding - - -def test_issue4030(): - """ Test whether textcat works fine with empty doc """ - unique_classes = ["offensive", "inoffensive"] - x_train = [ - "This is an offensive text", - "This is the second offensive text", - "inoff", - ] - y_train = ["offensive", "offensive", "inoffensive"] - - # preparing the data - pos_cats = list() - for train_instance in y_train: - pos_cats.append({label: label == train_instance for label in unique_classes}) - train_data = list(zip(x_train, [{"cats": cats} for cats in pos_cats])) - - # set up the spacy model with a text categorizer component - nlp = spacy.blank("en") - - textcat = nlp.create_pipe( - "textcat", - config={"exclusive_classes": True, "architecture": "bow", "ngram_size": 2}, - ) - - for label in unique_classes: - textcat.add_label(label) - nlp.add_pipe(textcat, last=True) - - # training the network - with nlp.disable_pipes([p for p in nlp.pipe_names if p != "textcat"]): - optimizer = nlp.begin_training() - for i in range(3): - losses = {} - batches = minibatch(train_data, size=compounding(4.0, 32.0, 1.001)) - - for batch in batches: - texts, annotations = zip(*batch) - nlp.update( - docs=texts, - golds=annotations, - sgd=optimizer, - drop=0.1, - losses=losses, - ) - - # processing of an empty doc should result in 0.0 for all categories - doc = nlp("") - assert doc.cats["offensive"] == 0.0 - assert doc.cats["inoffensive"] == 0.0 diff --git a/spacy/tests/regression/test_issue4042.py b/spacy/tests/regression/test_issue4042.py deleted file mode 100644 index 00a8882d3..000000000 --- a/spacy/tests/regression/test_issue4042.py +++ /dev/null @@ -1,81 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import spacy -from spacy.pipeline import EntityRecognizer, EntityRuler -from spacy.lang.en import English -from spacy.tokens import Span -from spacy.util import ensure_path - -from ..util import make_tempdir - - -def test_issue4042(): - """Test that serialization of an EntityRuler before NER works fine.""" - nlp = English() - - # add ner pipe - ner = nlp.create_pipe("ner") - ner.add_label("SOME_LABEL") - 
nlp.add_pipe(ner) - nlp.begin_training() - - # Add entity ruler - ruler = EntityRuler(nlp) - patterns = [ - {"label": "MY_ORG", "pattern": "Apple"}, - {"label": "MY_GPE", "pattern": [{"lower": "san"}, {"lower": "francisco"}]}, - ] - ruler.add_patterns(patterns) - nlp.add_pipe(ruler, before="ner") # works fine with "after" - doc1 = nlp("What do you think about Apple ?") - assert doc1.ents[0].label_ == "MY_ORG" - - with make_tempdir() as d: - output_dir = ensure_path(d) - if not output_dir.exists(): - output_dir.mkdir() - nlp.to_disk(output_dir) - - nlp2 = spacy.load(output_dir) - doc2 = nlp2("What do you think about Apple ?") - assert doc2.ents[0].label_ == "MY_ORG" - - -def test_issue4042_bug2(): - """ - Test that serialization of an NER works fine when new labels were added. - This is the second bug of two bugs underlying the issue 4042. - """ - nlp1 = English() - vocab = nlp1.vocab - - # add ner pipe - ner1 = nlp1.create_pipe("ner") - ner1.add_label("SOME_LABEL") - nlp1.add_pipe(ner1) - nlp1.begin_training() - - # add a new label to the doc - doc1 = nlp1("What do you think about Apple ?") - assert len(ner1.labels) == 1 - assert "SOME_LABEL" in ner1.labels - apple_ent = Span(doc1, 5, 6, label="MY_ORG") - doc1.ents = list(doc1.ents) + [apple_ent] - - # reapply the NER - at this point it should resize itself - ner1(doc1) - assert len(ner1.labels) == 2 - assert "SOME_LABEL" in ner1.labels - assert "MY_ORG" in ner1.labels - - with make_tempdir() as d: - # assert IO goes fine - output_dir = ensure_path(d) - if not output_dir.exists(): - output_dir.mkdir() - ner1.to_disk(output_dir) - - ner2 = EntityRecognizer(vocab) - ner2.from_disk(output_dir) - assert len(ner2.labels) == 2 diff --git a/spacy/tests/regression/test_issue4054.py b/spacy/tests/regression/test_issue4054.py deleted file mode 100644 index cc84cebf8..000000000 --- a/spacy/tests/regression/test_issue4054.py +++ /dev/null @@ -1,33 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from spacy.vocab import Vocab -import spacy -from spacy.lang.en import English -from spacy.util import ensure_path - -from ..util import make_tempdir - - -def test_issue4054(en_vocab): - """Test that a new blank model can be made with a vocab from file, - and that serialization does not drop the language at any point.""" - nlp1 = English() - vocab1 = nlp1.vocab - - with make_tempdir() as d: - vocab_dir = ensure_path(d / "vocab") - if not vocab_dir.exists(): - vocab_dir.mkdir() - vocab1.to_disk(vocab_dir) - - vocab2 = Vocab().from_disk(vocab_dir) - print("lang", vocab2.lang) - nlp2 = spacy.blank("en", vocab=vocab2) - - nlp_dir = ensure_path(d / "nlp") - if not nlp_dir.exists(): - nlp_dir.mkdir() - nlp2.to_disk(nlp_dir) - nlp3 = spacy.load(nlp_dir) - assert nlp3.lang == "en" diff --git a/spacy/tests/regression/test_issue4120.py b/spacy/tests/regression/test_issue4120.py deleted file mode 100644 index d288f46c4..000000000 --- a/spacy/tests/regression/test_issue4120.py +++ /dev/null @@ -1,26 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from spacy.matcher import Matcher -from spacy.tokens import Doc - - -def test_issue4120(en_vocab): - """Test that matches without a final {OP: ?} token are returned.""" - matcher = Matcher(en_vocab) - matcher.add("TEST", [[{"ORTH": "a"}, {"OP": "?"}]]) - doc1 = Doc(en_vocab, words=["a"]) - assert len(matcher(doc1)) == 1 # works - - doc2 = Doc(en_vocab, words=["a", "b", "c"]) - assert len(matcher(doc2)) == 2 # fixed - - matcher = Matcher(en_vocab) - matcher.add("TEST", [[{"ORTH": "a"}, {"OP": 
"?"}, {"ORTH": "b"}]]) - doc3 = Doc(en_vocab, words=["a", "b", "b", "c"]) - assert len(matcher(doc3)) == 2 # works - - matcher = Matcher(en_vocab) - matcher.add("TEST", [[{"ORTH": "a"}, {"OP": "?"}, {"ORTH": "b", "OP": "?"}]]) - doc4 = Doc(en_vocab, words=["a", "b", "b", "c"]) - assert len(matcher(doc4)) == 3 # fixed diff --git a/spacy/tests/regression/test_issue4133.py b/spacy/tests/regression/test_issue4133.py deleted file mode 100644 index 93262f8cf..000000000 --- a/spacy/tests/regression/test_issue4133.py +++ /dev/null @@ -1,31 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from spacy.lang.en import English -from spacy.tokens import Doc -from spacy.vocab import Vocab - - -def test_issue4133(en_vocab): - nlp = English() - vocab_bytes = nlp.vocab.to_bytes() - words = ["Apple", "is", "looking", "at", "buying", "a", "startup"] - pos = ["NOUN", "VERB", "ADP", "VERB", "PROPN", "NOUN", "ADP"] - doc = Doc(en_vocab, words=words) - for i, token in enumerate(doc): - token.pos_ = pos[i] - - # usually this is already True when starting from proper models instead of blank English - doc.is_tagged = True - - doc_bytes = doc.to_bytes() - - vocab = Vocab() - vocab = vocab.from_bytes(vocab_bytes) - doc = Doc(vocab).from_bytes(doc_bytes) - - actual = [] - for token in doc: - actual.append(token.pos_) - - assert actual == pos diff --git a/spacy/tests/regression/test_issue4190.py b/spacy/tests/regression/test_issue4190.py deleted file mode 100644 index eb4eb8648..000000000 --- a/spacy/tests/regression/test_issue4190.py +++ /dev/null @@ -1,49 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from spacy.lang.en import English -from spacy.tokenizer import Tokenizer -from spacy import util - -from ..util import make_tempdir - - -def test_issue4190(): - test_string = "Test c." - # Load default language - nlp_1 = English() - doc_1a = nlp_1(test_string) - result_1a = [token.text for token in doc_1a] # noqa: F841 - # Modify tokenizer - customize_tokenizer(nlp_1) - doc_1b = nlp_1(test_string) - result_1b = [token.text for token in doc_1b] - # Save and Reload - with make_tempdir() as model_dir: - nlp_1.to_disk(model_dir) - nlp_2 = util.load_model(model_dir) - # This should be the modified tokenizer - doc_2 = nlp_2(test_string) - result_2 = [token.text for token in doc_2] - assert result_1b == result_2 - - -def customize_tokenizer(nlp): - prefix_re = util.compile_prefix_regex(nlp.Defaults.prefixes) - suffix_re = util.compile_suffix_regex(nlp.Defaults.suffixes) - infix_re = util.compile_infix_regex(nlp.Defaults.infixes) - # Remove all exceptions where a single letter is followed by a period (e.g. 
'h.') - exceptions = { - k: v - for k, v in dict(nlp.Defaults.tokenizer_exceptions).items() - if not (len(k) == 2 and k[1] == ".") - } - new_tokenizer = Tokenizer( - nlp.vocab, - exceptions, - prefix_search=prefix_re.search, - suffix_search=suffix_re.search, - infix_finditer=infix_re.finditer, - token_match=nlp.tokenizer.token_match, - ) - nlp.tokenizer = new_tokenizer diff --git a/spacy/tests/regression/test_issue4267.py b/spacy/tests/regression/test_issue4267.py deleted file mode 100644 index ef871bf9f..000000000 --- a/spacy/tests/regression/test_issue4267.py +++ /dev/null @@ -1,37 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from spacy.lang.en import English -from spacy.pipeline import EntityRuler - - -def test_issue4267(): - """ Test that running an entity_ruler after ner gives consistent results""" - nlp = English() - ner = nlp.create_pipe("ner") - ner.add_label("PEOPLE") - nlp.add_pipe(ner) - nlp.begin_training() - - assert "ner" in nlp.pipe_names - - # assert that we have correct IOB annotations - doc1 = nlp("hi") - assert doc1.is_nered - for token in doc1: - assert token.ent_iob == 2 - - # add entity ruler and run again - ruler = EntityRuler(nlp) - patterns = [{"label": "SOFTWARE", "pattern": "spacy"}] - - ruler.add_patterns(patterns) - nlp.add_pipe(ruler) - assert "entity_ruler" in nlp.pipe_names - assert "ner" in nlp.pipe_names - - # assert that we still have correct IOB annotations - doc2 = nlp("hi") - assert doc2.is_nered - for token in doc2: - assert token.ent_iob == 2 diff --git a/spacy/tests/regression/test_issue4272.py b/spacy/tests/regression/test_issue4272.py deleted file mode 100644 index c57704d71..000000000 --- a/spacy/tests/regression/test_issue4272.py +++ /dev/null @@ -1,12 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from spacy.lang.el import Greek - - -def test_issue4272(): - """Test that lookup table can be accessed from Token.lemma if no POS tags - are available.""" - nlp = Greek() - doc = nlp("Χθες") - assert doc[0].lemma_ diff --git a/spacy/tests/regression/test_issue4278.py b/spacy/tests/regression/test_issue4278.py deleted file mode 100644 index cb09340ff..000000000 --- a/spacy/tests/regression/test_issue4278.py +++ /dev/null @@ -1,28 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest -from spacy.language import Language -from spacy.pipeline import Pipe - - -class DummyPipe(Pipe): - def __init__(self): - self.model = "dummy_model" - - def predict(self, docs): - return ([1, 2, 3], [4, 5, 6]) - - def set_annotations(self, docs, scores, tensors=None): - return docs - - -@pytest.fixture -def nlp(): - return Language() - - -def test_multiple_predictions(nlp): - doc = nlp.make_doc("foo") - dummy_pipe = DummyPipe() - dummy_pipe(doc) diff --git a/spacy/tests/regression/test_issue4313.py b/spacy/tests/regression/test_issue4313.py deleted file mode 100644 index c68f745a7..000000000 --- a/spacy/tests/regression/test_issue4313.py +++ /dev/null @@ -1,39 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from collections import defaultdict - -from spacy.pipeline import EntityRecognizer - -from spacy.lang.en import English -from spacy.tokens import Span - - -def test_issue4313(): - """ This should not crash or exit with some strange error code """ - beam_width = 16 - beam_density = 0.0001 - nlp = English() - ner = EntityRecognizer(nlp.vocab) - ner.add_label("SOME_LABEL") - ner.begin_training([]) - nlp.add_pipe(ner) - - # add a new label to the doc - doc = nlp("What do you think 
about Apple ?") - assert len(ner.labels) == 1 - assert "SOME_LABEL" in ner.labels - apple_ent = Span(doc, 5, 6, label="MY_ORG") - doc.ents = list(doc.ents) + [apple_ent] - - # ensure the beam_parse still works with the new label - docs = [doc] - beams = nlp.entity.beam_parse( - docs, beam_width=beam_width, beam_density=beam_density - ) - - for doc, beam in zip(docs, beams): - entity_scores = defaultdict(float) - for score, ents in nlp.entity.moves.get_beam_parses(beam): - for start, end, label in ents: - entity_scores[(start, end, label)] += score diff --git a/spacy/tests/regression/test_issue4348.py b/spacy/tests/regression/test_issue4348.py deleted file mode 100644 index d2e27d563..000000000 --- a/spacy/tests/regression/test_issue4348.py +++ /dev/null @@ -1,25 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from spacy.lang.en import English -from spacy.util import minibatch, compounding -import pytest - - -@pytest.mark.filterwarnings("ignore::UserWarning") -def test_issue4348(): - """Test that training the tagger with empty data, doesn't throw errors""" - - TRAIN_DATA = [("", {"tags": []}), ("", {"tags": []})] - - nlp = English() - tagger = nlp.create_pipe("tagger") - nlp.add_pipe(tagger) - - optimizer = nlp.begin_training() - for i in range(5): - losses = {} - batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001)) - for batch in batches: - texts, annotations = zip(*batch) - nlp.update(texts, annotations, sgd=optimizer, losses=losses) diff --git a/spacy/tests/regression/test_issue4367.py b/spacy/tests/regression/test_issue4367.py deleted file mode 100644 index ab6192744..000000000 --- a/spacy/tests/regression/test_issue4367.py +++ /dev/null @@ -1,11 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from spacy.tokens import DocBin - - -def test_issue4367(): - """Test that docbin init goes well""" - DocBin() - DocBin(attrs=["LEMMA"]) - DocBin(attrs=["LEMMA", "ENT_IOB", "ENT_TYPE"]) diff --git a/spacy/tests/regression/test_issue4373.py b/spacy/tests/regression/test_issue4373.py deleted file mode 100644 index 57d7547da..000000000 --- a/spacy/tests/regression/test_issue4373.py +++ /dev/null @@ -1,13 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from spacy.matcher import Matcher, PhraseMatcher -from spacy.vocab import Vocab - - -def test_issue4373(): - """Test that PhraseMatcher.vocab can be accessed (like Matcher.vocab).""" - matcher = Matcher(Vocab()) - assert isinstance(matcher.vocab, Vocab) - matcher = PhraseMatcher(Vocab()) - assert isinstance(matcher.vocab, Vocab) diff --git a/spacy/tests/regression/test_issue4402.py b/spacy/tests/regression/test_issue4402.py deleted file mode 100644 index d3b4bdf9a..000000000 --- a/spacy/tests/regression/test_issue4402.py +++ /dev/null @@ -1,96 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import srsly -from spacy.gold import GoldCorpus -from spacy.lang.en import English - -from ..util import make_tempdir - - -def test_issue4402(): - nlp = English() - with make_tempdir() as tmpdir: - print("temp", tmpdir) - json_path = tmpdir / "test4402.json" - srsly.write_json(json_path, json_data) - - corpus = GoldCorpus(str(json_path), str(json_path)) - - train_docs = list(corpus.train_docs(nlp, gold_preproc=True, max_length=0)) - # assert that the data got split into 4 sentences - assert len(train_docs) == 4 - - -json_data = [ - { - "id": 0, - "paragraphs": [ - { - "raw": "How should I cook bacon in an oven?\nI've heard of people cooking bacon in an oven.", - "sentences": 
[ - { - "tokens": [ - {"id": 0, "orth": "How", "ner": "O"}, - {"id": 1, "orth": "should", "ner": "O"}, - {"id": 2, "orth": "I", "ner": "O"}, - {"id": 3, "orth": "cook", "ner": "O"}, - {"id": 4, "orth": "bacon", "ner": "O"}, - {"id": 5, "orth": "in", "ner": "O"}, - {"id": 6, "orth": "an", "ner": "O"}, - {"id": 7, "orth": "oven", "ner": "O"}, - {"id": 8, "orth": "?", "ner": "O"}, - ], - "brackets": [], - }, - { - "tokens": [ - {"id": 9, "orth": "\n", "ner": "O"}, - {"id": 10, "orth": "I", "ner": "O"}, - {"id": 11, "orth": "'ve", "ner": "O"}, - {"id": 12, "orth": "heard", "ner": "O"}, - {"id": 13, "orth": "of", "ner": "O"}, - {"id": 14, "orth": "people", "ner": "O"}, - {"id": 15, "orth": "cooking", "ner": "O"}, - {"id": 16, "orth": "bacon", "ner": "O"}, - {"id": 17, "orth": "in", "ner": "O"}, - {"id": 18, "orth": "an", "ner": "O"}, - {"id": 19, "orth": "oven", "ner": "O"}, - {"id": 20, "orth": ".", "ner": "O"}, - ], - "brackets": [], - }, - ], - "cats": [ - {"label": "baking", "value": 1.0}, - {"label": "not_baking", "value": 0.0}, - ], - }, - { - "raw": "What is the difference between white and brown eggs?\n", - "sentences": [ - { - "tokens": [ - {"id": 0, "orth": "What", "ner": "O"}, - {"id": 1, "orth": "is", "ner": "O"}, - {"id": 2, "orth": "the", "ner": "O"}, - {"id": 3, "orth": "difference", "ner": "O"}, - {"id": 4, "orth": "between", "ner": "O"}, - {"id": 5, "orth": "white", "ner": "O"}, - {"id": 6, "orth": "and", "ner": "O"}, - {"id": 7, "orth": "brown", "ner": "O"}, - {"id": 8, "orth": "eggs", "ner": "O"}, - {"id": 9, "orth": "?", "ner": "O"}, - ], - "brackets": [], - }, - {"tokens": [{"id": 10, "orth": "\n", "ner": "O"}], "brackets": []}, - ], - "cats": [ - {"label": "baking", "value": 0.0}, - {"label": "not_baking", "value": 1.0}, - ], - }, - ], - } -] diff --git a/spacy/tests/regression/test_issue4501-5000.py b/spacy/tests/regression/test_issue4501-5000.py new file mode 100644 index 000000000..6dbbc233b --- /dev/null +++ b/spacy/tests/regression/test_issue4501-5000.py @@ -0,0 +1,251 @@ +import pytest +from spacy.tokens import Doc, Span, DocBin +from spacy.training import Example +from spacy.training.converters.conllu_to_docs import conllu_to_docs +from spacy.lang.en import English +from spacy.kb import KnowledgeBase +from spacy.vocab import Vocab +from spacy.language import Language +from spacy.util import ensure_path, load_model_from_path +import numpy +import pickle + +from ..util import make_tempdir + + +def test_issue4528(en_vocab): + """Test that user_data is correctly serialized in DocBin.""" + doc = Doc(en_vocab, words=["hello", "world"]) + doc.user_data["foo"] = "bar" + # This is how extension attribute values are stored in the user data + doc.user_data[("._.", "foo", None, None)] = "bar" + doc_bin = DocBin(store_user_data=True) + doc_bin.add(doc) + doc_bin_bytes = doc_bin.to_bytes() + new_doc_bin = DocBin(store_user_data=True).from_bytes(doc_bin_bytes) + new_doc = list(new_doc_bin.get_docs(en_vocab))[0] + assert new_doc.user_data["foo"] == "bar" + assert new_doc.user_data[("._.", "foo", None, None)] == "bar" + + +@pytest.mark.parametrize( + "text,words", [("A'B C", ["A", "'", "B", "C"]), ("A-B", ["A-B"])] +) +def test_gold_misaligned(en_tokenizer, text, words): + doc = en_tokenizer(text) + Example.from_dict(doc, {"words": words}) + + +def test_issue4651_with_phrase_matcher_attr(): + """Test that the EntityRuler PhraseMatcher is deserialized correctly using + the method from_disk when the EntityRuler argument phrase_matcher_attr is + specified. 
+ """ + text = "Spacy is a python library for nlp" + nlp = English() + patterns = [{"label": "PYTHON_LIB", "pattern": "spacy", "id": "spaCy"}] + ruler = nlp.add_pipe("entity_ruler", config={"phrase_matcher_attr": "LOWER"}) + ruler.add_patterns(patterns) + doc = nlp(text) + res = [(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents] + nlp_reloaded = English() + with make_tempdir() as d: + file_path = d / "entityruler" + ruler.to_disk(file_path) + nlp_reloaded.add_pipe("entity_ruler").from_disk(file_path) + doc_reloaded = nlp_reloaded(text) + res_reloaded = [(ent.text, ent.label_, ent.ent_id_) for ent in doc_reloaded.ents] + assert res == res_reloaded + + +def test_issue4651_without_phrase_matcher_attr(): + """Test that the EntityRuler PhraseMatcher is deserialized correctly using + the method from_disk when the EntityRuler argument phrase_matcher_attr is + not specified. + """ + text = "Spacy is a python library for nlp" + nlp = English() + patterns = [{"label": "PYTHON_LIB", "pattern": "spacy", "id": "spaCy"}] + ruler = nlp.add_pipe("entity_ruler") + ruler.add_patterns(patterns) + doc = nlp(text) + res = [(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents] + nlp_reloaded = English() + with make_tempdir() as d: + file_path = d / "entityruler" + ruler.to_disk(file_path) + nlp_reloaded.add_pipe("entity_ruler").from_disk(file_path) + doc_reloaded = nlp_reloaded(text) + res_reloaded = [(ent.text, ent.label_, ent.ent_id_) for ent in doc_reloaded.ents] + assert res == res_reloaded + + +def test_issue4665(): + """ + conllu_to_docs should not raise an exception if the HEAD column contains an + underscore + """ + input_data = """ +1 [ _ PUNCT -LRB- _ _ punct _ _ +2 This _ DET DT _ _ det _ _ +3 killing _ NOUN NN _ _ nsubj _ _ +4 of _ ADP IN _ _ case _ _ +5 a _ DET DT _ _ det _ _ +6 respected _ ADJ JJ _ _ amod _ _ +7 cleric _ NOUN NN _ _ nmod _ _ +8 will _ AUX MD _ _ aux _ _ +9 be _ AUX VB _ _ aux _ _ +10 causing _ VERB VBG _ _ root _ _ +11 us _ PRON PRP _ _ iobj _ _ +12 trouble _ NOUN NN _ _ dobj _ _ +13 for _ ADP IN _ _ case _ _ +14 years _ NOUN NNS _ _ nmod _ _ +15 to _ PART TO _ _ mark _ _ +16 come _ VERB VB _ _ acl _ _ +17 . _ PUNCT . _ _ punct _ _ +18 ] _ PUNCT -RRB- _ _ punct _ _ +""" + conllu_to_docs(input_data) + + +def test_issue4674(): + """Test that setting entities with overlapping identifiers does not mess up IO""" + nlp = English() + kb = KnowledgeBase(nlp.vocab, entity_vector_length=3) + vector1 = [0.9, 1.1, 1.01] + vector2 = [1.8, 2.25, 2.01] + with pytest.warns(UserWarning): + kb.set_entities( + entity_list=["Q1", "Q1"], + freq_list=[32, 111], + vector_list=[vector1, vector2], + ) + assert kb.get_size_entities() == 1 + # dumping to file & loading back in + with make_tempdir() as d: + dir_path = ensure_path(d) + if not dir_path.exists(): + dir_path.mkdir() + file_path = dir_path / "kb" + kb.to_disk(str(file_path)) + kb2 = KnowledgeBase(nlp.vocab, entity_vector_length=3) + kb2.from_disk(str(file_path)) + assert kb2.get_size_entities() == 1 + + +@pytest.mark.skip(reason="API change: disable just disables, new exclude arg") +def test_issue4707(): + """Tests that disabled component names are also excluded from nlp.from_disk + by default when loading a model. 
+ """ + nlp = English() + nlp.add_pipe("sentencizer") + nlp.add_pipe("entity_ruler") + assert nlp.pipe_names == ["sentencizer", "entity_ruler"] + exclude = ["tokenizer", "sentencizer"] + with make_tempdir() as tmpdir: + nlp.to_disk(tmpdir, exclude=exclude) + new_nlp = load_model_from_path(tmpdir, disable=exclude) + assert "sentencizer" not in new_nlp.pipe_names + assert "entity_ruler" in new_nlp.pipe_names + + +def test_issue4725_1(): + """ Ensure the pickling of the NER goes well""" + vocab = Vocab(vectors_name="test_vocab_add_vector") + nlp = English(vocab=vocab) + config = { + "update_with_oracle_cut_size": 111, + } + ner = nlp.create_pipe("ner", config=config) + with make_tempdir() as tmp_path: + with (tmp_path / "ner.pkl").open("wb") as file_: + pickle.dump(ner, file_) + assert ner.cfg["update_with_oracle_cut_size"] == 111 + + with (tmp_path / "ner.pkl").open("rb") as file_: + ner2 = pickle.load(file_) + assert ner2.cfg["update_with_oracle_cut_size"] == 111 + + +def test_issue4725_2(): + # ensures that this runs correctly and doesn't hang or crash because of the global vectors + # if it does crash, it's usually because of calling 'spawn' for multiprocessing (e.g. on Windows), + # or because of issues with pickling the NER (cf test_issue4725_1) + vocab = Vocab(vectors_name="test_vocab_add_vector") + data = numpy.ndarray((5, 3), dtype="f") + data[0] = 1.0 + data[1] = 2.0 + vocab.set_vector("cat", data[0]) + vocab.set_vector("dog", data[1]) + nlp = English(vocab=vocab) + nlp.add_pipe("ner") + nlp.initialize() + docs = ["Kurt is in London."] * 10 + for _ in nlp.pipe(docs, batch_size=2, n_process=2): + pass + + +def test_issue4849(): + nlp = English() + patterns = [ + {"label": "PERSON", "pattern": "joe biden", "id": "joe-biden"}, + {"label": "PERSON", "pattern": "bernie sanders", "id": "bernie-sanders"}, + ] + ruler = nlp.add_pipe("entity_ruler", config={"phrase_matcher_attr": "LOWER"}) + ruler.add_patterns(patterns) + text = """ + The left is starting to take aim at Democratic front-runner Joe Biden. + Sen. Bernie Sanders joined in her criticism: "There is no 'middle ground' when it comes to climate policy." + """ + # USING 1 PROCESS + count_ents = 0 + for doc in nlp.pipe([text], n_process=1): + count_ents += len([ent for ent in doc.ents if ent.ent_id > 0]) + assert count_ents == 2 + # USING 2 PROCESSES + count_ents = 0 + for doc in nlp.pipe([text], n_process=2): + count_ents += len([ent for ent in doc.ents if ent.ent_id > 0]) + assert count_ents == 2 + + +@Language.factory("my_pipe") +class CustomPipe: + def __init__(self, nlp, name="my_pipe"): + self.name = name + Span.set_extension("my_ext", getter=self._get_my_ext) + Doc.set_extension("my_ext", default=None) + + def __call__(self, doc): + gathered_ext = [] + for sent in doc.sents: + sent_ext = self._get_my_ext(sent) + sent._.set("my_ext", sent_ext) + gathered_ext.append(sent_ext) + + doc._.set("my_ext", "\n".join(gathered_ext)) + return doc + + @staticmethod + def _get_my_ext(span): + return str(span.end) + + +def test_issue4903(): + """Ensure that this runs correctly and doesn't hang or crash on Windows / + macOS.""" + nlp = English() + nlp.add_pipe("sentencizer") + nlp.add_pipe("my_pipe", after="sentencizer") + text = ["I like bananas.", "Do you like them?", "No, I prefer wasabi."] + docs = list(nlp.pipe(text, n_process=2)) + assert docs[0].text == "I like bananas." + assert docs[1].text == "Do you like them?" + assert docs[2].text == "No, I prefer wasabi." 
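# Illustrative sketch (not part of the patch above). test_issue4849 and
# test_issue4903 register stateful components through @Language.factory;
# spaCy v3 also provides @Language.component for plain, stateless functions.
# The component name "sentence_counter" and the sample texts below are
# invented for illustration only.
from spacy.lang.en import English
from spacy.language import Language


@Language.component("sentence_counter")
def sentence_counter(doc):
    # A stateless component receives the Doc, may inspect or annotate it,
    # and must return it so the next pipe can run.
    doc.user_data["n_sents"] = len(list(doc.sents))
    return doc


nlp = English()
nlp.add_pipe("sentencizer")
nlp.add_pipe("sentence_counter", after="sentencizer")
for doc in nlp.pipe(["I like bananas. Do you?", "No, I prefer wasabi."]):
    assert doc.user_data["n_sents"] >= 1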
+ + +def test_issue4924(): + nlp = Language() + example = Example.from_dict(nlp.make_doc(""), {}) + nlp.evaluate([example]) diff --git a/spacy/tests/regression/test_issue4528.py b/spacy/tests/regression/test_issue4528.py deleted file mode 100644 index 460449003..000000000 --- a/spacy/tests/regression/test_issue4528.py +++ /dev/null @@ -1,19 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from spacy.tokens import Doc, DocBin - - -def test_issue4528(en_vocab): - """Test that user_data is correctly serialized in DocBin.""" - doc = Doc(en_vocab, words=["hello", "world"]) - doc.user_data["foo"] = "bar" - # This is how extension attribute values are stored in the user data - doc.user_data[("._.", "foo", None, None)] = "bar" - doc_bin = DocBin(store_user_data=True) - doc_bin.add(doc) - doc_bin_bytes = doc_bin.to_bytes() - new_doc_bin = DocBin(store_user_data=True).from_bytes(doc_bin_bytes) - new_doc = list(new_doc_bin.get_docs(en_vocab))[0] - assert new_doc.user_data["foo"] == "bar" - assert new_doc.user_data[("._.", "foo", None, None)] == "bar" diff --git a/spacy/tests/regression/test_issue4529.py b/spacy/tests/regression/test_issue4529.py deleted file mode 100644 index 381957be6..000000000 --- a/spacy/tests/regression/test_issue4529.py +++ /dev/null @@ -1,13 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest -from spacy.gold import GoldParse - - -@pytest.mark.parametrize( - "text,words", [("A'B C", ["A", "'", "B", "C"]), ("A-B", ["A-B"])] -) -def test_gold_misaligned(en_tokenizer, text, words): - doc = en_tokenizer(text) - GoldParse(doc, words=words) diff --git a/spacy/tests/regression/test_issue4590.py b/spacy/tests/regression/test_issue4590.py deleted file mode 100644 index 3d01cd487..000000000 --- a/spacy/tests/regression/test_issue4590.py +++ /dev/null @@ -1,38 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from mock import Mock -from spacy.matcher import DependencyMatcher -from ..util import get_doc - - -def test_issue4590(en_vocab): - """Test that matches param in on_match method are the same as matches run with no on_match method""" - pattern = [ - {"SPEC": {"NODE_NAME": "jumped"}, "PATTERN": {"ORTH": "jumped"}}, - { - "SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"}, - "PATTERN": {"ORTH": "fox"}, - }, - { - "SPEC": {"NODE_NAME": "quick", "NBOR_RELOP": ".", "NBOR_NAME": "jumped"}, - "PATTERN": {"ORTH": "fox"}, - }, - ] - - on_match = Mock() - - matcher = DependencyMatcher(en_vocab) - matcher.add("pattern", on_match, pattern) - - text = "The quick brown fox jumped over the lazy fox" - heads = [3, 2, 1, 1, 0, -1, 2, 1, -3] - deps = ["det", "amod", "amod", "nsubj", "ROOT", "prep", "det", "amod", "pobj"] - - doc = get_doc(en_vocab, text.split(), heads=heads, deps=deps) - - matches = matcher(doc) - - on_match_args = on_match.call_args - - assert on_match_args[0][3] == matches diff --git a/spacy/tests/regression/test_issue4651.py b/spacy/tests/regression/test_issue4651.py deleted file mode 100644 index eb49f4a38..000000000 --- a/spacy/tests/regression/test_issue4651.py +++ /dev/null @@ -1,65 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from spacy.lang.en import English -from spacy.pipeline import EntityRuler - -from ..util import make_tempdir - - -def test_issue4651_with_phrase_matcher_attr(): - """Test that the EntityRuler PhraseMatcher is deserialize correctly using - the method from_disk when the EntityRuler argument phrase_matcher_attr is - specified. 
- """ - text = "Spacy is a python library for nlp" - - nlp = English() - ruler = EntityRuler(nlp, phrase_matcher_attr="LOWER") - patterns = [{"label": "PYTHON_LIB", "pattern": "spacy", "id": "spaCy"}] - ruler.add_patterns(patterns) - nlp.add_pipe(ruler) - - doc = nlp(text) - res = [(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents] - - nlp_reloaded = English() - with make_tempdir() as d: - file_path = d / "entityruler" - ruler.to_disk(file_path) - ruler_reloaded = EntityRuler(nlp_reloaded).from_disk(file_path) - - nlp_reloaded.add_pipe(ruler_reloaded) - doc_reloaded = nlp_reloaded(text) - res_reloaded = [(ent.text, ent.label_, ent.ent_id_) for ent in doc_reloaded.ents] - - assert res == res_reloaded - - -def test_issue4651_without_phrase_matcher_attr(): - """Test that the EntityRuler PhraseMatcher is deserialize correctly using - the method from_disk when the EntityRuler argument phrase_matcher_attr is - not specified. - """ - text = "Spacy is a python library for nlp" - - nlp = English() - ruler = EntityRuler(nlp) - patterns = [{"label": "PYTHON_LIB", "pattern": "spacy", "id": "spaCy"}] - ruler.add_patterns(patterns) - nlp.add_pipe(ruler) - - doc = nlp(text) - res = [(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents] - - nlp_reloaded = English() - with make_tempdir() as d: - file_path = d / "entityruler" - ruler.to_disk(file_path) - ruler_reloaded = EntityRuler(nlp_reloaded).from_disk(file_path) - - nlp_reloaded.add_pipe(ruler_reloaded) - doc_reloaded = nlp_reloaded(text) - res_reloaded = [(ent.text, ent.label_, ent.ent_id_) for ent in doc_reloaded.ents] - - assert res == res_reloaded diff --git a/spacy/tests/regression/test_issue4665.py b/spacy/tests/regression/test_issue4665.py deleted file mode 100644 index 721ec0098..000000000 --- a/spacy/tests/regression/test_issue4665.py +++ /dev/null @@ -1,31 +0,0 @@ -from spacy.cli.converters.conllu2json import conllu2json - -input_data = """ -1 [ _ PUNCT -LRB- _ _ punct _ _ -2 This _ DET DT _ _ det _ _ -3 killing _ NOUN NN _ _ nsubj _ _ -4 of _ ADP IN _ _ case _ _ -5 a _ DET DT _ _ det _ _ -6 respected _ ADJ JJ _ _ amod _ _ -7 cleric _ NOUN NN _ _ nmod _ _ -8 will _ AUX MD _ _ aux _ _ -9 be _ AUX VB _ _ aux _ _ -10 causing _ VERB VBG _ _ root _ _ -11 us _ PRON PRP _ _ iobj _ _ -12 trouble _ NOUN NN _ _ dobj _ _ -13 for _ ADP IN _ _ case _ _ -14 years _ NOUN NNS _ _ nmod _ _ -15 to _ PART TO _ _ mark _ _ -16 come _ VERB VB _ _ acl _ _ -17 . _ PUNCT . 
_ _ punct _ _ -18 ] _ PUNCT -RRB- _ _ punct _ _ -""" - - -def test_issue4665(): - """ - conllu2json should not raise an exception if the HEAD column contains an - underscore - """ - - conllu2json(input_data) diff --git a/spacy/tests/regression/test_issue4674.py b/spacy/tests/regression/test_issue4674.py deleted file mode 100644 index 8fa4f9259..000000000 --- a/spacy/tests/regression/test_issue4674.py +++ /dev/null @@ -1,39 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -import pytest -from spacy.kb import KnowledgeBase -from spacy.util import ensure_path -from spacy.lang.en import English - -from ..util import make_tempdir - - -def test_issue4674(): - """Test that setting entities with overlapping identifiers does not mess up IO""" - nlp = English() - kb = KnowledgeBase(nlp.vocab, entity_vector_length=3) - - vector1 = [0.9, 1.1, 1.01] - vector2 = [1.8, 2.25, 2.01] - with pytest.warns(UserWarning): - kb.set_entities( - entity_list=["Q1", "Q1"], - freq_list=[32, 111], - vector_list=[vector1, vector2], - ) - - assert kb.get_size_entities() == 1 - - # dumping to file & loading back in - with make_tempdir() as d: - dir_path = ensure_path(d) - if not dir_path.exists(): - dir_path.mkdir() - file_path = dir_path / "kb" - kb.dump(str(file_path)) - - kb2 = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3) - kb2.load_bulk(str(file_path)) - - assert kb2.get_size_entities() == 1 diff --git a/spacy/tests/regression/test_issue4707.py b/spacy/tests/regression/test_issue4707.py deleted file mode 100644 index e710881d7..000000000 --- a/spacy/tests/regression/test_issue4707.py +++ /dev/null @@ -1,23 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from spacy.util import load_model_from_path -from spacy.lang.en import English - -from ..util import make_tempdir - - -def test_issue4707(): - """Tests that disabled component names are also excluded from nlp.from_disk - by default when loading a model. 
- """ - nlp = English() - nlp.add_pipe(nlp.create_pipe("sentencizer")) - nlp.add_pipe(nlp.create_pipe("entity_ruler")) - assert nlp.pipe_names == ["sentencizer", "entity_ruler"] - exclude = ["tokenizer", "sentencizer"] - with make_tempdir() as tmpdir: - nlp.to_disk(tmpdir, exclude=exclude) - new_nlp = load_model_from_path(tmpdir, disable=exclude) - assert "sentencizer" not in new_nlp.pipe_names - assert "entity_ruler" in new_nlp.pipe_names diff --git a/spacy/tests/regression/test_issue4725.py b/spacy/tests/regression/test_issue4725.py deleted file mode 100644 index 57675a202..000000000 --- a/spacy/tests/regression/test_issue4725.py +++ /dev/null @@ -1,25 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import numpy - -from spacy.lang.en import English -from spacy.vocab import Vocab - - -def test_issue4725(): - # ensures that this runs correctly and doesn't hang or crash because of the global vectors - vocab = Vocab(vectors_name="test_vocab_add_vector") - data = numpy.ndarray((5, 3), dtype="f") - data[0] = 1.0 - data[1] = 2.0 - vocab.set_vector("cat", data[0]) - vocab.set_vector("dog", data[1]) - - nlp = English(vocab=vocab) - ner = nlp.create_pipe("ner") - nlp.add_pipe(ner) - nlp.begin_training() - docs = ["Kurt is in London."] * 10 - for _ in nlp.pipe(docs, batch_size=2, n_process=2): - pass diff --git a/spacy/tests/regression/test_issue4849.py b/spacy/tests/regression/test_issue4849.py deleted file mode 100644 index 5c7ffc999..000000000 --- a/spacy/tests/regression/test_issue4849.py +++ /dev/null @@ -1,37 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from spacy.lang.en import English -from spacy.pipeline import EntityRuler - - -def test_issue4849(): - nlp = English() - - ruler = EntityRuler( - nlp, - patterns=[ - {"label": "PERSON", "pattern": "joe biden", "id": "joe-biden"}, - {"label": "PERSON", "pattern": "bernie sanders", "id": "bernie-sanders"}, - ], - phrase_matcher_attr="LOWER", - ) - - nlp.add_pipe(ruler) - - text = """ - The left is starting to take aim at Democratic front-runner Joe Biden. - Sen. Bernie Sanders joined in her criticism: "There is no 'middle ground' when it comes to climate policy." 
- """ - - # USING 1 PROCESS - count_ents = 0 - for doc in nlp.pipe([text], n_process=1): - count_ents += len([ent for ent in doc.ents if ent.ent_id > 0]) - assert count_ents == 2 - - # USING 2 PROCESSES - count_ents = 0 - for doc in nlp.pipe([text], n_process=2): - count_ents += len([ent for ent in doc.ents if ent.ent_id > 0]) - assert count_ents == 2 diff --git a/spacy/tests/regression/test_issue4903.py b/spacy/tests/regression/test_issue4903.py deleted file mode 100644 index d467b1cd6..000000000 --- a/spacy/tests/regression/test_issue4903.py +++ /dev/null @@ -1,43 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from spacy.lang.en import English -from spacy.tokens import Span, Doc - - -class CustomPipe: - name = "my_pipe" - - def __init__(self): - Span.set_extension("my_ext", getter=self._get_my_ext) - Doc.set_extension("my_ext", default=None) - - def __call__(self, doc): - gathered_ext = [] - for sent in doc.sents: - sent_ext = self._get_my_ext(sent) - sent._.set("my_ext", sent_ext) - gathered_ext.append(sent_ext) - - doc._.set("my_ext", "\n".join(gathered_ext)) - - return doc - - @staticmethod - def _get_my_ext(span): - return str(span.end) - - -def test_issue4903(): - # ensures that this runs correctly and doesn't hang or crash on Windows / macOS - - nlp = English() - custom_component = CustomPipe() - nlp.add_pipe(nlp.create_pipe("sentencizer")) - nlp.add_pipe(custom_component, after="sentencizer") - - text = ["I like bananas.", "Do you like them?", "No, I prefer wasabi."] - docs = list(nlp.pipe(text, n_process=2)) - assert docs[0].text == "I like bananas." - assert docs[1].text == "Do you like them?" - assert docs[2].text == "No, I prefer wasabi." diff --git a/spacy/tests/regression/test_issue4924.py b/spacy/tests/regression/test_issue4924.py deleted file mode 100644 index 0e45291a9..000000000 --- a/spacy/tests/regression/test_issue4924.py +++ /dev/null @@ -1,16 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest - -import spacy - - -@pytest.fixture -def nlp(): - return spacy.blank("en") - - -def test_issue4924(nlp): - docs_golds = [("", {})] - nlp.evaluate(docs_golds) diff --git a/spacy/tests/regression/test_issue5001-5500.py b/spacy/tests/regression/test_issue5001-5500.py new file mode 100644 index 000000000..dbfe78679 --- /dev/null +++ b/spacy/tests/regression/test_issue5001-5500.py @@ -0,0 +1,138 @@ +import numpy +from spacy.tokens import Doc, DocBin +from spacy.attrs import DEP, POS, TAG +from spacy.lang.en import English +from spacy.language import Language +from spacy.lang.en.syntax_iterators import noun_chunks +from spacy.vocab import Vocab +import spacy +import pytest + +from ...util import make_tempdir + + +def test_issue5048(en_vocab): + words = ["This", "is", "a", "sentence"] + pos_s = ["DET", "VERB", "DET", "NOUN"] + spaces = [" ", " ", " ", ""] + deps_s = ["dep", "adj", "nn", "atm"] + tags_s = ["DT", "VBZ", "DT", "NN"] + strings = en_vocab.strings + for w in words: + strings.add(w) + deps = [strings.add(d) for d in deps_s] + pos = [strings.add(p) for p in pos_s] + tags = [strings.add(t) for t in tags_s] + attrs = [POS, DEP, TAG] + array = numpy.array(list(zip(pos, deps, tags)), dtype="uint64") + doc = Doc(en_vocab, words=words, spaces=spaces) + doc.from_array(attrs, array) + v1 = [(token.text, token.pos_, token.tag_) for token in doc] + doc2 = Doc(en_vocab, words=words, pos=pos_s, deps=deps_s, tags=tags_s) + v2 = [(token.text, token.pos_, token.tag_) for token in doc2] + assert v1 == v2 + + +def test_issue5082(): + # 
Ensure the 'merge_entities' pipeline does something sensible for the vectors of the merged tokens + nlp = English() + vocab = nlp.vocab + array1 = numpy.asarray([0.1, 0.5, 0.8], dtype=numpy.float32) + array2 = numpy.asarray([-0.2, -0.6, -0.9], dtype=numpy.float32) + array3 = numpy.asarray([0.3, -0.1, 0.7], dtype=numpy.float32) + array4 = numpy.asarray([0.5, 0, 0.3], dtype=numpy.float32) + array34 = numpy.asarray([0.4, -0.05, 0.5], dtype=numpy.float32) + vocab.set_vector("I", array1) + vocab.set_vector("like", array2) + vocab.set_vector("David", array3) + vocab.set_vector("Bowie", array4) + text = "I like David Bowie" + patterns = [ + {"label": "PERSON", "pattern": [{"LOWER": "david"}, {"LOWER": "bowie"}]} + ] + ruler = nlp.add_pipe("entity_ruler") + ruler.add_patterns(patterns) + parsed_vectors_1 = [t.vector for t in nlp(text)] + assert len(parsed_vectors_1) == 4 + numpy.testing.assert_array_equal(parsed_vectors_1[0], array1) + numpy.testing.assert_array_equal(parsed_vectors_1[1], array2) + numpy.testing.assert_array_equal(parsed_vectors_1[2], array3) + numpy.testing.assert_array_equal(parsed_vectors_1[3], array4) + nlp.add_pipe("merge_entities") + parsed_vectors_2 = [t.vector for t in nlp(text)] + assert len(parsed_vectors_2) == 3 + numpy.testing.assert_array_equal(parsed_vectors_2[0], array1) + numpy.testing.assert_array_equal(parsed_vectors_2[1], array2) + numpy.testing.assert_array_equal(parsed_vectors_2[2], array34) + + +def test_issue5137(): + @Language.factory("my_component") + class MyComponent: + def __init__(self, nlp, name="my_component", categories="all_categories"): + self.nlp = nlp + self.categories = categories + self.name = name + + def __call__(self, doc): + pass + + def to_disk(self, path, **kwargs): + pass + + def from_disk(self, path, **cfg): + pass + + nlp = English() + my_component = nlp.add_pipe("my_component") + assert my_component.categories == "all_categories" + with make_tempdir() as tmpdir: + nlp.to_disk(tmpdir) + overrides = {"components": {"my_component": {"categories": "my_categories"}}} + nlp2 = spacy.load(tmpdir, config=overrides) + assert nlp2.get_pipe("my_component").categories == "my_categories" + + +def test_issue5141(en_vocab): + """ Ensure an empty DocBin does not crash on serialization """ + doc_bin = DocBin(attrs=["DEP", "HEAD"]) + assert list(doc_bin.get_docs(en_vocab)) == [] + doc_bin_bytes = doc_bin.to_bytes() + doc_bin_2 = DocBin().from_bytes(doc_bin_bytes) + assert list(doc_bin_2.get_docs(en_vocab)) == [] + + +def test_issue5152(): + # Test that the comparison between a Span and a Token, goes well + # There was a bug when the number of tokens in the span equaled the number of characters in the token (!) 
+ nlp = English() + text = nlp("Talk about being boring!") + text_var = nlp("Talk of being boring!") + y = nlp("Let") + span = text[0:3] # Talk about being + span_2 = text[0:3] # Talk about being + span_3 = text_var[0:3] # Talk of being + token = y[0] # Let + with pytest.warns(UserWarning): + assert span.similarity(token) == 0.0 + assert span.similarity(span_2) == 1.0 + with pytest.warns(UserWarning): + assert span_2.similarity(span_3) < 1.0 + + +def test_issue5458(): + # Test that the noun chuncker does not generate overlapping spans + # fmt: off + words = ["In", "an", "era", "where", "markets", "have", "brought", "prosperity", "and", "empowerment", "."] + vocab = Vocab(strings=words) + deps = ["ROOT", "det", "pobj", "advmod", "nsubj", "aux", "relcl", "dobj", "cc", "conj", "punct"] + pos = ["ADP", "DET", "NOUN", "ADV", "NOUN", "AUX", "VERB", "NOUN", "CCONJ", "NOUN", "PUNCT"] + heads = [0, 2, 0, 9, 6, 6, 2, 6, 7, 7, 0] + # fmt: on + en_doc = Doc(vocab, words=words, pos=pos, heads=heads, deps=deps) + en_doc.noun_chunks_iterator = noun_chunks + + # if there are overlapping spans, this will fail with an E102 error "Can't merge non-disjoint spans" + nlp = English() + merge_nps = nlp.create_pipe("merge_noun_chunks") + merge_nps(en_doc) diff --git a/spacy/tests/regression/test_issue5048.py b/spacy/tests/regression/test_issue5048.py deleted file mode 100644 index 228322493..000000000 --- a/spacy/tests/regression/test_issue5048.py +++ /dev/null @@ -1,35 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import numpy -from spacy.tokens import Doc -from spacy.attrs import DEP, POS, TAG - -from ..util import get_doc - - -def test_issue5048(en_vocab): - words = ["This", "is", "a", "sentence"] - pos_s = ["DET", "VERB", "DET", "NOUN"] - spaces = [" ", " ", " ", ""] - deps_s = ["dep", "adj", "nn", "atm"] - tags_s = ["DT", "VBZ", "DT", "NN"] - - strings = en_vocab.strings - - for w in words: - strings.add(w) - deps = [strings.add(d) for d in deps_s] - pos = [strings.add(p) for p in pos_s] - tags = [strings.add(t) for t in tags_s] - - attrs = [POS, DEP, TAG] - array = numpy.array(list(zip(pos, deps, tags)), dtype="uint64") - - doc = Doc(en_vocab, words=words, spaces=spaces) - doc.from_array(attrs, array) - v1 = [(token.text, token.pos_, token.tag_) for token in doc] - - doc2 = get_doc(en_vocab, words=words, pos=pos_s, deps=deps_s, tags=tags_s) - v2 = [(token.text, token.pos_, token.tag_) for token in doc2] - assert v1 == v2 diff --git a/spacy/tests/regression/test_issue5082.py b/spacy/tests/regression/test_issue5082.py deleted file mode 100644 index efa5d39f2..000000000 --- a/spacy/tests/regression/test_issue5082.py +++ /dev/null @@ -1,46 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import numpy as np -from spacy.lang.en import English -from spacy.pipeline import EntityRuler - - -def test_issue5082(): - # Ensure the 'merge_entities' pipeline does something sensible for the vectors of the merged tokens - nlp = English() - vocab = nlp.vocab - array1 = np.asarray([0.1, 0.5, 0.8], dtype=np.float32) - array2 = np.asarray([-0.2, -0.6, -0.9], dtype=np.float32) - array3 = np.asarray([0.3, -0.1, 0.7], dtype=np.float32) - array4 = np.asarray([0.5, 0, 0.3], dtype=np.float32) - array34 = np.asarray([0.4, -0.05, 0.5], dtype=np.float32) - - vocab.set_vector("I", array1) - vocab.set_vector("like", array2) - vocab.set_vector("David", array3) - vocab.set_vector("Bowie", array4) - - text = "I like David Bowie" - ruler = EntityRuler(nlp) - patterns = [ - {"label": "PERSON", "pattern": 
[{"LOWER": "david"}, {"LOWER": "bowie"}]} - ] - ruler.add_patterns(patterns) - nlp.add_pipe(ruler) - - parsed_vectors_1 = [t.vector for t in nlp(text)] - assert len(parsed_vectors_1) == 4 - np.testing.assert_array_equal(parsed_vectors_1[0], array1) - np.testing.assert_array_equal(parsed_vectors_1[1], array2) - np.testing.assert_array_equal(parsed_vectors_1[2], array3) - np.testing.assert_array_equal(parsed_vectors_1[3], array4) - - merge_ents = nlp.create_pipe("merge_entities") - nlp.add_pipe(merge_ents) - - parsed_vectors_2 = [t.vector for t in nlp(text)] - assert len(parsed_vectors_2) == 3 - np.testing.assert_array_equal(parsed_vectors_2[0], array1) - np.testing.assert_array_equal(parsed_vectors_2[1], array2) - np.testing.assert_array_equal(parsed_vectors_2[2], array34) diff --git a/spacy/tests/regression/test_issue5137.py b/spacy/tests/regression/test_issue5137.py deleted file mode 100644 index 4b4e597d3..000000000 --- a/spacy/tests/regression/test_issue5137.py +++ /dev/null @@ -1,33 +0,0 @@ -import spacy -from spacy.language import Language -from spacy.lang.en import English -from spacy.tests.util import make_tempdir - - -def test_issue5137(): - class MyComponent(object): - name = "my_component" - - def __init__(self, nlp, **cfg): - self.nlp = nlp - self.categories = cfg.get("categories", "all_categories") - - def __call__(self, doc): - pass - - def to_disk(self, path, **kwargs): - pass - - def from_disk(self, path, **cfg): - pass - - Language.factories["my_component"] = lambda nlp, **cfg: MyComponent(nlp, **cfg) - - nlp = English() - nlp.add_pipe(nlp.create_pipe("my_component")) - assert nlp.get_pipe("my_component").categories == "all_categories" - - with make_tempdir() as tmpdir: - nlp.to_disk(tmpdir) - nlp2 = spacy.load(tmpdir, categories="my_categories") - assert nlp2.get_pipe("my_component").categories == "my_categories" diff --git a/spacy/tests/regression/test_issue5152.py b/spacy/tests/regression/test_issue5152.py deleted file mode 100644 index 758ac9c14..000000000 --- a/spacy/tests/regression/test_issue5152.py +++ /dev/null @@ -1,21 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from spacy.lang.en import English - - -def test_issue5152(): - # Test that the comparison between a Span and a Token, goes well - # There was a bug when the number of tokens in the span equaled the number of characters in the token (!) 
- nlp = English() - text = nlp("Talk about being boring!") - text_var = nlp("Talk of being boring!") - y = nlp("Let") - - span = text[0:3] # Talk about being - span_2 = text[0:3] # Talk about being - span_3 = text_var[0:3] # Talk of being - token = y[0] # Let - assert span.similarity(token) == 0.0 - assert span.similarity(span_2) == 1.0 - assert span_2.similarity(span_3) < 1.0 diff --git a/spacy/tests/regression/test_issue5230.py b/spacy/tests/regression/test_issue5230.py index 2b14ff589..a00b2a688 100644 --- a/spacy/tests/regression/test_issue5230.py +++ b/spacy/tests/regression/test_issue5230.py @@ -1,4 +1,3 @@ -# coding: utf8 import warnings from unittest import TestCase import pytest @@ -7,9 +6,8 @@ from numpy import zeros from spacy.kb import KnowledgeBase, Writer from spacy.vectors import Vectors from spacy.language import Language -from spacy.pipeline import Pipe -from spacy.compat import is_python2 - +from spacy.pipeline import TrainablePipe +from spacy.vocab import Vocab from ..util import make_tempdir @@ -26,7 +24,7 @@ def vectors(): def custom_pipe(): # create dummy pipe partially implementing interface -- only want to test to_disk - class SerializableDummy(object): + class SerializableDummy: def __init__(self, **cfg): if cfg: self.cfg = cfg @@ -46,39 +44,43 @@ def custom_pipe(): def from_disk(self, path, exclude=tuple(), **kwargs): return self - class MyPipe(Pipe): + class MyPipe(TrainablePipe): def __init__(self, vocab, model=True, **cfg): if cfg: self.cfg = cfg else: self.cfg = None self.model = SerializableDummy() - self.vocab = SerializableDummy() + self.vocab = vocab - return MyPipe(None) + return MyPipe(Vocab()) def tagger(): nlp = Language() - nlp.add_pipe(nlp.create_pipe("tagger")) - tagger = nlp.get_pipe("tagger") + tagger = nlp.add_pipe("tagger") # need to add model for two reasons: # 1. no model leads to error in serialization, # 2. the affected line is the one for model serialization - tagger.begin_training(pipeline=nlp.pipeline) + tagger.add_label("A") + nlp.initialize() return tagger def entity_linker(): nlp = Language() - nlp.add_pipe(nlp.create_pipe("entity_linker")) - entity_linker = nlp.get_pipe("entity_linker") + + def create_kb(vocab): + kb = KnowledgeBase(vocab, entity_vector_length=1) + kb.add_entity("test", 0.0, zeros((1, 1), dtype="f")) + return kb + + entity_linker = nlp.add_pipe("entity_linker") + entity_linker.set_kb(create_kb) # need to add model for two reasons: # 1. no model leads to error in serialization, # 2. 
the affected line is the one for model serialization - kb = KnowledgeBase(nlp.vocab, entity_vector_length=1) - entity_linker.set_kb(kb) - entity_linker.begin_training(pipeline=nlp.pipeline) + nlp.initialize() return entity_linker @@ -97,14 +99,12 @@ def write_obj_and_catch_warnings(obj): return list(filter(lambda x: isinstance(x, ResourceWarning), warnings_list)) -@pytest.mark.skipif(is_python2, reason="ResourceWarning needs Python 3.x") @pytest.mark.parametrize("obj", objects_to_test[0], ids=objects_to_test[1]) def test_to_disk_resource_warning(obj): warnings_list = write_obj_and_catch_warnings(obj) assert len(warnings_list) == 0 -@pytest.mark.skipif(is_python2, reason="ResourceWarning needs Python 3.x") def test_writer_with_path_py35(): writer = None with make_tempdir() as d: @@ -124,24 +124,22 @@ def test_save_and_load_knowledge_base(): with make_tempdir() as d: path = d / "kb" try: - kb.dump(path) + kb.to_disk(path) except Exception as e: pytest.fail(str(e)) try: kb_loaded = KnowledgeBase(nlp.vocab, entity_vector_length=1) - kb_loaded.load_bulk(path) + kb_loaded.from_disk(path) except Exception as e: pytest.fail(str(e)) -if not is_python2: +class TestToDiskResourceWarningUnittest(TestCase): + def test_resource_warning(self): + scenarios = zip(*objects_to_test) - class TestToDiskResourceWarningUnittest(TestCase): - def test_resource_warning(self): - scenarios = zip(*objects_to_test) - - for scenario in scenarios: - with self.subTest(msg=scenario[1]): - warnings_list = write_obj_and_catch_warnings(scenario[0]) - self.assertEqual(len(warnings_list), 0) + for scenario in scenarios: + with self.subTest(msg=scenario[1]): + warnings_list = write_obj_and_catch_warnings(scenario[0]) + self.assertEqual(len(warnings_list), 0) diff --git a/spacy/tests/regression/test_issue5458.py b/spacy/tests/regression/test_issue5458.py deleted file mode 100644 index 3281e2a8c..000000000 --- a/spacy/tests/regression/test_issue5458.py +++ /dev/null @@ -1,26 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from spacy.lang.en import English -from spacy.lang.en.syntax_iterators import noun_chunks -from spacy.tests.util import get_doc -from spacy.vocab import Vocab - - -def test_issue5458(): - # Test that the noun chuncker does not generate overlapping spans - # fmt: off - words = ["In", "an", "era", "where", "markets", "have", "brought", "prosperity", "and", "empowerment", "."] - vocab = Vocab(strings=words) - dependencies = ["ROOT", "det", "pobj", "advmod", "nsubj", "aux", "relcl", "dobj", "cc", "conj", "punct"] - pos_tags = ["ADP", "DET", "NOUN", "ADV", "NOUN", "AUX", "VERB", "NOUN", "CCONJ", "NOUN", "PUNCT"] - heads = [0, 1, -2, 6, 2, 1, -4, -1, -1, -2, -10] - # fmt: on - - en_doc = get_doc(vocab, words, pos_tags, heads, dependencies) - en_doc.noun_chunks_iterator = noun_chunks - - # if there are overlapping spans, this will fail with an E102 error "Can't merge non-disjoint spans" - nlp = English() - merge_nps = nlp.create_pipe("merge_noun_chunks") - merge_nps(en_doc) diff --git a/spacy/tests/regression/test_issue5551.py b/spacy/tests/regression/test_issue5551.py new file mode 100644 index 000000000..655764362 --- /dev/null +++ b/spacy/tests/regression/test_issue5551.py @@ -0,0 +1,37 @@ +from spacy.lang.en import English +from spacy.util import fix_random_seed + + +def test_issue5551(): + """Test that after fixing the random seed, the results of the pipeline are truly identical""" + component = "textcat" + pipe_cfg = { + "model": { + "@architectures": "spacy.TextCatBOW.v1", + 
"exclusive_classes": True, + "ngram_size": 2, + "no_output_layer": False, + } + } + + results = [] + for i in range(3): + fix_random_seed(0) + nlp = English() + example = ( + "Once hot, form ping-pong-ball-sized balls of the mixture, each weighing roughly 25 g.", + {"cats": {"Labe1": 1.0, "Label2": 0.0, "Label3": 0.0}}, + ) + pipe = nlp.add_pipe(component, config=pipe_cfg, last=True) + for label in set(example[1]["cats"]): + pipe.add_label(label) + nlp.initialize() + + # Store the result of each iteration + result = pipe.model.predict([nlp.make_doc(example[0])]) + results.append(list(result[0])) + + # All results should be the same because of the fixed seed + assert len(results) == 3 + assert results[0] == results[1] + assert results[0] == results[2] diff --git a/spacy/tests/regression/test_issue5838.py b/spacy/tests/regression/test_issue5838.py index c008c5aec..4e4d98beb 100644 --- a/spacy/tests/regression/test_issue5838.py +++ b/spacy/tests/regression/test_issue5838.py @@ -1,15 +1,13 @@ -# coding: utf8 -from __future__ import unicode_literals - from spacy.lang.en import English from spacy.tokens import Span from spacy import displacy -SAMPLE_TEXT = '''First line + +SAMPLE_TEXT = """First line Second line, with ent Third line Fourth line -''' +""" def test_issue5838(): @@ -18,8 +16,8 @@ def test_issue5838(): nlp = English() doc = nlp(SAMPLE_TEXT) - doc.ents = [Span(doc, 7, 8, label='test')] + doc.ents = [Span(doc, 7, 8, label="test")] - html = displacy.render(doc, style='ent') - found = html.count('
</br>') +    html = displacy.render(doc, style="ent") +    found = html.count("</br>
") assert found == 4 diff --git a/spacy/tests/regression/test_issue5918.py b/spacy/tests/regression/test_issue5918.py index 2dee26d82..d25323ef6 100644 --- a/spacy/tests/regression/test_issue5918.py +++ b/spacy/tests/regression/test_issue5918.py @@ -1,21 +1,17 @@ -# coding: utf8 -from __future__ import unicode_literals - from spacy.lang.en import English -from spacy.pipeline import merge_entities, EntityRuler +from spacy.pipeline import merge_entities def test_issue5918(): # Test edge case when merging entities. nlp = English() + ruler = nlp.add_pipe("entity_ruler") patterns = [ {"label": "ORG", "pattern": "Digicon Inc"}, {"label": "ORG", "pattern": "Rotan Mosle Inc's"}, {"label": "ORG", "pattern": "Rotan Mosle Technology Partners Ltd"}, ] - ruler = EntityRuler(nlp) ruler.add_patterns(patterns) - nlp.add_pipe(ruler) text = """ Digicon Inc said it has completed the previously-announced disposition @@ -26,6 +22,8 @@ def test_issue5918(): assert len(doc.ents) == 3 # make it so that the third span's head is within the entity (ent_iob=I) # bug #5918 would wrongly transfer that I to the full entity, resulting in 2 instead of 3 final ents. - doc[29].head = doc[33] + # TODO: test for logging here + # with pytest.warns(UserWarning): + # doc[29].head = doc[33] doc = merge_entities(doc) assert len(doc.ents) == 3 diff --git a/spacy/tests/regression/test_issue6207.py b/spacy/tests/regression/test_issue6207.py index 3c9c3ce89..9d8b047bf 100644 --- a/spacy/tests/regression/test_issue6207.py +++ b/spacy/tests/regression/test_issue6207.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from spacy.util import filter_spans @@ -9,8 +6,8 @@ def test_issue6207(en_tokenizer): # Make spans s1 = doc[:4] - s2 = doc[3:6] # overlaps with s1 - s3 = doc[5:7] # overlaps with s2, not s1 + s2 = doc[3:6] # overlaps with s1 + s3 = doc[5:7] # overlaps with s2, not s1 result = filter_spans((s1, s2, s3)) assert s1 in result diff --git a/spacy/tests/serialize/test_serialize_config.py b/spacy/tests/serialize/test_serialize_config.py new file mode 100644 index 000000000..8b3f5c2b8 --- /dev/null +++ b/spacy/tests/serialize/test_serialize_config.py @@ -0,0 +1,357 @@ +import pytest +from thinc.api import Config, ConfigValidationError +import spacy +from spacy.lang.en import English +from spacy.lang.de import German +from spacy.language import Language, DEFAULT_CONFIG +from spacy.util import registry, load_model_from_config +from spacy.ml.models import build_Tok2Vec_model, build_tb_parser_model +from spacy.ml.models import MultiHashEmbed, MaxoutWindowEncoder +from spacy.schemas import ConfigSchema + +from ..util import make_tempdir + + +nlp_config_string = """ +[paths] +train = null +dev = null + +[corpora] + +[corpora.train] +@readers = "spacy.Corpus.v1" +path = ${paths.train} + +[corpora.dev] +@readers = "spacy.Corpus.v1" +path = ${paths.dev} + +[training] + +[training.batcher] +@batchers = "spacy.batch_by_words.v1" +size = 666 + +[nlp] +lang = "en" +pipeline = ["tok2vec", "tagger"] + +[components] + +[components.tok2vec] +factory = "tok2vec" + +[components.tok2vec.model] +@architectures = "spacy.HashEmbedCNN.v1" +pretrained_vectors = null +width = 342 +depth = 4 +window_size = 1 +embed_size = 2000 +maxout_pieces = 3 +subword_features = true + +[components.tagger] +factory = "tagger" + +[components.tagger.model] +@architectures = "spacy.Tagger.v1" + +[components.tagger.model.tok2vec] +@architectures = "spacy.Tok2VecListener.v1" +width = ${components.tok2vec.model.width} +""" + + +parser_config_string = 
""" +[model] +@architectures = "spacy.TransitionBasedParser.v1" +state_type = "parser" +extra_state_tokens = false +hidden_width = 66 +maxout_pieces = 2 + +[model.tok2vec] +@architectures = "spacy.HashEmbedCNN.v1" +pretrained_vectors = null +width = 333 +depth = 4 +embed_size = 5555 +window_size = 1 +maxout_pieces = 7 +subword_features = false +""" + + +@registry.architectures.register("my_test_parser") +def my_parser(): + tok2vec = build_Tok2Vec_model( + MultiHashEmbed( + width=321, + attrs=["LOWER", "SHAPE"], + rows=[5432, 5432], + include_static_vectors=False, + ), + MaxoutWindowEncoder(width=321, window_size=3, maxout_pieces=4, depth=2), + ) + parser = build_tb_parser_model( + tok2vec=tok2vec, + state_type="parser", + extra_state_tokens=True, + hidden_width=65, + maxout_pieces=5, + ) + return parser + + +def test_create_nlp_from_config(): + config = Config().from_str(nlp_config_string) + with pytest.raises(ConfigValidationError): + load_model_from_config(config, auto_fill=False) + nlp = load_model_from_config(config, auto_fill=True) + assert nlp.config["training"]["batcher"]["size"] == 666 + assert len(nlp.config["training"]) > 1 + assert nlp.pipe_names == ["tok2vec", "tagger"] + assert len(nlp.config["components"]) == 2 + assert len(nlp.config["nlp"]["pipeline"]) == 2 + nlp.remove_pipe("tagger") + assert len(nlp.config["components"]) == 1 + assert len(nlp.config["nlp"]["pipeline"]) == 1 + with pytest.raises(ValueError): + bad_cfg = {"yolo": {}} + load_model_from_config(Config(bad_cfg), auto_fill=True) + with pytest.raises(ValueError): + bad_cfg = {"pipeline": {"foo": "bar"}} + load_model_from_config(Config(bad_cfg), auto_fill=True) + + +def test_create_nlp_from_config_multiple_instances(): + """Test that the nlp object is created correctly for a config with multiple + instances of the same component.""" + config = Config().from_str(nlp_config_string) + config["components"] = { + "t2v": config["components"]["tok2vec"], + "tagger1": config["components"]["tagger"], + "tagger2": config["components"]["tagger"], + } + config["nlp"]["pipeline"] = list(config["components"].keys()) + nlp = load_model_from_config(config, auto_fill=True) + assert nlp.pipe_names == ["t2v", "tagger1", "tagger2"] + assert nlp.get_pipe_meta("t2v").factory == "tok2vec" + assert nlp.get_pipe_meta("tagger1").factory == "tagger" + assert nlp.get_pipe_meta("tagger2").factory == "tagger" + pipeline_config = nlp.config["components"] + assert len(pipeline_config) == 3 + assert list(pipeline_config.keys()) == ["t2v", "tagger1", "tagger2"] + assert nlp.config["nlp"]["pipeline"] == ["t2v", "tagger1", "tagger2"] + + +def test_serialize_nlp(): + """ Create a custom nlp pipeline from config and ensure it serializes it correctly """ + nlp_config = Config().from_str(nlp_config_string) + nlp = load_model_from_config(nlp_config, auto_fill=True) + nlp.get_pipe("tagger").add_label("A") + nlp.initialize() + assert "tok2vec" in nlp.pipe_names + assert "tagger" in nlp.pipe_names + assert "parser" not in nlp.pipe_names + assert nlp.get_pipe("tagger").model.get_ref("tok2vec").get_dim("nO") == 342 + + with make_tempdir() as d: + nlp.to_disk(d) + nlp2 = spacy.load(d) + assert "tok2vec" in nlp2.pipe_names + assert "tagger" in nlp2.pipe_names + assert "parser" not in nlp2.pipe_names + assert nlp2.get_pipe("tagger").model.get_ref("tok2vec").get_dim("nO") == 342 + + +def test_serialize_custom_nlp(): + """ Create a custom nlp pipeline and ensure it serializes it correctly""" + nlp = English() + parser_cfg = dict() + parser_cfg["model"] = 
{"@architectures": "my_test_parser"} + nlp.add_pipe("parser", config=parser_cfg) + nlp.initialize() + + with make_tempdir() as d: + nlp.to_disk(d) + nlp2 = spacy.load(d) + model = nlp2.get_pipe("parser").model + model.get_ref("tok2vec") + upper = model.get_ref("upper") + # check that we have the correct settings, not the default ones + assert upper.get_dim("nI") == 65 + + +def test_serialize_parser(): + """ Create a non-default parser config to check nlp serializes it correctly """ + nlp = English() + model_config = Config().from_str(parser_config_string) + parser = nlp.add_pipe("parser", config=model_config) + parser.add_label("nsubj") + nlp.initialize() + + with make_tempdir() as d: + nlp.to_disk(d) + nlp2 = spacy.load(d) + model = nlp2.get_pipe("parser").model + model.get_ref("tok2vec") + upper = model.get_ref("upper") + # check that we have the correct settings, not the default ones + assert upper.get_dim("nI") == 66 + + +def test_config_nlp_roundtrip(): + """Test that a config prduced by the nlp object passes training config + validation.""" + nlp = English() + nlp.add_pipe("entity_ruler") + nlp.add_pipe("ner") + new_nlp = load_model_from_config(nlp.config, auto_fill=False) + assert new_nlp.config == nlp.config + assert new_nlp.pipe_names == nlp.pipe_names + assert new_nlp._pipe_configs == nlp._pipe_configs + assert new_nlp._pipe_meta == nlp._pipe_meta + assert new_nlp._factory_meta == nlp._factory_meta + + +def test_config_nlp_roundtrip_bytes_disk(): + """Test that the config is serialized correctly and not interpolated + by mistake.""" + nlp = English() + nlp_bytes = nlp.to_bytes() + new_nlp = English().from_bytes(nlp_bytes) + assert new_nlp.config == nlp.config + nlp = English() + with make_tempdir() as d: + nlp.to_disk(d) + new_nlp = spacy.load(d) + assert new_nlp.config == nlp.config + + +def test_serialize_config_language_specific(): + """Test that config serialization works as expected with language-specific + factories.""" + name = "test_serialize_config_language_specific" + + @English.factory(name, default_config={"foo": 20}) + def custom_factory(nlp: Language, name: str, foo: int): + return lambda doc: doc + + nlp = Language() + assert not nlp.has_factory(name) + nlp = English() + assert nlp.has_factory(name) + nlp.add_pipe(name, config={"foo": 100}, name="bar") + pipe_config = nlp.config["components"]["bar"] + assert pipe_config["foo"] == 100 + assert pipe_config["factory"] == name + + with make_tempdir() as d: + nlp.to_disk(d) + nlp2 = spacy.load(d) + assert nlp2.has_factory(name) + assert nlp2.pipe_names == ["bar"] + assert nlp2.get_pipe_meta("bar").factory == name + pipe_config = nlp2.config["components"]["bar"] + assert pipe_config["foo"] == 100 + assert pipe_config["factory"] == name + + config = Config().from_str(nlp2.config.to_str()) + config["nlp"]["lang"] = "de" + with pytest.raises(ValueError): + # German doesn't have a factory, only English does + load_model_from_config(config) + + +def test_serialize_config_missing_pipes(): + config = Config().from_str(nlp_config_string) + config["components"].pop("tok2vec") + assert "tok2vec" in config["nlp"]["pipeline"] + assert "tok2vec" not in config["components"] + with pytest.raises(ValueError): + load_model_from_config(config, auto_fill=True) + + +def test_config_overrides(): + overrides_nested = {"nlp": {"lang": "de", "pipeline": ["tagger"]}} + overrides_dot = {"nlp.lang": "de", "nlp.pipeline": ["tagger"]} + # load_model from config with overrides passed directly to Config + config = Config().from_str(nlp_config_string, 
overrides=overrides_dot) + nlp = load_model_from_config(config, auto_fill=True) + assert isinstance(nlp, German) + assert nlp.pipe_names == ["tagger"] + # Serialized roundtrip with config passed in + base_config = Config().from_str(nlp_config_string) + base_nlp = load_model_from_config(base_config, auto_fill=True) + assert isinstance(base_nlp, English) + assert base_nlp.pipe_names == ["tok2vec", "tagger"] + with make_tempdir() as d: + base_nlp.to_disk(d) + nlp = spacy.load(d, config=overrides_nested) + assert isinstance(nlp, German) + assert nlp.pipe_names == ["tagger"] + with make_tempdir() as d: + base_nlp.to_disk(d) + nlp = spacy.load(d, config=overrides_dot) + assert isinstance(nlp, German) + assert nlp.pipe_names == ["tagger"] + with make_tempdir() as d: + base_nlp.to_disk(d) + nlp = spacy.load(d) + assert isinstance(nlp, English) + assert nlp.pipe_names == ["tok2vec", "tagger"] + + +def test_config_interpolation(): + config = Config().from_str(nlp_config_string, interpolate=False) + assert config["corpora"]["train"]["path"] == "${paths.train}" + interpolated = config.interpolate() + assert interpolated["corpora"]["train"]["path"] is None + nlp = English.from_config(config) + assert nlp.config["corpora"]["train"]["path"] == "${paths.train}" + # Ensure that variables are preserved in nlp config + width = "${components.tok2vec.model.width}" + assert config["components"]["tagger"]["model"]["tok2vec"]["width"] == width + assert nlp.config["components"]["tagger"]["model"]["tok2vec"]["width"] == width + interpolated2 = nlp.config.interpolate() + assert interpolated2["corpora"]["train"]["path"] is None + assert interpolated2["components"]["tagger"]["model"]["tok2vec"]["width"] == 342 + nlp2 = English.from_config(interpolated) + assert nlp2.config["corpora"]["train"]["path"] is None + assert nlp2.config["components"]["tagger"]["model"]["tok2vec"]["width"] == 342 + + +def test_config_optional_sections(): + config = Config().from_str(nlp_config_string) + config = DEFAULT_CONFIG.merge(config) + assert "pretraining" not in config + filled = registry.fill(config, schema=ConfigSchema, validate=False) + # Make sure that optional "pretraining" block doesn't default to None, + # which would (rightly) cause error because it'd result in a top-level + # key that's not a section (dict). Note that the following roundtrip is + # also how Config.interpolate works under the hood. 
+    new_config = Config().from_str(filled.to_str())
+    assert new_config["pretraining"] == {}
+
+
+def test_config_auto_fill_extra_fields():
+    config = Config({"nlp": {"lang": "en"}, "training": {}})
+    assert load_model_from_config(config, auto_fill=True)
+    config = Config({"nlp": {"lang": "en"}, "training": {"extra": "hello"}})
+    nlp = load_model_from_config(config, auto_fill=True, validate=False)
+    assert "extra" not in nlp.config["training"]
+    # Make sure the config generated is valid
+    load_model_from_config(nlp.config)
+
+
+def test_config_validate_literal():
+    nlp = English()
+    config = Config().from_str(parser_config_string)
+    config["model"]["state_type"] = "nonsense"
+    with pytest.raises(ConfigValidationError):
+        nlp.add_pipe("parser", config=config)
+    config["model"]["state_type"] = "ner"
+    nlp.add_pipe("parser", config=config)
diff --git a/spacy/tests/serialize/test_serialize_doc.py b/spacy/tests/serialize/test_serialize_doc.py
index ef2b1ee89..00b9d12d4 100644
--- a/spacy/tests/serialize/test_serialize_doc.py
+++ b/spacy/tests/serialize/test_serialize_doc.py
@@ -1,13 +1,9 @@
-# coding: utf-8
-from __future__ import unicode_literals
+import pytest
+from spacy.tokens.doc import Underscore
 import spacy
-
-import pytest
-
 from spacy.lang.en import English
 from spacy.tokens import Doc, DocBin
-from spacy.compat import path2str
 from ..util import make_tempdir
@@ -43,7 +39,7 @@ def test_serialize_doc_roundtrip_disk_str_path(en_vocab):
     doc = Doc(en_vocab, words=["hello", "world"])
     with make_tempdir() as d:
         file_path = d / "doc"
-        file_path = path2str(file_path)
+        file_path = str(file_path)
         doc.to_disk(file_path)
         doc_d = Doc(en_vocab).from_disk(file_path)
         assert doc.to_bytes() == doc_d.to_bytes()
@@ -58,10 +54,6 @@ def test_serialize_doc_exclude(en_vocab):
     assert not new_doc.user_data
     new_doc = Doc(en_vocab).from_bytes(doc.to_bytes(exclude=["user_data"]))
     assert not new_doc.user_data
-    with pytest.raises(ValueError):
-        doc.to_bytes(user_data=False)
-    with pytest.raises(ValueError):
-        Doc(en_vocab).from_bytes(doc.to_bytes(), tensor=False)
 def test_serialize_doc_bin():
@@ -81,3 +73,42 @@ def test_serialize_doc_bin():
     for i, doc in enumerate(reloaded_docs):
         assert doc.text == texts[i]
         assert doc.cats == cats
+
+
+def test_serialize_doc_bin_unknown_spaces(en_vocab):
+    doc1 = Doc(en_vocab, words=["that", "'s"])
+    assert doc1.has_unknown_spaces
+    assert doc1.text == "that 's "
+    doc2 = Doc(en_vocab, words=["that", "'s"], spaces=[False, False])
+    assert not doc2.has_unknown_spaces
+    assert doc2.text == "that's"
+
+    doc_bin = DocBin().from_bytes(DocBin(docs=[doc1, doc2]).to_bytes())
+    re_doc1, re_doc2 = doc_bin.get_docs(en_vocab)
+    assert re_doc1.has_unknown_spaces
+    assert re_doc1.text == "that 's "
+    assert not re_doc2.has_unknown_spaces
+    assert re_doc2.text == "that's"
+
+
+@pytest.mark.parametrize(
+    "writer_flag,reader_flag,reader_value",
+    [
+        (True, True, "bar"),
+        (True, False, "bar"),
+        (False, True, "nothing"),
+        (False, False, "nothing"),
+    ],
+)
+def test_serialize_custom_extension(en_vocab, writer_flag, reader_flag, reader_value):
+    """Test that custom extensions are correctly serialized in DocBin."""
+    Doc.set_extension("foo", default="nothing")
+    doc = Doc(en_vocab, words=["hello", "world"])
+    doc._.foo = "bar"
+    doc_bin_1 = DocBin(store_user_data=writer_flag)
+    doc_bin_1.add(doc)
+    doc_bin_bytes = doc_bin_1.to_bytes()
+    doc_bin_2 = DocBin(store_user_data=reader_flag).from_bytes(doc_bin_bytes)
+    doc_2 = list(doc_bin_2.get_docs(en_vocab))[0]
+    assert doc_2._.foo == reader_value
+
Underscore.doc_extensions = {} diff --git a/spacy/tests/serialize/test_serialize_extension_attrs.py b/spacy/tests/serialize/test_serialize_extension_attrs.py index 45c2e3909..9cfa1a552 100644 --- a/spacy/tests/serialize/test_serialize_extension_attrs.py +++ b/spacy/tests/serialize/test_serialize_extension_attrs.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.tokens import Doc, Token from spacy.vocab import Vocab @@ -10,9 +7,7 @@ from spacy.vocab import Vocab def doc_w_attrs(en_tokenizer): Doc.set_extension("_test_attr", default=False) Doc.set_extension("_test_prop", getter=lambda doc: len(doc.text)) - Doc.set_extension( - "_test_method", method=lambda doc, arg: "{}{}".format(len(doc.text), arg) - ) + Doc.set_extension("_test_method", method=lambda doc, arg: f"{len(doc.text)}{arg}") doc = en_tokenizer("This is a test.") doc._._test_attr = "test" @@ -28,8 +23,7 @@ def test_serialize_ext_attrs_from_bytes(doc_w_attrs): assert doc._.has("_test_attr") assert doc._._test_attr == "test" assert doc._._test_prop == len(doc.text) - assert doc._._test_method("test") == "{}{}".format(len(doc.text), "test") - + assert doc._._test_method("test") == f"{len(doc.text)}test" assert doc[0]._._test_token == "t0" assert doc[1]._._test_token == "t1" assert doc[2]._._test_token == "t0" diff --git a/spacy/tests/serialize/test_serialize_kb.py b/spacy/tests/serialize/test_serialize_kb.py index b19c11864..352c335ea 100644 --- a/spacy/tests/serialize/test_serialize_kb.py +++ b/spacy/tests/serialize/test_serialize_kb.py @@ -1,10 +1,12 @@ -# coding: utf-8 -from __future__ import unicode_literals +from typing import Callable -from spacy.util import ensure_path +from spacy import util +from spacy.util import ensure_path, registry, load_model_from_config from spacy.kb import KnowledgeBase +from thinc.api import Config from ..util import make_tempdir +from numpy import zeros def test_serialize_kb_disk(en_vocab): @@ -18,18 +20,16 @@ def test_serialize_kb_disk(en_vocab): if not dir_path.exists(): dir_path.mkdir() file_path = dir_path / "kb" - kb1.dump(str(file_path)) - + kb1.to_disk(str(file_path)) kb2 = KnowledgeBase(vocab=en_vocab, entity_vector_length=3) - kb2.load_bulk(str(file_path)) + kb2.from_disk(str(file_path)) # final assertions _check_kb(kb2) def _get_dummy_kb(vocab): - kb = KnowledgeBase(vocab=vocab, entity_vector_length=3) - + kb = KnowledgeBase(vocab, entity_vector_length=3) kb.add_entity(entity="Q53", freq=33, entity_vector=[0, 5, 3]) kb.add_entity(entity="Q17", freq=2, entity_vector=[7, 1, 0]) kb.add_entity(entity="Q007", freq=7, entity_vector=[0, 0, 7]) @@ -62,7 +62,7 @@ def _check_kb(kb): assert alias_string not in kb.get_alias_strings() # check candidates & probabilities - candidates = sorted(kb.get_candidates("double07"), key=lambda x: x.entity_) + candidates = sorted(kb.get_alias_candidates("double07"), key=lambda x: x.entity_) assert len(candidates) == 2 assert candidates[0].entity_ == "Q007" @@ -76,3 +76,68 @@ def _check_kb(kb): assert candidates[1].entity_vector == [7, 1, 0] assert candidates[1].alias_ == "double07" assert 0.099 < candidates[1].prior_prob < 0.101 + + +def test_serialize_subclassed_kb(): + """Check that IO of a custom KB works fine as part of an EL pipe.""" + + config_string = """ + [nlp] + lang = "en" + pipeline = ["entity_linker"] + + [components] + + [components.entity_linker] + factory = "entity_linker" + + [initialize] + + [initialize.components] + + [initialize.components.entity_linker] + + 
[initialize.components.entity_linker.kb_loader] + @misc = "spacy.CustomKB.v1" + entity_vector_length = 342 + custom_field = 666 + """ + + class SubKnowledgeBase(KnowledgeBase): + def __init__(self, vocab, entity_vector_length, custom_field): + super().__init__(vocab, entity_vector_length) + self.custom_field = custom_field + + @registry.misc.register("spacy.CustomKB.v1") + def custom_kb( + entity_vector_length: int, custom_field: int + ) -> Callable[["Vocab"], KnowledgeBase]: + def custom_kb_factory(vocab): + kb = SubKnowledgeBase( + vocab=vocab, + entity_vector_length=entity_vector_length, + custom_field=custom_field, + ) + kb.add_entity("random_entity", 0.0, zeros(entity_vector_length)) + return kb + + return custom_kb_factory + + config = Config().from_str(config_string) + nlp = load_model_from_config(config, auto_fill=True) + nlp.initialize() + + entity_linker = nlp.get_pipe("entity_linker") + assert type(entity_linker.kb) == SubKnowledgeBase + assert entity_linker.kb.entity_vector_length == 342 + assert entity_linker.kb.custom_field == 666 + + # Make sure the custom KB is serialized correctly + with make_tempdir() as tmp_dir: + nlp.to_disk(tmp_dir) + nlp2 = util.load_model_from_path(tmp_dir) + entity_linker2 = nlp2.get_pipe("entity_linker") + # After IO, the KB is the standard one + assert type(entity_linker2.kb) == KnowledgeBase + assert entity_linker2.kb.entity_vector_length == 342 + assert not hasattr(entity_linker2.kb, "custom_field") diff --git a/spacy/tests/serialize/test_serialize_language.py b/spacy/tests/serialize/test_serialize_language.py index efc5d181c..05529f9d1 100644 --- a/spacy/tests/serialize/test_serialize_language.py +++ b/spacy/tests/serialize/test_serialize_language.py @@ -1,8 +1,6 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest import re + from spacy.language import Language from spacy.tokenizer import Tokenizer @@ -59,12 +57,8 @@ def test_serialize_language_exclude(meta_data): nlp = Language(meta=meta_data) assert nlp.meta["name"] == name new_nlp = Language().from_bytes(nlp.to_bytes()) - assert nlp.meta["name"] == name + assert new_nlp.meta["name"] == name new_nlp = Language().from_bytes(nlp.to_bytes(), exclude=["meta"]) assert not new_nlp.meta["name"] == name new_nlp = Language().from_bytes(nlp.to_bytes(exclude=["meta"])) assert not new_nlp.meta["name"] == name - with pytest.raises(ValueError): - nlp.to_bytes(meta=False) - with pytest.raises(ValueError): - Language().from_bytes(nlp.to_bytes(), meta=False) diff --git a/spacy/tests/serialize/test_serialize_pipeline.py b/spacy/tests/serialize/test_serialize_pipeline.py index efa7ef625..951dd3035 100644 --- a/spacy/tests/serialize/test_serialize_pipeline.py +++ b/spacy/tests/serialize/test_serialize_pipeline.py @@ -1,9 +1,14 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest +from spacy import registry, Vocab from spacy.pipeline import Tagger, DependencyParser, EntityRecognizer -from spacy.pipeline import Tensorizer, TextCategorizer +from spacy.pipeline import TextCategorizer, SentenceRecognizer, TrainablePipe +from spacy.pipeline.dep_parser import DEFAULT_PARSER_MODEL +from spacy.pipeline.tagger import DEFAULT_TAGGER_MODEL +from spacy.pipeline.textcat import DEFAULT_TEXTCAT_MODEL +from spacy.pipeline.senter import DEFAULT_SENTER_MODEL +from spacy.lang.en import English +from thinc.api import Linear +import spacy from ..util import make_tempdir @@ -13,58 +18,108 @@ test_parsers = [DependencyParser, EntityRecognizer] @pytest.fixture def parser(en_vocab): - 
parser = DependencyParser(en_vocab) + config = { + "learn_tokens": False, + "min_action_freq": 30, + "update_with_oracle_cut_size": 100, + } + cfg = {"model": DEFAULT_PARSER_MODEL} + model = registry.resolve(cfg, validate=True)["model"] + parser = DependencyParser(en_vocab, model, **config) parser.add_label("nsubj") - parser.model, cfg = parser.Model(parser.moves.n_moves) - parser.cfg.update(cfg) return parser @pytest.fixture def blank_parser(en_vocab): - parser = DependencyParser(en_vocab) + config = { + "learn_tokens": False, + "min_action_freq": 30, + "update_with_oracle_cut_size": 100, + } + cfg = {"model": DEFAULT_PARSER_MODEL} + model = registry.resolve(cfg, validate=True)["model"] + parser = DependencyParser(en_vocab, model, **config) return parser @pytest.fixture def taggers(en_vocab): - tagger1 = Tagger(en_vocab) - tagger2 = Tagger(en_vocab) - tagger1.model = tagger1.Model(8) - tagger2.model = tagger1.model - return (tagger1, tagger2) + cfg = {"model": DEFAULT_TAGGER_MODEL} + model = registry.resolve(cfg, validate=True)["model"] + tagger1 = Tagger(en_vocab, model) + tagger2 = Tagger(en_vocab, model) + return tagger1, tagger2 @pytest.mark.parametrize("Parser", test_parsers) def test_serialize_parser_roundtrip_bytes(en_vocab, Parser): - parser = Parser(en_vocab) - parser.model, _ = parser.Model(10) - new_parser = Parser(en_vocab) - new_parser.model, _ = new_parser.Model(10) + config = { + "learn_tokens": False, + "min_action_freq": 0, + "update_with_oracle_cut_size": 100, + } + cfg = {"model": DEFAULT_PARSER_MODEL} + model = registry.resolve(cfg, validate=True)["model"] + parser = Parser(en_vocab, model, **config) + new_parser = Parser(en_vocab, model, **config) new_parser = new_parser.from_bytes(parser.to_bytes(exclude=["vocab"])) - assert new_parser.to_bytes(exclude=["vocab"]) == parser.to_bytes(exclude=["vocab"]) + bytes_2 = new_parser.to_bytes(exclude=["vocab"]) + bytes_3 = parser.to_bytes(exclude=["vocab"]) + assert len(bytes_2) == len(bytes_3) + assert bytes_2 == bytes_3 + + +@pytest.mark.parametrize("Parser", test_parsers) +def test_serialize_parser_strings(Parser): + vocab1 = Vocab() + label = "FunnyLabel" + assert label not in vocab1.strings + config = { + "learn_tokens": False, + "min_action_freq": 0, + "update_with_oracle_cut_size": 100, + } + cfg = {"model": DEFAULT_PARSER_MODEL} + model = registry.resolve(cfg, validate=True)["model"] + parser1 = Parser(vocab1, model, **config) + parser1.add_label(label) + assert label in parser1.vocab.strings + vocab2 = Vocab() + assert label not in vocab2.strings + parser2 = Parser(vocab2, model, **config) + parser2 = parser2.from_bytes(parser1.to_bytes(exclude=["vocab"])) + assert label in parser2.vocab.strings @pytest.mark.parametrize("Parser", test_parsers) def test_serialize_parser_roundtrip_disk(en_vocab, Parser): - parser = Parser(en_vocab) - parser.model, _ = parser.Model(0) + config = { + "learn_tokens": False, + "min_action_freq": 0, + "update_with_oracle_cut_size": 100, + } + cfg = {"model": DEFAULT_PARSER_MODEL} + model = registry.resolve(cfg, validate=True)["model"] + parser = Parser(en_vocab, model, **config) with make_tempdir() as d: file_path = d / "parser" parser.to_disk(file_path) - parser_d = Parser(en_vocab) - parser_d.model, _ = parser_d.Model(0) + parser_d = Parser(en_vocab, model, **config) parser_d = parser_d.from_disk(file_path) parser_bytes = parser.to_bytes(exclude=["model", "vocab"]) parser_d_bytes = parser_d.to_bytes(exclude=["model", "vocab"]) + assert len(parser_bytes) == len(parser_d_bytes) assert 
parser_bytes == parser_d_bytes def test_to_from_bytes(parser, blank_parser): assert parser.model is not True - assert blank_parser.model is True + assert blank_parser.model is not True assert blank_parser.moves.n_moves != parser.moves.n_moves bytes_data = parser.to_bytes(exclude=["vocab"]) + # the blank parser needs to be resized before we can call from_bytes + blank_parser.model.attrs["resize_output"](blank_parser.model, parser.moves.n_moves) blank_parser.from_bytes(bytes_data) assert blank_parser.model is not True assert blank_parser.moves.n_moves == parser.moves.n_moves @@ -78,8 +133,12 @@ def test_serialize_tagger_roundtrip_bytes(en_vocab, taggers): tagger1_b = tagger1.to_bytes() tagger1 = tagger1.from_bytes(tagger1_b) assert tagger1.to_bytes() == tagger1_b - new_tagger1 = Tagger(en_vocab).from_bytes(tagger1_b) - assert new_tagger1.to_bytes() == tagger1_b + cfg = {"model": DEFAULT_TAGGER_MODEL} + model = registry.resolve(cfg, validate=True)["model"] + new_tagger1 = Tagger(en_vocab, model).from_bytes(tagger1_b) + new_tagger1_b = new_tagger1.to_bytes() + assert len(new_tagger1_b) == len(tagger1_b) + assert new_tagger1_b == tagger1_b def test_serialize_tagger_roundtrip_disk(en_vocab, taggers): @@ -89,46 +148,55 @@ def test_serialize_tagger_roundtrip_disk(en_vocab, taggers): file_path2 = d / "tagger2" tagger1.to_disk(file_path1) tagger2.to_disk(file_path2) - tagger1_d = Tagger(en_vocab).from_disk(file_path1) - tagger2_d = Tagger(en_vocab).from_disk(file_path2) + cfg = {"model": DEFAULT_TAGGER_MODEL} + model = registry.resolve(cfg, validate=True)["model"] + tagger1_d = Tagger(en_vocab, model).from_disk(file_path1) + tagger2_d = Tagger(en_vocab, model).from_disk(file_path2) assert tagger1_d.to_bytes() == tagger2_d.to_bytes() -def test_serialize_tensorizer_roundtrip_bytes(en_vocab): - tensorizer = Tensorizer(en_vocab) - tensorizer.model = tensorizer.Model() - tensorizer_b = tensorizer.to_bytes(exclude=["vocab"]) - new_tensorizer = Tensorizer(en_vocab).from_bytes(tensorizer_b) - assert new_tensorizer.to_bytes(exclude=["vocab"]) == tensorizer_b - - -def test_serialize_tensorizer_roundtrip_disk(en_vocab): - tensorizer = Tensorizer(en_vocab) - tensorizer.model = tensorizer.Model() +def test_serialize_tagger_strings(en_vocab, de_vocab, taggers): + label = "SomeWeirdLabel" + assert label not in en_vocab.strings + assert label not in de_vocab.strings + tagger = taggers[0] + assert label not in tagger.vocab.strings with make_tempdir() as d: - file_path = d / "tensorizer" - tensorizer.to_disk(file_path) - tensorizer_d = Tensorizer(en_vocab).from_disk(file_path) - assert tensorizer.to_bytes(exclude=["vocab"]) == tensorizer_d.to_bytes( - exclude=["vocab"] - ) + # check that custom labels are serialized as part of the component's strings.jsonl + tagger.add_label(label) + assert label in tagger.vocab.strings + file_path = d / "tagger1" + tagger.to_disk(file_path) + # ensure that the custom strings are loaded back in when using the tagger in another pipeline + cfg = {"model": DEFAULT_TAGGER_MODEL} + model = registry.resolve(cfg, validate=True)["model"] + tagger2 = Tagger(de_vocab, model).from_disk(file_path) + assert label in tagger2.vocab.strings def test_serialize_textcat_empty(en_vocab): # See issue #1105 - textcat = TextCategorizer(en_vocab, labels=["ENTITY", "ACTION", "MODIFIER"]) + cfg = {"model": DEFAULT_TEXTCAT_MODEL} + model = registry.resolve(cfg, validate=True)["model"] + textcat = TextCategorizer(en_vocab, model, threshold=0.5) textcat.to_bytes(exclude=["vocab"]) 
@pytest.mark.parametrize("Parser", test_parsers) def test_serialize_pipe_exclude(en_vocab, Parser): + cfg = {"model": DEFAULT_PARSER_MODEL} + model = registry.resolve(cfg, validate=True)["model"] + config = { + "learn_tokens": False, + "min_action_freq": 0, + "update_with_oracle_cut_size": 100, + } + def get_new_parser(): - new_parser = Parser(en_vocab) - new_parser.model, _ = new_parser.Model(0) + new_parser = Parser(en_vocab, model, **config) return new_parser - parser = Parser(en_vocab) - parser.model, _ = parser.Model(0) + parser = Parser(en_vocab, model, **config) parser.cfg["foo"] = "bar" new_parser = get_new_parser().from_bytes(parser.to_bytes(exclude=["vocab"])) assert "foo" in new_parser.cfg @@ -140,7 +208,80 @@ def test_serialize_pipe_exclude(en_vocab, Parser): parser.to_bytes(exclude=["cfg"]), exclude=["vocab"] ) assert "foo" not in new_parser.cfg + + +def test_serialize_sentencerecognizer(en_vocab): + cfg = {"model": DEFAULT_SENTER_MODEL} + model = registry.resolve(cfg, validate=True)["model"] + sr = SentenceRecognizer(en_vocab, model) + sr_b = sr.to_bytes() + sr_d = SentenceRecognizer(en_vocab, model).from_bytes(sr_b) + assert sr.to_bytes() == sr_d.to_bytes() + + +def test_serialize_pipeline_disable_enable(): + nlp = English() + nlp.add_pipe("ner") + nlp.add_pipe("tagger") + nlp.disable_pipe("tagger") + assert nlp.config["nlp"]["disabled"] == ["tagger"] + config = nlp.config.copy() + nlp2 = English.from_config(config) + assert nlp2.pipe_names == ["ner"] + assert nlp2.component_names == ["ner", "tagger"] + assert nlp2.disabled == ["tagger"] + assert nlp2.config["nlp"]["disabled"] == ["tagger"] + with make_tempdir() as d: + nlp2.to_disk(d) + nlp3 = spacy.load(d) + assert nlp3.pipe_names == ["ner"] + assert nlp3.component_names == ["ner", "tagger"] + with make_tempdir() as d: + nlp3.to_disk(d) + nlp4 = spacy.load(d, disable=["ner"]) + assert nlp4.pipe_names == [] + assert nlp4.component_names == ["ner", "tagger"] + assert nlp4.disabled == ["ner", "tagger"] + with make_tempdir() as d: + nlp.to_disk(d) + nlp5 = spacy.load(d, exclude=["tagger"]) + assert nlp5.pipe_names == ["ner"] + assert nlp5.component_names == ["ner"] + assert nlp5.disabled == [] + + +def test_serialize_custom_trainable_pipe(): + class BadCustomPipe1(TrainablePipe): + def __init__(self, vocab): + pass + + class BadCustomPipe2(TrainablePipe): + def __init__(self, vocab): + self.vocab = vocab + self.model = None + + class CustomPipe(TrainablePipe): + def __init__(self, vocab, model): + self.vocab = vocab + self.model = model + + pipe = BadCustomPipe1(Vocab()) with pytest.raises(ValueError): - parser.to_bytes(cfg=False, exclude=["vocab"]) + pipe.to_bytes() + with make_tempdir() as d: + with pytest.raises(ValueError): + pipe.to_disk(d) + pipe = BadCustomPipe2(Vocab()) with pytest.raises(ValueError): - get_new_parser().from_bytes(parser.to_bytes(exclude=["vocab"]), cfg=False) + pipe.to_bytes() + with make_tempdir() as d: + with pytest.raises(ValueError): + pipe.to_disk(d) + pipe = CustomPipe(Vocab(), Linear()) + pipe_bytes = pipe.to_bytes() + new_pipe = CustomPipe(Vocab(), Linear()).from_bytes(pipe_bytes) + assert new_pipe.to_bytes() == pipe_bytes + with make_tempdir() as d: + pipe.to_disk(d) + new_pipe = CustomPipe(Vocab(), Linear()).from_disk(d) + assert new_pipe.to_bytes() == pipe_bytes diff --git a/spacy/tests/serialize/test_serialize_tokenizer.py b/spacy/tests/serialize/test_serialize_tokenizer.py index cbe119225..00a88ec38 100644 --- a/spacy/tests/serialize/test_serialize_tokenizer.py +++ 
b/spacy/tests/serialize/test_serialize_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.util import get_lang_class from spacy.tokenizer import Tokenizer @@ -9,7 +6,7 @@ from ..util import make_tempdir, assert_packed_msg_equal def load_tokenizer(b): - tok = get_lang_class("en").Defaults.create_tokenizer() + tok = get_lang_class("en")().tokenizer tok.from_bytes(b) return tok diff --git a/spacy/tests/serialize/test_serialize_vocab_strings.py b/spacy/tests/serialize/test_serialize_vocab_strings.py index 4727899a3..45a546203 100644 --- a/spacy/tests/serialize/test_serialize_vocab_strings.py +++ b/spacy/tests/serialize/test_serialize_vocab_strings.py @@ -1,11 +1,7 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest import pickle from spacy.vocab import Vocab from spacy.strings import StringStore -from spacy.compat import is_python2 from ..util import make_tempdir @@ -14,7 +10,6 @@ test_strings = [([], []), (["rats", "are", "cute"], ["i", "like", "rats"])] test_strings_attrs = [(["rats", "are", "cute"], "Hello")] -@pytest.mark.xfail @pytest.mark.parametrize("text", ["rat"]) def test_serialize_vocab(en_vocab, text): text_hash = en_vocab.strings.add(text) @@ -38,8 +33,8 @@ def test_serialize_vocab_roundtrip_bytes(strings1, strings2): assert vocab1.to_bytes() == vocab1_b new_vocab1 = Vocab().from_bytes(vocab1_b) assert new_vocab1.to_bytes() == vocab1_b - assert len(new_vocab1.strings) == len(strings1) + 1 # adds _SP - assert sorted([s for s in new_vocab1.strings]) == sorted(strings1 + ["_SP"]) + assert len(new_vocab1.strings) == len(strings1) + assert sorted([s for s in new_vocab1.strings]) == sorted(strings1) @pytest.mark.parametrize("strings1,strings2", test_strings) @@ -54,16 +49,12 @@ def test_serialize_vocab_roundtrip_disk(strings1, strings2): vocab1_d = Vocab().from_disk(file_path1) vocab2_d = Vocab().from_disk(file_path2) # check strings rather than lexemes, which are only reloaded on demand - assert strings1 == [s for s in vocab1_d.strings if s != "_SP"] - assert strings2 == [s for s in vocab2_d.strings if s != "_SP"] + assert strings1 == [s for s in vocab1_d.strings] + assert strings2 == [s for s in vocab2_d.strings] if strings1 == strings2: - assert [s for s in vocab1_d.strings if s != "_SP"] == [ - s for s in vocab2_d.strings if s != "_SP" - ] + assert [s for s in vocab1_d.strings] == [s for s in vocab2_d.strings] else: - assert [s for s in vocab1_d.strings if s != "_SP"] != [ - s for s in vocab2_d.strings if s != "_SP" - ] + assert [s for s in vocab1_d.strings] != [s for s in vocab2_d.strings] @pytest.mark.parametrize("strings,lex_attr", test_strings_attrs) @@ -82,7 +73,7 @@ def test_deserialize_vocab_seen_entries(strings, lex_attr): # Reported in #2153 vocab = Vocab(strings=strings) vocab.from_bytes(vocab.to_bytes()) - assert len(vocab.strings) == len(strings) + 1 # adds _SP + assert len(vocab.strings) == len(strings) @pytest.mark.parametrize("strings,lex_attr", test_strings_attrs) @@ -135,7 +126,6 @@ def test_serialize_stringstore_roundtrip_disk(strings1, strings2): assert list(sstore1_d) != list(sstore2_d) -@pytest.mark.skipif(is_python2, reason="Dict order? 
Not sure if worth investigating") @pytest.mark.parametrize("strings,lex_attr", test_strings_attrs) def test_pickle_vocab(strings, lex_attr): vocab = Vocab(strings=strings) diff --git a/spacy/tests/test_architectures.py b/spacy/tests/test_architectures.py index 77f1af020..31b2a2d2f 100644 --- a/spacy/tests/test_architectures.py +++ b/spacy/tests/test_architectures.py @@ -1,15 +1,12 @@ -# coding: utf8 -from __future__ import unicode_literals - import pytest from spacy import registry -from thinc.v2v import Affine +from thinc.api import Linear from catalogue import RegistryError @registry.architectures.register("my_test_function") def create_model(nr_in, nr_out): - return Affine(nr_in, nr_out) + return Linear(nr_in, nr_out) def test_get_architecture(): diff --git a/spacy/tests/test_cli.py b/spacy/tests/test_cli.py index 6dce649a9..62584d0ce 100644 --- a/spacy/tests/test_cli.py +++ b/spacy/tests/test_cli.py @@ -1,15 +1,22 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest +from click import NoSuchOption +from spacy.training import docs_to_json, offsets_to_biluo_tags +from spacy.training.converters import iob_to_docs, conll_ner_to_docs, conllu_to_docs +from spacy.schemas import ProjectConfigSchema, RecommendationSchema, validate +from spacy.util import ENV_VARS +from spacy.cli.init_config import init_config, RECOMMENDATIONS +from spacy.cli._util import validate_project_commands, parse_config_overrides +from spacy.cli._util import load_project_config, substitute_project_variables +from spacy.cli._util import string_to_list +from thinc.api import ConfigValidationError +import srsly +import os -from spacy.lang.en import English -from spacy.cli.converters import conllu2json, iob2json, conll_ner2json -from spacy.cli.pretrain import make_docs +from .util import make_tempdir -def test_cli_converters_conllu2json(): - # https://raw.githubusercontent.com/ohenrik/nb_news_ud_sm/master/original_data/no-ud-dev-ner.conllu +def test_cli_converters_conllu_to_docs(): + # from NorNE: https://github.com/ltgoslo/norne/blob/3d23274965f513f23aa48455b28b1878dad23c05/ud/nob/no_bokmaal-ud-dev.conllu lines = [ "1\tDommer\tdommer\tNOUN\t_\tDefinite=Ind|Gender=Masc|Number=Sing\t2\tappos\t_\tO", "2\tFinn\tFinn\tPROPN\t_\tGender=Masc\t4\tnsubj\t_\tB-PER", @@ -17,8 +24,9 @@ def test_cli_converters_conllu2json(): "4\tavstår\tavstå\tVERB\t_\tMood=Ind|Tense=Pres|VerbForm=Fin\t0\troot\t_\tO", ] input_data = "\n".join(lines) - converted = conllu2json(input_data, n_sents=1) - assert len(converted) == 1 + converted_docs = conllu_to_docs(input_data, n_sents=1) + assert len(converted_docs) == 1 + converted = [docs_to_json(converted_docs)] assert converted[0]["id"] == 0 assert len(converted[0]["paragraphs"]) == 1 assert len(converted[0]["paragraphs"][0]["sentences"]) == 1 @@ -29,10 +37,107 @@ def test_cli_converters_conllu2json(): assert [t["tag"] for t in tokens] == ["NOUN", "PROPN", "PROPN", "VERB"] assert [t["head"] for t in tokens] == [1, 2, -1, 0] assert [t["dep"] for t in tokens] == ["appos", "nsubj", "name", "ROOT"] - assert [t["ner"] for t in tokens] == ["O", "B-PER", "L-PER", "O"] + ent_offsets = [ + (e[0], e[1], e[2]) for e in converted[0]["paragraphs"][0]["entities"] + ] + biluo_tags = offsets_to_biluo_tags(converted_docs[0], ent_offsets, missing="O") + assert biluo_tags == ["O", "B-PER", "L-PER", "O"] -def test_cli_converters_iob2json(): +@pytest.mark.parametrize( + "lines", + [ + ( + "1\tDommer\tdommer\tNOUN\t_\tDefinite=Ind|Gender=Masc|Number=Sing\t2\tappos\t_\tname=O", + 
"2\tFinn\tFinn\tPROPN\t_\tGender=Masc\t4\tnsubj\t_\tSpaceAfter=No|name=B-PER", + "3\tEilertsen\tEilertsen\tPROPN\t_\t_\t2\tname\t_\tname=I-PER", + "4\tavstår\tavstå\tVERB\t_\tMood=Ind|Tense=Pres|VerbForm=Fin\t0\troot\t_\tSpaceAfter=No|name=O", + "5\t.\t$.\tPUNCT\t_\t_\t4\tpunct\t_\tname=B-BAD", + ), + ( + "1\tDommer\tdommer\tNOUN\t_\tDefinite=Ind|Gender=Masc|Number=Sing\t2\tappos\t_\t_", + "2\tFinn\tFinn\tPROPN\t_\tGender=Masc\t4\tnsubj\t_\tSpaceAfter=No|NE=B-PER", + "3\tEilertsen\tEilertsen\tPROPN\t_\t_\t2\tname\t_\tNE=L-PER", + "4\tavstår\tavstå\tVERB\t_\tMood=Ind|Tense=Pres|VerbForm=Fin\t0\troot\t_\tSpaceAfter=No", + "5\t.\t$.\tPUNCT\t_\t_\t4\tpunct\t_\tNE=B-BAD", + ), + ], +) +def test_cli_converters_conllu_to_docs_name_ner_map(lines): + input_data = "\n".join(lines) + converted_docs = conllu_to_docs( + input_data, n_sents=1, ner_map={"PER": "PERSON", "BAD": ""} + ) + assert len(converted_docs) == 1 + converted = [docs_to_json(converted_docs)] + assert converted[0]["id"] == 0 + assert len(converted[0]["paragraphs"]) == 1 + assert converted[0]["paragraphs"][0]["raw"] == "Dommer FinnEilertsen avstår. " + assert len(converted[0]["paragraphs"][0]["sentences"]) == 1 + sent = converted[0]["paragraphs"][0]["sentences"][0] + assert len(sent["tokens"]) == 5 + tokens = sent["tokens"] + assert [t["orth"] for t in tokens] == ["Dommer", "Finn", "Eilertsen", "avstår", "."] + assert [t["tag"] for t in tokens] == ["NOUN", "PROPN", "PROPN", "VERB", "PUNCT"] + assert [t["head"] for t in tokens] == [1, 2, -1, 0, -1] + assert [t["dep"] for t in tokens] == ["appos", "nsubj", "name", "ROOT", "punct"] + ent_offsets = [ + (e[0], e[1], e[2]) for e in converted[0]["paragraphs"][0]["entities"] + ] + biluo_tags = offsets_to_biluo_tags(converted_docs[0], ent_offsets, missing="O") + assert biluo_tags == ["O", "B-PERSON", "L-PERSON", "O", "O"] + + +def test_cli_converters_conllu_to_docs_subtokens(): + # https://raw.githubusercontent.com/ohenrik/nb_news_ud_sm/master/original_data/no-ud-dev-ner.conllu + lines = [ + "1\tDommer\tdommer\tNOUN\t_\tDefinite=Ind|Gender=Masc|Number=Sing\t2\tappos\t_\tname=O", + "2-3\tFE\t_\t_\t_\t_\t_\t_\t_\t_", + "2\tFinn\tFinn\tPROPN\t_\tGender=Masc\t4\tnsubj\t_\tname=B-PER", + "3\tEilertsen\tEilertsen\tX\t_\tGender=Fem|Tense=past\t2\tname\t_\tname=I-PER", + "4\tavstår\tavstå\tVERB\t_\tMood=Ind|Tense=Pres|VerbForm=Fin\t0\troot\t_\tSpaceAfter=No|name=O", + "5\t.\t$.\tPUNCT\t_\t_\t4\tpunct\t_\tname=O", + ] + input_data = "\n".join(lines) + converted_docs = conllu_to_docs( + input_data, n_sents=1, merge_subtokens=True, append_morphology=True + ) + assert len(converted_docs) == 1 + converted = [docs_to_json(converted_docs)] + + assert converted[0]["id"] == 0 + assert len(converted[0]["paragraphs"]) == 1 + assert converted[0]["paragraphs"][0]["raw"] == "Dommer FE avstår. 
" + assert len(converted[0]["paragraphs"][0]["sentences"]) == 1 + sent = converted[0]["paragraphs"][0]["sentences"][0] + assert len(sent["tokens"]) == 4 + tokens = sent["tokens"] + print(tokens) + assert [t["orth"] for t in tokens] == ["Dommer", "FE", "avstår", "."] + assert [t["tag"] for t in tokens] == [ + "NOUN__Definite=Ind|Gender=Masc|Number=Sing", + "PROPN_X__Gender=Fem,Masc|Tense=past", + "VERB__Mood=Ind|Tense=Pres|VerbForm=Fin", + "PUNCT", + ] + assert [t["pos"] for t in tokens] == ["NOUN", "PROPN", "VERB", "PUNCT"] + assert [t["morph"] for t in tokens] == [ + "Definite=Ind|Gender=Masc|Number=Sing", + "Gender=Fem,Masc|Tense=past", + "Mood=Ind|Tense=Pres|VerbForm=Fin", + "", + ] + assert [t["lemma"] for t in tokens] == ["dommer", "Finn Eilertsen", "avstå", "$."] + assert [t["head"] for t in tokens] == [1, 1, 0, -1] + assert [t["dep"] for t in tokens] == ["appos", "nsubj", "ROOT", "punct"] + ent_offsets = [ + (e[0], e[1], e[2]) for e in converted[0]["paragraphs"][0]["entities"] + ] + biluo_tags = offsets_to_biluo_tags(converted_docs[0], ent_offsets, missing="O") + assert biluo_tags == ["O", "U-PER", "O", "O"] + + +def test_cli_converters_iob_to_docs(): lines = [ "I|O like|O London|I-GPE and|O New|B-GPE York|I-GPE City|I-GPE .|O", "I|O like|O London|B-GPE and|O New|B-GPE York|I-GPE City|I-GPE .|O", @@ -40,22 +145,24 @@ def test_cli_converters_iob2json(): "I|PRP|O like|VBP|O London|NNP|B-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O", ] input_data = "\n".join(lines) - converted = iob2json(input_data, n_sents=10) - assert len(converted) == 1 - assert converted[0]["id"] == 0 - assert len(converted[0]["paragraphs"]) == 1 - assert len(converted[0]["paragraphs"][0]["sentences"]) == 4 + converted_docs = iob_to_docs(input_data, n_sents=10) + assert len(converted_docs) == 1 + converted = docs_to_json(converted_docs) + assert converted["id"] == 0 + assert len(converted["paragraphs"]) == 1 + assert len(converted["paragraphs"][0]["sentences"]) == 4 for i in range(0, 4): - sent = converted[0]["paragraphs"][0]["sentences"][i] + sent = converted["paragraphs"][0]["sentences"][i] assert len(sent["tokens"]) == 8 tokens = sent["tokens"] - # fmt: off - assert [t["orth"] for t in tokens] == ["I", "like", "London", "and", "New", "York", "City", "."] - assert [t["ner"] for t in tokens] == ["O", "O", "U-GPE", "O", "B-GPE", "I-GPE", "L-GPE", "O"] - # fmt: on + expected = ["I", "like", "London", "and", "New", "York", "City", "."] + assert [t["orth"] for t in tokens] == expected + assert len(converted_docs[0].ents) == 8 + for ent in converted_docs[0].ents: + assert ent.text in ["New York City", "London"] -def test_cli_converters_conll_ner2json(): +def test_cli_converters_conll_ner_to_docs(): lines = [ "-DOCSTART- -X- O O", "", @@ -105,59 +212,206 @@ def test_cli_converters_conll_ner2json(): ".\t.\t_\tO", ] input_data = "\n".join(lines) - converted = conll_ner2json(input_data, n_sents=10) - print(converted) - assert len(converted) == 1 - assert converted[0]["id"] == 0 - assert len(converted[0]["paragraphs"]) == 1 - assert len(converted[0]["paragraphs"][0]["sentences"]) == 5 + converted_docs = conll_ner_to_docs(input_data, n_sents=10) + assert len(converted_docs) == 1 + converted = docs_to_json(converted_docs) + assert converted["id"] == 0 + assert len(converted["paragraphs"]) == 1 + assert len(converted["paragraphs"][0]["sentences"]) == 5 for i in range(0, 5): - sent = converted[0]["paragraphs"][0]["sentences"][i] + sent = converted["paragraphs"][0]["sentences"][i] assert len(sent["tokens"]) == 
8 tokens = sent["tokens"] # fmt: off assert [t["orth"] for t in tokens] == ["I", "like", "London", "and", "New", "York", "City", "."] - assert [t["ner"] for t in tokens] == ["O", "O", "U-GPE", "O", "B-GPE", "I-GPE", "L-GPE", "O"] # fmt: on + assert len(converted_docs[0].ents) == 10 + for ent in converted_docs[0].ents: + assert ent.text in ["New York City", "London"] -def test_pretrain_make_docs(): - nlp = English() +def test_project_config_validation_full(): + config = { + "vars": {"some_var": 20}, + "directories": ["assets", "configs", "corpus", "scripts", "training"], + "assets": [ + { + "dest": "x", + "url": "https://example.com", + "checksum": "63373dd656daa1fd3043ce166a59474c", + }, + { + "dest": "y", + "git": { + "repo": "https://github.com/example/repo", + "branch": "develop", + "path": "y", + }, + }, + ], + "commands": [ + { + "name": "train", + "help": "Train a model", + "script": ["python -m spacy train config.cfg -o training"], + "deps": ["config.cfg", "corpus/training.spcy"], + "outputs": ["training/model-best"], + }, + {"name": "test", "script": ["pytest", "custom.py"], "no_skip": True}, + ], + "workflows": {"all": ["train", "test"], "train": ["train"]}, + } + errors = validate(ProjectConfigSchema, config) + assert not errors - valid_jsonl_text = {"text": "Some text"} - docs, skip_count = make_docs(nlp, [valid_jsonl_text], 1, 10) - assert len(docs) == 1 - assert skip_count == 0 - valid_jsonl_tokens = {"tokens": ["Some", "tokens"]} - docs, skip_count = make_docs(nlp, [valid_jsonl_tokens], 1, 10) - assert len(docs) == 1 - assert skip_count == 0 +@pytest.mark.parametrize( + "config", + [ + {"commands": [{"name": "a"}, {"name": "a"}]}, + {"commands": [{"name": "a"}], "workflows": {"a": []}}, + {"commands": [{"name": "a"}], "workflows": {"b": ["c"]}}, + ], +) +def test_project_config_validation1(config): + with pytest.raises(SystemExit): + validate_project_commands(config) - invalid_jsonl_type = 0 - with pytest.raises(TypeError): - make_docs(nlp, [invalid_jsonl_type], 1, 100) - invalid_jsonl_key = {"invalid": "Does not matter"} - with pytest.raises(ValueError): - make_docs(nlp, [invalid_jsonl_key], 1, 100) +@pytest.mark.parametrize( + "config,n_errors", + [ + ({"commands": {"a": []}}, 1), + ({"commands": [{"help": "..."}]}, 1), + ({"commands": [{"name": "a", "extra": "b"}]}, 1), + ({"commands": [{"extra": "b"}]}, 2), + ({"commands": [{"name": "a", "deps": [123]}]}, 1), + ], +) +def test_project_config_validation2(config, n_errors): + errors = validate(ProjectConfigSchema, config) + assert len(errors) == n_errors - empty_jsonl_text = {"text": ""} - docs, skip_count = make_docs(nlp, [empty_jsonl_text], 1, 10) - assert len(docs) == 0 - assert skip_count == 1 - empty_jsonl_tokens = {"tokens": []} - docs, skip_count = make_docs(nlp, [empty_jsonl_tokens], 1, 10) - assert len(docs) == 0 - assert skip_count == 1 +def test_project_config_interpolation(): + variables = {"a": 10, "b": {"c": "foo", "d": True}} + commands = [ + {"name": "x", "script": ["hello ${vars.a} ${vars.b.c}"]}, + {"name": "y", "script": ["${vars.b.c} ${vars.b.d}"]}, + ] + project = {"commands": commands, "vars": variables} + with make_tempdir() as d: + srsly.write_yaml(d / "project.yml", project) + cfg = load_project_config(d) + assert cfg["commands"][0]["script"][0] == "hello 10 foo" + assert cfg["commands"][1]["script"][0] == "foo true" + commands = [{"name": "x", "script": ["hello ${vars.a} ${vars.b.e}"]}] + project = {"commands": commands, "vars": variables} + with pytest.raises(ConfigValidationError): + 
substitute_project_variables(project) - too_short_jsonl = {"text": "This text is not long enough"} - docs, skip_count = make_docs(nlp, [too_short_jsonl], 10, 15) - assert len(docs) == 0 - assert skip_count == 0 - too_long_jsonl = {"text": "This text contains way too much tokens for this test"} - docs, skip_count = make_docs(nlp, [too_long_jsonl], 1, 5) - assert len(docs) == 0 - assert skip_count == 0 +@pytest.mark.parametrize( + "args,expected", + [ + # fmt: off + (["--x.foo", "10"], {"x.foo": 10}), + (["--x.foo=10"], {"x.foo": 10}), + (["--x.foo", "bar"], {"x.foo": "bar"}), + (["--x.foo=bar"], {"x.foo": "bar"}), + (["--x.foo", "--x.bar", "baz"], {"x.foo": True, "x.bar": "baz"}), + (["--x.foo", "--x.bar=baz"], {"x.foo": True, "x.bar": "baz"}), + (["--x.foo", "10.1", "--x.bar", "--x.baz", "false"], {"x.foo": 10.1, "x.bar": True, "x.baz": False}), + (["--x.foo", "10.1", "--x.bar", "--x.baz=false"], {"x.foo": 10.1, "x.bar": True, "x.baz": False}) + # fmt: on + ], +) +def test_parse_config_overrides(args, expected): + assert parse_config_overrides(args) == expected + + +@pytest.mark.parametrize("args", [["--foo"], ["--x.foo", "bar", "--baz"]]) +def test_parse_config_overrides_invalid(args): + with pytest.raises(NoSuchOption): + parse_config_overrides(args) + + +@pytest.mark.parametrize("args", [["--x.foo", "bar", "baz"], ["x.foo"]]) +def test_parse_config_overrides_invalid_2(args): + with pytest.raises(SystemExit): + parse_config_overrides(args) + + +def test_parse_cli_overrides(): + overrides = "--x.foo bar --x.bar=12 --x.baz false --y.foo=hello" + os.environ[ENV_VARS.CONFIG_OVERRIDES] = overrides + result = parse_config_overrides([]) + assert len(result) == 4 + assert result["x.foo"] == "bar" + assert result["x.bar"] == 12 + assert result["x.baz"] is False + assert result["y.foo"] == "hello" + os.environ[ENV_VARS.CONFIG_OVERRIDES] = "--x" + assert parse_config_overrides([], env_var=None) == {} + with pytest.raises(SystemExit): + parse_config_overrides([]) + os.environ[ENV_VARS.CONFIG_OVERRIDES] = "hello world" + with pytest.raises(SystemExit): + parse_config_overrides([]) + del os.environ[ENV_VARS.CONFIG_OVERRIDES] + + +@pytest.mark.parametrize("lang", ["en", "nl"]) +@pytest.mark.parametrize( + "pipeline", [["tagger", "parser", "ner"], [], ["ner", "textcat", "sentencizer"]] +) +@pytest.mark.parametrize("optimize", ["efficiency", "accuracy"]) +def test_init_config(lang, pipeline, optimize): + # TODO: add more tests and also check for GPU with transformers + init_config("-", lang=lang, pipeline=pipeline, optimize=optimize, cpu=True) + + +def test_model_recommendations(): + for lang, data in RECOMMENDATIONS.items(): + assert RecommendationSchema(**data) + + +@pytest.mark.parametrize( + "value", + [ + # fmt: off + "parser,textcat,tagger", + " parser, textcat ,tagger ", + 'parser,textcat,tagger', + ' parser, textcat ,tagger ', + ' "parser"," textcat " ,"tagger "', + " 'parser',' textcat ' ,'tagger '", + '[parser,textcat,tagger]', + '["parser","textcat","tagger"]', + '[" parser" ,"textcat ", " tagger " ]', + "[parser,textcat,tagger]", + "[ parser, textcat , tagger]", + "['parser','textcat','tagger']", + "[' parser' , 'textcat', ' tagger ' ]", + # fmt: on + ], +) +def test_string_to_list(value): + assert string_to_list(value, intify=False) == ["parser", "textcat", "tagger"] + + +@pytest.mark.parametrize( + "value", + [ + # fmt: off + "1,2,3", + '[1,2,3]', + '["1","2","3"]', + '[" 1" ,"2 ", " 3 " ]', + "[' 1' , '2', ' 3 ' ]", + # fmt: on + ], +) +def test_string_to_list_intify(value): + assert 
string_to_list(value, intify=False) == ["1", "2", "3"] + assert string_to_list(value, intify=True) == [1, 2, 3] diff --git a/spacy/tests/test_displacy.py b/spacy/tests/test_displacy.py index 539714e0c..040dd657f 100644 --- a/spacy/tests/test_displacy.py +++ b/spacy/tests/test_displacy.py @@ -1,18 +1,13 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy import displacy -from spacy.displacy.render import DependencyRenderer -from spacy.tokens import Span +from spacy.displacy.render import DependencyRenderer, EntityRenderer +from spacy.tokens import Span, Doc from spacy.lang.fa import Persian -from .util import get_doc - def test_displacy_parse_ents(en_vocab): """Test that named entities on a Doc are converted into displaCy's format.""" - doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"]) + doc = Doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"]) doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings["ORG"])] ents = displacy.parse_ents(doc) assert isinstance(ents, dict) @@ -23,11 +18,11 @@ def test_displacy_parse_ents(en_vocab): def test_displacy_parse_deps(en_vocab): """Test that deps and tags on a Doc are converted into displaCy's format.""" words = ["This", "is", "a", "sentence"] - heads = [1, 0, 1, -2] + heads = [1, 1, 3, 1] pos = ["DET", "VERB", "DET", "NOUN"] tags = ["DT", "VBZ", "DT", "NN"] deps = ["nsubj", "ROOT", "det", "attr"] - doc = get_doc(en_vocab, words=words, heads=heads, pos=pos, tags=tags, deps=deps) + doc = Doc(en_vocab, words=words, heads=heads, pos=pos, tags=tags, deps=deps) deps = displacy.parse_deps(doc) assert isinstance(deps, dict) assert deps["words"] == [ @@ -56,7 +51,7 @@ def test_displacy_invalid_arcs(): def test_displacy_spans(en_vocab): """Test that displaCy can render Spans.""" - doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"]) + doc = Doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"]) doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings["ORG"])] html = displacy.render(doc[1:4], style="ent") assert html.startswith("TEST") # Restore displacy.set_render_wrapper(lambda html: html) + + +def test_displacy_options_case(): + ents = ["foo", "BAR"] + colors = {"FOO": "red", "bar": "green"} + renderer = EntityRenderer({"ents": ents, "colors": colors}) + text = "abcd" + labels = ["foo", "bar", "FOO", "BAR"] + spans = [{"start": i, "end": i + 1, "label": labels[i]} for i in range(len(text))] + result = renderer.render_ents("abcde", spans, None).split("\n\n") + assert "red" in result[0] and "foo" in result[0] + assert "green" in result[1] and "bar" in result[1] + assert "red" in result[2] and "FOO" in result[2] + assert "green" in result[3] and "BAR" in result[3] diff --git a/spacy/tests/test_errors.py b/spacy/tests/test_errors.py index 1bd4eec7f..e79abc6ab 100644 --- a/spacy/tests/test_errors.py +++ b/spacy/tests/test_errors.py @@ -6,7 +6,7 @@ from spacy.errors import add_codes @add_codes -class Errors(object): +class Errors: E001 = "error description" diff --git a/spacy/tests/test_gold.py b/spacy/tests/test_gold.py deleted file mode 100644 index 53665d852..000000000 --- a/spacy/tests/test_gold.py +++ /dev/null @@ -1,288 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from spacy.gold import biluo_tags_from_offsets, offsets_from_biluo_tags -from spacy.gold import spans_from_biluo_tags, GoldParse, iob_to_biluo -from spacy.gold import GoldCorpus, docs_to_json, align -from spacy.lang.en import English -from 
spacy.tokens import Doc -from spacy.util import get_words_and_spaces -from .util import make_tempdir -import pytest -import srsly - - -def test_gold_biluo_U(en_vocab): - words = ["I", "flew", "to", "London", "."] - spaces = [True, True, True, False, True] - doc = Doc(en_vocab, words=words, spaces=spaces) - entities = [(len("I flew to "), len("I flew to London"), "LOC")] - tags = biluo_tags_from_offsets(doc, entities) - assert tags == ["O", "O", "O", "U-LOC", "O"] - - -def test_gold_biluo_BL(en_vocab): - words = ["I", "flew", "to", "San", "Francisco", "."] - spaces = [True, True, True, True, False, True] - doc = Doc(en_vocab, words=words, spaces=spaces) - entities = [(len("I flew to "), len("I flew to San Francisco"), "LOC")] - tags = biluo_tags_from_offsets(doc, entities) - assert tags == ["O", "O", "O", "B-LOC", "L-LOC", "O"] - - -def test_gold_biluo_BIL(en_vocab): - words = ["I", "flew", "to", "San", "Francisco", "Valley", "."] - spaces = [True, True, True, True, True, False, True] - doc = Doc(en_vocab, words=words, spaces=spaces) - entities = [(len("I flew to "), len("I flew to San Francisco Valley"), "LOC")] - tags = biluo_tags_from_offsets(doc, entities) - assert tags == ["O", "O", "O", "B-LOC", "I-LOC", "L-LOC", "O"] - - -def test_gold_biluo_overlap(en_vocab): - words = ["I", "flew", "to", "San", "Francisco", "Valley", "."] - spaces = [True, True, True, True, True, False, True] - doc = Doc(en_vocab, words=words, spaces=spaces) - entities = [ - (len("I flew to "), len("I flew to San Francisco Valley"), "LOC"), - (len("I flew to "), len("I flew to San Francisco"), "LOC"), - ] - with pytest.raises(ValueError): - biluo_tags_from_offsets(doc, entities) - - -def test_gold_biluo_misalign(en_vocab): - words = ["I", "flew", "to", "San", "Francisco", "Valley."] - spaces = [True, True, True, True, True, False] - doc = Doc(en_vocab, words=words, spaces=spaces) - entities = [(len("I flew to "), len("I flew to San Francisco Valley"), "LOC")] - with pytest.warns(UserWarning): - tags = biluo_tags_from_offsets(doc, entities) - assert tags == ["O", "O", "O", "-", "-", "-"] - - -def test_gold_biluo_different_tokenization(en_vocab, en_tokenizer): - # one-to-many - words = ["I", "flew to", "San Francisco Valley", "."] - spaces = [True, True, False, False] - doc = Doc(en_vocab, words=words, spaces=spaces) - entities = [(len("I flew to "), len("I flew to San Francisco Valley"), "LOC")] - gp = GoldParse( - doc, - words=["I", "flew", "to", "San", "Francisco", "Valley", "."], - entities=entities, - ) - assert gp.ner == ["O", "O", "U-LOC", "O"] - - # many-to-one - words = ["I", "flew", "to", "San", "Francisco", "Valley", "."] - spaces = [True, True, True, True, True, False, False] - doc = Doc(en_vocab, words=words, spaces=spaces) - entities = [(len("I flew to "), len("I flew to San Francisco Valley"), "LOC")] - gp = GoldParse( - doc, words=["I", "flew to", "San Francisco Valley", "."], entities=entities - ) - assert gp.ner == ["O", "O", "O", "B-LOC", "I-LOC", "L-LOC", "O"] - - # misaligned - words = ["I flew", "to", "San Francisco", "Valley", "."] - spaces = [True, True, True, False, False] - doc = Doc(en_vocab, words=words, spaces=spaces) - entities = [(len("I flew to "), len("I flew to San Francisco Valley"), "LOC")] - gp = GoldParse( - doc, words=["I", "flew to", "San", "Francisco Valley", "."], entities=entities, - ) - assert gp.ner == ["O", "O", "B-LOC", "L-LOC", "O"] - - # additional whitespace tokens in GoldParse words - words, spaces = get_words_and_spaces( - ["I", "flew", "to", "San Francisco", 
"Valley", "."], - "I flew to San Francisco Valley.", - ) - doc = Doc(en_vocab, words=words, spaces=spaces) - entities = [(len("I flew to "), len("I flew to San Francisco Valley"), "LOC")] - gp = GoldParse( - doc, - words=["I", "flew", " ", "to", "San Francisco Valley", "."], - entities=entities, - ) - assert gp.ner == ["O", "O", "O", "O", "B-LOC", "L-LOC", "O"] - - # from issue #4791 - data = ( - "I'll return the ₹54 amount", - { - "words": ["I", "'ll", "return", "the", "₹", "54", "amount"], - "entities": [(16, 19, "MONEY")], - }, - ) - gp = GoldParse(en_tokenizer(data[0]), **data[1]) - assert gp.ner == ["O", "O", "O", "O", "U-MONEY", "O"] - - data = ( - "I'll return the $54 amount", - { - "words": ["I", "'ll", "return", "the", "$", "54", "amount"], - "entities": [(16, 19, "MONEY")], - }, - ) - gp = GoldParse(en_tokenizer(data[0]), **data[1]) - assert gp.ner == ["O", "O", "O", "O", "B-MONEY", "L-MONEY", "O"] - - -def test_roundtrip_offsets_biluo_conversion(en_tokenizer): - text = "I flew to Silicon Valley via London." - biluo_tags = ["O", "O", "O", "B-LOC", "L-LOC", "O", "U-GPE", "O"] - offsets = [(10, 24, "LOC"), (29, 35, "GPE")] - doc = en_tokenizer(text) - biluo_tags_converted = biluo_tags_from_offsets(doc, offsets) - assert biluo_tags_converted == biluo_tags - offsets_converted = offsets_from_biluo_tags(doc, biluo_tags) - assert offsets_converted == offsets - - -def test_biluo_spans(en_tokenizer): - doc = en_tokenizer("I flew to Silicon Valley via London.") - biluo_tags = ["O", "O", "O", "B-LOC", "L-LOC", "O", "U-GPE", "O"] - spans = spans_from_biluo_tags(doc, biluo_tags) - assert len(spans) == 2 - assert spans[0].text == "Silicon Valley" - assert spans[0].label_ == "LOC" - assert spans[1].text == "London" - assert spans[1].label_ == "GPE" - - -def test_gold_ner_missing_tags(en_tokenizer): - doc = en_tokenizer("I flew to Silicon Valley via London.") - biluo_tags = [None, "O", "O", "B-LOC", "L-LOC", "O", "U-GPE", "O"] - gold = GoldParse(doc, entities=biluo_tags) # noqa: F841 - - -def test_iob_to_biluo(): - good_iob = ["O", "O", "B-LOC", "I-LOC", "O", "B-PERSON"] - good_biluo = ["O", "O", "B-LOC", "L-LOC", "O", "U-PERSON"] - bad_iob = ["O", "O", '"', "B-LOC", "I-LOC"] - converted_biluo = iob_to_biluo(good_iob) - assert good_biluo == converted_biluo - with pytest.raises(ValueError): - iob_to_biluo(bad_iob) - - -def test_roundtrip_docs_to_json(): - text = "I flew to Silicon Valley via London." 
- tags = ["PRP", "VBD", "IN", "NNP", "NNP", "IN", "NNP", "."] - heads = [1, 1, 1, 4, 2, 1, 5, 1] - deps = ["nsubj", "ROOT", "prep", "compound", "pobj", "prep", "pobj", "punct"] - biluo_tags = ["O", "O", "O", "B-LOC", "L-LOC", "O", "U-GPE", "O"] - cats = {"TRAVEL": 1.0, "BAKING": 0.0} - nlp = English() - doc = nlp(text) - for i in range(len(tags)): - doc[i].tag_ = tags[i] - doc[i].dep_ = deps[i] - doc[i].head = doc[heads[i]] - doc.ents = spans_from_biluo_tags(doc, biluo_tags) - doc.cats = cats - doc.is_tagged = True - doc.is_parsed = True - - # roundtrip to JSON - with make_tempdir() as tmpdir: - json_file = tmpdir / "roundtrip.json" - srsly.write_json(json_file, [docs_to_json(doc)]) - goldcorpus = GoldCorpus(str(json_file), str(json_file)) - - reloaded_doc, goldparse = next(goldcorpus.train_docs(nlp)) - - assert len(doc) == goldcorpus.count_train() - assert text == reloaded_doc.text - assert tags == goldparse.tags - assert deps == goldparse.labels - assert heads == goldparse.heads - assert biluo_tags == goldparse.ner - assert "TRAVEL" in goldparse.cats - assert "BAKING" in goldparse.cats - assert cats["TRAVEL"] == goldparse.cats["TRAVEL"] - assert cats["BAKING"] == goldparse.cats["BAKING"] - - # roundtrip to JSONL train dicts - with make_tempdir() as tmpdir: - jsonl_file = tmpdir / "roundtrip.jsonl" - srsly.write_jsonl(jsonl_file, [docs_to_json(doc)]) - goldcorpus = GoldCorpus(str(jsonl_file), str(jsonl_file)) - - reloaded_doc, goldparse = next(goldcorpus.train_docs(nlp)) - - assert len(doc) == goldcorpus.count_train() - assert text == reloaded_doc.text - assert tags == goldparse.tags - assert deps == goldparse.labels - assert heads == goldparse.heads - assert biluo_tags == goldparse.ner - assert "TRAVEL" in goldparse.cats - assert "BAKING" in goldparse.cats - assert cats["TRAVEL"] == goldparse.cats["TRAVEL"] - assert cats["BAKING"] == goldparse.cats["BAKING"] - - # roundtrip to JSONL tuples - with make_tempdir() as tmpdir: - jsonl_file = tmpdir / "roundtrip.jsonl" - # write to JSONL train dicts - srsly.write_jsonl(jsonl_file, [docs_to_json(doc)]) - goldcorpus = GoldCorpus(str(jsonl_file), str(jsonl_file)) - # load and rewrite as JSONL tuples - srsly.write_jsonl(jsonl_file, goldcorpus.train_tuples) - goldcorpus = GoldCorpus(str(jsonl_file), str(jsonl_file)) - - reloaded_doc, goldparse = next(goldcorpus.train_docs(nlp)) - - assert len(doc) == goldcorpus.count_train() - assert text == reloaded_doc.text - assert tags == goldparse.tags - assert deps == goldparse.labels - assert heads == goldparse.heads - assert biluo_tags == goldparse.ner - assert "TRAVEL" in goldparse.cats - assert "BAKING" in goldparse.cats - assert cats["TRAVEL"] == goldparse.cats["TRAVEL"] - assert cats["BAKING"] == goldparse.cats["BAKING"] - - -@pytest.mark.parametrize( - "tokens_a,tokens_b,expected", - [ - (["a", "b", "c"], ["ab", "c"], (3, [-1, -1, 1], [-1, 2], {0: 0, 1: 0}, {})), - ( - ["a", "b", '"', "c"], - ['ab"', "c"], - (4, [-1, -1, -1, 1], [-1, 3], {0: 0, 1: 0, 2: 0}, {}), - ), - (["a", "bc"], ["ab", "c"], (4, [-1, -1], [-1, -1], {0: 0}, {1: 1})), - ( - ["ab", "c", "d"], - ["a", "b", "cd"], - (6, [-1, -1, -1], [-1, -1, -1], {1: 2, 2: 2}, {0: 0, 1: 0}), - ), - ( - ["a", "b", "cd"], - ["a", "b", "c", "d"], - (3, [0, 1, -1], [0, 1, -1, -1], {}, {2: 2, 3: 2}), - ), - ([" ", "a"], ["a"], (1, [-1, 0], [1], {}, {})), - ], -) -def test_align(tokens_a, tokens_b, expected): - cost, a2b, b2a, a2b_multi, b2a_multi = align(tokens_a, tokens_b) - assert (cost, list(a2b), list(b2a), a2b_multi, b2a_multi) == expected - # check 
symmetry - cost, a2b, b2a, a2b_multi, b2a_multi = align(tokens_b, tokens_a) - assert (cost, list(b2a), list(a2b), b2a_multi, a2b_multi) == expected - - -def test_goldparse_startswith_space(en_tokenizer): - text = " a" - doc = en_tokenizer(text) - g = GoldParse(doc, words=["a"], entities=["U-DATE"], deps=["ROOT"], heads=[0]) - assert g.words == [" ", "a"] - assert g.ner == [None, "U-DATE"] - assert g.labels == [None, "ROOT"] diff --git a/spacy/tests/test_json_schemas.py b/spacy/tests/test_json_schemas.py deleted file mode 100644 index 89e797c1a..000000000 --- a/spacy/tests/test_json_schemas.py +++ /dev/null @@ -1,50 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from spacy.util import get_json_validator, validate_json, validate_schema -from spacy.cli._schemas import META_SCHEMA, TRAINING_SCHEMA -from spacy.matcher._schemas import TOKEN_PATTERN_SCHEMA -import pytest - - -@pytest.fixture(scope="session") -def training_schema_validator(): - return get_json_validator(TRAINING_SCHEMA) - - -def test_validate_schema(): - validate_schema({"type": "object"}) - with pytest.raises(Exception): - validate_schema({"type": lambda x: x}) - - -@pytest.mark.parametrize("schema", [TRAINING_SCHEMA, META_SCHEMA, TOKEN_PATTERN_SCHEMA]) -def test_schemas(schema): - validate_schema(schema) - - -@pytest.mark.parametrize( - "data", - [ - {"text": "Hello world"}, - {"text": "Hello", "ents": [{"start": 0, "end": 5, "label": "TEST"}]}, - ], -) -def test_json_schema_training_valid(data, training_schema_validator): - errors = validate_json([data], training_schema_validator) - assert not errors - - -@pytest.mark.parametrize( - "data,n_errors", - [ - ({"spans": []}, 1), - ({"text": "Hello", "ents": [{"start": "0", "end": "5", "label": "TEST"}]}, 2), - ({"text": "Hello", "ents": [{"start": 0, "end": 5}]}, 1), - ({"text": "Hello", "ents": [{"start": 0, "end": 5, "label": "test"}]}, 1), - ({"text": "spaCy", "tokens": [{"pos": "PROPN"}]}, 2), - ], -) -def test_json_schema_training_invalid(data, n_errors, training_schema_validator): - errors = validate_json([data], training_schema_validator) - assert len(errors) == n_errors diff --git a/spacy/tests/test_language.py b/spacy/tests/test_language.py index 7106cef74..917e7552e 100644 --- a/spacy/tests/test_language.py +++ b/spacy/tests/test_language.py @@ -1,14 +1,13 @@ -# coding: utf-8 -from __future__ import unicode_literals - import itertools - import pytest -from spacy.compat import is_python2 -from spacy.gold import GoldParse from spacy.language import Language from spacy.tokens import Doc, Span from spacy.vocab import Vocab +from spacy.training import Example +from spacy.lang.en import English +from spacy.lang.de import German +from spacy.util import registry +import spacy from .util import add_vecs_to_vocab, assert_docs_equal @@ -16,11 +15,10 @@ from .util import add_vecs_to_vocab, assert_docs_equal @pytest.fixture def nlp(): nlp = Language(Vocab()) - textcat = nlp.create_pipe("textcat") + textcat = nlp.add_pipe("textcat") for label in ("POSITIVE", "NEGATIVE"): textcat.add_label(label) - nlp.add_pipe(textcat) - nlp.begin_training() + nlp.initialize() return nlp @@ -29,66 +27,77 @@ def test_language_update(nlp): annots = {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}} wrongkeyannots = {"LABEL": True} doc = Doc(nlp.vocab, words=text.split(" ")) - gold = GoldParse(doc, **annots) - # Update with doc and gold objects - nlp.update([doc], [gold]) - # Update with text and dict - nlp.update([text], [annots]) + example = Example.from_dict(doc, annots) + 
nlp.update([example]) + + # Not allowed to call with just one Example + with pytest.raises(TypeError): + nlp.update(example) + + # Update with text and dict: not supported anymore since v.3 + with pytest.raises(TypeError): + nlp.update((text, annots)) # Update with doc object and dict - nlp.update([doc], [annots]) - # Update with text and gold object - nlp.update([text], [gold]) - # Update badly - with pytest.raises(IndexError): - nlp.update([doc], []) - with pytest.raises(IndexError): - nlp.update([], [gold]) + with pytest.raises(TypeError): + nlp.update((doc, annots)) + + # Create examples badly with pytest.raises(ValueError): - nlp.update([text], [wrongkeyannots]) + example = Example.from_dict(doc, None) + with pytest.raises(KeyError): + example = Example.from_dict(doc, wrongkeyannots) def test_language_evaluate(nlp): text = "hello world" - annots = {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}} + annots = {"doc_annotation": {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}} doc = Doc(nlp.vocab, words=text.split(" ")) - gold = GoldParse(doc, **annots) - # Evaluate with doc and gold objects - nlp.evaluate([(doc, gold)]) - # Evaluate with text and dict - nlp.evaluate([(text, annots)]) + example = Example.from_dict(doc, annots) + nlp.evaluate([example]) + + # Not allowed to call with just one Example + with pytest.raises(TypeError): + nlp.evaluate(example) + + # Evaluate with text and dict: not supported anymore since v.3 + with pytest.raises(TypeError): + nlp.evaluate([(text, annots)]) # Evaluate with doc object and dict - nlp.evaluate([(doc, annots)]) - # Evaluate with text and gold object - nlp.evaluate([(text, gold)]) - # Evaluate badly - with pytest.raises(Exception): - nlp.evaluate([text, gold]) + with pytest.raises(TypeError): + nlp.evaluate([(doc, annots)]) + with pytest.raises(TypeError): + nlp.evaluate([text, annots]) def test_evaluate_no_pipe(nlp): """Test that docs are processed correctly within Language.pipe if the component doesn't expose a .pipe method.""" + @Language.component("test_evaluate_no_pipe") def pipe(doc): return doc text = "hello world" annots = {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}} nlp = Language(Vocab()) - nlp.add_pipe(pipe) - nlp.evaluate([(text, annots)]) + doc = nlp(text) + nlp.add_pipe("test_evaluate_no_pipe") + nlp.evaluate([Example.from_dict(doc, annots)]) +@Language.component("test_language_vector_modification_pipe") def vector_modification_pipe(doc): doc.vector += 1 return doc +@Language.component("test_language_userdata_pipe") def userdata_pipe(doc): doc.user_data["foo"] = "bar" return doc +@Language.component("test_language_ner_pipe") def ner_pipe(doc): span = Span(doc, 0, 1, label="FIRST") doc.ents += (span,) @@ -107,9 +116,9 @@ def sample_vectors(): @pytest.fixture def nlp2(nlp, sample_vectors): add_vecs_to_vocab(nlp.vocab, sample_vectors) - nlp.add_pipe(vector_modification_pipe) - nlp.add_pipe(ner_pipe) - nlp.add_pipe(userdata_pipe) + nlp.add_pipe("test_language_vector_modification_pipe") + nlp.add_pipe("test_language_ner_pipe") + nlp.add_pipe("test_language_userdata_pipe") return nlp @@ -134,9 +143,6 @@ def test_language_pipe(nlp2, n_process, texts): assert_docs_equal(doc, expected_doc) -@pytest.mark.skipif( - is_python2, reason="python2 seems to be unable to handle iterator properly" -) @pytest.mark.parametrize("n_process", [1, 2]) def test_language_pipe_stream(nlp2, n_process, texts): # check if nlp.pipe can handle infinite length iterator properly. 
@@ -148,3 +154,145 @@ def test_language_pipe_stream(nlp2, n_process, texts): n_fetch = 20 for doc, expected_doc in itertools.islice(zip(docs, expecteds), n_fetch): assert_docs_equal(doc, expected_doc) + + +def test_language_from_config_before_after_init(): + name = "test_language_from_config_before_after_init" + ran_before = False + ran_after = False + ran_after_pipeline = False + + @registry.callbacks(f"{name}_before") + def make_before_creation(): + def before_creation(lang_cls): + nonlocal ran_before + ran_before = True + assert lang_cls is English + lang_cls.Defaults.foo = "bar" + return lang_cls + + return before_creation + + @registry.callbacks(f"{name}_after") + def make_after_creation(): + def after_creation(nlp): + nonlocal ran_after + ran_after = True + assert isinstance(nlp, English) + assert nlp.pipe_names == [] + assert nlp.Defaults.foo == "bar" + nlp.meta["foo"] = "bar" + return nlp + + return after_creation + + @registry.callbacks(f"{name}_after_pipeline") + def make_after_pipeline_creation(): + def after_pipeline_creation(nlp): + nonlocal ran_after_pipeline + ran_after_pipeline = True + assert isinstance(nlp, English) + assert nlp.pipe_names == ["sentencizer"] + assert nlp.Defaults.foo == "bar" + assert nlp.meta["foo"] == "bar" + nlp.meta["bar"] = "baz" + return nlp + + return after_pipeline_creation + + config = { + "nlp": { + "pipeline": ["sentencizer"], + "before_creation": {"@callbacks": f"{name}_before"}, + "after_creation": {"@callbacks": f"{name}_after"}, + "after_pipeline_creation": {"@callbacks": f"{name}_after_pipeline"}, + }, + "components": {"sentencizer": {"factory": "sentencizer"}}, + } + nlp = English.from_config(config) + assert all([ran_before, ran_after, ran_after_pipeline]) + assert nlp.Defaults.foo == "bar" + assert nlp.meta["foo"] == "bar" + assert nlp.meta["bar"] == "baz" + assert nlp.pipe_names == ["sentencizer"] + assert nlp("text") + + +def test_language_from_config_before_after_init_invalid(): + """Check that an error is raised if function doesn't return nlp.""" + name = "test_language_from_config_before_after_init_invalid" + registry.callbacks(f"{name}_before1", func=lambda: lambda nlp: None) + registry.callbacks(f"{name}_before2", func=lambda: lambda nlp: nlp()) + registry.callbacks(f"{name}_after1", func=lambda: lambda nlp: None) + registry.callbacks(f"{name}_after1", func=lambda: lambda nlp: English) + + for callback_name in [f"{name}_before1", f"{name}_before2"]: + config = {"nlp": {"before_creation": {"@callbacks": callback_name}}} + with pytest.raises(ValueError): + English.from_config(config) + for callback_name in [f"{name}_after1", f"{name}_after2"]: + config = {"nlp": {"after_creation": {"@callbacks": callback_name}}} + with pytest.raises(ValueError): + English.from_config(config) + for callback_name in [f"{name}_after1", f"{name}_after2"]: + config = {"nlp": {"after_pipeline_creation": {"@callbacks": callback_name}}} + with pytest.raises(ValueError): + English.from_config(config) + + +def test_language_custom_tokenizer(): + """Test that a fully custom tokenizer can be plugged in via the registry.""" + name = "test_language_custom_tokenizer" + + class CustomTokenizer: + """Dummy "tokenizer" that splits on spaces and adds prefix to each word.""" + + def __init__(self, nlp, prefix): + self.vocab = nlp.vocab + self.prefix = prefix + + def __call__(self, text): + words = [f"{self.prefix}{word}" for word in text.split(" ")] + return Doc(self.vocab, words=words) + + @registry.tokenizers(name) + def custom_create_tokenizer(prefix: str = "_"): + 
def create_tokenizer(nlp): + return CustomTokenizer(nlp, prefix=prefix) + + return create_tokenizer + + config = {"nlp": {"tokenizer": {"@tokenizers": name}}} + nlp = English.from_config(config) + doc = nlp("hello world") + assert [t.text for t in doc] == ["_hello", "_world"] + doc = list(nlp.pipe(["hello world"]))[0] + assert [t.text for t in doc] == ["_hello", "_world"] + + +def test_language_from_config_invalid_lang(): + """Test that calling Language.from_config raises an error and lang defined + in config needs to match language-specific subclasses.""" + config = {"nlp": {"lang": "en"}} + with pytest.raises(ValueError): + Language.from_config(config) + with pytest.raises(ValueError): + German.from_config(config) + + +def test_spacy_blank(): + nlp = spacy.blank("en") + assert nlp.config["training"]["dropout"] == 0.1 + config = {"training": {"dropout": 0.2}} + meta = {"name": "my_custom_model"} + nlp = spacy.blank("en", config=config, meta=meta) + assert nlp.config["training"]["dropout"] == 0.2 + assert nlp.meta["name"] == "my_custom_model" + + +@pytest.mark.parametrize("value", [False, None, ["x", "y"], Language, Vocab]) +def test_language_init_invalid_vocab(value): + err_fragment = "invalid value" + with pytest.raises(ValueError) as e: + Language(value) + assert err_fragment in str(e.value) diff --git a/spacy/tests/test_lemmatizer.py b/spacy/tests/test_lemmatizer.py deleted file mode 100644 index e7736b042..000000000 --- a/spacy/tests/test_lemmatizer.py +++ /dev/null @@ -1,61 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest -from spacy.tokens import Doc -from spacy.language import Language -from spacy.lookups import Lookups -from spacy.lemmatizer import Lemmatizer - - -def test_lemmatizer_reflects_lookups_changes(): - """Test for an issue that'd cause lookups available in a model loaded from - disk to not be reflected in the lemmatizer.""" - nlp = Language() - assert Doc(nlp.vocab, words=["foo"])[0].lemma_ == "foo" - table = nlp.vocab.lookups.add_table("lemma_lookup") - table["foo"] = "bar" - assert Doc(nlp.vocab, words=["foo"])[0].lemma_ == "bar" - table = nlp.vocab.lookups.get_table("lemma_lookup") - table["hello"] = "world" - # The update to the table should be reflected in the lemmatizer - assert Doc(nlp.vocab, words=["hello"])[0].lemma_ == "world" - new_nlp = Language() - table = new_nlp.vocab.lookups.add_table("lemma_lookup") - table["hello"] = "hi" - assert Doc(new_nlp.vocab, words=["hello"])[0].lemma_ == "hi" - nlp_bytes = nlp.to_bytes() - new_nlp.from_bytes(nlp_bytes) - # Make sure we have the previously saved lookup table - assert "lemma_lookup" in new_nlp.vocab.lookups - assert len(new_nlp.vocab.lookups.get_table("lemma_lookup")) == 2 - assert new_nlp.vocab.lookups.get_table("lemma_lookup")["hello"] == "world" - assert Doc(new_nlp.vocab, words=["foo"])[0].lemma_ == "bar" - assert Doc(new_nlp.vocab, words=["hello"])[0].lemma_ == "world" - - -def test_tagger_warns_no_lookups(): - nlp = Language() - nlp.vocab.lookups = Lookups() - assert not len(nlp.vocab.lookups) - tagger = nlp.create_pipe("tagger") - nlp.add_pipe(tagger) - with pytest.warns(UserWarning): - nlp.begin_training() - nlp.vocab.lookups.add_table("lemma_lookup") - nlp.vocab.lookups.add_table("lexeme_norm") - nlp.vocab.lookups.get_table("lexeme_norm")["a"] = "A" - with pytest.warns(None) as record: - nlp.begin_training() - assert not record.list - - -def test_lemmatizer_without_is_base_form_implementation(): - # Norwegian example from #5658 - lookups = Lookups() - 
lookups.add_table("lemma_rules", {"noun": []}) - lookups.add_table("lemma_index", {"noun": {}}) - lookups.add_table("lemma_exc", {"noun": {"formuesskatten": ["formuesskatt"]}}) - - lemmatizer = Lemmatizer(lookups, is_base_form=None) - assert lemmatizer("Formuesskatten", "noun", {'Definite': 'def', 'Gender': 'masc', 'Number': 'sing'}) == ["formuesskatt"] diff --git a/spacy/tests/test_misc.py b/spacy/tests/test_misc.py index d48ba24a2..b9a0a9d05 100644 --- a/spacy/tests/test_misc.py +++ b/spacy/tests/test_misc.py @@ -1,43 +1,21 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest import os import ctypes -import srsly from pathlib import Path +from spacy.about import __version__ as spacy_version from spacy import util from spacy import prefer_gpu, require_gpu -from spacy.compat import symlink_to, symlink_remove, path2str, is_windows -from spacy._ml import PrecomputableAffine -from subprocess import CalledProcessError -from .util import make_tempdir +from spacy.ml._precomputable_affine import PrecomputableAffine +from spacy.ml._precomputable_affine import _backprop_precomputable_affine_padding +from spacy.util import dot_to_object, SimpleFrozenList +from thinc.api import Config, Optimizer, ConfigValidationError +from spacy.training.batchers import minibatch_by_words +from spacy.lang.en import English +from spacy.lang.nl import Dutch +from spacy.language import DEFAULT_CONFIG_PATH +from spacy.schemas import ConfigSchemaTraining - -@pytest.fixture -def symlink_target(): - return Path("./foo-target") - - -@pytest.fixture -def symlink(): - return Path("./foo-symlink") - - -@pytest.fixture(scope="function") -def symlink_setup_target(request, symlink_target, symlink): - if not symlink_target.exists(): - os.mkdir(path2str(symlink_target)) - # yield -- need to cleanup even if assertion fails - # https://github.com/pytest-dev/pytest/issues/2508#issuecomment-309934240 - - def cleanup(): - # Remove symlink only if it was created - if symlink.exists(): - symlink_remove(symlink) - os.rmdir(path2str(symlink_target)) - - request.addfinalizer(cleanup) +from .util import get_random_doc @pytest.fixture @@ -57,10 +35,12 @@ def test_util_ensure_path_succeeds(text): assert isinstance(path, Path) -@pytest.mark.parametrize("package", ["numpy"]) -def test_util_is_package(package): +@pytest.mark.parametrize( + "package,result", [("numpy", True), ("sfkodskfosdkfpsdpofkspdof", False)] +) +def test_util_is_package(package, result): """Test that an installed package via pip is recognised by util.is_package.""" - assert util.is_package(package) + assert util.is_package(package) is result @pytest.mark.parametrize("package", ["thinc"]) @@ -71,29 +51,31 @@ def test_util_get_package_path(package): def test_PrecomputableAffine(nO=4, nI=5, nF=3, nP=2): - model = PrecomputableAffine(nO=nO, nI=nI, nF=nF, nP=nP) - assert model.W.shape == (nF, nO, nP, nI) - tensor = model.ops.allocate((10, nI)) + model = PrecomputableAffine(nO=nO, nI=nI, nF=nF, nP=nP).initialize() + assert model.get_param("W").shape == (nF, nO, nP, nI) + tensor = model.ops.alloc((10, nI)) Y, get_dX = model.begin_update(tensor) assert Y.shape == (tensor.shape[0] + 1, nF, nO, nP) - assert model.d_pad.shape == (1, nF, nO, nP) - dY = model.ops.allocate((15, nO, nP)) - ids = model.ops.allocate((15, nF)) + dY = model.ops.alloc((15, nO, nP)) + ids = model.ops.alloc((15, nF)) ids[1, 2] = -1 dY[1] = 1 - assert model.d_pad[0, 2, 0, 0] == 0.0 - model._backprop_padding(dY, ids) - assert model.d_pad[0, 2, 0, 0] == 1.0 - model.d_pad.fill(0.0) + assert not 
model.has_grad("pad") + d_pad = _backprop_precomputable_affine_padding(model, dY, ids) + assert d_pad[0, 2, 0, 0] == 1.0 ids.fill(0.0) dY.fill(0.0) - ids[1, 2] = -1 + dY[0] = 0 + ids[1, 2] = 0 ids[1, 1] = -1 ids[1, 0] = -1 dY[1] = 1 - assert model.d_pad[0, 2, 0, 0] == 0.0 - model._backprop_padding(dY, ids) - assert model.d_pad[0, 2, 0, 0] == 3.0 + ids[2, 0] = -1 + dY[2] = 5 + d_pad = _backprop_precomputable_affine_padding(model, dY, ids) + assert d_pad[0, 0, 0, 0] == 6 + assert d_pad[0, 1, 0, 0] == 1 + assert d_pad[0, 2, 0, 0] == 0 def test_prefer_gpu(): @@ -111,25 +93,6 @@ def test_require_gpu(): require_gpu() -def test_create_symlink_windows( - symlink_setup_target, symlink_target, symlink, is_admin -): - """Test the creation of symlinks on windows. If run as admin or not on windows it should succeed, otherwise a CalledProcessError should be raised.""" - assert symlink_target.exists() - - if is_admin or not is_windows: - try: - symlink_to(symlink, symlink_target) - assert symlink.exists() - except CalledProcessError as e: - pytest.fail(e) - else: - with pytest.raises(CalledProcessError): - symlink_to(symlink, symlink_target) - - assert not symlink.exists() - - def test_ascii_filenames(): """Test that all filenames in the project are ASCII. See: https://twitter.com/_inesmontani/status/1177941471632211968 @@ -150,31 +113,196 @@ def test_load_model_blank_shortcut(): util.load_model("blank:fjsfijsdof") -def test_load_model_version_compat(): - """Test warnings for various spacy_version specifications in meta. Since - this is more of a hack for v2, manually specify the current major.minor - version to simplify test creation.""" - nlp = util.load_model("blank:en") - assert nlp.meta["spacy_version"].startswith(">=2.3") - with make_tempdir() as d: - # no change: compatible - nlp.to_disk(d) - meta_path = Path(d / "meta.json") - util.get_model_meta(d) +@pytest.mark.parametrize( + "version,constraint,compatible", + [ + (spacy_version, spacy_version, True), + (spacy_version, f">={spacy_version}", True), + ("3.0.0", "2.0.0", False), + ("3.2.1", ">=2.0.0", True), + ("2.2.10a1", ">=1.0.0,<2.1.1", False), + ("3.0.0.dev3", ">=1.2.3,<4.5.6", True), + ("n/a", ">=1.2.3,<4.5.6", None), + ("1.2.3", "n/a", None), + ("n/a", "n/a", None), + ], +) +def test_is_compatible_version(version, constraint, compatible): + assert util.is_compatible_version(version, constraint) is compatible - # additional compatible upper pin - nlp.meta["spacy_version"] = ">=2.3.0,<2.4.0" - srsly.write_json(meta_path, nlp.meta) - util.get_model_meta(d) - # incompatible older version - nlp.meta["spacy_version"] = ">=2.2.5" - srsly.write_json(meta_path, nlp.meta) - with pytest.warns(UserWarning): - util.get_model_meta(d) +@pytest.mark.parametrize( + "constraint,expected", + [ + ("3.0.0", False), + ("==3.0.0", False), + (">=2.3.0", True), + (">2.0.0", True), + ("<=2.0.0", True), + (">2.0.0,<3.0.0", False), + (">=2.0.0,<3.0.0", False), + ("!=1.1,>=1.0,~=1.0", True), + ("n/a", None), + ], +) +def test_is_unconstrained_version(constraint, expected): + assert util.is_unconstrained_version(constraint) is expected - # invalid version specification - nlp.meta["spacy_version"] = ">@#$%_invalid_version" - srsly.write_json(meta_path, nlp.meta) - with pytest.warns(UserWarning): - util.get_model_meta(d) + +@pytest.mark.parametrize( + "a1,a2,b1,b2,is_match", + [ + ("3.0.0", "3.0", "3.0.1", "3.0", True), + ("3.1.0", "3.1", "3.2.1", "3.2", False), + ("xxx", None, "1.2.3.dev0", "1.2", False), + ], +) +def test_minor_version(a1, a2, b1, b2, is_match): + 
assert util.get_minor_version(a1) == a2 + assert util.get_minor_version(b1) == b2 + assert util.is_minor_version_match(a1, b1) is is_match + assert util.is_minor_version_match(a2, b2) is is_match + + +@pytest.mark.parametrize( + "dot_notation,expected", + [ + ( + {"token.pos": True, "token._.xyz": True}, + {"token": {"pos": True, "_": {"xyz": True}}}, + ), + ( + {"training.batch_size": 128, "training.optimizer.learn_rate": 0.01}, + {"training": {"batch_size": 128, "optimizer": {"learn_rate": 0.01}}}, + ), + ], +) +def test_dot_to_dict(dot_notation, expected): + result = util.dot_to_dict(dot_notation) + assert result == expected + assert util.dict_to_dot(result) == dot_notation + + +@pytest.mark.parametrize( + "doc_sizes, expected_batches", + [ + ([400, 400, 199], [3]), + ([400, 400, 199, 3], [4]), + ([400, 400, 199, 3, 200], [3, 2]), + ([400, 400, 199, 3, 1], [5]), + ([400, 400, 199, 3, 1, 1500], [5]), # 1500 will be discarded + ([400, 400, 199, 3, 1, 200], [3, 3]), + ([400, 400, 199, 3, 1, 999], [3, 3]), + ([400, 400, 199, 3, 1, 999, 999], [3, 2, 1, 1]), + ([1, 2, 999], [3]), + ([1, 2, 999, 1], [4]), + ([1, 200, 999, 1], [2, 2]), + ([1, 999, 200, 1], [2, 2]), + ], +) +def test_util_minibatch(doc_sizes, expected_batches): + docs = [get_random_doc(doc_size) for doc_size in doc_sizes] + tol = 0.2 + batch_size = 1000 + batches = list( + minibatch_by_words(docs, size=batch_size, tolerance=tol, discard_oversize=True) + ) + assert [len(batch) for batch in batches] == expected_batches + + max_size = batch_size + batch_size * tol + for batch in batches: + assert sum([len(doc) for doc in batch]) < max_size + + +@pytest.mark.parametrize( + "doc_sizes, expected_batches", + [ + ([400, 4000, 199], [1, 2]), + ([400, 400, 199, 3000, 200], [1, 4]), + ([400, 400, 199, 3, 1, 1500], [1, 5]), + ([400, 400, 199, 3000, 2000, 200, 200], [1, 1, 3, 2]), + ([1, 2, 9999], [1, 2]), + ([2000, 1, 2000, 1, 1, 1, 2000], [1, 1, 1, 4]), + ], +) +def test_util_minibatch_oversize(doc_sizes, expected_batches): + """ Test that oversized documents are returned in their own batch""" + docs = [get_random_doc(doc_size) for doc_size in doc_sizes] + tol = 0.2 + batch_size = 1000 + batches = list( + minibatch_by_words(docs, size=batch_size, tolerance=tol, discard_oversize=False) + ) + assert [len(batch) for batch in batches] == expected_batches + + +def test_util_dot_section(): + cfg_string = """ + [nlp] + lang = "en" + pipeline = ["textcat"] + + [components] + + [components.textcat] + factory = "textcat" + + [components.textcat.model] + @architectures = "spacy.TextCatBOW.v1" + exclusive_classes = true + ngram_size = 1 + no_output_layer = false + """ + nlp_config = Config().from_str(cfg_string) + en_nlp = util.load_model_from_config(nlp_config, auto_fill=True) + default_config = Config().from_disk(DEFAULT_CONFIG_PATH) + default_config["nlp"]["lang"] = "nl" + nl_nlp = util.load_model_from_config(default_config, auto_fill=True) + # Test that creation went OK + assert isinstance(en_nlp, English) + assert isinstance(nl_nlp, Dutch) + assert nl_nlp.pipe_names == [] + assert en_nlp.pipe_names == ["textcat"] + # not exclusive_classes + assert en_nlp.get_pipe("textcat").model.attrs["multi_label"] is False + # Test that default values got overwritten + assert en_nlp.config["nlp"]["pipeline"] == ["textcat"] + assert nl_nlp.config["nlp"]["pipeline"] == [] # default value [] + # Test proper functioning of 'dot_to_object' + with pytest.raises(KeyError): + dot_to_object(en_nlp.config, "nlp.pipeline.tagger") + with pytest.raises(KeyError): + 
dot_to_object(en_nlp.config, "nlp.unknownattribute") + T = util.registry.resolve(nl_nlp.config["training"], schema=ConfigSchemaTraining) + assert isinstance(dot_to_object({"training": T}, "training.optimizer"), Optimizer) + + +def test_simple_frozen_list(): + t = SimpleFrozenList(["foo", "bar"]) + assert t == ["foo", "bar"] + assert t.index("bar") == 1 # okay method + with pytest.raises(NotImplementedError): + t.append("baz") + with pytest.raises(NotImplementedError): + t.sort() + with pytest.raises(NotImplementedError): + t.extend(["baz"]) + with pytest.raises(NotImplementedError): + t.pop() + t = SimpleFrozenList(["foo", "bar"], error="Error!") + with pytest.raises(NotImplementedError): + t.append("baz") + + +def test_resolve_dot_names(): + config = { + "training": {"optimizer": {"@optimizers": "Adam.v1"}}, + "foo": {"bar": "training.optimizer", "baz": "training.xyz"}, + } + result = util.resolve_dot_names(config, ["training.optimizer"]) + assert isinstance(result[0], Optimizer) + with pytest.raises(ConfigValidationError) as e: + util.resolve_dot_names(config, ["training.xyz", "training.optimizer"]) + errors = e.value.errors + assert len(errors) == 1 + assert errors[0]["loc"] == ["training", "xyz"] diff --git a/spacy/tests/test_models.py b/spacy/tests/test_models.py new file mode 100644 index 000000000..e8884e6b2 --- /dev/null +++ b/spacy/tests/test_models.py @@ -0,0 +1,200 @@ +from typing import List +import pytest +from thinc.api import fix_random_seed, Adam, set_dropout_rate +from numpy.testing import assert_array_equal +import numpy +from spacy.ml.models import build_Tok2Vec_model, MultiHashEmbed, MaxoutWindowEncoder +from spacy.ml.models import build_text_classifier, build_simple_cnn_text_classifier +from spacy.ml.staticvectors import StaticVectors +from spacy.lang.en import English +from spacy.lang.en.examples import sentences as EN_SENTENCES + + +def get_textcat_kwargs(): + return { + "width": 64, + "embed_size": 2000, + "pretrained_vectors": None, + "exclusive_classes": False, + "ngram_size": 1, + "window_size": 1, + "conv_depth": 2, + "dropout": None, + "nO": 7, + } + + +def get_textcat_cnn_kwargs(): + return { + "tok2vec": test_tok2vec(), + "exclusive_classes": False, + "nO": 13, + } + + +def get_all_params(model): + params = [] + for node in model.walk(): + for name in node.param_names: + params.append(node.get_param(name).ravel()) + return node.ops.xp.concatenate(params) + + +def get_docs(): + nlp = English() + return list(nlp.pipe(EN_SENTENCES + [" ".join(EN_SENTENCES)])) + + +def get_gradient(model, Y): + if isinstance(Y, model.ops.xp.ndarray): + dY = model.ops.alloc(Y.shape, dtype=Y.dtype) + dY += model.ops.xp.random.uniform(-1.0, 1.0, Y.shape) + return dY + elif isinstance(Y, List): + return [get_gradient(model, y) for y in Y] + else: + raise ValueError(f"Could not get gradient for type {type(Y)}") + + +def get_tok2vec_kwargs(): + # This actually creates models, so seems best to put it in a function. 
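+    # The embed/encode pair below is what build_Tok2Vec_model expects; the small widths and row counts are presumably chosen to keep the test models cheap to build.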
+ return { + "embed": MultiHashEmbed( + width=32, + rows=[500, 500, 500], + attrs=["NORM", "PREFIX", "SHAPE"], + include_static_vectors=False, + ), + "encode": MaxoutWindowEncoder( + width=32, depth=2, maxout_pieces=2, window_size=1 + ), + } + + +def test_tok2vec(): + return build_Tok2Vec_model(**get_tok2vec_kwargs()) + + +def test_multi_hash_embed(): + embed = MultiHashEmbed( + width=32, + rows=[500, 500, 500], + attrs=["NORM", "PREFIX", "SHAPE"], + include_static_vectors=False, + ) + hash_embeds = [node for node in embed.walk() if node.name == "hashembed"] + assert len(hash_embeds) == 3 + # Check they look at different columns. + assert list(sorted(he.attrs["column"] for he in hash_embeds)) == [0, 1, 2] + # Check they use different seeds + assert len(set(he.attrs["seed"] for he in hash_embeds)) == 3 + # Check they all have the same number of rows + assert [he.get_dim("nV") for he in hash_embeds] == [500, 500, 500] + # Now try with different row factors + embed = MultiHashEmbed( + width=32, + rows=[1000, 50, 250], + attrs=["NORM", "PREFIX", "SHAPE"], + include_static_vectors=False, + ) + hash_embeds = [node for node in embed.walk() if node.name == "hashembed"] + assert [he.get_dim("nV") for he in hash_embeds] == [1000, 50, 250] + + +@pytest.mark.parametrize( + "seed,model_func,kwargs", + [ + (0, build_Tok2Vec_model, get_tok2vec_kwargs()), + (0, build_text_classifier, get_textcat_kwargs()), + (0, build_simple_cnn_text_classifier, get_textcat_cnn_kwargs()), + ], +) +def test_models_initialize_consistently(seed, model_func, kwargs): + fix_random_seed(seed) + model1 = model_func(**kwargs) + model1.initialize() + fix_random_seed(seed) + model2 = model_func(**kwargs) + model2.initialize() + params1 = get_all_params(model1) + params2 = get_all_params(model2) + assert_array_equal(params1, params2) + + +@pytest.mark.parametrize( + "seed,model_func,kwargs,get_X", + [ + (0, build_Tok2Vec_model, get_tok2vec_kwargs(), get_docs), + (0, build_text_classifier, get_textcat_kwargs(), get_docs), + (0, build_simple_cnn_text_classifier, get_textcat_cnn_kwargs(), get_docs), + ], +) +def test_models_predict_consistently(seed, model_func, kwargs, get_X): + fix_random_seed(seed) + model1 = model_func(**kwargs).initialize() + Y1 = model1.predict(get_X()) + fix_random_seed(seed) + model2 = model_func(**kwargs).initialize() + Y2 = model2.predict(get_X()) + + if model1.has_ref("tok2vec"): + tok2vec1 = model1.get_ref("tok2vec").predict(get_X()) + tok2vec2 = model2.get_ref("tok2vec").predict(get_X()) + for i in range(len(tok2vec1)): + for j in range(len(tok2vec1[i])): + assert_array_equal( + numpy.asarray(tok2vec1[i][j]), numpy.asarray(tok2vec2[i][j]) + ) + + if isinstance(Y1, numpy.ndarray): + assert_array_equal(Y1, Y2) + elif isinstance(Y1, List): + assert len(Y1) == len(Y2) + for y1, y2 in zip(Y1, Y2): + assert_array_equal(y1, y2) + else: + raise ValueError(f"Could not compare type {type(Y1)}") + + +@pytest.mark.parametrize( + "seed,dropout,model_func,kwargs,get_X", + [ + (0, 0.2, build_Tok2Vec_model, get_tok2vec_kwargs(), get_docs), + (0, 0.2, build_text_classifier, get_textcat_kwargs(), get_docs), + (0, 0.2, build_simple_cnn_text_classifier, get_textcat_cnn_kwargs(), get_docs), + ], +) +def test_models_update_consistently(seed, dropout, model_func, kwargs, get_X): + def get_updated_model(): + fix_random_seed(seed) + optimizer = Adam(0.001) + model = model_func(**kwargs).initialize() + initial_params = get_all_params(model) + set_dropout_rate(model, dropout) + for _ in range(5): + Y, get_dX = 
model.begin_update(get_X()) + dY = get_gradient(model, Y) + get_dX(dY) + model.finish_update(optimizer) + updated_params = get_all_params(model) + with pytest.raises(AssertionError): + assert_array_equal(initial_params, updated_params) + return model + + model1 = get_updated_model() + model2 = get_updated_model() + assert_array_equal(get_all_params(model1), get_all_params(model2)) + + +@pytest.mark.parametrize("model_func,kwargs", [(StaticVectors, {"nO": 128, "nM": 300})]) +def test_empty_docs(model_func, kwargs): + nlp = English() + model = model_func(**kwargs).initialize() + # Test the layer can be called successfully with 0, 1 and 2 empty docs. + for n_docs in range(3): + docs = [nlp("") for _ in range(n_docs)] + # Test predict + model.predict(docs) + # Test backprop + output, backprop = model.begin_update(docs) + backprop(output) diff --git a/spacy/tests/test_pickles.py b/spacy/tests/test_pickles.py index 65288527a..e4c67b672 100644 --- a/spacy/tests/test_pickles.py +++ b/spacy/tests/test_pickles.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest import numpy import srsly diff --git a/spacy/tests/test_scorer.py b/spacy/tests/test_scorer.py index 2a4ef0f40..4c1b09849 100644 --- a/spacy/tests/test_scorer.py +++ b/spacy/tests/test_scorer.py @@ -1,13 +1,13 @@ -# coding: utf-8 -from __future__ import unicode_literals - from numpy.testing import assert_almost_equal, assert_array_almost_equal import pytest from pytest import approx -from spacy.gold import GoldParse +from spacy.training import Example +from spacy.training.iob_utils import offsets_to_biluo_tags from spacy.scorer import Scorer, ROCAUCScore from spacy.scorer import _roc_auc_score, _roc_curve -from .util import get_doc +from spacy.lang.en import English +from spacy.tokens import Doc + test_las_apple = [ [ @@ -43,101 +43,238 @@ test_ner_apple = [ ] +@pytest.fixture +def tagged_doc(): + text = "Sarah's sister flew to Silicon Valley via London." + tags = ["NNP", "POS", "NN", "VBD", "IN", "NNP", "NNP", "IN", "NNP", "."] + pos = [ + "PROPN", + "PART", + "NOUN", + "VERB", + "ADP", + "PROPN", + "PROPN", + "ADP", + "PROPN", + "PUNCT", + ] + morphs = [ + "NounType=prop|Number=sing", + "Poss=yes", + "Number=sing", + "Tense=past|VerbForm=fin", + "", + "NounType=prop|Number=sing", + "NounType=prop|Number=sing", + "", + "NounType=prop|Number=sing", + "PunctType=peri", + ] + nlp = English() + doc = nlp(text) + for i in range(len(tags)): + doc[i].tag_ = tags[i] + doc[i].pos_ = pos[i] + doc[i].set_morph(morphs[i]) + if i > 0: + doc[i].is_sent_start = False + return doc + + +@pytest.fixture +def sented_doc(): + text = "One sentence. Two sentences. Three sentences." 
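+    # The text tokenizes into nine tokens; the loop below marks every third token (i % 3 == 0) as a sentence start, giving three gold sentences.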
+ nlp = English() + doc = nlp(text) + for i in range(len(doc)): + if i % 3 == 0: + doc[i].is_sent_start = True + else: + doc[i].is_sent_start = False + return doc + + +def test_tokenization(sented_doc): + scorer = Scorer() + gold = {"sent_starts": [t.sent_start for t in sented_doc]} + example = Example.from_dict(sented_doc, gold) + scores = scorer.score([example]) + assert scores["token_acc"] == 1.0 + + nlp = English() + example.predicted = Doc( + nlp.vocab, + words=["One", "sentence.", "Two", "sentences.", "Three", "sentences."], + spaces=[True, True, True, True, True, False], + ) + example.predicted[1].is_sent_start = False + scores = scorer.score([example]) + assert scores["token_acc"] == approx(0.66666666) + assert scores["token_p"] == 0.5 + assert scores["token_r"] == approx(0.33333333) + assert scores["token_f"] == 0.4 + + +def test_sents(sented_doc): + scorer = Scorer() + gold = {"sent_starts": [t.sent_start for t in sented_doc]} + example = Example.from_dict(sented_doc, gold) + scores = scorer.score([example]) + assert scores["sents_f"] == 1.0 + + # One sentence start is moved + gold["sent_starts"][3] = 0 + gold["sent_starts"][4] = 1 + example = Example.from_dict(sented_doc, gold) + scores = scorer.score([example]) + assert scores["sents_f"] == approx(0.3333333) + + def test_las_per_type(en_vocab): # Gold and Doc are identical scorer = Scorer() + examples = [] for input_, annot in test_las_apple: - doc = get_doc( - en_vocab, - words=input_.split(" "), - heads=([h - i for i, h in enumerate(annot["heads"])]), - deps=annot["deps"], + doc = Doc( + en_vocab, words=input_.split(" "), heads=annot["heads"], deps=annot["deps"] ) - gold = GoldParse(doc, heads=annot["heads"], deps=annot["deps"]) - scorer.score(doc, gold) - results = scorer.scores + gold = {"heads": annot["heads"], "deps": annot["deps"]} + example = Example.from_dict(doc, gold) + examples.append(example) + results = scorer.score(examples) - assert results["uas"] == 100 - assert results["las"] == 100 - assert results["las_per_type"]["nsubj"]["p"] == 100 - assert results["las_per_type"]["nsubj"]["r"] == 100 - assert results["las_per_type"]["nsubj"]["f"] == 100 - assert results["las_per_type"]["compound"]["p"] == 100 - assert results["las_per_type"]["compound"]["r"] == 100 - assert results["las_per_type"]["compound"]["f"] == 100 + assert results["dep_uas"] == 1.0 + assert results["dep_las"] == 1.0 + assert results["dep_las_per_type"]["nsubj"]["p"] == 1.0 + assert results["dep_las_per_type"]["nsubj"]["r"] == 1.0 + assert results["dep_las_per_type"]["nsubj"]["f"] == 1.0 + assert results["dep_las_per_type"]["compound"]["p"] == 1.0 + assert results["dep_las_per_type"]["compound"]["r"] == 1.0 + assert results["dep_las_per_type"]["compound"]["f"] == 1.0 # One dep is incorrect in Doc scorer = Scorer() + examples = [] for input_, annot in test_las_apple: - doc = get_doc( - en_vocab, - words=input_.split(" "), - heads=([h - i for i, h in enumerate(annot["heads"])]), - deps=annot["deps"], + doc = Doc( + en_vocab, words=input_.split(" "), heads=annot["heads"], deps=annot["deps"] ) - gold = GoldParse(doc, heads=annot["heads"], deps=annot["deps"]) + gold = {"heads": annot["heads"], "deps": annot["deps"]} doc[0].dep_ = "compound" - scorer.score(doc, gold) - results = scorer.scores + example = Example.from_dict(doc, gold) + examples.append(example) + results = scorer.score(examples) - assert results["uas"] == 100 - assert_almost_equal(results["las"], 90.9090909) - assert results["las_per_type"]["nsubj"]["p"] == 0 - assert 
results["las_per_type"]["nsubj"]["r"] == 0 - assert results["las_per_type"]["nsubj"]["f"] == 0 - assert_almost_equal(results["las_per_type"]["compound"]["p"], 66.6666666) - assert results["las_per_type"]["compound"]["r"] == 100 - assert results["las_per_type"]["compound"]["f"] == 80 + assert results["dep_uas"] == 1.0 + assert_almost_equal(results["dep_las"], 0.9090909) + assert results["dep_las_per_type"]["nsubj"]["p"] == 0 + assert results["dep_las_per_type"]["nsubj"]["r"] == 0 + assert results["dep_las_per_type"]["nsubj"]["f"] == 0 + assert_almost_equal(results["dep_las_per_type"]["compound"]["p"], 0.666666666) + assert results["dep_las_per_type"]["compound"]["r"] == 1.0 + assert results["dep_las_per_type"]["compound"]["f"] == 0.8 def test_ner_per_type(en_vocab): # Gold and Doc are identical scorer = Scorer() + examples = [] for input_, annot in test_ner_cardinal: - doc = get_doc( - en_vocab, - words=input_.split(" "), - ents=[[0, 1, "CARDINAL"], [2, 3, "CARDINAL"]], + doc = Doc( + en_vocab, words=input_.split(" "), ents=["B-CARDINAL", "O", "B-CARDINAL"] ) - gold = GoldParse(doc, entities=annot["entities"]) - scorer.score(doc, gold) - results = scorer.scores + entities = offsets_to_biluo_tags(doc, annot["entities"]) + example = Example.from_dict(doc, {"entities": entities}) + # a hack for sentence boundaries + example.predicted[1].is_sent_start = False + example.reference[1].is_sent_start = False + examples.append(example) + results = scorer.score(examples) - assert results["ents_p"] == 100 - assert results["ents_f"] == 100 - assert results["ents_r"] == 100 - assert results["ents_per_type"]["CARDINAL"]["p"] == 100 - assert results["ents_per_type"]["CARDINAL"]["f"] == 100 - assert results["ents_per_type"]["CARDINAL"]["r"] == 100 + assert results["ents_p"] == 1.0 + assert results["ents_r"] == 1.0 + assert results["ents_f"] == 1.0 + assert results["ents_per_type"]["CARDINAL"]["p"] == 1.0 + assert results["ents_per_type"]["CARDINAL"]["r"] == 1.0 + assert results["ents_per_type"]["CARDINAL"]["f"] == 1.0 # Doc has one missing and one extra entity # Entity type MONEY is not present in Doc scorer = Scorer() + examples = [] for input_, annot in test_ner_apple: - doc = get_doc( + doc = Doc( en_vocab, words=input_.split(" "), - ents=[[0, 1, "ORG"], [5, 6, "GPE"], [6, 7, "ORG"]], + ents=["B-ORG", "O", "O", "O", "O", "B-GPE", "B-ORG", "O", "O", "O"], ) - gold = GoldParse(doc, entities=annot["entities"]) - scorer.score(doc, gold) - results = scorer.scores + entities = offsets_to_biluo_tags(doc, annot["entities"]) + example = Example.from_dict(doc, {"entities": entities}) + # a hack for sentence boundaries + example.predicted[1].is_sent_start = False + example.reference[1].is_sent_start = False + examples.append(example) + results = scorer.score(examples) - assert results["ents_p"] == approx(66.66666) - assert results["ents_r"] == approx(66.66666) - assert results["ents_f"] == approx(66.66666) + assert results["ents_p"] == approx(0.6666666) + assert results["ents_r"] == approx(0.6666666) + assert results["ents_f"] == approx(0.6666666) assert "GPE" in results["ents_per_type"] assert "MONEY" in results["ents_per_type"] assert "ORG" in results["ents_per_type"] - assert results["ents_per_type"]["GPE"]["p"] == 100 - assert results["ents_per_type"]["GPE"]["r"] == 100 - assert results["ents_per_type"]["GPE"]["f"] == 100 + assert results["ents_per_type"]["GPE"]["p"] == 1.0 + assert results["ents_per_type"]["GPE"]["r"] == 1.0 + assert results["ents_per_type"]["GPE"]["f"] == 1.0 assert 
results["ents_per_type"]["MONEY"]["p"] == 0 assert results["ents_per_type"]["MONEY"]["r"] == 0 assert results["ents_per_type"]["MONEY"]["f"] == 0 - assert results["ents_per_type"]["ORG"]["p"] == 50 - assert results["ents_per_type"]["ORG"]["r"] == 100 - assert results["ents_per_type"]["ORG"]["f"] == approx(66.66666) + assert results["ents_per_type"]["ORG"]["p"] == 0.5 + assert results["ents_per_type"]["ORG"]["r"] == 1.0 + assert results["ents_per_type"]["ORG"]["f"] == approx(0.6666666) + + +def test_tag_score(tagged_doc): + # Gold and Doc are identical + scorer = Scorer() + gold = { + "tags": [t.tag_ for t in tagged_doc], + "pos": [t.pos_ for t in tagged_doc], + "morphs": [str(t.morph) for t in tagged_doc], + "sent_starts": [1 if t.is_sent_start else -1 for t in tagged_doc], + } + example = Example.from_dict(tagged_doc, gold) + results = scorer.score([example]) + + assert results["tag_acc"] == 1.0 + assert results["pos_acc"] == 1.0 + assert results["morph_acc"] == 1.0 + assert results["morph_per_feat"]["NounType"]["f"] == 1.0 + + # Gold annotation is modified + scorer = Scorer() + tags = [t.tag_ for t in tagged_doc] + tags[0] = "NN" + pos = [t.pos_ for t in tagged_doc] + pos[1] = "X" + morphs = [str(t.morph) for t in tagged_doc] + morphs[1] = "Number=sing" + morphs[2] = "Number=plur" + gold = { + "tags": tags, + "pos": pos, + "morphs": morphs, + "sent_starts": gold["sent_starts"], + } + example = Example.from_dict(tagged_doc, gold) + results = scorer.score([example]) + + assert results["tag_acc"] == 0.9 + assert results["pos_acc"] == 0.9 + assert results["morph_acc"] == approx(0.8) + assert results["morph_per_feat"]["NounType"]["f"] == 1.0 + assert results["morph_per_feat"]["Poss"]["f"] == 0.0 + assert results["morph_per_feat"]["Number"]["f"] == approx(0.72727272) def test_roc_auc_score(): diff --git a/spacy/tests/test_tok2vec.py b/spacy/tests/test_tok2vec.py deleted file mode 100644 index ddaa71059..000000000 --- a/spacy/tests/test_tok2vec.py +++ /dev/null @@ -1,66 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -import pytest - -from spacy._ml import Tok2Vec -from spacy.vocab import Vocab -from spacy.tokens import Doc -from spacy.compat import unicode_ - - -def get_batch(batch_size): - vocab = Vocab() - docs = [] - start = 0 - for size in range(1, batch_size + 1): - # Make the words numbers, so that they're distnct - # across the batch, and easy to track. - numbers = [unicode_(i) for i in range(start, start + size)] - docs.append(Doc(vocab, words=numbers)) - start += size - return docs - - -# This fails in Thinc v7.3.1. 
Need to push patch -@pytest.mark.xfail -def test_empty_doc(): - width = 128 - embed_size = 2000 - vocab = Vocab() - doc = Doc(vocab, words=[]) - tok2vec = Tok2Vec(width, embed_size) - vectors, backprop = tok2vec.begin_update([doc]) - assert len(vectors) == 1 - assert vectors[0].shape == (0, width) - - -@pytest.mark.parametrize( - "batch_size,width,embed_size", [[1, 128, 2000], [2, 128, 2000], [3, 8, 63]] -) -def test_tok2vec_batch_sizes(batch_size, width, embed_size): - batch = get_batch(batch_size) - tok2vec = Tok2Vec(width, embed_size) - vectors, backprop = tok2vec.begin_update(batch) - assert len(vectors) == len(batch) - for doc_vec, doc in zip(vectors, batch): - assert doc_vec.shape == (len(doc), width) - - -@pytest.mark.parametrize( - "tok2vec_config", - [ - {"width": 8, "embed_size": 100, "char_embed": False}, - {"width": 8, "embed_size": 100, "char_embed": True}, - {"width": 8, "embed_size": 100, "conv_depth": 6}, - {"width": 8, "embed_size": 100, "conv_depth": 6}, - {"width": 8, "embed_size": 100, "subword_features": False}, - ], -) -def test_tok2vec_configs(tok2vec_config): - docs = get_batch(3) - tok2vec = Tok2Vec(**tok2vec_config) - vectors, backprop = tok2vec.begin_update(docs) - assert len(vectors) == len(docs) - assert vectors[0].shape == (len(docs[0]), tok2vec_config["width"]) - backprop(vectors) diff --git a/spacy/tests/tokenizer/test_exceptions.py b/spacy/tests/tokenizer/test_exceptions.py index a79363abb..9a98e049e 100644 --- a/spacy/tests/tokenizer/test_exceptions.py +++ b/spacy/tests/tokenizer/test_exceptions.py @@ -1,13 +1,12 @@ -# coding: utf-8 -from __future__ import unicode_literals - import sys import pytest def test_tokenizer_handles_emoticons(tokenizer): # Tweebo challenge (CMU) - text = """:o :/ :'( >:o (: :) >.< XD -__- o.O ;D :-) @_@ :P 8D :1 >:( :D =| ") :> ....""" + text = ( + """:o :/ :'( >:o (: :) >.< XD -__- o.O ;D :-) @_@ :P 8D :1 >:( :D =| :> ....""" + ) tokens = tokenizer(text) assert tokens[0].text == ":o" assert tokens[1].text == ":/" @@ -28,12 +27,11 @@ def test_tokenizer_handles_emoticons(tokenizer): assert tokens[16].text == ">:(" assert tokens[17].text == ":D" assert tokens[18].text == "=|" - assert tokens[19].text == '")' - assert tokens[20].text == ":>" - assert tokens[21].text == "...." + assert tokens[19].text == ":>" + assert tokens[20].text == "...." 
-@pytest.mark.parametrize("text,length", [("example:)", 3), ("108)", 2), ("XDN", 1)]) +@pytest.mark.parametrize("text,length", [("108)", 2), ("XDN", 1)]) def test_tokenizer_excludes_false_pos_emoticons(tokenizer, text, length): tokens = tokenizer(text) assert len(tokens) == length diff --git a/spacy/tests/tokenizer/test_explain.py b/spacy/tests/tokenizer/test_explain.py index 2d71588cc..ea6cf91be 100644 --- a/spacy/tests/tokenizer/test_explain.py +++ b/spacy/tests/tokenizer/test_explain.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.util import get_lang_class @@ -57,8 +54,8 @@ LANGUAGES = [ @pytest.mark.parametrize("lang", LANGUAGES) def test_tokenizer_explain(lang): - tokenizer = get_lang_class(lang).Defaults.create_tokenizer() - examples = pytest.importorskip("spacy.lang.{}.examples".format(lang)) + tokenizer = get_lang_class(lang)().tokenizer + examples = pytest.importorskip(f"spacy.lang.{lang}.examples") for sentence in examples.sentences: tokens = [t.text for t in tokenizer(sentence) if not t.is_space] debug_tokens = [t[1] for t in tokenizer.explain(sentence)] diff --git a/spacy/tests/tokenizer/test_naughty_strings.py b/spacy/tests/tokenizer/test_naughty_strings.py index 9737b15cf..b22dabb9d 100644 --- a/spacy/tests/tokenizer/test_naughty_strings.py +++ b/spacy/tests/tokenizer/test_naughty_strings.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest # Examples taken from the "Big List of Naughty Strings" diff --git a/spacy/tests/tokenizer/test_tokenizer.py b/spacy/tests/tokenizer/test_tokenizer.py index 803c31abf..23c2d5c47 100644 --- a/spacy/tests/tokenizer/test_tokenizer.py +++ b/spacy/tests/tokenizer/test_tokenizer.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.vocab import Vocab from spacy.tokenizer import Tokenizer @@ -109,14 +106,60 @@ def test_tokenizer_add_special_case(tokenizer, text, tokens): @pytest.mark.parametrize( - "text,tokens", [("lorem", [{"orth": "lo", "tag": "NN"}, {"orth": "rem"}])] + "text,tokens", + [ + ("lorem", [{"orth": "lo"}, {"orth": "re"}]), + ("lorem", [{"orth": "lo", "tag": "A"}, {"orth": "rem"}]), + ], +) +def test_tokenizer_validate_special_case(tokenizer, text, tokens): + with pytest.raises(ValueError): + tokenizer.add_special_case(text, tokens) + + +@pytest.mark.parametrize( + "text,tokens", [("lorem", [{"orth": "lo", "norm": "LO"}, {"orth": "rem"}])] ) def test_tokenizer_add_special_case_tag(text, tokens): - vocab = Vocab(tag_map={"NN": {"pos": "NOUN"}}) + vocab = Vocab() tokenizer = Tokenizer(vocab, {}, None, None, None) tokenizer.add_special_case(text, tokens) doc = tokenizer(text) assert doc[0].text == tokens[0]["orth"] - assert doc[0].tag_ == tokens[0]["tag"] - assert doc[0].pos_ == "NOUN" + assert doc[0].norm_ == tokens[0]["norm"] assert doc[1].text == tokens[1]["orth"] + + +def test_tokenizer_special_cases_with_affixes(tokenizer): + text = '(((_SPECIAL_ A/B, A/B-A/B")' + tokenizer.add_special_case("_SPECIAL_", [{"orth": "_SPECIAL_"}]) + tokenizer.add_special_case("A/B", [{"orth": "A/B"}]) + doc = tokenizer(text) + assert [token.text for token in doc] == [ + "(", + "(", + "(", + "_SPECIAL_", + "A/B", + ",", + "A/B", + "-", + "A/B", + '"', + ")", + ] + + +def test_tokenizer_special_cases_with_period(tokenizer): + text = "_SPECIAL_." 
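+    # The special-case token and the trailing period should come out as two separate tokens.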
+ tokenizer.add_special_case("_SPECIAL_", [{"orth": "_SPECIAL_"}]) + doc = tokenizer(text) + assert [token.text for token in doc] == ["_SPECIAL_", "."] + + +def test_tokenizer_special_cases_idx(tokenizer): + text = "the _ID'X_" + tokenizer.add_special_case("_ID'X_", [{"orth": "_ID"}, {"orth": "'X_"}]) + doc = tokenizer(text) + assert doc[1].idx == 4 + assert doc[2].idx == 7 diff --git a/spacy/tests/tokenizer/test_urls.py b/spacy/tests/tokenizer/test_urls.py index 65ba93d66..57e970f87 100644 --- a/spacy/tests/tokenizer/test_urls.py +++ b/spacy/tests/tokenizer/test_urls.py @@ -1,8 +1,7 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest +from spacy.lang.tokenizer_exceptions import BASE_EXCEPTIONS + URLS_BASIC = [ "http://www.nytimes.com/2016/04/20/us/politics/new-york-primary-preview.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=a-lede-package-region®ion=top-news&WT.nav=top-news&_r=0", @@ -196,7 +195,12 @@ def test_tokenizer_handles_two_prefix_url(tokenizer, prefix1, prefix2, url): @pytest.mark.parametrize("url", URLS_FULL) def test_tokenizer_handles_two_suffix_url(tokenizer, suffix1, suffix2, url): tokens = tokenizer(url + suffix1 + suffix2) - assert len(tokens) == 3 - assert tokens[0].text == url - assert tokens[1].text == suffix1 - assert tokens[2].text == suffix2 + if suffix1 + suffix2 in BASE_EXCEPTIONS: + assert len(tokens) == 2 + assert tokens[0].text == url + assert tokens[1].text == suffix1 + suffix2 + else: + assert len(tokens) == 3 + assert tokens[0].text == url + assert tokens[1].text == suffix1 + assert tokens[2].text == suffix2 diff --git a/spacy/tests/tokenizer/test_whitespace.py b/spacy/tests/tokenizer/test_whitespace.py index e32fa3efc..d68bb9e4e 100644 --- a/spacy/tests/tokenizer/test_whitespace.py +++ b/spacy/tests/tokenizer/test_whitespace.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest diff --git a/spacy/syntax/__init__.py b/spacy/tests/training/__init__.py similarity index 100% rename from spacy/syntax/__init__.py rename to spacy/tests/training/__init__.py diff --git a/spacy/tests/training/test_augmenters.py b/spacy/tests/training/test_augmenters.py new file mode 100644 index 000000000..0bd4d5ef2 --- /dev/null +++ b/spacy/tests/training/test_augmenters.py @@ -0,0 +1,100 @@ +import pytest +from spacy.training import Corpus +from spacy.training.augment import create_orth_variants_augmenter +from spacy.training.augment import create_lower_casing_augmenter +from spacy.lang.en import English +from spacy.tokens import DocBin, Doc +from contextlib import contextmanager +import random + +from ..util import make_tempdir + + +@contextmanager +def make_docbin(docs, name="roundtrip.spacy"): + with make_tempdir() as tmpdir: + output_file = tmpdir / name + DocBin(docs=docs).to_disk(output_file) + yield output_file + + +@pytest.fixture +def nlp(): + return English() + + +@pytest.fixture +def doc(nlp): + # fmt: off + words = ["Sarah", "'s", "sister", "flew", "to", "Silicon", "Valley", "via", "London", "."] + tags = ["NNP", "POS", "NN", "VBD", "IN", "NNP", "NNP", "IN", "NNP", "."] + pos = ["PROPN", "PART", "NOUN", "VERB", "ADP", "PROPN", "PROPN", "ADP", "PROPN", "PUNCT"] + ents = ["B-PERSON", "I-PERSON", "O", "O", "O", "B-LOC", "I-LOC", "O", "B-GPE", "O"] + cats = {"TRAVEL": 1.0, "BAKING": 0.0} + # fmt: on + doc = Doc(nlp.vocab, words=words, tags=tags, pos=pos, ents=ents) + doc.cats = cats + return doc + + +@pytest.mark.filterwarnings("ignore::UserWarning") +def test_make_orth_variants(nlp, 
doc): + single = [ + {"tags": ["NFP"], "variants": ["…", "..."]}, + {"tags": [":"], "variants": ["-", "—", "–", "--", "---", "——"]}, + ] + augmenter = create_orth_variants_augmenter( + level=0.2, lower=0.5, orth_variants={"single": single} + ) + with make_docbin([doc]) as output_file: + reader = Corpus(output_file, augmenter=augmenter) + # Due to randomness, only test that it works without errors for now + list(reader(nlp)) + + +def test_lowercase_augmenter(nlp, doc): + augmenter = create_lower_casing_augmenter(level=1.0) + with make_docbin([doc]) as output_file: + reader = Corpus(output_file, augmenter=augmenter) + corpus = list(reader(nlp)) + eg = corpus[0] + assert eg.reference.text == doc.text.lower() + assert eg.predicted.text == doc.text.lower() + ents = [(e.start, e.end, e.label) for e in doc.ents] + assert [(e.start, e.end, e.label) for e in eg.reference.ents] == ents + for ref_ent, orig_ent in zip(eg.reference.ents, doc.ents): + assert ref_ent.text == orig_ent.text.lower() + assert [t.pos_ for t in eg.reference] == [t.pos_ for t in doc] + + +@pytest.mark.filterwarnings("ignore::UserWarning") +def test_custom_data_augmentation(nlp, doc): + def create_spongebob_augmenter(randomize: bool = False): + def augment(nlp, example): + text = example.text + if randomize: + ch = [c.lower() if random.random() < 0.5 else c.upper() for c in text] + else: + ch = [c.lower() if i % 2 else c.upper() for i, c in enumerate(text)] + example_dict = example.to_dict() + doc = nlp.make_doc("".join(ch)) + example_dict["token_annotation"]["ORTH"] = [t.text for t in doc] + yield example + yield example.from_dict(doc, example_dict) + + return augment + + with make_docbin([doc]) as output_file: + reader = Corpus(output_file, augmenter=create_spongebob_augmenter()) + corpus = list(reader(nlp)) + orig_text = "Sarah 's sister flew to Silicon Valley via London . " + augmented = "SaRaH 's sIsTeR FlEw tO SiLiCoN VaLlEy vIa lOnDoN . 
" + assert corpus[0].text == orig_text + assert corpus[0].reference.text == orig_text + assert corpus[0].predicted.text == orig_text + assert corpus[1].text == augmented + assert corpus[1].reference.text == augmented + assert corpus[1].predicted.text == augmented + ents = [(e.start, e.end, e.label) for e in doc.ents] + assert [(e.start, e.end, e.label) for e in corpus[0].reference.ents] == ents + assert [(e.start, e.end, e.label) for e in corpus[1].reference.ents] == ents diff --git a/spacy/tests/training/test_new_example.py b/spacy/tests/training/test_new_example.py new file mode 100644 index 000000000..06db86a12 --- /dev/null +++ b/spacy/tests/training/test_new_example.py @@ -0,0 +1,265 @@ +import pytest +from spacy.training.example import Example +from spacy.tokens import Doc +from spacy.vocab import Vocab + + +def test_Example_init_requires_doc_objects(): + vocab = Vocab() + with pytest.raises(TypeError): + Example(None, None) + with pytest.raises(TypeError): + Example(Doc(vocab, words=["hi"]), None) + with pytest.raises(TypeError): + Example(None, Doc(vocab, words=["hi"])) + + +def test_Example_from_dict_basic(): + example = Example.from_dict( + Doc(Vocab(), words=["hello", "world"]), {"words": ["hello", "world"]} + ) + assert isinstance(example.x, Doc) + assert isinstance(example.y, Doc) + + +@pytest.mark.parametrize( + "annots", [{"words": ["ice", "cream"], "weirdannots": ["something", "such"]}] +) +def test_Example_from_dict_invalid(annots): + vocab = Vocab() + predicted = Doc(vocab, words=annots["words"]) + with pytest.raises(KeyError): + Example.from_dict(predicted, annots) + + +@pytest.mark.parametrize( + "pred_words", [["ice", "cream"], ["icecream"], ["i", "ce", "cream"]] +) +@pytest.mark.parametrize("annots", [{"words": ["icecream"], "tags": ["NN"]}]) +def test_Example_from_dict_with_tags(pred_words, annots): + vocab = Vocab() + predicted = Doc(vocab, words=pred_words) + example = Example.from_dict(predicted, annots) + for i, token in enumerate(example.reference): + assert token.tag_ == annots["tags"][i] + aligned_tags = example.get_aligned("TAG", as_string=True) + assert aligned_tags == ["NN" for _ in predicted] + + +@pytest.mark.filterwarnings("ignore::UserWarning") +def test_aligned_tags(): + pred_words = ["Apply", "some", "sunscreen", "unless", "you", "can", "not"] + gold_words = ["Apply", "some", "sun", "screen", "unless", "you", "cannot"] + gold_tags = ["VERB", "DET", "NOUN", "NOUN", "SCONJ", "PRON", "VERB"] + annots = {"words": gold_words, "tags": gold_tags} + vocab = Vocab() + predicted = Doc(vocab, words=pred_words) + example1 = Example.from_dict(predicted, annots) + aligned_tags1 = example1.get_aligned("TAG", as_string=True) + assert aligned_tags1 == ["VERB", "DET", "NOUN", "SCONJ", "PRON", "VERB", "VERB"] + # ensure that to_dict works correctly + example2 = Example.from_dict(predicted, example1.to_dict()) + aligned_tags2 = example2.get_aligned("TAG", as_string=True) + assert aligned_tags2 == ["VERB", "DET", "NOUN", "SCONJ", "PRON", "VERB", "VERB"] + + +def test_aligned_tags_multi(): + pred_words = ["Applysome", "sunscreen", "unless", "you", "can", "not"] + gold_words = ["Apply", "somesun", "screen", "unless", "you", "cannot"] + gold_tags = ["VERB", "DET", "NOUN", "SCONJ", "PRON", "VERB"] + annots = {"words": gold_words, "tags": gold_tags} + vocab = Vocab() + predicted = Doc(vocab, words=pred_words) + example = Example.from_dict(predicted, annots) + aligned_tags = example.get_aligned("TAG", as_string=True) + assert aligned_tags == [None, None, "SCONJ", "PRON", 
"VERB", "VERB"] + + +@pytest.mark.parametrize( + "annots", + [ + { + "words": ["I", "like", "London", "and", "Berlin", "."], + "deps": ["nsubj", "ROOT", "dobj", "cc", "conj", "punct"], + "heads": [1, 1, 1, 2, 2, 1], + } + ], +) +def test_Example_from_dict_with_parse(annots): + vocab = Vocab() + predicted = Doc(vocab, words=annots["words"]) + example = Example.from_dict(predicted, annots) + for i, token in enumerate(example.reference): + assert token.dep_ == annots["deps"][i] + assert token.head.i == annots["heads"][i] + + +@pytest.mark.parametrize( + "annots", + [ + { + "words": ["Sarah", "'s", "sister", "flew"], + "morphs": [ + "NounType=prop|Number=sing", + "Poss=yes", + "Number=sing", + "Tense=past|VerbForm=fin", + ], + } + ], +) +def test_Example_from_dict_with_morphology(annots): + vocab = Vocab() + predicted = Doc(vocab, words=annots["words"]) + example = Example.from_dict(predicted, annots) + for i, token in enumerate(example.reference): + assert str(token.morph) == annots["morphs"][i] + + +@pytest.mark.parametrize( + "annots", + [ + { + "words": ["This", "is", "one", "sentence", "this", "is", "another"], + "sent_starts": [1, 0, 0, 0, 1, 0, 0], + } + ], +) +def test_Example_from_dict_with_sent_start(annots): + vocab = Vocab() + predicted = Doc(vocab, words=annots["words"]) + example = Example.from_dict(predicted, annots) + assert len(list(example.reference.sents)) == 2 + for i, token in enumerate(example.reference): + assert bool(token.is_sent_start) == bool(annots["sent_starts"][i]) + + +@pytest.mark.parametrize( + "annots", + [ + { + "words": ["This", "is", "a", "sentence"], + "cats": {"cat1": 1.0, "cat2": 0.0, "cat3": 0.5}, + } + ], +) +def test_Example_from_dict_with_cats(annots): + vocab = Vocab() + predicted = Doc(vocab, words=annots["words"]) + example = Example.from_dict(predicted, annots) + assert len(list(example.reference.cats)) == 3 + assert example.reference.cats["cat1"] == 1.0 + assert example.reference.cats["cat2"] == 0.0 + assert example.reference.cats["cat3"] == 0.5 + + +@pytest.mark.parametrize( + "annots", + [ + { + "words": ["I", "like", "New", "York", "and", "Berlin", "."], + "entities": [(7, 15, "LOC"), (20, 26, "LOC")], + } + ], +) +def test_Example_from_dict_with_entities(annots): + vocab = Vocab() + predicted = Doc(vocab, words=annots["words"]) + example = Example.from_dict(predicted, annots) + + assert len(list(example.reference.ents)) == 2 + assert [example.reference[i].ent_iob_ for i in range(7)] == [ + "O", + "O", + "B", + "I", + "O", + "B", + "O", + ] + assert example.get_aligned("ENT_IOB") == [2, 2, 3, 1, 2, 3, 2] + + assert example.reference[2].ent_type_ == "LOC" + assert example.reference[3].ent_type_ == "LOC" + assert example.reference[5].ent_type_ == "LOC" + + +@pytest.mark.parametrize( + "annots", + [ + { + "words": ["I", "like", "New", "York", "and", "Berlin", "."], + "entities": [ + (0, 4, "LOC"), + (21, 27, "LOC"), + ], # not aligned to token boundaries + } + ], +) +def test_Example_from_dict_with_entities_invalid(annots): + vocab = Vocab() + predicted = Doc(vocab, words=annots["words"]) + with pytest.warns(UserWarning): + example = Example.from_dict(predicted, annots) + assert len(list(example.reference.ents)) == 0 + + +@pytest.mark.parametrize( + "annots", + [ + { + "words": ["I", "like", "New", "York", "and", "Berlin", "."], + "entities": [(7, 15, "LOC"), (20, 26, "LOC")], + "links": { + (7, 15): {"Q60": 1.0, "Q64": 0.0}, + (20, 26): {"Q60": 0.0, "Q64": 1.0}, + }, + } + ], +) +def test_Example_from_dict_with_links(annots): + vocab = Vocab() 
+ predicted = Doc(vocab, words=annots["words"]) + example = Example.from_dict(predicted, annots) + assert example.reference[0].ent_kb_id_ == "" + assert example.reference[1].ent_kb_id_ == "" + assert example.reference[2].ent_kb_id_ == "Q60" + assert example.reference[3].ent_kb_id_ == "Q60" + assert example.reference[4].ent_kb_id_ == "" + assert example.reference[5].ent_kb_id_ == "Q64" + assert example.reference[6].ent_kb_id_ == "" + + +@pytest.mark.parametrize( + "annots", + [ + { + "words": ["I", "like", "New", "York", "and", "Berlin", "."], + "links": {(7, 14): {"Q7381115": 1.0, "Q2146908": 0.0}}, + } + ], +) +def test_Example_from_dict_with_links_invalid(annots): + vocab = Vocab() + predicted = Doc(vocab, words=annots["words"]) + with pytest.raises(ValueError): + Example.from_dict(predicted, annots) + + +def test_Example_from_dict_sentences(): + vocab = Vocab() + predicted = Doc(vocab, words=["One", "sentence", ".", "one", "more"]) + annots = {"sent_starts": [1, 0, 0, 1, 0]} + ex = Example.from_dict(predicted, annots) + assert len(list(ex.reference.sents)) == 2 + + # this currently throws an error - bug or feature? + # predicted = Doc(vocab, words=["One", "sentence", "not", "one", "more"]) + # annots = {"sent_starts": [1, 0, 0, 0, 0]} + # ex = Example.from_dict(predicted, annots) + # assert len(list(ex.reference.sents)) == 1 + + predicted = Doc(vocab, words=["One", "sentence", "not", "one", "more"]) + annots = {"sent_starts": [1, -1, 0, 0, 0]} + ex = Example.from_dict(predicted, annots) + assert len(list(ex.reference.sents)) == 1 diff --git a/spacy/tests/training/test_readers.py b/spacy/tests/training/test_readers.py new file mode 100644 index 000000000..9d82ca50a --- /dev/null +++ b/spacy/tests/training/test_readers.py @@ -0,0 +1,117 @@ +from typing import Dict, Iterable, Callable +import pytest +from thinc.api import Config +from spacy import Language +from spacy.util import load_model_from_config, registry, resolve_dot_names +from spacy.schemas import ConfigSchemaTraining +from spacy.training import Example + + +def test_readers(): + config_string = """ + [training] + + [corpora] + @readers = "myreader.v1" + + [nlp] + lang = "en" + pipeline = ["tok2vec", "textcat"] + + [components] + + [components.tok2vec] + factory = "tok2vec" + + [components.textcat] + factory = "textcat" + """ + + @registry.readers.register("myreader.v1") + def myreader() -> Dict[str, Callable[[Language, str], Iterable[Example]]]: + annots = {"cats": {"POS": 1.0, "NEG": 0.0}} + + def reader(nlp: Language): + doc = nlp.make_doc(f"This is an example") + return [Example.from_dict(doc, annots)] + + return {"train": reader, "dev": reader, "extra": reader, "something": reader} + + config = Config().from_str(config_string) + nlp = load_model_from_config(config, auto_fill=True) + T = registry.resolve( + nlp.config.interpolate()["training"], schema=ConfigSchemaTraining + ) + dot_names = [T["train_corpus"], T["dev_corpus"]] + train_corpus, dev_corpus = resolve_dot_names(nlp.config, dot_names) + assert isinstance(train_corpus, Callable) + optimizer = T["optimizer"] + # simulate a training loop + nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer) + for example in train_corpus(nlp): + nlp.update([example], sgd=optimizer) + scores = nlp.evaluate(list(dev_corpus(nlp))) + assert scores["cats_score"] + # ensure the pipeline runs + doc = nlp("Quick test") + assert doc.cats + corpora = {"corpora": nlp.config.interpolate()["corpora"]} + extra_corpus = registry.resolve(corpora)["corpora"]["extra"] + assert 
isinstance(extra_corpus, Callable) + + +@pytest.mark.slow +@pytest.mark.parametrize( + "reader,additional_config", + [ + ("ml_datasets.imdb_sentiment.v1", {"train_limit": 10, "dev_limit": 2}), + ("ml_datasets.dbpedia.v1", {"train_limit": 10, "dev_limit": 2}), + ("ml_datasets.cmu_movies.v1", {"limit": 10, "freq_cutoff": 200, "split": 0.8}), + ], +) +def test_cat_readers(reader, additional_config): + nlp_config_string = """ + [training] + + [corpora] + @readers = "PLACEHOLDER" + + [nlp] + lang = "en" + pipeline = ["tok2vec", "textcat"] + + [components] + + [components.tok2vec] + factory = "tok2vec" + + [components.textcat] + factory = "textcat" + """ + config = Config().from_str(nlp_config_string) + config["corpora"]["@readers"] = reader + config["corpora"].update(additional_config) + nlp = load_model_from_config(config, auto_fill=True) + T = registry.resolve( + nlp.config["training"].interpolate(), schema=ConfigSchemaTraining + ) + dot_names = [T["train_corpus"], T["dev_corpus"]] + train_corpus, dev_corpus = resolve_dot_names(nlp.config, dot_names) + optimizer = T["optimizer"] + # simulate a training loop + nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer) + for example in train_corpus(nlp): + assert example.y.cats + # this shouldn't fail if each training example has at least one positive label + assert sorted(list(set(example.y.cats.values()))) == [0.0, 1.0] + nlp.update([example], sgd=optimizer) + # simulate performance benchmark on dev corpus + dev_examples = list(dev_corpus(nlp)) + for example in dev_examples: + # this shouldn't fail if each dev example has at least one positive label + assert sorted(list(set(example.y.cats.values()))) == [0.0, 1.0] + scores = nlp.evaluate(dev_examples) + assert scores["cats_score"] + # ensure the pipeline runs + doc = nlp("Quick test") + assert doc.cats diff --git a/spacy/tests/training/test_training.py b/spacy/tests/training/test_training.py new file mode 100644 index 000000000..07e1aef01 --- /dev/null +++ b/spacy/tests/training/test_training.py @@ -0,0 +1,692 @@ +import numpy +from spacy.training import offsets_to_biluo_tags, biluo_tags_to_offsets, Alignment +from spacy.training import biluo_tags_to_spans, iob_to_biluo +from spacy.training import Corpus, docs_to_json, Example +from spacy.training.converters import json_to_docs +from spacy.lang.en import English +from spacy.tokens import Doc, DocBin +from spacy.util import get_words_and_spaces, minibatch +from thinc.api import compounding +import pytest +import srsly + +from ..util import make_tempdir + + +@pytest.fixture +def doc(): + nlp = English() # make sure we get a new vocab every time + # fmt: off + words = ["Sarah", "'s", "sister", "flew", "to", "Silicon", "Valley", "via", "London", "."] + tags = ["NNP", "POS", "NN", "VBD", "IN", "NNP", "NNP", "IN", "NNP", "."] + pos = ["PROPN", "PART", "NOUN", "VERB", "ADP", "PROPN", "PROPN", "ADP", "PROPN", "PUNCT"] + morphs = ["NounType=prop|Number=sing", "Poss=yes", "Number=sing", "Tense=past|VerbForm=fin", + "", "NounType=prop|Number=sing", "NounType=prop|Number=sing", "", + "NounType=prop|Number=sing", "PunctType=peri"] + # head of '.' 
is intentionally nonprojective for testing + heads = [2, 0, 3, 3, 3, 6, 4, 3, 7, 5] + deps = ["poss", "case", "nsubj", "ROOT", "prep", "compound", "pobj", "prep", "pobj", "punct"] + lemmas = ["Sarah", "'s", "sister", "fly", "to", "Silicon", "Valley", "via", "London", "."] + ents = ["O"] * len(words) + ents[0] = "B-PERSON" + ents[1] = "I-PERSON" + ents[5] = "B-LOC" + ents[6] = "I-LOC" + ents[8] = "B-GPE" + cats = {"TRAVEL": 1.0, "BAKING": 0.0} + # fmt: on + doc = Doc( + nlp.vocab, + words=words, + tags=tags, + pos=pos, + morphs=morphs, + heads=heads, + deps=deps, + lemmas=lemmas, + ents=ents, + ) + doc.cats = cats + return doc + + +@pytest.fixture() +def merged_dict(): + return { + "ids": [1, 2, 3, 4, 5, 6, 7], + "words": ["Hi", "there", "everyone", "It", "is", "just", "me"], + "spaces": [True, True, True, True, True, True, False], + "tags": ["INTJ", "ADV", "PRON", "PRON", "AUX", "ADV", "PRON"], + "sent_starts": [1, 0, 0, 1, 0, 0, 0], + } + + +@pytest.fixture +def vocab(): + nlp = English() + return nlp.vocab + + +def test_gold_biluo_U(en_vocab): + words = ["I", "flew", "to", "London", "."] + spaces = [True, True, True, False, True] + doc = Doc(en_vocab, words=words, spaces=spaces) + entities = [(len("I flew to "), len("I flew to London"), "LOC")] + tags = offsets_to_biluo_tags(doc, entities) + assert tags == ["O", "O", "O", "U-LOC", "O"] + + +def test_gold_biluo_BL(en_vocab): + words = ["I", "flew", "to", "San", "Francisco", "."] + spaces = [True, True, True, True, False, True] + doc = Doc(en_vocab, words=words, spaces=spaces) + entities = [(len("I flew to "), len("I flew to San Francisco"), "LOC")] + tags = offsets_to_biluo_tags(doc, entities) + assert tags == ["O", "O", "O", "B-LOC", "L-LOC", "O"] + + +def test_gold_biluo_BIL(en_vocab): + words = ["I", "flew", "to", "San", "Francisco", "Valley", "."] + spaces = [True, True, True, True, True, False, True] + doc = Doc(en_vocab, words=words, spaces=spaces) + entities = [(len("I flew to "), len("I flew to San Francisco Valley"), "LOC")] + tags = offsets_to_biluo_tags(doc, entities) + assert tags == ["O", "O", "O", "B-LOC", "I-LOC", "L-LOC", "O"] + + +def test_gold_biluo_overlap(en_vocab): + words = ["I", "flew", "to", "San", "Francisco", "Valley", "."] + spaces = [True, True, True, True, True, False, True] + doc = Doc(en_vocab, words=words, spaces=spaces) + entities = [ + (len("I flew to "), len("I flew to San Francisco Valley"), "LOC"), + (len("I flew to "), len("I flew to San Francisco"), "LOC"), + ] + with pytest.raises(ValueError): + offsets_to_biluo_tags(doc, entities) + + +def test_gold_biluo_misalign(en_vocab): + words = ["I", "flew", "to", "San", "Francisco", "Valley."] + spaces = [True, True, True, True, True, False] + doc = Doc(en_vocab, words=words, spaces=spaces) + entities = [(len("I flew to "), len("I flew to San Francisco Valley"), "LOC")] + with pytest.warns(UserWarning): + tags = offsets_to_biluo_tags(doc, entities) + assert tags == ["O", "O", "O", "-", "-", "-"] + + +def test_example_constructor(en_vocab): + words = ["I", "like", "stuff"] + tags = ["NOUN", "VERB", "NOUN"] + tag_ids = [en_vocab.strings.add(tag) for tag in tags] + predicted = Doc(en_vocab, words=words) + reference = Doc(en_vocab, words=words) + reference = reference.from_array("TAG", numpy.array(tag_ids, dtype="uint64")) + example = Example(predicted, reference) + tags = example.get_aligned("TAG", as_string=True) + assert tags == ["NOUN", "VERB", "NOUN"] + + +def test_example_from_dict_tags(en_vocab): + words = ["I", "like", "stuff"] + tags = ["NOUN", "VERB", 
"NOUN"] + predicted = Doc(en_vocab, words=words) + example = Example.from_dict(predicted, {"TAGS": tags}) + tags = example.get_aligned("TAG", as_string=True) + assert tags == ["NOUN", "VERB", "NOUN"] + + +def test_example_from_dict_no_ner(en_vocab): + words = ["a", "b", "c", "d"] + spaces = [True, True, False, True] + predicted = Doc(en_vocab, words=words, spaces=spaces) + example = Example.from_dict(predicted, {"words": words}) + ner_tags = example.get_aligned_ner() + assert ner_tags == [None, None, None, None] + + +def test_example_from_dict_some_ner(en_vocab): + words = ["a", "b", "c", "d"] + spaces = [True, True, False, True] + predicted = Doc(en_vocab, words=words, spaces=spaces) + example = Example.from_dict( + predicted, {"words": words, "entities": ["U-LOC", None, None, None]} + ) + ner_tags = example.get_aligned_ner() + assert ner_tags == ["U-LOC", None, None, None] + + +@pytest.mark.filterwarnings("ignore::UserWarning") +def test_json_to_docs_no_ner(en_vocab): + data = [ + { + "id": 1, + "paragraphs": [ + { + "sentences": [ + { + "tokens": [ + {"dep": "nn", "head": 1, "tag": "NNP", "orth": "Ms."}, + { + "dep": "nsubj", + "head": 1, + "tag": "NNP", + "orth": "Haag", + }, + { + "dep": "ROOT", + "head": 0, + "tag": "VBZ", + "orth": "plays", + }, + { + "dep": "dobj", + "head": -1, + "tag": "NNP", + "orth": "Elianti", + }, + {"dep": "punct", "head": -2, "tag": ".", "orth": "."}, + ] + } + ] + } + ], + } + ] + docs = json_to_docs(data) + assert len(docs) == 1 + for doc in docs: + assert not doc.has_annotation("ENT_IOB") + for token in doc: + assert token.ent_iob == 0 + eg = Example( + Doc( + doc.vocab, + words=[w.text for w in doc], + spaces=[bool(w.whitespace_) for w in doc], + ), + doc, + ) + ner_tags = eg.get_aligned_ner() + assert ner_tags == [None, None, None, None, None] + + +def test_split_sentences(en_vocab): + # fmt: off + words = ["I", "flew", "to", "San Francisco Valley", "had", "loads of fun"] + gold_words = ["I", "flew", "to", "San", "Francisco", "Valley", "had", "loads", "of", "fun"] + sent_starts = [True, False, False, False, False, False, True, False, False, False] + # fmt: on + doc = Doc(en_vocab, words=words) + example = Example.from_dict(doc, {"words": gold_words, "sent_starts": sent_starts}) + assert example.text == "I flew to San Francisco Valley had loads of fun " + split_examples = example.split_sents() + assert len(split_examples) == 2 + assert split_examples[0].text == "I flew to San Francisco Valley " + assert split_examples[1].text == "had loads of fun " + # fmt: off + words = ["I", "flew", "to", "San", "Francisco", "Valley", "had", "loads", "of fun"] + gold_words = ["I", "flew", "to", "San Francisco", "Valley", "had", "loads of", "fun"] + sent_starts = [True, False, False, False, False, True, False, False] + # fmt: on + doc = Doc(en_vocab, words=words) + example = Example.from_dict(doc, {"words": gold_words, "sent_starts": sent_starts}) + assert example.text == "I flew to San Francisco Valley had loads of fun " + split_examples = example.split_sents() + assert len(split_examples) == 2 + assert split_examples[0].text == "I flew to San Francisco Valley " + assert split_examples[1].text == "had loads of fun " + + +def test_gold_biluo_one_to_many(en_vocab, en_tokenizer): + words = ["Mr and ", "Mrs Smith", "flew to", "San Francisco Valley", "."] + spaces = [True, True, True, False, False] + doc = Doc(en_vocab, words=words, spaces=spaces) + prefix = "Mr and Mrs Smith flew to " + entities = [(len(prefix), len(prefix + "San Francisco Valley"), "LOC")] + gold_words = 
["Mr and Mrs Smith", "flew", "to", "San", "Francisco", "Valley", "."] + example = Example.from_dict(doc, {"words": gold_words, "entities": entities}) + ner_tags = example.get_aligned_ner() + assert ner_tags == ["O", "O", "O", "U-LOC", "O"] + + entities = [ + (len("Mr and "), len("Mr and Mrs Smith"), "PERSON"), # "Mrs Smith" is a PERSON + (len(prefix), len(prefix + "San Francisco Valley"), "LOC"), + ] + # fmt: off + gold_words = ["Mr and", "Mrs", "Smith", "flew", "to", "San", "Francisco", "Valley", "."] + # fmt: on + example = Example.from_dict(doc, {"words": gold_words, "entities": entities}) + ner_tags = example.get_aligned_ner() + assert ner_tags == ["O", "U-PERSON", "O", "U-LOC", "O"] + + entities = [ + (len("Mr and "), len("Mr and Mrs"), "PERSON"), # "Mrs" is a Person + (len(prefix), len(prefix + "San Francisco Valley"), "LOC"), + ] + # fmt: off + gold_words = ["Mr and", "Mrs", "Smith", "flew", "to", "San", "Francisco", "Valley", "."] + # fmt: on + example = Example.from_dict(doc, {"words": gold_words, "entities": entities}) + ner_tags = example.get_aligned_ner() + assert ner_tags == ["O", None, "O", "U-LOC", "O"] + + +def test_gold_biluo_many_to_one(en_vocab, en_tokenizer): + words = ["Mr and", "Mrs", "Smith", "flew", "to", "San", "Francisco", "Valley", "."] + spaces = [True, True, True, True, True, True, True, False, False] + doc = Doc(en_vocab, words=words, spaces=spaces) + prefix = "Mr and Mrs Smith flew to " + entities = [(len(prefix), len(prefix + "San Francisco Valley"), "LOC")] + gold_words = ["Mr and Mrs Smith", "flew to", "San Francisco Valley", "."] + example = Example.from_dict(doc, {"words": gold_words, "entities": entities}) + ner_tags = example.get_aligned_ner() + assert ner_tags == ["O", "O", "O", "O", "O", "B-LOC", "I-LOC", "L-LOC", "O"] + + entities = [ + (len("Mr and "), len("Mr and Mrs Smith"), "PERSON"), # "Mrs Smith" is a PERSON + (len(prefix), len(prefix + "San Francisco Valley"), "LOC"), + ] + gold_words = ["Mr and", "Mrs Smith", "flew to", "San Francisco Valley", "."] + example = Example.from_dict(doc, {"words": gold_words, "entities": entities}) + ner_tags = example.get_aligned_ner() + expected = ["O", "B-PERSON", "L-PERSON", "O", "O", "B-LOC", "I-LOC", "L-LOC", "O"] + assert ner_tags == expected + + +def test_gold_biluo_misaligned(en_vocab, en_tokenizer): + words = ["Mr and Mrs", "Smith", "flew", "to", "San Francisco", "Valley", "."] + spaces = [True, True, True, True, True, False, False] + doc = Doc(en_vocab, words=words, spaces=spaces) + prefix = "Mr and Mrs Smith flew to " + entities = [(len(prefix), len(prefix + "San Francisco Valley"), "LOC")] + gold_words = ["Mr", "and Mrs Smith", "flew to", "San", "Francisco Valley", "."] + example = Example.from_dict(doc, {"words": gold_words, "entities": entities}) + ner_tags = example.get_aligned_ner() + assert ner_tags == ["O", "O", "O", "O", "B-LOC", "L-LOC", "O"] + + entities = [ + (len("Mr and "), len("Mr and Mrs Smith"), "PERSON"), # "Mrs Smith" is a PERSON + (len(prefix), len(prefix + "San Francisco Valley"), "LOC"), + ] + gold_words = ["Mr and", "Mrs Smith", "flew to", "San", "Francisco Valley", "."] + example = Example.from_dict(doc, {"words": gold_words, "entities": entities}) + ner_tags = example.get_aligned_ner() + assert ner_tags == [None, None, "O", "O", "B-LOC", "L-LOC", "O"] + + +def test_gold_biluo_additional_whitespace(en_vocab, en_tokenizer): + # additional whitespace tokens in GoldParse words + words, spaces = get_words_and_spaces( + ["I", "flew", "to", "San Francisco", "Valley", "."], + "I flew 
to San Francisco Valley.", + ) + doc = Doc(en_vocab, words=words, spaces=spaces) + prefix = "I flew to " + entities = [(len(prefix), len(prefix + "San Francisco Valley"), "LOC")] + gold_words = ["I", "flew", " ", "to", "San Francisco Valley", "."] + gold_spaces = [True, True, False, True, False, False] + example = Example.from_dict( + doc, {"words": gold_words, "spaces": gold_spaces, "entities": entities} + ) + ner_tags = example.get_aligned_ner() + assert ner_tags == ["O", "O", "O", "O", "B-LOC", "L-LOC", "O"] + + +def test_gold_biluo_4791(en_vocab, en_tokenizer): + doc = en_tokenizer("I'll return the ₹54 amount") + gold_words = ["I", "'ll", "return", "the", "₹", "54", "amount"] + gold_spaces = [False, True, True, True, False, True, False] + entities = [(16, 19, "MONEY")] + example = Example.from_dict( + doc, {"words": gold_words, "spaces": gold_spaces, "entities": entities} + ) + ner_tags = example.get_aligned_ner() + assert ner_tags == ["O", "O", "O", "O", "U-MONEY", "O"] + + doc = en_tokenizer("I'll return the $54 amount") + gold_words = ["I", "'ll", "return", "the", "$", "54", "amount"] + gold_spaces = [False, True, True, True, False, True, False] + entities = [(16, 19, "MONEY")] + example = Example.from_dict( + doc, {"words": gold_words, "spaces": gold_spaces, "entities": entities} + ) + ner_tags = example.get_aligned_ner() + assert ner_tags == ["O", "O", "O", "O", "B-MONEY", "L-MONEY", "O"] + + +def test_roundtrip_offsets_biluo_conversion(en_tokenizer): + text = "I flew to Silicon Valley via London." + biluo_tags = ["O", "O", "O", "B-LOC", "L-LOC", "O", "U-GPE", "O"] + offsets = [(10, 24, "LOC"), (29, 35, "GPE")] + doc = en_tokenizer(text) + biluo_tags_converted = offsets_to_biluo_tags(doc, offsets) + assert biluo_tags_converted == biluo_tags + offsets_converted = biluo_tags_to_offsets(doc, biluo_tags) + offsets_converted = [ent for ent in offsets if ent[2]] + assert offsets_converted == offsets + + +def test_biluo_spans(en_tokenizer): + doc = en_tokenizer("I flew to Silicon Valley via London.") + biluo_tags = ["O", "O", "O", "B-LOC", "L-LOC", "O", "U-GPE", "O"] + spans = biluo_tags_to_spans(doc, biluo_tags) + spans = [span for span in spans if span.label_] + assert len(spans) == 2 + assert spans[0].text == "Silicon Valley" + assert spans[0].label_ == "LOC" + assert spans[1].text == "London" + assert spans[1].label_ == "GPE" + + +def test_aligned_spans_y2x(en_vocab, en_tokenizer): + words = ["Mr and Mrs Smith", "flew", "to", "San Francisco Valley", "."] + spaces = [True, True, True, False, False] + doc = Doc(en_vocab, words=words, spaces=spaces) + prefix = "Mr and Mrs Smith flew to " + entities = [ + (0, len("Mr and Mrs Smith"), "PERSON"), + (len(prefix), len(prefix + "San Francisco Valley"), "LOC"), + ] + # fmt: off + tokens_ref = ["Mr", "and", "Mrs", "Smith", "flew", "to", "San", "Francisco", "Valley", "."] + # fmt: on + example = Example.from_dict(doc, {"words": tokens_ref, "entities": entities}) + ents_ref = example.reference.ents + assert [(ent.start, ent.end) for ent in ents_ref] == [(0, 4), (6, 9)] + ents_y2x = example.get_aligned_spans_y2x(ents_ref) + assert [(ent.start, ent.end) for ent in ents_y2x] == [(0, 1), (3, 4)] + + +def test_aligned_spans_x2y(en_vocab, en_tokenizer): + text = "Mr and Mrs Smith flew to San Francisco Valley" + nlp = English() + patterns = [ + {"label": "PERSON", "pattern": "Mr and Mrs Smith"}, + {"label": "LOC", "pattern": "San Francisco Valley"}, + ] + ruler = nlp.add_pipe("entity_ruler") + ruler.add_patterns(patterns) + doc = nlp(text) + assert 
[(ent.start, ent.end) for ent in doc.ents] == [(0, 4), (6, 9)] + prefix = "Mr and Mrs Smith flew to " + entities = [ + (0, len("Mr and Mrs Smith"), "PERSON"), + (len(prefix), len(prefix + "San Francisco Valley"), "LOC"), + ] + tokens_ref = ["Mr and Mrs", "Smith", "flew", "to", "San Francisco", "Valley"] + example = Example.from_dict(doc, {"words": tokens_ref, "entities": entities}) + assert [(ent.start, ent.end) for ent in example.reference.ents] == [(0, 2), (4, 6)] + # Ensure that 'get_aligned_spans_x2y' has the aligned entities correct + ents_pred = example.predicted.ents + assert [(ent.start, ent.end) for ent in ents_pred] == [(0, 4), (6, 9)] + ents_x2y = example.get_aligned_spans_x2y(ents_pred) + assert [(ent.start, ent.end) for ent in ents_x2y] == [(0, 2), (4, 6)] + + +def test_gold_ner_missing_tags(en_tokenizer): + doc = en_tokenizer("I flew to Silicon Valley via London.") + biluo_tags = [None, "O", "O", "B-LOC", "L-LOC", "O", "U-GPE", "O"] + example = Example.from_dict(doc, {"entities": biluo_tags}) + assert example.get_aligned("ENT_IOB") == [0, 2, 2, 3, 1, 2, 3, 2] + + +def test_projectivize(en_tokenizer): + doc = en_tokenizer("He pretty quickly walks away") + heads = [3, 2, 3, 0, 2] + example = Example.from_dict(doc, {"heads": heads}) + proj_heads, proj_labels = example.get_aligned_parse(projectivize=True) + nonproj_heads, nonproj_labels = example.get_aligned_parse(projectivize=False) + assert proj_heads == [3, 2, 3, 0, 3] + assert nonproj_heads == [3, 2, 3, 0, 2] + + +def test_iob_to_biluo(): + good_iob = ["O", "O", "B-LOC", "I-LOC", "O", "B-PERSON"] + good_biluo = ["O", "O", "B-LOC", "L-LOC", "O", "U-PERSON"] + bad_iob = ["O", "O", '"', "B-LOC", "I-LOC"] + converted_biluo = iob_to_biluo(good_iob) + assert good_biluo == converted_biluo + with pytest.raises(ValueError): + iob_to_biluo(bad_iob) + + +def test_roundtrip_docs_to_docbin(doc): + text = doc.text + idx = [t.idx for t in doc] + tags = [t.tag_ for t in doc] + pos = [t.pos_ for t in doc] + morphs = [str(t.morph) for t in doc] + lemmas = [t.lemma_ for t in doc] + deps = [t.dep_ for t in doc] + heads = [t.head.i for t in doc] + cats = doc.cats + ents = [(e.start_char, e.end_char, e.label_) for e in doc.ents] + # roundtrip to DocBin + with make_tempdir() as tmpdir: + # use a separate vocab to test that all labels are added + reloaded_nlp = English() + json_file = tmpdir / "roundtrip.json" + srsly.write_json(json_file, [docs_to_json(doc)]) + output_file = tmpdir / "roundtrip.spacy" + DocBin(docs=[doc]).to_disk(output_file) + reader = Corpus(output_file) + reloaded_examples = list(reader(reloaded_nlp)) + assert len(doc) == sum(len(eg) for eg in reloaded_examples) + reloaded_example = reloaded_examples[0] + assert text == reloaded_example.reference.text + assert idx == [t.idx for t in reloaded_example.reference] + assert tags == [t.tag_ for t in reloaded_example.reference] + assert pos == [t.pos_ for t in reloaded_example.reference] + assert morphs == [str(t.morph) for t in reloaded_example.reference] + assert lemmas == [t.lemma_ for t in reloaded_example.reference] + assert deps == [t.dep_ for t in reloaded_example.reference] + assert heads == [t.head.i for t in reloaded_example.reference] + assert ents == [ + (e.start_char, e.end_char, e.label_) for e in reloaded_example.reference.ents + ] + assert "TRAVEL" in reloaded_example.reference.cats + assert "BAKING" in reloaded_example.reference.cats + assert cats["TRAVEL"] == reloaded_example.reference.cats["TRAVEL"] + assert cats["BAKING"] == reloaded_example.reference.cats["BAKING"] 
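The roundtrip test above doubles as the basic recipe for preparing v3 training data: build annotated Doc objects, pack them into a DocBin, write the binary .spacy file, and let Corpus stream it back as Example objects. A minimal sketch of that flow; the file name and the annotations are illustrative, not taken from this patch:

from spacy.lang.en import English
from spacy.tokens import Doc, DocBin
from spacy.training import Corpus

nlp = English()
# A small annotated Doc; words and entity labels are placeholder examples.
doc = Doc(
    nlp.vocab,
    words=["Sarah", "flew", "to", "London"],
    ents=["B-PERSON", "O", "O", "B-GPE"],
)
# Serialize to the binary .spacy format that Corpus (and `spacy train`) consume.
DocBin(docs=[doc]).to_disk("train.spacy")
# Corpus streams the file back as Example objects for a given pipeline;
# a fresh English() works because DocBin stores the strings it needs.
reader = Corpus("train.spacy")
examples = list(reader(English()))
assert examples[0].reference.ents[0].label_ == "PERSON"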
+ + +@pytest.mark.skip("Outdated") +@pytest.mark.parametrize( + "tokens_a,tokens_b,expected", + [ + (["a", "b", "c"], ["ab", "c"], (3, [-1, -1, 1], [-1, 2], {0: 0, 1: 0}, {})), + ( + ["a", "b", '"', "c"], + ['ab"', "c"], + (4, [-1, -1, -1, 1], [-1, 3], {0: 0, 1: 0, 2: 0}, {}), + ), + (["a", "bc"], ["ab", "c"], (4, [-1, -1], [-1, -1], {0: 0}, {1: 1})), + ( + ["ab", "c", "d"], + ["a", "b", "cd"], + (6, [-1, -1, -1], [-1, -1, -1], {1: 2, 2: 2}, {0: 0, 1: 0}), + ), + ( + ["a", "b", "cd"], + ["a", "b", "c", "d"], + (3, [0, 1, -1], [0, 1, -1, -1], {}, {2: 2, 3: 2}), + ), + ([" ", "a"], ["a"], (1, [-1, 0], [1], {}, {})), + ], +) +def test_align(tokens_a, tokens_b, expected): # noqa + cost, a2b, b2a, a2b_multi, b2a_multi = align(tokens_a, tokens_b) # noqa + assert (cost, list(a2b), list(b2a), a2b_multi, b2a_multi) == expected # noqa + # check symmetry + cost, a2b, b2a, a2b_multi, b2a_multi = align(tokens_b, tokens_a) # noqa + assert (cost, list(b2a), list(a2b), b2a_multi, a2b_multi) == expected # noqa + + +def test_goldparse_startswith_space(en_tokenizer): + text = " a" + doc = en_tokenizer(text) + gold_words = ["a"] + entities = ["U-DATE"] + deps = ["ROOT"] + heads = [0] + example = Example.from_dict( + doc, {"words": gold_words, "entities": entities, "deps": deps, "heads": heads} + ) + ner_tags = example.get_aligned_ner() + assert ner_tags == ["O", "U-DATE"] + assert example.get_aligned("DEP", as_string=True) == [None, "ROOT"] + + +def test_gold_constructor(): + """Test that the Example constructor works fine""" + nlp = English() + doc = nlp("This is a sentence") + example = Example.from_dict(doc, {"cats": {"cat1": 1.0, "cat2": 0.0}}) + assert example.get_aligned("ORTH", as_string=True) == [ + "This", + "is", + "a", + "sentence", + ] + assert example.reference.cats["cat1"] + assert not example.reference.cats["cat2"] + + +def test_tuple_format_implicit(): + """Test tuple format""" + + train_data = [ + ("Uber blew through $1 million a week", {"entities": [(0, 4, "ORG")]}), + ( + "Spotify steps up Asia expansion", + {"entities": [(0, 7, "ORG"), (17, 21, "LOC")]}, + ), + ("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]}), + ] + + _train_tuples(train_data) + + +def test_tuple_format_implicit_invalid(): + """Test that an error is thrown for an implicit invalid field""" + train_data = [ + ("Uber blew through $1 million a week", {"frumble": [(0, 4, "ORG")]}), + ( + "Spotify steps up Asia expansion", + {"entities": [(0, 7, "ORG"), (17, 21, "LOC")]}, + ), + ("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]}), + ] + with pytest.raises(KeyError): + _train_tuples(train_data) + + +def _train_tuples(train_data): + nlp = English() + ner = nlp.add_pipe("ner") + ner.add_label("ORG") + ner.add_label("LOC") + train_examples = [] + for t in train_data: + train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1])) + optimizer = nlp.initialize() + for i in range(5): + losses = {} + batches = minibatch(train_examples, size=compounding(4.0, 32.0, 1.001)) + for batch in batches: + nlp.update(batch, sgd=optimizer, losses=losses) + + +def test_split_sents(merged_dict): + nlp = English() + example = Example.from_dict( + Doc(nlp.vocab, words=merged_dict["words"], spaces=merged_dict["spaces"]), + merged_dict, + ) + assert example.text == "Hi there everyone It is just me" + split_examples = example.split_sents() + assert len(split_examples) == 2 + assert split_examples[0].text == "Hi there everyone " + assert split_examples[1].text == "It is just me" + token_annotation_1 = 
split_examples[0].to_dict()["token_annotation"] + assert token_annotation_1["ORTH"] == ["Hi", "there", "everyone"] + assert token_annotation_1["TAG"] == ["INTJ", "ADV", "PRON"] + assert token_annotation_1["SENT_START"] == [1, 0, 0] + token_annotation_2 = split_examples[1].to_dict()["token_annotation"] + assert token_annotation_2["ORTH"] == ["It", "is", "just", "me"] + assert token_annotation_2["TAG"] == ["PRON", "AUX", "ADV", "PRON"] + assert token_annotation_2["SENT_START"] == [1, 0, 0, 0] + + +def test_alignment(): + other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."] + spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."] + align = Alignment.from_strings(other_tokens, spacy_tokens) + assert list(align.x2y.lengths) == [1, 1, 1, 1, 1, 1, 1, 1] + assert list(align.x2y.dataXd) == [0, 1, 2, 3, 4, 4, 5, 6] + assert list(align.y2x.lengths) == [1, 1, 1, 1, 2, 1, 1] + assert list(align.y2x.dataXd) == [0, 1, 2, 3, 4, 5, 6, 7] + + +def test_alignment_case_insensitive(): + other_tokens = ["I", "listened", "to", "obama", "'", "s", "podcasts", "."] + spacy_tokens = ["i", "listened", "to", "Obama", "'s", "PODCASTS", "."] + align = Alignment.from_strings(other_tokens, spacy_tokens) + assert list(align.x2y.lengths) == [1, 1, 1, 1, 1, 1, 1, 1] + assert list(align.x2y.dataXd) == [0, 1, 2, 3, 4, 4, 5, 6] + assert list(align.y2x.lengths) == [1, 1, 1, 1, 2, 1, 1] + assert list(align.y2x.dataXd) == [0, 1, 2, 3, 4, 5, 6, 7] + + +def test_alignment_complex(): + other_tokens = ["i listened to", "obama", "'", "s", "podcasts", "."] + spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts."] + align = Alignment.from_strings(other_tokens, spacy_tokens) + assert list(align.x2y.lengths) == [3, 1, 1, 1, 1, 1] + assert list(align.x2y.dataXd) == [0, 1, 2, 3, 4, 4, 5, 5] + assert list(align.y2x.lengths) == [1, 1, 1, 1, 2, 2] + assert list(align.y2x.dataXd) == [0, 0, 0, 1, 2, 3, 4, 5] + + +def test_alignment_complex_example(en_vocab): + other_tokens = ["i listened to", "obama", "'", "s", "podcasts", "."] + spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts."] + predicted = Doc( + en_vocab, words=other_tokens, spaces=[True, False, False, True, False, False] + ) + reference = Doc( + en_vocab, words=spacy_tokens, spaces=[True, True, True, False, True, False] + ) + assert predicted.text == "i listened to obama's podcasts." + assert reference.text == "i listened to obama's podcasts." 
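    # Reading the alignment checks below: the alignment is exposed as two
    # ragged arrays. x2y.lengths[i] is how many reference (y) tokens the i-th
    # predicted (x) token maps to, and x2y.dataXd lists those target indices
    # flattened; y2x is the reverse mapping. Here the single predicted token
    # "i listened to" aligns to three reference tokens, hence the leading 3.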
+ example = Example(predicted, reference) + align = example.alignment + assert list(align.x2y.lengths) == [3, 1, 1, 1, 1, 1] + assert list(align.x2y.dataXd) == [0, 1, 2, 3, 4, 4, 5, 5] + assert list(align.y2x.lengths) == [1, 1, 1, 1, 2, 2] + assert list(align.y2x.dataXd) == [0, 0, 0, 1, 2, 3, 4, 5] + + +def test_alignment_different_texts(): + other_tokens = ["she", "listened", "to", "obama", "'s", "podcasts", "."] + spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."] + with pytest.raises(ValueError): + Alignment.from_strings(other_tokens, spacy_tokens) + + +def test_retokenized_docs(doc): + a = doc.to_array(["TAG"]) + doc1 = Doc(doc.vocab, words=[t.text for t in doc]).from_array(["TAG"], a) + doc2 = Doc(doc.vocab, words=[t.text for t in doc]).from_array(["TAG"], a) + example = Example(doc1, doc2) + # fmt: off + expected1 = ["Sarah", "'s", "sister", "flew", "to", "Silicon", "Valley", "via", "London", "."] + expected2 = [None, "sister", "flew", "to", None, "via", "London", "."] + # fmt: on + assert example.get_aligned("ORTH", as_string=True) == expected1 + with doc1.retokenize() as retokenizer: + retokenizer.merge(doc1[0:2]) + retokenizer.merge(doc1[5:7]) + assert example.get_aligned("ORTH", as_string=True) == expected2 diff --git a/spacy/tests/util.py b/spacy/tests/util.py index 4e1c50398..ef7b4d00d 100644 --- a/spacy/tests/util.py +++ b/spacy/tests/util.py @@ -1,17 +1,10 @@ -# coding: utf-8 -from __future__ import unicode_literals - import numpy import tempfile -import shutil import contextlib import srsly -from pathlib import Path - -from spacy import Errors -from spacy.tokens import Doc, Span -from spacy.attrs import POS, TAG, HEAD, DEP, LEMMA -from spacy.compat import path2str +from spacy.tokens import Doc +from spacy.vocab import Vocab +from spacy.util import make_tempdir # noqa: F401 @contextlib.contextmanager @@ -21,64 +14,24 @@ def make_tempfile(mode="r"): f.close() -@contextlib.contextmanager -def make_tempdir(): - d = Path(tempfile.mkdtemp()) - yield d - shutil.rmtree(path2str(d)) +def get_batch(batch_size): + vocab = Vocab() + docs = [] + start = 0 + for size in range(1, batch_size + 1): + # Make the words numbers, so that they're distinct + # across the batch, and easy to track. 
+ numbers = [str(i) for i in range(start, start + size)] + docs.append(Doc(vocab, words=numbers)) + start += size + return docs -def get_doc( - vocab, words=[], pos=None, heads=None, deps=None, tags=None, ents=None, lemmas=None -): - """Create Doc object from given vocab, words and annotations.""" - if deps and not heads: - heads = [0] * len(deps) - headings = [] - values = [] - annotations = [pos, heads, deps, lemmas, tags] - possible_headings = [POS, HEAD, DEP, LEMMA, TAG] - for a, annot in enumerate(annotations): - if annot is not None: - if len(annot) != len(words): - raise ValueError(Errors.E189) - headings.append(possible_headings[a]) - if annot is not heads: - values.extend(annot) - for value in values: - vocab.strings.add(value) - - doc = Doc(vocab, words=words) - - # if there are any other annotations, set them - if headings: - attrs = doc.to_array(headings) - - j = 0 - for annot in annotations: - if annot: - if annot is heads: - for i in range(len(words)): - if attrs.ndim == 1: - attrs[i] = heads[i] - else: - attrs[i, j] = heads[i] - else: - for i in range(len(words)): - if attrs.ndim == 1: - attrs[i] = doc.vocab.strings[annot[i]] - else: - attrs[i, j] = doc.vocab.strings[annot[i]] - j += 1 - doc.from_array(headings, attrs) - - # finally, set the entities - if ents: - doc.ents = [ - Span(doc, start, end, label=doc.vocab.strings[label]) - for start, end, label in ents - ] - return doc +def get_random_doc(n_words): + vocab = Vocab() + # Make the words numbers, so that they're easy to track. + numbers = [str(i) for i in range(0, n_words)] + return Doc(vocab, words=numbers) def apply_transition_sequence(parser, doc, sequence): diff --git a/spacy/tests/vocab_vectors/test_lexeme.py b/spacy/tests/vocab_vectors/test_lexeme.py index af73a79bf..4288f427c 100644 --- a/spacy/tests/vocab_vectors/test_lexeme.py +++ b/spacy/tests/vocab_vectors/test_lexeme.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest import numpy from spacy.attrs import IS_ALPHA, IS_DIGIT diff --git a/spacy/tests/vocab_vectors/test_lookups.py b/spacy/tests/vocab_vectors/test_lookups.py index af15e9e91..d8c7651e4 100644 --- a/spacy/tests/vocab_vectors/test_lookups.py +++ b/spacy/tests/vocab_vectors/test_lookups.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.lookups import Lookups, Table from spacy.strings import get_string_id diff --git a/spacy/tests/vocab_vectors/test_similarity.py b/spacy/tests/vocab_vectors/test_similarity.py index f98f0e6e0..b5f7303b5 100644 --- a/spacy/tests/vocab_vectors/test_similarity.py +++ b/spacy/tests/vocab_vectors/test_similarity.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest import numpy from spacy.tokens import Doc diff --git a/spacy/tests/vocab_vectors/test_stringstore.py b/spacy/tests/vocab_vectors/test_stringstore.py index 75b1116dd..a0f8016af 100644 --- a/spacy/tests/vocab_vectors/test_stringstore.py +++ b/spacy/tests/vocab_vectors/test_stringstore.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.strings import StringStore @@ -98,22 +95,3 @@ def test_stringstore_to_bytes(stringstore, text): serialized = stringstore.to_bytes() new_stringstore = StringStore().from_bytes(serialized) assert new_stringstore[store] == text - - -@pytest.mark.xfail -@pytest.mark.parametrize("text", [["a", "b", "c"]]) -def test_stringstore_freeze_oov(stringstore, text): - """Test the possibly temporary workaround of 
flushing the stringstore of - OOV words.""" - assert stringstore[text[0]] == 1 - assert stringstore[text[1]] == 2 - - stringstore.set_frozen(True) - s = stringstore[text[2]] - assert s >= 4 - s_ = stringstore[s] - assert s_ == text[2] - - stringstore.flush_oov() - with pytest.raises(IndexError): - s_ = stringstore[s] diff --git a/spacy/tests/vocab_vectors/test_vectors.py b/spacy/tests/vocab_vectors/test_vectors.py index b31cef1f2..4257022ea 100644 --- a/spacy/tests/vocab_vectors/test_vectors.py +++ b/spacy/tests/vocab_vectors/test_vectors.py @@ -1,18 +1,13 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest import numpy from numpy.testing import assert_allclose, assert_equal -from spacy._ml import cosine from spacy.vocab import Vocab from spacy.vectors import Vectors from spacy.tokenizer import Tokenizer from spacy.strings import hash_string from spacy.tokens import Doc -from spacy.compat import is_python2 -from ..util import add_vecs_to_vocab, make_tempdir +from ..util import add_vecs_to_vocab, get_cosine, make_tempdir @pytest.fixture @@ -337,10 +332,9 @@ def test_vocab_prune_vectors(): assert list(remap.keys()) == ["kitten"] neighbour, similarity = list(remap.values())[0] assert neighbour == "cat", remap - assert_allclose(similarity, cosine(data[0], data[2]), atol=1e-4, rtol=1e-3) + assert_allclose(similarity, get_cosine(data[0], data[2]), atol=1e-4, rtol=1e-3) -@pytest.mark.skipif(is_python2, reason="Dict order? Not sure if worth investigating") def test_vectors_serialize(): data = numpy.asarray([[4, 2, 2, 2], [4, 2, 2, 2], [1, 1, 1, 1]], dtype="f") v = Vectors(data=data, keys=["A", "B", "C"]) diff --git a/spacy/tests/vocab_vectors/test_vocab_api.py b/spacy/tests/vocab_vectors/test_vocab_api.py index d22db2d8b..a687059be 100644 --- a/spacy/tests/vocab_vectors/test_vocab_api.py +++ b/spacy/tests/vocab_vectors/test_vocab_api.py @@ -1,6 +1,3 @@ -# coding: utf-8 -from __future__ import unicode_literals - import pytest from spacy.attrs import LEMMA, ORTH, PROB, IS_ALPHA from spacy.parts_of_speech import NOUN, VERB diff --git a/spacy/tokenizer.pxd b/spacy/tokenizer.pxd index 694ea49cc..9c1398a17 100644 --- a/spacy/tokenizer.pxd +++ b/spacy/tokenizer.pxd @@ -1,13 +1,13 @@ from libcpp.vector cimport vector - from preshed.maps cimport PreshMap from cymem.cymem cimport Pool from .typedefs cimport hash_t -from .structs cimport LexemeC, TokenC +from .structs cimport LexemeC, SpanC, TokenC from .strings cimport StringStore from .tokens.doc cimport Doc from .vocab cimport Vocab, LexemesOrTokens, _Cached +from .matcher.phrasematcher cimport PhraseMatcher cdef class Tokenizer: @@ -22,15 +22,30 @@ cdef class Tokenizer: cdef object _suffix_search cdef object _infix_finditer cdef object _rules + cdef PhraseMatcher _special_matcher + cdef int _property_init_count + cdef int _property_init_max - cpdef Doc tokens_from_list(self, list strings) - - cdef int _try_cache(self, hash_t key, Doc tokens) except -1 - cdef int _tokenize(self, Doc tokens, unicode span, hash_t key) except -1 - cdef unicode _split_affixes(self, Pool mem, unicode string, vector[LexemeC*] *prefixes, - vector[LexemeC*] *suffixes, int* has_special) + cdef Doc _tokenize_affixes(self, unicode string, bint with_special_cases) + cdef int _apply_special_cases(self, Doc doc) except -1 + cdef void _filter_special_spans(self, vector[SpanC] &original, + vector[SpanC] &filtered, int doc_len) nogil + cdef object _prepare_special_spans(self, Doc doc, + vector[SpanC] &filtered) + cdef int _retokenize_special_spans(self, Doc doc, 
TokenC* tokens, + object span_data) + cdef int _try_specials_and_cache(self, hash_t key, Doc tokens, + int* has_special, + bint with_special_cases) except -1 + cdef int _tokenize(self, Doc tokens, unicode span, hash_t key, + int* has_special, bint with_special_cases) except -1 + cdef unicode _split_affixes(self, Pool mem, unicode string, + vector[LexemeC*] *prefixes, + vector[LexemeC*] *suffixes, int* has_special, + bint with_special_cases) cdef int _attach_tokens(self, Doc tokens, unicode string, - vector[LexemeC*] *prefixes, vector[LexemeC*] *suffixes) except -1 - - cdef int _save_cached(self, const TokenC* tokens, hash_t key, int has_special, - int n) except -1 + vector[LexemeC*] *prefixes, + vector[LexemeC*] *suffixes, int* has_special, + bint with_special_cases) except -1 + cdef int _save_cached(self, const TokenC* tokens, hash_t key, + int* has_special, int n) except -1 diff --git a/spacy/tokenizer.pyx b/spacy/tokenizer.pyx index 154a42c4f..17714940d 100644 --- a/spacy/tokenizer.pyx +++ b/spacy/tokenizer.pyx @@ -1,33 +1,37 @@ -# cython: embedsignature=True -# cython: profile=True -# coding: utf8 +# cython: embedsignature=True, profile=True, binding=True from __future__ import unicode_literals from cython.operator cimport dereference as deref from cython.operator cimport preincrement as preinc +from libc.string cimport memcpy, memset +from libcpp.set cimport set as stdset from cymem.cymem cimport Pool from preshed.maps cimport PreshMap cimport cython -from collections import OrderedDict import re import warnings from .tokens.doc cimport Doc from .strings cimport hash_string -from .compat import unescape_unicode, basestring_ -from .attrs import intify_attrs -from .symbols import ORTH +from .lexeme cimport EMPTY_LEXEME +from .attrs import intify_attrs +from .symbols import ORTH, NORM from .errors import Errors, Warnings from . import util +from .util import registry +from .attrs import intify_attrs +from .symbols import ORTH +from .scorer import Scorer +from .training import validate_examples cdef class Tokenizer: """Segment text, and create Doc objects with the discovered segment boundaries. - DOCS: https://spacy.io/api/tokenizer + DOCS: https://nightly.spacy.io/api/tokenizer """ def __init__(self, Vocab vocab, rules=None, prefix_search=None, suffix_search=None, infix_finditer=None, token_match=None, @@ -43,16 +47,14 @@ cdef class Tokenizer: `infix_finditer` (callable): A function matching the signature of `re.compile(string).finditer` to find infixes. token_match (callable): A boolean function matching strings to be - recognised as tokens. + recognized as tokens. url_match (callable): A boolean function matching strings to be - recognised as tokens after considering prefixes and suffixes. - RETURNS (Tokenizer): The newly constructed object. + recognized as tokens after considering prefixes and suffixes. 
EXAMPLE: >>> tokenizer = Tokenizer(nlp.vocab) - >>> tokenizer = English().Defaults.create_tokenizer(nlp) - DOCS: https://spacy.io/api/tokenizer#init + DOCS: https://nightly.spacy.io/api/tokenizer#init """ self.mem = Pool() self._cache = PreshMap() @@ -64,7 +66,10 @@ cdef class Tokenizer: self.infix_finditer = infix_finditer self.vocab = vocab self._rules = {} - self._load_special_tokenization(rules) + self._special_matcher = PhraseMatcher(self.vocab) + self._load_special_cases(rules) + self._property_init_count = 0 + self._property_init_max = 4 property token_match: def __get__(self): @@ -72,7 +77,9 @@ cdef class Tokenizer: def __set__(self, token_match): self._token_match = token_match - self._flush_cache() + self._reload_special_cases() + if self._property_init_count <= self._property_init_max: + self._property_init_count += 1 property url_match: def __get__(self): @@ -88,7 +95,9 @@ cdef class Tokenizer: def __set__(self, prefix_search): self._prefix_search = prefix_search - self._flush_cache() + self._reload_special_cases() + if self._property_init_count <= self._property_init_max: + self._property_init_count += 1 property suffix_search: def __get__(self): @@ -96,7 +105,9 @@ cdef class Tokenizer: def __set__(self, suffix_search): self._suffix_search = suffix_search - self._flush_cache() + self._reload_special_cases() + if self._property_init_count <= self._property_init_max: + self._property_init_count += 1 property infix_finditer: def __get__(self): @@ -104,7 +115,9 @@ cdef class Tokenizer: def __set__(self, infix_finditer): self._infix_finditer = infix_finditer - self._flush_cache() + self._reload_special_cases() + if self._property_init_count <= self._property_init_max: + self._property_init_count += 1 property rules: def __get__(self): @@ -113,10 +126,10 @@ cdef class Tokenizer: def __set__(self, rules): self._rules = {} self._reset_cache([key for key in self._cache]) - self._reset_specials() + self._flush_specials() self._cache = PreshMap() self._specials = PreshMap() - self._load_special_tokenization(rules) + self._load_special_cases(rules) def __reduce__(self): args = (self.vocab, @@ -128,18 +141,24 @@ cdef class Tokenizer: self.url_match) return (self.__class__, args, None, None) - cpdef Doc tokens_from_list(self, list strings): - warnings.warn(Warnings.W002, DeprecationWarning) - return Doc(self.vocab, words=strings) - - @cython.boundscheck(False) def __call__(self, unicode string): """Tokenize a string. - string (unicode): The string to tokenize. + string (str): The string to tokenize. RETURNS (Doc): A container for linguistic annotations. - DOCS: https://spacy.io/api/tokenizer#call + DOCS: https://nightly.spacy.io/api/tokenizer#call + """ + doc = self._tokenize_affixes(string, True) + self._apply_special_cases(doc) + return doc + + @cython.boundscheck(False) + cdef Doc _tokenize_affixes(self, unicode string, bint with_special_cases): + """Tokenize according to affix and token_match settings. + + string (str): The string to tokenize. + RETURNS (Doc): A container for linguistic annotations. """ if len(string) >= (2 ** 30): raise ValueError(Errors.E025.format(length=len(string))) @@ -149,7 +168,7 @@ cdef class Tokenizer: return doc cdef int i = 0 cdef int start = 0 - cdef bint cache_hit + cdef int has_special = 0 cdef bint in_ws = string[0].isspace() cdef unicode span # The task here is much like string.split, but not quite @@ -165,9 +184,8 @@ cdef class Tokenizer: # we don't have to create the slice when we hit the cache. 
span = string[start:i] key = hash_string(span) - cache_hit = self._try_cache(key, doc) - if not cache_hit: - self._tokenize(doc, span, key) + if not self._try_specials_and_cache(key, doc, &has_special, with_special_cases): + self._tokenize(doc, span, key, &has_special, with_special_cases) if uc == ' ': doc.c[doc.length - 1].spacy = True start = i + 1 @@ -178,13 +196,12 @@ cdef class Tokenizer: if start < i: span = string[start:] key = hash_string(span) - cache_hit = self._try_cache(key, doc) - if not cache_hit: - self._tokenize(doc, span, key) + if not self._try_specials_and_cache(key, doc, &has_special, with_special_cases): + self._tokenize(doc, span, key, &has_special, with_special_cases) doc.c[doc.length - 1].spacy = string[-1] == " " and not in_ws return doc - def pipe(self, texts, batch_size=1000, n_threads=-1): + def pipe(self, texts, batch_size=1000): """Tokenize a stream of texts. texts: A sequence of unicode texts. @@ -192,60 +209,194 @@ cdef class Tokenizer: Defaults to 1000. YIELDS (Doc): A sequence of Doc objects, in order. - DOCS: https://spacy.io/api/tokenizer#pipe + DOCS: https://nightly.spacy.io/api/tokenizer#pipe """ - if n_threads != -1: - warnings.warn(Warnings.W016, DeprecationWarning) for text in texts: yield self(text) def _flush_cache(self): - self._reset_cache([key for key in self._cache if not key in self._specials]) + self._reset_cache([key for key in self._cache]) def _reset_cache(self, keys): for k in keys: + cached = <_Cached*>self._cache.get(k) del self._cache[k] - if not k in self._specials: - cached = <_Cached*>self._cache.get(k) - if cached is not NULL: - self.mem.free(cached) + if cached is not NULL: + self.mem.free(cached) - def _reset_specials(self): + def _flush_specials(self): for k in self._specials: cached = <_Cached*>self._specials.get(k) del self._specials[k] if cached is not NULL: self.mem.free(cached) - cdef int _try_cache(self, hash_t key, Doc tokens) except -1: - cached = <_Cached*>self._cache.get(key) - if cached == NULL: - return False + cdef int _apply_special_cases(self, Doc doc) except -1: + """Retokenize doc according to special cases. + + doc (Doc): Document. 
+ """ cdef int i - if cached.is_lex: - for i in range(cached.length): - tokens.push_back(cached.data.lexemes[i], False) + cdef int max_length = 0 + cdef bint modify_in_place + cdef Pool mem = Pool() + cdef vector[SpanC] c_matches + cdef vector[SpanC] c_filtered + cdef int offset + cdef int modified_doc_length + # Find matches for special cases + self._special_matcher.find_matches(doc, &c_matches) + # Skip processing if no matches + if c_matches.size() == 0: + return True + self._filter_special_spans(c_matches, c_filtered, doc.length) + # Put span info in span.start-indexed dict and calculate maximum + # intermediate document size + (span_data, max_length, modify_in_place) = self._prepare_special_spans(doc, c_filtered) + # If modifications never increase doc length, can modify in place + if modify_in_place: + tokens = doc.c + # Otherwise create a separate array to store modified tokens else: - for i in range(cached.length): - tokens.push_back(&cached.data.tokens[i], False) + tokens = mem.alloc(max_length, sizeof(TokenC)) + # Modify tokenization according to filtered special cases + offset = self._retokenize_special_spans(doc, tokens, span_data) + # Allocate more memory for doc if needed + modified_doc_length = doc.length + offset + while modified_doc_length >= doc.max_length: + doc._realloc(doc.max_length * 2) + # If not modified in place, copy tokens back to doc + if not modify_in_place: + memcpy(doc.c, tokens, max_length * sizeof(TokenC)) + for i in range(doc.length + offset, doc.length): + memset(&doc.c[i], 0, sizeof(TokenC)) + doc.c[i].lex = &EMPTY_LEXEME + doc.length = doc.length + offset return True - cdef int _tokenize(self, Doc tokens, unicode span, hash_t orig_key) except -1: + cdef void _filter_special_spans(self, vector[SpanC] &original, vector[SpanC] &filtered, int doc_len) nogil: + + cdef int seen_i + cdef SpanC span + cdef stdset[int] seen_tokens + stdsort(original.begin(), original.end(), len_start_cmp) + cdef int orig_i = original.size() - 1 + while orig_i >= 0: + span = original[orig_i] + if not seen_tokens.count(span.start) and not seen_tokens.count(span.end - 1): + filtered.push_back(span) + for seen_i in range(span.start, span.end): + seen_tokens.insert(seen_i) + orig_i -= 1 + stdsort(filtered.begin(), filtered.end(), start_cmp) + + cdef object _prepare_special_spans(self, Doc doc, vector[SpanC] &filtered): + spans = [doc[match.start:match.end] for match in filtered] + cdef bint modify_in_place = True + cdef int curr_length = doc.length + cdef int max_length + cdef int span_length_diff = 0 + span_data = {} + for span in spans: + rule = self._rules.get(span.text, None) + span_length_diff = 0 + if rule: + span_length_diff = len(rule) - (span.end - span.start) + if span_length_diff > 0: + modify_in_place = False + curr_length += span_length_diff + if curr_length > max_length: + max_length = curr_length + span_data[span.start] = (span.text, span.start, span.end, span_length_diff) + return (span_data, max_length, modify_in_place) + + cdef int _retokenize_special_spans(self, Doc doc, TokenC* tokens, object span_data): + cdef int i = 0 + cdef int j = 0 + cdef int offset = 0 + cdef _Cached* cached + cdef int idx_offset = 0 + cdef int orig_final_spacy + cdef int orig_idx + cdef int span_start + cdef int span_end + while i < doc.length: + if not i in span_data: + tokens[i + offset] = doc.c[i] + i += 1 + else: + span = span_data[i] + span_start = span[1] + span_end = span[2] + cached = <_Cached*>self._specials.get(hash_string(span[0])) + if cached == NULL: + # Copy original tokens 
if no rule found + for j in range(span_end - span_start): + tokens[i + offset + j] = doc.c[i + j] + i += span_end - span_start + else: + # Copy special case tokens into doc and adjust token and + # character offsets + idx_offset = 0 + orig_final_spacy = doc.c[span_end + offset - 1].spacy + orig_idx = doc.c[i].idx + for j in range(cached.length): + tokens[i + offset + j] = cached.data.tokens[j] + tokens[i + offset + j].idx = orig_idx + idx_offset + idx_offset += cached.data.tokens[j].lex.length + if cached.data.tokens[j].spacy: + idx_offset += 1 + tokens[i + offset + cached.length - 1].spacy = orig_final_spacy + i += span_end - span_start + offset += span[3] + return offset + + cdef int _try_specials_and_cache(self, hash_t key, Doc tokens, int* has_special, bint with_special_cases) except -1: + cdef bint specials_hit = 0 + cdef bint cache_hit = 0 + cdef int i + if with_special_cases: + cached = <_Cached*>self._specials.get(key) + if cached == NULL: + specials_hit = False + else: + for i in range(cached.length): + tokens.push_back(&cached.data.tokens[i], False) + has_special[0] = 1 + specials_hit = True + if not specials_hit: + cached = <_Cached*>self._cache.get(key) + if cached == NULL: + cache_hit = False + else: + if cached.is_lex: + for i in range(cached.length): + tokens.push_back(cached.data.lexemes[i], False) + else: + for i in range(cached.length): + tokens.push_back(&cached.data.tokens[i], False) + cache_hit = True + if not specials_hit and not cache_hit: + return False + return True + + cdef int _tokenize(self, Doc tokens, unicode span, hash_t orig_key, int* has_special, bint with_special_cases) except -1: cdef vector[LexemeC*] prefixes cdef vector[LexemeC*] suffixes cdef int orig_size - cdef int has_special = 0 orig_size = tokens.length span = self._split_affixes(tokens.mem, span, &prefixes, &suffixes, - &has_special) - self._attach_tokens(tokens, span, &prefixes, &suffixes) + has_special, with_special_cases) + self._attach_tokens(tokens, span, &prefixes, &suffixes, has_special, + with_special_cases) self._save_cached(&tokens.c[orig_size], orig_key, has_special, tokens.length - orig_size) cdef unicode _split_affixes(self, Pool mem, unicode string, vector[const LexemeC*] *prefixes, vector[const LexemeC*] *suffixes, - int* has_special): + int* has_special, + bint with_special_cases): cdef size_t i cdef unicode prefix cdef unicode suffix @@ -253,31 +404,28 @@ cdef class Tokenizer: cdef unicode minus_suf cdef size_t last_size = 0 while string and len(string) != last_size: - if self.token_match and self.token_match(string): + if self.token_match and self.token_match(string) \ + and not self.find_prefix(string) \ + and not self.find_suffix(string): break - if self._specials.get(hash_string(string)) != NULL: - has_special[0] = 1 + if with_special_cases and self._specials.get(hash_string(string)) != NULL: break last_size = len(string) pre_len = self.find_prefix(string) if pre_len != 0: prefix = string[:pre_len] minus_pre = string[pre_len:] - # Check whether we've hit a special-case - if minus_pre and self._specials.get(hash_string(minus_pre)) != NULL: + if minus_pre and with_special_cases and self._specials.get(hash_string(minus_pre)) != NULL: string = minus_pre prefixes.push_back(self.vocab.get(mem, prefix)) - has_special[0] = 1 break suf_len = self.find_suffix(string) if suf_len != 0: suffix = string[-suf_len:] minus_suf = string[:-suf_len] - # Check whether we've hit a special-case - if minus_suf and (self._specials.get(hash_string(minus_suf)) != NULL): + if minus_suf and 
with_special_cases and self._specials.get(hash_string(minus_suf)) != NULL: string = minus_suf suffixes.push_back(self.vocab.get(mem, suffix)) - has_special[0] = 1 break if pre_len and suf_len and (pre_len + suf_len) <= len(string): string = string[pre_len:-suf_len] @@ -289,15 +437,15 @@ cdef class Tokenizer: elif suf_len: string = minus_suf suffixes.push_back(self.vocab.get(mem, suffix)) - if string and (self._specials.get(hash_string(string)) != NULL): - has_special[0] = 1 - break return string cdef int _attach_tokens(self, Doc tokens, unicode string, vector[const LexemeC*] *prefixes, - vector[const LexemeC*] *suffixes) except -1: - cdef bint cache_hit + vector[const LexemeC*] *suffixes, + int* has_special, + bint with_special_cases) except -1: + cdef bint specials_hit = 0 + cdef bint cache_hit = 0 cdef int split, end cdef const LexemeC* const* lexemes cdef const LexemeC* lexeme @@ -307,8 +455,7 @@ cdef class Tokenizer: for i in range(prefixes.size()): tokens.push_back(prefixes[0][i], False) if string: - cache_hit = self._try_cache(hash_string(string), tokens) - if cache_hit: + if self._try_specials_and_cache(hash_string(string), tokens, has_special, with_special_cases): pass elif (self.token_match and self.token_match(string)) or \ (self.url_match and \ @@ -355,7 +502,7 @@ cdef class Tokenizer: tokens.push_back(lexeme, False) cdef int _save_cached(self, const TokenC* tokens, hash_t key, - int has_special, int n) except -1: + int* has_special, int n) except -1: cdef int i if n <= 0: # avoid mem alloc of zero length @@ -364,7 +511,7 @@ cdef class Tokenizer: if self.vocab._by_orth.get(tokens[i].lex.orth) == NULL: return 0 # See #1250 - if has_special: + if has_special[0]: return 0 cached = <_Cached*>self.mem.alloc(1, sizeof(_Cached)) cached.length = n @@ -378,12 +525,12 @@ cdef class Tokenizer: def find_infix(self, unicode string): """Find internal split points of the string, such as hyphens. - string (unicode): The string to segment. + string (str): The string to segment. RETURNS (list): A list of `re.MatchObject` objects that have `.start()` and `.end()` methods, denoting the placement of internal segment separators, e.g. hyphens. - DOCS: https://spacy.io/api/tokenizer#find_infix + DOCS: https://nightly.spacy.io/api/tokenizer#find_infix """ if self.infix_finditer is None: return 0 @@ -393,10 +540,10 @@ cdef class Tokenizer: """Find the length of a prefix that should be segmented from the string, or None if no prefix rules match. - string (unicode): The string to segment. + string (str): The string to segment. RETURNS (int): The length of the prefix if present, otherwise `None`. - DOCS: https://spacy.io/api/tokenizer#find_prefix + DOCS: https://nightly.spacy.io/api/tokenizer#find_prefix """ if self.prefix_search is None: return 0 @@ -407,32 +554,52 @@ cdef class Tokenizer: """Find the length of a suffix that should be segmented from the string, or None if no suffix rules match. - string (unicode): The string to segment. + string (str): The string to segment. Returns (int): The length of the suffix if present, otherwise `None`. 
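A minimal sketch of the affix helpers documented here, assuming a blank English pipeline; the returned lengths depend on the language's prefix/suffix rules:

    import spacy

    nlp = spacy.blank("en")
    # Length of the leading prefix match, e.g. 1 for the opening parenthesis
    nlp.tokenizer.find_prefix('("Hello')
    # Length of the trailing suffix match, e.g. 1 for the closing quote
    nlp.tokenizer.find_suffix('Hello!"')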
- DOCS: https://spacy.io/api/tokenizer#find_suffix + DOCS: https://nightly.spacy.io/api/tokenizer#find_suffix """ if self.suffix_search is None: return 0 match = self.suffix_search(string) return (match.end() - match.start()) if match is not None else 0 - def _load_special_tokenization(self, special_cases): + def _load_special_cases(self, special_cases): """Add special-case tokenization rules.""" if special_cases is not None: for chunk, substrings in sorted(special_cases.items()): + self._validate_special_case(chunk, substrings) self.add_special_case(chunk, substrings) + def _validate_special_case(self, chunk, substrings): + """Check whether the `ORTH` fields match the string. Check that + additional features beyond `ORTH` and `NORM` are not set by the + exception. + + chunk (str): The string to specially tokenize. + substrings (iterable): A sequence of dicts, where each dict describes + a token and its attributes. + """ + attrs = [intify_attrs(spec, _do_deprecated=True) for spec in substrings] + orth = "".join([spec[ORTH] for spec in attrs]) + if chunk != orth: + raise ValueError(Errors.E997.format(chunk=chunk, orth=orth, token_attrs=substrings)) + for substring in attrs: + for attr in substring: + if attr not in (ORTH, NORM): + raise ValueError(Errors.E1005.format(attr=self.vocab.strings[attr], chunk=chunk)) + def add_special_case(self, unicode string, substrings): """Add a special-case tokenization rule. - string (unicode): The string to specially tokenize. + string (str): The string to specially tokenize. substrings (iterable): A sequence of dicts, where each dict describes a token and its attributes. The `ORTH` fields of the attributes must exactly match the string when they are concatenated. - DOCS: https://spacy.io/api/tokenizer#add_special_case + DOCS: https://nightly.spacy.io/api/tokenizer#add_special_case """ + self._validate_special_case(string, substrings) substrings = list(substrings) cached = <_Cached*>self.mem.alloc(1, sizeof(_Cached)) cached.length = len(substrings) @@ -440,15 +607,25 @@ cdef class Tokenizer: cached.data.tokens = self.vocab.make_fused_token(substrings) key = hash_string(string) stale_special = <_Cached*>self._specials.get(key) - stale_cached = <_Cached*>self._cache.get(key) - self._flush_cache() self._specials.set(key, cached) - self._cache.set(key, cached) if stale_special is not NULL: self.mem.free(stale_special) - if stale_special != stale_cached and stale_cached is not NULL: - self.mem.free(stale_cached) self._rules[string] = substrings + self._flush_cache() + if self.find_prefix(string) or self.find_infix(string) or self.find_suffix(string): + self._special_matcher.add(string, None, self._tokenize_affixes(string, False)) + + def _reload_special_cases(self): + try: + self._property_init_count + except AttributeError: + return + # only reload if all 4 of prefix, suffix, infix, token_match have + # have been initialized + if self.vocab is not None and self._property_init_count >= self._property_init_max: + self._flush_cache() + self._flush_specials() + self._load_special_cases(self._rules) def explain(self, text): """A debugging tokenizer that provides information about which @@ -456,10 +633,10 @@ cdef class Tokenizer: produced are identical to `nlp.tokenizer()` except for whitespace tokens. - string (unicode): The string to tokenize. + string (str): The string to tokenize. 
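A minimal usage sketch for the validated special-case API above, assuming a blank English pipeline; the exact token output may vary by version:

    from spacy.lang.en import English
    from spacy.symbols import ORTH

    nlp = English()
    # The ORTH pieces must concatenate to the key string, otherwise E997 is raised
    nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
    # With the special-case retokenization pass, the rule also applies when the
    # string is wrapped in affixes such as punctuation
    print([t.text for t in nlp("(gimme)")])  # roughly: ['(', 'gim', 'me', ')']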
RETURNS (list): A list of (pattern_string, token_string) tuples - DOCS: https://spacy.io/api/tokenizer#explain + DOCS: https://nightly.spacy.io/api/tokenizer#explain """ prefix_search = self.prefix_search suffix_search = self.suffix_search @@ -529,101 +706,113 @@ cdef class Tokenizer: tokens.extend(reversed(suffixes)) return tokens + def score(self, examples, **kwargs): + validate_examples(examples, "Tokenizer.score") + return Scorer.score_tokenization(examples) + def to_disk(self, path, **kwargs): """Save the current state to a directory. - path (unicode or Path): A path to a directory, which will be created if + path (str / Path): A path to a directory, which will be created if it doesn't exist. exclude (list): String names of serialization fields to exclude. - DOCS: https://spacy.io/api/tokenizer#to_disk + DOCS: https://nightly.spacy.io/api/tokenizer#to_disk """ path = util.ensure_path(path) with path.open("wb") as file_: file_.write(self.to_bytes(**kwargs)) - def from_disk(self, path, **kwargs): + def from_disk(self, path, *, exclude=tuple()): """Loads state from a directory. Modifies the object in place and returns it. - path (unicode or Path): A path to a directory. + path (str / Path): A path to a directory. exclude (list): String names of serialization fields to exclude. RETURNS (Tokenizer): The modified `Tokenizer` object. - DOCS: https://spacy.io/api/tokenizer#from_disk + DOCS: https://nightly.spacy.io/api/tokenizer#from_disk """ path = util.ensure_path(path) with path.open("rb") as file_: bytes_data = file_.read() - self.from_bytes(bytes_data, **kwargs) + self.from_bytes(bytes_data, exclude=exclude) return self - def to_bytes(self, exclude=tuple(), **kwargs): + def to_bytes(self, *, exclude=tuple()): """Serialize the current state to a binary string. exclude (list): String names of serialization fields to exclude. RETURNS (bytes): The serialized form of the `Tokenizer` object. - DOCS: https://spacy.io/api/tokenizer#to_bytes + DOCS: https://nightly.spacy.io/api/tokenizer#to_bytes """ - serializers = OrderedDict(( - ("vocab", lambda: self.vocab.to_bytes()), - ("prefix_search", lambda: _get_regex_pattern(self.prefix_search)), - ("suffix_search", lambda: _get_regex_pattern(self.suffix_search)), - ("infix_finditer", lambda: _get_regex_pattern(self.infix_finditer)), - ("token_match", lambda: _get_regex_pattern(self.token_match)), - ("url_match", lambda: _get_regex_pattern(self.url_match)), - ("exceptions", lambda: OrderedDict(sorted(self._rules.items()))) - )) - exclude = util.get_serialization_exclude(serializers, exclude, kwargs) + serializers = { + "vocab": lambda: self.vocab.to_bytes(), + "prefix_search": lambda: _get_regex_pattern(self.prefix_search), + "suffix_search": lambda: _get_regex_pattern(self.suffix_search), + "infix_finditer": lambda: _get_regex_pattern(self.infix_finditer), + "token_match": lambda: _get_regex_pattern(self.token_match), + "url_match": lambda: _get_regex_pattern(self.url_match), + "exceptions": lambda: dict(sorted(self._rules.items())) + } return util.to_bytes(serializers, exclude) - def from_bytes(self, bytes_data, exclude=tuple(), **kwargs): + def from_bytes(self, bytes_data, *, exclude=tuple()): """Load state from a binary string. bytes_data (bytes): The data to load from. exclude (list): String names of serialization fields to exclude. RETURNS (Tokenizer): The `Tokenizer` object. 
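A minimal round-trip sketch for the keyword-only serialization API above, assuming a blank pipeline:

    import spacy

    nlp = spacy.blank("en")
    # exclude is now keyword-only on to_bytes/from_bytes and to_disk/from_disk
    tok_bytes = nlp.tokenizer.to_bytes(exclude=["vocab"])
    nlp.tokenizer.from_bytes(tok_bytes, exclude=["vocab"])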
- DOCS: https://spacy.io/api/tokenizer#from_bytes + DOCS: https://nightly.spacy.io/api/tokenizer#from_bytes """ - data = OrderedDict() - deserializers = OrderedDict(( - ("vocab", lambda b: self.vocab.from_bytes(b)), - ("prefix_search", lambda b: data.setdefault("prefix_search", b)), - ("suffix_search", lambda b: data.setdefault("suffix_search", b)), - ("infix_finditer", lambda b: data.setdefault("infix_finditer", b)), - ("token_match", lambda b: data.setdefault("token_match", b)), - ("url_match", lambda b: data.setdefault("url_match", b)), - ("exceptions", lambda b: data.setdefault("rules", b)) - )) - exclude = util.get_serialization_exclude(deserializers, exclude, kwargs) + data = {} + deserializers = { + "vocab": lambda b: self.vocab.from_bytes(b), + "prefix_search": lambda b: data.setdefault("prefix_search", b), + "suffix_search": lambda b: data.setdefault("suffix_search", b), + "infix_finditer": lambda b: data.setdefault("infix_finditer", b), + "token_match": lambda b: data.setdefault("token_match", b), + "url_match": lambda b: data.setdefault("url_match", b), + "exceptions": lambda b: data.setdefault("rules", b) + } msg = util.from_bytes(bytes_data, deserializers, exclude) - for key in ["prefix_search", "suffix_search", "infix_finditer", "token_match", "url_match"]: - if key in data: - data[key] = unescape_unicode(data[key]) - if "prefix_search" in data and isinstance(data["prefix_search"], basestring_): + if "prefix_search" in data and isinstance(data["prefix_search"], str): self.prefix_search = re.compile(data["prefix_search"]).search - if "suffix_search" in data and isinstance(data["suffix_search"], basestring_): + if "suffix_search" in data and isinstance(data["suffix_search"], str): self.suffix_search = re.compile(data["suffix_search"]).search - if "infix_finditer" in data and isinstance(data["infix_finditer"], basestring_): + if "infix_finditer" in data and isinstance(data["infix_finditer"], str): self.infix_finditer = re.compile(data["infix_finditer"]).finditer - if "token_match" in data and isinstance(data["token_match"], basestring_): + if "token_match" in data and isinstance(data["token_match"], str): self.token_match = re.compile(data["token_match"]).match - if "url_match" in data and isinstance(data["url_match"], basestring_): + if "url_match" in data and isinstance(data["url_match"], str): self.url_match = re.compile(data["url_match"]).match if "rules" in data and isinstance(data["rules"], dict): # make sure to hard reset the cache to remove data from the default exceptions self._rules = {} - self._reset_cache([key for key in self._cache]) - self._reset_specials() - self._cache = PreshMap() - self._specials = PreshMap() - self._load_special_tokenization(data["rules"]) - + self._flush_cache() + self._flush_specials() + self._load_special_cases(data["rules"]) return self def _get_regex_pattern(regex): """Get a pattern string for a regex, or None if the pattern is None.""" return None if regex is None else regex.__self__.pattern + + +cdef extern from "" namespace "std" nogil: + void stdsort "sort"(vector[SpanC].iterator, + vector[SpanC].iterator, + bint (*)(SpanC, SpanC)) + + +cdef bint len_start_cmp(SpanC a, SpanC b) nogil: + if a.end - a.start == b.end - b.start: + return b.start < a.start + return a.end - a.start < b.end - b.start + + +cdef bint start_cmp(SpanC a, SpanC b) nogil: + return a.start < b.start diff --git a/spacy/tokens/__init__.py b/spacy/tokens/__init__.py index 536ec8349..1aefa2b7c 100644 --- a/spacy/tokens/__init__.py +++ b/spacy/tokens/__init__.py @@ -1,9 
+1,7 @@ -# coding: utf8 -from __future__ import unicode_literals - from .doc import Doc from .token import Token from .span import Span from ._serialize import DocBin +from .morphanalysis import MorphAnalysis -__all__ = ["Doc", "Token", "Span", "DocBin"] +__all__ = ["Doc", "Token", "Span", "DocBin", "MorphAnalysis"] diff --git a/spacy/tokens/_retokenize.pyx b/spacy/tokens/_retokenize.pyx index 4a030bef6..398dfca26 100644 --- a/spacy/tokens/_retokenize.pyx +++ b/spacy/tokens/_retokenize.pyx @@ -1,14 +1,9 @@ -# coding: utf8 -# cython: infer_types=True -# cython: bounds_check=False -# cython: profile=True -from __future__ import unicode_literals - +# cython: infer_types=True, bounds_check=False, profile=True from libc.string cimport memcpy, memset from libc.stdlib cimport malloc, free from cymem.cymem cimport Pool -from thinc.neural.util import get_array_module +from thinc.api import get_array_module import numpy from .doc cimport Doc, set_children_from_heads, token_by_start, token_by_end @@ -16,7 +11,8 @@ from .span cimport Span from .token cimport Token from ..lexeme cimport Lexeme, EMPTY_LEXEME from ..structs cimport LexemeC, TokenC -from ..attrs cimport TAG +from ..attrs cimport MORPH +from ..vocab cimport Vocab from .underscore import is_writable_attr from ..attrs import intify_attrs @@ -28,8 +24,8 @@ from ..strings import get_string_id cdef class Retokenizer: """Helper class for doc.retokenize() context manager. - DOCS: https://spacy.io/api/doc#retokenize - USAGE: https://spacy.io/usage/linguistic-features#retokenization + DOCS: https://nightly.spacy.io/api/doc#retokenize + USAGE: https://nightly.spacy.io/usage/linguistic-features#retokenization """ cdef Doc doc cdef list merges @@ -51,7 +47,7 @@ cdef class Retokenizer: span (Span): The span to merge. attrs (dict): Attributes to set on the merged token. - DOCS: https://spacy.io/api/doc#retokenizer.merge + DOCS: https://nightly.spacy.io/api/doc#retokenizer.merge """ if (span.start, span.end) in self._spans_to_merge: return @@ -62,14 +58,7 @@ cdef class Retokenizer: raise ValueError(Errors.E102.format(token=repr(token))) self.tokens_to_merge.add(token.i) self._spans_to_merge.append((span.start, span.end)) - if "_" in attrs: # Extension attributes - extensions = attrs["_"] - _validate_extensions(extensions) - attrs = {key: value for key, value in attrs.items() if key != "_"} - attrs = intify_attrs(attrs, strings_map=self.doc.vocab.strings) - attrs["_"] = extensions - else: - attrs = intify_attrs(attrs, strings_map=self.doc.vocab.strings) + attrs = normalize_token_attrs(self.doc.vocab, attrs) self.merges.append((span, attrs)) def split(self, Token token, orths, heads, attrs=SimpleFrozenDict()): @@ -84,7 +73,7 @@ cdef class Retokenizer: attrs (dict): Attributes to set on all split tokens. Attribute names mapped to list of per-token attribute values. 
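A minimal sketch of the retokenization context manager documented above, assuming a blank English pipeline; the MORPH value is illustrative:

    import spacy

    nlp = spacy.blank("en")
    doc = nlp("I live in New York")
    with doc.retokenize() as retokenizer:
        # MORPH strings are normalized through the vocab's morphology table
        retokenizer.merge(doc[3:5], attrs={"LEMMA": "New York", "MORPH": "Number=Sing"})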
- DOCS: https://spacy.io/api/doc#retokenizer.split + DOCS: https://nightly.spacy.io/api/doc#retokenizer.split """ if ''.join(orths) != token.text: raise ValueError(Errors.E117.format(new=''.join(orths), old=token.text)) @@ -101,6 +90,11 @@ cdef class Retokenizer: # NB: Since we support {"KEY": [value, value]} syntax here, this # will only "intify" the keys, not the values attrs = intify_attrs(attrs, strings_map=self.doc.vocab.strings) + if MORPH in attrs: + for i, morph in enumerate(attrs[MORPH]): + # add and set to normalized value + morph = self.doc.vocab.morphology.add(self.doc.vocab.strings.as_string(morph)) + attrs[MORPH][i] = morph head_offsets = [] for head in heads: if isinstance(head, Token): @@ -226,21 +220,7 @@ def _merge(Doc doc, merges): token.lex = lex # We set trailing space here too token.spacy = doc.c[spans[token_index].end-1].spacy - py_token = span[0] - # Assign attributes - for attr_name, attr_value in attributes.items(): - if attr_name == "_": # Set extension attributes - for ext_attr_key, ext_attr_value in attr_value.items(): - py_token._.set(ext_attr_key, ext_attr_value) - elif attr_name == TAG: - doc.vocab.morphology.assign_tag(token, attr_value) - else: - # Set attributes on both token and lexeme to take care of token - # attribute vs. lexical attribute without having to enumerate - # them. If an attribute name is not valid, set_struct_attr will - # ignore it. - Token.set_struct_attr(token, attr_name, attr_value) - Lexeme.set_struct_attr(lex, attr_name, attr_value) + set_token_attrs(span[0], attributes) # Begin by setting all the head indices to absolute token positions # This is easier to work with for now than the offsets # Before thinking of something simpler, beware the case where a @@ -294,7 +274,7 @@ def _merge(Doc doc, merges): for i in range(doc.length): doc.c[i].head -= i # Set the left/right children, left/right edges - set_children_from_heads(doc.c, doc.length) + set_children_from_heads(doc.c, 0, doc.length) # Make sure ent_iob remains consistent make_iob_consistent(doc.c, doc.length) # Return the merged Python object @@ -314,6 +294,7 @@ def _split(Doc doc, int token_index, orths, heads, attrs): """Retokenize the document, such that the token at `doc[token_index]` is split into tokens with the orth 'orths' token_index(int): token index of the token to split. + orths: IDs of the verbatim text content of the tokens to create **attributes: Attributes to assign to each of the newly created tokens. By default, attributes are inherited from the original token. @@ -387,8 +368,6 @@ def _split(Doc doc, int token_index, orths, heads, attrs): doc[token_index + i]._.set(ext_attr_key, ext_attr_value) # NB: We need to call get_string_id here because only the keys are # "intified" (since we support "KEY": [value, value] syntax here). - elif attr_name == TAG: - doc.vocab.morphology.assign_tag(token, get_string_id(attr_value)) else: # Set attributes on both token and lexeme to take care of token # attribute vs. 
lexical attribute without having to enumerate @@ -403,7 +382,7 @@ def _split(Doc doc, int token_index, orths, heads, attrs): for i in range(doc.length): doc.c[i].head -= i # set children from head - set_children_from_heads(doc.c, doc.length) + set_children_from_heads(doc.c, 0, doc.length) def _validate_extensions(extensions): @@ -425,3 +404,37 @@ cdef make_iob_consistent(TokenC* tokens, int length): for i in range(1, length): if tokens[i].ent_iob == 1 and tokens[i - 1].ent_type != tokens[i].ent_type: tokens[i].ent_iob = 3 + + +def normalize_token_attrs(Vocab vocab, attrs): + if "_" in attrs: # Extension attributes + extensions = attrs["_"] + _validate_extensions(extensions) + attrs = {key: value for key, value in attrs.items() if key != "_"} + attrs = intify_attrs(attrs, strings_map=vocab.strings) + attrs["_"] = extensions + else: + attrs = intify_attrs(attrs, strings_map=vocab.strings) + if MORPH in attrs: + # add and set to normalized value + morph = vocab.morphology.add(vocab.strings.as_string(attrs[MORPH])) + attrs[MORPH] = morph + return attrs + + +def set_token_attrs(Token py_token, attrs): + cdef TokenC* token = py_token.c + cdef const LexemeC* lex = token.lex + cdef Doc doc = py_token.doc + # Assign attributes + for attr_name, attr_value in attrs.items(): + if attr_name == "_": # Set extension attributes + for ext_attr_key, ext_attr_value in attr_value.items(): + py_token._.set(ext_attr_key, ext_attr_value) + else: + # Set attributes on both token and lexeme to take care of token + # attribute vs. lexical attribute without having to enumerate + # them. If an attribute name is not valid, set_struct_attr will + # ignore it. + Token.set_struct_attr(token, attr_name, attr_value) + Lexeme.set_struct_attr(lex, attr_name, attr_value) diff --git a/spacy/tokens/_serialize.py b/spacy/tokens/_serialize.py index b60a6d7b3..11eb75821 100644 --- a/spacy/tokens/_serialize.py +++ b/spacy/tokens/_serialize.py @@ -1,18 +1,23 @@ -# coding: utf8 -from __future__ import unicode_literals - +from typing import Iterable, Iterator, Union +from pathlib import Path import numpy import zlib import srsly -from thinc.neural.ops import NumpyOps +from thinc.api import NumpyOps +from .doc import Doc +from ..vocab import Vocab from ..compat import copy_reg -from ..tokens import Doc from ..attrs import SPACY, ORTH, intify_attr from ..errors import Errors +from ..util import ensure_path, SimpleFrozenList + +# fmt: off +ALL_ATTRS = ("ORTH", "NORM", "TAG", "HEAD", "DEP", "ENT_IOB", "ENT_TYPE", "ENT_KB_ID", "ENT_ID", "LEMMA", "MORPH", "POS", "SENT_START") +# fmt: on -class DocBin(object): +class DocBin: """Pack Doc objects for binary serialization. The DocBin class lets you efficiently serialize the information from a @@ -31,6 +36,7 @@ class DocBin(object): "spaces": bytes, # Serialized numpy boolean array with spaces data "lengths": bytes, # Serialized numpy int32 array with the doc lengths "strings": List[unicode] # List of unique strings in the token data + "version": str, # DocBin version number } Strings for the words, tags, labels etc are represented by 64-bit hashes in @@ -42,37 +48,45 @@ class DocBin(object): document from the DocBin. """ - def __init__(self, attrs=None, store_user_data=False): + def __init__( + self, + attrs: Iterable[str] = ALL_ATTRS, + store_user_data: bool = False, + docs: Iterable[Doc] = SimpleFrozenList(), + ) -> None: """Create a DocBin object to hold serialized annotations. - attrs (list): List of attributes to serialize. 
'orth' and 'spacy' are - always serialized, so they're not required. Defaults to None. - store_user_data (bool): Whether to include the `Doc.user_data`. - RETURNS (DocBin): The newly constructed object. + attrs (Iterable[str]): List of attributes to serialize. 'orth' and + 'spacy' are always serialized, so they're not required. + store_user_data (bool): Whether to write the `Doc.user_data` to bytes/file. + docs (Iterable[Doc]): Docs to add. - DOCS: https://spacy.io/api/docbin#init + DOCS: https://nightly.spacy.io/api/docbin#init """ - attrs = attrs or [] attrs = sorted([intify_attr(attr) for attr in attrs]) + self.version = "0.1" self.attrs = [attr for attr in attrs if attr != ORTH and attr != SPACY] self.attrs.insert(0, ORTH) # Ensure ORTH is always attrs[0] self.tokens = [] self.spaces = [] self.cats = [] self.user_data = [] + self.flags = [] self.strings = set() self.store_user_data = store_user_data + for doc in docs: + self.add(doc) - def __len__(self): + def __len__(self) -> int: """RETURNS: The number of Doc objects added to the DocBin.""" return len(self.tokens) - def add(self, doc): + def add(self, doc: Doc) -> None: """Add a Doc's annotations to the DocBin for serialization. doc (Doc): The Doc object to add. - DOCS: https://spacy.io/api/docbin#add + DOCS: https://nightly.spacy.io/api/docbin#add """ array = doc.to_array(self.attrs) if len(array.shape) == 1: @@ -82,81 +96,107 @@ class DocBin(object): assert array.shape[0] == spaces.shape[0] # this should never happen spaces = spaces.reshape((spaces.shape[0], 1)) self.spaces.append(numpy.asarray(spaces, dtype=bool)) - self.strings.update(w.text for w in doc) + self.flags.append({"has_unknown_spaces": doc.has_unknown_spaces}) + for token in doc: + self.strings.add(token.text) + self.strings.add(token.tag_) + self.strings.add(token.lemma_) + self.strings.add(str(token.morph)) + self.strings.add(token.dep_) + self.strings.add(token.ent_type_) + self.strings.add(token.ent_kb_id_) self.cats.append(doc.cats) - if self.store_user_data: - self.user_data.append(srsly.msgpack_dumps(doc.user_data)) + self.user_data.append(srsly.msgpack_dumps(doc.user_data)) - def get_docs(self, vocab): + def get_docs(self, vocab: Vocab) -> Iterator[Doc]: """Recover Doc objects from the annotations, using the given vocab. + Note that the user data of each doc will be read (if available) and returned, + regardless of the setting of 'self.store_user_data'. vocab (Vocab): The shared vocab. YIELDS (Doc): The Doc objects. - DOCS: https://spacy.io/api/docbin#get_docs + DOCS: https://nightly.spacy.io/api/docbin#get_docs """ for string in self.strings: vocab[string] orth_col = self.attrs.index(ORTH) for i in range(len(self.tokens)): + flags = self.flags[i] tokens = self.tokens[i] spaces = self.spaces[i] - words = [vocab.strings[orth] for orth in tokens[:, orth_col]] - doc = Doc(vocab, words=words, spaces=spaces) + if flags.get("has_unknown_spaces"): + spaces = None + doc = Doc(vocab, words=tokens[:, orth_col], spaces=spaces) doc = doc.from_array(self.attrs, tokens) doc.cats = self.cats[i] - if self.store_user_data: + if i < len(self.user_data) and self.user_data[i] is not None: user_data = srsly.msgpack_loads(self.user_data[i], use_list=False) doc.user_data.update(user_data) yield doc - def merge(self, other): + def merge(self, other: "DocBin") -> None: """Extend the annotations of this DocBin with the annotations from another. Will raise an error if the pre-defined attrs of the two - DocBins don't match. 
+ DocBins don't match, or if they differ in whether or not to store + user data. other (DocBin): The DocBin to merge into the current bin. - DOCS: https://spacy.io/api/docbin#merge + DOCS: https://nightly.spacy.io/api/docbin#merge """ if self.attrs != other.attrs: - raise ValueError(Errors.E166.format(current=self.attrs, other=other.attrs)) + raise ValueError( + Errors.E166.format(param="attrs", current=self.attrs, other=other.attrs) + ) + if self.store_user_data != other.store_user_data: + raise ValueError( + Errors.E166.format( + param="store_user_data", + current=self.store_user_data, + other=other.store_user_data, + ) + ) self.tokens.extend(other.tokens) self.spaces.extend(other.spaces) self.strings.update(other.strings) self.cats.extend(other.cats) - if self.store_user_data: - self.user_data.extend(other.user_data) + self.flags.extend(other.flags) + self.user_data.extend(other.user_data) - def to_bytes(self): + def to_bytes(self) -> bytes: """Serialize the DocBin's annotations to a bytestring. RETURNS (bytes): The serialized DocBin. - DOCS: https://spacy.io/api/docbin#to_bytes + DOCS: https://nightly.spacy.io/api/docbin#to_bytes """ for tokens in self.tokens: assert len(tokens.shape) == 2, tokens.shape # this should never happen lengths = [len(tokens) for tokens in self.tokens] + tokens = numpy.vstack(self.tokens) if self.tokens else numpy.asarray([]) + spaces = numpy.vstack(self.spaces) if self.spaces else numpy.asarray([]) msg = { + "version": self.version, "attrs": self.attrs, - "tokens": numpy.vstack(self.tokens).tobytes("C"), - "spaces": numpy.vstack(self.spaces).tobytes("C"), + "tokens": tokens.tobytes("C"), + "spaces": spaces.tobytes("C"), "lengths": numpy.asarray(lengths, dtype="int32").tobytes("C"), - "strings": list(self.strings), + "strings": list(sorted(self.strings)), "cats": self.cats, + "flags": self.flags, } if self.store_user_data: msg["user_data"] = self.user_data return zlib.compress(srsly.msgpack_dumps(msg)) - def from_bytes(self, bytes_data): + def from_bytes(self, bytes_data: bytes) -> "DocBin": """Deserialize the DocBin's annotations from a bytestring. bytes_data (bytes): The data to load from. RETURNS (DocBin): The loaded DocBin. - DOCS: https://spacy.io/api/docbin#from_bytes + DOCS: https://nightly.spacy.io/api/docbin#from_bytes """ msg = srsly.msgpack_loads(zlib.decompress(bytes_data)) self.attrs = msg["attrs"] @@ -170,12 +210,39 @@ class DocBin(object): self.tokens = NumpyOps().unflatten(flat_tokens, lengths) self.spaces = NumpyOps().unflatten(flat_spaces, lengths) self.cats = msg["cats"] - if self.store_user_data and "user_data" in msg: + self.flags = msg.get("flags", [{} for _ in lengths]) + if "user_data" in msg: self.user_data = list(msg["user_data"]) + else: + self.user_data = [None] * len(self) for tokens in self.tokens: assert len(tokens.shape) == 2, tokens.shape # this should never happen return self + def to_disk(self, path: Union[str, Path]) -> None: + """Save the DocBin to a file (typically called .spacy). + + path (str / Path): The file path. + + DOCS: https://nightly.spacy.io/api/docbin#to_disk + """ + path = ensure_path(path) + with path.open("wb") as file_: + file_.write(self.to_bytes()) + + def from_disk(self, path: Union[str, Path]) -> "DocBin": + """Load the DocBin from a file (typically called .spacy). + + path (str / Path): The file path. + RETURNS (DocBin): The loaded DocBin. 
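A minimal sketch of the DocBin disk helpers introduced above, assuming a loaded pipeline `nlp`; "train.spacy" is an illustrative filename:

    from spacy.tokens import DocBin

    doc_bin = DocBin(docs=nlp.pipe(["Some text.", "More text."]), store_user_data=True)
    doc_bin.to_disk("train.spacy")
    docs = list(DocBin().from_disk("train.spacy").get_docs(nlp.vocab))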
+ + DOCS: https://nightly.spacy.io/api/docbin#to_disk + """ + path = ensure_path(path) + with path.open("rb") as file_: + self.from_bytes(file_.read()) + return self + def merge_bins(bins): merged = None diff --git a/spacy/tokens/doc.pxd b/spacy/tokens/doc.pxd index 6536d271d..08f795b1a 100644 --- a/spacy/tokens/doc.pxd +++ b/spacy/tokens/doc.pxd @@ -19,10 +19,10 @@ ctypedef fused LexemeOrToken: const_TokenC_ptr -cdef int set_children_from_heads(TokenC* tokens, int length) except -1 +cdef int set_children_from_heads(TokenC* tokens, int start, int end) except -1 -cdef int _set_lr_kids_and_edges(TokenC* tokens, int length, int loop_count) except -1 +cdef int _set_lr_kids_and_edges(TokenC* tokens, int start, int end, int loop_count) except -1 cdef int token_by_start(const TokenC* tokens, int length, int start_char) except -2 @@ -31,9 +31,6 @@ cdef int token_by_start(const TokenC* tokens, int length, int start_char) except cdef int token_by_end(const TokenC* tokens, int length, int end_char) except -2 -cdef int set_children_from_heads(TokenC* tokens, int length) except -1 - - cdef int [:,:] _get_lca_matrix(Doc, int start, int end) cdef class Doc: @@ -49,20 +46,20 @@ cdef class Doc: cdef TokenC* c - cdef public bint is_tagged - cdef public bint is_parsed - cdef public float sentiment cdef public dict user_hooks cdef public dict user_token_hooks cdef public dict user_span_hooks + cdef public bint has_unknown_spaces + cdef public list _py_tokens cdef int length cdef int max_length + cdef public object noun_chunks_iterator cdef object __weakref__ @@ -70,5 +67,3 @@ cdef class Doc: cdef int push_back(self, LexemeOrToken lex_or_tok, bint has_space) except -1 cpdef np.ndarray to_array(self, object features) - - cdef void set_parse(self, const TokenC* parsed) nogil diff --git a/spacy/tokens/doc.pyx b/spacy/tokens/doc.pyx index 89573ba09..abc82030d 100644 --- a/spacy/tokens/doc.pyx +++ b/spacy/tokens/doc.pyx @@ -1,39 +1,36 @@ - -# coding: utf8 -# cython: infer_types=True -# cython: bounds_check=False -# cython: profile=True -from __future__ import unicode_literals - +# cython: infer_types=True, bounds_check=False, profile=True cimport cython cimport numpy as np -from libc.string cimport memcpy, memset +from libc.string cimport memcpy from libc.math cimport sqrt -from collections import Counter +from libc.stdint cimport int32_t, uint64_t +import copy +from collections import Counter +from enum import Enum +import itertools import numpy -import numpy.linalg -import struct import srsly -from thinc.neural.util import get_array_module, copy_array +from thinc.api import get_array_module +from thinc.util import copy_array import warnings from .span cimport Span from .token cimport Token from ..lexeme cimport Lexeme, EMPTY_LEXEME from ..typedefs cimport attr_t, flags_t -from ..attrs cimport ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX, CLUSTER -from ..attrs cimport LENGTH, POS, LEMMA, TAG, DEP, HEAD, SPACY, ENT_IOB -from ..attrs cimport ENT_TYPE, ENT_ID, ENT_KB_ID, SENT_START, IDX, attr_id_t -from ..parts_of_speech cimport CCONJ, PUNCT, NOUN, univ_pos_t +from ..attrs cimport attr_id_t +from ..attrs cimport LENGTH, POS, LEMMA, TAG, MORPH, DEP, HEAD, SPACY, ENT_IOB +from ..attrs cimport ENT_TYPE, ENT_ID, ENT_KB_ID, SENT_START, IDX, NORM -from ..attrs import intify_attrs, IDS -from ..util import normalize_slice -from ..compat import is_config, copy_reg, pickle, basestring_ +from ..attrs import intify_attr, IDS +from ..compat import copy_reg, pickle from ..errors import Errors, Warnings +from ..morphology import 
Morphology from .. import util from .underscore import Underscore, get_ext_args from ._retokenize import Retokenizer +from ._serialize import ALL_ATTRS as DOCBIN_ALL_ATTRS DEF PADDING = 5 @@ -57,6 +54,8 @@ cdef attr_t get_token_attr(const TokenC* token, attr_id_t feat_name) nogil: return token.pos elif feat_name == TAG: return token.tag + elif feat_name == MORPH: + return token.morph elif feat_name == DEP: return token.dep elif feat_name == HEAD: @@ -89,14 +88,15 @@ cdef attr_t get_token_attr_for_matcher(const TokenC* token, attr_id_t feat_name) return get_token_attr(token, feat_name) -def _get_chunker(lang): - try: - cls = util.get_lang_class(lang) - except ImportError: - return None - except KeyError: - return None - return cls.Defaults.syntax_iterators.get("noun_chunks") +class SetEntsDefault(str, Enum): + blocked = "blocked" + missing = "missing" + outside = "outside" + unmodified = "unmodified" + + @classmethod + def values(cls): + return list(cls.__members__.keys()) cdef class Doc: @@ -112,25 +112,24 @@ cdef class Doc: Construction 2 >>> from spacy.tokens import Doc - >>> doc = Doc(nlp.vocab, words=[u'hello', u'world', u'!'], - >>> spaces=[True, False, False]) + >>> doc = Doc(nlp.vocab, words=["hello", "world", "!"], spaces=[True, False, False]) - DOCS: https://spacy.io/api/doc + DOCS: https://nightly.spacy.io/api/doc """ @classmethod def set_extension(cls, name, **kwargs): """Define a custom attribute which becomes available as `Doc._`. - name (unicode): Name of the attribute to set. + name (str): Name of the attribute to set. default: Optional default value of the attribute. getter (callable): Optional getter function. setter (callable): Optional setter function. method (callable): Optional method for method extension. force (bool): Force overwriting existing attribute. - DOCS: https://spacy.io/api/doc#set_extension - USAGE: https://spacy.io/usage/processing-pipelines#custom-components-attributes + DOCS: https://nightly.spacy.io/api/doc#set_extension + USAGE: https://nightly.spacy.io/usage/processing-pipelines#custom-components-attributes """ if cls.has_extension(name) and not kwargs.get("force", False): raise ValueError(Errors.E090.format(name=name, obj="Doc")) @@ -140,10 +139,10 @@ cdef class Doc: def get_extension(cls, name): """Look up a previously registered extension by name. - name (unicode): Name of the extension. + name (str): Name of the extension. RETURNS (tuple): A `(default, method, getter, setter)` tuple. - DOCS: https://spacy.io/api/doc#get_extension + DOCS: https://nightly.spacy.io/api/doc#get_extension """ return Underscore.doc_extensions.get(name) @@ -151,10 +150,10 @@ cdef class Doc: def has_extension(cls, name): """Check whether an extension has been registered. - name (unicode): Name of the extension. + name (str): Name of the extension. RETURNS (bool): Whether the extension has been registered. - DOCS: https://spacy.io/api/doc#has_extension + DOCS: https://nightly.spacy.io/api/doc#has_extension """ return name in Underscore.doc_extensions @@ -162,34 +161,66 @@ cdef class Doc: def remove_extension(cls, name): """Remove a previously registered extension. - name (unicode): Name of the extension. + name (str): Name of the extension. RETURNS (tuple): A `(default, method, getter, setter)` tuple of the removed extension. 
- DOCS: https://spacy.io/api/doc#remove_extension + DOCS: https://nightly.spacy.io/api/doc#remove_extension """ if not cls.has_extension(name): raise ValueError(Errors.E046.format(name=name)) return Underscore.doc_extensions.pop(name) - def __init__(self, Vocab vocab, words=None, spaces=None, user_data=None, - orths_and_spaces=None): + def __init__( + self, + Vocab vocab, + words=None, + spaces=None, + *, + user_data=None, + tags=None, + pos=None, + morphs=None, + lemmas=None, + heads=None, + deps=None, + sent_starts=None, + ents=None, + ): """Create a Doc object. vocab (Vocab): A vocabulary object, which must match any models you want to use (e.g. tokenizer, parser, entity recognizer). - words (list or None): A list of unicode strings to add to the document + words (Optional[List[str]]): A list of unicode strings to add to the document as words. If `None`, defaults to empty list. - spaces (list or None): A list of boolean values, of the same length as + spaces (Optional[List[bool]]): A list of boolean values, of the same length as words. True means that the word is followed by a space, False means it is not. If `None`, defaults to `[True]*len(words)` user_data (dict or None): Optional extra data to attach to the Doc. - RETURNS (Doc): The newly constructed object. + tags (Optional[List[str]]): A list of unicode strings, of the same + length as words, to assign as token.tag. Defaults to None. + pos (Optional[List[str]]): A list of unicode strings, of the same + length as words, to assign as token.pos. Defaults to None. + morphs (Optional[List[str]]): A list of unicode strings, of the same + length as words, to assign as token.morph. Defaults to None. + lemmas (Optional[List[str]]): A list of unicode strings, of the same + length as words, to assign as token.lemma. Defaults to None. + heads (Optional[List[int]]): A list of values, of the same length as + words, to assign as heads. Head indices are the position of the + head in the doc. Defaults to None. + deps (Optional[List[str]]): A list of unicode strings, of the same + length as words, to assign as token.dep. Defaults to None. + sent_starts (Optional[List[Union[bool, None]]]): A list of values, of + the same length as words, to assign as token.is_sent_start. Will be + overridden by heads if heads is provided. Defaults to None. + ents (Optional[List[str]]): A list of unicode strings, of the same + length as words, as IOB tags to assign as token.ent_iob and + token.ent_type. Defaults to None. 
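A minimal sketch of the extended constructor described above; the tag, sentence-start and entity values are illustrative only:

    import spacy
    from spacy.tokens import Doc

    nlp = spacy.blank("en")
    doc = Doc(
        nlp.vocab,
        words=["Apple", "is", "huge"],
        spaces=[True, True, False],
        tags=["NNP", "VBZ", "JJ"],
        sent_starts=[True, False, False],
        ents=["B-ORG", "O", "O"],
    )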
- DOCS: https://spacy.io/api/doc#init + DOCS: https://nightly.spacy.io/api/doc#init """ self.vocab = vocab - size = 20 + size = max(20, (len(words) if words is not None else 0)) self.mem = Pool() # Guarantee self.lex[i-x], for any i >= 0 and x < padding is in bounds # However, we need to remember the true starting places, so that we can @@ -203,8 +234,6 @@ cdef class Doc: self.c = data_start + PADDING self.max_length = size self.length = 0 - self.is_tagged = False - self.is_parsed = False self.sentiment = 0.0 self.cats = {} self.user_hooks = {} @@ -213,33 +242,121 @@ cdef class Doc: self.tensor = numpy.zeros((0,), dtype="float32") self.user_data = {} if user_data is None else user_data self._vector = None - self.noun_chunks_iterator = _get_chunker(self.vocab.lang) - cdef unicode orth + self.noun_chunks_iterator = self.vocab.get_noun_chunks cdef bint has_space - if orths_and_spaces is None and words is not None: - if spaces is None: - spaces = [True] * len(words) - elif len(spaces) != len(words): - raise ValueError(Errors.E027) - orths_and_spaces = zip(words, spaces) - if orths_and_spaces is not None: - for orth_space in orths_and_spaces: - if isinstance(orth_space, unicode): - orth = orth_space - has_space = True - elif isinstance(orth_space, bytes): - raise ValueError(Errors.E028.format(value=orth_space)) + if words is None and spaces is not None: + raise ValueError(Errors.E908) + elif spaces is None and words is not None: + self.has_unknown_spaces = True + else: + self.has_unknown_spaces = False + words = words if words is not None else [] + spaces = spaces if spaces is not None else ([True] * len(words)) + if len(spaces) != len(words): + raise ValueError(Errors.E027) + cdef const LexemeC* lexeme + for word, has_space in zip(words, spaces): + if isinstance(word, unicode): + lexeme = self.vocab.get(self.mem, word) + elif isinstance(word, bytes): + raise ValueError(Errors.E028.format(value=word)) + else: + lexeme = self.vocab.get_by_orth(self.mem, word) + self.push_back(lexeme, has_space) + + if heads is not None: + heads = [head - i for i, head in enumerate(heads)] + if deps and not heads: + heads = [0] * len(deps) + if sent_starts is not None: + for i in range(len(sent_starts)): + if sent_starts[i] is True: + sent_starts[i] = 1 + elif sent_starts[i] is False: + sent_starts[i] = -1 + elif sent_starts[i] is None or sent_starts[i] not in [-1, 0, 1]: + sent_starts[i] = 0 + ent_iobs = None + ent_types = None + if ents is not None: + iob_strings = Token.iob_strings() + # make valid IOB2 out of IOB1 or IOB2 + for i, ent in enumerate(ents): + if ent is "": + ents[i] = None + elif ent is not None and not isinstance(ent, str): + raise ValueError(Errors.E177.format(tag=ent)) + if i < len(ents) - 1: + # OI -> OB + if (ent is None or ent.startswith("O")) and \ + (ents[i+1] is not None and ents[i+1].startswith("I")): + ents[i+1] = "B" + ents[i+1][1:] + # B-TYPE1 I-TYPE2 or I-TYPE1 I-TYPE2 -> B/I-TYPE1 B-TYPE2 + if ent is not None and ents[i+1] is not None and \ + (ent.startswith("B") or ent.startswith("I")) and \ + ents[i+1].startswith("I") and \ + ent[1:] != ents[i+1][1:]: + ents[i+1] = "B" + ents[i+1][1:] + ent_iobs = [] + ent_types = [] + for ent in ents: + if ent is None: + ent_iobs.append(iob_strings.index("")) + ent_types.append("") + elif ent == "O": + ent_iobs.append(iob_strings.index(ent)) + ent_types.append("") else: - orth, has_space = orth_space - # Note that we pass self.mem here --- we have ownership, if LexemeC - # must be created. 
- self.push_back( - self.vocab.get(self.mem, orth), has_space) - # Tough to decide on policy for this. Is an empty doc tagged and parsed? - # There's no information we'd like to add to it, so I guess so? - if self.length == 0: - self.is_tagged = True - self.is_parsed = True + if len(ent) < 3 or ent[1] != "-": + raise ValueError(Errors.E177.format(tag=ent)) + ent_iob, ent_type = ent.split("-", 1) + if ent_iob not in iob_strings: + raise ValueError(Errors.E177.format(tag=ent)) + ent_iob = iob_strings.index(ent_iob) + ent_iobs.append(ent_iob) + ent_types.append(ent_type) + headings = [] + values = [] + annotations = [pos, heads, deps, lemmas, tags, morphs, sent_starts, ent_iobs, ent_types] + possible_headings = [POS, HEAD, DEP, LEMMA, TAG, MORPH, SENT_START, ENT_IOB, ENT_TYPE] + for a, annot in enumerate(annotations): + if annot is not None: + if len(annot) != len(words): + raise ValueError(Errors.E189) + headings.append(possible_headings[a]) + if annot is not heads and annot is not sent_starts and annot is not ent_iobs: + values.extend(annot) + for value in values: + self.vocab.strings.add(value) + + # if there are any other annotations, set them + if headings: + attrs = self.to_array(headings) + + j = 0 + for annot in annotations: + if annot: + if annot is heads or annot is sent_starts or annot is ent_iobs: + for i in range(len(words)): + if attrs.ndim == 1: + attrs[i] = annot[i] + else: + attrs[i, j] = annot[i] + elif annot is morphs: + for i in range(len(words)): + morph_key = vocab.morphology.add(morphs[i]) + if attrs.ndim == 1: + attrs[i] = morph_key + else: + attrs[i, j] = morph_key + else: + for i in range(len(words)): + if attrs.ndim == 1: + attrs[i] = self.vocab.strings[annot[i]] + else: + attrs[i, j] = self.vocab.strings[annot[i]] + j += 1 + self.from_array(headings, attrs) @property def _(self): @@ -247,37 +364,61 @@ cdef class Doc: return Underscore(Underscore.doc_extensions, self) @property - def is_sentenced(self): - """Check if the document has sentence boundaries assigned. This is - defined as having at least one of the following: + def is_tagged(self): + warnings.warn(Warnings.W107.format(prop="is_tagged", attr="TAG"), DeprecationWarning) + return self.has_annotation("TAG") - a) An entry "sents" in doc.user_hooks"; - b) Doc.is_parsed is set to True; - c) At least one token other than the first where sent_start is not None. - """ - if "sents" in self.user_hooks: - return True - if self.is_parsed: - return True - if len(self) < 2: - return True - for i in range(1, self.length): - if self.c[i].sent_start == -1 or self.c[i].sent_start == 1: - return True - return False + @property + def is_parsed(self): + warnings.warn(Warnings.W107.format(prop="is_parsed", attr="DEP"), DeprecationWarning) + return self.has_annotation("DEP") @property def is_nered(self): - """Check if the document has named entities set. Will return True if - *any* of the tokens has a named entity tag set (even if the others are - unknown values), or if the document is empty. + warnings.warn(Warnings.W107.format(prop="is_nered", attr="ENT_IOB"), DeprecationWarning) + return self.has_annotation("ENT_IOB") + + @property + def is_sentenced(self): + warnings.warn(Warnings.W107.format(prop="is_sentenced", attr="SENT_START"), DeprecationWarning) + return self.has_annotation("SENT_START") + + def has_annotation(self, attr, *, require_complete=False): + """Check whether the doc contains annotation on a token attribute. + + attr (Union[int, str]): The attribute string name or int ID. 
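A minimal sketch of the new annotation check, assuming `doc` is a processed Doc; it replaces the deprecated is_tagged/is_parsed/is_nered/is_sentenced flags:

    if doc.has_annotation("DEP"):
        sentences = list(doc.sents)
    fully_tagged = doc.has_annotation("TAG", require_complete=True)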
+ require_complete (bool): Whether to check that the attribute is set on + every token in the doc. + RETURNS (bool): Whether annotation is present. + + DOCS: https://nightly.spacy.io/api/doc#has_annotation """ - if len(self) == 0: + + # empty docs are always annotated + if self.length == 0: return True - for i in range(self.length): - if self.c[i].ent_iob != 0: + cdef int i + cdef int range_start = 0 + attr = intify_attr(attr) + # adjust attributes + if attr == HEAD: + # HEAD does not have an unset state, so rely on DEP + attr = DEP + elif attr == self.vocab.strings["IS_SENT_START"]: + # as in Matcher, allow IS_SENT_START as an alias of SENT_START + attr = SENT_START + # special cases for sentence boundaries + if attr == SENT_START: + if "sents" in self.user_hooks: return True - return False + # docs of length 1 always have sentence boundaries + if self.length == 1: + return True + range_start = 1 + if require_complete: + return all(Token.get_struct_attr(&self.c[i], attr) for i in range(range_start, self.length)) + else: + return any(Token.get_struct_attr(&self.c[i], attr) for i in range(range_start, self.length)) def __getitem__(self, object i): """Get a `Token` or `Span` object. @@ -302,10 +443,10 @@ cdef class Doc: You can use negative indices and open-ended ranges, which have their normal Python semantics. - DOCS: https://spacy.io/api/doc#getitem + DOCS: https://nightly.spacy.io/api/doc#getitem """ if isinstance(i, slice): - start, stop = normalize_slice(len(self), i.start, i.stop, i.step) + start, stop = util.normalize_slice(len(self), i.start, i.stop, i.step) return Span(self, start, stop, label=0) if i < 0: i = self.length + i @@ -319,7 +460,7 @@ cdef class Doc: than-Python speeds are required, you can instead access the annotations as a numpy array, or access the underlying C data directly from Cython. - DOCS: https://spacy.io/api/doc#iter + DOCS: https://nightly.spacy.io/api/doc#iter """ cdef int i for i in range(self.length): @@ -330,7 +471,7 @@ cdef class Doc: RETURNS (int): The number of tokens in the document. - DOCS: https://spacy.io/api/doc#len + DOCS: https://nightly.spacy.io/api/doc#len """ return self.length @@ -341,9 +482,7 @@ cdef class Doc: return "".join([t.text_with_ws for t in self]).encode("utf-8") def __str__(self): - if is_config(python3=True): - return self.__unicode__() - return self.__bytes__() + return self.__unicode__() def __repr__(self): return self.__str__() @@ -373,7 +512,7 @@ cdef class Doc: partially covered by the character span). Defaults to "strict". RETURNS (Span): The newly constructed object. - DOCS: https://spacy.io/api/doc#char_span + DOCS: https://nightly.spacy.io/api/doc#char_span """ if not isinstance(label, int): label = self.vocab.strings.add(label) @@ -415,7 +554,7 @@ cdef class Doc: `Span`, `Token` and `Lexeme` objects. RETURNS (float): A scalar similarity score. Higher is more similar. 
- DOCS: https://spacy.io/api/doc#similarity + DOCS: https://nightly.spacy.io/api/doc#similarity """ if "similarity" in self.user_hooks: return self.user_hooks["similarity"](self, other) @@ -437,7 +576,9 @@ cdef class Doc: return 0.0 vector = self.vector xp = get_array_module(vector) - return xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm) + result = xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm) + # ensure we get a scalar back (numpy does this automatically but cupy doesn't) + return result.item() @property def has_vector(self): @@ -446,7 +587,7 @@ cdef class Doc: RETURNS (bool): Whether a word vector is associated with the object. - DOCS: https://spacy.io/api/doc#has_vector + DOCS: https://nightly.spacy.io/api/doc#has_vector """ if "has_vector" in self.user_hooks: return self.user_hooks["has_vector"](self) @@ -464,7 +605,7 @@ cdef class Doc: RETURNS (numpy.ndarray[ndim=1, dtype='float32']): A 1D numpy array representing the document's semantics. - DOCS: https://spacy.io/api/doc#vector + DOCS: https://nightly.spacy.io/api/doc#vector """ def __get__(self): if "vector" in self.user_hooks: @@ -492,7 +633,7 @@ cdef class Doc: RETURNS (float): The L2 norm of the vector representation. - DOCS: https://spacy.io/api/doc#vector_norm + DOCS: https://nightly.spacy.io/api/doc#vector_norm """ def __get__(self): if "vector_norm" in self.user_hooks: @@ -513,7 +654,7 @@ cdef class Doc: def text(self): """A unicode representation of the document text. - RETURNS (unicode): The original verbatim text of the document. + RETURNS (str): The original verbatim text of the document. """ return "".join(t.text_with_ws for t in self) @@ -522,7 +663,7 @@ cdef class Doc: """An alias of `Doc.text`, provided for duck-type compatibility with `Span` and `Token`. - RETURNS (unicode): The original verbatim text of the document. + RETURNS (str): The original verbatim text of the document. """ return self.text @@ -532,7 +673,7 @@ cdef class Doc: RETURNS (tuple): Entities in the document, one `Span` per entity. - DOCS: https://spacy.io/api/doc#ents + DOCS: https://nightly.spacy.io/api/doc#ents """ def __get__(self): cdef int i @@ -545,9 +686,10 @@ cdef class Doc: token = &self.c[i] if token.ent_iob == 1: if start == -1: - seq = ["%s|%s" % (t.text, t.ent_iob_) for t in self[i-5:i+5]] + seq = [f"{t.text}|{t.ent_iob_}" for t in self[i-5:i+5]] raise ValueError(Errors.E093.format(seq=" ".join(seq))) - elif token.ent_iob == 2 or token.ent_iob == 0: + elif token.ent_iob == 2 or token.ent_iob == 0 or \ + (token.ent_iob == 3 and token.ent_type == 0): if start != -1: output.append(Span(self, start, i, label=label, kb_id=kb_id)) start = -1 @@ -561,53 +703,108 @@ cdef class Doc: kb_id = token.ent_kb_id if start != -1: output.append(Span(self, start, self.length, label=label, kb_id=kb_id)) + # remove empty-label spans + output = [o for o in output if o.label_ != ""] return tuple(output) def __set__(self, ents): # TODO: # 1. Test basic data-driven ORTH gazetteer # 2. 
Test more nuanced date and currency regex - tokens_in_ents = {} - cdef attr_t entity_type - cdef attr_t kb_id + cdef attr_t entity_type, kb_id cdef int ent_start, ent_end + ent_spans = [] for ent_info in ents: - entity_type, kb_id, ent_start, ent_end = get_entity_info(ent_info) - for token_index in range(ent_start, ent_end): - if token_index in tokens_in_ents.keys(): - raise ValueError(Errors.E103.format( - span1=(tokens_in_ents[token_index][0], - tokens_in_ents[token_index][1], - self.vocab.strings[tokens_in_ents[token_index][2]]), - span2=(ent_start, ent_end, self.vocab.strings[entity_type]))) - tokens_in_ents[token_index] = (ent_start, ent_end, entity_type, kb_id) - cdef int i + entity_type_, kb_id, ent_start, ent_end = get_entity_info(ent_info) + if isinstance(entity_type_, str): + self.vocab.strings.add(entity_type_) + span = Span(self, ent_start, ent_end, label=entity_type_, kb_id=kb_id) + ent_spans.append(span) + self.set_ents(ent_spans, default=SetEntsDefault.outside) + + def set_ents(self, entities, *, blocked=None, missing=None, outside=None, default=SetEntsDefault.outside): + """Set entity annotation. + + entities (List[Span]): Spans with labels to set as entities. + blocked (Optional[List[Span]]): Spans to set as 'blocked' (never an + entity) for spacy's built-in NER component. Other components may + ignore this setting. + missing (Optional[List[Span]]): Spans with missing/unknown entity + information. + outside (Optional[List[Span]]): Spans outside of entities (O in IOB). + default (str): How to set entity annotation for tokens outside of any + provided spans. Options: "blocked", "missing", "outside" and + "unmodified" (preserve current state). Defaults to "outside". + """ + if default not in SetEntsDefault.values(): + raise ValueError(Errors.E1011.format(default=default, modes=", ".join(SetEntsDefault))) + + # Ignore spans with missing labels + entities = [ent for ent in entities if ent.label > 0] + + if blocked is None: + blocked = tuple() + if missing is None: + missing = tuple() + if outside is None: + outside = tuple() + + # Find all tokens covered by spans and check that none are overlapping + cdef int i + seen_tokens = set() + for span in itertools.chain.from_iterable([entities, blocked, missing, outside]): + if not isinstance(span, Span): + raise ValueError(Errors.E1012.format(span=span)) + for i in range(span.start, span.end): + if i in seen_tokens: + raise ValueError(Errors.E1010.format(i=i)) + seen_tokens.add(i) + + # Set all specified entity information + for span in entities: + for i in range(span.start, span.end): + if i == span.start: + self.c[i].ent_iob = 3 + else: + self.c[i].ent_iob = 1 + self.c[i].ent_type = span.label + self.c[i].ent_kb_id = span.kb_id + for span in blocked: + for i in range(span.start, span.end): + self.c[i].ent_iob = 3 + self.c[i].ent_type = 0 + for span in missing: + for i in range(span.start, span.end): + self.c[i].ent_iob = 0 + self.c[i].ent_type = 0 + for span in outside: + for i in range(span.start, span.end): + self.c[i].ent_iob = 2 + self.c[i].ent_type = 0 + + # Set tokens outside of all provided spans + if default != SetEntsDefault.unmodified: for i in range(self.length): - # default values - entity_type = 0 - kb_id = 0 + if i not in seen_tokens: + self.c[i].ent_type = 0 + if default == SetEntsDefault.outside: + self.c[i].ent_iob = 2 + elif default == SetEntsDefault.missing: + self.c[i].ent_iob = 0 + elif default == SetEntsDefault.blocked: + self.c[i].ent_iob = 3 - # Set ent_iob to Missing (0) bij default unless this token was 
nered before - ent_iob = 0 - if self.c[i].ent_iob != 0: - ent_iob = 2 - - # overwrite if the token was part of a specified entity - if i in tokens_in_ents.keys(): - ent_start, ent_end, entity_type, kb_id = tokens_in_ents[i] - if entity_type is None or entity_type <= 0: - # Blocking this token from being overwritten by downstream NER - ent_iob = 3 - elif ent_start == i: - # Marking the start of an entity - ent_iob = 3 - else: - # Marking the inside of an entity - ent_iob = 1 - - self.c[i].ent_type = entity_type - self.c[i].ent_kb_id = kb_id - self.c[i].ent_iob = ent_iob + # Fix any resulting inconsistent annotation + for i in range(self.length - 1): + # I must follow B or I: convert I to B + if (self.c[i].ent_iob == 0 or self.c[i].ent_iob == 2) and \ + self.c[i+1].ent_iob == 1: + self.c[i+1].ent_iob = 3 + # Change of type with BI or II: convert second I to B + if self.c[i].ent_type != self.c[i+1].ent_type and \ + (self.c[i].ent_iob == 3 or self.c[i].ent_iob == 1) and \ + self.c[i+1].ent_iob == 1: + self.c[i+1].ent_iob = 3 @property def noun_chunks(self): @@ -620,9 +817,9 @@ cdef class Doc: YIELDS (Span): Noun chunks in the document. - DOCS: https://spacy.io/api/doc#noun_chunks + DOCS: https://nightly.spacy.io/api/doc#noun_chunks """ - + # Accumulate the result before beginning to iterate over it. This # prevents the tokenisation from being changed out from under us # during the iteration. The tricky thing here is that Span accepts @@ -638,16 +835,13 @@ cdef class Doc: @property def sents(self): """Iterate over the sentences in the document. Yields sentence `Span` - objects. Sentence spans have no label. To improve accuracy on informal - texts, spaCy calculates sentence boundaries from the syntactic - dependency parse. If the parser is disabled, the `sents` iterator will - be unavailable. + objects. Sentence spans have no label. YIELDS (Span): Sentences in the document. - DOCS: https://spacy.io/api/doc#sents + DOCS: https://nightly.spacy.io/api/doc#sents """ - if not self.is_sentenced: + if not self.has_annotation("SENT_START"): raise ValueError(Errors.E030) if "sents" in self.user_hooks: yield from self.user_hooks["sents"](self) @@ -667,14 +861,10 @@ cdef class Doc: @property def lang_(self): - """RETURNS (unicode): Language of the doc's vocabulary, e.g. 'en'.""" + """RETURNS (str): Language of the doc's vocabulary, e.g. 'en'.""" return self.vocab.lang cdef int push_back(self, LexemeOrToken lex_or_tok, bint has_space) except -1: - if self.length == 0: - # Flip these to false when we see the first token. - self.is_tagged = False - self.is_parsed = False if self.length == self.max_length: self._realloc(self.length * 2) cdef TokenC* t = &self.c[self.length] @@ -722,15 +912,19 @@ cdef class Doc: cdef np.ndarray[attr_t, ndim=2] output # Handle scalar/list inputs of strings/ints for py_attr_ids # See also #3064 - if isinstance(py_attr_ids, basestring_): + if isinstance(py_attr_ids, str): # Handle inputs like doc.to_array('ORTH') py_attr_ids = [py_attr_ids] elif not hasattr(py_attr_ids, "__iter__"): # Handle inputs like doc.to_array(ORTH) py_attr_ids = [py_attr_ids] # Allow strings, e.g. 
'lemma' or 'LEMMA' - py_attr_ids = [(IDS[id_.upper()] if hasattr(id_, "upper") else id_) + try: + py_attr_ids = [(IDS[id_.upper()] if hasattr(id_, "upper") else id_) for id_ in py_attr_ids] + except KeyError as msg: + keys = [k for k in IDS.keys() if not k.startswith("FLAG")] + raise KeyError(Errors.E983.format(dict="IDS", key=msg, keys=keys)) from None # Make an array from the attributes --- otherwise our inner loop is # Python dict iteration. cdef np.ndarray attr_ids = numpy.asarray(py_attr_ids, dtype="i") @@ -754,7 +948,7 @@ cdef class Doc: attr_id (int): The attribute ID to key the counts. RETURNS (dict): A dictionary mapping attributes to integer counts. - DOCS: https://spacy.io/api/doc#count_by + DOCS: https://nightly.spacy.io/api/doc#count_by """ cdef int i cdef attr_t attr @@ -777,6 +971,8 @@ cdef class Doc: return dict(counts) def _realloc(self, new_size): + if new_size < self.max_length: + return self.max_length = new_size n = new_size + (PADDING * 2) # What we're storing is a "padded" array. We've jumped forward PADDING @@ -791,14 +987,6 @@ cdef class Doc: for i in range(self.length, self.max_length + PADDING): self.c[i].lex = &EMPTY_LEXEME - cdef void set_parse(self, const TokenC* parsed) nogil: - # TODO: This method is fairly misleading atm. It's used by Parser - # to actually apply the parse calculated. Need to rethink this. - # Probably we should use from_array? - self.is_parsed = True - for i in range(self.length): - self.c[i] = parsed[i] - def from_array(self, attrs, array): """Load attributes from a numpy array. Write to a `Doc` object, from an `(M, N)` array of attributes. @@ -807,11 +995,11 @@ cdef class Doc: array (numpy.ndarray[ndim=2, dtype='int32']): The attribute values. RETURNS (Doc): Itself. - DOCS: https://spacy.io/api/doc#from_array + DOCS: https://nightly.spacy.io/api/doc#from_array """ # Handle scalar/list inputs of strings/ints for py_attr_ids # See also #3064 - if isinstance(attrs, basestring_): + if isinstance(attrs, str): # Handle inputs like doc.to_array('ORTH') attrs = [attrs] elif not hasattr(attrs, "__iter__"): @@ -823,12 +1011,14 @@ cdef class Doc: if array.dtype != numpy.uint64: warnings.warn(Warnings.W028.format(type=array.dtype)) - if SENT_START in attrs and HEAD in attrs: - raise ValueError(Errors.E032) - cdef int i, col, abs_head_index + cdef int i, col + cdef int32_t abs_head_index cdef attr_id_t attr_id cdef TokenC* tokens = self.c cdef int length = len(array) + if length != len(self): + raise ValueError(Errors.E971.format(array_length=length, doc_length=len(self))) + # Get set up for fast loading cdef Pool mem = Pool() cdef int n_attrs = len(attrs) @@ -839,34 +1029,128 @@ cdef class Doc: attr_ids[i] = attr_id if len(array.shape) == 1: array = array.reshape((array.size, 1)) + cdef np.ndarray transposed_array = numpy.ascontiguousarray(array.T) + values = transposed_array.data + stride = transposed_array.shape[1] # Check that all heads are within the document bounds if HEAD in attrs: col = attrs.index(HEAD) for i in range(length): # cast index to signed int - abs_head_index = numpy.int32(array[i, col]) + i + abs_head_index = values[col * stride + i] + abs_head_index += i if abs_head_index < 0 or abs_head_index >= length: - raise ValueError(Errors.E190.format(index=i, value=array[i, col], rel_head_index=numpy.int32(array[i, col]))) - # Do TAG first. 
This lets subsequent loop override stuff like POS, LEMMA - if TAG in attrs: - col = attrs.index(TAG) + raise ValueError( + Errors.E190.format( + index=i, + value=array[i, col], + rel_head_index=abs_head_index-i + ) + ) + # Verify ENT_IOB are proper integers + if ENT_IOB in attrs: + iob_strings = Token.iob_strings() + col = attrs.index(ENT_IOB) + n_iob_strings = len(iob_strings) for i in range(length): - if array[i, col] != 0: - self.vocab.morphology.assign_tag(&tokens[i], array[i, col]) + value = values[col * stride + i] + if value < 0 or value >= n_iob_strings: + raise ValueError( + Errors.E982.format( + values=iob_strings, + value=value + ) + ) # Now load the data for i in range(length): token = &self.c[i] for j in range(n_attrs): - if attr_ids[j] != TAG: - Token.set_struct_attr(token, attr_ids[j], array[i, j]) - # Set flags - self.is_parsed = bool(self.is_parsed or HEAD in attrs) - self.is_tagged = bool(self.is_tagged or TAG in attrs or POS in attrs) - # If document is parsed, set children - if self.is_parsed: - set_children_from_heads(self.c, length) + value = values[j * stride + i] + if attr_ids[j] == MORPH: + # add morph to morphology table + self.vocab.morphology.add(self.vocab.strings[value]) + Token.set_struct_attr(token, attr_ids[j], value) + # If document is parsed, set children and sentence boundaries + if HEAD in attrs and DEP in attrs: + col = attrs.index(DEP) + if array[:, col].any(): + set_children_from_heads(self.c, 0, length) return self + @staticmethod + def from_docs(docs, ensure_whitespace=True, attrs=None): + """Concatenate multiple Doc objects to form a new one. Raises an error + if the `Doc` objects do not all share the same `Vocab`. + + docs (list): A list of Doc objects. + ensure_whitespace (bool): Insert a space between two adjacent docs whenever the first doc does not end in whitespace. + attrs (list): Optional list of attribute ID ints or attribute name strings. + RETURNS (Doc): A doc that contains the concatenated docs, or None if no docs were given. 
+ + DOCS: https://nightly.spacy.io/api/doc#from_docs + """ + if not docs: + return None + + vocab = {doc.vocab for doc in docs} + if len(vocab) > 1: + raise ValueError(Errors.E999) + (vocab,) = vocab + + if attrs is None: + attrs = Doc._get_array_attrs() + else: + if any(isinstance(attr, str) for attr in attrs): # resolve attribute names + attrs = [intify_attr(attr) for attr in attrs] # intify_attr returns None for invalid attrs + attrs = list(attr for attr in set(attrs) if attr) # filter duplicates, remove None if present + if SPACY not in attrs: + attrs.append(SPACY) + + concat_words = [] + concat_spaces = [] + concat_user_data = {} + char_offset = 0 + for doc in docs: + concat_words.extend(t.text for t in doc) + concat_spaces.extend(bool(t.whitespace_) for t in doc) + + for key, value in doc.user_data.items(): + if isinstance(key, tuple) and len(key) == 4: + data_type, name, start, end = key + if start is not None or end is not None: + start += char_offset + if end is not None: + end += char_offset + concat_user_data[(data_type, name, start, end)] = copy.copy(value) + else: + warnings.warn(Warnings.W101.format(name=name)) + else: + warnings.warn(Warnings.W102.format(key=key, value=value)) + char_offset += len(doc.text) + if ensure_whitespace and not (len(doc) > 0 and doc[-1].is_space): + char_offset += 1 + + arrays = [doc.to_array(attrs) for doc in docs] + + if ensure_whitespace: + spacy_index = attrs.index(SPACY) + for i, array in enumerate(arrays[:-1]): + if len(array) > 0 and not docs[i][-1].is_space: + array[-1][spacy_index] = 1 + token_offset = -1 + for doc in docs[:-1]: + token_offset += len(doc) + if not (len(doc) > 0 and doc[-1].is_space): + concat_spaces[token_offset] = True + + concat_array = numpy.concatenate(arrays) + + concat_doc = Doc(vocab, words=concat_words, spaces=concat_spaces, user_data=concat_user_data) + + concat_doc.from_array(attrs, concat_array) + + return concat_doc + def get_lca_matrix(self): """Calculates a matrix of Lowest Common Ancestors (LCA) for a given `Doc`, where LCA[i, j] is the index of the lowest common ancestor among @@ -875,64 +1159,100 @@ cdef class Doc: RETURNS (np.array[ndim=2, dtype=numpy.int32]): LCA matrix with shape (n, n), where n = len(self). - DOCS: https://spacy.io/api/doc#get_lca_matrix + DOCS: https://nightly.spacy.io/api/doc#get_lca_matrix """ return numpy.asarray(_get_lca_matrix(self, 0, len(self))) - def to_disk(self, path, **kwargs): + def copy(self): + cdef Doc other = Doc(self.vocab) + other._vector = copy.deepcopy(self._vector) + other._vector_norm = copy.deepcopy(self._vector_norm) + other.tensor = copy.deepcopy(self.tensor) + other.cats = copy.deepcopy(self.cats) + other.user_data = copy.deepcopy(self.user_data) + other.sentiment = self.sentiment + other.has_unknown_spaces = self.has_unknown_spaces + other.user_hooks = dict(self.user_hooks) + other.user_token_hooks = dict(self.user_token_hooks) + other.user_span_hooks = dict(self.user_span_hooks) + other.length = self.length + other.max_length = self.max_length + buff_size = other.max_length + (PADDING*2) + tokens = other.mem.alloc(buff_size, sizeof(TokenC)) + memcpy(tokens, self.c - PADDING, buff_size * sizeof(TokenC)) + other.c = &tokens[PADDING] + return other + + def to_disk(self, path, *, exclude=tuple()): """Save the current state to a directory. - path (unicode or Path): A path to a directory, which will be created if + path (str / Path): A path to a directory, which will be created if it doesn't exist. Paths may be either strings or Path-like objects. 
- exclude (list): String names of serialization fields to exclude. + exclude (Iterable[str]): String names of serialization fields to exclude. - DOCS: https://spacy.io/api/doc#to_disk + DOCS: https://nightly.spacy.io/api/doc#to_disk """ path = util.ensure_path(path) with path.open("wb") as file_: - file_.write(self.to_bytes(**kwargs)) + file_.write(self.to_bytes(exclude=exclude)) - def from_disk(self, path, **kwargs): + def from_disk(self, path, *, exclude=tuple()): """Loads state from a directory. Modifies the object in place and returns it. - path (unicode or Path): A path to a directory. Paths may be either + path (str / Path): A path to a directory. Paths may be either strings or `Path`-like objects. exclude (list): String names of serialization fields to exclude. RETURNS (Doc): The modified `Doc` object. - DOCS: https://spacy.io/api/doc#from_disk + DOCS: https://nightly.spacy.io/api/doc#from_disk """ path = util.ensure_path(path) with path.open("rb") as file_: bytes_data = file_.read() - return self.from_bytes(bytes_data, **kwargs) + return self.from_bytes(bytes_data, exclude=exclude) - def to_bytes(self, exclude=tuple(), **kwargs): + def to_bytes(self, *, exclude=tuple()): """Serialize, i.e. export the document contents to a binary string. exclude (list): String names of serialization fields to exclude. RETURNS (bytes): A losslessly serialized copy of the `Doc`, including all annotations. - DOCS: https://spacy.io/api/doc#to_bytes + DOCS: https://nightly.spacy.io/api/doc#to_bytes """ - array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_ID, NORM, ENT_KB_ID] - if self.is_tagged: - array_head.extend([TAG, POS]) - # If doc parsed add head and dep attribute - if self.is_parsed: - array_head.extend([HEAD, DEP]) - # Otherwise add sent_start - else: - array_head.append(SENT_START) + return srsly.msgpack_dumps(self.to_dict(exclude=exclude)) + + def from_bytes(self, bytes_data, *, exclude=tuple()): + """Deserialize, i.e. import the document contents from a binary string. + + bytes_data (bytes): The string to load from. + exclude (list): String names of serialization fields to exclude. + RETURNS (Doc): Itself. + + DOCS: https://nightly.spacy.io/api/doc#from_bytes + """ + return self.from_dict(srsly.msgpack_loads(bytes_data), exclude=exclude) + + def to_dict(self, *, exclude=tuple()): + """Export the document contents to a dictionary for serialization. + + exclude (list): String names of serialization fields to exclude. + RETURNS (dict): A dictionary representation of the `Doc`, including + all annotations. + + DOCS: https://nightly.spacy.io/api/doc#to_bytes + """ + array_head = Doc._get_array_attrs() strings = set() for token in self: strings.add(token.tag_) strings.add(token.lemma_) + strings.add(str(token.morph)) strings.add(token.dep_) strings.add(token.ent_type_) strings.add(token.ent_kb_id_) + strings.add(token.ent_id_) strings.add(token.norm_) # Msgpack doesn't distinguish between lists and tuples, which is # vexing for user data.
As a best guess, we *know* that within @@ -946,26 +1266,24 @@ cdef class Doc: "tensor": lambda: self.tensor, "cats": lambda: self.cats, "strings": lambda: list(strings), + "has_unknown_spaces": lambda: self.has_unknown_spaces } - for key in kwargs: - if key in serializers or key in ("user_data", "user_data_keys", "user_data_values"): - raise ValueError(Errors.E128.format(arg=key)) if "user_data" not in exclude and self.user_data: user_data_keys, user_data_values = list(zip(*self.user_data.items())) if "user_data_keys" not in exclude: serializers["user_data_keys"] = lambda: srsly.msgpack_dumps(user_data_keys) if "user_data_values" not in exclude: serializers["user_data_values"] = lambda: srsly.msgpack_dumps(user_data_values) - return util.to_bytes(serializers, exclude) + return util.to_dict(serializers, exclude) - def from_bytes(self, bytes_data, exclude=tuple(), **kwargs): + def from_dict(self, msg, *, exclude=tuple()): """Deserialize, i.e. import the document contents from a binary string. data (bytes): The string to load from. exclude (list): String names of serialization fields to exclude. RETURNS (Doc): Itself. - DOCS: https://spacy.io/api/doc#from_bytes + DOCS: https://nightly.spacy.io/api/doc#from_dict """ if self.length != 0: raise ValueError(Errors.E033.format(length=self.length)) @@ -979,11 +1297,8 @@ cdef class Doc: "strings": lambda b: None, "user_data_keys": lambda b: None, "user_data_values": lambda b: None, + "has_unknown_spaces": lambda b: None } - for key in kwargs: - if key in deserializers or key in ("user_data",): - raise ValueError(Errors.E128.format(arg=key)) - msg = util.from_bytes(bytes_data, deserializers, exclude) # Msgpack doesn't distinguish between lists and tuples, which is # vexing for user data. As a best guess, we *know* that within # keys, we must have tuples. In values we just have to hope @@ -1003,6 +1318,8 @@ cdef class Doc: if "strings" not in exclude and "strings" in msg: for s in msg["strings"]: self.vocab.strings.add(s) + if "has_unknown_spaces" not in exclude and "has_unknown_spaces" in msg: + self.has_unknown_spaces = msg["has_unknown_spaces"] start = 0 cdef const LexemeC* lex cdef unicode orth_ @@ -1018,6 +1335,7 @@ cdef class Doc: self.from_array(msg["array_head"][2:], attrs[:, 2:]) return self + def extend_tensor(self, tensor): """Concatenate a new tensor onto the doc.tensor object. @@ -1045,8 +1363,8 @@ cdef class Doc: retokenization are invalidated, although they may accidentally continue to work. - DOCS: https://spacy.io/api/doc#retokenize - USAGE: https://spacy.io/usage/linguistic-features#retokenization + DOCS: https://nightly.spacy.io/api/doc#retokenize + USAGE: https://nightly.spacy.io/usage/linguistic-features#retokenization """ return Retokenizer(self) @@ -1073,78 +1391,38 @@ cdef class Doc: remove_label_if_necessary(attributes[i]) retokenizer.merge(span, attributes[i]) - def merge(self, int start_idx, int end_idx, *args, **attributes): - """Retokenize the document, such that the span at - `doc.text[start_idx : end_idx]` is merged into a single token. If - `start_idx` and `end_idx `do not mark start and end token boundaries, - the document remains unchanged. - - start_idx (int): Character index of the start of the slice to merge. - end_idx (int): Character index after the end of the slice to merge. - **attributes: Attributes to assign to the merged token. By default, - attributes are inherited from the syntactic root of the span. 
- RETURNS (Token): The newly merged token, or `None` if the start and end - indices did not fall at token boundaries. - """ - cdef unicode tag, lemma, ent_type - warnings.warn(Warnings.W013.format(obj="Doc"), DeprecationWarning) - # TODO: ENT_KB_ID ? - if len(args) == 3: - warnings.warn(Warnings.W003, DeprecationWarning) - tag, lemma, ent_type = args - attributes[TAG] = tag - attributes[LEMMA] = lemma - attributes[ENT_TYPE] = ent_type - elif not args: - fix_attributes(self, attributes) - elif args: - raise ValueError(Errors.E034.format(n_args=len(args), args=repr(args), - kwargs=repr(attributes))) - remove_label_if_necessary(attributes) - attributes = intify_attrs(attributes, strings_map=self.vocab.strings) - cdef int start = token_by_start(self.c, self.length, start_idx) - if start == -1: - return None - cdef int end = token_by_end(self.c, self.length, end_idx) - if end == -1: - return None - # Currently we have the token index, we want the range-end index - end += 1 - with self.retokenize() as retokenizer: - retokenizer.merge(self[start:end], attrs=attributes) - return self[start] - - def print_tree(self, light=False, flat=False): - raise ValueError(Errors.E105) - def to_json(self, underscore=None): - """Convert a Doc to JSON. The format it produces will be the new format - for the `spacy train` command (not implemented yet). + """Convert a Doc to JSON. underscore (list): Optional list of string names of custom doc._. attributes. Attribute values need to be JSON-serializable. Values will be added to an "_" key in the data, e.g. "_": {"foo": "bar"}. RETURNS (dict): The data in spaCy's JSON format. - - DOCS: https://spacy.io/api/doc#to_json """ data = {"text": self.text} - if self.is_nered: + if self.has_annotation("ENT_IOB"): data["ents"] = [{"start": ent.start_char, "end": ent.end_char, "label": ent.label_} for ent in self.ents] - if self.is_sentenced: + if self.has_annotation("SENT_START"): sents = list(self.sents) data["sents"] = [{"start": sent.start_char, "end": sent.end_char} for sent in sents] if self.cats: data["cats"] = self.cats data["tokens"] = [] + attrs = ["TAG", "MORPH", "POS", "LEMMA", "DEP"] + include_annotation = {attr: self.has_annotation(attr) for attr in attrs} for token in self: token_data = {"id": token.i, "start": token.idx, "end": token.idx + len(token)} - if self.is_tagged: - token_data["pos"] = token.pos_ + if include_annotation["TAG"]: token_data["tag"] = token.tag_ - if self.is_parsed: + if include_annotation["POS"]: + token_data["pos"] = token.pos_ + if include_annotation["MORPH"]: + token_data["morph"] = token.morph.to_json() + if include_annotation["LEMMA"]: + token_data["lemma"] = token.lemma_ + if include_annotation["DEP"]: token_data["dep"] = token.dep_ token_data["head"] = token.head.i data["tokens"].append(token_data) @@ -1190,6 +1468,12 @@ cdef class Doc: j += 1 return output + @staticmethod + def _get_array_attrs(): + attrs = [LENGTH, SPACY] + attrs.extend(intify_attr(x) for x in DOCBIN_ALL_ATTRS) + return tuple(attrs) + cdef int token_by_start(const TokenC* tokens, int length, int start_char) except -2: cdef int i = token_by_char(tokens, length, start_char) @@ -1220,13 +1504,13 @@ cdef int token_by_char(const TokenC* tokens, int length, int char_idx) except -2 return mid return -1 - -cdef int set_children_from_heads(TokenC* tokens, int length) except -1: +cdef int set_children_from_heads(TokenC* tokens, int start, int end) except -1: + # note: end is exclusive cdef TokenC* head cdef TokenC* child cdef int i # Set number of left/right children to 0. 
We'll increment it in the loops. - for i in range(length): + for i in range(start, end): tokens[i].l_kids = 0 tokens[i].r_kids = 0 tokens[i].l_edge = i @@ -1240,38 +1524,40 @@ cdef int set_children_from_heads(TokenC* tokens, int length) except -1: # without risking getting stuck in an infinite loop if something is # terribly malformed. while not heads_within_sents: - heads_within_sents = _set_lr_kids_and_edges(tokens, length, loop_count) + heads_within_sents = _set_lr_kids_and_edges(tokens, start, end, loop_count) if loop_count > 10: - warnings.warn(Warnings.W026) + util.logger.debug(Warnings.W026) break loop_count += 1 # Set sentence starts - for i in range(length): - if tokens[i].head == 0 and tokens[i].dep != 0: - tokens[tokens[i].l_edge].sent_start = True + for i in range(start, end): + tokens[i].sent_start = -1 + for i in range(start, end): + if tokens[i].head == 0: + tokens[tokens[i].l_edge].sent_start = 1 -cdef int _set_lr_kids_and_edges(TokenC* tokens, int length, int loop_count) except -1: +cdef int _set_lr_kids_and_edges(TokenC* tokens, int start, int end, int loop_count) except -1: # May be called multiple times due to non-projectivity. See issues #3170 # and #4688. # Set left edges cdef TokenC* head cdef TokenC* child cdef int i, j - for i in range(length): + for i in range(start, end): child = &tokens[i] head = &tokens[i + child.head] - if child < head and loop_count == 0: + if loop_count == 0 and child < head: head.l_kids += 1 if child.l_edge < head.l_edge: head.l_edge = child.l_edge if child.r_edge > head.r_edge: head.r_edge = child.r_edge # Set right edges - same as above, but iterate in reverse - for i in range(length-1, -1, -1): + for i in range(end-1, start-1, -1): child = &tokens[i] head = &tokens[i + child.head] - if child > head and loop_count == 0: + if loop_count == 0 and child > head: head.r_kids += 1 if child.r_edge > head.r_edge: head.r_edge = child.r_edge @@ -1279,14 +1565,14 @@ cdef int _set_lr_kids_and_edges(TokenC* tokens, int length, int loop_count) exce head.l_edge = child.l_edge # Get sentence start positions according to current state sent_starts = set() - for i in range(length): - if tokens[i].head == 0 and tokens[i].dep != 0: + for i in range(start, end): + if tokens[i].head == 0: sent_starts.add(tokens[i].l_edge) cdef int curr_sent_start = 0 cdef int curr_sent_end = 0 # Check whether any heads are not within the current sentence - for i in range(length): - if (i > 0 and i in sent_starts) or i == length - 1: + for i in range(start, end): + if (i > 0 and i in sent_starts) or i == end - 1: curr_sent_end = i for j in range(curr_sent_start, curr_sent_end): if tokens[j].head + j < curr_sent_start or tokens[j].head + j >= curr_sent_end + 1: @@ -1335,6 +1621,7 @@ cdef int [:,:] _get_lca_matrix(Doc doc, int start, int end): with shape (n, n), where n = len(doc). 
""" cdef int [:,:] lca_matrix + cdef int j, k n_tokens= end - start lca_mat = numpy.empty((n_tokens, n_tokens), dtype=numpy.int32) lca_mat.fill(-1) diff --git a/spacy/tokens/morphanalysis.pxd b/spacy/tokens/morphanalysis.pxd index 22844454a..9510875c9 100644 --- a/spacy/tokens/morphanalysis.pxd +++ b/spacy/tokens/morphanalysis.pxd @@ -5,5 +5,5 @@ from ..structs cimport MorphAnalysisC cdef class MorphAnalysis: cdef readonly Vocab vocab - cdef hash_t key + cdef readonly hash_t key cdef MorphAnalysisC c diff --git a/spacy/tokens/morphanalysis.pyx b/spacy/tokens/morphanalysis.pyx index 12f2f6cc3..a7d1f2e44 100644 --- a/spacy/tokens/morphanalysis.pyx +++ b/spacy/tokens/morphanalysis.pyx @@ -1,15 +1,16 @@ from libc.string cimport memset +cimport numpy as np +from ..errors import Errors +from ..morphology import Morphology from ..vocab cimport Vocab from ..typedefs cimport hash_t, attr_t -from ..morphology cimport list_features, check_feature, get_field, tag_to_json - -from ..strings import get_string_id +from ..morphology cimport list_features, check_feature, get_by_field cdef class MorphAnalysis: """Control access to morphological features for a token.""" - def __init__(self, Vocab vocab, features=tuple()): + def __init__(self, Vocab vocab, features=dict()): self.vocab = vocab self.key = self.vocab.morphology.add(features) analysis = self.vocab.morphology.tags.get(self.key) @@ -33,7 +34,7 @@ cdef class MorphAnalysis: def __contains__(self, feature): """Test whether the morphological analysis contains some feature.""" - cdef attr_t feat_id = get_string_id(feature) + cdef attr_t feat_id = self.vocab.strings.as_int(feature) return check_feature(&self.c, feat_id) def __iter__(self): @@ -49,369 +50,38 @@ cdef class MorphAnalysis: def __hash__(self): return self.key - def get(self, unicode field): - """Retrieve a feature by field.""" - cdef int field_id = self.vocab.morphology._feat_map.attr2field[field] - return self.vocab.strings[get_field(&self.c, field_id)] + def __eq__(self, other): + if isinstance(other, str): + raise ValueError(Errors.E977) + return self.key == other.key + + def __ne__(self, other): + return self.key != other.key + + def get(self, field): + """Retrieve feature values by field.""" + cdef attr_t field_id = self.vocab.strings.as_int(field) + cdef np.ndarray results = get_by_field(&self.c, field_id) + features = [self.vocab.strings[result] for result in results] + return [f.split(Morphology.FIELD_SEP)[1] for f in features] def to_json(self): - """Produce a json serializable representation, which will be a list of - strings. + """Produce a json serializable representation as a UD FEATS-style + string. """ - return tag_to_json(&self.c) + morph_string = self.vocab.strings[self.c.key] + if morph_string == self.vocab.morphology.EMPTY_MORPH: + return "" + return morph_string - @property - def is_base_form(self): - raise NotImplementedError + def to_dict(self): + """Produce a dict representation. 
+ """ + return self.vocab.morphology.feats_to_dict(self.to_json()) - @property - def pos(self): - return self.c.pos + def __str__(self): + return self.to_json() - @property - def pos_(self): - return self.vocab.strings[self.c.pos] + def __repr__(self): + return self.to_json() - property id: - def __get__(self): - return self.key - - property abbr: - def __get__(self): - return self.c.abbr - - property adp_type: - def __get__(self): - return self.c.adp_type - - property adv_type: - def __get__(self): - return self.c.adv_type - - property animacy: - def __get__(self): - return self.c.animacy - - property aspect: - def __get__(self): - return self.c.aspect - - property case: - def __get__(self): - return self.c.case - - property conj_type: - def __get__(self): - return self.c.conj_type - - property connegative: - def __get__(self): - return self.c.connegative - - property definite: - def __get__(self): - return self.c.definite - - property degree: - def __get__(self): - return self.c.degree - - property derivation: - def __get__(self): - return self.c.derivation - - property echo: - def __get__(self): - return self.c.echo - - property foreign: - def __get__(self): - return self.c.foreign - - property gender: - def __get__(self): - return self.c.gender - - property hyph: - def __get__(self): - return self.c.hyph - - property inf_form: - def __get__(self): - return self.c.inf_form - - property mood: - def __get__(self): - return self.c.mood - - property name_type: - def __get__(self): - return self.c.name_type - - property negative: - def __get__(self): - return self.c.negative - - property noun_type: - def __get__(self): - return self.c.noun_type - - property number: - def __get__(self): - return self.c.number - - property num_form: - def __get__(self): - return self.c.num_form - - property num_type: - def __get__(self): - return self.c.num_type - - property num_value: - def __get__(self): - return self.c.num_value - - property part_form: - def __get__(self): - return self.c.part_form - - property part_type: - def __get__(self): - return self.c.part_type - - property person: - def __get__(self): - return self.c.person - - property polite: - def __get__(self): - return self.c.polite - - property polarity: - def __get__(self): - return self.c.polarity - - property poss: - def __get__(self): - return self.c.poss - - property prefix: - def __get__(self): - return self.c.prefix - - property prep_case: - def __get__(self): - return self.c.prep_case - - property pron_type: - def __get__(self): - return self.c.pron_type - - property punct_side: - def __get__(self): - return self.c.punct_side - - property punct_type: - def __get__(self): - return self.c.punct_type - - property reflex: - def __get__(self): - return self.c.reflex - - property style: - def __get__(self): - return self.c.style - - property style_variant: - def __get__(self): - return self.c.style_variant - - property tense: - def __get__(self): - return self.c.tense - - property typo: - def __get__(self): - return self.c.typo - - property verb_form: - def __get__(self): - return self.c.verb_form - - property voice: - def __get__(self): - return self.c.voice - - property verb_type: - def __get__(self): - return self.c.verb_type - - property abbr_: - def __get__(self): - return self.vocab.strings[self.c.abbr] - - property adp_type_: - def __get__(self): - return self.vocab.strings[self.c.adp_type] - - property adv_type_: - def __get__(self): - return self.vocab.strings[self.c.adv_type] - - property animacy_: - def __get__(self): - return 
self.vocab.strings[self.c.animacy] - - property aspect_: - def __get__(self): - return self.vocab.strings[self.c.aspect] - - property case_: - def __get__(self): - return self.vocab.strings[self.c.case] - - property conj_type_: - def __get__(self): - return self.vocab.strings[self.c.conj_type] - - property connegative_: - def __get__(self): - return self.vocab.strings[self.c.connegative] - - property definite_: - def __get__(self): - return self.vocab.strings[self.c.definite] - - property degree_: - def __get__(self): - return self.vocab.strings[self.c.degree] - - property derivation_: - def __get__(self): - return self.vocab.strings[self.c.derivation] - - property echo_: - def __get__(self): - return self.vocab.strings[self.c.echo] - - property foreign_: - def __get__(self): - return self.vocab.strings[self.c.foreign] - - property gender_: - def __get__(self): - return self.vocab.strings[self.c.gender] - - property hyph_: - def __get__(self): - return self.vocab.strings[self.c.hyph] - - property inf_form_: - def __get__(self): - return self.vocab.strings[self.c.inf_form] - - property name_type_: - def __get__(self): - return self.vocab.strings[self.c.name_type] - - property negative_: - def __get__(self): - return self.vocab.strings[self.c.negative] - - property mood_: - def __get__(self): - return self.vocab.strings[self.c.mood] - - property number_: - def __get__(self): - return self.vocab.strings[self.c.number] - - property num_form_: - def __get__(self): - return self.vocab.strings[self.c.num_form] - - property num_type_: - def __get__(self): - return self.vocab.strings[self.c.num_type] - - property num_value_: - def __get__(self): - return self.vocab.strings[self.c.num_value] - - property part_form_: - def __get__(self): - return self.vocab.strings[self.c.part_form] - - property part_type_: - def __get__(self): - return self.vocab.strings[self.c.part_type] - - property person_: - def __get__(self): - return self.vocab.strings[self.c.person] - - property polite_: - def __get__(self): - return self.vocab.strings[self.c.polite] - - property polarity_: - def __get__(self): - return self.vocab.strings[self.c.polarity] - - property poss_: - def __get__(self): - return self.vocab.strings[self.c.poss] - - property prefix_: - def __get__(self): - return self.vocab.strings[self.c.prefix] - - property prep_case_: - def __get__(self): - return self.vocab.strings[self.c.prep_case] - - property pron_type_: - def __get__(self): - return self.vocab.strings[self.c.pron_type] - - property punct_side_: - def __get__(self): - return self.vocab.strings[self.c.punct_side] - - property punct_type_: - def __get__(self): - return self.vocab.strings[self.c.punct_type] - - property reflex_: - def __get__(self): - return self.vocab.strings[self.c.reflex] - - property style_: - def __get__(self): - return self.vocab.strings[self.c.style] - - property style_variant_: - def __get__(self): - return self.vocab.strings[self.c.style_variant] - - property tense_: - def __get__(self): - return self.vocab.strings[self.c.tense] - - property typo_: - def __get__(self): - return self.vocab.strings[self.c.typo] - - property verb_form_: - def __get__(self): - return self.vocab.strings[self.c.verb_form] - - property voice_: - def __get__(self): - return self.vocab.strings[self.c.voice] - - property verb_type_: - def __get__(self): - return self.vocab.strings[self.c.verb_type] diff --git a/spacy/tokens/span.pxd b/spacy/tokens/span.pxd index f6f88a23e..cc6b908bb 100644 --- a/spacy/tokens/span.pxd +++ b/spacy/tokens/span.pxd @@ 
-16,5 +16,4 @@ cdef class Span: cdef public _vector cdef public _vector_norm - cpdef int _recalculate_indices(self) except -1 cpdef np.ndarray to_array(self, object features) diff --git a/spacy/tokens/span.pyx b/spacy/tokens/span.pyx index cf0775bae..491ba0266 100644 --- a/spacy/tokens/span.pyx +++ b/spacy/tokens/span.pyx @@ -1,17 +1,13 @@ -# coding: utf8 from __future__ import unicode_literals cimport numpy as np from libc.math cimport sqrt import numpy -import numpy.linalg +from thinc.api import get_array_module import warnings -from thinc.neural.util import get_array_module -from collections import defaultdict from .doc cimport token_by_start, token_by_end, get_token_attr, _get_lca_matrix -from .token cimport TokenC from ..structs cimport TokenC, LexemeC from ..typedefs cimport flags_t, attr_t, hash_t from ..attrs cimport attr_id_t @@ -21,29 +17,28 @@ from ..lexeme cimport Lexeme from ..symbols cimport dep from ..util import normalize_slice -from ..compat import is_config, basestring_ -from ..errors import Errors, TempErrors, Warnings +from ..errors import Errors, Warnings from .underscore import Underscore, get_ext_args cdef class Span: """A slice from a Doc object. - DOCS: https://spacy.io/api/span + DOCS: https://nightly.spacy.io/api/span """ @classmethod def set_extension(cls, name, **kwargs): """Define a custom attribute which becomes available as `Span._`. - name (unicode): Name of the attribute to set. + name (str): Name of the attribute to set. default: Optional default value of the attribute. getter (callable): Optional getter function. setter (callable): Optional setter function. method (callable): Optional method for method extension. force (bool): Force overwriting existing attribute. - DOCS: https://spacy.io/api/span#set_extension - USAGE: https://spacy.io/usage/processing-pipelines#custom-components-attributes + DOCS: https://nightly.spacy.io/api/span#set_extension + USAGE: https://nightly.spacy.io/usage/processing-pipelines#custom-components-attributes """ if cls.has_extension(name) and not kwargs.get("force", False): raise ValueError(Errors.E090.format(name=name, obj="Span")) @@ -53,10 +48,10 @@ cdef class Span: def get_extension(cls, name): """Look up a previously registered extension by name. - name (unicode): Name of the extension. + name (str): Name of the extension. RETURNS (tuple): A `(default, method, getter, setter)` tuple. - DOCS: https://spacy.io/api/span#get_extension + DOCS: https://nightly.spacy.io/api/span#get_extension """ return Underscore.span_extensions.get(name) @@ -64,10 +59,10 @@ cdef class Span: def has_extension(cls, name): """Check whether an extension has been registered. - name (unicode): Name of the extension. + name (str): Name of the extension. RETURNS (bool): Whether the extension has been registered. - DOCS: https://spacy.io/api/span#has_extension + DOCS: https://nightly.spacy.io/api/span#has_extension """ return name in Underscore.span_extensions @@ -75,11 +70,11 @@ cdef class Span: def remove_extension(cls, name): """Remove a previously registered extension. - name (unicode): Name of the extension. + name (str): Name of the extension. RETURNS (tuple): A `(default, method, getter, setter)` tuple of the removed extension. - DOCS: https://spacy.io/api/span#remove_extension + DOCS: https://nightly.spacy.io/api/span#remove_extension """ if not cls.has_extension(name): raise ValueError(Errors.E046.format(name=name)) @@ -96,9 +91,8 @@ cdef class Span: kb_id (uint64): An identifier from a Knowledge Base to capture the meaning of a named entity. 
vector (ndarray[ndim=1, dtype='float32']): A meaning representation of the span. - RETURNS (Span): The newly constructed object. - DOCS: https://spacy.io/api/span#init + DOCS: https://nightly.spacy.io/api/span#init """ if not (0 <= start <= end <= len(doc)): raise IndexError(Errors.E035.format(start=start, end=end, length=len(doc))) @@ -110,9 +104,9 @@ cdef class Span: self.end_char = self.doc[end - 1].idx + len(self.doc[end - 1]) else: self.end_char = 0 - if isinstance(label, basestring_): + if isinstance(label, str): label = doc.vocab.strings.add(label) - if isinstance(kb_id, basestring_): + if isinstance(kb_id, str): kb_id = doc.vocab.strings.add(kb_id) if label not in doc.vocab.strings: raise ValueError(Errors.E084.format(label=label)) @@ -154,17 +148,14 @@ cdef class Span: RETURNS (int): The number of tokens in the span. - DOCS: https://spacy.io/api/span#len + DOCS: https://nightly.spacy.io/api/span#len """ - self._recalculate_indices() if self.end < self.start: return 0 return self.end - self.start def __repr__(self): - if is_config(python3=True): - return self.text - return self.text.encode("utf-8") + return self.text def __getitem__(self, object i): """Get a `Token` or a `Span` object @@ -173,9 +164,8 @@ cdef class Span: the span to get. RETURNS (Token or Span): The token at `span[i]`. - DOCS: https://spacy.io/api/span#getitem + DOCS: https://nightly.spacy.io/api/span#getitem """ - self._recalculate_indices() if isinstance(i, slice): start, end = normalize_slice(len(self), i.start, i.stop, i.step) return Span(self.doc, start + self.start, end + self.start) @@ -187,16 +177,15 @@ cdef class Span: if self.start <= token_i < self.end: return self.doc[token_i] else: - raise IndexError(Errors.E201) + raise IndexError(Errors.E1002) def __iter__(self): """Iterate over `Token` objects. YIELDS (Token): A `Token` object. - DOCS: https://spacy.io/api/span#iter + DOCS: https://nightly.spacy.io/api/span#iter """ - self._recalculate_indices() for i in range(self.start, self.end): yield self.doc[i] @@ -209,27 +198,18 @@ cdef class Span: return Underscore(Underscore.span_extensions, self, start=self.start_char, end=self.end_char) - def as_doc(self, bint copy_user_data=False): + def as_doc(self, *, bint copy_user_data=False): """Create a `Doc` object with a copy of the `Span`'s data. copy_user_data (bool): Whether or not to copy the original doc's user data. RETURNS (Doc): The `Doc` copy of the span. - DOCS: https://spacy.io/api/span#as_doc + DOCS: https://nightly.spacy.io/api/span#as_doc """ - # TODO: make copy_user_data a keyword-only argument (Python 3 only) words = [t.text for t in self] spaces = [bool(t.whitespace_) for t in self] cdef Doc doc = Doc(self.doc.vocab, words=words, spaces=spaces) - array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_ID, ENT_KB_ID] - if self.doc.is_tagged: - array_head.append(TAG) - # If doc parsed add head and dep attribute - if self.doc.is_parsed: - array_head.extend([HEAD, DEP]) - # Otherwise add sent_start - else: - array_head.append(SENT_START) + array_head = self.doc._get_array_attrs() array = self.doc.to_array(array_head) array = array[self.start : self.end] self._fix_dep_copy(array_head, array) @@ -288,18 +268,6 @@ cdef class Span: return array - def merge(self, *args, **attributes): - """Retokenize the document, such that the span is merged into a single - token. - - **attributes: Attributes to assign to the merged token. By default, - attributes are inherited from the syntactic root token of the span. - RETURNS (Token): The newly merged token. 
- """ - warnings.warn(Warnings.W013.format(obj="Span"), DeprecationWarning) - return self.doc.merge(self.start_char, self.end_char, *args, - **attributes) - def get_lca_matrix(self): """Calculates a matrix of Lowest Common Ancestors (LCA) for a given `Span`, where LCA[i, j] is the index of the lowest common ancestor among @@ -309,7 +277,7 @@ cdef class Span: RETURNS (np.array[ndim=2, dtype=numpy.int32]): LCA matrix with shape (n, n), where n = len(self). - DOCS: https://spacy.io/api/span#get_lca_matrix + DOCS: https://nightly.spacy.io/api/span#get_lca_matrix """ return numpy.asarray(_get_lca_matrix(self.doc, self.start, self.end)) @@ -321,7 +289,7 @@ cdef class Span: `Span`, `Token` and `Lexeme` objects. RETURNS (float): A scalar similarity score. Higher is more similar. - DOCS: https://spacy.io/api/span#similarity + DOCS: https://nightly.spacy.io/api/span#similarity """ if "similarity" in self.doc.user_span_hooks: return self.doc.user_span_hooks["similarity"](self, other) @@ -368,19 +336,6 @@ cdef class Span: output[i-self.start, j] = get_token_attr(&self.doc.c[i], feature) return output - cpdef int _recalculate_indices(self) except -1: - if self.end > self.doc.length \ - or self.doc.c[self.start].idx != self.start_char \ - or (self.doc.c[self.end-1].idx + self.doc.c[self.end-1].lex.length) != self.end_char: - start = token_by_start(self.doc.c, self.doc.length, self.start_char) - if self.start == -1: - raise IndexError(Errors.E036.format(start=self.start_char)) - end = token_by_end(self.doc.c, self.doc.length, self.end_char) - if end == -1: - raise IndexError(Errors.E037.format(end=self.end_char)) - self.start = start - self.end = end + 1 - @property def vocab(self): """RETURNS (Vocab): The Span's Doc's vocab.""" @@ -393,7 +348,7 @@ cdef class Span: return self.doc.user_span_hooks["sent"](self) # Use `sent_start` token attribute to find sentence boundaries cdef int n = 0 - if self.doc.is_sentenced: + if self.doc.has_annotation("SENT_START"): # Find start of the sentence start = self.start while self.doc.c[start].sent_start != 1 and start > 0: @@ -416,7 +371,7 @@ cdef class Span: RETURNS (tuple): Entities in the span, one `Span` per entity. - DOCS: https://spacy.io/api/span#ents + DOCS: https://nightly.spacy.io/api/span#ents """ ents = [] for ent in self.doc.ents: @@ -431,7 +386,7 @@ cdef class Span: RETURNS (bool): Whether a word vector is associated with the object. - DOCS: https://spacy.io/api/span#has_vector + DOCS: https://nightly.spacy.io/api/span#has_vector """ if "has_vector" in self.doc.user_span_hooks: return self.doc.user_span_hooks["has_vector"](self) @@ -450,7 +405,7 @@ cdef class Span: RETURNS (numpy.ndarray[ndim=1, dtype='float32']): A 1D numpy array representing the span's semantics. - DOCS: https://spacy.io/api/span#vector + DOCS: https://nightly.spacy.io/api/span#vector """ if "vector" in self.doc.user_span_hooks: return self.doc.user_span_hooks["vector"](self) @@ -464,7 +419,7 @@ cdef class Span: RETURNS (float): The L2 norm of the vector representation. - DOCS: https://spacy.io/api/span#vector_norm + DOCS: https://nightly.spacy.io/api/span#vector_norm """ if "vector_norm" in self.doc.user_span_hooks: return self.doc.user_span_hooks["vector"](self) @@ -478,7 +433,7 @@ cdef class Span: @property def tensor(self): """The span's slice of the doc's tensor. - + RETURNS (ndarray[ndim=2, dtype='float32']): A 2D numpy or cupy array representing the span's semantics. 
""" @@ -498,7 +453,7 @@ cdef class Span: @property def text(self): - """RETURNS (unicode): The original verbatim text of the span.""" + """RETURNS (str): The original verbatim text of the span.""" text = self.text_with_ws if self[-1].whitespace_: text = text[:-1] @@ -509,7 +464,7 @@ cdef class Span: """The text content of the span with a trailing whitespace character if the last token has one. - RETURNS (unicode): The text content of the span (with trailing + RETURNS (str): The text content of the span (with trailing whitespace). """ return "".join([t.text_with_ws for t in self]) @@ -524,10 +479,8 @@ cdef class Span: YIELDS (Span): Base noun-phrase `Span` objects. - DOCS: https://spacy.io/api/span#noun_chunks + DOCS: https://nightly.spacy.io/api/span#noun_chunks """ - if not self.doc.is_parsed: - raise ValueError(Errors.E029) # Accumulate the result before beginning to iterate over it. This # prevents the tokenisation from being changed out from under us # during the iteration. The tricky thing here is that Span accepts @@ -549,9 +502,8 @@ cdef class Span: RETURNS (Token): The root token. - DOCS: https://spacy.io/api/span#root + DOCS: https://nightly.spacy.io/api/span#root """ - self._recalculate_indices() if "root" in self.doc.user_span_hooks: return self.doc.user_span_hooks["root"](self) # This should probably be called 'head', and the other one called @@ -606,7 +558,7 @@ cdef class Span: RETURNS (tuple): A tuple of Token objects. - DOCS: https://spacy.io/api/span#lefts + DOCS: https://nightly.spacy.io/api/span#lefts """ return self.root.conjuncts @@ -617,7 +569,7 @@ cdef class Span: YIELDS (Token):A left-child of a token of the span. - DOCS: https://spacy.io/api/span#lefts + DOCS: https://nightly.spacy.io/api/span#lefts """ for token in reversed(self): # Reverse, so we get tokens in order for left in token.lefts: @@ -631,7 +583,7 @@ cdef class Span: YIELDS (Token): A right-child of a token of the span. - DOCS: https://spacy.io/api/span#rights + DOCS: https://nightly.spacy.io/api/span#rights """ for token in self: for right in token.rights: @@ -646,7 +598,7 @@ cdef class Span: RETURNS (int): The number of leftward immediate children of the span, in the syntactic dependency parse. - DOCS: https://spacy.io/api/span#n_lefts + DOCS: https://nightly.spacy.io/api/span#n_lefts """ return len(list(self.lefts)) @@ -658,7 +610,7 @@ cdef class Span: RETURNS (int): The number of rightward immediate children of the span, in the syntactic dependency parse. - DOCS: https://spacy.io/api/span#n_rights + DOCS: https://nightly.spacy.io/api/span#n_rights """ return len(list(self.rights)) @@ -668,7 +620,7 @@ cdef class Span: YIELDS (Token): A token within the span, or a descendant from it. - DOCS: https://spacy.io/api/span#subtree + DOCS: https://nightly.spacy.io/api/span#subtree """ for word in self.lefts: yield from word.subtree @@ -682,46 +634,31 @@ cdef class Span: return self.root.ent_id def __set__(self, hash_t key): - raise NotImplementedError(TempErrors.T007.format(attr="ent_id")) + raise NotImplementedError(Errors.E200.format(attr="ent_id")) property ent_id_: - """RETURNS (unicode): The (string) entity ID.""" + """RETURNS (str): The (string) entity ID.""" def __get__(self): return self.root.ent_id_ def __set__(self, hash_t key): - raise NotImplementedError(TempErrors.T007.format(attr="ent_id_")) + raise NotImplementedError(Errors.E200.format(attr="ent_id_")) @property def orth_(self): """Verbatim text content (identical to `Span.text`). Exists mostly for consistency with other attributes. 
- RETURNS (unicode): The span's text.""" + RETURNS (str): The span's text.""" return self.text @property def lemma_(self): - """RETURNS (unicode): The span's lemma.""" + """RETURNS (str): The span's lemma.""" return " ".join([t.lemma_ for t in self]).strip() - @property - def upper_(self): - """Deprecated. Use `Span.text.upper()` instead.""" - return "".join([t.text_with_ws.upper() for t in self]).strip() - - @property - def lower_(self): - """Deprecated. Use `Span.text.lower()` instead.""" - return "".join([t.text_with_ws.lower() for t in self]).strip() - - @property - def string(self): - """Deprecated: Use `Span.text_with_ws` instead.""" - return "".join([t.text_with_ws for t in self]) - property label_: - """RETURNS (unicode): The span's label.""" + """RETURNS (str): The span's label.""" def __get__(self): return self.doc.vocab.strings[self.label] @@ -731,7 +668,7 @@ cdef class Span: raise NotImplementedError(Errors.E129.format(start=self.start, end=self.end, label=label_)) property kb_id_: - """RETURNS (unicode): The named entity's KB ID.""" + """RETURNS (str): The named entity's KB ID.""" def __get__(self): return self.doc.vocab.strings[self.kb_id] diff --git a/spacy/tokens/token.pxd b/spacy/tokens/token.pxd index cbca55c40..45c906a82 100644 --- a/spacy/tokens/token.pxd +++ b/spacy/tokens/token.pxd @@ -6,6 +6,7 @@ from ..typedefs cimport attr_t, flags_t from ..parts_of_speech cimport univ_pos_t from .doc cimport Doc from ..lexeme cimport Lexeme + from ..errors import Errors @@ -43,6 +44,8 @@ cdef class Token: return token.pos elif feat_name == TAG: return token.tag + elif feat_name == MORPH: + return token.morph elif feat_name == DEP: return token.dep elif feat_name == HEAD: @@ -73,6 +76,8 @@ cdef class Token: token.pos = value elif feat_name == TAG: token.tag = value + elif feat_name == MORPH: + token.morph = value elif feat_name == DEP: token.dep = value elif feat_name == HEAD: diff --git a/spacy/tokens/token.pyx b/spacy/tokens/token.pyx index 8d3406bae..2075c3cc8 100644 --- a/spacy/tokens/token.pyx +++ b/spacy/tokens/token.pyx @@ -1,54 +1,47 @@ # cython: infer_types=True -# coding: utf8 -from __future__ import unicode_literals - -from libc.string cimport memcpy -from cpython.mem cimport PyMem_Malloc, PyMem_Free # Compiler crashes on memory view coercion without this. Should report bug. from cython.view cimport array as cvarray cimport numpy as np np.import_array() import numpy +from thinc.api import get_array_module import warnings -from thinc.neural.util import get_array_module from ..typedefs cimport hash_t from ..lexeme cimport Lexeme from ..attrs cimport IS_ALPHA, IS_ASCII, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_SPACE from ..attrs cimport IS_BRACKET, IS_QUOTE, IS_LEFT_PUNCT, IS_RIGHT_PUNCT -from ..attrs cimport IS_TITLE, IS_UPPER, IS_CURRENCY, LIKE_URL, LIKE_NUM, LIKE_EMAIL -from ..attrs cimport IS_STOP, ID, ORTH, NORM, LOWER, SHAPE, PREFIX, SUFFIX -from ..attrs cimport LENGTH, CLUSTER, LEMMA, POS, TAG, DEP +from ..attrs cimport IS_TITLE, IS_UPPER, IS_CURRENCY, IS_STOP +from ..attrs cimport LIKE_URL, LIKE_NUM, LIKE_EMAIL from ..symbols cimport conj +from .morphanalysis cimport MorphAnalysis +from .doc cimport set_children_from_heads from .. import parts_of_speech -from .. import util -from ..compat import is_config from ..errors import Errors, Warnings from .underscore import Underscore, get_ext_args -from .morphanalysis cimport MorphAnalysis cdef class Token: """An individual token – i.e. a word, punctuation symbol, whitespace, etc. 
- DOCS: https://spacy.io/api/token + DOCS: https://nightly.spacy.io/api/token """ @classmethod def set_extension(cls, name, **kwargs): """Define a custom attribute which becomes available as `Token._`. - name (unicode): Name of the attribute to set. + name (str): Name of the attribute to set. default: Optional default value of the attribute. getter (callable): Optional getter function. setter (callable): Optional setter function. method (callable): Optional method for method extension. force (bool): Force overwriting existing attribute. - DOCS: https://spacy.io/api/token#set_extension - USAGE: https://spacy.io/usage/processing-pipelines#custom-components-attributes + DOCS: https://nightly.spacy.io/api/token#set_extension + USAGE: https://nightly.spacy.io/usage/processing-pipelines#custom-components-attributes """ if cls.has_extension(name) and not kwargs.get("force", False): raise ValueError(Errors.E090.format(name=name, obj="Token")) @@ -58,10 +51,10 @@ cdef class Token: def get_extension(cls, name): """Look up a previously registered extension by name. - name (unicode): Name of the extension. + name (str): Name of the extension. RETURNS (tuple): A `(default, method, getter, setter)` tuple. - DOCS: https://spacy.io/api/token#get_extension + DOCS: https://nightly.spacy.io/api/token#get_extension """ return Underscore.token_extensions.get(name) @@ -69,10 +62,10 @@ cdef class Token: def has_extension(cls, name): """Check whether an extension has been registered. - name (unicode): Name of the extension. + name (str): Name of the extension. RETURNS (bool): Whether the extension has been registered. - DOCS: https://spacy.io/api/token#has_extension + DOCS: https://nightly.spacy.io/api/token#has_extension """ return name in Underscore.token_extensions @@ -80,11 +73,11 @@ cdef class Token: def remove_extension(cls, name): """Remove a previously registered extension. - name (unicode): Name of the extension. + name (str): Name of the extension. RETURNS (tuple): A `(default, method, getter, setter)` tuple of the removed extension. - DOCS: https://spacy.io/api/token#remove_extension + DOCS: https://nightly.spacy.io/api/token#remove_extension """ if not cls.has_extension(name): raise ValueError(Errors.E046.format(name=name)) @@ -97,7 +90,7 @@ cdef class Token: doc (Doc): The parent document. offset (int): The index of the token within the document. - DOCS: https://spacy.io/api/token#init + DOCS: https://nightly.spacy.io/api/token#init """ self.vocab = vocab self.doc = doc @@ -112,7 +105,7 @@ cdef class Token: RETURNS (int): The number of unicode characters in the token. - DOCS: https://spacy.io/api/token#len + DOCS: https://nightly.spacy.io/api/token#len """ return self.c.lex.length @@ -123,9 +116,7 @@ cdef class Token: return self.text.encode('utf8') def __str__(self): - if is_config(python3=True): - return self.__unicode__() - return self.__bytes__() + return self.__unicode__() def __repr__(self): return self.__str__() @@ -177,7 +168,7 @@ cdef class Token: flag_id (int): The ID of the flag attribute. RETURNS (bool): Whether the flag is set. - DOCS: https://spacy.io/api/token#check_flag + DOCS: https://nightly.spacy.io/api/token#check_flag """ return Lexeme.c_check_flag(self.c.lex, flag_id) @@ -187,7 +178,7 @@ cdef class Token: i (int): The relative position of the token to get. Defaults to 1. RETURNS (Token): The token at position `self.doc[self.i+i]`. 
- DOCS: https://spacy.io/api/token#nbor + DOCS: https://nightly.spacy.io/api/token#nbor """ if self.i+i < 0 or (self.i+i >= len(self.doc)): raise IndexError(Errors.E042.format(i=self.i, j=i, length=len(self.doc))) @@ -201,7 +192,7 @@ cdef class Token: `Span`, `Token` and `Lexeme` objects. RETURNS (float): A scalar similarity score. Higher is more similar. - DOCS: https://spacy.io/api/token#similarity + DOCS: https://nightly.spacy.io/api/token#similarity """ if "similarity" in self.doc.user_token_hooks: return self.doc.user_token_hooks["similarity"](self, other) @@ -220,9 +211,32 @@ cdef class Token: xp = get_array_module(vector) return (xp.dot(vector, other.vector) / (self.vector_norm * other.vector_norm)) + property morph: + def __get__(self): + return MorphAnalysis.from_id(self.vocab, self.c.morph) + + def __set__(self, MorphAnalysis morph): + # Check that the morph has the same vocab + if self.vocab != morph.vocab: + raise ValueError(Errors.E1013) + self.c.morph = morph.c.key + + def set_morph(self, features): + cdef hash_t key + if features is None: + self.c.morph = 0 + elif isinstance(features, MorphAnalysis): + self.morph = features + else: + if isinstance(features, int): + features = self.vocab.strings[features] + key = self.vocab.morphology.add(features) + self.c.morph = key + @property - def morph(self): - return MorphAnalysis.from_id(self.vocab, self.c.morph) + def lex(self): + """RETURNS (Lexeme): The underlying lexeme.""" + return self.vocab[self.c.lex.orth] @property def lex_id(self): @@ -235,19 +249,14 @@ cdef class Token: index into tables, e.g. for word vectors.""" return self.c.lex.id - @property - def string(self): - """Deprecated: Use Token.text_with_ws instead.""" - return self.text_with_ws - @property def text(self): - """RETURNS (unicode): The original verbatim text of the token.""" + """RETURNS (str): The original verbatim text of the token.""" return self.orth_ @property def text_with_ws(self): - """RETURNS (unicode): The text content of the span (with trailing + """RETURNS (str): The text content of the span (with trailing whitespace). """ cdef unicode orth = self.vocab.strings[self.c.lex.orth] @@ -335,11 +344,7 @@ cdef class Token: inflectional suffixes. """ def __get__(self): - if self.c.lemma == 0: - lemma_ = self.vocab.morphology.lemmatizer.lookup(self.orth_, orth=self.orth) - return self.vocab.strings[lemma_] - else: - return self.c.lemma + return self.c.lemma def __set__(self, attr_t lemma): self.c.lemma = lemma @@ -358,7 +363,7 @@ cdef class Token: return self.c.tag def __set__(self, attr_t tag): - self.vocab.morphology.assign_tag(self.c, tag) + self.c.tag = tag property dep: """RETURNS (uint64): ID of syntactic dependency label.""" @@ -375,7 +380,7 @@ cdef class Token: RETURNS (bool): Whether a word vector is associated with the object. - DOCS: https://spacy.io/api/token#has_vector + DOCS: https://nightly.spacy.io/api/token#has_vector """ if "has_vector" in self.doc.user_token_hooks: return self.doc.user_token_hooks["has_vector"](self) @@ -390,7 +395,7 @@ cdef class Token: RETURNS (numpy.ndarray[ndim=1, dtype='float32']): A 1D numpy array representing the token's semantics. - DOCS: https://spacy.io/api/token#vector + DOCS: https://nightly.spacy.io/api/token#vector """ if "vector" in self.doc.user_token_hooks: return self.doc.user_token_hooks["vector"](self) @@ -405,7 +410,7 @@ cdef class Token: RETURNS (float): The L2 norm of the vector representation. 
- DOCS: https://spacy.io/api/token#vector_norm + DOCS: https://nightly.spacy.io/api/token#vector_norm """ if "vector_norm" in self.doc.user_token_hooks: return self.doc.user_token_hooks["vector_norm"](self) @@ -428,7 +433,7 @@ cdef class Token: RETURNS (int): The number of leftward immediate children of the word, in the syntactic dependency parse. - DOCS: https://spacy.io/api/token#n_lefts + DOCS: https://nightly.spacy.io/api/token#n_lefts """ return self.c.l_kids @@ -440,7 +445,7 @@ cdef class Token: RETURNS (int): The number of rightward immediate children of the word, in the syntactic dependency parse. - DOCS: https://spacy.io/api/token#n_rights + DOCS: https://nightly.spacy.io/api/token#n_rights """ return self.c.r_kids @@ -472,7 +477,7 @@ cdef class Token: RETURNS (bool / None): Whether the token starts a sentence. None if unknown. - DOCS: https://spacy.io/api/token#is_sent_start + DOCS: https://nightly.spacy.io/api/token#is_sent_start """ def __get__(self): if self.c.sent_start == 0: @@ -483,7 +488,7 @@ cdef class Token: return True def __set__(self, value): - if self.doc.is_parsed: + if self.doc.has_annotation("DEP"): raise ValueError(Errors.E043) if value is None: self.c.sent_start = 0 @@ -501,7 +506,7 @@ cdef class Token: RETURNS (bool / None): Whether the token ends a sentence. None if unknown. - DOCS: https://spacy.io/api/token#is_sent_end + DOCS: https://nightly.spacy.io/api/token#is_sent_end """ def __get__(self): if self.i + 1 == len(self.doc): @@ -523,7 +528,7 @@ cdef class Token: YIELDS (Token): A left-child of the token. - DOCS: https://spacy.io/api/token#lefts + DOCS: https://nightly.spacy.io/api/token#lefts """ cdef int nr_iter = 0 cdef const TokenC* ptr = self.c - (self.i - self.c.l_edge) @@ -543,7 +548,7 @@ cdef class Token: YIELDS (Token): A right-child of the token. - DOCS: https://spacy.io/api/token#rights + DOCS: https://nightly.spacy.io/api/token#rights """ cdef const TokenC* ptr = self.c + (self.c.r_edge - self.i) tokens = [] @@ -565,7 +570,7 @@ cdef class Token: YIELDS (Token): A child token such that `child.head==self`. - DOCS: https://spacy.io/api/token#children + DOCS: https://nightly.spacy.io/api/token#children """ yield from self.lefts yield from self.rights @@ -578,7 +583,7 @@ cdef class Token: YIELDS (Token): A descendent token such that `self.is_ancestor(descendent) or token == self`. - DOCS: https://spacy.io/api/token#subtree + DOCS: https://nightly.spacy.io/api/token#subtree """ for word in self.lefts: yield from word.subtree @@ -609,7 +614,7 @@ cdef class Token: YIELDS (Token): A sequence of ancestor tokens such that `ancestor.is_ancestor(self)`. - DOCS: https://spacy.io/api/token#ancestors + DOCS: https://nightly.spacy.io/api/token#ancestors """ cdef const TokenC* head_ptr = self.c # Guard against infinite loop, no token can have @@ -627,7 +632,7 @@ cdef class Token: descendant (Token): Another token. RETURNS (bool): Whether this token is the ancestor of the descendant. 
- DOCS: https://spacy.io/api/token#is_ancestor + DOCS: https://nightly.spacy.io/api/token#is_ancestor """ if self.doc is not descendant.doc: return False @@ -652,78 +657,19 @@ cdef class Token: # Do nothing if old head is new head if self.i + self.c.head == new_head.i: return - cdef Token old_head = self.head - cdef int rel_newhead_i = new_head.i - self.i - # Is the new head a descendant of the old head - cdef bint is_desc = old_head.is_ancestor(new_head) - cdef int new_edge - cdef Token anc, child - # Update number of deps of old head - if self.c.head > 0: # left dependent - old_head.c.l_kids -= 1 - if self.c.l_edge == old_head.c.l_edge: - # The token dominates the left edge so the left edge of - # the head may change when the token is reattached, it may - # not change if the new head is a descendant of the current - # head. - new_edge = self.c.l_edge - # The new l_edge is the left-most l_edge on any of the - # other dependents where the l_edge is left of the head, - # otherwise it is the head - if not is_desc: - new_edge = old_head.i - for child in old_head.children: - if child == self: - continue - if child.c.l_edge < new_edge: - new_edge = child.c.l_edge - old_head.c.l_edge = new_edge - # Walk up the tree from old_head and assign new l_edge to - # ancestors until an ancestor already has an l_edge that's - # further left - for anc in old_head.ancestors: - if anc.c.l_edge <= new_edge: - break - anc.c.l_edge = new_edge - elif self.c.head < 0: # right dependent - old_head.c.r_kids -= 1 - # Do the same thing as for l_edge - if self.c.r_edge == old_head.c.r_edge: - new_edge = self.c.r_edge - if not is_desc: - new_edge = old_head.i - for child in old_head.children: - if child == self: - continue - if child.c.r_edge > new_edge: - new_edge = child.c.r_edge - old_head.c.r_edge = new_edge - for anc in old_head.ancestors: - if anc.c.r_edge >= new_edge: - break - anc.c.r_edge = new_edge - # Update number of deps of new head - if rel_newhead_i > 0: # left dependent - new_head.c.l_kids += 1 - # Walk up the tree from new head and set l_edge to self.l_edge - # until you hit a token with an l_edge further to the left - if self.c.l_edge < new_head.c.l_edge: - new_head.c.l_edge = self.c.l_edge - for anc in new_head.ancestors: - if anc.c.l_edge <= self.c.l_edge: - break - anc.c.l_edge = self.c.l_edge - elif rel_newhead_i < 0: # right dependent - new_head.c.r_kids += 1 - # Do the same as for l_edge - if self.c.r_edge > new_head.c.r_edge: - new_head.c.r_edge = self.c.r_edge - for anc in new_head.ancestors: - if anc.c.r_edge >= self.c.r_edge: - break - anc.c.r_edge = self.c.r_edge + # Find the widest l/r_edges of the roots of the two tokens involved + # to limit the number of tokens for set_children_from_heads + cdef Token self_root, new_head_root + self_ancestors = list(self.ancestors) + new_head_ancestors = list(new_head.ancestors) + self_root = self_ancestors[-1] if self_ancestors else self + new_head_root = new_head_ancestors[-1] if new_head_ancestors else new_head + start = self_root.c.l_edge if self_root.c.l_edge < new_head_root.c.l_edge else new_head_root.c.l_edge + end = self_root.c.r_edge if self_root.c.r_edge > new_head_root.c.r_edge else new_head_root.c.r_edge # Set new head - self.c.head = rel_newhead_i + self.c.head = new_head.i - self.i + # Adjust parse properties and sentence starts + set_children_from_heads(self.doc.c, start, end + 1) @property def conjuncts(self): @@ -731,7 +677,7 @@ cdef class Token: RETURNS (tuple): The coordinated tokens. 
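The rewritten `head` setter above now delegates the edge and child bookkeeping to `set_children_from_heads` over the span covered by the two roots involved. A minimal sketch of how that looks from the Python API, assuming a doc constructed with explicit heads and deps:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
words = ["I", "like", "green", "apples"]
heads = [1, 1, 3, 1]                      # absolute index of each token's head
deps = ["nsubj", "ROOT", "amod", "dobj"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

doc[2].head = doc[1]                      # reattach "green" from "apples" to "like"
print([t.text for t in doc[1].children])  # ['I', 'green', 'apples']
print([t.text for t in doc[3].subtree])   # ['apples']
```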
- DOCS: https://spacy.io/api/token#conjuncts + DOCS: https://nightly.spacy.io/api/token#conjuncts """ cdef Token word, child if "conjuncts" in self.doc.user_token_hooks: @@ -760,7 +706,7 @@ cdef class Token: self.c.ent_type = ent_type property ent_type_: - """RETURNS (unicode): Named entity type.""" + """RETURNS (str): Named entity type.""" def __get__(self): return self.vocab.strings[self.c.ent_type] @@ -776,6 +722,10 @@ cdef class Token: """ return self.c.ent_iob + @classmethod + def iob_strings(cls): + return ("", "I", "O", "B") + @property def ent_iob_(self): """IOB code of named entity tag. "B" means the token begins an entity, @@ -783,10 +733,9 @@ cdef class Token: and "" means no entity tag is set. "B" with an empty ent_type means that the token is blocked from further processing by NER. - RETURNS (unicode): IOB code of named entity tag. + RETURNS (str): IOB code of named entity tag. """ - iob_strings = ("", "I", "O", "B") - return iob_strings[self.c.ent_iob] + return self.iob_strings()[self.c.ent_iob] property ent_id: """RETURNS (uint64): ID of the entity the token is an instance of, @@ -799,7 +748,7 @@ cdef class Token: self.c.ent_id = key property ent_id_: - """RETURNS (unicode): ID of the entity the token is an instance of, + """RETURNS (str): ID of the entity the token is an instance of, if any. """ def __get__(self): @@ -817,7 +766,7 @@ cdef class Token: self.c.ent_kb_id = ent_kb_id property ent_kb_id_: - """RETURNS (unicode): Named entity KB ID.""" + """RETURNS (str): Named entity KB ID.""" def __get__(self): return self.vocab.strings[self.c.ent_kb_id] @@ -826,12 +775,12 @@ cdef class Token: @property def whitespace_(self): - """RETURNS (unicode): The trailing whitespace character, if present.""" + """RETURNS (str): The trailing whitespace character, if present.""" return " " if self.c.spacy else "" @property def orth_(self): - """RETURNS (unicode): Verbatim text content (identical to + """RETURNS (str): Verbatim text content (identical to `Token.text`). Exists mostly for consistency with the other attributes. """ @@ -839,13 +788,13 @@ cdef class Token: @property def lower_(self): - """RETURNS (unicode): The lowercase token text. Equivalent to + """RETURNS (str): The lowercase token text. Equivalent to `Token.text.lower()`. """ return self.vocab.strings[self.c.lex.lower] property norm_: - """RETURNS (unicode): The token's norm, i.e. a normalised form of the + """RETURNS (str): The token's norm, i.e. a normalised form of the token text. Usually set in the language's tokenizer exceptions or norm exceptions. """ @@ -857,47 +806,44 @@ cdef class Token: @property def shape_(self): - """RETURNS (unicode): Transform of the tokens's string, to show + """RETURNS (str): Transform of the tokens's string, to show orthographic features. For example, "Xxxx" or "dd". """ return self.vocab.strings[self.c.lex.shape] @property def prefix_(self): - """RETURNS (unicode): A length-N substring from the start of the token. + """RETURNS (str): A length-N substring from the start of the token. Defaults to `N=1`. """ return self.vocab.strings[self.c.lex.prefix] @property def suffix_(self): - """RETURNS (unicode): A length-N substring from the end of the token. + """RETURNS (str): A length-N substring from the end of the token. Defaults to `N=3`. """ return self.vocab.strings[self.c.lex.suffix] @property def lang_(self): - """RETURNS (unicode): Language of the parent document's vocabulary, + """RETURNS (str): Language of the parent document's vocabulary, e.g. 'en'. 
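The `iob_strings` classmethod added above exposes the integer-to-string mapping that `ent_iob_` indexes into. A small sketch (the entity setup is purely illustrative):

```python
import spacy
from spacy.tokens import Span, Token

nlp = spacy.blank("en")
doc = nlp.make_doc("Apple hired twenty people")
doc.ents = [Span(doc, 0, 1, label="ORG")]
print(doc[0].ent_iob, doc[0].ent_iob_)   # 3 'B' (a single-token entity begins with "B")
print(Token.iob_strings())               # ('', 'I', 'O', 'B')
```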
""" return self.vocab.strings[self.c.lex.lang] property lemma_: - """RETURNS (unicode): The token lemma, i.e. the base form of the word, + """RETURNS (str): The token lemma, i.e. the base form of the word, with no inflectional suffixes. """ def __get__(self): - if self.c.lemma == 0: - return self.vocab.morphology.lemmatizer.lookup(self.orth_, orth=self.orth) - else: - return self.vocab.strings[self.c.lemma] + return self.vocab.strings[self.c.lemma] def __set__(self, unicode lemma_): self.c.lemma = self.vocab.strings.add(lemma_) property pos_: - """RETURNS (unicode): Coarse-grained part-of-speech tag.""" + """RETURNS (str): Coarse-grained part-of-speech tag.""" def __get__(self): return parts_of_speech.NAMES[self.c.pos] @@ -905,7 +851,7 @@ cdef class Token: self.c.pos = parts_of_speech.IDS[pos_name] property tag_: - """RETURNS (unicode): Fine-grained part-of-speech tag.""" + """RETURNS (str): Fine-grained part-of-speech tag.""" def __get__(self): return self.vocab.strings[self.c.tag] @@ -913,7 +859,7 @@ cdef class Token: self.tag = self.vocab.strings.add(tag) property dep_: - """RETURNS (unicode): The syntactic dependency label.""" + """RETURNS (str): The syntactic dependency label.""" def __get__(self): return self.vocab.strings[self.c.dep] diff --git a/spacy/tokens/underscore.py b/spacy/tokens/underscore.py index 8dac8526e..b7966fd6e 100644 --- a/spacy/tokens/underscore.py +++ b/spacy/tokens/underscore.py @@ -1,13 +1,10 @@ -# coding: utf8 -from __future__ import unicode_literals - import functools import copy from ..errors import Errors -class Underscore(object): +class Underscore: mutable_types = (dict, list, set) doc_extensions = {} span_extensions = {} diff --git a/spacy/syntax/__init__.pxd b/spacy/training/__init__.pxd similarity index 100% rename from spacy/syntax/__init__.pxd rename to spacy/training/__init__.pxd diff --git a/spacy/training/__init__.py b/spacy/training/__init__.py new file mode 100644 index 000000000..86341dd9a --- /dev/null +++ b/spacy/training/__init__.py @@ -0,0 +1,10 @@ +from .corpus import Corpus # noqa: F401 +from .example import Example, validate_examples, validate_get_examples # noqa: F401 +from .align import Alignment # noqa: F401 +from .augment import dont_augment, orth_variants_augmenter # noqa: F401 +from .iob_utils import iob_to_biluo, biluo_to_iob # noqa: F401 +from .iob_utils import offsets_to_biluo_tags, biluo_tags_to_offsets # noqa: F401 +from .iob_utils import biluo_tags_to_spans, tags_to_entities # noqa: F401 +from .gold_io import docs_to_json, read_json_file # noqa: F401 +from .batchers import minibatch_by_padded_size, minibatch_by_words # noqa: F401 +from .loggers import console_logger, wandb_logger # noqa: F401 diff --git a/spacy/training/align.py b/spacy/training/align.py new file mode 100644 index 000000000..e8f17a667 --- /dev/null +++ b/spacy/training/align.py @@ -0,0 +1,34 @@ +from typing import List +import numpy +from thinc.types import Ragged +from dataclasses import dataclass +import tokenizations + +from ..errors import Errors + + +@dataclass +class Alignment: + x2y: Ragged + y2x: Ragged + + @classmethod + def from_indices(cls, x2y: List[List[int]], y2x: List[List[int]]) -> "Alignment": + x2y = _make_ragged(x2y) + y2x = _make_ragged(y2x) + return Alignment(x2y=x2y, y2x=y2x) + + @classmethod + def from_strings(cls, A: List[str], B: List[str]) -> "Alignment": + if "".join(A).replace(" ", "").lower() != "".join(B).replace(" ", "").lower(): + raise ValueError(Errors.E949) + x2y, y2x = tokenizations.get_alignments(A, B) + return 
Alignment.from_indices(x2y=x2y, y2x=y2x) + + +def _make_ragged(indices): + lengths = numpy.array([len(x) for x in indices], dtype="i") + flat = [] + for x in indices: + flat.extend(x) + return Ragged(numpy.array(flat, dtype="i"), lengths) diff --git a/spacy/training/augment.py b/spacy/training/augment.py new file mode 100644 index 000000000..13ae45bd2 --- /dev/null +++ b/spacy/training/augment.py @@ -0,0 +1,203 @@ +from typing import Callable, Iterator, Dict, List, Tuple, TYPE_CHECKING +import random +import itertools +import copy +from functools import partial +from pydantic import BaseModel, StrictStr + +from ..util import registry +from ..tokens import Doc +from .example import Example + +if TYPE_CHECKING: + from ..language import Language # noqa: F401 + + +class OrthVariantsSingle(BaseModel): + tags: List[StrictStr] + variants: List[StrictStr] + + +class OrthVariantsPaired(BaseModel): + tags: List[StrictStr] + variants: List[List[StrictStr]] + + +class OrthVariants(BaseModel): + paired: List[OrthVariantsPaired] = {} + single: List[OrthVariantsSingle] = {} + + +@registry.augmenters("spacy.orth_variants.v1") +def create_orth_variants_augmenter( + level: float, lower: float, orth_variants: OrthVariants +) -> Callable[["Language", Example], Iterator[Example]]: + """Create a data augmentation callback that uses orth-variant replacement. + The callback can be added to a corpus or other data iterator during training. + + level (float): The percentage of texts that will be augmented. + lower (float): The percentage of texts that will be lowercased. + orth_variants (Dict[str, dict]): A dictionary containing the single and + paired orth variants. Typically loaded from a JSON file. + RETURNS (Callable[[Language, Example], Iterator[Example]]): The augmenter. + """ + return partial( + orth_variants_augmenter, orth_variants=orth_variants, level=level, lower=lower + ) + + +@registry.augmenters("spacy.lower_case.v1") +def create_lower_casing_augmenter( + level: float, +) -> Callable[["Language", Example], Iterator[Example]]: + """Create a data augmentation callback that converts documents to lowercase. + The callback can be added to a corpus or other data iterator during training. + + level (float): The percentage of texts that will be augmented. + RETURNS (Callable[[Language, Example], Iterator[Example]]): The augmenter. 
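Since `align.py` is a new module, here is a minimal sketch of the `Alignment` helper in isolation. It assumes the `tokenizations` dependency imported above is installed; the commented values follow from how `get_alignments` pairs the two tokenizations.

```python
from spacy.training import Alignment

other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]
align = Alignment.from_strings(other_tokens, spacy_tokens)
print(align.x2y.lengths)      # how many spacy tokens each source token maps to
print(list(align.x2y.data))   # flat target indices; "'" and "s" both point at "'s"
```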
+ """ + return partial(lower_casing_augmenter, level=level) + + +def dont_augment(nlp: "Language", example: Example) -> Iterator[Example]: + yield example + + +def lower_casing_augmenter( + nlp: "Language", example: Example, *, level: float +) -> Iterator[Example]: + if random.random() >= level: + yield example + else: + example_dict = example.to_dict() + doc = nlp.make_doc(example.text.lower()) + example_dict["token_annotation"]["ORTH"] = [t.lower_ for t in doc] + yield example.from_dict(doc, example_dict) + + +def orth_variants_augmenter( + nlp: "Language", + example: Example, + orth_variants: dict, + *, + level: float = 0.0, + lower: float = 0.0, +) -> Iterator[Example]: + if random.random() >= level: + yield example + else: + raw_text = example.text + orig_dict = example.to_dict() + if not orig_dict["token_annotation"]: + yield example + else: + variant_text, variant_token_annot = make_orth_variants( + nlp, + raw_text, + orig_dict["token_annotation"], + orth_variants, + lower=raw_text is not None and random.random() < lower, + ) + if variant_text: + doc = nlp.make_doc(variant_text) + else: + doc = Doc(nlp.vocab, words=variant_token_annot["ORTH"]) + variant_token_annot["ORTH"] = [w.text for w in doc] + variant_token_annot["SPACY"] = [w.whitespace_ for w in doc] + orig_dict["token_annotation"] = variant_token_annot + yield example.from_dict(doc, orig_dict) + + +def make_orth_variants( + nlp: "Language", + raw: str, + token_dict: Dict[str, List[str]], + orth_variants: Dict[str, List[Dict[str, List[str]]]], + *, + lower: bool = False, +) -> Tuple[str, Dict[str, List[str]]]: + orig_token_dict = copy.deepcopy(token_dict) + ndsv = orth_variants.get("single", []) + ndpv = orth_variants.get("paired", []) + words = token_dict.get("ORTH", []) + tags = token_dict.get("TAG", []) + # keep unmodified if words or tags are not defined + if words and tags: + if lower: + words = [w.lower() for w in words] + # single variants + punct_choices = [random.choice(x["variants"]) for x in ndsv] + for word_idx in range(len(words)): + for punct_idx in range(len(ndsv)): + if ( + tags[word_idx] in ndsv[punct_idx]["tags"] + and words[word_idx] in ndsv[punct_idx]["variants"] + ): + words[word_idx] = punct_choices[punct_idx] + # paired variants + punct_choices = [random.choice(x["variants"]) for x in ndpv] + for word_idx in range(len(words)): + for punct_idx in range(len(ndpv)): + if tags[word_idx] in ndpv[punct_idx]["tags"] and words[ + word_idx + ] in itertools.chain.from_iterable(ndpv[punct_idx]["variants"]): + # backup option: random left vs. 
right from pair + pair_idx = random.choice([0, 1]) + # best option: rely on paired POS tags like `` / '' + if len(ndpv[punct_idx]["tags"]) == 2: + pair_idx = ndpv[punct_idx]["tags"].index(tags[word_idx]) + # next best option: rely on position in variants + # (may not be unambiguous, so order of variants matters) + else: + for pair in ndpv[punct_idx]["variants"]: + if words[word_idx] in pair: + pair_idx = pair.index(words[word_idx]) + words[word_idx] = punct_choices[punct_idx][pair_idx] + token_dict["ORTH"] = words + token_dict["TAG"] = tags + # modify raw + if raw is not None: + variants = [] + for single_variants in ndsv: + variants.extend(single_variants["variants"]) + for paired_variants in ndpv: + variants.extend( + list(itertools.chain.from_iterable(paired_variants["variants"])) + ) + # store variants in reverse length order to be able to prioritize + # longer matches (e.g., "---" before "--") + variants = sorted(variants, key=lambda x: len(x)) + variants.reverse() + variant_raw = "" + raw_idx = 0 + # add initial whitespace + while raw_idx < len(raw) and raw[raw_idx].isspace(): + variant_raw += raw[raw_idx] + raw_idx += 1 + for word in words: + match_found = False + # skip whitespace words + if word.isspace(): + match_found = True + # add identical word + elif word not in variants and raw[raw_idx:].startswith(word): + variant_raw += word + raw_idx += len(word) + match_found = True + # add variant word + else: + for variant in variants: + if not match_found and raw[raw_idx:].startswith(variant): + raw_idx += len(variant) + variant_raw += word + match_found = True + # something went wrong, abort + # (add a warning message?) + if not match_found: + return raw, orig_token_dict + # add following whitespace + while raw_idx < len(raw) and raw[raw_idx].isspace(): + variant_raw += raw[raw_idx] + raw_idx += 1 + raw = variant_raw + return raw, token_dict diff --git a/spacy/training/batchers.py b/spacy/training/batchers.py new file mode 100644 index 000000000..c54242eae --- /dev/null +++ b/spacy/training/batchers.py @@ -0,0 +1,230 @@ +from typing import Union, Iterable, Sequence, TypeVar, List, Callable +from typing import Optional, Any +from functools import partial +import itertools + +from ..util import registry, minibatch + + +Sizing = Union[Iterable[int], int] +ItemT = TypeVar("ItemT") +BatcherT = Callable[[Iterable[ItemT]], Iterable[List[ItemT]]] + + +@registry.batchers("spacy.batch_by_padded.v1") +def configure_minibatch_by_padded_size( + *, + size: Sizing, + buffer: int, + discard_oversize: bool, + get_length: Optional[Callable[[ItemT], int]] = None +) -> BatcherT: + """Create a batcher that uses the `batch_by_padded_size` strategy. + + The padded size is defined as the maximum length of sequences within the + batch multiplied by the number of sequences in the batch. + + size (int or Iterable[int]): The largest padded size to batch sequences into. + Can be a single integer, or a sequence, allowing for variable batch sizes. + buffer (int): The number of sequences to accumulate before sorting by length. + A larger buffer will result in more even sizing, but if the buffer is + very large, the iteration order will be less random, which can result + in suboptimal training. + discard_oversize (bool): Whether to discard sequences that are by themselves + longer than the largest padded batch size. + get_length (Callable or None): Function to get the length of a sequence item. + The `len` function is used by default. + """ + # Avoid displacing optional values from the underlying function. 
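To see one of the augmenter factories defined above in action outside a training config, the sketch below resolves `spacy.lower_case.v1` from the registry and applies it to a single `Example`. With `level=1.0` every example should be lowercased; querying the factory via `registry.augmenters.get` is how I would expect the catalogue registry to be used here, so treat it as a sketch rather than the canonical pattern.

```python
import spacy
from spacy.training import Example
from spacy.util import registry

nlp = spacy.blank("en")
make_augmenter = registry.augmenters.get("spacy.lower_case.v1")
augmenter = make_augmenter(level=1.0)      # level: share of examples to augment

doc = nlp.make_doc("I LIKE London.")
example = Example.from_dict(doc, {"words": [t.text for t in doc]})
for augmented in augmenter(nlp, example):
    print(augmented.text)                  # "i like london."
```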
+ optionals = {"get_length": get_length} if get_length is not None else {} + return partial( + minibatch_by_padded_size, + size=size, + buffer=buffer, + discard_oversize=discard_oversize, + **optionals + ) + + +@registry.batchers("spacy.batch_by_words.v1") +def configure_minibatch_by_words( + *, + size: Sizing, + tolerance: float, + discard_oversize: bool, + get_length: Optional[Callable[[ItemT], int]] = None +) -> BatcherT: + """Create a batcher that uses the "minibatch by words" strategy. + + size (int or Iterable[int]): The target number of words per batch. + Can be a single integer, or a sequence, allowing for variable batch sizes. + tolerance (float): What percentage of the size to allow batches to exceed. + discard_oversize (bool): Whether to discard sequences that by themselves + exceed the tolerated size. + get_length (Callable or None): Function to get the length of a sequence + item. The `len` function is used by default. + """ + optionals = {"get_length": get_length} if get_length is not None else {} + return partial( + minibatch_by_words, size=size, discard_oversize=discard_oversize, **optionals + ) + + +@registry.batchers("spacy.batch_by_sequence.v1") +def configure_minibatch( + size: Sizing, get_length: Optional[Callable[[ItemT], int]] = None +) -> BatcherT: + """Create a batcher that creates batches of the specified size. + + size (int or Iterable[int]): The target number of items per batch. + Can be a single integer, or a sequence, allowing for variable batch sizes. + """ + optionals = {"get_length": get_length} if get_length is not None else {} + return partial(minibatch, size=size, **optionals) + + +def minibatch_by_padded_size( + seqs: Iterable[ItemT], + size: Sizing, + buffer: int = 256, + discard_oversize: bool = False, + get_length: Callable = len, +) -> Iterable[List[ItemT]]: + """Minibatch a sequence by the size of padded batches that would result, + with sequences binned by length within a window. + + The padded size is defined as the maximum length of sequences within the + batch multiplied by the number of sequences in the batch. + + size (int): The largest padded size to batch sequences into. + buffer (int): The number of sequences to accumulate before sorting by length. + A larger buffer will result in more even sizing, but if the buffer is + very large, the iteration order will be less random, which can result + in suboptimal training. + discard_oversize (bool): Whether to discard sequences that are by themselves + longer than the largest padded batch size. + get_length (Callable or None): Function to get the length of a sequence item. + The `len` function is used by default. + """ + if isinstance(size, int): + size_ = itertools.repeat(size) + else: + size_ = size + for outer_batch in minibatch(seqs, size=buffer): + outer_batch = list(outer_batch) + target_size = next(size_) + for indices in _batch_by_length(outer_batch, target_size, get_length): + subbatch = [outer_batch[i] for i in indices] + padded_size = max(len(seq) for seq in subbatch) * len(subbatch) + if discard_oversize and padded_size >= target_size: + pass + else: + yield subbatch + + +def minibatch_by_words( + seqs: Iterable[ItemT], + size: Sizing, + tolerance=0.2, + discard_oversize=False, + get_length=len, +) -> Iterable[List[ItemT]]: + """Create minibatches of roughly a given number of words. If any examples + are longer than the specified batch length, they will appear in a batch by + themselves, or be discarded if discard_oversize=True. 
+ + seqs (Iterable[Sequence]): The sequences to minibatch. + size (int or Iterable[int]): The target number of words per batch. + Can be a single integer, or a sequence, allowing for variable batch sizes. + tolerance (float): What percentage of the size to allow batches to exceed. + discard_oversize (bool): Whether to discard sequences that by themselves + exceed the tolerated size. + get_length (Callable or None): Function to get the length of a sequence + item. The `len` function is used by default. + """ + if isinstance(size, int): + size_ = itertools.repeat(size) + elif isinstance(size, List): + size_ = iter(size) + else: + size_ = size + target_size = next(size_) + tol_size = target_size * tolerance + batch = [] + overflow = [] + batch_size = 0 + overflow_size = 0 + for seq in seqs: + n_words = get_length(seq) + # if the current example exceeds the maximum batch size, it is returned separately + # but only if discard_oversize=False. + if n_words > target_size + tol_size: + if not discard_oversize: + yield [seq] + # add the example to the current batch if there's no overflow yet and it still fits + elif overflow_size == 0 and (batch_size + n_words) <= target_size: + batch.append(seq) + batch_size += n_words + # add the example to the overflow buffer if it fits in the tolerance margin + elif (batch_size + overflow_size + n_words) <= (target_size + tol_size): + overflow.append(seq) + overflow_size += n_words + # yield the previous batch and start a new one. The new one gets the overflow examples. + else: + if batch: + yield batch + target_size = next(size_) + tol_size = target_size * tolerance + batch = overflow + batch_size = overflow_size + overflow = [] + overflow_size = 0 + # this example still fits + if (batch_size + n_words) <= target_size: + batch.append(seq) + batch_size += n_words + # this example fits in overflow + elif (batch_size + n_words) <= (target_size + tol_size): + overflow.append(seq) + overflow_size += n_words + # this example does not fit with the previous overflow: start another new batch + else: + if batch: + yield batch + target_size = next(size_) + tol_size = target_size * tolerance + batch = [seq] + batch_size = n_words + batch.extend(overflow) + if batch: + yield batch + + +def _batch_by_length( + seqs: Sequence[Any], max_words: int, get_length=len +) -> List[List[Any]]: + """Given a list of sequences, return a batched list of indices into the + list, where the batches are grouped by length, in descending order. + + Batches may be at most max_words in size, defined as max sequence length * size. + """ + # Use negative index so we can get sort by position ascending. 
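A small sketch of `minibatch_by_words` on plain token lists (any sequence type works, since `get_length` defaults to `len`). The 30-item sequence exceeds `size + size * tolerance` and is dropped because `discard_oversize=True`.

```python
from spacy.training import minibatch_by_words

seqs = [["tok"] * n for n in (3, 12, 7, 2, 9, 30, 4)]
batches = minibatch_by_words(seqs, size=20, tolerance=0.2, discard_oversize=True)
for batch in batches:
    print([len(seq) for seq in batch], "->", sum(len(seq) for seq in batch), "words")
```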
+ lengths_indices = [(get_length(seq), i) for i, seq in enumerate(seqs)] + lengths_indices.sort() + batches = [] + batch = [] + for length, i in lengths_indices: + if not batch: + batch.append(i) + elif length * (len(batch) + 1) <= max_words: + batch.append(i) + else: + batches.append(batch) + batch = [i] + if batch: + batches.append(batch) + # Check lengths match + assert sum(len(b) for b in batches) == len(seqs) + batches = [list(sorted(batch)) for batch in batches] + batches.reverse() + return batches diff --git a/spacy/training/converters/__init__.py b/spacy/training/converters/__init__.py new file mode 100644 index 000000000..e91b6aaa6 --- /dev/null +++ b/spacy/training/converters/__init__.py @@ -0,0 +1,4 @@ +from .iob_to_docs import iob_to_docs # noqa: F401 +from .conll_ner_to_docs import conll_ner_to_docs # noqa: F401 +from .json_to_docs import json_to_docs # noqa: F401 +from .conllu_to_docs import conllu_to_docs # noqa: F401 diff --git a/spacy/cli/converters/conll_ner2json.py b/spacy/training/converters/conll_ner_to_docs.py similarity index 69% rename from spacy/cli/converters/conll_ner2json.py rename to spacy/training/converters/conll_ner_to_docs.py index 46489ad7c..c01686aee 100644 --- a/spacy/cli/converters/conll_ner2json.py +++ b/spacy/training/converters/conll_ner_to_docs.py @@ -1,20 +1,18 @@ -# coding: utf8 -from __future__ import unicode_literals - from wasabi import Printer -from ...gold import iob_to_biluo -from ...lang.xx import MultiLanguage -from ...tokens.doc import Doc -from ...util import load_model +from .. import tags_to_entities +from ...training import iob_to_biluo +from ...tokens import Doc, Span +from ...errors import Errors +from ...util import load_model, get_lang_class -def conll_ner2json( +def conll_ner_to_docs( input_data, n_sents=10, seg_sents=False, model=None, no_print=False, **kwargs ): """ Convert files in the CoNLL-2003 NER format and similar - whitespace-separated columns into JSON format for use with train cli. + whitespace-separated columns into Doc objects. The first column is the tokens, the final column is the IOB tags. If an additional second column is present, the second column is the tags. @@ -64,9 +62,9 @@ def conll_ner2json( # sentence segmentation required for document segmentation if n_sents > 0 and not seg_sents: msg.warn( - "No sentence boundaries found to use with option `-n {}`. " - "Use `-s` to automatically segment sentences or `-n 0` " - "to disable.".format(n_sents) + f"No sentence boundaries found to use with option `-n {n_sents}`. " + f"Use `-s` to automatically segment sentences or `-n 0` " + f"to disable." ) else: n_sents_info(msg, n_sents) @@ -84,43 +82,41 @@ def conll_ner2json( "No document delimiters found. Use `-n` to automatically group " "sentences into documents." 
) + + if model: + nlp = load_model(model) + else: + nlp = get_lang_class("xx")() output_docs = [] - for doc in input_data.strip().split(doc_delimiter): - doc = doc.strip() - if not doc: + for conll_doc in input_data.strip().split(doc_delimiter): + conll_doc = conll_doc.strip() + if not conll_doc: continue - output_doc = [] - for sent in doc.split("\n\n"): - sent = sent.strip() - if not sent: + words = [] + sent_starts = [] + pos_tags = [] + biluo_tags = [] + for conll_sent in conll_doc.split("\n\n"): + conll_sent = conll_sent.strip() + if not conll_sent: continue - lines = [line.strip() for line in sent.split("\n") if line.strip()] + lines = [line.strip() for line in conll_sent.split("\n") if line.strip()] cols = list(zip(*[line.split() for line in lines])) if len(cols) < 2: - raise ValueError( - "The token-per-line NER file is not formatted correctly. " - "Try checking whitespace and delimiters. See " - "https://spacy.io/api/cli#convert" - ) - words = cols[0] - iob_ents = cols[-1] - if len(cols) > 2: - tags = cols[1] - else: - tags = ["-"] * len(words) - biluo_ents = iob_to_biluo(iob_ents) - output_doc.append( - { - "tokens": [ - {"orth": w, "tag": tag, "ner": ent} - for (w, tag, ent) in zip(words, tags, biluo_ents) - ] - } - ) - output_docs.append( - {"id": len(output_docs), "paragraphs": [{"sentences": output_doc}]} - ) - output_doc = [] + raise ValueError(Errors.E903) + length = len(cols[0]) + words.extend(cols[0]) + sent_starts.extend([True] + [False] * (length - 1)) + biluo_tags.extend(iob_to_biluo(cols[-1])) + pos_tags.extend(cols[1] if len(cols) > 2 else ["-"] * length) + + doc = Doc(nlp.vocab, words=words) + for i, token in enumerate(doc): + token.tag_ = pos_tags[i] + token.is_sent_start = sent_starts[i] + entities = tags_to_entities(biluo_tags) + doc.ents = [Span(doc, start=s, end=e + 1, label=L) for L, s, e in entities] + output_docs.append(doc) return output_docs @@ -129,14 +125,14 @@ def segment_sents_and_docs(doc, n_sents, doc_delimiter, model=None, msg=None): if model: nlp = load_model(model) if "parser" in nlp.pipe_names: - msg.info("Segmenting sentences with parser from model '{}'.".format(model)) + msg.info(f"Segmenting sentences with parser from model '{model}'.") sentencizer = nlp.get_pipe("parser") if not sentencizer: msg.info( "Segmenting sentences with sentencizer. 
(Use `-b model` for " "improved parser-based sentence segmentation.)" ) - nlp = MultiLanguage() + nlp = get_lang_class("xx")() sentencizer = nlp.create_pipe("sentencizer") lines = doc.strip().split("\n") words = [line.strip().split()[0] for line in lines] @@ -166,7 +162,7 @@ def segment_docs(input_data, n_sents, doc_delimiter): def n_sents_info(msg, n_sents): - msg.info("Grouping every {} sentences into a document.".format(n_sents)) + msg.info(f"Grouping every {n_sents} sentences into a document.") if n_sents == 1: msg.warn( "To generate better training data, you may want to group " diff --git a/spacy/training/converters/conllu_to_docs.py b/spacy/training/converters/conllu_to_docs.py new file mode 100644 index 000000000..2e6084ae5 --- /dev/null +++ b/spacy/training/converters/conllu_to_docs.py @@ -0,0 +1,298 @@ +import re + +from .conll_ner_to_docs import n_sents_info +from ...training import iob_to_biluo, biluo_tags_to_spans +from ...tokens import Doc, Token, Span +from ...vocab import Vocab +from wasabi import Printer + + +def conllu_to_docs( + input_data, + n_sents=10, + append_morphology=False, + ner_map=None, + merge_subtokens=False, + no_print=False, + **_ +): + """ + Convert conllu files into JSON format for use with train cli. + append_morphology parameter enables appending morphology to tags, which is + useful for languages such as Spanish, where UD tags are not so rich. + + Extract NER tags if available and convert them so that they follow + BILUO and the Wikipedia scheme + """ + MISC_NER_PATTERN = "^((?:name|NE)=)?([BILU])-([A-Z_]+)|O$" + msg = Printer(no_print=no_print) + n_sents_info(msg, n_sents) + sent_docs = read_conllx( + input_data, + append_morphology=append_morphology, + ner_tag_pattern=MISC_NER_PATTERN, + ner_map=ner_map, + merge_subtokens=merge_subtokens, + ) + docs = [] + sent_docs_to_merge = [] + for sent_doc in sent_docs: + sent_docs_to_merge.append(sent_doc) + if len(sent_docs_to_merge) % n_sents == 0: + docs.append(Doc.from_docs(sent_docs_to_merge)) + sent_docs_to_merge = [] + if sent_docs_to_merge: + docs.append(Doc.from_docs(sent_docs_to_merge)) + return docs + + +def has_ner(input_data, ner_tag_pattern): + """ + Check the MISC column for NER tags. + """ + for sent in input_data.strip().split("\n\n"): + lines = sent.strip().split("\n") + if lines: + while lines[0].startswith("#"): + lines.pop(0) + for line in lines: + parts = line.split("\t") + id_, word, lemma, pos, tag, morph, head, dep, _1, misc = parts + for misc_part in misc.split("|"): + if re.match(ner_tag_pattern, misc_part): + return True + return False + + +def read_conllx( + input_data, + append_morphology=False, + merge_subtokens=False, + ner_tag_pattern="", + ner_map=None, +): + """ Yield docs, one for each sentence """ + vocab = Vocab() # need vocab to make a minimal Doc + for sent in input_data.strip().split("\n\n"): + lines = sent.strip().split("\n") + if lines: + while lines[0].startswith("#"): + lines.pop(0) + doc = conllu_sentence_to_doc( + vocab, + lines, + ner_tag_pattern, + merge_subtokens=merge_subtokens, + append_morphology=append_morphology, + ner_map=ner_map, + ) + yield doc + + +def get_entities(lines, tag_pattern, ner_map=None): + """Find entities in the MISC column according to the pattern and map to + final entity type with `ner_map` if mapping present. Entity tag is 'O' if + the pattern is not matched. + + lines (str): CONLL-U lines for one sentences + tag_pattern (str): Regex pattern for entity tag + ner_map (dict): Map old NER tag names to new ones, '' maps to O. 
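For context on the converter defined in this file, here is a rough sketch of feeding it a single hand-written CoNLL-U sentence. The rows and the printed attributes are illustrative only; exact segmentation behaviour and warnings depend on options not exercised here.

```python
from spacy.training.converters import conllu_to_docs

rows = [
    "1\tThey\tthey\tPRON\tPRP\tCase=Nom\t2\tnsubj\t_\t_",
    "2\tbuy\tbuy\tVERB\tVBP\tTense=Pres\t0\troot\t_\t_",
    "3\tbooks\tbook\tNOUN\tNNS\tNumber=Plur\t2\tobj\t_\tSpaceAfter=No",
    "4\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_",
]
docs = list(conllu_to_docs("\n".join(rows), n_sents=1, no_print=True))
doc = docs[0]
print([(t.text, t.pos_, t.dep_, t.head.text) for t in doc])
```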
+ RETURNS (list): List of BILUO entity tags + """ + miscs = [] + for line in lines: + parts = line.split("\t") + id_, word, lemma, pos, tag, morph, head, dep, _1, misc = parts + if "-" in id_ or "." in id_: + continue + miscs.append(misc) + + iob = [] + for misc in miscs: + iob_tag = "O" + for misc_part in misc.split("|"): + tag_match = re.match(tag_pattern, misc_part) + if tag_match: + prefix = tag_match.group(2) + suffix = tag_match.group(3) + if prefix and suffix: + iob_tag = prefix + "-" + suffix + if ner_map: + suffix = ner_map.get(suffix, suffix) + if suffix == "": + iob_tag = "O" + else: + iob_tag = prefix + "-" + suffix + break + iob.append(iob_tag) + return iob_to_biluo(iob) + + +def conllu_sentence_to_doc( + vocab, + lines, + ner_tag_pattern, + merge_subtokens=False, + append_morphology=False, + ner_map=None, +): + """Create an Example from the lines for one CoNLL-U sentence, merging + subtokens and appending morphology to tags if required. + + lines (str): The non-comment lines for a CoNLL-U sentence + ner_tag_pattern (str): The regex pattern for matching NER in MISC col + RETURNS (Example): An example containing the annotation + """ + # create a Doc with each subtoken as its own token + # if merging subtokens, each subtoken orth is the merged subtoken form + if not Token.has_extension("merged_orth"): + Token.set_extension("merged_orth", default="") + if not Token.has_extension("merged_lemma"): + Token.set_extension("merged_lemma", default="") + if not Token.has_extension("merged_morph"): + Token.set_extension("merged_morph", default="") + if not Token.has_extension("merged_spaceafter"): + Token.set_extension("merged_spaceafter", default="") + words, spaces, tags, poses, morphs, lemmas = [], [], [], [], [], [] + heads, deps = [], [] + subtok_word = "" + in_subtok = False + for i in range(len(lines)): + line = lines[i] + parts = line.split("\t") + id_, word, lemma, pos, tag, morph, head, dep, _1, misc = parts + if "." 
in id_: + continue + if "-" in id_: + in_subtok = True + if "-" in id_: + in_subtok = True + subtok_word = word + subtok_start, subtok_end = id_.split("-") + subtok_spaceafter = "SpaceAfter=No" not in misc + continue + if merge_subtokens and in_subtok: + words.append(subtok_word) + else: + words.append(word) + if in_subtok: + if id_ == subtok_end: + spaces.append(subtok_spaceafter) + else: + spaces.append(False) + elif "SpaceAfter=No" in misc: + spaces.append(False) + else: + spaces.append(True) + if in_subtok and id_ == subtok_end: + subtok_word = "" + in_subtok = False + id_ = int(id_) - 1 + head = (int(head) - 1) if head not in ("0", "_") else id_ + tag = pos if tag == "_" else tag + morph = morph if morph != "_" else "" + dep = "ROOT" if dep == "root" else dep + lemmas.append(lemma) + poses.append(pos) + tags.append(tag) + morphs.append(morph) + heads.append(head) + deps.append(dep) + + doc = Doc( + vocab, + words=words, + spaces=spaces, + tags=tags, + pos=poses, + deps=deps, + lemmas=lemmas, + morphs=morphs, + heads=heads, + ) + for i in range(len(doc)): + doc[i]._.merged_orth = words[i] + doc[i]._.merged_morph = morphs[i] + doc[i]._.merged_lemma = lemmas[i] + doc[i]._.merged_spaceafter = spaces[i] + ents = get_entities(lines, ner_tag_pattern, ner_map) + doc.ents = biluo_tags_to_spans(doc, ents) + + if merge_subtokens: + doc = merge_conllu_subtokens(lines, doc) + + # create final Doc from custom Doc annotation + words, spaces, tags, morphs, lemmas, poses = [], [], [], [], [], [] + heads, deps = [], [] + for i, t in enumerate(doc): + words.append(t._.merged_orth) + lemmas.append(t._.merged_lemma) + spaces.append(t._.merged_spaceafter) + morphs.append(t._.merged_morph) + if append_morphology and t._.merged_morph: + tags.append(t.tag_ + "__" + t._.merged_morph) + else: + tags.append(t.tag_) + poses.append(t.pos_) + heads.append(t.head.i) + deps.append(t.dep_) + + doc_x = Doc( + vocab, + words=words, + spaces=spaces, + tags=tags, + morphs=morphs, + lemmas=lemmas, + pos=poses, + deps=deps, + heads=heads, + ) + doc_x.ents = [Span(doc_x, ent.start, ent.end, label=ent.label) for ent in doc.ents] + + return doc_x + + +def merge_conllu_subtokens(lines, doc): + # identify and process all subtoken spans to prepare attrs for merging + subtok_spans = [] + for line in lines: + parts = line.split("\t") + id_, word, lemma, pos, tag, morph, head, dep, _1, misc = parts + if "-" in id_: + subtok_start, subtok_end = id_.split("-") + subtok_span = doc[int(subtok_start) - 1 : int(subtok_end)] + subtok_spans.append(subtok_span) + # create merged tag, morph, and lemma values + tags = [] + morphs = {} + lemmas = [] + for token in subtok_span: + tags.append(token.tag_) + lemmas.append(token.lemma_) + if token._.merged_morph: + for feature in token._.merged_morph.split("|"): + field, values = feature.split("=", 1) + if field not in morphs: + morphs[field] = set() + for value in values.split(","): + morphs[field].add(value) + # create merged features for each morph field + for field, values in morphs.items(): + morphs[field] = field + "=" + ",".join(sorted(values)) + # set the same attrs on all subtok tokens so that whatever head the + # retokenizer chooses, the final attrs are available on that token + for token in subtok_span: + token._.merged_orth = token.orth_ + token._.merged_lemma = " ".join(lemmas) + token.tag_ = "_".join(tags) + token._.merged_morph = "|".join(sorted(morphs.values())) + token._.merged_spaceafter = ( + True if subtok_span[-1].whitespace_ else False + ) + + with doc.retokenize() as 
retokenizer: + for span in subtok_spans: + retokenizer.merge(span) + + return doc diff --git a/spacy/training/converters/iob_to_docs.py b/spacy/training/converters/iob_to_docs.py new file mode 100644 index 000000000..a2185fef7 --- /dev/null +++ b/spacy/training/converters/iob_to_docs.py @@ -0,0 +1,65 @@ +from wasabi import Printer + +from .conll_ner_to_docs import n_sents_info +from ...vocab import Vocab +from ...training import iob_to_biluo, tags_to_entities +from ...tokens import Doc, Span +from ...errors import Errors +from ...util import minibatch + + +def iob_to_docs(input_data, n_sents=10, no_print=False, *args, **kwargs): + """ + Convert IOB files with one sentence per line and tags separated with '|' + into Doc objects so they can be saved. IOB and IOB2 are accepted. + + Sample formats: + + I|O like|O London|I-GPE and|O New|B-GPE York|I-GPE City|I-GPE .|O + I|O like|O London|B-GPE and|O New|B-GPE York|I-GPE City|I-GPE .|O + I|PRP|O like|VBP|O London|NNP|I-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O + I|PRP|O like|VBP|O London|NNP|B-GPE and|CC|O New|NNP|B-GPE York|NNP|I-GPE City|NNP|I-GPE .|.|O + """ + vocab = Vocab() # need vocab to make a minimal Doc + msg = Printer(no_print=no_print) + if n_sents > 0: + n_sents_info(msg, n_sents) + docs = read_iob(input_data.split("\n"), vocab, n_sents) + return docs + + +def read_iob(raw_sents, vocab, n_sents): + docs = [] + for group in minibatch(raw_sents, size=n_sents): + tokens = [] + words = [] + tags = [] + iob = [] + sent_starts = [] + for line in group: + if not line.strip(): + continue + sent_tokens = [t.split("|") for t in line.split()] + if len(sent_tokens[0]) == 3: + sent_words, sent_tags, sent_iob = zip(*sent_tokens) + elif len(sent_tokens[0]) == 2: + sent_words, sent_iob = zip(*sent_tokens) + sent_tags = ["-"] * len(sent_words) + else: + raise ValueError(Errors.E902) + words.extend(sent_words) + tags.extend(sent_tags) + iob.extend(sent_iob) + tokens.extend(sent_tokens) + sent_starts.append(True) + sent_starts.extend([False for _ in sent_words[1:]]) + doc = Doc(vocab, words=words) + for i, tag in enumerate(tags): + doc[i].tag_ = tag + for i, sent_start in enumerate(sent_starts): + doc[i].is_sent_start = sent_start + biluo = iob_to_biluo(iob) + entities = tags_to_entities(biluo) + doc.ents = [Span(doc, start=s, end=e + 1, label=L) for (L, s, e) in entities] + docs.append(doc) + return docs diff --git a/spacy/training/converters/json_to_docs.py b/spacy/training/converters/json_to_docs.py new file mode 100644 index 000000000..d7df1d6f9 --- /dev/null +++ b/spacy/training/converters/json_to_docs.py @@ -0,0 +1,22 @@ +import srsly +from ..gold_io import json_iterate, json_to_annotations +from ..example import annotations_to_doc +from ..example import _fix_legacy_dict_data, _parse_example_dict_data +from ...util import load_model +from ...lang.xx import MultiLanguage + + +def json_to_docs(input_data, model=None, **kwargs): + nlp = load_model(model) if model is not None else MultiLanguage() + if not isinstance(input_data, bytes): + if not isinstance(input_data, str): + input_data = srsly.json_dumps(input_data) + input_data = input_data.encode("utf8") + docs = [] + for json_doc in json_iterate(input_data): + for json_para in json_to_annotations(json_doc): + example_dict = _fix_legacy_dict_data(json_para) + tok_dict, doc_dict = _parse_example_dict_data(example_dict) + doc = annotations_to_doc(nlp.vocab, tok_dict, doc_dict) + docs.append(doc) + return docs diff --git a/spacy/training/corpus.py b/spacy/training/corpus.py new 
file mode 100644 index 000000000..b3ff30e66 --- /dev/null +++ b/spacy/training/corpus.py @@ -0,0 +1,248 @@ +import warnings +from typing import Union, List, Iterable, Iterator, TYPE_CHECKING, Callable +from typing import Optional +from pathlib import Path +import srsly + +from .. import util +from .augment import dont_augment +from .example import Example +from ..errors import Warnings, Errors +from ..tokens import DocBin, Doc +from ..vocab import Vocab + +if TYPE_CHECKING: + # This lets us add type hints for mypy etc. without causing circular imports + from ..language import Language # noqa: F401 + +FILE_TYPE = ".spacy" + + +@util.registry.readers("spacy.Corpus.v1") +def create_docbin_reader( + path: Optional[Path], + gold_preproc: bool, + max_length: int = 0, + limit: int = 0, + augmenter: Optional[Callable] = None, +) -> Callable[["Language"], Iterable[Example]]: + if path is None: + raise ValueError(Errors.E913) + util.logger.debug(f"Loading corpus from path: {path}") + return Corpus( + path, + gold_preproc=gold_preproc, + max_length=max_length, + limit=limit, + augmenter=augmenter, + ) + + +@util.registry.readers("spacy.JsonlCorpus.v1") +def create_jsonl_reader( + path: Path, min_length: int = 0, max_length: int = 0, limit: int = 0 +) -> Callable[["Language"], Iterable[Doc]]: + return JsonlCorpus(path, min_length=min_length, max_length=max_length, limit=limit) + + +@util.registry.readers("spacy.read_labels.v1") +def read_labels(path: Path, *, require: bool = False): + # I decided not to give this a generic name, because I don't want people to + # use it for arbitrary stuff, as I want this require arg with default False. + if not require and not path.exists(): + return None + return srsly.read_json(path) + + +def walk_corpus(path: Union[str, Path], file_type) -> List[Path]: + path = util.ensure_path(path) + if not path.is_dir() and path.parts[-1].endswith(file_type): + return [path] + orig_path = path + paths = [path] + locs = [] + seen = set() + for path in paths: + if str(path) in seen: + continue + seen.add(str(path)) + if path.parts and path.parts[-1].startswith("."): + continue + elif path.is_dir(): + paths.extend(path.iterdir()) + elif path.parts[-1].endswith(file_type): + locs.append(path) + if len(locs) == 0: + warnings.warn(Warnings.W090.format(path=orig_path, format=file_type)) + # It's good to sort these, in case the ordering messes up a cache. + locs.sort() + return locs + + +class Corpus: + """Iterate Example objects from a file or directory of DocBin (.spacy) + formatted data files. + + path (Path): The directory or filename to read from. + gold_preproc (bool): Whether to set up the Example object with gold-standard + sentences and tokens for the predictions. Gold preprocessing helps + the annotations align to the tokenization, and may result in sequences + of more consistent length. However, it may reduce run-time accuracy due + to train/test skew. Defaults to False. + max_length (int): Maximum document length. Longer documents will be + split into sentences, if sentence boundaries are available. Defaults to + 0, which indicates no limit. + limit (int): Limit corpus to a subset of examples, e.g. for debugging. + Defaults to 0, which indicates no limit. + augment (Callable[Example, Iterable[Example]]): Optional data augmentation + function, to extrapolate additional examples from your annotations. 
+ + DOCS: https://nightly.spacy.io/api/corpus + """ + + def __init__( + self, + path: Union[str, Path], + *, + limit: int = 0, + gold_preproc: bool = False, + max_length: int = 0, + augmenter: Optional[Callable] = None, + ) -> None: + self.path = util.ensure_path(path) + self.gold_preproc = gold_preproc + self.max_length = max_length + self.limit = limit + self.augmenter = augmenter if augmenter is not None else dont_augment + + def __call__(self, nlp: "Language") -> Iterator[Example]: + """Yield examples from the data. + + nlp (Language): The current nlp object. + YIELDS (Example): The examples. + + DOCS: https://nightly.spacy.io/api/corpus#call + """ + ref_docs = self.read_docbin(nlp.vocab, walk_corpus(self.path, FILE_TYPE)) + if self.gold_preproc: + examples = self.make_examples_gold_preproc(nlp, ref_docs) + else: + examples = self.make_examples(nlp, ref_docs) + for real_eg in examples: + for augmented_eg in self.augmenter(nlp, real_eg): + yield augmented_eg + + def _make_example( + self, nlp: "Language", reference: Doc, gold_preproc: bool + ) -> Example: + if gold_preproc or reference.has_unknown_spaces: + return Example( + Doc( + nlp.vocab, + words=[word.text for word in reference], + spaces=[bool(word.whitespace_) for word in reference], + ), + reference, + ) + else: + return Example(nlp.make_doc(reference.text), reference) + + def make_examples( + self, nlp: "Language", reference_docs: Iterable[Doc] + ) -> Iterator[Example]: + for reference in reference_docs: + if len(reference) == 0: + continue + elif self.max_length == 0 or len(reference) < self.max_length: + yield self._make_example(nlp, reference, False) + elif reference.is_sentenced: + for ref_sent in reference.sents: + if len(ref_sent) == 0: + continue + elif self.max_length == 0 or len(ref_sent) < self.max_length: + yield self._make_example(nlp, ref_sent.as_doc(), False) + + def make_examples_gold_preproc( + self, nlp: "Language", reference_docs: Iterable[Doc] + ) -> Iterator[Example]: + for reference in reference_docs: + if reference.is_sentenced: + ref_sents = [sent.as_doc() for sent in reference.sents] + else: + ref_sents = [reference] + for ref_sent in ref_sents: + eg = self._make_example(nlp, ref_sent, True) + if len(eg.x): + yield eg + + def read_docbin( + self, vocab: Vocab, locs: Iterable[Union[str, Path]] + ) -> Iterator[Doc]: + """ Yield training examples as example dicts """ + i = 0 + for loc in locs: + loc = util.ensure_path(loc) + if loc.parts[-1].endswith(FILE_TYPE): + doc_bin = DocBin().from_disk(loc) + docs = doc_bin.get_docs(vocab) + for doc in docs: + if len(doc): + yield doc + i += 1 + if self.limit >= 1 and i >= self.limit: + break + + +class JsonlCorpus: + """Iterate Doc objects from a file or directory of jsonl + formatted raw text files. + + path (Path): The directory or filename to read from. + min_length (int): Minimum document length (in tokens). Shorter documents + will be skipped. Defaults to 0, which indicates no limit. + + max_length (int): Maximum document length (in tokens). Longer documents will + be skipped. Defaults to 0, which indicates no limit. + limit (int): Limit corpus to a subset of examples, e.g. for debugging. + Defaults to 0, which indicates no limit. 
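A minimal round trip through `Corpus`: serialize one annotated `Doc` to a `.spacy` file with `DocBin`, then read it back as `Example` objects. The `./train.spacy` path is purely illustrative.

```python
import spacy
from spacy.tokens import DocBin, Span
from spacy.training import Corpus

nlp = spacy.blank("en")
doc = nlp.make_doc("Berlin is a city")
doc.ents = [Span(doc, 0, 1, label="GPE")]
DocBin(docs=[doc]).to_disk("./train.spacy")   # illustrative path

corpus = Corpus("./train.spacy", gold_preproc=False)
for example in corpus(nlp):
    print(example.reference.ents, "|", example.predicted.text)
```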
+ + DOCS: https://nightly.spacy.io/api/corpus#jsonlcorpus + """ + + file_type = "jsonl" + + def __init__( + self, + path: Union[str, Path], + *, + limit: int = 0, + min_length: int = 0, + max_length: int = 0, + ) -> None: + self.path = util.ensure_path(path) + self.min_length = min_length + self.max_length = max_length + self.limit = limit + + def __call__(self, nlp: "Language") -> Iterator[Example]: + """Yield examples from the data. + + nlp (Language): The current nlp object. + YIELDS (Example): The example objects. + + DOCS: https://nightly.spacy.io/api/corpus#jsonlcorpus-call + """ + for loc in walk_corpus(self.path, ".jsonl"): + records = srsly.read_jsonl(loc) + for record in records: + doc = nlp.make_doc(record["text"]) + if self.min_length >= 1 and len(doc) < self.min_length: + continue + elif self.max_length >= 1 and len(doc) >= self.max_length: + continue + else: + words = [w.text for w in doc] + spaces = [bool(w.whitespace_) for w in doc] + # We don't *need* an example here, but it seems nice to + # make it match the Corpus signature. + yield Example(doc, Doc(nlp.vocab, words=words, spaces=spaces)) diff --git a/spacy/training/example.pxd b/spacy/training/example.pxd new file mode 100644 index 000000000..49e239757 --- /dev/null +++ b/spacy/training/example.pxd @@ -0,0 +1,12 @@ +from ..tokens.doc cimport Doc +from libc.stdint cimport uint64_t + + +cdef class Example: + cdef readonly Doc x + cdef readonly Doc y + cdef readonly object _cached_alignment + cdef readonly object _cached_words_x + cdef readonly object _cached_words_y + cdef readonly uint64_t _x_sig + cdef readonly uint64_t _y_sig diff --git a/spacy/training/example.pyx b/spacy/training/example.pyx new file mode 100644 index 000000000..a8da49c61 --- /dev/null +++ b/spacy/training/example.pyx @@ -0,0 +1,497 @@ +from collections.abc import Iterable as IterableInstance +import warnings +import numpy +from murmurhash.mrmr cimport hash64 + +from ..tokens.doc cimport Doc +from ..tokens.span cimport Span +from ..tokens.span import Span +from ..attrs import IDS +from .align import Alignment +from .iob_utils import biluo_to_iob, offsets_to_biluo_tags, doc_to_biluo_tags +from .iob_utils import biluo_tags_to_spans +from ..errors import Errors, Warnings +from ..pipeline._parser_internals import nonproj +from ..util import logger + + +cpdef Doc annotations_to_doc(vocab, tok_annot, doc_annot): + """ Create a Doc from dictionaries with token and doc annotations. """ + attrs, array = _annot2array(vocab, tok_annot, doc_annot) + output = Doc(vocab, words=tok_annot["ORTH"], spaces=tok_annot["SPACY"]) + if "entities" in doc_annot: + _add_entities_to_doc(output, doc_annot["entities"]) + if array.size: + output = output.from_array(attrs, array) + # links are currently added with ENT_KB_ID on the token level + output.cats.update(doc_annot.get("cats", {})) + return output + + +def validate_examples(examples, method): + """Check that a batch of examples received during processing is valid. + This function lives here to prevent circular imports. + + examples (Iterable[Examples]): A batch of examples. + method (str): The method name to show in error messages. 
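`JsonlCorpus` is not re-exported from `spacy.training` in the `__init__.py` above, so the sketch below imports it from the module directly; the `./raw.jsonl` path is illustrative.

```python
import srsly
import spacy
from spacy.training.corpus import JsonlCorpus

nlp = spacy.blank("en")
srsly.write_jsonl("./raw.jsonl", [{"text": "This is a sentence."}, {"text": "ok"}])

corpus = JsonlCorpus("./raw.jsonl", min_length=3)
for example in corpus(nlp):
    print(example.text)   # only the first record passes the min_length filter
```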
+ """ + if not isinstance(examples, IterableInstance): + err = Errors.E978.format(name=method, types=type(examples)) + raise TypeError(err) + wrong = set([type(eg) for eg in examples if not isinstance(eg, Example)]) + if wrong: + err = Errors.E978.format(name=method, types=wrong) + raise TypeError(err) + + +def validate_get_examples(get_examples, method): + """Check that a generator of a batch of examples received during processing is valid: + the callable produces a non-empty list of Example objects. + This function lives here to prevent circular imports. + + get_examples (Callable[[], Iterable[Example]]): A function that produces a batch of examples. + method (str): The method name to show in error messages. + """ + if get_examples is None or not hasattr(get_examples, "__call__"): + err = Errors.E930.format(method=method, obj=type(get_examples)) + raise TypeError(err) + examples = get_examples() + if not examples: + err = Errors.E930.format(method=method, obj=examples) + raise TypeError(err) + validate_examples(examples, method) + + +cdef class Example: + def __init__(self, Doc predicted, Doc reference, *, alignment=None): + if predicted is None: + raise TypeError(Errors.E972.format(arg="predicted")) + if reference is None: + raise TypeError(Errors.E972.format(arg="reference")) + self.predicted = predicted + self.reference = reference + self._cached_alignment = alignment + + def __len__(self): + return len(self.predicted) + + property predicted: + def __get__(self): + return self.x + + def __set__(self, doc): + self.x = doc + self._cached_alignment = None + self._cached_words_x = [t.text for t in doc] + + property reference: + def __get__(self): + return self.y + + def __set__(self, doc): + self.y = doc + self._cached_alignment = None + self._cached_words_y = [t.text for t in doc] + + def copy(self): + return Example( + self.x.copy(), + self.y.copy() + ) + + @classmethod + def from_dict(cls, Doc predicted, dict example_dict): + if predicted is None: + raise ValueError(Errors.E976.format(n="first", type="Doc")) + if example_dict is None: + raise ValueError(Errors.E976.format(n="second", type="dict")) + example_dict = _fix_legacy_dict_data(example_dict) + tok_dict, doc_dict = _parse_example_dict_data(example_dict) + if "ORTH" not in tok_dict: + tok_dict["ORTH"] = [tok.text for tok in predicted] + tok_dict["SPACY"] = [tok.whitespace_ for tok in predicted] + return Example( + predicted, + annotations_to_doc(predicted.vocab, tok_dict, doc_dict) + ) + + @property + def alignment(self): + x_sig = hash64(self.x.c, sizeof(self.x.c[0]) * self.x.length, 0) + y_sig = hash64(self.y.c, sizeof(self.y.c[0]) * self.y.length, 0) + if self._cached_alignment is None: + words_x = [token.text for token in self.x] + words_y = [token.text for token in self.y] + self._x_sig = x_sig + self._y_sig = y_sig + self._cached_words_x = words_x + self._cached_words_y = words_y + self._cached_alignment = Alignment.from_strings(words_x, words_y) + return self._cached_alignment + elif self._x_sig == x_sig and self._y_sig == y_sig: + # If we have a cached alignment, check whether the cache is invalid + # due to retokenization. To make this check fast in loops, we first + # check a hash of the TokenC arrays. 
+ return self._cached_alignment + else: + words_x = [token.text for token in self.x] + words_y = [token.text for token in self.y] + if words_x == self._cached_words_x and words_y == self._cached_words_y: + self._x_sig = x_sig + self._y_sig = y_sig + return self._cached_alignment + else: + self._cached_alignment = Alignment.from_strings(words_x, words_y) + self._cached_words_x = words_x + self._cached_words_y = words_y + self._x_sig = x_sig + self._y_sig = y_sig + return self._cached_alignment + + def get_aligned(self, field, as_string=False): + """Return an aligned array for a token attribute.""" + align = self.alignment.x2y + + vocab = self.reference.vocab + gold_values = self.reference.to_array([field]) + output = [None] * len(self.predicted) + for token in self.predicted: + if token.is_space: + output[token.i] = None + else: + values = gold_values[align[token.i].dataXd] + values = values.ravel() + if len(values) == 0: + output[token.i] = None + elif len(values) == 1: + output[token.i] = values[0] + elif len(set(list(values))) == 1: + # If all aligned tokens have the same value, use it. + output[token.i] = values[0] + else: + output[token.i] = None + if as_string and field not in ["ENT_IOB", "SENT_START"]: + output = [vocab.strings[o] if o is not None else o for o in output] + return output + + def get_aligned_parse(self, projectivize=True): + cand_to_gold = self.alignment.x2y + gold_to_cand = self.alignment.y2x + aligned_heads = [None] * self.x.length + aligned_deps = [None] * self.x.length + heads = [token.head.i for token in self.y] + deps = [token.dep_ for token in self.y] + if projectivize: + heads, deps = nonproj.projectivize(heads, deps) + for cand_i in range(self.x.length): + if cand_to_gold.lengths[cand_i] == 1: + gold_i = cand_to_gold[cand_i].dataXd[0, 0] + if gold_to_cand.lengths[heads[gold_i]] == 1: + aligned_heads[cand_i] = int(gold_to_cand[heads[gold_i]].dataXd[0, 0]) + aligned_deps[cand_i] = deps[gold_i] + return aligned_heads, aligned_deps + + def get_aligned_spans_x2y(self, x_spans): + return self._get_aligned_spans(self.y, x_spans, self.alignment.x2y) + + def get_aligned_spans_y2x(self, y_spans): + return self._get_aligned_spans(self.x, y_spans, self.alignment.y2x) + + def _get_aligned_spans(self, doc, spans, align): + seen = set() + output = [] + for span in spans: + indices = align[span.start : span.end].data.ravel() + indices = [idx for idx in indices if idx not in seen] + if len(indices) >= 1: + aligned_span = Span(doc, indices[0], indices[-1] + 1, label=span.label) + target_text = span.text.lower().strip().replace(" ", "") + our_text = aligned_span.text.lower().strip().replace(" ", "") + if our_text == target_text: + output.append(aligned_span) + seen.update(indices) + return output + + def get_aligned_ner(self): + if not self.y.has_annotation("ENT_IOB"): + return [None] * len(self.x) # should this be 'missing' instead of 'None' ? + x_ents = self.get_aligned_spans_y2x(self.y.ents) + # Default to 'None' for missing values + x_tags = offsets_to_biluo_tags( + self.x, + [(e.start_char, e.end_char, e.label_) for e in x_ents], + missing=None + ) + # Now fill the tokens we can align to O. 
+ O = 2 # I=1, O=2, B=3 + for i, ent_iob in enumerate(self.get_aligned("ENT_IOB")): + if x_tags[i] is None: + if ent_iob == O: + x_tags[i] = "O" + elif self.x[i].is_space: + x_tags[i] = "O" + return x_tags + + def to_dict(self): + return { + "doc_annotation": { + "cats": dict(self.reference.cats), + "entities": doc_to_biluo_tags(self.reference), + "links": self._links_to_dict() + }, + "token_annotation": { + "ORTH": [t.text for t in self.reference], + "SPACY": [bool(t.whitespace_) for t in self.reference], + "TAG": [t.tag_ for t in self.reference], + "LEMMA": [t.lemma_ for t in self.reference], + "POS": [t.pos_ for t in self.reference], + "MORPH": [str(t.morph) for t in self.reference], + "HEAD": [t.head.i for t in self.reference], + "DEP": [t.dep_ for t in self.reference], + "SENT_START": [int(bool(t.is_sent_start)) for t in self.reference] + } + } + + def _links_to_dict(self): + links = {} + for ent in self.reference.ents: + if ent.kb_id_: + links[(ent.start_char, ent.end_char)] = {ent.kb_id_: 1.0} + return links + + def split_sents(self): + """ Split the token annotations into multiple Examples based on + sent_starts and return a list of the new Examples""" + if not self.reference.has_annotation("SENT_START"): + return [self] + + align = self.alignment.y2x + seen_indices = set() + output = [] + for y_sent in self.reference.sents: + indices = align[y_sent.start : y_sent.end].data.ravel() + indices = [idx for idx in indices if idx not in seen_indices] + if indices: + x_sent = self.predicted[indices[0] : indices[-1] + 1] + output.append(Example(x_sent.as_doc(), y_sent.as_doc())) + seen_indices.update(indices) + return output + + property text: + def __get__(self): + return self.x.text + + def __str__(self): + return str(self.to_dict()) + + def __repr__(self): + return str(self.to_dict()) + + +def _annot2array(vocab, tok_annot, doc_annot): + attrs = [] + values = [] + + for key, value in doc_annot.items(): + if value: + if key == "entities": + pass + elif key == "links": + ent_kb_ids = _parse_links(vocab, tok_annot["ORTH"], tok_annot["SPACY"], value) + tok_annot["ENT_KB_ID"] = ent_kb_ids + elif key == "cats": + pass + else: + raise ValueError(Errors.E974.format(obj="doc", key=key)) + + for key, value in tok_annot.items(): + if key not in IDS: + raise ValueError(Errors.E974.format(obj="token", key=key)) + elif key in ["ORTH", "SPACY"]: + pass + elif key == "HEAD": + attrs.append(key) + values.append([h-i for i, h in enumerate(value)]) + elif key == "SENT_START": + attrs.append(key) + values.append(value) + elif key == "MORPH": + attrs.append(key) + values.append([vocab.morphology.add(v) for v in value]) + else: + attrs.append(key) + if not all(isinstance(v, str) for v in value): + types = set([type(v) for v in value]) + raise TypeError(Errors.E969.format(field=key, types=types)) from None + values.append([vocab.strings.add(v) for v in value]) + array = numpy.asarray(values, dtype="uint64") + return attrs, array.T + + +def _add_entities_to_doc(doc, ner_data): + if ner_data is None: + return + elif ner_data == []: + doc.ents = [] + elif isinstance(ner_data[0], tuple): + return _add_entities_to_doc( + doc, + offsets_to_biluo_tags(doc, ner_data) + ) + elif isinstance(ner_data[0], str) or ner_data[0] is None: + return _add_entities_to_doc( + doc, + biluo_tags_to_spans(doc, ner_data) + ) + elif isinstance(ner_data[0], Span): + entities = [] + missing = [] + for span in ner_data: + if span.label: + entities.append(span) + else: + missing.append(span) + doc.set_ents(entities, missing=missing) + 
else: + raise ValueError(Errors.E973) + + +def _parse_example_dict_data(example_dict): + return ( + example_dict["token_annotation"], + example_dict["doc_annotation"] + ) + + +def _fix_legacy_dict_data(example_dict): + token_dict = example_dict.get("token_annotation", {}) + doc_dict = example_dict.get("doc_annotation", {}) + for key, value in example_dict.items(): + if value: + if key in ("token_annotation", "doc_annotation"): + pass + elif key == "ids": + pass + elif key in ("cats", "links"): + doc_dict[key] = value + elif key in ("ner", "entities"): + doc_dict["entities"] = value + else: + token_dict[key] = value + # Remap keys + remapping = { + "words": "ORTH", + "tags": "TAG", + "pos": "POS", + "lemmas": "LEMMA", + "deps": "DEP", + "heads": "HEAD", + "sent_starts": "SENT_START", + "morphs": "MORPH", + "spaces": "SPACY", + } + old_token_dict = token_dict + token_dict = {} + for key, value in old_token_dict.items(): + if key in ("text", "ids", "brackets"): + pass + elif key in remapping.values(): + token_dict[key] = value + elif key.lower() in remapping: + token_dict[remapping[key.lower()]] = value + else: + all_keys = set(remapping.values()) + all_keys.update(remapping.keys()) + raise KeyError(Errors.E983.format(key=key, dict="token_annotation", keys=all_keys)) + text = example_dict.get("text", example_dict.get("raw")) + if _has_field(token_dict, "ORTH") and not _has_field(token_dict, "SPACY"): + token_dict["SPACY"] = _guess_spaces(text, token_dict["ORTH"]) + if "HEAD" in token_dict and "SENT_START" in token_dict: + # If heads are set, we don't also redundantly specify SENT_START. + token_dict.pop("SENT_START") + logger.debug(Warnings.W092) + return { + "token_annotation": token_dict, + "doc_annotation": doc_dict + } + +def _has_field(annot, field): + if field not in annot: + return False + elif annot[field] is None: + return False + elif len(annot[field]) == 0: + return False + elif all([value is None for value in annot[field]]): + return False + else: + return True + + +def _parse_ner_tags(biluo_or_offsets, vocab, words, spaces): + if isinstance(biluo_or_offsets[0], (list, tuple)): + # Convert to biluo if necessary + # This is annoying but to convert the offsets we need a Doc + # that has the target tokenization. 
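+ # e.g. words ["Tokyo", "is"] with the offset (0, 5, "LOC") produce
+ # the BILUO tags ["U-LOC", "O"].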
+ reference = Doc(vocab, words=words, spaces=spaces) + biluo = offsets_to_biluo_tags(reference, biluo_or_offsets) + else: + biluo = biluo_or_offsets + ent_iobs = [] + ent_types = [] + for iob_tag in biluo_to_iob(biluo): + if iob_tag in (None, "-"): + ent_iobs.append("") + ent_types.append("") + else: + ent_iobs.append(iob_tag.split("-")[0]) + if iob_tag.startswith("I") or iob_tag.startswith("B"): + ent_types.append(iob_tag.split("-", 1)[1]) + else: + ent_types.append("") + return ent_iobs, ent_types + +def _parse_links(vocab, words, spaces, links): + reference = Doc(vocab, words=words, spaces=spaces) + starts = {token.idx: token.i for token in reference} + ends = {token.idx + len(token): token.i for token in reference} + ent_kb_ids = ["" for _ in reference] + + for index, annot_dict in links.items(): + true_kb_ids = [] + for key, value in annot_dict.items(): + if value == 1.0: + true_kb_ids.append(key) + if len(true_kb_ids) > 1: + raise ValueError(Errors.E980) + + if len(true_kb_ids) == 1: + start_char, end_char = index + start_token = starts.get(start_char) + end_token = ends.get(end_char) + if start_token is None or end_token is None: + raise ValueError(Errors.E981) + for i in range(start_token, end_token+1): + ent_kb_ids[i] = true_kb_ids[0] + + return ent_kb_ids + + +def _guess_spaces(text, words): + if text is None: + return None + spaces = [] + text_pos = 0 + # align words with text + for word in words: + try: + word_start = text[text_pos:].index(word) + except ValueError: + spaces.append(True) + continue + text_pos += word_start + len(word) + if text_pos < len(text) and text[text_pos] == " ": + spaces.append(True) + else: + spaces.append(False) + return spaces diff --git a/spacy/training/gold_io.pyx b/spacy/training/gold_io.pyx new file mode 100644 index 000000000..8fb6b8565 --- /dev/null +++ b/spacy/training/gold_io.pyx @@ -0,0 +1,207 @@ +import warnings +import srsly +from .. import util +from ..errors import Warnings +from ..tokens import Doc +from .iob_utils import offsets_to_biluo_tags, tags_to_entities +import json + + +def docs_to_json(docs, doc_id=0, ner_missing_tag="O"): + """Convert a list of Doc objects into the JSON-serializable format used by + the spacy train command. + + docs (iterable / Doc): The Doc object(s) to convert. + doc_id (int): Id for the JSON. 
+ RETURNS (dict): The data in spaCy's JSON format + - each input doc will be treated as a paragraph in the output doc + """ + if isinstance(docs, Doc): + docs = [docs] + json_doc = {"id": doc_id, "paragraphs": []} + for i, doc in enumerate(docs): + json_para = {'raw': doc.text, "sentences": [], "cats": [], "entities": [], "links": []} + for cat, val in doc.cats.items(): + json_cat = {"label": cat, "value": val} + json_para["cats"].append(json_cat) + # warning: entities information is currently duplicated as + # doc-level "entities" and token-level "ner" + for ent in doc.ents: + ent_tuple = (ent.start_char, ent.end_char, ent.label_) + json_para["entities"].append(ent_tuple) + if ent.kb_id_: + link_dict = {(ent.start_char, ent.end_char): {ent.kb_id_: 1.0}} + json_para["links"].append(link_dict) + biluo_tags = offsets_to_biluo_tags(doc, json_para["entities"], missing=ner_missing_tag) + attrs = ("TAG", "POS", "MORPH", "LEMMA", "DEP", "ENT_IOB") + include_annotation = {attr: doc.has_annotation(attr) for attr in attrs} + for j, sent in enumerate(doc.sents): + json_sent = {"tokens": [], "brackets": []} + for token in sent: + json_token = {"id": token.i, "orth": token.text, "space": token.whitespace_} + if include_annotation["TAG"]: + json_token["tag"] = token.tag_ + if include_annotation["POS"]: + json_token["pos"] = token.pos_ + if include_annotation["MORPH"]: + json_token["morph"] = str(token.morph) + if include_annotation["LEMMA"]: + json_token["lemma"] = token.lemma_ + if include_annotation["DEP"]: + json_token["head"] = token.head.i-token.i + json_token["dep"] = token.dep_ + if include_annotation["ENT_IOB"]: + json_token["ner"] = biluo_tags[token.i] + json_sent["tokens"].append(json_token) + json_para["sentences"].append(json_sent) + json_doc["paragraphs"].append(json_para) + return json_doc + + +def read_json_file(loc, docs_filter=None, limit=None): + """Read Example dictionaries from a json file or directory.""" + loc = util.ensure_path(loc) + if loc.is_dir(): + for filename in sorted(loc.iterdir()): + yield from read_json_file(loc / filename, limit=limit) + else: + with loc.open("rb") as file_: + utf8_str = file_.read() + for json_doc in json_iterate(utf8_str): + if docs_filter is not None and not docs_filter(json_doc): + continue + for json_paragraph in json_to_annotations(json_doc): + yield json_paragraph + + +def json_to_annotations(doc): + """Convert an item in the JSON-formatted training data to the format + used by Example. + + doc (dict): One entry in the training data. 
+ YIELDS (tuple): The reformatted data - one training example per paragraph + """ + for paragraph in doc["paragraphs"]: + example = {"text": paragraph.get("raw", None)} + words = [] + spaces = [] + ids = [] + tags = [] + ner_tags = [] + pos = [] + morphs = [] + lemmas = [] + heads = [] + labels = [] + sent_starts = [] + brackets = [] + for sent in paragraph["sentences"]: + sent_start_i = len(words) + for i, token in enumerate(sent["tokens"]): + words.append(token["orth"]) + spaces.append(token.get("space", None)) + ids.append(token.get('id', sent_start_i + i)) + tags.append(token.get("tag", None)) + pos.append(token.get("pos", None)) + morphs.append(token.get("morph", None)) + lemmas.append(token.get("lemma", None)) + if "head" in token: + heads.append(token["head"] + sent_start_i + i) + else: + heads.append(None) + if "dep" in token: + labels.append(token["dep"]) + # Ensure ROOT label is case-insensitive + if labels[-1].lower() == "root": + labels[-1] = "ROOT" + else: + labels.append(None) + ner_tags.append(token.get("ner", None)) + if i == 0: + sent_starts.append(1) + else: + sent_starts.append(0) + if "brackets" in sent: + brackets.extend((b["first"] + sent_start_i, + b["last"] + sent_start_i, b["label"]) + for b in sent["brackets"]) + + example["token_annotation"] = dict( + ids=ids, + words=words, + spaces=spaces, + sent_starts=sent_starts, + brackets=brackets + ) + # avoid including dummy values that looks like gold info was present + if any(tags): + example["token_annotation"]["tags"] = tags + if any(pos): + example["token_annotation"]["pos"] = pos + if any(morphs): + example["token_annotation"]["morphs"] = morphs + if any(lemmas): + example["token_annotation"]["lemmas"] = lemmas + if any(head is not None for head in heads): + example["token_annotation"]["heads"] = heads + if any(labels): + example["token_annotation"]["deps"] = labels + + cats = {} + for cat in paragraph.get("cats", {}): + cats[cat["label"]] = cat["value"] + example["doc_annotation"] = dict( + cats=cats, + entities=ner_tags, + links=paragraph.get("links", []) + ) + yield example + +def json_iterate(bytes utf8_str): + # We should've made these files jsonl...But since we didn't, parse out + # the docs one-by-one to reduce memory usage. + # It's okay to read in the whole file -- just don't parse it into JSON. 
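+ # The scan below walks the bytes and tracks bracket depth, slicing out
+ # each top-level object of the outer JSON array and parsing it on its own.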
+ cdef long file_length = len(utf8_str) + if file_length > 2 ** 30: + warnings.warn(Warnings.W027.format(size=file_length)) + + raw = utf8_str + cdef int square_depth = 0 + cdef int curly_depth = 0 + cdef int inside_string = 0 + cdef int escape = 0 + cdef long start = -1 + cdef char c + cdef char quote = ord('"') + cdef char backslash = ord("\\") + cdef char open_square = ord("[") + cdef char close_square = ord("]") + cdef char open_curly = ord("{") + cdef char close_curly = ord("}") + for i in range(file_length): + c = raw[i] + if escape: + escape = False + continue + if c == backslash: + escape = True + continue + if c == quote: + inside_string = not inside_string + continue + if inside_string: + continue + if c == open_square: + square_depth += 1 + elif c == close_square: + square_depth -= 1 + elif c == open_curly: + if square_depth == 1 and curly_depth == 0: + start = i + curly_depth += 1 + elif c == close_curly: + curly_depth -= 1 + if square_depth == 1 and curly_depth == 0: + substr = utf8_str[start : i + 1].decode("utf8") + yield srsly.json_loads(substr) + start = -1 diff --git a/spacy/training/initialize.py b/spacy/training/initialize.py new file mode 100644 index 000000000..7c84caf95 --- /dev/null +++ b/spacy/training/initialize.py @@ -0,0 +1,259 @@ +from typing import Union, Dict, Optional, Any, List, IO, TYPE_CHECKING +from thinc.api import Config, fix_random_seed, set_gpu_allocator +from thinc.api import ConfigValidationError +from pathlib import Path +import srsly +import numpy +import tarfile +import gzip +import zipfile +import tqdm + +from ..lookups import Lookups +from ..vectors import Vectors +from ..errors import Errors +from ..schemas import ConfigSchemaTraining +from ..util import registry, load_model_from_config, resolve_dot_names, logger +from ..util import load_model, ensure_path, OOV_RANK, DEFAULT_OOV_PROB + +if TYPE_CHECKING: + from ..language import Language # noqa: F401 + + +def init_nlp(config: Config, *, use_gpu: int = -1) -> "Language": + raw_config = config + config = raw_config.interpolate() + if config["training"]["seed"] is not None: + fix_random_seed(config["training"]["seed"]) + allocator = config["training"]["gpu_allocator"] + if use_gpu >= 0 and allocator: + set_gpu_allocator(allocator) + # Use original config here before it's resolved to functions + sourced_components = get_sourced_components(config) + nlp = load_model_from_config(raw_config, auto_fill=True) + logger.info("Set up nlp object from config") + config = nlp.config.interpolate() + # Resolve all training-relevant sections using the filled nlp config + T = registry.resolve(config["training"], schema=ConfigSchemaTraining) + dot_names = [T["train_corpus"], T["dev_corpus"]] + train_corpus, dev_corpus = resolve_dot_names(config, dot_names) + optimizer = T["optimizer"] + # Components that shouldn't be updated during training + frozen_components = T["frozen_components"] + # Sourced components that require resume_training + resume_components = [p for p in sourced_components if p not in frozen_components] + logger.info(f"Pipeline: {nlp.pipe_names}") + if resume_components: + with nlp.select_pipes(enable=resume_components): + logger.info(f"Resuming training for: {resume_components}") + nlp.resume_training(sgd=optimizer) + with nlp.select_pipes(disable=[*frozen_components, *resume_components]): + nlp.initialize(lambda: train_corpus(nlp), sgd=optimizer) + logger.info(f"Initialized pipeline components: {nlp.pipe_names}") + return nlp + + +def init_vocab( + nlp: "Language", + *, + data: Optional[Path] = 
None, + lookups: Optional[Lookups] = None, + vectors: Optional[str] = None, +) -> "Language": + if lookups: + nlp.vocab.lookups = lookups + logger.info(f"Added vocab lookups: {', '.join(lookups.tables)}") + data_path = ensure_path(data) + if data_path is not None: + lex_attrs = srsly.read_jsonl(data_path) + for lexeme in nlp.vocab: + lexeme.rank = OOV_RANK + for attrs in lex_attrs: + if "settings" in attrs: + continue + lexeme = nlp.vocab[attrs["orth"]] + lexeme.set_attrs(**attrs) + if len(nlp.vocab): + oov_prob = min(lex.prob for lex in nlp.vocab) - 1 + else: + oov_prob = DEFAULT_OOV_PROB + nlp.vocab.cfg.update({"oov_prob": oov_prob}) + logger.info(f"Added {len(nlp.vocab)} lexical entries to the vocab") + logger.info("Created vocabulary") + if vectors is not None: + load_vectors_into_model(nlp, vectors) + logger.info(f"Added vectors: {vectors}") + logger.info("Finished initializing nlp object") + + +def load_vectors_into_model( + nlp: "Language", name: Union[str, Path], *, add_strings: bool = True +) -> None: + """Load word vectors from an installed model or path into a model instance.""" + try: + vectors_nlp = load_model(name) + except ConfigValidationError as e: + title = f"Config validation error for vectors {name}" + desc = ( + "This typically means that there's a problem in the config.cfg included " + "with the packaged vectors. Make sure that the vectors package you're " + "loading is compatible with the current version of spaCy." + ) + err = ConfigValidationError.from_error(e, config=None, title=title, desc=desc) + raise err from None + nlp.vocab.vectors = vectors_nlp.vocab.vectors + if add_strings: + # I guess we should add the strings from the vectors_nlp model? + # E.g. if someone does a similarity query, they might expect the strings. + for key in nlp.vocab.vectors.key2row: + if key in vectors_nlp.vocab.strings: + nlp.vocab.strings.add(vectors_nlp.vocab.strings[key]) + + +def init_tok2vec( + nlp: "Language", pretrain_config: Dict[str, Any], init_config: Dict[str, Any] +) -> bool: + # Load pretrained tok2vec weights - cf. CLI command 'pretrain' + P = pretrain_config + I = init_config + weights_data = None + init_tok2vec = ensure_path(I["init_tok2vec"]) + if init_tok2vec is not None: + if P["objective"].get("type") == "vectors" and not I["vectors"]: + err = 'need initialize.vectors if pretraining.objective.type is "vectors"' + errors = [{"loc": ["initialize"], "msg": err}] + raise ConfigValidationError(config=nlp.config, errors=errors) + if not init_tok2vec.exists(): + err = f"can't find pretrained tok2vec: {init_tok2vec}" + errors = [{"loc": ["initialize", "init_tok2vec"], "msg": err}] + raise ConfigValidationError(config=nlp.config, errors=errors) + with init_tok2vec.open("rb") as file_: + weights_data = file_.read() + if weights_data is not None: + tok2vec_component = P["component"] + if tok2vec_component is None: + desc = ( + f"To use pretrained tok2vec weights, [pretraining.component] " + f"needs to specify the component that should load them." + ) + err = "component can't be null" + errors = [{"loc": ["pretraining", "component"], "msg": err}] + raise ConfigValidationError( + config=nlp.config["pretraining"], errors=errors, desc=desc + ) + layer = nlp.get_pipe(tok2vec_component).model + if P["layer"]: + layer = layer.get_ref(P["layer"]) + layer.from_bytes(weights_data) + return True + return False + + +def get_sourced_components(config: Union[Dict[str, Any], Config]) -> List[str]: + """RETURNS (List[str]): All sourced components in the original config, + e.g. 
{"source": "en_core_web_sm"}. If the config contains a key + "factory", we assume it refers to a component factory. + """ + return [ + name + for name, cfg in config.get("components", {}).items() + if "factory" not in cfg and "source" in cfg + ] + + +def convert_vectors( + nlp: "Language", + vectors_loc: Optional[Path], + *, + truncate: int, + prune: int, + name: Optional[str] = None, +) -> None: + vectors_loc = ensure_path(vectors_loc) + if vectors_loc and vectors_loc.parts[-1].endswith(".npz"): + nlp.vocab.vectors = Vectors(data=numpy.load(vectors_loc.open("rb"))) + for lex in nlp.vocab: + if lex.rank and lex.rank != OOV_RANK: + nlp.vocab.vectors.add(lex.orth, row=lex.rank) + else: + if vectors_loc: + logger.info(f"Reading vectors from {vectors_loc}") + vectors_data, vector_keys = read_vectors(vectors_loc, truncate) + logger.info(f"Loaded vectors from {vectors_loc}") + else: + vectors_data, vector_keys = (None, None) + if vector_keys is not None: + for word in vector_keys: + if word not in nlp.vocab: + nlp.vocab[word] + if vectors_data is not None: + nlp.vocab.vectors = Vectors(data=vectors_data, keys=vector_keys) + if name is None: + # TODO: Is this correct? Does this matter? + nlp.vocab.vectors.name = f"{nlp.meta['lang']}_{nlp.meta['name']}.vectors" + else: + nlp.vocab.vectors.name = name + nlp.meta["vectors"]["name"] = nlp.vocab.vectors.name + if prune >= 1: + nlp.vocab.prune_vectors(prune) + + +def read_vectors(vectors_loc: Path, truncate_vectors: int): + f = open_file(vectors_loc) + f = ensure_shape(f) + shape = tuple(int(size) for size in next(f).split()) + if truncate_vectors >= 1: + shape = (truncate_vectors, shape[1]) + vectors_data = numpy.zeros(shape=shape, dtype="f") + vectors_keys = [] + for i, line in enumerate(tqdm.tqdm(f)): + line = line.rstrip() + pieces = line.rsplit(" ", vectors_data.shape[1]) + word = pieces.pop(0) + if len(pieces) != vectors_data.shape[1]: + raise ValueError(Errors.E094.format(line_num=i, loc=vectors_loc)) + vectors_data[i] = numpy.asarray(pieces, dtype="f") + vectors_keys.append(word) + if i == truncate_vectors - 1: + break + return vectors_data, vectors_keys + + +def open_file(loc: Union[str, Path]) -> IO: + """Handle .gz, .tar.gz or unzipped files""" + loc = ensure_path(loc) + if tarfile.is_tarfile(str(loc)): + return tarfile.open(str(loc), "r:gz") + elif loc.parts[-1].endswith("gz"): + return (line.decode("utf8") for line in gzip.open(str(loc), "r")) + elif loc.parts[-1].endswith("zip"): + zip_file = zipfile.ZipFile(str(loc)) + names = zip_file.namelist() + file_ = zip_file.open(names[0]) + return (line.decode("utf8") for line in file_) + else: + return loc.open("r", encoding="utf8") + + +def ensure_shape(lines): + """Ensure that the first line of the data is the vectors shape. + If it's not, we read in the data and output the shape as the first result, + so that the reader doesn't have to deal with the problem. + """ + first_line = next(lines) + try: + shape = tuple(int(size) for size in first_line.split()) + except ValueError: + shape = None + if shape is not None: + # All good, give the data + yield first_line + yield from lines + else: + # Figure out the shape, make it the first value, and then give the + # rest of the data. 
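+ # e.g. if 20000 rows of 300-dim vectors were buffered, the synthesized
+ # header line is "20000 300".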
+ width = len(first_line.split()) - 1 + captured = [first_line] + list(lines) + length = len(captured) + yield f"{length} {width}" + yield from captured diff --git a/spacy/training/iob_utils.py b/spacy/training/iob_utils.py new file mode 100644 index 000000000..0e8e7eed0 --- /dev/null +++ b/spacy/training/iob_utils.py @@ -0,0 +1,218 @@ +from typing import List, Tuple, Iterable, Union, Iterator +import warnings + +from ..errors import Errors, Warnings +from ..tokens import Span, Doc + + +def iob_to_biluo(tags: Iterable[str]) -> List[str]: + out = [] + tags = list(tags) + while tags: + out.extend(_consume_os(tags)) + out.extend(_consume_ent(tags)) + return out + + +def biluo_to_iob(tags: Iterable[str]) -> List[str]: + out = [] + for tag in tags: + if tag is None: + out.append(tag) + else: + tag = tag.replace("U-", "B-", 1).replace("L-", "I-", 1) + out.append(tag) + return out + + +def _consume_os(tags: List[str]) -> Iterator[str]: + while tags and tags[0] == "O": + yield tags.pop(0) + + +def _consume_ent(tags: List[str]) -> List[str]: + if not tags: + return [] + tag = tags.pop(0) + target_in = "I" + tag[1:] + target_last = "L" + tag[1:] + length = 1 + while tags and tags[0] in {target_in, target_last}: + length += 1 + tags.pop(0) + label = tag[2:] + if length == 1: + if len(label) == 0: + raise ValueError(Errors.E177.format(tag=tag)) + return ["U-" + label] + else: + start = "B-" + label + end = "L-" + label + middle = [f"I-{label}" for _ in range(1, length - 1)] + return [start] + middle + [end] + + +def doc_to_biluo_tags(doc: Doc, missing: str = "O"): + return offsets_to_biluo_tags( + doc, + [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents], + missing=missing, + ) + + +def offsets_to_biluo_tags( + doc: Doc, entities: Iterable[Tuple[int, int, Union[str, int]]], missing: str = "O" +) -> List[str]: + """Encode labelled spans into per-token tags, using the + Begin/In/Last/Unit/Out scheme (BILUO). + + doc (Doc): The document that the entity offsets refer to. The output tags + will refer to the token boundaries within the document. + entities (iterable): A sequence of `(start, end, label)` triples. `start` + and `end` should be character-offset integers denoting the slice into + the original string. + RETURNS (list): A list of unicode strings, describing the tags. Each tag + string will be of the form either "", "O" or "{action}-{label}", where + action is one of "B", "I", "L", "U". The missing label is used where the + entity offsets don't align with the tokenization in the `Doc` object. + The training algorithm will view these as missing values. "O" denotes a + non-entity token. "B" denotes the beginning of a multi-token entity, + "I" the inside of an entity of three or more tokens, and "L" the end + of an entity of two or more tokens. "U" denotes a single-token entity. + + EXAMPLE: + >>> text = 'I like London.' 
+ >>> entities = [(len('I like '), len('I like London'), 'LOC')] + >>> doc = nlp.tokenizer(text) + >>> tags = offsets_to_biluo_tags(doc, entities) + >>> assert tags == ["O", "O", 'U-LOC', "O"] + """ + # Ensure no overlapping entity labels exist + tokens_in_ents = {} + starts = {token.idx: token.i for token in doc} + ends = {token.idx + len(token): token.i for token in doc} + biluo = ["-" for _ in doc] + # Handle entity cases + for start_char, end_char, label in entities: + if not label: + for s in starts: # account for many-to-one + if s >= start_char and s < end_char: + biluo[starts[s]] = "O" + else: + for token_index in range(start_char, end_char): + if token_index in tokens_in_ents.keys(): + raise ValueError( + Errors.E103.format( + span1=( + tokens_in_ents[token_index][0], + tokens_in_ents[token_index][1], + tokens_in_ents[token_index][2], + ), + span2=(start_char, end_char, label), + ) + ) + tokens_in_ents[token_index] = (start_char, end_char, label) + start_token = starts.get(start_char) + end_token = ends.get(end_char) + # Only interested if the tokenization is correct + if start_token is not None and end_token is not None: + if start_token == end_token: + biluo[start_token] = f"U-{label}" + else: + biluo[start_token] = f"B-{label}" + for i in range(start_token + 1, end_token): + biluo[i] = f"I-{label}" + biluo[end_token] = f"L-{label}" + # Now distinguish the O cases from ones where we miss the tokenization + entity_chars = set() + for start_char, end_char, label in entities: + for i in range(start_char, end_char): + entity_chars.add(i) + for token in doc: + for i in range(token.idx, token.idx + len(token)): + if i in entity_chars: + break + else: + biluo[token.i] = missing + if "-" in biluo and missing != "-": + ent_str = str(entities) + warnings.warn( + Warnings.W030.format( + text=doc.text[:50] + "..." if len(doc.text) > 50 else doc.text, + entities=ent_str[:50] + "..." if len(ent_str) > 50 else ent_str, + ) + ) + return biluo + + +def biluo_tags_to_spans(doc: Doc, tags: Iterable[str]) -> List[Span]: + """Encode per-token tags following the BILUO scheme into Span object, e.g. + to overwrite the doc.ents. + + doc (Doc): The document that the BILUO tags refer to. + entities (iterable): A sequence of BILUO tags with each tag describing one + token. Each tag string will be of the form of either "", "O" or + "{action}-{label}", where action is one of "B", "I", "L", "U". + RETURNS (list): A sequence of Span objects. Each token with a missing IOB + tag is returned as a Span with an empty label. + """ + token_offsets = tags_to_entities(tags) + spans = [] + for label, start_idx, end_idx in token_offsets: + span = Span(doc, start_idx, end_idx + 1, label=label) + spans.append(span) + return spans + + +def biluo_tags_to_offsets( + doc: Doc, tags: Iterable[str] +) -> List[Tuple[int, int, Union[str, int]]]: + """Encode per-token tags following the BILUO scheme into entity offsets. + + doc (Doc): The document that the BILUO tags refer to. + entities (iterable): A sequence of BILUO tags with each tag describing one + token. Each tags string will be of the form of either "", "O" or + "{action}-{label}", where action is one of "B", "I", "L", "U". + RETURNS (list): A sequence of `(start, end, label)` triples. `start` and + `end` will be character-offset integers denoting the slice into the + original string. 
+ """ + spans = biluo_tags_to_spans(doc, tags) + return [(span.start_char, span.end_char, span.label_) for span in spans] + + +def tags_to_entities(tags: Iterable[str]) -> List[Tuple[str, int, int]]: + """Note that the end index returned by this function is inclusive. + To use it for Span creation, increment the end by 1.""" + entities = [] + start = None + for i, tag in enumerate(tags): + if tag is None or tag.startswith("-"): + # TODO: We shouldn't be getting these malformed inputs. Fix this. + if start is not None: + start = None + else: + entities.append(("", i, i)) + elif tag.startswith("O"): + pass + elif tag.startswith("I"): + if start is None: + raise ValueError(Errors.E067.format(start="I", tags=tags[: i + 1])) + elif tag.startswith("U"): + entities.append((tag[2:], i, i)) + elif tag.startswith("B"): + start = i + elif tag.startswith("L"): + if start is None: + raise ValueError(Errors.E067.format(start="L", tags=tags[: i + 1])) + entities.append((tag[2:], start, i)) + start = None + else: + raise ValueError(Errors.E068.format(tag=tag)) + return entities + + +# Fallbacks to make backwards-compat easier +offsets_from_biluo_tags = biluo_tags_to_offsets +spans_from_biluo_tags = biluo_tags_to_spans +biluo_tags_from_offsets = offsets_to_biluo_tags diff --git a/spacy/training/loggers.py b/spacy/training/loggers.py new file mode 100644 index 000000000..79459a89b --- /dev/null +++ b/spacy/training/loggers.py @@ -0,0 +1,139 @@ +from typing import TYPE_CHECKING, Dict, Any, Tuple, Callable, List, Optional, IO +from wasabi import Printer +import tqdm +import sys + +from ..util import registry +from .. import util +from ..errors import Errors + +if TYPE_CHECKING: + from ..language import Language # noqa: F401 + + +def setup_table( + *, cols: List[str], widths: List[int], max_width: int = 13 +) -> Tuple[List[str], List[int], List[str]]: + final_cols = [] + final_widths = [] + for col, width in zip(cols, widths): + if len(col) > max_width: + col = col[: max_width - 3] + "..." # shorten column if too long + final_cols.append(col.upper()) + final_widths.append(max(len(col), width)) + return final_cols, final_widths, ["r" for _ in final_widths] + + +@registry.loggers("spacy.ConsoleLogger.v1") +def console_logger(progress_bar: bool = False): + def setup_printer( + nlp: "Language", stdout: IO = sys.stdout, stderr: IO = sys.stderr + ) -> Tuple[Callable[[Optional[Dict[str, Any]]], None], Callable[[], None]]: + write = lambda text: stdout.write(f"{text}\n") + msg = Printer(no_print=True) + # ensure that only trainable components are logged + logged_pipes = [ + name + for name, proc in nlp.pipeline + if hasattr(proc, "is_trainable") and proc.is_trainable + ] + eval_frequency = nlp.config["training"]["eval_frequency"] + score_weights = nlp.config["training"]["score_weights"] + score_cols = [col for col, value in score_weights.items() if value is not None] + loss_cols = [f"Loss {pipe}" for pipe in logged_pipes] + spacing = 2 + table_header, table_widths, table_aligns = setup_table( + cols=["E", "#"] + loss_cols + score_cols + ["Score"], + widths=[3, 6] + [8 for _ in loss_cols] + [6 for _ in score_cols] + [6], + ) + write(msg.row(table_header, widths=table_widths, spacing=spacing)) + write(msg.row(["-" * width for width in table_widths], spacing=spacing)) + progress = None + + def log_step(info: Optional[Dict[str, Any]]) -> None: + nonlocal progress + + if info is None: + # If we don't have a new checkpoint, just return. 
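+ # (When the progress bar is enabled, it's sized to eval_frequency
+ # steps, so each tick shows progress towards the next evaluation.)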
+ if progress is not None: + progress.update(1) + return + losses = [ + "{0:.2f}".format(float(info["losses"][pipe_name])) + for pipe_name in logged_pipes + ] + + scores = [] + for col in score_cols: + score = info["other_scores"].get(col, 0.0) + try: + score = float(score) + except TypeError: + err = Errors.E916.format(name=col, score_type=type(score)) + raise ValueError(err) from None + if col != "speed": + score *= 100 + scores.append("{0:.2f}".format(score)) + + data = ( + [info["epoch"], info["step"]] + + losses + + scores + + ["{0:.2f}".format(float(info["score"]))] + ) + if progress is not None: + progress.close() + write( + msg.row(data, widths=table_widths, aligns=table_aligns, spacing=spacing) + ) + if progress_bar: + # Set disable=None, so that it disables on non-TTY + progress = tqdm.tqdm( + total=eval_frequency, disable=None, leave=False, file=stderr + ) + progress.set_description(f"Epoch {info['epoch']+1}") + + def finalize() -> None: + pass + + return log_step, finalize + + return setup_printer + + +@registry.loggers("spacy.WandbLogger.v1") +def wandb_logger(project_name: str, remove_config_values: List[str] = []): + import wandb + + console = console_logger(progress_bar=False) + + def setup_logger( + nlp: "Language", stdout: IO = sys.stdout, stderr: IO = sys.stderr + ) -> Tuple[Callable[[Dict[str, Any]], None], Callable[[], None]]: + config = nlp.config.interpolate() + config_dot = util.dict_to_dot(config) + for field in remove_config_values: + del config_dot[field] + config = util.dot_to_dict(config_dot) + wandb.init(project=project_name, config=config, reinit=True) + console_log_step, console_finalize = console(nlp, stdout, stderr) + + def log_step(info: Optional[Dict[str, Any]]): + console_log_step(info) + if info is not None: + score = info["score"] + other_scores = info["other_scores"] + losses = info["losses"] + wandb.log({"score": score}) + if losses: + wandb.log({f"loss_{k}": v for k, v in losses.items()}) + if isinstance(other_scores, dict): + wandb.log(other_scores) + + def finalize() -> None: + console_finalize() + wandb.join() + + return log_step, finalize + + return setup_logger diff --git a/spacy/training/loop.py b/spacy/training/loop.py new file mode 100644 index 000000000..c3fa83b39 --- /dev/null +++ b/spacy/training/loop.py @@ -0,0 +1,334 @@ +from typing import List, Callable, Tuple, Dict, Iterable, Iterator, Union, Any, IO +from typing import Optional, TYPE_CHECKING +from pathlib import Path +from timeit import default_timer as timer +from thinc.api import Optimizer, Config, constant, fix_random_seed, set_gpu_allocator +from wasabi import Printer +import random +import sys +import shutil + +from .example import Example +from ..schemas import ConfigSchemaTraining +from ..errors import Errors +from ..util import resolve_dot_names, registry, logger + +if TYPE_CHECKING: + from ..language import Language # noqa: F401 + + +DIR_MODEL_BEST = "model-best" +DIR_MODEL_LAST = "model-last" + + +def train( + nlp: "Language", + output_path: Optional[Path] = None, + *, + use_gpu: int = -1, + stdout: IO = sys.stdout, + stderr: IO = sys.stderr, +) -> None: + """Train a pipeline. + + nlp (Language): The initialized nlp object with the full config. + output_path (Path): Optional output path to save trained model to. + use_gpu (int): Whether to train on GPU. Make sure to call require_gpu + before calling this function. + stdout (file): A file-like object to write output messages. To disable + printing, set to io.StringIO. 
+ stderr (file): A second file-like object to write output messages. To disable + printing, set to io.StringIO. + + RETURNS (Path / None): The path to the final exported model. + """ + # We use no_print here so we can respect the stdout/stderr options. + msg = Printer(no_print=True) + # Create iterator, which yields out info after each optimization step. + config = nlp.config.interpolate() + if config["training"]["seed"] is not None: + fix_random_seed(config["training"]["seed"]) + allocator = config["training"]["gpu_allocator"] + if use_gpu >= 0 and allocator: + set_gpu_allocator(allocator) + T = registry.resolve(config["training"], schema=ConfigSchemaTraining) + dot_names = [T["train_corpus"], T["dev_corpus"]] + train_corpus, dev_corpus = resolve_dot_names(config, dot_names) + optimizer = T["optimizer"] + score_weights = T["score_weights"] + batcher = T["batcher"] + train_logger = T["logger"] + before_to_disk = create_before_to_disk_callback(T["before_to_disk"]) + # Components that shouldn't be updated during training + frozen_components = T["frozen_components"] + # Create iterator, which yields out info after each optimization step. + training_step_iterator = train_while_improving( + nlp, + optimizer, + create_train_batches(train_corpus(nlp), batcher, T["max_epochs"]), + create_evaluation_callback(nlp, dev_corpus, score_weights), + dropout=T["dropout"], + accumulate_gradient=T["accumulate_gradient"], + patience=T["patience"], + max_steps=T["max_steps"], + eval_frequency=T["eval_frequency"], + exclude=frozen_components, + ) + clean_output_dir(output_path) + stdout.write(msg.info(f"Pipeline: {nlp.pipe_names}") + "\n") + if frozen_components: + stdout.write(msg.info(f"Frozen components: {frozen_components}") + "\n") + stdout.write(msg.info(f"Initial learn rate: {optimizer.learn_rate}") + "\n") + with nlp.select_pipes(disable=frozen_components): + log_step, finalize_logger = train_logger(nlp, stdout, stderr) + try: + for batch, info, is_best_checkpoint in training_step_iterator: + log_step(info if is_best_checkpoint is not None else None) + if is_best_checkpoint is not None and output_path is not None: + with nlp.select_pipes(disable=frozen_components): + update_meta(T, nlp, info) + with nlp.use_params(optimizer.averages): + nlp = before_to_disk(nlp) + nlp.to_disk(output_path / DIR_MODEL_BEST) + except Exception as e: + if output_path is not None: + # We don't want to swallow the traceback if we don't have a + # specific error, but we do want to warn that we're trying + # to do something here. + stdout.write( + msg.warn( + f"Aborting and saving the final best model. " + f"Encountered exception: {str(e)}" + ) + + "\n" + ) + raise e + finally: + finalize_logger() + if output_path is not None: + final_model_path = output_path / DIR_MODEL_LAST + if optimizer.averages: + with nlp.use_params(optimizer.averages): + nlp.to_disk(final_model_path) + else: + nlp.to_disk(final_model_path) + # This will only run if we don't hit an error + stdout.write( + msg.good("Saved pipeline to output directory", final_model_path) + "\n" + ) + + +def train_while_improving( + nlp: "Language", + optimizer: Optimizer, + train_data, + evaluate, + *, + dropout: float, + eval_frequency: int, + accumulate_gradient: int, + patience: int, + max_steps: int, + exclude: List[str], +): + """Train until an evaluation stops improving. 
Works as a generator,
+ with each iteration yielding a tuple `(batch, info, is_best_checkpoint)`,
+ where info is a dict, and is_best_checkpoint is in [True, False, None] --
+ None indicating that the iteration was not evaluated as a checkpoint.
+ The evaluation is conducted by calling the evaluate callback.
+
+ Positional arguments:
+ nlp: The spaCy pipeline to evaluate.
+ optimizer: The optimizer callable.
+ train_data (Iterable[Batch]): A generator of batches, with the training
+ data. Each batch should be a Sized[Tuple[Input, Annot]]. The training
+ data iterable needs to take care of iterating over the epochs and
+ shuffling.
+ evaluate (Callable[[], Tuple[float, Any]]): A callback to perform evaluation.
+ The callback should take no arguments and return a tuple
+ `(main_score, other_scores)`. The main_score should be a float where
+ higher is better. other_scores can be any object.
+
+ Every iteration, the function yields out a tuple with:
+
+ * batch: A list of Example objects.
+ * info: A dict with various information about the last update (see below).
+ * is_best_checkpoint: A value in None, False, True, indicating whether this
+ was the best evaluation so far. You should use this to save the model
+ checkpoints during training. If None, evaluation was not conducted on
+ that iteration. False means evaluation was conducted, but a previous
+ evaluation was better.
+
+ The info dict provides the following information:
+
+ epoch (int): How many passes over the data have been completed.
+ step (int): How many steps have been completed.
+ score (float): The main score from the last evaluation.
+ other_scores: The other scores from the last evaluation.
+ losses: The accumulated losses throughout training.
+ checkpoints: A list of previous results, where each result is a
+ (score, step, epoch) tuple.
+ """ + if isinstance(dropout, float): + dropouts = constant(dropout) + else: + dropouts = dropout + results = [] + losses = {} + words_seen = 0 + start_time = timer() + for step, (epoch, batch) in enumerate(train_data): + dropout = next(dropouts) + for subbatch in subdivide_batch(batch, accumulate_gradient): + nlp.update( + subbatch, drop=dropout, losses=losses, sgd=False, exclude=exclude + ) + # TODO: refactor this so we don't have to run it separately in here + for name, proc in nlp.pipeline: + if ( + name not in exclude + and hasattr(proc, "is_trainable") + and proc.is_trainable + and proc.model not in (True, False, None) + ): + proc.finish_update(optimizer) + optimizer.step_schedules() + if not (step % eval_frequency): + if optimizer.averages: + with nlp.use_params(optimizer.averages): + score, other_scores = evaluate() + else: + score, other_scores = evaluate() + results.append((score, step)) + is_best_checkpoint = score == max(results)[0] + else: + score, other_scores = (None, None) + is_best_checkpoint = None + words_seen += sum(len(eg) for eg in batch) + info = { + "epoch": epoch, + "step": step, + "score": score, + "other_scores": other_scores, + "losses": losses, + "checkpoints": results, + "seconds": int(timer() - start_time), + "words": words_seen, + } + yield batch, info, is_best_checkpoint + if is_best_checkpoint is not None: + losses = {} + # Stop if no improvement in `patience` updates (if specified) + best_score, best_step = max(results) + if patience and (step - best_step) >= patience: + break + # Stop if we've exhausted our max steps (if specified) + if max_steps and step >= max_steps: + break + + +def subdivide_batch(batch, accumulate_gradient): + batch = list(batch) + batch.sort(key=lambda eg: len(eg.predicted)) + sub_len = len(batch) // accumulate_gradient + start = 0 + for i in range(accumulate_gradient): + subbatch = batch[start : start + sub_len] + if subbatch: + yield subbatch + start += len(subbatch) + subbatch = batch[start:] + if subbatch: + yield subbatch + + +def create_evaluation_callback( + nlp: "Language", dev_corpus: Callable, weights: Dict[str, float] +) -> Callable[[], Tuple[float, Dict[str, float]]]: + weights = {key: value for key, value in weights.items() if value is not None} + + def evaluate() -> Tuple[float, Dict[str, float]]: + dev_examples = list(dev_corpus(nlp)) + try: + scores = nlp.evaluate(dev_examples) + except KeyError as e: + raise KeyError(Errors.E900.format(pipeline=nlp.pipe_names)) from e + # Calculate a weighted sum based on score_weights for the main score. + # We can only consider scores that are ints/floats, not dicts like + # entity scores per type etc. 
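+ # e.g. weights {"dep_uas": 0.5, "dep_las": 0.5} with scores
+ # {"dep_uas": 0.9, "dep_las": 0.8} give a weighted score of 0.85.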
+ for key, value in scores.items(): + if key in weights and not isinstance(value, (int, float)): + raise ValueError(Errors.E915.format(name=key, score_type=type(value))) + try: + weighted_score = sum( + scores.get(s, 0.0) * weights.get(s, 0.0) for s in weights + ) + except KeyError as e: + keys = list(scores.keys()) + err = Errors.E983.format(dict="score_weights", key=str(e), keys=keys) + raise KeyError(err) from None + return weighted_score, scores + + return evaluate + + +def create_train_batches( + iterator: Iterator[Example], + batcher: Callable[[Iterable[Example]], Iterable[Example]], + max_epochs: int, +): + epoch = 0 + examples = list(iterator) + if not examples: + # Raise error if no data + raise ValueError(Errors.E986) + while max_epochs < 1 or epoch != max_epochs: + random.shuffle(examples) + for batch in batcher(examples): + yield epoch, batch + epoch += 1 + + +def update_meta( + training: Union[Dict[str, Any], Config], nlp: "Language", info: Dict[str, Any] +) -> None: + nlp.meta["performance"] = {} + for metric in training["score_weights"]: + if metric is not None: + nlp.meta["performance"][metric] = info["other_scores"].get(metric, 0.0) + for pipe_name in nlp.pipe_names: + if pipe_name in info["losses"]: + nlp.meta["performance"][f"{pipe_name}_loss"] = info["losses"][pipe_name] + + +def create_before_to_disk_callback( + callback: Optional[Callable[["Language"], "Language"]] +) -> Callable[["Language"], "Language"]: + from ..language import Language # noqa: F811 + + def before_to_disk(nlp: Language) -> Language: + if not callback: + return nlp + modified_nlp = callback(nlp) + if not isinstance(modified_nlp, Language): + err = Errors.E914.format(name="before_to_disk", value=type(modified_nlp)) + raise ValueError(err) + return modified_nlp + + return before_to_disk + + +def clean_output_dir(path: Union[str, Path]) -> None: + """Remove an existing output directory. Typically used to ensure that that + a directory like model-best and its contents aren't just being overwritten + by nlp.to_disk, which could preserve existing subdirectories (e.g. + components that don't exist anymore). 
+ """ + if path is not None and path.exists(): + for subdir in [path / DIR_MODEL_BEST, path / DIR_MODEL_LAST]: + if subdir.exists(): + try: + shutil.rmtree(str(subdir)) + logger.debug(f"Removed existing output directory: {subdir}") + except Exception as e: + raise IOError(Errors.E901.format(path=path)) from e diff --git a/spacy/training/pretrain.py b/spacy/training/pretrain.py new file mode 100644 index 000000000..b91fb07a8 --- /dev/null +++ b/spacy/training/pretrain.py @@ -0,0 +1,267 @@ +from typing import Optional, Callable, Iterable, Union, List +from thinc.api import Config, fix_random_seed, set_gpu_allocator, Model, Optimizer +from thinc.api import set_dropout_rate, to_categorical, CosineDistance, L2Distance +from pathlib import Path +from functools import partial +from collections import Counter +import srsly +import numpy +import time +import re +from wasabi import Printer + +from .example import Example +from ..tokens import Doc +from ..attrs import ID +from ..ml.models.multi_task import build_cloze_multi_task_model +from ..ml.models.multi_task import build_cloze_characters_multi_task_model +from ..schemas import ConfigSchemaTraining, ConfigSchemaPretrain +from ..errors import Errors +from ..util import registry, load_model_from_config, dot_to_object + + +def pretrain( + config: Config, + output_dir: Path, + resume_path: Optional[Path] = None, + epoch_resume: Optional[int] = None, + use_gpu: int = -1, + silent: bool = True, +): + msg = Printer(no_print=silent) + if config["training"]["seed"] is not None: + fix_random_seed(config["training"]["seed"]) + allocator = config["training"]["gpu_allocator"] + if use_gpu >= 0 and allocator: + set_gpu_allocator(allocator) + nlp = load_model_from_config(config) + _config = nlp.config.interpolate() + T = registry.resolve(_config["training"], schema=ConfigSchemaTraining) + P = registry.resolve(_config["pretraining"], schema=ConfigSchemaPretrain) + corpus = dot_to_object(T, P["corpus"]) + batcher = P["batcher"] + model = create_pretraining_model(nlp, P) + optimizer = P["optimizer"] + # Load in pretrained weights to resume from + if resume_path is not None: + _resume_model(model, resume_path, epoch_resume, silent=silent) + else: + # Without '--resume-path' the '--epoch-resume' argument is ignored + epoch_resume = 0 + # TODO: move this to logger function? + tracker = ProgressTracker(frequency=10000) + msg.divider(f"Pre-training tok2vec layer - starting at epoch {epoch_resume}") + row_settings = {"widths": (3, 10, 10, 6, 4), "aligns": ("r", "r", "r", "r", "r")} + msg.row(("#", "# Words", "Total Loss", "Loss", "w/s"), **row_settings) + + def _save_model(epoch, is_temp=False): + is_temp_str = ".temp" if is_temp else "" + with model.use_params(optimizer.averages): + with (output_dir / f"model{epoch}{is_temp_str}.bin").open("wb") as file_: + file_.write(model.get_ref("tok2vec").to_bytes()) + log = { + "nr_word": tracker.nr_word, + "loss": tracker.loss, + "epoch_loss": tracker.epoch_loss, + "epoch": epoch, + } + with (output_dir / "log.jsonl").open("a") as file_: + file_.write(srsly.json_dumps(log) + "\n") + + objective = create_objective(P["objective"]) + # TODO: I think we probably want this to look more like the + # 'create_train_batches' function? 
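+ # Each epoch streams batches from the corpus, runs one update per
+ # batch against the chosen objective, and optionally saves temporary
+ # weights every n_save_every batches.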
+ for epoch in range(epoch_resume, P["max_epochs"]): + for batch_id, batch in enumerate(batcher(corpus(nlp))): + docs = ensure_docs(batch) + loss = make_update(model, docs, optimizer, objective) + progress = tracker.update(epoch, loss, docs) + if progress: + msg.row(progress, **row_settings) + if P["n_save_every"] and (batch_id % P["n_save_every"] == 0): + _save_model(epoch, is_temp=True) + _save_model(epoch) + tracker.epoch_loss = 0.0 + + +def ensure_docs(examples_or_docs: Iterable[Union[Doc, Example]]) -> List[Doc]: + docs = [] + for eg_or_doc in examples_or_docs: + if isinstance(eg_or_doc, Doc): + docs.append(eg_or_doc) + else: + docs.append(eg_or_doc.reference) + return docs + + +def _resume_model( + model: Model, resume_path: Path, epoch_resume: int, silent: bool = True +) -> None: + msg = Printer(no_print=silent) + msg.info(f"Resume training tok2vec from: {resume_path}") + with resume_path.open("rb") as file_: + weights_data = file_.read() + model.get_ref("tok2vec").from_bytes(weights_data) + # Parse the epoch number from the given weight file + model_name = re.search(r"model\d+\.bin", str(resume_path)) + if model_name: + # Default weight file name so read epoch_start from it by cutting off 'model' and '.bin' + epoch_resume = int(model_name.group(0)[5:][:-4]) + 1 + msg.info(f"Resuming from epoch: {epoch_resume}") + else: + msg.info(f"Resuming from epoch: {epoch_resume}") + + +def make_update( + model: Model, docs: Iterable[Doc], optimizer: Optimizer, objective_func: Callable +) -> float: + """Perform an update over a single batch of documents. + + docs (iterable): A batch of `Doc` objects. + optimizer (callable): An optimizer. + RETURNS loss: A float for the loss. + """ + predictions, backprop = model.begin_update(docs) + loss, gradients = objective_func(model.ops, docs, predictions) + backprop(gradients) + model.finish_update(optimizer) + # Don't want to return a cupy object here + # The gradients are modified in-place by the BERT MLM, + # so we get an accurate loss + return float(loss) + + +def create_objective(config: Config): + """Create the objective for pretraining. + + We'd like to replace this with a registry function but it's tricky because + we're also making a model choice based on this. For now we hard-code support + for two types (characters, vectors). For characters you can specify + n_characters, for vectors you can specify the loss. + + Bleh. + """ + objective_type = config["type"] + if objective_type == "characters": + return partial(get_characters_loss, nr_char=config["n_characters"]) + elif objective_type == "vectors": + if config["loss"] == "cosine": + distance = CosineDistance(normalize=True, ignore_zeros=True) + return partial(get_vectors_loss, distance=distance) + elif config["loss"] == "L2": + distance = L2Distance(normalize=True, ignore_zeros=True) + return partial(get_vectors_loss, distance=distance) + else: + raise ValueError(Errors.E906.format(loss_type=config["loss"])) + else: + raise ValueError(Errors.E907.format(objective_type=objective_type)) + + +def get_vectors_loss(ops, docs, prediction, distance): + """Compute a loss based on a distance between the documents' vectors and + the prediction. + """ + # The simplest way to implement this would be to vstack the + # token.vector values, but that's a bit inefficient, especially on GPU. + # Instead we fetch the index into the vectors table for each of our tokens, + # and look them up all at once. This prevents data copying. 
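+ # `ids` holds one row index into the vectors table per token in the
+ # batch, so `target` lines up row-for-row with the model's predictions.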
+ ids = ops.flatten([doc.to_array(ID).ravel() for doc in docs]) + target = docs[0].vocab.vectors.data[ids] + d_target, loss = distance(prediction, target) + return loss, d_target + + +def get_characters_loss(ops, docs, prediction, nr_char): + """Compute a loss based on a number of characters predicted from the docs.""" + target_ids = numpy.vstack([doc.to_utf8_array(nr_char=nr_char) for doc in docs]) + target_ids = target_ids.reshape((-1,)) + target = ops.asarray(to_categorical(target_ids, n_classes=256), dtype="f") + target = target.reshape((-1, 256 * nr_char)) + diff = prediction - target + loss = (diff ** 2).sum() + d_target = diff / float(prediction.shape[0]) + return loss, d_target + + +def create_pretraining_model(nlp, pretrain_config): + """Define a network for the pretraining. We simply add an output layer onto + the tok2vec input model. The tok2vec input model needs to be a model that + takes a batch of Doc objects (as a list), and returns a list of arrays. + Each array in the output needs to have one row per token in the doc. + The actual tok2vec layer is stored as a reference, and only this bit will be + serialized to file and read back in when calling the 'train' command. + """ + component = nlp.get_pipe(pretrain_config["component"]) + if pretrain_config.get("layer"): + tok2vec = component.model.get_ref(pretrain_config["layer"]) + else: + tok2vec = component.model + + # TODO + maxout_pieces = 3 + hidden_size = 300 + if pretrain_config["objective"]["type"] == "vectors": + model = build_cloze_multi_task_model( + nlp.vocab, tok2vec, hidden_size=hidden_size, maxout_pieces=maxout_pieces + ) + elif pretrain_config["objective"]["type"] == "characters": + model = build_cloze_characters_multi_task_model( + nlp.vocab, + tok2vec, + hidden_size=hidden_size, + maxout_pieces=maxout_pieces, + nr_char=pretrain_config["objective"]["n_characters"], + ) + model.initialize(X=[nlp.make_doc("Give it a doc to infer shapes")]) + set_dropout_rate(model, pretrain_config["dropout"]) + return model + + +class ProgressTracker: + def __init__(self, frequency=1000000): + self.loss = 0.0 + self.prev_loss = 0.0 + self.nr_word = 0 + self.words_per_epoch = Counter() + self.frequency = frequency + self.last_time = time.time() + self.last_update = 0 + self.epoch_loss = 0.0 + + def update(self, epoch, loss, docs): + self.loss += loss + self.epoch_loss += loss + words_in_batch = sum(len(doc) for doc in docs) + self.words_per_epoch[epoch] += words_in_batch + self.nr_word += words_in_batch + words_since_update = self.nr_word - self.last_update + if words_since_update >= self.frequency: + wps = words_since_update / (time.time() - self.last_time) + self.last_update = self.nr_word + self.last_time = time.time() + loss_per_word = self.loss - self.prev_loss + status = ( + epoch, + self.nr_word, + _smart_round(self.loss, width=10), + _smart_round(loss_per_word, width=6), + int(wps), + ) + self.prev_loss = float(self.loss) + return status + else: + return None + + +def _smart_round( + figure: Union[float, int], width: int = 10, max_decimal: int = 4 +) -> str: + """Round large numbers as integers, smaller numbers as decimals.""" + n_digits = len(str(int(figure))) + n_decimal = width - (n_digits + 1) + if n_decimal <= 1: + return str(int(figure)) + else: + n_decimal = min(n_decimal, max_decimal) + format_str = "%." 
+ str(n_decimal) + "f" + return format_str % figure diff --git a/spacy/typedefs.pxd b/spacy/typedefs.pxd index bd5b38958..b43814268 100644 --- a/spacy/typedefs.pxd +++ b/spacy/typedefs.pxd @@ -2,7 +2,9 @@ from libc.stdint cimport uint16_t, uint32_t, uint64_t, uintptr_t, int32_t from libc.stdint cimport uint8_t +ctypedef float weight_t ctypedef uint64_t hash_t +ctypedef uint64_t class_t ctypedef char* utf8_t ctypedef uint64_t attr_t ctypedef uint64_t flags_t diff --git a/spacy/util.py b/spacy/util.py index 735bfc53b..8335a4fcc 100644 --- a/spacy/util.py +++ b/spacy/util.py @@ -1,14 +1,14 @@ -# coding: utf8 -from __future__ import unicode_literals, print_function - +from typing import List, Union, Dict, Any, Optional, Iterable, Callable, Tuple +from typing import Iterator, Type, Pattern, Generator, TYPE_CHECKING +from types import ModuleType import os import importlib +import importlib.util import re from pathlib import Path -import random -from collections import OrderedDict -from thinc.neural._classes.model import Model -from thinc.neural.ops import NumpyOps +import thinc +from thinc.api import NumpyOps, get_current_ops, Adam, Config, Optimizer +from thinc.api import ConfigValidationError import functools import itertools import numpy.random @@ -17,57 +17,174 @@ import srsly import catalogue import sys import warnings -from . import about - -try: - import jsonschema -except ImportError: - jsonschema = None +from packaging.specifiers import SpecifierSet, InvalidSpecifier +from packaging.version import Version, InvalidVersion +import subprocess +from contextlib import contextmanager +import tempfile +import shutil +import shlex +import inspect +import logging try: import cupy.random except ImportError: cupy = None +try: # Python 3.8 + import importlib.metadata as importlib_metadata +except ImportError: + import importlib_metadata + +# These are functions that were previously (v2.x) available from spacy.util +# and have since moved to Thinc. We're importing them here so people's code +# doesn't break, but they should always be imported from Thinc from now on, +# not from spacy.util. +from thinc.api import fix_random_seed, compounding, decaying # noqa: F401 + + from .symbols import ORTH -from .compat import cupy, CudaStream, path2str, basestring_, unicode_ -from .compat import import_file -from .errors import Errors, Warnings +from .compat import cupy, CudaStream, is_windows +from .errors import Errors, Warnings, OLD_MODEL_SHORTCUTS +from . import about + +if TYPE_CHECKING: + # This lets us add type hints for mypy etc. without causing circular imports + from .language import Language # noqa: F401 + from .tokens import Doc, Span # noqa: F401 + from .vocab import Vocab # noqa: F401 -_data_path = Path(__file__).parent / "data" -_PRINT_ENV = False OOV_RANK = numpy.iinfo(numpy.uint64).max +DEFAULT_OOV_PROB = -20 +LEXEME_NORM_LANGS = ["da", "de", "el", "en", "id", "lb", "pt", "ru", "sr", "ta", "th"] + +# Default order of sections in the config.cfg. Not all sections needs to exist, +# and additional sections are added at the end, in alphabetical order. 
+# fmt: off +CONFIG_SECTION_ORDER = ["paths", "variables", "system", "nlp", "components", "corpora", "training", "pretraining", "initialize"] +# fmt: on -class registry(object): +logging.basicConfig(format="%(message)s") +logger = logging.getLogger("spacy") + + +class ENV_VARS: + CONFIG_OVERRIDES = "SPACY_CONFIG_OVERRIDES" + PROJECT_USE_GIT_VERSION = "SPACY_PROJECT_USE_GIT_VERSION" + + +class registry(thinc.registry): languages = catalogue.create("spacy", "languages", entry_points=True) architectures = catalogue.create("spacy", "architectures", entry_points=True) + tokenizers = catalogue.create("spacy", "tokenizers", entry_points=True) + lemmatizers = catalogue.create("spacy", "lemmatizers", entry_points=True) lookups = catalogue.create("spacy", "lookups", entry_points=True) - factories = catalogue.create("spacy", "factories", entry_points=True) displacy_colors = catalogue.create("spacy", "displacy_colors", entry_points=True) + misc = catalogue.create("spacy", "misc", entry_points=True) + # Callback functions used to manipulate nlp object etc. + callbacks = catalogue.create("spacy", "callbacks") + batchers = catalogue.create("spacy", "batchers", entry_points=True) + readers = catalogue.create("spacy", "readers", entry_points=True) + augmenters = catalogue.create("spacy", "augmenters", entry_points=True) + loggers = catalogue.create("spacy", "loggers", entry_points=True) + # These are factories registered via third-party packages and the + # spacy_factories entry point. This registry only exists so we can easily + # load them via the entry points. The "true" factories are added via the + # Language.factory decorator (in the spaCy code base and user code) and those + # are the factories used to initialize components via registry.resolve. + _entry_point_factories = catalogue.create("spacy", "factories", entry_points=True) + factories = catalogue.create("spacy", "internal_factories") + # This is mostly used to get a list of all installed models in the current + # environment. spaCy models packaged with `spacy package` will "advertise" + # themselves via entry points. + models = catalogue.create("spacy", "models", entry_points=True) + cli = catalogue.create("spacy", "cli", entry_points=True) -def set_env_log(value): - global _PRINT_ENV - _PRINT_ENV = value +class SimpleFrozenDict(dict): + """Simplified implementation of a frozen dict, mainly used as default + function or method argument (for arguments that should default to empty + dictionary). Will raise an error if user or spaCy attempts to add to dict. + """ + + def __init__(self, *args, error: str = Errors.E095, **kwargs) -> None: + """Initialize the frozen dict. Can be initialized with pre-defined + values. + + error (str): The error message when user tries to assign to dict. + """ + super().__init__(*args, **kwargs) + self.error = error + + def __setitem__(self, key, value): + raise NotImplementedError(self.error) + + def pop(self, key, default=None): + raise NotImplementedError(self.error) + + def update(self, other): + raise NotImplementedError(self.error) -def lang_class_is_loaded(lang): +class SimpleFrozenList(list): + """Wrapper class around a list that lets us raise custom errors if certain + attributes/methods are accessed. Mostly used for properties like + Language.pipeline that return an immutable list (and that we don't want to + convert to a tuple to not break too much backwards compatibility). If a user + accidentally calls nlp.pipeline.append(), we can raise a more helpful error. 
+ """ + + def __init__(self, *args, error: str = Errors.E927) -> None: + """Initialize the frozen list. + + error (str): The error message when user tries to mutate the list. + """ + self.error = error + super().__init__(*args) + + def append(self, *args, **kwargs): + raise NotImplementedError(self.error) + + def clear(self, *args, **kwargs): + raise NotImplementedError(self.error) + + def extend(self, *args, **kwargs): + raise NotImplementedError(self.error) + + def insert(self, *args, **kwargs): + raise NotImplementedError(self.error) + + def pop(self, *args, **kwargs): + raise NotImplementedError(self.error) + + def remove(self, *args, **kwargs): + raise NotImplementedError(self.error) + + def reverse(self, *args, **kwargs): + raise NotImplementedError(self.error) + + def sort(self, *args, **kwargs): + raise NotImplementedError(self.error) + + +def lang_class_is_loaded(lang: str) -> bool: """Check whether a Language class is already loaded. Language classes are loaded lazily, to avoid expensive setup code associated with the language data. - lang (unicode): Two-letter language code, e.g. 'en'. + lang (str): Two-letter language code, e.g. 'en'. RETURNS (bool): Whether a Language class has been loaded. """ return lang in registry.languages -def get_lang_class(lang): +def get_lang_class(lang: str) -> "Language": """Import and load a Language class. - lang (unicode): Two-letter language code, e.g. 'en'. + lang (str): Two-letter language code, e.g. 'en'. RETURNS (Language): Language class. """ # Check if language is registered / entry point is available @@ -75,65 +192,39 @@ def get_lang_class(lang): return registry.languages.get(lang) else: try: - module = importlib.import_module(".lang.%s" % lang, "spacy") + module = importlib.import_module(f".lang.{lang}", "spacy") except ImportError as err: - raise ImportError(Errors.E048.format(lang=lang, err=err)) + raise ImportError(Errors.E048.format(lang=lang, err=err)) from err set_lang_class(lang, getattr(module, module.__all__[0])) return registry.languages.get(lang) -def set_lang_class(name, cls): +def set_lang_class(name: str, cls: Type["Language"]) -> None: """Set a custom Language class name that can be loaded via get_lang_class. - name (unicode): Name of Language class. + name (str): Name of Language class. cls (Language): Language class. """ registry.languages.register(name, func=cls) -def get_data_path(require_exists=True): - """Get path to spaCy data directory. - - require_exists (bool): Only return path if it exists, otherwise None. - RETURNS (Path or None): Data path or None. - """ - if not require_exists: - return _data_path - else: - return _data_path if _data_path.exists() else None - - -def set_data_path(path): - """Set path to spaCy data directory. - - path (unicode or Path): Path to new data directory. - """ - global _data_path - _data_path = ensure_path(path) - - -def make_layer(arch_config): - arch_func = registry.architectures.get(arch_config["arch"]) - return arch_func(arch_config["config"]) - - -def ensure_path(path): +def ensure_path(path: Any) -> Any: """Ensure string is converted to a Path. - path: Anything. If string, it's converted to Path. + path (Any): Anything. If string, it's converted to Path. RETURNS: Path or original argument. """ - if isinstance(path, basestring_): + if isinstance(path, str): return Path(path) else: return path -def load_language_data(path): +def load_language_data(path: Union[str, Path]) -> Union[dict, list]: """Load JSON language data using the given path as a base. 
If the provided path isn't present, will attempt to load a gzipped version before giving up. - path (unicode / Path): The data to load. + path (str / Path): The data to load. RETURNS: The loaded data. """ path = ensure_path(path) @@ -142,167 +233,447 @@ def load_language_data(path): path = path.with_suffix(path.suffix + ".gz") if path.exists(): return srsly.read_gzip_json(path) - raise ValueError(Errors.E160.format(path=path2str(path))) + raise ValueError(Errors.E160.format(path=path)) -def get_module_path(module): +def get_module_path(module: ModuleType) -> Path: + """Get the path of a Python module. + + module (ModuleType): The Python module. + RETURNS (Path): The path. + """ if not hasattr(module, "__module__"): raise ValueError(Errors.E169.format(module=repr(module))) return Path(sys.modules[module.__module__].__file__).parent -def load_model(name, **overrides): - """Load a model from a shortcut link, package or data path. +def load_model( + name: Union[str, Path], + *, + vocab: Union["Vocab", bool] = True, + disable: Iterable[str] = SimpleFrozenList(), + exclude: Iterable[str] = SimpleFrozenList(), + config: Union[Dict[str, Any], Config] = SimpleFrozenDict(), +) -> "Language": + """Load a model from a package or data path. - name (unicode): Package name, shortcut link or model path. - **overrides: Specific overrides, like pipeline components to disable. - RETURNS (Language): `Language` class with the loaded model. + name (str): Package name or model path. + vocab (Vocab / True): Optional vocab to pass in on initialization. If True, + a new Vocab object will be created. + disable (Iterable[str]): Names of pipeline components to disable. + config (Dict[str, Any] / Config): Config overrides as nested dict or dict + keyed by section values in dot notation. + RETURNS (Language): The loaded nlp object. 
""" - data_path = get_data_path() - if not data_path or not data_path.exists(): - raise IOError(Errors.E049.format(path=path2str(data_path))) - if isinstance(name, basestring_): # in data dir / shortcut + kwargs = {"vocab": vocab, "disable": disable, "exclude": exclude, "config": config} + if isinstance(name, str): # name or string path if name.startswith("blank:"): # shortcut for blank model return get_lang_class(name.replace("blank:", ""))() - if name in set([d.name for d in data_path.iterdir()]): - return load_model_from_link(name, **overrides) if is_package(name): # installed as package - return load_model_from_package(name, **overrides) + return load_model_from_package(name, **kwargs) if Path(name).exists(): # path to model data directory - return load_model_from_path(Path(name), **overrides) + return load_model_from_path(Path(name), **kwargs) elif hasattr(name, "exists"): # Path or Path-like to model data - return load_model_from_path(name, **overrides) + return load_model_from_path(name, **kwargs) + if name in OLD_MODEL_SHORTCUTS: + raise IOError(Errors.E941.format(name=name, full=OLD_MODEL_SHORTCUTS[name])) raise IOError(Errors.E050.format(name=name)) -def load_model_from_link(name, **overrides): - """Load a model from a shortcut link, or directory in spaCy data path.""" - path = get_data_path() / name / "__init__.py" - try: - cls = import_file(name, path) - except AttributeError: - raise IOError(Errors.E051.format(name=name)) - return cls.load(**overrides) +def load_model_from_package( + name: str, + *, + vocab: Union["Vocab", bool] = True, + disable: Iterable[str] = SimpleFrozenList(), + exclude: Iterable[str] = SimpleFrozenList(), + config: Union[Dict[str, Any], Config] = SimpleFrozenDict(), +) -> "Language": + """Load a model from an installed package. - -def load_model_from_package(name, **overrides): - """Load a model from an installed package.""" + name (str): The package name. + vocab (Vocab / True): Optional vocab to pass in on initialization. If True, + a new Vocab object will be created. + disable (Iterable[str]): Names of pipeline components to disable. Disabled + pipes will be loaded but they won't be run unless you explicitly + enable them by calling nlp.enable_pipe. + exclude (Iterable[str]): Names of pipeline components to exclude. Excluded + components won't be loaded. + config (Dict[str, Any] / Config): Config overrides as nested dict or dict + keyed by section values in dot notation. + RETURNS (Language): The loaded nlp object. + """ cls = importlib.import_module(name) - return cls.load(**overrides) + return cls.load(vocab=vocab, disable=disable, exclude=exclude, config=config) -def load_model_from_path(model_path, meta=False, **overrides): +def load_model_from_path( + model_path: Union[str, Path], + *, + meta: Optional[Dict[str, Any]] = None, + vocab: Union["Vocab", bool] = True, + disable: Iterable[str] = SimpleFrozenList(), + exclude: Iterable[str] = SimpleFrozenList(), + config: Union[Dict[str, Any], Config] = SimpleFrozenDict(), +) -> "Language": """Load a model from a data directory path. Creates Language class with - pipeline from meta.json and then calls from_disk() with path.""" + pipeline from config.cfg and then calls from_disk() with path. + + name (str): Package name or model path. + meta (Dict[str, Any]): Optional model meta. + vocab (Vocab / True): Optional vocab to pass in on initialization. If True, + a new Vocab object will be created. + disable (Iterable[str]): Names of pipeline components to disable. 
Disabled + pipes will be loaded but they won't be run unless you explicitly + enable them by calling nlp.enable_pipe. + exclude (Iterable[str]): Names of pipeline components to exclude. Excluded + components won't be loaded. + config (Dict[str, Any] / Config): Config overrides as nested dict or dict + keyed by section values in dot notation. + RETURNS (Language): The loaded nlp object. + """ + if not model_path.exists(): + raise IOError(Errors.E052.format(path=model_path)) if not meta: meta = get_model_meta(model_path) - # Support language factories registered via entry points (e.g. custom - # language subclass) while keeping top-level language identifier "lang" - lang = meta.get("lang_factory", meta["lang"]) - cls = get_lang_class(lang) - nlp = cls(meta=meta, **overrides) - pipeline = meta.get("pipeline", []) - factories = meta.get("factories", {}) - disable = overrides.get("disable", []) - if pipeline is True: - pipeline = nlp.Defaults.pipe_names - elif pipeline in (False, None): - pipeline = [] - # skip "vocab" from overrides in component initialization since vocab is - # already configured from overrides when nlp is initialized above - if "vocab" in overrides: - del overrides["vocab"] - for name in pipeline: - if name not in disable: - config = meta.get("pipeline_args", {}).get(name, {}) - config.update(overrides) - factory = factories.get(name, name) - component = nlp.create_pipe(factory, config=config) - nlp.add_pipe(component, name=name) - return nlp.from_disk(model_path, exclude=disable) + config_path = model_path / "config.cfg" + config = load_config(config_path, overrides=dict_to_dot(config)) + nlp = load_model_from_config(config, vocab=vocab, disable=disable, exclude=exclude) + return nlp.from_disk(model_path, exclude=exclude) -def load_model_from_init_py(init_file, **overrides): +def load_model_from_config( + config: Union[Dict[str, Any], Config], + *, + vocab: Union["Vocab", bool] = True, + disable: Iterable[str] = SimpleFrozenList(), + exclude: Iterable[str] = SimpleFrozenList(), + auto_fill: bool = False, + validate: bool = True, +) -> "Language": + """Create an nlp object from a config. Expects the full config file including + a section "nlp" containing the settings for the nlp object. + + name (str): Package name or model path. + meta (Dict[str, Any]): Optional model meta. + vocab (Vocab / True): Optional vocab to pass in on initialization. If True, + a new Vocab object will be created. + disable (Iterable[str]): Names of pipeline components to disable. Disabled + pipes will be loaded but they won't be run unless you explicitly + enable them by calling nlp.enable_pipe. + exclude (Iterable[str]): Names of pipeline components to exclude. Excluded + components won't be loaded. + auto_fill (bool): Whether to auto-fill config with missing defaults. + validate (bool): Whether to show config validation errors. + RETURNS (Language): The loaded nlp object. 
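For a sense of how this entry point is used, a minimal sketch (assuming a spaCy v3 environment; `auto_fill=True` fills in the remaining defaults):

from thinc.api import Config
from spacy.util import load_model_from_config

minimal_cfg = """
[nlp]
lang = "en"
pipeline = []
"""
nlp = load_model_from_config(Config().from_str(minimal_cfg), auto_fill=True)
assert nlp.lang == "en"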
+ """ + if "nlp" not in config: + raise ValueError(Errors.E985.format(config=config)) + nlp_config = config["nlp"] + if "lang" not in nlp_config or nlp_config["lang"] is None: + raise ValueError(Errors.E993.format(config=nlp_config)) + # This will automatically handle all codes registered via the languages + # registry, including custom subclasses provided via entry points + lang_cls = get_lang_class(nlp_config["lang"]) + nlp = lang_cls.from_config( + config, + vocab=vocab, + disable=disable, + exclude=exclude, + auto_fill=auto_fill, + validate=validate, + ) + return nlp + + +def resolve_dot_names(config: Config, dot_names: List[Optional[str]]) -> Tuple[Any]: + """Resolve one or more "dot notation" names, e.g. corpora.train. + The paths could point anywhere into the config, so we don't know which + top-level section we'll be looking within. + + We resolve the whole top-level section, although we could resolve less -- + we could find the lowest part of the tree. + """ + # TODO: include schema? + resolved = {} + output = [] + errors = [] + for name in dot_names: + if name is None: + output.append(name) + else: + section = name.split(".")[0] + # We want to avoid resolving the same thing twice + if section not in resolved: + if registry.is_promise(config[section]): + # Otherwise we can't resolve [corpus] if it's a promise + result = registry.resolve({"config": config[section]})["config"] + else: + result = registry.resolve(config[section]) + resolved[section] = result + try: + output.append(dot_to_object(resolved, name)) + except KeyError: + msg = f"not a valid section reference: {name}" + errors.append({"loc": name.split("."), "msg": msg}) + if errors: + raise ConfigValidationError(config=config, errors=errors) + return tuple(output) + + +def load_model_from_init_py( + init_file: Union[Path, str], + *, + vocab: Union["Vocab", bool] = True, + disable: Iterable[str] = SimpleFrozenList(), + exclude: Iterable[str] = SimpleFrozenList(), + config: Union[Dict[str, Any], Config] = SimpleFrozenDict(), +) -> "Language": """Helper function to use in the `load()` method of a model package's __init__.py. - init_file (unicode): Path to model's __init__.py, i.e. `__file__`. - **overrides: Specific overrides, like pipeline components to disable. - RETURNS (Language): `Language` class with loaded model. + vocab (Vocab / True): Optional vocab to pass in on initialization. If True, + a new Vocab object will be created. + disable (Iterable[str]): Names of pipeline components to disable. Disabled + pipes will be loaded but they won't be run unless you explicitly + enable them by calling nlp.enable_pipe. + exclude (Iterable[str]): Names of pipeline components to exclude. Excluded + components won't be loaded. + config (Dict[str, Any] / Config): Config overrides as nested dict or dict + keyed by section values in dot notation. + RETURNS (Language): The loaded nlp object. 
""" model_path = Path(init_file).parent meta = get_model_meta(model_path) - data_dir = "%s_%s-%s" % (meta["lang"], meta["name"], meta["version"]) + data_dir = f"{meta['lang']}_{meta['name']}-{meta['version']}" data_path = model_path / data_dir if not model_path.exists(): - raise IOError(Errors.E052.format(path=path2str(data_path))) - return load_model_from_path(data_path, meta, **overrides) + raise IOError(Errors.E052.format(path=data_path)) + return load_model_from_path( + data_path, + vocab=vocab, + meta=meta, + disable=disable, + exclude=exclude, + config=config, + ) -def get_model_meta(path): - """Get model meta.json from a directory path and validate its contents. +def load_config( + path: Union[str, Path], + overrides: Dict[str, Any] = SimpleFrozenDict(), + interpolate: bool = False, +) -> Config: + """Load a config file. Takes care of path validation and section order. - path (unicode or Path): Path to model directory. - RETURNS (dict): The model's meta data. + path (Union[str, Path]): Path to the config file. + overrides: (Dict[str, Any]): Config overrides as nested dict or + dict keyed by section values in dot notation. + interpolate (bool): Whether to interpolate and resolve variables. + RETURNS (Config): The loaded config. """ - model_path = ensure_path(path) - if not model_path.exists(): - raise IOError(Errors.E052.format(path=path2str(model_path))) - meta_path = model_path / "meta.json" - if not meta_path.is_file(): - raise IOError(Errors.E053.format(path=meta_path)) - meta = srsly.read_json(meta_path) + config_path = ensure_path(path) + if not config_path.exists() or not config_path.is_file(): + raise IOError(Errors.E053.format(path=config_path, name="config.cfg")) + return Config(section_order=CONFIG_SECTION_ORDER).from_disk( + config_path, overrides=overrides, interpolate=interpolate + ) + + +def load_config_from_str( + text: str, overrides: Dict[str, Any] = SimpleFrozenDict(), interpolate: bool = False +): + """Load a full config from a string. Wrapper around Thinc's Config.from_str. + + text (str): The string config to load. + interpolate (bool): Whether to interpolate and resolve variables. + RETURNS (Config): The loaded config. + """ + return Config(section_order=CONFIG_SECTION_ORDER).from_str( + text, overrides=overrides, interpolate=interpolate + ) + + +def get_installed_models() -> List[str]: + """List all model packages currently installed in the environment. + + RETURNS (List[str]): The string names of the models. + """ + return list(registry.models.get_all().keys()) + + +def get_package_version(name: str) -> Optional[str]: + """Get the version of an installed package. Typically used to get model + package versions. + + name (str): The name of the installed Python package. + RETURNS (str / None): The version or None if package not installed. + """ + try: + return importlib_metadata.version(name) + except importlib_metadata.PackageNotFoundError: + return None + + +def is_compatible_version( + version: str, constraint: str, prereleases: bool = True +) -> Optional[bool]: + """Check if a version (e.g. "2.0.0") is compatible given a version + constraint (e.g. ">=1.9.0,<2.2.1"). If the constraint is a specific version, + it's interpreted as =={version}. + + version (str): The version to check. + constraint (str): The constraint string. + prereleases (bool): Whether to allow prereleases. If set to False, + prerelease versions will be considered incompatible. + RETURNS (bool / None): Whether the version is compatible, or None if the + version or constraint are invalid. 
+ """ + # Handle cases where exact version is provided as constraint + if constraint[0].isdigit(): + constraint = f"=={constraint}" + try: + spec = SpecifierSet(constraint) + version = Version(version) + except (InvalidSpecifier, InvalidVersion): + return None + spec.prereleases = prereleases + return version in spec + + +def is_unconstrained_version( + constraint: str, prereleases: bool = True +) -> Optional[bool]: + # We have an exact version, this is the ultimate constrained version + if constraint[0].isdigit(): + return False + try: + spec = SpecifierSet(constraint) + except InvalidSpecifier: + return None + spec.prereleases = prereleases + specs = [sp for sp in spec] + # We only have one version spec and it defines > or >= + if len(specs) == 1 and specs[0].operator in (">", ">="): + return True + # One specifier is exact version + if any(sp.operator in ("==") for sp in specs): + return False + has_upper = any(sp.operator in ("<", "<=") for sp in specs) + has_lower = any(sp.operator in (">", ">=") for sp in specs) + # We have a version spec that defines an upper and lower bound + if has_upper and has_lower: + return False + # Everything else, like only an upper version, only a lower version etc. + return True + + +def get_model_version_range(spacy_version: str) -> str: + """Generate a version range like >=1.2.3,<1.3.0 based on a given spaCy + version. Models are always compatible across patch versions but not + across minor or major versions. + """ + release = Version(spacy_version).release + return f">={spacy_version},<{release[0]}.{release[1] + 1}.0" + + +def get_base_version(version: str) -> str: + """Generate the base version without any prerelease identifiers. + + version (str): The version, e.g. "3.0.0.dev1". + RETURNS (str): The base version, e.g. "3.0.0". + """ + return Version(version).base_version + + +def get_minor_version(version: str) -> Optional[str]: + """Get the major + minor version (without patch or prerelease identifiers). + + version (str): The version. + RETURNS (str): The major + minor version or None if version is invalid. + """ + try: + v = Version(version) + except (TypeError, InvalidVersion): + return None + return f"{v.major}.{v.minor}" + + +def is_minor_version_match(version_a: str, version_b: str) -> bool: + """Compare two versions and check if they match in major and minor, without + patch or prerelease identifiers. Used internally for compatibility checks + that should be insensitive to patch releases. + + version_a (str): The first version + version_b (str): The second version. + RETURNS (bool): Whether the versions match. + """ + a = get_minor_version(version_a) + b = get_minor_version(version_b) + return a is not None and b is not None and a == b + + +def load_meta(path: Union[str, Path]) -> Dict[str, Any]: + """Load a model meta.json from a path and validate its contents. + + path (Union[str, Path]): Path to meta.json. + RETURNS (Dict[str, Any]): The loaded meta. 
+ """ + path = ensure_path(path) + if not path.parent.exists(): + raise IOError(Errors.E052.format(path=path.parent)) + if not path.exists() or not path.is_file(): + raise IOError(Errors.E053.format(path=path.parent, name="meta.json")) + meta = srsly.read_json(path) for setting in ["lang", "name", "version"]: if setting not in meta or not meta[setting]: raise ValueError(Errors.E054.format(setting=setting)) if "spacy_version" in meta: - about_major_minor = ".".join(about.__version__.split(".")[:2]) - if not meta["spacy_version"].startswith(">=" + about_major_minor): - # try to simplify version requirements from model meta to vx.x - # for warning message - meta_spacy_version = "v" + ".".join( - meta["spacy_version"].replace(">=", "").split(".")[:2] - ) - # if the format is unexpected, supply the full version - if not re.match(r"v\d+\.\d+", meta_spacy_version): - meta_spacy_version = meta["spacy_version"] - warn_msg = Warnings.W031.format( - model=meta["lang"] + "_" + meta["name"], + if not is_compatible_version(about.__version__, meta["spacy_version"]): + warn_msg = Warnings.W095.format( + model=f"{meta['lang']}_{meta['name']}", model_version=meta["version"], - version=meta_spacy_version, + version=meta["spacy_version"], current=about.__version__, ) warnings.warn(warn_msg) - else: - warn_msg = Warnings.W032.format( - model=meta["lang"] + "_" + meta["name"], - model_version=meta["version"], - current=about.__version__, - ) - warnings.warn(warn_msg) + if is_unconstrained_version(meta["spacy_version"]): + warn_msg = Warnings.W094.format( + model=f"{meta['lang']}_{meta['name']}", + model_version=meta["version"], + version=meta["spacy_version"], + example=get_model_version_range(about.__version__), + ) + warnings.warn(warn_msg) return meta -def is_package(name): +def get_model_meta(path: Union[str, Path]) -> Dict[str, Any]: + """Get model meta.json from a directory path and validate its contents. + + path (str / Path): Path to model directory. + RETURNS (Dict[str, Any]): The model's meta data. + """ + model_path = ensure_path(path) + return load_meta(model_path / "meta.json") + + +def is_package(name: str) -> bool: """Check if string maps to a package installed via pip. - name (unicode): Name of package. + name (str): Name of package. RETURNS (bool): True if installed package, False if not. """ - import pkg_resources - - name = name.lower() # compare package name against lowercase name - packages = pkg_resources.working_set.by_key.keys() - for package in packages: - if package.lower().replace("-", "_") == name: - return True - return False + try: + importlib_metadata.distribution(name) + return True + except: # noqa: E722 + return False -def get_package_path(name): +def get_package_path(name: str) -> Path: """Get the path to an installed package. - name (unicode): Package name. + name (str): Package name. RETURNS (Path): Path to installed package. """ name = name.lower() # use lowercase version to be safe @@ -312,7 +683,124 @@ def get_package_path(name): return Path(pkg.__file__).parent -def is_in_jupyter(): +def split_command(command: str) -> List[str]: + """Split a string command using shlex. Handles platform compatibility. + + command (str) : The command to split + RETURNS (List[str]): The split command. + """ + return shlex.split(command, posix=not is_windows) + + +def join_command(command: List[str]) -> str: + """Join a command using shlex. shlex.join is only available for Python 3.8+, + so we're using a workaround here. + + command (List[str]): The command to join. 
+ RETURNS (str): The joined command + """ + return " ".join(shlex.quote(cmd) for cmd in command) + + +def run_command( + command: Union[str, List[str]], + *, + stdin: Optional[Any] = None, + capture: bool = False, +) -> Optional[subprocess.CompletedProcess]: + """Run a command on the command line as a subprocess. If the subprocess + returns a non-zero exit code, a system exit is performed. + + command (str / List[str]): The command. If provided as a string, the + string will be split using shlex.split. + stdin (Optional[Any]): stdin to read from or None. + capture (bool): Whether to capture the output and errors. If False, + the stdout and stderr will not be redirected, and if there's an error, + sys.exit will be called with the returncode. You should use capture=False + when you want to turn over execution to the command, and capture=True + when you want to run the command more like a function. + RETURNS (Optional[CompletedProcess]): The process object. + """ + if isinstance(command, str): + cmd_list = split_command(command) + cmd_str = command + else: + cmd_list = command + cmd_str = " ".join(command) + try: + ret = subprocess.run( + cmd_list, + env=os.environ.copy(), + input=stdin, + encoding="utf8", + check=False, + stdout=subprocess.PIPE if capture else None, + stderr=subprocess.STDOUT if capture else None, + ) + except FileNotFoundError: + # Indicates the *command* wasn't found, it's an error before the command + # is run. + raise FileNotFoundError( + Errors.E970.format(str_command=cmd_str, tool=cmd_list[0]) + ) from None + if ret.returncode != 0 and capture: + message = f"Error running command:\n\n{cmd_str}\n\n" + message += f"Subprocess exited with status {ret.returncode}" + if ret.stdout is not None: + message += f"\n\nProcess log (stdout and stderr):\n\n" + message += ret.stdout + error = subprocess.SubprocessError(message) + error.ret = ret + error.command = cmd_str + raise error + elif ret.returncode != 0: + sys.exit(ret.returncode) + return ret + + +@contextmanager +def working_dir(path: Union[str, Path]) -> None: + """Change current working directory and returns to previous on exit. + + path (str / Path): The directory to navigate to. + YIELDS (Path): The absolute path to the current working directory. This + should be used if the block needs to perform actions within the working + directory, to prevent mismatches with relative paths. + """ + prev_cwd = Path.cwd() + current = Path(path).resolve() + os.chdir(str(current)) + try: + yield current + finally: + os.chdir(str(prev_cwd)) + + +@contextmanager +def make_tempdir() -> Generator[Path, None, None]: + """Execute a block in a temporary directory and remove the directory and + its contents at the end of the with block. + + YIELDS (Path): The path of the temp directory. + """ + d = Path(tempfile.mkdtemp()) + yield d + try: + shutil.rmtree(str(d)) + except PermissionError as e: + warnings.warn(Warnings.W091.format(dir=d, msg=e)) + + +def is_cwd(path: Union[Path, str]) -> bool: + """Check whether a path is the current working directory. + + path (Union[Path, str]): The directory path. + RETURNS (bool): Whether the path is the current working directory. + """ + return str(Path(path).resolve()).lower() == str(Path.cwd().resolve()).lower() + + +def is_in_jupyter() -> bool: """Check if user is running spaCy from a Jupyter notebook by detecting the IPython kernel. Mainly used for the displaCy visualizer. RETURNS (bool): True if in Jupyter, False if not. 
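The process and filesystem helpers compose as in this rough sketch (`sys.executable` is used so the command resolves on any platform):

import sys
from spacy.util import make_tempdir, working_dir, run_command

with make_tempdir() as tmp_dir:        # directory is removed again on exit
    with working_dir(tmp_dir):         # previous cwd is restored afterwards
        ret = run_command([sys.executable, "--version"], capture=True)
        print(ret.stdout.strip())      # e.g. "Python 3.8.5"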
@@ -327,20 +815,47 @@ def is_in_jupyter(): return False -def get_component_name(component): - if hasattr(component, "name"): - return component.name - if hasattr(component, "__name__"): - return component.__name__ - if hasattr(component, "__class__") and hasattr(component.__class__, "__name__"): - return component.__class__.__name__ - return repr(component) +def get_object_name(obj: Any) -> str: + """Get a human-readable name of a Python object, e.g. a pipeline component. + + obj (Any): The Python object, typically a function or class. + RETURNS (str): A human-readable name. + """ + if hasattr(obj, "name") and obj.name is not None: + return obj.name + if hasattr(obj, "__name__"): + return obj.__name__ + if hasattr(obj, "__class__") and hasattr(obj.__class__, "__name__"): + return obj.__class__.__name__ + return repr(obj) -def get_cuda_stream(require=False, non_blocking=True): +def is_same_func(func1: Callable, func2: Callable) -> bool: + """Approximately decide whether two functions are the same, even if their + identity is different (e.g. after they have been live reloaded). Mostly + used in the @Language.component and @Language.factory decorators to decide + whether to raise if a factory already exists. Allows decorator to run + multiple times with the same function. + + func1 (Callable): The first function. + func2 (Callable): The second function. + RETURNS (bool): Whether it's the same function (most likely). + """ + if not callable(func1) or not callable(func2): + return False + same_name = func1.__qualname__ == func2.__qualname__ + same_file = inspect.getfile(func1) == inspect.getfile(func2) + same_code = inspect.getsourcelines(func1) == inspect.getsourcelines(func2) + return same_name and same_file and same_code + + +def get_cuda_stream( + require: bool = False, non_blocking: bool = True +) -> Optional[CudaStream]: + ops = get_current_ops() if CudaStream is None: return None - elif isinstance(Model.ops, NumpyOps): + elif isinstance(ops, NumpyOps): return None else: return CudaStream(non_blocking=non_blocking) @@ -355,28 +870,7 @@ def get_async(stream, numpy_array): return array -def env_opt(name, default=None): - if type(default) is float: - type_convert = float - else: - type_convert = int - if "SPACY_" + name.upper() in os.environ: - value = type_convert(os.environ["SPACY_" + name.upper()]) - if _PRINT_ENV: - print(name, "=", repr(value), "via", "$SPACY_" + name.upper()) - return value - elif name in os.environ: - value = type_convert(os.environ[name]) - if _PRINT_ENV: - print(name, "=", repr(value), "via", "$" + name) - return value - else: - if _PRINT_ENV: - print(name, "=", repr(default), "by default") - return default - - -def read_regex(path): +def read_regex(path: Union[str, Path]) -> Pattern: path = ensure_path(path) with path.open(encoding="utf8") as file_: entries = file_.read().split("\n") @@ -386,44 +880,40 @@ def read_regex(path): return re.compile(expression) -def compile_prefix_regex(entries): +def compile_prefix_regex(entries: Iterable[Union[str, Pattern]]) -> Pattern: """Compile a sequence of prefix rules into a regex object. - entries (tuple): The prefix rules, e.g. spacy.lang.punctuation.TOKENIZER_PREFIXES. - RETURNS (regex object): The regex object. to be used for Tokenizer.prefix_search. + entries (Iterable[Union[str, Pattern]]): The prefix rules, e.g. + spacy.lang.punctuation.TOKENIZER_PREFIXES. + RETURNS (Pattern): The regex object. to be used for Tokenizer.prefix_search. 
""" - if "(" in entries: - # Handle deprecated data - expression = "|".join( - ["^" + re.escape(piece) for piece in entries if piece.strip()] - ) - return re.compile(expression) - else: - expression = "|".join(["^" + piece for piece in entries if piece.strip()]) - return re.compile(expression) + expression = "|".join(["^" + piece for piece in entries if piece.strip()]) + return re.compile(expression) -def compile_suffix_regex(entries): +def compile_suffix_regex(entries: Iterable[Union[str, Pattern]]) -> Pattern: """Compile a sequence of suffix rules into a regex object. - entries (tuple): The suffix rules, e.g. spacy.lang.punctuation.TOKENIZER_SUFFIXES. - RETURNS (regex object): The regex object. to be used for Tokenizer.suffix_search. + entries (Iterable[Union[str, Pattern]]): The suffix rules, e.g. + spacy.lang.punctuation.TOKENIZER_SUFFIXES. + RETURNS (Pattern): The regex object. to be used for Tokenizer.suffix_search. """ expression = "|".join([piece + "$" for piece in entries if piece.strip()]) return re.compile(expression) -def compile_infix_regex(entries): +def compile_infix_regex(entries: Iterable[Union[str, Pattern]]) -> Pattern: """Compile a sequence of infix rules into a regex object. - entries (tuple): The infix rules, e.g. spacy.lang.punctuation.TOKENIZER_INFIXES. + entries (Iterable[Union[str, Pattern]]): The infix rules, e.g. + spacy.lang.punctuation.TOKENIZER_INFIXES. RETURNS (regex object): The regex object. to be used for Tokenizer.infix_finditer. """ expression = "|".join([piece for piece in entries if piece.strip()]) return re.compile(expression) -def add_lookups(default_func, *lookups): +def add_lookups(default_func: Callable[[str], Any], *lookups) -> Callable[[str], Any]: """Extend an attribute function with special cases. If a word is in the lookups, the value is returned. Otherwise the previous function is used. @@ -436,24 +926,28 @@ def add_lookups(default_func, *lookups): return functools.partial(_get_attr_unless_lookup, default_func, lookups) -def _get_attr_unless_lookup(default_func, lookups, string): +def _get_attr_unless_lookup( + default_func: Callable[[str], Any], lookups: Dict[str, Any], string: str +) -> Any: for lookup in lookups: if string in lookup: return lookup[string] return default_func(string) -def update_exc(base_exceptions, *addition_dicts): +def update_exc( + base_exceptions: Dict[str, List[dict]], *addition_dicts +) -> Dict[str, List[dict]]: """Update and validate tokenizer exceptions. Will overwrite exceptions. - base_exceptions (dict): Base exceptions. - *addition_dicts (dict): Exceptions to add to the base dict, in order. - RETURNS (dict): Combined tokenizer exceptions. + base_exceptions (Dict[str, List[dict]]): Base exceptions. + *addition_dicts (Dict[str, List[dict]]): Exceptions to add to the base dict, in order. + RETURNS (Dict[str, List[dict]]): Combined tokenizer exceptions. 
""" exc = dict(base_exceptions) for additions in addition_dicts: for orth, token_attrs in additions.items(): - if not all(isinstance(attr[ORTH], unicode_) for attr in token_attrs): + if not all(isinstance(attr[ORTH], str) for attr in token_attrs): raise ValueError(Errors.E055.format(key=orth, orths=token_attrs)) described_orth = "".join(attr[ORTH] for attr in token_attrs) if orth != described_orth: @@ -463,14 +957,16 @@ def update_exc(base_exceptions, *addition_dicts): return exc -def expand_exc(excs, search, replace): +def expand_exc( + excs: Dict[str, List[dict]], search: str, replace: str +) -> Dict[str, List[dict]]: """Find string in tokenizer exceptions, duplicate entry and replace string. For example, to add additional versions with typographic apostrophes. - excs (dict): Tokenizer exceptions. - search (unicode): String to find and replace. - replace (unicode): Replacement. - RETURNS (dict): Combined tokenizer exceptions. + excs (Dict[str, List[dict]]): Tokenizer exceptions. + search (str): String to find and replace. + replace (str): Replacement. + RETURNS (Dict[str, List[dict]]): Combined tokenizer exceptions. """ def _fix_token(token, search, replace): @@ -487,7 +983,9 @@ def expand_exc(excs, search, replace): return new_excs -def normalize_slice(length, start, stop, step=None): +def normalize_slice( + length: int, start: int, stop: int, step: Optional[int] = None +) -> Tuple[int, int]: if not (step is None or step == 1): raise ValueError(Errors.E057) if start is None: @@ -503,142 +1001,14 @@ def normalize_slice(length, start, stop, step=None): return start, stop -def minibatch(items, size=8): - """Iterate over batches of items. `size` may be an iterator, - so that batch-size can vary on each step. - """ - if isinstance(size, int): - size_ = itertools.repeat(size) - else: - size_ = size - items = iter(items) - while True: - batch_size = next(size_) - batch = list(itertools.islice(items, int(batch_size))) - if len(batch) == 0: - break - yield list(batch) - - -def compounding(start, stop, compound): - """Yield an infinite series of compounding values. Each time the - generator is called, a value is produced by multiplying the previous - value by the compound rate. - - EXAMPLE: - >>> sizes = compounding(1., 10., 1.5) - >>> assert next(sizes) == 1. - >>> assert next(sizes) == 1 * 1.5 - >>> assert next(sizes) == 1.5 * 1.5 - """ - - def clip(value): - return max(value, stop) if (start > stop) else min(value, stop) - - curr = float(start) - while True: - yield clip(curr) - curr *= compound - - -def stepping(start, stop, steps): - """Yield an infinite series of values that step from a start value to a - final value over some number of steps. Each step is (stop-start)/steps. - - After the final value is reached, the generator continues yielding that - value. - - EXAMPLE: - >>> sizes = stepping(1., 200., 100) - >>> assert next(sizes) == 1. - >>> assert next(sizes) == 1 * (200.-1.) / 100 - >>> assert next(sizes) == 1 + (200.-1.) / 100 + (200.-1.) 
/ 100 - """ - - def clip(value): - return max(value, stop) if (start > stop) else min(value, stop) - - curr = float(start) - while True: - yield clip(curr) - curr += (stop - start) / steps - - -def decaying(start, stop, decay): - """Yield an infinite series of linearly decaying values.""" - - curr = float(start) - while True: - yield max(curr, stop) - curr -= decay - - -def minibatch_by_words(items, size, tuples=True, count_words=len): - """Create minibatches of a given number of words.""" - if isinstance(size, int): - size_ = itertools.repeat(size) - else: - size_ = size - items = iter(items) - while True: - batch_size = next(size_) - batch = [] - while batch_size >= 0: - try: - if tuples: - doc, gold = next(items) - else: - doc = next(items) - except StopIteration: - if batch: - yield batch - return - batch_size -= count_words(doc) - if tuples: - batch.append((doc, gold)) - else: - batch.append(doc) - if batch: - yield batch - - -def itershuffle(iterable, bufsize=1000): - """Shuffle an iterator. This works by holding `bufsize` items back - and yielding them sometime later. Obviously, this is not unbiased – - but should be good enough for batching. Larger bufsize means less bias. - From https://gist.github.com/andres-erbsen/1307752 - - iterable (iterable): Iterator to shuffle. - bufsize (int): Items to hold back. - YIELDS (iterable): The shuffled iterator. - """ - iterable = iter(iterable) - buf = [] - try: - while True: - for i in range(random.randint(1, bufsize - len(buf))): - buf.append(next(iterable)) - random.shuffle(buf) - for i in range(random.randint(1, bufsize)): - if buf: - yield buf.pop() - else: - break - except StopIteration: - random.shuffle(buf) - while buf: - yield buf.pop() - raise StopIteration - - -def filter_spans(spans): +def filter_spans(spans: Iterable["Span"]) -> List["Span"]: """Filter a sequence of spans and remove duplicates or overlaps. Useful for creating named entities (where one token can only be part of one entity) or when merging spans with `Retokenizer.merge`. When spans overlap, the (first) longest span is preferred over shorter spans. - spans (iterable): The spans to filter. - RETURNS (list): The filtered spans. + spans (Iterable[Span]): The spans to filter. + RETURNS (List[Span]): The filtered spans. 
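A quick illustration of the overlap filtering, assuming a blank English pipeline for tokenization:

import spacy
from spacy.util import filter_spans

nlp = spacy.blank("en")
doc = nlp("I like New York in Autumn")
spans = [doc[2:4], doc[3:5], doc[0:1]]   # "New York", "York in", "I"
filtered = filter_spans(spans)
assert [span.text for span in filtered] == ["I", "New York"]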
""" get_sort_key = lambda span: (span.end - span.start, -span.start) sorted_spans = sorted(spans, key=get_sort_key, reverse=True) @@ -653,17 +1023,34 @@ def filter_spans(spans): return result -def to_bytes(getters, exclude): - serialized = OrderedDict() +def to_bytes(getters: Dict[str, Callable[[], bytes]], exclude: Iterable[str]) -> bytes: + return srsly.msgpack_dumps(to_dict(getters, exclude)) + + +def from_bytes( + bytes_data: bytes, + setters: Dict[str, Callable[[bytes], Any]], + exclude: Iterable[str], +) -> None: + return from_dict(srsly.msgpack_loads(bytes_data), setters, exclude) + + +def to_dict( + getters: Dict[str, Callable[[], Any]], exclude: Iterable[str] +) -> Dict[str, Any]: + serialized = {} for key, getter in getters.items(): # Split to support file names like meta.json if key.split(".")[0] not in exclude: serialized[key] = getter() - return srsly.msgpack_dumps(serialized) + return serialized -def from_bytes(bytes_data, setters, exclude): - msg = srsly.msgpack_loads(bytes_data) +def from_dict( + msg: Dict[str, Any], + setters: Dict[str, Callable[[Any], Any]], + exclude: Iterable[str], +) -> Dict[str, Any]: for key, setter in setters.items(): # Split to support file names like meta.json if key.split(".")[0] not in exclude and key in msg: @@ -671,7 +1058,11 @@ def from_bytes(bytes_data, setters, exclude): return msg -def to_disk(path, writers, exclude): +def to_disk( + path: Union[str, Path], + writers: Dict[str, Callable[[Path], None]], + exclude: Iterable[str], +) -> Path: path = ensure_path(path) if not path.exists(): path.mkdir() @@ -682,7 +1073,11 @@ def to_disk(path, writers, exclude): return path -def from_disk(path, readers, exclude): +def from_disk( + path: Union[str, Path], + readers: Dict[str, Callable[[Path], None]], + exclude: Iterable[str], +) -> Path: path = ensure_path(path) for key, reader in readers.items(): # Split to support file names like meta.json @@ -691,23 +1086,36 @@ def from_disk(path, readers, exclude): return path -def minify_html(html): +def import_file(name: str, loc: Union[str, Path]) -> ModuleType: + """Import module from a file. Used to load models from a directory. + + name (str): Name of module to load. + loc (str / Path): Path to the file. + RETURNS: The loaded module. + """ + spec = importlib.util.spec_from_file_location(name, str(loc)) + module = importlib.util.module_from_spec(spec) + spec.loader.exec_module(module) + return module + + +def minify_html(html: str) -> str: """Perform a template-specific, rudimentary HTML minification for displaCy. Disclaimer: NOT a general-purpose solution, only removes indentation and newlines. - html (unicode): Markup to minify. - RETURNS (unicode): "Minified" HTML. + html (str): Markup to minify. + RETURNS (str): "Minified" HTML. """ return html.strip().replace(" ", "").replace("\n", "") -def escape_html(text): +def escape_html(text: str) -> str: """Replace <, >, &, " with their HTML encoded representation. Intended to prevent HTML errors in rendered displaCy markup. - text (unicode): The original text. - RETURNS (unicode): Equivalent text to be safely used within HTML. + text (str): The original text. + RETURNS (str): Equivalent text to be safely used within HTML. 
""" text = text.replace("&", "&") text = text.replace("<", "<") @@ -716,82 +1124,18 @@ def escape_html(text): return text -def use_gpu(gpu_id): - try: - import cupy.cuda.device - except ImportError: - return None - from thinc.neural.ops import CupyOps +def get_words_and_spaces( + words: Iterable[str], text: str +) -> Tuple[List[str], List[bool]]: + """Given a list of words and a text, reconstruct the original tokens and + return a list of words and spaces that can be used to create a Doc. This + can help recover destructive tokenization that didn't preserve any + whitespace information. - device = cupy.cuda.device.Device(gpu_id) - device.use() - Model.ops = CupyOps() - Model.Ops = CupyOps - return device - - -def fix_random_seed(seed=0): - random.seed(seed) - numpy.random.seed(seed) - if cupy is not None: - cupy.random.seed(seed) - - -def get_json_validator(schema): - # We're using a helper function here to make it easier to change the - # validator that's used (e.g. different draft implementation), without - # having to change it all across the codebase. - # TODO: replace with (stable) Draft6Validator, if available - if jsonschema is None: - raise ValueError(Errors.E136) - return jsonschema.Draft4Validator(schema) - - -def validate_schema(schema): - """Validate a given schema. This just checks if the schema itself is valid.""" - validator = get_json_validator(schema) - validator.check_schema(schema) - - -def validate_json(data, validator): - """Validate data against a given JSON schema (see https://json-schema.org). - - data: JSON-serializable data to validate. - validator (jsonschema.DraftXValidator): The validator. - RETURNS (list): A list of error messages, if available. + words (Iterable[str]): The words. + text (str): The original text. + RETURNS (Tuple[List[str], List[bool]]): The words and spaces. """ - errors = [] - for err in sorted(validator.iter_errors(data), key=lambda e: e.path): - if err.path: - err_path = "[{}]".format(" -> ".join([str(p) for p in err.path])) - else: - err_path = "" - msg = err.message + " " + err_path - if err.context: # Error has suberrors, e.g. if schema uses anyOf - suberrs = [" - {}".format(suberr.message) for suberr in err.context] - msg += ":\n{}".format("".join(suberrs)) - errors.append(msg) - return errors - - -def get_serialization_exclude(serializers, exclude, kwargs): - """Helper function to validate serialization args and manage transition from - keyword arguments (pre v2.1) to exclude argument. - """ - exclude = list(exclude) - # Split to support file names like meta.json - options = [name.split(".")[0] for name in serializers] - for key, value in kwargs.items(): - if key in ("vocab",) and value is False: - warnings.warn(Warnings.W015.format(arg=key), DeprecationWarning) - exclude.append(key) - elif key.split(".")[0] in options: - raise ValueError(Errors.E128.format(arg=key)) - # TODO: user warning? 
- return exclude - - -def get_words_and_spaces(words, text): if "".join("".join(words).split()) != "".join(text.split()): raise ValueError(Errors.E194.format(text=text, words=words)) text_words = [] @@ -804,7 +1148,7 @@ def get_words_and_spaces(words, text): try: word_start = text[text_pos:].index(word) except ValueError: - raise ValueError(Errors.E194.format(text=text, words=words)) + raise ValueError(Errors.E194.format(text=text, words=words)) from None if word_start > 0: text_words.append(text[text_pos : text_pos + word_start]) text_spaces.append(False) @@ -821,23 +1165,130 @@ def get_words_and_spaces(words, text): return (text_words, text_spaces) -class SimpleFrozenDict(dict): - """Simplified implementation of a frozen dict, mainly used as default - function or method argument (for arguments that should default to empty - dictionary). Will raise an error if user or spaCy attempts to add to dict. +def copy_config(config: Union[Dict[str, Any], Config]) -> Config: + """Deep copy a Config. Will raise an error if the config contents are not + JSON-serializable. + + config (Config): The config to copy. + RETURNS (Config): The copied config. """ - - def __setitem__(self, key, value): - raise NotImplementedError(Errors.E095) - - def pop(self, key, default=None): - raise NotImplementedError(Errors.E095) - - def update(self, other): - raise NotImplementedError(Errors.E095) + try: + return Config(config).copy() + except ValueError: + raise ValueError(Errors.E961.format(config=config)) from None -class DummyTokenizer(object): +def dot_to_dict(values: Dict[str, Any]) -> Dict[str, dict]: + """Convert dot notation to a dict. For example: {"token.pos": True, + "token._.xyz": True} becomes {"token": {"pos": True, "_": {"xyz": True }}}. + + values (Dict[str, Any]): The key/value pairs to convert. + RETURNS (Dict[str, dict]): The converted values. + """ + result = {} + for key, value in values.items(): + path = result + parts = key.lower().split(".") + for i, item in enumerate(parts): + is_last = i == len(parts) - 1 + path = path.setdefault(item, value if is_last else {}) + return result + + +def dict_to_dot(obj: Dict[str, dict]) -> Dict[str, Any]: + """Convert dot notation to a dict. For example: {"token": {"pos": True, + "_": {"xyz": True }}} becomes {"token.pos": True, "token._.xyz": True}. + + values (Dict[str, dict]): The dict to convert. + RETURNS (Dict[str, Any]): The key/value pairs. + """ + return {".".join(key): value for key, value in walk_dict(obj)} + + +def dot_to_object(config: Config, section: str): + """Convert dot notation of a "section" to a specific part of the Config. + e.g. "training.optimizer" would return the Optimizer object. + Throws an error if the section is not defined in this config. + + config (Config): The config. + section (str): The dot notation of the section in the config. 
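The dot-notation helpers round-trip as sketched here (the key is illustrative):

from spacy.util import dot_to_dict, dict_to_dot

flat = {"training.optimizer.learn_rate": 0.001}
nested = dot_to_dict(flat)
assert nested == {"training": {"optimizer": {"learn_rate": 0.001}}}
assert dict_to_dot(nested) == flat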
+ RETURNS: The object denoted by the section + """ + component = config + parts = section.split(".") + for item in parts: + try: + component = component[item] + except (KeyError, TypeError): + raise KeyError(Errors.E952.format(name=section)) from None + return component + + +def walk_dict( + node: Dict[str, Any], parent: List[str] = [] +) -> Iterator[Tuple[List[str], Any]]: + """Walk a dict and yield the path and values of the leaves.""" + for key, value in node.items(): + key_parent = [*parent, key] + if isinstance(value, dict): + yield from walk_dict(value, key_parent) + else: + yield (key_parent, value) + + +def get_arg_names(func: Callable) -> List[str]: + """Get a list of all named arguments of a function (regular, + keyword-only). + + func (Callable): The function + RETURNS (List[str]): The argument names. + """ + argspec = inspect.getfullargspec(func) + return list(set([*argspec.args, *argspec.kwonlyargs])) + + +def combine_score_weights( + weights: List[Dict[str, float]], + overrides: Dict[str, Optional[Union[float, int]]] = SimpleFrozenDict(), +) -> Dict[str, float]: + """Combine and normalize score weights defined by components, e.g. + {"ents_r": 0.2, "ents_p": 0.3, "ents_f": 0.5} and {"some_other_score": 1.0}. + + weights (List[dict]): The weights defined by the components. + overrides (Dict[str, Optional[Union[float, int]]]): Existing scores that + should be preserved. + RETURNS (Dict[str, float]): The combined and normalized weights. + """ + # We first need to extract all None/null values for score weights that + # shouldn't be shown in the table *or* be weighted + result = {} + all_weights = [] + for w_dict in weights: + filtered_weights = {} + for key, value in w_dict.items(): + value = overrides.get(key, value) + if value is None: + result[key] = None + else: + filtered_weights[key] = value + all_weights.append(filtered_weights) + for w_dict in all_weights: + # We need to account for weights that don't sum to 1.0 and normalize + # the score weights accordingly, then divide score by the number of + # components. + total = sum(w_dict.values()) + for key, value in w_dict.items(): + if total == 0: + weight = 0.0 + else: + weight = round(value / total / len(all_weights), 2) + prev_weight = result.get(key, 0.0) + prev_weight = 0.0 if prev_weight is None else prev_weight + result[key] = prev_weight + weight + return result + + +class DummyTokenizer: # add dummy methods for to_bytes, from_bytes, to_disk and from_disk to # allow serialization (see #1557) def to_bytes(self, **kwargs): @@ -851,3 +1302,71 @@ class DummyTokenizer(object): def from_disk(self, _path, **kwargs): return self + + +def create_default_optimizer() -> Optimizer: + return Adam() + + +def minibatch(items, size): + """Iterate over batches of items. `size` may be an iterator, + so that batch-size can vary on each step. + """ + if isinstance(size, int): + size_ = itertools.repeat(size) + else: + size_ = size + items = iter(items) + while True: + batch_size = next(size_) + batch = list(itertools.islice(items, int(batch_size))) + if len(batch) == 0: + break + yield list(batch) + + +def is_cython_func(func: Callable) -> bool: + """Slightly hacky check for whether a callable is implemented in Cython. + Can be used to implement slightly different behaviors, especially around + inspecting and parameter annotations. Note that this will only return True + for actual cdef functions and methods, not regular Python functions defined + in Python modules. + + func (Callable): The callable to check. 
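For two hypothetical components, the weight combination plays out roughly like this (numbers chosen so the rounding is exact):

from spacy.util import combine_score_weights

weights = [
    {"dep_uas": 0.5, "dep_las": 0.5},   # e.g. a parser component
    {"tag_acc": 1.0},                   # e.g. a tagger component
]
combined = combine_score_weights(weights)
# Each component's weights are normalized, then split evenly across components:
assert combined == {"dep_uas": 0.25, "dep_las": 0.25, "tag_acc": 0.5}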
+ RETURNS (bool): Whether the callable is Cython (probably). + """ + attr = "__pyx_vtable__" + if hasattr(func, attr): # function or class instance + return True + # https://stackoverflow.com/a/55767059 + if hasattr(func, "__qualname__") and hasattr(func, "__module__"): # method + cls_func = vars(sys.modules[func.__module__])[func.__qualname__.split(".")[0]] + return hasattr(cls_func, attr) + return False + + +def check_bool_env_var(env_var: str) -> bool: + """Convert the value of an environment variable to a boolean. Add special + check for "0" (falsy) and consider everything else truthy, except unset. + + env_var (str): The name of the environment variable to check. + RETURNS (bool): Its boolean value. + """ + value = os.environ.get(env_var, False) + if value == "0": + return False + return bool(value) + + +def _pipe(docs, proc, kwargs): + if hasattr(proc, "pipe"): + yield from proc.pipe(docs, **kwargs) + else: + # We added some args for pipe that __call__ doesn't expect. + kwargs = dict(kwargs) + for arg in ["batch_size"]: + if arg in kwargs: + kwargs.pop(arg) + for doc in docs: + doc = proc(doc, **kwargs) + yield doc diff --git a/spacy/vectors.pyx b/spacy/vectors.pyx index aec086e6c..ae2508c87 100644 --- a/spacy/vectors.pyx +++ b/spacy/vectors.pyx @@ -1,22 +1,15 @@ -# coding: utf8 -from __future__ import unicode_literals - cimport numpy as np from cython.operator cimport dereference as deref from libcpp.set cimport set as cppset import functools import numpy -from collections import OrderedDict import srsly -import warnings -from thinc.neural.util import get_array_module -from thinc.neural._classes.model import Model +from thinc.api import get_array_module, get_current_ops from .strings cimport StringStore from .strings import get_string_id -from .compat import basestring_, path2str from .errors import Errors from . import util @@ -25,7 +18,7 @@ def unpickle_vectors(bytes_data): return Vectors().from_bytes(bytes_data) -class GlobalRegistry(object): +class GlobalRegistry: """Global store of vectors, to avoid repeatedly loading the data.""" data = {} @@ -51,7 +44,7 @@ cdef class Vectors: the table need to be assigned - so len(list(vectors.keys())) may be greater or smaller than vectors.shape[0]. - DOCS: https://spacy.io/api/vectors + DOCS: https://nightly.spacy.io/api/vectors """ cdef public object name cdef public object data @@ -64,10 +57,9 @@ cdef class Vectors: shape (tuple): Size of the table, as (# entries, # columns) data (numpy.ndarray): The vector data. keys (iterable): A sequence of keys, aligned with the data. - name (unicode): A name to identify the vectors table. - RETURNS (Vectors): The newly created object. + name (str): A name to identify the vectors table. - DOCS: https://spacy.io/api/vectors#init + DOCS: https://nightly.spacy.io/api/vectors#init """ self.name = name if data is None: @@ -75,7 +67,7 @@ cdef class Vectors: shape = (0,0) data = numpy.zeros(shape, dtype="f") self.data = data - self.key2row = OrderedDict() + self.key2row = {} if self.data is not None: self._unset = cppset[int]({i for i in range(self.data.shape[0])}) else: @@ -91,7 +83,7 @@ cdef class Vectors: RETURNS (tuple): A `(rows, dims)` pair. - DOCS: https://spacy.io/api/vectors#shape + DOCS: https://nightly.spacy.io/api/vectors#shape """ return self.data.shape @@ -101,7 +93,7 @@ cdef class Vectors: RETURNS (int): The vector size. 
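A small construction sketch for the refactored `Vectors` class (data and keys are made up; keys are stored internally as hash values):

import numpy
from spacy.strings import get_string_id
from spacy.vectors import Vectors

data = numpy.zeros((3, 5), dtype="f")
vectors = Vectors(data=data, keys=["cat", "dog", "rat"], name="demo_vectors")
assert vectors.shape == (3, 5)
assert len(vectors) == 3                 # rows in the data table
assert vectors.n_keys == 3               # keys mapped to rows
assert get_string_id("cat") in vectors   # lookup is by hash, not raw string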
- DOCS: https://spacy.io/api/vectors#size + DOCS: https://nightly.spacy.io/api/vectors#size """ return self.data.shape[0] * self.data.shape[1] @@ -111,7 +103,7 @@ cdef class Vectors: RETURNS (bool): `True` if no slots are available for new keys. - DOCS: https://spacy.io/api/vectors#is_full + DOCS: https://nightly.spacy.io/api/vectors#is_full """ return self._unset.size() == 0 @@ -122,7 +114,7 @@ cdef class Vectors: RETURNS (int): The number of keys in the table. - DOCS: https://spacy.io/api/vectors#n_keys + DOCS: https://nightly.spacy.io/api/vectors#n_keys """ return len(self.key2row) @@ -135,7 +127,7 @@ cdef class Vectors: key (int): The key to get the vector for. RETURNS (ndarray): The vector for the key. - DOCS: https://spacy.io/api/vectors#getitem + DOCS: https://nightly.spacy.io/api/vectors#getitem """ i = self.key2row[key] if i is None: @@ -149,7 +141,7 @@ cdef class Vectors: key (int): The key to set the vector for. vector (ndarray): The vector to set. - DOCS: https://spacy.io/api/vectors#setitem + DOCS: https://nightly.spacy.io/api/vectors#setitem """ i = self.key2row[key] self.data[i] = vector @@ -161,7 +153,7 @@ cdef class Vectors: YIELDS (int): A key in the table. - DOCS: https://spacy.io/api/vectors#iter + DOCS: https://nightly.spacy.io/api/vectors#iter """ yield from self.key2row @@ -170,7 +162,7 @@ cdef class Vectors: RETURNS (int): The number of vectors in the data. - DOCS: https://spacy.io/api/vectors#len + DOCS: https://nightly.spacy.io/api/vectors#len """ return self.data.shape[0] @@ -180,7 +172,7 @@ cdef class Vectors: key (int): The key to check. RETURNS (bool): Whether the key has a vector entry. - DOCS: https://spacy.io/api/vectors#contains + DOCS: https://nightly.spacy.io/api/vectors#contains """ return key in self.key2row @@ -197,7 +189,7 @@ cdef class Vectors: inplace (bool): Reallocate the memory. RETURNS (list): The removed items as a list of `(key, row)` tuples. - DOCS: https://spacy.io/api/vectors#resize + DOCS: https://nightly.spacy.io/api/vectors#resize """ xp = get_array_module(self.data) if inplace: @@ -232,7 +224,7 @@ cdef class Vectors: YIELDS (ndarray): A vector in the table. - DOCS: https://spacy.io/api/vectors#values + DOCS: https://nightly.spacy.io/api/vectors#values """ for row, vector in enumerate(range(self.data.shape[0])): if not self._unset.count(row): @@ -243,7 +235,7 @@ cdef class Vectors: YIELDS (tuple): A key/vector pair. - DOCS: https://spacy.io/api/vectors#items + DOCS: https://nightly.spacy.io/api/vectors#items """ for key, row in self.key2row.items(): yield key, self.data[row] @@ -251,7 +243,7 @@ cdef class Vectors: def find(self, *, key=None, keys=None, row=None, rows=None): """Look up one or more keys by row, or vice versa. - key (unicode / int): Find the row that the given key points to. + key (str / int): Find the row that the given key points to. Returns int, -1 if missing. keys (iterable): Find rows that the keys point to. Returns ndarray. @@ -289,7 +281,7 @@ cdef class Vectors: row (int / None): The row number of a vector to map the key to. RETURNS (int): The row the vector was added to. 
- DOCS: https://spacy.io/api/vectors#add + DOCS: https://nightly.spacy.io/api/vectors#add """ # use int for all keys and rows in key2row for more efficient access # and serialization @@ -357,7 +349,7 @@ cdef class Vectors: sorted_index = xp.arange(scores.shape[0])[:,None][i:i+batch_size],xp.argsort(scores[i:i+batch_size], axis=1)[:,::-1] scores[i:i+batch_size] = scores[sorted_index] best_rows[i:i+batch_size] = best_rows[sorted_index] - + for i, j in numpy.ndindex(best_rows.shape): best_rows[i, j] = filled[best_rows[i, j]] # Round values really close to 1 or -1 @@ -366,17 +358,17 @@ cdef class Vectors: scores = xp.clip(scores, a_min=-1, a_max=1, out=scores) row2key = {row: key for key, row in self.key2row.items()} keys = xp.asarray( - [[row2key[row] for row in best_rows[i] if row in row2key] + [[row2key[row] for row in best_rows[i] if row in row2key] for i in range(len(queries)) ], dtype="uint64") return (keys, best_rows, scores) def to_disk(self, path, **kwargs): """Save the current state to a directory. - path (unicode / Path): A path to a directory, which will be created if + path (str / Path): A path to a directory, which will be created if it doesn't exists. - DOCS: https://spacy.io/api/vectors#to_disk + DOCS: https://nightly.spacy.io/api/vectors#to_disk """ xp = get_array_module(self.data) if xp is numpy: @@ -391,20 +383,20 @@ cdef class Vectors: with path.open("wb") as _file: save_array(self.data, _file) - serializers = OrderedDict(( - ("vectors", lambda p: save_vectors(p)), - ("key2row", lambda p: srsly.write_msgpack(p, self.key2row)) - )) + serializers = { + "vectors": lambda p: save_vectors(p), + "key2row": lambda p: srsly.write_msgpack(p, self.key2row) + } return util.to_disk(path, serializers, []) def from_disk(self, path, **kwargs): """Loads state from a directory. Modifies the object in place and returns it. - path (unicode / Path): Directory path, string or Path-like object. + path (str / Path): Directory path, string or Path-like object. RETURNS (Vectors): The modified object. - DOCS: https://spacy.io/api/vectors#from_disk + DOCS: https://nightly.spacy.io/api/vectors#from_disk """ def load_key2row(path): if path.exists(): @@ -420,15 +412,16 @@ cdef class Vectors: self.add(key, row=i) def load_vectors(path): - xp = Model.ops.xp + ops = get_current_ops() if path.exists(): - self.data = xp.load(str(path)) + self.data = ops.xp.load(str(path)) + + serializers = { + "vectors": load_vectors, + "keys": load_keys, + "key2row": load_key2row, + } - serializers = OrderedDict(( - ("vectors", load_vectors), - ("keys", load_keys), - ("key2row", load_key2row), - )) util.from_disk(path, serializers, []) self._sync_unset() return self @@ -439,7 +432,7 @@ cdef class Vectors: exclude (list): String names of serialization fields to exclude. RETURNS (bytes): The serialized form of the `Vectors` object. - DOCS: https://spacy.io/api/vectors#to_bytes + DOCS: https://nightly.spacy.io/api/vectors#to_bytes """ def serialize_weights(): if hasattr(self.data, "to_bytes"): @@ -447,10 +440,10 @@ cdef class Vectors: else: return srsly.msgpack_dumps(self.data) - serializers = OrderedDict(( - ("key2row", lambda: srsly.msgpack_dumps(self.key2row)), - ("vectors", serialize_weights) - )) + serializers = { + "key2row": lambda: srsly.msgpack_dumps(self.key2row), + "vectors": serialize_weights + } return util.to_bytes(serializers, []) def from_bytes(self, data, **kwargs): @@ -460,7 +453,7 @@ cdef class Vectors: exclude (list): String names of serialization fields to exclude. RETURNS (Vectors): The `Vectors` object. 
- DOCS: https://spacy.io/api/vectors#from_bytes + DOCS: https://nightly.spacy.io/api/vectors#from_bytes """ def deserialize_weights(b): if hasattr(self.data, "from_bytes"): @@ -468,10 +461,10 @@ cdef class Vectors: else: self.data = srsly.msgpack_loads(b) - deserializers = OrderedDict(( - ("key2row", lambda b: self.key2row.update(srsly.msgpack_loads(b))), - ("vectors", deserialize_weights) - )) + deserializers = { + "key2row": lambda b: self.key2row.update(srsly.msgpack_loads(b)), + "vectors": deserialize_weights + } util.from_bytes(data, deserializers, []) self._sync_unset() return self diff --git a/spacy/vocab.pxd b/spacy/vocab.pxd index 73754eb02..7d8dfd5d6 100644 --- a/spacy/vocab.pxd +++ b/spacy/vocab.pxd @@ -1,5 +1,4 @@ from libcpp.vector cimport vector - from preshed.maps cimport PreshMap from cymem.cymem cimport Pool from murmurhash.mrmr cimport hash64 @@ -29,8 +28,9 @@ cdef class Vocab: cpdef readonly StringStore strings cpdef public Morphology morphology cpdef public object vectors - cpdef public object lookups - cpdef public object lookups_extra + cpdef public object _lookups + cpdef public object writing_system + cpdef public object get_noun_chunks cdef readonly int length cdef public object data_dir cdef public object lex_attr_getters diff --git a/spacy/vocab.pyx b/spacy/vocab.pyx index 1b1b04e13..93918250b 100644 --- a/spacy/vocab.pyx +++ b/spacy/vocab.pyx @@ -1,28 +1,45 @@ -# coding: utf8 # cython: profile=True -from __future__ import unicode_literals from libc.string cimport memcpy import srsly -from collections import OrderedDict -from thinc.neural.util import get_array_module +from thinc.api import get_array_module +import functools from .lexeme cimport EMPTY_LEXEME, OOV_RANK from .lexeme cimport Lexeme from .typedefs cimport attr_t from .tokens.token cimport Token -from .attrs cimport LANG, ORTH, TAG, POS +from .attrs cimport LANG, ORTH -from .compat import copy_reg, basestring_ +from .compat import copy_reg from .errors import Errors -from .lemmatizer import Lemmatizer -from .attrs import intify_attrs, NORM +from .attrs import intify_attrs, NORM, IS_STOP from .vectors import Vectors -from ._ml import link_vectors_to_models +from .util import registry from .lookups import Lookups from . import util from .lang.norm_exceptions import BASE_NORMS -from .lang.lex_attrs import LEX_ATTRS +from .lang.lex_attrs import LEX_ATTRS, is_stop, get_lang + + +def create_vocab(lang, defaults, vectors_name=None): + # If the spacy-lookups-data package is installed, we pre-populate the lookups + # with lexeme data, if available + lex_attrs = {**LEX_ATTRS, **defaults.lex_attr_getters} + # This is messy, but it's the minimal working fix to Issue #639. + lex_attrs[IS_STOP] = functools.partial(is_stop, stops=defaults.stop_words) + # Ensure that getter can be pickled + lex_attrs[LANG] = functools.partial(get_lang, lang=lang) + lex_attrs[NORM] = util.add_lookups( + lex_attrs.get(NORM, LEX_ATTRS[NORM]), + BASE_NORMS, + ) + return Vocab( + lex_attr_getters=lex_attrs, + writing_system=defaults.writing_system, + get_noun_chunks=defaults.syntax_iterators.get("noun_chunks"), + vectors_name=vectors_name, + ) cdef class Vocab: @@ -30,36 +47,24 @@ cdef class Vocab: instance also provides access to the `StringStore`, and owns underlying C-data that is shared between `Doc` objects. 
- DOCS: https://spacy.io/api/vocab + DOCS: https://nightly.spacy.io/api/vocab """ - def __init__(self, lex_attr_getters=None, tag_map=None, lemmatizer=None, - strings=tuple(), lookups=None, lookups_extra=None, - oov_prob=-20., vectors_name=None, **deprecated_kwargs): + def __init__(self, lex_attr_getters=None, strings=tuple(), lookups=None, + oov_prob=-20., vectors_name=None, writing_system={}, + get_noun_chunks=None, **deprecated_kwargs): """Create the vocabulary. lex_attr_getters (dict): A dictionary mapping attribute IDs to functions to compute them. Defaults to `None`. - tag_map (dict): Dictionary mapping fine-grained tags to coarse-grained - parts-of-speech, and optionally morphological attributes. - lemmatizer (object): A lemmatizer. Defaults to `None`. strings (StringStore): StringStore that maps strings to integers, and vice versa. lookups (Lookups): Container for large lookup tables and dictionaries. - lookups_extra (Lookups): Container for optional lookup tables and dictionaries. oov_prob (float): Default OOV probability. vectors_name (unicode): Optional name to identify the vectors table. - RETURNS (Vocab): The newly constructed object. """ lex_attr_getters = lex_attr_getters if lex_attr_getters is not None else {} - tag_map = tag_map if tag_map is not None else {} if lookups in (None, True, False): lookups = Lookups() - if "lexeme_norm" not in lookups: - lookups.add_table("lexeme_norm") - if lemmatizer in (None, True, False): - lemmatizer = Lemmatizer(lookups) - if lookups_extra in (None, True, False): - lookups_extra = Lookups() self.cfg = {'oov_prob': oov_prob} self.mem = Pool() self._by_orth = PreshMap() @@ -69,10 +74,11 @@ cdef class Vocab: for string in strings: _ = self[string] self.lex_attr_getters = lex_attr_getters - self.morphology = Morphology(self.strings, tag_map, lemmatizer) + self.morphology = Morphology(self.strings) self.vectors = Vectors(name=vectors_name) self.lookups = lookups - self.lookups_extra = lookups_extra + self.writing_system = writing_system + self.get_noun_chunks = get_noun_chunks @property def lang(self): @@ -81,17 +87,6 @@ cdef class Vocab: langfunc = self.lex_attr_getters.get(LANG, None) return langfunc("_") if langfunc else "" - property writing_system: - """A dict with information about the language's writing system. To get - the data, we use the vocab.lang property to fetch the Language class. - If the Language class is not loaded, an empty dict is returned. - """ - def __get__(self): - if not util.lang_class_is_loaded(self.lang): - return {} - lang_class = util.get_lang_class(self.lang) - return dict(lang_class.Defaults.writing_system) - def __len__(self): """The current number of lexemes stored. @@ -115,7 +110,7 @@ cdef class Vocab: available bit will be chosen. RETURNS (int): The integer ID by which the flag value can be checked. - DOCS: https://spacy.io/api/vocab#add_flag + DOCS: https://nightly.spacy.io/api/vocab#add_flag """ if flag_id == -1: for bit in range(1, 64): @@ -199,7 +194,7 @@ cdef class Vocab: string (unicode): The ID string. RETURNS (bool) Whether the string has an entry in the vocabulary. - DOCS: https://spacy.io/api/vocab#contains + DOCS: https://nightly.spacy.io/api/vocab#contains """ cdef hash_t int_key if isinstance(key, bytes): @@ -216,7 +211,7 @@ cdef class Vocab: YIELDS (Lexeme): An entry in the vocabulary. 
- DOCS: https://spacy.io/api/vocab#iter + DOCS: https://nightly.spacy.io/api/vocab#iter """ cdef attr_t key cdef size_t addr @@ -239,7 +234,7 @@ cdef class Vocab: >>> apple = nlp.vocab.strings["apple"] >>> assert nlp.vocab[apple] == nlp.vocab[u"apple"] - DOCS: https://spacy.io/api/vocab#getitem + DOCS: https://nightly.spacy.io/api/vocab#getitem """ cdef attr_t orth if isinstance(id_or_string, unicode): @@ -258,12 +253,6 @@ cdef class Vocab: # Set the special tokens up to have arbitrary attributes lex = self.get_by_orth(self.mem, props[ORTH]) token.lex = lex - if TAG in props: - self.morphology.assign_tag(token, props[TAG]) - elif POS in props: - # Don't allow POS to be set without TAG -- this causes problems, - # see #1773 - props.pop(POS) for attr_id, value in props.items(): Token.set_struct_attr(token, attr_id, value) # NORM is the only one that overlaps between the two @@ -313,7 +302,7 @@ cdef class Vocab: word was mapped to, and `score` the similarity score between the two words. - DOCS: https://spacy.io/api/vocab#prune_vectors + DOCS: https://nightly.spacy.io/api/vocab#prune_vectors """ xp = get_array_module(self.vectors.data) # Make prob negative so it sorts by rank ascending @@ -334,29 +323,28 @@ cdef class Vocab: synonym = self.strings[syn_keys[i][0]] score = scores[i][0] remap[word] = (synonym, score) - link_vectors_to_models(self) return remap def get_vector(self, orth, minn=None, maxn=None): """Retrieve a vector for a word in the vocabulary. Words can be looked up by string or int ID. If no vectors data is loaded, ValueError is raised. - - If `minn` is defined, then the resulting vector uses Fasttext's + + If `minn` is defined, then the resulting vector uses Fasttext's subword features by average over ngrams of `orth`. orth (int / unicode): The hash value of a word, or its unicode string. - minn (int): Minimum n-gram length used for Fasttext's ngram computation. + minn (int): Minimum n-gram length used for Fasttext's ngram computation. Defaults to the length of `orth`. - maxn (int): Maximum n-gram length used for Fasttext's ngram computation. + maxn (int): Maximum n-gram length used for Fasttext's ngram computation. Defaults to the length of `orth`. RETURNS (numpy.ndarray): A word vector. Size and shape determined by the `vocab.vectors` instance. Usually, a numpy ndarray of shape (300,) and dtype float32. - DOCS: https://spacy.io/api/vocab#get_vector + DOCS: https://nightly.spacy.io/api/vocab#get_vector """ - if isinstance(orth, basestring_): + if isinstance(orth, str): orth = self.strings.add(orth) word = self[orth].orth_ if orth in self.vectors.key2row: @@ -401,9 +389,9 @@ cdef class Vocab: orth (int / unicode): The word. vector (numpy.ndarray[ndim=1, dtype='float32']): The vector to set. - DOCS: https://spacy.io/api/vocab#set_vector + DOCS: https://nightly.spacy.io/api/vocab#set_vector """ - if isinstance(orth, basestring_): + if isinstance(orth, str): orth = self.strings.add(orth) if self.vectors.is_full and orth not in self.vectors: new_rows = max(100, int(self.vectors.shape[0]*1.3)) @@ -423,36 +411,46 @@ cdef class Vocab: orth (int / unicode): The word. RETURNS (bool): Whether the word has a vector. 
- DOCS: https://spacy.io/api/vocab#has_vector + DOCS: https://nightly.spacy.io/api/vocab#has_vector """ - if isinstance(orth, basestring_): + if isinstance(orth, str): orth = self.strings.add(orth) return orth in self.vectors - def to_disk(self, path, exclude=tuple(), **kwargs): + property lookups: + def __get__(self): + return self._lookups + + def __set__(self, lookups): + self._lookups = lookups + if lookups.has_table("lexeme_norm"): + self.lex_attr_getters[NORM] = util.add_lookups( + self.lex_attr_getters.get(NORM, LEX_ATTRS[NORM]), + self.lookups.get_table("lexeme_norm"), + ) + + + def to_disk(self, path, *, exclude=tuple()): """Save the current state to a directory. path (unicode or Path): A path to a directory, which will be created if it doesn't exist. exclude (list): String names of serialization fields to exclude. - DOCS: https://spacy.io/api/vocab#to_disk + DOCS: https://nightly.spacy.io/api/vocab#to_disk """ path = util.ensure_path(path) if not path.exists(): path.mkdir() setters = ["strings", "vectors"] - exclude = util.get_serialization_exclude(setters, exclude, kwargs) if "strings" not in exclude: self.strings.to_disk(path / "strings.json") - if "vectors" not in "exclude" and self.vectors is not None: + if "vectors" not in "exclude": self.vectors.to_disk(path) - if "lookups" not in "exclude" and self.lookups is not None: + if "lookups" not in "exclude": self.lookups.to_disk(path) - if "lookups_extra" not in "exclude" and self.lookups_extra is not None: - self.lookups_extra.to_disk(path, filename="lookups_extra.bin") - def from_disk(self, path, exclude=tuple(), **kwargs): + def from_disk(self, path, *, exclude=tuple()): """Loads state from a directory. Modifies the object in place and returns it. @@ -460,22 +458,17 @@ cdef class Vocab: exclude (list): String names of serialization fields to exclude. RETURNS (Vocab): The modified `Vocab` object. - DOCS: https://spacy.io/api/vocab#to_disk + DOCS: https://nightly.spacy.io/api/vocab#to_disk """ path = util.ensure_path(path) getters = ["strings", "vectors"] - exclude = util.get_serialization_exclude(getters, exclude, kwargs) if "strings" not in exclude: self.strings.from_disk(path / "strings.json") # TODO: add exclude? if "vectors" not in exclude: if self.vectors is not None: self.vectors.from_disk(path, exclude=["strings"]) - if self.vectors.name is not None: - link_vectors_to_models(self) if "lookups" not in exclude: self.lookups.from_disk(path) - if "lookups_extra" not in exclude: - self.lookups_extra.from_disk(path, filename="lookups_extra.bin") if "lexeme_norm" in self.lookups: self.lex_attr_getters[NORM] = util.add_lookups( self.lex_attr_getters.get(NORM, LEX_ATTRS[NORM]), self.lookups.get_table("lexeme_norm") @@ -484,13 +477,13 @@ cdef class Vocab: self._by_orth = PreshMap() return self - def to_bytes(self, exclude=tuple(), **kwargs): + def to_bytes(self, *, exclude=tuple()): """Serialize the current state to a binary string. exclude (list): String names of serialization fields to exclude. RETURNS (bytes): The serialized form of the `Vocab` object. 
- DOCS: https://spacy.io/api/vocab#to_bytes + DOCS: https://nightly.spacy.io/api/vocab#to_bytes """ def deserialize_vectors(): if self.vectors is None: @@ -498,23 +491,21 @@ cdef class Vocab: else: return self.vectors.to_bytes() - getters = OrderedDict(( - ("strings", lambda: self.strings.to_bytes()), - ("vectors", deserialize_vectors), - ("lookups", lambda: self.lookups.to_bytes()), - ("lookups_extra", lambda: self.lookups_extra.to_bytes()) - )) - exclude = util.get_serialization_exclude(getters, exclude, kwargs) + getters = { + "strings": lambda: self.strings.to_bytes(), + "vectors": deserialize_vectors, + "lookups": lambda: self.lookups.to_bytes(), + } return util.to_bytes(getters, exclude) - def from_bytes(self, bytes_data, exclude=tuple(), **kwargs): + def from_bytes(self, bytes_data, *, exclude=tuple()): """Load state from a binary string. bytes_data (bytes): The data to load from. exclude (list): String names of serialization fields to exclude. RETURNS (Vocab): The `Vocab` object. - DOCS: https://spacy.io/api/vocab#from_bytes + DOCS: https://nightly.spacy.io/api/vocab#from_bytes """ def serialize_vectors(b): if self.vectors is None: @@ -522,13 +513,12 @@ cdef class Vocab: else: return self.vectors.from_bytes(b) - setters = OrderedDict(( - ("strings", lambda b: self.strings.from_bytes(b)), - ("vectors", lambda b: serialize_vectors(b)), - ("lookups", lambda b: self.lookups.from_bytes(b)), - ("lookups_extra", lambda b: self.lookups_extra.from_bytes(b)) - )) - exclude = util.get_serialization_exclude(setters, exclude, kwargs) + setters = { + "strings": lambda b: self.strings.from_bytes(b), + "lexemes": lambda b: self.lexemes_from_bytes(b), + "vectors": lambda b: serialize_vectors(b), + "lookups": lambda b: self.lookups.from_bytes(b), + } util.from_bytes(bytes_data, setters, exclude) if "lexeme_norm" in self.lookups: self.lex_attr_getters[NORM] = util.add_lookups( @@ -536,8 +526,6 @@ cdef class Vocab: ) self.length = 0 self._by_orth = PreshMap() - if self.vectors.name is not None: - link_vectors_to_models(self) return self def _reset_cache(self, keys, strings): @@ -545,19 +533,6 @@ cdef class Vocab: raise NotImplementedError - def load_extra_lookups(self, table_name): - if table_name not in self.lookups_extra: - if self.lang + "_extra" in util.registry.lookups: - tables = util.registry.lookups.get(self.lang + "_extra") - for name, filename in tables.items(): - if table_name == name: - data = util.load_language_data(filename) - self.lookups_extra.add_table(name, data) - if table_name not in self.lookups_extra: - self.lookups_extra.add_table(table_name) - return self.lookups_extra.get_table(table_name) - - def pickle_vocab(vocab): sstore = vocab.strings vectors = vocab.vectors @@ -565,13 +540,12 @@ def pickle_vocab(vocab): data_dir = vocab.data_dir lex_attr_getters = srsly.pickle_dumps(vocab.lex_attr_getters) lookups = vocab.lookups - lookups_extra = vocab.lookups_extra return (unpickle_vocab, - (sstore, vectors, morph, data_dir, lex_attr_getters, lookups, lookups_extra)) + (sstore, vectors, morph, data_dir, lex_attr_getters, lookups)) def unpickle_vocab(sstore, vectors, morphology, data_dir, - lex_attr_getters, lookups, lookups_extra): + lex_attr_getters, lookups): cdef Vocab vocab = Vocab() vocab.vectors = vectors vocab.strings = sstore @@ -579,7 +553,6 @@ def unpickle_vocab(sstore, vectors, morphology, data_dir, vocab.data_dir = data_dir vocab.lex_attr_getters = srsly.pickle_loads(lex_attr_getters) vocab.lookups = lookups - vocab.lookups_extra = lookups_extra return vocab diff --git 
a/website/README.md b/website/README.md index a02d5a151..076032d92 100644 --- a/website/README.md +++ b/website/README.md @@ -75,7 +75,8 @@ import { H1, H2, H3, H4, H5, Label, InlineList, Comment } from Headlines are set in [HK Grotesk](http://cargocollective.com/hanken/HK-Grotesk-Open-Source-Font) by Hanken Design. All other body text and code uses the best-matching default -system font to provide a "native" reading experience. +system font to provide a "native" reading experience. All code uses the +[JetBrains Mono](https://www.jetbrains.com/lp/mono/) typeface by JetBrains. @@ -106,7 +107,7 @@ Tags are also available as standalone `` components. | Argument | Example | Result | | -------- | -------------------------- | ----------------------------------------- | | `tag` | `{tag="method"}` | method | -| `new` | `{new="2"}` | 2 | +| `new` | `{new="3"}` | 3 | | `model` | `{model="tagger, parser"}` | tagger, parser | | `hidden` | `{hidden="true"}` | | @@ -130,6 +131,8 @@ Special link styles are used depending on the link URL. - [I am a regular external link](https://explosion.ai) - [I am a link to the documentation](/api/doc) +- [I am a link to an architecture](/api/architectures#HashEmbedCNN) +- [I am a link to a model](/models/en#en_core_web_sm) - [I am a link to GitHub](https://github.com/explosion/spaCy) ### Abbreviations {#abbr} @@ -188,18 +191,20 @@ the buttons are implemented as styled links instead of native button elements. +
+ ## Components -### Table +### Table {#table} > #### Markdown > > ```markdown_ > | Header 1 | Header 2 | -> | --- | --- | +> | -------- | -------- | > | Column 1 | Column 2 | > ``` > @@ -213,7 +218,7 @@ the buttons are implemented as styled links instead of native button elements. > ``` Tables are used to present data and API documentation. Certain keywords can be -used to mark a footer row with a distinct style, for example to visualise the +used to mark a footer row with a distinct style, for example to visualize the return values of a documented function. | Header 1 | Header 2 | Header 3 | Header 4 | @@ -224,7 +229,73 @@ return values of a documented function. | Column 1 | Column 2 | Column 3 | Column 4 | | **RETURNS** | Column 2 | Column 3 | Column 4 | -### List +Tables also support optional "divider" rows that are typically used to denote +keyword-only arguments in API documentation. To turn a row into a dividing +headline, it should only include content in its first cell, and its value should +be italicized: + +> #### Markdown +> +> ```markdown_ +> | Header 1 | Header 2 | Header 3 | +> | -------- | -------- | -------- | +> | Column 1 | Column 2 | Column 3 | +> | _Hello_ | | | +> | Column 1 | Column 2 | Column 3 | +> ``` + +| Header 1 | Header 2 | Header 3 | +| -------- | -------- | -------- | +| Column 1 | Column 2 | Column 3 | +| _Hello_ | | | +| Column 1 | Column 2 | Column 3 | + +### Type Annotations {#type-annotations} + +> #### Markdown +> +> ```markdown_ +> ~~Model[List[Doc], Floats2d]~~ +> ``` +> +> #### JSX +> +> ```markup +> Model[List[Doc], Floats2d] +> ``` + +Type annotations are special inline code blocks are used to describe Python +types in the [type hints](https://docs.python.org/3/library/typing.html) format. +The special component will split the type, apply syntax highlighting and link +all types that specify links in `meta/type-annotations.json`. Types can link to +internal or external documentation pages. To make it easy to represent the type +annotations in Markdown, the rendering "hijacks" the `~~` tags that would +typically be converted to a `` element – but in this case, text surrounded +by `~~` becomes a type annotation. + +- ~~Dict[str, List[Union[Doc, Span]]]~~ +- ~~Model[List[Doc], List[numpy.ndarray]]~~ + +Type annotations support a special visual style in tables and will render as a +separate row, under the cell text. This allows the API docs to display complex +types without taking up too much space in the cell. The type annotation should +always be the **last element** in the row. + +> #### Markdown +> +> ```markdown_ +> | Header 1 | Header 2 | +> | -------- | ----------------------- | +> | Column 1 | Column 2 ~~List[Doc]~~ | +> ``` + +| Name | Description | +| ----------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `vocab` | The shared vocabulary. ~~Vocab~~ | +| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. ~~Model[List[Doc], FullTransformerBatch]~~ | +| `set_extra_annotations` | Function that takes a batch of `Doc` objects and transformer outputs and can set additional annotations on the `Doc`. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ | + +### List {#list} > #### Markdown > @@ -255,7 +326,7 @@ automatically. 3. Lorem ipsum dolor 4. 
consectetur adipiscing elit -### Aside +### Aside {#aside} > #### Markdown > @@ -280,7 +351,7 @@ To make them easier to use in Markdown, paragraphs formatted as blockquotes will turn into asides by default. Level 4 headlines (with a leading `####`) will become aside titles. -### Code Block +### Code Block {#code-block} > #### Markdown > @@ -384,10 +455,10 @@ original file is shown at the top of the widget. > ``` ```python -https://github.com/explosion/spaCy/tree/master/examples/pipeline/custom_component_countries_api.py +https://github.com/explosion/spaCy/tree/master/spacy/language.py ``` -### Infobox +### Infobox {#infobox} import Infobox from 'components/infobox' @@ -425,7 +496,7 @@ blocks.
-### Accordion +### Accordion {#accordion} import Accordion from 'components/accordion' @@ -538,7 +609,6 @@ In addition to the native markdown elements, you can use the components ├── docs # the actual markdown content ├── meta # JSON-formatted site metadata | ├── languages.json # supported languages and statistical models -| ├── logos.json # logos and links for landing page | ├── sidebars.json # sidebar navigations for different sections | ├── site.json # general site metadata | └── universe.json # data for the spaCy universe section @@ -560,3 +630,49 @@ In addition to the native markdown elements, you can use the components ├── gatsby-node.js # Node-specific hooks for Gatsby └── package.json # package settings and dependencies ``` + +## Editorial {#editorial} + +- "spaCy" should always be spelled with a lowercase "s" and a capital "C", + unless it specifically refers to the Python package or Python import `spacy` + (in which case it should be formatted as code). + - ✅ spaCy is a library for advanced NLP in Python. + - ❌ Spacy is a library for advanced NLP in Python. + - ✅ First, you need to install the `spacy` package from pip. +- Mentions of code, like function names, classes, variable names etc. in inline + text should be formatted as `code`. + - ✅ "Calling the `nlp` object on a text returns a `Doc`." +- Objects that have pages in the [API docs](/api) should be linked – for + example, [`Doc`](/api/doc) or [`Language.to_disk`](/api/language#to_disk). The + mentions should still be formatted as code within the link. Links pointing to + the API docs will automatically receive a little icon. However, if a paragraph + includes many references to the API, the links can easily get messy. In that + case, we typically only link the first mention of an object and not any + subsequent ones. + - ✅ The [`Span`](/api/span) and [`Token`](/api/token) objects are views of a + [`Doc`](/api/doc). [`Span.as_doc`](/api/span#as_doc) creates a `Doc` object + from a `Span`. + - ❌ The [`Span`](/api/span) and [`Token`](/api/token) objects are views of a + [`Doc`](/api/doc). [`Span.as_doc`](/api/span#as_doc) creates a + [`Doc`](/api/doc) object from a [`Span`](/api/span). + +* Other things we format as code are: references to trained pipeline packages + like `en_core_web_sm` or file names like `code.py` or `meta.json`. + + - ✅ After training, the `config.cfg` is saved to disk. + +* [Type annotations](#type-annotations) are a special type of code formatting, + expressed by wrapping the text in `~~` instead of backticks. The result looks + like this: ~~List[Doc]~~. All references to known types will be linked + automatically. + + - ✅ The model has the input type ~~List[Doc]~~ and it outputs a + ~~List[Array2d]~~. + +* We try to keep links meaningful but short. + - ✅ For details, see the usage guide on + [training with custom code](/usage/training#custom-code). + - ❌ For details, see + [the usage guide on training with custom code](/usage/training#custom-code). + - ❌ For details, see the usage guide on training with custom code + [here](/usage/training#custom-code). 
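Stepping back to the Python changes earlier in this diff: the new `minibatch` helper added to `spacy/util.py` accepts either a fixed integer batch size or an iterator of sizes, so the batch size can grow over the course of training. A minimal sketch of that behavior, assuming Thinc's `compounding` schedule (which is not part of this diff) as the size iterator:

```python
# Sketch: feed minibatch() a schedule instead of a fixed int, so each call to
# next() on the schedule decides the size of the next batch.
from thinc.api import compounding  # assumed available from Thinc v8
from spacy.util import minibatch

sizes = compounding(1.0, 4.0, 1.5)  # yields 1.0, 1.5, 2.25, 3.375, 4.0, 4.0, ...
batches = list(minibatch(range(10), size=sizes))
# int() is taken of each size, so the batches come out as:
# [[0], [1], [2, 3], [4, 5, 6], [7, 8, 9]]
```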
diff --git a/website/docs/api/annotation.md b/website/docs/api/annotation.md deleted file mode 100644 index 5ca5e91d9..000000000 --- a/website/docs/api/annotation.md +++ /dev/null @@ -1,621 +0,0 @@ ---- -title: Annotation Specifications -teaser: Schemes used for labels, tags and training data -menu: - - ['Text Processing', 'text-processing'] - - ['POS Tagging', 'pos-tagging'] - - ['Dependencies', 'dependency-parsing'] - - ['Named Entities', 'named-entities'] - - ['Models & Training', 'training'] ---- - -## Text processing {#text-processing} - -> #### Example -> -> ```python -> from spacy.lang.en import English -> nlp = English() -> tokens = nlp("Some\\nspaces and\\ttab characters") -> tokens_text = [t.text for t in tokens] -> assert tokens_text == ["Some", "\\n", "spaces", " ", "and", "\\t", "tab", "characters"] -> ``` - -Tokenization standards are based on the -[OntoNotes 5](https://catalog.ldc.upenn.edu/LDC2013T19) corpus. The tokenizer -differs from most by including **tokens for significant whitespace**. Any -sequence of whitespace characters beyond a single space (`' '`) is included as a -token. The whitespace tokens are useful for much the same reason punctuation is -– it's often an important delimiter in the text. By preserving it in the token -output, we are able to maintain a simple alignment between the tokens and the -original string, and we ensure that **no information is lost** during -processing. - -### Lemmatization {#lemmatization} - -> #### Examples -> -> In English, this means: -> -> - **Adjectives**: happier, happiest → happy -> - **Adverbs**: worse, worst → badly -> - **Nouns**: dogs, children → dog, child -> - **Verbs**: writes, writing, wrote, written → write - -As of v2.2, lemmatization data is stored in a separate package, -[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) that can -be installed if needed via `pip install spacy[lookups]`. Some languages provide -full lemmatization rules and exceptions, while other languages currently only -rely on simple lookup tables. - - - -spaCy adds a **special case for English pronouns**: all English pronouns are -lemmatized to the special token `-PRON-`. Unlike verbs and common nouns, -there's no clear base form of a personal pronoun. Should the lemma of "me" be -"I", or should we normalize person as well, giving "it" — or maybe "he"? -spaCy's solution is to introduce a novel symbol, `-PRON-`, which is used as the -lemma for all personal pronouns. - - - -### Sentence boundary detection {#sentence-boundary} - -Sentence boundaries are calculated from the syntactic parse tree, so features -such as punctuation and capitalization play an important but non-decisive role -in determining the sentence boundaries. Usually this means that the sentence -boundaries will at least coincide with clause boundaries, even given poorly -punctuated text. - -## Part-of-speech tagging {#pos-tagging} - -> #### Tip: Understanding tags -> -> You can also use `spacy.explain` to get the description for the string -> representation of a tag. For example, `spacy.explain("RB")` will return -> "adverb". - -This section lists the fine-grained and coarse-grained part-of-speech tags -assigned by spaCy's [models](/models). The individual mapping is specific to the -training corpus and can be defined in the respective language data's -[`tag_map.py`](/usage/adding-languages#tag-map). 
- - - -spaCy maps all language-specific part-of-speech tags to a small, fixed set of -word type tags following the -[Universal Dependencies scheme](http://universaldependencies.org/u/pos/). The -universal tags don't code for any morphological features and only cover the word -type. They're available as the [`Token.pos`](/api/token#attributes) and -[`Token.pos_`](/api/token#attributes) attributes. - -| POS | Description | Examples | -| ------- | ------------------------- | --------------------------------------------- | -| `ADJ` | adjective | big, old, green, incomprehensible, first | -| `ADP` | adposition | in, to, during | -| `ADV` | adverb | very, tomorrow, down, where, there | -| `AUX` | auxiliary | is, has (done), will (do), should (do) | -| `CONJ` | conjunction | and, or, but | -| `CCONJ` | coordinating conjunction | and, or, but | -| `DET` | determiner | a, an, the | -| `INTJ` | interjection | psst, ouch, bravo, hello | -| `NOUN` | noun | girl, cat, tree, air, beauty | -| `NUM` | numeral | 1, 2017, one, seventy-seven, IV, MMXIV | -| `PART` | particle | 's, not, | -| `PRON` | pronoun | I, you, he, she, myself, themselves, somebody | -| `PROPN` | proper noun | Mary, John, London, NATO, HBO | -| `PUNCT` | punctuation | ., (, ), ? | -| `SCONJ` | subordinating conjunction | if, while, that | -| `SYM` | symbol | \$, %, §, ©, +, −, ×, ÷, =, :), 😝 | -| `VERB` | verb | run, runs, running, eat, ate, eating | -| `X` | other | sfpksdpsxmsa | -| `SPACE` | space | - - - - - -The English part-of-speech tagger uses the -[OntoNotes 5](https://catalog.ldc.upenn.edu/LDC2013T19) version of the Penn -Treebank tag set. We also map the tags to the simpler Universal Dependencies v2 -POS tag set. - -| Tag |  POS | Morphology | Description | -| ------------------------------------- | ------- | --------------------------------------- | ----------------------------------------- | -| `$` | `SYM` | | symbol, currency | -| `` | `PUNCT` | `PunctType=quot PunctSide=ini` | opening quotation mark | -| `''` | `PUNCT` | `PunctType=quot PunctSide=fin` | closing quotation mark | -| `,` | `PUNCT` | `PunctType=comm` | punctuation mark, comma | -| `-LRB-` | `PUNCT` | `PunctType=brck PunctSide=ini` | left round bracket | -| `-RRB-` | `PUNCT` | `PunctType=brck PunctSide=fin` | right round bracket | -| `.` | `PUNCT` | `PunctType=peri` | punctuation mark, sentence closer | -| `:` | `PUNCT` | | punctuation mark, colon or ellipsis | -| `ADD` | `X` | | email | -| `AFX` | `ADJ` | `Hyph=yes` | affix | -| `CC` | `CCONJ` | `ConjType=comp` | conjunction, coordinating | -| `CD` | `NUM` | `NumType=card` | cardinal number | -| `DT` | `DET` | | determiner | -| `EX` | `PRON` | `AdvType=ex` | existential there | -| `FW` | `X` | `Foreign=yes` | foreign word | -| `GW` | `X` | | additional word in multi-word expression | -| `HYPH` | `PUNCT` | `PunctType=dash` | punctuation mark, hyphen | -| `IN` | `ADP` | | conjunction, subordinating or preposition | -| `JJ` | `ADJ` | `Degree=pos` | adjective | -| `JJR` | `ADJ` | `Degree=comp` | adjective, comparative | -| `JJS` | `ADJ` | `Degree=sup` | adjective, superlative | -| `LS` | `X` | `NumType=ord` | list item marker | -| `MD` | `VERB` | `VerbType=mod` | verb, modal auxiliary | -| `NFP` | `PUNCT` | | superfluous punctuation | -| `NIL` | `X` | | missing tag | -| `NN` | `NOUN` | `Number=sing` | noun, singular or mass | -| `NNP` | `PROPN` | `NounType=prop Number=sing` | noun, proper singular | -| `NNPS` | `PROPN` | `NounType=prop Number=plur` | noun, proper plural | -| `NNS` | `NOUN` | `Number=plur` | 
noun, plural | -| `PDT` | `DET` | | predeterminer | -| `POS` | `PART` | `Poss=yes` | possessive ending | -| `PRP` | `PRON` | `PronType=prs` | pronoun, personal | -| `PRP$` | `DET` | `PronType=prs Poss=yes` | pronoun, possessive | -| `RB` | `ADV` | `Degree=pos` | adverb | -| `RBR` | `ADV` | `Degree=comp` | adverb, comparative | -| `RBS` | `ADV` | `Degree=sup` | adverb, superlative | -| `RP` | `ADP` | | adverb, particle | -| `SP` | `SPACE` | | space | -| `SYM` | `SYM` | | symbol | -| `TO` | `PART` | `PartType=inf VerbForm=inf` | infinitival "to" | -| `UH` | `INTJ` | | interjection | -| `VB` | `VERB` | `VerbForm=inf` | verb, base form | -| `VBD` | `VERB` | `VerbForm=fin Tense=past` | verb, past tense | -| `VBG` | `VERB` | `VerbForm=part Tense=pres Aspect=prog` | verb, gerund or present participle | -| `VBN` | `VERB` | `VerbForm=part Tense=past Aspect=perf` | verb, past participle | -| `VBP` | `VERB` | `VerbForm=fin Tense=pres` | verb, non-3rd person singular present | -| `VBZ` | `VERB` | `VerbForm=fin Tense=pres Number=sing Person=three` | verb, 3rd person singular present | -| `WDT` | `DET` | | wh-determiner | -| `WP` | `PRON` | | wh-pronoun, personal | -| `WP$` | `DET` | `Poss=yes` | wh-pronoun, possessive | -| `WRB` | `ADV` | | wh-adverb | -| `XX` | `X` | | unknown | -| `_SP` | `SPACE` | | | - - - - -The German part-of-speech tagger uses the -[TIGER Treebank](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/index.html) -annotation scheme. We also map the tags to the simpler Universal Dependencies -v2 POS tag set. - -| Tag |  POS | Morphology | Description | -| --------- | ------- | ---------------------------------------- | ------------------------------------------------- | -| `$(` | `PUNCT` | `PunctType=brck` | other sentence-internal punctuation mark | -| `$,` | `PUNCT` | `PunctType=comm` | comma | -| `$.` | `PUNCT` | `PunctType=peri` | sentence-final punctuation mark | -| `ADJA` | `ADJ` | | adjective, attributive | -| `ADJD` | `ADJ` | | adjective, adverbial or predicative | -| `ADV` | `ADV` | | adverb | -| `APPO` | `ADP` | `AdpType=post` | postposition | -| `APPR` | `ADP` | `AdpType=prep` | preposition; circumposition left | -| `APPRART` | `ADP` | `AdpType=prep PronType=art` | preposition with article | -| `APZR` | `ADP` | `AdpType=circ` | circumposition right | -| `ART` | `DET` | `PronType=art` | definite or indefinite article | -| `CARD` | `NUM` | `NumType=card` | cardinal number | -| `FM` | `X` | `Foreign=yes` | foreign language material | -| `ITJ` | `INTJ` | | interjection | -| `KOKOM` | `CCONJ` | `ConjType=comp` | comparative conjunction | -| `KON` | `CCONJ` | | coordinate conjunction | -| `KOUI` | `SCONJ` | | subordinate conjunction with "zu" and infinitive | -| `KOUS` | `SCONJ` | | subordinate conjunction with sentence | -| `NE` | `PROPN` | | proper noun | -| `NN` | `NOUN` | | noun, singular or mass | -| `NNE` | `PROPN` | | proper noun | -| `PDAT` | `DET` | `PronType=dem` | attributive demonstrative pronoun | -| `PDS` | `PRON` | `PronType=dem` | substituting demonstrative pronoun | -| `PIAT` | `DET` | `PronType=ind|neg|tot` | attributive indefinite pronoun without determiner | -| `PIS` | `PRON` | `PronType=ind|neg|tot` | substituting indefinite pronoun | -| `PPER` | `PRON` | `PronType=prs` | non-reflexive personal pronoun | -| `PPOSAT` | `DET` | `Poss=yes PronType=prs` | attributive possessive pronoun | -| `PPOSS` | `PRON` | `Poss=yes PronType=prs` | substituting possessive pronoun | -| `PRELAT` | `DET` | `PronType=rel` | attributive relative 
pronoun | -| `PRELS` | `PRON` | `PronType=rel` | substituting relative pronoun | -| `PRF` | `PRON` | `PronType=prs Reflex=yes` | reflexive personal pronoun | -| `PROAV` | `ADV` | `PronType=dem` | pronominal adverb | -| `PTKA` | `PART` | | particle with adjective or adverb | -| `PTKANT` | `PART` | `PartType=res` | answer particle | -| `PTKNEG` | `PART` | `Polarity=neg` | negative particle | -| `PTKVZ` | `ADP` | `PartType=vbp` | separable verbal particle | -| `PTKZU` | `PART` | `PartType=inf` | "zu" before infinitive | -| `PWAT` | `DET` | `PronType=int` | attributive interrogative pronoun | -| `PWAV` | `ADV` | `PronType=int` | adverbial interrogative or relative pronoun | -| `PWS` | `PRON` | `PronType=int` | substituting interrogative pronoun | -| `TRUNC` | `X` | `Hyph=yes` | word remnant | -| `VAFIN` | `AUX` | `Mood=ind VerbForm=fin` | finite verb, auxiliary | -| `VAIMP` | `AUX` | `Mood=imp VerbForm=fin` | imperative, auxiliary | -| `VAINF` | `AUX` | `VerbForm=inf` | infinitive, auxiliary | -| `VAPP` | `AUX` | `Aspect=perf VerbForm=part` | perfect participle, auxiliary | -| `VMFIN` | `VERB` | `Mood=ind VerbForm=fin VerbType=mod` | finite verb, modal | -| `VMINF` | `VERB` | `VerbForm=inf VerbType=mod` | infinitive, modal | -| `VMPP` | `VERB` | `Aspect=perf VerbForm=part VerbType=mod` | perfect participle, modal | -| `VVFIN` | `VERB` | `Mood=ind VerbForm=fin` | finite verb, full | -| `VVIMP` | `VERB` | `Mood=imp VerbForm=fin` | imperative, full | -| `VVINF` | `VERB` | `VerbForm=inf` | infinitive, full | -| `VVIZU` | `VERB` | `VerbForm=inf` | infinitive with "zu", full | -| `VVPP` | `VERB` | `Aspect=perf VerbForm=part` | perfect participle, full | -| `XY` | `X` | | non-word containing non-letter | -| `_SP` | `SPACE` | | | - - ---- - - - -For the label schemes used by the other models, see the respective `tag_map.py` -in [`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang). - - - -## Syntactic Dependency Parsing {#dependency-parsing} - -> #### Tip: Understanding labels -> -> You can also use `spacy.explain` to get the description for the string -> representation of a label. For example, `spacy.explain("prt")` will return -> "particle". - -This section lists the syntactic dependency labels assigned by spaCy's -[models](/models). The individual labels are language-specific and depend on the -training corpus. - - - -The [Universal Dependencies scheme](http://universaldependencies.org/u/dep/) is -used in all languages trained on Universal Dependency Corpora. 
- -| Label | Description | -| ------------ | -------------------------------------------- | -| `acl` | clausal modifier of noun (adjectival clause) | -| `advcl` | adverbial clause modifier | -| `advmod` | adverbial modifier | -| `amod` | adjectival modifier | -| `appos` | appositional modifier | -| `aux` | auxiliary | -| `case` | case marking | -| `cc` | coordinating conjunction | -| `ccomp` | clausal complement | -| `clf` | classifier | -| `compound` | compound | -| `conj` | conjunct | -| `cop` | copula | -| `csubj` | clausal subject | -| `dep` | unspecified dependency | -| `det` | determiner | -| `discourse` | discourse element | -| `dislocated` | dislocated elements | -| `expl` | expletive | -| `fixed` | fixed multiword expression | -| `flat` | flat multiword expression | -| `goeswith` | goes with | -| `iobj` | indirect object | -| `list` | list | -| `mark` | marker | -| `nmod` | nominal modifier | -| `nsubj` | nominal subject | -| `nummod` | numeric modifier | -| `obj` | object | -| `obl` | oblique nominal | -| `orphan` | orphan | -| `parataxis` | parataxis | -| `punct` | punctuation | -| `reparandum` | overridden disfluency | -| `root` | root | -| `vocative` | vocative | -| `xcomp` | open clausal complement | - - - - - -The English dependency labels use the -[CLEAR Style](https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md) -by [ClearNLP](http://www.clearnlp.com). - -| Label | Description | -| ----------- | -------------------------------------------- | -| `acl` | clausal modifier of noun (adjectival clause) | -| `acomp` | adjectival complement | -| `advcl` | adverbial clause modifier | -| `advmod` | adverbial modifier | -| `agent` | agent | -| `amod` | adjectival modifier | -| `appos` | appositional modifier | -| `attr` | attribute | -| `aux` | auxiliary | -| `auxpass` | auxiliary (passive) | -| `case` | case marking | -| `cc` | coordinating conjunction | -| `ccomp` | clausal complement | -| `compound` | compound | -| `conj` | conjunct | -| `cop` | copula | -| `csubj` | clausal subject | -| `csubjpass` | clausal subject (passive) | -| `dative` | dative | -| `dep` | unclassified dependent | -| `det` | determiner | -| `dobj` | direct object | -| `expl` | expletive | -| `intj` | interjection | -| `mark` | marker | -| `meta` | meta modifier | -| `neg` | negation modifier | -| `nn` | noun compound modifier | -| `nounmod` | modifier of nominal | -| `npmod` | noun phrase as adverbial modifier | -| `nsubj` | nominal subject | -| `nsubjpass` | nominal subject (passive) | -| `nummod` | numeric modifier | -| `oprd` | object predicate | -| `obj` | object | -| `obl` | oblique nominal | -| `parataxis` | parataxis | -| `pcomp` | complement of preposition | -| `pobj` | object of preposition | -| `poss` | possession modifier | -| `preconj` | pre-correlative conjunction | -| `prep` | prepositional modifier | -| `prt` | particle | -| `punct` | punctuation | -| `quantmod` | modifier of quantifier | -| `relcl` | relative clause modifier | -| `root` | root | -| `xcomp` | open clausal complement | - - - - - -The German dependency labels use the -[TIGER Treebank](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/index.html) -annotation scheme. 
- -| Label | Description | -| ------- | ------------------------------- | -| `ac` | adpositional case marker | -| `adc` | adjective component | -| `ag` | genitive attribute | -| `ams` | measure argument of adjective | -| `app` | apposition | -| `avc` | adverbial phrase component | -| `cc` | comparative complement | -| `cd` | coordinating conjunction | -| `cj` | conjunct | -| `cm` | comparative conjunction | -| `cp` | complementizer | -| `cvc` | collocational verb construction | -| `da` | dative | -| `dm` | discourse marker | -| `ep` | expletive es | -| `ju` | junctor | -| `mnr` | postnominal modifier | -| `mo` | modifier | -| `ng` | negation | -| `nk` | noun kernel element | -| `nmc` | numerical component | -| `oa` | accusative object | -| `oa2` | second accusative object | -| `oc` | clausal object | -| `og` | genitive object | -| `op` | prepositional object | -| `par` | parenthetical element | -| `pd` | predicate | -| `pg` | phrasal genitive | -| `ph` | placeholder | -| `pm` | morphological particle | -| `pnc` | proper noun component | -| `punct` | punctuation | -| `rc` | relative clause | -| `re` | repeated element | -| `rs` | reported speech | -| `sb` | subject | -| `sbp` | passivized subject (PP) | -| `sp` | subject or predicate | -| `svp` | separable verb prefix | -| `uc` | unit component | -| `vo` | vocative | -| `ROOT` | root | - - - -## Named Entity Recognition {#named-entities} - -> #### Tip: Understanding entity types -> -> You can also use `spacy.explain` to get the description for the string -> representation of an entity label. For example, `spacy.explain("LANGUAGE")` -> will return "any named language". - -Models trained on the [OntoNotes 5](https://catalog.ldc.upenn.edu/LDC2013T19) -corpus support the following entity types: - -| Type | Description | -| ------------- | ---------------------------------------------------- | -| `PERSON` | People, including fictional. | -| `NORP` | Nationalities or religious or political groups. | -| `FAC` | Buildings, airports, highways, bridges, etc. | -| `ORG` | Companies, agencies, institutions, etc. | -| `GPE` | Countries, cities, states. | -| `LOC` | Non-GPE locations, mountain ranges, bodies of water. | -| `PRODUCT` | Objects, vehicles, foods, etc. (Not services.) | -| `EVENT` | Named hurricanes, battles, wars, sports events, etc. | -| `WORK_OF_ART` | Titles of books, songs, etc. | -| `LAW` | Named documents made into laws. | -| `LANGUAGE` | Any named language. | -| `DATE` | Absolute or relative dates or periods. | -| `TIME` | Times smaller than a day. | -| `PERCENT` | Percentage, including "%". | -| `MONEY` | Monetary values, including unit. | -| `QUANTITY` | Measurements, as of weight or distance. | -| `ORDINAL` | "first", "second", etc. | -| `CARDINAL` | Numerals that do not fall under another type. | - -### Wikipedia scheme {#ner-wikipedia-scheme} - -Models trained on Wikipedia corpus -([Nothman et al., 2013](http://www.sciencedirect.com/science/article/pii/S0004370212000276)) -use a less fine-grained NER annotation scheme and recognise the following -entities: - -| Type | Description | -| ------ | ----------------------------------------------------------------------------------------------------------------------------------------- | -| `PER` | Named person or family. | -| `LOC` | Name of politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains). | -| `ORG` | Named corporate, governmental, or other organizational entity. 
| -| `MISC` | Miscellaneous entities, e.g. events, nationalities, products or works of art. | - -### IOB Scheme {#iob} - -| Tag | ID | Description | -| ----- | --- | ------------------------------------- | -| `"I"` | `1` | Token is inside an entity. | -| `"O"` | `2` | Token is outside an entity. | -| `"B"` | `3` | Token begins an entity. | -| `""` | `0` | No entity tag is set (missing value). | - -### BILUO Scheme {#biluo} - -| Tag | Description | -| ----------- | ---------------------------------------- | -| **`B`**EGIN | The first token of a multi-token entity. | -| **`I`**N | An inner token of a multi-token entity. | -| **`L`**AST | The final token of a multi-token entity. | -| **`U`**NIT | A single-token entity. | -| **`O`**UT | A non-entity token. | - -> #### Why BILUO, not IOB? -> -> There are several coding schemes for encoding entity annotations as token -> tags. These coding schemes are equally expressive, but not necessarily equally -> learnable. [Ratinov and Roth](http://www.aclweb.org/anthology/W09-1119) showed -> that the minimal **Begin**, **In**, **Out** scheme was more difficult to learn -> than the **BILUO** scheme that we use, which explicitly marks boundary tokens. - -spaCy translates the character offsets into this scheme, in order to decide the -cost of each action given the current state of the entity recognizer. The costs -are then used to calculate the gradient of the loss, to train the model. The -exact algorithm is a pastiche of well-known methods, and is not currently -described in any single publication. The model is a greedy transition-based -parser guided by a linear model whose weights are learned using the averaged -perceptron loss, via the -[dynamic oracle](http://www.aclweb.org/anthology/C12-1059) imitation learning -strategy. The transition system is equivalent to the BILUO tagging scheme. - -## Models and training data {#training} - -### JSON input format for training {#json-input} - -spaCy takes training data in JSON format. The built-in -[`convert`](/api/cli#convert) command helps you convert the `.conllu` format -used by the -[Universal Dependencies corpora](https://github.com/UniversalDependencies) to -spaCy's training format. To convert one or more existing `Doc` objects to -spaCy's JSON format, you can use the -[`gold.docs_to_json`](/api/goldparse#docs_to_json) helper. - -> #### Annotating entities -> -> Named entities are provided in the [BILUO](#biluo) notation. Tokens outside an -> entity are set to `"O"` and tokens that are part of an entity are set to the -> entity label, prefixed by the BILUO marker. For example `"B-ORG"` describes -> the first token of a multi-token `ORG` entity and `"U-PERSON"` a single token -> representing a `PERSON` entity. The -> [`biluo_tags_from_offsets`](/api/goldparse#biluo_tags_from_offsets) function -> can help you convert entity offsets to the right format. - -```python -### Example structure -[{ - "id": int, # ID of the document within the corpus - "paragraphs": [{ # list of paragraphs in the corpus - "raw": string, # raw text of the paragraph - "sentences": [{ # list of sentences in the paragraph - "tokens": [{ # list of tokens in the sentence - "id": int, # index of the token in the document - "dep": string, # dependency label - "head": int, # offset of token head relative to token index - "tag": string, # part-of-speech tag - "orth": string, # verbatim text of the token - "ner": string # BILUO label, e.g. 
"O" or "B-ORG" - }], - "brackets": [{ # phrase structure (NOT USED by current models) - "first": int, # index of first token - "last": int, # index of last token - "label": string # phrase label - }] - }], - "cats": [{ # new in v2.2: categories for text classifier - "label": string, # text category label - "value": float / bool # label applies (1.0/true) or not (0.0/false) - }] - }] -}] -``` - -Here's an example of dependencies, part-of-speech tags and names entities, taken -from the English Wall Street Journal portion of the Penn Treebank: - -```json -https://github.com/explosion/spaCy/tree/master/examples/training/training-data.json -``` - -### Lexical data for vocabulary {#vocab-jsonl new="2"} - -To populate a model's vocabulary, you can use the -[`spacy init-model`](/api/cli#init-model) command and load in a -[newline-delimited JSON](http://jsonlines.org/) (JSONL) file containing one -lexical entry per line via the `--jsonl-loc` option. The first line defines the -language and vocabulary settings. All other lines are expected to be JSON -objects describing an individual lexeme. The lexical attributes will be then set -as attributes on spaCy's [`Lexeme`](/api/lexeme#attributes) object. The `vocab` -command outputs a ready-to-use spaCy model with a `Vocab` containing the lexical -data. - -```python -### First line -{"lang": "en", "settings": {"oov_prob": -20.502029418945312}} -``` - -```python -### Entry structure -{ - "orth": string, # the word text - "id": int, # can correspond to row in vectors table - "lower": string, - "norm": string, - "shape": string - "prefix": string, - "suffix": string, - "length": int, - "cluster": string, - "prob": float, - "is_alpha": bool, - "is_ascii": bool, - "is_digit": bool, - "is_lower": bool, - "is_punct": bool, - "is_space": bool, - "is_title": bool, - "is_upper": bool, - "like_url": bool, - "like_num": bool, - "like_email": bool, - "is_stop": bool, - "is_oov": bool, - "is_quote": bool, - "is_left_punct": bool, - "is_right_punct": bool -} -``` - -Here's an example of the 20 most frequent lexemes in the English training data: - -```json -https://github.com/explosion/spaCy/tree/master/examples/training/vocab-data.jsonl -``` diff --git a/website/docs/api/architectures.md b/website/docs/api/architectures.md new file mode 100644 index 000000000..3157c261a --- /dev/null +++ b/website/docs/api/architectures.md @@ -0,0 +1,674 @@ +--- +title: Model Architectures +teaser: Pre-defined model architectures included with the core library +source: spacy/ml/models +menu: + - ['Tok2Vec', 'tok2vec-arch'] + - ['Transformers', 'transformers'] + - ['Parser & NER', 'parser'] + - ['Tagging', 'tagger'] + - ['Text Classification', 'textcat'] + - ['Entity Linking', 'entitylinker'] +--- + +A **model architecture** is a function that wires up a +[`Model`](https://thinc.ai/docs/api-model) instance, which you can then use in a +pipeline component or as a layer of a larger network. This page documents +spaCy's built-in architectures that are used for different NLP tasks. All +trainable [built-in components](/api#architecture-pipeline) expect a `model` +argument defined in the config and document their the default architecture. +Custom architectures can be registered using the +[`@spacy.registry.architectures`](/api/top-level#regsitry) decorator and used as +part of the [training config](/usage/training#custom-functions). Also see the +usage documentation on +[layers and model architectures](/usage/layers-architectures). 
+ +## Tok2Vec architectures {#tok2vec-arch source="spacy/ml/models/tok2vec.py"} + +### spacy.Tok2Vec.v1 {#Tok2Vec} + +> #### Example config +> +> ```ini +> [model] +> @architectures = "spacy.Tok2Vec.v1" +> +> [model.embed] +> @architectures = "spacy.CharacterEmbed.v1" +> # ... +> +> [model.encode] +> @architectures = "spacy.MaxoutWindowEncoder.v1" +> # ... +> ``` + +Construct a tok2vec model out of two subnetworks: one for embedding and one for +encoding. See the +["Embed, Encode, Attend, Predict"](https://explosion.ai/blog/deep-learning-formula-nlp) +blog post for background. + +| Name | Description | +| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `embed` | Embed tokens into context-independent word vector representations. For example, [CharacterEmbed](/api/architectures#CharacterEmbed) or [MultiHashEmbed](/api/architectures#MultiHashEmbed). ~~Model[List[Doc], List[Floats2d]]~~ | +| `encode` | Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. For example, [MaxoutWindowEncoder](/api/architectures#MaxoutWindowEncoder). ~~Model[List[Floats2d], List[Floats2d]]~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ | + +### spacy.HashEmbedCNN.v1 {#HashEmbedCNN} + +> #### Example Config +> +> ```ini +> [model] +> @architectures = "spacy.HashEmbedCNN.v1" +> pretrained_vectors = null +> width = 96 +> depth = 4 +> embed_size = 2000 +> window_size = 1 +> maxout_pieces = 3 +> subword_features = true +> ``` + +Build spaCy's "standard" tok2vec layer. This layer is defined by a +[MultiHashEmbed](/api/architectures#MultiHashEmbed) embedding layer that uses +subword features, and a +[MaxoutWindowEncoder](/api/architectures#MaxoutWindowEncoder) encoding layer +consisting of a CNN and a layer-normalized maxout activation function. + +| Name | Description | +| -------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `width` | The width of the input and output. These are required to be the same, so that residual connections can be used. Recommended values are `96`, `128` or `300`. ~~int~~ | +| `depth` | The number of convolutional layers to use. Recommended values are between `2` and `8`. ~~int~~ | +| `embed_size` | The number of rows in the hash embedding tables. This can be surprisingly small, due to the use of the hash embeddings. Recommended values are between `2000` and `10000`. ~~int~~ | +| `window_size` | The number of tokens on either side to concatenate during the convolutions. The receptive field of the CNN will be `depth * (window_size * 2 + 1)`, so a 4-layer network with a window size of `2` will be sensitive to 17 words at a time. Recommended value is `1`. ~~int~~ | +| `maxout_pieces` | The number of pieces to use in the maxout non-linearity. If `1`, the [`Mish`](https://thinc.ai/docs/api-layers#mish) non-linearity is used instead. Recommended values are `1`-`3`. ~~int~~ | +| `subword_features` | Whether to also embed subword features, specifically the prefix, suffix and word shape. 
This is recommended for alphabetic languages like English, but not if single-character tokens are used for a language such as Chinese. ~~bool~~ | +| `pretrained_vectors` | Whether to also use static vectors. ~~bool~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ | + +### spacy.Tok2VecListener.v1 {#Tok2VecListener} + +> #### Example config +> +> ```ini +> [components.tok2vec] +> factory = "tok2vec" +> +> [components.tok2vec.model] +> @architectures = "spacy.HashEmbedCNN.v1" +> width = 342 +> +> [components.tagger] +> factory = "tagger" +> +> [components.tagger.model] +> @architectures = "spacy.Tagger.v1" +> +> [components.tagger.model.tok2vec] +> @architectures = "spacy.Tok2VecListener.v1" +> width = ${components.tok2vec.model.width} +> ``` + +A listener is used as a sublayer within a component such as a +[`DependencyParser`](/api/dependencyparser), +[`EntityRecognizer`](/api/entityrecognizer)or +[`TextCategorizer`](/api/textcategorizer). Usually you'll have multiple +listeners connecting to a single upstream [`Tok2Vec`](/api/tok2vec) component +that's earlier in the pipeline. The listener layers act as **proxies**, passing +the predictions from the `Tok2Vec` component into downstream components, and +communicating gradients back upstream. + +Instead of defining its own `Tok2Vec` instance, a model architecture like +[Tagger](/api/architectures#tagger) can define a listener as its `tok2vec` +argument that connects to the shared `tok2vec` component in the pipeline. + +| Name | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `width` | The width of the vectors produced by the "upstream" [`Tok2Vec`](/api/tok2vec) component. ~~int~~ | +| `upstream` | A string to identify the "upstream" `Tok2Vec` component to communicate with. By default, the upstream name is the wildcard string `"*"`, but you could also specify the name of the `Tok2Vec` component. You'll almost never have multiple upstream `Tok2Vec` components, so the wildcard string will almost always be fine. ~~str~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ | + +### spacy.MultiHashEmbed.v1 {#MultiHashEmbed} + +> #### Example config +> +> ```ini +> [model] +> @architectures = "spacy.MultiHashEmbed.v1" +> width = 64 +> attrs = ["NORM", "PREFIX", "SUFFIX", "SHAPE"] +> rows = [2000, 1000, 1000, 1000] +> include_static_vectors = true +> ``` + +Construct an embedding layer that separately embeds a number of lexical +attributes using hash embedding, concatenates the results, and passes it through +a feed-forward subnetwork to build a mixed representations. The features used +can be configured with the `attrs` argument. The suggested attributes are +`NORM`, `PREFIX`, `SUFFIX` and `SHAPE`. This lets the model take into account +some subword information, without construction a fully character-based +representation. If pretrained vectors are available, they can be included in the +representation as well, with the vectors table will be kept static (i.e. it's +not updated). 
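To make the parameters listed below concrete, the following sketch pulls the registered function out of the architectures registry and calls it directly in Python instead of via the config. This is a minimal sketch, assuming the layer can be initialized from a small sample batch as shown; the row counts and the sample text are arbitrary illustrative values.

```python
### Building a MultiHashEmbed layer in Python (sketch)
import spacy

# Look up the registered architecture and call it with the documented parameters.
MultiHashEmbed = spacy.registry.architectures.get("spacy.MultiHashEmbed.v1")
embed = MultiHashEmbed(
    width=96,
    attrs=["NORM", "PREFIX", "SUFFIX", "SHAPE"],
    rows=[5000, 2500, 2500, 2500],
    include_static_vectors=False,
)

nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying a U.K. startup")
embed.initialize([doc])         # infer remaining shapes from a sample batch
vectors = embed.predict([doc])  # one (n_tokens, width) array per Doc
```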
+ +| Name | Description | +| ------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `width` | The output width. Also used as the width of the embedding tables. Recommended values are between `64` and `300`. If static vectors are included, a learned linear layer is used to map the vectors to the specified width before concatenating it with the other embedding outputs. A single maxout layer is then used to reduce the concatenated vectors to the final width. ~~int~~ | +| `attrs` | The token attributes to embed. A separate embedding table will be constructed for each attribute. ~~List[Union[int, str]]~~ | +| `rows` | The number of rows for each embedding tables. Can be low, due to the hashing trick. Recommended values are between `1000` and `10000`. The layer needs surprisingly few rows, due to its use of the hashing trick. Generally between 2000 and 10000 rows is sufficient, even for very large vocabularies. A number of rows must be specified for each table, so the `rows` list must be of the same length as the `attrs` parameter. ~~List[int]~~ | +| `include_static_vectors` | Whether to also use static word vectors. Requires a vectors table to be loaded in the [`Doc`](/api/doc) objects' vocab. ~~bool~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ | + +### spacy.CharacterEmbed.v1 {#CharacterEmbed} + +> #### Example config +> +> ```ini +> [model] +> @architectures = "spacy.CharacterEmbed.v1" +> width = 128 +> rows = 7000 +> nM = 64 +> nC = 8 +> ``` + +Construct an embedded representation based on character embeddings, using a +feed-forward network. A fixed number of UTF-8 byte characters are used for each +word, taken from the beginning and end of the word equally. Padding is used in +the center for words that are too short. + +For instance, let's say `nC=4`, and the word is "jumping". The characters used +will be `"jung"` (two from the start, two from the end). If we had `nC=8`, the +characters would be `"jumpping"`: 4 from the start, 4 from the end. This ensures +that the final character is always in the last position, instead of being in an +arbitrary position depending on the word length. + +The characters are embedded in a embedding table with a given number of rows, +and the vectors concatenated. A hash-embedded vector of the `NORM` of the word +is also concatenated on, and the result is then passed through a feed-forward +network to construct a single vector to represent the information. + +| Name | Description | +| ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `width` | The width of the output vector and the `NORM` hash embedding. ~~int~~ | +| `rows` | The number of rows in the `NORM` hash embedding table. ~~int~~ | +| `nM` | The dimensionality of the character embeddings. Recommended values are between `16` and `64`. ~~int~~ | +| `nC` | The number of UTF-8 bytes to embed per word. Recommended values are between `3` and `8`, although it may depend on the length of words in the language. 
~~int~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ | + +### spacy.MaxoutWindowEncoder.v1 {#MaxoutWindowEncoder} + +> #### Example config +> +> ```ini +> [model] +> @architectures = "spacy.MaxoutWindowEncoder.v1" +> width = 128 +> window_size = 1 +> maxout_pieces = 3 +> depth = 4 +> ``` + +Encode context using convolutions with maxout activation, layer normalization +and residual connections. + +| Name | Description | +| --------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `width` | The input and output width. These are required to be the same, to allow residual connections. This value will be determined by the width of the inputs. Recommended values are between `64` and `300`. ~~int~~ | +| `window_size` | The number of words to concatenate around each token to construct the convolution. Recommended value is `1`. ~~int~~ | +| `maxout_pieces` | The number of maxout pieces to use. Recommended values are `2` or `3`. ~~int~~ | +| `depth` | The number of convolutional layers. Recommended value is `4`. ~~int~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Floats2d], List[Floats2d]]~~ | + +### spacy.MishWindowEncoder.v1 {#MishWindowEncoder} + +> #### Example config +> +> ```ini +> [model] +> @architectures = "spacy.MishWindowEncoder.v1" +> width = 64 +> window_size = 1 +> depth = 4 +> ``` + +Encode context using convolutions with +[`Mish`](https://thinc.ai/docs/api-layers#mish) activation, layer normalization +and residual connections. + +| Name | Description | +| ------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `width` | The input and output width. These are required to be the same, to allow residual connections. This value will be determined by the width of the inputs. Recommended values are between `64` and `300`. ~~int~~ | +| `window_size` | The number of words to concatenate around each token to construct the convolution. Recommended value is `1`. ~~int~~ | +| `depth` | The number of convolutional layers. Recommended value is `4`. ~~int~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Floats2d], List[Floats2d]]~~ | + +### spacy.TorchBiLSTMEncoder.v1 {#TorchBiLSTMEncoder} + +> #### Example config +> +> ```ini +> [model] +> @architectures = "spacy.TorchBiLSTMEncoder.v1" +> width = 64 +> window_size = 1 +> depth = 4 +> ``` + +Encode context using bidirectional LSTM layers. Requires +[PyTorch](https://pytorch.org). + +| Name | Description | +| ------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `width` | The input and output width. These are required to be the same, to allow residual connections. This value will be determined by the width of the inputs. Recommended values are between `64` and `300`. ~~int~~ | +| `window_size` | The number of words to concatenate around each token to construct the convolution. Recommended value is `1`. ~~int~~ | +| `depth` | The number of convolutional layers. Recommended value is `4`. ~~int~~ | +| **CREATES** | The model using the architecture. 
~~Model[List[Floats2d], List[Floats2d]]~~ | + +### spacy.StaticVectors.v1 {#StaticVectors} + +> #### Example config +> +> ```ini +> [model] +> @architectures = "spacy.StaticVectors.v1" +> nO = null +> nM = null +> dropout = 0.2 +> key_attr = "ORTH" +> +> [model.init_W] +> @initializers = "glorot_uniform_init.v1" +> ``` + +Embed [`Doc`](/api/doc) objects with their vocab's vectors table, applying a +learned linear projection to control the dimensionality. See the documentation +on [static vectors](/usage/embeddings-transformers#static-vectors) for details. + +| Name |  Description | +| ----------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `nO` | The output width of the layer, after the linear projection. ~~Optional[int]~~ | +| `nM` | The width of the static vectors. ~~Optional[int]~~ | +| `dropout` | Optional dropout rate. If set, it's applied per dimension over the whole batch. Defaults to `None`. ~~Optional[float]~~ | +| `init_W` | The [initialization function](https://thinc.ai/docs/api-initializers). Defaults to [`glorot_uniform_init`](https://thinc.ai/docs/api-initializers#glorot_uniform_init). ~~Callable[[Ops, Tuple[int, ...]]], FloatsXd]~~ | +| `key_attr` | Defaults to `"ORTH"`. ~~str~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Doc], Ragged]~~ | + +### spacy.FeatureExtractor.v1 {#FeatureExtractor} + +> #### Example config +> +> ```ini +> [model] +> @architectures = "spacy.FeatureExtractor.v1" +> columns = ["NORM", "PREFIX", "SUFFIX", "SHAPE", "ORTH"] +> ``` + +Extract arrays of input features from [`Doc`](/api/doc) objects. Expects a list +of feature names to extract, which should refer to token attributes. + +| Name |  Description | +| ----------- | ------------------------------------------------------------------------ | +| `columns` | The token attributes to extract. ~~List[Union[int, str]]~~ | +| **CREATES** | The created feature extraction layer. ~~Model[List[Doc], List[Ints2d]]~~ | + +## Transformer architectures {#transformers source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/architectures.py"} + +The following architectures are provided by the package +[`spacy-transformers`](https://github.com/explosion/spacy-transformers). See the +[usage documentation](/usage/embeddings-transformers#transformers) for how to +integrate the architectures into your training config. + + + +Note that in order to use these architectures in your config, you need to +install the +[`spacy-transformers`](https://github.com/explosion/spacy-transformers). See the +[installation docs](/usage/embeddings-transformers#transformers-installation) +for details and system requirements. + + + +### spacy-transformers.TransformerModel.v1 {#TransformerModel} + +> #### Example Config +> +> ```ini +> [model] +> @architectures = "spacy-transformers.TransformerModel.v1" +> name = "roberta-base" +> tokenizer_config = {"use_fast": true} +> +> [model.get_spans] +> @span_getters = "spacy-transformers.strided_spans.v1" +> window = 128 +> stride = 96 +> ``` + +Load and wrap a transformer model from the +[HuggingFace `transformers`](https://huggingface.co/transformers) library. You +can use any transformer that has pretrained weights and a PyTorch +implementation. The `name` variable is passed through to the underlying library, +so it can be either a string or a path. 
If it's a string, the pretrained weights +will be downloaded via the transformers library if they are not already +available locally. + +In order to support longer documents, the +[TransformerModel](/api/architectures#TransformerModel) layer allows you to pass +in a `get_spans` function that will divide up the [`Doc`](/api/doc) objects +before passing them through the transformer. Your spans are allowed to overlap +or exclude tokens. This layer is usually used directly by the +[`Transformer`](/api/transformer) component, which allows you to share the +transformer weights across your pipeline. For a layer that's configured for use +in other components, see +[Tok2VecTransformer](/api/architectures#Tok2VecTransformer). + +| Name | Description | +| ------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | Any model name that can be loaded by [`transformers.AutoModel`](https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoModel). ~~str~~ | +| `get_spans` | Function that takes a batch of [`Doc`](/api/doc) object and returns lists of [`Span`](/api) objects to process by the transformer. [See here](/api/transformer#span_getters) for built-in options and examples. ~~Callable[[List[Doc]], List[Span]]~~ | +| `tokenizer_config` | Tokenizer settings passed to [`transformers.AutoTokenizer`](https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoTokenizer). ~~Dict[str, Any]~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Doc], FullTransformerBatch]~~ | + +### spacy-transformers.TransformerListener.v1 {#TransformerListener} + +> #### Example Config +> +> ```ini +> [model] +> @architectures = "spacy-transformers.TransformerListener.v1" +> grad_factor = 1.0 +> +> [model.pooling] +> @layers = "reduce_mean.v1" +> ``` + +Create a `TransformerListener` layer, which will connect to a +[`Transformer`](/api/transformer) component earlier in the pipeline. The layer +takes a list of [`Doc`](/api/doc) objects as input, and produces a list of +2-dimensional arrays as output, with each array having one row per token. Most +spaCy models expect a sublayer with this signature, making it easy to connect +them to a transformer model via this sublayer. Transformer models usually +operate over wordpieces, which usually don't align one-to-one against spaCy +tokens. The layer therefore requires a reduction operation in order to calculate +a single token vector given zero or more wordpiece vectors. + +| Name | Description | +| ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `pooling` | A reduction layer used to calculate the token vectors based on zero or more wordpiece vectors. If in doubt, mean pooling (see [`reduce_mean`](https://thinc.ai/docs/api-layers#reduce_mean)) is usually a good choice. ~~Model[Ragged, Floats2d]~~ | +| `grad_factor` | Reweight gradients from the component before passing them upstream. You can set this to `0` to "freeze" the transformer weights with respect to the component, or use it to make some components more significant than others. Leaving it at `1.0` is usually fine. 
~~float~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ | + +### spacy-transformers.Tok2VecTransformer.v1 {#Tok2VecTransformer} + +> #### Example Config +> +> ```ini +> [model] +> @architectures = "spacy.Tok2VecTransformer.v1" +> name = "albert-base-v2" +> tokenizer_config = {"use_fast": false} +> grad_factor = 1.0 +> ``` + +Use a transformer as a [`Tok2Vec`](/api/tok2vec) layer directly. This does +**not** allow multiple components to share the transformer weights and does +**not** allow the transformer to set annotations into the [`Doc`](/api/doc) +object, but it's a **simpler solution** if you only need the transformer within +one component. + +| Name | Description | +| ------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `get_spans` | Function that takes a batch of [`Doc`](/api/doc) object and returns lists of [`Span`](/api) objects to process by the transformer. [See here](/api/transformer#span_getters) for built-in options and examples. ~~Callable[[List[Doc]], List[Span]]~~ | +| `tokenizer_config` | Tokenizer settings passed to [`transformers.AutoTokenizer`](https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoTokenizer). ~~Dict[str, Any]~~ | +| `pooling` | A reduction layer used to calculate the token vectors based on zero or more wordpiece vectors. If in doubt, mean pooling (see [`reduce_mean`](https://thinc.ai/docs/api-layers#reduce_mean)) is usually a good choice. ~~Model[Ragged, Floats2d]~~ | +| `grad_factor` | Reweight gradients from the component before passing them upstream. You can set this to `0` to "freeze" the transformer weights with respect to the component, or use it to make some components more significant than others. Leaving it at `1.0` is usually fine. ~~float~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ | + +## Parser & NER architectures {#parser} + +### spacy.TransitionBasedParser.v1 {#TransitionBasedParser source="spacy/ml/models/parser.py"} + +> #### Example Config +> +> ```ini +> [model] +> @architectures = "spacy.TransitionBasedParser.v1" +> state_type = "ner" +> extra_state_tokens = false +> hidden_width = 64 +> maxout_pieces = 2 +> +> [model.tok2vec] +> @architectures = "spacy.HashEmbedCNN.v1" +> pretrained_vectors = null +> width = 96 +> depth = 4 +> embed_size = 2000 +> window_size = 1 +> maxout_pieces = 3 +> subword_features = true +> ``` + +Build a transition-based parser model. Can apply to NER or dependency parsing. +Transition-based parsing is an approach to structured prediction where the task +of predicting the structure is mapped to a series of state transitions. You +might find [this tutorial](https://explosion.ai/blog/parsing-english-in-python) +helpful for background information. The neural network state prediction model +consists of either two or three subnetworks: + +- **tok2vec**: Map each token into a vector representation. This subnetwork is + run once for each batch. +- **lower**: Construct a feature-specific vector for each `(token, feature)` + pair. This is also run once for each batch. Constructing the state + representation is then simply a matter of summing the component features and + applying the non-linearity. 
+- **upper** (optional): A feed-forward network that predicts scores from the + state representation. If not present, the output from the lower model is used + as action scores directly. + +| Name | Description | +| -------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `tok2vec` | Subnetwork to map tokens into vector representations. ~~Model[List[Doc], List[Floats2d]]~~ | +| `state_type` | Which task to extract features for. Possible values are "ner" and "parser". ~~str~~ | +| `extra_state_tokens` | Whether to use an expanded feature set when extracting the state tokens. Slightly slower, but sometimes improves accuracy slightly. Defaults to `False`. ~~bool~~ | +| `hidden_width` | The width of the hidden layer. ~~int~~ | +| `maxout_pieces` | How many pieces to use in the state prediction layer. Recommended values are `1`, `2` or `3`. If `1`, the maxout non-linearity is replaced with a [`Relu`](https://thinc.ai/docs/api-layers#relu) non-linearity if `use_upper` is `True`, and no non-linearity if `False`. ~~int~~ | +| `use_upper` | Whether to use an additional hidden layer after the state vector in order to predict the action scores. It is recommended to set this to `False` for large pretrained models such as transformers, and `True` for smaller networks. The upper layer is computed on CPU, which becomes a bottleneck on larger GPU-based models, where it's also less necessary. ~~bool~~ | +| `nO` | The number of actions the model will predict between. Usually inferred from data at the beginning of training, or loaded from disk. ~~int~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Docs], List[List[Floats2d]]]~~ | + +## Tagging architectures {#tagger source="spacy/ml/models/tagger.py"} + +### spacy.Tagger.v1 {#Tagger} + +> #### Example Config +> +> ```ini +> [model] +> @architectures = "spacy.Tagger.v1" +> nO = null +> +> [model.tok2vec] +> # ... +> ``` + +Build a tagger model, using a provided token-to-vector component. The tagger +model simply adds a linear layer with softmax activation to predict scores given +the token vectors. + +| Name | Description | +| ----------- | ------------------------------------------------------------------------------------------ | +| `tok2vec` | Subnetwork to map tokens into vector representations. ~~Model[List[Doc], List[Floats2d]]~~ | +| `nO` | The number of tags to output. Inferred from the data if `None`. ~~Optional[int]~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ | + +## Text classification architectures {#textcat source="spacy/ml/models/textcat.py"} + +A text classification architecture needs to take a [`Doc`](/api/doc) as input, +and produce a score for each potential label class. Textcat challenges can be +binary (e.g. sentiment analysis) or involve multiple possible labels. +Multi-label challenges can either have mutually exclusive labels (each example +has exactly one label), or multiple labels may be applicable at the same time. + +As the properties of text classification problems can vary widely, we provide +several different built-in architectures. 
It is recommended to experiment with +different architectures and settings to determine what works best on your +specific data and challenge. + +### spacy.TextCatEnsemble.v1 {#TextCatEnsemble} + +> #### Example Config +> +> ```ini +> [model] +> @architectures = "spacy.TextCatEnsemble.v1" +> exclusive_classes = false +> pretrained_vectors = null +> width = 64 +> embed_size = 2000 +> conv_depth = 2 +> window_size = 1 +> ngram_size = 1 +> dropout = null +> nO = null +> ``` + +Stacked ensemble of a bag-of-words model and a neural network model. The neural +network has an internal CNN Tok2Vec layer and uses attention. + +| Name | Description | +| -------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `exclusive_classes` | Whether or not categories are mutually exclusive. ~~bool~~ | +| `pretrained_vectors` | Whether or not pretrained vectors will be used in addition to the feature vectors. ~~bool~~ | +| `width` | Output dimension of the feature encoding step. ~~int~~ | +| `embed_size` | Input dimension of the feature encoding step. ~~int~~ | +| `conv_depth` | Depth of the tok2vec layer. ~~int~~ | +| `window_size` | The number of contextual vectors to [concatenate](https://thinc.ai/docs/api-layers#expand_window) from the left and from the right. ~~int~~ | +| `ngram_size` | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3`would give unigram, trigram and bigram features. ~~int~~ | +| `dropout` | The dropout rate. ~~float~~ | +| `nO` | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `initialize` is called. ~~Optional[int]~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ | + +### spacy.TextCatCNN.v1 {#TextCatCNN} + +> #### Example Config +> +> ```ini +> [model] +> @architectures = "spacy.TextCatCNN.v1" +> exclusive_classes = false +> nO = null +> +> [model.tok2vec] +> @architectures = "spacy.HashEmbedCNN.v1" +> pretrained_vectors = null +> width = 96 +> depth = 4 +> embed_size = 2000 +> window_size = 1 +> maxout_pieces = 3 +> subword_features = true +> ``` + +A neural network model where token vectors are calculated using a CNN. The +vectors are mean pooled and used as features in a feed-forward network. This +architecture is usually less accurate than the ensemble, but runs faster. + +| Name | Description | +| ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `exclusive_classes` | Whether or not categories are mutually exclusive. ~~bool~~ | +| `tok2vec` | The [`tok2vec`](#tok2vec) layer of the model. ~~Model~~ | +| `nO` | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `initialize` is called. ~~Optional[int]~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ | + +### spacy.TextCatBOW.v1 {#TextCatBOW} + +> #### Example Config +> +> ```ini +> [model] +> @architectures = "spacy.TextCatBOW.v1" +> exclusive_classes = false +> ngram_size = 1 +> no_output_layer = false +> nO = null +> ``` + +An n-gram "bag-of-words" model. 
This architecture should run much faster than +the others, but may not be as accurate, especially if texts are short. + +| Name | Description | +| ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `exclusive_classes` | Whether or not categories are mutually exclusive. ~~bool~~ | +| `ngram_size` | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, trigram and bigram features. ~~int~~ | +| `no_output_layer` | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes` is `True`, else `Logistic`). ~~bool~~ | +| `nO` | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `initialize` is called. ~~Optional[int]~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ | + +## Entity linking architectures {#entitylinker source="spacy/ml/models/entity_linker.py"} + +An [`EntityLinker`](/api/entitylinker) component disambiguates textual mentions +(tagged as named entities) to unique identifiers, grounding the named entities +into the "real world". This requires 3 main components: + +- A [`KnowledgeBase`](/api/kb) (KB) holding the unique identifiers, potential + synonyms and prior probabilities. +- A candidate generation step to produce a set of likely identifiers, given a + certain textual mention. +- A machine learning [`Model`](https://thinc.ai/docs/api-model) that picks the + most plausible ID from the set of candidates. + +### spacy.EntityLinker.v1 {#EntityLinker} + +> #### Example Config +> +> ```ini +> [model] +> @architectures = "spacy.EntityLinker.v1" +> nO = null +> +> [model.tok2vec] +> @architectures = "spacy.HashEmbedCNN.v1" +> pretrained_vectors = null +> width = 96 +> depth = 2 +> embed_size = 300 +> window_size = 1 +> maxout_pieces = 3 +> subword_features = true +> ``` + +The `EntityLinker` model architecture is a Thinc `Model` with a +[`Linear`](https://thinc.ai/api-layers#linear) output layer. + +| Name | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `tok2vec` | The [`tok2vec`](#tok2vec) layer of the model. ~~Model~~ | +| `nO` | Output dimension, determined by the length of the vectors encoding each entity in the KB. If the `nO` dimension is not set, the entity linking component will set it when `initialize` is called. ~~Optional[int]~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ | + +### spacy.EmptyKB.v1 {#EmptyKB} + +A function that creates an empty `KnowledgeBase` from a [`Vocab`](/api/vocab) +instance. This is the default when a new entity linker component is created. + +| Name | Description | +| ---------------------- | ----------------------------------------------------------------------------------- | +| `entity_vector_length` | The length of the vectors encoding each entity in the KB. Defaults to `64`. ~~int~~ | + +### spacy.KBFromFile.v1 {#KBFromFile} + +A function that reads an existing `KnowledgeBase` from file. 
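The snippet below is a minimal sketch of how such a file could be produced in the first place: it builds a tiny in-memory `KnowledgeBase`, adds a couple of made-up entities and aliases, and writes it out so that its location can later be supplied as `kb_path`. All identifiers, frequencies and vectors here are invented for illustration.

```python
### Creating a KB to load from file (sketch)
import spacy
from spacy.kb import KnowledgeBase

nlp = spacy.blank("en")
# The entity vector length is kept tiny here; in practice it should match
# the encoding size used by the entity linker component (e.g. 64).
kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)
kb.add_entity(entity="Q42", freq=42, entity_vector=[0.1, 0.2, 0.3])
kb.add_entity(entity="Q463035", freq=12, entity_vector=[0.4, 0.1, 0.0])
kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[0.9])
kb.add_alias(alias="Douglas", entities=["Q42", "Q463035"], probabilities=[0.6, 0.3])
kb.to_disk("./my_kb")  # the stored KB can then be referenced via kb_path
```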
+ +| Name | Description | +| --------- | -------------------------------------------------------- | +| `kb_path` | The location of the KB that was stored to file. ~~Path~~ | + +### spacy.CandidateGenerator.v1 {#CandidateGenerator} + +A function that takes as input a [`KnowledgeBase`](/api/kb) and a +[`Span`](/api/span) object denoting a named entity, and returns a list of +plausible [`Candidate`](/api/kb/#candidate) objects. The default +`CandidateGenerator` simply uses the text of a mention to find its potential +aliases in the `KnowledgeBase`. Note that this function is case-dependent. diff --git a/website/docs/api/attributeruler.md b/website/docs/api/attributeruler.md new file mode 100644 index 000000000..a253ca9f8 --- /dev/null +++ b/website/docs/api/attributeruler.md @@ -0,0 +1,281 @@ +--- +title: AttributeRuler +tag: class +source: spacy/pipeline/attributeruler.py +new: 3 +teaser: 'Pipeline component for rule-based token attribute assignment' +api_string_name: attribute_ruler +api_trainable: false +--- + +The attribute ruler lets you set token attributes for tokens identified by +[`Matcher` patterns](/usage/rule-based-matching#matcher). The attribute ruler is +typically used to handle exceptions for token attributes and to map values +between attributes such as mapping fine-grained POS tags to coarse-grained POS +tags. See the [usage guide](/usage/linguistic-features/#mappings-exceptions) for +examples. + +## Config and implementation {#config} + +The default config is defined by the pipeline component factory and describes +how the component should be configured. You can override its settings via the +`config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your +[`config.cfg` for training](/usage/training#config). + +> #### Example +> +> ```python +> config = {"validate": True} +> nlp.add_pipe("attribute_ruler", config=config) +> ``` + +| Setting | Description | +| ---------- | --------------------------------------------------------------------------------------------- | +| `validate` | Whether patterns should be validated (passed to the `Matcher`). Defaults to `False`. ~~bool~~ | + +```python +%%GITHUB_SPACY/spacy/pipeline/attributeruler.py +``` + +## AttributeRuler.\_\_init\_\_ {#init tag="method"} + +Initialize the attribute ruler. + +> #### Example +> +> ```python +> # Construction via add_pipe +> ruler = nlp.add_pipe("attribute_ruler") +> ``` + +| Name | Description | +| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------- | +| `vocab` | The shared vocabulary to pass to the matcher. ~~Vocab~~ | +| `name` | Instance name of the current pipeline component. Typically passed in automatically from the factory when the component is added. ~~str~~ | +| _keyword-only_ | | +| `validate` | Whether patterns should be validated (passed to the [`Matcher`](/api/matcher#init)). Defaults to `False`. ~~bool~~ | + +## AttributeRuler.\_\_call\_\_ {#call tag="method"} + +Apply the attribute ruler to a `Doc`, setting token attributes for tokens +matched by the provided patterns. + +| Name | Description | +| ----------- | -------------------------------- | +| `doc` | The document to process. ~~Doc~~ | +| **RETURNS** | The processed document. ~~Doc~~ | + +## AttributeRuler.add {#add tag="method"} + +Add patterns to the attribute ruler. The patterns are a list of `Matcher` +patterns and the attributes are a dict of attributes to set on the matched +token. 
If the pattern matches a span of more than one token, the `index` can be +used to set the attributes for the token at that index in the span. The `index` +may be negative to index from the end of the span. + +> #### Example +> +> ```python +> ruler = nlp.add_pipe("attribute_ruler") +> patterns = [[{"TAG": "VB"}]] +> attrs = {"POS": "VERB"} +> ruler.add(patterns=patterns, attrs=attrs) +> ``` + +| Name | Description | +| ---------- | --------------------------------------------------------------------------------------------------------------------------------- | +| `patterns` | The `Matcher` patterns to add. ~~Iterable[List[Dict[Union[int, str], Any]]]~~ | +| `attrs` | The attributes to assign to the target token in the matched span. ~~Dict[str, Any]~~ | +| `index` | The index of the token in the matched span to modify. May be negative to index from the end of the span. Defaults to `0`. ~~int~~ | + +## AttributeRuler.add_patterns {#add_patterns tag="method"} + +> #### Example +> +> ```python +> ruler = nlp.add_pipe("attribute_ruler") +> patterns = [ +> { +> "patterns": [[{"TAG": "VB"}]], "attrs": {"POS": "VERB"} +> }, +> { +> "patterns": [[{"LOWER": "two"}, {"LOWER": "apples"}]], +> "attrs": {"LEMMA": "apple"}, +> "index": -1 +> }, +> ] +> ruler.add_patterns(patterns) +> ``` + +Add patterns from a list of pattern dicts. Each pattern dict can specify the +keys `"patterns"`, `"attrs"` and `"index"`, which match the arguments of +[`AttributeRuler.add`](/api/attributeruler#add). + +| Name | Description | +| ---------- | -------------------------------------------------------------------------- | +| `patterns` | The patterns to add. ~~Iterable[Dict[str, Union[List[dict], dict, int]]]~~ | + +## AttributeRuler.patterns {#patterns tag="property"} + +Get all patterns that have been added to the attribute ruler in the +`patterns_dict` format accepted by +[`AttributeRuler.add_patterns`](/api/attributeruler#add_patterns). + +| Name | Description | +| ----------- | -------------------------------------------------------------------------------------------- | +| **RETURNS** | The patterns added to the attribute ruler. ~~List[Dict[str, Union[List[dict], dict, int]]]~~ | + +## AttributeRuler.initialize {#initialize tag="method"} + +Initialize the component with data and used before training to load in rules +from a file. This method is typically called by +[`Language.initialize`](/api/language#initialize) and lets you customize +arguments it receives via the +[`[initialize.components]`](/api/data-formats#config-initialize) block in the +config. + +> #### Example +> +> ```python +> ruler = nlp.add_pipe("attribute_ruler") +> ruler.initialize(lambda: [], nlp=nlp, patterns=patterns) +> ``` +> +> ```ini +> ### config.cfg +> [initialize.components.attribute_ruler] +> +> [initialize.components.attribute_ruler.patterns] +> @readers = "srsly.read_json.v1" +> path = "corpus/attribute_ruler_patterns.json +> ``` + +| Name | Description | +| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects (the training data). Not used by this component. ~~Callable[[], Iterable[Example]]~~ | +| _keyword-only_ | | +| `nlp` | The current `nlp` object. Defaults to `None`. 
~~Optional[Language]~~ | +| `patterns` | A list of pattern dicts with the keys as the arguments to [`AttributeRuler.add`](/api/attributeruler#add) (`patterns`/`attrs`/`index`) to add as patterns. Defaults to `None`. ~~Optional[Iterable[Dict[str, Union[List[dict], dict, int]]]]~~ | +| `tag_map` | The tag map that maps fine-grained tags to coarse-grained tags and morphological features. Defaults to `None`. ~~Optional[Dict[str, Dict[Union[int, str], Union[int, str]]]]~~ | +| `morph_rules` | The morph rules that map token text and fine-grained tags to coarse-grained tags, lemmas and morphological features. Defaults to `None`. ~~Optional[Dict[str, Dict[str, Dict[Union[int, str], Union[int, str]]]]]~~ | + +## AttributeRuler.load_from_tag_map {#load_from_tag_map tag="method"} + +Load attribute ruler patterns from a tag map. + +| Name | Description | +| --------- | ------------------------------------------------------------------------------------------------------------------------------------------------ | +| `tag_map` | The tag map that maps fine-grained tags to coarse-grained tags and morphological features. ~~Dict[str, Dict[Union[int, str], Union[int, str]]]~~ | + +## AttributeRuler.load_from_morph_rules {#load_from_morph_rules tag="method"} + +Load attribute ruler patterns from morph rules. + +| Name | Description | +| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `morph_rules` | The morph rules that map token text and fine-grained tags to coarse-grained tags, lemmas and morphological features. ~~Dict[str, Dict[str, Dict[Union[int, str], Union[int, str]]]]~~ | + +## AttributeRuler.score {#score tag="method" new="3"} + +Score a batch of examples. + +> #### Example +> +> ```python +> scores = ruler.score(examples) +> ``` + +| Name | Description | +| ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `examples` | The examples to score. ~~Iterable[Example]~~ | +| **RETURNS** | The scores, produced by [`Scorer.score_token_attr`](/api/scorer#score_token_attr) for the attributes `"tag"`, `"pos"`, `"morph"` and `"lemma"` if present in any of the target token attributes. ~~Dict[str, float]~~ | + +## AttributeRuler.to_disk {#to_disk tag="method"} + +Serialize the pipe to disk. + +> #### Example +> +> ```python +> ruler = nlp.add_pipe("attribute_ruler") +> ruler.to_disk("/path/to/attribute_ruler") +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | +| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | + +## AttributeRuler.from_disk {#from_disk tag="method"} + +Load the pipe from disk. Modifies the object in place and returns it. 
+ +> #### Example +> +> ```python +> ruler = nlp.add_pipe("attribute_ruler") +> ruler.from_disk("/path/to/attribute_ruler") +> ``` + +| Name | Description | +| -------------- | ----------------------------------------------------------------------------------------------- | +| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The modified `AttributeRuler` object. ~~AttributeRuler~~ | + +## AttributeRuler.to_bytes {#to_bytes tag="method"} + +> #### Example +> +> ```python +> ruler = nlp.add_pipe("attribute_ruler") +> ruler = ruler.to_bytes() +> ``` + +Serialize the pipe to a bytestring. + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------- | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The serialized form of the `AttributeRuler` object. ~~bytes~~ | + +## AttributeRuler.from_bytes {#from_bytes tag="method"} + +Load the pipe from a bytestring. Modifies the object in place and returns it. + +> #### Example +> +> ```python +> ruler_bytes = ruler.to_bytes() +> ruler = nlp.add_pipe("attribute_ruler") +> ruler.from_bytes(ruler_bytes) +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------- | +| `bytes_data` | The data to load from. ~~bytes~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The `AttributeRuler` object. ~~AttributeRuler~~ | + +## Serialization fields {#serialization-fields} + +During serialization, spaCy will export several data fields used to restore +different aspects of the object. If needed, you can exclude them from +serialization by passing in the string names via the `exclude` argument. + +> #### Example +> +> ```python +> data = ruler.to_disk("/path", exclude=["vocab"]) +> ``` + +| Name | Description | +| ---------- | --------------------------------------------------------------- | +| `vocab` | The shared [`Vocab`](/api/vocab). | +| `patterns` | The `Matcher` patterns. You usually don't want to exclude this. | +| `attrs` | The attributes to set. You usually don't want to exclude this. | +| `indices` | The token indices. You usually don't want to exclude this. 
| diff --git a/website/docs/api/cli.md b/website/docs/api/cli.md index b97308aab..168465fab 100644 --- a/website/docs/api/cli.md +++ b/website/docs/api/cli.md @@ -1,206 +1,472 @@ --- title: Command Line Interface -teaser: Download, train and package models, and debug spaCy +teaser: Download, train and package pipelines, and debug spaCy source: spacy/cli menu: - - ['Download', 'download'] - - ['Link', 'link'] - - ['Info', 'info'] - - ['Validate', 'validate'] - - ['Convert', 'convert'] - - ['Debug data', 'debug-data'] - - ['Train', 'train'] - - ['Pretrain', 'pretrain'] - - ['Init Model', 'init-model'] - - ['Evaluate', 'evaluate'] - - ['Package', 'package'] + - ['download', 'download'] + - ['info', 'info'] + - ['validate', 'validate'] + - ['init', 'init'] + - ['convert', 'convert'] + - ['debug', 'debug'] + - ['train', 'train'] + - ['pretrain', 'pretrain'] + - ['evaluate', 'evaluate'] + - ['package', 'package'] + - ['project', 'project'] + - ['ray', 'ray'] --- -As of v1.7.0, spaCy comes with new command line helpers to download and link -models and show useful debugging information. For a list of available commands, -type `spacy --help`. +spaCy's CLI provides a range of helpful commands for downloading and training +pipelines, converting data and debugging your config, data and installation. For +a list of available commands, you can type `python -m spacy --help`. You can +also add the `--help` flag to any command or subcommand to see the description, +available arguments and usage. -## Download {#download} +## download {#download tag="command"} -Download [models](/usage/models) for spaCy. The downloader finds the -best-matching compatible version, uses `pip install` to download the model as a -package and creates a [shortcut link](/usage/models#usage) if the model was -downloaded via a shortcut. Direct downloads don't perform any compatibility -checks and require the model name to be specified with its version (e.g. -`en_core_web_sm-2.2.0`). +Download [trained pipelines](/usage/models) for spaCy. The downloader finds the +best-matching compatible version and uses `pip install` to download the Python +package. Direct downloads don't perform any compatibility checks and require the +pipeline name to be specified with its version (e.g. `en_core_web_sm-2.2.0`). > #### Downloading best practices > > The `download` command is mostly intended as a convenient, interactive wrapper > – it performs compatibility checks and prints detailed messages in case things > go wrong. It's **not recommended** to use this command as part of an automated -> process. If you know which model your project needs, you should consider a -> [direct download via pip](/usage/models#download-pip), or uploading the model -> to a local PyPi installation and fetching it straight from there. This will -> also allow you to add it as a versioned package dependency to your project. +> process. If you know which package your project needs, you should consider a +> [direct download via pip](/usage/models#download-pip), or uploading the +> package to a local PyPi installation and fetching it straight from there. This +> will also allow you to add it as a versioned package dependency to your +> project. 
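For example, an automated setup could treat the pipeline package as a normal pinned dependency and only verify at runtime that it is present, rather than downloading it on the fly. The check below is a sketch; `en_core_web_sm` stands in for whatever package your project depends on.

```python
### Loading a pipeline installed via pip (sketch)
import spacy
from spacy.util import is_package

# The package is expected to be installed up front, e.g. as a pinned
# requirement, so the code only needs to load it by name.
if not is_package("en_core_web_sm"):
    raise RuntimeError("en_core_web_sm is not installed - add it to your requirements")
nlp = spacy.load("en_core_web_sm")
```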
-```bash -$ python -m spacy download [model] [--direct] [pip args] +```cli +$ python -m spacy download [model] [--direct] [pip_args] ``` -| Argument | Type | Description | -| ------------------------------------- | ------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `model` | positional | Model name or shortcut (`en`, `de`, `en_core_web_sm`). | -| `--direct`, `-d` | flag | Force direct download of exact model version. | -| pip args 2.1 | - | Additional installation options to be passed to `pip install` when installing the model package. For example, `--user` to install to the user home directory or `--no-deps` to not install model dependencies. | -| `--help`, `-h` | flag | Show help message and available arguments. | -| **CREATES** | directory, symlink | The installed model package in your `site-packages` directory and a shortcut link as a symlink in `spacy/data` if installed via shortcut. | +| Name | Description | +| ------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `model` | Pipeline package name, e.g. [`en_core_web_sm`](/models/en#en_core_web_sm). ~~str (positional)~~ | +| `--direct`, `-d` | Force direct download of exact package version. ~~bool (flag)~~ | +| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | +| pip args 2.1 | Additional installation options to be passed to `pip install` when installing the pipeline package. For example, `--user` to install to the user home directory or `--no-deps` to not install package dependencies. ~~Any (option/flag)~~ | +| **CREATES** | The installed pipeline package in your `site-packages` directory. | -## Link {#link} +## info {#info tag="command"} -Create a [shortcut link](/usage/models#usage) for a model, either a Python -package or a local directory. This will let you load models from any location -using a custom name via [`spacy.load()`](/api/top-level#spacy.load). +Print information about your spaCy installation, trained pipelines and local +setup, and generate [Markdown](https://en.wikipedia.org/wiki/Markdown)-formatted +markup to copy-paste into +[GitHub issues](https://github.com/explosion/spaCy/issues). - - -In spaCy v1.x, you had to use the model data directory to set up a shortcut link -for a local path. As of v2.0, spaCy expects all shortcut links to be **loadable -model packages**. If you want to load a data directory, call -[`spacy.load()`](/api/top-level#spacy.load) or -[`Language.from_disk()`](/api/language#from_disk) with the path, or use the -[`package`](/api/cli#package) command to create a model package. - - - -```bash -$ python -m spacy link [origin] [link_name] [--force] -``` - -| Argument | Type | Description | -| --------------- | ---------- | --------------------------------------------------------------- | -| `origin` | positional | Model name if package, or path to local directory. | -| `link_name` | positional | Name of the shortcut link to create. | -| `--force`, `-f` | flag | Force overwriting of existing link. | -| `--help`, `-h` | flag | Show help message and available arguments. | -| **CREATES** | symlink | A shortcut link of the given name as a symlink in `spacy/data`. 
| - -## Info {#info} - -Print information about your spaCy installation, models and local setup, and -generate [Markdown](https://en.wikipedia.org/wiki/Markdown)-formatted markup to -copy-paste into [GitHub issues](https://github.com/explosion/spaCy/issues). - -```bash +```cli $ python -m spacy info [--markdown] [--silent] ``` -```bash +```cli $ python -m spacy info [model] [--markdown] [--silent] ``` -| Argument | Type | Description | -| ------------------------------------------------ | ---------- | ------------------------------------------------------------- | -| `model` | positional | A model, i.e. shortcut link, package name or path (optional). | -| `--markdown`, `-md` | flag | Print information as Markdown. | -| `--silent`, `-s` 2.0.12 | flag | Don't print anything, just return the values. | -| `--help`, `-h` | flag | Show help message and available arguments. | -| **PRINTS** | `stdout` | Information about your spaCy installation. | +| Name | Description | +| ------------------------------------------------ | ----------------------------------------------------------------------------------------- | +| `model` | A trained pipeline, i.e. package name or path (optional). ~~Optional[str] \(positional)~~ | +| `--markdown`, `-md` | Print information as Markdown. ~~bool (flag)~~ | +| `--silent`, `-s` 2.0.12 | Don't print anything, just return the values. ~~bool (flag)~~ | +| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | +| **PRINTS** | Information about your spaCy installation. | -## Validate {#validate new="2"} +## validate {#validate new="2" tag="command"} -Find all models installed in the current environment (both packages and shortcut -links) and check whether they are compatible with the currently installed -version of spaCy. Should be run after upgrading spaCy via `pip install -U spacy` -to ensure that all installed models are can be used with the new version. The -command is also useful to detect out-of-sync model links resulting from links -created in different virtual environments. It will show a list of models and -their installed versions. If any model is out of date, the latest compatible -versions and command for updating are shown. +Find all trained pipeline packages installed in the current environment and +check whether they are compatible with the currently installed version of spaCy. +Should be run after upgrading spaCy via `pip install -U spacy` to ensure that +all installed packages can be used with the new version. It will show a list of +packages and their installed versions. If any package is out of date, the latest +compatible versions and command for updating are shown. > #### Automated validation > > You can also use the `validate` command as part of your build process or test -> suite, to ensure all models are up to date before proceeding. If incompatible -> models or shortcut links are found, it will return `1`. +> suite, to ensure all packages are up to date before proceeding. If +> incompatible packages are found, it will return `1`. -```bash +```cli $ python -m spacy validate ``` -| Argument | Type | Description | -| ---------- | -------- | --------------------------------------------------------- | -| **PRINTS** | `stdout` | Details about the compatibility of your installed models. | +| Name | Description | +| ---------- | -------------------------------------------------------------------- | +| **PRINTS** | Details about the compatibility of your installed pipeline packages. 
| -## Convert {#convert} +## init {#init new="3"} -Convert files into spaCy's [JSON format](/api/annotation#json-input) for use -with the `train` command and other experiment management functions. The -converter can be specified on the command line, or chosen based on the file -extension of the input file. +The `spacy init` CLI includes helpful commands for initializing training config +files and pipeline directories. -```bash -$ python -m spacy convert [input_file] [output_dir] [--file-type] [--converter] -[--n-sents] [--morphology] [--lang] +### init config {#init-config new="3" tag="command"} + +Initialize and save a [`config.cfg` file](/usage/training#config) using the +**recommended settings** for your use case. It works just like the +[quickstart widget](/usage/training#quickstart), only that it also auto-fills +all default values and exports a [training](/usage/training#config)-ready +config. The settings you specify will impact the suggested model architectures +and pipeline setup, as well as the hyperparameters. You can also adjust and +customize those settings in your config file later. + +> #### Example +> +> ```cli +> $ python -m spacy init config config.cfg --lang en --pipeline ner,textcat --optimize accuracy +> ``` + +```cli +$ python -m spacy init config [output_file] [--lang] [--pipeline] [--optimize] [--cpu] [--pretraining] ``` -| Argument | Type | Description | -| ------------------------------------------------ | ---------- | ------------------------------------------------------------------------------------------------- | -| `input_file` | positional | Input file. | -| `output_dir` | positional | Output directory for converted file. Defaults to `"-"`, meaning data will be written to `stdout`. | -| `--file-type`, `-t` 2.1 | option | Type of file to create (see below). | -| `--converter`, `-c` 2 | option | Name of converter to use (see below). | -| `--n-sents`, `-n` | option | Number of sentences per document. | -| `--seg-sents`, `-s` 2.2 | flag | Segment sentences (for `-c ner`) | -| `--model`, `-b` 2.2 | option | Model for parser-based sentence segmentation (for `-s`) | -| `--morphology`, `-m` | option | Enable appending morphology to tags. | -| `--lang`, `-l` 2.1 | option | Language code (if tokenizer required). | -| `--help`, `-h` | flag | Show help message and available arguments. | -| **CREATES** | JSON | Data in spaCy's [JSON format](/api/annotation#json-input). | +| Name | Description | +| ---------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `output_file` | Path to output `.cfg` file or `-` to write the config to stdout (so you can pipe it forward to a file). Note that if you're writing to stdout, no additional logging info is printed. ~~Path (positional)~~ | +| `--lang`, `-l` | Optional code of the [language](/usage/models#languages) to use. Defaults to `"en"`. ~~str (option)~~ | +| `--pipeline`, `-p` | Comma-separated list of trainable [pipeline components](/usage/processing-pipelines#built-in) to include. Defaults to `"tagger,parser,ner"`. ~~str (option)~~ | +| `--optimize`, `-o` | `"efficiency"` or `"accuracy"`. Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). 
This will impact the choice of architecture, pretrained weights and related hyperparameters. Defaults to `"efficiency"`. ~~str (option)~~ | +| `--cpu`, `-C` | Whether the model needs to run on CPU. This will impact the choice of architecture, pretrained weights and related hyperparameters. ~~bool (flag)~~ | +| `--pretraining`, `-pt` | Include config for pretraining (with [`spacy pretrain`](/api/cli#pretrain)). Defaults to `False`. ~~bool (flag)~~ | +| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | +| **CREATES** | The config file for training. | -### Output file types {new="2.1"} +### init fill-config {#init-fill-config new="3"} -All output files generated by this command are compatible with -[`spacy train`](/api/cli#train). +Auto-fill a partial [`config.cfg` file](/usage/training#config) file with **all +default values**, e.g. a config generated with the +[quickstart widget](/usage/training#quickstart). Config files used for training +should always be complete and not contain any hidden defaults or missing values, +so this command helps you create your final training config. In order to find +the available settings and defaults, all functions referenced in the config will +be created, and their signatures are used to find the defaults. If your config +contains a problem that can't be resolved automatically, spaCy will show you a +validation error with more details. -| ID | Description | -| ------- | -------------------------- | -| `json` | Regular JSON (default). | -| `jsonl` | Newline-delimited JSON. | -| `msg` | Binary MessagePack format. | +> #### Example +> +> ```cli +> $ python -m spacy init fill-config base.cfg config.cfg --diff +> ``` +> +> #### Example diff +> +> ![Screenshot of visual diff in terminal](../images/cli_init_fill-config_diff.jpg) -### Converter options +```cli +$ python -m spacy init fill-config [base_path] [output_file] [--diff] +``` -| ID | Description | -| ------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `auto` | Automatically pick converter based on file extension and file content (default). | -| `conll`, `conllu`, `conllubio` | Universal Dependencies `.conllu` or `.conll` format. | -| `ner` | NER with IOB/IOB2 tags, one token per line with columns separated by whitespace. The first column is the token and the final column is the IOB tag. Sentences are separated by blank lines and documents are separated by the line `-DOCSTART- -X- O O`. Supports CoNLL 2003 NER format. See [sample data](https://github.com/explosion/spaCy/tree/master/examples/training/ner_example_data). | -| `iob` | NER with IOB/IOB2 tags, one sentence per line with tokens separated by whitespace and annotation separated by `|`, either `word|B-ENT` or `word|POS|B-ENT`. See [sample data](https://github.com/explosion/spaCy/tree/master/examples/training/ner_example_data). | -| `jsonl` | NER data formatted as JSONL with one dict per line and a `"text"` and `"spans"` key. This is also the format exported by the [Prodigy](https://prodi.gy) annotation tool. See [sample data](https://raw.githubusercontent.com/explosion/projects/master/ner-fashion-brands/fashion_brands_training.jsonl). 
| +| Name | Description | +| ---------------------- | ----------------------------------------------------------------------------------------------------------------------------------- | +| `base_path` | Path to base config to fill, e.g. generated by the [quickstart widget](/usage/training#quickstart). ~~Path (positional)~~ | +| `output_file` | Path to output `.cfg` file. If not set, the config is written to stdout so you can pipe it forward to a file. ~~Path (positional)~~ | +| `--pretraining`, `-pt` | Include config for pretraining (with [`spacy pretrain`](/api/cli#pretrain)). Defaults to `False`. ~~bool (flag)~~ | +| `--diff`, `-D` | Print a visual diff highlighting the changes. ~~bool (flag)~~ | +| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | +| **CREATES** | Complete and auto-filled config file for training. | -## Debug data {#debug-data new="2.2"} +### init vectors {#init-vectors new="3" tag="command"} -Analyze, debug, and validate your training and development data. Get useful +Convert [word vectors](/usage/linguistic-features#vectors-similarity) for use +with spaCy. Will export an `nlp` object that you can use in the +[`[initialize]`](/api/data-formats#config-initialize) block of your config to +initialize a model with vectors. See the usage guide on +[static vectors](/usage/embeddings-transformers#static-vectors) for details on +how to use vectors in your model. + + + +This functionality was previously available as part of the command `init-model`. + + + +```cli +$ python -m spacy init vectors [lang] [vectors_loc] [output_dir] [--prune] [--truncate] [--name] [--verbose] +``` + +| Name | Description | +| ------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `lang` | Pipeline language [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes), e.g. `en`. ~~str (positional)~~ | +| `vectors_loc` | Location of vectors. Should be a file where the first row contains the dimensions of the vectors, followed by a space-separated Word2Vec table. File can be provided in `.txt` format or as a zipped text file in `.zip` or `.tar.gz` format. ~~Path (positional)~~ | +| `output_dir` | Pipeline output directory. Will be created if it doesn't exist. ~~Path (positional)~~ | +| `--truncate`, `-t` | Number of vectors to truncate to when reading in vectors file. Defaults to `0` for no truncation. ~~int (option)~~ | +| `--prune`, `-p` | Number of vectors to prune the vocabulary to. Defaults to `-1` for no pruning. ~~int (option)~~ | +| `--name`, `-n` | Name to assign to the word vectors in the `meta.json`, e.g. `en_core_web_md.vectors`. ~~Optional[str] \(option)~~ | +| `--verbose`, `-V` | Print additional information and explanations. ~~bool (flag)~~ | +| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | +| **CREATES** | A spaCy pipeline directory containing the vocab and vectors. | + +### init labels {#init-labels new="3" tag="command"} + +Generate JSON files for the labels in the data. This helps speed up the training +process, since spaCy won't have to preprocess the data to extract the labels. +After generating the labels, you can provide them to components that accept a +`labels` argument on initialization via the +[`[initialize]`](/api/data-formats#config-initialize) block of your config. 
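Before wiring the generated files into your config, it can be handy to peek at what `init labels` actually wrote out. The snippet below is only an illustrative sketch: it uses [`srsly`](https://github.com/explosion/srsly), the serialization helper spaCy already depends on, and assumes a `corpus/labels/ner.json` path that mirrors the example config shown next.

```python
# Sketch: inspect a label file produced by `init labels` before referencing it
# in the [initialize] block of your config. The path below is an assumption
# that matches the example config following this snippet.
import srsly

labels = srsly.read_json("corpus/labels/ner.json")
print(labels)  # the labels extracted from your training data for this component
```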
+ +> #### Example config +> +> ```ini +> [initialize.components.ner] +> +> [initialize.components.ner.labels] +> @readers = "spacy.read_labels.v1" +> path = "corpus/labels/ner.json" +> ``` + +```cli +$ python -m spacy init labels [config_path] [output_path] [--code] [--verbose] [--gpu-id] [overrides] +``` + +| Name | Description | +| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `config_path` | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. ~~Path (positional)~~ | +| `output_path` | Output directory for the label files. Will create one JSON file per component. ~~Path (positional)~~ | +| `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~ | +| `--verbose`, `-V` | Show more detailed messages during training. ~~bool (flag)~~ | +| `--gpu-id`, `-g` | GPU ID or `-1` for CPU. Defaults to `-1`. ~~int (option)~~ | +| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | +| overrides | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. ~~Any (option/flag)~~ | +| **CREATES** | The label files, one JSON file per component. | + +## convert {#convert tag="command"} + +Convert files into spaCy's +[binary training data format](/api/data-formats#binary-training), a serialized +[`DocBin`](/api/docbin), for use with the `train` command and other experiment +management functions. The converter can be specified on the command line, or +chosen based on the file extension of the input file. + +```cli +$ python -m spacy convert [input_file] [output_dir] [--converter] [--file-type] [--n-sents] [--seg-sents] [--base] [--morphology] [--merge-subtokens] [--ner-map] [--lang] +``` + +| Name | Description | +| ------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------- | +| `input_file` | Input file. ~~Path (positional)~~ | +| `output_dir` | Output directory for converted file. Defaults to `"-"`, meaning data will be written to `stdout`. ~~Optional[Path] \(positional)~~ | +| `--converter`, `-c` 2 | Name of converter to use (see below). ~~str (option)~~ | +| `--file-type`, `-t` 2.1 | Type of file to create. Either `spacy` (default) for binary [`DocBin`](/api/docbin) data or `json` for v2.x JSON format. ~~str (option)~~ | +| `--n-sents`, `-n` | Number of sentences per document. ~~int (option)~~ | +| `--seg-sents`, `-s` 2.2 | Segment sentences (for `--converter ner`). ~~bool (flag)~~ | +| `--base`, `-b` | Trained spaCy pipeline for sentence segmentation to use as base (for `--seg-sents`). ~~Optional[str] \(option)~~ | +| `--morphology`, `-m` | Enable appending morphology to tags. ~~bool (flag)~~ | +| `--ner-map`, `-nm` | NER tag mapping (as JSON-encoded dict of entity types). ~~Optional[Path] \(option)~~ | +| `--lang`, `-l` 2.1 | Language code (if tokenizer required). ~~Optional[str] \(option)~~ | +| `--help`, `-h` | Show help message and available arguments. 
~~bool (flag)~~ | +| **CREATES** | Binary [`DocBin`](/api/docbin) training data that can be used with [`spacy train`](/api/cli#train). | + +### Converters {#converters} + +| ID | Description | +| ------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `auto` | Automatically pick converter based on file extension and file content (default). | +| `json` | JSON-formatted training data used in spaCy v2.x. | +| `conll` | Universal Dependencies `.conllu` or `.conll` format. | +| `ner` | NER with IOB/IOB2 tags, one token per line with columns separated by whitespace. The first column is the token and the final column is the IOB tag. Sentences are separated by blank lines and documents are separated by the line `-DOCSTART- -X- O O`. Supports CoNLL 2003 NER format. See [sample data](%%GITHUB_SPACY/extra/example_data/ner_example_data). | +| `iob` | NER with IOB/IOB2 tags, one sentence per line with tokens separated by whitespace and annotation separated by `|`, either `word|B-ENT` or `word|POS|B-ENT`. See [sample data](%%GITHUB_SPACY/extra/example_data/ner_example_data). | + +## debug {#debug new="3"} + +The `spacy debug` CLI includes helpful commands for debugging and profiling your +configs, data and implementations. + +### debug config {#debug-config new="3" tag="command"} + +Debug a [`config.cfg` file](/usage/training#config) and show validation errors. +The command will create all objects in the tree and validate them. Note that +some config validation errors are blocking and will prevent the rest of the +config from being resolved. This means that you may not see all validation +errors at once and some issues are only shown once previous errors have been +fixed. To auto-fill a partial config and save the result, you can use the +[`init fill-config`](/api/cli#init-fill-config) command. 
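For reference, the check is roughly equivalent to loading the config yourself and constructing the pipeline from it, which is where the validation errors surface. The snippet below is a simplified sketch rather than the command's actual implementation, and the `./config.cfg` path is an assumption.

```python
# Rough sketch of what `debug config` exercises: load the config and create
# every object referenced in it, so invalid or missing values raise errors.
# "./config.cfg" is an assumed path.
from spacy import util

config = util.load_config("./config.cfg")
try:
    nlp = util.load_model_from_config(config, validate=True)
    print("Config resolved:", nlp.pipe_names)
except Exception as err:  # e.g. a config validation error for missing values
    print(err)
```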
+ +```cli +$ python -m spacy debug config [config_path] [--code] [--show-functions] [--show-variables] [overrides] +``` + +> #### Example +> +> ```cli +> $ python -m spacy debug config config.cfg +> ``` + + + +``` +✘ Config validation error +dropout field required +optimizer field required +optimize extra fields not permitted + +{'seed': 0, 'accumulate_gradient': 1, 'dev_corpus': 'corpora.dev', 'train_corpus': 'corpora.train', 'gpu_allocator': None, 'patience': 1600, 'max_epochs': 0, 'max_steps': 20000, 'eval_frequency': 200, 'frozen_components': [], 'optimize': None, 'before_to_disk': None, 'batcher': {'@batchers': 'spacy.batch_by_words.v1', 'discard_oversize': False, 'tolerance': 0.2, 'get_length': None, 'size': {'@schedules': 'compounding.v1', 'start': 100, 'stop': 1000, 'compound': 1.001, 't': 0.0}}, 'logger': {'@loggers': 'spacy.ConsoleLogger.v1', 'progress_bar': False}, 'score_weights': {'tag_acc': 0.5, 'dep_uas': 0.25, 'dep_las': 0.25, 'sents_f': 0.0}} + +If your config contains missing values, you can run the 'init fill-config' +command to fill in all the defaults, if possible: + +python -m spacy init fill-config tmp/starter-config_invalid.cfg tmp/starter-config_invalid.cfg +``` + + + + + +```cli +$ python -m spacy debug config ./config.cfg --show-functions --show-variables +``` + +``` +============================= Config validation ============================= +✔ Config is valid + +=============================== Variables (6) =============================== + +Variable Value +----------------------------------------- ---------------------------------- +${components.tok2vec.model.encode.width} 96 +${paths.dev} 'hello' +${paths.init_tok2vec} None +${paths.raw} None +${paths.train} '' +${system.seed} 0 + + +========================= Registered functions (17) ========================= +ℹ [nlp.tokenizer] +Registry @tokenizers +Name spacy.Tokenizer.v1 +Module spacy.language +File /path/to/spacy/language.py (line 64) +ℹ [components.ner.model] +Registry @architectures +Name spacy.TransitionBasedParser.v1 +Module spacy.ml.models.parser +File /path/to/spacy/ml/models/parser.py (line 11) +ℹ [components.ner.model.tok2vec] +Registry @architectures +Name spacy.Tok2VecListener.v1 +Module spacy.ml.models.tok2vec +File /path/to/spacy/ml/models/tok2vec.py (line 16) +ℹ [components.parser.model] +Registry @architectures +Name spacy.TransitionBasedParser.v1 +Module spacy.ml.models.parser +File /path/to/spacy/ml/models/parser.py (line 11) +ℹ [components.parser.model.tok2vec] +Registry @architectures +Name spacy.Tok2VecListener.v1 +Module spacy.ml.models.tok2vec +File /path/to/spacy/ml/models/tok2vec.py (line 16) +ℹ [components.tagger.model] +Registry @architectures +Name spacy.Tagger.v1 +Module spacy.ml.models.tagger +File /path/to/spacy/ml/models/tagger.py (line 9) +ℹ [components.tagger.model.tok2vec] +Registry @architectures +Name spacy.Tok2VecListener.v1 +Module spacy.ml.models.tok2vec +File /path/to/spacy/ml/models/tok2vec.py (line 16) +ℹ [components.tok2vec.model] +Registry @architectures +Name spacy.Tok2Vec.v1 +Module spacy.ml.models.tok2vec +File /path/to/spacy/ml/models/tok2vec.py (line 72) +ℹ [components.tok2vec.model.embed] +Registry @architectures +Name spacy.MultiHashEmbed.v1 +Module spacy.ml.models.tok2vec +File /path/to/spacy/ml/models/tok2vec.py (line 93) +ℹ [components.tok2vec.model.encode] +Registry @architectures +Name spacy.MaxoutWindowEncoder.v1 +Module spacy.ml.models.tok2vec +File /path/to/spacy/ml/models/tok2vec.py (line 207) +ℹ [corpora.dev] +Registry @readers +Name 
spacy.Corpus.v1 +Module spacy.training.corpus +File /path/to/spacy/training/corpus.py (line 18) +ℹ [corpora.train] +Registry @readers +Name spacy.Corpus.v1 +Module spacy.training.corpus +File /path/to/spacy/training/corpus.py (line 18) +ℹ [training.logger] +Registry @loggers +Name spacy.ConsoleLogger.v1 +Module spacy.training.loggers +File /path/to/spacy/training/loggers.py (line 8) +ℹ [training.batcher] +Registry @batchers +Name spacy.batch_by_words.v1 +Module spacy.training.batchers +File /path/to/spacy/training/batchers.py (line 49) +ℹ [training.batcher.size] +Registry @schedules +Name compounding.v1 +Module thinc.schedules +File /path/to/thinc/thinc/schedules.py (line 43) +ℹ [training.optimizer] +Registry @optimizers +Name Adam.v1 +Module thinc.optimizers +File /path/to/thinc/thinc/optimizers.py (line 58) +ℹ [training.optimizer.learn_rate] +Registry @schedules +Name warmup_linear.v1 +Module thinc.schedules +File /path/to/thinc/thinc/schedules.py (line 91) +``` + + + +| Name | Description | +| ------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `config_path` | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. ~~Path (positional)~~ | +| `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~ | +| `--show-functions`, `-F` | Show an overview of all registered function blocks used in the config and where those functions come from, including the module name, Python file and line number. ~~bool (flag)~~ | +| `--show-variables`, `-V` | Show an overview of all variables referenced in the config, e.g. `${paths.train}` and their values that will be used. This also reflects any config overrides provided on the CLI, e.g. `--paths.train /path`. ~~bool (flag)~~ | +| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | +| overrides | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. ~~Any (option/flag)~~ | +| **PRINTS** | Config validation errors, if available. | + +### debug data {#debug-data tag="command"} + +Analyze, debug and validate your training and development data. Get useful stats, and find problems like invalid entity annotations, cyclic dependencies, low data labels and more. -```bash -$ python -m spacy debug-data [lang] [train_path] [dev_path] [--base-model] [--pipeline] [--ignore-warnings] [--verbose] [--no-format] + + +The `debug data` command is now available as a subcommand of `spacy debug`. It +takes the same arguments as `train` and reads settings off the +[`config.cfg` file](/usage/training#config) and optional +[overrides](/usage/training#config-overrides) on the CLI. + + + +```cli +$ python -m spacy debug data [config_path] [--code] [--ignore-warnings] [--verbose] [--no-format] [overrides] ``` -| Argument | Type | Description | -| ------------------------------------------------------ | ---------- | -------------------------------------------------------------------------------------------------- | -| `lang` | positional | Model language. | -| `train_path` | positional | Location of JSON-formatted training data. Can be a file or a directory of files. 
| -| `dev_path` | positional | Location of JSON-formatted development data for evaluation. Can be a file or a directory of files. | -| `--tag-map-path`, `-tm` 2.2.4 | option | Location of JSON-formatted tag map. | -| `--base-model`, `-b` | option | Optional name of base model to update. Can be any loadable spaCy model. | -| `--pipeline`, `-p` | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. | -| `--ignore-warnings`, `-IW` | flag | Ignore warnings, only show stats and errors. | -| `--verbose`, `-V` | flag | Print additional information and explanations. | -| --no-format, `-NF` | flag | Don't pretty-print the results. Use this if you want to write to a file. | +> #### Example +> +> ```cli +> $ python -m spacy debug data ./config.cfg +> ``` - + ``` =========================== Data format validation =========================== ✔ Corpus is loadable +✔ Pipeline can be initialized with data =============================== Training stats =============================== Training pipeline: tagger, parser, ner @@ -230,7 +496,7 @@ New: 'ORG' (23860), 'PERSON' (21395), 'GPE' (21193), 'DATE' (18080), 'CARDINAL' ✔ No entities consisting of or starting/ending with whitespace =========================== Part-of-speech Tagging =========================== -ℹ 49 labels in data (57 labels in tag map) +ℹ 49 labels in data 'NN' (266331), 'IN' (227365), 'DT' (185600), 'NNP' (164404), 'JJ' (119830), 'NNS' (110957), '.' (101482), ',' (92476), 'RB' (90090), 'PRP' (90081), 'VB' (74538), 'VBD' (68199), 'CC' (62862), 'VBZ' (50712), 'VBP' (43420), 'VBN' @@ -241,7 +507,6 @@ New: 'ORG' (23860), 'PERSON' (21395), 'GPE' (21193), 'DATE' (18080), 'CARDINAL' '-RRB-' (2825), '-LRB-' (2788), 'PDT' (2078), 'XX' (1316), 'RBS' (1142), 'FW' (794), 'NFP' (557), 'SYM' (440), 'WP$' (294), 'LS' (293), 'ADD' (191), 'AFX' (24) -✔ All labels present in tag map for language 'en' ============================= Dependency Parsing ============================= ℹ Found 111703 sentences with an average length of 18.6 words. @@ -335,277 +600,616 @@ will not be available. -## Train {#train} +| Name | Description | +| -------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `config_path` | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. ~~Path (positional)~~ | +| `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~ | +| `--ignore-warnings`, `-IW` | Ignore warnings, only show stats and errors. ~~bool (flag)~~ | +| `--verbose`, `-V` | Print additional information and explanations. ~~bool (flag)~~ | +| `--no-format`, `-NF` | Don't pretty-print the results. Use this if you want to write to a file. ~~bool (flag)~~ | +| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | +| overrides | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. ~~Any (option/flag)~~ | +| **PRINTS** | Debugging information. | -Train a model. Expects data in spaCy's -[JSON format](/api/annotation#json-input). On each epoch, a model will be saved -out to the directory. 
Accuracy scores and model details will be added to a -[`meta.json`](/usage/training#models-generating) to allow packaging the model -using the [`package`](/api/cli#package) command. +### debug profile {#debug-profile tag="command"} - +Profile which functions take the most time in a spaCy pipeline. Input should be +formatted as one JSON object per line with a key `"text"`. It can either be +provided as a JSONL file, or be read from `sys.sytdin`. If no input file is +specified, the IMDB dataset is loaded via +[`ml_datasets`](https://github.com/explosion/ml_datasets). -As of spaCy 2.1, the `--no-tagger`, `--no-parser` and `--no-entities` flags have -been replaced by a `--pipeline` option, which lets you define comma-separated -names of pipeline components to train. For example, `--pipeline tagger,parser` -will only train the tagger and parser. + + +The `profile` command is now available as a subcommand of `spacy debug`. -```bash -$ python -m spacy train [lang] [output_path] [train_path] [dev_path] -[--base-model] [--pipeline] [--vectors] [--n-iter] [--n-early-stopping] -[--n-examples] [--use-gpu] [--version] [--meta-path] [--init-tok2vec] -[--parser-multitasks] [--entity-multitasks] [--gold-preproc] [--noise-level] -[--orth-variant-level] [--learn-tokens] [--textcat-arch] [--textcat-multilabel] -[--textcat-positive-label] [--verbose] +```cli +$ python -m spacy debug profile [model] [inputs] [--n-texts] ``` -| Argument | Type | Description | -| --------------------------------------------------------------- | ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `lang` | positional | Model language. | -| `output_path` | positional | Directory to store model in. Will be created if it doesn't exist. | -| `train_path` | positional | Location of JSON-formatted training data. Can be a file or a directory of files. | -| `dev_path` | positional | Location of JSON-formatted development data for evaluation. Can be a file or a directory of files. | -| `--base-model`, `-b` 2.1 | option | Optional name of base model to update. Can be any loadable spaCy model. | -| `--pipeline`, `-p` 2.1 | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. | -| `--replace-components`, `-R` | flag | Replace components from the base model. | -| `--vectors`, `-v` | option | Model to load vectors from. | -| `--n-iter`, `-n` | option | Number of iterations (default: `30`). | -| `--n-early-stopping`, `-ne` | option | Maximum number of training epochs without dev accuracy improvement. | -| `--n-examples`, `-ns` | option | Number of examples to use (defaults to `0` for all examples). | -| `--use-gpu`, `-g` | option | GPU ID or `-1` for CPU only (default: `-1`). | -| `--version`, `-V` | option | Model version. Will be written out to the model's `meta.json` after training. | -| `--meta-path`, `-m` 2 | option | Optional path to model [`meta.json`](/usage/training#models-generating). All relevant properties like `lang`, `pipeline` and `spacy_version` will be overwritten. | -| `--init-tok2vec`, `-t2v` 2.1 | option | Path to pretrained weights for the token-to-vector parts of the models. See `spacy pretrain`. Experimental. | -| `--parser-multitasks`, `-pt` | option | Side objectives for parser CNN, e.g. `'dep'` or `'dep,tag'` | -| `--entity-multitasks`, `-et` | option | Side objectives for NER CNN, e.g. 
`'dep'` or `'dep,tag'` | -| `--width`, `-cw` 2.2.4 | option | Width of CNN layers of `Tok2Vec` component. | -| `--conv-depth`, `-cd` 2.2.4 | option | Depth of CNN layers of `Tok2Vec` component. | -| `--cnn-window`, `-cW` 2.2.4 | option | Window size for CNN layers of `Tok2Vec` component. | -| `--cnn-pieces`, `-cP` 2.2.4 | option | Maxout size for CNN layers of `Tok2Vec` component. | -| `--use-chars`, `-chr` 2.2.4 | flag | Whether to use character-based embedding of `Tok2Vec` component. | -| `--bilstm-depth`, `-lstm` 2.2.4 | option | Depth of BiLSTM layers of `Tok2Vec` component (requires PyTorch). | -| `--embed-rows`, `-er` 2.2.4 | option | Number of embedding rows of `Tok2Vec` component. | -| `--noise-level`, `-nl` | option | Float indicating the amount of corruption for data augmentation. | -| `--orth-variant-level`, `-ovl` 2.2 | option | Float indicating the orthography variation for data augmentation (e.g. `0.3` for making 30% of occurrences of some tokens subject to replacement). | -| `--gold-preproc`, `-G` | flag | Use gold preprocessing. | -| `--learn-tokens`, `-T` | flag | Make parser learn gold-standard tokenization by merging ] subtokens. Typically used for languages like Chinese. | -| `--textcat-multilabel`, `-TML` 2.2 | flag | Text classification classes aren't mutually exclusive (multilabel). | -| `--textcat-arch`, `-ta` 2.2 | option | Text classification model architecture. Defaults to `"bow"`. | -| `--textcat-positive-label`, `-tpl` 2.2 | option | Text classification positive label for binary classes with two labels. | -| `--tag-map-path`, `-tm` 2.2.4 | option | Location of JSON-formatted tag map. | -| `--verbose`, `-VV` 2.0.13 | flag | Show more detailed messages during training. | -| `--help`, `-h` | flag | Show help message and available arguments. | -| **CREATES** | model, pickle | A spaCy model on each epoch. | +| Name | Description | +| ----------------- | ---------------------------------------------------------------------------------- | +| `model` | A loadable spaCy pipeline (package name or path). ~~str (positional)~~ | +| `inputs` | Optional path to input file, or `-` for standard input. ~~Path (positional)~~ | +| `--n-texts`, `-n` | Maximum number of texts to use if available. Defaults to `10000`. ~~int (option)~~ | +| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | +| **PRINTS** | Profiling information for the pipeline. | -### Environment variables for hyperparameters {#train-hyperparams new="2"} +### debug model {#debug-model new="3" tag="command"} -spaCy lets you set hyperparameters for training via environment variables. For -example: +Debug a Thinc [`Model`](https://thinc.ai/docs/api-model) by running it on a +sample text and checking how it updates its internal weights and parameters. -```bash -$ token_vector_width=256 learn_rate=0.0001 spacy train [...] +```cli +$ python -m spacy debug model [config_path] [component] [--layers] [--dimensions] [--parameters] [--gradients] [--attributes] [--print-step0] [--print-step1] [--print-step2] [--print-step3] [--gpu-id] ``` -> #### Usage with alias -> -> Environment variables keep the command simple and allow you to to -> [create an alias](https://askubuntu.com/questions/17536/how-do-i-create-a-permanent-bash-alias/17537#17537) -> for your custom `train` command while still being able to easily tweak the -> hyperparameters. 
-> -> ```bash -> alias train-parser="python -m spacy train en /output /data /train /dev -n 1000" -> token_vector_width=256 train-parser -> ``` + -| Name | Description | Default | -| -------------------- | --------------------------------------------------- | ------- | -| `dropout_from` | Initial dropout rate. | `0.2` | -| `dropout_to` | Final dropout rate. | `0.2` | -| `dropout_decay` | Rate of dropout change. | `0.0` | -| `batch_from` | Initial batch size. | `1` | -| `batch_to` | Final batch size. | `64` | -| `batch_compound` | Rate of batch size acceleration. | `1.001` | -| `token_vector_width` | Width of embedding tables and convolutional layers. | `128` | -| `embed_size` | Number of rows in embedding tables. | `7500` | -| `hidden_width` | Size of the parser's and NER's hidden layers. | `128` | -| `learn_rate` | Learning rate. | `0.001` | -| `optimizer_B1` | Momentum for the Adam solver. | `0.9` | -| `optimizer_B2` | Adagrad-momentum for the Adam solver. | `0.999` | -| `optimizer_eps` | Epsilon value for the Adam solver. | `1e-08` | -| `L2_penalty` | L2 regularization penalty. | `1e-06` | -| `grad_norm_clip` | Gradient L2 norm constraint. | `1.0` | +In this example log, we just print the name of each layer after creation of the +model ("Step 0"), which helps us to understand the internal structure of the +Neural Network, and to focus on specific layers that we want to inspect further +(see next example). -## Pretrain {#pretrain new="2.1" tag="experimental"} - -Pre-train the "token to vector" (`tok2vec`) layer of pipeline components, using -an approximate language-modeling objective. Specifically, we load pretrained -vectors, and train a component like a CNN, BiLSTM, etc to predict vectors which -match the pretrained ones. The weights are saved to a directory after each -epoch. You can then pass a path to one of these pretrained weights files to the -`spacy train` command. You can try to use a few with low `Loss` values reported -in the output. - -This technique may be especially helpful if you have little labelled data. -However, it's still quite experimental, so your mileage may vary. To load the -weights back in during `spacy train`, you need to ensure all settings are the -same between pretraining and training. The API and errors around this need some -improvement. - -```bash -$ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir] -[--width] [--conv-depth] [--cnn-window] [--cnn-pieces] [--use-chars] [--sa-depth] -[--embed-rows] [--loss_func] [--dropout] [--batch-size] [--max-length] -[--min-length] [--seed] [--n-iter] [--use-vectors] [--n-save-every] -[--init-tok2vec] [--epoch-start] +```cli +$ python -m spacy debug model ./config.cfg tagger -P0 ``` -| Argument | Type | Description | -| ----------------------------------------------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `texts_loc` | positional | Path to JSONL file with raw texts to learn from, with text provided as the key `"text"` or tokens as the key `"tokens"`. [See here](#pretrain-jsonl) for details. | -| `vectors_model` | positional | Name or path to spaCy model with vectors to learn from. | -| `output_dir` | positional | Directory to write models to on each epoch. | -| `--width`, `-cw` | option | Width of CNN layers. | -| `--conv-depth`, `-cd` | option | Depth of CNN layers. | -| `--cnn-window`, `-cW` 2.2.2 | option | Window size for CNN layers. 
| -| `--cnn-pieces`, `-cP` 2.2.2 | option | Maxout size for CNN layers. `1` for [Mish](https://github.com/digantamisra98/Mish). | -| `--use-chars`, `-chr` 2.2.2 | flag | Whether to use character-based embedding. | -| `--sa-depth`, `-sa` 2.2.2 | option | Depth of self-attention layers. | -| `--embed-rows`, `-er` | option | Number of embedding rows. | -| `--loss-func`, `-L` | option | Loss function to use for the objective. Either `"cosine"`, `"L2"` or `"characters"`. | -| `--dropout`, `-d` | option | Dropout rate. | -| `--batch-size`, `-bs` | option | Number of words per training batch. | -| `--max-length`, `-xw` | option | Maximum words per example. Longer examples are discarded. | -| `--min-length`, `-nw` | option | Minimum words per example. Shorter examples are discarded. | -| `--seed`, `-s` | option | Seed for random number generators. | -| `--n-iter`, `-i` | option | Number of iterations to pretrain. | -| `--use-vectors`, `-uv` | flag | Whether to use the static vectors as input features. | -| `--n-save-every`, `-se` | option | Save model every X batches. | -| `--init-tok2vec`, `-t2v` 2.1 | option | Path to pretrained weights for the token-to-vector parts of the models. See `spacy pretrain`. Experimental. | -| `--epoch-start`, `-es` 2.1.5 | option | The epoch to start counting at. Only relevant when using `--init-tok2vec` and the given weight file has been renamed. Prevents unintended overwriting of existing weight files. | -| **CREATES** | weights | The pretrained weights that can be used to initialize `spacy train`. | +``` +ℹ Using CPU +ℹ Fixing random seed: 0 +ℹ Analysing model with ID 62 -### JSONL format for raw text {#pretrain-jsonl} - -Raw text can be provided as a `.jsonl` (newline-delimited JSON) file containing -one input text per line (roughly paragraph length is good). Optionally, custom -tokenization can be provided. - -> #### Tip: Writing JSONL -> -> Our utility library [`srsly`](https://github.com/explosion/srsly) provides a -> handy `write_jsonl` helper that takes a file path and list of dictionaries and -> writes out JSONL-formatted data. -> -> ```python -> import srsly -> data = [{"text": "Some text"}, {"text": "More..."}] -> srsly.write_jsonl("/path/to/text.jsonl", data) -> ``` - -| Key | Type | Description | -| -------- | ------- | ---------------------------------------------------------- | -| `text` | unicode | The raw input text. Is not required if `tokens` available. | -| `tokens` | list | Optional tokenization, one string per token. | - -```json -### Example -{"text": "Can I ask where you work now and what you do, and if you enjoy it?"} -{"text": "They may just pull out of the Seattle market completely, at least until they have autonomous vehicles."} -{"text": "My cynical view on this is that it will never be free to the public. Reason: what would be the draw of joining the military? Right now their selling point is free Healthcare and Education. 
Ironically both are run horribly and most, that I've talked to, come out wishing they never went in."} -{"tokens": ["If", "tokens", "are", "provided", "then", "we", "can", "skip", "the", "raw", "input", "text"]} +========================== STEP 0 - before training ========================== +ℹ Layer 0: model ID 62: +'extract_features>>list2ragged>>with_array-ints-getitem>>hashembed|ints-getitem>>hashembed|ints-getitem>>hashembed|ints-getitem>>hashembed>>with_array-maxout>>layernorm>>dropout>>ragged2list>>with_array-residual>>residual>>residual>>residual>>with_array-softmax' +ℹ Layer 1: model ID 59: +'extract_features>>list2ragged>>with_array-ints-getitem>>hashembed|ints-getitem>>hashembed|ints-getitem>>hashembed|ints-getitem>>hashembed>>with_array-maxout>>layernorm>>dropout>>ragged2list>>with_array-residual>>residual>>residual>>residual' +ℹ Layer 2: model ID 61: 'with_array-softmax' +ℹ Layer 3: model ID 24: +'extract_features>>list2ragged>>with_array-ints-getitem>>hashembed|ints-getitem>>hashembed|ints-getitem>>hashembed|ints-getitem>>hashembed>>with_array-maxout>>layernorm>>dropout>>ragged2list' +ℹ Layer 4: model ID 58: 'with_array-residual>>residual>>residual>>residual' +ℹ Layer 5: model ID 60: 'softmax' +ℹ Layer 6: model ID 13: 'extract_features' +ℹ Layer 7: model ID 14: 'list2ragged' +ℹ Layer 8: model ID 16: +'with_array-ints-getitem>>hashembed|ints-getitem>>hashembed|ints-getitem>>hashembed|ints-getitem>>hashembed' +ℹ Layer 9: model ID 22: 'with_array-maxout>>layernorm>>dropout' +ℹ Layer 10: model ID 23: 'ragged2list' +ℹ Layer 11: model ID 57: 'residual>>residual>>residual>>residual' +ℹ Layer 12: model ID 15: +'ints-getitem>>hashembed|ints-getitem>>hashembed|ints-getitem>>hashembed|ints-getitem>>hashembed' +ℹ Layer 13: model ID 21: 'maxout>>layernorm>>dropout' +ℹ Layer 14: model ID 32: 'residual' +ℹ Layer 15: model ID 40: 'residual' +ℹ Layer 16: model ID 48: 'residual' +ℹ Layer 17: model ID 56: 'residual' +ℹ Layer 18: model ID 3: 'ints-getitem>>hashembed' +ℹ Layer 19: model ID 6: 'ints-getitem>>hashembed' +ℹ Layer 20: model ID 9: 'ints-getitem>>hashembed' +... ``` -## Init Model {#init-model new="2"} +In this example log, we see how initialization of the model (Step 1) propagates +the correct values for the `nI` (input) and `nO` (output) dimensions of the +various layers. In the `softmax` layer, this step also defines the `W` matrix as +an all-zero matrix determined by the `nO` and `nI` dimensions. After a first +training step (Step 2), this matrix has clearly updated its values through the +training feedback loop. -Create a new model directory from raw data, like word frequencies, Brown -clusters and word vectors. This command is similar to the `spacy model` command -in v1.x. Note that in order to populate the model's vocab, you need to pass in a -JSONL-formatted [vocabulary file](<(/api/annotation#vocab-jsonl)>) as -`--jsonl-loc` with optional `id` values that correspond to the vectors table. -Just loading in vectors will not automatically populate the vocab. +```cli +$ python -m spacy debug model ./config.cfg tagger -l "5,15" -DIM -PAR -P0 -P1 -P2 +``` - +``` +ℹ Using CPU +ℹ Fixing random seed: 0 +ℹ Analysing model with ID 62 -As of v2.1.0, the `--freqs-loc` and `--clusters-loc` are deprecated and have -been replaced with the `--jsonl-loc` argument, which lets you pass in a a -[JSONL](http://jsonlines.org/) file containing one lexical entry per line. For -more details on the format, see the -[annotation specs](/api/annotation#vocab-jsonl). 
+========================= STEP 0 - before training ========================= +ℹ Layer 5: model ID 60: 'softmax' +ℹ - dim nO: None +ℹ - dim nI: 96 +ℹ - param W: None +ℹ - param b: None +ℹ Layer 15: model ID 40: 'residual' +ℹ - dim nO: None +ℹ - dim nI: None + +======================= STEP 1 - after initialization ======================= +ℹ Layer 5: model ID 60: 'softmax' +ℹ - dim nO: 4 +ℹ - dim nI: 96 +ℹ - param W: (4, 96) - sample: [0. 0. 0. 0. 0.] +ℹ - param b: (4,) - sample: [0. 0. 0. 0.] +ℹ Layer 15: model ID 40: 'residual' +ℹ - dim nO: 96 +ℹ - dim nI: None + +========================== STEP 2 - after training ========================== +ℹ Layer 5: model ID 60: 'softmax' +ℹ - dim nO: 4 +ℹ - dim nI: 96 +ℹ - param W: (4, 96) - sample: [ 0.00283958 -0.00294119 0.00268396 -0.00296219 +-0.00297141] +ℹ - param b: (4,) - sample: [0.00300002 0.00300002 0.00300002 0.00300002] +ℹ Layer 15: model ID 40: 'residual' +ℹ - dim nO: 96 +ℹ - dim nI: None +``` + + + +| Name | Description | +| ----------------------- | --------------------------------------------------------------------------------------------------------------------------- | +| `config_path` | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. ~~Path (positional)~~ | +| `component` | Name of the pipeline component of which the model should be analyzed. ~~str (positional)~~ | +| `--layers`, `-l` | Comma-separated names of layer IDs to print. ~~str (option)~~ | +| `--dimensions`, `-DIM` | Show dimensions of each layer. ~~bool (flag)~~ | +| `--parameters`, `-PAR` | Show parameters of each layer. ~~bool (flag)~~ | +| `--gradients`, `-GRAD` | Show gradients of each layer. ~~bool (flag)~~ | +| `--attributes`, `-ATTR` | Show attributes of each layer. ~~bool (flag)~~ | +| `--print-step0`, `-P0` | Print model before training. ~~bool (flag)~~ | +| `--print-step1`, `-P1` | Print model after initialization. ~~bool (flag)~~ | +| `--print-step2`, `-P2` | Print model after training. ~~bool (flag)~~ | +| `--print-step3`, `-P3` | Print final predictions. ~~bool (flag)~~ | +| `--gpu-id`, `-g` | GPU ID or `-1` for CPU. Defaults to `-1`. ~~int (option)~~ | +| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | +| **PRINTS** | Debugging information. | + +## train {#train tag="command"} + +Train a pipeline. Expects data in spaCy's +[binary format](/api/data-formats#training) and a +[config file](/api/data-formats#config) with all settings and hyperparameters. +Will save out the best model from all epochs, as well as the final pipeline. The +`--code` argument can be used to provide a Python file that's imported before +the training process starts. This lets you register +[custom functions](/usage/training#custom-functions) and architectures and refer +to them in your config, all while still using spaCy's built-in `train` workflow. +If you need to manage complex multi-step training workflows, check out the new +[spaCy projects](/usage/projects). + + + +The `train` command doesn't take a long list of command-line arguments anymore +and instead expects a single [`config.cfg` file](/usage/training#config) +containing all settings for the pipeline, training process and hyperparameters. +Config values can be [overwritten](/usage/training#config-overrides) on the CLI +if needed. For example, `--paths.train ./train.spacy` sets the variable `train` +in the section `[paths]`. 
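As a rough illustration of how such dotted overrides map onto the config, the sketch below applies the same override when loading a config programmatically; the `config.cfg` path is an assumption, and this is not the command's internals.

```python
# Sketch: CLI overrides like `--paths.train ./train.spacy` are dotted
# "section.key" names applied on top of the loaded config.
# "config.cfg" is an assumed path.
from spacy import util

config = util.load_config("config.cfg", overrides={"paths.train": "./train.spacy"})
print(config["paths"]["train"])  # "./train.spacy"
```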
-```bash -$ python -m spacy init-model [lang] [output_dir] [--jsonl-loc] [--vectors-loc] -[--prune-vectors] +```cli +$ python -m spacy train [config_path] [--output] [--code] [--verbose] [--gpu-id] [overrides] ``` -| Argument | Type | Description | -| ----------------------------------------------------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| `lang` | positional | Model language [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes), e.g. `en`. | -| `output_dir` | positional | Model output directory. Will be created if it doesn't exist. | -| `--jsonl-loc`, `-j` | option | Optional location of JSONL-formatted [vocabulary file](/api/annotation#vocab-jsonl) with lexical attributes. | -| `--vectors-loc`, `-v` | option | Optional location of vectors. Should be a file where the first row contains the dimensions of the vectors, followed by a space-separated Word2Vec table. File can be provided in `.txt` format or as a zipped text file in `.zip` or `.tar.gz` format. | -| `--truncate-vectors`, `-t` 2.3 | option | Number of vectors to truncate to when reading in vectors file. Defaults to `0` for no truncation. | -| `--prune-vectors`, `-V` | option | Number of vectors to prune the vocabulary to. Defaults to `-1` for no pruning. | -| `--vectors-name`, `-vn` | option | Name to assign to the word vectors in the `meta.json`, e.g. `en_core_web_md.vectors`. | -| `--omit-extra-lookups`, `-OEL` 2.3 | flag | Do not include any of the extra lookups tables (`cluster`/`prob`/`sentiment`) from `spacy-lookups-data` in the model. | -| **CREATES** | model | A spaCy model containing the vocab and vectors. | +| Name | Description | +| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `config_path` | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. ~~Path (positional)~~ | +| `--output`, `-o` | Directory to store trained pipeline in. Will be created if it doesn't exist. ~~Optional[Path] \(positional)~~ | +| `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~ | +| `--verbose`, `-V` | Show more detailed messages during training. ~~bool (flag)~~ | +| `--gpu-id`, `-g` | GPU ID or `-1` for CPU. Defaults to `-1`. ~~int (option)~~ | +| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | +| overrides | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. ~~Any (option/flag)~~ | +| **CREATES** | The final trained pipeline and the best trained pipeline. | -## Evaluate {#evaluate new="2"} +## pretrain {#pretrain new="2.1" tag="command,experimental"} -Evaluate a model's accuracy and speed on JSON-formatted annotated data. Will -print the results and optionally export -[displaCy visualizations](/usage/visualizers) of a sample set of parses to -`.html` files. 
Visualizations for the dependency parse and NER will be exported -as separate files if the respective component is present in the model's -pipeline. +Pretrain the "token to vector" ([`Tok2vec`](/api/tok2vec)) layer of pipeline +components on [raw text](/api/data-formats#pretrain), using an approximate +language-modeling objective. Specifically, we load pretrained vectors, and train +a component like a CNN, BiLSTM, etc to predict vectors which match the +pretrained ones. The weights are saved to a directory after each epoch. You can +then include a **path to one of these pretrained weights files** in your +[training config](/usage/training#config) as the `init_tok2vec` setting when you +train your pipeline. This technique may be especially helpful if you have little +labelled data. See the usage docs on +[pretraining](/usage/embeddings-transformers#pretraining) for more info. -```bash -$ python -m spacy evaluate [model] [data_path] [--displacy-path] [--displacy-limit] -[--gpu-id] [--gold-preproc] [--return-scores] + + +As of spaCy v3.0, the `pretrain` command takes the same +[config file](/usage/training#config) as the `train` command. This ensures that +settings are consistent between pretraining and training. Settings for +pretraining can be defined in the `[pretraining]` block of the config file and +auto-generated by setting `--pretraining` on +[`init fill-config`](/api/cli#init-fill-config). Also see the +[data format](/api/data-formats#config) for details. + + + +```cli +$ python -m spacy pretrain [config_path] [output_dir] [--code] [--resume-path] [--epoch-resume] [--gpu-id] [overrides] ``` -| Argument | Type | Description | -| ------------------------- | -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `model` | positional | Model to evaluate. Can be a package or shortcut link name, or a path to a model data directory. | -| `data_path` | positional | Location of JSON-formatted evaluation data. | -| `--displacy-path`, `-dp` | option | Directory to output rendered parses as HTML. If not set, no visualizations will be generated. | -| `--displacy-limit`, `-dl` | option | Number of parses to generate per file. Defaults to `25`. Keep in mind that a significantly higher number might cause the `.html` files to render slowly. | -| `--gpu-id`, `-g` | option | GPU to use, if any. Defaults to `-1` for CPU. | -| `--gold-preproc`, `-G` | flag | Use gold preprocessing. | -| `--return-scores`, `-R` | flag | Return dict containing model scores. | -| **CREATES** | `stdout`, HTML | Training results and optional displaCy visualizations. | +| Name | Description | +| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `config_path` | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. ~~Path (positional)~~ | +| `output_dir` | Directory to save binary weights to on each epoch. ~~Path (positional)~~ | +| `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~ | +| `--resume-path`, `-r` | Path to pretrained weights from which to resume pretraining. 
~~Optional[Path] \(option)~~ | +| `--epoch-resume`, `-er` | The epoch to resume counting from when using `--resume-path`. Prevents unintended overwriting of existing weight files. ~~Optional[int] \(option)~~ | +| `--gpu-id`, `-g` | GPU ID or `-1` for CPU. Defaults to `-1`. ~~int (option)~~ | +| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | +| overrides | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--training.dropout 0.2`. ~~Any (option/flag)~~ | +| **CREATES** | The pretrained weights that can be used to initialize `spacy train`. | -## Package {#package} +## evaluate {#evaluate new="2" tag="command"} -Generate a [model Python package](/usage/training#models-generating) from an -existing model data directory. All data files are copied over. If the path to a -`meta.json` is supplied, or a `meta.json` is found in the input directory, this -file is used. Otherwise, the data can be entered directly from the command line. -After packaging, you can run `python setup.py sdist` from the newly created -directory to turn your model into an installable archive file. +Evaluate a trained pipeline. Expects a loadable spaCy pipeline (package name or +path) and evaluation data in the +[binary `.spacy` format](/api/data-formats#binary-training). The +`--gold-preproc` option sets up the evaluation examples with gold-standard +sentences and tokens for the predictions. Gold preprocessing helps the +annotations align to the tokenization, and may result in sequences of more +consistent length. However, it may reduce runtime accuracy due to train/test +skew. To render a sample of dependency parses in a HTML file using the +[displaCy visualizations](/usage/visualizers), set as output directory as the +`--displacy-path` argument. -```bash -$ python -m spacy package [input_dir] [output_dir] [--meta-path] [--create-meta] [--force] +```cli +$ python -m spacy evaluate [model] [data_path] [--output] [--gold-preproc] [--gpu-id] [--displacy-path] [--displacy-limit] ``` -```bash -### Example -python -m spacy package /input /output -cd /output/en_model-0.0.0 -python setup.py sdist -pip install dist/en_model-0.0.0.tar.gz +| Name | Description | +| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `model` | Pipeline to evaluate. Can be a package or a path to a data directory. ~~str (positional)~~ | +| `data_path` | Location of evaluation data in spaCy's [binary format](/api/data-formats#training). ~~Path (positional)~~ | +| `--output`, `-o` | Output JSON file for metrics. If not set, no metrics will be exported. ~~Optional[Path] \(option)~~ | +| `--code-path`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~ | +| `--gold-preproc`, `-G` | Use gold preprocessing. ~~bool (flag)~~ | +| `--gpu-id`, `-g` | GPU to use, if any. Defaults to `-1` for CPU. ~~int (option)~~ | +| `--displacy-path`, `-dp` | Directory to output rendered parses as HTML. If not set, no visualizations will be generated. ~~Optional[Path] \(option)~~ | +| `--displacy-limit`, `-dl` | Number of parses to generate per file. Defaults to `25`. Keep in mind that a significantly higher number might cause the `.html` files to render slowly. 
~~int (option)~~ | +| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | +| **CREATES** | Training results and optional metrics and visualizations. | + +## package {#package tag="command"} + +Generate an installable [Python package](/usage/training#models-generating) from +an existing pipeline data directory. All data files are copied over. If the path +to a [`meta.json`](/api/data-formats#meta) is supplied, or a `meta.json` is +found in the input directory, this file is used. Otherwise, the data can be +entered directly from the command line. spaCy will then create a `.tar.gz` +archive file that you can distribute and install with `pip install`. + + + +The `spacy package` command now also builds the `.tar.gz` archive automatically, +so you don't have to run `python setup.py sdist` separately anymore. To disable +this, you can set the `--no-sdist` flag. + + + +```cli +$ python -m spacy package [input_dir] [output_dir] [--meta-path] [--create-meta] [--no-sdist] [--name] [--version] [--force] ``` -| Argument | Type | Description | -| ------------------------------------------------ | ---------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `input_dir` | positional | Path to directory containing model data. | -| `output_dir` | positional | Directory to create package folder in. | -| `--meta-path`, `-m` 2 | option | Path to `meta.json` file (optional). | -| `--create-meta`, `-c` 2 | flag | Create a `meta.json` file on the command line, even if one already exists in the directory. If an existing file is found, its entries will be shown as the defaults in the command line prompt. | -| `--force`, `-f` | flag | Force overwriting of existing folder in output directory. | -| `--help`, `-h` | flag | Show help message and available arguments. | -| **CREATES** | directory | A Python package containing the spaCy model. | +> #### Example +> +> ```cli +> $ python -m spacy package /input /output +> $ cd /output/en_pipeline-0.0.0 +> $ pip install dist/en_pipeline-0.0.0.tar.gz +> ``` + +| Name | Description | +| ------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `input_dir` | Path to directory containing pipeline data. ~~Path (positional)~~ | +| `output_dir` | Directory to create package folder in. ~~Path (positional)~~ | +| `--meta-path`, `-m` 2 | Path to [`meta.json`](/api/data-formats#meta) file (optional). ~~Optional[Path] \(option)~~ | +| `--create-meta`, `-C` 2 | Create a `meta.json` file on the command line, even if one already exists in the directory. If an existing file is found, its entries will be shown as the defaults in the command line prompt. ~~bool (flag)~~ | +| `--no-sdist`, `-NS`, | Don't build the `.tar.gz` sdist automatically. Can be set if you want to run this step manually. ~~bool (flag)~~ | +| `--name`, `-n` 3 | Package name to override in meta. ~~Optional[str] \(option)~~ | +| `--version`, `-v` 3 | Package version to override in meta. Useful when training new versions, as it doesn't require editing the meta template. ~~Optional[str] \(option)~~ | +| `--force`, `-f` | Force overwriting of existing folder in output directory. 
~~bool (flag)~~ | +| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | +| **CREATES** | A Python package containing the spaCy pipeline. | + +## project {#project new="3"} + +The `spacy project` CLI includes subcommands for working with +[spaCy projects](/usage/projects), end-to-end workflows for building and +deploying custom spaCy pipelines. + +### project clone {#project-clone tag="command"} + +Clone a project template from a Git repository. Calls into `git` under the hood +and can use the sparse checkout feature if available, so you're only downloading +what you need. By default, spaCy's +[project templates repo](https://github.com/explosion/projects) is used, but you +can provide any other repo (public or private) that you have access to using the +`--repo` option. + +```cli +$ python -m spacy project clone [name] [dest] [--repo] [--branch] [--sparse] +``` + +> #### Example +> +> ```cli +> $ python -m spacy project clone pipelines/ner_wikiner +> ``` +> +> Clone from custom repo: +> +> ```cli +> $ python -m spacy project clone template --repo https://github.com/your_org/your_repo +> ``` + +| Name | Description | +| ---------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | The name of the template to clone, relative to the repo. Can be a top-level directory or a subdirectory like `dir/template`. ~~str (positional)~~ | +| `dest` | Where to clone the project. Defaults to current working directory. ~~Path (positional)~~ | +| `--repo`, `-r` | The repository to clone from. Can be any public or private Git repo you have access to. ~~str (option)~~ | +| `--branch`, `-b` | The branch to clone from. Defaults to `master`. ~~str (option)~~ | +| `--sparse`, `-S` | Enable [sparse checkout](https://git-scm.com/docs/git-sparse-checkout) to only check out and download what's needed. Requires Git v22.2+. ~~bool (flag)~~ | +| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | +| **CREATES** | The cloned [project directory](/usage/projects#project-files). | + +### project assets {#project-assets tag="command"} + +Fetch project assets like datasets and pretrained weights. Assets are defined in +the `assets` section of the [`project.yml`](/usage/projects#project-yml). If a +`checksum` is provided, the file is only downloaded if no local file with the +same checksum exists and spaCy will show an error if the checksum of the +downloaded file doesn't match. If assets don't specify a `url` they're +considered "private" and you have to take care of putting them into the +destination directory yourself. If a local path is provided, the asset is copied +into the current project. + +```cli +$ python -m spacy project assets [project_dir] +``` + +> #### Example +> +> ```cli +> $ python -m spacy project assets [--sparse] +> ``` + +| Name | Description | +| ---------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `project_dir` | Path to project directory. Defaults to current working directory. ~~Path (positional)~~ | +| `--sparse`, `-S` | Enable [sparse checkout](https://git-scm.com/docs/git-sparse-checkout) to only check out and download what's needed. Requires Git v22.2+. ~~bool (flag)~~ | +| `--help`, `-h` | Show help message and available arguments. 
~~bool (flag)~~ | +| **CREATES** | Downloaded or copied assets defined in the `project.yml`. | + +### project run {#project-run tag="command"} + +Run a named command or workflow defined in the +[`project.yml`](/usage/projects#project-yml). If a workflow name is specified, +all commands in the workflow are run, in order. If commands define +[dependencies or outputs](/usage/projects#deps-outputs), they will only be +re-run if state has changed. For example, if the input dataset changes, a +preprocessing command that depends on those files will be re-run. + +```cli +$ python -m spacy project run [subcommand] [project_dir] [--force] [--dry] +``` + +> #### Example +> +> ```cli +> $ python -m spacy project run train +> ``` + +| Name | Description | +| --------------- | --------------------------------------------------------------------------------------- | +| `subcommand` | Name of the command or workflow to run. ~~str (positional)~~ | +| `project_dir` | Path to project directory. Defaults to current working directory. ~~Path (positional)~~ | +| `--force`, `-F` | Force re-running steps, even if nothing changed. ~~bool (flag)~~ | +| `--dry`, `-D` |  Perform a dry run and don't execute scripts. ~~bool (flag)~~ | +| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | +| **EXECUTES** | The command defined in the `project.yml`. | + +### project push {#project-push tag="command"} + +Upload all available files or directories listed as in the `outputs` section of +commands to a remote storage. Outputs are archived and compressed prior to +upload, and addressed in the remote storage using the output's relative path +(URL encoded), a hash of its command string and dependencies, and a hash of its +file contents. This means `push` should **never overwrite** a file in your +remote. If all the hashes match, the contents are the same and nothing happens. +If the contents are different, the new version of the file is uploaded. Deleting +obsolete files is left up to you. + +Remotes can be defined in the `remotes` section of the +[`project.yml`](/usage/projects#project-yml). Under the hood, spaCy uses the +[`smart-open`](https://github.com/RaRe-Technologies/smart_open) library to +communicate with the remote storages, so you can use any protocol that +`smart-open` supports, including [S3](https://aws.amazon.com/s3/), +[Google Cloud Storage](https://cloud.google.com/storage), SSH and more, although +you may need to install extra dependencies to use certain protocols. + +```cli +$ python -m spacy project push [remote] [project_dir] +``` + +> #### Example +> +> ```cli +> $ python -m spacy project push my_bucket +> ``` +> +> ```yaml +> ### project.yml +> remotes: +> my_bucket: 's3://my-spacy-bucket' +> ``` + +| Name | Description | +| -------------- | --------------------------------------------------------------------------------------- | +| `remote` | The name of the remote to upload to. Defaults to `"default"`. ~~str (positional)~~ | +| `project_dir` | Path to project directory. Defaults to current working directory. ~~Path (positional)~~ | +| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | +| **UPLOADS** | All project outputs that exist and are not already stored in the remote. | + +### project pull {#project-pull tag="command"} + +Download all files or directories listed as `outputs` for commands, unless they +are not already present locally. 
When searching for files in the remote, `pull` +won't just look at the output path, but will also consider the **command +string** and the **hashes of the dependencies**. For instance, let's say you've +previously pushed a checkpoint to the remote, but now you've changed some +hyper-parameters. Because you've changed the inputs to the command, if you run +`pull`, you won't retrieve the stale result. If you train your pipeline and push +the outputs to the remote, the outputs will be saved alongside the prior +outputs, so if you change the config back, you'll be able to fetch back the +result. + +Remotes can be defined in the `remotes` section of the +[`project.yml`](/usage/projects#project-yml). Under the hood, spaCy uses the +[`smart-open`](https://github.com/RaRe-Technologies/smart_open) library to +communicate with the remote storages, so you can use any protocol that +`smart-open` supports, including [S3](https://aws.amazon.com/s3/), +[Google Cloud Storage](https://cloud.google.com/storage), SSH and more, although +you may need to install extra dependencies to use certain protocols. + +```cli +$ python -m spacy project pull [remote] [project_dir] +``` + +> #### Example +> +> ```cli +> $ python -m spacy project pull my_bucket +> ``` +> +> ```yaml +> ### project.yml +> remotes: +> my_bucket: 's3://my-spacy-bucket' +> ``` + +| Name | Description | +| -------------- | --------------------------------------------------------------------------------------- | +| `remote` | The name of the remote to download from. Defaults to `"default"`. ~~str (positional)~~ | +| `project_dir` | Path to project directory. Defaults to current working directory. ~~Path (positional)~~ | +| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | +| **DOWNLOADS** | All project outputs that do not exist locally and can be found in the remote. | + +### project document {#project-document tag="command"} + +Auto-generate a pretty Markdown-formatted `README` for your project, based on +its [`project.yml`](/usage/projects#project-yml). Will create sections that +document the available commands, workflows and assets. The auto-generated +content will be placed between two hidden markers, so you can add your own +custom content before or after the auto-generated documentation. When you re-run +the `project document` command, only the auto-generated part is replaced. + +```cli +$ python -m spacy project document [project_dir] [--output] [--no-emoji] +``` + +> #### Example +> +> ```cli +> $ python -m spacy project document --output README.md +> ``` + + + +For more examples, see the templates in our +[`projects`](https://github.com/explosion/projects) repo. + +![Screenshot of auto-generated Markdown Readme](../images/project_document.jpg) + + + +| Name | Description | +| -------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `project_dir` | Path to project directory. Defaults to current working directory. ~~Path (positional)~~ | +| `--output`, `-o` | Path to output file or `-` for stdout (default). If a file is specified and it already exists and contains auto-generated docs, only the auto-generated docs section is replaced. ~~Path (positional)~~ | +|  `--no-emoji`, `-NE` | Don't use emoji in the titles. ~~bool (flag)~~ | +| **CREATES** | The Markdown-formatted project documentation. 
| + +### project dvc {#project-dvc tag="command"} + +Auto-generate [Data Version Control](https://dvc.org) (DVC) config file. Calls +[`dvc run`](https://dvc.org/doc/command-reference/run) with `--no-exec` under +the hood to generate the `dvc.yaml`. A DVC project can only define one pipeline, +so you need to specify one workflow defined in the +[`project.yml`](/usage/projects#project-yml). If no workflow is specified, the +first defined workflow is used. The DVC config will only be updated if the +`project.yml` changed. For details, see the +[DVC integration](/usage/projects#dvc) docs. + + + +This command requires DVC to be installed and initialized in the project +directory, e.g. via [`dvc init`](https://dvc.org/doc/command-reference/init). +You'll also need to add the assets you want to track with +[`dvc add`](https://dvc.org/doc/command-reference/add). + + + +```cli +$ python -m spacy project dvc [project_dir] [workflow] [--force] [--verbose] +``` + +> #### Example +> +> ```cli +> $ git init +> $ dvc init +> $ python -m spacy project dvc all +> ``` + +| Name | Description | +| ----------------- | ----------------------------------------------------------------------------------------------------------------- | +| `project_dir` | Path to project directory. Defaults to current working directory. ~~Path (positional)~~ | +| `workflow` | Name of workflow defined in `project.yml`. Defaults to first workflow if not set. ~~Optional[str] \(positional)~~ | +| `--force`, `-F` | Force-updating config file. ~~bool (flag)~~ | +| `--verbose`, `-V` |  Print more output generated by DVC. ~~bool (flag)~~ | +| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | +| **CREATES** | A `dvc.yaml` file in the project directory, based on the steps defined in the given workflow. | + +## ray {#ray new="3"} + +The `spacy ray` CLI includes commands for parallel and distributed computing via +[Ray](https://ray.io). + + + +To use this command, you need the +[`spacy-ray`](https://github.com/explosion/spacy-ray) package installed. +Installing the package will automatically add the `ray` command to the spaCy +CLI. + + + +### ray train {#ray-train tag="command"} + +Train a spaCy pipeline using [Ray](https://ray.io) for parallel training. The +command works just like [`spacy train`](/api/cli#train). For more details and +examples, see the usage guide on +[parallel training](/usage/training#parallel-training) and the spaCy project +[integration](/usage/projects#ray). + +```cli +$ python -m spacy ray train [config_path] [--code] [--output] [--n-workers] [--address] [--gpu-id] [--verbose] [overrides] +``` + +> #### Example +> +> ```cli +> $ python -m spacy ray train config.cfg --n-workers 2 +> ``` + +| Name | Description | +| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `config_path` | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. ~~Path (positional)~~ | +| `--code`, `-c` | Path to Python file with additional code to be imported. Allows [registering custom functions](/usage/training#custom-functions) for new architectures. ~~Optional[Path] \(option)~~ | +| `--output`, `-o` | Directory or remote storage URL for saving trained pipeline. The directory will be created if it doesn't exist. ~~Optional[Path] \(positional)~~ | +| `--n-workers`, `-n` | The number of workers. 
Defaults to `1`. ~~int (option)~~ | +| `--address`, `-a` | Optional address of the Ray cluster. If not set (default), Ray will run locally. ~~Optional[str] \(option)~~ | +| `--gpu-id`, `-g` | GPU ID or `-1` for CPU. Defaults to `-1`. ~~int (option)~~ | +| `--verbose`, `-V` | Display more information for debugging purposes. ~~bool (flag)~~ | +| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | +| overrides | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. ~~Any (option/flag)~~ | diff --git a/website/docs/api/corpus.md b/website/docs/api/corpus.md new file mode 100644 index 000000000..986c6f458 --- /dev/null +++ b/website/docs/api/corpus.md @@ -0,0 +1,177 @@ +--- +title: Corpus +teaser: An annotated corpus +tag: class +source: spacy/training/corpus.py +new: 3 +--- + +This class manages annotated corpora and can be used for training and +development datasets in the [`DocBin`](/api/docbin) (`.spacy`) format. To +customize the data loading during training, you can register your own +[data readers and batchers](/usage/training#custom-code-readers-batchers). Also +see the usage guide on [data utilities](/usage/training#data) for more details +and examples. + +## Config and implementation {#config} + +`spacy.Corpus.v1` is a registered function that creates a `Corpus` of training +or evaluation data. It takes the same arguments as the `Corpus` class and +returns a callable that yields [`Example`](/api/example) objects. You can +replace it with your own registered function in the +[`@readers` registry](/api/top-level#registry) to customize the data loading and +streaming. + +> #### Example config +> +> ```ini +> [paths] +> train = "corpus/train.spacy" +> +> [corpora.train] +> @readers = "spacy.Corpus.v1" +> path = ${paths.train} +> gold_preproc = false +> max_length = 0 +> limit = 0 +> augmenter = null +> ``` + +| Name | Description | +| --------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `path` | The directory or filename to read from. Expects data in spaCy's binary [`.spacy` format](/api/data-formats#binary-training). ~~Path~~ | +|  `gold_preproc` | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See [`Corpus`](/api/corpus#init) for details. ~~bool~~ | +| `max_length` | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~ | +| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~ | +| `augmenter` | Apply some simply data augmentation, where we replace tokens with variations. This is especially useful for punctuation and case replacement, to help generalize beyond corpora that don't have smart-quotes, or only have smart quotes, etc. Defaults to `None`. ~~Optional[Callable]~~ | + +```python +%%GITHUB_SPACY/spacy/training/corpus.py +``` + +## Corpus.\_\_init\_\_ {#init tag="method"} + +Create a `Corpus` for iterating [Example](/api/example) objects from a file or +directory of [`.spacy` data files](/api/data-formats#binary-training). 
The +`gold_preproc` setting lets you specify whether to set up the `Example` object +with gold-standard sentences and tokens for the predictions. Gold preprocessing +helps the annotations align to the tokenization, and may result in sequences of +more consistent length. However, it may reduce runtime accuracy due to +train/test skew. + +> #### Example +> +> ```python +> from spacy.training import Corpus +> +> # With a single file +> corpus = Corpus("./data/train.spacy") +> +> # With a directory +> corpus = Corpus("./data", limit=10) +> ``` + +| Name | Description | +| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- | +| `path` | The directory or filename to read from. ~~Union[str, Path]~~ | +| _keyword-only_ | | +|  `gold_preproc` | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. Defaults to `False`. ~~bool~~ | +| `max_length` | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~ | +| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~ | +| `augmenter` | Optional data augmentation callback. ~~Callable[[Language, Example], Iterable[Example]]~~ | + +## Corpus.\_\_call\_\_ {#call tag="method"} + +Yield examples from the data. + +> #### Example +> +> ```python +> from spacy.training import Corpus +> import spacy +> +> corpus = Corpus("./train.spacy") +> nlp = spacy.blank("en") +> train_data = corpus(nlp) +> ``` + +| Name | Description | +| ---------- | -------------------------------------- | +| `nlp` | The current `nlp` object. ~~Language~~ | +| **YIELDS** | The examples. ~~Example~~ | + +## JsonlCorpus {#jsonlcorpus tag="class"} + +Iterate Doc objects from a file or directory of JSONL (newline-delimited JSON) +formatted raw text files. Can be used to read the raw text corpus for language +model [pretraining](/usage/embeddings-transformers#pretraining) from a JSONL +file. + +> #### Tip: Writing JSONL +> +> Our utility library [`srsly`](https://github.com/explosion/srsly) provides a +> handy `write_jsonl` helper that takes a file path and list of dictionaries and +> writes out JSONL-formatted data. +> +> ```python +> import srsly +> data = [{"text": "Some text"}, {"text": "More..."}] +> srsly.write_jsonl("/path/to/text.jsonl", data) +> ``` + +```json +### Example +{"text": "Can I ask where you work now and what you do, and if you enjoy it?"} +{"text": "They may just pull out of the Seattle market completely, at least until they have autonomous vehicles."} +{"text": "My cynical view on this is that it will never be free to the public. Reason: what would be the draw of joining the military? Right now their selling point is free Healthcare and Education. Ironically both are run horribly and most, that I've talked to, come out wishing they never went in."} +``` + +### JsonlCorpus.\_\init\_\_ {#jsonlcorpus tag="method"} + +Initialize the reader. 
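As a rough end-to-end sketch, the snippet below ties together the `write_jsonl` tip above and the reader: it writes a couple of raw-text records to a JSONL file and then streams them back in as `Example` objects. The file path, sample texts and length limits are placeholders chosen for illustration, not spaCy defaults.

```python
import spacy
import srsly
from spacy.training import JsonlCorpus

# Write a few raw-text records to newline-delimited JSON (placeholder path)
data = [{"text": "A short doc."}, {"text": "A slightly longer raw text for pretraining."}]
srsly.write_jsonl("./texts.jsonl", data)

# Stream the texts back in, skipping documents outside the configured length range
nlp = spacy.blank("en")
corpus = JsonlCorpus("./texts.jsonl", min_length=2, max_length=500)
for example in corpus(nlp):
    print(example.reference.text)
```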
+ +> #### Example +> +> ```python +> from spacy.training import JsonlCorpus +> +> corpus = JsonlCorpus("./data/texts.jsonl") +> ``` +> +> ```ini +> ### Example config +> [corpora.pretrain] +> @readers = "spacy.JsonlCorpus.v1" +> path = "corpus/raw_text.jsonl" +> min_length = 0 +> max_length = 0 +> limit = 0 +> ``` + +| Name | Description | +| -------------- | -------------------------------------------------------------------------------------------------------------------------------- | +| `path` | The directory or filename to read from. Expects newline-delimited JSON with a key `"text"` for each record. ~~Union[str, Path]~~ | +| _keyword-only_ | | +| `min_length` | Minimum document length (in tokens). Shorter documents will be skipped. Defaults to `0`, which indicates no limit. ~~int~~ | +| `max_length` | Maximum document length (in tokens). Longer documents will be skipped. Defaults to `0`, which indicates no limit. ~~int~~ | +| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~ | + +### JsonlCorpus.\_\_call\_\_ {#jsonlcorpus-call tag="method"} + +Yield examples from the data. + +> #### Example +> +> ```python +> from spacy.training import JsonlCorpus +> import spacy +> +> corpus = JsonlCorpus("./texts.jsonl") +> nlp = spacy.blank("en") +> data = corpus(nlp) +> ``` + +| Name | Description | +| ---------- | -------------------------------------- | +| `nlp` | The current `nlp` object. ~~Language~~ | +| **YIELDS** | The examples. ~~Example~~ | diff --git a/website/docs/api/cython-classes.md b/website/docs/api/cython-classes.md index 77d6fdd10..a4ecf294a 100644 --- a/website/docs/api/cython-classes.md +++ b/website/docs/api/cython-classes.md @@ -23,13 +23,13 @@ accessed from Python. For the Python documentation, see [`Doc`](/api/doc). ### Attributes {#doc_attributes} -| Name | Type | Description | -| ------------ | ------------ | ----------------------------------------------------------------------------------------- | -| `mem` | `cymem.Pool` | A memory pool. Allocated memory will be freed once the `Doc` object is garbage collected. | -| `vocab` | `Vocab` | A reference to the shared `Vocab` object. | -| `c` | `TokenC*` | A pointer to a [`TokenC`](/api/cython-structs#tokenc) struct. | -| `length` | `int` | The number of tokens in the document. | -| `max_length` | `int` | The underlying size of the `Doc.c` array. | +| Name | Description | +| ------------ | -------------------------------------------------------------------------------------------------------- | +| `mem` | A memory pool. Allocated memory will be freed once the `Doc` object is garbage collected. ~~cymem.Pool~~ | +| `vocab` | A reference to the shared `Vocab` object. ~~Vocab~~ | +| `c` | A pointer to a [`TokenC`](/api/cython-structs#tokenc) struct. ~~TokenC\*~~ | +| `length` | The number of tokens in the document. ~~int~~ | +| `max_length` | The underlying size of the `Doc.c` array. ~~int~~ | ### Doc.push_back {#doc_push_back tag="method"} @@ -50,10 +50,10 @@ Append a token to the `Doc`. The token can be provided as a > assert doc.text == "hello " > ``` -| Name | Type | Description | -| ------------ | --------------- | ----------------------------------------- | -| `lex_or_tok` | `LexemeOrToken` | The word to append to the `Doc`. | -| `has_space` | `bint` | Whether the word has trailing whitespace. | +| Name | Description | +| ------------ | -------------------------------------------------- | +| `lex_or_tok` | The word to append to the `Doc`. 
~~LexemeOrToken~~ | +| `has_space` | Whether the word has trailing whitespace. ~~bint~~ | ## Token {#token tag="cdef class" source="spacy/tokens/token.pxd"} @@ -70,12 +70,12 @@ accessed from Python. For the Python documentation, see [`Token`](/api/token). ### Attributes {#token_attributes} -| Name | Type | Description | -| ------- | --------- | ------------------------------------------------------------- | -| `vocab` | `Vocab` | A reference to the shared `Vocab` object. | -| `c` | `TokenC*` | A pointer to a [`TokenC`](/api/cython-structs#tokenc) struct. | -| `i` | `int` | The offset of the token within the document. | -| `doc` | `Doc` | The parent document. | +| Name | Description | +| ------- | -------------------------------------------------------------------------- | +| `vocab` | A reference to the shared `Vocab` object. ~~Vocab~~ | +| `c` | A pointer to a [`TokenC`](/api/cython-structs#tokenc) struct. ~~TokenC\*~~ | +| `i` | The offset of the token within the document. ~~int~~ | +| `doc` | The parent document. ~~Doc~~ | ### Token.cinit {#token_cinit tag="method"} @@ -87,13 +87,12 @@ Create a `Token` object from a `TokenC*` pointer. > token = Token.cinit(&doc.c[3], doc, 3) > ``` -| Name | Type | Description | -| ----------- | --------- | ------------------------------------------------------------ | -| `vocab` | `Vocab` | A reference to the shared `Vocab`. | -| `c` | `TokenC*` | A pointer to a [`TokenC`](/api/cython-structs#tokenc)struct. | -| `offset` | `int` | The offset of the token within the document. | -| `doc` | `Doc` | The parent document. | -| **RETURNS** | `Token` | The newly constructed object. | +| Name | Description | +| -------- | -------------------------------------------------------------------------- | +| `vocab` | A reference to the shared `Vocab`. ~~Vocab~~ | +| `c` | A pointer to a [`TokenC`](/api/cython-structs#tokenc) struct. ~~TokenC\*~~ | +| `offset` | The offset of the token within the document. ~~int~~ | +| `doc` | The parent document. ~~int~~ | ## Span {#span tag="cdef class" source="spacy/tokens/span.pxd"} @@ -108,14 +107,14 @@ accessed from Python. For the Python documentation, see [`Span`](/api/span). ### Attributes {#span_attributes} -| Name | Type | Description | -| ------------ | -------------------------------------- | ------------------------------------------------------- | -| `doc` | `Doc` | The parent document. | -| `start` | `int` | The index of the first token of the span. | -| `end` | `int` | The index of the first token after the span. | -| `start_char` | `int` | The index of the first character of the span. | -| `end_char` | `int` | The index of the last character of the span. | -| `label` | `attr_t` | A label to attach to the span, e.g. for named entities. | +| Name | Description | +| ------------ | ----------------------------------------------------------------------------- | +| `doc` | The parent document. ~~Doc~~ | +| `start` | The index of the first token of the span. ~~int~~ | +| `end` | The index of the first token after the span. ~~int~~ | +| `start_char` | The index of the first character of the span. ~~int~~ | +| `end_char` | The index of the last character of the span. ~~int~~ | +| `label` | A label to attach to the span, e.g. for named entities. ~~attr_t (uint64_t)~~ | ## Lexeme {#lexeme tag="cdef class" source="spacy/lexeme.pxd"} @@ -130,11 +129,11 @@ accessed from Python. For the Python documentation, see [`Lexeme`](/api/lexeme). 
### Attributes {#lexeme_attributes} -| Name | Type | Description | -| ------- | -------------------------------------- | --------------------------------------------------------------- | -| `c` | `LexemeC*` | A pointer to a [`LexemeC`](/api/cython-structs#lexemec) struct. | -| `vocab` | `Vocab` | A reference to the shared `Vocab` object. | -| `orth` | `attr_t` | ID of the verbatim text content. | +| Name | Description | +| ------- | ----------------------------------------------------------------------------- | +| `c` | A pointer to a [`LexemeC`](/api/cython-structs#lexemec) struct. ~~LexemeC\*~~ | +| `vocab` | A reference to the shared `Vocab` object. ~~Vocab~~ | +| `orth` | ID of the verbatim text content. ~~attr_t (uint64_t)~~ | ## Vocab {#vocab tag="cdef class" source="spacy/vocab.pxd"} @@ -150,11 +149,11 @@ accessed from Python. For the Python documentation, see [`Vocab`](/api/vocab). ### Attributes {#vocab_attributes} -| Name | Type | Description | -| --------- | ------------- | ------------------------------------------------------------------------------------------- | -| `mem` | `cymem.Pool` | A memory pool. Allocated memory will be freed once the `Vocab` object is garbage collected. | -| `strings` | `StringStore` | A `StringStore` that maps string to hash values and vice versa. | -| `length` | `int` | The number of entries in the vocabulary. | +| Name | Description | +| --------- | ---------------------------------------------------------------------------------------------------------- | +| `mem` | A memory pool. Allocated memory will be freed once the `Vocab` object is garbage collected. ~~cymem.Pool~~ | +| `strings` | A `StringStore` that maps string to hash values and vice versa. ~~StringStore~~ | +| `length` | The number of entries in the vocabulary. ~~int~~ | ### Vocab.get {#vocab_get tag="method"} @@ -167,11 +166,11 @@ vocabulary. > lexeme = vocab.get(vocab.mem, "hello") > ``` -| Name | Type | Description | -| ----------- | ---------------- | ------------------------------------------------------------------------------------------- | -| `mem` | `cymem.Pool` | A memory pool. Allocated memory will be freed once the `Vocab` object is garbage collected. | -| `string` | unicode | The string of the word to look up. | -| **RETURNS** | `const LexemeC*` | The lexeme in the vocabulary. | +| Name | Description | +| ----------- | ---------------------------------------------------------------------------------------------------------- | +| `mem` | A memory pool. Allocated memory will be freed once the `Vocab` object is garbage collected. ~~cymem.Pool~~ | +| `string` | The string of the word to look up. ~~str~~ | +| **RETURNS** | The lexeme in the vocabulary. ~~const LexemeC\*~~ | ### Vocab.get_by_orth {#vocab_get_by_orth tag="method"} @@ -184,11 +183,11 @@ vocabulary. > lexeme = vocab.get_by_orth(doc[0].lex.norm) > ``` -| Name | Type | Description | -| ----------- | -------------------------------------- | ------------------------------------------------------------------------------------------- | -| `mem` | `cymem.Pool` | A memory pool. Allocated memory will be freed once the `Vocab` object is garbage collected. | -| `orth` | `attr_t` | ID of the verbatim text content. | -| **RETURNS** | `const LexemeC*` | The lexeme in the vocabulary. | +| Name | Description | +| ----------- | ---------------------------------------------------------------------------------------------------------- | +| `mem` | A memory pool. 
Allocated memory will be freed once the `Vocab` object is garbage collected. ~~cymem.Pool~~ | +| `orth` | ID of the verbatim text content. ~~attr_t (uint64_t)~~ | +| **RETURNS** | The lexeme in the vocabulary. ~~const LexemeC\*~~ | ## StringStore {#stringstore tag="cdef class" source="spacy/strings.pxd"} @@ -204,7 +203,7 @@ accessed from Python. For the Python documentation, see ### Attributes {#stringstore_attributes} -| Name | Type | Description | -| ------ | ------------------------------------------------------ | ------------------------------------------------------------------------------------------------ | -| `mem` | `cymem.Pool` | A memory pool. Allocated memory will be freed once the`StringStore` object is garbage collected. | -| `keys` | `vector[hash_t]` | A list of hash values in the `StringStore`. | +| Name | Description | +| ------ | ---------------------------------------------------------------------------------------------------------------- | +| `mem` | A memory pool. Allocated memory will be freed once the `StringStore` object is garbage collected. ~~cymem.Pool~~ | +| `keys` | A list of hash values in the `StringStore`. ~~vector[hash_t] \(vector[uint64_t])~~ | diff --git a/website/docs/api/cython-structs.md b/website/docs/api/cython-structs.md index 8ee1f1b9a..4c8514b64 100644 --- a/website/docs/api/cython-structs.md +++ b/website/docs/api/cython-structs.md @@ -18,26 +18,26 @@ Cython data container for the `Token` object. > token_ptr = &doc.c[3] > ``` -| Name | Type | Description | -| ------------ | -------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `lex` | `const LexemeC*` | A pointer to the lexeme for the token. | -| `morph` | `uint64_t` | An ID allowing lookup of morphological attributes. | -| `pos` | `univ_pos_t` | Coarse-grained part-of-speech tag. | -| `spacy` | `bint` | A binary value indicating whether the token has trailing whitespace. | -| `tag` | `attr_t` | Fine-grained part-of-speech tag. | -| `idx` | `int` | The character offset of the token within the parent document. | -| `lemma` | `attr_t` | Base form of the token, with no inflectional suffixes. | -| `sense` | `attr_t` | Space for storing a word sense ID, currently unused. | -| `head` | `int` | Offset of the syntactic parent relative to the token. | -| `dep` | `attr_t` | Syntactic dependency relation. | -| `l_kids` | `uint32_t` | Number of left children. | -| `r_kids` | `uint32_t` | Number of right children. | -| `l_edge` | `uint32_t` | Offset of the leftmost token of this token's syntactic descendants. | -| `r_edge` | `uint32_t` | Offset of the rightmost token of this token's syntactic descendants. | -| `sent_start` | `int` | Ternary value indicating whether the token is the first word of a sentence. `0` indicates a missing value, `-1` indicates `False` and `1` indicates `True`. The default value, 0, is interpreted as no sentence break. Sentence boundary detectors will usually set 0 for all tokens except tokens that follow a sentence boundary. | -| `ent_iob` | `int` | IOB code of named entity tag. `0` indicates a missing value, `1` indicates `I`, `2` indicates `0` and `3` indicates `B`. | -| `ent_type` | `attr_t` | Named entity type. 
| -| `ent_id` | `attr_t` | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. | +| Name | Description | +| ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `lex` | A pointer to the lexeme for the token. ~~const LexemeC\*~~ | +| `morph` | An ID allowing lookup of morphological attributes. ~~uint64_t~~ | +| `pos` | Coarse-grained part-of-speech tag. ~~univ_pos_t~~ | +| `spacy` | A binary value indicating whether the token has trailing whitespace. ~~bint~~ | +| `tag` | Fine-grained part-of-speech tag. ~~attr_t (uint64_t)~~ | +| `idx` | The character offset of the token within the parent document. ~~int~~ | +| `lemma` | Base form of the token, with no inflectional suffixes. ~~attr_t (uint64_t)~~ | +| `sense` | Space for storing a word sense ID, currently unused. ~~attr_t (uint64_t)~~ | +| `head` | Offset of the syntactic parent relative to the token. ~~int~~ | +| `dep` | Syntactic dependency relation. ~~attr_t (uint64_t)~~ | +| `l_kids` | Number of left children. ~~uint32_t~~ | +| `r_kids` | Number of right children. ~~uint32_t~~ | +| `l_edge` | Offset of the leftmost token of this token's syntactic descendants. ~~uint32_t~~ | +| `r_edge` | Offset of the rightmost token of this token's syntactic descendants. ~~uint32_t~~ | +| `sent_start` | Ternary value indicating whether the token is the first word of a sentence. `0` indicates a missing value, `-1` indicates `False` and `1` indicates `True`. The default value, 0, is interpreted as no sentence break. Sentence boundary detectors will usually set 0 for all tokens except tokens that follow a sentence boundary. ~~int~~ | +| `ent_iob` | IOB code of named entity tag. `0` indicates a missing value, `1` indicates `I`, `2` indicates `0` and `3` indicates `B`. ~~int~~ | +| `ent_type` | Named entity type. ~~attr_t (uint64_t)~~ | +| `ent_id` | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. ~~attr_t (uint64_t)~~ | ### Token.get_struct_attr {#token_get_struct_attr tag="staticmethod, nogil" source="spacy/tokens/token.pxd"} @@ -52,11 +52,11 @@ Get the value of an attribute from the `TokenC` struct by attribute ID. > is_alpha = Token.get_struct_attr(&doc.c[3], IS_ALPHA) > ``` -| Name | Type | Description | -| ----------- | -------------------------------------- | -------------------------------------------------------------------------------------- | -| `token` | `const TokenC*` | A pointer to a `TokenC` struct. | -| `feat_name` | `attr_id_t` | The ID of the attribute to look up. The attributes are enumerated in `spacy.typedefs`. | -| **RETURNS** | `attr_t` | The value of the attribute. | +| Name | Description | +| ----------- | ---------------------------------------------------------------------------------------------------- | +| `token` | A pointer to a `TokenC` struct. ~~const TokenC\*~~ | +| `feat_name` | The ID of the attribute to look up. The attributes are enumerated in `spacy.typedefs`. ~~attr_id_t~~ | +| **RETURNS** | The value of the attribute. 
~~attr_t (uint64_t)~~ | ### Token.set_struct_attr {#token_set_struct_attr tag="staticmethod, nogil" source="spacy/tokens/token.pxd"} @@ -72,11 +72,11 @@ Set the value of an attribute of the `TokenC` struct by attribute ID. > Token.set_struct_attr(token, TAG, 0) > ``` -| Name | Type | Description | -| ----------- | -------------------------------------- | -------------------------------------------------------------------------------------- | -| `token` | `const TokenC*` | A pointer to a `TokenC` struct. | -| `feat_name` | `attr_id_t` | The ID of the attribute to look up. The attributes are enumerated in `spacy.typedefs`. | -| `value` | `attr_t` | The value to set. | +| Name | Description | +| ----------- | ---------------------------------------------------------------------------------------------------- | +| `token` | A pointer to a `TokenC` struct. ~~const TokenC\*~~ | +| `feat_name` | The ID of the attribute to look up. The attributes are enumerated in `spacy.typedefs`. ~~attr_id_t~~ | +| `value` | The value to set. ~~attr_t (uint64_t)~~ | ### token_by_start {#token_by_start tag="function" source="spacy/tokens/doc.pxd"} @@ -93,12 +93,12 @@ Find a token in a `TokenC*` array by the offset of its first character. > assert token_by_start(doc.c, doc.length, 4) == -1 > ``` -| Name | Type | Description | -| ------------ | --------------- | --------------------------------------------------------- | -| `tokens` | `const TokenC*` | A `TokenC*` array. | -| `length` | `int` | The number of tokens in the array. | -| `start_char` | `int` | The start index to search for. | -| **RETURNS** | `int` | The index of the token in the array or `-1` if not found. | +| Name | Description | +| ------------ | ----------------------------------------------------------------- | +| `tokens` | A `TokenC*` array. ~~const TokenC\*~~ | +| `length` | The number of tokens in the array. ~~int~~ | +| `start_char` | The start index to search for. ~~int~~ | +| **RETURNS** | The index of the token in the array or `-1` if not found. ~~int~~ | ### token_by_end {#token_by_end tag="function" source="spacy/tokens/doc.pxd"} @@ -115,12 +115,12 @@ Find a token in a `TokenC*` array by the offset of its final character. > assert token_by_end(doc.c, doc.length, 1) == -1 > ``` -| Name | Type | Description | -| ----------- | --------------- | --------------------------------------------------------- | -| `tokens` | `const TokenC*` | A `TokenC*` array. | -| `length` | `int` | The number of tokens in the array. | -| `end_char` | `int` | The end index to search for. | -| **RETURNS** | `int` | The index of the token in the array or `-1` if not found. | +| Name | Description | +| ----------- | ----------------------------------------------------------------- | +| `tokens` | A `TokenC*` array. ~~const TokenC\*~~ | +| `length` | The number of tokens in the array. ~~int~~ | +| `end_char` | The end index to search for. ~~int~~ | +| **RETURNS** | The index of the token in the array or `-1` if not found. ~~int~~ | ### set_children_from_heads {#set_children_from_heads tag="function" source="spacy/tokens/doc.pxd"} @@ -143,10 +143,10 @@ attribute, in order to make the parse tree navigation consistent. > assert doc.c[3].l_kids == 1 > ``` -| Name | Type | Description | -| -------- | --------------- | ---------------------------------- | -| `tokens` | `const TokenC*` | A `TokenC*` array. | -| `length` | `int` | The number of tokens in the array. 
| +| Name | Description | +| -------- | ------------------------------------------ | +| `tokens` | A `TokenC*` array. ~~const TokenC\*~~ | +| `length` | The number of tokens in the array. ~~int~~ | ## LexemeC {#lexemec tag="C struct" source="spacy/structs.pxd"} @@ -160,17 +160,17 @@ struct. > lex = doc.c[3].lex > ``` -| Name | Type | Description | -| ----------- | --------------------------------------- | -------------------------------------------------------------------------------------------------------------------------- | -| `flags` | `flags_t` | Bit-field for binary lexical flag values. | -| `id` | `attr_t` | Usually used to map lexemes to rows in a matrix, e.g. for word vectors. Does not need to be unique, so currently misnamed. | -| `length` | `attr_t` | Number of unicode characters in the lexeme. | -| `orth` | `attr_t` | ID of the verbatim text content. | -| `lower` | `attr_t` | ID of the lowercase form of the lexeme. | -| `norm` | `attr_t` | ID of the lexeme's norm, i.e. a normalized form of the text. | -| `shape` | `attr_t` | Transform of the lexeme's string, to show orthographic features. | -| `prefix` | `attr_t` | Length-N substring from the start of the lexeme. Defaults to `N=1`. | -| `suffix` | `attr_t` | Length-N substring from the end of the lexeme. Defaults to `N=3`. | +| Name | Description | +| -------- | ------------------------------------------------------------------------------------------------------------------------------------------------ | +| `flags` | Bit-field for binary lexical flag values. ~~flags_t (uint64_t)~~ | +| `id` | Usually used to map lexemes to rows in a matrix, e.g. for word vectors. Does not need to be unique, so currently misnamed. ~~attr_t (uint64_t)~~ | +| `length` | Number of unicode characters in the lexeme. ~~attr_t (uint64_t)~~ | +| `orth` | ID of the verbatim text content. ~~attr_t (uint64_t)~~ | +| `lower` | ID of the lowercase form of the lexeme. ~~attr_t (uint64_t)~~ | +| `norm` | ID of the lexeme's norm, i.e. a normalized form of the text. ~~attr_t (uint64_t)~~ | +| `shape` | Transform of the lexeme's string, to show orthographic features. ~~attr_t (uint64_t)~~ | +| `prefix` | Length-N substring from the start of the lexeme. Defaults to `N=1`. ~~attr_t (uint64_t)~~ | +| `suffix` | Length-N substring from the end of the lexeme. Defaults to `N=3`. ~~attr_t (uint64_t)~~ | ### Lexeme.get_struct_attr {#lexeme_get_struct_attr tag="staticmethod, nogil" source="spacy/lexeme.pxd"} @@ -186,11 +186,11 @@ Get the value of an attribute from the `LexemeC` struct by attribute ID. > is_alpha = Lexeme.get_struct_attr(lexeme, IS_ALPHA) > ``` -| Name | Type | Description | -| ----------- | -------------------------------------- | -------------------------------------------------------------------------------------- | -| `lex` | `const LexemeC*` | A pointer to a `LexemeC` struct. | -| `feat_name` | `attr_id_t` | The ID of the attribute to look up. The attributes are enumerated in `spacy.typedefs`. | -| **RETURNS** | `attr_t` | The value of the attribute. | +| Name | Description | +| ----------- | ---------------------------------------------------------------------------------------------------- | +| `lex` | A pointer to a `LexemeC` struct. ~~const LexemeC\*~~ | +| `feat_name` | The ID of the attribute to look up. The attributes are enumerated in `spacy.typedefs`. ~~attr_id_t~~ | +| **RETURNS** | The value of the attribute. 
~~attr_t (uint64_t)~~ | ### Lexeme.set_struct_attr {#lexeme_set_struct_attr tag="staticmethod, nogil" source="spacy/lexeme.pxd"} @@ -206,11 +206,11 @@ Set the value of an attribute of the `LexemeC` struct by attribute ID. > Lexeme.set_struct_attr(lexeme, NORM, lexeme.lower) > ``` -| Name | Type | Description | -| ----------- | -------------------------------------- | -------------------------------------------------------------------------------------- | -| `lex` | `const LexemeC*` | A pointer to a `LexemeC` struct. | -| `feat_name` | `attr_id_t` | The ID of the attribute to look up. The attributes are enumerated in `spacy.typedefs`. | -| `value` | `attr_t` | The value to set. | +| Name | Description | +| ----------- | ---------------------------------------------------------------------------------------------------- | +| `lex` | A pointer to a `LexemeC` struct. ~~const LexemeC\*~~ | +| `feat_name` | The ID of the attribute to look up. The attributes are enumerated in `spacy.typedefs`. ~~attr_id_t~~ | +| `value` | The value to set. ~~attr_t (uint64_t)~~ | ### Lexeme.c_check_flag {#lexeme_c_check_flag tag="staticmethod, nogil" source="spacy/lexeme.pxd"} @@ -226,11 +226,11 @@ Check the value of a binary flag attribute. > is_stop = Lexeme.c_check_flag(lexeme, IS_STOP) > ``` -| Name | Type | Description | -| ----------- | ---------------- | ------------------------------------------------------------------------------- | -| `lexeme` | `const LexemeC*` | A pointer to a `LexemeC` struct. | -| `flag_id` | `attr_id_t` | The ID of the flag to look up. The flag IDs are enumerated in `spacy.typedefs`. | -| **RETURNS** | `bint` | The boolean value of the flag. | +| Name | Description | +| ----------- | --------------------------------------------------------------------------------------------- | +| `lexeme` | A pointer to a `LexemeC` struct. ~~const LexemeC\*~~ | +| `flag_id` | The ID of the flag to look up. The flag IDs are enumerated in `spacy.typedefs`. ~~attr_id_t~~ | +| **RETURNS** | The boolean value of the flag. ~~bint~~ | ### Lexeme.c_set_flag {#lexeme_c_set_flag tag="staticmethod, nogil" source="spacy/lexeme.pxd"} @@ -246,8 +246,8 @@ Set the value of a binary flag attribute. > Lexeme.c_set_flag(lexeme, IS_STOP, 0) > ``` -| Name | Type | Description | -| --------- | ---------------- | ------------------------------------------------------------------------------- | -| `lexeme` | `const LexemeC*` | A pointer to a `LexemeC` struct. | -| `flag_id` | `attr_id_t` | The ID of the flag to look up. The flag IDs are enumerated in `spacy.typedefs`. | -| `value` | `bint` | The value to set. | +| Name | Description | +| --------- | --------------------------------------------------------------------------------------------- | +| `lexeme` | A pointer to a `LexemeC` struct. ~~const LexemeC\*~~ | +| `flag_id` | The ID of the flag to look up. The flag IDs are enumerated in `spacy.typedefs`. ~~attr_id_t~~ | +| `value` | The value to set. ~~bint~~ | diff --git a/website/docs/api/cython.md b/website/docs/api/cython.md index f91909747..16b11cead 100644 --- a/website/docs/api/cython.md +++ b/website/docs/api/cython.md @@ -23,12 +23,12 @@ abruptly. With Cython there are four ways of declaring complex data types. 
Unfortunately we use all four in different places, as they all have different utility: -| Declaration | Description | Example | -| --------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------- | -| `class` | A normal Python class. | [`Language`](/api/language) | -| `cdef class` | A Python extension type. Differs from a normal Python class in that its attributes can be defined on the underlying struct. Can have C-level objects as attributes (notably structs and pointers), and can have methods which have C-level objects as arguments or return types. | [`Lexeme`](/api/cython-classes#lexeme) | -| `cdef struct` | A struct is just a collection of variables, sort of like a named tuple, except the memory is contiguous. Structs can't have methods, only attributes. | [`LexemeC`](/api/cython-structs#lexemec) | -| `cdef cppclass` | A C++ class. Like a struct, this can be allocated on the stack, but can have methods, a constructor and a destructor. Differs from `cdef class` in that it can be created and destroyed without acquiring the Python global interpreter lock. This style is the most obscure. | [`StateC`](https://github.com/explosion/spaCy/tree/master/spacy/syntax/_state.pxd) | +| Declaration | Description | Example | +| --------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------- | +| `class` | A normal Python class. | [`Language`](/api/language) | +| `cdef class` | A Python extension type. Differs from a normal Python class in that its attributes can be defined on the underlying struct. Can have C-level objects as attributes (notably structs and pointers), and can have methods which have C-level objects as arguments or return types. | [`Lexeme`](/api/cython-classes#lexeme) | +| `cdef struct` | A struct is just a collection of variables, sort of like a named tuple, except the memory is contiguous. Structs can't have methods, only attributes. | [`LexemeC`](/api/cython-structs#lexemec) | +| `cdef cppclass` | A C++ class. Like a struct, this can be allocated on the stack, but can have methods, a constructor and a destructor. Differs from `cdef class` in that it can be created and destroyed without acquiring the Python global interpreter lock. This style is the most obscure. | [`StateC`](%%GITHUB_SPACY/spacy/pipeline/_parser_internals/_state.pxd) | The most important classes in spaCy are defined as `cdef class` objects. The underlying data for these objects is usually gathered into a struct, which is @@ -122,7 +122,7 @@ where the rescuers keep passing out from low oxygen, causing another rescuer to follow — only to succumb themselves. In short, just say no to optimizing your Python. If it's not fast enough the first time, just switch to Cython. 
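To make these conventions concrete, here's a minimal, hypothetical sketch of the `cdef struct` plus `cdef class` pattern described above. The names `CountsC` and `Counts` are made up for illustration and don't exist in spaCy: the struct holds the raw data, while the extension type owns a `cymem.Pool` and a pointer into it, so the allocated memory is freed once the Python object is garbage-collected.

```python
from cymem.cymem cimport Pool

cdef struct CountsC:
    # Plain C data: no methods, contiguous memory
    int length

cdef class Counts:
    # Extension type wrapping a pointer to the struct
    cdef Pool mem
    cdef CountsC* c

    def __init__(self):
        self.mem = Pool()
        self.c = <CountsC*>self.mem.alloc(1, sizeof(CountsC))

    @property
    def length(self):
        # Expose the C-level value to Python
        return self.c.length
```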
- + - [Official Cython documentation](http://docs.cython.org/en/latest/) (cython.org) diff --git a/website/docs/api/data-formats.md b/website/docs/api/data-formats.md new file mode 100644 index 000000000..c4cc5b1e4 --- /dev/null +++ b/website/docs/api/data-formats.md @@ -0,0 +1,586 @@ +--- +title: Data formats +teaser: Details on spaCy's input and output data formats +menu: + - ['Training Config', 'config'] + - ['Training Data', 'training'] + - ['Vocabulary', 'vocab-jsonl'] + - ['Pipeline Meta', 'meta'] +--- + +This section documents input and output formats of data used by spaCy, including +the [training config](/usage/training#config), training data and lexical +vocabulary data. For an overview of label schemes used by the models, see the +[models directory](/models). Each trained pipeline documents the label schemes +used in its components, depending on the data it was trained on. + +## Training config {#config new="3"} + +Config files define the training process and pipeline and can be passed to +[`spacy train`](/api/cli#train). They use +[Thinc's configuration system](https://thinc.ai/docs/usage-config) under the +hood. For details on how to use training configs, see the +[usage documentation](/usage/training#config). To get started with the +recommended settings for your use case, check out the +[quickstart widget](/usage/training#quickstart) or run the +[`init config`](/api/cli#init-config) command. + +> #### What does the @ mean? +> +> The `@` syntax lets you refer to function names registered in the +> [function registry](/api/top-level#registry). For example, +> `@architectures = "spacy.HashEmbedCNN.v1"` refers to a registered function of +> the name [spacy.HashEmbedCNN.v1](/api/architectures#HashEmbedCNN) and all +> other values defined in its block will be passed into that function as +> arguments. Those arguments depend on the registered function. See the usage +> guide on [registered functions](/usage/training#config-functions) for details. + +```ini +%%GITHUB_SPACY/spacy/default_config.cfg +``` + + + +Under the hood, spaCy's configs are powered by our machine learning library +[Thinc's config system](https://thinc.ai/docs/usage-config), which uses +[`pydantic`](https://github.com/samuelcolvin/pydantic/) for data validation +based on type hints. See [`spacy/schemas.py`](%%GITHUB_SPACY/spacy/schemas.py) +for the schemas used to validate the default config. Arguments of registered +functions are validated against their type annotations, if available. To debug +your config and check that it's valid, you can run the +[`spacy debug config`](/api/cli#debug-config) command. + + + +### nlp {#config-nlp tag="section"} + +> #### Example +> +> ```ini +> [nlp] +> lang = "en" +> pipeline = ["tagger", "parser", "ner"] +> before_creation = null +> after_creation = null +> after_pipeline_creation = null +> +> [nlp.tokenizer] +> @tokenizers = "spacy.Tokenizer.v1" +> ``` + +Defines the `nlp` object, its tokenizer and +[processing pipeline](/usage/processing-pipelines) component names. + +| Name | Description | +| ------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `lang` | Pipeline language [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). Defaults to `null`. 
~~str~~ | +| `pipeline` | Names of pipeline components in order. Should correspond to sections in the `[components]` block, e.g. `[components.ner]`. See docs on [defining components](/usage/training#config-components). Defaults to `[]`. ~~List[str]~~ | +| `disabled` | Names of pipeline components that are loaded but disabled by default and not run as part of the pipeline. Should correspond to components listed in `pipeline`. After a pipeline is loaded, disabled components can be enabled using [`Language.enable_pipe`](/api/language#enable_pipe). ~~List[str]~~ | +| `before_creation` | Optional [callback](/usage/training#custom-code-nlp-callbacks) to modify `Language` subclass before it's initialized. Defaults to `null`. ~~Optional[Callable[[Type[Language]], Type[Language]]]~~ | +| `after_creation` | Optional [callback](/usage/training#custom-code-nlp-callbacks) to modify `nlp` object right after it's initialized. Defaults to `null`. ~~Optional[Callable[[Language], Language]]~~ | +| `after_pipeline_creation` | Optional [callback](/usage/training#custom-code-nlp-callbacks) to modify `nlp` object after the pipeline components have been added. Defaults to `null`. ~~Optional[Callable[[Language], Language]]~~ | +| `tokenizer` | The tokenizer to use. Defaults to [`Tokenizer`](/api/tokenizer). ~~Callable[[str], Doc]~~ | + +### components {#config-components tag="section"} + +> #### Example +> +> ```ini +> [components.textcat] +> factory = "textcat" +> labels = ["POSITIVE", "NEGATIVE"] +> +> [components.textcat.model] +> @architectures = "spacy.TextCatBOW.v1" +> exclusive_classes = false +> ngram_size = 1 +> no_output_layer = false +> ``` + +This section includes definitions of the +[pipeline components](/usage/processing-pipelines) and their models, if +available. Components in this section can be referenced in the `pipeline` of the +`[nlp]` block. Component blocks need to specify either a `factory` (named +function to use to create component) or a `source` (name of path of trained +pipeline to copy components from). See the docs on +[defining pipeline components](/usage/training#config-components) for details. + +### paths, system {#config-variables tag="variables"} + +These sections define variables that can be referenced across the other sections +as variables. For example `${paths.train}` uses the value of `train` defined in +the block `[paths]`. If your config includes custom registered functions that +need paths, you can define them here. All config values can also be +[overwritten](/usage/training#config-overrides) on the CLI when you run +[`spacy train`](/api/cli#train), which is especially relevant for data paths +that you don't want to hard-code in your config file. + +```cli +$ python -m spacy train config.cfg --paths.train ./corpus/train.spacy +``` + +### corpora {#config-corpora tag="section"} + +> #### Example +> +> ```ini +> [corpora] +> +> [corpora.train] +> @readers = "spacy.Corpus.v1" +> path = ${paths:train} +> +> [corpora.dev] +> @readers = "spacy.Corpus.v1" +> path = ${paths:dev} +> +> [corpora.pretrain] +> @readers = "spacy.JsonlCorpus.v1" +> path = ${paths.raw} +> +> [corpora.my_custom_data] +> @readers = "my_custom_reader.v1" +> ``` + +This section defines a **dictionary** mapping of string keys to functions. Each +function takes an `nlp` object and yields [`Example`](/api/example) objects. By +default, the two keys `train` and `dev` are specified and each refer to a +[`Corpus`](/api/top-level#Corpus). 
When pretraining, an additional `pretrain` +section is added that defaults to a [`JsonlCorpus`](/api/top-level#JsonlCorpus). +You can also register custom functions that return a callable. + +| Name | Description | +| ---------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `train` | Training data corpus, typically used in `[training]` block. ~~Callable[[Language], Iterator[Example]]~~ | +| `dev` | Development data corpus, typically used in `[training]` block. ~~Callable[[Language], Iterator[Example]]~~ | +| `pretrain` | Raw text for [pretraining](/usage/embeddings-transformers#pretraining), typically used in `[pretraining]` block (if available). ~~Callable[[Language], Iterator[Example]]~~ | +| ... | Any custom or alternative corpora. ~~Callable[[Language], Iterator[Example]]~~ | + +Alternatively, the `[corpora]` block can refer to **one function** that returns +a dictionary keyed by the corpus names. This can be useful if you want to load a +single corpus once and then divide it up into `train` and `dev` partitions. + +> #### Example +> +> ```ini +> [corpora] +> @readers = "my_custom_reader.v1" +> train_path = ${paths:train} +> dev_path = ${paths:dev} +> shuffle = true +> +> ``` + +| Name | Description | +| --------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `corpora` | A dictionary keyed by string names, mapped to corpus functions that receive the current `nlp` object and return an iterator of [`Example`](/api/example) objects. ~~Dict[str, Callable[[Language], Iterator[Example]]]~~ | + +### training {#config-training tag="section"} + +This section defines settings and controls for the training and evaluation +process that are used when you run [`spacy train`](/api/cli#train). + +| Name | Description | +| --------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `accumulate_gradient` | Whether to divide the batch up into substeps. Defaults to `1`. ~~int~~ | +| `batcher` | Callable that takes an iterator of [`Doc`](/api/doc) objects and yields batches of `Doc`s. Defaults to [`batch_by_words`](/api/top-level#batch_by_words). ~~Callable[[Iterator[Doc], Iterator[List[Doc]]]]~~ | +| `before_to_disk` | Optional callback to modify `nlp` object right before it is saved to disk during and after training. Can be used to remove or reset config values or disable components. Defaults to `null`. ~~Optional[Callable[[Language], Language]]~~ | +| `dev_corpus` | Dot notation of the config location defining the dev corpus. Defaults to `corpora.dev`. ~~str~~ | +| `dropout` | The dropout rate. Defaults to `0.1`. ~~float~~ | +| `eval_frequency` | How often to evaluate during training (steps). Defaults to `200`. ~~int~~ | +| `frozen_components` | Pipeline component names that are "frozen" and shouldn't be initialized or updated during training. See [here](/usage/training#config-components) for details. Defaults to `[]`. ~~List[str]~~ | +| `gpu_allocator` | Library for cupy to route GPU memory allocation to. 
Can be `"pytorch"` or `"tensorflow"`. Defaults to variable `${system.gpu_allocator}`. ~~str~~ | +| `logger` | Callable that takes the `nlp` and stdout and stderr `IO` objects, sets up the logger, and returns two new callables to log a training step and to finalize the logger. Defaults to [`ConsoleLogger`](/api/top-level#ConsoleLogger). ~~Callable[[Language, IO, IO], [Tuple[Callable[[Dict[str, Any]], None], Callable[[], None]]]]~~ | +| `max_epochs` | Maximum number of epochs to train for. Defaults to `0`. ~~int~~ | +| `max_steps` | Maximum number of update steps to train for. Defaults to `20000`. ~~int~~ | +| `optimizer` | The optimizer. The learning rate schedule and other settings can be configured as part of the optimizer. Defaults to [`Adam`](https://thinc.ai/docs/api-optimizers#adam). ~~Optimizer~~ | +| `patience` | How many steps to continue without improvement in evaluation score. Defaults to `1600`. ~~int~~ | +| `score_weights` | Score names shown in metrics mapped to their weight towards the final weighted score. See [here](/usage/training#metrics) for details. Defaults to `{}`. ~~Dict[str, float]~~ | +| `seed` | The random seed. Defaults to variable `${system.seed}`. ~~int~~ | +| `train_corpus` | Dot notation of the config location defining the train corpus. Defaults to `corpora.train`. ~~str~~ | + +### pretraining {#config-pretraining tag="section,optional"} + +This section is optional and defines settings and controls for +[language model pretraining](/usage/embeddings-transformers#pretraining). It's +used when you run [`spacy pretrain`](/api/cli#pretrain). + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `max_epochs` | Maximum number of epochs. Defaults to `1000`. ~~int~~ | +| `dropout` | The dropout rate. Defaults to `0.2`. ~~float~~ | +| `n_save_every` | Saving frequency. Defaults to `null`. ~~Optional[int]~~ | +| `objective` | The pretraining objective. Defaults to `{"type": "characters", "n_characters": 4}`. ~~Dict[str, Any]~~ | +| `optimizer` | The optimizer. The learning rate schedule and other settings can be configured as part of the optimizer. Defaults to [`Adam`](https://thinc.ai/docs/api-optimizers#adam). ~~Optimizer~~ | +| `corpus` | Dot notation of the config location defining the corpus with raw text. Defaults to `corpora.pretrain`. ~~str~~ | +| `batcher` | Callable that takes an iterator of [`Doc`](/api/doc) objects and yields batches of `Doc`s. Defaults to [`batch_by_words`](/api/top-level#batch_by_words). ~~Callable[[Iterator[Doc], Iterator[List[Doc]]]]~~ | +| `component` | Component name to identify the layer with the model to pretrain. Defaults to `"tok2vec"`. ~~str~~ | +| `layer` | The specific layer of the model to pretrain. If empty, the whole model will be used. ~~str~~ | + +### initialize {#config-initialize tag="section"} + +This config block lets you define resources for **initializing the pipeline**. +It's used by [`Language.initialize`](/api/language#initialize) and typically +called right before training (but not at runtime). The section allows you to +specify local file paths or custom functions to load data resources from, +without requiring them at runtime when you load the trained pipeline back in. 
+Also see the usage guides on the +[config lifecycle](/usage/training#config-lifecycle) and +[custom initialization](/usage/training#initialization). + +> #### Example +> +> ```ini +> [initialize] +> vectors = "/path/to/vectors_nlp" +> init_tok2vec = "/path/to/pretrain.bin" +> +> [initialize_components] +> +> [initialize.components.my_component] +> data_path = "/path/to/component_data" +> ``` + +| Name | Description | +| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `components` | Additional arguments passed to the `initialize` method of a pipeline component, keyed by component name. If type annotations are available on the method, the config will be validated against them. The `initialize` methods will always receive the `get_examples` callback and the current `nlp` object. ~~Dict[str, Dict[str, Any]]~~ | +| `init_tok2vec` | Optional path to pretrained tok2vec weights created with [`spacy pretrain`](/api/cli#pretrain). Defaults to variable `${paths.init_tok2vec}`. ~~Optional[str]~~ | +| `lookups` | Additional lexeme and vocab data from [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). Defaults to `null`. ~~Optional[Lookups]~~ | +| `tokenizer` | Additional arguments passed to the `initialize` method of the specified tokenizer. Can be used for languages like Chinese that depend on dictionaries or trained models for tokenization. If type annotations are available on the method, the config will be validated against them. The `initialize` method will always receive the `get_examples` callback and the current `nlp` object. ~~Dict[str, Any]~~ | +| `vectors` | Name or path of pipeline containing pretrained word vectors to use, e.g. created with [`init vectors`](/api/cli#init-vectors). Defaults to `null`. ~~Optional[str]~~ | +| `vocab_data` | Path to JSONL-formatted [vocabulary file](/api/data-formats#vocab-jsonl) to initialize vocabulary. ~~Optional[str]~~ | + +## Training data {#training} + +### Binary training format {#binary-training new="3"} + +> #### Example +> +> ```python +> from spacy.tokens import DocBin +> from spacy.training import Corpus +> +> doc_bin = DocBin(docs=docs) +> doc_bin.to_disk("./data.spacy") +> reader = Corpus("./data.spacy") +> ``` + +The main data format used in spaCy v3.0 is a **binary format** created by +serializing a [`DocBin`](/api/docbin), which represents a collection of `Doc` +objects. This means that you can train spaCy pipelines using the same format it +outputs: annotated `Doc` objects. The binary format is extremely **efficient in +storage**, especially when packing multiple documents together. + +Typically, the extension for these binary files is `.spacy`, and they are used +as input format for specifying a [training corpus](/api/corpus) and for spaCy's +CLI [`train`](/api/cli#train) command. The built-in +[`convert`](/api/cli#convert) command helps you convert spaCy's previous +[JSON format](#json-input) to the new binary format. It also supports conversion +of the `.conllu` format used by the +[Universal Dependencies corpora](https://github.com/UniversalDependencies). 
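+
+As a short sketch, annotated `Doc` objects can also be packed into the binary
+format directly, here with a single toy example and a hypothetical
+`./train.spacy` output path:
+
+```python
+import spacy
+from spacy.tokens import DocBin
+
+nlp = spacy.blank("en")
+doc_bin = DocBin()
+for text, entities in [("I like London.", [(7, 13, "GPE")])]:
+    doc = nlp.make_doc(text)
+    # Convert character offsets into entity spans and store them on the Doc
+    doc.ents = [doc.char_span(start, end, label=label) for start, end, label in entities]
+    doc_bin.add(doc)
+doc_bin.to_disk("./train.spacy")
+```
+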
+ +### JSON training format {#json-input tag="deprecated"} + + + +As of v3.0, the JSON input format is deprecated and is replaced by the +[binary format](#binary-training). Instead of converting [`Doc`](/api/doc) +objects to JSON, you can now serialize them directly using the +[`DocBin`](/api/docbin) container and then use them as input data. + +[`spacy convert`](/api/cli) lets you convert your JSON data to the new `.spacy` +format: + +```cli +$ python -m spacy convert ./data.json ./output.spacy +``` + + + +> #### Annotating entities +> +> Named entities are provided in the +> [BILUO](/usage/linguistic-features#accessing-ner) notation. Tokens outside an +> entity are set to `"O"` and tokens that are part of an entity are set to the +> entity label, prefixed by the BILUO marker. For example `"B-ORG"` describes +> the first token of a multi-token `ORG` entity and `"U-PERSON"` a single token +> representing a `PERSON` entity. The +> [`offsets_to_biluo_tags`](/api/top-level#offsets_to_biluo_tags) function can +> help you convert entity offsets to the right format. + +```python +### Example structure +[{ + "id": int, # ID of the document within the corpus + "paragraphs": [{ # list of paragraphs in the corpus + "raw": string, # raw text of the paragraph + "sentences": [{ # list of sentences in the paragraph + "tokens": [{ # list of tokens in the sentence + "id": int, # index of the token in the document + "dep": string, # dependency label + "head": int, # offset of token head relative to token index + "tag": string, # part-of-speech tag + "orth": string, # verbatim text of the token + "ner": string # BILUO label, e.g. "O" or "B-ORG" + }], + "brackets": [{ # phrase structure (NOT USED by current models) + "first": int, # index of first token + "last": int, # index of last token + "label": string # phrase label + }] + }], + "cats": [{ # new in v2.2: categories for text classifier + "label": string, # text category label + "value": float / bool # label applies (1.0/true) or not (0.0/false) + }] + }] +}] +``` + + + +Here's an example of dependencies, part-of-speech tags and named entities, taken +from the English Wall Street Journal portion of the Penn Treebank: + +```json +https://github.com/explosion/spaCy/blob/v2.3.x/examples/training/training-data.json +``` + + + +### Annotation format for creating training examples {#dict-input} + +An [`Example`](/api/example) object holds the information for one training +instance. It stores two [`Doc`](/api/doc) objects: one for holding the +gold-standard reference data, and one for holding the predictions of the +pipeline. Examples can be created using the +[`Example.from_dict`](/api/example#from_dict) method with a reference `Doc` and +a dictionary of gold-standard annotations. + +> #### Example +> +> ```python +> example = Example.from_dict(doc, gold_dict) +> ``` + + + +`Example` objects are used as part of the +[internal training API](/usage/training#api) and they're expected when you call +[`nlp.update`](/api/language#update). However, for most use cases, you +**shouldn't** have to write your own training scripts. It's recommended to train +your pipelines via the [`spacy train`](/api/cli#train) command with a config +file to keep track of your settings and hyperparameters and your own +[registered functions](/usage/training/#custom-code) to customize the setup. 
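+
+For illustration, a minimal sketch of this internal API with toy data could
+look like the following; in practice you would normally let
+[`spacy train`](/api/cli#train) drive this loop:
+
+```python
+import random
+import spacy
+from spacy.training import Example
+
+nlp = spacy.blank("en")
+nlp.add_pipe("textcat")
+
+# Toy data, only to illustrate the API
+train_data = [
+    ("I love this!", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
+    ("This is awful.", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
+]
+examples = [Example.from_dict(nlp.make_doc(text), annots) for text, annots in train_data]
+
+# Labels are inferred from the examples during initialization
+optimizer = nlp.initialize(lambda: examples)
+for epoch in range(5):
+    random.shuffle(examples)
+    losses = {}
+    nlp.update(examples, sgd=optimizer, losses=losses)
+    print(epoch, losses)
+```
+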
+ + + +> #### Example +> +> ```python +> { +> "text": str, +> "words": List[str], +> "lemmas": List[str], +> "spaces": List[bool], +> "tags": List[str], +> "pos": List[str], +> "morphs": List[str], +> "sent_starts": List[bool], +> "deps": List[string], +> "heads": List[int], +> "entities": List[str], +> "entities": List[(int, int, str)], +> "cats": Dict[str, float], +> "links": Dict[(int, int), dict], +> } +> ``` + +| Name | Description | +| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `text` | Raw text. ~~str~~ | +| `words` | List of gold-standard tokens. ~~List[str]~~ | +| `lemmas` | List of lemmas. ~~List[str]~~ | +| `spaces` | List of boolean values indicating whether the corresponding tokens is followed by a space or not. ~~List[bool]~~ | +| `tags` | List of fine-grained [POS tags](/usage/linguistic-features#pos-tagging). ~~List[str]~~ | +| `pos` | List of coarse-grained [POS tags](/usage/linguistic-features#pos-tagging). ~~List[str]~~ | +| `morphs` | List of [morphological features](/usage/linguistic-features#rule-based-morphology). ~~List[str]~~ | +| `sent_starts` | List of boolean values indicating whether each token is the first of a sentence or not. ~~List[bool]~~ | +| `deps` | List of string values indicating the [dependency relation](/usage/linguistic-features#dependency-parse) of a token to its head. ~~List[str]~~ | +| `heads` | List of integer values indicating the dependency head of each token, referring to the absolute index of each token in the text. ~~List[int]~~ | +| `entities` | **Option 1:** List of [BILUO tags](/usage/linguistic-features#accessing-ner) per token of the format `"{action}-{label}"`, or `None` for unannotated tokens. ~~List[str]~~ | +| `entities` | **Option 2:** List of `"(start, end, label)"` tuples defining all entities in the text. ~~List[Tuple[int, int, str]]~~ | +| `cats` | Dictionary of `label`/`value` pairs indicating how relevant a certain [text category](/api/textcategorizer) is for the text. ~~Dict[str, float]~~ | +| `links` | Dictionary of `offset`/`dict` pairs defining [named entity links](/usage/linguistic-features#entity-linking). The character offsets are linked to a dictionary of relevant knowledge base IDs. ~~Dict[Tuple[int, int], Dict]~~ | + + + +- Multiple formats are possible for the "entities" entry, but you have to pick + one. +- Any values for sentence starts will be ignored if there are annotations for + dependency relations. +- If the dictionary contains values for `"text"` and `"words"`, but not + `"spaces"`, the latter are inferred automatically. If "words" is not provided + either, the values are inferred from the `Doc` argument. 
+ + + +```python +### Examples +# Training data for a part-of-speech tagger +doc = Doc(vocab, words=["I", "like", "stuff"]) +gold_dict = {"tags": ["NOUN", "VERB", "NOUN"]} +example = Example.from_dict(doc, gold_dict) + +# Training data for an entity recognizer (option 1) +doc = nlp("Laura flew to Silicon Valley.") +gold_dict = {"entities": ["U-PERS", "O", "O", "B-LOC", "L-LOC"]} +example = Example.from_dict(doc, gold_dict) + +# Training data for an entity recognizer (option 2) +doc = nlp("Laura flew to Silicon Valley.") +gold_dict = {"entities": [(0, 5, "PERSON"), (14, 28, "LOC")]} +example = Example.from_dict(doc, gold_dict) + +# Training data for text categorization +doc = nlp("I'm pretty happy about that!") +gold_dict = {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}} +example = Example.from_dict(doc, gold_dict) + +# Training data for an Entity Linking component +doc = nlp("Russ Cochran his reprints include EC Comics.") +gold_dict = {"links": {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}}} +example = Example.from_dict(doc, gold_dict) +``` + +## Lexical data for vocabulary {#vocab-jsonl new="2"} + +This data file can be provided via the `vocab_data` setting in the +`[initialize]` block of the training config to pre-define the lexical data to +initialize the `nlp` object's vocabulary with. The file should contain one +lexical entry per line. The first line defines the language and vocabulary +settings. All other lines are expected to be JSON objects describing an +individual lexeme. The lexical attributes will be then set as attributes on +spaCy's [`Lexeme`](/api/lexeme#attributes) object. + +> #### Example config +> +> ```ini +> [initialize] +> vocab_data = "/path/to/vocab-data.jsonl" +> ``` + +```python +### First line +{"lang": "en", "settings": {"oov_prob": -20.502029418945312}} +``` + +```python +### Entry structure +{ + "orth": string, # the word text + "id": int, # can correspond to row in vectors table + "lower": string, + "norm": string, + "shape": string + "prefix": string, + "suffix": string, + "length": int, + "cluster": string, + "prob": float, + "is_alpha": bool, + "is_ascii": bool, + "is_digit": bool, + "is_lower": bool, + "is_punct": bool, + "is_space": bool, + "is_title": bool, + "is_upper": bool, + "like_url": bool, + "like_num": bool, + "like_email": bool, + "is_stop": bool, + "is_oov": bool, + "is_quote": bool, + "is_left_punct": bool, + "is_right_punct": bool +} +``` + +Here's an example of the 20 most frequent lexemes in the English training data: + +```json +%%GITHUB_SPACY/extra/example_data/vocab-data.jsonl +``` + +## Pipeline meta {#meta} + +The pipeline meta is available as the file `meta.json` and exported +automatically when you save an `nlp` object to disk. Its contents are available +as [`nlp.meta`](/api/language#meta). + + + +As of spaCy v3.0, the `meta.json` **isn't** used to construct the language class +and pipeline anymore and only contains meta information for reference and for +creating a Python package with [`spacy package`](/api/cli#package). How to set +up the `nlp` object is now defined in the +[`config.cfg`](/api/data-formats#config), which includes detailed information +about the pipeline components and their model architectures, and all other +settings and hyperparameters used to train the pipeline. It's the **single +source of truth** used for loading a pipeline. 
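+
+At runtime you can see the difference between the two, assuming a trained
+pipeline such as `en_core_web_sm` is installed:
+
+```python
+import spacy
+
+nlp = spacy.load("en_core_web_sm")
+# meta.json contents: reference information about the pipeline
+print(nlp.meta["name"], nlp.meta["version"])
+# config.cfg contents: the settings actually used to construct the pipeline
+print(nlp.config["nlp"]["pipeline"])
+```
+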
+ + + +> #### Example +> +> ```json +> { +> "name": "example_pipeline", +> "lang": "en", +> "version": "1.0.0", +> "spacy_version": ">=3.0.0,<3.1.0", +> "parent_package": "spacy", +> "description": "Example pipeline for spaCy", +> "author": "You", +> "email": "you@example.com", +> "url": "https://example.com", +> "license": "CC BY-SA 3.0", +> "sources": [{ "name": "My Corpus", "license": "MIT" }], +> "vectors": { "width": 0, "vectors": 0, "keys": 0, "name": null }, +> "pipeline": ["tok2vec", "ner", "textcat"], +> "labels": { +> "ner": ["PERSON", "ORG", "PRODUCT"], +> "textcat": ["POSITIVE", "NEGATIVE"] +> }, +> "performance": { +> "ents_f": 82.7300930714, +> "ents_p": 82.135523614, +> "ents_r": 83.3333333333, +> "textcat_score": 88.364323811 +> }, +> "speed": { "cpu": 7667.8, "gpu": null, "nwords": 10329 }, +> "spacy_git_version": "61dfdd9fb" +> } +> ``` + +| Name | Description | +| ---------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `lang` | Pipeline language [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). Defaults to `"en"`. ~~str~~ | +| `name` | Pipeline name, e.g. `"core_web_sm"`. The final package name will be `{lang}_{name}`. Defaults to `"pipeline"`. ~~str~~ | +| `version` | Pipeline version. Will be used to version a Python package created with [`spacy package`](/api/cli#package). Defaults to `"0.0.0"`. ~~str~~ | +| `spacy_version` | spaCy version range the package is compatible with. Defaults to the spaCy version used to create the pipeline, up to next minor version, which is the default compatibility for the available [trained pipelines](/models). For instance, a pipeline trained with v3.0.0 will have the version range `">=3.0.0,<3.1.0"`. ~~str~~ | +| `parent_package` | Name of the spaCy package. Typically `"spacy"` or `"spacy_nightly"`. Defaults to `"spacy"`. ~~str~~ | +| `description` | Pipeline description. Also used for Python package. Defaults to `""`. ~~str~~ | +| `author` | Pipeline author name. Also used for Python package. Defaults to `""`. ~~str~~ | +| `email` | Pipeline author email. Also used for Python package. Defaults to `""`. ~~str~~ | +| `url` | Pipeline author URL. Also used for Python package. Defaults to `""`. ~~str~~ | +| `license` | Pipeline license. Also used for Python package. Defaults to `""`. ~~str~~ | +| `sources` | Data sources used to train the pipeline. Typically a list of dicts with the keys `"name"`, `"url"`, `"author"` and `"license"`. [See here](https://github.com/explosion/spacy-models/tree/master/meta) for examples. Defaults to `None`. ~~Optional[List[Dict[str, str]]]~~ | +| `vectors` | Information about the word vectors included with the pipeline. Typically a dict with the keys `"width"`, `"vectors"` (number of vectors), `"keys"` and `"name"`. ~~Dict[str, Any]~~ | +| `pipeline` | Names of pipeline component names, in order. Corresponds to [`nlp.pipe_names`](/api/language#pipe_names). Only exists for reference and is not used to create the components. This information is defined in the [`config.cfg`](/api/data-formats#config). Defaults to `[]`. ~~List[str]~~ | +| `labels` | Label schemes of the trained pipeline components, keyed by component name. 
Corresponds to [`nlp.pipe_labels`](/api/language#pipe_labels). [See here](https://github.com/explosion/spacy-models/tree/master/meta) for examples. Defaults to `{}`. ~~Dict[str, Dict[str, List[str]]]~~ | +| `accuracy` | Training accuracy, added automatically by [`spacy train`](/api/cli#train). Dictionary of [score names](/usage/training#metrics) mapped to scores. Defaults to `{}`. ~~Dict[str, Union[float, Dict[str, float]]]~~ | +| `speed` | Inference speed, added automatically by [`spacy train`](/api/cli#train). Typically a dictionary with the keys `"cpu"`, `"gpu"` and `"nwords"` (words per second). Defaults to `{}`. ~~Dict[str, Optional[Union[float, str]]]~~ | +| `spacy_git_version` 3 | Git commit of [`spacy`](https://github.com/explosion/spaCy) used to create pipeline. ~~str~~ | +| other | Any other custom meta information you want to add. The data is preserved in [`nlp.meta`](/api/language#meta). ~~Any~~ | diff --git a/website/docs/api/dependencymatcher.md b/website/docs/api/dependencymatcher.md new file mode 100644 index 000000000..356adcda7 --- /dev/null +++ b/website/docs/api/dependencymatcher.md @@ -0,0 +1,222 @@ +--- +title: DependencyMatcher +teaser: Match subtrees within a dependency parse +tag: class +new: 3 +source: spacy/matcher/dependencymatcher.pyx +--- + +The `DependencyMatcher` follows the same API as the [`Matcher`](/api/matcher) +and [`PhraseMatcher`](/api/phrasematcher) and lets you match on dependency trees +using +[Semgrex operators](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html). +It requires a pretrained [`DependencyParser`](/api/parser) or other component +that sets the `Token.dep` and `Token.head` attributes. See the +[usage guide](/usage/rule-based-matching#dependencymatcher) for examples. + +## Pattern format {#patterns} + +> ```python +> ### Example +> # pattern: "[subject] ... initially founded" +> [ +> # anchor token: founded +> { +> "RIGHT_ID": "founded", +> "RIGHT_ATTRS": {"ORTH": "founded"} +> }, +> # founded -> subject +> { +> "LEFT_ID": "founded", +> "REL_OP": ">", +> "RIGHT_ID": "subject", +> "RIGHT_ATTRS": {"DEP": "nsubj"} +> }, +> # "founded" follows "initially" +> { +> "LEFT_ID": "founded", +> "REL_OP": ";", +> "RIGHT_ID": "initially", +> "RIGHT_ATTRS": {"ORTH": "initially"} +> } +> ] +> ``` + +A pattern added to the `DependencyMatcher` consists of a list of dictionaries, +with each dictionary describing a token to match. Except for the first +dictionary, which defines an anchor token using only `RIGHT_ID` and +`RIGHT_ATTRS`, each pattern should have the following keys: + +| Name | Description | +| ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `LEFT_ID` | The name of the left-hand node in the relation, which has been defined in an earlier node. ~~str~~ | +| `REL_OP` | An operator that describes how the two nodes are related. ~~str~~ | +| `RIGHT_ID` | A unique name for the right-hand node in the relation. ~~str~~ | +| `RIGHT_ATTRS` | The token attributes to match for the right-hand node in the same format as patterns provided to the regular token-based [`Matcher`](/api/matcher). ~~Dict[str, Any]~~ | + + + +For examples of how to construct dependency matcher patterns for different types +of relations, see the usage guide on +[dependency matching](/usage/rule-based-matching#dependencymatcher). 
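+
+Putting the pattern above to use, a minimal sketch (assuming a pipeline with a
+parser, such as `en_core_web_sm`, is installed, and using an illustrative
+sentence) could look like this:
+
+```python
+import spacy
+from spacy.matcher import DependencyMatcher
+
+nlp = spacy.load("en_core_web_sm")
+matcher = DependencyMatcher(nlp.vocab)
+pattern = [
+    {"RIGHT_ID": "founded", "RIGHT_ATTRS": {"ORTH": "founded"}},
+    {"LEFT_ID": "founded", "REL_OP": ">", "RIGHT_ID": "subject", "RIGHT_ATTRS": {"DEP": "nsubj"}},
+    {"LEFT_ID": "founded", "REL_OP": ";", "RIGHT_ID": "initially", "RIGHT_ATTRS": {"ORTH": "initially"}},
+]
+matcher.add("FOUNDED", [pattern])
+
+doc = nlp("Lee, an experienced CEO, has initially founded two AI startups.")
+for match_id, token_ids in matcher(doc):
+    # token_ids line up with the order of the dicts in the pattern
+    print([doc[i].text for i in token_ids])
+```
+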
+ + + +### Operators + +The following operators are supported by the `DependencyMatcher`, most of which +come directly from +[Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html): + +| Symbol | Description | +| --------- | -------------------------------------------------------------------------------------------------------------------- | +| `A < B` | `A` is the immediate dependent of `B`. | +| `A > B` | `A` is the immediate head of `B`. | +| `A << B` | `A` is the dependent in a chain to `B` following dep → head paths. | +| `A >> B` | `A` is the head in a chain to `B` following head → dep paths. | +| `A . B` | `A` immediately precedes `B`, i.e. `A.i == B.i - 1`, and both are within the same dependency tree. | +| `A .* B` | `A` precedes `B`, i.e. `A.i < B.i`, and both are within the same dependency tree _(not in Semgrex)_. | +| `A ; B` | `A` immediately follows `B`, i.e. `A.i == B.i + 1`, and both are within the same dependency tree _(not in Semgrex)_. | +| `A ;* B` | `A` follows `B`, i.e. `A.i > B.i`, and both are within the same dependency tree _(not in Semgrex)_. | +| `A $+ B` | `B` is a right immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i - 1`. | +| `A $- B` | `B` is a left immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i + 1`. | +| `A $++ B` | `B` is a right sibling of `A`, i.e. `A` and `B` have the same parent and `A.i < B.i`. | +| `A $-- B` | `B` is a left sibling of `A`, i.e. `A` and `B` have the same parent and `A.i > B.i`. | + +## DependencyMatcher.\_\_init\_\_ {#init tag="method"} + +Create a `DependencyMatcher`. + +> #### Example +> +> ```python +> from spacy.matcher import DependencyMatcher +> matcher = DependencyMatcher(nlp.vocab) +> ``` + +| Name | Description | +| -------------- | ----------------------------------------------------------------------------------------------------- | +| `vocab` | The vocabulary object, which must be shared with the documents the matcher will operate on. ~~Vocab~~ | +| _keyword-only_ | | +| `validate` | Validate all patterns added to this matcher. ~~bool~~ | + +## DependencyMatcher.\_\call\_\_ {#call tag="method"} + +Find all tokens matching the supplied patterns on the `Doc` or `Span`. + +> #### Example +> +> ```python +> from spacy.matcher import DependencyMatcher +> +> matcher = DependencyMatcher(nlp.vocab) +> pattern = [{"RIGHT_ID": "founded_id", +> "RIGHT_ATTRS": {"ORTH": "founded"}}] +> matcher.add("FOUNDED", [pattern]) +> doc = nlp("Bill Gates founded Microsoft.") +> matches = matcher(doc) +> ``` + +| Name | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `doclike` | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ | +| **RETURNS** | A list of `(match_id, token_ids)` tuples, describing the matches. The `match_id` is the ID of the match pattern and `token_ids` is a list of token indices matched by the pattern, where the position of each token in the list corresponds to the position of the node specification in the pattern. ~~List[Tuple[int, List[int]]]~~ | + +## DependencyMatcher.\_\_len\_\_ {#len tag="method"} + +Get the number of rules added to the dependency matcher. 
Note that this only +returns the number of rules (identical with the number of IDs), not the number +of individual patterns. + +> #### Example +> +> ```python +> matcher = DependencyMatcher(nlp.vocab) +> assert len(matcher) == 0 +> pattern = [{"RIGHT_ID": "founded_id", +> "RIGHT_ATTRS": {"ORTH": "founded"}}] +> matcher.add("FOUNDED", [pattern]) +> assert len(matcher) == 1 +> ``` + +| Name | Description | +| ----------- | ---------------------------- | +| **RETURNS** | The number of rules. ~~int~~ | + +## DependencyMatcher.\_\_contains\_\_ {#contains tag="method"} + +Check whether the matcher contains rules for a match ID. + +> #### Example +> +> ```python +> matcher = DependencyMatcher(nlp.vocab) +> assert "FOUNDED" not in matcher +> matcher.add("FOUNDED", [pattern]) +> assert "FOUNDED" in matcher +> ``` + +| Name | Description | +| ----------- | -------------------------------------------------------------- | +| `key` | The match ID. ~~str~~ | +| **RETURNS** | Whether the matcher contains rules for this match ID. ~~bool~~ | + +## DependencyMatcher.add {#add tag="method"} + +Add a rule to the matcher, consisting of an ID key, one or more patterns, and an +optional callback function to act on the matches. The callback function will +receive the arguments `matcher`, `doc`, `i` and `matches`. If a pattern already +exists for the given ID, the patterns will be extended. An `on_match` callback +will be overwritten. + +> #### Example +> +> ```python +> def on_match(matcher, doc, id, matches): +> print('Matched!', matches) +> +> matcher = DependencyMatcher(nlp.vocab) +> matcher.add("FOUNDED", patterns, on_match=on_match) +> ``` + +| Name | Description | +| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `match_id` | An ID for the patterns. ~~str~~ | +| `patterns` | A list of match patterns. A pattern consists of a list of dicts, where each dict describes a token in the tree. ~~List[List[Dict[str, Union[str, Dict]]]]~~ | +| _keyword-only_ | | +| `on_match` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. ~~Optional[Callable[[DependencyMatcher, Doc, int, List[Tuple], Any]]~~ | + +## DependencyMatcher.get {#get tag="method"} + +Retrieve the pattern stored for a key. Returns the rule as an +`(on_match, patterns)` tuple containing the callback and available patterns. + +> #### Example +> +> ```python +> matcher.add("FOUNDED", patterns, on_match=on_match) +> on_match, patterns = matcher.get("FOUNDED") +> ``` + +| Name | Description | +| ----------- | ----------------------------------------------------------------------------------------------------------- | +| `key` | The ID of the match rule. ~~str~~ | +| **RETURNS** | The rule, as an `(on_match, patterns)` tuple. ~~Tuple[Optional[Callable], List[List[Union[Dict, Tuple]]]]~~ | + +## DependencyMatcher.remove {#remove tag="method"} + +Remove a rule from the dependency matcher. A `KeyError` is raised if the match +ID does not exist. + +> #### Example +> +> ```python +> matcher.add("FOUNDED", patterns) +> assert "FOUNDED" in matcher +> matcher.remove("FOUNDED") +> assert "FOUNDED" not in matcher +> ``` + +| Name | Description | +| ----- | --------------------------------- | +| `key` | The ID of the match rule. 
~~str~~ | diff --git a/website/docs/api/dependencyparser.md b/website/docs/api/dependencyparser.md index df0df3e38..fe8f7d8d5 100644 --- a/website/docs/api/dependencyparser.md +++ b/website/docs/api/dependencyparser.md @@ -1,48 +1,96 @@ --- title: DependencyParser tag: class -source: spacy/pipeline/pipes.pyx +source: spacy/pipeline/dep_parser.pyx +teaser: 'Pipeline component for syntactic dependency parsing' +api_base_class: /api/pipe +api_string_name: parser +api_trainable: true --- -This class is a subclass of `Pipe` and follows the same API. The pipeline -component is available in the [processing pipeline](/usage/processing-pipelines) -via the ID `"parser"`. +A transition-based dependency parser component. The dependency parser jointly +learns sentence segmentation and labelled dependency parsing, and can optionally +learn to merge tokens that had been over-segmented by the tokenizer. The parser +uses a variant of the **non-monotonic arc-eager transition-system** described by +[Honnibal and Johnson (2014)](https://www.aclweb.org/anthology/D15-1162/), with +the addition of a "break" transition to perform the sentence segmentation. +[Nivre (2005)](https://www.aclweb.org/anthology/P05-1013/)'s **pseudo-projective +dependency transformation** is used to allow the parser to predict +non-projective parses. -## DependencyParser.Model {#model tag="classmethod"} +The parser is trained using an **imitation learning objective**. It follows the +actions predicted by the current weights, and at each state, determines which +actions are compatible with the optimal parse that could be reached from the +current state. The weights are updated such that the scores assigned to the set +of optimal actions is increased, while scores assigned to other actions are +decreased. Note that more than one action may be optimal for a given state. -Initialize a model for the pipe. The model should implement the -`thinc.neural.Model` API. Wrappers are under development for most major machine -learning libraries. +## Config and implementation {#config} -| Name | Type | Description | -| ----------- | ------ | ------------------------------------- | -| `**kwargs` | - | Parameters for initializing the model | -| **RETURNS** | object | The initialized model. | - -## DependencyParser.\_\_init\_\_ {#init tag="method"} - -Create a new pipeline instance. In your application, you would normally use a -shortcut for this and instantiate the component using its string name and -[`nlp.create_pipe`](/api/language#create_pipe). +The default config is defined by the pipeline component factory and describes +how the component should be configured. You can override its settings via the +`config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your +[`config.cfg` for training](/usage/training#config). See the +[model architectures](/api/architectures) documentation for details on the +architectures and their arguments and hyperparameters. 
> #### Example > > ```python -> # Construction via create_pipe -> parser = nlp.create_pipe("parser") +> from spacy.pipeline.dep_parser import DEFAULT_PARSER_MODEL +> config = { +> "moves": None, +> "update_with_oracle_cut_size": 100, +> "learn_tokens": False, +> "min_action_freq": 30, +> "model": DEFAULT_PARSER_MODEL, +> } +> nlp.add_pipe("parser", config=config) +> ``` + +| Setting | Description | +| ----------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `moves` | A list of transition names. Inferred from the data if not provided. Defaults to `None`. ~~Optional[List[str]]~~ | +| `update_with_oracle_cut_size` | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. Defaults to `100`. ~~int~~ | +| `learn_tokens` | Whether to learn to merge subtokens that are split relative to the gold standard. Experimental. Defaults to `False`. ~~bool~~ | +| `min_action_freq` | The minimum frequency of labelled actions to retain. Rarer labelled actions have their label backed-off to "dep". While this primarily affects the label accuracy, it can also affect the attachment structure, as the labels are used to represent the pseudo-projectivity transformation. Defaults to `30`. ~~int~~ | +| `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. Defaults to [TransitionBasedParser](/api/architectures#TransitionBasedParser). ~~Model[List[Doc], List[Floats2d]]~~ | + +```python +%%GITHUB_SPACY/spacy/pipeline/dep_parser.pyx +``` + +## DependencyParser.\_\_init\_\_ {#init tag="method"} + +> #### Example +> +> ```python +> # Construction via add_pipe with default model +> parser = nlp.add_pipe("parser") +> +> # Construction via add_pipe with custom model +> config = {"model": {"@architectures": "my_parser"}} +> parser = nlp.add_pipe("parser", config=config) > > # Construction from class > from spacy.pipeline import DependencyParser -> parser = DependencyParser(nlp.vocab) -> parser.from_disk("/path/to/model") +> parser = DependencyParser(nlp.vocab, model) > ``` -| Name | Type | Description | -| ----------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- | -| `vocab` | `Vocab` | The shared vocabulary. | -| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. | -| `**cfg` | - | Configuration parameters. | -| **RETURNS** | `DependencyParser` | The newly constructed object. | +Create a new pipeline instance. In your application, you would normally use a +shortcut for this and instantiate the component using its string name and +[`nlp.add_pipe`](/api/language#add_pipe). 
+ +| Name | Description | +| ----------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `vocab` | The shared vocabulary. ~~Vocab~~ | +| `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ | +| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ | +| `moves` | A list of transition names. Inferred from the data if not provided. ~~Optional[List[str]]~~ | +| _keyword-only_ | | +| `update_with_oracle_cut_size` | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. `100` is a good default. ~~int~~ | +| `learn_tokens` | Whether to learn to merge subtokens that are split relative to the gold standard. Experimental. ~~bool~~ | +| `min_action_freq` | The minimum frequency of labelled actions to retain. Rarer labelled actions have their label backed-off to "dep". While this primarily affects the label accuracy, it can also affect the attachment structure, as the labels are used to represent the pseudo-projectivity transformation. ~~int~~ | ## DependencyParser.\_\_call\_\_ {#call tag="method"} @@ -57,16 +105,16 @@ and all pipeline components are applied to the `Doc` in order. Both > #### Example > > ```python -> parser = DependencyParser(nlp.vocab) > doc = nlp("This is a sentence.") +> parser = nlp.add_pipe("parser") > # This usually happens under the hood > processed = parser(doc) > ``` -| Name | Type | Description | -| ----------- | ----- | ------------------------ | -| `doc` | `Doc` | The document to process. | -| **RETURNS** | `Doc` | The processed document. | +| Name | Description | +| ----------- | -------------------------------- | +| `doc` | The document to process. ~~Doc~~ | +| **RETURNS** | The processed document. ~~Doc~~ | ## DependencyParser.pipe {#pipe tag="method"} @@ -80,72 +128,118 @@ applied to the `Doc` in order. Both [`__call__`](/api/dependencyparser#call) and > #### Example > > ```python -> parser = DependencyParser(nlp.vocab) +> parser = nlp.add_pipe("parser") > for doc in parser.pipe(docs, batch_size=50): > pass > ``` -| Name | Type | Description | -| ------------ | -------- | ------------------------------------------------------ | -| `stream` | iterable | A stream of documents. | -| `batch_size` | int | The number of texts to buffer. Defaults to `128`. | -| **YIELDS** | `Doc` | Processed documents in the order of the original text. | +| Name | Description | +| -------------- | ------------------------------------------------------------- | +| `docs` | A stream of documents. ~~Iterable[Doc]~~ | +| _keyword-only_ | | +| `batch_size` | The number of documents to buffer. Defaults to `128`. ~~int~~ | +| **YIELDS** | The processed documents in order. ~~Doc~~ | + +## DependencyParser.initialize {#initialize tag="method" new="3"} + +Initialize the component for training. `get_examples` should be a function that +returns an iterable of [`Example`](/api/example) objects. The data examples are +used to **initialize the model** of the component and can either be the full +training data or a representative sample. 
Initialization includes validating the +network, +[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and +setting up the label scheme based on the data. This method is typically called +by [`Language.initialize`](/api/language#initialize) and lets you customize +arguments it receives via the +[`[initialize.components]`](/api/data-formats#config-initialize) block in the +config. + + + +This method was previously called `begin_training`. + + + +> #### Example +> +> ```python +> parser = nlp.add_pipe("parser") +> parser.initialize(lambda: [], nlp=nlp) +> ``` +> +> ```ini +> ### config.cfg +> [initialize.components.parser] +> +> [initialize.components.parser.labels] +> @readers = "spacy.read_labels.v1" +> path = "corpus/labels/parser.json +> ``` + +| Name | Description | +| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ | +| _keyword-only_ | | +| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ | +| `labels` | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Dict[str, Dict[str, int]]]~~ | ## DependencyParser.predict {#predict tag="method"} -Apply the pipeline's model to a batch of docs, without modifying them. +Apply the component's model to a batch of [`Doc`](/api/doc) objects, without +modifying them. > #### Example > > ```python -> parser = DependencyParser(nlp.vocab) +> parser = nlp.add_pipe("parser") > scores = parser.predict([doc1, doc2]) > ``` -| Name | Type | Description | -| ----------- | ------------------- | ---------------------------------------------- | -| `docs` | iterable | The documents to predict. | -| **RETURNS** | `syntax.StateClass` | A helper class for the parse state (internal). | +| Name | Description | +| ----------- | ------------------------------------------------------------- | +| `docs` | The documents to predict. ~~Iterable[Doc]~~ | +| **RETURNS** | A helper class for the parse state (internal). ~~StateClass~~ | ## DependencyParser.set_annotations {#set_annotations tag="method"} -Modify a batch of documents, using pre-computed scores. +Modify a batch of [`Doc`](/api/doc) objects, using pre-computed scores. > #### Example > > ```python -> parser = DependencyParser(nlp.vocab) +> parser = nlp.add_pipe("parser") > scores = parser.predict([doc1, doc2]) > parser.set_annotations([doc1, doc2], scores) > ``` -| Name | Type | Description | -| -------- | -------- | ---------------------------------------------------------- | -| `docs` | iterable | The documents to modify. | -| `scores` | - | The scores to set, produced by `DependencyParser.predict`. 
| +| Name | Description | +| -------- | ------------------------------------------------------------------------------------------------------------------------------------- | +| `docs` | The documents to modify. ~~Iterable[Doc]~~ | +| `scores` | The scores to set, produced by `DependencyParser.predict`. Returns an internal helper class for the parse state. ~~List[StateClass]~~ | ## DependencyParser.update {#update tag="method"} -Learn from a batch of documents and gold-standard information, updating the -pipe's model. Delegates to [`predict`](/api/dependencyparser#predict) and +Learn from a batch of [`Example`](/api/example) objects, updating the pipe's +model. Delegates to [`predict`](/api/dependencyparser#predict) and [`get_loss`](/api/dependencyparser#get_loss). > #### Example > > ```python -> parser = DependencyParser(nlp.vocab) -> losses = {} -> optimizer = nlp.begin_training() -> parser.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer) +> parser = nlp.add_pipe("parser") +> optimizer = nlp.initialize() +> losses = parser.update(examples, sgd=optimizer) > ``` -| Name | Type | Description | -| -------- | -------- | -------------------------------------------------------------------------------------------- | -| `docs` | iterable | A batch of documents to learn from. | -| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. | -| `drop` | float | The dropout rate. | -| `sgd` | callable | The optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. | -| `losses` | dict | Optional record of the loss during training. The value keyed by the model's name is updated. | +| Name | Description | +| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- | +| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ | +| _keyword-only_ | | +| `drop` | The dropout rate. ~~float~~ | +| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ | +| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ | +| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ | +| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ | ## DependencyParser.get_loss {#get_loss tag="method"} @@ -155,83 +249,103 @@ predicted scores. > #### Example > > ```python -> parser = DependencyParser(nlp.vocab) -> scores = parser.predict([doc1, doc2]) -> loss, d_loss = parser.get_loss([doc1, doc2], [gold1, gold2], scores) +> parser = nlp.add_pipe("parser") +> scores = parser.predict([eg.predicted for eg in examples]) +> loss, d_loss = parser.get_loss(examples, scores) > ``` -| Name | Type | Description | -| ----------- | -------- | ------------------------------------------------------------ | -| `docs` | iterable | The batch of documents. | -| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. | -| `scores` | - | Scores representing the model's predictions. | -| **RETURNS** | tuple | The loss and the gradient, i.e. `(loss, gradient)`. | +| Name | Description | +| ----------- | --------------------------------------------------------------------------- | +| `examples` | The batch of examples. 
~~Iterable[Example]~~ | +| `scores` | Scores representing the model's predictions. ~~StateClass~~ | +| **RETURNS** | The loss and the gradient, i.e. `(loss, gradient)`. ~~Tuple[float, float]~~ | -## DependencyParser.begin_training {#begin_training tag="method"} +## DependencyParser.score {#score tag="method" new="3"} -Initialize the pipe for training, using data examples if available. If no model -has been initialized yet, the model is added. +Score a batch of examples. > #### Example > > ```python -> parser = DependencyParser(nlp.vocab) -> nlp.pipeline.append(parser) -> optimizer = parser.begin_training(pipeline=nlp.pipeline) +> scores = parser.score(examples) > ``` -| Name | Type | Description | -| ------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `gold_tuples` | iterable | Optional gold-standard annotations from which to construct [`GoldParse`](/api/goldparse) objects. | -| `pipeline` | list | Optional list of pipeline components that this component is part of. | -| `sgd` | callable | An optional optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. Will be created via [`DependencyParser`](/api/dependencyparser#create_optimizer) if not set. | -| **RETURNS** | callable | An optimizer. | +| Name | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `examples` | The examples to score. ~~Iterable[Example]~~ | +| **RETURNS** | The scores, produced by [`Scorer.score_spans`](/api/scorer#score_spans) and [`Scorer.score_deps`](/api/scorer#score_deps). ~~Dict[str, Union[float, Dict[str, float]]]~~ | ## DependencyParser.create_optimizer {#create_optimizer tag="method"} -Create an optimizer for the pipeline component. +Create an [`Optimizer`](https://thinc.ai/docs/api-optimizers) for the pipeline +component. > #### Example > > ```python -> parser = DependencyParser(nlp.vocab) +> parser = nlp.add_pipe("parser") > optimizer = parser.create_optimizer() > ``` -| Name | Type | Description | -| ----------- | -------- | -------------- | -| **RETURNS** | callable | The optimizer. | +| Name | Description | +| ----------- | ---------------------------- | +| **RETURNS** | The optimizer. ~~Optimizer~~ | ## DependencyParser.use_params {#use_params tag="method, contextmanager"} -Modify the pipe's model, to use the given parameter values. +Modify the pipe's model, to use the given parameter values. At the end of the +context, the original parameters are restored. > #### Example > > ```python > parser = DependencyParser(nlp.vocab) -> with parser.use_params(): +> with parser.use_params(optimizer.averages): > parser.to_disk("/best_model") > ``` -| Name | Type | Description | -| -------- | ---- | ---------------------------------------------------------------------------------------------------------- | -| `params` | - | The parameter values to use in the model. At the end of the context, the original parameters are restored. | +| Name | Description | +| -------- | -------------------------------------------------- | +| `params` | The parameter values to use in the model. ~~dict~~ | ## DependencyParser.add_label {#add_label tag="method"} -Add a new label to the pipe. +Add a new label to the pipe. 
Note that you don't have to call this method if you +provide a **representative data sample** to the [`initialize`](#initialize) +method. In this case, all labels found in the sample will be automatically added +to the model, and the output dimension will be +[inferred](/usage/layers-architectures#thinc-shape-inference) automatically. > #### Example > > ```python -> parser = DependencyParser(nlp.vocab) +> parser = nlp.add_pipe("parser") > parser.add_label("MY_LABEL") > ``` -| Name | Type | Description | -| ------- | ------- | ----------------- | -| `label` | unicode | The label to add. | +| Name | Description | +| ----------- | ----------------------------------------------------------- | +| `label` | The label to add. ~~str~~ | +| **RETURNS** | `0` if the label is already present, otherwise `1`. ~~int~~ | + +## DependencyParser.set_output {#set_output tag="method"} + +Change the output dimension of the component's model by calling the model's +attribute `resize_output`. This is a function that takes the original model and +the new output dimension `nO`, and changes the model in place. When resizing an +already trained model, care should be taken to avoid the "catastrophic +forgetting" problem. + +> #### Example +> +> ```python +> parser = nlp.add_pipe("parser") +> parser.set_output(512) +> ``` + +| Name | Description | +| ---- | --------------------------------- | +| `nO` | The new output dimension. ~~int~~ | ## DependencyParser.to_disk {#to_disk tag="method"} @@ -240,14 +354,15 @@ Serialize the pipe to disk. > #### Example > > ```python -> parser = DependencyParser(nlp.vocab) +> parser = nlp.add_pipe("parser") > parser.to_disk("/path/to/parser") > ``` -| Name | Type | Description | -| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- | -| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | -| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | +| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | ## DependencyParser.from_disk {#from_disk tag="method"} @@ -256,31 +371,33 @@ Load the pipe from disk. Modifies the object in place and returns it. > #### Example > > ```python -> parser = DependencyParser(nlp.vocab) +> parser = nlp.add_pipe("parser") > parser.from_disk("/path/to/parser") > ``` -| Name | Type | Description | -| ----------- | ------------------ | -------------------------------------------------------------------------- | -| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. | -| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | -| **RETURNS** | `DependencyParser` | The modified `DependencyParser` object. | +| Name | Description | +| -------------- | ----------------------------------------------------------------------------------------------- | +| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. 
~~Union[str, Path]~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The modified `DependencyParser` object. ~~DependencyParser~~ | ## DependencyParser.to_bytes {#to_bytes tag="method"} > #### Example > > ```python -> parser = DependencyParser(nlp.vocab) +> parser = nlp.add_pipe("parser") > parser_bytes = parser.to_bytes() > ``` Serialize the pipe to a bytestring. -| Name | Type | Description | -| ----------- | ----- | ------------------------------------------------------------------------- | -| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | -| **RETURNS** | bytes | The serialized form of the `DependencyParser` object. | +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------- | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The serialized form of the `DependencyParser` object. ~~bytes~~ | ## DependencyParser.from_bytes {#from_bytes tag="method"} @@ -290,15 +407,16 @@ Load the pipe from a bytestring. Modifies the object in place and returns it. > > ```python > parser_bytes = parser.to_bytes() -> parser = DependencyParser(nlp.vocab) +> parser = nlp.add_pipe("parser") > parser.from_bytes(parser_bytes) > ``` -| Name | Type | Description | -| ------------ | ------------------ | ------------------------------------------------------------------------- | -| `bytes_data` | bytes | The data to load from. | -| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | -| **RETURNS** | `DependencyParser` | The `DependencyParser` object. | +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------- | +| `bytes_data` | The data to load from. ~~bytes~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The `DependencyParser` object. ~~DependencyParser~~ | ## DependencyParser.labels {#labels tag="property"} @@ -311,9 +429,27 @@ The labels currently added to the component. > assert "MY_LABEL" in parser.labels > ``` -| Name | Type | Description | -| ----------- | ----- | ---------------------------------- | -| **RETURNS** | tuple | The labels added to the component. | +| Name | Description | +| ----------- | ------------------------------------------------------ | +| **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ | + +## DependencyParser.label_data {#label_data tag="property" new="3"} + +The labels currently added to the component and their internal meta information. +This is the data generated by [`init labels`](/api/cli#init-labels) and used by +[`DependencyParser.initialize`](/api/dependencyparser#initialize) to initialize +the model with a pre-defined label set. + +> #### Example +> +> ```python +> labels = parser.label_data +> parser.initialize(lambda: [], nlp=nlp, labels=labels) +> ``` + +| Name | Description | +| ----------- | ------------------------------------------------------------------------------- | +| **RETURNS** | The label data added to the component. 
~~Dict[str, Dict[str, Dict[str, int]]]~~ |

## Serialization fields {#serialization-fields}

diff --git a/website/docs/api/doc.md b/website/docs/api/doc.md
index 420e12fcb..d511dc889 100644
--- a/website/docs/api/doc.md
+++ b/website/docs/api/doc.md
@@ -25,17 +25,27 @@ Construct a `Doc` object. The most common way to get a `Doc` object is via the
>
> # Construction 2
> from spacy.tokens import Doc
+>
> words = ["hello", "world", "!"]
> spaces = [True, False, False]
> doc = Doc(nlp.vocab, words=words, spaces=spaces)
> ```

-| Name | Type | Description |
-| ----------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `vocab` | `Vocab` | A storage container for lexical types. |
-| `words` | iterable | A list of strings to add to the container. |
-| `spaces` | iterable | A list of boolean values indicating whether each word has a subsequent space. Must have the same length as `words`, if specified. Defaults to a sequence of `True`. |
-| **RETURNS** | `Doc` | The newly constructed object. |
+| Name | Description |
+| ---------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `vocab` | A storage container for lexical types. ~~Vocab~~ |
+| `words` | A list of strings to add to the container. ~~Optional[List[str]]~~ |
+| `spaces` | A list of boolean values indicating whether each word has a subsequent space. Must have the same length as `words`, if specified. Defaults to a sequence of `True`. ~~Optional[List[bool]]~~ |
+| _keyword-only_ | |
+| `user_data` | Optional extra data to attach to the Doc. ~~Dict~~ |
+| `tags` 3 | A list of strings, of the same length as `words`, to assign as `token.tag` for each word. Defaults to `None`. ~~Optional[List[str]]~~ |
+| `pos` 3 | A list of strings, of the same length as `words`, to assign as `token.pos` for each word. Defaults to `None`. ~~Optional[List[str]]~~ |
+| `morphs` 3 | A list of strings, of the same length as `words`, to assign as `token.morph` for each word. Defaults to `None`. ~~Optional[List[str]]~~ |
+| `lemmas` 3 | A list of strings, of the same length as `words`, to assign as `token.lemma` for each word. Defaults to `None`. ~~Optional[List[str]]~~ |
+| `heads` 3 | A list of values, of the same length as `words`, to assign as the head for each word. Head indices are the absolute position of the head in the `Doc`. Defaults to `None`. ~~Optional[List[int]]~~ |
+| `deps` 3 | A list of strings, of the same length as `words`, to assign as `token.dep` for each word. Defaults to `None`. ~~Optional[List[str]]~~ |
+| `sent_starts` 3 | A list of values, of the same length as `words`, to assign as `token.is_sent_start`. Will be overridden by heads if `heads` is provided. Defaults to `None`. ~~Optional[List[Union[bool, None]]]~~ |
+| `ents` 3 | A list of strings, of the same length as `words`, to assign the token-based IOB tag. Defaults to `None`. ~~Optional[List[str]]~~ |

## Doc.\_\_getitem\_\_ {#getitem tag="method"}

@@ -53,10 +63,10 @@ Negative indexing is supported, and follows the usual Python semantics, i.e.
> assert span.text == "it back"
> ```

-| Name | Type | Description |
-| ----------- | ------- | ----------------------- |
-| `i` | int | The index of the token. |
-| **RETURNS** | `Token` | The token at `doc[i]`.
| +| Name | Description | +| ----------- | -------------------------------- | +| `i` | The index of the token. ~~int~~ | +| **RETURNS** | The token at `doc[i]`. ~~Token~~ | Get a [`Span`](/api/span) object, starting at position `start` (token index) and ending at position `end` (token index). For instance, `doc[2:5]` produces a span @@ -65,10 +75,10 @@ are not supported, as `Span` objects must be contiguous (cannot have gaps). You can use negative indices and open-ended ranges, which have their normal Python semantics. -| Name | Type | Description | -| ----------- | ------ | --------------------------------- | -| `start_end` | tuple | The slice of the document to get. | -| **RETURNS** | `Span` | The span at `doc[start:end]`. | +| Name | Description | +| ----------- | ----------------------------------------------------- | +| `start_end` | The slice of the document to get. ~~Tuple[int, int]~~ | +| **RETURNS** | The span at `doc[start:end]`. ~~Span~~ | ## Doc.\_\_iter\_\_ {#iter tag="method"} @@ -86,9 +96,9 @@ main way annotations are accessed from Python. If faster-than-Python speeds are required, you can instead access the annotations as a numpy array, or access the underlying C data directly from Cython. -| Name | Type | Description | -| ---------- | ------- | ----------------- | -| **YIELDS** | `Token` | A `Token` object. | +| Name | Description | +| ---------- | --------------------------- | +| **YIELDS** | A `Token` object. ~~Token~~ | ## Doc.\_\_len\_\_ {#len tag="method"} @@ -101,9 +111,9 @@ Get the number of tokens in the document. > assert len(doc) == 7 > ``` -| Name | Type | Description | -| ----------- | ---- | ------------------------------------- | -| **RETURNS** | int | The number of tokens in the document. | +| Name | Description | +| ----------- | --------------------------------------------- | +| **RETURNS** | The number of tokens in the document. ~~int~~ | ## Doc.set_extension {#set_extension tag="classmethod" new="2"} @@ -121,14 +131,14 @@ details, see the documentation on > assert doc._.has_city > ``` -| Name | Type | Description | -| --------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------- | -| `name` | unicode | Name of the attribute to set by the extension. For example, `'my_attr'` will be available as `doc._.my_attr`. | -| `default` | - | Optional default value of the attribute if no getter or method is defined. | -| `method` | callable | Set a custom method on the object, for example `doc._.compare(other_doc)`. | -| `getter` | callable | Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute. | -| `setter` | callable | Setter function that takes the `Doc` and a value, and modifies the object. Is called when the user writes to the `Doc._` attribute. | -| `force` | bool | Force overwriting existing attribute. | +| Name | Description | +| --------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | Name of the attribute to set by the extension. For example, `"my_attr"` will be available as `doc._.my_attr`. ~~str~~ | +| `default` | Optional default value of the attribute if no getter or method is defined. ~~Optional[Any]~~ | +| `method` | Set a custom method on the object, for example `doc._.compare(other_doc)`. 
~~Optional[Callable[[Doc, ...], Any]]~~ | +| `getter` | Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute. ~~Optional[Callable[[Doc], Any]]~~ | +| `setter` | Setter function that takes the `Doc` and a value, and modifies the object. Is called when the user writes to the `Doc._` attribute. ~~Optional[Callable[[Doc, Any], None]]~~ | +| `force` | Force overwriting existing attribute. ~~bool~~ | ## Doc.get_extension {#get_extension tag="classmethod" new="2"} @@ -140,15 +150,15 @@ Look up a previously registered extension by name. Returns a 4-tuple > > ```python > from spacy.tokens import Doc -> Doc.set_extension('has_city', default=False) -> extension = Doc.get_extension('has_city') +> Doc.set_extension("has_city", default=False) +> extension = Doc.get_extension("has_city") > assert extension == (False, None, None, None) > ``` -| Name | Type | Description | -| ----------- | ------- | ------------------------------------------------------------- | -| `name` | unicode | Name of the extension. | -| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the extension. | +| Name | Description | +| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | Name of the extension. ~~str~~ | +| **RETURNS** | A `(default, method, getter, setter)` tuple of the extension. ~~Tuple[Optional[Any], Optional[Callable], Optional[Callable], Optional[Callable]]~~ | ## Doc.has_extension {#has_extension tag="classmethod" new="2"} @@ -158,14 +168,14 @@ Check whether an extension has been registered on the `Doc` class. > > ```python > from spacy.tokens import Doc -> Doc.set_extension('has_city', default=False) -> assert Doc.has_extension('has_city') +> Doc.set_extension("has_city", default=False) +> assert Doc.has_extension("has_city") > ``` -| Name | Type | Description | -| ----------- | ------- | ------------------------------------------ | -| `name` | unicode | Name of the extension to check. | -| **RETURNS** | bool | Whether the extension has been registered. | +| Name | Description | +| ----------- | --------------------------------------------------- | +| `name` | Name of the extension to check. ~~str~~ | +| **RETURNS** | Whether the extension has been registered. ~~bool~~ | ## Doc.remove_extension {#remove_extension tag="classmethod" new="2.0.12"} @@ -175,21 +185,21 @@ Remove a previously registered extension. > > ```python > from spacy.tokens import Doc -> Doc.set_extension('has_city', default=False) -> removed = Doc.remove_extension('has_city') -> assert not Doc.has_extension('has_city') +> Doc.set_extension("has_city", default=False) +> removed = Doc.remove_extension("has_city") +> assert not Doc.has_extension("has_city") > ``` -| Name | Type | Description | -| ----------- | ------- | --------------------------------------------------------------------- | -| `name` | unicode | Name of the extension. | -| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the removed extension. | +| Name | Description | +| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | Name of the extension. ~~str~~ | +| **RETURNS** | A `(default, method, getter, setter)` tuple of the removed extension. 
~~Tuple[Optional[Any], Optional[Callable], Optional[Callable], Optional[Callable]]~~ |

## Doc.char_span {#char_span tag="method" new="2"}

Create a `Span` object from the slice `doc.text[start_idx:end_idx]`. Returns
-`None` if the character indices don't map to a valid span using the default mode
-`"strict".
+`None` if the character indices don't map to a valid span using the default
+alignment mode `"strict"`.

> #### Example
>
@@ -199,15 +209,39 @@ Create a `Span` object from the slice `doc.text[start_idx:end_idx]`. Returns
> assert span.text == "New York"
> ```

-| Name | Type | Description |
-| ------------------------------------ | ---------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `start_idx` | int | The index of the first character of the span. |
-| `end_idx` | int | The index of the last character after the span. |
-| `label` | uint64 / unicode | A label to attach to the span, e.g. for named entities. |
-| `kb_id` 2.2 | uint64 / unicode | An ID from a knowledge base to capture the meaning of a named entity. |
-| `vector` | `numpy.ndarray[ndim=1, dtype='float32']` | A meaning representation of the span. |
-| `mode` | `str` | How character indices snap to token boundaries. Options: "strict" (no snapping), "inside" (span of all tokens completely within the character span), "outside" (span of all tokens at least partially covered by the character span). Defaults to "strict". |
-| **RETURNS** | `Span` | The newly constructed object or `None`. |
+| Name | Description |
+| ------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `start` | The index of the first character of the span. ~~int~~ |
+| `end` | The index of the last character after the span. ~~int~~ |
+| `label` | A label to attach to the span, e.g. for named entities. ~~Union[int, str]~~ |
+| `kb_id` 2.2 | An ID from a knowledge base to capture the meaning of a named entity. ~~Union[int, str]~~ |
+| `vector` | A meaning representation of the span. ~~numpy.ndarray[ndim=1, dtype=float32]~~ |
+| `alignment_mode` | How character indices snap to token boundaries. Options: `"strict"` (no snapping), `"contract"` (span of all tokens completely within the character span), `"expand"` (span of all tokens at least partially covered by the character span). Defaults to `"strict"`. ~~str~~ |
+| **RETURNS** | The newly constructed object or `None`. ~~Optional[Span]~~ |
+
+## Doc.set_ents {#set_ents tag="method" new="3"}
+
+Set the named entities in the document.
+
+> #### Example
+>
+> ```python
+> from spacy.tokens import Span
+> doc = nlp("Mr. Best flew to New York on Saturday morning.")
+> doc.set_ents([Span(doc, 0, 2, "PERSON")])
+> ents = list(doc.ents)
+> assert ents[0].label_ == "PERSON"
+> assert ents[0].text == "Mr. Best"
+> ```
+
+| Name | Description |
+| -------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| entities | Spans with labels to set as entities.
~~List[Span]~~ | +| _keyword-only_ | | +| blocked | Spans to set as "blocked" (never an entity) for spacy's built-in NER component. Other components may ignore this setting. ~~Optional[List[Span]]~~ | +| missing | Spans with missing/unknown entity information. ~~Optional[List[Span]]~~ | +| outside | Spans outside of entities (O in IOB). ~~Optional[List[Span]]~~ | +| default | How to set entity annotation for tokens outside of any provided spans. Options: "blocked", "missing", "outside" and "unmodified" (preserve current state). Defaults to "outside". ~~str~~ | ## Doc.similarity {#similarity tag="method" model="vectors"} @@ -224,10 +258,10 @@ using an average of word vectors. > assert apples_oranges == oranges_apples > ``` -| Name | Type | Description | -| ----------- | ----- | -------------------------------------------------------------------------------------------- | -| `other` | - | The object to compare with. By default, accepts `Doc`, `Span`, `Token` and `Lexeme` objects. | -| **RETURNS** | float | A scalar similarity score. Higher is more similar. | +| Name | Description | +| ----------- | -------------------------------------------------------------------------------------------------------------------------------- | +| `other` | The object to compare with. By default, accepts `Doc`, `Span`, `Token` and `Lexeme` objects. ~~Union[Doc, Span, Token, Lexeme]~~ | +| **RETURNS** | A scalar similarity score. Higher is more similar. ~~float~~ | ## Doc.count_by {#count_by tag="method"} @@ -240,15 +274,15 @@ attribute ID. > ```python > from spacy.attrs import ORTH > doc = nlp("apple apple orange banana") -> assert doc.count_by(ORTH) == {7024L: 1, 119552L: 1, 2087L: 2} +> assert doc.count_by(ORTH) == {7024: 1, 119552: 1, 2087: 2} > doc.to_array([ORTH]) > # array([[11880], [11880], [7561], [12800]]) > ``` -| Name | Type | Description | -| ----------- | ---- | -------------------------------------------------- | -| `attr_id` | int | The attribute ID | -| **RETURNS** | dict | A dictionary mapping attributes to integer counts. | +| Name | Description | +| ----------- | --------------------------------------------------------------------- | +| `attr_id` | The attribute ID. ~~int~~ | +| **RETURNS** | A dictionary mapping attributes to integer counts. ~~Dict[int, int]~~ | ## Doc.get_lca_matrix {#get_lca_matrix tag="method"} @@ -264,57 +298,41 @@ ancestor is found, e.g. if span excludes a necessary ancestor. > # array([[0, 1, 1, 1], [1, 1, 1, 1], [1, 1, 2, 3], [1, 1, 3, 3]], dtype=int32) > ``` -| Name | Type | Description | -| ----------- | -------------------------------------- | ----------------------------------------------- | -| **RETURNS** | `numpy.ndarray[ndim=2, dtype='int32']` | The lowest common ancestor matrix of the `Doc`. | +| Name | Description | +| ----------- | -------------------------------------------------------------------------------------- | +| **RETURNS** | The lowest common ancestor matrix of the `Doc`. ~~numpy.ndarray[ndim=2, dtype=int32]~~ | -## Doc.to_json {#to_json tag="method" new="2.1"} +## Doc.has_annotation {#has_annotation tag="method"} -Convert a Doc to JSON. The format it produces will be the new format for the -[`spacy train`](/api/cli#train) command (not implemented yet). If custom -underscore attributes are specified, their values need to be JSON-serializable. -They'll be added to an `"_"` key in the data, e.g. `"_": {"foo": "bar"}`. +Check whether the doc contains annotation on a token attribute. 
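For illustration, a minimal usage sketch of this check, assuming an `nlp` pipeline whose components set part-of-speech tags:

```python
doc = nlp("This is a text")
# True if the attribute is set on at least one token in the doc
assert doc.has_annotation("TAG")
# require_complete=True additionally requires every token to be annotated
assert doc.has_annotation("TAG", require_complete=True)
```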
-> #### Example -> -> ```python -> doc = nlp("Hello") -> json_doc = doc.to_json() -> ``` -> -> #### Result -> -> ```python -> { -> "text": "Hello", -> "ents": [], -> "sents": [{"start": 0, "end": 5}], -> "tokens": [{"id": 0, "start": 0, "end": 5, "pos": "INTJ", "tag": "UH", "dep": "ROOT", "head": 0} -> ] -> } -> ``` + -| Name | Type | Description | -| ------------ | ---- | ------------------------------------------------------------------------------ | -| `underscore` | list | Optional list of string names of custom JSON-serializable `doc._.` attributes. | -| **RETURNS** | dict | The JSON-formatted data. | +This method replaces the previous boolean attributes like `Doc.is_tagged`, +`Doc.is_parsed` or `Doc.is_sentenced`. - - -spaCy previously implemented a `Doc.print_tree` method that returned a similar -JSON-formatted representation of a `Doc`. As of v2.1, this method is deprecated -in favor of `Doc.to_json`. If you need more complex nested representations, you -might want to write your own function to extract the data. +```diff +doc = nlp("This is a text") +- assert doc.is_parsed ++ assert doc.has_annotation("DEP") +``` +| Name | Description | +| ------------------ | --------------------------------------------------------------------------------------------------- | +| `attr` | The attribute string name or int ID. ~~Union[int, str]~~ | +| _keyword-only_ | | +| `require_complete` | Whether to check that the attribute is set on every token in the doc. Defaults to `False`. ~~bool~~ | +| **RETURNS** | Whether specified annotation is present in the doc. ~~bool~~ | + ## Doc.to_array {#to_array tag="method"} Export given token attributes to a numpy `ndarray`. If `attr_ids` is a sequence of `M` attributes, the output array will be of shape `(N, M)`, where `N` is the length of the `Doc` (in tokens). If `attr_ids` is a single attribute, the output shape will be `(N,)`. You can specify attributes by integer ID (e.g. -`spacy.attrs.LEMMA`) or string name (e.g. 'LEMMA' or 'lemma'). The values will +`spacy.attrs.LEMMA`) or string name (e.g. "LEMMA" or "lemma"). The values will be 64-bit integers. Returns a 2D array with one row per token and one column per attribute (when @@ -331,10 +349,10 @@ Returns a 2D array with one row per token and one column per attribute (when > np_array = doc.to_array("POS") > ``` -| Name | Type | Description | -| ----------- | ---------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------- | -| `attr_ids` | list or int or string | A list of attributes (int IDs or string names) or a single attribute (int ID or string name) | -| **RETURNS** | `numpy.ndarray[ndim=2, dtype='uint64']` or `numpy.ndarray[ndim=1, dtype='uint64']` | The exported attributes as a numpy array. | +| Name | Description | +| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------- | +| `attr_ids` | A list of attributes (int IDs or string names) or a single attribute (int ID or string name). ~~Union[int, str, List[Union[int, str]]]~~ | +| **RETURNS** | The exported attributes as a numpy array. ~~Union[numpy.ndarray[ndim=2, dtype=uint64], numpy.ndarray[ndim=1, dtype=uint64]]~~ | ## Doc.from_array {#from_array tag="method"} @@ -353,12 +371,39 @@ array of attributes. 
> assert doc[0].pos_ == doc2[0].pos_ > ``` -| Name | Type | Description | -| ----------- | -------------------------------------- | ------------------------------------------------------------------------- | -| `attrs` | list | A list of attribute ID ints. | -| `array` | `numpy.ndarray[ndim=2, dtype='int32']` | The attribute values to load. | -| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | -| **RETURNS** | `Doc` | Itself. | +| Name | Description | +| ----------- | ------------------------------------------------------------------------------------------- | +| `attrs` | A list of attribute ID ints. ~~List[int]~~ | +| `array` | The attribute values to load. ~~numpy.ndarray[ndim=2, dtype=int32]~~ | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The `Doc` itself. ~~Doc~~ | + +## Doc.from_docs {#from_docs tag="staticmethod" new="3"} + +Concatenate multiple `Doc` objects to form a new one. Raises an error if the +`Doc` objects do not all share the same `Vocab`. + +> #### Example +> +> ```python +> from spacy.tokens import Doc +> texts = ["London is the capital of the United Kingdom.", +> "The River Thames flows through London.", +> "The famous Tower Bridge crosses the River Thames."] +> docs = list(nlp.pipe(texts)) +> c_doc = Doc.from_docs(docs) +> assert str(c_doc) == " ".join(texts) +> assert len(list(c_doc.sents)) == len(docs) +> assert [str(ent) for ent in c_doc.ents] == \ +> [str(ent) for doc in docs for ent in doc.ents] +> ``` + +| Name | Description | +| ------------------- | ----------------------------------------------------------------------------------------------------------------- | +| `docs` | A list of `Doc` objects. ~~List[Doc]~~ | +| `ensure_whitespace` | Insert a space between two adjacent docs whenever the first doc does not end in whitespace. ~~bool~~ | +| `attrs` | Optional list of attribute ID ints or attribute name strings. ~~Optional[List[Union[str, int]]]~~ | +| **RETURNS** | The new `Doc` object that is containing the other docs or `None`, if `docs` is empty or `None`. ~~Optional[Doc]~~ | ## Doc.to_disk {#to_disk tag="method" new="2"} @@ -370,10 +415,11 @@ Save the current state to a directory. > doc.to_disk("/path/to/doc") > ``` -| Name | Type | Description | -| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- | -| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | -| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | +| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | ## Doc.from_disk {#from_disk tag="method" new="2"} @@ -387,11 +433,12 @@ Loads state from a directory. Modifies the object in place and returns it. 
> doc = Doc(Vocab()).from_disk("/path/to/doc") > ``` -| Name | Type | Description | -| ----------- | ---------------- | -------------------------------------------------------------------------- | -| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. | -| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | -| **RETURNS** | `Doc` | The modified `Doc` object. | +| Name | Description | +| -------------- | ----------------------------------------------------------------------------------------------- | +| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The modified `Doc` object. ~~Doc~~ | ## Doc.to_bytes {#to_bytes tag="method"} @@ -404,10 +451,11 @@ Serialize, i.e. export the document contents to a binary string. > doc_bytes = doc.to_bytes() > ``` -| Name | Type | Description | -| ----------- | ----- | ------------------------------------------------------------------------- | -| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | -| **RETURNS** | bytes | A losslessly serialized copy of the `Doc`, including all annotations. | +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------- | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | A losslessly serialized copy of the `Doc`, including all annotations. ~~bytes~~ | ## Doc.from_bytes {#from_bytes tag="method"} @@ -423,11 +471,12 @@ Deserialize, i.e. import the document contents from a binary string. > assert doc.text == doc2.text > ``` -| Name | Type | Description | -| ----------- | ----- | ------------------------------------------------------------------------- | -| `data` | bytes | The string to load from. | -| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | -| **RETURNS** | `Doc` | The `Doc` object. | +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------- | +| `data` | The string to load from. ~~bytes~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The `Doc` object. ~~Doc~~ | ## Doc.retokenize {#retokenize tag="contextmanager" new="2.1"} @@ -445,17 +494,18 @@ invalidated, although they may accidentally continue to work. > retokenizer.merge(doc[0:2]) > ``` -| Name | Type | Description | -| ----------- | ------------- | ---------------- | -| **RETURNS** | `Retokenizer` | The retokenizer. | +| Name | Description | +| ----------- | -------------------------------- | +| **RETURNS** | The retokenizer. ~~Retokenizer~~ | ### Retokenizer.merge {#retokenizer.merge tag="method"} Mark a span for merging. The `attrs` will be applied to the resulting token (if they're context-dependent token attributes like `LEMMA` or `DEP`) or to the underlying lexeme (if they're context-independent lexical attributes like -`LOWER` or `IS_STOP`). Writable custom extension attributes can be provided as a -dictionary mapping attribute names to values as the `"_"` key. +`LOWER` or `IS_STOP`). 
Writable custom extension attributes can be provided +using the `"_"` key and specifying a dictionary that maps attribute names to +values. > #### Example > @@ -466,10 +516,10 @@ dictionary mapping attribute names to values as the `"_"` key. > retokenizer.merge(doc[2:4], attrs=attrs) > ``` -| Name | Type | Description | -| ------- | ------ | -------------------------------------- | -| `span` | `Span` | The span to merge. | -| `attrs` | dict | Attributes to set on the merged token. | +| Name | Description | +| ------- | --------------------------------------------------------------------- | +| `span` | The span to merge. ~~Span~~ | +| `attrs` | Attributes to set on the merged token. ~~Dict[Union[str, int], Any]~~ | ### Retokenizer.split {#retokenizer.split tag="method"} @@ -500,41 +550,12 @@ underlying lexeme (if they're context-independent lexical attributes like > retokenizer.split(doc[3], ["New", "York"], heads=heads, attrs=attrs) > ``` -| Name | Type | Description | -| ------- | ------- | ----------------------------------------------------------------------------------------------------------- | -| `token` | `Token` | The token to split. | -| `orths` | list | The verbatim text of the split tokens. Needs to match the text of the original token. | -| `heads` | list | List of `token` or `(token, subtoken)` tuples specifying the tokens to attach the newly split subtokens to. | -| `attrs` | dict | Attributes to set on all split tokens. Attribute names mapped to list of per-token attribute values. | - -## Doc.merge {#merge tag="method"} - - - -As of v2.1.0, `Doc.merge` still works but is considered deprecated. You should -use the new and less error-prone [`Doc.retokenize`](/api/doc#retokenize) -instead. - - - -Retokenize the document, such that the span at `doc.text[start_idx : end_idx]` -is merged into a single token. If `start_idx` and `end_idx` do not mark start -and end token boundaries, the document remains unchanged. - -> #### Example -> -> ```python -> doc = nlp("Los Angeles start.") -> doc.merge(0, len("Los Angeles"), "NNP", "Los Angeles", "GPE") -> assert [t.text for t in doc] == ["Los Angeles", "start", "."] -> ``` - -| Name | Type | Description | -| -------------- | ------- | ------------------------------------------------------------------------------------------------------------------------- | -| `start_idx` | int | The character index of the start of the slice to merge. | -| `end_idx` | int | The character index after the end of the slice to merge. | -| `**attributes` | - | Attributes to assign to the merged token. By default, attributes are inherited from the syntactic root token of the span. | -| **RETURNS** | `Token` | The newly merged token, or `None` if the start and end indices did not fall at token boundaries | +| Name | Description | +| ------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- | +| `token` | The token to split. ~~Token~~ | +| `orths` | The verbatim text of the split tokens. Needs to match the text of the original token. ~~List[str]~~ | +| `heads` | List of `token` or `(token, subtoken)` tuples specifying the tokens to attach the newly split subtokens to. ~~List[Union[Token, Tuple[Token, int]]]~~ | +| `attrs` | Attributes to set on all split tokens. Attribute names mapped to list of per-token attribute values. 
~~Dict[Union[str, int], List[Any]]~~ | ## Doc.ents {#ents tag="property" model="NER"} @@ -546,14 +567,13 @@ objects, if the entity recognizer has been applied. > ```python > doc = nlp("Mr. Best flew to New York on Saturday morning.") > ents = list(doc.ents) -> assert ents[0].label == 346 > assert ents[0].label_ == "PERSON" > assert ents[0].text == "Mr. Best" > ``` -| Name | Type | Description | -| ----------- | ----- | ------------------------------------------------ | -| **RETURNS** | tuple | Entities in the document, one `Span` per entity. | +| Name | Description | +| ----------- | --------------------------------------------------------------------- | +| **RETURNS** | Entities in the document, one `Span` per entity. ~~Tuple[Span, ...]~~ | ## Doc.noun_chunks {#noun_chunks tag="property" model="parser"} @@ -572,9 +592,9 @@ relative clauses. > assert chunks[1].text == "another phrase" > ``` -| Name | Type | Description | -| ---------- | ------ | ---------------------------- | -| **YIELDS** | `Span` | Noun chunks in the document. | +| Name | Description | +| ---------- | ------------------------------------- | +| **YIELDS** | Noun chunks in the document. ~~Span~~ | ## Doc.sents {#sents tag="property" model="parser"} @@ -592,9 +612,9 @@ will be unavailable. > assert [s.root.text for s in sents] == ["is", "'s"] > ``` -| Name | Type | Description | -| ---------- | ------ | -------------------------- | -| **YIELDS** | `Span` | Sentences in the document. | +| Name | Description | +| ---------- | ----------------------------------- | +| **YIELDS** | Sentences in the document. ~~Span~~ | ## Doc.has_vector {#has_vector tag="property" model="vectors"} @@ -607,9 +627,9 @@ A boolean value indicating whether a word vector is associated with the object. > assert doc.has_vector > ``` -| Name | Type | Description | -| ----------- | ---- | ------------------------------------------------ | -| **RETURNS** | bool | Whether the document has a vector data attached. | +| Name | Description | +| ----------- | --------------------------------------------------------- | +| **RETURNS** | Whether the document has a vector data attached. ~~bool~~ | ## Doc.vector {#vector tag="property" model="vectors"} @@ -624,9 +644,9 @@ vectors. > assert doc.vector.shape == (300,) > ``` -| Name | Type | Description | -| ----------- | ---------------------------------------- | ------------------------------------------------------- | -| **RETURNS** | `numpy.ndarray[ndim=1, dtype='float32']` | A 1D numpy array representing the document's semantics. | +| Name | Description | +| ----------- | -------------------------------------------------------------------------------------------------- | +| **RETURNS** | A 1-dimensional array representing the document's vector. ~~numpy.ndarray[ndim=1, dtype=float32]~~ | ## Doc.vector_norm {#vector_norm tag="property" model="vectors"} @@ -642,32 +662,28 @@ The L2 norm of the document's vector representation. > assert doc1.vector_norm != doc2.vector_norm > ``` -| Name | Type | Description | -| ----------- | ----- | ----------------------------------------- | -| **RETURNS** | float | The L2 norm of the vector representation. | +| Name | Description | +| ----------- | --------------------------------------------------- | +| **RETURNS** | The L2 norm of the vector representation. 
~~float~~ | ## Attributes {#attributes} -| Name | Type | Description | -| --------------------------------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `text` | unicode | A unicode representation of the document text. | -| `text_with_ws` | unicode | An alias of `Doc.text`, provided for duck-type compatibility with `Span` and `Token`. | -| `mem` | `Pool` | The document's local memory heap, for all C data it owns. | -| `vocab` | `Vocab` | The store of lexical types. | -| `tensor` 2 | `ndarray` | Container for dense vector representations. | -| `cats` 2 | dict | Maps a label to a score for categories applied to the document. The label is a string and the score should be a float. | -| `user_data` | - | A generic storage area, for user custom data. | -| `lang` 2.1 | int | Language of the document's vocabulary. | -| `lang_` 2.1 | unicode | Language of the document's vocabulary. | -| `is_tagged` | bool | A flag indicating that the document has been part-of-speech tagged. Returns `True` if the `Doc` is empty. | -| `is_parsed` | bool | A flag indicating that the document has been syntactically parsed. Returns `True` if the `Doc` is empty. | -| `is_sentenced` | bool | A flag indicating that sentence boundaries have been applied to the document. Returns `True` if the `Doc` is empty. | -| `is_nered` 2.1 | bool | A flag indicating that named entities have been set. Will return `True` if the `Doc` is empty, or if _any_ of the tokens has an entity tag set, even if the others are unknown. | -| `sentiment` | float | The document's positivity/negativity score, if available. | -| `user_hooks` | dict | A dictionary that allows customization of the `Doc`'s properties. | -| `user_token_hooks` | dict | A dictionary that allows customization of properties of `Token` children. | -| `user_span_hooks` | dict | A dictionary that allows customization of properties of `Span` children. | -| `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). | +| Name | Description | +| ------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------- | +| `text` | A string representation of the document text. ~~str~~ | +| `text_with_ws` | An alias of `Doc.text`, provided for duck-type compatibility with `Span` and `Token`. ~~str~~ | +| `mem` | The document's local memory heap, for all C data it owns. ~~cymem.Pool~~ | +| `vocab` | The store of lexical types. ~~Vocab~~ | +| `tensor` 2 | Container for dense vector representations. ~~numpy.ndarray~~ | +| `cats` 2 | Maps a label to a score for categories applied to the document. The label is a string and the score should be a float. ~~Dict[str, float]~~ | +| `user_data` | A generic storage area, for user custom data. ~~Dict[str, Any]~~ | +| `lang` 2.1 | Language of the document's vocabulary. ~~int~~ | +| `lang_` 2.1 | Language of the document's vocabulary. ~~str~~ | +| `sentiment` | The document's positivity/negativity score, if available. ~~float~~ | +| `user_hooks` | A dictionary that allows customization of the `Doc`'s properties. ~~Dict[str, Callable]~~ | +| `user_token_hooks` | A dictionary that allows customization of properties of `Token` children. 
~~Dict[str, Callable]~~ |
+| `user_span_hooks` | A dictionary that allows customization of properties of `Span` children. ~~Dict[str, Callable]~~ |
+| `_` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). ~~Underscore~~ |

## Serialization fields {#serialization-fields}

diff --git a/website/docs/api/docbin.md b/website/docs/api/docbin.md
index 9f12a07e6..3625ed790 100644
--- a/website/docs/api/docbin.md
+++ b/website/docs/api/docbin.md
@@ -16,13 +16,14 @@ document from the `DocBin`. The serialization format is gzipped msgpack, where
the msgpack object has the following structure:

```python
-### msgpack object strcutrue
+### msgpack object structure
{
+    "version": str,              # DocBin version number
    "attrs": List[uint64],       # e.g. [TAG, HEAD, ENT_IOB, ENT_TYPE]
    "tokens": bytes,             # Serialized numpy uint64 array with the token data
    "spaces": bytes,             # Serialized numpy boolean array with spaces data
    "lengths": bytes,            # Serialized numpy int32 array with the doc lengths
-    "strings": List[unicode]     # List of unique strings in the token data
+    "strings": List[str]         # List of unique strings in the token data
}
```

@@ -43,11 +44,11 @@ Create a `DocBin` object to hold serialized annotations.
> doc_bin = DocBin(attrs=["ENT_IOB", "ENT_TYPE"])
> ```

-| Argument | Type | Description |
-| ----------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `attrs` | list | List of attributes to serialize. `orth` (hash of token text) and `spacy` (whether the token is followed by whitespace) are always serialized, so they're not required. Defaults to `None`. |
-| `store_user_data` | bool | Whether to include the `Doc.user_data` and the values of custom extension attributes. Defaults to `False`. |
-| **RETURNS** | `DocBin` | The newly constructed object. |
+| Argument | Description |
+| ----------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `attrs` | List of attributes to serialize. `ORTH` (hash of token text) and `SPACY` (whether the token is followed by whitespace) are always serialized, so they're not required. Defaults to `("ORTH", "TAG", "HEAD", "DEP", "ENT_IOB", "ENT_TYPE", "ENT_KB_ID", "LEMMA", "MORPH", "POS")`. ~~Iterable[str]~~ |
+| `store_user_data` | Whether to write the `Doc.user_data` and the values of custom extension attributes to file/bytes. Defaults to `False`. ~~bool~~ |
+| `docs` | `Doc` objects to add on initialization. ~~Iterable[Doc]~~ |

## DocBin.\_\_len\_\_ {#len tag="method"}

Get the number of `Doc` objects that were added to the `DocBin`.

> #### Example
>
> ```python
> doc_bin = DocBin(attrs=["LEMMA"])
> doc = nlp("This is a document to serialize.")
> doc_bin.add(doc)
> assert len(doc_bin) == 1
> ```

-| Argument | Type | Description |
-| ----------- | ---- | ------------------------------------------- |
-| **RETURNS** | int | The number of `Doc`s added to the `DocBin`. |
+| Argument | Description |
+| ----------- | --------------------------------------------------- |
+| **RETURNS** | The number of `Doc`s added to the `DocBin`. ~~int~~ |

## DocBin.add {#add tag="method"}

@@ -78,9 +79,9 @@ Add a `Doc`'s annotations to the `DocBin` for serialization.
> doc_bin.add(doc) > ``` -| Argument | Type | Description | -| -------- | ----- | ------------------------ | -| `doc` | `Doc` | The `Doc` object to add. | +| Argument | Description | +| -------- | -------------------------------- | +| `doc` | The `Doc` object to add. ~~Doc~~ | ## DocBin.get_docs {#get_docs tag="method"} @@ -92,15 +93,15 @@ Recover `Doc` objects from the annotations, using the given vocab. > docs = list(doc_bin.get_docs(nlp.vocab)) > ``` -| Argument | Type | Description | -| ---------- | ------- | ------------------ | -| `vocab` | `Vocab` | The shared vocab. | -| **YIELDS** | `Doc` | The `Doc` objects. | +| Argument | Description | +| ---------- | --------------------------- | +| `vocab` | The shared vocab. ~~Vocab~~ | +| **YIELDS** | The `Doc` objects. ~~Doc~~ | ## DocBin.merge {#merge tag="method"} Extend the annotations of this `DocBin` with the annotations from another. Will -raise an error if the pre-defined attrs of the two `DocBin`s don't match. +raise an error if the pre-defined `attrs` of the two `DocBin`s don't match. > #### Example > @@ -113,9 +114,9 @@ raise an error if the pre-defined attrs of the two `DocBin`s don't match. > assert len(doc_bin1) == 2 > ``` -| Argument | Type | Description | -| -------- | -------- | ------------------------------------------- | -| `other` | `DocBin` | The `DocBin` to merge into the current bin. | +| Argument | Description | +| -------- | ------------------------------------------------------ | +| `other` | The `DocBin` to merge into the current bin. ~~DocBin~~ | ## DocBin.to_bytes {#to_bytes tag="method"} @@ -124,13 +125,14 @@ Serialize the `DocBin`'s annotations to a bytestring. > #### Example > > ```python -> doc_bin = DocBin(attrs=["DEP", "HEAD"]) +> docs = [nlp("Hello world!")] +> doc_bin = DocBin(docs=docs) > doc_bin_bytes = doc_bin.to_bytes() > ``` -| Argument | Type | Description | -| ----------- | ----- | ------------------------ | -| **RETURNS** | bytes | The serialized `DocBin`. | +| Argument | Description | +| ----------- | ---------------------------------- | +| **RETURNS** | The serialized `DocBin`. ~~bytes~~ | ## DocBin.from_bytes {#from_bytes tag="method"} @@ -143,7 +145,40 @@ Deserialize the `DocBin`'s annotations from a bytestring. > new_doc_bin = DocBin().from_bytes(doc_bin_bytes) > ``` -| Argument | Type | Description | -| ------------ | -------- | ---------------------- | -| `bytes_data` | bytes | The data to load from. | -| **RETURNS** | `DocBin` | The loaded `DocBin`. | +| Argument | Description | +| ------------ | -------------------------------- | +| `bytes_data` | The data to load from. ~~bytes~~ | +| **RETURNS** | The loaded `DocBin`. ~~DocBin~~ | + +## DocBin.to_disk {#to_disk tag="method" new="3"} + +Save the serialized `DocBin` to a file. Typically uses the `.spacy` extension +and the result can be used as the input data for +[`spacy train`](/api/cli#train). + +> #### Example +> +> ```python +> docs = [nlp("Hello world!")] +> doc_bin = DocBin(docs=docs) +> doc_bin.to_disk("./data.spacy") +> ``` + +| Argument | Description | +| -------- | -------------------------------------------------------------------------- | +| `path` | The file path, typically with the `.spacy` extension. ~~Union[str, Path]~~ | + +## DocBin.from_disk {#from_disk tag="method" new="3"} + +Load a serialized `DocBin` from a file. Typically uses the `.spacy` extension. 
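A minimal round-trip sketch tying `to_disk`, `from_disk` and `get_docs` together, assuming spaCy v3, a blank English pipeline and an illustrative `./data.spacy` path:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
# Collect a few processed docs and write them to disk in the .spacy format
doc_bin = DocBin(docs=nlp.pipe(["Hello world!", "This is a test."]))
doc_bin.to_disk("./data.spacy")

# Later, e.g. before training: load the annotations and recover Doc objects
doc_bin = DocBin().from_disk("./data.spacy")
docs = list(doc_bin.get_docs(nlp.vocab))
assert len(docs) == 2
```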
+ +> #### Example +> +> ```python +> doc_bin = DocBin().from_disk("./data.spacy") +> ``` + +| Argument | Description | +| ----------- | -------------------------------------------------------------------------- | +| `path` | The file path, typically with the `.spacy` extension. ~~Union[str, Path]~~ | +| **RETURNS** | The loaded `DocBin`. ~~DocBin~~ | diff --git a/website/docs/api/entitylinker.md b/website/docs/api/entitylinker.md index a9d6a31a5..683927b1c 100644 --- a/website/docs/api/entitylinker.md +++ b/website/docs/api/entitylinker.md @@ -1,59 +1,100 @@ --- title: EntityLinker -teaser: - Functionality to disambiguate a named entity in text to a unique knowledge - base identifier. tag: class -source: spacy/pipeline/pipes.pyx +source: spacy/pipeline/entity_linker.py new: 2.2 +teaser: 'Pipeline component for named entity linking and disambiguation' +api_base_class: /api/pipe +api_string_name: entity_linker +api_trainable: true --- -This class is a subclass of `Pipe` and follows the same API. The pipeline -component is available in the [processing pipeline](/usage/processing-pipelines) -via the ID `"entity_linker"`. +An `EntityLinker` component disambiguates textual mentions (tagged as named +entities) to unique identifiers, grounding the named entities into the "real +world". It requires a `KnowledgeBase`, as well as a function to generate +plausible candidates from that `KnowledgeBase` given a certain textual mention, +and a machine learning model to pick the right candidate, given the local +context of the mention. -## EntityLinker.Model {#model tag="classmethod"} +## Config and implementation {#config} -Initialize a model for the pipe. The model should implement the -`thinc.neural.Model` API, and should contain a field `tok2vec` that contains the -context encoder. Wrappers are under development for most major machine learning -libraries. - -| Name | Type | Description | -| ----------- | ------ | ------------------------------------- | -| `**kwargs` | - | Parameters for initializing the model | -| **RETURNS** | object | The initialized model. | - -## EntityLinker.\_\_init\_\_ {#init tag="method"} - -Create a new pipeline instance. In your application, you would normally use a -shortcut for this and instantiate the component using its string name and -[`nlp.create_pipe`](/api/language#create_pipe). +The default config is defined by the pipeline component factory and describes +how the component should be configured. You can override its settings via the +`config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your +[`config.cfg` for training](/usage/training#config). See the +[model architectures](/api/architectures) documentation for details on the +architectures and their arguments and hyperparameters. 
> #### Example > > ```python -> # Construction via create_pipe -> entity_linker = nlp.create_pipe("entity_linker") +> from spacy.pipeline.entity_linker import DEFAULT_NEL_MODEL +> config = { +> "labels_discard": [], +> "incl_prior": True, +> "incl_context": True, +> "model": DEFAULT_NEL_MODEL, +> "entity_vector_length": 64, +> "get_candidates": {'@misc': 'spacy.CandidateGenerator.v1'}, +> } +> nlp.add_pipe("entity_linker", config=config) +> ``` + +| Setting | Description | +| ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `labels_discard` | NER labels that will automatically get a "NIL" prediction. Defaults to `[]`. ~~Iterable[str]~~ | +| `incl_prior` | Whether or not to include prior probabilities from the KB in the model. Defaults to `True`. ~~bool~~ | +| `incl_context` | Whether or not to include the local context in the model. Defaults to `True`. ~~bool~~ | +| `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. Defaults to [EntityLinker](/api/architectures#EntityLinker). ~~Model~~ | +| `entity_vector_length` | Size of encoding vectors in the KB. Defaults to `64`. ~~int~~ | +| `get_candidates` | Function that generates plausible candidates for a given `Span` object. Defaults to [CandidateGenerator](/api/architectures#CandidateGenerator), a function looking up exact, case-dependent aliases in the KB. ~~Callable[[KnowledgeBase, Span], Iterable[Candidate]]~~ | + +```python +%%GITHUB_SPACY/spacy/pipeline/entity_linker.py +``` + +## EntityLinker.\_\_init\_\_ {#init tag="method"} + +> #### Example +> +> ```python +> # Construction via add_pipe with default model +> entity_linker = nlp.add_pipe("entity_linker") +> +> # Construction via add_pipe with custom model +> config = {"model": {"@architectures": "my_el.v1"}} +> entity_linker = nlp.add_pipe("entity_linker", config=config) > > # Construction from class > from spacy.pipeline import EntityLinker -> entity_linker = EntityLinker(nlp.vocab) -> entity_linker.from_disk("/path/to/model") +> entity_linker = EntityLinker(nlp.vocab, model) > ``` -| Name | Type | Description | -| -------------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- | -| `vocab` | `Vocab` | The shared vocabulary. | -| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. | -| `hidden_width` | int | Width of the hidden layer of the entity linking model, defaults to `128`. | -| `incl_prior` | bool | Whether or not to include prior probabilities in the model. Defaults to `True`. | -| `incl_context` | bool | Whether or not to include the local context in the model (if not: only prior probabilities are used). Defaults to `True`. | -| **RETURNS** | `EntityLinker` | The newly constructed object. | +Create a new pipeline instance. In your application, you would normally use a +shortcut for this and instantiate the component using its string name and +[`nlp.add_pipe`](/api/language#add_pipe). + +Upon construction of the entity linker component, an empty knowledge base is +constructed with the provided `entity_vector_length`. 
If you want to use a +custom knowledge base, you should either call +[`set_kb`](/api/entitylinker#set_kb) or provide a `kb_loader` in the +[`initialize`](/api/entitylinker#initialize) call. + +| Name | Description | +| ---------------------- | -------------------------------------------------------------------------------------------------------------------------------- | +| `vocab` | The shared vocabulary. ~~Vocab~~ | +| `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model~~ | +| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ | +| _keyword-only_ | | +| `entity_vector_length` | Size of encoding vectors in the KB. ~~int~~ | +| `get_candidates` | Function that generates plausible candidates for a given `Span` object. ~~Callable[[KnowledgeBase, Span], Iterable[Candidate]]~~ | +| `labels_discard` | NER labels that will automatically get a `"NIL"` prediction. ~~Iterable[str]~~ | +| `incl_prior` | Whether or not to include prior probabilities from the KB in the model. ~~bool~~ | +| `incl_context` | Whether or not to include the local context in the model. ~~bool~~ | ## EntityLinker.\_\_call\_\_ {#call tag="method"} -Apply the pipe to one document. The document is modified in place, and returned. +Apply the pipe to one document. The document is modified in place and returned. This usually happens under the hood when the `nlp` object is called on a text and all pipeline components are applied to the `Doc` in order. Both [`__call__`](/api/entitylinker#call) and [`pipe`](/api/entitylinker#pipe) @@ -63,16 +104,16 @@ delegate to the [`predict`](/api/entitylinker#predict) and > #### Example > > ```python -> entity_linker = EntityLinker(nlp.vocab) > doc = nlp("This is a sentence.") +> entity_linker = nlp.add_pipe("entity_linker") > # This usually happens under the hood > processed = entity_linker(doc) > ``` -| Name | Type | Description | -| ----------- | ----- | ------------------------ | -| `doc` | `Doc` | The document to process. | -| **RETURNS** | `Doc` | The processed document. | +| Name | Description | +| ----------- | -------------------------------- | +| `doc` | The document to process. ~~Doc~~ | +| **RETURNS** | The processed document. ~~Doc~~ | ## EntityLinker.pipe {#pipe tag="method"} @@ -86,32 +127,93 @@ applied to the `Doc` in order. Both [`__call__`](/api/entitylinker#call) and > #### Example > > ```python -> entity_linker = EntityLinker(nlp.vocab) +> entity_linker = nlp.add_pipe("entity_linker") > for doc in entity_linker.pipe(docs, batch_size=50): > pass > ``` -| Name | Type | Description | -| ------------ | -------- | ------------------------------------------------------ | -| `stream` | iterable | A stream of documents. | -| `batch_size` | int | The number of texts to buffer. Defaults to `128`. | -| **YIELDS** | `Doc` | Processed documents in the order of the original text. | +| Name | Description | +| -------------- | ------------------------------------------------------------- | +| `stream` | A stream of documents. ~~Iterable[Doc]~~ | +| _keyword-only_ | | +| `batch_size` | The number of documents to buffer. Defaults to `128`. ~~int~~ | +| **YIELDS** | The processed documents in order. ~~Doc~~ | -## EntityLinker.predict {#predict tag="method"} +## EntityLinker.set_kb {#initialize tag="method" new="3"} -Apply the pipeline's model to a batch of docs, without modifying them. 
+The `kb_loader` should be a function that takes a `Vocab` instance and creates
+the `KnowledgeBase`, ensuring that the strings of the knowledge base are synced
+with the current vocab.

> #### Example
>
> ```python
-> entity_linker = EntityLinker(nlp.vocab)
-> kb_ids, tensors = entity_linker.predict([doc1, doc2])
+> def create_kb(vocab):
+>     kb = KnowledgeBase(vocab, entity_vector_length=128)
+>     kb.add_entity(...)
+>     kb.add_alias(...)
+>     return kb
+> entity_linker = nlp.add_pipe("entity_linker")
+> entity_linker.set_kb(create_kb)
> ```

-| Name | Type | Description |
-| ----------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `docs` | iterable | The documents to predict. |
-| **RETURNS** | tuple | A `(kb_ids, tensors)` tuple where `kb_ids` are the model's predicted KB identifiers for the entities in the `docs`, and `tensors` are the token representations used to predict these identifiers. |
+| Name | Description |
+| ----------- | ---------------------------------------------------------------------------------------------------------------- |
+| `kb_loader` | Function that creates a [`KnowledgeBase`](/api/kb) from a `Vocab` instance. ~~Callable[[Vocab], KnowledgeBase]~~ |
+
+## EntityLinker.initialize {#initialize tag="method" new="3"}
+
+Initialize the component for training. `get_examples` should be a function that
+returns an iterable of [`Example`](/api/example) objects. The data examples are
+used to **initialize the model** of the component and can either be the full
+training data or a representative sample. Initialization includes validating the
+network,
+[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
+setting up the label scheme based on the data. This method is typically called
+by [`Language.initialize`](/api/language#initialize).
+
+Optionally, a `kb_loader` argument may be specified to change the internal
+knowledge base. This argument should be a function that takes a `Vocab` instance
+and creates the `KnowledgeBase`, ensuring that the strings of the knowledge base
+are synced with the current vocab.
+
+
+
+This method was previously called `begin_training`.
+
+
+
+> #### Example
+>
+> ```python
+> entity_linker = nlp.add_pipe("entity_linker")
+> entity_linker.initialize(lambda: [], nlp=nlp, kb_loader=my_kb)
+> ```
+
+| Name | Description |
+| -------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
+| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
+| _keyword-only_ | |
+| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
+| `kb_loader` | Function that creates a [`KnowledgeBase`](/api/kb) from a `Vocab` instance. ~~Callable[[Vocab], KnowledgeBase]~~ |
+
+## EntityLinker.predict {#predict tag="method"}
+
+Apply the component's model to a batch of [`Doc`](/api/doc) objects, without
+modifying them. Returns the KB IDs for each entity in each doc, including `NIL`
+if there is no prediction.
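+For a trained pipeline that runs an `ner` component before the entity linker,
+the returned IDs line up with the entities in the processed docs and can be
+applied back with [`set_annotations`](/api/entitylinker#set_annotations). A
+rough sketch (the texts and the trained pipeline are assumptions, not part of
+the API):
+
+```python
+entity_linker = nlp.get_pipe("entity_linker")
+doc1 = nlp("Douglas Adams wrote The Hitchhiker's Guide to the Galaxy.")
+doc2 = nlp("She bought some sunscreen.")
+kb_ids = entity_linker.predict([doc1, doc2])
+entity_linker.set_annotations([doc1, doc2], kb_ids)
+for ent in doc1.ents:
+    # Entities without a plausible candidate are assigned the ID "NIL"
+    print(ent.text, ent.kb_id_)
+```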
+
+> #### Example
+>
+> ```python
+> entity_linker = nlp.add_pipe("entity_linker")
+> kb_ids = entity_linker.predict([doc1, doc2])
+> ```
+
+| Name | Description |
+| ----------- | --------------------------------------------------------------------------- |
+| `docs` | The documents to predict. ~~Iterable[Doc]~~ |
+| **RETURNS** | The predicted KB identifiers for the entities in the `docs`. ~~List[str]~~ |

## EntityLinker.set_annotations {#set_annotations tag="method"}

@@ -121,100 +223,54 @@ entities.

> #### Example
>
> ```python
-> entity_linker = EntityLinker(nlp.vocab)
-> kb_ids, tensors = entity_linker.predict([doc1, doc2])
-> entity_linker.set_annotations([doc1, doc2], kb_ids, tensors)
+> entity_linker = nlp.add_pipe("entity_linker")
+> kb_ids = entity_linker.predict([doc1, doc2])
+> entity_linker.set_annotations([doc1, doc2], kb_ids)
> ```

-| Name | Type | Description |
-| --------- | -------- | ------------------------------------------------------------------------------------------------- |
-| `docs` | iterable | The documents to modify. |
-| `kb_ids` | iterable | The knowledge base identifiers for the entities in the docs, predicted by `EntityLinker.predict`. |
-| `tensors` | iterable | The token representations used to predict the identifiers. |
+| Name | Description |
+| -------- | --------------------------------------------------------------------------------------------------------------- |
+| `docs` | The documents to modify. ~~Iterable[Doc]~~ |
+| `kb_ids` | The knowledge base identifiers for the entities in the docs, predicted by `EntityLinker.predict`. ~~List[str]~~ |

## EntityLinker.update {#update tag="method"}

-Learn from a batch of documents and gold-standard information, updating both the
+Learn from a batch of [`Example`](/api/example) objects, updating both the
pipe's entity linking model and context encoder. Delegates to
-[`predict`](/api/entitylinker#predict) and
-[`get_loss`](/api/entitylinker#get_loss).
+[`predict`](/api/entitylinker#predict).

> #### Example
>
> ```python
-> entity_linker = EntityLinker(nlp.vocab)
-> losses = {}
-> optimizer = nlp.begin_training()
-> entity_linker.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)
+> entity_linker = nlp.add_pipe("entity_linker")
+> optimizer = nlp.initialize()
+> losses = entity_linker.update(examples, sgd=optimizer)
> ```

-| Name | Type | Description |
-| -------- | -------- | --------------------------------------------------------------------------------------------------------- |
-| `docs` | iterable | A batch of documents to learn from. |
-| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. |
-| `drop` | float | The dropout rate, used both for the EL model and the context encoder. |
-| `sgd` | callable | The optimizer for the EL model. Should take two arguments `weights` and `gradient`, and an optional ID. |
-| `losses` | dict | Optional record of the loss during training. The value keyed by the model's name is updated. |
+| Name | Description |
+| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
+| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
+| _keyword-only_ | |
+| `drop` | The dropout rate. ~~float~~ |
+| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ |
+| `sgd` | An optimizer. 
Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ | +| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ | +| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ | -## EntityLinker.get_loss {#get_loss tag="method"} +## EntityLinker.score {#score tag="method" new="3"} -Find the loss and gradient of loss for the entities in a batch of documents and -their predicted scores. +Score a batch of examples. > #### Example > > ```python -> entity_linker = EntityLinker(nlp.vocab) -> kb_ids, tensors = entity_linker.predict(docs) -> loss, d_loss = entity_linker.get_loss(docs, [gold1, gold2], kb_ids, tensors) +> scores = entity_linker.score(examples) > ``` -| Name | Type | Description | -| ----------- | -------- | ------------------------------------------------------------ | -| `docs` | iterable | The batch of documents. | -| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. | -| `kb_ids` | iterable | KB identifiers representing the model's predictions. | -| `tensors` | iterable | The token representations used to predict the identifiers | -| **RETURNS** | tuple | The loss and the gradient, i.e. `(loss, gradient)`. | - -## EntityLinker.set_kb {#set_kb tag="method"} - -Define the knowledge base (KB) used for disambiguating named entities to KB -identifiers. - -> #### Example -> -> ```python -> entity_linker = EntityLinker(nlp.vocab) -> entity_linker.set_kb(kb) -> ``` - -| Name | Type | Description | -| ---- | --------------- | ------------------------------- | -| `kb` | `KnowledgeBase` | The [`KnowledgeBase`](/api/kb). | - -## EntityLinker.begin_training {#begin_training tag="method"} - -Initialize the pipe for training, using data examples if available. If no model -has been initialized yet, the model is added. Before calling this method, a -knowledge base should have been defined with -[`set_kb`](/api/entitylinker#set_kb). - -> #### Example -> -> ```python -> entity_linker = EntityLinker(nlp.vocab) -> entity_linker.set_kb(kb) -> nlp.add_pipe(entity_linker, last=True) -> optimizer = entity_linker.begin_training(pipeline=nlp.pipeline) -> ``` - -| Name | Type | Description | -| ------------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `gold_tuples` | iterable | Optional gold-standard annotations from which to construct [`GoldParse`](/api/goldparse) objects. | -| `pipeline` | list | Optional list of pipeline components that this component is part of. | -| `sgd` | callable | An optional optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. Will be created via [`EntityLinker`](/api/entitylinker#create_optimizer) if not set. | -| **RETURNS** | callable | An optimizer. | +| Name | Description | +| ----------- | ---------------------------------------------------------------------------------------------- | +| `examples` | The examples to score. ~~Iterable[Example]~~ | +| **RETURNS** | The scores, produced by [`Scorer.score_links`](/api/scorer#score_links) . ~~Dict[str, float]~~ | ## EntityLinker.create_optimizer {#create_optimizer tag="method"} @@ -223,29 +279,30 @@ Create an optimizer for the pipeline component. 
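+The optimizer returned here is what you would pass as `sgd` to
+[`update`](/api/entitylinker#update) when training the component manually
+instead of through [`spacy train`](/api/cli#train). A minimal sketch, assuming
+`train_examples` is a list of [`Example`](/api/example) objects and the
+component has already been initialized:
+
+```python
+import random
+
+entity_linker = nlp.get_pipe("entity_linker")
+optimizer = entity_linker.create_optimizer()
+losses = {}
+for epoch in range(10):
+    random.shuffle(train_examples)
+    entity_linker.update(train_examples, drop=0.2, sgd=optimizer, losses=losses)
+    # The loss is accumulated under the component's name
+    print(epoch, losses["entity_linker"])
+```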
> #### Example > > ```python -> entity_linker = EntityLinker(nlp.vocab) +> entity_linker = nlp.add_pipe("entity_linker") > optimizer = entity_linker.create_optimizer() > ``` -| Name | Type | Description | -| ----------- | -------- | -------------- | -| **RETURNS** | callable | The optimizer. | +| Name | Description | +| ----------- | ---------------------------- | +| **RETURNS** | The optimizer. ~~Optimizer~~ | ## EntityLinker.use_params {#use_params tag="method, contextmanager"} -Modify the pipe's EL model, to use the given parameter values. +Modify the pipe's model, to use the given parameter values. At the end of the +context, the original parameters are restored. > #### Example > > ```python -> entity_linker = EntityLinker(nlp.vocab) +> entity_linker = nlp.add_pipe("entity_linker") > with entity_linker.use_params(optimizer.averages): > entity_linker.to_disk("/best_model") > ``` -| Name | Type | Description | -| -------- | ---- | ---------------------------------------------------------------------------------------------------------- | -| `params` | dict | The parameter values to use in the model. At the end of the context, the original parameters are restored. | +| Name | Description | +| -------- | -------------------------------------------------- | +| `params` | The parameter values to use in the model. ~~dict~~ | ## EntityLinker.to_disk {#to_disk tag="method"} @@ -254,14 +311,15 @@ Serialize the pipe to disk. > #### Example > > ```python -> entity_linker = EntityLinker(nlp.vocab) +> entity_linker = nlp.add_pipe("entity_linker") > entity_linker.to_disk("/path/to/entity_linker") > ``` -| Name | Type | Description | -| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- | -| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | -| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | +| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | ## EntityLinker.from_disk {#from_disk tag="method"} @@ -270,15 +328,16 @@ Load the pipe from disk. Modifies the object in place and returns it. > #### Example > > ```python -> entity_linker = EntityLinker(nlp.vocab) +> entity_linker = nlp.add_pipe("entity_linker") > entity_linker.from_disk("/path/to/entity_linker") > ``` -| Name | Type | Description | -| ----------- | ---------------- | -------------------------------------------------------------------------- | -| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. | -| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | -| **RETURNS** | `EntityLinker` | The modified `EntityLinker` object. | +| Name | Description | +| -------------- | ----------------------------------------------------------------------------------------------- | +| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. 
~~Union[str, Path]~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The modified `EntityLinker` object. ~~EntityLinker~~ | ## Serialization fields {#serialization-fields} diff --git a/website/docs/api/entityrecognizer.md b/website/docs/api/entityrecognizer.md index 9345ee249..6ac0d163f 100644 --- a/website/docs/api/entityrecognizer.md +++ b/website/docs/api/entityrecognizer.md @@ -1,52 +1,89 @@ --- title: EntityRecognizer tag: class -source: spacy/pipeline/pipes.pyx +source: spacy/pipeline/ner.pyx +teaser: 'Pipeline component for named entity recognition' +api_base_class: /api/pipe +api_string_name: ner +api_trainable: true --- -This class is a subclass of `Pipe` and follows the same API. The pipeline -component is available in the [processing pipeline](/usage/processing-pipelines) -via the ID `"ner"`. +A transition-based named entity recognition component. The entity recognizer +identifies **non-overlapping labelled spans** of tokens. The transition-based +algorithm used encodes certain assumptions that are effective for "traditional" +named entity recognition tasks, but may not be a good fit for every span +identification problem. Specifically, the loss function optimizes for **whole +entity accuracy**, so if your inter-annotator agreement on boundary tokens is +low, the component will likely perform poorly on your problem. The +transition-based algorithm also assumes that the most decisive information about +your entities will be close to their initial tokens. If your entities are long +and characterized by tokens in their middle, the component will likely not be a +good fit for your task. -## EntityRecognizer.Model {#model tag="classmethod"} +## Config and implementation {#config} -Initialize a model for the pipe. The model should implement the -`thinc.neural.Model` API. Wrappers are under development for most major machine -learning libraries. - -| Name | Type | Description | -| ----------- | ------ | ------------------------------------- | -| `**kwargs` | - | Parameters for initializing the model | -| **RETURNS** | object | The initialized model. | - -## EntityRecognizer.\_\_init\_\_ {#init tag="method"} - -Create a new pipeline instance. In your application, you would normally use a -shortcut for this and instantiate the component using its string name and -[`nlp.create_pipe`](/api/language#create_pipe). +The default config is defined by the pipeline component factory and describes +how the component should be configured. You can override its settings via the +`config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your +[`config.cfg` for training](/usage/training#config). See the +[model architectures](/api/architectures) documentation for details on the +architectures and their arguments and hyperparameters. > #### Example > > ```python -> # Construction via create_pipe -> ner = nlp.create_pipe("ner") +> from spacy.pipeline.ner import DEFAULT_NER_MODEL +> config = { +> "moves": None, +> "update_with_oracle_cut_size": 100, +> "model": DEFAULT_NER_MODEL, +> } +> nlp.add_pipe("ner", config=config) +> ``` + +| Setting | Description | +| ----------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `moves` | A list of transition names. 
Inferred from the data if not provided. Defaults to `None`. ~~Optional[List[str]]~~ | +| `update_with_oracle_cut_size` | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. Defaults to `100`. ~~int~~ | +| `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. Defaults to [TransitionBasedParser](/api/architectures#TransitionBasedParser). ~~Model[List[Doc], List[Floats2d]]~~ | + +```python +%%GITHUB_SPACY/spacy/pipeline/ner.pyx +``` + +## EntityRecognizer.\_\_init\_\_ {#init tag="method"} + +> #### Example +> +> ```python +> # Construction via add_pipe with default model +> ner = nlp.add_pipe("ner") +> +> # Construction via add_pipe with custom model +> config = {"model": {"@architectures": "my_ner"}} +> parser = nlp.add_pipe("ner", config=config) > > # Construction from class > from spacy.pipeline import EntityRecognizer -> ner = EntityRecognizer(nlp.vocab) -> ner.from_disk("/path/to/model") +> ner = EntityRecognizer(nlp.vocab, model) > ``` -| Name | Type | Description | -| ----------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- | -| `vocab` | `Vocab` | The shared vocabulary. | -| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. | -| `**cfg` | - | Configuration parameters. | -| **RETURNS** | `EntityRecognizer` | The newly constructed object. | +Create a new pipeline instance. In your application, you would normally use a +shortcut for this and instantiate the component using its string name and +[`nlp.add_pipe`](/api/language#add_pipe). + +| Name | Description | +| ----------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `vocab` | The shared vocabulary. ~~Vocab~~ | +| `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ | +| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ | +| `moves` | A list of transition names. Inferred from the data if not provided. ~~Optional[List[str]]~~ | +| _keyword-only_ | | +| `update_with_oracle_cut_size` | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. `100` is a good default. ~~int~~ | ## EntityRecognizer.\_\_call\_\_ {#call tag="method"} -Apply the pipe to one document. The document is modified in place, and returned. +Apply the pipe to one document. The document is modified in place and returned. This usually happens under the hood when the `nlp` object is called on a text and all pipeline components are applied to the `Doc` in order. Both [`__call__`](/api/entityrecognizer#call) and @@ -57,16 +94,16 @@ and all pipeline components are applied to the `Doc` in order. 
Both > #### Example > > ```python -> ner = EntityRecognizer(nlp.vocab) > doc = nlp("This is a sentence.") +> ner = nlp.add_pipe("ner") > # This usually happens under the hood > processed = ner(doc) > ``` -| Name | Type | Description | -| ----------- | ----- | ------------------------ | -| `doc` | `Doc` | The document to process. | -| **RETURNS** | `Doc` | The processed document. | +| Name | Description | +| ----------- | -------------------------------- | +| `doc` | The document to process. ~~Doc~~ | +| **RETURNS** | The processed document. ~~Doc~~ | ## EntityRecognizer.pipe {#pipe tag="method"} @@ -80,73 +117,118 @@ applied to the `Doc` in order. Both [`__call__`](/api/entityrecognizer#call) and > #### Example > > ```python -> ner = EntityRecognizer(nlp.vocab) +> ner = nlp.add_pipe("ner") > for doc in ner.pipe(docs, batch_size=50): > pass > ``` -| Name | Type | Description | -| ------------ | -------- | ------------------------------------------------------ | -| `stream` | iterable | A stream of documents. | -| `batch_size` | int | The number of texts to buffer. Defaults to `128`. | -| **YIELDS** | `Doc` | Processed documents in the order of the original text. | +| Name | Description | +| -------------- | ------------------------------------------------------------- | +| `docs` | A stream of documents. ~~Iterable[Doc]~~ | +| _keyword-only_ | | +| `batch_size` | The number of documents to buffer. Defaults to `128`. ~~int~~ | +| **YIELDS** | The processed documents in order. ~~Doc~~ | + +## EntityRecognizer.initialize {#initialize tag="method" new="3"} + +Initialize the component for training. `get_examples` should be a function that +returns an iterable of [`Example`](/api/example) objects. The data examples are +used to **initialize the model** of the component and can either be the full +training data or a representative sample. Initialization includes validating the +network, +[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and +setting up the label scheme based on the data. This method is typically called +by [`Language.initialize`](/api/language#initialize) and lets you customize +arguments it receives via the +[`[initialize.components]`](/api/data-formats#config-initialize) block in the +config. + + + +This method was previously called `begin_training`. + + + +> #### Example +> +> ```python +> ner = nlp.add_pipe("ner") +> ner.initialize(lambda: [], nlp=nlp) +> ``` +> +> ```ini +> ### config.cfg +> [initialize.components.ner] +> +> [initialize.components.ner.labels] +> @readers = "spacy.read_labels.v1" +> path = "corpus/labels/ner.json +> ``` + +| Name | Description | +| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ | +| _keyword-only_ | | +| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ | +| `labels` | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. 
To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Dict[str, Dict[str, int]]]~~ | ## EntityRecognizer.predict {#predict tag="method"} -Apply the pipeline's model to a batch of docs, without modifying them. +Apply the component's model to a batch of [`Doc`](/api/doc) objects, without +modifying them. > #### Example > > ```python -> ner = EntityRecognizer(nlp.vocab) -> scores, tensors = ner.predict([doc1, doc2]) +> ner = nlp.add_pipe("ner") +> scores = ner.predict([doc1, doc2]) > ``` -| Name | Type | Description | -| ----------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `docs` | iterable | The documents to predict. | -| **RETURNS** | list | List of `syntax.StateClass` objects. `syntax.StateClass` is a helper class for the parse state (internal). | +| Name | Description | +| ----------- | ------------------------------------------------------------- | +| `docs` | The documents to predict. ~~Iterable[Doc]~~ | +| **RETURNS** | A helper class for the parse state (internal). ~~StateClass~~ | ## EntityRecognizer.set_annotations {#set_annotations tag="method"} -Modify a batch of documents, using pre-computed scores. +Modify a batch of [`Doc`](/api/doc) objects, using pre-computed scores. > #### Example > > ```python -> ner = EntityRecognizer(nlp.vocab) -> scores, tensors = ner.predict([doc1, doc2]) -> ner.set_annotations([doc1, doc2], scores, tensors) +> ner = nlp.add_pipe("ner") +> scores = ner.predict([doc1, doc2]) +> ner.set_annotations([doc1, doc2], scores) > ``` -| Name | Type | Description | -| --------- | -------- | ---------------------------------------------------------- | -| `docs` | iterable | The documents to modify. | -| `scores` | - | The scores to set, produced by `EntityRecognizer.predict`. | -| `tensors` | iterable | The token representations used to predict the scores. | +| Name | Description | +| -------- | ------------------------------------------------------------------------------------------------------------------------------------- | +| `docs` | The documents to modify. ~~Iterable[Doc]~~ | +| `scores` | The scores to set, produced by `EntityRecognizer.predict`. Returns an internal helper class for the parse state. ~~List[StateClass]~~ | ## EntityRecognizer.update {#update tag="method"} -Learn from a batch of documents and gold-standard information, updating the -pipe's model. Delegates to [`predict`](/api/entityrecognizer#predict) and +Learn from a batch of [`Example`](/api/example) objects, updating the pipe's +model. Delegates to [`predict`](/api/entityrecognizer#predict) and [`get_loss`](/api/entityrecognizer#get_loss). > #### Example > > ```python -> ner = EntityRecognizer(nlp.vocab) -> losses = {} -> optimizer = nlp.begin_training() -> ner.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer) +> ner = nlp.add_pipe("ner") +> optimizer = nlp.initialize() +> losses = ner.update(examples, sgd=optimizer) > ``` -| Name | Type | Description | -| -------- | -------- | -------------------------------------------------------------------------------------------- | -| `docs` | iterable | A batch of documents to learn from. 
| -| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. | -| `drop` | float | The dropout rate. | -| `sgd` | callable | The optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. | -| `losses` | dict | Optional record of the loss during training. The value keyed by the model's name is updated. | +| Name | Description | +| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- | +| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ | +| _keyword-only_ | | +| `drop` | The dropout rate. ~~float~~ | +| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ | +| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ | +| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ | +| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ | ## EntityRecognizer.get_loss {#get_loss tag="method"} @@ -156,37 +238,31 @@ predicted scores. > #### Example > > ```python -> ner = EntityRecognizer(nlp.vocab) -> scores = ner.predict([doc1, doc2]) -> loss, d_loss = ner.get_loss([doc1, doc2], [gold1, gold2], scores) +> ner = nlp.add_pipe("ner") +> scores = ner.predict([eg.predicted for eg in examples]) +> loss, d_loss = ner.get_loss(examples, scores) > ``` -| Name | Type | Description | -| ----------- | -------- | ------------------------------------------------------------ | -| `docs` | iterable | The batch of documents. | -| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. | -| `scores` | - | Scores representing the model's predictions. | -| **RETURNS** | tuple | The loss and the gradient, i.e. `(loss, gradient)`. | +| Name | Description | +| ----------- | --------------------------------------------------------------------------- | +| `examples` | The batch of examples. ~~Iterable[Example]~~ | +| `scores` | Scores representing the model's predictions. ~~StateClass~~ | +| **RETURNS** | The loss and the gradient, i.e. `(loss, gradient)`. ~~Tuple[float, float]~~ | -## EntityRecognizer.begin_training {#begin_training tag="method"} +## EntityRecognizer.score {#score tag="method" new="3"} -Initialize the pipe for training, using data examples if available. If no model -has been initialized yet, the model is added. +Score a batch of examples. > #### Example > > ```python -> ner = EntityRecognizer(nlp.vocab) -> nlp.pipeline.append(ner) -> optimizer = ner.begin_training(pipeline=nlp.pipeline) +> scores = ner.score(examples) > ``` -| Name | Type | Description | -| ------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `gold_tuples` | iterable | Optional gold-standard annotations from which to construct [`GoldParse`](/api/goldparse) objects. | -| `pipeline` | list | Optional list of pipeline components that this component is part of. | -| `sgd` | callable | An optional optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. Will be created via [`EntityRecognizer`](/api/entityrecognizer#create_optimizer) if not set. | -| **RETURNS** | callable | An optimizer. 
| +| Name | Description | +| ----------- | --------------------------------------------------------- | +| `examples` | The examples to score. ~~Iterable[Example]~~ | +| **RETURNS** | The scores. ~~Dict[str, Union[float, Dict[str, float]]]~~ | ## EntityRecognizer.create_optimizer {#create_optimizer tag="method"} @@ -195,17 +271,18 @@ Create an optimizer for the pipeline component. > #### Example > > ```python -> ner = EntityRecognizer(nlp.vocab) +> ner = nlp.add_pipe("ner") > optimizer = ner.create_optimizer() > ``` -| Name | Type | Description | -| ----------- | -------- | -------------- | -| **RETURNS** | callable | The optimizer. | +| Name | Description | +| ----------- | ---------------------------- | +| **RETURNS** | The optimizer. ~~Optimizer~~ | ## EntityRecognizer.use_params {#use_params tag="method, contextmanager"} -Modify the pipe's model, to use the given parameter values. +Modify the pipe's model, to use the given parameter values. At the end of the +context, the original parameters are restored. > #### Example > @@ -215,24 +292,48 @@ Modify the pipe's model, to use the given parameter values. > ner.to_disk("/best_model") > ``` -| Name | Type | Description | -| -------- | ---- | ---------------------------------------------------------------------------------------------------------- | -| `params` | dict | The parameter values to use in the model. At the end of the context, the original parameters are restored. | +| Name | Description | +| -------- | -------------------------------------------------- | +| `params` | The parameter values to use in the model. ~~dict~~ | ## EntityRecognizer.add_label {#add_label tag="method"} -Add a new label to the pipe. +Add a new label to the pipe. Note that you don't have to call this method if you +provide a **representative data sample** to the [`initialize`](#initialize) +method. In this case, all labels found in the sample will be automatically added +to the model, and the output dimension will be +[inferred](/usage/layers-architectures#thinc-shape-inference) automatically. > #### Example > > ```python -> ner = EntityRecognizer(nlp.vocab) +> ner = nlp.add_pipe("ner") > ner.add_label("MY_LABEL") > ``` -| Name | Type | Description | -| ------- | ------- | ----------------- | -| `label` | unicode | The label to add. | +| Name | Description | +| ----------- | ----------------------------------------------------------- | +| `label` | The label to add. ~~str~~ | +| **RETURNS** | `0` if the label is already present, otherwise `1`. ~~int~~ | + +## EntityRecognizer.set_output {#set_output tag="method"} + +Change the output dimension of the component's model by calling the model's +attribute `resize_output`. This is a function that takes the original model and +the new output dimension `nO`, and changes the model in place. When resizing an +already trained model, care should be taken to avoid the "catastrophic +forgetting" problem. + +> #### Example +> +> ```python +> ner = nlp.add_pipe("ner") +> ner.set_output(512) +> ``` + +| Name | Description | +| ---- | --------------------------------- | +| `nO` | The new output dimension. ~~int~~ | ## EntityRecognizer.to_disk {#to_disk tag="method"} @@ -241,14 +342,15 @@ Serialize the pipe to disk. 
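+A common pattern is to pair this with [`from_disk`](#from_disk) to move a
+single trained component into another pipeline. A small sketch with placeholder
+paths, assuming the receiving pipeline uses a compatible vocabulary:
+
+```python
+import spacy
+
+ner = nlp.get_pipe("ner")
+ner.to_disk("/path/to/ner")
+
+nlp2 = spacy.blank("en")
+new_ner = nlp2.add_pipe("ner")
+new_ner.from_disk("/path/to/ner")
+```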
> #### Example > > ```python -> ner = EntityRecognizer(nlp.vocab) +> ner = nlp.add_pipe("ner") > ner.to_disk("/path/to/ner") > ``` -| Name | Type | Description | -| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- | -| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | -| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | +| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | ## EntityRecognizer.from_disk {#from_disk tag="method"} @@ -257,31 +359,33 @@ Load the pipe from disk. Modifies the object in place and returns it. > #### Example > > ```python -> ner = EntityRecognizer(nlp.vocab) +> ner = nlp.add_pipe("ner") > ner.from_disk("/path/to/ner") > ``` -| Name | Type | Description | -| ----------- | ------------------ | -------------------------------------------------------------------------- | -| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. | -| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | -| **RETURNS** | `EntityRecognizer` | The modified `EntityRecognizer` object. | +| Name | Description | +| -------------- | ----------------------------------------------------------------------------------------------- | +| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The modified `EntityRecognizer` object. ~~EntityRecognizer~~ | ## EntityRecognizer.to_bytes {#to_bytes tag="method"} > #### Example > > ```python -> ner = EntityRecognizer(nlp.vocab) +> ner = nlp.add_pipe("ner") > ner_bytes = ner.to_bytes() > ``` Serialize the pipe to a bytestring. -| Name | Type | Description | -| ----------- | ----- | ------------------------------------------------------------------------- | -| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | -| **RETURNS** | bytes | The serialized form of the `EntityRecognizer` object. | +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------- | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The serialized form of the `EntityRecognizer` object. ~~bytes~~ | ## EntityRecognizer.from_bytes {#from_bytes tag="method"} @@ -291,15 +395,16 @@ Load the pipe from a bytestring. Modifies the object in place and returns it. 
> > ```python > ner_bytes = ner.to_bytes() -> ner = EntityRecognizer(nlp.vocab) +> ner = nlp.add_pipe("ner") > ner.from_bytes(ner_bytes) > ``` -| Name | Type | Description | -| ------------ | ------------------ | ------------------------------------------------------------------------- | -| `bytes_data` | bytes | The data to load from. | -| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | -| **RETURNS** | `EntityRecognizer` | The `EntityRecognizer` object. | +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------- | +| `bytes_data` | The data to load from. ~~bytes~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The `EntityRecognizer` object. ~~EntityRecognizer~~ | ## EntityRecognizer.labels {#labels tag="property"} @@ -312,9 +417,27 @@ The labels currently added to the component. > assert "MY_LABEL" in ner.labels > ``` -| Name | Type | Description | -| ----------- | ----- | ---------------------------------- | -| **RETURNS** | tuple | The labels added to the component. | +| Name | Description | +| ----------- | ------------------------------------------------------ | +| **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ | + +## EntityRecognizer.label_data {#label_data tag="property" new="3"} + +The labels currently added to the component and their internal meta information. +This is the data generated by [`init labels`](/api/cli#init-labels) and used by +[`EntityRecognizer.initialize`](/api/entityrecognizer#initialize) to initialize +the model with a pre-defined label set. + +> #### Example +> +> ```python +> labels = ner.label_data +> ner.initialize(lambda: [], nlp=nlp, labels=labels) +> ``` + +| Name | Description | +| ----------- | ------------------------------------------------------------------------------- | +| **RETURNS** | The label data added to the component. ~~Dict[str, Dict[str, Dict[str, int]]]~~ | ## Serialization fields {#serialization-fields} diff --git a/website/docs/api/entityruler.md b/website/docs/api/entityruler.md index 0fd24897d..76a4b3604 100644 --- a/website/docs/api/entityruler.md +++ b/website/docs/api/entityruler.md @@ -3,44 +3,108 @@ title: EntityRuler tag: class source: spacy/pipeline/entityruler.py new: 2.1 +teaser: 'Pipeline component for rule-based named entity recognition' +api_string_name: entity_ruler +api_trainable: false --- -The EntityRuler lets you add spans to the [`Doc.ents`](/api/doc#ents) using +The entity ruler lets you add spans to the [`Doc.ents`](/api/doc#ents) using token-based rules or exact phrase matches. It can be combined with the statistical [`EntityRecognizer`](/api/entityrecognizer) to boost accuracy, or -used on its own to implement a purely rule-based entity recognition system. -After initialization, the component is typically added to the processing -pipeline using [`nlp.add_pipe`](/api/language#add_pipe). For usage examples, see -the docs on +used on its own to implement a purely rule-based entity recognition system. For +usage examples, see the docs on [rule-based entity recognition](/usage/rule-based-matching#entityruler). +## Config and implementation {#config} + +The default config is defined by the pipeline component factory and describes +how the component should be configured. 
You can override its settings via the +`config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your +[`config.cfg` for training](/usage/training#config). + +> #### Example +> +> ```python +> config = { +> "phrase_matcher_attr": None, +> "validate": True, +> "overwrite_ents": False, +> "ent_id_sep": "||", +> } +> nlp.add_pipe("entity_ruler", config=config) +> ``` + +| Setting | Description | +| --------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `phrase_matcher_attr` | Optional attribute name match on for the internal [`PhraseMatcher`](/api/phrasematcher), e.g. `LOWER` to match on the lowercase token text. Defaults to `None`. ~~Optional[Union[int, str]]~~ | +| `validate` | Whether patterns should be validated (passed to the `Matcher` and `PhraseMatcher`). Defaults to `False`. ~~bool~~ | +| `overwrite_ents` | If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. Defaults to `False`. ~~bool~~ | +| `ent_id_sep` | Separator used internally for entity IDs. Defaults to `"||"`. ~~str~~ | + +```python +%%GITHUB_SPACY/spacy/pipeline/entityruler.py +``` + ## EntityRuler.\_\_init\_\_ {#init tag="method"} Initialize the entity ruler. If patterns are supplied here, they need to be a list of dictionaries with a `"label"` and `"pattern"` key. A pattern can either be a token pattern (list) or a phrase pattern (string). For example: -`{'label': 'ORG', 'pattern': 'Apple'}`. +`{"label": "ORG", "pattern": "Apple"}`. > #### Example > > ```python -> # Construction via create_pipe -> ruler = nlp.create_pipe("entity_ruler") +> # Construction via add_pipe +> ruler = nlp.add_pipe("entity_ruler") > > # Construction from class > from spacy.pipeline import EntityRuler > ruler = EntityRuler(nlp, overwrite_ents=True) > ``` -| Name | Type | Description | -| --------------------- | ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- | -| `nlp` | `Language` | The shared nlp object to pass the vocab to the matchers and process phrase patterns. | -| `patterns` | iterable | Optional patterns to load in. | -| `phrase_matcher_attr` | int / unicode | Optional attr to pass to the internal [`PhraseMatcher`](/api/phrasematcher). defaults to `None` | -| `validate` | bool | Whether patterns should be validated, passed to Matcher and PhraseMatcher as `validate`. Defaults to `False`. | -| `overwrite_ents` | bool | If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. Defaults to `False`. | -| `**cfg` | - | Other config parameters. If pipeline component is loaded as part of a model pipeline, this will include all keyword arguments passed to `spacy.load`. | -| **RETURNS** | `EntityRuler` | The newly constructed object. | +| Name | Description | +| --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `nlp` | The shared nlp object to pass the vocab to the matchers and process phrase patterns. ~~Language~~ | +| `name` 3 | Instance name of the current pipeline component. 
Typically passed in automatically from the factory when the component is added. Used to disable the current entity ruler while creating phrase patterns with the nlp object. ~~str~~ |
+| _keyword-only_ | |
+| `phrase_matcher_attr` | Optional attribute name match on for the internal [`PhraseMatcher`](/api/phrasematcher), e.g. `LOWER` to match on the lowercase token text. Defaults to `None`. ~~Optional[Union[int, str]]~~ |
+| `validate` | Whether patterns should be validated, passed to Matcher and PhraseMatcher as `validate`. Defaults to `False`. ~~bool~~ |
+| `overwrite_ents` | If existing entities are present, e.g. entities added by the model, overwrite them by matches if necessary. Defaults to `False`. ~~bool~~ |
+| `ent_id_sep` | Separator used internally for entity IDs. Defaults to `"||"`. ~~str~~ |
+| `patterns` | Optional patterns to load in on initialization. ~~Optional[List[Dict[str, Union[str, List[dict]]]]]~~ |
+
+## EntityRuler.initialize {#initialize tag="method" new="3"}
+
+Initialize the component with data and, before training, load in rules
+from a file. This method is typically called by
+[`Language.initialize`](/api/language#initialize) and lets you customize
+arguments it receives via the
+[`[initialize.components]`](/api/data-formats#config-initialize) block in the
+config.
+
+> #### Example
+>
+> ```python
+> entity_ruler = nlp.add_pipe("entity_ruler")
+> entity_ruler.initialize(lambda: [], nlp=nlp, patterns=patterns)
+> ```
+>
+> ```ini
+> ### config.cfg
+> [initialize.components.entity_ruler]
+>
+> [initialize.components.entity_ruler.patterns]
+> @readers = "srsly.read_jsonl.v1"
+> path = "corpus/entity_ruler_patterns.jsonl"
+> ```
+
+| Name | Description |
+| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. Not used by the `EntityRuler`. ~~Callable[[], Iterable[Example]]~~ |
+| _keyword-only_ | |
+| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
+| `patterns` | The list of patterns. Defaults to `None`. ~~Optional[Sequence[Dict[str, Union[str, List[Dict[str, Any]]]]]]~~ |

## EntityRuler.\_\_len\_\_ {#len tag="method"}

@@ -49,15 +113,15 @@ The number of all patterns added to the entity ruler.

> #### Example
>
> ```python
-> ruler = EntityRuler(nlp)
+> ruler = nlp.add_pipe("entity_ruler")
> assert len(ruler) == 0
> ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])
> assert len(ruler) == 1
> ```

-| Name | Type | Description |
-| ----------- | ---- | ----------------------- |
-| **RETURNS** | int | The number of patterns. |
+| Name | Description |
+| ----------- | ------------------------------- |
+| **RETURNS** | The number of patterns. ~~int~~ |

## EntityRuler.\_\_contains\_\_ {#contains tag="method"}

Whether a label is present in the patterns.

> #### Example
>
> ```python
-> ruler = EntityRuler(nlp)
+> ruler = nlp.add_pipe("entity_ruler")
> ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])
> assert "ORG" in ruler
> assert not "PERSON" in ruler
> ```

-| Name | Type | Description |
-| ----------- | ------- | -------------------------------------------- |
-| `label` | unicode | The label to check. |
-| **RETURNS** | bool | Whether the entity ruler contains the label. 
| +| Name | Description | +| ----------- | ----------------------------------------------------- | +| `label` | The label to check. ~~str~~ | +| **RETURNS** | Whether the entity ruler contains the label. ~~bool~~ | ## EntityRuler.\_\_call\_\_ {#call tag="method"} @@ -83,25 +147,25 @@ Find matches in the `Doc` and add them to the `doc.ents`. Typically, this happens automatically after the component has been added to the pipeline using [`nlp.add_pipe`](/api/language#add_pipe). If the entity ruler was initialized with `overwrite_ents=True`, existing entities will be replaced if they overlap -with the matches. When matches overlap in a Doc, the entity ruler prioritizes longer -patterns over shorter, and if equal the match occuring first in the Doc is chosen. +with the matches. When matches overlap in a Doc, the entity ruler prioritizes +longer patterns over shorter, and if equal the match occuring first in the Doc +is chosen. > #### Example > > ```python -> ruler = EntityRuler(nlp) +> ruler = nlp.add_pipe("entity_ruler") > ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}]) -> nlp.add_pipe(ruler) > > doc = nlp("A text about Apple.") > ents = [(ent.text, ent.label_) for ent in doc.ents] > assert ents == [("Apple", "ORG")] > ``` -| Name | Type | Description | -| ----------- | ----- | ------------------------------------------------------------ | -| `doc` | `Doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. | -| **RETURNS** | `Doc` | The modified `Doc` with added entities, if available. | +| Name | Description | +| ----------- | -------------------------------------------------------------------- | +| `doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. ~~Doc~~ | +| **RETURNS** | The modified `Doc` with added entities, if available. ~~Doc~~ | ## EntityRuler.add_patterns {#add_patterns tag="method"} @@ -116,13 +180,13 @@ of dicts) or a phrase pattern (string). For more details, see the usage guide on > {"label": "ORG", "pattern": "Apple"}, > {"label": "GPE", "pattern": [{"lower": "san"}, {"lower": "francisco"}]} > ] -> ruler = EntityRuler(nlp) +> ruler = nlp.add_pipe("entity_ruler") > ruler.add_patterns(patterns) > ``` -| Name | Type | Description | -| ---------- | ---- | -------------------- | -| `patterns` | list | The patterns to add. | +| Name | Description | +| ---------- | ---------------------------------------------------------------- | +| `patterns` | The patterns to add. ~~List[Dict[str, Union[str, List[dict]]]]~~ | ## EntityRuler.to_disk {#to_disk tag="method"} @@ -134,18 +198,18 @@ only the patterns are saved as JSONL. If a directory name is provided, a > #### Example > > ```python -> ruler = EntityRuler(nlp) +> ruler = nlp.add_pipe("entity_ruler") > ruler.to_disk("/path/to/patterns.jsonl") # saves patterns only > ruler.to_disk("/path/to/entity_ruler") # saves patterns and config > ``` -| Name | Type | Description | -| ------ | ---------------- | ----------------------------------------------------------------------------------------------------------------------------------- | -| `path` | unicode / `Path` | A path to a JSONL file or directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | +| Name | Description | +| ------ | -------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `path` | A path to a JSONL file or directory, which will be created if it doesn't exist. 
Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |

## EntityRuler.from_disk {#from_disk tag="method"}

-Load the entity ruler from a file. Expects either a file containing
+Load the entity ruler from a path. Expects either a file containing
newline-delimited JSON (JSONL) with one entry per line, or a directory
containing a `patterns.jsonl` file and a `cfg` file with the component
configuration.

> #### Example
>
> ```python
-> ruler = EntityRuler(nlp)
+> ruler = nlp.add_pipe("entity_ruler")
> ruler.from_disk("/path/to/patterns.jsonl")  # loads patterns only
> ruler.from_disk("/path/to/entity_ruler")  # loads patterns and config
> ```

-| Name | Type | Description |
-| ----------- | ---------------- | ---------------------------------------------------------------------------------------- |
-| `path` | unicode / `Path` | A path to a JSONL file or directory. Paths may be either strings or `Path`-like objects. |
-| **RETURNS** | `EntityRuler` | The modified `EntityRuler` object. |
+| Name | Description |
+| ----------- | ------------------------------------------------------------------------------------------------------------- |
+| `path` | A path to a JSONL file or directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
+| **RETURNS** | The modified `EntityRuler` object. ~~EntityRuler~~ |

## EntityRuler.to_bytes {#to_bytes tag="method"}

Serialize the entity ruler patterns to a bytestring.

> #### Example
>
> ```python
-> ruler = EntityRuler(nlp)
+> ruler = nlp.add_pipe("entity_ruler")
> ruler_bytes = ruler.to_bytes()
> ```

-| Name | Type | Description |
-| ----------- | ----- | ------------------------ |
-| **RETURNS** | bytes | The serialized patterns. |
+| Name | Description |
+| ----------- | ---------------------------------- |
+| **RETURNS** | The serialized patterns. ~~bytes~~ |

## EntityRuler.from_bytes {#from_bytes tag="method"}

Load the pipe from a bytestring. Modifies the object in place and returns it.

>
> ```python
> ruler_bytes = ruler.to_bytes()
-> ruler = EntityRuler(nlp)
+> ruler = nlp.add_pipe("entity_ruler")
> ruler.from_bytes(ruler_bytes)
> ```

-| Name | Type | Description |
-| ---------------- | ------------- | ---------------------------------- |
-| `patterns_bytes` | bytes | The bytestring to load. |
-| **RETURNS** | `EntityRuler` | The modified `EntityRuler` object. |
+| Name | Description |
+| ------------ | -------------------------------------------------- |
+| `bytes_data` | The bytestring to load. ~~bytes~~ |
+| **RETURNS** | The modified `EntityRuler` object. ~~EntityRuler~~ |

## EntityRuler.labels {#labels tag="property"}

All labels present in the match patterns.

-| Name | Type | Description |
-| ----------- | ----- | ------------------ |
-| **RETURNS** | tuple | The string labels. |
+| Name | Description |
+| ----------- | -------------------------------------- |
+| **RETURNS** | The string labels. ~~Tuple[str, ...]~~ |

## EntityRuler.ent_ids {#ent_ids tag="property" new="2.2.2"}

-All entity ids present in the match patterns `id` properties.
+All entity IDs present in the `id` properties of the match patterns.

-| Name | Type | Description |
-| ----------- | ----- | ------------------- |
-| **RETURNS** | tuple | The string ent_ids. |
+| Name | Description |
+| ----------- | ----------------------------------- |
+| **RETURNS** | The string IDs. 
~~Tuple[str, ...]~~ | ## EntityRuler.patterns {#patterns tag="property"} Get all patterns that were added to the entity ruler. -| Name | Type | Description | -| ----------- | ---- | -------------------------------------------------- | -| **RETURNS** | list | The original patterns, one dictionary per pattern. | +| Name | Description | +| ----------- | ---------------------------------------------------------------------------------------- | +| **RETURNS** | The original patterns, one dictionary per pattern. ~~List[Dict[str, Union[str, dict]]]~~ | ## Attributes {#attributes} -| Name | Type | Description | -| ----------------- | ------------------------------------- | ---------------------------------------------------------------- | -| `matcher` | [`Matcher`](/api/matcher) | The underlying matcher used to process token patterns. | -| `phrase_matcher` | [`PhraseMatcher`](/api/phrasematcher) | The underlying phrase matcher, used to process phrase patterns. | -| `token_patterns` | dict | The token patterns present in the entity ruler, keyed by label. | -| `phrase_patterns` | dict | The phrase patterns present in the entity ruler, keyed by label. | +| Name | Description | +| ----------------- | --------------------------------------------------------------------------------------------------------------------- | +| `matcher` | The underlying matcher used to process token patterns. ~~Matcher~~ | +| `phrase_matcher` | The underlying phrase matcher used to process phrase patterns. ~~PhraseMatcher~~ | +| `token_patterns` | The token patterns present in the entity ruler, keyed by label. ~~Dict[str, List[Dict[str, Union[str, List[dict]]]]~~ | +| `phrase_patterns` | The phrase patterns present in the entity ruler, keyed by label. ~~Dict[str, List[Doc]]~~ | diff --git a/website/docs/api/example.md b/website/docs/api/example.md new file mode 100644 index 000000000..2811f4d91 --- /dev/null +++ b/website/docs/api/example.md @@ -0,0 +1,322 @@ +--- +title: Example +teaser: A training instance +tag: class +source: spacy/training/example.pyx +new: 3.0 +--- + +An `Example` holds the information for one training instance. It stores two +`Doc` objects: one for holding the gold-standard reference data, and one for +holding the predictions of the pipeline. An +[`Alignment`](/api/example#alignment-object) object stores the alignment between +these two documents, as they can differ in tokenization. + +## Example.\_\_init\_\_ {#init tag="method"} + +Construct an `Example` object from the `predicted` document and the `reference` +document. If `alignment` is `None`, it will be initialized from the words in +both documents. + +> #### Example +> +> ```python +> from spacy.tokens import Doc +> from spacy.training import Example +> +> words = ["hello", "world", "!"] +> spaces = [True, False, False] +> predicted = Doc(nlp.vocab, words=words, spaces=spaces) +> reference = parse_gold_doc(my_data) +> example = Example(predicted, reference) +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------ | +| `predicted` | The document containing (partial) predictions. Cannot be `None`. ~~Doc~~ | +| `reference` | The document containing gold-standard annotations. Cannot be `None`. ~~Doc~~ | +| _keyword-only_ | | +| `alignment` | An object holding the alignment between the tokens of the `predicted` and `reference` documents. 
~~Optional[Alignment]~~ | + +## Example.from_dict {#from_dict tag="classmethod"} + +Construct an `Example` object from the `predicted` document and the reference +annotations provided as a dictionary. For more details on the required format, +see the [training format documentation](/api/data-formats#dict-input). + +> #### Example +> +> ```python +> from spacy.tokens import Doc +> from spacy.training import Example +> +> predicted = Doc(vocab, words=["Apply", "some", "sunscreen"]) +> token_ref = ["Apply", "some", "sun", "screen"] +> tags_ref = ["VERB", "DET", "NOUN", "NOUN"] +> example = Example.from_dict(predicted, {"words": token_ref, "tags": tags_ref}) +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------------------- | +| `predicted` | The document containing (partial) predictions. Cannot be `None`. ~~Doc~~ | +| `example_dict` | `Dict[str, obj]` | The gold-standard annotations as a dictionary. Cannot be `None`. ~~Dict[str, Any]~~ | +| **RETURNS** | The newly constructed object. ~~Example~~ | + +## Example.text {#text tag="property"} + +The text of the `predicted` document in this `Example`. + +> #### Example +> +> ```python +> raw_text = example.text +> ``` + +| Name | Description | +| ----------- | --------------------------------------------- | +| **RETURNS** | The text of the `predicted` document. ~~str~~ | + +## Example.predicted {#predicted tag="property"} + +The `Doc` holding the predictions. Occasionally also referred to as `example.x`. + +> #### Example +> +> ```python +> docs = [eg.predicted for eg in examples] +> predictions, _ = model.begin_update(docs) +> set_annotations(docs, predictions) +> ``` + +| Name | Description | +| ----------- | ------------------------------------------------------ | +| **RETURNS** | The document containing (partial) predictions. ~~Doc~~ | + +## Example.reference {#reference tag="property"} + +The `Doc` holding the gold-standard annotations. Occasionally also referred to +as `example.y`. + +> #### Example +> +> ```python +> for i, eg in enumerate(examples): +> for j, label in enumerate(all_labels): +> gold_labels[i][j] = eg.reference.cats.get(label, 0.0) +> ``` + +| Name | Description | +| ----------- | ---------------------------------------------------------- | +| **RETURNS** | The document containing gold-standard annotations. ~~Doc~~ | + +## Example.alignment {#alignment tag="property"} + +The [`Alignment`](/api/example#alignment-object) object mapping the tokens of +the `predicted` document to those of the `reference` document. + +> #### Example +> +> ```python +> tokens_x = ["Apply", "some", "sunscreen"] +> x = Doc(vocab, words=tokens_x) +> tokens_y = ["Apply", "some", "sun", "screen"] +> example = Example.from_dict(x, {"words": tokens_y}) +> alignment = example.alignment +> assert list(alignment.y2x.data) == [[0], [1], [2], [2]] +> ``` + +| Name | Description | +| ----------- | ---------------------------------------------------------------- | +| **RETURNS** | The document containing gold-standard annotations. ~~Alignment~~ | + +## Example.get_aligned {#get_aligned tag="method"} + +Get the aligned view of a certain token attribute, denoted by its int ID or +string name. 
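The attribute can also be passed as an integer ID from `spacy.attrs` instead of a string name. A small complementary sketch (it reuses the `example` object constructed in the snippet below; with the default `as_string=False` the aligned values come back as integer hashes):

```python
from spacy.attrs import TAG

# Assumes `example` is the Example built in the snippet below
tag_ids = example.get_aligned(TAG)                        # integer hash values
tag_strings = example.get_aligned("TAG", as_string=True)  # ["VERB", "DET", "NOUN"]
```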
+ +> #### Example +> +> ```python +> predicted = Doc(vocab, words=["Apply", "some", "sunscreen"]) +> token_ref = ["Apply", "some", "sun", "screen"] +> tags_ref = ["VERB", "DET", "NOUN", "NOUN"] +> example = Example.from_dict(predicted, {"words": token_ref, "tags": tags_ref}) +> assert example.get_aligned("TAG", as_string=True) == ["VERB", "DET", "NOUN"] +> ``` + +| Name | Description | +| ----------- | -------------------------------------------------------------------------------------------------- | +| `field` | Attribute ID or string name. ~~Union[int, str]~~ | +| `as_string` | Whether or not to return the list of values as strings. Defaults to `False`. ~~bool~~ | +| **RETURNS** | List of integer values, or string values if `as_string` is `True`. ~~Union[List[int], List[str]]~~ | + +## Example.get_aligned_parse {#get_aligned_parse tag="method"} + +Get the aligned view of the dependency parse. If `projectivize` is set to +`True`, non-projective dependency trees are made projective through the +Pseudo-Projective Dependency Parsing algorithm by Nivre and Nilsson (2005). + +> #### Example +> +> ```python +> doc = nlp("He pretty quickly walks away") +> example = Example.from_dict(doc, {"heads": [3, 2, 3, 0, 2]}) +> proj_heads, proj_labels = example.get_aligned_parse(projectivize=True) +> assert proj_heads == [3, 2, 3, 0, 3] +> ``` + +| Name | Description | +| -------------- | -------------------------------------------------------------------------------------------------- | +| `projectivize` | Whether or not to projectivize the dependency trees. Defaults to `True`. ~~bool~~ | +| **RETURNS** | List of integer values, or string values if `as_string` is `True`. ~~Union[List[int], List[str]]~~ | + +## Example.get_aligned_ner {#get_aligned_ner tag="method"} + +Get the aligned view of the NER +[BILUO](/usage/linguistic-features#accessing-ner) tags. + +> #### Example +> +> ```python +> words = ["Mrs", "Smith", "flew", "to", "New York"] +> doc = Doc(en_vocab, words=words) +> entities = [(0, 9, "PERSON"), (18, 26, "LOC")] +> gold_words = ["Mrs Smith", "flew", "to", "New", "York"] +> example = Example.from_dict(doc, {"words": gold_words, "entities": entities}) +> ner_tags = example.get_aligned_ner() +> assert ner_tags == ["B-PERSON", "L-PERSON", "O", "O", "U-LOC"] +> ``` + +| Name | Description | +| ----------- | ------------------------------------------------------------------------------------------------- | +| **RETURNS** | List of BILUO values, denoting whether tokens are part of an NER annotation or not. ~~List[str]~~ | + +## Example.get_aligned_spans_y2x {#get_aligned_spans_y2x tag="method"} + +Get the aligned view of any set of [`Span`](/api/span) objects defined over +[`Example.reference`](/api/example#reference). The resulting span indices will +align to the tokenization in [`Example.predicted`](/api/example#predicted). 
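One common use, sketched below under the assumption that the projected spans form valid, non-overlapping entities (and reusing the `example` built in the snippet that follows), is to project the gold-standard entities onto the predicted tokenization:

```python
# Assumes `example` is the Example built in the snippet below
gold_ents = example.reference.ents
aligned_ents = example.get_aligned_spans_y2x(gold_ents)
# The returned spans index into example.predicted, so they can be assigned there
example.predicted.ents = aligned_ents
```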
+ +> #### Example +> +> ```python +> words = ["Mr and Mrs Smith", "flew", "to", "New York"] +> doc = Doc(en_vocab, words=words) +> entities = [(0, 16, "PERSON")] +> tokens_ref = ["Mr", "and", "Mrs", "Smith", "flew", "to", "New", "York"] +> example = Example.from_dict(doc, {"words": tokens_ref, "entities": entities}) +> ents_ref = example.reference.ents +> assert [(ent.start, ent.end) for ent in ents_ref] == [(0, 4)] +> ents_y2x = example.get_aligned_spans_y2x(ents_ref) +> assert [(ent.start, ent.end) for ent in ents_y2x] == [(0, 1)] +> ``` + +| Name | Description | +| ----------- | ----------------------------------------------------------------------------- | +| `y_spans` | `Span` objects aligned to the tokenization of `reference`. ~~Iterable[Span]~~ | +| **RETURNS** | `Span` objects aligned to the tokenization of `predicted`. ~~List[Span]~~ | + +## Example.get_aligned_spans_x2y {#get_aligned_spans_x2y tag="method"} + +Get the aligned view of any set of [`Span`](/api/span) objects defined over +[`Example.predicted`](/api/example#predicted). The resulting span indices will +align to the tokenization in [`Example.reference`](/api/example#reference). This +method is particularly useful to assess the accuracy of predicted entities +against the original gold-standard annotation. + +> #### Example +> +> ```python +> nlp.add_pipe("my_ner") +> doc = nlp("Mr and Mrs Smith flew to New York") +> tokens_ref = ["Mr and Mrs", "Smith", "flew", "to", "New York"] +> example = Example.from_dict(doc, {"words": tokens_ref}) +> ents_pred = example.predicted.ents +> # Assume the NER model has found "Mr and Mrs Smith" as a named entity +> assert [(ent.start, ent.end) for ent in ents_pred] == [(0, 4)] +> ents_x2y = example.get_aligned_spans_x2y(ents_pred) +> assert [(ent.start, ent.end) for ent in ents_x2y] == [(0, 2)] +> ``` + +| Name | Description | +| ----------- | ----------------------------------------------------------------------------- | +| `x_spans` | `Span` objects aligned to the tokenization of `predicted`. ~~Iterable[Span]~~ | +| **RETURNS** | `Span` objects aligned to the tokenization of `reference`. ~~List[Span]~~ | + +## Example.to_dict {#to_dict tag="method"} + +Return a [dictionary representation](/api/data-formats#dict-input) of the +reference annotation contained in this `Example`. + +> #### Example +> +> ```python +> eg_dict = example.to_dict() +> ``` + +| Name | Description | +| ----------- | ------------------------------------------------------------------------- | +| **RETURNS** | Dictionary representation of the reference annotation. ~~Dict[str, Any]~~ | + +## Example.split_sents {#split_sents tag="method"} + +Split one `Example` into multiple `Example` objects, one for each sentence. + +> #### Example +> +> ```python +> doc = nlp("I went yesterday had lots of fun") +> tokens_ref = ["I", "went", "yesterday", "had", "lots", "of", "fun"] +> sents_ref = [True, False, False, True, False, False, False] +> example = Example.from_dict(doc, {"words": tokens_ref, "sent_starts": sents_ref}) +> split_examples = example.split_sents() +> assert split_examples[0].text == "I went yesterday " +> assert split_examples[1].text == "had lots of fun" +> ``` + +| Name | Description | +| ----------- | ---------------------------------------------------------------------------- | +| **RETURNS** | List of `Example` objects, one for each original sentence. ~~List[Example]~~ | + +## Alignment {#alignment-object new="3"} + +Calculate alignment tables between two tokenizations. 
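Both directions of the mapping are exposed via the attributes and example further below. As an additional sketch (the token lists are purely illustrative; both tokenizations must spell out the same underlying string), the `lengths` of each `Ragged` table show how many tokens on the other side each token covers:

```python
from spacy.training import Alignment

other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]
align = Alignment.from_strings(other_tokens, spacy_tokens)
# Each token in other_tokens maps to exactly one spaCy token here
assert list(align.x2y.dataXd) == [0, 1, 2, 3, 4, 4, 5, 6]
# The spaCy token "'s" covers two tokens ("'" and "s") on the other side
assert list(align.y2x.lengths) == [1, 1, 1, 1, 2, 1, 1]
```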
+ +### Alignment attributes {#alignment-attributes"} + +| Name | Description | +| ----- | --------------------------------------------------------------------- | +| `x2y` | The `Ragged` object holding the alignment from `x` to `y`. ~~Ragged~~ | +| `y2x` | The `Ragged` object holding the alignment from `y` to `x`. ~~Ragged~~ | + + + +The current implementation of the alignment algorithm assumes that both +tokenizations add up to the same string. For example, you'll be able to align +`["I", "'", "m"]` and `["I", "'m"]`, which both add up to `"I'm"`, but not +`["I", "'m"]` and `["I", "am"]`. + + + +> #### Example +> +> ```python +> from spacy.training import Alignment +> +> bert_tokens = ["obama", "'", "s", "podcast"] +> spacy_tokens = ["obama", "'s", "podcast"] +> alignment = Alignment.from_strings(bert_tokens, spacy_tokens) +> a2b = alignment.x2y +> assert list(a2b.dataXd) == [0, 1, 1, 2] +> ``` +> +> If `a2b.dataXd[1] == a2b.dataXd[2] == 1`, that means that `A[1]` (`"'"`) and +> `A[2]` (`"s"`) both align to `B[1]` (`"'s"`). + +### Alignment.from_strings {#classmethod tag="function"} + +| Name | Description | +| ----------- | ------------------------------------------------------------- | +| `A` | String values of candidate tokens to align. ~~List[str]~~ | +| `B` | String values of reference tokens to align. ~~List[str]~~ | +| **RETURNS** | An `Alignment` object describing the alignment. ~~Alignment~~ | diff --git a/website/docs/api/goldcorpus.md b/website/docs/api/goldcorpus.md deleted file mode 100644 index a18ef4d32..000000000 --- a/website/docs/api/goldcorpus.md +++ /dev/null @@ -1,24 +0,0 @@ ---- -title: GoldCorpus -teaser: An annotated corpus, using the JSON file format -tag: class -source: spacy/gold.pyx -new: 2 ---- - -This class manages annotations for tagging, dependency parsing and NER. - -## GoldCorpus.\_\_init\_\_ {#init tag="method"} - -Create a `GoldCorpus`. IF the input data is an iterable, each item should be a -`(text, paragraphs)` tuple, where each paragraph is a tuple -`(sentences, brackets)`, and each sentence is a tuple -`(ids, words, tags, heads, ner)`. See the implementation of -[`gold.read_json_file`](https://github.com/explosion/spaCy/tree/master/spacy/gold.pyx) -for further details. - -| Name | Type | Description | -| ----------- | --------------------------- | ------------------------------------------------------------ | -| `train` | unicode / `Path` / iterable | Training data, as a path (file or directory) or iterable. | -| `dev` | unicode / `Path` / iterable | Development data, as a path (file or directory) or iterable. | -| **RETURNS** | `GoldCorpus` | The newly constructed object. | diff --git a/website/docs/api/goldparse.md b/website/docs/api/goldparse.md deleted file mode 100644 index bc33dd4e6..000000000 --- a/website/docs/api/goldparse.md +++ /dev/null @@ -1,207 +0,0 @@ ---- -title: GoldParse -teaser: A collection for training annotations -tag: class -source: spacy/gold.pyx ---- - -## GoldParse.\_\_init\_\_ {#init tag="method"} - -Create a `GoldParse`. The [`TextCategorizer`](/api/textcategorizer) component -expects true examples of a label to have the value `1.0`, and negative examples -of a label to have the value `0.0`. Labels not in the dictionary are treated as -missing – the gradient for those labels will be zero. 
- -| Name | Type | Description | -| ----------------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `doc` | `Doc` | The document the annotations refer to. | -| `words` | iterable | A sequence of unicode word strings. | -| `tags` | iterable | A sequence of strings, representing tag annotations. | -| `heads` | iterable | A sequence of integers, representing syntactic head offsets. | -| `deps` | iterable | A sequence of strings, representing the syntactic relation types. | -| `entities` | iterable | A sequence of named entity annotations, either as BILUO tag strings, or as `(start_char, end_char, label)` tuples, representing the entity positions. If BILUO tag strings, you can specify missing values by setting the tag to None. | -| `cats` | dict | Labels for text classification. Each key in the dictionary is a string label for the category and each value is `1.0` (positive) or `0.0` (negative). | -| `links` | dict | Labels for entity linking. A dict with `(start_char, end_char)` keys, and the values being dicts with `kb_id:value` entries, representing external KB IDs mapped to either `1.0` (positive) or `0.0` (negative). | -| `make_projective` | bool | Whether to projectivize the dependency tree. Defaults to `False`. | -| **RETURNS** | `GoldParse` | The newly constructed object. | - -## GoldParse.\_\_len\_\_ {#len tag="method"} - -Get the number of gold-standard tokens. - -| Name | Type | Description | -| ----------- | ---- | ----------------------------------- | -| **RETURNS** | int | The number of gold-standard tokens. | - -## GoldParse.is_projective {#is_projective tag="property"} - -Whether the provided syntactic annotations form a projective dependency tree. - -| Name | Type | Description | -| ----------- | ---- | ----------------------------------------- | -| **RETURNS** | bool | Whether annotations form projective tree. | - -## Attributes {#attributes} - -| Name | Type | Description | -| ------------------------------------ | ---- | ------------------------------------------------------------------------------------------------------------------------ | -| `words` | list | The words. | -| `tags` | list | The part-of-speech tag annotations. | -| `heads` | list | The syntactic head annotations. | -| `labels` | list | The syntactic relation-type annotations. | -| `ner` | list | The named entity annotations as BILUO tags. | -| `cand_to_gold` | list | The alignment from candidate tokenization to gold tokenization. | -| `gold_to_cand` | list | The alignment from gold tokenization to candidate tokenization. | -| `cats` 2 | dict | Keys in the dictionary are string category labels with values `1.0` or `0.0`. | -| `links` 2.2 | dict | Keys in the dictionary are `(start_char, end_char)` triples, and the values are dictionaries with `kb_id:value` entries. | - -## Utilities {#util} - -### gold.docs_to_json {#docs_to_json tag="function"} - -Convert a list of Doc objects into the -[JSON-serializable format](/api/annotation#json-input) used by the -[`spacy train`](/api/cli#train) command. Each input doc will be treated as a -'paragraph' in the output doc. 
- -> #### Example -> -> ```python -> from spacy.gold import docs_to_json -> -> doc = nlp("I like London") -> json_data = docs_to_json([doc]) -> ``` - -| Name | Type | Description | -| ----------- | ---------------- | ------------------------------------------ | -| `docs` | iterable / `Doc` | The `Doc` object(s) to convert. | -| `id` | int | ID to assign to the JSON. Defaults to `0`. | -| **RETURNS** | dict | The data in spaCy's JSON format. | - -### gold.align {#align tag="function"} - -Calculate alignment tables between two tokenizations, using the Levenshtein -algorithm. The alignment is case-insensitive. - - - -The current implementation of the alignment algorithm assumes that both -tokenizations add up to the same string. For example, you'll be able to align -`["I", "'", "m"]` and `["I", "'m"]`, which both add up to `"I'm"`, but not -`["I", "'m"]` and `["I", "am"]`. - - - -> #### Example -> -> ```python -> from spacy.gold import align -> -> bert_tokens = ["obama", "'", "s", "podcast"] -> spacy_tokens = ["obama", "'s", "podcast"] -> alignment = align(bert_tokens, spacy_tokens) -> cost, a2b, b2a, a2b_multi, b2a_multi = alignment -> ``` - -| Name | Type | Description | -| ----------- | ----- | -------------------------------------------------------------------------- | -| `tokens_a` | list | String values of candidate tokens to align. | -| `tokens_b` | list | String values of reference tokens to align. | -| **RETURNS** | tuple | A `(cost, a2b, b2a, a2b_multi, b2a_multi)` tuple describing the alignment. | - -The returned tuple contains the following alignment information: - -> #### Example -> -> ```python -> a2b = array([0, -1, -1, 2]) -> b2a = array([0, 2, 3]) -> a2b_multi = {1: 1, 2: 1} -> b2a_multi = {} -> ``` -> -> If `a2b[3] == 2`, that means that `tokens_a[3]` aligns to `tokens_b[2]`. If -> there's no one-to-one alignment for a token, it has the value `-1`. - -| Name | Type | Description | -| ----------- | -------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- | -| `cost` | int | The number of misaligned tokens. | -| `a2b` | `numpy.ndarray[ndim=1, dtype='int32']` | One-to-one mappings of indices in `tokens_a` to indices in `tokens_b`. | -| `b2a` | `numpy.ndarray[ndim=1, dtype='int32']` | One-to-one mappings of indices in `tokens_b` to indices in `tokens_a`. | -| `a2b_multi` | dict | A dictionary mapping indices in `tokens_a` to indices in `tokens_b`, where multiple tokens of `tokens_a` align to the same token of `tokens_b`. | -| `b2a_multi` | dict | A dictionary mapping indices in `tokens_b` to indices in `tokens_a`, where multiple tokens of `tokens_b` align to the same token of `tokens_a`. | - -### gold.biluo_tags_from_offsets {#biluo_tags_from_offsets tag="function"} - -Encode labelled spans into per-token tags, using the -[BILUO scheme](/api/annotation#biluo) (Begin, In, Last, Unit, Out). Returns a -list of unicode strings, describing the tags. Each tag string will be of the -form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of -`"B"`, `"I"`, `"L"`, `"U"`. The string `"-"` is used where the entity offsets -don't align with the tokenization in the `Doc` object. The training algorithm -will view these as missing values. `O` denotes a non-entity token. `B` denotes -the beginning of a multi-token entity, `I` the inside of an entity of three or -more tokens, and `L` the end of an entity of two or more tokens. 
`U` denotes a -single-token entity. - -> #### Example -> -> ```python -> from spacy.gold import biluo_tags_from_offsets -> -> doc = nlp("I like London.") -> entities = [(7, 13, "LOC")] -> tags = biluo_tags_from_offsets(doc, entities) -> assert tags == ["O", "O", "U-LOC", "O"] -> ``` - -| Name | Type | Description | -| ----------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------------------- | -| `doc` | `Doc` | The document that the entity offsets refer to. The output tags will refer to the token boundaries within the document. | -| `entities` | iterable | A sequence of `(start, end, label)` triples. `start` and `end` should be character-offset integers denoting the slice into the original string. | -| **RETURNS** | list | Unicode strings, describing the [BILUO](/api/annotation#biluo) tags. | - -### gold.offsets_from_biluo_tags {#offsets_from_biluo_tags tag="function"} - -Encode per-token tags following the [BILUO scheme](/api/annotation#biluo) into -entity offsets. - -> #### Example -> -> ```python -> from spacy.gold import offsets_from_biluo_tags -> -> doc = nlp("I like London.") -> tags = ["O", "O", "U-LOC", "O"] -> entities = offsets_from_biluo_tags(doc, tags) -> assert entities == [(7, 13, "LOC")] -> ``` - -| Name | Type | Description | -| ----------- | -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `doc` | `Doc` | The document that the BILUO tags refer to. | -| `entities` | iterable | A sequence of [BILUO](/api/annotation#biluo) tags with each tag describing one token. Each tag string will be of the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`. | -| **RETURNS** | list | A sequence of `(start, end, label)` triples. `start` and `end` will be character-offset integers denoting the slice into the original string. | - -### gold.spans_from_biluo_tags {#spans_from_biluo_tags tag="function" new="2.1"} - -Encode per-token tags following the [BILUO scheme](/api/annotation#biluo) into -[`Span`](/api/span) objects. This can be used to create entity spans from -token-based tags, e.g. to overwrite the `doc.ents`. - -> #### Example -> -> ```python -> from spacy.gold import spans_from_biluo_tags -> -> doc = nlp("I like London.") -> tags = ["O", "O", "U-LOC", "O"] -> doc.ents = spans_from_biluo_tags(doc, tags) -> ``` - -| Name | Type | Description | -| ----------- | -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `doc` | `Doc` | The document that the BILUO tags refer to. | -| `entities` | iterable | A sequence of [BILUO](/api/annotation#biluo) tags with each tag describing one token. Each tag string will be of the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`. | -| **RETURNS** | list | A sequence of `Span` objects with added entity labels. 
| diff --git a/website/docs/api/index.md b/website/docs/api/index.md index 97a7f57c4..a9dc408f6 100644 --- a/website/docs/api/index.md +++ b/website/docs/api/index.md @@ -1,10 +1,8 @@ --- -title: Architecture -next: /api/annotation +title: Library Architecture +next: /api/architectures --- -## Library architecture {#architecture} - import Architecture101 from 'usage/101/\_architecture.md' diff --git a/website/docs/api/kb.md b/website/docs/api/kb.md index eeba85e84..855dead27 100644 --- a/website/docs/api/kb.md +++ b/website/docs/api/kb.md @@ -1,16 +1,19 @@ --- title: KnowledgeBase -teaser: A storage class for entities and aliases of a specific knowledge base (ontology) +teaser: + A storage class for entities and aliases of a specific knowledge base + (ontology) tag: class source: spacy/kb.pyx new: 2.2 --- -The `KnowledgeBase` object provides a method to generate [`Candidate`](/api/kb/#candidate_init) -objects, which are plausible external identifiers given a certain textual mention. -Each such `Candidate` holds information from the relevant KB entities, -such as its frequency in text and possible aliases. -Each entity in the knowledge base also has a pretrained entity vector of a fixed size. +The `KnowledgeBase` object provides a method to generate +[`Candidate`](/api/kb/#candidate) objects, which are plausible external +identifiers given a certain textual mention. Each such `Candidate` holds +information from the relevant KB entities, such as its frequency in text and +possible aliases. Each entity in the knowledge base also has a pretrained entity +vector of a fixed size. ## KnowledgeBase.\_\_init\_\_ {#init tag="method"} @@ -24,25 +27,24 @@ Create the knowledge base. > kb = KnowledgeBase(vocab=vocab, entity_vector_length=64) > ``` -| Name | Type | Description | -| ----------------------- | ---------------- | ----------------------------------------- | -| `vocab` | `Vocab` | A `Vocab` object. | -| `entity_vector_length` | int | Length of the fixed-size entity vectors. | -| **RETURNS** | `KnowledgeBase` | The newly constructed object. | - +| Name | Description | +| ---------------------- | ------------------------------------------------ | +| `vocab` | The shared vocabulary. ~~Vocab~~ | +| `entity_vector_length` | Length of the fixed-size entity vectors. ~~int~~ | ## KnowledgeBase.entity_vector_length {#entity_vector_length tag="property"} The length of the fixed-size entity vectors in the knowledge base. -| Name | Type | Description | -| ----------- | ---- | ----------------------------------------- | -| **RETURNS** | int | Length of the fixed-size entity vectors. | +| Name | Description | +| ----------- | ------------------------------------------------ | +| **RETURNS** | Length of the fixed-size entity vectors. ~~int~~ | ## KnowledgeBase.add_entity {#add_entity tag="method"} -Add an entity to the knowledge base, specifying its corpus frequency -and entity vector, which should be of length [`entity_vector_length`](/api/kb#entity_vector_length). +Add an entity to the knowledge base, specifying its corpus frequency and entity +vector, which should be of length +[`entity_vector_length`](/api/kb#entity_vector_length). 
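Because the example below leaves `vector1` and `vector2` undefined, here is a more self-contained sketch (the entity ID, frequency and the 3-dimensional vector are made up for illustration):

```python
import numpy
from spacy.kb import KnowledgeBase
from spacy.vocab import Vocab

vocab = Vocab()
kb = KnowledgeBase(vocab=vocab, entity_vector_length=3)
# Hypothetical entity with a made-up corpus frequency and embedding
vector1 = numpy.asarray([0.1, 0.2, 0.3], dtype="float32")
kb.add_entity(entity="Q42", freq=12, entity_vector=vector1)
```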
> #### Example > @@ -51,16 +53,16 @@ and entity vector, which should be of length [`entity_vector_length`](/api/kb#en > kb.add_entity(entity="Q463035", freq=111, entity_vector=vector2) > ``` -| Name | Type | Description | -| --------------- | ------------- | ------------------------------------------------- | -| `entity` | unicode | The unique entity identifier | -| `freq` | float | The frequency of the entity in a typical corpus | -| `entity_vector` | vector | The pretrained vector of the entity | +| Name | Description | +| --------------- | ---------------------------------------------------------- | +| `entity` | The unique entity identifier. ~~str~~ | +| `freq` | The frequency of the entity in a typical corpus. ~~float~~ | +| `entity_vector` | The pretrained vector of the entity. ~~numpy.ndarray~~ | ## KnowledgeBase.set_entities {#set_entities tag="method"} -Define the full list of entities in the knowledge base, specifying the corpus frequency -and entity vector for each entity. +Define the full list of entities in the knowledge base, specifying the corpus +frequency and entity vector for each entity. > #### Example > @@ -68,18 +70,19 @@ and entity vector for each entity. > kb.set_entities(entity_list=["Q42", "Q463035"], freq_list=[32, 111], vector_list=[vector1, vector2]) > ``` -| Name | Type | Description | -| ------------- | ------------- | ------------------------------------------------- | -| `entity_list` | iterable | List of unique entity identifiers | -| `freq_list` | iterable | List of entity frequencies | -| `vector_list` | iterable | List of entity vectors | +| Name | Description | +| ------------- | ---------------------------------------------------------------- | +| `entity_list` | List of unique entity identifiers. ~~Iterable[Union[str, int]]~~ | +| `freq_list` | List of entity frequencies. ~~Iterable[int]~~ | +| `vector_list` | List of entity vectors. ~~Iterable[numpy.ndarray]~~ | ## KnowledgeBase.add_alias {#add_alias tag="method"} -Add an alias or mention to the knowledge base, specifying its potential KB identifiers -and their prior probabilities. The entity identifiers should refer to entities previously -added with [`add_entity`](/api/kb#add_entity) or [`set_entities`](/api/kb#set_entities). -The sum of the prior probabilities should not exceed 1. +Add an alias or mention to the knowledge base, specifying its potential KB +identifiers and their prior probabilities. The entity identifiers should refer +to entities previously added with [`add_entity`](/api/kb#add_entity) or +[`set_entities`](/api/kb#set_entities). The sum of the prior probabilities +should not exceed 1. > #### Example > @@ -87,11 +90,11 @@ The sum of the prior probabilities should not exceed 1. > kb.add_alias(alias="Douglas", entities=["Q42", "Q463035"], probabilities=[0.6, 0.3]) > ``` -| Name | Type | Description | -| -------------- | ------------- | -------------------------------------------------- | -| `alias` | unicode | The textual mention or alias | -| `entities` | iterable | The potential entities that the alias may refer to | -| `probabilities`| iterable | The prior probabilities of each entity | +| Name | Description | +| --------------- | --------------------------------------------------------------------------------- | +| `alias` | The textual mention or alias. ~~str~~ | +| `entities` | The potential entities that the alias may refer to. ~~Iterable[Union[str, int]]~~ | +| `probabilities` | The prior probabilities of each entity. 
~~Iterable[float]~~ | ## KnowledgeBase.\_\_len\_\_ {#len tag="method"} @@ -103,9 +106,9 @@ Get the total number of entities in the knowledge base. > total_entities = len(kb) > ``` -| Name | Type | Description | -| ----------- | ---- | --------------------------------------------- | -| **RETURNS** | int | The number of entities in the knowledge base. | +| Name | Description | +| ----------- | ----------------------------------------------------- | +| **RETURNS** | The number of entities in the knowledge base. ~~int~~ | ## KnowledgeBase.get_entity_strings {#get_entity_strings tag="method"} @@ -117,9 +120,9 @@ Get a list of all entity IDs in the knowledge base. > all_entities = kb.get_entity_strings() > ``` -| Name | Type | Description | -| ----------- | ---- | --------------------------------------------- | -| **RETURNS** | list | The list of entities in the knowledge base. | +| Name | Description | +| ----------- | --------------------------------------------------------- | +| **RETURNS** | The list of entities in the knowledge base. ~~List[str]~~ | ## KnowledgeBase.get_size_aliases {#get_size_aliases tag="method"} @@ -131,9 +134,9 @@ Get the total number of aliases in the knowledge base. > total_aliases = kb.get_size_aliases() > ``` -| Name | Type | Description | -| ----------- | ---- | --------------------------------------------- | -| **RETURNS** | int | The number of aliases in the knowledge base. | +| Name | Description | +| ----------- | ---------------------------------------------------- | +| **RETURNS** | The number of aliases in the knowledge base. ~~int~~ | ## KnowledgeBase.get_alias_strings {#get_alias_strings tag="method"} @@ -145,14 +148,14 @@ Get a list of all aliases in the knowledge base. > all_aliases = kb.get_alias_strings() > ``` -| Name | Type | Description | -| ----------- | ---- | --------------------------------------------- | -| **RETURNS** | list | The list of aliases in the knowledge base. | +| Name | Description | +| ----------- | -------------------------------------------------------- | +| **RETURNS** | The list of aliases in the knowledge base. ~~List[str]~~ | ## KnowledgeBase.get_candidates {#get_candidates tag="method"} Given a certain textual mention as input, retrieve a list of candidate entities -of type [`Candidate`](/api/kb/#candidate_init). +of type [`Candidate`](/api/kb/#candidate). > #### Example > @@ -160,10 +163,10 @@ of type [`Candidate`](/api/kb/#candidate_init). > candidates = kb.get_candidates("Douglas") > ``` -| Name | Type | Description | -| ------------- | ------------- | -------------------------------------------------- | -| `alias` | unicode | The textual mention or alias | -| **RETURNS** | iterable | The list of relevant `Candidate` objects | +| Name | Description | +| ----------- | ------------------------------------- | +| `alias` | The textual mention or alias. ~~str~~ | +| **RETURNS** | iterable | The list of relevant `Candidate` objects. ~~List[Candidate]~~ | ## KnowledgeBase.get_vector {#get_vector tag="method"} @@ -175,15 +178,15 @@ Given a certain entity ID, retrieve its pretrained entity vector. > vector = kb.get_vector("Q42") > ``` -| Name | Type | Description | -| ------------- | ------------- | -------------------------------------------------- | -| `entity` | unicode | The entity ID | -| **RETURNS** | vector | The entity vector | +| Name | Description | +| ----------- | ------------------------------------ | +| `entity` | The entity ID. ~~str~~ | +| **RETURNS** | The entity vector. 
~~numpy.ndarray~~ | ## KnowledgeBase.get_prior_prob {#get_prior_prob tag="method"} -Given a certain entity ID and a certain textual mention, retrieve -the prior probability of the fact that the mention links to the entity ID. +Given a certain entity ID and a certain textual mention, retrieve the prior +probability of the fact that the mention links to the entity ID. > #### Example > @@ -191,30 +194,30 @@ the prior probability of the fact that the mention links to the entity ID. > probability = kb.get_prior_prob("Q42", "Douglas") > ``` -| Name | Type | Description | -| ------------- | ------------- | --------------------------------------------------------------- | -| `entity` | unicode | The entity ID | -| `alias` | unicode | The textual mention or alias | -| **RETURNS** | float | The prior probability of the `alias` referring to the `entity` | +| Name | Description | +| ----------- | ------------------------------------------------------------------------- | +| `entity` | The entity ID. ~~str~~ | +| `alias` | The textual mention or alias. ~~str~~ | +| **RETURNS** | The prior probability of the `alias` referring to the `entity`. ~~float~~ | -## KnowledgeBase.dump {#dump tag="method"} +## KnowledgeBase.to_disk {#to_disk tag="method"} Save the current state of the knowledge base to a directory. > #### Example > > ```python -> kb.dump(loc) +> kb.to_disk(loc) > ``` -| Name | Type | Description | -| ------------- | ---------------- | ------------------------------------------------------------------------------------------------------------------------ | -| `loc` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | +| Name | Description | +| ----- | ------------------------------------------------------------------------------------------------------------------------------------------ | +| `loc` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | -## KnowledgeBase.load_bulk {#load_bulk tag="method"} +## KnowledgeBase.from_disk {#from_disk tag="method"} -Restore the state of the knowledge base from a given directory. Note that the [`Vocab`](/api/vocab) -should also be the same as the one used to create the KB. +Restore the state of the knowledge base from a given directory. Note that the +[`Vocab`](/api/vocab) should also be the same as the one used to create the KB. > #### Example > @@ -223,21 +226,27 @@ should also be the same as the one used to create the KB. > from spacy.vocab import Vocab > vocab = Vocab().from_disk("/path/to/vocab") > kb = KnowledgeBase(vocab=vocab, entity_vector_length=64) -> kb.load_bulk("/path/to/kb") +> kb.from_disk("/path/to/kb") > ``` +| Name | Description | +| ----------- | ----------------------------------------------------------------------------------------------- | +| `loc` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| **RETURNS** | The modified `KnowledgeBase` object. ~~KnowledgeBase~~ | -| Name | Type | Description | -| ----------- | ---------------- | ----------------------------------------------------------------------------------------- | -| `loc` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. | -| **RETURNS** | `KnowledgeBase` | The modified `KnowledgeBase` object. 
| +## Candidate {#candidate tag="class"} +A `Candidate` object refers to a textual mention (alias) that may or may not be +resolved to a specific entity from a `KnowledgeBase`. This will be used as input +for the entity linking algorithm which will disambiguate the various candidates +to the correct one. Each candidate `(alias, entity)` pair is assigned to a +certain prior probability. -## Candidate.\_\_init\_\_ {#candidate_init tag="method"} +### Candidate.\_\_init\_\_ {#candidate-init tag="method"} Construct a `Candidate` object. Usually this constructor is not called directly, -but instead these objects are returned by the [`get_candidates`](/api/kb#get_candidates) method -of a `KnowledgeBase`. +but instead these objects are returned by the +[`get_candidates`](/api/kb#get_candidates) method of a `KnowledgeBase`. > #### Example > @@ -246,23 +255,22 @@ of a `KnowledgeBase`. > candidate = Candidate(kb, entity_hash, entity_freq, entity_vector, alias_hash, prior_prob) > ``` -| Name | Type | Description | -| ------------- | --------------- | -------------------------------------------------------------- | -| `kb` | `KnowledgeBase` | The knowledge base that defined this candidate. | -| `entity_hash` | int | The hash of the entity's KB ID. | -| `entity_freq` | float | The entity frequency as recorded in the KB. | -| `alias_hash` | int | The hash of the textual mention or alias. | -| `prior_prob` | float | The prior probability of the `alias` referring to the `entity` | -| **RETURNS** | `Candidate` | The newly constructed object. | +| Name | Description | +| ------------- | ------------------------------------------------------------------------- | +| `kb` | The knowledge base that defined this candidate. ~~KnowledgeBase~~ | +| `entity_hash` | The hash of the entity's KB ID. ~~int~~ | +| `entity_freq` | The entity frequency as recorded in the KB. ~~float~~ | +| `alias_hash` | The hash of the textual mention or alias. ~~int~~ | +| `prior_prob` | The prior probability of the `alias` referring to the `entity`. ~~float~~ | -## Candidate attributes {#candidate_attributes} +## Candidate attributes {#candidate-attributes} -| Name | Type | Description | -| ---------------------- | ------------ | ------------------------------------------------------------------ | -| `entity` | int | The entity's unique KB identifier | -| `entity_` | unicode | The entity's unique KB identifier | -| `alias` | int | The alias or textual mention | -| `alias_` | unicode | The alias or textual mention | -| `prior_prob` | long | The prior probability of the `alias` referring to the `entity` | -| `entity_freq` | long | The frequency of the entity in a typical corpus | -| `entity_vector` | vector | The pretrained vector of the entity | +| Name | Description | +| --------------- | ------------------------------------------------------------------------ | +| `entity` | The entity's unique KB identifier. ~~int~~ | +| `entity_` | The entity's unique KB identifier. ~~str~~ | +| `alias` | The alias or textual mention. ~~int~~ | +| `alias_` | The alias or textual mention. ~~str~~ | +| `prior_prob` | The prior probability of the `alias` referring to the `entity`. ~~long~~ | +| `entity_freq` | The frequency of the entity in a typical corpus. ~~long~~ | +| `entity_vector` | The pretrained vector of the entity. 
~~numpy.ndarray~~ | diff --git a/website/docs/api/language.md b/website/docs/api/language.md index 97dfbf100..51e9a5e10 100644 --- a/website/docs/api/language.md +++ b/website/docs/api/language.md @@ -7,9 +7,9 @@ source: spacy/language.py Usually you'll load this once per process as `nlp` and pass the instance around your application. The `Language` class is created when you call -[`spacy.load()`](/api/top-level#spacy.load) and contains the shared vocabulary -and [language data](/usage/adding-languages), optional model data loaded from a -[model package](/models) or a path, and a +[`spacy.load`](/api/top-level#spacy.load) and contains the shared vocabulary and +[language data](/usage/linguistic-features#language-data), optional binary +weights, e.g. provided by a [trained pipeline](/models), and the [processing pipeline](/usage/processing-pipelines) containing components like the tagger or parser that are called on a document in order. You can also add your own processing pipeline components that take a `Doc` object, modify it and @@ -17,25 +17,145 @@ return it. ## Language.\_\_init\_\_ {#init tag="method"} -Initialize a `Language` object. +Initialize a `Language` object. Note that the `meta` is only used for meta +information in [`Language.meta`](/api/language#meta) and not to configure the +`nlp` object or to override the config. To initialize from a config, use +[`Language.from_config`](/api/language#from_config) instead. > #### Example > > ```python +> # Construction from subclass +> from spacy.lang.en import English +> nlp = English() +> +> # Construction from scratch > from spacy.vocab import Vocab > from spacy.language import Language > nlp = Language(Vocab()) -> -> from spacy.lang.en import English -> nlp = English() > ``` -| Name | Type | Description | -| ----------- | ---------- | ------------------------------------------------------------------------------------------ | -| `vocab` | `Vocab` | A `Vocab` object. If `True`, a vocab is created via `Language.Defaults.create_vocab`. | -| `make_doc` | callable | A function that takes text and returns a `Doc` object. Usually a `Tokenizer`. | -| `meta` | dict | Custom meta data for the `Language` class. Is written to by models to add model meta data. | -| **RETURNS** | `Language` | The newly constructed object. | +| Name | Description | +| ------------------ | ------------------------------------------------------------------------------------------------------------------------ | +| `vocab` | A `Vocab` object. If `True`, a vocab is created using the default language data settings. ~~Vocab~~ | +| _keyword-only_ | | +| `max_length` | Maximum number of characters allowed in a single text. Defaults to `10 ** 6`. ~~int~~ | +| `meta` | [Meta data](/api/data-formats#meta) overrides. ~~Dict[str, Any]~~ | +| `create_tokenizer` | Optional function that receives the `nlp` object and returns a tokenizer. ~~Callable[[Language], Callable[[str], Doc]]~~ | + +## Language.from_config {#from_config tag="classmethod" new="3"} + +Create a `Language` object from a loaded config. Will set up the tokenizer and +language data, add pipeline components based on the pipeline and add pipeline +components based on the definitions specified in the config. If no config is +provided, the default config of the given language is used. This is also how +spaCy loads a model under the hood based on its +[`config.cfg`](/api/data-formats#config). 
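As a variation on the example below (the `config.cfg` path and the `"ner"` component name are placeholders), components can be excluded and missing values filled in from the defaults while loading:

```python
from thinc.api import Config
from spacy.language import Language

config = Config().from_disk("./config.cfg")
# Skip loading the "ner" component and auto-fill missing config values
nlp = Language.from_config(config, exclude=["ner"], auto_fill=True, validate=True)
```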
+ +> #### Example +> +> ```python +> from thinc.api import Config +> from spacy.language import Language +> +> config = Config().from_disk("./config.cfg") +> nlp = Language.from_config(config) +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `config` | The loaded config. ~~Union[Dict[str, Any], Config]~~ | +| _keyword-only_ | | +| `vocab` | A `Vocab` object. If `True`, a vocab is created using the default language data settings. ~~Vocab~~ | +| `disable` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). Disabled pipes will be loaded but they won't be run unless you explicitly enable them by calling [`nlp.enable_pipe`](/api/language#enable_pipe). ~~List[str]~~ | +| `exclude` | Names of pipeline components to [exclude](/usage/processing-pipelines#disabling). Excluded components won't be loaded. ~~List[str]~~ | +| `meta` | [Meta data](/api/data-formats#meta) overrides. ~~Dict[str, Any]~~ | +| `auto_fill` | Whether to automatically fill in missing values in the config, based on defaults and function argument annotations. Defaults to `True`. ~~bool~~ | +| `validate` | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`. ~~bool~~ | +| **RETURNS** | The initialized object. ~~Language~~ | + +## Language.component {#component tag="classmethod" new="3"} + +Register a custom pipeline component under a given name. This allows +initializing the component by name using +[`Language.add_pipe`](/api/language#add_pipe) and referring to it in +[config files](/usage/training#config). This classmethod and decorator is +intended for **simple stateless functions** that take a `Doc` and return it. For +more complex stateful components that allow settings and need access to the +shared `nlp` object, use the [`Language.factory`](/api/language#factory) +decorator. For more details and examples, see the +[usage documentation](/usage/processing-pipelines#custom-components). + +> #### Example +> +> ```python +> from spacy.language import Language +> +> # Usage as a decorator +> @Language.component("my_component") +> def my_component(doc): +> # Do something to the doc +> return doc +> +> # Usage as a function +> Language.component("my_component2", func=my_component) +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `name` | The name of the component factory. ~~str~~ | +| _keyword-only_ | | +| `assigns` | `Doc` or `Token` attributes assigned by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~ | +| `requires` | `Doc` or `Token` attributes required by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~ | +| `retokenizes` | Whether the component changes tokenization. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~bool~~ | +| `func` | Optional function if not used as a decorator. ~~Optional[Callable[[Doc], Doc]]~~ | + +## Language.factory {#factory tag="classmethod"} + +Register a custom pipeline component factory under a given name. 
This allows +initializing the component by name using +[`Language.add_pipe`](/api/language#add_pipe) and referring to it in +[config files](/usage/training#config). The registered factory function needs to +take at least two **named arguments** which spaCy fills in automatically: `nlp` +for the current `nlp` object and `name` for the component instance name. This +can be useful to distinguish multiple instances of the same component and allows +trainable components to add custom losses using the component instance name. The +`default_config` defines the default values of the remaining factory arguments. +It's merged into the [`nlp.config`](/api/language#config). For more details and +examples, see the +[usage documentation](/usage/processing-pipelines#custom-components). + +> #### Example +> +> ```python +> from spacy.language import Language +> +> # Usage as a decorator +> @Language.factory( +> "my_component", +> default_config={"some_setting": True}, +> ) +> def create_my_component(nlp, name, some_setting): +> return MyComponent(some_setting) +> +> # Usage as function +> Language.factory( +> "my_component", +> default_config={"some_setting": True}, +> func=create_my_component +> ) +> ``` + +| Name | Description | +| ----------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | The name of the component factory. ~~str~~ | +| _keyword-only_ | | +| `default_config` | The default config, describing the default values of the factory arguments. ~~Dict[str, Any]~~ | +| `assigns` | `Doc` or `Token` attributes assigned by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~ | +| `requires` | `Doc` or `Token` attributes required by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~ | +| `retokenizes` | Whether the component changes tokenization. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~bool~~ | +| `default_score_weights` | The scores to report during training, and their default weight towards the final score used to select the best model. Weights should sum to `1.0` per component and will be combined and normalized for the whole pipeline. If a weight is set to `None`, the score will not be logged or weighted. ~~Dict[str, Optional[float]]~~ | +| `func` | Optional function if not used as a decorator. ~~Optional[Callable[[...], Callable[[Doc], Doc]]]~~ | ## Language.\_\_call\_\_ {#call tag="method"} @@ -49,116 +169,199 @@ contain arbitrary whitespace. Alignment into the original string is preserved. > assert (doc[0].text, doc[0].head.tag_) == ("An", "NN") > ``` -| Name | Type | Description | -| ----------- | ------- | --------------------------------------------------------------------------------- | -| `text` | unicode | The text to be processed. | -| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). | -| **RETURNS** | `Doc` | A container for accessing the annotations. | - - - -Pipeline components to prevent from being loaded can now be added as a list to -`disable`, instead of specifying one keyword argument per component. 
- -```diff -- doc = nlp("I don't want parsed", parse=False) -+ doc = nlp("I don't want parsed", disable=["parser"]) -``` - - +| Name | Description | +| --------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- | +| `text` | The text to be processed. ~~str~~ | +| _keyword-only_ | | +| `disable` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). ~~List[str]~~ | +| `component_cfg` | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to `None`. ~~Optional[Dict[str, Dict[str, Any]]]~~ | +| **RETURNS** | A container for accessing the annotations. ~~Doc~~ | ## Language.pipe {#pipe tag="method"} Process texts as a stream, and yield `Doc` objects in order. This is usually more efficient than processing texts one-by-one. - +> #### Example +> +> ```python +> texts = ["One document.", "...", "Lots of documents"] +> for doc in nlp.pipe(texts, batch_size=50): +> assert doc.has_annotation("DEP") +> ``` -Early versions of spaCy used simple statistical models that could be efficiently -multi-threaded, as we were able to entirely release Python's global interpreter -lock. The multi-threading was controlled using the `n_threads` keyword argument -to the `.pipe` method. This keyword argument is now deprecated as of v2.1.0. A -new keyword argument, `n_process`, was introduced to control parallel inference -via multiprocessing in v2.2.2. +| Name | Description | +| ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `texts` | A sequence of strings. ~~Iterable[str]~~ | +| _keyword-only_ | | +| `as_tuples` | If set to `True`, inputs should be a sequence of `(text, context)` tuples. Output will then be a sequence of `(doc, context)` tuples. Defaults to `False`. ~~bool~~ | +| `batch_size` | The number of texts to buffer. ~~int~~ | +| `disable` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). ~~List[str]~~ | +| `cleanup` | If `True`, unneeded strings are freed to control memory use. Experimental. ~~bool~~ | +| `component_cfg` | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to `None`. ~~Optional[Dict[str, Dict[str, Any]]]~~ | +| `n_process` 2.2.2 | Number of processors to use. Defaults to `1`. ~~int~~ | +| **YIELDS** | Documents in the order of the original text. ~~Doc~~ | + +## Language.initialize {#initialize tag="method" new="3"} + +Initialize the pipeline for training and return an +[`Optimizer`](https://thinc.ai/docs/api-optimizers). Under the hood, it uses the +settings defined in the [`[initialize]`](/api/data-formats#config-initialize) +config block to set up the vocabulary, load in vectors and tok2vec weights and +pass optional arguments to the `initialize` methods implemented by pipeline +components or the tokenizer. This method is typically called automatically when +you run [`spacy train`](/api/cli#train). See the usage guide on the +[config lifecycle](/usage/training#config-lifecycle) and +[initialization](/usage/training#initialization) for details. + +`get_examples` should be a function that returns an iterable of +[`Example`](/api/example) objects. The data examples can either be the full +training data or a representative sample. 
They are used to **initialize the +models** of trainable pipeline components and are passed each component's +[`initialize`](/api/pipe#initialize) method, if available. Initialization +includes validating the network, +[inferring missing shapes](/usage/layers-architectures#thinc-shape-inference) +and setting up the label scheme based on the data. + +If no `get_examples` function is provided when calling `nlp.initialize`, the +pipeline components will be initialized with generic data. In this case, it is +crucial that the output dimension of each component has already been defined +either in the [config](/usage/training#config), or by calling +[`pipe.add_label`](/api/pipe#add_label) for each possible output label (e.g. for +the tagger or textcat). + + + +This method was previously called `begin_training`. It now also takes a +**function** that is called with no arguments and returns a sequence of +[`Example`](/api/example) objects instead of tuples of `Doc` and `GoldParse` +objects. > #### Example > > ```python -> texts = ["One document.", "...", "Lots of documents"] -> for doc in nlp.pipe(texts, batch_size=50): -> assert doc.is_parsed +> get_examples = lambda: examples +> optimizer = nlp.initialize(get_examples) > ``` -| Name | Type | Description | -| -------------------------------------------- | ----- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `texts` | - | A sequence of unicode objects. | -| `as_tuples` | bool | If set to `True`, inputs should be a sequence of `(text, context)` tuples. Output will then be a sequence of `(doc, context)` tuples. Defaults to `False`. | -| `batch_size` | int | The number of texts to buffer. | -| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). | -| `component_cfg` 2.1 | dict | Config parameters for specific pipeline components, keyed by component name. | -| `n_process` 2.2.2 | int | Number of processors to use, only supported in Python 3. Defaults to `1`. | -| **YIELDS** | `Doc` | Documents in the order of the original text. | +| Name | Description | +| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `get_examples` | Optional function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Optional[Callable[[], Iterable[Example]]]~~ | +| _keyword-only_ | | +| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ | +| **RETURNS** | The optimizer. ~~Optimizer~~ | + +## Language.resume_training {#resume_training tag="method,experimental" new="3"} + +Continue training a trained pipeline. Create and return an optimizer, and +initialize "rehearsal" for any pipeline component that has a `rehearse` method. +Rehearsal is used to prevent models from "forgetting" their initialized +"knowledge". To perform rehearsal, collect samples of text you want the models +to retain performance on, and call [`nlp.rehearse`](/api/language#rehearse) with +a batch of [Example](/api/example) objects. 
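One possible rehearsal workflow is sketched below. It assumes a trained pipeline such as `en_core_web_sm` is installed and uses placeholder texts standing in for the data you want the models to retain performance on:

```python
import spacy
from spacy.training import Example

nlp = spacy.load("en_core_web_sm")  # assumed to be installed
# Placeholder texts the pipeline should keep performing well on
raw_texts = ["Apple is looking at buying a U.K. startup.", "Revenue rose sharply in 2019."]

optimizer = nlp.resume_training()
# Rehearsal examples don't need gold annotations: the initial model's own
# predictions are used as the target
rehearsal_examples = [Example.from_dict(nlp.make_doc(text), {}) for text in raw_texts]
losses = {}
nlp.rehearse(rehearsal_examples, sgd=optimizer, losses=losses)
print(losses)
```

In a real script you would interleave `rehearse` calls with regular `update` calls on your new training data.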
+ +> #### Example +> +> ```python +> optimizer = nlp.resume_training() +> nlp.rehearse(examples, sgd=optimizer) +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------- | +| _keyword-only_ | | +| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ | +| **RETURNS** | The optimizer. ~~Optimizer~~ | ## Language.update {#update tag="method"} Update the models in the pipeline. + + +The `Language.update` method now takes a batch of [`Example`](/api/example) +objects instead of the raw texts and annotations or `Doc` and `GoldParse` +objects. An [`Example`](/api/example) streamlines how data is passed around. It +stores two `Doc` objects: one for holding the gold-standard reference data, and +one for holding the predictions of the pipeline. + +For most use cases, you shouldn't have to write your own training scripts +anymore. Instead, you can use [`spacy train`](/api/cli#train) with a config file +and custom registered functions if needed. See the +[training documentation](/usage/training) for details. + + + > #### Example > > ```python > for raw_text, entity_offsets in train_data: > doc = nlp.make_doc(raw_text) -> gold = GoldParse(doc, entities=entity_offsets) -> nlp.update([doc], [gold], drop=0.5, sgd=optimizer) +> example = Example.from_dict(doc, {"entities": entity_offsets}) +> nlp.update([example], sgd=optimizer) > ``` -| Name | Type | Description | -| -------------------------------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `docs` | iterable | A batch of `Doc` objects or unicode. If unicode, a `Doc` object will be created from the text. | -| `golds` | iterable | A batch of `GoldParse` objects or dictionaries. Dictionaries will be used to create [`GoldParse`](/api/goldparse) objects. For the available keys and their usage, see [`GoldParse.__init__`](/api/goldparse#init). | -| `drop` | float | The dropout rate. | -| `sgd` | callable | An optimizer. | -| `losses` | dict | Dictionary to update with the loss, keyed by pipeline component. | -| `component_cfg` 2.1 | dict | Config parameters for specific pipeline components, keyed by component name. | +| Name | Description | +| --------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- | +| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ | +| _keyword-only_ | | +| `drop` | The dropout rate. ~~float~~ | +| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ | +| `losses` | Dictionary to update with the loss, keyed by pipeline component. ~~Optional[Dict[str, float]]~~ | +| `component_cfg` | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to `None`. ~~Optional[Dict[str, Dict[str, Any]]]~~ | +| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ | + +## Language.rehearse {#rehearse tag="method,experimental" new="3"} + +Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the +current model to make predictions similar to an initial model, to try to address +the "catastrophic forgetting" problem. 
This feature is experimental. + +> #### Example +> +> ```python +> optimizer = nlp.resume_training() +> losses = nlp.rehearse(examples, sgd=optimizer) +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------- | +| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ | +| _keyword-only_ | | +| `drop` | The dropout rate. ~~float~~ | +| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ | +| `losses` | Dictionary to update with the loss, keyed by pipeline component. ~~Optional[Dict[str, float]]~~ | +| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ | ## Language.evaluate {#evaluate tag="method"} -Evaluate a model's pipeline components. +Evaluate a pipeline's components. + + + +The `Language.update` method now takes a batch of [`Example`](/api/example) +objects instead of tuples of `Doc` and `GoldParse` objects. + + > #### Example > > ```python -> scorer = nlp.evaluate(docs_golds, verbose=True) -> print(scorer.scores) +> scores = nlp.evaluate(examples) +> print(scores) > ``` -| Name | Type | Description | -| -------------------------------------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `docs_golds` | iterable | Tuples of `Doc` and `GoldParse` objects, such that the `Doc` objects contain the predictions and the `GoldParse` objects the correct annotations. Alternatively, `(text, annotations)` tuples of raw text and a dict (see [simple training style](/usage/training#training-simple-style)). | -| `verbose` | bool | Print debugging information. | -| `batch_size` | int | The batch size to use. | -| `scorer` | `Scorer` | Optional [`Scorer`](/api/scorer) to use. If not passed in, a new one will be created. | -| `component_cfg` 2.1 | dict | Config parameters for specific pipeline components, keyed by component name. | -| **RETURNS** | Scorer | The scorer containing the evaluation scores. | - -## Language.begin_training {#begin_training tag="method"} - -Allocate models, pre-process training data and acquire an optimizer. - -> #### Example -> -> ```python -> optimizer = nlp.begin_training(gold_tuples) -> ``` - -| Name | Type | Description | -| -------------------------------------------- | -------- | ---------------------------------------------------------------------------- | -| `gold_tuples` | iterable | Gold-standard training data. | -| `component_cfg` 2.1 | dict | Config parameters for specific pipeline components, keyed by component name. | -| `**cfg` | - | Config parameters (sent to all components). | -| **RETURNS** | callable | An optimizer. | +| Name | Description | +| --------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- | +| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ | +| _keyword-only_ | | +| `batch_size` | The batch size to use. ~~int~~ | +| `scorer` | Optional [`Scorer`](/api/scorer) to use. If not passed in, a new one will be created. ~~Optional[Scorer]~~ | +| `component_cfg` | Optional dictionary of keyword arguments for components, keyed by component names. Defaults to `None`. 
~~Optional[Dict[str, Dict[str, Any]]]~~ | +| `scorer_cfg` | Optional dictionary of keyword arguments for the `Scorer`. Defaults to `None`. ~~Optional[Dict[str, Any]]~~ | +| **RETURNS** | A dictionary of evaluation scores. ~~Dict[str, Union[float, Dict[str, float]]]~~ | ## Language.use_params {#use_params tag="contextmanager, method"} @@ -173,62 +376,113 @@ their original weights after the block. > nlp.to_disk("/tmp/checkpoint") > ``` -| Name | Type | Description | -| -------- | ---- | --------------------------------------------- | -| `params` | dict | A dictionary of parameters keyed by model ID. | -| `**cfg` | - | Config parameters. | +| Name | Description | +| -------- | ------------------------------------------------------ | +| `params` | A dictionary of parameters keyed by model ID. ~~dict~~ | -## Language.preprocess_gold {#preprocess_gold tag="method"} +## Language.add_pipe {#add_pipe tag="method" new="2"} -Can be called before training to pre-process gold data. By default, it handles -nonprojectivity and adds missing tags to the tag map. +Add a component to the processing pipeline. Expects a name that maps to a +component factory registered using +[`@Language.component`](/api/language#component) or +[`@Language.factory`](/api/language#factory). Components should be callables +that take a `Doc` object, modify it and return it. Only one of `before`, +`after`, `first` or `last` can be set. Default behavior is `last=True`. -| Name | Type | Description | -| ------------ | -------- | ---------------------------------------- | -| `docs_golds` | iterable | Tuples of `Doc` and `GoldParse` objects. | -| **YIELDS** | tuple | Tuples of `Doc` and `GoldParse` objects. | + + +As of v3.0, the [`Language.add_pipe`](/api/language#add_pipe) method doesn't +take callables anymore and instead expects the **name of a component factory** +registered using [`@Language.component`](/api/language#component) or +[`@Language.factory`](/api/language#factory). It now takes care of creating the +component, adds it to the pipeline and returns it. + + + +> #### Example +> +> ```python +> @Language.component("component") +> def component_func(doc): +> # modify Doc and return it return doc +> +> nlp.add_pipe("component", before="ner") +> component = nlp.add_pipe("component", name="custom_name", last=True) +> +> # Add component from source pipeline +> source_nlp = spacy.load("en_core_web_sm") +> nlp.add_pipe("ner", source=source_nlp) +> ``` + +| Name | Description | +| ------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `factory_name` | Name of the registered component factory. ~~str~~ | +| `name` | Optional unique name of pipeline component instance. If not set, the factory name is used. An error is raised if the name already exists in the pipeline. ~~Optional[str]~~ | +| _keyword-only_ | | +| `before` | Component name or index to insert component directly before. ~~Optional[Union[str, int]]~~ | +| `after` | Component name or index to insert component directly after. ~~Optional[Union[str, int]]~~ | +| `first` | Insert component first / not first in the pipeline. ~~Optional[bool]~~ | +| `last` | Insert component last / not last in the pipeline. ~~Optional[bool]~~ | +| `config` 3 | Optional config parameters to use for this component. 
Will be merged with the `default_config` specified by the component factory. ~~Optional[Dict[str, Any]]~~ | +| `source` 3 | Optional source pipeline to copy component from. If a source is provided, the `factory_name` is interpreted as the name of the component in the source pipeline. Make sure that the vocab, vectors and settings of the source pipeline match the target pipeline. ~~Optional[Language]~~ | +| `validate` 3 | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`. ~~bool~~ | +| **RETURNS** | The pipeline component. ~~Callable[[Doc], Doc]~~ | ## Language.create_pipe {#create_pipe tag="method" new="2"} Create a pipeline component from a factory. + + +As of v3.0, the [`Language.add_pipe`](/api/language#add_pipe) method also takes +the string name of the factory, creates the component, adds it to the pipeline +and returns it. The `Language.create_pipe` method is now mostly used internally. +To create a component and add it to the pipeline, you should always use +`Language.add_pipe`. + + + > #### Example > > ```python > parser = nlp.create_pipe("parser") -> nlp.add_pipe(parser) > ``` -| Name | Type | Description | -| ----------- | -------- | ---------------------------------------------------------------------------------- | -| `name` | unicode | Factory name to look up in [`Language.factories`](/api/language#class-attributes). | -| `config` | dict | Configuration parameters to initialize component. | -| **RETURNS** | callable | The pipeline component. | +| Name | Description | +| ------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `factory_name` | Name of the registered component factory. ~~str~~ | +| `name` | Optional unique name of pipeline component instance. If not set, the factory name is used. An error is raised if the name already exists in the pipeline. ~~Optional[str]~~ | +| _keyword-only_ | | +| `config` 3 | Optional config parameters to use for this component. Will be merged with the `default_config` specified by the component factory. ~~Optional[Dict[str, Any]]~~ | +| `validate` 3 | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`. ~~bool~~ | +| **RETURNS** | The pipeline component. ~~Callable[[Doc], Doc]~~ | -## Language.add_pipe {#add_pipe tag="method" new="2"} +## Language.has_factory {#has_factory tag="classmethod" new="3"} -Add a component to the processing pipeline. Valid components are callables that -take a `Doc` object, modify it and return it. Only one of `before`, `after`, -`first` or `last` can be set. Default behavior is `last=True`. +Check whether a factory name is registered on the `Language` class or subclass. +Will check for +[language-specific factories](/usage/processing-pipelines#factories-language) +registered on the subclass, as well as general-purpose factories registered on +the `Language` base class, available to all subclasses. 
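As a rough illustration, `has_factory` can be used to guard an `add_pipe` call; the component name here is only an example:

```python
from spacy.lang.en import English

nlp = English()
# Only add the component if a factory of that name is available for this
# language class (registered on English or on the Language base class)
if English.has_factory("lemmatizer"):
    nlp.add_pipe("lemmatizer")
```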
> #### Example > > ```python -> def component(doc): -> # modify Doc and return it return doc +> from spacy.language import Language +> from spacy.lang.en import English > -> nlp.add_pipe(component, before="ner") -> nlp.add_pipe(component, name="custom_name", last=True) +> @English.component("component") +> def component(doc): +> return doc +> +> assert English.has_factory("component") +> assert not Language.has_factory("component") > ``` -| Name | Type | Description | -| ----------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| `component` | callable | The pipeline component. | -| `name` | unicode | Name of pipeline component. Overwrites existing `component.name` attribute if available. If no `name` is set and the component exposes no name attribute, `component.__name__` is used. An error is raised if the name already exists in the pipeline. | -| `before` | unicode | Component name to insert component directly before. | -| `after` | unicode | Component name to insert component directly after: | -| `first` | bool | Insert component first / not first in the pipeline. | -| `last` | bool | Insert component last / not last in the pipeline. | +| Name | Description | +| ----------- | ------------------------------------------------------------------- | +| `name` | Name of the pipeline factory to check. ~~str~~ | +| **RETURNS** | Whether a factory of that name is registered on the class. ~~bool~~ | ## Language.has_pipe {#has_pipe tag="method" new="2"} @@ -238,15 +492,19 @@ Check whether a component is present in the pipeline. Equivalent to > #### Example > > ```python -> nlp.add_pipe(lambda doc: doc, name="component") -> assert "component" in nlp.pipe_names -> assert nlp.has_pipe("component") +> @Language.component("component") +> def component(doc): +> return doc +> +> nlp.add_pipe("component", name="my_component") +> assert "my_component" in nlp.pipe_names +> assert nlp.has_pipe("my_component") > ``` -| Name | Type | Description | -| ----------- | ------- | -------------------------------------------------------- | -| `name` | unicode | Name of the pipeline component to check. | -| **RETURNS** | bool | Whether a component of that name exists in the pipeline. | +| Name | Description | +| ----------- | ----------------------------------------------------------------- | +| `name` | Name of the pipeline component to check. ~~str~~ | +| **RETURNS** | Whether a component of that name exists in the pipeline. ~~bool~~ | ## Language.get_pipe {#get_pipe tag="method" new="2"} @@ -259,25 +517,38 @@ Get a pipeline component for a given component name. > custom_component = nlp.get_pipe("custom_component") > ``` -| Name | Type | Description | -| ----------- | -------- | -------------------------------------- | -| `name` | unicode | Name of the pipeline component to get. | -| **RETURNS** | callable | The pipeline component. | +| Name | Description | +| ----------- | ------------------------------------------------ | +| `name` | Name of the pipeline component to get. ~~str~~ | +| **RETURNS** | The pipeline component. ~~Callable[[Doc], Doc]~~ | ## Language.replace_pipe {#replace_pipe tag="method" new="2"} -Replace a component in the pipeline. +Replace a component in the pipeline and return the new component. 
+ + + +As of v3.0, the `Language.replace_pipe` method doesn't take callables anymore +and instead expects the **name of a component factory** registered using +[`@Language.component`](/api/language#component) or +[`@Language.factory`](/api/language#factory). + + > #### Example > > ```python -> nlp.replace_pipe("parser", my_custom_parser) +> new_parser = nlp.replace_pipe("parser", "my_custom_parser") > ``` -| Name | Type | Description | -| ----------- | -------- | --------------------------------- | -| `name` | unicode | Name of the component to replace. | -| `component` | callable | The pipeline component to insert. | +| Name | Description | +| ------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `name` | Name of the component to replace. ~~str~~ | +| `component` | The factory name of the component to insert. ~~str~~ | +| _keyword-only_ | | +| `config` 3 | Optional config parameters to use for the new component. Will be merged with the `default_config` specified by the component factory. ~~Optional[Dict[str, Any]]~~ | +| `validate` 3 | Whether to validate the component config and arguments against the types expected by the factory. Defaults to `True`. ~~bool~~ | +| **RETURNS** | The new pipeline component. ~~Callable[[Doc], Doc]~~ | ## Language.rename_pipe {#rename_pipe tag="method" new="2"} @@ -292,10 +563,10 @@ added to the pipeline, you can also use the `name` argument on > nlp.rename_pipe("parser", "spacy_parser") > ``` -| Name | Type | Description | -| ---------- | ------- | -------------------------------- | -| `old_name` | unicode | Name of the component to rename. | -| `new_name` | unicode | New name of the component. | +| Name | Description | +| ---------- | ---------------------------------------- | +| `old_name` | Name of the component to rename. ~~str~~ | +| `new_name` | New name of the component. ~~str~~ | ## Language.remove_pipe {#remove_pipe tag="method" new="2"} @@ -309,109 +580,336 @@ component function. > assert name == "parser" > ``` -| Name | Type | Description | -| ----------- | ------- | ----------------------------------------------------- | -| `name` | unicode | Name of the component to remove. | -| **RETURNS** | tuple | A `(name, component)` tuple of the removed component. | +| Name | Description | +| ----------- | ------------------------------------------------------------------------------------------ | +| `name` | Name of the component to remove. ~~str~~ | +| **RETURNS** | A `(name, component)` tuple of the removed component. ~~Tuple[str, Callable[[Doc], Doc]]~~ | -## Language.disable_pipes {#disable_pipes tag="contextmanager, method" new="2"} +## Language.disable_pipe {#disable_pipe tag="method" new="3"} + +Temporarily disable a pipeline component so it's not run as part of the +pipeline. Disabled components are listed in +[`nlp.disabled`](/api/language#attributes) and included in +[`nlp.components`](/api/language#attributes), but not in +[`nlp.pipeline`](/api/language#pipeline), so they're not run when you process a +`Doc` with the `nlp` object. If the component is already disabled, this method +does nothing. 
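For instance, a small sketch (assuming the `en_core_web_sm` pipeline is installed) showing that a disabled component stays loaded but is skipped when text is processed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
nlp.disable_pipe("ner")
doc = nlp("Apple is looking at buying a U.K. startup.")
print(doc.ents)      # () - the entity recognizer didn't run
print(nlp.disabled)  # ['ner']
```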
+ +> #### Example +> +> ```python +> nlp.add_pipe("ner") +> nlp.add_pipe("textcat") +> assert nlp.pipe_names == ["ner", "textcat"] +> nlp.disable_pipe("ner") +> assert nlp.pipe_names == ["textcat"] +> assert nlp.component_names == ["ner", "textcat"] +> assert nlp.disabled == ["ner"] +> ``` + +| Name | Description | +| ------ | ----------------------------------------- | +| `name` | Name of the component to disable. ~~str~~ | + +## Language.enable_pipe {#enable_pipe tag="method" new="3"} + +Enable a previously disabled component (e.g. via +[`Language.disable_pipes`](/api/language#disable_pipes)) so it's run as part of +the pipeline, [`nlp.pipeline`](/api/language#pipeline). If the component is +already enabled, this method does nothing. + +> #### Example +> +> ```python +> nlp.disable_pipe("ner") +> assert "ner" in nlp.disabled +> assert not "ner" in nlp.pipe_names +> nlp.enable_pipe("ner") +> assert not "ner" in nlp.disabled +> assert "ner" in nlp.pipe_names +> ``` + +| Name | Description | +| ------ | ---------------------------------------- | +| `name` | Name of the component to enable. ~~str~~ | + +## Language.select_pipes {#select_pipes tag="contextmanager, method" new="3"} Disable one or more pipeline components. If used as a context manager, the pipeline will be restored to the initial state at the end of the block. Otherwise, a `DisabledPipes` object is returned, that has a `.restore()` method -you can use to undo your changes. +you can use to undo your changes. You can specify either `disable` (as a list or +string), or `enable`. In the latter case, all components not in the `enable` +list will be disabled. Under the hood, this method calls into +[`disable_pipe`](/api/language#disable_pipe) and +[`enable_pipe`](/api/language#enable_pipe). > #### Example > > ```python -> # New API as of v2.2.2 -> with nlp.disable_pipes(["tagger", "parser"]): -> nlp.begin_training() +> with nlp.select_pipes(disable=["tagger", "parser"]): +> nlp.initialize() > -> with nlp.disable_pipes("tagger", "parser"): -> nlp.begin_training() +> with nlp.select_pipes(enable="ner"): +> nlp.initialize() > -> disabled = nlp.disable_pipes("tagger", "parser") -> nlp.begin_training() +> disabled = nlp.select_pipes(disable=["tagger", "parser"]) +> nlp.initialize() > disabled.restore() > ``` -| Name | Type | Description | -| ----------------------------------------- | --------------- | ------------------------------------------------------------------------------------ | -| `disabled` 2.2.2 | list | Names of pipeline components to disable. | -| `*disabled` | unicode | Names of pipeline components to disable. | -| **RETURNS** | `DisabledPipes` | The disabled pipes that can be restored by calling the object's `.restore()` method. | + - - -As of spaCy v2.2.2, the `Language.disable_pipes` method can also take a list of -component names as its first argument (instead of a variable number of -arguments). This is especially useful if you're generating the component names -to disable programmatically. The new syntax will become the default in the -future. 
+As of spaCy v3.0, the `disable_pipes` method has been renamed to `select_pipes`: ```diff -- disabled = nlp.disable_pipes("tagger", "parser") -+ disabled = nlp.disable_pipes(["tagger", "parser"]) +- nlp.disable_pipes(["tagger", "parser"]) ++ nlp.select_pipes(disable=["tagger", "parser"]) ``` -## Language.to_disk {#to_disk tag="method" new="2"} +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------ | +| _keyword-only_ | | +| `disable` | Name(s) of pipeline components to disable. ~~Optional[Union[str, Iterable[str]]]~~ | +| `enable` | Name(s) of pipeline components that will not be disabled. ~~Optional[Union[str, Iterable[str]]]~~ | +| **RETURNS** | The disabled pipes that can be restored by calling the object's `.restore()` method. ~~DisabledPipes~~ | -Save the current state to a directory. If a model is loaded, this will **include -the model**. +## Language.get_factory_meta {#get_factory_meta tag="classmethod" new="3"} + +Get the factory meta information for a given pipeline component name. Expects +the name of the component **factory**. The factory meta is an instance of the +[`FactoryMeta`](/api/language#factorymeta) dataclass and contains the +information about the component and its default provided by the +[`@Language.component`](/api/language#component) or +[`@Language.factory`](/api/language#factory) decorator. > #### Example > > ```python -> nlp.to_disk("/path/to/models") +> factory_meta = Language.get_factory_meta("ner") +> assert factory_meta.factory == "ner" +> print(factory_meta.default_config) > ``` -| Name | Type | Description | -| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- | -| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | -| `exclude` | list | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. | +| Name | Description | +| ----------- | --------------------------------- | +| `name` | The factory name. ~~str~~ | +| **RETURNS** | The factory meta. ~~FactoryMeta~~ | + +## Language.get_pipe_meta {#get_pipe_meta tag="method" new="3"} + +Get the factory meta information for a given pipeline component name. Expects +the name of the component **instance** in the pipeline. The factory meta is an +instance of the [`FactoryMeta`](/api/language#factorymeta) dataclass and +contains the information about the component and its default provided by the +[`@Language.component`](/api/language#component) or +[`@Language.factory`](/api/language#factory) decorator. + +> #### Example +> +> ```python +> nlp.add_pipe("ner", name="entity_recognizer") +> factory_meta = nlp.get_pipe_meta("entity_recognizer") +> assert factory_meta.factory == "ner" +> print(factory_meta.default_config) +> ``` + +| Name | Description | +| ----------- | ------------------------------------ | +| `name` | The pipeline component name. ~~str~~ | +| **RETURNS** | The factory meta. ~~FactoryMeta~~ | + +## Language.analyze_pipes {#analyze_pipes tag="method" new="3"} + +Analyze the current pipeline components and show a summary of the attributes +they assign and require, and the scores they set. The data is based on the +information provided in the [`@Language.component`](/api/language#component) and +[`@Language.factory`](/api/language#factory) decorator. If requirements aren't +met, e.g. 
if a component specifies a required property that is not set by a +previous component, a warning is shown. + + + +The pipeline analysis is static and does **not actually run the components**. +This means that it relies on the information provided by the components +themselves. If a custom component declares that it assigns an attribute but it +doesn't, the pipeline analysis won't catch that. + + + +> #### Example +> +> ```python +> nlp = spacy.blank("en") +> nlp.add_pipe("tagger") +> nlp.add_pipe("entity_linker") +> analysis = nlp.analyze_pipes() +> ``` + + + +```json +### Structured +{ + "summary": { + "tagger": { + "assigns": ["token.tag"], + "requires": [], + "scores": ["tag_acc", "pos_acc", "lemma_acc"], + "retokenizes": false + }, + "entity_linker": { + "assigns": ["token.ent_kb_id"], + "requires": ["doc.ents", "doc.sents", "token.ent_iob", "token.ent_type"], + "scores": [], + "retokenizes": false + } + }, + "problems": { + "tagger": [], + "entity_linker": ["doc.ents", "doc.sents", "token.ent_iob", "token.ent_type"] + }, + "attrs": { + "token.ent_iob": { "assigns": [], "requires": ["entity_linker"] }, + "doc.ents": { "assigns": [], "requires": ["entity_linker"] }, + "token.ent_kb_id": { "assigns": ["entity_linker"], "requires": [] }, + "doc.sents": { "assigns": [], "requires": ["entity_linker"] }, + "token.tag": { "assigns": ["tagger"], "requires": [] }, + "token.ent_type": { "assigns": [], "requires": ["entity_linker"] } + } +} +``` + +``` +### Pretty +============================= Pipeline Overview ============================= + +# Component Assigns Requires Scores Retokenizes +- ------------- --------------- -------------- --------- ----------- +0 tagger token.tag tag_acc False + pos_acc + lemma_acc + +1 entity_linker token.ent_kb_id doc.ents False + doc.sents + token.ent_iob + token.ent_type + + +================================ Problems (4) ================================ +⚠ 'entity_linker' requirements not met: doc.ents, doc.sents, +token.ent_iob, token.ent_type +``` + + + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| _keyword-only_ | | +| `keys` | The values to display in the table. Corresponds to attributes of the [`FactoryMeta`](/api/language#factorymeta). Defaults to `["assigns", "requires", "scores", "retokenizes"]`. ~~List[str]~~ | +| `pretty` | Pretty-print the results as a table. Defaults to `False`. ~~bool~~ | +| **RETURNS** | Dictionary containing the pipe analysis, keyed by `"summary"` (component meta by pipe), `"problems"` (attribute names by pipe) and `"attrs"` (pipes that assign and require an attribute, keyed by attribute). ~~Optional[Dict[str, Any]]~~ | + +## Language.meta {#meta tag="property"} + +Meta data for the `Language` class, including name, version, data sources, +license, author information and more. If a trained pipeline is loaded, this +contains meta data of the pipeline. The `Language.meta` is also what's +serialized as the `meta.json` when you save an `nlp` object to disk. See the +[meta data format](/api/data-formats#meta) for more details. + + + +As of v3.0, the meta only contains **meta information** about the pipeline and +isn't used to construct the language class and pipeline components. This +information is expressed in the [`config.cfg`](/api/data-formats#config). 
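As a small illustration of that split (assuming a trained pipeline such as `en_core_web_sm` is installed): the meta describes the pipeline, while the config defines how it is constructed.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# Descriptive information about the loaded pipeline
print(nlp.meta["lang"], nlp.meta["name"], nlp.meta["version"])
# Construction information lives in the config instead
print(nlp.config["nlp"]["pipeline"])
```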
+ + + +> #### Example +> +> ```python +> print(nlp.meta) +> ``` + +| Name | Description | +| ----------- | --------------------------------- | +| **RETURNS** | The meta data. ~~Dict[str, Any]~~ | + +## Language.config {#config tag="property" new="3"} + +Export a trainable [`config.cfg`](/api/data-formats#config) for the current +`nlp` object. Includes the current pipeline, all configs used to create the +currently active pipeline components, as well as the default training config +that can be used with [`spacy train`](/api/cli#train). `Language.config` returns +a [Thinc `Config` object](https://thinc.ai/docs/api-config#config), which is a +subclass of the built-in `dict`. It supports the additional methods `to_disk` +(serialize the config to a file) and `to_str` (output the config as a string). + +> #### Example +> +> ```python +> nlp.config.to_disk("./config.cfg") +> print(nlp.config.to_str()) +> ``` + +| Name | Description | +| ----------- | ---------------------- | +| **RETURNS** | The config. ~~Config~~ | + +## Language.to_disk {#to_disk tag="method" new="2"} + +Save the current state to a directory. Under the hood, this method delegates to +the `to_disk` methods of the individual pipeline components, if available. This +means that if a trained pipeline is loaded, all components and their weights +will be saved to disk. + +> #### Example +> +> ```python +> nlp.to_disk("/path/to/pipeline") +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | +| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| _keyword-only_ | | +| `exclude` | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | ## Language.from_disk {#from_disk tag="method" new="2"} -Loads state from a directory. Modifies the object in place and returns it. If -the saved `Language` object contains a model, the model will be loaded. Note -that this method is commonly used via the subclasses like `English` or `German` -to make language-specific functionality like the -[lexical attribute getters](/usage/adding-languages#lex-attrs) available to the -loaded object. +Loads state from a directory, including all data that was saved with the +`Language` object. Modifies the object in place and returns it. + + + +Keep in mind that this method **only loads the serialized state** and doesn't +set up the `nlp` object. This means that it requires the correct language class +to be initialized and all pipeline components to be added to the pipeline. If +you want to load a serialized pipeline from a directory, you should use +[`spacy.load`](/api/top-level#spacy.load), which will set everything up for you. + + > #### Example > > ```python > from spacy.language import Language -> nlp = Language().from_disk("/path/to/model") +> nlp = Language().from_disk("/path/to/pipeline") > -> # using language-specific subclass +> # Using language-specific subclass > from spacy.lang.en import English -> nlp = English().from_disk("/path/to/en_model") +> nlp = English().from_disk("/path/to/pipeline") > ``` -| Name | Type | Description | -| ----------- | ---------------- | ----------------------------------------------------------------------------------------- | -| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. 
| -| `exclude` | list | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. | -| **RETURNS** | `Language` | The modified `Language` object. | - - - -As of spaCy v2.0, the `save_to_directory` method has been renamed to `to_disk`, -to improve consistency across classes. Pipeline components to prevent from being -loaded can now be added as a list to `disable` (v2.0) or `exclude` (v2.1), -instead of specifying one keyword argument per component. - -```diff -- nlp = spacy.load("en", tagger=False, entity=False) -+ nlp = English().from_disk("/model", exclude=["tagger", "ner"]) -``` - - +| Name | Description | +| -------------- | ----------------------------------------------------------------------------------------------------------- | +| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| _keyword-only_ | | +| `exclude` | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The modified `Language` object. ~~Language~~ | ## Language.to_bytes {#to_bytes tag="method"} @@ -423,16 +921,17 @@ Serialize the current state to a binary string. > nlp_bytes = nlp.to_bytes() > ``` -| Name | Type | Description | -| ----------- | ----- | ----------------------------------------------------------------------------------------- | -| `exclude` | list | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. | -| **RETURNS** | bytes | The serialized form of the `Language` object. | +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------ | +| _keyword-only_ | | +| `exclude` | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. ~~iterable~~ | +| **RETURNS** | The serialized form of the `Language` object. ~~bytes~~ | ## Language.from_bytes {#from_bytes tag="method"} Load state from a binary string. Note that this method is commonly used via the subclasses like `English` or `German` to make language-specific functionality -like the [lexical attribute getters](/usage/adding-languages#lex-attrs) +like the [lexical attribute getters](/usage/linguistic-features#language-data) available to the loaded object. > #### Example @@ -444,45 +943,81 @@ available to the loaded object. > nlp2.from_bytes(nlp_bytes) > ``` -| Name | Type | Description | -| ------------ | ---------- | ----------------------------------------------------------------------------------------- | -| `bytes_data` | bytes | The data to load from. | -| `exclude` | list | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. | -| **RETURNS** | `Language` | The `Language` object. | - - - -Pipeline components to prevent from being loaded can now be added as a list to -`disable` (v2.0) or `exclude` (v2.1), instead of specifying one keyword argument -per component. - -```diff -- nlp = English().from_bytes(bytes, tagger=False, entity=False) -+ nlp = English().from_bytes(bytes, exclude=["tagger", "ner"]) -``` - - +| Name | Description | +| -------------- | ----------------------------------------------------------------------------------------------------------- | +| `bytes_data` | The data to load from. ~~bytes~~ | +| _keyword-only_ | | +| `exclude` | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The `Language` object. 
~~Language~~ | ## Attributes {#attributes} -| Name | Type | Description | -| ------------------------------------------ | ----------- | ----------------------------------------------------------------------------------------------- | -| `vocab` | `Vocab` | A container for the lexical types. | -| `tokenizer` | `Tokenizer` | The tokenizer. | -| `make_doc` | `callable` | Callable that takes a unicode text and returns a `Doc`. | -| `pipeline` | list | List of `(name, component)` tuples describing the current processing pipeline, in order. | -| `pipe_names` 2 | list | List of pipeline component names, in order. | -| `pipe_labels` 2.2 | dict | List of labels set by the pipeline components, if available, keyed by component name. | -| `meta` | dict | Custom meta data for the Language class. If a model is loaded, contains meta data of the model. | -| `path` 2 | `Path` | Path to the model data directory, if a model is loaded. Otherwise `None`. | +| Name | Description | +| --------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- | +| `vocab` | A container for the lexical types. ~~Vocab~~ | +| `tokenizer` | The tokenizer. ~~Tokenizer~~ | +| `make_doc` | Callable that takes a string and returns a `Doc`. ~~Callable[[str], Doc]~~ | +| `pipeline` | List of `(name, component)` tuples describing the current processing pipeline, in order. ~~List[Tuple[str, Callable[[Doc], Doc]]]~~ | +| `pipe_names` 2 | List of pipeline component names, in order. ~~List[str]~~ | +| `pipe_labels` 2.2 | List of labels set by the pipeline components, if available, keyed by component name. ~~Dict[str, List[str]]~~ | +| `pipe_factories` 2.2 | Dictionary of pipeline component names, mapped to their factory names. ~~Dict[str, str]~~ | +| `factories` | All available factory functions, keyed by name. ~~Dict[str, Callable[[...], Callable[[Doc], Doc]]]~~ | +| `factory_names` 3 | List of all available factory names. ~~List[str]~~ | +| `components` 3 | List of all available `(name, component)` tuples, including components that are currently disabled. ~~List[Tuple[str, Callable[[Doc], Doc]]]~~ | +| `component_names` 3 | List of all available component names, including components that are currently disabled. ~~List[str]~~ | +| `disabled` 3 | Names of components that are currently disabled and don't run as part of the pipeline. ~~List[str]~~ | +| `path` 2 | Path to the pipeline data directory, if a pipeline is loaded from a path or package. Otherwise `None`. ~~Optional[Path]~~ | ## Class attributes {#class-attributes} -| Name | Type | Description | -| -------------------------------------- | ------- | ----------------------------------------------------------------------------------------------------------------------------------- | -| `Defaults` | class | Settings, data and factory methods for creating the `nlp` object and processing pipeline. | -| `lang` | unicode | Two-letter language ID, i.e. [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). | -| `factories` 2 | dict | Factories that create pre-defined pipeline components, e.g. the tagger, parser or entity recognizer, keyed by their component name. 
| +| Name | Description | +| ---------------- | -------------------- | +| `Defaults` | Settings, data and factory methods for creating the `nlp` object and processing pipeline. ~~Defaults~~ | +| `lang` | Two-letter language ID, i.e. [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). ~~str~~ | +| `default_config` | Base [config](/usage/training#config) to use for [Language.config](/api/language#config). Defaults to [`default_config.cfg`](%%GITHUB_SPACY/spacy/default_config.cfg). ~~Config~~ | + +## Defaults {#defaults} + +The following attributes can be set on the `Language.Defaults` class to +customize the default language data: + +> #### Example +> +> ```python +> from spacy.language import Language +> from spacy.lang.tokenizer_exceptions import URL_MATCH +> from thinc.api import Config +> +> DEFAULT_CONFIG = """ +> [nlp.tokenizer] +> @tokenizers = "MyCustomTokenizer.v1" +> """ +> +> class Defaults(Language.Defaults): +> stop_words = set() +> tokenizer_exceptions = {} +> prefixes = tuple() +> suffixes = tuple() +> infixes = tuple() +> token_match = None +> url_match = URL_MATCH +> lex_attr_getters = {} +> syntax_iterators = {} +> writing_system = {"direction": "ltr", "has_case": True, "has_letters": True} +> config = Config().from_str(DEFAULT_CONFIG) +> ``` + +| Name | Description | +| --------------------------------- | -------------------- | +| `stop_words` | List of stop words, used for `Token.is_stop`.

**Example:** [`stop_words.py`](%%GITHUB_SPACY/spacy/lang/en/stop_words.py) ~~Set[str]~~ | +| `tokenizer_exceptions` | Tokenizer exception rules, string mapped to list of token attributes.
**Example:** [`de/tokenizer_exceptions.py`](%%GITHUB_SPACY/spacy/lang/de/tokenizer_exceptions.py) ~~Dict[str, List[dict]]~~ | +| `prefixes`, `suffixes`, `infixes` | Prefix, suffix and infix rules for the default tokenizer.
**Example:** [`punctuation.py`](%%GITHUB_SPACY/spacy/lang/punctuation.py) ~~Optional[List[Union[str, Pattern]]]~~ | +| `token_match` | Optional regex for matching strings that should never be split, overriding the infix rules.
**Example:** [`fr/tokenizer_exceptions.py`](%%GITHUB_SPACY/spacy/lang/fr/tokenizer_exceptions.py) ~~Optional[Pattern]~~ | +| `url_match` | Regular expression for matching URLs. Prefixes and suffixes are removed before applying the match.
**Example:** [`tokenizer_exceptions.py`](%%GITHUB_SPACY/spacy/lang/tokenizer_exceptions.py) ~~Optional[Pattern]~~ | +| `lex_attr_getters` | Custom functions for setting lexical attributes on tokens, e.g. `like_num`.
**Example:** [`lex_attrs.py`](%%GITHUB_SPACY/spacy/lang/en/lex_attrs.py) ~~Dict[int, Callable[[str], Any]]~~ | +| `syntax_iterators` | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks).
**Example:** [`syntax_iterators.py`](%%GITHUB_SPACY/spacy/lang/en/syntax_iterators.py) ~~Dict[str, Callable[[Union[Doc, Span]], Iterator[Span]]]~~ | +| `writing_system` | Information about the language's writing system, available via `Vocab.writing_system`. Defaults to: `{"direction": "ltr", "has_case": True, "has_letters": True}`.
**Example:** [`zh/__init__.py`](%%GITHUB_SPACY/spacy/lang/zh/__init__.py) ~~Dict[str, Any]~~ | +| `config` | Default [config](/usage/training#config) added to `nlp.config`. This can include references to custom tokenizers or lemmatizers.
**Example:** [`zh/__init__.py`](%%GITHUB_SPACY/spacy/lang/zh/__init__.py) ~~Config~~ | ## Serialization fields {#serialization-fields} @@ -494,12 +1029,30 @@ serialization by passing in the string names via the `exclude` argument. > > ```python > data = nlp.to_bytes(exclude=["tokenizer", "vocab"]) -> nlp.from_disk("./model-data", exclude=["ner"]) +> nlp.from_disk("/pipeline", exclude=["ner"]) > ``` -| Name | Description | -| ----------- | -------------------------------------------------- | -| `vocab` | The shared [`Vocab`](/api/vocab). | -| `tokenizer` | Tokenization rules and exceptions. | -| `meta` | The meta data, available as `Language.meta`. | -| ... | String names of pipeline components, e.g. `"ner"`. | +| Name | Description | +| ----------- | ------------------------------------------------------------------ | +| `vocab` | The shared [`Vocab`](/api/vocab). | +| `tokenizer` | Tokenization rules and exceptions. | +| `meta` | The meta data, available as [`Language.meta`](/api/language#meta). | +| ... | String names of pipeline components, e.g. `"ner"`. | + +## FactoryMeta {#factorymeta new="3" tag="dataclass"} + +The `FactoryMeta` contains the information about the component and its default +provided by the [`@Language.component`](/api/language#component) or +[`@Language.factory`](/api/language#factory) decorator. It's created whenever a +component is defined and stored on the `Language` class for each component +instance and factory instance. + +| Name | Description | +| ----------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `factory` | The name of the registered component factory. ~~str~~ | +| `default_config` | The default config, describing the default values of the factory arguments. ~~Dict[str, Any]~~ | +| `assigns` | `Doc` or `Token` attributes assigned by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~ | +| `requires` | `Doc` or `Token` attributes required by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~  | +| `retokenizes` | Whether the component changes tokenization. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~bool~~  | +| `default_score_weights` | The scores to report during training, and their default weight towards the final score used to select the best model. Weights should sum to `1.0` per component and will be combined and normalized for the whole pipeline. If a weight is set to `None`, the score will not be logged or weighted. ~~Dict[str, Optional[float]]~~ | +| `scores` | All scores set by the components if it's trainable, e.g. `["ents_f", "ents_r", "ents_p"]`. Based on the `default_score_weights` and used for [pipe analysis](/usage/processing-pipelines#analysis). 
~~Iterable[str]~~ | diff --git a/website/docs/api/lemmatizer.md b/website/docs/api/lemmatizer.md index f43e17fd3..e838c75b2 100644 --- a/website/docs/api/lemmatizer.md +++ b/website/docs/api/lemmatizer.md @@ -1,115 +1,292 @@ --- title: Lemmatizer -teaser: Assign the base forms of words tag: class -source: spacy/lemmatizer.py +source: spacy/pipeline/lemmatizer.py +new: 3 +teaser: 'Pipeline component for lemmatization' +api_base_class: /api/pipe +api_string_name: lemmatizer +api_trainable: false --- -The `Lemmatizer` supports simple part-of-speech-sensitive suffix rules and -lookup tables. +Component for assigning base forms to tokens using rules based on part-of-speech +tags, or lookup tables. Functionality to train the component is coming soon. +Different [`Language`](/api/language) subclasses can implement their own +lemmatizer components via +[language-specific factories](/usage/processing-pipelines#factories-language). +The default data used is provided by the +[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) +extension package. -## Lemmatizer.\_\_init\_\_ {#init tag="method"} + -Initialize a `Lemmatizer`. Typically, this happens under the hood within spaCy -when a `Language` subclass and its `Vocab` is initialized. +As of v3.0, the `Lemmatizer` is a **standalone pipeline component** that can be +added to your pipeline, and not a hidden part of the vocab that runs behind the +scenes. This makes it easier to customize how lemmas should be assigned in your +pipeline. -> #### Example -> -> ```python -> from spacy.lemmatizer import Lemmatizer -> from spacy.lookups import Lookups -> lookups = Lookups() -> lookups.add_table("lemma_rules", {"noun": [["s", ""]]}) -> lemmatizer = Lemmatizer(lookups) -> ``` -> -> For examples of the data format, see the -> [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) repo. - -| Name | Type | Description | -| -------------------------------------- | ------------------------- | ------------------------------------------------------------------------------------------------------------------------- | -| `lookups` 2.2 | [`Lookups`](/api/lookups) | The lookups object containing the (optional) tables `"lemma_rules"`, `"lemma_index"`, `"lemma_exc"` and `"lemma_lookup"`. | -| **RETURNS** | `Lemmatizer` | The newly created object. | - - - -As of v2.2, the lemmatizer is initialized with a [`Lookups`](/api/lookups) -object containing tables for the different components. This makes it easier for -spaCy to share and serialize rules and lookup tables via the `Vocab`, and allows -users to modify lemmatizer data at runtime by updating `nlp.vocab.lookups`. - -```diff -- lemmatizer = Lemmatizer(rules=lemma_rules) -+ lemmatizer = Lemmatizer(lookups) -``` +If the lemmatization mode is set to `"rule"`, which requires coarse-grained POS +(`Token.pos`) to be assigned, make sure a [`Tagger`](/api/tagger), +[`Morphologizer`](/api/morphologizer) or another component assigning POS is +available in the pipeline and runs _before_ the lemmatizer. +## Config and implementation + +The default config is defined by the pipeline component factory and describes +how the component should be configured. You can override its settings via the +`config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your +[`config.cfg` for training](/usage/training#config). For examples of the lookups +data format used by the lookup and rule-based lemmatizers, see +[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). 
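A rough end-to-end sketch, under the assumption that the `en_core_web_sm` pipeline and the `spacy-lookups-data` package are both installed, showing a rule-based lemmatizer added behind components that assign the coarse-grained POS tags it relies on:

```python
import spacy

# Assumes en_core_web_sm and spacy-lookups-data are installed
nlp = spacy.load("en_core_web_sm", exclude=["lemmatizer"])
# The tagger and attribute_ruler already in the pipeline assign token.pos,
# which the rule-based mode needs
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "rule"})
lemmatizer.initialize()  # loads the default rule tables from spacy-lookups-data
doc = nlp("The striped bats were hanging upside down")
print([token.lemma_ for token in doc])
```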
+ +> #### Example +> +> ```python +> config = {"mode": "rule"} +> nlp.add_pipe("lemmatizer", config=config) +> ``` + +| Setting | Description | +| ----------- | -------------------- | +| `mode` | The lemmatizer mode, e.g. `"lookup"` or `"rule"`. Defaults to `"lookup"`. ~~str~~ | +| `overwrite` | Whether to overwrite existing lemmas. Defaults to `False`. ~~bool~~ | +| `model` | **Not yet implemented:** the model to use. ~~Model~~ | + +```python +%%GITHUB_SPACY/spacy/pipeline/lemmatizer.py +``` + +## Lemmatizer.\_\_init\_\_ {#init tag="method"} + +> #### Example +> +> ```python +> # Construction via add_pipe with default model +> lemmatizer = nlp.add_pipe("lemmatizer") +> +> # Construction via add_pipe with custom settings +> config = {"mode": "rule", "overwrite": True} +> lemmatizer = nlp.add_pipe("lemmatizer", config=config) +> ``` + +Create a new pipeline instance. In your application, you would normally use a +shortcut for this and instantiate the component using its string name and +[`nlp.add_pipe`](/api/language#add_pipe). + +| Name | Description | +| -------------- | -------------------- | +| `vocab` | The shared vocabulary. ~~Vocab~~ | +| `model` | **Not yet implemented:** The model to use. ~~Model~~ | +| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ | +| _keyword-only_ | | +| `mode` | The lemmatizer mode, e.g. `"lookup"` or `"rule"`. Defaults to `"lookup"`. ~~str~~ | +| `overwrite` | Whether to overwrite existing lemmas. ~~bool~~ | + ## Lemmatizer.\_\_call\_\_ {#call tag="method"} -Lemmatize a string. +Apply the pipe to one document. The document is modified in place, and returned. +This usually happens under the hood when the `nlp` object is called on a text +and all pipeline components are applied to the `Doc` in order. > #### Example > > ```python -> from spacy.lemmatizer import Lemmatizer -> from spacy.lookups import Lookups -> lookups = Lookups() -> lookups.add_table("lemma_rules", {"noun": [["s", ""]]}) -> lemmatizer = Lemmatizer(lookups) -> lemmas = lemmatizer("ducks", "NOUN") -> assert lemmas == ["duck"] +> doc = nlp("This is a sentence.") +> lemmatizer = nlp.add_pipe("lemmatizer") +> # This usually happens under the hood +> processed = lemmatizer(doc) > ``` -| Name | Type | Description | -| ------------ | ------------- | -------------------- | -| `string` | unicode | The string to lemmatize, e.g. the token text. | -| `univ_pos` | unicode / int | The token's universal part-of-speech tag. | -| `morphology` | dict / `None` | Morphological features following the [Universal Dependencies](http://universaldependencies.org/) scheme. | -| **RETURNS** | list | The available lemmas for the string. | +| Name | Description | +| ----------- | -------------------------------- | +| `doc` | The document to process. ~~Doc~~ | +| **RETURNS** | The processed document. ~~Doc~~ | -## Lemmatizer.lookup {#lookup tag="method" new="2"} -Look up a lemma in the lookup table, if available. If no lemma is found, the -original string is returned. Languages can provide a -[lookup table](/usage/adding-languages#lemmatizer) via the `Lookups`. +## Lemmatizer.pipe {#pipe tag="method"} +Apply the pipe to a stream of documents.
This usually happens under the hood +when the `nlp` object is called on a text and all pipeline components are +applied to the `Doc` in order. > #### Example > > ```python -> lookups = Lookups() -> lookups.add_table("lemma_lookup", {"going": "go"}) -> assert lemmatizer.lookup("going") == "go" +> lemmatizer = nlp.add_pipe("lemmatizer") +> for doc in lemmatizer.pipe(docs, batch_size=50): +> pass > ``` -| Name | Type | Description | -| ----------- | ------- | ----------------------------------------------------------------------------------------------------------- | -| `string` | unicode | The string to look up. | -| `orth` | int | Optional hash of the string to look up. If not set, the string will be used and hashed. Defaults to `None`. | -| **RETURNS** | unicode | The lemma if the string was found, otherwise the original string. | +| Name | Description | +| -------------- | ------------------------------------------------------------- | +| `stream` | A stream of documents. ~~Iterable[Doc]~~ | +| _keyword-only_ | | +| `batch_size` | The number of documents to buffer. Defaults to `128`. ~~int~~ | +| **YIELDS** | The processed documents in order. ~~Doc~~ | + +## Lemmatizer.initialize {#initialize tag="method"} + +Initialize the lemmatizer and load any data resources. This method is typically +called by [`Language.initialize`](/api/language#initialize) and lets you +customize arguments it receives via the +[`[initialize.components]`](/api/data-formats#config-initialize) block in the +config. The loading only happens during initialization, typically before +training. At runtime, all data is loaded from disk. + +> #### Example +> +> ```python +> lemmatizer = nlp.add_pipe("lemmatizer") +> lemmatizer.initialize(lookups=lookups) +> ``` +> +> ```ini +> ### config.cfg +> [initialize.components.lemmatizer] +> +> [initialize.components.lemmatizer.lookups] +> @misc = "load_my_lookups.v1" +> ``` + +| Name | Description | +| -------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. Defaults to `None`. ~~Optional[Callable[[], Iterable[Example]]]~~ | +| _keyword-only_ | | +| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ | +| `lookups` | The lookups object containing the tables such as `"lemma_rules"`, `"lemma_index"`, `"lemma_exc"` and `"lemma_lookup"`. If `None`, default tables are loaded from [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). Defaults to `None`. ~~Optional[Lookups]~~ | + +## Lemmatizer.lookup_lemmatize {#lookup_lemmatize tag="method"} + +Lemmatize a token using a lookup-based approach. If no lemma is found, the +original string is returned. + +| Name | Description | +| ----------- | --------------------------------------------------- | +| `token` | The token to lemmatize. ~~Token~~ | +| **RETURNS** | A list containing one or more lemmas. ~~List[str]~~ | + +## Lemmatizer.rule_lemmatize {#rule_lemmatize tag="method"} + +Lemmatize a token using a rule-based approach. Typically relies on POS tags. + +| Name | Description | +| ----------- | --------------------------------------------------- | +| `token` | The token to lemmatize. 
~~Token~~ | +| **RETURNS** | A list containing one or more lemmas. ~~List[str]~~ | ## Lemmatizer.is_base_form {#is_base_form tag="method"} Check whether we're dealing with an uninflected paradigm, so we can avoid lemmatization entirely. +| Name | Description | +| ----------- | ---------------------------------------------------------------------------------------------------------------- | +| `token` | The token to analyze. ~~Token~~ | +| **RETURNS** | Whether the token's attributes (e.g., part-of-speech tag, morphological features) describe a base form. ~~bool~~ | + +## Lemmatizer.get_lookups_config {#get_lookups_config tag="classmethod"} + +Returns the lookups configuration settings for a given mode for use in +[`Lemmatizer.load_lookups`](/api/lemmatizer#load_lookups). + +| Name | Description | +| ----------- | -------------------------------------------------------------------------------------- | +| `mode` | The lemmatizer mode. ~~str~~ | +| **RETURNS** | The required table names and the optional table names. ~~Tuple[List[str], List[str]]~~ | + +## Lemmatizer.to_disk {#to_disk tag="method"} + +Serialize the pipe to disk. + > #### Example > > ```python -> pos = "verb" -> morph = {"VerbForm": "inf"} -> is_base_form = lemmatizer.is_base_form(pos, morph) -> assert is_base_form == True +> lemmatizer = nlp.add_pipe("lemmatizer") +> lemmatizer.to_disk("/path/to/lemmatizer") > ``` -| Name | Type | Description | -| ------------ | ------------- | --------------------------------------------------------------------------------------- | -| `univ_pos` | unicode / int | The token's universal part-of-speech tag. | -| `morphology` | dict | The token's morphological features. | -| **RETURNS** | bool | Whether the token's part-of-speech tag and morphological features describe a base form. | +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | +| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | + +## Lemmatizer.from_disk {#from_disk tag="method"} + +Load the pipe from disk. Modifies the object in place and returns it. + +> #### Example +> +> ```python +> lemmatizer = nlp.add_pipe("lemmatizer") +> lemmatizer.from_disk("/path/to/lemmatizer") +> ``` + +| Name | Description | +| -------------- | ----------------------------------------------------------------------------------------------- | +| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The modified `Lemmatizer` object. ~~Lemmatizer~~ | + +## Lemmatizer.to_bytes {#to_bytes tag="method"} + +> #### Example +> +> ```python +> lemmatizer = nlp.add_pipe("lemmatizer") +> lemmatizer_bytes = lemmatizer.to_bytes() +> ``` + +Serialize the pipe to a bytestring. + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------- | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The serialized form of the `Lemmatizer` object. 
~~bytes~~ | + +## Lemmatizer.from_bytes {#from_bytes tag="method"} + +Load the pipe from a bytestring. Modifies the object in place and returns it. + +> #### Example +> +> ```python +> lemmatizer_bytes = lemmatizer.to_bytes() +> lemmatizer = nlp.add_pipe("lemmatizer") +> lemmatizer.from_bytes(lemmatizer_bytes) +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------- | +| `bytes_data` | The data to load from. ~~bytes~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The `Lemmatizer` object. ~~Lemmatizer~~ | ## Attributes {#attributes} -| Name | Type | Description | -| -------------------------------------- | ------------------------- | --------------------------------------------------------------- | -| `lookups` 2.2 | [`Lookups`](/api/lookups) | The lookups object containing the rules and data, if available. | +| Name | Description | +| --------- | ------------------------------------------- | +| `vocab` | The shared [`Vocab`](/api/vocab). ~~Vocab~~ | +| `lookups` | The lookups object. ~~Lookups~~ | +| `mode` | The lemmatizer mode. ~~str~~ | + +## Serialization fields {#serialization-fields} + +During serialization, spaCy will export several data fields used to restore +different aspects of the object. If needed, you can exclude them from +serialization by passing in the string names via the `exclude` argument. + +> #### Example +> +> ```python +> data = lemmatizer.to_disk("/path", exclude=["vocab"]) +> ``` + +| Name | Description | +| --------- | ---------------------------------------------------- | +| `vocab` | The shared [`Vocab`](/api/vocab). | +| `lookups` | The lookups. You usually don't want to exclude this. | diff --git a/website/docs/api/lexeme.md b/website/docs/api/lexeme.md index f7f6d654c..a7e1d1ca0 100644 --- a/website/docs/api/lexeme.md +++ b/website/docs/api/lexeme.md @@ -13,11 +13,10 @@ lemmatization depends on the part-of-speech tag). Create a `Lexeme` object. -| Name | Type | Description | -| ----------- | -------- | ----------------------------- | -| `vocab` | `Vocab` | The parent vocabulary. | -| `orth` | int | The orth id of the lexeme. | -| **RETURNS** | `Lexeme` | The newly constructed object. | +| Name | Description | +| ------- | ---------------------------------- | +| `vocab` | The parent vocabulary. ~~Vocab~~ | +| `orth` | The orth id of the lexeme. ~~int~~ | ## Lexeme.set_flag {#set_flag tag="method"} @@ -30,10 +29,10 @@ Change the value of a boolean flag. > nlp.vocab["spaCy"].set_flag(COOL_FLAG, True) > ``` -| Name | Type | Description | -| --------- | ---- | ------------------------------------ | -| `flag_id` | int | The attribute ID of the flag to set. | -| `value` | bool | The new value of the flag. | +| Name | Description | +| --------- | -------------------------------------------- | +| `flag_id` | The attribute ID of the flag to set. ~~int~~ | +| `value` | The new value of the flag. ~~bool~~ | ## Lexeme.check_flag {#check_flag tag="method"} @@ -47,10 +46,10 @@ Check the value of a boolean flag. > assert nlp.vocab["spaCy"].check_flag(MY_LIBRARY) == True > ``` -| Name | Type | Description | -| ----------- | ---- | -------------------------------------- | -| `flag_id` | int | The attribute ID of the flag to query. | -| **RETURNS** | bool | The value of the flag. 
| +| Name | Description | +| ----------- | ---------------------------------------------- | +| `flag_id` | The attribute ID of the flag to query. ~~int~~ | +| **RETURNS** | The value of the flag. ~~bool~~ | ## Lexeme.similarity {#similarity tag="method" model="vectors"} @@ -66,10 +65,10 @@ Compute a semantic similarity estimate. Defaults to cosine over vectors. > assert apple_orange == orange_apple > ``` -| Name | Type | Description | -| ----------- | ----- | -------------------------------------------------------------------------------------------- | -| other | - | The object to compare with. By default, accepts `Doc`, `Span`, `Token` and `Lexeme` objects. | -| **RETURNS** | float | A scalar similarity score. Higher is more similar. | +| Name | Description | +| ----------- | -------------------------------------------------------------------------------------------------------------------------------- | +| other | The object to compare with. By default, accepts `Doc`, `Span`, `Token` and `Lexeme` objects. ~~Union[Doc, Span, Token, Lexeme]~~ | +| **RETURNS** | A scalar similarity score. Higher is more similar. ~~float~~ | ## Lexeme.has_vector {#has_vector tag="property" model="vectors"} @@ -82,9 +81,9 @@ A boolean value indicating whether a word vector is associated with the lexeme. > assert apple.has_vector > ``` -| Name | Type | Description | -| ----------- | ---- | ---------------------------------------------- | -| **RETURNS** | bool | Whether the lexeme has a vector data attached. | +| Name | Description | +| ----------- | ------------------------------------------------------- | +| **RETURNS** | Whether the lexeme has a vector data attached. ~~bool~~ | ## Lexeme.vector {#vector tag="property" model="vectors"} @@ -98,9 +97,9 @@ A real-valued meaning representation. > assert apple.vector.shape == (300,) > ``` -| Name | Type | Description | -| ----------- | ---------------------------------------- | ----------------------------------------------------- | -| **RETURNS** | `numpy.ndarray[ndim=1, dtype='float32']` | A 1D numpy array representing the lexeme's semantics. | +| Name | Description | +| ----------- | ------------------------------------------------------------------------------------------------ | +| **RETURNS** | A 1-dimensional array representing the lexeme's vector. ~~numpy.ndarray[ndim=1, dtype=float32]~~ | ## Lexeme.vector_norm {#vector_norm tag="property" model="vectors"} @@ -116,50 +115,50 @@ The L2 norm of the lexeme's vector representation. > assert apple.vector_norm != pasta.vector_norm > ``` -| Name | Type | Description | -| ----------- | ----- | ----------------------------------------- | -| **RETURNS** | float | The L2 norm of the vector representation. | +| Name | Description | +| ----------- | --------------------------------------------------- | +| **RETURNS** | The L2 norm of the vector representation. ~~float~~ | ## Attributes {#attributes} -| Name | Type | Description | -| -------------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| `vocab` | `Vocab` | The lexeme's vocabulary. | -| `text` | unicode | Verbatim text content. | -| `orth` | int | ID of the verbatim text content. | -| `orth_` | unicode | Verbatim text content (identical to `Lexeme.text`). Exists mostly for consistency with the other attributes. 
| -| `rank` | int | Sequential ID of the lexemes's lexical type, used to index into tables, e.g. for word vectors. | -| `flags` | int | Container of the lexeme's binary flags. | -| `norm` | int | The lexemes's norm, i.e. a normalized form of the lexeme text. | -| `norm_` | unicode | The lexemes's norm, i.e. a normalized form of the lexeme text. | -| `lower` | int | Lowercase form of the word. | -| `lower_` | unicode | Lowercase form of the word. | -| `shape` | int | Transform of the words's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. | -| `shape_` | unicode | Transform of the word's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. | -| `prefix` | int | Length-N substring from the start of the word. Defaults to `N=1`. | -| `prefix_` | unicode | Length-N substring from the start of the word. Defaults to `N=1`. | -| `suffix` | int | Length-N substring from the end of the word. Defaults to `N=3`. | -| `suffix_` | unicode | Length-N substring from the start of the word. Defaults to `N=3`. | -| `is_alpha` | bool | Does the lexeme consist of alphabetic characters? Equivalent to `lexeme.text.isalpha()`. | -| `is_ascii` | bool | Does the lexeme consist of ASCII characters? Equivalent to `[any(ord(c) >= 128 for c in lexeme.text)]`. | -| `is_digit` | bool | Does the lexeme consist of digits? Equivalent to `lexeme.text.isdigit()`. | -| `is_lower` | bool | Is the lexeme in lowercase? Equivalent to `lexeme.text.islower()`. | -| `is_upper` | bool | Is the lexeme in uppercase? Equivalent to `lexeme.text.isupper()`. | -| `is_title` | bool | Is the lexeme in titlecase? Equivalent to `lexeme.text.istitle()`. | -| `is_punct` | bool | Is the lexeme punctuation? | -| `is_left_punct` | bool | Is the lexeme a left punctuation mark, e.g. `(`? | -| `is_right_punct` | bool | Is the lexeme a right punctuation mark, e.g. `)`? | -| `is_space` | bool | Does the lexeme consist of whitespace characters? Equivalent to `lexeme.text.isspace()`. | -| `is_bracket` | bool | Is the lexeme a bracket? | -| `is_quote` | bool | Is the lexeme a quotation mark? | -| `is_currency` 2.0.8 | bool | Is the lexeme a currency symbol? | -| `like_url` | bool | Does the lexeme resemble a URL? | -| `like_num` | bool | Does the lexeme represent a number? e.g. "10.9", "10", "ten", etc. | -| `like_email` | bool | Does the lexeme resemble an email address? | -| `is_oov` | bool | Does the lexeme have a word vector? | -| `is_stop` | bool | Is the lexeme part of a "stop list"? | -| `lang` | int | Language of the parent vocabulary. | -| `lang_` | unicode | Language of the parent vocabulary. | -| `prob` | float | Smoothed log probability estimate of the lexeme's word type (context-independent entry in the vocabulary). | -| `cluster` | int | Brown cluster ID. | -| `sentiment` | float | A scalar value indicating the positivity or negativity of the lexeme. 
| +| Name | Description | +| -------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `vocab` | The lexeme's vocabulary. ~~Vocab~~ | +| `text` | Verbatim text content. ~~str~~ | +| `orth` | ID of the verbatim text content. ~~int~~ | +| `orth_` | Verbatim text content (identical to `Lexeme.text`). Exists mostly for consistency with the other attributes. ~~str~~ | +| `rank` | Sequential ID of the lexemes's lexical type, used to index into tables, e.g. for word vectors. ~~int~~ | +| `flags` | Container of the lexeme's binary flags. ~~int~~ | +| `norm` | The lexemes's norm, i.e. a normalized form of the lexeme text. ~~int~~ | +| `norm_` | The lexemes's norm, i.e. a normalized form of the lexeme text. ~~str~~ | +| `lower` | Lowercase form of the word. ~~int~~ | +| `lower_` | Lowercase form of the word. ~~str~~ | +| `shape` | Transform of the words's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. ~~int~~ | +| `shape_` | Transform of the word's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. ~~str~~ | +| `prefix` | Length-N substring from the start of the word. Defaults to `N=1`. ~~int~~ | +| `prefix_` | Length-N substring from the start of the word. Defaults to `N=1`. ~~str~~ | +| `suffix` | Length-N substring from the end of the word. Defaults to `N=3`. ~~int~~ | +| `suffix_` | Length-N substring from the start of the word. Defaults to `N=3`. ~~str~~ | +| `is_alpha` | Does the lexeme consist of alphabetic characters? Equivalent to `lexeme.text.isalpha()`. ~~bool~~ | +| `is_ascii` | Does the lexeme consist of ASCII characters? Equivalent to `[any(ord(c) >= 128 for c in lexeme.text)]`. ~~bool~~ | +| `is_digit` | Does the lexeme consist of digits? Equivalent to `lexeme.text.isdigit()`. ~~bool~~ | +| `is_lower` | Is the lexeme in lowercase? Equivalent to `lexeme.text.islower()`. ~~bool~~ | +| `is_upper` | Is the lexeme in uppercase? Equivalent to `lexeme.text.isupper()`. ~~bool~~ | +| `is_title` | Is the lexeme in titlecase? Equivalent to `lexeme.text.istitle()`. ~~bool~~ | +| `is_punct` | Is the lexeme punctuation? ~~bool~~ | +| `is_left_punct` | Is the lexeme a left punctuation mark, e.g. `(`? ~~bool~~ | +| `is_right_punct` | Is the lexeme a right punctuation mark, e.g. `)`? ~~bool~~ | +| `is_space` | Does the lexeme consist of whitespace characters? Equivalent to `lexeme.text.isspace()`. ~~bool~~ | +| `is_bracket` | Is the lexeme a bracket? ~~bool~~ | +| `is_quote` | Is the lexeme a quotation mark? ~~bool~~ | +| `is_currency` 2.0.8 | Is the lexeme a currency symbol? ~~bool~~ | +| `like_url` | Does the lexeme resemble a URL? ~~bool~~ | +| `like_num` | Does the lexeme represent a number? e.g. "10.9", "10", "ten", etc. ~~bool~~ | +| `like_email` | Does the lexeme resemble an email address? ~~bool~~ | +| `is_oov` | Does the lexeme have a word vector? ~~bool~~ | +| `is_stop` | Is the lexeme part of a "stop list"? ~~bool~~ | +| `lang` | Language of the parent vocabulary. 
~~int~~ | +| `lang_` | Language of the parent vocabulary. ~~str~~ | +| `prob` | Smoothed log probability estimate of the lexeme's word type (context-independent entry in the vocabulary). ~~float~~ | +| `cluster` | Brown cluster ID. ~~int~~ | +| `sentiment` | A scalar value indicating the positivity or negativity of the lexeme. ~~float~~ | diff --git a/website/docs/api/lookups.md b/website/docs/api/lookups.md index bd3b38303..9565e478f 100644 --- a/website/docs/api/lookups.md +++ b/website/docs/api/lookups.md @@ -24,10 +24,6 @@ Create a `Lookups` object. > lookups = Lookups() > ``` -| Name | Type | Description | -| ----------- | --------- | ----------------------------- | -| **RETURNS** | `Lookups` | The newly constructed object. | - ## Lookups.\_\_len\_\_ {#len tag="method"} Get the current number of tables in the lookups. @@ -39,9 +35,9 @@ Get the current number of tables in the lookups. > assert len(lookups) == 0 > ``` -| Name | Type | Description | -| ----------- | ---- | ------------------------------------ | -| **RETURNS** | int | The number of tables in the lookups. | +| Name | Description | +| ----------- | -------------------------------------------- | +| **RETURNS** | The number of tables in the lookups. ~~int~~ | ## Lookups.\_\contains\_\_ {#contains tag="method"} @@ -56,10 +52,10 @@ Check if the lookups contain a table of a given name. Delegates to > assert "some_table" in lookups > ``` -| Name | Type | Description | -| ----------- | ------- | ----------------------------------------------- | -| `name` | unicode | Name of the table. | -| **RETURNS** | bool | Whether a table of that name is in the lookups. | +| Name | Description | +| ----------- | -------------------------------------------------------- | +| `name` | Name of the table. ~~str~~ | +| **RETURNS** | Whether a table of that name is in the lookups. ~~bool~~ | ## Lookups.tables {#tables tag="property"} @@ -73,9 +69,9 @@ Get the names of all tables in the lookups. > assert lookups.tables == ["some_table"] > ``` -| Name | Type | Description | -| ----------- | ---- | ----------------------------------- | -| **RETURNS** | list | Names of the tables in the lookups. | +| Name | Description | +| ----------- | ------------------------------------------------- | +| **RETURNS** | Names of the tables in the lookups. ~~List[str]~~ | ## Lookups.add_table {#add_table tag="method"} @@ -89,11 +85,11 @@ exists. > lookups.add_table("some_table", {"foo": "bar"}) > ``` -| Name | Type | Description | -| ----------- | ----------------------------- | ---------------------------------- | -| `name` | unicode | Unique name of the table. | -| `data` | dict | Optional data to add to the table. | -| **RETURNS** | [`Table`](/api/lookups#table) | The newly added table. | +| Name | Description | +| ----------- | ------------------------------------------- | +| `name` | Unique name of the table. ~~str~~ | +| `data` | Optional data to add to the table. ~~dict~~ | +| **RETURNS** | The newly added table. ~~Table~~ | ## Lookups.get_table {#get_table tag="method"} @@ -108,10 +104,10 @@ Get a table from the lookups. Raises an error if the table doesn't exist. > assert table["foo"] == "bar" > ``` -| Name | Type | Description | -| ----------- | ----------------------------- | ------------------ | -| `name` | unicode | Name of the table. | -| **RETURNS** | [`Table`](/api/lookups#table) | The table. | +| Name | Description | +| ----------- | -------------------------- | +| `name` | Name of the table. ~~str~~ | +| **RETURNS** | The table. 
~~Table~~ | ## Lookups.remove_table {#remove_table tag="method"} @@ -126,10 +122,10 @@ Remove a table from the lookups. Raises an error if the table doesn't exist. > assert "some_table" not in lookups > ``` -| Name | Type | Description | -| ----------- | ----------------------------- | ---------------------------- | -| `name` | unicode | Name of the table to remove. | -| **RETURNS** | [`Table`](/api/lookups#table) | The removed table. | +| Name | Description | +| ----------- | ------------------------------------ | +| `name` | Name of the table to remove. ~~str~~ | +| **RETURNS** | The removed table. ~~Table~~ | ## Lookups.has_table {#has_table tag="method"} @@ -144,10 +140,10 @@ Check if the lookups contain a table of a given name. Equivalent to > assert lookups.has_table("some_table") > ``` -| Name | Type | Description | -| ----------- | ------- | ----------------------------------------------- | -| `name` | unicode | Name of the table. | -| **RETURNS** | bool | Whether a table of that name is in the lookups. | +| Name | Description | +| ----------- | -------------------------------------------------------- | +| `name` | Name of the table. ~~str~~ | +| **RETURNS** | Whether a table of that name is in the lookups. ~~bool~~ | ## Lookups.to_bytes {#to_bytes tag="method"} @@ -159,9 +155,9 @@ Serialize the lookups to a bytestring. > lookup_bytes = lookups.to_bytes() > ``` -| Name | Type | Description | -| ----------- | ----- | ----------------------- | -| **RETURNS** | bytes | The serialized lookups. | +| Name | Description | +| ----------- | --------------------------------- | +| **RETURNS** | The serialized lookups. ~~bytes~~ | ## Lookups.from_bytes {#from_bytes tag="method"} @@ -175,10 +171,10 @@ Load the lookups from a bytestring. > lookups.from_bytes(lookup_bytes) > ``` -| Name | Type | Description | -| ------------ | --------- | ---------------------- | -| `bytes_data` | bytes | The data to load from. | -| **RETURNS** | `Lookups` | The loaded lookups. | +| Name | Description | +| ------------ | -------------------------------- | +| `bytes_data` | The data to load from. ~~bytes~~ | +| **RETURNS** | The loaded lookups. ~~Lookups~~ | ## Lookups.to_disk {#to_disk tag="method"} @@ -191,9 +187,9 @@ which will be created if it doesn't exist. > lookups.to_disk("/path/to/lookups") > ``` -| Name | Type | Description | -| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- | -| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | +| Name | Description | +| ------ | ------------------------------------------------------------------------------------------------------------------------------------------ | +| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | ## Lookups.from_disk {#from_disk tag="method"} @@ -208,10 +204,10 @@ the file doesn't exist. > lookups.from_disk("/path/to/lookups") > ``` -| Name | Type | Description | -| ----------- | ---------------- | -------------------------------------------------------------------------- | -| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. | -| **RETURNS** | `Lookups` | The loaded lookups. 
| +| Name | Description | +| ----------- | ----------------------------------------------------------------------------------------------- | +| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| **RETURNS** | The loaded lookups. ~~Lookups~~ | ## Table {#table tag="class, ordererddict"} @@ -236,10 +232,9 @@ Initialize a new table. > assert table["foo"] == "bar" > ``` -| Name | Type | Description | -| ----------- | ------- | ---------------------------------- | -| `name` | unicode | Optional table name for reference. | -| **RETURNS** | `Table` | The newly constructed object. | +| Name | Description | +| ------ | ------------------------------------------ | +| `name` | Optional table name for reference. ~~str~~ | ### Table.from_dict {#table.from_dict tag="classmethod"} @@ -253,11 +248,11 @@ Initialize a new table from a dict. > table = Table.from_dict(data, name="some_table") > ``` -| Name | Type | Description | -| ----------- | ------- | ---------------------------------- | -| `data` | dict | The dictionary. | -| `name` | unicode | Optional table name for reference. | -| **RETURNS** | `Table` | The newly constructed object. | +| Name | Description | +| ----------- | ------------------------------------------ | +| `data` | The dictionary. ~~dict~~ | +| `name` | Optional table name for reference. ~~str~~ | +| **RETURNS** | The newly constructed object. ~~Table~~ | ### Table.set {#table.set tag="method"} @@ -273,10 +268,10 @@ Set a new key / value pair. String keys will be hashed. Same as > assert table["foo"] == "bar" > ``` -| Name | Type | Description | -| ------- | ------------- | ----------- | -| `key` | unicode / int | The key. | -| `value` | - | The value. | +| Name | Description | +| ------- | ---------------------------- | +| `key` | The key. ~~Union[str, int]~~ | +| `value` | The value. | ### Table.to_bytes {#table.to_bytes tag="method"} @@ -288,9 +283,9 @@ Serialize the table to a bytestring. > table_bytes = table.to_bytes() > ``` -| Name | Type | Description | -| ----------- | ----- | --------------------- | -| **RETURNS** | bytes | The serialized table. | +| Name | Description | +| ----------- | ------------------------------- | +| **RETURNS** | The serialized table. ~~bytes~~ | ### Table.from_bytes {#table.from_bytes tag="method"} @@ -304,15 +299,15 @@ Load a table from a bytestring. > table.from_bytes(table_bytes) > ``` -| Name | Type | Description | -| ------------ | ------- | ----------------- | -| `bytes_data` | bytes | The data to load. | -| **RETURNS** | `Table` | The loaded table. | +| Name | Description | +| ------------ | --------------------------- | +| `bytes_data` | The data to load. ~~bytes~~ | +| **RETURNS** | The loaded table. ~~Table~~ | ### Attributes {#table-attributes} -| Name | Type | Description | -| -------------- | --------------------------- | ----------------------------------------------------- | -| `name` | unicode | Table name. | -| `default_size` | int | Default size of bloom filters if no data is provided. | -| `bloom` | `preshed.bloom.BloomFilter` | The bloom filters. | +| Name | Description | +| -------------- | ------------------------------------------------------------- | +| `name` | Table name. ~~str~~ | +| `default_size` | Default size of bloom filters if no data is provided. ~~int~~ | +| `bloom` | The bloom filters. 
~~preshed.BloomFilter~~ | diff --git a/website/docs/api/matcher.md b/website/docs/api/matcher.md index 7b195e352..81c2a8515 100644 --- a/website/docs/api/matcher.md +++ b/website/docs/api/matcher.md @@ -5,18 +5,83 @@ tag: class source: spacy/matcher/matcher.pyx --- - +The `Matcher` lets you find words and phrases using rules describing their token +attributes. Rules can refer to token annotations (like the text or +part-of-speech tags), as well as lexical attributes like `Token.is_punct`. +Applying the matcher to a [`Doc`](/api/doc) gives you access to the matched +tokens in context. For in-depth examples and workflows for combining rules and +statistical models, see the [usage guide](/usage/rule-based-matching) on +rule-based matching. -As of spaCy 2.0, `Matcher.add_pattern` and `Matcher.add_entity` are deprecated -and have been replaced with a simpler [`Matcher.add`](/api/matcher#add) that -lets you add a list of patterns and a callback for a given match ID. -`Matcher.get_entity` is now called [`matcher.get`](/api/matcher#get). -`Matcher.load` (not useful, as it didn't allow specifying callbacks), and -`Matcher.has_entity` (now redundant) have been removed. The concept of "acceptor -functions" has also been retired – this logic can now be handled in the callback -functions. +## Pattern format {#patterns} - +> ```json +> ### Example +> [ +> {"LOWER": "i"}, +> {"LEMMA": {"IN": ["like", "love"]}}, +> {"POS": "NOUN", "OP": "+"} +> ] +> ``` + +A pattern added to the `Matcher` consists of a list of dictionaries. Each +dictionary describes **one token** and its attributes. The available token +pattern keys correspond to a number of +[`Token` attributes](/api/token#attributes). The supported attributes for +rule-based matching are: + +| Attribute |  Description | +| ----------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------- | +| `ORTH` | The exact verbatim text of a token. ~~str~~ | +| `TEXT` 2.1 | The exact verbatim text of a token. ~~str~~ | +| `LOWER` | The lowercase form of the token text. ~~str~~ | +|  `LENGTH` | The length of the token text. ~~int~~ | +|  `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT` | Token text consists of alphabetic characters, ASCII characters, digits. ~~bool~~ | +|  `IS_LOWER`, `IS_UPPER`, `IS_TITLE` | Token text is in lowercase, uppercase, titlecase. ~~bool~~ | +|  `IS_PUNCT`, `IS_SPACE`, `IS_STOP` | Token is punctuation, whitespace, stop word. ~~bool~~ | +|  `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL` | Token text resembles a number, URL, email. ~~bool~~ | +|  `POS`, `TAG`, `MORPH`, `DEP`, `LEMMA`, `SHAPE` | The token's simple and extended part-of-speech tag, morphological analysis, dependency label, lemma, shape. ~~str~~ | +| `ENT_TYPE` | The token's entity label. ~~str~~ | +| `_` 2.1 | Properties in [custom extension attributes](/usage/processing-pipelines#custom-components-attributes). ~~Dict[str, Any]~~ | +| `OP` | Operator or quantifier to determine how often to match a token pattern. ~~str~~ | + +Operators and quantifiers define **how often** a token pattern should be +matched: + +> ```json +> ### Example +> [ +> {"POS": "ADJ", "OP": "*"}, +> {"POS": "NOUN", "OP": "+"} +> ] +> ``` + +| OP | Description | +| --- | ---------------------------------------------------------------- | +| `!` | Negate the pattern, by requiring it to match exactly 0 times. | +| `?` | Make the pattern optional, by allowing it to match 0 or 1 times. 
| +| `+` | Require the pattern to match 1 or more times. | +| `*` | Allow the pattern to match 0 or more times. | + +Token patterns can also map to a **dictionary of properties** instead of a +single value to indicate whether the expected value is a member of a list or how +it compares to another value. + +> ```json +> ### Example +> [ +> {"LEMMA": {"IN": ["like", "love", "enjoy"]}}, +> {"POS": "PROPN", "LENGTH": {">=": 10}}, +> ] +> ``` + +| Attribute | Description | +| -------------------------- | ------------------------------------------------------------------------------------------------------- | +| `IN` | Attribute value is member of a list. ~~Any~~ | +| `NOT_IN` | Attribute value is _not_ member of a list. ~~Any~~ | +| `ISSUBSET` | Attribute values (for `MORPH`) are a subset of a list. ~~Any~~ | +| `ISSUPERSET` | Attribute values (for `MORPH`) are a superset of a list. ~~Any~~ | +| `==`, `>=`, `<=`, `>`, `<` | Attribute value is equal, greater or equal, smaller or equal, greater or smaller. ~~Union[int, float]~~ | ## Matcher.\_\_init\_\_ {#init tag="method"} @@ -32,16 +97,14 @@ string where an integer is expected) or unexpected property names. > matcher = Matcher(nlp.vocab) > ``` -| Name | Type | Description | -| --------------------------------------- | --------- | ------------------------------------------------------------------------------------------- | -| `vocab` | `Vocab` | The vocabulary object, which must be shared with the documents the matcher will operate on. | -| `validate` 2.1 | bool | Validate all patterns added to this matcher. | -| **RETURNS** | `Matcher` | The newly constructed object. | +| Name | Description | +| --------------------------------------- | ----------------------------------------------------------------------------------------------------- | +| `vocab` | The vocabulary object, which must be shared with the documents the matcher will operate on. ~~Vocab~~ | +| `validate` 2.1 | Validate all patterns added to this matcher. ~~bool~~ | ## Matcher.\_\_call\_\_ {#call tag="method"} -Find all token sequences matching the supplied patterns on the `Doc`. As of -spaCy v2.3, the `Matcher` can also be called on `Span` objects. +Find all token sequences matching the supplied patterns on the `Doc` or `Span`. > #### Example > @@ -50,49 +113,17 @@ spaCy v2.3, the `Matcher` can also be called on `Span` objects. > > matcher = Matcher(nlp.vocab) > pattern = [{"LOWER": "hello"}, {"LOWER": "world"}] -> matcher.add("HelloWorld", None, pattern) +> matcher.add("HelloWorld", [pattern]) > doc = nlp("hello world!") > matches = matcher(doc) > ``` -| Name | Type | Description | -| ----------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| `doclike` | `Doc`/`Span` | The document to match over or a `Span` (as of v2.3). | -| **RETURNS** | list | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. | - - - -By default, the matcher **does not perform any action** on matches, like tagging -matched phrases with entity types. Instead, actions need to be specified when -**adding patterns or entities**, by passing in a callback function as the -`on_match` argument on [`add`](/api/matcher#add). This allows you to define -custom actions per pattern within the same matcher. 
For example, you might only -want to merge some entity types, and set custom flags for other matched -patterns. For more details and examples, see the usage guide on -[rule-based matching](/usage/rule-based-matching). - - - -## Matcher.pipe {#pipe tag="method"} - -Match a stream of documents, yielding them in turn. - -> #### Example -> -> ```python -> from spacy.matcher import Matcher -> matcher = Matcher(nlp.vocab) -> for doc in matcher.pipe(docs, batch_size=50): -> pass -> ``` - -| Name | Type | Description | -| --------------------------------------------- | -------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `docs` | iterable | A stream of documents. | -| `batch_size` | int | The number of documents to accumulate into a working set. | -| `return_matches` 2.1 | bool | Yield the match lists along with the docs, making results `(doc, matches)` tuples. | -| `as_tuples` | bool | Interpret the input stream as `(doc, context)` tuples, and yield `(result, context)` tuples out. If both `return_matches` and `as_tuples` are `True`, the output will be a sequence of `((doc, matches), context)` tuples. | -| **YIELDS** | `Doc` | Documents, in order. | +| Name | Description | +| ------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `doclike` | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ | +| _keyword-only_ | | +| `as_spans` 3 | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ | +| **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ | ## Matcher.\_\_len\_\_ {#len tag="method" new="2"} @@ -105,13 +136,13 @@ patterns. > ```python > matcher = Matcher(nlp.vocab) > assert len(matcher) == 0 -> matcher.add("Rule", None, [{"ORTH": "test"}]) +> matcher.add("Rule", [[{"ORTH": "test"}]]) > assert len(matcher) == 1 > ``` -| Name | Type | Description | -| ----------- | ---- | -------------------- | -| **RETURNS** | int | The number of rules. | +| Name | Description | +| ----------- | ---------------------------- | +| **RETURNS** | The number of rules. ~~int~~ | ## Matcher.\_\_contains\_\_ {#contains tag="method" new="2"} @@ -121,60 +152,62 @@ Check whether the matcher contains rules for a match ID. > > ```python > matcher = Matcher(nlp.vocab) -> assert 'Rule' not in matcher -> matcher.add('Rule', None, [{'ORTH': 'test'}]) -> assert 'Rule' in matcher +> assert "Rule" not in matcher +> matcher.add("Rule", [[{'ORTH': 'test'}]]) +> assert "Rule" in matcher > ``` -| Name | Type | Description | -| ----------- | ------- | ----------------------------------------------------- | -| `key` | unicode | The match ID. | -| **RETURNS** | bool | Whether the matcher contains rules for this match ID. 
| +| Name | Description | +| ----------- | -------------------------------------------------------------- | +| `key` | The match ID. ~~str~~ | +| **RETURNS** | Whether the matcher contains rules for this match ID. ~~bool~~ | ## Matcher.add {#add tag="method" new="2"} -Add a rule to the matcher, consisting of an ID key, one or more patterns, and a -callback function to act on the matches. The callback function will receive the -arguments `matcher`, `doc`, `i` and `matches`. If a pattern already exists for -the given ID, the patterns will be extended. An `on_match` callback will be -overwritten. +Add a rule to the matcher, consisting of an ID key, one or more patterns, and an +optional callback function to act on the matches. The callback function will +receive the arguments `matcher`, `doc`, `i` and `matches`. If a pattern already +exists for the given ID, the patterns will be extended. An `on_match` callback +will be overwritten. > #### Example > > ```python -> def on_match(matcher, doc, id, matches): -> print('Matched!', matches) +> def on_match(matcher, doc, id, matches): +> print('Matched!', matches) > -> matcher = Matcher(nlp.vocab) -> matcher.add("HelloWorld", on_match, [{"LOWER": "hello"}, {"LOWER": "world"}]) -> matcher.add("GoogleMaps", on_match, [{"ORTH": "Google"}, {"ORTH": "Maps"}]) -> doc = nlp("HELLO WORLD on Google Maps.") -> matches = matcher(doc) +> matcher = Matcher(nlp.vocab) +> patterns = [ +> [{"LOWER": "hello"}, {"LOWER": "world"}], +> [{"ORTH": "Google"}, {"ORTH": "Maps"}] +> ] +> matcher.add("TEST_PATTERNS", patterns) +> doc = nlp("HELLO WORLD on Google Maps.") +> matches = matcher(doc) > ``` -| Name | Type | Description | -| ----------- | ------------------ | --------------------------------------------------------------------------------------------- | -| `match_id` | unicode | An ID for the thing you're matching. | -| `on_match` | callable or `None` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. | -| `*patterns` | list | Match pattern. A pattern consists of a list of dicts, where each dict describes a token. | + - - -As of spaCy 2.2.2, `Matcher.add` also supports the new API, which will become -the default in the future. The patterns are now the second argument and a list +As of spaCy v3.0, `Matcher.add` takes a list of patterns as the second argument (instead of a variable number of arguments). The `on_match` callback becomes an optional keyword argument. ```diff patterns = [[{"TEXT": "Google"}, {"TEXT": "Now"}], [{"TEXT": "GoogleNow"}]] -- matcher.add("GoogleNow", None, *patterns) -+ matcher.add("GoogleNow", patterns) - matcher.add("GoogleNow", on_match, *patterns) + matcher.add("GoogleNow", patterns, on_match=on_match) ``` +| Name | Description | +| ----------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `match_id` | An ID for the thing you're matching. ~~str~~ | +| `patterns` | Match pattern. A pattern consists of a list of dicts, where each dict describes a token. ~~List[List[Dict[str, Any]]]~~ | +| _keyword-only_ | | +| `on_match` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. ~~Optional[Callable[[Matcher, Doc, int, List[tuple], Any]]~~ | +| `greedy` 3 | Optional filter for greedy matches. Can either be `"FIRST"` or `"LONGEST"`. 
~~Optional[str]~~ | + ## Matcher.remove {#remove tag="method" new="2"} Remove a rule from the matcher. A `KeyError` is raised if the match ID does not @@ -183,15 +216,15 @@ exist. > #### Example > > ```python -> matcher.add("Rule", None, [{"ORTH": "test"}]) +> matcher.add("Rule", [[{"ORTH": "test"}]]) > assert "Rule" in matcher > matcher.remove("Rule") > assert "Rule" not in matcher > ``` -| Name | Type | Description | -| ----- | ------- | ------------------------- | -| `key` | unicode | The ID of the match rule. | +| Name | Description | +| ----- | --------------------------------- | +| `key` | The ID of the match rule. ~~str~~ | ## Matcher.get {#get tag="method" new="2"} @@ -201,11 +234,11 @@ Retrieve the pattern stored for a key. Returns the rule as an > #### Example > > ```python -> matcher.add("Rule", None, [{"ORTH": "test"}]) +> matcher.add("Rule", [[{"ORTH": "test"}]]) > on_match, patterns = matcher.get("Rule") > ``` -| Name | Type | Description | -| ----------- | ------- | --------------------------------------------- | -| `key` | unicode | The ID of the match rule. | -| **RETURNS** | tuple | The rule, as an `(on_match, patterns)` tuple. | +| Name | Description | +| ----------- | --------------------------------------------------------------------------------------------- | +| `key` | The ID of the match rule. ~~str~~ | +| **RETURNS** | The rule, as an `(on_match, patterns)` tuple. ~~Tuple[Optional[Callable], List[List[dict]]]~~ | diff --git a/website/docs/api/morphologizer.md b/website/docs/api/morphologizer.md new file mode 100644 index 000000000..d32514fb0 --- /dev/null +++ b/website/docs/api/morphologizer.md @@ -0,0 +1,414 @@ +--- +title: Morphologizer +tag: class +source: spacy/pipeline/morphologizer.pyx +new: 3 +teaser: 'Pipeline component for predicting morphological features' +api_base_class: /api/tagger +api_string_name: morphologizer +api_trainable: true +--- + +A trainable pipeline component to predict morphological features and +coarse-grained POS tags following the Universal Dependencies +[UPOS](https://universaldependencies.org/u/pos/index.html) and +[FEATS](https://universaldependencies.org/format.html#morphological-annotation) +annotation guidelines. + +## Config and implementation {#config} + +The default config is defined by the pipeline component factory and describes +how the component should be configured. You can override its settings via the +`config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your +[`config.cfg` for training](/usage/training#config). See the +[model architectures](/api/architectures) documentation for details on the +architectures and their arguments and hyperparameters. + +> #### Example +> +> ```python +> from spacy.pipeline.morphologizer import DEFAULT_MORPH_MODEL +> config = {"model": DEFAULT_MORPH_MODEL} +> nlp.add_pipe("morphologizer", config=config) +> ``` + +| Setting | Description | +| ------- | ------------------------------------------------------------------------------------------------------- | +| `model` | The model to use. Defaults to [Tagger](/api/architectures#Tagger). ~~Model[List[Doc], List[Floats2d]]~~ | + +```python +%%GITHUB_SPACY/spacy/pipeline/morphologizer.pyx +``` + +## Morphologizer.\_\_init\_\_ {#init tag="method"} + +Create a new pipeline instance. In your application, you would normally use a +shortcut for this and instantiate the component using its string name and +[`nlp.add_pipe`](/api/language#add_pipe). 
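+
+Once trained and added to a pipeline, the component writes its predictions to
+the `Token.pos` and `Token.morph` attributes. A hedged usage sketch, where
+`"my_pipeline"` stands in for any trained pipeline that includes a
+morphologizer:
+
+```python
+import spacy
+
+nlp = spacy.load("my_pipeline")  # hypothetical trained pipeline
+doc = nlp("She was reading the paper.")
+print(doc[2].pos_)   # coarse-grained POS tag, e.g. "VERB"
+print(doc[2].morph)  # features, e.g. "Aspect=Prog|Tense=Pres|VerbForm=Part"
+```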
+ +> #### Example +> +> ```python +> # Construction via add_pipe with default model +> morphologizer = nlp.add_pipe("morphologizer") +> +> # Construction via create_pipe with custom model +> config = {"model": {"@architectures": "my_morphologizer"}} +> morphologizer = nlp.add_pipe("morphologizer", config=config) +> +> # Construction from class +> from spacy.pipeline import Morphologizer +> morphologizer = Morphologizer(nlp.vocab, model) +> ``` + +| Name | Description | +| -------------- | -------------------------------------------------------------------------------------------------------------------- | +| `vocab` | The shared vocabulary. ~~Vocab~~ | +| `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ | +| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ | +| _keyword-only_ | | +| `labels_morph` | Mapping of morph + POS tags to morph labels. ~~Dict[str, str]~~ | +| `labels_pos` | Mapping of morph + POS tags to POS tags. ~~Dict[str, str]~~ | + +## Morphologizer.\_\_call\_\_ {#call tag="method"} + +Apply the pipe to one document. The document is modified in place, and returned. +This usually happens under the hood when the `nlp` object is called on a text +and all pipeline components are applied to the `Doc` in order. Both +[`__call__`](/api/morphologizer#call) and [`pipe`](/api/morphologizer#pipe) +delegate to the [`predict`](/api/morphologizer#predict) and +[`set_annotations`](/api/morphologizer#set_annotations) methods. + +> #### Example +> +> ```python +> doc = nlp("This is a sentence.") +> morphologizer = nlp.add_pipe("morphologizer") +> # This usually happens under the hood +> processed = morphologizer(doc) +> ``` + +| Name | Description | +| ----------- | -------------------------------- | +| `doc` | The document to process. ~~Doc~~ | +| **RETURNS** | The processed document. ~~Doc~~ | + +## Morphologizer.pipe {#pipe tag="method"} + +Apply the pipe to a stream of documents. This usually happens under the hood +when the `nlp` object is called on a text and all pipeline components are +applied to the `Doc` in order. Both [`__call__`](/api/morphologizer#call) and +[`pipe`](/api/morphologizer#pipe) delegate to the +[`predict`](/api/morphologizer#predict) and +[`set_annotations`](/api/morphologizer#set_annotations) methods. + +> #### Example +> +> ```python +> morphologizer = nlp.add_pipe("morphologizer") +> for doc in morphologizer.pipe(docs, batch_size=50): +> pass +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------- | +| `stream` | A stream of documents. ~~Iterable[Doc]~~ | +| _keyword-only_ | | +| `batch_size` | The number of documents to buffer. Defaults to `128`. ~~int~~ | +| **YIELDS** | The processed documents in order. ~~Doc~~ | + +## Morphologizer.initialize {#initialize tag="method"} + +Initialize the component for training. `get_examples` should be a function that +returns an iterable of [`Example`](/api/example) objects. The data examples are +used to **initialize the model** of the component and can either be the full +training data or a representative sample. Initialization includes validating the +network, +[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and +setting up the label scheme based on the data. 
This method is typically called +by [`Language.initialize`](/api/language#initialize) and lets you customize +arguments it receives via the +[`[initialize.components]`](/api/data-formats#config-initialize) block in the +config. + +> #### Example +> +> ```python +> morphologizer = nlp.add_pipe("morphologizer") +> morphologizer.initialize(lambda: [], nlp=nlp) +> ``` +> +> ```ini +> ### config.cfg +> [initialize.components.morphologizer] +> +> [initialize.components.morphologizer.labels] +> @readers = "spacy.read_labels.v1" +> path = "corpus/labels/morphologizer.json +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ | +| _keyword-only_ | | +| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ | +| `labels` | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[dict]~~ | + +## Morphologizer.predict {#predict tag="method"} + +Apply the component's model to a batch of [`Doc`](/api/doc) objects, without +modifying them. + +> #### Example +> +> ```python +> morphologizer = nlp.add_pipe("morphologizer") +> scores = morphologizer.predict([doc1, doc2]) +> ``` + +| Name | Description | +| ----------- | ------------------------------------------- | +| `docs` | The documents to predict. ~~Iterable[Doc]~~ | +| **RETURNS** | The model's prediction for each document. | + +## Morphologizer.set_annotations {#set_annotations tag="method"} + +Modify a batch of [`Doc`](/api/doc) objects, using pre-computed scores. + +> #### Example +> +> ```python +> morphologizer = nlp.add_pipe("morphologizer") +> scores = morphologizer.predict([doc1, doc2]) +> morphologizer.set_annotations([doc1, doc2], scores) +> ``` + +| Name | Description | +| -------- | ------------------------------------------------------- | +| `docs` | The documents to modify. ~~Iterable[Doc]~~ | +| `scores` | The scores to set, produced by `Morphologizer.predict`. | + +## Morphologizer.update {#update tag="method"} + +Learn from a batch of [`Example`](/api/example) objects containing the +predictions and gold-standard annotations, and update the component's model. +Delegates to [`predict`](/api/morphologizer#predict) and +[`get_loss`](/api/morphologizer#get_loss). + +> #### Example +> +> ```python +> morphologizer = nlp.add_pipe("morphologizer") +> optimizer = nlp.initialize() +> losses = morphologizer.update(examples, sgd=optimizer) +> ``` + +| Name | Description | +| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- | +| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ | +| _keyword-only_ | | +| `drop` | The dropout rate. 
~~float~~ | +| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ | +| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ | +| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ | +| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ | + +## Morphologizer.get_loss {#get_loss tag="method"} + +Find the loss and gradient of loss for the batch of documents and their +predicted scores. + +> #### Example +> +> ```python +> morphologizer = nlp.add_pipe("morphologizer") +> scores = morphologizer.predict([eg.predicted for eg in examples]) +> loss, d_loss = morphologizer.get_loss(examples, scores) +> ``` + +| Name | Description | +| ----------- | --------------------------------------------------------------------------- | +| `examples` | The batch of examples. ~~Iterable[Example]~~ | +| `scores` | Scores representing the model's predictions. | +| **RETURNS** | The loss and the gradient, i.e. `(loss, gradient)`. ~~Tuple[float, float]~~ | + +## Morphologizer.create_optimizer {#create_optimizer tag="method"} + +Create an optimizer for the pipeline component. + +> #### Example +> +> ```python +> morphologizer = nlp.add_pipe("morphologizer") +> optimizer = morphologizer.create_optimizer() +> ``` + +| Name | Description | +| ----------- | ---------------------------- | +| **RETURNS** | The optimizer. ~~Optimizer~~ | + +## Morphologizer.use_params {#use_params tag="method, contextmanager"} + +Modify the pipe's model, to use the given parameter values. At the end of the +context, the original parameters are restored. + +> #### Example +> +> ```python +> morphologizer = nlp.add_pipe("morphologizer") +> with morphologizer.use_params(optimizer.averages): +> morphologizer.to_disk("/best_model") +> ``` + +| Name | Description | +| -------- | -------------------------------------------------- | +| `params` | The parameter values to use in the model. ~~dict~~ | + +## Morphologizer.add_label {#add_label tag="method"} + +Add a new label to the pipe. If the `Morphologizer` should set annotations for +both `pos` and `morph`, the label should include the UPOS as the feature `POS`. +Raises an error if the output dimension is already set, or if the model has +already been fully [initialized](#initialize). Note that you don't have to call +this method if you provide a **representative data sample** to the +[`initialize`](#initialize) method. In this case, all labels found in the sample +will be automatically added to the model, and the output dimension will be +[inferred](/usage/layers-architectures#thinc-shape-inference) automatically. + +> #### Example +> +> ```python +> morphologizer = nlp.add_pipe("morphologizer") +> morphologizer.add_label("Mood=Ind|POS=VERB|Tense=Past|VerbForm=Fin") +> ``` + +| Name | Description | +| ----------- | ----------------------------------------------------------- | +| `label` | The label to add. ~~str~~ | +| **RETURNS** | `0` if the label is already present, otherwise `1`. ~~int~~ | + +## Morphologizer.to_disk {#to_disk tag="method"} + +Serialize the pipe to disk. 
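+
+In many workflows the component is saved as part of the whole pipeline via
+[`nlp.to_disk`](/api/language#to_disk) rather than on its own. A hedged
+sketch, where `"my_pipeline"` is a hypothetical trained pipeline containing a
+morphologizer:
+
+```python
+import spacy
+
+nlp = spacy.load("my_pipeline")   # hypothetical trained pipeline
+nlp.to_disk("/path/to/pipeline")  # also serializes the morphologizer
+nlp2 = spacy.load("/path/to/pipeline")
+assert "morphologizer" in nlp2.pipe_names
+```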
+ +> #### Example +> +> ```python +> morphologizer = nlp.add_pipe("morphologizer") +> morphologizer.to_disk("/path/to/morphologizer") +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | +| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | + +## Morphologizer.from_disk {#from_disk tag="method"} + +Load the pipe from disk. Modifies the object in place and returns it. + +> #### Example +> +> ```python +> morphologizer = nlp.add_pipe("morphologizer") +> morphologizer.from_disk("/path/to/morphologizer") +> ``` + +| Name | Description | +| -------------- | ----------------------------------------------------------------------------------------------- | +| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The modified `Morphologizer` object. ~~Morphologizer~~ | + +## Morphologizer.to_bytes {#to_bytes tag="method"} + +> #### Example +> +> ```python +> morphologizer = nlp.add_pipe("morphologizer") +> morphologizer_bytes = morphologizer.to_bytes() +> ``` + +Serialize the pipe to a bytestring. + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------- | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The serialized form of the `Morphologizer` object. ~~bytes~~ | + +## Morphologizer.from_bytes {#from_bytes tag="method"} + +Load the pipe from a bytestring. Modifies the object in place and returns it. + +> #### Example +> +> ```python +> morphologizer_bytes = morphologizer.to_bytes() +> morphologizer = nlp.add_pipe("morphologizer") +> morphologizer.from_bytes(morphologizer_bytes) +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------- | +| `bytes_data` | The data to load from. ~~bytes~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The `Morphologizer` object. ~~Morphologizer~~ | + +## Morphologizer.labels {#labels tag="property"} + +The labels currently added to the component in the Universal Dependencies +[FEATS](https://universaldependencies.org/format.html#morphological-annotation) +format. Note that even for a blank component, this will always include the +internal empty label `_`. If POS features are used, the labels will include the +coarse-grained POS as the feature `POS`. + +> #### Example +> +> ```python +> morphologizer.add_label("Mood=Ind|POS=VERB|Tense=Past|VerbForm=Fin") +> assert "Mood=Ind|POS=VERB|Tense=Past|VerbForm=Fin" in morphologizer.labels +> ``` + +| Name | Description | +| ----------- | ------------------------------------------------------ | +| **RETURNS** | The labels added to the component. 
~~Tuple[str, ...]~~ | + +## Morphologizer.label_data {#label_data tag="property" new="3"} + +The labels currently added to the component and their internal meta information. +This is the data generated by [`init labels`](/api/cli#init-labels) and used by +[`Morphologizer.initialize`](/api/morphologizer#initialize) to initialize the +model with a pre-defined label set. + +> #### Example +> +> ```python +> labels = morphologizer.label_data +> morphologizer.initialize(lambda: [], nlp=nlp, labels=labels) +> ``` + +| Name | Description | +| ----------- | ----------------------------------------------- | +| **RETURNS** | The label data added to the component. ~~dict~~ | + +## Serialization fields {#serialization-fields} + +During serialization, spaCy will export several data fields used to restore +different aspects of the object. If needed, you can exclude them from +serialization by passing in the string names via the `exclude` argument. + +> #### Example +> +> ```python +> data = morphologizer.to_disk("/path", exclude=["vocab"]) +> ``` + +| Name | Description | +| ------- | -------------------------------------------------------------- | +| `vocab` | The shared [`Vocab`](/api/vocab). | +| `cfg` | The config file. You usually don't want to exclude this. | +| `model` | The binary model data. You usually don't want to exclude this. | diff --git a/website/docs/api/morphology.md b/website/docs/api/morphology.md new file mode 100644 index 000000000..e64f26bdd --- /dev/null +++ b/website/docs/api/morphology.md @@ -0,0 +1,254 @@ +--- +title: Morphology +tag: class +source: spacy/morphology.pyx +--- + +Store the possible morphological analyses for a language, and index them by +hash. To save space on each token, tokens only know the hash of their +morphological analysis, so queries of morphological attributes are delegated to +this class. See [`MorphAnalysis`](/api/morphology#morphanalysis) for the +container storing a single morphological analysis. + +## Morphology.\_\_init\_\_ {#init tag="method"} + +Create a `Morphology` object. + +> #### Example +> +> ```python +> from spacy.morphology import Morphology +> +> morphology = Morphology(strings) +> ``` + +| Name | Description | +| --------- | --------------------------------- | +| `strings` | The string store. ~~StringStore~~ | + +## Morphology.add {#add tag="method"} + +Insert a morphological analysis in the morphology table, if not already present. +The morphological analysis may be provided in the Universal Dependencies +[FEATS](https://universaldependencies.org/format.html#morphological-annotation) +format as a string or in the tag map dictionary format. Returns the hash of the +new analysis. + +> #### Example +> +> ```python +> feats = "Feat1=Val1|Feat2=Val2" +> hash = nlp.vocab.morphology.add(feats) +> assert hash == nlp.vocab.strings[feats] +> ``` + +| Name | Description | +| ---------- | ------------------------------------------------ | +| `features` | The morphological features. ~~Union[Dict, str]~~ | + +## Morphology.get {#get tag="method"} + +> #### Example +> +> ```python +> feats = "Feat1=Val1|Feat2=Val2" +> hash = nlp.vocab.morphology.add(feats) +> assert nlp.vocab.morphology.get(hash) == feats +> ``` + +Get the +[FEATS](https://universaldependencies.org/format.html#morphological-annotation) +string for the hash of the morphological analysis. + +| Name | Description | +| ------- | ----------------------------------------------- | +| `morph` | The hash of the morphological analysis. 
~~int~~ | + +## Morphology.feats_to_dict {#feats_to_dict tag="staticmethod"} + +Convert a string +[FEATS](https://universaldependencies.org/format.html#morphological-annotation) +representation to a dictionary of features and values in the same format as the +tag map. + +> #### Example +> +> ```python +> from spacy.morphology import Morphology +> d = Morphology.feats_to_dict("Feat1=Val1|Feat2=Val2") +> assert d == {"Feat1": "Val1", "Feat2": "Val2"} +> ``` + +| Name | Description | +| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------- | +| `feats` | The morphological features in Universal Dependencies [FEATS](https://universaldependencies.org/format.html#morphological-annotation) format. ~~str~~ | +| **RETURNS** | The morphological features as a dictionary. ~~Dict[str, str]~~ | + +## Morphology.dict_to_feats {#dict_to_feats tag="staticmethod"} + +Convert a dictionary of features and values to a string +[FEATS](https://universaldependencies.org/format.html#morphological-annotation) +representation. + +> #### Example +> +> ```python +> from spacy.morphology import Morphology +> f = Morphology.dict_to_feats({"Feat1": "Val1", "Feat2": "Val2"}) +> assert f == "Feat1=Val1|Feat2=Val2" +> ``` + +| Name | Description | +| ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `feats_dict` | The morphological features as a dictionary. ~~Dict[str, str]~~ | +| **RETURNS** | The morphological features in Universal Dependencies [FEATS](https://universaldependencies.org/format.html#morphological-annotation) format. ~~str~~ | + +## Attributes {#attributes} + +| Name | Description | +| ------------- | ------------------------------------------------------------------------------------------------------------------------------ | +| `FEATURE_SEP` | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) feature separator. Default is `|`. ~~str~~ | +| `FIELD_SEP` | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) field separator. Default is `=`. ~~str~~ | +| `VALUE_SEP` | The [FEATS](https://universaldependencies.org/format.html#morphological-annotation) value separator. Default is `,`. ~~str~~ | + +## MorphAnalysis {#morphanalysis tag="class" source="spacy/tokens/morphanalysis.pyx"} + +Stores a single morphological analysis. + +### MorphAnalysis.\_\_init\_\_ {#morphanalysis-init tag="method"} + +Initialize a MorphAnalysis object from a Universal Dependencies +[FEATS](https://universaldependencies.org/format.html#morphological-annotation) +string or a dictionary of morphological features. + +> #### Example +> +> ```python +> from spacy.tokens import MorphAnalysis +> +> feats = "Feat1=Val1|Feat2=Val2" +> m = MorphAnalysis(nlp.vocab, feats) +> ``` + +| Name | Description | +| ---------- | ---------------------------------------------------------- | +| `vocab` | The vocab. ~~Vocab~~ | +| `features` | The morphological features. ~~Union[Dict[str, str], str]~~ | + +### MorphAnalysis.\_\_contains\_\_ {#morphanalysis-contains tag="method"} + +Whether a feature/value pair is in the analysis. 
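+Membership is checked per `Field=Value` pair, so a feature with several values
+contributes one pair per value. A small sketch, using a blank pipeline only for
+its vocab:
+
+```python
+import spacy
+from spacy.tokens import MorphAnalysis
+
+nlp = spacy.blank("en")
+morph = MorphAnalysis(nlp.vocab, "Degree=Pos|Number=Plur,Sing")
+
+assert "Number=Plur" in morph  # each value counts as its own pair
+assert "Number=Sing" in morph
+assert morph.get("Number") == ["Plur", "Sing"]  # query by field instead
+```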
+ +> #### Example +> +> ```python +> feats = "Feat1=Val1,Val2|Feat2=Val2" +> morph = MorphAnalysis(nlp.vocab, feats) +> assert "Feat1=Val1" in morph +> ``` + +| Name | Description | +| ----------- | --------------------------------------------- | +| **RETURNS** | A feature/value pair in the analysis. ~~str~~ | + +### MorphAnalysis.\_\_iter\_\_ {#morphanalysis-iter tag="method"} + +Iterate over the feature/value pairs in the analysis. + +> #### Example +> +> ```python +> feats = "Feat1=Val1,Val3|Feat2=Val2" +> morph = MorphAnalysis(nlp.vocab, feats) +> assert list(morph) == ["Feat1=Va1", "Feat1=Val3", "Feat2=Val2"] +> ``` + +| Name | Description | +| ---------- | --------------------------------------------- | +| **YIELDS** | A feature/value pair in the analysis. ~~str~~ | + +### MorphAnalysis.\_\_len\_\_ {#morphanalysis-len tag="method"} + +Returns the number of features in the analysis. + +> #### Example +> +> ```python +> feats = "Feat1=Val1,Val2|Feat2=Val2" +> morph = MorphAnalysis(nlp.vocab, feats) +> assert len(morph) == 3 +> ``` + +| Name | Description | +| ----------- | ----------------------------------------------- | +| **RETURNS** | The number of features in the analysis. ~~int~~ | + +### MorphAnalysis.\_\_str\_\_ {#morphanalysis-str tag="method"} + +Returns the morphological analysis in the Universal Dependencies +[FEATS](https://universaldependencies.org/format.html#morphological-annotation) +string format. + +> #### Example +> +> ```python +> feats = "Feat1=Val1,Val2|Feat2=Val2" +> morph = MorphAnalysis(nlp.vocab, feats) +> assert str(morph) == feats +> ``` + +| Name | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------ | +| **RETURNS** | The analysis in the Universal Dependencies [FEATS](https://universaldependencies.org/format.html#morphological-annotation) format. ~~str~~ | + +### MorphAnalysis.get {#morphanalysis-get tag="method"} + +Retrieve values for a feature by field. + +> #### Example +> +> ```python +> feats = "Feat1=Val1,Val2" +> morph = MorphAnalysis(nlp.vocab, feats) +> assert morph.get("Feat1") == ["Val1", "Val2"] +> ``` + +| Name | Description | +| ----------- | ------------------------------------------------ | +| `field` | The field to retrieve. ~~str~~ | +| **RETURNS** | A list of the individual features. ~~List[str]~~ | + +### MorphAnalysis.to_dict {#morphanalysis-to_dict tag="method"} + +Produce a dict representation of the analysis, in the same format as the tag +map. + +> #### Example +> +> ```python +> feats = "Feat1=Val1,Val2|Feat2=Val2" +> morph = MorphAnalysis(nlp.vocab, feats) +> assert morph.to_dict() == {"Feat1": "Val1,Val2", "Feat2": "Val2"} +> ``` + +| Name | Description | +| ----------- | ----------------------------------------------------------- | +| **RETURNS** | The dict representation of the analysis. ~~Dict[str, str]~~ | + +### MorphAnalysis.from_id {#morphanalysis-from_id tag="classmethod"} + +Create a morphological analysis from a given hash ID. + +> #### Example +> +> ```python +> feats = "Feat1=Val1|Feat2=Val2" +> hash = nlp.vocab.strings[feats] +> morph = MorphAnalysis.from_id(nlp.vocab, hash) +> assert str(morph) == feats +> ``` + +| Name | Description | +| ------- | ---------------------------------------- | +| `vocab` | The vocab. ~~Vocab~~ | +| `key` | The hash of the features string. 
~~int~~ | diff --git a/website/docs/api/phrasematcher.md b/website/docs/api/phrasematcher.md index 49211174c..47bbdcf6a 100644 --- a/website/docs/api/phrasematcher.md +++ b/website/docs/api/phrasematcher.md @@ -9,7 +9,8 @@ new: 2 The `PhraseMatcher` lets you efficiently match large terminology lists. While the [`Matcher`](/api/matcher) lets you match sequences based on lists of token descriptions, the `PhraseMatcher` accepts match patterns in the form of `Doc` -objects. +objects. See the [usage guide](/usage/rule-based-matching#phrasematcher) for +examples. ## PhraseMatcher.\_\_init\_\_ {#init tag="method"} @@ -35,20 +36,11 @@ be shown. > matcher = PhraseMatcher(nlp.vocab) > ``` -| Name | Type | Description | -| --------------------------------------- | --------------- | ------------------------------------------------------------------------------------------- | -| `vocab` | `Vocab` | The vocabulary object, which must be shared with the documents the matcher will operate on. | -| `max_length` | int | Deprecated argument - the `PhraseMatcher` does not have a phrase length limit anymore. | -| `attr` 2.1 | int / unicode | The token attribute to match on. Defaults to `ORTH`, i.e. the verbatim token text. | -| `validate` 2.1 | bool | Validate patterns added to the matcher. | -| **RETURNS** | `PhraseMatcher` | The newly constructed object. | - - - -As of v2.1, the `PhraseMatcher` doesn't have a phrase length limit anymore, so -the `max_length` argument is now deprecated. - - +| Name | Description | +| --------------------------------------- | ------------------------------------------------------------------------------------------------------ | +| `vocab` | The vocabulary object, which must be shared with the documents the matcher will operate on. ~~Vocab~~ | +| `attr` 2.1 | The token attribute to match on. Defaults to `ORTH`, i.e. the verbatim token text. ~~Union[int, str]~~ | +| `validate` 2.1 | Validate patterns added to the matcher. ~~bool~~ | ## PhraseMatcher.\_\_call\_\_ {#call tag="method"} @@ -60,15 +52,17 @@ Find all token sequences matching the supplied patterns on the `Doc`. > from spacy.matcher import PhraseMatcher > > matcher = PhraseMatcher(nlp.vocab) -> matcher.add("OBAMA", None, nlp("Barack Obama")) +> matcher.add("OBAMA", [nlp("Barack Obama")]) > doc = nlp("Barack Obama lifts America one last time in emotional farewell") > matches = matcher(doc) > ``` -| Name | Type | Description | -| ----------- | ----- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| `doc` | `Doc` | The document to match over. | -| **RETURNS** | list | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern. | +| Name | Description | +| ------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `doc` | The document to match over. ~~Doc~~ | +| _keyword-only_ | | +| `as_spans` 3 | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. 
~~bool~~ | +| **RETURNS** | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ | @@ -82,25 +76,6 @@ match_id_string = nlp.vocab.strings[match_id] -## PhraseMatcher.pipe {#pipe tag="method"} - -Match a stream of documents, yielding them in turn. - -> #### Example -> -> ```python -> from spacy.matcher import PhraseMatcher -> matcher = PhraseMatcher(nlp.vocab) -> for doc in matcher.pipe(docs, batch_size=50): -> pass -> ``` - -| Name | Type | Description | -| ------------ | -------- | --------------------------------------------------------- | -| `docs` | iterable | A stream of documents. | -| `batch_size` | int | The number of documents to accumulate into a working set. | -| **YIELDS** | `Doc` | Documents, in order. | - ## PhraseMatcher.\_\_len\_\_ {#len tag="method"} Get the number of rules added to the matcher. Note that this only returns the @@ -112,13 +87,13 @@ patterns. > ```python > matcher = PhraseMatcher(nlp.vocab) > assert len(matcher) == 0 -> matcher.add("OBAMA", None, nlp("Barack Obama")) +> matcher.add("OBAMA", [nlp("Barack Obama")]) > assert len(matcher) == 1 > ``` -| Name | Type | Description | -| ----------- | ---- | -------------------- | -| **RETURNS** | int | The number of rules. | +| Name | Description | +| ----------- | ---------------------------- | +| **RETURNS** | The number of rules. ~~int~~ | ## PhraseMatcher.\_\_contains\_\_ {#contains tag="method"} @@ -129,14 +104,14 @@ Check whether the matcher contains rules for a match ID. > ```python > matcher = PhraseMatcher(nlp.vocab) > assert "OBAMA" not in matcher -> matcher.add("OBAMA", None, nlp("Barack Obama")) +> matcher.add("OBAMA", [nlp("Barack Obama")]) > assert "OBAMA" in matcher > ``` -| Name | Type | Description | -| ----------- | ------- | ----------------------------------------------------- | -| `key` | unicode | The match ID. | -| **RETURNS** | bool | Whether the matcher contains rules for this match ID. | +| Name | Description | +| ----------- | -------------------------------------------------------------- | +| `key` | The match ID. ~~str~~ | +| **RETURNS** | Whether the matcher contains rules for this match ID. ~~bool~~ | ## PhraseMatcher.add {#add tag="method"} @@ -153,36 +128,33 @@ overwritten. > print('Matched!', matches) > > matcher = PhraseMatcher(nlp.vocab) -> matcher.add("OBAMA", on_match, nlp("Barack Obama")) -> matcher.add("HEALTH", on_match, nlp("health care reform"), -> nlp("healthcare reform")) +> matcher.add("OBAMA", [nlp("Barack Obama")], on_match=on_match) +> matcher.add("HEALTH", [nlp("health care reform"), nlp("healthcare reform")], on_match=on_match) > doc = nlp("Barack Obama urges Congress to find courage to defend his healthcare reforms") > matches = matcher(doc) > ``` -| Name | Type | Description | -| ---------- | ------------------ | --------------------------------------------------------------------------------------------- | -| `match_id` | unicode | An ID for the thing you're matching. | -| `on_match` | callable or `None` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. | -| `*docs` | `Doc` | `Doc` objects of the phrases to match. | + - - -As of spaCy 2.2.2, `PhraseMatcher.add` also supports the new API, which will -become the default in the future. 
The `Doc` patterns are now the second argument -and a list (instead of a variable number of arguments). The `on_match` callback +As of spaCy v3.0, `PhraseMatcher.add` takes a list of patterns as the second +argument (instead of a variable number of arguments). The `on_match` callback becomes an optional keyword argument. ```diff patterns = [nlp("health care reform"), nlp("healthcare reform")] -- matcher.add("HEALTH", None, *patterns) -+ matcher.add("HEALTH", patterns) - matcher.add("HEALTH", on_match, *patterns) + matcher.add("HEALTH", patterns, on_match=on_match) ``` +| Name | Description | +| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `match_id` | An ID for the thing you're matching. ~~str~~ | | +| `docs` | `Doc` objects of the phrases to match. ~~List[Doc]~~ | +| _keyword-only_ | | +| `on_match` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. ~~Optional[Callable[[Matcher, Doc, int, List[tuple], Any]]~~ | + ## PhraseMatcher.remove {#remove tag="method" new="2.2"} Remove a rule from the matcher by match ID. A `KeyError` is raised if the key @@ -192,12 +164,12 @@ does not exist. > > ```python > matcher = PhraseMatcher(nlp.vocab) -> matcher.add("OBAMA", None, nlp("Barack Obama")) +> matcher.add("OBAMA", [nlp("Barack Obama")]) > assert "OBAMA" in matcher > matcher.remove("OBAMA") > assert "OBAMA" not in matcher > ``` -| Name | Type | Description | -| ----- | ------- | ------------------------- | -| `key` | unicode | The ID of the match rule. | +| Name | Description | +| ----- | --------------------------------- | +| `key` | The ID of the match rule. ~~str~~ | diff --git a/website/docs/api/pipe.md b/website/docs/api/pipe.md new file mode 100644 index 000000000..1f7fab8aa --- /dev/null +++ b/website/docs/api/pipe.md @@ -0,0 +1,497 @@ +--- +title: TrainablePipe +tag: class +teaser: Base class for trainable pipeline components +--- + +This class is a base class and **not instantiated directly**. Trainable pipeline +components like the [`EntityRecognizer`](/api/entityrecognizer) or +[`TextCategorizer`](/api/textcategorizer) inherit from it and it defines the +interface that components should follow to function as trainable components in a +spaCy pipeline. See the docs on +[writing trainable components](/usage/processing-pipelines#trainable-components) +for how to use the `TrainablePipe` base class to implement custom components. + + + +> #### Why is it implemented in Cython? +> +> The `TrainablePipe` class is implemented in a `.pyx` module, the extension +> used by [Cython](/api/cython). This is needed so that **other** Cython +> classes, like the [`EntityRecognizer`](/api/entityrecognizer) can inherit from +> it. But it doesn't mean you have to implement trainable components in Cython – +> pure Python components like the [`TextCategorizer`](/api/textcategorizer) can +> also inherit from `TrainablePipe`. + +```python +%%GITHUB_SPACY/spacy/pipeline/trainable_pipe.pyx +``` + +## TrainablePipe.\_\_init\_\_ {#init tag="method"} + +> #### Example +> +> ```python +> from spacy.pipeline import TrainablePipe +> from spacy.language import Language +> +> class CustomPipe(TrainablePipe): +> ... +> +> @Language.factory("your_custom_pipe", default_config={"model": MODEL}) +> def make_custom_pipe(nlp, name, model): +> return CustomPipe(nlp.vocab, model, name) +> ``` + +Create a new pipeline instance. 
In your application, you would normally use a +shortcut for this and instantiate the component using its string name and +[`nlp.add_pipe`](/api/language#create_pipe). + +| Name | Description | +| ------- | -------------------------------------------------------------------------------------------------------------------------- | +| `vocab` | The shared vocabulary. ~~Vocab~~ | +| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], Any]~~ | +| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ | +| `**cfg` | Additional config parameters and settings. Will be available as the dictionary `cfg` and is serialized with the component. | + +## TrainablePipe.\_\_call\_\_ {#call tag="method"} + +Apply the pipe to one document. The document is modified in place, and returned. +This usually happens under the hood when the `nlp` object is called on a text +and all pipeline components are applied to the `Doc` in order. Both +[`__call__`](/api/pipe#call) and [`pipe`](/api/pipe#pipe) delegate to the +[`predict`](/api/pipe#predict) and +[`set_annotations`](/api/pipe#set_annotations) methods. + +> #### Example +> +> ```python +> doc = nlp("This is a sentence.") +> pipe = nlp.add_pipe("your_custom_pipe") +> # This usually happens under the hood +> processed = pipe(doc) +> ``` + +| Name | Description | +| ----------- | -------------------------------- | +| `doc` | The document to process. ~~Doc~~ | +| **RETURNS** | The processed document. ~~Doc~~ | + +## TrainablePipe.pipe {#pipe tag="method"} + +Apply the pipe to a stream of documents. This usually happens under the hood +when the `nlp` object is called on a text and all pipeline components are +applied to the `Doc` in order. Both [`__call__`](/api/pipe#call) and +[`pipe`](/api/pipe#pipe) delegate to the [`predict`](/api/pipe#predict) and +[`set_annotations`](/api/pipe#set_annotations) methods. + +> #### Example +> +> ```python +> pipe = nlp.add_pipe("your_custom_pipe") +> for doc in pipe.pipe(docs, batch_size=50): +> pass +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------- | +| `stream` | A stream of documents. ~~Iterable[Doc]~~ | +| _keyword-only_ | | +| `batch_size` | The number of documents to buffer. Defaults to `128`. ~~int~~ | +| **YIELDS** | The processed documents in order. ~~Doc~~ | + +## TrainablePipe.initialize {#initialize tag="method" new="3"} + +Initialize the component for training. `get_examples` should be a function that +returns an iterable of [`Example`](/api/example) objects. The data examples are +used to **initialize the model** of the component and can either be the full +training data or a representative sample. Initialization includes validating the +network, +[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and +setting up the label scheme based on the data. This method is typically called +by [`Language.initialize`](/api/language#initialize). + + + +This method was previously called `begin_training`. 
+ + + +> #### Example +> +> ```python +> pipe = nlp.add_pipe("your_custom_pipe") +> pipe.initialize(lambda: [], pipeline=nlp.pipeline) +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------- | +| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ | +| _keyword-only_ | | +| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ | + +## TrainablePipe.predict {#predict tag="method"} + +Apply the component's model to a batch of [`Doc`](/api/doc) objects, without +modifying them. + + + +This method needs to be overwritten with your own custom `predict` method. + + + +> #### Example +> +> ```python +> pipe = nlp.add_pipe("your_custom_pipe") +> scores = pipe.predict([doc1, doc2]) +> ``` + +| Name | Description | +| ----------- | ------------------------------------------- | +| `docs` | The documents to predict. ~~Iterable[Doc]~~ | +| **RETURNS** | The model's prediction for each document. | + +## TrainablePipe.set_annotations {#set_annotations tag="method"} + +Modify a batch of [`Doc`](/api/doc) objects, using pre-computed scores. + + + +This method needs to be overwritten with your own custom `set_annotations` +method. + + + +> #### Example +> +> ```python +> pipe = nlp.add_pipe("your_custom_pipe") +> scores = pipe.predict(docs) +> pipe.set_annotations(docs, scores) +> ``` + +| Name | Description | +| -------- | ------------------------------------------------ | +| `docs` | The documents to modify. ~~Iterable[Doc]~~ | +| `scores` | The scores to set, produced by `Tagger.predict`. | + +## TrainablePipe.update {#update tag="method"} + +Learn from a batch of [`Example`](/api/example) objects containing the +predictions and gold-standard annotations, and update the component's model. + +> #### Example +> +> ```python +> pipe = nlp.add_pipe("your_custom_pipe") +> optimizer = nlp.initialize() +> losses = pipe.update(examples, sgd=optimizer) +> ``` + +| Name | Description | +| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- | +| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ | +| _keyword-only_ | | +| `drop` | The dropout rate. ~~float~~ | +| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ | +| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ | +| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ | +| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ | + +## TrainablePipe.rehearse {#rehearse tag="method,experimental" new="3"} + +Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the +current model to make predictions similar to an initial model, to try to address +the "catastrophic forgetting" problem. This feature is experimental. 
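+A common pattern is to interleave regular `update` calls on the new annotations
+with `rehearse` calls on raw text from the original domain, so the component
+keeps its old behavior while learning from the new data. The following is a
+rough sketch of that idea, using the trained NER component as a stand-in for
+any `TrainablePipe` and hypothetical example data (it assumes `en_core_web_sm`
+is installed):
+
+```python
+import random
+import spacy
+from spacy.training import Example
+from spacy.util import minibatch
+
+nlp = spacy.load("en_core_web_sm")
+ner = nlp.get_pipe("ner")  # a TrainablePipe subclass
+
+# Hypothetical new annotations plus raw text from the original domain
+new_examples = [
+    Example.from_dict(
+        nlp.make_doc("ACME Corp hired Jane Doe"),
+        {"entities": [(0, 9, "ORG"), (16, 24, "PERSON")]},
+    )
+]
+raw_examples = [Example.from_dict(nlp.make_doc("The company hired an engineer."), {})]
+
+optimizer = nlp.resume_training()
+for _ in range(5):
+    random.shuffle(new_examples)
+    losses = {}
+    for batch in minibatch(new_examples, size=8):
+        ner.update(batch, sgd=optimizer, losses=losses)
+    # Nudge the model back towards its original predictions
+    ner.rehearse(raw_examples, sgd=optimizer, losses=losses)
+```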
+ +> #### Example +> +> ```python +> pipe = nlp.add_pipe("your_custom_pipe") +> optimizer = nlp.resume_training() +> losses = pipe.rehearse(examples, sgd=optimizer) +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------ | +| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ | +| _keyword-only_ | | +| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ | +| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ | +| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ | + +## TrainablePipe.get_loss {#get_loss tag="method"} + +Find the loss and gradient of loss for the batch of documents and their +predicted scores. + + + +This method needs to be overwritten with your own custom `get_loss` method. + + + +> #### Example +> +> ```python +> ner = nlp.add_pipe("ner") +> scores = ner.predict([eg.predicted for eg in examples]) +> loss, d_loss = ner.get_loss(examples, scores) +> ``` + +| Name | Description | +| ----------- | --------------------------------------------------------------------------- | +| `examples` | The batch of examples. ~~Iterable[Example]~~ | +| `scores` | Scores representing the model's predictions. | +| **RETURNS** | The loss and the gradient, i.e. `(loss, gradient)`. ~~Tuple[float, float]~~ | + +## TrainablePipe.score {#score tag="method" new="3"} + +Score a batch of examples. + +> #### Example +> +> ```python +> scores = pipe.score(examples) +> ``` + +| Name | Description | +| ----------- | ------------------------------------------------------------------------------------------------------- | +| `examples` | The examples to score. ~~Iterable[Example]~~ | +| **RETURNS** | The scores, e.g. produced by the [`Scorer`](/api/scorer). ~~Dict[str, Union[float, Dict[str, float]]]~~ | + +## TrainablePipe.create_optimizer {#create_optimizer tag="method"} + +Create an optimizer for the pipeline component. Defaults to +[`Adam`](https://thinc.ai/docs/api-optimizers#adam) with default settings. + +> #### Example +> +> ```python +> pipe = nlp.add_pipe("your_custom_pipe") +> optimizer = pipe.create_optimizer() +> ``` + +| Name | Description | +| ----------- | ---------------------------- | +| **RETURNS** | The optimizer. ~~Optimizer~~ | + +## TrainablePipe.use_params {#use_params tag="method, contextmanager"} + +Modify the pipe's model, to use the given parameter values. At the end of the +context, the original parameters are restored. + +> #### Example +> +> ```python +> pipe = nlp.add_pipe("your_custom_pipe") +> with pipe.use_params(optimizer.averages): +> pipe.to_disk("/best_model") +> ``` + +| Name | Description | +| -------- | -------------------------------------------------- | +| `params` | The parameter values to use in the model. ~~dict~~ | + +## TrainablePipe.finish_update {#finish_update tag="method"} + +Update parameters using the current parameter gradients. Defaults to calling +[`self.model.finish_update`](https://thinc.ai/docs/api-model#finish_update). + +> #### Example +> +> ```python +> pipe = nlp.add_pipe("your_custom_pipe") +> optimizer = nlp.initialize() +> losses = pipe.update(examples, sgd=None) +> pipe.finish_update(sgd) +> ``` + +| Name | Description | +| ----- | ------------------------------------- | +| `sgd` | An optimizer. 
~~Optional[Optimizer]~~ | + +## TrainablePipe.add_label {#add_label tag="method"} + +> #### Example +> +> ```python +> pipe = nlp.add_pipe("your_custom_pipe") +> pipe.add_label("MY_LABEL") +> ``` + +Add a new label to the pipe, to be predicted by the model. The actual +implementation depends on the specific component, but in general `add_label` +shouldn't be called if the output dimension is already set, or if the model has +already been fully [initialized](#initialize). If these conditions are violated, +the function will raise an Error. The exception to this rule is when the +component is [resizable](#is_resizable), in which case +[`set_output`](#set_output) should be called to ensure that the model is +properly resized. + + + +This method needs to be overwritten with your own custom `add_label` method. + + + +| Name | Description | +| ----------- | ------------------------------------------------------- | +| `label` | The label to add. ~~str~~ | +| **RETURNS** | 0 if the label is already present, otherwise 1. ~~int~~ | + +Note that in general, you don't have to call `pipe.add_label` if you provide a +representative data sample to the [`initialize`](#initialize) method. In this +case, all labels found in the sample will be automatically added to the model, +and the output dimension will be +[inferred](/usage/layers-architectures#thinc-shape-inference) automatically. + +## TrainablePipe.is_resizable {#is_resizable tag="property"} + +> #### Example +> +> ```python +> can_resize = pipe.is_resizable +> ``` +> +> With custom resizing implemented by a component: +> +> ```python +> def custom_resize(model, new_nO): +> # adjust model +> return model +> +> custom_model.attrs["resize_output"] = custom_resize +> ``` + +Check whether or not the output dimension of the component's model can be +resized. If this method returns `True`, [`set_output`](#set_output) can be +called to change the model's output dimension. + +For built-in components that are not resizable, you have to create and train a +new model from scratch with the appropriate architecture and output dimension. +For custom components, you can implement a `resize_output` function and add it +as an attribute to the component's model. + +| Name | Description | +| ----------- | ---------------------------------------------------------------------------------------------- | +| **RETURNS** | Whether or not the output dimension of the model can be changed after initialization. ~~bool~~ | + +## TrainablePipe.set_output {#set_output tag="method"} + +Change the output dimension of the component's model. If the component is not +[resizable](#is_resizable), this method will raise a `NotImplementedError`. If a +component is resizable, the model's attribute `resize_output` will be called. +This is a function that takes the original model and the new output dimension +`nO`, and changes the model in place. When resizing an already trained model, +care should be taken to avoid the "catastrophic forgetting" problem. + +> #### Example +> +> ```python +> if pipe.is_resizable: +> pipe.set_output(512) +> ``` + +| Name | Description | +| ---- | --------------------------------- | +| `nO` | The new output dimension. ~~int~~ | + +## TrainablePipe.to_disk {#to_disk tag="method"} + +Serialize the pipe to disk. 
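+This saves only the component itself, not the whole pipeline, which can be
+useful for checkpointing a single component during a custom training run and
+restoring it later into a pipeline with the same configuration. A minimal
+sketch using the trainable tagger as a stand-in (the path is a placeholder and
+`en_core_web_sm` is assumed to be installed):
+
+```python
+import spacy
+
+nlp = spacy.load("en_core_web_sm")
+nlp.get_pipe("tagger").to_disk("/tmp/tagger")
+
+# Load the stored weights back into a compatible component in another pipeline
+other_nlp = spacy.load("en_core_web_sm")
+other_nlp.get_pipe("tagger").from_disk("/tmp/tagger")
+```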
+ +> #### Example +> +> ```python +> pipe = nlp.add_pipe("your_custom_pipe") +> pipe.to_disk("/path/to/pipe") +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | +| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | + +## TrainablePipe.from_disk {#from_disk tag="method"} + +Load the pipe from disk. Modifies the object in place and returns it. + +> #### Example +> +> ```python +> pipe = nlp.add_pipe("your_custom_pipe") +> pipe.from_disk("/path/to/pipe") +> ``` + +| Name | Description | +| -------------- | ----------------------------------------------------------------------------------------------- | +| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The modified pipe. ~~TrainablePipe~~ | + +## TrainablePipe.to_bytes {#to_bytes tag="method"} + +> #### Example +> +> ```python +> pipe = nlp.add_pipe("your_custom_pipe") +> pipe_bytes = pipe.to_bytes() +> ``` + +Serialize the pipe to a bytestring. + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------- | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The serialized form of the pipe. ~~bytes~~ | + +## TrainablePipe.from_bytes {#from_bytes tag="method"} + +Load the pipe from a bytestring. Modifies the object in place and returns it. + +> #### Example +> +> ```python +> pipe_bytes = pipe.to_bytes() +> pipe = nlp.add_pipe("your_custom_pipe") +> pipe.from_bytes(pipe_bytes) +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------- | +| `bytes_data` | The data to load from. ~~bytes~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The pipe. ~~TrainablePipe~~ | + +## Attributes {#attributes} + +| Name | Description | +| ------- | --------------------------------------------------------------------------------------------------------------------------------- | +| `vocab` | The shared vocabulary that's passed in on initialization. ~~Vocab~~ | +| `model` | The model powering the component. ~~Model[List[Doc], Any]~~ | +| `name` | The name of the component instance in the pipeline. Can be used in the losses. ~~str~~ | +| `cfg` | Keyword arguments passed to [`TrainablePipe.__init__`](/api/pipe#init). Will be serialized with the component. ~~Dict[str, Any]~~ | + +## Serialization fields {#serialization-fields} + +During serialization, spaCy will export several data fields used to restore +different aspects of the object. If needed, you can exclude them from +serialization by passing in the string names via the `exclude` argument. 
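+For example, to serialize a component without its binary model data, you could
+exclude the `model` field; the same names work for the bytes-based methods. A
+hypothetical sketch using the trainable tagger as a stand-in:
+
+```python
+import spacy
+
+nlp = spacy.load("en_core_web_sm")
+tagger = nlp.get_pipe("tagger")
+
+# Everything except the binary model weights
+tagger_bytes = tagger.to_bytes(exclude=["model"])
+
+# Excluded fields are skipped on loading, leaving the current values in place
+tagger.from_bytes(tagger_bytes, exclude=["model"])
+```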
+ +> #### Example +> +> ```python +> data = pipe.to_disk("/path") +> ``` + +| Name | Description | +| ------- | -------------------------------------------------------------- | +| `cfg` | The config file. You usually don't want to exclude this. | +| `model` | The binary model data. You usually don't want to exclude this. | diff --git a/website/docs/api/pipeline-functions.md b/website/docs/api/pipeline-functions.md index 6e2b473b1..0dc03a16a 100644 --- a/website/docs/api/pipeline-functions.md +++ b/website/docs/api/pipeline-functions.md @@ -11,8 +11,7 @@ menu: ## merge_noun_chunks {#merge_noun_chunks tag="function"} Merge noun chunks into a single token. Also available via the string name -`"merge_noun_chunks"`. After initialization, the component is typically added to -the processing pipeline using [`nlp.add_pipe`](/api/language#add_pipe). +`"merge_noun_chunks"`. > #### Example > @@ -20,9 +19,7 @@ the processing pipeline using [`nlp.add_pipe`](/api/language#add_pipe). > texts = [t.text for t in nlp("I have a blue car")] > assert texts == ["I", "have", "a", "blue", "car"] > -> merge_nps = nlp.create_pipe("merge_noun_chunks") -> nlp.add_pipe(merge_nps) -> +> nlp.add_pipe("merge_noun_chunks") > texts = [t.text for t in nlp("I have a blue car")] > assert texts == ["I", "have", "a blue car"] > ``` @@ -36,16 +33,15 @@ all other components. -| Name | Type | Description | -| ----------- | ----- | ------------------------------------------------------------ | -| `doc` | `Doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. | -| **RETURNS** | `Doc` | The modified `Doc` with merged noun chunks. | +| Name | Description | +| ----------- | -------------------------------------------------------------------- | +| `doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. ~~Doc~~ | +| **RETURNS** | The modified `Doc` with merged noun chunks. ~~Doc~~ | ## merge_entities {#merge_entities tag="function"} Merge named entities into a single token. Also available via the string name -`"merge_entities"`. After initialization, the component is typically added to -the processing pipeline using [`nlp.add_pipe`](/api/language#add_pipe). +`"merge_entities"`. > #### Example > @@ -53,8 +49,7 @@ the processing pipeline using [`nlp.add_pipe`](/api/language#add_pipe). > texts = [t.text for t in nlp("I like David Bowie")] > assert texts == ["I", "like", "David", "Bowie"] > -> merge_ents = nlp.create_pipe("merge_entities") -> nlp.add_pipe(merge_ents) +> nlp.add_pipe("merge_entities") > > texts = [t.text for t in nlp("I like David Bowie")] > assert texts == ["I", "like", "David Bowie"] @@ -68,20 +63,17 @@ components to the end of the pipeline and after all other components. -| Name | Type | Description | -| ----------- | ----- | ------------------------------------------------------------ | -| `doc` | `Doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. | -| **RETURNS** | `Doc` | The modified `Doc` with merged entities. | +| Name | Description | +| ----------- | -------------------------------------------------------------------- | +| `doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. ~~Doc~~ | +| **RETURNS** | The modified `Doc` with merged entities. ~~Doc~~ | ## merge_subtokens {#merge_subtokens tag="function" new="2.1"} Merge subtokens into a single token. Also available via the string name -`"merge_subtokens"`. After initialization, the component is typically added to -the processing pipeline using [`nlp.add_pipe`](/api/language#add_pipe). 
- -As of v2.1, the parser is able to predict "subtokens" that should be merged into -one single token later on. This is especially relevant for languages like -Chinese, Japanese or Korean, where a "word" isn't defined as a +`"merge_subtokens"`. As of v2.1, the parser is able to predict "subtokens" that +should be merged into one single token later on. This is especially relevant for +languages like Chinese, Japanese or Korean, where a "word" isn't defined as a whitespace-delimited sequence of characters. Under the hood, this component uses the [`Matcher`](/api/matcher) to find sequences of tokens with the dependency label `"subtok"` and then merges them into a single token. @@ -96,9 +88,7 @@ label `"subtok"` and then merges them into a single token. > print([(token.text, token.dep_) for token in doc]) > # [('拜', 'subtok'), ('托', 'subtok')] > -> merge_subtok = nlp.create_pipe("merge_subtokens") -> nlp.add_pipe(merge_subtok) -> +> nlp.add_pipe("merge_subtokens") > doc = nlp("拜托") > print([token.text for token in doc]) > # ['拜托'] @@ -112,8 +102,8 @@ end of the pipeline and after all other components. -| Name | Type | Description | -| ----------- | ------- | ------------------------------------------------------------ | -| `doc` | `Doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. | -| `label` | unicode | The subtoken dependency label. Defaults to `"subtok"`. | -| **RETURNS** | `Doc` | The modified `Doc` with merged subtokens. | +| Name | Description | +| ----------- | -------------------------------------------------------------------- | +| `doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. ~~Doc~~ | +| `label` | The subtoken dependency label. Defaults to `"subtok"`. ~~str~~ | +| **RETURNS** | The modified `Doc` with merged subtokens. ~~Doc~~ | diff --git a/website/docs/api/scorer.md b/website/docs/api/scorer.md index b1824573c..0dbc0de33 100644 --- a/website/docs/api/scorer.md +++ b/website/docs/api/scorer.md @@ -5,8 +5,10 @@ tag: class source: spacy/scorer.py --- -The `Scorer` computes and stores evaluation scores. It's typically created by -[`Language.evaluate`](/api/language#evaluate). +The `Scorer` computes evaluation scores. It's typically created by +[`Language.evaluate`](/api/language#evaluate). In addition, the `Scorer` +provides a number of evaluation methods for evaluating [`Token`](/api/token) and +[`Doc`](/api/doc) attributes. ## Scorer.\_\_init\_\_ {#init tag="method"} @@ -17,46 +19,213 @@ Create a new `Scorer`. > ```python > from spacy.scorer import Scorer > +> # Default scoring pipeline > scorer = Scorer() +> +> # Provided scoring pipeline +> nlp = spacy.load("en_core_web_sm") +> scorer = Scorer(nlp) > ``` -| Name | Type | Description | -| ------------ | -------- | ------------------------------------------------------------ | -| `eval_punct` | bool | Evaluate the dependency attachments to and from punctuation. | -| **RETURNS** | `Scorer` | The newly created object. | +| Name | Description | +| ----- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `nlp` | The pipeline to use for scoring, where each pipeline component may provide a scoring method. 
If none is provided, then a default pipeline for the multi-language code `xx` is constructed containing: `senter`, `tagger`, `morphologizer`, `parser`, `ner`, `textcat`. ~~Language~~ | ## Scorer.score {#score tag="method"} -Update the evaluation scores from a single [`Doc`](/api/doc) / -[`GoldParse`](/api/goldparse) pair. +Calculate the scores for a list of [`Example`](/api/example) objects using the +scoring methods provided by the components in the pipeline. + +The returned `Dict` contains the scores provided by the individual pipeline +components. For the scoring methods provided by the `Scorer` and use by the core +pipeline components, the individual score names start with the `Token` or `Doc` +attribute being scored: + +- `token_acc`, `token_p`, `token_r`, `token_f`, +- `sents_p`, `sents_r`, `sents_f` +- `tag_acc`, `pos_acc`, `morph_acc`, `morph_per_feat`, `lemma_acc` +- `dep_uas`, `dep_las`, `dep_las_per_type` +- `ents_p`, `ents_r` `ents_f`, `ents_per_type` +- `textcat_macro_auc`, `textcat_macro_f` > #### Example > > ```python > scorer = Scorer() -> scorer.score(doc, gold) +> scores = scorer.score(examples) > ``` -| Name | Type | Description | -| -------------- | ----------- | -------------------------------------------------------------------------------------------------------------------- | -| `doc` | `Doc` | The predicted annotations. | -| `gold` | `GoldParse` | The correct annotations. | -| `verbose` | bool | Print debugging information. | -| `punct_labels` | tuple | Dependency labels for punctuation. Used to evaluate dependency attachments to punctuation if `eval_punct` is `True`. | +| Name | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------- | +| `examples` | The `Example` objects holding both the predictions and the correct gold-standard annotations. ~~Iterable[Example]~~ | +| **RETURNS** | A dictionary of scores. ~~Dict[str, Union[float, Dict[str, float]]]~~ | -## Properties +## Scorer.score_tokenization {#score_tokenization tag="staticmethod" new="3"} -| Name | Type | Description | -| ----------------------------------------------- | ----- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `token_acc` | float | Tokenization accuracy. | -| `tags_acc` | float | Part-of-speech tag accuracy (fine grained tags, i.e. `Token.tag`). | -| `uas` | float | Unlabelled dependency score. | -| `las` | float | Labelled dependency score. | -| `ents_p` | float | Named entity accuracy (precision). | -| `ents_r` | float | Named entity accuracy (recall). | -| `ents_f` | float | Named entity accuracy (F-score). | -| `ents_per_type` 2.1.5 | dict | Scores per entity label. Keyed by label, mapped to a dict of `p`, `r` and `f` scores. | -| `textcat_score` 2.2 | float | F-score on positive label for binary exclusive, macro-averaged F-score for 3+ exclusive, macro-averaged AUC ROC score for multilabel (`-1` if undefined). | -| `textcats_per_cat` 2.2 | dict | Scores per textcat label, keyed by label. | -| `las_per_type` 2.2.3 | dict | Labelled dependency scores, keyed by label. | -| `scores` | dict | All scores, keyed by type. 
| +Scores the tokenization: + +- `token_acc`: number of correct tokens / number of gold tokens +- `token_p`, `token_r`, `token_f`: precision, recall and F-score for token + character spans + +> #### Example +> +> ```python +> scores = Scorer.score_tokenization(examples) +> ``` + +| Name | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------- | +| `examples` | The `Example` objects holding both the predictions and the correct gold-standard annotations. ~~Iterable[Example]~~ | +| **RETURNS** | `Dict` | A dictionary containing the scores `token_acc`, `token_p`, `token_r`, `token_f`. ~~Dict[str, float]]~~ | + +## Scorer.score_token_attr {#score_token_attr tag="staticmethod" new="3"} + +Scores a single token attribute. + +> #### Example +> +> ```python +> scores = Scorer.score_token_attr(examples, "pos") +> print(scores["pos_acc"]) +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `examples` | The `Example` objects holding both the predictions and the correct gold-standard annotations. ~~Iterable[Example]~~ | +| `attr` | The attribute to score. ~~str~~ | +| _keyword-only_ | | +| `getter` | Defaults to `getattr`. If provided, `getter(token, attr)` should return the value of the attribute for an individual `Token`. ~~Callable[[Token, str], Any]~~ | +| **RETURNS** | A dictionary containing the score `{attr}_acc`. ~~Dict[str, float]~~ | + +## Scorer.score_token_attr_per_feat {#score_token_attr_per_feat tag="staticmethod" new="3"} + +Scores a single token attribute per feature for a token attribute in the +Universal Dependencies +[FEATS](https://universaldependencies.org/format.html#morphological-annotation) +format. + +> #### Example +> +> ```python +> scores = Scorer.score_token_attr_per_feat(examples, "morph") +> print(scores["morph_per_feat"]) +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `examples` | The `Example` objects holding both the predictions and the correct gold-standard annotations. ~~Iterable[Example]~~ | +| `attr` | The attribute to score. ~~str~~ | +| _keyword-only_ | | +| `getter` | Defaults to `getattr`. If provided, `getter(token, attr)` should return the value of the attribute for an individual `Token`. ~~Callable[[Token, str], Any]~~ | +| **RETURNS** | A dictionary containing the per-feature PRF scores under the key `{attr}_per_feat`. ~~Dict[str, Dict[str, float]]~~ | + +## Scorer.score_spans {#score_spans tag="staticmethod" new="3"} + +Returns PRF scores for labeled or unlabeled spans. + +> #### Example +> +> ```python +> scores = Scorer.score_spans(examples, "ents") +> print(scores["ents_f"]) +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `examples` | The `Example` objects holding both the predictions and the correct gold-standard annotations. ~~Iterable[Example]~~ | +| `attr` | The attribute to score. ~~str~~ | +| _keyword-only_ | | +| `getter` | Defaults to `getattr`. If provided, `getter(doc, attr)` should return the `Span` objects for an individual `Doc`. 
~~Callable[[Doc, str], Iterable[Span]]~~ | +| **RETURNS** | A dictionary containing the PRF scores under the keys `{attr}_p`, `{attr}_r`, `{attr}_f` and the per-type PRF scores under `{attr}_per_type`. ~~Dict[str, Union[float, Dict[str, float]]]~~ | + +## Scorer.score_deps {#score_deps tag="staticmethod" new="3"} + +Calculate the UAS, LAS, and LAS per type scores for dependency parses. + +> #### Example +> +> ```python +> def dep_getter(token, attr): +> dep = getattr(token, attr) +> dep = token.vocab.strings.as_string(dep).lower() +> return dep +> +> scores = Scorer.score_deps( +> examples, +> "dep", +> getter=dep_getter, +> ignore_labels=("p", "punct") +> ) +> print(scores["dep_uas"], scores["dep_las"]) +> ``` + +| Name | Description | +| --------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `examples` | The `Example` objects holding both the predictions and the correct gold-standard annotations. ~~Iterable[Example]~~ | +| `attr` | The attribute to score. ~~str~~ | +| _keyword-only_ | | +| `getter` | Defaults to `getattr`. If provided, `getter(token, attr)` should return the value of the attribute for an individual `Token`. ~~Callable[[Token, str], Any]~~ | +| `head_attr` | The attribute containing the head token. ~~str~~ | +| `head_getter` | Defaults to `getattr`. If provided, `head_getter(token, attr)` should return the head for an individual `Token`. ~~Callable[[Doc, str], Token]~~ | +| `ignore_labels` | Labels to ignore while scoring (e.g. `"punct"`). ~~Iterable[str]~~ | +| **RETURNS** | A dictionary containing the scores: `{attr}_uas`, `{attr}_las`, and `{attr}_las_per_type`. ~~Dict[str, Union[float, Dict[str, float]]]~~ | + +## Scorer.score_cats {#score_cats tag="staticmethod" new="3"} + +Calculate PRF and ROC AUC scores for a doc-level attribute that is a dict +containing scores for each label like `Doc.cats`. The reported overall score +depends on the scorer settings: + +1. **all:** `{attr}_score` (one of `{attr}_f` / `{attr}_macro_f` / + `{attr}_macro_auc`), `{attr}_score_desc` (text description of the overall + score), `{attr}_f_per_type`, `{attr}_auc_per_type` +2. **binary exclusive with positive label:** `{attr}_p`, `{attr}_r`, `{attr}_f` +3. **3+ exclusive classes**, macro-averaged F-score: `{attr}_macro_f`; +4. **multilabel**, macro-averaged AUC: `{attr}_macro_auc` + +> #### Example +> +> ```python +> labels = ["LABEL_A", "LABEL_B", "LABEL_C"] +> scores = Scorer.score_cats( +> examples, +> "cats", +> labels=labels +> ) +> print(scores["cats_macro_auc"]) +> ``` + +| Name | Description | +| ---------------- | -------------------------------------------------------------------------------------------------------------------------------------------------- | +| `examples` | The `Example` objects holding both the predictions and the correct gold-standard annotations. ~~Iterable[Example]~~ | +| `attr` | The attribute to score. ~~str~~ | +| _keyword-only_ | | +| `getter` | Defaults to `getattr`. If provided, `getter(doc, attr)` should return the cats for an individual `Doc`. ~~Callable[[Doc, str], Dict[str, float]]~~ | +| labels | The set of possible labels. Defaults to `[]`. ~~Iterable[str]~~ | +| `multi_label` | Whether the attribute allows multiple labels. Defaults to `True`. ~~bool~~ | +| `positive_label` | The positive label for a binary task with exclusive classes. Defaults to `None`. 
~~Optional[str]~~ | +| **RETURNS** | A dictionary containing the scores, with inapplicable scores as `None`. ~~Dict[str, Optional[float]]~~ | + +## Scorer.score_links {#score_links tag="staticmethod" new="3"} + +Returns PRF for predicted links on the entity level. To disentangle the +performance of the NEL from the NER, this method only evaluates NEL links for +entities that overlap between the gold reference and the predictions. + +> #### Example +> +> ```python +> scores = Scorer.score_links( +> examples, +> negative_labels=["NIL", ""] +> ) +> print(scores["nel_micro_f"]) +> ``` + +| Name | Description | +| ----------------- | ------------------------------------------------------------------------------------------------------------------- | +| `examples` | The `Example` objects holding both the predictions and the correct gold-standard annotations. ~~Iterable[Example]~~ | +| _keyword-only_ | | +| `negative_labels` | The string values that refer to no annotation (e.g. "NIL"). ~~Iterable[str]~~ | +| **RETURNS** | A dictionary containing the scores. ~~Dict[str, Optional[float]]~~ | diff --git a/website/docs/api/sentencerecognizer.md b/website/docs/api/sentencerecognizer.md new file mode 100644 index 000000000..fced37fd3 --- /dev/null +++ b/website/docs/api/sentencerecognizer.md @@ -0,0 +1,376 @@ +--- +title: SentenceRecognizer +tag: class +source: spacy/pipeline/senter.pyx +new: 3 +teaser: 'Pipeline component for sentence segmentation' +api_base_class: /api/tagger +api_string_name: senter +api_trainable: true +--- + +A trainable pipeline component for sentence segmentation. For a simpler, +rule-based strategy, see the [`Sentencizer`](/api/sentencizer). + +## Config and implementation {#config} + +The default config is defined by the pipeline component factory and describes +how the component should be configured. You can override its settings via the +`config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your +[`config.cfg` for training](/usage/training#config). See the +[model architectures](/api/architectures) documentation for details on the +architectures and their arguments and hyperparameters. + +> #### Example +> +> ```python +> from spacy.pipeline.senter import DEFAULT_SENTER_MODEL +> config = {"model": DEFAULT_SENTER_MODEL,} +> nlp.add_pipe("senter", config=config) +> ``` + +| Setting | Description | +| ------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. Defaults to [Tagger](/api/architectures#Tagger). ~~Model[List[Doc], List[Floats2d]]~~ | + +```python +%%GITHUB_SPACY/spacy/pipeline/senter.pyx +``` + +## SentenceRecognizer.\_\_init\_\_ {#init tag="method"} + +Initialize the sentence recognizer. + +> #### Example +> +> ```python +> # Construction via add_pipe with default model +> senter = nlp.add_pipe("senter") +> +> # Construction via create_pipe with custom model +> config = {"model": {"@architectures": "my_senter"}} +> senter = nlp.add_pipe("senter", config=config) +> +> # Construction from class +> from spacy.pipeline import SentenceRecognizer +> senter = SentenceRecognizer(nlp.vocab, model) +> ``` + +Create a new pipeline instance. In your application, you would normally use a +shortcut for this and instantiate the component using its string name and +[`nlp.add_pipe`](/api/language#add_pipe). 
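Once the component is constructed (directly or via `nlp.add_pipe`), it still needs to be trained before it produces useful sentence boundaries. The following is a rough, self-contained sketch that builds a blank pipeline and runs a few updates on a single hand-annotated example; the `sent_starts` annotations and the tiny manual loop are purely illustrative assumptions, not a recommended training setup.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
senter = nlp.add_pipe("senter")

# A single hand-annotated example: "sent_starts" marks the first token of
# each sentence (True = sentence start).
text = "This is a sentence. This is another sentence."
words = ["This", "is", "a", "sentence", ".", "This", "is", "another", "sentence", "."]
sent_starts = [True, False, False, False, False, True, False, False, False, False]
example = Example.from_dict(nlp.make_doc(text), {"words": words, "sent_starts": sent_starts})

# Initialize the model from the example data and run a few updates.
optimizer = nlp.initialize(get_examples=lambda: [example])
losses = {}
for _ in range(20):
    senter.update([example], sgd=optimizer, losses=losses)
```

In practice you would train the component via the [training config](/usage/training#config) rather than a manual loop like this. The constructor arguments are listed below.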
+ +| Name | Description | +| ------- | -------------------------------------------------------------------------------------------------------------------- | +| `vocab` | The shared vocabulary. ~~Vocab~~ | +| `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ | +| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ | + +## SentenceRecognizer.\_\_call\_\_ {#call tag="method"} + +Apply the pipe to one document. The document is modified in place, and returned. +This usually happens under the hood when the `nlp` object is called on a text +and all pipeline components are applied to the `Doc` in order. Both +[`__call__`](/api/sentencerecognizer#call) and +[`pipe`](/api/sentencerecognizer#pipe) delegate to the +[`predict`](/api/sentencerecognizer#predict) and +[`set_annotations`](/api/sentencerecognizer#set_annotations) methods. + +> #### Example +> +> ```python +> doc = nlp("This is a sentence.") +> senter = nlp.add_pipe("senter") +> # This usually happens under the hood +> processed = senter(doc) +> ``` + +| Name | Description | +| ----------- | -------------------------------- | +| `doc` | The document to process. ~~Doc~~ | +| **RETURNS** | The processed document. ~~Doc~~ | + +## SentenceRecognizer.pipe {#pipe tag="method"} + +Apply the pipe to a stream of documents. This usually happens under the hood +when the `nlp` object is called on a text and all pipeline components are +applied to the `Doc` in order. Both [`__call__`](/api/sentencerecognizer#call) +and [`pipe`](/api/sentencerecognizer#pipe) delegate to the +[`predict`](/api/sentencerecognizer#predict) and +[`set_annotations`](/api/sentencerecognizer#set_annotations) methods. + +> #### Example +> +> ```python +> senter = nlp.add_pipe("senter") +> for doc in senter.pipe(docs, batch_size=50): +> pass +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------- | +| `stream` | A stream of documents. ~~Iterable[Doc]~~ | +| _keyword-only_ | | +| `batch_size` | The number of documents to buffer. Defaults to `128`. ~~int~~ | +| **YIELDS** | The processed documents in order. ~~Doc~~ | + +## SentenceRecognizer.initialize {#initialize tag="method"} + +Initialize the component for training. `get_examples` should be a function that +returns an iterable of [`Example`](/api/example) objects. The data examples are +used to **initialize the model** of the component and can either be the full +training data or a representative sample. Initialization includes validating the +network, +[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and +setting up the label scheme based on the data. This method is typically called +by [`Language.initialize`](/api/language#initialize). + +> #### Example +> +> ```python +> senter = nlp.add_pipe("senter") +> senter.initialize(lambda: [], nlp=nlp) +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------- | +| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ | +| _keyword-only_ | | +| `nlp` | The current `nlp` object. Defaults to `None`. 
~~Optional[Language]~~ | + +## SentenceRecognizer.predict {#predict tag="method"} + +Apply the component's model to a batch of [`Doc`](/api/doc) objects, without +modifying them. + +> #### Example +> +> ```python +> senter = nlp.add_pipe("senter") +> scores = senter.predict([doc1, doc2]) +> ``` + +| Name | Description | +| ----------- | ------------------------------------------- | +| `docs` | The documents to predict. ~~Iterable[Doc]~~ | +| **RETURNS** | The model's prediction for each document. | + +## SentenceRecognizer.set_annotations {#set_annotations tag="method"} + +Modify a batch of [`Doc`](/api/doc) objects, using pre-computed scores. + +> #### Example +> +> ```python +> senter = nlp.add_pipe("senter") +> scores = senter.predict([doc1, doc2]) +> senter.set_annotations([doc1, doc2], scores) +> ``` + +| Name | Description | +| -------- | ------------------------------------------------------------ | +| `docs` | The documents to modify. ~~Iterable[Doc]~~ | +| `scores` | The scores to set, produced by `SentenceRecognizer.predict`. | + +## SentenceRecognizer.update {#update tag="method"} + +Learn from a batch of [`Example`](/api/example) objects containing the +predictions and gold-standard annotations, and update the component's model. +Delegates to [`predict`](/api/sentencerecognizer#predict) and +[`get_loss`](/api/sentencerecognizer#get_loss). + +> #### Example +> +> ```python +> senter = nlp.add_pipe("senter") +> optimizer = nlp.initialize() +> losses = senter.update(examples, sgd=optimizer) +> ``` + +| Name | Description | +| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- | +| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ | +| _keyword-only_ | | +| `drop` | The dropout rate. ~~float~~ | +| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ | +| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ | +| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ | +| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ | + +## SentenceRecognizer.rehearse {#rehearse tag="method,experimental" new="3"} + +Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the +current model to make predictions similar to an initial model to try to address +the "catastrophic forgetting" problem. This feature is experimental. + +> #### Example +> +> ```python +> senter = nlp.add_pipe("senter") +> optimizer = nlp.resume_training() +> losses = senter.rehearse(examples, sgd=optimizer) +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------ | +| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ | +| _keyword-only_ | | +| `drop` | The dropout rate. ~~float~~ | +| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ | +| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ | +| **RETURNS** | The updated `losses` dictionary. 
~~Dict[str, float]~~ | + +## SentenceRecognizer.get_loss {#get_loss tag="method"} + +Find the loss and gradient of loss for the batch of documents and their +predicted scores. + +> #### Example +> +> ```python +> senter = nlp.add_pipe("senter") +> scores = senter.predict([eg.predicted for eg in examples]) +> loss, d_loss = senter.get_loss(examples, scores) +> ``` + +| Name | Description | +| ----------- | --------------------------------------------------------------------------- | +| `examples` | The batch of examples. ~~Iterable[Example]~~ | +| `scores` | Scores representing the model's predictions. | +| **RETURNS** | The loss and the gradient, i.e. `(loss, gradient)`. ~~Tuple[float, float]~~ | + +## SentenceRecognizer.score {#score tag="method" new="3"} + +Score a batch of examples. + +> #### Example +> +> ```python +> scores = senter.score(examples) +> ``` + +| Name | Description | +| ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `examples` | The examples to score. ~~Iterable[Example]~~ | +| **RETURNS** | The scores, produced by [`Scorer.score_token_attr`](/api/scorer#score_token_attr) for the attributes `"pos"`, `"tag"` and `"lemma"`. ~~Dict[str, float]~~ | + +## SentenceRecognizer.create_optimizer {#create_optimizer tag="method"} + +Create an optimizer for the pipeline component. + +> #### Example +> +> ```python +> senter = nlp.add_pipe("senter") +> optimizer = senter.create_optimizer() +> ``` + +| Name | Description | +| ----------- | ---------------------------- | +| **RETURNS** | The optimizer. ~~Optimizer~~ | + +## SentenceRecognizer.use_params {#use_params tag="method, contextmanager"} + +Modify the pipe's model, to use the given parameter values. At the end of the +context, the original parameters are restored. + +> #### Example +> +> ```python +> senter = nlp.add_pipe("senter") +> with senter.use_params(optimizer.averages): +> senter.to_disk("/best_model") +> ``` + +| Name | Description | +| -------- | -------------------------------------------------- | +| `params` | The parameter values to use in the model. ~~dict~~ | + +## SentenceRecognizer.to_disk {#to_disk tag="method"} + +Serialize the pipe to disk. + +> #### Example +> +> ```python +> senter = nlp.add_pipe("senter") +> senter.to_disk("/path/to/senter") +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | +| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | + +## SentenceRecognizer.from_disk {#from_disk tag="method"} + +Load the pipe from disk. Modifies the object in place and returns it. + +> #### Example +> +> ```python +> senter = nlp.add_pipe("senter") +> senter.from_disk("/path/to/senter") +> ``` + +| Name | Description | +| -------------- | ----------------------------------------------------------------------------------------------- | +| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. 
~~Iterable[str]~~ | +| **RETURNS** | The modified `SentenceRecognizer` object. ~~SentenceRecognizer~~ | + +## SentenceRecognizer.to_bytes {#to_bytes tag="method"} + +> #### Example +> +> ```python +> senter = nlp.add_pipe("senter") +> senter_bytes = senter.to_bytes() +> ``` + +Serialize the pipe to a bytestring. + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------- | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The serialized form of the `SentenceRecognizer` object. ~~bytes~~ | + +## SentenceRecognizer.from_bytes {#from_bytes tag="method"} + +Load the pipe from a bytestring. Modifies the object in place and returns it. + +> #### Example +> +> ```python +> senter_bytes = senter.to_bytes() +> senter = nlp.add_pipe("senter") +> senter.from_bytes(senter_bytes) +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------- | +| `bytes_data` | The data to load from. ~~bytes~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The `SentenceRecognizer` object. ~~SentenceRecognizer~~ | + +## Serialization fields {#serialization-fields} + +During serialization, spaCy will export several data fields used to restore +different aspects of the object. If needed, you can exclude them from +serialization by passing in the string names via the `exclude` argument. + +> #### Example +> +> ```python +> data = senter.to_disk("/path", exclude=["vocab"]) +> ``` + +| Name | Description | +| ------- | -------------------------------------------------------------- | +| `vocab` | The shared [`Vocab`](/api/vocab). | +| `cfg` | The config file. You usually don't want to exclude this. | +| `model` | The binary model data. You usually don't want to exclude this. | diff --git a/website/docs/api/sentencizer.md b/website/docs/api/sentencizer.md index 5a1ea162a..2cd49127d 100644 --- a/website/docs/api/sentencizer.md +++ b/website/docs/api/sentencizer.md @@ -1,30 +1,40 @@ --- title: Sentencizer tag: class -source: spacy/pipeline/pipes.pyx +source: spacy/pipeline/sentencizer.pyx +teaser: 'Pipeline component for rule-based sentence boundary detection' +api_string_name: sentencizer +api_trainable: false --- -A simple pipeline component, to allow custom sentence boundary detection logic +A simple pipeline component to allow custom sentence boundary detection logic that doesn't require the dependency parse. By default, sentence segmentation is performed by the [`DependencyParser`](/api/dependencyparser), so the `Sentencizer` lets you implement a simpler, rule-based strategy that doesn't -require a statistical model to be loaded. The component is also available via -the string name `"sentencizer"`. After initialization, it is typically added to -the processing pipeline using [`nlp.add_pipe`](/api/language#add_pipe). +require a statistical model to be loaded. - +## Config and implementation {#config} -Compared to the previous `SentenceSegmenter` class, the `Sentencizer` component -doesn't add a hook to `doc.user_hooks["sents"]`. Instead, it iterates over the -tokens in the `Doc` and sets the `Token.is_sent_start` property. 
The -`SentenceSegmenter` is still available if you import it directly: +The default config is defined by the pipeline component factory and describes +how the component should be configured. You can override its settings via the +`config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your +[`config.cfg` for training](/usage/training#config). + +> #### Example +> +> ```python +> config = {"punct_chars": None} +> nlp.add_pipe("entity_ruler", config=config) +> ``` + +| Setting | Description | +| ------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `punct_chars` | Optional custom list of punctuation characters that mark sentence ends. See below for defaults if not set. Defaults to `None`. ~~Optional[List[str]]~~ | `None` | ```python -from spacy.pipeline import SentenceSegmenter +%%GITHUB_SPACY/spacy/pipeline/sentencizer.pyx ``` - - ## Sentencizer.\_\_init\_\_ {#init tag="method"} Initialize the sentencizer. @@ -32,18 +42,32 @@ Initialize the sentencizer. > #### Example > > ```python -> # Construction via create_pipe -> sentencizer = nlp.create_pipe("sentencizer") +> # Construction via add_pipe +> sentencizer = nlp.add_pipe("sentencizer") > > # Construction from class > from spacy.pipeline import Sentencizer > sentencizer = Sentencizer() > ``` -| Name | Type | Description | -| ------------- | ------------- | ------------------------------------------------------------------------------------------------------ | -| `punct_chars` | list | Optional custom list of punctuation characters that mark sentence ends. Defaults to `['!', '.', '?', '։', '؟', '۔', '܀', '܁', '܂', '߹', '।', '॥', '၊', '။', '።', '፧', '፨', '᙮', '᜵', '᜶', '᠃', '᠉', '᥄', '᥅', '᪨', '᪩', '᪪', '᪫', '᭚', '᭛', '᭞', '᭟', '᰻', '᰼', '᱾', '᱿', '‼', '‽', '⁇', '⁈', '⁉', '⸮', '⸼', '꓿', '꘎', '꘏', '꛳', '꛷', '꡶', '꡷', '꣎', '꣏', '꤯', '꧈', '꧉', '꩝', '꩞', '꩟', '꫰', '꫱', '꯫', '﹒', '﹖', '﹗', '!', '.', '?', '𐩖', '𐩗', '𑁇', '𑁈', '𑂾', '𑂿', '𑃀', '𑃁', '𑅁', '𑅂', '𑅃', '𑇅', '𑇆', '𑇍', '𑇞', '𑇟', '𑈸', '𑈹', '𑈻', '𑈼', '𑊩', '𑑋', '𑑌', '𑗂', '𑗃', '𑗉', '𑗊', '𑗋', '𑗌', '𑗍', '𑗎', '𑗏', '𑗐', '𑗑', '𑗒', '𑗓', '𑗔', '𑗕', '𑗖', '𑗗', '𑙁', '𑙂', '𑜼', '𑜽', '𑜾', '𑩂', '𑩃', '𑪛', '𑪜', '𑱁', '𑱂', '𖩮', '𖩯', '𖫵', '𖬷', '𖬸', '𖭄', '𛲟', '𝪈', '。', '。']`. | -| **RETURNS** | `Sentencizer` | The newly constructed object. | +| Name | Description | +| -------------- | ----------------------------------------------------------------------------------------------------------------------- | +| _keyword-only_ | | +| `punct_chars` | Optional custom list of punctuation characters that mark sentence ends. See below for defaults. 
~~Optional[List[str]]~~ | + +```python +### punct_chars defaults +['!', '.', '?', '։', '؟', '۔', '܀', '܁', '܂', '߹', '।', '॥', '၊', '။', '።', + '፧', '፨', '᙮', '᜵', '᜶', '᠃', '᠉', '᥄', '᥅', '᪨', '᪩', '᪪', '᪫', + '᭚', '᭛', '᭞', '᭟', '᰻', '᰼', '᱾', '᱿', '‼', '‽', '⁇', '⁈', '⁉', + '⸮', '⸼', '꓿', '꘎', '꘏', '꛳', '꛷', '꡶', '꡷', '꣎', '꣏', '꤯', '꧈', + '꧉', '꩝', '꩞', '꩟', '꫰', '꫱', '꯫', '﹒', '﹖', '﹗', '!', '.', '?', + '𐩖', '𐩗', '𑁇', '𑁈', '𑂾', '𑂿', '𑃀', '𑃁', '𑅁', '𑅂', '𑅃', '𑇅', + '𑇆', '𑇍', '𑇞', '𑇟', '𑈸', '𑈹', '𑈻', '𑈼', '𑊩', '𑑋', '𑑌', '𑗂', + '𑗃', '𑗉', '𑗊', '𑗋', '𑗌', '𑗍', '𑗎', '𑗏', '𑗐', '𑗑', '𑗒', '𑗓', + '𑗔', '𑗕', '𑗖', '𑗗', '𑙁', '𑙂', '𑜼', '𑜽', '𑜾', '𑩂', '𑩃', '𑪛', + '𑪜', '𑱁', '𑱂', '𖩮', '𖩯', '𖫵', '𖬷', '𖬸', '𖭄', '𛲟', '𝪈', '。', '。'] +``` ## Sentencizer.\_\_call\_\_ {#call tag="method"} @@ -57,33 +81,69 @@ the component has been added to the pipeline using > from spacy.lang.en import English > > nlp = English() -> sentencizer = nlp.create_pipe("sentencizer") -> nlp.add_pipe(sentencizer) +> nlp.add_pipe("sentencizer") > doc = nlp("This is a sentence. This is another sentence.") > assert len(list(doc.sents)) == 2 > ``` -| Name | Type | Description | -| ----------- | ----- | ------------------------------------------------------------ | -| `doc` | `Doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. | -| **RETURNS** | `Doc` | The modified `Doc` with added sentence boundaries. | +| Name | Description | +| ----------- | -------------------------------------------------------------------- | +| `doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. ~~Doc~~ | +| **RETURNS** | The modified `Doc` with added sentence boundaries. ~~Doc~~ | -## Sentencizer.to_disk {#to_disk tag="method"} +## Sentencizer.pipe {#pipe tag="method"} -Save the sentencizer settings (punctuation characters) a directory. Will create -a file `sentencizer.json`. This also happens automatically when you save an -`nlp` object with a sentencizer added to its pipeline. +Apply the pipe to a stream of documents. This usually happens under the hood +when the `nlp` object is called on a text and all pipeline components are +applied to the `Doc` in order. > #### Example > > ```python -> sentencizer = Sentencizer(punct_chars=[".", "?", "!", "。"]) -> sentencizer.to_disk("/path/to/sentencizer.jsonl") +> sentencizer = nlp.add_pipe("sentencizer") +> for doc in sentencizer.pipe(docs, batch_size=50): +> pass > ``` -| Name | Type | Description | -| ------ | ---------------- | ---------------------------------------------------------------------------------------------------------------- | -| `path` | unicode / `Path` | A path to a file, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | +| Name | Description | +| -------------- | ------------------------------------------------------------- | +| `stream` | A stream of documents. ~~Iterable[Doc]~~ | +| _keyword-only_ | | +| `batch_size` | The number of documents to buffer. Defaults to `128`. ~~int~~ | +| **YIELDS** | The processed documents in order. ~~Doc~~ | + +## Sentencizer.score {#score tag="method" new="3"} + +Score a batch of examples. + +> #### Example +> +> ```python +> scores = sentencizer.score(examples) +> ``` + +| Name | Description | +| ----------- | --------------------------------------------------------------------------------------------------------------------- | +| `examples` | The examples to score. ~~Iterable[Example]~~ | +| **RETURNS** | The scores, produced by [`Scorer.score_spans`](/api/scorer#score_spans). 
~~Dict[str, Union[float, Dict[str, float]]~~ | + +## Sentencizer.to_disk {#to_disk tag="method"} + +Save the sentencizer settings (punctuation characters) to a directory. Will +create a file `sentencizer.json`. This also happens automatically when you save +an `nlp` object with a sentencizer added to its pipeline. + +> #### Example +> +> ```python +> config = {"punct_chars": [".", "?", "!", "。"]} +> sentencizer = nlp.add_pipe("sentencizer", config=config) +> sentencizer.to_disk("/path/to/sentencizer.json") +> ``` + +| Name | Description | +| ------ | ------------------------------------------------------------------------------------------------------------------------------------------ | +| `path` | A path to a JSON file, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | ## Sentencizer.from_disk {#from_disk tag="method"} @@ -94,14 +154,14 @@ added to its pipeline. > #### Example > > ```python -> sentencizer = Sentencizer() +> sentencizer = nlp.add_pipe("sentencizer") > sentencizer.from_disk("/path/to/sentencizer.json") > ``` -| Name | Type | Description | -| ----------- | ---------------- | -------------------------------------------------------------------------- | -| `path` | unicode / `Path` | A path to a JSON file. Paths may be either strings or `Path`-like objects. | -| **RETURNS** | `Sentencizer` | The modified `Sentencizer` object. | +| Name | Description | +| ----------- | ----------------------------------------------------------------------------------------------- | +| `path` | A path to a JSON file. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| **RETURNS** | The modified `Sentencizer` object. ~~Sentencizer~~ | ## Sentencizer.to_bytes {#to_bytes tag="method"} @@ -110,13 +170,14 @@ Serialize the sentencizer settings to a bytestring. > #### Example > > ```python -> sentencizer = Sentencizer(punct_chars=[".", "?", "!", "。"]) +> config = {"punct_chars": [".", "?", "!", "。"]} +> sentencizer = nlp.add_pipe("sentencizer", config=config) > sentencizer_bytes = sentencizer.to_bytes() > ``` -| Name | Type | Description | -| ----------- | ----- | -------------------- | -| **RETURNS** | bytes | The serialized data. | +| Name | Description | +| ----------- | ------------------------------ | +| **RETURNS** | The serialized data. ~~bytes~~ | ## Sentencizer.from_bytes {#from_bytes tag="method"} @@ -126,11 +187,11 @@ Load the pipe from a bytestring. Modifies the object in place and returns it. > > ```python > sentencizer_bytes = sentencizer.to_bytes() -> sentencizer = Sentencizer() +> sentencizer = nlp.add_pipe("sentencizer") > sentencizer.from_bytes(sentencizer_bytes) > ``` -| Name | Type | Description | -| ------------ | ------------- | ---------------------------------- | -| `bytes_data` | bytes | The bytestring to load. | -| **RETURNS** | `Sentencizer` | The modified `Sentencizer` object. | +| Name | Description | +| ------------ | -------------------------------------------------- | +| `bytes_data` | The bytestring to load. ~~bytes~~ | +| **RETURNS** | The modified `Sentencizer` object. ~~Sentencizer~~ | diff --git a/website/docs/api/span.md b/website/docs/api/span.md index 3833bbca9..7fa1aaa38 100644 --- a/website/docs/api/span.md +++ b/website/docs/api/span.md @@ -8,7 +8,7 @@ A slice from a [`Doc`](/api/doc) object. ## Span.\_\_init\_\_ {#init tag="method"} -Create a Span object from the slice `doc[start : end]`. +Create a `Span` object from the slice `doc[start : end]`. 
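Besides slicing an existing `Doc` (as in the example below), a `Span` can also be constructed directly. This is handy when assigning entities by hand, as in the following small sketch, where the token offsets are chosen for this particular sentence:

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("I like New York in Autumn.")
# Tokens 2-3 cover "New York"; attach a label and register the span as an
# entity on the doc.
span = Span(doc, 2, 4, label="GPE")
doc.ents = [span]
assert doc.ents[0].text == "New York"
assert doc.ents[0].label_ == "GPE"
```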
> #### Example > @@ -18,15 +18,14 @@ Create a Span object from the slice `doc[start : end]`. > assert [t.text for t in span] == ["it", "back", "!"] > ``` -| Name | Type | Description | -| ----------- | ---------------------------------------- | ----------------------------------------------------------------------------------------------------------------- | -| `doc` | `Doc` | The parent document. | -| `start` | int | The index of the first token of the span. | -| `end` | int | The index of the first token after the span. | -| `label` | int / unicode | A label to attach to the span, e.g. for named entities. As of v2.1, the label can also be a unicode string. | -| `kb_id` | int / unicode | A knowledge base ID to attach to the span, e.g. for named entities. The ID can be an integer or a unicode string. | -| `vector` | `numpy.ndarray[ndim=1, dtype='float32']` | A meaning representation of the span. | -| **RETURNS** | `Span` | The newly constructed object. | +| Name | Description | +| -------- | --------------------------------------------------------------------------------------- | +| `doc` | The parent document. ~~Doc~~ | +| `start` | The index of the first token of the span. ~~int~~ | +| `end` | The index of the first token after the span. ~~int~~ | +| `label` | A label to attach to the span, e.g. for named entities. ~~Union[str, int]~~ | +| `kb_id` | A knowledge base ID to attach to the span, e.g. for named entities. ~~Union[str, int]~~ | +| `vector` | A meaning representation of the span. ~~numpy.ndarray[ndim=1, dtype=float32]~~ | ## Span.\_\_getitem\_\_ {#getitem tag="method"} @@ -40,10 +39,10 @@ Get a `Token` object. > assert span[1].text == "back" > ``` -| Name | Type | Description | -| ----------- | ------- | --------------------------------------- | -| `i` | int | The index of the token within the span. | -| **RETURNS** | `Token` | The token at `span[i]`. | +| Name | Description | +| ----------- | ----------------------------------------------- | +| `i` | The index of the token within the span. ~~int~~ | +| **RETURNS** | The token at `span[i]`. ~~Token~~ | Get a `Span` object. @@ -55,10 +54,10 @@ Get a `Span` object. > assert span[1:3].text == "back!" > ``` -| Name | Type | Description | -| ----------- | ------ | -------------------------------- | -| `start_end` | tuple | The slice of the span to get. | -| **RETURNS** | `Span` | The span at `span[start : end]`. | +| Name | Description | +| ----------- | ------------------------------------------------- | +| `start_end` | The slice of the span to get. ~~Tuple[int, int]~~ | +| **RETURNS** | The span at `span[start : end]`. ~~Span~~ | ## Span.\_\_iter\_\_ {#iter tag="method"} @@ -72,9 +71,9 @@ Iterate over `Token` objects. > assert [t.text for t in span] == ["it", "back", "!"] > ``` -| Name | Type | Description | -| ---------- | ------- | ----------------- | -| **YIELDS** | `Token` | A `Token` object. | +| Name | Description | +| ---------- | --------------------------- | +| **YIELDS** | A `Token` object. ~~Token~~ | ## Span.\_\_len\_\_ {#len tag="method"} @@ -88,9 +87,9 @@ Get the number of tokens in the span. > assert len(span) == 3 > ``` -| Name | Type | Description | -| ----------- | ---- | --------------------------------- | -| **RETURNS** | int | The number of tokens in the span. | +| Name | Description | +| ----------- | ----------------------------------------- | +| **RETURNS** | The number of tokens in the span. 
~~int~~ | ## Span.set_extension {#set_extension tag="classmethod" new="2"} @@ -108,14 +107,14 @@ For details, see the documentation on > assert doc[1:4]._.has_city > ``` -| Name | Type | Description | -| --------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------- | -| `name` | unicode | Name of the attribute to set by the extension. For example, `'my_attr'` will be available as `span._.my_attr`. | -| `default` | - | Optional default value of the attribute if no getter or method is defined. | -| `method` | callable | Set a custom method on the object, for example `span._.compare(other_span)`. | -| `getter` | callable | Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute. | -| `setter` | callable | Setter function that takes the `Span` and a value, and modifies the object. Is called when the user writes to the `Span._` attribute. | -| `force` | bool | Force overwriting existing attribute. | +| Name | Description | +| --------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | Name of the attribute to set by the extension. For example, `"my_attr"` will be available as `span._.my_attr`. ~~str~~ | +| `default` | Optional default value of the attribute if no getter or method is defined. ~~Optional[Any]~~ | +| `method` | Set a custom method on the object, for example `span._.compare(other_span)`. ~~Optional[Callable[[Span, ...], Any]]~~ | +| `getter` | Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute. ~~Optional[Callable[[Span], Any]]~~ | +| `setter` | Setter function that takes the `Span` and a value, and modifies the object. Is called when the user writes to the `Span._` attribute. ~~Optional[Callable[[Span, Any], None]]~~ | +| `force` | Force overwriting existing attribute. ~~bool~~ | ## Span.get_extension {#get_extension tag="classmethod" new="2"} @@ -132,10 +131,10 @@ Look up a previously registered extension by name. Returns a 4-tuple > assert extension == (False, None, None, None) > ``` -| Name | Type | Description | -| ----------- | ------- | ------------------------------------------------------------- | -| `name` | unicode | Name of the extension. | -| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the extension. | +| Name | Description | +| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | Name of the extension. ~~str~~ | +| **RETURNS** | A `(default, method, getter, setter)` tuple of the extension. ~~Tuple[Optional[Any], Optional[Callable], Optional[Callable], Optional[Callable]]~~ | ## Span.has_extension {#has_extension tag="classmethod" new="2"} @@ -149,10 +148,10 @@ Check whether an extension has been registered on the `Span` class. > assert Span.has_extension("is_city") > ``` -| Name | Type | Description | -| ----------- | ------- | ------------------------------------------ | -| `name` | unicode | Name of the extension to check. | -| **RETURNS** | bool | Whether the extension has been registered. | +| Name | Description | +| ----------- | --------------------------------------------------- | +| `name` | Name of the extension to check. 
~~str~~ | +| **RETURNS** | Whether the extension has been registered. ~~bool~~ | ## Span.remove_extension {#remove_extension tag="classmethod" new="2.0.12"} @@ -167,10 +166,10 @@ Remove a previously registered extension. > assert not Span.has_extension("is_city") > ``` -| Name | Type | Description | -| ----------- | ------- | --------------------------------------------------------------------- | -| `name` | unicode | Name of the extension. | -| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the removed extension. | +| Name | Description | +| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | Name of the extension. ~~str~~ | +| **RETURNS** | A `(default, method, getter, setter)` tuple of the removed extension. ~~Tuple[Optional[Any], Optional[Callable], Optional[Callable], Optional[Callable]]~~ | ## Span.char_span {#char_span tag="method" new="2.2.4"} @@ -185,14 +184,14 @@ the character indices don't map to a valid span. > assert span.text == "New York" > ``` -| Name | Type | Description | -| ----------- | ---------------------------------------- | --------------------------------------------------------------------- | -| `start` | int | The index of the first character of the span. | -| `end` | int | The index of the last character after the span. | -| `label` | uint64 / unicode | A label to attach to the span, e.g. for named entities. | -| `kb_id` | uint64 / unicode | An ID from a knowledge base to capture the meaning of a named entity. | -| `vector` | `numpy.ndarray[ndim=1, dtype='float32']` | A meaning representation of the span. | -| **RETURNS** | `Span` | The newly constructed object or `None`. | +| Name | Description | +| ------------------------------------ | ----------------------------------------------------------------------------------------- | +| `start` | The index of the first character of the span. ~~int~~ | +| `end` | The index of the last character after the span. ~~int~~ | +| `label` | A label to attach to the span, e.g. for named entities. ~~Union[int, str]~~ | +| `kb_id` 2.2 | An ID from a knowledge base to capture the meaning of a named entity. ~~Union[int, str]~~ | +| `vector` | A meaning representation of the span. ~~numpy.ndarray[ndim=1, dtype=float32]~~ | +| **RETURNS** | The newly constructed object or `None`. ~~Optional[Span]~~ | ## Span.similarity {#similarity tag="method" model="vectors"} @@ -210,10 +209,10 @@ using an average of word vectors. > assert apples_oranges == oranges_apples > ``` -| Name | Type | Description | -| ----------- | ----- | -------------------------------------------------------------------------------------------- | -| `other` | - | The object to compare with. By default, accepts `Doc`, `Span`, `Token` and `Lexeme` objects. | -| **RETURNS** | float | A scalar similarity score. Higher is more similar. | +| Name | Description | +| ----------- | -------------------------------------------------------------------------------------------------------------------------------- | +| `other` | The object to compare with. By default, accepts `Doc`, `Span`, `Token` and `Lexeme` objects. ~~Union[Doc, Span, Token, Lexeme]~~ | +| **RETURNS** | A scalar similarity score. Higher is more similar. ~~float~~ | ## Span.get_lca_matrix {#get_lca_matrix tag="method"} @@ -230,9 +229,9 @@ ancestor is found, e.g. if span excludes a necessary ancestor. 
> # array([[0, 0, 0], [0, 1, 2], [0, 2, 2]], dtype=int32) > ``` -| Name | Type | Description | -| ----------- | -------------------------------------- | ------------------------------------------------ | -| **RETURNS** | `numpy.ndarray[ndim=2, dtype='int32']` | The lowest common ancestor matrix of the `Span`. | +| Name | Description | +| ----------- | --------------------------------------------------------------------------------------- | +| **RETURNS** | The lowest common ancestor matrix of the `Span`. ~~numpy.ndarray[ndim=2, dtype=int32]~~ | ## Span.to_array {#to_array tag="method" new="2"} @@ -250,37 +249,10 @@ shape `(N, M)`, where `N` is the length of the document. The values will be > np_array = span.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA]) > ``` -| Name | Type | Description | -| ----------- | ----------------------------- | -------------------------------------------------------------------------------------------------------- | -| `attr_ids` | list | A list of attribute ID ints. | -| **RETURNS** | `numpy.ndarray[long, ndim=2]` | A feature matrix, with one row per word, and one column per attribute indicated in the input `attr_ids`. | - -## Span.merge {#merge tag="method"} - - - -As of v2.1.0, `Span.merge` still works but is considered deprecated. You should -use the new and less error-prone [`Doc.retokenize`](/api/doc#retokenize) -instead. - - - -Retokenize the document, such that the span is merged into a single token. - -> #### Example -> -> ```python -> doc = nlp("I like New York in Autumn.") -> span = doc[2:4] -> span.merge() -> assert len(doc) == 6 -> assert doc[2].text == "New York" -> ``` - -| Name | Type | Description | -| -------------- | ------- | ------------------------------------------------------------------------------------------------------------------------- | -| `**attributes` | - | Attributes to assign to the merged token. By default, attributes are inherited from the syntactic root token of the span. | -| **RETURNS** | `Token` | The newly merged token. | +| Name | Description | +| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------- | +| `attr_ids` | A list of attributes (int IDs or string names) or a single attribute (int ID or string name). ~~Union[int, str, List[Union[int, str]]]~~ | +| **RETURNS** | The exported attributes as a numpy array. ~~Union[numpy.ndarray[ndim=2, dtype=uint64], numpy.ndarray[ndim=1, dtype=uint64]]~~ | ## Span.ents {#ents tag="property" new="2.0.13" model="ner"} @@ -298,9 +270,9 @@ if the entity recognizer has been applied. > assert ents[0].text == "Mr. Best" > ``` -| Name | Type | Description | -| ----------- | ----- | -------------------------------------------- | -| **RETURNS** | tuple | Entities in the span, one `Span` per entity. | +| Name | Description | +| ----------- | ----------------------------------------------------------------- | +| **RETURNS** | Entities in the span, one `Span` per entity. ~~Tuple[Span, ...]~~ | ## Span.as_doc {#as_doc tag="method"} @@ -315,10 +287,10 @@ Create a new `Doc` object corresponding to the `Span`, with a copy of the data. > assert doc2.text == "New York" > ``` -| Name | Type | Description | -| ---------------- | ----- | ---------------------------------------------------- | -| `copy_user_data` | bool | Whether or not to copy the original doc's user data. | -| **RETURNS** | `Doc` | A `Doc` object of the `Span`'s content. 
| +| Name | Description | +| ---------------- | ------------------------------------------------------------- | +| `copy_user_data` | Whether or not to copy the original doc's user data. ~~bool~~ | +| **RETURNS** | A `Doc` object of the `Span`'s content. ~~Doc~~ | ## Span.root {#root tag="property" model="parser"} @@ -337,9 +309,9 @@ taken. > assert new_york.root.text == "York" > ``` -| Name | Type | Description | -| ----------- | ------- | --------------- | -| **RETURNS** | `Token` | The root token. | +| Name | Description | +| ----------- | ------------------------- | +| **RETURNS** | The root token. ~~Token~~ | ## Span.conjuncts {#conjuncts tag="property" model="parser"} @@ -353,9 +325,9 @@ A tuple of tokens coordinated to `span.root`. > assert [t.text for t in apples_conjuncts] == ["oranges"] > ``` -| Name | Type | Description | -| ----------- | ------- | ----------------------- | -| **RETURNS** | `tuple` | The coordinated tokens. | +| Name | Description | +| ----------- | --------------------------------------------- | +| **RETURNS** | The coordinated tokens. ~~Tuple[Token, ...]~~ | ## Span.lefts {#lefts tag="property" model="parser"} @@ -369,9 +341,9 @@ Tokens that are to the left of the span, whose heads are within the span. > assert lefts == ["New"] > ``` -| Name | Type | Description | -| ---------- | ------- | ------------------------------------ | -| **YIELDS** | `Token` | A left-child of a token of the span. | +| Name | Description | +| ---------- | ---------------------------------------------- | +| **YIELDS** | A left-child of a token of the span. ~~Token~~ | ## Span.rights {#rights tag="property" model="parser"} @@ -385,9 +357,9 @@ Tokens that are to the right of the span, whose heads are within the span. > assert rights == ["in"] > ``` -| Name | Type | Description | -| ---------- | ------- | ------------------------------------- | -| **YIELDS** | `Token` | A right-child of a token of the span. | +| Name | Description | +| ---------- | ----------------------------------------------- | +| **YIELDS** | A right-child of a token of the span. ~~Token~~ | ## Span.n_lefts {#n_lefts tag="property" model="parser"} @@ -401,9 +373,9 @@ the span. > assert doc[3:7].n_lefts == 1 > ``` -| Name | Type | Description | -| ----------- | ---- | -------------------------------- | -| **RETURNS** | int | The number of left-child tokens. | +| Name | Description | +| ----------- | ---------------------------------------- | +| **RETURNS** | The number of left-child tokens. ~~int~~ | ## Span.n_rights {#n_rights tag="property" model="parser"} @@ -417,9 +389,9 @@ the span. > assert doc[2:4].n_rights == 1 > ``` -| Name | Type | Description | -| ----------- | ---- | --------------------------------- | -| **RETURNS** | int | The number of right-child tokens. | +| Name | Description | +| ----------- | ----------------------------------------- | +| **RETURNS** | The number of right-child tokens. ~~int~~ | ## Span.subtree {#subtree tag="property" model="parser"} @@ -433,9 +405,9 @@ Tokens within the span and tokens which descend from them. > assert subtree == ["Give", "it", "back", "!"] > ``` -| Name | Type | Description | -| ---------- | ------- | ------------------------------------------------- | -| **YIELDS** | `Token` | A token within the span, or a descendant from it. | +| Name | Description | +| ---------- | ----------------------------------------------------------- | +| **YIELDS** | A token within the span, or a descendant from it. 
~~Token~~ | ## Span.has_vector {#has_vector tag="property" model="vectors"} @@ -448,9 +420,9 @@ A boolean value indicating whether a word vector is associated with the object. > assert doc[1:].has_vector > ``` -| Name | Type | Description | -| ----------- | ---- | -------------------------------------------- | -| **RETURNS** | bool | Whether the span has a vector data attached. | +| Name | Description | +| ----------- | ----------------------------------------------------- | +| **RETURNS** | Whether the span has a vector data attached. ~~bool~~ | ## Span.vector {#vector tag="property" model="vectors"} @@ -465,9 +437,9 @@ vectors. > assert doc[1:].vector.shape == (300,) > ``` -| Name | Type | Description | -| ----------- | ---------------------------------------- | --------------------------------------------------- | -| **RETURNS** | `numpy.ndarray[ndim=1, dtype='float32']` | A 1D numpy array representing the span's semantics. | +| Name | Description | +| ----------- | ----------------------------------------------------------------------------------------------- | +| **RETURNS** | A 1-dimensional array representing the span's vector. ~~`numpy.ndarray[ndim=1, dtype=float32]~~ | ## Span.vector_norm {#vector_norm tag="property" model="vectors"} @@ -482,31 +454,31 @@ The L2 norm of the span's vector representation. > assert doc[1:].vector_norm != doc[2:].vector_norm > ``` -| Name | Type | Description | -| ----------- | ----- | ----------------------------------------- | -| **RETURNS** | float | The L2 norm of the vector representation. | +| Name | Description | +| ----------- | --------------------------------------------------- | +| **RETURNS** | The L2 norm of the vector representation. ~~float~~ | ## Attributes {#attributes} -| Name | Type | Description | -| --------------------------------------- | ------------ | -------------------------------------------------------------------------------------------------------------- | -| `doc` | `Doc` | The parent document. | -| `tensor` 2.1.7 | `ndarray` | The span's slice of the parent `Doc`'s tensor. | -| `sent` | `Span` | The sentence span that this span is a part of. | -| `start` | int | The token offset for the start of the span. | -| `end` | int | The token offset for the end of the span. | -| `start_char` | int | The character offset for the start of the span. | -| `end_char` | int | The character offset for the end of the span. | -| `text` | unicode | A unicode representation of the span text. | -| `text_with_ws` | unicode | The text content of the span with a trailing whitespace character if the last token has one. | -| `orth` | int | ID of the verbatim text content. | -| `orth_` | unicode | Verbatim text content (identical to `Span.text`). Exists mostly for consistency with the other attributes. | -| `label` | int | The hash value of the span's label. | -| `label_` | unicode | The span's label. | -| `lemma_` | unicode | The span's lemma. | -| `kb_id` | int | The hash value of the knowledge base ID referred to by the span. | -| `kb_id_` | unicode | The knowledge base ID referred to by the span. | -| `ent_id` | int | The hash value of the named entity the token is an instance of. | -| `ent_id_` | unicode | The string ID of the named entity the token is an instance of. | -| `sentiment` | float | A scalar value indicating the positivity or negativity of the span. | -| `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). 
| +| Name | Description | +| --------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------- | +| `doc` | The parent document. ~~Doc~~ | +| `tensor` 2.1.7 | The span's slice of the parent `Doc`'s tensor. ~~numpy.ndarray~~ | +| `sent` | The sentence span that this span is a part of. ~~Span~~ | +| `start` | The token offset for the start of the span. ~~int~~ | +| `end` | The token offset for the end of the span. ~~int~~ | +| `start_char` | The character offset for the start of the span. ~~int~~ | +| `end_char` | The character offset for the end of the span. ~~int~~ | +| `text` | A string representation of the span text. ~~str~~ | +| `text_with_ws` | The text content of the span with a trailing whitespace character if the last token has one. ~~str~~ | +| `orth` | ID of the verbatim text content. ~~int~~ | +| `orth_` | Verbatim text content (identical to `Span.text`). Exists mostly for consistency with the other attributes. ~~str~~ | +| `label` | The hash value of the span's label. ~~int~~ | +| `label_` | The span's label. ~~str~~ | +| `lemma_` | The span's lemma. Equivalent to `"".join(token.text_with_ws for token in span)`. ~~str~~ | +| `kb_id` | The hash value of the knowledge base ID referred to by the span. ~~int~~ | +| `kb_id_` | The knowledge base ID referred to by the span. ~~str~~ | +| `ent_id` | The hash value of the named entity the token is an instance of. ~~int~~ | +| `ent_id_` | The string ID of the named entity the token is an instance of. ~~str~~ | +| `sentiment` | A scalar value indicating the positivity or negativity of the span. ~~float~~ | +| `_` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). ~~Underscore~~ | diff --git a/website/docs/api/stringstore.md b/website/docs/api/stringstore.md index 268f19125..d5f78dbab 100644 --- a/website/docs/api/stringstore.md +++ b/website/docs/api/stringstore.md @@ -19,10 +19,9 @@ Create the `StringStore`. > stringstore = StringStore(["apple", "orange"]) > ``` -| Name | Type | Description | -| ----------- | ------------- | -------------------------------------------------- | -| `strings` | iterable | A sequence of unicode strings to add to the store. | -| **RETURNS** | `StringStore` | The newly constructed object. | +| Name | Description | +| --------- | ---------------------------------------------------------------------- | +| `strings` | A sequence of strings to add to the store. ~~Optional[Iterable[str]]~~ | ## StringStore.\_\_len\_\_ {#len tag="method"} @@ -35,9 +34,9 @@ Get the number of strings in the store. > assert len(stringstore) == 2 > ``` -| Name | Type | Description | -| ----------- | ---- | ----------------------------------- | -| **RETURNS** | int | The number of strings in the store. | +| Name | Description | +| ----------- | ------------------------------------------- | +| **RETURNS** | The number of strings in the store. ~~int~~ | ## StringStore.\_\_getitem\_\_ {#getitem tag="method"} @@ -52,10 +51,10 @@ Retrieve a string from a given hash, or vice versa. > assert stringstore[apple_hash] == "apple" > ``` -| Name | Type | Description | -| -------------- | ------------------------ | -------------------------- | -| `string_or_id` | bytes, unicode or uint64 | The value to encode. | -| **RETURNS** | unicode or int | The value to be retrieved. 
| +| Name | Description | +| -------------- | ----------------------------------------------- | +| `string_or_id` | The value to encode. ~~Union[bytes, str, int]~~ | +| **RETURNS** | The value to be retrieved. ~~Union[str, int]~~ | ## StringStore.\_\_contains\_\_ {#contains tag="method"} @@ -69,15 +68,15 @@ Check whether a string is in the store. > assert not "cherry" in stringstore > ``` -| Name | Type | Description | -| ----------- | ------- | -------------------------------------- | -| `string` | unicode | The string to check. | -| **RETURNS** | bool | Whether the store contains the string. | +| Name | Description | +| ----------- | ----------------------------------------------- | +| `string` | The string to check. ~~str~~ | +| **RETURNS** | Whether the store contains the string. ~~bool~~ | ## StringStore.\_\_iter\_\_ {#iter tag="method"} Iterate over the strings in the store, in order. Note that a newly initialized -store will always include an empty string `''` at position `0`. +store will always include an empty string `""` at position `0`. > #### Example > @@ -87,9 +86,9 @@ store will always include an empty string `''` at position `0`. > assert all_strings == ["apple", "orange"] > ``` -| Name | Type | Description | -| ---------- | ------- | ---------------------- | -| **YIELDS** | unicode | A string in the store. | +| Name | Description | +| ---------- | ------------------------------ | +| **YIELDS** | A string in the store. ~~str~~ | ## StringStore.add {#add tag="method" new="2"} @@ -106,10 +105,10 @@ Add a string to the `StringStore`. > assert stringstore["banana"] == banana_hash > ``` -| Name | Type | Description | -| ----------- | ------- | ------------------------ | -| `string` | unicode | The string to add. | -| **RETURNS** | uint64 | The string's hash value. | +| Name | Description | +| ----------- | -------------------------------- | +| `string` | The string to add. ~~str~~ | +| **RETURNS** | The string's hash value. ~~int~~ | ## StringStore.to_disk {#to_disk tag="method" new="2"} @@ -121,9 +120,9 @@ Save the current state to a directory. > stringstore.to_disk("/path/to/strings") > ``` -| Name | Type | Description | -| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- | -| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | +| Name | Description | +| ------ | ------------------------------------------------------------------------------------------------------------------------------------------ | +| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | ## StringStore.from_disk {#from_disk tag="method" new="2"} @@ -136,10 +135,10 @@ Loads state from a directory. Modifies the object in place and returns it. > stringstore = StringStore().from_disk("/path/to/strings") > ``` -| Name | Type | Description | -| ----------- | ---------------- | -------------------------------------------------------------------------- | -| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. | -| **RETURNS** | `StringStore` | The modified `StringStore` object. | +| Name | Description | +| ----------- | ----------------------------------------------------------------------------------------------- | +| `path` | A path to a directory. 
Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| **RETURNS** | The modified `StringStore` object. ~~StringStore~~ | ## StringStore.to_bytes {#to_bytes tag="method"} @@ -151,9 +150,9 @@ Serialize the current state to a binary string. > store_bytes = stringstore.to_bytes() > ``` -| Name | Type | Description | -| ----------- | ----- | ------------------------------------------------ | -| **RETURNS** | bytes | The serialized form of the `StringStore` object. | +| Name | Description | +| ----------- | ---------------------------------------------------------- | +| **RETURNS** | The serialized form of the `StringStore` object. ~~bytes~~ | ## StringStore.from_bytes {#from_bytes tag="method"} @@ -167,10 +166,10 @@ Load state from a binary string. > new_store = StringStore().from_bytes(store_bytes) > ``` -| Name | Type | Description | -| ------------ | ------------- | ------------------------- | -| `bytes_data` | bytes | The data to load from. | -| **RETURNS** | `StringStore` | The `StringStore` object. | +| Name | Description | +| ------------ | ----------------------------------------- | +| `bytes_data` | The data to load from. ~~bytes~~ | +| **RETURNS** | The `StringStore` object. ~~StringStore~~ | ## Utilities {#util} @@ -185,7 +184,7 @@ Get a 64-bit hash for a given string. > assert hash_string("apple") == 8566208034543834098 > ``` -| Name | Type | Description | -| ----------- | ------- | ------------------- | -| `string` | unicode | The string to hash. | -| **RETURNS** | uint64 | The hash. | +| Name | Description | +| ----------- | --------------------------- | +| `string` | The string to hash. ~~str~~ | +| **RETURNS** | The hash. ~~int~~ | diff --git a/website/docs/api/tagger.md b/website/docs/api/tagger.md index bd3382f89..2123004b6 100644 --- a/website/docs/api/tagger.md +++ b/website/docs/api/tagger.md @@ -1,48 +1,70 @@ --- title: Tagger tag: class -source: spacy/pipeline/pipes.pyx +source: spacy/pipeline/tagger.pyx +teaser: 'Pipeline component for part-of-speech tagging' +api_base_class: /api/pipe +api_string_name: tagger +api_trainable: true --- -This class is a subclass of `Pipe` and follows the same API. The pipeline -component is available in the [processing pipeline](/usage/processing-pipelines) -via the ID `"tagger"`. +## Config and implementation {#config} -## Tagger.Model {#model tag="classmethod"} - -Initialize a model for the pipe. The model should implement the -`thinc.neural.Model` API. Wrappers are under development for most major machine -learning libraries. - -| Name | Type | Description | -| ----------- | ------ | ------------------------------------- | -| `**kwargs` | - | Parameters for initializing the model | -| **RETURNS** | object | The initialized model. | - -## Tagger.\_\_init\_\_ {#init tag="method"} - -Create a new pipeline instance. In your application, you would normally use a -shortcut for this and instantiate the component using its string name and -[`nlp.create_pipe`](/api/language#create_pipe). +The default config is defined by the pipeline component factory and describes +how the component should be configured. You can override its settings via the +`config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your +[`config.cfg` for training](/usage/training#config). See the +[model architectures](/api/architectures) documentation for details on the +architectures and their arguments and hyperparameters. 
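To see which settings the factory fills in, a quick sketch like the following prints the resolved component config from a blank pipeline (the exact defaults may differ between versions):

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("tagger")
# The resolved settings (factory defaults merged with any overrides) are
# stored on the nlp object's config under the component's name.
print(nlp.config["components"]["tagger"])
```

The same block is what you override via the `config` argument in the example below, or in the `[components.tagger]` section of your training config.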
> #### Example > > ```python -> # Construction via create_pipe -> tagger = nlp.create_pipe("tagger") +> from spacy.pipeline.tagger import DEFAULT_TAGGER_MODEL +> config = { +> "set_morphology": False, +> "model": DEFAULT_TAGGER_MODEL, +> } +> nlp.add_pipe("tagger", config=config) +> ``` + +| Setting | Description | +| ---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `set_morphology` | Whether to set morphological features. Defaults to `False`. ~~bool~~ | +| `model` | A model instance that predicts the tag probabilities. The output vectors should match the number of tags in size, and be normalized as probabilities (all scores between 0 and 1, with the rows summing to `1`). Defaults to [Tagger](/api/architectures#Tagger). ~~Model[List[Doc], List[Floats2d]]~~ | + +```python +%%GITHUB_SPACY/spacy/pipeline/tagger.pyx +``` + +## Tagger.\_\_init\_\_ {#init tag="method"} + +> #### Example +> +> ```python +> # Construction via add_pipe with default model +> tagger = nlp.add_pipe("tagger") +> +> # Construction via create_pipe with custom model +> config = {"model": {"@architectures": "my_tagger"}} +> tagger = nlp.add_pipe("tagger", config=config) > > # Construction from class > from spacy.pipeline import Tagger -> tagger = Tagger(nlp.vocab) -> tagger.from_disk("/path/to/model") +> tagger = Tagger(nlp.vocab, model) > ``` -| Name | Type | Description | -| ----------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- | -| `vocab` | `Vocab` | The shared vocabulary. | -| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. | -| `**cfg` | - | Configuration parameters. | -| **RETURNS** | `Tagger` | The newly constructed object. | +Create a new pipeline instance. In your application, you would normally use a +shortcut for this and instantiate the component using its string name and +[`nlp.add_pipe`](/api/language#add_pipe). + +| Name | Description | +| ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `vocab` | The shared vocabulary. ~~Vocab~~ | +| `model` | A model instance that predicts the tag probabilities. The output vectors should match the number of tags in size, and be normalized as probabilities (all scores between 0 and 1, with the rows summing to `1`). ~~Model[List[Doc], List[Floats2d]]~~ | +| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ | +| _keyword-only_ | | +| `set_morphology` | Whether to set morphological features. ~~bool~~ | ## Tagger.\_\_call\_\_ {#call tag="method"} @@ -56,16 +78,16 @@ and all pipeline components are applied to the `Doc` in order. 
Both > #### Example > > ```python -> tagger = Tagger(nlp.vocab) > doc = nlp("This is a sentence.") +> tagger = nlp.add_pipe("tagger") > # This usually happens under the hood > processed = tagger(doc) > ``` -| Name | Type | Description | -| ----------- | ----- | ------------------------ | -| `doc` | `Doc` | The document to process. | -| **RETURNS** | `Doc` | The processed document. | +| Name | Description | +| ----------- | -------------------------------- | +| `doc` | The document to process. ~~Doc~~ | +| **RETURNS** | The processed document. ~~Doc~~ | ## Tagger.pipe {#pipe tag="method"} @@ -78,73 +100,142 @@ applied to the `Doc` in order. Both [`__call__`](/api/tagger#call) and > #### Example > > ```python -> tagger = Tagger(nlp.vocab) +> tagger = nlp.add_pipe("tagger") > for doc in tagger.pipe(docs, batch_size=50): > pass > ``` -| Name | Type | Description | -| ------------ | -------- | ------------------------------------------------------ | -| `stream` | iterable | A stream of documents. | -| `batch_size` | int | The number of texts to buffer. Defaults to `128`. | -| **YIELDS** | `Doc` | Processed documents in the order of the original text. | +| Name | Description | +| -------------- | ------------------------------------------------------------- | +| `stream` | A stream of documents. ~~Iterable[Doc]~~ | +| _keyword-only_ | | +| `batch_size` | The number of documents to buffer. Defaults to `128`. ~~int~~ | +| **YIELDS** | The processed documents in order. ~~Doc~~ | + +## Tagger.initialize {#initialize tag="method" new="3"} + +Initialize the component for training. `get_examples` should be a function that +returns an iterable of [`Example`](/api/example) objects. The data examples are +used to **initialize the model** of the component and can either be the full +training data or a representative sample. Initialization includes validating the +network, +[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and +setting up the label scheme based on the data. This method is typically called +by [`Language.initialize`](/api/language#initialize) and lets you customize +arguments it receives via the +[`[initialize.components]`](/api/data-formats#config-initialize) block in the +config. + + + +This method was previously called `begin_training`. + + + +> #### Example +> +> ```python +> tagger = nlp.add_pipe("tagger") +> tagger.initialize(lambda: [], nlp=nlp) +> ``` +> +> ```ini +> ### config.cfg +> [initialize.components.tagger] +> +> [initialize.components.tagger.labels] +> @readers = "spacy.read_labels.v1" +> path = "corpus/labels/tagger.json +> ``` + +| Name | Description | +| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ | +| _keyword-only_ | | +| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ | +| `labels` | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. 
To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Iterable[str]]~~ | ## Tagger.predict {#predict tag="method"} -Apply the pipeline's model to a batch of docs, without modifying them. +Apply the component's model to a batch of [`Doc`](/api/doc) objects, without +modifying them. > #### Example > > ```python -> tagger = Tagger(nlp.vocab) -> scores, tensors = tagger.predict([doc1, doc2]) +> tagger = nlp.add_pipe("tagger") +> scores = tagger.predict([doc1, doc2]) > ``` -| Name | Type | Description | -| ----------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `docs` | iterable | The documents to predict. | -| **RETURNS** | tuple | A `(scores, tensors)` tuple where `scores` is the model's prediction for each document and `tensors` is the token representations used to predict the scores. Each tensor is an array with one row for each token in the document. | +| Name | Description | +| ----------- | ------------------------------------------- | +| `docs` | The documents to predict. ~~Iterable[Doc]~~ | +| **RETURNS** | The model's prediction for each document. | ## Tagger.set_annotations {#set_annotations tag="method"} -Modify a batch of documents, using pre-computed scores. +Modify a batch of [`Doc`](/api/doc) objects, using pre-computed scores. > #### Example > > ```python -> tagger = Tagger(nlp.vocab) -> scores, tensors = tagger.predict([doc1, doc2]) -> tagger.set_annotations([doc1, doc2], scores, tensors) +> tagger = nlp.add_pipe("tagger") +> scores = tagger.predict([doc1, doc2]) +> tagger.set_annotations([doc1, doc2], scores) > ``` -| Name | Type | Description | -| --------- | -------- | ----------------------------------------------------- | -| `docs` | iterable | The documents to modify. | -| `scores` | - | The scores to set, produced by `Tagger.predict`. | -| `tensors` | iterable | The token representations used to predict the scores. | +| Name | Description | +| -------- | ------------------------------------------------ | +| `docs` | The documents to modify. ~~Iterable[Doc]~~ | +| `scores` | The scores to set, produced by `Tagger.predict`. | ## Tagger.update {#update tag="method"} -Learn from a batch of documents and gold-standard information, updating the -pipe's model. Delegates to [`predict`](/api/tagger#predict) and +Learn from a batch of [`Example`](/api/example) objects containing the +predictions and gold-standard annotations, and update the component's model. +Delegates to [`predict`](/api/tagger#predict) and [`get_loss`](/api/tagger#get_loss). > #### Example > > ```python -> tagger = Tagger(nlp.vocab) -> losses = {} -> optimizer = nlp.begin_training() -> tagger.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer) +> tagger = nlp.add_pipe("tagger") +> optimizer = nlp.initialize() +> losses = tagger.update(examples, sgd=optimizer) > ``` -| Name | Type | Description | -| -------- | -------- | -------------------------------------------------------------------------------------------- | -| `docs` | iterable | A batch of documents to learn from. | -| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. | -| `drop` | float | The dropout rate. 
| -| `sgd` | callable | The optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. | -| `losses` | dict | Optional record of the loss during training. The value keyed by the model's name is updated. | +| Name | Description | +| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- | +| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ | +| _keyword-only_ | | +| `drop` | The dropout rate. ~~float~~ | +| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ | +| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ | +| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ | +| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ | + +## Tagger.rehearse {#rehearse tag="method,experimental" new="3"} + +Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the +current model to make predictions similar to an initial model, to try to address +the "catastrophic forgetting" problem. This feature is experimental. + +> #### Example +> +> ```python +> tagger = nlp.add_pipe("tagger") +> optimizer = nlp.resume_training() +> losses = tagger.rehearse(examples, sgd=optimizer) +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------ | +| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ | +| _keyword-only_ | | +| `drop` | The dropout rate. ~~float~~ | +| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ | +| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ | +| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ | ## Tagger.get_loss {#get_loss tag="method"} @@ -154,37 +245,31 @@ predicted scores. > #### Example > > ```python -> tagger = Tagger(nlp.vocab) -> scores = tagger.predict([doc1, doc2]) -> loss, d_loss = tagger.get_loss([doc1, doc2], [gold1, gold2], scores) +> tagger = nlp.add_pipe("tagger") +> scores = tagger.predict([eg.predicted for eg in examples]) +> loss, d_loss = tagger.get_loss(examples, scores) > ``` -| Name | Type | Description | -| ----------- | -------- | ------------------------------------------------------------ | -| `docs` | iterable | The batch of documents. | -| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. | -| `scores` | - | Scores representing the model's predictions. | -| **RETURNS** | tuple | The loss and the gradient, i.e. `(loss, gradient)`. | +| Name | Description | +| ----------- | --------------------------------------------------------------------------- | +| `examples` | The batch of examples. ~~Iterable[Example]~~ | +| `scores` | Scores representing the model's predictions. | +| **RETURNS** | The loss and the gradient, i.e. `(loss, gradient)`. ~~Tuple[float, float]~~ | -## Tagger.begin_training {#begin_training tag="method"} +## Tagger.score {#score tag="method" new="3"} -Initialize the pipe for training, using data examples if available. 
If no model -has been initialized yet, the model is added. +Score a batch of examples. > #### Example > > ```python -> tagger = Tagger(nlp.vocab) -> nlp.pipeline.append(tagger) -> optimizer = tagger.begin_training(pipeline=nlp.pipeline) +> scores = tagger.score(examples) > ``` -| Name | Type | Description | -| ------------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `gold_tuples` | iterable | Optional gold-standard annotations from which to construct [`GoldParse`](/api/goldparse) objects. | -| `pipeline` | list | Optional list of pipeline components that this component is part of. | -| `sgd` | callable | An optional optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. Will be created via [`Tagger`](/api/tagger#create_optimizer) if not set. | -| **RETURNS** | callable | An optimizer. | +| Name | Description | +| ----------- | --------------------------------------------------------------------------------------------------------------------------------- | +| `examples` | The examples to score. ~~Iterable[Example]~~ | +| **RETURNS** | The scores, produced by [`Scorer.score_token_attr`](/api/scorer#score_token_attr) for the attribute `"tag"`. ~~Dict[str, float]~~ | ## Tagger.create_optimizer {#create_optimizer tag="method"} @@ -193,46 +278,52 @@ Create an optimizer for the pipeline component. > #### Example > > ```python -> tagger = Tagger(nlp.vocab) +> tagger = nlp.add_pipe("tagger") > optimizer = tagger.create_optimizer() > ``` -| Name | Type | Description | -| ----------- | -------- | -------------- | -| **RETURNS** | callable | The optimizer. | +| Name | Description | +| ----------- | ---------------------------- | +| **RETURNS** | The optimizer. ~~Optimizer~~ | ## Tagger.use_params {#use_params tag="method, contextmanager"} -Modify the pipe's model, to use the given parameter values. +Modify the pipe's model, to use the given parameter values. At the end of the +context, the original parameters are restored. > #### Example > > ```python -> tagger = Tagger(nlp.vocab) -> with tagger.use_params(): +> tagger = nlp.add_pipe("tagger") +> with tagger.use_params(optimizer.averages): > tagger.to_disk("/best_model") > ``` -| Name | Type | Description | -| -------- | ---- | ---------------------------------------------------------------------------------------------------------- | -| `params` | - | The parameter values to use in the model. At the end of the context, the original parameters are restored. | +| Name | Description | +| -------- | -------------------------------------------------- | +| `params` | The parameter values to use in the model. ~~dict~~ | ## Tagger.add_label {#add_label tag="method"} -Add a new label to the pipe. +Add a new label to the pipe. Raises an error if the output dimension is already +set, or if the model has already been fully [initialized](#initialize). Note +that you don't have to call this method if you provide a **representative data +sample** to the [`initialize`](#initialize) method. In this case, all labels +found in the sample will be automatically added to the model, and the output +dimension will be [inferred](/usage/layers-architectures#thinc-shape-inference) +automatically. 
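+
+As a rough sketch of that workflow (assuming a blank English pipeline and the default
+tagger model), you could skip `add_label` entirely and let [`initialize`](#initialize)
+read the labels off a small, hypothetical sample of [`Example`](/api/example) objects:
+
+```python
+import spacy
+from spacy.training import Example
+
+nlp = spacy.blank("en")
+tagger = nlp.add_pipe("tagger")
+# A tiny sample with one gold tag per token
+doc = nlp.make_doc("I like cats")
+examples = [Example.from_dict(doc, {"tags": ["PRP", "VBP", "NNS"]})]
+tagger.initialize(lambda: examples, nlp=nlp)
+# The labels were inferred from the sample
+assert set(tagger.labels) == {"PRP", "VBP", "NNS"}
+```
+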
> #### Example > > ```python -> from spacy.symbols import POS -> tagger = Tagger(nlp.vocab) -> tagger.add_label("MY_LABEL", {POS: 'NOUN'}) +> tagger = nlp.add_pipe("tagger") +> tagger.add_label("MY_LABEL") > ``` -| Name | Type | Description | -| -------- | ------- | --------------------------------------------------------------- | -| `label` | unicode | The label to add. | -| `values` | dict | Optional values to map to the label, e.g. a tag map dictionary. | +| Name | Description | +| ----------- | ----------------------------------------------------------- | +| `label` | The label to add. ~~str~~ | +| **RETURNS** | `0` if the label is already present, otherwise `1`. ~~int~~ | ## Tagger.to_disk {#to_disk tag="method"} @@ -241,14 +332,15 @@ Serialize the pipe to disk. > #### Example > > ```python -> tagger = Tagger(nlp.vocab) +> tagger = nlp.add_pipe("tagger") > tagger.to_disk("/path/to/tagger") > ``` -| Name | Type | Description | -| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- | -| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | -| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | +| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | ## Tagger.from_disk {#from_disk tag="method"} @@ -257,31 +349,33 @@ Load the pipe from disk. Modifies the object in place and returns it. > #### Example > > ```python -> tagger = Tagger(nlp.vocab) +> tagger = nlp.add_pipe("tagger") > tagger.from_disk("/path/to/tagger") > ``` -| Name | Type | Description | -| ----------- | ---------------- | -------------------------------------------------------------------------- | -| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. | -| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | -| **RETURNS** | `Tagger` | The modified `Tagger` object. | +| Name | Description | +| -------------- | ----------------------------------------------------------------------------------------------- | +| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The modified `Tagger` object. ~~Tagger~~ | ## Tagger.to_bytes {#to_bytes tag="method"} > #### Example > > ```python -> tagger = Tagger(nlp.vocab) +> tagger = nlp.add_pipe("tagger") > tagger_bytes = tagger.to_bytes() > ``` Serialize the pipe to a bytestring. -| Name | Type | Description | -| ----------- | ----- | ------------------------------------------------------------------------- | -| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | -| **RETURNS** | bytes | The serialized form of the `Tagger` object. 
| +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------- | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The serialized form of the `Tagger` object. ~~bytes~~ | ## Tagger.from_bytes {#from_bytes tag="method"} @@ -291,21 +385,20 @@ Load the pipe from a bytestring. Modifies the object in place and returns it. > > ```python > tagger_bytes = tagger.to_bytes() -> tagger = Tagger(nlp.vocab) +> tagger = nlp.add_pipe("tagger") > tagger.from_bytes(tagger_bytes) > ``` -| Name | Type | Description | -| ------------ | -------- | ------------------------------------------------------------------------- | -| `bytes_data` | bytes | The data to load from. | -| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | -| **RETURNS** | `Tagger` | The `Tagger` object. | +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------- | +| `bytes_data` | The data to load from. ~~bytes~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The `Tagger` object. ~~Tagger~~ | ## Tagger.labels {#labels tag="property"} -The labels currently added to the component. Note that even for a blank -component, this will always include the built-in coarse-grained part-of-speech -tags by default, e.g. `VERB`, `NOUN` and so on. +The labels currently added to the component. > #### Example > @@ -314,9 +407,27 @@ tags by default, e.g. `VERB`, `NOUN` and so on. > assert "MY_LABEL" in tagger.labels > ``` -| Name | Type | Description | -| ----------- | ----- | ---------------------------------- | -| **RETURNS** | tuple | The labels added to the component. | +| Name | Description | +| ----------- | ------------------------------------------------------ | +| **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ | + +## Tagger.label_data {#label_data tag="property" new="3"} + +The labels currently added to the component and their internal meta information. +This is the data generated by [`init labels`](/api/cli#init-labels) and used by +[`Tagger.initialize`](/api/tagger#initialize) to initialize the model with a +pre-defined label set. + +> #### Example +> +> ```python +> labels = tagger.label_data +> tagger.initialize(lambda: [], nlp=nlp, labels=labels) +> ``` + +| Name | Description | +| ----------- | ---------------------------------------------------------- | +| **RETURNS** | The label data added to the component. ~~Tuple[str, ...]~~ | ## Serialization fields {#serialization-fields} @@ -330,9 +441,8 @@ serialization by passing in the string names via the `exclude` argument. > data = tagger.to_disk("/path", exclude=["vocab"]) > ``` -| Name | Description | -| --------- | ------------------------------------------------------------------------------------------ | -| `vocab` | The shared [`Vocab`](/api/vocab). | -| `cfg` | The config file. You usually don't want to exclude this. | -| `model` | The binary model data. You usually don't want to exclude this. | -| `tag_map` | The [tag map](/usage/adding-languages#tag-map) mapping fine-grained to coarse-grained tag. | +| Name | Description | +| ------- | -------------------------------------------------------------- | +| `vocab` | The shared [`Vocab`](/api/vocab). | +| `cfg` | The config file. 
You usually don't want to exclude this. | +| `model` | The binary model data. You usually don't want to exclude this. | diff --git a/website/docs/api/textcategorizer.md b/website/docs/api/textcategorizer.md index 1a0280265..447765e15 100644 --- a/website/docs/api/textcategorizer.md +++ b/website/docs/api/textcategorizer.md @@ -1,66 +1,77 @@ --- title: TextCategorizer tag: class -source: spacy/pipeline/pipes.pyx +source: spacy/pipeline/textcat.py new: 2 +teaser: 'Pipeline component for text classification' +api_base_class: /api/pipe +api_string_name: textcat +api_trainable: true --- -This class is a subclass of `Pipe` and follows the same API. The pipeline -component is available in the [processing pipeline](/usage/processing-pipelines) -via the ID `"textcat"`. +The text categorizer predicts **categories over a whole document**. It can learn +one or more labels, and the labels can be mutually exclusive (i.e. one true +label per document) or non-mutually exclusive (i.e. zero or more labels may be +true per document). The multi-label setting is controlled by the model instance +that's provided. -## TextCategorizer.Model {#model tag="classmethod"} +## Config and implementation {#config} -Initialize a model for the pipe. The model should implement the -`thinc.neural.Model` API. Wrappers are under development for most major machine -learning libraries. - -| Name | Type | Description | -| ----------- | ------ | ------------------------------------- | -| `**kwargs` | - | Parameters for initializing the model | -| **RETURNS** | object | The initialized model. | - -## TextCategorizer.\_\_init\_\_ {#init tag="method"} - -Create a new pipeline instance. In your application, you would normally use a -shortcut for this and instantiate the component using its string name and -[`nlp.create_pipe`](/api/language#create_pipe). +The default config is defined by the pipeline component factory and describes +how the component should be configured. You can override its settings via the +`config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your +[`config.cfg` for training](/usage/training#config). See the +[model architectures](/api/architectures) documentation for details on the +architectures and their arguments and hyperparameters. > #### Example > > ```python -> # Construction via create_pipe -> textcat = nlp.create_pipe("textcat") -> textcat = nlp.create_pipe("textcat", config={"exclusive_classes": True}) +> from spacy.pipeline.textcat import DEFAULT_TEXTCAT_MODEL +> config = { +> "threshold": 0.5, +> "model": DEFAULT_TEXTCAT_MODEL, +> } +> nlp.add_pipe("textcat", config=config) +> ``` + +| Setting | Description | +| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ | +| `model` | A model instance that predicts scores for each category. Defaults to [TextCatEnsemble](/api/architectures#TextCatEnsemble). 
~~Model[List[Doc], List[Floats2d]]~~ | + +```python +%%GITHUB_SPACY/spacy/pipeline/textcat.py +``` + +## TextCategorizer.\_\_init\_\_ {#init tag="method"} + +> #### Example +> +> ```python +> # Construction via add_pipe with default model +> textcat = nlp.add_pipe("textcat") +> +> # Construction via add_pipe with custom model +> config = {"model": {"@architectures": "my_textcat"}} +> parser = nlp.add_pipe("textcat", config=config) > > # Construction from class > from spacy.pipeline import TextCategorizer -> textcat = TextCategorizer(nlp.vocab) -> textcat.from_disk("/path/to/model") +> textcat = TextCategorizer(nlp.vocab, model, threshold=0.5) > ``` -| Name | Type | Description | -| ------------------- | ----------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- | -| `vocab` | `Vocab` | The shared vocabulary. | -| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. | -| `exclusive_classes` | bool | Make categories mutually exclusive. Defaults to `False`. | -| `architecture` | unicode | Model architecture to use, see [architectures](#architectures) for details. Defaults to `"ensemble"`. | -| **RETURNS** | `TextCategorizer` | The newly constructed object. | +Create a new pipeline instance. In your application, you would normally use a +shortcut for this and instantiate the component using its string name and +[`nlp.add_pipe`](/api/language#create_pipe). -### Architectures {#architectures new="2.1"} - -Text classification models can be used to solve a wide variety of problems. -Differences in text length, number of labels, difficulty, and runtime -performance constraints mean that no single algorithm performs well on all types -of problems. To handle a wider variety of problems, the `TextCategorizer` object -allows configuration of its model architecture, using the `architecture` keyword -argument. - -| Name | Description | -| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `"ensemble"` | **Default:** Stacked ensemble of a bag-of-words model and a neural network model. The neural network uses a CNN with mean pooling and attention. The "ngram_size" and "attr" arguments can be used to configure the feature extraction for the bag-of-words model. | -| `"simple_cnn"` | A neural network model where token vectors are calculated using a CNN. The vectors are mean pooled and used as features in a feed-forward network. This architecture is usually less accurate than the ensemble, but runs faster. | -| `"bow"` | An ngram "bag-of-words" model. This architecture should run much faster than the others, but may not be as accurate, especially if texts are short. The features extracted can be controlled using the keyword arguments `ngram_size` and `attr`. For instance, `ngram_size=3` and `attr="lower"` would give lower-cased unigram, trigram and bigram features. 2, 3 or 4 are usually good choices of ngram size. 
| +| Name | Description | +| -------------- | -------------------------------------------------------------------------------------------------------------------------- | +| `vocab` | The shared vocabulary. ~~Vocab~~ | +| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ | +| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ | +| _keyword-only_ | | +| `threshold` | Cutoff to consider a prediction "positive", relevant when printing accuracy results. ~~float~~ | ## TextCategorizer.\_\_call\_\_ {#call tag="method"} @@ -74,16 +85,16 @@ delegate to the [`predict`](/api/textcategorizer#predict) and > #### Example > > ```python -> textcat = TextCategorizer(nlp.vocab) > doc = nlp("This is a sentence.") +> textcat = nlp.add_pipe("textcat") > # This usually happens under the hood > processed = textcat(doc) > ``` -| Name | Type | Description | -| ----------- | ----- | ------------------------ | -| `doc` | `Doc` | The document to process. | -| **RETURNS** | `Doc` | The processed document. | +| Name | Description | +| ----------- | -------------------------------- | +| `doc` | The document to process. ~~Doc~~ | +| **RETURNS** | The processed document. ~~Doc~~ | ## TextCategorizer.pipe {#pipe tag="method"} @@ -97,73 +108,144 @@ applied to the `Doc` in order. Both [`__call__`](/api/textcategorizer#call) and > #### Example > > ```python -> textcat = TextCategorizer(nlp.vocab) +> textcat = nlp.add_pipe("textcat") > for doc in textcat.pipe(docs, batch_size=50): > pass > ``` -| Name | Type | Description | -| ------------ | -------- | ------------------------------------------------------ | -| `stream` | iterable | A stream of documents. | -| `batch_size` | int | The number of texts to buffer. Defaults to `128`. | -| **YIELDS** | `Doc` | Processed documents in the order of the original text. | +| Name | Description | +| -------------- | ------------------------------------------------------------- | +| `stream` | A stream of documents. ~~Iterable[Doc]~~ | +| _keyword-only_ | | +| `batch_size` | The number of documents to buffer. Defaults to `128`. ~~int~~ | +| **YIELDS** | The processed documents in order. ~~Doc~~ | + +## TextCategorizer.initialize {#initialize tag="method" new="3"} + +Initialize the component for training. `get_examples` should be a function that +returns an iterable of [`Example`](/api/example) objects. The data examples are +used to **initialize the model** of the component and can either be the full +training data or a representative sample. Initialization includes validating the +network, +[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and +setting up the label scheme based on the data. This method is typically called +by [`Language.initialize`](/api/language#initialize) and lets you customize +arguments it receives via the +[`[initialize.components]`](/api/data-formats#config-initialize) block in the +config. + + + +This method was previously called `begin_training`. 
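+
+As a minimal sketch (assuming a blank English pipeline and the default textcat model),
+you could also call `initialize` directly with a small, hypothetical sample of
+[`Example`](/api/example) objects, passing the labels and the positive label for a
+binary task explicitly:
+
+```python
+import spacy
+from spacy.training import Example
+
+nlp = spacy.blank("en")
+textcat = nlp.add_pipe("textcat")
+# A tiny sample with gold category annotations
+doc = nlp.make_doc("This is great")
+examples = [Example.from_dict(doc, {"cats": {"POS": 1.0, "NEG": 0.0}})]
+textcat.initialize(lambda: examples, nlp=nlp, labels=["POS", "NEG"], positive_label="POS")
+# The provided labels were added to the component
+assert set(textcat.labels) == {"POS", "NEG"}
+```
+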
+ + + +> #### Example +> +> ```python +> textcat = nlp.add_pipe("textcat") +> textcat.initialize(lambda: [], nlp=nlp) +> ``` +> +> ```ini +> ### config.cfg +> [initialize.components.textcat] +> positive_label = "POS" +> +> [initialize.components.textcat.labels] +> @readers = "spacy.read_labels.v1" +> path = "corpus/labels/textcat.json +> ``` + +| Name | Description | +| ---------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ | +| _keyword-only_ | | +| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ | +| `labels` | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Iterable[str]]~~ | +| `positive_label` | The positive label for a binary task with exclusive classes, None otherwise and by default. ~~Optional[str]~~ | ## TextCategorizer.predict {#predict tag="method"} -Apply the pipeline's model to a batch of docs, without modifying them. +Apply the component's model to a batch of [`Doc`](/api/doc) objects without +modifying them. > #### Example > > ```python -> textcat = TextCategorizer(nlp.vocab) -> scores, tensors = textcat.predict([doc1, doc2]) +> textcat = nlp.add_pipe("textcat") +> scores = textcat.predict([doc1, doc2]) > ``` -| Name | Type | Description | -| ----------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `docs` | iterable | The documents to predict. | -| **RETURNS** | tuple | A `(scores, tensors)` tuple where `scores` is the model's prediction for each document and `tensors` is the token representations used to predict the scores. Each tensor is an array with one row for each token in the document. | +| Name | Description | +| ----------- | ------------------------------------------- | +| `docs` | The documents to predict. ~~Iterable[Doc]~~ | +| **RETURNS** | The model's prediction for each document. | ## TextCategorizer.set_annotations {#set_annotations tag="method"} -Modify a batch of documents, using pre-computed scores. +Modify a batch of [`Doc`](/api/doc) objects using pre-computed scores. > #### Example > > ```python -> textcat = TextCategorizer(nlp.vocab) -> scores, tensors = textcat.predict([doc1, doc2]) -> textcat.set_annotations([doc1, doc2], scores, tensors) +> textcat = nlp.add_pipe("textcat") +> scores = textcat.predict(docs) +> textcat.set_annotations(docs, scores) > ``` -| Name | Type | Description | -| --------- | -------- | --------------------------------------------------------- | -| `docs` | iterable | The documents to modify. | -| `scores` | - | The scores to set, produced by `TextCategorizer.predict`. 
| -| `tensors` | iterable | The token representations used to predict the scores. | +| Name | Description | +| -------- | --------------------------------------------------------- | +| `docs` | The documents to modify. ~~Iterable[Doc]~~ | +| `scores` | The scores to set, produced by `TextCategorizer.predict`. | ## TextCategorizer.update {#update tag="method"} -Learn from a batch of documents and gold-standard information, updating the -pipe's model. Delegates to [`predict`](/api/textcategorizer#predict) and +Learn from a batch of [`Example`](/api/example) objects containing the +predictions and gold-standard annotations, and update the component's model. +Delegates to [`predict`](/api/textcategorizer#predict) and [`get_loss`](/api/textcategorizer#get_loss). > #### Example > > ```python -> textcat = TextCategorizer(nlp.vocab) -> losses = {} -> optimizer = nlp.begin_training() -> textcat.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer) +> textcat = nlp.add_pipe("textcat") +> optimizer = nlp.initialize() +> losses = textcat.update(examples, sgd=optimizer) > ``` -| Name | Type | Description | -| -------- | -------- | -------------------------------------------------------------------------------------------- | -| `docs` | iterable | A batch of documents to learn from. | -| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. | -| `drop` | float | The dropout rate. | -| `sgd` | callable | The optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. | -| `losses` | dict | Optional record of the loss during training. The value keyed by the model's name is updated. | +| Name | Description | +| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- | +| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ | +| _keyword-only_ | | +| `drop` | The dropout rate. ~~float~~ | +| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ | +| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ | +| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ | +| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ | + +## TextCategorizer.rehearse {#rehearse tag="method,experimental" new="3"} + +Perform a "rehearsal" update from a batch of data. Rehearsal updates teach the +current model to make predictions similar to an initial model to try to address +the "catastrophic forgetting" problem. This feature is experimental. + +> #### Example +> +> ```python +> textcat = nlp.add_pipe("textcat") +> optimizer = nlp.resume_training() +> losses = textcat.rehearse(examples, sgd=optimizer) +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------ | +| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ | +| _keyword-only_ | | +| `drop` | The dropout rate. ~~float~~ | +| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ | +| `losses` | Optional record of the loss during training. Updated using the component name as the key. 
~~Optional[Dict[str, float]]~~ | +| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ | ## TextCategorizer.get_loss {#get_loss tag="method"} @@ -173,37 +255,33 @@ predicted scores. > #### Example > > ```python -> textcat = TextCategorizer(nlp.vocab) -> scores = textcat.predict([doc1, doc2]) -> loss, d_loss = textcat.get_loss([doc1, doc2], [gold1, gold2], scores) +> textcat = nlp.add_pipe("textcat") +> scores = textcat.predict([eg.predicted for eg in examples]) +> loss, d_loss = textcat.get_loss(examples, scores) > ``` -| Name | Type | Description | -| ----------- | -------- | ------------------------------------------------------------ | -| `docs` | iterable | The batch of documents. | -| `golds` | iterable | The gold-standard data. Must have the same length as `docs`. | -| `scores` | - | Scores representing the model's predictions. | -| **RETURNS** | tuple | The loss and the gradient, i.e. `(loss, gradient)`. | +| Name | Description | +| ----------- | --------------------------------------------------------------------------- | +| `examples` | The batch of examples. ~~Iterable[Example]~~ | +| `scores` | Scores representing the model's predictions. | +| **RETURNS** | The loss and the gradient, i.e. `(loss, gradient)`. ~~Tuple[float, float]~~ | -## TextCategorizer.begin_training {#begin_training tag="method"} +## TextCategorizer.score {#score tag="method" new="3"} -Initialize the pipe for training, using data examples if available. If no model -has been initialized yet, the model is added. +Score a batch of examples. > #### Example > > ```python -> textcat = TextCategorizer(nlp.vocab) -> nlp.pipeline.append(textcat) -> optimizer = textcat.begin_training(pipeline=nlp.pipeline) +> scores = textcat.score(examples) > ``` -| Name | Type | Description | -| ------------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `gold_tuples` | iterable | Optional gold-standard annotations from which to construct [`GoldParse`](/api/goldparse) objects. | -| `pipeline` | list | Optional list of pipeline components that this component is part of. | -| `sgd` | callable | An optional optimizer. Should take two arguments `weights` and `gradient`, and an optional ID. Will be created via [`TextCategorizer`](/api/textcategorizer#create_optimizer) if not set. | -| **RETURNS** | callable | An optimizer. | +| Name | Description | +| ---------------- | -------------------------------------------------------------------------------------------------------------------- | +| `examples` | The examples to score. ~~Iterable[Example]~~ | +| _keyword-only_ | | +| `positive_label` | Optional positive label. ~~Optional[str]~~ | +| **RETURNS** | The scores, produced by [`Scorer.score_cats`](/api/scorer#score_cats). ~~Dict[str, Union[float, Dict[str, float]]]~~ | ## TextCategorizer.create_optimizer {#create_optimizer tag="method"} @@ -212,44 +290,51 @@ Create an optimizer for the pipeline component. > #### Example > > ```python -> textcat = TextCategorizer(nlp.vocab) +> textcat = nlp.add_pipe("textcat") > optimizer = textcat.create_optimizer() > ``` -| Name | Type | Description | -| ----------- | -------- | -------------- | -| **RETURNS** | callable | The optimizer. | +| Name | Description | +| ----------- | ---------------------------- | +| **RETURNS** | The optimizer. 
~~Optimizer~~ | ## TextCategorizer.use_params {#use_params tag="method, contextmanager"} -Modify the pipe's model, to use the given parameter values. +Modify the pipe's model to use the given parameter values. > #### Example > > ```python -> textcat = TextCategorizer(nlp.vocab) +> textcat = nlp.add_pipe("textcat") > with textcat.use_params(optimizer.averages): > textcat.to_disk("/best_model") > ``` -| Name | Type | Description | -| -------- | ---- | ---------------------------------------------------------------------------------------------------------- | -| `params` | dict | The parameter values to use in the model. At the end of the context, the original parameters are restored. | +| Name | Description | +| -------- | -------------------------------------------------- | +| `params` | The parameter values to use in the model. ~~dict~~ | ## TextCategorizer.add_label {#add_label tag="method"} -Add a new label to the pipe. +Add a new label to the pipe. Raises an error if the output dimension is already +set, or if the model has already been fully [initialized](#initialize). Note +that you don't have to call this method if you provide a **representative data +sample** to the [`initialize`](#initialize) method. In this case, all labels +found in the sample will be automatically added to the model, and the output +dimension will be [inferred](/usage/layers-architectures#thinc-shape-inference) +automatically. > #### Example > > ```python -> textcat = TextCategorizer(nlp.vocab) +> textcat = nlp.add_pipe("textcat") > textcat.add_label("MY_LABEL") > ``` -| Name | Type | Description | -| ------- | ------- | ----------------- | -| `label` | unicode | The label to add. | +| Name | Description | +| ----------- | ----------------------------------------------------------- | +| `label` | The label to add. ~~str~~ | +| **RETURNS** | `0` if the label is already present, otherwise `1`. ~~int~~ | ## TextCategorizer.to_disk {#to_disk tag="method"} @@ -258,14 +343,15 @@ Serialize the pipe to disk. > #### Example > > ```python -> textcat = TextCategorizer(nlp.vocab) +> textcat = nlp.add_pipe("textcat") > textcat.to_disk("/path/to/textcat") > ``` -| Name | Type | Description | -| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- | -| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | -| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | +| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | ## TextCategorizer.from_disk {#from_disk tag="method"} @@ -274,31 +360,33 @@ Load the pipe from disk. Modifies the object in place and returns it. 
> #### Example > > ```python -> textcat = TextCategorizer(nlp.vocab) +> textcat = nlp.add_pipe("textcat") > textcat.from_disk("/path/to/textcat") > ``` -| Name | Type | Description | -| ----------- | ----------------- | -------------------------------------------------------------------------- | -| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. | -| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | -| **RETURNS** | `TextCategorizer` | The modified `TextCategorizer` object. | +| Name | Description | +| -------------- | ----------------------------------------------------------------------------------------------- | +| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The modified `TextCategorizer` object. ~~TextCategorizer~~ | ## TextCategorizer.to_bytes {#to_bytes tag="method"} > #### Example > > ```python -> textcat = TextCategorizer(nlp.vocab) +> textcat = nlp.add_pipe("textcat") > textcat_bytes = textcat.to_bytes() > ``` Serialize the pipe to a bytestring. -| Name | Type | Description | -| ----------- | ----- | ------------------------------------------------------------------------- | -| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | -| **RETURNS** | bytes | The serialized form of the `TextCategorizer` object. | +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------- | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The serialized form of the `TextCategorizer` object. ~~bytes~~ | ## TextCategorizer.from_bytes {#from_bytes tag="method"} @@ -308,15 +396,16 @@ Load the pipe from a bytestring. Modifies the object in place and returns it. > > ```python > textcat_bytes = textcat.to_bytes() -> textcat = TextCategorizer(nlp.vocab) +> textcat = nlp.add_pipe("textcat") > textcat.from_bytes(textcat_bytes) > ``` -| Name | Type | Description | -| ------------ | ----------------- | ------------------------------------------------------------------------- | -| `bytes_data` | bytes | The data to load from. | -| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | -| **RETURNS** | `TextCategorizer` | The `TextCategorizer` object. | +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------- | +| `bytes_data` | The data to load from. ~~bytes~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The `TextCategorizer` object. ~~TextCategorizer~~ | ## TextCategorizer.labels {#labels tag="property"} @@ -329,9 +418,27 @@ The labels currently added to the component. > assert "MY_LABEL" in textcat.labels > ``` -| Name | Type | Description | -| ----------- | ----- | ---------------------------------- | -| **RETURNS** | tuple | The labels added to the component. | +| Name | Description | +| ----------- | ------------------------------------------------------ | +| **RETURNS** | The labels added to the component. 
~~Tuple[str, ...]~~ | + +## TextCategorizer.label_data {#label_data tag="property" new="3"} + +The labels currently added to the component and their internal meta information. +This is the data generated by [`init labels`](/api/cli#init-labels) and used by +[`TextCategorizer.initialize`](/api/textcategorizer#initialize) to initialize +the model with a pre-defined label set. + +> #### Example +> +> ```python +> labels = textcat.label_data +> textcat.initialize(lambda: [], nlp=nlp, labels=labels) +> ``` + +| Name | Description | +| ----------- | ---------------------------------------------------------- | +| **RETURNS** | The label data added to the component. ~~Tuple[str, ...]~~ | ## Serialization fields {#serialization-fields} diff --git a/website/docs/api/tok2vec.md b/website/docs/api/tok2vec.md new file mode 100644 index 000000000..051164ff5 --- /dev/null +++ b/website/docs/api/tok2vec.md @@ -0,0 +1,328 @@ +--- +title: Tok2Vec +source: spacy/pipeline/tok2vec.py +new: 3 +teaser: null +api_base_class: /api/pipe +api_string_name: tok2vec +api_trainable: true +--- + +Apply a "token-to-vector" model and set its outputs in the `Doc.tensor` +attribute. This is mostly useful to **share a single subnetwork** between +multiple components, e.g. to have one embedding and CNN network shared between a +[`DependencyParser`](/api/dependencyparser), [`Tagger`](/api/tagger) and +[`EntityRecognizer`](/api/entityrecognizer). + +In order to use the `Tok2Vec` predictions, subsequent components should use the +[Tok2VecListener](/api/architectures#Tok2VecListener) layer as the `tok2vec` +subnetwork of their model. This layer will read data from the `doc.tensor` +attribute during prediction. During training, the `Tok2Vec` component will save +its prediction and backprop callback for each batch, so that the subsequent +components can backpropagate to the shared weights. This implementation is used +because it allows us to avoid relying on object identity within the models to +achieve the parameter sharing. + +## Config and implementation {#config} + +The default config is defined by the pipeline component factory and describes +how the component should be configured. You can override its settings via the +`config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your +[`config.cfg` for training](/usage/training#config). See the +[model architectures](/api/architectures) documentation for details on the +architectures and their arguments and hyperparameters. + +> #### Example +> +> ```python +> from spacy.pipeline.tok2vec import DEFAULT_TOK2VEC_MODEL +> config = {"model": DEFAULT_TOK2VEC_MODEL} +> nlp.add_pipe("tok2vec", config=config) +> ``` + +| Setting | Description | +| ------- | ------------------------------------------------------------------------------------------------------------------ | +| `model` | The model to use. Defaults to [HashEmbedCNN](/api/architectures#HashEmbedCNN). ~~Model[List[Doc], List[Floats2d]~~ | + +```python +%%GITHUB_SPACY/spacy/pipeline/tok2vec.py +``` + +## Tok2Vec.\_\_init\_\_ {#init tag="method"} + +> #### Example +> +> ```python +> # Construction via add_pipe with default model +> tok2vec = nlp.add_pipe("tok2vec") +> +> # Construction via add_pipe with custom model +> config = {"model": {"@architectures": "my_tok2vec"}} +> parser = nlp.add_pipe("tok2vec", config=config) +> +> # Construction from class +> from spacy.pipeline import Tok2Vec +> tok2vec = Tok2Vec(nlp.vocab, model) +> ``` + +Create a new pipeline instance. 
In your application, you would normally use a +shortcut for this and instantiate the component using its string name and +[`nlp.add_pipe`](/api/language#create_pipe). + +| Name | Description | +| ------- | ------------------------------------------------------------------------------------------------------------------------- | +| `vocab` | The shared vocabulary. ~~Vocab~~ | +| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]~~ | +| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ | + +## Tok2Vec.\_\_call\_\_ {#call tag="method"} + +Apply the pipe to one document and add context-sensitive embeddings to the +`Doc.tensor` attribute, allowing them to be used as features by downstream +components. The document is modified in place, and returned. This usually +happens under the hood when the `nlp` object is called on a text and all +pipeline components are applied to the `Doc` in order. Both +[`__call__`](/api/tok2vec#call) and [`pipe`](/api/tok2vec#pipe) delegate to the +[`predict`](/api/tok2vec#predict) and +[`set_annotations`](/api/tok2vec#set_annotations) methods. + +> #### Example +> +> ```python +> doc = nlp("This is a sentence.") +> tok2vec = nlp.add_pipe("tok2vec") +> # This usually happens under the hood +> processed = tok2vec(doc) +> ``` + +| Name | Description | +| ----------- | -------------------------------- | +| `doc` | The document to process. ~~Doc~~ | +| **RETURNS** | The processed document. ~~Doc~~ | + +## Tok2Vec.pipe {#pipe tag="method"} + +Apply the pipe to a stream of documents. This usually happens under the hood +when the `nlp` object is called on a text and all pipeline components are +applied to the `Doc` in order. Both [`__call__`](/api/tok2vec#call) and +[`pipe`](/api/tok2vec#pipe) delegate to the [`predict`](/api/tok2vec#predict) +and [`set_annotations`](/api/tok2vec#set_annotations) methods. + +> #### Example +> +> ```python +> tok2vec = nlp.add_pipe("tok2vec") +> for doc in tok2vec.pipe(docs, batch_size=50): +> pass +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------- | +| `stream` | A stream of documents. ~~Iterable[Doc]~~ | +| _keyword-only_ | | +| `batch_size` | The number of documents to buffer. Defaults to `128`. ~~int~~ | +| **YIELDS** | The processed documents in order. ~~Doc~~ | + +## Tok2Vec.initialize {#initialize tag="method"} + +Initialize the component for training and return an +[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a +function that returns an iterable of [`Example`](/api/example) objects. The data +examples are used to **initialize the model** of the component and can either be +the full training data or a representative sample. Initialization includes +validating the network, +[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and +setting up the label scheme based on the data. This method is typically called +by [`Language.initialize`](/api/language#initialize). + +> #### Example +> +> ```python +> tok2vec = nlp.add_pipe("tok2vec") +> tok2vec.initialize(lambda: [], nlp=nlp) +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------- | +| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. 
~~Callable[[], Iterable[Example]]~~ | +| _keyword-only_ | | +| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ | + +## Tok2Vec.predict {#predict tag="method"} + +Apply the component's model to a batch of [`Doc`](/api/doc) objects without +modifying them. + +> #### Example +> +> ```python +> tok2vec = nlp.add_pipe("tok2vec") +> scores = tok2vec.predict([doc1, doc2]) +> ``` + +| Name | Description | +| ----------- | ------------------------------------------- | +| `docs` | The documents to predict. ~~Iterable[Doc]~~ | +| **RETURNS** | The model's prediction for each document. | + +## Tok2Vec.set_annotations {#set_annotations tag="method"} + +Modify a batch of [`Doc`](/api/doc) objects, using pre-computed scores. + +> #### Example +> +> ```python +> tok2vec = nlp.add_pipe("tok2vec") +> scores = tok2vec.predict(docs) +> tok2vec.set_annotations(docs, scores) +> ``` + +| Name | Description | +| -------- | ------------------------------------------------- | +| `docs` | The documents to modify. ~~Iterable[Doc]~~ | +| `scores` | The scores to set, produced by `Tok2Vec.predict`. | + +## Tok2Vec.update {#update tag="method"} + +Learn from a batch of [`Example`](/api/example) objects containing the +predictions and gold-standard annotations, and update the component's model. +Delegates to [`predict`](/api/tok2vec#predict). + +> #### Example +> +> ```python +> tok2vec = nlp.add_pipe("tok2vec") +> optimizer = nlp.initialize() +> losses = tok2vec.update(examples, sgd=optimizer) +> ``` + +| Name | Description | +| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------- | +| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ | +| _keyword-only_ | | +| `drop` | The dropout rate. ~~float~~ | +| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ | +| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ | +| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ | +| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ | + +## Tok2Vec.create_optimizer {#create_optimizer tag="method"} + +Create an optimizer for the pipeline component. + +> #### Example +> +> ```python +> tok2vec = nlp.add_pipe("tok2vec") +> optimizer = tok2vec.create_optimizer() +> ``` + +| Name | Description | +| ----------- | ---------------------------- | +| **RETURNS** | The optimizer. ~~Optimizer~~ | + +## Tok2Vec.use_params {#use_params tag="method, contextmanager"} + +Modify the pipe's model to use the given parameter values. At the end of the +context, the original parameters are restored. + +> #### Example +> +> ```python +> tok2vec = nlp.add_pipe("tok2vec") +> with tok2vec.use_params(optimizer.averages): +> tok2vec.to_disk("/best_model") +> ``` + +| Name | Description | +| -------- | -------------------------------------------------- | +| `params` | The parameter values to use in the model. ~~dict~~ | + +## Tok2Vec.to_disk {#to_disk tag="method"} + +Serialize the pipe to disk. 
+ +> #### Example +> +> ```python +> tok2vec = nlp.add_pipe("tok2vec") +> tok2vec.to_disk("/path/to/tok2vec") +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | +| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | + +## Tok2Vec.from_disk {#from_disk tag="method"} + +Load the pipe from disk. Modifies the object in place and returns it. + +> #### Example +> +> ```python +> tok2vec = nlp.add_pipe("tok2vec") +> tok2vec.from_disk("/path/to/tok2vec") +> ``` + +| Name | Description | +| -------------- | ----------------------------------------------------------------------------------------------- | +| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The modified `Tok2Vec` object. ~~Tok2Vec~~ | + +## Tok2Vec.to_bytes {#to_bytes tag="method"} + +> #### Example +> +> ```python +> tok2vec = nlp.add_pipe("tok2vec") +> tok2vec_bytes = tok2vec.to_bytes() +> ``` + +Serialize the pipe to a bytestring. + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------- | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The serialized form of the `Tok2Vec` object. ~~bytes~~ | + +## Tok2Vec.from_bytes {#from_bytes tag="method"} + +Load the pipe from a bytestring. Modifies the object in place and returns it. + +> #### Example +> +> ```python +> tok2vec_bytes = tok2vec.to_bytes() +> tok2vec = nlp.add_pipe("tok2vec") +> tok2vec.from_bytes(tok2vec_bytes) +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------- | +| `bytes_data` | The data to load from. ~~bytes~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The `Tok2Vec` object. ~~Tok2Vec~~ | + +## Serialization fields {#serialization-fields} + +During serialization, spaCy will export several data fields used to restore +different aspects of the object. If needed, you can exclude them from +serialization by passing in the string names via the `exclude` argument. + +> #### Example +> +> ```python +> data = tok2vec.to_disk("/path", exclude=["vocab"]) +> ``` + +| Name | Description | +| ------- | -------------------------------------------------------------- | +| `vocab` | The shared [`Vocab`](/api/vocab). | +| `cfg` | The config file. You usually don't want to exclude this. | +| `model` | The binary model data. You usually don't want to exclude this. | diff --git a/website/docs/api/token.md b/website/docs/api/token.md index 9f8594c96..e7e66e931 100644 --- a/website/docs/api/token.md +++ b/website/docs/api/token.md @@ -17,12 +17,11 @@ Construct a `Token` object. 
> assert token.text == "Give" > ``` -| Name | Type | Description | -| ----------- | ------- | ------------------------------------------- | -| `vocab` | `Vocab` | A storage container for lexical types. | -| `doc` | `Doc` | The parent document. | -| `offset` | int | The index of the token within the document. | -| **RETURNS** | `Token` | The newly constructed object. | +| Name | Description | +| -------- | --------------------------------------------------- | +| `vocab` | A storage container for lexical types. ~~Vocab~~ | +| `doc` | The parent document. ~~Doc~~ | +| `offset` | The index of the token within the document. ~~int~~ | ## Token.\_\_len\_\_ {#len tag="method"} @@ -36,9 +35,9 @@ The number of unicode characters in the token, i.e. `token.text`. > assert len(token) == 4 > ``` -| Name | Type | Description | -| ----------- | ---- | ---------------------------------------------- | -| **RETURNS** | int | The number of unicode characters in the token. | +| Name | Description | +| ----------- | ------------------------------------------------------ | +| **RETURNS** | The number of unicode characters in the token. ~~int~~ | ## Token.set_extension {#set_extension tag="classmethod" new="2"} @@ -56,14 +55,14 @@ For details, see the documentation on > assert doc[3]._.is_fruit > ``` -| Name | Type | Description | -| --------- | -------- | --------------------------------------------------------------------------------------------------------------------------------------- | -| `name` | unicode | Name of the attribute to set by the extension. For example, `'my_attr'` will be available as `token._.my_attr`. | -| `default` | - | Optional default value of the attribute if no getter or method is defined. | -| `method` | callable | Set a custom method on the object, for example `token._.compare(other_token)`. | -| `getter` | callable | Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute. | -| `setter` | callable | Setter function that takes the `Token` and a value, and modifies the object. Is called when the user writes to the `Token._` attribute. | -| `force` | bool | Force overwriting existing attribute. | +| Name | Description | +| --------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | Name of the attribute to set by the extension. For example, `"my_attr"` will be available as `token._.my_attr`. ~~str~~ | +| `default` | Optional default value of the attribute if no getter or method is defined. ~~Optional[Any]~~ | +| `method` | Set a custom method on the object, for example `token._.compare(other_token)`. ~~Optional[Callable[[Token, ...], Any]]~~ | +| `getter` | Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute. ~~Optional[Callable[[Token], Any]]~~ | +| `setter` | Setter function that takes the `Token` and a value, and modifies the object. Is called when the user writes to the `Token._` attribute. ~~Optional[Callable[[Token, Any], None]]~~ | +| `force` | Force overwriting existing attribute. ~~bool~~ | ## Token.get_extension {#get_extension tag="classmethod" new="2"} @@ -80,10 +79,10 @@ Look up a previously registered extension by name. 
Returns a 4-tuple > assert extension == (False, None, None, None) > ``` -| Name | Type | Description | -| ----------- | ------- | ------------------------------------------------------------- | -| `name` | unicode | Name of the extension. | -| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the extension. | +| Name | Description | +| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | Name of the extension. ~~str~~ | +| **RETURNS** | A `(default, method, getter, setter)` tuple of the extension. ~~Tuple[Optional[Any], Optional[Callable], Optional[Callable], Optional[Callable]]~~ | ## Token.has_extension {#has_extension tag="classmethod" new="2"} @@ -97,10 +96,10 @@ Check whether an extension has been registered on the `Token` class. > assert Token.has_extension("is_fruit") > ``` -| Name | Type | Description | -| ----------- | ------- | ------------------------------------------ | -| `name` | unicode | Name of the extension to check. | -| **RETURNS** | bool | Whether the extension has been registered. | +| Name | Description | +| ----------- | --------------------------------------------------- | +| `name` | Name of the extension to check. ~~str~~ | +| **RETURNS** | Whether the extension has been registered. ~~bool~~ | ## Token.remove_extension {#remove_extension tag="classmethod" new=""2.0.11""} @@ -115,10 +114,10 @@ Remove a previously registered extension. > assert not Token.has_extension("is_fruit") > ``` -| Name | Type | Description | -| ----------- | ------- | --------------------------------------------------------------------- | -| `name` | unicode | Name of the extension. | -| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the removed extension. | +| Name | Description | +| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | Name of the extension. ~~str~~ | +| **RETURNS** | A `(default, method, getter, setter)` tuple of the removed extension. ~~Tuple[Optional[Any], Optional[Callable], Optional[Callable], Optional[Callable]]~~ | ## Token.check_flag {#check_flag tag="method"} @@ -133,10 +132,10 @@ Check the value of a boolean flag. > assert token.check_flag(IS_TITLE) == True > ``` -| Name | Type | Description | -| ----------- | ---- | -------------------------------------- | -| `flag_id` | int | The attribute ID of the flag to check. | -| **RETURNS** | bool | Whether the flag is set. | +| Name | Description | +| ----------- | ---------------------------------------------- | +| `flag_id` | The attribute ID of the flag to check. ~~int~~ | +| **RETURNS** | Whether the flag is set. ~~bool~~ | ## Token.similarity {#similarity tag="method" model="vectors"} @@ -151,10 +150,10 @@ Compute a semantic similarity estimate. Defaults to cosine over vectors. > assert apples_oranges == oranges_apples > ``` -| Name | Type | Description | -| ----------- | ----- | -------------------------------------------------------------------------------------------- | -| other | - | The object to compare with. By default, accepts `Doc`, `Span`, `Token` and `Lexeme` objects. | -| **RETURNS** | float | A scalar similarity score. Higher is more similar. 
| +| Name | Description | +| ----------- | -------------------------------------------------------------------------------------------------------------------------------- | +| other | The object to compare with. By default, accepts `Doc`, `Span`, `Token` and `Lexeme` objects. ~~Union[Doc, Span, Token, Lexeme]~~ | +| **RETURNS** | A scalar similarity score. Higher is more similar. ~~float~~ | ## Token.nbor {#nbor tag="method"} @@ -168,10 +167,29 @@ Get a neighboring token. > assert give_nbor.text == "it" > ``` -| Name | Type | Description | -| ----------- | ------- | ----------------------------------------------------------- | -| `i` | int | The relative position of the token to get. Defaults to `1`. | -| **RETURNS** | `Token` | The token at position `self.doc[self.i+i]`. | +| Name | Description | +| ----------- | ------------------------------------------------------------------- | +| `i` | The relative position of the token to get. Defaults to `1`. ~~int~~ | +| **RETURNS** | The token at position `self.doc[self.i+i]`. ~~Token~~ | + +## Token.set_morph {#set_morph tag="method"} + +Set the morphological analysis from a UD FEATS string, hash value of a UD FEATS +string, features dict or `MorphAnalysis`. The value `None` can be used to reset +the morph to an unset state. + +> #### Example +> +> ```python +> doc = nlp("Give it back! He pleaded.") +> doc[0].set_morph("Mood=Imp|VerbForm=Fin") +> assert "Mood=Imp" in doc[0].morph +> assert doc[0].morph.get("Mood") == ["Imp"] +> ``` + +| Name | Description | +| -------- | --------------------------------------------------------------------------------- | +| features | The morphological features to set. ~~Union[int, dict, str, MorphAnalysis, None]~~ | ## Token.is_ancestor {#is_ancestor tag="method" model="parser"} @@ -187,10 +205,10 @@ dependency tree. > assert give.is_ancestor(it) > ``` -| Name | Type | Description | -| ----------- | ------- | ----------------------------------------------------- | -| descendant | `Token` | Another token. | -| **RETURNS** | bool | Whether this token is the ancestor of the descendant. | +| Name | Description | +| ----------- | -------------------------------------------------------------- | +| descendant | Another token. ~~Token~~ | +| **RETURNS** | Whether this token is the ancestor of the descendant. ~~bool~~ | ## Token.ancestors {#ancestors tag="property" model="parser"} @@ -206,9 +224,9 @@ The rightmost token of this token's syntactic descendants. > assert [t.text for t in he_ancestors] == ["pleaded"] > ``` -| Name | Type | Description | -| ---------- | ------- | --------------------------------------------------------------------- | -| **YIELDS** | `Token` | A sequence of ancestor tokens such that `ancestor.is_ancestor(self)`. | +| Name | Description | +| ---------- | ------------------------------------------------------------------------------- | +| **YIELDS** | A sequence of ancestor tokens such that `ancestor.is_ancestor(self)`. ~~Token~~ | ## Token.conjuncts {#conjuncts tag="property" model="parser"} @@ -222,9 +240,9 @@ A tuple of coordinated tokens, not including the token itself. > assert [t.text for t in apples_conjuncts] == ["oranges"] > ``` -| Name | Type | Description | -| ----------- | ------- | ----------------------- | -| **RETURNS** | `tuple` | The coordinated tokens. | +| Name | Description | +| ----------- | --------------------------------------------- | +| **RETURNS** | The coordinated tokens. 
~~Tuple[Token, ...]~~ | ## Token.children {#children tag="property" model="parser"} @@ -238,13 +256,13 @@ A sequence of the token's immediate syntactic children. > assert [t.text for t in give_children] == ["it", "back", "!"] > ``` -| Name | Type | Description | -| ---------- | ------- | ------------------------------------------- | -| **YIELDS** | `Token` | A child token such that `child.head==self`. | +| Name | Description | +| ---------- | ------------------------------------------------------- | +| **YIELDS** | A child token such that `child.head == self`. ~~Token~~ | ## Token.lefts {#lefts tag="property" model="parser"} -The leftward immediate children of the word, in the syntactic dependency parse. +The leftward immediate children of the word in the syntactic dependency parse. > #### Example > @@ -254,13 +272,13 @@ The leftward immediate children of the word, in the syntactic dependency parse. > assert lefts == ["New"] > ``` -| Name | Type | Description | -| ---------- | ------- | -------------------------- | -| **YIELDS** | `Token` | A left-child of the token. | +| Name | Description | +| ---------- | ------------------------------------ | +| **YIELDS** | A left-child of the token. ~~Token~~ | ## Token.rights {#rights tag="property" model="parser"} -The rightward immediate children of the word, in the syntactic dependency parse. +The rightward immediate children of the word in the syntactic dependency parse. > #### Example > @@ -270,13 +288,13 @@ The rightward immediate children of the word, in the syntactic dependency parse. > assert rights == ["in"] > ``` -| Name | Type | Description | -| ---------- | ------- | --------------------------- | -| **YIELDS** | `Token` | A right-child of the token. | +| Name | Description | +| ---------- | ------------------------------------- | +| **YIELDS** | A right-child of the token. ~~Token~~ | ## Token.n_lefts {#n_lefts tag="property" model="parser"} -The number of leftward immediate children of the word, in the syntactic +The number of leftward immediate children of the word in the syntactic dependency parse. > #### Example @@ -286,13 +304,13 @@ dependency parse. > assert doc[3].n_lefts == 1 > ``` -| Name | Type | Description | -| ----------- | ---- | -------------------------------- | -| **RETURNS** | int | The number of left-child tokens. | +| Name | Description | +| ----------- | ---------------------------------------- | +| **RETURNS** | The number of left-child tokens. ~~int~~ | ## Token.n_rights {#n_rights tag="property" model="parser"} -The number of rightward immediate children of the word, in the syntactic +The number of rightward immediate children of the word in the syntactic dependency parse. > #### Example @@ -302,9 +320,9 @@ dependency parse. > assert doc[3].n_rights == 1 > ``` -| Name | Type | Description | -| ----------- | ---- | --------------------------------- | -| **RETURNS** | int | The number of right-child tokens. | +| Name | Description | +| ----------- | ----------------------------------------- | +| **RETURNS** | The number of right-child tokens. ~~int~~ | ## Token.subtree {#subtree tag="property" model="parser"} @@ -318,9 +336,9 @@ A sequence containing the token and all the token's syntactic descendants. > assert [t.text for t in give_subtree] == ["Give", "it", "back", "!"] > ``` -| Name | Type | Description | -| ---------- | ------- | -------------------------------------------------------------------------- | -| **YIELDS** | `Token` | A descendant token such that `self.is_ancestor(token)` or `token == self`. 
| +| Name | Description | +| ---------- | ------------------------------------------------------------------------------------ | +| **YIELDS** | A descendant token such that `self.is_ancestor(token)` or `token == self`. ~~Token~~ | ## Token.is_sent_start {#is_sent_start tag="property" new="2"} @@ -335,24 +353,9 @@ unknown. Defaults to `True` for the first token in the `Doc`. > assert not doc[5].is_sent_start > ``` -| Name | Type | Description | -| ----------- | ---- | ------------------------------------ | -| **RETURNS** | bool | Whether the token starts a sentence. | - - - -As of spaCy v2.0, the `Token.sent_start` property is deprecated and has been -replaced with `Token.is_sent_start`, which returns a boolean value instead of a -misleading `0` for `False` and `1` for `True`. It also now returns `None` if the -answer is unknown, and fixes a quirk in the old logic that would always set the -property to `0` for the first word of the document. - -```diff -- assert doc[4].sent_start == 1 -+ assert doc[4].is_sent_start == True -``` - - +| Name | Description | +| ----------- | --------------------------------------------- | +| **RETURNS** | Whether the token starts a sentence. ~~bool~~ | ## Token.has_vector {#has_vector tag="property" model="vectors"} @@ -366,9 +369,9 @@ A boolean value indicating whether a word vector is associated with the token. > assert apples.has_vector > ``` -| Name | Type | Description | -| ----------- | ---- | --------------------------------------------- | -| **RETURNS** | bool | Whether the token has a vector data attached. | +| Name | Description | +| ----------- | ------------------------------------------------------ | +| **RETURNS** | Whether the token has a vector data attached. ~~bool~~ | ## Token.vector {#vector tag="property" model="vectors"} @@ -383,9 +386,9 @@ A real-valued meaning representation. > assert apples.vector.shape == (300,) > ``` -| Name | Type | Description | -| ----------- | ---------------------------------------- | ---------------------------------------------------- | -| **RETURNS** | `numpy.ndarray[ndim=1, dtype='float32']` | A 1D numpy array representing the token's semantics. | +| Name | Description | +| ----------- | ----------------------------------------------------------------------------------------------- | +| **RETURNS** | A 1-dimensional array representing the token's vector. ~~numpy.ndarray[ndim=1, dtype=float32]~~ | ## Token.vector_norm {#vector_norm tag="property" model="vectors"} @@ -402,77 +405,79 @@ The L2 norm of the token's vector representation. > assert apples.vector_norm != pasta.vector_norm > ``` -| Name | Type | Description | -| ----------- | ----- | ----------------------------------------- | -| **RETURNS** | float | The L2 norm of the vector representation. | +| Name | Description | +| ----------- | --------------------------------------------------- | +| **RETURNS** | The L2 norm of the vector representation. ~~float~~ | ## Attributes {#attributes} -| Name | Type | Description | -| -------------------------------------------- | ------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `doc` | `Doc` | The parent document. | -| `sent` 2.0.12 | `Span` | The sentence span that this token is a part of. | -| `text` | unicode | Verbatim text content. 
| -| `text_with_ws` | unicode | Text content, with trailing space character if present. | -| `whitespace_` | unicode | Trailing space character if present. | -| `orth` | int | ID of the verbatim text content. | -| `orth_` | unicode | Verbatim text content (identical to `Token.text`). Exists mostly for consistency with the other attributes. | -| `vocab` | `Vocab` | The vocab object of the parent `Doc`. | -| `tensor` 2.1.7 | `ndarray` | The tokens's slice of the parent `Doc`'s tensor. | -| `head` | `Token` | The syntactic parent, or "governor", of this token. | -| `left_edge` | `Token` | The leftmost token of this token's syntactic descendants. | -| `right_edge` | `Token` | The rightmost token of this token's syntactic descendants. | -| `i` | int | The index of the token within the parent document. | -| `ent_type` | int | Named entity type. | -| `ent_type_` | unicode | Named entity type. | -| `ent_iob` | int | IOB code of named entity tag. `3` means the token begins an entity, `2` means it is outside an entity, `1` means it is inside an entity, and `0` means no entity tag is set. | -| `ent_iob_` | unicode | IOB code of named entity tag. "B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set. | -| `ent_kb_id` 2.2 | int | Knowledge base ID that refers to the named entity this token is a part of, if any. | -| `ent_kb_id_` 2.2 | unicode | Knowledge base ID that refers to the named entity this token is a part of, if any. | -| `ent_id` | int | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. | -| `ent_id_` | unicode | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. | -| `lemma` | int | Base form of the token, with no inflectional suffixes. | -| `lemma_` | unicode | Base form of the token, with no inflectional suffixes. | -| `norm` | int | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). | -| `norm_` | unicode | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). | -| `lower` | int | Lowercase form of the token. | -| `lower_` | unicode | Lowercase form of the token text. Equivalent to `Token.text.lower()`. | -| `shape` | int | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. | -| `shape_` | unicode | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. | -| `prefix` | int | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. | -| `prefix_` | unicode | A length-N substring from the start of the token. Defaults to `N=1`. | -| `suffix` | int | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. | -| `suffix_` | unicode | Length-N substring from the end of the token. Defaults to `N=3`. 
| -| `is_alpha` | bool | Does the token consist of alphabetic characters? Equivalent to `token.text.isalpha()`. | -| `is_ascii` | bool | Does the token consist of ASCII characters? Equivalent to `all(ord(c) < 128 for c in token.text)`. | -| `is_digit` | bool | Does the token consist of digits? Equivalent to `token.text.isdigit()`. | -| `is_lower` | bool | Is the token in lowercase? Equivalent to `token.text.islower()`. | -| `is_upper` | bool | Is the token in uppercase? Equivalent to `token.text.isupper()`. | -| `is_title` | bool | Is the token in titlecase? Equivalent to `token.text.istitle()`. | -| `is_punct` | bool | Is the token punctuation? | -| `is_left_punct` | bool | Is the token a left punctuation mark, e.g. `'('` ? | -| `is_right_punct` | bool | Is the token a right punctuation mark, e.g. `')'` ? | -| `is_space` | bool | Does the token consist of whitespace characters? Equivalent to `token.text.isspace()`. | -| `is_bracket` | bool | Is the token a bracket? | -| `is_quote` | bool | Is the token a quotation mark? | -| `is_currency` 2.0.8 | bool | Is the token a currency symbol? | -| `like_url` | bool | Does the token resemble a URL? | -| `like_num` | bool | Does the token represent a number? e.g. "10.9", "10", "ten", etc. | -| `like_email` | bool | Does the token resemble an email address? | -| `is_oov` | bool | Does the token have a word vector? | -| `is_stop` | bool | Is the token part of a "stop list"? | -| `pos` | int | Coarse-grained part-of-speech from the [Universal POS tag set](https://universaldependencies.org/docs/u/pos/). | -| `pos_` | unicode | Coarse-grained part-of-speech from the [Universal POS tag set](https://universaldependencies.org/docs/u/pos/). | -| `tag` | int | Fine-grained part-of-speech. | -| `tag_` | unicode | Fine-grained part-of-speech. | -| `dep` | int | Syntactic dependency relation. | -| `dep_` | unicode | Syntactic dependency relation. | -| `lang` | int | Language of the parent document's vocabulary. | -| `lang_` | unicode | Language of the parent document's vocabulary. | -| `prob` | float | Smoothed log probability estimate of token's word type (context-independent entry in the vocabulary). | -| `idx` | int | The character offset of the token within the parent document. | -| `sentiment` | float | A scalar value indicating the positivity or negativity of the token. | -| `lex_id` | int | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. | -| `rank` | int | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. | -| `cluster` | int | Brown cluster ID. | -| `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). | +| Name | Description | +| -------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `doc` | The parent document. ~~Doc~~ | +| `lex` 3 | The underlying lexeme. ~~Lexeme~~ | +| `sent` 2.0.12 | The sentence span that this token is a part of. ~~Span~~ | +| `text` | Verbatim text content. ~~str~~ | +| `text_with_ws` | Text content, with trailing space character if present. ~~str~~ | +| `whitespace_` | Trailing space character if present. ~~str~~ | +| `orth` | ID of the verbatim text content. 
~~int~~ | +| `orth_` | Verbatim text content (identical to `Token.text`). Exists mostly for consistency with the other attributes. ~~str~~ | +| `vocab` | The vocab object of the parent `Doc`. ~~vocab~~ | +| `tensor` 2.1.7 | The tokens's slice of the parent `Doc`'s tensor. ~~numpy.ndarray~~ | +| `head` | The syntactic parent, or "governor", of this token. ~~Token~~ | +| `left_edge` | The leftmost token of this token's syntactic descendants. ~~Token~~ | +| `right_edge` | The rightmost token of this token's syntactic descendants. ~~Token~~ | +| `i` | The index of the token within the parent document. ~~int~~ | +| `ent_type` | Named entity type. ~~int~~ | +| `ent_type_` | Named entity type. ~~str~~ | +| `ent_iob` | IOB code of named entity tag. `3` means the token begins an entity, `2` means it is outside an entity, `1` means it is inside an entity, and `0` means no entity tag is set. ~~int~~ | +| `ent_iob_` | IOB code of named entity tag. "B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set. ~~str~~ | +| `ent_kb_id` 2.2 | Knowledge base ID that refers to the named entity this token is a part of, if any. ~~int~~ | +| `ent_kb_id_` 2.2 | Knowledge base ID that refers to the named entity this token is a part of, if any. ~~str~~ | +| `ent_id` | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. ~~int~~ | +| `ent_id_` | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. ~~str~~ | +| `lemma` | Base form of the token, with no inflectional suffixes. ~~int~~ | +| `lemma_` | Base form of the token, with no inflectional suffixes. ~~str~~ | +| `norm` | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/linguistic-features#language-data). ~~int~~ | +| `norm_` | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/linguistic-features#language-data). ~~str~~ | +| `lower` | Lowercase form of the token. ~~int~~ | +| `lower_` | Lowercase form of the token text. Equivalent to `Token.text.lower()`. ~~str~~ | +| `shape` | Transform of the tokens's string to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. ~~int~~ | +| `shape_` | Transform of the tokens's string to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. ~~str~~ | +| `prefix` | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. ~~int~~ | +| `prefix_` | A length-N substring from the start of the token. Defaults to `N=1`. ~~str~~ | +| `suffix` | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. ~~int~~ | +| `suffix_` | Length-N substring from the end of the token. Defaults to `N=3`. ~~str~~ | +| `is_alpha` | Does the token consist of alphabetic characters? Equivalent to `token.text.isalpha()`. ~~bool~~ | +| `is_ascii` | Does the token consist of ASCII characters? Equivalent to `all(ord(c) < 128 for c in token.text)`. ~~bool~~ | +| `is_digit` | Does the token consist of digits? Equivalent to `token.text.isdigit()`. 
~~bool~~ | +| `is_lower` | Is the token in lowercase? Equivalent to `token.text.islower()`. ~~bool~~ | +| `is_upper` | Is the token in uppercase? Equivalent to `token.text.isupper()`. ~~bool~~ | +| `is_title` | Is the token in titlecase? Equivalent to `token.text.istitle()`. ~~bool~~ | +| `is_punct` | Is the token punctuation? ~~bool~~ | +| `is_left_punct` | Is the token a left punctuation mark, e.g. `"("` ? ~~bool~~ | +| `is_right_punct` | Is the token a right punctuation mark, e.g. `")"` ? ~~bool~~ | +| `is_space` | Does the token consist of whitespace characters? Equivalent to `token.text.isspace()`. ~~bool~~ | +| `is_bracket` | Is the token a bracket? ~~bool~~ | +| `is_quote` | Is the token a quotation mark? ~~bool~~ | +| `is_currency` 2.0.8 | Is the token a currency symbol? ~~bool~~ | +| `like_url` | Does the token resemble a URL? ~~bool~~ | +| `like_num` | Does the token represent a number? e.g. "10.9", "10", "ten", etc. ~~bool~~ | +| `like_email` | Does the token resemble an email address? ~~bool~~ | +| `is_oov` | Does the token have a word vector? ~~bool~~ | +| `is_stop` | Is the token part of a "stop list"? ~~bool~~ | +| `pos` | Coarse-grained part-of-speech from the [Universal POS tag set](https://universaldependencies.org/docs/u/pos/). ~~int~~ | +| `pos_` | Coarse-grained part-of-speech from the [Universal POS tag set](https://universaldependencies.org/docs/u/pos/). ~~str~~ | +| `tag` | Fine-grained part-of-speech. ~~int~~ | +| `tag_` | Fine-grained part-of-speech. ~~str~~ | +| `morph` 3 | Morphological analysis. ~~MorphAnalysis~~ | +| `dep` | Syntactic dependency relation. ~~int~~ | +| `dep_` | Syntactic dependency relation. ~~str~~ | +| `lang` | Language of the parent document's vocabulary. ~~int~~ | +| `lang_` | Language of the parent document's vocabulary. ~~str~~ | +| `prob` | Smoothed log probability estimate of token's word type (context-independent entry in the vocabulary). ~~float~~ | +| `idx` | The character offset of the token within the parent document. ~~int~~ | +| `sentiment` | A scalar value indicating the positivity or negativity of the token. ~~float~~ | +| `lex_id` | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. ~~int~~ | +| `rank` | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. ~~int~~ | +| `cluster` | Brown cluster ID. ~~int~~ | +| `_` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). ~~Underscore~~ | diff --git a/website/docs/api/tokenizer.md b/website/docs/api/tokenizer.md index 6f8badfe8..8809c10bc 100644 --- a/website/docs/api/tokenizer.md +++ b/website/docs/api/tokenizer.md @@ -1,19 +1,29 @@ --- title: Tokenizer -teaser: Segment text into words, punctuations marks etc. +teaser: Segment text into words, punctuations marks, etc. tag: class source: spacy/tokenizer.pyx --- +> #### Default config +> +> ```ini +> [nlp.tokenizer] +> @tokenizers = "spacy.Tokenizer.v1" +> ``` + Segment text, and create `Doc` objects with the discovered segment boundaries. For a deeper understanding, see the docs on [how spaCy's tokenizer works](/usage/linguistic-features#how-tokenizer-works). +The tokenizer is typically created automatically when a +[`Language`](/api/language) subclass is initialized and it reads its settings +like punctuation and special case rules from the +[`Language.Defaults`](/api/language#defaults) provided by the language subclass. 
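Because the tokenizer picks up its punctuation rules and exceptions from the language defaults, you can usually adjust its behavior on an existing pipeline instead of constructing a new `Tokenizer` from scratch. The snippet below is a minimal sketch; the `"gimme"` special case and the extra `~` infix pattern are purely illustrative and not part of any language's defaults.

```python
import spacy
from spacy.symbols import ORTH
from spacy.util import compile_infix_regex

nlp = spacy.blank("en")

# Add a special-case rule: tokenize "gimme" as two tokens
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

# Extend the default infix patterns so the tokenizer also splits on "~"
infixes = list(nlp.Defaults.infixes) + [r"~"]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

doc = nlp("gimme that book~now")
print([t.text for t in doc])  # ['gim', 'me', 'that', 'book', '~', 'now']
```
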
## Tokenizer.\_\_init\_\_ {#init tag="method"} -Create a `Tokenizer`, to create `Doc` objects given unicode text. For examples -of how to construct a custom tokenizer with different tokenization rules, see -the +Create a `Tokenizer` to create `Doc` objects given unicode text. For examples of +how to construct a custom tokenizer with different tokenization rules, see the [usage documentation](https://spacy.io/usage/linguistic-features#native-tokenizers). > #### Example @@ -31,19 +41,18 @@ the > nlp = English() > # Create a Tokenizer with the default settings for English > # including punctuation rules and exceptions -> tokenizer = nlp.Defaults.create_tokenizer(nlp) +> tokenizer = nlp.tokenizer > ``` -| Name | Type | Description | -| ---------------- | ----------- | ------------------------------------------------------------------------------------------------------------------------------ | -| `vocab` | `Vocab` | A storage container for lexical types. | -| `rules` | dict | Exceptions and special-cases for the tokenizer. | -| `prefix_search` | callable | A function matching the signature of `re.compile(string).search` to match prefixes. | -| `suffix_search` | callable | A function matching the signature of `re.compile(string).search` to match suffixes. | -| `infix_finditer` | callable | A function matching the signature of `re.compile(string).finditer` to find infixes. | -| `token_match` | callable | A function matching the signature of `re.compile(string).match` to find token matches. | -| `url_match` | callable | A function matching the signature of `re.compile(string).match` to find token matches after considering prefixes and suffixes. | -| **RETURNS** | `Tokenizer` | The newly constructed object. | +| Name | Description | +| ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `vocab` | A storage container for lexical types. ~~Vocab~~ | +| `rules` | Exceptions and special-cases for the tokenizer. ~~Optional[Dict[str, List[Dict[int, str]]]]~~ | +| `prefix_search` | A function matching the signature of `re.compile(string).search` to match prefixes. ~~Optional[Callable[[str], Optional[Match]]]~~ | +| `suffix_search` | A function matching the signature of `re.compile(string).search` to match suffixes. ~~Optional[Callable[[str], Optional[Match]]]~~ | +| `infix_finditer` | A function matching the signature of `re.compile(string).finditer` to find infixes. ~~Optional[Callable[[str], Iterator[Match]]]~~ | +| `token_match` | A function matching the signature of `re.compile(string).match` to find token matches. ~~Optional[Callable[[str], Optional[Match]]]~~ | +| `url_match` | A function matching the signature of `re.compile(string).match` to find token matches after considering prefixes and suffixes. ~~Optional[Callable[[str], Optional[Match]]]~~ | ## Tokenizer.\_\_call\_\_ {#call tag="method"} @@ -56,10 +65,10 @@ Tokenize a string. > assert len(tokens) == 4 > ``` -| Name | Type | Description | -| ----------- | ------- | --------------------------------------- | -| `string` | unicode | The string to tokenize. | -| **RETURNS** | `Doc` | A container for linguistic annotations. | +| Name | Description | +| ----------- | ----------------------------------------------- | +| `string` | The string to tokenize. ~~str~~ | +| **RETURNS** | A container for linguistic annotations. 
~~Doc~~ | ## Tokenizer.pipe {#pipe tag="method"} @@ -73,48 +82,48 @@ Tokenize a stream of texts. > pass > ``` -| Name | Type | Description | -| ------------ | ----- | ---------------------------------------------------------------------------- | -| `texts` | - | A sequence of unicode texts. | -| `batch_size` | int | The number of texts to accumulate in an internal buffer. Defaults to `1000`. | -| **YIELDS** | `Doc` | A sequence of Doc objects, in order. | +| Name | Description | +| ------------ | ------------------------------------------------------------------------------------ | +| `texts` | A sequence of unicode texts. ~~Iterable[str]~~ | +| `batch_size` | The number of texts to accumulate in an internal buffer. Defaults to `1000`. ~~int~~ | +| **YIELDS** | The tokenized `Doc` objects, in order. ~~Doc~~ | ## Tokenizer.find_infix {#find_infix tag="method"} Find internal split points of the string. -| Name | Type | Description | -| ----------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------------------- | -| `string` | unicode | The string to split. | -| **RETURNS** | list | A list of `re.MatchObject` objects that have `.start()` and `.end()` methods, denoting the placement of internal segment separators, e.g. hyphens. | +| Name | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `string` | The string to split. ~~str~~ | +| **RETURNS** | A list of `re.MatchObject` objects that have `.start()` and `.end()` methods, denoting the placement of internal segment separators, e.g. hyphens. ~~List[Match]~~ | ## Tokenizer.find_prefix {#find_prefix tag="method"} Find the length of a prefix that should be segmented from the string, or `None` if no prefix rules match. -| Name | Type | Description | -| ----------- | ------- | ------------------------------------------------------ | -| `string` | unicode | The string to segment. | -| **RETURNS** | int | The length of the prefix if present, otherwise `None`. | +| Name | Description | +| ----------- | ------------------------------------------------------------------------ | +| `string` | The string to segment. ~~str~~ | +| **RETURNS** | The length of the prefix if present, otherwise `None`. ~~Optional[int]~~ | ## Tokenizer.find_suffix {#find_suffix tag="method"} Find the length of a suffix that should be segmented from the string, or `None` if no suffix rules match. -| Name | Type | Description | -| ----------- | ------------ | ------------------------------------------------------ | -| `string` | unicode | The string to segment. | -| **RETURNS** | int / `None` | The length of the suffix if present, otherwise `None`. | +| Name | Description | +| ----------- | ------------------------------------------------------------------------ | +| `string` | The string to segment. ~~str~~ | +| **RETURNS** | The length of the suffix if present, otherwise `None`. ~~Optional[int]~~ | ## Tokenizer.add_special_case {#add_special_case tag="method"} Add a special-case tokenization rule. This mechanism is also used to add custom -tokenizer exceptions to the language data. See the usage guide on -[adding languages](/usage/adding-languages#tokenizer-exceptions) and -[linguistic features](/usage/linguistic-features#special-cases) for more details -and examples. +tokenizer exceptions to the language data. 
See the usage guide on the +[languages data](/usage/linguistic-features#language-data) and +[tokenizer special cases](/usage/linguistic-features#special-cases) for more +details and examples. > #### Example > @@ -124,10 +133,10 @@ and examples. > tokenizer.add_special_case("don't", case) > ``` -| Name | Type | Description | -| ------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| `string` | unicode | The string to specially tokenize. | -| `token_attrs` | iterable | A sequence of dicts, where each dict describes a token and its attributes. The `ORTH` fields of the attributes must exactly match the string when they are concatenated. | +| Name | Description | +| ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `string` | The string to specially tokenize. ~~str~~ | +| `token_attrs` | A sequence of dicts, where each dict describes a token and its attributes. The `ORTH` fields of the attributes must exactly match the string when they are concatenated. ~~Iterable[Dict[int, str]]~~ | ## Tokenizer.explain {#explain tag="method"} @@ -143,10 +152,10 @@ produced are identical to `Tokenizer.__call__` except for whitespace tokens. > assert [t[1] for t in tok_exp] == ["(", "do", "n't", ")"] > ``` -| Name | Type | Description | -| ------------| -------- | --------------------------------------------------- | -| `string` | unicode | The string to tokenize with the debugging tokenizer | -| **RETURNS** | list | A list of `(pattern_string, token_string)` tuples | +| Name | Description | +| ----------- | ---------------------------------------------------------------------------- | +| `string` | The string to tokenize with the debugging tokenizer. ~~str~~ | +| **RETURNS** | A list of `(pattern_string, token_string)` tuples. ~~List[Tuple[str, str]]~~ | ## Tokenizer.to_disk {#to_disk tag="method"} @@ -159,10 +168,11 @@ Serialize the tokenizer to disk. > tokenizer.to_disk("/path/to/tokenizer") > ``` -| Name | Type | Description | -| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- | -| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | -| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | +| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | ## Tokenizer.from_disk {#from_disk tag="method"} @@ -175,11 +185,12 @@ Load the tokenizer from disk. Modifies the object in place and returns it. > tokenizer.from_disk("/path/to/tokenizer") > ``` -| Name | Type | Description | -| ----------- | ---------------- | -------------------------------------------------------------------------- | -| `path` | unicode / `Path` | A path to a directory. 
Paths may be either strings or `Path`-like objects. | -| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | -| **RETURNS** | `Tokenizer` | The modified `Tokenizer` object. | +| Name | Description | +| -------------- | ----------------------------------------------------------------------------------------------- | +| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The modified `Tokenizer` object. ~~Tokenizer~~ | ## Tokenizer.to_bytes {#to_bytes tag="method"} @@ -192,10 +203,11 @@ Load the tokenizer from disk. Modifies the object in place and returns it. Serialize the tokenizer to a bytestring. -| Name | Type | Description | -| ----------- | ----- | ------------------------------------------------------------------------- | -| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | -| **RETURNS** | bytes | The serialized form of the `Tokenizer` object. | +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------- | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The serialized form of the `Tokenizer` object. ~~bytes~~ | ## Tokenizer.from_bytes {#from_bytes tag="method"} @@ -210,22 +222,23 @@ it. > tokenizer.from_bytes(tokenizer_bytes) > ``` -| Name | Type | Description | -| ------------ | ----------- | ------------------------------------------------------------------------- | -| `bytes_data` | bytes | The data to load from. | -| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | -| **RETURNS** | `Tokenizer` | The `Tokenizer` object. | +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------- | +| `bytes_data` | The data to load from. ~~bytes~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The `Tokenizer` object. ~~Tokenizer~~ | ## Attributes {#attributes} -| Name | Type | Description | -| ---------------- | ------- | --------------------------------------------------------------------------------------------------------------------------- | -| `vocab` | `Vocab` | The vocab object of the parent `Doc`. | -| `prefix_search` | - | A function to find segment boundaries from the start of a string. Returns the length of the segment, or `None`. | -| `suffix_search` | - | A function to find segment boundaries from the end of a string. Returns the length of the segment, or `None`. | -| `infix_finditer` | - | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) list of `re.MatchObject` objects. | -| `token_match` | - | A function matching the signature of `re.compile(string).match to find token matches. Returns an `re.MatchObject` or `None. | -| `rules` | dict | A dictionary of tokenizer exceptions and special cases. | +| Name | Description | +| ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `vocab` | The vocab object of the parent `Doc`. 
~~Vocab~~ | +| `prefix_search` | A function to find segment boundaries from the start of a string. Returns the length of the segment, or `None`. ~~Optional[Callable[[str], Optional[Match]]]~~ | +| `suffix_search` | A function to find segment boundaries from the end of a string. Returns the length of the segment, or `None`. ~~Optional[Callable[[str], Optional[Match]]]~~ | +| `infix_finditer` | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) sequence of `re.MatchObject` objects. ~~Optional[Callable[[str], Iterator[Match]]]~~ | +| `token_match` | A function matching the signature of `re.compile(string).match` to find token matches. Returns an `re.MatchObject` or `None`. ~~Optional[Callable[[str], Optional[Match]]]~~ | +| `rules` | A dictionary of tokenizer exceptions and special cases. ~~Optional[Dict[str, List[Dict[int, str]]]]~~ | ## Serialization fields {#serialization-fields} diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md index 217c51794..eb2eb5d71 100644 --- a/website/docs/api/top-level.md +++ b/website/docs/api/top-level.md @@ -3,111 +3,115 @@ title: Top-level Functions menu: - ['spacy', 'spacy'] - ['displacy', 'displacy'] + - ['registry', 'registry'] + - ['Loggers', 'loggers'] + - ['Readers', 'readers'] + - ['Batchers', 'batchers'] + - ['Augmenters', 'augmenters'] + - ['Training & Alignment', 'gold'] - ['Utility Functions', 'util'] - - ['Compatibility', 'compat'] --- ## spaCy {#spacy hidden="true"} -### spacy.load {#spacy.load tag="function" model="any"} +### spacy.load {#spacy.load tag="function"} -Load a model via its [shortcut link](/usage/models#usage), the name of an -installed [model package](/usage/training#models-generating), a unicode path or -a `Path`-like object. spaCy will try resolving the load argument in this order. -If a model is loaded from a shortcut link or package name, spaCy will assume -it's a Python package and import it and call the model's own `load()` method. If -a model is loaded from a path, spaCy will assume it's a data directory, read the -language and pipeline settings off the meta.json and initialize the `Language` -class. The data will be loaded in via +Load a pipeline using the name of an installed +[package](/usage/saving-loading#models), a string path or a `Path`-like object. +spaCy will try resolving the load argument in this order. If a pipeline is +loaded from a string name, spaCy will assume it's a Python package and import it +and call the package's own `load()` method. If a pipeline is loaded from a path, +spaCy will assume it's a data directory, load its +[`config.cfg`](/api/data-formats#config) and use the language and pipeline +information to construct the `Language` class. The data will be loaded in via [`Language.from_disk`](/api/language#from_disk). + + +As of v3.0, the `disable` keyword argument specifies components to load but +disable, instead of components to not load at all. Those components can now be +specified separately using the new `exclude` keyword argument. 
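Assuming the `en_core_web_sm` package is installed, a minimal sketch of the difference between the two arguments might look like this:

```python
import spacy

# "disable": the component is loaded but skipped when processing texts,
# and can be switched back on later without reloading the pipeline
nlp = spacy.load("en_core_web_sm", disable=["ner"])
nlp.enable_pipe("ner")

# "exclude": the component is not loaded at all
nlp = spacy.load("en_core_web_sm", exclude=["ner"])
assert "ner" not in nlp.pipe_names
```
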
+ + + > #### Example > > ```python -> nlp = spacy.load("en") # shortcut link > nlp = spacy.load("en_core_web_sm") # package -> nlp = spacy.load("/path/to/en") # unicode path -> nlp = spacy.load(Path("/path/to/en")) # pathlib Path +> nlp = spacy.load("/path/to/pipeline") # string path +> nlp = spacy.load(Path("/path/to/pipeline")) # pathlib Path > -> nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger"]) +> nlp = spacy.load("en_core_web_sm", exclude=["parser", "tagger"]) > ``` -| Name | Type | Description | -| ----------- | ---------------- | --------------------------------------------------------------------------------- | -| `name` | unicode / `Path` | Model to load, i.e. shortcut link, package name or path. | -| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). | -| **RETURNS** | `Language` | A `Language` object with the loaded model. | +| Name | Description | +| ------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | Pipeline to load, i.e. package name or path. ~~Union[str, Path]~~ | +| _keyword-only_ | | +| `disable` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). Disabled pipes will be loaded but they won't be run unless you explicitly enable them by calling [nlp.enable_pipe](/api/language#enable_pipe). ~~List[str]~~ | +| `exclude` 3 | Names of pipeline components to [exclude](/usage/processing-pipelines#disabling). Excluded components won't be loaded. ~~List[str]~~ | +| `config` 3 | Optional config overrides, either as nested dict or dict keyed by section value in dot notation, e.g. `"components.name.value"`. ~~Union[Dict[str, Any], Config]~~ | +| **RETURNS** | A `Language` object with the loaded pipeline. ~~Language~~ | -Essentially, `spacy.load()` is a convenience wrapper that reads the language ID -and pipeline components from a model's `meta.json`, initializes the `Language` -class, loads in the model data and returns it. +Essentially, `spacy.load()` is a convenience wrapper that reads the pipeline's +[`config.cfg`](/api/data-formats#config), uses the language and pipeline +information to construct a `Language` object, loads in the model data and +weights, and returns it. ```python ### Abstract example -cls = util.get_lang_class(lang) # get language for ID, e.g. 'en' -nlp = cls() # initialise the language -for name in pipeline: component = nlp.create_pipe(name) # create each pipeline component nlp.add_pipe(component) # add component to pipeline -nlp.from_disk(model_data_path) # load in model data +cls = spacy.util.get_lang_class(lang) # 1. Get Language class, e.g. English +nlp = cls() # 2. Initialize it +for name in pipeline: + nlp.add_pipe(name) # 3. Add the component to the pipeline +nlp.from_disk(data_path) # 4. Load in the binary data ``` - - -As of spaCy 2.0, the `path` keyword argument is deprecated. spaCy will also -raise an error if no model could be loaded and never just return an empty -`Language` object. If you need a blank language, you can use the new function -[`spacy.blank()`](/api/top-level#spacy.blank) or import the class explicitly, -e.g. `from spacy.lang.en import English`. 
- -```diff -- nlp = spacy.load("en", path="/model") -+ nlp = spacy.load("/model") -``` - - - ### spacy.blank {#spacy.blank tag="function" new="2"} -Create a blank model of a given language class. This function is the twin of +Create a blank pipeline of a given language class. This function is the twin of `spacy.load()`. > #### Example > > ```python -> nlp_en = spacy.blank("en") -> nlp_de = spacy.blank("de") +> nlp_en = spacy.blank("en") # equivalent to English() +> nlp_de = spacy.blank("de") # equivalent to German() > ``` -| Name | Type | Description | -| ----------- | ---------- | ------------------------------------------------------------------------------------------------ | -| `name` | unicode | [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) of the language class to load. | -| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). | -| **RETURNS** | `Language` | An empty `Language` object of the appropriate subclass. | +| Name | Description | +| ----------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `name` | [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) of the language class to load. ~~str~~ | +| _keyword-only_ | | +| `vocab` 3 | Optional shared vocab to pass in on initialization. If `True` (default), a new `Vocab` object will be created. ~~Union[Vocab, bool]~~. | +| `config` 3 | Optional config overrides, either as nested dict or dict keyed by section value in dot notation, e.g. `"components.name.value"`. ~~Union[Dict[str, Any], Config]~~ | +| `meta` 3 | Optional meta overrides for [`nlp.meta`](/api/language#meta). ~~Dict[str, Any]~~ | +| **RETURNS** | An empty `Language` object of the appropriate subclass. ~~Language~~ | -#### spacy.info {#spacy.info tag="function"} +### spacy.info {#spacy.info tag="function"} The same as the [`info` command](/api/cli#info). Pretty-print information about -your installation, models and local setup from within spaCy. To get the model -meta data as a dictionary instead, you can use the `meta` attribute on your -`nlp` object with a loaded model, e.g. `nlp.meta`. +your installation, installed pipelines and local setup from within spaCy. > #### Example > > ```python > spacy.info() -> spacy.info("en") -> spacy.info("de", markdown=True) +> spacy.info("en_core_web_sm") +> markdown = spacy.info(markdown=True, silent=True) > ``` -| Name | Type | Description | -| ---------- | ------- | ------------------------------------------------------------- | -| `model` | unicode | A model, i.e. shortcut link, package name or path (optional). | -| `markdown` | bool | Print information as Markdown. | +| Name | Description | +| -------------- | ---------------------------------------------------------------------------- | +| `model` | Optional pipeline, i.e. a package name or path (optional). ~~Optional[str]~~ | +| _keyword-only_ | | +| `markdown` | Print information as Markdown. ~~bool~~ | +| `silent` | Don't print anything, just return. ~~bool~~ | ### spacy.explain {#spacy.explain tag="function"} Get a description for a given POS tag, dependency label or entity type. For a -list of available terms, see -[`glossary.py`](https://github.com/explosion/spaCy/tree/master/spacy/glossary.py). +list of available terms, see [`glossary.py`](%%GITHUB_SPACY/spacy/glossary.py). 
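+
+For instance, you can combine `spacy.explain` with a pipeline component's
+`labels` property to print a short description for every label the component
+predicts. The snippet below is a minimal sketch and assumes the
+`en_core_web_sm` package is installed.
+
+```python
+import spacy
+
+nlp = spacy.load("en_core_web_sm")
+ner = nlp.get_pipe("ner")
+for label in ner.labels:
+    # Look up a human-readable description for each entity label
+    print(label, spacy.explain(label))
+```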
> #### Example > @@ -122,17 +126,17 @@ list of available terms, see > # world NN noun, singular or mass > ``` -| Name | Type | Description | -| ----------- | ------- | -------------------------------------------------------- | -| `term` | unicode | Term to explain. | -| **RETURNS** | unicode | The explanation, or `None` if not found in the glossary. | +| Name | Description | +| ----------- | -------------------------------------------------------------------------- | +| `term` | Term to explain. ~~str~~ | +| **RETURNS** | The explanation, or `None` if not found in the glossary. ~~Optional[str]~~ | ### spacy.prefer_gpu {#spacy.prefer_gpu tag="function" new="2.0.14"} Allocate data and perform operations on [GPU](/usage/#gpu), if available. If data has already been allocated on CPU, it will not be moved. Ideally, this function should be called right after importing spaCy and _before_ loading any -models. +pipelines. > #### Example > @@ -142,16 +146,17 @@ models. > nlp = spacy.load("en_core_web_sm") > ``` -| Name | Type | Description | -| ----------- | ---- | ------------------------------ | -| **RETURNS** | bool | Whether the GPU was activated. | +| Name | Description | +| ----------- | ------------------------------------------------ | +| `gpu_id` | Device index to select. Defaults to `0`. ~~int~~ | +| **RETURNS** | Whether the GPU was activated. ~~bool~~ | ### spacy.require_gpu {#spacy.require_gpu tag="function" new="2.0.14"} Allocate data and perform operations on [GPU](/usage/#gpu). Will raise an error if no GPU is available. If data has already been allocated on CPU, it will not be moved. Ideally, this function should be called right after importing spaCy -and _before_ loading any models. +and _before_ loading any pipelines. > #### Example > @@ -161,9 +166,10 @@ and _before_ loading any models. > nlp = spacy.load("en_core_web_sm") > ``` -| Name | Type | Description | -| ----------- | ---- | ----------- | -| **RETURNS** | bool | `True` | +| Name | Description | +| ----------- | ------------------------------------------------ | +| `gpu_id` | Device index to select. Defaults to `0`. ~~int~~ | +| **RETURNS** | `True` ~~bool~~ | ## displaCy {#displacy source="spacy/displacy"} @@ -186,16 +192,16 @@ browser. Will run a simple web server. > displacy.serve([doc1, doc2], style="dep") > ``` -| Name | Type | Description | Default | -| --------- | ------------------- | ------------------------------------------------------------------------------------------------------------------------------------ | ----------- | -| `docs` | list, `Doc`, `Span` | Document(s) to visualize. | -| `style` | unicode | Visualization style, `'dep'` or `'ent'`. | `'dep'` | -| `page` | bool | Render markup as full HTML page. | `True` | -| `minify` | bool | Minify HTML markup. | `False` | -| `options` | dict | [Visualizer-specific options](#displacy_options), e.g. colors. | `{}` | -| `manual` | bool | Don't parse `Doc` and instead, expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. | `False` | -| `port` | int | Port to serve visualization. | `5000` | -| `host` | unicode | Host to serve visualization. | `'0.0.0.0'` | +| Name | Description | +| --------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `docs` | Document(s) or span(s) to visualize. ~~Union[Iterable[Union[Doc, Span]], Doc, Span]~~ | +| `style` | Visualization style, `"dep"` or `"ent"`. 
Defaults to `"dep"`. ~~str~~ |
+| `page` | Render markup as full HTML page. Defaults to `True`. ~~bool~~ |
+| `minify` | Minify HTML markup. Defaults to `False`. ~~bool~~ |
+| `options` | [Visualizer-specific options](#displacy_options), e.g. colors. ~~Dict[str, Any]~~ |
+| `manual` | Don't parse `Doc` and instead expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. Defaults to `False`. ~~bool~~ |
+| `port` | Port to serve visualization. Defaults to `5000`. ~~int~~ |
+| `host` | Host to serve visualization. Defaults to `"0.0.0.0"`. ~~str~~ |
 
 ### displacy.render {#displacy.render tag="method" new="2"}
 
@@ -211,16 +217,16 @@ Render a dependency parse tree or named entity visualization.
 > html = displacy.render(doc, style="dep")
 > ```
 
-| Name | Type | Description | Default |
-| ----------- | ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
-| `docs` | list, `Doc`, `Span` | Document(s) to visualize. |
-| `style` | unicode | Visualization style, `'dep'` or `'ent'`. | `'dep'` |
-| `page` | bool | Render markup as full HTML page. | `False` |
-| `minify` | bool | Minify HTML markup. | `False` |
-| `jupyter` | bool | Explicitly enable or disable "[Jupyter](http://jupyter.org/) mode" to return markup ready to be rendered in a notebook. Detected automatically if `None`. | `None` |
-| `options` | dict | [Visualizer-specific options](#displacy_options), e.g. colors. | `{}` |
-| `manual` | bool | Don't parse `Doc` and instead, expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. | `False` |
-| **RETURNS** | unicode | Rendered HTML markup. |
+| Name | Description |
+| ----------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `docs` | Document(s) or span(s) to visualize. ~~Union[Iterable[Union[Doc, Span]], Doc, Span]~~ |
+| `style` | Visualization style, `"dep"` or `"ent"`. Defaults to `"dep"`. ~~str~~ |
+| `page` | Render markup as full HTML page. Defaults to `False`. ~~bool~~ |
+| `minify` | Minify HTML markup. Defaults to `False`. ~~bool~~ |
+| `options` | [Visualizer-specific options](#displacy_options), e.g. colors. ~~Dict[str, Any]~~ |
+| `manual` | Don't parse `Doc` and instead expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. Defaults to `False`. ~~bool~~ |
+| `jupyter` | Explicitly enable or disable "[Jupyter](http://jupyter.org/) mode" to return markup ready to be rendered in a notebook. Detected automatically if `None` (default). ~~Optional[bool]~~ |
+| **RETURNS** | The rendered HTML markup. ~~str~~ |
 
 ### Visualizer options {#displacy_options}
 
@@ -236,22 +242,22 @@ If a setting is not present in the options, the default value will be used.
 > displacy.serve(doc, style="dep", options=options)
 > ```
 
-| Name | Type | Description | Default |
-| ------------------------------------------ | ------- | ------------------------------------------------------------------------------------------------------------------- | ----------------------- |
-| `fine_grained` | bool | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). | `False` |
-| `add_lemma` 2.2.4 | bool | Print the lemma's in a separate row below the token texts.
| `False` | -| `collapse_punct` | bool | Attach punctuation to tokens. Can make the parse more readable, as it prevents long arcs to attach punctuation. | `True` | -| `collapse_phrases` | bool | Merge noun phrases into one token. | `False` | -| `compact` | bool | "Compact mode" with square arrows that takes up less space. | `False` | -| `color` | unicode | Text color (HEX, RGB or color names). | `'#000000'` | -| `bg` | unicode | Background color (HEX, RGB or color names). | `'#ffffff'` | -| `font` | unicode | Font name or font family for all text. | `'Arial'` | -| `offset_x` | int | Spacing on left side of the SVG in px. | `50` | -| `arrow_stroke` | int | Width of arrow path in px. | `2` | -| `arrow_width` | int | Width of arrow head in px. | `10` / `8` (compact) | -| `arrow_spacing` | int | Spacing between arrows in px to avoid overlaps. | `20` / `12` (compact) | -| `word_spacing` | int | Vertical spacing between words and arcs in px. | `45` | -| `distance` | int | Distance between words in px. | `175` / `150` (compact) | +| Name | Description | +| ------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------- | +| `fine_grained` | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). Defaults to `False`. ~~bool~~ | +| `add_lemma` 2.2.4 | Print the lemmas in a separate row below the token texts. Defaults to `False`. ~~bool~~ | +| `collapse_punct` | Attach punctuation to tokens. Can make the parse more readable, as it prevents long arcs to attach punctuation. Defaults to `True`. ~~bool~~ | +| `collapse_phrases` | Merge noun phrases into one token. Defaults to `False`. ~~bool~~ | +| `compact` | "Compact mode" with square arrows that takes up less space. Defaults to `False`. ~~bool~~ | +| `color` | Text color (HEX, RGB or color names). Defaults to `"#000000"`. ~~str~~ | +| `bg` | Background color (HEX, RGB or color names). Defaults to `"#ffffff"`. ~~str~~ | +| `font` | Font name or font family for all text. Defaults to `"Arial"`. ~~str~~ | +| `offset_x` | Spacing on left side of the SVG in px. Defaults to `50`. ~~int~~ | +| `arrow_stroke` | Width of arrow path in px. Defaults to `2`. ~~int~~ | +| `arrow_width` | Width of arrow head in px. Defaults to `10` in regular mode and `8` in compact mode. ~~int~~ | +| `arrow_spacing` | Spacing between arrows in px to avoid overlaps. Defaults to `20` in regular mode and `12` in compact mode. ~~int~~ | +| `word_spacing` | Vertical spacing between words and arcs in px. Defaults to `45`. ~~int~~ | +| `distance` | Distance between words in px. Defaults to `175` in regular mode and `150` in compact mode. ~~int~~ | #### Named Entity Visualizer options {#displacy_options-ent} @@ -263,61 +269,588 @@ If a setting is not present in the options, the default value will be used. > displacy.serve(doc, style="ent", options=options) > ``` -| Name | Type | Description | Default | -| --------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------ | -| `ents` | list | Entity types to highlight (`None` for all types). | `None` | -| `colors` | dict | Color overrides. Entity types in uppercase should be mapped to color names or values. 
| `{}` | -| `template` 2.2 | unicode | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use `{bg}`, `{text}` and `{label}`. | see [`templates.py`](https://github.com/explosion/spaCy/blob/master/spacy/displacy/templates.py) | +| Name | Description | +| --------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `ents` | Entity types to highlight or `None` for all types (default). ~~Optional[List[str]]~~ | +| `colors` | Color overrides. Entity types should be mapped to color names or values. ~~Dict[str, str]~~ | +| `template` 2.2 | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use `{bg}`, `{text}` and `{label}`. See [`templates.py`](%%GITHUB_SPACY/spacy/displacy/templates.py) for examples. ~~Optional[str]~~ | -By default, displaCy comes with colors for all -[entity types supported by spaCy](/api/annotation#named-entities). If you're -using custom entity types, you can use the `colors` setting to add your own -colors for them. Your application or model package can also expose a +By default, displaCy comes with colors for all entity types used by +[spaCy's trained pipelines](/models). If you're using custom entity types, you +can use the `colors` setting to add your own colors for them. Your application +or pipeline package can also expose a [`spacy_displacy_colors` entry point](/usage/saving-loading#entry-points-displacy) to add custom labels and their colors automatically. -## Utility functions {#util source="spacy/util.py"} +## registry {#registry source="spacy/util.py" new="3"} -spaCy comes with a small collection of utility functions located in -[`spacy/util.py`](https://github.com/explosion/spaCy/tree/master/spacy/util.py). -Because utility functions are mostly intended for **internal use within spaCy**, -their behavior may change with future releases. The functions documented on this -page should be safe to use and we'll try to ensure backwards compatibility. -However, we recommend having additional tests in place if your application -depends on any of spaCy's utilities. - -### util.get_data_path {#util.get_data_path tag="function"} - -Get path to the data directory where spaCy looks for models. Defaults to -`spacy/data`. - -| Name | Type | Description | -| ---------------- | --------------- | ------------------------------------------------------- | -| `require_exists` | bool | Only return path if it exists, otherwise return `None`. | -| **RETURNS** | `Path` / `None` | Data path or `None`. | - -### util.set_data_path {#util.set_data_path tag="function"} - -Set custom path to the data directory where spaCy looks for models. +spaCy's function registry extends +[Thinc's `registry`](https://thinc.ai/docs/api-config#registry) and allows you +to map strings to functions. You can register functions to create architectures, +optimizers, schedules and more, and then refer to them and set their arguments +in your [config file](/usage/training#config). Python type hints are used to +validate the inputs. See the +[Thinc docs](https://thinc.ai/docs/api-config#registry) for details on the +`registry` methods and our helper library +[`catalogue`](https://github.com/explosion/catalogue) for some background on the +concept of function registries. 
spaCy also uses the function registry for +language subclasses, model architecture, lookups and pipeline component +factories. > #### Example > > ```python -> util.set_data_path("/custom/path") -> util.get_data_path() -> # PosixPath('/custom/path') +> from typing import Iterator +> import spacy +> +> @spacy.registry.schedules("waltzing.v1") +> def waltzing() -> Iterator[float]: +> i = 0 +> while True: +> yield i % 3 + 1 +> i += 1 > ``` -| Name | Type | Description | -| ------ | ---------------- | --------------------------- | -| `path` | unicode / `Path` | Path to new data directory. | +| Registry name | Description | +| ----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `architectures` | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. | +| `augmenters` | Registry for functions that create [data augmentation](#augmenters) callbacks for corpora and other training data iterators. | +| `batchers` | Registry for training and evaluation [data batchers](#batchers). | +| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. | +| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). | +| `factories` | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points). | +| `initializers` | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). | +| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). | +| `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). | +| `loggers` | Registry for functions that log [training results](/usage/training). | +| `lookups` | Registry for large lookup tables available via `vocab.lookups`. | +| `losses` | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). | +| `misc` | Registry for miscellaneous functions that return data assets, knowledge bases or anything else you may need. | +| `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). | +| `readers` | Registry for file and data readers, including training and evaluation data readers like [`Corpus`](/api/corpus). | +| `schedules` | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). | +| `tokenizers` | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable. | + +### spacy-transformers registry {#registry-transformers} + +The following registries are added by the +[`spacy-transformers`](https://github.com/explosion/spacy-transformers) package. +See the [`Transformer`](/api/transformer) API reference and +[usage docs](/usage/embeddings-transformers) for details. 
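+
+For instance, a custom span getter can be registered in much the same way as
+the annotation setter shown in the example that follows. The snippet below is a
+minimal sketch with a hypothetical registry name; it simply hands each
+document's sentences to the transformer.
+
+```python
+import spacy_transformers
+
+@spacy_transformers.registry.span_getters("custom_sent_spans.v1")  # hypothetical name
+def configure_get_sent_spans():
+    def get_sent_spans(docs):
+        # Return one list of spans per Doc (here: the sentences)
+        return [list(doc.sents) for doc in docs]
+    return get_sent_spans
+```
+
+Note that `doc.sents` requires sentence boundaries to be set, e.g. by a
+component earlier in the pipeline.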
+ +> #### Example +> +> ```python +> import spacy_transformers +> +> @spacy_transformers.registry.annotation_setters("my_annotation_setter.v1") +> def configure_custom_annotation_setter(): +> def annotation_setter(docs, trf_data) -> None: +> # Set annotations on the docs +> +> return annotation_setter +> ``` + +| Registry name | Description | +| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| [`span_getters`](/api/transformer#span_getters) | Registry for functions that take a batch of `Doc` objects and return a list of `Span` objects to process by the transformer, e.g. sentences. | +| [`annotation_setters`](/api/transformer#annotation_setters) | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. | + +## Loggers {#loggers source="spacy/training/loggers.py" new="3"} + +A logger records the training results. When a logger is created, two functions +are returned: one for logging the information for each training step, and a +second function that is called to finalize the logging when the training is +finished. To log each training step, a +[dictionary](/usage/training#custom-logging) is passed on from the +[`spacy train`](/api/cli#train), including information such as the training loss +and the accuracy scores on the development set. + +There are two built-in logging functions: a logger printing results to the +console in tabular format (which is the default), and one that also sends the +results to a [Weights & Biases](https://www.wandb.com/) dashboard. Instead of +using one of the built-in loggers listed here, you can also +[implement your own](/usage/training#custom-logging). + +#### spacy.ConsoleLogger.v1 {#ConsoleLogger tag="registered function"} + +> #### Example config +> +> ```ini +> [training.logger] +> @loggers = "spacy.ConsoleLogger.v1" +> ``` + +Writes the results of a training step to the console in a tabular format. + + + +```cli +$ python -m spacy train config.cfg +``` + +``` +ℹ Using CPU +ℹ Loading config and nlp from: config.cfg +ℹ Pipeline: ['tok2vec', 'tagger'] +ℹ Start training +ℹ Training. Initial learn rate: 0.0 + +E # LOSS TOK2VEC LOSS TAGGER TAG_ACC SCORE +--- ------ ------------ ----------- ------- ------ + 1 0 0.00 86.20 0.22 0.00 + 1 200 3.08 18968.78 34.00 0.34 + 1 400 31.81 22539.06 33.64 0.34 + 1 600 92.13 22794.91 43.80 0.44 + 1 800 183.62 21541.39 56.05 0.56 + 1 1000 352.49 25461.82 65.15 0.65 + 1 1200 422.87 23708.82 71.84 0.72 + 1 1400 601.92 24994.79 76.57 0.77 + 1 1600 662.57 22268.02 80.20 0.80 + 1 1800 1101.50 28413.77 82.56 0.83 + 1 2000 1253.43 28736.36 85.00 0.85 + 1 2200 1411.02 28237.53 87.42 0.87 + 1 2400 1605.35 28439.95 88.70 0.89 +``` + +Note that the cumulative loss keeps increasing within one epoch, but should +start decreasing across epochs. + + + +#### spacy.WandbLogger.v1 {#WandbLogger tag="registered function"} + +> #### Installation +> +> ```bash +> $ pip install wandb +> $ wandb login +> ``` + +Built-in logger that sends the results of each training step to the dashboard of +the [Weights & Biases](https://www.wandb.com/) tool. 
To use this logger, Weights
+& Biases should be installed, and you should be logged in. The logger will send
+the full config file to W&B, as well as various system information such as
+memory utilization, network traffic, disk IO, GPU statistics, etc. This will
+also include information such as your hostname and operating system, as well as
+the location of your Python executable.
+
+
+
+Note that by default, the full (interpolated)
+[training config](/usage/training#config) is sent over to the W&B dashboard. If
+you prefer to **exclude certain information** such as path names, you can list
+those fields in "dot notation" in the `remove_config_values` parameter. These
+fields will then be removed from the config before uploading, but will otherwise
+remain in the config file stored on your local system.
+
+
+
+> #### Example config
+>
+> ```ini
+> [training.logger]
+> @loggers = "spacy.WandbLogger.v1"
+> project_name = "monitor_spacy_training"
+> remove_config_values = ["paths.train", "paths.dev", "corpora.train.path", "corpora.dev.path"]
+> ```
+
+| Name | Description |
+| ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
+| `project_name` | The name of the project in the Weights & Biases interface. The project will be created automatically if it doesn't exist yet. ~~str~~ |
+| `remove_config_values` | A list of values to exclude from the config before it is uploaded to W&B (default: empty). ~~List[str]~~ |
+
+
+
+Get started with tracking your spaCy training runs in Weights & Biases using our
+project template. It trains on the IMDB Movie Review Dataset and includes a
+simple config with the built-in `WandbLogger`, as well as a custom example of
+creating variants of the config for a simple hyperparameter grid search and
+logging the results.
+
+
+
+## Readers {#readers}
+
+### File readers {#file-readers source="github.com/explosion/srsly" new="3"}
+
+The following file readers are provided by our serialization library
+[`srsly`](https://github.com/explosion/srsly). All registered functions take one
+argument `path`, pointing to the file path to load.
+
+> #### Example config
+>
+> ```ini
+> [corpora.train.augmenter.orth_variants]
+> @readers = "srsly.read_json.v1"
+> path = "corpus/en_orth_variants.json"
+> ```
+
+| Name | Description |
+| ----------------------- | ----------------------------------------------------- |
+| `srsly.read_json.v1` | Read data from a JSON file. |
+| `srsly.read_jsonl.v1` | Read data from a JSONL (newline-delimited JSON) file. |
+| `srsly.read_yaml.v1` | Read data from a YAML file. |
+| `srsly.read_msgpack.v1` | Read data from a binary MessagePack file. |
+
+
+
+Since the file readers expect a local path, you should only use them in config
+blocks that are **not executed at runtime** – for example, in `[training]` and
+`[corpora]` (to load data or resources like data augmentation tables) or in
+`[initialize]` (to pass data to pipeline components).
+
+
+
+#### spacy.read_labels.v1 {#read_labels tag="registered function"}
+
+Read a JSON-formatted labels file generated with
+[`init labels`](/api/cli#init-labels). Typically used in the
+[`[initialize]`](/api/data-formats#config-initialize) block of the training
+config to speed up the model initialization process and provide pre-generated
+label sets.
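+
+The labels file is plain JSON, so you can also inspect it directly, for example
+with `srsly`. This is a rough sketch using the path from the example config
+below:
+
+```python
+import srsly
+
+# Inspect a labels file previously generated with `init labels`
+labels = srsly.read_json("corpus/labels/ner.json")
+print(labels)
+```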
+
+> #### Example config
+>
+> ```ini
+> [initialize.components]
+>
+> [initialize.components.ner]
+>
+> [initialize.components.ner.labels]
+> @readers = "spacy.read_labels.v1"
+> path = "corpus/labels/ner.json"
+> ```
+
+| Name | Description |
+| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `path` | The path to the labels file generated with [`init labels`](/api/cli#init-labels). ~~Path~~ |
+| `require` | Whether to require the file to exist. If set to `False` and the labels file doesn't exist, the loader will return `None` and the `initialize` method will extract the labels from the data. Defaults to `False`. ~~bool~~ |
+| **CREATES** | The list of labels. ~~List[str]~~ |
+
+### Corpus readers {#corpus-readers source="spacy/training/corpus.py" new="3"}
+
+Corpus readers are registered functions that load data and return a function
+that takes the current `nlp` object and yields [`Example`](/api/example) objects
+that can be used for [training](/usage/training) and
+[pretraining](/usage/embeddings-transformers#pretraining). You can replace it
+with your own registered function in the
+[`@readers` registry](/api/top-level#registry) to customize the data loading and
+streaming.
+
+#### spacy.Corpus.v1 {#corpus tag="registered function"}
+
+The `Corpus` reader manages annotated corpora and can be used for training and
+development datasets in the [DocBin](/api/docbin) (`.spacy`) format. Also see
+the [`Corpus`](/api/corpus) class.
+
+> #### Example config
+>
+> ```ini
+> [paths]
+> train = "corpus/train.spacy"
+>
+> [corpora.train]
+> @readers = "spacy.Corpus.v1"
+> path = ${paths.train}
+> gold_preproc = false
+> max_length = 0
+> limit = 0
+> ```
+
+| Name | Description |
+| --------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `path` | The directory or filename to read from. Expects data in spaCy's binary [`.spacy` format](/api/data-formats#binary-training). ~~Union[str, Path]~~ |
+| `gold_preproc` | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See [`Corpus`](/api/corpus#init) for details. ~~bool~~ |
+| `max_length` | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~ |
+| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~ |
+| `augmenter` | Apply some simple data augmentation, where we replace tokens with variations. This is especially useful for punctuation and case replacement, to help generalize beyond corpora that don't have smart quotes, or only have smart quotes, etc. Defaults to `None`. ~~Optional[Callable]~~ |
+| **CREATES** | The corpus reader. ~~Corpus~~ |
+
+#### spacy.JsonlCorpus.v1 {#jsonlcorpus tag="registered function"}
+
+Create [`Example`](/api/example) objects from a JSONL (newline-delimited JSON)
+file of texts keyed by `"text"`. Can be used to read the raw text corpus for
+language model [pretraining](/usage/embeddings-transformers#pretraining) from a
+JSONL file. Also see the [`JsonlCorpus`](/api/corpus#jsonlcorpus) class.
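+
+For reference, such a JSONL file can be created with `srsly`, which writes one
+JSON object per line. This is a minimal sketch using the path from the example
+config below.
+
+```python
+import srsly
+
+# Each record is a JSON object with a "text" key, one per line
+records = [
+    {"text": "This is the first raw text."},
+    {"text": "And here is another raw text."},
+]
+srsly.write_jsonl("corpus/raw_text.jsonl", records)
+```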
+ +> #### Example config +> +> ```ini +> [paths] +> pretrain = "corpus/raw_text.jsonl" +> +> [corpora.pretrain] +> @readers = "spacy.JsonlCorpus.v1" +> path = ${paths.pretrain} +> min_length = 0 +> max_length = 0 +> limit = 0 +> ``` + +| Name | Description | +| ------------ | -------------------------------------------------------------------------------------------------------------------------------- | +| `path` | The directory or filename to read from. Expects newline-delimited JSON with a key `"text"` for each record. ~~Union[str, Path]~~ | +| `min_length` | Minimum document length (in tokens). Shorter documents will be skipped. Defaults to `0`, which indicates no limit. ~~int~~ | +| `max_length` | Maximum document length (in tokens). Longer documents will be skipped. Defaults to `0`, which indicates no limit. ~~int~~ | +| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~ | +| **CREATES** | The corpus reader. ~~JsonlCorpus~~ | + +## Batchers {#batchers source="spacy/training/batchers.py" new="3"} + +A data batcher implements a batching strategy that essentially turns a stream of +items into a stream of batches, with each batch consisting of one item or a list +of items. During training, the models update their weights after processing one +batch at a time. Typical batching strategies include presenting the training +data as a stream of batches with similar sizes, or with increasing batch sizes. +See the Thinc documentation on +[`schedules`](https://thinc.ai/docs/api-schedules) for a few standard examples. + +Instead of using one of the built-in batchers listed here, you can also +[implement your own](/usage/training#custom-code-readers-batchers), which may or +may not use a custom schedule. + +### spacy.batch_by_words.v1 {#batch_by_words tag="registered function"} + +Create minibatches of roughly a given number of words. If any examples are +longer than the specified batch length, they will appear in a batch by +themselves, or be discarded if `discard_oversize` is set to `True`. The argument +`docs` can be a list of strings, [`Doc`](/api/doc) objects or +[`Example`](/api/example) objects. + +> #### Example config +> +> ```ini +> [training.batcher] +> @batchers = "spacy.batch_by_words.v1" +> size = 100 +> tolerance = 0.2 +> discard_oversize = false +> get_length = null +> ``` + +| Name | Description | +| ------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `seqs` | The sequences to minibatch. ~~Iterable[Any]~~ | +| `size` | The target number of words per batch. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). ~~Union[int, Sequence[int]]~~ | +| `tolerance` | What percentage of the size to allow batches to exceed. ~~float~~ | +| `discard_oversize` | Whether to discard sequences that by themselves exceed the tolerated size. ~~bool~~ | +| `get_length` | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. ~~Optional[Callable[[Any], int]]~~ | +| **CREATES** | The batcher that takes an iterable of items and returns batches. 
~~Callable[[Iterable[Any]], Iterable[List[Any]]]~~ | + +### spacy.batch_by_sequence.v1 {#batch_by_sequence tag="registered function"} + +> #### Example config +> +> ```ini +> [training.batcher] +> @batchers = "spacy.batch_by_sequence.v1" +> size = 32 +> get_length = null +> ``` + +Create a batcher that creates batches of the specified size. + +| Name | Description | +| ------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `size` | The target number of items per batch. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). ~~Union[int, Sequence[int]]~~ | +| `get_length` | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. ~~Optional[Callable[[Any], int]]~~ | +| **CREATES** | The batcher that takes an iterable of items and returns batches. ~~Callable[[Iterable[Any]], Iterable[List[Any]]]~~ | + +### spacy.batch_by_padded.v1 {#batch_by_padded tag="registered function"} + +> #### Example config +> +> ```ini +> [training.batcher] +> @batchers = "spacy.batch_by_padded.v1" +> size = 100 +> buffer = 256 +> discard_oversize = false +> get_length = null +> ``` + +Minibatch a sequence by the size of padded batches that would result, with +sequences binned by length within a window. The padded size is defined as the +maximum length of sequences within the batch multiplied by the number of +sequences in the batch. + +| Name | Description | +| ------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `size` | The largest padded size to batch sequences into. Can also be a block referencing a schedule, e.g. [`compounding`](https://thinc.ai/docs/api-schedules/#compounding). ~~Union[int, Sequence[int]]~~ | +| `buffer` | The number of sequences to accumulate before sorting by length. A larger buffer will result in more even sizing, but if the buffer is very large, the iteration order will be less random, which can result in suboptimal training. ~~int~~ | +| `discard_oversize` | Whether to discard sequences that are by themselves longer than the largest padded batch size. ~~bool~~ | +| `get_length` | Optional function that receives a sequence item and returns its length. Defaults to the built-in `len()` if not set. ~~Optional[Callable[[Any], int]]~~ | +| **CREATES** | The batcher that takes an iterable of items and returns batches. ~~Callable[[Iterable[Any]], Iterable[List[Any]]]~~ | + +## Augmenters {#augmenters source="spacy/training/augment.py" new="3"} + +Data augmentation is the process of applying small modifications to the training +data. It can be especially useful for punctuation and case replacement – for +example, if your corpus only uses smart quotes and you want to include +variations using regular quotes, or to make the model less sensitive to +capitalization by including a mix of capitalized and lowercase examples. See the +[usage guide](/usage/training#data-augmentation) for details and examples. 
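+
+To illustrate the expected shapes, the sketch below registers a custom
+augmenter: the registered function is a factory that returns a callback taking
+the current `nlp` object and an [`Example`](/api/example) and yielding (possibly
+augmented) `Example` objects. It loosely follows the built-in
+`spacy.lower_case.v1` augmenter documented below; the registry name is
+hypothetical.
+
+```python
+import random
+import spacy
+
+@spacy.registry.augmenters("my_lowercase_augmenter.v1")  # hypothetical name
+def create_lowercase_augmenter(level: float):
+    def augment(nlp, example):
+        if random.random() >= level:
+            # Keep the original example untouched
+            yield example
+        else:
+            # Replace it with a lowercased copy that keeps the same annotations
+            example_dict = example.to_dict()
+            doc = nlp.make_doc(example.text.lower())
+            example_dict["token_annotation"]["ORTH"] = [
+                orth.lower() for orth in example_dict["token_annotation"]["ORTH"]
+            ]
+            yield example.from_dict(doc, example_dict)
+    return augment
+```
+
+The callback can then be referenced from a corpus block, e.g. via
+`@augmenters = "my_lowercase_augmenter.v1"` in `[corpora.train.augmenter]`.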
+
+### spacy.orth_variants.v1 {#orth_variants tag="registered function"}
+
+> #### Example config
+>
+> ```ini
+> [corpora.train.augmenter]
+> @augmenters = "spacy.orth_variants.v1"
+> level = 0.1
+> lower = 0.5
+>
+> [corpora.train.augmenter.orth_variants]
+> @readers = "srsly.read_json.v1"
+> path = "corpus/en_orth_variants.json"
+> ```
+
+Create a data augmentation callback that uses orth-variant replacement. The
+callback can be added to a corpus or other data iterator during training. It's
+especially useful for punctuation and case replacement, to help generalize
+beyond corpora that don't have smart quotes, or only have smart quotes, etc.
+
+| Name | Description |
+| --------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `level` | The percentage of texts that will be augmented. ~~float~~ |
+| `lower` | The percentage of texts that will be lowercased. ~~float~~ |
+| `orth_variants` | A dictionary containing the single and paired orth variants. Typically loaded from a JSON file. See [`en_orth_variants.json`](https://github.com/explosion/spacy-lookups-data/blob/master/spacy_lookups_data/data/en_orth_variants.json) for an example. ~~Dict[str, Dict[List[Union[str, List[str]]]]]~~ |
+| **CREATES** | A function that takes the current `nlp` object and an [`Example`](/api/example) and yields augmented `Example` objects. ~~Callable[[Language, Example], Iterator[Example]]~~ |
+
+### spacy.lower_case.v1 {#lower_case tag="registered function"}
+
+> #### Example config
+>
+> ```ini
+> [corpora.train.augmenter]
+> @augmenters = "spacy.lower_case.v1"
+> level = 0.3
+> ```
+
+Create a data augmentation callback that lowercases documents. The callback can
+be added to a corpus or other data iterator during training. It's especially
+useful for making the model less sensitive to capitalization.
+
+| Name | Description |
+| ----------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `level` | The percentage of texts that will be augmented. ~~float~~ |
+| **CREATES** | A function that takes the current `nlp` object and an [`Example`](/api/example) and yields augmented `Example` objects. ~~Callable[[Language, Example], Iterator[Example]]~~ |
+
+## Training data and alignment {#gold source="spacy/training"}
+
+### training.offsets_to_biluo_tags {#offsets_to_biluo_tags tag="function"}
+
+Encode labelled spans into per-token tags, using the
+[BILUO scheme](/usage/linguistic-features#accessing-ner) (Begin, In, Last, Unit,
+Out). Returns a list of strings, describing the tags. Each tag string will be in
+the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of
+`"B"`, `"I"`, `"L"`, `"U"`. The string `"-"` is used where the entity offsets
+don't align with the tokenization in the `Doc` object. The training algorithm
+will view these as missing values. `O` denotes a non-entity token. `B` denotes
+the beginning of a multi-token entity, `I` the inside of an entity of three or
+more tokens, and `L` the end of an entity of two or more tokens. `U` denotes a
+single-token entity.
+
+
+
+This method was previously available as `spacy.gold.biluo_tags_from_offsets`.
+ + + +> #### Example +> +> ```python +> from spacy.training import offsets_to_biluo_tags +> +> doc = nlp("I like London.") +> entities = [(7, 13, "LOC")] +> tags = offsets_to_biluo_tags(doc, entities) +> assert tags == ["O", "O", "U-LOC", "O"] +> ``` + +| Name | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `doc` | The document that the entity offsets refer to. The output tags will refer to the token boundaries within the document. ~~Doc~~ | +| `entities` | A sequence of `(start, end, label)` triples. `start` and `end` should be character-offset integers denoting the slice into the original string. ~~List[Tuple[int, int, Union[str, int]]]~~ | +| `missing` | The label used for missing values, e.g. if tokenization doesn't align with the entity offsets. Defaults to `"O"`. ~~str~~ | +| **RETURNS** | A list of strings, describing the [BILUO](/usage/linguistic-features#accessing-ner) tags. ~~List[str]~~ | + +### training.biluo_tags_to_offsets {#biluo_tags_to_offsets tag="function"} + +Encode per-token tags following the +[BILUO scheme](/usage/linguistic-features#accessing-ner) into entity offsets. + + + +This method was previously available as `spacy.gold.offsets_from_biluo_tags`. + + + +> #### Example +> +> ```python +> from spacy.training import biluo_tags_to_offsets +> +> doc = nlp("I like London.") +> tags = ["O", "O", "U-LOC", "O"] +> entities = biluo_tags_to_offsets(doc, tags) +> assert entities == [(7, 13, "LOC")] +> ``` + +| Name | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `doc` | The document that the BILUO tags refer to. ~~Doc~~ | +| `entities` | A sequence of [BILUO](/usage/linguistic-features#accessing-ner) tags with each tag describing one token. Each tag string will be of the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`. ~~List[str]~~ | +| **RETURNS** | A sequence of `(start, end, label)` triples. `start` and `end` will be character-offset integers denoting the slice into the original string. ~~List[Tuple[int, int, str]]~~ | + +### training.biluo_tags_to_spans {#biluo_tags_to_spans tag="function" new="2.1"} + +Encode per-token tags following the +[BILUO scheme](/usage/linguistic-features#accessing-ner) into +[`Span`](/api/span) objects. This can be used to create entity spans from +token-based tags, e.g. to overwrite the `doc.ents`. + + + +This method was previously available as `spacy.gold.spans_from_biluo_tags`. + + + +> #### Example +> +> ```python +> from spacy.training import biluo_tags_to_spans +> +> doc = nlp("I like London.") +> tags = ["O", "O", "U-LOC", "O"] +> doc.ents = biluo_tags_to_spans(doc, tags) +> ``` + +| Name | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `doc` | The document that the BILUO tags refer to. ~~Doc~~ | +| `entities` | A sequence of [BILUO](/usage/linguistic-features#accessing-ner) tags with each tag describing one token. 
Each tag string will be of the form of either `""`, `"O"` or `"{action}-{label}"`, where action is one of `"B"`, `"I"`, `"L"`, `"U"`. ~~List[str]~~ | +| **RETURNS** | A sequence of `Span` objects with added entity labels. ~~List[Span]~~ | + +## Utility functions {#util source="spacy/util.py"} + +spaCy comes with a small collection of utility functions located in +[`spacy/util.py`](%%GITHUB_SPACY/spacy/util.py). Because utility functions are +mostly intended for **internal use within spaCy**, their behavior may change +with future releases. The functions documented on this page should be safe to +use and we'll try to ensure backwards compatibility. However, we recommend +having additional tests in place if your application depends on any of spaCy's +utilities. ### util.get_lang_class {#util.get_lang_class tag="function"} Import and load a `Language` class. Allows lazy-loading -[language data](/usage/adding-languages) and importing languages using the -two-letter language code. To add a language code for a custom language class, -you can use the [`set_lang_class`](/api/top-level#util.set_lang_class) helper. +[language data](/usage/linguistic-features#language-data) and importing +languages using the two-letter language code. To add a language code for a +custom language class, you can register it using the +[`@registry.languages`](/api/top-level#registry) decorator. > #### Example > @@ -325,40 +858,17 @@ you can use the [`set_lang_class`](/api/top-level#util.set_lang_class) helper. > for lang_id in ["en", "de"]: > lang_class = util.get_lang_class(lang_id) > lang = lang_class() -> tokenizer = lang.Defaults.create_tokenizer() > ``` -| Name | Type | Description | -| ----------- | ---------- | -------------------------------------- | -| `lang` | unicode | Two-letter language code, e.g. `'en'`. | -| **RETURNS** | `Language` | Language class. | - -### util.set_lang_class {#util.set_lang_class tag="function"} - -Set a custom `Language` class name that can be loaded via -[`get_lang_class`](/api/top-level#util.get_lang_class). If your model uses a -custom language, this is required so that spaCy can load the correct class from -the two-letter language code. - -> #### Example -> -> ```python -> from spacy.lang.xy import CustomLanguage -> -> util.set_lang_class('xy', CustomLanguage) -> lang_class = util.get_lang_class('xy') -> nlp = lang_class() -> ``` - -| Name | Type | Description | -| ------ | ---------- | -------------------------------------- | -| `name` | unicode | Two-letter language code, e.g. `'en'`. | -| `cls` | `Language` | The language class, e.g. `English`. | +| Name | Description | +| ----------- | ---------------------------------------------- | +| `lang` | Two-letter language code, e.g. `"en"`. ~~str~~ | +| **RETURNS** | The respective subclass. ~~Language~~ | ### util.lang_class_is_loaded {#util.lang_class_is_loaded tag="function" new="2.1"} -Check whether a `Language` class is already loaded. `Language` classes are -loaded lazily, to avoid expensive setup code associated with the language data. +Check whether a `Language` subclass is already loaded. `Language` subclasses are +loaded lazily to avoid expensive setup code associated with the language data. > #### Example > @@ -368,57 +878,40 @@ loaded lazily, to avoid expensive setup code associated with the language data. > assert util.lang_class_is_loaded("de") is False > ``` -| Name | Type | Description | -| ----------- | ------- | -------------------------------------- | -| `name` | unicode | Two-letter language code, e.g. `'en'`. 
| -| **RETURNS** | bool | Whether the class has been loaded. | +| Name | Description | +| ----------- | ---------------------------------------------- | +| `name` | Two-letter language code, e.g. `"en"`. ~~str~~ | +| **RETURNS** | Whether the class has been loaded. ~~bool~~ | ### util.load_model {#util.load_model tag="function" new="2"} -Load a model from a shortcut link, package or data path. If called with a -shortcut link or package name, spaCy will assume the model is a Python package -and import and call its `load()` method. If called with a path, spaCy will -assume it's a data directory, read the language and pipeline settings from the -meta.json and initialize a `Language` class. The model data will then be loaded -in via [`Language.from_disk()`](/api/language#from_disk). +Load a pipeline from a package or data path. If called with a string name, spaCy +will assume the pipeline is a Python package and import and call its `load()` +method. If called with a path, spaCy will assume it's a data directory, read the +language and pipeline settings from the [`config.cfg`](/api/data-formats#config) +and create a `Language` object. The model data will then be loaded in via +[`Language.from_disk`](/api/language#from_disk). > #### Example > > ```python -> nlp = util.load_model("en") -> nlp = util.load_model("en_core_web_sm", disable=["ner"]) +> nlp = util.load_model("en_core_web_sm") +> nlp = util.load_model("en_core_web_sm", exclude=["ner"]) > nlp = util.load_model("/path/to/data") > ``` -| Name | Type | Description | -| ------------- | ---------- | -------------------------------------------------------- | -| `name` | unicode | Package name, shortcut link or model path. | -| `**overrides` | - | Specific overrides, like pipeline components to disable. | -| **RETURNS** | `Language` | `Language` class with the loaded model. | - -### util.load_model_from_path {#util.load_model_from_path tag="function" new="2"} - -Load a model from a data directory path. Creates the [`Language`](/api/language) -class and pipeline based on the directory's meta.json and then calls -[`from_disk()`](/api/language#from_disk) with the path. This function also makes -it easy to test a new model that you haven't packaged yet. - -> #### Example -> -> ```python -> nlp = load_model_from_path("/path/to/data") -> ``` - -| Name | Type | Description | -| ------------- | ---------- | ---------------------------------------------------------------------------------------------------- | -| `model_path` | unicode | Path to model data directory. | -| `meta` | dict | Model meta data. If `False`, spaCy will try to load the meta from a meta.json in the same directory. | -| `**overrides` | - | Specific overrides, like pipeline components to disable. | -| **RETURNS** | `Language` | `Language` class with the loaded model. | +| Name | Description | +| ------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `name` | Package name or path. ~~str~~ | +| `vocab` 3 | Optional shared vocab to pass in on initialization. If `True` (default), a new `Vocab` object will be created. ~~Union[Vocab, bool]~~. | +| `disable` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). 
Disabled pipes will be loaded but they won't be run unless you explicitly enable them by calling [`nlp.enable_pipe`](/api/language#enable_pipe). ~~List[str]~~ | +| `exclude` 3 | Names of pipeline components to [exclude](/usage/processing-pipelines#disabling). Excluded components won't be loaded. ~~List[str]~~ | +| `config` 3 | Config overrides as nested dict or flat dict keyed by section values in dot notation, e.g. `"nlp.pipeline"`. ~~Union[Dict[str, Any], Config]~~ | +| **RETURNS** | `Language` class with the loaded pipeline. ~~Language~~ | ### util.load_model_from_init_py {#util.load_model_from_init_py tag="function" new="2"} -A helper function to use in the `load()` method of a model package's +A helper function to use in the `load()` method of a pipeline package's [`__init__.py`](https://github.com/explosion/spacy-models/tree/master/template/model/xx_model_name/__init__.py). > #### Example @@ -430,31 +923,74 @@ A helper function to use in the `load()` method of a model package's > return load_model_from_init_py(__file__, **overrides) > ``` -| Name | Type | Description | -| ------------- | ---------- | -------------------------------------------------------- | -| `init_file` | unicode | Path to model's `__init__.py`, i.e. `__file__`. | -| `**overrides` | - | Specific overrides, like pipeline components to disable. | -| **RETURNS** | `Language` | `Language` class with the loaded model. | +| Name | Description | +| ------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `init_file` | Path to package's `__init__.py`, i.e. `__file__`. ~~Union[str, Path]~~ | +| `vocab` 3 | Optional shared vocab to pass in on initialization. If `True` (default), a new `Vocab` object will be created. ~~Union[Vocab, bool]~~. | +| `disable` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). Disabled pipes will be loaded but they won't be run unless you explicitly enable them by calling [nlp.enable_pipe](/api/language#enable_pipe). ~~List[str]~~ | +| `exclude` 3 | Names of pipeline components to [exclude](/usage/processing-pipelines#disabling). Excluded components won't be loaded. ~~List[str]~~ | +| `config` 3 | Config overrides as nested dict or flat dict keyed by section values in dot notation, e.g. `"nlp.pipeline"`. ~~Union[Dict[str, Any], Config]~~ | +| **RETURNS** | `Language` class with the loaded pipeline. ~~Language~~ | -### util.get_model_meta {#util.get_model_meta tag="function" new="2"} +### util.load_config {#util.load_config tag="function" new="3"} -Get a model's meta.json from a directory path and validate its contents. +Load a pipeline's [`config.cfg`](/api/data-formats#config) from a file path. The +config typically includes details about the components and how they're created, +as well as all training settings and hyperparameters. > #### Example > > ```python -> meta = util.get_model_meta("/path/to/model") +> config = util.load_config("/path/to/config.cfg") +> print(config.to_str()) > ``` -| Name | Type | Description | -| ----------- | ---------------- | ------------------------ | -| `path` | unicode / `Path` | Path to model directory. | -| **RETURNS** | dict | The model's meta data. 
| +| Name | Description | +| ------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `path` | Path to the pipeline's `config.cfg`. ~~Union[str, Path]~~ | +| `overrides` | Optional config overrides to replace in loaded config. Can be provided as nested dict, or as flat dict with keys in dot notation, e.g. `"nlp.pipeline"`. ~~Dict[str, Any]~~ | +| `interpolate` | Whether to interpolate the config and replace variables like `${paths.train}` with their values. Defaults to `False`. ~~bool~~ | +| **RETURNS** | The pipeline's config. ~~Config~~ | + +### util.load_meta {#util.load_meta tag="function" new="3"} + +Get a pipeline's [`meta.json`](/api/data-formats#meta) from a file path and +validate its contents. The meta typically includes details about author, +licensing, data sources and version. + +> #### Example +> +> ```python +> meta = util.load_meta("/path/to/meta.json") +> ``` + +| Name | Description | +| ----------- | -------------------------------------------------------- | +| `path` | Path to the pipeline's `meta.json`. ~~Union[str, Path]~~ | +| **RETURNS** | The pipeline's meta data. ~~Dict[str, Any]~~ | + +### util.get_installed_models {#util.get_installed_models tag="function" new="3"} + +List all pipeline packages installed in the current environment. This will +include any spaCy pipeline that was packaged with +[`spacy package`](/api/cli#package). Under the hood, pipeline packages expose a +Python entry point that spaCy can check, without having to load the `nlp` +object. + +> #### Example +> +> ```python +> names = util.get_installed_models() +> ``` + +| Name | Description | +| ----------- | ------------------------------------------------------------------------------------- | +| **RETURNS** | The string names of the pipelines installed in the current environment. ~~List[str]~~ | ### util.is_package {#util.is_package tag="function"} Check if string maps to a package installed via pip. Mainly used to validate -[model packages](/usage/models). +[pipeline packages](/usage/models). > #### Example > @@ -463,15 +999,16 @@ Check if string maps to a package installed via pip. Mainly used to validate > util.is_package("xyz") # False > ``` -| Name | Type | Description | -| ----------- | ------- | -------------------------------------------- | -| `name` | unicode | Name of package. | -| **RETURNS** | `bool` | `True` if installed package, `False` if not. | +| Name | Description | +| ----------- | ----------------------------------------------------- | +| `name` | Name of package. ~~str~~ | +| **RETURNS** | `True` if installed package, `False` if not. ~~bool~~ | ### util.get_package_path {#util.get_package_path tag="function" new="2"} Get path to an installed package. Mainly used to resolve the location of -[model packages](/usage/models). Currently imports the package to find its path. +[pipeline packages](/usage/models). Currently imports the package to find its +path. > #### Example > @@ -480,10 +1017,10 @@ Get path to an installed package. Mainly used to resolve the location of > # /usr/lib/python3.6/site-packages/en_core_web_sm > ``` -| Name | Type | Description | -| -------------- | ------- | -------------------------------- | -| `package_name` | unicode | Name of installed package. | -| **RETURNS** | `Path` | Path to model package directory. 
| +| Name | Description | +| -------------- | -------------------------------------------- | +| `package_name` | Name of installed package. ~~str~~ | +| **RETURNS** | Path to pipeline package directory. ~~Path~~ | ### util.is_in_jupyter {#util.is_in_jupyter tag="function" new="2"} @@ -500,31 +1037,9 @@ detecting the IPython kernel. Mainly used for the > display(HTML(html)) > ``` -| Name | Type | Description | -| ----------- | ---- | ------------------------------------- | -| **RETURNS** | bool | `True` if in Jupyter, `False` if not. | - -### util.update_exc {#util.update_exc tag="function"} - -Update, validate and overwrite -[tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions). Used to -combine global exceptions with custom, language-specific exceptions. Will raise -an error if key doesn't match `ORTH` values. - -> #### Example -> -> ```python -> BASE = {"a.": [{ORTH: "a."}], ":)": [{ORTH: ":)"}]} -> NEW = {"a.": [{ORTH: "a.", NORM: "all"}]} -> exceptions = util.update_exc(BASE, NEW) -> # {"a.": [{ORTH: "a.", NORM: "all"}], ":)": [{ORTH: ":)"}]} -> ``` - -| Name | Type | Description | -| ----------------- | ----- | --------------------------------------------------------------- | -| `base_exceptions` | dict | Base tokenizer exceptions. | -| `*addition_dicts` | dicts | Exception dictionaries to add to the base exceptions, in order. | -| **RETURNS** | dict | Combined tokenizer exceptions. | +| Name | Description | +| ----------- | ---------------------------------------------- | +| **RETURNS** | `True` if in Jupyter, `False` if not. ~~bool~~ | ### util.compile_prefix_regex {#util.compile_prefix_regex tag="function"} @@ -538,10 +1053,10 @@ Compile a sequence of prefix rules into a regex object. > nlp.tokenizer.prefix_search = prefix_regex.search > ``` -| Name | Type | Description | -| ----------- | ------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- | -| `entries` | tuple | The prefix rules, e.g. [`lang.punctuation.TOKENIZER_PREFIXES`](https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py). | -| **RETURNS** | [regex](https://docs.python.org/3/library/re.html#re-objects) | The regex object. to be used for [`Tokenizer.prefix_search`](/api/tokenizer#attributes). | +| Name | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------- | +| `entries` | The prefix rules, e.g. [`lang.punctuation.TOKENIZER_PREFIXES`](%%GITHUB_SPACY/spacy/lang/punctuation.py). ~~Iterable[Union[str, Pattern]]~~ | +| **RETURNS** | The regex object to be used for [`Tokenizer.prefix_search`](/api/tokenizer#attributes). ~~Pattern~~ | ### util.compile_suffix_regex {#util.compile_suffix_regex tag="function"} @@ -555,10 +1070,10 @@ Compile a sequence of suffix rules into a regex object. > nlp.tokenizer.suffix_search = suffix_regex.search > ``` -| Name | Type | Description | -| ----------- | ------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- | -| `entries` | tuple | The suffix rules, e.g. [`lang.punctuation.TOKENIZER_SUFFIXES`](https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py). 
| -| **RETURNS** | [regex](https://docs.python.org/3/library/re.html#re-objects) | The regex object. to be used for [`Tokenizer.suffix_search`](/api/tokenizer#attributes). | +| Name | Description | +| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------- | +| `entries` | The suffix rules, e.g. [`lang.punctuation.TOKENIZER_SUFFIXES`](%%GITHUB_SPACY/spacy/lang/punctuation.py). ~~Iterable[Union[str, Pattern]]~~ | +| **RETURNS** | The regex object to be used for [`Tokenizer.suffix_search`](/api/tokenizer#attributes). ~~Pattern~~ | ### util.compile_infix_regex {#util.compile_infix_regex tag="function"} @@ -572,10 +1087,10 @@ Compile a sequence of infix rules into a regex object. > nlp.tokenizer.infix_finditer = infix_regex.finditer > ``` -| Name | Type | Description | -| ----------- | ------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- | -| `entries` | tuple | The infix rules, e.g. [`lang.punctuation.TOKENIZER_INFIXES`](https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py). | -| **RETURNS** | [regex](https://docs.python.org/3/library/re.html#re-objects) | The regex object. to be used for [`Tokenizer.infix_finditer`](/api/tokenizer#attributes). | +| Name | Description | +| ----------- | ----------------------------------------------------------------------------------------------------------------------------------------- | +| `entries` | The infix rules, e.g. [`lang.punctuation.TOKENIZER_INFIXES`](%%GITHUB_SPACY/spacy/lang/punctuation.py). ~~Iterable[Union[str, Pattern]]~~ | +| **RETURNS** | The regex object to be used for [`Tokenizer.infix_finditer`](/api/tokenizer#attributes). ~~Pattern~~ | ### util.minibatch {#util.minibatch tag="function" new="2"} @@ -587,76 +1102,14 @@ vary on each step. > ```python > batches = minibatch(train_data) > for batch in batches: -> texts, annotations = zip(*batch) -> nlp.update(texts, annotations) +> nlp.update(batch) > ``` -| Name | Type | Description | -| ---------- | -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `items` | iterable | The items to batch up. | -| `size` | int / iterable | The batch size(s). Use [`util.compounding`](/api/top-level#util.compounding) or [`util.decaying`](/api/top-level#util.decaying) or for an infinite series of compounding or decaying values. | -| **YIELDS** | list | The batches. | - -### util.compounding {#util.compounding tag="function" new="2"} - -Yield an infinite series of compounding values. Each time the generator is -called, a value is produced by multiplying the previous value by the compound -rate. - -> #### Example -> -> ```python -> sizes = compounding(1., 10., 1.5) -> assert next(sizes) == 1. -> assert next(sizes) == 1. * 1.5 -> assert next(sizes) == 1.5 * 1.5 -> ``` - -| Name | Type | Description | -| ---------- | ----------- | ----------------------- | -| `start` | int / float | The first value. | -| `stop` | int / float | The maximum value. | -| `compound` | int / float | The compounding factor. | -| **YIELDS** | int | Compounding values. | - -### util.decaying {#util.decaying tag="function" new="2"} - -Yield an infinite series of linearly decaying values. 
- -> #### Example -> -> ```python -> sizes = decaying(10., 1., 0.001) -> assert next(sizes) == 10. -> assert next(sizes) == 10. - 0.001 -> assert next(sizes) == 9.999 - 0.001 -> ``` - -| Name | Type | Description | -| ---------- | ----------- | -------------------- | -| `start` | int / float | The first value. | -| `end` | int / float | The maximum value. | -| `decay` | int / float | The decaying factor. | -| **YIELDS** | int | The decaying values. | - -### util.itershuffle {#util.itershuffle tag="function" new="2"} - -Shuffle an iterator. This works by holding `bufsize` items back and yielding -them sometime later. Obviously, this is not unbiased – but should be good enough -for batching. Larger `bufsize` means less bias. - -> #### Example -> -> ```python -> values = range(1000) -> shuffled = itershuffle(values) -> ``` - -| Name | Type | Description | -| ---------- | -------- | ----------------------------------- | -| `iterable` | iterable | Iterator to shuffle. | -| `bufsize` | int | Items to hold back (default: 1000). | -| **YIELDS** | iterable | The shuffled iterator. | +| Name | Description | +| ---------- | ---------------------------------------- | +| `items` | The items to batch up. ~~Iterable[Any]~~ | +| `size` | int / iterable | The batch size(s). ~~Union[int, Sequence[int]]~~ | +| **YIELDS** | The batches. | ### util.filter_spans {#util.filter_spans tag="function" new="2.1.4"} @@ -674,54 +1127,30 @@ of one entity) or when merging spans with > filtered = filter_spans(spans) > ``` -| Name | Type | Description | -| ----------- | -------- | -------------------- | -| `spans` | iterable | The spans to filter. | -| **RETURNS** | list | The filtered spans. | +| Name | Description | +| ----------- | --------------------------------------- | +| `spans` | The spans to filter. ~~Iterable[Span]~~ | +| **RETURNS** | The filtered spans. ~~List[Span]~~ | -## Compatibility functions {#compat source="spacy/compaty.py"} +### util.get_words_and_spaces {#get_words_and_spaces tag="function" new="3"} -All Python code is written in an **intersection of Python 2 and Python 3**. This -is easy in Cython, but somewhat ugly in Python. Logic that deals with Python or -platform compatibility only lives in `spacy.compat`. To distinguish them from -the builtin functions, replacement functions are suffixed with an underscore, -e.g. `unicode_`. +Given a list of words and a text, reconstruct the original tokens and return a +list of words and spaces that can be used to create a [`Doc`](/api/doc#init). +This can help recover destructive tokenization that didn't preserve any +whitespace information. > #### Example > > ```python -> from spacy.compat import unicode_ -> -> compatible_unicode = unicode_("hello world") +> orig_words = ["Hey", ",", "what", "'s", "up", "?"] +> orig_text = "Hey, what's up?" +> words, spaces = get_words_and_spaces(orig_words, orig_text) +> # ['Hey', ',', 'what', "'s", 'up', '?'] +> # [False, True, False, True, False, False] > ``` -| Name | Python 2 | Python 3 | -| -------------------- | ---------------------------------- | ----------- | -| `compat.bytes_` | `str` | `bytes` | -| `compat.unicode_` | `unicode` | `str` | -| `compat.basestring_` | `basestring` | `str` | -| `compat.input_` | `raw_input` | `input` | -| `compat.path2str` | `str(path)` with `.decode('utf8')` | `str(path)` | - -### compat.is_config {#compat.is_config tag="function"} - -Check if a specific configuration of Python version and operating system matches -the user's setup. Mostly used to display targeted error messages. 
- -> #### Example -> -> ```python -> from spacy.compat import is_config -> -> if is_config(python2=True, windows=True): -> print("You are using Python 2 on Windows.") -> ``` - -| Name | Type | Description | -| ----------- | ---- | ---------------------------------------------------------------- | -| `python2` | bool | spaCy is executed with Python 2.x. | -| `python3` | bool | spaCy is executed with Python 3.x. | -| `windows` | bool | spaCy is executed on Windows. | -| `linux` | bool | spaCy is executed on Linux. | -| `osx` | bool | spaCy is executed on OS X or macOS. | -| **RETURNS** | bool | Whether the specified configuration matches the user's platform. | +| Name | Description | +| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------- | +| `words` | The list of words. ~~Iterable[str]~~ | +| `text` | The original text. ~~str~~ | +| **RETURNS** | A list of words and a list of boolean values indicating whether the word at this position is followed by a space. ~~Tuple[List[str], List[bool]]~~ | diff --git a/website/docs/api/transformer.md b/website/docs/api/transformer.md new file mode 100644 index 000000000..5754d2238 --- /dev/null +++ b/website/docs/api/transformer.md @@ -0,0 +1,555 @@ +--- +title: Transformer +teaser: Pipeline component for multi-task learning with transformer models +tag: class +source: github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py +new: 3 +api_base_class: /api/pipe +api_string_name: transformer +--- + +> #### Installation +> +> ```bash +> $ pip install -U %%SPACY_PKG_NAME[transformers] %%SPACY_PKG_FLAGS +> ``` + + + +This component is available via the extension package +[`spacy-transformers`](https://github.com/explosion/spacy-transformers). It +exposes the component via entry points, so if you have the package installed, +using `factory = "transformer"` in your +[training config](/usage/training#config) or `nlp.add_pipe("transformer")` will +work out-of-the-box. + + + +This pipeline component lets you use transformer models in your pipeline. It +supports all models that are available via the +[HuggingFace `transformers`](https://huggingface.co/transformers) library. +Usually you will connect subsequent components to the shared transformer using +the [TransformerListener](/api/architectures#TransformerListener) layer. This +works similarly to spaCy's [Tok2Vec](/api/tok2vec) component and +[Tok2VecListener](/api/architectures/#Tok2VecListener) sublayer. + +The component assigns the output of the transformer to the `Doc`'s extension +attributes. We also calculate an alignment between the word-piece tokens and the +spaCy tokenization, so that we can use the last hidden states to set the +`Doc.tensor` attribute. When multiple word-piece tokens align to the same spaCy +token, the spaCy token receives the sum of their values. To access the values, +you can use the custom [`Doc._.trf_data`](#custom-attributes) attribute. The +package also adds the function registries [`@span_getters`](#span_getters) and +[`@annotation_setters`](#annotation_setters) with several built-in registered +functions. For more details, see the +[usage documentation](/usage/embeddings-transformers). + +## Config and implementation {#config} + +The default config is defined by the pipeline component factory and describes +how the component should be configured. 
You can override its settings via the +`config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your +[`config.cfg` for training](/usage/training#config). See the +[model architectures](/api/architectures#transformers) documentation for details +on the transformer architectures and their arguments and hyperparameters. + +> #### Example +> +> ```python +> from spacy_transformers import Transformer, DEFAULT_CONFIG +> +> nlp.add_pipe("transformer", config=DEFAULT_CONFIG) +> ``` + +| Setting | Description | +| ----------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `max_batch_items` | Maximum size of a padded batch. Defaults to `4096`. ~~int~~ | +| `set_extra_annotations` | Function that takes a batch of `Doc` objects and transformer outputs to set additional annotations on the `Doc`. The `Doc._.transformer_data` attribute is set prior to calling the callback. Defaults to `null_annotation_setter` (no additional annotations). ~~Callable[[List[Doc], FullTransformerBatch], None]~~ | +| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Defaults to [TransformerModel](/api/architectures#TransformerModel). ~~Model[List[Doc], FullTransformerBatch]~~ | + +```python +https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py +``` + +## Transformer.\_\_init\_\_ {#init tag="method"} + +> #### Example +> +> ```python +> # Construction via add_pipe with default model +> trf = nlp.add_pipe("transformer") +> +> # Construction via add_pipe with custom config +> config = { +> "model": { +> "@architectures": "spacy-transformers.TransformerModel.v1", +> "name": "bert-base-uncased", +> "tokenizer_config": {"use_fast": True} +> } +> } +> trf = nlp.add_pipe("transformer", config=config) +> +> # Construction from class +> from spacy_transformers import Transformer +> trf = Transformer(nlp.vocab, model) +> ``` + +Construct a `Transformer` component. One or more subsequent spaCy components can +use the transformer outputs as features in its model, with gradients +backpropagated to the single shared weights. The activations from the +transformer are saved in the [`Doc._.trf_data`](#custom-attributes) extension +attribute. You can also provide a callback to set additional annotations. In +your application, you would normally use a shortcut for this and instantiate the +component using its string name and [`nlp.add_pipe`](/api/language#create_pipe). + +| Name | Description | +| ----------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `vocab` | The shared vocabulary. ~~Vocab~~ | +| `model` | The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Usually you will want to use the [TransformerModel](/api/architectures#TransformerModel) layer for this. ~~Model[List[Doc], FullTransformerBatch]~~ | +| `set_extra_annotations` | Function that takes a batch of `Doc` objects and transformer outputs and stores the annotations on the `Doc`. 
The `Doc._.trf_data` attribute is set prior to calling the callback. By default, no additional annotations are set. ~~Callable[[List[Doc], FullTransformerBatch], None]~~ | +| _keyword-only_ | | +| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ | +| `max_batch_items` | Maximum size of a padded batch. Defaults to `128*32`. ~~int~~ | + +## Transformer.\_\_call\_\_ {#call tag="method"} + +Apply the pipe to one document. The document is modified in place, and returned. +This usually happens under the hood when the `nlp` object is called on a text +and all pipeline components are applied to the `Doc` in order. Both +[`__call__`](/api/transformer#call) and [`pipe`](/api/transformer#pipe) delegate +to the [`predict`](/api/transformer#predict) and +[`set_annotations`](/api/transformer#set_annotations) methods. + +> #### Example +> +> ```python +> doc = nlp("This is a sentence.") +> trf = nlp.add_pipe("transformer") +> # This usually happens under the hood +> processed = transformer(doc) +> ``` + +| Name | Description | +| ----------- | -------------------------------- | +| `doc` | The document to process. ~~Doc~~ | +| **RETURNS** | The processed document. ~~Doc~~ | + +## Transformer.pipe {#pipe tag="method"} + +Apply the pipe to a stream of documents. This usually happens under the hood +when the `nlp` object is called on a text and all pipeline components are +applied to the `Doc` in order. Both [`__call__`](/api/transformer#call) and +[`pipe`](/api/transformer#pipe) delegate to the +[`predict`](/api/transformer#predict) and +[`set_annotations`](/api/transformer#set_annotations) methods. + +> #### Example +> +> ```python +> trf = nlp.add_pipe("transformer") +> for doc in trf.pipe(docs, batch_size=50): +> pass +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------- | +| `stream` | A stream of documents. ~~Iterable[Doc]~~ | +| _keyword-only_ | | +| `batch_size` | The number of documents to buffer. Defaults to `128`. ~~int~~ | +| **YIELDS** | The processed documents in order. ~~Doc~~ | + +## Transformer.initialize {#initialize tag="method"} + +Initialize the component for training and return an +[`Optimizer`](https://thinc.ai/docs/api-optimizers). `get_examples` should be a +function that returns an iterable of [`Example`](/api/example) objects. The data +examples are used to **initialize the model** of the component and can either be +the full training data or a representative sample. Initialization includes +validating the network, +[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and +setting up the label scheme based on the data. This method is typically called +by [`Language.initialize`](/api/language#initialize). + +> #### Example +> +> ```python +> trf = nlp.add_pipe("transformer") +> trf.initialize(lambda: [], nlp=nlp) +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------- | +| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ | +| _keyword-only_ | | +| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ | + +## Transformer.predict {#predict tag="method"} + +Apply the component's model to a batch of [`Doc`](/api/doc) objects without +modifying them. 
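+
+As a minimal sketch (assuming an `nlp` object that already contains a
+`"transformer"` component), the usual flow is to call `predict` on a batch of
+docs and then hand the output to
+[`set_annotations`](/api/transformer#set_annotations):
+
+```python
+# Illustrative only: predict() leaves the docs untouched, set_annotations()
+# then writes the output to the custom Doc._.trf_data extension attribute.
+docs = [nlp.make_doc("First text."), nlp.make_doc("Second text.")]
+trf = nlp.get_pipe("transformer")
+predictions = trf.predict(docs)
+trf.set_annotations(docs, predictions)
+```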
+ +> #### Example +> +> ```python +> trf = nlp.add_pipe("transformer") +> scores = trf.predict([doc1, doc2]) +> ``` + +| Name | Description | +| ----------- | ------------------------------------------- | +| `docs` | The documents to predict. ~~Iterable[Doc]~~ | +| **RETURNS** | The model's prediction for each document. | + +## Transformer.set_annotations {#set_annotations tag="method"} + +Assign the extracted features to the `Doc` objects. By default, the +[`TransformerData`](/api/transformer#transformerdata) object is written to the +[`Doc._.trf_data`](#custom-attributes) attribute. Your `set_extra_annotations` +callback is then called, if provided. + +> #### Example +> +> ```python +> trf = nlp.add_pipe("transformer") +> scores = trf.predict(docs) +> trf.set_annotations(docs, scores) +> ``` + +| Name | Description | +| -------- | ----------------------------------------------------- | +| `docs` | The documents to modify. ~~Iterable[Doc]~~ | +| `scores` | The scores to set, produced by `Transformer.predict`. | + +## Transformer.update {#update tag="method"} + +Prepare for an update to the transformer. Like the [`Tok2Vec`](/api/tok2vec) +component, the `Transformer` component is unusual in that it does not receive +"gold standard" annotations to calculate a weight update. The optimal output of +the transformer data is unknown – it's a hidden layer inside the network that is +updated by backpropagating from output layers. + +The `Transformer` component therefore does **not** perform a weight update +during its own `update` method. Instead, it runs its transformer model and +communicates the output and the backpropagation callback to any **downstream +components** that have been connected to it via the +[TransformerListener](/api/architectures#TransformerListener) sublayer. If there +are multiple listeners, the last layer will actually backprop to the transformer +and call the optimizer, while the others simply increment the gradients. + +> #### Example +> +> ```python +> trf = nlp.add_pipe("transformer") +> optimizer = nlp.initialize() +> losses = trf.update(examples, sgd=optimizer) +> ``` + +| Name | Description | +| ----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `examples` | A batch of [`Example`](/api/example) objects. Only the [`Example.predicted`](/api/example#predicted) `Doc` object is used, the reference `Doc` is ignored. ~~Iterable[Example]~~ | +| _keyword-only_ | | +| `drop` | The dropout rate. ~~float~~ | +| `set_annotations` | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](#set_annotations). ~~bool~~ | +| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ | +| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ | +| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ | + +## Transformer.create_optimizer {#create_optimizer tag="method"} + +Create an optimizer for the pipeline component. + +> #### Example +> +> ```python +> trf = nlp.add_pipe("transformer") +> optimizer = trf.create_optimizer() +> ``` + +| Name | Description | +| ----------- | ---------------------------- | +| **RETURNS** | The optimizer. 
~~Optimizer~~ | + +## Transformer.use_params {#use_params tag="method, contextmanager"} + +Modify the pipe's model to use the given parameter values. At the end of the +context, the original parameters are restored. + +> #### Example +> +> ```python +> trf = nlp.add_pipe("transformer") +> with trf.use_params(optimizer.averages): +> trf.to_disk("/best_model") +> ``` + +| Name | Description | +| -------- | -------------------------------------------------- | +| `params` | The parameter values to use in the model. ~~dict~~ | + +## Transformer.to_disk {#to_disk tag="method"} + +Serialize the pipe to disk. + +> #### Example +> +> ```python +> trf = nlp.add_pipe("transformer") +> trf.to_disk("/path/to/transformer") +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | +| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | + +## Transformer.from_disk {#from_disk tag="method"} + +Load the pipe from disk. Modifies the object in place and returns it. + +> #### Example +> +> ```python +> trf = nlp.add_pipe("transformer") +> trf.from_disk("/path/to/transformer") +> ``` + +| Name | Description | +| -------------- | ----------------------------------------------------------------------------------------------- | +| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The modified `Transformer` object. ~~Transformer~~ | + +## Transformer.to_bytes {#to_bytes tag="method"} + +> #### Example +> +> ```python +> trf = nlp.add_pipe("transformer") +> trf_bytes = trf.to_bytes() +> ``` + +Serialize the pipe to a bytestring. + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------- | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The serialized form of the `Transformer` object. ~~bytes~~ | + +## Transformer.from_bytes {#from_bytes tag="method"} + +Load the pipe from a bytestring. Modifies the object in place and returns it. + +> #### Example +> +> ```python +> trf_bytes = trf.to_bytes() +> trf = nlp.add_pipe("transformer") +> trf.from_bytes(trf_bytes) +> ``` + +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------- | +| `bytes_data` | The data to load from. ~~bytes~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The `Transformer` object. ~~Transformer~~ | + +## Serialization fields {#serialization-fields} + +During serialization, spaCy will export several data fields used to restore +different aspects of the object. If needed, you can exclude them from +serialization by passing in the string names via the `exclude` argument. 
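+
+For example, a byte-level round trip that leaves out the shared vocab might
+look like the following sketch (illustrative only, assuming an `nlp` object
+with a `"transformer"` component):
+
+```python
+# Serialize without the shared vocab, then restore into a component on the
+# same nlp object, which already provides that vocab.
+trf = nlp.get_pipe("transformer")
+trf_bytes = trf.to_bytes(exclude=["vocab"])
+trf.from_bytes(trf_bytes, exclude=["vocab"])
+```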
+ +> #### Example +> +> ```python +> data = trf.to_disk("/path", exclude=["vocab"]) +> ``` + +| Name | Description | +| ------- | -------------------------------------------------------------- | +| `vocab` | The shared [`Vocab`](/api/vocab). | +| `cfg` | The config file. You usually don't want to exclude this. | +| `model` | The binary model data. You usually don't want to exclude this. | + +## TransformerData {#transformerdata tag="dataclass"} + +Transformer tokens and outputs for one `Doc` object. The transformer models +return tensors that refer to a whole padded batch of documents. These tensors +are wrapped into the +[FullTransformerBatch](/api/transformer#fulltransformerbatch) object. The +`FullTransformerBatch` then splits out the per-document data, which is handled +by this class. Instances of this class are typically assigned to the +[`Doc._.trf_data`](/api/transformer#custom-attributes) extension attribute. + +| Name | Description | +| --------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `tokens` | A slice of the tokens data produced by the tokenizer. This may have several fields, including the token IDs, the texts and the attention mask. See the [`transformers.BatchEncoding`](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.BatchEncoding) object for details. ~~dict~~ | +| `tensors` | The activations for the `Doc` from the transformer. Usually the last tensor that is 3-dimensional will be the most important, as that will provide the final hidden state. Generally activations that are 2-dimensional will be attention weights. Details of this variable will differ depending on the underlying transformer model. ~~List[FloatsXd]~~ | +| `align` | Alignment from the `Doc`'s tokenization to the wordpieces. This is a ragged array, where `align.lengths[i]` indicates the number of wordpiece tokens that token `i` aligns against. The actual indices are provided at `align[i].dataXd`. ~~Ragged~~ | +| `width` | The width of the last hidden layer. ~~int~~ | + +### TransformerData.empty {#transformerdata-emoty tag="classmethod"} + +Create an empty `TransformerData` container. + +| Name | Description | +| ----------- | ---------------------------------- | +| **RETURNS** | The container. ~~TransformerData~~ | + +## FullTransformerBatch {#fulltransformerbatch tag="dataclass"} + +Holds a batch of input and output objects for a transformer model. The data can +then be split to a list of [`TransformerData`](/api/transformer#transformerdata) +objects to associate the outputs to each [`Doc`](/api/doc) in the batch. + +| Name | Description | +| ---------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `spans` | The batch of input spans. The outer list refers to the Doc objects in the batch, and the inner list are the spans for that `Doc`. 
Note that spans are allowed to overlap or exclude tokens, but each `Span` can only refer to one `Doc` (by definition). This means that within a `Doc`, the regions of the output tensors that correspond to each `Span` may overlap or have gaps, but for each `Doc`, there is a non-overlapping contiguous slice of the outputs. ~~List[List[Span]]~~ | +| `tokens` | The output of the tokenizer. ~~transformers.BatchEncoding~~ | +| `tensors` | The output of the transformer model. ~~List[torch.Tensor]~~ | +| `align` | Alignment from the spaCy tokenization to the wordpieces. This is a ragged array, where `align.lengths[i]` indicates the number of wordpiece tokens that token `i` aligns against. The actual indices are provided at `align[i].dataXd`. ~~Ragged~~ | +| `doc_data` | The outputs, split per `Doc` object. ~~List[TransformerData]~~ | + +### FullTransformerBatch.unsplit_by_doc {#fulltransformerbatch-unsplit_by_doc tag="method"} + +Return a new `FullTransformerBatch` from a split batch of activations, using the +current object's spans, tokens and alignment. This is used during the backward +pass, in order to construct the gradients to pass back into the transformer +model. + +| Name | Description | +| ----------- | -------------------------------------------------------- | +| `arrays` | The split batch of activations. ~~List[List[Floats3d]]~~ | +| **RETURNS** | The transformer batch. ~~FullTransformerBatch~~ | + +### FullTransformerBatch.split_by_doc {#fulltransformerbatch-split_by_doc tag="method"} + +Split a `TransformerData` object that represents a batch into a list with one +`TransformerData` per `Doc`. + +| Name | Description | +| ----------- | ------------------------------------------ | +| **RETURNS** | The split batch. ~~List[TransformerData]~~ | + +## Span getters {#span_getters source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/span_getters.py"} + +Span getters are functions that take a batch of [`Doc`](/api/doc) objects and +return a lists of [`Span`](/api/span) objects for each doc to be processed by +the transformer. This is used to manage long documents by cutting them into +smaller sequences before running the transformer. The spans are allowed to +overlap, and you can also omit sections of the `Doc` if they are not relevant. + +Span getters can be referenced in the `[components.transformer.model.get_spans]` +block of the config to customize the sequences processed by the transformer. You +can also register +[custom span getters](/usage/embeddings-transformers#transformers-training-custom-settings) +using the `@spacy.registry.span_getters` decorator. + +> #### Example +> +> ```python +> @spacy.registry.span_getters("custom_sent_spans") +> def configure_get_sent_spans() -> Callable: +> def get_sent_spans(docs: Iterable[Doc]) -> List[List[Span]]: +> return [list(doc.sents) for doc in docs] +> +> return get_sent_spans +> ``` + +| Name | Description | +| ----------- | ------------------------------------------------------------- | +| `docs` | A batch of `Doc` objects. ~~Iterable[Doc]~~ | +| **RETURNS** | The spans to process by the transformer. ~~List[List[Span]]~~ | + +### doc_spans.v1 {#doc_spans tag="registered function"} + +> #### Example config +> +> ```ini +> [transformer.model.get_spans] +> @span_getters = "spacy-transformers.doc_spans.v1" +> ``` + +Create a span getter that uses the whole document as its spans. This is the best +approach if your [`Doc`](/api/doc) objects already refer to relatively short +texts. 
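+
+To make the behavior concrete, a whole-document getter along these lines (an
+illustrative sketch, not the package's internal implementation) simply wraps
+each `Doc` in a single span:
+
+```python
+from typing import Iterable, List
+from spacy.tokens import Doc, Span
+
+def get_doc_spans(docs: Iterable[Doc]) -> List[List[Span]]:
+    # One span per document, covering all of its tokens
+    return [[doc[:]] for doc in docs]
+```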
+ +### sent_spans.v1 {#sent_spans tag="registered function"} + +> #### Example config +> +> ```ini +> [transformer.model.get_spans] +> @span_getters = "spacy-transformers.sent_spans.v1" +> ``` + +Create a span getter that uses sentence boundary markers to extract the spans. +This requires sentence boundaries to be set (e.g. by the +[`Sentencizer`](/api/sentencizer)), and may result in somewhat uneven batches, +depending on the sentence lengths. However, it does provide the transformer with +more meaningful windows to attend over. + +### strided_spans.v1 {#strided_spans tag="registered function"} + +> #### Example config +> +> ```ini +> [transformer.model.get_spans] +> @span_getters = "spacy-transformers.strided_spans.v1" +> window = 128 +> stride = 96 +> ``` + +Create a span getter for strided spans. If you set the `window` and `stride` to +the same value, the spans will cover each token once. Setting `stride` lower +than `window` will allow for an overlap, so that some tokens are counted twice. +This can be desirable, because it allows all tokens to have both a left and +right context. + +| Name | Description | +| -------- | ------------------------ | +| `window` | The window size. ~~int~~ | +| `stride` | The stride size. ~~int~~ | + +## Annotation setters {#annotation_setters tag="registered functions" source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/annotation_setters.py"} + +Annotation setters are functions that take a batch of `Doc` objects and a +[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set +additional annotations on the `Doc`, e.g. to set custom or built-in attributes. +You can register custom annotation setters using the +`@registry.annotation_setters` decorator. + +> #### Example +> +> ```python +> @registry.annotation_setters("spacy-transformers.null_annotation_setter.v1") +> def configure_null_annotation_setter() -> Callable: +> def setter(docs: List[Doc], trf_data: FullTransformerBatch) -> None: +> pass +> +> return setter +> ``` + +| Name | Description | +| ---------- | ------------------------------------------------------------- | +| `docs` | A batch of `Doc` objects. ~~List[Doc]~~ | +| `trf_data` | The transformers data for the batch. ~~FullTransformerBatch~~ | + +The following built-in functions are available: + +| Name | Description | +| ---------------------------------------------- | ------------------------------------- | +| `spacy-transformers.null_annotation_setter.v1` | Don't set any additional annotations. | + +## Custom attributes {#custom-attributes} + +The component sets the following +[custom extension attributes](/usage/processing-pipeline#custom-components-attributes): + +| Name | Description | +| ---------------- | ------------------------------------------------------------------------ | +| `Doc._.trf_data` | Transformer tokens and outputs for the `Doc` object. ~~TransformerData~~ | diff --git a/website/docs/api/vectors.md b/website/docs/api/vectors.md index a4c36f8cd..ba2d5ab42 100644 --- a/website/docs/api/vectors.md +++ b/website/docs/api/vectors.md @@ -30,13 +30,13 @@ you can add vectors to later. > vectors = Vectors(data=data, keys=keys) > ``` -| Name | Type | Description | -| ----------- | ---------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| `data` | `ndarray[ndim=1, dtype='float32']` | The vector data. 
| -| `keys` | iterable | A sequence of keys aligned with the data. | -| `shape` | tuple | Size of the table as `(n_entries, n_columns)`, the number of entries and number of columns. Not required if you're initializing the object with `data` and `keys`. | -| `name` | unicode | A name to identify the vectors table. | -| **RETURNS** | `Vectors` | The newly created object. | +| Name | Description | +| -------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| _keyword-only_ | | +| `shape` | Size of the table as `(n_entries, n_columns)`, the number of entries and number of columns. Not required if you're initializing the object with `data` and `keys`. ~~Tuple[int, int]~~ | +| `data` | The vector data. ~~numpy.ndarray[ndim=1, dtype=float32]~~ | +| `keys` | A sequence of keys aligned with the data. ~~Iterable[Union[str, int]]~~ | +| `name` | A name to identify the vectors table. ~~str~~ | ## Vectors.\_\_getitem\_\_ {#getitem tag="method"} @@ -51,10 +51,10 @@ raised. > assert cat_vector == nlp.vocab["cat"].vector > ``` -| Name | Type | Description | -| ------- | ---------------------------------- | ------------------------------ | -| `key` | int | The key to get the vector for. | -| returns | `ndarray[ndim=1, dtype='float32']` | The vector for the key. | +| Name | Description | +| ----------- | ---------------------------------------------------------------- | +| `key` | The key to get the vector for. ~~int~~ | +| **RETURNS** | The vector for the key. ~~numpy.ndarray[ndim=1, dtype=float32]~~ | ## Vectors.\_\_setitem\_\_ {#setitem tag="method"} @@ -68,10 +68,10 @@ Set a vector for the given key. > nlp.vocab.vectors[cat_id] = vector > ``` -| Name | Type | Description | -| -------- | ---------------------------------- | ------------------------------ | -| `key` | int | The key to set the vector for. | -| `vector` | `ndarray[ndim=1, dtype='float32']` | The vector to set. | +| Name | Description | +| -------- | ----------------------------------------------------------- | +| `key` | The key to set the vector for. ~~int~~ | +| `vector` | The vector to set. ~~numpy.ndarray[ndim=1, dtype=float32]~~ | ## Vectors.\_\_iter\_\_ {#iter tag="method"} @@ -84,9 +84,9 @@ Iterate over the keys in the table. > print(key, nlp.vocab.strings[key]) > ``` -| Name | Type | Description | -| ---------- | ---- | ------------------- | -| **YIELDS** | int | A key in the table. | +| Name | Description | +| ---------- | --------------------------- | +| **YIELDS** | A key in the table. ~~int~~ | ## Vectors.\_\_len\_\_ {#len tag="method"} @@ -99,9 +99,9 @@ Return the number of vectors in the table. > assert len(vectors) == 3 > ``` -| Name | Type | Description | -| ----------- | ---- | ----------------------------------- | -| **RETURNS** | int | The number of vectors in the table. | +| Name | Description | +| ----------- | ------------------------------------------- | +| **RETURNS** | The number of vectors in the table. ~~int~~ | ## Vectors.\_\_contains\_\_ {#contains tag="method"} @@ -115,16 +115,16 @@ Check whether a key has been mapped to a vector entry in the table. > assert cat_id in vectors > ``` -| Name | Type | Description | -| ----------- | ---- | ----------------------------------- | -| `key` | int | The key to check. | -| **RETURNS** | bool | Whether the key has a vector entry. 
| +| Name | Description | +| ----------- | -------------------------------------------- | +| `key` | The key to check. ~~int~~ | +| **RETURNS** | Whether the key has a vector entry. ~~bool~~ | ## Vectors.add {#add tag="method"} Add a key to the table, optionally setting a vector value as well. Keys can be mapped to an existing vector by setting `row`, or a new vector can be added. -When adding unicode keys, keep in mind that the `Vectors` class itself has no +When adding string keys, keep in mind that the `Vectors` class itself has no [`StringStore`](/api/stringstore), so you have to store the hash-to-string mapping separately. If you need to manage the strings, you should use the `Vectors` via the [`Vocab`](/api/vocab) class, e.g. `vocab.vectors`. @@ -138,12 +138,13 @@ mapping separately. If you need to manage the strings, you should use the > nlp.vocab.vectors.add("dog", row=0) > ``` -| Name | Type | Description | -| ----------- | ---------------------------------- | ----------------------------------------------------- | -| `key` | unicode / int | The key to add. | -| `vector` | `ndarray[ndim=1, dtype='float32']` | An optional vector to add for the key. | -| `row` | int | An optional row number of a vector to map the key to. | -| **RETURNS** | int | The row the vector was added to. | +| Name | Description | +| -------------- | ------------------------------------------------------------------------------- | +| `key` | The key to add. ~~Union[str, int]~~ | +| _keyword-only_ | | +| `vector` | An optional vector to add for the key. ~~numpy.ndarray[ndim=1, dtype=float32]~~ | +| `row` | An optional row number of a vector to map the key to. ~~int~~ | +| **RETURNS** | The row the vector was added to. ~~int~~ | ## Vectors.resize {#resize tag="method"} @@ -159,11 +160,11 @@ These removed items are returned as a list of `(key, row)` tuples. > removed = nlp.vocab.vectors.resize((10000, 300)) > ``` -| Name | Type | Description | -| ----------- | ----- | -------------------------------------------------------------------- | -| `shape` | tuple | A `(rows, dims)` tuple describing the number of rows and dimensions. | -| `inplace` | bool | Reallocate the memory. | -| **RETURNS** | list | The removed items as a list of `(key, row)` tuples. | +| Name | Description | +| ----------- | ---------------------------------------------------------------------------------------- | +| `shape` | A `(rows, dims)` tuple describing the number of rows and dimensions. ~~Tuple[int, int]~~ | +| `inplace` | Reallocate the memory. ~~bool~~ | +| **RETURNS** | The removed items as a list of `(key, row)` tuples. ~~List[Tuple[int, int]]~~ | ## Vectors.keys {#keys tag="method"} @@ -176,9 +177,9 @@ A sequence of the keys in the table. > print(key, nlp.vocab.strings[key]) > ``` -| Name | Type | Description | -| ----------- | -------- | ----------- | -| **RETURNS** | iterable | The keys. | +| Name | Description | +| ----------- | --------------------------- | +| **RETURNS** | The keys. ~~Iterable[int]~~ | ## Vectors.values {#values tag="method"} @@ -193,9 +194,9 @@ the length of the vectors table. > print(vector) > ``` -| Name | Type | Description | -| ---------- | ---------------------------------- | ---------------------- | -| **YIELDS** | `ndarray[ndim=1, dtype='float32']` | A vector in the table. | +| Name | Description | +| ---------- | --------------------------------------------------------------- | +| **YIELDS** | A vector in the table. 
~~numpy.ndarray[ndim=1, dtype=float32]~~ | ## Vectors.items {#items tag="method"} @@ -208,9 +209,9 @@ Iterate over `(key, vector)` pairs, in order. > print(key, nlp.vocab.strings[key], vector) > ``` -| Name | Type | Description | -| ---------- | ----- | -------------------------------- | -| **YIELDS** | tuple | `(key, vector)` pairs, in order. | +| Name | Description | +| ---------- | ------------------------------------------------------------------------------------- | +| **YIELDS** | `(key, vector)` pairs, in order. ~~Tuple[int, numpy.ndarray[ndim=1, dtype=float32]]~~ | ## Vectors.find {#find tag="method"} @@ -225,13 +226,14 @@ Look up one or more keys by row, or vice versa. > keys = nlp.vocab.vectors.find(rows=[18, 256, 985]) > ``` -| Name | Type | Description | -| ----------- | ------------------------------------- | ------------------------------------------------------------------------ | -| `key` | unicode / int | Find the row that the given key points to. Returns int, `-1` if missing. | -| `keys` | iterable | Find rows that the keys point to. Returns `ndarray`. | -| `row` | int | Find the first key that points to the row. Returns int. | -| `rows` | iterable | Find the keys that point to the rows. Returns ndarray. | -| **RETURNS** | The requested key, keys, row or rows. | +| Name | Description | +| -------------- | -------------------------------------------------------------------------------------------- | +| _keyword-only_ | | +| `key` | Find the row that the given key points to. Returns int, `-1` if missing. ~~Union[str, int]~~ | +| `keys` | Find rows that the keys point to. Returns `numpy.ndarray`. ~~Iterable[Union[str, int]]~~ | +| `row` | Find the first key that points to the row. Returns integer. ~~int~~ | +| `rows` | Find the keys that point to the rows. Returns `numpy.ndarray`. ~~Iterable[int]~~ | +| **RETURNS** | The requested key, keys, row or rows. ~~Union[int, numpy.ndarray[ndim=1, dtype=float32]]~~ | ## Vectors.shape {#shape tag="property"} @@ -248,9 +250,9 @@ vector table. > assert dims == 300 > ``` -| Name | Type | Description | -| ----------- | ----- | ---------------------- | -| **RETURNS** | tuple | A `(rows, dims)` pair. | +| Name | Description | +| ----------- | ------------------------------------------ | +| **RETURNS** | A `(rows, dims)` pair. ~~Tuple[int, int]~~ | ## Vectors.size {#size tag="property"} @@ -263,9 +265,9 @@ The vector size, i.e. `rows * dims`. > assert vectors.size == 150000 > ``` -| Name | Type | Description | -| ----------- | ---- | ---------------- | -| **RETURNS** | int | The vector size. | +| Name | Description | +| ----------- | ------------------------ | +| **RETURNS** | The vector size. ~~int~~ | ## Vectors.is_full {#is_full tag="property"} @@ -281,14 +283,14 @@ If a table is full, it can be resized using > assert vectors.is_full > ``` -| Name | Type | Description | -| ----------- | ---- | ---------------------------------- | -| **RETURNS** | bool | Whether the vectors table is full. | +| Name | Description | +| ----------- | ------------------------------------------- | +| **RETURNS** | Whether the vectors table is full. ~~bool~~ | ## Vectors.n_keys {#n_keys tag="property"} Get the number of keys in the table. Note that this is the number of _all_ keys, -not just unique vectors. If several keys are mapped are mapped to the same +not just unique vectors. If several keys are mapped to the same vectors, they will be counted individually. > #### Example @@ -299,16 +301,16 @@ vectors, they will be counted individually. 
> assert vectors.n_keys == 0 > ``` -| Name | Type | Description | -| ----------- | ---- | ------------------------------------ | -| **RETURNS** | int | The number of all keys in the table. | +| Name | Description | +| ----------- | -------------------------------------------- | +| **RETURNS** | The number of all keys in the table. ~~int~~ | ## Vectors.most_similar {#most_similar tag="method"} -For each of the given vectors, find the `n` most similar entries to it, by +For each of the given vectors, find the `n` most similar entries to it by cosine. Queries are by vector. Results are returned as a `(keys, best_rows, scores)` tuple. If `queries` is large, the calculations are -performed in chunks, to avoid consuming too much memory. You can set the +performed in chunks to avoid consuming too much memory. You can set the `batch_size` to control the size/space trade-off during the calculations. > #### Example @@ -318,13 +320,14 @@ performed in chunks, to avoid consuming too much memory. You can set the > most_similar = nlp.vocab.vectors.most_similar(queries, n=10) > ``` -| Name | Type | Description | -| ------------ | --------- | ------------------------------------------------------------------ | -| `queries` | `ndarray` | An array with one or more vectors. | -| `batch_size` | int | The batch size to use. Default to `1024`. | -| `n` | int | The number of entries to return for each query. Defaults to `1`. | -| `sort` | bool | Whether to sort the entries returned by score. Defaults to `True`. | -| **RETURNS** | tuple | The most similar entries as a `(keys, best_rows, scores)` tuple. | +| Name | Description | +| -------------- | --------------------------------------------------------------------------- | +| `queries` | An array with one or more vectors. ~~numpy.ndarray~~ | +| _keyword-only_ | | +| `batch_size` | The batch size to use. Default to `1024`. ~~int~~ | +| `n` | The number of entries to return for each query. Defaults to `1`. ~~int~~ | +| `sort` | Whether to sort the entries returned by score. Defaults to `True`. ~~bool~~ | +| **RETURNS** | tuple | The most similar entries as a `(keys, best_rows, scores)` tuple. ~~Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]~~ | ## Vectors.to_disk {#to_disk tag="method"} @@ -337,9 +340,9 @@ Save the current state to a directory. > > ``` -| Name | Type | Description | -| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- | -| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | +| Name | Description | +| ------ | ------------------------------------------------------------------------------------------------------------------------------------------ | +| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | ## Vectors.from_disk {#from_disk tag="method"} @@ -352,10 +355,10 @@ Loads state from a directory. Modifies the object in place and returns it. > vectors.from_disk("/path/to/vectors") > ``` -| Name | Type | Description | -| ----------- | ---------------- | -------------------------------------------------------------------------- | -| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. | -| **RETURNS** | `Vectors` | The modified `Vectors` object. 
| +| Name | Description | +| ----------- | ----------------------------------------------------------------------------------------------- | +| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| **RETURNS** | The modified `Vectors` object. ~~Vectors~~ | ## Vectors.to_bytes {#to_bytes tag="method"} @@ -367,9 +370,9 @@ Serialize the current state to a binary string. > vectors_bytes = vectors.to_bytes() > ``` -| Name | Type | Description | -| ----------- | ----- | -------------------------------------------- | -| **RETURNS** | bytes | The serialized form of the `Vectors` object. | +| Name | Description | +| ----------- | ------------------------------------------------------ | +| **RETURNS** | The serialized form of the `Vectors` object. ~~bytes~~ | ## Vectors.from_bytes {#from_bytes tag="method"} @@ -384,15 +387,15 @@ Load state from a binary string. > new_vectors.from_bytes(vectors_bytes) > ``` -| Name | Type | Description | -| ----------- | --------- | ---------------------- | -| `data` | bytes | The data to load from. | -| **RETURNS** | `Vectors` | The `Vectors` object. | +| Name | Description | +| ----------- | --------------------------------- | +| `data` | The data to load from. ~~bytes~~ | +| **RETURNS** | The `Vectors` object. ~~Vectors~~ | ## Attributes {#attributes} -| Name | Type | Description | -| --------- | ---------------------------------- | ------------------------------------------------------------------------------- | -| `data` | `ndarray[ndim=1, dtype='float32']` | Stored vectors data. `numpy` is used for CPU vectors, `cupy` for GPU vectors. | -| `key2row` | dict | Dictionary mapping word hashes to rows in the `Vectors.data` table. | -| `keys` | `ndarray[ndim=1, dtype='float32']` | Array keeping the keys in order, such that `keys[vectors.key2row[key]] == key`. | +| Name | Description | +| --------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `data` | Stored vectors data. `numpy` is used for CPU vectors, `cupy` for GPU vectors. ~~Union[numpy.ndarray[ndim=1, dtype=float32], cupy.ndarray[ndim=1, dtype=float32]]~~ | +| `key2row` | Dictionary mapping word hashes to rows in the `Vectors.data` table. ~~Dict[int, int]~~ | +| `keys` | Array keeping the keys in order, such that `keys[vectors.key2row[key]] == key`. ~~Union[numpy.ndarray[ndim=1, dtype=float32], cupy.ndarray[ndim=1, dtype=float32]]~~ | diff --git a/website/docs/api/vocab.md b/website/docs/api/vocab.md index 2be6d67ed..a2ca63002 100644 --- a/website/docs/api/vocab.md +++ b/website/docs/api/vocab.md @@ -21,17 +21,15 @@ Create the vocabulary. > vocab = Vocab(strings=["hello", "world"]) > ``` -| Name | Type | Description | -| ------------------------------------------- | -------------------- | ------------------------------------------------------------------------------------------------------------------ | -| `lex_attr_getters` | dict | A dictionary mapping attribute IDs to functions to compute them. Defaults to `None`. | -| `tag_map` | dict | A dictionary mapping fine-grained tags to coarse-grained parts-of-speech, and optionally morphological attributes. | -| `lemmatizer` | object | A lemmatizer. Defaults to `None`. | -| `strings` | `StringStore` / list | A [`StringStore`](/api/stringstore) that maps strings to hash values, and vice versa, or a list of strings. 
| -| `lookups` | `Lookups` | A [`Lookups`](/api/lookups) that stores the `lemma_\*`, `lexeme_norm` and other large lookup tables. Defaults to `None`. | -| `lookups_extra` 2.3 | `Lookups` | A [`Lookups`](/api/lookups) that stores the optional `lexeme_cluster`/`lexeme_prob`/`lexeme_sentiment`/`lexeme_settings` lookup tables. Defaults to `None`. | -| `oov_prob` | float | The default OOV probability. Defaults to `-20.0`. | -| `vectors_name` 2.2 | unicode | A name to identify the vectors table. | -| **RETURNS** | `Vocab` | The newly constructed object. | +| Name | Description | +| ------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `lex_attr_getters` | A dictionary mapping attribute IDs to functions to compute them. Defaults to `None`. ~~Optional[Dict[str, Callable[[str], Any]]]~~ | +| `strings` | A [`StringStore`](/api/stringstore) that maps strings to hash values, and vice versa, or a list of strings. ~~Union[List[str], StringStore]~~ | +| `lookups` | A [`Lookups`](/api/lookups) that stores the `lexeme_norm` and other large lookup tables. Defaults to `None`. ~~Optional[Lookups]~~ | +| `oov_prob` | The default OOV probability. Defaults to `-20.0`. ~~float~~ | +| `vectors_name` 2.2 | A name to identify the vectors table. ~~str~~ | +| `writing_system` | A dictionary describing the language's writing system. Typically provided by [`Language.Defaults`](/api/language#defaults). ~~Dict[str, Any]~~ | +| `get_noun_chunks` | A function that yields base noun phrases used for [`Doc.noun_chunks`](/ap/doc#noun_chunks). ~~Optional[Callable[[Union[Doc, Span], Iterator[Span]]]]~~ | ## Vocab.\_\_len\_\_ {#len tag="method"} @@ -44,14 +42,14 @@ Get the current number of lexemes in the vocabulary. > assert len(nlp.vocab) > 0 > ``` -| Name | Type | Description | -| ----------- | ---- | ---------------------------------------- | -| **RETURNS** | int | The number of lexemes in the vocabulary. | +| Name | Description | +| ----------- | ------------------------------------------------ | +| **RETURNS** | The number of lexemes in the vocabulary. ~~int~~ | ## Vocab.\_\_getitem\_\_ {#getitem tag="method"} -Retrieve a lexeme, given an int ID or a unicode string. If a previously unseen -unicode string is given, a new lexeme is created and stored. +Retrieve a lexeme, given an int ID or a string. If a previously unseen string is +given, a new lexeme is created and stored. > #### Example > @@ -60,10 +58,10 @@ unicode string is given, a new lexeme is created and stored. > assert nlp.vocab[apple] == nlp.vocab["apple"] > ``` -| Name | Type | Description | -| -------------- | ------------- | ------------------------------------------------ | -| `id_or_string` | int / unicode | The hash value of a word, or its unicode string. | -| **RETURNS** | `Lexeme` | The lexeme indicated by the given ID. | +| Name | Description | +| -------------- | ------------------------------------------------------------ | +| `id_or_string` | The hash value of a word, or its string. ~~Union[int, str]~~ | +| **RETURNS** | The lexeme indicated by the given ID. ~~Lexeme~~ | ## Vocab.\_\_iter\_\_ {#iter tag="method"} @@ -75,9 +73,9 @@ Iterate over the lexemes in the vocabulary. > stop_words = (lex for lex in nlp.vocab if lex.is_stop) > ``` -| Name | Type | Description | -| ---------- | -------- | --------------------------- | -| **YIELDS** | `Lexeme` | An entry in the vocabulary. 
| +| Name | Description | +| ---------- | -------------------------------------- | +| **YIELDS** | An entry in the vocabulary. ~~Lexeme~~ | ## Vocab.\_\_contains\_\_ {#contains tag="method"} @@ -94,10 +92,10 @@ given string, you need to look it up in > assert oov not in nlp.vocab > ``` -| Name | Type | Description | -| ----------- | ------- | -------------------------------------------------- | -| `string` | unicode | The ID string. | -| **RETURNS** | bool | Whether the string has an entry in the vocabulary. | +| Name | Description | +| ----------- | ----------------------------------------------------------- | +| `string` | The ID string. ~~str~~ | +| **RETURNS** | Whether the string has an entry in the vocabulary. ~~bool~~ | ## Vocab.add_flag {#add_flag tag="method"} @@ -118,11 +116,11 @@ using `token.check_flag(flag_id)`. > assert doc[2].check_flag(MY_PRODUCT) == True > ``` -| Name | Type | Description | -| ------------- | ---- | ----------------------------------------------------------------------------------------------------------------------------------------------- | -| `flag_getter` | dict | A function `f(unicode) -> bool`, to get the flag value. | -| `flag_id` | int | An integer between 1 and 63 (inclusive), specifying the bit at which the flag will be stored. If `-1`, the lowest available bit will be chosen. | -| **RETURNS** | int | The integer ID by which the flag value can be checked. | +| Name | Description | +| ------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `flag_getter` | A function that takes the lexeme text and returns the boolean flag value. ~~Callable[[str], bool]~~ | +| `flag_id` | An integer between `1` and `63` (inclusive), specifying the bit at which the flag will be stored. If `-1`, the lowest available bit will be chosen. ~~int~~ | +| **RETURNS** | The integer ID by which the flag value can be checked. ~~int~~ | ## Vocab.reset_vectors {#reset_vectors tag="method" new="2"} @@ -136,10 +134,11 @@ have to call this to change the size of the vectors. Only one of the `width` and > nlp.vocab.reset_vectors(width=300) > ``` -| Name | Type | Description | -| ------- | ---- | -------------------------------------- | -| `width` | int | The new width (keyword argument only). | -| `shape` | int | The new shape (keyword argument only). | +| Name | Description | +| -------------- | ---------------------- | +| _keyword-only_ | | +| `width` | The new width. ~~int~~ | +| `shape` | The new shape. ~~int~~ | ## Vocab.prune_vectors {#prune_vectors tag="method" new="2"} @@ -151,7 +150,7 @@ rows, we would discard the vectors for "feline" and "reclined". These words would then be remapped to the closest remaining vector – so "feline" would have the same vector as "cat", and "reclined" would have the same vector as "sat". The similarities are judged by cosine. The original vectors may be large, so the -cosines are calculated in minibatches, to reduce memory usage. +cosines are calculated in minibatches to reduce memory usage. > #### Example > @@ -160,18 +159,18 @@ cosines are calculated in minibatches, to reduce memory usage. 
> assert len(nlp.vocab.vectors) <= 1000 > ``` -| Name | Type | Description | -| ------------ | ---- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `nr_row` | int | The number of rows to keep in the vector table. | -| `batch_size` | int | Batch of vectors for calculating the similarities. Larger batch sizes might be faster, while temporarily requiring more memory. | -| **RETURNS** | dict | A dictionary keyed by removed words mapped to `(string, score)` tuples, where `string` is the entry the removed word was mapped to, and `score` the similarity score between the two words. | +| Name | Description | +| ------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `nr_row` | The number of rows to keep in the vector table. ~~int~~ | +| `batch_size` | Batch of vectors for calculating the similarities. Larger batch sizes might be faster, while temporarily requiring more memory. ~~int~~ | +| **RETURNS** | A dictionary keyed by removed words mapped to `(string, score)` tuples, where `string` is the entry the removed word was mapped to, and `score` the similarity score between the two words. ~~Dict[str, Tuple[str, float]]~~ | ## Vocab.get_vector {#get_vector tag="method" new="2"} Retrieve a vector for a word in the vocabulary. Words can be looked up by string or hash value. If no vectors data is loaded, a `ValueError` is raised. If `minn` is defined, then the resulting vector uses [FastText](https://fasttext.cc/)'s -subword features by average over ngrams of `orth` (introduced in spaCy `v2.1`). +subword features by average over n-grams of `orth` (introduced in spaCy `v2.1`). > #### Example > @@ -180,16 +179,16 @@ subword features by average over ngrams of `orth` (introduced in spaCy `v2.1`). > nlp.vocab.get_vector("apple", minn=1, maxn=5) > ``` -| Name | Type | Description | -| ----------------------------------- | ---------------------------------------- | ---------------------------------------------------------------------------------------------- | -| `orth` | int / unicode | The hash value of a word, or its unicode string. | -| `minn` 2.1 | int | Minimum n-gram length used for FastText's ngram computation. Defaults to the length of `orth`. | -| `maxn` 2.1 | int | Maximum n-gram length used for FastText's ngram computation. Defaults to the length of `orth`. | -| **RETURNS** | `numpy.ndarray[ndim=1, dtype='float32']` | A word vector. Size and shape are determined by the `Vocab.vectors` instance. | +| Name | Description | +| ----------------------------------- | ---------------------------------------------------------------------------------------------------------------------- | +| `orth` | The hash value of a word, or its unicode string. ~~Union[int, str]~~ | +| `minn` 2.1 | Minimum n-gram length used for FastText's n-gram computation. Defaults to the length of `orth`. ~~int~~ | +| `maxn` 2.1 | Maximum n-gram length used for FastText's n-gram computation. Defaults to the length of `orth`. ~~int~~ | +| **RETURNS** | A word vector. Size and shape are determined by the `Vocab.vectors` instance. ~~numpy.ndarray[ndim=1, dtype=float32]~~ | ## Vocab.set_vector {#set_vector tag="method" new="2"} -Set a vector for a word in the vocabulary. 
Words can be referenced by by string +Set a vector for a word in the vocabulary. Words can be referenced by string or hash value. > #### Example @@ -198,10 +197,10 @@ or hash value. > nlp.vocab.set_vector("apple", array([...])) > ``` -| Name | Type | Description | -| -------- | ---------------------------------------- | ------------------------------------------------ | -| `orth` | int / unicode | The hash value of a word, or its unicode string. | -| `vector` | `numpy.ndarray[ndim=1, dtype='float32']` | The vector to set. | +| Name | Description | +| -------- | -------------------------------------------------------------------- | +| `orth` | The hash value of a word, or its unicode string. ~~Union[int, str]~~ | +| `vector` | The vector to set. ~~numpy.ndarray[ndim=1, dtype=float32]~~ | ## Vocab.has_vector {#has_vector tag="method" new="2"} @@ -215,10 +214,10 @@ Words can be looked up by string or hash value. > vector = nlp.vocab.get_vector("apple") > ``` -| Name | Type | Description | -| ----------- | ------------- | ------------------------------------------------ | -| `orth` | int / unicode | The hash value of a word, or its unicode string. | -| **RETURNS** | bool | Whether the word has a vector. | +| Name | Description | +| ----------- | -------------------------------------------------------------------- | +| `orth` | The hash value of a word, or its unicode string. ~~Union[int, str]~~ | +| **RETURNS** | Whether the word has a vector. ~~bool~~ | ## Vocab.to_disk {#to_disk tag="method" new="2"} @@ -230,10 +229,11 @@ Save the current state to a directory. > nlp.vocab.to_disk("/path/to/vocab") > ``` -| Name | Type | Description | -| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- | -| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | -| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ | +| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | ## Vocab.from_disk {#from_disk tag="method" new="2"} @@ -246,11 +246,12 @@ Loads state from a directory. Modifies the object in place and returns it. > vocab = Vocab().from_disk("/path/to/vocab") > ``` -| Name | Type | Description | -| ----------- | ---------------- | -------------------------------------------------------------------------- | -| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. | -| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | -| **RETURNS** | `Vocab` | The modified `Vocab` object. | +| Name | Description | +| -------------- | ----------------------------------------------------------------------------------------------- | +| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. 
~~Iterable[str]~~ | +| **RETURNS** | The modified `Vocab` object. ~~Vocab~~ | ## Vocab.to_bytes {#to_bytes tag="method"} @@ -262,10 +263,11 @@ Serialize the current state to a binary string. > vocab_bytes = nlp.vocab.to_bytes() > ``` -| Name | Type | Description | -| ----------- | ----- | ------------------------------------------------------------------------- | -| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | -| **RETURNS** | bytes | The serialized form of the `Vocab` object. | +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------- | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The serialized form of the `Vocab` object. ~~Vocab~~ | ## Vocab.from_bytes {#from_bytes tag="method"} @@ -280,11 +282,12 @@ Load state from a binary string. > vocab.from_bytes(vocab_bytes) > ``` -| Name | Type | Description | -| ------------ | ------- | ------------------------------------------------------------------------- | -| `bytes_data` | bytes | The data to load from. | -| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | -| **RETURNS** | `Vocab` | The `Vocab` object. | +| Name | Description | +| -------------- | ------------------------------------------------------------------------------------------- | +| `bytes_data` | The data to load from. ~~bytes~~ | +| _keyword-only_ | | +| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ | +| **RETURNS** | The `Vocab` object. ~~Vocab~~ | ## Attributes {#attributes} @@ -297,13 +300,13 @@ Load state from a binary string. > assert type(PERSON) == int > ``` -| Name | Type | Description | -| --------------------------------------------- | ------------- | ------------------------------------------------------------ | -| `strings` | `StringStore` | A table managing the string-to-int mapping. | -| `vectors` 2 | `Vectors` | A table associating word IDs to word vectors. | -| `vectors_length` | int | Number of dimensions for each word vector. | -| `lookups` | `Lookups` | The available lookup tables in this vocab. | -| `writing_system` 2.1 | dict | A dict with information about the language's writing system. | +| Name | Description | +| --------------------------------------------- | ------------------------------------------------------------------------------- | +| `strings` | A table managing the string-to-int mapping. ~~StringStore~~ | +| `vectors` 2 | A table associating word IDs to word vectors. ~~Vectors~~ | +| `vectors_length` | Number of dimensions for each word vector. ~~int~~ | +| `lookups` | The available lookup tables in this vocab. ~~Lookups~~ | +| `writing_system` 2.1 | A dict with information about the language's writing system. 
~~Dict[str, Any]~~ |

## Serialization fields {#serialization-fields}

diff --git a/website/docs/images/architecture.svg b/website/docs/images/architecture.svg
index 8279e6432..2e271e98a 100644
--- a/website/docs/images/architecture.svg
+++ b/website/docs/images/architecture.svg
@@ -1,132 +1,226 @@
- [SVG diagram markup; labels: Language, Vocab, StringStore, Tokenizer, Doc, Token, Span, Lexeme, Morphology, Dependency Parser, Entity Recognizer, Tagger, TextCategorizer, Custom Components]
+ [SVG diagram markup]
diff --git a/website/docs/images/cli_init_fill-config_diff.jpg b/website/docs/images/cli_init_fill-config_diff.jpg
new file mode 100644
index 000000000..3e3751726
Binary files /dev/null and b/website/docs/images/cli_init_fill-config_diff.jpg differ
diff --git a/website/docs/images/dep-match-diagram.svg b/website/docs/images/dep-match-diagram.svg
new file mode 100644
index 000000000..676be4137
--- /dev/null
+++ b/website/docs/images/dep-match-diagram.svg
@@ -0,0 +1,39 @@
+ [SVG diagram markup]
diff --git a/website/docs/images/displacy-dep-founded.html b/website/docs/images/displacy-dep-founded.html
new file mode 100644
index 000000000..e22984ee1
--- /dev/null
+++ b/website/docs/images/displacy-dep-founded.html
@@ -0,0 +1,58 @@
+ [displaCy dependency visualization markup; labels: Smith, founded, a, healthcare, company, nsubj, det, compound, dobj]
diff --git a/website/docs/images/language_data.svg b/website/docs/images/language_data.svg
deleted file mode 100644
index 58482b2c5..000000000
--- a/website/docs/images/language_data.svg
+++ /dev/null
@@ -1,85 +0,0 @@
- [SVG diagram markup; labels: Tokenizer, Base data, Language data, stop words, lexical attributes, tokenizer exceptions, prefixes/suffixes/infixes, lemma data, Lemmatizer, char classes, Token, morph rules, tag map, Morphology]
diff --git a/website/docs/images/lifecycle.svg b/website/docs/images/lifecycle.svg
new file mode 100644
index 000000000..2f4b304b8
--- /dev/null
+++ b/website/docs/images/lifecycle.svg
@@ -0,0 +1,93 @@
+ [SVG diagram markup]
diff --git a/website/docs/images/pipeline.svg b/website/docs/images/pipeline.svg
index 022219c5f..9ece70e6f 100644
--- a/website/docs/images/pipeline.svg
+++ b/website/docs/images/pipeline.svg
@@ -1,30 +1,33 @@
- [SVG diagram markup; labels: Doc, Text, nlp, tokenizer, tagger, parser, ner, ...]
+ [SVG diagram markup]
diff --git a/website/docs/images/pipeline_transformer.svg b/website/docs/images/pipeline_transformer.svg
new file mode 100644
index 000000000..cfbf470cc
--- /dev/null
+++ b/website/docs/images/pipeline_transformer.svg
@@ -0,0 +1,37 @@
+ [SVG diagram markup]
diff --git a/website/docs/images/prodigy_overview.jpg b/website/docs/images/prodigy_overview.jpg
new file mode 100644
index 000000000..84326ccea
Binary files /dev/null and b/website/docs/images/prodigy_overview.jpg differ
diff --git a/website/docs/images/project_document.jpg b/website/docs/images/project_document.jpg
new file mode 100644
index 000000000..7942619a8
Binary files /dev/null and b/website/docs/images/project_document.jpg differ
diff --git a/website/docs/images/projects.png b/website/docs/images/projects.png
new file mode 100644
index 000000000..934e98e0a
Binary files /dev/null and b/website/docs/images/projects.png differ
diff --git a/website/docs/images/projects.svg b/website/docs/images/projects.svg
new file mode 100644
index 000000000..c7518d445
--- /dev/null
+++ b/website/docs/images/projects.svg
@@ -0,0 +1,92 @@
+ [SVG diagram markup]
diff --git a/website/docs/images/sense2vec.jpg b/website/docs/images/sense2vec.jpg
new file mode 100644
index 000000000..3a1772582
Binary files /dev/null and b/website/docs/images/sense2vec.jpg differ
diff --git a/website/docs/images/spacy-ray.svg b/website/docs/images/spacy-ray.svg
new file mode 100644
index 000000000..4c2fd81f1
--- /dev/null
+++ b/website/docs/images/spacy-ray.svg
@@ -0,0 +1,55 @@
+ [SVG diagram markup]
diff --git a/website/docs/images/spacy-streamlit.png b/website/docs/images/spacy-streamlit.png
new file mode 100644
index 000000000..8f617d49f
Binary files /dev/null and b/website/docs/images/spacy-streamlit.png differ
diff --git a/website/docs/images/thinc_mypy.jpg b/website/docs/images/thinc_mypy.jpg
new file mode 100644
index 000000000..c0f7ee636
Binary files /dev/null and b/website/docs/images/thinc_mypy.jpg differ
diff --git a/website/docs/images/tok2vec-listener.svg b/website/docs/images/tok2vec-listener.svg
new file mode 100644
index 000000000..bb67d2186
--- /dev/null
+++ b/website/docs/images/tok2vec-listener.svg
@@ -0,0 +1,41 @@
+ [SVG diagram markup]
diff --git a/website/docs/images/tok2vec.svg b/website/docs/images/tok2vec.svg
new file mode 100644
index 000000000..5338b6280
--- /dev/null
+++ b/website/docs/images/tok2vec.svg
@@ -0,0 +1,17 @@
+ [SVG diagram markup]
diff --git a/website/docs/images/tokenization.svg b/website/docs/images/tokenization.svg
index 9877e1a30..d676fdace 100644
--- a/website/docs/images/tokenization.svg
+++ b/website/docs/images/tokenization.svg
@@ -1,123 +1,305 @@
- [SVG diagram markup; labels: "Let's go to N.Y.!", Let, 's, go, to, N.Y., !, EXCEPTION, PREFIX, SUFFIX, DONE]
+ [SVG diagram markup]
diff --git a/website/docs/images/trainable_component.svg b/website/docs/images/trainable_component.svg
new file mode 100644
index 000000000..621ff90ef
--- /dev/null
+++ b/website/docs/images/trainable_component.svg
@@ -0,0 +1,55 @@
+ [SVG diagram markup]
diff --git a/website/docs/images/training-loop.svg b/website/docs/images/training-loop.svg
deleted file mode 100644
index e883b36be..000000000
--- a/website/docs/images/training-loop.svg
+++ /dev/null
@@ -1,40 +0,0 @@
- [SVG diagram markup; labels: Training data, label, text, Doc, GoldParse, update, nlp, optimizer]
diff --git a/website/docs/images/training.svg b/website/docs/images/training.svg
index 0a4649c48..4d6bcc0cc 100644
--- a/website/docs/images/training.svg
+++ b/website/docs/images/training.svg
@@ -1,47 +1,60 @@
- [SVG diagram markup; labels: PREDICT, SAVE, Model, Training data, label, text, Updated Model, GRADIENT]
+ [SVG diagram markup]
diff --git a/website/docs/images/vocab_stringstore.svg b/website/docs/images/vocab_stringstore.svg
index b604041f2..e10ff3c58 100644
--- a/website/docs/images/vocab_stringstore.svg
+++ b/website/docs/images/vocab_stringstore.svg
@@ -1,77 +1,118 @@
- - Lexeme - - - "coffee" - - 31979… - - "I" - - 46904… - - "love" - - 37020… - - - - - nsubj - - - - dobj - - String - Store - - Vocab - - Doc - - love - VERB - - Token - - I - PRON - - Token - - coffee - NOUN - - Token - - - - - - - - - - - - + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/website/docs/images/wandb1.jpg b/website/docs/images/wandb1.jpg new file mode 100644 index 000000000..3baf4aba0 Binary files /dev/null and b/website/docs/images/wandb1.jpg differ diff --git a/website/docs/images/wandb2.jpg b/website/docs/images/wandb2.jpg new file mode 100644 index 000000000..cd67c9aa4 Binary files /dev/null and b/website/docs/images/wandb2.jpg differ diff --git a/website/docs/models/index.md b/website/docs/models/index.md index 31bc3c549..30b4f11d9 100644 --- a/website/docs/models/index.md +++ b/website/docs/models/index.md @@ -1,174 +1,55 @@ --- -title: Models -teaser: Downloadable pretrained models for spaCy +title: Trained Models & Pipelines +teaser: Downloadable trained pipelines and weights for spaCy menu: - ['Quickstart', 'quickstart'] - - ['Model Architecture', 'architecture'] - ['Conventions', 'conventions'] --- -The models directory includes two types of pretrained models: - -1. **Core models:** General-purpose pretrained models to predict named entities, - part-of-speech tags and syntactic dependencies. Can be used out-of-the-box - and fine-tuned on more specific data. -2. **Starter models:** Transfer learning starter packs with pretrained weights - you can initialize your models with to achieve better accuracy. They can - include word vectors (which will be used as features during training) or - other pretrained representations like BERT. These models don't include - components for specific tasks like NER or text classification and are - intended to be used as base models when training your own models. + ### Quickstart {hidden="true"} +> #### 📖 Installation and usage +> +> For more details on how to use trained pipelines with spaCy, see the +> [usage guide](/usage/models). + import QuickstartModels from 'widgets/quickstart-models.js' - + - +## Package naming conventions {#conventions} -For more details on how to use models with spaCy, see the -[usage guide on models](/usage/models). +In general, spaCy expects all pipeline packages to follow the naming convention +of `[lang]\_[name]`. For spaCy's pipelines, we also chose to divide the name +into three components: - +1. **Type:** Capabilities (e.g. `core` for general-purpose pipeline with + vocabulary, syntax, entities and word vectors, or `dep` for only vocab and + syntax). +2. **Genre:** Type of text the pipeline is trained on, e.g. `web` or `news`. +3. **Size:** Package size indicator, `sm`, `md` or `lg`. -## Model architecture {#architecture} +For example, [`en_core_web_sm`](/models/en#en_core_web_sm) is a small English +pipeline trained on written web text (blogs, news, comments), that includes +vocabulary, vectors, syntax and entities. -spaCy v2.0 features new neural models for **tagging**, **parsing** and **entity -recognition**. The models have been designed and implemented from scratch -specifically for spaCy, to give you an unmatched balance of speed, size and -accuracy. A novel bloom embedding strategy with subword features is used to -support huge vocabularies in tiny tables. 
Convolutional layers with residual -connections, layer normalization and maxout non-linearity are used, giving much -better efficiency than the standard BiLSTM solution. +### Package versioning {#model-versioning} -The parser and NER use an imitation learning objective to deliver **accuracy -in-line with the latest research systems**, even when evaluated from raw text. -With these innovations, spaCy v2.0's models are **10× smaller**, **20% more -accurate**, and **even cheaper to run** than the previous generation. The -current architecture hasn't been published yet, but in the meantime we prepared -a video that explains how the models work, with particular focus on NER. - - - -The parsing model is a blend of recent results. The two recent inspirations have -been the work of Eli Klipperwasser and Yoav Goldberg at Bar Ilan[^1], and the -SyntaxNet team from Google. The foundation of the parser is still based on the -work of Joakim Nivre[^2], who introduced the transition-based framework[^3], the -arc-eager transition system, and the imitation learning objective. The model is -implemented using [Thinc](https://github.com/explosion/thinc), spaCy's machine -learning library. We first predict context-sensitive vectors for each word in -the input: - -```python -(embed_lower | embed_prefix | embed_suffix | embed_shape) - >> Maxout(token_width) - >> convolution ** 4 -``` - -This convolutional layer is shared between the tagger, parser and NER, and will -also be shared by the future neural lemmatizer. Because the parser shares these -layers with the tagger, the parser does not require tag features. I got this -trick from David Weiss's "Stack Combination" paper[^4]. - -To boost the representation, the tagger actually predicts a "super tag" with -POS, morphology and dependency label[^5]. The tagger predicts these supertags by -adding a softmax layer onto the convolutional layer – so, we're teaching the -convolutional layer to give us a representation that's one affine transform from -this informative lexical information. This is obviously good for the parser -(which backprops to the convolutions, too). The parser model makes a state -vector by concatenating the vector representations for its context tokens. The -current context tokens: - -| Context tokens | Description | -| ---------------------------------------------------------------------------------- | --------------------------------------------------------------------------- | -| `S0`, `S1`, `S2` | Top three words on the stack. | -| `B0`, `B1` | First two words of the buffer. | -| `S0L1`, `S1L1`, `S2L1`, `B0L1`, `B1L1`
`S0L2`, `S1L2`, `S2L2`, `B0L2`, `B1L2` | Leftmost and second leftmost children of `S0`, `S1`, `S2`, `B0` and `B1`. | -| `S0R1`, `S1R1`, `S2R1`, `B0R1`, `B1R1`
`S0R2`, `S1R2`, `S2R2`, `B0R2`, `B1R2` | Rightmost and second rightmost children of `S0`, `S1`, `S2`, `B0` and `B1`. | - -This makes the state vector quite long: `13*T`, where `T` is the token vector -width (128 is working well). Fortunately, there's a way to structure the -computation to save some expense (and make it more GPU-friendly). - -The parser typically visits `2*N` states for a sentence of length `N` (although -it may visit more, if it back-tracks with a non-monotonic transition[^4]). A -naive implementation would require `2*N (B, 13*T) @ (13*T, H)` matrix -multiplications for a batch of size `B`. We can instead perform one -`(B*N, T) @ (T, 13*H)` multiplication, to pre-compute the hidden weights for -each positional feature with respect to the words in the batch. (Note that our -token vectors come from the CNN — so we can't play this trick over the -vocabulary. That's how Stanford's NN parser[^3] works — and why its model is so -big.) - -This pre-computation strategy allows a nice compromise between GPU-friendliness -and implementation simplicity. The CNN and the wide lower layer are computed on -the GPU, and then the precomputed hidden weights are moved to the CPU, before we -start the transition-based parsing process. This makes a lot of things much -easier. We don't have to worry about variable-length batch sizes, and we don't -have to implement the dynamic oracle in CUDA to train. - -Currently the parser's loss function is multi-label log loss[^6], as the dynamic -oracle allows multiple states to be 0 cost. This is defined as follows, where -`gZ` is the sum of the scores assigned to gold classes: - -```python -(exp(score) / Z) - (exp(score) / gZ) -``` - - - -1. [Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations {#fn-1}](https://www.semanticscholar.org/paper/Simple-and-Accurate-Dependency-Parsing-Using-Bidir-Kiperwasser-Goldberg/3cf31ecb2724b5088783d7c96a5fc0d5604cbf41). - Eliyahu Kiperwasser, Yoav Goldberg. (2016) -2. [A Dynamic Oracle for Arc-Eager Dependency Parsing {#fn-2}](https://www.semanticscholar.org/paper/A-Dynamic-Oracle-for-Arc-Eager-Dependency-Parsing-Goldberg-Nivre/22697256ec19ecc3e14fcfc63624a44cf9c22df4). - Yoav Goldberg, Joakim Nivre (2012) -3. [Parsing English in 500 Lines of Python {#fn-3}](https://explosion.ai/blog/parsing-english-in-python). - Matthew Honnibal (2013) -4. [Stack-propagation: Improved Representation Learning for Syntax {#fn-4}](https://www.semanticscholar.org/paper/Stack-propagation-Improved-Representation-Learning-Zhang-Weiss/0c133f79b23e8c680891d2e49a66f0e3d37f1466). - Yuan Zhang, David Weiss (2016) -5. [Deep multi-task learning with low level tasks supervised at lower layers {#fn-5}](https://www.semanticscholar.org/paper/Deep-multi-task-learning-with-low-level-tasks-supe-S%C3%B8gaard-Goldberg/03ad06583c9721855ccd82c3d969a01360218d86). - Anders Søgaard, Yoav Goldberg (2016) -6. [An Improved Non-monotonic Transition System for Dependency Parsing {#fn-6}](https://www.semanticscholar.org/paper/An-Improved-Non-monotonic-Transition-System-for-De-Honnibal-Johnson/4094cee47ade13b77b5ab4d2e6cb9dd2b8a2917c). - Matthew Honnibal, Mark Johnson (2015) -7. [A Fast and Accurate Dependency Parser using Neural Networks {#fn-7}](http://cs.stanford.edu/people/danqi/papers/emnlp2014.pdf). - Danqi Cheng, Christopher D. Manning (2014) -8. 
[Parsing the Wall Street Journal using a Lexical-Functional Grammar and Discriminative Estimation Techniques {#fn-8}](https://www.semanticscholar.org/paper/Parsing-the-Wall-Street-Journal-using-a-Lexical-Fu-Riezler-King/0ad07862a91cd59b7eb5de38267e47725a62b8b2). - Stefan Riezler et al. (2002) - - - -## Model naming conventions {#conventions} - -In general, spaCy expects all model packages to follow the naming convention of -`[lang`\_[name]]. For spaCy's models, we also chose to divide the name into -three components: - -1. **Type:** Model capabilities (e.g. `core` for general-purpose model with - vocabulary, syntax, entities and word vectors, or `depent` for only vocab, - syntax and entities). -2. **Genre:** Type of text the model is trained on, e.g. `web` or `news`. -3. **Size:** Model size indicator, `sm`, `md` or `lg`. - -For example, `en_core_web_sm` is a small English model trained on written web -text (blogs, news, comments), that includes vocabulary, vectors, syntax and -entities. - -### Model versioning {#model-versioning} - -Additionally, the model versioning reflects both the compatibility with spaCy, -as well as the major and minor model version. A model version `a.b.c` translates -to: +Additionally, the pipeline package versioning reflects both the compatibility +with spaCy, as well as the major and minor version. A package version `a.b.c` +translates to: - `a`: **spaCy major version**. For example, `2` for spaCy v2.x. -- `b`: **Model major version**. Models with a different major version can't be - loaded by the same code. For example, changing the width of the model, adding - hidden layers or changing the activation changes the model major version. -- `c`: **Model minor version**. Same model structure, but different parameter - values, e.g. from being trained on different data, for different numbers of - iterations, etc. +- `b`: **Package major version**. Pipelines with a different major version can't + be loaded by the same code. For example, changing the width of the model, + adding hidden layers or changing the activation changes the major version. +- `c`: **Package minor version**. Same pipeline structure, but different + parameter values, e.g. from being trained on different data, for different + numbers of iterations, etc. For a detailed compatibility overview, see the -[`compatibility.json`](https://github.com/explosion/spacy-models/tree/master/compatibility.json) -in the models repository. This is also the source of spaCy's internal -compatibility check, performed when you run the [`download`](/api/cli#download) -command. +[`compatibility.json`](https://github.com/explosion/spacy-models/tree/master/compatibility.json). +This is also the source of spaCy's internal compatibility check, performed when +you run the [`download`](/api/cli#download) command. diff --git a/website/docs/styleguide.md b/website/docs/styleguide.md index 4d8aa8748..ed6f9d99b 100644 --- a/website/docs/styleguide.md +++ b/website/docs/styleguide.md @@ -11,6 +11,7 @@ menu: - ['Setup & Installation', 'setup'] - ['Markdown Reference', 'markdown'] - ['Project Structure', 'structure'] + - ['Editorial', 'editorial'] sidebar: - label: Styleguide items: diff --git a/website/docs/usage/101/_architecture.md b/website/docs/usage/101/_architecture.md index 7cd749521..b012c4ec0 100644 --- a/website/docs/usage/101/_architecture.md +++ b/website/docs/usage/101/_architecture.md @@ -1,52 +1,86 @@ -The central data structures in spaCy are the `Doc` and the `Vocab`. 
The `Doc` -object owns the **sequence of tokens** and all their annotations. The `Vocab` -object owns a set of **look-up tables** that make common information available -across documents. By centralizing strings, word vectors and lexical attributes, -we avoid storing multiple copies of this data. This saves memory, and ensures -there's a **single source of truth**. +The central data structures in spaCy are the [`Language`](/api/language) class, +the [`Vocab`](/api/vocab) and the [`Doc`](/api/doc) object. The `Language` class +is used to process a text and turn it into a `Doc` object. It's typically stored +as a variable called `nlp`. The `Doc` object owns the **sequence of tokens** and +all their annotations. By centralizing strings, word vectors and lexical +attributes in the `Vocab`, we avoid storing multiple copies of this data. This +saves memory, and ensures there's a **single source of truth**. Text annotations are also designed to allow a single source of truth: the `Doc` -object owns the data, and `Span` and `Token` are **views that point into it**. -The `Doc` object is constructed by the `Tokenizer`, and then **modified in -place** by the components of the pipeline. The `Language` object coordinates -these components. It takes raw text and sends it through the pipeline, returning -an **annotated document**. It also orchestrates training and serialization. +object owns the data, and [`Span`](/api/span) and [`Token`](/api/token) are +**views that point into it**. The `Doc` object is constructed by the +[`Tokenizer`](/api/tokenizer), and then **modified in place** by the components +of the pipeline. The `Language` object coordinates these components. It takes +raw text and sends it through the pipeline, returning an **annotated document**. +It also orchestrates training and serialization. ![Library architecture](../../images/architecture.svg) ### Container objects {#architecture-containers} -| Name | Description | -| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- | -| [`Doc`](/api/doc) | A container for accessing linguistic annotations. | -| [`Span`](/api/span) | A slice from a `Doc` object. | -| [`Token`](/api/token) | An individual token — i.e. a word, punctuation symbol, whitespace, etc. | -| [`Lexeme`](/api/lexeme) | An entry in the vocabulary. It's a word type with no context, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse etc. | +| Name | Description | +| --------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- | +| [`Doc`](/api/doc) | A container for accessing linguistic annotations. | +| [`DocBin`](/api/docbin) | A collection of `Doc` objects for efficient binary serialization. Also used for [training data](/api/data-formats#binary-training). | +| [`Example`](/api/example) | A collection of training annotations, containing two `Doc` objects: the reference data and the predictions. | +| [`Language`](/api/language) | Processing class that turns text into `Doc` objects. Different languages implement their own subclasses of it. The variable is typically called `nlp`. | +| [`Lexeme`](/api/lexeme) | An entry in the vocabulary. It's a word type with no context, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse etc. 
| +| [`Span`](/api/span) | A slice from a `Doc` object. | +| [`Token`](/api/token) | An individual token — i.e. a word, punctuation symbol, whitespace, etc. | ### Processing pipeline {#architecture-pipeline} -| Name | Description | -| ------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------- | -| [`Language`](/api/language) | A text-processing pipeline. Usually you'll load this once per process as `nlp` and pass the instance around your application. | -| [`Tokenizer`](/api/tokenizer) | Segment text, and create `Doc` objects with the discovered segment boundaries. | -| [`Lemmatizer`](/api/lemmatizer) | Determine the base forms of words. | -| `Morphology` | Assign linguistic features like lemmas, noun case, verb tense etc. based on the word and its part-of-speech tag. | -| [`Tagger`](/api/tagger) | Annotate part-of-speech tags on `Doc` objects. | -| [`DependencyParser`](/api/dependencyparser) | Annotate syntactic dependencies on `Doc` objects. | -| [`EntityRecognizer`](/api/entityrecognizer) | Annotate named entities, e.g. persons or products, on `Doc` objects. | -| [`TextCategorizer`](/api/textcategorizer) | Assign categories or labels to `Doc` objects. | -| [`Matcher`](/api/matcher) | Match sequences of tokens, based on pattern rules, similar to regular expressions. | -| [`PhraseMatcher`](/api/phrasematcher) | Match sequences of tokens based on phrases. | -| [`EntityRuler`](/api/entityruler) | Add entity spans to the `Doc` using token-based rules or exact phrase matches. | -| [`Sentencizer`](/api/sentencizer) | Implement custom sentence boundary detection logic that doesn't require the dependency parse. | -| [Other functions](/api/pipeline-functions) | Automatically apply something to the `Doc`, e.g. to merge spans of tokens. | +The processing pipeline consists of one or more **pipeline components** that are +called on the `Doc` in order. The tokenizer runs before the components. Pipeline +components can be added using [`Language.add_pipe`](/api/language#add_pipe). +They can contain a statistical model and trained weights, or only make +rule-based modifications to the `Doc`. spaCy provides a range of built-in +components for different language processing tasks and also allows adding +[custom components](/usage/processing-pipelines#custom-components). + +![The processing pipeline](../../images/pipeline.svg) + +| Name | Description | +| ----------------------------------------------- | ------------------------------------------------------------------------------------------- | +| [`AttributeRuler`](/api/attributeruler) | Set token attributes using matcher rules. | +| [`DependencyParser`](/api/dependencyparser) | Predict syntactic dependencies. | +| [`EntityLinker`](/api/entitylinker) | Disambiguate named entities to nodes in a knowledge base. | +| [`EntityRecognizer`](/api/entityrecognizer) | Predict named entities, e.g. persons or products. | +| [`EntityRuler`](/api/entityruler) | Add entity spans to the `Doc` using token-based rules or exact phrase matches. | +| [`Lemmatizer`](/api/lemmatizer) | Determine the base forms of words. | +| [`Morphologizer`](/api/morphologizer) | Predict morphological features and coarse-grained part-of-speech tags. | +| [`SentenceRecognizer`](/api/sentencerecognizer) | Predict sentence boundaries. | +| [`Sentencizer`](/api/sentencizer) | Implement rule-based sentence boundary detection that doesn't require the dependency parse. 
| +| [`Tagger`](/api/tagger) | Predict part-of-speech tags. | +| [`TextCategorizer`](/api/textcategorizer) | Predict categories or labels over the whole document. | +| [`Tok2Vec`](/api/tok2vec) | Apply a "token-to-vector" model and set its outputs. | +| [`Tokenizer`](/api/tokenizer) | Segment raw text and create `Doc` objects from the words. | +| [`TrainablePipe`](/api/pipe) | Class that all trainable pipeline components inherit from. | +| [`Transformer`](/api/transformer) | Use a transformer model and set its outputs. | +| [Other functions](/api/pipeline-functions) | Automatically apply something to the `Doc`, e.g. to merge spans of tokens. | + +### Matchers {#architecture-matchers} + +Matchers help you find and extract information from [`Doc`](/api/doc) objects +based on match patterns describing the sequences you're looking for. A matcher +operates on a `Doc` and gives you access to the matched tokens **in context**. + +| Name | Description | +| --------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| [`DependencyMatcher`](/api/dependencymatcher) | Match sequences of tokens based on dependency trees using [Semgrex operators](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html). | +| [`Matcher`](/api/matcher) | Match sequences of tokens, based on pattern rules, similar to regular expressions. | +| [`PhraseMatcher`](/api/phrasematcher) | Match sequences of tokens based on phrases. | ### Other classes {#architecture-other} -| Name | Description | -| --------------------------------- | ------------------------------------------------------------------------------------------------------------- | -| [`Vocab`](/api/vocab) | A lookup table for the vocabulary that allows you to access `Lexeme` objects. | -| [`StringStore`](/api/stringstore) | Map strings to and from hash values. | -| [`Vectors`](/api/vectors) | Container class for vector data keyed by string. | -| [`GoldParse`](/api/goldparse) | Collection for training annotations. | -| [`GoldCorpus`](/api/goldcorpus) | An annotated corpus, using the JSON file format. Manages annotations for tagging, dependency parsing and NER. | +| Name | Description | +| ------------------------------------------------ | -------------------------------------------------------------------------------------------------- | +| [`Corpus`](/api/corpus) | Class for managing annotated corpora for training and evaluation data. | +| [`KnowledgeBase`](/api/kb) | Storage for entities and aliases of a knowledge base for entity linking. | +| [`Lookups`](/api/lookups) | Container for convenient access to large lookup tables and dictionaries. | +| [`MorphAnalysis`](/api/morphology#morphanalysis) | A morphological analysis. | +| [`Morphology`](/api/morphology) | Store morphological analyses and map them to and from hash values. | +| [`Scorer`](/api/scorer) | Compute evaluation scores. | +| [`StringStore`](/api/stringstore) | Map strings to and from hash values. | +| [`Vectors`](/api/vectors) | Container class for vector data keyed by string. | +| [`Vocab`](/api/vocab) | The shared vocabulary that stores strings and gives you access to [`Lexeme`](/api/lexeme) objects. 
| diff --git a/website/docs/usage/101/_language-data.md b/website/docs/usage/101/_language-data.md index 31bfe53ab..239cec9d1 100644 --- a/website/docs/usage/101/_language-data.md +++ b/website/docs/usage/101/_language-data.md @@ -2,18 +2,16 @@ Every language is different – and usually full of **exceptions and special cases**, especially amongst the most common words. Some of these exceptions are shared across languages, while others are **entirely specific** – usually so specific that they need to be hard-coded. The -[`lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang) module -contains all language-specific data, organized in simple Python files. This -makes the data easy to update and extend. +[`lang`](%%GITHUB_SPACY/spacy/lang) module contains all language-specific data, +organized in simple Python files. This makes the data easy to update and extend. The **shared language data** in the directory root includes rules that can be generalized across languages – for example, rules for basic punctuation, emoji, -emoticons, single-letter abbreviations and norms for equivalent tokens with -different spellings, like `"` and `”`. This helps the models make more accurate -predictions. The **individual language data** in a submodule contains rules that -are only relevant to a particular language. It also takes care of putting -together all components and creating the `Language` subclass – for example, -`English` or `German`. +emoticons and single-letter abbreviations. The **individual language data** in a +submodule contains rules that are only relevant to a particular language. It +also takes care of putting together all components and creating the +[`Language`](/api/language) subclass – for example, `English` or `German`. The +values are defined in the [`Language.Defaults`](/api/language#defaults). > ```python > from spacy.lang.en import English @@ -23,37 +21,12 @@ together all components and creating the `Language` subclass – for example, > nlp_de = German() # Includes German data > ``` -![Language data architecture](../../images/language_data.svg) - -| Name | Description | -| ---------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- | -| **Stop words**
[`stop_words.py`][stop_words.py] | List of most common words of a language that are often useful to filter out, for example "and" or "I". Matching tokens will return `True` for `is_stop`. | -| **Tokenizer exceptions**
[`tokenizer_exceptions.py`][tokenizer_exceptions.py] | Special-case rules for the tokenizer, for example, contractions like "can't" and abbreviations with punctuation, like "U.K.". | -| **Norm exceptions**
[`norm_exceptions.py`][norm_exceptions.py] | Special-case rules for normalizing tokens to improve the model's predictions, for example on American vs. British spelling. | -| **Punctuation rules**
[`punctuation.py`][punctuation.py] | Regular expressions for splitting tokens, e.g. on punctuation or special characters like emoji. Includes rules for prefixes, suffixes and infixes. | -| **Character classes**
[`char_classes.py`][char_classes.py] | Character classes to be used in regular expressions, for example, latin characters, quotes, hyphens or icons. | -| **Lexical attributes**
[`lex_attrs.py`][lex_attrs.py] | Custom functions for setting lexical attributes on tokens, e.g. `like_num`, which includes language-specific words like "ten" or "hundred". | -| **Syntax iterators**
[`syntax_iterators.py`][syntax_iterators.py] | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks). | -| **Tag map**
[`tag_map.py`][tag_map.py] | Dictionary mapping strings in your tag set to [Universal Dependencies](http://universaldependencies.org/u/pos/all.html) tags. | -| **Morph rules**
[`morph_rules.py`][morph_rules.py] | Exception rules for morphological analysis of irregular words like personal pronouns. | -| **Lemmatizer**
[`spacy-lookups-data`][spacy-lookups-data] | Lemmatization rules or a lookup-based lemmatization table to assign base forms, for example "be" for "was". | - -[stop_words.py]: - https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py -[tokenizer_exceptions.py]: - https://github.com/explosion/spaCy/tree/master/spacy/lang/de/tokenizer_exceptions.py -[norm_exceptions.py]: - https://github.com/explosion/spaCy/tree/master/spacy/lang/norm_exceptions.py -[punctuation.py]: - https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py -[char_classes.py]: - https://github.com/explosion/spaCy/tree/master/spacy/lang/char_classes.py -[lex_attrs.py]: - https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py -[syntax_iterators.py]: - https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py -[tag_map.py]: - https://github.com/explosion/spaCy/tree/master/spacy/lang/en/tag_map.py -[morph_rules.py]: - https://github.com/explosion/spaCy/tree/master/spacy/lang/en/morph_rules.py -[spacy-lookups-data]: https://github.com/explosion/spacy-lookups-data +| Name | Description | +| ---------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **Stop words**
[`stop_words.py`](%%GITHUB_SPACY/spacy/lang/en/stop_words.py) | List of most common words of a language that are often useful to filter out, for example "and" or "I". Matching tokens will return `True` for `is_stop`. | +| **Tokenizer exceptions**
[`tokenizer_exceptions.py`](%%GITHUB_SPACY/spacy/lang/de/tokenizer_exceptions.py) | Special-case rules for the tokenizer, for example, contractions like "can't" and abbreviations with punctuation, like "U.K.". | +| **Punctuation rules**
[`punctuation.py`](%%GITHUB_SPACY/spacy/lang/punctuation.py) | Regular expressions for splitting tokens, e.g. on punctuation or special characters like emoji. Includes rules for prefixes, suffixes and infixes. | +| **Character classes**
[`char_classes.py`](%%GITHUB_SPACY/spacy/lang/char_classes.py) | Character classes to be used in regular expressions, for example, Latin characters, quotes, hyphens or icons. | +| **Lexical attributes**
[`lex_attrs.py`](%%GITHUB_SPACY/spacy/lang/en/lex_attrs.py) | Custom functions for setting lexical attributes on tokens, e.g. `like_num`, which includes language-specific words like "ten" or "hundred". | +| **Syntax iterators**
[`syntax_iterators.py`](%%GITHUB_SPACY/spacy/lang/en/syntax_iterators.py) | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks). | +| **Lemmatizer**
[`lemmatizer.py`](%%GITHUB_SPACY/master/spacy/lang/fr/lemmatizer.py) [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) | Custom lemmatizer implementation and lemmatization tables. | diff --git a/website/docs/usage/101/_named-entities.md b/website/docs/usage/101/_named-entities.md index 0dfee8636..2abc45cbd 100644 --- a/website/docs/usage/101/_named-entities.md +++ b/website/docs/usage/101/_named-entities.md @@ -1,10 +1,9 @@ A named entity is a "real-world object" that's assigned a name – for example, a -person, a country, a product or a book title. spaCy can **recognize -[various types](/api/annotation#named-entities)** of named entities in a -document, by asking the model for a **prediction**. Because models are -statistical and strongly depend on the examples they were trained on, this -doesn't always work _perfectly_ and might need some tuning later, depending on -your use case. +person, a country, a product or a book title. spaCy can **recognize various +types of named entities in a document, by asking the model for a +prediction**. Because models are statistical and strongly depend on the +examples they were trained on, this doesn't always work _perfectly_ and might +need some tuning later, depending on your use case. Named entities are available as the `ents` property of a `Doc`: diff --git a/website/docs/usage/101/_pipelines.md b/website/docs/usage/101/_pipelines.md index d33ea45fd..f43219f41 100644 --- a/website/docs/usage/101/_pipelines.md +++ b/website/docs/usage/101/_pipelines.md @@ -1,9 +1,9 @@ When you call `nlp` on a text, spaCy first tokenizes the text to produce a `Doc` object. The `Doc` is then processed in several different steps – this is also referred to as the **processing pipeline**. The pipeline used by the -[default models](/models) consists of a tagger, a parser and an entity -recognizer. Each pipeline component returns the processed `Doc`, which is then -passed on to the next component. +[trained pipelines](/models) typically include a tagger, a lemmatizer, a parser +and an entity recognizer. Each pipeline component returns the processed `Doc`, +which is then passed on to the next component. ![The processing pipeline](../../images/pipeline.svg) @@ -12,47 +12,53 @@ passed on to the next component. > - **Creates:** Objects, attributes and properties modified and set by the > component. -| Name | Component | Creates | Description | -| ----------------- | ------------------------------------------------------------------ | ----------------------------------------------------------- | ------------------------------------------------ | -| **tokenizer** | [`Tokenizer`](/api/tokenizer) | `Doc` | Segment text into tokens. | -| **tagger** | [`Tagger`](/api/tagger) | `Doc[i].tag` | Assign part-of-speech tags. | -| **parser** | [`DependencyParser`](/api/dependencyparser) | `Doc[i].head`, `Doc[i].dep`, `Doc.sents`, `Doc.noun_chunks` | Assign dependency labels. | -| **ner** | [`EntityRecognizer`](/api/entityrecognizer) | `Doc.ents`, `Doc[i].ent_iob`, `Doc[i].ent_type` | Detect and label named entities. | -| **textcat** | [`TextCategorizer`](/api/textcategorizer) | `Doc.cats` | Assign document labels. | -| ... | [custom components](/usage/processing-pipelines#custom-components) | `Doc._.xxx`, `Token._.xxx`, `Span._.xxx` | Assign custom attributes, methods or properties. 
| +| Name | Component | Creates | Description | +| --------------------- | ------------------------------------------------------------------ | --------------------------------------------------------- | ------------------------------------------------ | +| **tokenizer** | [`Tokenizer`](/api/tokenizer) | `Doc` | Segment text into tokens. | +| _processing pipeline_ | | | +| **tagger** | [`Tagger`](/api/tagger) | `Token.tag` | Assign part-of-speech tags. | +| **parser** | [`DependencyParser`](/api/dependencyparser) | `Token.head`, `Token.dep`, `Doc.sents`, `Doc.noun_chunks` | Assign dependency labels. | +| **ner** | [`EntityRecognizer`](/api/entityrecognizer) | `Doc.ents`, `Token.ent_iob`, `Token.ent_type` | Detect and label named entities. | +| **lemmatizer** | [`Lemmatizer`](/api/lemmatizer) | `Token.lemma` | Assign base forms. | +| **textcat** | [`TextCategorizer`](/api/textcategorizer) | `Doc.cats` | Assign document labels. | +| **custom** | [custom components](/usage/processing-pipelines#custom-components) | `Doc._.xxx`, `Token._.xxx`, `Span._.xxx` | Assign custom attributes, methods or properties. | -The processing pipeline always **depends on the statistical model** and its -capabilities. For example, a pipeline can only include an entity recognizer -component if the model includes data to make predictions of entity labels. This -is why each model will specify the pipeline to use in its meta data, as a simple -list containing the component names: +The capabilities of a processing pipeline always depend on the components, their +models and how they were trained. For example, a pipeline for named entity +recognition needs to include a trained named entity recognizer component with a +statistical model and weights that enable it to **make predictions** of entity +labels. This is why each pipeline specifies its components and their settings in +the [config](/usage/training#config): -```json -"pipeline": ["tagger", "parser", "ner"] +```ini +[nlp] +pipeline = ["tok2vec", "tagger", "parser", "ner"] ``` import Accordion from 'components/accordion.js' -In spaCy v2.x, the statistical components like the tagger or parser are -independent and don't share any data between themselves. For example, the named -entity recognizer doesn't use any features set by the tagger and parser, and so -on. This means that you can swap them, or remove single components from the -pipeline without affecting the others. +The statistical components like the tagger or parser are typically independent +and don't share any data between each other. For example, the named entity +recognizer doesn't use any features set by the tagger and parser, and so on. +This means that you can swap them, or remove single components from the pipeline +without affecting the others. However, components may share a "token-to-vector" +component like [`Tok2Vec`](/api/tok2vec) or [`Transformer`](/api/transformer). +You can read more about this in the docs on +[embedding layers](/usage/embeddings-transformers#embedding-layers). -However, custom components may depend on annotations set by other components. -For example, a custom lemmatizer may need the part-of-speech tags assigned, so -it'll only work if it's added after the tagger. The parser will respect -pre-defined sentence boundaries, so if a previous component in the pipeline sets -them, its dependency predictions may be different. 
Similarly, it matters if you -add the [`EntityRuler`](/api/entityruler) before or after the statistical entity +Custom components may also depend on annotations set by other components. For +example, a custom lemmatizer may need the part-of-speech tags assigned, so it'll +only work if it's added after the tagger. The parser will respect pre-defined +sentence boundaries, so if a previous component in the pipeline sets them, its +dependency predictions may be different. Similarly, it matters if you add the +[`EntityRuler`](/api/entityruler) before or after the statistical entity recognizer: if it's added before, the entity recognizer will take the existing -entities into account when making predictions. -The [`EntityLinker`](/api/entitylinker), which resolves named entities to -knowledge base IDs, should be preceded by -a pipeline component that recognizes entities such as the -[`EntityRecognizer`](/api/entityrecognizer). +entities into account when making predictions. The +[`EntityLinker`](/api/entitylinker), which resolves named entities to knowledge +base IDs, should be preceded by a pipeline component that recognizes entities +such as the [`EntityRecognizer`](/api/entityrecognizer). diff --git a/website/docs/usage/101/_pos-deps.md b/website/docs/usage/101/_pos-deps.md index 1e8960edf..a531b245e 100644 --- a/website/docs/usage/101/_pos-deps.md +++ b/website/docs/usage/101/_pos-deps.md @@ -1,9 +1,9 @@ After tokenization, spaCy can **parse** and **tag** a given `Doc`. This is where -the statistical model comes in, which enables spaCy to **make a prediction** of -which tag or label most likely applies in this context. A model consists of -binary data and is produced by showing a system enough examples for it to make -predictions that generalize across the language – for example, a word following -"the" in English is most likely a noun. +the trained pipeline and its statistical models come in, which enable spaCy to +**make predictions** of which tag or label most likely applies in this context. +A trained component includes binary data that is produced by showing a system +enough examples for it to make predictions that generalize across the language – +for example, a word following "the" in English is most likely a noun. Linguistic annotations are available as [`Token` attributes](/api/token#attributes). Like many NLP libraries, spaCy @@ -25,7 +25,8 @@ for token in doc: > - **Text:** The original word text. > - **Lemma:** The base form of the word. -> - **POS:** The simple [UPOS](https://universaldependencies.org/docs/u/pos/) part-of-speech tag. +> - **POS:** The simple [UPOS](https://universaldependencies.org/docs/u/pos/) +> part-of-speech tag. > - **Tag:** The detailed part-of-speech tag. > - **Dep:** Syntactic dependency, i.e. the relation between tokens. > - **Shape:** The word shape – capitalization, punctuation, digits. diff --git a/website/docs/usage/101/_serialization.md b/website/docs/usage/101/_serialization.md index 01a9c39d1..ce34ea6e9 100644 --- a/website/docs/usage/101/_serialization.md +++ b/website/docs/usage/101/_serialization.md @@ -1,9 +1,9 @@ If you've been modifying the pipeline, vocabulary, vectors and entities, or made -updates to the model, you'll eventually want to **save your progress** – for -example, everything that's in your `nlp` object. This means you'll have to -translate its contents and structure into a format that can be saved, like a -file or a byte string. This process is called serialization. 
spaCy comes with -**built-in serialization methods** and supports the +updates to the component models, you'll eventually want to **save your +progress** – for example, everything that's in your `nlp` object. This means +you'll have to translate its contents and structure into a format that can be +saved, like a file or a byte string. This process is called serialization. spaCy +comes with **built-in serialization methods** and supports the [Pickle protocol](https://www.diveinto.org/python3/serializing.html#dump). > #### What's pickle? diff --git a/website/docs/usage/101/_tokenization.md b/website/docs/usage/101/_tokenization.md index 764f1e62a..b82150f1a 100644 --- a/website/docs/usage/101/_tokenization.md +++ b/website/docs/usage/101/_tokenization.md @@ -45,6 +45,6 @@ marks. While punctuation rules are usually pretty general, tokenizer exceptions strongly depend on the specifics of the individual language. This is why each -[available language](/usage/models#languages) has its own subclass like +[available language](/usage/models#languages) has its own subclass, like `English` or `German`, that loads in lists of hard-coded data and exception rules. diff --git a/website/docs/usage/101/_training.md b/website/docs/usage/101/_training.md index baf3a1891..b73a83d6a 100644 --- a/website/docs/usage/101/_training.md +++ b/website/docs/usage/101/_training.md @@ -1,26 +1,30 @@ -spaCy's models are **statistical** and every "decision" they make – for example, +spaCy's tagger, parser, text categorizer and many other components are powered +by **statistical models**. Every "decision" these components make – for example, which part-of-speech tag to assign, or whether a word is a named entity – is a -**prediction**. This prediction is based on the examples the model has seen -during **training**. To train a model, you first need training data – examples -of text, and the labels you want the model to predict. This could be a -part-of-speech tag, a named entity or any other information. +**prediction** based on the model's current **weight values**. The weight values +are estimated based on examples the model has seen during **training**. To train +a model, you first need training data – examples of text, and the labels you +want the model to predict. This could be a part-of-speech tag, a named entity or +any other information. -The model is then shown the unlabelled text and will make a prediction. Because -we know the correct answer, we can give the model feedback on its prediction in -the form of an **error gradient** of the **loss function** that calculates the -difference between the training example and the expected output. The greater the -difference, the more significant the gradient and the updates to our model. +Training is an iterative process in which the model's predictions are compared +against the reference annotations in order to estimate the **gradient of the +loss**. The gradient of the loss is then used to calculate the gradient of the +weights through [backpropagation](https://thinc.ai/backprop101). The gradients +indicate how the weight values should be changed so that the model's predictions +become more similar to the reference labels over time. > - **Training data:** Examples and their annotations. > - **Text:** The input text the model should predict a label for. > - **Label:** The label the model should predict. -> - **Gradient:** Gradient of the loss function calculating the difference -> between input and expected output. 
+> - **Gradient:** The direction and rate of change for a numeric value. +> Minimising the gradient of the weights should result in predictions that are +> closer to the reference labels on the training data. ![The training process](../../images/training.svg) When training a model, we don't just want it to memorize our examples – we want -it to come up with a theory that can be **generalized across other examples**. +it to come up with a theory that can be **generalized across unseen data**. After all, we don't just want the model to learn that this one instance of "Amazon" right here is a company – we want it to learn that "Amazon", in contexts _like this_, is most likely a company. That's why the training data @@ -34,5 +38,4 @@ it's learning the right things, you don't only need **training data** – you'll also need **evaluation data**. If you only test the model with the data it was trained on, you'll have no idea how well it's generalizing. If you want to train a model from scratch, you usually need at least a few hundred examples for both -training and evaluation. To update an existing model, you can already achieve -decent results with very few examples – as long as they're representative. +training and evaluation. diff --git a/website/docs/usage/101/_vectors-similarity.md b/website/docs/usage/101/_vectors-similarity.md index 9ff55f815..cf5b70af2 100644 --- a/website/docs/usage/101/_vectors-similarity.md +++ b/website/docs/usage/101/_vectors-similarity.md @@ -24,12 +24,12 @@ array([2.02280000e-01, -7.66180009e-02, 3.70319992e-01, -To make them compact and fast, spaCy's small [models](/models) (all packages -that end in `sm`) **don't ship with word vectors**, and only include +To make them compact and fast, spaCy's small [pipeline packages](/models) (all +packages that end in `sm`) **don't ship with word vectors**, and only include context-sensitive **tensors**. This means you can still use the `similarity()` methods to compare documents, spans and tokens – but the result won't be as good, and individual tokens won't have any vectors assigned. So in order to use -_real_ word vectors, you need to download a larger model: +_real_ word vectors, you need to download a larger pipeline package: ```diff - python -m spacy download en_core_web_sm @@ -38,11 +38,11 @@ _real_ word vectors, you need to download a larger model: -Models that come with built-in word vectors make them available as the -[`Token.vector`](/api/token#vector) attribute. [`Doc.vector`](/api/doc#vector) -and [`Span.vector`](/api/span#vector) will default to an average of their token -vectors. You can also check if a token has a vector assigned, and get the L2 -norm, which can be used to normalize vectors. +Pipeline packages that come with built-in word vectors make them available as +the [`Token.vector`](/api/token#vector) attribute. +[`Doc.vector`](/api/doc#vector) and [`Span.vector`](/api/span#vector) will +default to an average of their token vectors. You can also check if a token has +a vector assigned, and get the L2 norm, which can be used to normalize vectors. ```python ### {executable="true"} @@ -62,12 +62,12 @@ for token in tokens: > - **OOV**: Out-of-vocabulary The words "dog", "cat" and "banana" are all pretty common in English, so they're -part of the model's vocabulary, and come with a vector. The word "afskfsd" on +part of the pipeline's vocabulary, and come with a vector. 
The word "afskfsd" on the other hand is a lot less common and out-of-vocabulary – so its vector representation consists of 300 dimensions of `0`, which means it's practically nonexistent. If your application will benefit from a **large vocabulary** with -more vectors, you should consider using one of the larger models or loading in a -full vector package, for example, +more vectors, you should consider using one of the larger pipeline packages or +loading in a full vector package, for example, [`en_vectors_web_lg`](/models/en-starters#en_vectors_web_lg), which includes over **1 million unique vectors**. @@ -77,26 +77,75 @@ or flagging duplicates. For example, you can suggest a user content that's similar to what they're currently looking at, or label a support ticket as a duplicate if it's very similar to an already existing one. -Each `Doc`, `Span` and `Token` comes with a -[`.similarity()`](/api/token#similarity) method that lets you compare it with -another object, and determine the similarity. Of course similarity is always -subjective – whether "dog" and "cat" are similar really depends on how you're -looking at it. spaCy's similarity model usually assumes a pretty general-purpose -definition of similarity. +Each [`Doc`](/api/doc), [`Span`](/api/span), [`Token`](/api/token) and +[`Lexeme`](/api/lexeme) comes with a [`.similarity`](/api/token#similarity) +method that lets you compare it with another object, and determine the +similarity. Of course similarity is always subjective – whether two words, spans +or documents are similar really depends on how you're looking at it. spaCy's +similarity implementation usually assumes a pretty general-purpose definition of +similarity. + +> #### 📝 Things to try +> +> 1. Compare two different tokens and try to find the two most _dissimilar_ +> tokens in the texts with the lowest similarity score (according to the +> vectors). +> 2. Compare the similarity of two [`Lexeme`](/api/lexeme) objects, entries in +> the vocabulary. You can get a lexeme via the `.lex` attribute of a token. +> You should see that the similarity results are identical to the token +> similarity. ```python ### {executable="true"} import spacy -nlp = spacy.load("en_core_web_md") # make sure to use larger model! -tokens = nlp("dog cat banana") +nlp = spacy.load("en_core_web_md") # make sure to use larger package! +doc1 = nlp("I like salty fries and hamburgers.") +doc2 = nlp("Fast food tastes very good.") -for token1 in tokens: - for token2 in tokens: - print(token1.text, token2.text, token1.similarity(token2)) +# Similarity of two documents +print(doc1, "<->", doc2, doc1.similarity(doc2)) +# Similarity of tokens and spans +french_fries = doc1[2:4] +burgers = doc1[5] +print(french_fries, "<->", burgers, french_fries.similarity(burgers)) ``` -In this case, the model's predictions are pretty on point. A dog is very similar -to a cat, whereas a banana is not very similar to either of them. Identical -tokens are obviously 100% similar to each other (just not always exactly `1.0`, -because of vector math and floating point imprecisions). +### What to expect from similarity results {#similarity-expectations} + +Computing similarity scores can be helpful in many situations, but it's also +important to maintain **realistic expectations** about what information it can +provide. 
Words can be related to each over in many ways, so a single +"similarity" score will always be a **mix of different signals**, and vectors +trained on different data can produce very different results that may not be +useful for your purpose. Here are some important considerations to keep in mind: + +- There's no objective definition of similarity. Whether "I like burgers" and "I + like pasta" is similar **depends on your application**. Both talk about food + preferences, which makes them very similar – but if you're analyzing mentions + of food, those sentences are pretty dissimilar, because they talk about very + different foods. +- The similarity of [`Doc`](/api/doc) and [`Span`](/api/span) objects defaults + to the **average** of the token vectors. This means that the vector for "fast + food" is the average of the vectors for "fast" and "food", which isn't + necessarily representative of the phrase "fast food". +- Vector averaging means that the vector of multiple tokens is **insensitive to + the order** of the words. Two documents expressing the same meaning with + dissimilar wording will return a lower similarity score than two documents + that happen to contain the same words while expressing different meanings. + + + +[![](../../images/sense2vec.jpg)](https://github.com/explosion/sense2vec) + +[`sense2vec`](https://github.com/explosion/sense2vec) is a library developed by +us that builds on top of spaCy and lets you train and query more interesting and +detailed word vectors. It combines noun phrases like "fast food" or "fair game" +and includes the part-of-speech tags and entity labels. The library also +includes annotation recipes for our annotation tool [Prodigy](https://prodi.gy) +that let you evaluate vectors and create terminology lists. For more details, +check out [our blog post](https://explosion.ai/blog/sense2vec-reloaded). To +explore the semantic similarities across all Reddit comments of 2015 and 2019, +see the [interactive demo](https://explosion.ai/demos/sense2vec). + + diff --git a/website/docs/usage/_benchmarks-choi.md b/website/docs/usage/_benchmarks-choi.md deleted file mode 100644 index 47d6f479f..000000000 --- a/website/docs/usage/_benchmarks-choi.md +++ /dev/null @@ -1,10 +0,0 @@ -import { Help } from 'components/typography' - -| System | Year | Language | Accuracy | Speed (wps) | -| -------------- | ---- | --------------- | -------: | -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | -| **spaCy v2.x** | 2017 | Python / Cython | **92.6** | _n/a_ This table shows speed as benchmarked by Choi et al. We therefore can't provide comparable figures, as we'd be running the benchmark on different hardware. | -| **spaCy v1.x** | 2015 | Python / Cython | 91.8 | 13,963 | -| ClearNLP | 2015 | Java | 91.7 | 10,271 | -| CoreNLP | 2015 | Java | 89.6 | 8,602 | -| MATE | 2015 | Java | 92.5 | 550 | -| Turbo | 2015 | C++ | 92.4 | 349 | diff --git a/website/docs/usage/_benchmarks-models.md b/website/docs/usage/_benchmarks-models.md new file mode 100644 index 000000000..4e6da9ad8 --- /dev/null +++ b/website/docs/usage/_benchmarks-models.md @@ -0,0 +1,46 @@ +import { Help } from 'components/typography'; import Link from 'components/link' + + + +
+
+| Pipeline | Parser | Tagger | NER | WPS<br />CPU <Help>words per second on CPU, higher is better</Help> | WPS<br />GPU <Help>words per second on GPU, higher is better</Help> |
+| ---------------------------------------------------------- | -----: | -----: | ---: | ------------------------------------------------------------------: | -----------------------------------------------------------------: |
+| [`en_core_web_trf`](/models/en#en_core_web_trf) (spaCy v3) | 95.5 | 98.3 | 89.7 | 1k | 8k |
+| [`en_core_web_lg`](/models/en#en_core_web_lg) (spaCy v3) | 92.2 | 97.4 | 85.8 | 7k | |
+| `en_core_web_lg` (spaCy v2) | 91.9 | 97.2 | | 10k | |
+
+
+ +**Full pipeline accuracy and speed** on the +[OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19) corpus. + +
+ +
+ +
+
+| Named Entity Recognition System | OntoNotes | CoNLL '03 |
+| ------------------------------------------------------------------------------ | --------: | --------: |
+| spaCy RoBERTa (2020) | 89.7 | 91.6 |
+| spaCy CNN (2020) | 84.5 | 87.4 |
+| [Stanza](https://stanfordnlp.github.io/stanza/) (StanfordNLP)<sup>1</sup> | 88.8 | 92.1 |
+| Flair<sup>2</sup> | 89.7 | 93.1 |
+| BERT Base<sup>3</sup> | - | 92.4 |
+
+
+ +**Named entity recognition accuracy** on the +[OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19) and +[CoNLL-2003](https://www.aclweb.org/anthology/W03-0419.pdf) corpora. See +[NLP-progress](http://nlpprogress.com/english/named_entity_recognition.html) for +more results. Project template: +[`benchmarks/ner_conll03`](%%GITHUB_PROJECTS/benchmarks/ner_conll03). **1. ** +[Qi et al. (2020)](https://arxiv.org/pdf/2003.07082.pdf). **2. ** +[Akbik et al. (2018)](https://www.aclweb.org/anthology/C18-1139/). **3. ** +[Devlin et al. (2018)](https://arxiv.org/abs/1810.04805). + +
+ +
diff --git a/website/docs/usage/adding-languages.md b/website/docs/usage/adding-languages.md deleted file mode 100644 index 96ffafe06..000000000 --- a/website/docs/usage/adding-languages.md +++ /dev/null @@ -1,675 +0,0 @@ ---- -title: Adding Languages -next: /usage/training -menu: - - ['Language Data', 'language-data'] - - ['Testing', 'testing'] - - ['Training', 'training'] ---- - -Adding full support for a language touches many different parts of the spaCy -library. This guide explains how to fit everything together, and points you to -the specific workflows for each component. - -> #### Working on spaCy's source -> -> To add a new language to spaCy, you'll need to **modify the library's code**. -> The easiest way to do this is to clone the -> [repository](https://github.com/explosion/spaCy/tree/master/) and **build -> spaCy from source**. For more information on this, see the -> [installation guide](/usage). Unlike spaCy's core, which is mostly written in -> Cython, all language data is stored in regular Python files. This means that -> you won't have to rebuild anything in between – you can simply make edits and -> reload spaCy to test them. - - - -
- -Obviously, there are lots of ways you can organize your code when you implement -your own language data. This guide will focus on how it's done within spaCy. For -full language support, you'll need to create a `Language` subclass, define -custom **language data**, like a stop list and tokenizer exceptions and test the -new tokenizer. Once the language is set up, you can **build the vocabulary**, -including word frequencies, Brown clusters and word vectors. Finally, you can -**train the tagger and parser**, and save the model to a directory. - -For some languages, you may also want to develop a solution for lemmatization -and morphological analysis. - -
- - - -- [Language data 101](#language-data) -- [The Language subclass](#language-subclass) -- [Stop words](#stop-words) -- [Tokenizer exceptions](#tokenizer-exceptions) -- [Norm exceptions](#norm-exceptions) -- [Lexical attributes](#lex-attrs) -- [Syntax iterators](#syntax-iterators) -- [Lemmatizer](#lemmatizer) -- [Tag map](#tag-map) -- [Morph rules](#morph-rules) -- [Testing the language](#testing) -- [Training](#training) - - - -
- -## Language data {#language-data} - -import LanguageData101 from 'usage/101/\_language-data.md' - - - -The individual components **expose variables** that can be imported within a -language module, and added to the language's `Defaults`. Some components, like -the punctuation rules, usually don't need much customization and can be imported -from the global rules. Others, like the tokenizer and norm exceptions, are very -specific and will make a big difference to spaCy's performance on the particular -language and training a language model. - -| Variable | Type | Description | -| ---------------------- | ----- | ---------------------------------------------------------------------------------------------------------- | -| `STOP_WORDS` | set | Individual words. | -| `TOKENIZER_EXCEPTIONS` | dict | Keyed by strings mapped to list of one dict per token with token attributes. | -| `TOKEN_MATCH` | regex | Regexes to match complex tokens, e.g. URLs. | -| `NORM_EXCEPTIONS` | dict | Keyed by strings, mapped to their norms. | -| `TOKENIZER_PREFIXES` | list | Strings or regexes, usually not customized. | -| `TOKENIZER_SUFFIXES` | list | Strings or regexes, usually not customized. | -| `TOKENIZER_INFIXES` | list | Strings or regexes, usually not customized. | -| `LEX_ATTRS` | dict | Attribute ID mapped to function. | -| `SYNTAX_ITERATORS` | dict | Iterator ID mapped to function. Currently only supports `'noun_chunks'`. | -| `TAG_MAP` | dict | Keyed by strings mapped to [Universal Dependencies](http://universaldependencies.org/u/pos/all.html) tags. | -| `MORPH_RULES` | dict | Keyed by strings mapped to a dict of their morphological features. | - -> #### Should I ever update the global data? -> -> Reusable language data is collected as atomic pieces in the root of the -> [`spacy.lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang) -> module. Often, when a new language is added, you'll find a pattern or symbol -> that's missing. Even if it isn't common in other languages, it might be best -> to add it to the shared language data, unless it has some conflicting -> interpretation. For instance, we don't expect to see guillemot quotation -> symbols (`»` and `«`) in English text. But if we do see them, we'd probably -> prefer the tokenizer to split them off. - - - -In order for the tokenizer to split suffixes, prefixes and infixes, spaCy needs -to know the language's character set. If the language you're adding uses -non-latin characters, you might need to define the required character classes in -the global -[`char_classes.py`](https://github.com/explosion/spaCy/tree/master/spacy/lang/char_classes.py). -For efficiency, spaCy uses hard-coded unicode ranges to define character -classes, the definitions of which can be found on -[Wikipedia](https://en.wikipedia.org/wiki/Unicode_block). If the language -requires very specific punctuation rules, you should consider overwriting the -default regular expressions with your own in the language's `Defaults`. - - - -### Creating a language subclass {#language-subclass} - -Language-specific code and resources should be organized into a sub-package of -spaCy, named according to the language's -[ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). For instance, -code and resources specific to Spanish are placed into a directory -`spacy/lang/es`, which can be imported as `spacy.lang.es`. - -To get started, you can check out the -[existing languages](https://github.com/explosion/spacy/tree/master/spacy/lang). 
-Here's what the class could look like: - -```python -### __init__.py (excerpt) -# import language-specific data -from .stop_words import STOP_WORDS -from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS -from .lex_attrs import LEX_ATTRS - -from ..tokenizer_exceptions import BASE_EXCEPTIONS -from ...language import Language -from ...attrs import LANG -from ...util import update_exc - -# Create Defaults class in the module scope (necessary for pickling!) -class XxxxxDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = lambda text: "xx" # language ISO code - - # Optional: replace flags with custom functions, e.g. like_num() - lex_attr_getters.update(LEX_ATTRS) - - # Merge base exceptions and custom tokenizer exceptions - tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) - stop_words = STOP_WORDS - -# Create actual Language class -class Xxxxx(Language): - lang = "xx" # Language ISO code - Defaults = XxxxxDefaults # Override defaults - -# Set default export – this allows the language class to be lazy-loaded -__all__ = ["Xxxxx"] -``` - - - -Some languages contain large volumes of custom data, like lemmatizer lookup -tables, or complex regular expression that are expensive to compute. As of spaCy -v2.0, `Language` classes are not imported on initialization and are only loaded -when you import them directly, or load a model that requires a language to be -loaded. To lazy-load languages in your application, you can use the -[`util.get_lang_class`](/api/top-level#util.get_lang_class) helper function with -the two-letter language code as its argument. - - - -### Stop words {#stop-words} - -A ["stop list"](https://en.wikipedia.org/wiki/Stop_words) is a classic trick -from the early days of information retrieval when search was largely about -keyword presence and absence. It is still sometimes useful today to filter out -common words from a bag-of-words model. To improve readability, `STOP_WORDS` are -separated by spaces and newlines, and added as a multiline string. - -> #### What does spaCy consider a stop word? -> -> There's no particularly principled logic behind what words should be added to -> the stop list. Make a list that you think might be useful to people and is -> likely to be unsurprising. As a rule of thumb, words that are very rare are -> unlikely to be useful stop words. - -```python -### Example -STOP_WORDS = set(""" -a about above across after afterwards again against all almost alone along -already also although always am among amongst amount an and another any anyhow -anyone anything anyway anywhere are around as at - -back be became because become becomes becoming been before beforehand behind -being below beside besides between beyond both bottom but by -""".split()) -``` - - - -When adding stop words from an online source, always **include the link** in a -comment. Make sure to **proofread** and double-check the words carefully. A lot -of the lists available online have been passed around for years and often -contain mistakes, like unicode errors or random words that have once been added -for a specific use case, but don't actually qualify. - - - -### Tokenizer exceptions {#tokenizer-exceptions} - -spaCy's [tokenization algorithm](/usage/linguistic-features#how-tokenizer-works) -lets you deal with whitespace-delimited chunks separately. This makes it easy to -define special-case rules, without worrying about how they interact with the -rest of the tokenizer. 
Whenever the key string is matched, the special-case rule -is applied, giving the defined sequence of tokens. - -Tokenizer exceptions can be added in the following format: - -```python -### tokenizer_exceptions.py (excerpt) -TOKENIZER_EXCEPTIONS = { - "don't": [ - {ORTH: "do"}, - {ORTH: "n't", NORM: "not"}] -} -``` - - - -If an exception consists of more than one token, the `ORTH` values combined -always need to **match the original string**. The way the original string is -split up can be pretty arbitrary sometimes – for example `"gonna"` is split into -`"gon"` (norm "going") and `"na"` (norm "to"). Because of how the tokenizer -works, it's currently not possible to split single-letter strings into multiple -tokens. - - - -> #### Generating tokenizer exceptions -> -> Keep in mind that generating exceptions only makes sense if there's a clearly -> defined and **finite number** of them, like common contractions in English. -> This is not always the case – in Spanish for instance, infinitive or -> imperative reflexive verbs and pronouns are one token (e.g. "vestirme"). In -> cases like this, spaCy shouldn't be generating exceptions for _all verbs_. -> Instead, this will be handled at a later stage after part-of-speech tagging -> and lemmatization. - -When adding the tokenizer exceptions to the `Defaults`, you can use the -[`update_exc`](/api/top-level#util.update_exc) helper function to merge them -with the global base exceptions (including one-letter abbreviations and -emoticons). The function performs a basic check to make sure exceptions are -provided in the correct format. It can take any number of exceptions dicts as -its arguments, and will update and overwrite the exception in this order. For -example, if your language's tokenizer exceptions include a custom tokenization -pattern for "a.", it will overwrite the base exceptions with the language's -custom one. - -```python -### Example -from ...util import update_exc - -BASE_EXCEPTIONS = {"a.": [{ORTH: "a."}], ":)": [{ORTH: ":)"}]} -TOKENIZER_EXCEPTIONS = {"a.": [{ORTH: "a.", NORM: "all"}]} - -tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS) -# {"a.": [{ORTH: "a.", NORM: "all"}], ":)": [{ORTH: ":)"}]} -``` - -### Norm exceptions {#norm-exceptions new="2"} - -In addition to `ORTH`, tokenizer exceptions can also set a `NORM` attribute. -This is useful to specify a normalized version of the token – for example, the -norm of "n't" is "not". By default, a token's norm equals its lowercase text. If -the lowercase spelling of a word exists, norms should always be in lowercase. - -> #### Norms vs. lemmas -> -> ```python -> doc = nlp("I'm gonna realise") -> norms = [token.norm_ for token in doc] -> lemmas = [token.lemma_ for token in doc] -> assert norms == ["i", "am", "going", "to", "realize"] -> assert lemmas == ["i", "be", "go", "to", "realise"] -> ``` - -spaCy usually tries to normalize words with different spellings to a single, -common spelling. This has no effect on any other token attributes, or -tokenization in general, but it ensures that **equivalent tokens receive similar -representations**. This can improve the model's predictions on words that -weren't common in the training data, but are equivalent to other words – for -example, "realise" and "realize", or "thx" and "thanks". - -Similarly, spaCy also includes -[global base norms](https://github.com/explosion/spaCy/tree/master/spacy/lang/norm_exceptions.py) -for normalizing different styles of quotation marks and currency symbols. 
Even -though `$` and `€` are very different, spaCy normalizes them both to `$`. This -way, they'll always be seen as similar, no matter how common they were in the -training data. - -As of spaCy v2.3, language-specific norm exceptions are provided as a -JSON dictionary in the package -[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) rather -than in the main library. For a full example, see -[`en_lexeme_norm.json`](https://github.com/explosion/spacy-lookups-data/blob/master/spacy_lookups_data/data/en_lexeme_norm.json). - -```json -### Example -{ - "cos": "because", - "fav": "favorite", - "accessorise": "accessorize", - "accessorised": "accessorized" -} -``` - -If you're adding tables for a new languages, be sure to add the tables to -[`spacy_lookups_data/__init__.py`](https://github.com/explosion/spacy-lookups-data/blob/master/spacy_lookups_data/__init__.py) -and register the entry point under `spacy_lookups` in -[`setup.cfg`](https://github.com/explosion/spacy-lookups-data/blob/master/setup.cfg). - -Alternatively, you can initialize your language [`Vocab`](/api/vocab) with a -[`Lookups`](/api/lookups) object that includes the table `lexeme_norm`. - - - -Previously in spaCy v2.0-v2.2, norm exceptions were provided as a simple python -dictionary. For more examples, see the English -[`norm_exceptions.py`](https://github.com/explosion/spaCy/tree/v2.2.x/spacy/lang/en/norm_exceptions.py). - -```python -### Example -NORM_EXCEPTIONS = { - "cos": "because", - "fav": "favorite", - "accessorise": "accessorize", - "accessorised": "accessorized" -} -``` - -To add the custom norm exceptions lookup table, you can use the `add_lookups()` -helper functions. It takes the default attribute getter function as its first -argument, plus a variable list of dictionaries. If a string's norm is found in -one of the dictionaries, that value is used – otherwise, the default function is -called and the token is assigned its default norm. - -```python -lex_attr_getters[NORM] = add_lookups(Language.Defaults.lex_attr_getters[NORM], - NORM_EXCEPTIONS, BASE_NORMS) -``` - -The order of the dictionaries is also the lookup order – so if your language's -norm exceptions overwrite any of the global exceptions, they should be added -first. Also note that the tokenizer exceptions will always have priority over -the attribute getters. - - - -### Lexical attributes {#lex-attrs new="2"} - -spaCy provides a range of [`Token` attributes](/api/token#attributes) that -return useful information on that token – for example, whether it's uppercase or -lowercase, a left or right punctuation mark, or whether it resembles a number or -email address. Most of these functions, like `is_lower` or `like_url` should be -language-independent. Others, like `like_num` (which includes both digits and -number words), requires some customization. - -> #### Best practices -> -> Keep in mind that those functions are only intended to be an approximation. -> It's always better to prioritize simplicity and performance over covering very -> specific edge cases. -> -> English number words are pretty simple, because even large numbers consist of -> individual tokens, and we can get away with splitting and matching strings -> against a list. In other languages, like German, "two hundred and thirty-four" -> is one word, and thus one token. Here, it's best to match a string against a -> list of number word fragments (instead of a technically almost infinite list -> of possible number words). 
- -Here's an example from the English -[`lex_attrs.py`](https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py): - -```python -### lex_attrs.py -_num_words = ["zero", "one", "two", "three", "four", "five", "six", "seven", - "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", - "fifteen", "sixteen", "seventeen", "eighteen", "nineteen", "twenty", - "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety", - "hundred", "thousand", "million", "billion", "trillion", "quadrillion", - "gajillion", "bazillion"] - -def like_num(text): - text = text.replace(",", "").replace(".", "") - if text.isdigit(): - return True - if text.count("/") == 1: - num, denom = text.split("/") - if num.isdigit() and denom.isdigit(): - return True - if text.lower() in _num_words: - return True - return False - -LEX_ATTRS = { - LIKE_NUM: like_num -} -``` - -By updating the default lexical attributes with a custom `LEX_ATTRS` dictionary -in the language's defaults via `lex_attr_getters.update(LEX_ATTRS)`, only the -new custom functions are overwritten. - -### Syntax iterators {#syntax-iterators} - -Syntax iterators are functions that compute views of a `Doc` object based on its -syntax. At the moment, this data is only used for extracting -[noun chunks](/usage/linguistic-features#noun-chunks), which are available as -the [`Doc.noun_chunks`](/api/doc#noun_chunks) property. Because base noun -phrases work differently across languages, the rules to compute them are part of -the individual language's data. If a language does not include a noun chunks -iterator, the property won't be available. For examples, see the existing syntax -iterators: - -> #### Noun chunks example -> -> ```python -> doc = nlp("A phrase with another phrase occurs.") -> chunks = list(doc.noun_chunks) -> assert chunks[0].text == "A phrase" -> assert chunks[1].text == "another phrase" -> ``` - -| Language | Code | Source | -| ---------------- | ---- | ----------------------------------------------------------------------------------------------------------------- | -| English | `en` | [`lang/en/syntax_iterators.py`](https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py) | -| German | `de` | [`lang/de/syntax_iterators.py`](https://github.com/explosion/spaCy/tree/master/spacy/lang/de/syntax_iterators.py) | -| French | `fr` | [`lang/fr/syntax_iterators.py`](https://github.com/explosion/spaCy/tree/master/spacy/lang/fr/syntax_iterators.py) | -| Spanish | `es` | [`lang/es/syntax_iterators.py`](https://github.com/explosion/spaCy/tree/master/spacy/lang/es/syntax_iterators.py) | -| Greek | `el` | [`lang/el/syntax_iterators.py`](https://github.com/explosion/spaCy/tree/master/spacy/lang/el/syntax_iterators.py) | -| Norwegian Bokmål | `nb` | [`lang/nb/syntax_iterators.py`](https://github.com/explosion/spaCy/tree/master/spacy/lang/nb/syntax_iterators.py) | -| Swedish | `sv` | [`lang/sv/syntax_iterators.py`](https://github.com/explosion/spaCy/tree/master/spacy/lang/sv/syntax_iterators.py) | -| Indonesian | `id` | [`lang/id/syntax_iterators.py`](https://github.com/explosion/spaCy/tree/master/spacy/lang/id/syntax_iterators.py) | -| Persian | `fa` | [`lang/fa/syntax_iterators.py`](https://github.com/explosion/spaCy/tree/master/spacy/lang/fa/syntax_iterators.py) | - -### Lemmatizer {#lemmatizer new="2"} - -As of v2.0, spaCy supports simple lookup-based lemmatization. This is usually -the quickest and easiest way to get started. The data is stored in a dictionary -mapping a string to its lemma. 
To determine a token's lemma, spaCy simply looks -it up in the table. Here's an example from the Spanish language data: - -```json -### es_lemma_lookup.json (excerpt) -{ - "aba": "abar", - "ababa": "abar", - "ababais": "abar", - "ababan": "abar", - "ababanes": "ababán", - "ababas": "abar", - "ababoles": "ababol", - "ababábites": "ababábite" -} -``` - -#### Adding JSON resources {#lemmatizer-resources new="2.2"} - -As of v2.2, resources for the lemmatizer are stored as JSON and have been moved -to a separate repository and package, -[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). The -package exposes the data files via language-specific -[entry points](/usage/saving-loading#entry-points) that spaCy reads when -constructing the `Vocab` and [`Lookups`](/api/lookups). This allows easier -access to the data, serialization with the models and file compression on disk -(so your spaCy installation is smaller). If you want to use the lookup tables -without a pretrained model, you have to explicitly install spaCy with lookups -via `pip install spacy[lookups]` or by installing -[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) in the -same environment as spaCy. - -### Tag map {#tag-map} - -Most treebanks define a custom part-of-speech tag scheme, striking a balance -between level of detail and ease of prediction. While it's useful to have custom -tagging schemes, it's also useful to have a common scheme, to which the more -specific tags can be related. The tagger can learn a tag scheme with any -arbitrary symbols. However, you need to define how those symbols map down to the -[Universal Dependencies tag set](http://universaldependencies.org/u/pos/all.html). -This is done by providing a tag map. - -The keys of the tag map should be **strings in your tag set**. The values should -be a dictionary. The dictionary must have an entry POS whose value is one of the -[Universal Dependencies](http://universaldependencies.org/u/pos/all.html) tags. -Optionally, you can also include morphological features or other token -attributes in the tag map as well. This allows you to do simple -[rule-based morphological analysis](/usage/linguistic-features#rule-based-morphology). - -```python -### Example -from ..symbols import POS, NOUN, VERB, DET - -TAG_MAP = { - "NNS": {POS: NOUN, "Number": "plur"}, - "VBG": {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"}, - "DT": {POS: DET} -} -``` - -### Morph rules {#morph-rules} - -The morphology rules let you set token attributes such as lemmas, keyed by the -extended part-of-speech tag and token text. The morphological features and their -possible values are language-specific and based on the -[Universal Dependencies scheme](http://universaldependencies.org). 
- -```python -### Example -from ..symbols import LEMMA - -MORPH_RULES = { - "VBZ": { - "am": {LEMMA: "be", "VerbForm": "Fin", "Person": "One", "Tense": "Pres", "Mood": "Ind"}, - "are": {LEMMA: "be", "VerbForm": "Fin", "Person": "Two", "Tense": "Pres", "Mood": "Ind"}, - "is": {LEMMA: "be", "VerbForm": "Fin", "Person": "Three", "Tense": "Pres", "Mood": "Ind"}, - "'re": {LEMMA: "be", "VerbForm": "Fin", "Person": "Two", "Tense": "Pres", "Mood": "Ind"}, - "'s": {LEMMA: "be", "VerbForm": "Fin", "Person": "Three", "Tense": "Pres", "Mood": "Ind"} - } -} -``` - -In the example of `"am"`, the attributes look like this: - -| Attribute | Description | -| ------------------- | ------------------------------------------------------------------------------------------------------------------------------ | -| `LEMMA: "be"` | Base form, e.g. "to be". | -| `"VerbForm": "Fin"` | Finite verb. Finite verbs have a subject and can be the root of an independent clause – "I am." is a valid, complete sentence. | -| `"Person": "One"` | First person, i.e. "**I** am". | -| `"Tense": "Pres"` | Present tense, i.e. actions that are happening right now or actions that usually happen. | -| `"Mood": "Ind"` | Indicative, i.e. something happens, has happened or will happen (as opposed to imperative or conditional). | - - - -The morphological attributes are currently **not all used by spaCy**. Full -integration is still being developed. In the meantime, it can still be useful to -add them, especially if the language you're adding includes important -distinctions and special cases. This ensures that as soon as full support is -introduced, your language will be able to assign all possible attributes. - - - -## Testing the new language {#testing} - -Before using the new language or submitting a -[pull request](https://github.com/explosion/spaCy/pulls) to spaCy, you should -make sure it works as expected. This is especially important if you've added -custom regular expressions for token matching or punctuation – you don't want to -be causing regressions. - - - -spaCy uses the [pytest framework](https://docs.pytest.org/en/latest/) for -testing. For more details on how the tests are structured and best practices for -writing your own tests, see our -[tests documentation](https://github.com/explosion/spaCy/tree/master/spacy/tests). - - - -### Writing language-specific tests {#testing-custom} - -It's recommended to always add at least some tests with examples specific to the -language. Language tests should be located in -[`tests/lang`](https://github.com/explosion/spaCy/tree/master/spacy/tests/lang) -in a directory named after the language ID. You'll also need to create a fixture -for your tokenizer in the -[`conftest.py`](https://github.com/explosion/spaCy/tree/master/spacy/tests/conftest.py). -Always use the [`get_lang_class`](/api/top-level#util.get_lang_class) helper -function within the fixture, instead of importing the class at the top of the -file. This will load the language data only when it's needed. (Otherwise, _all -data_ would be loaded every time you run a test.) - -```python -@pytest.fixture -def en_tokenizer(): - return util.get_lang_class("en").Defaults.create_tokenizer() -``` - -When adding test cases, always -[`parametrize`](https://github.com/explosion/spaCy/tree/master/spacy/tests#parameters) -them – this will make it easier for others to add more test cases without having -to modify the test itself. 
You can also add parameter tuples, for example, a -test sentence and its expected length, or a list of expected tokens. Here's an -example of an English tokenizer test for combinations of punctuation and -abbreviations: - -```python -### Example test -@pytest.mark.parametrize('text,length', [ - ("The U.S. Army likes Shock and Awe.", 8), - ("U.N. regulations are not a part of their concern.", 10), - ("“Isn't it?”", 6)]) -def test_en_tokenizer_handles_punct_abbrev(en_tokenizer, text, length): - tokens = en_tokenizer(text) - assert len(tokens) == length -``` - -## Training a language model {#training} - -Much of spaCy's functionality requires models to be trained from labeled data. -For instance, in order to use the named entity recognizer, you need to first -train a model on text annotated with examples of the entities you want to -recognize. The parser, part-of-speech tagger and text categorizer all also -require models to be trained from labeled examples. The word vectors, word -probabilities and word clusters also require training, although these can be -trained from unlabeled text, which tends to be much easier to collect. - -### Creating a vocabulary file {#vocab-file} - -spaCy expects that common words will be cached in a [`Vocab`](/api/vocab) -instance. The vocabulary caches lexical features. spaCy loads the vocabulary -from binary data, in order to keep loading efficient. The easiest way to save -out a new binary vocabulary file is to use the `spacy init-model` command, which -expects a JSONL file with words and their lexical attributes. See the docs on -the [vocab JSONL format](/api/annotation#vocab-jsonl) for details. - -#### Training the word vectors {#word-vectors} - -[Word2vec](https://en.wikipedia.org/wiki/Word2vec) and related algorithms let -you train useful word similarity models from unlabeled text. This is a key part -of using deep learning for NLP with limited labeled data. The vectors are also -useful by themselves – they power the `.similarity` methods in spaCy. For best -results, you should pre-process the text with spaCy before training the Word2vec -model. This ensures your tokenization will match. You can use our -[word vectors training script](https://github.com/explosion/spacy/tree/master/bin/train_word_vectors.py), -which pre-processes the text with your language-specific tokenizer and trains -the model using [Gensim](https://radimrehurek.com/gensim/). The `vectors.bin` -file should consist of one word and vector per line. - -```python -https://github.com/explosion/spacy/tree/master/bin/train_word_vectors.py -``` - -If you don't have a large sample of text available, you can also convert word -vectors produced by a variety of other tools into spaCy's format. See the docs -on [converting word vectors](/usage/vectors-similarity#converting) for details. - -### Creating or converting a training corpus {#training-corpus} - -The easiest way to train spaCy's tagger, parser, entity recognizer or text -categorizer is to use the [`spacy train`](/api/cli#train) command-line utility. -In order to use this, you'll need training and evaluation data in the -[JSON format](/api/annotation#json-input) spaCy expects for training. - -If your data is in one of the supported formats, the easiest solution might be -to use the [`spacy convert`](/api/cli#convert) command-line utility. 
This -supports several popular formats, including the IOB format for named entity -recognition, the JSONL format produced by our annotation tool -[Prodigy](https://prodi.gy), and the -[CoNLL-U](http://universaldependencies.org/docs/format.html) format used by the -[Universal Dependencies](http://universaldependencies.org/) corpus. - -One thing to keep in mind is that spaCy expects to train its models from **whole -documents**, not just single sentences. If your corpus only contains single -sentences, spaCy's models will never learn to expect multi-sentence documents, -leading to low performance on real text. To mitigate this problem, you can use -the `-n` argument to the `spacy convert` command, to merge some of the sentences -into longer pseudo-documents. - -### Training the tagger and parser {#train-tagger-parser} - -Once you have your training and evaluation data in the format spaCy expects, you -can train your model use the using spaCy's [`train`](/api/cli#train) command. -Note that training statistical models still involves a degree of -trial-and-error. You may need to tune one or more settings, also called -"hyper-parameters", to achieve optimal performance. See the -[usage guide on training](/usage/training#tagger-parser) for more details. diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md new file mode 100644 index 000000000..093b0c137 --- /dev/null +++ b/website/docs/usage/embeddings-transformers.md @@ -0,0 +1,758 @@ +--- +title: Embeddings, Transformers and Transfer Learning +teaser: Using transformer embeddings like BERT in spaCy +menu: + - ['Embedding Layers', 'embedding-layers'] + - ['Transformers', 'transformers'] + - ['Static Vectors', 'static-vectors'] + - ['Pretraining', 'pretraining'] +next: /usage/training +--- + +spaCy supports a number of **transfer and multi-task learning** workflows that +can often help improve your pipeline's efficiency or accuracy. Transfer learning +refers to techniques such as word vector tables and language model pretraining. +These techniques can be used to import knowledge from raw text into your +pipeline, so that your models are able to generalize better from your annotated +examples. + +You can convert **word vectors** from popular tools like +[FastText](https://fasttext.cc) and [Gensim](https://radimrehurek.com/gensim), +or you can load in any pretrained **transformer model** if you install +[`spacy-transformers`](https://github.com/explosion/spacy-transformers). You can +also do your own language model pretraining via the +[`spacy pretrain`](/api/cli#pretrain) command. You can even **share** your +transformer or other contextual embedding model across multiple components, +which can make long pipelines several times more efficient. To use transfer +learning, you'll need at least a few annotated examples for what you're trying +to predict. Otherwise, you could try using a "one-shot learning" approach using +[vectors and similarity](/usage/linguistic-features#vectors-similarity). + + + +[Transformers](#transformers) are large and powerful neural networks that give +you better accuracy, but are harder to deploy in production, as they require a +GPU to run effectively. [Word vectors](#word-vectors) are a slightly older +technique that can give your models a smaller improvement in accuracy, and can +also provide some additional capabilities. 
+ +The key difference between word-vectors and contextual language models such as +transformers is that word vectors model **lexical types**, rather than _tokens_. +If you have a list of terms with no context around them, a transformer model +like BERT can't really help you. BERT is designed to understand language **in +context**, which isn't what you have. A word vectors table will be a much better +fit for your task. However, if you do have words in context – whole sentences or +paragraphs of running text – word vectors will only provide a very rough +approximation of what the text is about. + +Word vectors are also very computationally efficient, as they map a word to a +vector with a single indexing operation. Word vectors are therefore useful as a +way to **improve the accuracy** of neural network models, especially models that +are small or have received little or no pretraining. In spaCy, word vector +tables are only used as **static features**. spaCy does not backpropagate +gradients to the pretrained word vectors table. The static vectors table is +usually used in combination with a smaller table of learned task-specific +embeddings. + + + + + +Word vectors are not compatible with most [transformer models](#transformers), +but if you're training another type of NLP network, it's almost always worth +adding word vectors to your model. As well as improving your final accuracy, +word vectors often make experiments more consistent, as the accuracy you reach +will be less sensitive to how the network is randomly initialized. High variance +due to random chance can slow down your progress significantly, as you need to +run many experiments to filter the signal from the noise. + +Word vector features need to be enabled prior to training, and the same word +vectors table will need to be available at runtime as well. You cannot add word +vector features once the model has already been trained, and you usually cannot +replace one word vectors table with another without causing a significant loss +of performance. + + + +## Shared embedding layers {#embedding-layers} + +spaCy lets you share a single transformer or other token-to-vector ("tok2vec") +embedding layer between multiple components. You can even update the shared +layer, performing **multi-task learning**. Reusing the tok2vec layer between +components can make your pipeline run a lot faster and result in much smaller +models. However, it can make the pipeline less modular and make it more +difficult to swap components or retrain parts of the pipeline. Multi-task +learning can affect your accuracy (either positively or negatively), and may +require some retuning of your hyper-parameters. + +![Pipeline components using a shared embedding component vs. 
independent embedding layers](../images/tok2vec.svg) + +| Shared | Independent | +| ------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------- | +| ✅ **smaller:** models only need to include a single copy of the embeddings | ❌ **larger:** models need to include the embeddings for each component | +| ✅ **faster:** embed the documents once for your whole pipeline | ❌ **slower:** rerun the embedding for each component | +| ❌ **less composable:** all components require the same embedding component in the pipeline | ✅ **modular:** components can be moved and swapped freely | + +You can share a single transformer or other tok2vec model between multiple +components by adding a [`Transformer`](/api/transformer) or +[`Tok2Vec`](/api/tok2vec) component near the start of your pipeline. Components +later in the pipeline can "connect" to it by including a **listener layer** like +[Tok2VecListener](/api/architectures#Tok2VecListener) within their model. + +![Pipeline components listening to shared embedding component](../images/tok2vec-listener.svg) + +At the beginning of training, the [`Tok2Vec`](/api/tok2vec) component will grab +a reference to the relevant listener layers in the rest of your pipeline. When +it processes a batch of documents, it will pass forward its predictions to the +listeners, allowing the listeners to **reuse the predictions** when they are +eventually called. A similar mechanism is used to pass gradients from the +listeners back to the model. The [`Transformer`](/api/transformer) component and +[TransformerListener](/api/architectures#TransformerListener) layer do the same +thing for transformer models, but the `Transformer` component will also save the +transformer outputs to the +[`Doc._.trf_data`](/api/transformer#custom_attributes) extension attribute, +giving you access to them after the pipeline has finished running. + +### Example: Shared vs. independent config {#embedding-layers-config} + +The [config system](/usage/training#config) lets you express model configuration +for both shared and independent embedding layers. The shared setup uses a single +[`Tok2Vec`](/api/tok2vec) component with the +[Tok2Vec](/api/architectures#Tok2Vec) architecture. All other components, like +the entity recognizer, use a +[Tok2VecListener](/api/architectures#Tok2VecListener) layer as their model's +`tok2vec` argument, which connects to the `tok2vec` component model. + +```ini +### Shared {highlight="1-2,4-5,19-20"} +[components.tok2vec] +factory = "tok2vec" + +[components.tok2vec.model] +@architectures = "spacy.Tok2Vec.v1" + +[components.tok2vec.model.embed] +@architectures = "spacy.MultiHashEmbed.v1" + +[components.tok2vec.model.encode] +@architectures = "spacy.MaxoutWindowEncoder.v1" + +[components.ner] +factory = "ner" + +[components.ner.model] +@architectures = "spacy.TransitionBasedParser.v1" + +[components.ner.model.tok2vec] +@architectures = "spacy.Tok2VecListener.v1" +``` + +In the independent setup, the entity recognizer component defines its own +[Tok2Vec](/api/architectures#Tok2Vec) instance. Other components will do the +same. This makes them fully independent and doesn't require an upstream +[`Tok2Vec`](/api/tok2vec) component to be present in the pipeline. 
+ +```ini +### Independent {highlight="7-8"} +[components.ner] +factory = "ner" + +[components.ner.model] +@architectures = "spacy.TransitionBasedParser.v1" + +[components.ner.model.tok2vec] +@architectures = "spacy.Tok2Vec.v1" + +[components.ner.model.tok2vec.embed] +@architectures = "spacy.MultiHashEmbed.v1" + +[components.ner.model.tok2vec.encode] +@architectures = "spacy.MaxoutWindowEncoder.v1" +``` + + + +## Using transformer models {#transformers} + +Transformers are a family of neural network architectures that compute **dense, +context-sensitive representations** for the tokens in your documents. Downstream +models in your pipeline can then use these representations as input features to +**improve their predictions**. You can connect multiple components to a single +transformer model, with any or all of those components giving feedback to the +transformer to fine-tune it to your tasks. spaCy's transformer support +interoperates with [PyTorch](https://pytorch.org) and the +[HuggingFace `transformers`](https://huggingface.co/transformers/) library, +giving you access to thousands of pretrained models for your pipelines. There +are many [great guides](http://jalammar.github.io/illustrated-transformer/) to +transformer models, but for practical purposes, you can simply think of them as +drop-in replacements that let you achieve **higher accuracy** in exchange for +**higher training and runtime costs**. + +### Setup and installation {#transformers-installation} + +> #### System requirements +> +> We recommend an NVIDIA **GPU** with at least **10GB of memory** in order to +> work with transformer models. Make sure your GPU drivers are up to date and +> you have **CUDA v9+** installed. + +> The exact requirements will depend on the transformer model. Training a +> transformer-based model without a GPU will be too slow for most practical +> purposes. +> +> Provisioning a new machine will require about **5GB** of data to be +> downloaded: 3GB CUDA runtime, 800MB PyTorch, 400MB CuPy, 500MB weights, 200MB +> spaCy and dependencies. + +Once you have CUDA installed, you'll need to install two pip packages, +[`cupy`](https://docs.cupy.dev/en/stable/install.html) and +[`spacy-transformers`](https://github.com/explosion/spacy-transformers). `cupy` +is just like `numpy`, but for GPU. The best way to install it is to choose a +wheel that matches the version of CUDA you're using. You may also need to set +the `CUDA_PATH` environment variable if your CUDA runtime is installed in a +non-standard location. Putting it all together, if you had installed CUDA 10.2 +in `/opt/nvidia/cuda`, you would run: + +```bash +### Installation with CUDA +$ export CUDA_PATH="/opt/nvidia/cuda" +$ pip install -U %%SPACY_PKG_NAME[cud102,transformers]%%SPACY_PKG_FLAGS +``` + +### Runtime usage {#transformers-runtime} + +Transformer models can be used as **drop-in replacements** for other types of +neural networks, so your spaCy pipeline can include them in a way that's +completely invisible to the user. Users will download, load and use the model in +the standard way, like any other spaCy pipeline. Instead of using the +transformers as subnetworks directly, you can also use them via the +[`Transformer`](/api/transformer) pipeline component. + +![The processing pipeline with the transformer component](../images/pipeline_transformer.svg) + +The `Transformer` component sets the +[`Doc._.trf_data`](/api/transformer#custom_attributes) extension attribute, +which lets you access the transformers outputs at runtime. 
The trained +transformer-based [pipelines](/models) provided by spaCy end on `_trf`, e.g. +[`en_core_web_trf`](/models/en#en_core_web_trf). + +```cli +$ python -m spacy download en_core_web_trf +``` + +```python +### Example +import spacy +from thinc.api import use_pytorch_for_gpu_memory, require_gpu + +# Use the GPU, with memory allocations directed via PyTorch. +# This prevents out-of-memory errors that would otherwise occur from competing +# memory pools. +use_pytorch_for_gpu_memory() +require_gpu(0) + +nlp = spacy.load("en_core_web_trf") +for doc in nlp.pipe(["some text", "some other text"]): + tokvecs = doc._.trf_data.tensors[-1] +``` + +You can also customize how the [`Transformer`](/api/transformer) component sets +annotations onto the [`Doc`](/api/doc) by specifying a custom +`set_extra_annotations` function. This callback will be called with the raw +input and output data for the whole batch, along with the batch of `Doc` +objects, allowing you to implement whatever you need. The annotation setter is +called with a batch of [`Doc`](/api/doc) objects and a +[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) containing the +transformers data for the batch. + +```python +def custom_annotation_setter(docs, trf_data): + doc_data = list(trf_data.doc_data) + for doc, data in zip(docs, doc_data): + doc._.custom_attr = data + +nlp = spacy.load("en_core_web_trf") +nlp.get_pipe("transformer").set_extra_annotations = custom_annotation_setter +doc = nlp("This is a text") +assert isinstance(doc._.custom_attr, TransformerData) +print(doc._.custom_attr.tensors) +``` + +### Training usage {#transformers-training} + +The recommended workflow for training is to use spaCy's +[config system](/usage/training#config), usually via the +[`spacy train`](/api/cli#train) command. The training config defines all +component settings and hyperparameters in one place and lets you describe a tree +of objects by referring to creation functions, including functions you register +yourself. For details on how to get started with training your own model, check +out the [training quickstart](/usage/training#quickstart). + + + +The `[components]` section in the [`config.cfg`](/api/data-formats#config) +describes the pipeline components and the settings used to construct them, +including their model implementation. Here's a config snippet for the +[`Transformer`](/api/transformer) component, along with matching Python code. 
In +this case, the `[components.transformer]` block describes the `transformer` +component: + +> #### Python equivalent +> +> ```python +> from spacy_transformers import Transformer, TransformerModel +> from spacy_transformers.annotation_setters import null_annotation_setter +> from spacy_transformers.span_getters import get_doc_spans +> +> trf = Transformer( +> nlp.vocab, +> TransformerModel( +> "bert-base-cased", +> get_spans=get_doc_spans, +> tokenizer_config={"use_fast": True}, +> ), +> set_extra_annotations=null_annotation_setter, +> max_batch_items=4096, +> ) +> ``` + +```ini +### config.cfg (excerpt) +[components.transformer] +factory = "transformer" +max_batch_items = 4096 + +[components.transformer.model] +@architectures = "spacy-transformers.TransformerModel.v1" +name = "bert-base-cased" +tokenizer_config = {"use_fast": true} + +[components.transformer.model.get_spans] +@span_getters = "spacy-transformers.doc_spans.v1" + +[components.transformer.set_extra_annotations] +@annotation_setters = "spacy-transformers.null_annotation_setter.v1" + +``` + +The `[components.transformer.model]` block describes the `model` argument passed +to the transformer component. It's a Thinc +[`Model`](https://thinc.ai/docs/api-model) object that will be passed into the +component. Here, it references the function +[spacy-transformers.TransformerModel.v1](/api/architectures#TransformerModel) +registered in the [`architectures` registry](/api/top-level#registry). If a key +in a block starts with `@`, it's **resolved to a function** and all other +settings are passed to the function as arguments. In this case, `name`, +`tokenizer_config` and `get_spans`. + +`get_spans` is a function that takes a batch of `Doc` objects and returns lists +of potentially overlapping `Span` objects to process by the transformer. Several +[built-in functions](/api/transformer#span_getters) are available – for example, +to process the whole document or individual sentences. When the config is +resolved, the function is created and passed into the model as an argument. + + + +Remember that the `config.cfg` used for training should contain **no missing +values** and requires all settings to be defined. You don't want any hidden +defaults creeping in and changing your results! spaCy will tell you if settings +are missing, and you can run +[`spacy init fill-config`](/api/cli#init-fill-config) to automatically fill in +all defaults. + + + +### Customizing the settings {#transformers-training-custom-settings} + +To change any of the settings, you can edit the `config.cfg` and re-run the +training. To change any of the functions, like the span getter, you can replace +the name of the referenced function – e.g. +`@span_getters = "spacy-transformers.sent_spans.v1"` to process sentences. You +can also register your own functions using the +[`span_getters` registry](/api/top-level#registry). For instance, the following +custom function returns [`Span`](/api/span) objects following sentence +boundaries, unless a sentence succeeds a certain amount of tokens, in which case +subsentences of at most `max_length` tokens are returned. 
+ +> #### config.cfg +> +> ```ini +> [components.transformer.model.get_spans] +> @span_getters = "custom_sent_spans" +> max_length = 25 +> ``` + +```python +### code.py +import spacy_transformers + +@spacy_transformers.registry.span_getters("custom_sent_spans") +def configure_custom_sent_spans(max_length: int): + def get_custom_sent_spans(docs): + spans = [] + for doc in docs: + spans.append([]) + for sent in doc.sents: + start = 0 + end = max_length + while end <= len(sent): + spans[-1].append(sent[start:end]) + start += max_length + end += max_length + if start < len(sent): + spans[-1].append(sent[start:len(sent)]) + return spans + + return get_custom_sent_spans +``` + +To resolve the config during training, spaCy needs to know about your custom +function. You can make it available via the `--code` argument that can point to +a Python file. For more details on training with custom code, see the +[training documentation](/usage/training#custom-functions). + +```cli +python -m spacy train ./config.cfg --code ./code.py +``` + +### Customizing the model implementations {#training-custom-model} + +The [`Transformer`](/api/transformer) component expects a Thinc +[`Model`](https://thinc.ai/docs/api-model) object to be passed in as its `model` +argument. You're not limited to the implementation provided by +`spacy-transformers` – the only requirement is that your registered function +must return an object of type ~~Model[List[Doc], FullTransformerBatch]~~: that +is, a Thinc model that takes a list of [`Doc`](/api/doc) objects, and returns a +[`FullTransformerBatch`](/api/transformer#fulltransformerbatch) object with the +transformer data. + +The same idea applies to task models that power the **downstream components**. +Most of spaCy's built-in model creation functions support a `tok2vec` argument, +which should be a Thinc layer of type ~~Model[List[Doc], List[Floats2d]]~~. This +is where we'll plug in our transformer model, using the +[TransformerListener](/api/architectures#TransformerListener) layer, which +sneakily delegates to the `Transformer` pipeline component. + +```ini +### config.cfg (excerpt) {highlight="12"} +[components.ner] +factory = "ner" + +[nlp.pipeline.ner.model] +@architectures = "spacy.TransitionBasedParser.v1" +state_type = "ner" +extra_state_tokens = false +hidden_width = 128 +maxout_pieces = 3 +use_upper = false + +[nlp.pipeline.ner.model.tok2vec] +@architectures = "spacy-transformers.TransformerListener.v1" +grad_factor = 1.0 + +[nlp.pipeline.ner.model.tok2vec.pooling] +@layers = "reduce_mean.v1" +``` + +The [TransformerListener](/api/architectures#TransformerListener) layer expects +a [pooling layer](https://thinc.ai/docs/api-layers#reduction-ops) as the +argument `pooling`, which needs to be of type ~~Model[Ragged, Floats2d]~~. This +layer determines how the vector for each spaCy token will be computed from the +zero or more source rows the token is aligned against. Here we use the +[`reduce_mean`](https://thinc.ai/docs/api-layers#reduce_mean) layer, which +averages the wordpiece rows. We could instead use +[`reduce_max`](https://thinc.ai/docs/api-layers#reduce_max), or a custom +function you write yourself. + +You can have multiple components all listening to the same transformer model, +and all passing gradients back to it. By default, all of the gradients will be +**equally weighted**. You can control this with the `grad_factor` setting, which +lets you reweight the gradients from the different listeners. 
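As a rough sketch of what that looks like in the config – assuming a second `textcat` component that also connects to the shared transformer via a `TransformerListener` (the component names and values here are only illustrative, shown under the standard `components.*` blocks) – you might weight the two listeners differently:

```ini
### config.cfg (sketch)
# Both components listen to the shared transformer, but contribute
# differently weighted gradients when the transformer is updated.
[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0

[components.ner.model.tok2vec.pooling]
@layers = "reduce_mean.v1"

[components.textcat.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 0.5

[components.textcat.model.tok2vec.pooling]
@layers = "reduce_mean.v1"
```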
For instance, +setting `grad_factor = 0` would disable gradients from one of the listeners, +while `grad_factor = 2.0` would multiply them by 2. This is similar to having a +custom learning rate for each component. Instead of a constant, you can also +provide a schedule, allowing you to freeze the shared parameters at the start of +training. + +## Static vectors {#static-vectors} + +If your pipeline includes a **word vectors table**, you'll be able to use the +`.similarity()` method on the [`Doc`](/api/doc), [`Span`](/api/span), +[`Token`](/api/token) and [`Lexeme`](/api/lexeme) objects. You'll also be able +to access the vectors using the `.vector` attribute, or you can look up one or +more vectors directly using the [`Vocab`](/api/vocab) object. Pipelines with +word vectors can also **use the vectors as features** for the statistical +models, which can **improve the accuracy** of your components. + +Word vectors in spaCy are "static" in the sense that they are not learned +parameters of the statistical models, and spaCy itself does not feature any +algorithms for learning word vector tables. You can train a word vectors table +using tools such as [Gensim](https://radimrehurek.com/gensim/), +[FastText](https://fasttext.cc/) or +[GloVe](https://nlp.stanford.edu/projects/glove/), or download existing +pretrained vectors. The [`init vectors`](/api/cli#init-vectors) command lets you +convert vectors for use with spaCy and will give you a directory you can load or +refer to in your [training configs](/usage/training#config). + + + +For more details on loading word vectors into spaCy, using them for similarity +and improving word vector coverage by truncating and pruning the vectors, see +the usage guide on +[word vectors and similarity](/usage/linguistic-features#vectors-similarity). + + + +### Using word vectors in your models {#word-vectors-models} + +Many neural network models are able to use word vector tables as additional +features, which sometimes results in significant improvements in accuracy. +spaCy's built-in embedding layer, +[MultiHashEmbed](/api/architectures#MultiHashEmbed), can be configured to use +word vector tables using the `include_static_vectors` flag. + +```ini +[tagger.model.tok2vec.embed] +@architectures = "spacy.MultiHashEmbed.v1" +width = 128 +attrs = ["LOWER","PREFIX","SUFFIX","SHAPE"] +rows = [5000,2500,2500,2500] +include_static_vectors = true +``` + + + +The configuration system will look up the string `"spacy.MultiHashEmbed.v1"` in +the `architectures` [registry](/api/top-level#registry), and call the returned +object with the rest of the arguments from the block. This will result in a call +to the +[`MultiHashEmbed`](https://github.com/explosion/spacy/tree/develop/spacy/ml/models/tok2vec.py) +function, which will return a [Thinc](https://thinc.ai) model object with the +type signature ~~Model[List[Doc], List[Floats2d]]~~. Because the embedding layer +takes a list of `Doc` objects as input, it does not need to store a copy of the +vectors table. The vectors will be retrieved from the `Doc` objects that are +passed in, via the `doc.vocab.vectors` attribute. This part of the process is +handled by the [StaticVectors](/api/architectures#StaticVectors) layer. + + + +#### Creating a custom embedding layer {#custom-embedding-layer} + +The [MultiHashEmbed](/api/architectures#StaticVectors) layer is spaCy's +recommended strategy for constructing initial word representations for your +neural network models, but you can also implement your own. 
You can register any +function to a string name, and then reference that function within your config +(see the [training docs](/usage/training) for more details). To try this out, +you can save the following little example to a new Python file: + +```python +from spacy.ml.staticvectors import StaticVectors +from spacy.util import registry + +print("I was imported!") + +@registry.architectures("my_example.MyEmbedding.v1") +def MyEmbedding(output_width: int) -> Model[List[Doc], List[Floats2d]]: + print("I was called!") + return StaticVectors(nO=output_width) +``` + +If you pass the path to your file to the [`spacy train`](/api/cli#train) command +using the `--code` argument, your file will be imported, which means the +decorator registering the function will be run. Your function is now on equal +footing with any of spaCy's built-ins, so you can drop it in instead of any +other model with the same input and output signature. For instance, you could +use it in the tagger model as follows: + +```ini +[tagger.model.tok2vec.embed] +@architectures = "my_example.MyEmbedding.v1" +output_width = 128 +``` + +Now that you have a custom function wired into the network, you can start +implementing the logic you're interested in. For example, let's say you want to +try a relatively simple embedding strategy that makes use of static word +vectors, but combines them via summation with a smaller table of learned +embeddings. + +```python +from thinc.api import add, chain, remap_ids, Embed +from spacy.ml.staticvectors import StaticVectors +from spacy.ml.featureextractor import FeatureExtractor +from spacy.util import registry + +@registry.architectures("my_example.MyEmbedding.v1") +def MyCustomVectors( + output_width: int, + vector_width: int, + embed_rows: int, + key2row: Dict[int, int] +) -> Model[List[Doc], List[Floats2d]]: + return add( + StaticVectors(nO=output_width), + chain( + FeatureExtractor(["ORTH"]), + remap_ids(key2row), + Embed(nO=output_width, nV=embed_rows) + ) + ) +``` + +## Pretraining {#pretraining} + +The [`spacy pretrain`](/api/cli#pretrain) command lets you initialize your +models with **information from raw text**. Without pretraining, the models for +your components will usually be initialized randomly. The idea behind +pretraining is simple: random probably isn't optimal, so if we have some text to +learn from, we can probably find a way to get the model off to a better start. + +Pretraining uses the same [`config.cfg`](/usage/training#config) file as the +regular training, which helps keep the settings and hyperparameters consistent. +The additional `[pretraining]` section has several configuration subsections +that are familiar from the training block: the `[pretraining.batcher]`, +`[pretraining.optimizer]` and `[pretraining.corpus]` all work the same way and +expect the same types of objects, although for pretraining your corpus does not +need to have any annotations, so you will often use a different reader, such as +the [`JsonlCorpus`](/api/top-level#jsonlcorpus). + +> #### Raw text format +> +> The raw text can be provided in spaCy's +> [binary `.spacy` format](/api/data-formats#training) consisting of serialized +> `Doc` objects or as a JSONL (newline-delimited JSON) with a key `"text"` per +> entry. This allows the data to be read in line by line, while also allowing +> you to include newlines in the texts. 
+> +> ```json +> {"text": "Can I ask where you work now and what you do, and if you enjoy it?"} +> {"text": "They may just pull out of the Seattle market completely, at least until they have autonomous vehicles."} +> ``` +> +> You can also use your own custom corpus loader instead. + +You can add a `[pretraining]` block to your config by setting the +`--pretraining` flag on [`init config`](/api/cli#init-config) or +[`init fill-config`](/api/cli#init-fill-config): + +```cli +$ python -m spacy init fill-config config.cfg config_pretrain.cfg --pretraining +``` + +You can then run [`spacy pretrain`](/api/cli#pretrain) with the updated config +and pass in optional config overrides, like the path to the raw text file: + +```cli +$ python -m spacy pretrain config_pretrain.cfg ./output --paths.raw text.jsonl +``` + +The following defaults are used for the `[pretraining]` block and merged into +your existing config when you run [`init config`](/api/cli#init-config) or +[`init fill-config`](/api/cli#init-fill-config) with `--pretraining`. If needed, +you can [configure](#pretraining-configure) the settings and hyperparameters or +change the [objective](#pretraining-details). + +```ini +%%GITHUB_SPACY/spacy/default_config_pretraining.cfg +``` + +### How pretraining works {#pretraining-details} + +The impact of [`spacy pretrain`](/api/cli#pretrain) varies, but it will usually +be worth trying if you're **not using a transformer** model and you have +**relatively little training data** (for instance, fewer than 5,000 sentences). +A good rule of thumb is that pretraining will generally give you a similar +accuracy improvement to using word vectors in your model. If word vectors have +given you a 10% error reduction, pretraining with spaCy might give you another +10%, for a 20% error reduction in total. + +The [`spacy pretrain`](/api/cli#pretrain) command will take a **specific +subnetwork** within one of your components, and add additional layers to build a +network for a temporary task that forces the model to learn something about +sentence structure and word cooccurrence statistics. Pretraining produces a +**binary weights file** that can be loaded back in at the start of training. The +weights file specifies an initial set of weights. Training then proceeds as +normal. + +You can only pretrain one subnetwork from your pipeline at a time, and the +subnetwork must be typed ~~Model[List[Doc], List[Floats2d]]~~ (i.e. it has to be +a "tok2vec" layer). The most common workflow is to use the +[`Tok2Vec`](/api/tok2vec) component to create a shared token-to-vector layer for +several components of your pipeline, and apply pretraining to its whole model. + +#### Configuring the pretraining {#pretraining-configure} + +The [`spacy pretrain`](/api/cli#pretrain) command is configured using the +`[pretraining]` section of your [config file](/usage/training#config). The +`component` and `layer` settings tell spaCy how to **find the subnetwork** to +pretrain. The `layer` setting should be either the empty string (to use the +whole model), or a +[node reference](https://thinc.ai/docs/usage-models#model-state). Most of +spaCy's built-in model architectures have a reference named `"tok2vec"` that +will refer to the right layer. + +```ini +### config.cfg +# 1. Use the whole model of the "tok2vec" component +[pretraining] +component = "tok2vec" +layer = "" + +# 2. 
Pretrain the "tok2vec" node of the "textcat" component +[pretraining] +component = "textcat" +layer = "tok2vec" +``` + +#### Pretraining objectives {#pretraining-details} + +Two pretraining objectives are available, both of which are variants of the +cloze task [Devlin et al. (2018)](https://arxiv.org/abs/1810.04805) introduced +for BERT. The objective can be defined and configured via the +`[pretraining.objective]` config block. + +> ```ini +> ### Characters objective +> [pretraining.objective] +> type = "characters" +> n_characters = 4 +> ``` +> +> ```ini +> ### Vectors objective +> [pretraining.objective] +> type = "vectors" +> loss = "cosine" +> ``` + +- **Characters:** The `"characters"` objective asks the model to predict some + number of leading and trailing UTF-8 bytes for the words. For instance, + setting `n_characters = 2`, the model will try to predict the first two and + last two characters of the word. + +- **Vectors:** The `"vectors"` objective asks the model to predict the word's + vector, from a static embeddings table. This requires a word vectors model to + be trained and loaded. The vectors objective can optimize either a cosine or + an L2 loss. We've generally found cosine loss to perform better. + +These pretraining objectives use a trick that we term **language modelling with +approximate outputs (LMAO)**. The motivation for the trick is that predicting an +exact word ID introduces a lot of incidental complexity. You need a large output +layer, and even then, the vocabulary is too large, which motivates tokenization +schemes that do not align to actual word boundaries. At the end of training, the +output layer will be thrown away regardless: we just want a task that forces the +network to model something about word cooccurrence statistics. Predicting +leading and trailing characters does that more than adequately, as the exact +word sequence could be recovered with high accuracy if the initial and trailing +characters are predicted accurately. With the vectors objective, the pretraining +uses the embedding space learned by an algorithm such as +[GloVe](https://nlp.stanford.edu/projects/glove/) or +[Word2vec](https://code.google.com/archive/p/word2vec/), allowing the model to +focus on the contextual modelling we actual care about. diff --git a/website/docs/usage/examples.md b/website/docs/usage/examples.md deleted file mode 100644 index 854b2d42b..000000000 --- a/website/docs/usage/examples.md +++ /dev/null @@ -1,207 +0,0 @@ ---- -title: Examples -teaser: Full code examples you can modify and run -menu: - - ['Information Extraction', 'information-extraction'] - - ['Pipeline', 'pipeline'] - - ['Training', 'training'] - - ['Vectors & Similarity', 'vectors'] - - ['Deep Learning', 'deep-learning'] ---- - -## Information Extraction {#information-extraction hidden="true"} - -### Using spaCy's phrase matcher {#phrase-matcher new="2"} - -This example shows how to use the new [`PhraseMatcher`](/api/phrasematcher) to -efficiently find entities from a large terminology list. - -```python -https://github.com/explosion/spaCy/tree/master/examples/information_extraction/phrase_matcher.py -``` - -### Extracting entity relations {#entity-relations} - -A simple example of extracting relations between phrases and entities using -spaCy's named entity recognizer and the dependency parse. 
Here, we extract money -and currency values (entities labelled as `MONEY`) and then check the dependency -tree to find the noun phrase they are referring to – for example: -`"$9.4 million"` → `"Net income"`. - -```python -https://github.com/explosion/spaCy/tree/master/examples/information_extraction/entity_relations.py -``` - -### Navigating the parse tree and subtrees {#subtrees} - -This example shows how to navigate the parse tree including subtrees attached to -a word. - -```python -https://github.com/explosion/spaCy/tree/master/examples/information_extraction/parse_subtrees.py -``` - -## Pipeline {#pipeline hidden="true"} - -### Custom pipeline components and attribute extensions {#custom-components-entities new="2"} - -This example shows the implementation of a pipeline component that sets entity -annotations based on a list of single or multiple-word company names, merges -entities into one token and sets custom attributes on the `Doc`, `Span` and -`Token`. - -```python -https://github.com/explosion/spaCy/tree/master/examples/pipeline/custom_component_entities.py -``` - -### Custom pipeline components and attribute extensions via a REST API {#custom-components-api new="2"} - -This example shows the implementation of a pipeline component that fetches -country meta data via the [REST Countries API](https://restcountries.eu) sets -entity annotations for countries, merges entities into one token and sets custom -attributes on the `Doc`, `Span` and `Token` – for example, the capital, -latitude/longitude coordinates and the country flag. - -```python -https://github.com/explosion/spaCy/tree/master/examples/pipeline/custom_component_countries_api.py -``` - -### Custom method extensions {#custom-components-attr-methods new="2"} - -A collection of snippets showing examples of extensions adding custom methods to -the `Doc`, `Token` and `Span`. - -```python -https://github.com/explosion/spaCy/tree/master/examples/pipeline/custom_attr_methods.py -``` - -### Multi-processing with Joblib {#multi-processing} - -This example shows how to use multiple cores to process text using spaCy and -[Joblib](https://joblib.readthedocs.io/en/latest/). We're exporting -part-of-speech-tagged, true-cased, (very roughly) sentence-separated text, with -each "sentence" on a newline, and spaces between tokens. Data is loaded from the -IMDB movie reviews dataset and will be loaded automatically via Thinc's built-in -dataset loader. - -```python -https://github.com/explosion/spaCy/tree/master/examples/pipeline/multi_processing.py -``` - -## Training {#training hidden="true"} - -### Training spaCy's Named Entity Recognizer {#training-ner} - -This example shows how to update spaCy's entity recognizer with your own -examples, starting off with an existing, pretrained model, or from scratch -using a blank `Language` class. - -```python -https://github.com/explosion/spaCy/tree/master/examples/training/train_ner.py -``` - -### Training an additional entity type {#new-entity-type} - -This script shows how to add a new entity type to an existing pretrained NER -model. To keep the example short and simple, only four sentences are provided as -examples. In practice, you'll need many more — a few hundred would be a good -start. - -```python -https://github.com/explosion/spaCy/tree/master/examples/training/train_new_entity_type.py -``` - -### Creating a Knowledge Base for Named Entity Linking {#kb} - -This example shows how to create a knowledge base in spaCy, -which is needed to implement entity linking functionality. 
-It requires as input a spaCy model with pretrained word vectors, -and it stores the KB to file (if an `output_dir` is provided). - -```python -https://github.com/explosion/spaCy/tree/master/examples/training/create_kb.py -``` - -### Training spaCy's Named Entity Linker {#nel} - -This example shows how to train spaCy's entity linker with your own custom -examples, starting off with a predefined knowledge base and its vocab, -and using a blank `English` class. - -```python -https://github.com/explosion/spaCy/tree/master/examples/training/train_entity_linker.py -``` - -### Training spaCy's Dependency Parser {#parser} - -This example shows how to update spaCy's dependency parser, starting off with an -existing, pretrained model, or from scratch using a blank `Language` class. - -```python -https://github.com/explosion/spaCy/tree/master/examples/training/train_parser.py -``` - -### Training spaCy's Part-of-speech Tagger {#tagger} - -In this example, we're training spaCy's part-of-speech tagger with a custom tag -map, mapping our own tags to the mapping those tags to the -[Universal Dependencies scheme](http://universaldependencies.github.io/docs/u/pos/index.html). - -```python -https://github.com/explosion/spaCy/tree/master/examples/training/train_tagger.py -``` - -### Training a custom parser for chat intent semantics {#intent-parser} - -spaCy's parser component can be used to trained to predict any type of tree -structure over your input text. You can also predict trees over whole documents -or chat logs, with connections between the sentence-roots used to annotate -discourse structure. In this example, we'll build a message parser for a common -"chat intent": finding local businesses. Our message semantics will have the -following types of relations: `ROOT`, `PLACE`, `QUALITY`, `ATTRIBUTE`, `TIME` -and `LOCATION`. - -```python -https://github.com/explosion/spaCy/tree/master/examples/training/train_intent_parser.py -``` - -### Training spaCy's text classifier {#textcat new="2"} - -This example shows how to train a multi-label convolutional neural network text -classifier on IMDB movie reviews, using spaCy's new -[`TextCategorizer`](/api/textcategorizer) component. The dataset will be loaded -automatically via Thinc's built-in dataset loader. Predictions are available via -[`Doc.cats`](/api/doc#attributes). - -```python -https://github.com/explosion/spaCy/tree/master/examples/training/train_textcat.py -``` - -## Vectors {#vectors hidden="true"} - -### Visualizing spaCy vectors in TensorBoard {#tensorboard} - -This script lets you load any spaCy model containing word vectors into -[TensorBoard](https://projector.tensorflow.org/) to create an -[embedding visualization](https://github.com/tensorflow/tensorboard/blob/master/docs/tensorboard_projector_plugin.ipynb). - -```python -https://github.com/explosion/spaCy/tree/master/examples/vectors_tensorboard.py -``` - -## Deep Learning {#deep-learning hidden="true"} - -### Text classification with Keras {#keras} - -This example shows how to use a [Keras](https://keras.io) LSTM sentiment -classification model in spaCy. spaCy splits the document into sentences, and -each sentence is classified using the LSTM. The scores for the sentences are -then aggregated to give the document score. This kind of hierarchical model is -quite difficult in "pure" Keras or TensorFlow, but it's very effective. The -Keras example on this dataset performs quite poorly, because it cuts off the -documents so that they're a fixed size. 
This hurts review accuracy a lot, -because people often summarize their rating in the final sentence. - -```python -https://github.com/explosion/spaCy/tree/master/examples/deep_learning_keras.py -``` diff --git a/website/docs/usage/facts-figures.md b/website/docs/usage/facts-figures.md index e2549ecfc..c7a7d0525 100644 --- a/website/docs/usage/facts-figures.md +++ b/website/docs/usage/facts-figures.md @@ -5,254 +5,106 @@ next: /usage/spacy-101 menu: - ['Feature Comparison', 'comparison'] - ['Benchmarks', 'benchmarks'] + # TODO: - ['Citing spaCy', 'citation'] --- -## Feature comparison {#comparison} +## Comparison {#comparison hidden="true"} -Here's a quick comparison of the functionalities offered by spaCy, -[NLTK](http://www.nltk.org/py-modindex.html) and -[CoreNLP](http://stanfordnlp.github.io/CoreNLP/). +### When should I use spaCy? {#comparison-usage} -| | spaCy | NLTK | CoreNLP | -| ----------------------- | :----: | :----: | :-----------: | -| Programming language | Python | Python | Java / Python | -| Neural network models | ✅ | ❌ | ✅ | -| Integrated word vectors | ✅ | ❌ | ❌ | -| Multi-language support | ✅ | ✅ | ✅ | -| Tokenization | ✅ | ✅ | ✅ | -| Part-of-speech tagging | ✅ | ✅ | ✅ | -| Sentence segmentation | ✅ | ✅ | ✅ | -| Dependency parsing | ✅ | ❌ | ✅ | -| Entity recognition | ✅ | ✅ | ✅ | -| Entity linking | ✅ | ❌ | ❌ | -| Coreference resolution | ❌ | ❌ | ✅ | - -### When should I use what? {#comparison-usage} - -Natural Language Understanding is an active area of research and development, so -there are many different tools or technologies catering to different use-cases. -The table below summarizes a few libraries (spaCy, -[NLTK](http://www.nltk.org/py-modindex.html), [AllenNLP](https://allennlp.org/), -[StanfordNLP](https://stanfordnlp.github.io/stanfordnlp/) and -[TensorFlow](https://www.tensorflow.org/)) to help you get a feel for things fit -together. - -| | spaCy | NLTK | Allen-
NLP | Stanford-
NLP | Tensor-
Flow | -| ----------------------------------------------------------------- | :---: | :--: | :-------------: | :----------------: | :---------------: | -| I'm a beginner and just getting started with NLP. | ✅ | ✅ | ❌ | ✅ | ❌ | -| I want to build an end-to-end production application. | ✅ | ❌ | ❌ | ❌ | ✅ | -| I want to try out different neural network architectures for NLP. | ❌ | ❌ | ✅ | ❌ | ✅ | -| I want to try the latest models with state-of-the-art accuracy. | ❌ | ❌ | ✅ | ✅ | ✅ | -| I want to train models from my own data. | ✅ | ✅ | ✅ | ✅ | ✅ | -| I want my application to be efficient on CPU. | ✅ | ✅ | ❌ | ❌ | ❌ | +- ✅ **I'm a beginner and just getting started with NLP.** – spaCy makes it easy + to get started and comes with extensive documentation, including a + beginner-friendly [101 guide](/usage/spacy-101), a free interactive + [online course](https://course.spacy.io) and a range of + [video tutorials](https://www.youtube.com/c/ExplosionAI). +- ✅ **I want to build an end-to-end production application.** – spaCy is + specifically designed for production use and lets you build and train powerful + NLP pipelines and package them for easy deployment. +- ✅ **I want my application to be efficient on GPU _and_ CPU.** – While spaCy + lets you train modern NLP models that are best run on GPU, it also offers + CPU-optimized pipelines, which are less accurate but much cheaper to run. +- ✅ **I want to try out different neural network architectures for NLP.** – + spaCy lets you customize and swap out the model architectures powering its + components, and implement your own using a framework like PyTorch or + TensorFlow. The declarative configuration system makes it easy to mix and + match functions and keep track of your hyperparameters to make sure your + experiments are reproducible. +- ❌ **I want to build a language generation application.** – spaCy's focus is + natural language _processing_ and extracting information from large volumes of + text. While you can use it to help you re-write existing text, it doesn't + include any specific functionality for language generation tasks. +- ❌ **I want to research machine learning algorithms.** spaCy is built on the + latest research, but it's not a research library. If your goal is to write + papers and run benchmarks, spaCy is probably not a good choice. However, you + can use it to make the results of your research easily available for others to + use, e.g. via a custom spaCy component. ## Benchmarks {#benchmarks} -Two peer-reviewed papers in 2015 confirmed that spaCy offers the **fastest -syntactic parser in the world** and that **its accuracy is within 1% of the -best** available. The few systems that are more accurate are 20× slower or more. +spaCy v3.0 introduces transformer-based pipelines that bring spaCy's accuracy +right up to **current state-of-the-art**. You can also use a CPU-optimized +pipeline, which is less accurate but much cheaper to run. -> #### About the evaluation + + +> #### Evaluation details > -> The first of the evaluations was published by **Yahoo! Labs** and **Emory -> University**, as part of a survey of current parsing technologies -> ([Choi et al., 2015](https://aclweb.org/anthology/P/P15/P15-1038.pdf)). Their -> results and subsequent discussions helped us develop a novel -> psychologically-motivated technique to improve spaCy's accuracy, which we -> published in joint work with Macquarie University -> ([Honnibal and Johnson, 2015](https://www.aclweb.org/anthology/D/D15/D15-1162.pdf)). 
+> - **OntoNotes 5.0:** spaCy's English models are trained on this corpus, as +> it's several times larger than other English treebanks. However, most +> systems do not report accuracies on it. +> - **Penn Treebank:** The "classic" parsing evaluation for research. However, +> it's quite far removed from actual usage: it uses sentences with +> gold-standard segmentation and tokenization, from a pretty specific type of +> text (articles from a single newspaper, 1984-1989). -import BenchmarksChoi from 'usage/\_benchmarks-choi.md' +import Benchmarks from 'usage/\_benchmarks-models.md' - + -### Algorithm comparison {#algorithm} +
-In this section, we compare spaCy's algorithms to recently published systems, -using some of the most popular benchmarks. These benchmarks are designed to help -isolate the contributions of specific algorithmic decisions, so they promote -slightly "idealized" conditions. Specifically, the text comes pre-processed with -"gold standard" token and sentence boundaries. The data sets also tend to be -fairly small, to help researchers iterate quickly. These conditions mean the -models trained on these data sets are not always useful for practical purposes. +| Dependency Parsing System | UAS | LAS | +| ------------------------------------------------------------------------------ | ---: | ---: | +| spaCy RoBERTa (2020) | 95.5 | 94.3 | +| spaCy CNN (2020) | | | +| [Mrini et al.](https://khalilmrini.github.io/Label_Attention_Layer.pdf) (2019) | 97.4 | 96.3 | +| [Zhou and Zhao](https://www.aclweb.org/anthology/P19-1230/) (2019) | 97.2 | 95.7 | -#### Parse accuracy (Penn Treebank / Wall Street Journal) {#parse-accuracy-penn} +
-This is the "classic" evaluation, so it's the number parsing researchers are -most easily able to put in context. However, it's quite far removed from actual -usage: it uses sentences with gold-standard segmentation and tokenization, from -a pretty specific type of text (articles from a single newspaper, 1984-1989). +**Dependency parsing accuracy** on the Penn Treebank. See +[NLP-progress](http://nlpprogress.com/english/dependency_parsing.html) for more +results. Project template: +[`benchmarks/parsing_penn_treebank`](%%GITHUB_PROJECTS/benchmarks/parsing_penn_treebank). -> #### Methodology -> -> [Andor et al. (2016)](http://arxiv.org/abs/1603.06042) chose slightly -> different experimental conditions from -> [Choi et al. (2015)](https://aclweb.org/anthology/P/P15/P15-1038.pdf), so the -> two accuracy tables here do not present directly comparable figures. +
-| System | Year | Type | Accuracy | -| ------------------------------------------------------------ | ---- | ------ | --------: | -| spaCy v2.0.0 | 2017 | neural | 94.48 | -| spaCy v1.1.0 | 2016 | linear | 92.80 | -| [Dozat and Manning][dozat and manning] | 2017 | neural | **95.75** | -| [Andor et al.][andor et al.] | 2016 | neural | 94.44 | -| [SyntaxNet Parsey McParseface][syntaxnet parsey mcparseface] | 2016 | neural | 94.15 | -| [Weiss et al.][weiss et al.] | 2015 | neural | 93.91 | -| [Zhang and McDonald][zhang and mcdonald] | 2014 | linear | 93.32 | -| [Martins et al.][martins et al.] | 2013 | linear | 93.10 | +
-[dozat and manning]: https://arxiv.org/pdf/1611.01734.pdf -[andor et al.]: http://arxiv.org/abs/1603.06042 -[syntaxnet parsey mcparseface]: - https://github.com/tensorflow/models/tree/master/research/syntaxnet -[weiss et al.]: - http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43800.pdf -[zhang and mcdonald]: http://research.google.com/pubs/archive/38148.pdf -[martins et al.]: http://www.cs.cmu.edu/~ark/TurboParser/ +### Speed comparison {#benchmarks-speed} -#### NER accuracy (OntoNotes 5, no pre-process) {#ner-accuracy-ontonotes5} + -This is the evaluation we use to tune spaCy's parameters to decide which -algorithms are better than the others. It's reasonably close to actual usage, -because it requires the parses to be produced from raw text, without any -pre-processing. +
-| System | Year | Type | Accuracy | -| -------------------------------------------------- | ---- | ------ | --------: | -| spaCy [`en_core_web_lg`][en_core_web_lg] v2.0.0a3  | 2017 | neural | 85.85 | -| [Strubell et al.][strubell et al.]  | 2017 | neural | **86.81** | -| [Chiu and Nichols][chiu and nichols]  | 2016 | neural | 86.19 | -| [Durrett and Klein][durrett and klein]  | 2014 | neural | 84.04 | -| [Ratinov and Roth][ratinov and roth]  | 2009 | linear | 83.45 | +| Library | Pipeline | WPS CPU words per second on CPU, higher is better | WPS GPU words per second on GPU, higher is better | +| ------- | ----------------------------------------------- | -------------------------------------------------------------: | -------------------------------------------------------------: | +| spaCy | [`en_core_web_md`](/models/en#en_core_web_md) | +| spaCy | [`en_core_web_trf`](/models/en#en_core_web_trf) | +| Stanza | `en_ewt` | | +| Flair | `pos-fast_ner-fast` | +| Flair | `pos_ner` | +| UDPipe | `english-ewt-ud-2.5` | -[en_core_web_lg]: /models/en#en_core_web_lg -[strubell et al.]: https://arxiv.org/pdf/1702.02098.pdf -[chiu and nichols]: - https://www.semanticscholar.org/paper/Named-Entity-Recognition-with-Bidirectional-LSTM-C-Chiu-Nichols/10a4db59e81d26b2e0e896d3186ef81b4458b93f -[durrett and klein]: - https://www.semanticscholar.org/paper/A-Joint-Model-for-Entity-Analysis-Coreference-Typi-Durrett-Klein/28eb033eee5f51c5e5389cbb6b777779203a6778 -[ratinov and roth]: http://www.aclweb.org/anthology/W09-1119 +
-### Model comparison {#spacy-models} +**End-to-end processing speed** on raw unannotated text. Project template: +[`benchmarks/speed`](%%GITHUB_PROJECTS/benchmarks/speed). -In this section, we provide benchmark accuracies for the pretrained model -pipelines we distribute with spaCy. Evaluations are conducted end-to-end from -raw text, with no "gold standard" pre-processing, over text from a mix of genres -where possible. +
-> #### Methodology -> -> The evaluation was conducted on raw text with no gold standard information. -> The parser, tagger and entity recognizer were trained on the -> [OntoNotes 5](https://www.gabormelli.com/RKB/OntoNotes_Corpus) corpus, the -> word vectors on [Common Crawl](http://commoncrawl.org). +
-#### English {#benchmarks-models-english} + diff --git a/website/docs/usage/index.md b/website/docs/usage/index.md index d0172104b..ccb59e937 100644 --- a/website/docs/usage/index.md +++ b/website/docs/usage/index.md @@ -8,67 +8,77 @@ menu: - ['Changelog', 'changelog'] --- -spaCy is compatible with **64-bit CPython 2.7 / 3.5+** and runs on -**Unix/Linux**, **macOS/OS X** and **Windows**. The latest spaCy releases are -available over [pip](https://pypi.python.org/pypi/spacy) and -[conda](https://anaconda.org/conda-forge/spacy). +## Quickstart {hidden="true"} > #### 📖 Looking for the old docs? > -> To help you make the transition from v1.x to v2.0, we've uploaded the old -> website to [**legacy.spacy.io**](https://legacy.spacy.io/docs). Wherever -> possible, the new docs also include notes on features that have changed in -> v2.0, and features that were introduced in the new version. - -## Quickstart {hidden="true"} +> To help you make the transition from v2.x to v3.0, we've uploaded the old +> website to [**v2.spacy.io**](https://v2.spacy.io/docs). To see what's changed +> and how to migrate, see the [v3.0 guide](/usage/v3). import QuickstartInstall from 'widgets/quickstart-install.js' - + ## Installation instructions {#installation} +spaCy is compatible with **64-bit CPython 3.6+** and runs on **Unix/Linux**, +**macOS/OS X** and **Windows**. The latest spaCy releases are available over +[pip](https://pypi.python.org/pypi/spacy) and +[conda](https://anaconda.org/conda-forge/spacy). + ### pip {#pip} -Using pip, spaCy releases are available as source packages and binary wheels (as -of v2.0.13). +Using pip, spaCy releases are available as source packages and binary wheels. +Before you install spaCy and its dependencies, make sure that your `pip`, +`setuptools` and `wheel` are up to date. -```bash -$ pip install -U spacy -``` - -> #### Download models +> #### Download pipelines > -> After installation you need to download a language model. For more info and -> available models, see the [docs on models](/models). +> After installation you typically want to download a trained pipeline. For more +> info and available packages, see the [models directory](/models). > -> ```bash +> ```cli > $ python -m spacy download en_core_web_sm > > >>> import spacy > >>> nlp = spacy.load("en_core_web_sm") > ``` - - -To install additional data tables for lemmatization in **spaCy v2.2+** you can -run `pip install spacy[lookups]` or install -[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) -separately. The lookups package is needed to create blank models with -lemmatization data, and to lemmatize in languages that don't yet come with -pretrained models and aren't powered by third-party libraries. - - +```bash +$ pip install -U pip setuptools wheel +$ pip install -U %%SPACY_PKG_NAME%%SPACY_PKG_FLAGS +``` When using pip it is generally recommended to install packages in a virtual environment to avoid modifying system state: ```bash -python -m venv .env -source .env/bin/activate -pip install spacy +$ python -m venv .env +$ source .env/bin/activate +$ pip install -U pip setuptools wheel +$ pip install -U %%SPACY_PKG_NAME%%SPACY_PKG_FLAGS ``` +spaCy also lets you install extra dependencies by specifying the following +keywords in brackets, e.g. `spacy[ja]` or `spacy[lookups,transformers]` (with +multiple comma-separated extras). See the `[options.extras_require]` section in +spaCy's [`setup.cfg`](%%GITHUB_SPACY/setup.cfg) for details on what's included. 
+ +> #### Example +> +> ```bash +> $ pip install %%SPACY_PKG_NAME[lookups,transformers]%%SPACY_PKG_FLAGS +> ``` + +| Name | Description | +| ---------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `lookups` | Install [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) for data tables for lemmatization and lexeme normalization. The data is serialized with trained pipelines, so you only need this package if you want to train your own models. | +| `transformers` | Install [`spacy-transformers`](https://github.com/explosion/spacy-transformers). The package will be installed automatically when you install a transformer-based pipeline. | +| `ray` | Install [`spacy-ray`](https://github.com/explosion/spacy-ray) to add CLI commands for [parallel training](/usage/training#parallel-training). | +| `cuda`, ... | Install spaCy with GPU support provided by [CuPy](https://cupy.chainer.org) for your given CUDA version. See the GPU [installation instructions](#gpu) for details and options. | +| `ja`, `ko`, `th`, `zh` | Install additional dependencies required for tokenization for the [languages](/usage/models#languages). | + ### conda {#conda} Thanks to our great community, we've been able to re-add conda support. You can @@ -79,47 +89,44 @@ $ conda install -c conda-forge spacy ``` For the feedstock including the build recipe and configuration, check out -[this repository](https://github.com/conda-forge/spacy-feedstock). Improvements -and pull requests to the recipe and setup are always appreciated. +[this repository](https://github.com/conda-forge/spacy-feedstock). Note that we +currently don't publish any [pre-releases](#changelog-pre) on conda. ### Upgrading spaCy {#upgrading} -> #### Upgrading from v1 to v2 +> #### Upgrading from v2 to v3 > > Although we've tried to keep breaking changes to a minimum, upgrading from -> spaCy v1.x to v2.x may still require some changes to your code base. For -> details see the sections on [backwards incompatibilities](/usage/v2#incompat) -> and [migrating](/usage/v2#migrating). Also remember to download the new -> models, and retrain your own models. +> spaCy v2.x to v3.x may still require some changes to your code base. For +> details see the sections on [backwards incompatibilities](/usage/v3#incompat) +> and [migrating](/usage/v3#migrating). Also remember to download the new +> trained pipelines, and retrain your own pipelines. When updating to a newer version of spaCy, it's generally recommended to start with a clean virtual environment. If you're upgrading to a new major version, -make sure you have the latest **compatible models** installed, and that there -are no old shortcut links or incompatible model packages left over in your -environment, as this can often lead to unexpected results and errors. If you've -trained your own models, keep in mind that your train and runtime inputs must -match. This means you'll have to **retrain your models** with the new version. +make sure you have the latest **compatible trained pipelines** installed, and +that there are no old and incompatible packages left over in your environment, +as this can often lead to unexpected results and errors. If you've trained your +own models, keep in mind that your train and runtime inputs must match. 
This +means you'll have to **retrain your pipelines** with the new version. -As of v2.0, spaCy also provides a [`validate`](/api/cli#validate) command, which -lets you verify that all installed models are compatible with your spaCy -version. If incompatible models are found, tips and installation instructions -are printed. The command is also useful to detect out-of-sync model links -resulting from links created in different virtual environments. It's recommended -to run the command with `python -m` to make sure you're executing the correct -version of spaCy. +spaCy also provides a [`validate`](/api/cli#validate) command, which lets you +verify that all installed pipeline packages are compatible with your spaCy +version. If incompatible packages are found, tips and installation instructions +are printed. It's recommended to run the command with `python -m` to make sure +you're executing the correct version of spaCy. -```bash -pip install -U spacy -python -m spacy validate +```cli +$ pip install -U %%SPACY_PKG_NAME%%SPACY_PKG_FLAGS +$ python -m spacy validate ``` ### Run spaCy with GPU {#gpu new="2.0.14"} As of v2.0, spaCy comes with neural network models that are implemented in our -machine learning library, [Thinc](https://github.com/explosion/thinc). For GPU -support, we've been grateful to use the work of Chainer's -[CuPy](https://cupy.chainer.org) module, which provides a numpy-compatible -interface for GPU arrays. +machine learning library, [Thinc](https://thinc.ai). For GPU support, we've been +grateful to use the work of Chainer's [CuPy](https://cupy.chainer.org) module, +which provides a numpy-compatible interface for GPU arrays. spaCy can be installed on GPU by specifying `spacy[cuda]`, `spacy[cuda90]`, `spacy[cuda91]`, `spacy[cuda92]`, `spacy[cuda100]`, `spacy[cuda101]` or @@ -128,14 +135,14 @@ specifier allows cupy to be installed via wheel, saving some compilation time. The specifiers should install [`cupy`](https://cupy.chainer.org). ```bash -$ pip install -U spacy[cuda92] +$ pip install -U %%SPACY_PKG_NAME[cuda92]%%SPACY_PKG_FLAGS ``` Once you have a GPU-enabled installation, the best way to activate it is to call [`spacy.prefer_gpu`](/api/top-level#spacy.prefer_gpu) or [`spacy.require_gpu()`](/api/top-level#spacy.require_gpu) somewhere in your -script before any models have been loaded. `require_gpu` will raise an error if -no GPU is available. +script before any pipelines have been loaded. `require_gpu` will raise an error +if no GPU is available. ```python import spacy @@ -158,80 +165,93 @@ system. See notes on [Ubuntu](#source-ubuntu), [macOS / OS X](#source-osx) and [Windows](#source-windows) for details. 
```bash -python -m pip install -U pip # update pip -git clone https://github.com/explosion/spaCy # clone spaCy -cd spaCy # navigate into directory +$ python -m pip install -U pip # update pip +$ git clone https://github.com/explosion/spaCy # clone spaCy +$ cd spaCy # navigate into dir -python -m venv .env # create environment in .env -source .env/bin/activate # activate virtual environment -\export PYTHONPATH=`pwd` # set Python path to spaCy directory -pip install -r requirements.txt # install all requirements -python setup.py build_ext --inplace # compile spaCy +$ python -m venv .env # create environment in .env +$ source .env/bin/activate # activate virtual env +$ export PYTHONPATH=`pwd` # set Python path to spaCy dir +$ pip install -r requirements.txt # install all requirements +$ python setup.py build_ext --inplace # compile spaCy ``` Compared to regular install via pip, the -[`requirements.txt`](https://github.com/explosion/spaCy/tree/master/requirements.txt) -additionally installs developer dependencies such as Cython. See the the -[quickstart widget](#quickstart) to get the right commands for your platform and -Python version. +[`requirements.txt`](%%GITHUB_SPACY/requirements.txt) additionally installs +developer dependencies such as Cython. See the [quickstart widget](#quickstart) +to get the right commands for your platform and Python version. -#### Ubuntu {#source-ubuntu} + -Install system-level dependencies via `apt-get`: +- **Ubuntu:** Install system-level dependencies via `apt-get`: + `sudo apt-get install build-essential python-dev git` +- **macOS / OS X:** Install a recent version of + [XCode](https://developer.apple.com/xcode/), including the so-called "Command + Line Tools". macOS and OS X ship with Python and Git preinstalled. +- **Windows:** Install a version of the + [Visual C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/) + or + [Visual Studio Express](https://www.visualstudio.com/vs/visual-studio-express/) + that matches the version that was used to compile your Python interpreter. + +### Building an executable {#executable} + +The spaCy repository includes a [`Makefile`](%%GITHUB_SPACY/Makefile) that +builds an executable zip file using [`pex`](https://github.com/pantsbuild/pex) +(**P**ython **Ex**ecutable). The executable includes spaCy and all its package +dependencies and only requires the system Python at runtime. Building an +executable `.pex` file is often the most convenient way to deploy spaCy, as it +lets you separate the build from the deployment process. + +> #### Usage +> +> To use a `.pex` file, just replace `python` with the path to the file when you +> execute your code or CLI commands. This is equivalent to running Python in a +> virtual environment with spaCy installed. +> +> ```bash +> $ ./spacy.pex my_script.py +> $ ./spacy.pex -m spacy info +> ``` ```bash -$ sudo apt-get install build-essential python-dev git +$ git clone https://github.com/explosion/spaCy +$ cd spaCy +$ make ``` -#### macOS / OS X {#source-osx} +You can configure the build process with the following environment variables: -Install a recent version of [XCode](https://developer.apple.com/xcode/), -including the so-called "Command Line Tools". macOS and OS X ship with Python -and git preinstalled. 
- -#### Windows {#source-windows} - -Install a version of the -[Visual C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/) -or -[Visual Studio Express](https://www.visualstudio.com/vs/visual-studio-express/) -that matches the version that was used to compile your Python interpreter. For -official distributions these are: - -| Distribution | Version | -| ------------ | ------------------ | -| Python 2.7 | Visual Studio 2008 | -| Python 3.4 | Visual Studio 2010 | -| Python 3.5+ | Visual Studio 2015 | +| Variable | Description | +| -------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `SPACY_EXTRAS` | Additional Python packages to install alongside spaCy with optional version specifications. Should be a string that can be passed to `pip install`. See [`Makefile`](%%GITHUB_SPACY/Makefile) for defaults. | +| `PYVER` | The Python version to build against. This version needs to be available on your build and runtime machines. Defaults to `3.6`. | +| `WHEELHOUSE` | Directory to store the wheel files during compilation. Defaults to `./wheelhouse`. | ### Run tests {#run-tests} -spaCy comes with an -[extensive test suite](https://github.com/explosion/spaCy/tree/master/spacy/tests). -In order to run the tests, you'll usually want to clone the -[repository](https://github.com/explosion/spaCy/tree/master/) and -[build spaCy from source](#source). This will also install the required +spaCy comes with an [extensive test suite](%%GITHUB_SPACY/spacy/tests). In order +to run the tests, you'll usually want to clone the [repository](%%GITHUB_SPACY) +and [build spaCy from source](#source). This will also install the required development dependencies and test utilities defined in the `requirements.txt`. Alternatively, you can find out where spaCy is installed and run `pytest` on that directory. Don't forget to also install the test utilities via spaCy's -[`requirements.txt`](https://github.com/explosion/spaCy/tree/master/requirements.txt): +[`requirements.txt`](%%GITHUB_SPACY/requirements.txt): ```bash -python -c "import os; import spacy; print(os.path.dirname(spacy.__file__))" -pip install -r path/to/requirements.txt -python -m pytest [spacy directory] +$ python -c "import os; import spacy; print(os.path.dirname(spacy.__file__))" +$ pip install -r path/to/requirements.txt +$ python -m pytest [spacy directory] ``` Calling `pytest` on the spaCy directory will run only the basic tests. The flag `--slow` is optional and enables additional tests that take longer. ```bash -# make sure you are using recent pytest version -python -m pip install -U pytest - -python -m pytest [spacy directory] # basic tests -python -m pytest [spacy directory] --slow # basic and slow tests +$ python -m pip install -U pytest # update pytest +$ python -m pytest [spacy directory] # basic tests +$ python -m pytest [spacy directory] --slow # basic and slow tests ``` ## Troubleshooting guide {#troubleshooting} @@ -249,46 +269,28 @@ installing, loading and using spaCy, as well as their solutions. ``` -No compatible model found for [lang] (spaCy vX.X.X). +No compatible package found for [lang] (spaCy vX.X.X). ``` -This usually means that the model you're trying to download does not exist, or -isn't available for your version of spaCy. 
Check the +This usually means that the trained pipeline you're trying to download does not +exist, or isn't available for your version of spaCy. Check the [compatibility table](https://github.com/explosion/spacy-models/tree/master/compatibility.json) -to see which models are available for your spaCy version. If you're using an old -version, consider upgrading to the latest release. Note that while spaCy +to see which packages are available for your spaCy version. If you're using an +old version, consider upgrading to the latest release. Note that while spaCy supports tokenization for [a variety of languages](/usage/models#languages), not -all of them come with statistical models. To only use the tokenizer, import the +all of them come with trained pipelines. To only use the tokenizer, import the language's `Language` class instead, for example `from spacy.lang.fr import French`. - - -``` -OSError: symbolic link privilege not held -``` - -To create [shortcut links](/usage/models#usage) that let you load models by -name, spaCy creates a symbolic link in the `spacy/data` directory. This means -your user needs permission to do this. The above error mostly occurs when doing -a system-wide installation, which will create the symlinks in a system -directory. Run the `download` or `link` command as administrator (on Windows, -you can either right-click on your terminal or shell and select "Run as -Administrator"), set the `--user` flag when installing a model or use a virtual -environment to install spaCy in a user directory, instead of doing a system-wide -installation. - - - ``` no such option: --no-cache-dir ``` -The `download` command uses pip to install the models and sets the +The `download` command uses pip to install the pipeline packages and sets the `--no-cache-dir` flag to prevent it from requiring too much memory. [This setting](https://pip.pypa.io/en/stable/reference/pip_install/#caching) requires pip v6.0 or newer. Run `pip install -U pip` to upgrade to the latest @@ -310,7 +312,7 @@ only 65535 in a narrow unicode build. You can check this by running the following command: ```bash -python -c "import sys; print(sys.maxunicode)" +$ python -c "import sys; print(sys.maxunicode)" ``` If you're running a narrow unicode build, reinstall Python and use a wide @@ -332,8 +334,8 @@ run `source ~/.bash_profile` or `source ~/.zshrc`. Make sure to add **both lines** for `LC_ALL` and `LANG`. ```bash -\export LC_ALL=en_US.UTF-8 -\export LANG=en_US.UTF-8 +$ export LC_ALL=en_US.UTF-8 +$ export LANG=en_US.UTF-8 ``` @@ -352,21 +354,19 @@ also run `which python` to find out where your Python executable is located.
- + ``` ImportError: No module named 'en_core_web_sm' ``` -As of spaCy v1.7, all models can be installed as Python packages. This means -that they'll become importable modules of your application. When creating -[shortcut links](/usage/models#usage), spaCy will also try to import the model -to load its meta data. If this fails, it's usually a sign that the package is -not installed in the current environment. Run `pip list` or `pip freeze` to -check which model packages you have installed, and install the -[correct models](/models) if necessary. If you're importing a model manually at -the top of a file, make sure to use the name of the package, not the shortcut -link you've created. +As of spaCy v1.7, all trained pipelines can be installed as Python packages. +This means that they'll become importable modules of your application. If this +fails, it's usually a sign that the package is not installed in the current +environment. Run `pip list` or `pip freeze` to check which pipeline packages you +have installed, and install the [correct package](/models) if necessary. If +you're importing a package manually at the top of a file, make sure to use the +full name of the package. @@ -380,7 +380,7 @@ This error may occur when running the `spacy` command from the command line. spaCy does not currently add an entry to your `PATH` environment variable, as this can lead to unexpected results, especially when using a virtual environment. Instead, spaCy adds an auto-alias that maps `spacy` to -`python -m spacy]`. If this is not working as expected, run the command with +`python -m spacy`. If this is not working as expected, run the command with `python -m`, yourself – for example `python -m spacy download en_core_web_sm`. For more info on this, see the [`download`](/api/cli#download) command. @@ -399,24 +399,7 @@ from is called `spacy`. So, when using spaCy, never call anything else `spacy`. - - -```python -doc = nlp("They are") -print(doc[0].lemma_) -# -PRON- -``` - -This is in fact expected behavior and not a bug. Unlike verbs and common nouns, -there's no clear base form of a personal pronoun. Should the lemma of "me" be -"I", or should we normalize person as well, giving "it" — or maybe "he"? spaCy's -solution is to introduce a novel symbol, `-PRON-`, which is used as the lemma -for all personal pronouns. For more info on this, see the -[lemmatization specs](/api/annotation#lemmatization). - - - - + If your training data only contained new entities and you didn't mix in any examples the model previously recognized, it can cause the model to "forget" @@ -444,8 +427,8 @@ disk has some binary files that should not go through this conversion. When they do, you get the error above. You can fix it by either changing your [`core.autocrlf`](https://git-scm.com/book/en/v2/Customizing-Git-Git-Configuration) setting to `"false"`, or by committing a -[`.gitattributes`](https://git-scm.com/docs/gitattributes) file] to your -repository to tell git on which files or folders it shouldn't do LF-to-CRLF +[`.gitattributes`](https://git-scm.com/docs/gitattributes) file to your +repository to tell Git on which files or folders it shouldn't do LF-to-CRLF conversion, with an entry like `path/to/spacy/model/** -text`. After you've done either of these, clone your repository again. 
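As a concrete sketch of both fixes – using the placeholder `path/to/spacy/model/**` pattern from the entry above – the commands could look like this:

```bash
# Option 1: turn off LF-to-CRLF conversion on this machine
$ git config --global core.autocrlf false

# Option 2: commit a .gitattributes entry for the pipeline directory
$ echo "path/to/spacy/model/** -text" >> .gitattributes
$ git add .gitattributes
$ git commit -m "Don't convert line endings for pipeline files"
```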
diff --git a/website/docs/usage/layers-architectures.md b/website/docs/usage/layers-architectures.md new file mode 100644 index 000000000..d7b2593e7 --- /dev/null +++ b/website/docs/usage/layers-architectures.md @@ -0,0 +1,866 @@ +--- +title: Layers and Model Architectures +teaser: Power spaCy components with custom neural networks +menu: + - ['Type Signatures', 'type-sigs'] + - ['Swapping Architectures', 'swap-architectures'] + - ['PyTorch & TensorFlow', 'frameworks'] + - ['Custom Thinc Models', 'thinc'] + - ['Trainable Components', 'components'] +next: /usage/projects +--- + +> #### Example +> +> ```python +> from thinc.api import Model, chain +> +> @spacy.registry.architectures.register("model.v1") +> def build_model(width: int, classes: int) -> Model: +> tok2vec = build_tok2vec(width) +> output_layer = build_output_layer(width, classes) +> model = chain(tok2vec, output_layer) +> return model +> ``` + +A **model architecture** is a function that wires up a +[Thinc `Model`](https://thinc.ai/docs/api-model) instance. It describes the +neural network that is run internally as part of a component in a spaCy +pipeline. To define the actual architecture, you can implement your logic in +Thinc directly, or you can use Thinc as a thin wrapper around frameworks such as +PyTorch, TensorFlow and MXNet. Each `Model` can also be used as a sublayer of a +larger network, allowing you to freely combine implementations from different +frameworks into a single model. + +spaCy's built-in components require a `Model` instance to be passed to them via +the config system. To change the model architecture of an existing component, +you just need to [**update the config**](#swap-architectures) so that it refers +to a different registered function. Once the component has been created from +this config, you won't be able to change it anymore. The architecture is like a +recipe for the network, and you can't change the recipe once the dish has +already been prepared. You have to make a new one. + +```ini +### config.cfg (excerpt) +[components.tagger] +factory = "tagger" + +[components.tagger.model] +@architectures = "model.v1" +width = 512 +classes = 16 +``` + +## Type signatures {#type-sigs} + +> #### Example +> +> ```python +> from typing import List +> from thinc.api import Model, chain +> from thinc.types import Floats2d +> def chain_model( +> tok2vec: Model[List[Doc], List[Floats2d]], +> layer1: Model[List[Floats2d], Floats2d], +> layer2: Model[Floats2d, Floats2d] +> ) -> Model[List[Doc], Floats2d]: +> model = chain(tok2vec, layer1, layer2) +> return model +> ``` + +The Thinc `Model` class is a **generic type** that can specify its input and +output types. Python uses a square-bracket notation for this, so the type +~~Model[List, Dict]~~ says that each batch of inputs to the model will be a +list, and the outputs will be a dictionary. You can be even more specific and +write for instance~~Model[List[Doc], Dict[str, float]]~~ to specify that the +model expects a list of [`Doc`](/api/doc) objects as input, and returns a +dictionary mapping of strings to floats. Some of the most common types you'll +see are: ​ + +| Type | Description | +| ------------------ | ---------------------------------------------------------------------------------------------------- | +| ~~List[Doc]~~ | A batch of [`Doc`](/api/doc) objects. Most components expect their models to take this as input. | +| ~~Floats2d~~ | A two-dimensional `numpy` or `cupy` array of floats. Usually 32-bit. 
| +| ~~Ints2d~~ | A two-dimensional `numpy` or `cupy` array of integers. Common dtypes include uint64, int32 and int8. | +| ~~List[Floats2d]~~ | A list of two-dimensional arrays, generally with one array per `Doc` and one row per token. | +| ~~Ragged~~ | A container to handle variable-length sequence data in an unpadded contiguous array. | +| ~~Padded~~ | A container to handle variable-length sequence data in a padded contiguous array. | + +See the [Thinc type reference](https://thinc.ai/docs/api-types) for details. The +model type signatures help you figure out which model architectures and +components can **fit together**. For instance, the +[`TextCategorizer`](/api/textcategorizer) class expects a model typed +~~Model[List[Doc], Floats2d]~~, because the model will predict one row of +category probabilities per [`Doc`](/api/doc). In contrast, the +[`Tagger`](/api/tagger) class expects a model typed ~~Model[List[Doc], +List[Floats2d]]~~, because it needs to predict one row of probabilities per +token. + +There's no guarantee that two models with the same type signature can be used +interchangeably. There are many other ways they could be incompatible. However, +if the types don't match, they almost surely _won't_ be compatible. This little +bit of validation goes a long way, especially if you +[configure your editor](https://thinc.ai/docs/usage-type-checking) or other +tools to highlight these errors early. The config file is also validated at the +beginning of training, to verify that all the types match correctly. + + + +If you're using a modern editor like Visual Studio Code, you can +[set up `mypy`](https://thinc.ai/docs/usage-type-checking#install) with the +custom Thinc plugin and get live feedback about mismatched types as you write +code. + +[![](../images/thinc_mypy.jpg)](https://thinc.ai/docs/usage-type-checking#linting) + + + +## Swapping model architectures {#swap-architectures} + +If no model is specified for the [`TextCategorizer`](/api/textcategorizer), the +[TextCatEnsemble](/api/architectures#TextCatEnsemble) architecture is used by +default. This architecture combines a simple bag-of-words model with a neural +network, usually resulting in the most accurate results, but at the cost of +speed. The config file for this model would look something like this: + +```ini +### config.cfg (excerpt) +[components.textcat] +factory = "textcat" +labels = [] + +[components.textcat.model] +@architectures = "spacy.TextCatEnsemble.v1" +exclusive_classes = false +pretrained_vectors = null +width = 64 +conv_depth = 2 +embed_size = 2000 +window_size = 1 +ngram_size = 1 +dropout = 0 +nO = null +``` + +spaCy has two additional built-in `textcat` architectures, and you can easily +use those by swapping out the definition of the textcat's model. For instance, +to use the simple and fast bag-of-words model +[TextCatBOW](/api/architectures#TextCatBOW), you can change the config to: + +```ini +### config.cfg (excerpt) {highlight="6-10"} +[components.textcat] +factory = "textcat" +labels = [] + +[components.textcat.model] +@architectures = "spacy.TextCatBOW.v1" +exclusive_classes = false +ngram_size = 1 +no_output_layer = false +nO = null +``` + +For details on all pre-defined architectures shipped with spaCy and how to +configure them, check out the [model architectures](/api/architectures) +documentation. + +### Defining sublayers {#sublayers} + +Model architecture functions often accept **sublayers as arguments**, so that +you can try **substituting a different layer** into the network. 
Depending on +how the architecture function is structured, you might be able to define your +network structure entirely through the [config system](/usage/training#config), +using layers that have already been defined. ​ + +In most neural network models for NLP, the most important parts of the network +are what we refer to as the +[embed and encode](https://explosion.ai/blog/deep-learning-formula-nlp) steps. +These steps together compute dense, context-sensitive representations of the +tokens, and their combination forms a typical +[`Tok2Vec`](/api/architectures#Tok2Vec) layer: + +```ini +### config.cfg (excerpt) +[components.tok2vec] +factory = "tok2vec" + +[components.tok2vec.model] +@architectures = "spacy.Tok2Vec.v1" + +[components.tok2vec.model.embed] +@architectures = "spacy.MultiHashEmbed.v1" +# ... + +[components.tok2vec.model.encode] +@architectures = "spacy.MaxoutWindowEncoder.v1" +# ... +``` + +By defining these sublayers specifically, it becomes straightforward to swap out +a sublayer for another one, for instance changing the first sublayer to a +character embedding with the [CharacterEmbed](/api/architectures#CharacterEmbed) +architecture: + +```ini +### config.cfg (excerpt) +[components.tok2vec.model.embed] +@architectures = "spacy.CharacterEmbed.v1" +# ... + +[components.tok2vec.model.encode] +@architectures = "spacy.MaxoutWindowEncoder.v1" +# ... +``` + +Most of spaCy's default architectures accept a `tok2vec` layer as a sublayer +within the larger task-specific neural network. This makes it easy to **switch +between** transformer, CNN, BiLSTM or other feature extraction approaches. The +[transformers documentation](/usage/embeddings-transformers#training-custom-model) +section shows an example of swapping out a model's standard `tok2vec` layer with +a transformer. And if you want to define your own solution, all you need to do +is register a ~~Model[List[Doc], List[Floats2d]]~~ architecture function, and +you'll be able to try it out in any of the spaCy components. ​ + +## Wrapping PyTorch, TensorFlow and other frameworks {#frameworks} + +Thinc allows you to [wrap models](https://thinc.ai/docs/usage-frameworks) +written in other machine learning frameworks like PyTorch, TensorFlow and MXNet +using a unified [`Model`](https://thinc.ai/docs/api-model) API. This makes it +easy to use a model implemented in a different framework to power a component in +your spaCy pipeline. For example, to wrap a PyTorch model as a Thinc `Model`, +you can use Thinc's +[`PyTorchWrapper`](https://thinc.ai/docs/api-layers#pytorchwrapper): + +```python +from thinc.api import PyTorchWrapper + +wrapped_pt_model = PyTorchWrapper(torch_model) +``` + +Let's use PyTorch to define a very simple neural network consisting of two +hidden `Linear` layers with `ReLU` activation and dropout, and a +softmax-activated output layer: + +```python +### PyTorch model +from torch import nn + +torch_model = nn.Sequential( + nn.Linear(width, hidden_width), + nn.ReLU(), + nn.Dropout2d(dropout), + nn.Linear(hidden_width, nO), + nn.ReLU(), + nn.Dropout2d(dropout), + nn.Softmax(dim=1) +) +``` + +The resulting wrapped `Model` can be used as a **custom architecture** as such, +or can be a **subcomponent of a larger model**. For instance, we can use Thinc's +[`chain`](https://thinc.ai/docs/api-layers#chain) combinator, which works like +`Sequential` in PyTorch, to combine the wrapped model with other components in a +larger network. 
This effectively means that you can easily wrap different +components from different frameworks, and "glue" them together with Thinc: + +```python +from thinc.api import chain, with_array, PyTorchWrapper +from spacy.ml import CharacterEmbed + +wrapped_pt_model = PyTorchWrapper(torch_model) +char_embed = CharacterEmbed(width, embed_size, nM, nC) +model = chain(char_embed, with_array(wrapped_pt_model)) +``` + +In the above example, we have combined our custom PyTorch model with a character +embedding layer defined by spaCy. +[CharacterEmbed](/api/architectures#CharacterEmbed) returns a `Model` that takes +a ~~List[Doc]~~ as input, and outputs a ~~List[Floats2d]~~. To make sure that +the wrapped PyTorch model receives valid inputs, we use Thinc's +[`with_array`](https://thinc.ai/docs/api-layers#with_array) helper. + +You could also implement a model that only uses PyTorch for the transformer +layers, and "native" Thinc layers to do fiddly input and output transformations +and add on task-specific "heads", as efficiency is less of a consideration for +those parts of the network. + +### Using wrapped models {#frameworks-usage} + +To use our custom model including the PyTorch subnetwork, all we need to do is +register the architecture using the +[`architectures` registry](/api/top-level#registry). This assigns the +architecture a name so spaCy knows how to find it, and allows passing in +arguments like hyperparameters via the [config](/usage/training#config). The +full example then becomes: + +```python +### Registering the architecture {highlight="9"} +from typing import List +from thinc.types import Floats2d +from thinc.api import Model, PyTorchWrapper, chain, with_array +import spacy +from spacy.tokens.doc import Doc +from spacy.ml import CharacterEmbed +from torch import nn + +@spacy.registry.architectures("CustomTorchModel.v1") +def create_torch_model( + nO: int, + width: int, + hidden_width: int, + embed_size: int, + nM: int, + nC: int, + dropout: float, +) -> Model[List[Doc], List[Floats2d]]: + char_embed = CharacterEmbed(width, embed_size, nM, nC) + torch_model = nn.Sequential( + nn.Linear(width, hidden_width), + nn.ReLU(), + nn.Dropout2d(dropout), + nn.Linear(hidden_width, nO), + nn.ReLU(), + nn.Dropout2d(dropout), + nn.Softmax(dim=1) + ) + wrapped_pt_model = PyTorchWrapper(torch_model) + model = chain(char_embed, with_array(wrapped_pt_model)) + return model +``` + +The model definition can now be used in any existing trainable spaCy component, +by specifying it in the config file. In this configuration, all required +parameters for the various subcomponents of the custom architecture are passed +in as settings via the config. + +```ini +### config.cfg (excerpt) {highlight="5-5"} +[components.tagger] +factory = "tagger" + +[components.tagger.model] +@architectures = "CustomTorchModel.v1" +nO = 50 +width = 96 +hidden_width = 48 +embed_size = 2000 +nM = 64 +nC = 8 +dropout = 0.2 +``` + + + +Remember that it is best not to rely on any (hidden) default values to ensure +that training configs are complete and experiments fully reproducible. + + + +Note that when using a PyTorch or Tensorflow model, it is recommended to set the +GPU memory allocator accordingly. When `gpu_allocator` is set to "pytorch" or +"tensorflow" in the training config, cupy will allocate memory via those +respective libraries, preventing OOM errors when there's available memory +sitting in the other library's pool. 
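In the training config this is a single setting, shown in the excerpt below. If you're running spaCy directly in a script or notebook instead, a rough equivalent – assuming Thinc v8's `set_gpu_allocator` and `require_gpu` helpers – looks like this:

```python
from thinc.api import set_gpu_allocator, require_gpu

# Route cupy's GPU memory allocations through PyTorch's allocator
set_gpu_allocator("pytorch")
require_gpu(0)  # use the first GPU
```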
+ +```ini +### config.cfg (excerpt) +[training] +gpu_allocator = "pytorch" +``` + +## Custom models with Thinc {#thinc} + +Of course it's also possible to define the `Model` from the previous section +entirely in Thinc. The Thinc documentation provides details on the +[various layers](https://thinc.ai/docs/api-layers) and helper functions +available. Combinators can be used to +[overload operators](https://thinc.ai/docs/usage-models#operators) and a common +usage pattern is to bind `chain` to `>>`. The "native" Thinc version of our +simple neural network would then become: + +```python +from thinc.api import chain, with_array, Model, Relu, Dropout, Softmax +from spacy.ml import CharacterEmbed + +char_embed = CharacterEmbed(width, embed_size, nM, nC) +with Model.define_operators({">>": chain}): + layers = ( + Relu(hidden_width, width) + >> Dropout(dropout) + >> Relu(hidden_width, hidden_width) + >> Dropout(dropout) + >> Softmax(nO, hidden_width) + ) + model = char_embed >> with_array(layers) +``` + + + +Note that Thinc layers define the output dimension (`nO`) as the first argument, +followed (optionally) by the input dimension (`nI`). This is in contrast to how +the PyTorch layers are defined, where `in_features` precedes `out_features`. + + + +### Shape inference in Thinc {#thinc-shape-inference} + +It is **not** strictly necessary to define all the input and output dimensions +for each layer, as Thinc can perform +[shape inference](https://thinc.ai/docs/usage-models#validation) between +sequential layers by matching up the output dimensionality of one layer to the +input dimensionality of the next. This means that we can simplify the `layers` +definition: + +> #### Diff +> +> ```diff +> layers = ( +> Relu(hidden_width, width) +> >> Dropout(dropout) +> - >> Relu(hidden_width, hidden_width) +> + >> Relu(hidden_width) +> >> Dropout(dropout) +> - >> Softmax(nO, hidden_width) +> + >> Softmax(nO) +> ) +> ``` + +```python +with Model.define_operators({">>": chain}): + layers = ( + Relu(hidden_width, width) + >> Dropout(dropout) + >> Relu(hidden_width) + >> Dropout(dropout) + >> Softmax(nO) + ) +``` + +Thinc can even go one step further and **deduce the correct input dimension** of +the first layer, and output dimension of the last. To enable this functionality, +you have to call +[`Model.initialize`](https://thinc.ai/docs/api-model#initialize) with an **input +sample** `X` and an **output sample** `Y` with the correct dimensions: + +```python +### Shape inference with initialization {highlight="3,7,10"} +with Model.define_operators({">>": chain}): + layers = ( + Relu(hidden_width) + >> Dropout(dropout) + >> Relu(hidden_width) + >> Dropout(dropout) + >> Softmax() + ) + model = char_embed >> with_array(layers) + model.initialize(X=input_sample, Y=output_sample) +``` + +The built-in [pipeline components](/usage/processing-pipelines) in spaCy ensure +that their internal models are **always initialized** with appropriate sample +data. In this case, `X` is typically a ~~List[Doc]~~, while `Y` is typically a +~~List[Array1d]~~ or ~~List[Array2d]~~, depending on the specific task. This +functionality is triggered when [`nlp.initialize`](/api/language#initialize) is +called. + +### Dropout and normalization in Thinc {#thinc-dropout-norm} + +Many of the available Thinc [layers](https://thinc.ai/docs/api-layers) allow you +to define a `dropout` argument that will result in "chaining" an additional +[`Dropout`](https://thinc.ai/docs/api-layers#dropout) layer. 
Optionally, you can +often specify whether or not you want to add layer normalization, which would +result in an additional +[`LayerNorm`](https://thinc.ai/docs/api-layers#layernorm) layer. That means that +the following `layers` definition is equivalent to the previous: + +```python +with Model.define_operators({">>": chain}): + layers = ( + Relu(hidden_width, dropout=dropout, normalize=False) + >> Relu(hidden_width, dropout=dropout, normalize=False) + >> Softmax() + ) + model = char_embed >> with_array(layers) + model.initialize(X=input_sample, Y=output_sample) +``` + +## Create new trainable components {#components} + +In addition to [swapping out](#swap-architectures) default models in built-in +components, you can also implement an entirely new, +[trainable](/usage/processing-pipelines#trainable-components) pipeline component +from scratch. This can be done by creating a new class inheriting from +[`TrainablePipe`](/api/pipe), and linking it up to your custom model +implementation. + + + +For details on how to implement pipeline components, check out the usage guide +on [custom components](/usage/processing-pipelines#custom-component) and the +overview of the `TrainablePipe` methods used by +[trainable components](/usage/processing-pipelines#trainable-components). + + + +### Example: Entity relation extraction component {#component-rel} + +This section outlines an example use-case of implementing a **novel relation +extraction component** from scratch. We'll implement a binary relation +extraction method that determines whether or not **two entities** in a document +are related, and if so, what type of relation. We'll allow multiple types of +relations between two such entities (multi-label setting). There are two major +steps required: + +1. Implement a [machine learning model](#component-rel-model) specific to this + task. It will have to extract candidates from a [`Doc`](/api/doc) and predict + a relation for the available candidate pairs. +2. Implement a custom [pipeline component](#component-rel-pipe) powered by the + machine learning model that sets annotations on the [`Doc`](/api/doc) passing + through the pipeline. + + + +#### Step 1: Implementing the Model {#component-rel-model} + +We need to implement a [`Model`](https://thinc.ai/docs/api-model) that takes a +**list of documents** (~~List[Doc]~~) as input, and outputs a **two-dimensional +matrix** (~~Floats2d~~) of predictions: + +> #### Model type annotations +> +> The `Model` class is a generic type that can specify its input and output +> types, e.g. ~~Model[List[Doc], Floats2d]~~. Type hints are used for static +> type checks and validation. See the section on [type signatures](#type-sigs) +> for details. + +```python +### Register the model architecture +@registry.architectures.register("rel_model.v1") +def create_relation_model(...) -> Model[List[Doc], Floats2d]: + model = ... # 👈 model will go here + return model +``` + +The first layer in this model will typically be an +[embedding layer](/usage/embeddings-transformers) such as a +[`Tok2Vec`](/api/tok2vec) component or a [`Transformer`](/api/transformer). This +layer is assumed to be of type ~~Model[List[Doc], List[Floats2d]]~~ as it +transforms each **document into a list of tokens**, with each token being +represented by its embedding in the vector space. + +Next, we need a method that **generates pairs of entities** that we want to +classify as being related or not. 
As these candidate pairs are typically formed +within one document, this function takes a [`Doc`](/api/doc) as input and +outputs a `List` of `Span` tuples. For instance, a very straightforward +implementation would be to just take any two entities from the same document: + +```python +### Simple candiate generation +def get_candidates(doc: Doc) -> List[Tuple[Span, Span]]: + candidates = [] + for ent1 in doc.ents: + for ent2 in doc.ents: + candidates.append((ent1, ent2)) + return candidates +``` + +But we could also refine this further by **excluding relations** of an entity +with itself, and posing a **maximum distance** (in number of tokens) between two +entities. We register this function in the +[`@misc` registry](/api/top-level#registry) so we can refer to it from the +config, and easily swap it out for any other candidate generation function. + +> #### config.cfg (excerpt) +> +> ```ini +> [model] +> @architectures = "rel_model.v1" +> +> [model.tok2vec] +> # ... +> +> [model.get_candidates] +> @misc = "rel_cand_generator.v1" +> max_length = 20 +> ``` + +```python +### Extended candidate generation {highlight="1,2,7,8"} +@registry.misc.register("rel_cand_generator.v1") +def create_candidate_indices(max_length: int) -> Callable[[Doc], List[Tuple[Span, Span]]]: + def get_candidates(doc: "Doc") -> List[Tuple[Span, Span]]: + candidates = [] + for ent1 in doc.ents: + for ent2 in doc.ents: + if ent1 != ent2: + if max_length and abs(ent2.start - ent1.start) <= max_length: + candidates.append((ent1, ent2)) + return candidates + return get_candidates +``` + +Finally, we require a method that transforms the candidate entity pairs into a +2D tensor using the specified [`Tok2Vec`](/api/tok2vec) or +[`Transformer`](/api/transformer). The resulting ~~Floats2~~ object will then be +processed by a final `output_layer` of the network. Putting all this together, +we can define our relation model in a config file as such: + +```ini +### config.cfg +[model] +@architectures = "rel_model.v1" +# ... + +[model.tok2vec] +# ... + +[model.get_candidates] +@misc = "rel_cand_generator.v1" +max_length = 20 + +[model.create_candidate_tensor] +@misc = "rel_cand_tensor.v1" + +[model.output_layer] +@architectures = "rel_output_layer.v1" +# ... +``` + + + + +When creating this model, we store the custom functions as +[attributes](https://thinc.ai/docs/api-model#properties) and the sublayers as +references, so we can access them easily: + +```python +tok2vec_layer = model.get_ref("tok2vec") +output_layer = model.get_ref("output_layer") +create_candidate_tensor = model.attrs["create_candidate_tensor"] +get_candidates = model.attrs["get_candidates"] +``` + +#### Step 2: Implementing the pipeline component {#component-rel-pipe} + +To use our new relation extraction model as part of a custom +[trainable component](/usage/processing-pipelines#trainable-components), we +create a subclass of [`TrainablePipe`](/api/pipe) that holds the model. + +![Illustration of Pipe methods](../images/trainable_component.svg) + +```python +### Pipeline component skeleton +from spacy.pipeline import TrainablePipe + +class RelationExtractor(TrainablePipe): + def __init__(self, vocab, model, name="rel"): + """Create a component instance.""" + self.model = model + self.vocab = vocab + self.name = name + + def update(self, examples, drop=0.0, set_annotations=False, sgd=None, losses=None): + """Learn from a batch of Example objects.""" + ... + + def predict(self, docs): + """Apply the model to a batch of Doc objects.""" + ... 
+ + def set_annotations(self, docs, predictions): + """Modify a batch of Doc objects using the predictions.""" + ... + + def initialize(self, get_examples, nlp=None, labels=None): + """Initialize the model before training.""" + ... + + def add_label(self, label): + """Add a label to the component.""" + ... +``` + +Before the model can be used, it needs to be +[initialized](/usage/training#initialization). This function receives a callback +to access the full **training data set**, or a representative sample. This data +set can be used to deduce all **relevant labels**. Alternatively, a list of +labels can be provided to `initialize`, or you can call +`RelationExtractor.add_label` directly. The number of labels defines the output +dimensionality of the network, and will be used to do +[shape inference](https://thinc.ai/docs/usage-models#validation) throughout the +layers of the neural network. This is triggered by calling +[`Model.initialize`](https://thinc.ai/api/model#initialize). + +```python +### The initialize method {highlight="12,18,22"} +from itertools import islice + +def initialize( + self, + get_examples: Callable[[], Iterable[Example]], + *, + nlp: Language = None, + labels: Optional[List[str]] = None, +): + if labels is not None: + for label in labels: + self.add_label(label) + else: + for example in get_examples(): + relations = example.reference._.rel + for indices, label_dict in relations.items(): + for label in label_dict.keys(): + self.add_label(label) + subbatch = list(islice(get_examples(), 10)) + doc_sample = [eg.reference for eg in subbatch] + label_sample = self._examples_to_truth(subbatch) + self.model.initialize(X=doc_sample, Y=label_sample) +``` + +The `initialize` method is triggered whenever this component is part of an `nlp` +pipeline, and [`nlp.initialize`](/api/language#initialize) is invoked. +Typically, this happens when the pipeline is set up before training in +[`spacy train`](/api/cli#training). After initialization, the pipeline component +and its internal model can be trained and used to make predictions. + +During training, the function [`update`](/api/pipe#update) is invoked which +delegates to +[`Model.begin_update`](https://thinc.ai/docs/api-model#begin_update) and a +[`get_loss`](/api/pipe#get_loss) function that **calculates the loss** for a +batch of examples, as well as the **gradient** of loss that will be used to +update the weights of the model layers. Thinc provides several +[loss functions](https://thinc.ai/docs/api-loss) that can be used for the +implementation of the `get_loss` function. + +```python +### The update method {highlight="12-14"} +def update( + self, + examples: Iterable[Example], + *, + drop: float = 0.0, + set_annotations: bool = False, + sgd: Optional[Optimizer] = None, + losses: Optional[Dict[str, float]] = None, +) -> Dict[str, float]: + ... + docs = [ex.predicted for ex in examples] + predictions, backprop = self.model.begin_update(docs) + loss, gradient = self.get_loss(examples, predictions) + backprop(gradient) + losses[self.name] += loss + ... + return losses +``` + +When the internal model is trained, the component can be used to make novel +**predictions**. The [`predict`](/api/pipe#predict) function needs to be +implemented for each subclass of `TrainablePipe`. 
In our case, we can simply +delegate to the internal model's +[predict](https://thinc.ai/docs/api-model#predict) function that takes a batch +of `Doc` objects and returns a ~~Floats2d~~ array: + +```python +### The predict method +def predict(self, docs: Iterable[Doc]) -> Floats2d: + predictions = self.model.predict(docs) + return self.model.ops.asarray(predictions) +``` + +The final method that needs to be implemented, is +[`set_annotations`](/api/pipe#set_annotations). This function takes the +predictions, and modifies the given `Doc` object in place to store them. For our +relation extraction component, we store the data as a dictionary in a custom +[extension attribute](/usage/processing-pipelines#custom-components-attributes) +`doc._.rel`. As keys, we represent the candidate pair by the **start offsets of +each entity**, as this defines an entity pair uniquely within one document. + +To interpret the scores predicted by the relation extraction model correctly, we +need to refer to the model's `get_candidates` function that defined which pairs +of entities were relevant candidates, so that the predictions can be linked to +those exact entities: + +> #### Example output +> +> ```python +> doc = nlp("Amsterdam is the capital of the Netherlands.") +> print("spans", [(e.start, e.text, e.label_) for e in doc.ents]) +> for value, rel_dict in doc._.rel.items(): +> print(f"{value}: {rel_dict}") +> +> # spans [(0, 'Amsterdam', 'LOC'), (6, 'Netherlands', 'LOC')] +> # (0, 6): {'CAPITAL_OF': 0.89, 'LOCATED_IN': 0.75, 'UNRELATED': 0.002} +> # (6, 0): {'CAPITAL_OF': 0.01, 'LOCATED_IN': 0.13, 'UNRELATED': 0.017} +> ``` + +```python +### Registering the extension attribute +from spacy.tokens import Doc +Doc.set_extension("rel", default={}) +``` + +```python +### The set_annotations method {highlight="5-6,10"} +def set_annotations(self, docs: Iterable[Doc], predictions: Floats2d): + c = 0 + get_candidates = self.model.attrs["get_candidates"] + for doc in docs: + for (e1, e2) in get_candidates(doc): + offset = (e1.start, e2.start) + if offset not in doc._.rel: + doc._.rel[offset] = {} + for j, label in enumerate(self.labels): + doc._.rel[offset][label] = predictions[c, j] + c += 1 +``` + +Under the hood, when the pipe is applied to a document, it delegates to the +`predict` and `set_annotations` methods: + +```python +### The __call__ method +def __call__(self, Doc doc): + predictions = self.predict([doc]) + self.set_annotations([doc], predictions) + return doc +``` + +Once our `TrainablePipe` subclass is fully implemented, we can +[register](/usage/processing-pipelines#custom-components-factories) the +component with the [`@Language.factory`](/api/language#factory) decorator. This +assigns it a name and lets you create the component with +[`nlp.add_pipe`](/api/language#add_pipe) and via the +[config](/usage/training#config). + +> #### config.cfg (excerpt) +> +> ```ini +> [components.relation_extractor] +> factory = "relation_extractor" +> +> [components.relation_extractor.model] +> @architectures = "rel_model.v1" +> +> [components.relation_extractor.model.tok2vec] +> # ... 
+> +> [components.relation_extractor.model.get_candidates] +> @misc = "rel_cand_generator.v1" +> max_length = 20 +> ``` + +```python +### Registering the pipeline component +from spacy.language import Language + +@Language.factory("relation_extractor") +def make_relation_extractor(nlp, name, model): + return RelationExtractor(nlp.vocab, model, name) +``` + + diff --git a/website/docs/usage/linguistic-features.md b/website/docs/usage/linguistic-features.md index 53ea2dfa6..4077cf293 100644 --- a/website/docs/usage/linguistic-features.md +++ b/website/docs/usage/linguistic-features.md @@ -3,12 +3,17 @@ title: Linguistic Features next: /usage/rule-based-matching menu: - ['POS Tagging', 'pos-tagging'] + - ['Morphology', 'morphology'] + - ['Lemmatization', 'lemmatization'] - ['Dependency Parse', 'dependency-parse'] - ['Named Entities', 'named-entities'] - ['Entity Linking', 'entity-linking'] - ['Tokenization', 'tokenization'] - ['Merging & Splitting', 'retokenization'] - ['Sentence Segmentation', 'sbd'] + - ['Vectors & Similarity', 'vectors-similarity'] + - ['Mappings & Exceptions', 'mappings-exceptions'] + - ['Language Data', 'language-data'] --- Processing raw text intelligently is difficult: most words are rare, and it's @@ -27,58 +32,188 @@ import PosDeps101 from 'usage/101/\_pos-deps.md' - + For a list of the fine-grained and coarse-grained part-of-speech tags assigned -by spaCy's models across different languages, see the -[POS tag scheme documentation](/api/annotation#pos-tagging). +by spaCy's models across different languages, see the label schemes documented +in the [models directory](/models). -### Rule-based morphology {#rule-based-morphology} +## Morphology {#morphology} Inflectional morphology is the process by which a root form of a word is modified by adding prefixes or suffixes that specify its grammatical function -but do not changes its part-of-speech. We say that a **lemma** (root form) is +but do not change its part-of-speech. We say that a **lemma** (root form) is **inflected** (modified/combined) with one or more **morphological features** to create a surface form. Here are some examples: -| Context | Surface | Lemma | POS |  Morphological Features | -| ---------------------------------------- | ------- | ----- | ---- | ---------------------------------------- | -| I was reading the paper | reading | read | verb | `VerbForm=Ger` | -| I don't watch the news, I read the paper | read | read | verb | `VerbForm=Fin`, `Mood=Ind`, `Tense=Pres` | -| I read the paper yesterday | read | read | verb | `VerbForm=Fin`, `Mood=Ind`, `Tense=Past` | +| Context | Surface | Lemma | POS |  Morphological Features | +| ---------------------------------------- | ------- | ----- | ------ | ---------------------------------------- | +| I was reading the paper | reading | read | `VERB` | `VerbForm=Ger` | +| I don't watch the news, I read the paper | read | read | `VERB` | `VerbForm=Fin`, `Mood=Ind`, `Tense=Pres` | +| I read the paper yesterday | read | read | `VERB` | `VerbForm=Fin`, `Mood=Ind`, `Tense=Past` | -English has a relatively simple morphological system, which spaCy handles using -rules that can be keyed by the token, the part-of-speech tag, or the combination -of the two. The system works as follows: +Morphological features are stored in the [`MorphAnalysis`](/api/morphanalysis) +under `Token.morph`, which allows you to access individual morphological +features. -1. 
The tokenizer consults a - [mapping table](/usage/adding-languages#tokenizer-exceptions) - `TOKENIZER_EXCEPTIONS`, which allows sequences of characters to be mapped to - multiple tokens. Each token may be assigned a part of speech and one or more - morphological features. -2. The part-of-speech tagger then assigns each token an **extended POS tag**. In - the API, these tags are known as `Token.tag`. They express the part-of-speech - (e.g. `VERB`) and some amount of morphological information, e.g. that the - verb is past tense. -3. For words whose POS is not set by a prior process, a - [mapping table](/usage/adding-languages#tag-map) `TAG_MAP` maps the tags to a - part-of-speech and a set of morphological features. -4. Finally, a **rule-based deterministic lemmatizer** maps the surface form, to - a lemma in light of the previously assigned extended part-of-speech and - morphological information, without consulting the context of the token. The - lemmatizer also accepts list-based exception files, acquired from - [WordNet](https://wordnet.princeton.edu/). +> #### 📝 Things to try +> +> 1. Change "I" to "She". You should see that the morphological features change +> and express that it's a pronoun in the third person. +> 2. Inspect `token.morph` for the other tokens. + +```python +### {executable="true"} +import spacy + +nlp = spacy.load("en_core_web_sm") +print("Pipeline:", nlp.pipe_names) +doc = nlp("I was reading the paper.") +token = doc[0] # 'I' +print(token.morph) # 'Case=Nom|Number=Sing|Person=1|PronType=Prs' +print(token.morph.get("PronType")) # ['Prs'] +``` + +### Statistical morphology {#morphologizer new="3" model="morphologizer"} + +spaCy's statistical [`Morphologizer`](/api/morphologizer) component assigns the +morphological features and coarse-grained part-of-speech tags as `Token.morph` +and `Token.pos`. + +```python +### {executable="true"} +import spacy + +nlp = spacy.load("de_core_news_sm") +doc = nlp("Wo bist du?") # English: 'Where are you?' +print(doc[2].morph) # 'Case=Nom|Number=Sing|Person=2|PronType=Prs' +print(doc[2].pos_) # 'PRON' +``` + +### Rule-based morphology {#rule-based-morphology} + +For languages with relatively simple morphological systems like English, spaCy +can assign morphological features through a rule-based approach, which uses the +**token text** and **fine-grained part-of-speech tags** to produce +coarse-grained part-of-speech tags and morphological features. + +1. The part-of-speech tagger assigns each token a **fine-grained part-of-speech + tag**. In the API, these tags are known as `Token.tag`. They express the + part-of-speech (e.g. verb) and some amount of morphological information, e.g. + that the verb is past tense (e.g. `VBD` for a past tense verb in the Penn + Treebank) . +2. For words whose coarse-grained POS is not set by a prior process, a + [mapping table](#mapping-exceptions) maps the fine-grained tags to a + coarse-grained POS tags and morphological features. + +```python +### {executable="true"} +import spacy + +nlp = spacy.load("en_core_web_sm") +doc = nlp("Where are you?") +print(doc[2].morph) # 'Case=Nom|Person=2|PronType=Prs' +print(doc[2].pos_) # 'PRON' +``` + +## Lemmatization {#lemmatization model="lemmatizer" new="3"} + +The [`Lemmatizer`](/api/lemmatizer) is a pipeline component that provides lookup +and rule-based lemmatization methods in a configurable component. An individual +language can extend the `Lemmatizer` as part of its +[language data](#language-data). 
+ +```python +### {executable="true"} +import spacy + +# English pipelines include a rule-based lemmatizer +nlp = spacy.load("en_core_web_sm") +lemmatizer = nlp.get_pipe("lemmatizer") +print(lemmatizer.mode) # 'rule' + +doc = nlp("I was reading the paper.") +print([token.lemma_ for token in doc]) +# ['I', 'be', 'read', 'the', 'paper', '.'] +``` + + + +Unlike spaCy v2, spaCy v3 models do _not_ provide lemmas by default or switch +automatically between lookup and rule-based lemmas depending on whether a tagger +is in the pipeline. To have lemmas in a `Doc`, the pipeline needs to include a +[`Lemmatizer`](/api/lemmatizer) component. The lemmatizer component is +configured to use a single mode such as `"lookup"` or `"rule"` on +initialization. The `"rule"` mode requires `Token.pos` to be set by a previous +component. + + + +The data for spaCy's lemmatizers is distributed in the package +[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). The +provided trained pipelines already include all the required tables, but if you +are creating new pipelines, you'll probably want to install `spacy-lookups-data` +to provide the data when the lemmatizer is initialized. + +### Lookup lemmatizer {#lemmatizer-lookup} + +For pipelines without a tagger or morphologizer, a lookup lemmatizer can be +added to the pipeline as long as a lookup table is provided, typically through +[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). The +lookup lemmatizer looks up the token surface form in the lookup table without +reference to the token's part-of-speech or context. + +```python +# pip install -U %%SPACY_PKG_NAME[lookups]%%SPACY_PKG_FLAGS +import spacy + +nlp = spacy.blank("sv") +nlp.add_pipe("lemmatizer", config={"mode": "lookup"}) +``` + +### Rule-based lemmatizer {#lemmatizer-rule} + +When training pipelines that include a component that assigns part-of-speech +tags (a morphologizer or a tagger with a [POS mapping](#mappings-exceptions)), a +rule-based lemmatizer can be added using rule tables from +[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data): + +```python +# pip install -U %%SPACY_PKG_NAME[lookups]%%SPACY_PKG_FLAGS +import spacy + +nlp = spacy.blank("de") +# Morphologizer (note: model is not yet trained!) +nlp.add_pipe("morphologizer") +# Rule-based lemmatizer +nlp.add_pipe("lemmatizer", config={"mode": "rule"}) +``` + +The rule-based deterministic lemmatizer maps the surface form to a lemma in +light of the previously assigned coarse-grained part-of-speech and morphological +information, without consulting the context of the token. The rule-based +lemmatizer also accepts list-based exception files. For English, these are +acquired from [WordNet](https://wordnet.princeton.edu/). ## Dependency Parsing {#dependency-parse model="parser"} spaCy features a fast and accurate syntactic dependency parser, and has a rich API for navigating the tree. The parser also powers the sentence boundary detection, and lets you iterate over base noun phrases, or "chunks". You can -check whether a [`Doc`](/api/doc) object has been parsed with the -`doc.is_parsed` attribute, which returns a boolean value. If this attribute is -`False`, the default sentence iterator will raise an exception. +check whether a [`Doc`](/api/doc) object has been parsed by calling +`doc.has_annotation("DEP")`, which checks whether the attribute `Token.dep` has +been set returns a boolean value. If the result is `False`, the default sentence +iterator will raise an exception. 
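For example, a minimal sketch of that check, comparing a trained English pipeline with a blank one:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Autonomous cars shift insurance liability toward manufacturers.")
print(doc.has_annotation("DEP"))        # True – the parser has set Token.dep

blank_doc = spacy.blank("en")("A pipeline without a parser.")
print(blank_doc.has_annotation("DEP"))  # False
```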
+ + + +For a list of the syntactic dependency labels assigned by spaCy's models across +different languages, see the label schemes documented in the +[models directory](/models). + + ### Noun chunks {#noun-chunks} @@ -153,7 +288,7 @@ import DisplaCyLong2Html from 'images/displacy-long2.html' Because the syntactic relations form a tree, every word has **exactly one head**. You can therefore iterate over the arcs in the tree by iterating over the words in the sentence. This is usually the best way to match an arc of -interest — from below: +interest – from below: ```python ### {executable="true"} @@ -229,10 +364,10 @@ sequence of tokens. You can walk up the tree with the > #### Projective vs. non-projective > -> For the [default English model](/models/en), the parse tree is **projective**, -> which means that there are no crossing brackets. The tokens returned by -> `.subtree` are therefore guaranteed to be contiguous. This is not true for the -> German model, which has many +> For the [default English pipelines](/models/en), the parse tree is +> **projective**, which means that there are no crossing brackets. The tokens +> returned by `.subtree` are therefore guaranteed to be contiguous. This is not +> true for the German pipelines, which have many > [non-projective dependencies](https://explosion.ai/blog/german-model#word-order). ```python @@ -262,7 +397,7 @@ for descendant in subject.subtree: Finally, the `.left_edge` and `.right_edge` attributes can be especially useful, because they give you the first and last token of the subtree. This is the easiest way to create a `Span` object for a syntactic phrase. Note that -`.right_edge` gives a token **within** the subtree — so if you use it as the +`.right_edge` gives a token **within** the subtree – so if you use it as the end-point of a range, don't forget to `+1`! ```python @@ -286,19 +421,53 @@ for token in doc: | their | `ADJ` | `poss` | requests | | requests | `NOUN` | `dobj` | submit | - +The dependency parse can be a useful tool for **information extraction**, +especially when combined with other predictions like +[named entities](#named-entities). The following example extracts money and +currency values, i.e. entities labeled as `MONEY`, and then uses the dependency +parse to find the noun phrase they are referring to – for example `"Net income"` +→ `"$9.4 million"`. -For a list of the syntactic dependency labels assigned by spaCy's models across -different languages, see the -[dependency label scheme documentation](/api/annotation#dependency-parsing). 
+```python +### {executable="true"} +import spacy + +nlp = spacy.load("en_core_web_sm") +# Merge noun phrases and entities for easier analysis +nlp.add_pipe("merge_entities") +nlp.add_pipe("merge_noun_chunks") + +TEXTS = [ + "Net income was $9.4 million compared to the prior year of $2.7 million.", + "Revenue exceeded twelve billion dollars, with a loss of $1b.", +] +for doc in nlp.pipe(TEXTS): + for token in doc: + if token.ent_type_ == "MONEY": + # We have an attribute and direct object, so check for subject + if token.dep_ in ("attr", "dobj"): + subj = [w for w in token.head.lefts if w.dep_ == "nsubj"] + if subj: + print(subj[0], "-->", token) + # We have a prepositional object with a preposition + elif token.dep_ == "pobj" and token.head.dep_ == "prep": + print(token.head.head, "-->", token) +``` + + + +For more examples of how to write rule-based information extraction logic that +takes advantage of the model's predictions produced by the different components, +see the usage guide on +[combining models and rules](/usage/rule-based-matching#models-rules). ### Visualizing dependencies {#displacy} The best way to understand spaCy's dependency parser is interactively. To make -this easier, spaCy v2.0+ comes with a visualization module. You can pass a `Doc` -or a list of `Doc` objects to displaCy and run +this easier, spaCy comes with a visualization module. You can pass a `Doc` or a +list of `Doc` objects to displaCy and run [`displacy.serve`](/api/top-level#displacy.serve) to run the web server, or [`displacy.render`](/api/top-level#displacy.render) to generate the raw markup. If you want to know how to write rules that hook into some type of syntactic @@ -326,45 +495,27 @@ displaCy in our [online demo](https://explosion.ai/demos/displacy).. ### Disabling the parser {#disabling} -In the [default models](/models), the parser is loaded and enabled as part of -the [standard processing pipeline](/usage/processing-pipelines). If you don't need +In the [trained pipelines](/models) provided by spaCy, the parser is loaded and +enabled by default as part of the +[standard processing pipeline](/usage/processing-pipelines). If you don't need any of the syntactic information, you should disable the parser. Disabling the parser will make spaCy load and run much faster. If you want to load the parser, but need to disable it for specific documents, you can also control its use on -the `nlp` object. +the `nlp` object. For more details, see the usage guide on +[disabling pipeline components](/usage/processing-pipelines/#disabling). ```python nlp = spacy.load("en_core_web_sm", disable=["parser"]) -nlp = English().from_disk("/model", disable=["parser"]) -doc = nlp("I don't want parsed", disable=["parser"]) ``` - - -Since spaCy v2.0 comes with better support for customizing the processing -pipeline components, the `parser` keyword argument has been replaced with -`disable`, which takes a list of -[pipeline component names](/usage/processing-pipelines). This lets you disable -both default and custom components when loading a model, or initializing a -Language class via [`from_disk`](/api/language#from_disk). 
- -```diff -+ nlp = spacy.load("en_core_web_sm", disable=["parser"]) -+ doc = nlp("I don't want parsed", disable=["parser"]) - -- nlp = spacy.load("en_core_web_sm", parser=False) -- doc = nlp("I don't want parsed", parse=False) -``` - - - ## Named Entity Recognition {#named-entities} spaCy features an extremely fast statistical entity recognition system, that -assigns labels to contiguous spans of tokens. The default model identifies a -variety of named and numeric entities, including companies, locations, -organizations and products. You can add arbitrary classes to the entity -recognition system, and update the model with new examples. +assigns labels to contiguous spans of tokens. The default +[trained pipelines](/models) can indentify a variety of named and numeric +entities, including companies, locations, organizations and products. You can +add arbitrary classes to the entity recognition system, and update the model +with new examples. ### Named Entity Recognition 101 {#named-entities-101} @@ -372,7 +523,7 @@ import NER101 from 'usage/101/\_named-entities.md' -### Accessing entity annotations {#accessing} +### Accessing entity annotations and labels {#accessing-ner} The standard way to access entity annotations is the [`doc.ents`](/api/doc#ents) property, which produces a sequence of [`Span`](/api/span) objects. The entity @@ -389,9 +540,17 @@ on a token, it will return an empty string. > #### IOB Scheme > -> - `I` – Token is inside an entity. -> - `O` – Token is outside an entity. -> - `B` – Token is the beginning of an entity. +> - `I` – Token is **inside** an entity. +> - `O` – Token is **outside** an entity. +> - `B` – Token is the **beginning** of an entity. +> +> #### BILUO Scheme +> +> - `B` – Token is the **beginning** of a multi-token entity. +> - `I` – Token is **inside** a multi-token entity. +> - `L` – Token is the **last** token of a multi-token entity. +> - `U` – Token is a single-token **unit** entity. +> - `O` – Toke is **outside** an entity. ```python ### {executable="true"} @@ -438,7 +597,7 @@ nlp = spacy.load("en_core_web_sm") doc = nlp("fb is hiring a new vice president of global policy") ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents] print('Before', ents) -# the model didn't recognise "fb" as an entity :( +# The model didn't recognize "fb" as an entity :( fb_ent = Span(doc, 0, 1, label="ORG") # create a Span for the new entity doc.ents = list(doc.ents) + [fb_ent] @@ -480,7 +639,7 @@ print("After", doc.ents) # [London] #### Setting entity annotations in Cython {#setting-cython} -Finally, you can always write to the underlying struct, if you compile a +Finally, you can always write to the underlying struct if you compile a [Cython](http://cython.org/) function. This is easy to do, and allows you to write efficient native code. @@ -509,39 +668,9 @@ responsibility for ensuring that the data is left in a consistent state. -For details on the entity types available in spaCy's pretrained models, see the -[NER annotation scheme](/api/annotation#named-entities). - - - -### Training and updating {#updating} - -To provide training examples to the entity recognizer, you'll first need to -create an instance of the [`GoldParse`](/api/goldparse) class. You can specify -your annotations in a stand-off format or as token tags. If a character offset -in your entity annotations doesn't fall on a token boundary, the `GoldParse` -class will treat that annotation as a missing value. 
This allows for more -realistic training, because the entity recognizer is allowed to learn from -examples that may feature tokenizer errors. - -```python -train_data = [ - ("Who is Chaka Khan?", [(7, 17, "PERSON")]), - ("I like London and Berlin.", [(7, 13, "LOC"), (18, 24, "LOC")]), -] -``` - -```python -doc = Doc(nlp.vocab, ["rats", "make", "good", "pets"]) -gold = GoldParse(doc, entities=["U-ANIMAL", "O", "O", "O"]) -``` - - - -For more details on **training and updating** the named entity recognizer, see -the usage guides on [training](/usage/training) or check out the runnable -[training script](https://github.com/explosion/spaCy/tree/master/examples/training/train_ner.py) -on GitHub. +For details on the entity types available in spaCy's trained pipelines, see the +"label scheme" sections of the individual models in the +[models directory](/models). @@ -551,8 +680,8 @@ The [displaCy ENT visualizer](https://explosion.ai/demos/displacy-ent) lets you explore an entity recognition model's behavior interactively. If you're training a model, it's very useful to run the visualization yourself. To help -you do that, spaCy v2.0+ comes with a visualization module. You can pass a `Doc` -or a list of `Doc` objects to displaCy and run +you do that, spaCy comes with a visualization module. You can pass a `Doc` or a +list of `Doc` objects to displaCy and run [`displacy.serve`](/api/top-level#displacy.serve) to run the web server, or [`displacy.render`](/api/top-level#displacy.render) to generate the raw markup. @@ -580,11 +709,10 @@ import DisplacyEntHtml from 'images/displacy-ent2.html' To ground the named entities into the "real world", spaCy provides functionality to perform entity linking, which resolves a textual entity to a unique identifier from a knowledge base (KB). You can create your own -[`KnowledgeBase`](/api/kb) and -[train a new Entity Linking model](/usage/training#entity-linker) using that -custom-made KB. +[`KnowledgeBase`](/api/kb) and [train](/usage/training) a new +[`EntityLinker`](/api/entitylinker) using that custom knowledge base. 
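+
+A custom knowledge base can be assembled programmatically. The snippet below is
+a minimal sketch rather than spaCy's canonical setup – the entity ID, frequency,
+alias and the 3-dimensional entity vector are placeholder values chosen purely
+for illustration:
+
+```python
+### Creating a custom knowledge base (sketch)
+import spacy
+from spacy.kb import KnowledgeBase
+
+nlp = spacy.load("en_core_web_sm")
+# The length of the entity vectors is fixed when the KB is created
+kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)
+# Register an entity under a unique identifier with a placeholder vector
+kb.add_entity(entity="Q7259", freq=100, entity_vector=[1.0, 2.0, 3.0])
+# Map the alias "Ada Lovelace" to that entity with a prior probability
+kb.add_alias(alias="Ada Lovelace", entities=["Q7259"], probabilities=[0.9])
+```
+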
-### Accessing entity identifiers {#entity-linking-accessing} +### Accessing entity identifiers {#entity-linking-accessing model="entity linking"} The annotated KB identifier is accessible as either a hash value or as a string, using the attributes `ent.kb_id` and `ent.kb_id_` of a [`Span`](/api/span) @@ -594,14 +722,14 @@ object, or the `ent_kb_id` and `ent_kb_id_` attributes of a ```python import spacy -nlp = spacy.load("my_custom_el_model") +nlp = spacy.load("my_custom_el_pipeline") doc = nlp("Ada Lovelace was born in London") -# document level +# Document level ents = [(e.text, e.label_, e.kb_id_) for e in doc.ents] print(ents) # [('Ada Lovelace', 'PERSON', 'Q7259'), ('London', 'GPE', 'Q84')] -# token level +# Token level ent_ada_0 = [doc[0].text, doc[0].ent_type_, doc[0].ent_kb_id_] ent_ada_1 = [doc[1].text, doc[1].ent_type_, doc[1].ent_kb_id_] ent_london_5 = [doc[5].text, doc[5].ent_type_, doc[5].ent_kb_id_] @@ -610,15 +738,6 @@ print(ent_ada_1) # ['Lovelace', 'PERSON', 'Q7259'] print(ent_london_5) # ['London', 'GPE', 'Q84'] ``` -| Text | ent_type\_ | ent_kb_id\_ | -| -------- | ---------- | ----------- | -| Ada | `"PERSON"` | `"Q7259"` | -| Lovelace | `"PERSON"` | `"Q7259"` | -| was | - | - | -| born | - | - | -| in | - | - | -| London | `"GPE"` | `"Q84"` | - ## Tokenization {#tokenization} Tokenization is the task of splitting a text into meaningful segments, called @@ -642,36 +761,113 @@ import Tokenization101 from 'usage/101/\_tokenization.md' -### Tokenizer data {#101-data} + + +spaCy introduces a novel tokenization algorithm that gives a better balance +between performance, ease of definition and ease of alignment into the original +string. + +After consuming a prefix or suffix, we consult the special cases again. We want +the special cases to handle things like "don't" in English, and we want the same +rule to work for "(don't)!". We do this by splitting off the open bracket, then +the exclamation, then the closed bracket, and finally matching the special case. 
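+
+For example, with a blank English pipeline, the interplay of prefixes, suffixes
+and special cases should produce the following tokens (a quick sketch for
+illustration):
+
+```python
+import spacy
+
+nlp = spacy.blank("en")
+# "(" is split off as a prefix, "!" and ")" as suffixes, and the remaining
+# "don't" is then matched against the special cases
+print([t.text for t in nlp("(don't)!")])
+# ['(', 'do', "n't", ')', '!']
+```
+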
+Here's an implementation of the algorithm in Python optimized for readability +rather than performance: + +```python +def tokenizer_pseudo_code( + special_cases, + prefix_search, + suffix_search, + infix_finditer, + token_match, + url_match +): + tokens = [] + for substring in text.split(): + suffixes = [] + while substring: + while prefix_search(substring) or suffix_search(substring): + if token_match(substring): + tokens.append(substring) + substring = "" + break + if substring in special_cases: + tokens.extend(special_cases[substring]) + substring = "" + break + if prefix_search(substring): + split = prefix_search(substring).end() + tokens.append(substring[:split]) + substring = substring[split:] + if substring in special_cases: + continue + if suffix_search(substring): + split = suffix_search(substring).start() + suffixes.append(substring[split:]) + substring = substring[:split] + if token_match(substring): + tokens.append(substring) + substring = "" + elif url_match(substring): + tokens.append(substring) + substring = "" + elif substring in special_cases: + tokens.extend(special_cases[substring]) + substring = "" + elif list(infix_finditer(substring)): + infixes = infix_finditer(substring) + offset = 0 + for match in infixes: + tokens.append(substring[offset : match.start()]) + tokens.append(substring[match.start() : match.end()]) + offset = match.end() + if substring[offset:]: + tokens.append(substring[offset:]) + substring = "" + elif substring: + tokens.append(substring) + substring = "" + tokens.extend(reversed(suffixes)) + return tokens +``` + +The algorithm can be summarized as follows: + +1. Iterate over whitespace-separated substrings. +2. Look for a token match. If there is a match, stop processing and keep this + token. +3. Check whether we have an explicitly defined special case for this substring. + If we do, use it. +4. Otherwise, try to consume one prefix. If we consumed a prefix, go back to #2, + so that the token match and special cases always get priority. +5. If we didn't consume a prefix, try to consume a suffix and then go back to + #2. +6. If we can't consume a prefix or a suffix, look for a URL match. +7. If there's no URL match, then look for a special case. +8. Look for "infixes" – stuff like hyphens etc. and split the substring into + tokens on all infixes. +9. Once we can't consume any more of the string, handle it as a single token. + + **Global** and **language-specific** tokenizer data is supplied via the language -data in -[`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang). The -tokenizer exceptions define special cases like "don't" in English, which needs -to be split into two tokens: `{ORTH: "do"}` and `{ORTH: "n't", NORM: "not"}`. -The prefixes, suffixes and infixes mostly define punctuation rules – for -example, when to split off periods (at the end of a sentence), and when to leave -tokens containing periods intact (abbreviations like "U.S."). - -![Language data architecture](../images/language_data.svg) - - - -For more details on the language-specific data, see the usage guide on -[adding languages](/usage/adding-languages). - - +data in [`spacy/lang`](%%GITHUB_SPACY/spacy/lang). The tokenizer exceptions +define special cases like "don't" in English, which needs to be split into two +tokens: `{ORTH: "do"}` and `{ORTH: "n't", NORM: "not"}`. 
The prefixes, suffixes +and infixes mostly define punctuation rules – for example, when to split off +periods (at the end of a sentence), and when to leave tokens containing periods +intact (abbreviations like "U.S."). Tokenization rules that are specific to one language, but can be **generalized -across that language** should ideally live in the language data in -[`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang) – we -always appreciate pull requests! Anything that's specific to a domain or text -type – like financial trading abbreviations, or Bavarian youth slang – should be -added as a special case rule to your tokenizer instance. If you're dealing with -a lot of customizations, it might make sense to create an entirely custom -subclass. +across that language**, should ideally live in the language data in +[`spacy/lang`](%%GITHUB_SPACY/spacy/lang) – we always appreciate pull requests! +Anything that's specific to a domain or text type – like financial trading +abbreviations or Bavarian youth slang – should be added as a special case rule +to your tokenizer instance. If you're dealing with a lot of customizations, it +might make sense to create an entirely custom subclass. @@ -703,102 +899,17 @@ print([w.text for w in nlp("gimme that")]) # ['gim', 'me', 'that'] The special case doesn't have to match an entire whitespace-delimited substring. The tokenizer will incrementally split off punctuation, and keep looking up the -remaining substring: +remaining substring. The special case rules also have precedence over the +punctuation splitting. ```python assert "gimme" not in [w.text for w in nlp("gimme!")] assert "gimme" not in [w.text for w in nlp('("...gimme...?")')] -``` -The special case rules have precedence over the punctuation splitting: - -```python nlp.tokenizer.add_special_case("...gimme...?", [{"ORTH": "...gimme...?"}]) assert len(nlp("...gimme...?")) == 1 ``` -### How spaCy's tokenizer works {#how-tokenizer-works} - -spaCy introduces a novel tokenization algorithm, that gives a better balance -between performance, ease of definition, and ease of alignment into the original -string. - -After consuming a prefix or suffix, we consult the special cases again. We want -the special cases to handle things like "don't" in English, and we want the same -rule to work for "(don't)!". We do this by splitting off the open bracket, then -the exclamation, then the close bracket, and finally matching the special case. 
-Here's an implementation of the algorithm in Python, optimized for readability -rather than performance: - -```python -def tokenizer_pseudo_code(self, special_cases, prefix_search, suffix_search, - infix_finditer, token_match, url_match): - tokens = [] - for substring in text.split(): - suffixes = [] - while substring: - while prefix_search(substring) or suffix_search(substring): - if token_match(substring): - tokens.append(substring) - substring = '' - break - if substring in special_cases: - tokens.extend(special_cases[substring]) - substring = '' - break - if prefix_search(substring): - split = prefix_search(substring).end() - tokens.append(substring[:split]) - substring = substring[split:] - if substring in special_cases: - continue - if suffix_search(substring): - split = suffix_search(substring).start() - suffixes.append(substring[split:]) - substring = substring[:split] - if token_match(substring): - tokens.append(substring) - substring = '' - elif url_match(substring): - tokens.append(substring) - substring = '' - elif substring in special_cases: - tokens.extend(special_cases[substring]) - substring = '' - elif list(infix_finditer(substring)): - infixes = infix_finditer(substring) - offset = 0 - for match in infixes: - tokens.append(substring[offset : match.start()]) - tokens.append(substring[match.start() : match.end()]) - offset = match.end() - if substring[offset:]: - tokens.append(substring[offset:]) - substring = '' - elif substring: - tokens.append(substring) - substring = '' - tokens.extend(reversed(suffixes)) - return tokens -``` - -The algorithm can be summarized as follows: - -1. Iterate over whitespace-separated substrings. -2. Look for a token match. If there is a match, stop processing and keep this - token. -3. Check whether we have an explicitly defined special case for this substring. - If we do, use it. -4. Otherwise, try to consume one prefix. If we consumed a prefix, go back to - #2, so that the token match and special cases always get priority. -5. If we didn't consume a prefix, try to consume a suffix and then go back to - #2. -6. If we can't consume a prefix or a suffix, look for a URL match. -7. If there's no URL match, then look for a special case. -8. Look for "infixes" — stuff like hyphens etc. and split the substring into - tokens on all infixes. -9. Once we can't consume any more of the string, handle it as a single token. - #### Debugging the tokenizer {#tokenizer-debug new="2.2.3"} A working implementation of the pseudo-code above is available for debugging as @@ -806,6 +917,17 @@ A working implementation of the pseudo-code above is available for debugging as tuples showing which tokenizer rule or pattern was matched for each token. The tokens produced are identical to `nlp.tokenizer()` except for whitespace tokens: +> #### Expected output +> +> ``` +> " PREFIX +> Let SPECIAL-1 +> 's SPECIAL-2 +> go TOKEN +> ! SUFFIX +> " SUFFIX +> ``` + ```python ### {executable="true"} from spacy.lang.en import English @@ -817,13 +939,6 @@ tok_exp = nlp.tokenizer.explain(text) assert [t.text for t in doc if not t.is_space] == [t[1] for t in tok_exp] for t in tok_exp: print(t[1], "\\t", t[0]) - -# " PREFIX -# Let SPECIAL-1 -# 's SPECIAL-2 -# go TOKEN -# ! SUFFIX -# " SUFFIX ``` ### Customizing spaCy's Tokenizer class {#native-tokenizers} @@ -843,19 +958,6 @@ domain. There are six things you may need to define: be split, overriding the infix rules. Useful for things like numbers. 6. 
An optional boolean function `url_match`, which is similar to `token_match` except that prefixes and suffixes are removed before applying the match. - - - -In spaCy v2.2.2-v2.2.4, the `token_match` was equivalent to the `url_match` -above and there was no match pattern applied before prefixes and suffixes were -analyzed. As of spaCy v2.3.0, the `token_match` has been reverted to its -behavior in v2.2.1 and earlier with precedence over prefixes and suffixes. - -The `url_match` is introduced in v2.3.0 to handle cases like URLs where the -tokenizer should remove prefixes and suffixes (e.g., a comma at the end of a -URL) before applying the match. - - You shouldn't usually need to create a `Tokenizer` subclass. Standard usage is to use `re.compile()` to build a regular expression object, and pass its @@ -868,8 +970,8 @@ import spacy from spacy.tokenizer import Tokenizer special_cases = {":)": [{"ORTH": ":)"}]} -prefix_re = re.compile(r'''^[\[\("']''') -suffix_re = re.compile(r'''[\]\)"']$''') +prefix_re = re.compile(r'''^[\\[\\("']''') +suffix_re = re.compile(r'''[\\]\\)"']$''') infix_re = re.compile(r'''[-~]''') simple_url_re = re.compile(r'''^https?://''') @@ -936,13 +1038,15 @@ function that behaves the same way. -If you're using a statistical model, writing to the `nlp.Defaults` or -`English.Defaults` directly won't work, since the regular expressions are read -from the model and will be compiled when you load it. If you modify -`nlp.Defaults`, you'll only see the effect if you call -[`spacy.blank`](/api/top-level#spacy.blank) or `Defaults.create_tokenizer()`. If -you want to modify the tokenizer loaded from a statistical model, you should -modify `nlp.tokenizer` directly. +If you've loaded a trained pipeline, writing to the +[`nlp.Defaults`](/api/language#defaults) or `English.Defaults` directly won't +work, since the regular expressions are read from the pipeline data and will be +compiled when you load it. If you modify `nlp.Defaults`, you'll only see the +effect if you call [`spacy.blank`](/api/top-level#spacy.blank). If you want to +modify the tokenizer loaded from a trained pipeline, you should modify +`nlp.tokenizer` directly. If you're training your own pipeline, you can register +[callbacks](/usage/training/#custom-code-nlp-callbacks) to modify the `nlp` +object before training. @@ -951,7 +1055,7 @@ but also detailed regular expressions that take the surrounding context into account. For example, there is a regular expression that treats a hyphen between letters as an infix. 
If you do not want the tokenizer to split on hyphens between letters, you can modify the existing infix definition from -[`lang/punctuation.py`](https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py): +[`lang/punctuation.py`](%%GITHUB_SPACY/spacy/lang/punctuation.py): ```python ### {executable="true"} @@ -960,12 +1064,12 @@ from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS from spacy.util import compile_infix_regex -# default tokenizer +# Default tokenizer nlp = spacy.load("en_core_web_sm") doc = nlp("mother-in-law") print([t.text for t in doc]) # ['mother', '-', 'in', '-', 'law'] -# modify tokenizer infix patterns +# Modify tokenizer infix patterns infixes = ( LIST_ELLIPSES + LIST_ICONS @@ -975,8 +1079,8 @@ infixes = ( al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES ), r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA), - # EDIT: commented out regex that splits on hyphens between letters: - #r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS), + # ✅ Commented out regex that splits on hyphens between letters: + # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS), r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA), ] ) @@ -988,129 +1092,249 @@ print([t.text for t in doc]) # ['mother-in-law'] ``` For an overview of the default regular expressions, see -[`lang/punctuation.py`](https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py) -and language-specific definitions such as -[`lang/de/punctuation.py`](https://github.com/explosion/spaCy/blob/master/spacy/lang/de/punctuation.py) -for German. +[`lang/punctuation.py`](%%GITHUB_SPACY/spacy/lang/punctuation.py) and +language-specific definitions such as +[`lang/de/punctuation.py`](%%GITHUB_SPACY/spacy/lang/de/punctuation.py) for +German. -### Hooking an arbitrary tokenizer into the pipeline {#custom-tokenizer} +### Hooking a custom tokenizer into the pipeline {#custom-tokenizer} The tokenizer is the first component of the processing pipeline and the only one that can't be replaced by writing to `nlp.pipeline`. This is because it has a different signature from all the other components: it takes a text and returns a -`Doc`, whereas all other components expect to already receive a tokenized `Doc`. +[`Doc`](/api/doc), whereas all other components expect to already receive a +tokenized `Doc`. ![The processing pipeline](../images/pipeline.svg) To overwrite the existing tokenizer, you need to replace `nlp.tokenizer` with a -custom function that takes a text, and returns a `Doc`. +custom function that takes a text and returns a [`Doc`](/api/doc). + +> #### Creating a Doc +> +> Constructing a [`Doc`](/api/doc) object manually requires at least two +> arguments: the shared `Vocab` and a list of words. Optionally, you can pass in +> a list of `spaces` values indicating whether the token at this position is +> followed by a space (default `True`). See the section on +> [pre-tokenized text](#own-annotations) for more info. +> +> ```python +> words = ["Let", "'s", "go", "!"] +> spaces = [False, True, False, False] +> doc = Doc(nlp.vocab, words=words, spaces=spaces) +> ``` ```python -nlp = spacy.load("en_core_web_sm") +nlp = spacy.blank("en") nlp.tokenizer = my_tokenizer ``` -| Argument | Type | Description | -| ----------- | ------- | ------------------------- | -| `text` | unicode | The raw text to tokenize. | -| **RETURNS** | `Doc` | The tokenized document. 
| +| Argument | Type | Description | +| ----------- | ----------------- | ------------------------- | +| `text` | `str` | The raw text to tokenize. | +| **RETURNS** | [`Doc`](/api/doc) | The tokenized document. | - +#### Example 1: Basic whitespace tokenizer {#custom-tokenizer-example} -In spaCy v1.x, you had to add a custom tokenizer by passing it to the `make_doc` -keyword argument, or by passing a tokenizer "factory" to `create_make_doc`. This -was unnecessarily complicated. Since spaCy v2.0, you can write to -`nlp.tokenizer` instead. If your tokenizer needs the vocab, you can write a -function and use `nlp.vocab`. - -```diff -- nlp = spacy.load("en_core_web_sm", make_doc=my_tokenizer) -- nlp = spacy.load("en_core_web_sm", create_make_doc=my_tokenizer_factory) - -+ nlp.tokenizer = my_tokenizer -+ nlp.tokenizer = my_tokenizer_factory(nlp.vocab) -``` - - - -### Example: A custom whitespace tokenizer {#custom-tokenizer-example} - -To construct the tokenizer, we usually want attributes of the `nlp` pipeline. -Specifically, we want the tokenizer to hold a reference to the vocabulary -object. Let's say we have the following class as our tokenizer: +Here's an example of the most basic whitespace tokenizer. It takes the shared +vocab, so it can construct `Doc` objects. When it's called on a text, it returns +a `Doc` object consisting of the text split on single space characters. We can +then overwrite the `nlp.tokenizer` attribute with an instance of our custom +tokenizer. ```python ### {executable="true"} import spacy from spacy.tokens import Doc -class WhitespaceTokenizer(object): +class WhitespaceTokenizer: def __init__(self, vocab): self.vocab = vocab def __call__(self, text): - words = text.split(' ') - # All tokens 'own' a subsequent space character in this tokenizer - spaces = [True] * len(words) - return Doc(self.vocab, words=words, spaces=spaces) + words = text.split(" ") + return Doc(self.vocab, words=words) -nlp = spacy.load("en_core_web_sm") +nlp = spacy.blank("en") nlp.tokenizer = WhitespaceTokenizer(nlp.vocab) doc = nlp("What's happened to me? he thought. It wasn't a dream.") -print([t.text for t in doc]) +print([token.text for token in doc]) ``` -As you can see, we need a `Vocab` instance to construct this — but we won't have -it until we get back the loaded `nlp` object. The simplest solution is to build -the tokenizer in two steps. This also means that you can reuse the "tokenizer -factory" and initialize it with different instances of `Vocab`. +#### Example 2: Third-party tokenizers (BERT word pieces) {#custom-tokenizer-example2} -### Bringing your own annotations {#own-annotations} +You can use the same approach to plug in any other third-party tokenizers. Your +custom callable just needs to return a `Doc` object with the tokens produced by +your tokenizer. In this example, the wrapper uses the **BERT word piece +tokenizer**, provided by the +[`tokenizers`](https://github.com/huggingface/tokenizers) library. The tokens +available in the `Doc` object returned by spaCy now match the exact word pieces +produced by the tokenizer. -spaCy generally assumes by default that your data is raw text. However, +> #### 💡 Tip: spacy-transformers +> +> If you're working with transformer models like BERT, check out the +> [`spacy-transformers`](https://github.com/explosion/spacy-transformers) +> extension package and [documentation](/usage/embeddings-transformers). 
It +> includes a pipeline component for using pretrained transformer weights and +> **training transformer models** in spaCy, as well as helpful utilities for +> aligning word pieces to linguistic tokenization. + +```python +### Custom BERT word piece tokenizer +from tokenizers import BertWordPieceTokenizer +from spacy.tokens import Doc +import spacy + +class BertTokenizer: + def __init__(self, vocab, vocab_file, lowercase=True): + self.vocab = vocab + self._tokenizer = BertWordPieceTokenizer(vocab_file, lowercase=lowercase) + + def __call__(self, text): + tokens = self._tokenizer.encode(text) + words = [] + spaces = [] + for i, (text, (start, end)) in enumerate(zip(tokens.tokens, tokens.offsets)): + words.append(text) + if i < len(tokens.tokens) - 1: + # If next start != current end we assume a space in between + next_start, next_end = tokens.offsets[i + 1] + spaces.append(next_start > end) + else: + spaces.append(True) + return Doc(self.vocab, words=words, spaces=spaces) + +nlp = spacy.blank("en") +nlp.tokenizer = BertTokenizer(nlp.vocab, "bert-base-uncased-vocab.txt") +doc = nlp("Justin Drew Bieber is a Canadian singer, songwriter, and actor.") +print(doc.text, [token.text for token in doc]) +# [CLS]justin drew bi##eber is a canadian singer, songwriter, and actor.[SEP] +# ['[CLS]', 'justin', 'drew', 'bi', '##eber', 'is', 'a', 'canadian', 'singer', +# ',', 'songwriter', ',', 'and', 'actor', '.', '[SEP]'] +``` + + + +Keep in mind that your models' results may be less accurate if the tokenization +during training differs from the tokenization at runtime. So if you modify a +trained pipeline's tokenization afterwards, it may produce very different +predictions. You should therefore train your pipeline with the **same +tokenizer** it will be using at runtime. See the docs on +[training with custom tokenization](#custom-tokenizer-training) for details. + + + +#### Training with custom tokenization {#custom-tokenizer-training new="3"} + +spaCy's [training config](/usage/training#config) describes the settings, +hyperparameters, pipeline and tokenizer used for constructing and training the +pipeline. The `[nlp.tokenizer]` block refers to a **registered function** that +takes the `nlp` object and returns a tokenizer. Here, we're registering a +function called `whitespace_tokenizer` in the +[`@tokenizers` registry](/api/registry). To make sure spaCy knows how to +construct your tokenizer during training, you can pass in your Python file by +setting `--code functions.py` when you run [`spacy train`](/api/cli#train). + +> #### config.cfg +> +> ```ini +> [nlp.tokenizer] +> @tokenizers = "whitespace_tokenizer" +> ``` + +```python +### functions.py {highlight="1"} +@spacy.registry.tokenizers("whitespace_tokenizer") +def create_whitespace_tokenizer(): + def create_tokenizer(nlp): + return WhitespaceTokenizer(nlp.vocab) + + return create_tokenizer +``` + +Registered functions can also take arguments that are then passed in from the +config. This allows you to quickly change and keep track of different settings. +Here, the registered function called `bert_word_piece_tokenizer` takes two +arguments: the path to a vocabulary file and whether to lowercase the text. The +Python type hints `str` and `bool` ensure that the received values have the +correct type. 
+
+> #### config.cfg
+>
+> ```ini
+> [nlp.tokenizer]
+> @tokenizers = "bert_word_piece_tokenizer"
+> vocab_file = "bert-base-uncased-vocab.txt"
+> lowercase = true
+> ```
+
+```python
+### functions.py {highlight="1"}
+@spacy.registry.tokenizers("bert_word_piece_tokenizer")
+def create_bert_word_piece_tokenizer(vocab_file: str, lowercase: bool):
+    def create_tokenizer(nlp):
+        return BertTokenizer(nlp.vocab, vocab_file, lowercase)
+
+    return create_tokenizer
+```
+
+To avoid hard-coding local paths into your config file, you can also set the
+vocab path on the CLI by using the `--nlp.tokenizer.vocab_file`
+[override](/usage/training#config-overrides) when you run
+[`spacy train`](/api/cli#train). For more details on using registered functions,
+see the docs in [training with custom code](/usage/training#custom-code).
+
+
+
+Remember that a registered function should always be a function that spaCy
+**calls to create something**, not the "something" itself. In this case, it
+**creates a function** that takes the `nlp` object and returns a callable that
+takes a text and returns a `Doc`.
+
+
+
+#### Using pre-tokenized text {#own-annotations}
+
+spaCy generally assumes by default that your data is **raw text**. However,
sometimes your data is partially annotated, e.g. with pre-existing tokenization,
-part-of-speech tags, etc. The most common situation is that you have pre-defined
-tokenization. If you have a list of strings, you can create a `Doc` object
-directly. Optionally, you can also specify a list of boolean values, indicating
-whether each word has a subsequent space.
+part-of-speech tags, etc. The most common situation is that you have
+**pre-defined tokenization**. If you have a list of strings, you can create a
+[`Doc`](/api/doc) object directly. Optionally, you can also specify a list of
+boolean values, indicating whether each word is followed by a space.
+
+> #### ✏️ Things to try
+>
+> 1. Change a boolean value in the list of `spaces`. You should see it reflected
+>    in the `doc.text` and whether the token is followed by a space.
+> 2. Remove `spaces=spaces` from the `Doc`. You should see that every token is
+>    now followed by a space.
+> 3. Copy-paste a random sentence from the internet and manually construct a
+>    `Doc` with `words` and `spaces` so that the `doc.text` matches the original
+>    input text.

```python
### {executable="true"}
import spacy
from spacy.tokens import Doc
-from spacy.lang.en import English

-nlp = English()
-doc = Doc(nlp.vocab, words=["Hello", ",", "world", "!"],
-          spaces=[False, True, False, False])
+nlp = spacy.blank("en")
+words = ["Hello", ",", "world", "!"]
+spaces = [False, True, False, False]
+doc = Doc(nlp.vocab, words=words, spaces=spaces)
+print(doc.text)
print([(t.text, t.text_with_ws, t.whitespace_) for t in doc])
```

-If provided, the spaces list must be the same length as the words list. The
+If provided, the spaces list must be the **same length** as the words list. The
spaces list affects the `doc.text`, `span.text`, `token.idx`, `span.start_char`
and `span.end_char` attributes. If you don't provide a `spaces` sequence, spaCy
-will assume that all words are whitespace delimited.
+will assume that all words are followed by a space. Once you have a
+[`Doc`](/api/doc) object, you can write to its attributes to set the
+part-of-speech tags, syntactic dependencies, named entities and other
+attributes. 
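+
+As a minimal sketch (the tag values and entity labels below are purely
+illustrative, not predictions), you could fill in annotations on such a
+pre-tokenized `Doc` like this:
+
+```python
+import spacy
+from spacy.tokens import Doc, Span
+
+nlp = spacy.blank("en")
+words = ["Ada", "Lovelace", "was", "born", "in", "London"]
+doc = Doc(nlp.vocab, words=words)
+# Write token-level attributes directly
+doc[0].tag_ = "NNP"
+doc[1].tag_ = "NNP"
+# Assign named entities by overwriting doc.ents with Span objects
+doc.ents = [Span(doc, 0, 2, label="PERSON"), Span(doc, 5, 6, label="GPE")]
+print([(ent.text, ent.label_) for ent in doc.ents])
+```
+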
-```python -### {executable="true"} -import spacy -from spacy.tokens import Doc -from spacy.lang.en import English - -nlp = English() -bad_spaces = Doc(nlp.vocab, words=["Hello", ",", "world", "!"]) -good_spaces = Doc(nlp.vocab, words=["Hello", ",", "world", "!"], - spaces=[False, True, False, False]) - -print(bad_spaces.text) # 'Hello , world !' -print(good_spaces.text) # 'Hello, world!' -``` - -Once you have a [`Doc`](/api/doc) object, you can write to its attributes to set -the part-of-speech tags, syntactic dependencies, named entities and other -attributes. For details, see the respective usage pages. - -### Aligning tokenization {#aligning-tokenization} +#### Aligning tokenization {#aligning-tokenization} spaCy's tokenization is non-destructive and uses language-specific rules optimized for compatibility with treebank annotations. Other tools and resources @@ -1121,51 +1345,45 @@ In situations like that, you often want to align the tokenization so that you can merge annotations from different sources together, or take vectors predicted by a [pretrained BERT model](https://github.com/huggingface/pytorch-transformers) and -apply them to spaCy tokens. spaCy's [`gold.align`](/api/goldparse#align) helper -returns a `(cost, a2b, b2a, a2b_multi, b2a_multi)` tuple describing the number -of misaligned tokens, the one-to-one mappings of token indices in both -directions and the indices where multiple tokens align to one single token. +apply them to spaCy tokens. spaCy's [`Alignment`](/api/example#alignment-object) +object allows the one-to-one mappings of token indices in both directions as +well as taking into account indices where multiple tokens align to one single +token. > #### ✏️ Things to try > > 1. Change the capitalization in one of the token lists – for example, > `"obama"` to `"Obama"`. You'll see that the alignment is case-insensitive. > 2. Change `"podcasts"` in `other_tokens` to `"pod", "casts"`. You should see -> that there are now 4 misaligned tokens and that the new many-to-one mapping -> is reflected in `a2b_multi`. -> 3. Make `other_tokens` and `spacy_tokens` identical. You'll see that the -> `cost` is `0` and all corresponding mappings are also identical. +> that there are now two tokens of length 2 in `y2x`, one corresponding to +> "'s", and one to "podcasts". +> 3. Make `other_tokens` and `spacy_tokens` identical. You'll see that all +> tokens now correspond 1-to-1. 
```python ### {executable="true"} -from spacy.gold import align +from spacy.training import Alignment other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."] spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."] -cost, a2b, b2a, a2b_multi, b2a_multi = align(other_tokens, spacy_tokens) -print("Edit distance:", cost) # 3 -print("One-to-one mappings a -> b", a2b) # array([0, 1, 2, 3, -1, -1, 5, 6]) -print("One-to-one mappings b -> a", b2a) # array([0, 1, 2, 3, -1, 6, 7]) -print("Many-to-one mappings a -> b", a2b_multi) # {4: 4, 5: 4} -print("Many-to-one mappings b-> a", b2a_multi) # {} +align = Alignment.from_strings(other_tokens, spacy_tokens) +print(f"a -> b, lengths: {align.x2y.lengths}") # array([1, 1, 1, 1, 1, 1, 1, 1]) +print(f"a -> b, mapping: {align.x2y.dataXd}") # array([0, 1, 2, 3, 4, 4, 5, 6]) : two tokens both refer to "'s" +print(f"b -> a, lengths: {align.y2x.lengths}") # array([1, 1, 1, 1, 2, 1, 1]) : the token "'s" refers to two tokens +print(f"b -> a, mappings: {align.y2x.dataXd}") # array([0, 1, 2, 3, 4, 5, 6, 7]) ``` Here are some insights from the alignment information generated in the example above: -- The edit distance (cost) is `3`: two deletions and one insertion. - The one-to-one mappings for the first four tokens are identical, which means they map to each other. This makes sense because they're also identical in the input: `"i"`, `"listened"`, `"to"` and `"obama"`. -- The index mapped to `a2b[6]` is `5`, which means that `other_tokens[6]` +- The value of `x2y.dataXd[6]` is `5`, which means that `other_tokens[6]` (`"podcasts"`) aligns to `spacy_tokens[5]` (also `"podcasts"`). -- `a2b[4]` is `-1`, which means that there is no one-to-one alignment for the - token at `other_tokens[4]`. The token `"'"` doesn't exist on its own in - `spacy_tokens`. The same goes for `a2b[5]` and `other_tokens[5]`, i.e. `"s"`. -- The dictionary `a2b_multi` shows that both tokens 4 and 5 of `other_tokens` - (`"'"` and `"s"`) align to token 4 of `spacy_tokens` (`"'s"`). -- The dictionary `b2a_multi` shows that there are no tokens in `spacy_tokens` - that map to multiple tokens in `other_tokens`. +- `x2y.dataXd[4]` and `x2y.dataXd[5]` are both `4`, which means that both tokens + 4 and 5 of `other_tokens` (`"'"` and `"s"`) align to token 4 of `spacy_tokens` + (`"'s"`). @@ -1245,7 +1463,7 @@ filtered_spans = filter_spans(spans) The [`retokenizer.split`](/api/doc#retokenizer.split) method allows splitting one token into two or more tokens. This can be useful for cases where tokenization rules alone aren't sufficient. For example, you might want to split -"its" into the tokens "it" and "is" — but not the possessive pronoun "its". You +"its" into the tokens "it" and "is" – but not the possessive pronoun "its". You can write rule-based logic that can find only the correct "its" to split, but by that time, the `Doc` will already be tokenized. @@ -1293,7 +1511,7 @@ the token indices after splitting. | `"York"` | `doc[2]` | Attach this token to `doc[1]` in the original `Doc`, i.e. "in". | If you don't care about the heads (for example, if you're only running the -tokenizer and not the parser), you can each subtoken to itself: +tokenizer and not the parser), you can attach each subtoken to itself: ```python ### {highlight="3"} @@ -1372,25 +1590,47 @@ print("After:", [(token.text, token._.is_musician) for token in doc]) ## Sentence Segmentation {#sbd} A [`Doc`](/api/doc) object's sentences are available via the `Doc.sents` -property. 
Unlike other libraries, spaCy uses the dependency parse to determine -sentence boundaries. This is usually more accurate than a rule-based approach, -but it also means you'll need a **statistical model** and accurate predictions. -If your texts are closer to general-purpose news or web text, this should work -well out-of-the-box. For social media or conversational text that doesn't follow -the same rules, your application may benefit from a custom rule-based -implementation. You can either use the built-in -[`Sentencizer`](/api/sentencizer) or plug an entirely custom rule-based function -into your [processing pipeline](/usage/processing-pipelines). +property. To view a `Doc`'s sentences, you can iterate over the `Doc.sents`, a +generator that yields [`Span`](/api/span) objects. You can check whether a `Doc` +has sentence boundaries by calling +[`Doc.has_annotation`](/api/doc#has_annotation) with the attribute name +`"SENT_START"`. -spaCy's dependency parser respects already set boundaries, so you can preprocess -your `Doc` using custom rules _before_ it's parsed. Depending on your text, this -may also improve accuracy, since the parser is constrained to predict parses -consistent with the sentence boundaries. +```python +### {executable="true"} +import spacy + +nlp = spacy.load("en_core_web_sm") +doc = nlp("This is a sentence. This is another sentence.") +assert doc.has_annotation("SENT_START") +for sent in doc.sents: + print(sent.text) +``` + +spaCy provides four alternatives for sentence segmentation: + +1. [Dependency parser](#sbd-parser): the statistical + [`DependencyParser`](/api/dependencyparser) provides the most accurate + sentence boundaries based on full dependency parses. +2. [Statistical sentence segmenter](#sbd-senter): the statistical + [`SentenceRecognizer`](/api/sentencerecognizer) is a simpler and faster + alternative to the parser that only sets sentence boundaries. +3. [Rule-based pipeline component](#sbd-component): the rule-based + [`Sentencizer`](/api/sentencizer) sets sentence boundaries using a + customizable list of sentence-final punctuation. +4. [Custom function](#sbd-custom): your own custom function added to the + processing pipeline can set sentence boundaries by writing to + `Token.is_sent_start`. ### Default: Using the dependency parse {#sbd-parser model="parser"} -To view a `Doc`'s sentences, you can iterate over the `Doc.sents`, a generator -that yields [`Span`](/api/span) objects. +Unlike other libraries, spaCy uses the dependency parse to determine sentence +boundaries. This is usually the most accurate approach, but it requires a +**trained pipeline** that provides accurate predictions. If your texts are +closer to general-purpose news or web text, this should work well out-of-the-box +with spaCy's provided trained pipelines. For social media or conversational text +that doesn't follow the same rules, your application may benefit from a custom +trained or rule-based component. ```python ### {executable="true"} @@ -1402,21 +1642,56 @@ for sent in doc.sents: print(sent.text) ``` +spaCy's dependency parser respects already set boundaries, so you can preprocess +your `Doc` using custom components _before_ it's parsed. Depending on your text, +this may also improve parse accuracy, since the parser is constrained to predict +parses consistent with the sentence boundaries. 
+ +### Statistical sentence segmenter {#sbd-senter model="senter" new="3"} + +The [`SentenceRecognizer`](/api/sentencerecognizer) is a simple statistical +component that only provides sentence boundaries. Along with being faster and +smaller than the parser, its primary advantage is that it's easier to train +because it only requires annotated sentence boundaries rather than full +dependency parses. spaCy's [trained pipelines](/models) include both a parser +and a trained sentence segmenter, which is +[disabled](/usage/processing-pipelines#disabling) by default. If you only need +sentence boundaries and no parser, you can use the `exclude` or `disable` +argument on [`spacy.load`](/api/top-level#spacy.load) to load the pipeline +without the parser and then enable the sentence recognizer explicitly with +[`nlp.enable_pipe`](/api/language#enable_pipe). + +> #### senter vs. parser +> +> The recall for the `senter` is typically slightly lower than for the parser, +> which is better at predicting sentence boundaries when punctuation is not +> present. + +```python +### {executable="true"} +import spacy + +nlp = spacy.load("en_core_web_sm", exclude=["parser"]) +nlp.enable_pipe("senter") +doc = nlp("This is a sentence. This is another sentence.") +for sent in doc.sents: + print(sent.text) +``` + ### Rule-based pipeline component {#sbd-component} The [`Sentencizer`](/api/sentencizer) component is a [pipeline component](/usage/processing-pipelines) that splits sentences on punctuation like `.`, `!` or `?`. You can plug it into your pipeline if you only -need sentence boundaries without the dependency parse. +need sentence boundaries without dependency parses. ```python ### {executable="true"} import spacy from spacy.lang.en import English -nlp = English() # just the language with no model -sentencizer = nlp.create_pipe("sentencizer") -nlp.add_pipe(sentencizer) +nlp = English() # just the language with no pipeline +nlp.add_pipe("sentencizer") doc = nlp("This is a sentence. This is another sentence.") for sent in doc.sents: print(sent.text) @@ -1435,14 +1710,15 @@ and can still be overwritten by the parser. To prevent inconsistent state, you can only set boundaries **before** a document -is parsed (and `Doc.is_parsed` is `False`). To ensure that your component is -added in the right place, you can set `before='parser'` or `first=True` when -adding it to the pipeline using [`nlp.add_pipe`](/api/language#add_pipe). +is parsed (and `doc.has_annotation("DEP")` is `False`). To ensure that your +component is added in the right place, you can set `before='parser'` or +`first=True` when adding it to the pipeline using +[`nlp.add_pipe`](/api/language#add_pipe). Here's an example of a component that implements a pre-processing rule for -splitting on `'...'` tokens. The component is added before the parser, which is +splitting on `"..."` tokens. The component is added before the parser, which is then used to further segment the text. That's possible, because `is_sent_start` is only set to `True` for some of the tokens – all others still specify `None` for unset sentence boundaries. This approach can be useful if you want to @@ -1451,6 +1727,7 @@ take advantage of dependency-based sentence segmentation. ```python ### {executable="true"} +from spacy.language import Language import spacy text = "this is a sentence...hello...and another sentence." 
@@ -1459,24 +1736,284 @@ nlp = spacy.load("en_core_web_sm") doc = nlp(text) print("Before:", [sent.text for sent in doc.sents]) +@Language.component("set_custom_boundaries") def set_custom_boundaries(doc): for token in doc[:-1]: if token.text == "...": - doc[token.i+1].is_sent_start = True + doc[token.i + 1].is_sent_start = True return doc -nlp.add_pipe(set_custom_boundaries, before="parser") +nlp.add_pipe("set_custom_boundaries", before="parser") doc = nlp(text) print("After:", [sent.text for sent in doc.sents]) ``` -## Rule-based matching {#rule-based-matching hidden="true"} +## Mappings & Exceptions {#mappings-exceptions new="3"} -
- +The [`AttributeRuler`](/api/attributeruler) manages **rule-based mappings and +exceptions** for all token-level attributes. As the number of +[pipeline components](/api/#architecture-pipeline) has grown from spaCy v2 to +v3, handling rules and exceptions in each component individually has become +impractical, so the `AttributeRuler` provides a single component with a unified +pattern format for all token attribute mappings and exceptions. -The documentation on rule-based matching -[has moved to its own page](/usage/rule-based-matching). +The `AttributeRuler` uses +[`Matcher` patterns](/usage/rule-based-matching#adding-patterns) to identify +tokens and then assigns them the provided attributes. If needed, the +[`Matcher`](/api/matcher) patterns can include context around the target token. +For example, the attribute ruler can: + +- provide exceptions for any **token attributes** +- map **fine-grained tags** to **coarse-grained tags** for languages without + statistical morphologizers (replacing the v2.x `tag_map` in the + [language data](#language-data)) +- map token **surface form + fine-grained tags** to **morphological features** + (replacing the v2.x `morph_rules` in the [language data](#language-data)) +- specify the **tags for space tokens** (replacing hard-coded behavior in the + tagger) + +The following example shows how the tag and POS `NNP`/`PROPN` can be specified +for the phrase `"The Who"`, overriding the tags provided by the statistical +tagger and the POS tag map. + +```python +### {executable="true"} +import spacy + +nlp = spacy.load("en_core_web_sm") +text = "I saw The Who perform. Who did you see?" +doc1 = nlp(text) +print(doc1[2].tag_, doc1[2].pos_) # DT DET +print(doc1[3].tag_, doc1[3].pos_) # WP PRON + +# Add attribute ruler with exception for "The Who" as NNP/PROPN NNP/PROPN +ruler = nlp.get_pipe("attribute_ruler") +# Pattern to match "The Who" +patterns = [[{"LOWER": "the"}, {"TEXT": "Who"}]] +# The attributes to assign to the matched token +attrs = {"TAG": "NNP", "POS": "PROPN"} +# Add rules to the attribute ruler +ruler.add(patterns=patterns, attrs=attrs, index=0) # "The" in "The Who" +ruler.add(patterns=patterns, attrs=attrs, index=1) # "Who" in "The Who" + +doc2 = nlp(text) +print(doc2[2].tag_, doc2[2].pos_) # NNP PROPN +print(doc2[3].tag_, doc2[3].pos_) # NNP PROPN +# The second "Who" remains unmodified +print(doc2[5].tag_, doc2[5].pos_) # WP PRON +``` + + + +The [`AttributeRuler`](/api/attributeruler) can import a **tag map and morph +rules** in the v2.x format via its built-in methods or when the component is +initialized before training. See the +[migration guide](/usage/v3#migrating-training-mappings-exceptions) for details. -
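+
+As a sketch, importing a single v2.x-style tag map entry could look like this
+(the mapping below is only an illustration):
+
+```python
+import spacy
+
+nlp = spacy.load("en_core_web_sm")
+ruler = nlp.get_pipe("attribute_ruler")
+# A v2.x-style tag map: fine-grained tag -> attributes to assign
+tag_map = {".": {"POS": "PUNCT"}}
+ruler.load_from_tag_map(tag_map)
+```
+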
+ +## Word vectors and semantic similarity {#vectors-similarity} + +import Vectors101 from 'usage/101/\_vectors-similarity.md' + + + +### Adding word vectors {#adding-vectors} + +Custom word vectors can be trained using a number of open-source libraries, such +as [Gensim](https://radimrehurek.com/gensim), [FastText](https://fasttext.cc), +or Tomas Mikolov's original +[Word2vec implementation](https://code.google.com/archive/p/word2vec/). Most +word vector libraries output an easy-to-read text-based format, where each line +consists of the word followed by its vector. For everyday use, we want to +convert the vectors into a binary format that loads faster and takes up less +space on disk. The easiest way to do this is the +[`init vectors`](/api/cli#init-vectors) command-line utility. This will output a +blank spaCy pipeline in the directory `/tmp/la_vectors_wiki_lg`, giving you +access to some nice Latin vectors. You can then pass the directory path to +[`spacy.load`](/api/top-level#spacy.load) or use it in the +[`[initialize]`](/api/data-formats#config-initialize) of your config when you +[train](/usage/training) a model. + +> #### Usage example +> +> ```python +> nlp_latin = spacy.load("/tmp/la_vectors_wiki_lg") +> doc1 = nlp_latin("Caecilius est in horto") +> doc2 = nlp_latin("servus est in atrio") +> doc1.similarity(doc2) +> ``` + +```cli +$ wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.la.300.vec.gz +$ python -m spacy init vectors en cc.la.300.vec.gz /tmp/la_vectors_wiki_lg +``` + + + +To help you strike a good balance between coverage and memory usage, spaCy's +[`Vectors`](/api/vectors) class lets you map **multiple keys** to the **same +row** of the table. If you're using the +[`spacy init vectors`](/api/cli#init-vectors) command to create a vocabulary, +pruning the vectors will be taken care of automatically if you set the `--prune` +flag. You can also do it manually in the following steps: + +1. Start with a **word vectors package** that covers a huge vocabulary. For + instance, the [`en_vectors_web_lg`](/models/en-starters#en_vectors_web_lg) + starter provides 300-dimensional GloVe vectors for over 1 million terms of + English. +2. If your vocabulary has values set for the `Lexeme.prob` attribute, the + lexemes will be sorted by descending probability to determine which vectors + to prune. Otherwise, lexemes will be sorted by their order in the `Vocab`. +3. Call [`Vocab.prune_vectors`](/api/vocab#prune_vectors) with the number of + vectors you want to keep. + +```python +nlp = spacy.load('en_vectors_web_lg') +n_vectors = 105000 # number of vectors to keep +removed_words = nlp.vocab.prune_vectors(n_vectors) + +assert len(nlp.vocab.vectors) <= n_vectors # unique vectors have been pruned +assert nlp.vocab.vectors.n_keys > n_vectors # but not the total entries +``` + +[`Vocab.prune_vectors`](/api/vocab#prune_vectors) reduces the current vector +table to a given number of unique entries, and returns a dictionary containing +the removed words, mapped to `(string, score)` tuples, where `string` is the +entry the removed word was mapped to and `score` the similarity score between +the two words. 
+ +```python +### Removed words +{ + "Shore": ("coast", 0.732257), + "Precautionary": ("caution", 0.490973), + "hopelessness": ("sadness", 0.742366), + "Continous": ("continuous", 0.732549), + "Disemboweled": ("corpse", 0.499432), + "biostatistician": ("scientist", 0.339724), + "somewheres": ("somewheres", 0.402736), + "observing": ("observe", 0.823096), + "Leaving": ("leaving", 1.0), +} +``` + +In the example above, the vector for "Shore" was removed and remapped to the +vector of "coast", which is deemed about 73% similar. "Leaving" was remapped to +the vector of "leaving", which is identical. If you're using the +[`init vectors`](/api/cli#init-vectors) command, you can set the `--prune` +option to easily reduce the size of the vectors as you add them to a spaCy +pipeline: + +```cli +$ python -m spacy init vectors en la.300d.vec.tgz /tmp/la_vectors_web_md --prune 10000 +``` + +This will create a blank spaCy pipeline with vectors for the first 10,000 words +in the vectors. All other words in the vectors are mapped to the closest vector +among those retained. + + + +### Adding vectors individually {#adding-individual-vectors} + +The `vector` attribute is a **read-only** numpy or cupy array (depending on +whether you've configured spaCy to use GPU memory), with dtype `float32`. The +array is read-only so that spaCy can avoid unnecessary copy operations where +possible. You can modify the vectors via the [`Vocab`](/api/vocab) or +[`Vectors`](/api/vectors) table. Using the +[`Vocab.set_vector`](/api/vocab#set_vector) method is often the easiest approach +if you have vectors in an arbitrary format, as you can read in the vectors with +your own logic, and just set them with a simple loop. This method is likely to +be slower than approaches that work with the whole vectors table at once, but +it's a great approach for once-off conversions before you save out your `nlp` +object to disk. + +```python +### Adding vectors +from spacy.vocab import Vocab + +vector_data = { + "dog": numpy.random.uniform(-1, 1, (300,)), + "cat": numpy.random.uniform(-1, 1, (300,)), + "orange": numpy.random.uniform(-1, 1, (300,)) +} +vocab = Vocab() +for word, vector in vector_data.items(): + vocab.set_vector(word, vector) +``` + +## Language Data {#language-data} + +import LanguageData101 from 'usage/101/\_language-data.md' + + + +### Creating a custom language subclass {#language-subclass} + +If you want to customize multiple components of the language data or add support +for a custom language or domain-specific "dialect", you can also implement your +own language subclass. The subclass should define two attributes: the `lang` +(unique language code) and the `Defaults` defining the language data. For an +overview of the available attributes that can be overwritten, see the +[`Language.Defaults`](/api/language#defaults) documentation. + +```python +### {executable="true"} +from spacy.lang.en import English + +class CustomEnglishDefaults(English.Defaults): + stop_words = set(["custom", "stop"]) + +class CustomEnglish(English): + lang = "custom_en" + Defaults = CustomEnglishDefaults + +nlp1 = English() +nlp2 = CustomEnglish() + +print(nlp1.lang, [token.is_stop for token in nlp1("custom stop")]) +print(nlp2.lang, [token.is_stop for token in nlp2("custom stop")]) +``` + +The [`@spacy.registry.languages`](/api/top-level#registry) decorator lets you +register a custom language class and assign it a string name. 
This means that +you can call [`spacy.blank`](/api/top-level#spacy.blank) with your custom +language name, and even train pipelines with it and refer to it in your +[training config](/usage/training#config). + +> #### Config usage +> +> After registering your custom language class using the `languages` registry, +> you can refer to it in your [training config](/usage/training#config). This +> means spaCy will train your pipeline using the custom subclass. +> +> ```ini +> [nlp] +> lang = "custom_en" +> ``` +> +> In order to resolve `"custom_en"` to your subclass, the registered function +> needs to be available during training. You can load a Python file containing +> the code using the `--code` argument: +> +> ```cli +> python -m spacy train config.cfg --code code.py +> ``` + +```python +### Registering a custom language {highlight="7,12-13"} +import spacy +from spacy.lang.en import English + +class CustomEnglishDefaults(English.Defaults): + stop_words = set(["custom", "stop"]) + +@spacy.registry.languages("custom_en") +class CustomEnglish(English): + lang = "custom_en" + Defaults = CustomEnglishDefaults + +# This now works! 🎉 +nlp = spacy.blank("custom_en") +``` diff --git a/website/docs/usage/models.md b/website/docs/usage/models.md index cc65dad68..8c8875b9e 100644 --- a/website/docs/usage/models.md +++ b/website/docs/usage/models.md @@ -8,40 +8,39 @@ menu: - ['Production Use', 'production'] --- -spaCy's models can be installed as **Python packages**. This means that they're -a component of your application, just like any other module. They're versioned -and can be defined as a dependency in your `requirements.txt`. Models can be -installed from a download URL or a local directory, manually or via -[pip](https://pypi.python.org/pypi/pip). Their data can be located anywhere on -your file system. +spaCy's trained pipelines can be installed as **Python packages**. This means +that they're a component of your application, just like any other module. +They're versioned and can be defined as a dependency in your `requirements.txt`. +Trained pipelines can be installed from a download URL or a local directory, +manually or via [pip](https://pypi.python.org/pypi/pip). Their data can be +located anywhere on your file system. > #### Important note > -> If you're upgrading to spaCy v1.7.x or v2.x, you need to **download the new -> models**. If you've trained statistical models that use spaCy's annotations, -> you should **retrain your models** after updating spaCy. If you don't retrain, -> you may suffer train/test skew, which might decrease your accuracy. +> If you're upgrading to spaCy v3.x, you need to **download the new pipeline +> packages**. If you've trained your own pipelines, you need to **retrain** them +> after updating spaCy. ## Quickstart {hidden="true"} import QuickstartModels from 'widgets/quickstart-models.js' - + ## Language support {#languages} spaCy currently provides support for the following languages. You can help by -[improving the existing language data](/usage/adding-languages#language-data) +improving the existing [language data](/usage/linguistic-features#language-data) and extending the tokenization patterns. [See here](https://github.com/explosion/spaCy/issues/3056) for details on how to -contribute to model development. +contribute to development. > #### Usage note > -> If a model is available for a language, you can download it using the -> [`spacy download`](/api/cli#download) command. 
In order to use languages that -> don't yet come with a model, you have to import them directly, or use -> [`spacy.blank`](/api/top-level#spacy.blank): +> If a trained pipeline is available for a language, you can download it using +> the [`spacy download`](/api/cli#download) command. In order to use languages +> that don't yet come with a trained pipeline, you have to import them directly, +> or use [`spacy.blank`](/api/top-level#spacy.blank): > > ```python > from spacy.lang.fi import Finnish @@ -55,7 +54,7 @@ contribute to model development. > separately in the same environment: > > ```bash -> $ pip install spacy[lookups] +> $ pip install -U %%SPACY_PKG_NAME[lookups]%%SPACY_PKG_FLAGS > ``` import Languages from 'widgets/languages.js' @@ -70,77 +69,112 @@ import Languages from 'widgets/languages.js' > nlp = MultiLanguage() > > # With lazy-loading -> from spacy.util import get_lang_class -> nlp = get_lang_class('xx') +> nlp = spacy.blank("xx") > ``` -As of v2.0, spaCy supports models trained on more than one language. This is +spaCy also supports pipelines trained on more than one language. This is especially useful for named entity recognition. The language ID used for -multi-language or language-neutral models is `xx`. The language class, a generic -subclass containing only the base language data, can be found in -[`lang/xx`](https://github.com/explosion/spaCy/tree/master/spacy/lang/xx). +multi-language or language-neutral pipelines is `xx`. The language class, a +generic subclass containing only the base language data, can be found in +[`lang/xx`](%%GITHUB_SPACY/spacy/lang/xx). -To load your model with the neutral, multi-language class, simply set -`"language": "xx"` in your [model package](/usage/training#models-generating)'s -`meta.json`. You can also import the class directly, or call -[`util.get_lang_class()`](/api/top-level#util.get_lang_class) for lazy-loading. +To train a pipeline using the neutral multi-language class, you can set +`lang = "xx"` in your [training config](/usage/training#config). You can also +import the `MultiLanguage` class directly, or call +[`spacy.blank("xx")`](/api/top-level#spacy.blank) for lazy-loading. -### Chinese language support {#chinese new=2.3} +### Chinese language support {#chinese new="2.3"} -The Chinese language class supports three word segmentation options: +The Chinese language class supports three word segmentation options, `char`, +`jieba` and `pkuseg`. +> #### Manual setup +> > ```python > from spacy.lang.zh import Chinese > -> # Disable jieba to use character segmentation -> Chinese.Defaults.use_jieba = False +> # Character segmentation (default) > nlp = Chinese() -> -> # Disable jieba through tokenizer config options -> cfg = {"use_jieba": False} -> nlp = Chinese(meta={"tokenizer": {"config": cfg}}) -> -> # Load with "default" model provided by pkuseg -> cfg = {"pkuseg_model": "default", "require_pkuseg": True} -> nlp = Chinese(meta={"tokenizer": {"config": cfg}}) +> # Jieba +> cfg = {"segmenter": "jieba"} +> nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}}) +> # PKUSeg with "mixed" model provided by pkuseg +> cfg = {"segmenter": "pkuseg"} +> nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}}) +> nlp.tokenizer.initialize(pkuseg_model="mixed") > ``` -1. **Jieba:** `Chinese` uses [Jieba](https://github.com/fxsjy/jieba) for word - segmentation by default. It's enabled when you create a new `Chinese` - language class or call `spacy.blank("zh")`. -2. 
**Character segmentation:** Character segmentation is supported by disabling - `jieba` and setting `Chinese.Defaults.use_jieba = False` _before_ - initializing the language class. As of spaCy v2.3.0, the `meta` tokenizer - config options can be used to configure `use_jieba`. -3. **PKUSeg**: In spaCy v2.3.0, support for - [PKUSeg](https://github.com/lancopku/PKUSeg-python) has been added to support - better segmentation for Chinese OntoNotes and the new - [Chinese models](/models/zh). +```ini +### config.cfg +[nlp.tokenizer] +@tokenizers = "spacy.zh.ChineseTokenizer" +segmenter = "char" +``` - +| Segmenter | Description | +| --------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `char` | **Character segmentation:** Character segmentation is the default segmentation option. It's enabled when you create a new `Chinese` language class or call `spacy.blank("zh")`. | +| `jieba` | **Jieba:** to use [Jieba](https://github.com/fxsjy/jieba) for word segmentation, you can set the option `segmenter` to `"jieba"`. | +| `pkuseg` | **PKUSeg**: As of spaCy v2.3.0, support for [PKUSeg](https://github.com/explosion/spacy-pkuseg) has been added to support better segmentation for Chinese OntoNotes and the provided [Chinese pipelines](/models/zh). Enable PKUSeg by setting tokenizer option `segmenter` to `"pkuseg"`. | -The `meta` argument of the `Chinese` language class supports the following -following tokenizer config settings: + -| Name | Type | Description | -| ------------------ | ------- | ---------------------------------------------------------------------------------------------------- | -| `pkuseg_model` | unicode | **Required:** Name of a model provided by `pkuseg` or the path to a local model directory. | -| `pkuseg_user_dict` | unicode | Optional path to a file with one word per line which overrides the default `pkuseg` user dictionary. | -| `require_pkuseg` | bool | Overrides all `jieba` settings (optional but strongly recommended). | +In v3.0, the default word segmenter has switched from Jieba to character +segmentation. Because the `pkuseg` segmenter depends on a model that can be +loaded from a file, the model is loaded on +[initialization](/usage/training#config-lifecycle) (typically before training). +This ensures that your packaged Chinese model doesn't depend on a local path at +runtime. + + + + + +The `initialize` method for the Chinese tokenizer class supports the following +config settings for loading `pkuseg` models: + +| Name | Description | +| ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `pkuseg_model` | Name of a model provided by `spacy-pkuseg` or the path to a local model directory. ~~str~~ | +| `pkuseg_user_dict` | Optional path to a file with one word per line which overrides the default `pkuseg` user dictionary. Defaults to `"default"`, the default provided dictionary. ~~str~~ | + +The initialization settings are typically provided in the +[training config](/usage/training#config) and the data is loaded in before +training and serialized with the model. This allows you to load the data from a +local path and save out your pipeline and config, without requiring the same +local path at runtime. 
See the usage guide on the +[config lifecycle](/usage/training#config-lifecycle) for more background on +this. + +```ini +### config.cfg +[initialize] + +[initialize.tokenizer] +pkuseg_model = "/path/to/model" +pkuseg_user_dict = "default" +``` + +You can also initialize the tokenizer for a blank language class by calling its +`initialize` method: ```python ### Examples -# Load "default" model -cfg = {"pkuseg_model": "default", "require_pkuseg": True} -nlp = Chinese(meta={"tokenizer": {"config": cfg}}) +# Initialize the pkuseg tokenizer +cfg = {"segmenter": "pkuseg"} +nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}}) + +# Load spaCy's OntoNotes model +nlp.tokenizer.initialize(pkuseg_model="spacy_ontonotes") + +# Load pkuseg's "news" model +nlp.tokenizer.initialize(pkuseg_model="news") # Load local model -cfg = {"pkuseg_model": "/path/to/pkuseg_model", "require_pkuseg": True} -nlp = Chinese(meta={"tokenizer": {"config": cfg}}) +nlp.tokenizer.initialize(pkuseg_model="/path/to/pkuseg_model") # Override the user directory -cfg = {"pkuseg_model": "default", "require_pkuseg": True, "pkuseg_user_dict": "/path"} -nlp = Chinese(meta={"tokenizer": {"config": cfg}}) +nlp.tokenizer.initialize(pkuseg_model="spacy_ontonotes", pkuseg_user_dict="/path/to/user_dict") ``` You can also modify the user dictionary on-the-fly: @@ -158,97 +192,103 @@ nlp.tokenizer.pkuseg_update_user_dict([], reset=True) - + -The [Chinese models](/models/zh) provided by spaCy include a custom `pkuseg` +The [Chinese pipelines](/models/zh) provided by spaCy include a custom `pkuseg` model trained only on [Chinese OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19), since the models provided by `pkuseg` include data restricted to research use. For -research use, `pkuseg` provides models for several different domains -(`"default"`, `"news"` `"web"`, `"medicine"`, `"tourism"`) and for other uses, -`pkuseg` provides a simple -[training API](https://github.com/lancopku/pkuseg-python/blob/master/readme/readme_english.md#usage): +research use, `pkuseg` provides models for several different domains (`"mixed"` +(equivalent to `"default"` from `pkuseg` packages), `"news"` `"web"`, +`"medicine"`, `"tourism"`) and for other uses, `pkuseg` provides a simple +[training API](https://github.com/explosion/spacy-pkuseg/blob/master/readme/readme_english.md#usage): ```python -import pkuseg +import spacy_pkuseg as pkuseg from spacy.lang.zh import Chinese # Train pkuseg model pkuseg.train("train.utf8", "test.utf8", "/path/to/pkuseg_model") + # Load pkuseg model in spaCy Chinese tokenizer -nlp = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "/path/to/pkuseg_model", "require_pkuseg": True}}}) +cfg = {"segmenter": "pkuseg"} +nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}}) +nlp.tokenizer.initialize(pkuseg_model="/path/to/pkuseg_model") ``` ### Japanese language support {#japanese new=2.3} +> #### Manual setup +> > ```python > from spacy.lang.ja import Japanese > > # Load SudachiPy with split mode A (default) > nlp = Japanese() -> > # Load SudachiPy with split mode B > cfg = {"split_mode": "B"} -> nlp = Japanese(meta={"tokenizer": {"config": cfg}}) +> nlp = Japanese.from_config({"nlp": {"tokenizer": cfg}}) > ``` The Japanese language class uses [SudachiPy](https://github.com/WorksApplications/SudachiPy) for word segmentation and part-of-speech tagging. The default Japanese language class and -the provided Japanese models use SudachiPy split mode `A`. +the provided Japanese pipelines use SudachiPy split mode `A`. 
The tokenizer
+config can be used to configure the split mode to `A`, `B` or `C`.
-The `meta` argument of the `Japanese` language class can be used to configure
-the split mode to `A`, `B` or `C`.
+```ini
+### config.cfg
+[nlp.tokenizer]
+@tokenizers = "spacy.ja.JapaneseTokenizer"
+split_mode = "A"
+```
If you run into errors related to `sudachipy`, which is currently under active
-development, we suggest downgrading to `sudachipy==0.4.5`, which is the version
-used for training the current [Japanese models](/models/ja).
+development, we suggest downgrading to `sudachipy==0.4.9`, which is the version
+used for training the current [Japanese pipelines](/models/ja).
-## Installing and using models {#download}
+## Installing and using trained pipelines {#download}
-> #### Downloading models in spaCy < v1.7
+The easiest way to download a trained pipeline is via spaCy's
+[`download`](/api/cli#download) command. It takes care of finding the
+best-matching package compatible with your spaCy installation.
+
+> #### Important note for v3.0
>
-> In older versions of spaCy, you can still use the old download commands. This
-> will download and install the models into the `spacy/data` directory.
+> Note that as of spaCy v3.0, shortcut links like `en` that create (potentially
+> brittle) symlinks in your spaCy installation are **deprecated**. To download
+> and load an installed pipeline package, use its full name:
>
-> ```bash
-> python -m spacy.en.download all
-> python -m spacy.de.download all
-> python -m spacy.en.download glove
+> ```diff
+> - python -m spacy download en
+> + python -m spacy download en_core_web_sm
> ```
>
-> The old models are also
-> [attached to the v1.6.0 release](https://github.com/explosion/spaCy/tree/v1.6.0).
-> To download and install them manually, unpack the archive, drop the contained
-> directory into `spacy/data`.
+> ```diff
+> - nlp = spacy.load("en")
+> + nlp = spacy.load("en_core_web_sm")
+> ```
-The easiest way to download a model is via spaCy's
-[`download`](/api/cli#download) command. It takes care of finding the
-best-matching model compatible with your spaCy installation.
+```cli
+# Download best-matching version of a package for your spaCy installation
+$ python -m spacy download en_core_web_sm
-```bash
-# Download best-matching version of specific model for your spaCy installation
-python -m spacy download en_core_web_sm
-
-# Out-of-the-box: download best-matching default model and create shortcut link
-python -m spacy download en
-
-# Download exact model version (doesn't create shortcut link)
-python -m spacy download en_core_web_sm-2.2.0 --direct
+# Download exact package version
+$ python -m spacy download en_core_web_sm-3.0.0 --direct
```
-The download command will [install the model](/usage/models#download-pip) via
+The download command will [install the package](/usage/models#download-pip) via
pip and place the package in your `site-packages` directory.
-```bash
-pip install spacy
-python -m spacy download en_core_web_sm
+```cli
+$ pip install -U %%SPACY_PKG_NAME%%SPACY_PKG_FLAGS
+$ python -m spacy download en_core_web_sm
```
```python
@@ -257,146 +297,97 @@ nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")
```
-
-
-If you're downloading the models using a shortcut like `"en"`, spaCy will create
-a symlink within the `spacy/data` directory. This means that your user needs the
-**required permissions**. If you've installed spaCy to a system directory and
-don't have admin privileges, the model linking may fail.
The easiest solution is -to re-run the command as admin, set the `--user` flag or use a virtual -environment. For more info on this, see the -[troubleshooting guide](/usage/#symlink-privilege). - - - ### Installation via pip {#download-pip} -To download a model directly using [pip](https://pypi.python.org/pypi/pip), -point `pip install` to the URL or local path of the archive file. To find the -direct link to a model, head over to the -[model releases](https://github.com/explosion/spacy-models/releases), right -click on the archive link and copy it to your clipboard. +To download a trained pipeline directly using +[pip](https://pypi.python.org/pypi/pip), point `pip install` to the URL or local +path of the archive file. To find the direct link to a package, head over to the +[releases](https://github.com/explosion/spacy-models/releases), right click on +the archive link and copy it to your clipboard. ```bash # With external URL -pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz +$ pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz # With local file -pip install /Users/you/en_core_web_sm-2.2.0.tar.gz +$ pip install /Users/you/en_core_web_sm-3.0.0.tar.gz ``` -By default, this will install the model into your `site-packages` directory. You -can then use `spacy.load()` to load it via its package name, create a -[shortcut link](#usage-link) to assign it a custom name, or +By default, this will install the pipeline package into your `site-packages` +directory. You can then use `spacy.load` to load it via its package name or [import it](#usage-import) explicitly as a module. If you need to download -models as part of an automated process, we recommend using pip with a direct -link, instead of relying on spaCy's [`download`](/api/cli#download) command. +pipeline packages as part of an automated process, we recommend using pip with a +direct link, instead of relying on spaCy's [`download`](/api/cli#download) +command. You can also add the direct download link to your application's `requirements.txt`. For more details, see the section on -[working with models in production](#production). +[working with pipeline packages in production](#production). ### Manual download and installation {#download-manual} In some cases, you might prefer downloading the data manually, for example to -place it into a custom directory. You can download the model via your browser +place it into a custom directory. You can download the package via your browser from the [latest releases](https://github.com/explosion/spacy-models/releases), or configure your own download script using the URL of the archive file. The -archive consists of a model directory that contains another directory with the -model data. +archive consists of a package directory that contains another directory with the +pipeline data. 
```yaml
-### Directory structure {highlight="7"}
-└── en_core_web_md-2.2.0.tar.gz # downloaded archive
- ├── meta.json # model meta data
+### Directory structure {highlight="6"}
+└── en_core_web_md-3.0.0.tar.gz # downloaded archive
├── setup.py # setup file for pip installation
- └── en_core_web_md # 📦 model package
+ ├── meta.json # copy of pipeline meta
+ └── en_core_web_md # 📦 pipeline package
├── __init__.py # init for pip installation
- ├── meta.json # model meta data
- └── en_core_web_md-2.2.0 # model data
+ └── en_core_web_md-3.0.0 # pipeline data
+ ├── config.cfg # pipeline config
+ ├── meta.json # pipeline meta
+ └── ... # directories with component data
```
-You can place the **model package directory** anywhere on your local file
-system. To use it with spaCy, assign it a name by creating a shortcut link for
-the data directory.
+You can place the **pipeline package directory** anywhere on your local file
+system.
-### Using models with spaCy {#usage}
+### Using trained pipelines with spaCy {#usage}
-To load a model, use [`spacy.load`](/api/top-level#spacy.load) with the model's
-shortcut link, package name or a path to the data directory:
+To load a pipeline package, use [`spacy.load`](/api/top-level#spacy.load) with
+the package name or a path to the data directory:
+
+> #### Important note for v3.0
+>
+> Note that as of spaCy v3.0, shortcut links like `en` that create (potentially
+> brittle) symlinks in your spaCy installation are **deprecated**. To download
+> and load an installed pipeline package, use its full name:
+>
+> ```diff
+> - python -m spacy download en
+> + python -m spacy download en_core_web_sm
+> ```
```python
import spacy
-nlp = spacy.load("en_core_web_sm") # load model package "en_core_web_sm"
+nlp = spacy.load("en_core_web_sm") # load package "en_core_web_sm"
nlp = spacy.load("/path/to/en_core_web_sm") # load package from a directory
-nlp = spacy.load("en") # load model with shortcut link "en"
doc = nlp("This is a sentence.")
```
-
+
You can use the [`info`](/api/cli#info) command or
-[`spacy.info()`](/api/top-level#spacy.info) method to print a model's meta data
-before loading it. Each `Language` object with a loaded model also exposes the
-model's meta data as the attribute `meta`. For example, `nlp.meta['version']`
-will return the model's version.
+[`spacy.info()`](/api/top-level#spacy.info) method to print a pipeline
+package's meta data before loading it. Each `Language` object with a loaded
+pipeline also exposes the pipeline's meta data as the attribute `meta`. For
+example, `nlp.meta['version']` will return the package version.
-### Using custom shortcut links {#usage-link}
+### Importing pipeline packages as modules {#usage-import}
-While previous versions of spaCy required you to maintain a data directory
-containing the models for each installation, you can now choose **how and where
-you want to keep your data**. For example, you could download all models
-manually and put them into a local directory. Whenever your spaCy projects need
-a model, you create a shortcut link to tell spaCy to load it from there. This
-means you'll never end up with duplicate data.
-
-The [`link`](/api/cli#link) command will create a symlink in the `spacy/data`
-directory.
-
-> #### Why does spaCy use symlinks?
->
-> Symlinks were originally introduced to maintain backwards compatibility, as
-> older versions expected model data to live within `spacy/data`. However, we
-> decided to keep using them in v2.0 instead of opting for a config file.
-> There'll always be a need for assigning and saving custom model names or IDs. -> And your system already comes with a native solution to mapping unicode -> aliases to file paths: symbolic links. - -```bash -$ python -m spacy link [package name or path] [shortcut] [--force] -``` - -The first argument is the **package name** (if the model was installed via pip), -or a local path to the the **model package**. The second argument is the -internal name you want to use for the model. Setting the `--force` flag will -overwrite any existing links. - -```bash -### Examples -# set up shortcut link to load installed package as "en_default" -python -m spacy link en_core_web_md en_default - -# set up shortcut link to load local model as "my_amazing_model" -python -m spacy link /Users/you/model my_amazing_model -``` - - - -In order to create a symlink, your user needs the **required permissions**. If -you've installed spaCy to a system directory and don't have admin privileges, -the `spacy link` command may fail. The easiest solution is to re-run the command -as admin, set the `--user` flag or use a virtual environment. For more info on -this, see the [troubleshooting guide](/usage/#symlink-privilege). - - - -### Importing models as modules {#usage-import} - -If you've installed a model via spaCy's downloader, or directly via pip, you can -also `import` it and then call its `load()` method with no arguments: +If you've installed a trained pipeline via [`spacy download`](/api/cli#download) +or directly via pip, you can also `import` it and then call its `load()` method +with no arguments: ```python ### {executable="true"} @@ -406,53 +397,36 @@ nlp = en_core_web_sm.load() doc = nlp("This is a sentence.") ``` -How you choose to load your models ultimately depends on personal preference. -However, **for larger code bases**, we usually recommend native imports, as this -will make it easier to integrate models with your existing build process, -continuous integration workflow and testing framework. It'll also prevent you -from ever trying to load a model that is not installed, as your code will raise -an `ImportError` immediately, instead of failing somewhere down the line when -calling `spacy.load()`. +How you choose to load your trained pipelines ultimately depends on personal +preference. However, **for larger code bases**, we usually recommend native +imports, as this will make it easier to integrate pipeline packages with your +existing build process, continuous integration workflow and testing framework. +It'll also prevent you from ever trying to load a package that is not installed, +as your code will raise an `ImportError` immediately, instead of failing +somewhere down the line when calling `spacy.load()`. For more details, see the +section on [working with pipeline packages in production](#production). -For more details, see the section on -[working with models in production](#production). +## Using trained pipelines in production {#production} -### Using your own models {#own-models} +If your application depends on one or more trained pipeline packages, you'll +usually want to integrate them into your continuous integration workflow and +build process. While spaCy provides a range of useful helpers for downloading +and loading pipeline packages, the underlying functionality is entirely based on +native Python packaging. This allows your application to handle a spaCy pipeline +like any other package dependency. 
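Since a pipeline package is just a regular Python package, your application can treat it exactly like any other dependency, for example by failing fast at startup if it isn't installed. The snippet below is a minimal sketch of that idea, assuming `en_core_web_sm` as the package name and using only the standard library's `importlib.metadata`:

```python
### Checking for a pipeline package at startup (sketch)
from importlib.metadata import version, PackageNotFoundError  # Python 3.8+

import spacy

PIPELINE = "en_core_web_sm"  # hypothetical package your application depends on

try:
    # The pipeline is a normal Python package, so its installed version
    # can be inspected like that of any other dependency.
    print(f"Using {PIPELINE} v{version(PIPELINE)}")
except PackageNotFoundError:
    raise SystemExit(f"{PIPELINE} is not installed. Add it to your requirements.")

nlp = spacy.load(PIPELINE)
```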
-If you've trained your own model, for example for -[additional languages](/usage/adding-languages) or -[custom named entities](/usage/training#ner), you can save its state using the -[`Language.to_disk()`](/api/language#to_disk) method. To make the model more -convenient to deploy, we recommend wrapping it as a Python package. - -For more information and a detailed guide on how to package your model, see the -documentation on [saving and loading models](/usage/saving-loading#models). - -## Using models in production {#production} - -If your application depends on one or more models, you'll usually want to -integrate them into your continuous integration workflow and build process. -While spaCy provides a range of useful helpers for downloading, linking and -loading models, the underlying functionality is entirely based on native Python -packages. This allows your application to handle a model like any other package -dependency. - -For an example of an automated model training and build process, see -[this overview](/usage/training#example-training-spacy) of how we're training -and packaging our models for spaCy. - -### Downloading and requiring model dependencies {#models-download} +### Downloading and requiring package dependencies {#models-download} spaCy's built-in [`download`](/api/cli#download) command is mostly intended as a convenient, interactive wrapper. It performs compatibility checks and prints -detailed error messages and warnings. However, if you're downloading models as -part of an automated build process, this only adds an unnecessary layer of -complexity. If you know which models your application needs, you should be -specifying them directly. +detailed error messages and warnings. However, if you're downloading pipeline +packages as part of an automated build process, this only adds an unnecessary +layer of complexity. If you know which packages your application needs, you +should be specifying them directly. -Because all models are valid Python packages, you can add them to your +Because pipeline packages are valid Python packages, you can add them to your application's `requirements.txt`. If you're running your own internal PyPi -installation, you can upload the models there. pip's +installation, you can upload the pipeline packages there. pip's [requirements file format](https://pip.pypa.io/en/latest/reference/pip_install/#requirements-file-format) supports both package names to download via a PyPi server, as well as direct URLs. @@ -468,18 +442,17 @@ the download URL. This way, the package won't be re-downloaded and overwritten if it's already installed - just like when you're downloading a package from PyPi. -All models are versioned and specify their spaCy dependency. This ensures -cross-compatibility and lets you specify exact version requirements for each -model. If you've trained your own model, you can use the -[`package`](/api/cli#package) command to generate the required meta data and -turn it into a loadable package. +All pipeline packages are versioned and specify their spaCy dependency. This +ensures cross-compatibility and lets you specify exact version requirements for +each pipeline. If you've [trained](/usage/training) your own pipeline, you can +use the [`spacy package`](/api/cli#package) command to generate the required +meta data and turn it into a loadable package. 
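If you want to verify at runtime that the installed package matches what your application expects, the pipeline's meta data is one place to look. The sketch below is illustrative only; the expected version string is a hypothetical value, and the exact meta fields can vary between packages:

```python
### Verifying a pipeline's version via its meta data (sketch)
import spacy

EXPECTED_VERSION = "3.0.0"  # hypothetical version your application was tested with

nlp = spacy.load("en_core_web_sm")
meta = nlp.meta  # contents of the package's meta.json as a dict

if meta["version"] != EXPECTED_VERSION:
    print(f"Warning: expected en_core_web_sm {EXPECTED_VERSION}, got {meta['version']}")
# The meta data also records which spaCy versions the package was built for
print("Compatible spaCy versions:", meta.get("spacy_version"))
```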
-### Loading and testing models {#models-loading}
+### Loading and testing pipeline packages {#models-loading}
-Downloading models directly via pip won't call spaCy's link
-[`package`](/api/cli#link) command, which creates symlinks for model shortcuts.
-This means that you'll have to run this command separately, or use the native
-`import` syntax to load the models:
+Pipeline packages are regular Python packages, so you can also import them as a
+package using Python's native `import` syntax, and then call the `load` method
+to load the data and return an `nlp` object:
```python
import en_core_web_sm
@@ -487,16 +460,17 @@ nlp = en_core_web_sm.load()
```
In general, this approach is recommended for larger code bases, as it's more
-"native", and doesn't depend on symlinks or rely on spaCy's loader to resolve
-string names to model packages. If a model can't be imported, Python will raise
-an `ImportError` immediately. And if a model is imported but not used, any
-linter will catch that.
+"native", and doesn't rely on spaCy's loader to resolve string names to
+packages. If a package can't be imported, Python will raise an `ImportError`
+immediately. And if a package is imported but not used, any linter will catch
+that.
Similarly, it'll give you more flexibility when writing tests that require
-loading models. For example, instead of writing your own `try` and `except`
+loading pipelines. For example, instead of writing your own `try` and `except`
logic around spaCy's loader, you can use
[pytest](http://pytest.readthedocs.io/en/latest/)'s
[`importorskip()`](https://docs.pytest.org/en/latest/builtin.html#_pytest.outcomes.importorskip)
-method to only run a test if a specific model or model version is installed.
-Each model package exposes a `__version__` attribute which you can also use to
-perform your own version compatibility checks before loading a model.
+method to only run a test if a specific pipeline package or version is
+installed. Each pipeline package exposes a `__version__` attribute which
+you can also use to perform your own version compatibility checks before loading
+it.
diff --git a/website/docs/usage/processing-pipelines.md b/website/docs/usage/processing-pipelines.md
index b7b840999..a0cf36909 100644
--- a/website/docs/usage/processing-pipelines.md
+++ b/website/docs/usage/processing-pipelines.md
@@ -1,10 +1,13 @@
---
title: Language Processing Pipelines
-next: vectors-similarity
+next: /usage/embeddings-transformers
menu:
  - ['Processing Text', 'processing']
-  - ['How Pipelines Work', 'pipelines']
+  - ['Pipelines & Components', 'pipelines']
  - ['Custom Components', 'custom-components']
+  - ['Component Data', 'component-data']
+  - ['Type Hints & Validation', 'type-hints']
+  - ['Trainable Components', 'trainable-components']
  - ['Extension Attributes', 'custom-components-attributes']
  - ['Plugins & Wrappers', 'plugins']
---
@@ -34,7 +37,7 @@ texts = ["This is a text", "These are lots of texts", "..."]
+ docs = list(nlp.pipe(texts))
```
-
+
- Process the texts **as a stream** using [`nlp.pipe`](/api/language#pipe) and
  buffer them in batches, instead of one-by-one. This is usually much more
@@ -42,8 +45,8 @@ texts = ["This is a text", "These are lots of texts", "..."]
- Only apply the **pipeline components you need**. Getting predictions from the
  model that you don't actually need adds up and becomes very inefficient at
  scale.
To prevent this, use the `disable` keyword argument to disable - components you don't need – either when loading a model, or during processing - with `nlp.pipe`. See the section on + components you don't need – either when loading a pipeline, or during + processing with `nlp.pipe`. See the section on [disabling pipeline components](#disabling) for more details and examples. @@ -89,42 +92,50 @@ have to call `list()` on it first: -## How pipelines work {#pipelines} +## Pipelines and built-in components {#pipelines} spaCy makes it very easy to create your own pipelines consisting of reusable components – this includes spaCy's default tagger, parser and entity recognizer, but also your own custom processing functions. A pipeline component can be added -to an already existing `nlp` object, specified when initializing a `Language` -class, or defined within a [model package](/usage/saving-loading#models). +to an already existing `nlp` object, specified when initializing a +[`Language`](/api/language) class, or defined within a +[pipeline package](/usage/saving-loading#models). -When you load a model, spaCy first consults the model's -[`meta.json`](/usage/saving-loading#models). The meta typically includes the -model details, the ID of a language class, and an optional list of pipeline -components. spaCy then does the following: - -> #### meta.json (excerpt) +> #### config.cfg (excerpt) > -> ```json -> { -> "lang": "en", -> "name": "core_web_sm", -> "description": "Example model for spaCy", -> "pipeline": ["tagger", "parser", "ner"] -> } +> ```ini +> [nlp] +> lang = "en" +> pipeline = ["tok2vec", "parser"] +> +> [components] +> +> [components.tok2vec] +> factory = "tok2vec" +> # Settings for the tok2vec component +> +> [components.parser] +> factory = "parser" +> # Settings for the parser component > ``` +When you load a pipeline, spaCy first consults the +[`meta.json`](/usage/saving-loading#models) and +[`config.cfg`](/usage/training#config). The config tells spaCy what language +class to use, which components are in the pipeline, and how those components +should be created. spaCy will then do the following: + 1. Load the **language class and data** for the given ID via [`get_lang_class`](/api/top-level#util.get_lang_class) and initialize it. The `Language` class contains the shared vocabulary, tokenization rules and the - language-specific annotation scheme. -2. Iterate over the **pipeline names** and create each component using - [`create_pipe`](/api/language#create_pipe), which looks them up in - `Language.factories`. -3. Add each pipeline component to the pipeline in order, using - [`add_pipe`](/api/language#add_pipe). -4. Make the **model data** available to the `Language` class by calling - [`from_disk`](/api/language#from_disk) with the path to the model data - directory. + language-specific settings. +2. Iterate over the **pipeline names** and look up each component name in the + `[components]` block. The `factory` tells spaCy which + [component factory](#custom-components-factories) to use for adding the + component with [`add_pipe`](/api/language#add_pipe). The settings are passed + into the factory. +3. Make the **model data** available to the `Language` class by calling + [`from_disk`](/api/language#from_disk) with the path to the data directory. So when you call this... @@ -132,19 +143,27 @@ So when you call this... nlp = spacy.load("en_core_web_sm") ``` -... the model's `meta.json` tells spaCy to use the language `"en"` and the -pipeline `["tagger", "parser", "ner"]`. 
spaCy will then initialize +... the pipeline's `config.cfg` tells spaCy to use the language `"en"` and the +pipeline `["tok2vec", "tagger", "parser", "ner"]`. spaCy will then initialize `spacy.lang.en.English`, and create each pipeline component and add it to the -processing pipeline. It'll then load in the model's data from its data directory +processing pipeline. It'll then load in the model data from the data directory and return the modified `Language` class for you to use as the `nlp` object. -Fundamentally, a [spaCy model](/models) consists of three components: **the -weights**, i.e. binary data loaded in from a directory, a **pipeline** of + + +spaCy v3.0 introduces a `config.cfg`, which includes more detailed settings for +the pipeline, its components and the [training process](/usage/training#config). +You can export the config of your current `nlp` object by calling +[`nlp.config.to_disk`](/api/language#config). + + + +Fundamentally, a [spaCy pipeline package](/models) consists of three components: +**the weights**, i.e. binary data loaded in from a directory, a **pipeline** of functions called in order, and **language data** like the tokenization rules and -annotation scheme. All of this is specific to each model, and defined in the -model's `meta.json` – for example, a Spanish NER model requires different -weights, language data and pipeline components than an English parsing and -tagging model. This is also why the pipeline state is always held by the +language-specific settings. For example, a Spanish NER pipeline requires +different weights, language data and components than an English parsing and +tagging pipeline. This is also why the pipeline state is always held by the `Language` class. [`spacy.load`](/api/top-level#spacy.load) puts this all together and returns an instance of `Language` with a pipeline set and access to the binary data: @@ -152,15 +171,14 @@ the binary data: ```python ### spacy.load under the hood lang = "en" -pipeline = ["tagger", "parser", "ner"] -data_path = "path/to/en_core_web_sm/en_core_web_sm-2.0.0" +pipeline = ["tok2vec", "tagger", "parser", "ner"] +data_path = "path/to/en_core_web_sm/en_core_web_sm-3.0.0" -cls = spacy.util.get_lang_class(lang) # 1. Get Language instance, e.g. English() -nlp = cls() # 2. Initialize it +cls = spacy.util.get_lang_class(lang) # 1. Get Language class, e.g. English +nlp = cls() # 2. Initialize it for name in pipeline: - component = nlp.create_pipe(name) # 3. Create the pipeline components - nlp.add_pipe(component) # 4. Add the component to the pipeline -nlp.from_disk(model_data_path) # 5. Load in the binary data + nlp.add_pipe(name) # 3. Add the component to the pipeline +nlp.from_disk(data_path) # 4. Load in the binary data ``` When you call `nlp` on a text, spaCy will **tokenize** it and then **call each @@ -172,9 +190,9 @@ which is then processed by the component next in the pipeline. ```python ### The pipeline under the hood -doc = nlp.make_doc("This is a sentence") # create a Doc from raw text -for name, proc in nlp.pipeline: # iterate over components in order - doc = proc(doc) # apply each component +doc = nlp.make_doc("This is a sentence") # Create a Doc from raw text +for name, proc in nlp.pipeline: # Iterate over components in order + doc = proc(doc) # Apply each component ``` The current processing pipeline is available as `nlp.pipeline`, which returns a @@ -183,147 +201,349 @@ list of human-readable component names. 
```python print(nlp.pipeline) -# [('tagger', ), ('parser', ), ('ner', )] +# [('tok2vec', ), ('tagger', ), ('parser', ), ('ner', )] print(nlp.pipe_names) -# ['tagger', 'parser', 'ner'] +# ['tok2vec', 'tagger', 'parser', 'ner'] ``` ### Built-in pipeline components {#built-in} -spaCy ships with several built-in pipeline components that are also available in -the `Language.factories`. This means that you can initialize them by calling -[`nlp.create_pipe`](/api/language#create_pipe) with their string names and -require them in the pipeline settings in your model's `meta.json`. +spaCy ships with several built-in pipeline components that are registered with +string names. This means that you can initialize them by calling +[`nlp.add_pipe`](/api/language#add_pipe) with their names and spaCy will know +how to create them. See the [API documentation](/api) for a full list of +available pipeline components and component functions. > #### Usage > > ```python -> # Option 1: Import and initialize -> from spacy.pipeline import EntityRuler -> ruler = EntityRuler(nlp) -> nlp.add_pipe(ruler) -> -> # Option 2: Using nlp.create_pipe -> sentencizer = nlp.create_pipe("sentencizer") -> nlp.add_pipe(sentencizer) +> nlp = spacy.blank("en") +> nlp.add_pipe("sentencizer") +> # add_pipe returns the added component +> ruler = nlp.add_pipe("entity_ruler") > ``` -| String name | Component | Description | -| ------------------- | ---------------------------------------------------------------- | --------------------------------------------------------------------------------------------- | -| `tagger` | [`Tagger`](/api/tagger) | Assign part-of-speech-tags. | -| `parser` | [`DependencyParser`](/api/dependencyparser) | Assign dependency labels. | -| `ner` | [`EntityRecognizer`](/api/entityrecognizer) | Assign named entities. | -| `entity_linker` | [`EntityLinker`](/api/entitylinker) | Assign knowledge base IDs to named entities. Should be added after the entity recognizer. | -| `textcat` | [`TextCategorizer`](/api/textcategorizer) | Assign text categories. | -| `entity_ruler` | [`EntityRuler`](/api/entityruler) | Assign named entities based on pattern rules. | -| `sentencizer` | [`Sentencizer`](/api/sentencizer) | Add rule-based sentence segmentation without the dependency parse. | -| `merge_noun_chunks` | [`merge_noun_chunks`](/api/pipeline-functions#merge_noun_chunks) | Merge all noun chunks into a single token. Should be added after the tagger and parser. | -| `merge_entities` | [`merge_entities`](/api/pipeline-functions#merge_entities) | Merge all entities into a single token. Should be added after the entity recognizer. | -| `merge_subtokens` | [`merge_subtokens`](/api/pipeline-functions#merge_subtokens) | Merge subtokens predicted by the parser into single tokens. Should be added after the parser. | +| String name | Component | Description | +| ----------------- | ----------------------------------------------- | ----------------------------------------------------------------------------------------- | +| `tagger` | [`Tagger`](/api/tagger) | Assign part-of-speech-tags. | +| `parser` | [`DependencyParser`](/api/dependencyparser) | Assign dependency labels. | +| `ner` | [`EntityRecognizer`](/api/entityrecognizer) | Assign named entities. | +| `entity_linker` | [`EntityLinker`](/api/entitylinker) | Assign knowledge base IDs to named entities. Should be added after the entity recognizer. | +| `entity_ruler` | [`EntityRuler`](/api/entityruler) | Assign named entities based on pattern rules and dictionaries. 
| +| `textcat` | [`TextCategorizer`](/api/textcategorizer) | Assign text categories. | +| `lemmatizer` | [`Lemmatizer`](/api/lemmatizer) | Assign base forms to words. | +| `morphologizer` | [`Morphologizer`](/api/morphologizer) | Assign morphological features and coarse-grained POS tags. | +| `attribute_ruler` | [`AttributeRuler`](/api/attributeruler) | Assign token attribute mappings and rule-based exceptions. | +| `senter` | [`SentenceRecognizer`](/api/sentencerecognizer) | Assign sentence boundaries. | +| `sentencizer` | [`Sentencizer`](/api/sentencizer) | Add rule-based sentence segmentation without the dependency parse. | +| `tok2vec` | [`Tok2Vec`](/api/tok2vec) | Assign token-to-vector embeddings. | +| `transformer` | [`Transformer`](/api/transformer) | Assign the tokens and outputs of a transformer model. | -### Disabling and modifying pipeline components {#disabling} +### Disabling, excluding and modifying components {#disabling} If you don't need a particular component of the pipeline – for example, the -tagger or the parser, you can **disable loading** it. This can sometimes make a -big difference and improve loading speed. Disabled component names can be -provided to [`spacy.load`](/api/top-level#spacy.load), -[`Language.from_disk`](/api/language#from_disk) or the `nlp` object itself as a -list: +tagger or the parser, you can **disable or exclude** it. This can sometimes make +a big difference and improve loading and inference speed. There are two +different mechanisms you can use: + +1. **Disable:** The component and its data will be loaded with the pipeline, but + it will be disabled by default and not run as part of the processing + pipeline. To run it, you can explicitly enable it by calling + [`nlp.enable_pipe`](/api/language#enable_pipe). When you save out the `nlp` + object, the disabled component will be included but disabled by default. +2. **Exclude:** Don't load the component and its data with the pipeline. Once + the pipeline is loaded, there will be no reference to the excluded component. + +Disabled and excluded component names can be provided to +[`spacy.load`](/api/top-level#spacy.load) as a list. + +> #### 💡 Optional pipeline components +> +> The `disable` mechanism makes it easy to distribute pipeline packages with +> optional components that you can enable or disable at runtime. For instance, +> your pipeline may include a statistical _and_ a rule-based component for +> sentence segmentation, and you can choose which one to run depending on your +> use case. +> +> For example, spaCy's [trained pipelines](/models) like +> [`en_core_web_sm`](/models/en#en_core_web_sm) contain both a `parser` and +> `senter` that perform sentence segmentation, but the `senter` is disabled by +> default. ```python -### Disable loading +# Load the pipeline without the entity recognizer +nlp = spacy.load("en_core_web_sm", exclude=["ner"]) + +# Load the tagger and parser but don't enable them nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser"]) -nlp = English().from_disk("/model", disable=["ner"]) +# Explicitly enable the tagger later on +nlp.enable_pipe("tagger") ``` -In some cases, you do want to load all pipeline components and their weights, -because you need them at different points in your application. However, if you -only need a `Doc` object with named entities, there's no need to run all -pipeline components on it – that can potentially make processing much slower. 
-Instead, you can use the `disable` keyword argument on -[`nlp.pipe`](/api/language#pipe) to temporarily disable the components **during -processing**: + -```python -### Disable for processing -for doc in nlp.pipe(texts, disable=["tagger", "parser"]): - # Do something with the doc here -``` +As of v3.0, the `disable` keyword argument specifies components to load but +disable, instead of components to not load at all. Those components can now be +specified separately using the new `exclude` keyword argument. -If you need to **execute more code** with components disabled – e.g. to reset -the weights or update only some components during training – you can use the -[`nlp.disable_pipes`](/api/language#disable_pipes) contextmanager. At the end of -the `with` block, the disabled pipeline components will be restored -automatically. Alternatively, `disable_pipes` returns an object that lets you + + +As a shortcut, you can use the [`nlp.select_pipes`](/api/language#select_pipes) +context manager to temporarily disable certain components for a given block. At +the end of the `with` block, the disabled pipeline components will be restored +automatically. Alternatively, `select_pipes` returns an object that lets you call its `restore()` method to restore the disabled components when needed. This can be useful if you want to prevent unnecessary code indentation of large blocks. ```python ### Disable for block -# 1. Use as a contextmanager -with nlp.disable_pipes("tagger", "parser"): +# 1. Use as a context manager +with nlp.select_pipes(disable=["tagger", "parser"]): doc = nlp("I won't be tagged and parsed") doc = nlp("I will be tagged and parsed") # 2. Restore manually -disabled = nlp.disable_pipes("ner") +disabled = nlp.select_pipes(disable="ner") doc = nlp("I won't have named entities") disabled.restore() ``` +If you want to disable all pipes except for one or a few, you can use the +`enable` keyword. Just like the `disable` keyword, it takes a list of pipe +names, or a string defining just one pipe. + +```python +# Enable only the parser +with nlp.select_pipes(enable="parser"): + doc = nlp("I will only be parsed") +``` + +The [`nlp.pipe`](/api/language#pipe) method also supports a `disable` keyword +argument if you only want to disable components during processing: + +```python +for doc in nlp.pipe(texts, disable=["tagger", "parser"]): + # Do something with the doc here +``` + Finally, you can also use the [`remove_pipe`](/api/language#remove_pipe) method to remove pipeline components from an existing pipeline, the [`rename_pipe`](/api/language#rename_pipe) method to rename them, or the [`replace_pipe`](/api/language#replace_pipe) method to replace them with a custom component entirely (more details on this in the section on -[custom components](#custom-components). +[custom components](#custom-components)). ```python nlp.remove_pipe("parser") nlp.rename_pipe("ner", "entityrecognizer") -nlp.replace_pipe("tagger", my_custom_tagger) +nlp.replace_pipe("tagger", "my_custom_tagger") ``` - +The `Language` object exposes different [attributes](/api/language#attributes) +that let you inspect all available components and the components that currently +run as part of the pipeline. -Since spaCy v2.0 comes with better support for customizing the processing -pipeline components, the `parser`, `tagger` and `entity` keyword arguments have -been replaced with `disable`, which takes a list of pipeline component names. 
-This lets you disable pre-defined components when loading a model, or -initializing a Language class via [`from_disk`](/api/language#from_disk). +> #### Example +> +> ```python +> nlp = spacy.blank("en") +> nlp.add_pipe("ner") +> nlp.add_pipe("textcat") +> assert nlp.pipe_names == ["ner", "textcat"] +> nlp.disable_pipe("ner") +> assert nlp.pipe_names == ["textcat"] +> assert nlp.component_names == ["ner", "textcat"] +> assert nlp.disabled == ["ner"] +> ``` -```diff -- nlp = spacy.load('en', tagger=False, entity=False) -- doc = nlp("I don't want parsed", parse=False) +| Name | Description | +| --------------------- | ---------------------------------------------------------------- | +| `nlp.pipeline` | `(name, component)` tuples of the processing pipeline, in order. | +| `nlp.pipe_names` | Pipeline component names, in order. | +| `nlp.components` | All `(name, component)` tuples, including disabled components. | +| `nlp.component_names` | All component names, including disabled components. | +| `nlp.disabled` | Names of components that are currently disabled. | -+ nlp = spacy.load("en", disable=["ner"]) -+ nlp.remove_pipe("parser") -+ doc = nlp("I don't want parsed") +### Sourcing components from existing pipelines {#sourced-components new="3"} + +Pipeline components that are independent can also be reused across pipelines. +Instead of adding a new blank component, you can also copy an existing component +from a trained pipeline by setting the `source` argument on +[`nlp.add_pipe`](/api/language#add_pipe). The first argument will then be +interpreted as the name of the component in the source pipeline – for instance, +`"ner"`. This is especially useful for +[training a pipeline](/usage/training#config-components) because it lets you mix +and match components and create fully custom pipeline packages with updated +trained components and new components trained on your data. + + + +When reusing components across pipelines, keep in mind that the **vocabulary**, +**vectors** and model settings **must match**. If a trained pipeline includes +[word vectors](/usage/linguistic-features#vectors-similarity) and the component +uses them as features, the pipeline you copy it to needs to have the _same_ +vectors available – otherwise, it won't be able to make the same predictions. + + + +> #### In training config +> +> Instead of providing a `factory`, component blocks in the training +> [config](/usage/training#config) can also define a `source`. The string needs +> to be a loadable spaCy pipeline package or path. +> +> ```ini +> [components.ner] +> source = "en_core_web_sm" +> component = "ner" +> ``` +> +> By default, sourced components will be updated with your data during training. +> If you want to preserve the component as-is, you can "freeze" it: +> +> ```ini +> [training] +> frozen_components = ["ner"] +> ``` + +```python +### {executable="true"} +import spacy + +# The source pipeline with different components +source_nlp = spacy.load("en_core_web_sm") +print(source_nlp.pipe_names) + +# Add only the entity recognizer to the new blank pipeline +nlp = spacy.blank("en") +nlp.add_pipe("ner", source=source_nlp) +print(nlp.pipe_names) ``` +### Analyzing pipeline components {#analysis new="3"} + +The [`nlp.analyze_pipes`](/api/language#analyze_pipes) method analyzes the +components in the current pipeline and outputs information about them like the +attributes they set on the [`Doc`](/api/doc) and [`Token`](/api/token), whether +they retokenize the `Doc` and which scores they produce during training. 
It will +also show warnings if components require values that aren't set by previous +component – for instance, if the entity linker is used but no component that +runs before it sets named entities. Setting `pretty=True` will pretty-print a +table instead of only returning the structured data. + +> #### ✏️ Things to try +> +> 1. Add the components `"ner"` and `"sentencizer"` _before_ the +> `"entity_linker"`. The analysis should now show no problems, because +> requirements are met. + +```python +### {executable="true"} +import spacy + +nlp = spacy.blank("en") +nlp.add_pipe("tagger") +# This is a problem because it needs entities and sentence boundaries +nlp.add_pipe("entity_linker") +analysis = nlp.analyze_pipes(pretty=True) +``` + + + +```json +### Structured +{ + "summary": { + "tagger": { + "assigns": ["token.tag"], + "requires": [], + "scores": ["tag_acc", "pos_acc", "lemma_acc"], + "retokenizes": false + }, + "entity_linker": { + "assigns": ["token.ent_kb_id"], + "requires": ["doc.ents", "doc.sents", "token.ent_iob", "token.ent_type"], + "scores": [], + "retokenizes": false + } + }, + "problems": { + "tagger": [], + "entity_linker": ["doc.ents", "doc.sents", "token.ent_iob", "token.ent_type"] + }, + "attrs": { + "token.ent_iob": { "assigns": [], "requires": ["entity_linker"] }, + "doc.ents": { "assigns": [], "requires": ["entity_linker"] }, + "token.ent_kb_id": { "assigns": ["entity_linker"], "requires": [] }, + "doc.sents": { "assigns": [], "requires": ["entity_linker"] }, + "token.tag": { "assigns": ["tagger"], "requires": [] }, + "token.ent_type": { "assigns": [], "requires": ["entity_linker"] } + } +} +``` + +``` +### Pretty +============================= Pipeline Overview ============================= + +# Component Assigns Requires Scores Retokenizes +- ------------- --------------- -------------- --------- ----------- +0 tagger token.tag tag_acc False + pos_acc + lemma_acc + +1 entity_linker token.ent_kb_id doc.ents False + doc.sents + token.ent_iob + token.ent_type + + +================================ Problems (4) ================================ +⚠ 'entity_linker' requirements not met: doc.ents, doc.sents, +token.ent_iob, token.ent_type +``` + + + + + +The pipeline analysis is static and does **not actually run the components**. +This means that it relies on the information provided by the components +themselves. If a custom component declares that it assigns an attribute but it +doesn't, the pipeline analysis won't catch that. + ## Creating custom pipeline components {#custom-components} -A component receives a `Doc` object and can modify it – for example, by using -the current weights to make a prediction and set some annotation on the -document. By adding a component to the pipeline, you'll get access to the `Doc` -at any point **during processing** – instead of only being able to modify it -afterwards. +A pipeline component is a function that receives a `Doc` object, modifies it and +returns it – for example, by using the current weights to make a prediction and +set some annotation on the document. By adding a component to the pipeline, +you'll get access to the `Doc` at any point **during processing** – instead of +only being able to modify it afterwards. 
> #### Example > > ```python +> from spacy.language import Language +> +> @Language.component("my_component") > def my_component(doc): -> # do something to the doc here +> # Do something to the doc here > return doc > ``` -| Argument | Type | Description | -| ----------- | ----- | ------------------------------------------------------ | -| `doc` | `Doc` | The `Doc` object processed by the previous component. | -| **RETURNS** | `Doc` | The `Doc` object processed by this pipeline component. | +| Argument | Type | Description | +| ----------- | ----------------- | ------------------------------------------------------ | +| `doc` | [`Doc`](/api/doc) | The `Doc` object processed by the previous component. | +| **RETURNS** | [`Doc`](/api/doc) | The `Doc` object processed by this pipeline component. | + +The [`@Language.component`](/api/language#component) decorator lets you turn a +simple function into a pipeline component. It takes at least one argument, the +**name** of the component factory. You can use this name to add an instance of +your component to the pipeline. It can also be listed in your pipeline config, +so you can save, load and train pipelines using your component. Custom components can be added to the pipeline using the [`add_pipe`](/api/language#add_pipe) method. Optionally, you can either specify @@ -334,23 +554,43 @@ last** in the pipeline, or define a **custom name**. If no name is set and no > #### Example > > ```python -> nlp.add_pipe(my_component) -> nlp.add_pipe(my_component, first=True) -> nlp.add_pipe(my_component, before="parser") +> nlp.add_pipe("my_component") +> nlp.add_pipe("my_component", first=True) +> nlp.add_pipe("my_component", before="parser") > ``` -| Argument | Type | Description | -| -------- | ------- | ------------------------------------------------------------------------ | -| `last` | bool | If set to `True`, component is added **last** in the pipeline (default). | -| `first` | bool | If set to `True`, component is added **first** in the pipeline. | -| `before` | unicode | String name of component to add the new component **before**. | -| `after` | unicode | String name of component to add the new component **after**. | +| Argument | Description | +| -------- | --------------------------------------------------------------------------------- | +| `last` | If set to `True`, component is added **last** in the pipeline (default). ~~bool~~ | +| `first` | If set to `True`, component is added **first** in the pipeline. ~~bool~~ | +| `before` | String name or index to add the new component **before**. ~~Union[str, int]~~ | +| `after` | String name or index to add the new component **after**. ~~Union[str, int]~~ | -### Example: A simple pipeline component {#custom-components-simple} + + +As of v3.0, components need to be registered using the +[`@Language.component`](/api/language#component) or +[`@Language.factory`](/api/language#factory) decorator so spaCy knows that a +function is a component. [`nlp.add_pipe`](/api/language#add_pipe) now takes the +**string name** of the component factory instead of the component function. This +doesn't only save you lines of code, it also allows spaCy to validate and track +your custom components, and make sure they can be saved and loaded. 
+ +```diff +- ruler = nlp.create_pipe("entity_ruler") +- nlp.add_pipe(ruler) ++ ruler = nlp.add_pipe("entity_ruler") +``` + + + +### Examples: Simple stateless pipeline components {#custom-components-simple} The following component receives the `Doc` in the pipeline and prints some information about it: the number of tokens, the part-of-speech tags of the -tokens and a conditional message based on the document length. +tokens and a conditional message based on the document length. The +[`@Language.component`](/api/language#component) decorator lets you register the +component under the name `"info_component"`. > #### ✏️ Things to try > @@ -361,89 +601,34 @@ tokens and a conditional message based on the document length. > this change reflected in `nlp.pipe_names`. > 3. Print `nlp.pipeline`. You'll see a list of tuples describing the component > name and the function that's called on the `Doc` object in the pipeline. +> 4. Change the first argument to `@Language.component`, the name, to something +> else. spaCy should now complain that it doesn't know a component of the +> name `"info_component"`. ```python ### {executable="true"} import spacy +from spacy.language import Language +@Language.component("info_component") def my_component(doc): - print("After tokenization, this doc has {} tokens.".format(len(doc))) + print(f"After tokenization, this doc has {len(doc)} tokens.") print("The part-of-speech tags are:", [token.pos_ for token in doc]) if len(doc) < 10: print("This is a pretty short document.") return doc nlp = spacy.load("en_core_web_sm") -nlp.add_pipe(my_component, name="print_info", last=True) +nlp.add_pipe("info_component", name="print_info", last=True) print(nlp.pipe_names) # ['tagger', 'parser', 'ner', 'print_info'] doc = nlp("This is a sentence.") - ``` -Of course, you can also wrap your component as a class to allow initializing it -with custom settings and hold state within the component. This is useful for -**stateful components**, especially ones which **depend on shared data**. In the -following example, the custom component `EntityMatcher` can be initialized with -`nlp` object, a terminology list and an entity label. Using the -[`PhraseMatcher`](/api/phrasematcher), it then matches the terms in the `Doc` -and adds them to the existing entities. - - - -As of v2.1.0, spaCy ships with the [`EntityRuler`](/api/entityruler), a pipeline -component for easy, rule-based named entity recognition. Its implementation is -similar to the `EntityMatcher` code shown below, but it includes some additional -features like support for phrase patterns and token patterns, handling overlaps -with existing entities and pattern export as JSONL. - -We'll still keep the pipeline component example below, as it works well to -illustrate complex components. But if you're planning on using this type of -component in your application, you might find the `EntityRuler` more convenient. -[See here](/usage/rule-based-matching#entityruler) for more details and -examples. 
- - - -```python -### {executable="true"} -import spacy -from spacy.matcher import PhraseMatcher -from spacy.tokens import Span - -class EntityMatcher(object): - name = "entity_matcher" - - def __init__(self, nlp, terms, label): - patterns = [nlp.make_doc(text) for text in terms] - self.matcher = PhraseMatcher(nlp.vocab) - self.matcher.add(label, None, *patterns) - - def __call__(self, doc): - matches = self.matcher(doc) - for match_id, start, end in matches: - span = Span(doc, start, end, label=match_id) - doc.ents = list(doc.ents) + [span] - return doc - -nlp = spacy.load("en_core_web_sm") -terms = ("cat", "dog", "tree kangaroo", "giant sea spider") -entity_matcher = EntityMatcher(nlp, terms, "ANIMAL") - -nlp.add_pipe(entity_matcher, after="ner") - -print(nlp.pipe_names) # The components in the pipeline - -doc = nlp("This is a text about Barack Obama and a tree kangaroo") -print([(ent.text, ent.label_) for ent in doc.ents]) -``` - -### Example: Custom sentence segmentation logic {#component-example1} - -Let's say you want to implement custom logic to improve spaCy's sentence -boundary detection. Currently, sentence segmentation is based on the dependency -parse, which doesn't always produce ideal results. The custom logic should -therefore be applied **after** tokenization, but _before_ the dependency parsing -– this way, the parser can also take advantage of the sentence boundaries. +Here's another example of a pipeline component that implements custom logic to +improve the sentence boundaries set by the dependency parser. The custom logic +should therefore be applied **after** tokenization, but _before_ the dependency +parsing – this way, the parser can also take advantage of the sentence +boundaries. > #### ✏️ Things to try > @@ -457,90 +642,663 @@ therefore be applied **after** tokenization, but _before_ the dependency parsing ```python ### {executable="true"} import spacy +from spacy.language import Language +@Language.component("custom_sentencizer") def custom_sentencizer(doc): for i, token in enumerate(doc[:-2]): # Define sentence start if pipe + titlecase token - if token.text == "|" and doc[i+1].is_title: - doc[i+1].is_sent_start = True + if token.text == "|" and doc[i + 1].is_title: + doc[i + 1].is_sent_start = True else: # Explicitly set sentence start to False otherwise, to tell # the parser to leave those tokens alone - doc[i+1].is_sent_start = False + doc[i + 1].is_sent_start = False return doc nlp = spacy.load("en_core_web_sm") -nlp.add_pipe(custom_sentencizer, before="parser") # Insert before the parser +nlp.add_pipe("custom_sentencizer", before="parser") # Insert before the parser doc = nlp("This is. A sentence. | This is. Another sentence.") for sent in doc.sents: print(sent.text) ``` -### Example: Pipeline component for entity matching and tagging with custom attributes {#component-example2} +### Component factories and stateful components {#custom-components-factories} -This example shows how to create a spaCy extension that takes a terminology list -(in this case, single- and multi-word company names), matches the occurrences in -a document, labels them as `ORG` entities, merges the tokens and sets custom -`is_tech_org` and `has_tech_org` attributes. For efficient matching, the example -uses the [`PhraseMatcher`](/api/phrasematcher) which accepts `Doc` objects as -match patterns and works well for large terminology lists. It also ensures your -patterns will always match, even when you customize spaCy's tokenization rules. 
-When you call `nlp` on a text, the custom pipeline component is applied to the -`Doc`. - -```python -https://github.com/explosion/spaCy/tree/master/examples/pipeline/custom_component_entities.py -``` - -Wrapping this functionality in a pipeline component allows you to reuse the -module with different settings, and have all pre-processing taken care of when -you call `nlp` on your text and receive a `Doc` object. - -### Adding factories {#custom-components-factories} - -When spaCy loads a model via its `meta.json`, it will iterate over the -`"pipeline"` setting, look up every component name in the internal factories and -call [`nlp.create_pipe`](/api/language#create_pipe) to initialize the individual -components, like the tagger, parser or entity recognizer. If your model uses -custom components, this won't work – so you'll have to tell spaCy **where to -find your component**. You can do this by writing to the `Language.factories`: +Component factories are callables that take settings and return a **pipeline +component function**. This is useful if your component is stateful and if you +need to customize their creation, or if you need access to the current `nlp` +object or the shared vocab. Component factories can be registered using the +[`@Language.factory`](/api/language#factory) decorator and they need at least +**two named arguments** that are filled in automatically when the component is +added to the pipeline: + +> #### Example +> +> ```python +> from spacy.language import Language +> +> @Language.factory("my_component") +> def my_component(nlp, name): +> return MyComponent() +> ``` + +| Argument | Description | +| -------- | --------------------------------------------------------------------------------------------------------------------------------- | +| `nlp` | The current `nlp` object. Can be used to access the shared vocab. ~~Language~~ | +| `name` | The **instance name** of the component in the pipeline. This lets you identify different instances of the same component. ~~str~~ | + +All other settings can be passed in by the user via the `config` argument on +[`nlp.add_pipe`](/api/language). The +[`@Language.factory`](/api/language#factory) decorator also lets you define a +`default_config` that's used as a fallback. ```python +### With config {highlight="4,9"} +import spacy from spacy.language import Language -Language.factories["entity_matcher"] = lambda nlp, **cfg: EntityMatcher(nlp, **cfg) + +@Language.factory("my_component", default_config={"some_setting": True}) +def my_component(nlp, name, some_setting: bool): + return MyComponent(some_setting=some_setting) + +nlp = spacy.blank("en") +nlp.add_pipe("my_component", config={"some_setting": False}) ``` -You can also ship the above code and your custom component in your packaged -model's `__init__.py`, so it's executed when you load your model. The `**cfg` -config parameters are passed all the way down from -[`spacy.load`](/api/top-level#spacy.load), so you can load the model and its -components with custom settings: + + +The [`@Language.component`](/api/language#component) decorator is essentially a +**shortcut** for stateless pipeline components that don't need any settings. +This means you don't have to always write a function that returns your function +if there's no state to be passed through – spaCy can just take care of this for +you. 
The following two code examples are equivalent:

```python
-nlp = spacy.load("your_custom_model", terms=["tree kangaroo"], label="ANIMAL")
+# Stateless component with @Language.factory
+@Language.factory("my_component")
+def create_my_component(nlp, name):
+    def my_component(doc):
+        # Do something to the doc
+        return doc
+
+    return my_component
+
+# Stateless component with @Language.component
+@Language.component("my_component")
+def my_component(doc):
+    # Do something to the doc
+    return doc
```
-
+
-When you load a model via its shortcut or package name, like `en_core_web_sm`,
-spaCy will import the package and then call its `load()` method. This means that
-custom code in the model's `__init__.py` will be executed, too. This is **not
-the case** if you're loading a model from a path containing the model data.
-Here, spaCy will only read in the `meta.json`. If you want to use custom
-factories with a model loaded from a path, you need to add them to
-`Language.factories` _before_ you load the model.
+
+
+Yes, the [`@Language.factory`](/api/language#factory) decorator can be added to
+a function or a class. If it's added to a class, it expects the `__init__`
+method to take the arguments `nlp` and `name`, and will populate all other
+arguments from the config. That said, it's often cleaner and more intuitive to
+make your factory a separate function. That's also how spaCy does it internally.
+
+
+
+### Language-specific factories {#factories-language new="3"}
+
+There are many use cases where you might want your pipeline components to be
+language-specific. Sometimes this requires an entirely different implementation
+per language, sometimes the only difference is in the settings or data. spaCy
+allows you to register factories of the **same name** on both the `Language`
+base class, as well as its **subclasses** like `English` or `German`. Factories
+are resolved starting with the specific subclass. If the subclass doesn't define
+a component of that name, spaCy will check the `Language` base class.
+
+Here's an example of a pipeline component that overwrites the normalized form of
+a token, the `Token.norm_`, with an entry from a language-specific lookup table.
+It's registered twice under the name `"token_normalizer"` – once using
+`@English.factory` and once using `@German.factory`:
+
+```python
+### {executable="true"}
+from spacy.lang.en import English
+from spacy.lang.de import German
+
+class TokenNormalizer:
+    def __init__(self, norm_table):
+        self.norm_table = norm_table
+
+    def __call__(self, doc):
+        for token in doc:
+            # Overwrite the token.norm_ if there's an entry in the data
+            token.norm_ = self.norm_table.get(token.text, token.norm_)
+        return doc
+
+@English.factory("token_normalizer")
+def create_en_normalizer(nlp, name):
+    return TokenNormalizer({"realise": "realize", "colour": "color"})
+
+@German.factory("token_normalizer")
+def create_de_normalizer(nlp, name):
+    return TokenNormalizer({"daß": "dass", "wußte": "wusste"})
+
+nlp_en = English()
+nlp_en.add_pipe("token_normalizer")  # uses the English factory
+print([token.norm_ for token in nlp_en("realise colour daß wußte")])
+
+nlp_de = German()
+nlp_de.add_pipe("token_normalizer")  # uses the German factory
+print([token.norm_ for token in nlp_de("realise colour daß wußte")])
+```
+
+
+
+Under the hood, language-specific factories are added to the
+[`factories` registry](/api/top-level#registry) prefixed with the language code,
+e.g. `"en.token_normalizer"`.
When resolving the factory in +[`nlp.add_pipe`](/api/language#add_pipe), spaCy first checks for a +language-specific version of the factory using `nlp.lang` and if none is +available, falls back to looking up the regular factory name. + + + +### Example: Stateful component with settings {#example-stateful-components} + +This example shows a **stateful** pipeline component for handling acronyms: +based on a dictionary, it will detect acronyms and their expanded forms in both +directions and add them to a list as the custom `doc._.acronyms` +[extension attribute](#custom-components-attributes). Under the hood, it uses +the [`PhraseMatcher`](/api/phrasematcher) to find instances of the phrases. + +The factory function takes three arguments: the shared `nlp` object and +component instance `name`, which are passed in automatically by spaCy, and a +`case_sensitive` config setting that makes the matching and acronym detection +case-sensitive. + +> #### ✏️ Things to try +> +> 1. Change the `config` passed to `nlp.add_pipe` and set `"case_sensitive"` to +> `True`. You should see that the expanded acronym for "LOL" isn't detected +> anymore. +> 2. Add some more terms to the `DICTIONARY` and update the processed text so +> they're detected. +> 3. Add a `name` argument to `nlp.add_pipe` to change the component name. Print +> `nlp.pipe_names` to see the change reflected in the pipeline. +> 4. Print the config of the current `nlp` object with +> `print(nlp.config.to_str())` and inspect the `[components]` block. You +> should see an entry for the acronyms component, referencing the factory +> `acronyms` and the config settings. + +```python +### {executable="true"} +from spacy.language import Language +from spacy.tokens import Doc +from spacy.matcher import PhraseMatcher +import spacy + +DICTIONARY = {"lol": "laughing out loud", "brb": "be right back"} +DICTIONARY.update({value: key for key, value in DICTIONARY.items()}) + +@Language.factory("acronyms", default_config={"case_sensitive": False}) +def create_acronym_component(nlp: Language, name: str, case_sensitive: bool): + return AcronymComponent(nlp, case_sensitive) + +class AcronymComponent: + def __init__(self, nlp: Language, case_sensitive: bool): + # Create the matcher and match on Token.lower if case-insensitive + matcher_attr = "TEXT" if case_sensitive else "LOWER" + self.matcher = PhraseMatcher(nlp.vocab, attr=matcher_attr) + self.matcher.add("ACRONYMS", [nlp.make_doc(term) for term in DICTIONARY]) + self.case_sensitive = case_sensitive + # Register custom extension on the Doc + if not Doc.has_extension("acronyms"): + Doc.set_extension("acronyms", default=[]) + + def __call__(self, doc: Doc) -> Doc: + # Add the matched spans when doc is processed + for _, start, end in self.matcher(doc): + span = doc[start:end] + acronym = DICTIONARY.get(span.text if self.case_sensitive else span.text.lower()) + doc._.acronyms.append((span, acronym)) + return doc + +# Add the component to the pipeline and configure it +nlp = spacy.blank("en") +nlp.add_pipe("acronyms", config={"case_sensitive": False}) + +# Process a doc and see the results +doc = nlp("LOL, be right back") +print(doc._.acronyms) +``` + +## Initializing and serializing component data {#component-data} + +Many stateful components depend on **data resources** like dictionaries and +lookup tables that should ideally be **configurable**. 
For example, it makes +sense to make the `DICTIONARY` in the above example an argument of the +registered function, so the `AcronymComponent` can be re-used with different +data. One logical solution would be to make it an argument of the component +factory, and allow it to be initialized with different dictionaries. + +> #### config.cfg +> +> ```ini +> [components.acronyms.data] +> # 🚨 Problem: you don't want the data in the config +> lol = "laugh out loud" +> brb = "be right back" +> ``` + +```python +@Language.factory("acronyms", default_config={"data": {}, "case_sensitive": False}) +def create_acronym_component(nlp: Language, name: str, data: Dict[str, str], case_sensitive: bool): + # 🚨 Problem: data ends up in the config file + return AcronymComponent(nlp, data, case_sensitive) +``` + +However, passing in the dictionary directly is problematic, because it means +that if a component saves out its config and settings, the +[`config.cfg`](/usage/training#config) will include a dump of the entire data, +since that's the config the component was created with. It will also fail if the +data is not JSON-serializable. + +### Option 1: Using a registered function {#component-data-function} + + + +- ✅ **Pros:** can load anything in Python, easy to add to and configure via + config +- ❌ **Cons:** requires the function and its dependencies to be available at + runtime + + + +If what you're passing in isn't JSON-serializable – e.g. a custom object like a +[model](#trainable-components) – saving out the component config becomes +impossible because there's no way for spaCy to know _how_ that object was +created, and what to do to create it again. This makes it much harder to save, +load and train custom pipelines with custom components. A simple solution is to +**register a function** that returns your resources. The +[registry](/api/top-level#registry) lets you **map string names to functions** +that create objects, so given a name and optional arguments, spaCy will know how +to recreate the object. To register a function that returns your custom +dictionary, you can use the `@spacy.registry.misc` decorator with a single +argument, the name: + +> #### What's the misc registry? +> +> The [`registry`](/api/top-level#registry) provides different categories for +> different types of functions – for example, model architectures, tokenizers or +> batchers. `misc` is intended for miscellaneous functions that don't fit +> anywhere else. + +```python +### Registered function for assets {highlight="1"} +@spacy.registry.misc("acronyms.slang_dict.v1") +def create_acronyms_slang_dict(): + dictionary = {"lol": "laughing out loud", "brb": "be right back"} + dictionary.update({value: key for key, value in dictionary.items()}) + return dictionary +``` + +In your `default_config` (and later in your +[training config](/usage/training#config)), you can now refer to the function +registered under the name `"acronyms.slang_dict.v1"` using the `@misc` key. This +tells spaCy how to create the value, and when your component is created, the +result of the registered function is passed in as the key `"dictionary"`. 
+
+> #### config.cfg
+>
+> ```ini
+> [components.acronyms]
+> factory = "acronyms"
+>
+> [components.acronyms.data]
+> @misc = "acronyms.slang_dict.v1"
+> ```
+
+```diff
+- default_config = {"dictionary": DICTIONARY}
++ default_config = {"dictionary": {"@misc": "acronyms.slang_dict.v1"}}
+```
+
+Using a registered function also means that you can easily include your custom
+components in pipelines that you [train](/usage/training). To make sure spaCy
+knows where to find your custom `@misc` function, you can pass in a Python file
+via the argument `--code`. If someone else is using your component, all they
+have to do to customize the data is to register their own function and swap out
+the name. Registered functions can also take **arguments** that can be defined
+in the config as well – you can read more about this in the docs on
+[training with custom code](/usage/training#custom-code).
+
+### Option 2: Save data with the pipeline and load it in once on initialization {#component-data-initialization}
+
+
+
+- ✅ **Pros:** lets components save and load their own data and reflect user
+  changes, load in data assets before training without depending on them at
+  runtime
+- ❌ **Cons:** requires more component methods, more complex config and data
+  flow
+
+
+
+Just like models save out their binary weights when you call
+[`nlp.to_disk`](/api/language#to_disk), components can also **serialize** any
+other data assets – for instance, an acronym dictionary. If a pipeline component
+implements its own `to_disk` and `from_disk` methods, those will be called
+automatically by `nlp.to_disk` and will receive the path to the directory to
+save to or load from. The component can then perform any custom saving or
+loading. If a user makes changes to the component data, they will be reflected
+when the `nlp` object is saved. For more examples of this, see the usage guide
+on [serialization methods](/usage/saving-loading/#serialization-methods).
+
+> #### About the data path
+>
+> The `path` argument spaCy passes to the serialization methods consists of the
+> path provided by the user, plus a directory of the component name. This means
+> that when you call `nlp.to_disk("/path")`, the `acronyms` component will
+> receive the directory path `/path/acronyms` and can then create files in this
+> directory.
+
+```python
+### Custom serialization methods {highlight="6-7,9-11"}
+import srsly
+
+class AcronymComponent:
+    # other methods here...
+
+    def to_disk(self, path, exclude=tuple()):
+        srsly.write_json(path / "data.json", self.data)
+
+    def from_disk(self, path, exclude=tuple()):
+        self.data = srsly.read_json(path / "data.json")
+        return self
+```
+
+Now the component can save to and load from a directory. The only remaining
+question: How do you **load in the initial data**? In Python, you could just
+call the pipe's `from_disk` method yourself. But if you're adding the component
+to your [training config](/usage/training#config), spaCy will need to know how
+to set it up, from start to finish, including the data to initialize it with.
+
+While you could use a registered function or a file loader like
+[`srsly.read_json.v1`](/api/top-level#file_readers) as an argument of the
+component factory, this approach is problematic: the component factory runs
+**every time the component is created**. This means it will run when creating
+the `nlp` object before training, but also every time a user loads your pipeline.
So
+your runtime pipeline would either depend on a local path on your file system,
+or the data is loaded twice: once when the component is created, and then again
+when it's loaded back in by `from_disk`.
+
+> ```ini
+> ### config.cfg
+> [components.acronyms.data]
+> # 🚨 Problem: Runtime pipeline depends on local path
+> @readers = "srsly.read_json.v1"
+> path = "/path/to/slang_dict.json"
+> ```
+>
+> ```ini
+> ### config.cfg
+> [components.acronyms.data]
+> # 🚨 Problem: this always runs
+> @misc = "acronyms.slang_dict.v1"
+> ```
+
+```python
+@Language.factory("acronyms", default_config={"data": {}, "case_sensitive": False})
+def create_acronym_component(nlp: Language, name: str, data: Dict[str, str], case_sensitive: bool):
+    # 🚨 Problem: data will be loaded every time component is created
+    return AcronymComponent(nlp, data, case_sensitive)
+```
+
+To solve this, your component can implement a separate method, `initialize`,
+which will be called by [`nlp.initialize`](/api/language#initialize) if
+available. This typically happens before training, but not at runtime when the
+pipeline is loaded. For more background on this, see the usage guides on the
+[config lifecycle](/usage/training#config-lifecycle) and
+[custom initialization](/usage/training#initialization).
+
+![Illustration of pipeline lifecycle](../images/lifecycle.svg)
+
+A component's `initialize` method needs to take at least **two named
+arguments**: a `get_examples` callback that gives it access to the training
+examples, and the current `nlp` object. This is mostly used by trainable
+components so they can initialize their models and label schemes from the data,
+so we can ignore those arguments here. All **other arguments** on the method can
+be defined via the config – in this case a dictionary `data`.
+
+> #### config.cfg
+>
+> ```ini
+> [initialize.components.my_component]
+>
+> [initialize.components.my_component.data]
+> # ✅ This only runs on initialization
+> @readers = "srsly.read_json.v1"
+> path = "/path/to/slang_dict.json"
+> ```
+
+```python
+### Custom initialize method {highlight="5-6"}
+class AcronymComponent:
+    def __init__(self):
+        self.data = {}
+
+    def initialize(self, get_examples=None, nlp=None, data={}):
+        self.data = data
+```
+
+When [`nlp.initialize`](/api/language#initialize) runs before training (or when
+you call it in your own code), the
+[`[initialize]`](/api/data-formats#config-initialize) block of the config is
+loaded and used to construct the `nlp` object. The custom acronym component will
+then be passed the data loaded from the JSON file. After training, the `nlp`
+object is saved to disk, which will run the component's `to_disk` method. When
+the pipeline is loaded back into spaCy later to use it, the `from_disk` method
+will load the data back in.
+
+## Python type hints and validation {#type-hints new="3"}
+
+spaCy's configs are powered by our machine learning library Thinc's
+[configuration system](https://thinc.ai/docs/usage-config), which supports
+[type hints](https://docs.python.org/3/library/typing.html) and even
+[advanced type annotations](https://thinc.ai/docs/usage-config#advanced-types)
+using [`pydantic`](https://github.com/samuelcolvin/pydantic). If your component
+factory provides type hints, the values that are passed in will be **checked
+against the expected types**. If a value can't be cast to the expected type –
+an integer, for example – spaCy will raise an error.
`pydantic` also provides strict types like `StrictFloat`,
+which will require the value to actually be a float and raise an error if it's
+not – for instance, if your config defines an integer.
+
+
+
+If you're not using
+[strict types](https://pydantic-docs.helpmanual.io/usage/types/#strict-types),
+values that can be **cast to** the given type will still be accepted. For
+example, `1` can be cast to a `float` or a `bool` type, but not to a
+`List[str]`. However, if the type is
+[`StrictFloat`](https://pydantic-docs.helpmanual.io/usage/types/#strict-types),
+only a float will be accepted.
+
+
+
+The following example shows a custom pipeline component for debugging. It can be
+added anywhere in the pipeline and logs information about the `nlp` object and
+the `Doc` that passes through. The `log_level` config setting lets the user
+customize what log statements are shown – for instance, `"INFO"` will show info
+logs and more critical logging statements, whereas `"DEBUG"` will show
+everything. The value is annotated as a `StrictStr`, so it will only accept a
+string value.
+
+> #### ✏️ Things to try
+>
+> 1. Change the `config` passed to `nlp.add_pipe` to use the log level `"INFO"`.
+>    You should see that only the statement logged with `logger.info` is shown.
+> 2. Change the `config` passed to `nlp.add_pipe` so that it contains unexpected
+>    values – for example, a boolean instead of a string: `"log_level": False`.
+>    You should see a validation error.
+> 3. Check out the docs on `pydantic`'s
+>    [constrained types](https://pydantic-docs.helpmanual.io/usage/types/#constrained-types)
+>    and write a type hint for `log_level` that only accepts the exact string
+>    values `"DEBUG"`, `"INFO"` or `"CRITICAL"`.
+
+```python
+### {executable="true"}
+import spacy
+from spacy.language import Language
+from spacy.tokens import Doc
+from pydantic import StrictStr
+import logging
+
+@Language.factory("debug", default_config={"log_level": "DEBUG"})
+class DebugComponent:
+    def __init__(self, nlp: Language, name: str, log_level: StrictStr):
+        self.logger = logging.getLogger(f"spacy.{name}")
+        self.logger.setLevel(log_level)
+        self.logger.info(f"Pipeline: {nlp.pipe_names}")
+
+    def __call__(self, doc: Doc) -> Doc:
+        is_tagged = doc.has_annotation("TAG")
+        self.logger.debug(f"Doc: {len(doc)} tokens, is tagged: {is_tagged}")
+        return doc
+
+nlp = spacy.load("en_core_web_sm")
+nlp.add_pipe("debug", config={"log_level": "DEBUG"})
+doc = nlp("This is a text...")
+```
+
+## Trainable components {#trainable-components new="3"}
+
+spaCy's [`TrainablePipe`](/api/pipe) class helps you implement your own
+trainable components that have their own model instance, make predictions over
+`Doc` objects and can be updated using [`spacy train`](/api/cli#train). This
+lets you plug fully custom machine learning components into your pipeline.
+
+![Illustration of Pipe methods](../images/trainable_component.svg)
+
+You'll need the following:
+
+1. **Model:** A Thinc [`Model`](https://thinc.ai/docs/api-model) instance. This
+   can be a model implemented in [Thinc](/usage/layers-architectures#thinc), or
+   a [wrapped model](/usage/layers-architectures#frameworks) implemented in
+   PyTorch, TensorFlow, MXNet or a fully custom solution. The model must take a
+   list of [`Doc`](/api/doc) objects as input and can have any type of output.
+2.
**TrainablePipe subclass:** A subclass of [`TrainablePipe`](/api/pipe) that + implements at least two methods: [`TrainablePipe.predict`](/api/pipe#predict) + and [`TrainablePipe.set_annotations`](/api/pipe#set_annotations). +3. **Component factory:** A component factory registered with + [`@Language.factory`](/api/language#factory) that takes the `nlp` object and + component `name` and optional settings provided by the config and returns an + instance of your trainable component. + +> #### Example +> +> ```python +> from spacy.pipeline import TrainablePipe +> from spacy.language import Language +> +> class TrainableComponent(TrainablePipe): +> def predict(self, docs): +> ... +> +> def set_annotations(self, docs, scores): +> ... +> +> @Language.factory("my_trainable_component") +> def make_component(nlp, name, model): +> return TrainableComponent(nlp.vocab, model, name=name) +> ``` + +| Name | Description | +| ---------------------------------------------- | ------------------------------------------------------------------------------------------------------------------- | +| [`predict`](/api/pipe#predict) | Apply the component's model to a batch of [`Doc`](/api/doc) objects (without modifying them) and return the scores. | +| [`set_annotations`](/api/pipe#set_annotations) | Modify a batch of [`Doc`](/api/doc) objects, using pre-computed scores generated by `predict`. | + +By default, [`TrainablePipe.__init__`](/api/pipe#init) takes the shared vocab, +the [`Model`](https://thinc.ai/docs/api-model) and the name of the component +instance in the pipeline, which you can use as a key in the losses. All other +keyword arguments will become available as [`TrainablePipe.cfg`](/api/pipe#cfg) +and will also be serialized with the component. + + + +spaCy's [config system](/usage/training#config) resolves the config describing +the pipeline components and models **bottom-up**. This means that it will +_first_ create a `Model` from a [registered architecture](/api/architectures), +validate its arguments and _then_ pass the object forward to the component. This +means that the config can express very complex, nested trees of objects – but +the objects don't have to pass the model settings all the way down to the +components. It also makes the components more **modular** and lets you +[swap](/usage/layers-architectures#swap-architectures) different architectures +in your config, and re-use model definitions. + +```ini +### config.cfg (excerpt) +[components] + +[components.textcat] +factory = "textcat" +labels = [] + +# This function is created and then passed to the "textcat" component as +# the argument "model" +[components.textcat.model] +@architectures = "spacy.TextCatEnsemble.v1" +exclusive_classes = false +pretrained_vectors = null +width = 64 +conv_depth = 2 +embed_size = 2000 +window_size = 1 +ngram_size = 1 +dropout = null + +[components.other_textcat] +factory = "textcat" +# This references the [components.textcat.model] block above +model = ${components.textcat.model} +labels = [] +``` + +Your trainable pipeline component factories should therefore always take a +`model` argument instead of instantiating the +[`Model`](https://thinc.ai/docs/api-model) inside the component. To register +custom architectures, you can use the +[`@spacy.registry.architectures`](/api/top-level#registry) decorator. Also see +the [training guide](/usage/training#config) for details. 
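
To make the last point concrete, here's a minimal sketch of what registering a custom architecture could look like. The architecture name `custom_demo.SimpleTagger.v1`, the `tok2vec` sublayer (expected to be filled in from its own config block) and the output size `nO` are made-up example values for illustration, not built-in spaCy architectures:

```python
### Registering a custom architecture (sketch)
from typing import List

import spacy
from spacy.tokens import Doc
from thinc.api import Model, Softmax, chain, with_array
from thinc.types import Floats2d

@spacy.registry.architectures("custom_demo.SimpleTagger.v1")
def build_simple_tagger(
    tok2vec: Model[List[Doc], List[Floats2d]], nO: int
) -> Model[List[Doc], List[Floats2d]]:
    # Chain the embedding sublayer (passed in as "tok2vec" via the config)
    # with a per-token softmax output over nO classes
    return chain(tok2vec, with_array(Softmax(nO)))
```

A component's `model` block in the config could then reference it with `@architectures = "custom_demo.SimpleTagger.v1"` and define the `tok2vec` sublayer as a nested block – analogous to the `spacy.TextCatEnsemble.v1` excerpt above.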
+
+
+
+For some use cases, it makes sense to also overwrite additional methods to
+customize how the model is updated from examples, how it's initialized, how the
+loss is calculated and to add evaluation scores to the training output.
+
+| Name | Description |
+| ------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| [`update`](/api/pipe#update) | Learn from a batch of [`Example`](/api/example) objects containing the predictions and gold-standard annotations, and update the component's model. |
+| [`initialize`](/api/pipe#initialize) | Initialize the model. Typically calls into [`Model.initialize`](https://thinc.ai/docs/api-model#initialize) and can be passed custom arguments via the [`[initialize]`](/api/data-formats#config-initialize) config block that are only loaded during training or when you call [`nlp.initialize`](/api/language#initialize), not at runtime. |
+| [`get_loss`](/api/pipe#get_loss) | Return a tuple of the loss and the gradient for a batch of [`Example`](/api/example) objects. |
+| [`score`](/api/pipe#score) | Score a batch of [`Example`](/api/example) objects and return a dictionary of scores. The [`@Language.factory`](/api/language#factory) decorator can define the `default_score_weights` of the component to decide which keys of the scores to display during training and how they count towards the final score. |
+
+
+
+For more details on how to implement your own trainable components and model
+architectures, and plug existing models implemented in PyTorch or TensorFlow
+into your spaCy pipeline, see the usage guide on
+[layers and model architectures](/usage/layers-architectures#components).

## Extension attributes {#custom-components-attributes new="2"}

-As of v2.0, spaCy allows you to set any custom attributes and methods on the
-`Doc`, `Span` and `Token`, which become available as `Doc._`, `Span._` and
-`Token._` – for example, `Token._.my_attr`. This lets you store additional
-information relevant to your application, add new features and functionality to
-spaCy, and implement your own models trained with other machine learning
-libraries. It also lets you take advantage of spaCy's data structures and the
-`Doc` object as the "single source of truth".
+spaCy allows you to set any custom attributes and methods on the `Doc`, `Span`
+and `Token`, which become available as `Doc._`, `Span._` and `Token._` – for
+example, `Token._.my_attr`. This lets you store additional information relevant
+to your application, add new features and functionality to spaCy, and implement
+your own models trained with other machine learning libraries. It also lets you
+take advantage of spaCy's data structures and the `Doc` object as the "single
+source of truth".

@@ -602,7 +1360,7 @@ There are three main types of extensions, which can be defined using the
 [these examples](/usage/examples#custom-components-attr-methods).

 ```python
-    Doc.set_extension("hello", method=lambda doc, name: "Hi {}!".format(name))
+    Doc.set_extension("hello", method=lambda doc, name: f"Hi {name}!")
     assert doc._.hello("Bob") == "Hi Bob!"
``` @@ -645,12 +1403,70 @@ especially useful it you want to pass in a string instead of calling This example shows the implementation of a pipeline component that fetches country meta data via the [REST Countries API](https://restcountries.eu), sets -entity annotations for countries, merges entities into one token and sets custom -attributes on the `Doc`, `Span` and `Token` – for example, the capital, -latitude/longitude coordinates and even the country flag. +entity annotations for countries and sets custom attributes on the `Doc` and +`Span` – for example, the capital, latitude/longitude coordinates and even the +country flag. ```python -https://github.com/explosion/spaCy/tree/master/examples/pipeline/custom_component_countries_api.py +### {executable="true"} +import requests +from spacy.lang.en import English +from spacy.language import Language +from spacy.matcher import PhraseMatcher +from spacy.tokens import Doc, Span, Token + +@Language.factory("rest_countries") +class RESTCountriesComponent: + def __init__(self, nlp, name, label="GPE"): + r = requests.get("https://restcountries.eu/rest/v2/all") + r.raise_for_status() # make sure requests raises an error if it fails + countries = r.json() + # Convert API response to dict keyed by country name for easy lookup + self.countries = {c["name"]: c for c in countries} + self.label = label + # Set up the PhraseMatcher with Doc patterns for each country name + self.matcher = PhraseMatcher(nlp.vocab) + self.matcher.add("COUNTRIES", [nlp.make_doc(c) for c in self.countries.keys()]) + # Register attributes on the Span. We'll be overwriting this based on + # the matches, so we're only setting a default value, not a getter. + Span.set_extension("is_country", default=None) + Span.set_extension("country_capital", default=None) + Span.set_extension("country_latlng", default=None) + Span.set_extension("country_flag", default=None) + # Register attribute on Doc via a getter that checks if the Doc + # contains a country entity + Doc.set_extension("has_country", getter=self.has_country) + + def __call__(self, doc): + spans = [] # keep the spans for later so we can merge them afterwards + for _, start, end in self.matcher(doc): + # Generate Span representing the entity & set label + entity = Span(doc, start, end, label=self.label) + # Set custom attributes on entity. Can be extended with other data + # returned by the API, like currencies, country code, calling code etc. + entity._.set("is_country", True) + entity._.set("country_capital", self.countries[entity.text]["capital"]) + entity._.set("country_latlng", self.countries[entity.text]["latlng"]) + entity._.set("country_flag", self.countries[entity.text]["flag"]) + spans.append(entity) + # Overwrite doc.ents and add entity – be careful not to replace! + doc.ents = list(doc.ents) + spans + return doc # don't forget to return the Doc! + + def has_country(self, doc): + """Getter for Doc attributes. 
Since the getter is only called + when we access the attribute, we can refer to the Span's 'is_country' + attribute here, which is already set in the processing step.""" + return any([entity._.get("is_country") for entity in doc.ents]) + +nlp = English() +nlp.add_pipe("rest_countries", config={"label": "GPE"}) +doc = nlp("Some text about Colombia and the Czech Republic") +print("Pipeline", nlp.pipe_names) # pipeline contains component name +print("Doc has countries", doc._.has_country) # Doc contains countries +for ent in doc.ents: + if ent._.is_country: + print(ent.text, ent.label_, ent._.country_capital, ent._.country_latlng, ent._.country_flag) ``` In this case, all data can be fetched on initialization in one request. However, @@ -665,14 +1481,14 @@ While it's generally recommended to use the `Doc._`, `Span._` and `Token._` proxies to add your own custom attributes, spaCy offers a few exceptions to allow **customizing the built-in methods** like [`Doc.similarity`](/api/doc#similarity) or [`Doc.vector`](/api/doc#vector) with -your own hooks, which can rely on statistical models you train yourself. For -instance, you can provide your own on-the-fly sentence segmentation algorithm or -document similarity method. +your own hooks, which can rely on components you train yourself. For instance, +you can provide your own on-the-fly sentence segmentation algorithm or document +similarity method. Hooks let you customize some of the behaviors of the `Doc`, `Span` or `Token` objects by adding a component to the pipeline. For instance, to customize the [`Doc.similarity`](/api/doc#similarity) method, you can add a component that -sets a custom function to `doc.user_hooks['similarity']`. The built-in +sets a custom function to `doc.user_hooks["similarity"]`. The built-in `Doc.similarity` method will check the `user_hooks` dict, and delegate to your function if you've set one. Similar results can be achieved by setting functions to `Doc.user_span_hooks` and `Doc.user_token_hooks`. @@ -681,7 +1497,7 @@ to `Doc.user_span_hooks` and `Doc.user_token_hooks`. > > The hooks live on the `Doc` object because the `Span` and `Token` objects are > created lazily, and don't own any data. They just proxy to their parent `Doc`. -> This turns out to be convenient here — we only have to worry about installing +> This turns out to be convenient here – we only have to worry about installing > hooks in one place. | Name | Customizes | @@ -692,7 +1508,7 @@ to `Doc.user_span_hooks` and `Doc.user_token_hooks`. ```python ### Add custom similarity hooks -class SimilarityModel(object): +class SimilarityModel: def __init__(self, model): self._model = model @@ -709,8 +1525,8 @@ class SimilarityModel(object): ## Developing plugins and wrappers {#plugins} We're very excited about all the new possibilities for community extensions and -plugins in spaCy v2.0, and we can't wait to see what you build with it! To get -you started, here are a few tips, tricks and best +plugins in spaCy, and we can't wait to see what you build with it! To get you +started, here are a few tips, tricks and best practices. [See here](/universe/?category=pipeline) for examples of other spaCy extensions. @@ -799,20 +1615,14 @@ function that takes a `Doc`, modifies it and returns it. method. However, a third-party extension should **never silently overwrite built-ins**, or attributes set by other extensions. 
-- If you're looking to publish a model that depends on a custom pipeline - component, you can either **require it** in the model package's dependencies, - or – if the component is specific and lightweight – choose to **ship it with - your model package** and add it to the `Language` instance returned by the - model's `load()` method. For examples of this, check out the implementations - of spaCy's - [`load_model_from_init_py`](/api/top-level#util.load_model_from_init_py) - [`load_model_from_path`](/api/top-level#util.load_model_from_path) utility - functions. - - ```diff - + nlp.add_pipe(my_custom_component) - + return nlp.from_disk(model_path) - ``` +- If you're looking to publish a pipeline package that depends on a custom + pipeline component, you can either **require it** in the package's + dependencies, or – if the component is specific and lightweight – choose to + **ship it with your pipeline package**. Just make sure the + [`@Language.component`](/api/language#component) or + [`@Language.factory`](/api/language#factory) decorator that registers the + custom component runs in your package's `__init__.py` or is exposed via an + [entry point](/usage/saving-loading#entry-points). - Once you're ready to share your extension with others, make sure to **add docs and installation instructions** (you can always link to this page for more @@ -827,14 +1637,14 @@ function that takes a `Doc`, modifies it and returns it. ### Wrapping other models and libraries {#wrapping-models-libraries} Let's say you have a custom entity recognizer that takes a list of strings and -returns their [BILUO tags](/api/annotation#biluo). Given an input like -`["A", "text", "about", "Facebook"]`, it will predict and return +returns their [BILUO tags](/usage/linguistic-features#accessing-ner). Given an +input like `["A", "text", "about", "Facebook"]`, it will predict and return `["O", "O", "O", "U-ORG"]`. To integrate it into your spaCy pipeline and make it add those entities to the `doc.ents`, you can wrap it in a custom pipeline component function and pass it the token texts from the `Doc` object received by the component. -The [`gold.spans_from_biluo_tags`](/api/goldparse#spans_from_biluo_tags) is very +The [`training.biluo_tags_to_spans`](/api/top-level#biluo_tags_to_spans) is very helpful here, because it takes a `Doc` object and token-based BILUO tags and returns a sequence of `Span` objects in the `Doc` with added labels. So all your wrapper has to do is compute the entity spans and overwrite the `doc.ents`. @@ -847,20 +1657,22 @@ wrapper has to do is compute the entity spans and overwrite the `doc.ents`. > overlapping entity spans are not allowed. ```python -### {highlight="1,6-7"} +### {highlight="1,8-9"} import your_custom_entity_recognizer -from spacy.gold import offsets_from_biluo_tags +from spacy.training import biluo_tags_to_spans +from spacy.language import Language +@Language.component("custom_ner_wrapper") def custom_ner_wrapper(doc): words = [token.text for token in doc] custom_entities = your_custom_entity_recognizer(words) - doc.ents = spans_from_biluo_tags(doc, custom_entities) + doc.ents = biluo_tags_to_spans(doc, custom_entities) return doc ``` -The `custom_ner_wrapper` can then be added to the pipeline of a blank model -using [`nlp.add_pipe`](/api/language#add_pipe). You can also replace the -existing entity recognizer of a pretrained model with +The `custom_ner_wrapper` can then be added to a blank pipeline using +[`nlp.add_pipe`](/api/language#add_pipe). 
You can also replace the existing +entity recognizer of a trained pipeline with [`nlp.replace_pipe`](/api/language#replace_pipe). Here's another example of a custom model, `your_custom_model`, that takes a list @@ -874,22 +1686,24 @@ because it returns the integer ID of the string _and_ makes sure it's added to the vocab. This is especially important if the custom model uses a different label scheme than spaCy's default models. -> #### Example: spacy-stanfordnlp +> #### Example: spacy-stanza > > For an example of an end-to-end wrapper for statistical tokenization, tagging > and parsing, check out -> [`spacy-stanfordnlp`](https://github.com/explosion/spacy-stanfordnlp). It uses -> a very similar approach to the example in this section – the only difference -> is that it fully replaces the `nlp` object instead of providing a pipeline -> component, since it also needs to handle tokenization. +> [`spacy-stanza`](https://github.com/explosion/spacy-stanza). It uses a very +> similar approach to the example in this section – the only difference is that +> it fully replaces the `nlp` object instead of providing a pipeline component, +> since it also needs to handle tokenization. ```python -### {highlight="1,9,15-17"} +### {highlight="1,11,17-19"} import your_custom_model +from spacy.language import Language from spacy.symbols import POS, TAG, DEP, HEAD from spacy.tokens import Doc import numpy +@Language.component("custom_model_wrapper") def custom_model_wrapper(doc): words = [token.text for token in doc] spaces = [token.whitespace for token in doc] @@ -921,7 +1735,7 @@ new_heads = [head - i - 1 if head != 0 else 0 for i, head in enumerate(heads)] - + For more details on how to write and package custom components, make them available to spaCy via entry points and implement your own serialization diff --git a/website/docs/usage/projects.md b/website/docs/usage/projects.md new file mode 100644 index 000000000..492345f2f --- /dev/null +++ b/website/docs/usage/projects.md @@ -0,0 +1,988 @@ +--- +title: Projects +new: 3 +menu: + - ['Intro & Workflow', 'intro'] + - ['Directory & Assets', 'directory'] + - ['Custom Projects', 'custom'] + - ['Remote Storage', 'remote'] + - ['Integrations', 'integrations'] +--- + +## Introduction and workflow {#intro hidden="true"} + +> #### 🪐 Project templates +> +> Our [`projects`](https://github.com/explosion/projects) repo includes various +> project templates for different NLP tasks, models, workflows and integrations +> that you can clone and run. The easiest way to get started is to pick a +> template, clone it and start modifying it! + +spaCy projects let you manage and share **end-to-end spaCy workflows** for +different **use cases and domains**, and orchestrate training, packaging and +serving your custom pipelines. You can start off by cloning a pre-defined +project template, adjust it to fit your needs, load in your data, train a +pipeline, export it as a Python package, upload your outputs to a remote storage +and share your results with your team. spaCy projects can be used via the new +[`spacy project`](/api/cli#project) command and we provide templates in our +[`projects`](https://github.com/explosion/projects) repo. + +![Illustration of project workflow and commands](../images/projects.svg) + + + +The easiest way to get started is to clone a project template and run it – for +example, this end-to-end template that lets you train a **part-of-speech +tagger** and **dependency parser** on a Universal Dependencies treebank. 
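
In practice, the full loop can be as short as a few commands. The sketch below assumes the `pipelines/tagger_parser_ud` template shown above, which is cloned into a local directory of the same name and defines an `all` workflow – the individual steps are described in more detail below:

```cli
$ python -m spacy project clone pipelines/tagger_parser_ud
$ cd tagger_parser_ud
$ python -m spacy project assets
$ python -m spacy project run all
```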
+ + + +spaCy projects make it easy to integrate with many other **awesome tools** in +the data science and machine learning ecosystem to track and manage your data +and experiments, iterate on demos and prototypes and ship your models into +production. + + +Manage and version your data +Create labelled training data +Visualize and demo your pipelines +Serve your models and host APIs +Distributed and parallel training +Track your experiments and results + + +### 1. Clone a project template {#clone} + +> #### Cloning under the hood +> +> To clone a project, spaCy calls into `git` and uses the "sparse checkout" +> feature to only clone the relevant directory or directories. + +The [`spacy project clone`](/api/cli#project-clone) command clones an existing +project template and copies the files to a local directory. You can then run the +project, e.g. to train a pipeline and edit the commands and scripts to build +fully custom workflows. + +```cli +python -m spacy project clone pipelines/tagger_parser_ud +``` + +By default, the project will be cloned into the current working directory. You +can specify an optional second argument to define the output directory. The +`--repo` option lets you define a custom repo to clone from if you don't want +to use the spaCy [`projects`](https://github.com/explosion/projects) repo. You +can also use any private repo you have access to with Git. + +### 2. Fetch the project assets {#assets} + +> #### project.yml +> +> ```yaml +> assets: +> - dest: 'assets/training.spacy' +> url: 'https://example.com/data.spacy' +> checksum: '63373dd656daa1fd3043ce166a59474c' +> - dest: 'assets/development.spacy' +> git: +> repo: 'https://github.com/example/repo' +> branch: 'master' +> path: 'path/development.spacy' +> checksum: '5113dc04e03f079525edd8df3f4f39e3' +> ``` + +Assets are data files your project needs – for example, the training and +evaluation data or pretrained vectors and embeddings to initialize your model +with. Each project template comes with a `project.yml` that defines the assets +to download and where to put them. The +[`spacy project assets`](/api/cli#project-assets) will fetch the project assets +for you: + +```cli +$ cd some_example_project +$ python -m spacy project assets +``` + +Asset URLs can be a number of different protocols: HTTP, HTTPS, FTP, SSH, and +even cloud storage such as GCS and S3. You can also fetch assets using git, by +replacing the `url` string with a `git` block. spaCy will use Git's "sparse +checkout" feature to avoid downloading the whole repository. + +### 3. Run a command {#run} + +> #### project.yml +> +> ```yaml +> commands: +> - name: preprocess +> help: "Convert the input data to spaCy's format" +> script: +> - 'python -m spacy convert assets/train.conllu corpus/' +> - 'python -m spacy convert assets/eval.conllu corpus/' +> deps: +> - 'assets/train.conllu' +> - 'assets/eval.conllu' +> outputs: +> - 'corpus/train.spacy' +> - 'corpus/eval.spacy' +> ``` + +Commands consist of one or more steps and can be run with +[`spacy project run`](/api/cli#project-run). The following will run the command +`preprocess` defined in the `project.yml`: + +```cli +$ python -m spacy project run preprocess +``` + +Commands can define their expected [dependencies and outputs](#deps-outputs) +using the `deps` (files the commands require) and `outputs` (files the commands +create) keys. This allows your project to track changes and determine whether a +command needs to be re-run. 
For instance, if your input data changes, you want +to re-run the `preprocess` command. But if nothing changed, this step can be +skipped. You can also set `--force` to force re-running a command, or `--dry` to +perform a "dry run" and see what would happen (without actually running the +script). + +### 4. Run a workflow {#run-workfow} + +> #### project.yml +> +> ```yaml +> workflows: +> all: +> - preprocess +> - train +> - package +> ``` + +Workflows are series of commands that are run in order and often depend on each +other. For instance, to generate a pipeline package, you might start by +converting your data, then run [`spacy train`](/api/cli#train) to train your +pipeline on the converted data and if that's successful, run +[`spacy package`](/api/cli#package) to turn the best trained artifact into an +installable Python package. The following command runs the workflow named `all` +defined in the `project.yml`, and executes the commands it specifies, in order: + +```cli +$ python -m spacy project run all +``` + +Using the expected [dependencies and outputs](#deps-outputs) defined in the +commands, spaCy can determine whether to re-run a command (if its inputs or +outputs have changed) or whether to skip it. If you're looking to implement more +advanced data pipelines and track your changes in Git, check out the +[Data Version Control (DVC) integration](#dvc). The +[`spacy project dvc`](/api/cli#project-dvc) command generates a DVC config file +from a workflow defined in your `project.yml` so you can manage your spaCy +project as a DVC repo. + +### 5. Optional: Push to remote storage {#push} + +> ```yaml +> ### project.yml +> remotes: +> default: 's3://my-spacy-bucket' +> local: '/mnt/scratch/cache' +> ``` + +After training a pipeline, you can optionally use the +[`spacy project push`](/api/cli#project-push) command to upload your outputs to +a remote storage, using protocols like [S3](https://aws.amazon.com/s3/), +[Google Cloud Storage](https://cloud.google.com/storage) or SSH. This can help +you **export** your pipeline packages, **share** work with your team, or **cache +results** to avoid repeating work. + +```cli +$ python -m spacy project push +``` + +The `remotes` section in your `project.yml` lets you assign names to the +different storages. To download state from a remote storage, you can use the +[`spacy project pull`](/api/cli#project-pull) command. For more details, see the +docs on [remote storage](#remote). + +## Project directory and assets {#directory} + +### project.yml {#project-yml} + +The `project.yml` defines the assets a project depends on, like datasets and +pretrained weights, as well as a series of commands that can be run separately +or as a workflow – for instance, to preprocess the data, convert it to spaCy's +format, train a pipeline, evaluate it and export metrics, package it and spin up +a quick web demo. It looks pretty similar to a config file used to define CI +pipelines. 
+ +```yaml +%%GITHUB_PROJECTS/pipelines/tagger_parser_ud/project.yml +``` + +| Section | Description | +| --------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| `title` | An optional project title used in `--help` message and [auto-generated docs](#custom-docs). | +| `description` | An optional project description used in [auto-generated docs](#custom-docs). | +| `vars` | A dictionary of variables that can be referenced in paths, URLs and scripts, just like [`config.cfg` variables](/usage/training#config-interpolation). For example, `${vars.name}` will use the value of the variable `name`. Variables need to be defined in the section `vars`, but can be a nested dict, so you're able to reference `${vars.model.name}`. | +| `directories` | An optional list of [directories](#project-files) that should be created in the project for assets, training outputs, metrics etc. spaCy will make sure that these directories always exist. | +| `assets` | A list of assets that can be fetched with the [`project assets`](/api/cli#project-assets) command. `url` defines a URL or local path, `dest` is the destination file relative to the project directory, and an optional `checksum` ensures that an error is raised if the file's checksum doesn't match. Instead of `url`, you can also provide a `git` block with the keys `repo`, `branch` and `path`, to download from a Git repo. | +| `workflows` | A dictionary of workflow names, mapped to a list of command names, to execute in order. Workflows can be run with the [`project run`](/api/cli#project-run) command. | +| `commands` | A list of named commands. A command can define an optional help message (shown in the CLI when the user adds `--help`) and the `script`, a list of commands to run. The `deps` and `outputs` let you define the created file the command depends on and produces, respectively. This lets spaCy determine whether a command needs to be re-run because its dependencies or outputs changed. Commands can be run as part of a workflow, or separately with the [`project run`](/api/cli#project-run) command. | +| `spacy_version` | Optional spaCy version range like `>=3.0.0,<3.1.0` that the project is compatible with. If it's loaded with an incompatible version, an error is raised when the project is loaded. | + +### Data assets {#data-assets} + +Assets are any files that your project might need, like training and development +corpora or pretrained weights for initializing your model. Assets are defined in +the `assets` block of your `project.yml` and can be downloaded using the +[`project assets`](/api/cli#project-assets) command. Defining checksums lets you +verify that someone else running your project will use the same files you used. +Asset URLs can be a number of different **protocols**: HTTP, HTTPS, FTP, SSH, +and even **cloud storage** such as GCS and S3. You can also download assets from +a **Git repo** instead. 
+ +#### Downloading from a URL or cloud storage {#data-assets-url} + +Under the hood, spaCy uses the +[`smart-open`](https://github.com/RaRe-Technologies/smart_open) library so you +can use any protocol it supports. Note that you may need to install extra +dependencies to use certain protocols. + +> #### project.yml +> +> ```yaml +> assets: +> # Download from public HTTPS URL +> - dest: 'assets/training.spacy' +> url: 'https://example.com/data.spacy' +> checksum: '63373dd656daa1fd3043ce166a59474c' +> # Download from Google Cloud Storage bucket +> - dest: 'assets/development.spacy' +> url: 'gs://your-bucket/corpora' +> checksum: '5113dc04e03f079525edd8df3f4f39e3' +> ``` + +| Name | Description | +| ------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `dest` | The destination path to save the downloaded asset to (relative to the project directory), including the file name. | +| `url` | The URL to download from, using the respective protocol. | +| `checksum` | Optional checksum of the file. If provided, it will be used to verify that the file matches and downloads will be skipped if a local file with the same checksum already exists. | +| `description` | Optional asset description, used in [auto-generated docs](#custom-docs). | + +#### Downloading from a Git repo {#data-assets-git} + +If a `git` block is provided, the asset is downloaded from the given Git +repository. You can download from any repo that you have access to. Under the +hood, this uses Git's "sparse checkout" feature, so you're only downloading the +files you need and not the whole repo. + +> #### project.yml +> +> ```yaml +> assets: +> - dest: 'assets/training.spacy' +> git: +> repo: 'https://github.com/example/repo' +> branch: 'master' +> path: 'path/training.spacy' +> checksum: '63373dd656daa1fd3043ce166a59474c' +> description: 'The training data (5000 examples)' +> ``` + +| Name | Description | +| ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `dest` | The destination path to save the downloaded asset to (relative to the project directory), including the file name. | +| `git` | `repo`: The URL of the repo to download from.
<br /> `path`: Path of the file or directory to download, relative to the repo root. <br />
`branch`: The branch to download from. Defaults to `"master"`. | +| `checksum` | Optional checksum of the file. If provided, it will be used to verify that the file matches and downloads will be skipped if a local file with the same checksum already exists. | +| `description` | Optional asset description, used in [auto-generated docs](#custom-docs). | + +#### Working with private assets {#data-asets-private} + +> #### project.yml +> +> ```yaml +> assets: +> - dest: 'assets/private_training_data.json' +> checksum: '63373dd656daa1fd3043ce166a59474c' +> - dest: 'assets/private_vectors.bin' +> checksum: '5113dc04e03f079525edd8df3f4f39e3' +> ``` + +For many projects, the datasets and weights you're working with might be +company-internal and not available over the internet. In that case, you can +specify the destination paths and a checksum, and leave out the URL. When your +teammates clone and run your project, they can place the files in the respective +directory themselves. The [`project assets`](/api/cli#project-assets) command +will alert you about missing files and mismatched checksums, so you can ensure that +others are running your project with the same data. + +### Dependencies and outputs {#deps-outputs} + +Each command defined in the `project.yml` can optionally define a list of +dependencies and outputs. These are the files the command requires and creates. +For example, a command for training a pipeline may depend on a +[`config.cfg`](/usage/training#config) and the training and evaluation data, and +it will export a directory `model-best`, which you can then re-use in other +commands. + + +```yaml +### project.yml +commands: + - name: train + help: 'Train a spaCy pipeline using the specified corpus and config' + script: + - 'python -m spacy train ./configs/config.cfg -o training/ --paths.train ./corpus/training.spacy --paths.dev ./corpus/evaluation.spacy' + deps: + - 'configs/config.cfg' + - 'corpus/training.spacy' + - 'corpus/evaluation.spacy' + outputs: + - 'training/model-best' +``` + +> #### Re-running vs. skipping +> +> Under the hood, spaCy uses a `project.lock` lockfile that stores the details +> for each command, as well as its dependencies and outputs and their checksums. +> It's updated on each run. If any of this information changes, the command will +> be re-run. Otherwise, it will be skipped. + +If you're running a command and it depends on files that are missing, spaCy will +show you an error. If a command defines dependencies and outputs that haven't +changed since the last run, the command will be skipped. This means that you're +only re-running commands if they need to be re-run. Commands can also set +`no_skip: true` if they should never be skipped – for example commands that run +tests. Commands without outputs are also never skipped. To force re-running a +command or workflow, even if nothing changed, you can set the `--force` flag. + +Note that [`spacy project`](/api/cli#project) doesn't compile any dependency +graphs based on the dependencies and outputs, and won't re-run previous steps +automatically. For instance, if you only run the command `train` that depends on +data created by `preprocess` and those files are missing, spaCy will show an +error – it won't just re-run `preprocess`. If you're looking for more advanced +data management, check out the [Data Version Control (DVC) integration](#dvc). 
If you're planning on integrating your spaCy project with DVC, you +can also use `outputs_no_cache` instead of `outputs` to define outputs that +won't be cached or tracked. + +### Files and directory structure {#project-files} + +The `project.yml` can define a list of `directories` that should be created +within a project – for instance, `assets`, `training`, `corpus` and so on. spaCy +will make sure that these directories are always available, so your commands can +write to and read from them. Project directories will also include all files and +directories copied from the project template with +[`spacy project clone`](/api/cli#project-clone). Here's an example of a project +directory: + +> #### project.yml +> +> +> ```yaml +> directories: ['assets', 'configs', 'corpus', 'metas', 'metrics', 'notebooks', 'packages', 'scripts', 'training'] +> ``` + +```yaml +### Example project directory +├── project.yml # the project settings +├── project.lock # lockfile that tracks inputs/outputs +├── assets/ # downloaded data assets +├── configs/ # pipeline config.cfg files used for training +├── corpus/ # output directory for training corpus +├── metas/ # pipeline meta.json templates used for packaging +├── metrics/ # output directory for evaluation metrics +├── notebooks/ # directory for Jupyter notebooks +├── packages/ # output directory for pipeline Python packages +├── scripts/ # directory for scripts, e.g. referenced in commands +├── training/ # output directory for trained pipelines +└── ... # any other files, like a requirements.txt etc. +``` + +If you don't want a project to create a directory, you can delete it and remove +its entry from the `project.yml` – just make sure it's not required by any of +the commands. [Custom templates](#custom) can use any directories they need – +the only file that's required for a project is the `project.yml`. + +--- + +## Custom scripts and projects {#custom} + +The `project.yml` lets you define any custom commands and run them as part of +your training, evaluation or deployment workflows. The `script` section defines +a list of commands that are called in a subprocess, in order. This lets you +execute other Python scripts or command-line tools. Let's say you've written a +few integration tests that load the best model produced by the training command +and check that it works correctly. You can now define a `test` command that +calls into [`pytest`](https://docs.pytest.org/en/latest/), runs your tests and +uses [`pytest-html`](https://github.com/pytest-dev/pytest-html) to export a test +report: + +```yaml +### project.yml +commands: + - name: test + help: 'Test the trained pipeline' + script: + - 'pip install pytest pytest-html' + - 'python -m pytest ./scripts/tests --html=metrics/test-report.html' + deps: + - 'training/model-best' + outputs: + - 'metrics/test-report.html' + no_skip: true +``` + +Adding `training/model-best` to the command's `deps` lets you ensure that the +file is available. If not, spaCy will show an error and the command won't run. +Setting `no_skip: true` means that the command will always run, even if the +dependencies (the trained pipeline) haven't changed. This makes sense here, +because you typically don't want to skip your tests. + +### Writing custom scripts {#custom-scripts} + +Your project commands can include any custom scripts – essentially, anything you +can run from the command line. 
+Here's an example of a custom script that uses
+[`typer`](https://typer.tiangolo.com/) for quick and easy command-line arguments
+that you can define via your `project.yml`:
+
+> #### About Typer
+>
+> [`typer`](https://typer.tiangolo.com/) is a modern library for building Python
+> CLIs using type hints. It's a dependency of spaCy, so it will already be
+> pre-installed in your environment. Function arguments automatically become
+> positional CLI arguments, and using Python type hints, you can define the value
+> types. For instance, `batch_size: int` means that the value provided via the
+> command line is converted to an integer.
+
+```python
+### scripts/custom_evaluation.py
+import typer
+
+def custom_evaluation(batch_size: int, model_path: str, data_path: str):
+    # The arguments are now available as positional CLI arguments
+    print(batch_size, model_path, data_path)
+
+if __name__ == "__main__":
+    typer.run(custom_evaluation)
+```
+
+In your `project.yml`, you can then run the script by calling
+`python scripts/custom_evaluation.py` with the function arguments. You can also
+use the `vars` section to define reusable variables that will be substituted in
+commands, paths and URLs. In this example, the batch size is defined as a
+variable and will be substituted in place of `${vars.batch_size}` in the script.
+
+> #### Calling into Python
+>
+> If any of your command scripts call into `python`, spaCy will take care of
+> replacing that with your `sys.executable`, to make sure you're executing
+> everything with the same Python (not some other Python installed on your
+> system). It also normalizes references to `python3`, `pip3` and `pip`.
+
+```yaml
+### project.yml
+vars:
+  batch_size: 128
+
+commands:
+  - name: evaluate
+    script:
+      - 'python scripts/custom_evaluation.py ${vars.batch_size} ./training/model-best ./corpus/eval.json'
+    deps:
+      - 'training/model-best'
+      - 'corpus/eval.json'
+```
+
+### Documenting your project {#custom-docs}
+
+> #### Readme Example
+>
+> For more examples, see the [`projects`](https://github.com/explosion/projects)
+> repo.
+>
+> ![Screenshot of auto-generated Markdown Readme](../images/project_document.jpg)
+
+When your custom project is ready and you want to share it with others, you can
+use the [`spacy project document`](/api/cli#project-document) command to
+**auto-generate** a pretty, Markdown-formatted `README` file based on your
+project's `project.yml`. It will list all commands, workflows and assets defined
+in the project and include details on how to run the project, as well as links
+to the relevant spaCy documentation to make it easy for others to get started
+using your project.
+
+```cli
+$ python -m spacy project document --output README.md
+```
+
+Under the hood, hidden markers are added to identify where the auto-generated
+content starts and ends. This means that you can add your own custom content
+before or after it and re-running the `project document` command will **only
+update the auto-generated part**. This makes it easy to keep your documentation
+up to date.
+
+
+
+Note that the contents of an existing file will be **replaced** if no existing
+auto-generated docs are found. If you want spaCy to ignore a file and not update
+it, you can add the comment marker `` anywhere in
+your markup.
+
+
+
+### Cloning from your own repo {#custom-repo}
+
+The [`spacy project clone`](/api/cli#project-clone) command lets you customize
+the repo to clone from using the `--repo` option.
It calls into `git`, so you'll +be able to clone from any repo that you have access to, including private repos. + +```cli +python -m spacy project clone your_project --repo https://github.com/you/repo +``` + +At a minimum, a valid project template needs to contain a +[`project.yml`](#project-yml). It can also include +[other files](/usage/projects#project-files), like custom scripts, a +`requirements.txt` listing additional dependencies, +[training configs](/usage/training#config) and model meta templates, or Jupyter +notebooks with usage examples. + + + +It's typically not a good idea to check large data assets, trained pipelines or +other artifacts into a Git repo and you should exclude them from your project +template by adding a `.gitignore`. If you want to version your data and models, +check out [Data Version Control](#dvc) (DVC), which integrates with spaCy +projects. + + + +## Remote Storage {#remote} + +You can persist your project outputs to a remote storage using the +[`project push`](/api/cli#project-push) command. This can help you **export** +your pipeline packages, **share** work with your team, or **cache results** to +avoid repeating work. The [`project pull`](/api/cli#project-pull) command will +download any outputs that are in the remote storage and aren't available +locally. + +You can list one or more remotes in the `remotes` section of your +[`project.yml`](#project-yml) by mapping a string name to the URL of the +storage. Under the hood, spaCy uses the +[`smart-open`](https://github.com/RaRe-Technologies/smart_open) library to +communicate with the remote storages, so you can use any protocol that +`smart-open` supports, including [S3](https://aws.amazon.com/s3/), +[Google Cloud Storage](https://cloud.google.com/storage), SSH and more, although +you may need to install extra dependencies to use certain protocols. + +> #### Example +> +> ```cli +> $ python -m spacy project pull local +> ``` + +```yaml +### project.yml +remotes: + default: 's3://my-spacy-bucket' + local: '/mnt/scratch/cache' + stuff: 'ssh://myserver.example.com/whatever' +``` + + + +Inside the remote storage, spaCy uses a clever **directory structure** to avoid +overwriting files. The top level of the directory structure is a URL-encoded +version of the output's path. Within this directory are subdirectories named +according to a hash of the command string and the command's dependencies. +Finally, within those directories are files, named according to an MD5 hash of +their contents. + + + + +```yaml +└── urlencoded_file_path # Path of original file + ├── some_command_hash # Hash of command you ran + │ ├── some_content_hash # Hash of file content + │ └── another_content_hash + └── another_command_hash + └── third_content_hash +``` + + + +For instance, let's say you had the following command in your `project.yml`: + +```yaml +### project.yml +- name: train + help: 'Train a spaCy pipeline using the specified corpus and config' + script: + - 'spacy train ./config.cfg --output training/' + deps: + - 'corpus/train' + - 'corpus/dev' + - 'config.cfg' + outputs: + - 'training/model-best' +``` + +> #### Example +> +> ``` +> └── s3://my-spacy-bucket/training%2Fmodel-best +> └── 1d8cb33a06cc345ad3761c6050934a1b +> └── d8e20c3537a084c5c10d95899fe0b1ff +> ``` + +After you finish training, you run [`project push`](/api/cli#project-push) to +make sure the `training/model-best` output is saved to remote storage. 
+spaCy will then construct a hash from your command script and the listed dependencies,
+`corpus/train`, `corpus/dev` and `config.cfg`, in order to identify the
+execution context of your output. It would then compute an MD5 hash of the
+`training/model-best` directory, and use those three pieces of information to
+construct the storage URL.
+
+```cli
+$ python -m spacy project run train
+$ python -m spacy project push
+```
+
+If you change the command or one of its dependencies (for instance, by editing
+the [`config.cfg`](/usage/training#config) file to tune the hyperparameters), a
+different creation hash will be calculated, so when you use
+[`project push`](/api/cli#project-push) you won't be overwriting your previous
+file. The system even supports multiple outputs for the same file and the same
+context, which can happen if your training process is not deterministic, or if
+you have dependencies that aren't represented in the command.
+
+In summary, the [`spacy project`](/api/cli#project) remote storages are designed
+to make a particular set of trade-offs. Priority is placed on **convenience**,
+**correctness** and **avoiding data loss**. You can use
+[`project push`](/api/cli#project-push) freely, as you'll never overwrite remote
+state, and you don't have to come up with names or version numbers. However,
+it's up to you to manage the size of your remote storage, and to remove files
+that are no longer relevant to you.
+
+## Integrations {#integrations}
+
+### Data Version Control (DVC) {#dvc}
+
+Data assets like training corpora or pretrained weights are at the core of any
+NLP project, but they're often difficult to manage: you can't just check them
+into your Git repo to version and keep track of them. And if you have multiple
+steps that depend on each other, like a preprocessing step that generates your
+training data, you need to make sure the data is always up-to-date, and re-run
+all steps of your process every time, just to be safe.
+
+[Data Version Control](https://dvc.org) (DVC) is a standalone open-source tool
+that integrates into your workflow like Git, builds a dependency graph for your
+data pipelines and tracks and caches your data files. If you're downloading data
+from an external source, like a storage bucket, DVC can tell whether the
+resource has changed. It can also determine whether to re-run a step, depending
+on whether its inputs have changed or not. All metadata can be checked into a Git
+repo, so you'll always be able to reproduce your experiments.
+
+To set up DVC, install the package and initialize your spaCy project as a Git
+and DVC repo. You can also
+[customize your DVC installation](https://dvc.org/doc/install/macos#install-with-pip)
+to include support for remote storage like Google Cloud Storage, S3, Azure, SSH
+and more.
+
+```bash
+$ pip install dvc   # Install DVC
+$ git init          # Initialize a Git repo
+$ dvc init          # Initialize a DVC project
+```
+
+
+
+DVC enables usage analytics by default, so if you're working in a
+privacy-sensitive environment, make sure to
+[**opt-out manually**](https://dvc.org/doc/user-guide/analytics#opting-out).
+
+
+
+The [`spacy project dvc`](/api/cli#project-dvc) command creates a `dvc.yaml`
+config file based on a workflow defined in your `project.yml`. Whenever you
+update your project, you can re-run the command to update your DVC config.
+You can then manage your spaCy project like any other DVC project, run
+[`dvc add`](https://dvc.org/doc/command-reference/add) to add and track assets
+and [`dvc repro`](https://dvc.org/doc/command-reference/repro) to reproduce the
+workflow or individual commands.
+
+```cli
+$ python -m spacy project dvc [workflow_name]
+```
+
+
+
+DVC currently expects a single workflow per project, so when creating the config
+with [`spacy project dvc`](/api/cli#project-dvc), you need to specify the name
+of a workflow defined in your `project.yml`. You can still use multiple
+workflows, but only one can be tracked by DVC.
+
+
+
+---
+
+### Prodigy {#prodigy}
+
+
+
+The Prodigy integration will require a nightly version of Prodigy that supports
+spaCy v3+. You can already use annotations created with Prodigy in spaCy v3 by
+exporting your data with
+[`data-to-spacy`](https://prodi.gy/docs/recipes#data-to-spacy) and running
+[`spacy convert`](/api/cli#convert) to convert it to the binary format.
+
+
+
+[Prodigy](https://prodi.gy) is a modern annotation tool for creating training
+data for machine learning models, developed by us. It integrates with spaCy
+out-of-the-box and provides many different
+[annotation recipes](https://prodi.gy/docs/recipes) for a variety of NLP tasks,
+with and without a model in the loop. If Prodigy is installed in your project,
+you can start the annotation server from your `project.yml` for a tight feedback
+loop between data development and training.
+
+The following example command starts the Prodigy app using the
+[`ner.correct`](https://prodi.gy/docs/recipes#ner-correct) recipe and streams in
+suggestions for the given entity labels produced by a pretrained model. You can
+then correct the suggestions manually in the UI. After you save and exit the
+server, the full dataset is exported in spaCy's format and split into a training
+and evaluation set.
+
+> #### Example usage
+>
+> ```cli
+> $ python -m spacy project run annotate
+> ```
+
+```yaml
+### project.yml
+vars:
+  prodigy:
+    dataset: 'ner_articles'
+    labels: 'PERSON,ORG,PRODUCT'
+    model: 'en_core_web_md'
+
+commands:
+  - name: annotate
+    script:
+      - 'python -m prodigy ner.correct ${vars.prodigy.dataset} ./assets/raw_data.jsonl ${vars.prodigy.model} --labels ${vars.prodigy.labels}'
+      - 'python -m prodigy data-to-spacy ./corpus/train.json ./corpus/eval.json --ner ${vars.prodigy.dataset}'
+      - 'python -m spacy convert ./corpus/train.json ./corpus/train.spacy'
+      - 'python -m spacy convert ./corpus/eval.json ./corpus/eval.spacy'
+    deps:
+      - 'assets/raw_data.jsonl'
+    outputs:
+      - 'corpus/train.spacy'
+      - 'corpus/eval.spacy'
+```
+
+You can use the same approach for other types of projects and annotation
+workflows, including
+[text classification](https://prodi.gy/docs/recipes#textcat),
+[dependency parsing](https://prodi.gy/docs/recipes#dep),
+[part-of-speech tagging](https://prodi.gy/docs/recipes#pos) or fully
+[custom recipes](https://prodi.gy/docs/custom-recipes) – for instance, an A/B
+evaluation workflow that lets you compare two different models and their
+results.
+
+
+
+---
+
+### Streamlit {#streamlit}
+
+[Streamlit](https://streamlit.io) is a Python framework for building interactive
+data apps. The [`spacy-streamlit`](https://github.com/explosion/spacy-streamlit)
+package helps you integrate spaCy visualizations into your Streamlit apps and
+quickly spin up demos to explore your pipelines interactively. It includes a
+full embedded visualizer, as well as individual components.
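+
+For example, a visualizer script called from your `project.yml` could look
+roughly like the sketch below. It assumes the package's top-level `visualize`
+helper and a `scripts/visualize.py` path – adjust both to your own project:
+
+```python
+### scripts/visualize.py (sketch)
+import typer
+import spacy_streamlit
+
+def main(models: str, default_text: str):
+    # Comma-separated list of pipeline paths or installed package names
+    model_names = [name.strip() for name in models.split(",")]
+    spacy_streamlit.visualize(model_names, default_text)
+
+if __name__ == "__main__":
+    try:
+        typer.run(main)
+    except SystemExit:
+        # Streamlit re-executes the script top to bottom, so swallow typer's exit
+        pass
+```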
+
+
+
+> #### Installation
+>
+> ```bash
+> $ pip install spacy-streamlit --pre
+> ```
+
+![](../images/spacy-streamlit.png)
+
+Using [`spacy-streamlit`](https://github.com/explosion/spacy-streamlit), your
+projects can easily define their own scripts that spin up an interactive
+visualizer, using the latest pipeline you trained, or a selection of pipelines
+so you can compare their results.
+
+
+
+Get started with spaCy and Streamlit using our project template. It includes a
+script to spin up a custom visualizer and commands you can adjust to showcase
+and explore your own custom trained pipelines.
+
+
+
+> #### Example usage
+>
+> ```cli
+> $ python -m spacy project run visualize
+> ```
+
+```yaml
+### project.yml
+commands:
+  - name: visualize
+    help: "Visualize the pipeline's output interactively using Streamlit"
+    script:
+      - 'streamlit run ./scripts/visualize.py ./training/model-best "I like Adidas shoes."'
+    deps:
+      - "training/model-best"
+```
+
+The following script is called from the `project.yml` and takes two positional
+command-line arguments: a comma-separated list of paths or packages to load the
+pipelines from and an example text to use as the default text.
+
+```python
+https://github.com/explosion/projects/blob/v3/integrations/streamlit/scripts/visualize.py
+```
+
+---
+
+### FastAPI {#fastapi}
+
+[FastAPI](https://fastapi.tiangolo.com/) is a modern high-performance framework
+for building REST APIs with Python, based on Python
+[type hints](https://fastapi.tiangolo.com/python-types/). It's become a popular
+library for serving machine learning models and you can use it in your spaCy
+projects to quickly serve up a trained pipeline and make it available behind a
+REST API.
+
+
+
+Get started with spaCy and FastAPI using our project template. It includes a
+simple REST API for processing batches of text, and usage examples for how to
+query your API from Python and JavaScript (Vanilla JS and React).
+
+
+
+> #### Example usage
+>
+> ```cli
+> $ python -m spacy project run serve
+> ```
+
+```yaml
+### project.yml
+commands:
+  - name: "serve"
+    help: "Serve the models via a FastAPI REST API using the given host and port"
+    script:
+      - "uvicorn scripts.main:app --reload --host 127.0.0.1 --port 5000"
+    deps:
+      - "scripts/main.py"
+    no_skip: true
+```
+
+The script included in the template shows a simple REST API with a `POST`
+endpoint that accepts batches of texts and returns batches of predictions, e.g.
+named entities found in the documents. Type hints and
+[`pydantic`](https://github.com/samuelcolvin/pydantic) are used to define the
+expected data types.
+
+```python
+https://github.com/explosion/projects/blob/v3/integrations/fastapi/scripts/main.py
+```
+
+---
+
+### Ray {#ray}
+
+> #### Installation
+>
+> ```cli
+> $ pip install -U %%SPACY_PKG_NAME[ray]%%SPACY_PKG_FLAGS
+> # Check that the CLI is registered
+> $ python -m spacy ray --help
+> ```
+
+[Ray](https://ray.io/) is a fast and simple framework for building and running
+**distributed applications**. You can use Ray for parallel and distributed
+training with spaCy via our lightweight
+[`spacy-ray`](https://github.com/explosion/spacy-ray) extension package. If the
+package is installed in the same environment as spaCy, it will automatically add
+[`spacy ray`](/api/cli#ray) commands to your spaCy CLI. See the usage guide on
+[parallel training](/usage/training#parallel-training) for more details on how
+it works under the hood.
+
+
+
+Get started with parallel training using our project template.
It trains a +simple model on a Universal Dependencies Treebank and lets you parallelize the +training with Ray. + + + +You can integrate [`spacy ray train`](/api/cli#ray-train) into your +`project.yml` just like the regular training command and pass it the config, and +optional output directory or remote storage URL and config overrides if needed. + + +```yaml +### project.yml +commands: + - name: "ray" + help: "Train a model via parallel training with Ray" + script: + - "python -m spacy ray train configs/config.cfg -o training/ --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy" + deps: + - "corpus/train.spacy" + - "corpus/dev.spacy" + outputs: + - "training/model-best" +``` + +--- + +### Weights & Biases {#wandb} + +[Weights & Biases](https://www.wandb.com/) is a popular platform for experiment +tracking. spaCy integrates with it out-of-the-box via the +[`WandbLogger`](/api/top-level#WandbLogger), which you can add as the +`[training.logger]` block of your training [config](/usage/training#config). The +results of each step are then logged in your project, together with the full +**training config**. This means that _every_ hyperparameter, registered function +name and argument will be tracked and you'll be able to see the impact it has on +your results. + +> #### Example config +> +> ```ini +> [training.logger] +> @loggers = "spacy.WandbLogger.v1" +> project_name = "monitor_spacy_training" +> remove_config_values = ["paths.train", "paths.dev", "corpora.train.path", "corpora.dev.path"] +> ``` + +![Screenshot: Visualized training results](../images/wandb1.jpg) + +![Screenshot: Parameter importance using config values](../images/wandb2.jpg 'Parameter importance using config values') + + + +Get started with tracking your spaCy training runs in Weights & Biases using our +project template. It trains on the IMDB Movie Review Dataset and includes a +simple config with the built-in `WandbLogger`, as well as a custom example of +creating variants of the config for a simple hyperparameter grid search and +logging the results. + + diff --git a/website/docs/usage/rule-based-matching.md b/website/docs/usage/rule-based-matching.md index 7749dab59..131bd8c94 100644 --- a/website/docs/usage/rule-based-matching.md +++ b/website/docs/usage/rule-based-matching.md @@ -4,6 +4,7 @@ teaser: Find phrases and tokens, and match entities menu: - ['Token Matcher', 'matcher'] - ['Phrase Matcher', 'phrasematcher'] + - ['Dependency Matcher', 'dependencymatcher'] - ['Entity Ruler', 'entityruler'] - ['Models & Rules', 'models-rules'] --- @@ -54,7 +55,7 @@ abstract representations of the tokens you're looking for, using lexical attributes, linguistic features predicted by the model, operators, set membership and rich comparison. For example, you can find a noun, followed by a verb with the lemma "love" or "like", followed by an optional determiner and -another token that's at least ten characters long. +another token that's at least 10 characters long.
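+
+For illustration, a token pattern along those lines could be written and applied
+as in the following sketch – the attribute and operator syntax is covered in the
+sections below, and the exact matches will depend on the model's predictions:
+
+```python
+import spacy
+from spacy.matcher import Matcher
+
+nlp = spacy.load("en_core_web_sm")
+matcher = Matcher(nlp.vocab)
+# A noun, a verb with the lemma "love" or "like", an optional determiner
+# and a token that's at least 10 characters long
+pattern = [
+    {"POS": "NOUN"},
+    {"LEMMA": {"IN": ["love", "like"]}, "POS": "VERB"},
+    {"POS": "DET", "OP": "?"},
+    {"LENGTH": {">=": 10}},
+]
+matcher.add("NOUN_LOVE_LIKE", [pattern])
+doc = nlp("Cats love wonderfully comfortable cardboard boxes.")
+print([doc[start:end].text for _, start, end in matcher(doc)])
+```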
@@ -98,9 +99,7 @@ print([token.text for token in doc]) First, we initialize the `Matcher` with a vocab. The matcher must always share the same vocab with the documents it will operate on. We can now call -[`matcher.add()`](/api/matcher#add) with an ID and our custom pattern. The -second argument lets you pass in an optional callback function to invoke on a -successful match. For now, we set it to `None`. +[`matcher.add()`](/api/matcher#add) with an ID and a list of patterns. ```python ### {executable="true"} @@ -111,7 +110,7 @@ nlp = spacy.load("en_core_web_sm") matcher = Matcher(nlp.vocab) # Add match ID "HelloWorld" with no callback and one pattern pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}] -matcher.add("HelloWorld", None, pattern) +matcher.add("HelloWorld", [pattern]) doc = nlp("Hello, world! Hello world!") matches = matcher(doc) @@ -137,9 +136,11 @@ Optionally, we could also choose to add more than one pattern, for example to also match sequences without punctuation between "hello" and "world": ```python -matcher.add("HelloWorld", None, - [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}], - [{"LOWER": "hello"}, {"LOWER": "world"}]) +patterns = [ + [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}], + [{"LOWER": "hello"}, {"LOWER": "world"}] +] +matcher.add("HelloWorld", patterns) ``` By default, the matcher will only return the matches and **not do anything @@ -157,20 +158,21 @@ The available token pattern keys correspond to a number of [`Token` attributes](/api/token#attributes). The supported attributes for rule-based matching are: -| Attribute | Type |  Description | -| -------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------ | -| `ORTH` | unicode | The exact verbatim text of a token. | -| `TEXT` 2.1 | unicode | The exact verbatim text of a token. | -| `LOWER` | unicode | The lowercase form of the token text. | -|  `LENGTH` | int | The length of the token text. | -|  `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT` | bool | Token text consists of alphabetic characters, ASCII characters, digits. | -|  `IS_LOWER`, `IS_UPPER`, `IS_TITLE` | bool | Token text is in lowercase, uppercase, titlecase. | -|  `IS_PUNCT`, `IS_SPACE`, `IS_STOP` | bool | Token is punctuation, whitespace, stop word. | -|  `IS_SENT_START` | bool | Token is start of sentence. | -|  `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL` | bool | Token text resembles a number, URL, email. | -|  `POS`, `TAG`, `DEP`, `LEMMA`, `SHAPE` | unicode | The token's simple and extended part-of-speech tag, dependency label, lemma, shape. | -| `ENT_TYPE` | unicode | The token's entity label. | -| `_` 2.1 | dict | Properties in [custom extension attributes](/usage/processing-pipelines#custom-components-attributes). | +| Attribute |  Description | +| ----------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------- | +| `ORTH` | The exact verbatim text of a token. ~~str~~ | +| `TEXT` 2.1 | The exact verbatim text of a token. ~~str~~ | +| `LOWER` | The lowercase form of the token text. ~~str~~ | +|  `LENGTH` | The length of the token text. ~~int~~ | +|  `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT` | Token text consists of alphabetic characters, ASCII characters, digits. ~~bool~~ | +|  `IS_LOWER`, `IS_UPPER`, `IS_TITLE` | Token text is in lowercase, uppercase, titlecase. 
~~bool~~ | +|  `IS_PUNCT`, `IS_SPACE`, `IS_STOP` | Token is punctuation, whitespace, stop word. ~~bool~~ | +|  `IS_SENT_START` | Token is start of sentence. ~~bool~~ | +|  `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL` | Token text resembles a number, URL, email. ~~bool~~ | +|  `POS`, `TAG`, `MORPH`, `DEP`, `LEMMA`, `SHAPE` | The token's simple and extended part-of-speech tag, morphological analysis, dependency label, lemma, shape. ~~str~~ | +| `ENT_TYPE` | The token's entity label. ~~str~~ | +| `_` 2.1 | Properties in [custom extension attributes](/usage/processing-pipelines#custom-components-attributes). ~~Dict[str, Any]~~ | +| `OP` | [Operator or quantifier](#quantifiers) to determine how often to match a token pattern. ~~str~~ | @@ -191,12 +193,11 @@ of [`Token`](/api/token). This means that all of the attributes that refer to computed properties can't be accessed. The uppercase attribute names like `LOWER` or `IS_PUNCT` refer to symbols from -the -[`spacy.attrs`](https://github.com/explosion/spaCy/tree/master/spacy/attrs.pyx) -enum table. They're passed into a function that essentially is a big case/switch -statement, to figure out which struct field to return. The same attribute -identifiers are used in [`Doc.to_array`](/api/doc#to_array), and a few other -places in the code where you need to describe fields like this. +the [`spacy.attrs`](%%GITHUB_SPACY/spacy/attrs.pyx) enum table. They're passed +into a function that essentially is a big case/switch statement, to figure out +which struct field to return. The same attribute identifiers are used in +[`Doc.to_array`](/api/doc#to_array), and a few other places in the code where +you need to describe fields like this. @@ -232,11 +233,13 @@ following rich comparison attributes are available: > pattern2 = [{"LENGTH": {">=": 10}}] > ``` -| Attribute | Value Type | Description | -| -------------------------- | ---------- | --------------------------------------------------------------------------------- | -| `IN` | any | Attribute value is member of a list. | -| `NOT_IN` | any | Attribute value is _not_ member of a list. | -| `==`, `>=`, `<=`, `>`, `<` | int, float | Attribute value is equal, greater or equal, smaller or equal, greater or smaller. | +| Attribute | Description | +| -------------------------- | ------------------------------------------------------------------------------------------------------- | +| `IN` | Attribute value is member of a list. ~~Any~~ | +| `NOT_IN` | Attribute value is _not_ member of a list. ~~Any~~ | +| `ISSUBSET` | Attribute values (for `MORPH`) are a subset of a list. ~~Any~~ | +| `ISSUPERSET` | Attribute values (for `MORPH`) are a superset of a list. ~~Any~~ | +| `==`, `>=`, `<=`, `>`, `<` | Attribute value is equal, greater or equal, smaller or equal, greater or smaller. 
~~Union[int, float]~~ | #### Regular expressions {#regex new="2.1"} @@ -414,7 +417,7 @@ nlp = spacy.load("en_core_web_sm") matcher = Matcher(nlp.vocab, validate=True) # Add match ID "HelloWorld" with unsupported attribute CASEINSENSITIVE pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"CASEINSENSITIVE": "world"}] -matcher.add("HelloWorld", None, pattern) +matcher.add("HelloWorld", [pattern]) # 🚨 Raises an error: # MatchPatternError: Invalid token patterns for matcher rule 'HelloWorld' # Pattern 0: @@ -447,7 +450,7 @@ def add_event_ent(matcher, doc, i, matches): print(entity.text) pattern = [{"ORTH": "Google"}, {"ORTH": "I"}, {"ORTH": "/"}, {"ORTH": "O"}] -matcher.add("GoogleIO", add_event_ent, pattern) +matcher.add("GoogleIO", [pattern], on_match=add_event_ent) doc = nlp("This is a text about Google I/O") matches = matcher(doc) ``` @@ -486,12 +489,45 @@ This allows you to write callbacks that consider the entire set of matched phrases, so that you can resolve overlaps and other conflicts in whatever way you prefer. -| Argument | Type | Description | -| --------- | --------- | -------------------------------------------------------------------------------------------------------------------- | -| `matcher` | `Matcher` | The matcher instance. | -| `doc` | `Doc` | The document the matcher was used on. | -| `i` | int | Index of the current match (`matches[i`]). | -| `matches` | list |  A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. | +| Argument | Description | +| --------- | ------------------------------------------------------------------------------------------------------------------------------------------------- | +| `matcher` | The matcher instance. ~~Matcher~~ | +| `doc` | The document the matcher was used on. ~~Doc~~ | +| `i` | Index of the current match (`matches[i`]). ~~int~~ | +| `matches` | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. ~~List[Tuple[int, int int]]~~ | + +### Creating spans from matches {#matcher-spans} + +Creating [`Span`](/api/span) objects from the returned matches is a very common +use case. spaCy makes this easy by giving you access to the `start` and `end` +token of each match, which you can use to construct a new span with an optional +label. As of spaCy v3.0, you can also set `as_spans=True` when calling the +matcher on a `Doc`, which will return a list of [`Span`](/api/span) objects +using the `match_id` as the span label. + +```python +### {executable="true"} +import spacy +from spacy.matcher import Matcher +from spacy.tokens import Span + +nlp = spacy.blank("en") +matcher = Matcher(nlp.vocab) +matcher.add("PERSON", [[{"lower": "barack"}, {"lower": "obama"}]]) +doc = nlp("Barack Obama was the 44th president of the United States") + +# 1. Return (match_id, start, end) tuples +matches = matcher(doc) +for match_id, start, end in matches: + # Create the matched span and assign the match_id as a label + span = Span(doc, start, end, label=match_id) + print(span.text, span.label_) + +# 2. Return Span objects directly +matches = matcher(doc, as_spans=True) +for span in matches: + print(span.text, span.label_) +``` ### Using custom pipeline components {#matcher-pipeline} @@ -507,22 +543,26 @@ attribute `bad_html` on the token. 
```python ### {executable="true"} import spacy +from spacy.language import Language from spacy.matcher import Matcher from spacy.tokens import Token -# We're using a class because the component needs to be initialised with -# the shared vocab via the nlp object -class BadHTMLMerger(object): - def __init__(self, nlp): - # Register a new token extension to flag bad HTML - Token.set_extension("bad_html", default=False) - self.matcher = Matcher(nlp.vocab) - self.matcher.add( - "BAD_HTML", - None, +# We're using a component factory because the component needs to be +# initialized with the shared vocab via the nlp object +@Language.factory("html_merger") +def create_bad_html_merger(nlp, name): + return BadHTMLMerger(nlp.vocab) + +class BadHTMLMerger: + def __init__(self, vocab): + patterns = [ [{"ORTH": "<"}, {"LOWER": "br"}, {"ORTH": ">"}], [{"ORTH": "<"}, {"LOWER": "br/"}, {"ORTH": ">"}], - ) + ] + # Register a new token extension to flag bad HTML + Token.set_extension("bad_html", default=False) + self.matcher = Matcher(vocab) + self.matcher.add("BAD_HTML", patterns) def __call__(self, doc): # This method is invoked when the component is called on a Doc @@ -538,8 +578,7 @@ class BadHTMLMerger(object): return doc nlp = spacy.load("en_core_web_sm") -html_merger = BadHTMLMerger(nlp) -nlp.add_pipe(html_merger, last=True) # Add component to the pipeline +nlp.add_pipe("html_merger", last=True) # Add component to the pipeline doc = nlp("Hello
<br>world! <br/>
This is a test.") for token in doc: print(token.text, token._.bad_html) @@ -548,13 +587,19 @@ for token in doc: Instead of hard-coding the patterns into the component, you could also make it take a path to a JSON file containing the patterns. This lets you reuse the -component with different patterns, depending on your application: +component with different patterns, depending on your application. When adding +the component to the pipeline with [`nlp.add_pipe`](/api/language#add_pipe), you +can pass in the argument via the `config`: ```python -html_merger = BadHTMLMerger(nlp, path="/path/to/patterns.json") +@Language.factory("html_merger", default_config={"path": None}) +def create_bad_html_merger(nlp, name, path): + return BadHTMLMerger(nlp, path=path) + +nlp.add_pipe("html_merger", config={"path": "/path/to/patterns.json"}) ``` - + For more details and examples of how to **create custom pipeline components** and **extension attributes**, see the @@ -586,8 +631,8 @@ To get a quick overview of the results, you could collect all sentences containing a match and render them with the [displaCy visualizer](/usage/visualizers). In the callback function, you'll have access to the `start` and `end` of each match, as well as the parent `Doc`. This -lets you determine the sentence containing the match, `doc[start : end`.sent], -and calculate the start and end of the matched span within the sentence. Using +lets you determine the sentence containing the match, `doc[start:end].sent`, and +calculate the start and end of the matched span within the sentence. Using displaCy in ["manual" mode](/usage/visualizers#manual-usage) lets you pass in a list of dictionaries containing the text and entities to render. @@ -617,7 +662,7 @@ def collect_sents(matcher, doc, i, matches): pattern = [{"LOWER": "facebook"}, {"LEMMA": "be"}, {"POS": "ADV", "OP": "*"}, {"POS": "ADJ"}] -matcher.add("FacebookIs", collect_sents, pattern) # add pattern +matcher.add("FacebookIs", [pattern], on_match=collect_sents) # add pattern doc = nlp("I'd say that Facebook is evil. – Facebook is pretty cool, right?") matches = matcher(doc) @@ -672,7 +717,7 @@ nlp = spacy.load("en_core_web_sm") matcher = Matcher(nlp.vocab) pattern = [{"ORTH": "("}, {"SHAPE": "ddd"}, {"ORTH": ")"}, {"SHAPE": "ddd"}, {"ORTH": "-", "OP": "?"}, {"SHAPE": "ddd"}] -matcher.add("PHONE_NUMBER", None, pattern) +matcher.add("PHONE_NUMBER", [pattern]) doc = nlp("Call me at (123) 456 789 or (123) 456 789!") print([t.text for t in doc]) @@ -717,7 +762,7 @@ whitespace, making them easy to match as well. from spacy.lang.en import English from spacy.matcher import Matcher -nlp = English() # We only want the tokenizer, so no need to load a model +nlp = English() # We only want the tokenizer, so no need to load a pipeline matcher = Matcher(nlp.vocab) pos_emoji = ["😀", "😃", "😂", "🤣", "😊", "😍"] # Positive emoji @@ -735,11 +780,11 @@ def label_sentiment(matcher, doc, i, matches): elif doc.vocab.strings[match_id] == "SAD": doc.sentiment -= 0.1 # Subtract 0.1 for negative sentiment -matcher.add("HAPPY", label_sentiment, *pos_patterns) # Add positive pattern -matcher.add("SAD", label_sentiment, *neg_patterns) # Add negative pattern +matcher.add("HAPPY", pos_patterns, on_match=label_sentiment) # Add positive pattern +matcher.add("SAD", neg_patterns, on_match=label_sentiment) # Add negative pattern # Add pattern for valid hashtag, i.e. 
'#' plus any ASCII token -matcher.add("HASHTAG", None, [{"ORTH": "#"}, {"IS_ASCII": True}]) +matcher.add("HASHTAG", [[{"ORTH": "#"}, {"IS_ASCII": True}]]) doc = nlp("Hello world 😀 #MondayMotivation") matches = matcher(doc) @@ -793,7 +838,7 @@ nlp = spacy.load("en_core_web_sm") matcher = Matcher(nlp.vocab) # Add pattern for valid hashtag, i.e. '#' plus any ASCII token -matcher.add("HASHTAG", None, [{"ORTH": "#"}, {"IS_ASCII": True}]) +matcher.add("HASHTAG", [[{"ORTH": "#"}, {"IS_ASCII": True}]]) # Register token extension Token.set_extension("is_hashtag", default=False) @@ -814,15 +859,6 @@ for token in doc: print(token.text, token._.is_hashtag) ``` -To process a stream of social media posts, we can use -[`Language.pipe`](/api/language#pipe), which will return a stream of `Doc` -objects that we can pass to [`Matcher.pipe`](/api/matcher#pipe). - -```python -docs = nlp.pipe(LOTS_OF_TWEETS) -matches = matcher.pipe(docs) -``` - ## Efficient phrase matching {#phrasematcher} If you need to match large terminology lists, you can also use the @@ -837,12 +873,12 @@ patterns can contain single or multiple tokens. import spacy from spacy.matcher import PhraseMatcher -nlp = spacy.load('en_core_web_sm') +nlp = spacy.load("en_core_web_sm") matcher = PhraseMatcher(nlp.vocab) terms = ["Barack Obama", "Angela Merkel", "Washington, D.C."] # Only run nlp.make_doc to speed things up patterns = [nlp.make_doc(text) for text in terms] -matcher.add("TerminologyList", None, *patterns) +matcher.add("TerminologyList", patterns) doc = nlp("German Chancellor Angela Merkel and US President Barack Obama " "converse in the Oval Office inside the White House in Washington, D.C.") @@ -860,12 +896,13 @@ pattern covering the exact tokenization of the term. To create the patterns, each phrase has to be processed with the `nlp` object. -If you have a model loaded, doing this in a loop or list comprehension can -easily become inefficient and slow. If you **only need the tokenization and -lexical attributes**, you can run [`nlp.make_doc`](/api/language#make_doc) -instead, which will only run the tokenizer. For an additional speed boost, you -can also use the [`nlp.tokenizer.pipe`](/api/tokenizer#pipe) method, which will -process the texts as a stream. +If you have a trained pipeline loaded, doing this in a loop or list +comprehension can easily become inefficient and slow. If you **only need the +tokenization and lexical attributes**, you can run +[`nlp.make_doc`](/api/language#make_doc) instead, which will only run the +tokenizer. For an additional speed boost, you can also use the +[`nlp.tokenizer.pipe`](/api/tokenizer#pipe) method, which will process the texts +as a stream. ```diff - patterns = [nlp(term) for term in LOTS_OF_TERMS] @@ -891,7 +928,7 @@ from spacy.matcher import PhraseMatcher nlp = English() matcher = PhraseMatcher(nlp.vocab, attr="LOWER") patterns = [nlp.make_doc(name) for name in ["Angela Merkel", "Barack Obama"]] -matcher.add("Names", None, *patterns) +matcher.add("Names", patterns) doc = nlp("angela merkel and us president barack Obama") for match_id, start, end in matcher(doc): @@ -905,10 +942,10 @@ object patterns as efficiently as possible and without running any of the other pipeline components. If the token attribute you want to match on are set by a pipeline component, **make sure that the pipeline component runs** when you create the pattern. For example, to match on `POS` or `LEMMA`, the pattern `Doc` -objects need to have part-of-speech tags set by the `tagger`. 
You can either -call the `nlp` object on your pattern texts instead of `nlp.make_doc`, or use -[`nlp.disable_pipes`](/api/language#disable_pipes) to disable components -selectively. +objects need to have part-of-speech tags set by the `tagger` or `morphologizer`. +You can either call the `nlp` object on your pattern texts instead of +`nlp.make_doc`, or use [`nlp.select_pipes`](/api/language#select_pipes) to +disable components selectively. @@ -925,7 +962,7 @@ from spacy.matcher import PhraseMatcher nlp = English() matcher = PhraseMatcher(nlp.vocab, attr="SHAPE") -matcher.add("IP", None, nlp("127.0.0.1"), nlp("127.127.0.0")) +matcher.add("IP", [nlp("127.0.0.1"), nlp("127.127.0.0")]) doc = nlp("Often the router will have an IP address such as 192.168.1.1 or 192.168.2.1.") for match_id, start, end in matcher(doc): @@ -939,12 +976,289 @@ to match phrases with the same sequence of punctuation and non-punctuation tokens as the pattern. But this can easily get confusing and doesn't have much of an advantage over writing one or two token patterns. +## Dependency Matcher {#dependencymatcher new="3" model="parser"} + +The [`DependencyMatcher`](/api/dependencymatcher) lets you match patterns within +the dependency parse using +[Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html) +operators. It requires a model containing a parser such as the +[`DependencyParser`](/api/dependencyparser). Instead of defining a list of +adjacent tokens as in `Matcher` patterns, the `DependencyMatcher` patterns match +tokens in the dependency parse and specify the relations between them. + +> ```python +> ### Example +> from spacy.matcher import DependencyMatcher +> +> # "[subject] ... initially founded" +> pattern = [ +> # anchor token: founded +> { +> "RIGHT_ID": "founded", +> "RIGHT_ATTRS": {"ORTH": "founded"} +> }, +> # founded -> subject +> { +> "LEFT_ID": "founded", +> "REL_OP": ">", +> "RIGHT_ID": "subject", +> "RIGHT_ATTRS": {"DEP": "nsubj"} +> }, +> # "founded" follows "initially" +> { +> "LEFT_ID": "founded", +> "REL_OP": ";", +> "RIGHT_ID": "initially", +> "RIGHT_ATTRS": {"ORTH": "initially"} +> } +> ] +> +> matcher = DependencyMatcher(nlp.vocab) +> matcher.add("FOUNDED", [pattern]) +> matches = matcher(doc) +> ``` + +A pattern added to the dependency matcher consists of a **list of +dictionaries**, with each dictionary describing a **token to match** and its +**relation to an existing token** in the pattern. Except for the first +dictionary, which defines an anchor token using only `RIGHT_ID` and +`RIGHT_ATTRS`, each pattern should have the following keys: + +| Name | Description | +| ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `LEFT_ID` | The name of the left-hand node in the relation, which has been defined in an earlier node. ~~str~~ | +| `REL_OP` | An operator that describes how the two nodes are related. ~~str~~ | +| `RIGHT_ID` | A unique name for the right-hand node in the relation. ~~str~~ | +| `RIGHT_ATTRS` | The token attributes to match for the right-hand node in the same format as patterns provided to the regular token-based [`Matcher`](/api/matcher). ~~Dict[str, Any]~~ | + +Each additional token added to the pattern is linked to an existing token +`LEFT_ID` by the relation `REL_OP`. The new token is given the name `RIGHT_ID` +and described by the attributes `RIGHT_ATTRS`. 
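+
+Here's a minimal end-to-end sketch showing what the dependency matcher returns:
+one tuple per match, with the matched token indices in the same order as the
+dicts in the pattern. The exact matches depend on the parser's predictions:
+
+```python
+import spacy
+from spacy.matcher import DependencyMatcher
+
+nlp = spacy.load("en_core_web_sm")
+matcher = DependencyMatcher(nlp.vocab)
+pattern = [
+    # anchor token: founded
+    {"RIGHT_ID": "founded", "RIGHT_ATTRS": {"ORTH": "founded"}},
+    # the subject of "founded"
+    {"LEFT_ID": "founded", "REL_OP": ">", "RIGHT_ID": "subject", "RIGHT_ATTRS": {"DEP": "nsubj"}},
+]
+matcher.add("FOUNDED", [pattern])
+doc = nlp("Smith founded a healthcare company in 2005.")
+for match_id, token_ids in matcher(doc):
+    # token_ids are aligned with the order of the dicts in the pattern
+    print([doc[token_id].text for token_id in token_ids])  # e.g. ['founded', 'Smith']
+```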
+ + + +Because the unique token **names** in `LEFT_ID` and `RIGHT_ID` are used to +identify tokens, the order of the dicts in the patterns is important: a token +name needs to be defined as `RIGHT_ID` in one dict in the pattern **before** it +can be used as `LEFT_ID` in another dict. + + + +### Dependency matcher operators {#dependencymatcher-operators} + +The following operators are supported by the `DependencyMatcher`, most of which +come directly from +[Semgrex](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html): + +| Symbol | Description | +| --------- | -------------------------------------------------------------------------------------------------------------------- | +| `A < B` | `A` is the immediate dependent of `B`. | +| `A > B` | `A` is the immediate head of `B`. | +| `A << B` | `A` is the dependent in a chain to `B` following dep → head paths. | +| `A >> B` | `A` is the head in a chain to `B` following head → dep paths. | +| `A . B` | `A` immediately precedes `B`, i.e. `A.i == B.i - 1`, and both are within the same dependency tree. | +| `A .* B` | `A` precedes `B`, i.e. `A.i < B.i`, and both are within the same dependency tree _(not in Semgrex)_. | +| `A ; B` | `A` immediately follows `B`, i.e. `A.i == B.i + 1`, and both are within the same dependency tree _(not in Semgrex)_. | +| `A ;* B` | `A` follows `B`, i.e. `A.i > B.i`, and both are within the same dependency tree _(not in Semgrex)_. | +| `A $+ B` | `B` is a right immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i - 1`. | +| `A $- B` | `B` is a left immediate sibling of `A`, i.e. `A` and `B` have the same parent and `A.i == B.i + 1`. | +| `A $++ B` | `B` is a right sibling of `A`, i.e. `A` and `B` have the same parent and `A.i < B.i`. | +| `A $-- B` | `B` is a left sibling of `A`, i.e. `A` and `B` have the same parent and `A.i > B.i`. | + +### Designing dependency matcher patterns {#dependencymatcher-patterns} + +Let's say we want to find sentences describing who founded what kind of company: + +- _Smith founded a healthcare company in 2005._ +- _Williams initially founded an insurance company in 1987._ +- _Lee, an experienced CEO, has founded two AI startups._ + +The dependency parse for "Smith founded a healthcare company" shows types of +relations and tokens we want to match: + +> #### Visualizing the parse +> +> The [`displacy` visualizer](/usage/visualizers) lets you render `Doc` objects +> and their dependency parse and part-of-speech tags: +> +> ```python +> import spacy +> from spacy import displacy +> +> nlp = spacy.load("en_core_web_sm") +> doc = nlp("Smith founded a healthcare company") +> displacy.serve(doc) +> ``` + +import DisplaCyDepFoundedHtml from 'images/displacy-dep-founded.html' + +