mirror of https://github.com/explosion/spaCy.git
synced 2025-07-10 16:22:29 +03:00

commit 78c72d3ab7
Merge branch 'main' into feature/docwise-generator-batching
.github/FUNDING.yml | 1 (vendored, new file)
@@ -0,0 +1 @@
+custom: [https://explosion.ai/merch, https://explosion.ai/tailored-solutions]
.github/workflows/tests.yml | 4 (vendored)
@@ -58,7 +58,7 @@ jobs:
       fail-fast: true
       matrix:
         os: [ubuntu-latest, windows-latest, macos-latest]
-        python_version: ["3.11"]
+        python_version: ["3.12"]
         include:
           - os: macos-latest
             python_version: "3.8"
@@ -66,6 +66,8 @@ jobs:
             python_version: "3.9"
           - os: windows-latest
             python_version: "3.10"
+          - os: macos-latest
+            python_version: "3.11"
 
     runs-on: ${{ matrix.os }}
 
LICENSE | 2
@@ -1,6 +1,6 @@
 The MIT License (MIT)
 
-Copyright (C) 2016-2022 ExplosionAI GmbH, 2016 spaCy GmbH, 2015 Matthew Honnibal
+Copyright (C) 2016-2023 ExplosionAI GmbH, 2016 spaCy GmbH, 2015 Matthew Honnibal
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
README.md | 79
@@ -6,23 +6,20 @@ spaCy is a library for **advanced Natural Language Processing** in Python and
 Cython. It's built on the very latest research, and was designed from day one to
 be used in real products.
 
-spaCy comes with
-[pretrained pipelines](https://spacy.io/models) and
-currently supports tokenization and training for **70+ languages**. It features
-state-of-the-art speed and **neural network models** for tagging,
-parsing, **named entity recognition**, **text classification** and more,
-multi-task learning with pretrained **transformers** like BERT, as well as a
+spaCy comes with [pretrained pipelines](https://spacy.io/models) and currently
+supports tokenization and training for **70+ languages**. It features
+state-of-the-art speed and **neural network models** for tagging, parsing,
+**named entity recognition**, **text classification** and more, multi-task
+learning with pretrained **transformers** like BERT, as well as a
 production-ready [**training system**](https://spacy.io/usage/training) and easy
 model packaging, deployment and workflow management. spaCy is commercial
-open-source software, released under the [MIT license](https://github.com/explosion/spaCy/blob/master/LICENSE).
+open-source software, released under the
+[MIT license](https://github.com/explosion/spaCy/blob/master/LICENSE).
 
-💥 **We'd love to hear more about your experience with spaCy!**
-[Fill out our survey here.](https://form.typeform.com/to/aMel9q9f)
-
-💫 **Version 3.5 out now!**
+💫 **Version 3.7 out now!**
 [Check out the release notes here.](https://github.com/explosion/spaCy/releases)
 
-[](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)
+[](https://github.com/explosion/spaCy/actions/workflows/tests.yml)
 [](https://github.com/explosion/spaCy/releases)
 [](https://pypi.org/project/spacy/)
 [](https://anaconda.org/conda-forge/spacy)
@@ -35,35 +32,42 @@ open-source software, released under the [MIT license](https://github.com/explos
 
 ## 📖 Documentation
 
 | Documentation | |
-| ----------------------------- | ---------------------------------------------------------------------- |
+| --- | --- |
 | ⭐️ **[spaCy 101]** | New to spaCy? Here's everything you need to know! |
 | 📚 **[Usage Guides]** | How to use spaCy and its features. |
 | 🚀 **[New in v3.0]** | New features, backwards incompatibilities and migration guide. |
 | 🪐 **[Project Templates]** | End-to-end workflows you can clone, modify and run. |
 | 🎛 **[API Reference]** | The detailed reference for spaCy's API. |
+| ⏩ **[GPU Processing]** | Use spaCy with CUDA-compatible GPU processing. |
 | 📦 **[Models]** | Download trained pipelines for spaCy. |
+| 🦙 **[Large Language Models]** | Integrate LLMs into spaCy pipelines. |
 | 🌌 **[Universe]** | Plugins, extensions, demos and books from the spaCy ecosystem. |
 | ⚙️ **[spaCy VS Code Extension]** | Additional tooling and features for working with spaCy's config files. |
 | 👩🏫 **[Online Course]** | Learn spaCy in this free and interactive online course. |
+| 📰 **[Blog]** | Read about current spaCy and Prodigy development, releases, talks and more from Explosion. |
 | 📺 **[Videos]** | Our YouTube channel with video tutorials, talks and more. |
 | 🛠 **[Changelog]** | Changes and version history. |
 | 💝 **[Contribute]** | How to contribute to the spaCy project and code base. |
-| <a href="https://explosion.ai/spacy-tailored-pipelines"><img src="https://user-images.githubusercontent.com/13643239/152853098-1c761611-ccb0-4ec6-9066-b234552831fe.png" width="125" alt="spaCy Tailored Pipelines"/></a> | Get a custom spaCy pipeline, tailor-made for your NLP problem by spaCy's core developers. Streamlined, production-ready, predictable and maintainable. Start by completing our 5-minute questionnaire to tell us what you need and we'll be in touch! **[Learn more →](https://explosion.ai/spacy-tailored-pipelines)** |
-| <a href="https://explosion.ai/spacy-tailored-analysis"><img src="https://user-images.githubusercontent.com/1019791/206151300-b00cd189-e503-4797-aa1e-1bb6344062c5.png" width="125" alt="spaCy Tailored Pipelines"/></a> | Bespoke advice for problem solving, strategy and analysis for applied NLP projects. Services include data strategy, code reviews, pipeline design and annotation coaching. Curious? Fill in our 5-minute questionnaire to tell us what you need and we'll be in touch! **[Learn more →](https://explosion.ai/spacy-tailored-analysis)** |
+| 👕 **[Swag]** | Support us and our work with unique, custom-designed swag! |
+| <a href="https://explosion.ai/tailored-solutions"><img src="https://github.com/explosion/spaCy/assets/13643239/36d2a42e-98c0-4599-90e1-788ef75181be" width="150" alt="Tailored Solutions"/></a> | Custom NLP consulting, implementation and strategic advice by spaCy's core development team. Streamlined, production-ready, predictable and maintainable. Send us an email or take our 5-minute questionnaire, and we'll be in touch! **[Learn more →](https://explosion.ai/tailored-solutions)** |
 
 [spacy 101]: https://spacy.io/usage/spacy-101
 [new in v3.0]: https://spacy.io/usage/v3
 [usage guides]: https://spacy.io/usage/
 [api reference]: https://spacy.io/api/
+[gpu processing]: https://spacy.io/usage#gpu
 [models]: https://spacy.io/models
+[large language models]: https://spacy.io/usage/large-language-models
 [universe]: https://spacy.io/universe
-[spaCy VS Code Extension]: https://github.com/explosion/spacy-vscode
+[spacy vs code extension]: https://github.com/explosion/spacy-vscode
 [videos]: https://www.youtube.com/c/ExplosionAI
 [online course]: https://course.spacy.io
+[blog]: https://explosion.ai
 [project templates]: https://github.com/explosion/projects
 [changelog]: https://spacy.io/usage#changelog
 [contribute]: https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md
+[swag]: https://explosion.ai/merch
 
 ## 💬 Where to ask questions
 
@@ -92,7 +96,9 @@ more people can benefit from it.
 - State-of-the-art speed
 - Production-ready **training system**
 - Linguistically-motivated **tokenization**
-- Components for named **entity recognition**, part-of-speech-tagging, dependency parsing, sentence segmentation, **text classification**, lemmatization, morphological analysis, entity linking and more
+- Components for named **entity recognition**, part-of-speech-tagging,
+  dependency parsing, sentence segmentation, **text classification**,
+  lemmatization, morphological analysis, entity linking and more
 - Easily extensible with **custom components** and attributes
 - Support for custom models in **PyTorch**, **TensorFlow** and other frameworks
 - Built in **visualizers** for syntax and NER
@@ -118,8 +124,8 @@ For detailed installation instructions, see the
 ### pip
 
 Using pip, spaCy releases are available as source packages and binary wheels.
-Before you install spaCy and its dependencies, make sure that
-your `pip`, `setuptools` and `wheel` are up to date.
+Before you install spaCy and its dependencies, make sure that your `pip`,
+`setuptools` and `wheel` are up to date.
 
 ```bash
 pip install -U pip setuptools wheel
@@ -174,9 +180,9 @@ with the new version.
 
 ## 📦 Download model packages
 
-Trained pipelines for spaCy can be installed as **Python packages**. This
-means that they're a component of your application, just like any other module.
-Models can be installed using spaCy's [`download`](https://spacy.io/api/cli#download)
+Trained pipelines for spaCy can be installed as **Python packages**. This means
+that they're a component of your application, just like any other module. Models
+can be installed using spaCy's [`download`](https://spacy.io/api/cli#download)
 command, or manually by pointing pip to a path or URL.
 
 | Documentation | |
@@ -242,8 +248,7 @@ do that depends on your system.
 | **Mac** | Install a recent version of [XCode](https://developer.apple.com/xcode/), including the so-called "Command Line Tools". macOS and OS X ship with Python and git preinstalled. |
 | **Windows** | Install a version of the [Visual C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/) or [Visual Studio Express](https://visualstudio.microsoft.com/vs/express/) that matches the version that was used to compile your Python interpreter. |
 
-For more details
-and instructions, see the documentation on
+For more details and instructions, see the documentation on
 [compiling spaCy from source](https://spacy.io/usage#source) and the
 [quickstart widget](https://spacy.io/usage#section-quickstart) to get the right
 commands for your platform and Python version.
@@ -1,7 +1,4 @@
-# build version constraints for use with wheelwright + multibuild
+# build version constraints for use with wheelwright
 numpy==1.17.3; python_version=='3.8' and platform_machine!='aarch64'
 numpy==1.19.2; python_version=='3.8' and platform_machine=='aarch64'
-numpy==1.19.3; python_version=='3.9'
-numpy==1.21.3; python_version=='3.10'
-numpy==1.23.2; python_version=='3.11'
-numpy; python_version>='3.12'
+numpy>=1.25.0; python_version>='3.9'
@@ -1,14 +1,17 @@
 # Listeners
 
-1. [Overview](#1-overview)
-2. [Initialization](#2-initialization)
-   - [A. Linking listeners to the embedding component](#2a-linking-listeners-to-the-embedding-component)
-   - [B. Shape inference](#2b-shape-inference)
-3. [Internal communication](#3-internal-communication)
-   - [A. During prediction](#3a-during-prediction)
-   - [B. During training](#3b-during-training)
-   - [C. Frozen components](#3c-frozen-components)
-4. [Replacing listener with standalone](#4-replacing-listener-with-standalone)
+- [1. Overview](#1-overview)
+- [2. Initialization](#2-initialization)
+  - [2A. Linking listeners to the embedding component](#2a-linking-listeners-to-the-embedding-component)
+  - [2B. Shape inference](#2b-shape-inference)
+- [3. Internal communication](#3-internal-communication)
+  - [3A. During prediction](#3a-during-prediction)
+  - [3B. During training](#3b-during-training)
+    - [Training with multiple listeners](#training-with-multiple-listeners)
+  - [3C. Frozen components](#3c-frozen-components)
+    - [The Tok2Vec or Transformer is frozen](#the-tok2vec-or-transformer-is-frozen)
+    - [The upstream component is frozen](#the-upstream-component-is-frozen)
+- [4. Replacing listener with standalone](#4-replacing-listener-with-standalone)
 
 ## 1. Overview
 
@@ -62,7 +65,7 @@ of this `find_listener()` method will specifically identify sublayers of a model
 
 If it's a Transformer-based pipeline, a
 [`transformer` component](https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py)
 has a similar implementation but its `find_listener()` function will specifically look for `TransformerListener`
 sublayers of downstream components.
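The `find_listener()` walk described above — scanning sublayers of downstream components for listener layers — can be sketched on a toy model tree. The `Layer` class and all names here are illustrative stand-ins for Thinc models, not spaCy's actual API:

```python
class Layer:
    """Toy stand-in for a Thinc model node with named sublayers."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def find_listeners(model, listener_name):
    """Depth-first search for sublayers whose name matches the listener type,
    roughly what an upstream component's find_listener() does."""
    found = [model] if model.name == listener_name else []
    for child in model.children:
        found.extend(find_listeners(child, listener_name))
    return found


# A downstream tagger whose model embeds a TransformerListener sublayer.
tagger_model = Layer("tagger", [Layer("TransformerListener"), Layer("softmax")])
listeners = find_listeners(tagger_model, "TransformerListener")
```

In the real implementation the match is on layer type rather than a name string, but the traversal shape is the same.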
 ### 2B. Shape inference
 
@@ -154,7 +157,7 @@ as a tagger or a parser. This used to be impossible before 3.1, but has become s
 embedding component in the [`annotating_components`](https://spacy.io/usage/training#annotating-components)
 list of the config. This works like any other "annotating component" because it relies on the `Doc` attributes.
 
 However, if the `Tok2Vec` or `Transformer` is frozen, and not present in `annotating_components`, and a related
 listener isn't frozen, then a `W086` warning is shown and further training of the pipeline will likely end with `E954`.
 
 #### The upstream component is frozen
@@ -216,5 +219,17 @@ new_model = tok2vec_model.attrs["replace_listener"](new_model)
 ```
 
 The new config and model are then properly stored on the `nlp` object.
 Note that this functionality (running the replacement for a transformer listener) was broken prior to
 `spacy-transformers` 1.0.5.
+
+In spaCy 3.7, `Language.replace_listeners` was updated to pass the following additional arguments to the `replace_listener` callback:
+the listener to be replaced and the `tok2vec`/`transformer` pipe from which the new model was copied. To maintain backwards-compatibility,
+the method only passes these extra arguments for callbacks that support them:
+
+```
+def replace_listener_pre_37(copied_tok2vec_model):
+    ...
+
+def replace_listener_post_37(copied_tok2vec_model, replaced_listener, tok2vec_pipe):
+    ...
+```
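The backwards-compatibility behaviour described in that hunk — passing the extra arguments only to callbacks that accept them — can be sketched with the standard library's `inspect.signature`. The function and callback names below are illustrative, not spaCy's actual implementation:

```python
import inspect


def call_replace_listener(callback, copied_model, listener, pipe):
    """Invoke a replace_listener callback, passing the two extra arguments
    only when the callback's signature accepts them (illustrative shim)."""
    n_params = len(inspect.signature(callback).parameters)
    if n_params >= 3:
        return callback(copied_model, listener, pipe)
    return callback(copied_model)


def pre_37(model):  # old-style callback: only the copied model
    return ("old", model)


def post_37(model, listener, pipe):  # 3.7-style callback: all three arguments
    return ("new", model, listener, pipe)
```

A real implementation would also need to account for `*args` and keyword-only parameters, which `inspect.signature` reports as well.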
@@ -158,3 +158,45 @@ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 SOFTWARE.
+
+
+SciPy
+-----
+
+* Files: scorer.py
+
+The implementation of trapezoid() is adapted from SciPy, which is distributed
+under the following license:
+
+New BSD License
+
+Copyright (c) 2001-2002 Enthought, Inc. 2003-2023, SciPy Developers.
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions
+are met:
+
+1. Redistributions of source code must retain the above copyright
+   notice, this list of conditions and the following disclaimer.
+
+2. Redistributions in binary form must reproduce the above
+   copyright notice, this list of conditions and the following
+   disclaimer in the documentation and/or other materials provided
+   with the distribution.
+
+3. Neither the name of the copyright holder nor the names of its
+   contributors may be used to endorse or promote products derived
+   from this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
@@ -5,8 +5,9 @@ requires = [
     "cymem>=2.0.2,<2.1.0",
     "preshed>=3.0.2,<3.1.0",
     "murmurhash>=0.28.0,<1.1.0",
-    "thinc>=9.0.0.dev2,<9.1.0",
-    "numpy>=1.15.0",
+    "thinc>=9.0.0.dev4,<9.1.0",
+    "numpy>=1.15.0; python_version < '3.9'",
+    "numpy>=1.25.0; python_version >= '3.9'",
 ]
 build-backend = "setuptools.build_meta"
@@ -1,22 +1,23 @@
 # Our libraries
-spacy-legacy>=4.0.0.dev0,<4.1.0
+spacy-legacy>=4.0.0.dev1,<4.1.0
 spacy-loggers>=1.0.0,<2.0.0
 cymem>=2.0.2,<2.1.0
 preshed>=3.0.2,<3.1.0
-thinc>=9.0.0.dev2,<9.1.0
+thinc>=9.0.0.dev4,<9.1.0
 ml_datasets>=0.2.0,<0.3.0
 murmurhash>=0.28.0,<1.1.0
 wasabi>=0.9.1,<1.2.0
 srsly>=2.4.3,<3.0.0
 catalogue>=2.0.6,<2.1.0
 typer>=0.3.0,<0.10.0
-pathy>=0.10.0
 smart-open>=5.2.1,<7.0.0
+weasel>=0.1.0,<0.4.0
 # Third party dependencies
-numpy>=1.15.0
+numpy>=1.15.0; python_version < "3.9"
+numpy>=1.19.0; python_version >= "3.9"
 requests>=2.13.0,<3.0.0
 tqdm>=4.38.0,<5.0.0
-pydantic>=1.7.4,!=1.8,!=1.8.1,<1.11.0
+pydantic>=1.7.4,!=1.8,!=1.8.1,<3.0.0
 jinja2
 langcodes>=3.2.0,<4.0.0
 # Official Python utilities
@@ -30,11 +31,11 @@ pytest-timeout>=1.3.0,<2.0.0
 mock>=2.0.0,<3.0.0
 flake8>=3.8.0,<6.0.0
 hypothesis>=3.27.0,<7.0.0
-mypy>=0.990,<1.1.0; platform_machine != "aarch64"
+mypy>=1.5.0,<1.6.0; platform_machine != "aarch64" and python_version >= "3.8"
 types-mock>=0.1.1
 types-setuptools>=57.0.0
 types-requests
 types-setuptools>=57.0.0
 black==22.3.0
-cython-lint>=0.15.0; python_version >= "3.7"
+cython-lint>=0.15.0
 isort>=5.0,<6.0
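The numpy pins above branch on the interpreter using PEP 508 environment markers such as `python_version >= "3.9"`, so one requirements file can serve several Python versions. As a rough illustration of what such a marker decides, here is a stdlib-only sketch (the helper name is ours, not part of pip or spaCy):

```python
import sys


def marker_python_version_at_least(spec: str) -> bool:
    """Roughly what the PEP 508 marker `python_version >= "X.Y"` evaluates,
    using only the standard library (illustrative helper, not pip's API)."""
    wanted = tuple(int(part) for part in spec.split("."))
    # sys.version_info slices compare tuple-wise, e.g. (3, 11) >= (3, 9).
    return sys.version_info[: len(wanted)] >= wanted


# Choose the numpy pin the same way the requirement lines above branch.
pin = "numpy>=1.19.0" if marker_python_version_at_least("3.9") else "numpy>=1.15.0"
print(pin)
```

Real marker evaluation (handled by pip via the `packaging` library) also covers `platform_machine`, as used by the mypy and aarch64 numpy lines.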
setup.cfg | 25
@@ -30,33 +30,26 @@ project_urls =
 zip_safe = false
 include_package_data = true
 python_requires = >=3.8
-setup_requires =
-    cython>=0.25,<3.0
-    numpy>=1.15.0
-    # We also need our Cython packages here to compile against
-    cymem>=2.0.2,<2.1.0
-    preshed>=3.0.2,<3.1.0
-    murmurhash>=0.28.0,<1.1.0
-    thinc>=9.0.0.dev2,<9.1.0
 install_requires =
     # Our libraries
-    spacy-legacy>=4.0.0.dev0,<4.1.0
+    spacy-legacy>=4.0.0.dev1,<4.1.0
     spacy-loggers>=1.0.0,<2.0.0
     murmurhash>=0.28.0,<1.1.0
     cymem>=2.0.2,<2.1.0
     preshed>=3.0.2,<3.1.0
-    thinc>=9.0.0.dev2,<9.1.0
+    thinc>=9.0.0.dev4,<9.1.0
     wasabi>=0.9.1,<1.2.0
     srsly>=2.4.3,<3.0.0
     catalogue>=2.0.6,<2.1.0
+    weasel>=0.1.0,<0.4.0
     # Third-party dependencies
     typer>=0.3.0,<0.10.0
-    pathy>=0.10.0
     smart-open>=5.2.1,<7.0.0
     tqdm>=4.38.0,<5.0.0
-    numpy>=1.15.0
+    numpy>=1.15.0; python_version < "3.9"
+    numpy>=1.19.0; python_version >= "3.9"
     requests>=2.13.0,<3.0.0
-    pydantic>=1.7.4,!=1.8,!=1.8.1,<1.11.0
+    pydantic>=1.7.4,!=1.8,!=1.8.1,<3.0.0
     jinja2
     # Official Python utilities
     setuptools
@@ -71,9 +64,7 @@ console_scripts =
 lookups =
     spacy_lookups_data>=1.0.3,<1.1.0
 transformers =
-    spacy_transformers>=1.1.2,<1.3.0
-ray =
-    spacy_ray>=0.1.0,<1.0.0
+    spacy_transformers>=1.1.2,<1.4.0
 cuda =
     cupy>=5.0.0b4,<13.0.0
 cuda80 =
@@ -108,6 +99,8 @@ cuda117 =
     cupy-cuda117>=5.0.0b4,<13.0.0
 cuda11x =
     cupy-cuda11x>=11.0.0,<13.0.0
+cuda12x =
+    cupy-cuda12x>=11.5.0,<13.0.0
 cuda-autodetect =
     cupy-wheel>=11.0.0,<13.0.0
 apple =
38
setup.py
38
setup.py
|
@ -1,10 +1,9 @@
|
||||||
#!/usr/bin/env python
|
#!/usr/bin/env python
|
||||||
from setuptools import Extension, setup, find_packages
|
from setuptools import Extension, setup, find_packages
|
||||||
import sys
|
import sys
|
||||||
import platform
|
|
||||||
import numpy
|
import numpy
|
||||||
from distutils.command.build_ext import build_ext
|
from setuptools.command.build_ext import build_ext
|
||||||
from distutils.sysconfig import get_python_inc
|
from sysconfig import get_path
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
import shutil
|
import shutil
|
||||||
from Cython.Build import cythonize
|
from Cython.Build import cythonize
|
||||||
|
@ -33,10 +32,12 @@ MOD_NAMES = [
|
||||||
"spacy.kb.candidate",
|
"spacy.kb.candidate",
|
||||||
"spacy.kb.kb",
|
"spacy.kb.kb",
|
||||||
"spacy.kb.kb_in_memory",
|
"spacy.kb.kb_in_memory",
|
||||||
"spacy.ml.tb_framework",
|
"spacy.ml.parser_model",
|
||||||
"spacy.morphology",
|
"spacy.morphology",
|
||||||
|
"spacy.pipeline.dep_parser",
|
||||||
"spacy.pipeline._edit_tree_internals.edit_trees",
|
"spacy.pipeline._edit_tree_internals.edit_trees",
|
||||||
"spacy.pipeline.morphologizer",
|
"spacy.pipeline.morphologizer",
|
||||||
|
"spacy.pipeline.ner",
|
||||||
"spacy.pipeline.pipe",
|
"spacy.pipeline.pipe",
|
||||||
"spacy.pipeline.trainable_pipe",
|
"spacy.pipeline.trainable_pipe",
|
||||||
"spacy.pipeline.sentencizer",
|
"spacy.pipeline.sentencizer",
|
||||||
|
@ -44,7 +45,6 @@ MOD_NAMES = [
|
||||||
"spacy.pipeline.tagger",
|
"spacy.pipeline.tagger",
|
||||||
"spacy.pipeline.transition_parser",
|
"spacy.pipeline.transition_parser",
|
||||||
"spacy.pipeline._parser_internals.arc_eager",
|
"spacy.pipeline._parser_internals.arc_eager",
|
||||||
"spacy.pipeline._parser_internals.batch",
|
|
||||||
"spacy.pipeline._parser_internals.ner",
|
"spacy.pipeline._parser_internals.ner",
|
||||||
"spacy.pipeline._parser_internals.nonproj",
|
"spacy.pipeline._parser_internals.nonproj",
|
||||||
"spacy.pipeline._parser_internals.search",
|
"spacy.pipeline._parser_internals.search",
|
||||||
|
@ -52,7 +52,6 @@ MOD_NAMES = [
|
||||||
"spacy.pipeline._parser_internals.stateclass",
|
"spacy.pipeline._parser_internals.stateclass",
|
||||||
"spacy.pipeline._parser_internals.transition_system",
|
"spacy.pipeline._parser_internals.transition_system",
|
||||||
"spacy.pipeline._parser_internals._beam_utils",
|
"spacy.pipeline._parser_internals._beam_utils",
|
||||||
"spacy.pipeline._parser_internals._parser_utils",
|
|
||||||
"spacy.tokenizer",
|
"spacy.tokenizer",
|
||||||
"spacy.training.align",
|
"spacy.training.align",
|
||||||
"spacy.training.gold_io",
|
"spacy.training.gold_io",
|
||||||
|
@ -80,6 +79,7 @@ COMPILER_DIRECTIVES = {
|
||||||
"language_level": -3,
|
"language_level": -3,
|
||||||
"embedsignature": True,
|
"embedsignature": True,
|
||||||
"annotation_typing": False,
|
"annotation_typing": False,
|
||||||
|
"profile": sys.version_info < (3, 12),
|
||||||
}
|
}
|
||||||
# Files to copy into the package that are otherwise not included
|
# Files to copy into the package that are otherwise not included
|
||||||
COPY_FILES = {
|
COPY_FILES = {
|
||||||
|
@ -89,30 +89,6 @@ COPY_FILES = {
|
||||||
}
|
}
|
||||||
|
|
||||||
|
|
||||||
def is_new_osx():
|
|
||||||
"""Check whether we're on OSX >= 10.7"""
|
|
||||||
if sys.platform != "darwin":
|
|
||||||
return False
|
|
||||||
mac_ver = platform.mac_ver()[0]
|
|
||||||
if mac_ver.startswith("10"):
|
|
||||||
minor_version = int(mac_ver.split(".")[1])
|
|
||||||
if minor_version >= 7:
|
|
||||||
return True
|
|
||||||
else:
|
|
||||||
return False
|
|
||||||
return False
|
|
||||||
|
|
||||||
|
|
||||||
if is_new_osx():
|
|
||||||
# On Mac, use libc++ because Apple deprecated use of
|
|
||||||
# libstdc
|
|
||||||
COMPILE_OPTIONS["other"].append("-stdlib=libc++")
|
|
||||||
LINK_OPTIONS["other"].append("-lc++")
|
|
||||||
# g++ (used by unix compiler on mac) links to libstdc++ as a default lib.
|
|
||||||
# See: https://stackoverflow.com/questions/1653047/avoid-linking-to-libstdc
|
|
||||||
LINK_OPTIONS["other"].append("-nodefaultlibs")
|
|
||||||
|
|
||||||
|
|
||||||
# By subclassing build_extensions we have the actual compiler that will be used which is really known only after finalize_options
|
# By subclassing build_extensions we have the actual compiler that will be used which is really known only after finalize_options
|
||||||
# http://stackoverflow.com/questions/724664/python-distutils-how-to-get-a-compiler-that-is-going-to-be-used
|
# http://stackoverflow.com/questions/724664/python-distutils-how-to-get-a-compiler-that-is-going-to-be-used
|
||||||
class build_ext_options:
|
class build_ext_options:
|
||||||
|
@ -205,7 +181,7 @@ def setup_package():
|
||||||
|
|
||||||
include_dirs = [
|
include_dirs = [
|
||||||
numpy.get_include(),
|
numpy.get_include(),
|
||||||
get_python_inc(plat_specific=True),
|
get_path("include"),
|
||||||
]
|
]
|
||||||
ext_modules = []
|
ext_modules = []
|
||||||
ext_modules.append(
|
ext_modules.append(
|
||||||
|
|
|
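The last hunk above swaps the deprecated `distutils` helper `get_python_inc(plat_specific=True)` for `sysconfig.get_path("include")`, since `distutils` is removed in Python 3.12. A minimal stdlib-only sketch of the replacement:

```python
# sysconfig.get_path("include") returns the directory containing the
# CPython headers (Python.h), replacing distutils.sysconfig.get_python_inc().
import sysconfig

include_dir = sysconfig.get_path("include")
print(include_dir)  # e.g. a path ending in .../include/python3.X
```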
@@ -1,7 +1,9 @@
 # fmt: off
 __title__ = "spacy"
-__version__ = "4.0.0.dev1"
+__version__ = "4.0.0.dev2"
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
 __projects__ = "https://github.com/explosion/projects"
 __projects_branch__ = "v3"
+__lookups_tag__ = "v1.0.3"
+__lookups_url__ = f"https://raw.githubusercontent.com/explosion/spacy-lookups-data/{__lookups_tag__}/spacy_lookups_data/data/"

@@ -1,3 +1,4 @@
+# cython: profile=False
 from .errors import Errors
 
 IOB_STRINGS = ("", "I", "O", "B")

@@ -14,6 +14,7 @@ from .debug_diff import debug_diff  # noqa: F401
 from .debug_model import debug_model  # noqa: F401
 from .download import download  # noqa: F401
 from .evaluate import evaluate  # noqa: F401
+from .find_function import find_function  # noqa: F401
 from .find_threshold import find_threshold  # noqa: F401
 from .info import info  # noqa: F401
 from .init_config import fill_config, init_config  # noqa: F401
@@ -21,15 +22,17 @@ from .init_pipeline import init_pipeline_cli  # noqa: F401
 from .package import package  # noqa: F401
 from .pretrain import pretrain  # noqa: F401
 from .profile import profile  # noqa: F401
-from .project.assets import project_assets  # noqa: F401
-from .project.clone import project_clone  # noqa: F401
-from .project.document import project_document  # noqa: F401
-from .project.dvc import project_update_dvc  # noqa: F401
-from .project.pull import project_pull  # noqa: F401
-from .project.push import project_push  # noqa: F401
-from .project.run import project_run  # noqa: F401
-from .train import train_cli  # noqa: F401
-from .validate import validate  # noqa: F401
+from .project.assets import project_assets  # type: ignore[attr-defined] # noqa: F401
+from .project.clone import project_clone  # type: ignore[attr-defined] # noqa: F401
+from .project.document import (  # type: ignore[attr-defined] # noqa: F401
+    project_document,
+)
+from .project.dvc import project_update_dvc  # type: ignore[attr-defined] # noqa: F401
+from .project.pull import project_pull  # type: ignore[attr-defined] # noqa: F401
+from .project.push import project_push  # type: ignore[attr-defined] # noqa: F401
+from .project.run import project_run  # type: ignore[attr-defined] # noqa: F401
+from .train import train_cli  # type: ignore[attr-defined] # noqa: F401
+from .validate import validate  # type: ignore[attr-defined] # noqa: F401
 
 
 @app.command("link", no_args_is_help=True, deprecated=True, hidden=True)

@@ -26,10 +26,11 @@ from thinc.api import Config, ConfigValidationError, require_gpu
 from thinc.util import gpu_is_available
 from typer.main import get_command
 from wasabi import Printer, msg
+from weasel import app as project_cli
 
 from .. import about
 from ..errors import RENAMED_LANGUAGE_CODES
-from ..schemas import ProjectConfigSchema, validate
+from ..schemas import validate
 from ..util import (
     ENV_VARS,
     SimpleFrozenDict,
@@ -41,15 +42,10 @@ from ..util import (
     run_command,
 )
-
-if TYPE_CHECKING:
-    from pathy import FluidPath  # noqa: F401
-
 
 SDIST_SUFFIX = ".tar.gz"
 WHEEL_SUFFIX = "-py3-none-any.whl"
 
 PROJECT_FILE = "project.yml"
-PROJECT_LOCK = "project.lock"
 COMMAND = "python -m spacy"
 NAME = "spacy"
 HELP = """spaCy Command-line Interface
@@ -75,11 +71,10 @@ Opt = typer.Option
 
 app = typer.Typer(name=NAME, help=HELP)
 benchmark_cli = typer.Typer(name="benchmark", help=BENCHMARK_HELP, no_args_is_help=True)
-project_cli = typer.Typer(name="project", help=PROJECT_HELP, no_args_is_help=True)
 debug_cli = typer.Typer(name="debug", help=DEBUG_HELP, no_args_is_help=True)
 init_cli = typer.Typer(name="init", help=INIT_HELP, no_args_is_help=True)
 
-app.add_typer(project_cli)
+app.add_typer(project_cli, name="project", help=PROJECT_HELP, no_args_is_help=True)
 app.add_typer(debug_cli)
 app.add_typer(benchmark_cli)
 app.add_typer(init_cli)
@@ -164,148 +159,6 @@ def _handle_renamed_language_codes(lang: Optional[str]) -> None:
     )
 
 
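The hunk above stops building a local `project` sub-app and instead mounts the external `weasel` CLI under the top-level app with an explicit name and help text. The same "mount a sub-CLI under a parent command" pattern, sketched with stdlib `argparse` subparsers as a hedged analogy (this is not spaCy's actual Typer code):

```python
# A parent parser with a named "project" subcommand, mirroring how the
# diff attaches the weasel Typer app under the name "project".
import argparse

app = argparse.ArgumentParser(prog="spacy")
sub = app.add_subparsers(dest="command")
project = sub.add_parser("project", help="Command-line interface for spaCy projects")
project.add_argument("action", choices=["run", "pull", "push"])

args = app.parse_args(["project", "run"])
assert args.command == "project" and args.action == "run"
```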
-def load_project_config(
-    path: Path, interpolate: bool = True, overrides: Dict[str, Any] = SimpleFrozenDict()
-) -> Dict[str, Any]:
-    """Load the project.yml file from a directory and validate it. Also make
-    sure that all directories defined in the config exist.
-
-    path (Path): The path to the project directory.
-    interpolate (bool): Whether to substitute project variables.
-    overrides (Dict[str, Any]): Optional config overrides.
-    RETURNS (Dict[str, Any]): The loaded project.yml.
-    """
-    config_path = path / PROJECT_FILE
-    if not config_path.exists():
-        msg.fail(f"Can't find {PROJECT_FILE}", config_path, exits=1)
-    invalid_err = f"Invalid {PROJECT_FILE}. Double-check that the YAML is correct."
-    try:
-        config = srsly.read_yaml(config_path)
-    except ValueError as e:
-        msg.fail(invalid_err, e, exits=1)
-    errors = validate(ProjectConfigSchema, config)
-    if errors:
-        msg.fail(invalid_err)
-        print("\n".join(errors))
-        sys.exit(1)
-    validate_project_version(config)
-    validate_project_commands(config)
-    if interpolate:
-        err = f"{PROJECT_FILE} validation error"
-        with show_validation_error(title=err, hint_fill=False):
-            config = substitute_project_variables(config, overrides)
-    # Make sure directories defined in config exist
-    for subdir in config.get("directories", []):
-        dir_path = path / subdir
-        if not dir_path.exists():
-            dir_path.mkdir(parents=True)
-    return config
-
-
-def substitute_project_variables(
-    config: Dict[str, Any],
-    overrides: Dict[str, Any] = SimpleFrozenDict(),
-    key: str = "vars",
-    env_key: str = "env",
-) -> Dict[str, Any]:
-    """Interpolate variables in the project file using the config system.
-
-    config (Dict[str, Any]): The project config.
-    overrides (Dict[str, Any]): Optional config overrides.
-    key (str): Key containing variables in project config.
-    env_key (str): Key containing environment variable mapping in project config.
-    RETURNS (Dict[str, Any]): The interpolated project config.
-    """
-    config.setdefault(key, {})
-    config.setdefault(env_key, {})
-    # Substitute references to env vars with their values
-    for config_var, env_var in config[env_key].items():
-        config[env_key][config_var] = _parse_override(os.environ.get(env_var, ""))
-    # Need to put variables in the top scope again so we can have a top-level
-    # section "project" (otherwise, a list of commands in the top scope wouldn't)
-    # be allowed by Thinc's config system
-    cfg = Config({"project": config, key: config[key], env_key: config[env_key]})
-    cfg = Config().from_str(cfg.to_str(), overrides=overrides)
-    interpolated = cfg.interpolate()
-    return dict(interpolated["project"])
-
-
-def validate_project_version(config: Dict[str, Any]) -> None:
-    """If the project defines a compatible spaCy version range, chec that it's
-    compatible with the current version of spaCy.
-
-    config (Dict[str, Any]): The loaded config.
-    """
-    spacy_version = config.get("spacy_version", None)
-    if spacy_version and not is_compatible_version(about.__version__, spacy_version):
-        err = (
-            f"The {PROJECT_FILE} specifies a spaCy version range ({spacy_version}) "
-            f"that's not compatible with the version of spaCy you're running "
-            f"({about.__version__}). You can edit version requirement in the "
-            f"{PROJECT_FILE} to load it, but the project may not run as expected."
-        )
-        msg.fail(err, exits=1)
-
-
-def validate_project_commands(config: Dict[str, Any]) -> None:
-    """Check that project commands and workflows are valid, don't contain
-    duplicates, don't clash and only refer to commands that exist.
-
-    config (Dict[str, Any]): The loaded config.
-    """
-    command_names = [cmd["name"] for cmd in config.get("commands", [])]
-    workflows = config.get("workflows", {})
-    duplicates = set([cmd for cmd in command_names if command_names.count(cmd) > 1])
-    if duplicates:
-        err = f"Duplicate commands defined in {PROJECT_FILE}: {', '.join(duplicates)}"
-        msg.fail(err, exits=1)
-    for workflow_name, workflow_steps in workflows.items():
-        if workflow_name in command_names:
-            err = f"Can't use workflow name '{workflow_name}': name already exists as a command"
-            msg.fail(err, exits=1)
-        for step in workflow_steps:
-            if step not in command_names:
-                msg.fail(
-                    f"Unknown command specified in workflow '{workflow_name}': {step}",
-                    f"Workflows can only refer to commands defined in the 'commands' "
-                    f"section of the {PROJECT_FILE}.",
-                    exits=1,
-                )
-
-
-def get_hash(data, exclude: Iterable[str] = tuple()) -> str:
-    """Get the hash for a JSON-serializable object.
-
-    data: The data to hash.
-    exclude (Iterable[str]): Top-level keys to exclude if data is a dict.
-    RETURNS (str): The hash.
-    """
-    if isinstance(data, dict):
-        data = {k: v for k, v in data.items() if k not in exclude}
-    data_str = srsly.json_dumps(data, sort_keys=True).encode("utf8")
-    return hashlib.md5(data_str).hexdigest()
-
-
-def get_checksum(path: Union[Path, str]) -> str:
-    """Get the checksum for a file or directory given its file path. If a
-    directory path is provided, this uses all files in that directory.
-
-    path (Union[Path, str]): The file or directory path.
-    RETURNS (str): The checksum.
-    """
-    path = Path(path)
-    if not (path.is_file() or path.is_dir()):
-        msg.fail(f"Can't get checksum for {path}: not a file or directory", exits=1)
-    if path.is_file():
-        return hashlib.md5(Path(path).read_bytes()).hexdigest()
-    else:
-        # TODO: this is currently pretty slow
-        dir_checksum = hashlib.md5()
-        for sub_file in sorted(fp for fp in path.rglob("*") if fp.is_file()):
-            dir_checksum.update(sub_file.read_bytes())
-        return dir_checksum.hexdigest()
-
-
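The removed `get_hash` helper above hashes a JSON-serializable object deterministically by sorting keys before digesting. A self-contained sketch of that idea, with stdlib `json` standing in for `srsly.json_dumps`:

```python
# Stable key order makes the md5 digest identical for equal dicts,
# regardless of insertion order; `exclude` drops top-level keys first.
import hashlib
import json


def get_hash(data, exclude=()):
    if isinstance(data, dict):
        data = {k: v for k, v in data.items() if k not in exclude}
    data_str = json.dumps(data, sort_keys=True).encode("utf8")
    return hashlib.md5(data_str).hexdigest()


assert get_hash({"a": 1, "b": 2}) == get_hash({"b": 2, "a": 1})
assert get_hash({"a": 1, "secret": 3}, exclude=("secret",)) == get_hash({"a": 1})
```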
 @contextmanager
 def show_validation_error(
     file_path: Optional[Union[str, Path]] = None,
@@ -350,6 +203,13 @@ def show_validation_error(
     msg.fail("Config validation error", e, exits=1)
 
 
+def import_code_paths(code_paths: str) -> None:
+    """Helper to import comma-separated list of code paths."""
+    code_paths = [Path(p.strip()) for p in string_to_list(code_paths)]
+    for code_path in code_paths:
+        import_code(code_path)
+
+
 def import_code(code_path: Optional[Union[Path, str]]) -> None:
     """Helper to import Python file provided in training commands / commands
     using the config. This makes custom registered functions available.
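The new `import_code_paths` helper above accepts a comma-separated string and imports each file in turn. A hedged sketch of just the split-and-strip step (`parse_code_paths` is a hypothetical stand-in; spaCy's `string_to_list` also handles quotes and brackets):

```python
# Turn "a.py, b.py" into a list of Path objects, ignoring empty entries.
from pathlib import Path


def parse_code_paths(code_paths: str) -> list:
    return [Path(p.strip()) for p in code_paths.split(",") if p.strip()]


paths = parse_code_paths("functions.py, extra/components.py")
assert paths == [Path("functions.py"), Path("extra/components.py")]
```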
@@ -363,166 +223,10 @@ def import_code(code_path: Optional[Union[Path, str]]) -> None:
     msg.fail(f"Couldn't load Python code: {code_path}", e, exits=1)
 
 
-def upload_file(src: Path, dest: Union[str, "FluidPath"]) -> None:
-    """Upload a file.
-
-    src (Path): The source path.
-    url (str): The destination URL to upload to.
-    """
-    import smart_open
-
-    # Create parent directories for local paths
-    if isinstance(dest, Path):
-        if not dest.parent.exists():
-            dest.parent.mkdir(parents=True)
-
-    dest = str(dest)
-    with smart_open.open(dest, mode="wb") as output_file:
-        with src.open(mode="rb") as input_file:
-            output_file.write(input_file.read())
-
-
-def download_file(
-    src: Union[str, "FluidPath"], dest: Path, *, force: bool = False
-) -> None:
-    """Download a file using smart_open.
-
-    url (str): The URL of the file.
-    dest (Path): The destination path.
-    force (bool): Whether to force download even if file exists.
-        If False, the download will be skipped.
-    """
-    import smart_open
-
-    if dest.exists() and not force:
-        return None
-    src = str(src)
-    with smart_open.open(src, mode="rb", compression="disable") as input_file:
-        with dest.open(mode="wb") as output_file:
-            shutil.copyfileobj(input_file, output_file)
-
-
-def ensure_pathy(path):
-    """Temporary helper to prevent importing Pathy globally (which can cause
-    slow and annoying Google Cloud warning)."""
-    from pathy import Pathy  # noqa: F811
-
-    return Pathy.fluid(path)
-
-
-def git_checkout(
-    repo: str, subpath: str, dest: Path, *, branch: str = "master", sparse: bool = False
-):
-    git_version = get_git_version()
-    if dest.exists():
-        msg.fail("Destination of checkout must not exist", exits=1)
-    if not dest.parent.exists():
-        msg.fail("Parent of destination of checkout must exist", exits=1)
-    if sparse and git_version >= (2, 22):
-        return git_sparse_checkout(repo, subpath, dest, branch)
-    elif sparse:
-        # Only show warnings if the user explicitly wants sparse checkout but
-        # the Git version doesn't support it
-        err_old = (
-            f"You're running an old version of Git (v{git_version[0]}.{git_version[1]}) "
-            f"that doesn't fully support sparse checkout yet."
-        )
-        err_unk = "You're running an unknown version of Git, so sparse checkout has been disabled."
-        msg.warn(
-            f"{err_unk if git_version == (0, 0) else err_old} "
-            f"This means that more files than necessary may be downloaded "
-            f"temporarily. To only download the files needed, make sure "
-            f"you're using Git v2.22 or above."
-        )
-    with make_tempdir() as tmp_dir:
-        cmd = f"git -C {tmp_dir} clone {repo} . -b {branch}"
-        run_command(cmd, capture=True)
-        # We need Path(name) to make sure we also support subdirectories
-        try:
-            source_path = tmp_dir / Path(subpath)
-            if not is_subpath_of(tmp_dir, source_path):
-                err = f"'{subpath}' is a path outside of the cloned repository."
-                msg.fail(err, repo, exits=1)
-            shutil.copytree(str(source_path), str(dest))
-        except FileNotFoundError:
-            err = f"Can't clone {subpath}. Make sure the directory exists in the repo (branch '{branch}')"
-            msg.fail(err, repo, exits=1)
-
-
-def git_sparse_checkout(repo, subpath, dest, branch):
-    # We're using Git, partial clone and sparse checkout to
-    # only clone the files we need
-    # This ends up being RIDICULOUS. omg.
-    # So, every tutorial and SO post talks about 'sparse checkout'...But they
-    # go and *clone* the whole repo. Worthless. And cloning part of a repo
-    # turns out to be completely broken. The only way to specify a "path" is..
-    # a path *on the server*? The contents of which, specifies the paths. Wat.
-    # Obviously this is hopelessly broken and insecure, because you can query
-    # arbitrary paths on the server! So nobody enables this.
-    # What we have to do is disable *all* files. We could then just checkout
-    # the path, and it'd "work", but be hopelessly slow...Because it goes and
-    # transfers every missing object one-by-one. So the final piece is that we
-    # need to use some weird git internals to fetch the missings in bulk, and
-    # *that* we can do by path.
-    # We're using Git and sparse checkout to only clone the files we need
-    with make_tempdir() as tmp_dir:
-        # This is the "clone, but don't download anything" part.
-        cmd = (
-            f"git clone {repo} {tmp_dir} --no-checkout --depth 1 "
-            f"-b {branch} --filter=blob:none"
-        )
-        run_command(cmd)
-        # Now we need to find the missing filenames for the subpath we want.
-        # Looking for this 'rev-list' command in the git --help? Hah.
-        cmd = f"git -C {tmp_dir} rev-list --objects --all --missing=print -- {subpath}"
-        ret = run_command(cmd, capture=True)
-        git_repo = _http_to_git(repo)
-        # Now pass those missings into another bit of git internals
-        missings = " ".join([x[1:] for x in ret.stdout.split() if x.startswith("?")])
-        if not missings:
-            err = (
-                f"Could not find any relevant files for '{subpath}'. "
-                f"Did you specify a correct and complete path within repo '{repo}' "
-                f"and branch {branch}?"
-            )
-            msg.fail(err, exits=1)
-        cmd = f"git -C {tmp_dir} fetch-pack {git_repo} {missings}"
-        run_command(cmd, capture=True)
-        # And finally, we can checkout our subpath
-        cmd = f"git -C {tmp_dir} checkout {branch} {subpath}"
-        run_command(cmd, capture=True)
-
-        # Get a subdirectory of the cloned path, if appropriate
-        source_path = tmp_dir / Path(subpath)
-        if not is_subpath_of(tmp_dir, source_path):
-            err = f"'{subpath}' is a path outside of the cloned repository."
-            msg.fail(err, repo, exits=1)
-
-        shutil.move(str(source_path), str(dest))
-
-
-def git_repo_branch_exists(repo: str, branch: str) -> bool:
-    """Uses 'git ls-remote' to check if a repository and branch exists
-
-    repo (str): URL to get repo.
-    branch (str): Branch on repo to check.
-    RETURNS (bool): True if repo:branch exists.
-    """
-    get_git_version()
-    cmd = f"git ls-remote {repo} {branch}"
-    # We might be tempted to use `--exit-code` with `git ls-remote`, but
-    # `run_command` handles the `returncode` for us, so we'll rely on
-    # the fact that stdout returns '' if the requested branch doesn't exist
-    ret = run_command(cmd, capture=True)
-    exists = ret.stdout != ""
-    return exists
-
-
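The removed `git_sparse_checkout` helper above drove `rev-list --missing=print` and `fetch-pack` by hand because older Git had no ergonomic way to fetch just one subpath. A hedged sketch of the modern equivalent using the `git sparse-checkout` command (Git >= 2.25) on top of a blobless partial clone, demonstrated against a throwaway local repository so it runs offline:

```shell
# Build a small source repo, then clone only its "configs" subdirectory.
set -e
workdir=$(mktemp -d)
cd "$workdir"

git init -q src
git -C src config user.email demo@example.com
git -C src config user.name demo
git -C src config uploadpack.allowfilter true   # allow --filter over file://
mkdir -p src/configs
echo "config" > src/configs/default.cfg
echo "readme" > src/README.md
git -C src add .
git -C src commit -qm "init"
branch=$(git -C src rev-parse --abbrev-ref HEAD)

# Blobless partial clone, then sparse-checkout of just the subpath.
git clone -q "file://$workdir/src" dest --no-checkout -b "$branch" --filter=blob:none
git -C dest sparse-checkout set configs
git -C dest checkout -q "$branch"
ls dest/configs
```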
 def get_git_version(
     error: str = "Could not run 'git'. Make sure it's installed and the executable is available.",
 ) -> Tuple[int, int]:
     """Get the version of git and raise an error if calling 'git --version' fails.
 
     error (str): The error message to show.
     RETURNS (Tuple[int, int]): The version as a (major, minor) tuple. Returns
         (0, 0) if the version couldn't be determined.
@@ -538,30 +242,6 @@ def get_git_version(
     return int(version[0]), int(version[1])
-
-
-def _http_to_git(repo: str) -> str:
-    if repo.startswith("http://"):
-        repo = repo.replace(r"http://", r"https://")
-    if repo.startswith(r"https://"):
-        repo = repo.replace("https://", "git@").replace("/", ":", 1)
-    if repo.endswith("/"):
-        repo = repo[:-1]
-    repo = f"{repo}.git"
-    return repo
-
-
-def is_subpath_of(parent, child):
-    """
-    Check whether `child` is a path contained within `parent`.
-    """
-    # Based on https://stackoverflow.com/a/37095733 .
-
-    # In Python 3.9, the `Path.is_relative_to()` method will supplant this, so
-    # we can stop using crusty old os.path functions.
-    parent_realpath = os.path.realpath(parent)
-    child_realpath = os.path.realpath(child)
-    return os.path.commonpath([parent_realpath, child_realpath]) == parent_realpath
 
 
 @overload
 def string_to_list(value: str, intify: Literal[False] = ...) -> List[str]:
     ...

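The removed `is_subpath_of` helper above containment-checks via `os.path.commonpath` of the resolved paths; its own comment anticipated the `Path.is_relative_to()` replacement available since Python 3.9. A self-contained sketch of both (purely lexical here, no symlink resolution caveats exercised):

```python
# commonpath of the two realpaths equals the parent exactly when the
# child lives inside it; is_relative_to() does the same check on pure paths.
import os
from pathlib import Path


def is_subpath_of(parent, child):
    parent_realpath = os.path.realpath(parent)
    child_realpath = os.path.realpath(child)
    return os.path.commonpath([parent_realpath, child_realpath]) == parent_realpath


assert is_subpath_of("/tmp", "/tmp/repo/file.txt") is True
assert is_subpath_of("/tmp/repo", "/tmp/other") is False
assert Path("/tmp/repo/file.txt").is_relative_to("/tmp") is True
```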
@@ -133,7 +133,9 @@ def apply(
     if len(text_files) > 0:
         streams.append(_stream_texts(text_files))
     datagen = cast(DocOrStrStream, chain(*streams))
-    for doc in tqdm.tqdm(nlp.pipe(datagen, batch_size=batch_size, n_process=n_process)):
+    for doc in tqdm.tqdm(
+        nlp.pipe(datagen, batch_size=batch_size, n_process=n_process), disable=None
+    ):
         docbin.add(doc)
     if output_file.suffix == "":
         output_file = output_file.with_suffix(".spacy")

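Per tqdm's documentation, `disable=None` (added in the hunk above) means "disable the bar when the output stream is not a TTY", so redirected logs and CI output aren't flooded with progress frames. A stdlib-only mimic of that decision rule (`should_disable` is a hypothetical helper, not tqdm's API):

```python
# disable=None -> suppress the bar exactly when the stream isn't a TTY;
# an explicit True/False overrides the TTY check entirely.
import io
import sys


def should_disable(bar_disable, stream):
    if bar_disable is None:
        return not (hasattr(stream, "isatty") and stream.isatty())
    return bool(bar_disable)


assert should_disable(None, io.StringIO()) is True    # captured stream: no bar
assert should_disable(False, io.StringIO()) is False  # explicit False: always draw
print(should_disable(None, sys.stderr))               # depends on how you run it
```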
@@ -11,7 +11,7 @@ from ._util import (
     Arg,
     Opt,
     app,
-    import_code,
+    import_code_paths,
     parse_config_overrides,
     show_validation_error,
 )
@@ -26,7 +26,7 @@ def assemble_cli(
     ctx: typer.Context,  # This is only used to read additional arguments
     config_path: Path = Arg(..., help="Path to config file", exists=True, allow_dash=True),
     output_path: Path = Arg(..., help="Output directory to store assembled pipeline in"),
-    code_path: Optional[Path] = Opt(None, "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
+    code_path: str = Opt("", "--code", "-c", help="Comma-separated paths to Python files with additional code (registered functions) to be imported"),
     verbose: bool = Opt(False, "--verbose", "-V", "-VV", help="Display more information for debugging purposes"),
     # fmt: on
 ):
@@ -40,12 +40,13 @@ def assemble_cli(
 
     DOCS: https://spacy.io/api/cli#assemble
     """
-    util.logger.setLevel(logging.DEBUG if verbose else logging.INFO)
+    if verbose:
+        util.logger.setLevel(logging.DEBUG)
     # Make sure all files and paths exists if they are needed
     if not config_path or (str(config_path) != "-" and not config_path.exists()):
         msg.fail("Config file not found", config_path, exits=1)
     overrides = parse_config_overrides(ctx.args)
-    import_code(code_path)
+    import_code_paths(code_path)
     with show_validation_error(config_path):
         config = util.load_config(config_path, overrides=overrides, interpolate=False)
     msg.divider("Initializing pipeline")

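The verbosity change above switches from always forcing a level (DEBUG or INFO) to only escalating when `--verbose` is passed, leaving whatever level was configured elsewhere untouched in the default case. A minimal stdlib illustration of that pattern:

```python
# Escalate the logger only on demand; otherwise preserve the configured level.
import logging

logger = logging.getLogger("assemble_demo")
logger.setLevel(logging.WARNING)  # level configured elsewhere

verbose = False
if verbose:
    logger.setLevel(logging.DEBUG)
assert logger.level == logging.WARNING  # untouched without --verbose

verbose = True
if verbose:
    logger.setLevel(logging.DEBUG)
assert logger.level == logging.DEBUG
```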
@@ -89,7 +89,7 @@ class Quartiles:
 def annotate(
     nlp: Language, docs: List[Doc], batch_size: Optional[int]
 ) -> numpy.ndarray:
-    docs = nlp.pipe(tqdm(docs, unit="doc"), batch_size=batch_size)
+    docs = nlp.pipe(tqdm(docs, unit="doc", disable=None), batch_size=batch_size)
     wps = []
     while True:
         with time_context() as elapsed:

@ -13,7 +13,7 @@ from ._util import (
|
||||||
Arg,
|
Arg,
|
||||||
Opt,
|
Opt,
|
||||||
debug_cli,
|
debug_cli,
|
||||||
import_code,
|
import_code_paths,
|
||||||
parse_config_overrides,
|
parse_config_overrides,
|
||||||
show_validation_error,
|
show_validation_error,
|
||||||
)
|
)
|
||||||
|
spacy/cli/debug_config.py

@@ -27,7 +27,7 @@ def debug_config_cli(
     # fmt: off
     ctx: typer.Context,  # This is only used to read additional arguments
     config_path: Path = Arg(..., help="Path to config file", exists=True, allow_dash=True),
-    code_path: Optional[Path] = Opt(None, "--code-path", "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
+    code_path: str = Opt("", "--code", "-c", help="Comma-separated paths to Python files with additional code (registered functions) to be imported"),
     show_funcs: bool = Opt(False, "--show-functions", "-F", help="Show an overview of all registered functions used in the config and where they come from (modules, files etc.)"),
     show_vars: bool = Opt(False, "--show-variables", "-V", help="Show an overview of all variables referenced in the config and their values. This will also reflect variables overwritten on the CLI.")
     # fmt: on
@@ -44,7 +44,7 @@ def debug_config_cli(
     DOCS: https://spacy.io/api/cli#debug-config
     """
     overrides = parse_config_overrides(ctx.args)
-    import_code(code_path)
+    import_code_paths(code_path)
     debug_config(
         config_path, overrides=overrides, show_funcs=show_funcs, show_vars=show_vars
     )
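The `import_code` → `import_code_paths` rename in this commit switches the `--code` option from a single path to a comma-separated list of paths. A hypothetical sketch of what such a helper could look like (the real helper lives in `spacy/cli/_util.py` and may differ in details such as error handling):

```python
import importlib.util
import sys
from pathlib import Path


def import_code_paths(code_paths: str) -> None:
    """Import one or more Python files from a comma-separated string of
    paths, so that any registered functions they define become available."""
    for code_path in [p.strip() for p in code_paths.split(",") if p.strip()]:
        path = Path(code_path)
        # Load the file as a module named after its stem
        spec = importlib.util.spec_from_file_location(path.stem, path)
        assert spec is not None and spec.loader is not None
        module = importlib.util.module_from_spec(spec)
        sys.modules[path.stem] = module
        spec.loader.exec_module(module)
```

An empty string (the new default) simply results in no imports, so the option remains optional.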
spacy/cli/debug_data.py

@@ -40,7 +40,7 @@ from ._util import (
     _format_number,
     app,
     debug_cli,
-    import_code,
+    import_code_paths,
     parse_config_overrides,
     show_validation_error,
 )
@@ -72,7 +72,7 @@ def debug_data_cli(
     # fmt: off
     ctx: typer.Context,  # This is only used to read additional arguments
     config_path: Path = Arg(..., help="Path to config file", exists=True, allow_dash=True),
-    code_path: Optional[Path] = Opt(None, "--code-path", "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
+    code_path: str = Opt("", "--code", "-c", help="Comma-separated paths to Python files with additional code (registered functions) to be imported"),
     ignore_warnings: bool = Opt(False, "--ignore-warnings", "-IW", help="Ignore warnings, only show stats and errors"),
     verbose: bool = Opt(False, "--verbose", "-V", help="Print additional information and explanations"),
     no_format: bool = Opt(False, "--no-format", "-NF", help="Don't pretty-print the results"),
@@ -92,7 +92,7 @@ def debug_data_cli(
         "--help for an overview of the other available debugging commands."
     )
     overrides = parse_config_overrides(ctx.args)
-    import_code(code_path)
+    import_code_paths(code_path)
     debug_data(
         config_path,
         config_overrides=overrides,
spacy/cli/download.py

@@ -10,6 +10,8 @@ from ..util import (
     get_installed_models,
     get_minor_version,
     get_package_version,
+    is_in_interactive,
+    is_in_jupyter,
     is_package,
     is_prerelease_version,
     run_command,
@@ -85,6 +87,27 @@ def download(
         "Download and installation successful",
         f"You can now load the package via spacy.load('{model_name}')",
     )
+    if is_in_jupyter():
+        reload_deps_msg = (
+            "If you are in a Jupyter or Colab notebook, you may need to "
+            "restart Python in order to load all the package's dependencies. "
+            "You can do this by selecting the 'Restart kernel' or 'Restart "
+            "runtime' option."
+        )
+        msg.warn(
+            "Restart to reload dependencies",
+            reload_deps_msg,
+        )
+    elif is_in_interactive():
+        reload_deps_msg = (
+            "If you are in an interactive Python session, you may need to "
+            "exit and restart Python to load all the package's dependencies. "
+            "You can exit with Ctrl-D (or Ctrl-Z and Enter on Windows)."
+        )
+        msg.warn(
+            "Restart to reload dependencies",
+            reload_deps_msg,
+        )


 def get_model_filename(model_name: str, version: str, sdist: bool = False) -> str:
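The new `is_in_jupyter()` / `is_in_interactive()` helpers are imported from `..util`. A minimal sketch of the kind of detection such helpers typically perform — the names and checks below are illustrative assumptions, not spaCy's exact implementation:

```python
import sys


def in_jupyter() -> bool:
    # Inside IPython/Jupyter, get_ipython() is injected into builtins;
    # the ZMQ shell class indicates a notebook kernel rather than a terminal.
    try:
        shell = get_ipython().__class__.__name__  # noqa: F821 (only defined in IPython)
        return shell == "ZMQInteractiveShell"
    except NameError:
        return False


def in_interactive() -> bool:
    # An interactive REPL defines sys.ps1; plain scripts do not.
    return hasattr(sys, "ps1") or bool(sys.flags.interactive)


print(in_jupyter(), in_interactive())
```

Run as a script, both checks report `False`, so the restart warning above would be skipped for non-interactive use.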
spacy/cli/evaluate.py

@@ -10,7 +10,7 @@ from .. import displacy, util
 from ..scorer import Scorer
 from ..tokens import Doc
 from ..training import Corpus
-from ._util import Arg, Opt, app, benchmark_cli, import_code, setup_gpu
+from ._util import Arg, Opt, app, benchmark_cli, import_code_paths, setup_gpu


 @benchmark_cli.command(
@@ -22,12 +22,13 @@ def evaluate_cli(
     model: str = Arg(..., help="Model name or path"),
     data_path: Path = Arg(..., help="Location of binary evaluation data in .spacy format", exists=True),
     output: Optional[Path] = Opt(None, "--output", "-o", help="Output JSON file for metrics", dir_okay=False),
-    code_path: Optional[Path] = Opt(None, "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
+    code_path: str = Opt("", "--code", "-c", help="Comma-separated paths to Python files with additional code (registered functions) to be imported"),
     use_gpu: int = Opt(-1, "--gpu-id", "-g", help="GPU ID or -1 for CPU"),
     gold_preproc: bool = Opt(False, "--gold-preproc", "-G", help="Use gold preprocessing"),
     displacy_path: Optional[Path] = Opt(None, "--displacy-path", "-dp", help="Directory to output rendered parses as HTML", exists=True, file_okay=False),
     displacy_limit: int = Opt(25, "--displacy-limit", "-dl", help="Limit of parses to render as HTML"),
     per_component: bool = Opt(False, "--per-component", "-P", help="Return scores per component, only applicable when an output JSON file is specified."),
+    spans_key: str = Opt("sc", "--spans-key", "-sk", help="Spans key to use when evaluating Doc.spans"),
     # fmt: on
 ):
     """
@@ -42,7 +43,7 @@ def evaluate_cli(

     DOCS: https://spacy.io/api/cli#benchmark-accuracy
     """
-    import_code(code_path)
+    import_code_paths(code_path)
     evaluate(
         model,
         data_path,
@@ -53,6 +54,7 @@ def evaluate_cli(
         displacy_limit=displacy_limit,
         per_component=per_component,
         silent=False,
+        spans_key=spans_key,
     )

spacy/cli/find_function.py (new file, 69 lines)

@@ -0,0 +1,69 @@
+from typing import Optional, Tuple
+
+from catalogue import RegistryError
+from wasabi import msg
+
+from ..util import registry
+from ._util import Arg, Opt, app
+
+
+@app.command("find-function")
+def find_function_cli(
+    # fmt: off
+    func_name: str = Arg(..., help="Name of the registered function."),
+    registry_name: Optional[str] = Opt(None, "--registry", "-r", help="Name of the catalogue registry."),
+    # fmt: on
+):
+    """
+    Find the module, path and line number to the file the registered
+    function is defined in, if available.
+
+    func_name (str): Name of the registered function.
+    registry_name (Optional[str]): Name of the catalogue registry.
+
+    DOCS: https://spacy.io/api/cli#find-function
+    """
+    if not registry_name:
+        registry_names = registry.get_registry_names()
+        for name in registry_names:
+            if registry.has(name, func_name):
+                registry_name = name
+                break
+
+    if not registry_name:
+        msg.fail(
+            f"Couldn't find registered function: '{func_name}'",
+            exits=1,
+        )
+
+    assert registry_name is not None
+    find_function(func_name, registry_name)
+
+
+def find_function(func_name: str, registry_name: str) -> Tuple[str, int]:
+    registry_desc = None
+    try:
+        registry_desc = registry.find(registry_name, func_name)
+    except RegistryError as e:
+        msg.fail(
+            f"Couldn't find registered function: '{func_name}' in registry '{registry_name}'",
+        )
+        msg.fail(f"{e}", exits=1)
+    assert registry_desc is not None
+
+    registry_path = None
+    line_no = None
+    if registry_desc["file"]:
+        registry_path = registry_desc["file"]
+        line_no = registry_desc["line_no"]
+
+    if not registry_path or not line_no:
+        msg.fail(
+            f"Couldn't find path to registered function: '{func_name}' in registry '{registry_name}'",
+            exits=1,
+        )
+    assert registry_path is not None
+    assert line_no is not None
+
+    msg.good(f"Found registered function '{func_name}' at {registry_path}:{line_no}")
+    return str(registry_path), int(line_no)
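The new `find-function` command relies on the registry being able to report where a function was registered. A stdlib-only sketch of that idea, using code-object metadata on a toy registry instead of spaCy's catalogue-based one (all names here are illustrative):

```python
from typing import Callable, Dict, Tuple

_REGISTRY: Dict[str, Callable] = {}


def register(name: str):
    """Decorator that records a function under a name."""
    def wrapper(func: Callable) -> Callable:
        _REGISTRY[name] = func
        return func
    return wrapper


def find_function(name: str) -> Tuple[str, int]:
    """Return the file and line number where a registered function is defined.

    Every Python function carries this metadata on its code object, which is
    what makes a lookup like spaCy's possible without importing inspect."""
    func = _REGISTRY[name]
    code = func.__code__
    return code.co_filename, code.co_firstlineno


@register("my_func.v1")
def my_func() -> int:
    return 42
```

Calling `find_function("my_func.v1")` then returns the defining file and the line of `def my_func`; an unknown name raises `KeyError`, where the real CLI instead prints a friendly failure and exits.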
spacy/cli/find_threshold.py

@@ -52,8 +52,8 @@ def find_threshold_cli(

     DOCS: https://spacy.io/api/cli#find-threshold
     """
-    util.logger.setLevel(logging.DEBUG if verbose else logging.INFO)
+    if verbose:
+        util.logger.setLevel(logging.DEBUG)
     import_code(code_path)
     find_threshold(
         model=model,
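The logging change above (repeated in `init_pipeline.py` below) stops the CLI from unconditionally forcing the level to INFO, which would clobber a level the user had already configured; presumably that is the motivation. A small demonstration of the difference on a throwaway logger:

```python
import logging

logger = logging.getLogger("sketch")
logger.setLevel(logging.WARNING)  # e.g. a level the user configured elsewhere

verbose = False

# Old behaviour: the level is always overwritten, even without --verbose:
#     logger.setLevel(logging.DEBUG if verbose else logging.INFO)
# New behaviour: the level is only touched when --verbose is passed:
if verbose:
    logger.setLevel(logging.DEBUG)

print(logging.getLevelName(logger.level))  # the user's WARNING setting survives
```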
spacy/cli/init_pipeline.py

@@ -90,7 +90,8 @@ def init_pipeline_cli(
     use_gpu: int = Opt(-1, "--gpu-id", "-g", help="GPU ID or -1 for CPU")
     # fmt: on
 ):
-    util.logger.setLevel(logging.DEBUG if verbose else logging.INFO)
+    if verbose:
+        util.logger.setLevel(logging.DEBUG)
     overrides = parse_config_overrides(ctx.args)
     import_code(code_path)
     setup_gpu(use_gpu)
@@ -119,7 +120,8 @@ def init_labels_cli(
     """Generate JSON files for the labels in the data. This helps speed up the
     training process, since spaCy won't have to preprocess the data to
     extract the labels."""
-    util.logger.setLevel(logging.DEBUG if verbose else logging.INFO)
+    if verbose:
+        util.logger.setLevel(logging.DEBUG)
     if not output_path.exists():
         output_path.mkdir(parents=True)
     overrides = parse_config_overrides(ctx.args)
spacy/cli/package.py

@@ -1,5 +1,8 @@
+import importlib.metadata
+import os
 import re
 import shutil
+import subprocess
 import sys
 from collections import defaultdict
 from pathlib import Path
@@ -20,7 +23,7 @@ def package_cli(
     # fmt: off
     input_dir: Path = Arg(..., help="Directory with pipeline data", exists=True, file_okay=False),
     output_dir: Path = Arg(..., help="Output parent directory", exists=True, file_okay=False),
-    code_paths: str = Opt("", "--code", "-c", help="Comma-separated paths to Python file with additional code (registered functions) to be included in the package"),
+    code_paths: str = Opt("", "--code", "-c", help="Comma-separated paths to Python files with additional code (registered functions) to be included in the package"),
     meta_path: Optional[Path] = Opt(None, "--meta-path", "--meta", "-m", help="Path to meta.json", exists=True, dir_okay=False),
     create_meta: bool = Opt(False, "--create-meta", "-C", help="Create meta.json, even if one exists"),
     name: Optional[str] = Opt(None, "--name", "-n", help="Package name to override meta"),
@@ -35,7 +38,7 @@ def package_cli(
     specified output directory, and the data will be copied over. If
     --create-meta is set and a meta.json already exists in the output directory,
     the existing values will be used as the defaults in the command-line prompt.
-    After packaging, "python setup.py sdist" is run in the package directory,
+    After packaging, "python -m build --sdist" is run in the package directory,
     which will create a .tar.gz archive that can be installed via "pip install".

     If additional code files are provided (e.g. Python files containing custom
@@ -78,9 +81,17 @@ def package(
     input_path = util.ensure_path(input_dir)
     output_path = util.ensure_path(output_dir)
     meta_path = util.ensure_path(meta_path)
-    if create_wheel and not has_wheel():
-        err = "Generating a binary .whl file requires wheel to be installed"
-        msg.fail(err, "pip install wheel", exits=1)
+    if create_wheel and not has_wheel() and not has_build():
+        err = (
+            "Generating wheels requires 'build' or 'wheel' (deprecated) to be installed"
+        )
+        msg.fail(err, "pip install build", exits=1)
+    if not has_build():
+        msg.warn(
+            "Generating packages without the 'build' package is deprecated and "
+            "will not be supported in the future. To install 'build': pip "
+            "install build"
+        )
     if not input_path or not input_path.exists():
         msg.fail("Can't locate pipeline data", input_path, exits=1)
     if not output_path or not output_path.exists():
@@ -184,12 +195,37 @@ def package(
     msg.good(f"Successfully created package directory '{model_name_v}'", main_path)
     if create_sdist:
         with util.working_dir(main_path):
-            util.run_command([sys.executable, "setup.py", "sdist"], capture=False)
+            # run directly, since util.run_command is not designed to continue
+            # after a command fails
+            ret = subprocess.run(
+                [sys.executable, "-m", "build", ".", "--sdist"],
+                env=os.environ.copy(),
+            )
+            if ret.returncode != 0:
+                msg.warn(
+                    "Creating sdist with 'python -m build' failed. Falling "
+                    "back to deprecated use of 'python setup.py sdist'"
+                )
+                util.run_command([sys.executable, "setup.py", "sdist"], capture=False)
         zip_file = main_path / "dist" / f"{model_name_v}{SDIST_SUFFIX}"
         msg.good(f"Successfully created zipped Python package", zip_file)
     if create_wheel:
         with util.working_dir(main_path):
-            util.run_command([sys.executable, "setup.py", "bdist_wheel"], capture=False)
+            # run directly, since util.run_command is not designed to continue
+            # after a command fails
+            ret = subprocess.run(
+                [sys.executable, "-m", "build", ".", "--wheel"],
+                env=os.environ.copy(),
+            )
+            if ret.returncode != 0:
+                msg.warn(
+                    "Creating wheel with 'python -m build' failed. Falling "
+                    "back to deprecated use of 'wheel' with "
+                    "'python setup.py bdist_wheel'"
+                )
+                util.run_command(
+                    [sys.executable, "setup.py", "bdist_wheel"], capture=False
+                )
         wheel_name_squashed = re.sub("_+", "_", model_name_v)
         wheel = main_path / "dist" / f"{wheel_name_squashed}{WHEEL_SUFFIX}"
         msg.good(f"Successfully created binary wheel", wheel)
@@ -209,6 +245,17 @@ def has_wheel() -> bool:
         return False


+def has_build() -> bool:
+    # it's very likely that there is a local directory named build/ (especially
+    # in an editable install), so an import check is not sufficient; instead
+    # check that there is a package version
+    try:
+        importlib.metadata.version("build")
+        return True
+    except importlib.metadata.PackageNotFoundError:  # type: ignore[attr-defined]
+        return False
+
+
 def get_third_party_dependencies(
     config: Config, exclude: List[str] = util.SimpleFrozenList()
 ) -> List[str]:
@@ -403,7 +450,7 @@ def _format_sources(data: Any) -> str:
         if author:
             result += " ({})".format(author)
         sources.append(result)
-    return "<br />".join(sources)
+    return "<br>".join(sources)


 def _format_accuracy(data: Dict[str, Any], exclude: List[str] = ["speed"]) -> str:
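The new `has_build()` helper above deliberately checks for an installed `build` distribution via `importlib.metadata` rather than importing it, since a local `build/` directory next to the source tree would satisfy a plain import check. The same probe works for any distribution name; a generic sketch:

```python
import importlib.metadata


def has_distribution(dist_name: str) -> bool:
    """True if a distribution of this name is installed, regardless of any
    same-named local directory that a plain import could accidentally pick up."""
    try:
        importlib.metadata.version(dist_name)
        return True
    except importlib.metadata.PackageNotFoundError:
        return False


print(has_distribution("pip"), has_distribution("definitely-not-installed"))
```

This is why the fallback to `setup.py sdist`/`bdist_wheel` only triggers when `python -m build` genuinely isn't available or fails, not merely because a `build/` directory exists.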
spacy/cli/pretrain.py

@@ -11,7 +11,7 @@ from ._util import (
     Arg,
     Opt,
     app,
-    import_code,
+    import_code_paths,
     parse_config_overrides,
     setup_gpu,
     show_validation_error,
@@ -27,7 +27,7 @@ def pretrain_cli(
     ctx: typer.Context,  # This is only used to read additional arguments
     config_path: Path = Arg(..., help="Path to config file", exists=True, dir_okay=False, allow_dash=True),
     output_dir: Path = Arg(..., help="Directory to write weights to on each epoch"),
-    code_path: Optional[Path] = Opt(None, "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
+    code_path: str = Opt("", "--code", "-c", help="Comma-separated paths to Python files with additional code (registered functions) to be imported"),
     resume_path: Optional[Path] = Opt(None, "--resume-path", "-r", help="Path to pretrained weights from which to resume pretraining"),
     epoch_resume: Optional[int] = Opt(None, "--epoch-resume", "-er", help="The epoch to resume counting from when using --resume-path. Prevents unintended overwriting of existing weight files."),
     use_gpu: int = Opt(-1, "--gpu-id", "-g", help="GPU ID or -1 for CPU"),
@@ -56,7 +56,7 @@ def pretrain_cli(
     DOCS: https://spacy.io/api/cli#pretrain
     """
     config_overrides = parse_config_overrides(ctx.args)
-    import_code(code_path)
+    import_code_paths(code_path)
     verify_cli_args(config_path, output_dir, resume_path, epoch_resume)
     setup_gpu(use_gpu)
     msg.info(f"Loading config from: {config_path}")

spacy/cli/profile.py

@@ -71,7 +71,7 @@ def profile(model: str, inputs: Optional[Path] = None, n_texts: int = 10000) ->


 def parse_texts(nlp: Language, texts: Sequence[str]) -> None:
-    for doc in nlp.pipe(tqdm.tqdm(texts), batch_size=16):
+    for doc in nlp.pipe(tqdm.tqdm(texts, disable=None), batch_size=16):
         pass

@ -1,217 +1 @@
|
||||||
import os
|
from weasel.cli.assets import *
|
||||||
import re
|
|
||||||
import shutil
|
|
||||||
from pathlib import Path
|
|
||||||
from typing import Any, Dict, Optional
|
|
||||||
|
|
||||||
import requests
|
|
||||||
import typer
|
|
||||||
from wasabi import msg
|
|
||||||
|
|
||||||
from ...util import ensure_path, working_dir
|
|
||||||
from .._util import (
|
|
||||||
PROJECT_FILE,
|
|
||||||
Arg,
|
|
||||||
Opt,
|
|
||||||
SimpleFrozenDict,
|
|
||||||
download_file,
|
|
||||||
get_checksum,
|
|
||||||
get_git_version,
|
|
||||||
git_checkout,
|
|
||||||
load_project_config,
|
|
||||||
parse_config_overrides,
|
|
||||||
project_cli,
|
|
||||||
)
|
|
||||||
|
|
||||||
# Whether assets are extra if `extra` is not set.
|
|
||||||
EXTRA_DEFAULT = False
|
|
||||||
|
|
||||||
|
|
||||||
@project_cli.command(
|
|
||||||
"assets",
|
|
||||||
context_settings={"allow_extra_args": True, "ignore_unknown_options": True},
|
|
||||||
)
|
|
||||||
def project_assets_cli(
|
|
||||||
# fmt: off
|
|
||||||
ctx: typer.Context, # This is only used to read additional arguments
|
|
||||||
project_dir: Path = Arg(Path.cwd(), help="Path to cloned project. Defaults to current working directory.", exists=True, file_okay=False),
|
|
||||||
sparse_checkout: bool = Opt(False, "--sparse", "-S", help="Use sparse checkout for assets provided via Git, to only check out and clone the files needed. Requires Git v22.2+."),
|
|
||||||
extra: bool = Opt(False, "--extra", "-e", help="Download all assets, including those marked as 'extra'.")
|
|
||||||
# fmt: on
|
|
||||||
):
|
|
||||||
"""Fetch project assets like datasets and pretrained weights. Assets are
|
|
||||||
defined in the "assets" section of the project.yml. If a checksum is
|
|
||||||
provided in the project.yml, the file is only downloaded if no local file
|
|
||||||
with the same checksum exists.
|
|
||||||
|
|
||||||
DOCS: https://spacy.io/api/cli#project-assets
|
|
||||||
"""
|
|
||||||
overrides = parse_config_overrides(ctx.args)
|
|
||||||
project_assets(
|
|
||||||
project_dir,
|
|
||||||
overrides=overrides,
|
|
||||||
sparse_checkout=sparse_checkout,
|
|
||||||
extra=extra,
|
|
||||||
)
|
|
||||||
|
|
||||||
|
|
||||||
def project_assets(
|
|
||||||
project_dir: Path,
|
|
||||||
*,
|
|
||||||
overrides: Dict[str, Any] = SimpleFrozenDict(),
|
|
||||||
sparse_checkout: bool = False,
|
|
||||||
extra: bool = False,
|
|
||||||
) -> None:
|
|
||||||
"""Fetch assets for a project using DVC if possible.
|
|
||||||
|
|
||||||
project_dir (Path): Path to project directory.
|
|
||||||
sparse_checkout (bool): Use sparse checkout for assets provided via Git, to only check out and clone the files
|
|
||||||
needed.
|
|
||||||
extra (bool): Whether to download all assets, including those marked as 'extra'.
|
|
||||||
"""
|
|
||||||
project_path = ensure_path(project_dir)
|
|
||||||
config = load_project_config(project_path, overrides=overrides)
|
|
||||||
assets = [
|
|
||||||
asset
|
|
||||||
for asset in config.get("assets", [])
|
|
||||||
if extra or not asset.get("extra", EXTRA_DEFAULT)
|
|
||||||
]
|
|
||||||
if not assets:
|
|
||||||
msg.warn(
|
|
||||||
f"No assets specified in {PROJECT_FILE} (if assets are marked as extra, download them with --extra)",
|
|
||||||
exits=0,
|
|
||||||
)
|
|
||||||
msg.info(f"Fetching {len(assets)} asset(s)")
|
|
||||||
|
|
||||||
for asset in assets:
|
|
||||||
dest = (project_dir / asset["dest"]).resolve()
|
|
||||||
checksum = asset.get("checksum")
|
|
||||||
if "git" in asset:
|
|
||||||
git_err = (
|
|
||||||
f"Cloning spaCy project templates requires Git and the 'git' command. "
|
|
||||||
f"Make sure it's installed and that the executable is available."
|
|
||||||
)
|
|
||||||
get_git_version(error=git_err)
|
|
||||||
if dest.exists():
|
|
||||||
# If there's already a file, check for checksum
|
|
||||||
if checksum and checksum == get_checksum(dest):
|
|
||||||
msg.good(
|
|
||||||
f"Skipping download with matching checksum: {asset['dest']}"
|
|
||||||
)
|
|
||||||
continue
|
|
||||||
else:
|
|
||||||
if dest.is_dir():
|
|
||||||
shutil.rmtree(dest)
|
|
||||||
else:
|
|
||||||
dest.unlink()
|
|
||||||
if "repo" not in asset["git"] or asset["git"]["repo"] is None:
|
|
||||||
msg.fail(
|
|
||||||
"A git asset must include 'repo', the repository address.", exits=1
|
|
||||||
)
|
|
||||||
if "path" not in asset["git"] or asset["git"]["path"] is None:
|
|
||||||
msg.fail(
|
|
||||||
"A git asset must include 'path' - use \"\" to get the entire repository.",
|
|
||||||
exits=1,
|
|
||||||
)
|
|
||||||
git_checkout(
|
|
||||||
asset["git"]["repo"],
|
|
||||||
asset["git"]["path"],
|
|
||||||
dest,
|
|
||||||
branch=asset["git"].get("branch"),
|
|
||||||
sparse=sparse_checkout,
|
|
||||||
)
|
|
||||||
msg.good(f"Downloaded asset {dest}")
|
|
||||||
else:
|
|
||||||
url = asset.get("url")
|
|
||||||
if not url:
|
|
||||||
# project.yml defines asset without URL that the user has to place
|
|
||||||
check_private_asset(dest, checksum)
|
|
||||||
continue
|
|
||||||
fetch_asset(project_path, url, dest, checksum)
|
|
||||||
|
|
||||||
|
|
||||||
def check_private_asset(dest: Path, checksum: Optional[str] = None) -> None:
|
|
||||||
"""Check and validate assets without a URL (private assets that the user
|
|
||||||
has to provide themselves) and give feedback about the checksum.
|
|
||||||
|
|
||||||
dest (Path): Destination path of the asset.
|
|
||||||
checksum (Optional[str]): Optional checksum of the expected file.
|
|
||||||
"""
|
|
||||||
if not Path(dest).exists():
|
|
||||||
err = f"No URL provided for asset. You need to add this file yourself: {dest}"
|
|
||||||
msg.warn(err)
|
|
||||||
else:
|
|
||||||
if not checksum:
|
|
||||||
msg.good(f"Asset already exists: {dest}")
|
|
||||||
elif checksum == get_checksum(dest):
|
|
||||||
msg.good(f"Asset exists with matching checksum: {dest}")
|
|
||||||
else:
|
|
||||||
msg.fail(f"Asset available but with incorrect checksum: {dest}")
|
|
||||||
|
|
||||||
|
|
||||||
def fetch_asset(
|
|
||||||
project_path: Path, url: str, dest: Path, checksum: Optional[str] = None
|
|
||||||
) -> None:
|
|
||||||
"""Fetch an asset from a given URL or path. If a checksum is provided and a
|
|
||||||
local file exists, it's only re-downloaded if the checksum doesn't match.
|
|
||||||
|
|
||||||
project_path (Path): Path to project directory.
|
|
||||||
url (str): URL or path to asset.
|
|
||||||
checksum (Optional[str]): Optional expected checksum of local file.
|
|
||||||
RETURNS (Optional[Path]): The path to the fetched asset or None if fetching
|
|
||||||
the asset failed.
|
|
||||||
"""
|
|
||||||
dest_path = (project_path / dest).resolve()
|
|
||||||
if dest_path.exists():
|
|
||||||
# If there's already a file, check for checksum
|
|
||||||
if checksum:
|
|
||||||
if checksum == get_checksum(dest_path):
|
|
||||||
msg.good(f"Skipping download with matching checksum: {dest}")
|
|
||||||
                return
        else:
            # If there's not a checksum, make sure the file is a possibly valid size
            if os.path.getsize(dest_path) == 0:
                msg.warn(f"Asset exists but with size of 0 bytes, deleting: {dest}")
                os.remove(dest_path)
    # We might as well support the user here and create parent directories in
    # case the asset dir isn't listed as a dir to create in the project.yml
    if not dest_path.parent.exists():
        dest_path.parent.mkdir(parents=True)
    with working_dir(project_path):
        url = convert_asset_url(url)
        try:
            download_file(url, dest_path)
            msg.good(f"Downloaded asset {dest}")
        except requests.exceptions.RequestException as e:
            if Path(url).exists() and Path(url).is_file():
                # If it's a local file, copy to destination
                shutil.copy(url, str(dest_path))
                msg.good(f"Copied local asset {dest}")
            else:
                msg.fail(f"Download failed: {dest}", e)
    if checksum and checksum != get_checksum(dest_path):
        msg.fail(f"Checksum doesn't match value defined in {PROJECT_FILE}: {dest}")


def convert_asset_url(url: str) -> str:
    """Check and convert the asset URL if needed.

    url (str): The asset URL.
    RETURNS (str): The converted URL.
    """
    # If the asset URL is a regular GitHub URL it's likely a mistake
    if (
        re.match(r"(http(s?)):\/\/github.com", url)
        and "releases/download" not in url
        and "/raw/" not in url
    ):
        converted = url.replace("github.com", "raw.githubusercontent.com")
        converted = re.sub(r"/(tree|blob)/", "/", converted)
        msg.warn(
            "Downloading from a regular GitHub URL. This will only download "
            "the source of the page, not the actual file. Converting the URL "
            "to a raw URL.",
            converted,
        )
        return converted
    return url
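The GitHub URL rewrite in `convert_asset_url` above can be sketched standalone — a minimal re-implementation of the same regex rewrite (the function name `to_raw_github_url` is illustrative, not part of the diff):

```python
import re


def to_raw_github_url(url: str) -> str:
    # Rewrite regular GitHub page URLs to raw.githubusercontent.com and drop
    # the /tree/ or /blob/ path segment, leaving release-download and /raw/
    # URLs untouched, mirroring convert_asset_url.
    if (
        re.match(r"(http(s?)):\/\/github.com", url)
        and "releases/download" not in url
        and "/raw/" not in url
    ):
        converted = url.replace("github.com", "raw.githubusercontent.com")
        return re.sub(r"/(tree|blob)/", "/", converted)
    return url


print(to_raw_github_url("https://github.com/explosion/projects/blob/v3/README.md"))
# → https://raw.githubusercontent.com/explosion/projects/v3/README.md
```

Release assets (`releases/download`) and already-raw URLs pass through unchanged, since those resolve to file contents rather than an HTML page.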
@@ -1,124 +1 @@
from weasel.cli.clone import *
import re
import subprocess
from pathlib import Path
from typing import Optional

from wasabi import msg

from ... import about
from ...util import ensure_path
from .._util import (
    COMMAND,
    PROJECT_FILE,
    Arg,
    Opt,
    get_git_version,
    git_checkout,
    git_repo_branch_exists,
    project_cli,
)

DEFAULT_REPO = about.__projects__
DEFAULT_PROJECTS_BRANCH = about.__projects_branch__
DEFAULT_BRANCHES = ["main", "master"]


@project_cli.command("clone")
def project_clone_cli(
    # fmt: off
    name: str = Arg(..., help="The name of the template to clone"),
    dest: Optional[Path] = Arg(None, help="Where to clone the project. Defaults to current working directory", exists=False),
    repo: str = Opt(DEFAULT_REPO, "--repo", "-r", help="The repository to clone from"),
    branch: Optional[str] = Opt(None, "--branch", "-b", help=f"The branch to clone from. If not provided, will attempt {', '.join(DEFAULT_BRANCHES)}"),
    sparse_checkout: bool = Opt(False, "--sparse", "-S", help="Use sparse Git checkout to only check out and clone the files needed. Requires Git v22.2+.")
    # fmt: on
):
    """Clone a project template from a repository. Calls into "git" and will
    only download the files from the given subdirectory. The GitHub repo
    defaults to the official spaCy template repo, but can be customized
    (including using a private repo).

    DOCS: https://spacy.io/api/cli#project-clone
    """
    if dest is None:
        dest = Path.cwd() / Path(name).parts[-1]
    if repo == DEFAULT_REPO and branch is None:
        branch = DEFAULT_PROJECTS_BRANCH

    if branch is None:
        for default_branch in DEFAULT_BRANCHES:
            if git_repo_branch_exists(repo, default_branch):
                branch = default_branch
                break
        if branch is None:
            default_branches_msg = ", ".join(f"'{b}'" for b in DEFAULT_BRANCHES)
            msg.fail(
                "No branch provided and attempted default "
                f"branches {default_branches_msg} do not exist.",
                exits=1,
            )
    else:
        if not git_repo_branch_exists(repo, branch):
            msg.fail(f"repo: {repo} (branch: {branch}) does not exist.", exits=1)
    assert isinstance(branch, str)
    project_clone(name, dest, repo=repo, branch=branch, sparse_checkout=sparse_checkout)


def project_clone(
    name: str,
    dest: Path,
    *,
    repo: str = about.__projects__,
    branch: str = about.__projects_branch__,
    sparse_checkout: bool = False,
) -> None:
    """Clone a project template from a repository.

    name (str): Name of subdirectory to clone.
    dest (Path): Destination path of cloned project.
    repo (str): URL of Git repo containing project templates.
    branch (str): The branch to clone from
    """
    dest = ensure_path(dest)
    check_clone(name, dest, repo)
    project_dir = dest.resolve()
    repo_name = re.sub(r"(http(s?)):\/\/github.com/", "", repo)
    try:
        git_checkout(repo, name, dest, branch=branch, sparse=sparse_checkout)
    except subprocess.CalledProcessError:
        err = f"Could not clone '{name}' from repo '{repo_name}' (branch '{branch}')"
        msg.fail(err, exits=1)
    msg.good(f"Cloned '{name}' from '{repo_name}' (branch '{branch}')", project_dir)
    if not (project_dir / PROJECT_FILE).exists():
        msg.warn(f"No {PROJECT_FILE} found in directory")
    else:
        msg.good(f"Your project is now ready!")
        print(f"To fetch the assets, run:\n{COMMAND} project assets {dest}")


def check_clone(name: str, dest: Path, repo: str) -> None:
    """Check and validate that the destination path can be used to clone. Will
    check that Git is available and that the destination path is suitable.

    name (str): Name of the directory to clone from the repo.
    dest (Path): Local destination of cloned directory.
    repo (str): URL of the repo to clone from.
    """
    git_err = (
        f"Cloning spaCy project templates requires Git and the 'git' command. "
        f"To clone a project without Git, copy the files from the '{name}' "
        f"directory in the {repo} to {dest} manually."
    )
    get_git_version(error=git_err)
    if not dest:
        msg.fail(f"Not a valid directory to clone project: {dest}", exits=1)
    if dest.exists():
        # Directory already exists (not allowed, clone needs to create it)
        msg.fail(f"Can't clone project, directory already exists: {dest}", exits=1)
    if not dest.parent.exists():
        # We're not creating parents, parent dir should exist
        msg.fail(
            f"Can't clone project, parent directory doesn't exist: {dest.parent}. "
            f"Create the necessary folder(s) first before continuing.",
            exits=1,
        )
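The default-branch fallback in `project_clone_cli` can be illustrated in isolation. Here `branch_exists` is a stand-in predicate for `git_repo_branch_exists` (a hypothetical callable, not the real Git call), so the selection logic is testable without a network:

```python
from typing import Callable, Iterable, Optional


def pick_branch(
    requested: Optional[str],
    candidates: Iterable[str],
    branch_exists: Callable[[str], bool],
) -> Optional[str]:
    # If the user named a branch, it must exist; otherwise try the default
    # candidates ("main", then "master") in order, as project_clone_cli does.
    if requested is not None:
        return requested if branch_exists(requested) else None
    for candidate in candidates:
        if branch_exists(candidate):
            return candidate
    return None


print(pick_branch(None, ["main", "master"], lambda b: b == "master"))
# → master
```

In the real command, a `None` result is fatal (`msg.fail(..., exits=1)`); the sketch just returns it so the caller decides.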
@@ -1,115 +1 @@
from weasel.cli.document import *
from pathlib import Path

from wasabi import MarkdownRenderer, msg

from ...util import working_dir
from .._util import PROJECT_FILE, Arg, Opt, load_project_config, project_cli

DOCS_URL = "https://spacy.io"
INTRO_PROJECT = f"""The [`{PROJECT_FILE}`]({PROJECT_FILE}) defines the data assets required by the
project, as well as the available commands and workflows. For details, see the
[spaCy projects documentation]({DOCS_URL}/usage/projects)."""
INTRO_COMMANDS = f"""The following commands are defined by the project. They
can be executed using [`spacy project run [name]`]({DOCS_URL}/api/cli#project-run).
Commands are only re-run if their inputs have changed."""
INTRO_WORKFLOWS = f"""The following workflows are defined by the project. They
can be executed using [`spacy project run [name]`]({DOCS_URL}/api/cli#project-run)
and will run the specified commands in order. Commands are only re-run if their
inputs have changed."""
INTRO_ASSETS = f"""The following assets are defined by the project. They can
be fetched by running [`spacy project assets`]({DOCS_URL}/api/cli#project-assets)
in the project directory."""
# These markers are added to the Markdown and can be used to update the file in
# place if it already exists. Only the auto-generated part will be replaced.
MARKER_START = "<!-- SPACY PROJECT: AUTO-GENERATED DOCS START (do not remove) -->"
MARKER_END = "<!-- SPACY PROJECT: AUTO-GENERATED DOCS END (do not remove) -->"
# If this marker is used in an existing README, it's ignored and not replaced
MARKER_IGNORE = "<!-- SPACY PROJECT: IGNORE -->"


@project_cli.command("document")
def project_document_cli(
    # fmt: off
    project_dir: Path = Arg(Path.cwd(), help="Path to cloned project. Defaults to current working directory.", exists=True, file_okay=False),
    output_file: Path = Opt("-", "--output", "-o", help="Path to output Markdown file for output. Defaults to - for standard output"),
    no_emoji: bool = Opt(False, "--no-emoji", "-NE", help="Don't use emoji")
    # fmt: on
):
    """
    Auto-generate a README.md for a project. If the content is saved to a file,
    hidden markers are added so you can add custom content before or after the
    auto-generated section and only the auto-generated docs will be replaced
    when you re-run the command.

    DOCS: https://spacy.io/api/cli#project-document
    """
    project_document(project_dir, output_file, no_emoji=no_emoji)


def project_document(
    project_dir: Path, output_file: Path, *, no_emoji: bool = False
) -> None:
    is_stdout = str(output_file) == "-"
    config = load_project_config(project_dir)
    md = MarkdownRenderer(no_emoji=no_emoji)
    md.add(MARKER_START)
    title = config.get("title")
    description = config.get("description")
    md.add(md.title(1, f"spaCy Project{f': {title}' if title else ''}", "🪐"))
    if description:
        md.add(description)
    md.add(md.title(2, PROJECT_FILE, "📋"))
    md.add(INTRO_PROJECT)
    # Commands
    cmds = config.get("commands", [])
    data = [(md.code(cmd["name"]), cmd.get("help", "")) for cmd in cmds]
    if data:
        md.add(md.title(3, "Commands", "⏯"))
        md.add(INTRO_COMMANDS)
        md.add(md.table(data, ["Command", "Description"]))
    # Workflows
    wfs = config.get("workflows", {}).items()
    data = [(md.code(n), " → ".join(md.code(w) for w in stp)) for n, stp in wfs]
    if data:
        md.add(md.title(3, "Workflows", "⏭"))
        md.add(INTRO_WORKFLOWS)
        md.add(md.table(data, ["Workflow", "Steps"]))
    # Assets
    assets = config.get("assets", [])
    data = []
    for a in assets:
        source = "Git" if a.get("git") else "URL" if a.get("url") else "Local"
        dest_path = a["dest"]
        dest = md.code(dest_path)
        if source == "Local":
            # Only link assets if they're in the repo
            with working_dir(project_dir) as p:
                if (p / dest_path).exists():
                    dest = md.link(dest, dest_path)
        data.append((dest, source, a.get("description", "")))
    if data:
        md.add(md.title(3, "Assets", "🗂"))
        md.add(INTRO_ASSETS)
        md.add(md.table(data, ["File", "Source", "Description"]))
    md.add(MARKER_END)
    # Output result
    if is_stdout:
        print(md.text)
    else:
        content = md.text
        if output_file.exists():
            with output_file.open("r", encoding="utf8") as f:
                existing = f.read()
            if MARKER_IGNORE in existing:
                msg.warn("Found ignore marker in existing file: skipping", output_file)
                return
            if MARKER_START in existing and MARKER_END in existing:
                msg.info("Found existing file: only replacing auto-generated docs")
                before = existing.split(MARKER_START)[0]
                after = existing.split(MARKER_END)[1]
                content = f"{before}{content}{after}"
            else:
                msg.warn("Replacing existing file")
        with output_file.open("w", encoding="utf8") as f:
            f.write(content)
        msg.good("Saved project documentation", output_file)
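The marker handling in `project_document` — replacing only the auto-generated region between `MARKER_START` and `MARKER_END` — reduces to a string splice. This sketch uses shortened marker strings for readability:

```python
START = "<!-- AUTO START -->"
END = "<!-- AUTO END -->"


def splice_docs(existing: str, generated: str) -> str:
    # Keep everything before the start marker and after the end marker,
    # swapping only the auto-generated middle (markers are part of `generated`),
    # like the split/join in project_document.
    if START in existing and END in existing:
        before = existing.split(START)[0]
        after = existing.split(END)[1]
        return f"{before}{generated}{after}"
    return generated  # no markers: replace the whole file


old = f"intro\n{START}\nold docs\n{END}\noutro"
new_block = f"{START}\nnew docs\n{END}"
print(splice_docs(old, new_block))
```

Custom content outside the markers ("intro"/"outro" here) survives a re-run; the separate `MARKER_IGNORE` check in the real command skips the file entirely before this splice ever happens.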
@@ -1,220 +1 @@
from weasel.cli.dvc import *
"""This module contains helpers and subcommands for integrating spaCy projects
with Data Version Control (DVC). https://dvc.org"""
import subprocess
from pathlib import Path
from typing import Any, Dict, Iterable, List, Optional

from wasabi import msg

from ...util import (
    SimpleFrozenList,
    join_command,
    run_command,
    split_command,
    working_dir,
)
from .._util import (
    COMMAND,
    NAME,
    PROJECT_FILE,
    Arg,
    Opt,
    get_hash,
    load_project_config,
    project_cli,
)

DVC_CONFIG = "dvc.yaml"
DVC_DIR = ".dvc"
UPDATE_COMMAND = "dvc"
DVC_CONFIG_COMMENT = f"""# This file is auto-generated by spaCy based on your {PROJECT_FILE}. If you've
# edited your {PROJECT_FILE}, you can regenerate this file by running:
# {COMMAND} project {UPDATE_COMMAND}"""


@project_cli.command(UPDATE_COMMAND)
def project_update_dvc_cli(
    # fmt: off
    project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
    workflow: Optional[str] = Arg(None, help=f"Name of workflow defined in {PROJECT_FILE}. Defaults to first workflow if not set."),
    verbose: bool = Opt(False, "--verbose", "-V", help="Print more info"),
    quiet: bool = Opt(False, "--quiet", "-q", help="Print less info"),
    force: bool = Opt(False, "--force", "-F", help="Force update DVC config"),
    # fmt: on
):
    """Auto-generate Data Version Control (DVC) config. A DVC
    project can only define one pipeline, so you need to specify one workflow
    defined in the project.yml. If no workflow is specified, the first defined
    workflow is used. The DVC config will only be updated if the project.yml
    changed.

    DOCS: https://spacy.io/api/cli#project-dvc
    """
    project_update_dvc(project_dir, workflow, verbose=verbose, quiet=quiet, force=force)


def project_update_dvc(
    project_dir: Path,
    workflow: Optional[str] = None,
    *,
    verbose: bool = False,
    quiet: bool = False,
    force: bool = False,
) -> None:
    """Update the auto-generated Data Version Control (DVC) config file. A DVC
    project can only define one pipeline, so you need to specify one workflow
    defined in the project.yml. Will only update the file if the checksum changed.

    project_dir (Path): The project directory.
    workflow (Optional[str]): Optional name of workflow defined in project.yml.
        If not set, the first workflow will be used.
    verbose (bool): Print more info.
    quiet (bool): Print less info.
    force (bool): Force update DVC config.
    """
    config = load_project_config(project_dir)
    updated = update_dvc_config(
        project_dir, config, workflow, verbose=verbose, quiet=quiet, force=force
    )
    help_msg = "To execute the workflow with DVC, run: dvc repro"
    if updated:
        msg.good(f"Updated DVC config from {PROJECT_FILE}", help_msg)
    else:
        msg.info(f"No changes found in {PROJECT_FILE}, no update needed", help_msg)


def update_dvc_config(
    path: Path,
    config: Dict[str, Any],
    workflow: Optional[str] = None,
    verbose: bool = False,
    quiet: bool = False,
    force: bool = False,
) -> bool:
    """Re-run the DVC commands in dry mode and update dvc.yaml file in the
    project directory. The file is auto-generated based on the config. The
    first line of the auto-generated file specifies the hash of the config
    dict, so if any of the config values change, the DVC config is regenerated.

    path (Path): The path to the project directory.
    config (Dict[str, Any]): The loaded project.yml.
    verbose (bool): Whether to print additional info (via DVC).
    quiet (bool): Don't output anything (via DVC).
    force (bool): Force update, even if hashes match.
    RETURNS (bool): Whether the DVC config file was updated.
    """
    ensure_dvc(path)
    workflows = config.get("workflows", {})
    workflow_names = list(workflows.keys())
    check_workflows(workflow_names, workflow)
    if not workflow:
        workflow = workflow_names[0]
    config_hash = get_hash(config)
    path = path.resolve()
    dvc_config_path = path / DVC_CONFIG
    if dvc_config_path.exists():
        # Check if the file was generated using the current config, if not, redo
        with dvc_config_path.open("r", encoding="utf8") as f:
            ref_hash = f.readline().strip().replace("# ", "")
        if ref_hash == config_hash and not force:
            return False  # Nothing has changed in project.yml, don't need to update
        dvc_config_path.unlink()
    dvc_commands = []
    config_commands = {cmd["name"]: cmd for cmd in config.get("commands", [])}

    # some flags that apply to every command
    flags = []
    if verbose:
        flags.append("--verbose")
    if quiet:
        flags.append("--quiet")

    for name in workflows[workflow]:
        command = config_commands[name]
        deps = command.get("deps", [])
        outputs = command.get("outputs", [])
        outputs_no_cache = command.get("outputs_no_cache", [])
        if not deps and not outputs and not outputs_no_cache:
            continue
        # Default to the working dir as the project path since dvc.yaml is auto-generated
        # and we don't want arbitrary paths in there
        project_cmd = ["python", "-m", NAME, "project", "run", name]
        deps_cmd = [c for cl in [["-d", p] for p in deps] for c in cl]
        outputs_cmd = [c for cl in [["-o", p] for p in outputs] for c in cl]
        outputs_nc_cmd = [c for cl in [["-O", p] for p in outputs_no_cache] for c in cl]

        dvc_cmd = ["run", *flags, "-n", name, "-w", str(path), "--no-exec"]
        if command.get("no_skip"):
            dvc_cmd.append("--always-changed")
        full_cmd = [*dvc_cmd, *deps_cmd, *outputs_cmd, *outputs_nc_cmd, *project_cmd]
        dvc_commands.append(join_command(full_cmd))

    if not dvc_commands:
        # If we don't check for this, then there will be an error when reading the
        # config, since DVC wouldn't create it.
        msg.fail(
            "No usable commands for DVC found. This can happen if none of your "
            "commands have dependencies or outputs.",
            exits=1,
        )

    with working_dir(path):
        for c in dvc_commands:
            dvc_command = "dvc " + c
            run_command(dvc_command)
    with dvc_config_path.open("r+", encoding="utf8") as f:
        content = f.read()
        f.seek(0, 0)
        f.write(f"# {config_hash}\n{DVC_CONFIG_COMMENT}\n{content}")
    return True


def check_workflows(workflows: List[str], workflow: Optional[str] = None) -> None:
    """Validate workflows provided in project.yml and check that a given
    workflow can be used to generate a DVC config.

    workflows (List[str]): Names of the available workflows.
    workflow (Optional[str]): The name of the workflow to convert.
    """
    if not workflows:
        msg.fail(
            f"No workflows defined in {PROJECT_FILE}. To generate a DVC config, "
            f"define at least one list of commands.",
            exits=1,
        )
    if workflow is not None and workflow not in workflows:
        msg.fail(
            f"Workflow '{workflow}' not defined in {PROJECT_FILE}. "
            f"Available workflows: {', '.join(workflows)}",
            exits=1,
        )
    if not workflow:
        msg.warn(
            f"No workflow specified for DVC pipeline. Using the first workflow "
            f"defined in {PROJECT_FILE}: '{workflows[0]}'"
        )


def ensure_dvc(project_dir: Path) -> None:
    """Ensure that the "dvc" command is available and that the current project
    directory is an initialized DVC project.
    """
    try:
        subprocess.run(["dvc", "--version"], stdout=subprocess.DEVNULL)
    except Exception:
        msg.fail(
            "To use spaCy projects with DVC (Data Version Control), DVC needs "
            "to be installed and the 'dvc' command needs to be available",
            "You can install the Python package from pip (pip install dvc) or "
            "conda (conda install -c conda-forge dvc). For more details, see the "
            "documentation: https://dvc.org/doc/install",
            exits=1,
        )
    if not (project_dir / ".dvc").exists():
        msg.fail(
            "Project not initialized as a DVC project",
            "To initialize a DVC project, you can run 'dvc init' in the project "
            "directory. For more details, see the documentation: "
            "https://dvc.org/doc/command-reference/init",
            exits=1,
        )
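The staleness check in `update_dvc_config` works by writing the config hash as a `# `-prefixed first line of `dvc.yaml` and comparing it on the next run. A minimal sketch of that round trip — the helper names and the md5-based stand-in for `get_hash` are illustrative, not the real implementation:

```python
import hashlib
import tempfile
from pathlib import Path


def config_hash(config: dict) -> str:
    # Stand-in for get_hash(): a stable digest of the config dict.
    return hashlib.md5(repr(sorted(config.items())).encode("utf8")).hexdigest()


def needs_regen(path: Path, config: dict) -> bool:
    # Read back the "# <hash>" first line and compare it against the current
    # config hash, mirroring the check in update_dvc_config.
    if not path.exists():
        return True
    ref_hash = path.read_text(encoding="utf8").splitlines()[0].replace("# ", "")
    return ref_hash != config_hash(config)


cfg = {"commands": ["train"]}
path = Path(tempfile.mkstemp()[1])
path.write_text(f"# {config_hash(cfg)}\nstages: ...\n", encoding="utf8")
print(needs_regen(path, cfg), needs_regen(path, {"commands": ["evaluate"]}))
# → False True
```

This is why editing `project.yml` by hand triggers a regeneration, while re-running the command on an unchanged config is a no-op (unless `--force` is passed).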
@@ -1,67 +1 @@
from weasel.cli.pull import *
from pathlib import Path

from wasabi import msg

from .._util import Arg, load_project_config, logger, project_cli
from .remote_storage import RemoteStorage, get_command_hash
from .run import update_lockfile


@project_cli.command("pull")
def project_pull_cli(
    # fmt: off
    remote: str = Arg("default", help="Name or path of remote storage"),
    project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
    # fmt: on
):
    """Retrieve available precomputed outputs from a remote storage.
    You can alias remotes in your project.yml by mapping them to storage paths.
    A storage can be anything that the smart-open library can upload to, e.g.
    AWS, Google Cloud Storage, SSH, local directories etc.

    DOCS: https://spacy.io/api/cli#project-pull
    """
    for url, output_path in project_pull(project_dir, remote):
        if url is not None:
            msg.good(f"Pulled {output_path} from {url}")


def project_pull(project_dir: Path, remote: str, *, verbose: bool = False):
    # TODO: We don't have tests for this :(. It would take a bit of mockery to
    # set up. I guess see if it breaks first?
    config = load_project_config(project_dir)
    if remote in config.get("remotes", {}):
        remote = config["remotes"][remote]
    storage = RemoteStorage(project_dir, remote)
    commands = list(config.get("commands", []))
    # We use a while loop here because we don't know how the commands
    # will be ordered. A command might need dependencies from one that's later
    # in the list.
    while commands:
        for i, cmd in enumerate(list(commands)):
            logger.debug("CMD: %s.", cmd["name"])
            deps = [project_dir / dep for dep in cmd.get("deps", [])]
            if all(dep.exists() for dep in deps):
                cmd_hash = get_command_hash("", "", deps, cmd["script"])
                for output_path in cmd.get("outputs", []):
                    url = storage.pull(output_path, command_hash=cmd_hash)
                    logger.debug(
                        "URL: %s for %s with command hash %s",
                        url,
                        output_path,
                        cmd_hash,
                    )
                    yield url, output_path

                out_locs = [project_dir / out for out in cmd.get("outputs", [])]
                if all(loc.exists() for loc in out_locs):
                    update_lockfile(project_dir, cmd)
                # We remove the command from the list here, and break, so that
                # we iterate over the loop again.
                commands.pop(i)
                break
            else:
                logger.debug("Dependency missing. Skipping %s outputs.", cmd["name"])
        else:
            # If we didn't break the for loop, break the while loop.
            break
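The while/for pattern in `project_pull` — keep sweeping the command list, handling whatever has its dependencies satisfied, and stop when a full pass makes no progress — can be shown with toy commands whose outputs become later dependencies. All names here are hypothetical; the sketch only demonstrates the ordering logic, not the storage calls:

```python
def resolve_order(commands):
    # Each command: {"name", "deps", "outputs"}. Repeatedly scan the list and
    # run any command whose deps are all available, removing it and restarting
    # the scan; a pass with no progress ends the loop (the for/else + break
    # pattern used in project_pull).
    available = set()
    order = []
    pending = list(commands)
    while pending:
        for i, cmd in enumerate(pending):
            if all(dep in available for dep in cmd["deps"]):
                available.update(cmd["outputs"])
                order.append(cmd["name"])
                pending.pop(i)
                break
        else:
            break  # no runnable command left: stop sweeping
    return order


cmds = [
    {"name": "train", "deps": ["corpus"], "outputs": ["model"]},
    {"name": "corpus", "deps": [], "outputs": ["corpus"]},
    {"name": "package", "deps": ["model"], "outputs": ["pkg"]},
]
print(resolve_order(cmds))
# → ['corpus', 'train', 'package']
```

Note that `corpus` runs before `train` even though it appears later in the list — exactly the situation the comment in `project_pull` describes.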
@@ -1,69 +1 @@
from weasel.cli.push import *
from pathlib import Path

from wasabi import msg

from .._util import Arg, load_project_config, logger, project_cli
from .remote_storage import RemoteStorage, get_command_hash, get_content_hash


@project_cli.command("push")
def project_push_cli(
    # fmt: off
    remote: str = Arg("default", help="Name or path of remote storage"),
    project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
    # fmt: on
):
    """Persist outputs to a remote storage. You can alias remotes in your
    project.yml by mapping them to storage paths. A storage can be anything that
    the smart-open library can upload to, e.g. AWS, Google Cloud Storage, SSH,
    local directories etc.

    DOCS: https://spacy.io/api/cli#project-push
    """
    for output_path, url in project_push(project_dir, remote):
        if url is None:
            msg.info(f"Skipping {output_path}")
        else:
            msg.good(f"Pushed {output_path} to {url}")


def project_push(project_dir: Path, remote: str):
    """Persist outputs to a remote storage. You can alias remotes in your project.yml
    by mapping them to storage paths. A storage can be anything that the smart-open
    library can upload to, e.g. gcs, aws, ssh, local directories etc
    """
    config = load_project_config(project_dir)
    if remote in config.get("remotes", {}):
        remote = config["remotes"][remote]
    storage = RemoteStorage(project_dir, remote)
    for cmd in config.get("commands", []):
        logger.debug("CMD: %s", cmd["name"])
        deps = [project_dir / dep for dep in cmd.get("deps", [])]
        if any(not dep.exists() for dep in deps):
            logger.debug("Dependency missing. Skipping %s outputs", cmd["name"])
            continue
        cmd_hash = get_command_hash(
            "", "", [project_dir / dep for dep in cmd.get("deps", [])], cmd["script"]
        )
        logger.debug("CMD_HASH: %s", cmd_hash)
        for output_path in cmd.get("outputs", []):
            output_loc = project_dir / output_path
            if output_loc.exists() and _is_not_empty_dir(output_loc):
                url = storage.push(
                    output_path,
                    command_hash=cmd_hash,
                    content_hash=get_content_hash(output_loc),
                )
                logger.debug(
                    "URL: %s for output %s with cmd_hash %s", url, output_path, cmd_hash
                )
                yield output_path, url


def _is_not_empty_dir(loc: Path):
    if not loc.is_dir():
        return True
    elif any(_is_not_empty_dir(child) for child in loc.iterdir()):
        return True
    else:
        return False
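`_is_not_empty_dir` treats any file as pushable and skips directories that contain nothing but empty directories. A quick check of that behavior against a throwaway tree (the paths are temporary, created only for the demo):

```python
import tempfile
from pathlib import Path


def is_not_empty_dir(loc: Path) -> bool:
    # Same recursion as _is_not_empty_dir in push.py: a file counts as
    # non-empty; a directory is non-empty only if some descendant is a file.
    if not loc.is_dir():
        return True
    return any(is_not_empty_dir(child) for child in loc.iterdir())


root = Path(tempfile.mkdtemp())
(root / "empty" / "nested").mkdir(parents=True)
(root / "data").mkdir()
(root / "data" / "file.txt").write_text("x")
print(is_not_empty_dir(root / "empty"), is_not_empty_dir(root / "data"))
# → False True
```

The guard matters because pushing an empty directory would upload a useless archive and pollute the remote cache with entries that can never satisfy a pull.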
@@ -1,212 +1 @@
from weasel.cli.remote_storage import *
import hashlib
import os
import site
import tarfile
import urllib.parse
from pathlib import Path
from typing import TYPE_CHECKING, Dict, List, Optional

from wasabi import msg

from ... import about
from ...errors import Errors
from ...git_info import GIT_VERSION
from ...util import ENV_VARS, check_bool_env_var, get_minor_version
from .._util import (
    download_file,
    ensure_pathy,
    get_checksum,
    get_hash,
    make_tempdir,
    upload_file,
)

if TYPE_CHECKING:
    from pathy import FluidPath  # noqa: F401


class RemoteStorage:
    """Push and pull outputs to and from a remote file storage.

    Remotes can be anything that `smart-open` can support: AWS, GCS, file system,
    ssh, etc.
    """

    def __init__(self, project_root: Path, url: str, *, compression="gz"):
        self.root = project_root
        self.url = ensure_pathy(url)
        self.compression = compression

    def push(self, path: Path, command_hash: str, content_hash: str) -> "FluidPath":
        """Compress a file or directory within a project and upload it to a remote
        storage. If an object exists at the full URL, nothing is done.

        Within the remote storage, files are addressed by their project path
        (url encoded) and two user-supplied hashes, representing their creation
        context and their file contents. If the URL already exists, the data is
        not uploaded. Paths are archived and compressed prior to upload.
        """
        loc = self.root / path
        if not loc.exists():
            raise IOError(f"Cannot push {loc}: does not exist.")
        url = self.make_url(path, command_hash, content_hash)
        if url.exists():
            return url
        tmp: Path
        with make_tempdir() as tmp:
            tar_loc = tmp / self.encode_name(str(path))
            mode_string = f"w:{self.compression}" if self.compression else "w"
            with tarfile.open(tar_loc, mode=mode_string) as tar_file:
                tar_file.add(str(loc), arcname=str(path))
            upload_file(tar_loc, url)
        return url

    def pull(
        self,
        path: Path,
        *,
        command_hash: Optional[str] = None,
        content_hash: Optional[str] = None,
    ) -> Optional["FluidPath"]:
        """Retrieve a file from the remote cache. If the file already exists,
        nothing is done.

        If the command_hash and/or content_hash are specified, only matching
        results are returned. If no results are available, an error is raised.
        """
        dest = self.root / path
        if dest.exists():
            return None
        url = self.find(path, command_hash=command_hash, content_hash=content_hash)
        if url is None:
            return url
        else:
            # Make sure the destination exists
            if not dest.parent.exists():
                dest.parent.mkdir(parents=True)
            tmp: Path
            with make_tempdir() as tmp:
                tar_loc = tmp / url.parts[-1]
                download_file(url, tar_loc)
                mode_string = f"r:{self.compression}" if self.compression else "r"
                with tarfile.open(tar_loc, mode=mode_string) as tar_file:
                    # This requires that the path is added correctly, relative
                    # to root. This is how we set things up in push()

                    # Disallow paths outside the current directory for the tar
                    # file (CVE-2007-4559, directory traversal vulnerability)
                    def is_within_directory(directory, target):
                        abs_directory = os.path.abspath(directory)
                        abs_target = os.path.abspath(target)
                        prefix = os.path.commonprefix([abs_directory, abs_target])
                        return prefix == abs_directory

                    def safe_extract(tar, path):
                        for member in tar.getmembers():
                            member_path = os.path.join(path, member.name)
                            if not is_within_directory(path, member_path):
                                raise ValueError(Errors.E852)
                        tar.extractall(path)

                    safe_extract(tar_file, self.root)
        return url

    def find(
        self,
        path: Path,
        *,
        command_hash: Optional[str] = None,
        content_hash: Optional[str] = None,
    ) -> Optional["FluidPath"]:
        """Find the best matching version of a file within the storage,
        or `None` if no match can be found. If both the creation and content hash
        are specified, only exact matches will be returned. Otherwise, the most
        recent matching file is preferred.
        """
        name = self.encode_name(str(path))
        urls = []
        if command_hash is not None and content_hash is not None:
            url = self.url / name / command_hash / content_hash
            urls = [url] if url.exists() else []
        elif command_hash is not None:
            if (self.url / name / command_hash).exists():
                urls = list((self.url / name / command_hash).iterdir())
        else:
            if (self.url / name).exists():
                for sub_dir in (self.url / name).iterdir():
                    urls.extend(sub_dir.iterdir())
            if content_hash is not None:
                urls = [url for url in urls if url.parts[-1] == content_hash]
        if len(urls) >= 2:
            try:
                urls.sort(key=lambda x: x.stat().last_modified)  # type: ignore
            except Exception:
                msg.warn(
                    "Unable to sort remote files by last modified. The file(s) "
                    "pulled from the cache may not be the most recent."
                )
        return urls[-1] if urls else None

    def make_url(self, path: Path, command_hash: str, content_hash: str) -> "FluidPath":
|
|
||||||
"""Construct a URL from a subpath, a creation hash and a content hash."""
|
|
||||||
return self.url / self.encode_name(str(path)) / command_hash / content_hash
|
|
||||||
|
|
||||||
def encode_name(self, name: str) -> str:
|
|
||||||
"""Encode a subpath into a URL-safe name."""
|
|
||||||
return urllib.parse.quote_plus(name)
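The `encode_name` method above leans on `urllib.parse.quote_plus`, which percent-encodes path separators so an entire subpath collapses into a single URL path component under the remote cache root. A minimal standalone sketch of that behaviour (outside the class, for illustration only):

```python
import urllib.parse


# Standalone sketch of encode_name() above: quote_plus() percent-encodes "/"
# (and turns spaces into "+"), so a nested subpath like "training/model-best"
# maps to one flat, URL-safe component in the remote storage layout.
def encode_name(name: str) -> str:
    return urllib.parse.quote_plus(name)


print(encode_name("training/model-best"))  # training%2Fmodel-best
print(encode_name("my model"))  # my+model
```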


def get_content_hash(loc: Path) -> str:
    return get_checksum(loc)


def get_command_hash(
    site_hash: str, env_hash: str, deps: List[Path], cmd: List[str]
) -> str:
    """Create a hash representing the execution of a command. This includes the
    currently installed packages, whatever environment variables have been marked
    as relevant, and the command.
    """
    if check_bool_env_var(ENV_VARS.PROJECT_USE_GIT_VERSION):
        spacy_v = GIT_VERSION
    else:
        spacy_v = str(get_minor_version(about.__version__) or "")
    dep_checksums = [get_checksum(dep) for dep in sorted(deps)]
    hashes = [spacy_v, site_hash, env_hash] + dep_checksums
    hashes.extend(cmd)
    creation_bytes = "".join(hashes).encode("utf8")
    return hashlib.md5(creation_bytes).hexdigest()


def get_site_hash():
    """Hash the current Python environment's site-packages contents, including
    the name and version of the libraries. The list we're hashing is what
    `pip freeze` would output.
    """
    site_dirs = site.getsitepackages()
    if site.ENABLE_USER_SITE:
        # getusersitepackages() returns a single string, so append, not extend
        site_dirs.append(site.getusersitepackages())
    packages = set()
    for site_dir in site_dirs:
        site_dir = Path(site_dir)
        for subpath in site_dir.iterdir():
            if subpath.parts[-1].endswith("dist-info"):
                packages.add(subpath.parts[-1].replace(".dist-info", ""))
    package_bytes = "".join(sorted(packages)).encode("utf8")
    return hashlib.md5(package_bytes).hexdigest()


def get_env_hash(env: Dict[str, str]) -> str:
    """Construct a hash of the environment variables that will be passed into
    the commands.

    Values in the env dict may be references to the current os.environ, using
    the syntax $ENV_VAR to mean os.environ[ENV_VAR]
    """
    env_vars = {}
    for key, value in env.items():
        if value.startswith("$"):
            env_vars[key] = os.environ.get(value[1:], "")
        else:
            env_vars[key] = value
    return get_hash(env_vars)
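The `$ENV_VAR` substitution described in `get_env_hash()` above can be sketched on its own: values starting with `$` are resolved from `os.environ` (falling back to an empty string if unset) before the dict is hashed. A minimal illustration, with hypothetical variable names:

```python
import os


# Minimal sketch of the "$ENV_VAR" resolution step in get_env_hash() above.
# Only the substitution is shown here; the subsequent hashing is omitted.
def resolve_env(env):
    out = {}
    for key, value in env.items():
        if value.startswith("$"):
            out[key] = os.environ.get(value[1:], "")
        else:
            out[key] = value
    return out


os.environ["DEMO_VAR"] = "hello"  # hypothetical variable for the demo
print(resolve_env({"a": "$DEMO_VAR", "b": "literal"}))  # {'a': 'hello', 'b': 'literal'}
```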
@@ -1,379 +1 @@
-import os.path
+from weasel.cli.run import *
-import sys
-from pathlib import Path
-from typing import Any, Dict, Iterable, List, Optional, Sequence, Tuple
-
-import srsly
-import typer
-from wasabi import msg
-from wasabi.util import locale_escape
-
-from ... import about
-from ...git_info import GIT_VERSION
-from ...util import (
-    ENV_VARS,
-    SimpleFrozenDict,
-    SimpleFrozenList,
-    check_bool_env_var,
-    is_cwd,
-    is_minor_version_match,
-    join_command,
-    run_command,
-    split_command,
-    working_dir,
-)
-from .._util import (
-    COMMAND,
-    PROJECT_FILE,
-    PROJECT_LOCK,
-    Arg,
-    Opt,
-    get_checksum,
-    get_hash,
-    load_project_config,
-    parse_config_overrides,
-    project_cli,
-)
-
-
-@project_cli.command(
-    "run", context_settings={"allow_extra_args": True, "ignore_unknown_options": True}
-)
-def project_run_cli(
-    # fmt: off
-    ctx: typer.Context,  # This is only used to read additional arguments
-    subcommand: str = Arg(None, help=f"Name of command defined in the {PROJECT_FILE}"),
-    project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False),
-    force: bool = Opt(False, "--force", "-F", help="Force re-running steps, even if nothing changed"),
-    dry: bool = Opt(False, "--dry", "-D", help="Perform a dry run and don't execute scripts"),
-    show_help: bool = Opt(False, "--help", help="Show help message and available subcommands")
-    # fmt: on
-):
-    """Run a named command or workflow defined in the project.yml. If a workflow
-    name is specified, all commands in the workflow are run, in order. If
-    commands define dependencies and/or outputs, they will only be re-run if
-    state has changed.
-
-    DOCS: https://spacy.io/api/cli#project-run
-    """
-    if show_help or not subcommand:
-        print_run_help(project_dir, subcommand)
-    else:
-        overrides = parse_config_overrides(ctx.args)
-        project_run(project_dir, subcommand, overrides=overrides, force=force, dry=dry)
-
-
-def project_run(
-    project_dir: Path,
-    subcommand: str,
-    *,
-    overrides: Dict[str, Any] = SimpleFrozenDict(),
-    force: bool = False,
-    dry: bool = False,
-    capture: bool = False,
-    skip_requirements_check: bool = False,
-) -> None:
-    """Run a named script defined in the project.yml. If the script is part
-    of the default pipeline (defined in the "run" section), DVC is used to
-    execute the command, so it can determine whether to rerun it. It then
-    calls into "exec" to execute it.
-
-    project_dir (Path): Path to project directory.
-    subcommand (str): Name of command to run.
-    overrides (Dict[str, Any]): Optional config overrides.
-    force (bool): Force re-running, even if nothing changed.
-    dry (bool): Perform a dry run and don't execute commands.
-    capture (bool): Whether to capture the output and errors of individual commands.
-        If False, the stdout and stderr will not be redirected, and if there's an error,
-        sys.exit will be called with the return code. You should use capture=False
-        when you want to turn over execution to the command, and capture=True
-        when you want to run the command more like a function.
-    skip_requirements_check (bool): Whether to skip the requirements check.
-    """
-    config = load_project_config(project_dir, overrides=overrides)
-    commands = {cmd["name"]: cmd for cmd in config.get("commands", [])}
-    workflows = config.get("workflows", {})
-    validate_subcommand(list(commands.keys()), list(workflows.keys()), subcommand)
-
-    req_path = project_dir / "requirements.txt"
-    if not skip_requirements_check:
-        if config.get("check_requirements", True) and os.path.exists(req_path):
-            with req_path.open() as requirements_file:
-                _check_requirements([req.strip() for req in requirements_file])
-
-    if subcommand in workflows:
-        msg.info(f"Running workflow '{subcommand}'")
-        for cmd in workflows[subcommand]:
-            project_run(
-                project_dir,
-                cmd,
-                overrides=overrides,
-                force=force,
-                dry=dry,
-                capture=capture,
-                skip_requirements_check=True,
-            )
-    else:
-        cmd = commands[subcommand]
-        for dep in cmd.get("deps", []):
-            if not (project_dir / dep).exists():
-                err = f"Missing dependency specified by command '{subcommand}': {dep}"
-                err_help = "Maybe you forgot to run the 'project assets' command or a previous step?"
-                err_exits = 1 if not dry else None
-                msg.fail(err, err_help, exits=err_exits)
-        check_spacy_commit = check_bool_env_var(ENV_VARS.PROJECT_USE_GIT_VERSION)
-        with working_dir(project_dir) as current_dir:
-            msg.divider(subcommand)
-            rerun = check_rerun(current_dir, cmd, check_spacy_commit=check_spacy_commit)
-            if not rerun and not force:
-                msg.info(f"Skipping '{cmd['name']}': nothing changed")
-            else:
-                run_commands(cmd["script"], dry=dry, capture=capture)
-                if not dry:
-                    update_lockfile(current_dir, cmd)
-
-
-def print_run_help(project_dir: Path, subcommand: Optional[str] = None) -> None:
-    """Simulate a CLI help prompt using the info available in the project.yml.
-
-    project_dir (Path): The project directory.
-    subcommand (Optional[str]): The subcommand or None. If a subcommand is
-        provided, the subcommand help is shown. Otherwise, the top-level help
-        and a list of available commands is printed.
-    """
-    config = load_project_config(project_dir)
-    config_commands = config.get("commands", [])
-    commands = {cmd["name"]: cmd for cmd in config_commands}
-    workflows = config.get("workflows", {})
-    project_loc = "" if is_cwd(project_dir) else project_dir
-    if subcommand:
-        validate_subcommand(list(commands.keys()), list(workflows.keys()), subcommand)
-        print(f"Usage: {COMMAND} project run {subcommand} {project_loc}")
-        if subcommand in commands:
-            help_text = commands[subcommand].get("help")
-            if help_text:
-                print(f"\n{help_text}\n")
-        elif subcommand in workflows:
-            steps = workflows[subcommand]
-            print(f"\nWorkflow consisting of {len(steps)} commands:")
-            steps_data = [
-                (f"{i + 1}. {step}", commands[step].get("help", ""))
-                for i, step in enumerate(steps)
-            ]
-            msg.table(steps_data)
-            help_cmd = f"{COMMAND} project run [COMMAND] {project_loc} --help"
-            print(f"For command details, run: {help_cmd}")
-    else:
-        print("")
-        title = config.get("title")
-        if title:
-            print(f"{locale_escape(title)}\n")
-        if config_commands:
-            print(f"Available commands in {PROJECT_FILE}")
-            print(f"Usage: {COMMAND} project run [COMMAND] {project_loc}")
-            msg.table([(cmd["name"], cmd.get("help", "")) for cmd in config_commands])
-        if workflows:
-            print(f"Available workflows in {PROJECT_FILE}")
-            print(f"Usage: {COMMAND} project run [WORKFLOW] {project_loc}")
-            msg.table([(name, " -> ".join(steps)) for name, steps in workflows.items()])
-
-
-def run_commands(
-    commands: Iterable[str] = SimpleFrozenList(),
-    silent: bool = False,
-    dry: bool = False,
-    capture: bool = False,
-) -> None:
-    """Run a sequence of commands in a subprocess, in order.
-
-    commands (List[str]): The string commands.
-    silent (bool): Don't print the commands.
-    dry (bool): Perform a dry run and don't execute anything.
-    capture (bool): Whether to capture the output and errors of individual commands.
-        If False, the stdout and stderr will not be redirected, and if there's an error,
-        sys.exit will be called with the return code. You should use capture=False
-        when you want to turn over execution to the command, and capture=True
-        when you want to run the command more like a function.
-    """
-    for c in commands:
-        command = split_command(c)
-        # Not sure if this is needed or a good idea. Motivation: users may often
-        # use commands in their config that reference "python" and we want to
-        # make sure that it's always executing the same Python that spaCy is
-        # executed with and the pip in the same env, not some other Python/pip.
-        # Also ensures cross-compatibility if user 1 writes "python3" (because
-        # that's how it's set up on their system), and user 2 without the
-        # shortcut tries to re-run the command.
-        if len(command) and command[0] in ("python", "python3"):
-            command[0] = sys.executable
-        elif len(command) and command[0] in ("pip", "pip3"):
-            command = [sys.executable, "-m", "pip", *command[1:]]
-        if not silent:
-            print(f"Running command: {join_command(command)}")
-        if not dry:
-            run_command(command, capture=capture)
-
-
-def validate_subcommand(
-    commands: Sequence[str], workflows: Sequence[str], subcommand: str
-) -> None:
-    """Check that a subcommand is valid and defined. Raises an error otherwise.
-
-    commands (Sequence[str]): The available commands.
-    subcommand (str): The subcommand.
-    """
-    if not commands and not workflows:
-        msg.fail(f"No commands or workflows defined in {PROJECT_FILE}", exits=1)
-    if subcommand not in commands and subcommand not in workflows:
-        help_msg = []
-        if subcommand in ["assets", "asset"]:
-            help_msg.append("Did you mean to run: python -m spacy project assets?")
-        if commands:
-            help_msg.append(f"Available commands: {', '.join(commands)}")
-        if workflows:
-            help_msg.append(f"Available workflows: {', '.join(workflows)}")
-        msg.fail(
-            f"Can't find command or workflow '{subcommand}' in {PROJECT_FILE}",
-            ". ".join(help_msg),
-            exits=1,
-        )
-
-
-def check_rerun(
-    project_dir: Path,
-    command: Dict[str, Any],
-    *,
-    check_spacy_version: bool = True,
-    check_spacy_commit: bool = False,
-) -> bool:
-    """Check if a command should be rerun because its settings or inputs/outputs
-    changed.
-
-    project_dir (Path): The current project directory.
-    command (Dict[str, Any]): The command, as defined in the project.yml.
-    strict_version (bool):
-    RETURNS (bool): Whether to re-run the command.
-    """
-    # Always rerun if no-skip is set
-    if command.get("no_skip", False):
-        return True
-    lock_path = project_dir / PROJECT_LOCK
-    if not lock_path.exists():  # We don't have a lockfile, run command
-        return True
-    data = srsly.read_yaml(lock_path)
-    if command["name"] not in data:  # We don't have info about this command
-        return True
-    entry = data[command["name"]]
-    # Always run commands with no outputs (otherwise they'd always be skipped)
-    if not entry.get("outs", []):
-        return True
-    # Always rerun if spaCy version or commit hash changed
-    spacy_v = entry.get("spacy_version")
-    commit = entry.get("spacy_git_version")
-    if check_spacy_version and not is_minor_version_match(spacy_v, about.__version__):
-        info = f"({spacy_v} in {PROJECT_LOCK}, {about.__version__} current)"
-        msg.info(f"Re-running '{command['name']}': spaCy minor version changed {info}")
-        return True
-    if check_spacy_commit and commit != GIT_VERSION:
-        info = f"({commit} in {PROJECT_LOCK}, {GIT_VERSION} current)"
-        msg.info(f"Re-running '{command['name']}': spaCy commit changed {info}")
-        return True
-    # If the entry in the lockfile matches the lockfile entry that would be
-    # generated from the current command, we don't rerun because it means that
-    # all inputs/outputs, hashes and scripts are the same and nothing changed
-    lock_entry = get_lock_entry(project_dir, command)
-    exclude = ["spacy_version", "spacy_git_version"]
-    return get_hash(lock_entry, exclude=exclude) != get_hash(entry, exclude=exclude)
-
-
-def update_lockfile(project_dir: Path, command: Dict[str, Any]) -> None:
-    """Update the lockfile after running a command. Will create a lockfile if
-    it doesn't yet exist and will add an entry for the current command, its
-    script and dependencies/outputs.
-
-    project_dir (Path): The current project directory.
-    command (Dict[str, Any]): The command, as defined in the project.yml.
-    """
-    lock_path = project_dir / PROJECT_LOCK
-    if not lock_path.exists():
-        srsly.write_yaml(lock_path, {})
-        data = {}
-    else:
-        data = srsly.read_yaml(lock_path)
-    data[command["name"]] = get_lock_entry(project_dir, command)
-    srsly.write_yaml(lock_path, data)
-
-
-def get_lock_entry(project_dir: Path, command: Dict[str, Any]) -> Dict[str, Any]:
-    """Get a lockfile entry for a given command. An entry includes the command,
-    the script (command steps) and a list of dependencies and outputs with
-    their paths and file hashes, if available. The format is based on the
-    dvc.lock files, to keep things consistent.
-
-    project_dir (Path): The current project directory.
-    command (Dict[str, Any]): The command, as defined in the project.yml.
-    RETURNS (Dict[str, Any]): The lockfile entry.
-    """
-    deps = get_fileinfo(project_dir, command.get("deps", []))
-    outs = get_fileinfo(project_dir, command.get("outputs", []))
-    outs_nc = get_fileinfo(project_dir, command.get("outputs_no_cache", []))
-    return {
-        "cmd": f"{COMMAND} run {command['name']}",
-        "script": command["script"],
-        "deps": deps,
-        "outs": [*outs, *outs_nc],
-        "spacy_version": about.__version__,
-        "spacy_git_version": GIT_VERSION,
-    }
-
-
-def get_fileinfo(project_dir: Path, paths: List[str]) -> List[Dict[str, Optional[str]]]:
-    """Generate the file information for a list of paths (dependencies, outputs).
-    Includes the file path and the file's checksum.
-
-    project_dir (Path): The current project directory.
-    paths (List[str]): The file paths.
-    RETURNS (List[Dict[str, str]]): The lockfile entry for a file.
-    """
-    data = []
-    for path in paths:
-        file_path = project_dir / path
-        md5 = get_checksum(file_path) if file_path.exists() else None
-        data.append({"path": path, "md5": md5})
-    return data
-
-
-def _check_requirements(requirements: List[str]) -> Tuple[bool, bool]:
-    """Checks whether requirements are installed and free of version conflicts.
-    requirements (List[str]): List of requirements.
-    RETURNS (Tuple[bool, bool]): Whether (1) any packages couldn't be imported,
-        (2) any packages with version conflicts exist.
-    """
-    import pkg_resources
-
-    failed_pkgs_msgs: List[str] = []
-    conflicting_pkgs_msgs: List[str] = []
-
-    for req in requirements:
-        try:
-            pkg_resources.require(req)
-        except pkg_resources.DistributionNotFound as dnf:
-            failed_pkgs_msgs.append(dnf.report())
-        except pkg_resources.VersionConflict as vc:
-            conflicting_pkgs_msgs.append(vc.report())
-        except Exception:
-            msg.warn(
-                f"Unable to check requirement: {req} "
-                "Checks are currently limited to requirement specifiers "
-                "(PEP 508)"
-            )
-
-    if len(failed_pkgs_msgs) or len(conflicting_pkgs_msgs):
-        msg.warn(
-            title="Missing requirements or requirement conflicts detected. Make sure your Python environment is set up "
-            "correctly and you installed all requirements specified in your project's requirements.txt: "
-        )
-        for pgk_msg in failed_pkgs_msgs + conflicting_pkgs_msgs:
-            msg.text(pgk_msg)
-
-    return len(failed_pkgs_msgs) > 0, len(conflicting_pkgs_msgs) > 0
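The removed `_check_requirements()` above uses `pkg_resources` to validate PEP 508 requirement specifiers. A rough stdlib analogue of its missing-package half (an illustrative sketch, not the project's code, and without the version-conflict handling) can be written with `importlib.metadata`:

```python
from importlib.metadata import PackageNotFoundError, version


# Sketch of the missing-package check: a distribution that cannot be resolved
# is collected, mirroring how pkg_resources.DistributionNotFound is handled
# in _check_requirements() above. Version-conflict detection is omitted.
def find_missing(names):
    missing = []
    for name in names:
        try:
            version(name)
        except PackageNotFoundError:
            missing.append(name)
    return missing


print(find_missing(["definitely-not-a-real-package-xyz"]))
```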
@@ -90,11 +90,12 @@ grad_factor = 1.0
factory = "parser"

[components.parser.model]
-@architectures = "spacy.TransitionBasedParser.v3"
+@architectures = "spacy.TransitionBasedParser.v2"
state_type = "parser"
extra_state_tokens = false
hidden_width = 128
maxout_pieces = 3
+use_upper = false
nO = null

[components.parser.model.tok2vec]

@@ -110,11 +111,12 @@ grad_factor = 1.0
factory = "ner"

[components.ner.model]
-@architectures = "spacy.TransitionBasedParser.v3"
+@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
+use_upper = false
nO = null

[components.ner.model.tok2vec]

@@ -269,8 +271,9 @@ grad_factor = 1.0
@layers = "reduce_mean.v1"

[components.textcat.model.linear_model]
-@architectures = "spacy.TextCatBOW.v2"
+@architectures = "spacy.TextCatBOW.v3"
exclusive_classes = true
+length = 262144
ngram_size = 1
no_output_layer = false

@@ -306,8 +309,9 @@ grad_factor = 1.0
@layers = "reduce_mean.v1"

[components.textcat_multilabel.model.linear_model]
-@architectures = "spacy.TextCatBOW.v2"
+@architectures = "spacy.TextCatBOW.v3"
exclusive_classes = false
+length = 262144
ngram_size = 1
no_output_layer = false

@@ -383,11 +387,12 @@ width = ${components.tok2vec.model.encode.width}
factory = "parser"

[components.parser.model]
-@architectures = "spacy.TransitionBasedParser.v3"
+@architectures = "spacy.TransitionBasedParser.v2"
state_type = "parser"
extra_state_tokens = false
hidden_width = 128
maxout_pieces = 3
+use_upper = true
nO = null

[components.parser.model.tok2vec]

@@ -400,11 +405,12 @@ width = ${components.tok2vec.model.encode.width}
factory = "ner"

[components.ner.model]
-@architectures = "spacy.TransitionBasedParser.v3"
+@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
+use_upper = true
nO = null

[components.ner.model.tok2vec]

@@ -538,14 +544,15 @@ nO = null
width = ${components.tok2vec.model.encode.width}

[components.textcat.model.linear_model]
-@architectures = "spacy.TextCatBOW.v2"
+@architectures = "spacy.TextCatBOW.v3"
exclusive_classes = true
+length = 262144
ngram_size = 1
no_output_layer = false

{% else -%}
[components.textcat.model]
-@architectures = "spacy.TextCatBOW.v2"
+@architectures = "spacy.TextCatBOW.v3"
exclusive_classes = true
ngram_size = 1
no_output_layer = false

@@ -566,15 +573,17 @@ nO = null
width = ${components.tok2vec.model.encode.width}

[components.textcat_multilabel.model.linear_model]
-@architectures = "spacy.TextCatBOW.v2"
+@architectures = "spacy.TextCatBOW.v3"
exclusive_classes = false
+length = 262144
ngram_size = 1
no_output_layer = false

{% else -%}
[components.textcat_multilabel.model]
-@architectures = "spacy.TextCatBOW.v2"
+@architectures = "spacy.TextCatBOW.v3"
exclusive_classes = false
+length = 262144
ngram_size = 1
no_output_layer = false
{%- endif %}
@@ -13,7 +13,7 @@ from ._util import (
    Arg,
    Opt,
    app,
-    import_code,
+    import_code_paths,
    parse_config_overrides,
    setup_gpu,
    show_validation_error,

@@ -28,7 +28,7 @@ def train_cli(
    ctx: typer.Context,  # This is only used to read additional arguments
    config_path: Path = Arg(..., help="Path to config file", exists=True, allow_dash=True),
    output_path: Optional[Path] = Opt(None, "--output", "--output-path", "-o", help="Output directory to store trained pipeline in"),
-    code_path: Optional[Path] = Opt(None, "--code", "-c", help="Path to Python file with additional code (registered functions) to be imported"),
+    code_path: str = Opt("", "--code", "-c", help="Comma-separated paths to Python files with additional code (registered functions) to be imported"),
    verbose: bool = Opt(False, "--verbose", "-V", "-VV", help="Display more information for debugging purposes"),
    use_gpu: int = Opt(-1, "--gpu-id", "-g", help="GPU ID or -1 for CPU")
    # fmt: on

@@ -47,9 +47,10 @@ def train_cli(

    DOCS: https://spacy.io/api/cli#train
    """
-    util.logger.setLevel(logging.DEBUG if verbose else logging.INFO)
+    if verbose:
+        util.logger.setLevel(logging.DEBUG)
    overrides = parse_config_overrides(ctx.args)
-    import_code(code_path)
+    import_code_paths(code_path)
    train(config_path, output_path, use_gpu=use_gpu, overrides=overrides)
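In the hunk above, `--code` changes from a single optional path to a comma-separated string handled by `import_code_paths`. A minimal sketch of how such a value might be split before each file is imported (the helper name `split_code_paths` is hypothetical, not spaCy's API):

```python
# Hypothetical helper mirroring the comma-separated --code handling above:
# split on commas and drop empty entries, so "a.py, b.py" yields two paths
# and the new default of "" yields no paths at all.
def split_code_paths(value: str):
    return [p.strip() for p in value.split(",") if p.strip()]


print(split_code_paths("functions.py, extra/components.py"))  # ['functions.py', 'extra/components.py']
print(split_code_paths(""))  # []
```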
@@ -26,6 +26,9 @@ batch_size = 1000
[nlp.tokenizer]
@tokenizers = "spacy.Tokenizer.v1"

+[nlp.vectors]
+@vectors = "spacy.Vectors.v1"
+
# The pipeline components and their models
[components]
@@ -142,7 +142,25 @@ class SpanRenderer:
         spans (list): Individual entity spans and their start, end, label, kb_id and kb_url.
         title (str / None): Document title set in Doc.user_data['title'].
         """
-        per_token_info = []
+        per_token_info = self._assemble_per_token_info(tokens, spans)
+        markup = self._render_markup(per_token_info)
+        markup = TPL_SPANS.format(content=markup, dir=self.direction)
+        if title:
+            markup = TPL_TITLE.format(title=title) + markup
+        return markup
+
+    @staticmethod
+    def _assemble_per_token_info(
+        tokens: List[str], spans: List[Dict[str, Any]]
+    ) -> List[Dict[str, List[Dict[str, Any]]]]:
+        """Assembles token info used to generate markup in render_spans().
+        tokens (List[str]): Tokens in text.
+        spans (List[Dict[str, Any]]): Spans in text.
+        RETURNS (List[Dict[str, List[Dict, str, Any]]]): Per token info needed to render HTML markup for given tokens
+            and spans.
+        """
+        per_token_info: List[Dict[str, List[Dict[str, Any]]]] = []
+
         # we must sort so that we can correctly describe when spans need to "stack"
         # which is determined by their start token, then span length (longer spans on top),
         # then break any remaining ties with the span label
@@ -154,21 +172,22 @@ class SpanRenderer:
                 s["label"],
             ),
         )
 
         for s in spans:
            # this is the vertical 'slot' that the span will be rendered in
            # vertical_position = span_label_offset + (offset_step * (slot - 1))
             s["render_slot"] = 0
 
         for idx, token in enumerate(tokens):
             # Identify if a token belongs to a Span (and which) and if it's a
             # start token of said Span. We'll use this for the final HTML render
             token_markup: Dict[str, Any] = {}
             token_markup["text"] = token
-            concurrent_spans = 0
+            intersecting_spans: List[Dict[str, Any]] = []
             entities = []
             for span in spans:
                 ent = {}
                 if span["start_token"] <= idx < span["end_token"]:
-                    concurrent_spans += 1
                     span_start = idx == span["start_token"]
                     ent["label"] = span["label"]
                     ent["is_start"] = span_start
@@ -176,7 +195,12 @@ class SpanRenderer:
                         # When the span starts, we need to know how many other
                         # spans are on the 'span stack' and will be rendered.
                         # This value becomes the vertical render slot for this entire span
-                        span["render_slot"] = concurrent_spans
+                        span["render_slot"] = (
+                            intersecting_spans[-1]["render_slot"]
+                            if len(intersecting_spans)
+                            else 0
+                        ) + 1
+                    intersecting_spans.append(span)
                     ent["render_slot"] = span["render_slot"]
                     kb_id = span.get("kb_id", "")
                     kb_url = span.get("kb_url", "#")
@@ -193,11 +217,8 @@ class SpanRenderer:
                         span["render_slot"] = 0
             token_markup["entities"] = entities
             per_token_info.append(token_markup)
-        markup = self._render_markup(per_token_info)
-        markup = TPL_SPANS.format(content=markup, dir=self.direction)
-        if title:
-            markup = TPL_TITLE.format(title=title) + markup
-        return markup
+
+        return per_token_info
 
     def _render_markup(self, per_token_info: List[Dict[str, Any]]) -> str:
         """Render the markup from per-token information"""
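The `SpanRenderer` changes above replace a simple counter with an `intersecting_spans` stack: when a span starts, its vertical "render slot" is one above the slot of the deepest span already covering that token. A minimal standalone sketch of that slot assignment (the function name and example spans here are illustrative, not part of the diff):

```python
from typing import Any, Dict, List


def assign_render_slots(spans: List[Dict[str, Any]], n_tokens: int) -> Dict[str, int]:
    """Assign stacking slots the way SpanRenderer sorts and walks spans."""
    # Sort by start token, then longer spans first, then label.
    ordered = sorted(
        spans,
        key=lambda s: (
            s["start_token"],
            s["start_token"] - s["end_token"],
            s["label"],
        ),
    )
    slots: Dict[str, int] = {}
    for idx in range(n_tokens):
        intersecting: List[Dict[str, Any]] = []
        for span in ordered:
            if span["start_token"] <= idx < span["end_token"]:
                if idx == span["start_token"]:
                    # One above the deepest span already covering this token.
                    span["render_slot"] = (
                        intersecting[-1]["render_slot"] if intersecting else 0
                    ) + 1
                    slots[span["label"]] = span["render_slot"]
                intersecting.append(span)
    return slots


spans = [
    {"start_token": 0, "end_token": 3, "label": "ORG"},
    {"start_token": 1, "end_token": 2, "label": "GPE"},
    {"start_token": 4, "end_token": 5, "label": "PERSON"},
]
print(assign_render_slots(spans, 5))  # → {'ORG': 1, 'GPE': 2, 'PERSON': 1}
```

The nested `GPE` span stacks on top of `ORG` (slot 2), while the non-overlapping `PERSON` span restarts at slot 1.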
@@ -313,6 +334,8 @@ class DependencyRenderer:
                 self.lang = settings.get("lang", DEFAULT_LANG)
             render_id = f"{id_prefix}-{i}"
             svg = self.render_svg(render_id, p["words"], p["arcs"])
+            if p.get("title"):
+                svg = TPL_TITLE.format(title=p.get("title")) + svg
             rendered.append(svg)
         if page:
             content = "".join([TPL_FIGURE.format(content=svg) for svg in rendered])
@@ -565,7 +588,7 @@ class EntityRenderer:
         for i, fragment in enumerate(fragments):
             markup += escape_html(fragment)
             if len(fragments) > 1 and i != len(fragments) - 1:
-                markup += "</br>"
+                markup += "<br>"
         if self.ents is None or label.upper() in self.ents:
             color = self.colors.get(label.upper(), self.default_color)
             ent_settings = {
@@ -583,7 +606,7 @@ class EntityRenderer:
         for i, fragment in enumerate(fragments):
             markup += escape_html(fragment)
             if len(fragments) > 1 and i != len(fragments) - 1:
-                markup += "</br>"
+                markup += "<br>"
         markup = TPL_ENTS.format(content=markup, dir=self.direction)
         if title:
             markup = TPL_TITLE.format(title=title) + markup
@@ -1,6 +1,8 @@
 import warnings
 from typing import Literal
 
+from . import about
+
 
 class ErrorsWithCodes(type):
     def __getattribute__(self, code):
@@ -103,13 +105,14 @@ class Warnings(metaclass=ErrorsWithCodes):
             "table. This may degrade the performance of the model to some "
             "degree. If this is intentional or the language you're using "
             "doesn't have a normalization table, please ignore this warning. "
-            "If this is surprising, make sure you have the spacy-lookups-data "
-            "package installed and load the table in your config. The "
-            "languages with lexeme normalization tables are currently: "
-            "{langs}\n\nLoad the table in your config with:\n\n"
+            "If this is surprising, make sure you are loading the table in "
+            "your config. The languages with lexeme normalization tables are "
+            "currently: {langs}\n\nAn example of how to load a table in "
+            "your config :\n\n"
             "[initialize.lookups]\n"
-            "@misc = \"spacy.LookupsDataLoader.v1\"\n"
+            "@misc = \"spacy.LookupsDataLoaderFromURL.v1\"\n"
             "lang = ${{nlp.lang}}\n"
+            f'url = "{about.__lookups_url__}"\n'
             "tables = [\"lexeme_norm\"]\n")
     W035 = ("Discarding subpattern '{pattern}' due to an unrecognized "
             "attribute or operator.")
@@ -211,9 +214,9 @@ class Warnings(metaclass=ErrorsWithCodes):
     W125 = ("The StaticVectors key_attr is no longer used. To set a custom "
             "key attribute for vectors, configure it through Vectors(attr=) or "
             "'spacy init vectors --attr'")
+    W126 = ("These keys are unsupported: {unsupported}")
 
     # v4 warning strings
-    W400 = ("`use_upper=False` is ignored, the upper layer is always enabled")
     W401 = ("`incl_prior is True`, but the selected knowledge base type {kb_type} doesn't support prior probability "
             "lookups so this setting will be ignored. If your KB does support prior probability lookups, make sure "
             "to return `True` in `.supports_prior_probs`.")
@@ -224,7 +227,6 @@ class Errors(metaclass=ErrorsWithCodes):
    E002 = ("Can't find factory for '{name}' for language {lang} ({lang_code}). "
            "This usually happens when spaCy calls `nlp.{method}` with a custom "
            "component name that's not registered on the current language class. "
-           "If you're using a Transformer, make sure to install 'spacy-transformers'. "
            "If you're using a custom component, make sure you've added the "
            "decorator `@Language.component` (for function components) or "
            "`@Language.factory` (for class components).\n\nAvailable "
@@ -549,12 +551,12 @@ class Errors(metaclass=ErrorsWithCodes):
             "during training, make sure to include it in 'annotating components'")
 
     # New errors added in v3.x
+    E849 = ("The vocab only supports {method} for vectors of type "
+            "spacy.vectors.Vectors, not {vectors_type}.")
     E850 = ("The PretrainVectors objective currently only supports default or "
             "floret vectors, not {mode} vectors.")
     E851 = ("The 'textcat' component labels should only have values of 0 or 1, "
             "but found value of '{val}'.")
-    E852 = ("The tar file pulled from the remote attempted an unsafe path "
-            "traversal.")
     E853 = ("Unsupported component factory name '{name}'. The character '.' is "
             "not permitted in factory names.")
     E854 = ("Unable to set doc.ents. Check that the 'ents_filter' does not "
@@ -967,6 +969,12 @@ class Errors(metaclass=ErrorsWithCodes):
             " 'min_length': {min_length}, 'max_length': {max_length}")
     E1054 = ("The text, including whitespace, must match between reference and "
             "predicted docs when training {component}.")
+    E1055 = ("The 'replace_listener' callback expects {num_params} parameters, "
+            "but only callbacks with one or three parameters are supported")
+    E1056 = ("The `TextCatBOW` architecture expects a length of at least 1, was {length}.")
+    E1057 = ("The `TextCatReduce` architecture must be used with at least one "
+            "reduction. Please enable one of `use_reduce_first`, "
+            "`use_reduce_last`, `use_reduce_max` or `use_reduce_mean`.")
 
     # v4 error strings
     E4000 = ("Expected a Doc as input, but got: '{type}'")
@@ -982,6 +990,18 @@ class Errors(metaclass=ErrorsWithCodes):
             "{existing_value}.")
     E4008 = ("Span {pos}_char {value} does not correspond to a token {pos}.")
     E4009 = ("The '{attr}' parameter should be 'None' or 'True', but found '{value}'.")
+    E4010 = ("Required lemmatizer table(s) {missing_tables} not found in "
+             "[initialize] or in registered lookups (spacy-lookups-data). An "
+             "example for how to load lemmatizer tables in [initialize]:\n\n"
+             "[initialize.components]\n\n"
+             "[initialize.components.{pipe_name}]\n\n"
+             "[initialize.components.{pipe_name}.lookups]\n"
+             '@misc = "spacy.LookupsDataLoaderFromURL.v1"\n'
+             "lang = ${{nlp.lang}}\n"
+             f'url = "{about.__lookups_url__}"\n'
+             "tables = {tables}\n"
+             "# or required tables only: tables = {required_tables}\n")
+    E4011 = ("Server error ({status_code}), couldn't fetch {url}")
 
 
 RENAMED_LANGUAGE_CODES = {"xx": "mul", "is": "isl"}
@@ -2,4 +2,9 @@ from .candidate import Candidate, InMemoryCandidate
 from .kb import KnowledgeBase
 from .kb_in_memory import InMemoryLookupKB
 
-__all__ = ["KnowledgeBase", "InMemoryLookupKB", "Candidate", "InMemoryCandidate"]
+__all__ = [
+    "Candidate",
+    "KnowledgeBase",
+    "InMemoryCandidate",
+    "InMemoryLookupKB",
+]
@@ -1,4 +1,4 @@
-# cython: infer_types=True, profile=True
+# cython: infer_types=True
 
 from .kb_in_memory cimport InMemoryLookupKB
@@ -1,4 +1,4 @@
-# cython: infer_types=True, profile=True
+# cython: infer_types=True
 
 from pathlib import Path
 from typing import Iterable, Iterator, Tuple, Union
@@ -1,4 +1,4 @@
-# cython: infer_types=True, profile=True
+# cython: infer_types=True
 from typing import Any, Callable, Dict, Iterable, Iterator
 
 import srsly
@@ -6,7 +6,8 @@ _num_words = [
     "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
     "sixteen", "seventeen", "eighteen", "nineteen", "twenty", "thirty", "forty",
     "fifty", "sixty", "seventy", "eighty", "ninety", "hundred", "thousand",
-    "million", "billion", "trillion", "quadrillion", "gajillion", "bazillion"
+    "million", "billion", "trillion", "quadrillion", "quintillion", "sextillion",
+    "septillion", "octillion", "nonillion", "decillion", "gajillion", "bazillion"
 ]
 _ordinal_words = [
     "first", "second", "third", "fourth", "fifth", "sixth", "seventh", "eighth",
@@ -14,7 +15,8 @@ _ordinal_words = [
     "fifteenth", "sixteenth", "seventeenth", "eighteenth", "nineteenth",
     "twentieth", "thirtieth", "fortieth", "fiftieth", "sixtieth", "seventieth",
     "eightieth", "ninetieth", "hundredth", "thousandth", "millionth", "billionth",
-    "trillionth", "quadrillionth", "gajillionth", "bazillionth"
+    "trillionth", "quadrillionth", "quintillionth", "sextillionth", "septillionth",
+    "octillionth", "nonillionth", "decillionth", "gajillionth", "bazillionth"
 ]
 # fmt: on
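The extended `_num_words` list above feeds the English `like_num` lexical attribute. A trimmed-down sketch of how such a word list is typically used (this simplified `like_num` is an illustration, not the exact spaCy implementation, which also handles fractions):

```python
_num_words = [
    "zero", "one", "two", "three", "ten", "hundred", "thousand",
    "million", "billion", "trillion", "quadrillion", "quintillion",
    "sextillion", "septillion", "octillion", "nonillion", "decillion",
]


def like_num(text: str) -> bool:
    # Strip common digit separators, then check for digits or a number word.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    return text.lower() in _num_words


print(like_num("Quintillion"), like_num("10,000"), like_num("fish"))
# → True True False
```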
@@ -163,7 +163,7 @@ class SpanishLemmatizer(Lemmatizer):
         for old, new in self.lookups.get_table("lemma_rules").get("det", []):
             if word == old:
                 return [new]
-        # If none of the specfic rules apply, search in the common rules for
+        # If none of the specific rules apply, search in the common rules for
         # determiners and pronouns that follow a unique pattern for
         # lemmatization. If the word is in the list, return the corresponding
         # lemma.
@@ -291,7 +291,7 @@ class SpanishLemmatizer(Lemmatizer):
         for old, new in self.lookups.get_table("lemma_rules").get("pron", []):
             if word == old:
                 return [new]
-        # If none of the specfic rules apply, search in the common rules for
+        # If none of the specific rules apply, search in the common rules for
         # determiners and pronouns that follow a unique pattern for
         # lemmatization. If the word is in the list, return the corresponding
         # lemma.
18 spacy/lang/fo/__init__.py Normal file
@@ -0,0 +1,18 @@
+from ...language import BaseDefaults, Language
+from ..punctuation import TOKENIZER_INFIXES, TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES
+from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
+
+
+class FaroeseDefaults(BaseDefaults):
+    tokenizer_exceptions = TOKENIZER_EXCEPTIONS
+    infixes = TOKENIZER_INFIXES
+    suffixes = TOKENIZER_SUFFIXES
+    prefixes = TOKENIZER_PREFIXES
+
+
+class Faroese(Language):
+    lang = "fo"
+    Defaults = FaroeseDefaults
+
+
+__all__ = ["Faroese"]
90 spacy/lang/fo/tokenizer_exceptions.py Normal file
@@ -0,0 +1,90 @@
+from ...symbols import ORTH
+from ...util import update_exc
+from ..tokenizer_exceptions import BASE_EXCEPTIONS
+
+_exc = {}
+
+for orth in [
+    "apr.",
+    "aug.",
+    "avgr.",
+    "árg.",
+    "ávís.",
+    "beinl.",
+    "blkv.",
+    "blaðkv.",
+    "blm.",
+    "blaðm.",
+    "bls.",
+    "blstj.",
+    "blaðstj.",
+    "des.",
+    "eint.",
+    "febr.",
+    "fyrrv.",
+    "góðk.",
+    "h.m.",
+    "innt.",
+    "jan.",
+    "kl.",
+    "m.a.",
+    "mðr.",
+    "mió.",
+    "nr.",
+    "nto.",
+    "nov.",
+    "nút.",
+    "o.a.",
+    "o.a.m.",
+    "o.a.tíl.",
+    "o.fl.",
+    "ff.",
+    "o.m.a.",
+    "o.o.",
+    "o.s.fr.",
+    "o.tíl.",
+    "o.ø.",
+    "okt.",
+    "omf.",
+    "pst.",
+    "ritstj.",
+    "sbr.",
+    "sms.",
+    "smst.",
+    "smb.",
+    "sb.",
+    "sbrt.",
+    "sp.",
+    "sept.",
+    "spf.",
+    "spsk.",
+    "t.e.",
+    "t.s.",
+    "t.s.s.",
+    "tlf.",
+    "tel.",
+    "tsk.",
+    "t.o.v.",
+    "t.d.",
+    "uml.",
+    "ums.",
+    "uppl.",
+    "upprfr.",
+    "uppr.",
+    "útg.",
+    "útl.",
+    "útr.",
+    "vanl.",
+    "v.",
+    "v.h.",
+    "v.ø.o.",
+    "viðm.",
+    "viðv.",
+    "vm.",
+    "v.m.",
+]:
+    _exc[orth] = [{ORTH: orth}]
+    capitalized = orth.capitalize()
+    _exc[capitalized] = [{ORTH: capitalized}]
+
+TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc)
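The Faroese exception file above registers a capitalized variant of every abbreviation before merging with the shared base exceptions. A self-contained sketch of that pattern (`ORTH` is a plain string here, standing in for `spacy.symbols.ORTH`, and `update_exc` is a simplified stand-in for `spacy.util.update_exc`):

```python
ORTH = "ORTH"  # stand-in for spacy.symbols.ORTH


def update_exc(base, *addl):
    # Simplified stand-in for spacy.util.update_exc: later dicts win.
    exc = dict(base)
    for extra in addl:
        exc.update(extra)
    return exc


BASE_EXCEPTIONS = {":)": [{ORTH: ":)"}]}

_exc = {}
for orth in ["t.d.", "o.s.fr.", "kl."]:
    _exc[orth] = [{ORTH: orth}]
    capitalized = orth.capitalize()
    _exc[capitalized] = [{ORTH: capitalized}]

TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc)
print(sorted(TOKENIZER_EXCEPTIONS))
# → [':)', 'Kl.', 'O.s.fr.', 'T.d.', 'kl.', 'o.s.fr.', 't.d.']
```

Each entry maps the surface form to a list of token attribute dicts, so "T.d." at the start of a sentence is kept as one token just like "t.d." mid-sentence.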
@@ -15,6 +15,7 @@ _prefixes = (
     [
         "†",
         "⸏",
+        "〈",
     ]
     + LIST_PUNCT
     + LIST_ELLIPSES
@@ -31,6 +32,7 @@ _suffixes = (
     + [
         "†",
         "⸎",
+        "〉",
         r"(?<=[\u1F00-\u1FFF\u0370-\u03FF])[\-\.⸏]",
     ]
 )
20 spacy/lang/nn/__init__.py Normal file
@@ -0,0 +1,20 @@
+from ...language import BaseDefaults, Language
+from ..nb import SYNTAX_ITERATORS
+from .punctuation import TOKENIZER_INFIXES, TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES
+from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
+
+
+class NorwegianNynorskDefaults(BaseDefaults):
+    tokenizer_exceptions = TOKENIZER_EXCEPTIONS
+    prefixes = TOKENIZER_PREFIXES
+    infixes = TOKENIZER_INFIXES
+    suffixes = TOKENIZER_SUFFIXES
+    syntax_iterators = SYNTAX_ITERATORS
+
+
+class NorwegianNynorsk(Language):
+    lang = "nn"
+    Defaults = NorwegianNynorskDefaults
+
+
+__all__ = ["NorwegianNynorsk"]
15 spacy/lang/nn/examples.py Normal file
@@ -0,0 +1,15 @@
+"""
+Example sentences to test spaCy and its language models.
+
+>>> from spacy.lang.nn.examples import sentences
+>>> docs = nlp.pipe(sentences)
+"""
+
+
+# sentences taken from Omsetjingsminne frå Nynorsk pressekontor 2022 (https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-80/)
+sentences = [
+    "Konseptet går ut på at alle tre omgangar tel, alle hopparar må stille i kvalifiseringa og poengsummen skal telje.",
+    "Det er ein meir enn i same periode i fjor.",
+    "Det har lava ned enorme snømengder i store delar av Europa den siste tida.",
+    "Akhtar Chaudhry er ikkje innstilt på Oslo-lista til SV, men utfordrar Heikki Holmås om førsteplassen.",
+]
74 spacy/lang/nn/punctuation.py Normal file
@@ -0,0 +1,74 @@
+from ..char_classes import (
+    ALPHA,
+    ALPHA_LOWER,
+    ALPHA_UPPER,
+    CONCAT_QUOTES,
+    CURRENCY,
+    LIST_CURRENCY,
+    LIST_ELLIPSES,
+    LIST_ICONS,
+    LIST_PUNCT,
+    LIST_QUOTES,
+    PUNCT,
+    UNITS,
+)
+from ..punctuation import TOKENIZER_SUFFIXES
+
+_quotes = CONCAT_QUOTES.replace("'", "")
+_list_punct = [x for x in LIST_PUNCT if x != "#"]
+_list_icons = [x for x in LIST_ICONS if x != "°"]
+_list_icons = [x.replace("\\u00B0", "") for x in _list_icons]
+_list_quotes = [x for x in LIST_QUOTES if x != "\\'"]
+
+
+_prefixes = (
+    ["§", "%", "=", "—", "–", r"\+(?![0-9])"]
+    + _list_punct
+    + LIST_ELLIPSES
+    + LIST_QUOTES
+    + LIST_CURRENCY
+    + LIST_ICONS
+)
+
+
+_infixes = (
+    LIST_ELLIPSES
+    + _list_icons
+    + [
+        r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
+        r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
+        r"(?<=[{a}])[:<>=/](?=[{a}])".format(a=ALPHA),
+        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
+        r"(?<=[{a}])([{q}\)\]\(\[])(?=[{a}])".format(a=ALPHA, q=_quotes),
+        r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
+    ]
+)
+
+_suffixes = (
+    LIST_PUNCT
+    + LIST_ELLIPSES
+    + _list_quotes
+    + _list_icons
+    + ["—", "–"]
+    + [
+        r"(?<=[0-9])\+",
+        r"(?<=°[FfCcKk])\.",
+        r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
+        r"(?<=[0-9])(?:{u})".format(u=UNITS),
+        r"(?<=[{al}{e}{p}(?:{q})])\.".format(
+            al=ALPHA_LOWER, e=r"%²\-\+", q=_quotes, p=PUNCT
+        ),
+        r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
+    ]
+    + [r"(?<=[^sSxXzZ])'"]
+)
+_suffixes += [
+    suffix
+    for suffix in TOKENIZER_SUFFIXES
+    if suffix not in ["'s", "'S", "’s", "’S", r"\'"]
+]
+
+
+TOKENIZER_PREFIXES = _prefixes
+TOKENIZER_INFIXES = _infixes
+TOKENIZER_SUFFIXES = _suffixes
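The Nynorsk suffix rules above rely on lookbehind assertions, such as the temperature-degree rule and the digits-plus rule. A quick check of two of those patterns with plain `re`, treating each suffix pattern in isolation and anchoring it at the end of a candidate token:

```python
import re

# Suffix patterns from the list above: split a trailing "." after °F/°C/°K,
# and a trailing "+" after digits.
degree_suffix = re.compile(r"(?<=°[FfCcKk])\.$")
plus_suffix = re.compile(r"(?<=[0-9])\+$")

print(bool(degree_suffix.search("20°C.")))  # → True
print(bool(degree_suffix.search("20°C")))   # → False
print(bool(plus_suffix.search("18+")))      # → True
```

In the real tokenizer these patterns are compiled together via spaCy's suffix-regex machinery rather than applied one by one; the isolated anchoring here is only for illustration.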
228 spacy/lang/nn/tokenizer_exceptions.py Normal file
@@ -0,0 +1,228 @@
+from ...symbols import NORM, ORTH
+from ...util import update_exc
+from ..tokenizer_exceptions import BASE_EXCEPTIONS
+
+_exc = {}
+
+
+for exc_data in [
+    {ORTH: "jan.", NORM: "januar"},
+    {ORTH: "feb.", NORM: "februar"},
+    {ORTH: "mar.", NORM: "mars"},
+    {ORTH: "apr.", NORM: "april"},
+    {ORTH: "jun.", NORM: "juni"},
+    # note: "jul." is in the simple list below without a NORM exception
+    {ORTH: "aug.", NORM: "august"},
+    {ORTH: "sep.", NORM: "september"},
+    {ORTH: "okt.", NORM: "oktober"},
+    {ORTH: "nov.", NORM: "november"},
+    {ORTH: "des.", NORM: "desember"},
+]:
+    _exc[exc_data[ORTH]] = [exc_data]
+
+
+for orth in [
+    "Ap.",
+    "Aq.",
+    "Ca.",
+    "Chr.",
+    "Co.",
+    "Dr.",
+    "F.eks.",
+    "Fr.p.",
+    "Frp.",
+    "Grl.",
+    "Kr.",
+    "Kr.F.",
+    "Kr.F.s",
+    "Mr.",
+    "Mrs.",
+    "Pb.",
+    "Pr.",
+    "Sp.",
+    "St.",
+    "a.m.",
+    "ad.",
+    "adm.dir.",
+    "adr.",
+    "b.c.",
+    "bl.a.",
+    "bla.",
+    "bm.",
+    "bnr.",
+    "bto.",
+    "c.c.",
+    "ca.",
+    "cand.mag.",
+    "co.",
+    "d.d.",
+    "d.m.",
+    "d.y.",
+    "dept.",
+    "dr.",
+    "dr.med.",
+    "dr.philos.",
+    "dr.psychol.",
+    "dss.",
+    "dvs.",
+    "e.Kr.",
+    "e.l.",
+    "eg.",
+    "eig.",
+    "ekskl.",
+    "el.",
+    "et.",
+    "etc.",
+    "etg.",
+    "ev.",
+    "evt.",
+    "f.",
+    "f.Kr.",
+    "f.eks.",
+    "f.o.m.",
+    "fhv.",
+    "fk.",
+    "foreg.",
+    "fork.",
+    "fv.",
+    "fvt.",
+    "g.",
+    "gl.",
+    "gno.",
+    "gnr.",
+    "grl.",
+    "gt.",
+    "h.r.adv.",
+    "hhv.",
+    "hoh.",
+    "hr.",
+    "ifb.",
+    "ifm.",
+    "iht.",
+    "inkl.",
+    "istf.",
+    "jf.",
+    "jr.",
+    "jul.",
+    "juris.",
+    "kfr.",
+    "kgl.",
+    "kgl.res.",
+    "kl.",
+    "komm.",
+    "kr.",
+    "kst.",
+    "lat.",
+    "lø.",
+    "m.a.",
+    "m.a.o.",
+    "m.fl.",
+    "m.m.",
+    "m.v.",
+    "ma.",
+    "mag.art.",
+    "md.",
+    "mfl.",
+    "mht.",
+    "mill.",
+    "min.",
+    "mnd.",
+    "moh.",
+    "mrd.",
+    "muh.",
+    "mv.",
+    "mva.",
+    "n.å.",
+    "ndf.",
+    "nr.",
+    "nto.",
+    "nyno.",
+    "o.a.",
+    "o.l.",
+    "obl.",
+    "off.",
+    "ofl.",
+    "on.",
+    "op.",
+    "org.",
+    "osv.",
+    "ovf.",
+    "p.",
+    "p.a.",
+    "p.g.a.",
+    "p.m.",
+    "p.t.",
+    "pga.",
+    "ph.d.",
+    "pkt.",
+    "pr.",
+    "pst.",
+    "pt.",
+    "red.anm.",
+    "ref.",
+    "res.",
+    "res.kap.",
+    "resp.",
+    "rv.",
+    "s.",
+    "s.d.",
+    "s.k.",
+    "s.u.",
+    "s.å.",
+    "sen.",
+    "sep.",
+    "siviling.",
+    "sms.",
+    "snr.",
+    "spm.",
+    "sr.",
+    "sst.",
+    "st.",
+    "st.meld.",
+    "st.prp.",
+    "stip.",
+    "stk.",
+    "stud.",
+    "sv.",
+    "såk.",
+    "sø.",
+    "t.d.",
+    "t.h.",
+    "t.o.m.",
+    "t.v.",
+    "temp.",
+    "ti.",
+    "tils.",
+    "tilsv.",
+    "tl;dr",
+    "tlf.",
+    "to.",
+    "ult.",
+    "utg.",
+    "v.",
+    "vedk.",
+    "vedr.",
+    "vg.",
+    "vgs.",
+    "vha.",
+    "vit.ass.",
+    "vn.",
+    "vol.",
+    "vs.",
+    "vsa.",
+    "§§",
+    "©NTB",
+    "årg.",
+    "årh.",
+]:
+    _exc[orth] = [{ORTH: orth}]
+
+# Dates
+for h in range(1, 31 + 1):
+    for period in ["."]:
+        _exc[f"{h}{period}"] = [{ORTH: f"{h}."}]
+
+_custom_base_exc = {"i.": [{ORTH: "i", NORM: "i"}, {ORTH: "."}]}
+_exc.update(_custom_base_exc)
+
+TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc)
@@ -15,4 +15,7 @@ sentences = [
     "Türkiye'nin başkenti neresi?",
     "Bakanlar Kurulu 180 günlük eylem planını açıkladı.",
     "Merkez Bankası, beklentiler doğrultusunda faizlerde değişikliğe gitmedi.",
+    "Cemal Sureya kimdir?",
+    "Bunlari Biliyor muydunuz?",
+    "Altinoluk Turkiye haritasinin neresinde yer alir?",
 ]
@@ -31,7 +31,7 @@ segmenter = "char"
 [initialize]
 
 [initialize.tokenizer]
-pkuseg_model = null
+pkuseg_model = "spacy_ontonotes"
 pkuseg_user_dict = "default"
 """
@@ -1,4 +1,5 @@
import functools
import inspect
import itertools
import multiprocessing as mp
import random
@@ -64,6 +65,7 @@ from .util import (
    registry,
    warn_if_jupyter_cupy,
)
from .vectors import BaseVectors
from .vocab import Vocab, create_vocab

PipeCallable = Callable[[Doc], Doc]
@@ -128,13 +130,6 @@ def create_tokenizer() -> Callable[["Language"], Tokenizer]:
    return tokenizer_factory


@registry.misc("spacy.LookupsDataLoader.v1")
def load_lookups_data(lang, tables):
    util.logger.debug("Loading lookups from spacy-lookups-data: %s", tables)
    lookups = load_lookups(lang=lang, tables=tables)
    return lookups


class Language:
    """A text-processing pipeline. Usually you'll load this once per process,
    and pass the instance around your application.
@@ -160,6 +155,7 @@ class Language:
        max_length: int = 10**6,
        meta: Dict[str, Any] = {},
        create_tokenizer: Optional[Callable[["Language"], Callable[[str], Doc]]] = None,
        create_vectors: Optional[Callable[["Vocab"], BaseVectors]] = None,
        batch_size: int = 1000,
        **kwargs,
    ) -> None:
@@ -199,6 +195,10 @@ class Language:
            raise ValueError(Errors.E918.format(vocab=vocab, vocab_type=type(Vocab)))
        if vocab is True:
            vocab = create_vocab(self.lang, self.Defaults)
            if not create_vectors:
                vectors_cfg = {"vectors": self._config["nlp"]["vectors"]}
                create_vectors = registry.resolve(vectors_cfg)["vectors"]
            vocab.vectors = create_vectors(vocab)
        else:
            if (self.lang and vocab.lang) and (self.lang != vocab.lang):
                raise ValueError(Errors.E150.format(nlp=self.lang, vocab=vocab.lang))
@@ -1797,6 +1797,12 @@ class Language:
        for proc in procs:
            proc.start()

        # Close writing-end of channels. This is needed to avoid that reading
        # from the channel blocks indefinitely when the worker closes the
        # channel.
        for tx in bytedocs_send_ch:
            tx.close()

        # Cycle channels not to break the order of docs.
        # The received object is a batch of byte-encoded docs, so flatten them with chain.from_iterable.
        byte_tuples = chain.from_iterable(
@@ -1819,8 +1825,23 @@ class Language:
                # tell `sender` that one batch was consumed.
                sender.step()
        finally:
            # If we are stopping in an orderly fashion, the workers' queues
            # are empty. Put the sentinel in their queues to signal that work
            # is done, so that they can exit gracefully.
            for q in texts_q:
                q.put(_WORK_DONE_SENTINEL)

            # Otherwise, we are stopping because the error handler raised an
            # exception. The sentinel will be last to go out of the queue.
            # To avoid doing unnecessary work or hanging on platforms that
            # block on sending (Windows), we'll close our end of the channel.
            # This signals to the worker that it can exit the next time it
            # attempts to send data down the channel.
            for r in bytedocs_recv_ch:
                r.close()

            for proc in procs:
                proc.terminate()
                proc.join()

    def _link_components(self) -> None:
        """Register 'listeners' within pipeline components, to allow them to
@@ -1885,6 +1906,10 @@ class Language:
        ).merge(config)
        if "nlp" not in config:
            raise ValueError(Errors.E985.format(config=config))
        # fill in [nlp.vectors] if not present (as a narrower alternative to
        # auto-filling [nlp] from the default config)
        if "vectors" not in config["nlp"]:
            config["nlp"]["vectors"] = {"@vectors": "spacy.Vectors.v1"}
        config_lang = config["nlp"].get("lang")
        if config_lang is not None and config_lang != cls.lang:
            raise ValueError(
@@ -1920,6 +1945,7 @@ class Language:
            filled["nlp"], validate=validate, schema=ConfigSchemaNlp
        )
        create_tokenizer = resolved_nlp["tokenizer"]
        create_vectors = resolved_nlp["vectors"]
        before_creation = resolved_nlp["before_creation"]
        after_creation = resolved_nlp["after_creation"]
        after_pipeline_creation = resolved_nlp["after_pipeline_creation"]
@@ -1940,7 +1966,12 @@ class Language:
        # inside stuff like the spacy train function. If we loaded them here,
        # then we would load them twice at runtime: once when we make from config,
        # and then again when we load from disk.
        nlp = lang_cls(vocab=vocab, create_tokenizer=create_tokenizer, meta=meta)
        nlp = lang_cls(
            vocab=vocab,
            create_tokenizer=create_tokenizer,
            create_vectors=create_vectors,
            meta=meta,
        )
        if after_creation is not None:
            nlp = after_creation(nlp)
        if not isinstance(nlp, cls):
@@ -2157,8 +2188,20 @@ class Language:
        # Go over the listener layers and replace them
        for listener in pipe_listeners:
            new_model = tok2vec_model.copy()
            if "replace_listener" in tok2vec_model.attrs:
                new_model = tok2vec_model.attrs["replace_listener"](new_model)
            replace_listener_func = tok2vec_model.attrs.get("replace_listener")
            if replace_listener_func is not None:
                # Pass the extra args to the callback without breaking compatibility with
                # old library versions that only expect a single parameter.
                num_params = len(
                    inspect.signature(replace_listener_func).parameters
                )
                if num_params == 1:
                    new_model = replace_listener_func(new_model)
                elif num_params == 3:
                    new_model = replace_listener_func(new_model, listener, tok2vec)
                else:
                    raise ValueError(Errors.E1055.format(num_params=num_params))

            util.replace_model_node(pipe.model, listener, new_model)  # type: ignore[attr-defined]
            tok2vec.remove_listener(listener, pipe_name)
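The `replace_listener` change above keeps old single-parameter callbacks working by counting their parameters with `inspect.signature`. A minimal standalone sketch of that dispatch pattern (the helper and callbacks here are illustrative, not spaCy API):

```python
import inspect

def dispatch_by_arity(callback, model, listener=None, tok2vec=None):
    # Call `callback` with one or three arguments, depending on how many
    # parameters its signature declares.
    num_params = len(inspect.signature(callback).parameters)
    if num_params == 1:
        return callback(model)
    elif num_params == 3:
        return callback(model, listener, tok2vec)
    raise ValueError(f"unsupported callback arity: {num_params}")

# Old-style callbacks accept one argument, new-style accept three.
old_style = lambda model: ("v1", model)
new_style = lambda model, listener, tok2vec: ("v2", model, listener, tok2vec)
```

Both styles can then be invoked through the same call site, which is why the diff inspects the signature rather than requiring the new three-argument form.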
@@ -2418,6 +2461,11 @@ def _apply_pipes(
    while True:
        try:
            texts_with_ctx = receiver.get()

            # Stop working if we encounter the end-of-work sentinel.
            if isinstance(texts_with_ctx, _WorkDoneSentinel):
                return

            docs = (
                ensure_doc(doc_like, context) for doc_like, context in texts_with_ctx
            )
@@ -2426,11 +2474,21 @@ def _apply_pipes(
            # Connection does not accept unpickable objects, so send list.
            byte_docs = [(doc.to_bytes(), doc._context, None) for doc in docs]
            padding = [(None, None, None)] * (len(texts_with_ctx) - len(byte_docs))
            sender.send(byte_docs + padding)  # type: ignore[operator]
            data: Sequence[Tuple[Optional[bytes], Optional[Any], Optional[bytes]]] = (
                byte_docs + padding  # type: ignore[operator]
            )
        except Exception:
            error_msg = [(None, None, srsly.msgpack_dumps(traceback.format_exc()))]
            padding = [(None, None, None)] * (len(texts_with_ctx) - 1)
            sender.send(error_msg + padding)
            data = error_msg + padding

        try:
            sender.send(data)
        except BrokenPipeError:
            # Parent has closed the pipe prematurely. This happens when a
            # worker encounters an error and the error handler is set to
            # stop processing.
            return
|
class _Sender:
|
||||||
|
@ -2460,3 +2518,10 @@ class _Sender:
|
||||||
if self.count >= self.chunk_size:
|
if self.count >= self.chunk_size:
|
||||||
self.count = 0
|
self.count = 0
|
||||||
self.send()
|
self.send()
|
||||||
|
|
||||||
|
|
||||||
|
class _WorkDoneSentinel:
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
_WORK_DONE_SENTINEL = _WorkDoneSentinel()
|
||||||
|
|
|
@@ -1,4 +1,5 @@
# cython: embedsignature=True
# cython: profile=False
# Compiler crashes on memory view coercion without this. Should report bug.
cimport numpy as np
from libc.string cimport memset
@@ -2,16 +2,40 @@ from collections import OrderedDict
from pathlib import Path
from typing import Any, Dict, List, Optional, Union

import requests
import srsly
from preshed.bloom import BloomFilter

from .errors import Errors
from .strings import get_string_id
from .util import SimpleFrozenDict, ensure_path, load_language_data, registry
from .util import SimpleFrozenDict, ensure_path, load_language_data, logger, registry

UNSET = object()


@registry.misc("spacy.LookupsDataLoader.v1")
def load_lookups_data(lang, tables):
    logger.debug(f"Loading lookups from spacy-lookups-data: {tables}")
    lookups = load_lookups(lang=lang, tables=tables)
    return lookups


@registry.misc("spacy.LookupsDataLoaderFromURL.v1")
def load_lookups_data_from_url(lang, tables, url):
    logger.debug(f"Loading lookups from {url}: {tables}")
    lookups = Lookups()
    for table in tables:
        table_url = url + lang + "_" + table + ".json"
        r = requests.get(table_url)
        if r.status_code != 200:
            raise ValueError(
                Errors.E4011.format(status_code=r.status_code, url=table_url)
            )
        table_data = r.json()
        lookups.add_table(table, table_data)
    return lookups


def load_lookups(lang: str, tables: List[str], strict: bool = True) -> "Lookups":
    """Load the data from the spacy-lookups-data package for a given language,
    if available. Returns an empty `Lookups` container if there's no data or if the package
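A quick way to see what `load_lookups_data_from_url` above will fetch is to reproduce its URL scheme, `url + lang + "_" + table + ".json"`, in isolation (the base URL and table names below are made-up placeholders, not real spacy-lookups-data endpoints):

```python
def lookup_table_urls(lang, tables, url):
    # Hypothetical helper: mirrors the "<url><lang>_<table>.json" scheme.
    return [url + lang + "_" + table + ".json" for table in tables]

# Illustrative inputs only.
urls = lookup_table_urls("en", ["lemma_lookup", "lemma_exc"], "https://example.com/data/")
```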
@@ -3,4 +3,4 @@ from .levenshtein import levenshtein
from .matcher import Matcher
from .phrasematcher import PhraseMatcher

__all__ = ["Matcher", "PhraseMatcher", "DependencyMatcher", "levenshtein"]
__all__ = ["DependencyMatcher", "Matcher", "PhraseMatcher", "levenshtein"]
@@ -1,4 +1,4 @@
# cython: infer_types=True, profile=True
# cython: infer_types=True
import warnings
from collections import defaultdict
from itertools import product
@@ -129,6 +129,7 @@ cdef class DependencyMatcher:
        else:
            required_keys = {"RIGHT_ID", "RIGHT_ATTRS", "REL_OP", "LEFT_ID"}
            relation_keys = set(relation.keys())
            # Identify required keys that have not been specified
            missing = required_keys - relation_keys
            if missing:
                missing_txt = ", ".join(list(missing))
@@ -136,6 +137,13 @@ cdef class DependencyMatcher:
                    required=required_keys,
                    missing=missing_txt
                ))
            # Identify additional, unsupported keys
            unsupported = relation_keys - required_keys
            if unsupported:
                unsupported_txt = ", ".join(list(unsupported))
                warnings.warn(Warnings.W126.format(
                    unsupported=unsupported_txt
                ))
            if (
                relation["RIGHT_ID"] in visited_nodes
                or relation["LEFT_ID"] not in visited_nodes
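The added validation relies on plain set differences between the supplied and required keys; the same checks can be run in isolation (`validate_relation` is a hypothetical helper, not part of the library):

```python
REQUIRED_KEYS = {"RIGHT_ID", "RIGHT_ATTRS", "REL_OP", "LEFT_ID"}

def validate_relation(relation):
    # Returns (missing, unsupported): required keys that are absent, and
    # supplied keys that the matcher does not recognise.
    relation_keys = set(relation)
    return REQUIRED_KEYS - relation_keys, relation_keys - REQUIRED_KEYS

missing, unsupported = validate_relation({"RIGHT_ID": "x", "REL_OP": ">", "FOO": 1})
```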
@@ -1,4 +1,4 @@
# cython: profile=True, binding=True, infer_types=True
# cython: binding=True, infer_types=True
from cpython.object cimport PyObject
from libc.stdint cimport int64_t

@@ -1,4 +1,4 @@
# cython: binding=True, infer_types=True, profile=True
# cython: binding=True, infer_types=True
from typing import Iterable, List

from cymem.cymem cimport Pool

@@ -1,4 +1,4 @@
# cython: infer_types=True, profile=True
# cython: infer_types=True
from collections import defaultdict
from typing import List
164
spacy/ml/_precomputable_affine.py
Normal file
@@ -0,0 +1,164 @@
from thinc.api import Model, normal_init

from ..util import registry


@registry.layers("spacy.PrecomputableAffine.v1")
def PrecomputableAffine(nO, nI, nF, nP, dropout=0.1):
    model = Model(
        "precomputable_affine",
        forward,
        init=init,
        dims={"nO": nO, "nI": nI, "nF": nF, "nP": nP},
        params={"W": None, "b": None, "pad": None},
        attrs={"dropout_rate": dropout},
    )
    return model


def forward(model, X, is_train):
    nF = model.get_dim("nF")
    nO = model.get_dim("nO")
    nP = model.get_dim("nP")
    nI = model.get_dim("nI")
    W = model.get_param("W")
    # Preallocate array for layer output, including padding.
    Yf = model.ops.alloc2f(X.shape[0] + 1, nF * nO * nP, zeros=False)
    model.ops.gemm(X, W.reshape((nF * nO * nP, nI)), trans2=True, out=Yf[1:])
    Yf = Yf.reshape((Yf.shape[0], nF, nO, nP))

    # Set padding. Padding has shape (1, nF, nO, nP). Unfortunately, we cannot
    # change its shape to (nF, nO, nP) without breaking existing models. So
    # we'll squeeze the first dimension here.
    Yf[0] = model.ops.xp.squeeze(model.get_param("pad"), 0)

    def backward(dY_ids):
        # This backprop is particularly tricky, because we get back a different
        # thing from what we put out. We put out an array of shape:
        # (nB, nF, nO, nP), and get back:
        # (nB, nO, nP) and ids (nB, nF)
        # The ids tell us the values of nF, so we would have:
        #
        # dYf = zeros((nB, nF, nO, nP))
        # for b in range(nB):
        #     for f in range(nF):
        #         dYf[b, ids[b, f]] += dY[b]
        #
        # However, we avoid building that array for efficiency -- and just pass
        # in the indices.
        dY, ids = dY_ids
        assert dY.ndim == 3
        assert dY.shape[1] == nO, dY.shape
        assert dY.shape[2] == nP, dY.shape
        # nB = dY.shape[0]
        model.inc_grad("pad", _backprop_precomputable_affine_padding(model, dY, ids))
        Xf = X[ids]
        Xf = Xf.reshape((Xf.shape[0], nF * nI))

        model.inc_grad("b", dY.sum(axis=0))
        dY = dY.reshape((dY.shape[0], nO * nP))

        Wopfi = W.transpose((1, 2, 0, 3))
        Wopfi = Wopfi.reshape((nO * nP, nF * nI))
        dXf = model.ops.gemm(dY.reshape((dY.shape[0], nO * nP)), Wopfi)

        dWopfi = model.ops.gemm(dY, Xf, trans1=True)
        dWopfi = dWopfi.reshape((nO, nP, nF, nI))
        # (o, p, f, i) --> (f, o, p, i)
        dWopfi = dWopfi.transpose((2, 0, 1, 3))
        model.inc_grad("W", dWopfi)
        return dXf.reshape((dXf.shape[0], nF, nI))

    return Yf, backward


def _backprop_precomputable_affine_padding(model, dY, ids):
    nB = dY.shape[0]
    nF = model.get_dim("nF")
    nP = model.get_dim("nP")
    nO = model.get_dim("nO")
    # Backprop the "padding", used as a filler for missing values.
    # Values that are missing are set to -1, and each state vector could
    # have multiple missing values. The padding has different values for
    # different missing features. The gradient of the padding vector is:
    #
    # for b in range(nB):
    #     for f in range(nF):
    #         if ids[b, f] < 0:
    #             d_pad[f] += dY[b]
    #
    # Which can be rewritten as:
    #
    # (ids < 0).T @ dY
    mask = model.ops.asarray(ids < 0, dtype="f")
    d_pad = model.ops.gemm(mask, dY.reshape(nB, nO * nP), trans1=True)
    return d_pad.reshape((1, nF, nO, nP))


def init(model, X=None, Y=None):
    """This is like the 'layer sequential unit variance', but instead
    of taking the actual inputs, we randomly generate whitened data.

    Why's this all so complicated? We have a huge number of inputs,
    and the maxout unit makes guessing the dynamics tricky. Instead
    we set the maxout weights to values that empirically result in
    whitened outputs given whitened inputs.
    """
    if model.has_param("W") and model.get_param("W").any():
        return

    nF = model.get_dim("nF")
    nO = model.get_dim("nO")
    nP = model.get_dim("nP")
    nI = model.get_dim("nI")
    W = model.ops.alloc4f(nF, nO, nP, nI)
    b = model.ops.alloc2f(nO, nP)
    pad = model.ops.alloc4f(1, nF, nO, nP)

    ops = model.ops
    W = normal_init(ops, W.shape, mean=float(ops.xp.sqrt(1.0 / nF * nI)))
    pad = normal_init(ops, pad.shape, mean=1.0)
    model.set_param("W", W)
    model.set_param("b", b)
    model.set_param("pad", pad)

    ids = ops.alloc((5000, nF), dtype="f")
    ids += ops.xp.random.uniform(0, 1000, ids.shape)
    ids = ops.asarray(ids, dtype="i")
    tokvecs = ops.alloc((5000, nI), dtype="f")
    tokvecs += ops.xp.random.normal(loc=0.0, scale=1.0, size=tokvecs.size).reshape(
        tokvecs.shape
    )

    def predict(ids, tokvecs):
        # nS ids. nW tokvecs. Exclude the padding array.
        hiddens = model.predict(tokvecs[:-1])  # (nW, f, o, p)
        vectors = model.ops.alloc((ids.shape[0], nO * nP), dtype="f")
        # need nS vectors
        hiddens = hiddens.reshape((hiddens.shape[0] * nF, nO * nP))
        model.ops.scatter_add(vectors, ids.flatten(), hiddens)
        vectors = vectors.reshape((vectors.shape[0], nO, nP))
        vectors += b
        vectors = model.ops.asarray(vectors)
        if nP >= 2:
            return model.ops.maxout(vectors)[0]
        else:
            return vectors * (vectors >= 0)

    tol_var = 0.01
    tol_mean = 0.01
    t_max = 10
    W = model.get_param("W").copy()
    b = model.get_param("b").copy()
    for t_i in range(t_max):
        acts1 = predict(ids, tokvecs)
        var = model.ops.xp.var(acts1)
        mean = model.ops.xp.mean(acts1)
        if abs(var - 1.0) >= tol_var:
            W /= model.ops.xp.sqrt(var)
            model.set_param("W", W)
        elif abs(mean) >= tol_mean:
            b -= mean
            model.set_param("b", b)
        else:
            break
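The comment in `_backprop_precomputable_affine_padding` above claims the per-example loop over missing features is equivalent to the single matrix product `(ids < 0).T @ dY`. That identity is easy to verify numerically with plain NumPy (the shapes and id values below are arbitrary, chosen only for the check):

```python
import numpy as np

nB, nF, nO, nP = 4, 3, 2, 2
dY = np.arange(nB * nO * nP, dtype="f").reshape(nB, nO * nP)
ids = np.array([[-1, 0, 2], [1, -1, -1], [3, 4, 0], [-1, 2, 1]])

# Loop form: wherever feature f is missing (id < 0), the padding row f
# accumulates that example's gradient.
d_pad_loop = np.zeros((nF, nO * nP), dtype="f")
for b in range(nB):
    for f in range(nF):
        if ids[b, f] < 0:
            d_pad_loop[f] += dY[b]

# Vectorised form used in the layer: (ids < 0).T @ dY
mask = (ids < 0).astype("f")
d_pad_gemm = mask.T @ dY
```

Both forms produce the same `(nF, nO * nP)` array, which the layer then reshapes to `(1, nF, nO, nP)`.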
@ -1,66 +1,23 @@
|
||||||
import warnings
|
from typing import List, Literal, Optional
|
||||||
from typing import Any, List, Literal, Optional, Tuple
|
|
||||||
|
|
||||||
from thinc.api import Model
|
from thinc.api import Linear, Model, chain, list2array, use_ops, zero_init
|
||||||
from thinc.types import Floats2d
|
from thinc.types import Floats2d
|
||||||
|
|
||||||
from ...errors import Errors, Warnings
|
from ...errors import Errors
|
||||||
from ...tokens.doc import Doc
|
from ...tokens import Doc
|
||||||
from ...util import registry
|
from ...util import registry
|
||||||
|
from .._precomputable_affine import PrecomputableAffine
|
||||||
from ..tb_framework import TransitionModel
|
from ..tb_framework import TransitionModel
|
||||||
|
|
||||||
TransitionSystem = Any # TODO
|
|
||||||
State = Any # TODO
|
|
||||||
|
|
||||||
|
|
||||||
@registry.architectures.register("spacy.TransitionBasedParser.v2")
|
|
||||||
def transition_parser_v2(
|
|
||||||
tok2vec: Model[List[Doc], List[Floats2d]],
|
|
||||||
state_type: Literal["parser", "ner"],
|
|
||||||
extra_state_tokens: bool,
|
|
||||||
hidden_width: int,
|
|
||||||
maxout_pieces: int,
|
|
||||||
use_upper: bool,
|
|
||||||
nO: Optional[int] = None,
|
|
||||||
) -> Model:
|
|
||||||
if not use_upper:
|
|
||||||
warnings.warn(Warnings.W400)
|
|
||||||
|
|
||||||
return build_tb_parser_model(
|
|
||||||
tok2vec,
|
|
||||||
state_type,
|
|
||||||
extra_state_tokens,
|
|
||||||
hidden_width,
|
|
||||||
maxout_pieces,
|
|
||||||
nO=nO,
|
|
||||||
)
|
|
||||||
|
|
||||||
|
|
||||||
@registry.architectures.register("spacy.TransitionBasedParser.v3")
|
|
||||||
def transition_parser_v3(
|
|
||||||
tok2vec: Model[List[Doc], List[Floats2d]],
|
|
||||||
state_type: Literal["parser", "ner"],
|
|
||||||
extra_state_tokens: bool,
|
|
||||||
hidden_width: int,
|
|
||||||
maxout_pieces: int,
|
|
||||||
nO: Optional[int] = None,
|
|
||||||
) -> Model:
|
|
||||||
return build_tb_parser_model(
|
|
||||||
tok2vec,
|
|
||||||
state_type,
|
|
||||||
extra_state_tokens,
|
|
||||||
hidden_width,
|
|
||||||
maxout_pieces,
|
|
||||||
nO=nO,
|
|
||||||
)
|
|
||||||
|
|
||||||
|
|
||||||
|
@registry.architectures("spacy.TransitionBasedParser.v2")
|
||||||
def build_tb_parser_model(
|
def build_tb_parser_model(
|
||||||
tok2vec: Model[List[Doc], List[Floats2d]],
|
tok2vec: Model[List[Doc], List[Floats2d]],
|
||||||
state_type: Literal["parser", "ner"],
|
state_type: Literal["parser", "ner"],
|
||||||
extra_state_tokens: bool,
|
extra_state_tokens: bool,
|
||||||
hidden_width: int,
|
hidden_width: int,
|
||||||
maxout_pieces: int,
|
maxout_pieces: int,
|
||||||
|
use_upper: bool,
|
||||||
nO: Optional[int] = None,
|
nO: Optional[int] = None,
|
||||||
) -> Model:
|
) -> Model:
|
||||||
"""
|
"""
|
||||||
|
@ -94,7 +51,14 @@ def build_tb_parser_model(
|
||||||
feature sets (for the NER) or 13 (for the parser).
|
feature sets (for the NER) or 13 (for the parser).
|
||||||
hidden_width (int): The width of the hidden layer.
|
hidden_width (int): The width of the hidden layer.
|
||||||
maxout_pieces (int): How many pieces to use in the state prediction layer.
|
maxout_pieces (int): How many pieces to use in the state prediction layer.
|
||||||
Recommended values are 1, 2 or 3.
|
Recommended values are 1, 2 or 3. If 1, the maxout non-linearity
|
||||||
|
is replaced with a ReLu non-linearity if use_upper=True, and no
|
||||||
|
non-linearity if use_upper=False.
|
||||||
|
use_upper (bool): Whether to use an additional hidden layer after the state
|
||||||
|
vector in order to predict the action scores. It is recommended to set
|
||||||
|
this to False for large pretrained models such as transformers, and True
|
||||||
|
for smaller networks. The upper layer is computed on CPU, which becomes
|
||||||
|
a bottleneck on larger GPU-based models, where it's also less necessary.
|
||||||
nO (int or None): The number of actions the model will predict between.
|
nO (int or None): The number of actions the model will predict between.
|
||||||
Usually inferred from data at the beginning of training, or loaded from
|
Usually inferred from data at the beginning of training, or loaded from
|
||||||
disk.
|
disk.
|
||||||
|
@ -105,11 +69,106 @@ def build_tb_parser_model(
|
||||||
nr_feature_tokens = 6 if extra_state_tokens else 3
|
nr_feature_tokens = 6 if extra_state_tokens else 3
|
||||||
else:
|
else:
|
||||||
raise ValueError(Errors.E917.format(value=state_type))
|
raise ValueError(Errors.E917.format(value=state_type))
|
||||||
return TransitionModel(
|
t2v_width = tok2vec.get_dim("nO") if tok2vec.has_dim("nO") else None
|
||||||
tok2vec=tok2vec,
|
tok2vec = chain(
|
||||||
state_tokens=nr_feature_tokens,
|
tok2vec,
|
||||||
hidden_width=hidden_width,
|
list2array(),
|
||||||
maxout_pieces=maxout_pieces,
|
Linear(hidden_width, t2v_width),
|
||||||
nO=nO,
|
|
||||||
unseen_classes=set(),
|
|
||||||
)
|
)
|
||||||
|
tok2vec.set_dim("nO", hidden_width)
|
||||||
|
lower = _define_lower(
|
||||||
|
nO=hidden_width if use_upper else nO,
|
||||||
|
nF=nr_feature_tokens,
|
||||||
|
nI=tok2vec.get_dim("nO"),
|
||||||
|
nP=maxout_pieces,
|
||||||
|
)
|
||||||
|
upper = None
|
||||||
|
if use_upper:
|
||||||
|
with use_ops("cpu"):
|
||||||
|
# Initialize weights at zero, as it's a classification layer.
|
||||||
|
upper = _define_upper(nO=nO, nI=None)
|
||||||
|
return TransitionModel(tok2vec, lower, upper, resize_output)
|
||||||
|
|
||||||
|
|
||||||
|
def _define_upper(nO, nI):
|
||||||
|
return Linear(nO=nO, nI=nI, init_W=zero_init)
|
||||||
|
|
||||||
|
|
||||||
|
def _define_lower(nO, nF, nI, nP):
|
||||||
|
return PrecomputableAffine(nO=nO, nF=nF, nI=nI, nP=nP)
|
||||||
|
|
||||||
|
|
||||||
|
def resize_output(model, new_nO):
|
||||||
|
if model.attrs["has_upper"]:
|
||||||
|
return _resize_upper(model, new_nO)
|
||||||
|
return _resize_lower(model, new_nO)
|
||||||
|
|
||||||
|
|
||||||
|
def _resize_upper(model, new_nO):
|
||||||
|
upper = model.get_ref("upper")
|
||||||
|
if upper.has_dim("nO") is None:
|
||||||
|
upper.set_dim("nO", new_nO)
|
||||||
|
return model
|
||||||
|
elif new_nO == upper.get_dim("nO"):
|
||||||
|
return model
|
||||||
|
|
||||||
|
smaller = upper
|
||||||
|
nI = smaller.maybe_get_dim("nI")
|
||||||
|
    with use_ops("cpu"):
        larger = _define_upper(nO=new_nO, nI=nI)
    # it could be that the model is not initialized yet, then skip this bit
    if smaller.has_param("W"):
        larger_W = larger.ops.alloc2f(new_nO, nI)
        larger_b = larger.ops.alloc1f(new_nO)
        smaller_W = smaller.get_param("W")
        smaller_b = smaller.get_param("b")
        # Weights are stored in (nr_out, nr_in) format, so we're basically
        # just adding rows here.
        if smaller.has_dim("nO"):
            old_nO = smaller.get_dim("nO")
            larger_W[:old_nO] = smaller_W
            larger_b[:old_nO] = smaller_b
            for i in range(old_nO, new_nO):
                model.attrs["unseen_classes"].add(i)

        larger.set_param("W", larger_W)
        larger.set_param("b", larger_b)
    model._layers[-1] = larger
    model.set_ref("upper", larger)
    return model


def _resize_lower(model, new_nO):
    lower = model.get_ref("lower")
    if lower.has_dim("nO") is None:
        lower.set_dim("nO", new_nO)
        return model

    smaller = lower
    nI = smaller.maybe_get_dim("nI")
    nF = smaller.maybe_get_dim("nF")
    nP = smaller.maybe_get_dim("nP")
    larger = _define_lower(nO=new_nO, nI=nI, nF=nF, nP=nP)
    # it could be that the model is not initialized yet, then skip this bit
    if smaller.has_param("W"):
        larger_W = larger.ops.alloc4f(nF, new_nO, nP, nI)
        larger_b = larger.ops.alloc2f(new_nO, nP)
        larger_pad = larger.ops.alloc4f(1, nF, new_nO, nP)
        smaller_W = smaller.get_param("W")
        smaller_b = smaller.get_param("b")
        smaller_pad = smaller.get_param("pad")
        # Copy the old weights and padding into the new layer
        if smaller.has_dim("nO"):
            old_nO = smaller.get_dim("nO")
            larger_W[:, 0:old_nO, :, :] = smaller_W
            larger_pad[:, :, 0:old_nO, :] = smaller_pad
            larger_b[0:old_nO, :] = smaller_b
            for i in range(old_nO, new_nO):
                model.attrs["unseen_classes"].add(i)

        larger.set_param("W", larger_W)
        larger.set_param("b", larger_b)
        larger.set_param("pad", larger_pad)
    model._layers[1] = larger
    model.set_ref("lower", larger)
    return model
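The row-copying logic above (weights stored as `(nr_out, nr_in)`, resizing appends zero rows) can be sketched in plain numpy. This is a standalone illustration, not spaCy's actual helper; `resize_weights` is a hypothetical name:

```python
import numpy as np


def resize_weights(W: np.ndarray, b: np.ndarray, new_nO: int):
    # Weights are (nr_out, nr_in): growing the output dimension just
    # appends zero-initialized rows, and the old rows are copied over.
    old_nO, nI = W.shape
    W_new = np.zeros((new_nO, nI), dtype=W.dtype)
    b_new = np.zeros(new_nO, dtype=b.dtype)
    W_new[:old_nO] = W
    b_new[:old_nO] = b
    return W_new, b_new
```

Classes in the new rows start with zero weights, which is why the code above also records them as "unseen_classes".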
@@ -1,21 +1,28 @@
 from functools import partial
-from typing import List, Optional, cast
+from typing import List, Optional, Tuple, cast

 from thinc.api import (
     Dropout,
+    Gelu,
     LayerNorm,
     Linear,
     Logistic,
     Maxout,
     Model,
     ParametricAttention,
+    ParametricAttention_v2,
     Relu,
     Softmax,
     SparseLinear,
+    SparseLinear_v2,
     chain,
     clone,
     concatenate,
     list2ragged,
+    noop,
+    reduce_first,
+    reduce_last,
+    reduce_max,
     reduce_mean,
     reduce_sum,
     residual,
@@ -25,9 +32,10 @@ from thinc.api import (
 )
 from thinc.layers.chain import init as init_chain
 from thinc.layers.resizable import resize_linear_weighted, resize_model
-from thinc.types import Floats2d
+from thinc.types import ArrayXd, Floats2d

 from ...attrs import ORTH
+from ...errors import Errors
 from ...tokens import Doc
 from ...util import registry
 from ..extract_ngrams import extract_ngrams
@@ -47,10 +55,255 @@ def build_simple_cnn_text_classifier(
     outputs sum to 1. If exclusive_classes=False, a logistic non-linearity
     is applied instead, so that outputs are in the range [0, 1].
     """
+    return build_reduce_text_classifier(
+        tok2vec=tok2vec,
+        exclusive_classes=exclusive_classes,
+        use_reduce_first=False,
+        use_reduce_last=False,
+        use_reduce_max=False,
+        use_reduce_mean=True,
+        nO=nO,
+    )
+
+
+def resize_and_set_ref(model, new_nO, resizable_layer):
+    resizable_layer = resize_model(resizable_layer, new_nO)
+    model.set_ref("output_layer", resizable_layer.layers[0])
+    model.set_dim("nO", new_nO, force=True)
+    return model
+
+
+@registry.architectures("spacy.TextCatBOW.v2")
+def build_bow_text_classifier(
+    exclusive_classes: bool,
+    ngram_size: int,
+    no_output_layer: bool,
+    nO: Optional[int] = None,
+) -> Model[List[Doc], Floats2d]:
+    return _build_bow_text_classifier(
+        exclusive_classes=exclusive_classes,
+        ngram_size=ngram_size,
+        no_output_layer=no_output_layer,
+        nO=nO,
+        sparse_linear=SparseLinear(nO=nO),
+    )
+
+
+@registry.architectures("spacy.TextCatBOW.v3")
+def build_bow_text_classifier_v3(
+    exclusive_classes: bool,
+    ngram_size: int,
+    no_output_layer: bool,
+    length: int = 262144,
+    nO: Optional[int] = None,
+) -> Model[List[Doc], Floats2d]:
+    if length < 1:
+        raise ValueError(Errors.E1056.format(length=length))
+
+    # Find k such that 2**(k-1) < length <= 2**k.
+    length = 2 ** (length - 1).bit_length()
+
+    return _build_bow_text_classifier(
+        exclusive_classes=exclusive_classes,
+        ngram_size=ngram_size,
+        no_output_layer=no_output_layer,
+        nO=nO,
+        sparse_linear=SparseLinear_v2(nO=nO, length=length),
+    )
+
+
+def _build_bow_text_classifier(
+    exclusive_classes: bool,
+    ngram_size: int,
+    no_output_layer: bool,
+    sparse_linear: Model[Tuple[ArrayXd, ArrayXd, ArrayXd], ArrayXd],
+    nO: Optional[int] = None,
+) -> Model[List[Doc], Floats2d]:
     fill_defaults = {"b": 0, "W": 0}
     with Model.define_operators({">>": chain}):
-        cnn = tok2vec >> list2ragged() >> reduce_mean()
-        nI = tok2vec.maybe_get_dim("nO")
+        output_layer = None
+        if not no_output_layer:
+            fill_defaults["b"] = NEG_VALUE
+            output_layer = softmax_activation() if exclusive_classes else Logistic()
+        resizable_layer: Model[Floats2d, Floats2d] = resizable(
+            sparse_linear,
+            resize_layer=partial(resize_linear_weighted, fill_defaults=fill_defaults),
+        )
+        model = extract_ngrams(ngram_size, attr=ORTH) >> resizable_layer
+        model = with_cpu(model, model.ops)
+        if output_layer:
+            model = model >> with_cpu(output_layer, output_layer.ops)
+    if nO is not None:
+        model.set_dim("nO", cast(int, nO))
+    model.set_ref("output_layer", sparse_linear)
+    model.attrs["multi_label"] = not exclusive_classes
+    model.attrs["resize_output"] = partial(
+        resize_and_set_ref, resizable_layer=resizable_layer
+    )
+    return model
+
+
+@registry.architectures("spacy.TextCatEnsemble.v2")
+def build_text_classifier_v2(
+    tok2vec: Model[List[Doc], List[Floats2d]],
+    linear_model: Model[List[Doc], Floats2d],
+    nO: Optional[int] = None,
+) -> Model[List[Doc], Floats2d]:
+    width = tok2vec.maybe_get_dim("nO")
+    exclusive_classes = not linear_model.attrs["multi_label"]
+    parametric_attention = _build_parametric_attention_with_residual_nonlinear(
+        tok2vec=tok2vec,
+        nonlinear_layer=Maxout(nI=width, nO=width),
+        key_transform=noop(),
+    )
+    with Model.define_operators({">>": chain, "|": concatenate}):
+        nO_double = nO * 2 if nO else None
+        if exclusive_classes:
+            output_layer = Softmax(nO=nO, nI=nO_double)
+        else:
+            output_layer = Linear(nO=nO, nI=nO_double) >> Logistic()
+        model = (linear_model | parametric_attention) >> output_layer
+        model.set_ref("tok2vec", tok2vec)
+    if model.has_dim("nO") is not False and nO is not None:
+        model.set_dim("nO", cast(int, nO))
+    model.set_ref("output_layer", linear_model.get_ref("output_layer"))
+    model.attrs["multi_label"] = not exclusive_classes
+
+    return model
+
+
+@registry.architectures("spacy.TextCatLowData.v1")
+def build_text_classifier_lowdata(
+    width: int, dropout: Optional[float], nO: Optional[int] = None
+) -> Model[List[Doc], Floats2d]:
+    # Don't document this yet, I'm not sure it's right.
+    # Note, before v.3, this was the default if setting "low_data" and "pretrained_dims"
+    with Model.define_operators({">>": chain, "**": clone}):
+        model = (
+            StaticVectors(width)
+            >> list2ragged()
+            >> ParametricAttention(width)
+            >> reduce_sum()
+            >> residual(Relu(width, width)) ** 2
+            >> Linear(nO, width)
+        )
+        if dropout:
+            model = model >> Dropout(dropout)
+        model = model >> Logistic()
+    return model
+
+
+@registry.architectures("spacy.TextCatParametricAttention.v1")
+def build_textcat_parametric_attention_v1(
+    tok2vec: Model[List[Doc], List[Floats2d]],
+    exclusive_classes: bool,
+    nO: Optional[int] = None,
+) -> Model[List[Doc], Floats2d]:
+    width = tok2vec.maybe_get_dim("nO")
+    parametric_attention = _build_parametric_attention_with_residual_nonlinear(
+        tok2vec=tok2vec,
+        nonlinear_layer=Maxout(nI=width, nO=width),
+        key_transform=Gelu(nI=width, nO=width),
+    )
+    with Model.define_operators({">>": chain}):
+        if exclusive_classes:
+            output_layer = Softmax(nO=nO)
+        else:
+            output_layer = Linear(nO=nO) >> Logistic()
+        model = parametric_attention >> output_layer
+    if model.has_dim("nO") is not False and nO is not None:
+        model.set_dim("nO", cast(int, nO))
+    model.set_ref("output_layer", output_layer)
+    model.attrs["multi_label"] = not exclusive_classes
+
+    return model
+
+
+def _build_parametric_attention_with_residual_nonlinear(
+    *,
+    tok2vec: Model[List[Doc], List[Floats2d]],
+    nonlinear_layer: Model[Floats2d, Floats2d],
+    key_transform: Optional[Model[Floats2d, Floats2d]] = None,
+) -> Model[List[Doc], Floats2d]:
+    with Model.define_operators({">>": chain, "|": concatenate}):
+        width = tok2vec.maybe_get_dim("nO")
+        attention_layer = ParametricAttention_v2(nO=width, key_transform=key_transform)
+        norm_layer = LayerNorm(nI=width)
+        parametric_attention = (
+            tok2vec
+            >> list2ragged()
+            >> attention_layer
+            >> reduce_sum()
+            >> residual(nonlinear_layer >> norm_layer >> Dropout(0.0))
+        )
+
+        parametric_attention.init = _init_parametric_attention_with_residual_nonlinear
+
+        parametric_attention.set_ref("tok2vec", tok2vec)
+        parametric_attention.set_ref("attention_layer", attention_layer)
+        parametric_attention.set_ref("nonlinear_layer", nonlinear_layer)
+        parametric_attention.set_ref("norm_layer", norm_layer)
+
+    return parametric_attention
+
+
+def _init_parametric_attention_with_residual_nonlinear(model, X, Y) -> Model:
+    tok2vec_width = get_tok2vec_width(model)
+    model.get_ref("attention_layer").set_dim("nO", tok2vec_width)
+    model.get_ref("nonlinear_layer").set_dim("nO", tok2vec_width)
+    model.get_ref("nonlinear_layer").set_dim("nI", tok2vec_width)
+    model.get_ref("norm_layer").set_dim("nI", tok2vec_width)
+    model.get_ref("norm_layer").set_dim("nO", tok2vec_width)
+    init_chain(model, X, Y)
+    return model
+
+
+@registry.architectures("spacy.TextCatReduce.v1")
+def build_reduce_text_classifier(
+    tok2vec: Model,
+    exclusive_classes: bool,
+    use_reduce_first: bool,
+    use_reduce_last: bool,
+    use_reduce_max: bool,
+    use_reduce_mean: bool,
+    nO: Optional[int] = None,
+) -> Model[List[Doc], Floats2d]:
+    """Build a model that classifies pooled `Doc` representations.
+
+    Pooling is performed using reductions. Reductions are concatenated when
+    multiple reductions are used.
+
+    tok2vec (Model): the tok2vec layer to pool over.
+    exclusive_classes (bool): Whether or not classes are mutually exclusive.
+    use_reduce_first (bool): Pool by using the hidden representation of the
+        first token of a `Doc`.
+    use_reduce_last (bool): Pool by using the hidden representation of the
+        last token of a `Doc`.
+    use_reduce_max (bool): Pool by taking the maximum values of the hidden
+        representations of a `Doc`.
+    use_reduce_mean (bool): Pool by taking the mean of all hidden
+        representations of a `Doc`.
+    nO (Optional[int]): Number of classes.
+    """
+
+    fill_defaults = {"b": 0, "W": 0}
+    reductions = []
+    if use_reduce_first:
+        reductions.append(reduce_first())
+    if use_reduce_last:
+        reductions.append(reduce_last())
+    if use_reduce_max:
+        reductions.append(reduce_max())
+    if use_reduce_mean:
+        reductions.append(reduce_mean())
+
+    if not len(reductions):
+        raise ValueError(Errors.E1057)
+
+    with Model.define_operators({">>": chain}):
+        cnn = tok2vec >> list2ragged() >> concatenate(*reductions)
+        nO_tok2vec = tok2vec.maybe_get_dim("nO")
+        nI = nO_tok2vec * len(reductions) if nO_tok2vec is not None else None
         if exclusive_classes:
             output_layer = Softmax(nO=nO, nI=nI)
             fill_defaults["b"] = NEG_VALUE
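The power-of-two rounding added in `TextCatBOW.v3` above ("Find k such that 2**(k-1) < length <= 2**k") can be sanity-checked in plain Python, with no spaCy installed (`round_up_to_pow2` is an illustrative name, not the library's):

```python
def round_up_to_pow2(length: int) -> int:
    # Round the hash-table length up to the next power of two, as
    # build_bow_text_classifier_v3 does: 2 ** (length - 1).bit_length().
    # A length that is already a power of two is returned unchanged.
    return 2 ** (length - 1).bit_length()


print(round_up_to_pow2(262144))  # 262144 (the default, already 2**18)
print(round_up_to_pow2(100000))  # 131072 (rounded up to 2**17)
```

Rounding the table size to a power of two lets the hashing layer use a cheap bit-mask instead of a modulo when mapping n-gram hashes to rows.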
@@ -80,113 +333,3 @@ def build_simple_cnn_text_classifier(
         model.set_dim("nO", cast(int, nO))
     model.attrs["multi_label"] = not exclusive_classes
     return model
-
-
-def resize_and_set_ref(model, new_nO, resizable_layer):
-    resizable_layer = resize_model(resizable_layer, new_nO)
-    model.set_ref("output_layer", resizable_layer.layers[0])
-    model.set_dim("nO", new_nO, force=True)
-    return model
-
-
-@registry.architectures("spacy.TextCatBOW.v2")
-def build_bow_text_classifier(
-    exclusive_classes: bool,
-    ngram_size: int,
-    no_output_layer: bool,
-    nO: Optional[int] = None,
-) -> Model[List[Doc], Floats2d]:
-    fill_defaults = {"b": 0, "W": 0}
-    with Model.define_operators({">>": chain}):
-        sparse_linear = SparseLinear(nO=nO)
-        output_layer = None
-        if not no_output_layer:
-            fill_defaults["b"] = NEG_VALUE
-            output_layer = softmax_activation() if exclusive_classes else Logistic()
-        resizable_layer: Model[Floats2d, Floats2d] = resizable(
-            sparse_linear,
-            resize_layer=partial(resize_linear_weighted, fill_defaults=fill_defaults),
-        )
-        model = extract_ngrams(ngram_size, attr=ORTH) >> resizable_layer
-        model = with_cpu(model, model.ops)
-        if output_layer:
-            model = model >> with_cpu(output_layer, output_layer.ops)
-    if nO is not None:
-        model.set_dim("nO", cast(int, nO))
-    model.set_ref("output_layer", sparse_linear)
-    model.attrs["multi_label"] = not exclusive_classes
-    model.attrs["resize_output"] = partial(
-        resize_and_set_ref, resizable_layer=resizable_layer
-    )
-    return model
-
-
-@registry.architectures("spacy.TextCatEnsemble.v2")
-def build_text_classifier_v2(
-    tok2vec: Model[List[Doc], List[Floats2d]],
-    linear_model: Model[List[Doc], Floats2d],
-    nO: Optional[int] = None,
-) -> Model[List[Doc], Floats2d]:
-    exclusive_classes = not linear_model.attrs["multi_label"]
-    with Model.define_operators({">>": chain, "|": concatenate}):
-        width = tok2vec.maybe_get_dim("nO")
-        attention_layer = ParametricAttention(width)
-        maxout_layer = Maxout(nO=width, nI=width)
-        norm_layer = LayerNorm(nI=width)
-        cnn_model = (
-            tok2vec
-            >> list2ragged()
-            >> attention_layer
-            >> reduce_sum()
-            >> residual(maxout_layer >> norm_layer >> Dropout(0.0))
-        )
-
-        nO_double = nO * 2 if nO else None
-        if exclusive_classes:
-            output_layer = Softmax(nO=nO, nI=nO_double)
-        else:
-            output_layer = Linear(nO=nO, nI=nO_double) >> Logistic()
-        model = (linear_model | cnn_model) >> output_layer
-        model.set_ref("tok2vec", tok2vec)
-    if model.has_dim("nO") is not False and nO is not None:
-        model.set_dim("nO", cast(int, nO))
-    model.set_ref("output_layer", linear_model.get_ref("output_layer"))
-    model.set_ref("attention_layer", attention_layer)
-    model.set_ref("maxout_layer", maxout_layer)
-    model.set_ref("norm_layer", norm_layer)
-    model.attrs["multi_label"] = not exclusive_classes
-
-    model.init = init_ensemble_textcat  # type: ignore[assignment]
-    return model
-
-
-def init_ensemble_textcat(model, X, Y) -> Model:
-    tok2vec_width = get_tok2vec_width(model)
-    model.get_ref("attention_layer").set_dim("nO", tok2vec_width)
-    model.get_ref("maxout_layer").set_dim("nO", tok2vec_width)
-    model.get_ref("maxout_layer").set_dim("nI", tok2vec_width)
-    model.get_ref("norm_layer").set_dim("nI", tok2vec_width)
-    model.get_ref("norm_layer").set_dim("nO", tok2vec_width)
-    init_chain(model, X, Y)
-    return model
-
-
-@registry.architectures("spacy.TextCatLowData.v1")
-def build_text_classifier_lowdata(
-    width: int, dropout: Optional[float], nO: Optional[int] = None
-) -> Model[List[Doc], Floats2d]:
-    # Don't document this yet, I'm not sure it's right.
-    # Note, before v.3, this was the default if setting "low_data" and "pretrained_dims"
-    with Model.define_operators({">>": chain, "**": clone}):
-        model = (
-            StaticVectors(width)
-            >> list2ragged()
-            >> ParametricAttention(width)
-            >> reduce_sum()
-            >> residual(Relu(width, width)) ** 2
-            >> Linear(nO, width)
-        )
-        if dropout:
-            model = model >> Dropout(dropout)
-        model = model >> Logistic()
-    return model
@@ -67,8 +67,8 @@ def build_hash_embed_cnn_tok2vec(
         are between 2 and 8.
     window_size (int): The number of tokens on either side to concatenate during
         the convolutions. The receptive field of the CNN will be
-        depth * (window_size * 2 + 1), so a 4-layer network with window_size of
-        2 will be sensitive to 20 words at a time. Recommended value is 1.
+        depth * window_size * 2 + 1, so a 4-layer network with window_size of
+        2 will be sensitive to 17 words at a time. Recommended value is 1.
     embed_size (int): The number of rows in the hash embedding tables. This can
         be surprisingly small, due to the use of the hash embeddings. Recommended
         values are between 2000 and 10000.
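The corrected receptive-field formula in the docstring above is easy to verify: each convolution layer widens the context by `window_size` tokens on each side, and the center token is counted once, not once per layer. A quick standalone sketch (`receptive_field` is an illustrative name, not a spaCy function):

```python
def receptive_field(depth: int, window_size: int) -> int:
    # Each of `depth` stacked conv layers adds `window_size` tokens of
    # context on both sides; the token itself contributes the "+ 1".
    return depth * window_size * 2 + 1


print(receptive_field(depth=4, window_size=2))  # 17, as the docstring says
```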
49  spacy/ml/parser_model.pxd  Normal file
@@ -0,0 +1,49 @@
from libc.string cimport memcpy, memset
from thinc.backends.cblas cimport CBlas

from ..pipeline._parser_internals._state cimport StateC
from ..typedefs cimport hash_t, weight_t


cdef struct SizesC:
    int states
    int classes
    int hiddens
    int pieces
    int feats
    int embed_width


cdef struct WeightsC:
    const float* feat_weights
    const float* feat_bias
    const float* hidden_bias
    const float* hidden_weights
    const float* seen_classes


cdef struct ActivationsC:
    int* token_ids
    float* unmaxed
    float* scores
    float* hiddens
    int* is_valid
    int _curr_size
    int _max_size


cdef WeightsC get_c_weights(model) except *

cdef SizesC get_c_sizes(model, int batch_size) except *

cdef ActivationsC alloc_activations(SizesC n) nogil

cdef void free_activations(const ActivationsC* A) nogil

cdef void predict_states(CBlas cblas, ActivationsC* A, StateC** states,
                         const WeightsC* W, SizesC n) nogil

cdef int arg_max_if_valid(const weight_t* scores, const int* is_valid, int n) nogil

cdef void cpu_log_loss(float* d_scores, const float* costs,
                       const int* is_valid, const float* scores, int O) nogil
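The `ActivationsC` buffers declared above are sized `states * hiddens * pieces` (`unmaxed`) and `states * hiddens` (`hiddens`): the parser computes several "pieces" per hidden unit and keeps the best one, i.e. a maxout. A numpy sketch of that reduction (illustrative only; the real implementation is the Cython `predict_states` in `parser_model.pyx`):

```python
import numpy as np

# Toy sizes standing in for SizesC.states / .hiddens / .pieces.
states, hiddens, pieces = 2, 3, 4

# `unmaxed` plays the role of ActivationsC.unmaxed: one score per
# (state, hidden unit, maxout piece).
unmaxed = np.arange(states * hiddens * pieces, dtype="float32").reshape(
    states, hiddens, pieces
)

# Maxout: keep the best piece for each hidden unit of each state.
hidden = unmaxed.max(axis=-1)
print(hidden.shape)  # (2, 3)
```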
500
spacy/ml/parser_model.pyx
Normal file
500
spacy/ml/parser_model.pyx
Normal file
|
@ -0,0 +1,500 @@
|
||||||
|
# cython: infer_types=True, cdivision=True, boundscheck=False
|
||||||
|
# cython: profile=False
|
||||||
|
cimport numpy as np
|
||||||
|
from libc.math cimport exp
|
||||||
|
from libc.stdlib cimport calloc, free, realloc
|
||||||
|
from libc.string cimport memcpy, memset
|
||||||
|
from thinc.backends.cblas cimport saxpy, sgemm
|
||||||
|
|
||||||
|
import numpy
|
||||||
|
import numpy.random
|
||||||
|
from thinc.api import CupyOps, Model, NumpyOps, get_ops
|
||||||
|
|
||||||
|
from .. import util
|
||||||
|
from ..errors import Errors
|
||||||
|
|
||||||
|
from ..pipeline._parser_internals.stateclass cimport StateClass
|
||||||
|
from ..typedefs cimport weight_t
|
||||||
|
|
||||||
|
|
||||||
|
cdef WeightsC get_c_weights(model) except *:
|
||||||
|
cdef WeightsC output
|
||||||
|
cdef precompute_hiddens state2vec = model.state2vec
|
||||||
|
output.feat_weights = state2vec.get_feat_weights()
|
||||||
|
output.feat_bias = <const float*>state2vec.bias.data
|
||||||
|
cdef np.ndarray vec2scores_W
|
||||||
|
cdef np.ndarray vec2scores_b
|
||||||
|
if model.vec2scores is None:
|
||||||
|
output.hidden_weights = NULL
|
||||||
|
output.hidden_bias = NULL
|
||||||
|
else:
|
||||||
|
vec2scores_W = model.vec2scores.get_param("W")
|
||||||
|
vec2scores_b = model.vec2scores.get_param("b")
|
||||||
|
output.hidden_weights = <const float*>vec2scores_W.data
|
||||||
|
output.hidden_bias = <const float*>vec2scores_b.data
|
||||||
|
cdef np.ndarray class_mask = model._class_mask
|
||||||
|
output.seen_classes = <const float*>class_mask.data
|
||||||
|
return output
|
||||||
|
|
||||||
|
|
||||||
|
cdef SizesC get_c_sizes(model, int batch_size) except *:
|
||||||
|
cdef SizesC output
|
||||||
|
output.states = batch_size
|
||||||
|
if model.vec2scores is None:
|
||||||
|
output.classes = model.state2vec.get_dim("nO")
|
||||||
|
else:
|
||||||
|
output.classes = model.vec2scores.get_dim("nO")
|
||||||
|
output.hiddens = model.state2vec.get_dim("nO")
|
||||||
|
output.pieces = model.state2vec.get_dim("nP")
|
||||||
|
output.feats = model.state2vec.get_dim("nF")
|
||||||
|
output.embed_width = model.tokvecs.shape[1]
|
||||||
|
return output
|
||||||
|
|
||||||
|
|
||||||
|
cdef ActivationsC alloc_activations(SizesC n) nogil:
|
||||||
|
cdef ActivationsC A
|
||||||
|
memset(&A, 0, sizeof(A))
|
||||||
|
resize_activations(&A, n)
|
||||||
|
return A
|
||||||
|
|
||||||
|
|
||||||
|
cdef void free_activations(const ActivationsC* A) nogil:
|
||||||
|
free(A.token_ids)
|
||||||
|
free(A.scores)
|
||||||
|
free(A.unmaxed)
|
||||||
|
free(A.hiddens)
|
||||||
|
free(A.is_valid)
|
||||||
|
|
||||||
|
|
||||||
|
cdef void resize_activations(ActivationsC* A, SizesC n) nogil:
|
||||||
|
if n.states <= A._max_size:
|
||||||
|
A._curr_size = n.states
|
||||||
|
return
|
||||||
|
if A._max_size == 0:
|
||||||
|
A.token_ids = <int*>calloc(n.states * n.feats, sizeof(A.token_ids[0]))
|
||||||
|
A.scores = <float*>calloc(n.states * n.classes, sizeof(A.scores[0]))
|
||||||
|
A.unmaxed = <float*>calloc(n.states * n.hiddens * n.pieces, sizeof(A.unmaxed[0]))
|
||||||
|
A.hiddens = <float*>calloc(n.states * n.hiddens, sizeof(A.hiddens[0]))
|
||||||
|
A.is_valid = <int*>calloc(n.states * n.classes, sizeof(A.is_valid[0]))
|
||||||
|
A._max_size = n.states
|
||||||
|
else:
|
||||||
|
A.token_ids = <int*>realloc(A.token_ids,
|
||||||
|
n.states * n.feats * sizeof(A.token_ids[0]))
|
||||||
|
A.scores = <float*>realloc(A.scores,
|
||||||
|
n.states * n.classes * sizeof(A.scores[0]))
|
||||||
|
A.unmaxed = <float*>realloc(A.unmaxed,
|
||||||
|
n.states * n.hiddens * n.pieces * sizeof(A.unmaxed[0]))
|
||||||
|
A.hiddens = <float*>realloc(A.hiddens,
|
||||||
|
n.states * n.hiddens * sizeof(A.hiddens[0]))
|
||||||
|
A.is_valid = <int*>realloc(A.is_valid,
|
||||||
|
n.states * n.classes * sizeof(A.is_valid[0]))
|
||||||
|
A._max_size = n.states
|
||||||
|
A._curr_size = n.states
|
||||||
|
|
||||||
|
|
||||||
|
cdef void predict_states(CBlas cblas, ActivationsC* A, StateC** states,
|
||||||
|
const WeightsC* W, SizesC n) nogil:
|
||||||
|
resize_activations(A, n)
|
||||||
|
for i in range(n.states):
|
||||||
|
states[i].set_context_tokens(&A.token_ids[i*n.feats], n.feats)
|
||||||
|
memset(A.unmaxed, 0, n.states * n.hiddens * n.pieces * sizeof(float))
|
||||||
|
memset(A.hiddens, 0, n.states * n.hiddens * sizeof(float))
|
||||||
|
sum_state_features(cblas, A.unmaxed, W.feat_weights, A.token_ids, n.states,
|
||||||
|
n.feats, n.hiddens * n.pieces)
|
||||||
|
for i in range(n.states):
|
||||||
|
saxpy(cblas)(n.hiddens * n.pieces, 1., W.feat_bias, 1,
|
||||||
|
&A.unmaxed[i*n.hiddens*n.pieces], 1)
|
||||||
|
for j in range(n.hiddens):
|
||||||
|
index = i * n.hiddens * n.pieces + j * n.pieces
|
||||||
|
which = _arg_max(&A.unmaxed[index], n.pieces)
|
||||||
|
A.hiddens[i*n.hiddens + j] = A.unmaxed[index + which]
|
||||||
|
memset(A.scores, 0, n.states * n.classes * sizeof(float))
|
||||||
|
if W.hidden_weights == NULL:
|
||||||
|
memcpy(A.scores, A.hiddens, n.states * n.classes * sizeof(float))
|
||||||
|
else:
|
||||||
|
# Compute hidden-to-output
|
||||||
|
sgemm(cblas)(False, True, n.states, n.classes, n.hiddens, 1.0,
|
||||||
|
<const float *>A.hiddens, n.hiddens,
|
||||||
|
<const float *>W.hidden_weights, n.hiddens, 0.0,
|
||||||
|
A.scores, n.classes)
|
||||||
|
# Add bias
|
||||||
|
for i in range(n.states):
|
||||||
|
saxpy(cblas)(n.classes, 1., W.hidden_bias, 1, &A.scores[i*n.classes], 1)
|
||||||
|
# Set unseen classes to minimum value
|
||||||
|
i = 0
|
||||||
|
min_ = A.scores[0]
|
||||||
|
for i in range(1, n.states * n.classes):
|
||||||
|
if A.scores[i] < min_:
|
||||||
|
min_ = A.scores[i]
|
||||||
|
for i in range(n.states):
|
||||||
|
for j in range(n.classes):
|
||||||
|
if not W.seen_classes[j]:
|
||||||
|
A.scores[i*n.classes+j] = min_
|
||||||
|
|
||||||
|
|
||||||
|
cdef void sum_state_features(CBlas cblas, float* output, const float* cached,
|
||||||
|
const int* token_ids, int B, int F, int O) nogil:
|
||||||
|
cdef int idx, b, f
|
||||||
|
cdef const float* feature
|
||||||
|
padding = cached
|
||||||
|
cached += F * O
|
||||||
|
cdef int id_stride = F*O
|
||||||
|
cdef float one = 1.
|
||||||
|
for b in range(B):
|
||||||
|
for f in range(F):
|
||||||
|
if token_ids[f] < 0:
|
||||||
|
feature = &padding[f*O]
|
||||||
|
else:
|
||||||
|
idx = token_ids[f] * id_stride + f*O
|
||||||
|
feature = &cached[idx]
|
||||||
|
saxpy(cblas)(O, one, <const float*>feature, 1, &output[b*O], 1)
|
||||||
|
token_ids += F
|
||||||
|
|
||||||
|
|
||||||
|
cdef void cpu_log_loss(float* d_scores, const float* costs, const int* is_valid,
|
||||||
|
const float* scores, int O) nogil:
|
||||||
|
"""Do multi-label log loss"""
|
||||||
|
cdef double max_, gmax, Z, gZ
|
||||||
|
best = arg_max_if_gold(scores, costs, is_valid, O)
|
||||||
|
guess = _arg_max(scores, O)
|
||||||
|
|
||||||
|
if best == -1 or guess == -1:
|
||||||
|
# These shouldn't happen, but if they do, we want to make sure we don't
|
||||||
|
# cause an OOB access.
|
||||||
|
return
|
||||||
|
Z = 1e-10
|
||||||
|
gZ = 1e-10
|
||||||
|
max_ = scores[guess]
|
||||||
|
gmax = scores[best]
|
||||||
|
for i in range(O):
|
||||||
|
Z += exp(scores[i] - max_)
|
||||||
|
if costs[i] <= costs[best]:
|
||||||
|
gZ += exp(scores[i] - gmax)
|
||||||
|
for i in range(O):
|
||||||
|
if costs[i] <= costs[best]:
|
||||||
|
d_scores[i] = (exp(scores[i]-max_) / Z) - (exp(scores[i]-gmax)/gZ)
|
||||||
|
else:
|
||||||
|
d_scores[i] = exp(scores[i]-max_) / Z
|
||||||
|
|
||||||
|
|
||||||
|
cdef int arg_max_if_gold(const weight_t* scores, const weight_t* costs,
|
||||||
|
const int* is_valid, int n) nogil:
|
||||||
|
# Find minimum cost
|
||||||
|
cdef float cost = 1
|
||||||
|
for i in range(n):
|
||||||
|
if is_valid[i] and costs[i] < cost:
|
||||||
|
cost = costs[i]
|
||||||
|
# Now find best-scoring with that cost
|
||||||
|
cdef int best = -1
|
||||||
|
for i in range(n):
|
||||||
|
if costs[i] <= cost and is_valid[i]:
|
||||||
|
if best == -1 or scores[i] > scores[best]:
|
||||||
|
best = i
|
||||||
|
return best
|
||||||
|
|
||||||
|
|
||||||
|
cdef int arg_max_if_valid(const weight_t* scores, const int* is_valid, int n) nogil:
|
||||||
|
cdef int best = -1
|
||||||
|
for i in range(n):
|
||||||
|
if is_valid[i] >= 1:
|
||||||
|
if best == -1 or scores[i] > scores[best]:
|
||||||
|
best = i
|
||||||
|
return best
|
||||||
|
|
||||||
|
|
||||||
|
class ParserStepModel(Model):
|
||||||
|
    def __init__(self, docs, layers, *, has_upper, unseen_classes=None, train=True,
                 dropout=0.1):
        Model.__init__(self, name="parser_step_model", forward=step_forward)
        self.attrs["has_upper"] = has_upper
        self.attrs["dropout_rate"] = dropout
        self.tokvecs, self.bp_tokvecs = layers[0](docs, is_train=train)
        if layers[1].get_dim("nP") >= 2:
            activation = "maxout"
        elif has_upper:
            activation = None
        else:
            activation = "relu"
        self.state2vec = precompute_hiddens(len(docs), self.tokvecs, layers[1],
                                            activation=activation, train=train)
        if has_upper:
            self.vec2scores = layers[-1]
        else:
            self.vec2scores = None
        self.cuda_stream = util.get_cuda_stream(non_blocking=True)
        self.backprops = []
        self._class_mask = numpy.zeros((self.nO,), dtype='f')
        self._class_mask.fill(1)
        if unseen_classes is not None:
            for class_ in unseen_classes:
                self._class_mask[class_] = 0.

    def clear_memory(self):
        del self.tokvecs
        del self.bp_tokvecs
        del self.state2vec
        del self.backprops
        del self._class_mask

    @property
    def nO(self):
        if self.attrs["has_upper"]:
            return self.vec2scores.get_dim("nO")
        else:
            return self.state2vec.get_dim("nO")

    def class_is_unseen(self, class_):
        return self._class_mask[class_]

    def mark_class_unseen(self, class_):
        self._class_mask[class_] = 0

    def mark_class_seen(self, class_):
        self._class_mask[class_] = 1

    def get_token_ids(self, states):
        cdef StateClass state
        states = [state for state in states if not state.is_final()]
        cdef np.ndarray ids = numpy.zeros((len(states), self.state2vec.nF),
                                          dtype='i', order='C')
        ids.fill(-1)
        c_ids = <int*>ids.data
        for state in states:
            state.c.set_context_tokens(c_ids, ids.shape[1])
            c_ids += ids.shape[1]
        return ids

    def backprop_step(self, token_ids, d_vector, get_d_tokvecs):
        if isinstance(self.state2vec.ops, CupyOps) \
                and not isinstance(token_ids, self.state2vec.ops.xp.ndarray):
            # Move token_ids and d_vector to GPU, asynchronously
            self.backprops.append((
                util.get_async(self.cuda_stream, token_ids),
                util.get_async(self.cuda_stream, d_vector),
                get_d_tokvecs
            ))
        else:
            self.backprops.append((token_ids, d_vector, get_d_tokvecs))

    def finish_steps(self, golds):
        # Add a padding vector to the d_tokvecs gradient, so that missing
        # values don't affect the real gradient.
        d_tokvecs = self.ops.alloc((self.tokvecs.shape[0]+1, self.tokvecs.shape[1]))
        # Tell CUDA to block, so our async copies complete.
        if self.cuda_stream is not None:
            self.cuda_stream.synchronize()
        for ids, d_vector, bp_vector in self.backprops:
            d_state_features = bp_vector((d_vector, ids))
            ids = ids.flatten()
            d_state_features = d_state_features.reshape(
                (ids.size, d_state_features.shape[2]))
            self.ops.scatter_add(d_tokvecs, ids, d_state_features)
        # Padded -- see update()
        self.bp_tokvecs(d_tokvecs[:-1])
        return d_tokvecs

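The `finish_steps` method above relies on a padding trick: the token-vector gradient gets one extra row, so context slots filled with -1 (missing tokens) scatter into the padding row instead of corrupting a real token's gradient, and that row is dropped before backprop. A minimal NumPy sketch of the idea (function name and shapes are illustrative, not spaCy API):

```python
import numpy as np


def accumulate_gradients(n_tokens, width, backprops):
    """Accumulate per-state feature gradients into per-token gradients.

    ids of -1 mark missing context tokens; with one extra padding row,
    negative indices wrap around to that row, which is dropped at the end.
    """
    d_tokvecs = np.zeros((n_tokens + 1, width), dtype="f")
    for ids, d_state_features in backprops:
        ids = ids.flatten()
        d_state_features = d_state_features.reshape((ids.size, width))
        # np.add.at is the NumPy analogue of ops.scatter_add
        np.add.at(d_tokvecs, ids, d_state_features)
    return d_tokvecs[:-1]  # drop the padding row
```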
NUMPY_OPS = NumpyOps()


def step_forward(model: ParserStepModel, states, is_train):
    token_ids = model.get_token_ids(states)
    vector, get_d_tokvecs = model.state2vec(token_ids, is_train)
    mask = None
    if model.attrs["has_upper"]:
        dropout_rate = model.attrs["dropout_rate"]
        if is_train and dropout_rate > 0:
            mask = NUMPY_OPS.get_dropout_mask(vector.shape, dropout_rate)
            vector *= mask
        scores, get_d_vector = model.vec2scores(vector, is_train)
    else:
        scores = NumpyOps().asarray(vector)
        def get_d_vector(d_scores): return d_scores
    # If a class is unseen, make sure its score is the minimum
    scores[:, model._class_mask == 0] = numpy.nanmin(scores)

    def backprop_parser_step(d_scores):
        # Zero gradients for unseen classes
        d_scores *= model._class_mask
        d_vector = get_d_vector(d_scores)
        if mask is not None:
            d_vector *= mask
        model.backprop_step(token_ids, d_vector, get_d_tokvecs)
        return None
    return scores, backprop_parser_step

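`step_forward` above pushes the scores of unseen classes down to the batch minimum on the forward pass and zeroes their gradient on the way back. A self-contained sketch of that masking (names are illustrative):

```python
import numpy as np


def mask_unseen(scores, class_mask):
    """Forward: clamp unseen classes (mask == 0) to the minimum score.
    Backward: zero the gradient for those classes."""
    scores = scores.copy()
    scores[:, class_mask == 0] = np.nanmin(scores)

    def backprop(d_scores):
        return d_scores * class_mask

    return scores, backprop
```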
cdef class precompute_hiddens:
    """Allow a model to be "primed" by pre-computing input features in bulk.

    This is used for the parser, where we want to take a batch of documents,
    and compute vectors for each (token, position) pair. These vectors can then
    be reused, especially for beam-search.

    Let's say we're using 12 features for each state, e.g. word at start of
    buffer, three words on stack, their children, etc. In the normal arc-eager
    system, a document of length N is processed in 2*N states. This means we'll
    create 2*N*12 feature vectors --- but if we pre-compute, we only need
    N*12 vector computations. The saving for beam-search is much better:
    if we have a beam of k, we'll normally make 2*N*12*K computations --
    so we can save the factor k. This also gives a nice CPU/GPU division:
    we can do all our hard maths up front, packed into large multiplications,
    and do the hard-to-program parsing on the CPU.
    """
    cdef readonly int nF, nO, nP
    cdef bint _is_synchronized
    cdef public object ops
    cdef public object numpy_ops
    cdef public object _cpu_ops
    cdef np.ndarray _features
    cdef np.ndarray _cached
    cdef np.ndarray bias
    cdef object _cuda_stream
    cdef object _bp_hiddens
    cdef object activation

    def __init__(self, batch_size, tokvecs, lower_model, cuda_stream=None,
                 activation="maxout", train=False):
        gpu_cached, bp_features = lower_model(tokvecs, train)
        cdef np.ndarray cached
        if not isinstance(gpu_cached, numpy.ndarray):
            # Note the passing of cuda_stream here: it lets
            # cupy make the copy asynchronously.
            # We then have to block before first use.
            cached = gpu_cached.get(stream=cuda_stream)
        else:
            cached = gpu_cached
        if not isinstance(lower_model.get_param("b"), numpy.ndarray):
            self.bias = lower_model.get_param("b").get(stream=cuda_stream)
        else:
            self.bias = lower_model.get_param("b")
        self.nF = cached.shape[1]
        if lower_model.has_dim("nP"):
            self.nP = lower_model.get_dim("nP")
        else:
            self.nP = 1
        self.nO = cached.shape[2]
        self.ops = lower_model.ops
        self.numpy_ops = NumpyOps()
        self._cpu_ops = get_ops("cpu") if isinstance(self.ops, CupyOps) else self.ops
        assert activation in (None, "relu", "maxout")
        self.activation = activation
        self._is_synchronized = False
        self._cuda_stream = cuda_stream
        self._cached = cached
        self._bp_hiddens = bp_features

    cdef const float* get_feat_weights(self) except NULL:
        if not self._is_synchronized and self._cuda_stream is not None:
            self._cuda_stream.synchronize()
            self._is_synchronized = True
        return <float*>self._cached.data

    def has_dim(self, name):
        if name == "nF":
            return self.nF if self.nF is not None else True
        elif name == "nP":
            return self.nP if self.nP is not None else True
        elif name == "nO":
            return self.nO if self.nO is not None else True
        else:
            return False

    def get_dim(self, name):
        if name == "nF":
            return self.nF
        elif name == "nP":
            return self.nP
        elif name == "nO":
            return self.nO
        else:
            raise ValueError(Errors.E1033.format(name=name))

    def set_dim(self, name, value):
        if name == "nF":
            self.nF = value
        elif name == "nP":
            self.nP = value
        elif name == "nO":
            self.nO = value
        else:
            raise ValueError(Errors.E1033.format(name=name))

    def __call__(self, X, bint is_train):
        if is_train:
            return self.begin_update(X)
        else:
            return self.predict(X), lambda X: X

    def predict(self, X):
        return self.begin_update(X)[0]

    def begin_update(self, token_ids):
        cdef np.ndarray state_vector = numpy.zeros(
            (token_ids.shape[0], self.nO, self.nP), dtype='f')
        # This is tricky, but (assuming GPU available):
        # - Input to forward on CPU
        # - Output from forward on CPU
        # - Input to backward on GPU!
        # - Output from backward on GPU
        bp_hiddens = self._bp_hiddens

        cdef CBlas cblas = self._cpu_ops.cblas()

        feat_weights = self.get_feat_weights()
        cdef int[:, ::1] ids = token_ids
        sum_state_features(cblas, <float*>state_vector.data,
                           feat_weights, &ids[0, 0], token_ids.shape[0],
                           self.nF, self.nO*self.nP)
        state_vector += self.bias
        state_vector, bp_nonlinearity = self._nonlinearity(state_vector)

        def backward(d_state_vector_ids):
            d_state_vector, token_ids = d_state_vector_ids
            d_state_vector = bp_nonlinearity(d_state_vector)
            d_tokens = bp_hiddens((d_state_vector, token_ids))
            return d_tokens
        return state_vector, backward

    def _nonlinearity(self, state_vector):
        if self.activation == "maxout":
            return self._maxout_nonlinearity(state_vector)
        else:
            return self._relu_nonlinearity(state_vector)

    def _maxout_nonlinearity(self, state_vector):
        state_vector, mask = self.numpy_ops.maxout(state_vector)
        # We're outputting to CPU, but we need this variable on GPU for the
        # backward pass.
        mask = self.ops.asarray(mask)

        def backprop_maxout(d_best):
            return self.ops.backprop_maxout(d_best, mask, self.nP)

        return state_vector, backprop_maxout

    def _relu_nonlinearity(self, state_vector):
        state_vector = state_vector.reshape((state_vector.shape[0], -1))
        mask = state_vector >= 0.
        state_vector *= mask
        # We're outputting to CPU, but we need this variable on GPU for the
        # backward pass.
        mask = self.ops.asarray(mask)

        def backprop_relu(d_best):
            d_best *= mask
            return d_best.reshape((d_best.shape + (1,)))

        return state_vector, backprop_relu

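The `precompute_hiddens` docstring above explains the saving: multiply every token vector by the hidden weights once, then build each state vector by gathering and summing nF cached rows. A NumPy sketch showing that the cached sum reproduces the direct computation (shapes and names are assumptions, not the spaCy layout):

```python
import numpy as np


def precompute(tokvecs, W):
    """Cache per-token contributions: cached[n, f] = tokvecs[n] @ W[f],
    one (nI, nH) weight slice per feature slot f."""
    return np.einsum("ni,fih->nfh", tokvecs, W)


def state_vector(cached, token_ids):
    """Sum the cached contributions for one state's context tokens."""
    nF = cached.shape[1]
    return sum(cached[token_ids[f], f] for f in range(nF))
```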
cdef inline int _arg_max(const float* scores, const int n_classes) nogil:
    if n_classes == 2:
        return 0 if scores[0] > scores[1] else 1
    cdef int i
    cdef int best = 0
    cdef float mode = scores[0]
    for i in range(1, n_classes):
        if scores[i] > mode:
            mode = scores[i]
            best = i
    return best

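For reference, a pure-Python equivalent of the general loop in `_arg_max` (note that the two-class fast path above breaks an exact tie toward class 1, while the general loop keeps the earlier index):

```python
def arg_max(scores):
    """Index of the largest score; ties resolve to the earlier class."""
    best, best_score = 0, scores[0]
    for i, score in enumerate(scores[1:], start=1):
        if score > best_score:
            best, best_score = i, score
    return best
```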
@@ -9,7 +9,7 @@ from thinc.util import partial
 from ..attrs import ORTH
 from ..errors import Errors, Warnings
 from ..tokens import Doc
-from ..vectors import Mode
+from ..vectors import Mode, Vectors
 from ..vocab import Vocab


@@ -48,11 +48,14 @@ def forward(
     key_attr: int = getattr(vocab.vectors, "attr", ORTH)
     keys = model.ops.flatten([cast(Ints1d, doc.to_array(key_attr)) for doc in docs])
     W = cast(Floats2d, model.ops.as_contig(model.get_param("W")))
-    if vocab.vectors.mode == Mode.default:
+    if isinstance(vocab.vectors, Vectors) and vocab.vectors.mode == Mode.default:
         V = model.ops.asarray(vocab.vectors.data)
         rows = vocab.vectors.find(keys=keys)
         V = model.ops.as_contig(V[rows])
-    elif vocab.vectors.mode == Mode.floret:
+    elif isinstance(vocab.vectors, Vectors) and vocab.vectors.mode == Mode.floret:
+        V = vocab.vectors.get_batch(keys)
+        V = model.ops.as_contig(V)
+    elif hasattr(vocab.vectors, "get_batch"):
         V = vocab.vectors.get_batch(keys)
         V = model.ops.as_contig(V)
     else:
@@ -61,7 +64,7 @@ def forward(
         vectors_data = model.ops.gemm(V, W, trans2=True)
     except ValueError:
         raise RuntimeError(Errors.E896)
-    if vocab.vectors.mode == Mode.default:
+    if isinstance(vocab.vectors, Vectors) and vocab.vectors.mode == Mode.default:
         # Convert negative indices to 0-vectors
         # TODO: more options for UNK tokens
         vectors_data[rows < 0] = 0

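The hunks above wrap the mode-specific branches in `isinstance(vocab.vectors, Vectors)` checks and add a duck-typed `get_batch` fallback, so user-provided vector stores that are not `Vectors` instances keep working. A self-contained sketch of that dispatch with stand-in classes (not the real spaCy API):

```python
class FakeVectors:
    """Stand-in for the spaCy Vectors table (illustrative, not the real API)."""
    mode = "default"

    def __init__(self, data):
        self.data = data

    def find(self, keys):
        return list(keys)


class CustomStore:
    """A user-provided store that only implements get_batch()."""

    def get_batch(self, keys):
        return [[float(k)] for k in keys]


def batch_vectors(vectors, keys):
    # Mirrors the dispatch added in the diff: known Vectors tables are
    # handled per mode; anything else is duck-typed via get_batch().
    if isinstance(vectors, FakeVectors) and vectors.mode == "default":
        rows = vectors.find(keys)
        return [vectors.data[r] for r in rows]
    elif hasattr(vectors, "get_batch"):
        return vectors.get_batch(keys)
    raise ValueError("unsupported vectors table")
```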
@@ -1,28 +0,0 @@
from libc.stdint cimport int8_t


cdef struct SizesC:
    int states
    int classes
    int hiddens
    int pieces
    int feats
    int embed_width
    int tokens


cdef struct WeightsC:
    const float* feat_weights
    const float* feat_bias
    const float* hidden_bias
    const float* hidden_weights
    const int8_t* seen_mask


cdef struct ActivationsC:
    int* token_ids
    float* unmaxed
    float* hiddens
    int* is_valid
    int _curr_size
    int _max_size
51
spacy/ml/tb_framework.py
Normal file
@@ -0,0 +1,51 @@
from thinc.api import Model, noop

from ..util import registry
from .parser_model import ParserStepModel


@registry.layers("spacy.TransitionModel.v1")
def TransitionModel(
    tok2vec, lower, upper, resize_output, dropout=0.2, unseen_classes=set()
):
    """Set up a stepwise transition-based model"""
    if upper is None:
        has_upper = False
        upper = noop()
    else:
        has_upper = True
    # don't define nO for this object, because we can't dynamically change it
    return Model(
        name="parser_model",
        forward=forward,
        dims={"nI": tok2vec.maybe_get_dim("nI")},
        layers=[tok2vec, lower, upper],
        refs={"tok2vec": tok2vec, "lower": lower, "upper": upper},
        init=init,
        attrs={
            "has_upper": has_upper,
            "unseen_classes": set(unseen_classes),
            "resize_output": resize_output,
        },
    )


def forward(model, X, is_train):
    step_model = ParserStepModel(
        X,
        model.layers,
        unseen_classes=model.attrs["unseen_classes"],
        train=is_train,
        has_upper=model.attrs["has_upper"],
    )

    return step_model, step_model.finish_steps


def init(model, X=None, Y=None):
    model.get_ref("tok2vec").initialize(X=X)
    lower = model.get_ref("lower")
    lower.initialize()
    if model.attrs["has_upper"]:
        statevecs = model.ops.alloc2f(2, lower.get_dim("nO"))
        model.get_ref("upper").initialize(X=statevecs)
@@ -1,641 +0,0 @@
# cython: infer_types=True, cdivision=True, boundscheck=False
from typing import Any, List, Optional, Tuple, cast

from libc.stdlib cimport calloc, free, realloc
from libc.string cimport memcpy, memset
from libcpp.vector cimport vector

import numpy

cimport numpy as np

from thinc.api import (
    Linear,
    Model,
    NumpyOps,
    chain,
    glorot_uniform_init,
    list2array,
    normal_init,
    uniform_init,
    zero_init,
)

from thinc.backends.cblas cimport CBlas, saxpy, sgemm

from thinc.types import Floats2d, Floats3d, Floats4d, Ints1d, Ints2d

from ..errors import Errors
from ..pipeline._parser_internals import _beam_utils
from ..pipeline._parser_internals.batch import GreedyBatch

from ..pipeline._parser_internals._parser_utils cimport arg_max
from ..pipeline._parser_internals.stateclass cimport StateC, StateClass
from ..pipeline._parser_internals.transition_system cimport (
    TransitionSystem,
    c_apply_actions,
    c_transition_batch,
)

from ..tokens.doc import Doc
from ..util import registry

State = Any  # TODO


@registry.layers("spacy.TransitionModel.v2")
def TransitionModel(
    *,
    tok2vec: Model[List[Doc], List[Floats2d]],
    beam_width: int = 1,
    beam_density: float = 0.0,
    state_tokens: int,
    hidden_width: int,
    maxout_pieces: int,
    nO: Optional[int] = None,
    unseen_classes=set(),
) -> Model[Tuple[List[Doc], TransitionSystem], List[Tuple[State, List[Floats2d]]]]:
    """Set up a transition-based parsing model, using a maxout hidden
    layer and a linear output layer.
    """
    t2v_width = tok2vec.get_dim("nO") if tok2vec.has_dim("nO") else None
    tok2vec_projected = chain(tok2vec, list2array(), Linear(hidden_width, t2v_width))  # type: ignore
    tok2vec_projected.set_dim("nO", hidden_width)

    # FIXME: we use `output` as a container for the output layer's
    # weights and biases. Thinc optimizers cannot handle resizing
    # of parameters. So, when the parser model is resized, we
    # construct a new `output` layer, which has a different key in
    # the optimizer. Once the optimizer supports parameter resizing,
    # we can replace the `output` layer by `output_W` and `output_b`
    # parameters in this model.
    output = Linear(nO=None, nI=hidden_width, init_W=zero_init)

    return Model(
        name="parser_model",
        forward=forward,
        init=init,
        layers=[tok2vec_projected, output],
        refs={
            "tok2vec": tok2vec_projected,
            "output": output,
        },
        params={
            "hidden_W": None,  # Floats2d W for the hidden layer
            "hidden_b": None,  # Floats1d bias for the hidden layer
            "hidden_pad": None,  # Floats1d padding for the hidden layer
        },
        dims={
            "nO": None,  # Output size
            "nP": maxout_pieces,
            "nH": hidden_width,
            "nI": tok2vec_projected.maybe_get_dim("nO"),
            "nF": state_tokens,
        },
        attrs={
            "beam_width": beam_width,
            "beam_density": beam_density,
            "unseen_classes": set(unseen_classes),
            "resize_output": resize_output,
        },
    )


def resize_output(model: Model, new_nO: int) -> Model:
    old_nO = model.maybe_get_dim("nO")
    output = model.get_ref("output")
    if old_nO is None:
        model.set_dim("nO", new_nO)
        output.set_dim("nO", new_nO)
        output.initialize()
        return model
    elif new_nO <= old_nO:
        return model
    elif output.has_param("W"):
        nH = model.get_dim("nH")
        new_output = Linear(nO=new_nO, nI=nH, init_W=zero_init)
        new_output.initialize()
        new_W = new_output.get_param("W")
        new_b = new_output.get_param("b")
        old_W = output.get_param("W")
        old_b = output.get_param("b")
        new_W[:old_nO] = old_W  # type: ignore
        new_b[:old_nO] = old_b  # type: ignore
        for i in range(old_nO, new_nO):
            model.attrs["unseen_classes"].add(i)
        model.layers[-1] = new_output
        model.set_ref("output", new_output)
        # TODO: Avoid this private intrusion
        model._dims["nO"] = new_nO
    return model


def init(
    model,
    X: Optional[Tuple[List[Doc], TransitionSystem]] = None,
    Y: Optional[Tuple[List[State], List[Floats2d]]] = None,
):
    if X is not None:
        docs, _ = X
        model.get_ref("tok2vec").initialize(X=docs)
    else:
        model.get_ref("tok2vec").initialize()
    inferred_nO = _infer_nO(Y)
    if inferred_nO is not None:
        current_nO = model.maybe_get_dim("nO")
        if current_nO is None or current_nO != inferred_nO:
            model.attrs["resize_output"](model, inferred_nO)
    nP = model.get_dim("nP")
    nH = model.get_dim("nH")
    nI = model.get_dim("nI")
    nF = model.get_dim("nF")
    ops = model.ops

    Wl = ops.alloc2f(nH * nP, nF * nI)
    bl = ops.alloc1f(nH * nP)
    padl = ops.alloc1f(nI)
    # Wl = zero_init(ops, Wl.shape)
    Wl = glorot_uniform_init(ops, Wl.shape)
    padl = uniform_init(ops, padl.shape)  # type: ignore
    # TODO: Experiment with whether better to initialize output_W
    model.set_param("hidden_W", Wl)
    model.set_param("hidden_b", bl)
    model.set_param("hidden_pad", padl)
    # model = _lsuv_init(model)
    return model


class TransitionModelInputs:
    """
    Input to transition model.
    """

    # dataclass annotation is not yet supported in Cython 0.29.x,
    # so, we'll do something close to it.

    actions: Optional[List[Ints1d]]
    docs: List[Doc]
    max_moves: int
    moves: TransitionSystem
    states: Optional[List[State]]

    __slots__ = [
        "actions",
        "docs",
        "max_moves",
        "moves",
        "states",
    ]

    def __init__(
        self,
        docs: List[Doc],
        moves: TransitionSystem,
        actions: Optional[List[Ints1d]] = None,
        max_moves: int = 0,
        states: Optional[List[State]] = None,
    ):
        """
        actions (Optional[List[Ints1d]]): actions to apply for each Doc.
        docs (List[Doc]): Docs to predict transition sequences for.
        max_moves (int): the maximum number of moves to apply, values less
            than 1 will apply moves to states until they are final states.
        moves (TransitionSystem): the transition system to use when predicting
            the transition sequences.
        states (Optional[List[States]]): the initial states to predict the
            transition sequences for. When absent, the initial states are
            initialized from the provided Docs.
        """
        self.actions = actions
        self.docs = docs
        self.moves = moves
        self.max_moves = max_moves
        self.states = states


def forward(model, inputs: TransitionModelInputs, is_train: bool):
    docs = inputs.docs
    moves = inputs.moves
    actions = inputs.actions

    beam_width = model.attrs["beam_width"]
    hidden_pad = model.get_param("hidden_pad")
    tok2vec = model.get_ref("tok2vec")

    states = moves.init_batch(docs) if inputs.states is None else inputs.states
    tokvecs, backprop_tok2vec = tok2vec(docs, is_train)
    tokvecs = model.ops.xp.vstack((tokvecs, hidden_pad))
    feats, backprop_feats = _forward_precomputable_affine(model, tokvecs, is_train)
    seen_mask = _get_seen_mask(model)

    if not is_train and beam_width == 1 and isinstance(model.ops, NumpyOps):
        # Note: max_moves is only used during training, so we don't need to
        # pass it to the greedy inference path.
        return _forward_greedy_cpu(model, moves, states, feats, seen_mask, actions=actions)
    else:
        return _forward_fallback(model, moves, states, tokvecs, backprop_tok2vec,
                                 feats, backprop_feats, seen_mask, is_train, actions=actions,
                                 max_moves=inputs.max_moves)


def _forward_greedy_cpu(model: Model, TransitionSystem moves, states: List[StateClass], np.ndarray feats,
                        np.ndarray[np.npy_bool, ndim = 1] seen_mask, actions: Optional[List[Ints1d]] = None):
    cdef vector[StateC*] c_states
    cdef StateClass state
    for state in states:
        if not state.is_final():
            c_states.push_back(state.c)
    weights = _get_c_weights(model, <float*>feats.data, seen_mask)
    # Precomputed features have rows for each token, plus one for padding.
    cdef int n_tokens = feats.shape[0] - 1
    sizes = _get_c_sizes(model, c_states.size(), n_tokens)
    cdef CBlas cblas = model.ops.cblas()
    scores = _parse_batch(cblas, moves, &c_states[0], weights, sizes, actions=actions)

    def backprop(dY):
        raise ValueError(Errors.E4004)

    return (states, scores), backprop


cdef list _parse_batch(CBlas cblas, TransitionSystem moves, StateC** states,
                       WeightsC weights, SizesC sizes, actions: Optional[List[Ints1d]]=None):
    cdef int i
    cdef vector[StateC *] unfinished
    cdef ActivationsC activations = _alloc_activations(sizes)
    cdef np.ndarray step_scores
    cdef np.ndarray step_actions

    scores = []
    while sizes.states >= 1 and (actions is None or len(actions) > 0):
        step_scores = numpy.empty((sizes.states, sizes.classes), dtype="f")
        step_actions = actions[0] if actions is not None else None
        assert step_actions is None or step_actions.size == sizes.states, \
            f"number of step actions ({step_actions.size}) must equal number of states ({sizes.states})"
        with nogil:
            _predict_states(cblas, &activations, <float*>step_scores.data, states, &weights, sizes)
        if actions is None:
            # Validate actions, argmax, take action.
            c_transition_batch(moves, states, <const float*>step_scores.data, sizes.classes,
                               sizes.states)
        else:
            c_apply_actions(moves, states, <const int*>step_actions.data, sizes.states)
        for i in range(sizes.states):
            if not states[i].is_final():
                unfinished.push_back(states[i])
        for i in range(unfinished.size()):
            states[i] = unfinished[i]
        sizes.states = unfinished.size()
        scores.append(step_scores)
        unfinished.clear()
        actions = actions[1:] if actions is not None else None
    _free_activations(&activations)

    return scores


def _forward_fallback(
    model: Model,
    moves: TransitionSystem,
    states: List[StateClass],
    tokvecs, backprop_tok2vec,
    feats,
    backprop_feats,
    seen_mask,
    is_train: bool,
    actions: Optional[List[Ints1d]] = None,
    max_moves: int = 0,
):
    nF = model.get_dim("nF")
    output = model.get_ref("output")
    hidden_b = model.get_param("hidden_b")
    nH = model.get_dim("nH")
    nP = model.get_dim("nP")

    beam_width = model.attrs["beam_width"]
    beam_density = model.attrs["beam_density"]

    ops = model.ops

    all_ids = []
    all_which = []
    all_statevecs = []
    all_scores = []
    if beam_width == 1:
        batch = GreedyBatch(moves, states, None)
    else:
        batch = _beam_utils.BeamBatch(
            moves, states, None, width=beam_width, density=beam_density
        )
    arange = ops.xp.arange(nF)
    n_moves = 0
    while not batch.is_done:
        ids = numpy.zeros((len(batch.get_unfinished_states()), nF), dtype="i")
        for i, state in enumerate(batch.get_unfinished_states()):
            state.set_context_tokens(ids, i, nF)
        # Sum the state features, add the bias and apply the activation (maxout)
        # to create the state vectors.
        preacts2f = feats[ids, arange].sum(axis=1)  # type: ignore
        preacts2f += hidden_b
        preacts = ops.reshape3f(preacts2f, preacts2f.shape[0], nH, nP)
        assert preacts.shape[0] == len(batch.get_unfinished_states()), preacts.shape
        statevecs, which = ops.maxout(preacts)
        # We don't use output's backprop, since we want to backprop for
        # all states at once, rather than a single state.
        scores = output.predict(statevecs)
        scores[:, seen_mask] = ops.xp.nanmin(scores)
        # Transition the states, filtering out any that are finished.
        cpu_scores = ops.to_numpy(scores)
        if actions is None:
            batch.advance(cpu_scores)
        else:
            batch.advance_with_actions(actions[0])
            actions = actions[1:]
        all_scores.append(scores)
        if is_train:
            # Remember intermediate results for the backprop.
            all_ids.append(ids)
            all_statevecs.append(statevecs)
            all_which.append(which)
        if n_moves >= max_moves >= 1:
            break
        n_moves += 1

    def backprop_parser(d_states_d_scores):
        ids = ops.xp.vstack(all_ids)
        which = ops.xp.vstack(all_which)
        statevecs = ops.xp.vstack(all_statevecs)
        _, d_scores = d_states_d_scores
        if model.attrs.get("unseen_classes"):
            # If we have a negative gradient (i.e. the probability should
            # increase) on any classes we filtered out as unseen, mark
            # them as seen.
            for clas in set(model.attrs["unseen_classes"]):
                if (d_scores[:, clas] < 0).any():
                    model.attrs["unseen_classes"].remove(clas)
        d_scores *= seen_mask == False  # no-cython-lint
        # Calculate the gradients for the parameters of the output layer.
        # The weight gemm is (nS, nO) @ (nS, nH).T
        output.inc_grad("b", d_scores.sum(axis=0))
        output.inc_grad("W", ops.gemm(d_scores, statevecs, trans1=True))
        # Now calculate d_statevecs, by backproping through the output linear layer.
        # This gemm is (nS, nO) @ (nO, nH)
        output_W = output.get_param("W")
        d_statevecs = ops.gemm(d_scores, output_W)
        # Backprop through the maxout activation
        d_preacts = ops.backprop_maxout(d_statevecs, which, nP)
        d_preacts2f = ops.reshape2f(d_preacts, d_preacts.shape[0], nH * nP)
        model.inc_grad("hidden_b", d_preacts2f.sum(axis=0))
        # We don't need to backprop the summation, because we pass back the IDs instead
        d_state_features = backprop_feats((d_preacts2f, ids))
        d_tokvecs = ops.alloc2f(tokvecs.shape[0], tokvecs.shape[1])
        ops.scatter_add(d_tokvecs, ids, d_state_features)
        model.inc_grad("hidden_pad", d_tokvecs[-1])
        return (backprop_tok2vec(d_tokvecs[:-1]), None)

    return (list(batch), all_scores), backprop_parser


def _get_seen_mask(model: Model) -> numpy.array[bool, 1]:
    mask = model.ops.xp.zeros(model.get_dim("nO"), dtype="bool")
    for class_ in model.attrs.get("unseen_classes", set()):
        mask[class_] = True
    return mask


def _forward_precomputable_affine(model, X: Floats2d, is_train: bool):
    W: Floats2d = model.get_param("hidden_W")
    nF = model.get_dim("nF")
    nH = model.get_dim("nH")
    nP = model.get_dim("nP")
    nI = model.get_dim("nI")
    # The weights start out (nH * nP, nF * nI). Transpose and reshape to (nF * nH * nP, nI)
    W3f = model.ops.reshape3f(W, nH * nP, nF, nI)
    W3f = W3f.transpose((1, 0, 2))
    W2f = model.ops.reshape2f(W3f, nF * nH * nP, nI)
    assert X.shape == (X.shape[0], nI), X.shape
    Yf_ = model.ops.gemm(X, W2f, trans2=True)
    Yf = model.ops.reshape3f(Yf_, Yf_.shape[0], nF, nH * nP)

    def backward(dY_ids: Tuple[Floats3d, Ints2d]):
        # This backprop is particularly tricky, because we get back a different
        # thing from what we put out. We put out an array of shape:
        # (nB, nF, nH, nP), and get back:
        # (nB, nH, nP) and ids (nB, nF)
        # The ids tell us the values of nF, so we would have:
        #
        # dYf = zeros((nB, nF, nH, nP))
        # for b in range(nB):
        #     for f in range(nF):
        #         dYf[b, ids[b, f]] += dY[b]
        #
        # However, we avoid building that array for efficiency -- and just pass
        # in the indices.
        dY, ids = dY_ids
        dXf = model.ops.gemm(dY, W)
        Xf = X[ids].reshape((ids.shape[0], -1))
        dW = model.ops.gemm(dY, Xf, trans1=True)
        model.inc_grad("hidden_W", dW)
        return model.ops.reshape3f(dXf, dXf.shape[0], nF, nI)

    return Yf, backward


def _infer_nO(Y: Optional[Tuple[List[State], List[Floats2d]]]) -> Optional[int]:
    if Y is None:
        return None
    _, scores = Y
    if len(scores) == 0:
        return None
    assert scores[0].shape[0] >= 1
    assert len(scores[0].shape) == 2
    return scores[0].shape[1]
|
||||||
def _lsuv_init(model: Model):
|
|
||||||
"""This is like the 'layer sequential unit variance', but instead
|
|
||||||
of taking the actual inputs, we randomly generate whitened data.
|
|
||||||
|
|
||||||
Why's this all so complicated? We have a huge number of inputs,
|
|
||||||
and the maxout unit makes guessing the dynamics tricky. Instead
|
|
||||||
we set the maxout weights to values that empirically result in
|
|
||||||
whitened outputs given whitened inputs.
|
|
||||||
"""
|
|
||||||
W = model.maybe_get_param("hidden_W")
|
|
||||||
if W is not None and W.any():
|
|
||||||
return
|
|
||||||
|
|
||||||
nF = model.get_dim("nF")
|
|
||||||
nH = model.get_dim("nH")
|
|
||||||
nP = model.get_dim("nP")
|
|
||||||
nI = model.get_dim("nI")
|
|
||||||
W = model.ops.alloc4f(nF, nH, nP, nI)
|
|
||||||
b = model.ops.alloc2f(nH, nP)
|
|
||||||
pad = model.ops.alloc4f(1, nF, nH, nP)
|
|
||||||
|
|
||||||
ops = model.ops
|
|
||||||
W = normal_init(ops, W.shape, mean=float(ops.xp.sqrt(1.0 / nF * nI)))
|
|
||||||
pad = normal_init(ops, pad.shape, mean=1.0)
|
|
||||||
model.set_param("W", W)
|
|
||||||
model.set_param("b", b)
|
|
||||||
model.set_param("pad", pad)
|
|
||||||
|
|
||||||
ids = ops.alloc_f((5000, nF), dtype="f")
|
|
||||||
ids += ops.xp.random.uniform(0, 1000, ids.shape)
|
|
||||||
ids = ops.asarray(ids, dtype="i")
|
|
||||||
tokvecs = ops.alloc_f((5000, nI), dtype="f")
|
|
||||||
tokvecs += ops.xp.random.normal(loc=0.0, scale=1.0, size=tokvecs.size).reshape(
|
|
||||||
tokvecs.shape
|
|
||||||
)
|
|
||||||
|
|
||||||
def predict(ids, tokvecs):
|
|
||||||
# nS ids. nW tokvecs. Exclude the padding array.
|
|
||||||
hiddens, _ = _forward_precomputable_affine(model, tokvecs[:-1], False)
|
|
||||||
vectors = model.ops.alloc2f(ids.shape[0], nH * nP)
|
|
||||||
# need nS vectors
|
|
||||||
hiddens = hiddens.reshape((hiddens.shape[0] * nF, nH * nP))
|
|
||||||
model.ops.scatter_add(vectors, ids.flatten(), hiddens)
|
|
||||||
vectors3f = model.ops.reshape3f(vectors, vectors.shape[0], nH, nP)
|
|
||||||
vectors3f += b
|
|
||||||
return model.ops.maxout(vectors3f)[0]
|
|
||||||
|
|
||||||
tol_var = 0.01
|
|
||||||
tol_mean = 0.01
|
|
||||||
t_max = 10
|
|
||||||
W = cast(Floats4d, model.get_param("hidden_W").copy())
|
|
||||||
b = cast(Floats2d, model.get_param("hidden_b").copy())
|
|
||||||
for t_i in range(t_max):
|
|
||||||
acts1 = predict(ids, tokvecs)
|
|
||||||
var = model.ops.xp.var(acts1)
|
|
||||||
mean = model.ops.xp.mean(acts1)
|
|
||||||
if abs(var - 1.0) >= tol_var:
|
|
||||||
W /= model.ops.xp.sqrt(var)
|
|
||||||
model.set_param("hidden_W", W)
|
|
||||||
elif abs(mean) >= tol_mean:
|
|
||||||
b -= mean
|
|
||||||
model.set_param("hidden_b", b)
|
|
||||||
else:
|
|
||||||
break
|
|
||||||
return model
|
|
||||||
|
|
||||||
|
|
||||||
cdef WeightsC _get_c_weights(model, const float* feats, np.ndarray[np.npy_bool, ndim=1] seen_mask) except *:
|
|
||||||
output = model.get_ref("output")
|
|
||||||
cdef np.ndarray hidden_b = model.get_param("hidden_b")
|
|
||||||
cdef np.ndarray output_W = output.get_param("W")
|
|
||||||
cdef np.ndarray output_b = output.get_param("b")
|
|
||||||
|
|
||||||
cdef WeightsC weights
|
|
||||||
weights.feat_weights = feats
|
|
||||||
weights.feat_bias = <const float*>hidden_b.data
|
|
||||||
weights.hidden_weights = <const float *> output_W.data
|
|
||||||
weights.hidden_bias = <const float *> output_b.data
|
|
||||||
weights.seen_mask = <const int8_t*> seen_mask.data
|
|
||||||
|
|
||||||
return weights
|
|
||||||
|
|
||||||
|
|
||||||
cdef SizesC _get_c_sizes(model, int batch_size, int tokens) except *:
|
|
||||||
cdef SizesC sizes
|
|
||||||
sizes.states = batch_size
|
|
||||||
sizes.classes = model.get_dim("nO")
|
|
||||||
sizes.hiddens = model.get_dim("nH")
|
|
||||||
sizes.pieces = model.get_dim("nP")
|
|
||||||
sizes.feats = model.get_dim("nF")
|
|
||||||
sizes.embed_width = model.get_dim("nI")
|
|
||||||
sizes.tokens = tokens
|
|
||||||
return sizes
|
|
||||||
|
|
||||||
|
|
||||||
cdef ActivationsC _alloc_activations(SizesC n) nogil:
|
|
||||||
cdef ActivationsC A
|
|
||||||
memset(&A, 0, sizeof(A))
|
|
||||||
_resize_activations(&A, n)
|
|
||||||
return A
|
|
||||||
|
|
||||||
|
|
||||||
cdef void _free_activations(const ActivationsC* A) nogil:
|
|
||||||
free(A.token_ids)
|
|
||||||
free(A.unmaxed)
|
|
||||||
free(A.hiddens)
|
|
||||||
free(A.is_valid)
|
|
||||||
|
|
||||||
|
|
||||||
cdef void _resize_activations(ActivationsC* A, SizesC n) nogil:
|
|
||||||
if n.states <= A._max_size:
|
|
||||||
A._curr_size = n.states
|
|
||||||
return
|
|
||||||
if A._max_size == 0:
|
|
||||||
A.token_ids = <int*>calloc(n.states * n.feats, sizeof(A.token_ids[0]))
|
|
||||||
A.unmaxed = <float*>calloc(n.states * n.hiddens * n.pieces, sizeof(A.unmaxed[0]))
|
|
||||||
A.hiddens = <float*>calloc(n.states * n.hiddens, sizeof(A.hiddens[0]))
|
|
||||||
A.is_valid = <int*>calloc(n.states * n.classes, sizeof(A.is_valid[0]))
|
|
||||||
A._max_size = n.states
|
|
||||||
else:
|
|
||||||
A.token_ids = <int*>realloc(A.token_ids,
|
|
||||||
n.states * n.feats * sizeof(A.token_ids[0]))
|
|
||||||
A.unmaxed = <float*>realloc(A.unmaxed,
|
|
||||||
n.states * n.hiddens * n.pieces * sizeof(A.unmaxed[0]))
|
|
||||||
A.hiddens = <float*>realloc(A.hiddens,
|
|
||||||
n.states * n.hiddens * sizeof(A.hiddens[0]))
|
|
||||||
A.is_valid = <int*>realloc(A.is_valid,
|
|
||||||
n.states * n.classes * sizeof(A.is_valid[0]))
|
|
||||||
A._max_size = n.states
|
|
||||||
A._curr_size = n.states
|
|
||||||
|
|
||||||
|
|
||||||
cdef void _predict_states(CBlas cblas, ActivationsC* A, float* scores, StateC** states, const WeightsC* W, SizesC n) nogil:
|
|
||||||
_resize_activations(A, n)
|
|
||||||
for i in range(n.states):
|
|
||||||
states[i].set_context_tokens(&A.token_ids[i*n.feats], n.feats)
|
|
||||||
memset(A.unmaxed, 0, n.states * n.hiddens * n.pieces * sizeof(float))
|
|
||||||
_sum_state_features(cblas, A.unmaxed, W.feat_weights, A.token_ids, n)
|
|
||||||
for i in range(n.states):
|
|
||||||
saxpy(cblas)(n.hiddens * n.pieces, 1., W.feat_bias, 1, &A.unmaxed[i*n.hiddens*n.pieces], 1)
|
|
||||||
for j in range(n.hiddens):
|
|
||||||
index = i * n.hiddens * n.pieces + j * n.pieces
|
|
||||||
which = arg_max(&A.unmaxed[index], n.pieces)
|
|
||||||
A.hiddens[i*n.hiddens + j] = A.unmaxed[index + which]
|
|
||||||
if W.hidden_weights == NULL:
|
|
||||||
memcpy(scores, A.hiddens, n.states * n.classes * sizeof(float))
|
|
||||||
else:
|
|
||||||
# Compute hidden-to-output
|
|
||||||
sgemm(cblas)(False, True, n.states, n.classes, n.hiddens,
|
|
||||||
1.0, <const float *>A.hiddens, n.hiddens,
|
|
||||||
<const float *>W.hidden_weights, n.hiddens,
|
|
||||||
0.0, scores, n.classes)
|
|
||||||
# Add bias
|
|
||||||
for i in range(n.states):
|
|
||||||
saxpy(cblas)(n.classes, 1., W.hidden_bias, 1, &scores[i*n.classes], 1)
|
|
||||||
# Set unseen classes to minimum value
|
|
||||||
i = 0
|
|
||||||
min_ = scores[0]
|
|
||||||
for i in range(1, n.states * n.classes):
|
|
||||||
if scores[i] < min_:
|
|
||||||
min_ = scores[i]
|
|
||||||
for i in range(n.states):
|
|
||||||
for j in range(n.classes):
|
|
||||||
if W.seen_mask[j]:
|
|
||||||
scores[i*n.classes+j] = min_
|
|
||||||
|
|
||||||
|
|
||||||
cdef void _sum_state_features(CBlas cblas, float* output, const float* cached,
|
|
||||||
const int* token_ids, SizesC n) nogil:
|
|
||||||
cdef int idx, b, f
|
|
||||||
cdef const float* feature
|
|
||||||
cdef int B = n.states
|
|
||||||
cdef int O = n.hiddens * n.pieces # no-cython-lint
|
|
||||||
cdef int F = n.feats
|
|
||||||
cdef int T = n.tokens
|
|
||||||
padding = cached + (T * F * O)
|
|
||||||
cdef int id_stride = F*O
|
|
||||||
cdef float one = 1.
|
|
||||||
for b in range(B):
|
|
||||||
for f in range(F):
|
|
||||||
if token_ids[f] < 0:
|
|
||||||
feature = &padding[f*O]
|
|
||||||
else:
|
|
||||||
idx = token_ids[f] * id_stride + f*O
|
|
||||||
feature = &cached[idx]
|
|
||||||
saxpy(cblas)(O, one, <const float*>feature, 1, &output[b*O], 1)
|
|
||||||
token_ids += F
|
|
|
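Editor's note: the `backward` comment in `_forward_precomputable_affine` above describes replacing an explicit per-feature gradient tensor with an ID-indexed scatter-add, as also done for `d_tokvecs` via `ops.scatter_add`. A minimal NumPy sketch of that equivalence (shapes and names here are illustrative, not spaCy's API):

```python
import numpy as np

nB, nF, nI = 4, 3, 5  # batch size, features per state, token vector width
n_tokens = 6
rng = np.random.default_rng(0)
ids = rng.integers(0, n_tokens, size=(nB, nF))   # token index per (state, feature)
d_feats = rng.standard_normal((nB, nF, nI))      # gradient w.r.t. gathered features

# Naive version: materialize the per-token gradient with a double loop.
d_tok_naive = np.zeros((n_tokens, nI))
for b in range(nB):
    for f in range(nF):
        d_tok_naive[ids[b, f]] += d_feats[b, f]

# Scatter-add version: one unbuffered indexed addition, same result.
d_tok = np.zeros((n_tokens, nI))
np.add.at(d_tok, ids.reshape(-1), d_feats.reshape(-1, nI))

assert np.allclose(d_tok, d_tok_naive)
```

`np.add.at` is used rather than `d_tok[ids] += ...` because repeated indices must each contribute; plain fancy-index assignment would keep only the last write per token.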
@@ -1,4 +1,5 @@
 # cython: infer_types
+# cython: profile=False
 import warnings
 from typing import Dict, List, Optional, Tuple, Union
@@ -1,4 +1,4 @@
+# cython: profile=False
 IDS = {
     "": NO_TAG,
     "ADJ": ADJ,
@@ -21,6 +21,7 @@ from .trainable_pipe import TrainablePipe
 __all__ = [
     "AttributeRuler",
     "DependencyParser",
+    "EditTreeLemmatizer",
     "EntityLinker",
     "EntityRecognizer",
     "Morphologizer",
@@ -1,4 +1,5 @@
 # cython: infer_types=True, binding=True
+# cython: profile=False
 from cython.operator cimport dereference as deref
 from libc.stdint cimport UINT32_MAX, uint32_t
 from libc.string cimport memset
@@ -1,8 +1,12 @@
 from collections import defaultdict
 from typing import Any, Dict, List, Union
 
-from pydantic import BaseModel, Field, ValidationError
-from pydantic.types import StrictBool, StrictInt, StrictStr
+try:
+    from pydantic.v1 import BaseModel, Field, ValidationError
+    from pydantic.v1.types import StrictBool, StrictInt, StrictStr
+except ImportError:
+    from pydantic import BaseModel, Field, ValidationError  # type: ignore
+    from pydantic.types import StrictBool, StrictInt, StrictStr  # type: ignore
 
 
 class MatchNodeSchema(BaseModel):
@@ -1,13 +1,10 @@
 # cython: infer_types=True
-# cython: profile=True
 import numpy
 
 from ...typedefs cimport class_t
 from .transition_system cimport Transition, TransitionSystem
 
 from ...errors import Errors
 
-from .batch cimport Batch
 from .search cimport Beam, MaxViolation
 
 from .search import MaxViolation
@@ -29,7 +26,7 @@ cdef int check_final_state(void* _state, void* extra_args) except -1:
     return state.is_final()
 
 
-cdef class BeamBatch(Batch):
+cdef class BeamBatch(object):
     cdef public TransitionSystem moves
     cdef public object states
     cdef public object docs
@@ -1,2 +0,0 @@
-cdef int arg_max(const float* scores, const int n_classes) nogil
-cdef int arg_max_if_valid(const float* scores, const int* is_valid, int n) nogil
@@ -1,22 +0,0 @@
-# cython: infer_types=True
-
-cdef inline int arg_max(const float* scores, const int n_classes) nogil:
-    if n_classes == 2:
-        return 0 if scores[0] > scores[1] else 1
-    cdef int i
-    cdef int best = 0
-    cdef float mode = scores[0]
-    for i in range(1, n_classes):
-        if scores[i] > mode:
-            mode = scores[i]
-            best = i
-    return best
-
-
-cdef inline int arg_max_if_valid(const float* scores, const int* is_valid, int n) nogil:
-    cdef int best = -1
-    for i in range(n):
-        if is_valid[i] >= 1:
-            if best == -1 or scores[i] > scores[best]:
-                best = i
-    return best
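Editor's note: the deleted helper above is a masked argmax over the action scores, skipping actions the transition system marked invalid. A pure-Python equivalent, for illustration only:

```python
def arg_max_if_valid(scores, is_valid):
    """Index of the highest score among valid entries, or -1 if none are valid."""
    best = -1
    for i, (score, ok) in enumerate(zip(scores, is_valid)):
        if ok and (best == -1 or score > scores[best]):
            best = i
    return best

print(arg_max_if_valid([0.1, 0.9, 0.5], [1, 0, 1]))  # index 2: 0.9 is masked out
print(arg_max_if_valid([0.1, 0.9], [0, 0]))          # -1: nothing is valid
```

The `-1` sentinel is what the callers in `c_transition_batch` check for before forcing a state to its final configuration.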
@@ -1,4 +1,5 @@
 cimport libcpp
+from cpython.exc cimport PyErr_CheckSignals, PyErr_SetFromErrno
 from cython.operator cimport dereference as deref
 from cython.operator cimport preincrement as incr
 from libc.stdint cimport uint32_t, uint64_t
@@ -26,7 +27,7 @@ cdef struct ArcC:
 
 
 cdef cppclass StateC:
-    vector[int] _heads
+    int* _heads
     const TokenC* _sent
     vector[int] _stack
     vector[int] _rebuffer
@@ -34,34 +35,31 @@ cdef cppclass StateC:
     unordered_map[int, vector[ArcC]] _left_arcs
     unordered_map[int, vector[ArcC]] _right_arcs
     vector[libcpp.bool] _unshiftable
-    vector[int] history
     set[int] _sent_starts
     TokenC _empty_token
     int length
    int offset
     int _b_i
 
-    __init__(const TokenC* sent, int length) nogil except +:
-        this._heads.resize(length, -1)
-        this._unshiftable.resize(length, False)
-
-        # Reserve memory ahead of time to minimize allocations during parsing.
-        # The initial capacity set here ideally reflects the expected average-case/majority usage.
-        cdef int init_capacity = 32
-        this._stack.reserve(init_capacity)
-        this._rebuffer.reserve(init_capacity)
-        this._ents.reserve(init_capacity)
-        this._left_arcs.reserve(init_capacity)
-        this._right_arcs.reserve(init_capacity)
-        this.history.reserve(init_capacity)
-
+    __init__(const TokenC* sent, int length) nogil:
         this._sent = sent
+        this._heads = <int*>calloc(length, sizeof(int))
+        if not (this._sent and this._heads):
+            with gil:
+                PyErr_SetFromErrno(MemoryError)
+                PyErr_CheckSignals()
         this.offset = 0
         this.length = length
         this._b_i = 0
+        for i in range(length):
+            this._heads[i] = -1
+            this._unshiftable.push_back(0)
         memset(&this._empty_token, 0, sizeof(TokenC))
         this._empty_token.lex = &EMPTY_LEXEME
 
+    __dealloc__():
+        free(this._heads)
+
     void set_context_tokens(int* ids, int n) nogil:
         cdef int i, j
         if n == 1:
@@ -134,20 +132,19 @@ cdef cppclass StateC:
             ids[i] = -1
 
     int S(int i) nogil const:
-        cdef int stack_size = this._stack.size()
-        if i >= stack_size or i < 0:
+        if i >= this._stack.size():
             return -1
-        else:
-            return this._stack[stack_size - (i+1)]
+        elif i < 0:
+            return -1
+        return this._stack.at(this._stack.size() - (i+1))
 
     int B(int i) nogil const:
-        cdef int buf_size = this._rebuffer.size()
         if i < 0:
             return -1
-        elif i < buf_size:
-            return this._rebuffer[buf_size - (i+1)]
+        elif i < this._rebuffer.size():
+            return this._rebuffer.at(this._rebuffer.size() - (i+1))
         else:
-            b_i = this._b_i + (i - buf_size)
+            b_i = this._b_i + (i - this._rebuffer.size())
             if b_i >= this.length:
                 return -1
             else:
@@ -246,7 +243,7 @@ cdef cppclass StateC:
             return 0
         elif this._sent[word].sent_start == 1:
             return 1
-        elif this._sent_starts.const_find(word) != this._sent_starts.const_end():
+        elif this._sent_starts.count(word) >= 1:
             return 1
         else:
             return 0
@@ -330,7 +327,7 @@ cdef cppclass StateC:
         if item >= this._unshiftable.size():
             return 0
         else:
-            return this._unshiftable[item]
+            return this._unshiftable.at(item)
 
     void set_reshiftable(int item) nogil:
         if item < this._unshiftable.size():
@@ -350,9 +347,6 @@ cdef cppclass StateC:
         this._heads[child] = head
 
     void map_del_arc(unordered_map[int, vector[ArcC]]* heads_arcs, int h_i, int c_i) nogil:
-        cdef vector[ArcC]* arcs
-        cdef ArcC* arc
-
         arcs_it = heads_arcs.find(h_i)
         if arcs_it == heads_arcs.end():
             return
@@ -361,12 +355,12 @@ cdef cppclass StateC:
         if arcs.size() == 0:
             return
 
-        arc = &arcs.back()
+        arc = arcs.back()
         if arc.head == h_i and arc.child == c_i:
             arcs.pop_back()
         else:
             for i in range(arcs.size()-1):
-                arc = &deref(arcs)[i]
+                arc = arcs.at(i)
                 if arc.head == h_i and arc.child == c_i:
                     arc.head = -1
                     arc.child = -1
@@ -406,11 +400,10 @@ cdef cppclass StateC:
         this._rebuffer = src._rebuffer
         this._sent_starts = src._sent_starts
         this._unshiftable = src._unshiftable
-        this._heads = src._heads
+        memcpy(this._heads, src._heads, this.length * sizeof(this._heads[0]))
         this._ents = src._ents
         this._left_arcs = src._left_arcs
         this._right_arcs = src._right_arcs
         this._b_i = src._b_i
         this.offset = src.offset
         this._empty_token = src._empty_token
-        this.history = src.history
@@ -0,0 +1 @@
+# cython: profile=False
@@ -1,4 +1,4 @@
-# cython: profile=True, cdivision=True, infer_types=True
+# cython: cdivision=True, infer_types=True
 from cymem.cymem cimport Address, Pool
 from libc.stdint cimport int32_t
 from libcpp.vector cimport vector
@@ -779,8 +779,6 @@ cdef class ArcEager(TransitionSystem):
         return list(arcs)
 
     def has_gold(self, Example eg, start=0, end=None):
-        if end is not None and end < 0:
-            end = None
         for word in eg.y[start:end]:
             if word.dep != 0:
                 return True
@@ -865,7 +863,6 @@ cdef class ArcEager(TransitionSystem):
                     state.print_state()
                 )))
             action.do(state.c, action.label)
-            state.c.history.push_back(i)
             break
         else:
             failed = False
@@ -1,2 +0,0 @@
-cdef class Batch:
-    pass
@@ -1,52 +0,0 @@
-from typing import Any
-
-TransitionSystem = Any  # TODO
-
-cdef class Batch:
-    def advance(self, scores):
-        raise NotImplementedError
-
-    def get_states(self):
-        raise NotImplementedError
-
-    @property
-    def is_done(self):
-        raise NotImplementedError
-
-    def get_unfinished_states(self):
-        raise NotImplementedError
-
-    def __getitem__(self, i):
-        raise NotImplementedError
-
-    def __len__(self):
-        raise NotImplementedError
-
-
-class GreedyBatch(Batch):
-    def __init__(self, moves: TransitionSystem, states, golds):
-        self._moves = moves
-        self._states = states
-        self._next_states = [s for s in states if not s.is_final()]
-
-    def advance(self, scores):
-        self._next_states = self._moves.transition_states(self._next_states, scores)
-
-    def advance_with_actions(self, actions):
-        self._next_states = self._moves.apply_actions(self._next_states, actions)
-
-    def get_states(self):
-        return self._states
-
-    @property
-    def is_done(self):
-        return all(s.is_final() for s in self._states)
-
-    def get_unfinished_states(self):
-        return [st for st in self._states if not st.is_final()]
-
-    def __getitem__(self, i):
-        return self._states[i]
-
-    def __len__(self):
-        return len(self._states)
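Editor's note: the removed `GreedyBatch` above wraps a simple driver loop, advance every unfinished state by one transition per round until all states are final. A toy skeleton of that cycle (names and state class are hypothetical, not spaCy's API):

```python
class ToyState:
    """Stand-in for a parser state that finishes after a fixed number of steps."""

    def __init__(self, n_steps):
        self.n_steps = n_steps

    def is_final(self):
        return self.n_steps == 0

    def step(self):
        self.n_steps -= 1


def greedy_parse(states):
    # Mirror the advance()/get_unfinished_states() cycle: each round applies
    # one action to every state that has not yet reached a final configuration.
    rounds = 0
    while any(not s.is_final() for s in states):
        for s in states:
            if not s.is_final():
                s.step()
        rounds += 1
    return rounds


print(greedy_parse([ToyState(2), ToyState(3)]))  # 3: the longest state needs 3 rounds
```

In the real batch, `step` is replaced by scoring the states with the model and applying the argmax valid transition, which is why the class also exposes `advance_with_actions` for externally supplied action sequences.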
@@ -1,3 +1,4 @@
+# cython: profile=False
 from cymem.cymem cimport Pool
 from libcpp.memory cimport shared_ptr
 from libcpp.vector cimport vector
@@ -306,8 +307,6 @@ cdef class BiluoPushDown(TransitionSystem):
         for span in eg.y.spans.get(neg_key, []):
             if span.start >= start and span.end <= end:
                 return True
-        if end is not None and end < 0:
-            end = None
         for word in eg.y[start:end]:
             if word.ent_iob != 0:
                 return True
@@ -1,4 +1,4 @@
-# cython: profile=True, infer_types=True
+# cython: infer_types=True
 """Implements the projectivize/deprojectivize mechanism in Nivre & Nilsson 2005
 for doing pseudo-projective parsing implementation uses the HEAD decoration
 scheme.
@@ -1,4 +1,4 @@
-# cython: profile=True, experimental_cpp_class_def=True, cdivision=True, infer_types=True
+# cython: experimental_cpp_class_def=True, cdivision=True, infer_types=True
 cimport cython
 from cymem.cymem cimport Pool
 from libc.math cimport exp
@@ -1,4 +1,5 @@
 # cython: infer_types=True
+# cython: profile=False
 from libcpp.vector cimport vector
 
 from ...tokens.doc cimport Doc
@@ -19,10 +20,6 @@ cdef class StateClass:
         if self._borrowed != 1:
            del self.c
 
-    @property
-    def history(self):
-        return list(self.c.history)
-
     @property
     def stack(self):
         return [self.S(i) for i in range(self.c.stack_depth())]
@@ -32,7 +29,7 @@ cdef class StateClass:
         return [self.B(i) for i in range(self.c.buffer_length())]
 
     @property
-    def token_vector_lenth(self):
+    def token_vector_length(self):
         return self.doc.tensor.shape[1]
 
     @property
@@ -179,6 +176,3 @@ cdef class StateClass:
 
     def clone(self, StateClass src):
         self.c.clone(src.c)
-
-    def set_context_tokens(self, int[:, :] output, int row, int n_feats):
-        self.c.set_context_tokens(&output[row, 0], n_feats)
@@ -57,10 +57,3 @@ cdef class TransitionSystem:
 
     cdef int set_costs(self, int* is_valid, weight_t* costs,
                        const StateC* state, gold) except -1
-
-
-cdef void c_apply_actions(TransitionSystem moves, StateC** states, const int* actions,
-                          int batch_size) nogil
-
-cdef void c_transition_batch(TransitionSystem moves, StateC** states, const float* scores,
-                             int nr_class, int batch_size) nogil
@@ -1,17 +1,14 @@
 # cython: infer_types=True
+# cython: profile=False
 from __future__ import print_function
 
 from cymem.cymem cimport Pool
-from libc.stdlib cimport calloc, free
-from libcpp.vector cimport vector
 
 from collections import Counter
 
 import srsly
 
 from ...structs cimport TokenC
-from ...typedefs cimport attr_t, weight_t
-from ._parser_utils cimport arg_max_if_valid
 from .stateclass cimport StateClass
 
 from ... import util
@@ -76,18 +73,7 @@ cdef class TransitionSystem:
             offset += len(doc)
         return states
 
-    def follow_history(self, doc, history):
-        cdef int clas
-        cdef StateClass state = StateClass(doc)
-        for clas in history:
-            action = self.c[clas]
-            action.do(state.c, action.label)
-            state.c.history.push_back(clas)
-        return state
-
     def get_oracle_sequence(self, Example example, _debug=False):
-        if not self.has_gold(example):
-            return []
         states, golds, _ = self.init_gold_batch([example])
         if not states:
             return []
@@ -99,8 +85,6 @@ cdef class TransitionSystem:
         return self.get_oracle_sequence_from_state(state, gold)
 
     def get_oracle_sequence_from_state(self, StateClass state, gold, _debug=None):
-        if state.is_final():
-            return []
         cdef Pool mem = Pool()
         # n_moves should not be zero at this point, but make sure to avoid zero-length mem alloc
         assert self.n_moves > 0
@@ -126,7 +110,6 @@ cdef class TransitionSystem:
                     "S0 head?", str(state.has_head(state.S(0))),
                 )))
                 action.do(state.c, action.label)
-                state.c.history.push_back(i)
                 break
         else:
             if _debug:
@@ -154,28 +137,6 @@ cdef class TransitionSystem:
             raise ValueError(Errors.E170.format(name=name))
         action = self.lookup_transition(name)
         action.do(state.c, action.label)
-        state.c.history.push_back(action.clas)
-
-    def apply_actions(self, states, const int[::1] actions):
-        assert len(states) == actions.shape[0]
-        cdef StateClass state
-        cdef vector[StateC*] c_states
-        c_states.resize(len(states))
-        cdef int i
-        for (i, state) in enumerate(states):
-            c_states[i] = state.c
-        c_apply_actions(self, &c_states[0], &actions[0], actions.shape[0])
-        return [state for state in states if not state.c.is_final()]
-
-    def transition_states(self, states, float[:, ::1] scores):
-        assert len(states) == scores.shape[0]
-        cdef StateClass state
-        cdef float* c_scores = &scores[0, 0]
-        cdef vector[StateC*] c_states
-        for state in states:
-            c_states.push_back(state.c)
-        c_transition_batch(self, &c_states[0], c_scores, scores.shape[1], scores.shape[0])
-        return [state for state in states if not state.c.is_final()]
 
     cdef Transition lookup_transition(self, object name) except *:
         raise NotImplementedError
@@ -288,34 +249,3 @@ cdef class TransitionSystem:
         self.cfg.update(msg['cfg'])
         self.initialize_actions(labels)
         return self
-
-
-cdef void c_apply_actions(TransitionSystem moves, StateC** states, const int* actions,
-                          int batch_size) nogil:
-    cdef int i
-    cdef Transition action
-    cdef StateC* state
-    for i in range(batch_size):
-        state = states[i]
-        action = moves.c[actions[i]]
-        action.do(state, action.label)
-        state.history.push_back(action.clas)
-
-
-cdef void c_transition_batch(TransitionSystem moves, StateC** states, const float* scores,
-                             int nr_class, int batch_size) nogil:
-    is_valid = <int*>calloc(moves.n_moves, sizeof(int))
-    cdef int i, guess
-    cdef Transition action
-    for i in range(batch_size):
-        moves.set_valid(is_valid, states[i])
-        guess = arg_max_if_valid(&scores[i*nr_class], is_valid, nr_class)
-        if guess == -1:
-            # This shouldn't happen, but it's hard to raise an error here,
-            # and we don't want to infinite loop. So, force to end state.
-            states[i].force_final()
-        else:
-            action = moves.c[guess]
-            action.do(states[i], action.label)
-            states[i].history.push_back(guess)
-    free(is_valid)
@@ -1,9 +1,14 @@
-# cython: infer_types=True, profile=True, binding=True
+# cython: infer_types=True, binding=True
 from collections import defaultdict
 from typing import Callable, Optional
 
 from thinc.api import Config, Model
 
+from ._parser_internals.transition_system import TransitionSystem
+
+from ._parser_internals.arc_eager cimport ArcEager
+from .transition_parser cimport Parser
+
 from ..language import Language
 from ..scorer import Scorer
 from ..training import remove_bilu_prefix

@@ -17,11 +22,12 @@ from .transition_parser import Parser
 
 default_model_config = """
 [model]
-@architectures = "spacy.TransitionBasedParser.v3"
+@architectures = "spacy.TransitionBasedParser.v2"
 state_type = "parser"
 extra_state_tokens = false
 hidden_width = 64
 maxout_pieces = 2
+use_upper = true
 
 [model.tok2vec]
 @architectures = "spacy.HashEmbedCNN.v2"

@@ -227,7 +233,6 @@ def parser_score(examples, **kwargs):
 
     DOCS: https://spacy.io/api/dependencyparser#score
     """
-
     def has_sents(doc):
         return doc.has_annotation("SENT_START")
 

@@ -235,11 +240,8 @@ def parser_score(examples, **kwargs):
         dep = getattr(token, attr)
         dep = token.vocab.strings.as_string(dep).lower()
         return dep
 
     results = {}
-    results.update(
-        Scorer.score_spans(examples, "sents", has_annotation=has_sents, **kwargs)
-    )
+    results.update(Scorer.score_spans(examples, "sents", has_annotation=has_sents, **kwargs))
     kwargs.setdefault("getter", dep_getter)
     kwargs.setdefault("ignore_labels", ("p", "punct"))
     results.update(Scorer.score_deps(examples, "dep", **kwargs))

@@ -252,12 +254,11 @@ def make_parser_scorer():
     return parser_score
 
 
-class DependencyParser(Parser):
+cdef class DependencyParser(Parser):
     """Pipeline component for dependency parsing.
 
     DOCS: https://spacy.io/api/dependencyparser
     """
 
     TransitionSystem = ArcEager
 
     def __init__(

@@ -277,7 +278,8 @@ class DependencyParser(Parser):
         incorrect_spans_key=None,
         scorer=parser_score,
     ):
-        """Create a DependencyParser."""
+        """Create a DependencyParser.
+        """
         super().__init__(
             vocab,
             model,
@@ -5,7 +5,6 @@ from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple, Union,
 import numpy as np
 import srsly
 from thinc.api import Config, Model, NumpyOps, SequenceCategoricalCrossentropy
-from thinc.legacy import LegacySequenceCategoricalCrossentropy
 from thinc.types import ArrayXd, Floats2d, Ints1d
 
 from .. import util

@@ -131,9 +130,7 @@ class EditTreeLemmatizer(TrainablePipe):
         self, examples: Iterable[Example], scores: List[Floats2d]
     ) -> Tuple[float, List[Floats2d]]:
         validate_examples(examples, "EditTreeLemmatizer.get_loss")
-        loss_func = LegacySequenceCategoricalCrossentropy(
-            normalize=False, missing_value=-1
-        )
+        loss_func = SequenceCategoricalCrossentropy(normalize=False, missing_value=-1)
 
         truths = []
         for eg in examples:

@@ -169,7 +166,7 @@ class EditTreeLemmatizer(TrainablePipe):
 
         DOCS: https://spacy.io/api/edittreelemmatizer#get_teacher_student_loss
         """
-        loss_func = LegacySequenceCategoricalCrossentropy(normalize=False)
+        loss_func = SequenceCategoricalCrossentropy(normalize=False)
         d_scores, loss = loss_func(student_scores, teacher_scores)
         if self.model.ops.xp.isnan(loss):
             raise ValueError(Errors.E910.format(name=self.name))
@@ -2,6 +2,7 @@ import warnings
 from pathlib import Path
 from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple, Union
 
+import srsly
 from thinc.api import Model
 
 from .. import util

@@ -155,8 +156,24 @@ class Lemmatizer(Pipe):
         """
         required_tables, optional_tables = self.get_lookups_config(self.mode)
         if lookups is None:
-            logger.debug("Lemmatizer: loading tables from spacy-lookups-data")
-            lookups = load_lookups(lang=self.vocab.lang, tables=required_tables)
+            logger.debug(
+                "Lemmatizer: no lemmatizer lookups tables provided, "
+                "trying to load tables from registered lookups (usually "
+                "spacy-lookups-data)"
+            )
+            lookups = load_lookups(
+                lang=self.vocab.lang, tables=required_tables, strict=False
+            )
+            missing_tables = set(required_tables) - set(lookups.tables)
+            if len(missing_tables) > 0:
+                raise ValueError(
+                    Errors.E4010.format(
+                        missing_tables=list(missing_tables),
+                        pipe_name=self.name,
+                        required_tables=srsly.json_dumps(required_tables),
+                        tables=srsly.json_dumps(required_tables + optional_tables),
+                    )
+                )
             optional_lookups = load_lookups(
                 lang=self.vocab.lang, tables=optional_tables, strict=False
             )
@@ -1,9 +1,8 @@
-# cython: infer_types=True, profile=True, binding=True
+# cython: infer_types=True, binding=True
 from itertools import islice
 from typing import Callable, Dict, Iterable, Optional, Union
 
-from thinc.api import Config, Model
-from thinc.legacy import LegacySequenceCategoricalCrossentropy
+from thinc.api import Config, Model, SequenceCategoricalCrossentropy
 
 from ..morphology cimport Morphology
 from ..tokens.doc cimport Doc

@@ -296,8 +295,8 @@ class Morphologizer(Tagger):
         DOCS: https://spacy.io/api/morphologizer#get_loss
         """
         validate_examples(examples, "Morphologizer.get_loss")
-        loss_func = LegacySequenceCategoricalCrossentropy(names=self.labels, normalize=False,
-                                                          label_smoothing=self.cfg["label_smoothing"])
+        loss_func = SequenceCategoricalCrossentropy(names=self.labels, normalize=False,
+                                                    label_smoothing=self.cfg["label_smoothing"])
         truths = []
         for eg in examples:
             eg_truths = []
@@ -1,4 +1,4 @@
-# cython: infer_types=True, profile=True, binding=True
+# cython: infer_types=True, binding=True
 from collections import defaultdict
 from typing import Callable, Optional
 

@@ -10,15 +10,23 @@ from ..training import remove_bilu_prefix
 from ..util import registry
 from ._parser_internals.ner import BiluoPushDown
 from ._parser_internals.transition_system import TransitionSystem
-from .transition_parser import Parser
+
+from ._parser_internals.ner cimport BiluoPushDown
+from .transition_parser cimport Parser
+
+from ..language import Language
+from ..scorer import get_ner_prf
+from ..training import remove_bilu_prefix
+from ..util import registry
 
 default_model_config = """
 [model]
-@architectures = "spacy.TransitionBasedParser.v3"
+@architectures = "spacy.TransitionBasedParser.v2"
 state_type = "ner"
 extra_state_tokens = false
 hidden_width = 64
 maxout_pieces = 2
+use_upper = true
 
 [model.tok2vec]
 @architectures = "spacy.HashEmbedCNN.v2"

@@ -43,12 +51,8 @@ DEFAULT_NER_MODEL = Config().from_str(default_model_config)["model"]
         "incorrect_spans_key": None,
         "scorer": {"@scorers": "spacy.ner_scorer.v1"},
     },
-    default_score_weights={
-        "ents_f": 1.0,
-        "ents_p": 0.0,
-        "ents_r": 0.0,
-        "ents_per_type": None,
-    },
+    default_score_weights={"ents_f": 1.0, "ents_p": 0.0, "ents_r": 0.0, "ents_per_type": None},
 )
 def make_ner(
     nlp: Language,

@@ -115,12 +119,7 @@ def make_ner(
         "incorrect_spans_key": None,
         "scorer": None,
     },
-    default_score_weights={
-        "ents_f": 1.0,
-        "ents_p": 0.0,
-        "ents_r": 0.0,
-        "ents_per_type": None,
-    },
+    default_score_weights={"ents_f": 1.0, "ents_p": 0.0, "ents_r": 0.0, "ents_per_type": None},
 )
 def make_beam_ner(
     nlp: Language,

@@ -194,12 +193,11 @@ def make_ner_scorer():
     return ner_score
 
 
-class EntityRecognizer(Parser):
+cdef class EntityRecognizer(Parser):
     """Pipeline component for named entity recognition.
 
     DOCS: https://spacy.io/api/entityrecognizer
     """
 
     TransitionSystem = BiluoPushDown
 
     def __init__(

@@ -217,14 +215,15 @@ class EntityRecognizer(Parser):
         incorrect_spans_key=None,
         scorer=ner_score,
     ):
-        """Create an EntityRecognizer."""
+        """Create an EntityRecognizer.
+        """
         super().__init__(
             vocab,
             model,
             name,
             moves,
             update_with_oracle_cut_size=update_with_oracle_cut_size,
             min_action_freq=1,  # not relevant for NER
             learn_tokens=False,  # not relevant for NER
             beam_width=beam_width,
             beam_density=beam_density,
@@ -1,4 +1,4 @@
-# cython: infer_types=True, profile=True, binding=True
+# cython: infer_types=True, binding=True
 from typing import Callable, Dict, Iterable, Iterator, Tuple, Union
 
 import srsly
@@ -1,4 +1,4 @@
-# cython: infer_types=True, profile=True, binding=True
+# cython: infer_types=True, binding=True
 from typing import Callable, List, Optional
 
 import srsly
Some files were not shown because too many files have changed in this diff.