Mirror of https://github.com/explosion/spaCy.git, synced 2025-01-13 18:56:36 +03:00

Merge pull request #5933 from svlandeg/feature/more-v3-docs [ci skip]

This commit is contained in commit 2285e59765

Changed: CONTRIBUTING.md (156)
@@ -43,33 +43,33 @@ can also submit a [regression test](#fixing-bugs) straight away. When you're
opening an issue to report the bug, simply refer to your pull request in the
issue body. A few more tips:

- **Describing your issue:** Try to provide as many details as possible. What
  exactly goes wrong? _How_ is it failing? Is there an error?
  "XY doesn't work" usually isn't that helpful for tracking down problems. Always
  remember to include the code you ran and if possible, extract only the relevant
  parts and don't just dump your entire script. This will make it easier for us to
  reproduce the error.

- **Getting info about your spaCy installation and environment:** If you're
  using spaCy v1.7+, you can use the command line interface to print details and
  even format them as Markdown to copy-paste into GitHub issues:
  `python -m spacy info --markdown`.

- **Checking the model compatibility:** If you're having problems with a
  [statistical model](https://spacy.io/models), it may be because the
  model is incompatible with your spaCy installation. In spaCy v2.0+, you can check
  this on the command line by running `python -m spacy validate`.

- **Sharing a model's output, like dependencies and entities:** spaCy v2.0+
  comes with [built-in visualizers](https://spacy.io/usage/visualizers) that
  you can run from within your script or a Jupyter notebook. For some issues, it's
  helpful to **include a screenshot** of the visualization. You can simply drag and
  drop the image into GitHub's editor and it will be uploaded and included.

- **Sharing long blocks of code or logs:** If you need to include long code,
  logs or tracebacks, you can wrap them in `<details>` and `</details>`. This
  [collapses the content](https://developer.mozilla.org/en/docs/Web/HTML/Element/details)
  so it only becomes visible on click, making the issue easier to read and follow.

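For instance, a small helper (a sketch for illustration, not part of spaCy) can wrap a traceback in a collapsible block before you paste it into an issue:

```python
def collapse_for_github(summary: str, body: str) -> str:
    # Wrap a long log or traceback in <details>/</details> so GitHub
    # renders it collapsed and the issue stays easy to scan.
    return f"<details>\n<summary>{summary}</summary>\n\n{body}\n\n</details>"

snippet = collapse_for_github(
    "Full traceback", "Traceback (most recent call last): ..."
)
```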
### Issue labels

@@ -94,39 +94,39 @@ shipped in the core library, and what could be provided in other packages. Our
philosophy is to prefer a smaller core library. We generally ask the following
questions:

- **What would this feature look like if implemented in a separate package?**
  Some features would be very difficult to implement externally – for example,
  changes to spaCy's built-in methods. In contrast, a library of word
  alignment functions could easily live as a separate package that depended on
  spaCy — there's little difference between writing `import word_aligner` and
  `import spacy.word_aligner`. spaCy v2.0+ makes it easy to implement
  [custom pipeline components](https://spacy.io/usage/processing-pipelines#custom-components),
  and add your own attributes, properties and methods to the `Doc`, `Token` and
  `Span`. If you're looking to implement a new spaCy feature, starting with a
  custom component package is usually the best strategy. You won't have to worry
  about spaCy's internals and you can test your module in an isolated
  environment. And if it works well, we can always integrate it into the core
  library later.

- **Would the feature be easier to implement if it relied on "heavy" dependencies spaCy doesn't currently require?**
  Python has a very rich ecosystem. Libraries like scikit-learn, SciPy, Gensim or
  TensorFlow/Keras do lots of useful things — but we don't want to have them as
  dependencies. If the feature requires functionality in one of these libraries,
  it's probably better to break it out into a different package.

- **Is the feature orthogonal to the current spaCy functionality, or overlapping?**
  spaCy strongly prefers to avoid having 6 different ways of doing the same thing.
  As better techniques are developed, we prefer to drop support for "the old way".
  However, it's rare that one approach _entirely_ dominates another. It's very
  common that there's still a use-case for the "obsolete" approach. For instance,
  [WordNet](https://wordnet.princeton.edu/) is still very useful — but word
  vectors are better for most use-cases, and the two approaches to lexical
  semantics do a lot of the same things. spaCy therefore only supports word
  vectors, and support for WordNet is currently left for other packages.

- **Do you need the feature to get basic things done?** We do want spaCy to be
  at least somewhat self-contained. If we keep needing some feature in our
  recipes, that does provide some argument for bringing it "in house".

### Getting started

@@ -203,10 +203,10 @@ your files on save:

```json
{
  "python.formatting.provider": "black",
  "[python]": {
    "editor.formatOnSave": true
  }
}
```

@@ -216,7 +216,7 @@ list of available editor integrations.

#### Disabling formatting

There are a few cases where auto-formatting doesn't improve readability – for
example, in some of the language data files like the `tag_map.py`, or in
the tests that construct `Doc` objects from lists of words and other labels.
Wrapping a block in `# fmt: off` and `# fmt: on` lets you disable formatting
for that particular code. Here's an example:
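The original example falls outside this hunk; a sketch of what such a block looks like (with hypothetical data):

```python
# The manual column alignment below aids readability, so Black is told
# to leave this block exactly as written.
# fmt: off
words = ["I",     "like", "London", "."]
heads = [1,       1,      1,        1]
deps  = ["nsubj", "ROOT", "dobj",   "punct"]
# fmt: on

assert len(words) == len(heads) == len(deps)
```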
@@ -397,10 +397,10 @@ Python. If it's not fast enough the first time, just switch to Cython.

### Resources to get you started

- [PEP 8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) (python.org)
- [Official Cython documentation](http://docs.cython.org/en/latest/) (cython.org)
- [Writing C in Cython](https://explosion.ai/blog/writing-c-in-cython) (explosion.ai)
- [Multi-threading spaCy’s parser and named entity recogniser](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)

## Adding tests

@@ -440,25 +440,25 @@ simply click on the "Suggest edits" button at the bottom of a page.
We're very excited about all the new possibilities for **community extensions**
and plugins in spaCy v2.0, and we can't wait to see what you build with it!

- An extension or plugin should add substantial functionality, be
  **well-documented** and **open-source**. It should be available for users to download
  and install as a Python package – for example via [PyPI](http://pypi.python.org).

- Extensions that write to `Doc`, `Token` or `Span` attributes should be wrapped
  as [pipeline components](https://spacy.io/usage/processing-pipelines#custom-components)
  that users can **add to their processing pipeline** using `nlp.add_pipe()`.

- When publishing your extension on GitHub, **tag it** with the topics
  [`spacy`](https://github.com/topics/spacy?o=desc&s=stars) and
  [`spacy-extensions`](https://github.com/topics/spacy-extension?o=desc&s=stars)
  to make it easier to find. Those are also the topics we're linking to from the
  spaCy website. If you're sharing your project on Twitter, feel free to tag
  [@spacy_io](https://twitter.com/spacy_io) so we can check it out.

- Once your extension is published, you can open an issue on the
  [issue tracker](https://github.com/explosion/spacy/issues) to suggest it for the
  [resources directory](https://spacy.io/usage/resources#extensions) on the
  website.

📖 **For more tips and best practices, see the [checklist for developing spaCy extensions](https://spacy.io/usage/processing-pipelines#extensions).**

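The component pattern above can be sketched in plain Python — this is only the general shape, not spaCy's actual `Language` implementation, and the `emoji_detector` component is hypothetical:

```python
class Pipeline:
    # Toy stand-in for the nlp object: components are callables that
    # receive the document and return it, applied in the order added.
    def __init__(self):
        self.components = []

    def add_pipe(self, component):
        self.components.append(component)

    def __call__(self, doc):
        for component in self.components:
            doc = component(doc)
        return doc

def emoji_detector(doc):
    # Hypothetical extension: writes a custom attribute to the document.
    doc["has_emoji"] = "😀" in doc["text"]
    return doc

nlp = Pipeline()
nlp.add_pipe(emoji_detector)
doc = nlp({"text": "spaCy is fun 😀"})
```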
@@ -235,7 +235,7 @@ def train_while_improving(
with each iteration yielding a tuple `(batch, info, is_best_checkpoint)`,
where info is a dict, and is_best_checkpoint is in [True, False, None] --
None indicating that the iteration was not evaluated as a checkpoint.
The evaluation is conducted by calling the evaluate callback.

Positional arguments:
  nlp: The spaCy pipeline to evaluate.
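The contract described in this docstring can be sketched as a generator — a simplified illustration of the yielded tuples, not the actual training loop:

```python
def train_while_improving(batches, evaluate, eval_frequency):
    # Yields (batch, info, is_best_checkpoint). is_best_checkpoint is None
    # when the iteration was not evaluated as a checkpoint; otherwise it is
    # True or False depending on whether the evaluation score improved.
    best_score = None
    for step, batch in enumerate(batches):
        info = {"step": step}
        if step % eval_frequency == 0:
            score = evaluate()  # the evaluate callback
            info["score"] = score
            is_best_checkpoint = best_score is None or score > best_score
            if is_best_checkpoint:
                best_score = score
        else:
            is_best_checkpoint = None
        yield batch, info, is_best_checkpoint
```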
@@ -545,18 +545,18 @@ network has an internal CNN Tok2Vec layer and uses attention.

<!-- TODO: model return type -->

| Name                 | Description |
| -------------------- | ----------- |
| `exclusive_classes`  | Whether or not categories are mutually exclusive. ~~bool~~ |
| `pretrained_vectors` | Whether or not pretrained vectors will be used in addition to the feature vectors. ~~bool~~ |
| `width`              | Output dimension of the feature encoding step. ~~int~~ |
| `embed_size`         | Input dimension of the feature encoding step. ~~int~~ |
| `conv_depth`         | Depth of the tok2vec layer. ~~int~~ |
| `window_size`        | The number of contextual vectors to [concatenate](https://thinc.ai/docs/api-layers#expand_window) from the left and from the right. ~~int~~ |
| `ngram_size`         | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, bigram and trigram features. ~~int~~ |
| `dropout`            | The dropout rate. ~~float~~ |
| `nO`                 | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. ~~Optional[int]~~ |
| **CREATES**          | The model using the architecture. ~~Model~~ |
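The `ngram_size` behavior can be illustrated with a small stand-alone function — a sketch of the idea, not the model's actual feature extractor:

```python
def bow_ngrams(tokens, ngram_size):
    # Collect all n-grams up to ngram_size: ngram_size=3 yields unigram,
    # bigram and trigram features over the token sequence.
    features = []
    for n in range(1, ngram_size + 1):
        for i in range(len(tokens) - n + 1):
            features.append(tuple(tokens[i:i + n]))
    return features
```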

### spacy.TextCatCNN.v1 {#TextCatCNN}

@@ -585,12 +585,12 @@ architecture is usually less accurate than the ensemble, but runs faster.

<!-- TODO: model return type -->

| Name                | Description |
| ------------------- | ----------- |
| `exclusive_classes` | Whether or not categories are mutually exclusive. ~~bool~~ |
| `tok2vec`           | The [`tok2vec`](#tok2vec) layer of the model. ~~Model~~ |
| `nO`                | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. ~~Optional[int]~~ |
| **CREATES**         | The model using the architecture. ~~Model~~ |

### spacy.TextCatBOW.v1 {#TextCatBOW}

@@ -610,13 +610,13 @@ others, but may not be as accurate, especially if texts are short.

<!-- TODO: model return type -->

| Name                | Description |
| ------------------- | ----------- |
| `exclusive_classes` | Whether or not categories are mutually exclusive. ~~bool~~ |
| `ngram_size`        | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3` would give unigram, bigram and trigram features. ~~int~~ |
| `no_output_layer`   | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes` is `True`, else `Logistic`). ~~bool~~ |
| `nO`                | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. ~~Optional[int]~~ |
| **CREATES**         | The model using the architecture. ~~Model~~ |

## Entity linking architectures {#entitylinker source="spacy/ml/models/entity_linker.py"}

@@ -17,7 +17,7 @@ customize the data loading during training, you can register your own
or evaluation data. It takes the same arguments as the `Corpus` class and
returns a callable that yields [`Example`](/api/example) objects. You can
replace it with your own registered function in the
[`@readers` registry](/api/top-level#registry) to customize the data loading and
streaming.

> #### Example config
@@ -162,7 +162,7 @@ run [`spacy pretrain`](/api/cli#pretrain).
| `dropout`                    | The dropout rate. Defaults to `0.2`. ~~float~~ |
| `n_save_every`               | Saving frequency. Defaults to `null`. ~~Optional[int]~~ |
| `batch_size`                 | The batch size or batch size [schedule](https://thinc.ai/docs/api-schedules). Defaults to `3000`. ~~Union[int, Sequence[int]]~~ |
| `seed`                       | The random seed. Defaults to variable `${system:seed}`. ~~int~~ |
| `use_pytorch_for_gpu_memory` | Allocate memory via PyTorch. Defaults to variable `${system:use_pytorch_for_gpu_memory}`. ~~bool~~ |
| `tok2vec_model`              | The model section of the embedding component in the config. Defaults to `"components.tok2vec.model"`. ~~str~~ |
| `objective`                  | The pretraining objective. Defaults to `{"type": "characters", "n_characters": 4}`. ~~Dict[str, Any]~~ |
@@ -169,7 +169,7 @@ $ python setup.py build_ext --inplace # compile spaCy

Compared to regular install via pip, the
[`requirements.txt`](https://github.com/explosion/spaCy/tree/master/requirements.txt)
additionally installs developer dependencies such as Cython. See the
[quickstart widget](#quickstart) to get the right commands for your platform and
Python version.

@@ -551,9 +551,9 @@ setup(
)
```

After installing the package, the custom colors will be used when visualizing
text with `displacy`. Whenever the label `SNEK` is assigned, it will be
displayed in `#3dff74`.

import DisplaCyEntSnekHtml from 'images/displacy-ent-snek.html'

@@ -144,7 +144,7 @@ https://github.com/explosion/spaCy/blob/develop/spacy/default_config.cfg

Under the hood, the config is parsed into a dictionary. It's divided into
sections and subsections, indicated by the square brackets and dot notation. For
example, `[training]` is a section and `[training.batch_size]` a subsection.
Subsections can define values, just like a dictionary, or use the `@` syntax to
refer to [registered functions](#config-functions). This allows the config to
not just define static settings, but also construct objects like architectures,
@@ -156,7 +156,7 @@ sections of a config file are:
| `nlp`         | Definition of the `nlp` object, its tokenizer and [processing pipeline](/usage/processing-pipelines) component names. |
| `components`  | Definitions of the [pipeline components](/usage/processing-pipelines) and their models. |
| `paths`       | Paths to data and other assets. Re-used across the config as variables, e.g. `${paths:train}`, and can be [overwritten](#config-overrides) on the CLI. |
| `system`      | Settings related to system and hardware. Re-used across the config as variables, e.g. `${system:seed}`, and can be [overwritten](#config-overrides) on the CLI. |
| `training`    | Settings and controls for the training and evaluation process. |
| `pretraining` | Optional settings and controls for the [language model pretraining](#pretraining). |

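spaCy's config is parsed by Thinc rather than the standard library, but stdlib `configparser` with `ExtendedInterpolation` happens to illustrate both conventions above: dotted section names and `${section:key}` variables (the config values here are made up for the example):

```python
import configparser

CONFIG = """
[paths]
train = ./corpus/train.spacy

[training]
seed = 0
corpus = ${paths:train}

[training.batch_size]
start = 100
stop = 1000
"""

parser = configparser.ConfigParser(interpolation=configparser.ExtendedInterpolation())
parser.read_string(CONFIG)
# ${paths:train} is resolved as a cross-section variable, and the dotted
# section name [training.batch_size] expresses a subsection of [training].
train_path = parser["training"]["corpus"]
batch_start = parser["training.batch_size"]["start"]
```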
@@ -514,11 +514,11 @@ language class and `nlp` object at different points of the lifecycle:
| `after_creation`          | Called right after the `nlp` object is created, but before the pipeline components are added to the pipeline and receives the `nlp` object. Useful for modifying the tokenizer. |
| `after_pipeline_creation` | Called right after the pipeline components are created and added and receives the `nlp` object. Useful for modifying pipeline components. |

The `@spacy.registry.callbacks` decorator lets you register your custom function
in the `callbacks` [registry](/api/top-level#registry) under a given name. You
can then reference the function in a config block using the `@callbacks` key. If
a block contains a key starting with an `@`, it's interpreted as a reference to
a function. Because you've registered the function, spaCy knows how to create it
when you reference `"customize_language_data"` in your config. Here's an example
of a callback that runs before the `nlp` object is created and adds a few custom
tokenization rules to the defaults:
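The callback example itself falls outside this hunk, but the registry mechanism the paragraph describes can be sketched in plain Python — a toy version of the idea, not spaCy's actual registry, with a made-up `stop_words` payload:

```python
callbacks = {}  # toy "callbacks" registry

def register_callback(name):
    # Decorator that registers a function under a given name.
    def register(func):
        callbacks[name] = func
        return func
    return register

@register_callback("customize_language_data")
def customize_language_data(extra_stop_words=()):
    def callback(defaults):
        defaults["stop_words"] |= set(extra_stop_words)
        return defaults
    return callback

def resolve(block):
    # A key starting with "@" marks the block as a reference to a
    # registered function; the remaining keys become keyword arguments.
    name = block["@callbacks"]
    kwargs = {k: v for k, v in block.items() if not k.startswith("@")}
    return callbacks[name](**kwargs)

callback = resolve({"@callbacks": "customize_language_data", "extra_stop_words": ["btw"]})
defaults = callback({"stop_words": {"the", "a"}})
```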
|
@@ -593,9 +593,9 @@ spaCy's configs are powered by our machine learning library Thinc's

using [`pydantic`](https://github.com/samuelcolvin/pydantic). If your registered
function provides type hints, the values that are passed in will be checked
against the expected types. For example, `debug: bool` in the example above will
ensure that the value received as the argument `debug` is a boolean. If the
value can't be coerced into a boolean, spaCy will raise an error.
`debug: pydantic.StrictBool` will force the value to be a boolean and raise an
error if it's not – for instance, if your config defines `1` instead of `true`.

</Infobox>
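To make the difference concrete, here's a small pure-Python sketch of lenient coercion versus strict validation (an illustration only, not pydantic's actual implementation):

```python
def coerce_bool(value):
    # Lenient behavior, analogous to annotating `debug: bool`:
    # common boolean-like config values are converted.
    if isinstance(value, bool):
        return value
    if value in (0, 1):
        return bool(value)
    if isinstance(value, str) and value.lower() in ("true", "false"):
        return value.lower() == "true"
    raise ValueError(f"Can't coerce {value!r} to a boolean")


def strict_bool(value):
    # Strict behavior, analogous to `debug: pydantic.StrictBool`:
    # only a real boolean is accepted, so `1` raises an error.
    if not isinstance(value, bool):
        raise ValueError(f"Expected a boolean, got {value!r}")
    return value


print(coerce_bool(1))     # True
print(strict_bool(True))  # True
```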
@@ -642,7 +642,9 @@ settings in the block will be passed to the function as keyword arguments. Keep

in mind that the config shouldn't have any hidden defaults and all arguments on
the functions need to be represented in the config. If your function defines
**default argument values**, spaCy is able to auto-fill your config when you run
[`init fill-config`](/api/cli#init-fill-config). If you want to make sure that a
given parameter is always explicitly set in the config, avoid setting a default
value for it.

```ini
### config.cfg (excerpt)
```
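A rough sketch of what such auto-filling amounts to, using Python's `inspect` module (`compounding_schedule` is a hypothetical function; spaCy's real logic in `init fill-config` is more involved):

```python
import inspect

def auto_fill(func, partial_config):
    # Fill missing settings from the function's default argument values,
    # mimicking how `init fill-config` completes a partial config.
    filled = dict(partial_config)
    for name, param in inspect.signature(func).parameters.items():
        if name in filled:
            continue
        if param.default is inspect.Parameter.empty:
            raise ValueError(f"Missing required setting: {name}")
        filled[name] = param.default
    return filled

def compounding_schedule(start: float, stop: float, factor: float = 1.005):
    """Hypothetical registered function with one default argument value."""

config = auto_fill(compounding_schedule, {"start": 0.1, "stop": 1.0})
# `factor` is filled in from its default, 1.005; omitting `start` or `stop`
# would raise instead, since they have no defaults.
```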
@@ -654,7 +656,68 @@ factor = 1.005

#### Example: Custom data reading and batching {#custom-code-readers-batchers}

Some use-cases require streaming in data or manipulating datasets on the fly,
rather than generating all data beforehand and storing it to file. Instead of
using the built-in reader `"spacy.Corpus.v1"`, which uses static file paths, you
can create and register a custom function that generates
[`Example`](/api/example) objects. The resulting generator can be infinite. When
using this dataset for training, stopping criteria such as maximum number of
steps, or stopping when the loss does not decrease further, can be used.

In this example we assume a custom function `read_custom_data()` which loads or
generates texts with relevant textcat annotations. Then, small lexical
variations of the input text are created before generating the final `Example`
objects.

We can also customize the batching strategy by registering a new "batcher" which
turns a stream of items into a stream of batches. spaCy has several useful
built-in batching strategies with customizable sizes<!-- TODO: link -->, but
it's also easy to implement your own. For instance, the following function takes
the stream of generated `Example` objects, and removes those which have the
exact same underlying raw text, to avoid duplicates within each batch. Note that
in a more realistic implementation, you'd also want to check whether the
annotations are exactly the same.
> ```ini
> [training.train_corpus]
> @readers = "corpus_variants.v1"
>
> [training.batcher]
> @batchers = "filtering_batch.v1"
> size = 150
> ```

```python
### functions.py
from typing import Callable, Iterable, List
import spacy
from spacy.gold import Example
import random


@spacy.registry.readers("corpus_variants.v1")
def stream_data() -> Callable[["Language"], Iterable[Example]]:
    def generate_stream(nlp):
        for text, cats in read_custom_data():
            random_index = random.randint(0, len(text) - 1)
            variant = text[:random_index] + text[random_index].upper() + text[random_index + 1:]
            doc = nlp.make_doc(variant)
            example = Example.from_dict(doc, {"cats": cats})
            yield example

    return generate_stream


@spacy.registry.batchers("filtering_batch.v1")
def filter_batch(size: int) -> Callable[[Iterable[Example]], Iterable[List[Example]]]:
    def create_filtered_batches(examples: Iterable[Example]) -> Iterable[List[Example]]:
        batch = []
        for eg in examples:
            if eg.text not in [x.text for x in batch]:
                batch.append(eg)
            if len(batch) == size:
                yield batch
                batch = []

    return create_filtered_batches
```
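To see the batcher's behavior, here's a quick standalone check using a stand-in object with a `.text` attribute (`FakeExample` is made up for illustration):

```python
from dataclasses import dataclass

@dataclass
class FakeExample:
    # Stand-in for spacy.gold.Example, which exposes a `.text` attribute.
    text: str

def filter_batch(size):
    # Same filtering logic as the registered batcher above.
    def create_filtered_batches(examples):
        batch = []
        for eg in examples:
            if eg.text not in [x.text for x in batch]:
                batch.append(eg)
            if len(batch) == size:
                yield batch
                batch = []
    return create_filtered_batches

stream = [FakeExample("a"), FakeExample("a"), FakeExample("b"), FakeExample("c")]
batches = list(filter_batch(2)(stream))
# One full batch containing "a" and "b": the duplicate "a" is skipped, and
# the trailing incomplete batch containing "c" is never yielded.
```

Note that a trailing partial batch is discarded by this implementation; depending on your data, you may want to yield it as well.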
### Wrapping PyTorch and TensorFlow {#custom-frameworks}
@@ -60,7 +60,7 @@

    "clear": "rm -rf .cache",
    "test": "echo \"Write tests! -> https://gatsby.app/unit-testing\"",
    "python:install": "pip install setup/requirements.txt",
    "python:setup": "cd setup && sh setup.sh"
  },
  "devDependencies": {
    "@sindresorhus/slugify": "^0.8.0",
@@ -2,7 +2,7 @@

# With additional functionality: in/not in, replace, pprint, round, + for lists,
# rendering empty dicts
# This script is mostly used to generate the JavaScript function for the
# training quickstart widget.
import contextlib
import json
import re