mirror of https://github.com/explosion/spaCy.git
synced 2025-01-26 17:24:41 +03:00
Merge remote-tracking branch 'upstream/develop' into feature/update-docs
# Conflicts:
#	website/docs/usage/training.md
commit 169b5bcda0
@ -509,8 +509,6 @@ page should be safe to use and we'll try to ensure backwards compatibility.
However, we recommend having additional tests in place if your application
depends on any of spaCy's utilities.

<!-- TODO: document new config-related util functions? -->

### util.get_lang_class {#util.get_lang_class tag="function"}

Import and load a `Language` class. Allows lazy-loading
@ -623,7 +623,7 @@ added to the pipeline:
>
> @Language.factory("my_component")
> def my_component(nlp, name):
> return MyComponent()
> return MyComponent()
> ```

| Argument | Description |
@ -636,8 +636,6 @@ All other settings can be passed in by the user via the `config` argument on
[`@Language.factory`](/api/language#factory) decorator also lets you define a
`default_config` that's used as a fallback.

<!-- TODO: add example of passing in a custom Python object via the config based on a registered function -->

```python
### With config {highlight="4,9"}
import spacy
@ -688,7 +686,7 @@ make your factory a separate function. That's also how spaCy does it internally.

</Accordion>

### Example: Stateful component with settings
### Example: Stateful component with settings {#example-stateful-components}

This example shows a **stateful** pipeline component for handling acronyms:
based on a dictionary, it will detect acronyms and their expanded forms in both
@ -757,6 +755,85 @@ doc = nlp("LOL, be right back")
print(doc._.acronyms)
```

Many stateful components depend on **data resources** like dictionaries and
lookup tables that should ideally be **configurable**. For example, it makes
sense to make the `DICTIONARY` an argument of the registered function, so the
`AcronymComponent` can be re-used with different data. One logical solution
would be to make it an argument of the component factory, and allow it to be
initialized with different dictionaries.

> #### Example
>
> Making the data an argument of the registered function would result in output
> like this in your `config.cfg`, which is typically not what you want (and only
> works for JSON-serializable data).
>
> ```ini
> [components.acronyms.dictionary]
> lol = "laugh out loud"
> brb = "be right back"
> ```

However, passing in the dictionary directly is problematic, because it means
that if a component saves out its config and settings, the
[`config.cfg`](/usage/training#config) will include a dump of the entire data,
since that's the config the component was created with.

```diff
DICTIONARY = {"lol": "laughing out loud", "brb": "be right back"}
- default_config = {"dictionary": DICTIONARY}
```

If what you're passing in isn't JSON-serializable – e.g. a custom object like a
[model](#trainable-components) – saving out the component config becomes
impossible because there's no way for spaCy to know _how_ that object was
created, and what to do to create it again. This makes it much harder to save,
load and train custom models with custom components. A simple solution is to
**register a function** that returns your resources. The
[registry](/api/top-level#registry) lets you **map string names to functions**
that create objects, so given a name and optional arguments, spaCy will know how
to recreate the object. To register a function that returns a custom asset, you
can use the `@spacy.registry.assets` decorator with a single argument, the name:

```python
### Registered function for assets {highlight="1"}
@spacy.registry.assets("acronyms.slang_dict.v1")
def create_acronyms_slang_dict():
    dictionary = {"lol": "laughing out loud", "brb": "be right back"}
    dictionary.update({value: key for key, value in dictionary.items()})
    return dictionary
```

In your `default_config` (and later in your
[training config](/usage/training#config)), you can now refer to the function
registered under the name `"acronyms.slang_dict.v1"` using the `@assets` key.
This tells spaCy how to create the value, and when your component is created,
the result of the registered function is passed in as the key `"dictionary"`.

> #### config.cfg
>
> ```ini
> [components.acronyms]
> factory = "acronyms"
>
> [components.acronyms.dictionary]
> @assets = "acronyms.slang_dict.v1"
> ```

```diff
- default_config = {"dictionary": DICTIONARY}
+ default_config = {"dictionary": {"@assets": "acronyms.slang_dict.v1"}}
```
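
The difference between the two versions of `default_config` can be sketched with
the standard library's `json` module. This is only an approximation of the
serialization problem, not how spaCy writes its config, and `CustomModel` is a
hypothetical stand-in for any object that can't be serialized.

```python
import json


class CustomModel:
    """Hypothetical stand-in for an object that JSON can't represent."""


def is_json_serializable(config: dict) -> bool:
    """Return True if the config can be written out with json.dumps."""
    try:
        json.dumps(config)
        return True
    except TypeError:
        return False


# Passing the object itself: there is no way to write this out.
config_with_object = {"model": CustomModel()}
# Passing a reference to a registered function: just nested strings.
config_with_reference = {"dictionary": {"@assets": "acronyms.slang_dict.v1"}}

print(is_json_serializable(config_with_object))     # False
print(is_json_serializable(config_with_reference))  # True

# The reference also survives a save/load round trip unchanged.
restored = json.loads(json.dumps(config_with_reference))
print(restored == config_with_reference)            # True
```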

Using a registered function also means that you can easily include your custom
components in models that you [train](/usage/training). To make sure spaCy knows
where to find your custom `@assets` function, you can pass in a Python file via
the argument `--code`. If someone else is using your component, all they have to
do to customize the data is to register their own function and swap out the
name. Registered functions can also take **arguments** that can be defined in
the config as well – you can read more about this in the docs on
[training with custom code](/usage/training#custom-code).
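
As a rough sketch of how a registered function with arguments could be resolved
from a config block, the example below treats every key other than `@assets` as
a keyword argument. Everything here is an assumption for illustration: the
`ASSETS` mapping, the `resolve` helper, the `include_reverse` argument and the
name `"acronyms.slang_dict.v2"` are hypothetical and not part of spaCy.

```python
from typing import Any, Callable, Dict


def create_slang_dict(include_reverse: bool = True) -> Dict[str, str]:
    """Hypothetical asset function that takes an argument."""
    dictionary = {"lol": "laughing out loud", "brb": "be right back"}
    if include_reverse:
        dictionary.update({value: key for key, value in dictionary.items()})
    return dictionary


# Hypothetical name-to-function mapping standing in for the registry.
ASSETS: Dict[str, Callable[..., Any]] = {
    "acronyms.slang_dict.v2": create_slang_dict,
}


def resolve(block: Dict[str, Any]) -> Any:
    """Call the function named under "@assets", passing the other keys as arguments."""
    args = dict(block)  # copy, so the config block itself is left untouched
    func = ASSETS[args.pop("@assets")]
    return func(**args)


block = {"@assets": "acronyms.slang_dict.v2", "include_reverse": False}
dictionary = resolve(block)
print(len(dictionary))  # 2
```

Because the block only holds a name and simple values, swapping in different
data means changing the name or the arguments, not the component code.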

### Python type hints and pydantic validation {#type-hints new="3"}

spaCy's configs are powered by our machine learning library Thinc's
@ -994,7 +1071,7 @@ loss is calculated and to add evaluation scores to the training output.
| [`get_loss`](/api/pipe#get_loss) | Return a tuple of the loss and the gradient for a batch of [`Example`](/api/example) objects. |
| [`score`](/api/pipe#score) | Score a batch of [`Example`](/api/example) objects and return a dictionary of scores. The [`@Language.factory`](/api/language#factory) decorator can define the `default_score_weights` of the component to decide which keys of the scores to display during training and how they count towards the final score. |

<!-- TODO: add more details, examples and maybe an example project -->
<!-- TODO: link to (not yet created) page for defining models for trainable components -->

## Extension attributes {#custom-components-attributes new="2"}

@ -97,7 +97,7 @@ to download and where to put them. The
[`spacy project assets`](/api/cli#project-assets) will fetch the project assets
for you:

```
```cli
$ cd some_example_project
$ python -m spacy project assets
```
@ -414,7 +414,7 @@ recipe once the dish has already been prepared. You have to make a new one.
spaCy includes a variety of built-in [architectures](/api/architectures) for
different tasks. For example:

<!-- TODO: -->
<!-- TODO: select example architectures to showcase -->

| Architecture | Description |
| ----------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
@ -776,12 +776,11 @@ mattis pretium.

### Defining custom architectures {#custom-architectures}

<!-- TODO: this could maybe be a more general example of using Thinc to compose some layers? We don't want to go too deep here and probably want to focus on a simple architecture example to show how it works -->
<!-- TODO: Wrapping PyTorch and TensorFlow -->
<!-- TODO: this should probably move to new section on models -->

## Transfer learning {#transfer-learning}

<!-- TODO: link to embeddings and transformers page -->
<!-- TODO: write something, link to embeddings and transformers page – should probably wait until transformers/embeddings/transfer learning docs are done -->

### Using transformer models like BERT {#transformers}

@ -811,7 +810,7 @@ config and customize the implementations, see the usage guide on

### Pretraining with spaCy {#pretraining}

<!-- TODO: document spacy pretrain, objectives etc. -->
<!-- TODO: document spacy pretrain, objectives etc. – should probably wait until transformers/embeddings/transfer learning docs are done -->

## Parallel Training with Ray {#parallel-training}

@ -836,9 +835,8 @@ spaCy gives you full control over the training loop. However, for most use
cases, it's recommended to train your models via the
[`spacy train`](/api/cli#train) command with a [`config.cfg`](#config) to keep
track of your settings and hyperparameters, instead of writing your own training
scripts from scratch.
[Custom registered functions](/usage/training/#custom-code) should typically
give you everything you need to train fully custom models with
scripts from scratch. [Custom registered functions](#custom-code) should
typically give you everything you need to train fully custom models with
[`spacy train`](/api/cli#train).

</Infobox>