mirror of
https://github.com/explosion/spaCy.git
synced 2024-12-26 18:06:29 +03:00
Update docs [ci skip]
parent
3eaeb73342
commit
c044460823
|
@ -3,7 +3,7 @@ title: Model Architectures
|
|||
teaser: Pre-defined model architectures included with the core library
|
||||
source: spacy/ml/models
|
||||
menu:
|
||||
- ['Tok2Vec', 'tok2vec']
|
||||
- ['Tok2Vec', 'tok2vec-arch']
|
||||
- ['Transformers', 'transformers']
|
||||
- ['Parser & NER', 'parser']
|
||||
- ['Tagging', 'tagger']
|
||||
|
@ -236,7 +236,7 @@ and residual connections.
|
|||
> depth = 4
|
||||
> ```
|
||||
|
||||
Encode context using bidirectonal LSTM layers. Requires
|
||||
Encode context using bidirectional LSTM layers. Requires
|
||||
[PyTorch](https://pytorch.org).
|
||||
|
||||
| Name | Type | Description |
|
||||
|
@ -278,8 +278,6 @@ architectures into your training config.
|
|||
|
||||
### spacy-transformers.Tok2VecListener.v1 {#Tok2VecListener}
|
||||
|
||||
<!-- TODO: description -->
|
||||
|
||||
> #### Example Config
|
||||
>
|
||||
> ```ini
|
||||
|
@ -291,10 +289,41 @@ architectures into your training config.
|
|||
> @layers = "reduce_mean.v1"
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------- | ------------------------- | ---------------------------------------------------------------------------------------------- |
|
||||
| `grad_factor` | float | Factor for weighting the gradient if multiple components listen to the same transformer model. |
|
||||
| `pooling` | `Model[Ragged, Floats2d]` | Pooling layer to determine how the vector for each spaCy token will be computed. |
|
||||
Create a `TransformerListener` layer, which will connect to a
|
||||
[`Transformer`](/api/transformer) component earlier in the pipeline. The layer
|
||||
takes a list of [`Doc`](/api/doc) objects as input, and produces a list of
|
||||
2-dimensional arrays as output, with each array having one row per token. Most
|
||||
spaCy models expect a sublayer with this signature, making it easy to connect
|
||||
them to a transformer model via this sublayer. Transformer models usually
|
||||
operate over wordpieces, which often don't align one-to-one with spaCy
|
||||
tokens. The layer therefore requires a reduction operation in order to calculate
|
||||
a single token vector given zero or more wordpiece vectors.
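
The reduction described above can be sketched in plain Python. This is a minimal illustration of mean pooling over a ragged batch, not the Thinc `reduce_mean` implementation: the helper names `mean_pool` and `pool_tokens` are hypothetical, and returning a zero vector for a token with no aligned wordpieces is an assumption made for the sketch.

```python
from typing import List

Vector = List[float]


def mean_pool(wordpiece_vectors: List[Vector], width: int) -> Vector:
    """Average zero or more wordpiece vectors into one token vector."""
    if not wordpiece_vectors:
        # Assumption for this sketch: an unaligned token gets a zero vector.
        return [0.0] * width
    count = len(wordpiece_vectors)
    return [sum(vec[i] for vec in wordpiece_vectors) / count for i in range(width)]


def pool_tokens(wordpieces: List[Vector], lengths: List[int], width: int) -> List[Vector]:
    """Reduce a ragged batch: lengths[i] consecutive wordpieces belong to token i."""
    rows = []
    start = 0
    for length in lengths:
        rows.append(mean_pool(wordpieces[start : start + length], width))
        start += length
    return rows


# Three tokens aligned to 2, 0 and 1 wordpieces respectively.
wordpieces = [[1.0, 3.0], [3.0, 5.0], [4.0, 4.0]]
token_vectors = pool_tokens(wordpieces, lengths=[2, 0, 1], width=2)
```

The result has one row per token regardless of how many wordpieces each token was split into, which is the shape downstream components expect from the listener.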
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `pooling` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** [`Ragged`](https://thinc.ai/docs/api-types#ragged). **Output:** [`Floats2d`](https://thinc.ai/docs/api-types#types). A reduction layer used to calculate the token vectors based on zero or more wordpiece vectors. If in doubt, mean pooling (see [`reduce_mean`](https://thinc.ai/docs/api-layers#reduce_mean)) is usually a good choice. |
|
||||
| `grad_factor` | float | Reweight gradients from the component before passing them upstream. You can set this to `0` to "freeze" the transformer weights with respect to the component, or use it to make some components more significant than others. Leaving it at `1.0` is usually fine. |
|
||||
|
||||
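
The effect of `grad_factor` from the table above can be illustrated with a small sketch. It assumes the factor is applied as a plain scalar multiplication of the component's gradient before it is passed upstream; `reweight_gradient` is a hypothetical helper, not a function from `spacy-transformers`.

```python
from typing import List


def reweight_gradient(d_output: List[float], grad_factor: float) -> List[float]:
    """Scale a listener's gradient before it is passed to the transformer."""
    return [grad_factor * value for value in d_output]


gradient = [0.5, -2.0]
unchanged = reweight_gradient(gradient, 1.0)  # the usual setting
halved = reweight_gradient(gradient, 0.5)     # this component counts for less
frozen = reweight_gradient(gradient, 0.0)     # no signal reaches the transformer
```

With a factor of `0`, the component still trains its own layers, but contributes nothing to the transformer's weight update.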
### spacy-transformers.Tok2VecTransformer.v1 {#Tok2VecTransformer}
|
||||
|
||||
> #### Example Config
|
||||
>
|
||||
> ```ini
|
||||
> # TODO:
|
||||
> ```
|
||||
|
||||
Use a transformer as a [`Tok2Vec`](/api/tok2vec) layer directly. This does
|
||||
**not** allow multiple components to share the transformer weights, and does
|
||||
**not** allow the transformer to set annotations into the [`Doc`](/api/doc)
|
||||
object, but it's a **simpler solution** if you only need the transformer within
|
||||
one component.
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------------ | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `get_spans` | callable | Function that takes a batch of [`Doc`](/api/doc) objects and returns lists of [`Span`](/api) objects to be processed by the transformer. [See here](/api/transformer#span_getters) for built-in options and examples. |
|
||||
| `tokenizer_config` | `Dict[str, Any]` | Tokenizer settings passed to [`transformers.AutoTokenizer`](https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoTokenizer). |
|
||||
| `pooling` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** [`Ragged`](https://thinc.ai/docs/api-types#ragged). **Output:** [`Floats2d`](https://thinc.ai/docs/api-types#types). A reduction layer used to calculate the token vectors based on zero or more wordpiece vectors. If in doubt, mean pooling (see [`reduce_mean`](https://thinc.ai/docs/api-layers#reduce_mean)) is usually a good choice. |
|
||||
| `grad_factor` | float | Reweight gradients from the component before passing them upstream. You can set this to `0` to "freeze" the transformer weights with respect to the component, or use it to make some components more significant than others. Leaving it at `1.0` is usually fine. |
|
||||
|
||||
## Parser & NER architectures {#parser}
|
||||
|
||||
|
@ -595,8 +624,6 @@ A function that creates a default, empty `KnowledgeBase` from a
|
|||
|
||||
A function that takes as input a [`KnowledgeBase`](/api/kb) and a
|
||||
[`Span`](/api/span) object denoting a named entity, and returns a list of
|
||||
plausible [`Candidate` objects](/api/kb/#candidate_init).
|
||||
|
||||
The default `CandidateGenerator` simply uses the text of a mention to find its
|
||||
potential aliases in the Knowledgebase. Note that this function is
|
||||
case-dependent.
|
||||
plausible [`Candidate` objects](/api/kb/#candidate_init). The default
|
||||
`CandidateGenerator` simply uses the text of a mention to find its potential
|
||||
aliases in the `KnowledgeBase`. Note that this function is case-dependent.
|
||||
|
|
|
@ -242,6 +242,21 @@ a batch of [Example](/api/example) objects.
|
|||
|
||||
Update the models in the pipeline.
|
||||
|
||||
<Infobox variant="warning" title="Changed in v3.0">
|
||||
|
||||
The `Language.update` method now takes a batch of [`Example`](/api/example)
|
||||
objects instead of the raw texts and annotations or `Doc` and `GoldParse`
|
||||
objects. An [`Example`](/api/example) streamlines how data is passed around. It
|
||||
stores two `Doc` objects: one for holding the gold-standard reference data, and
|
||||
one for holding the predictions of the pipeline.
|
||||
|
||||
For most use cases, you shouldn't have to write your own training scripts
|
||||
anymore. Instead, you can use [`spacy train`](/api/cli#train) with a config file
|
||||
and custom registered functions if needed. See the
|
||||
[training documentation](/usage/training) for details.
|
||||
|
||||
</Infobox>
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
|
@ -253,7 +268,7 @@ Update the models in the pipeline.
|
|||
|
||||
| Name | Type | Description |
|
||||
| --------------- | --------------------------------------------------- | ------------------------------------------------------------------------------------------------------ |
|
||||
| `examples` | `Iterable[Example]` | A batch of `Example` objects to learn from. |
|
||||
| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
|
||||
| _keyword-only_ | | |
|
||||
| `drop` | float | The dropout rate. |
|
||||
| `sgd` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
|
||||
|
|
|
@ -9,6 +9,28 @@ api_string_name: lemmatizer
|
|||
api_trainable: false
|
||||
---
|
||||
|
||||
Component for assigning base forms to tokens using rules based on part-of-speech
|
||||
tags, or lookup tables. Functionality to train the component is coming soon.
|
||||
Different [`Language`](/api/language) subclasses can implement their own
|
||||
lemmatizer components via
|
||||
[language-specific factories](/usage/processing-pipelines#factories-language).
|
||||
The default data used is provided by the
|
||||
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
|
||||
extension package.
|
||||
|
||||
<Infobox variant="warning" title="New in v3.0">
|
||||
|
||||
As of v3.0, the `Lemmatizer` is a **standalone pipeline component** that can be
|
||||
added to your pipeline, and not a hidden part of the vocab that runs behind the
|
||||
scenes. This makes it easier to customize how lemmas should be assigned in your
|
||||
pipeline.
|
||||
|
||||
If the lemmatization mode is set to `"rule"` and requires part-of-speech tags to
|
||||
be assigned, make sure a [`Tagger`](/api/tagger) or another component assigning
|
||||
tags is available in the pipeline and runs _before_ the lemmatizer.
|
||||
|
||||
</Infobox>
|
||||
|
||||
## Config and implementation
|
||||
|
||||
The default config is defined by the pipeline component factory and describes
|
||||
|
@ -29,7 +51,7 @@ lemmatizers, see the
|
|||
|
||||
| Setting | Type | Description | Default |
|
||||
| ----------- | ------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------- |
|
||||
| `mode` | str | The lemmatizer mode, e.g. "lookup" or "rule". | `"lookup"` |
|
||||
| `mode` | str | The lemmatizer mode, e.g. `"lookup"` or `"rule"`. | `"lookup"` |
|
||||
| `lookups` | [`Lookups`](/api/lookups) | The lookups object containing the tables such as `"lemma_rules"`, `"lemma_index"`, `"lemma_exc"` and `"lemma_lookup"`. If `None`, default tables are loaded from `spacy-lookups-data`. | `None` |
|
||||
| `overwrite` | bool | Whether to overwrite existing lemmas. | `False` |
|
||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | **Not yet implemented:** the model to use. | `None` |
|
||||
|
@ -55,15 +77,15 @@ Create a new pipeline instance. In your application, you would normally use a
|
|||
shortcut for this and instantiate the component using its string name and
|
||||
[`nlp.add_pipe`](/api/language#add_pipe).
|
||||
|
||||
| Name | Type | Description |
|
||||
| -------------- | ------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `vocab` | [`Vocab`](/api/vocab) | The vocab. |
|
||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | A model (not yet implemented). |
|
||||
| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
|
||||
| _keyword-only_ | | |
|
||||
| mode | str | The lemmatizer mode, e.g. "lookup" or "rule". Defaults to "lookup". |
|
||||
| lookups | [`Lookups`](/api/lookups) | A lookups object containing the tables such as "lemma_rules", "lemma_index", "lemma_exc" and "lemma_lookup". Defaults to `None`. |
|
||||
| overwrite | bool | Whether to overwrite existing lemmas. |
|
||||
| Name | Type | Description |
|
||||
| -------------- | ------------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `vocab` | [`Vocab`](/api/vocab) | The vocab. |
|
||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | A model (not yet implemented). |
|
||||
| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
|
||||
| _keyword-only_ | | |
|
||||
| `mode` | str | The lemmatizer mode, e.g. `"lookup"` or `"rule"`. Defaults to `"lookup"`. |
|
||||
| `lookups` | [`Lookups`](/api/lookups) | A lookups object containing the tables such as `"lemma_rules"`, `"lemma_index"`, `"lemma_exc"` and `"lemma_lookup"`. Defaults to `None`. |
|
||||
| `overwrite` | bool | Whether to overwrite existing lemmas. |
|
||||
|
||||
## Lemmatizer.\_\_call\_\_ {#call tag="method"}
|
||||
|
||||
|
|
|
@ -25,8 +25,15 @@ work out-of-the-box.
|
|||
|
||||
</Infobox>
|
||||
|
||||
This pipeline component lets you use transformer models in your pipeline. The
|
||||
component assigns the output of the transformer to the Doc's extension
|
||||
This pipeline component lets you use transformer models in your pipeline, using
|
||||
the [HuggingFace `transformers`](https://huggingface.co/transformers) library
|
||||
under the hood. Usually you will connect subsequent components to the shared
|
||||
transformer using the
|
||||
[TransformerListener](/api/architectures#TransformerListener) layer. This works
|
||||
similarly to spaCy's [Tok2Vec](/api/tok2vec) component and
|
||||
[Tok2VecListener](/api/architectures/Tok2VecListener) sublayer.
|
||||
|
||||
The component assigns the output of the transformer to the `Doc`'s extension
|
||||
attributes. We also calculate an alignment between the word-piece tokens and the
|
||||
spaCy tokenization, so that we can use the last hidden states to set the
|
||||
`Doc.tensor` attribute. When multiple word-piece tokens align to the same spaCy
|
||||
|
@ -53,11 +60,11 @@ architectures and their arguments and hyperparameters.
|
|||
> nlp.add_pipe("transformer", config=DEFAULT_CONFIG)
|
||||
> ```
|
||||
|
||||
| Setting | Type | Description | Default |
|
||||
| ------------------- | ------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------- |
|
||||
| `max_batch_items` | int | Maximum size of a padded batch. | `4096` |
|
||||
| `annotation_setter` | Callable | Function that takes a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. | `null_annotation_setter` |
|
||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The model to use. | [TransformerModel](/api/architectures#TransformerModel) |
|
||||
| Setting | Type | Description | Default |
|
||||
| ------------------- | ------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------- |
|
||||
| `max_batch_items` | int | Maximum size of a padded batch. | `4096` |
|
||||
| `annotation_setter` | Callable | Function that takes a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. The `Doc._.trf_data` attribute is set prior to calling the callback. By default, no additional annotations are set. | `null_annotation_setter` |
|
||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** [`FullTransformerBatch`](/api/transformer#fulltransformerbatch). The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. | [TransformerModel](/api/architectures#TransformerModel) |
|
||||
|
||||
```python
|
||||
https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/pipeline_component.py
|
||||
|
@ -86,18 +93,22 @@ https://github.com/explosion/spacy-transformers/blob/master/spacy_transformers/p
|
|||
> trf = Transformer(nlp.vocab, model)
|
||||
> ```
|
||||
|
||||
Create a new pipeline instance. In your application, you would normally use a
|
||||
shortcut for this and instantiate the component using its string name and
|
||||
[`nlp.add_pipe`](/api/language#create_pipe).
|
||||
Construct a `Transformer` component. One or more subsequent spaCy components can
|
||||
use the transformer outputs as features in their models, with gradients
|
||||
backpropagated to the single shared weights. The activations from the
|
||||
transformer are saved in the [`Doc._.trf_data`](#custom-attributes) extension
|
||||
attribute. You can also provide a callback to set additional annotations. In
|
||||
your application, you would normally use a shortcut for this and instantiate the
|
||||
component using its string name and [`nlp.add_pipe`](/api/language#add_pipe).
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------------- | ------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `vocab` | `Vocab` | The shared vocabulary. |
|
||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | The Thinc [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. |
|
||||
| `annotation_setter` | `Callable` | Function that takes a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. Defaults to `null_annotation_setter`, a function that does nothing. |
|
||||
| _keyword-only_ | | |
|
||||
| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
|
||||
| `max_batch_items` | int | Maximum size of a padded batch. Defaults to `128*32`. |
|
||||
| Name | Type | Description |
|
||||
| ------------------- | ------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `vocab` | `Vocab` | The shared vocabulary. |
|
||||
| `model` | [`Model`](https://thinc.ai/docs/api-model) | **Input:** `List[Doc]`. **Output:** [`FullTransformerBatch`](/api/transformer#fulltransformerbatch). The Thinc [`Model`](https://thinc.ai/docs/api-model) wrapping the transformer. Usually you will want to use the [TransformerModel](/api/architectures#TransformerModel) layer for this. |
|
||||
| `annotation_setter` | `Callable` | Function that takes a batch of `Doc` objects and a [`FullTransformerBatch`](/api/transformer#fulltransformerbatch) and can set additional annotations on the `Doc`. The `Doc._.trf_data` attribute is set prior to calling the callback. By default, no additional annotations are set. |
|
||||
| _keyword-only_ | | |
|
||||
| `name` | str | String name of the component instance. Used to add entries to the `losses` during training. |
|
||||
| `max_batch_items` | int | Maximum size of a padded batch. Defaults to `128*32`. |
|
||||
|
||||
## Transformer.\_\_call\_\_ {#call tag="method"}
|
||||
|
||||
|
@ -184,7 +195,10 @@ Apply the pipeline's model to a batch of docs, without modifying them.
|
|||
|
||||
## Transformer.set_annotations {#set_annotations tag="method"}
|
||||
|
||||
Modify a batch of documents, using pre-computed scores.
|
||||
Assign the extracted features to the `Doc` objects. By default, the
|
||||
[`TransformerData`](/api/transformer#transformerdata) object is written to the
|
||||
[`Doc._.trf_data`](#custom-attributes) attribute. Your `annotation_setter`
|
||||
callback is then called, if provided.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
@ -201,8 +215,19 @@ Modify a batch of documents, using pre-computed scores.
|
|||
|
||||
## Transformer.update {#update tag="method"}
|
||||
|
||||
Learn from a batch of documents and gold-standard information, updating the
|
||||
pipe's model. Delegates to [`predict`](/api/transformer#predict).
|
||||
Prepare for an update to the transformer. Like the [`Tok2Vec`](/api/tok2vec)
|
||||
component, the `Transformer` component is unusual in that it does not receive
|
||||
"gold standard" annotations to calculate a weight update. The optimal output of
|
||||
the transformer is unknown: it's a hidden layer inside the network that is
|
||||
updated by backpropagating from output layers.
|
||||
|
||||
The `Transformer` component therefore does **not** perform a weight update
|
||||
during its own `update` method. Instead, it runs its transformer model and
|
||||
communicates the output and the backpropagation callback to any **downstream
|
||||
components** that have been connected to it via the
|
||||
[TransformerListener](/api/architectures#TransformerListener) sublayer. If there
|
||||
are multiple listeners, the last layer will actually backprop to the transformer
|
||||
and call the optimizer, while the others simply increment the gradients.
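
The listener behavior described above can be sketched as a small accumulator. This is an illustration of the idea only, assuming gradients are simple lists of floats; the `SharedBackprop` class and its attributes are hypothetical and do not mirror the actual `spacy-transformers` code.

```python
from typing import Callable, List


class SharedBackprop:
    """Collect gradients from several listeners; backprop once on the last one."""

    def __init__(self, n_listeners: int, backprop: Callable[[List[float]], None]):
        self.n_listeners = n_listeners
        self.backprop = backprop
        self.received = 0
        self.total: List[float] = []

    def receive(self, d_output: List[float]) -> None:
        if not self.total:
            self.total = [0.0] * len(d_output)
        # Every listener adds its gradient to the running total.
        self.total = [a + b for a, b in zip(self.total, d_output)]
        self.received += 1
        if self.received == self.n_listeners:
            # Only the final listener triggers the actual backward pass.
            self.backprop(self.total)


calls: List[List[float]] = []
shared = SharedBackprop(n_listeners=2, backprop=calls.append)
shared.receive([1.0, 2.0])   # first listener: accumulate only
calls_after_first = len(calls)
shared.receive([0.5, -1.0])  # last listener: backprop the summed gradient
```

The transformer's backward pass therefore runs once per batch, on the sum of the gradients from all connected components.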
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
@ -212,15 +237,15 @@ pipe's model. Delegates to [`predict`](/api/transformer#predict).
|
|||
> losses = trf.update(examples, sgd=optimizer)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------------- | --------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
|
||||
| _keyword-only_ | | |
|
||||
| `drop` | float | The dropout rate. |
|
||||
| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/transformer#set_annotations). |
|
||||
| `sgd` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
|
||||
| `losses` | `Dict[str, float]` | Optional record of the loss during training. Updated using the component name as the key. |
|
||||
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
|
||||
| Name | Type | Description |
|
||||
| ----------------- | --------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects. Only the [`Example.predicted`](/api/example#predicted) `Doc` object is used; the reference `Doc` is ignored. |
|
||||
| _keyword-only_ | | |
|
||||
| `drop` | float | The dropout rate. |
|
||||
| `set_annotations` | bool | Whether or not to update the `Example` objects with the predictions, delegating to [`set_annotations`](/api/transformer#set_annotations). |
|
||||
| `sgd` | [`Optimizer`](https://thinc.ai/docs/api-optimizers) | The optimizer. |
|
||||
| `losses` | `Dict[str, float]` | Optional record of the loss during training. Updated using the component name as the key. |
|
||||
| **RETURNS** | `Dict[str, float]` | The updated `losses` dictionary. |
|
||||
|
||||
## Transformer.create_optimizer {#create_optimizer tag="method"}
|
||||
|
||||
|
@ -396,14 +421,16 @@ Split a `TransformerData` object that represents a batch into a list with one
|
|||
|
||||
## Span getters {#span_getters tag="registered functions" source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/span_getters.py"}
|
||||
|
||||
<!-- TODO: details on what this is for -->
|
||||
|
||||
Span getters are functions that take a batch of [`Doc`](/api/doc) objects and
|
||||
return lists of [`Span`](/api/span) objects for each doc, to be processed by
|
||||
the transformer. The returned spans can overlap. Span getters can be referenced
|
||||
in the config's `[components.transformer.model.get_spans]` block to customize
|
||||
the sequences processed by the transformer. You can also register custom span
|
||||
getters using the `@registry.span_getters` decorator.
|
||||
the transformer. This is used to manage long documents, by cutting them into
|
||||
smaller sequences before running the transformer. The spans are allowed to
|
||||
overlap, and you can also omit sections of the `Doc` if they are not relevant.
|
||||
|
||||
Span getters can be referenced in the config's
|
||||
`[components.transformer.model.get_spans]` block to customize the sequences
|
||||
processed by the transformer. You can also register custom span getters using
|
||||
the `@registry.span_getters` decorator.
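
To make the idea of cutting long documents into overlapping sequences concrete, here is a minimal sketch of a strided span getter. It works on plain lists of token strings rather than `Doc` and `Span` objects, and the names `strided_spans` and `get_spans` as well as the `window` and `stride` values are assumptions chosen for the illustration.

```python
from typing import List, Sequence


def strided_spans(tokens: Sequence[str], window: int, stride: int) -> List[List[str]]:
    """Cut one document into windows of `window` tokens, one every `stride` tokens."""
    if window < 1 or stride < 1:
        raise ValueError("window and stride must be positive")
    spans = []
    start = 0
    while start < len(tokens):
        spans.append(list(tokens[start : start + window]))
        if start + window >= len(tokens):
            break  # The last window already reaches the end of the document.
        start += stride
    return spans


def get_spans(docs: List[Sequence[str]]) -> List[List[List[str]]]:
    """Take a batch of documents and return one list of spans per document."""
    return [strided_spans(doc, window=4, stride=3) for doc in docs]


docs = [["a", "b", "c", "d", "e", "f", "g"], ["x", "y"]]
spans = get_spans(docs)
```

Because the stride is smaller than the window, neighboring spans share one token (`"d"` in the first document), so tokens near a boundary are seen with context from both sides.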
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
|
|
@ -6,25 +6,97 @@ menu:
|
|||
- ['New Features', 'features']
|
||||
- ['Backwards Incompatibilities', 'incompat']
|
||||
- ['Migrating from v2.x', 'migrating']
|
||||
- ['Migrating plugins', 'plugins']
|
||||
---
|
||||
|
||||
## Summary {#summary}
|
||||
|
||||
## New Features {#features}
|
||||
|
||||
### New training workflow and config system {#features-training}
|
||||
|
||||
### Transformer-based pipelines {#features-transformers}
|
||||
|
||||
### Custom models using any framework {#feautres-custom-models}
|
||||
|
||||
### Manage end-to-end workflows with projects {#features-projects}
|
||||
|
||||
### New built-in pipeline components {#features-pipeline-components}
|
||||
|
||||
| Name | Description |
|
||||
| ----------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||
| [`SentenceRecognizer`](/api/sentencerecognizer) | Trainable component for sentence segmentation. |
|
||||
| [`Morphologizer`](/api/morphologizer) | Trainable component to predict morphological features. |
|
||||
| [`Lemmatizer`](/api/lemmatizer) | Standalone component for rule-based and lookup lemmatization. |
|
||||
| [`AttributeRuler`](/api/attributeruler) | Component for setting token attributes using match patterns. |
|
||||
| [`Transformer`](/api/transformer) | Component for using [transformer models](/usage/transformers) in your pipeline, accessing outputs and aligning tokens. Provided via [`spacy-transformers`](https://github.com/explosion/spacy-transformers). |
|
||||
|
||||
### New and improved pipeline component APIs {#features-components}
|
||||
|
||||
- `Language.factory`, `Language.component`
|
||||
- `Language.analyze_pipes`
|
||||
- Adding components from other models
|
||||
|
||||
### Type hints and type-based data validation {#features-types}
|
||||
|
||||
spaCy v3.0 officially drops support for Python 2 and now requires **Python
|
||||
3.6+**. This also means that the code base can take full advantage of
|
||||
[type hints](https://docs.python.org/3/library/typing.html). spaCy's user-facing
|
||||
API that's implemented in pure Python (as opposed to Cython) now comes with type
|
||||
hints. The new version of spaCy's machine learning library
|
||||
[Thinc](https://thinc.ai) also features extensive
|
||||
[type support](https://thinc.ai/docs/usage-type-checking/), including custom
|
||||
types for models and arrays, and a custom `mypy` plugin that can be used to
|
||||
type-check model definitions.
|
||||
|
||||
For data validation, spaCy v3.0 adopts
|
||||
[`pydantic`](https://github.com/samuelcolvin/pydantic). It also powers the data
|
||||
validation of Thinc's [config system](https://thinc.ai/docs/usage-config), which
|
||||
lets you register **custom functions with typed arguments**, reference them
|
||||
in your config and see validation errors if the argument values don't match.
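
The sketch below shows the general pattern of registering a typed function, referencing it from a config block and getting an error when a value has the wrong type. It uses only the standard library as a stand-in for `pydantic`, and everything in it is hypothetical: the `REGISTRY` dict, the `register` decorator, the `@function` key and the `resolve` helper are not part of spaCy or Thinc.

```python
import inspect
from typing import Any, Callable, Dict, List

# Hypothetical stand-in for a function registry; not spaCy's or Thinc's API.
REGISTRY: Dict[str, Callable[..., Any]] = {}


def register(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        REGISTRY[name] = func
        return func

    return decorator


@register("constant_rate.v1")
def constant_rate(rate: float, steps: int) -> List[float]:
    return [rate] * steps


def resolve(block: Dict[str, Any]) -> Any:
    """Look up the referenced function and check argument types before calling it."""
    func = REGISTRY[block["@function"]]
    args = {key: value for key, value in block.items() if not key.startswith("@")}
    for name, param in inspect.signature(func).parameters.items():
        if name in args and not isinstance(args[name], param.annotation):
            raise TypeError(
                f"{name}: expected {param.annotation.__name__}, "
                f"got {type(args[name]).__name__}"
            )
    return func(**args)


schedule = resolve({"@function": "constant_rate.v1", "rate": 0.1, "steps": 3})
```

Passing `"steps": "3"` instead of `3` raises a `TypeError` that names the offending argument. This strict `isinstance` check is a simplification made for the sketch.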
|
||||
|
||||
### CLI
|
||||
|
||||
| Name | Description |
|
||||
| --------------------------------------- | -------------------------------------------------------------------------------------------------------- |
|
||||
| [`init config`](/api/cli#init-config) | Initialize a [training config](/usage/training) file for a blank language or auto-fill a partial config. |
|
||||
| [`debug config`](/api/cli#debug-config) | Debug a [training config](/usage/training) file and show validation errors. |
|
||||
| [`project`](/api/cli#project) | Subcommand for cloning and running [spaCy projects](/usage/projects). |
|
||||
|
||||
## Backwards Incompatibilities {#incompat}
|
||||
|
||||
### Removed or renamed objects, methods, attributes and arguments {#incompat-removed}
|
||||
As always, we've tried to keep the breaking changes to a minimum and focus on
|
||||
changes that were necessary to support the new features, fix problems or improve
|
||||
usability. The following section lists the relevant changes to the user-facing
|
||||
API. For specific examples of how to rewrite your code, check out the
|
||||
[migration guide](#migrating).
|
||||
|
||||
| Removed | Replacement |
|
||||
| -------------------------------------------------------- | ----------------------------------------- |
|
||||
| `GoldParse` | [`Example`](/api/example) |
|
||||
| `GoldCorpus` | [`Corpus`](/api/corpus) |
|
||||
| `spacy debug-data` | [`spacy debug data`](/api/cli#debug-data) |
|
||||
| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated |
|
||||
### Compatibility {#incompat-compat}
|
||||
|
||||
### Removed deprecated methods, attributes and arguments {#incompat-removed-deprecated}
|
||||
- spaCy now requires **Python 3.6+**.
|
||||
|
||||
### API changes {#incompat-api}
|
||||
|
||||
- [`Language.add_pipe`](/api/language#add_pipe) now takes the **string name** of
|
||||
the component factory instead of the component function.
|
||||
- **Custom pipeline components** now need to be decorated with the
|
||||
[`@Language.component`](/api/language#component) or
|
||||
[`@Language.factory`](/api/language#factory) decorator.
|
||||
- [`Language.update`](/api/language#update) now takes a batch of
|
||||
[`Example`](/api/example) objects instead of raw texts and annotations, or
|
||||
`Doc` and `GoldParse` objects.
|
||||
- The `Language.disable_pipes` contextmanager has been replaced by
|
||||
[`Language.select_pipes`](/api/language#select_pipes), which can explicitly
|
||||
disable or enable components.
|
||||
|
||||
### Removed or renamed API {#incompat-removed}
|
||||
|
||||
| Removed | Replacement |
|
||||
| -------------------------------------------------------- | ----------------------------------------------------- |
|
||||
| `Language.disable_pipes` | [`Language.select_pipes`](/api/language#select_pipes) |
|
||||
| `GoldParse` | [`Example`](/api/example) |
|
||||
| `GoldCorpus` | [`Corpus`](/api/corpus) |
|
||||
| `spacy debug-data` | [`spacy debug data`](/api/cli#debug-data) |
|
||||
| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated |
|
||||
|
||||
The following deprecated methods, attributes and arguments were removed in v3.0.
|
||||
Most of them have been **deprecated for a while** and many would previously
|
||||
|
@ -214,17 +286,14 @@ python -m spacy package ./model ./packages
|
|||
- python setup.py sdist
|
||||
```
|
||||
|
||||
## Migration notes for plugin maintainers {#plugins}
|
||||
#### Migration notes for plugin maintainers {#migrating-plugins}
|
||||
|
||||
Thanks to everyone who's been contributing to the spaCy ecosystem by developing
|
||||
and maintaining one of the many awesome [plugins and extensions](/universe).
|
||||
We've tried to keep breaking changes to a minimum and make it as easy as
|
||||
possible for you to upgrade your packages for spaCy v3.
|
||||
|
||||
### Custom pipeline components
|
||||
|
||||
The most common use case for plugins is providing pipeline components and
|
||||
extension attributes.
|
||||
We've tried to make it as easy as possible for you to upgrade your packages for
|
||||
spaCy v3. The most common use case for plugins is providing pipeline components
|
||||
and extension attributes. When migrating your plugin, double-check the
|
||||
following:
|
||||
|
||||
- Use the [`@Language.factory`](/api/language#factory) decorator to register
|
||||
your component and assign it a name. This allows users to refer to your
|
||||
|
|