Update docs [ci skip]

Ines Montani 2020-07-29 19:41:34 +02:00
parent 9f69afdd1e
commit 9c80cb673d
5 changed files with 92 additions and 27 deletions

View File

@@ -24,10 +24,55 @@ TODO: intro and how architectures work, link to
## Transformer architectures {#transformers source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/architectures.py"}
The following architectures are provided by the package
[`spacy-transformers`](https://github.com/explosion/spacy-transformers). See the
[usage documentation](/usage/transformers) for how to integrate the
architectures into your training config.
### spacy-transformers.TransformerModel.v1 {#TransformerModel}
<!-- TODO: description -->
> #### Example Config
>
> ```ini
> [model]
> @architectures = "spacy-transformers.TransformerModel.v1"
> name = "roberta-base"
> tokenizer_config = {"use_fast": true}
>
> [model.get_spans]
> @span_getters = "strided_spans.v1"
> window = 128
> stride = 96
> ```
| Name | Type | Description |
| ------------------ | ---------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `name` | str | Any model name that can be loaded by [`transformers.AutoModel`](https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoModel). |
| `get_spans`        | `Callable`       | Function that takes a batch of [`Doc`](/api/doc) objects and returns lists of [`Span`](/api/span) objects for the transformer to process. [See here](/api/transformer#span_getters) for built-in options and examples. |
| `tokenizer_config` | `Dict[str, Any]` | Tokenizer settings passed to [`transformers.AutoTokenizer`](https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoTokenizer). |
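
The `strided_spans.v1` getter in the example config splits each `Doc` into
overlapping windows, so tokens near a window boundary still get context from
both sides. As a rough sketch of the idea (hypothetical code, not the actual
spacy-transformers implementation):

```python
from typing import List
from spacy.tokens import Doc, Span

def strided_spans(docs: List[Doc], window: int = 128, stride: int = 96) -> List[List[Span]]:
    # Emit slices of up to `window` tokens per Doc, advancing the start by
    # `stride`, so consecutive spans overlap by `window - stride` tokens
    # (32 tokens with the config values above).
    result = []
    for doc in docs:
        doc_spans = []
        start = 0
        while start < len(doc):
            doc_spans.append(doc[start : start + window])
            if start + window >= len(doc):
                break
            start += stride
        result.append(doc_spans)
    return result
```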
### spacy-transformers.Tok2VecListener.v1 {#Tok2VecListener}
<!-- TODO: description -->
> #### Example Config
>
> ```ini
> [model]
> @architectures = "spacy-transformers.Tok2VecListener.v1"
> grad_factor = 1.0
>
> [model.pooling]
> @layers = "reduce_mean.v1"
> ```
| Name | Type | Description |
| ------------- | ------------------------- | ---------------------------------------------------------------------------------------------- |
| `grad_factor` | float | Factor for weighting the gradient if multiple components listen to the same transformer model. |
| `pooling` | `Model[Ragged, Floats2d]` | Pooling layer to determine how the vector for each spaCy token will be computed. |
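
To make the `Model[Ragged, Floats2d]` contract concrete, here's a minimal
standalone sketch using Thinc directly (assumes Thinc and NumPy are installed;
the data is made up). Each entry in `lengths` says how many wordpiece rows
belong to one spaCy token:

```python
import numpy
from thinc.api import reduce_mean
from thinc.types import Ragged

pooling = reduce_mean()
# Six wordpiece rows belonging to three tokens (2, 1 and 3 rows each):
data = Ragged(
    numpy.asarray([[1.0], [3.0], [5.0], [2.0], [4.0], [6.0]], dtype="f"),
    numpy.asarray([2, 1, 3], dtype="i"),
)
vectors = pooling.predict(data)
print(vectors)  # [[2.], [5.], [4.]]: one mean vector per token
```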
## Parser & NER architectures {#parser source="spacy/ml/models/parser.py"}
### spacy.TransitionBasedParser.v1 {#TransitionBasedParser}

View File

@@ -366,13 +366,13 @@ Transformer tokens and outputs for one `Doc` object.
<!-- TODO: -->
| Name | Type | Description |
| ---------- | -------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------- |
| `spans` | `List[List[Span]]` | <!-- TODO: --> |
| `tokens` | [`transformers.BatchEncoding`](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.BatchEncoding) | <!-- TODO: --> |
| `tensors` | `List[torch.Tensor]` | <!-- TODO: --> |
| `align` | [`Ragged`](https://thinc.ai/docs/api-types#ragged) | <!-- TODO: --> |
| `doc_data` | `List[TransformerData]` | <!-- TODO: also mention it's property --> |
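
Note that `doc_data` is a property: it splits the batch back into one
`TransformerData` per `Doc`. A hedged illustration, assuming `batch` is a
`FullTransformerBatch` produced internally by the
[`Transformer`](/api/transformer) component:

```python
# `batch.doc_data` is a property returning List[TransformerData],
# one entry per Doc in the batch:
per_doc = batch.doc_data
print(len(per_doc), "docs in this batch")
```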
### FullTransformerBatch.unsplit_by_doc {#fulltransformerbatch-unsplit_by_doc tag="method"}

View File

@@ -220,15 +220,19 @@ available pipeline components and component functions.
> ruler = nlp.add_pipe("entity_ruler")
> ```
| String name | Component | Description |
| --------------- | ----------------------------------------------- | ----------------------------------------------------------------------------------------- |
| `tagger`        | [`Tagger`](/api/tagger)                         | Assign part-of-speech tags.                                                                |
| `parser` | [`DependencyParser`](/api/dependencyparser) | Assign dependency labels. |
| `ner` | [`EntityRecognizer`](/api/entityrecognizer) | Assign named entities. |
| `entity_linker` | [`EntityLinker`](/api/entitylinker) | Assign knowledge base IDs to named entities. Should be added after the entity recognizer. |
| `entity_ruler` | [`EntityRuler`](/api/entityruler) | Assign named entities based on pattern rules and dictionaries. |
| `textcat` | [`TextCategorizer`](/api/textcategorizer) | Assign text categories. |
| `morphologizer` | [`Morphologizer`](/api/morphologizer) | Assign morphological features and coarse-grained POS tags. |
| `senter` | [`SentenceRecognizer`](/api/sentencerecognizer) | Assign sentence boundaries. |
| `sentencizer` | [`Sentencizer`](/api/sentencizer) | Add rule-based sentence segmentation without the dependency parse. |
| `tok2vec` | [`Tok2Vec`](/api/tok2vec) | <!-- TODO: --> |
| `transformer` | [`Transformer`](/api/transformer) | Assign the tokens and outputs of a transformer model. |
<!-- TODO: update with more components -->
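
For example, the string names in the table can be used to assemble a pipeline
from a blank model (a minimal sketch, assuming spaCy v3):

```python
import spacy

nlp = spacy.blank("en")
# Components are added by their registered string names:
nlp.add_pipe("tagger")
nlp.add_pipe("senter")
print(nlp.pipe_names)  # ['tagger', 'senter']
```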

View File

@@ -101,7 +101,9 @@ evaluate, package and visualize your model.
The `[components]` section in the [`config.cfg`](#TODO:) describes the pipeline
components and the settings used to construct them, including their model
implementation. Here's a config snippet for the
[`Transformer`](/api/transformer) component, along with matching Python code. In
this case, the `[components.transformer]` block describes the `transformer`
component:
> #### Python equivalent
>
@@ -257,10 +259,10 @@ grad_factor = 1.0
```
The [Tok2VecListener](/api/architectures#Tok2VecListener) layer expects a
[pooling layer](https://thinc.ai/docs/api-layers#reduction-ops) as the argument
`pooling`, which needs to be of type `Model[Ragged, Floats2d]`. This layer
determines how the vector for each spaCy token will be computed from the zero or
more source rows the token is aligned against. Here we use the
[`reduce_mean`](https://thinc.ai/docs/api-layers#reduce_mean) layer, which
averages the wordpiece rows. We could instead use `reduce_last`,
[`reduce_max`](https://thinc.ai/docs/api-layers#reduce_max), or a custom

View File

@@ -36,13 +36,18 @@ const DATA = [
],
},
{
id: 'addition',
title: 'Additions',
multiple: true,
options: [
{
id: 'transformers',
title: 'Transformers',
help: 'Use transformers like BERT to train your spaCy models',
},
{
id: 'lookups',
title: 'Lemmatizer data',
help: 'Install additional lookup tables and rules for lemmatization',
},
],
@@ -86,13 +91,22 @@ const QuickstartInstall = ({ id, title }) => (
set PYTHONPATH=C:\path\to\spaCy
</QS>
<QS package="source">pip install -r requirements.txt</QS>
<QS data="lookups" package="pip">
<QS addition="transformers" package="pip">
pip install -U spacy-lookups-transformers
</QS>
<QS addition="transformers" package="source">
pip install -U spacy-transformers
</QS>
<QS addition="transformers" package="conda">
conda install -c conda-forge spacy-transformers
</QS>
<QS addition="lookups" package="pip">
pip install -U spacy-lookups-data
</QS>
<QS data="lookups" package="source">
<QS addition="lookups" package="source">
pip install -U spacy-lookups-data
</QS>
<QS data="lookups" package="conda">
<QS addition="lookups" package="conda">
conda install -c conda-forge spacy-lookups-data
</QS>
<QS package="source">python setup.py build_ext --inplace</QS>