diff --git a/website/docs/api/architectures.md b/website/docs/api/architectures.md index 534f0bdf0..95f7d0597 100644 --- a/website/docs/api/architectures.md +++ b/website/docs/api/architectures.md @@ -24,10 +24,55 @@ TODO: intro and how architectures work, link to ## Transformer architectures {#transformers source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/architectures.py"} +The following architectures are provided by the package +[`spacy-transformers`](https://github.com/explosion/spacy-transformers). See the +[usage documentation](/usage/transformers) for how to integrate the +architectures into your training config. + ### spacy-transformers.TransformerModel.v1 {#TransformerModel} + + +> #### Example Config +> +> ```ini +> [model] +> @architectures = "spacy-transformers.TransformerModel.v1" +> name = "roberta-base" +> tokenizer_config = {"use_fast": true} +> +> [model.get_spans] +> @span_getters = "strided_spans.v1" +> window = 128 +> stride = 96 +> ``` + +| Name | Type | Description | +| ------------------ | ---------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `name` | str | Any model name that can be loaded by [`transformers.AutoModel`](https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoModel). | +| `get_spans` | `Callable` | Function that takes a batch of [`Doc`](/api/doc) objects and returns lists of [`Span`](/api/span) objects for the transformer to process. [See here](/api/transformer#span_getters) for built-in options and examples. | +| `tokenizer_config` | `Dict[str, Any]` | Tokenizer settings passed to [`transformers.AutoTokenizer`](https://huggingface.co/transformers/model_doc/auto.html#transformers.AutoTokenizer).
| + ### spacy-transformers.Tok2VecListener.v1 {#Tok2VecListener} + + +> #### Example Config +> +> ```ini +> [model] +> @architectures = "spacy-transformers.Tok2VecListener.v1" +> grad_factor = 1.0 +> +> [model.pooling] +> @layers = "reduce_mean.v1" +> ``` + +| Name | Type | Description | +| ------------- | ------------------------- | ---------------------------------------------------------------------------------------------- | +| `grad_factor` | float | Factor for weighting the gradient if multiple components listen to the same transformer model. | +| `pooling` | `Model[Ragged, Floats2d]` | Pooling layer to determine how the vector for each spaCy token will be computed. | + ## Parser & NER architectures {#parser source="spacy/ml/models/parser.py"} ### spacy.TransitionBasedParser.v1 {#TransitionBasedParser} diff --git a/website/docs/api/transformer.md b/website/docs/api/transformer.md index 764b3dd88..70128d225 100644 --- a/website/docs/api/transformer.md +++ b/website/docs/api/transformer.md @@ -366,13 +366,13 @@ Transformer tokens and outputs for one `Doc` object. 
-| Name | Type | Description | -| ---------- | -------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------- | -| `spans` | `List[List[Span]]` | | -| `tokens` | [`transformers.BatchEncoding`](https://huggingface.co/transformers/main_classes/tokenizer.html?highlight=batchencoding#transformers.BatchEncoding) | | -| `tensors` | `List[torch.Tensor]` | | -| `align` | [`Ragged`](https://thinc.ai/docs/api-types#ragged) | | -| `doc_data` | `List[TransformerData]` | | +| Name | Type | Description | +| ---------- | -------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------- | +| `spans` | `List[List[Span]]` | The batches of input spans, with one list of spans per `Doc` object. | +| `tokens` | [`transformers.BatchEncoding`](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.BatchEncoding) | The output of the Hugging Face tokenizer. | +| `tensors` | `List[torch.Tensor]` | The output tensors from the transformer model. | +| `align` | [`Ragged`](https://thinc.ai/docs/api-types#ragged) | Alignment from the spaCy tokens to the wordpiece tokens. | +| `doc_data` | `List[TransformerData]` | The outputs, split per `Doc` object. | ### FullTransformerBatch.unsplit_by_doc {#fulltransformerbatch-unsplit_by_doc tag="method"} diff --git a/website/docs/usage/processing-pipelines.md b/website/docs/usage/processing-pipelines.md index 08e8e964f..56ade692a 100644 --- a/website/docs/usage/processing-pipelines.md +++ b/website/docs/usage/processing-pipelines.md @@ -220,15 +220,19 @@ available pipeline components and component functions. > ruler = nlp.add_pipe("entity_ruler") > ``` -| String name | Component | Description | | --------------- | ------------------------------------------- | ----------------------------------------------------------------------------------------- | -| `tagger` | [`Tagger`](/api/tagger) | Assign part-of-speech-tags. | -| `parser` | [`DependencyParser`](/api/dependencyparser) | Assign dependency labels.
| -| `ner` | [`EntityRecognizer`](/api/entityrecognizer) | Assign named entities. | -| `entity_linker` | [`EntityLinker`](/api/entitylinker) | Assign knowledge base IDs to named entities. Should be added after the entity recognizer. | -| `textcat` | [`TextCategorizer`](/api/textcategorizer) | Assign text categories. | -| `entity_ruler` | [`EntityRuler`](/api/entityruler) | Assign named entities based on pattern rules. | -| `sentencizer` | [`Sentencizer`](/api/sentencizer) | Add rule-based sentence segmentation without the dependency parse. | +| String name | Component | Description | | --------------- | ----------------------------------------------- | ----------------------------------------------------------------------------------------- | +| `tagger` | [`Tagger`](/api/tagger) | Assign part-of-speech tags. | +| `parser` | [`DependencyParser`](/api/dependencyparser) | Assign dependency labels. | +| `ner` | [`EntityRecognizer`](/api/entityrecognizer) | Assign named entities. | +| `entity_linker` | [`EntityLinker`](/api/entitylinker) | Assign knowledge base IDs to named entities. Should be added after the entity recognizer. | +| `entity_ruler` | [`EntityRuler`](/api/entityruler) | Assign named entities based on pattern rules and dictionaries. | +| `textcat` | [`TextCategorizer`](/api/textcategorizer) | Assign text categories. | +| `morphologizer` | [`Morphologizer`](/api/morphologizer) | Assign morphological features and coarse-grained POS tags. | +| `senter` | [`SentenceRecognizer`](/api/sentencerecognizer) | Assign sentence boundaries. | +| `sentencizer` | [`Sentencizer`](/api/sentencizer) | Add rule-based sentence segmentation without the dependency parse. | +| `tok2vec` | [`Tok2Vec`](/api/tok2vec) | Assign token-to-vector embeddings. | +| `transformer` | [`Transformer`](/api/transformer) | Assign the tokens and outputs of a transformer model.
| diff --git a/website/docs/usage/transformers.md b/website/docs/usage/transformers.md index 0c98ad630..a7fd83ac6 100644 --- a/website/docs/usage/transformers.md +++ b/website/docs/usage/transformers.md @@ -101,7 +101,9 @@ evaluate, package and visualize your model. The `[components]` section in the [`config.cfg`](#TODO:) describes the pipeline components and the settings used to construct them, including their model implementation. Here's a config snippet for the -[`Transformer`](/api/transformer) component, along with matching Python code: +[`Transformer`](/api/transformer) component, along with matching Python code. In +this case, the `[components.transformer]` block describes the `transformer` +component: > #### Python equivalent > @@ -257,10 +259,10 @@ grad_factor = 1.0 ``` The [Tok2VecListener](/api/architectures#Tok2VecListener) layer expects a -[pooling layer](https://thinc.ai/docs/api-layers#reduction-ops), which needs to -be of type `Model[Ragged, Floats2d]`. This layer determines how the vector for -each spaCy token will be computed from the zero or more source rows the token is -aligned against. Here we use the +[pooling layer](https://thinc.ai/docs/api-layers#reduction-ops) as the argument +`pooling`, which needs to be of type `Model[Ragged, Floats2d]`. This layer +determines how the vector for each spaCy token will be computed from the zero or +more source rows the token is aligned against. Here we use the [`reduce_mean`](https://thinc.ai/docs/api-layers#reduce_mean) layer, which averages the wordpiece rows. 
We could instead use [`reduce_last`](https://thinc.ai/docs/api-layers#reduce_last), [`reduce_max`](https://thinc.ai/docs/api-layers#reduce_max), or a custom diff --git a/website/src/widgets/quickstart-install.js b/website/src/widgets/quickstart-install.js index 237567eb8..b2e72752a 100644 --- a/website/src/widgets/quickstart-install.js +++ b/website/src/widgets/quickstart-install.js @@ -36,13 +36,18 @@ const DATA = [ ], }, { - id: 'data', - title: 'Additional data', + id: 'addition', + title: 'Additions', multiple: true, options: [ + { + id: 'transformers', + title: 'Transformers', + help: 'Use transformers like BERT to train your spaCy models', + }, { id: 'lookups', - title: 'Lemmatization', + title: 'Lemmatizer data', help: 'Install additional lookup tables and rules for lemmatization', }, ], @@ -86,13 +91,22 @@ const QuickstartInstall = ({ id, title }) => ( set PYTHONPATH=C:\path\to\spaCy pip install -r requirements.txt - + + pip install -U spacy-transformers + + + pip install -U spacy-transformers + + + conda install -c conda-forge spacy-transformers + + pip install -U spacy-lookups-data - + pip install -U spacy-lookups-data - + conda install -c conda-forge spacy-lookups-data python setup.py build_ext --inplace
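The pooling layers discussed in the `transformers.md` hunk above can be tried out directly in Thinc, outside of a spaCy pipeline. A minimal sketch (not part of the diff), assuming `thinc` and `numpy` are installed; the sample data and lengths are made up for illustration:

```python
import numpy
from thinc.api import reduce_mean, Ragged

# A Model[Ragged, Floats2d]: averages the zero or more wordpiece rows
# aligned against each token into a single vector per token.
pooling = reduce_mean()

# Two "tokens": the first aligned to two wordpiece rows, the second to one.
data = numpy.asarray([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], dtype="f")
lengths = numpy.asarray([2, 1], dtype="i")
vectors = pooling.predict(Ragged(data, lengths))
# vectors has one row per token: [[2.0, 3.0], [5.0, 6.0]]
```

Swapping in `reduce_max` or `reduce_last` changes only which wordpiece rows survive the reduction; the input and output types stay the same, which is why the `pooling` argument can be filled by any `Model[Ragged, Floats2d]`.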