Update docs [ci skip]

2025-07-23 06:29:48 +03:00 · 2020-08-21 16:11:38 +02:00 · 2020-08-21 16:11:38 +02:00 · 74cb6d39d0
commit 74cb6d39d0
parent f5bcc10268
5 changed files with 125 additions and 25 deletions
--- a/website/docs/api/architectures.md
+++ b/website/docs/api/architectures.md
@@ -303,22 +303,23 @@ architectures into your training config.
 > stride = 96
 > ```

-Load and wrap a transformer model from the Huggingface transformers library.
-You can any transformer that has pretrained weights and a PyTorch
-implementation. The `name` variable is passed through to the underlying
-library, so it can be either a string or a path. If it's a string, the
-pretrained weights will be downloaded via the transformers library if they are
-not already available locally.
-
-In order to support longer documents, the `TransformerModel` layer allows you
-to pass in a `get_spans` function that will divide up the `Doc` objects before
-passing them through the transformer. Your spans are allowed to overlap or
-exclude tokens.
-
-This layer is usually used directly by the `Transformer` component, which
-allows you to share the transformer weights across your pipeline. For a layer
-that's configured for use in other components, see `Tok2VecTransformer`.
+Load and wrap a transformer model from the
+[HuggingFace `transformers`](https://huggingface.co/transformers) library. You
+can use any transformer that has pretrained weights and a PyTorch implementation.
+The `name` variable is passed through to the underlying library, so it can be
+either a string or a path. If it's a string, the pretrained weights will be
+downloaded via the transformers library if they are not already available
+locally.

+In order to support longer documents, the
+[TransformerModel](/api/architectures#TransformerModel) layer allows you to pass
+in a `get_spans` function that will divide up the [`Doc`](/api/doc) objects
+before passing them through the transformer. Your spans are allowed to overlap
+or exclude tokens. This layer is usually used directly by the
+[`Transformer`](/api/transformer) component, which allows you to share the
+transformer weights across your pipeline. For a layer that's configured for use
+in other components, see
+[Tok2VecTransformer](/api/architectures#Tok2VecTransformer).

 | Name               | Description                                                                                                                                                                                                                                           |
 | ------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
--- a/website/docs/usage/architectures.md
+++ b/website/docs/usage/architectures.md
@@ -0,0 +1,92 @@
+---
+title: Layers and Model Architectures
+teaser: Power spaCy components with custom neural networks
+menu:
+  - ['Type Signatures', 'type-sigs']
+  - ['Defining Sublayers', 'sublayers']
+  - ['PyTorch & TensorFlow', 'frameworks']
+  - ['Trainable Components', 'components']
+---
+
+A **model architecture** is a function that wires up a
+[Thinc `Model`](https://thinc.ai/docs/api-model) instance, which you can then
+use in a component or as a layer of a larger network. You can use Thinc as a
+thin wrapper around frameworks such as PyTorch, TensorFlow or MXNet, or you can
+implement your logic in Thinc directly. spaCy's built-in components will never
+construct their `Model` instances themselves, so you won't have to subclass the
+component to change its model architecture. You can just **update the config**
+so that it refers to a different registered function. Once the component has
+been created, its model instance has already been assigned, so you cannot change
+its model architecture. The architecture is like a recipe for the network, and
+you can't change the recipe once the dish has already been prepared. You have to
+make a new one. 
+
+## Type signatures {#type-sigs}
+
+The Thinc `Model` class is a **generic type** that can specify its input and
+output types. Python uses a square-bracket notation for this, so the type
+~~Model[List, Dict]~~ says that each batch of inputs to the model will be a
+list, and the outputs will be a dictionary. Both `typing.List` and `typing.Dict`
+are also generics, allowing you to be more specific about the data. For
+instance, you can write ~~Model[List[Doc], Dict[str, float]]~~ to specify that
+the model expects a list of [`Doc`](/api/doc) objects as input, and returns a
+dictionary mapping strings to floats. Some of the most common types you'll see
+are: 
+
+| Type               | Description                                                                                          |
+| ------------------ | ---------------------------------------------------------------------------------------------------- |
+| ~~List[Doc]~~      | A batch of [`Doc`](/api/doc) objects. Most components expect their models to take this as input.     |
+| ~~Floats2d~~       | A two-dimensional `numpy` or `cupy` array of floats. Usually 32-bit.                                 |
+| ~~Ints2d~~         | A two-dimensional `numpy` or `cupy` array of integers. Common dtypes include uint64, int32 and int8. |
+| ~~List[Floats2d]~~ | A list of two-dimensional arrays, generally with one array per `Doc` and one row per token.          |
+| ~~Ragged~~         | A container to handle variable-length sequence data in an unpadded contiguous array.                 |
+| ~~Padded~~         | A container to handle variable-length sequence data in a padded contiguous array.                    |
+
+The model type-signatures help you figure out which model architectures and
+components can fit together. For instance, the
+[`TextCategorizer`](/api/textcategorizer) class expects a model typed
+~~Model[List[Doc], Floats2d]~~, because the model will predict one row of
+category probabilities per `Doc`. In contrast, the `Tagger` class expects a
+model typed ~~Model[List[Doc], List[Floats2d]]~~, because it needs to predict
+one row of probabilities per token. There's no guarantee that two models with
+the same type-signature can be used interchangeably. There are many other ways
+they could be incompatible. However, if the types don't match, they almost
+surely _won't_ be compatible. This little bit of validation goes a long way,
+especially if you configure your editor or other tools to highlight these errors
+early. Thinc will also verify that your types match correctly when your config
+file is processed at the beginning of training. 
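
The contrast between a per-document and a per-token output type can be sketched with plain Python generics. This is only an illustration under stated assumptions: `ToyModel`, `score_docs` and `score_tokens` are hypothetical stand-ins, strings stand in for `Doc` objects, and nested lists stand in for `Floats2d` arrays. None of it is Thinc's actual `Model` API.

```python
from typing import Callable, Generic, List, TypeVar

InT = TypeVar("InT")
OutT = TypeVar("OutT")


class ToyModel(Generic[InT, OutT]):
    """Hypothetical stand-in for a generic model type (not Thinc's Model)."""

    def __init__(self, forward: Callable[[InT], OutT]) -> None:
        self.forward = forward

    def predict(self, batch: InT) -> OutT:
        return self.forward(batch)


def score_docs(docs: List[str]) -> List[List[float]]:
    # One row per document, loosely analogous to Model[List[Doc], Floats2d].
    return [[float(len(doc.split()))] for doc in docs]


def score_tokens(docs: List[str]) -> List[List[List[float]]]:
    # One array per document with one row per token, loosely analogous to
    # Model[List[Doc], List[Floats2d]].
    return [[[float(len(tok))] for tok in doc.split()] for doc in docs]


# The annotations let a type checker flag a model used in the wrong place.
doc_model: ToyModel[List[str], List[List[float]]] = ToyModel(score_docs)
token_model: ToyModel[List[str], List[List[List[float]]]] = ToyModel(score_tokens)

docs = ["a short text", "another one"]
doc_scores = doc_model.predict(docs)      # 2 rows, one per document
token_scores = token_model.predict(docs)  # 2 arrays, one row per token
```

Both toy models accept the same input type, but a type checker would reject passing `token_model` where a per-document model is annotated, which is the kind of early validation the paragraph above describes.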
+
+## Defining sublayers {#sublayers}
+
+Model architecture functions often accept sublayers as arguments, so that you
+can try substituting a different layer into the network. Depending on how the
+architecture function is structured, you might be able to define your network
+structure entirely through the [config system](/usage/training#config), using
+layers that have already been defined. The
+[transformers documentation](/usage/embeddings-transformers#transformers)
+section shows a common example of swapping in a different sublayer. In most NLP
+neural network models, the most important parts of the network are what we refer
+to as the
+[embed and encode](https://explosion.ai/blog/embed-encode-attend-predict) steps.
+These steps together compute dense, context-sensitive representations of the
+tokens. Most of spaCy's default architectures accept a `tok2vec` layer as an
+argument, so you can control this important part of the network separately. This
+makes it easy to switch between transformer, CNN, BiLSTM or other feature
+extraction approaches. And if you want to define your own solution, all you need
+to do is register a ~~Model[List[Doc], List[Floats2d]]~~ architecture function,
+and you'll be able to try it out in any of spaCy's components. 
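
The idea of swapping a sublayer by pointing the config at a different registered function can be sketched in plain Python. Everything here is a toy under stated assumptions: the `ARCHITECTURES` dict, the `register` decorator, `build_from_config` and both encoders are hypothetical, and strings stand in for `Doc` objects. Only the `@architectures` key mirrors the naming used in spaCy configs; the real registry and config resolution live in spaCy and Thinc.

```python
from typing import Callable, Dict, List

Vectors = List[List[List[float]]]  # per doc, per token, per dimension
Tok2Vec = Callable[[List[str]], Vectors]

# Hypothetical registry mapping a name to an architecture function.
ARCHITECTURES: Dict[str, Callable[..., Tok2Vec]] = {}


def register(name: str) -> Callable[[Callable[..., Tok2Vec]], Callable[..., Tok2Vec]]:
    def decorator(func: Callable[..., Tok2Vec]) -> Callable[..., Tok2Vec]:
        ARCHITECTURES[name] = func
        return func

    return decorator


@register("toy.LengthEncoder.v1")
def build_length_encoder(width: int) -> Tok2Vec:
    def forward(docs: List[str]) -> Vectors:
        # Each token becomes a row of `width` copies of its length.
        return [[[float(len(tok))] * width for tok in doc.split()] for doc in docs]

    return forward


@register("toy.VowelEncoder.v1")
def build_vowel_encoder(width: int) -> Tok2Vec:
    def forward(docs: List[str]) -> Vectors:
        # Each token becomes a row of `width` copies of its vowel count.
        return [
            [[float(sum(ch in "aeiou" for ch in tok))] * width for tok in doc.split()]
            for doc in docs
        ]

    return forward


def build_from_config(config: Dict[str, object]) -> Tok2Vec:
    # Look up the registered function by name and call it with the rest.
    params = dict(config)
    name = str(params.pop("@architectures"))
    return ARCHITECTURES[name](**params)


length_vectors = build_from_config(
    {"@architectures": "toy.LengthEncoder.v1", "width": 2}
)(["hello world"])
vowel_vectors = build_from_config(
    {"@architectures": "toy.VowelEncoder.v1", "width": 2}
)(["hello world"])
```

Only the name in the config changes between the two calls; the calling code stays the same because both functions return a layer with the same signature.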
+
+### Registering new architectures
+
+- Recap concept, link to config docs. 
+
+## Wrapping PyTorch, TensorFlow and other frameworks {#frameworks}
+
+- Explain concept
+- Link off to notebook 
+
+## Models for trainable components {#components}
+
+- Interaction with `predict`, `get_loss` and `set_annotations`
+- Initialization life-cycle with `begin_training`.
+- Link to relation extraction notebook.
--- a/website/meta/sidebars.json
+++ b/website/meta/sidebars.json
@@ -24,6 +24,11 @@
                        "tag": "new"
                    },
                    { "text": "Training Models", "url": "/usage/training", "tag": "new" },
+                    {
+                        "text": "Layers & Model Architectures",
+                        "url": "/usage/architectures",
+                        "tag": "new"
+                    },
                    { "text": "spaCy Projects", "url": "/usage/projects", "tag": "new" },
                    { "text": "Saving & Loading", "url": "/usage/saving-loading" },
                    { "text": "Visualizers", "url": "/usage/visualizers" }
--- a/website/meta/type-annotations.json
+++ b/website/meta/type-annotations.json
@@ -29,6 +29,8 @@
    "Optimizer": "https://thinc.ai/docs/api-optimizers",
    "Model": "https://thinc.ai/docs/api-model",
    "Ragged": "https://thinc.ai/docs/api-types#ragged",
+    "Padded": "https://thinc.ai/docs/api-types#padded",
+    "Ints2d": "https://thinc.ai/docs/api-types#types",
    "Floats2d": "https://thinc.ai/docs/api-types#types",
    "Floats3d": "https://thinc.ai/docs/api-types#types",
    "FloatsXd": "https://thinc.ai/docs/api-types#types",
--- a/website/src/styles/code.module.sass
+++ b/website/src/styles/code.module.sass
@@ -67,7 +67,7 @@
        border: 0

    // Special style for types in API tables
-    td > &:last-child
+    td:not(:first-child) > &:last-child
        display: block
        border-top: 1px dotted var(--color-subtle)
        border-radius: 0