Update docs [ci skip]

This commit is contained in:
Ines Montani 2020-08-10 00:42:26 +02:00
parent c044460823
commit d611cbef43
4 changed files with 137 additions and 63 deletions

View File

@ -8,9 +8,9 @@ api_string_name: tok2vec
api_trainable: true
---

Apply a "token-to-vector" model and set its outputs in the `Doc.tensor`
attribute. This is mostly useful to **share a single subnetwork** between
multiple components, e.g. to have one embedding and CNN network shared between a
[`DependencyParser`](/api/dependencyparser), [`Tagger`](/api/tagger) and
[`EntityRecognizer`](/api/entityrecognizer).
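As a rough sketch of how this looks in code (assuming spaCy v3's string-based `add_pipe` API; the component still needs to be trained or initialized before its output appears in `doc.tensor`):

```python
import spacy

nlp = spacy.blank("en")
# Add a shared token-to-vector component. Downstream components can listen to
# its output instead of running their own embedding subnetwork.
tok2vec = nlp.add_pipe("tok2vec")
print(nlp.pipe_names)  # ['tok2vec']

# Once the pipeline is trained/initialized, processing a text sets doc.tensor:
# doc = nlp("A sentence to embed.")
# print(doc.tensor.shape)
```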

View File

@ -25,12 +25,12 @@ work out-of-the-box.
</Infobox>

This pipeline component lets you use transformer models in your pipeline.
Supports all models that are available via the
[HuggingFace `transformers`](https://huggingface.co/transformers) library.
Usually you will connect subsequent components to the shared transformer using
the [TransformerListener](/api/architectures#TransformerListener) layer. This
works similarly to spaCy's [Tok2Vec](/api/tok2vec) component and
[Tok2VecListener](/api/architectures/Tok2VecListener) sublayer.

The component assigns the output of the transformer to the `Doc`'s extension
@ -419,7 +419,7 @@ Split a `TransformerData` object that represents a batch into a list with one
| ----------- | ----------------------- | ----------- |
| **RETURNS** | `List[TransformerData]` |             |
## Span getters {#span_getters source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/span_getters.py"}
Span getters are functions that take a batch of [`Doc`](/api/doc) objects and
return a list of [`Span`](/api/span) objects for each doc, to be processed by
@ -427,15 +427,15 @@ the transformer. This is used to manage long documents, by cutting them into
smaller sequences before running the transformer. The spans are allowed to
overlap, and you can also omit sections of the Doc if they are not relevant.

Span getters can be referenced in the `[components.transformer.model.get_spans]`
block of the config to customize the sequences processed by the transformer. You
can also register custom span getters using the `@spacy.registry.span_getters`
decorator.
> #### Example
>
> ```python
> @spacy.registry.span_getters("sent_spans.v1")
> def configure_get_sent_spans() -> Callable:
>     def get_sent_spans(docs: Iterable[Doc]) -> List[List[Span]]:
>         return [list(doc.sents) for doc in docs]
@ -448,15 +448,55 @@ the `@registry.span_getters` decorator.
| `docs`      | `Iterable[Doc]`    | A batch of `Doc` objects.                |
| **RETURNS** | `List[List[Span]]` | The spans to process by the transformer. |
### doc_spans.v1 {#doc_spans tag="registered function"}

> #### Example config
>
> ```ini
> [transformer.model.get_spans]
> @span_getters = "doc_spans.v1"
> ```

Create a span getter that uses the whole document as its spans. This is the best
approach if your [`Doc`](/api/doc) objects already refer to relatively short
texts.

### sent_spans.v1 {#sent_spans tag="registered function"}
> #### Example config
>
> ```ini
> [transformer.model.get_spans]
> @span_getters = "sent_spans.v1"
> ```
Create a span getter that uses sentence boundary markers to extract the spans.
This requires sentence boundaries to be set (e.g. by the
[`Sentencizer`](/api/sentencizer)), and may result in somewhat uneven batches,
depending on the sentence lengths. However, it does provide the transformer with
more meaningful windows to attend over.
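For instance, here is a minimal sketch of a pipeline in which sentence boundaries are available for `sent_spans.v1` to use (the `Sentencizer` is rule-based, so no trained weights are needed for this illustration):

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # sets sentence boundaries on the Doc

doc = nlp("This is the first sentence. This is the second one.")
# These sentence spans are what sent_spans.v1 would hand to the transformer.
print([sent.text for sent in doc.sents])
# ['This is the first sentence.', 'This is the second one.']
```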
### strided_spans.v1 {#strided_spans tag="registered function"}
> #### Example config
>
> ```ini
> [transformer.model.get_spans]
> @span_getters = "strided_spans.v1"
> window = 128
> stride = 96
> ```
Create a span getter for strided spans. If you set the `window` and `stride` to
the same value, the spans will cover each token once. Setting `stride` lower
than `window` will allow for an overlap, so that some tokens are counted twice.
This can be desirable, because it allows all tokens to have both a left and
right context.
| Name     | Type  | Description      |
| -------- | ----- | ---------------- |
| `window` | `int` | The window size. |
| `stride` | `int` | The stride size. |
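To illustrate the effect of `window` and `stride`, here's a rough sketch of the token offsets a strided approach produces (an illustration only, not the library's actual implementation):

```python
from typing import List, Tuple

def strided_offsets(n_tokens: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Return illustrative (start, end) token offsets for strided spans."""
    spans = []
    start = 0
    while start < n_tokens:
        spans.append((start, min(start + window, n_tokens)))
        if start + window >= n_tokens:
            break
        start += stride
    return spans

# With stride < window, consecutive spans overlap by (window - stride) tokens.
print(strided_offsets(300, window=128, stride=96))
# [(0, 128), (96, 224), (192, 300)]
```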
## Annotation setters {#annotation_setters tag="registered functions" source="github.com/explosion/spacy-transformers/blob/master/spacy_transformers/annotation_setters.py"}

View File

@ -1,54 +1,88 @@
The central data structures in spaCy are the [`Language`](/api/language) class,
the [`Vocab`](/api/vocab) and the [`Doc`](/api/doc) object. The `Language` class
is used to process a text and turn it into a `Doc` object. It's typically stored
as a variable called `nlp`. The `Doc` object owns the **sequence of tokens** and
all their annotations. By centralizing strings, word vectors and lexical
attributes in the `Vocab`, we avoid storing multiple copies of this data. This
saves memory, and ensures there's a **single source of truth**.

Text annotations are also designed to allow a single source of truth: the `Doc`
object owns the data, and [`Span`](/api/span) and [`Token`](/api/token) are
**views that point into it**. The `Doc` object is constructed by the
[`Tokenizer`](/api/tokenizer), and then **modified in place** by the components
of the pipeline. The `Language` object coordinates these components. It takes
raw text and sends it through the pipeline, returning an **annotated document**.
It also orchestrates training and serialization.

<!-- TODO: update graphic -->

![Library architecture](../../images/architecture.svg)
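A minimal sketch of these objects in action (assuming the trained pipeline package `en_core_web_sm` is installed):

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # Language object, typically called nlp
doc = nlp("Apple is looking at buying a U.K. startup.")

token = doc[0]                       # Token: a view into the Doc
span = doc[0:2]                      # Span: a slice of the Doc, also a view
print(token.text, token.pos_)
print(span.text)
print(doc.vocab is nlp.vocab)        # the Doc shares the Language's Vocab -> True
```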
### Container objects {#architecture-containers}
| Name                        | Description                                                                                                                                              |
| --------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`Language`](/api/language) | Processing class that turns text into `Doc` objects. Different languages implement their own subclasses of it. The variable is typically called `nlp`.  |
| [`Doc`](/api/doc)           | A container for accessing linguistic annotations.                                                                                                        |
| [`Span`](/api/span)         | A slice from a `Doc` object.                                                                                                                             |
| [`Token`](/api/token)       | An individual token — i.e. a word, punctuation symbol, whitespace, etc.                                                                                  |
| [`Lexeme`](/api/lexeme)     | An entry in the vocabulary. It's a word type with no context, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse etc.  |
| [`Example`](/api/example)   | A collection of training annotations, containing two `Doc` objects: the reference data and the predictions.                                             |
| [`DocBin`](/api/docbin)     | A collection of `Doc` objects for efficient binary serialization. Also used for [training data](/api/data-formats#binary-training).                     |
### Processing pipeline {#architecture-pipeline}
The processing pipeline consists of one or more **pipeline components** that are
called on the `Doc` in order. The tokenizer runs before the components. Pipeline
components can be added using [`Language.add_pipe`](/api/language#add_pipe).
They can contain a statistical model and trained weights, or only make
rule-based modifications to the `Doc`. spaCy provides a range of built-in
components for different language processing tasks and also allows adding
[custom components](/usage/processing-pipelines#custom-components).

![The processing pipeline](../../images/pipeline.svg)

| Name                                            | Description                                                                                   |
| ----------------------------------------------- | --------------------------------------------------------------------------------------------- |
| [`Tokenizer`](/api/tokenizer)                   | Segment raw text and create `Doc` objects from the words.                                     |
| [`Tok2Vec`](/api/tok2vec)                       | Apply a "token-to-vector" model and set its outputs.                                          |
| [`Transformer`](/api/transformer)               | Use a transformer model and set its outputs.                                                  |
| [`Lemmatizer`](/api/lemmatizer)                 | Determine the base forms of words.                                                            |
| [`Morphologizer`](/api/morphologizer)           | Predict morphological features and coarse-grained part-of-speech tags.                       |
| [`Tagger`](/api/tagger)                         | Predict part-of-speech tags.                                                                  |
| [`AttributeRuler`](/api/attributeruler)         | Set token attributes using matcher rules.                                                     |
| [`DependencyParser`](/api/dependencyparser)     | Predict syntactic dependencies.                                                               |
| [`EntityRecognizer`](/api/entityrecognizer)     | Predict named entities, e.g. persons or products.                                             |
| [`EntityRuler`](/api/entityruler)               | Add entity spans to the `Doc` using token-based rules or exact phrase matches.                |
| [`EntityLinker`](/api/entitylinker)             | Disambiguate named entities to nodes in a knowledge base.                                     |
| [`TextCategorizer`](/api/textcategorizer)       | Predict categories or labels over the whole document.                                         |
| [`Sentencizer`](/api/sentencizer)               | Implement rule-based sentence boundary detection that doesn't require the dependency parse.   |
| [`SentenceRecognizer`](/api/sentencerecognizer) | Predict sentence boundaries.                                                                  |
| [Other functions](/api/pipeline-functions)      | Automatically apply something to the `Doc`, e.g. to merge spans of tokens.                    |
| [`Pipe`](/api/pipe)                             | Base class that all trainable pipeline components inherit from.                               |
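As a rough sketch (using the rule-based `Sentencizer`, which needs no trained weights), components are added by name with `Language.add_pipe` and then run on the `Doc` in order:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")               # add a built-in component by name

doc = nlp("This is a sentence. This is another one.")
print(nlp.pipe_names)                     # ['sentencizer']
print([sent.text for sent in doc.sents])
```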
### Matchers {#architecture-matchers}
Matchers help you find and extract information from [`Doc`](/api/doc) objects
based on match patterns describing the sequences you're looking for. A matcher
operates on a `Doc` and gives you access to the matched tokens **in context**.
| Name | Description |
| --------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`Matcher`](/api/matcher) | Match sequences of tokens, based on pattern rules, similar to regular expressions. |
| [`PhraseMatcher`](/api/phrasematcher) | Match sequences of tokens based on phrases. |
| [`DependencyMatcher`](/api/dependencymatcher) | Match sequences of tokens based on dependency trees using the [Semgrex syntax](https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/semgrex/SemgrexPattern.html). |
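For example, a minimal `Matcher` sketch (the pattern and text are illustrative):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# Match the token "hello" followed by "world", case-insensitively
matcher.add("HELLO_WORLD", [[{"LOWER": "hello"}, {"LOWER": "world"}]])

doc = nlp("Hello world! Hello there.")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)
# HELLO_WORLD Hello world
```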
### Other classes {#architecture-other}
| Name                                  | Description                                                                                                       |
| ------------------------------------- | -------------------------------------------------------------------------------------------------------------------- |
| [`Vocab`](/api/vocab)                 | The shared vocabulary that stores strings and gives you access to [`Lexeme`](/api/lexeme) objects.                |
| [`StringStore`](/api/stringstore)     | Map strings to and from hash values.                                                                              |
| [`Vectors`](/api/vectors)             | Container class for vector data keyed by string.                                                                  |
| [`Lookups`](/api/lookups)             | Container for convenient access to large lookup tables and dictionaries.                                          |
| [`Morphology`](/api/morphology)       | Assign linguistic features like lemmas, noun case, verb tense etc. based on the word and its part-of-speech tag.  |
| [`MorphAnalysis`](/api/morphanalysis) | A morphological analysis.                                                                                         |
| [`KnowledgeBase`](/api/kb)            | Storage for entities and aliases of a knowledge base for entity linking.                                          |
| [`Scorer`](/api/scorer)               | Compute evaluation scores.                                                                                        |
| [`Corpus`](/api/corpus)               | Class for managing annotated corpora for training and evaluation data.                                            |
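A small sketch of how the `Vocab` and `StringStore` fit together (the hash value shown is illustrative):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I love coffee")

# The StringStore maps strings to 64-bit hashes and back
coffee_hash = nlp.vocab.strings["coffee"]
print(coffee_hash)                        # e.g. 3197928453018144401
print(nlp.vocab.strings[coffee_hash])     # 'coffee'

# A Lexeme is a context-independent vocabulary entry
lexeme = nlp.vocab["coffee"]
print(lexeme.text, lexeme.is_alpha)       # coffee True
```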

View File

@ -11,7 +11,7 @@ import Link from './link'
import GitHubCode from './github'
import classes from '../styles/code.module.sass'

const WRAP_THRESHOLD = 16

export default props => (
    <Pre>