diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md
index 751cff6a5..75be71845 100644
--- a/website/docs/usage/embeddings-transformers.md
+++ b/website/docs/usage/embeddings-transformers.md
@@ -229,6 +229,8 @@ By default, the `Transformer` component sets the
 [`Doc._.trf_data`](/api/transformer#custom_attributes) extension attribute,
 which lets you access the transformers outputs at runtime.

+
+
 ```cli
 $ python -m spacy download en_core_trf_lg
 ```
@@ -368,10 +370,10 @@ To change any of the settings, you can edit the `config.cfg` and re-run the
 training. To change any of the functions, like the span getter, you can replace
 the name of the referenced function – e.g. `@span_getters = "sent_spans.v1"` to
 process sentences. You can also register your own functions using the
-`span_getters` registry. For instance, the following custom function returns
-`Span` objects following sentence boundaries, unless a sentence succeeds a
-certain amount of tokens, in which case subsentences of at most `max_length`
-tokens are returned.
+[`span_getters` registry](/api/top-level#registry). For instance, the following
+custom function returns [`Span`](/api/span) objects following sentence
+boundaries, unless a sentence succeeds a certain amount of tokens, in which case
+subsentences of at most `max_length` tokens are returned.

 > #### config.cfg
 >
@@ -408,7 +410,7 @@ def configure_custom_sent_spans(max_length: int):
 To resolve the config during training, spaCy needs to know about your custom
 function. You can make it available via the `--code` argument that can point to
 a Python file. For more details on training with custom code, see the
-[training documentation](/usage/training#custom-code).
+[training documentation](/usage/training#custom-functions).

 ```cli
 python -m spacy train ./config.cfg --code ./code.py
diff --git a/website/docs/usage/linguistic-features.md b/website/docs/usage/linguistic-features.md
index fe57d65ce..a0e58c9d2 100644
--- a/website/docs/usage/linguistic-features.md
+++ b/website/docs/usage/linguistic-features.md
@@ -750,14 +750,6 @@ subclass.

 ---

-
-
 ### Adding special case tokenization rules {#special-cases}

 Most domains have at least some idiosyncrasies that require custom tokenization
@@ -1488,19 +1480,20 @@ for sent in doc.sents:
     print(sent.text)
 ```

-spaCy provides three alternatives for sentence segmentation:
+spaCy provides four alternatives for sentence segmentation:

-1. [Dependency parser](#sbd-parser): the statistical `parser` provides the most
-   accurate sentence boundaries based on full dependency parses.
-2. [Statistical sentence segmenter](#sbd-senter): the statistical `senter` is a
-   simpler and faster alternative to the parser that only sets sentence
-   boundaries.
-3. [Rule-based pipeline component](#sbd-component): the rule-based `sentencizer`
-   sets sentence boundaries using a customizable list of sentence-final
-   punctuation.
-
-You can also plug an entirely custom [rule-based function](#sbd-custom) into
-your [processing pipeline](/usage/processing-pipelines).
+1. [Dependency parser](#sbd-parser): the statistical
+   [`DependencyParser`](/api/dependencyparser) provides the most accurate
+   sentence boundaries based on full dependency parses.
+2. [Statistical sentence segmenter](#sbd-senter): the statistical
+   [`SentenceRecognizer`](/api/sentencerecognizer) is a simpler and faster
+   alternative to the parser that only sets sentence boundaries.
+3. [Rule-based pipeline component](#sbd-component): the rule-based
+   [`Sentencizer`](/api/sentencizer) sets sentence boundaries using a
+   customizable list of sentence-final punctuation.
+4. [Custom function](#sbd-custom): your own custom function added to the
+   processing pipeline can set sentence boundaries by writing to
+   `Token.is_sent_start`.

 ### Default: Using the dependency parse {#sbd-parser model="parser"}

@@ -1535,7 +1528,13 @@ smaller than the parser, its primary advantage is that it's easier to train
 custom models because it only requires annotated sentence boundaries rather than
 full dependency parses.

-
+
+
+> #### senter vs. parser
+>
+> The recall for the `senter` is typically slightly lower than for the parser,
+> which is better at predicting sentence boundaries when punctuation is not
+> present.

 ```python
 ### {executable="true"}
@@ -1547,10 +1546,6 @@ for sent in doc.sents:
     print(sent.text)
 ```

-The recall for the `senter` is typically slightly lower than for the parser,
-which is better at predicting sentence boundaries when punctuation is not
-present.
-
 ### Rule-based pipeline component {#sbd-component}

 The [`Sentencizer`](/api/sentencizer) component is a