Update docs [ci skip]

Ines Montani 2020-08-29 12:53:14 +02:00
parent 450bf806b0
commit bc0730be3f
2 changed files with 27 additions and 30 deletions


@@ -229,6 +229,8 @@ By default, the `Transformer` component sets the
 [`Doc._.trf_data`](/api/transformer#custom_attributes) extension attribute,
 which lets you access the transformer's outputs at runtime.
 
+<!-- TODO: update/confirm once we have final models trained -->
+
 ```cli
 $ python -m spacy download en_core_trf_lg
 ```
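Once such a pipeline is installed, the extension attribute can be inspected directly. Here's a minimal sketch, assuming the provisional pipeline name from the command above and that the component stores the output arrays on `trf_data.tensors` (the exact data layout may still change, per the TODO):

```python
import spacy

# Provisional pipeline name taken from the download command above
nlp = spacy.load("en_core_trf_lg")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

# The transformer output set by the Transformer component at runtime
trf_data = doc._.trf_data
print(trf_data.tensors[0].shape)  # assumption: per-wordpiece hidden states
```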
@@ -368,10 +370,10 @@ To change any of the settings, you can edit the `config.cfg` and re-run the
 training. To change any of the functions, like the span getter, you can replace
 the name of the referenced function, e.g. `@span_getters = "sent_spans.v1"` to
 process sentences. You can also register your own functions using the
-`span_getters` registry. For instance, the following custom function returns
-`Span` objects following sentence boundaries, unless a sentence exceeds a
-certain number of tokens, in which case subsentences of at most `max_length`
-tokens are returned.
+[`span_getters` registry](/api/top-level#registry). For instance, the following
+custom function returns [`Span`](/api/span) objects following sentence
+boundaries, unless a sentence exceeds a certain number of tokens, in which case
+subsentences of at most `max_length` tokens are returned.
 
 > #### config.cfg
 >
@@ -408,7 +410,7 @@ def configure_custom_sent_spans(max_length: int):
 To resolve the config during training, spaCy needs to know about your custom
 function. You can make it available via the `--code` argument that can point to
 a Python file. For more details on training with custom code, see the
-[training documentation](/usage/training#custom-code).
+[training documentation](/usage/training#custom-functions).
 
 ```cli
 python -m spacy train ./config.cfg --code ./code.py
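For reference, the `code.py` passed to `--code` could register the span getter along these lines. This is a sketch: the registered name `"custom_sent_spans.v1"` is illustrative and just has to match the `@span_getters` reference in your `config.cfg`, and it assumes the registry exposed by `spacy-transformers`:

```python
### code.py
import spacy_transformers

# Illustrative name; it must match the @span_getters value in config.cfg
@spacy_transformers.registry.span_getters("custom_sent_spans.v1")
def configure_custom_sent_spans(max_length: int):
    def get_custom_sent_spans(docs):
        spans = []
        for doc in docs:
            spans.append([])
            for sent in doc.sents:
                # Return the sentence as-is, or split it into subsentences
                # of at most max_length tokens
                start = 0
                while start < len(sent):
                    spans[-1].append(sent[start : start + max_length])
                    start += max_length
        return spans
    return get_custom_sent_spans
```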


@@ -750,14 +750,6 @@ subclass.
 ---
 
-<!--
-
-### Customizing the tokenizer {#tokenizer-custom}
-
-TODO: rewrite the docs on custom tokenization in a more user-friendly order, including details on how to integrate a fully custom tokenizer, representing a tokenizer in the config etc.
-
--->
-
 ### Adding special case tokenization rules {#special-cases}
 
 Most domains have at least some idiosyncrasies that require custom tokenization
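As a taste of what such rules look like, a special case can map a single string to multiple tokens. This is a standard example; `spacy.blank("en")` just provides a bare English pipeline:

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")
# Special case: tokenize "gimme" as two tokens, "gim" and "me"
special_case = [{ORTH: "gim"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("gimme", special_case)
print([token.text for token in nlp("gimme that")])  # ['gim', 'me', 'that']
```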
@@ -1488,19 +1480,20 @@ for sent in doc.sents:
     print(sent.text)
 ```
 
-spaCy provides three alternatives for sentence segmentation:
+spaCy provides four alternatives for sentence segmentation:
 
-1. [Dependency parser](#sbd-parser): the statistical `parser` provides the most
-   accurate sentence boundaries based on full dependency parses.
-2. [Statistical sentence segmenter](#sbd-senter): the statistical `senter` is a
-   simpler and faster alternative to the parser that only sets sentence
-   boundaries.
-3. [Rule-based pipeline component](#sbd-component): the rule-based `sentencizer`
-   sets sentence boundaries using a customizable list of sentence-final
-   punctuation.
-
-You can also plug an entirely custom [rule-based function](#sbd-custom) into
-your [processing pipeline](/usage/processing-pipelines).
+1. [Dependency parser](#sbd-parser): the statistical
+   [`DependencyParser`](/api/dependencyparser) provides the most accurate
+   sentence boundaries based on full dependency parses.
+2. [Statistical sentence segmenter](#sbd-senter): the statistical
+   [`SentenceRecognizer`](/api/sentencerecognizer) is a simpler and faster
+   alternative to the parser that only sets sentence boundaries.
+3. [Rule-based pipeline component](#sbd-component): the rule-based
+   [`Sentencizer`](/api/sentencizer) sets sentence boundaries using a
+   customizable list of sentence-final punctuation.
+4. [Custom function](#sbd-custom): your own custom function added to the
+   processing pipeline can set sentence boundaries by writing to
+   `Token.is_sent_start`, as in the sketch below.
 
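To make the fourth option concrete, here's a minimal sketch of a custom component that writes to `Token.is_sent_start`. The component name is illustrative, and it has to be added before the parser so the preset boundaries are respected:

```python
import spacy
from spacy.language import Language

@Language.component("custom_sentencizer")  # illustrative name
def custom_sentencizer(doc):
    for i, token in enumerate(doc[:-1]):
        # Treat "..." as a hard sentence boundary and block others
        doc[i + 1].is_sent_start = token.text == "..."
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("custom_sentencizer", before="parser")  # insert before the parser
doc = nlp("Let's go... I said, let's go!")
print([sent.text for sent in doc.sents])
```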
 ### Default: Using the dependency parse {#sbd-parser model="parser"}
@@ -1535,7 +1528,13 @@ smaller than the parser, its primary advantage is that it's easier to train
 custom models because it only requires annotated sentence boundaries rather than
 full dependency parses.
 
-<!-- TODO: correct senter loading -->
+<!-- TODO: update/confirm usage once we have final models trained -->
+
+> #### senter vs. parser
+>
+> The recall for the `senter` is typically slightly lower than for the parser,
+> which is better at predicting sentence boundaries when punctuation is not
+> present.
 
 ```python
 ### {executable="true"}
@@ -1547,10 +1546,6 @@ for sent in doc.sents:
     print(sent.text)
 ```
 
-The recall for the `senter` is typically slightly lower than for the parser,
-which is better at predicting sentence boundaries when punctuation is not
-present.
-
 ### Rule-based pipeline component {#sbd-component}
 
 The [`Sentencizer`](/api/sentencizer) component is a