mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-12 02:06:31 +03:00
Update docs [ci skip]
parent 450bf806b0
commit bc0730be3f
@@ -229,6 +229,8 @@ By default, the `Transformer` component sets the
 [`Doc._.trf_data`](/api/transformer#custom_attributes) extension attribute,
 which lets you access the transformers outputs at runtime.
 
+<!-- TODO: update/confirm once we have final models trained -->
+
 ```cli
 $ python -m spacy download en_core_trf_lg
 ```
@@ -368,10 +370,10 @@ To change any of the settings, you can edit the `config.cfg` and re-run the
 training. To change any of the functions, like the span getter, you can replace
 the name of the referenced function – e.g. `@span_getters = "sent_spans.v1"` to
 process sentences. You can also register your own functions using the
-`span_getters` registry. For instance, the following custom function returns
-`Span` objects following sentence boundaries, unless a sentence succeeds a
-certain amount of tokens, in which case subsentences of at most `max_length`
-tokens are returned.
+[`span_getters` registry](/api/top-level#registry). For instance, the following
+custom function returns [`Span`](/api/span) objects following sentence
+boundaries, unless a sentence succeeds a certain amount of tokens, in which case
+subsentences of at most `max_length` tokens are returned.
 
 > #### config.cfg
 >
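Review note on the hunk above: the documented span getter keeps each sentence whole unless it is longer than `max_length` tokens, in which case it is cut into consecutive pieces. The following is a minimal, stdlib-only sketch of that chunking rule, not the code from the commit. Plain lists of strings stand in for spaCy `Span` objects, and the names `split_sentence` and `custom_sent_spans` are hypothetical.

```python
from typing import List, Sequence


def split_sentence(tokens: Sequence[str], max_length: int) -> List[List[str]]:
    """Cut one sentence into consecutive chunks of at most max_length tokens."""
    if max_length < 1:
        raise ValueError("max_length must be at least 1")
    return [
        list(tokens[start : start + max_length])
        for start in range(0, len(tokens), max_length)
    ]


def custom_sent_spans(
    sentences: Sequence[Sequence[str]], max_length: int
) -> List[List[str]]:
    """Keep short sentences whole and split long ones into sub-sentences."""
    spans: List[List[str]] = []
    for sentence in sentences:
        # A sentence that already fits yields exactly one chunk: itself.
        spans.extend(split_sentence(sentence, max_length))
    return spans


# Hypothetical input: two pre-tokenized sentences (stand-ins for doc.sents).
sentences = [
    ["A", "short", "one", "."],
    ["This", "one", "is", "a", "bit", "longer", "."],
]
chunks = custom_sent_spans(sentences, max_length=4)
print(chunks)
```

The first sentence fits within the limit and is returned unchanged; the second is split into a four-token piece and a three-token remainder. A real span getter would slice `Span` objects from each `Doc` instead of lists.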
@@ -408,7 +410,7 @@ def configure_custom_sent_spans(max_length: int):
 To resolve the config during training, spaCy needs to know about your custom
 function. You can make it available via the `--code` argument that can point to
 a Python file. For more details on training with custom code, see the
-[training documentation](/usage/training#custom-code).
+[training documentation](/usage/training#custom-functions).
 
 ```cli
 python -m spacy train ./config.cfg --code ./code.py
@@ -750,14 +750,6 @@ subclass.
 
 ---
 
-<!--
-
-### Customizing the tokenizer {#tokenizer-custom}
-
-TODO: rewrite the docs on custom tokenization in a more user-friendly order, including details on how to integrate a fully custom tokenizer, representing a tokenizer in the config etc.
-
--->
-
 ### Adding special case tokenization rules {#special-cases}
 
 Most domains have at least some idiosyncrasies that require custom tokenization
@@ -1488,19 +1480,20 @@ for sent in doc.sents:
     print(sent.text)
 ```
 
-spaCy provides three alternatives for sentence segmentation:
+spaCy provides four alternatives for sentence segmentation:
 
-1. [Dependency parser](#sbd-parser): the statistical `parser` provides the most
-accurate sentence boundaries based on full dependency parses.
-2. [Statistical sentence segmenter](#sbd-senter): the statistical `senter` is a
-simpler and faster alternative to the parser that only sets sentence
-boundaries.
-3. [Rule-based pipeline component](#sbd-component): the rule-based `sentencizer`
-sets sentence boundaries using a customizable list of sentence-final
-punctuation.
-
-You can also plug an entirely custom [rule-based function](#sbd-custom) into
-your [processing pipeline](/usage/processing-pipelines).
+1. [Dependency parser](#sbd-parser): the statistical
+[`DependencyParser`](/api/dependencyparser) provides the most accurate
+sentence boundaries based on full dependency parses.
+2. [Statistical sentence segmenter](#sbd-senter): the statistical
+[`SentenceRecognizer`](/api/sentencerecognizer) is a simpler and faster
+alternative to the parser that only sets sentence boundaries.
+3. [Rule-based pipeline component](#sbd-component): the rule-based
+[`Sentencizer`](/api/sentencizer) sets sentence boundaries using a
+customizable list of sentence-final punctuation.
+4. [Custom function](#sbd-custom): your own custom function added to the
+processing pipeline can set sentence boundaries by writing to
+`Token.is_sent_start`.
 
 ### Default: Using the dependency parse {#sbd-parser model="parser"}
 
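Review note on the new fourth list item above: a custom function sets sentence boundaries by writing a boolean to `Token.is_sent_start` for each token that should open a sentence. The sketch below shows only the flag-then-group idea with the standard library. Plain lists stand in for a `Doc` and its per-token flags, and the names `mark_after_ellipsis` and `group_sentences` are hypothetical. In a real pipeline component the function would receive a `Doc`, set `token.is_sent_start` on its tokens, and return the `Doc`.

```python
from typing import List


def mark_after_ellipsis(tokens: List[str]) -> List[bool]:
    """Return one is_sent_start-style flag per token.

    The first token always starts a sentence; so does any token that
    directly follows an ellipsis.
    """
    flags = [False] * len(tokens)
    if tokens:
        flags[0] = True
    for i in range(1, len(tokens)):
        if tokens[i - 1] == "...":
            flags[i] = True
    return flags


def group_sentences(tokens: List[str], flags: List[bool]) -> List[List[str]]:
    """Group tokens into sentences, opening a new one at every True flag."""
    sentences: List[List[str]] = []
    for token, starts_sentence in zip(tokens, flags):
        if starts_sentence or not sentences:
            sentences.append([])
        sentences[-1].append(token)
    return sentences


# Hypothetical pre-tokenized input.
tokens = ["this", "is", "a", "sentence", "...", "hello", "...",
          "and", "another", "sentence", "."]
flags = mark_after_ellipsis(tokens)
sentences = group_sentences(tokens, flags)
for sentence in sentences:
    print(" ".join(sentence))
```

The rule used here (split after `...`) is only an illustration; any rule that produces one flag per token fits the same shape.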
@@ -1535,7 +1528,13 @@ smaller than the parser, its primary advantage is that it's easier to train
 custom models because it only requires annotated sentence boundaries rather than
 full dependency parses.
 
-<!-- TODO: correct senter loading -->
+<!-- TODO: update/confirm usage once we have final models trained -->
 
+> #### senter vs. parser
+>
+> The recall for the `senter` is typically slightly lower than for the parser,
+> which is better at predicting sentence boundaries when punctuation is not
+> present.
+
 ```python
 ### {executable="true"}
@@ -1547,10 +1546,6 @@ for sent in doc.sents:
     print(sent.text)
 ```
 
-The recall for the `senter` is typically slightly lower than for the parser,
-which is better at predicting sentence boundaries when punctuation is not
-present.
-
 ### Rule-based pipeline component {#sbd-component}
 
 The [`Sentencizer`](/api/sentencizer) component is a