Update docs [ci skip]

Ines Montani 2020-10-02 13:24:33 +02:00
parent d2aa662ab2
commit df06f7a792
10 changed files with 88 additions and 44 deletions

@ -8,8 +8,8 @@ source: spacy/language.py
Usually you'll load this once per process as `nlp` and pass the instance around Usually you'll load this once per process as `nlp` and pass the instance around
your application. The `Language` class is created when you call your application. The `Language` class is created when you call
[`spacy.load`](/api/top-level#spacy.load) and contains the shared vocabulary and [`spacy.load`](/api/top-level#spacy.load) and contains the shared vocabulary and
[language data](/usage/adding-languages), optional binary weights, e.g. provided [language data](/usage/linguistic-features#language-data), optional binary
by a [trained pipeline](/models), and the weights, e.g. provided by a [trained pipeline](/models), and the
[processing pipeline](/usage/processing-pipelines) containing components like [processing pipeline](/usage/processing-pipelines) containing components like
the tagger or parser that are called on a document in order. You can also add the tagger or parser that are called on a document in order. You can also add
your own processing pipeline components that take a `Doc` object, modify it and your own processing pipeline components that take a `Doc` object, modify it and
@ -210,7 +210,9 @@ settings defined in the [`[initialize]`](/api/data-formats#config-initialize)
config block to set up the vocabulary, load in vectors and tok2vec weights and config block to set up the vocabulary, load in vectors and tok2vec weights and
pass optional arguments to the `initialize` methods implemented by pipeline pass optional arguments to the `initialize` methods implemented by pipeline
components or the tokenizer. This method is typically called automatically when components or the tokenizer. This method is typically called automatically when
you run [`spacy train`](/api/cli#train). you run [`spacy train`](/api/cli#train). See the usage guide on the
[config lifecycle](/usage/training#config-lifecycle) and
[initialization](/usage/training#initialization) for details.
`get_examples` should be a function that returns an iterable of `get_examples` should be a function that returns an iterable of
[`Example`](/api/example) objects. The data examples can either be the full [`Example`](/api/example) objects. The data examples can either be the full
@ -928,7 +930,7 @@ Serialize the current state to a binary string.
Load state from a binary string. Note that this method is commonly used via the Load state from a binary string. Note that this method is commonly used via the
subclasses like `English` or `German` to make language-specific functionality subclasses like `English` or `German` to make language-specific functionality
like the [lexical attribute getters](/usage/adding-languages#lex-attrs) like the [lexical attribute getters](/usage/linguistic-features#language-data)
available to the loaded object. available to the loaded object.
> #### Example > #### Example

@ -130,8 +130,7 @@ applied to the `Doc` in order.
## Lemmatizer.lookup_lemmatize {#lookup_lemmatize tag="method"} ## Lemmatizer.lookup_lemmatize {#lookup_lemmatize tag="method"}
Lemmatize a token using a lookup-based approach. If no lemma is found, the Lemmatize a token using a lookup-based approach. If no lemma is found, the
original string is returned. Languages can provide a original string is returned.
[lookup table](/usage/adding-languages#lemmatizer) via the `Lookups`.
| Name | Description | | Name | Description |
| ----------- | --------------------------------------------------- | | ----------- | --------------------------------------------------- |

@ -437,8 +437,8 @@ The L2 norm of the token's vector representation.
| `ent_id_` | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. ~~str~~ | | `ent_id_` | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. ~~str~~ |
| `lemma` | Base form of the token, with no inflectional suffixes. ~~int~~ | | `lemma` | Base form of the token, with no inflectional suffixes. ~~int~~ |
| `lemma_` | Base form of the token, with no inflectional suffixes. ~~str~~ | | `lemma_` | Base form of the token, with no inflectional suffixes. ~~str~~ |
| `norm` | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions). ~~int~~ | | `norm` | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/linguistic-features#language-data). ~~int~~ |
| `norm_` | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions). ~~str~~ | | `norm_` | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/linguistic-features#language-data). ~~str~~ |
| `lower` | Lowercase form of the token. ~~int~~ | | `lower` | Lowercase form of the token. ~~int~~ |
| `lower_` | Lowercase form of the token text. Equivalent to `Token.text.lower()`. ~~str~~ | | `lower_` | Lowercase form of the token text. Equivalent to `Token.text.lower()`. ~~str~~ |
| `shape` | Transform of the token's string to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. ~~int~~ | | `shape` | Transform of the token's string to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. ~~int~~ |
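
As a rough illustration of the `shape` attribute described in the row above, the following sketch maps a string to its orthographic shape. The function name `word_shape_sketch` is hypothetical and this is not spaCy's own implementation; it only follows the rules stated in the description (letters become `x` or `X`, digits become `d`, and runs of the same shape character are capped at four).

```python
def word_shape_sketch(text: str) -> str:
    """Approximate the documented `shape` transform (illustrative only)."""
    shape = []
    last = ""  # previous shape character
    run = 0    # length of the current run of identical shape characters
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and other characters are kept as-is
        run = run + 1 if shape_char == last else 1
        last = shape_char
        if run <= 4:
            shape.append(shape_char)
    return "".join(shape)


print(word_shape_sketch("Apple"))
print(word_shape_sketch("2020-10-02"))
```

Capping each run at four keeps the number of distinct shapes small, which is presumably why long words and long numbers collapse to the same value.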

@ -22,9 +22,8 @@ like punctuation and special case rules from the
## Tokenizer.\_\_init\_\_ {#init tag="method"} ## Tokenizer.\_\_init\_\_ {#init tag="method"}
Create a `Tokenizer` to create `Doc` objects given unicode text. For examples Create a `Tokenizer` to create `Doc` objects given unicode text. For examples of
of how to construct a custom tokenizer with different tokenization rules, see how to construct a custom tokenizer with different tokenization rules, see the
the
[usage documentation](https://spacy.io/usage/linguistic-features#native-tokenizers). [usage documentation](https://spacy.io/usage/linguistic-features#native-tokenizers).
> #### Example > #### Example
@ -87,7 +86,7 @@ Tokenize a stream of texts.
| ------------ | ------------------------------------------------------------------------------------ | | ------------ | ------------------------------------------------------------------------------------ |
| `texts` | A sequence of unicode texts. ~~Iterable[str]~~ | | `texts` | A sequence of unicode texts. ~~Iterable[str]~~ |
| `batch_size` | The number of texts to accumulate in an internal buffer. Defaults to `1000`. ~~int~~ | | `batch_size` | The number of texts to accumulate in an internal buffer. Defaults to `1000`. ~~int~~ |
| **YIELDS** | The tokenized `Doc` objects, in order. ~~Doc~~ | | **YIELDS** | The tokenized `Doc` objects, in order. ~~Doc~~ |
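
The `batch_size` buffering described in the table can be pictured with a small, library-independent sketch. The helper names `batched` and `pipe_sketch` and the stand-in `tokenize` callable are assumptions made for illustration, not spaCy internals. The sketch only shows texts being accumulated into a buffer of at most `batch_size` items, with results yielded in input order.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(texts: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of at most `batch_size` texts, preserving order."""
    iterator = iter(texts)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def pipe_sketch(
    texts: Iterable[str],
    tokenize: Callable[[str], List[str]],
    batch_size: int = 1000,
) -> Iterator[List[str]]:
    """Tokenize a stream of texts batch by batch, yielding results in order."""
    for batch in batched(texts, batch_size):
        for text in batch:
            yield tokenize(text)


# Whitespace splitting stands in for a real tokenizer here (an assumption).
for tokens in pipe_sketch(["Hello world", "spaCy docs"], str.split, batch_size=1):
    print(tokens)
```

Because both helpers are generators, the input stream is consumed lazily and never has to fit in memory at once.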
## Tokenizer.find_infix {#find_infix tag="method"} ## Tokenizer.find_infix {#find_infix tag="method"}
@ -121,10 +120,10 @@ if no suffix rules match.
## Tokenizer.add_special_case {#add_special_case tag="method"} ## Tokenizer.add_special_case {#add_special_case tag="method"}
Add a special-case tokenization rule. This mechanism is also used to add custom Add a special-case tokenization rule. This mechanism is also used to add custom
tokenizer exceptions to the language data. See the usage guide on tokenizer exceptions to the language data. See the usage guide on the
[adding languages](/usage/adding-languages#tokenizer-exceptions) and [language data](/usage/linguistic-features#language-data) and
[linguistic features](/usage/linguistic-features#special-cases) for more details [tokenizer special cases](/usage/linguistic-features#special-cases) for more
and examples. details and examples.
> #### Example > #### Example
> >

@ -827,7 +827,7 @@ utilities.
### util.get_lang_class {#util.get_lang_class tag="function"} ### util.get_lang_class {#util.get_lang_class tag="function"}
Import and load a `Language` class. Allows lazy-loading Import and load a `Language` class. Allows lazy-loading
[language data](/usage/adding-languages) and importing languages using the [language data](/usage/linguistic-features#language-data) and importing languages using the
two-letter language code. To add a language code for a custom language class, two-letter language code. To add a language code for a custom language class,
you can register it using the [`@registry.languages`](/api/top-level#registry) you can register it using the [`@registry.languages`](/api/top-level#registry)
decorator. decorator.

@ -30,7 +30,7 @@ import QuickstartModels from 'widgets/quickstart-models.js'
## Language support {#languages} ## Language support {#languages}
spaCy currently provides support for the following languages. You can help by spaCy currently provides support for the following languages. You can help by
[improving the existing language data](/usage/adding-languages#language-data) improving the existing [language data](/usage/linguistic-features#language-data)
and extending the tokenization patterns. and extending the tokenization patterns.
[See here](https://github.com/explosion/spaCy/issues/3056) for details on how to [See here](https://github.com/explosion/spaCy/issues/3056) for details on how to
contribute to development. contribute to development.
@ -83,55 +83,81 @@ To train a pipeline using the neutral multi-language class, you can set
import the `MultiLanguage` class directly, or call import the `MultiLanguage` class directly, or call
[`spacy.blank("xx")`](/api/top-level#spacy.blank) for lazy-loading. [`spacy.blank("xx")`](/api/top-level#spacy.blank) for lazy-loading.
### Chinese language support {#chinese new=2.3} ### Chinese language support {#chinese new="2.3"}
The Chinese language class supports three word segmentation options, `char`, The Chinese language class supports three word segmentation options, `char`,
`jieba` and `pkuseg`: `jieba` and `pkuseg`.
> #### Manual setup
>
> ```python > ```python
> from spacy.lang.zh import Chinese > from spacy.lang.zh import Chinese
> >
> # Character segmentation (default) > # Character segmentation (default)
> nlp = Chinese() > nlp = Chinese()
>
> # Jieba > # Jieba
> cfg = {"segmenter": "jieba"} > cfg = {"segmenter": "jieba"}
> nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}}) > nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
>
> # PKUSeg with "default" model provided by pkuseg > # PKUSeg with "default" model provided by pkuseg
> cfg = {"segmenter": "pkuseg"} > cfg = {"segmenter": "pkuseg"}
> nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}}) > nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
> nlp.tokenizer.initialize(pkuseg_model="default") > nlp.tokenizer.initialize(pkuseg_model="default")
> ``` > ```
1. **Character segmentation:** Character segmentation is the default ```ini
segmentation option. It's enabled when you create a new `Chinese` language ### config.cfg
class or call `spacy.blank("zh")`. [nlp.tokenizer]
2. **Jieba:** `Chinese` uses [Jieba](https://github.com/fxsjy/jieba) for word @tokenizers = "spacy.zh.ChineseTokenizer"
segmentation with the tokenizer option `{"segmenter": "jieba"}`. segmenter = "char"
3. **PKUSeg**: As of spaCy v2.3.0, support for ```
[PKUSeg](https://github.com/lancopku/PKUSeg-python) has been added to support
better segmentation for Chinese OntoNotes and the provided
[Chinese pipelines](/models/zh). Enable PKUSeg with the tokenizer option
`{"segmenter": "pkuseg"}`.
<Infobox variant="warning"> | Segmenter | Description |
| --------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `char` | **Character segmentation:** Character segmentation is the default segmentation option. It's enabled when you create a new `Chinese` language class or call `spacy.blank("zh")`. |
| `jieba` | **Jieba:** to use [Jieba](https://github.com/fxsjy/jieba) for word segmentation, you can set the option `segmenter` to `"jieba"`. |
| `pkuseg` | **PKUSeg**: As of spaCy v2.3.0, support for [PKUSeg](https://github.com/lancopku/PKUSeg-python) has been added to support better segmentation for Chinese OntoNotes and the provided [Chinese pipelines](/models/zh). Enable PKUSeg by setting the tokenizer option `segmenter` to `"pkuseg"`. |
In spaCy v3.0, the default Chinese word segmenter has switched from Jieba to <Infobox title="Changed in v3.0" variant="warning">
character segmentation.
In v3.0, the default word segmenter has switched from Jieba to character
segmentation. Because the `pkuseg` segmenter depends on a model that can be
loaded from a file, the model is loaded on
[initialization](/usage/training#config-lifecycle) (typically before training).
This ensures that your packaged Chinese model doesn't depend on a local path at
runtime.
</Infobox> </Infobox>
<Accordion title="Details on spaCy's Chinese API"> <Accordion title="Details on spaCy's Chinese API">
The `initialize` method for the Chinese tokenizer class supports the following The `initialize` method for the Chinese tokenizer class supports the following
config settings for loading pkuseg models: config settings for loading `pkuseg` models:
| Name | Description | | Name | Description |
| ------------------ | ------------------------------------------------------------------------------------------------------------------------------------- | | ------------------ | ------------------------------------------------------------------------------------------------------------------------------------- |
| `pkuseg_model` | Name of a model provided by `pkuseg` or the path to a local model directory. ~~str~~ | | `pkuseg_model` | Name of a model provided by `pkuseg` or the path to a local model directory. ~~str~~ |
| `pkuseg_user_dict` | Optional path to a file with one word per line which overrides the default `pkuseg` user dictionary. Defaults to `"default"`. ~~str~~ | | `pkuseg_user_dict` | Optional path to a file with one word per line which overrides the default `pkuseg` user dictionary. Defaults to `"default"`. ~~str~~ |
The initialization settings are typically provided in the
[training config](/usage/training#config) and the data is loaded in before
training and serialized with the model. This allows you to load the data from a
local path and save out your pipeline and config, without requiring the same
local path at runtime. See the usage guide on the
[config lifecycle](/usage/training#config-lifecycle) for more background on
this.
```ini
### config.cfg
[initialize]
[initialize.tokenizer]
pkuseg_model = "/path/to/model"
pkuseg_user_dict = "default"
```
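
To make the shape of the `[initialize.tokenizer]` block above concrete, here is a minimal sketch that reads those two settings with Python's standard library. It uses `configparser` and `json` as stand-ins and is not how spaCy itself loads configs; the helper name `read_tokenizer_init_settings` is hypothetical.

```python
import configparser
import json

CONFIG_TEXT = """
[initialize]

[initialize.tokenizer]
pkuseg_model = "/path/to/model"
pkuseg_user_dict = "default"
"""


def read_tokenizer_init_settings(config_text: str) -> dict:
    """Return the [initialize.tokenizer] settings as a plain dict."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    section = parser["initialize.tokenizer"]
    # Values are written as quoted strings, so decode them as JSON.
    return {key: json.loads(value) for key, value in section.items()}


settings = read_tokenizer_init_settings(CONFIG_TEXT)
print(settings)
```

The keys in the resulting dict match the `pkuseg_model` and `pkuseg_user_dict` settings listed in the table above.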
You can also initialize the tokenizer for a blank language class by calling its
`initialize` method:
```python ```python
### Examples ### Examples
# Initialize the pkuseg tokenizer # Initialize the pkuseg tokenizer
@ -191,12 +217,13 @@ nlp.tokenizer.initialize(pkuseg_model="/path/to/pkuseg_model")
### Japanese language support {#japanese new=2.3} ### Japanese language support {#japanese new=2.3}
> #### Manual setup
>
> ```python > ```python
> from spacy.lang.ja import Japanese > from spacy.lang.ja import Japanese
> >
> # Load SudachiPy with split mode A (default) > # Load SudachiPy with split mode A (default)
> nlp = Japanese() > nlp = Japanese()
>
> # Load SudachiPy with split mode B > # Load SudachiPy with split mode B
> cfg = {"split_mode": "B"} > cfg = {"split_mode": "B"}
> nlp = Japanese.from_config({"nlp": {"tokenizer": cfg}}) > nlp = Japanese.from_config({"nlp": {"tokenizer": cfg}})
@ -208,6 +235,13 @@ segmentation and part-of-speech tagging. The default Japanese language class and
the provided Japanese pipelines use SudachiPy split mode `A`. The tokenizer the provided Japanese pipelines use SudachiPy split mode `A`. The tokenizer
config can be used to configure the split mode to `A`, `B` or `C`. config can be used to configure the split mode to `A`, `B` or `C`.
```ini
### config.cfg
[nlp.tokenizer]
@tokenizers = "spacy.ja.JapaneseTokenizer"
split_mode = "A"
```
<Infobox variant="warning"> <Infobox variant="warning">
If you run into errors related to `sudachipy`, which is currently under active If you run into errors related to `sudachipy`, which is currently under active

@ -895,6 +895,10 @@ the name. Registered functions can also take **arguments**, by the way, that can
be defined in the config as well. You can read more about this in the docs on be defined in the config as well. You can read more about this in the docs on
[training with custom code](/usage/training#custom-code). [training with custom code](/usage/training#custom-code).
### Initializing components with data {#initialization}
<!-- TODO: -->
### Python type hints and pydantic validation {#type-hints new="3"} ### Python type hints and pydantic validation {#type-hints new="3"}
spaCy's configs are powered by our machine learning library Thinc's spaCy's configs are powered by our machine learning library Thinc's

@ -291,7 +291,7 @@ installed in the same environment. That's it.
| Entry point | Description | | Entry point | Description |
| ------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`spacy_factories`](#entry-points-components) | Group of entry points for pipeline component factories, keyed by component name. Can be used to expose custom components defined by another package. | | [`spacy_factories`](#entry-points-components) | Group of entry points for pipeline component factories, keyed by component name. Can be used to expose custom components defined by another package. |
| [`spacy_languages`](#entry-points-languages) | Group of entry points for custom [`Language` subclasses](/usage/adding-languages), keyed by language shortcut. | | [`spacy_languages`](#entry-points-languages) | Group of entry points for custom [`Language` subclasses](/usage/linguistic-features#language-data), keyed by language shortcut. |
| `spacy_lookups` <Tag variant="new">2.2</Tag> | Group of entry points for custom [`Lookups`](/api/lookups), including lemmatizer data. Used by spaCy's [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) package. | | `spacy_lookups` <Tag variant="new">2.2</Tag> | Group of entry points for custom [`Lookups`](/api/lookups), including lemmatizer data. Used by spaCy's [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) package. |
| [`spacy_displacy_colors`](#entry-points-displacy) <Tag variant="new">2.2</Tag> | Group of entry points of custom label colors for the [displaCy visualizer](/usage/visualizers#ent). The key name doesn't matter, but it should point to a dict of labels and color values. Useful for custom models that predict different entity types. | | [`spacy_displacy_colors`](#entry-points-displacy) <Tag variant="new">2.2</Tag> | Group of entry points of custom label colors for the [displaCy visualizer](/usage/visualizers#ent). The key name doesn't matter, but it should point to a dict of labels and color values. Useful for custom models that predict different entity types. |

@ -200,7 +200,7 @@ import Tokenization101 from 'usage/101/\_tokenization.md'
To learn more about how spaCy's tokenization rules work in detail, how to To learn more about how spaCy's tokenization rules work in detail, how to
**customize and replace** the default tokenizer and how to **add **customize and replace** the default tokenizer and how to **add
language-specific data**, see the usage guides on language-specific data**, see the usage guides on
[adding languages](/usage/adding-languages) and [language data](/usage/linguistic-features#language-data) and
[customizing the tokenizer](/usage/linguistic-features#tokenization). [customizing the tokenizer](/usage/linguistic-features#tokenization).
</Infobox> </Infobox>
@ -479,7 +479,7 @@ find a "Suggest edits" link at the bottom of each page that points you to the
source. source.
Another way of getting involved is to help us improve the Another way of getting involved is to help us improve the
[language data](/usage/adding-languages#language-data), especially if you [language data](/usage/linguistic-features#language-data), especially if you
happen to speak one of the languages currently in happen to speak one of the languages currently in
[alpha support](/usage/models#languages). Even adding simple tokenizer [alpha support](/usage/models#languages). Even adding simple tokenizer
exceptions, stop words or lemmatizer data can make a big difference. It will exceptions, stop words or lemmatizer data can make a big difference. It will

@ -216,7 +216,9 @@ The initialization settings are only loaded and used when
[`nlp.initialize`](/api/language#initialize) is called (typically right before [`nlp.initialize`](/api/language#initialize) is called (typically right before
training). This allows you to set up your pipeline using local data resources training). This allows you to set up your pipeline using local data resources
and custom functions, and preserve the information in your config but without and custom functions, and preserve the information in your config but without
requiring it to be available at runtime requiring it to be available at runtime. You can also use this mechanism to
provide data paths to custom pipeline components and custom tokenizers; see the
section on [custom initialization](#initialization) for details.
### Overwriting config settings on the command line {#config-overrides} ### Overwriting config settings on the command line {#config-overrides}
@ -815,9 +817,9 @@ def MyModel(output_width: int) -> Model[List[Doc], List[Floats2d]]:
return create_model(output_width) return create_model(output_width)
``` ```
<!-- TODO:
### Customizing the initialization {#initialization} ### Customizing the initialization {#initialization}
-->
<!-- TODO: -->
## Data utilities {#data} ## Data utilities {#data}
@ -1135,7 +1137,11 @@ An easy way to create modified `Example` objects is to use the
capitalization changes, so only the `ORTH` values of the tokens will be capitalization changes, so only the `ORTH` values of the tokens will be
different between the original and augmented examples. different between the original and augmented examples.
<!-- TODO: mention alignment --> Note that if your data augmentation strategy involves changing the tokenization
(for instance, removing or adding tokens) and your training examples include
token-based annotations like the dependency parse or entity labels, you'll need
to take care to adjust the `Example` object so its annotations match and remain
valid.
## Parallel & distributed training with Ray {#parallel-training} ## Parallel & distributed training with Ray {#parallel-training}