Update docs [ci skip]

Ines Montani 2020-10-02 13:24:33 +02:00
parent d2aa662ab2
commit df06f7a792
10 changed files with 88 additions and 44 deletions

View File

@ -8,8 +8,8 @@ source: spacy/language.py
Usually you'll load this once per process as `nlp` and pass the instance around
your application. The `Language` class is created when you call
[`spacy.load`](/api/top-level#spacy.load) and contains the shared vocabulary and
[language data](/usage/adding-languages), optional binary weights, e.g. provided
by a [trained pipeline](/models), and the
[language data](/usage/linguistic-features#language-data), optional binary
weights, e.g. provided by a [trained pipeline](/models), and the
[processing pipeline](/usage/processing-pipelines) containing components like
the tagger or parser that are called on a document in order. You can also add
your own processing pipeline components that take a `Doc` object, modify it and
@ -210,7 +210,9 @@ settings defined in the [`[initialize]`](/api/data-formats#config-initialize)
config block to set up the vocabulary, load in vectors and tok2vec weights and
pass optional arguments to the `initialize` methods implemented by pipeline
components or the tokenizer. This method is typically called automatically when
you run [`spacy train`](/api/cli#train).
you run [`spacy train`](/api/cli#train). See the usage guide on the
[config lifecycle](/usage/training#config-lifecycle) and
[initialization](/usage/training#initialization) for details.
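For illustration, a minimal sketch of calling `initialize` directly with a
`get_examples` callback – the blank pipeline, component and toy annotations
below are assumptions for the example, not requirements of the API:

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
nlp.add_pipe("tagger")

def get_examples():
    # Return an iterable of Example objects; a single toy example here
    doc = nlp.make_doc("I like cats")
    return [Example.from_dict(doc, {"tags": ["PRON", "VERB", "NOUN"]})]

optimizer = nlp.initialize(get_examples)
```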
`get_examples` should be a function that returns an iterable of
[`Example`](/api/example) objects. The data examples can either be the full
@ -928,7 +930,7 @@ Serialize the current state to a binary string.
Load state from a binary string. Note that this method is commonly used via the
subclasses like `English` or `German` to make language-specific functionality
like the [lexical attribute getters](/usage/adding-languages#lex-attrs)
like the [lexical attribute getters](/usage/linguistic-features#language-data)
available to the loaded object.
> #### Example
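>
> A minimal sketch of the usual round trip, assuming `nlp` is an existing
> pipeline object:
>
> ```python
> from spacy.lang.en import English
>
> nlp_bytes = nlp.to_bytes()
> nlp2 = English()
> nlp2.from_bytes(nlp_bytes)
> ```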

View File

@ -130,8 +130,7 @@ applied to the `Doc` in order.
## Lemmatizer.lookup_lemmatize {#lookup_lemmatize tag="method"}
Lemmatize a token using a lookup-based approach. If no lemma is found, the
original string is returned. Languages can provide a
[lookup table](/usage/adding-languages#lemmatizer) via the `Lookups`.
original string is returned.
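For illustration, a hedged sketch of calling the method directly – it assumes
the [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
package is installed so the lookup tables can be loaded:

```python
import spacy

nlp = spacy.blank("en")
lemmatizer = nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
lemmatizer.initialize()  # loads the lookup tables from spacy-lookups-data
doc = nlp("leaves")
print(lemmatizer.lookup_lemmatize(doc[0]))  # e.g. ["leaf"]
```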
| Name | Description |
| ----------- | --------------------------------------------------- |

View File

@ -437,8 +437,8 @@ The L2 norm of the token's vector representation.
| `ent_id_` | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. ~~str~~ |
| `lemma` | Base form of the token, with no inflectional suffixes. ~~int~~ |
| `lemma_` | Base form of the token, with no inflectional suffixes. ~~str~~ |
| `norm` | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions). ~~int~~ |
| `norm_` | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions). ~~str~~ |
| `norm` | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/linguistic-features#language-data). ~~int~~ |
| `norm_` | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/linguistic-features#language-data). ~~str~~ |
| `lower` | Lowercase form of the token. ~~int~~ |
| `lower_` | Lowercase form of the token text. Equivalent to `Token.text.lower()`. ~~str~~ |
| `shape` | Transform of the token's string to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. ~~int~~ |
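For illustration, a quick sketch of inspecting a few of these attributes – the
values shown in the comments are approximate and depend on the language data:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying U.K. startup")
token = doc[5]           # "U.K."
print(token.lower_)      # "u.k."
print(token.shape_)      # "X.X."
print(token.norm_)       # normalized form, e.g. "u.k."
```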

View File

@ -22,9 +22,8 @@ like punctuation and special case rules from the
## Tokenizer.\_\_init\_\_ {#init tag="method"}
Create a `Tokenizer` to create `Doc` objects given unicode text. For examples
of how to construct a custom tokenizer with different tokenization rules, see
the
Create a `Tokenizer` to create `Doc` objects given unicode text. For examples of
how to construct a custom tokenizer with different tokenization rules, see the
[usage documentation](https://spacy.io/usage/linguistic-features#native-tokenizers).
> #### Example
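>
> A minimal sketch, creating a blank tokenizer that only uses the shared vocab
> and no prefix, suffix or infix rules:
>
> ```python
> from spacy.tokenizer import Tokenizer
> from spacy.lang.en import English
>
> nlp = English()
> # Create a blank Tokenizer with just the English vocab
> tokenizer = Tokenizer(nlp.vocab)
> ```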
@ -121,10 +120,10 @@ if no suffix rules match.
## Tokenizer.add_special_case {#add_special_case tag="method"}
Add a special-case tokenization rule. This mechanism is also used to add custom
tokenizer exceptions to the language data. See the usage guide on
[adding languages](/usage/adding-languages#tokenizer-exceptions) and
[linguistic features](/usage/linguistic-features#special-cases) for more details
and examples.
tokenizer exceptions to the language data. See the usage guide on the
[language data](/usage/linguistic-features#language-data) and
[tokenizer special cases](/usage/linguistic-features#special-cases) for more
details and examples.
> #### Example
>
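> A minimal sketch – the `ORTH` values must concatenate to the original string:
>
> ```python
> from spacy.attrs import ORTH, NORM
> from spacy.lang.en import English
>
> nlp = English()
> special_case = [{ORTH: "do"}, {ORTH: "n't", NORM: "not"}]
> nlp.tokenizer.add_special_case("don't", special_case)
> ```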

View File

@ -827,7 +827,7 @@ utilities.
### util.get_lang_class {#util.get_lang_class tag="function"}
Import and load a `Language` class. Allows lazy-loading
[language data](/usage/adding-languages) and importing languages using the
[language data](/usage/linguistic-features#language-data) and importing languages using the
two-letter language code. To add a language code for a custom language class,
you can register it using the [`@registry.languages`](/api/top-level#registry)
decorator.
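For illustration, a small sketch of looking up a language class by its
two-letter code:

```python
from spacy.util import get_lang_class

lang_cls = get_lang_class("en")  # lazily imports and returns the English class
nlp = lang_cls()
```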

View File

@ -30,7 +30,7 @@ import QuickstartModels from 'widgets/quickstart-models.js'
## Language support {#languages}
spaCy currently provides support for the following languages. You can help by
[improving the existing language data](/usage/adding-languages#language-data)
improving the existing [language data](/usage/linguistic-features#language-data)
and extending the tokenization patterns.
[See here](https://github.com/explosion/spaCy/issues/3056) for details on how to
contribute to development.
@ -83,55 +83,81 @@ To train a pipeline using the neutral multi-language class, you can set
import the `MultiLanguage` class directly, or call
[`spacy.blank("xx")`](/api/top-level#spacy.blank) for lazy-loading.
### Chinese language support {#chinese new=2.3}
### Chinese language support {#chinese new="2.3"}
The Chinese language class supports three word segmentation options, `char`,
`jieba` and `pkuseg`:
`jieba` and `pkuseg`.
> #### Manual setup
>
> ```python
> from spacy.lang.zh import Chinese
>
> # Character segmentation (default)
> nlp = Chinese()
>
> # Jieba
> cfg = {"segmenter": "jieba"}
> nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
>
> # PKUSeg with "default" model provided by pkuseg
> cfg = {"segmenter": "pkuseg"}
> nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
> nlp.tokenizer.initialize(pkuseg_model="default")
> ```
1. **Character segmentation:** Character segmentation is the default
segmentation option. It's enabled when you create a new `Chinese` language
class or call `spacy.blank("zh")`.
2. **Jieba:** `Chinese` uses [Jieba](https://github.com/fxsjy/jieba) for word
segmentation with the tokenizer option `{"segmenter": "jieba"}`.
3. **PKUSeg**: As of spaCy v2.3.0, support for
[PKUSeg](https://github.com/lancopku/PKUSeg-python) has been added to support
better segmentation for Chinese OntoNotes and the provided
[Chinese pipelines](/models/zh). Enable PKUSeg with the tokenizer option
`{"segmenter": "pkuseg"}`.
```ini
### config.cfg
[nlp.tokenizer]
@tokenizers = "spacy.zh.ChineseTokenizer"
segmenter = "char"
```
<Infobox variant="warning">
| Segmenter | Description |
| --------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `char` | **Character segmentation:** The default segmentation option. It's enabled when you create a new `Chinese` language class or call `spacy.blank("zh")`. |
| `jieba` | **Jieba:** To use [Jieba](https://github.com/fxsjy/jieba) for word segmentation, set the option `segmenter` to `"jieba"`. |
| `pkuseg` | **PKUSeg:** As of spaCy v2.3.0, support for [PKUSeg](https://github.com/lancopku/PKUSeg-python) has been added for better segmentation of Chinese OntoNotes and the provided [Chinese pipelines](/models/zh). Enable PKUSeg by setting the tokenizer option `segmenter` to `"pkuseg"`. |
In spaCy v3.0, the default Chinese word segmenter has switched from Jieba to
character segmentation.
<Infobox title="Changed in v3.0" variant="warning">
In v3.0, the default word segmenter has switched from Jieba to character
segmentation. Because the `pkuseg` segmenter depends on a model that can be
loaded from a file, the model is loaded on
[initialization](/usage/training#config-lifecycle) (typically before training).
This ensures that your packaged Chinese model doesn't depend on a local path at
runtime.
</Infobox>
<Accordion title="Details on spaCy's Chinese API">
The `initialize` method for the Chinese tokenizer class supports the following
config settings for loading pkuseg models:
config settings for loading `pkuseg` models:
| Name | Description |
| ------------------ | ------------------------------------------------------------------------------------------------------------------------------------- |
| `pkuseg_model` | Name of a model provided by `pkuseg` or the path to a local model directory. ~~str~~ |
| `pkuseg_user_dict` | Optional path to a file with one word per line which overrides the default `pkuseg` user dictionary. Defaults to `"default"`. ~~str~~ |
The initialization settings are typically provided in the
[training config](/usage/training#config) and the data is loaded in before
training and serialized with the model. This allows you to load the data from a
local path and save out your pipeline and config, without requiring the same
local path at runtime. See the usage guide on the
[config lifecycle](/usage/training#config-lifecycle) for more background on
this.
```ini
### config.cfg
[initialize]
[initialize.tokenizer]
pkuseg_model = "/path/to/model"
pkuseg_user_dict = "default"
```
You can also initialize the tokenizer for a blank language class by calling its
`initialize` method:
```python
### Examples
# Initialize the pkuseg tokenizer
@ -191,12 +217,13 @@ nlp.tokenizer.initialize(pkuseg_model="/path/to/pkuseg_model")
### Japanese language support {#japanese new=2.3}
> #### Manual setup
>
> ```python
> from spacy.lang.ja import Japanese
>
> # Load SudachiPy with split mode A (default)
> nlp = Japanese()
>
> # Load SudachiPy with split mode B
> cfg = {"split_mode": "B"}
> nlp = Japanese.from_config({"nlp": {"tokenizer": cfg}})
@ -208,6 +235,13 @@ segmentation and part-of-speech tagging. The default Japanese language class and
the provided Japanese pipelines use SudachiPy split mode `A`. The tokenizer
config can be used to configure the split mode to `A`, `B` or `C`.
```ini
### config.cfg
[nlp.tokenizer]
@tokenizers = "spacy.ja.JapaneseTokenizer"
split_mode = "A"
```
<Infobox variant="warning">
If you run into errors related to `sudachipy`, which is currently under active

View File

@ -895,6 +895,10 @@ the name. Registered functions can also take **arguments**, by the way, that
can be defined in the config as well – you can read more about this in the docs
on [training with custom code](/usage/training#custom-code).
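As a rough sketch – the registry name `"my_length_filter.v1"` and its argument
are made up for this example – a registered function whose argument is supplied
from the config:

```python
import spacy
from spacy.tokens import Doc

@spacy.registry.misc("my_length_filter.v1")
def create_length_filter(min_length: int):
    def is_long_enough(doc: Doc) -> bool:
        return len(doc) >= min_length
    return is_long_enough

# The config block that requests this function passes the argument along:
#
# [components.my_component.filter]
# @misc = "my_length_filter.v1"
# min_length = 5
```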
### Initializing components with data {#initialization}
<!-- TODO: -->
### Python type hints and pydantic validation {#type-hints new="3"}
spaCy's configs are powered by our machine learning library Thinc's

View File

@ -291,7 +291,7 @@ installed in the same environment – that's it.
| Entry point | Description |
| ------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [`spacy_factories`](#entry-points-components) | Group of entry points for pipeline component factories, keyed by component name. Can be used to expose custom components defined by another package. |
| [`spacy_languages`](#entry-points-languages) | Group of entry points for custom [`Language` subclasses](/usage/adding-languages), keyed by language shortcut. |
| [`spacy_languages`](#entry-points-languages) | Group of entry points for custom [`Language` subclasses](/usage/linguistic-features#language-data), keyed by language shortcut. |
| `spacy_lookups` <Tag variant="new">2.2</Tag> | Group of entry points for custom [`Lookups`](/api/lookups), including lemmatizer data. Used by spaCy's [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) package. |
| [`spacy_displacy_colors`](#entry-points-displacy) <Tag variant="new">2.2</Tag> | Group of entry points of custom label colors for the [displaCy visualizer](/usage/visualizers#ent). The key name doesn't matter, but it should point to a dict of labels and color values. Useful for custom models that predict different entity types. |

View File

@ -200,7 +200,7 @@ import Tokenization101 from 'usage/101/\_tokenization.md'
To learn more about how spaCy's tokenization rules work in detail, how to
**customize and replace** the default tokenizer and how to **add
language-specific data**, see the usage guides on
[adding languages](/usage/adding-languages) and
[language data](/usage/linguistic-features#language-data) and
[customizing the tokenizer](/usage/linguistic-features#tokenization).
</Infobox>
@ -479,7 +479,7 @@ find a "Suggest edits" link at the bottom of each page that points you to the
source.
Another way of getting involved is to help us improve the
[language data](/usage/adding-languages#language-data) – especially if you
[language data](/usage/linguistic-features#language-data) – especially if you
happen to speak one of the languages currently in
[alpha support](/usage/models#languages). Even adding simple tokenizer
exceptions, stop words or lemmatizer data can make a big difference. It will

View File

@ -216,7 +216,9 @@ The initialization settings are only loaded and used when
[`nlp.initialize`](/api/language#initialize) is called (typically right before
training). This allows you to set up your pipeline using local data resources
and custom functions, and preserve the information in your config but without
requiring it to be available at runtime
requiring it to be available at runtime. You can also use this mechanism to
provide data paths to custom pipeline components and custom tokenizers – see the
section on [custom initialization](#initialization) for details.
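For example, a rough sketch – the component name, `data_path` argument and file
format are all made up here – of a custom component whose `initialize` method
receives a local path from the `[initialize.components]` block, so the file only
has to exist when `nlp.initialize` runs:

```python
from pathlib import Path

import srsly
from spacy.language import Language
from spacy.tokens import Doc


class Gazetteer:
    def __init__(self):
        self.terms = set()

    def initialize(self, get_examples=None, *, nlp=None, data_path: str = ""):
        # Called from nlp.initialize() before training – the path doesn't
        # need to be available when the trained pipeline is loaded later
        if data_path:
            self.terms = set(srsly.read_json(Path(data_path)))

    def __call__(self, doc: Doc) -> Doc:
        return doc


@Language.factory("my_gazetteer")
def create_gazetteer(nlp: Language, name: str):
    return Gazetteer()
```

The matching config would then provide the path under
`[initialize.components.my_gazetteer]`, e.g. `data_path = "/path/to/terms.json"`.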
### Overwriting config settings on the command line {#config-overrides}
@ -815,9 +817,9 @@ def MyModel(output_width: int) -> Model[List[Doc], List[Floats2d]]:
return create_model(output_width)
```
<!-- TODO:
### Customizing the initialization {#initialization}
-->
<!-- TODO: -->
## Data utilities {#data}
@ -1135,7 +1137,11 @@ An easy way to create modified `Example` objects is to use the
capitalization changes, so only the `ORTH` values of the tokens will be
different between the original and augmented examples.
<!-- TODO: mention alignment -->
Note that if your data augmentation strategy involves changing the tokenization
(for instance, removing or adding tokens) and your training examples include
token-based annotations like the dependency parse or entity labels, you'll need
to take care to adjust the `Example` object so its annotations match and remain
valid.
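As a rough illustration – the text and entity offsets are made up – this is the
kind of adjustment involved. When the augmented text keeps the same
tokenization, the original annotations can be carried over after updating the
`ORTH` values; if tokens were added or removed, the other annotations would have
to be adjusted to match as well:

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
doc = nlp.make_doc("I like London")
example = Example.from_dict(doc, {"entities": [(7, 13, "GPE")]})

# Lowercasing keeps the tokenization: update the ORTH values and build a new
# Example against a Doc of the modified text
example_dict = example.to_dict()
example_dict["token_annotation"]["ORTH"] = [t.lower_ for t in example.reference]
lower_doc = nlp.make_doc(example.text.lower())
lower_example = Example.from_dict(lower_doc, example_dict)
```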
## Parallel & distributed training with Ray {#parallel-training}