mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-13 10:46:29 +03:00
Update docs [ci skip]
parent
d2aa662ab2
commit
df06f7a792
|
@ -8,8 +8,8 @@ source: spacy/language.py
|
||||||
Usually you'll load this once per process as `nlp` and pass the instance around
|
Usually you'll load this once per process as `nlp` and pass the instance around
|
||||||
your application. The `Language` class is created when you call
|
your application. The `Language` class is created when you call
|
||||||
[`spacy.load`](/api/top-level#spacy.load) and contains the shared vocabulary and
|
[`spacy.load`](/api/top-level#spacy.load) and contains the shared vocabulary and
|
||||||
[language data](/usage/adding-languages), optional binary weights, e.g. provided
|
[language data](/usage/linguistic-features#language-data), optional binary
|
||||||
by a [trained pipeline](/models), and the
|
weights, e.g. provided by a [trained pipeline](/models), and the
|
||||||
[processing pipeline](/usage/processing-pipelines) containing components like
|
[processing pipeline](/usage/processing-pipelines) containing components like
|
||||||
the tagger or parser that are called on a document in order. You can also add
|
the tagger or parser that are called on a document in order. You can also add
|
||||||
your own processing pipeline components that take a `Doc` object, modify it and
|
your own processing pipeline components that take a `Doc` object, modify it and
|
||||||
|
@ -210,7 +210,9 @@ settings defined in the [`[initialize]`](/api/data-formats#config-initialize)
|
||||||
config block to set up the vocabulary, load in vectors and tok2vec weights and
|
config block to set up the vocabulary, load in vectors and tok2vec weights and
|
||||||
pass optional arguments to the `initialize` methods implemented by pipeline
|
pass optional arguments to the `initialize` methods implemented by pipeline
|
||||||
components or the tokenizer. This method is typically called automatically when
|
components or the tokenizer. This method is typically called automatically when
|
||||||
you run [`spacy train`](/api/cli#train).
|
you run [`spacy train`](/api/cli#train). See the usage guide on the
|
||||||
|
[config lifecycle](/usage/training#config-lifecycle) and
|
||||||
|
[initialization](/usage/training#initialization) for details.
|
||||||
|
|
||||||
`get_examples` should be a function that returns an iterable of
|
`get_examples` should be a function that returns an iterable of
|
||||||
[`Example`](/api/example) objects. The data examples can either be the full
|
[`Example`](/api/example) objects. The data examples can either be the full
|
||||||
|
@ -928,7 +930,7 @@ Serialize the current state to a binary string.
|
||||||
|
|
||||||
Load state from a binary string. Note that this method is commonly used via the
|
Load state from a binary string. Note that this method is commonly used via the
|
||||||
subclasses like `English` or `German` to make language-specific functionality
|
subclasses like `English` or `German` to make language-specific functionality
|
||||||
like the [lexical attribute getters](/usage/adding-languages#lex-attrs)
|
like the [lexical attribute getters](/usage/linguistic-features#language-data)
|
||||||
available to the loaded object.
|
available to the loaded object.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
|
|
|
@ -130,8 +130,7 @@ applied to the `Doc` in order.
|
||||||
## Lemmatizer.lookup_lemmatize {#lookup_lemmatize tag="method"}
|
## Lemmatizer.lookup_lemmatize {#lookup_lemmatize tag="method"}
|
||||||
|
|
||||||
Lemmatize a token using a lookup-based approach. If no lemma is found, the
|
Lemmatize a token using a lookup-based approach. If no lemma is found, the
|
||||||
original string is returned. Languages can provide a
|
original string is returned.
|
||||||
[lookup table](/usage/adding-languages#lemmatizer) via the `Lookups`.
|
|
||||||
|
|
||||||
| Name | Description |
|
| Name | Description |
|
||||||
| ----------- | --------------------------------------------------- |
|
| ----------- | --------------------------------------------------- |
|
||||||
|
|
|
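The fallback behaviour described for `Lemmatizer.lookup_lemmatize` above can be sketched in plain Python. This is only an illustration, not spaCy's implementation: the function signature and the `table` contents are hypothetical, and the real method operates on a `Token` and reads its table from the `Lookups`.

```python
from typing import Dict


def lookup_lemmatize(table: Dict[str, str], string: str) -> str:
    """Return the lemma stored in the lookup table, or the input itself."""
    # Fall back to the original string when the table has no entry for it.
    return table.get(string, string)


# Hypothetical lookup table, for illustration only.
table = {"going": "go", "mice": "mouse"}

known = lookup_lemmatize(table, "mice")       # found in the table
unknown = lookup_lemmatize(table, "spaCy")    # not found, returned unchanged
```
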
@ -437,8 +437,8 @@ The L2 norm of the token's vector representation.
|
||||||
| `ent_id_` | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. ~~str~~ |
|
| `ent_id_` | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. ~~str~~ |
|
||||||
| `lemma` | Base form of the token, with no inflectional suffixes. ~~int~~ |
|
| `lemma` | Base form of the token, with no inflectional suffixes. ~~int~~ |
|
||||||
| `lemma_` | Base form of the token, with no inflectional suffixes. ~~str~~ |
|
| `lemma_` | Base form of the token, with no inflectional suffixes. ~~str~~ |
|
||||||
| `norm` | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions). ~~int~~ |
|
| `norm` | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/linguistic-features#language-data). ~~int~~ |
|
||||||
| `norm_` | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions). ~~str~~ |
|
| `norm_` | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/linguistic-features#language-data). ~~str~~ |
|
||||||
| `lower` | Lowercase form of the token. ~~int~~ |
|
| `lower` | Lowercase form of the token. ~~int~~ |
|
||||||
| `lower_` | Lowercase form of the token text. Equivalent to `Token.text.lower()`. ~~str~~ |
|
| `lower_` | Lowercase form of the token text. Equivalent to `Token.text.lower()`. ~~str~~ |
|
||||||
| `shape` | Transform of the token's string to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. ~~int~~ |
|
| `shape` | Transform of the token's string to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. ~~int~~ |
|
||||||
|
|
|
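The `shape` rule from the attribute table above (letters become `x` or `X`, digits become `d`, runs are cut off after four repeats) can be written out as a short function. This is a sketch based only on that description; the name `word_shape` and the handling of other characters (kept as they are) are assumptions, and spaCy's own implementation may differ in details.

```python
def word_shape(text: str) -> str:
    """Map a string to its orthographic shape, e.g. "Abcd" to "Xxxx"."""
    shape = []
    last = ""  # previous shape character
    run = 0    # length of the current run of identical shape characters
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            # Assumption: punctuation and other characters are kept unchanged.
            shape_char = char
        run = run + 1 if shape_char == last else 1
        last = shape_char
        # Keep at most four identical shape characters in a row.
        if run <= 4:
            shape.append(shape_char)
    return "".join(shape)


capitalized = word_shape("Abcd")
digits = word_shape("12")
long_word = word_shape("internationalization")
```
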
@ -22,9 +22,8 @@ like punctuation and special case rules from the
|
||||||
|
|
||||||
## Tokenizer.\_\_init\_\_ {#init tag="method"}
|
## Tokenizer.\_\_init\_\_ {#init tag="method"}
|
||||||
|
|
||||||
Create a `Tokenizer` to create `Doc` objects given unicode text. For examples
|
Create a `Tokenizer` to create `Doc` objects given unicode text. For examples of
|
||||||
of how to construct a custom tokenizer with different tokenization rules, see
|
how to construct a custom tokenizer with different tokenization rules, see the
|
||||||
the
|
|
||||||
[usage documentation](https://spacy.io/usage/linguistic-features#native-tokenizers).
|
[usage documentation](https://spacy.io/usage/linguistic-features#native-tokenizers).
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
|
@ -87,7 +86,7 @@ Tokenize a stream of texts.
|
||||||
| ------------ | ------------------------------------------------------------------------------------ |
|
| ------------ | ------------------------------------------------------------------------------------ |
|
||||||
| `texts` | A sequence of unicode texts. ~~Iterable[str]~~ |
|
| `texts` | A sequence of unicode texts. ~~Iterable[str]~~ |
|
||||||
| `batch_size` | The number of texts to accumulate in an internal buffer. Defaults to `1000`. ~~int~~ |
|
| `batch_size` | The number of texts to accumulate in an internal buffer. Defaults to `1000`. ~~int~~ |
|
||||||
| **YIELDS** | The tokenized `Doc` objects, in order. ~~Doc~~ |
|
| **YIELDS** | The tokenized `Doc` objects, in order. ~~Doc~~ |
|
||||||
|
|
||||||
## Tokenizer.find_infix {#find_infix tag="method"}
|
## Tokenizer.find_infix {#find_infix tag="method"}
|
||||||
|
|
||||||
|
@ -121,10 +120,10 @@ if no suffix rules match.
|
||||||
## Tokenizer.add_special_case {#add_special_case tag="method"}
|
## Tokenizer.add_special_case {#add_special_case tag="method"}
|
||||||
|
|
||||||
Add a special-case tokenization rule. This mechanism is also used to add custom
|
Add a special-case tokenization rule. This mechanism is also used to add custom
|
||||||
tokenizer exceptions to the language data. See the usage guide on
|
tokenizer exceptions to the language data. See the usage guide on the
|
||||||
[adding languages](/usage/adding-languages#tokenizer-exceptions) and
|
[language data](/usage/linguistic-features#language-data) and
|
||||||
[linguistic features](/usage/linguistic-features#special-cases) for more details
|
[tokenizer special cases](/usage/linguistic-features#special-cases) for more
|
||||||
and examples.
|
details and examples.
|
||||||
|
|
||||||
> #### Example
|
> #### Example
|
||||||
>
|
>
|
||||||
|
|
|
@ -827,7 +827,7 @@ utilities.
|
||||||
### util.get_lang_class {#util.get_lang_class tag="function"}
|
### util.get_lang_class {#util.get_lang_class tag="function"}
|
||||||
|
|
||||||
Import and load a `Language` class. Allows lazy-loading
|
Import and load a `Language` class. Allows lazy-loading
|
||||||
[language data](/usage/adding-languages) and importing languages using the
|
[language data](/usage/linguistic-features#language-data) and importing languages using the
|
||||||
two-letter language code. To add a language code for a custom language class,
|
two-letter language code. To add a language code for a custom language class,
|
||||||
you can register it using the [`@registry.languages`](/api/top-level#registry)
|
you can register it using the [`@registry.languages`](/api/top-level#registry)
|
||||||
decorator.
|
decorator.
|
||||||
|
|
|
@ -30,7 +30,7 @@ import QuickstartModels from 'widgets/quickstart-models.js'
|
||||||
## Language support {#languages}
|
## Language support {#languages}
|
||||||
|
|
||||||
spaCy currently provides support for the following languages. You can help by
|
spaCy currently provides support for the following languages. You can help by
|
||||||
[improving the existing language data](/usage/adding-languages#language-data)
|
improving the existing [language data](/usage/linguistic-features#language-data)
|
||||||
and extending the tokenization patterns.
|
and extending the tokenization patterns.
|
||||||
[See here](https://github.com/explosion/spaCy/issues/3056) for details on how to
|
[See here](https://github.com/explosion/spaCy/issues/3056) for details on how to
|
||||||
contribute to development.
|
contribute to development.
|
||||||
|
@ -83,55 +83,81 @@ To train a pipeline using the neutral multi-language class, you can set
|
||||||
import the `MultiLanguage` class directly, or call
|
import the `MultiLanguage` class directly, or call
|
||||||
[`spacy.blank("xx")`](/api/top-level#spacy.blank) for lazy-loading.
|
[`spacy.blank("xx")`](/api/top-level#spacy.blank) for lazy-loading.
|
||||||
|
|
||||||
### Chinese language support {#chinese new=2.3}
|
### Chinese language support {#chinese new="2.3"}
|
||||||
|
|
||||||
The Chinese language class supports three word segmentation options, `char`,
|
The Chinese language class supports three word segmentation options, `char`,
|
||||||
`jieba` and `pkuseg`:
|
`jieba` and `pkuseg`.
|
||||||
|
|
||||||
|
> #### Manual setup
|
||||||
|
>
|
||||||
> ```python
|
> ```python
|
||||||
> from spacy.lang.zh import Chinese
|
> from spacy.lang.zh import Chinese
|
||||||
>
|
>
|
||||||
> # Character segmentation (default)
|
> # Character segmentation (default)
|
||||||
> nlp = Chinese()
|
> nlp = Chinese()
|
||||||
>
|
|
||||||
> # Jieba
|
> # Jieba
|
||||||
> cfg = {"segmenter": "jieba"}
|
> cfg = {"segmenter": "jieba"}
|
||||||
> nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
|
> nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
|
||||||
>
|
|
||||||
> # PKUSeg with "default" model provided by pkuseg
|
> # PKUSeg with "default" model provided by pkuseg
|
||||||
> cfg = {"segmenter": "pkuseg"}
|
> cfg = {"segmenter": "pkuseg"}
|
||||||
> nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
|
> nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
|
||||||
> nlp.tokenizer.initialize(pkuseg_model="default")
|
> nlp.tokenizer.initialize(pkuseg_model="default")
|
||||||
> ```
|
> ```
|
||||||
|
|
||||||
1. **Character segmentation:** Character segmentation is the default
|
```ini
|
||||||
segmentation option. It's enabled when you create a new `Chinese` language
|
### config.cfg
|
||||||
class or call `spacy.blank("zh")`.
|
[nlp.tokenizer]
|
||||||
2. **Jieba:** `Chinese` uses [Jieba](https://github.com/fxsjy/jieba) for word
|
@tokenizers = "spacy.zh.ChineseTokenizer"
|
||||||
segmentation with the tokenizer option `{"segmenter": "jieba"}`.
|
segmenter = "char"
|
||||||
3. **PKUSeg**: As of spaCy v2.3.0, support for
|
```
|
||||||
[PKUSeg](https://github.com/lancopku/PKUSeg-python) has been added to support
|
|
||||||
better segmentation for Chinese OntoNotes and the provided
|
|
||||||
[Chinese pipelines](/models/zh). Enable PKUSeg with the tokenizer option
|
|
||||||
`{"segmenter": "pkuseg"}`.
|
|
||||||
|
|
||||||
<Infobox variant="warning">
|
| Segmenter | Description |
|
||||||
|
| --------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
|
| `char` | **Character segmentation:** Character segmentation is the default segmentation option. It's enabled when you create a new `Chinese` language class or call `spacy.blank("zh")`. |
|
||||||
|
| `jieba` | **Jieba:** to use [Jieba](https://github.com/fxsjy/jieba) for word segmentation, you can set the option `segmenter` to `"jieba"`. |
|
||||||
|
| `pkuseg` | **PKUSeg**: As of spaCy v2.3.0, support for [PKUSeg](https://github.com/lancopku/PKUSeg-python) has been added to support better segmentation for Chinese OntoNotes and the provided [Chinese pipelines](/models/zh). Enable PKUSeg by setting tokenizer option `segmenter` to `"pkuseg"`. |
|
||||||
|
|
||||||
In spaCy v3.0, the default Chinese word segmenter has switched from Jieba to
|
<Infobox title="Changed in v3.0" variant="warning">
|
||||||
character segmentation.
|
|
||||||
|
In v3.0, the default word segmenter has switched from Jieba to character
|
||||||
|
segmentation. Because the `pkuseg` segmenter depends on a model that can be
|
||||||
|
loaded from a file, the model is loaded on
|
||||||
|
[initialization](/usage/training#config-lifecycle) (typically before training).
|
||||||
|
This ensures that your packaged Chinese model doesn't depend on a local path at
|
||||||
|
runtime.
|
||||||
|
|
||||||
</Infobox>
|
</Infobox>
|
||||||
|
|
||||||
<Accordion title="Details on spaCy's Chinese API">
|
<Accordion title="Details on spaCy's Chinese API">
|
||||||
|
|
||||||
The `initialize` method for the Chinese tokenizer class supports the following
|
The `initialize` method for the Chinese tokenizer class supports the following
|
||||||
config settings for loading pkuseg models:
|
config settings for loading `pkuseg` models:
|
||||||
|
|
||||||
| Name | Description |
|
| Name | Description |
|
||||||
| ------------------ | ------------------------------------------------------------------------------------------------------------------------------------- |
|
| ------------------ | ------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `pkuseg_model` | Name of a model provided by `pkuseg` or the path to a local model directory. ~~str~~ |
|
| `pkuseg_model` | Name of a model provided by `pkuseg` or the path to a local model directory. ~~str~~ |
|
||||||
| `pkuseg_user_dict` | Optional path to a file with one word per line which overrides the default `pkuseg` user dictionary. Defaults to `"default"`. ~~str~~ |
|
| `pkuseg_user_dict` | Optional path to a file with one word per line which overrides the default `pkuseg` user dictionary. Defaults to `"default"`. ~~str~~ |
|
||||||
|
|
||||||
|
The initialization settings are typically provided in the
|
||||||
|
[training config](/usage/training#config) and the data is loaded in before
|
||||||
|
training and serialized with the model. This allows you to load the data from a
|
||||||
|
local path and save out your pipeline and config, without requiring the same
|
||||||
|
local path at runtime. See the usage guide on the
|
||||||
|
[config lifecycle](/usage/training#config-lifecycle) for more background on
|
||||||
|
this.
|
||||||
|
|
||||||
|
```ini
|
||||||
|
### config.cfg
|
||||||
|
[initialize]
|
||||||
|
|
||||||
|
[initialize.tokenizer]
|
||||||
|
pkuseg_model = "/path/to/model"
|
||||||
|
pkuseg_user_dict = "default"
|
||||||
|
```
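To make the relationship between the `[nlp.tokenizer]` and `[initialize.tokenizer]` blocks easier to see, here is a standard-library sketch that reads both into one nested dictionary. It is not how spaCy or Thinc load configs: the helper `parse_config` is hypothetical, and it assumes that section names nest on dots and that values are JSON-encoded, as the quoted strings in the examples suggest.

```python
import configparser
import json
from typing import Any, Dict

CONFIG = """
[nlp.tokenizer]
@tokenizers = "spacy.zh.ChineseTokenizer"
segmenter = "pkuseg"

[initialize]

[initialize.tokenizer]
pkuseg_model = "/path/to/model"
pkuseg_user_dict = "default"
"""


def parse_config(text: str) -> Dict[str, Any]:
    """Read dotted INI sections into a nested dict (illustrative helper)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(text)
    result: Dict[str, Any] = {}
    for section in parser.sections():
        node = result
        # "initialize.tokenizer" becomes result["initialize"]["tokenizer"].
        for part in section.split("."):
            node = node.setdefault(part, {})
        for key, value in parser.items(section):
            node[key] = json.loads(value)
    return result


config = parse_config(CONFIG)
runtime_settings = config["nlp"]["tokenizer"]       # needed whenever the pipeline runs
init_settings = config["initialize"]["tokenizer"]   # only needed at initialization
```

The split mirrors the point made above: the local model path lives only under `[initialize.tokenizer]`, so it does not have to exist at runtime.
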
|
||||||
|
|
||||||
|
You can also initialize the tokenizer for a blank language class by calling its
|
||||||
|
`initialize` method:
|
||||||
|
|
||||||
```python
|
```python
|
||||||
### Examples
|
### Examples
|
||||||
# Initialize the pkuseg tokenizer
|
# Initialize the pkuseg tokenizer
|
||||||
|
@ -191,12 +217,13 @@ nlp.tokenizer.initialize(pkuseg_model="/path/to/pkuseg_model")
|
||||||
|
|
||||||
### Japanese language support {#japanese new=2.3}
|
### Japanese language support {#japanese new=2.3}
|
||||||
|
|
||||||
|
> #### Manual setup
|
||||||
|
>
|
||||||
> ```python
|
> ```python
|
||||||
> from spacy.lang.ja import Japanese
|
> from spacy.lang.ja import Japanese
|
||||||
>
|
>
|
||||||
> # Load SudachiPy with split mode A (default)
|
> # Load SudachiPy with split mode A (default)
|
||||||
> nlp = Japanese()
|
> nlp = Japanese()
|
||||||
>
|
|
||||||
> # Load SudachiPy with split mode B
|
> # Load SudachiPy with split mode B
|
||||||
> cfg = {"split_mode": "B"}
|
> cfg = {"split_mode": "B"}
|
||||||
> nlp = Japanese.from_config({"nlp": {"tokenizer": cfg}})
|
> nlp = Japanese.from_config({"nlp": {"tokenizer": cfg}})
|
||||||
|
@ -208,6 +235,13 @@ segmentation and part-of-speech tagging. The default Japanese language class and
|
||||||
the provided Japanese pipelines use SudachiPy split mode `A`. The tokenizer
|
the provided Japanese pipelines use SudachiPy split mode `A`. The tokenizer
|
||||||
config can be used to configure the split mode to `A`, `B` or `C`.
|
config can be used to configure the split mode to `A`, `B` or `C`.
|
||||||
|
|
||||||
|
```ini
|
||||||
|
### config.cfg
|
||||||
|
[nlp.tokenizer]
|
||||||
|
@tokenizers = "spacy.ja.JapaneseTokenizer"
|
||||||
|
split_mode = "A"
|
||||||
|
```
|
||||||
|
|
||||||
<Infobox variant="warning">
|
<Infobox variant="warning">
|
||||||
|
|
||||||
If you run into errors related to `sudachipy`, which is currently under active
|
If you run into errors related to `sudachipy`, which is currently under active
|
||||||
|
|
|
@ -895,6 +895,10 @@ the name. Registered functions can also take **arguments** by the way that can
|
||||||
be defined in the config as well – you can read more about this in the docs on
|
be defined in the config as well – you can read more about this in the docs on
|
||||||
[training with custom code](/usage/training#custom-code).
|
[training with custom code](/usage/training#custom-code).
|
||||||
|
|
||||||
|
### Initializing components with data {#initialization}
|
||||||
|
|
||||||
|
<!-- TODO: -->
|
||||||
|
|
||||||
### Python type hints and pydantic validation {#type-hints new="3"}
|
### Python type hints and pydantic validation {#type-hints new="3"}
|
||||||
|
|
||||||
spaCy's configs are powered by our machine learning library Thinc's
|
spaCy's configs are powered by our machine learning library Thinc's
|
||||||
|
|
|
@ -291,7 +291,7 @@ installed in the same environment – that's it.
|
||||||
| Entry point | Description |
|
| Entry point | Description |
|
||||||
| ------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| ------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| [`spacy_factories`](#entry-points-components) | Group of entry points for pipeline component factories, keyed by component name. Can be used to expose custom components defined by another package. |
|
| [`spacy_factories`](#entry-points-components) | Group of entry points for pipeline component factories, keyed by component name. Can be used to expose custom components defined by another package. |
|
||||||
| [`spacy_languages`](#entry-points-languages) | Group of entry points for custom [`Language` subclasses](/usage/adding-languages), keyed by language shortcut. |
|
| [`spacy_languages`](#entry-points-languages) | Group of entry points for custom [`Language` subclasses](/usage/linguistic-features#language-data), keyed by language shortcut. |
|
||||||
| `spacy_lookups` <Tag variant="new">2.2</Tag> | Group of entry points for custom [`Lookups`](/api/lookups), including lemmatizer data. Used by spaCy's [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) package. |
|
| `spacy_lookups` <Tag variant="new">2.2</Tag> | Group of entry points for custom [`Lookups`](/api/lookups), including lemmatizer data. Used by spaCy's [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) package. |
|
||||||
| [`spacy_displacy_colors`](#entry-points-displacy) <Tag variant="new">2.2</Tag> | Group of entry points of custom label colors for the [displaCy visualizer](/usage/visualizers#ent). The key name doesn't matter, but it should point to a dict of labels and color values. Useful for custom models that predict different entity types. |
|
| [`spacy_displacy_colors`](#entry-points-displacy) <Tag variant="new">2.2</Tag> | Group of entry points of custom label colors for the [displaCy visualizer](/usage/visualizers#ent). The key name doesn't matter, but it should point to a dict of labels and color values. Useful for custom models that predict different entity types. |
|
||||||
|
|
||||||
|
|
|
@ -200,7 +200,7 @@ import Tokenization101 from 'usage/101/\_tokenization.md'
|
||||||
To learn more about how spaCy's tokenization rules work in detail, how to
|
To learn more about how spaCy's tokenization rules work in detail, how to
|
||||||
**customize and replace** the default tokenizer and how to **add
|
**customize and replace** the default tokenizer and how to **add
|
||||||
language-specific data**, see the usage guides on
|
language-specific data**, see the usage guides on
|
||||||
[adding languages](/usage/adding-languages) and
|
[language data](/usage/linguistic-features#language-data) and
|
||||||
[customizing the tokenizer](/usage/linguistic-features#tokenization).
|
[customizing the tokenizer](/usage/linguistic-features#tokenization).
|
||||||
|
|
||||||
</Infobox>
|
</Infobox>
|
||||||
|
@ -479,7 +479,7 @@ find a "Suggest edits" link at the bottom of each page that points you to the
|
||||||
source.
|
source.
|
||||||
|
|
||||||
Another way of getting involved is to help us improve the
|
Another way of getting involved is to help us improve the
|
||||||
[language data](/usage/adding-languages#language-data) – especially if you
|
[language data](/usage/linguistic-features#language-data) – especially if you
|
||||||
happen to speak one of the languages currently in
|
happen to speak one of the languages currently in
|
||||||
[alpha support](/usage/models#languages). Even adding simple tokenizer
|
[alpha support](/usage/models#languages). Even adding simple tokenizer
|
||||||
exceptions, stop words or lemmatizer data can make a big difference. It will
|
exceptions, stop words or lemmatizer data can make a big difference. It will
|
||||||
|
|
|
@ -216,7 +216,9 @@ The initialization settings are only loaded and used when
|
||||||
[`nlp.initialize`](/api/language#initialize) is called (typically right before
|
[`nlp.initialize`](/api/language#initialize) is called (typically right before
|
||||||
training). This allows you to set up your pipeline using local data resources
|
training). This allows you to set up your pipeline using local data resources
|
||||||
and custom functions, and preserve the information in your config – but without
|
and custom functions, and preserve the information in your config – but without
|
||||||
requiring it to be available at runtime
|
requiring it to be available at runtime. You can also use this mechanism to
|
||||||
|
provide data paths to custom pipeline components and custom tokenizers – see the
|
||||||
|
section on [custom initialization](#initialization) for details.
|
||||||
|
|
||||||
### Overwriting config settings on the command line {#config-overrides}
|
### Overwriting config settings on the command line {#config-overrides}
|
||||||
|
|
||||||
|
@ -815,9 +817,9 @@ def MyModel(output_width: int) -> Model[List[Doc], List[Floats2d]]:
|
||||||
return create_model(output_width)
|
return create_model(output_width)
|
||||||
```
|
```
|
||||||
|
|
||||||
<!-- TODO:
|
|
||||||
### Customizing the initialization {#initialization}
|
### Customizing the initialization {#initialization}
|
||||||
-->
|
|
||||||
|
<!-- TODO: -->
|
||||||
|
|
||||||
## Data utilities {#data}
|
## Data utilities {#data}
|
||||||
|
|
||||||
|
@ -1135,7 +1137,11 @@ An easy way to create modified `Example` objects is to use the
|
||||||
capitalization changes, so only the `ORTH` values of the tokens will be
|
capitalization changes, so only the `ORTH` values of the tokens will be
|
||||||
different between the original and augmented examples.
|
different between the original and augmented examples.
|
||||||
|
|
||||||
<!-- TODO: mention alignment -->
|
Note that if your data augmentation strategy involves changing the tokenization
|
||||||
|
(for instance, removing or adding tokens) and your training examples include
|
||||||
|
token-based annotations like the dependency parse or entity labels, you'll need
|
||||||
|
to take care to adjust the `Example` object so its annotations match and remain
|
||||||
|
valid.
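
As a rough illustration of what "adjusting the annotations" involves, the sketch below removes one token from a list of words and keeps a parallel list of IOB entity tags aligned and valid. The helper `drop_token` and the sample data are hypothetical and independent of spaCy; in a real augmenter you would build a new `Example` from the adjusted words and annotations.

```python
from typing import List, Tuple


def drop_token(
    words: List[str], tags: List[str], index: int
) -> Tuple[List[str], List[str]]:
    """Remove one token and keep the IOB entity tags aligned and valid."""
    new_words = words[:index] + words[index + 1:]
    new_tags = tags[:index] + tags[index + 1:]
    removed = tags[index]
    # If the removed token opened an entity, the following token has to
    # open it instead, otherwise the sequence would start with "I-".
    if removed.startswith("B-") and index < len(new_tags):
        label = removed[2:]
        if new_tags[index] == f"I-{label}":
            new_tags[index] = f"B-{label}"
    return new_words, new_tags


words = ["I", "like", "New", "York", "City"]
tags = ["O", "O", "B-GPE", "I-GPE", "I-GPE"]

# Dropping "New" shifts every later token and changes which token starts the entity.
new_words, new_tags = drop_token(words, tags, 2)
```

The same care applies to dependency heads, which are token indices and have to be remapped whenever tokens are added or removed.
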
|
||||||
|
|
||||||
## Parallel & distributed training with Ray {#parallel-training}
|
## Parallel & distributed training with Ray {#parallel-training}
|
||||||
|
|
||||||
|
|