mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-24 16:24:16 +03:00
Update docs [ci skip]
parent
d2aa662ab2
commit
df06f7a792
|
@ -8,8 +8,8 @@ source: spacy/language.py
|
|||
Usually you'll load this once per process as `nlp` and pass the instance around
|
||||
your application. The `Language` class is created when you call
|
||||
[`spacy.load`](/api/top-level#spacy.load) and contains the shared vocabulary and
|
||||
[language data](/usage/adding-languages), optional binary weights, e.g. provided
|
||||
by a [trained pipeline](/models), and the
|
||||
[language data](/usage/linguistic-features#language-data), optional binary
|
||||
weights, e.g. provided by a [trained pipeline](/models), and the
|
||||
[processing pipeline](/usage/processing-pipelines) containing components like
|
||||
the tagger or parser that are called on a document in order. You can also add
|
||||
your own processing pipeline components that take a `Doc` object, modify it and
|
||||
|
@ -210,7 +210,9 @@ settings defined in the [`[initialize]`](/api/data-formats#config-initialize)
|
|||
config block to set up the vocabulary, load in vectors and tok2vec weights and
|
||||
pass optional arguments to the `initialize` methods implemented by pipeline
|
||||
components or the tokenizer. This method is typically called automatically when
|
||||
you run [`spacy train`](/api/cli#train).
|
||||
you run [`spacy train`](/api/cli#train). See the usage guide on the
|
||||
[config lifecycle](/usage/training#config-lifecycle) and
|
||||
[initialization](/usage/training#initialization) for details.
|
||||
|
||||
`get_examples` should be a function that returns an iterable of
|
||||
[`Example`](/api/example) objects. The data examples can either be the full
|
||||
|
@ -928,7 +930,7 @@ Serialize the current state to a binary string.
|
|||
|
||||
Load state from a binary string. Note that this method is commonly used via the
|
||||
subclasses like `English` or `German` to make language-specific functionality
|
||||
like the [lexical attribute getters](/usage/adding-languages#lex-attrs)
|
||||
like the [lexical attribute getters](/usage/linguistic-features#language-data)
|
||||
available to the loaded object.
|
||||
|
||||
> #### Example
|
||||
|
|
|
@ -130,8 +130,7 @@ applied to the `Doc` in order.
|
|||
## Lemmatizer.lookup_lemmatize {#lookup_lemmatize tag="method"}
|
||||
|
||||
Lemmatize a token using a lookup-based approach. If no lemma is found, the
|
||||
original string is returned. Languages can provide a
|
||||
[lookup table](/usage/adding-languages#lemmatizer) via the `Lookups`.
|
||||
original string is returned.
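
The lookup behavior described above can be pictured with a plain dictionary. This is only a minimal sketch: the `table` contents and the `lookup_lemma` helper are hypothetical stand-ins, whereas the real method operates on a `Token` and a lookup table supplied by the language data.

```python
from typing import Dict

# Hypothetical lookup table; real tables are provided by the language data.
table: Dict[str, str] = {"went": "go", "mice": "mouse"}


def lookup_lemma(table: Dict[str, str], string: str) -> str:
    """Return the lemma from the table, or the original string if none is found."""
    return table.get(string, string)


print(lookup_lemma(table, "went"))   # go
print(lookup_lemma(table, "spaCy"))  # spaCy (no entry, so the input is returned)
```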
|
||||
|
||||
| Name | Description |
|
||||
| ----------- | --------------------------------------------------- |
|
||||
|
|
|
@ -437,8 +437,8 @@ The L2 norm of the token's vector representation.
|
|||
| `ent_id_` | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. ~~str~~ |
|
||||
| `lemma` | Base form of the token, with no inflectional suffixes. ~~int~~ |
|
||||
| `lemma_` | Base form of the token, with no inflectional suffixes. ~~str~~ |
|
||||
| `norm` | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions). ~~int~~ |
|
||||
| `norm_` | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions). ~~str~~ |
|
||||
| `norm` | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/linguistic-features#language-data). ~~int~~ |
|
||||
| `norm_` | The token's norm, i.e. a normalized form of the token text. Can be set in the language's [tokenizer exceptions](/usage/linguistic-features#language-data). ~~str~~ |
|
||||
| `lower` | Lowercase form of the token. ~~int~~ |
|
||||
| `lower_` | Lowercase form of the token text. Equivalent to `Token.text.lower()`. ~~str~~ |
|
||||
| `shape` | Transform of the token's string to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. ~~int~~ |
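
The `shape` transform described in the row above can be sketched in plain Python. The `word_shape` function below is an illustrative reimplementation written from that description, assuming non-alphanumeric characters are kept as they are; it is not presented as the library's own code.

```python
def word_shape(text: str) -> str:
    """Map a string to its orthographic shape, e.g. "Apple" -> "Xxxxx"."""
    shape = []
    last = ""
    run = 0  # length of the current run of identical shape characters
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            # Assumption: punctuation and other characters are left unchanged.
            shape_char = char
        if shape_char == last:
            run += 1
        else:
            run = 1
            last = shape_char
        # Keep at most four identical shape characters in a row.
        if run <= 4:
            shape.append(shape_char)
    return "".join(shape)


print(word_shape("Apple"))  # Xxxxx
print(word_shape("12345"))  # dddd
print(word_shape("U.K."))   # X.X.
```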
|
||||
|
|
|
@ -22,9 +22,8 @@ like punctuation and special case rules from the
|
|||
|
||||
## Tokenizer.\_\_init\_\_ {#init tag="method"}
|
||||
|
||||
Create a `Tokenizer` to create `Doc` objects given unicode text. For examples
|
||||
of how to construct a custom tokenizer with different tokenization rules, see
|
||||
the
|
||||
Create a `Tokenizer` to create `Doc` objects given unicode text. For examples of
|
||||
how to construct a custom tokenizer with different tokenization rules, see the
|
||||
[usage documentation](https://spacy.io/usage/linguistic-features#native-tokenizers).
|
||||
|
||||
> #### Example
|
||||
|
@ -87,7 +86,7 @@ Tokenize a stream of texts.
|
|||
| ------------ | ------------------------------------------------------------------------------------ |
|
||||
| `texts` | A sequence of unicode texts. ~~Iterable[str]~~ |
|
||||
| `batch_size` | The number of texts to accumulate in an internal buffer. Defaults to `1000`. ~~int~~ |
|
||||
| **YIELDS** | The tokenized `Doc` objects, in order. ~~Doc~~ |
|
||||
| **YIELDS** | The tokenized `Doc` objects, in order. ~~Doc~~ |
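
To illustrate the contract in the table above (texts are buffered in groups of `batch_size` and results are yielded in input order), here is a minimal standard-library sketch. The `pipe` function and the whitespace `tokenize` callable are hypothetical stand-ins and make no claim about how the tokenizer is implemented internally.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def pipe(
    texts: Iterable[str],
    tokenize: Callable[[str], List[str]],
    batch_size: int = 1000,
) -> Iterator[List[str]]:
    """Tokenize a stream of texts, buffering up to batch_size texts at a time."""
    iterator = iter(texts)
    while True:
        # Pull the next batch from the stream without consuming all of it.
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        for text in batch:
            yield tokenize(text)


texts = ["One document.", "...and another document."]
for tokens in pipe(texts, str.split, batch_size=1):
    print(tokens)
```

Because the function is a generator, it also works on streams that are too large to hold in memory at once.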
|
||||
|
||||
## Tokenizer.find_infix {#find_infix tag="method"}
|
||||
|
||||
|
@ -121,10 +120,10 @@ if no suffix rules match.
|
|||
## Tokenizer.add_special_case {#add_special_case tag="method"}
|
||||
|
||||
Add a special-case tokenization rule. This mechanism is also used to add custom
|
||||
tokenizer exceptions to the language data. See the usage guide on
|
||||
[adding languages](/usage/adding-languages#tokenizer-exceptions) and
|
||||
[linguistic features](/usage/linguistic-features#special-cases) for more details
|
||||
and examples.
|
||||
tokenizer exceptions to the language data. See the usage guide on the
|
||||
[language data](/usage/linguistic-features#language-data) and
|
||||
[tokenizer special cases](/usage/linguistic-features#special-cases) for more
|
||||
details and examples.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
|
|
@ -827,7 +827,7 @@ utilities.
|
|||
### util.get_lang_class {#util.get_lang_class tag="function"}
|
||||
|
||||
Import and load a `Language` class. Allows lazy-loading
|
||||
[language data](/usage/adding-languages) and importing languages using the
|
||||
[language data](/usage/linguistic-features#language-data) and importing languages using the
|
||||
two-letter language code. To add a language code for a custom language class,
|
||||
you can register it using the [`@registry.languages`](/api/top-level#registry)
|
||||
decorator.
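
The combination of lazy loading and decorator-based registration can be sketched with the standard library alone. Everything below is hypothetical: the `_LANGUAGES` dict, `register_language`, `lazy_lang_class` and the `my_langs` package name are illustrative and are not spaCy's registry.

```python
import importlib
from typing import Callable, Dict

# Hypothetical registry mapping language codes to classes.
_LANGUAGES: Dict[str, type] = {}


def register_language(code: str) -> Callable[[type], type]:
    """Decorator that registers a class under a language code."""
    def decorator(cls: type) -> type:
        _LANGUAGES[code] = cls
        return cls
    return decorator


def lazy_lang_class(code: str, package: str = "my_langs") -> type:
    """Return the class for a code, importing its module only on first use."""
    if code not in _LANGUAGES:
        # Assumption: importing my_langs.<code> registers the class as a side effect.
        importlib.import_module(f"{package}.{code}")
    if code not in _LANGUAGES:
        raise ImportError(f"No language class registered for code '{code}'")
    return _LANGUAGES[code]


@register_language("xx_demo")
class DemoLanguage:
    lang = "xx_demo"


print(lazy_lang_class("xx_demo").lang)  # xx_demo
```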
|
||||
|
|
|
@ -30,7 +30,7 @@ import QuickstartModels from 'widgets/quickstart-models.js'
|
|||
## Language support {#languages}
|
||||
|
||||
spaCy currently provides support for the following languages. You can help by
|
||||
[improving the existing language data](/usage/adding-languages#language-data)
|
||||
improving the existing [language data](/usage/linguistic-features#language-data)
|
||||
and extending the tokenization patterns.
|
||||
[See here](https://github.com/explosion/spaCy/issues/3056) for details on how to
|
||||
contribute to development.
|
||||
|
@ -83,55 +83,81 @@ To train a pipeline using the neutral multi-language class, you can set
|
|||
import the `MultiLanguage` class directly, or call
|
||||
[`spacy.blank("xx")`](/api/top-level#spacy.blank) for lazy-loading.
|
||||
|
||||
### Chinese language support {#chinese new=2.3}
|
||||
### Chinese language support {#chinese new="2.3"}
|
||||
|
||||
The Chinese language class supports three word segmentation options, `char`,
|
||||
`jieba` and `pkuseg`:
|
||||
`jieba` and `pkuseg`.
|
||||
|
||||
> #### Manual setup
|
||||
>
|
||||
> ```python
|
||||
> from spacy.lang.zh import Chinese
|
||||
>
|
||||
> # Character segmentation (default)
|
||||
> nlp = Chinese()
|
||||
>
|
||||
> # Jieba
|
||||
> cfg = {"segmenter": "jieba"}
|
||||
> nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
|
||||
>
|
||||
> # PKUSeg with "default" model provided by pkuseg
|
||||
> cfg = {"segmenter": "pkuseg"}
|
||||
> nlp = Chinese.from_config({"nlp": {"tokenizer": cfg}})
|
||||
> nlp.tokenizer.initialize(pkuseg_model="default")
|
||||
> ```
|
||||
|
||||
1. **Character segmentation:** Character segmentation is the default
|
||||
segmentation option. It's enabled when you create a new `Chinese` language
|
||||
class or call `spacy.blank("zh")`.
|
||||
2. **Jieba:** `Chinese` uses [Jieba](https://github.com/fxsjy/jieba) for word
|
||||
segmentation with the tokenizer option `{"segmenter": "jieba"}`.
|
||||
3. **PKUSeg:** As of spaCy v2.3.0, support for
|
||||
[PKUSeg](https://github.com/lancopku/PKUSeg-python) has been added to support
|
||||
better segmentation for Chinese OntoNotes and the provided
|
||||
[Chinese pipelines](/models/zh). Enable PKUSeg with the tokenizer option
|
||||
`{"segmenter": "pkuseg"}`.
|
||||
```ini
|
||||
### config.cfg
|
||||
[nlp.tokenizer]
|
||||
@tokenizers = "spacy.zh.ChineseTokenizer"
|
||||
segmenter = "char"
|
||||
```
|
||||
|
||||
<Infobox variant="warning">
|
||||
| Segmenter | Description |
|
||||
| --------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||
| `char` | **Character segmentation:** Character segmentation is the default segmentation option. It's enabled when you create a new `Chinese` language class or call `spacy.blank("zh")`. |
|
||||
| `jieba` | **Jieba:** To use [Jieba](https://github.com/fxsjy/jieba) for word segmentation, you can set the option `segmenter` to `"jieba"`. |
|
||||
| `pkuseg` | **PKUSeg:** As of spaCy v2.3.0, support for [PKUSeg](https://github.com/lancopku/PKUSeg-python) has been added to provide better segmentation for Chinese OntoNotes and the provided [Chinese pipelines](/models/zh). Enable PKUSeg by setting the tokenizer option `segmenter` to `"pkuseg"`. |
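
As a rough picture of what the `char` option means, the sketch below splits a string into one token per character. The `segment` function is hypothetical, it assumes whitespace is simply dropped, and it deliberately does not implement the `jieba` or `pkuseg` options because those rely on third-party packages.

```python
from typing import List


def segment(text: str, segmenter: str = "char") -> List[str]:
    """Split text into tokens according to the chosen segmenter option."""
    if segmenter == "char":
        # Character segmentation: every non-whitespace character is a token.
        return [char for char in text if not char.isspace()]
    if segmenter in ("jieba", "pkuseg"):
        raise NotImplementedError(f"'{segmenter}' needs a third-party package")
    raise ValueError(f"Unknown segmenter: {segmenter}")


print(segment("我爱北京"))  # ['我', '爱', '北', '京']
```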
|
||||
|
||||
In spaCy v3.0, the default Chinese word segmenter has switched from Jieba to
|
||||
character segmentation.
|
||||
<Infobox title="Changed in v3.0" variant="warning">
|
||||
|
||||
In v3.0, the default word segmenter has switched from Jieba to character
|
||||
segmentation. Because the `pkuseg` segmenter depends on a model that can be
|
||||
loaded from a file, the model is loaded on
|
||||
[initialization](/usage/training#config-lifecycle) (typically before training).
|
||||
This ensures that your packaged Chinese model doesn't depend on a local path at
|
||||
runtime.
|
||||
|
||||
</Infobox>
|
||||
|
||||
<Accordion title="Details on spaCy's Chinese API">
|
||||
|
||||
The `initialize` method for the Chinese tokenizer class supports the following
|
||||
config settings for loading pkuseg models:
|
||||
config settings for loading `pkuseg` models:
|
||||
|
||||
| Name | Description |
|
||||
| ------------------ | ------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `pkuseg_model` | Name of a model provided by `pkuseg` or the path to a local model directory. ~~str~~ |
|
||||
| `pkuseg_user_dict` | Optional path to a file with one word per line which overrides the default `pkuseg` user dictionary. Defaults to `"default"`. ~~str~~ |
|
||||
|
||||
The initialization settings are typically provided in the
|
||||
[training config](/usage/training#config) and the data is loaded in before
|
||||
training and serialized with the model. This allows you to load the data from a
|
||||
local path and save out your pipeline and config, without requiring the same
|
||||
local path at runtime. See the usage guide on the
|
||||
[config lifecycle](/usage/training#config-lifecycle) for more background on
|
||||
this.
|
||||
|
||||
```ini
|
||||
### config.cfg
|
||||
[initialize]
|
||||
|
||||
[initialize.tokenizer]
|
||||
pkuseg_model = "/path/to/model"
|
||||
pkuseg_user_dict = "default"
|
||||
```
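
The idea behind this lifecycle (read data from a local path once at initialization, then serialize it with the component so the path is not needed at runtime) can be sketched with the standard library. The `DictSegmenter` class and the file name are hypothetical and only mirror the "one word per line" format described above.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Optional


class DictSegmenter:
    """Hypothetical component that loads a user dictionary on initialization."""

    def __init__(self) -> None:
        self.words: List[str] = []

    def initialize(self, user_dict_path: Optional[str] = None) -> None:
        # Only called before training: read one word per line from a local file.
        if user_dict_path is not None:
            lines = Path(user_dict_path).read_text(encoding="utf8").splitlines()
            self.words = [line.strip() for line in lines if line.strip()]

    def to_bytes(self) -> bytes:
        # The loaded data is stored with the component, not the path.
        return json.dumps({"words": self.words}).encode("utf8")

    def from_bytes(self, data: bytes) -> "DictSegmenter":
        self.words = json.loads(data.decode("utf8"))["words"]
        return self


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "user_dict.txt"
    path.write_text("北京大学\n自然语言\n", encoding="utf8")
    trained = DictSegmenter()
    trained.initialize(str(path))
    data = trained.to_bytes()

# The temporary file is gone, but the serialized component still has the words.
restored = DictSegmenter().from_bytes(data)
print(restored.words)  # ['北京大学', '自然语言']
```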
|
||||
|
||||
You can also initialize the tokenizer for a blank language class by calling its
|
||||
`initialize` method:
|
||||
|
||||
```python
|
||||
### Examples
|
||||
# Initialize the pkuseg tokenizer
|
||||
|
@ -191,12 +217,13 @@ nlp.tokenizer.initialize(pkuseg_model="/path/to/pkuseg_model")
|
|||
|
||||
### Japanese language support {#japanese new=2.3}
|
||||
|
||||
> #### Manual setup
|
||||
>
|
||||
> ```python
|
||||
> from spacy.lang.ja import Japanese
|
||||
>
|
||||
> # Load SudachiPy with split mode A (default)
|
||||
> nlp = Japanese()
|
||||
>
|
||||
> # Load SudachiPy with split mode B
|
||||
> cfg = {"split_mode": "B"}
|
||||
> nlp = Japanese.from_config({"nlp": {"tokenizer": cfg}})
|
||||
|
@ -208,6 +235,13 @@ segmentation and part-of-speech tagging. The default Japanese language class and
|
|||
the provided Japanese pipelines use SudachiPy split mode `A`. The tokenizer
|
||||
config can be used to configure the split mode to `A`, `B` or `C`.
|
||||
|
||||
```ini
|
||||
### config.cfg
|
||||
[nlp.tokenizer]
|
||||
@tokenizers = "spacy.ja.JapaneseTokenizer"
|
||||
split_mode = "A"
|
||||
```
|
||||
|
||||
<Infobox variant="warning">
|
||||
|
||||
If you run into errors related to `sudachipy`, which is currently under active
|
||||
|
|
|
@ -895,6 +895,10 @@ the name. Registered functions can also take **arguments** by the way that can
|
|||
be defined in the config as well – you can read more about this in the docs on
|
||||
[training with custom code](/usage/training#custom-code).
|
||||
|
||||
### Initializing components with data {#initialization}
|
||||
|
||||
<!-- TODO: -->
|
||||
|
||||
### Python type hints and pydantic validation {#type-hints new="3"}
|
||||
|
||||
spaCy's configs are powered by our machine learning library Thinc's
|
||||
|
|
|
@ -291,7 +291,7 @@ installed in the same environment – that's it.
|
|||
| Entry point | Description |
|
||||
| ------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| [`spacy_factories`](#entry-points-components) | Group of entry points for pipeline component factories, keyed by component name. Can be used to expose custom components defined by another package. |
|
||||
| [`spacy_languages`](#entry-points-languages) | Group of entry points for custom [`Language` subclasses](/usage/adding-languages), keyed by language shortcut. |
|
||||
| [`spacy_languages`](#entry-points-languages) | Group of entry points for custom [`Language` subclasses](/usage/linguistic-features#language-data), keyed by language shortcut. |
|
||||
| `spacy_lookups` <Tag variant="new">2.2</Tag> | Group of entry points for custom [`Lookups`](/api/lookups), including lemmatizer data. Used by spaCy's [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) package. |
|
||||
| [`spacy_displacy_colors`](#entry-points-displacy) <Tag variant="new">2.2</Tag> | Group of entry points of custom label colors for the [displaCy visualizer](/usage/visualizers#ent). The key name doesn't matter, but it should point to a dict of labels and color values. Useful for custom models that predict different entity types. |
|
||||
|
||||
|
|
|
@ -200,7 +200,7 @@ import Tokenization101 from 'usage/101/\_tokenization.md'
|
|||
To learn more about how spaCy's tokenization rules work in detail, how to
|
||||
**customize and replace** the default tokenizer and how to **add
|
||||
language-specific data**, see the usage guides on
|
||||
[adding languages](/usage/adding-languages) and
|
||||
[language data](/usage/linguistic-features#language-data) and
|
||||
[customizing the tokenizer](/usage/linguistic-features#tokenization).
|
||||
|
||||
</Infobox>
|
||||
|
@ -479,7 +479,7 @@ find a "Suggest edits" link at the bottom of each page that points you to the
|
|||
source.
|
||||
|
||||
Another way of getting involved is to help us improve the
|
||||
[language data](/usage/adding-languages#language-data) – especially if you
|
||||
[language data](/usage/linguistic-features#language-data) – especially if you
|
||||
happen to speak one of the languages currently in
|
||||
[alpha support](/usage/models#languages). Even adding simple tokenizer
|
||||
exceptions, stop words or lemmatizer data can make a big difference. It will
|
||||
|
|
|
@ -216,7 +216,9 @@ The initialization settings are only loaded and used when
|
|||
[`nlp.initialize`](/api/language#initialize) is called (typically right before
|
||||
training). This allows you to set up your pipeline using local data resources
|
||||
and custom functions, and preserve the information in your config – but without
|
||||
requiring it to be available at runtime
|
||||
requiring it to be available at runtime. You can also use this mechanism to
|
||||
provide data paths to custom pipeline components and custom tokenizers – see the
|
||||
section on [custom initialization](#initialization) for details.
|
||||
|
||||
### Overwriting config settings on the command line {#config-overrides}
|
||||
|
||||
|
@ -815,9 +817,9 @@ def MyModel(output_width: int) -> Model[List[Doc], List[Floats2d]]:
|
|||
return create_model(output_width)
|
||||
```
|
||||
|
||||
<!-- TODO:
|
||||
### Customizing the initialization {#initialization}
|
||||
-->
|
||||
|
||||
<!-- TODO: -->
|
||||
|
||||
## Data utilities {#data}
|
||||
|
||||
|
@ -1135,7 +1137,11 @@ An easy way to create modified `Example` objects is to use the
|
|||
capitalization changes, so only the `ORTH` values of the tokens will be
|
||||
different between the original and augmented examples.
|
||||
|
||||
<!-- TODO: mention alignment -->
|
||||
Note that if your data augmentation strategy involves changing the tokenization
|
||||
(for instance, removing or adding tokens) and your training examples include
|
||||
token-based annotations like the dependency parse or entity labels, you'll need
|
||||
to take care to adjust the `Example` object so its annotations match and remain
|
||||
valid.
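
As a minimal illustration of why the annotations need adjusting, the sketch below removes one token and shifts token-indexed entity spans accordingly. The `drop_token` helper and the `(start, end, label)` span format are assumptions made for this example and are not the `Example` API.

```python
from typing import List, Tuple

# (start token index, end token index (exclusive), label)
Span = Tuple[int, int, str]


def drop_token(
    tokens: List[str], spans: List[Span], index: int
) -> Tuple[List[str], List[Span]]:
    """Remove the token at index and shift the spans so they stay valid."""
    new_tokens = tokens[:index] + tokens[index + 1 :]
    new_spans = []
    for start, end, label in spans:
        # Boundaries to the right of the removed token move left by one.
        if start > index:
            start -= 1
        if end > index:
            end -= 1
        # Drop spans that became empty because their only token was removed.
        if start < end:
            new_spans.append((start, end, label))
    return new_tokens, new_spans


tokens = ["I", "really", "like", "New", "York"]
spans = [(3, 5, "GPE")]
print(drop_token(tokens, spans, 1))
# (['I', 'like', 'New', 'York'], [(2, 4, 'GPE')])
```

Without the shift, the original span `(3, 5)` would point past the end of the shortened token list.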
|
||||
|
||||
## Parallel & distributed training with Ray {#parallel-training}
|
||||
|
||||
|
|