mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-26 09:14:32 +03:00
Update docs [ci skip]
parent
1346ee06d4
commit
c288dba8e7
|
@ -49,11 +49,11 @@ contain arbitrary whitespace. Alignment into the original string is preserved.
|
|||
> assert (doc[0].text, doc[0].head.tag_) == ("An", "NN")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | --------------------------------------------------------------------------------- |
|
||||
| `text` | str | The text to be processed. |
|
||||
| `disable` | `List[str]` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
|
||||
| **RETURNS** | `Doc` | A container for accessing the annotations. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----------- | --------------------------------------------------------------------------------- |
|
||||
| `text` | str | The text to be processed. |
|
||||
| `disable` | `List[str]` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
|
||||
| **RETURNS** | `Doc` | A container for accessing the annotations. |
|
||||
|
||||
## Language.pipe {#pipe tag="method"}
|
||||
|
||||
|
@ -112,14 +112,14 @@ Evaluate a model's pipeline components.
|
|||
> print(scores)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| -------------------------------------------- | ------------------- | ------------------------------------------------------------------------------------- |
|
||||
| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
|
||||
| `verbose` | bool | Print debugging information. |
|
||||
| `batch_size` | int | The batch size to use. |
|
||||
| `scorer` | `Scorer` | Optional [`Scorer`](/api/scorer) to use. If not passed in, a new one will be created. |
|
||||
| `component_cfg` <Tag variant="new">2.1</Tag> | `Dict[str, Dict]` | Config parameters for specific pipeline components, keyed by component name. |
|
||||
| **RETURNS** | `Dict[str, Union[float, Dict]]` | A dictionary of evaluation scores. |
|
||||
| Name | Type | Description |
|
||||
| -------------------------------------------- | ------------------------------- | ------------------------------------------------------------------------------------- |
|
||||
| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
|
||||
| `verbose` | bool | Print debugging information. |
|
||||
| `batch_size` | int | The batch size to use. |
|
||||
| `scorer` | `Scorer` | Optional [`Scorer`](/api/scorer) to use. If not passed in, a new one will be created. |
|
||||
| `component_cfg` <Tag variant="new">2.1</Tag> | `Dict[str, Dict]` | Config parameters for specific pipeline components, keyed by component name. |
|
||||
| **RETURNS** | `Dict[str, Union[float, Dict]]` | A dictionary of evaluation scores. |
|
||||
|
||||
## Language.begin_training {#begin_training tag="method"}
|
||||
|
||||
|
@ -418,11 +418,70 @@ available to the loaded object.
|
|||
|
||||
## Class attributes {#class-attributes}
|
||||
|
||||
| Name | Type | Description |
|
||||
| -------------------------------------- | ----- | ----------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `Defaults` | class | Settings, data and factory methods for creating the `nlp` object and processing pipeline. |
|
||||
| `lang` | str | Two-letter language ID, i.e. [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). |
|
||||
| `factories` <Tag variant="new">2</Tag> | dict | Factories that create pre-defined pipeline components, e.g. the tagger, parser or entity recognizer, keyed by their component name. |
|
||||
| Name | Type | Description |
|
||||
| ---------- | ----- | ----------------------------------------------------------------------------------------------- |
|
||||
| `Defaults` | class | Settings, data and factory methods for creating the `nlp` object and processing pipeline. |
|
||||
| `lang` | str | Two-letter language ID, i.e. [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). |
|
||||
|
||||
## Defaults {#defaults}
|
||||
|
||||
The following attributes can be set on the `Language.Defaults` class to
|
||||
customize the default language data:
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
> from spacy.language import Language
> from spacy.lang.tokenizer_exceptions import URL_MATCH
> from thinc.api import Config
>
> DEFAULT_CONFIG = """
> [nlp.tokenizer]
> @tokenizers = "MyCustomTokenizer.v1"
> """
>
> class Defaults(Language.Defaults):
>     stop_words = set()
>     tokenizer_exceptions = {}
>     prefixes = tuple()
>     suffixes = tuple()
>     infixes = tuple()
>     token_match = None
>     url_match = URL_MATCH
>     lex_attr_getters = {}
>     syntax_iterators = {}
>     writing_system = {"direction": "ltr", "has_case": True, "has_letters": True}
>     config = Config().from_str(DEFAULT_CONFIG)
> ```
|
||||
|
||||
| Name | Description |
|
||||
| --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||
| `stop_words` | List of stop words, used for `Token.is_stop`.<br />**Example:** [`stop_words.py`][stop_words.py] |
|
||||
| `tokenizer_exceptions` | Tokenizer exception rules, string mapped to list of token attributes.<br />**Example:** [`de/tokenizer_exceptions.py`][de/tokenizer_exceptions.py] |
|
||||
| `prefixes`, `suffixes`, `infixes` | Prefix, suffix and infix rules for the default tokenizer.<br />**Example:** [`punctuation.py`][punctuation.py] |
|
||||
| `token_match` | Optional regex for matching strings that should never be split, overriding the infix rules.<br />**Example:** [`fr/tokenizer_exceptions.py`][fr/tokenizer_exceptions.py] |
|
||||
| `url_match` | Regular expression for matching URLs. Prefixes and suffixes are removed before applying the match.<br />**Example:** [`tokenizer_exceptions.py`][tokenizer_exceptions.py] |
|
||||
| `lex_attr_getters` | Custom functions for setting lexical attributes on tokens, e.g. `like_num`.<br />**Example:** [`lex_attrs.py`][lex_attrs.py] |
|
||||
| `syntax_iterators` | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks).<br />**Example:** [`syntax_iterators.py`][syntax_iterators.py]. |
|
||||
| `writing_system` | Information about the language's writing system, available via `Vocab.writing_system`. Defaults to: `{"direction": "ltr", "has_case": True, "has_letters": True}`.<br />**Example:** [`zh/__init__.py`][zh/__init__.py] |
|
||||
| `config` | Default [config](/usage/training#config) added to `nlp.config`. This can include references to custom tokenizers or lemmatizers.<br />**Example:** [`zh/__init__.py`][zh/__init__.py] |
|
||||
|
||||
[stop_words.py]:
|
||||
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py
|
||||
[tokenizer_exceptions.py]:
|
||||
https://github.com/explosion/spaCy/tree/master/spacy/lang/tokenizer_exceptions.py
|
||||
[de/tokenizer_exceptions.py]:
|
||||
https://github.com/explosion/spaCy/tree/master/spacy/lang/de/tokenizer_exceptions.py
|
||||
[fr/tokenizer_exceptions.py]:
|
||||
https://github.com/explosion/spaCy/tree/master/spacy/lang/fr/tokenizer_exceptions.py
|
||||
[punctuation.py]:
|
||||
https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py
|
||||
[lex_attrs.py]:
|
||||
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py
|
||||
[syntax_iterators.py]:
|
||||
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py
|
||||
[zh/__init__.py]:
|
||||
https://github.com/explosion/spaCy/tree/master/spacy/lang/zh/__init__.py
|
||||
|
||||
## Serialization fields {#serialization-fields}
|
||||
|
||||
|
|
|
@ -8,12 +8,10 @@ makes the data easy to update and extend.
|
|||
|
||||
The **shared language data** in the directory root includes rules that can be
|
||||
generalized across languages – for example, rules for basic punctuation, emoji,
|
||||
emoticons, single-letter abbreviations and norms for equivalent tokens with
|
||||
different spellings, like `"` and `”`. This helps the models make more accurate
|
||||
predictions. The **individual language data** in a submodule contains rules that
|
||||
are only relevant to a particular language. It also takes care of putting
|
||||
together all components and creating the `Language` subclass – for example,
|
||||
`English` or `German`.
|
||||
emoticons and single-letter abbreviations. The **individual language data** in a
|
||||
submodule contains rules that are only relevant to a particular language. It
|
||||
also takes care of putting together all components and creating the `Language`
|
||||
subclass – for example, `English` or `German`.
|
||||
|
||||
> ```python
|
||||
> from spacy.lang.en import English
|
||||
|
@ -23,27 +21,28 @@ together all components and creating the `Language` subclass – for example,
|
|||
> nlp_de = German() # Includes German data
|
||||
> ```
|
||||
|
||||
<!-- TODO: upgrade graphic
|
||||
|
||||
![Language data architecture](../../images/language_data.svg)
|
||||
|
||||
-->
|
||||
|
||||
<!-- TODO: remove this table in favor of more specific Language.Defaults docs in linguistic features? -->
|
||||
|
||||
| Name | Description |
|
||||
| ---------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| **Stop words**<br />[`stop_words.py`][stop_words.py] | List of most common words of a language that are often useful to filter out, for example "and" or "I". Matching tokens will return `True` for `is_stop`. |
|
||||
| **Tokenizer exceptions**<br />[`tokenizer_exceptions.py`][tokenizer_exceptions.py] | Special-case rules for the tokenizer, for example, contractions like "can't" and abbreviations with punctuation, like "U.K.". |
|
||||
| **Norm exceptions**<br />[`norm_exceptions.py`][norm_exceptions.py] | Special-case rules for normalizing tokens to improve the model's predictions, for example on American vs. British spelling. |
|
||||
| **Punctuation rules**<br />[`punctuation.py`][punctuation.py] | Regular expressions for splitting tokens, e.g. on punctuation or special characters like emoji. Includes rules for prefixes, suffixes and infixes. |
|
||||
| **Character classes**<br />[`char_classes.py`][char_classes.py] | Character classes to be used in regular expressions, for example, latin characters, quotes, hyphens or icons. |
|
||||
| **Lexical attributes**<br />[`lex_attrs.py`][lex_attrs.py] | Custom functions for setting lexical attributes on tokens, e.g. `like_num`, which includes language-specific words like "ten" or "hundred". |
|
||||
| **Syntax iterators**<br />[`syntax_iterators.py`][syntax_iterators.py] | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks). |
|
||||
| **Tag map**<br />[`tag_map.py`][tag_map.py] | Dictionary mapping strings in your tag set to [Universal Dependencies](http://universaldependencies.org/u/pos/all.html) tags. |
|
||||
| **Morph rules**<br />[`morph_rules.py`][morph_rules.py] | Exception rules for morphological analysis of irregular words like personal pronouns. |
|
||||
| **Lemmatizer**<br />[`spacy-lookups-data`][spacy-lookups-data] | Lemmatization rules or a lookup-based lemmatization table to assign base forms, for example "be" for "was". |
|
||||
|
||||
[stop_words.py]:
|
||||
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py
|
||||
[tokenizer_exceptions.py]:
|
||||
https://github.com/explosion/spaCy/tree/master/spacy/lang/de/tokenizer_exceptions.py
|
||||
[norm_exceptions.py]:
|
||||
https://github.com/explosion/spaCy/tree/master/spacy/lang/norm_exceptions.py
|
||||
[punctuation.py]:
|
||||
https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py
|
||||
[char_classes.py]:
|
||||
|
@ -52,8 +51,4 @@ together all components and creating the `Language` subclass – for example,
|
|||
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py
|
||||
[syntax_iterators.py]:
|
||||
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py
|
||||
[tag_map.py]:
|
||||
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/tag_map.py
|
||||
[morph_rules.py]:
|
||||
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/morph_rules.py
|
||||
[spacy-lookups-data]: https://github.com/explosion/spacy-lookups-data
|
||||
|
|
|
@ -602,7 +602,95 @@ import Tokenization101 from 'usage/101/\_tokenization.md'
|
|||
|
||||
<Tokenization101 />
|
||||
|
||||
### Tokenizer data {#101-data}
|
||||
<Accordion title="Algorithm details: How spaCy's tokenizer works" id="how-tokenizer-works" spaced>
|
||||
|
||||
spaCy introduces a novel tokenization algorithm that gives a better balance
|
||||
between performance, ease of definition, and ease of alignment into the original
|
||||
string.
|
||||
|
||||
After consuming a prefix or suffix, we consult the special cases again. We want
|
||||
the special cases to handle things like "don't" in English, and we want the same
|
||||
rule to work for "(don't)!". We do this by splitting off the open bracket, then
|
||||
the exclamation, then the close bracket, and finally matching the special case.
|
||||
Here's an implementation of the algorithm in Python, optimized for readability
|
||||
rather than performance:
|
||||
|
||||
```python
def tokenizer_pseudo_code(
    text,
    special_cases,
    prefix_search,
    suffix_search,
    infix_finditer,
    token_match,
    url_match
):
    tokens = []
    for substring in text.split():
        suffixes = []
        while substring:
            # Strip prefixes and suffixes, re-checking the token match and
            # the special cases after every split
            while prefix_search(substring) or suffix_search(substring):
                if token_match(substring):
                    tokens.append(substring)
                    substring = ""
                    break
                if substring in special_cases:
                    tokens.extend(special_cases[substring])
                    substring = ""
                    break
                if prefix_search(substring):
                    split = prefix_search(substring).end()
                    tokens.append(substring[:split])
                    substring = substring[split:]
                    if substring in special_cases:
                        continue
                if suffix_search(substring):
                    split = suffix_search(substring).start()
                    suffixes.append(substring[split:])
                    substring = substring[:split]
            # No more prefixes or suffixes: handle whatever is left
            if token_match(substring):
                tokens.append(substring)
                substring = ""
            elif url_match(substring):
                tokens.append(substring)
                substring = ""
            elif substring in special_cases:
                tokens.extend(special_cases[substring])
                substring = ""
            elif list(infix_finditer(substring)):
                infixes = infix_finditer(substring)
                offset = 0
                for match in infixes:
                    tokens.append(substring[offset : match.start()])
                    tokens.append(substring[match.start() : match.end()])
                    offset = match.end()
                if substring[offset:]:
                    tokens.append(substring[offset:])
                substring = ""
            elif substring:
                tokens.append(substring)
                substring = ""
        # Suffixes were collected outermost-first, so emit them in reverse
        tokens.extend(reversed(suffixes))
    return tokens
```
|
||||
|
||||
The algorithm can be summarized as follows:
|
||||
|
||||
1. Iterate over whitespace-separated substrings.
|
||||
2. Look for a token match. If there is a match, stop processing and keep this
|
||||
token.
|
||||
3. Check whether we have an explicitly defined special case for this substring.
|
||||
If we do, use it.
|
||||
4. Otherwise, try to consume one prefix. If we consumed a prefix, go back to #2,
|
||||
so that the token match and special cases always get priority.
|
||||
5. If we didn't consume a prefix, try to consume a suffix and then go back to
|
||||
#2.
|
||||
6. If we can't consume a prefix or a suffix, look for a URL match.
|
||||
7. If there's no URL match, then look for a special case.
|
||||
8. Look for "infixes", such as hyphens, and split the substring into
|
||||
tokens on all infixes.
|
||||
9. Once we can't consume any more of the string, handle it as a single token.
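
The prefix, suffix and special-case steps (3 to 5) can be tried out in isolation. The following is only a minimal sketch: the two regular expressions and the single special case are toy assumptions, not spaCy's actual rules, and the token match, URL match and infix steps are left out entirely.

```python
import re

# Toy rules, assumed for illustration only; spaCy's real rules are far richer.
SPECIAL_CASES = {"don't": ["do", "n't"]}
PREFIX_RE = re.compile(r"^[(\[]")      # one opening bracket at the start
SUFFIX_RE = re.compile(r"[)\]!?.,]$")  # one closing bracket or punctuation mark at the end


def split_affixes(substring):
    """Peel prefixes and suffixes off one whitespace-delimited chunk,
    consulting the special cases again after every split."""
    tokens, suffixes = [], []
    while substring:
        if substring in SPECIAL_CASES:
            tokens.extend(SPECIAL_CASES[substring])
            break
        prefix = PREFIX_RE.search(substring)
        if prefix:
            tokens.append(substring[: prefix.end()])
            substring = substring[prefix.end():]
            continue
        suffix = SUFFIX_RE.search(substring)
        if suffix:
            suffixes.append(substring[suffix.start():])
            substring = substring[: suffix.start()]
            continue
        # Nothing left to strip: keep the remainder as a single token
        tokens.append(substring)
        break
    # Suffixes were collected outermost-first, so reverse them
    return tokens + suffixes[::-1]


print(split_affixes("(don't)!"))
# ['(', 'do', "n't", ')', '!']
```

The open bracket is split off first, then the exclamation mark, then the close bracket, and only then does the remaining `don't` match the special case.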
|
||||
|
||||
</Accordion>
|
||||
|
||||
**Global** and **language-specific** tokenizer data is supplied via the language
|
||||
data in
|
||||
|
@ -613,15 +701,6 @@ The prefixes, suffixes and infixes mostly define punctuation rules – for
|
|||
example, when to split off periods (at the end of a sentence), and when to leave
|
||||
tokens containing periods intact (abbreviations like "U.S.").
|
||||
|
||||
![Language data architecture](../images/language_data.svg)
|
||||
|
||||
<Infobox title="Language data" emoji="📖">
|
||||
|
||||
For more details on the language-specific data, see the usage guide on
|
||||
[adding languages](/usage/adding-languages).
|
||||
|
||||
</Infobox>
|
||||
|
||||
<Accordion title="Should I change the language data or add custom tokenizer rules?" id="lang-data-vs-tokenizer">
|
||||
|
||||
Tokenization rules that are specific to one language, but can be **generalized
|
||||
|
@ -637,6 +716,14 @@ subclass.
|
|||
|
||||
---
|
||||
|
||||
<!--
|
||||
|
||||
### Customizing the tokenizer {#tokenizer-custom}
|
||||
|
||||
TODO: rewrite the docs on custom tokenization in a more user-friendly order, including details on how to integrate a fully custom tokenizer, representing a tokenizer in the config etc.
|
||||
|
||||
-->
|
||||
|
||||
### Adding special case tokenization rules {#special-cases}
|
||||
|
||||
Most domains have at least some idiosyncrasies that require custom tokenization
|
||||
|
@ -677,88 +764,6 @@ nlp.tokenizer.add_special_case("...gimme...?", [{"ORTH": "...gimme...?"}])
|
|||
assert len(nlp("...gimme...?")) == 1
|
||||
```
|
||||
|
||||
### How spaCy's tokenizer works {#how-tokenizer-works}
|
||||
|
||||
spaCy introduces a novel tokenization algorithm, that gives a better balance
|
||||
between performance, ease of definition, and ease of alignment into the original
|
||||
string.
|
||||
|
||||
After consuming a prefix or suffix, we consult the special cases again. We want
|
||||
the special cases to handle things like "don't" in English, and we want the same
|
||||
rule to work for "(don't)!". We do this by splitting off the open bracket, then
|
||||
the exclamation, then the close bracket, and finally matching the special case.
|
||||
Here's an implementation of the algorithm in Python, optimized for readability
|
||||
rather than performance:
|
||||
|
||||
```python
|
||||
def tokenizer_pseudo_code(self, special_cases, prefix_search, suffix_search,
|
||||
infix_finditer, token_match, url_match):
|
||||
tokens = []
|
||||
for substring in text.split():
|
||||
suffixes = []
|
||||
while substring:
|
||||
while prefix_search(substring) or suffix_search(substring):
|
||||
if token_match(substring):
|
||||
tokens.append(substring)
|
||||
substring = ''
|
||||
break
|
||||
if substring in special_cases:
|
||||
tokens.extend(special_cases[substring])
|
||||
substring = ''
|
||||
break
|
||||
if prefix_search(substring):
|
||||
split = prefix_search(substring).end()
|
||||
tokens.append(substring[:split])
|
||||
substring = substring[split:]
|
||||
if substring in special_cases:
|
||||
continue
|
||||
if suffix_search(substring):
|
||||
split = suffix_search(substring).start()
|
||||
suffixes.append(substring[split:])
|
||||
substring = substring[:split]
|
||||
if token_match(substring):
|
||||
tokens.append(substring)
|
||||
substring = ''
|
||||
elif url_match(substring):
|
||||
tokens.append(substring)
|
||||
substring = ''
|
||||
elif substring in special_cases:
|
||||
tokens.extend(special_cases[substring])
|
||||
substring = ''
|
||||
elif list(infix_finditer(substring)):
|
||||
infixes = infix_finditer(substring)
|
||||
offset = 0
|
||||
for match in infixes:
|
||||
tokens.append(substring[offset : match.start()])
|
||||
tokens.append(substring[match.start() : match.end()])
|
||||
offset = match.end()
|
||||
if substring[offset:]:
|
||||
tokens.append(substring[offset:])
|
||||
substring = ''
|
||||
elif substring:
|
||||
tokens.append(substring)
|
||||
substring = ''
|
||||
tokens.extend(reversed(suffixes))
|
||||
return tokens
|
||||
```
|
||||
|
||||
The algorithm can be summarized as follows:
|
||||
|
||||
1. Iterate over whitespace-separated substrings.
|
||||
2. Look for a token match. If there is a match, stop processing and keep this
|
||||
token.
|
||||
3. Check whether we have an explicitly defined special case for this substring.
|
||||
If we do, use it.
|
||||
4. Otherwise, try to consume one prefix. If we consumed a prefix, go back to #2,
|
||||
so that the token match and special cases always get priority.
|
||||
5. If we didn't consume a prefix, try to consume a suffix and then go back to
|
||||
#2.
|
||||
6. If we can't consume a prefix or a suffix, look for a URL match.
|
||||
7. If there's no URL match, then look for a special case.
|
||||
8. Look for "infixes" — stuff like hyphens etc. and split the substring into
|
||||
tokens on all infixes.
|
||||
9. Once we can't consume any more of the string, handle it as a single token.
|
||||
|
||||
#### Debugging the tokenizer {#tokenizer-debug new="2.2.3"}
|
||||
|
||||
A working implementation of the pseudo-code above is available for debugging as
|
||||
|
@ -766,6 +771,17 @@ A working implementation of the pseudo-code above is available for debugging as
|
|||
tuples showing which tokenizer rule or pattern was matched for each token. The
|
||||
tokens produced are identical to `nlp.tokenizer()` except for whitespace tokens:
|
||||
|
||||
> #### Expected output
|
||||
>
|
||||
> ```
|
||||
> " PREFIX
|
||||
> Let SPECIAL-1
|
||||
> 's SPECIAL-2
|
||||
> go TOKEN
|
||||
> ! SUFFIX
|
||||
> " SUFFIX
|
||||
> ```
|
||||
|
||||
```python
|
||||
### {executable="true"}
|
||||
from spacy.lang.en import English
|
||||
|
@ -777,13 +793,6 @@ tok_exp = nlp.tokenizer.explain(text)
|
|||
assert [t.text for t in doc if not t.is_space] == [t[1] for t in tok_exp]
|
||||
for t in tok_exp:
|
||||
print(t[1], "\\t", t[0])
|
||||
|
||||
# " PREFIX
|
||||
# Let SPECIAL-1
|
||||
# 's SPECIAL-2
|
||||
# go TOKEN
|
||||
# ! SUFFIX
|
||||
# " SUFFIX
|
||||
```
|
||||
|
||||
### Customizing spaCy's Tokenizer class {#native-tokenizers}
|
||||
|
@ -1437,3 +1446,73 @@ print("After:", [sent.text for sent in doc.sents])
|
|||
import LanguageData101 from 'usage/101/\_language-data.md'
|
||||
|
||||
<LanguageData101 />
|
||||
|
||||
### Creating a custom language subclass {#language-subclass}
|
||||
|
||||
If you want to customize multiple components of the language data or add support
|
||||
for a custom language or domain-specific "dialect", you can also implement your
|
||||
own language subclass. The subclass should define two attributes: the `lang`
|
||||
(unique language code) and the `Defaults` defining the language data. For an
|
||||
overview of the available attributes that can be overwritten, see the
|
||||
[`Language.Defaults`](/api/language#defaults) documentation.
|
||||
|
||||
```python
|
||||
### {executable="true"}
|
||||
from spacy.lang.en import English
|
||||
|
||||
class CustomEnglishDefaults(English.Defaults):
|
||||
stop_words = set(["custom", "stop"])
|
||||
|
||||
class CustomEnglish(English):
|
||||
lang = "custom_en"
|
||||
Defaults = CustomEnglishDefaults
|
||||
|
||||
nlp1 = English()
|
||||
nlp2 = CustomEnglish()
|
||||
|
||||
print(nlp1.lang, [token.is_stop for token in nlp1("custom stop")])
|
||||
print(nlp2.lang, [token.is_stop for token in nlp2("custom stop")])
|
||||
```
|
||||
|
||||
The [`@spacy.registry.languages`](/api/top-level#registry) decorator lets you
|
||||
register a custom language class and assign it a string name. This means that
|
||||
you can call [`spacy.blank`](/api/top-level#spacy.blank) with your custom
|
||||
language name, and even train models with it and refer to it in your
|
||||
[training config](/usage/training#config).
|
||||
|
||||
> #### Config usage
|
||||
>
|
||||
> After registering your custom language class using the `languages` registry,
|
||||
> you can refer to it in your [training config](/usage/training#config). This
|
||||
> means spaCy will train your model using the custom subclass.
|
||||
>
|
||||
> ```ini
|
||||
> [nlp]
|
||||
> lang = "custom_en"
|
||||
> ```
|
||||
>
|
||||
> In order to resolve `"custom_en"` to your subclass, the registered function
|
||||
> needs to be available during training. You can load a Python file containing
|
||||
> the code using the `--code` argument:
|
||||
>
|
||||
> ```bash
|
||||
> ### {wrap="true"}
|
||||
> $ python -m spacy train train.spacy dev.spacy config.cfg --code code.py
|
||||
> ```
|
||||
|
||||
```python
|
||||
### Registering a custom language {highlight="7,12-13"}
|
||||
import spacy
|
||||
from spacy.lang.en import English
|
||||
|
||||
class CustomEnglishDefaults(English.Defaults):
|
||||
stop_words = set(["custom", "stop"])
|
||||
|
||||
@spacy.registry.languages("custom_en")
|
||||
class CustomEnglish(English):
|
||||
lang = "custom_en"
|
||||
Defaults = CustomEnglishDefaults
|
||||
|
||||
# This now works! 🎉
|
||||
nlp = spacy.blank("custom_en")
|
||||
```
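
To see why the registered code has to be importable before the name can be resolved, it may help to look at a registry in miniature. The sketch below is a plain-Python stand-in, not spaCy's implementation: `LANGUAGES`, `register_language`, `blank` and `BaseLanguage` are all hypothetical names chosen for illustration.

```python
# Hypothetical stand-in for a string-to-class registry; not spaCy's code.
LANGUAGES = {}


def register_language(name):
    """Return a class decorator that stores the decorated class under `name`."""
    def decorator(cls):
        LANGUAGES[name] = cls
        return cls
    return decorator


def blank(name):
    """Look up a registered class by its string name and instantiate it."""
    if name not in LANGUAGES:
        raise KeyError(f"No language registered under {name!r}")
    return LANGUAGES[name]()


class BaseLanguage:
    lang = None


@register_language("custom_en")
class CustomEnglish(BaseLanguage):
    lang = "custom_en"


nlp = blank("custom_en")
print(type(nlp).__name__, nlp.lang)
```

Registration happens as a side effect of executing the module that contains the decorated class. If that module is never imported, the name is never added to the registry and the lookup fails.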
|
||||
|
|
|
@ -618,7 +618,9 @@ mattis pretium.
|
|||
[FastAPI](https://fastapi.tiangolo.com/) is a modern high-performance framework
|
||||
for building REST APIs with Python, based on Python
|
||||
[type hints](https://fastapi.tiangolo.com/python-types/). It's become a popular
|
||||
library for serving machine learning models and
|
||||
library for serving machine learning models, and you can use it in your spaCy
|
||||
projects to quickly serve up a trained model and make it available behind a REST
|
||||
API.
|
||||
|
||||
```python
|
||||
# TODO: show an example that addresses some of the main concerns for serving ML (workers etc.)
|
||||
|
|
|
@ -74,7 +74,7 @@ When you train a model using the [`spacy train`](/api/cli#train) command, you'll
|
|||
see a table showing metrics after each pass over the data. Here's what those
|
||||
metrics mean:
|
||||
|
||||
<!-- TODO: update table below with updated metrics if needed -->
|
||||
<!-- TODO: update table below and include note about scores in config -->
|
||||
|
||||
| Name | Description |
|
||||
| ---------- | ------------------------------------------------------------------------------------------------- |
|
||||
|
@ -116,7 +116,7 @@ integrate custom models and architectures, written in your framework of choice.
|
|||
Some of the main advantages and features of spaCy's training config are:
|
||||
|
||||
- **Structured sections.** The config is grouped into sections, and nested
|
||||
sections are defined using the `.` notation. For example, `[nlp.pipeline.ner]`
|
||||
sections are defined using the `.` notation. For example, `[components.ner]`
|
||||
defines the settings for the pipeline's named entity recognizer. The config
|
||||
can be loaded as a Python dict.
|
||||
- **References to registered functions.** Sections can refer to registered
|
||||
|
@ -136,10 +136,8 @@ Some of the main advantages and features of spaCy's training config are:
|
|||
Python [type hints](https://docs.python.org/3/library/typing.html) to tell the
|
||||
config which types of data to expect.
|
||||
|
||||
<!-- TODO: update this config? -->
|
||||
|
||||
```ini
|
||||
https://github.com/explosion/spaCy/blob/develop/examples/experiments/onto-joint/defaults.cfg
|
||||
https://github.com/explosion/spaCy/blob/develop/spacy/default_config.cfg
|
||||
```
|
||||
|
||||
Under the hood, the config is parsed into a dictionary. It's divided into
|
||||
|
@ -151,11 +149,12 @@ not just define static settings, but also construct objects like architectures,
|
|||
schedules, optimizers or any other custom components. The main top-level
|
||||
sections of a config file are:
|
||||
|
||||
| Section | Description |
|
||||
| ------------- | ----------------------------------------------------------------------------------------------------- |
|
||||
| `training` | Settings and controls for the training and evaluation process. |
|
||||
| `pretraining` | Optional settings and controls for the [language model pretraining](#pretraining). |
|
||||
| `nlp` | Definition of the [processing pipeline](/docs/processing-pipelines), its components and their models. |
|
||||
| Section | Description |
|
||||
| ------------- | -------------------------------------------------------------------------------------------------------------------- |
|
||||
| `training` | Settings and controls for the training and evaluation process. |
|
||||
| `pretraining` | Optional settings and controls for the [language model pretraining](#pretraining). |
|
||||
| `nlp` | Definition of the `nlp` object, its tokenizer and [processing pipeline](/docs/processing-pipelines) component names. |
|
||||
| `components` | Definitions of the [pipeline components](/docs/processing-pipelines) and their models. |
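
As a rough illustration of how dotted section names such as `[components.ner]` map onto a nested dictionary, here is a sketch that uses only Python's standard library. The config values are made up for the example, `to_nested_dict` is a hypothetical helper, and spaCy's own config loader does considerably more (for instance, resolving references to registered functions).

```python
import configparser
import json

# Illustrative values only; not a complete or validated training config.
CONFIG_TEXT = """
[training]
batch_size = 128

[nlp]
lang = "en"

[components.ner]
factory = "ner"
"""


def to_nested_dict(text):
    """Parse INI-style text and nest sections on the '.' in their names."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    nested = {}
    for section in parser.sections():
        block = nested
        for part in section.split("."):
            block = block.setdefault(part, {})
        for key, value in parser[section].items():
            # Values are written as JSON literals, e.g. 128 or "en"
            block[key] = json.loads(value)
    return nested


config = to_nested_dict(CONFIG_TEXT)
print(config["components"]["ner"])
```

Each top-level key of the resulting dict corresponds to one of the sections in the table above, and `[components.ner]` ends up under `config["components"]["ner"]`.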
|
||||
|
||||
<Infobox title="Config format and settings" emoji="📖">
|
||||
|
||||
|
@ -176,16 +175,16 @@ a consistent format. There are no command-line arguments that need to be set,
|
|||
and no hidden defaults. However, there can still be scenarios where you may want
|
||||
to override config settings when you run [`spacy train`](/api/cli#train). This
|
||||
includes **file paths** to vectors or other resources that shouldn't be
|
||||
hard-code in a config file, or **system-dependent settings** like the GPU ID.
|
||||
hard-coded in a config file, or **system-dependent settings**.
|
||||
|
||||
For cases like this, you can set additional command-line options starting with
|
||||
`--` that correspond to the config section and value to override. For example,
|
||||
`--training.use_gpu 1` sets the `use_gpu` value in the `[training]` block to
|
||||
`1`.
|
||||
`--training.batch_size 128` sets the `batch_size` value in the `[training]`
|
||||
block to `128`.
|
||||
|
||||
```bash
|
||||
$ python -m spacy train train.spacy dev.spacy config.cfg
|
||||
--training.use_gpu 1 --nlp.vectors /path/to/vectors
|
||||
--training.batch_size 128 --nlp.vectors /path/to/vectors
|
||||
```
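
The mapping from a dotted command-line option to a config section and value can be sketched in a few lines. This is an assumption-laden illustration rather than spaCy's actual argument handling: `parse_overrides` and `apply_overrides` are hypothetical helpers, and the starting values `64` and `None` are invented for the example.

```python
import copy
import json


def parse_overrides(args):
    """Turn ["--section.key", "value", ...] into {"section.key": value}."""
    overrides = {}
    for flag, raw in zip(args[::2], args[1::2]):
        if not flag.startswith("--"):
            raise ValueError(f"Expected an option starting with '--', got {flag!r}")
        try:
            value = json.loads(raw)  # "128" becomes the integer 128
        except json.JSONDecodeError:
            value = raw  # keep plain strings such as file paths
        overrides[flag[2:]] = value
    return overrides


def apply_overrides(config, overrides):
    """Return a copy of `config` with the overrides applied.
    Only sections and values that already exist can be overwritten."""
    result = copy.deepcopy(config)
    for dotted_key, value in overrides.items():
        *sections, name = dotted_key.split(".")
        block = result
        for section in sections:
            block = block[section]  # KeyError if the section does not exist
        if name not in block:
            raise KeyError(dotted_key)
        block[name] = value
    return result


config = {"training": {"batch_size": 64}, "nlp": {"vectors": None}}
args = ["--training.batch_size", "128", "--nlp.vectors", "/path/to/vectors"]
updated = apply_overrides(config, parse_overrides(args))
print(updated)
```

Rejecting unknown keys means a mistyped option surfaces as an error instead of silently adding a setting that nothing reads.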
|
||||
|
||||
Only existing sections and values in the config can be overwritten. At the end
|
||||
|
|
|
@ -14,4 +14,20 @@ menu:
|
|||
|
||||
## Backwards Incompatibilities {#incompat}
|
||||
|
||||
### Removed deprecated methods, attributes and arguments {#incompat-removed}
|
||||
|
||||
The following deprecated methods, attributes and arguments were removed in v3.0.
|
||||
Most of them have been deprecated for quite a while now and many would
|
||||
previously raise errors. Many were also primarily internal. If you've been
|
||||
working with more recent versions of spaCy v2.x, it's unlikely that your code
|
||||
relied on them.
|
||||
|
||||
| Class | Removed |
|
||||
| --------------------- | ------------------------------------------------------- |
|
||||
| [`Doc`](/api/doc) | `Doc.tokens_from_list`, `Doc.merge` |
|
||||
| [`Span`](/api/span) | `Span.merge`, `Span.upper`, `Span.lower`, `Span.string` |
|
||||
| [`Token`](/api/token) | `Token.string` |
|
||||
|
||||
<!-- TODO: complete (see release notes Dropbox Paper doc) -->
|
||||
|
||||
## Migrating from v2.x {#migrating}
|
||||
|
|