Update docs [ci skip]

This commit is contained in:
Ines Montani 2020-07-25 18:51:12 +02:00
parent 1346ee06d4
commit c288dba8e7
6 changed files with 297 additions and 147 deletions

View File

@ -50,7 +50,7 @@ contain arbitrary whitespace. Alignment into the original string is preserved.
> ```
| Name | Type | Description |
| ----------- | ----- | --------------------------------------------------------------------------------- |
| ----------- | ----------- | --------------------------------------------------------------------------------- |
| `text` | str | The text to be processed. |
| `disable` | `List[str]` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
| **RETURNS** | `Doc` | A container for accessing the annotations. |
@ -113,7 +113,7 @@ Evaluate a model's pipeline components.
> ```
| Name | Type | Description |
| -------------------------------------------- | ------------------- | ------------------------------------------------------------------------------------- |
| -------------------------------------------- | ------------------------------- | ------------------------------------------------------------------------------------- |
| `examples` | `Iterable[Example]` | A batch of [`Example`](/api/example) objects to learn from. |
| `verbose` | bool | Print debugging information. |
| `batch_size` | int | The batch size to use. |
@ -419,10 +419,69 @@ available to the loaded object.
## Class attributes {#class-attributes}
| Name | Type | Description |
| -------------------------------------- | ----- | ----------------------------------------------------------------------------------------------------------------------------------- |
| ---------- | ----- | ----------------------------------------------------------------------------------------------- |
| `Defaults` | class | Settings, data and factory methods for creating the `nlp` object and processing pipeline. |
| `lang` | str | Two-letter language ID, i.e. [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). |
| `factories` <Tag variant="new">2</Tag> | dict | Factories that create pre-defined pipeline components, e.g. the tagger, parser or entity recognizer, keyed by their component name. |
## Defaults {#defaults}
The following attributes can be set on the `Language.Defaults` class to
customize the default language data:
> #### Example
>
> ```python
> from spacy.language import language
> from spacy.lang.tokenizer_exceptions import URL_MATCH
> from thinc.api import Config
>
> DEFAULT_CONFIFG = """
> [nlp.tokenizer]
> @tokenizers = "MyCustomTokenizer.v1"
> """
>
> class Defaults(Language.Defaults):
> stop_words = set()
> tokenizer_exceptions = {}
> prefixes = tuple()
> suffixes = tuple()
> infixes = tuple()
> token_match = None
> url_match = URL_MATCH
> lex_attr_getters = {}
> syntax_iterators = {}
> writing_system = {"direction": "ltr", "has_case": True, "has_letters": True}
> config = Config().from_str(DEFAULT_CONFIG)
> ```
| Name | Description |
| --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `stop_words` | List of stop words, used for `Token.is_stop`.<br />**Example:** [`stop_words.py`][stop_words.py] |
| `tokenizer_exceptions` | Tokenizer exception rules, string mapped to list of token attributes.<br />**Example:** [`de/tokenizer_exceptions.py`][de/tokenizer_exceptions.py] |
| `prefixes`, `suffixes`, `infixes` | Prefix, suffix and infix rules for the default tokenizer.<br />**Example:** [`puncutation.py`][punctuation.py] |
| `token_match` | Optional regex for matching strings that should never be split, overriding the infix rules.<br />**Example:** [`fr/tokenizer_exceptions.py`][fr/tokenizer_exceptions.py] |
| `url_match` | Regular expression for matching URLs. Prefixes and suffixes are removed before applying the match.<br />**Example:** [`tokenizer_exceptions.py`][tokenizer_exceptions.py] |
| `lex_attr_getters` | Custom functions for setting lexical attributes on tokens, e.g. `like_num`.<br />**Example:** [`lex_attrs.py`][lex_attrs.py] |
| `syntax_iterators` | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks).<br />**Example:** [`syntax_iterators.py`][syntax_iterators.py]. |
| `writing_system` | Information about the language's writing system, available via `Vocab.writing_system`. Defaults to: `{"direction": "ltr", "has_case": True, "has_letters": True}.`.<br />**Example:** [`zh/__init__.py`][zh/__init__.py] |
| `config` | Default [config](/usage/training#config) added to `nlp.config`. This can include references to custom tokenizers or lemmatizers.<br />**Example:** [`zh/__init__.py`][zh/__init__.py] |
[stop_words.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py
[tokenizer_exceptions.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/tokenizer_exceptions.py
[de/tokenizer_exceptions.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/de/tokenizer_exceptions.py
[fr/tokenizer_exceptions.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/fr/tokenizer_exceptions.py
[punctuation.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py
[lex_attrs.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py
[syntax_iterators.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py
[zh/__init__.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/zh/__init__.py
## Serialization fields {#serialization-fields}

View File

@ -8,12 +8,10 @@ makes the data easy to update and extend.
The **shared language data** in the directory root includes rules that can be
generalized across languages for example, rules for basic punctuation, emoji,
emoticons, single-letter abbreviations and norms for equivalent tokens with
different spellings, like `"` and `”`. This helps the models make more accurate
predictions. The **individual language data** in a submodule contains rules that
are only relevant to a particular language. It also takes care of putting
together all components and creating the `Language` subclass for example,
`English` or `German`.
emoticons and single-letter abbreviations. The **individual language data** in a
submodule contains rules that are only relevant to a particular language. It
also takes care of putting together all components and creating the `Language`
subclass for example, `English` or `German`.
> ```python
> from spacy.lang.en import English
@ -23,27 +21,28 @@ together all components and creating the `Language` subclass for example,
> nlp_de = German() # Includes German data
> ```
<!-- TODO: upgrade graphic
![Language data architecture](../../images/language_data.svg)
-->
<!-- TODO: remove this table in favor of more specific Language.Defaults docs in linguistic features? -->
| Name | Description |
| ---------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Stop words**<br />[`stop_words.py`][stop_words.py] | List of most common words of a language that are often useful to filter out, for example "and" or "I". Matching tokens will return `True` for `is_stop`. |
| **Tokenizer exceptions**<br />[`tokenizer_exceptions.py`][tokenizer_exceptions.py] | Special-case rules for the tokenizer, for example, contractions like "can't" and abbreviations with punctuation, like "U.K.". |
| **Norm exceptions**<br />[`norm_exceptions.py`][norm_exceptions.py] | Special-case rules for normalizing tokens to improve the model's predictions, for example on American vs. British spelling. |
| **Punctuation rules**<br />[`punctuation.py`][punctuation.py] | Regular expressions for splitting tokens, e.g. on punctuation or special characters like emoji. Includes rules for prefixes, suffixes and infixes. |
| **Character classes**<br />[`char_classes.py`][char_classes.py] | Character classes to be used in regular expressions, for example, latin characters, quotes, hyphens or icons. |
| **Lexical attributes**<br />[`lex_attrs.py`][lex_attrs.py] | Custom functions for setting lexical attributes on tokens, e.g. `like_num`, which includes language-specific words like "ten" or "hundred". |
| **Syntax iterators**<br />[`syntax_iterators.py`][syntax_iterators.py] | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks). |
| **Tag map**<br />[`tag_map.py`][tag_map.py] | Dictionary mapping strings in your tag set to [Universal Dependencies](http://universaldependencies.org/u/pos/all.html) tags. |
| **Morph rules**<br />[`morph_rules.py`][morph_rules.py] | Exception rules for morphological analysis of irregular words like personal pronouns. |
| **Lemmatizer**<br />[`spacy-lookups-data`][spacy-lookups-data] | Lemmatization rules or a lookup-based lemmatization table to assign base forms, for example "be" for "was". |
[stop_words.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py
[tokenizer_exceptions.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/de/tokenizer_exceptions.py
[norm_exceptions.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/norm_exceptions.py
[punctuation.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py
[char_classes.py]:
@ -52,8 +51,4 @@ together all components and creating the `Language` subclass for example,
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py
[syntax_iterators.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py
[tag_map.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/tag_map.py
[morph_rules.py]:
https://github.com/explosion/spaCy/tree/master/spacy/lang/en/morph_rules.py
[spacy-lookups-data]: https://github.com/explosion/spacy-lookups-data

View File

@ -602,7 +602,95 @@ import Tokenization101 from 'usage/101/\_tokenization.md'
<Tokenization101 />
### Tokenizer data {#101-data}
<Accordion title="Algorithm details: How spaCy's tokenizer works" id="how-tokenizer-works" spaced>
spaCy introduces a novel tokenization algorithm, that gives a better balance
between performance, ease of definition, and ease of alignment into the original
string.
After consuming a prefix or suffix, we consult the special cases again. We want
the special cases to handle things like "don't" in English, and we want the same
rule to work for "(don't)!". We do this by splitting off the open bracket, then
the exclamation, then the close bracket, and finally matching the special case.
Here's an implementation of the algorithm in Python, optimized for readability
rather than performance:
```python
def tokenizer_pseudo_code(
special_cases,
prefix_search,
suffix_search,
infix_finditer,
token_match,
url_match
):
tokens = []
for substring in text.split():
suffixes = []
while substring:
while prefix_search(substring) or suffix_search(substring):
if token_match(substring):
tokens.append(substring)
substring = ""
break
if substring in special_cases:
tokens.extend(special_cases[substring])
substring = ""
break
if prefix_search(substring):
split = prefix_search(substring).end()
tokens.append(substring[:split])
substring = substring[split:]
if substring in special_cases:
continue
if suffix_search(substring):
split = suffix_search(substring).start()
suffixes.append(substring[split:])
substring = substring[:split]
if token_match(substring):
tokens.append(substring)
substring = ""
elif url_match(substring):
tokens.append(substring)
substring = ""
elif substring in special_cases:
tokens.extend(special_cases[substring])
substring = ""
elif list(infix_finditer(substring)):
infixes = infix_finditer(substring)
offset = 0
for match in infixes:
tokens.append(substring[offset : match.start()])
tokens.append(substring[match.start() : match.end()])
offset = match.end()
if substring[offset:]:
tokens.append(substring[offset:])
substring = ""
elif substring:
tokens.append(substring)
substring = ""
tokens.extend(reversed(suffixes))
return tokens
```
The algorithm can be summarized as follows:
1. Iterate over whitespace-separated substrings.
2. Look for a token match. If there is a match, stop processing and keep this
token.
3. Check whether we have an explicitly defined special case for this substring.
If we do, use it.
4. Otherwise, try to consume one prefix. If we consumed a prefix, go back to #2,
so that the token match and special cases always get priority.
5. If we didn't consume a prefix, try to consume a suffix and then go back to
#2.
6. If we can't consume a prefix or a suffix, look for a URL match.
7. If there's no URL match, then look for a special case.
8. Look for "infixes" — stuff like hyphens etc. and split the substring into
tokens on all infixes.
9. Once we can't consume any more of the string, handle it as a single token.
</Accordion>
**Global** and **language-specific** tokenizer data is supplied via the language
data in
@ -613,15 +701,6 @@ The prefixes, suffixes and infixes mostly define punctuation rules for
example, when to split off periods (at the end of a sentence), and when to leave
tokens containing periods intact (abbreviations like "U.S.").
![Language data architecture](../images/language_data.svg)
<Infobox title="Language data" emoji="📖">
For more details on the language-specific data, see the usage guide on
[adding languages](/usage/adding-languages).
</Infobox>
<Accordion title="Should I change the language data or add custom tokenizer rules?" id="lang-data-vs-tokenizer">
Tokenization rules that are specific to one language, but can be **generalized
@ -637,6 +716,14 @@ subclass.
---
<!--
### Customizing the tokenizer {#tokenizer-custom}
TODO: rewrite the docs on custom tokenization in a more user-friendly order, including details on how to integrate a fully custom tokenizer, representing a tokenizer in the config etc.
-->
### Adding special case tokenization rules {#special-cases}
Most domains have at least some idiosyncrasies that require custom tokenization
@ -677,88 +764,6 @@ nlp.tokenizer.add_special_case("...gimme...?", [{"ORTH": "...gimme...?"}])
assert len(nlp("...gimme...?")) == 1
```
### How spaCy's tokenizer works {#how-tokenizer-works}
spaCy introduces a novel tokenization algorithm, that gives a better balance
between performance, ease of definition, and ease of alignment into the original
string.
After consuming a prefix or suffix, we consult the special cases again. We want
the special cases to handle things like "don't" in English, and we want the same
rule to work for "(don't)!". We do this by splitting off the open bracket, then
the exclamation, then the close bracket, and finally matching the special case.
Here's an implementation of the algorithm in Python, optimized for readability
rather than performance:
```python
def tokenizer_pseudo_code(self, special_cases, prefix_search, suffix_search,
infix_finditer, token_match, url_match):
tokens = []
for substring in text.split():
suffixes = []
while substring:
while prefix_search(substring) or suffix_search(substring):
if token_match(substring):
tokens.append(substring)
substring = ''
break
if substring in special_cases:
tokens.extend(special_cases[substring])
substring = ''
break
if prefix_search(substring):
split = prefix_search(substring).end()
tokens.append(substring[:split])
substring = substring[split:]
if substring in special_cases:
continue
if suffix_search(substring):
split = suffix_search(substring).start()
suffixes.append(substring[split:])
substring = substring[:split]
if token_match(substring):
tokens.append(substring)
substring = ''
elif url_match(substring):
tokens.append(substring)
substring = ''
elif substring in special_cases:
tokens.extend(special_cases[substring])
substring = ''
elif list(infix_finditer(substring)):
infixes = infix_finditer(substring)
offset = 0
for match in infixes:
tokens.append(substring[offset : match.start()])
tokens.append(substring[match.start() : match.end()])
offset = match.end()
if substring[offset:]:
tokens.append(substring[offset:])
substring = ''
elif substring:
tokens.append(substring)
substring = ''
tokens.extend(reversed(suffixes))
return tokens
```
The algorithm can be summarized as follows:
1. Iterate over whitespace-separated substrings.
2. Look for a token match. If there is a match, stop processing and keep this
token.
3. Check whether we have an explicitly defined special case for this substring.
If we do, use it.
4. Otherwise, try to consume one prefix. If we consumed a prefix, go back to #2,
so that the token match and special cases always get priority.
5. If we didn't consume a prefix, try to consume a suffix and then go back to
#2.
6. If we can't consume a prefix or a suffix, look for a URL match.
7. If there's no URL match, then look for a special case.
8. Look for "infixes" — stuff like hyphens etc. and split the substring into
tokens on all infixes.
9. Once we can't consume any more of the string, handle it as a single token.
#### Debugging the tokenizer {#tokenizer-debug new="2.2.3"}
A working implementation of the pseudo-code above is available for debugging as
@ -766,6 +771,17 @@ A working implementation of the pseudo-code above is available for debugging as
tuples showing which tokenizer rule or pattern was matched for each token. The
tokens produced are identical to `nlp.tokenizer()` except for whitespace tokens:
> #### Expected output
>
> ```
> " PREFIX
> Let SPECIAL-1
> 's SPECIAL-2
> go TOKEN
> ! SUFFIX
> " SUFFIX
> ```
```python
### {executable="true"}
from spacy.lang.en import English
@ -777,13 +793,6 @@ tok_exp = nlp.tokenizer.explain(text)
assert [t.text for t in doc if not t.is_space] == [t[1] for t in tok_exp]
for t in tok_exp:
print(t[1], "\\t", t[0])
# " PREFIX
# Let SPECIAL-1
# 's SPECIAL-2
# go TOKEN
# ! SUFFIX
# " SUFFIX
```
### Customizing spaCy's Tokenizer class {#native-tokenizers}
@ -1437,3 +1446,73 @@ print("After:", [sent.text for sent in doc.sents])
import LanguageData101 from 'usage/101/\_language-data.md'
<LanguageData101 />
### Creating a custom language subclass {#language-subclass}
If you want to customize multiple components of the language data or add support
for a custom language or domain-specific "dialect", you can also implement your
own language subclass. The subclass should define two attributes: the `lang`
(unique language code) and the `Defaults` defining the language data. For an
overview of the available attributes that can be overwritten, see the
[`Language.Defaults`](/api/language#defaults) documentation.
```python
### {executable="true"}
from spacy.lang.en import English
class CustomEnglishDefaults(English.Defaults):
stop_words = set(["custom", "stop"])
class CustomEnglish(English):
lang = "custom_en"
Defaults = CustomEnglishDefaults
nlp1 = English()
nlp2 = CustomEnglish()
print(nlp1.lang, [token.is_stop for token in nlp1("custom stop")])
print(nlp2.lang, [token.is_stop for token in nlp2("custom stop")])
```
The [`@spacy.registry.languages`](/api/top-level#registry) decorator lets you
register a custom language class and assign it a string name. This means that
you can call [`spacy.blank`](/api/top-level#spacy.blank) with your custom
language name, and even train models with it and refer to it in your
[training config](/usage/training#config).
> #### Config usage
>
> After registering your custom language class using the `languages` registry,
> you can refer to it in your [training config](/usage/training#config). This
> means spaCy will train your model using the custom subclass.
>
> ```ini
> [nlp]
> lang = "custom_en"
> ```
>
> In order to resolve `"custom_en"` to your subclass, the registered function
> needs to be available during training. You can load a Python file containing
> the code using the `--code` argument:
>
> ```bash
> ### {wrap="true"}
> $ python -m spacy train train.spacy dev.spacy config.cfg --code code.py
> ```
```python
### Registering a custom language {highlight="7,12-13"}
import spacy
from spacy.lang.en import English
class CustomEnglishDefaults(English.Defaults):
stop_words = set(["custom", "stop"])
@spacy.registry.languages("custom_en")
class CustomEnglish(English):
lang = "custom_en"
Defaults = CustomEnglishDefaults
# This now works! 🎉
nlp = spacy.blank("custom_en")
```

View File

@ -618,7 +618,9 @@ mattis pretium.
[FastAPI](https://fastapi.tiangolo.com/) is a modern high-performance framework
for building REST APIs with Python, based on Python
[type hints](https://fastapi.tiangolo.com/python-types/). It's become a popular
library for serving machine learning models and
library for serving machine learning models and you can use it in your spaCy
projects to quickly serve up a trained model and make it available behind a REST
API.
```python
# TODO: show an example that addresses some of the main concerns for serving ML (workers etc.)

View File

@ -74,7 +74,7 @@ When you train a model using the [`spacy train`](/api/cli#train) command, you'll
see a table showing metrics after each pass over the data. Here's what those
metrics means:
<!-- TODO: update table below with updated metrics if needed -->
<!-- TODO: update table below and include note about scores in config -->
| Name | Description |
| ---------- | ------------------------------------------------------------------------------------------------- |
@ -116,7 +116,7 @@ integrate custom models and architectures, written in your framework of choice.
Some of the main advantages and features of spaCy's training config are:
- **Structured sections.** The config is grouped into sections, and nested
sections are defined using the `.` notation. For example, `[nlp.pipeline.ner]`
sections are defined using the `.` notation. For example, `[components.ner]`
defines the settings for the pipeline's named entity recognizer. The config
can be loaded as a Python dict.
- **References to registered functions.** Sections can refer to registered
@ -136,10 +136,8 @@ Some of the main advantages and features of spaCy's training config are:
Python [type hints](https://docs.python.org/3/library/typing.html) to tell the
config which types of data to expect.
<!-- TODO: update this config? -->
```ini
https://github.com/explosion/spaCy/blob/develop/examples/experiments/onto-joint/defaults.cfg
https://github.com/explosion/spaCy/blob/develop/spacy/default_config.cfg
```
Under the hood, the config is parsed into a dictionary. It's divided into
@ -152,10 +150,11 @@ schedules, optimizers or any other custom components. The main top-level
sections of a config file are:
| Section | Description |
| ------------- | ----------------------------------------------------------------------------------------------------- |
| ------------- | -------------------------------------------------------------------------------------------------------------------- |
| `training` | Settings and controls for the training and evaluation process. |
| `pretraining` | Optional settings and controls for the [language model pretraining](#pretraining). |
| `nlp` | Definition of the [processing pipeline](/docs/processing-pipelines), its components and their models. |
| `nlp` | Definition of the `nlp` object, its tokenizer and [processing pipeline](/docs/processing-pipelines) component names. |
| `components` | Definitions of the [pipeline components](/docs/processing-pipelines) and their models. |
<Infobox title="Config format and settings" emoji="📖">
@ -176,16 +175,16 @@ a consistent format. There are no command-line arguments that need to be set,
and no hidden defaults. However, there can still be scenarios where you may want
to override config settings when you run [`spacy train`](/api/cli#train). This
includes **file paths** to vectors or other resources that shouldn't be
hard-code in a config file, or **system-dependent settings** like the GPU ID.
hard-code in a config file, or **system-dependent settings**.
For cases like this, you can set additional command-line options starting with
`--` that correspond to the config section and value to override. For example,
`--training.use_gpu 1` sets the `use_gpu` value in the `[training]` block to
`1`.
`--training.batch_size 128` sets the `batch_size` value in the `[training]`
block to `128`.
```bash
$ python -m spacy train train.spacy dev.spacy config.cfg
--training.use_gpu 1 --nlp.vectors /path/to/vectors
--training.batch_size 128 --nlp.vectors /path/to/vectors
```
Only existing sections and values in the config can be overwritten. At the end

View File

@ -14,4 +14,20 @@ menu:
## Backwards Incompatibilities {#incompat}
### Removed deprecated methods, attributes and arguments {#incompat-removed}
The following deprecated methods, attributes and arguments were removed in v3.0.
Most of them have been deprecated for quite a while now and many would
previously raise errors. Many of them were also mostly internals. If you've been
working with more recent versions of spaCy v2.x, it's unlikely that your code
relied on them.
| Class | Removed |
| --------------------- | ------------------------------------------------------- |
| [`Doc`](/api/doc) | `Doc.tokens_from_list`, `Doc.merge` |
| [`Span`](/api/span) | `Span.merge`, `Span.upper`, `Span.lower`, `Span.string` |
| [`Token`](/api/token) | `Token.string` |
<!-- TODO: complete (see release notes Dropbox Paper doc) -->
## Migrating from v2.x {#migrating}