Update docs [ci skip]
parent 1346ee06d4 · commit c288dba8e7
@@ -49,11 +49,11 @@ contain arbitrary whitespace. Alignment into the original string is preserved.

> assert (doc[0].text, doc[0].head.tag_) == ("An", "NN")
> ```

| Name        | Type        | Description                                                                       |
| ----------- | ----------- | --------------------------------------------------------------------------------- |
| `text`      | str         | The text to be processed.                                                         |
| `disable`   | `List[str]` | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
| **RETURNS** | `Doc`       | A container for accessing the annotations.                                        |
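A minimal usage sketch of calling the `nlp` object directly with the parameters above (the `en_core_web_sm` package name is an assumption for illustration, not part of this excerpt):

```python
# Hypothetical sketch: process a text and disable a pipeline component for this call.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this trained model package is installed
doc = nlp("An example sentence.", disable=["parser"])
print([(token.text, token.tag_) for token in doc])
```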

## Language.pipe {#pipe tag="method"}
@@ -112,14 +112,14 @@ Evaluate a model's pipeline components.

> print(scores)
> ```

| Name                                         | Type                            | Description                                                                            |
| -------------------------------------------- | ------------------------------- | -------------------------------------------------------------------------------------- |
| `examples`                                   | `Iterable[Example]`             | A batch of [`Example`](/api/example) objects to learn from.                            |
| `verbose`                                    | bool                            | Print debugging information.                                                           |
| `batch_size`                                 | int                             | The batch size to use.                                                                 |
| `scorer`                                     | `Scorer`                        | Optional [`Scorer`](/api/scorer) to use. If not passed in, a new one will be created.  |
| `component_cfg` <Tag variant="new">2.1</Tag> | `Dict[str, Dict]`               | Config parameters for specific pipeline components, keyed by component name.          |
| **RETURNS**                                  | `Dict[str, Union[float, Dict]]` | A dictionary of evaluation scores.                                                     |
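A short sketch of how the method might be called, assuming a loaded pipeline `nlp` and reference annotations supplied as `Example` objects (the import path for `Example` has moved between v3 previews, so treat it as an assumption):

```python
# Hypothetical sketch: build a couple of Example objects and evaluate the pipeline on them.
from spacy.training import Example  # may live elsewhere (e.g. spacy.gold) in earlier v3 previews

texts_and_annots = [
    ("Apple is looking at buying U.K. startup", {"entities": [(0, 5, "ORG")]}),
]
examples = []
for text, annots in texts_and_annots:
    examples.append(Example.from_dict(nlp.make_doc(text), annots))

scores = nlp.evaluate(examples, verbose=False)
print(scores)
```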

## Language.begin_training {#begin_training tag="method"}
@@ -418,11 +418,70 @@ available to the loaded object.

## Class attributes {#class-attributes}

| Name       | Type  | Description                                                                                     |
| ---------- | ----- | ----------------------------------------------------------------------------------------------- |
| `Defaults` | class | Settings, data and factory methods for creating the `nlp` object and processing pipeline.      |
| `lang`     | str   | Two-letter language ID, i.e. [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). |

## Defaults {#defaults}

The following attributes can be set on the `Language.Defaults` class to
customize the default language data:

> #### Example
>
> ```python
> from spacy.language import Language
> from spacy.lang.tokenizer_exceptions import URL_MATCH
> from thinc.api import Config
>
> DEFAULT_CONFIG = """
> [nlp.tokenizer]
> @tokenizers = "MyCustomTokenizer.v1"
> """
>
> class Defaults(Language.Defaults):
>     stop_words = set()
>     tokenizer_exceptions = {}
>     prefixes = tuple()
>     suffixes = tuple()
>     infixes = tuple()
>     token_match = None
>     url_match = URL_MATCH
>     lex_attr_getters = {}
>     syntax_iterators = {}
>     writing_system = {"direction": "ltr", "has_case": True, "has_letters": True}
>     config = Config().from_str(DEFAULT_CONFIG)
> ```

| Name                              | Description                                                                                                                                                                                                            |
| --------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `stop_words`                      | List of stop words, used for `Token.is_stop`.<br />**Example:** [`stop_words.py`][stop_words.py]                                                                                                                       |
| `tokenizer_exceptions`            | Tokenizer exception rules, string mapped to list of token attributes.<br />**Example:** [`de/tokenizer_exceptions.py`][de/tokenizer_exceptions.py]                                                                     |
| `prefixes`, `suffixes`, `infixes` | Prefix, suffix and infix rules for the default tokenizer.<br />**Example:** [`punctuation.py`][punctuation.py]                                                                                                         |
| `token_match`                     | Optional regex for matching strings that should never be split, overriding the infix rules.<br />**Example:** [`fr/tokenizer_exceptions.py`][fr/tokenizer_exceptions.py]                                               |
| `url_match`                       | Regular expression for matching URLs. Prefixes and suffixes are removed before applying the match.<br />**Example:** [`tokenizer_exceptions.py`][tokenizer_exceptions.py]                                              |
| `lex_attr_getters`                | Custom functions for setting lexical attributes on tokens, e.g. `like_num`.<br />**Example:** [`lex_attrs.py`][lex_attrs.py]                                                                                           |
| `syntax_iterators`                | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks).<br />**Example:** [`syntax_iterators.py`][syntax_iterators.py] |
| `writing_system`                  | Information about the language's writing system, available via `Vocab.writing_system`. Defaults to `{"direction": "ltr", "has_case": True, "has_letters": True}`.<br />**Example:** [`zh/__init__.py`][zh/__init__.py] |
| `config`                          | Default [config](/usage/training#config) added to `nlp.config`. This can include references to custom tokenizers or lemmatizers.<br />**Example:** [`zh/__init__.py`][zh/__init__.py]                                  |

[stop_words.py]:
  https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py
[tokenizer_exceptions.py]:
  https://github.com/explosion/spaCy/tree/master/spacy/lang/tokenizer_exceptions.py
[de/tokenizer_exceptions.py]:
  https://github.com/explosion/spaCy/tree/master/spacy/lang/de/tokenizer_exceptions.py
[fr/tokenizer_exceptions.py]:
  https://github.com/explosion/spaCy/tree/master/spacy/lang/fr/tokenizer_exceptions.py
[punctuation.py]:
  https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py
[lex_attrs.py]:
  https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py
[syntax_iterators.py]:
  https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py
[zh/__init__.py]:
  https://github.com/explosion/spaCy/tree/master/spacy/lang/zh/__init__.py
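To see how these defaults surface at runtime, here is a small sketch (assuming only spaCy itself is installed; no trained model is needed):

```python
# Minimal sketch: values defined on a Defaults class end up on the nlp object,
# e.g. the writing system info is exposed via Vocab.writing_system.
from spacy.lang.en import English

nlp = English()
print(nlp.vocab.writing_system)   # {'direction': 'ltr', 'has_case': True, 'has_letters': True}
print(len(nlp.Defaults.stop_words) > 0)  # the shipped English defaults include stop words
```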

## Serialization fields {#serialization-fields}
@@ -8,12 +8,10 @@ makes the data easy to update and extend.

The **shared language data** in the directory root includes rules that can be
generalized across languages – for example, rules for basic punctuation, emoji,
emoticons and single-letter abbreviations. The **individual language data** in a
submodule contains rules that are only relevant to a particular language. It
also takes care of putting together all components and creating the `Language`
subclass – for example, `English` or `German`.

> ```python
> from spacy.lang.en import English
@@ -23,27 +21,28 @@ together all components and creating the `Language` subclass – for example,

> nlp_de = German() # Includes German data
> ```

<!-- TODO: upgrade graphic

![Language data architecture](../../images/language_data.svg)

-->

<!-- TODO: remove this table in favor of more specific Language.Defaults docs in linguistic features? -->

| Name                                                                               | Description                                                                                                                                               |
| ---------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Stop words**<br />[`stop_words.py`][stop_words.py]                               | List of most common words of a language that are often useful to filter out, for example "and" or "I". Matching tokens will return `True` for `is_stop`. |
| **Tokenizer exceptions**<br />[`tokenizer_exceptions.py`][tokenizer_exceptions.py] | Special-case rules for the tokenizer, for example, contractions like "can't" and abbreviations with punctuation, like "U.K.".                            |
| **Punctuation rules**<br />[`punctuation.py`][punctuation.py]                      | Regular expressions for splitting tokens, e.g. on punctuation or special characters like emoji. Includes rules for prefixes, suffixes and infixes.       |
| **Character classes**<br />[`char_classes.py`][char_classes.py]                    | Character classes to be used in regular expressions, for example, latin characters, quotes, hyphens or icons.                                            |
| **Lexical attributes**<br />[`lex_attrs.py`][lex_attrs.py]                         | Custom functions for setting lexical attributes on tokens, e.g. `like_num`, which includes language-specific words like "ten" or "hundred".              |
| **Syntax iterators**<br />[`syntax_iterators.py`][syntax_iterators.py]             | Functions that compute views of a `Doc` object based on its syntax. At the moment, only used for [noun chunks](/usage/linguistic-features#noun-chunks).  |
| **Lemmatizer**<br />[`spacy-lookups-data`][spacy-lookups-data]                     | Lemmatization rules or a lookup-based lemmatization table to assign base forms, for example "be" for "was".                                              |

[stop_words.py]:
  https://github.com/explosion/spaCy/tree/master/spacy/lang/en/stop_words.py
[tokenizer_exceptions.py]:
  https://github.com/explosion/spaCy/tree/master/spacy/lang/de/tokenizer_exceptions.py
[punctuation.py]:
  https://github.com/explosion/spaCy/tree/master/spacy/lang/punctuation.py
[char_classes.py]:
@@ -52,8 +51,4 @@ together all components and creating the `Language` subclass – for example,

  https://github.com/explosion/spaCy/tree/master/spacy/lang/en/lex_attrs.py
[syntax_iterators.py]:
  https://github.com/explosion/spaCy/tree/master/spacy/lang/en/syntax_iterators.py
[spacy-lookups-data]: https://github.com/explosion/spacy-lookups-data
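For a quick look at some of this data in action, a small sketch (assuming only spaCy itself is installed; a blank pipeline is enough, no trained model required):

```python
# Minimal sketch: stop words and lexical attributes from the language data are
# applied even by a blank pipeline without a trained model.
from spacy.lang.en import English

nlp = English()
doc = nlp("I like ten pizzas")
print([(token.text, token.is_stop, token.like_num) for token in doc])
```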
@@ -602,7 +602,95 @@ import Tokenization101 from 'usage/101/\_tokenization.md'

<Tokenization101 />

<Accordion title="Algorithm details: How spaCy's tokenizer works" id="how-tokenizer-works" spaced>

spaCy introduces a novel tokenization algorithm that gives a better balance
between performance, ease of definition, and ease of alignment into the original
string.

After consuming a prefix or suffix, we consult the special cases again. We want
the special cases to handle things like "don't" in English, and we want the same
rule to work for "(don't)!". We do this by splitting off the open bracket, then
the exclamation, then the close bracket, and finally matching the special case.
Here's an implementation of the algorithm in Python, optimized for readability
rather than performance:

```python
def tokenizer_pseudo_code(
    text,
    special_cases,
    prefix_search,
    suffix_search,
    infix_finditer,
    token_match,
    url_match
):
    tokens = []
    for substring in text.split():
        suffixes = []
        while substring:
            while prefix_search(substring) or suffix_search(substring):
                if token_match(substring):
                    tokens.append(substring)
                    substring = ""
                    break
                if substring in special_cases:
                    tokens.extend(special_cases[substring])
                    substring = ""
                    break
                if prefix_search(substring):
                    split = prefix_search(substring).end()
                    tokens.append(substring[:split])
                    substring = substring[split:]
                    if substring in special_cases:
                        continue
                if suffix_search(substring):
                    split = suffix_search(substring).start()
                    suffixes.append(substring[split:])
                    substring = substring[:split]
            if token_match(substring):
                tokens.append(substring)
                substring = ""
            elif url_match(substring):
                tokens.append(substring)
                substring = ""
            elif substring in special_cases:
                tokens.extend(special_cases[substring])
                substring = ""
            elif list(infix_finditer(substring)):
                infixes = infix_finditer(substring)
                offset = 0
                for match in infixes:
                    tokens.append(substring[offset : match.start()])
                    tokens.append(substring[match.start() : match.end()])
                    offset = match.end()
                if substring[offset:]:
                    tokens.append(substring[offset:])
                substring = ""
            elif substring:
                tokens.append(substring)
                substring = ""
        tokens.extend(reversed(suffixes))
    return tokens
```

The algorithm can be summarized as follows:

1. Iterate over whitespace-separated substrings.
2. Look for a token match. If there is a match, stop processing and keep this
   token.
3. Check whether we have an explicitly defined special case for this substring.
   If we do, use it.
4. Otherwise, try to consume one prefix. If we consumed a prefix, go back to #2,
   so that the token match and special cases always get priority.
5. If we didn't consume a prefix, try to consume a suffix and then go back to
   #2.
6. If we can't consume a prefix or a suffix, look for a URL match.
7. If there's no URL match, then look for a special case.
8. Look for "infixes" – stuff like hyphens etc. and split the substring into
   tokens on all infixes.
9. Once we can't consume any more of the string, handle it as a single token.

</Accordion>

**Global** and **language-specific** tokenizer data is supplied via the language
data in
@@ -613,15 +701,6 @@ The prefixes, suffixes and infixes mostly define punctuation rules – for

example, when to split off periods (at the end of a sentence), and when to leave
tokens containing periods intact (abbreviations like "U.S.").
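A quick sketch of that behavior (assuming only spaCy itself is installed; a blank English pipeline is enough to exercise the rules):

```python
# Minimal sketch: the sentence-final period is split off, while the period
# inside the abbreviation "U.S." stays attached to its token.
from spacy.lang.en import English

nlp = English()
print([t.text for t in nlp("I live in the U.S. It is big.")])
```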

<Accordion title="Should I change the language data or add custom tokenizer rules?" id="lang-data-vs-tokenizer">

Tokenization rules that are specific to one language, but can be **generalized
@@ -637,6 +716,14 @@ subclass.

---

<!--

### Customizing the tokenizer {#tokenizer-custom}

TODO: rewrite the docs on custom tokenization in a more user-friendly order, including details on how to integrate a fully custom tokenizer, representing a tokenizer in the config etc.

-->

### Adding special case tokenization rules {#special-cases}

Most domains have at least some idiosyncrasies that require custom tokenization
@@ -677,88 +764,6 @@ nlp.tokenizer.add_special_case("...gimme...?", [{"ORTH": "...gimme...?"}])

assert len(nlp("...gimme...?")) == 1
```

#### Debugging the tokenizer {#tokenizer-debug new="2.2.3"}

A working implementation of the pseudo-code above is available for debugging as
@@ -766,6 +771,17 @@ A working implementation of the pseudo-code above is available for debugging as

tuples showing which tokenizer rule or pattern was matched for each token. The
tokens produced are identical to `nlp.tokenizer()` except for whitespace tokens:

> #### Expected output
>
> ```
> " PREFIX
> Let SPECIAL-1
> 's SPECIAL-2
> go TOKEN
> ! SUFFIX
> " SUFFIX
> ```

```python
### {executable="true"}
from spacy.lang.en import English
@@ -777,13 +793,6 @@ tok_exp = nlp.tokenizer.explain(text)

assert [t.text for t in doc if not t.is_space] == [t[1] for t in tok_exp]
for t in tok_exp:
    print(t[1], "\\t", t[0])
```

### Customizing spaCy's Tokenizer class {#native-tokenizers}
@@ -1437,3 +1446,73 @@ print("After:", [sent.text for sent in doc.sents])

import LanguageData101 from 'usage/101/\_language-data.md'

<LanguageData101 />

### Creating a custom language subclass {#language-subclass}

If you want to customize multiple components of the language data or add support
for a custom language or domain-specific "dialect", you can also implement your
own language subclass. The subclass should define two attributes: the `lang`
(unique language code) and the `Defaults` defining the language data. For an
overview of the available attributes that can be overwritten, see the
[`Language.Defaults`](/api/language#defaults) documentation.

```python
### {executable="true"}
from spacy.lang.en import English

class CustomEnglishDefaults(English.Defaults):
    stop_words = set(["custom", "stop"])

class CustomEnglish(English):
    lang = "custom_en"
    Defaults = CustomEnglishDefaults

nlp1 = English()
nlp2 = CustomEnglish()

print(nlp1.lang, [token.is_stop for token in nlp1("custom stop")])
print(nlp2.lang, [token.is_stop for token in nlp2("custom stop")])
```

The [`@spacy.registry.languages`](/api/top-level#registry) decorator lets you
register a custom language class and assign it a string name. This means that
you can call [`spacy.blank`](/api/top-level#spacy.blank) with your custom
language name, and even train models with it and refer to it in your
[training config](/usage/training#config).

> #### Config usage
>
> After registering your custom language class using the `languages` registry,
> you can refer to it in your [training config](/usage/training#config). This
> means spaCy will train your model using the custom subclass.
>
> ```ini
> [nlp]
> lang = "custom_en"
> ```
>
> In order to resolve `"custom_en"` to your subclass, the registered function
> needs to be available during training. You can load a Python file containing
> the code using the `--code` argument:
>
> ```bash
> ### {wrap="true"}
> $ python -m spacy train train.spacy dev.spacy config.cfg --code code.py
> ```

```python
### Registering a custom language {highlight="7,12-13"}
import spacy
from spacy.lang.en import English

class CustomEnglishDefaults(English.Defaults):
    stop_words = set(["custom", "stop"])

@spacy.registry.languages("custom_en")
class CustomEnglish(English):
    lang = "custom_en"
    Defaults = CustomEnglishDefaults

# This now works! 🎉
nlp = spacy.blank("custom_en")
```
@@ -618,7 +618,9 @@ mattis pretium.

[FastAPI](https://fastapi.tiangolo.com/) is a modern high-performance framework
for building REST APIs with Python, based on Python
[type hints](https://fastapi.tiangolo.com/python-types/). It's become a popular
library for serving machine learning models and you can use it in your spaCy
projects to quickly serve up a trained model and make it available behind a REST
API.
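The snippet below is only a rough sketch of the idea (not the fuller example the TODO below refers to); it assumes `fastapi`, `uvicorn` and a trained model package such as `en_core_web_sm` are installed:

```python
# Hypothetical sketch: a tiny FastAPI app that loads a spaCy model once at import
# time and exposes a single endpoint returning entities for a posted text.
from fastapi import FastAPI
from pydantic import BaseModel
import spacy

app = FastAPI()
nlp = spacy.load("en_core_web_sm")  # assumed model package

class Request(BaseModel):
    text: str

@app.post("/entities")
def entities(req: Request):
    doc = nlp(req.text)
    return [{"text": ent.text, "label": ent.label_} for ent in doc.ents]
```

Run it with e.g. `uvicorn main:app --workers 2` (assuming the file is saved as `main.py`) and post a JSON body like `{"text": "Apple is looking at buying U.K. startup"}` to `/entities`.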

```python
# TODO: show an example that addresses some of the main concerns for serving ML (workers etc.)
@@ -74,7 +74,7 @@ When you train a model using the [`spacy train`](/api/cli#train) command, you'll

see a table showing metrics after each pass over the data. Here's what those
metrics mean:

<!-- TODO: update table below and include note about scores in config -->

| Name       | Description                                                                                         |
| ---------- | --------------------------------------------------------------------------------------------------- |
@@ -116,7 +116,7 @@ integrate custom models and architectures, written in your framework of choice.

Some of the main advantages and features of spaCy's training config are:

- **Structured sections.** The config is grouped into sections, and nested
  sections are defined using the `.` notation. For example, `[components.ner]`
  defines the settings for the pipeline's named entity recognizer. The config
  can be loaded as a Python dict.
- **References to registered functions.** Sections can refer to registered
@@ -136,10 +136,8 @@ Some of the main advantages and features of spaCy's training config are:

  Python [type hints](https://docs.python.org/3/library/typing.html) to tell the
  config which types of data to expect.

```ini
https://github.com/explosion/spaCy/blob/develop/spacy/default_config.cfg
```

Under the hood, the config is parsed into a dictionary. It's divided into
@@ -151,11 +149,12 @@ not just define static settings, but also construct objects like architectures,

schedules, optimizers or any other custom components. The main top-level
sections of a config file are:

| Section       | Description                                                                                                           |
| ------------- | -------------------------------------------------------------------------------------------------------------------- |
| `training`    | Settings and controls for the training and evaluation process.                                                       |
| `pretraining` | Optional settings and controls for the [language model pretraining](#pretraining).                                   |
| `nlp`         | Definition of the `nlp` object, its tokenizer and [processing pipeline](/docs/processing-pipelines) component names. |
| `components`  | Definitions of the [pipeline components](/docs/processing-pipelines) and their models.                               |
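As a rough sketch of what "parsed into a dictionary" means in practice (assuming `thinc` is installed, which spaCy depends on), a config string can be loaded and read back like nested dicts; the section names below simply mirror the table above:

```python
# Minimal sketch: parse a config string with thinc's Config and access values
# like a nested dictionary.
from thinc.api import Config

cfg_string = """
[nlp]
lang = "en"

[training]
seed = 0
"""
config = Config().from_str(cfg_string)
print(list(config.keys()))     # top-level sections, e.g. ['nlp', 'training']
print(config["nlp"]["lang"])   # 'en'
```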

<Infobox title="Config format and settings" emoji="📖">
@@ -176,16 +175,16 @@ a consistent format. There are no command-line arguments that need to be set,

and no hidden defaults. However, there can still be scenarios where you may want
to override config settings when you run [`spacy train`](/api/cli#train). This
includes **file paths** to vectors or other resources that shouldn't be
hard-coded in a config file, or **system-dependent settings**.

For cases like this, you can set additional command-line options starting with
`--` that correspond to the config section and value to override. For example,
`--training.batch_size 128` sets the `batch_size` value in the `[training]`
block to `128`.

```bash
$ python -m spacy train train.spacy dev.spacy config.cfg
--training.batch_size 128 --nlp.vectors /path/to/vectors
```

Only existing sections and values in the config can be overwritten. At the end
@@ -14,4 +14,20 @@ menu:

## Backwards Incompatibilities {#incompat}

### Removed deprecated methods, attributes and arguments {#incompat-removed}

The following deprecated methods, attributes and arguments were removed in v3.0.
Most of them have been deprecated for quite a while now and many would
previously raise errors. Many of them were also mostly internals. If you've been
working with more recent versions of spaCy v2.x, it's unlikely that your code
relied on them.

| Class                 | Removed                                                 |
| --------------------- | ------------------------------------------------------- |
| [`Doc`](/api/doc)     | `Doc.tokens_from_list`, `Doc.merge`                     |
| [`Span`](/api/span)   | `Span.merge`, `Span.upper`, `Span.lower`, `Span.string` |
| [`Token`](/api/token) | `Token.string`                                          |
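For instance, code that used the long-deprecated `Doc.merge`/`Span.merge` can switch to the retokenizer context manager, which has been the recommended API since v2.1 (a sketch, assuming a `Doc` with at least two tokens):

```python
# Hypothetical sketch: merge the first two tokens with the retokenizer instead
# of the removed Doc.merge / Span.merge.
import spacy

nlp = spacy.blank("en")
doc = nlp("New York is busy")
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:2])
print([t.text for t in doc])  # ['New York', 'is', 'busy']
```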

<!-- TODO: complete (see release notes Dropbox Paper doc) -->

## Migrating from v2.x {#migrating}