Mirror of https://github.com/explosion/spaCy.git (synced 2025-11-04 01:48:04 +03:00)
Documentation updates for v2.3.0 (#5593)

* Update website models for v2.3.0
* Add docs for Chinese word segmentation
* Tighten up Chinese docs section
* Merge branch 'master' into docs/v2.3.0 [ci skip]
* Merge branch 'master' into docs/v2.3.0 [ci skip]
* Auto-format and update version
* Update matcher.md
* Update languages and sorting
* Typo in landing page
* Infobox about token_match behavior
* Add meta and basic docs for Japanese
* POS -> TAG in models table
* Add info about lookups for normalization
* Updates to API docs for v2.3
* Update adding norm exceptions for adding languages
* Add --omit-extra-lookups to CLI API docs
* Add initial draft of "What's New in v2.3"
* Add new in v2.3 tags to Chinese and Japanese sections
* Add tokenizer to migration section
* Add new in v2.3 flags to init-model
* Typo
* More what's new in v2.3

Co-authored-by: Ines Montani <ines@ines.io>
This commit is contained in:

parent 7ff447c5a0
commit d5110ffbf2

README.md (17 lines changed)
@@ -6,12 +6,12 @@ spaCy is a library for advanced Natural Language Processing in Python and
 Cython. It's built on the very latest research, and was designed from day one to
 be used in real products. spaCy comes with
 [pretrained statistical models](https://spacy.io/models) and word vectors, and
-currently supports tokenization for **50+ languages**. It features
+currently supports tokenization for **60+ languages**. It features
 state-of-the-art speed, convolutional **neural network models** for tagging,
 parsing and **named entity recognition** and easy **deep learning** integration.
 It's commercial open-source software, released under the MIT license.
 
-💫 **Version 2.2 out now!**
+💫 **Version 2.3 out now!**
 [Check out the release notes here.](https://github.com/explosion/spaCy/releases)
 
 [Azure Pipelines build badge](https://dev.azure.com/explosion-ai/public/_build?definitionId=8)
@@ -32,7 +32,7 @@ It's commercial open-source software, released under the MIT license.
 | --------------- | -------------------------------------------------------------- |
 | [spaCy 101]     | New to spaCy? Here's everything you need to know!              |
 | [Usage Guides]  | How to use spaCy and its features.                             |
-| [New in v2.2]   | New features, backwards incompatibilities and migration guide. |
+| [New in v2.3]   | New features, backwards incompatibilities and migration guide. |
 | [API Reference] | The detailed reference for spaCy's API.                        |
 | [Models]        | Download statistical language models for spaCy.                |
 | [Universe]      | Libraries, extensions, demos, books and courses.               |
@@ -40,7 +40,7 @@ It's commercial open-source software, released under the MIT license.
 | [Contribute]    | How to contribute to the spaCy project and code base.          |
 
 [spacy 101]: https://spacy.io/usage/spacy-101
-[new in v2.2]: https://spacy.io/usage/v2-2
+[new in v2.3]: https://spacy.io/usage/v2-3
 [usage guides]: https://spacy.io/usage/
 [api reference]: https://spacy.io/api/
 [models]: https://spacy.io/models
@@ -113,12 +113,13 @@ of `v2.0.13`).
 pip install spacy
 ```
 
-To install additional data tables for lemmatization in **spaCy v2.2+** you can
-run `pip install spacy[lookups]` or install
+To install additional data tables for lemmatization and normalization in
+**spaCy v2.2+** you can run `pip install spacy[lookups]` or install
 [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
 separately. The lookups package is needed to create blank models with
-lemmatization data, and to lemmatize in languages that don't yet come with
-pretrained models and aren't powered by third-party libraries.
+lemmatization data for v2.2+ plus normalization data for v2.3+, and to
+lemmatize in languages that don't yet come with pretrained models and aren't
+powered by third-party libraries.
 
 When using pip it is generally recommended to install packages in a virtual
 environment to avoid modifying system state:
@@ -542,7 +542,7 @@ $ python -m spacy init-model [lang] [output_dir] [--jsonl-loc] [--vectors-loc]
 ```
 
 | Argument                                                    | Type       | Description                                                                                                  |
-| ------------------------------------------------------- | ---------- | ------------------------------------------------------------------------------------------------------------ |
+| ----------------------------------------------------------- | ---------- | ------------------------------------------------------------------------------------------------------------ |
 | `lang`                                                      | positional | Model language [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes), e.g. `en`.                 |
 | `output_dir`                                                | positional | Model output directory. Will be created if it doesn't exist.                                                 |
 | `--jsonl-loc`, `-j`                                         | option     | Optional location of JSONL-formatted [vocabulary file](/api/annotation#vocab-jsonl) with lexical attributes. |
@@ -550,6 +550,7 @@ $ python -m spacy init-model [lang] [output_dir] [--jsonl-loc] [--vectors-loc]
 | `--truncate-vectors`, `-t` <Tag variant="new">2.3</Tag>     | option     | Number of vectors to truncate to when reading in vectors file. Defaults to `0` for no truncation.                     |
 | `--prune-vectors`, `-V`                                     | option     | Number of vectors to prune the vocabulary to. Defaults to `-1` for no pruning.                                        |
 | `--vectors-name`, `-vn`                                     | option     | Name to assign to the word vectors in the `meta.json`, e.g. `en_core_web_md.vectors`.                                 |
+| `--omit-extra-lookups`, `-OEL` <Tag variant="new">2.3</Tag> | flag       | Do not include any of the extra lookups tables (`cluster`/`prob`/`sentiment`) from `spacy-lookups-data` in the model. |
 | **CREATES**                                                 | model      | A spaCy model containing the vocab and vectors.                                                                       |
 
 ## Evaluate {#evaluate new="2"}
@@ -171,9 +171,6 @@ struct.
 | `shape`     | <Abbr title="uint64_t">`attr_t`</Abbr>  | Transform of the lexeme's string, to show orthographic features.                                           |
 | `prefix`    | <Abbr title="uint64_t">`attr_t`</Abbr>  | Length-N substring from the start of the lexeme. Defaults to `N=1`.                                        |
 | `suffix`    | <Abbr title="uint64_t">`attr_t`</Abbr>  | Length-N substring from the end of the lexeme. Defaults to `N=3`.                                          |
-| `cluster`   | <Abbr title="uint64_t">`attr_t`</Abbr>  | Brown cluster ID.                                                                                          |
-| `prob`      | `float`                                 | Smoothed log probability estimate of the lexeme's word type (context-independent entry in the vocabulary). |
-| `sentiment` | `float`                                 | A scalar value indicating positivity or negativity.                                                        |
 
 ### Lexeme.get_struct_attr {#lexeme_get_struct_attr tag="staticmethod, nogil" source="spacy/lexeme.pxd"}
 
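For context, a hedged sketch of where these values live after this change: the attributes are still readable from Python in v2.3, but they're backed by the optional `lexeme_cluster`/`lexeme_prob`/`lexeme_sentiment` tables from `spacy-lookups-data` rather than the core struct, and fall back to defaults when those tables aren't installed.

```python
import spacy

nlp = spacy.blank("en")
lex = nlp.vocab["apple"]
# In v2.3 these are served by the optional lookups tables from
# spacy-lookups-data; without them they fall back to defaults
# (e.g. prob falls back to the vocab's oov_prob).
print(lex.cluster, lex.prob, lex.sentiment)
```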
@@ -22,6 +22,7 @@ missing – the gradient for those labels will be zero.
 | `entities`  | iterable    | A sequence of named entity annotations, either as BILUO tag strings, or as `(start_char, end_char, label)` tuples, representing the entity positions. If BILUO tag strings, you can specify missing values by setting the tag to None. |
 | `cats`      | dict        | Labels for text classification. Each key in the dictionary is a string label for the category and each value is `1.0` (positive) or `0.0` (negative).                                                                                  |
 | `links`     | dict        | Labels for entity linking. A dict with `(start_char, end_char)` keys, and the values being dicts with `kb_id:value` entries, representing external KB IDs mapped to either `1.0` (positive) or `0.0` (negative).                       |
+| `make_projective` | bool  | Whether to projectivize the dependency tree. Defaults to `False`.                                                                                                                                                                      |
 | **RETURNS** | `GoldParse` | The newly constructed object.                                                                                                                                                                                                          |
 
 ## GoldParse.\_\_len\_\_ {#len tag="method"}
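For context, a minimal sketch of the new flag on a blank English pipeline; the heads and dependency labels here are invented for illustration.

```python
import spacy
from spacy.gold import GoldParse

nlp = spacy.blank("en")
doc = nlp.make_doc("She ate the pizza")
# make_projective=True projectivizes the (here invented) dependency tree
gold = GoldParse(doc, heads=[1, 1, 3, 1], deps=["nsubj", "ROOT", "det", "dobj"],
                 make_projective=True)
print(gold.heads)
```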
@@ -156,7 +156,7 @@ The L2 norm of the lexeme's vector representation.
 | `like_url`                                   | bool    | Does the lexeme resemble a URL?                                    |
 | `like_num`                                   | bool    | Does the lexeme represent a number? e.g. "10.9", "10", "ten", etc. |
 | `like_email`                                 | bool    | Does the lexeme resemble an email address?                         |
-| `is_oov`                                     | bool    | Is the lexeme out-of-vocabulary?                                   |
+| `is_oov`                                     | bool    | Does the lexeme have a word vector?                                |
 | `is_stop`                                    | bool    | Is the lexeme part of a "stop list"?                               |
 | `lang`                                       | int     | Language of the parent vocabulary.                                 |
 | `lang_`                                      | unicode | Language of the parent vocabulary.                                 |
@@ -40,7 +40,8 @@ string where an integer is expected) or unexpected property names.
 
 ## Matcher.\_\_call\_\_ {#call tag="method"}
 
-Find all token sequences matching the supplied patterns on the `Doc`.
+Find all token sequences matching the supplied patterns on the `Doc`. As of
+spaCy v2.3, the `Matcher` can also be called on `Span` objects.
 
 > #### Example
 >
@@ -55,8 +56,8 @@ Find all token sequences matching the supplied patterns on the `Doc`.
 > ```
 
 | Name        | Type         | Description                                                                                                                                                              |
-| ----------- | ----- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `doc`       | `Doc` | The document to match over.                                                                                                                                              |
+| ----------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `doclike`   | `Doc`/`Span` | The document to match over or a `Span` (as of v2.3).                                                                                                                     |
 | **RETURNS** | list         | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern. |
 
 <Infobox title="Important note" variant="warning">
 
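For context, a minimal sketch of the `Span` support described in this hunk, using the v2.x `Matcher.add(key, on_match, *patterns)` signature and assuming `en_core_web_sm` is installed.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# v2.x signature: matcher.add(key, on_match_callback, *patterns)
matcher.add("GREETING", None, [{"LOWER": "hello"}, {"LOWER": "world"}])

doc = nlp("She said hello world and left.")
matches = matcher(doc)            # the usual Doc-based call
span_matches = matcher(doc[1:5])  # new in v2.3: a Span works too
print(len(matches), len(span_matches))
```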
@@ -42,7 +42,7 @@ Initialize the sentencizer.
 
 | Name          | Type          | Description                                                                                             |
 | ------------- | ------------- | ------------------------------------------------------------------------------------------------------ |
-| `punct_chars` | list          | Optional custom list of punctuation characters that mark sentence ends. Defaults to `[".", "!", "?"]`. |
+| `punct_chars` | list          | Optional custom list of punctuation characters that mark sentence ends. Defaults to `['!', '.', '?', '։', '؟', '۔', '܀', '܁', '܂', '߹', '।', '॥', '၊', '။', '።', '፧', '፨', '᙮', '᜵', '᜶', '᠃', '᠉', '᥄', '᥅', '᪨', '᪩', '᪪', '᪫', '᭚', '᭛', '᭞', '᭟', '᰻', '᰼', '᱾', '᱿', '‼', '‽', '⁇', '⁈', '⁉', '⸮', '⸼', '꓿', '꘎', '꘏', '꛳', '꛷', '꡶', '꡷', '꣎', '꣏', '꤯', '꧈', '꧉', '꩝', '꩞', '꩟', '꫰', '꫱', '꯫', '﹒', '﹖', '﹗', '！', '．', '？', '𐩖', '𐩗', '𑁇', '𑁈', '𑂾', '𑂿', '𑃀', '𑃁', '𑅁', '𑅂', '𑅃', '𑇅', '𑇆', '𑇍', '𑇞', '𑇟', '𑈸', '𑈹', '𑈻', '𑈼', '𑊩', '𑑋', '𑑌', '𑗂', '𑗃', '𑗉', '𑗊', '𑗋', '𑗌', '𑗍', '𑗎', '𑗏', '𑗐', '𑗑', '𑗒', '𑗓', '𑗔', '𑗕', '𑗖', '𑗗', '𑙁', '𑙂', '𑜼', '𑜽', '𑜾', '𑩂', '𑩃', '𑪛', '𑪜', '𑱁', '𑱂', '𖩮', '𖩯', '𖫵', '𖬷', '𖬸', '𖭄', '𛲟', '𝪈', '。', '。']`. |
 | **RETURNS**   | `Sentencizer` | The newly constructed object.                                                                           |
 
 ## Sentencizer.\_\_call\_\_ {#call tag="method"}
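For context, a small sketch of overriding `punct_chars` rather than relying on the expanded default list; a blank `English` pipeline is assumed.

```python
from spacy.lang.en import English
from spacy.pipeline import Sentencizer

nlp = English()
# override the (much longer) default punctuation list shown above
sentencizer = Sentencizer(punct_chars=[".", "?", "!", "。"])
nlp.add_pipe(sentencizer)
doc = nlp("This is a sentence。This is another one.")
print([sent.text for sent in doc.sents])
```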
@@ -459,7 +459,7 @@ The L2 norm of the token's vector representation.
 | `like_url`                                   | bool         | Does the token resemble a URL?                                                                                 |
 | `like_num`                                   | bool         | Does the token represent a number? e.g. "10.9", "10", "ten", etc.                                              |
 | `like_email`                                 | bool         | Does the token resemble an email address?                                                                      |
-| `is_oov`                                     | bool         | Is the token out-of-vocabulary?                                                                                |
+| `is_oov`                                     | bool         | Does the token have a word vector?                                                                             |
 | `is_stop`                                    | bool         | Is the token part of a "stop list"?                                                                            |
 | `pos`                                        | int          | Coarse-grained part-of-speech from the [Universal POS tag set](https://universaldependencies.org/docs/u/pos/). |
 | `pos_`                                       | unicode      | Coarse-grained part-of-speech from the [Universal POS tag set](https://universaldependencies.org/docs/u/pos/). |
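For context, a quick way to inspect the reworded `is_oov` side by side with `has_vector`; this assumes the `en_core_web_md` vectors model is installed, and `blorptastic` is an invented out-of-vocabulary word.

```python
import spacy

nlp = spacy.load("en_core_web_md")  # the md model ships word vectors
doc = nlp("apple blorptastic")
for token in doc:
    print(token.text, token.is_oov, token.has_vector)
```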
@@ -27,6 +27,9 @@ Create the vocabulary.
 | `tag_map`                                   | dict                 | A dictionary mapping fine-grained tags to coarse-grained parts-of-speech, and optionally morphological attributes.                                          |
 | `lemmatizer`                                | object               | A lemmatizer. Defaults to `None`.                                                                                                                           |
 | `strings`                                   | `StringStore` / list | A [`StringStore`](/api/stringstore) that maps strings to hash values, and vice versa, or a list of strings.                                                 |
+| `lookups`                                   | `Lookups`            | A [`Lookups`](/api/lookups) that stores the `lemma_\*`, `lexeme_norm` and other large lookup tables. Defaults to `None`.                                    |
+| `lookups_extra` <Tag variant="new">2.3</Tag> | `Lookups`           | A [`Lookups`](/api/lookups) that stores the optional `lexeme_cluster`/`lexeme_prob`/`lexeme_sentiment`/`lexeme_settings` lookup tables. Defaults to `None`. |
+| `oov_prob`                                  | float                | The default OOV probability. Defaults to `-20.0`.                                                                                                           |
 | `vectors_name` <Tag variant="new">2.2</Tag> | unicode              | A name to identify the vectors table.                                                                                                                       |
 | **RETURNS**                                 | `Vocab`              | The newly constructed object.                                                                                                                               |
 
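For context, a minimal sketch of the two new constructor arguments, with hand-built tables whose contents are invented for illustration.

```python
from spacy.lookups import Lookups
from spacy.vocab import Vocab

lookups = Lookups()
lookups.add_table("lexeme_norm", {"cos": "because"})

lookups_extra = Lookups()
lookups_extra.add_table("lexeme_prob", {"because": -10.0})

# lookups backs norms and lemmas; lookups_extra backs cluster/prob/sentiment
vocab = Vocab(lookups=lookups, lookups_extra=lookups_extra)
print(vocab["cos"].norm_)  # resolved via the lexeme_norm table in v2.3
```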
@@ -297,9 +297,35 @@ though `$` and `€` are very different, spaCy normalizes them both to `$`. This
 way, they'll always be seen as similar, no matter how common they were in the
 training data.
 
-Norm exceptions can be provided as a simple dictionary. For more examples, see
-the English
-[`norm_exceptions.py`](https://github.com/explosion/spaCy/tree/master/spacy/lang/en/norm_exceptions.py).
+As of spaCy v2.3, language-specific norm exceptions are provided as a
+JSON dictionary in the package
+[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) rather
+than in the main library. For a full example, see
+[`en_lexeme_norm.json`](https://github.com/explosion/spacy-lookups-data/blob/master/spacy_lookups_data/data/en_lexeme_norm.json).
+
+```json
+### Example
+{
+    "cos": "because",
+    "fav": "favorite",
+    "accessorise": "accessorize",
+    "accessorised": "accessorized"
+}
+```
+
+If you're adding tables for a new language, be sure to add the tables to
+[`spacy_lookups_data/__init__.py`](https://github.com/explosion/spacy-lookups-data/blob/master/spacy_lookups_data/__init__.py)
+and register the entry point under `spacy_lookups` in
+[`setup.cfg`](https://github.com/explosion/spacy-lookups-data/blob/master/setup.cfg).
+
+Alternatively, you can initialize your language [`Vocab`](/api/vocab) with a
+[`Lookups`](/api/lookups) object that includes the table `lexeme_norm`.
+
+<Accordion title="Norm exceptions in spaCy v2.0-v2.2" id="norm-exceptions-v2.2">
+
+Previously in spaCy v2.0-v2.2, norm exceptions were provided as a simple Python
+dictionary. For more examples, see the English
+[`norm_exceptions.py`](https://github.com/explosion/spaCy/tree/v2.2.x/spacy/lang/en/norm_exceptions.py).
+
 ```python
 ### Example
@@ -327,6 +353,8 @@ norm exceptions overwrite any of the global exceptions, they should be added
 first. Also note that the tokenizer exceptions will always have priority over
 the attribute getters.
 
+</Accordion>
+
 ### Lexical attributes {#lex-attrs new="2"}
 
 spaCy provides a range of [`Token` attributes](/api/token#attributes) that
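For context, a hedged sketch of the lookup-based norms this section describes: with `spacy-lookups-data` installed, a blank English pipeline should pick up the `lexeme_norm` table through the `spacy_lookups` entry point.

```python
from spacy.lang.en import English

# assumes spacy-lookups-data is installed alongside spaCy v2.3
nlp = English()
print(nlp.vocab.lookups.has_table("lexeme_norm"))
print(nlp.vocab["cos"].norm_)  # "because", per the JSON example above
```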
@@ -732,7 +732,7 @@ rather than performance:
 
 ```python
 def tokenizer_pseudo_code(self, special_cases, prefix_search, suffix_search,
-                          infix_finditer, token_match):
+                          infix_finditer, token_match, url_match):
     tokens = []
     for substring in text.split():
         suffixes = []
@@ -829,7 +829,7 @@ for t in tok_exp:
 ### Customizing spaCy's Tokenizer class {#native-tokenizers}
 
 Let's imagine you wanted to create a tokenizer for a new language or specific
-domain. There are five things you would need to define:
+domain. There are six things you may need to define:
 
 1. A dictionary of **special cases**. This handles things like contractions,
    units of measurement, emoticons, certain abbreviations, etc.
@@ -840,9 +840,22 @@ domain. There are five things you would need to define:
 4. A function `infixes_finditer`, to handle non-whitespace separators, such as
    hyphens etc.
 5. An optional boolean function `token_match` matching strings that should never
-   be split, overriding the infix rules. Useful for things like URLs or numbers.
+   be split, overriding the infix rules. Useful for things like numbers.
 6. An optional boolean function `url_match`, which is similar to `token_match`
-   except prefixes and suffixes are removed before applying the match.
+   except that prefixes and suffixes are removed before applying the match.
+
+<Infobox title="Important note: token match in spaCy v2.2" variant="warning">
+
+In spaCy v2.2.2-v2.2.4, the `token_match` was equivalent to the `url_match`
+above and there was no match pattern applied before prefixes and suffixes were
+analyzed. As of spaCy v2.3.0, the `token_match` has been reverted to its
+behavior in v2.2.1 and earlier with precedence over prefixes and suffixes.
+
+The `url_match` is introduced in v2.3.0 to handle cases like URLs where the
+tokenizer should remove prefixes and suffixes (e.g., a comma at the end of a
+URL) before applying the match.
+
+</Infobox>
 
 You shouldn't usually need to create a `Tokenizer` subclass. Standard usage is
 to use `re.compile()` to build a regular expression object, and pass its
@@ -865,7 +878,7 @@ def custom_tokenizer(nlp):
                                 prefix_search=prefix_re.search,
                                 suffix_search=suffix_re.search,
                                 infix_finditer=infix_re.finditer,
-                                token_match=simple_url_re.match)
+                                url_match=simple_url_re.match)
 
 nlp = spacy.load("en_core_web_sm")
 nlp.tokenizer = custom_tokenizer(nlp)
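For context, a runnable consolidation of the snippet this hunk edits; the regexes here are simplified stand-ins for the docs' `prefix_re`/`suffix_re`/`infix_re`/`simple_url_re`, and `en_core_web_sm` is assumed to be installed.

```python
import re
import spacy
from spacy.tokenizer import Tokenizer

# a deliberately simplified URL pattern, standing in for simple_url_re
simple_url_re = re.compile(r"https?://\S+")

def custom_tokenizer(nlp):
    prefix_re = re.compile(r"^[\[\(\"']")
    suffix_re = re.compile(r"[\]\)\"',.]$")
    infix_re = re.compile(r"[-~]")
    return Tokenizer(nlp.vocab,
                     prefix_search=prefix_re.search,
                     suffix_search=suffix_re.search,
                     infix_finditer=infix_re.finditer,
                     url_match=simple_url_re.match)  # new in v2.3

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = custom_tokenizer(nlp)
# the trailing comma is stripped as a suffix before url_match applies
print([t.text for t in nlp("See https://spacy.io, it's great.")])
```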
@@ -85,6 +85,123 @@ To load your model with the neutral, multi-language class, simply set
 `meta.json`. You can also import the class directly, or call
 [`util.get_lang_class()`](/api/top-level#util.get_lang_class) for lazy-loading.
 
+### Chinese language support {#chinese new=2.3}
+
+The Chinese language class supports three word segmentation options:
+
+> ```python
+> from spacy.lang.zh import Chinese
+>
+> # Disable jieba to use character segmentation
+> Chinese.Defaults.use_jieba = False
+> nlp = Chinese()
+>
+> # Disable jieba through tokenizer config options
+> cfg = {"use_jieba": False}
+> nlp = Chinese(meta={"tokenizer": {"config": cfg}})
+>
+> # Load with "default" model provided by pkuseg
+> cfg = {"pkuseg_model": "default", "require_pkuseg": True}
+> nlp = Chinese(meta={"tokenizer": {"config": cfg}})
+> ```
+
+1. **Jieba:** `Chinese` uses [Jieba](https://github.com/fxsjy/jieba) for word
+   segmentation by default. It's enabled when you create a new `Chinese`
+   language class or call `spacy.blank("zh")`.
+2. **Character segmentation:** Character segmentation is supported by setting
+   `Chinese.Defaults.use_jieba = False` _before_ initializing the language
+   class. As of spaCy v2.3.0, the `meta` tokenizer config options can also be
+   used to configure `use_jieba`.
+3. **PKUSeg:** In spaCy v2.3.0, support for
+   [PKUSeg](https://github.com/lancopku/PKUSeg-python) has been added for
+   better segmentation of Chinese OntoNotes and the new
+   [Chinese models](/models/zh).
+
+<Accordion title="Details on spaCy's PKUSeg API">
+
+The `meta` argument of the `Chinese` language class supports the following
+tokenizer config settings:
+
+| Name               | Type    | Description                                                                                           |
+| ------------------ | ------- | ----------------------------------------------------------------------------------------------------- |
+| `pkuseg_model`     | unicode | **Required:** Name of a model provided by `pkuseg` or the path to a local model directory.            |
+| `pkuseg_user_dict` | unicode | Optional path to a file with one word per line which overrides the default `pkuseg` user dictionary.  |
+| `require_pkuseg`   | bool    | Overrides all `jieba` settings (optional but strongly recommended).                                   |
+
+```python
+### Examples
+# Load "default" model
+cfg = {"pkuseg_model": "default", "require_pkuseg": True}
+nlp = Chinese(meta={"tokenizer": {"config": cfg}})
+
+# Load local model
+cfg = {"pkuseg_model": "/path/to/pkuseg_model", "require_pkuseg": True}
+nlp = Chinese(meta={"tokenizer": {"config": cfg}})
+
+# Override the user directory
+cfg = {"pkuseg_model": "default", "require_pkuseg": True, "pkuseg_user_dict": "/path"}
+nlp = Chinese(meta={"tokenizer": {"config": cfg}})
+```
+
+You can also modify the user dictionary on-the-fly:
+
+```python
+# Append words to user dict
+nlp.tokenizer.pkuseg_update_user_dict(["中国", "ABC"])
+
+# Remove all words from user dict and replace with new words
+nlp.tokenizer.pkuseg_update_user_dict(["中国"], reset=True)
+
+# Remove all words from user dict
+nlp.tokenizer.pkuseg_update_user_dict([], reset=True)
+```
+
+</Accordion>
+
+<Accordion title="Details on pretrained and custom Chinese models">
+
+The [Chinese models](/models/zh) provided by spaCy include a custom `pkuseg`
+model trained only on
+[Chinese OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19), since the
+models provided by `pkuseg` include data restricted to research use. For
+research use, `pkuseg` provides models for several different domains
+(`"default"`, `"news"`, `"web"`, `"medicine"`, `"tourism"`) and for other uses,
+`pkuseg` provides a simple
+[training API](https://github.com/lancopku/pkuseg-python/blob/master/readme/readme_english.md#usage):
+
+```python
+import pkuseg
+from spacy.lang.zh import Chinese
+
+# Train pkuseg model
+pkuseg.train("train.utf8", "test.utf8", "/path/to/pkuseg_model")
+# Load pkuseg model in spaCy Chinese tokenizer
+nlp = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "/path/to/pkuseg_model", "require_pkuseg": True}}})
+```
+
+</Accordion>
+
+### Japanese language support {#japanese new=2.3}
+
+> ```python
+> from spacy.lang.ja import Japanese
+>
+> # Load SudachiPy with split mode A (default)
+> nlp = Japanese()
+>
+> # Load SudachiPy with split mode B
+> cfg = {"split_mode": "B"}
+> nlp = Japanese(meta={"tokenizer": {"config": cfg}})
+> ```
+
+The Japanese language class uses
+[SudachiPy](https://github.com/WorksApplications/SudachiPy) for word
+segmentation and part-of-speech tagging. The default Japanese language class
+and the provided Japanese models use SudachiPy split mode `A`.
+
+The `meta` argument of the `Japanese` language class can be used to configure
+the split mode to `A`, `B` or `C`.
+
 ## Installing and using models {#download}
 
 > #### Downloading models in spaCy < v1.7
website/docs/usage/v2-3.md (213 lines, new file)
					@ -0,0 +1,213 @@
 | 
				
			||||||
 | 
					---
 | 
				
			||||||
 | 
					title: What's New in v2.3
 | 
				
			||||||
 | 
					teaser: New features, backwards incompatibilities and migration guide
 | 
				
			||||||
 | 
					menu:
 | 
				
			||||||
 | 
					  - ['New Features', 'features']
 | 
				
			||||||
 | 
					  - ['Backwards Incompatibilities', 'incompat']
 | 
				
			||||||
 | 
					  - ['Migrating from v2.2', 'migrating']
 | 
				
			||||||
 | 
					---
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					## New Features {#features hidden="true"}
 | 
				
			||||||
 | 
					
 | 
				
			||||||
 | 
					spaCy v2.3 features new pretrained models for five languages, word vectors for
 | 
				
			||||||
 | 
					all language models, and decreased model size and loading times for models with
 | 
				
			||||||
 | 
					vectors. We've added pretrained models for **Chinese, Danish, Japanese, Polish
 | 
				
			||||||
 | 
					and Romanian** and updated the training data and vectors for most languages.
 | 
				
			||||||
 | 
					Model packages with vectors are about **2×** smaller on disk and load
 | 
				
			||||||
 | 
					**2-4×** faster. For the full changelog, see the [release notes on
 | 
				
			||||||
 | 
					GitHub](https://github.com/explosion/spaCy/releases/tag/v2.3.0). For more
 | 
				
			||||||
 | 
					details and a behind-the-scenes look at the new release, [see our blog
 | 
				
			||||||
 | 
					post](https://explosion.ai/blog/spacy-v2-3).
 | 
				
			||||||
 | 
					
 | 
				
			||||||
### Expanded model families with vectors {#models}

> #### Example
>
> ```bash
> python -m spacy download da_core_news_sm
> python -m spacy download ja_core_news_sm
> python -m spacy download pl_core_news_sm
> python -m spacy download ro_core_news_sm
> python -m spacy download zh_core_web_sm
> ```

With new model families for Chinese, Danish, Japanese, Polish and Romanian plus
`md` and `lg` models with word vectors for all languages, this release provides
a total of 46 model packages. For models trained using
[Universal Dependencies](https://universaldependencies.org) corpora, the
training data has been updated to UD v2.5 (v2.6 for Japanese, v2.3 for Polish)
and Dutch has been extended to include both UD Dutch Alpino and LassySmall.

<Infobox>

**Models:** [Models directory](/models) **Benchmarks:**
[Release notes](https://github.com/explosion/spaCy/releases/tag/v2.3.0)

</Infobox>
### Chinese {#chinese}

> #### Example
>
> ```python
> from spacy.lang.zh import Chinese
>
> # Load with "default" model provided by pkuseg
> cfg = {"pkuseg_model": "default", "require_pkuseg": True}
> nlp = Chinese(meta={"tokenizer": {"config": cfg}})
>
> # Append words to user dict
> nlp.tokenizer.pkuseg_update_user_dict(["中国", "ABC"])
> ```

This release adds support for
[pkuseg](https://github.com/lancopku/pkuseg-python) for word segmentation, and
the new Chinese models ship with a custom pkuseg model trained on OntoNotes.
The Chinese tokenizer can be initialized with both `pkuseg` and custom models,
and the `pkuseg` user dictionary is easy to customize.

<Infobox>

**Chinese:** [Chinese tokenizer usage](/usage/models#chinese)

</Infobox>
### Japanese {#japanese}

The updated Japanese language class switches to
[SudachiPy](https://github.com/WorksApplications/SudachiPy) for word
segmentation and part-of-speech tagging. Using `sudachipy` greatly simplifies
installing spaCy for Japanese, which is now possible with a single command:
`pip install spacy[ja]`.

<Infobox>

**Japanese:** [Japanese tokenizer usage](/usage/models#japanese)

</Infobox>
### Small CLI updates

- `spacy debug-data` provides the coverage of the vectors in a base model with
  `spacy debug-data lang train dev -b base_model`
- `spacy evaluate` supports `blank:lg` (e.g. `spacy evaluate blank:en dev.json`)
  to evaluate the tokenization accuracy without loading a model
- `spacy train` on GPU restricts the CPU timing evaluation to the first
  iteration
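As a rough sketch of the new invocations (the language code, the file names and
the `en_vectors_web_lg` base model below are placeholders):

```bash
# Check the vector coverage of a base model against your training data
python -m spacy debug-data en train.json dev.json -b en_vectors_web_lg

# Evaluate tokenization accuracy without loading a trained model
python -m spacy evaluate blank:en dev.json
```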
## Backwards incompatibilities {#incompat}

<Infobox title="Important note on models" variant="warning">

If you've been training **your own models**, you'll need to **retrain** them
with the new version. Also don't forget to upgrade all models to the latest
versions. Models for earlier v2 releases (v2.0, v2.1, v2.2) aren't compatible
with models for v2.3. To check if all of your models are up to date, you can
run the [`spacy validate`](/api/cli#validate) command.

</Infobox>

> #### Install with lookups data
>
> ```bash
> $ pip install spacy[lookups]
> ```
>
> You can also install
> [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
> directly.
- If you're training new models, you'll want to install the package
  [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data),
  which now includes both the lemmatization tables (as in v2.2) and the
  normalization tables (new in v2.3). If you're using pretrained models,
  **nothing changes**, because the relevant tables are included in the model
  packages.
- Due to the updated Universal Dependencies training data, the fine-grained
  part-of-speech tags will change for many provided language models. The
  coarse-grained part-of-speech tagset remains the same, but the mapping from
  particular fine-grained to coarse-grained tags may show minor differences.
- For French, Italian, Portuguese and Spanish, the fine-grained part-of-speech
  tagsets contain new merged tags related to contracted forms, such as
  `ADP_DET` for French `"au"`, which maps to UPOS `ADP` based on the head
  `"à"`. This increases the accuracy of the models by improving the alignment
  between spaCy's tokenization and Universal Dependencies multi-word tokens
  used for contractions.
### Migrating from spaCy 2.2 {#migrating}

#### Tokenizer settings

In spaCy v2.2.2-v2.2.4, there was a change to the precedence of `token_match`
that gave prefixes and suffixes priority over `token_match`, which caused
problems for many custom tokenizer configurations. This has been reverted in
v2.3 so that `token_match` has priority over prefixes and suffixes, as in
v2.2.1 and earlier versions.

A new tokenizer setting `url_match` has been introduced in v2.3.0 to handle
cases like URLs where the tokenizer should remove prefixes and suffixes (e.g.,
a comma at the end of a URL) before applying the match. See the full
[tokenizer documentation](/usage/linguistic-features#tokenization) and try out
[`nlp.tokenizer.explain()`](/usage/linguistic-features#tokenizer-debug) when
debugging your tokenizer configuration.
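For instance, a minimal sketch of `nlp.tokenizer.explain()` on a string with a
trailing comma after a URL (the text is an arbitrary example):

```python
import spacy

nlp = spacy.blank("en")
# explain() returns (rule, token) pairs showing which tokenizer rule
# produced each token, e.g. a prefix, suffix or URL match
for rule, token in nlp.tokenizer.explain("Visit https://spacy.io, today!"):
    print(rule, repr(token))
```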
#### Warnings configuration

spaCy's custom warnings have been replaced with native Python
[`warnings`](https://docs.python.org/3/library/warnings.html). Instead of
setting `SPACY_WARNING_IGNORE`, use the
[warnings filters](https://docs.python.org/3/library/warnings.html#the-warnings-filter)
to manage warnings.
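A minimal sketch, assuming the warning you want to silence starts with its code
(spaCy's warning messages are prefixed with codes such as `[W008]`):

```python
import warnings

# Ignore a specific spaCy warning by matching the start of its message,
# instead of setting the old SPACY_WARNING_IGNORE environment variable
warnings.filterwarnings("ignore", message=r"\[W008\]")
```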
#### Normalization tables

The normalization tables have moved from the language data in
[`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang) to
the package
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). If
you're adding data for a new language, the normalization table should be added
to `spacy-lookups-data`. See
[adding norm exceptions](/usage/adding-languages#norm-exceptions).
#### Probability and cluster features

> #### Load and save extra prob lookups table
>
> ```python
> from spacy.lang.en import English
> nlp = English()
> doc = nlp("the")
> print(doc[0].prob)  # lazily loads extra prob table
> nlp.to_disk("/path/to/model")  # includes prob table
> ```

The `Token.prob` and `Token.cluster` features, which are no longer used by the
core pipeline components as of spaCy v2, are no longer provided in the
pretrained models to reduce the model size. To keep these features available
for users relying on them, the `prob` and `cluster` features for the most
frequent 1M tokens have been moved to
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) as
`extra` features for the relevant languages (English, German, Greek and
Spanish).

The extra tables are loaded lazily, so if you have `spacy-lookups-data`
installed and your code accesses `Token.prob`, the full table is loaded into
the model vocab, which will take a few seconds on initial loading. When you
save this model after loading the `prob` table, the full `prob` table will be
saved as part of the model vocab.
 | 
					If you'd like to include custom `cluster`, `prob`, or `sentiment` tables as
 | 
				
			||||||
 | 
					part of a new model, add the data to
 | 
				
			||||||
 | 
					[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) under
 | 
				
			||||||
 | 
					the entry point `lg_extra`, e.g. `en_extra` for English. Alternatively, you can
 | 
				
			||||||
 | 
					initialize your [`Vocab`](/api/vocab) with the `lookups_extra` argument with a
 | 
				
			||||||
 | 
					[`Lookups`](/api/lookups) object that includes the tables `lexeme_cluster`,
 | 
				
			||||||
 | 
					`lexeme_prob`, `lexeme_sentiment` or `lexeme_settings`. `lexeme_settings` is
 | 
				
			||||||
 | 
					currently only used to provide a custom `oov_prob`. See examples in the [`data`
 | 
				
			||||||
 | 
					directory](https://github.com/explosion/spacy-lookups-data/tree/master/spacy_lookups_data/data)
 | 
				
			||||||
 | 
					in `spacy-lookups-data`.
 | 
				
			||||||
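As a minimal sketch of the `lookups_extra` route (the table values below are
made-up placeholders):

```python
from spacy.lookups import Lookups
from spacy.vocab import Vocab

lookups = Lookups()
# Custom log probabilities for a handful of lexemes
lookups.add_table("lexeme_prob", {"the": -3.0, "a": -4.0})
# Custom default probability for out-of-vocabulary lexemes
lookups.add_table("lexeme_settings", {"oov_prob": -19.0})

vocab = Vocab(lookups_extra=lookups)
```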
#### Initializing new models without extra lookups tables

When you initialize a new model with [`spacy init-model`](/api/cli#init-model),
the `prob` table from `spacy-lookups-data` may be loaded as part of the
initialization. If you'd like to omit this extra data as in spaCy's provided
v2.3 models, use the new flag `--omit-extra-lookups`.
			||||||
website/meta/languages.json

@@ -1,5 +1,35 @@
 {
     "languages": [
+        {
+            "code": "zh",
+            "name": "Chinese",
+            "models": ["zh_core_web_sm", "zh_core_web_md", "zh_core_web_lg"],
+            "dependencies": [
+                {
+                    "name": "Jieba",
+                    "url": "https://github.com/fxsjy/jieba"
+                },
+                {
+                    "name": "PKUSeg",
+                    "url": "https://github.com/lancopku/PKUSeg-python"
+                }
+            ],
+            "has_examples": true
+        },
+        {
+            "code": "da",
+            "name": "Danish",
+            "example": "Dette er en sætning.",
+            "has_examples": true,
+            "models": ["da_core_news_sm", "da_core_news_md", "da_core_news_lg"]
+        },
+        {
+            "code": "nl",
+            "name": "Dutch",
+            "models": ["nl_core_news_sm", "nl_core_news_md", "nl_core_news_lg"],
+            "example": "Dit is een zin.",
+            "has_examples": true
+        },
         {
             "code": "en",
             "name": "English",
@@ -14,68 +44,91 @@
             "example": "This is a sentence.",
             "has_examples": true
         },
+        {
+            "code": "fr",
+            "name": "French",
+            "models": ["fr_core_news_sm", "fr_core_news_md", "fr_core_news_lg"],
+            "example": "C'est une phrase.",
+            "has_examples": true
+        },
         {
             "code": "de",
             "name": "German",
-            "models": ["de_core_news_sm", "de_core_news_md"],
+            "models": ["de_core_news_sm", "de_core_news_md", "de_core_news_lg"],
             "starters": ["de_trf_bertbasecased_lg"],
             "example": "Dies ist ein Satz.",
             "has_examples": true
         },
         {
-            "code": "fr",
-            "name": "French",
-            "models": ["fr_core_news_sm", "fr_core_news_md"],
-            "example": "C'est une phrase.",
-            "has_examples": true
-        },
-        {
-            "code": "es",
-            "name": "Spanish",
-            "models": ["es_core_news_sm", "es_core_news_md"],
-            "example": "Esto es una frase.",
-            "has_examples": true
-        },
-        {
-            "code": "pt",
-            "name": "Portuguese",
-            "models": ["pt_core_news_sm"],
-            "example": "Esta é uma frase.",
+            "code": "el",
+            "name": "Greek",
+            "models": ["el_core_news_sm", "el_core_news_md", "el_core_news_lg"],
+            "example": "Αυτή είναι μια πρόταση.",
             "has_examples": true
         },
         {
             "code": "it",
             "name": "Italian",
-            "models": ["it_core_news_sm"],
+            "models": ["it_core_news_sm", "it_core_news_md", "it_core_news_lg"],
             "example": "Questa è una frase.",
             "has_examples": true
         },
         {
-            "code": "nl",
-            "name": "Dutch",
-            "models": ["nl_core_news_sm"],
-            "example": "Dit is een zin.",
+            "code": "ja",
+            "name": "Japanese",
+            "models": ["ja_core_news_sm", "ja_core_news_md", "ja_core_news_lg"],
+            "dependencies": [
+                {
+                    "name": "SudachiPy",
+                    "url": "https://github.com/WorksApplications/SudachiPy"
+                }
+            ],
             "has_examples": true
         },
         {
-            "code": "el",
-            "name": "Greek",
-            "models": ["el_core_news_sm", "el_core_news_md"],
-            "example": "Αυτή είναι μια πρόταση.",
-            "has_examples": true
+            "code": "lt",
+            "name": "Lithuanian",
+            "has_examples": true,
+            "models": ["lt_core_news_sm", "lt_core_news_md", "lt_core_news_lg"]
         },
-        { "code": "sv", "name": "Swedish", "has_examples": true },
-        { "code": "fi", "name": "Finnish", "has_examples": true },
         {
             "code": "nb",
             "name": "Norwegian Bokmål",
             "example": "Dette er en setning.",
             "has_examples": true,
-            "models": ["nb_core_news_sm"]
+            "models": ["nb_core_news_sm", "nb_core_news_md", "nb_core_news_lg"]
         },
-        { "code": "da", "name": "Danish", "example": "Dette er en sætning.", "has_examples": true },
+        {
+            "code": "pl",
+            "name": "Polish",
+            "example": "To jest zdanie.",
+            "has_examples": true,
+            "models": ["pl_core_news_sm", "pl_core_news_md", "pl_core_news_lg"]
+        },
+        {
+            "code": "pt",
+            "name": "Portuguese",
+            "models": ["pt_core_news_sm", "pt_core_news_md", "pt_core_news_lg"],
+            "example": "Esta é uma frase.",
+            "has_examples": true
+        },
+        {
+            "code": "ro",
+            "name": "Romanian",
+            "example": "Aceasta este o propoziție.",
+            "has_examples": true,
+            "models": ["ro_core_news_sm", "ro_core_news_md", "ro_core_news_lg"]
+        },
+        {
+            "code": "es",
+            "name": "Spanish",
+            "models": ["es_core_news_sm", "es_core_news_md", "es_core_news_lg"],
+            "example": "Esto es una frase.",
+            "has_examples": true
+        },
+        { "code": "sv", "name": "Swedish", "has_examples": true },
+        { "code": "fi", "name": "Finnish", "has_examples": true },
         { "code": "hu", "name": "Hungarian", "example": "Ez egy mondat.", "has_examples": true },
-        { "code": "pl", "name": "Polish", "example": "To jest zdanie.", "has_examples": true },
         {
             "code": "ru",
             "name": "Russian",
@@ -88,12 +141,6 @@
             "has_examples": true,
             "dependencies": [{ "name": "pymorphy2", "url": "https://github.com/kmike/pymorphy2" }]
         },
-        {
-            "code": "ro",
-            "name": "Romanian",
-            "example": "Aceasta este o propoziție.",
-            "has_examples": true
-        },
         { "code": "hr", "name": "Croatian", "has_examples": true },
         { "code": "eu", "name": "Basque", "has_examples": true },
         { "code": "yo", "name": "Yoruba", "has_examples": true },
@@ -123,7 +170,6 @@
         { "code": "bg", "name": "Bulgarian", "example": "Това е изречение", "has_examples": true },
         { "code": "cs", "name": "Czech" },
         { "code": "is", "name": "Icelandic" },
-        { "code": "lt", "name": "Lithuanian", "has_examples": true, "models": ["lt_core_news_sm"] },
         { "code": "lv", "name": "Latvian" },
         { "code": "sr", "name": "Serbian" },
         { "code": "sk", "name": "Slovak" },
@@ -145,12 +191,6 @@
             "example": "นี่คือประโยค",
             "has_examples": true
         },
-        {
-            "code": "zh",
-            "name": "Chinese",
-            "dependencies": [{ "name": "Jieba", "url": "https://github.com/fxsjy/jieba" }],
-            "has_examples": true
-        },
         {
             "code": "ja",
             "name": "Japanese",
@@ -187,6 +227,21 @@
             "example": "Sta chì a l'é unna fraxe.",
             "has_examples": true
         },
+        {
+            "code": "hy",
+            "name": "Armenian",
+            "has_examples": true
+        },
+        {
+            "code": "gu",
+            "name": "Gujarati",
+            "has_examples": true
+        },
+        {
+            "code": "ml",
+            "name": "Malayalam",
+            "has_examples": true
+        },
         {
             "code": "xx",
             "name": "Multi-language",
website/meta/sidebars.json

@@ -9,6 +9,7 @@
                     { "text": "Models & Languages", "url": "/usage/models" },
                     { "text": "Facts & Figures", "url": "/usage/facts-figures" },
                     { "text": "spaCy 101", "url": "/usage/spacy-101" },
+                    { "text": "New in v2.3", "url": "/usage/v2-3" },
                     { "text": "New in v2.2", "url": "/usage/v2-2" },
                     { "text": "New in v2.1", "url": "/usage/v2-1" },
                     { "text": "New in v2.0", "url": "/usage/v2" }
website/src/templates/models.js

@@ -83,7 +83,7 @@ function formatVectors(data) {

 function formatAccuracy(data) {
     if (!data) return null
-    const labels = { tags_acc: 'POS', ents_f: 'NER F', ents_p: 'NER P', ents_r: 'NER R' }
+    const labels = { tags_acc: 'TAG', ents_f: 'NER F', ents_p: 'NER P', ents_r: 'NER R' }
     const isSyntax = key => ['tags_acc', 'las', 'uas'].includes(key)
     const isNer = key => key.startsWith('ents_')
     return Object.keys(data).map(key => ({
website/src/widgets/landing.js

@@ -124,7 +124,7 @@ const Landing = ({ data }) => {
                             {counts.modelLangs} languages
                         </Li>
                         <Li>
-                            pretrained <strong>word vectors</strong>
+                            Pretrained <strong>word vectors</strong>
                         </Li>
                         <Li>State-of-the-art speed</Li>
                         <Li>
website/src/widgets/languages.js

@@ -38,10 +38,10 @@ const Languages = () => (
             const langs = site.siteMetadata.languages
             const withModels = langs
                 .filter(({ models }) => models && !!models.length)
-                .sort((a, b) => a.code.localeCompare(b.code))
+                .sort((a, b) => a.name.localeCompare(b.name))
             const withoutModels = langs
                 .filter(({ models }) => !models || !models.length)
-                .sort((a, b) => a.code.localeCompare(b.code))
+                .sort((a, b) => a.name.localeCompare(b.name))
             const withDeps = langs.filter(({ dependencies }) => dependencies && dependencies.length)
             return (
                 <>