Mirror of https://github.com/explosion/spaCy.git (synced 2025-10-25 13:11:03 +03:00)

	Various docs updates for v3.0 (#8353)
* Update cats score names in Scorer API docs
* Refer to performance in meta
* Update package naming/versions, lemmatizer details
* Minor formatting fixes
* Provide more explanation for cats_score_desc
* Provide language-specific lemmatizer defaults in API docs

Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
parent 8729307e67
commit 507422149f

				|  | @ -588,7 +588,7 @@ source of truth** used for loading a pipeline. | ||||||
| | `vectors`                                      | Information about the word vectors included with the pipeline. Typically a dict with the keys `"width"`, `"vectors"` (number of vectors), `"keys"` and `"name"`. ~~Dict[str, Any]~~                                                                                                                                              | | | `vectors`                                      | Information about the word vectors included with the pipeline. Typically a dict with the keys `"width"`, `"vectors"` (number of vectors), `"keys"` and `"name"`. ~~Dict[str, Any]~~                                                                                                                                              | | ||||||
| | `pipeline`                                     | Names of pipeline component names, in order. Corresponds to [`nlp.pipe_names`](/api/language#pipe_names). Only exists for reference and is not used to create the components. This information is defined in the [`config.cfg`](/api/data-formats#config). Defaults to `[]`. ~~List[str]~~                                       | | | `pipeline`                                     | Names of pipeline component names, in order. Corresponds to [`nlp.pipe_names`](/api/language#pipe_names). Only exists for reference and is not used to create the components. This information is defined in the [`config.cfg`](/api/data-formats#config). Defaults to `[]`. ~~List[str]~~                                       | | ||||||
| | `labels`                                       | Label schemes of the trained pipeline components, keyed by component name. Corresponds to [`nlp.pipe_labels`](/api/language#pipe_labels). [See here](https://github.com/explosion/spacy-models/tree/master/meta) for examples. Defaults to `{}`. ~~Dict[str, Dict[str, List[str]]]~~                                             | | | `labels`                                       | Label schemes of the trained pipeline components, keyed by component name. Corresponds to [`nlp.pipe_labels`](/api/language#pipe_labels). [See here](https://github.com/explosion/spacy-models/tree/master/meta) for examples. Defaults to `{}`. ~~Dict[str, Dict[str, List[str]]]~~                                             | | ||||||
| | `accuracy`                                     | Training accuracy, added automatically by [`spacy train`](/api/cli#train). Dictionary of [score names](/usage/training#metrics) mapped to scores. Defaults to `{}`. ~~Dict[str, Union[float, Dict[str, float]]]~~                                                                                                                | | | `performance`                                  | Training accuracy, added automatically by [`spacy train`](/api/cli#train). Dictionary of [score names](/usage/training#metrics) mapped to scores. Defaults to `{}`. ~~Dict[str, Union[float, Dict[str, float]]]~~                                                                                                                | | ||||||
| | `speed`                                        | Inference speed, added automatically by [`spacy train`](/api/cli#train). Typically a dictionary with the keys `"cpu"`, `"gpu"` and `"nwords"` (words per second). Defaults to `{}`. ~~Dict[str, Optional[Union[float, str]]]~~                                                                                                   | | | `speed`                                        | Inference speed, added automatically by [`spacy train`](/api/cli#train). Typically a dictionary with the keys `"cpu"`, `"gpu"` and `"nwords"` (words per second). Defaults to `{}`. ~~Dict[str, Optional[Union[float, str]]]~~                                                                                                   | | ||||||
| | `spacy_git_version` <Tag variant="new">3</Tag> | Git commit of [`spacy`](https://github.com/explosion/spaCy) used to create pipeline. ~~str~~                                                                                                                                                                                                                                     | | | `spacy_git_version` <Tag variant="new">3</Tag> | Git commit of [`spacy`](https://github.com/explosion/spaCy) used to create pipeline. ~~str~~                                                                                                                                                                                                                                     | | ||||||
| | other                                          | Any other custom meta information you want to add. The data is preserved in [`nlp.meta`](/api/language#meta). ~~Any~~                                                                                                                                                                                                            | | | other                                          | Any other custom meta information you want to add. The data is preserved in [`nlp.meta`](/api/language#meta). ~~Any~~                                                                                                                                                                                                            | | ||||||
|  |  | ||||||
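The hunk above changes the documented meta key from `accuracy` to `performance`, so code that reads scores from a pipeline's meta may want to tolerate both names. The following is only a minimal sketch under that assumption: `meta` is a plain dict shaped like `nlp.meta`, the helper name `get_scores` is hypothetical and not part of spaCy, and the score values are made up for illustration.

```python
from typing import Any, Dict


def get_scores(meta: Dict[str, Any]) -> Dict[str, Any]:
    """Return the score dict from a meta dict, preferring the newer key."""
    # `performance` is the key documented after this change; `accuracy` is
    # the key shown on the old side of the diff. Both default to `{}`.
    if "performance" in meta:
        return meta["performance"]
    return meta.get("accuracy", {})


# Hand-written example values (not real benchmark results).
meta_new = {"performance": {"tag_acc": 0.5}}
meta_old = {"accuracy": {"tag_acc": 0.25}}
meta_empty = {"lang": "en"}
```

With a loaded pipeline, the same helper could presumably be called as `get_scores(nlp.meta)`, since the table states that meta data is preserved in `nlp.meta`.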
|  | @ -49,11 +49,34 @@ data format used by the lookup and rule-based lemmatizers, see | ||||||
| > ``` | > ``` | ||||||
| 
 | 
 | ||||||
| | Setting     | Description                                                                                                                                               | | | Setting     | Description                                                                                                                                               | | ||||||
| | ----------- | --------------------------------------------------------------------------------- | | | ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | | ||||||
| | `mode`      | The lemmatizer mode, e.g. `"lookup"` or `"rule"`. Defaults to `"lookup"`. ~~str~~ | | | `mode`      | The lemmatizer mode, e.g. `"lookup"` or `"rule"`. Defaults to `lookup` if no language-specific lemmatizer is available (see the following table). ~~str~~ | | ||||||
| | `overwrite` | Whether to overwrite existing lemmas. Defaults to `False`. ~~bool~~                                                                                       | | | `overwrite` | Whether to overwrite existing lemmas. Defaults to `False`. ~~bool~~                                                                                       | | ||||||
| | `model`     | **Not yet implemented:** the model to use. ~~Model~~                                                                                                      | | | `model`     | **Not yet implemented:** the model to use. ~~Model~~                                                                                                      | | ||||||
| 
 | 
 | ||||||
|  | Many languages specify a default lemmatizer mode other than `lookup` if a better | ||||||
|  | lemmatizer is available. The lemmatizer modes `rule` and `pos_lookup` require | ||||||
|  | [`token.pos`](/api/token) from a previous pipeline component (see example | ||||||
|  | pipeline configurations in the | ||||||
|  | [pretrained pipeline design details](/models#design-cnn)) or rely on third-party | ||||||
|  | libraries (`pymorphy2`). | ||||||
|  | 
 | ||||||
|  | | Language | Default Mode | | ||||||
|  | | -------- | ------------ | | ||||||
|  | | `bn`     | `rule`       | | ||||||
|  | | `el`     | `rule`       | | ||||||
|  | | `en`     | `rule`       | | ||||||
|  | | `es`     | `rule`       | | ||||||
|  | | `fa`     | `rule`       | | ||||||
|  | | `fr`     | `rule`       | | ||||||
|  | | `mk`     | `rule`       | | ||||||
|  | | `nb`     | `rule`       | | ||||||
|  | | `nl`     | `rule`       | | ||||||
|  | | `pl`     | `pos_lookup` | | ||||||
|  | | `ru`     | `pymorphy2`  | | ||||||
|  | | `sv`     | `rule`       | | ||||||
|  | | `uk`     | `pymorphy2`  | | ||||||
|  | 
 | ||||||
| ```python | ```python | ||||||
| %%GITHUB_SPACY/spacy/pipeline/lemmatizer.py | %%GITHUB_SPACY/spacy/pipeline/lemmatizer.py | ||||||
| ``` | ``` | ||||||
|  |  | ||||||
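The default-mode table added in this hunk can be read as a simple lookup with `lookup` as the fallback. The sketch below encodes only what the table and the paragraph before it state; the names `DEFAULT_LEMMATIZER_MODES`, `default_mode` and `needs_pos` are hypothetical helpers and do not exist in spaCy.

```python
# Default lemmatizer modes, copied from the table in this diff.
DEFAULT_LEMMATIZER_MODES = {
    "bn": "rule", "el": "rule", "en": "rule", "es": "rule", "fa": "rule",
    "fr": "rule", "mk": "rule", "nb": "rule", "nl": "rule",
    "pl": "pos_lookup", "ru": "pymorphy2", "sv": "rule", "uk": "pymorphy2",
}

# Modes that, per the text above, require `token.pos` from an earlier
# pipeline component.
POS_DEPENDENT_MODES = {"rule", "pos_lookup"}


def default_mode(lang: str) -> str:
    """Return the documented default mode, falling back to "lookup"."""
    return DEFAULT_LEMMATIZER_MODES.get(lang, "lookup")


def needs_pos(lang: str) -> bool:
    """True if the default mode for `lang` depends on `token.pos`."""
    return default_mode(lang) in POS_DEPENDENT_MODES
```

Languages that are not listed in the table fall through to `lookup`, which matches the updated description of the `mode` setting.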
|  | @ -46,7 +46,10 @@ attribute being scored: | ||||||
| - `tag_acc`, `pos_acc`, `morph_acc`, `morph_per_feat`, `lemma_acc` | - `tag_acc`, `pos_acc`, `morph_acc`, `morph_per_feat`, `lemma_acc` | ||||||
| - `dep_uas`, `dep_las`, `dep_las_per_type` | - `dep_uas`, `dep_las`, `dep_las_per_type` | ||||||
| - `ents_p`, `ents_r` `ents_f`, `ents_per_type` | - `ents_p`, `ents_r` `ents_f`, `ents_per_type` | ||||||
| - `textcat_macro_auc`, `textcat_macro_f` | - `cats_score` (depends on config, description provided in `cats_score_desc`), | ||||||
|  |   `cats_micro_p`, `cats_micro_r`, `cats_micro_f`, `cats_macro_p`, | ||||||
|  |   `cats_macro_r`, `cats_macro_f`, `cats_macro_auc`, `cats_f_per_type`, | ||||||
|  |   `cats_auc_per_type` | ||||||
| 
 | 
 | ||||||
| > #### Example | > #### Example | ||||||
| > | > | ||||||
|  | @ -77,7 +80,7 @@ Docs with `has_unknown_spaces` are skipped during scoring. | ||||||
| > ``` | > ``` | ||||||
| 
 | 
 | ||||||
| | Name        | Description                                                                                                         | | | Name        | Description                                                                                                         | | ||||||
| | ----------- | ------------------------------------------------------------------------------------------------------------------- | | | ----------- | ------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------ | | ||||||
| | `examples`  | The `Example` objects holding both the predictions and the correct gold-standard annotations. ~~Iterable[Example]~~ | | | `examples`  | The `Example` objects holding both the predictions and the correct gold-standard annotations. ~~Iterable[Example]~~ | | ||||||
| | **RETURNS** | `Dict`                                                                                                              | A dictionary containing the scores `token_acc`, `token_p`, `token_r`, `token_f`. ~~Dict[str, float]]~~ | | | **RETURNS** | `Dict`                                                                                                              | A dictionary containing the scores `token_acc`, `token_p`, `token_r`, `token_f`. ~~Dict[str, float]]~~ | | ||||||
| 
 | 
 | ||||||
|  |  | ||||||
|  | @ -27,28 +27,29 @@ of `[lang]\_[name]`. For spaCy's pipelines, we also chose to divide the name | ||||||
| into three components: | into three components: | ||||||
| 
 | 
 | ||||||
| 1. **Type:** Capabilities (e.g. `core` for general-purpose pipeline with | 1. **Type:** Capabilities (e.g. `core` for general-purpose pipeline with | ||||||
|    vocabulary, syntax, entities and word vectors, or `dep` for only vocab and |    tagging, parsing, lemmatization and named entity recognition, or `dep` for | ||||||
|    syntax). |    only tagging, parsing and lemmatization). | ||||||
| 2. **Genre:** Type of text the pipeline is trained on, e.g. `web` or `news`. | 2. **Genre:** Type of text the pipeline is trained on, e.g. `web` or `news`. | ||||||
| 3. **Size:** Package size indicator, `sm`, `md` or `lg`. | 3. **Size:** Package size indicator, `sm`, `md`, `lg` or `trf` (`sm`: no word | ||||||
|  |    vectors, `md`: reduced word vector table with 20k unique vectors for ~500k | ||||||
|  |    words, `lg`: large word vector table with ~500k entries, `trf`: transformer | ||||||
|  |    pipeline without static word vectors) | ||||||
| 
 | 
 | ||||||
| For example, [`en_core_web_sm`](/models/en#en_core_web_sm) is a small English | For example, [`en_core_web_sm`](/models/en#en_core_web_sm) is a small English | ||||||
| pipeline trained on written web text (blogs, news, comments), that includes | pipeline trained on written web text (blogs, news, comments), that includes | ||||||
| vocabulary, vectors, syntax and entities. | vocabulary, syntax and entities. | ||||||
| 
 | 
 | ||||||
| ### Package versioning {#model-versioning} | ### Package versioning {#model-versioning} | ||||||
| 
 | 
 | ||||||
| Additionally, the pipeline package versioning reflects both the compatibility | Additionally, the pipeline package versioning reflects both the compatibility | ||||||
| with spaCy, as well as the major and minor version. A package version `a.b.c` | with spaCy, as well as the model version. A package version `a.b.c` translates | ||||||
| translates to: | to: | ||||||
| 
 | 
 | ||||||
| - `a`: **spaCy major version**. For example, `2` for spaCy v2.x. | - `a`: **spaCy major version**. For example, `2` for spaCy v2.x. | ||||||
| - `b`: **Package major version**. Pipelines with a different major version can't | - `b`: **spaCy minor version**. For example, `3` for spaCy v2.3.x. | ||||||
|   be loaded by the same code. For example, changing the width of the model, | - `c`: **Model version**. Different model config: e.g. from being trained on | ||||||
|   adding hidden layers or changing the activation changes the major version. |   different data, with different parameters, for different numbers of | ||||||
| - `c`: **Package minor version**. Same pipeline structure, but different |   iterations, with different vectors, etc. | ||||||
|   parameter values, e.g. from being trained on different data, for different |  | ||||||
|   numbers of iterations, etc. |  | ||||||
| 
 | 
 | ||||||
| For a detailed compatibility overview, see the | For a detailed compatibility overview, see the | ||||||
| [`compatibility.json`](https://github.com/explosion/spacy-models/tree/master/compatibility.json). | [`compatibility.json`](https://github.com/explosion/spacy-models/tree/master/compatibility.json). | ||||||
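The naming and versioning scheme described in this hunk can be illustrated by splitting a package name and version into their parts. This is a rough sketch, not spaCy code: it assumes names follow the `[lang]_[type]_[genre]_[size]` convention exactly and that versions are plain `a.b.c` integers (no pre-release suffixes). `PackageInfo`, `parse_package_name` and `parse_package_version` are hypothetical names.

```python
from typing import Dict, NamedTuple


class PackageInfo(NamedTuple):
    lang: str          # e.g. "en"
    capabilities: str  # the "Type" component, e.g. "core" or "dep"
    genre: str         # e.g. "web" or "news"
    size: str          # "sm", "md", "lg" or "trf"


def parse_package_name(name: str) -> PackageInfo:
    """Split a name such as "en_core_web_sm" into its four components."""
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected [lang]_[type]_[genre]_[size], got {name!r}")
    return PackageInfo(*parts)


def parse_package_version(version: str) -> Dict[str, int]:
    """Interpret `a.b.c` using the new (right-hand) side of the diff."""
    a, b, c = version.split(".")
    return {"spacy_major": int(a), "spacy_minor": int(b), "model_version": int(c)}
```

Under the updated scheme, `parse_package_version("2.3.1")` would describe a package built for spaCy v2.3.x with model version 1.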
|  | @ -96,9 +97,9 @@ In the `sm`/`md`/`lg` models: | ||||||
|   tagger. For English, the attribute ruler can improve its mapping from |   tagger. For English, the attribute ruler can improve its mapping from | ||||||
|   `token.tag` to `token.pos` if dependency parses from a `parser` are present, |   `token.tag` to `token.pos` if dependency parses from a `parser` are present, | ||||||
|   but the parser is not required. |   but the parser is not required. | ||||||
| - The rule-based `lemmatizer` (Dutch, English, French, Greek, Macedonian, | - The `lemmatizer` component for many languages (Dutch, English, French, Greek, | ||||||
|   Norwegian and Spanish) requires `token.pos` annotation from either |   Macedonian, Norwegian, Polish and Spanish) requires `token.pos` annotation | ||||||
|   `tagger`+`attribute_ruler` or `morphologizer`. |   from either `tagger`+`attribute_ruler` or `morphologizer`. | ||||||
| - The `ner` component is independent with its own internal tok2vec layer. | - The `ner` component is independent with its own internal tok2vec layer. | ||||||
| 
 | 
 | ||||||
| ### Transformer pipeline design {#design-trf} | ### Transformer pipeline design {#design-trf} | ||||||
|  | @ -107,8 +108,6 @@ In the transformer (`trf`) models, the `tagger`, `parser` and `ner` (if present) | ||||||
| all listen to the `transformer` component. The `attribute_ruler` and | all listen to the `transformer` component. The `attribute_ruler` and | ||||||
| `lemmatizer` have the same configuration as in the CNN models. | `lemmatizer` have the same configuration as in the CNN models. | ||||||
| 
 | 
 | ||||||
| <!-- TODO: pretty diagram --> |  | ||||||
| 
 |  | ||||||
| ### Modifying the default pipeline {#design-modify} | ### Modifying the default pipeline {#design-modify} | ||||||
| 
 | 
 | ||||||
| For faster processing, you may only want to run a subset of the components in a | For faster processing, you may only want to run a subset of the components in a | ||||||
|  | @ -130,12 +129,13 @@ nlp = spacy.load("en_core_web_sm", disable=["tagger", "attribute_ruler", "lemmat | ||||||
| nlp = spacy.load("en_core_web_trf", disable=["tagger", "attribute_ruler", "lemmatizer"]) | nlp = spacy.load("en_core_web_trf", disable=["tagger", "attribute_ruler", "lemmatizer"]) | ||||||
| ``` | ``` | ||||||
| 
 | 
 | ||||||
| <Infobox variant="warning" title="Rule-based lemmatizers require Token.pos"> | <Infobox variant="warning" title="Rule-based and POS-lookup lemmatizers require | ||||||
|  | Token.pos"> | ||||||
| 
 | 
 | ||||||
| The lemmatizer depends on `tagger`+`attribute_ruler` or `morphologizer` for | The lemmatizer depends on `tagger`+`attribute_ruler` or `morphologizer` for | ||||||
| Dutch, English, French, Greek, Macedonian, Norwegian and Spanish. If you disable | Dutch, English, French, Greek, Macedonian, Norwegian, Polish and Spanish. If you | ||||||
| any of these components, you'll see lemmatizer warnings unless the lemmatizer is | disable any of these components, you'll see lemmatizer warnings unless the | ||||||
| also disabled. | lemmatizer is also disabled. | ||||||
| 
 | 
 | ||||||
| </Infobox> | </Infobox> | ||||||
| 
 | 
 | ||||||
|  |  | ||||||
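The warning box in this hunk describes a dependency that can be checked before loading a pipeline with `disable`. The sketch below is an assumption-laden illustration: the language codes are my mapping of the listed language names (with `nb` taken from the lemmatizer table earlier in this diff), and `lemmatizer_may_warn` is a hypothetical helper, not a spaCy function.

```python
from typing import Iterable, List

# Codes for Dutch, English, French, Greek, Macedonian, Norwegian, Polish and
# Spanish. The name-to-code mapping is an assumption of this sketch.
POS_LEMMATIZER_LANGS = {"nl", "en", "fr", "el", "mk", "nb", "pl", "es"}


def lemmatizer_may_warn(lang: str, pipe_names: List[str], disable: Iterable[str]) -> bool:
    """Guess whether disabling components leaves the lemmatizer without POS.

    Returns True when the lemmatizer stays active for one of the listed
    languages but neither `tagger`+`attribute_ruler` nor `morphologizer`
    remains to provide `token.pos`.
    """
    disabled = set(disable)
    active = [name for name in pipe_names if name not in disabled]
    if "lemmatizer" not in active or lang not in POS_LEMMATIZER_LANGS:
        return False
    has_pos = (
        "tagger" in active and "attribute_ruler" in active
    ) or "morphologizer" in active
    return not has_pos


# Component order here is illustrative only.
example_pipe = ["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner"]
```

Disabling `tagger`, `attribute_ruler` and `lemmatizer` together, as in the `spacy.load` calls shown in the diff, makes the check return `False` because the lemmatizer itself is no longer active.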
|  | @ -57,7 +57,7 @@ | ||||||
|             "title": "eMFDscore : Extended Moral Foundation Dictionary Scoring for Python", |             "title": "eMFDscore : Extended Moral Foundation Dictionary Scoring for Python", | ||||||
|             "slogan": "Extended Moral Foundation Dictionary Scoring for Python", |             "slogan": "Extended Moral Foundation Dictionary Scoring for Python", | ||||||
|             "description": "eMFDscore is a library for the fast and flexible extraction of various moral information metrics from textual input data. eMFDscore is built on spaCy for faster execution and performs minimal preprocessing consisting of tokenization, syntactic dependency parsing, lower-casing, and stopword/punctuation/whitespace removal. eMFDscore lets users score documents with multiple Moral Foundations Dictionaries, provides various metrics for analyzing moral information, and extracts moral patient, agent, and attribute words related to entities.", |             "description": "eMFDscore is a library for the fast and flexible extraction of various moral information metrics from textual input data. eMFDscore is built on spaCy for faster execution and performs minimal preprocessing consisting of tokenization, syntactic dependency parsing, lower-casing, and stopword/punctuation/whitespace removal. eMFDscore lets users score documents with multiple Moral Foundations Dictionaries, provides various metrics for analyzing moral information, and extracts moral patient, agent, and attribute words related to entities.", | ||||||
|             "github": "https://github.com/medianeuroscience/emfdscore", |             "github": "medianeuroscience/emfdscore", | ||||||
|             "code_example": [ |             "code_example": [ | ||||||
|                 "from emfdscore.scoring import score_docs", |                 "from emfdscore.scoring import score_docs", | ||||||
|                 "import pandas as pd", |                 "import pandas as pd", | ||||||
|  |  | ||||||