Remove NBSP's across tables in the docs (#10842)

Sofie Van Landeghem 2022-05-25 09:48:39 +02:00 committed by Adriane Boyd
parent 6be09bbd07
commit 4619a99185
9 changed files with 98 additions and 102 deletions
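
For readers who want to reproduce this kind of cleanup, the following is a minimal sketch and not the method used in this commit (the commit does not say how the edits were made). It assumes the stray characters are U+00A0 (no-break space) and that table rows are lines starting with `|`. The function name `strip_nbsp_from_table_rows` is hypothetical.

```python
NBSP = "\u00a0"  # assumption: the offending character is U+00A0


def strip_nbsp_from_table_rows(markdown: str) -> str:
    """Replace NBSPs with plain spaces, but only on Markdown table rows."""
    cleaned = []
    # Split on "\n" only, so no other separator characters are touched.
    for line in markdown.split("\n"):
        # Treat any line whose first non-whitespace character is "|" as a table row.
        if line.lstrip().startswith("|"):
            line = line.replace(NBSP, " ")
        cleaned.append(line)
    return "\n".join(cleaned)


if __name__ == "__main__":
    sample = "Intro\u00a0text\n| `--dry`, `-D` | \u00a0Perform a dry run. |"
    print(strip_nbsp_from_table_rows(sample))
```

The sketch only swaps characters. It does not realign column widths or separator rows, so a Markdown formatter would still need to run afterwards to get the tidied tables shown in the hunks below.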


@ -1335,7 +1335,7 @@ $ python -m spacy project run [subcommand] [project_dir] [--force] [--dry]
| `subcommand` | Name of the command or workflow to run. ~~str (positional)~~ | | `subcommand` | Name of the command or workflow to run. ~~str (positional)~~ |
| `project_dir` | Path to project directory. Defaults to current working directory. ~~Path (positional)~~ | | `project_dir` | Path to project directory. Defaults to current working directory. ~~Path (positional)~~ |
| `--force`, `-F` | Force re-running steps, even if nothing changed. ~~bool (flag)~~ | | `--force`, `-F` | Force re-running steps, even if nothing changed. ~~bool (flag)~~ |
| `--dry`, `-D` |  Perform a dry run and don't execute scripts. ~~bool (flag)~~ | | `--dry`, `-D` | Perform a dry run and don't execute scripts. ~~bool (flag)~~ |
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | | `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
| **EXECUTES** | The command defined in the `project.yml`. | | **EXECUTES** | The command defined in the `project.yml`. |
@ -1454,10 +1454,10 @@ For more examples, see the templates in our
</Accordion> </Accordion>
| Name | Description | | Name | Description |
| -------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `project_dir` | Path to project directory. Defaults to current working directory. ~~Path (positional)~~ | | `project_dir` | Path to project directory. Defaults to current working directory. ~~Path (positional)~~ |
| `--output`, `-o` | Path to output file or `-` for stdout (default). If a file is specified and it already exists and contains auto-generated docs, only the auto-generated docs section is replaced. ~~Path (positional)~~ | | `--output`, `-o` | Path to output file or `-` for stdout (default). If a file is specified and it already exists and contains auto-generated docs, only the auto-generated docs section is replaced. ~~Path (positional)~~ |
|  `--no-emoji`, `-NE` | Don't use emoji in the titles. ~~bool (flag)~~ | | `--no-emoji`, `-NE` | Don't use emoji in the titles. ~~bool (flag)~~ |
| **CREATES** | The Markdown-formatted project documentation. | | **CREATES** | The Markdown-formatted project documentation. |
### project dvc {#project-dvc tag="command"} ### project dvc {#project-dvc tag="command"}
@ -1497,7 +1497,7 @@ $ python -m spacy project dvc [project_dir] [workflow] [--force] [--verbose]
| `project_dir` | Path to project directory. Defaults to current working directory. ~~Path (positional)~~ | | `project_dir` | Path to project directory. Defaults to current working directory. ~~Path (positional)~~ |
| `workflow` | Name of workflow defined in `project.yml`. Defaults to first workflow if not set. ~~Optional[str] \(option)~~ | | `workflow` | Name of workflow defined in `project.yml`. Defaults to first workflow if not set. ~~Optional[str] \(option)~~ |
| `--force`, `-F` | Force-updating config file. ~~bool (flag)~~ | | `--force`, `-F` | Force-updating config file. ~~bool (flag)~~ |
| `--verbose`, `-V` |  Print more output generated by DVC. ~~bool (flag)~~ | | `--verbose`, `-V` | Print more output generated by DVC. ~~bool (flag)~~ |
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | | `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
| **CREATES** | A `dvc.yaml` file in the project directory, based on the steps defined in the given workflow. | | **CREATES** | A `dvc.yaml` file in the project directory, based on the steps defined in the given workflow. |
@ -1588,5 +1588,5 @@ $ python -m spacy huggingface-hub push [whl_path] [--org] [--msg] [--local-repo]
| `--org`, `-o` | Optional name of organization to which the pipeline should be uploaded. ~~str (option)~~ | | `--org`, `-o` | Optional name of organization to which the pipeline should be uploaded. ~~str (option)~~ |
| `--msg`, `-m` | Commit message to use for update. Defaults to `"Update spaCy pipeline"`. ~~str (option)~~ | | `--msg`, `-m` | Commit message to use for update. Defaults to `"Update spaCy pipeline"`. ~~str (option)~~ |
| `--local-repo`, `-l` | Local path to the model repository (will be created if it doesn't exist). Defaults to `hub` in the current working directory. ~~Path (option)~~ | | `--local-repo`, `-l` | Local path to the model repository (will be created if it doesn't exist). Defaults to `hub` in the current working directory. ~~Path (option)~~ |
| `--verbose`, `-V` | Output additional info for debugging, e.g. the full generated hub metadata. ~~bool (flag)~~  | | `--verbose`, `-V` | Output additional info for debugging, e.g. the full generated hub metadata. ~~bool (flag)~~ |
| **UPLOADS** | The pipeline to the hub. | | **UPLOADS** | The pipeline to the hub. |


@ -38,9 +38,9 @@ streaming.
> ``` > ```
| Name | Description | | Name | Description |
| --------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `path` | The directory or filename to read from. Expects data in spaCy's binary [`.spacy` format](/api/data-formats#binary-training). ~~Path~~ | | `path` | The directory or filename to read from. Expects data in spaCy's binary [`.spacy` format](/api/data-formats#binary-training). ~~Path~~ |
|  `gold_preproc` | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See [`Corpus`](/api/corpus#init) for details. ~~bool~~ | | `gold_preproc` | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See [`Corpus`](/api/corpus#init) for details. ~~bool~~ |
| `max_length` | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~ | | `max_length` | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~ |
| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~ | | `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~ |
| `augmenter` | Apply some simply data augmentation, where we replace tokens with variations. This is especially useful for punctuation and case replacement, to help generalize beyond corpora that don't have smart-quotes, or only have smart quotes, etc. Defaults to `None`. ~~Optional[Callable]~~ | | `augmenter` | Apply some simply data augmentation, where we replace tokens with variations. This is especially useful for punctuation and case replacement, to help generalize beyond corpora that don't have smart-quotes, or only have smart quotes, etc. Defaults to `None`. ~~Optional[Callable]~~ |
@ -72,10 +72,10 @@ train/test skew.
> ``` > ```
| Name | Description | | Name | Description |
| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- | | -------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- |
| `path` | The directory or filename to read from. ~~Union[str, Path]~~ | | `path` | The directory or filename to read from. ~~Union[str, Path]~~ |
| _keyword-only_ | | | _keyword-only_ | |
|  `gold_preproc` | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. Defaults to `False`. ~~bool~~ | | `gold_preproc` | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. Defaults to `False`. ~~bool~~ |
| `max_length` | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~ | | `max_length` | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~ |
| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~ | | `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~ |
| `augmenter` | Optional data augmentation callback. ~~Callable[[Language, Example], Iterable[Example]]~~ | | `augmenter` | Optional data augmentation callback. ~~Callable[[Language, Example], Iterable[Example]]~~ |


@ -1123,7 +1123,7 @@ instance and factory instance.
| `factory` | The name of the registered component factory. ~~str~~ | | `factory` | The name of the registered component factory. ~~str~~ |
| `default_config` | The default config, describing the default values of the factory arguments. ~~Dict[str, Any]~~ | | `default_config` | The default config, describing the default values of the factory arguments. ~~Dict[str, Any]~~ |
| `assigns` | `Doc` or `Token` attributes assigned by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~ | | `assigns` | `Doc` or `Token` attributes assigned by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~ |
| `requires` | `Doc` or `Token` attributes required by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~  | | `requires` | `Doc` or `Token` attributes required by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~ |
| `retokenizes` | Whether the component changes tokenization. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~bool~~  | | `retokenizes` | Whether the component changes tokenization. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~bool~~ |
| `default_score_weights` | The scores to report during training, and their default weight towards the final score used to select the best model. Weights should sum to `1.0` per component and will be combined and normalized for the whole pipeline. If a weight is set to `None`, the score will not be logged or weighted. ~~Dict[str, Optional[float]]~~ | | `default_score_weights` | The scores to report during training, and their default weight towards the final score used to select the best model. Weights should sum to `1.0` per component and will be combined and normalized for the whole pipeline. If a weight is set to `None`, the score will not be logged or weighted. ~~Dict[str, Optional[float]]~~ |
| `scores` | All scores set by the components if it's trainable, e.g. `["ents_f", "ents_r", "ents_p"]`. Based on the `default_score_weights` and used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~ | | `scores` | All scores set by the components if it's trainable, e.g. `["ents_f", "ents_r", "ents_p"]`. Based on the `default_score_weights` and used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~ |


@ -30,20 +30,20 @@ pattern keys correspond to a number of
[`Token` attributes](/api/token#attributes). The supported attributes for [`Token` attributes](/api/token#attributes). The supported attributes for
rule-based matching are: rule-based matching are:
| Attribute |  Description | | Attribute | Description |
| ----------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------- | | ---------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------- |
| `ORTH` | The exact verbatim text of a token. ~~str~~ | | `ORTH` | The exact verbatim text of a token. ~~str~~ |
| `TEXT` <Tag variant="new">2.1</Tag> | The exact verbatim text of a token. ~~str~~ | | `TEXT` <Tag variant="new">2.1</Tag> | The exact verbatim text of a token. ~~str~~ |
| `NORM` | The normalized form of the token text. ~~str~~ | | `NORM` | The normalized form of the token text. ~~str~~ |
| `LOWER` | The lowercase form of the token text. ~~str~~ | | `LOWER` | The lowercase form of the token text. ~~str~~ |
|  `LENGTH` | The length of the token text. ~~int~~ | | `LENGTH` | The length of the token text. ~~int~~ |
|  `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT` | Token text consists of alphabetic characters, ASCII characters, digits. ~~bool~~ | | `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT` | Token text consists of alphabetic characters, ASCII characters, digits. ~~bool~~ |
|  `IS_LOWER`, `IS_UPPER`, `IS_TITLE` | Token text is in lowercase, uppercase, titlecase. ~~bool~~ | | `IS_LOWER`, `IS_UPPER`, `IS_TITLE` | Token text is in lowercase, uppercase, titlecase. ~~bool~~ |
|  `IS_PUNCT`, `IS_SPACE`, `IS_STOP` | Token is punctuation, whitespace, stop word. ~~bool~~ | | `IS_PUNCT`, `IS_SPACE`, `IS_STOP` | Token is punctuation, whitespace, stop word. ~~bool~~ |
|  `IS_SENT_START` | Token is start of sentence. ~~bool~~ | | `IS_SENT_START` | Token is start of sentence. ~~bool~~ |
|  `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL` | Token text resembles a number, URL, email. ~~bool~~ | | `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL` | Token text resembles a number, URL, email. ~~bool~~ |
| `SPACY` | Token has a trailing space. ~~bool~~ | | `SPACY` | Token has a trailing space. ~~bool~~ |
|  `POS`, `TAG`, `MORPH`, `DEP`, `LEMMA`, `SHAPE` | The token's simple and extended part-of-speech tag, morphological analysis, dependency label, lemma, shape. ~~str~~ | | `POS`, `TAG`, `MORPH`, `DEP`, `LEMMA`, `SHAPE` | The token's simple and extended part-of-speech tag, morphological analysis, dependency label, lemma, shape. ~~str~~ |
| `ENT_TYPE` | The token's entity label. ~~str~~ | | `ENT_TYPE` | The token's entity label. ~~str~~ |
| `ENT_IOB` | The IOB part of the token's entity tag. ~~str~~ | | `ENT_IOB` | The IOB part of the token's entity tag. ~~str~~ |
| `ENT_ID` | The token's entity ID (`ent_id`). ~~str~~ | | `ENT_ID` | The token's entity ID (`ent_id`). ~~str~~ |


@ -320,7 +320,6 @@ If a setting is not present in the options, the default value will be used.
| `template` <Tag variant="new">2.2</Tag> | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use `{bg}`, `{text}` and `{label}`. See [`templates.py`](%%GITHUB_SPACY/spacy/displacy/templates.py) for examples. ~~Optional[str]~~ | | `template` <Tag variant="new">2.2</Tag> | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use `{bg}`, `{text}` and `{label}`. See [`templates.py`](%%GITHUB_SPACY/spacy/displacy/templates.py) for examples. ~~Optional[str]~~ |
| `kb_url_template` <Tag variant="new">3.2.1</Tag> | Optional template to construct the KB url for the entity to link to. Expects a python f-string format with single field to fill in. ~~Optional[str]~~ | | `kb_url_template` <Tag variant="new">3.2.1</Tag> | Optional template to construct the KB url for the entity to link to. Expects a python f-string format with single field to fill in. ~~Optional[str]~~ |
#### Span Visualizer options {#displacy_options-span} #### Span Visualizer options {#displacy_options-span}
> #### Example > #### Example
@ -331,20 +330,18 @@ If a setting is not present in the options, the default value will be used.
> ``` > ```
| Name | Description | | Name | Description |
|-----------------|---------------------------------------------------------------------------------------------------------------------------------------------------------| | ----------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `spans_key` | Which spans key to render spans from. Default is `"sc"`. ~~str~~ | | `spans_key` | Which spans key to render spans from. Default is `"sc"`. ~~str~~ |
| `templates` | Dictionary containing the keys `"span"`, `"slice"`, and `"start"`. These dictate how the overall span, a span slice, and the starting token will be rendered. ~~Optional[Dict[str, str]~~ | | `templates` | Dictionary containing the keys `"span"`, `"slice"`, and `"start"`. These dictate how the overall span, a span slice, and the starting token will be rendered. ~~Optional[Dict[str, str]~~ |
| `kb_url_template` | Optional template to construct the KB url for the entity to link to. Expects a python f-string format with single field to fill in ~~Optional[str]~~ | | `kb_url_template` | Optional template to construct the KB url for the entity to link to. Expects a python f-string format with single field to fill in ~~Optional[str]~~ |
| `colors` | Color overrides. Entity types should be mapped to color names or values. ~~Dict[str, str]~~ | | `colors` | Color overrides. Entity types should be mapped to color names or values. ~~Dict[str, str]~~ |
By default, displaCy comes with colors for all entity types used by
By default, displaCy comes with colors for all entity types used by [spaCy's [spaCy's trained pipelines](/models) for both entity and span visualizer. If
trained pipelines](/models) for both entity and span visualizer. If you're you're using custom entity types, you can use the `colors` setting to add your
using custom entity types, you can use the `colors` setting to add your own own colors for them. Your application or pipeline package can also expose a
colors for them. Your application or pipeline package can also expose a [`spacy_displacy_colors` entry point](/usage/saving-loading#entry-points-displacy)
[`spacy_displacy_colors` entry to add custom labels and their colors automatically.
point](/usage/saving-loading#entry-points-displacy) to add custom labels and
their colors automatically.
By default, displaCy links to `#` for entities without a `kb_id` set on their By default, displaCy links to `#` for entities without a `kb_id` set on their
span. If you wish to link an entity to their URL then consider using the span. If you wish to link an entity to their URL then consider using the
@ -354,7 +351,6 @@ span. If you wish to link an entity to their URL then consider using the
should redirect you to their Wikidata page, in this case should redirect you to their Wikidata page, in this case
`https://www.wikidata.org/wiki/Q95`. `https://www.wikidata.org/wiki/Q95`.
## registry {#registry source="spacy/util.py" new="3"} ## registry {#registry source="spacy/util.py" new="3"}
spaCy's function registry extends spaCy's function registry extends
@ -443,8 +439,8 @@ and the accuracy scores on the development set.
The built-in, default logger is the ConsoleLogger, which prints results to the The built-in, default logger is the ConsoleLogger, which prints results to the
console in tabular format. The console in tabular format. The
[spacy-loggers](https://github.com/explosion/spacy-loggers) package, included as [spacy-loggers](https://github.com/explosion/spacy-loggers) package, included as
a dependency of spaCy, enables other loggers, such as one that a dependency of spaCy, enables other loggers, such as one that sends results to
sends results to a [Weights & Biases](https://www.wandb.com/) dashboard. a [Weights & Biases](https://www.wandb.com/) dashboard.
Instead of using one of the built-in loggers, you can Instead of using one of the built-in loggers, you can
[implement your own](/usage/training#custom-logging). [implement your own](/usage/training#custom-logging).
@ -584,9 +580,9 @@ the [`Corpus`](/api/corpus) class.
> ``` > ```
| Name | Description | | Name | Description |
| --------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `path` | The directory or filename to read from. Expects data in spaCy's binary [`.spacy` format](/api/data-formats#binary-training). ~~Union[str, Path]~~ | | `path` | The directory or filename to read from. Expects data in spaCy's binary [`.spacy` format](/api/data-formats#binary-training). ~~Union[str, Path]~~ |
|  `gold_preproc` | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See [`Corpus`](/api/corpus#init) for details. ~~bool~~ | | `gold_preproc` | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See [`Corpus`](/api/corpus#init) for details. ~~bool~~ |
| `max_length` | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~ | | `max_length` | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~ |
| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~ | | `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~ |
| `augmenter` | Apply some simply data augmentation, where we replace tokens with variations. This is especially useful for punctuation and case replacement, to help generalize beyond corpora that don't have smart-quotes, or only have smart quotes, etc. Defaults to `None`. ~~Optional[Callable]~~ | | `augmenter` | Apply some simply data augmentation, where we replace tokens with variations. This is especially useful for punctuation and case replacement, to help generalize beyond corpora that don't have smart-quotes, or only have smart quotes, etc. Defaults to `None`. ~~Optional[Callable]~~ |


@ -48,7 +48,7 @@ but do not change its part-of-speech. We say that a **lemma** (root form) is
**inflected** (modified/combined) with one or more **morphological features** to **inflected** (modified/combined) with one or more **morphological features** to
create a surface form. Here are some examples: create a surface form. Here are some examples:
| Context | Surface | Lemma | POS |  Morphological Features | | Context | Surface | Lemma | POS | Morphological Features |
| ---------------------------------------- | ------- | ----- | ------ | ---------------------------------------- | | ---------------------------------------- | ------- | ----- | ------ | ---------------------------------------- |
| I was reading the paper | reading | read | `VERB` | `VerbForm=Ger` | | I was reading the paper | reading | read | `VERB` | `VerbForm=Ger` |
| I don't watch the news, I read the paper | read | read | `VERB` | `VerbForm=Fin`, `Mood=Ind`, `Tense=Pres` | | I don't watch the news, I read the paper | read | read | `VERB` | `VerbForm=Fin`, `Mood=Ind`, `Tense=Pres` |
@ -430,7 +430,7 @@ for token in doc:
print(token.text, token.pos_, token.dep_, token.head.text) print(token.text, token.pos_, token.dep_, token.head.text)
``` ```
| Text |  POS | Dep | Head text | | Text | POS | Dep | Head text |
| ----------------------------------- | ------ | ------- | --------- | | ----------------------------------- | ------ | ------- | --------- |
| Credit and mortgage account holders | `NOUN` | `nsubj` | submit | | Credit and mortgage account holders | `NOUN` | `nsubj` | submit |
| must | `VERB` | `aux` | submit | | must | `VERB` | `aux` | submit |


@ -158,20 +158,20 @@ The available token pattern keys correspond to a number of
[`Token` attributes](/api/token#attributes). The supported attributes for [`Token` attributes](/api/token#attributes). The supported attributes for
rule-based matching are: rule-based matching are:
| Attribute |  Description | | Attribute | Description |
| ----------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ---------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `ORTH` | The exact verbatim text of a token. ~~str~~ | | `ORTH` | The exact verbatim text of a token. ~~str~~ |
| `TEXT` <Tag variant="new">2.1</Tag> | The exact verbatim text of a token. ~~str~~ | | `TEXT` <Tag variant="new">2.1</Tag> | The exact verbatim text of a token. ~~str~~ |
| `NORM` | The normalized form of the token text. ~~str~~ | | `NORM` | The normalized form of the token text. ~~str~~ |
| `LOWER` | The lowercase form of the token text. ~~str~~ | | `LOWER` | The lowercase form of the token text. ~~str~~ |
|  `LENGTH` | The length of the token text. ~~int~~ | | `LENGTH` | The length of the token text. ~~int~~ |
|  `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT` | Token text consists of alphabetic characters, ASCII characters, digits. ~~bool~~ | | `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT` | Token text consists of alphabetic characters, ASCII characters, digits. ~~bool~~ |
|  `IS_LOWER`, `IS_UPPER`, `IS_TITLE` | Token text is in lowercase, uppercase, titlecase. ~~bool~~ | | `IS_LOWER`, `IS_UPPER`, `IS_TITLE` | Token text is in lowercase, uppercase, titlecase. ~~bool~~ |
|  `IS_PUNCT`, `IS_SPACE`, `IS_STOP` | Token is punctuation, whitespace, stop word. ~~bool~~ | | `IS_PUNCT`, `IS_SPACE`, `IS_STOP` | Token is punctuation, whitespace, stop word. ~~bool~~ |
|  `IS_SENT_START` | Token is start of sentence. ~~bool~~ | | `IS_SENT_START` | Token is start of sentence. ~~bool~~ |
|  `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL` | Token text resembles a number, URL, email. ~~bool~~ | | `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL` | Token text resembles a number, URL, email. ~~bool~~ |
| `SPACY` | Token has a trailing space. ~~bool~~ | | `SPACY` | Token has a trailing space. ~~bool~~ |
|  `POS`, `TAG`, `MORPH`, `DEP`, `LEMMA`, `SHAPE` | The token's simple and extended part-of-speech tag, morphological analysis, dependency label, lemma, shape. Note that the values of these attributes are case-sensitive. For a list of available part-of-speech tags and dependency labels, see the [Annotation Specifications](/api/annotation). ~~str~~ | | `POS`, `TAG`, `MORPH`, `DEP`, `LEMMA`, `SHAPE` | The token's simple and extended part-of-speech tag, morphological analysis, dependency label, lemma, shape. Note that the values of these attributes are case-sensitive. For a list of available part-of-speech tags and dependency labels, see the [Annotation Specifications](/api/annotation). ~~str~~ |
| `ENT_TYPE` | The token's entity label. ~~str~~ | | `ENT_TYPE` | The token's entity label. ~~str~~ |
| `_` <Tag variant="new">2.1</Tag> | Properties in [custom extension attributes](/usage/processing-pipelines#custom-components-attributes). ~~Dict[str, Any]~~ | | `_` <Tag variant="new">2.1</Tag> | Properties in [custom extension attributes](/usage/processing-pipelines#custom-components-attributes). ~~Dict[str, Any]~~ |
| `OP` | [Operator or quantifier](#quantifiers) to determine how often to match a token pattern. ~~str~~ | | `OP` | [Operator or quantifier](#quantifiers) to determine how often to match a token pattern. ~~str~~ |


@ -132,8 +132,8 @@ your own.
> contributions for Catalan and to Kenneth Enevoldsen for Danish. For additional > contributions for Catalan and to Kenneth Enevoldsen for Danish. For additional
> Danish pipelines, check out [DaCy](https://github.com/KennethEnevoldsen/DaCy). > Danish pipelines, check out [DaCy](https://github.com/KennethEnevoldsen/DaCy).
| Package | Language | UPOS | Parser LAS |  NER F | | Package | Language | UPOS | Parser LAS | NER F |
| ------------------------------------------------- | -------- | ---: | ---------: | -----: | | ------------------------------------------------- | -------- | ---: | ---------: | ----: |
| [`ca_core_news_sm`](/models/ca#ca_core_news_sm) | Catalan | 98.2 | 87.4 | 79.8 | | [`ca_core_news_sm`](/models/ca#ca_core_news_sm) | Catalan | 98.2 | 87.4 | 79.8 |
| [`ca_core_news_md`](/models/ca#ca_core_news_md) | Catalan | 98.3 | 88.2 | 84.0 | | [`ca_core_news_md`](/models/ca#ca_core_news_md) | Catalan | 98.3 | 88.2 | 84.0 |
| [`ca_core_news_lg`](/models/ca#ca_core_news_lg) | Catalan | 98.5 | 88.4 | 84.2 | | [`ca_core_news_lg`](/models/ca#ca_core_news_lg) | Catalan | 98.5 | 88.4 | 84.2 |


@ -116,7 +116,7 @@ import Benchmarks from 'usage/\_benchmarks-models.md'
> corpus that had both syntactic and entity annotations, so the transformer > corpus that had both syntactic and entity annotations, so the transformer
> models for those languages do not include NER. > models for those languages do not include NER.
| Package | Language | Transformer | Tagger | Parser |  NER | | Package | Language | Transformer | Tagger | Parser | NER |
| ------------------------------------------------ | -------- | --------------------------------------------------------------------------------------------- | -----: | -----: | ---: | | ------------------------------------------------ | -------- | --------------------------------------------------------------------------------------------- | -----: | -----: | ---: |
| [`en_core_web_trf`](/models/en#en_core_web_trf) | English | [`roberta-base`](https://huggingface.co/roberta-base) | 97.8 | 95.2 | 89.9 | | [`en_core_web_trf`](/models/en#en_core_web_trf) | English | [`roberta-base`](https://huggingface.co/roberta-base) | 97.8 | 95.2 | 89.9 |
| [`de_dep_news_trf`](/models/de#de_dep_news_trf) | German | [`bert-base-german-cased`](https://huggingface.co/bert-base-german-cased) | 99.0 | 95.8 | - | | [`de_dep_news_trf`](/models/de#de_dep_news_trf) | German | [`bert-base-german-cased`](https://huggingface.co/bert-base-german-cased) | 99.0 | 95.8 | - |
@ -856,9 +856,9 @@ attribute ruler before training using the `[initialize]` block of your config.
### Using Lexeme Tables ### Using Lexeme Tables
To use tables like `lexeme_prob` when training a model from scratch, you need To use tables like `lexeme_prob` when training a model from scratch, you need to
to add an entry to the `initialize` block in your config. Here's what that add an entry to the `initialize` block in your config. Here's what that looks
looks like for the existing trained pipelines: like for the existing trained pipelines:
```ini ```ini
[initialize.lookups] [initialize.lookups]