From 4619a99185778329d5b4d98a9f27442423721aae Mon Sep 17 00:00:00 2001
From: Sofie Van Landeghem
Date: Wed, 25 May 2022 09:48:39 +0200
Subject: [PATCH] Remove NBSP's across tables in the docs (#10842)

---
 website/docs/api/cli.md                   | 18 ++++-----
 website/docs/api/corpus.md                | 32 ++++++++--------
 website/docs/api/language.md              |  4 +-
 website/docs/api/matcher.md               | 40 ++++++++++----------
 website/docs/api/top-level.md             | 46 +++++++++++------------
 website/docs/usage/linguistic-features.md |  4 +-
 website/docs/usage/rule-based-matching.md | 34 ++++++++---------
 website/docs/usage/v3-1.md                | 14 +++----
 website/docs/usage/v3.md                  |  8 ++--
 9 files changed, 98 insertions(+), 102 deletions(-)

diff --git a/website/docs/api/cli.md b/website/docs/api/cli.md
index dd396d0b3..cbd1f794a 100644
--- a/website/docs/api/cli.md
+++ b/website/docs/api/cli.md
@@ -1335,7 +1335,7 @@ $ python -m spacy project run [subcommand] [project_dir] [--force] [--dry]
| `subcommand` | Name of the command or workflow to run. ~~str (positional)~~ |
| `project_dir` | Path to project directory. Defaults to current working directory. ~~Path (positional)~~ |
| `--force`, `-F` | Force re-running steps, even if nothing changed. ~~bool (flag)~~ |
-| `--dry`, `-D` |  Perform a dry run and don't execute scripts. ~~bool (flag)~~ |
+| `--dry`, `-D` | Perform a dry run and don't execute scripts. ~~bool (flag)~~ |
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
| **EXECUTES** | The command defined in the `project.yml`. |

@@ -1453,12 +1453,12 @@ For more examples, see the templates in our

-| Name | Description |
-| -------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `project_dir` | Path to project directory. Defaults to current working directory. ~~Path (positional)~~ |
-| `--output`, `-o` | Path to output file or `-` for stdout (default). If a file is specified and it already exists and contains auto-generated docs, only the auto-generated docs section is replaced. ~~Path (positional)~~ |
-|  `--no-emoji`, `-NE` | Don't use emoji in the titles. ~~bool (flag)~~ |
-| **CREATES** | The Markdown-formatted project documentation. |
+| Name | Description |
+| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `project_dir` | Path to project directory. Defaults to current working directory. ~~Path (positional)~~ |
+| `--output`, `-o` | Path to output file or `-` for stdout (default). If a file is specified and it already exists and contains auto-generated docs, only the auto-generated docs section is replaced. ~~Path (positional)~~ |
+| `--no-emoji`, `-NE` | Don't use emoji in the titles. ~~bool (flag)~~ |
+| **CREATES** | The Markdown-formatted project documentation. |

### project dvc {#project-dvc tag="command"}

@@ -1497,7 +1497,7 @@ $ python -m spacy project dvc [project_dir] [workflow] [--force] [--verbose]
| `project_dir` | Path to project directory. Defaults to current working directory. ~~Path (positional)~~ |
| `workflow` | Name of workflow defined in `project.yml`. Defaults to first workflow if not set. ~~Optional[str] \(option)~~ |
| `--force`, `-F` | Force-updating config file. ~~bool (flag)~~ |
-| `--verbose`, `-V` |  Print more output generated by DVC. ~~bool (flag)~~ |
+| `--verbose`, `-V` | Print more output generated by DVC. ~~bool (flag)~~ |
| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ |
| **CREATES** | A `dvc.yaml` file in the project directory, based on the steps defined in the given workflow. |

@@ -1588,5 +1588,5 @@ $ python -m spacy huggingface-hub push [whl_path] [--org] [--msg] [--local-repo]
| `--org`, `-o` | Optional name of organization to which the pipeline should be uploaded. ~~str (option)~~ |
| `--msg`, `-m` | Commit message to use for update. Defaults to `"Update spaCy pipeline"`. ~~str (option)~~ |
| `--local-repo`, `-l` | Local path to the model repository (will be created if it doesn't exist). Defaults to `hub` in the current working directory. ~~Path (option)~~ |
-| `--verbose`, `-V` | Output additional info for debugging, e.g. the full generated hub metadata. ~~bool (flag)~~  |
+| `--verbose`, `-V` | Output additional info for debugging, e.g. the full generated hub metadata. ~~bool (flag)~~ |
| **UPLOADS** | The pipeline to the hub. |
diff --git a/website/docs/api/corpus.md b/website/docs/api/corpus.md
index 35afc8fea..88c4befd7 100644
--- a/website/docs/api/corpus.md
+++ b/website/docs/api/corpus.md
@@ -37,13 +37,13 @@ streaming.
> augmenter = null
> ```

-| Name | Description |
-| --------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `path` | The directory or filename to read from. Expects data in spaCy's binary [`.spacy` format](/api/data-formats#binary-training). ~~Path~~ |
-|  `gold_preproc` | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See [`Corpus`](/api/corpus#init) for details. ~~bool~~ |
-| `max_length` | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~ |
-| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~ |
-| `augmenter` | Apply some simply data augmentation, where we replace tokens with variations. This is especially useful for punctuation and case replacement, to help generalize beyond corpora that don't have smart-quotes, or only have smart quotes, etc. Defaults to `None`. ~~Optional[Callable]~~ |
+| Name | Description |
+| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `path` | The directory or filename to read from. Expects data in spaCy's binary [`.spacy` format](/api/data-formats#binary-training). ~~Path~~ |
+| `gold_preproc` | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See [`Corpus`](/api/corpus#init) for details. ~~bool~~ |
+| `max_length` | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~ |
+| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~ |
+| `augmenter` | Apply some simply data augmentation, where we replace tokens with variations. This is especially useful for punctuation and case replacement, to help generalize beyond corpora that don't have smart-quotes, or only have smart quotes, etc. Defaults to `None`. ~~Optional[Callable]~~ |

```python
%%GITHUB_SPACY/spacy/training/corpus.py
@@ -71,15 +71,15 @@ train/test skew.
> corpus = Corpus("./data", limit=10)
> ```

-| Name | Description |
-| --------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `path` | The directory or filename to read from. ~~Union[str, Path]~~ |
-| _keyword-only_ | |
-|  `gold_preproc` | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. Defaults to `False`. ~~bool~~ |
-| `max_length` | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~ |
-| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~ |
-| `augmenter` | Optional data augmentation callback. ~~Callable[[Language, Example], Iterable[Example]]~~ |
-| `shuffle` | Whether to shuffle the examples. Defaults to `False`. ~~bool~~ |
+| Name | Description |
+| -------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `path` | The directory or filename to read from. ~~Union[str, Path]~~ |
+| _keyword-only_ | |
+| `gold_preproc` | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. Defaults to `False`. ~~bool~~ |
+| `max_length` | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~ |
+| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~ |
+| `augmenter` | Optional data augmentation callback. ~~Callable[[Language, Example], Iterable[Example]]~~ |
+| `shuffle` | Whether to shuffle the examples. Defaults to `False`. ~~bool~~ |

## Corpus.\_\_call\_\_ {#call tag="method"}

diff --git a/website/docs/api/language.md b/website/docs/api/language.md
index 8d7686243..9a413efaf 100644
--- a/website/docs/api/language.md
+++ b/website/docs/api/language.md
@@ -1123,7 +1123,7 @@ instance and factory instance.
| `factory` | The name of the registered component factory. ~~str~~ |
| `default_config` | The default config, describing the default values of the factory arguments. ~~Dict[str, Any]~~ |
| `assigns` | `Doc` or `Token` attributes assigned by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~ |
-| `requires` | `Doc` or `Token` attributes required by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~  |
-| `retokenizes` | Whether the component changes tokenization. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~bool~~  |
+| `requires` | `Doc` or `Token` attributes required by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~ |
+| `retokenizes` | Whether the component changes tokenization. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~bool~~ |
| `default_score_weights` | The scores to report during training, and their default weight towards the final score used to select the best model. Weights should sum to `1.0` per component and will be combined and normalized for the whole pipeline. If a weight is set to `None`, the score will not be logged or weighted. ~~Dict[str, Optional[float]]~~ |
| `scores` | All scores set by the components if it's trainable, e.g. `["ents_f", "ents_r", "ents_p"]`. Based on the `default_score_weights` and used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~ |

diff --git a/website/docs/api/matcher.md b/website/docs/api/matcher.md
index 273c202ca..6c8cae211 100644
--- a/website/docs/api/matcher.md
+++ b/website/docs/api/matcher.md
@@ -30,26 +30,26 @@ pattern keys correspond to a number of
[`Token` attributes](/api/token#attributes). The supported attributes for
rule-based matching are:

-| Attribute |  Description |
-| ----------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------- |
-| `ORTH` | The exact verbatim text of a token. ~~str~~ |
-| `TEXT` 2.1 | The exact verbatim text of a token. ~~str~~ |
-| `NORM` | The normalized form of the token text. ~~str~~ |
-| `LOWER` | The lowercase form of the token text. ~~str~~ |
-|  `LENGTH` | The length of the token text. ~~int~~ |
-|  `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT` | Token text consists of alphabetic characters, ASCII characters, digits. ~~bool~~ |
-|  `IS_LOWER`, `IS_UPPER`, `IS_TITLE` | Token text is in lowercase, uppercase, titlecase. ~~bool~~ |
-|  `IS_PUNCT`, `IS_SPACE`, `IS_STOP` | Token is punctuation, whitespace, stop word. ~~bool~~ |
-|  `IS_SENT_START` | Token is start of sentence. ~~bool~~ |
-|  `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL` | Token text resembles a number, URL, email. ~~bool~~ |
-| `SPACY` | Token has a trailing space. ~~bool~~ |
-|  `POS`, `TAG`, `MORPH`, `DEP`, `LEMMA`, `SHAPE` | The token's simple and extended part-of-speech tag, morphological analysis, dependency label, lemma, shape. ~~str~~ |
-| `ENT_TYPE` | The token's entity label. ~~str~~ |
-| `ENT_IOB` | The IOB part of the token's entity tag. ~~str~~ |
-| `ENT_ID` | The token's entity ID (`ent_id`). ~~str~~ |
-| `ENT_KB_ID` | The token's entity knowledge base ID (`ent_kb_id`). ~~str~~ |
-| `_` 2.1 | Properties in [custom extension attributes](/usage/processing-pipelines#custom-components-attributes). ~~Dict[str, Any]~~ |
-| `OP` | Operator or quantifier to determine how often to match a token pattern. ~~str~~ |
+| Attribute | Description |
+| ---------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------- |
+| `ORTH` | The exact verbatim text of a token. ~~str~~ |
+| `TEXT` 2.1 | The exact verbatim text of a token. ~~str~~ |
+| `NORM` | The normalized form of the token text. ~~str~~ |
+| `LOWER` | The lowercase form of the token text. ~~str~~ |
+| `LENGTH` | The length of the token text. ~~int~~ |
+| `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT` | Token text consists of alphabetic characters, ASCII characters, digits. ~~bool~~ |
+| `IS_LOWER`, `IS_UPPER`, `IS_TITLE` | Token text is in lowercase, uppercase, titlecase. ~~bool~~ |
+| `IS_PUNCT`, `IS_SPACE`, `IS_STOP` | Token is punctuation, whitespace, stop word. ~~bool~~ |
+| `IS_SENT_START` | Token is start of sentence. ~~bool~~ |
+| `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL` | Token text resembles a number, URL, email. ~~bool~~ |
+| `SPACY` | Token has a trailing space. ~~bool~~ |
+| `POS`, `TAG`, `MORPH`, `DEP`, `LEMMA`, `SHAPE` | The token's simple and extended part-of-speech tag, morphological analysis, dependency label, lemma, shape. ~~str~~ |
+| `ENT_TYPE` | The token's entity label. ~~str~~ |
+| `ENT_IOB` | The IOB part of the token's entity tag. ~~str~~ |
+| `ENT_ID` | The token's entity ID (`ent_id`). ~~str~~ |
+| `ENT_KB_ID` | The token's entity knowledge base ID (`ent_kb_id`). ~~str~~ |
+| `_` 2.1 | Properties in [custom extension attributes](/usage/processing-pipelines#custom-components-attributes). ~~Dict[str, Any]~~ |
+| `OP` | Operator or quantifier to determine how often to match a token pattern. ~~str~~ |

Operators and quantifiers define **how often** a token pattern should be
matched:

diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md
index f2fd1415f..904a91ea9 100644
--- a/website/docs/api/top-level.md
+++ b/website/docs/api/top-level.md
@@ -320,7 +320,6 @@ If a setting is not present in the options, the default value will be used.
| `template` 2.2 | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use `{bg}`, `{text}` and `{label}`. See [`templates.py`](%%GITHUB_SPACY/spacy/displacy/templates.py) for examples. ~~Optional[str]~~ |
| `kb_url_template` 3.2.1 | Optional template to construct the KB url for the entity to link to. Expects a python f-string format with single field to fill in. ~~Optional[str]~~ |

-
#### Span Visualizer options {#displacy_options-span}

> #### Example
>
> ```python
> options = {"spans_key": "sc"}
> displacy.serve(doc, style="span", options=options)
> ```

-| Name | Description |
-|-----------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|
-| `spans_key` | Which spans key to render spans from. Default is `"sc"`. ~~str~~ |
+| Name | Description |
+| ----------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `spans_key` | Which spans key to render spans from. Default is `"sc"`. ~~str~~ |
| `templates` | Dictionary containing the keys `"span"`, `"slice"`, and `"start"`. These dictate how the overall span, a span slice, and the starting token will be rendered. ~~Optional[Dict[str, str]~~ |
-| `kb_url_template` | Optional template to construct the KB url for the entity to link to. Expects a python f-string format with single field to fill in ~~Optional[str]~~ |
-| `colors` | Color overrides. Entity types should be mapped to color names or values. ~~Dict[str, str]~~ |
+| `kb_url_template` | Optional template to construct the KB url for the entity to link to. Expects a python f-string format with single field to fill in ~~Optional[str]~~ |
+| `colors` | Color overrides. Entity types should be mapped to color names or values. ~~Dict[str, str]~~ |

-By default, displaCy comes with colors for all entity types used by [spaCy's
-trained pipelines](/models) for both entity and span visualizer. If you're
-using custom entity types, you can use the `colors` setting to add your own
-colors for them. Your application or pipeline package can also expose a
-[`spacy_displacy_colors` entry
-point](/usage/saving-loading#entry-points-displacy) to add custom labels and
-their colors automatically.
+By default, displaCy comes with colors for all entity types used by
+[spaCy's trained pipelines](/models) for both entity and span visualizer. If
+you're using custom entity types, you can use the `colors` setting to add your
+own colors for them. Your application or pipeline package can also expose a
+[`spacy_displacy_colors` entry point](/usage/saving-loading#entry-points-displacy)
+to add custom labels and their colors automatically.

By default, displaCy links to `#` for entities without a `kb_id` set on their
span. If you wish to link an entity to their URL then consider using the
`kb_url_template` option from above. For example if the `kb_id` on a span is
`Q95` and this is a Wikidata identifier then this option can be set to
`https://www.wikidata.org/wiki/{}`. Clicking on your entity in the rendered HTML
should redirect you to their Wikidata page, in this case
`https://www.wikidata.org/wiki/Q95`.

-
## registry {#registry source="spacy/util.py" new="3"}

spaCy's function registry extends
@@ -443,8 +439,8 @@ and the accuracy scores on the development set.

The built-in, default logger is the ConsoleLogger, which prints results to the
console in tabular format. The
[spacy-loggers](https://github.com/explosion/spacy-loggers) package, included as
-a dependency of spaCy, enables other loggers, such as one that
-sends results to a [Weights & Biases](https://www.wandb.com/) dashboard.
+a dependency of spaCy, enables other loggers, such as one that sends results to
+a [Weights & Biases](https://www.wandb.com/) dashboard.

Instead of using one of the built-in loggers, you can
[implement your own](/usage/training#custom-logging).

@@ -583,14 +579,14 @@ the [`Corpus`](/api/corpus) class.
> limit = 0
> ```

-| Name | Description |
-| --------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `path` | The directory or filename to read from. Expects data in spaCy's binary [`.spacy` format](/api/data-formats#binary-training). ~~Union[str, Path]~~ |
-|  `gold_preproc` | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See [`Corpus`](/api/corpus#init) for details. ~~bool~~ |
-| `max_length` | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~ |
-| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~ |
-| `augmenter` | Apply some simply data augmentation, where we replace tokens with variations. This is especially useful for punctuation and case replacement, to help generalize beyond corpora that don't have smart-quotes, or only have smart quotes, etc. Defaults to `None`. ~~Optional[Callable]~~ |
-| **CREATES** | The corpus reader. ~~Corpus~~ |
+| Name | Description |
+| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `path` | The directory or filename to read from. Expects data in spaCy's binary [`.spacy` format](/api/data-formats#binary-training). ~~Union[str, Path]~~ |
+| `gold_preproc` | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See [`Corpus`](/api/corpus#init) for details. ~~bool~~ |
+| `max_length` | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~ |
+| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~ |
+| `augmenter` | Apply some simply data augmentation, where we replace tokens with variations. This is especially useful for punctuation and case replacement, to help generalize beyond corpora that don't have smart-quotes, or only have smart quotes, etc. Defaults to `None`. ~~Optional[Callable]~~ |
+| **CREATES** | The corpus reader. ~~Corpus~~ |

#### spacy.JsonlCorpus.v1 {#jsonlcorpus tag="registered function"}

diff --git a/website/docs/usage/linguistic-features.md b/website/docs/usage/linguistic-features.md
index b3b896a54..c547ec0bc 100644
--- a/website/docs/usage/linguistic-features.md
+++ b/website/docs/usage/linguistic-features.md
@@ -48,7 +48,7 @@ but do not change its part-of-speech. We say that a **lemma** (root form) is
**inflected** (modified/combined) with one or more **morphological features** to
create a surface form. Here are some examples:

-| Context | Surface | Lemma | POS |  Morphological Features |
+| Context | Surface | Lemma | POS | Morphological Features |
| ---------------------------------------- | ------- | ----- | ------ | ---------------------------------------- |
| I was reading the paper | reading | read | `VERB` | `VerbForm=Ger` |
| I don't watch the news, I read the paper | read | read | `VERB` | `VerbForm=Fin`, `Mood=Ind`, `Tense=Pres` |
@@ -430,7 +430,7 @@ for token in doc:
print(token.text, token.pos_, token.dep_, token.head.text)
```

-| Text |  POS | Dep | Head text |
+| Text | POS | Dep | Head text |
| ----------------------------------- | ------ | ------- | --------- |
| Credit and mortgage account holders | `NOUN` | `nsubj` | submit |
| must | `VERB` | `aux` | submit |

diff --git a/website/docs/usage/rule-based-matching.md b/website/docs/usage/rule-based-matching.md
index be9a56dc8..bf654c14f 100644
--- a/website/docs/usage/rule-based-matching.md
+++ b/website/docs/usage/rule-based-matching.md
@@ -158,23 +158,23 @@ The available token pattern keys correspond to a number of
[`Token` attributes](/api/token#attributes). The supported attributes for
rule-based matching are:

-| Attribute |  Description |
-| ----------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `ORTH` | The exact verbatim text of a token. ~~str~~ |
-| `TEXT` 2.1 | The exact verbatim text of a token. ~~str~~ |
-| `NORM` | The normalized form of the token text. ~~str~~ |
-| `LOWER` | The lowercase form of the token text. ~~str~~ |
-|  `LENGTH` | The length of the token text. ~~int~~ |
-|  `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT` | Token text consists of alphabetic characters, ASCII characters, digits. ~~bool~~ |
-|  `IS_LOWER`, `IS_UPPER`, `IS_TITLE` | Token text is in lowercase, uppercase, titlecase. ~~bool~~ |
-|  `IS_PUNCT`, `IS_SPACE`, `IS_STOP` | Token is punctuation, whitespace, stop word. ~~bool~~ |
-|  `IS_SENT_START` | Token is start of sentence. ~~bool~~ |
-|  `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL` | Token text resembles a number, URL, email. ~~bool~~ |
-| `SPACY` | Token has a trailing space. ~~bool~~ |
-|  `POS`, `TAG`, `MORPH`, `DEP`, `LEMMA`, `SHAPE` | The token's simple and extended part-of-speech tag, morphological analysis, dependency label, lemma, shape. Note that the values of these attributes are case-sensitive. For a list of available part-of-speech tags and dependency labels, see the [Annotation Specifications](/api/annotation). ~~str~~ |
-| `ENT_TYPE` | The token's entity label. ~~str~~ |
-| `_` 2.1 | Properties in [custom extension attributes](/usage/processing-pipelines#custom-components-attributes). ~~Dict[str, Any]~~ |
-| `OP` | [Operator or quantifier](#quantifiers) to determine how often to match a token pattern. ~~str~~ |
+| Attribute | Description |
+| ---------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `ORTH` | The exact verbatim text of a token. ~~str~~ |
+| `TEXT` 2.1 | The exact verbatim text of a token. ~~str~~ |
+| `NORM` | The normalized form of the token text. ~~str~~ |
+| `LOWER` | The lowercase form of the token text. ~~str~~ |
+| `LENGTH` | The length of the token text. ~~int~~ |
+| `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT` | Token text consists of alphabetic characters, ASCII characters, digits. ~~bool~~ |
+| `IS_LOWER`, `IS_UPPER`, `IS_TITLE` | Token text is in lowercase, uppercase, titlecase. ~~bool~~ |
+| `IS_PUNCT`, `IS_SPACE`, `IS_STOP` | Token is punctuation, whitespace, stop word. ~~bool~~ |
+| `IS_SENT_START` | Token is start of sentence. ~~bool~~ |
+| `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL` | Token text resembles a number, URL, email. ~~bool~~ |
+| `SPACY` | Token has a trailing space. ~~bool~~ |
+| `POS`, `TAG`, `MORPH`, `DEP`, `LEMMA`, `SHAPE` | The token's simple and extended part-of-speech tag, morphological analysis, dependency label, lemma, shape. Note that the values of these attributes are case-sensitive. For a list of available part-of-speech tags and dependency labels, see the [Annotation Specifications](/api/annotation). ~~str~~ |
+| `ENT_TYPE` | The token's entity label. ~~str~~ |
+| `_` 2.1 | Properties in [custom extension attributes](/usage/processing-pipelines#custom-components-attributes). ~~Dict[str, Any]~~ |
+| `OP` | [Operator or quantifier](#quantifiers) to determine how often to match a token pattern. ~~str~~ |

diff --git a/website/docs/usage/v3-1.md b/website/docs/usage/v3-1.md
index 1bac8fd81..2725cacb9 100644
--- a/website/docs/usage/v3-1.md
+++ b/website/docs/usage/v3-1.md
@@ -132,13 +132,13 @@ your own.
> contributions for Catalan and to Kenneth Enevoldsen for Danish. For additional
> Danish pipelines, check out [DaCy](https://github.com/KennethEnevoldsen/DaCy).

-| Package | Language | UPOS | Parser LAS |  NER F |
-| ------------------------------------------------- | -------- | ---: | ---------: | -----: |
-| [`ca_core_news_sm`](/models/ca#ca_core_news_sm) | Catalan | 98.2 | 87.4 | 79.8 |
-| [`ca_core_news_md`](/models/ca#ca_core_news_md) | Catalan | 98.3 | 88.2 | 84.0 |
-| [`ca_core_news_lg`](/models/ca#ca_core_news_lg) | Catalan | 98.5 | 88.4 | 84.2 |
-| [`ca_core_news_trf`](/models/ca#ca_core_news_trf) | Catalan | 98.9 | 93.0 | 91.2 |
-| [`da_core_news_trf`](/models/da#da_core_news_trf) | Danish | 98.0 | 85.0 | 82.9 |
+| Package | Language | UPOS | Parser LAS | NER F |
+| ------------------------------------------------- | -------- | ---: | ---------: | ----: |
+| [`ca_core_news_sm`](/models/ca#ca_core_news_sm) | Catalan | 98.2 | 87.4 | 79.8 |
+| [`ca_core_news_md`](/models/ca#ca_core_news_md) | Catalan | 98.3 | 88.2 | 84.0 |
+| [`ca_core_news_lg`](/models/ca#ca_core_news_lg) | Catalan | 98.5 | 88.4 | 84.2 |
+| [`ca_core_news_trf`](/models/ca#ca_core_news_trf) | Catalan | 98.9 | 93.0 | 91.2 |
+| [`da_core_news_trf`](/models/da#da_core_news_trf) | Danish | 98.0 | 85.0 | 82.9 |

### Resizable text classification architectures {#resizable-textcat}

diff --git a/website/docs/usage/v3.md b/website/docs/usage/v3.md
index 980f06172..971779ed3 100644
--- a/website/docs/usage/v3.md
+++ b/website/docs/usage/v3.md
@@ -116,7 +116,7 @@ import Benchmarks from 'usage/\_benchmarks-models.md'
> corpus that had both syntactic and entity annotations, so the transformer
> models for those languages do not include NER.

-| Package | Language | Transformer | Tagger | Parser |  NER |
+| Package | Language | Transformer | Tagger | Parser | NER |
| ------------------------------------------------ | -------- | --------------------------------------------------------------------------------------------- | -----: | -----: | ---: |
| [`en_core_web_trf`](/models/en#en_core_web_trf) | English | [`roberta-base`](https://huggingface.co/roberta-base) | 97.8 | 95.2 | 89.9 |
| [`de_dep_news_trf`](/models/de#de_dep_news_trf) | German | [`bert-base-german-cased`](https://huggingface.co/bert-base-german-cased) | 99.0 | 95.8 | - |
@@ -856,9 +856,9 @@ attribute ruler before training using the `[initialize]` block of your config.

### Using Lexeme Tables

-To use tables like `lexeme_prob` when training a model from scratch, you need
-to add an entry to the `initialize` block in your config. Here's what that
-looks like for the existing trained pipelines:
+To use tables like `lexeme_prob` when training a model from scratch, you need to
+add an entry to the `initialize` block in your config. Here's what that looks
+like for the existing trained pipelines:

```ini
[initialize.lookups]