mirror of https://github.com/explosion/spaCy.git
synced 2024-11-10 19:57:17 +03:00
Remove NBSP's across tables in the docs (#10842)
This commit is contained in:
parent
32954c3bcb
commit
83ed1f391b

@@ -1335,7 +1335,7 @@ $ python -m spacy project run [subcommand] [project_dir] [--force] [--dry]

| `subcommand`    | Name of the command or workflow to run. ~~str (positional)~~                            |
| `project_dir`   | Path to project directory. Defaults to current working directory. ~~Path (positional)~~ |
| `--force`, `-F` | Force re-running steps, even if nothing changed. ~~bool (flag)~~                        |
| `--dry`, `-D`   | Perform a dry run and don't execute scripts. ~~bool (flag)~~                            |
| `--help`, `-h`  | Show help message and available arguments. ~~bool (flag)~~                              |
| **EXECUTES**    | The command defined in the `project.yml`.                                               |
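
As a quick sketch, the command can also be invoked programmatically via Python's `subprocess`; the workflow name and project directory below are placeholders:

```python
import subprocess

# Hypothetical workflow name and project directory; adjust to your project.yml.
subprocess.run(
    [
        "python", "-m", "spacy", "project", "run",
        "all",            # subcommand: command or workflow to run
        "./my_project",   # project_dir
        "--dry",          # preview the steps without executing scripts
    ],
    check=True,  # raise if the spaCy CLI exits with a non-zero status
)
```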
@@ -1453,12 +1453,12 @@ For more examples, see the templates in our

</Accordion>

| Name                | Description                                                                                                                                                                                              |
| ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `project_dir`       | Path to project directory. Defaults to current working directory. ~~Path (positional)~~                                                                                                                   |
| `--output`, `-o`    | Path to output file or `-` for stdout (default). If a file is specified and it already exists and contains auto-generated docs, only the auto-generated docs section is replaced. ~~Path (positional)~~   |
| `--no-emoji`, `-NE` | Don't use emoji in the titles. ~~bool (flag)~~                                                                                                                                                            |
| **CREATES**         | The Markdown-formatted project documentation.                                                                                                                                                             |
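
A minimal sketch of calling this command from Python, with a placeholder project directory and output file:

```python
import subprocess

# Hypothetical paths; --output README.md replaces only the auto-generated
# section if the file already contains auto-generated docs.
subprocess.run(
    [
        "python", "-m", "spacy", "project", "document",
        "./my_project",
        "--output", "README.md",
        "--no-emoji",  # plain-text section titles
    ],
    check=True,
)
```
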
### project dvc {#project-dvc tag="command"}
@@ -1497,7 +1497,7 @@ $ python -m spacy project dvc [project_dir] [workflow] [--force] [--verbose]

| `project_dir`     | Path to project directory. Defaults to current working directory. ~~Path (positional)~~                       |
| `workflow`        | Name of workflow defined in `project.yml`. Defaults to first workflow if not set. ~~Optional[str] \(option)~~ |
| `--force`, `-F`   | Force-updating config file. ~~bool (flag)~~                                                                   |
| `--verbose`, `-V` | Print more output generated by DVC. ~~bool (flag)~~                                                           |
| `--help`, `-h`    | Show help message and available arguments. ~~bool (flag)~~                                                    |
| **CREATES**       | A `dvc.yaml` file in the project directory, based on the steps defined in the given workflow.                 |
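
A short sketch of generating the DVC config from Python; the project directory and workflow name are placeholders:

```python
import subprocess

# Hypothetical project directory and workflow; requires DVC to be installed
# and the project directory to be an initialized DVC repository.
subprocess.run(
    [
        "python", "-m", "spacy", "project", "dvc",
        "./my_project",
        "all",        # workflow defined in project.yml
        "--verbose",  # print more output generated by DVC
    ],
    check=True,
)
```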
@@ -1588,5 +1588,5 @@ $ python -m spacy huggingface-hub push [whl_path] [--org] [--msg] [--local-repo]

| `--org`, `-o`        | Optional name of organization to which the pipeline should be uploaded. ~~str (option)~~                                                        |
| `--msg`, `-m`        | Commit message to use for update. Defaults to `"Update spaCy pipeline"`. ~~str (option)~~                                                       |
| `--local-repo`, `-l` | Local path to the model repository (will be created if it doesn't exist). Defaults to `hub` in the current working directory. ~~Path (option)~~ |
| `--verbose`, `-V`    | Output additional info for debugging, e.g. the full generated hub metadata. ~~bool (flag)~~                                                     |
| **UPLOADS**          | The pipeline to the hub.                                                                                                                        |
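
A minimal sketch of pushing a packaged pipeline from Python; the wheel path and organization are placeholders:

```python
import subprocess

# Hypothetical wheel built with `spacy package`; requires being logged in
# to the Hugging Face Hub.
subprocess.run(
    [
        "python", "-m", "spacy", "huggingface-hub", "push",
        "dist/en_my_pipeline-1.0.0-py3-none-any.whl",
        "--org", "my-org",
        "--msg", "Update spaCy pipeline",
    ],
    check=True,
)
```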
@@ -37,13 +37,13 @@ streaming.

> augmenter = null
> ```

| Name           | Description                                                                                                                                                                                                                                                                              |
| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `path`         | The directory or filename to read from. Expects data in spaCy's binary [`.spacy` format](/api/data-formats#binary-training). ~~Path~~                                                                                                                                                     |
| `gold_preproc` | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See [`Corpus`](/api/corpus#init) for details. ~~bool~~                                                                                                                                  |
| `max_length`   | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~                                                                                                                                       |
| `limit`        | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~                                                                                                                                                                                           |
| `augmenter`    | Apply some simple data augmentation, where we replace tokens with variations. This is especially useful for punctuation and case replacement, to help generalize beyond corpora that don't have smart quotes, or only have smart quotes, etc. Defaults to `None`. ~~Optional[Callable]~~  |

```python
%%GITHUB_SPACY/spacy/training/corpus.py
```
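
As an illustration, a sketch of constructing the reader directly with an augmenter resolved from the registry; the registry name `spacy.lower_case.v1` and its `level` argument are assumed from spaCy v3's built-in augmenters:

```python
import spacy
from spacy.training import Corpus

# Assumed built-in augmenter that lower-cases a share of examples.
make_augmenter = spacy.registry.augmenters.get("spacy.lower_case.v1")
corpus = Corpus(
    "./train.spacy",                      # hypothetical path to .spacy data
    gold_preproc=False,
    max_length=0,                         # 0 = no limit
    limit=0,
    augmenter=make_augmenter(level=0.1),  # augment roughly 10% of examples
)
```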
@@ -71,15 +71,15 @@ train/test skew.

> corpus = Corpus("./data", limit=10)
> ```

| Name           | Description                                                                                                                                          |
| -------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
| `path`         | The directory or filename to read from. ~~Union[str, Path]~~                                                                                            |
| _keyword-only_ |                                                                                                                                                         |
| `gold_preproc` | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. Defaults to `False`. ~~bool~~                         |
| `max_length`   | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~    |
| `limit`        | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~                                                         |
| `augmenter`    | Optional data augmentation callback. ~~Callable[[Language, Example], Iterable[Example]]~~                                                               |
| `shuffle`      | Whether to shuffle the examples. Defaults to `False`. ~~bool~~                                                                                          |
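
For example, a minimal sketch of reading examples with the keyword-only arguments above; `./data` is a placeholder directory of `.spacy` files:

```python
import spacy
from spacy.training import Corpus

nlp = spacy.blank("en")
corpus = Corpus("./data", limit=10, shuffle=True)
for example in corpus(nlp):  # yields Example objects (see Corpus.__call__ below)
    print(example.reference.text)
```
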
## Corpus.\_\_call\_\_ {#call tag="method"}
@@ -1123,7 +1123,7 @@ instance and factory instance.

| `factory`               | The name of the registered component factory. ~~str~~                                                                                                                                                                                                                                                                               |
| `default_config`        | The default config, describing the default values of the factory arguments. ~~Dict[str, Any]~~                                                                                                                                                                                                                                      |
| `assigns`               | `Doc` or `Token` attributes assigned by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~                                                                                                                                                                  |
| `requires`              | `Doc` or `Token` attributes required by this component, e.g. `["token.ent_id"]`. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~                                                                                                                                                                  |
| `retokenizes`           | Whether the component changes tokenization. Used for [pipe analysis](/usage/processing-pipelines#analysis). ~~bool~~                                                                                                                                                                                                                |
| `default_score_weights` | The scores to report during training, and their default weight towards the final score used to select the best model. Weights should sum to `1.0` per component and will be combined and normalized for the whole pipeline. If a weight is set to `None`, the score will not be logged or weighted. ~~Dict[str, Optional[float]]~~ |
| `scores`                | All scores set by the component if it's trainable, e.g. `["ents_f", "ents_r", "ents_p"]`. Based on the `default_score_weights` and used for [pipe analysis](/usage/processing-pipelines#analysis). ~~Iterable[str]~~                                                                                                                |
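
These entries correspond to the arguments of the `@Language.factory` decorator. A minimal sketch with an illustrative component name and config:

```python
from spacy.language import Language
from spacy.tokens import Doc

@Language.factory(
    "my_component",                    # hypothetical factory name
    default_config={"enabled": True},
    assigns=["token.ent_id"],          # declared for pipe analysis
    retokenizes=False,
)
def create_my_component(nlp: Language, name: str, enabled: bool):
    def my_component(doc: Doc) -> Doc:
        # ... assign token.ent_id here ...
        return doc
    return my_component
```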
@@ -30,26 +30,26 @@ pattern keys correspond to a number of

[`Token` attributes](/api/token#attributes). The supported attributes for
rule-based matching are:

| Attribute                                      | Description                                                                                                                 |
| ---------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------- |
| `ORTH`                                         | The exact verbatim text of a token. ~~str~~                                                                                   |
| `TEXT` <Tag variant="new">2.1</Tag>            | The exact verbatim text of a token. ~~str~~                                                                                   |
| `NORM`                                         | The normalized form of the token text. ~~str~~                                                                                |
| `LOWER`                                        | The lowercase form of the token text. ~~str~~                                                                                 |
| `LENGTH`                                       | The length of the token text. ~~int~~                                                                                         |
| `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT`             | Token text consists of alphabetic characters, ASCII characters, digits. ~~bool~~                                              |
| `IS_LOWER`, `IS_UPPER`, `IS_TITLE`             | Token text is in lowercase, uppercase, titlecase. ~~bool~~                                                                    |
| `IS_PUNCT`, `IS_SPACE`, `IS_STOP`              | Token is punctuation, whitespace, stop word. ~~bool~~                                                                         |
| `IS_SENT_START`                                | Token is start of sentence. ~~bool~~                                                                                          |
| `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL`           | Token text resembles a number, URL, email. ~~bool~~                                                                           |
| `SPACY`                                        | Token has a trailing space. ~~bool~~                                                                                          |
| `POS`, `TAG`, `MORPH`, `DEP`, `LEMMA`, `SHAPE` | The token's simple and extended part-of-speech tag, morphological analysis, dependency label, lemma, shape. ~~str~~           |
| `ENT_TYPE`                                     | The token's entity label. ~~str~~                                                                                             |
| `ENT_IOB`                                      | The IOB part of the token's entity tag. ~~str~~                                                                               |
| `ENT_ID`                                       | The token's entity ID (`ent_id`). ~~str~~                                                                                     |
| `ENT_KB_ID`                                    | The token's entity knowledge base ID (`ent_kb_id`). ~~str~~                                                                   |
| `_` <Tag variant="new">2.1</Tag>               | Properties in [custom extension attributes](/usage/processing-pipelines#custom-components-attributes). ~~Dict[str, Any]~~     |
| `OP`                                           | Operator or quantifier to determine how often to match a token pattern. ~~str~~                                               |
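
For example, a minimal sketch combining a few of these attributes in a token pattern (the pattern and text are illustrative):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# Each dict describes one token: "hello", any punctuation, then "world".
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HelloWorld", [pattern])
doc = nlp("Hello, world!")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # "Hello, world"
```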

Operators and quantifiers define **how often** a token pattern should be
matched:
@@ -320,7 +320,6 @@ If a setting is not present in the options, the default value will be used.

| `template` <Tag variant="new">2.2</Tag>          | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use `{bg}`, `{text}` and `{label}`. See [`templates.py`](%%GITHUB_SPACY/spacy/displacy/templates.py) for examples. ~~Optional[str]~~ |
| `kb_url_template` <Tag variant="new">3.2.1</Tag> | Optional template to construct the KB url for the entity to link to. Expects a python f-string format with single field to fill in. ~~Optional[str]~~                                                                                       |

#### Span Visualizer options {#displacy_options-span}

> #### Example
@@ -330,21 +329,19 @@ If a setting is not present in the options, the default value will be used.

> displacy.serve(doc, style="span", options=options)
> ```

| Name              | Description                                                                                                                                                                                |
| ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `spans_key`       | Which spans key to render spans from. Default is `"sc"`. ~~str~~                                                                                                                            |
| `templates`       | Dictionary containing the keys `"span"`, `"slice"`, and `"start"`. These dictate how the overall span, a span slice, and the starting token will be rendered. ~~Optional[Dict[str, str]]~~  |
| `kb_url_template` | Optional template to construct the KB url for the entity to link to. Expects a python f-string format with a single field to fill in. ~~Optional[str]~~                                     |
| `colors`          | Color overrides. Entity types should be mapped to color names or values. ~~Dict[str, str]~~                                                                                                 |
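
A minimal sketch of rendering spans with these options; the text and span labels are illustrative:

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Welcome to the Bank of China.")
# Register spans under the default "sc" key used by the span visualizer.
doc.spans["sc"] = [Span(doc, 3, 6, "ORG"), Span(doc, 5, 6, "GPE")]
html = displacy.render(doc, style="span", options={"spans_key": "sc"})
```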

By default, displaCy comes with colors for all entity types used by
[spaCy's trained pipelines](/models) for both entity and span visualizer. If
you're using custom entity types, you can use the `colors` setting to add your
own colors for them. Your application or pipeline package can also expose a
[`spacy_displacy_colors` entry point](/usage/saving-loading#entry-points-displacy)
to add custom labels and their colors automatically.

By default, displaCy links to `#` for entities without a `kb_id` set on their
span. If you wish to link an entity to its URL then consider using the

@@ -354,7 +351,6 @@ span. If you wish to link an entity to their URL then consider using the

should redirect you to their Wikidata page, in this case
`https://www.wikidata.org/wiki/Q95`.
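
A sketch of such a template: displaCy fills the span's `kb_id` into the single field, so entities linked against Wikidata could use the following options. The text and entity span are illustrative:

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Google was founded in California.")
# Q95 is the Wikidata ID mentioned above; set it as the span's kb_id.
doc.ents = [Span(doc, 0, 1, label="ORG", kb_id="Q95")]
options = {"kb_url_template": "https://www.wikidata.org/wiki/{}"}
html = displacy.render(doc, style="ent", options=options)
```
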
## registry {#registry source="spacy/util.py" new="3"}

spaCy's function registry extends
@@ -443,8 +439,8 @@ and the accuracy scores on the development set.

The built-in, default logger is the ConsoleLogger, which prints results to the
console in tabular format. The
[spacy-loggers](https://github.com/explosion/spacy-loggers) package, included as
a dependency of spaCy, enables other loggers, such as one that sends results to
a [Weights & Biases](https://www.wandb.com/) dashboard.

Instead of using one of the built-in loggers, you can
[implement your own](/usage/training#custom-logging).
@@ -583,14 +579,14 @@ the [`Corpus`](/api/corpus) class.

> limit = 0
> ```

| Name           | Description                                                                                                                                                                                                                                                                              |
| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `path`         | The directory or filename to read from. Expects data in spaCy's binary [`.spacy` format](/api/data-formats#binary-training). ~~Union[str, Path]~~                                                                                                                                         |
| `gold_preproc` | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See [`Corpus`](/api/corpus#init) for details. ~~bool~~                                                                                                                                  |
| `max_length`   | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~                                                                                                                                       |
| `limit`        | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~                                                                                                                                                                                           |
| `augmenter`    | Apply some simple data augmentation, where we replace tokens with variations. This is especially useful for punctuation and case replacement, to help generalize beyond corpora that don't have smart quotes, or only have smart quotes, etc. Defaults to `None`. ~~Optional[Callable]~~  |
| **CREATES**    | The corpus reader. ~~Corpus~~                                                                                                                                                                                                                                                             |
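
A minimal sketch of resolving this registered function by name and calling it with the parameters above; the path is a placeholder and the keyword signature is assumed to mirror the table:

```python
import spacy

# Resolve the reader factory from the registry, as the training config does.
make_corpus = spacy.registry.readers.get("spacy.Corpus.v1")
corpus = make_corpus(
    path="./train.spacy",  # hypothetical path to binary training data
    gold_preproc=False,
    max_length=0,
    limit=0,
    augmenter=None,
)
```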
#### spacy.JsonlCorpus.v1 {#jsonlcorpus tag="registered function"}
@@ -48,7 +48,7 @@ but do not change its part-of-speech. We say that a **lemma** (root form) is

**inflected** (modified/combined) with one or more **morphological features** to
create a surface form. Here are some examples:

| Context                                   | Surface | Lemma | POS    | Morphological Features                   |
| ----------------------------------------- | ------- | ----- | ------ | ----------------------------------------- |
| I was reading the paper                   | reading | read  | `VERB` | `VerbForm=Ger`                            |
| I don't watch the news, I read the paper  | read    | read  | `VERB` | `VerbForm=Fin`, `Mood=Ind`, `Tense=Pres`  |
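
For instance, a minimal sketch of inspecting these attributes on a token; it assumes the `en_core_web_sm` pipeline is installed, and the exact features depend on the model:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I was reading the paper")
token = doc[2]  # "reading"
print(token.text, token.lemma_, token.pos_, token.morph)
```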
@@ -430,7 +430,7 @@ for token in doc:

    print(token.text, token.pos_, token.dep_, token.head.text)
```

| Text                                | POS    | Dep     | Head text |
| ----------------------------------- | ------ | ------- | --------- |
| Credit and mortgage account holders | `NOUN` | `nsubj` | submit    |
| must                                | `VERB` | `aux`   | submit    |
@@ -158,23 +158,23 @@ The available token pattern keys correspond to a number of

[`Token` attributes](/api/token#attributes). The supported attributes for
rule-based matching are:

| Attribute                                      | Description                                                                                                                                                                                                                                                                                                  |
| ---------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `ORTH`                                         | The exact verbatim text of a token. ~~str~~                                                                                                                                                                                                                                                                    |
| `TEXT` <Tag variant="new">2.1</Tag>            | The exact verbatim text of a token. ~~str~~                                                                                                                                                                                                                                                                    |
| `NORM`                                         | The normalized form of the token text. ~~str~~                                                                                                                                                                                                                                                                 |
| `LOWER`                                        | The lowercase form of the token text. ~~str~~                                                                                                                                                                                                                                                                  |
| `LENGTH`                                       | The length of the token text. ~~int~~                                                                                                                                                                                                                                                                          |
| `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT`             | Token text consists of alphabetic characters, ASCII characters, digits. ~~bool~~                                                                                                                                                                                                                               |
| `IS_LOWER`, `IS_UPPER`, `IS_TITLE`             | Token text is in lowercase, uppercase, titlecase. ~~bool~~                                                                                                                                                                                                                                                     |
| `IS_PUNCT`, `IS_SPACE`, `IS_STOP`              | Token is punctuation, whitespace, stop word. ~~bool~~                                                                                                                                                                                                                                                          |
| `IS_SENT_START`                                | Token is start of sentence. ~~bool~~                                                                                                                                                                                                                                                                           |
| `LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL`           | Token text resembles a number, URL, email. ~~bool~~                                                                                                                                                                                                                                                            |
| `SPACY`                                        | Token has a trailing space. ~~bool~~                                                                                                                                                                                                                                                                           |
| `POS`, `TAG`, `MORPH`, `DEP`, `LEMMA`, `SHAPE` | The token's simple and extended part-of-speech tag, morphological analysis, dependency label, lemma, shape. Note that the values of these attributes are case-sensitive. For a list of available part-of-speech tags and dependency labels, see the [Annotation Specifications](/api/annotation). ~~str~~      |
| `ENT_TYPE`                                     | The token's entity label. ~~str~~                                                                                                                                                                                                                                                                              |
| `_` <Tag variant="new">2.1</Tag>               | Properties in [custom extension attributes](/usage/processing-pipelines#custom-components-attributes). ~~Dict[str, Any]~~                                                                                                                                                                                      |
| `OP`                                           | [Operator or quantifier](#quantifiers) to determine how often to match a token pattern. ~~str~~                                                                                                                                                                                                                |
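
For example, a short sketch using `OP` to match one or more adjacent digit tokens (the pattern and text are illustrative):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
# "OP": "+" matches a token with IS_DIGIT=True one or more times in a row.
matcher.add("NUMBERS", [[{"IS_DIGIT": True, "OP": "+"}]])
doc = nlp("Call 555 0199 today")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```
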
<Accordion title="Does it matter if the attribute names are uppercase or lowercase?">
@@ -132,13 +132,13 @@ your own.

> contributions for Catalan and to Kenneth Enevoldsen for Danish. For additional
> Danish pipelines, check out [DaCy](https://github.com/KennethEnevoldsen/DaCy).

| Package                                           | Language | UPOS | Parser LAS | NER F |
| ------------------------------------------------- | -------- | ---: | ---------: | ----: |
| [`ca_core_news_sm`](/models/ca#ca_core_news_sm)   | Catalan  | 98.2 |       87.4 |  79.8 |
| [`ca_core_news_md`](/models/ca#ca_core_news_md)   | Catalan  | 98.3 |       88.2 |  84.0 |
| [`ca_core_news_lg`](/models/ca#ca_core_news_lg)   | Catalan  | 98.5 |       88.4 |  84.2 |
| [`ca_core_news_trf`](/models/ca#ca_core_news_trf) | Catalan  | 98.9 |       93.0 |  91.2 |
| [`da_core_news_trf`](/models/da#da_core_news_trf) | Danish   | 98.0 |       85.0 |  82.9 |
### Resizable text classification architectures {#resizable-textcat}
@@ -116,7 +116,7 @@ import Benchmarks from 'usage/\_benchmarks-models.md'

> corpus that had both syntactic and entity annotations, so the transformer
> models for those languages do not include NER.

| Package                                         | Language | Transformer                                                                | Tagger | Parser |  NER |
| ----------------------------------------------- | -------- | -------------------------------------------------------------------------- | -----: | -----: | ---: |
| [`en_core_web_trf`](/models/en#en_core_web_trf) | English  | [`roberta-base`](https://huggingface.co/roberta-base)                       |   97.8 |   95.2 | 89.9 |
| [`de_dep_news_trf`](/models/de#de_dep_news_trf) | German   | [`bert-base-german-cased`](https://huggingface.co/bert-base-german-cased)   |   99.0 |   95.8 |    - |
@@ -856,9 +856,9 @@ attribute ruler before training using the `[initialize]` block of your config.

### Using Lexeme Tables

To use tables like `lexeme_prob` when training a model from scratch, you need to
add an entry to the `initialize` block in your config. Here's what that looks
like for the existing trained pipelines:

```ini
[initialize.lookups]
```
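
At runtime, the same tables can be loaded directly; a sketch assuming the `spacy-lookups-data` package is installed:

```python
from spacy.lookups import load_lookups

# Loads the "lexeme_prob" table for English from spacy-lookups-data; the
# training config achieves this via the [initialize.lookups] block instead.
lookups = load_lookups("en", ["lexeme_prob"])
table = lookups.get_table("lexeme_prob")
print(table.get("the"))  # log-probability of the word "the"
```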