Update shape docs and examples (resolves #4615) [ci skip]

Ines Montani 2019-11-23 17:16:55 +01:00
parent c9f1e99787
commit cbacb0f1a4
3 changed files with 111 additions and 111 deletions


@@ -122,44 +122,44 @@ The L2 norm of the lexeme's vector representation.
## Attributes {#attributes}
| Name | Type | Description |
| -------------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------------ |
| `vocab` | `Vocab` | The lexeme's vocabulary. |
| `text` | unicode | Verbatim text content. |
| `orth` | int | ID of the verbatim text content. |
| `orth_` | unicode | Verbatim text content (identical to `Lexeme.text`). Exists mostly for consistency with the other attributes. |
| `rank` | int | Sequential ID of the lexeme's lexical type, used to index into tables, e.g. for word vectors. |
| `flags` | int | Container of the lexeme's binary flags. |
| `norm` | int | The lexeme's norm, i.e. a normalized form of the lexeme text. |
| `norm_` | unicode | The lexeme's norm, i.e. a normalized form of the lexeme text. |
| `lower` | int | Lowercase form of the word. |
| `lower_` | unicode | Lowercase form of the word. |
| `shape` | int | Transform of the word's string, to show orthographic features. |
| `shape_` | unicode | Transform of the word's string, to show orthographic features. |
| `prefix` | int | Length-N substring from the start of the word. Defaults to `N=1`. |
| `prefix_` | unicode | Length-N substring from the start of the word. Defaults to `N=1`. |
| `suffix` | int | Length-N substring from the end of the word. Defaults to `N=3`. |
| `suffix_` | unicode | Length-N substring from the end of the word. Defaults to `N=3`. |
| `is_alpha` | bool | Does the lexeme consist of alphabetic characters? Equivalent to `lexeme.text.isalpha()`. |
| `is_ascii` | bool | Does the lexeme consist of ASCII characters? Equivalent to `all(ord(c) < 128 for c in lexeme.text)`. |
| `is_digit` | bool | Does the lexeme consist of digits? Equivalent to `lexeme.text.isdigit()`. |
| `is_lower` | bool | Is the lexeme in lowercase? Equivalent to `lexeme.text.islower()`. |
| `is_upper` | bool | Is the lexeme in uppercase? Equivalent to `lexeme.text.isupper()`. |
| `is_title` | bool | Is the lexeme in titlecase? Equivalent to `lexeme.text.istitle()`. |
| `is_punct` | bool | Is the lexeme punctuation? |
| `is_left_punct` | bool | Is the lexeme a left punctuation mark, e.g. `(`? |
| `is_right_punct` | bool | Is the lexeme a right punctuation mark, e.g. `)`? |
| `is_space` | bool | Does the lexeme consist of whitespace characters? Equivalent to `lexeme.text.isspace()`. |
| `is_bracket` | bool | Is the lexeme a bracket? |
| `is_quote` | bool | Is the lexeme a quotation mark? |
| `is_currency` <Tag variant="new">2.0.8</Tag> | bool | Is the lexeme a currency symbol? |
| `like_url` | bool | Does the lexeme resemble a URL? |
| `like_num` | bool | Does the lexeme represent a number? e.g. "10.9", "10", "ten", etc. |
| `like_email` | bool | Does the lexeme resemble an email address? |
| `is_oov` | bool | Is the lexeme out-of-vocabulary? |
| `is_stop` | bool | Is the lexeme part of a "stop list"? |
| `lang` | int | Language of the parent vocabulary. |
| `lang_` | unicode | Language of the parent vocabulary. |
| `prob` | float | Smoothed log probability estimate of the lexeme's word type (context-independent entry in the vocabulary). |
| `cluster` | int | Brown cluster ID. |
| `sentiment` | float | A scalar value indicating the positivity or negativity of the lexeme. |
| Name | Type | Description |
| -------------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `vocab` | `Vocab` | The lexeme's vocabulary. |
| `text` | unicode | Verbatim text content. |
| `orth` | int | ID of the verbatim text content. |
| `orth_` | unicode | Verbatim text content (identical to `Lexeme.text`). Exists mostly for consistency with the other attributes. |
| `rank` | int | Sequential ID of the lexeme's lexical type, used to index into tables, e.g. for word vectors. |
| `flags` | int | Container of the lexeme's binary flags. |
| `norm` | int | The lexeme's norm, i.e. a normalized form of the lexeme text. |
| `norm_` | unicode | The lexeme's norm, i.e. a normalized form of the lexeme text. |
| `lower` | int | Lowercase form of the word. |
| `lower_` | unicode | Lowercase form of the word. |
| `shape` | int | Transform of the word's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. |
| `shape_` | unicode | Transform of the word's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. |
| `prefix` | int | Length-N substring from the start of the word. Defaults to `N=1`. |
| `prefix_` | unicode | Length-N substring from the start of the word. Defaults to `N=1`. |
| `suffix` | int | Length-N substring from the end of the word. Defaults to `N=3`. |
| `suffix_` | unicode | Length-N substring from the end of the word. Defaults to `N=3`. |
| `is_alpha` | bool | Does the lexeme consist of alphabetic characters? Equivalent to `lexeme.text.isalpha()`. |
| `is_ascii` | bool | Does the lexeme consist of ASCII characters? Equivalent to `all(ord(c) < 128 for c in lexeme.text)`. |
| `is_digit` | bool | Does the lexeme consist of digits? Equivalent to `lexeme.text.isdigit()`. |
| `is_lower` | bool | Is the lexeme in lowercase? Equivalent to `lexeme.text.islower()`. |
| `is_upper` | bool | Is the lexeme in uppercase? Equivalent to `lexeme.text.isupper()`. |
| `is_title` | bool | Is the lexeme in titlecase? Equivalent to `lexeme.text.istitle()`. |
| `is_punct` | bool | Is the lexeme punctuation? |
| `is_left_punct` | bool | Is the lexeme a left punctuation mark, e.g. `(`? |
| `is_right_punct` | bool | Is the lexeme a right punctuation mark, e.g. `)`? |
| `is_space` | bool | Does the lexeme consist of whitespace characters? Equivalent to `lexeme.text.isspace()`. |
| `is_bracket` | bool | Is the lexeme a bracket? |
| `is_quote` | bool | Is the lexeme a quotation mark? |
| `is_currency` <Tag variant="new">2.0.8</Tag> | bool | Is the lexeme a currency symbol? |
| `like_url` | bool | Does the lexeme resemble a URL? |
| `like_num` | bool | Does the lexeme represent a number? e.g. "10.9", "10", "ten", etc. |
| `like_email` | bool | Does the lexeme resemble an email address? |
| `is_oov` | bool | Is the lexeme out-of-vocabulary? |
| `is_stop` | bool | Is the lexeme part of a "stop list"? |
| `lang` | int | Language of the parent vocabulary. |
| `lang_` | unicode | Language of the parent vocabulary. |
| `prob` | float | Smoothed log probability estimate of the lexeme's word type (context-independent entry in the vocabulary). |
| `cluster` | int | Brown cluster ID. |
| `sentiment` | float | A scalar value indicating the positivity or negativity of the lexeme. |
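
For reference, the updated `shape`/`shape_` description can be checked directly on `Lexeme` objects. The following is a minimal sketch, assuming only a blank English pipeline; the example words are illustrative and not part of the docs:

```python
import spacy

nlp = spacy.blank("en")  # lexical attributes like shape don't require a trained model

for word in ["Apple", "apple", "12345", "U.K."]:
    lexeme = nlp.vocab[word]  # looking up a string in the vocab returns a Lexeme
    # shape_ maps letters to x/X and digits to d; runs of the same character
    # are cut off after length 4, so "12345" comes out as "dddd"
    print(lexeme.text, lexeme.shape_, lexeme.prefix_, lexeme.suffix_, lexeme.is_alpha)
```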


@@ -408,71 +408,71 @@ The L2 norm of the token's vector representation.
## Attributes {#attributes}
| Name | Type | Description |
| -------------------------------------------- | ------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `doc` | `Doc` | The parent document. |
| `sent` <Tag variant="new">2.0.12</Tag> | `Span` | The sentence span that this token is a part of. |
| `text` | unicode | Verbatim text content. |
| `text_with_ws` | unicode | Text content, with trailing space character if present. |
| `whitespace_` | unicode | Trailing space character if present. |
| `orth` | int | ID of the verbatim text content. |
| `orth_` | unicode | Verbatim text content (identical to `Token.text`). Exists mostly for consistency with the other attributes. |
| `vocab` | `Vocab` | The vocab object of the parent `Doc`. |
| `tensor` <Tag variant="new">2.1.7</Tag> | `ndarray` | The token's slice of the parent `Doc`'s tensor. |
| `head` | `Token` | The syntactic parent, or "governor", of this token. |
| `left_edge` | `Token` | The leftmost token of this token's syntactic descendants. |
| `right_edge` | `Token` | The rightmost token of this token's syntactic descendants. |
| `i` | int | The index of the token within the parent document. |
| `ent_type` | int | Named entity type. |
| `ent_type_` | unicode | Named entity type. |
| `ent_iob` | int | IOB code of named entity tag. `3` means the token begins an entity, `2` means it is outside an entity, `1` means it is inside an entity, and `0` means no entity tag is set. |
| `ent_iob_` | unicode | IOB code of named entity tag. "B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set. |
| `ent_kb_id` <Tag variant="new">2.2</Tag> | int | Knowledge base ID that refers to the named entity this token is a part of, if any. |
| `ent_kb_id_` <Tag variant="new">2.2</Tag> | unicode | Knowledge base ID that refers to the named entity this token is a part of, if any. |
| `ent_id` | int | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. |
| `ent_id_` | unicode | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. |
| `lemma` | int | Base form of the token, with no inflectional suffixes. |
| `lemma_` | unicode | Base form of the token, with no inflectional suffixes. |
| `norm` | int | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). |
| `norm_` | unicode | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). |
| `lower` | int | Lowercase form of the token. |
| `lower_` | unicode | Lowercase form of the token text. Equivalent to `Token.text.lower()`. |
| `shape` | int | Transform of the token's string, to show orthographic features. For example, "Xxxx" or "dd". |
| `shape_` | unicode | Transform of the token's string, to show orthographic features. For example, "Xxxx" or "dd". |
| `prefix` | int | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. |
| `prefix_` | unicode | A length-N substring from the start of the token. Defaults to `N=1`. |
| `suffix` | int | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. |
| `suffix_` | unicode | Length-N substring from the end of the token. Defaults to `N=3`. |
| `is_alpha` | bool | Does the token consist of alphabetic characters? Equivalent to `token.text.isalpha()`. |
| `is_ascii` | bool | Does the token consist of ASCII characters? Equivalent to `all(ord(c) < 128 for c in token.text)`. |
| `is_digit` | bool | Does the token consist of digits? Equivalent to `token.text.isdigit()`. |
| `is_lower` | bool | Is the token in lowercase? Equivalent to `token.text.islower()`. |
| `is_upper` | bool | Is the token in uppercase? Equivalent to `token.text.isupper()`. |
| `is_title` | bool | Is the token in titlecase? Equivalent to `token.text.istitle()`. |
| `is_punct` | bool | Is the token punctuation? |
| `is_left_punct` | bool | Is the token a left punctuation mark, e.g. `(`? |
| `is_right_punct` | bool | Is the token a right punctuation mark, e.g. `)`? |
| `is_space` | bool | Does the token consist of whitespace characters? Equivalent to `token.text.isspace()`. |
| `is_bracket` | bool | Is the token a bracket? |
| `is_quote` | bool | Is the token a quotation mark? |
| `is_currency` <Tag variant="new">2.0.8</Tag> | bool | Is the token a currency symbol? |
| `like_url` | bool | Does the token resemble a URL? |
| `like_num` | bool | Does the token represent a number? e.g. "10.9", "10", "ten", etc. |
| `like_email` | bool | Does the token resemble an email address? |
| `is_oov` | bool | Is the token out-of-vocabulary? |
| `is_stop` | bool | Is the token part of a "stop list"? |
| `pos` | int | Coarse-grained part-of-speech. |
| `pos_` | unicode | Coarse-grained part-of-speech. |
| `tag` | int | Fine-grained part-of-speech. |
| `tag_` | unicode | Fine-grained part-of-speech. |
| `dep` | int | Syntactic dependency relation. |
| `dep_` | unicode | Syntactic dependency relation. |
| `lang` | int | Language of the parent document's vocabulary. |
| `lang_` | unicode | Language of the parent document's vocabulary. |
| `prob` | float | Smoothed log probability estimate of the token's word type (context-independent entry in the vocabulary). |
| `idx` | int | The character offset of the token within the parent document. |
| `sentiment` | float | A scalar value indicating the positivity or negativity of the token. |
| `lex_id` | int | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. |
| `rank` | int | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. |
| `cluster` | int | Brown cluster ID. |
| `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). |
| Name | Type | Description |
| -------------------------------------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `doc` | `Doc` | The parent document. |
| `sent` <Tag variant="new">2.0.12</Tag> | `Span` | The sentence span that this token is a part of. |
| `text` | unicode | Verbatim text content. |
| `text_with_ws` | unicode | Text content, with trailing space character if present. |
| `whitespace_` | unicode | Trailing space character if present. |
| `orth` | int | ID of the verbatim text content. |
| `orth_` | unicode | Verbatim text content (identical to `Token.text`). Exists mostly for consistency with the other attributes. |
| `vocab` | `Vocab` | The vocab object of the parent `Doc`. |
| `tensor` <Tag variant="new">2.1.7</Tag> | `ndarray` | The token's slice of the parent `Doc`'s tensor. |
| `head` | `Token` | The syntactic parent, or "governor", of this token. |
| `left_edge` | `Token` | The leftmost token of this token's syntactic descendants. |
| `right_edge` | `Token` | The rightmost token of this token's syntactic descendants. |
| `i` | int | The index of the token within the parent document. |
| `ent_type` | int | Named entity type. |
| `ent_type_` | unicode | Named entity type. |
| `ent_iob` | int | IOB code of named entity tag. `3` means the token begins an entity, `2` means it is outside an entity, `1` means it is inside an entity, and `0` means no entity tag is set. |
| `ent_iob_` | unicode | IOB code of named entity tag. "B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set. |
| `ent_kb_id` <Tag variant="new">2.2</Tag> | int | Knowledge base ID that refers to the named entity this token is a part of, if any. |
| `ent_kb_id_` <Tag variant="new">2.2</Tag> | unicode | Knowledge base ID that refers to the named entity this token is a part of, if any. |
| `ent_id` | int | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. |
| `ent_id_` | unicode | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. |
| `lemma` | int | Base form of the token, with no inflectional suffixes. |
| `lemma_` | unicode | Base form of the token, with no inflectional suffixes. |
| `norm` | int | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). |
| `norm_` | unicode | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). |
| `lower` | int | Lowercase form of the token. |
| `lower_` | unicode | Lowercase form of the token text. Equivalent to `Token.text.lower()`. |
| `shape` | int | Transform of the token's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. |
| `shape_` | unicode | Transform of the token's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. |
| `prefix` | int | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. |
| `prefix_` | unicode | A length-N substring from the start of the token. Defaults to `N=1`. |
| `suffix` | int | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. |
| `suffix_` | unicode | Length-N substring from the end of the token. Defaults to `N=3`. |
| `is_alpha` | bool | Does the token consist of alphabetic characters? Equivalent to `token.text.isalpha()`. |
| `is_ascii` | bool | Does the token consist of ASCII characters? Equivalent to `all(ord(c) < 128 for c in token.text)`. |
| `is_digit` | bool | Does the token consist of digits? Equivalent to `token.text.isdigit()`. |
| `is_lower` | bool | Is the token in lowercase? Equivalent to `token.text.islower()`. |
| `is_upper` | bool | Is the token in uppercase? Equivalent to `token.text.isupper()`. |
| `is_title` | bool | Is the token in titlecase? Equivalent to `token.text.istitle()`. |
| `is_punct` | bool | Is the token punctuation? |
| `is_left_punct` | bool | Is the token a left punctuation mark, e.g. `(`? |
| `is_right_punct` | bool | Is the token a right punctuation mark, e.g. `)`? |
| `is_space` | bool | Does the token consist of whitespace characters? Equivalent to `token.text.isspace()`. |
| `is_bracket` | bool | Is the token a bracket? |
| `is_quote` | bool | Is the token a quotation mark? |
| `is_currency` <Tag variant="new">2.0.8</Tag> | bool | Is the token a currency symbol? |
| `like_url` | bool | Does the token resemble a URL? |
| `like_num` | bool | Does the token represent a number? e.g. "10.9", "10", "ten", etc. |
| `like_email` | bool | Does the token resemble an email address? |
| `is_oov` | bool | Is the token out-of-vocabulary? |
| `is_stop` | bool | Is the token part of a "stop list"? |
| `pos` | int | Coarse-grained part-of-speech. |
| `pos_` | unicode | Coarse-grained part-of-speech. |
| `tag` | int | Fine-grained part-of-speech. |
| `tag_` | unicode | Fine-grained part-of-speech. |
| `dep` | int | Syntactic dependency relation. |
| `dep_` | unicode | Syntactic dependency relation. |
| `lang` | int | Language of the parent document's vocabulary. |
| `lang_` | unicode | Language of the parent document's vocabulary. |
| `prob` | float | Smoothed log probability estimate of the token's word type (context-independent entry in the vocabulary). |
| `idx` | int | The character offset of the token within the parent document. |
| `sentiment` | float | A scalar value indicating the positivity or negativity of the token. |
| `lex_id` | int | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. |
| `rank` | int | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. |
| `cluster` | int | Brown cluster ID. |
| `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). |
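
The same truncation behavior applies to `Token.shape` / `Token.shape_`. Below is a hedged sketch using a blank English pipeline and the example sentence from the spaCy 101 docs; the shapes in the comment are the values listed in that table:

```python
import spacy

nlp = spacy.blank("en")  # shape is a lexical attribute, so no trained model is needed
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    # e.g. "Apple" -> "Xxxxx", "U.K." -> "X.X.", "$" -> "$", "1" -> "d", "billion" -> "xxxx"
    print(token.text, token.shape_)
```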


@@ -638,7 +638,7 @@ punctuation depending on the
The `IS_DIGIT` flag is not very helpful here, because it doesn't tell us
anything about the length. However, you can use the `SHAPE` flag, with each `d`
representing a digit:
representing a digit (up to 4 digits / characters):
```python
[{"ORTH": "("}, {"SHAPE": "ddd"}, {"ORTH": ")"}, {"SHAPE": "dddd"},
@@ -654,7 +654,7 @@ match the most common formats of
```python
[{"ORTH": "+"}, {"ORTH": "49"}, {"ORTH": "(", "OP": "?"}, {"SHAPE": "dddd"},
{"ORTH": ")", "OP": "?"}, {"SHAPE": "dddddd"}]
{"ORTH": ")", "OP": "?"}, {"SHAPE": "dddd", "LENGTH": 6}]
```
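
To see how a shape-based pattern behaves once the length-4 truncation is taken into account, here is a rough sketch of wiring a pattern into the `Matcher`; the pattern and example text are hypothetical, and the call uses the v2-style `matcher.add(key, None, pattern)`:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Hypothetical US-style pattern "(ddd) dddd dddd". SHAPE values never contain
# more than four identical characters, so longer digit runs still come out as
# "dddd" (optionally combined with LENGTH, as in the pattern above).
pattern = [{"ORTH": "("}, {"SHAPE": "ddd"}, {"ORTH": ")"},
           {"SHAPE": "dddd"}, {"SHAPE": "dddd"}]
matcher.add("PHONE_NUMBER", None, pattern)

doc = nlp("Call me at (123) 4567 8901 tomorrow")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # expected match: "(123) 4567 8901"
```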
Depending on the formats your application needs to match, creating an extensive