Update shape docs and examples (resolves #4615) [ci skip]

Ines Montani 2019-11-23 17:16:55 +01:00
parent c9f1e99787
commit cbacb0f1a4
3 changed files with 111 additions and 111 deletions


@ -123,7 +123,7 @@ The L2 norm of the lexeme's vector representation.
## Attributes {#attributes} ## Attributes {#attributes}
| Name | Type | Description | | Name | Type | Description |
| -------------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------------ | | -------------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `vocab` | `Vocab` | The lexeme's vocabulary. | | `vocab` | `Vocab` | The lexeme's vocabulary. |
| `text` | unicode | Verbatim text content. | | `text` | unicode | Verbatim text content. |
| `orth` | int | ID of the verbatim text content. | | `orth` | int | ID of the verbatim text content. |
@ -134,8 +134,8 @@ The L2 norm of the lexeme's vector representation.
| `norm_` | unicode | The lexeme's norm, i.e. a normalized form of the lexeme text. | | `norm_` | unicode | The lexeme's norm, i.e. a normalized form of the lexeme text. |
| `lower` | int | Lowercase form of the word. | | `lower` | int | Lowercase form of the word. |
| `lower_` | unicode | Lowercase form of the word. | | `lower_` | unicode | Lowercase form of the word. |
| `shape` | int | Transform of the word's string, to show orthographic features. | | `shape` | int | Transform of the word's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. |
| `shape_` | unicode | Transform of the word's string, to show orthographic features. | | `shape_` | unicode | Transform of the word's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. |
| `prefix` | int | Length-N substring from the start of the word. Defaults to `N=1`. | | `prefix` | int | Length-N substring from the start of the word. Defaults to `N=1`. |
| `prefix_` | unicode | Length-N substring from the start of the word. Defaults to `N=1`. | | `prefix_` | unicode | Length-N substring from the start of the word. Defaults to `N=1`. |
| `suffix` | int | Length-N substring from the end of the word. Defaults to `N=3`. | | `suffix` | int | Length-N substring from the end of the word. Defaults to `N=3`. |
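
The shape transform described in the updated `shape` / `shape_` rows can be sketched in plain Python. This is only an illustration of the documented rules, not the library's implementation: `sketch_shape` is a hypothetical helper, and passing non-alphanumeric characters through unchanged is an assumption the table does not state.

```python
def sketch_shape(text):
    """Approximate the documented shape rules (hypothetical helper)."""
    shape = []
    last = ""
    run = 0  # length of the current run of identical shape characters
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            # Assumption: other characters (punctuation etc.) are kept as-is.
            shape_char = char
        if shape_char == last:
            run += 1
        else:
            last = shape_char
            run = 1
        # Runs of the same shape character are cut off after length 4.
        if run <= 4:
            shape.append(shape_char)
    return "".join(shape)


for word in ["Word", "12", "123456"]:
    print(word, sketch_shape(word))
```

Under these rules a six-digit string and a four-digit string end up with the same shape, which is why the length has to be checked separately when it matters.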


@ -409,7 +409,7 @@ The L2 norm of the token's vector representation.
## Attributes {#attributes} ## Attributes {#attributes}
| Name | Type | Description | | Name | Type | Description |
| -------------------------------------------- | ------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | -------------------------------------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `doc` | `Doc` | The parent document. | | `doc` | `Doc` | The parent document. |
| `sent` <Tag variant="new">2.0.12</Tag> | `Span` | The sentence span that this token is a part of. | | `sent` <Tag variant="new">2.0.12</Tag> | `Span` | The sentence span that this token is a part of. |
| `text` | unicode | Verbatim text content. | | `text` | unicode | Verbatim text content. |
@ -437,8 +437,8 @@ The L2 norm of the token's vector representation.
| `norm_` | unicode | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). | | `norm_` | unicode | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). |
| `lower` | int | Lowercase form of the token. | | `lower` | int | Lowercase form of the token. |
| `lower_` | unicode | Lowercase form of the token text. Equivalent to `Token.text.lower()`. | | `lower_` | unicode | Lowercase form of the token text. Equivalent to `Token.text.lower()`. |
| `shape` | int | Transform of the tokens's string, to show orthographic features. For example, "Xxxx" or "dd". | | `shape` | int | Transform of the token's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. |
| `shape_` | unicode | Transform of the tokens's string, to show orthographic features. For example, "Xxxx" or "dd". | | `shape_` | unicode | Transform of the token's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. |
| `prefix` | int | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. | | `prefix` | int | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. |
| `prefix_` | unicode | A length-N substring from the start of the token. Defaults to `N=1`. | | `prefix_` | unicode | A length-N substring from the start of the token. Defaults to `N=1`. |
| `suffix` | int | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. | | `suffix` | int | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. |


@ -638,7 +638,7 @@ punctuation depending on the
The `IS_DIGIT` flag is not very helpful here, because it doesn't tell us The `IS_DIGIT` flag is not very helpful here, because it doesn't tell us
anything about the length. However, you can use the `SHAPE` flag, with each `d` anything about the length. However, you can use the `SHAPE` flag, with each `d`
representing a digit: representing a digit (shapes are truncated after 4 identical characters, so the longest digit shape is `dddd`):
```python ```python
[{"ORTH": "("}, {"SHAPE": "ddd"}, {"ORTH": ")"}, {"SHAPE": "dddd"}, [{"ORTH": "("}, {"SHAPE": "ddd"}, {"ORTH": ")"}, {"SHAPE": "dddd"},
@ -654,7 +654,7 @@ match the most common formats of
```python ```python
[{"ORTH": "+"}, {"ORTH": "49"}, {"ORTH": "(", "OP": "?"}, {"SHAPE": "dddd"}, [{"ORTH": "+"}, {"ORTH": "49"}, {"ORTH": "(", "OP": "?"}, {"SHAPE": "dddd"},
{"ORTH": ")", "OP": "?"}, {"SHAPE": "dddddd"}] {"ORTH": ")", "OP": "?"}, {"SHAPE": "dddd", "LENGTH": 6}]
``` ```
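
The change from `{"SHAPE": "dddddd"}` to `{"SHAPE": "dddd", "LENGTH": 6}` follows from the truncation rule: a six-digit token has the shape `dddd`, so a six-`d` shape can never match. The toy check below illustrates this in plain Python. It is not the `Matcher` itself; `shape_of` and `token_matches` are hypothetical helpers written for this sketch, and only the three attributes used in the patterns above are handled.

```python
def shape_of(text):
    """Hypothetical shape helper: x/X for letters, d for digits, runs cut at 4."""
    out, last, run = [], "", 0
    for char in text:
        if char.isalpha():
            mapped = "X" if char.isupper() else "x"
        elif char.isdigit():
            mapped = "d"
        else:
            mapped = char  # assumption: other characters are kept unchanged
        run = run + 1 if mapped == last else 1
        last = mapped
        if run <= 4:
            out.append(mapped)
    return "".join(out)


def token_matches(text, spec):
    """Check one token string against one pattern dict (toy version)."""
    getters = {"ORTH": lambda t: t, "SHAPE": shape_of, "LENGTH": len}
    return all(getters[attr](text) == value for attr, value in spec.items())


# The old pattern can never match, because no shape contains six d's in a row.
print(token_matches("123456", {"SHAPE": "dddddd"}))
# The new pattern matches six digits and rejects five.
print(token_matches("123456", {"SHAPE": "dddd", "LENGTH": 6}))
print(token_matches("12345", {"SHAPE": "dddd", "LENGTH": 6}))
```

Combining `SHAPE` with `LENGTH` keeps the "digits only" constraint while restoring the exact-length check that the truncated shape alone cannot express.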
Depending on the formats your application needs to match, creating an extensive Depending on the formats your application needs to match, creating an extensive