Update shape docs and examples (resolves #4615) [ci skip]

2025-12-24 02:23:19 +03:00 · 2019-11-23 17:16:55 +01:00
commit cbacb0f1a4
parent c9f1e99787
3 changed files with 111 additions and 111 deletions
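
The shape rule this commit documents can be illustrated with a short sketch. It assumes the transform works character by character exactly as the new attribute descriptions say; `shape_sketch` and `max_run` are hypothetical names used only for illustration, and this is not presented as spaCy's own implementation.

```python
def shape_sketch(text: str, max_run: int = 4) -> str:
    """Approximate the documented shape transform (illustrative only)."""
    out = []
    last = ""  # previously mapped character
    run = 0    # length of the current run of identical mapped characters
    for ch in text:
        if ch.isalpha():
            mapped = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            mapped = "d"
        else:
            mapped = ch  # punctuation and other characters are kept as-is
        if mapped == last:
            run += 1
        else:
            last = mapped
            run = 1
        # Keep at most `max_run` copies of the same mapped character in a row.
        if run <= max_run:
            out.append(mapped)
    return "".join(out)


print(shape_sketch("Apple"))   # Xxxxx
print(shape_sketch("49"))      # dd
print(shape_sketch("123456"))  # dddd
```

Under this rule a six-digit token has the shape `dddd`, so a pattern such as `{"SHAPE": "dddddd"}` could never match. That is why the phone number pattern in `rule-based-matching.md` switches to `{"SHAPE": "dddd", "LENGTH": 6}`.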
--- a/website/docs/api/lexeme.md
+++ b/website/docs/api/lexeme.md
@@ -123,7 +123,7 @@ The L2 norm of the lexeme's vector representation.
 ## Attributes {#attributes}

 | Name                                         | Type    | Description                                                                                                                                                                                                                                                  |
-| -------------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------------ |
+| -------------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
 | `vocab`                                      | `Vocab` | The lexeme's vocabulary.                                                                                                                                                                                                                                     |
 | `text`                                       | unicode | Verbatim text content.                                                                                                                                                                                                                                       |
 | `orth`                                       | int     | ID of the verbatim text content.                                                                                                                                                                                                                             |
@@ -134,8 +134,8 @@ The L2 norm of the lexeme's vector representation.
 | `norm_`                                      | unicode | The lexemes's norm, i.e. a normalized form of the lexeme text.                                                                                                                                                                                               |
 | `lower`                                      | int     | Lowercase form of the word.                                                                                                                                                                                                                                  |
 | `lower_`                                     | unicode | Lowercase form of the word.                                                                                                                                                                                                                                  |
-| `shape`                                      | int     | Transform of the word's string, to show orthographic features.                                               |
-| `shape_`                                     | unicode | Transform of the word's string, to show orthographic features.                                               |
+| `shape`                                      | int     | Transform of the word's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`.  |
+| `shape_`                                     | unicode | Transform of the word's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`.  |
 | `prefix`                                     | int     | Length-N substring from the start of the word. Defaults to `N=1`.                                                                                                                                                                                            |
 | `prefix_`                                    | unicode | Length-N substring from the start of the word. Defaults to `N=1`.                                                                                                                                                                                            |
 | `suffix`                                     | int     | Length-N substring from the end of the word. Defaults to `N=3`.                                                                                                                                                                                              |
--- a/website/docs/api/token.md
+++ b/website/docs/api/token.md
@@ -409,7 +409,7 @@ The L2 norm of the token's vector representation.
 ## Attributes {#attributes}

 | Name                                         | Type         | Description                                                                                                                                                                                                                                                   |
-| -------------------------------------------- | ------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| -------------------------------------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `doc`                                        | `Doc`        | The parent document.                                                                                                                                                                                                                                          |
 | `sent` <Tag variant="new">2.0.12</Tag>       | `Span`       | The sentence span that this token is a part of.                                                                                                                                                                                                               |
 | `text`                                       | unicode      | Verbatim text content.                                                                                                                                                                                                                                        |
@@ -437,8 +437,8 @@ The L2 norm of the token's vector representation.
 | `norm_`                                      | unicode      | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions).                                 |
 | `lower`                                      | int          | Lowercase form of the token.                                                                                                                                                                                                                                  |
 | `lower_`                                     | unicode      | Lowercase form of the token text. Equivalent to `Token.text.lower()`.                                                                                                                                                                                         |
-| `shape`                                      | int          | Transform of the tokens's string, to show orthographic features. For example, "Xxxx" or "dd".                                                                                                                                 |
-| `shape_`                                     | unicode      | Transform of the tokens's string, to show orthographic features. For example, "Xxxx" or "dd".                                                                                                                                 |
+| `shape`                                      | int          | Transform of the token's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`.  |
+| `shape_`                                     | unicode      | Transform of the token's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`.  |
 | `prefix`                                     | int          | Hash value of a length-N substring from the start of the token. Defaults to `N=1`.                                                                                                                                                                            |
 | `prefix_`                                    | unicode      | A length-N substring from the start of the token. Defaults to `N=1`.                                                                                                                                                                                          |
 | `suffix`                                     | int          | Hash value of a length-N substring from the end of the token. Defaults to `N=3`.                                                                                                                                                                              |
--- a/website/docs/usage/rule-based-matching.md
+++ b/website/docs/usage/rule-based-matching.md
@@ -638,7 +638,7 @@ punctuation – depending on the

 The `IS_DIGIT` flag is not very helpful here, because it doesn't tell us
 anything about the length. However, you can use the `SHAPE` flag, with each `d`
-representing a digit:
+representing a digit (a shape holds at most 4 of the same character in a row):

 ```python
 [{"ORTH": "("}, {"SHAPE": "ddd"}, {"ORTH": ")"}, {"SHAPE": "dddd"},
@@ -654,7 +654,7 @@ match the most common formats of

 ```python
 [{"ORTH": "+"}, {"ORTH": "49"}, {"ORTH": "(", "OP": "?"}, {"SHAPE": "dddd"},
- {"ORTH": ")", "OP": "?"}, {"SHAPE": "dddddd"}]
+ {"ORTH": ")", "OP": "?"}, {"SHAPE": "dddd", "LENGTH": 6}]
 ```

 Depending on the formats your application needs to match, creating an extensive