mirror of
				https://github.com/explosion/spaCy.git
				synced 2025-10-31 07:57:35 +03:00 
			
		
		
		
	Update shape docs and examples (resolves #4615) [ci skip]
parent c9f1e99787
commit cbacb0f1a4
|  | @@ -123,7 +123,7 @@ The L2 norm of the lexeme's vector representation. | |||
| ## Attributes {#attributes} | ||||
| 
 | ||||
| | Name                                         | Type    | Description                                                                                                                                                                                                                                                  | | ||||
| | -------------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------------ | | ||||
| | -------------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | ||||
| | `vocab`                                      | `Vocab` | The lexeme's vocabulary.                                                                                                                                                                                                                                     | | ||||
| | `text`                                       | unicode | Verbatim text content.                                                                                                                                                                                                                                       | | ||||
| | `orth`                                       | int     | ID of the verbatim text content.                                                                                                                                                                                                                             | | ||||
|  | @ -134,8 +134,8 @@ The L2 norm of the lexeme's vector representation. | |||
| | `norm_`                                      | unicode | The lexemes's norm, i.e. a normalized form of the lexeme text.                                                                                                                                                                                               | | ||||
| | `lower`                                      | int     | Lowercase form of the word.                                                                                                                                                                                                                                  | | ||||
| | `lower_`                                     | unicode | Lowercase form of the word.                                                                                                                                                                                                                                  | | ||||
| | `shape`                                      | int     | Transform of the word's string, to show orthographic features.                                               | | ||||
| | `shape_`                                     | unicode | Transform of the word's string, to show orthographic features.                                               | | ||||
| | `shape`                                      | int     | Transform of the word's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. | | ||||
| | `shape_`                                     | unicode | Transform of the word's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. | | ||||
| | `prefix`                                     | int     | Length-N substring from the start of the word. Defaults to `N=1`.                                                                                                                                                                                            | | ||||
| | `prefix_`                                    | unicode | Length-N substring from the start of the word. Defaults to `N=1`.                                                                                                                                                                                            | | ||||
| | `suffix`                                     | int     | Length-N substring from the end of the word. Defaults to `N=3`.                                                                                                                                                                                              | | ||||
|  |  | |||
|  | @@ -409,7 +409,7 @@ The L2 norm of the token's vector representation. | |||
| ## Attributes {#attributes} | ||||
| 
 | ||||
| | Name                                         | Type         | Description                                                                                                                                                                                                                                                   | | ||||
| | -------------------------------------------- | ------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ||||
| | -------------------------------------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ||||
| | `doc`                                        | `Doc`        | The parent document.                                                                                                                                                                                                                                          | | ||||
| | `sent` <Tag variant="new">2.0.12</Tag>       | `Span`       | The sentence span that this token is a part of.                                                                                                                                                                                                               | | ||||
| | `text`                                       | unicode      | Verbatim text content.                                                                                                                                                                                                                                        | | ||||
|  | @@ -437,8 +437,8 @@ The L2 norm of the token's vector representation. | |||
| | `norm_`                                      | unicode      | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions).                                 | | ||||
| | `lower`                                      | int          | Lowercase form of the token.                                                                                                                                                                                                                                  | | ||||
| | `lower_`                                     | unicode      | Lowercase form of the token text. Equivalent to `Token.text.lower()`.                                                                                                                                                                                         | | ||||
| | `shape`                                      | int          | Transform of the token's string, to show orthographic features. For example, "Xxxx" or "dd".                                                                                                                                  | | ||||
| | `shape_`                                     | unicode      | Transform of the token's string, to show orthographic features. For example, "Xxxx" or "dd".                                                                                                                                  | | ||||
| | `shape`                                      | int          | Transform of the token's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. | | ||||
| | `shape_`                                     | unicode      | Transform of the token's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. | | ||||
| | `prefix`                                     | int          | Hash value of a length-N substring from the start of the token. Defaults to `N=1`.                                                                                                                                                                            | | ||||
| | `prefix_`                                    | unicode      | A length-N substring from the start of the token. Defaults to `N=1`.                                                                                                                                                                                          | | ||||
| | `suffix`                                     | int          | Hash value of a length-N substring from the end of the token. Defaults to `N=3`.                                                                                                                                                                              | | ||||
|  |  | |||
|  | @@ -638,7 +638,7 @@ punctuation – depending on the | |||
| 
 | ||||
| The `IS_DIGIT` flag is not very helpful here, because it doesn't tell us | ||||
| anything about the length. However, you can use the `SHAPE` flag, with each `d` | ||||
| representing a digit: | ||||
| representing a digit (up to 4 digits / characters): | ||||
| 
 | ||||
| ```python | ||||
| [{"ORTH": "("}, {"SHAPE": "ddd"}, {"ORTH": ")"}, {"SHAPE": "dddd"}, | ||||
|  | @@ -654,7 +654,7 @@ match the most common formats of | |||
| 
 | ||||
| ```python | ||||
| [{"ORTH": "+"}, {"ORTH": "49"}, {"ORTH": "(", "OP": "?"}, {"SHAPE": "dddd"}, | ||||
|  {"ORTH": ")", "OP": "?"}, {"SHAPE": "dddddd"}] | ||||
|  {"ORTH": ")", "OP": "?"}, {"SHAPE": "dddd", "LENGTH": 6}] | ||||
| ``` | ||||
| 
 | ||||
| Depending on the formats your application needs to match, creating an extensive | ||||
|  |  | |||
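
The truncation rule described in the updated attribute tables explains why the second pattern pairs `"SHAPE": "dddd"` with `"LENGTH": 6`. The following is a minimal sketch of that rule in plain Python, written only from the description in this diff. The function name `shape_sketch` is hypothetical and is not part of spaCy's API, and the library's own implementation may handle additional cases not covered here.

```python
def shape_sketch(text: str) -> str:
    """Approximate the documented shape transform (illustrative only)."""
    out = []
    last = ""  # shape character emitted for the previous input character
    run = 0    # length of the current run of identical shape characters
    for ch in text:
        if ch.isalpha():
            s = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            s = "d"
        else:
            s = ch  # punctuation and other characters are kept as-is
        run = run + 1 if s == last else 1
        last = s
        # Runs of the same shape character are truncated after length 4.
        if run <= 4:
            out.append(s)
    return "".join(out)


print(shape_sketch("Apple"))   # Xxxxx
print(shape_sketch("123456"))  # dddd
print(shape_sketch("(030)"))   # (ddd)
```

Under this sketch, a six-digit token and a four-digit token both map to `dddd`, so a pattern that should match only the six-digit case needs an extra constraint such as `"LENGTH": 6`.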