mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-12 02:06:31 +03:00
Update shape docs and examples (resolves #4615) [ci skip]
parent
c9f1e99787
commit
cbacb0f1a4
|
@ -123,7 +123,7 @@ The L2 norm of the lexeme's vector representation.
|
||||||
## Attributes {#attributes}
|
## Attributes {#attributes}
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| -------------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------------ |
|
| -------------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||||
| `vocab` | `Vocab` | The lexeme's vocabulary. |
|
| `vocab` | `Vocab` | The lexeme's vocabulary. |
|
||||||
| `text` | unicode | Verbatim text content. |
|
| `text` | unicode | Verbatim text content. |
|
||||||
| `orth` | int | ID of the verbatim text content. |
|
| `orth` | int | ID of the verbatim text content. |
|
||||||
|
@ -134,8 +134,8 @@ The L2 norm of the lexeme's vector representation.
|
||||||
| `norm_` | unicode | The lexeme's norm, i.e. a normalized form of the lexeme text. |
|
| `norm_` | unicode | The lexeme's norm, i.e. a normalized form of the lexeme text. |
|
||||||
| `lower` | int | Lowercase form of the word. |
|
| `lower` | int | Lowercase form of the word. |
|
||||||
| `lower_` | unicode | Lowercase form of the word. |
|
| `lower_` | unicode | Lowercase form of the word. |
|
||||||
| `shape` | int | Transform of the word's string, to show orthographic features. |
|
| `shape` | int | Transform of the word's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. |
|
||||||
| `shape_` | unicode | Transform of the word's string, to show orthographic features. |
|
| `shape_` | unicode | Transform of the word's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. |
|
||||||
| `prefix` | int | Length-N substring from the start of the word. Defaults to `N=1`. |
|
| `prefix` | int | Length-N substring from the start of the word. Defaults to `N=1`. |
|
||||||
| `prefix_` | unicode | Length-N substring from the start of the word. Defaults to `N=1`. |
|
| `prefix_` | unicode | Length-N substring from the start of the word. Defaults to `N=1`. |
|
||||||
| `suffix` | int | Length-N substring from the end of the word. Defaults to `N=3`. |
|
| `suffix` | int | Length-N substring from the end of the word. Defaults to `N=3`. |
|
||||||
|
|
|
@ -409,7 +409,7 @@ The L2 norm of the token's vector representation.
|
||||||
## Attributes {#attributes}
|
## Attributes {#attributes}
|
||||||
|
|
||||||
| Name | Type | Description |
|
| Name | Type | Description |
|
||||||
| -------------------------------------------- | ------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
| -------------------------------------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||||
| `doc` | `Doc` | The parent document. |
|
| `doc` | `Doc` | The parent document. |
|
||||||
| `sent` <Tag variant="new">2.0.12</Tag> | `Span` | The sentence span that this token is a part of. |
|
| `sent` <Tag variant="new">2.0.12</Tag> | `Span` | The sentence span that this token is a part of. |
|
||||||
| `text` | unicode | Verbatim text content. |
|
| `text` | unicode | Verbatim text content. |
|
||||||
|
@ -437,8 +437,8 @@ The L2 norm of the token's vector representation.
|
||||||
| `norm_` | unicode | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). |
|
| `norm_` | unicode | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). |
|
||||||
| `lower` | int | Lowercase form of the token. |
|
| `lower` | int | Lowercase form of the token. |
|
||||||
| `lower_` | unicode | Lowercase form of the token text. Equivalent to `Token.text.lower()`. |
|
| `lower_` | unicode | Lowercase form of the token text. Equivalent to `Token.text.lower()`. |
|
||||||
| `shape` | int | Transform of the tokens's string, to show orthographic features. For example, "Xxxx" or "dd". |
|
| `shape` | int | Transform of the token's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. |
|
||||||
| `shape_` | unicode | Transform of the tokens's string, to show orthographic features. For example, "Xxxx" or "dd". |
|
| `shape_` | unicode | Transform of the token's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. |
|
||||||
| `prefix` | int | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. |
|
| `prefix` | int | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. |
|
||||||
| `prefix_` | unicode | A length-N substring from the start of the token. Defaults to `N=1`. |
|
| `prefix_` | unicode | A length-N substring from the start of the token. Defaults to `N=1`. |
|
||||||
| `suffix` | int | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. |
|
| `suffix` | int | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. |
|
||||||
|
|
|
@ -638,7 +638,7 @@ punctuation – depending on the
|
||||||
|
|
||||||
The `IS_DIGIT` flag is not very helpful here, because it doesn't tell us
|
The `IS_DIGIT` flag is not very helpful here, because it doesn't tell us
|
||||||
anything about the length. However, you can use the `SHAPE` flag, with each `d`
|
anything about the length. However, you can use the `SHAPE` flag, with each `d`
|
||||||
representing a digit:
|
representing a digit (up to 4 digits / characters):
|
||||||
|
|
||||||
```python
|
```python
|
||||||
[{"ORTH": "("}, {"SHAPE": "ddd"}, {"ORTH": ")"}, {"SHAPE": "dddd"},
|
[{"ORTH": "("}, {"SHAPE": "ddd"}, {"ORTH": ")"}, {"SHAPE": "dddd"},
|
||||||
|
@ -654,7 +654,7 @@ match the most common formats of
|
||||||
|
|
||||||
```python
|
```python
|
||||||
[{"ORTH": "+"}, {"ORTH": "49"}, {"ORTH": "(", "OP": "?"}, {"SHAPE": "dddd"},
|
[{"ORTH": "+"}, {"ORTH": "49"}, {"ORTH": "(", "OP": "?"}, {"SHAPE": "dddd"},
|
||||||
{"ORTH": ")", "OP": "?"}, {"SHAPE": "dddddd"}]
|
{"ORTH": ")", "OP": "?"}, {"SHAPE": "dddd", "LENGTH": 6}]
|
||||||
```
|
```
|
||||||
|
|
||||||
Depending on the formats your application needs to match, creating an extensive
|
Depending on the formats your application needs to match, creating an extensive
|
||||||
|
|
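The shape rule described in the updated attribute tables can be sketched in plain Python. This is a simplified illustration written for this note, not spaCy's own implementation: the function name `word_shape` and the handling of non-alphanumeric characters (kept unchanged) are assumptions.

```python
def word_shape(text: str) -> str:
    """Simplified sketch of the shape transform described in the docs above."""
    shape = []
    last = ""
    run = 0  # length of the current run of identical shape characters
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            # Assumption: punctuation and other characters are kept as-is.
            shape_char = char
        if shape_char == last:
            run += 1
        else:
            run = 1
            last = shape_char
        # Runs of the same shape character are cut off after length 4.
        if run <= 4:
            shape.append(shape_char)
    return "".join(shape)


print(word_shape("Apple"))
print(word_shape("12"))
print(word_shape("123456"))
```

Under this sketch, a six-digit token such as `"123456"` collapses to `"dddd"`, which is why the updated pattern pairs `{"SHAPE": "dddd"}` with `"LENGTH": 6` instead of using `{"SHAPE": "dddddd"}`.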