mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-12 02:06:31 +03:00
Update tokenizer and doc init example (#3939)
* Fix Doc.to_json hyperlink
* Update tokenizer and doc init examples
* Change "matchin rules" to "punctuation rules"
* Auto-format
parent 58f06e6180
commit 205c73a589
@@ -85,13 +85,14 @@ cdef class Doc:
 Python-level `Token` and `Span` objects are views of this array, i.e.
 they don't own the data themselves.
 
-EXAMPLE: Construction 1
+EXAMPLE:
+Construction 1
 >>> doc = nlp(u'Some text')
 
 Construction 2
 >>> from spacy.tokens import Doc
 >>> doc = Doc(nlp.vocab, words=[u'hello', u'world', u'!'],
-spaces=[True, False, False])
+>>> spaces=[True, False, False])
 
 DOCS: https://spacy.io/api/doc
 """
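The "Construction 2" example passes parallel `words` and `spaces` lists, where each entry in `spaces` says whether the corresponding word is followed by a space. The sketch below models only that relationship in plain Python. It does not call spaCy, and `reconstruct_text` is a hypothetical helper introduced for illustration, not part of the library.

```python
def reconstruct_text(words, spaces):
    """Hypothetical helper (not a spaCy API): rebuild the text that a
    words/spaces pair describes."""
    if len(words) != len(spaces):
        raise ValueError("words and spaces must have the same length")
    # Append a single space after every word whose flag is True.
    return "".join(
        word + (" " if has_space else "")
        for word, has_space in zip(words, spaces)
    )


# Same values as the docstring example in the hunk above.
text = reconstruct_text(["hello", "world", "!"], [True, False, False])
print(text)  # hello world!
```

With `spaces=[True, False, False]`, only "hello" is followed by a space, so the punctuation attaches directly to "world".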
@@ -264,7 +264,7 @@ ancestor is found, e.g. if span excludes a necessary ancestor.
 | ----------- | -------------------------------------- | ----------------------------------------------- |
 | **RETURNS** | `numpy.ndarray[ndim=2, dtype='int32']` | The lowest common ancestor matrix of the `Doc`. |
 
-## Doc.to_json {#to_json, tag="method" new="2.1"}
+## Doc.to_json {#to_json tag="method" new="2.1"}
 
 Convert a Doc to JSON. The format it produces will be the new format for the
 [`spacy train`](/api/cli#train) command (not implemented yet). If custom
@@ -9,7 +9,10 @@ Segment text, and create `Doc` objects with the discovered segment boundaries.
 
 ## Tokenizer.\_\_init\_\_ {#init tag="method"}
 
-Create a `Tokenizer`, to create `Doc` objects given unicode text.
+Create a `Tokenizer`, to create `Doc` objects given unicode text. For examples
+of how to construct a custom tokenizer with different tokenization rules, see
+the
+[usage documentation](https://spacy.io/usage/linguistic-features#native-tokenizers).
 
 > #### Example
 >
@@ -18,11 +21,14 @@ Create a `Tokenizer`, to create `Doc` objects given unicode text.
 > from spacy.tokenizer import Tokenizer
 > from spacy.lang.en import English
 > nlp = English()
+> # Create a blank Tokenizer with just the English vocab
 > tokenizer = Tokenizer(nlp.vocab)
 >
 > # Construction 2
 > from spacy.lang.en import English
 > nlp = English()
+> # Create a Tokenizer with the default settings for English
+> # including punctuation rules and exceptions
 > tokenizer = nlp.Defaults.create_tokenizer(nlp)
 > ```
 
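The comments added in this hunk distinguish a blank tokenizer, built from the vocab alone, from one that carries the English punctuation rules and exceptions. The toy sketch below illustrates why that difference matters. It is plain Python, not spaCy's implementation, and it assumes that a tokenizer without rules splits on whitespace only. The names `PUNCT`, `whitespace_tokenize`, and `tokenize_with_punct_rule` are hypothetical, and the single suffix rule stands in for a much richer rule set.

```python
# Toy model only: this does not import or call spaCy.
PUNCT = ".,!?"  # assumed suffix characters, chosen for illustration


def whitespace_tokenize(text):
    """Model of a tokenizer with no rules: split on whitespace only."""
    return text.split()


def tokenize_with_punct_rule(text):
    """Model of one punctuation rule: peel trailing punctuation off
    each whitespace-delimited chunk, one character at a time."""
    tokens = []
    for chunk in text.split():
        suffixes = []
        while len(chunk) > 1 and chunk[-1] in PUNCT:
            suffixes.append(chunk[-1])
            chunk = chunk[:-1]
        tokens.append(chunk)
        # Suffixes were collected right to left; restore reading order.
        tokens.extend(reversed(suffixes))
    return tokens


print(whitespace_tokenize("Hello, world!"))       # ['Hello,', 'world!']
print(tokenize_with_punct_rule("Hello, world!"))  # ['Hello', ',', 'world', '!']
```

In this model the rule-free version leaves punctuation attached to the neighbouring word, which is the behaviour the "blank Tokenizer" comment warns about.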