Update tokenizer and doc init example (#3939)

* Fix Doc.to_json hyperlink

* Update tokenizer and doc init examples

* Change "matchin rules" to "punctuation rules"

* Auto-format
Björn Böing 2019-07-10 10:16:48 +02:00 committed by Ines Montani
parent 58f06e6180
commit 205c73a589
3 changed files with 11 additions and 4 deletions


@@ -85,13 +85,14 @@ cdef class Doc:
 Python-level `Token` and `Span` objects are views of this array, i.e.
 they don't own the data themselves.
-EXAMPLE: Construction 1
+EXAMPLE:
+Construction 1
 >>> doc = nlp(u'Some text')
 Construction 2
 >>> from spacy.tokens import Doc
 >>> doc = Doc(nlp.vocab, words=[u'hello', u'world', u'!'],
-spaces=[True, False, False])
+>>> spaces=[True, False, False])
 DOCS: https://spacy.io/api/doc
 """

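The "Construction 2" example in the hunk above passes parallel `words` and `spaces` lists, where each boolean says whether the corresponding word is followed by a space. A minimal sketch of that pairing in plain Python, without spaCy, is shown below. `join_words` is a hypothetical helper written for illustration, not part of the library.

```python
# Illustrative only: mimics how `words` and `spaces` pair up.
# `join_words` is a hypothetical helper, not a spaCy function.
def join_words(words, spaces):
    """Rebuild the text by appending a space after each word flagged True."""
    if len(words) != len(spaces):
        raise ValueError("words and spaces must have the same length")
    return "".join(
        word + (" " if space else "") for word, space in zip(words, spaces)
    )


text = join_words(["hello", "world", "!"], [True, False, False])
print(text)  # hello world!
```

With the values from the docstring, only "hello" carries a trailing space, so the reconstructed text is `hello world!`.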

@@ -264,7 +264,7 @@ ancestor is found, e.g. if span excludes a necessary ancestor.
 | ----------- | -------------------------------------- | ----------------------------------------------- |
 | **RETURNS** | `numpy.ndarray[ndim=2, dtype='int32']` | The lowest common ancestor matrix of the `Doc`. |
-## Doc.to_json {#to_json, tag="method" new="2.1"}
+## Doc.to_json {#to_json tag="method" new="2.1"}
 Convert a Doc to JSON. The format it produces will be the new format for the
 [`spacy train`](/api/cli#train) command (not implemented yet). If custom


@@ -9,7 +9,10 @@ Segment text, and create `Doc` objects with the discovered segment boundaries.
 ## Tokenizer.\_\_init\_\_ {#init tag="method"}
-Create a `Tokenizer`, to create `Doc` objects given unicode text.
+Create a `Tokenizer`, to create `Doc` objects given unicode text. For examples
+of how to construct a custom tokenizer with different tokenization rules, see
+the
+[usage documentation](https://spacy.io/usage/linguistic-features#native-tokenizers).
 > #### Example
 >
@@ -18,11 +21,14 @@ Create a `Tokenizer`, to create `Doc` objects given unicode text.
 > from spacy.tokenizer import Tokenizer
 > from spacy.lang.en import English
 > nlp = English()
+> # Create a blank Tokenizer with just the English vocab
 > tokenizer = Tokenizer(nlp.vocab)
 >
 > # Construction 2
 > from spacy.lang.en import English
 > nlp = English()
+> # Create a Tokenizer with the default settings for English
+> # including punctuation rules and exceptions
 > tokenizer = nlp.Defaults.create_tokenizer(nlp)
 > ```
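
The comments added in this hunk distinguish a blank tokenizer from one built with the English defaults. As a rough illustration of why that matters, the sketch below contrasts whitespace-only splitting with a toy rule that also splits off trailing punctuation. It does not use spaCy: the function names and the regular expression are hypothetical, and the toy rule is far simpler than the library's actual punctuation rules and exceptions.

```python
import re


def whitespace_tokens(text):
    """Split on whitespace only, leaving punctuation attached to words."""
    return text.split()


def toy_punct_tokens(text):
    """Toy rule (an assumption for illustration): also split off trailing punctuation."""
    tokens = []
    for chunk in text.split():
        # Separate the chunk into a core and any run of trailing punctuation.
        match = re.match(r"^(.*?)([.,!?;:]*)$", chunk)
        core, trailing = match.group(1), match.group(2)
        if core:
            tokens.append(core)
        # Each trailing punctuation character becomes its own token.
        tokens.extend(trailing)
    return tokens


print(whitespace_tokens("Hello world!"))  # ['Hello', 'world!']
print(toy_punct_tokens("Hello world!"))   # ['Hello', 'world', '!']
```

The first function keeps `world!` as a single token, while the second yields `world` and `!` separately.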