unicode -> str consistency

Ines Montani 2020-05-24 17:23:00 +02:00
parent 5d3806e059
commit 262d306eaa
29 changed files with 564 additions and 561 deletions


@@ -504,10 +504,10 @@ tokenization can be provided.
 > srsly.write_jsonl("/path/to/text.jsonl", data)
 > ```
 | Key | Type | Description |
-| -------- | ------- | ---------------------------------------------------------- |
-| `text` | unicode | The raw input text. Is not required if `tokens` available. |
+| -------- | ---- | ---------------------------------------------------------- |
+| `text` | str | The raw input text. Is not required if `tokens` available. |
 | `tokens` | list | Optional tokenization, one string per token. |
 ```json
 ### Example
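As an annotation on this hunk: a minimal sketch of the `data` object the example above serializes; the records are illustrative and follow the two keys in the table:

```python
import srsly

# Illustrative records matching the table: "text" is the raw input;
# "tokens" optionally supplies a pre-tokenized version instead.
data = [
    {"text": "Apple is looking at buying U.K. startup."},
    {"tokens": ["They", "settled", "on", "a", "price", "."]},
]
srsly.write_jsonl("/path/to/text.jsonl", data)
```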


@@ -170,7 +170,7 @@ vocabulary.
 | Name | Type | Description |
 | ----------- | ---------------- | -------------------------------------------------------------------------------------------- |
 | `mem` | `cymem.Pool` | A memory pool. Allocated memory will be freed once the `Vocab` object is garbage collected. |
-| `string` | unicode | The string of the word to look up. |
+| `string` | str | The string of the word to look up. |
 | **RETURNS** | `const LexemeC*` | The lexeme in the vocabulary. |
 ### Vocab.get_by_orth {#vocab_get_by_orth tag="method"}


@@ -229,9 +229,9 @@ Add a new label to the pipe.
 > parser.add_label("MY_LABEL")
 > ```
 | Name | Type | Description |
-| ------- | ------- | ----------------- |
-| `label` | unicode | The label to add. |
+| ------- | ---- | ----------------- |
+| `label` | str | The label to add. |
 ## DependencyParser.to_disk {#to_disk tag="method"}
@@ -244,10 +244,10 @@ Serialize the pipe to disk.
 > parser.to_disk("/path/to/parser")
 > ```
 | Name | Type | Description |
-| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
-| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
+| --------- | ------------ | --------------------------------------------------------------------------------------------------------------------- |
+| `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
 | `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
 ## DependencyParser.from_disk {#from_disk tag="method"}
@@ -262,7 +262,7 @@ Load the pipe from disk. Modifies the object in place and returns it.
 | Name | Type | Description |
 | ----------- | ------------------ | -------------------------------------------------------------------------- |
-| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
+| `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
 | `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
 | **RETURNS** | `DependencyParser` | The modified `DependencyParser` object. |
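Editor's sketch of the round trip these tables describe, assuming a loaded `nlp` pipeline with a parser; the path and the excluded field name are illustrative:

```python
parser = nlp.get_pipe("parser")
# Serialize, skipping the shared vocab via the documented `exclude` list
parser.to_disk("/path/to/parser", exclude=["vocab"])
# Load back in place; the same exclusions apply on the way in
parser.from_disk("/path/to/parser", exclude=["vocab"])
```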


@@ -123,7 +123,7 @@ details, see the documentation on
 | Name | Type | Description |
 | --------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------- |
-| `name` | unicode | Name of the attribute to set by the extension. For example, `'my_attr'` will be available as `doc._.my_attr`. |
+| `name` | str | Name of the attribute to set by the extension. For example, `'my_attr'` will be available as `doc._.my_attr`. |
 | `default` | - | Optional default value of the attribute if no getter or method is defined. |
 | `method` | callable | Set a custom method on the object, for example `doc._.compare(other_doc)`. |
 | `getter` | callable | Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute. |
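A minimal sketch of the extension API this table documents; `my_attr` is the table's own example, `char_count` is a hypothetical getter-backed attribute, and a loaded `nlp` pipeline is assumed:

```python
from spacy.tokens import Doc

Doc.set_extension("my_attr", default=None)
Doc.set_extension("char_count", getter=lambda doc: len(doc.text))

doc = nlp("Hello world")  # assumes `nlp` is already loaded
doc._.my_attr = "value"
assert doc._.char_count == len(doc.text)
```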
@@ -145,10 +145,10 @@ Look up a previously registered extension by name. Returns a 4-tuple
 > assert extension == (False, None, None, None)
 > ```
 | Name | Type | Description |
-| ----------- | ------- | ------------------------------------------------------------- |
-| `name` | unicode | Name of the extension. |
+| ----------- | ----- | ------------------------------------------------------------- |
+| `name` | str | Name of the extension. |
 | **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the extension. |
 ## Doc.has_extension {#has_extension tag="classmethod" new="2"}
@@ -162,10 +162,10 @@ Check whether an extension has been registered on the `Doc` class.
 > assert Doc.has_extension('has_city')
 > ```
 | Name | Type | Description |
-| ----------- | ------- | ------------------------------------------ |
-| `name` | unicode | Name of the extension to check. |
+| ----------- | ---- | ------------------------------------------ |
+| `name` | str | Name of the extension to check. |
 | **RETURNS** | bool | Whether the extension has been registered. |
 ## Doc.remove_extension {#remove_extension tag="classmethod" new="2.0.12"}
@@ -180,10 +180,10 @@ Remove a previously registered extension.
 > assert not Doc.has_extension('has_city')
 > ```
 | Name | Type | Description |
-| ----------- | ------- | --------------------------------------------------------------------- |
-| `name` | unicode | Name of the extension. |
+| ----------- | ----- | --------------------------------------------------------------------- |
+| `name` | str | Name of the extension. |
 | **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the removed extension. |
 ## Doc.char_span {#char_span tag="method" new="2"}
@@ -368,10 +368,10 @@ Save the current state to a directory.
 > doc.to_disk("/path/to/doc")
 > ```
 | Name | Type | Description |
-| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
-| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
+| --------- | ------------ | --------------------------------------------------------------------------------------------------------------------- |
+| `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
 | `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
 ## Doc.from_disk {#from_disk tag="method" new="2"}
@@ -385,11 +385,11 @@ Loads state from a directory. Modifies the object in place and returns it.
 > doc = Doc(Vocab()).from_disk("/path/to/doc")
 > ```
 | Name | Type | Description |
-| ----------- | ---------------- | -------------------------------------------------------------------------- |
-| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
+| ----------- | ------------ | -------------------------------------------------------------------------- |
+| `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
 | `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
 | **RETURNS** | `Doc` | The modified `Doc` object. |
 ## Doc.to_bytes {#to_bytes tag="method"}
@@ -648,15 +648,15 @@ The L2 norm of the document's vector representation.
 | Name | Type | Description |
 | --------------------------------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `text` | unicode | A unicode representation of the document text. |
-| `text_with_ws` | unicode | An alias of `Doc.text`, provided for duck-type compatibility with `Span` and `Token`. |
+| `text` | str | A unicode representation of the document text. |
+| `text_with_ws` | str | An alias of `Doc.text`, provided for duck-type compatibility with `Span` and `Token`. |
 | `mem` | `Pool` | The document's local memory heap, for all C data it owns. |
 | `vocab` | `Vocab` | The store of lexical types. |
 | `tensor` <Tag variant="new">2</Tag> | `ndarray` | Container for dense vector representations. |
 | `cats` <Tag variant="new">2</Tag> | dictionary | Maps either a label to a score for categories applied to whole document, or `(start_char, end_char, label)` to score for categories applied to spans. `start_char` and `end_char` should be character offsets, label can be either a string or an integer ID, and score should be a float. |
 | `user_data` | - | A generic storage area, for user custom data. |
 | `lang` <Tag variant="new">2.1</Tag> | int | Language of the document's vocabulary. |
-| `lang_` <Tag variant="new">2.1</Tag> | unicode | Language of the document's vocabulary. |
+| `lang_` <Tag variant="new">2.1</Tag> | str | Language of the document's vocabulary. |
 | `is_tagged` | bool | A flag indicating that the document has been part-of-speech tagged. Returns `True` if the `Doc` is empty. |
 | `is_parsed` | bool | A flag indicating that the document has been syntactically parsed. Returns `True` if the `Doc` is empty. |
 | `is_sentenced` | bool | A flag indicating that sentence boundaries have been applied to the document. Returns `True` if the `Doc` is empty. |
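The `cats` entry above packs two key shapes into one plain dict. A hedged illustration, with made-up labels and scores, assuming a loaded `nlp` pipeline:

```python
doc = nlp("This is great!")
# Whole-document category: label -> score
doc.cats["POSITIVE"] = 0.95
# Span-level category: (start_char, end_char, label) -> score
doc.cats[(0, 4, "SUBJECT")] = 0.7
```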


@@ -258,10 +258,10 @@ Serialize the pipe to disk.
 > entity_linker.to_disk("/path/to/entity_linker")
 > ```
 | Name | Type | Description |
-| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
-| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
+| --------- | ------------ | --------------------------------------------------------------------------------------------------------------------- |
+| `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
 | `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
 ## EntityLinker.from_disk {#from_disk tag="method"}
@@ -274,11 +274,11 @@ Load the pipe from disk. Modifies the object in place and returns it.
 > entity_linker.from_disk("/path/to/entity_linker")
 > ```
 | Name | Type | Description |
-| ----------- | ---------------- | -------------------------------------------------------------------------- |
-| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
+| ----------- | -------------- | -------------------------------------------------------------------------- |
+| `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
 | `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
 | **RETURNS** | `EntityLinker` | The modified `EntityLinker` object. |
 ## Serialization fields {#serialization-fields}


@@ -230,9 +230,9 @@ Add a new label to the pipe.
 > ner.add_label("MY_LABEL")
 > ```
 | Name | Type | Description |
-| ------- | ------- | ----------------- |
-| `label` | unicode | The label to add. |
+| ------- | ---- | ----------------- |
+| `label` | str | The label to add. |
 ## EntityRecognizer.to_disk {#to_disk tag="method"}
@@ -245,10 +245,10 @@ Serialize the pipe to disk.
 > ner.to_disk("/path/to/ner")
 > ```
 | Name | Type | Description |
-| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
-| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
+| --------- | ------------ | --------------------------------------------------------------------------------------------------------------------- |
+| `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
 | `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
 ## EntityRecognizer.from_disk {#from_disk tag="method"}
@@ -263,7 +263,7 @@ Load the pipe from disk. Modifies the object in place and returns it.
 | Name | Type | Description |
 | ----------- | ------------------ | -------------------------------------------------------------------------- |
-| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
+| `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
 | `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
 | **RETURNS** | `EntityRecognizer` | The modified `EntityRecognizer` object. |


@@ -72,10 +72,10 @@ Whether a label is present in the patterns.
 > assert not "PERSON" in ruler
 > ```
 | Name | Type | Description |
-| ----------- | ------- | -------------------------------------------- |
-| `label` | unicode | The label to check. |
+| ----------- | ---- | -------------------------------------------- |
+| `label` | str | The label to check. |
 | **RETURNS** | bool | Whether the entity ruler contains the label. |
 ## EntityRuler.\_\_call\_\_ {#call tag="method"}
@@ -83,8 +83,9 @@ Find matches in the `Doc` and add them to the `doc.ents`. Typically, this
 happens automatically after the component has been added to the pipeline using
 [`nlp.add_pipe`](/api/language#add_pipe). If the entity ruler was initialized
 with `overwrite_ents=True`, existing entities will be replaced if they overlap
-with the matches. When matches overlap in a Doc, the entity ruler prioritizes longer
-patterns over shorter, and if equal the match occurring first in the Doc is chosen.
+with the matches. When matches overlap in a Doc, the entity ruler prioritizes
+longer patterns over shorter, and if equal the match occurring first in the Doc
+is chosen.
 > #### Example
 >
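A hedged sketch of the overlap resolution described in this hunk; the patterns and text are illustrative:

```python
from spacy.lang.en import English
from spacy.pipeline import EntityRuler

nlp = English()
ruler = EntityRuler(nlp)
ruler.add_patterns([
    {"label": "ORG", "pattern": "Apple"},
    {"label": "ORG", "pattern": "Apple Inc."},
])
nlp.add_pipe(ruler)

doc = nlp("Apple Inc. is opening a new store.")
# The longer "Apple Inc." pattern wins over the shorter "Apple" on overlap
assert [(ent.text, ent.label_) for ent in doc.ents] == [("Apple Inc.", "ORG")]
```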
@@ -139,9 +140,9 @@ only the patterns are saved as JSONL. If a directory name is provided, a
 > ruler.to_disk("/path/to/entity_ruler")  # saves patterns and config
 > ```
 | Name | Type | Description |
-| ------ | ---------------- | ------------------------------------------------------------------------------------------------------------------------------------ |
-| `path` | unicode / `Path` | A path to a JSONL file or directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
+| ------ | ------------ | ------------------------------------------------------------------------------------------------------------------------------------ |
+| `path` | str / `Path` | A path to a JSONL file or directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
 ## EntityRuler.from_disk {#from_disk tag="method"}
@@ -158,10 +159,10 @@ configuration.
 > ruler.from_disk("/path/to/entity_ruler")  # loads patterns and config
 > ```
 | Name | Type | Description |
-| ----------- | ---------------- | ------------------------------------------------------------------------------------------ |
-| `path` | unicode / `Path` | A path to a JSONL file or directory. Paths may be either strings or `Path`-like objects. |
+| ----------- | ------------- | ------------------------------------------------------------------------------------------ |
+| `path` | str / `Path` | A path to a JSONL file or directory. Paths may be either strings or `Path`-like objects. |
 | **RETURNS** | `EntityRuler` | The modified `EntityRuler` object. |
 ## EntityRuler.to_bytes {#to_bytes tag="method"}


@@ -17,8 +17,8 @@ Create a `GoldCorpus`. If the input data is an iterable, each item should be a
 [`gold.read_json_file`](https://github.com/explosion/spaCy/tree/master/spacy/gold.pyx)
 for further details.
 | Name | Type | Description |
-| ----------- | --------------------------- | ------------------------------------------------------------ |
-| `train` | unicode / `Path` / iterable | Training data, as a path (file or directory) or iterable. |
-| `dev` | unicode / `Path` / iterable | Development data, as a path (file or directory) or iterable. |
+| ----------- | ----------------------- | ------------------------------------------------------------ |
+| `train` | str / `Path` / iterable | Training data, as a path (file or directory) or iterable. |
+| `dev` | str / `Path` / iterable | Development data, as a path (file or directory) or iterable. |
 | **RETURNS** | `GoldCorpus` | The newly constructed object. |


@@ -62,7 +62,8 @@ Whether the provided syntactic annotations form a projective dependency tree.
 Convert a list of Doc objects into the
 [JSON-serializable format](/api/annotation#json-input) used by the
-[`spacy train`](/api/cli#train) command. Each input doc will be treated as a 'paragraph' in the output doc.
+[`spacy train`](/api/cli#train) command. Each input doc will be treated as a
+'paragraph' in the output doc.
 > #### Example
 >
@@ -160,7 +161,7 @@ single-token entity.
 | ----------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------ |
 | `doc` | `Doc` | The document that the entity offsets refer to. The output tags will refer to the token boundaries within the document. |
 | `entities` | iterable | A sequence of `(start, end, label)` triples. `start` and `end` should be character-offset integers denoting the slice into the original string. |
-| **RETURNS** | list | Unicode strings, describing the [BILUO](/api/annotation#biluo) tags. |
+| **RETURNS** | list | str strings, describing the [BILUO](/api/annotation#biluo) tags. |
 ### gold.offsets_from_biluo_tags {#offsets_from_biluo_tags tag="function"}
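A minimal sketch of the BILUO conversion this table documents, assuming a blank English pipeline; the text and offsets are illustrative:

```python
import spacy
from spacy.gold import biluo_tags_from_offsets

nlp = spacy.blank("en")
doc = nlp("I like London.")
# (start, end, label) character offsets into the original string
tags = biluo_tags_from_offsets(doc, [(7, 13, "LOC")])
assert tags == ["O", "O", "U-LOC", "O"]
```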


@@ -1,16 +1,19 @@
 ---
 title: KnowledgeBase
-teaser: A storage class for entities and aliases of a specific knowledge base (ontology)
+teaser:
+  A storage class for entities and aliases of a specific knowledge base
+  (ontology)
 tag: class
 source: spacy/kb.pyx
 new: 2.2
 ---
-The `KnowledgeBase` object provides a method to generate [`Candidate`](/api/kb/#candidate_init)
-objects, which are plausible external identifiers given a certain textual mention.
-Each such `Candidate` holds information from the relevant KB entities,
-such as its frequency in text and possible aliases.
-Each entity in the knowledge base also has a pretrained entity vector of a fixed size.
+The `KnowledgeBase` object provides a method to generate
+[`Candidate`](/api/kb/#candidate_init) objects, which are plausible external
+identifiers given a certain textual mention. Each such `Candidate` holds
+information from the relevant KB entities, such as its frequency in text and
+possible aliases. Each entity in the knowledge base also has a pretrained entity
+vector of a fixed size.
 ## KnowledgeBase.\_\_init\_\_ {#init tag="method"}
@@ -24,25 +27,25 @@ Create the knowledge base.
 > kb = KnowledgeBase(vocab=vocab, entity_vector_length=64)
 > ```
 | Name | Type | Description |
-| ----------------------- | ---------------- | ----------------------------------------- |
+| ---------------------- | --------------- | ---------------------------------------- |
 | `vocab` | `Vocab` | A `Vocab` object. |
 | `entity_vector_length` | int | Length of the fixed-size entity vectors. |
 | **RETURNS** | `KnowledgeBase` | The newly constructed object. |
 ## KnowledgeBase.entity_vector_length {#entity_vector_length tag="property"}
 The length of the fixed-size entity vectors in the knowledge base.
 | Name | Type | Description |
-| ----------- | ---- | ----------------------------------------- |
+| ----------- | ---- | ---------------------------------------- |
 | **RETURNS** | int | Length of the fixed-size entity vectors. |
 ## KnowledgeBase.add_entity {#add_entity tag="method"}
-Add an entity to the knowledge base, specifying its corpus frequency
-and entity vector, which should be of length [`entity_vector_length`](/api/kb#entity_vector_length).
+Add an entity to the knowledge base, specifying its corpus frequency and entity
+vector, which should be of length
+[`entity_vector_length`](/api/kb#entity_vector_length).
 > #### Example
 >
@@ -51,16 +54,16 @@ and entity vector, which should be of length [`entity_vector_length`](/api/kb#entity_vector_length).
 > kb.add_entity(entity="Q463035", freq=111, entity_vector=vector2)
 > ```
 | Name | Type | Description |
-| --------------- | ------------- | ------------------------------------------------- |
-| `entity` | unicode | The unique entity identifier |
+| --------------- | ------ | ----------------------------------------------- |
+| `entity` | str | The unique entity identifier |
 | `freq` | float | The frequency of the entity in a typical corpus |
 | `entity_vector` | vector | The pretrained vector of the entity |
 ## KnowledgeBase.set_entities {#set_entities tag="method"}
-Define the full list of entities in the knowledge base, specifying the corpus frequency
-and entity vector for each entity.
+Define the full list of entities in the knowledge base, specifying the corpus
+frequency and entity vector for each entity.
 > #### Example
 >
@@ -68,18 +71,19 @@ and entity vector for each entity.
 > kb.set_entities(entity_list=["Q42", "Q463035"], freq_list=[32, 111], vector_list=[vector1, vector2])
 > ```
 | Name | Type | Description |
-| ------------- | ------------- | ------------------------------------------------- |
+| ------------- | -------- | --------------------------------- |
 | `entity_list` | iterable | List of unique entity identifiers |
 | `freq_list` | iterable | List of entity frequencies |
 | `vector_list` | iterable | List of entity vectors |
 ## KnowledgeBase.add_alias {#add_alias tag="method"}
-Add an alias or mention to the knowledge base, specifying its potential KB identifiers
-and their prior probabilities. The entity identifiers should refer to entities previously
-added with [`add_entity`](/api/kb#add_entity) or [`set_entities`](/api/kb#set_entities).
-The sum of the prior probabilities should not exceed 1.
+Add an alias or mention to the knowledge base, specifying its potential KB
+identifiers and their prior probabilities. The entity identifiers should refer
+to entities previously added with [`add_entity`](/api/kb#add_entity) or
+[`set_entities`](/api/kb#set_entities). The sum of the prior probabilities
+should not exceed 1.
 > #### Example
 >
@@ -87,11 +91,11 @@ The sum of the prior probabilities should not exceed 1.
 > kb.add_alias(alias="Douglas", entities=["Q42", "Q463035"], probabilities=[0.6, 0.3])
 > ```
 | Name | Type | Description |
-| -------------- | ------------- | -------------------------------------------------- |
-| `alias` | unicode | The textual mention or alias |
+| --------------- | -------- | -------------------------------------------------- |
+| `alias` | str | The textual mention or alias |
 | `entities` | iterable | The potential entities that the alias may refer to |
-| `probabilities`| iterable | The prior probabilities of each entity |
+| `probabilities` | iterable | The prior probabilities of each entity |
 ## KnowledgeBase.\_\_len\_\_ {#len tag="method"}
@@ -117,9 +121,9 @@ Get a list of all entity IDs in the knowledge base.
 > all_entities = kb.get_entity_strings()
 > ```
 | Name | Type | Description |
-| ----------- | ---- | --------------------------------------------- |
+| ----------- | ---- | ------------------------------------------- |
 | **RETURNS** | list | The list of entities in the knowledge base. |
 ## KnowledgeBase.get_size_aliases {#get_size_aliases tag="method"}
@@ -131,9 +135,9 @@ Get the total number of aliases in the knowledge base.
 > total_aliases = kb.get_size_aliases()
 > ```
 | Name | Type | Description |
-| ----------- | ---- | --------------------------------------------- |
+| ----------- | ---- | -------------------------------------------- |
 | **RETURNS** | int | The number of aliases in the knowledge base. |
 ## KnowledgeBase.get_alias_strings {#get_alias_strings tag="method"}
@@ -145,9 +149,9 @@ Get a list of all aliases in the knowledge base.
 > all_aliases = kb.get_alias_strings()
 > ```
 | Name | Type | Description |
-| ----------- | ---- | --------------------------------------------- |
+| ----------- | ---- | ------------------------------------------ |
 | **RETURNS** | list | The list of aliases in the knowledge base. |
 ## KnowledgeBase.get_candidates {#get_candidates tag="method"}
@@ -160,10 +164,10 @@ of type [`Candidate`](/api/kb/#candidate_init).
 > candidates = kb.get_candidates("Douglas")
 > ```
 | Name | Type | Description |
-| ------------- | ------------- | -------------------------------------------------- |
-| `alias` | unicode | The textual mention or alias |
+| ----------- | -------- | ---------------------------------------- |
+| `alias` | str | The textual mention or alias |
 | **RETURNS** | iterable | The list of relevant `Candidate` objects |
 ## KnowledgeBase.get_vector {#get_vector tag="method"}
@@ -175,15 +179,15 @@ Given a certain entity ID, retrieve its pretrained entity vector.
 > vector = kb.get_vector("Q42")
 > ```
 | Name | Type | Description |
-| ------------- | ------------- | -------------------------------------------------- |
-| `entity` | unicode | The entity ID |
+| ----------- | ------ | ----------------- |
+| `entity` | str | The entity ID |
 | **RETURNS** | vector | The entity vector |
 ## KnowledgeBase.get_prior_prob {#get_prior_prob tag="method"}
-Given a certain entity ID and a certain textual mention, retrieve
-the prior probability of the fact that the mention links to the entity ID.
+Given a certain entity ID and a certain textual mention, retrieve the prior
+probability of the fact that the mention links to the entity ID.
 > #### Example
 >
@@ -191,11 +195,11 @@ the prior probability of the fact that the mention links to the entity ID.
 > probability = kb.get_prior_prob("Q42", "Douglas")
 > ```
 | Name | Type | Description |
-| ------------- | ------------- | --------------------------------------------------------------- |
-| `entity` | unicode | The entity ID |
-| `alias` | unicode | The textual mention or alias |
+| ----------- | ----- | -------------------------------------------------------------- |
+| `entity` | str | The entity ID |
+| `alias` | str | The textual mention or alias |
 | **RETURNS** | float | The prior probability of the `alias` referring to the `entity` |
 ## KnowledgeBase.dump {#dump tag="method"}
@@ -207,14 +211,14 @@ Save the current state of the knowledge base to a directory.
 > kb.dump(loc)
 > ```
 | Name | Type | Description |
-| ------------- | ---------------- | ------------------------------------------------------------------------------------------------------------------------ |
-| `loc` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
+| ----- | ------------ | --------------------------------------------------------------------------------------------------------------------- |
+| `loc` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
 ## KnowledgeBase.load_bulk {#load_bulk tag="method"}
-Restore the state of the knowledge base from a given directory. Note that the [`Vocab`](/api/vocab)
-should also be the same as the one used to create the KB.
+Restore the state of the knowledge base from a given directory. Note that the
+[`Vocab`](/api/vocab) should also be the same as the one used to create the KB.
 > #### Example
 >
@@ -226,18 +230,16 @@ should also be the same as the one used to create the KB.
 > kb.load_bulk("/path/to/kb")
 > ```
-| Name | Type | Description |
-| ----------- | ---------------- | ----------------------------------------------------------------------------------------- |
-| `loc` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
-| **RETURNS** | `KnowledgeBase` | The modified `KnowledgeBase` object. |
+| Name | Type | Description |
+| ----------- | --------------- | -------------------------------------------------------------------------- |
+| `loc` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
+| **RETURNS** | `KnowledgeBase` | The modified `KnowledgeBase` object. |
 ## Candidate.\_\_init\_\_ {#candidate_init tag="method"}
 Construct a `Candidate` object. Usually this constructor is not called directly,
-but instead these objects are returned by the [`get_candidates`](/api/kb#get_candidates) method
-of a `KnowledgeBase`.
+but instead these objects are returned by the
+[`get_candidates`](/api/kb#get_candidates) method of a `KnowledgeBase`.
 > #### Example
 >
@@ -257,12 +259,12 @@ of a `KnowledgeBase`.
 ## Candidate attributes {#candidate_attributes}
 | Name | Type | Description |
-| ---------------------- | ------------ | ------------------------------------------------------------------ |
+| --------------- | ------ | -------------------------------------------------------------- |
 | `entity` | int | The entity's unique KB identifier |
-| `entity_` | unicode | The entity's unique KB identifier |
+| `entity_` | str | The entity's unique KB identifier |
 | `alias` | int | The alias or textual mention |
-| `alias_` | unicode | The alias or textual mention |
+| `alias_` | str | The alias or textual mention |
 | `prior_prob` | long | The prior probability of the `alias` referring to the `entity` |
 | `entity_freq` | long | The frequency of the entity in a typical corpus |
 | `entity_vector` | vector | The pretrained vector of the entity |
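Pulling the methods above together, a hedged end-to-end sketch; the IDs, frequencies, vectors, and path are illustrative, and the priors deliberately sum to less than 1:

```python
from spacy.kb import KnowledgeBase
from spacy.vocab import Vocab

vocab = Vocab()
kb = KnowledgeBase(vocab=vocab, entity_vector_length=3)
kb.add_entity(entity="Q42", freq=32, entity_vector=[1.0, 2.0, 3.0])
kb.add_entity(entity="Q463035", freq=111, entity_vector=[4.0, 5.0, 6.0])
kb.add_alias(alias="Douglas", entities=["Q42", "Q463035"], probabilities=[0.6, 0.3])

for candidate in kb.get_candidates("Douglas"):
    print(candidate.entity_, candidate.prior_prob)

kb.dump("/path/to/kb")  # reload later with load_bulk() and the same vocab
```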


@@ -49,11 +49,11 @@ contain arbitrary whitespace. Alignment into the original string is preserved.
 > assert (doc[0].text, doc[0].head.tag_) == ("An", "NN")
 > ```
 | Name | Type | Description |
-| ----------- | ------- | --------------------------------------------------------------------------------- |
-| `text` | unicode | The text to be processed. |
+| ----------- | ----- | --------------------------------------------------------------------------------- |
+| `text` | str | The text to be processed. |
 | `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
 | **RETURNS** | `Doc` | A container for accessing the annotations. |
 <Infobox title="Changed in v2.0" variant="warning">
@@ -201,7 +201,7 @@ Create a pipeline component from a factory.
 | Name | Type | Description |
 | ----------- | -------- | ---------------------------------------------------------------------------------- |
-| `name` | unicode | Factory name to look up in [`Language.factories`](/api/language#class-attributes). |
+| `name` | str | Factory name to look up in [`Language.factories`](/api/language#class-attributes). |
 | `config` | dict | Configuration parameters to initialize component. |
 | **RETURNS** | callable | The pipeline component. |
@@ -224,9 +224,9 @@ take a `Doc` object, modify it and return it. Only one of `before`, `after`,
 | Name | Type | Description |
 | ----------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
 | `component` | callable | The pipeline component. |
-| `name` | unicode | Name of pipeline component. Overwrites existing `component.name` attribute if available. If no `name` is set and the component exposes no name attribute, `component.__name__` is used. An error is raised if the name already exists in the pipeline. |
-| `before` | unicode | Component name to insert component directly before. |
-| `after` | unicode | Component name to insert component directly after. |
+| `name` | str | Name of pipeline component. Overwrites existing `component.name` attribute if available. If no `name` is set and the component exposes no name attribute, `component.__name__` is used. An error is raised if the name already exists in the pipeline. |
+| `before` | str | Component name to insert component directly before. |
+| `after` | str | Component name to insert component directly after. |
 | `first` | bool | Insert component first / not first in the pipeline. |
 | `last` | bool | Insert component last / not last in the pipeline. |
@@ -243,10 +243,10 @@ Check whether a component is present in the pipeline. Equivalent to
 > assert nlp.has_pipe("component")
 > ```
 | Name | Type | Description |
-| ----------- | ------- | -------------------------------------------------------- |
-| `name` | unicode | Name of the pipeline component to check. |
+| ----------- | ---- | -------------------------------------------------------- |
+| `name` | str | Name of the pipeline component to check. |
 | **RETURNS** | bool | Whether a component of that name exists in the pipeline. |
 ## Language.get_pipe {#get_pipe tag="method" new="2"}
@@ -261,7 +261,7 @@ Get a pipeline component for a given component name.
 | Name | Type | Description |
 | ----------- | -------- | -------------------------------------- |
-| `name` | unicode | Name of the pipeline component to get. |
+| `name` | str | Name of the pipeline component to get. |
 | **RETURNS** | callable | The pipeline component. |
 ## Language.replace_pipe {#replace_pipe tag="method" new="2"}
@@ -276,7 +276,7 @@ Replace a component in the pipeline.
 | Name | Type | Description |
 | ----------- | -------- | --------------------------------- |
-| `name` | unicode | Name of the component to replace. |
+| `name` | str | Name of the component to replace. |
 | `component` | callable | The pipeline component to insert. |
 ## Language.rename_pipe {#rename_pipe tag="method" new="2"}
@@ -292,10 +292,10 @@ added to the pipeline, you can also use the `name` argument on
 > nlp.rename_pipe("parser", "spacy_parser")
 > ```
 | Name | Type | Description |
-| ---------- | ------- | -------------------------------- |
-| `old_name` | unicode | Name of the component to rename. |
-| `new_name` | unicode | New name of the component. |
+| ---------- | ---- | -------------------------------- |
+| `old_name` | str | Name of the component to rename. |
+| `new_name` | str | New name of the component. |
 ## Language.remove_pipe {#remove_pipe tag="method" new="2"}
@@ -309,10 +309,10 @@ component function.
 > assert name == "parser"
 > ```
 | Name | Type | Description |
-| ----------- | ------- | ----------------------------------------------------- |
-| `name` | unicode | Name of the component to remove. |
+| ----------- | ----- | ----------------------------------------------------- |
+| `name` | str | Name of the component to remove. |
 | **RETURNS** | tuple | A `(name, component)` tuple of the removed component. |
 ## Language.select_pipes {#select_pipes tag="contextmanager, method" new="3"}
@@ -342,12 +342,11 @@ latter case, all components not in the `enable` list will be disabled.
 | Name | Type | Description |
 | ----------- | --------------- | ------------------------------------------------------------------------------------ |
 | `disable` | list | Names of pipeline components to disable. |
-| `disable` | unicode | Name of pipeline component to disable. |
+| `disable` | str | Name of pipeline component to disable. |
 | `enable` | list | Names of pipeline components that will not be disabled. |
-| `enable` | unicode | Name of pipeline component that will not be disabled. |
+| `enable` | str | Name of pipeline component that will not be disabled. |
 | **RETURNS** | `DisabledPipes` | The disabled pipes that can be restored by calling the object's `.restore()` method. |
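A hedged sketch of the two calling styles this table implies (string or list values for `enable`/`disable`), assuming a loaded `nlp` pipeline on a version that has `select_pipes`:

```python
# As a context manager: temporarily run only the tagger
with nlp.select_pipes(enable="tagger"):
    doc = nlp("Some text")

# As a method: disable specific components, then restore them manually
disabled = nlp.select_pipes(disable=["parser", "ner"])
doc = nlp("Some text")
disabled.restore()
```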
 <Infobox title="Changed in v3.0" variant="warning">
 As of spaCy v3.0, the `disable_pipes` method has been renamed to `select_pipes`:
@@ -370,10 +369,10 @@ the model**.
 > nlp.to_disk("/path/to/models")
 > ```
 | Name | Type | Description |
-| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
-| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
+| --------- | ------------ | --------------------------------------------------------------------------------------------------------------------- |
+| `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
 | `exclude` | list | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. |
 ## Language.from_disk {#from_disk tag="method" new="2"}
@@ -395,11 +394,11 @@ loaded object.
 > nlp = English().from_disk("/path/to/en_model")
 > ```
 | Name | Type | Description |
-| ----------- | ---------------- | ----------------------------------------------------------------------------------------- |
-| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
+| ----------- | ------------ | ----------------------------------------------------------------------------------------- |
+| `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
 | `exclude` | list | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. |
 | **RETURNS** | `Language` | The modified `Language` object. |
 <Infobox title="Changed in v2.0" variant="warning">
@@ -480,11 +479,11 @@ per component.
 ## Class attributes {#class-attributes}
 | Name | Type | Description |
-| -------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------ |
+| -------------------------------------- | ----- | ------------------------------------------------------------------------------------------------------------------------------------ |
 | `Defaults` | class | Settings, data and factory methods for creating the `nlp` object and processing pipeline. |
-| `lang` | unicode | Two-letter language ID, i.e. [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). |
+| `lang` | str | Two-letter language ID, i.e. [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). |
 | `factories` <Tag variant="new">2</Tag> | dict | Factories that create pre-defined pipeline components, e.g. the tagger, parser or entity recognizer, keyed by their component name. |
 ## Serialization fields {#serialization-fields}


@@ -63,8 +63,8 @@ Lemmatize a string.
 | Name | Type | Description |
 | ------------ | ------------- | ---------------------------------------------------------------------------------------------------------- |
-| `string` | unicode | The string to lemmatize, e.g. the token text. |
-| `univ_pos` | unicode / int | The token's universal part-of-speech tag. |
+| `string` | str | The string to lemmatize, e.g. the token text. |
+| `univ_pos` | str / int | The token's universal part-of-speech tag. |
 | `morphology` | dict / `None` | Morphological features following the [Universal Dependencies](http://universaldependencies.org/) scheme. |
 | **RETURNS** | list | The available lemmas for the string. |
@@ -82,11 +82,11 @@ original string is returned. Languages can provide a
 > assert lemmatizer.lookup("going") == "go"
 > ```
 | Name | Type | Description |
-| ----------- | ------- | ------------------------------------------------------------------------------------------------------------ |
-| `string` | unicode | The string to look up. |
+| ----------- | ---- | ------------------------------------------------------------------------------------------------------------ |
+| `string` | str | The string to look up. |
 | `orth` | int | Optional hash of the string to look up. If not set, the string will be used and hashed. Defaults to `None`. |
-| **RETURNS** | unicode | The lemma if the string was found, otherwise the original string. |
+| **RETURNS** | str | The lemma if the string was found, otherwise the original string. |
 ## Lemmatizer.is_base_form {#is_base_form tag="method"}
@@ -102,11 +102,11 @@ lemmatization entirely.
 > assert is_base_form == True
 > ```
 | Name | Type | Description |
-| ------------ | ------------- | --------------------------------------------------------------------------------------- |
-| `univ_pos` | unicode / int | The token's universal part-of-speech tag. |
+| ------------ | --------- | --------------------------------------------------------------------------------------- |
+| `univ_pos` | str / int | The token's universal part-of-speech tag. |
 | `morphology` | dict | The token's morphological features. |
 | **RETURNS** | bool | Whether the token's part-of-speech tag and morphological features describe a base form. |
 ## Attributes {#attributes}
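A minimal sketch of the lookup path these tables describe, using the `lemma_lookup` table name from spaCy's lookup data; the single-entry table is illustrative:

```python
from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups

lookups = Lookups()
lookups.add_table("lemma_lookup", {"going": "go"})
lemmatizer = Lemmatizer(lookups)
# Falls back to the original string when no entry is found
assert lemmatizer.lookup("going") == "go"
assert lemmatizer.lookup("unseen") == "unseen"
```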


@ -56,10 +56,10 @@ Check if the lookups contain a table of a given name. Delegates to
> assert "some_table" in lookups > assert "some_table" in lookups
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ------- | ----------------------------------------------- | | ----------- | ---- | ----------------------------------------------- |
| `name` | unicode | Name of the table. | | `name` | str | Name of the table. |
| **RETURNS** | bool | Whether a table of that name is in the lookups. | | **RETURNS** | bool | Whether a table of that name is in the lookups. |
## Lookups.tables {#tables tag="property"} ## Lookups.tables {#tables tag="property"}
@ -91,7 +91,7 @@ exists.
| Name | Type | Description | | Name | Type | Description |
| ----------- | ----------------------------- | ---------------------------------- | | ----------- | ----------------------------- | ---------------------------------- |
| `name` | unicode | Unique name of the table. | | `name` | str | Unique name of the table. |
| `data` | dict | Optional data to add to the table. | | `data` | dict | Optional data to add to the table. |
| **RETURNS** | [`Table`](/api/lookups#table) | The newly added table. | | **RETURNS** | [`Table`](/api/lookups#table) | The newly added table. |
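For example, a table can be created and populated in one call (the table name and data below are made up for illustration):

```python
from spacy.lookups import Lookups

lookups = Lookups()
table = lookups.add_table("some_table", {"foo": "bar"})
assert "some_table" in lookups
```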
@ -110,7 +110,7 @@ Get a table from the lookups. Raises an error if the table doesn't exist.
| Name | Type | Description | | Name | Type | Description |
| ----------- | ----------------------------- | ------------------ | | ----------- | ----------------------------- | ------------------ |
| `name` | unicode | Name of the table. | | `name` | str | Name of the table. |
| **RETURNS** | [`Table`](/api/lookups#table) | The table. | | **RETURNS** | [`Table`](/api/lookups#table) | The table. |
## Lookups.remove_table {#remove_table tag="method"} ## Lookups.remove_table {#remove_table tag="method"}
@ -128,7 +128,7 @@ Remove a table from the lookups. Raises an error if the table doesn't exist.
| Name | Type | Description | | Name | Type | Description |
| ----------- | ----------------------------- | ---------------------------- | | ----------- | ----------------------------- | ---------------------------- |
| `name` | unicode | Name of the table to remove. | | `name` | str | Name of the table to remove. |
| **RETURNS** | [`Table`](/api/lookups#table) | The removed table. | | **RETURNS** | [`Table`](/api/lookups#table) | The removed table. |
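Continuing the sketch above, tables are retrieved and removed by name:

```python
table = lookups.get_table("some_table")       # raises if the table doesn't exist
assert table["foo"] == "bar"
removed = lookups.remove_table("some_table")  # returns the removed Table
assert not lookups.has_table("some_table")
```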
## Lookups.has_table {#has_table tag="method"} ## Lookups.has_table {#has_table tag="method"}
@ -144,10 +144,10 @@ Check if the lookups contain a table of a given name. Equivalent to
> assert lookups.has_table("some_table") > assert lookups.has_table("some_table")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ------- | ----------------------------------------------- | | ----------- | ---- | ----------------------------------------------- |
| `name` | unicode | Name of the table. | | `name` | str | Name of the table. |
| **RETURNS** | bool | Whether a table of that name is in the lookups. | | **RETURNS** | bool | Whether a table of that name is in the lookups. |
## Lookups.to_bytes {#to_bytes tag="method"} ## Lookups.to_bytes {#to_bytes tag="method"}
@ -191,9 +191,9 @@ which will be created if it doesn't exist.
> lookups.to_disk("/path/to/lookups") > lookups.to_disk("/path/to/lookups")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- | | ------ | ------------ | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | | `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
## Lookups.from_disk {#from_disk tag="method"} ## Lookups.from_disk {#from_disk tag="method"}
@ -208,10 +208,10 @@ the file doesn't exist.
> lookups.from_disk("/path/to/lookups") > lookups.from_disk("/path/to/lookups")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ---------------- | -------------------------------------------------------------------------- | | ----------- | ------------ | -------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. | | `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
| **RETURNS** | `Lookups` | The loaded lookups. | | **RETURNS** | `Lookups` | The loaded lookups. |
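A round trip through disk, under an assumed path:

```python
from spacy.lookups import Lookups

lookups = Lookups()
lookups.add_table("some_table", {"foo": "bar"})
lookups.to_disk("/path/to/lookups")  # assumed path
new_lookups = Lookups().from_disk("/path/to/lookups")
assert new_lookups.has_table("some_table")
```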
## Table {#table tag="class, ordereddict"} ## Table {#table tag="class, ordereddict"}
@ -238,7 +238,7 @@ Initialize a new table.
| Name | Type | Description | | Name | Type | Description |
| ----------- | ------- | ---------------------------------- | | ----------- | ------- | ---------------------------------- |
| `name` | unicode | Optional table name for reference. | | `name` | str | Optional table name for reference. |
| **RETURNS** | `Table` | The newly constructed object. | | **RETURNS** | `Table` | The newly constructed object. |
### Table.from_dict {#table.from_dict tag="classmethod"} ### Table.from_dict {#table.from_dict tag="classmethod"}
@ -256,7 +256,7 @@ Initialize a new table from a dict.
| Name | Type | Description | | Name | Type | Description |
| ----------- | ------- | ---------------------------------- | | ----------- | ------- | ---------------------------------- |
| `data` | dict | The dictionary. | | `data` | dict | The dictionary. |
| `name` | unicode | Optional table name for reference. | | `name` | str | Optional table name for reference. |
| **RETURNS** | `Table` | The newly constructed object. | | **RETURNS** | `Table` | The newly constructed object. |
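A short sketch with made-up data:

```python
from spacy.lookups import Table

data = {"foo": "bar", "baz": 100}
table = Table.from_dict(data, name="some_table")
assert table["foo"] == "bar"  # string keys are hashed internally
```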
### Table.set {#table.set tag="method"} ### Table.set {#table.set tag="method"}
@ -273,10 +273,10 @@ Set a new key / value pair. String keys will be hashed. Same as
> assert table["foo"] == "bar" > assert table["foo"] == "bar"
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ------- | ------------- | ----------- | | ------- | --------- | ----------- |
| `key` | unicode / int | The key. | | `key` | str / int | The key. |
| `value` | - | The value. | | `value` | - | The value. |
### Table.to_bytes {#table.to_bytes tag="method"} ### Table.to_bytes {#table.to_bytes tag="method"}
@ -313,6 +313,6 @@ Load a table from a bytestring.
| Name | Type | Description | | Name | Type | Description |
| -------------- | --------------------------- | ----------------------------------------------------- | | -------------- | --------------------------- | ----------------------------------------------------- |
| `name` | unicode | Table name. | | `name` | str | Table name. |
| `default_size` | int | Default size of bloom filters if no data is provided. | | `default_size` | int | Default size of bloom filters if no data is provided. |
| `bloom` | `preshed.bloom.BloomFilter` | The bloom filters. | | `bloom` | `preshed.bloom.BloomFilter` | The bloom filters. |
View File
@ -125,10 +125,10 @@ Check whether the matcher contains rules for a match ID.
> assert 'Rule' in matcher > assert 'Rule' in matcher
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ------- | ----------------------------------------------------- | | ----------- | ---- | ----------------------------------------------------- |
| `key` | unicode | The match ID. | | `key` | str | The match ID. |
| **RETURNS** | bool | Whether the matcher contains rules for this match ID. | | **RETURNS** | bool | Whether the matcher contains rules for this match ID. |
## Matcher.add {#add tag="method" new="2"} ## Matcher.add {#add tag="method" new="2"}
@ -153,7 +153,7 @@ overwritten.
| Name | Type | Description | | Name | Type | Description |
| ----------- | ------------------ | --------------------------------------------------------------------------------------------- | | ----------- | ------------------ | --------------------------------------------------------------------------------------------- |
| `match_id` | unicode | An ID for the thing you're matching. | | `match_id` | str | An ID for the thing you're matching. |
| `on_match` | callable or `None` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. | | `on_match` | callable or `None` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. |
| `*patterns` | list | Match pattern. A pattern consists of a list of dicts, where each dict describes a token. | | `*patterns` | list | Match pattern. A pattern consists of a list of dicts, where each dict describes a token. |
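A minimal sketch of the v2 call signature shown above; the match ID and pattern are illustrative:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "hello"}, {"LOWER": "world"}]  # one dict per token
matcher.add("HelloWorld", None, pattern)            # on_match=None: no callback
matches = matcher(nlp("Hello world!"))              # list of (match_id, start, end)
```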
@ -188,9 +188,9 @@ exist.
> assert "Rule" not in matcher > assert "Rule" not in matcher
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----- | ------- | ------------------------- | | ----- | ---- | ------------------------- |
| `key` | unicode | The ID of the match rule. | | `key` | str | The ID of the match rule. |
## Matcher.get {#get tag="method" new="2"} ## Matcher.get {#get tag="method" new="2"}
@ -204,7 +204,7 @@ Retrieve the pattern stored for a key. Returns the rule as an
> on_match, patterns = matcher.get("Rule") > on_match, patterns = matcher.get("Rule")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ------- | --------------------------------------------- | | ----------- | ----- | --------------------------------------------- |
| `key` | unicode | The ID of the match rule. | | `key` | str | The ID of the match rule. |
| **RETURNS** | tuple | The rule, as an `(on_match, patterns)` tuple. | | **RETURNS** | tuple | The rule, as an `(on_match, patterns)` tuple. |
View File
@ -133,10 +133,10 @@ Check whether the matcher contains rules for a match ID.
> assert "OBAMA" in matcher > assert "OBAMA" in matcher
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ------- | ----------------------------------------------------- | | ----------- | ---- | ----------------------------------------------------- |
| `key` | unicode | The match ID. | | `key` | str | The match ID. |
| **RETURNS** | bool | Whether the matcher contains rules for this match ID. | | **RETURNS** | bool | Whether the matcher contains rules for this match ID. |
## PhraseMatcher.add {#add tag="method"} ## PhraseMatcher.add {#add tag="method"}
@ -162,7 +162,7 @@ overwritten.
| Name | Type | Description | | Name | Type | Description |
| ---------- | ------------------ | --------------------------------------------------------------------------------------------- | | ---------- | ------------------ | --------------------------------------------------------------------------------------------- |
| `match_id` | unicode | An ID for the thing you're matching. | | `match_id` | str | An ID for the thing you're matching. |
| `on_match` | callable or `None` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. | | `on_match` | callable or `None` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. |
| `*docs` | `Doc` | `Doc` objects of the phrases to match. | | `*docs` | `Doc` | `Doc` objects of the phrases to match. |
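A minimal sketch, mirroring the `Matcher` API but with `Doc` objects as phrases:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("OBAMA", None, nlp("Barack Obama"))  # phrases are Doc objects
matches = matcher(nlp("Barack Obama was the 44th president"))
```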
@ -198,6 +198,6 @@ does not exist.
> assert "OBAMA" not in matcher > assert "OBAMA" not in matcher
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----- | ------- | ------------------------- | | ----- | ---- | ------------------------- |
| `key` | unicode | The ID of the match rule. | | `key` | str | The ID of the match rule. |
View File
@ -112,8 +112,8 @@ end of the pipeline and after all other components.
</Infobox> </Infobox>
| Name | Type | Description | | Name | Type | Description |
| ----------- | ------- | ------------------------------------------------------------ | | ----------- | ----- | ------------------------------------------------------------ |
| `doc` | `Doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. | | `doc` | `Doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. |
| `label` | unicode | The subtoken dependency label. Defaults to `"subtok"`. | | `label` | str | The subtoken dependency label. Defaults to `"subtok"`. |
| **RETURNS** | `Doc` | The modified `Doc` with merged subtokens. | | **RETURNS** | `Doc` | The modified `Doc` with merged subtokens. |
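A sketch of plugging the component into a pipeline, assuming an installed pretrained model whose parser predicts `subtok` labels:

```python
import spacy
from spacy.pipeline import merge_subtokens

nlp = spacy.load("en_core_web_sm")  # assumed installed model
nlp.add_pipe(merge_subtokens, after="parser")
```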
View File
@ -81,9 +81,9 @@ a file `sentencizer.json`. This also happens automatically when you save an
> sentencizer.to_disk("/path/to/sentencizer.json") > sentencizer.to_disk("/path/to/sentencizer.json")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ------ | ---------------- | ---------------------------------------------------------------------------------------------------------------- | | ------ | ------------ | ---------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a file, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | | `path` | str / `Path` | A path to a file, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
## Sentencizer.from_disk {#from_disk tag="method"} ## Sentencizer.from_disk {#from_disk tag="method"}
@ -98,10 +98,10 @@ added to its pipeline.
> sentencizer.from_disk("/path/to/sentencizer.json") > sentencizer.from_disk("/path/to/sentencizer.json")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ---------------- | -------------------------------------------------------------------------- | | ----------- | ------------- | -------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a JSON file. Paths may be either strings or `Path`-like objects. | | `path` | str / `Path` | A path to a JSON file. Paths may be either strings or `Path`-like objects. |
| **RETURNS** | `Sentencizer` | The modified `Sentencizer` object. | | **RETURNS** | `Sentencizer` | The modified `Sentencizer` object. |
## Sentencizer.to_bytes {#to_bytes tag="method"} ## Sentencizer.to_bytes {#to_bytes tag="method"}
View File
@ -110,7 +110,7 @@ For details, see the documentation on
| Name | Type | Description | | Name | Type | Description |
| --------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------- | | --------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------- |
| `name` | unicode | Name of the attribute to set by the extension. For example, `'my_attr'` will be available as `span._.my_attr`. | | `name` | str | Name of the attribute to set by the extension. For example, `'my_attr'` will be available as `span._.my_attr`. |
| `default` | - | Optional default value of the attribute if no getter or method is defined. | | `default` | - | Optional default value of the attribute if no getter or method is defined. |
| `method` | callable | Set a custom method on the object, for example `span._.compare(other_span)`. | | `method` | callable | Set a custom method on the object, for example `span._.compare(other_span)`. |
| `getter` | callable | Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute. | | `getter` | callable | Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute. |
@ -132,10 +132,10 @@ Look up a previously registered extension by name. Returns a 4-tuple
> assert extension == (False, None, None, None) > assert extension == (False, None, None, None)
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ------- | ------------------------------------------------------------- | | ----------- | ----- | ------------------------------------------------------------- |
| `name` | unicode | Name of the extension. | | `name` | str | Name of the extension. |
| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the extension. | | **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the extension. |
## Span.has_extension {#has_extension tag="classmethod" new="2"} ## Span.has_extension {#has_extension tag="classmethod" new="2"}
@ -149,10 +149,10 @@ Check whether an extension has been registered on the `Span` class.
> assert Span.has_extension("is_city") > assert Span.has_extension("is_city")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ------- | ------------------------------------------ | | ----------- | ---- | ------------------------------------------ |
| `name` | unicode | Name of the extension to check. | | `name` | str | Name of the extension to check. |
| **RETURNS** | bool | Whether the extension has been registered. | | **RETURNS** | bool | Whether the extension has been registered. |
## Span.remove_extension {#remove_extension tag="classmethod" new="2.0.12"} ## Span.remove_extension {#remove_extension tag="classmethod" new="2.0.12"}
@ -167,10 +167,10 @@ Remove a previously registered extension.
> assert not Span.has_extension("is_city") > assert not Span.has_extension("is_city")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ------- | --------------------------------------------------------------------- | | ----------- | ----- | --------------------------------------------------------------------- |
| `name` | unicode | Name of the extension. | | `name` | str | Name of the extension. |
| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the removed extension. | | **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the removed extension. |
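Taken together, the extension methods support patterns like this sketch (the extension name is illustrative):

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
Span.set_extension("is_city", default=False)
doc = nlp("I like New York")
doc[2:4]._.is_city = True  # write via the underscore namespace
assert Span.has_extension("is_city")
```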
## Span.char_span {#char_span tag="method" new="2.2.4"} ## Span.char_span {#char_span tag="method" new="2.2.4"}
@ -497,16 +497,16 @@ The L2 norm of the span's vector representation.
| `end` | int | The token offset for the end of the span. | | `end` | int | The token offset for the end of the span. |
| `start_char` | int | The character offset for the start of the span. | | `start_char` | int | The character offset for the start of the span. |
| `end_char` | int | The character offset for the end of the span. | | `end_char` | int | The character offset for the end of the span. |
| `text` | unicode | A unicode representation of the span text. | | `text` | str | A unicode representation of the span text. |
| `text_with_ws` | unicode | The text content of the span with a trailing whitespace character if the last token has one. | | `text_with_ws` | str | The text content of the span with a trailing whitespace character if the last token has one. |
| `orth` | int | ID of the verbatim text content. | | `orth` | int | ID of the verbatim text content. |
| `orth_` | unicode | Verbatim text content (identical to `Span.text`). Exists mostly for consistency with the other attributes. | | `orth_` | str | Verbatim text content (identical to `Span.text`). Exists mostly for consistency with the other attributes. |
| `label` | int | The hash value of the span's label. | | `label` | int | The hash value of the span's label. |
| `label_` | unicode | The span's label. | | `label_` | str | The span's label. |
| `lemma_` | unicode | The span's lemma. | | `lemma_` | str | The span's lemma. |
| `kb_id` | int | The hash value of the knowledge base ID referred to by the span. | | `kb_id` | int | The hash value of the knowledge base ID referred to by the span. |
| `kb_id_` | unicode | The knowledge base ID referred to by the span. | | `kb_id_` | str | The knowledge base ID referred to by the span. |
| `ent_id` | int | The hash value of the named entity the token is an instance of. | | `ent_id` | int | The hash value of the named entity the token is an instance of. |
| `ent_id_` | unicode | The string ID of the named entity the token is an instance of. | | `ent_id_` | str | The string ID of the named entity the token is an instance of. |
| `sentiment` | float | A scalar value indicating the positivity or negativity of the span. | | `sentiment` | float | A scalar value indicating the positivity or negativity of the span. |
| `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). | | `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). |
View File
@ -55,7 +55,7 @@ Retrieve a string from a given hash, or vice versa.
| Name | Type | Description | | Name | Type | Description |
| -------------- | ------------------------ | -------------------------- | | -------------- | ------------------------ | -------------------------- |
| `string_or_id` | bytes, unicode or uint64 | The value to encode. | | `string_or_id` | bytes, str or uint64 | The value to encode. |
| **RETURNS** | unicode or int | The value to be retrieved. | | **RETURNS** | str or int | The value to be retrieved. |
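A sketch of the two-way lookup:

```python
from spacy.strings import StringStore

stringstore = StringStore(["apple", "orange"])
apple_hash = stringstore["apple"]          # str -> uint64 hash
assert stringstore[apple_hash] == "apple"  # hash -> str
```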
## StringStore.\_\_contains\_\_ {#contains tag="method"} ## StringStore.\_\_contains\_\_ {#contains tag="method"}
@ -69,10 +69,10 @@ Check whether a string is in the store.
> assert not "cherry" in stringstore > assert not "cherry" in stringstore
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ------- | -------------------------------------- | | ----------- | ---- | -------------------------------------- |
| `string` | unicode | The string to check. | | `string` | str | The string to check. |
| **RETURNS** | bool | Whether the store contains the string. | | **RETURNS** | bool | Whether the store contains the string. |
## StringStore.\_\_iter\_\_ {#iter tag="method"} ## StringStore.\_\_iter\_\_ {#iter tag="method"}
@ -87,9 +87,9 @@ store will always include an empty string `''` at position `0`.
> assert all_strings == ["apple", "orange"] > assert all_strings == ["apple", "orange"]
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ---------- | ------- | ---------------------- | | ---------- | ---- | ---------------------- |
| **YIELDS** | unicode | A string in the store. | | **YIELDS** | str | A string in the store. |
## StringStore.add {#add tag="method" new="2"} ## StringStore.add {#add tag="method" new="2"}
@ -106,10 +106,10 @@ Add a string to the `StringStore`.
> assert stringstore["banana"] == banana_hash > assert stringstore["banana"] == banana_hash
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ------- | ------------------------ | | ----------- | ------ | ------------------------ |
| `string` | unicode | The string to add. | | `string` | str | The string to add. |
| **RETURNS** | uint64 | The string's hash value. | | **RETURNS** | uint64 | The string's hash value. |
## StringStore.to_disk {#to_disk tag="method" new="2"} ## StringStore.to_disk {#to_disk tag="method" new="2"}
@ -121,9 +121,9 @@ Save the current state to a directory.
> stringstore.to_disk("/path/to/strings") > stringstore.to_disk("/path/to/strings")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- | | ------ | ------------ | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | | `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
## StringStore.from_disk {#from_disk tag="method" new="2"} ## StringStore.from_disk {#from_disk tag="method" new="2"}
@ -136,10 +136,10 @@ Loads state from a directory. Modifies the object in place and returns it.
> stringstore = StringStore().from_disk("/path/to/strings") > stringstore = StringStore().from_disk("/path/to/strings")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ---------------- | -------------------------------------------------------------------------- | | ----------- | ------------- | -------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. | | `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
| **RETURNS** | `StringStore` | The modified `StringStore` object. | | **RETURNS** | `StringStore` | The modified `StringStore` object. |
## StringStore.to_bytes {#to_bytes tag="method"} ## StringStore.to_bytes {#to_bytes tag="method"}
@ -185,7 +185,7 @@ Get a 64-bit hash for a given string.
> assert hash_string("apple") == 8566208034543834098 > assert hash_string("apple") == 8566208034543834098
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ------- | ------------------- | | ----------- | ------ | ------------------- |
| `string` | unicode | The string to hash. | | `string` | str | The string to hash. |
| **RETURNS** | uint64 | The hash. | | **RETURNS** | uint64 | The hash. |
View File
@ -229,10 +229,10 @@ Add a new label to the pipe.
> tagger.add_label("MY_LABEL", {POS: 'NOUN'}) > tagger.add_label("MY_LABEL", {POS: 'NOUN'})
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| -------- | ------- | --------------------------------------------------------------- | | -------- | ---- | --------------------------------------------------------------- |
| `label` | unicode | The label to add. | | `label` | str | The label to add. |
| `values` | dict | Optional values to map to the label, e.g. a tag map dictionary. | | `values` | dict | Optional values to map to the label, e.g. a tag map dictionary. |
## Tagger.to_disk {#to_disk tag="method"} ## Tagger.to_disk {#to_disk tag="method"}
@ -245,10 +245,10 @@ Serialize the pipe to disk.
> tagger.to_disk("/path/to/tagger") > tagger.to_disk("/path/to/tagger")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- | | --------- | ------------ | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | | `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | | `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
## Tagger.from_disk {#from_disk tag="method"} ## Tagger.from_disk {#from_disk tag="method"}
@ -261,11 +261,11 @@ Load the pipe from disk. Modifies the object in place and returns it.
> tagger.from_disk("/path/to/tagger") > tagger.from_disk("/path/to/tagger")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ---------------- | -------------------------------------------------------------------------- | | ----------- | ------------ | -------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. | | `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | | `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `Tagger` | The modified `Tagger` object. | | **RETURNS** | `Tagger` | The modified `Tagger` object. |
## Tagger.to_bytes {#to_bytes tag="method"} ## Tagger.to_bytes {#to_bytes tag="method"}
View File
@ -44,7 +44,7 @@ shortcut for this and instantiate the component using its string name and
| `vocab` | `Vocab` | The shared vocabulary. | | `vocab` | `Vocab` | The shared vocabulary. |
| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. | | `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
| `exclusive_classes` | bool | Make categories mutually exclusive. Defaults to `False`. | | `exclusive_classes` | bool | Make categories mutually exclusive. Defaults to `False`. |
| `architecture` | unicode | Model architecture to use, see [architectures](#architectures) for details. Defaults to `"ensemble"`. | | `architecture` | str | Model architecture to use, see [architectures](#architectures) for details. Defaults to `"ensemble"`. |
| **RETURNS** | `TextCategorizer` | The newly constructed object. | | **RETURNS** | `TextCategorizer` | The newly constructed object. |
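In practice the component is usually created via the pipeline, as in this sketch using the documented `simple_cnn` architecture:

```python
import spacy

nlp = spacy.blank("en")
textcat = nlp.create_pipe("textcat", config={"exclusive_classes": True,
                                             "architecture": "simple_cnn"})
nlp.add_pipe(textcat)
```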
### Architectures {#architectures new="2.1"} ### Architectures {#architectures new="2.1"}
@ -247,9 +247,9 @@ Add a new label to the pipe.
> textcat.add_label("MY_LABEL") > textcat.add_label("MY_LABEL")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ------- | ------- | ----------------- | | ------- | ---- | ----------------- |
| `label` | unicode | The label to add. | | `label` | str | The label to add. |
## TextCategorizer.to_disk {#to_disk tag="method"} ## TextCategorizer.to_disk {#to_disk tag="method"}
@ -262,10 +262,10 @@ Serialize the pipe to disk.
> textcat.to_disk("/path/to/textcat") > textcat.to_disk("/path/to/textcat")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- | | --------- | ------------ | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | | `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | | `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
## TextCategorizer.from_disk {#from_disk tag="method"} ## TextCategorizer.from_disk {#from_disk tag="method"}
@ -280,7 +280,7 @@ Load the pipe from disk. Modifies the object in place and returns it.
| Name | Type | Description | | Name | Type | Description |
| ----------- | ----------------- | -------------------------------------------------------------------------- | | ----------- | ----------------- | -------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. | | `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | | `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `TextCategorizer` | The modified `TextCategorizer` object. | | **RETURNS** | `TextCategorizer` | The modified `TextCategorizer` object. |

View File
| Name | Type | Description | | Name | Type | Description |
| --------- | -------- | --------------------------------------------------------------------------------------------------------------------------------------- | | --------- | -------- | --------------------------------------------------------------------------------------------------------------------------------------- |
| `name` | unicode | Name of the attribute to set by the extension. For example, `'my_attr'` will be available as `token._.my_attr`. | | `name` | str | Name of the attribute to set by the extension. For example, `'my_attr'` will be available as `token._.my_attr`. |
| `default` | - | Optional default value of the attribute if no getter or method is defined. | | `default` | - | Optional default value of the attribute if no getter or method is defined. |
| `method` | callable | Set a custom method on the object, for example `token._.compare(other_token)`. | | `method` | callable | Set a custom method on the object, for example `token._.compare(other_token)`. |
| `getter` | callable | Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute. | | `getter` | callable | Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute. |
@ -80,10 +80,10 @@ Look up a previously registered extension by name. Returns a 4-tuple
> assert extension == (False, None, None, None) > assert extension == (False, None, None, None)
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ------- | ------------------------------------------------------------- | | ----------- | ----- | ------------------------------------------------------------- |
| `name` | unicode | Name of the extension. | | `name` | str | Name of the extension. |
| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the extension. | | **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the extension. |
## Token.has_extension {#has_extension tag="classmethod" new="2"} ## Token.has_extension {#has_extension tag="classmethod" new="2"}
@ -97,10 +97,10 @@ Check whether an extension has been registered on the `Token` class.
> assert Token.has_extension("is_fruit") > assert Token.has_extension("is_fruit")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ------- | ------------------------------------------ | | ----------- | ---- | ------------------------------------------ |
| `name` | unicode | Name of the extension to check. | | `name` | str | Name of the extension to check. |
| **RETURNS** | bool | Whether the extension has been registered. | | **RETURNS** | bool | Whether the extension has been registered. |
## Token.remove_extension {#remove_extension tag="classmethod" new="2.0.11"} ## Token.remove_extension {#remove_extension tag="classmethod" new="2.0.11"}
@ -115,10 +115,10 @@ Remove a previously registered extension.
> assert not Token.has_extension("is_fruit") > assert not Token.has_extension("is_fruit")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ------- | --------------------------------------------------------------------- | | ----------- | ----- | --------------------------------------------------------------------- |
| `name` | unicode | Name of the extension. | | `name` | str | Name of the extension. |
| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the removed extension. | | **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the removed extension. |
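A sketch combining the methods above, with an illustrative getter-based extension:

```python
import spacy
from spacy.tokens import Token

nlp = spacy.blank("en")
fruits = ["apple", "pear", "banana"]  # illustrative word list
Token.set_extension("is_fruit", getter=lambda token: token.text in fruits)
doc = nlp("I have an apple")
assert doc[3]._.is_fruit
```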
## Token.check_flag {#check_flag tag="method"} ## Token.check_flag {#check_flag tag="method"}
@ -408,71 +408,71 @@ The L2 norm of the token's vector representation.
## Attributes {#attributes} ## Attributes {#attributes}
| Name | Type | Description | | Name | Type | Description |
| -------------------------------------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | -------------------------------------------- | ------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `doc` | `Doc` | The parent document. | | `doc` | `Doc` | The parent document. |
| `sent` <Tag variant="new">2.0.12</Tag> | `Span` | The sentence span that this token is a part of. | | `sent` <Tag variant="new">2.0.12</Tag> | `Span` | The sentence span that this token is a part of. |
| `text` | unicode | Verbatim text content. | | `text` | str | Verbatim text content. |
| `text_with_ws` | unicode | Text content, with trailing space character if present. | | `text_with_ws` | str | Text content, with trailing space character if present. |
| `whitespace_` | unicode | Trailing space character if present. | | `whitespace_` | str | Trailing space character if present. |
| `orth` | int | ID of the verbatim text content. | | `orth` | int | ID of the verbatim text content. |
| `orth_` | unicode | Verbatim text content (identical to `Token.text`). Exists mostly for consistency with the other attributes. | | `orth_` | str | Verbatim text content (identical to `Token.text`). Exists mostly for consistency with the other attributes. |
| `vocab` | `Vocab` | The vocab object of the parent `Doc`. | | `vocab` | `Vocab` | The vocab object of the parent `Doc`. |
| `tensor` <Tag variant="new">2.1.7</Tag> | `ndarray` | The token's slice of the parent `Doc`'s tensor. | | `tensor` <Tag variant="new">2.1.7</Tag> | `ndarray` | The token's slice of the parent `Doc`'s tensor. |
| `head` | `Token` | The syntactic parent, or "governor", of this token. | | `head` | `Token` | The syntactic parent, or "governor", of this token. |
| `left_edge` | `Token` | The leftmost token of this token's syntactic descendants. | | `left_edge` | `Token` | The leftmost token of this token's syntactic descendants. |
| `right_edge` | `Token` | The rightmost token of this token's syntactic descendants. | | `right_edge` | `Token` | The rightmost token of this token's syntactic descendants. |
| `i` | int | The index of the token within the parent document. | | `i` | int | The index of the token within the parent document. |
| `ent_type` | int | Named entity type. | | `ent_type` | int | Named entity type. |
| `ent_type_` | unicode | Named entity type. | | `ent_type_` | str | Named entity type. |
| `ent_iob` | int | IOB code of named entity tag. `3` means the token begins an entity, `2` means it is outside an entity, `1` means it is inside an entity, and `0` means no entity tag is set. | | `ent_iob` | int | IOB code of named entity tag. `3` means the token begins an entity, `2` means it is outside an entity, `1` means it is inside an entity, and `0` means no entity tag is set. |
| `ent_iob_` | unicode | IOB code of named entity tag. "B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set. | | `ent_iob_` | str | IOB code of named entity tag. "B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set. |
| `ent_kb_id` <Tag variant="new">2.2</Tag> | int | Knowledge base ID that refers to the named entity this token is a part of, if any. | | `ent_kb_id` <Tag variant="new">2.2</Tag> | int | Knowledge base ID that refers to the named entity this token is a part of, if any. |
| `ent_kb_id_` <Tag variant="new">2.2</Tag> | unicode | Knowledge base ID that refers to the named entity this token is a part of, if any. | | `ent_kb_id_` <Tag variant="new">2.2</Tag> | str | Knowledge base ID that refers to the named entity this token is a part of, if any. |
| `ent_id` | int | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. | | `ent_id` | int | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. |
| `ent_id_` | unicode | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. | | `ent_id_` | str | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. |
| `lemma` | int | Base form of the token, with no inflectional suffixes. | | `lemma` | int | Base form of the token, with no inflectional suffixes. |
| `lemma_` | unicode | Base form of the token, with no inflectional suffixes. | | `lemma_` | str | Base form of the token, with no inflectional suffixes. |
| `norm` | int | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). | | `norm` | int | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). |
| `norm_` | unicode | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). | | `norm_` | str | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). |
| `lower` | int | Lowercase form of the token. | | `lower` | int | Lowercase form of the token. |
| `lower_` | unicode | Lowercase form of the token text. Equivalent to `Token.text.lower()`. | | `lower_` | str | Lowercase form of the token text. Equivalent to `Token.text.lower()`. |
| `shape` | int | Transform of the token's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. | | `shape` | int | Transform of the token's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. |
| `shape_` | unicode | Transform of the token's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. | | `shape_` | str | Transform of the token's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example, `"Xxxx"` or `"dd"`. |
| `prefix` | int | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. | | `prefix` | int | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. |
| `prefix_` | unicode | A length-N substring from the start of the token. Defaults to `N=1`. | | `prefix_` | str | A length-N substring from the start of the token. Defaults to `N=1`. |
| `suffix` | int | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. | | `suffix` | int | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. |
| `suffix_` | unicode | Length-N substring from the end of the token. Defaults to `N=3`. | | `suffix_` | str | Length-N substring from the end of the token. Defaults to `N=3`. |
| `is_alpha` | bool | Does the token consist of alphabetic characters? Equivalent to `token.text.isalpha()`. | | `is_alpha` | bool | Does the token consist of alphabetic characters? Equivalent to `token.text.isalpha()`. |
| `is_ascii` | bool | Does the token consist of ASCII characters? Equivalent to `all(ord(c) < 128 for c in token.text)`. | | `is_ascii` | bool | Does the token consist of ASCII characters? Equivalent to `all(ord(c) < 128 for c in token.text)`. |
| `is_digit` | bool | Does the token consist of digits? Equivalent to `token.text.isdigit()`. | | `is_digit` | bool | Does the token consist of digits? Equivalent to `token.text.isdigit()`. |
| `is_lower` | bool | Is the token in lowercase? Equivalent to `token.text.islower()`. | | `is_lower` | bool | Is the token in lowercase? Equivalent to `token.text.islower()`. |
| `is_upper` | bool | Is the token in uppercase? Equivalent to `token.text.isupper()`. | | `is_upper` | bool | Is the token in uppercase? Equivalent to `token.text.isupper()`. |
| `is_title` | bool | Is the token in titlecase? Equivalent to `token.text.istitle()`. | | `is_title` | bool | Is the token in titlecase? Equivalent to `token.text.istitle()`. |
| `is_punct` | bool | Is the token punctuation? | | `is_punct` | bool | Is the token punctuation? |
| `is_left_punct` | bool | Is the token a left punctuation mark, e.g. `'('` ? | | `is_left_punct` | bool | Is the token a left punctuation mark, e.g. `'('` ? |
| `is_right_punct` | bool | Is the token a right punctuation mark, e.g. `')'` ? | | `is_right_punct` | bool | Is the token a right punctuation mark, e.g. `')'` ? |
| `is_space` | bool | Does the token consist of whitespace characters? Equivalent to `token.text.isspace()`. | | `is_space` | bool | Does the token consist of whitespace characters? Equivalent to `token.text.isspace()`. |
| `is_bracket` | bool | Is the token a bracket? | | `is_bracket` | bool | Is the token a bracket? |
| `is_quote` | bool | Is the token a quotation mark? | | `is_quote` | bool | Is the token a quotation mark? |
| `is_currency` <Tag variant="new">2.0.8</Tag> | bool | Is the token a currency symbol? | | `is_currency` <Tag variant="new">2.0.8</Tag> | bool | Is the token a currency symbol? |
| `like_url` | bool | Does the token resemble a URL? | | `like_url` | bool | Does the token resemble a URL? |
| `like_num` | bool | Does the token represent a number? e.g. "10.9", "10", "ten", etc. | | `like_num` | bool | Does the token represent a number? e.g. "10.9", "10", "ten", etc. |
| `like_email` | bool | Does the token resemble an email address? | | `like_email` | bool | Does the token resemble an email address? |
| `is_oov` | bool | Is the token out-of-vocabulary? | | `is_oov` | bool | Is the token out-of-vocabulary? |
| `is_stop` | bool | Is the token part of a "stop list"? | | `is_stop` | bool | Is the token part of a "stop list"? |
| `pos` | int | Coarse-grained part-of-speech. | | `pos` | int | Coarse-grained part-of-speech. |
| `pos_` | unicode | Coarse-grained part-of-speech. | | `pos_` | str | Coarse-grained part-of-speech. |
| `tag` | int | Fine-grained part-of-speech. | | `tag` | int | Fine-grained part-of-speech. |
| `tag_` | unicode | Fine-grained part-of-speech. | | `tag_` | str | Fine-grained part-of-speech. |
| `dep` | int | Syntactic dependency relation. | | `dep` | int | Syntactic dependency relation. |
| `dep_` | unicode | Syntactic dependency relation. | | `dep_` | str | Syntactic dependency relation. |
| `lang` | int | Language of the parent document's vocabulary. | | `lang` | int | Language of the parent document's vocabulary. |
| `lang_` | unicode | Language of the parent document's vocabulary. | | `lang_` | str | Language of the parent document's vocabulary. |
| `prob` | float | Smoothed log probability estimate of token's word type (context-independent entry in the vocabulary). | | `prob` | float | Smoothed log probability estimate of token's word type (context-independent entry in the vocabulary). |
| `idx` | int | The character offset of the token within the parent document. | | `idx` | int | The character offset of the token within the parent document. |
| `sentiment` | float | A scalar value indicating the positivity or negativity of the token. | | `sentiment` | float | A scalar value indicating the positivity or negativity of the token. |
| `lex_id` | int | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. | | `lex_id` | int | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. |
| `rank` | int | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. | | `rank` | int | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. |
| `cluster` | int | Brown cluster ID. | | `cluster` | int | Brown cluster ID. |
| `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). | | `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). |
View File
@ -34,15 +34,15 @@ the
> tokenizer = nlp.Defaults.create_tokenizer(nlp) > tokenizer = nlp.Defaults.create_tokenizer(nlp)
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ---------------- | ----------- | ----------------------------------------------------------------------------------------------------------------------------- | | ---------------- | ----------- | ------------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | A storage container for lexical types. | | `vocab` | `Vocab` | A storage container for lexical types. |
| `rules` | dict | Exceptions and special-cases for the tokenizer. | | `rules` | dict | Exceptions and special-cases for the tokenizer. |
| `prefix_search` | callable | A function matching the signature of `re.compile(string).search` to match prefixes. | | `prefix_search` | callable | A function matching the signature of `re.compile(string).search` to match prefixes. |
| `suffix_search` | callable | A function matching the signature of `re.compile(string).search` to match suffixes. | | `suffix_search` | callable | A function matching the signature of `re.compile(string).search` to match suffixes. |
| `infix_finditer` | callable | A function matching the signature of `re.compile(string).finditer` to find infixes. | | `infix_finditer` | callable | A function matching the signature of `re.compile(string).finditer` to find infixes. |
| `token_match` | callable | A function matching the signature of `re.compile(string).match` to find token matches. | | `token_match` | callable | A function matching the signature of `re.compile(string).match` to find token matches. |
| **RETURNS** | `Tokenizer` | The newly constructed object. | | **RETURNS** | `Tokenizer` | The newly constructed object. |
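A sketch of a custom tokenizer built from hand-rolled regexes; the rules here are deliberately simplistic:

```python
import re
import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.blank("en")
prefix_re = re.compile(r'''^[\["']''')  # simplistic prefix rule
suffix_re = re.compile(r'''[\]"']$''')  # simplistic suffix rule
infix_re = re.compile(r"[-~]")          # split on hyphens and tildes
nlp.tokenizer = Tokenizer(nlp.vocab, rules={},
                          prefix_search=prefix_re.search,
                          suffix_search=suffix_re.search,
                          infix_finditer=infix_re.finditer)
```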
## Tokenizer.\_\_call\_\_ {#call tag="method"} ## Tokenizer.\_\_call\_\_ {#call tag="method"}
@ -55,10 +55,10 @@ Tokenize a string.
> assert len(tokens) == 4 > assert len(tokens) == 4
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ------- | --------------------------------------- | | ----------- | ----- | --------------------------------------- |
| `string` | unicode | The string to tokenize. | | `string` | str | The string to tokenize. |
| **RETURNS** | `Doc` | A container for linguistic annotations. | | **RETURNS** | `Doc` | A container for linguistic annotations. |
## Tokenizer.pipe {#pipe tag="method"} ## Tokenizer.pipe {#pipe tag="method"}
@ -82,20 +82,20 @@ Tokenize a stream of texts.
Find internal split points of the string. Find internal split points of the string.
| Name | Type | Description | | Name | Type | Description |
| ----------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------------------- | | ----------- | ---- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
| `string` | unicode | The string to split. | | `string` | str | The string to split. |
| **RETURNS** | list | A list of `re.MatchObject` objects that have `.start()` and `.end()` methods, denoting the placement of internal segment separators, e.g. hyphens. | | **RETURNS** | list | A list of `re.MatchObject` objects that have `.start()` and `.end()` methods, denoting the placement of internal segment separators, e.g. hyphens. |
## Tokenizer.find_prefix {#find_prefix tag="method"} ## Tokenizer.find_prefix {#find_prefix tag="method"}
Find the length of a prefix that should be segmented from the string, or `None` Find the length of a prefix that should be segmented from the string, or `None`
if no prefix rules match. if no prefix rules match.
| Name | Type | Description | | Name | Type | Description |
| ----------- | ------- | ------------------------------------------------------ | | ----------- | ---- | ------------------------------------------------------ |
| `string` | unicode | The string to segment. | | `string` | str | The string to segment. |
| **RETURNS** | int / `None` | The length of the prefix if present, otherwise `None`. | | **RETURNS** | int / `None` | The length of the prefix if present, otherwise `None`. |
## Tokenizer.find_suffix {#find_suffix tag="method"} ## Tokenizer.find_suffix {#find_suffix tag="method"}
@ -104,7 +104,7 @@ if no suffix rules match.
| Name | Type | Description | | Name | Type | Description |
| ----------- | ------------ | ------------------------------------------------------ | | ----------- | ------------ | ------------------------------------------------------ |
| `string` | unicode | The string to segment. | | `string` | str | The string to segment. |
| **RETURNS** | int / `None` | The length of the suffix if present, otherwise `None`. | | **RETURNS** | int / `None` | The length of the suffix if present, otherwise `None`. |
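As a rough illustration with the default English punctuation rules (exact return values depend on the language's affix rules, so the comments below are assumptions):

```python
import spacy

nlp = spacy.blank("en")
prefix_len = nlp.tokenizer.find_prefix('"Hello')  # e.g. 1 for the leading quote
suffix_len = nlp.tokenizer.find_suffix('Hello"')  # e.g. 1 for the trailing quote
```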
## Tokenizer.add_special_case {#add_special_case tag="method"} ## Tokenizer.add_special_case {#add_special_case tag="method"}
@ -125,7 +125,7 @@ and examples.
| Name | Type | Description | | Name | Type | Description |
| ------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | ------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `string` | unicode | The string to specially tokenize. | | `string` | str | The string to specially tokenize. |
| `token_attrs` | iterable | A sequence of dicts, where each dict describes a token and its attributes. The `ORTH` fields of the attributes must exactly match the string when they are concatenated. | | `token_attrs` | iterable | A sequence of dicts, where each dict describes a token and its attributes. The `ORTH` fields of the attributes must exactly match the string when they are concatenated. |
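A minimal sketch; note how the `ORTH` values concatenate back to the original string:

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
assert [t.text for t in nlp("gimme that")] == ["gim", "me", "that"]
```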
## Tokenizer.explain {#explain tag="method"} ## Tokenizer.explain {#explain tag="method"}
@ -142,10 +142,10 @@ produced are identical to `Tokenizer.__call__` except for whitespace tokens.
> assert [t[1] for t in tok_exp] == ["(", "do", "n't", ")"] > assert [t[1] for t in tok_exp] == ["(", "do", "n't", ")"]
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ------------| -------- | --------------------------------------------------- | | ----------- | ---- | --------------------------------------------------- |
| `string` | unicode | The string to tokenize with the debugging tokenizer. | | `string` | str | The string to tokenize with the debugging tokenizer. |
| **RETURNS** | list | A list of `(pattern_string, token_string)` tuples. | | **RETURNS** | list | A list of `(pattern_string, token_string)` tuples. |
## Tokenizer.to_disk {#to_disk tag="method"} ## Tokenizer.to_disk {#to_disk tag="method"}
@ -158,10 +158,10 @@ Serialize the tokenizer to disk.
> tokenizer.to_disk("/path/to/tokenizer") > tokenizer.to_disk("/path/to/tokenizer")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- | | --------- | ------------ | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | | `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | | `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
## Tokenizer.from_disk {#from_disk tag="method"} ## Tokenizer.from_disk {#from_disk tag="method"}
@ -174,11 +174,11 @@ Load the tokenizer from disk. Modifies the object in place and returns it.
> tokenizer.from_disk("/path/to/tokenizer") > tokenizer.from_disk("/path/to/tokenizer")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ---------------- | -------------------------------------------------------------------------- | | ----------- | ------------ | -------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. | | `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. | | `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
| **RETURNS** | `Tokenizer` | The modified `Tokenizer` object. | | **RETURNS** | `Tokenizer` | The modified `Tokenizer` object. |
## Tokenizer.to_bytes {#to_bytes tag="method"} ## Tokenizer.to_bytes {#to_bytes tag="method"}
@ -217,14 +217,14 @@ it.
## Attributes {#attributes} ## Attributes {#attributes}
| Name | Type | Description | | Name | Type | Description |
| ---------------- | ------- | --------------------------------------------------------------------------------------------------------------------------- | | ---------------- | ------- | -------------------------------------------------------------------------------------------------------------------------- |
| `vocab` | `Vocab` | The vocab object of the parent `Doc`. | | `vocab` | `Vocab` | The vocab object of the parent `Doc`. |
| `prefix_search` | - | A function to find segment boundaries from the start of a string. Returns the length of the segment, or `None`. | | `prefix_search` | - | A function to find segment boundaries from the start of a string. Returns the length of the segment, or `None`. |
| `suffix_search` | - | A function to find segment boundaries from the end of a string. Returns the length of the segment, or `None`. | | `suffix_search` | - | A function to find segment boundaries from the end of a string. Returns the length of the segment, or `None`. |
| `infix_finditer` | - | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) list of `re.MatchObject` objects. | | `infix_finditer` | - | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) list of `re.MatchObject` objects. |
| `token_match` | - | A function matching the signature of `re.compile(string).match` to find token matches. Returns an `re.MatchObject` or `None`. | | `token_match` | - | A function matching the signature of `re.compile(string).match` to find token matches. Returns an `re.MatchObject` or `None`. |
| `rules` | dict | A dictionary of tokenizer exceptions and special cases. | | `rules` | dict | A dictionary of tokenizer exceptions and special cases. |
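Because these attributes are writable, individual rule sets can be swapped out at runtime. A sketch that replaces only the infix rules (the hyphen-and-tilde pattern is an illustrative assumption, not the default):

```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")
# Illustrative assumption: only treat hyphens and tildes as infixes
infix_re = re.compile(r"[-~]")
nlp.tokenizer.infix_finditer = infix_re.finditer
```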
## Serialization fields {#serialization-fields} ## Serialization fields {#serialization-fields}
View File
@ -32,11 +32,11 @@ class. The data will be loaded in via
> nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger"]) > nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger"])
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ---------------- | --------------------------------------------------------------------------------- | | ----------- | ------------ | --------------------------------------------------------------------------------- |
| `name` | unicode / `Path` | Model to load, i.e. shortcut link, package name or path. | | `name` | str / `Path` | Model to load, i.e. shortcut link, package name or path. |
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). | | `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
| **RETURNS** | `Language` | A `Language` object with the loaded model. | | **RETURNS** | `Language` | A `Language` object with the loaded model. |
Essentially, `spacy.load()` is a convenience wrapper that reads the language ID Essentially, `spacy.load()` is a convenience wrapper that reads the language ID
and pipeline components from a model's `meta.json`, initializes the `Language` and pipeline components from a model's `meta.json`, initializes the `Language`
@ -79,7 +79,7 @@ Create a blank model of a given language class. This function is the twin of
| Name | Type | Description | | Name | Type | Description |
| ----------- | ---------- | ------------------------------------------------------------------------------------------------ | | ----------- | ---------- | ------------------------------------------------------------------------------------------------ |
| `name` | unicode | [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) of the language class to load. | | `name` | str | [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) of the language class to load. |
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). | | `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
| **RETURNS** | `Language` | An empty `Language` object of the appropriate subclass. | | **RETURNS** | `Language` | An empty `Language` object of the appropriate subclass. |
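For example, a blank English pipeline with no model data:

```python
import spacy

nlp_en = spacy.blank("en")  # roughly: from spacy.lang.en import English; English()
doc = nlp_en("This is a sentence.")
```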
@ -98,10 +98,10 @@ meta data as a dictionary instead, you can use the `meta` attribute on your
> spacy.info("de", markdown=True) > spacy.info("de", markdown=True)
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ---------- | ------- | ------------------------------------------------------------- | | ---------- | ---- | ------------------------------------------------------------- |
| `model` | unicode | A model, i.e. shortcut link, package name or path (optional). | | `model` | str | A model, i.e. shortcut link, package name or path (optional). |
| `markdown` | bool | Print information as Markdown. | | `markdown` | bool | Print information as Markdown. |
### spacy.explain {#spacy.explain tag="function"} ### spacy.explain {#spacy.explain tag="function"}
@ -122,10 +122,10 @@ list of available terms, see
> # world NN noun, singular or mass > # world NN noun, singular or mass
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ------- | -------------------------------------------------------- | | ----------- | ---- | -------------------------------------------------------- |
| `term` | unicode | Term to explain. | | `term` | str | Term to explain. |
| **RETURNS** | unicode | The explanation, or `None` if not found in the glossary. | | **RETURNS** | str | The explanation, or `None` if not found in the glossary. |
### spacy.prefer_gpu {#spacy.prefer_gpu tag="function" new="2.0.14"} ### spacy.prefer_gpu {#spacy.prefer_gpu tag="function" new="2.0.14"}
@ -189,13 +189,13 @@ browser. Will run a simple web server.
| Name | Type | Description | Default | | Name | Type | Description | Default |
| --------- | ------------------- | ------------------------------------------------------------------------------------------------------------------------------------ | ----------- | | --------- | ------------------- | ------------------------------------------------------------------------------------------------------------------------------------ | ----------- |
| `docs` | list, `Doc`, `Span` | Document(s) to visualize. | | `docs` | list, `Doc`, `Span` | Document(s) to visualize. |
| `style` | unicode | Visualization style, `'dep'` or `'ent'`. | `'dep'` | | `style` | str | Visualization style, `'dep'` or `'ent'`. | `'dep'` |
| `page` | bool | Render markup as full HTML page. | `True` | | `page` | bool | Render markup as full HTML page. | `True` |
| `minify` | bool | Minify HTML markup. | `False` | | `minify` | bool | Minify HTML markup. | `False` |
| `options` | dict | [Visualizer-specific options](#displacy_options), e.g. colors. | `{}` | | `options` | dict | [Visualizer-specific options](#displacy_options), e.g. colors. | `{}` |
| `manual` | bool | Don't parse `Doc` and instead expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. | `False` | | `manual` | bool | Don't parse `Doc` and instead expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. | `False` |
| `port` | int | Port to serve visualization. | `5000` | | `port` | int | Port to serve visualization. | `5000` |
| `host` | unicode | Host to serve visualization. | `'0.0.0.0'` | | `host` | str | Host to serve visualization. | `'0.0.0.0'` |
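A minimal sketch (assuming an installed `en_core_web_sm` model; the port is an arbitrary choice):

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
displacy.serve(doc, style="ent", port=8080)  # serves at http://0.0.0.0:8080
```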
### displacy.render {#displacy.render tag="method" new="2"} ### displacy.render {#displacy.render tag="method" new="2"}
@ -214,13 +214,13 @@ Render a dependency parse tree or named entity visualization.
| Name | Type | Description | Default | | Name | Type | Description | Default |
| ----------- | ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- | | ----------- | ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
| `docs` | list, `Doc`, `Span` | Document(s) to visualize. | | `docs` | list, `Doc`, `Span` | Document(s) to visualize. |
| `style` | unicode | Visualization style, `'dep'` or `'ent'`. | `'dep'` | | `style` | str | Visualization style, `'dep'` or `'ent'`. | `'dep'` |
| `page` | bool | Render markup as full HTML page. | `False` | | `page` | bool | Render markup as full HTML page. | `False` |
| `minify` | bool | Minify HTML markup. | `False` | | `minify` | bool | Minify HTML markup. | `False` |
| `jupyter` | bool | Explicitly enable or disable "[Jupyter](http://jupyter.org/) mode" to return markup ready to be rendered in a notebook. Detected automatically if `None`. | `None` | | `jupyter` | bool | Explicitly enable or disable "[Jupyter](http://jupyter.org/) mode" to return markup ready to be rendered in a notebook. Detected automatically if `None`. | `None` |
| `options` | dict | [Visualizer-specific options](#displacy_options), e.g. colors. | `{}` | | `options` | dict | [Visualizer-specific options](#displacy_options), e.g. colors. | `{}` |
| `manual` | bool | Don't parse `Doc` and instead expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. | `False` | | `manual` | bool | Don't parse `Doc` and instead expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. | `False` |
| **RETURNS** | unicode | Rendered HTML markup. | | **RETURNS** | str | Rendered HTML markup. |
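Unlike `displacy.serve`, `render` just returns the markup, so you can write it to a file yourself. A sketch:

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("She ate the pizza")
html = displacy.render(doc, style="dep", page=True)
with open("parse.html", "w", encoding="utf8") as f:
    f.write(html)
```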
### Visualizer options {#displacy_options} ### Visualizer options {#displacy_options}
@ -236,22 +236,22 @@ If a setting is not present in the options, the default value will be used.
> displacy.serve(doc, style="dep", options=options) > displacy.serve(doc, style="dep", options=options)
> ``` > ```
| Name | Type | Description | Default | | Name | Type | Description | Default |
| ------------------------------------------ | ------- | --------------------------------------------------------------------------------------------------------------- | ----------------------- | | ------------------------------------------ | ---- | --------------------------------------------------------------------------------------------------------------- | ----------------------- |
| `fine_grained` | bool | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). | `False` | | `fine_grained` | bool | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). | `False` |
| `add_lemma` <Tag variant="new">2.2.4</Tag> | bool | Print the lemmas in a separate row below the token texts. | `False` | | `add_lemma` <Tag variant="new">2.2.4</Tag> | bool | Print the lemmas in a separate row below the token texts. | `False` |
| `collapse_punct` | bool | Attach punctuation to tokens. Can make the parse more readable, as it avoids long arcs for attaching punctuation. | `True` | | `collapse_punct` | bool | Attach punctuation to tokens. Can make the parse more readable, as it avoids long arcs for attaching punctuation. | `True` |
| `collapse_phrases` | bool | Merge noun phrases into one token. | `False` | | `collapse_phrases` | bool | Merge noun phrases into one token. | `False` |
| `compact` | bool | "Compact mode" with square arrows that takes up less space. | `False` | | `compact` | bool | "Compact mode" with square arrows that takes up less space. | `False` |
| `color` | unicode | Text color (HEX, RGB or color names). | `'#000000'` | | `color` | str | Text color (HEX, RGB or color names). | `'#000000'` |
| `bg` | unicode | Background color (HEX, RGB or color names). | `'#ffffff'` | | `bg` | str | Background color (HEX, RGB or color names). | `'#ffffff'` |
| `font` | unicode | Font name or font family for all text. | `'Arial'` | | `font` | str | Font name or font family for all text. | `'Arial'` |
| `offset_x` | int | Spacing on left side of the SVG in px. | `50` | | `offset_x` | int | Spacing on left side of the SVG in px. | `50` |
| `arrow_stroke` | int | Width of arrow path in px. | `2` | | `arrow_stroke` | int | Width of arrow path in px. | `2` |
| `arrow_width` | int | Width of arrow head in px. | `10` / `8` (compact) | | `arrow_width` | int | Width of arrow head in px. | `10` / `8` (compact) |
| `arrow_spacing` | int | Spacing between arrows in px to avoid overlaps. | `20` / `12` (compact) | | `arrow_spacing` | int | Spacing between arrows in px to avoid overlaps. | `20` / `12` (compact) |
| `word_spacing` | int | Vertical spacing between words and arcs in px. | `45` | | `word_spacing` | int | Vertical spacing between words and arcs in px. | `45` |
| `distance` | int | Distance between words in px. | `175` / `150` (compact) | | `distance` | int | Distance between words in px. | `175` / `150` (compact) |
#### Named Entity Visualizer options {#displacy_options-ent} #### Named Entity Visualizer options {#displacy_options-ent}
@ -263,11 +263,11 @@ If a setting is not present in the options, the default value will be used.
> displacy.serve(doc, style="ent", options=options) > displacy.serve(doc, style="ent", options=options)
> ``` > ```
| Name | Type | Description | Default | | Name | Type | Description | Default |
| --------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------ | | --------------------------------------- | ---- | ------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------ |
| `ents` | list | Entity types to highlight (`None` for all types). | `None` | | `ents` | list | Entity types to highlight (`None` for all types). | `None` |
| `colors` | dict | Color overrides. Entity types in uppercase should be mapped to color names or values. | `{}` | | `colors` | dict | Color overrides. Entity types in uppercase should be mapped to color names or values. | `{}` |
| `template` <Tag variant="new">2.2</Tag> | unicode | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use `{bg}`, `{text}` and `{label}`. | see [`templates.py`](https://github.com/explosion/spaCy/blob/master/spacy/displacy/templates.py) | | `template` <Tag variant="new">2.2</Tag> | str | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use `{bg}`, `{text}` and `{label}`. | see [`templates.py`](https://github.com/explosion/spaCy/blob/master/spacy/displacy/templates.py) |
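To override the color for specific labels, pass a `colors` mapping in the options, sketched below (the gradient value is just an illustration):

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup")
colors = {"ORG": "linear-gradient(90deg, #aa9cfc, #fc9ce7)"}
options = {"ents": ["ORG"], "colors": colors}
displacy.serve(doc, style="ent", options=options)
```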
By default, displaCy comes with colors for all By default, displaCy comes with colors for all
[entity types supported by spaCy](/api/annotation#named-entities). If you're [entity types supported by spaCy](/api/annotation#named-entities). If you're
@ -308,9 +308,9 @@ Set custom path to the data directory where spaCy looks for models.
> # PosixPath('/custom/path') > # PosixPath('/custom/path')
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ------ | ---------------- | --------------------------- | | ------ | ------------ | --------------------------- |
| `path` | unicode / `Path` | Path to new data directory. | | `path` | str / `Path` | Path to new data directory. |
### util.get_lang_class {#util.get_lang_class tag="function"} ### util.get_lang_class {#util.get_lang_class tag="function"}
@ -330,7 +330,7 @@ you can use the [`set_lang_class`](/api/top-level#util.set_lang_class) helper.
| Name | Type | Description | | Name | Type | Description |
| ----------- | ---------- | -------------------------------------- | | ----------- | ---------- | -------------------------------------- |
| `lang` | unicode | Two-letter language code, e.g. `'en'`. | | `lang` | str | Two-letter language code, e.g. `'en'`. |
| **RETURNS** | `Language` | Language class. | | **RETURNS** | `Language` | Language class. |
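For example, to resolve a language code to its class and instantiate it lazily:

```python
from spacy import util

lang_cls = util.get_lang_class("en")  # spacy.lang.en.English
nlp = lang_cls()
```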
### util.set_lang_class {#util.set_lang_class tag="function"} ### util.set_lang_class {#util.set_lang_class tag="function"}
@ -352,7 +352,7 @@ the two-letter language code.
| Name | Type | Description | | Name | Type | Description |
| ------ | ---------- | -------------------------------------- | | ------ | ---------- | -------------------------------------- |
| `name` | unicode | Two-letter language code, e.g. `'en'`. | | `name` | str | Two-letter language code, e.g. `'en'`. |
| `cls` | `Language` | The language class, e.g. `English`. | | `cls` | `Language` | The language class, e.g. `English`. |
### util.lang_class_is_loaded {#util.lang_class_is_loaded tag="function" new="2.1"} ### util.lang_class_is_loaded {#util.lang_class_is_loaded tag="function" new="2.1"}
@ -368,10 +368,10 @@ loaded lazily, to avoid expensive setup code associated with the language data.
> assert util.lang_class_is_loaded("de") is False > assert util.lang_class_is_loaded("de") is False
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ------- | -------------------------------------- | | ----------- | ---- | -------------------------------------- |
| `name` | unicode | Two-letter language code, e.g. `'en'`. | | `name` | str | Two-letter language code, e.g. `'en'`. |
| **RETURNS** | bool | Whether the class has been loaded. | | **RETURNS** | bool | Whether the class has been loaded. |
### util.load_model {#util.load_model tag="function" new="2"} ### util.load_model {#util.load_model tag="function" new="2"}
@ -392,7 +392,7 @@ in via [`Language.from_disk()`](/api/language#from_disk).
| Name | Type | Description | | Name | Type | Description |
| ------------- | ---------- | -------------------------------------------------------- | | ------------- | ---------- | -------------------------------------------------------- |
| `name` | unicode | Package name, shortcut link or model path. | | `name` | str | Package name, shortcut link or model path. |
| `**overrides` | - | Specific overrides, like pipeline components to disable. | | `**overrides` | - | Specific overrides, like pipeline components to disable. |
| **RETURNS** | `Language` | `Language` class with the loaded model. | | **RETURNS** | `Language` | `Language` class with the loaded model. |
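A minimal sketch (assuming the `en_core_web_sm` package is installed):

```python
from spacy import util

nlp = util.load_model("en_core_web_sm", disable=["parser"])
```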
@ -411,7 +411,7 @@ it easy to test a new model that you haven't packaged yet.
| Name | Type | Description | | Name | Type | Description |
| ------------- | ---------- | ---------------------------------------------------------------------------------------------------- | | ------------- | ---------- | ---------------------------------------------------------------------------------------------------- |
| `model_path` | unicode | Path to model data directory. | | `model_path` | str | Path to model data directory. |
| `meta` | dict | Model meta data. If `False`, spaCy will try to load the meta from a meta.json in the same directory. | | `meta` | dict | Model meta data. If `False`, spaCy will try to load the meta from a meta.json in the same directory. |
| `**overrides` | - | Specific overrides, like pipeline components to disable. | | `**overrides` | - | Specific overrides, like pipeline components to disable. |
| **RETURNS** | `Language` | `Language` class with the loaded model. | | **RETURNS** | `Language` | `Language` class with the loaded model. |
@ -432,7 +432,7 @@ A helper function to use in the `load()` method of a model package's
| Name | Type | Description | | Name | Type | Description |
| ------------- | ---------- | -------------------------------------------------------- | | ------------- | ---------- | -------------------------------------------------------- |
| `init_file` | unicode | Path to model's `__init__.py`, i.e. `__file__`. | | `init_file` | str | Path to model's `__init__.py`, i.e. `__file__`. |
| `**overrides` | - | Specific overrides, like pipeline components to disable. | | `**overrides` | - | Specific overrides, like pipeline components to disable. |
| **RETURNS** | `Language` | `Language` class with the loaded model. | | **RETURNS** | `Language` | `Language` class with the loaded model. |
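The generated `__init__.py` of a model package essentially boils down to this sketch:

```python
# __init__.py of a model package
from spacy.util import load_model_from_init_py

def load(**overrides):
    return load_model_from_init_py(__file__, **overrides)
```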
@ -446,10 +446,10 @@ Get a model's meta.json from a directory path and validate its contents.
> meta = util.get_model_meta("/path/to/model") > meta = util.get_model_meta("/path/to/model")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ---------------- | ------------------------ | | ----------- | ------------ | ------------------------ |
| `path` | unicode / `Path` | Path to model directory. | | `path` | str / `Path` | Path to model directory. |
| **RETURNS** | dict | The model's meta data. | | **RETURNS** | dict | The model's meta data. |
### util.is_package {#util.is_package tag="function"} ### util.is_package {#util.is_package tag="function"}
@ -463,10 +463,10 @@ Check if string maps to a package installed via pip. Mainly used to validate
> util.is_package("xyz") # False > util.is_package("xyz") # False
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ------- | -------------------------------------------- | | ----------- | ------ | -------------------------------------------- |
| `name` | unicode | Name of package. | | `name` | str | Name of package. |
| **RETURNS** | `bool` | `True` if installed package, `False` if not. | | **RETURNS** | `bool` | `True` if installed package, `False` if not. |
### util.get_package_path {#util.get_package_path tag="function" new="2"} ### util.get_package_path {#util.get_package_path tag="function" new="2"}
@ -480,10 +480,10 @@ Get path to an installed package. Mainly used to resolve the location of
> # /usr/lib/python3.6/site-packages/en_core_web_sm > # /usr/lib/python3.6/site-packages/en_core_web_sm
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| -------------- | ------- | -------------------------------- | | -------------- | ------ | -------------------------------- |
| `package_name` | unicode | Name of installed package. | | `package_name` | str | Name of installed package. |
| **RETURNS** | `Path` | Path to model package directory. | | **RETURNS** | `Path` | Path to model package directory. |
### util.is_in_jupyter {#util.is_in_jupyter tag="function" new="2"} ### util.is_in_jupyter {#util.is_in_jupyter tag="function" new="2"}
View File
@ -35,7 +35,7 @@ you can add vectors to later.
| `data` | `ndarray[ndim=1, dtype='float32']` | The vector data. | | `data` | `ndarray[ndim=1, dtype='float32']` | The vector data. |
| `keys` | iterable | A sequence of keys aligned with the data. | | `keys` | iterable | A sequence of keys aligned with the data. |
| `shape` | tuple | Size of the table as `(n_entries, n_columns)`, the number of entries and number of columns. Not required if you're initializing the object with `data` and `keys`. | | `shape` | tuple | Size of the table as `(n_entries, n_columns)`, the number of entries and number of columns. Not required if you're initializing the object with `data` and `keys`. |
| `name` | unicode | A name to identify the vectors table. | | `name` | str | A name to identify the vectors table. |
| **RETURNS** | `Vectors` | The newly created object. | | **RETURNS** | `Vectors` | The newly created object. |
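Both construction styles, sketched below (an empty table to fill later, or `data` plus aligned `keys`):

```python
import numpy
from spacy.vectors import Vectors

empty_vectors = Vectors(shape=(10000, 300))  # empty table to add vectors to later

data = numpy.zeros((3, 300), dtype="f")  # "f" = float32
keys = ["cat", "dog", "rat"]
vectors = Vectors(data=data, keys=keys, name="animal_vectors")
```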
## Vectors.\_\_getitem\_\_ {#getitem tag="method"} ## Vectors.\_\_getitem\_\_ {#getitem tag="method"}
@ -140,7 +140,7 @@ mapping separately. If you need to manage the strings, you should use the
| Name | Type | Description | | Name | Type | Description |
| ----------- | ---------------------------------- | ----------------------------------------------------- | | ----------- | ---------------------------------- | ----------------------------------------------------- |
| `key` | unicode / int | The key to add. | | `key` | str / int | The key to add. |
| `vector` | `ndarray[ndim=1, dtype='float32']` | An optional vector to add for the key. | | `vector` | `ndarray[ndim=1, dtype='float32']` | An optional vector to add for the key. |
| `row` | int | An optional row number of a vector to map the key to. | | `row` | int | An optional row number of a vector to map the key to. |
| **RETURNS** | int | The row the vector was added to. | | **RETURNS** | int | The row the vector was added to. |
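A minimal sketch:

```python
import numpy
from spacy.vectors import Vectors

vectors = Vectors(shape=(2, 300))
row = vectors.add("cat", vector=numpy.random.uniform(-1, 1, (300,)))
```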
@ -227,7 +227,7 @@ Look up one or more keys by row, or vice versa.
| Name | Type | Description | | Name | Type | Description |
| ----------- | ------------------------------------- | ------------------------------------------------------------------------ | | ----------- | ------------------------------------- | ------------------------------------------------------------------------ |
| `key` | unicode / int | Find the row that the given key points to. Returns int, `-1` if missing. | | `key` | str / int | Find the row that the given key points to. Returns int, `-1` if missing. |
| `keys` | iterable | Find rows that the keys point to. Returns `ndarray`. | | `keys` | iterable | Find rows that the keys point to. Returns `ndarray`. |
| `row` | int | Find the first key that points to the row. Returns int. | | `row` | int | Find the first key that points to the row. Returns int. |
| `rows` | iterable | Find the keys that point to the rows. Returns `ndarray`. | | `rows` | iterable | Find the keys that point to the rows. Returns `ndarray`. |
@ -337,9 +337,9 @@ Save the current state to a directory.
> >
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- | | ------ | ------------ | --------------------------------------------------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. | | `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
## Vectors.from_disk {#from_disk tag="method"} ## Vectors.from_disk {#from_disk tag="method"}
@ -352,10 +352,10 @@ Loads state from a directory. Modifies the object in place and returns it.
> vectors.from_disk("/path/to/vectors") > vectors.from_disk("/path/to/vectors")
> ``` > ```
| Name | Type | Description | | Name | Type | Description |
| ----------- | ---------------- | -------------------------------------------------------------------------- | | ----------- | ------------ | -------------------------------------------------------------------------- |
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. | | `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
| **RETURNS** | `Vectors` | The modified `Vectors` object. | | **RETURNS** | `Vectors` | The modified `Vectors` object. |
## Vectors.to_bytes {#to_bytes tag="method"} ## Vectors.to_bytes {#to_bytes tag="method"}
View File
@ -327,11 +327,11 @@ displaCy in our [online demo](https://explosion.ai/demos/displacy). displaCy in our [online demo](https://explosion.ai/demos/displacy).
### Disabling the parser {#disabling} ### Disabling the parser {#disabling}
In the [default models](/models), the parser is loaded and enabled as part of In the [default models](/models), the parser is loaded and enabled as part of
the [standard processing pipeline](/usage/processing-pipelines). If you don't need the [standard processing pipeline](/usage/processing-pipelines). If you don't
any of the syntactic information, you should disable the parser. Disabling the need any of the syntactic information, you should disable the parser. Disabling
parser will make spaCy load and run much faster. If you want to load the parser, the parser will make spaCy load and run much faster. If you want to load the
but need to disable it for specific documents, you can also control its use on parser, but need to disable it for specific documents, you can also control its
the `nlp` object. use on the `nlp` object.
```python ```python
nlp = spacy.load("en_core_web_sm", disable=["parser"]) nlp = spacy.load("en_core_web_sm", disable=["parser"])
@ -990,10 +990,10 @@ nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = my_tokenizer nlp.tokenizer = my_tokenizer
``` ```
| Argument | Type | Description | | Argument | Type | Description |
| ----------- | ------- | ------------------------- | | ----------- | ----- | ------------------------- |
| `text` | unicode | The raw text to tokenize. | | `text` | str | The raw text to tokenize. |
| **RETURNS** | `Doc` | The tokenized document. | | **RETURNS** | `Doc` | The tokenized document. |
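A minimal sketch of a component satisfying this contract, here a naive whitespace tokenizer:

```python
import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer(object):
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        # Create the Doc from the shared vocab, one token per whitespace-separated word
        return Doc(self.vocab, words=words)

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp("What's happened to me? he thought.")
print([t.text for t in doc])
```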
<Infobox title="Important note: using a custom tokenizer" variant="warning"> <Infobox title="Important note: using a custom tokenizer" variant="warning">
View File
@ -272,16 +272,16 @@ doc = nlp("I won't have named entities")
disabled.restore() disabled.restore()
``` ```
If you want to disable all pipes except for one or a few, you can use the `enable` If you want to disable all pipes except for one or a few, you can use the
keyword. Just like the `disable` keyword, it takes a list of pipe names, or a string `enable` keyword. Just like the `disable` keyword, it takes a list of pipe
defining just one pipe. names, or a string defining just one pipe.
```python ```python
# Enable only the parser # Enable only the parser
with nlp.select_pipes(enable="parser"): with nlp.select_pipes(enable="parser"):
doc = nlp("I will only be parsed") doc = nlp("I will only be parsed")
``` ```
Finally, you can also use the [`remove_pipe`](/api/language#remove_pipe) method Finally, you can also use the [`remove_pipe`](/api/language#remove_pipe) method
to remove pipeline components from an existing pipeline, the to remove pipeline components from an existing pipeline, the
[`rename_pipe`](/api/language#rename_pipe) method to rename them, or the [`rename_pipe`](/api/language#rename_pipe) method to rename them, or the
@ -349,12 +349,12 @@ last** in the pipeline, or define a **custom name**. If no name is set and no
> nlp.add_pipe(my_component, before="parser") > nlp.add_pipe(my_component, before="parser")
> ``` > ```
| Argument | Type | Description | | Argument | Type | Description |
| -------- | ------- | ------------------------------------------------------------------------ | | -------- | ---- | ------------------------------------------------------------------------ |
| `last` | bool | If set to `True`, component is added **last** in the pipeline (default). | | `last` | bool | If set to `True`, component is added **last** in the pipeline (default). |
| `first` | bool | If set to `True`, component is added **first** in the pipeline. | | `first` | bool | If set to `True`, component is added **first** in the pipeline. |
| `before` | unicode | String name of component to add the new component **before**. | | `before` | str | String name of component to add the new component **before**. |
| `after` | unicode | String name of component to add the new component **after**. | | `after` | str | String name of component to add the new component **after**. |
### Example: A simple pipeline component {#custom-components-simple} ### Example: A simple pipeline component {#custom-components-simple}
View File
@ -94,8 +94,8 @@ docs = list(doc_bin.get_docs(nlp.vocab))
If `store_user_data` is set to `True`, the `Doc.user_data` will be serialized as If `store_user_data` is set to `True`, the `Doc.user_data` will be serialized as
well, which includes the values of well, which includes the values of
[extension attributes](/usage/processing-pipelines#custom-components-attributes) (if [extension attributes](/usage/processing-pipelines#custom-components-attributes)
they're serializable with msgpack). (if they're serializable with msgpack).
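A sketch of serializing with user data enabled:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.load("en_core_web_sm")
doc_bin = DocBin(store_user_data=True)  # also serialize Doc.user_data
doc_bin.add(nlp("Give it to me"))
bytes_data = doc_bin.to_bytes()
```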
<Infobox title="Important note on serializing extension attributes" variant="warning"> <Infobox title="Important note on serializing extension attributes" variant="warning">
@ -666,10 +666,10 @@ and lets you customize how the model should be initialized and loaded. You can
define the language data to be loaded and the define the language data to be loaded and the
[processing pipeline](/usage/processing-pipelines) to execute. [processing pipeline](/usage/processing-pipelines) to execute.
| Setting | Type | Description | | Setting | Type | Description |
| ---------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ---------- | ---- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `lang` | unicode | ID of the language class to initialize. | | `lang` | str | ID of the language class to initialize. |
| `pipeline` | list | A list of strings mapping to the IDs of pipeline factories to apply in that order. If not set, spaCy's [default pipeline](/usage/processing-pipelines) will be used. | | `pipeline` | list | A list of strings mapping to the IDs of pipeline factories to apply in that order. If not set, spaCy's [default pipeline](/usage/processing-pipelines) will be used. |
The `load()` method that comes with our model package templates will take care The `load()` method that comes with our model package templates will take care
of putting all this together and returning a `Language` object with the loaded of putting all this together and returning a `Language` object with the loaded
View File
@ -67,12 +67,12 @@ arcs.
</Infobox> </Infobox>
| Argument | Type | Description | Default | | Argument | Type | Description | Default |
| --------- | ------- | ----------------------------------------------------------- | ----------- | | --------- | ---- | ----------------------------------------------------------- | ----------- |
| `compact` | bool | "Compact mode" with square arrows that takes up less space. | `False` | | `compact` | bool | "Compact mode" with square arrows that takes up less space. | `False` |
| `color` | unicode | Text color (HEX, RGB or color names). | `"#000000"` | | `color` | str | Text color (HEX, RGB or color names). | `"#000000"` |
| `bg` | unicode | Background color (HEX, RGB or color names). | `"#ffffff"` | | `bg` | str | Background color (HEX, RGB or color names). | `"#ffffff"` |
| `font` | unicode | Font name or font family for all text. | `"Arial"` | | `font` | str | Font name or font family for all text. | `"Arial"` |
For a list of all available options, see the For a list of all available options, see the
[`displacy` API documentation](/api/top-level#displacy_options). [`displacy` API documentation](/api/top-level#displacy_options).
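Putting a few of these options together (a sketch; the font is an arbitrary choice):

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")
options = {"compact": True, "color": "blue", "font": "Source Sans Pro"}
displacy.serve(doc, style="dep", options=options)
```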