mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-26 01:04:34 +03:00
unicode -> str consistency
This commit is contained in:
parent
5d3806e059
commit
262d306eaa
|
@ -504,10 +504,10 @@ tokenization can be provided.
|
|||
> srsly.write_jsonl("/path/to/text.jsonl", data)
|
||||
> ```
|
||||
|
||||
| Key | Type | Description |
|
||||
| -------- | ------- | ---------------------------------------------------------- |
|
||||
| `text` | unicode | The raw input text. Is not required if `tokens` available. |
|
||||
| `tokens` | list | Optional tokenization, one string per token. |
|
||||
| Key | Type | Description |
|
||||
| -------- | ---- | ---------------------------------------------------------- |
|
||||
| `text` | str | The raw input text. Is not required if `tokens` available. |
|
||||
| `tokens` | list | Optional tokenization, one string per token. |
|
||||
|
||||
```json
|
||||
### Example
|
||||
|
|
|
@ -170,7 +170,7 @@ vocabulary.
|
|||
| Name | Type | Description |
|
||||
| ----------- | ---------------- | ------------------------------------------------------------------------------------------- |
|
||||
| `mem` | `cymem.Pool` | A memory pool. Allocated memory will be freed once the `Vocab` object is garbage collected. |
|
||||
| `string` | unicode | The string of the word to look up. |
|
||||
| `string` | str | The string of the word to look up. |
|
||||
| **RETURNS** | `const LexemeC*` | The lexeme in the vocabulary. |
|
||||
|
||||
### Vocab.get_by_orth {#vocab_get_by_orth tag="method"}
|
||||
|
|
|
@ -229,9 +229,9 @@ Add a new label to the pipe.
|
|||
> parser.add_label("MY_LABEL")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------- | ------- | ----------------- |
|
||||
| `label` | unicode | The label to add. |
|
||||
| Name | Type | Description |
|
||||
| ------- | ---- | ----------------- |
|
||||
| `label` | str | The label to add. |
|
||||
|
||||
## DependencyParser.to_disk {#to_disk tag="method"}
|
||||
|
||||
|
@ -244,10 +244,10 @@ Serialize the pipe to disk.
|
|||
> parser.to_disk("/path/to/parser")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| Name | Type | Description |
|
||||
| --------- | ------------ | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
|
||||
## DependencyParser.from_disk {#from_disk tag="method"}
|
||||
|
||||
|
@ -262,7 +262,7 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
|||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------------------ | -------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `DependencyParser` | The modified `DependencyParser` object. |
|
||||
|
||||
|
|
|
@ -123,7 +123,7 @@ details, see the documentation on
|
|||
|
||||
| Name | Type | Description |
|
||||
| --------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `name` | unicode | Name of the attribute to set by the extension. For example, `'my_attr'` will be available as `doc._.my_attr`. |
|
||||
| `name` | str | Name of the attribute to set by the extension. For example, `'my_attr'` will be available as `doc._.my_attr`. |
|
||||
| `default` | - | Optional default value of the attribute if no getter or method is defined. |
|
||||
| `method` | callable | Set a custom method on the object, for example `doc._.compare(other_doc)`. |
|
||||
| `getter` | callable | Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute. |
|
||||
|
@ -145,10 +145,10 @@ Look up a previously registered extension by name. Returns a 4-tuple
|
|||
> assert extension == (False, None, None, None)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | ------------------------------------------------------------- |
|
||||
| `name` | unicode | Name of the extension. |
|
||||
| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the extension. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ------------------------------------------------------------- |
|
||||
| `name` | str | Name of the extension. |
|
||||
| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the extension. |
|
||||
|
||||
## Doc.has_extension {#has_extension tag="classmethod" new="2"}
|
||||
|
||||
|
@ -162,10 +162,10 @@ Check whether an extension has been registered on the `Doc` class.
|
|||
> assert Doc.has_extension('has_city')
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | ------------------------------------------ |
|
||||
| `name` | unicode | Name of the extension to check. |
|
||||
| **RETURNS** | bool | Whether the extension has been registered. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---- | ------------------------------------------ |
|
||||
| `name` | str | Name of the extension to check. |
|
||||
| **RETURNS** | bool | Whether the extension has been registered. |
|
||||
|
||||
## Doc.remove_extension {#remove_extension tag="classmethod" new="2.0.12"}
|
||||
|
||||
|
@ -180,10 +180,10 @@ Remove a previously registered extension.
|
|||
> assert not Doc.has_extension('has_city')
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | --------------------------------------------------------------------- |
|
||||
| `name` | unicode | Name of the extension. |
|
||||
| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the removed extension. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | --------------------------------------------------------------------- |
|
||||
| `name` | str | Name of the extension. |
|
||||
| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the removed extension. |
|
||||
|
||||
## Doc.char_span {#char_span tag="method" new="2"}
|
||||
|
||||
|
@ -368,10 +368,10 @@ Save the current state to a directory.
|
|||
> doc.to_disk("/path/to/doc")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| Name | Type | Description |
|
||||
| --------- | ------------ | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
|
||||
## Doc.from_disk {#from_disk tag="method" new="2"}
|
||||
|
||||
|
@ -385,11 +385,11 @@ Loads state from a directory. Modifies the object in place and returns it.
|
|||
> doc = Doc(Vocab()).from_disk("/path/to/doc")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---------------- | -------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `Doc` | The modified `Doc` object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------------ | -------------------------------------------------------------------------- |
|
||||
| `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `Doc` | The modified `Doc` object. |
|
||||
|
||||
## Doc.to_bytes {#to_bytes tag="method"}
|
||||
|
||||
|
@ -648,15 +648,15 @@ The L2 norm of the document's vector representation.
|
|||
|
||||
| Name | Type | Description |
|
||||
| --------------------------------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||
| `text` | unicode | A unicode representation of the document text. |
|
||||
| `text_with_ws` | unicode | An alias of `Doc.text`, provided for duck-type compatibility with `Span` and `Token`. |
|
||||
| `text` | str | A unicode representation of the document text. |
|
||||
| `text_with_ws` | str | An alias of `Doc.text`, provided for duck-type compatibility with `Span` and `Token`. |
|
||||
| `mem` | `Pool` | The document's local memory heap, for all C data it owns. |
|
||||
| `vocab` | `Vocab` | The store of lexical types. |
|
||||
| `tensor` <Tag variant="new">2</Tag> | `ndarray` | Container for dense vector representations. |
|
||||
| `cats` <Tag variant="new">2</Tag> | dictionary | Maps either a label to a score for categories applied to whole document, or `(start_char, end_char, label)` to score for categories applied to spans. `start_char` and `end_char` should be character offsets, label can be either a string or an integer ID, and score should be a float. |
|
||||
| `user_data` | - | A generic storage area, for user custom data. |
|
||||
| `lang` <Tag variant="new">2.1</Tag> | int | Language of the document's vocabulary. |
|
||||
| `lang_` <Tag variant="new">2.1</Tag> | unicode | Language of the document's vocabulary. |
|
||||
| `lang_` <Tag variant="new">2.1</Tag> | str | Language of the document's vocabulary. |
|
||||
| `is_tagged` | bool | A flag indicating that the document has been part-of-speech tagged. Returns `True` if the `Doc` is empty. |
|
||||
| `is_parsed` | bool | A flag indicating that the document has been syntactically parsed. Returns `True` if the `Doc` is empty. |
|
||||
| `is_sentenced` | bool | A flag indicating that sentence boundaries have been applied to the document. Returns `True` if the `Doc` is empty. |
|
||||
|
|
|
@ -258,10 +258,10 @@ Serialize the pipe to disk.
|
|||
> entity_linker.to_disk("/path/to/entity_linker")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| Name | Type | Description |
|
||||
| --------- | ------------ | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
|
||||
## EntityLinker.from_disk {#from_disk tag="method"}
|
||||
|
||||
|
@ -274,11 +274,11 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
|||
> entity_linker.from_disk("/path/to/entity_linker")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---------------- | -------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `EntityLinker` | The modified `EntityLinker` object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | -------------- | -------------------------------------------------------------------------- |
|
||||
| `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `EntityLinker` | The modified `EntityLinker` object. |
|
||||
|
||||
## Serialization fields {#serialization-fields}
|
||||
|
||||
|
|
|
@ -230,9 +230,9 @@ Add a new label to the pipe.
|
|||
> ner.add_label("MY_LABEL")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------- | ------- | ----------------- |
|
||||
| `label` | unicode | The label to add. |
|
||||
| Name | Type | Description |
|
||||
| ------- | ---- | ----------------- |
|
||||
| `label` | str | The label to add. |
|
||||
|
||||
## EntityRecognizer.to_disk {#to_disk tag="method"}
|
||||
|
||||
|
@ -245,10 +245,10 @@ Serialize the pipe to disk.
|
|||
> ner.to_disk("/path/to/ner")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| Name | Type | Description |
|
||||
| --------- | ------------ | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
|
||||
## EntityRecognizer.from_disk {#from_disk tag="method"}
|
||||
|
||||
|
@ -263,7 +263,7 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
|||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------------------ | -------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `EntityRecognizer` | The modified `EntityRecognizer` object. |
|
||||
|
||||
|
|
|
@ -72,10 +72,10 @@ Whether a label is present in the patterns.
|
|||
> assert not "PERSON" in ruler
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | -------------------------------------------- |
|
||||
| `label` | unicode | The label to check. |
|
||||
| **RETURNS** | bool | Whether the entity ruler contains the label. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---- | -------------------------------------------- |
|
||||
| `label` | str | The label to check. |
|
||||
| **RETURNS** | bool | Whether the entity ruler contains the label. |
|
||||
|
||||
## EntityRuler.\_\_call\_\_ {#call tag="method"}
|
||||
|
||||
|
@ -83,8 +83,9 @@ Find matches in the `Doc` and add them to the `doc.ents`. Typically, this
|
|||
happens automatically after the component has been added to the pipeline using
|
||||
[`nlp.add_pipe`](/api/language#add_pipe). If the entity ruler was initialized
|
||||
with `overwrite_ents=True`, existing entities will be replaced if they overlap
|
||||
with the matches. When matches overlap in a Doc, the entity ruler prioritizes longer
|
||||
patterns over shorter, and if equal the match occuring first in the Doc is chosen.
|
||||
with the matches. When matches overlap in a Doc, the entity ruler prioritizes
|
||||
longer patterns over shorter, and if equal the match occuring first in the Doc
|
||||
is chosen.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
@ -139,9 +140,9 @@ only the patterns are saved as JSONL. If a directory name is provided, a
|
|||
> ruler.to_disk("/path/to/entity_ruler") # saves patterns and config
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------ | ---------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a JSONL file or directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| Name | Type | Description |
|
||||
| ------ | ------------ | ----------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | str / `Path` | A path to a JSONL file or directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
|
||||
## EntityRuler.from_disk {#from_disk tag="method"}
|
||||
|
||||
|
@ -158,10 +159,10 @@ configuration.
|
|||
> ruler.from_disk("/path/to/entity_ruler") # loads patterns and config
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---------------- | ---------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a JSONL file or directory. Paths may be either strings or `Path`-like objects. |
|
||||
| **RETURNS** | `EntityRuler` | The modified `EntityRuler` object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------------- | ---------------------------------------------------------------------------------------- |
|
||||
| `path` | str / `Path` | A path to a JSONL file or directory. Paths may be either strings or `Path`-like objects. |
|
||||
| **RETURNS** | `EntityRuler` | The modified `EntityRuler` object. |
|
||||
|
||||
## EntityRuler.to_bytes {#to_bytes tag="method"}
|
||||
|
||||
|
|
|
@ -17,8 +17,8 @@ Create a `GoldCorpus`. IF the input data is an iterable, each item should be a
|
|||
[`gold.read_json_file`](https://github.com/explosion/spaCy/tree/master/spacy/gold.pyx)
|
||||
for further details.
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | --------------------------- | ------------------------------------------------------------ |
|
||||
| `train` | unicode / `Path` / iterable | Training data, as a path (file or directory) or iterable. |
|
||||
| `dev` | unicode / `Path` / iterable | Development data, as a path (file or directory) or iterable. |
|
||||
| **RETURNS** | `GoldCorpus` | The newly constructed object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----------------------- | ------------------------------------------------------------ |
|
||||
| `train` | str / `Path` / iterable | Training data, as a path (file or directory) or iterable. |
|
||||
| `dev` | str / `Path` / iterable | Development data, as a path (file or directory) or iterable. |
|
||||
| **RETURNS** | `GoldCorpus` | The newly constructed object. |
|
||||
|
|
|
@ -62,7 +62,8 @@ Whether the provided syntactic annotations form a projective dependency tree.
|
|||
|
||||
Convert a list of Doc objects into the
|
||||
[JSON-serializable format](/api/annotation#json-input) used by the
|
||||
[`spacy train`](/api/cli#train) command. Each input doc will be treated as a 'paragraph' in the output doc.
|
||||
[`spacy train`](/api/cli#train) command. Each input doc will be treated as a
|
||||
'paragraph' in the output doc.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
@ -160,7 +161,7 @@ single-token entity.
|
|||
| ----------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `doc` | `Doc` | The document that the entity offsets refer to. The output tags will refer to the token boundaries within the document. |
|
||||
| `entities` | iterable | A sequence of `(start, end, label)` triples. `start` and `end` should be character-offset integers denoting the slice into the original string. |
|
||||
| **RETURNS** | list | Unicode strings, describing the [BILUO](/api/annotation#biluo) tags. |
|
||||
| **RETURNS** | list | str strings, describing the [BILUO](/api/annotation#biluo) tags. |
|
||||
|
||||
### gold.offsets_from_biluo_tags {#offsets_from_biluo_tags tag="function"}
|
||||
|
||||
|
|
|
@ -1,16 +1,19 @@
|
|||
---
|
||||
title: KnowledgeBase
|
||||
teaser: A storage class for entities and aliases of a specific knowledge base (ontology)
|
||||
teaser:
|
||||
A storage class for entities and aliases of a specific knowledge base
|
||||
(ontology)
|
||||
tag: class
|
||||
source: spacy/kb.pyx
|
||||
new: 2.2
|
||||
---
|
||||
|
||||
The `KnowledgeBase` object provides a method to generate [`Candidate`](/api/kb/#candidate_init)
|
||||
objects, which are plausible external identifiers given a certain textual mention.
|
||||
Each such `Candidate` holds information from the relevant KB entities,
|
||||
such as its frequency in text and possible aliases.
|
||||
Each entity in the knowledge base also has a pretrained entity vector of a fixed size.
|
||||
The `KnowledgeBase` object provides a method to generate
|
||||
[`Candidate`](/api/kb/#candidate_init) objects, which are plausible external
|
||||
identifiers given a certain textual mention. Each such `Candidate` holds
|
||||
information from the relevant KB entities, such as its frequency in text and
|
||||
possible aliases. Each entity in the knowledge base also has a pretrained entity
|
||||
vector of a fixed size.
|
||||
|
||||
## KnowledgeBase.\_\_init\_\_ {#init tag="method"}
|
||||
|
||||
|
@ -24,25 +27,25 @@ Create the knowledge base.
|
|||
> kb = KnowledgeBase(vocab=vocab, entity_vector_length=64)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------------------- | ---------------- | ----------------------------------------- |
|
||||
| `vocab` | `Vocab` | A `Vocab` object. |
|
||||
| `entity_vector_length` | int | Length of the fixed-size entity vectors. |
|
||||
| **RETURNS** | `KnowledgeBase` | The newly constructed object. |
|
||||
|
||||
| Name | Type | Description |
|
||||
| ---------------------- | --------------- | ---------------------------------------- |
|
||||
| `vocab` | `Vocab` | A `Vocab` object. |
|
||||
| `entity_vector_length` | int | Length of the fixed-size entity vectors. |
|
||||
| **RETURNS** | `KnowledgeBase` | The newly constructed object. |
|
||||
|
||||
## KnowledgeBase.entity_vector_length {#entity_vector_length tag="property"}
|
||||
|
||||
The length of the fixed-size entity vectors in the knowledge base.
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---- | ----------------------------------------- |
|
||||
| **RETURNS** | int | Length of the fixed-size entity vectors. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---- | ---------------------------------------- |
|
||||
| **RETURNS** | int | Length of the fixed-size entity vectors. |
|
||||
|
||||
## KnowledgeBase.add_entity {#add_entity tag="method"}
|
||||
|
||||
Add an entity to the knowledge base, specifying its corpus frequency
|
||||
and entity vector, which should be of length [`entity_vector_length`](/api/kb#entity_vector_length).
|
||||
Add an entity to the knowledge base, specifying its corpus frequency and entity
|
||||
vector, which should be of length
|
||||
[`entity_vector_length`](/api/kb#entity_vector_length).
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
@ -51,16 +54,16 @@ and entity vector, which should be of length [`entity_vector_length`](/api/kb#en
|
|||
> kb.add_entity(entity="Q463035", freq=111, entity_vector=vector2)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| --------------- | ------------- | ------------------------------------------------- |
|
||||
| `entity` | unicode | The unique entity identifier |
|
||||
| `freq` | float | The frequency of the entity in a typical corpus |
|
||||
| `entity_vector` | vector | The pretrained vector of the entity |
|
||||
| Name | Type | Description |
|
||||
| --------------- | ------ | ----------------------------------------------- |
|
||||
| `entity` | str | The unique entity identifier |
|
||||
| `freq` | float | The frequency of the entity in a typical corpus |
|
||||
| `entity_vector` | vector | The pretrained vector of the entity |
|
||||
|
||||
## KnowledgeBase.set_entities {#set_entities tag="method"}
|
||||
|
||||
Define the full list of entities in the knowledge base, specifying the corpus frequency
|
||||
and entity vector for each entity.
|
||||
Define the full list of entities in the knowledge base, specifying the corpus
|
||||
frequency and entity vector for each entity.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
@ -68,18 +71,19 @@ and entity vector for each entity.
|
|||
> kb.set_entities(entity_list=["Q42", "Q463035"], freq_list=[32, 111], vector_list=[vector1, vector2])
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------- | ------------- | ------------------------------------------------- |
|
||||
| `entity_list` | iterable | List of unique entity identifiers |
|
||||
| `freq_list` | iterable | List of entity frequencies |
|
||||
| `vector_list` | iterable | List of entity vectors |
|
||||
| Name | Type | Description |
|
||||
| ------------- | -------- | --------------------------------- |
|
||||
| `entity_list` | iterable | List of unique entity identifiers |
|
||||
| `freq_list` | iterable | List of entity frequencies |
|
||||
| `vector_list` | iterable | List of entity vectors |
|
||||
|
||||
## KnowledgeBase.add_alias {#add_alias tag="method"}
|
||||
|
||||
Add an alias or mention to the knowledge base, specifying its potential KB identifiers
|
||||
and their prior probabilities. The entity identifiers should refer to entities previously
|
||||
added with [`add_entity`](/api/kb#add_entity) or [`set_entities`](/api/kb#set_entities).
|
||||
The sum of the prior probabilities should not exceed 1.
|
||||
Add an alias or mention to the knowledge base, specifying its potential KB
|
||||
identifiers and their prior probabilities. The entity identifiers should refer
|
||||
to entities previously added with [`add_entity`](/api/kb#add_entity) or
|
||||
[`set_entities`](/api/kb#set_entities). The sum of the prior probabilities
|
||||
should not exceed 1.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
@ -87,11 +91,11 @@ The sum of the prior probabilities should not exceed 1.
|
|||
> kb.add_alias(alias="Douglas", entities=["Q42", "Q463035"], probabilities=[0.6, 0.3])
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| -------------- | ------------- | -------------------------------------------------- |
|
||||
| `alias` | unicode | The textual mention or alias |
|
||||
| `entities` | iterable | The potential entities that the alias may refer to |
|
||||
| `probabilities`| iterable | The prior probabilities of each entity |
|
||||
| Name | Type | Description |
|
||||
| --------------- | -------- | -------------------------------------------------- |
|
||||
| `alias` | str | The textual mention or alias |
|
||||
| `entities` | iterable | The potential entities that the alias may refer to |
|
||||
| `probabilities` | iterable | The prior probabilities of each entity |
|
||||
|
||||
## KnowledgeBase.\_\_len\_\_ {#len tag="method"}
|
||||
|
||||
|
@ -117,9 +121,9 @@ Get a list of all entity IDs in the knowledge base.
|
|||
> all_entities = kb.get_entity_strings()
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---- | --------------------------------------------- |
|
||||
| **RETURNS** | list | The list of entities in the knowledge base. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---- | ------------------------------------------- |
|
||||
| **RETURNS** | list | The list of entities in the knowledge base. |
|
||||
|
||||
## KnowledgeBase.get_size_aliases {#get_size_aliases tag="method"}
|
||||
|
||||
|
@ -131,9 +135,9 @@ Get the total number of aliases in the knowledge base.
|
|||
> total_aliases = kb.get_size_aliases()
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---- | --------------------------------------------- |
|
||||
| **RETURNS** | int | The number of aliases in the knowledge base. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---- | -------------------------------------------- |
|
||||
| **RETURNS** | int | The number of aliases in the knowledge base. |
|
||||
|
||||
## KnowledgeBase.get_alias_strings {#get_alias_strings tag="method"}
|
||||
|
||||
|
@ -145,9 +149,9 @@ Get a list of all aliases in the knowledge base.
|
|||
> all_aliases = kb.get_alias_strings()
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---- | --------------------------------------------- |
|
||||
| **RETURNS** | list | The list of aliases in the knowledge base. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---- | ------------------------------------------ |
|
||||
| **RETURNS** | list | The list of aliases in the knowledge base. |
|
||||
|
||||
## KnowledgeBase.get_candidates {#get_candidates tag="method"}
|
||||
|
||||
|
@ -160,10 +164,10 @@ of type [`Candidate`](/api/kb/#candidate_init).
|
|||
> candidates = kb.get_candidates("Douglas")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------- | ------------- | -------------------------------------------------- |
|
||||
| `alias` | unicode | The textual mention or alias |
|
||||
| **RETURNS** | iterable | The list of relevant `Candidate` objects |
|
||||
| Name | Type | Description |
|
||||
| ----------- | -------- | ---------------------------------------- |
|
||||
| `alias` | str | The textual mention or alias |
|
||||
| **RETURNS** | iterable | The list of relevant `Candidate` objects |
|
||||
|
||||
## KnowledgeBase.get_vector {#get_vector tag="method"}
|
||||
|
||||
|
@ -175,15 +179,15 @@ Given a certain entity ID, retrieve its pretrained entity vector.
|
|||
> vector = kb.get_vector("Q42")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------- | ------------- | -------------------------------------------------- |
|
||||
| `entity` | unicode | The entity ID |
|
||||
| **RETURNS** | vector | The entity vector |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------ | ----------------- |
|
||||
| `entity` | str | The entity ID |
|
||||
| **RETURNS** | vector | The entity vector |
|
||||
|
||||
## KnowledgeBase.get_prior_prob {#get_prior_prob tag="method"}
|
||||
|
||||
Given a certain entity ID and a certain textual mention, retrieve
|
||||
the prior probability of the fact that the mention links to the entity ID.
|
||||
Given a certain entity ID and a certain textual mention, retrieve the prior
|
||||
probability of the fact that the mention links to the entity ID.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
@ -191,11 +195,11 @@ the prior probability of the fact that the mention links to the entity ID.
|
|||
> probability = kb.get_prior_prob("Q42", "Douglas")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------- | ------------- | --------------------------------------------------------------- |
|
||||
| `entity` | unicode | The entity ID |
|
||||
| `alias` | unicode | The textual mention or alias |
|
||||
| **RETURNS** | float | The prior probability of the `alias` referring to the `entity` |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | -------------------------------------------------------------- |
|
||||
| `entity` | str | The entity ID |
|
||||
| `alias` | str | The textual mention or alias |
|
||||
| **RETURNS** | float | The prior probability of the `alias` referring to the `entity` |
|
||||
|
||||
## KnowledgeBase.dump {#dump tag="method"}
|
||||
|
||||
|
@ -207,14 +211,14 @@ Save the current state of the knowledge base to a directory.
|
|||
> kb.dump(loc)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------- | ---------------- | ------------------------------------------------------------------------------------------------------------------------ |
|
||||
| `loc` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| Name | Type | Description |
|
||||
| ----- | ------------ | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `loc` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
|
||||
## KnowledgeBase.load_bulk {#load_bulk tag="method"}
|
||||
|
||||
Restore the state of the knowledge base from a given directory. Note that the [`Vocab`](/api/vocab)
|
||||
should also be the same as the one used to create the KB.
|
||||
Restore the state of the knowledge base from a given directory. Note that the
|
||||
[`Vocab`](/api/vocab) should also be the same as the one used to create the KB.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
@ -226,18 +230,16 @@ should also be the same as the one used to create the KB.
|
|||
> kb.load_bulk("/path/to/kb")
|
||||
> ```
|
||||
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---------------- | ----------------------------------------------------------------------------------------- |
|
||||
| `loc` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| **RETURNS** | `KnowledgeBase` | The modified `KnowledgeBase` object. |
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | --------------- | -------------------------------------------------------------------------- |
|
||||
| `loc` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| **RETURNS** | `KnowledgeBase` | The modified `KnowledgeBase` object. |
|
||||
|
||||
## Candidate.\_\_init\_\_ {#candidate_init tag="method"}
|
||||
|
||||
Construct a `Candidate` object. Usually this constructor is not called directly,
|
||||
but instead these objects are returned by the [`get_candidates`](/api/kb#get_candidates) method
|
||||
of a `KnowledgeBase`.
|
||||
but instead these objects are returned by the
|
||||
[`get_candidates`](/api/kb#get_candidates) method of a `KnowledgeBase`.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
|
@ -257,12 +259,12 @@ of a `KnowledgeBase`.
|
|||
|
||||
## Candidate attributes {#candidate_attributes}
|
||||
|
||||
| Name | Type | Description |
|
||||
| ---------------------- | ------------ | ------------------------------------------------------------------ |
|
||||
| `entity` | int | The entity's unique KB identifier |
|
||||
| `entity_` | unicode | The entity's unique KB identifier |
|
||||
| `alias` | int | The alias or textual mention |
|
||||
| `alias_` | unicode | The alias or textual mention |
|
||||
| `prior_prob` | long | The prior probability of the `alias` referring to the `entity` |
|
||||
| `entity_freq` | long | The frequency of the entity in a typical corpus |
|
||||
| `entity_vector` | vector | The pretrained vector of the entity |
|
||||
| Name | Type | Description |
|
||||
| --------------- | ------ | -------------------------------------------------------------- |
|
||||
| `entity` | int | The entity's unique KB identifier |
|
||||
| `entity_` | str | The entity's unique KB identifier |
|
||||
| `alias` | int | The alias or textual mention |
|
||||
| `alias_` | str | The alias or textual mention |
|
||||
| `prior_prob` | long | The prior probability of the `alias` referring to the `entity` |
|
||||
| `entity_freq` | long | The frequency of the entity in a typical corpus |
|
||||
| `entity_vector` | vector | The pretrained vector of the entity |
|
||||
|
|
|
@ -49,11 +49,11 @@ contain arbitrary whitespace. Alignment into the original string is preserved.
|
|||
> assert (doc[0].text, doc[0].head.tag_) == ("An", "NN")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | --------------------------------------------------------------------------------- |
|
||||
| `text` | unicode | The text to be processed. |
|
||||
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
|
||||
| **RETURNS** | `Doc` | A container for accessing the annotations. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | --------------------------------------------------------------------------------- |
|
||||
| `text` | str | The text to be processed. |
|
||||
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
|
||||
| **RETURNS** | `Doc` | A container for accessing the annotations. |
|
||||
|
||||
<Infobox title="Changed in v2.0" variant="warning">
|
||||
|
||||
|
@ -201,7 +201,7 @@ Create a pipeline component from a factory.
|
|||
|
||||
| Name | Type | Description |
|
||||
| ----------- | -------- | ---------------------------------------------------------------------------------- |
|
||||
| `name` | unicode | Factory name to look up in [`Language.factories`](/api/language#class-attributes). |
|
||||
| `name` | str | Factory name to look up in [`Language.factories`](/api/language#class-attributes). |
|
||||
| `config` | dict | Configuration parameters to initialize component. |
|
||||
| **RETURNS** | callable | The pipeline component. |
|
||||
|
||||
|
@ -224,9 +224,9 @@ take a `Doc` object, modify it and return it. Only one of `before`, `after`,
|
|||
| Name | Type | Description |
|
||||
| ----------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||
| `component` | callable | The pipeline component. |
|
||||
| `name` | unicode | Name of pipeline component. Overwrites existing `component.name` attribute if available. If no `name` is set and the component exposes no name attribute, `component.__name__` is used. An error is raised if the name already exists in the pipeline. |
|
||||
| `before` | unicode | Component name to insert component directly before. |
|
||||
| `after` | unicode | Component name to insert component directly after: |
|
||||
| `name` | str | Name of pipeline component. Overwrites existing `component.name` attribute if available. If no `name` is set and the component exposes no name attribute, `component.__name__` is used. An error is raised if the name already exists in the pipeline. |
|
||||
| `before` | str | Component name to insert component directly before. |
|
||||
| `after` | str | Component name to insert component directly after: |
|
||||
| `first` | bool | Insert component first / not first in the pipeline. |
|
||||
| `last` | bool | Insert component last / not last in the pipeline. |
|
||||
|
||||
|
@ -243,10 +243,10 @@ Check whether a component is present in the pipeline. Equivalent to
|
|||
> assert nlp.has_pipe("component")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | -------------------------------------------------------- |
|
||||
| `name` | unicode | Name of the pipeline component to check. |
|
||||
| **RETURNS** | bool | Whether a component of that name exists in the pipeline. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---- | -------------------------------------------------------- |
|
||||
| `name` | str | Name of the pipeline component to check. |
|
||||
| **RETURNS** | bool | Whether a component of that name exists in the pipeline. |
|
||||
|
||||
## Language.get_pipe {#get_pipe tag="method" new="2"}
|
||||
|
||||
|
@ -261,7 +261,7 @@ Get a pipeline component for a given component name.
|
|||
|
||||
| Name | Type | Description |
|
||||
| ----------- | -------- | -------------------------------------- |
|
||||
| `name` | unicode | Name of the pipeline component to get. |
|
||||
| `name` | str | Name of the pipeline component to get. |
|
||||
| **RETURNS** | callable | The pipeline component. |
|
||||
|
||||
## Language.replace_pipe {#replace_pipe tag="method" new="2"}
|
||||
|
@ -276,7 +276,7 @@ Replace a component in the pipeline.
|
|||
|
||||
| Name | Type | Description |
|
||||
| ----------- | -------- | --------------------------------- |
|
||||
| `name` | unicode | Name of the component to replace. |
|
||||
| `name` | str | Name of the component to replace. |
|
||||
| `component` | callable | The pipeline component to insert. |
|
||||
|
||||
## Language.rename_pipe {#rename_pipe tag="method" new="2"}
|
||||
|
@ -292,10 +292,10 @@ added to the pipeline, you can also use the `name` argument on
|
|||
> nlp.rename_pipe("parser", "spacy_parser")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ---------- | ------- | -------------------------------- |
|
||||
| `old_name` | unicode | Name of the component to rename. |
|
||||
| `new_name` | unicode | New name of the component. |
|
||||
| Name | Type | Description |
|
||||
| ---------- | ---- | -------------------------------- |
|
||||
| `old_name` | str | Name of the component to rename. |
|
||||
| `new_name` | str | New name of the component. |
|
||||
|
||||
## Language.remove_pipe {#remove_pipe tag="method" new="2"}
|
||||
|
||||
|
@ -309,10 +309,10 @@ component function.
|
|||
> assert name == "parser"
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | ----------------------------------------------------- |
|
||||
| `name` | unicode | Name of the component to remove. |
|
||||
| **RETURNS** | tuple | A `(name, component)` tuple of the removed component. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ----------------------------------------------------- |
|
||||
| `name` | str | Name of the component to remove. |
|
||||
| **RETURNS** | tuple | A `(name, component)` tuple of the removed component. |
|
||||
|
||||
## Language.select_pipes {#select_pipes tag="contextmanager, method" new="3"}
|
||||
|
||||
|
@ -342,12 +342,11 @@ latter case, all components not in the `enable` list, will be disabled.
|
|||
| Name | Type | Description |
|
||||
| ----------- | --------------- | ------------------------------------------------------------------------------------ |
|
||||
| `disable` | list | Names of pipeline components to disable. |
|
||||
| `disable` | unicode | Name of pipeline component to disable. |
|
||||
| `disable` | str | Name of pipeline component to disable. |
|
||||
| `enable` | list | Names of pipeline components that will not be disabled. |
|
||||
| `enable` | unicode | Name of pipeline component that will not be disabled. |
|
||||
| `enable` | str | Name of pipeline component that will not be disabled. |
|
||||
| **RETURNS** | `DisabledPipes` | The disabled pipes that can be restored by calling the object's `.restore()` method. |
|
||||
|
||||
|
||||
<Infobox title="Changed in v3.0" variant="warning">
|
||||
|
||||
As of spaCy v3.0, the `disable_pipes` method has been renamed to `select_pipes`:
|
||||
|
@ -370,10 +369,10 @@ the model**.
|
|||
> nlp.to_disk("/path/to/models")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. |
|
||||
| Name | Type | Description |
|
||||
| --------- | ------------ | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. |
|
||||
|
||||
## Language.from_disk {#from_disk tag="method" new="2"}
|
||||
|
||||
|
@ -395,11 +394,11 @@ loaded object.
|
|||
> nlp = English().from_disk("/path/to/en_model")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---------------- | ----------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `Language` | The modified `Language` object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------------ | ----------------------------------------------------------------------------------------- |
|
||||
| `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | Names of pipeline components or [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `Language` | The modified `Language` object. |
|
||||
|
||||
<Infobox title="Changed in v2.0" variant="warning">
|
||||
|
||||
|
@ -480,11 +479,11 @@ per component.
|
|||
|
||||
## Class attributes {#class-attributes}
|
||||
|
||||
| Name | Type | Description |
|
||||
| -------------------------------------- | ------- | ----------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `Defaults` | class | Settings, data and factory methods for creating the `nlp` object and processing pipeline. |
|
||||
| `lang` | unicode | Two-letter language ID, i.e. [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). |
|
||||
| `factories` <Tag variant="new">2</Tag> | dict | Factories that create pre-defined pipeline components, e.g. the tagger, parser or entity recognizer, keyed by their component name. |
|
||||
| Name | Type | Description |
|
||||
| -------------------------------------- | ----- | ----------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `Defaults` | class | Settings, data and factory methods for creating the `nlp` object and processing pipeline. |
|
||||
| `lang` | str | Two-letter language ID, i.e. [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes). |
|
||||
| `factories` <Tag variant="new">2</Tag> | dict | Factories that create pre-defined pipeline components, e.g. the tagger, parser or entity recognizer, keyed by their component name. |
|
||||
|
||||
## Serialization fields {#serialization-fields}
|
||||
|
||||
|
|
|
@ -63,8 +63,8 @@ Lemmatize a string.
|
|||
|
||||
| Name | Type | Description |
|
||||
| ------------ | ------------- | -------------------------------------------------------------------------------------------------------- |
|
||||
| `string` | unicode | The string to lemmatize, e.g. the token text. |
|
||||
| `univ_pos` | unicode / int | The token's universal part-of-speech tag. |
|
||||
| `string` | str | The string to lemmatize, e.g. the token text. |
|
||||
| `univ_pos` | str / int | The token's universal part-of-speech tag. |
|
||||
| `morphology` | dict / `None` | Morphological features following the [Universal Dependencies](http://universaldependencies.org/) scheme. |
|
||||
| **RETURNS** | list | The available lemmas for the string. |
|
||||
|
||||
|
@ -82,11 +82,11 @@ original string is returned. Languages can provide a
|
|||
> assert lemmatizer.lookup("going") == "go"
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | ----------------------------------------------------------------------------------------------------------- |
|
||||
| `string` | unicode | The string to look up. |
|
||||
| `orth` | int | Optional hash of the string to look up. If not set, the string will be used and hashed. Defaults to `None`. |
|
||||
| **RETURNS** | unicode | The lemma if the string was found, otherwise the original string. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---- | ----------------------------------------------------------------------------------------------------------- |
|
||||
| `string` | str | The string to look up. |
|
||||
| `orth` | int | Optional hash of the string to look up. If not set, the string will be used and hashed. Defaults to `None`. |
|
||||
| **RETURNS** | str | The lemma if the string was found, otherwise the original string. |
|
||||
|
||||
## Lemmatizer.is_base_form {#is_base_form tag="method"}
|
||||
|
||||
|
@ -102,11 +102,11 @@ lemmatization entirely.
|
|||
> assert is_base_form == True
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------ | ------------- | --------------------------------------------------------------------------------------- |
|
||||
| `univ_pos` | unicode / int | The token's universal part-of-speech tag. |
|
||||
| `morphology` | dict | The token's morphological features. |
|
||||
| **RETURNS** | bool | Whether the token's part-of-speech tag and morphological features describe a base form. |
|
||||
| Name | Type | Description |
|
||||
| ------------ | --------- | --------------------------------------------------------------------------------------- |
|
||||
| `univ_pos` | str / int | The token's universal part-of-speech tag. |
|
||||
| `morphology` | dict | The token's morphological features. |
|
||||
| **RETURNS** | bool | Whether the token's part-of-speech tag and morphological features describe a base form. |
|
||||
|
||||
## Attributes {#attributes}
|
||||
|
||||
|
|
|
@ -56,10 +56,10 @@ Check if the lookups contain a table of a given name. Delegates to
|
|||
> assert "some_table" in lookups
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | ----------------------------------------------- |
|
||||
| `name` | unicode | Name of the table. |
|
||||
| **RETURNS** | bool | Whether a table of that name is in the lookups. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---- | ----------------------------------------------- |
|
||||
| `name` | str | Name of the table. |
|
||||
| **RETURNS** | bool | Whether a table of that name is in the lookups. |
|
||||
|
||||
## Lookups.tables {#tables tag="property"}
|
||||
|
||||
|
@ -91,7 +91,7 @@ exists.
|
|||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----------------------------- | ---------------------------------- |
|
||||
| `name` | unicode | Unique name of the table. |
|
||||
| `name` | str | Unique name of the table. |
|
||||
| `data` | dict | Optional data to add to the table. |
|
||||
| **RETURNS** | [`Table`](/api/lookups#table) | The newly added table. |
|
||||
|
||||
|
@ -110,7 +110,7 @@ Get a table from the lookups. Raises an error if the table doesn't exist.
|
|||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----------------------------- | ------------------ |
|
||||
| `name` | unicode | Name of the table. |
|
||||
| `name` | str | Name of the table. |
|
||||
| **RETURNS** | [`Table`](/api/lookups#table) | The table. |
|
||||
|
||||
## Lookups.remove_table {#remove_table tag="method"}
|
||||
|
@ -128,7 +128,7 @@ Remove a table from the lookups. Raises an error if the table doesn't exist.
|
|||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----------------------------- | ---------------------------- |
|
||||
| `name` | unicode | Name of the table to remove. |
|
||||
| `name` | str | Name of the table to remove. |
|
||||
| **RETURNS** | [`Table`](/api/lookups#table) | The removed table. |
|
||||
|
||||
## Lookups.has_table {#has_table tag="method"}
|
||||
|
@ -144,10 +144,10 @@ Check if the lookups contain a table of a given name. Equivalent to
|
|||
> assert lookups.has_table("some_table")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | ----------------------------------------------- |
|
||||
| `name` | unicode | Name of the table. |
|
||||
| **RETURNS** | bool | Whether a table of that name is in the lookups. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---- | ----------------------------------------------- |
|
||||
| `name` | str | Name of the table. |
|
||||
| **RETURNS** | bool | Whether a table of that name is in the lookups. |
|
||||
|
||||
## Lookups.to_bytes {#to_bytes tag="method"}
|
||||
|
||||
|
@ -191,9 +191,9 @@ which will be created if it doesn't exist.
|
|||
> lookups.to_disk("/path/to/lookups")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| Name | Type | Description |
|
||||
| ------ | ------------ | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
|
||||
## Lookups.from_disk {#from_disk tag="method"}
|
||||
|
||||
|
@ -208,10 +208,10 @@ the file doesn't exist.
|
|||
> lookups.from_disk("/path/to/lookups")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---------------- | -------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| **RETURNS** | `Lookups` | The loaded lookups. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------------ | -------------------------------------------------------------------------- |
|
||||
| `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| **RETURNS** | `Lookups` | The loaded lookups. |
|
||||
|
||||
## Table {#table tag="class, ordererddict"}
|
||||
|
||||
|
@ -238,7 +238,7 @@ Initialize a new table.
|
|||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | ---------------------------------- |
|
||||
| `name` | unicode | Optional table name for reference. |
|
||||
| `name` | str | Optional table name for reference. |
|
||||
| **RETURNS** | `Table` | The newly constructed object. |
|
||||
|
||||
### Table.from_dict {#table.from_dict tag="classmethod"}
|
||||
|
@ -256,7 +256,7 @@ Initialize a new table from a dict.
|
|||
| Name | Type | Description |
|
||||
| ----------- | ------- | ---------------------------------- |
|
||||
| `data` | dict | The dictionary. |
|
||||
| `name` | unicode | Optional table name for reference. |
|
||||
| `name` | str | Optional table name for reference. |
|
||||
| **RETURNS** | `Table` | The newly constructed object. |
|
||||
|
||||
### Table.set {#table.set tag="method"}
|
||||
|
@ -273,10 +273,10 @@ Set a new key / value pair. String keys will be hashed. Same as
|
|||
> assert table["foo"] == "bar"
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------- | ------------- | ----------- |
|
||||
| `key` | unicode / int | The key. |
|
||||
| `value` | - | The value. |
|
||||
| Name | Type | Description |
|
||||
| ------- | --------- | ----------- |
|
||||
| `key` | str / int | The key. |
|
||||
| `value` | - | The value. |
|
||||
|
||||
### Table.to_bytes {#table.to_bytes tag="method"}
|
||||
|
||||
|
@ -313,6 +313,6 @@ Load a table from a bytestring.
|
|||
|
||||
| Name | Type | Description |
|
||||
| -------------- | --------------------------- | ----------------------------------------------------- |
|
||||
| `name` | unicode | Table name. |
|
||||
| `name` | str | Table name. |
|
||||
| `default_size` | int | Default size of bloom filters if no data is provided. |
|
||||
| `bloom` | `preshed.bloom.BloomFilter` | The bloom filters. |
|
||||
|
|
|
@ -125,10 +125,10 @@ Check whether the matcher contains rules for a match ID.
|
|||
> assert 'Rule' in matcher
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | ----------------------------------------------------- |
|
||||
| `key` | unicode | The match ID. |
|
||||
| **RETURNS** | bool | Whether the matcher contains rules for this match ID. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---- | ----------------------------------------------------- |
|
||||
| `key` | str | The match ID. |
|
||||
| **RETURNS** | bool | Whether the matcher contains rules for this match ID. |
|
||||
|
||||
## Matcher.add {#add tag="method" new="2"}
|
||||
|
||||
|
@ -153,7 +153,7 @@ overwritten.
|
|||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------------------ | --------------------------------------------------------------------------------------------- |
|
||||
| `match_id` | unicode | An ID for the thing you're matching. |
|
||||
| `match_id` | str | An ID for the thing you're matching. |
|
||||
| `on_match` | callable or `None` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. |
|
||||
| `*patterns` | list | Match pattern. A pattern consists of a list of dicts, where each dict describes a token. |
|
||||
|
||||
|
@ -188,9 +188,9 @@ exist.
|
|||
> assert "Rule" not in matcher
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----- | ------- | ------------------------- |
|
||||
| `key` | unicode | The ID of the match rule. |
|
||||
| Name | Type | Description |
|
||||
| ----- | ---- | ------------------------- |
|
||||
| `key` | str | The ID of the match rule. |
|
||||
|
||||
## Matcher.get {#get tag="method" new="2"}
|
||||
|
||||
|
@ -204,7 +204,7 @@ Retrieve the pattern stored for a key. Returns the rule as an
|
|||
> on_match, patterns = matcher.get("Rule")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | --------------------------------------------- |
|
||||
| `key` | unicode | The ID of the match rule. |
|
||||
| **RETURNS** | tuple | The rule, as an `(on_match, patterns)` tuple. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | --------------------------------------------- |
|
||||
| `key` | str | The ID of the match rule. |
|
||||
| **RETURNS** | tuple | The rule, as an `(on_match, patterns)` tuple. |
|
||||
|
|
|
@ -133,10 +133,10 @@ Check whether the matcher contains rules for a match ID.
|
|||
> assert "OBAMA" in matcher
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | ----------------------------------------------------- |
|
||||
| `key` | unicode | The match ID. |
|
||||
| **RETURNS** | bool | Whether the matcher contains rules for this match ID. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---- | ----------------------------------------------------- |
|
||||
| `key` | str | The match ID. |
|
||||
| **RETURNS** | bool | Whether the matcher contains rules for this match ID. |
|
||||
|
||||
## PhraseMatcher.add {#add tag="method"}
|
||||
|
||||
|
@ -162,7 +162,7 @@ overwritten.
|
|||
|
||||
| Name | Type | Description |
|
||||
| ---------- | ------------------ | --------------------------------------------------------------------------------------------- |
|
||||
| `match_id` | unicode | An ID for the thing you're matching. |
|
||||
| `match_id` | str | An ID for the thing you're matching. |
|
||||
| `on_match` | callable or `None` | Callback function to act on matches. Takes the arguments `matcher`, `doc`, `i` and `matches`. |
|
||||
| `*docs` | `Doc` | `Doc` objects of the phrases to match. |
|
||||
|
||||
|
@ -198,6 +198,6 @@ does not exist.
|
|||
> assert "OBAMA" not in matcher
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----- | ------- | ------------------------- |
|
||||
| `key` | unicode | The ID of the match rule. |
|
||||
| Name | Type | Description |
|
||||
| ----- | ---- | ------------------------- |
|
||||
| `key` | str | The ID of the match rule. |
|
||||
|
|
|
@ -112,8 +112,8 @@ end of the pipeline and after all other components.
|
|||
|
||||
</Infobox>
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | ------------------------------------------------------------ |
|
||||
| `doc` | `Doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. |
|
||||
| `label` | unicode | The subtoken dependency label. Defaults to `"subtok"`. |
|
||||
| **RETURNS** | `Doc` | The modified `Doc` with merged subtokens. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ------------------------------------------------------------ |
|
||||
| `doc` | `Doc` | The `Doc` object to process, e.g. the `Doc` in the pipeline. |
|
||||
| `label` | str | The subtoken dependency label. Defaults to `"subtok"`. |
|
||||
| **RETURNS** | `Doc` | The modified `Doc` with merged subtokens. |
|
||||
|
|
|
@ -81,9 +81,9 @@ a file `sentencizer.json`. This also happens automatically when you save an
|
|||
> sentencizer.to_disk("/path/to/sentencizer.jsonl")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------ | ---------------- | ---------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a file, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| Name | Type | Description |
|
||||
| ------ | ------------ | ---------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | str / `Path` | A path to a file, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
|
||||
## Sentencizer.from_disk {#from_disk tag="method"}
|
||||
|
||||
|
@ -98,10 +98,10 @@ added to its pipeline.
|
|||
> sentencizer.from_disk("/path/to/sentencizer.json")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---------------- | -------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a JSON file. Paths may be either strings or `Path`-like objects. |
|
||||
| **RETURNS** | `Sentencizer` | The modified `Sentencizer` object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------------- | -------------------------------------------------------------------------- |
|
||||
| `path` | str / `Path` | A path to a JSON file. Paths may be either strings or `Path`-like objects. |
|
||||
| **RETURNS** | `Sentencizer` | The modified `Sentencizer` object. |
|
||||
|
||||
## Sentencizer.to_bytes {#to_bytes tag="method"}
|
||||
|
||||
|
|
|
@ -110,7 +110,7 @@ For details, see the documentation on
|
|||
|
||||
| Name | Type | Description |
|
||||
| --------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `name` | unicode | Name of the attribute to set by the extension. For example, `'my_attr'` will be available as `span._.my_attr`. |
|
||||
| `name` | str | Name of the attribute to set by the extension. For example, `'my_attr'` will be available as `span._.my_attr`. |
|
||||
| `default` | - | Optional default value of the attribute if no getter or method is defined. |
|
||||
| `method` | callable | Set a custom method on the object, for example `span._.compare(other_span)`. |
|
||||
| `getter` | callable | Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute. |
|
||||
|
@ -132,10 +132,10 @@ Look up a previously registered extension by name. Returns a 4-tuple
|
|||
> assert extension == (False, None, None, None)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | ------------------------------------------------------------- |
|
||||
| `name` | unicode | Name of the extension. |
|
||||
| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the extension. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ------------------------------------------------------------- |
|
||||
| `name` | str | Name of the extension. |
|
||||
| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the extension. |
|
||||
|
||||
## Span.has_extension {#has_extension tag="classmethod" new="2"}
|
||||
|
||||
|
@ -149,10 +149,10 @@ Check whether an extension has been registered on the `Span` class.
|
|||
> assert Span.has_extension("is_city")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | ------------------------------------------ |
|
||||
| `name` | unicode | Name of the extension to check. |
|
||||
| **RETURNS** | bool | Whether the extension has been registered. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---- | ------------------------------------------ |
|
||||
| `name` | str | Name of the extension to check. |
|
||||
| **RETURNS** | bool | Whether the extension has been registered. |
|
||||
|
||||
## Span.remove_extension {#remove_extension tag="classmethod" new="2.0.12"}
|
||||
|
||||
|
@ -167,10 +167,10 @@ Remove a previously registered extension.
|
|||
> assert not Span.has_extension("is_city")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | --------------------------------------------------------------------- |
|
||||
| `name` | unicode | Name of the extension. |
|
||||
| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the removed extension. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | --------------------------------------------------------------------- |
|
||||
| `name` | str | Name of the extension. |
|
||||
| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the removed extension. |
|
||||
|
||||
## Span.char_span {#char_span tag="method" new="2.2.4"}
|
||||
|
||||
|
@ -497,16 +497,16 @@ The L2 norm of the span's vector representation.
|
|||
| `end` | int | The token offset for the end of the span. |
|
||||
| `start_char` | int | The character offset for the start of the span. |
|
||||
| `end_char` | int | The character offset for the end of the span. |
|
||||
| `text` | unicode | A unicode representation of the span text. |
|
||||
| `text_with_ws` | unicode | The text content of the span with a trailing whitespace character if the last token has one. |
|
||||
| `text` | str | A unicode representation of the span text. |
|
||||
| `text_with_ws` | str | The text content of the span with a trailing whitespace character if the last token has one. |
|
||||
| `orth` | int | ID of the verbatim text content. |
|
||||
| `orth_` | unicode | Verbatim text content (identical to `Span.text`). Exists mostly for consistency with the other attributes. |
|
||||
| `orth_` | str | Verbatim text content (identical to `Span.text`). Exists mostly for consistency with the other attributes. |
|
||||
| `label` | int | The hash value of the span's label. |
|
||||
| `label_` | unicode | The span's label. |
|
||||
| `lemma_` | unicode | The span's lemma. |
|
||||
| `label_` | str | The span's label. |
|
||||
| `lemma_` | str | The span's lemma. |
|
||||
| `kb_id` | int | The hash value of the knowledge base ID referred to by the span. |
|
||||
| `kb_id_` | unicode | The knowledge base ID referred to by the span. |
|
||||
| `kb_id_` | str | The knowledge base ID referred to by the span. |
|
||||
| `ent_id` | int | The hash value of the named entity the token is an instance of. |
|
||||
| `ent_id_` | unicode | The string ID of the named entity the token is an instance of. |
|
||||
| `ent_id_` | str | The string ID of the named entity the token is an instance of. |
|
||||
| `sentiment` | float | A scalar value indicating the positivity or negativity of the span. |
|
||||
| `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). |
|
||||
|
|
|
@ -55,7 +55,7 @@ Retrieve a string from a given hash, or vice versa.
|
|||
| Name | Type | Description |
|
||||
| -------------- | ------------------------ | -------------------------- |
|
||||
| `string_or_id` | bytes, unicode or uint64 | The value to encode. |
|
||||
| **RETURNS** | unicode or int | The value to be retrieved. |
|
||||
| **RETURNS** | str or int | The value to be retrieved. |
|
||||
|
||||
## StringStore.\_\_contains\_\_ {#contains tag="method"}
|
||||
|
||||
|
@ -69,10 +69,10 @@ Check whether a string is in the store.
|
|||
> assert not "cherry" in stringstore
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | -------------------------------------- |
|
||||
| `string` | unicode | The string to check. |
|
||||
| **RETURNS** | bool | Whether the store contains the string. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---- | -------------------------------------- |
|
||||
| `string` | str | The string to check. |
|
||||
| **RETURNS** | bool | Whether the store contains the string. |
|
||||
|
||||
## StringStore.\_\_iter\_\_ {#iter tag="method"}
|
||||
|
||||
|
@ -87,9 +87,9 @@ store will always include an empty string `''` at position `0`.
|
|||
> assert all_strings == ["apple", "orange"]
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ---------- | ------- | ---------------------- |
|
||||
| **YIELDS** | unicode | A string in the store. |
|
||||
| Name | Type | Description |
|
||||
| ---------- | ---- | ---------------------- |
|
||||
| **YIELDS** | str | A string in the store. |
|
||||
|
||||
## StringStore.add {#add tag="method" new="2"}
|
||||
|
||||
|
@ -106,10 +106,10 @@ Add a string to the `StringStore`.
|
|||
> assert stringstore["banana"] == banana_hash
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | ------------------------ |
|
||||
| `string` | unicode | The string to add. |
|
||||
| **RETURNS** | uint64 | The string's hash value. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------ | ------------------------ |
|
||||
| `string` | str | The string to add. |
|
||||
| **RETURNS** | uint64 | The string's hash value. |
|
||||
|
||||
## StringStore.to_disk {#to_disk tag="method" new="2"}
|
||||
|
||||
|
@ -121,9 +121,9 @@ Save the current state to a directory.
|
|||
> stringstore.to_disk("/path/to/strings")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| Name | Type | Description |
|
||||
| ------ | ------------ | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
|
||||
## StringStore.from_disk {#from_disk tag="method" new="2"}
|
||||
|
||||
|
@ -136,10 +136,10 @@ Loads state from a directory. Modifies the object in place and returns it.
|
|||
> stringstore = StringStore().from_disk("/path/to/strings")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---------------- | -------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| **RETURNS** | `StringStore` | The modified `StringStore` object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------------- | -------------------------------------------------------------------------- |
|
||||
| `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| **RETURNS** | `StringStore` | The modified `StringStore` object. |
|
||||
|
||||
## StringStore.to_bytes {#to_bytes tag="method"}
|
||||
|
||||
|
@ -185,7 +185,7 @@ Get a 64-bit hash for a given string.
|
|||
> assert hash_string("apple") == 8566208034543834098
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | ------------------- |
|
||||
| `string` | unicode | The string to hash. |
|
||||
| **RETURNS** | uint64 | The hash. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------ | ------------------- |
|
||||
| `string` | str | The string to hash. |
|
||||
| **RETURNS** | uint64 | The hash. |
|
||||
|
|
|
@ -229,10 +229,10 @@ Add a new label to the pipe.
|
|||
> tagger.add_label("MY_LABEL", {POS: 'NOUN'})
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| -------- | ------- | --------------------------------------------------------------- |
|
||||
| `label` | unicode | The label to add. |
|
||||
| `values` | dict | Optional values to map to the label, e.g. a tag map dictionary. |
|
||||
| Name | Type | Description |
|
||||
| -------- | ---- | --------------------------------------------------------------- |
|
||||
| `label` | str | The label to add. |
|
||||
| `values` | dict | Optional values to map to the label, e.g. a tag map dictionary. |
|
||||
|
||||
## Tagger.to_disk {#to_disk tag="method"}
|
||||
|
||||
|
@ -245,10 +245,10 @@ Serialize the pipe to disk.
|
|||
> tagger.to_disk("/path/to/tagger")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| Name | Type | Description |
|
||||
| --------- | ------------ | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
|
||||
## Tagger.from_disk {#from_disk tag="method"}
|
||||
|
||||
|
@ -261,11 +261,11 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
|||
> tagger.from_disk("/path/to/tagger")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---------------- | -------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `Tagger` | The modified `Tagger` object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------------ | -------------------------------------------------------------------------- |
|
||||
| `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `Tagger` | The modified `Tagger` object. |
|
||||
|
||||
## Tagger.to_bytes {#to_bytes tag="method"}
|
||||
|
||||
|
|
|
@ -44,7 +44,7 @@ shortcut for this and instantiate the component using its string name and
|
|||
| `vocab` | `Vocab` | The shared vocabulary. |
|
||||
| `model` | `thinc.neural.Model` / `True` | The model powering the pipeline component. If no model is supplied, the model is created when you call `begin_training`, `from_disk` or `from_bytes`. |
|
||||
| `exclusive_classes` | bool | Make categories mutually exclusive. Defaults to `False`. |
|
||||
| `architecture` | unicode | Model architecture to use, see [architectures](#architectures) for details. Defaults to `"ensemble"`. |
|
||||
| `architecture` | str | Model architecture to use, see [architectures](#architectures) for details. Defaults to `"ensemble"`. |
|
||||
| **RETURNS** | `TextCategorizer` | The newly constructed object. |
|
||||
|
||||
### Architectures {#architectures new="2.1"}
|
||||
|
@ -247,9 +247,9 @@ Add a new label to the pipe.
|
|||
> textcat.add_label("MY_LABEL")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------- | ------- | ----------------- |
|
||||
| `label` | unicode | The label to add. |
|
||||
| Name | Type | Description |
|
||||
| ------- | ---- | ----------------- |
|
||||
| `label` | str | The label to add. |
|
||||
|
||||
## TextCategorizer.to_disk {#to_disk tag="method"}
|
||||
|
||||
|
@ -262,10 +262,10 @@ Serialize the pipe to disk.
|
|||
> textcat.to_disk("/path/to/textcat")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| Name | Type | Description |
|
||||
| --------- | ------------ | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
|
||||
## TextCategorizer.from_disk {#from_disk tag="method"}
|
||||
|
||||
|
@ -280,7 +280,7 @@ Load the pipe from disk. Modifies the object in place and returns it.
|
|||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----------------- | -------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `TextCategorizer` | The modified `TextCategorizer` object. |
|
||||
|
||||
|
|
|
@ -58,7 +58,7 @@ For details, see the documentation on
|
|||
|
||||
| Name | Type | Description |
|
||||
| --------- | -------- | --------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `name` | unicode | Name of the attribute to set by the extension. For example, `'my_attr'` will be available as `token._.my_attr`. |
|
||||
| `name` | str | Name of the attribute to set by the extension. For example, `'my_attr'` will be available as `token._.my_attr`. |
|
||||
| `default` | - | Optional default value of the attribute if no getter or method is defined. |
|
||||
| `method` | callable | Set a custom method on the object, for example `token._.compare(other_token)`. |
|
||||
| `getter` | callable | Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute. |
|
||||
|
@ -80,10 +80,10 @@ Look up a previously registered extension by name. Returns a 4-tuple
|
|||
> assert extension == (False, None, None, None)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | ------------------------------------------------------------- |
|
||||
| `name` | unicode | Name of the extension. |
|
||||
| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the extension. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | ------------------------------------------------------------- |
|
||||
| `name` | str | Name of the extension. |
|
||||
| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the extension. |
|
||||
|
||||
## Token.has_extension {#has_extension tag="classmethod" new="2"}
|
||||
|
||||
|
@ -97,10 +97,10 @@ Check whether an extension has been registered on the `Token` class.
|
|||
> assert Token.has_extension("is_fruit")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | ------------------------------------------ |
|
||||
| `name` | unicode | Name of the extension to check. |
|
||||
| **RETURNS** | bool | Whether the extension has been registered. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---- | ------------------------------------------ |
|
||||
| `name` | str | Name of the extension to check. |
|
||||
| **RETURNS** | bool | Whether the extension has been registered. |
|
||||
|
||||
## Token.remove_extension {#remove_extension tag="classmethod" new=""2.0.11""}
|
||||
|
||||
|
@ -115,10 +115,10 @@ Remove a previously registered extension.
|
|||
> assert not Token.has_extension("is_fruit")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | --------------------------------------------------------------------- |
|
||||
| `name` | unicode | Name of the extension. |
|
||||
| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the removed extension. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | --------------------------------------------------------------------- |
|
||||
| `name` | str | Name of the extension. |
|
||||
| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the removed extension. |
|
||||
|
||||
## Token.check_flag {#check_flag tag="method"}
|
||||
|
||||
|
@ -408,71 +408,71 @@ The L2 norm of the token's vector representation.
|
|||
|
||||
## Attributes {#attributes}
|
||||
|
||||
| Name | Type | Description |
|
||||
| -------------------------------------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `doc` | `Doc` | The parent document. |
|
||||
| `sent` <Tag variant="new">2.0.12</Tag> | `Span` | The sentence span that this token is a part of. |
|
||||
| `text` | unicode | Verbatim text content. |
|
||||
| `text_with_ws` | unicode | Text content, with trailing space character if present. |
|
||||
| `whitespace_` | unicode | Trailing space character if present. |
|
||||
| `orth` | int | ID of the verbatim text content. |
|
||||
| `orth_` | unicode | Verbatim text content (identical to `Token.text`). Exists mostly for consistency with the other attributes. |
|
||||
| `vocab` | `Vocab` | The vocab object of the parent `Doc`. |
|
||||
| `tensor` <Tag variant="new">2.1.7</Tag> | `ndarray` | The tokens's slice of the parent `Doc`'s tensor. |
|
||||
| `head` | `Token` | The syntactic parent, or "governor", of this token. |
|
||||
| `left_edge` | `Token` | The leftmost token of this token's syntactic descendants. |
|
||||
| `right_edge` | `Token` | The rightmost token of this token's syntactic descendants. |
|
||||
| `i` | int | The index of the token within the parent document. |
|
||||
| `ent_type` | int | Named entity type. |
|
||||
| `ent_type_` | unicode | Named entity type. |
|
||||
| `ent_iob` | int | IOB code of named entity tag. `3` means the token begins an entity, `2` means it is outside an entity, `1` means it is inside an entity, and `0` means no entity tag is set. |
|
||||
| `ent_iob_` | unicode | IOB code of named entity tag. "B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set. |
|
||||
| `ent_kb_id` <Tag variant="new">2.2</Tag> | int | Knowledge base ID that refers to the named entity this token is a part of, if any. |
|
||||
| `ent_kb_id_` <Tag variant="new">2.2</Tag> | unicode | Knowledge base ID that refers to the named entity this token is a part of, if any. |
|
||||
| `ent_id` | int | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. |
|
||||
| `ent_id_` | unicode | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. |
|
||||
| `lemma` | int | Base form of the token, with no inflectional suffixes. |
|
||||
| `lemma_` | unicode | Base form of the token, with no inflectional suffixes. |
|
||||
| `norm` | int | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). |
|
||||
| `norm_` | unicode | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). |
|
||||
| `lower` | int | Lowercase form of the token. |
|
||||
| `lower_` | unicode | Lowercase form of the token text. Equivalent to `Token.text.lower()`. |
|
||||
| Name | Type | Description |
|
||||
| -------------------------------------------- | ------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `doc` | `Doc` | The parent document. |
|
||||
| `sent` <Tag variant="new">2.0.12</Tag> | `Span` | The sentence span that this token is a part of. |
|
||||
| `text` | str | Verbatim text content. |
|
||||
| `text_with_ws` | str | Text content, with trailing space character if present. |
|
||||
| `whitespace_` | str | Trailing space character if present. |
|
||||
| `orth` | int | ID of the verbatim text content. |
|
||||
| `orth_` | str | Verbatim text content (identical to `Token.text`). Exists mostly for consistency with the other attributes. |
|
||||
| `vocab` | `Vocab` | The vocab object of the parent `Doc`. |
|
||||
| `tensor` <Tag variant="new">2.1.7</Tag> | `ndarray` | The tokens's slice of the parent `Doc`'s tensor. |
|
||||
| `head` | `Token` | The syntactic parent, or "governor", of this token. |
|
||||
| `left_edge` | `Token` | The leftmost token of this token's syntactic descendants. |
|
||||
| `right_edge` | `Token` | The rightmost token of this token's syntactic descendants. |
|
||||
| `i` | int | The index of the token within the parent document. |
|
||||
| `ent_type` | int | Named entity type. |
|
||||
| `ent_type_` | str | Named entity type. |
|
||||
| `ent_iob` | int | IOB code of named entity tag. `3` means the token begins an entity, `2` means it is outside an entity, `1` means it is inside an entity, and `0` means no entity tag is set. |
|
||||
| `ent_iob_` | str | IOB code of named entity tag. "B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set. |
|
||||
| `ent_kb_id` <Tag variant="new">2.2</Tag> | int | Knowledge base ID that refers to the named entity this token is a part of, if any. |
|
||||
| `ent_kb_id_` <Tag variant="new">2.2</Tag> | str | Knowledge base ID that refers to the named entity this token is a part of, if any. |
|
||||
| `ent_id` | int | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. |
|
||||
| `ent_id_` | str | ID of the entity the token is an instance of, if any. Currently not used, but potentially for coreference resolution. |
|
||||
| `lemma` | int | Base form of the token, with no inflectional suffixes. |
|
||||
| `lemma_` | str | Base form of the token, with no inflectional suffixes. |
|
||||
| `norm` | int | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). |
|
||||
| `norm_` | str | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). |
|
||||
| `lower` | int | Lowercase form of the token. |
|
||||
| `lower_` | str | Lowercase form of the token text. Equivalent to `Token.text.lower()`. |
|
||||
| `shape` | int | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. |
|
||||
| `shape_` | unicode | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. |
|
||||
| `prefix` | int | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. |
|
||||
| `prefix_` | unicode | A length-N substring from the start of the token. Defaults to `N=1`. |
|
||||
| `suffix` | int | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. |
|
||||
| `suffix_` | unicode | Length-N substring from the end of the token. Defaults to `N=3`. |
|
||||
| `is_alpha` | bool | Does the token consist of alphabetic characters? Equivalent to `token.text.isalpha()`. |
|
||||
| `is_ascii` | bool | Does the token consist of ASCII characters? Equivalent to `all(ord(c) < 128 for c in token.text)`. |
|
||||
| `is_digit` | bool | Does the token consist of digits? Equivalent to `token.text.isdigit()`. |
|
||||
| `is_lower` | bool | Is the token in lowercase? Equivalent to `token.text.islower()`. |
|
||||
| `is_upper` | bool | Is the token in uppercase? Equivalent to `token.text.isupper()`. |
|
||||
| `is_title` | bool | Is the token in titlecase? Equivalent to `token.text.istitle()`. |
|
||||
| `is_punct` | bool | Is the token punctuation? |
|
||||
| `is_left_punct` | bool | Is the token a left punctuation mark, e.g. `'('` ? |
|
||||
| `is_right_punct` | bool | Is the token a right punctuation mark, e.g. `')'` ? |
|
||||
| `is_space` | bool | Does the token consist of whitespace characters? Equivalent to `token.text.isspace()`. |
|
||||
| `is_bracket` | bool | Is the token a bracket? |
|
||||
| `is_quote` | bool | Is the token a quotation mark? |
|
||||
| `is_currency` <Tag variant="new">2.0.8</Tag> | bool | Is the token a currency symbol? |
|
||||
| `like_url` | bool | Does the token resemble a URL? |
|
||||
| `like_num` | bool | Does the token represent a number? e.g. "10.9", "10", "ten", etc. |
|
||||
| `like_email` | bool | Does the token resemble an email address? |
|
||||
| `is_oov` | bool | Is the token out-of-vocabulary? |
|
||||
| `is_stop` | bool | Is the token part of a "stop list"? |
|
||||
| `pos` | int | Coarse-grained part-of-speech. |
|
||||
| `pos_` | unicode | Coarse-grained part-of-speech. |
|
||||
| `tag` | int | Fine-grained part-of-speech. |
|
||||
| `tag_` | unicode | Fine-grained part-of-speech. |
|
||||
| `dep` | int | Syntactic dependency relation. |
|
||||
| `dep_` | unicode | Syntactic dependency relation. |
|
||||
| `lang` | int | Language of the parent document's vocabulary. |
|
||||
| `lang_` | unicode | Language of the parent document's vocabulary. |
|
||||
| `prob` | float | Smoothed log probability estimate of token's word type (context-independent entry in the vocabulary). |
|
||||
| `idx` | int | The character offset of the token within the parent document. |
|
||||
| `sentiment` | float | A scalar value indicating the positivity or negativity of the token. |
|
||||
| `lex_id` | int | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. |
|
||||
| `rank` | int | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. |
|
||||
| `cluster` | int | Brown cluster ID. |
|
||||
| `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). |
|
||||
| `shape_` | str | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. |
|
||||
| `prefix` | int | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. |
|
||||
| `prefix_` | str | A length-N substring from the start of the token. Defaults to `N=1`. |
|
||||
| `suffix` | int | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. |
|
||||
| `suffix_` | str | Length-N substring from the end of the token. Defaults to `N=3`. |
|
||||
| `is_alpha` | bool | Does the token consist of alphabetic characters? Equivalent to `token.text.isalpha()`. |
|
||||
| `is_ascii` | bool | Does the token consist of ASCII characters? Equivalent to `all(ord(c) < 128 for c in token.text)`. |
|
||||
| `is_digit` | bool | Does the token consist of digits? Equivalent to `token.text.isdigit()`. |
|
||||
| `is_lower` | bool | Is the token in lowercase? Equivalent to `token.text.islower()`. |
|
||||
| `is_upper` | bool | Is the token in uppercase? Equivalent to `token.text.isupper()`. |
|
||||
| `is_title` | bool | Is the token in titlecase? Equivalent to `token.text.istitle()`. |
|
||||
| `is_punct` | bool | Is the token punctuation? |
|
||||
| `is_left_punct` | bool | Is the token a left punctuation mark, e.g. `'('` ? |
|
||||
| `is_right_punct` | bool | Is the token a right punctuation mark, e.g. `')'` ? |
|
||||
| `is_space` | bool | Does the token consist of whitespace characters? Equivalent to `token.text.isspace()`. |
|
||||
| `is_bracket` | bool | Is the token a bracket? |
|
||||
| `is_quote` | bool | Is the token a quotation mark? |
|
||||
| `is_currency` <Tag variant="new">2.0.8</Tag> | bool | Is the token a currency symbol? |
|
||||
| `like_url` | bool | Does the token resemble a URL? |
|
||||
| `like_num` | bool | Does the token represent a number? e.g. "10.9", "10", "ten", etc. |
|
||||
| `like_email` | bool | Does the token resemble an email address? |
|
||||
| `is_oov` | bool | Is the token out-of-vocabulary? |
|
||||
| `is_stop` | bool | Is the token part of a "stop list"? |
|
||||
| `pos` | int | Coarse-grained part-of-speech. |
|
||||
| `pos_` | str | Coarse-grained part-of-speech. |
|
||||
| `tag` | int | Fine-grained part-of-speech. |
|
||||
| `tag_` | str | Fine-grained part-of-speech. |
|
||||
| `dep` | int | Syntactic dependency relation. |
|
||||
| `dep_` | str | Syntactic dependency relation. |
|
||||
| `lang` | int | Language of the parent document's vocabulary. |
|
||||
| `lang_` | str | Language of the parent document's vocabulary. |
|
||||
| `prob` | float | Smoothed log probability estimate of token's word type (context-independent entry in the vocabulary). |
|
||||
| `idx` | int | The character offset of the token within the parent document. |
|
||||
| `sentiment` | float | A scalar value indicating the positivity or negativity of the token. |
|
||||
| `lex_id` | int | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. |
|
||||
| `rank` | int | Sequential ID of the token's lexical type, used to index into tables, e.g. for word vectors. |
|
||||
| `cluster` | int | Brown cluster ID. |
|
||||
| `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). |
|
||||
|
|
|
@ -34,15 +34,15 @@ the
|
|||
> tokenizer = nlp.Defaults.create_tokenizer(nlp)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ---------------- | ----------- | ----------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `vocab` | `Vocab` | A storage container for lexical types. |
|
||||
| `rules` | dict | Exceptions and special-cases for the tokenizer. |
|
||||
| `prefix_search` | callable | A function matching the signature of `re.compile(string).search` to match prefixes. |
|
||||
| `suffix_search` | callable | A function matching the signature of `re.compile(string).search` to match suffixes. |
|
||||
| `infix_finditer` | callable | A function matching the signature of `re.compile(string).finditer` to find infixes. |
|
||||
| `token_match` | callable | A function matching the signature of `re.compile(string).match to find token matches. |
|
||||
| **RETURNS** | `Tokenizer` | The newly constructed object. |
|
||||
| Name | Type | Description |
|
||||
| ---------------- | ----------- | ------------------------------------------------------------------------------------- |
|
||||
| `vocab` | `Vocab` | A storage container for lexical types. |
|
||||
| `rules` | dict | Exceptions and special-cases for the tokenizer. |
|
||||
| `prefix_search` | callable | A function matching the signature of `re.compile(string).search` to match prefixes. |
|
||||
| `suffix_search` | callable | A function matching the signature of `re.compile(string).search` to match suffixes. |
|
||||
| `infix_finditer` | callable | A function matching the signature of `re.compile(string).finditer` to find infixes. |
|
||||
| `token_match` | callable | A function matching the signature of `re.compile(string).match to find token matches. |
|
||||
| **RETURNS** | `Tokenizer` | The newly constructed object. |
|
||||
|
||||
## Tokenizer.\_\_call\_\_ {#call tag="method"}
|
||||
|
||||
|
@ -55,10 +55,10 @@ Tokenize a string.
|
|||
> assert len(tokens) == 4
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | --------------------------------------- |
|
||||
| `string` | unicode | The string to tokenize. |
|
||||
| **RETURNS** | `Doc` | A container for linguistic annotations. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ----- | --------------------------------------- |
|
||||
| `string` | str | The string to tokenize. |
|
||||
| **RETURNS** | `Doc` | A container for linguistic annotations. |
|
||||
|
||||
## Tokenizer.pipe {#pipe tag="method"}
|
||||
|
||||
|
@ -82,20 +82,20 @@ Tokenize a stream of texts.
|
|||
|
||||
Find internal split points of the string.
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `string` | unicode | The string to split. |
|
||||
| **RETURNS** | list | A list of `re.MatchObject` objects that have `.start()` and `.end()` methods, denoting the placement of internal segment separators, e.g. hyphens. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `string` | str | The string to split. |
|
||||
| **RETURNS** | list | A list of `re.MatchObject` objects that have `.start()` and `.end()` methods, denoting the placement of internal segment separators, e.g. hyphens. |
|
||||
|
||||
## Tokenizer.find_prefix {#find_prefix tag="method"}
|
||||
|
||||
Find the length of a prefix that should be segmented from the string, or `None`
|
||||
if no prefix rules match.
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | ------------------------------------------------------ |
|
||||
| `string` | unicode | The string to segment. |
|
||||
| **RETURNS** | int | The length of the prefix if present, otherwise `None`. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---- | ------------------------------------------------------ |
|
||||
| `string` | str | The string to segment. |
|
||||
| **RETURNS** | int | The length of the prefix if present, otherwise `None`. |
|
||||
|
||||
## Tokenizer.find_suffix {#find_suffix tag="method"}
|
||||
|
||||
|
@ -104,7 +104,7 @@ if no suffix rules match.
|
|||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------------ | ------------------------------------------------------ |
|
||||
| `string` | unicode | The string to segment. |
|
||||
| `string` | str | The string to segment. |
|
||||
| **RETURNS** | int / `None` | The length of the suffix if present, otherwise `None`. |
|
||||
|
||||
## Tokenizer.add_special_case {#add_special_case tag="method"}
|
||||
|
@ -125,7 +125,7 @@ and examples.
|
|||
|
||||
| Name | Type | Description |
|
||||
| ------------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||||
| `string` | unicode | The string to specially tokenize. |
|
||||
| `string` | str | The string to specially tokenize. |
|
||||
| `token_attrs` | iterable | A sequence of dicts, where each dict describes a token and its attributes. The `ORTH` fields of the attributes must exactly match the string when they are concatenated. |
|
||||
|
||||
## Tokenizer.explain {#explain tag="method"}
|
||||
|
@ -142,10 +142,10 @@ produced are identical to `Tokenizer.__call__` except for whitespace tokens.
|
|||
> assert [t[1] for t in tok_exp] == ["(", "do", "n't", ")"]
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------| -------- | --------------------------------------------------- |
|
||||
| `string` | unicode | The string to tokenize with the debugging tokenizer |
|
||||
| **RETURNS** | list | A list of `(pattern_string, token_string)` tuples |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---- | --------------------------------------------------- |
|
||||
| `string` | str | The string to tokenize with the debugging tokenizer |
|
||||
| **RETURNS** | list | A list of `(pattern_string, token_string)` tuples |
|
||||
|
||||
## Tokenizer.to_disk {#to_disk tag="method"}
|
||||
|
||||
|
@ -158,10 +158,10 @@ Serialize the tokenizer to disk.
|
|||
> tokenizer.to_disk("/path/to/tokenizer")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| --------- | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| Name | Type | Description |
|
||||
| --------- | ------------ | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
|
||||
## Tokenizer.from_disk {#from_disk tag="method"}
|
||||
|
||||
|
@ -174,11 +174,11 @@ Load the tokenizer from disk. Modifies the object in place and returns it.
|
|||
> tokenizer.from_disk("/path/to/tokenizer")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---------------- | -------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `Tokenizer` | The modified `Tokenizer` object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------------ | -------------------------------------------------------------------------- |
|
||||
| `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| `exclude` | list | String names of [serialization fields](#serialization-fields) to exclude. |
|
||||
| **RETURNS** | `Tokenizer` | The modified `Tokenizer` object. |
|
||||
|
||||
## Tokenizer.to_bytes {#to_bytes tag="method"}
|
||||
|
||||
|
@ -217,14 +217,14 @@ it.
|
|||
|
||||
## Attributes {#attributes}
|
||||
|
||||
| Name | Type | Description |
|
||||
| ---------------- | ------- | --------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `vocab` | `Vocab` | The vocab object of the parent `Doc`. |
|
||||
| `prefix_search` | - | A function to find segment boundaries from the start of a string. Returns the length of the segment, or `None`. |
|
||||
| `suffix_search` | - | A function to find segment boundaries from the end of a string. Returns the length of the segment, or `None`. |
|
||||
| `infix_finditer` | - | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) list of `re.MatchObject` objects. |
|
||||
| `token_match` | - | A function matching the signature of `re.compile(string).match to find token matches. Returns an `re.MatchObject` or `None. |
|
||||
| `rules` | dict | A dictionary of tokenizer exceptions and special cases. |
|
||||
| Name | Type | Description |
|
||||
| ---------------- | ------- | -------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `vocab` | `Vocab` | The vocab object of the parent `Doc`. |
|
||||
| `prefix_search` | - | A function to find segment boundaries from the start of a string. Returns the length of the segment, or `None`. |
|
||||
| `suffix_search` | - | A function to find segment boundaries from the end of a string. Returns the length of the segment, or `None`. |
|
||||
| `infix_finditer` | - | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) list of `re.MatchObject` objects. |
|
||||
| `token_match` | - | A function matching the signature of `re.compile(string).match to find token matches. Returns an`re.MatchObject`or`None. |
|
||||
| `rules` | dict | A dictionary of tokenizer exceptions and special cases. |
|
||||
|
||||
## Serialization fields {#serialization-fields}
|
||||
|
||||
|
|
|
@ -32,11 +32,11 @@ class. The data will be loaded in via
|
|||
> nlp = spacy.load("en_core_web_sm", disable=["parser", "tagger"])
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---------------- | --------------------------------------------------------------------------------- |
|
||||
| `name` | unicode / `Path` | Model to load, i.e. shortcut link, package name or path. |
|
||||
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
|
||||
| **RETURNS** | `Language` | A `Language` object with the loaded model. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------------ | --------------------------------------------------------------------------------- |
|
||||
| `name` | str / `Path` | Model to load, i.e. shortcut link, package name or path. |
|
||||
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
|
||||
| **RETURNS** | `Language` | A `Language` object with the loaded model. |
|
||||
|
||||
Essentially, `spacy.load()` is a convenience wrapper that reads the language ID
|
||||
and pipeline components from a model's `meta.json`, initializes the `Language`
|
||||
|
@ -79,7 +79,7 @@ Create a blank model of a given language class. This function is the twin of
|
|||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---------- | ------------------------------------------------------------------------------------------------ |
|
||||
| `name` | unicode | [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) of the language class to load. |
|
||||
| `name` | str | [ISO code](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) of the language class to load. |
|
||||
| `disable` | list | Names of pipeline components to [disable](/usage/processing-pipelines#disabling). |
|
||||
| **RETURNS** | `Language` | An empty `Language` object of the appropriate subclass. |
|
||||
|
||||
|
@ -98,10 +98,10 @@ meta data as a dictionary instead, you can use the `meta` attribute on your
|
|||
> spacy.info("de", markdown=True)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ---------- | ------- | ------------------------------------------------------------- |
|
||||
| `model` | unicode | A model, i.e. shortcut link, package name or path (optional). |
|
||||
| `markdown` | bool | Print information as Markdown. |
|
||||
| Name | Type | Description |
|
||||
| ---------- | ---- | ------------------------------------------------------------- |
|
||||
| `model` | str | A model, i.e. shortcut link, package name or path (optional). |
|
||||
| `markdown` | bool | Print information as Markdown. |
|
||||
|
||||
### spacy.explain {#spacy.explain tag="function"}
|
||||
|
||||
|
@ -122,10 +122,10 @@ list of available terms, see
|
|||
> # world NN noun, singular or mass
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | -------------------------------------------------------- |
|
||||
| `term` | unicode | Term to explain. |
|
||||
| **RETURNS** | unicode | The explanation, or `None` if not found in the glossary. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---- | -------------------------------------------------------- |
|
||||
| `term` | str | Term to explain. |
|
||||
| **RETURNS** | str | The explanation, or `None` if not found in the glossary. |
|
||||
|
||||
### spacy.prefer_gpu {#spacy.prefer_gpu tag="function" new="2.0.14"}
|
||||
|
||||
|
@ -189,13 +189,13 @@ browser. Will run a simple web server.
|
|||
| Name | Type | Description | Default |
|
||||
| --------- | ------------------- | ------------------------------------------------------------------------------------------------------------------------------------ | ----------- |
|
||||
| `docs` | list, `Doc`, `Span` | Document(s) to visualize. |
|
||||
| `style` | unicode | Visualization style, `'dep'` or `'ent'`. | `'dep'` |
|
||||
| `style` | str | Visualization style, `'dep'` or `'ent'`. | `'dep'` |
|
||||
| `page` | bool | Render markup as full HTML page. | `True` |
|
||||
| `minify` | bool | Minify HTML markup. | `False` |
|
||||
| `options` | dict | [Visualizer-specific options](#displacy_options), e.g. colors. | `{}` |
|
||||
| `manual` | bool | Don't parse `Doc` and instead, expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. | `False` |
|
||||
| `port` | int | Port to serve visualization. | `5000` |
|
||||
| `host` | unicode | Host to serve visualization. | `'0.0.0.0'` |
|
||||
| `host` | str | Host to serve visualization. | `'0.0.0.0'` |
|
||||
|
||||
### displacy.render {#displacy.render tag="method" new="2"}
|
||||
|
||||
|
@ -214,13 +214,13 @@ Render a dependency parse tree or named entity visualization.
|
|||
| Name | Type | Description | Default |
|
||||
| ----------- | ------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
|
||||
| `docs` | list, `Doc`, `Span` | Document(s) to visualize. |
|
||||
| `style` | unicode | Visualization style, `'dep'` or `'ent'`. | `'dep'` |
|
||||
| `style` | str | Visualization style, `'dep'` or `'ent'`. | `'dep'` |
|
||||
| `page` | bool | Render markup as full HTML page. | `False` |
|
||||
| `minify` | bool | Minify HTML markup. | `False` |
|
||||
| `jupyter` | bool | Explicitly enable or disable "[Jupyter](http://jupyter.org/) mode" to return markup ready to be rendered in a notebook. Detected automatically if `None`. | `None` |
|
||||
| `options` | dict | [Visualizer-specific options](#displacy_options), e.g. colors. | `{}` |
|
||||
| `manual` | bool | Don't parse `Doc` and instead, expect a dict or list of dicts. [See here](/usage/visualizers#manual-usage) for formats and examples. | `False` |
|
||||
| **RETURNS** | unicode | Rendered HTML markup. |
|
||||
| **RETURNS** | str | Rendered HTML markup. |
|
||||
|
||||
### Visualizer options {#displacy_options}
|
||||
|
||||
|
@ -236,22 +236,22 @@ If a setting is not present in the options, the default value will be used.
|
|||
> displacy.serve(doc, style="dep", options=options)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description | Default |
|
||||
| ------------------------------------------ | ------- | --------------------------------------------------------------------------------------------------------------- | ----------------------- |
|
||||
| `fine_grained` | bool | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). | `False` |
|
||||
| `add_lemma` <Tag variant="new">2.2.4</Tag> | bool | Print the lemma's in a separate row below the token texts. | `False` |
|
||||
| `collapse_punct` | bool | Attach punctuation to tokens. Can make the parse more readable, as it prevents long arcs to attach punctuation. | `True` |
|
||||
| `collapse_phrases` | bool | Merge noun phrases into one token. | `False` |
|
||||
| `compact` | bool | "Compact mode" with square arrows that takes up less space. | `False` |
|
||||
| `color` | unicode | Text color (HEX, RGB or color names). | `'#000000'` |
|
||||
| `bg` | unicode | Background color (HEX, RGB or color names). | `'#ffffff'` |
|
||||
| `font` | unicode | Font name or font family for all text. | `'Arial'` |
|
||||
| `offset_x` | int | Spacing on left side of the SVG in px. | `50` |
|
||||
| `arrow_stroke` | int | Width of arrow path in px. | `2` |
|
||||
| `arrow_width` | int | Width of arrow head in px. | `10` / `8` (compact) |
|
||||
| `arrow_spacing` | int | Spacing between arrows in px to avoid overlaps. | `20` / `12` (compact) |
|
||||
| `word_spacing` | int | Vertical spacing between words and arcs in px. | `45` |
|
||||
| `distance` | int | Distance between words in px. | `175` / `150` (compact) |
|
||||
| Name | Type | Description | Default |
|
||||
| ------------------------------------------ | ---- | --------------------------------------------------------------------------------------------------------------- | ----------------------- |
|
||||
| `fine_grained` | bool | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). | `False` |
|
||||
| `add_lemma` <Tag variant="new">2.2.4</Tag> | bool | Print the lemma's in a separate row below the token texts. | `False` |
|
||||
| `collapse_punct` | bool | Attach punctuation to tokens. Can make the parse more readable, as it prevents long arcs to attach punctuation. | `True` |
|
||||
| `collapse_phrases` | bool | Merge noun phrases into one token. | `False` |
|
||||
| `compact` | bool | "Compact mode" with square arrows that takes up less space. | `False` |
|
||||
| `color` | str | Text color (HEX, RGB or color names). | `'#000000'` |
|
||||
| `bg` | str | Background color (HEX, RGB or color names). | `'#ffffff'` |
|
||||
| `font` | str | Font name or font family for all text. | `'Arial'` |
|
||||
| `offset_x` | int | Spacing on left side of the SVG in px. | `50` |
|
||||
| `arrow_stroke` | int | Width of arrow path in px. | `2` |
|
||||
| `arrow_width` | int | Width of arrow head in px. | `10` / `8` (compact) |
|
||||
| `arrow_spacing` | int | Spacing between arrows in px to avoid overlaps. | `20` / `12` (compact) |
|
||||
| `word_spacing` | int | Vertical spacing between words and arcs in px. | `45` |
|
||||
| `distance` | int | Distance between words in px. | `175` / `150` (compact) |
|
||||
|
||||
#### Named Entity Visualizer options {#displacy_options-ent}
|
||||
|
||||
|
@ -263,11 +263,11 @@ If a setting is not present in the options, the default value will be used.
|
|||
> displacy.serve(doc, style="ent", options=options)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description | Default |
|
||||
| --------------------------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------ |
|
||||
| `ents` | list | Entity types to highlight (`None` for all types). | `None` |
|
||||
| `colors` | dict | Color overrides. Entity types in uppercase should be mapped to color names or values. | `{}` |
|
||||
| `template` <Tag variant="new">2.2</Tag> | unicode | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use `{bg}`, `{text}` and `{label}`. | see [`templates.py`](https://github.com/explosion/spaCy/blob/master/spacy/displacy/templates.py) |
|
||||
| Name | Type | Description | Default |
|
||||
| --------------------------------------- | ---- | ------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------ |
|
||||
| `ents` | list | Entity types to highlight (`None` for all types). | `None` |
|
||||
| `colors` | dict | Color overrides. Entity types in uppercase should be mapped to color names or values. | `{}` |
|
||||
| `template` <Tag variant="new">2.2</Tag> | str | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use `{bg}`, `{text}` and `{label}`. | see [`templates.py`](https://github.com/explosion/spaCy/blob/master/spacy/displacy/templates.py) |
|
||||
|
||||
By default, displaCy comes with colors for all
|
||||
[entity types supported by spaCy](/api/annotation#named-entities). If you're
|
||||
|
@ -308,9 +308,9 @@ Set custom path to the data directory where spaCy looks for models.
|
|||
> # PosixPath('/custom/path')
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------ | ---------------- | --------------------------- |
|
||||
| `path` | unicode / `Path` | Path to new data directory. |
|
||||
| Name | Type | Description |
|
||||
| ------ | ------------ | --------------------------- |
|
||||
| `path` | str / `Path` | Path to new data directory. |
|
||||
|
||||
### util.get_lang_class {#util.get_lang_class tag="function"}
|
||||
|
||||
|
@ -330,7 +330,7 @@ you can use the [`set_lang_class`](/api/top-level#util.set_lang_class) helper.
|
|||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---------- | -------------------------------------- |
|
||||
| `lang` | unicode | Two-letter language code, e.g. `'en'`. |
|
||||
| `lang` | str | Two-letter language code, e.g. `'en'`. |
|
||||
| **RETURNS** | `Language` | Language class. |
|
||||
|
||||
### util.set_lang_class {#util.set_lang_class tag="function"}
|
||||
|
@ -352,7 +352,7 @@ the two-letter language code.
|
|||
|
||||
| Name | Type | Description |
|
||||
| ------ | ---------- | -------------------------------------- |
|
||||
| `name` | unicode | Two-letter language code, e.g. `'en'`. |
|
||||
| `name` | str | Two-letter language code, e.g. `'en'`. |
|
||||
| `cls` | `Language` | The language class, e.g. `English`. |
|
||||
|
||||
### util.lang_class_is_loaded {#util.lang_class_is_loaded tag="function" new="2.1"}
|
||||
|
@ -368,10 +368,10 @@ loaded lazily, to avoid expensive setup code associated with the language data.
|
|||
> assert util.lang_class_is_loaded("de") is False
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | -------------------------------------- |
|
||||
| `name` | unicode | Two-letter language code, e.g. `'en'`. |
|
||||
| **RETURNS** | bool | Whether the class has been loaded. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---- | -------------------------------------- |
|
||||
| `name` | str | Two-letter language code, e.g. `'en'`. |
|
||||
| **RETURNS** | bool | Whether the class has been loaded. |
|
||||
|
||||
### util.load_model {#util.load_model tag="function" new="2"}
|
||||
|
||||
|
@ -392,7 +392,7 @@ in via [`Language.from_disk()`](/api/language#from_disk).
|
|||
|
||||
| Name | Type | Description |
|
||||
| ------------- | ---------- | -------------------------------------------------------- |
|
||||
| `name` | unicode | Package name, shortcut link or model path. |
|
||||
| `name` | str | Package name, shortcut link or model path. |
|
||||
| `**overrides` | - | Specific overrides, like pipeline components to disable. |
|
||||
| **RETURNS** | `Language` | `Language` class with the loaded model. |
|
||||
|
||||
|
@ -411,7 +411,7 @@ it easy to test a new model that you haven't packaged yet.
|
|||
|
||||
| Name | Type | Description |
|
||||
| ------------- | ---------- | ---------------------------------------------------------------------------------------------------- |
|
||||
| `model_path` | unicode | Path to model data directory. |
|
||||
| `model_path` | str | Path to model data directory. |
|
||||
| `meta` | dict | Model meta data. If `False`, spaCy will try to load the meta from a meta.json in the same directory. |
|
||||
| `**overrides` | - | Specific overrides, like pipeline components to disable. |
|
||||
| **RETURNS** | `Language` | `Language` class with the loaded model. |
|
||||
|
@ -432,7 +432,7 @@ A helper function to use in the `load()` method of a model package's
|
|||
|
||||
| Name | Type | Description |
|
||||
| ------------- | ---------- | -------------------------------------------------------- |
|
||||
| `init_file` | unicode | Path to model's `__init__.py`, i.e. `__file__`. |
|
||||
| `init_file` | str | Path to model's `__init__.py`, i.e. `__file__`. |
|
||||
| `**overrides` | - | Specific overrides, like pipeline components to disable. |
|
||||
| **RETURNS** | `Language` | `Language` class with the loaded model. |
|
||||
|
||||
|
@ -446,10 +446,10 @@ Get a model's meta.json from a directory path and validate its contents.
|
|||
> meta = util.get_model_meta("/path/to/model")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---------------- | ------------------------ |
|
||||
| `path` | unicode / `Path` | Path to model directory. |
|
||||
| **RETURNS** | dict | The model's meta data. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------------ | ------------------------ |
|
||||
| `path` | str / `Path` | Path to model directory. |
|
||||
| **RETURNS** | dict | The model's meta data. |
|
||||
|
||||
### util.is_package {#util.is_package tag="function"}
|
||||
|
||||
|
@ -463,10 +463,10 @@ Check if string maps to a package installed via pip. Mainly used to validate
|
|||
> util.is_package("xyz") # False
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------- | -------------------------------------------- |
|
||||
| `name` | unicode | Name of package. |
|
||||
| **RETURNS** | `bool` | `True` if installed package, `False` if not. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------ | -------------------------------------------- |
|
||||
| `name` | str | Name of package. |
|
||||
| **RETURNS** | `bool` | `True` if installed package, `False` if not. |
|
||||
|
||||
### util.get_package_path {#util.get_package_path tag="function" new="2"}
|
||||
|
||||
|
@ -480,10 +480,10 @@ Get path to an installed package. Mainly used to resolve the location of
|
|||
> # /usr/lib/python3.6/site-packages/en_core_web_sm
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| -------------- | ------- | -------------------------------- |
|
||||
| `package_name` | unicode | Name of installed package. |
|
||||
| **RETURNS** | `Path` | Path to model package directory. |
|
||||
| Name | Type | Description |
|
||||
| -------------- | ------ | -------------------------------- |
|
||||
| `package_name` | str | Name of installed package. |
|
||||
| **RETURNS** | `Path` | Path to model package directory. |
|
||||
|
||||
### util.is_in_jupyter {#util.is_in_jupyter tag="function" new="2"}
|
||||
|
||||
|
|
|
@ -35,7 +35,7 @@ you can add vectors to later.
|
|||
| `data` | `ndarray[ndim=1, dtype='float32']` | The vector data. |
|
||||
| `keys` | iterable | A sequence of keys aligned with the data. |
|
||||
| `shape` | tuple | Size of the table as `(n_entries, n_columns)`, the number of entries and number of columns. Not required if you're initializing the object with `data` and `keys`. |
|
||||
| `name` | unicode | A name to identify the vectors table. |
|
||||
| `name` | str | A name to identify the vectors table. |
|
||||
| **RETURNS** | `Vectors` | The newly created object. |
|
||||
|
||||
## Vectors.\_\_getitem\_\_ {#getitem tag="method"}
|
||||
|
@ -140,7 +140,7 @@ mapping separately. If you need to manage the strings, you should use the
|
|||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---------------------------------- | ----------------------------------------------------- |
|
||||
| `key` | unicode / int | The key to add. |
|
||||
| `key` | str / int | The key to add. |
|
||||
| `vector` | `ndarray[ndim=1, dtype='float32']` | An optional vector to add for the key. |
|
||||
| `row` | int | An optional row number of a vector to map the key to. |
|
||||
| **RETURNS** | int | The row the vector was added to. |
|
||||
|
@ -227,7 +227,7 @@ Look up one or more keys by row, or vice versa.
|
|||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------------------------------------- | ------------------------------------------------------------------------ |
|
||||
| `key` | unicode / int | Find the row that the given key points to. Returns int, `-1` if missing. |
|
||||
| `key` | str / int | Find the row that the given key points to. Returns int, `-1` if missing. |
|
||||
| `keys` | iterable | Find rows that the keys point to. Returns `ndarray`. |
|
||||
| `row` | int | Find the first key that points to the row. Returns int. |
|
||||
| `rows` | iterable | Find the keys that point to the rows. Returns ndarray. |
|
||||
|
@ -337,9 +337,9 @@ Save the current state to a directory.
|
|||
>
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
| Name | Type | Description |
|
||||
| ------ | ------------ | --------------------------------------------------------------------------------------------------------------------- |
|
||||
| `path` | str / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||||
|
||||
## Vectors.from_disk {#from_disk tag="method"}
|
||||
|
||||
|
@ -352,10 +352,10 @@ Loads state from a directory. Modifies the object in place and returns it.
|
|||
> vectors.from_disk("/path/to/vectors")
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ----------- | ---------------- | -------------------------------------------------------------------------- |
|
||||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| **RETURNS** | `Vectors` | The modified `Vectors` object. |
|
||||
| Name | Type | Description |
|
||||
| ----------- | ------------ | -------------------------------------------------------------------------- |
|
||||
| `path` | str / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||||
| **RETURNS** | `Vectors` | The modified `Vectors` object. |
|
||||
|
||||
## Vectors.to_bytes {#to_bytes tag="method"}
|
||||
|
||||
|
|
|
@ -327,11 +327,11 @@ displaCy in our [online demo](https://explosion.ai/demos/displacy)..
|
|||
### Disabling the parser {#disabling}
|
||||
|
||||
In the [default models](/models), the parser is loaded and enabled as part of
|
||||
the [standard processing pipeline](/usage/processing-pipelines). If you don't need
|
||||
any of the syntactic information, you should disable the parser. Disabling the
|
||||
parser will make spaCy load and run much faster. If you want to load the parser,
|
||||
but need to disable it for specific documents, you can also control its use on
|
||||
the `nlp` object.
|
||||
the [standard processing pipeline](/usage/processing-pipelines). If you don't
|
||||
need any of the syntactic information, you should disable the parser. Disabling
|
||||
the parser will make spaCy load and run much faster. If you want to load the
|
||||
parser, but need to disable it for specific documents, you can also control its
|
||||
use on the `nlp` object.
|
||||
|
||||
```python
|
||||
nlp = spacy.load("en_core_web_sm", disable=["parser"])
|
||||
|
@ -990,10 +990,10 @@ nlp = spacy.load("en_core_web_sm")
|
|||
nlp.tokenizer = my_tokenizer
|
||||
```
|
||||
|
||||
| Argument | Type | Description |
|
||||
| ----------- | ------- | ------------------------- |
|
||||
| `text` | unicode | The raw text to tokenize. |
|
||||
| **RETURNS** | `Doc` | The tokenized document. |
|
||||
| Argument | Type | Description |
|
||||
| ----------- | ----- | ------------------------- |
|
||||
| `text` | str | The raw text to tokenize. |
|
||||
| **RETURNS** | `Doc` | The tokenized document. |
|
||||
|
||||
<Infobox title="Important note: using a custom tokenizer" variant="warning">
|
||||
|
||||
|
|
|
@ -272,16 +272,16 @@ doc = nlp("I won't have named entities")
|
|||
disabled.restore()
|
||||
```
|
||||
|
||||
If you want to disable all pipes except for one or a few, you can use the `enable`
|
||||
keyword. Just like the `disable` keyword, it takes a list of pipe names, or a string
|
||||
defining just one pipe.
|
||||
If you want to disable all pipes except for one or a few, you can use the
|
||||
`enable` keyword. Just like the `disable` keyword, it takes a list of pipe
|
||||
names, or a string defining just one pipe.
|
||||
|
||||
```python
|
||||
# Enable only the parser
|
||||
with nlp.select_pipes(enable="parser"):
|
||||
doc = nlp("I will only be parsed")
|
||||
```
|
||||
|
||||
|
||||
Finally, you can also use the [`remove_pipe`](/api/language#remove_pipe) method
|
||||
to remove pipeline components from an existing pipeline, the
|
||||
[`rename_pipe`](/api/language#rename_pipe) method to rename them, or the
|
||||
|
@ -349,12 +349,12 @@ last** in the pipeline, or define a **custom name**. If no name is set and no
|
|||
> nlp.add_pipe(my_component, before="parser")
|
||||
> ```
|
||||
|
||||
| Argument | Type | Description |
|
||||
| -------- | ------- | ------------------------------------------------------------------------ |
|
||||
| `last` | bool | If set to `True`, component is added **last** in the pipeline (default). |
|
||||
| `first` | bool | If set to `True`, component is added **first** in the pipeline. |
|
||||
| `before` | unicode | String name of component to add the new component **before**. |
|
||||
| `after` | unicode | String name of component to add the new component **after**. |
|
||||
| Argument | Type | Description |
|
||||
| -------- | ---- | ------------------------------------------------------------------------ |
|
||||
| `last` | bool | If set to `True`, component is added **last** in the pipeline (default). |
|
||||
| `first` | bool | If set to `True`, component is added **first** in the pipeline. |
|
||||
| `before` | str | String name of component to add the new component **before**. |
|
||||
| `after` | str | String name of component to add the new component **after**. |
|
||||
|
||||
### Example: A simple pipeline component {#custom-components-simple}
|
||||
|
||||
|
|
|
@ -94,8 +94,8 @@ docs = list(doc_bin.get_docs(nlp.vocab))
|
|||
|
||||
If `store_user_data` is set to `True`, the `Doc.user_data` will be serialized as
|
||||
well, which includes the values of
|
||||
[extension attributes](/usage/processing-pipelines#custom-components-attributes) (if
|
||||
they're serializable with msgpack).
|
||||
[extension attributes](/usage/processing-pipelines#custom-components-attributes)
|
||||
(if they're serializable with msgpack).
|
||||
|
||||
<Infobox title="Important note on serializing extension attributes" variant="warning">
|
||||
|
||||
|
@ -666,10 +666,10 @@ and lets you customize how the model should be initialized and loaded. You can
|
|||
define the language data to be loaded and the
|
||||
[processing pipeline](/usage/processing-pipelines) to execute.
|
||||
|
||||
| Setting | Type | Description |
|
||||
| ---------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `lang` | unicode | ID of the language class to initialize. |
|
||||
| `pipeline` | list | A list of strings mapping to the IDs of pipeline factories to apply in that order. If not set, spaCy's [default pipeline](/usage/processing-pipelines) will be used. |
|
||||
| Setting | Type | Description |
|
||||
| ---------- | ---- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `lang` | str | ID of the language class to initialize. |
|
||||
| `pipeline` | list | A list of strings mapping to the IDs of pipeline factories to apply in that order. If not set, spaCy's [default pipeline](/usage/processing-pipelines) will be used. |
|
||||
|
||||
The `load()` method that comes with our model package templates will take care
|
||||
of putting all this together and returning a `Language` object with the loaded
|
||||
|
|
|
@ -67,12 +67,12 @@ arcs.
|
|||
|
||||
</Infobox>
|
||||
|
||||
| Argument | Type | Description | Default |
|
||||
| --------- | ------- | ----------------------------------------------------------- | ----------- |
|
||||
| `compact` | bool | "Compact mode" with square arrows that takes up less space. | `False` |
|
||||
| `color` | unicode | Text color (HEX, RGB or color names). | `"#000000"` |
|
||||
| `bg` | unicode | Background color (HEX, RGB or color names). | `"#ffffff"` |
|
||||
| `font` | unicode | Font name or font family for all text. | `"Arial"` |
|
||||
| Argument | Type | Description | Default |
|
||||
| --------- | ---- | ----------------------------------------------------------- | ----------- |
|
||||
| `compact` | bool | "Compact mode" with square arrows that takes up less space. | `False` |
|
||||
| `color` | str | Text color (HEX, RGB or color names). | `"#000000"` |
|
||||
| `bg` | str | Background color (HEX, RGB or color names). | `"#ffffff"` |
|
||||
| `font` | str | Font name or font family for all text. | `"Arial"` |
|
||||
|
||||
For a list of all available options, see the
|
||||
[`displacy` API documentation](/api/top-level#displacy_options).
|
||||
|
|
Loading…
Reference in New Issue
Block a user