mirror of
https://github.com/explosion/spaCy.git
synced 2024-11-14 21:57:15 +03:00
e597110d31
<!--- Provide a general summary of your changes in the title. --> ## Description The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on. This PR also includes various new docs pages and content. Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837. ### Types of change enhancement ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
609 lines
30 KiB
Markdown
609 lines
30 KiB
Markdown
---
|
||
title: Doc
|
||
tag: class
|
||
teaser: A container for accessing linguistic annotations.
|
||
source: spacy/tokens/doc.pyx
|
||
---
|
||
|
||
A `Doc` is a sequence of [`Token`](/api/token) objects. Access sentences and
|
||
named entities, export annotations to numpy arrays, losslessly serialize to
|
||
compressed binary strings. The `Doc` object holds an array of `TokenC]` structs.
|
||
The Python-level `Token` and [`Span`](/api/span) objects are views of this
|
||
array, i.e. they don't own the data themselves.
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> # Construction 1
|
||
> doc = nlp(u"Some text")
|
||
>
|
||
> # Construction 2
|
||
> from spacy.tokens import Doc
|
||
> words = [u"hello", u"world", u"!"]
|
||
> spaces = [True, False, False]
|
||
> doc = Doc(nlp.vocab, words=words, spaces=spaces)
|
||
> ```
|
||
|
||
## Doc.\_\_init\_\_ {#init tag="method"}
|
||
|
||
Construct a `Doc` object. The most common way to get a `Doc` object is via the
|
||
`nlp` object.
|
||
|
||
| Name | Type | Description |
|
||
| ----------- | -------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||
| `vocab` | `Vocab` | A storage container for lexical types. |
|
||
| `words` | iterable | A list of strings to add to the container. |
|
||
| `spaces` | iterable | A list of boolean values indicating whether each word has a subsequent space. Must have the same length as `words`, if specified. Defaults to a sequence of `True`. |
|
||
| **RETURNS** | `Doc` | The newly constructed object. |
|
||
|
||
## Doc.\_\_getitem\_\_ {#getitem tag="method"}
|
||
|
||
Get a [`Token`](/api/token) object at position `i`, where `i` is an integer.
|
||
Negative indexing is supported, and follows the usual Python semantics, i.e.
|
||
`doc[-2]` is `doc[len(doc) - 2]`.
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> doc = nlp(u"Give it back! He pleaded.")
|
||
> assert doc[0].text == "Give"
|
||
> assert doc[-1].text == "."
|
||
> span = doc[1:3]
|
||
> assert span.text == "it back"
|
||
> ```
|
||
|
||
| Name | Type | Description |
|
||
| ----------- | ------- | ----------------------- |
|
||
| `i` | int | The index of the token. |
|
||
| **RETURNS** | `Token` | The token at `doc[i]`. |
|
||
|
||
Get a [`Span`](/api/span) object, starting at position `start` (token index) and
|
||
ending at position `end` (token index). For instance, `doc[2:5]` produces a span
|
||
consisting of tokens 2, 3 and 4. Stepped slices (e.g. `doc[start : end : step]`)
|
||
are not supported, as `Span` objects must be contiguous (cannot have gaps). You
|
||
can use negative indices and open-ended ranges, which have their normal Python
|
||
semantics.
|
||
|
||
| Name | Type | Description |
|
||
| ----------- | ------ | --------------------------------- |
|
||
| `start_end` | tuple | The slice of the document to get. |
|
||
| **RETURNS** | `Span` | The span at `doc[start:end]`. |
|
||
|
||
## Doc.\_\_iter\_\_ {#iter tag="method"}
|
||
|
||
Iterate over `Token` objects, from which the annotations can be easily accessed.
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> doc = nlp(u'Give it back')
|
||
> assert [t.text for t in doc] == [u'Give', u'it', u'back']
|
||
> ```
|
||
|
||
This is the main way of accessing [`Token`](/api/token) objects, which are the
|
||
main way annotations are accessed from Python. If faster-than-Python speeds are
|
||
required, you can instead access the annotations as a numpy array, or access the
|
||
underlying C data directly from Cython.
|
||
|
||
| Name | Type | Description |
|
||
| ---------- | ------- | ----------------- |
|
||
| **YIELDS** | `Token` | A `Token` object. |
|
||
|
||
## Doc.\_\_len\_\_ {#len tag="method"}
|
||
|
||
Get the number of tokens in the document.
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> doc = nlp(u"Give it back! He pleaded.")
|
||
> assert len(doc) == 7
|
||
> ```
|
||
|
||
| Name | Type | Description |
|
||
| ----------- | ---- | ------------------------------------- |
|
||
| **RETURNS** | int | The number of tokens in the document. |
|
||
|
||
## Doc.set_extension {#set_extension tag="classmethod" new="2"}
|
||
|
||
Define a custom attribute on the `Doc` which becomes available via `Doc._`. For
|
||
details, see the documentation on
|
||
[custom attributes](/usage/processing-pipelines#custom-components-attributes).
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> from spacy.tokens import Doc
|
||
> city_getter = lambda doc: any(city in doc.text for city in ('New York', 'Paris', 'Berlin'))
|
||
> Doc.set_extension('has_city', getter=city_getter)
|
||
> doc = nlp(u'I like New York')
|
||
> assert doc._.has_city
|
||
> ```
|
||
|
||
| Name | Type | Description |
|
||
| --------- | -------- | ----------------------------------------------------------------------------------------------------------------------------------- |
|
||
| `name` | unicode | Name of the attribute to set by the extension. For example, `'my_attr'` will be available as `doc._.my_attr`. |
|
||
| `default` | - | Optional default value of the attribute if no getter or method is defined. |
|
||
| `method` | callable | Set a custom method on the object, for example `doc._.compare(other_doc)`. |
|
||
| `getter` | callable | Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute. |
|
||
| `setter` | callable | Setter function that takes the `Doc` and a value, and modifies the object. Is called when the user writes to the `Doc._` attribute. |
|
||
|
||
## Doc.get_extension {#get_extension tag="classmethod" new="2"}
|
||
|
||
Look up a previously registered extension by name. Returns a 4-tuple
|
||
`(default, method, getter, setter)` if the extension is registered. Raises a
|
||
`KeyError` otherwise.
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> from spacy.tokens import Doc
|
||
> Doc.set_extension('has_city', default=False)
|
||
> extension = Doc.get_extension('has_city')
|
||
> assert extension == (False, None, None, None)
|
||
> ```
|
||
|
||
| Name | Type | Description |
|
||
| ----------- | ------- | ------------------------------------------------------------- |
|
||
| `name` | unicode | Name of the extension. |
|
||
| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the extension. |
|
||
|
||
## Doc.has_extension {#has_extension tag="classmethod" new="2"}
|
||
|
||
Check whether an extension has been registered on the `Doc` class.
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> from spacy.tokens import Doc
|
||
> Doc.set_extension('has_city', default=False)
|
||
> assert Doc.has_extension('has_city')
|
||
> ```
|
||
|
||
| Name | Type | Description |
|
||
| ----------- | ------- | ------------------------------------------ |
|
||
| `name` | unicode | Name of the extension to check. |
|
||
| **RETURNS** | bool | Whether the extension has been registered. |
|
||
|
||
## Doc.remove_extension {#remove_extension tag="classmethod" new="2.0.12"}
|
||
|
||
Remove a previously registered extension.
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> from spacy.tokens import Doc
|
||
> Doc.set_extension('has_city', default=False)
|
||
> removed = Doc.remove_extension('has_city')
|
||
> assert not Doc.has_extension('has_city')
|
||
> ```
|
||
|
||
| Name | Type | Description |
|
||
| ----------- | ------- | --------------------------------------------------------------------- |
|
||
| `name` | unicode | Name of the extension. |
|
||
| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the removed extension. |
|
||
|
||
## Doc.char_span {#char_span tag="method" new="2"}
|
||
|
||
Create a `Span` object from the slice `doc.text[start:end]`. Returns `None` if
|
||
the character indices don't map to a valid span.
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> doc = nlp(u"I like New York")
|
||
> span = doc.char_span(7, 15, label=u"GPE")
|
||
> assert span.text == "New York"
|
||
> ```
|
||
|
||
| Name | Type | Description |
|
||
| ----------- | ---------------------------------------- | ------------------------------------------------------- |
|
||
| `start` | int | The index of the first character of the span. |
|
||
| `end` | int | The index of the last character after the span. |
|
||
| `label` | uint64 / unicode | A label to attach to the Span, e.g. for named entities. |
|
||
| `vector` | `numpy.ndarray[ndim=1, dtype='float32']` | A meaning representation of the span. |
|
||
| **RETURNS** | `Span` | The newly constructed object or `None`. |
|
||
|
||
## Doc.similarity {#similarity tag="method" model="vectors"}
|
||
|
||
Make a semantic similarity estimate. The default estimate is cosine similarity
|
||
using an average of word vectors.
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> apples = nlp(u"I like apples")
|
||
> oranges = nlp(u"I like oranges")
|
||
> apples_oranges = apples.similarity(oranges)
|
||
> oranges_apples = oranges.similarity(apples)
|
||
> assert apples_oranges == oranges_apples
|
||
> ```
|
||
|
||
| Name | Type | Description |
|
||
| ----------- | ----- | -------------------------------------------------------------------------------------------- |
|
||
| `other` | - | The object to compare with. By default, accepts `Doc`, `Span`, `Token` and `Lexeme` objects. |
|
||
| **RETURNS** | float | A scalar similarity score. Higher is more similar. |
|
||
|
||
## Doc.count_by {#count_by tag="method"}
|
||
|
||
Count the frequencies of a given attribute. Produces a dict of
|
||
`{attr (int): count (ints)}` frequencies, keyed by the values of the given
|
||
attribute ID.
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> from spacy.attrs import ORTH
|
||
> doc = nlp(u"apple apple orange banana")
|
||
> assert doc.count_by(ORTH) == {7024L: 1, 119552L: 1, 2087L: 2}
|
||
> doc.to_array([attrs.ORTH])
|
||
> # array([[11880], [11880], [7561], [12800]])
|
||
> ```
|
||
|
||
| Name | Type | Description |
|
||
| ----------- | ---- | -------------------------------------------------- |
|
||
| `attr_id` | int | The attribute ID |
|
||
| **RETURNS** | dict | A dictionary mapping attributes to integer counts. |
|
||
|
||
## Doc.get_lca_matrix {#get_lca_matrix tag="method"}
|
||
|
||
Calculates the lowest common ancestor matrix for a given `Doc`. Returns LCA
|
||
matrix containing the integer index of the ancestor, or `-1` if no common
|
||
ancestor is found, e.g. if span excludes a necessary ancestor.
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> doc = nlp(u"This is a test")
|
||
> matrix = doc.get_lca_matrix()
|
||
> # array([[0, 1, 1, 1], [1, 1, 1, 1], [1, 1, 2, 3], [1, 1, 3, 3]], dtype=int32)
|
||
> ```
|
||
|
||
| Name | Type | Description |
|
||
| ----------- | -------------------------------------- | ----------------------------------------------- |
|
||
| **RETURNS** | `numpy.ndarray[ndim=2, dtype='int32']` | The lowest common ancestor matrix of the `Doc`. |
|
||
|
||
## Doc.to_array {#to_array tag="method"}
|
||
|
||
Export given token attributes to a numpy `ndarray`. If `attr_ids` is a sequence
|
||
of `M` attributes, the output array will be of shape `(N, M)`, where `N` is the
|
||
length of the `Doc` (in tokens). If `attr_ids` is a single attribute, the output
|
||
shape will be `(N,)`. You can specify attributes by integer ID (e.g.
|
||
`spacy.attrs.LEMMA`) or string name (e.g. 'LEMMA' or 'lemma'). The values will
|
||
be 64-bit integers.
|
||
|
||
Returns a 2D array with one row per token and one column per attribute (when
|
||
`attr_ids` is a list), or as a 1D numpy array, with one item per attribute (when
|
||
`attr_ids` is a single value).
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA
|
||
> doc = nlp(text)
|
||
> # All strings mapped to integers, for easy export to numpy
|
||
> np_array = doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
|
||
> np_array = doc.to_array("POS")
|
||
> ```
|
||
|
||
| Name | Type | Description |
|
||
| ----------- | ---------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------- |
|
||
| `attr_ids` | list or int or string | A list of attributes (int IDs or string names) or a single attribute (int ID or string name) |
|
||
| **RETURNS** | `numpy.ndarray[ndim=2, dtype='uint64']` or `numpy.ndarray[ndim=1, dtype='uint64']` | The exported attributes as a numpy array. |
|
||
|
||
## Doc.from_array {#from_array tag="method"}
|
||
|
||
Load attributes from a numpy array. Write to a `Doc` object, from an `(M, N)`
|
||
array of attributes.
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA
|
||
> from spacy.tokens import Doc
|
||
> doc = nlp(u"Hello world!")
|
||
> np_array = doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
|
||
> doc2 = Doc(doc.vocab, words=[t.text for t in doc])
|
||
> doc2.from_array([LOWER, POS, ENT_TYPE, IS_ALPHA], np_array)
|
||
> assert doc[0].pos_ == doc2[0].pos_
|
||
> ```
|
||
|
||
| Name | Type | Description |
|
||
| ----------- | -------------------------------------- | ----------------------------- |
|
||
| `attrs` | ints | A list of attribute ID ints. |
|
||
| `array` | `numpy.ndarray[ndim=2, dtype='int32']` | The attribute values to load. |
|
||
| **RETURNS** | `Doc` | Itself. |
|
||
|
||
## Doc.to_disk {#to_disk tag="method" new="2"}
|
||
|
||
Save the current state to a directory.
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> doc.to_disk("/path/to/doc")
|
||
> ```
|
||
|
||
| Name | Type | Description |
|
||
| ------ | ---------------- | --------------------------------------------------------------------------------------------------------------------- |
|
||
| `path` | unicode / `Path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. |
|
||
|
||
## Doc.from_disk {#from_disk tag="method" new="2"}
|
||
|
||
Loads state from a directory. Modifies the object in place and returns it.
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> from spacy.tokens import Doc
|
||
> from spacy.vocab import Vocab
|
||
> doc = Doc(Vocab()).from_disk("/path/to/doc")
|
||
> ```
|
||
|
||
| Name | Type | Description |
|
||
| ----------- | ---------------- | -------------------------------------------------------------------------- |
|
||
| `path` | unicode / `Path` | A path to a directory. Paths may be either strings or `Path`-like objects. |
|
||
| **RETURNS** | `Doc` | The modified `Doc` object. |
|
||
|
||
## Doc.to_bytes {#to_bytes tag="method"}
|
||
|
||
Serialize, i.e. export the document contents to a binary string.
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> doc = nlp(u"Give it back! He pleaded.")
|
||
> doc_bytes = doc.to_bytes()
|
||
> ```
|
||
|
||
| Name | Type | Description |
|
||
| ----------- | ----- | --------------------------------------------------------------------- |
|
||
| **RETURNS** | bytes | A losslessly serialized copy of the `Doc`, including all annotations. |
|
||
|
||
## Doc.from_bytes {#from_bytes tag="method"}
|
||
|
||
Deserialize, i.e. import the document contents from a binary string.
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> from spacy.tokens import Doc
|
||
> text = u"Give it back! He pleaded."
|
||
> doc = nlp(text)
|
||
> bytes = doc.to_bytes()
|
||
> doc2 = Doc(doc.vocab).from_bytes(bytes)
|
||
> assert doc.text == doc2.text
|
||
> ```
|
||
|
||
| Name | Type | Description |
|
||
| ----------- | ----- | ------------------------ |
|
||
| `data` | bytes | The string to load from. |
|
||
| **RETURNS** | `Doc` | The `Doc` object. |
|
||
|
||
## Doc.retokenize {#retokenize tag="contextmanager" new="2.1"}
|
||
|
||
Context manager to handle retokenization of the `Doc`. Modifications to the
|
||
`Doc`'s tokenization are stored, and then made all at once when the context
|
||
manager exits. This is much more efficient, and less error-prone. All views of
|
||
the `Doc` (`Span` and `Token`) created before the retokenization are
|
||
invalidated, although they may accidentally continue to work.
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> doc = nlp("Hello world!")
|
||
> with doc.retokenize() as retokenizer:
|
||
> retokenizer.merge(doc[0:2])
|
||
> ```
|
||
|
||
| Name | Type | Description |
|
||
| ----------- | ------------- | ---------------- |
|
||
| **RETURNS** | `Retokenizer` | The retokenizer. |
|
||
|
||
### Retokenizer.merge {#retokenizer.merge tag="method"}
|
||
|
||
Mark a span for merging. The `attrs` will be applied to the resulting token.
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> doc = nlp(u"I like David Bowie")
|
||
> with doc.retokenize() as retokenizer:
|
||
> attrs = {"LEMMA": u"David Bowie"}
|
||
> retokenizer.merge(doc[2:4], attrs=attrs)
|
||
> ```
|
||
|
||
| Name | Type | Description |
|
||
| ------- | ------ | -------------------------------------- |
|
||
| `span` | `Span` | The span to merge. |
|
||
| `attrs` | dict | Attributes to set on the merged token. |
|
||
|
||
### Retokenizer.split {#retokenizer.split tag="method"}
|
||
|
||
Mark a token for splitting, into the specified `orths`. The `heads` are required
|
||
to specify how the new subtokens should be integrated into the dependency tree.
|
||
The list of per-token heads can either be a token in the original document, e.g.
|
||
`doc[2]`, or a tuple consisting of the token in the original document and its
|
||
subtoken index. For example, `(doc[3], 1)` will attach the subtoken to the
|
||
second subtoken of `doc[3]`. This mechanism allows attaching subtokens to other
|
||
newly created subtokens, without having to keep track of the changing token
|
||
indices. If the specified head token will be split within the retokenizer block
|
||
and no subtoken index is specified, it will default to `0`.
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> doc = nlp(u"I live in NewYork")
|
||
> with doc.retokenize() as retokenizer:
|
||
> heads = [(doc[3], 1), doc[2]]
|
||
> attrs = {"POS": ["PROPN", "PROPN"],
|
||
> "DEP": ["pobj", "compound"]}
|
||
> retokenizer.split(doc[3], ["New", "York"], heads=heads, attrs=attrs)
|
||
> ```
|
||
|
||
| Name | Type | Description |
|
||
| ------- | ------- | ----------------------------------------------------------------------------------------------------------- |
|
||
| `token` | `Token` | The token to split. |
|
||
| `orths` | list | The verbatim text of the split tokens. Needs to match the text of the original token. |
|
||
| `heads` | list | List of `token` or `(token, subtoken)` tuples specifying the tokens to attach the newly split subtokens to. |
|
||
| `attrs` | dict | Attributes to set on all split tokens. Attribute names mapped to list of per-token attribute values. |
|
||
|
||
## Doc.merge {#merge tag="method"}
|
||
|
||
<Infobox title="Deprecation note" variant="danger">
|
||
|
||
As of v2.1.0, `Doc.merge` still works but is considered deprecated. You should
|
||
use the new and less error-prone [`Doc.retokenize`](/api/doc#retokenize)
|
||
instead.
|
||
|
||
</Infobox>
|
||
|
||
Retokenize the document, such that the span at `doc.text[start_idx : end_idx]`
|
||
is merged into a single token. If `start_idx` and `end_idx` do not mark start
|
||
and end token boundaries, the document remains unchanged.
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> doc = nlp(u"Los Angeles start.")
|
||
> doc.merge(0, len("Los Angeles"), "NNP", "Los Angeles", "GPE")
|
||
> assert [t.text for t in doc] == [u"Los Angeles", u"start", u"."]
|
||
> ```
|
||
|
||
| Name | Type | Description |
|
||
| -------------- | ------- | ------------------------------------------------------------------------------------------------------------------------- |
|
||
| `start_idx` | int | The character index of the start of the slice to merge. |
|
||
| `end_idx` | int | The character index after the end of the slice to merge. |
|
||
| `**attributes` | - | Attributes to assign to the merged token. By default, attributes are inherited from the syntactic root token of the span. |
|
||
| **RETURNS** | `Token` | The newly merged token, or `None` if the start and end indices did not fall at token boundaries |
|
||
|
||
## Doc.ents {#ents tag="property" model="NER"}
|
||
|
||
Iterate over the entities in the document. Yields named-entity `Span` objects,
|
||
if the entity recognizer has been applied to the document.
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> doc = nlp(u"Mr. Best flew to New York on Saturday morning.")
|
||
> ents = list(doc.ents)
|
||
> assert ents[0].label == 346
|
||
> assert ents[0].label_ == u"PERSON"
|
||
> assert ents[0].text == u"Mr. Best"
|
||
> ```
|
||
|
||
| Name | Type | Description |
|
||
| ---------- | ------ | ------------------------- |
|
||
| **YIELDS** | `Span` | Entities in the document. |
|
||
|
||
## Doc.noun_chunks {#noun_chunks tag="property" model="parser"}
|
||
|
||
Iterate over the base noun phrases in the document. Yields base noun-phrase
|
||
`Span` objects, if the document has been syntactically parsed. A base noun
|
||
phrase, or "NP chunk", is a noun phrase that does not permit other NPs to be
|
||
nested within it – so no NP-level coordination, no prepositional phrases, and no
|
||
relative clauses.
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> doc = nlp(u"A phrase with another phrase occurs.")
|
||
> chunks = list(doc.noun_chunks)
|
||
> assert chunks[0].text == u"A phrase"
|
||
> assert chunks[1].text == u"another phrase"
|
||
> ```
|
||
|
||
| Name | Type | Description |
|
||
| ---------- | ------ | ---------------------------- |
|
||
| **YIELDS** | `Span` | Noun chunks in the document. |
|
||
|
||
## Doc.sents {#sents tag="property" model="parser"}
|
||
|
||
Iterate over the sentences in the document. Sentence spans have no label. To
|
||
improve accuracy on informal texts, spaCy calculates sentence boundaries from
|
||
the syntactic dependency parse. If the parser is disabled, the `sents` iterator
|
||
will be unavailable.
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> doc = nlp(u"This is a sentence. Here's another...")
|
||
> sents = list(doc.sents)
|
||
> assert len(sents) == 2
|
||
> assert [s.root.text for s in sents] == [u"is", u"'s"]
|
||
> ```
|
||
|
||
| Name | Type | Description |
|
||
| ---------- | ---------------------------------- | ----------- |
|
||
| **YIELDS** | `Span | Sentences in the document. |
|
||
|
||
## Doc.has_vector {#has_vector tag="property" model="vectors"}
|
||
|
||
A boolean value indicating whether a word vector is associated with the object.
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> doc = nlp(u"I like apples")
|
||
> assert doc.has_vector
|
||
> ```
|
||
|
||
| Name | Type | Description |
|
||
| ----------- | ---- | ------------------------------------------------ |
|
||
| **RETURNS** | bool | Whether the document has a vector data attached. |
|
||
|
||
## Doc.vector {#vector tag="property" model="vectors"}
|
||
|
||
A real-valued meaning representation. Defaults to an average of the token
|
||
vectors.
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> doc = nlp(u"I like apples")
|
||
> assert doc.vector.dtype == 'float32'
|
||
> assert doc.vector.shape == (300,)
|
||
> ```
|
||
|
||
| Name | Type | Description |
|
||
| ----------- | ---------------------------------------- | ------------------------------------------------------- |
|
||
| **RETURNS** | `numpy.ndarray[ndim=1, dtype='float32']` | A 1D numpy array representing the document's semantics. |
|
||
|
||
## Doc.vector_norm {#vector_norm tag="property" model="vectors"}
|
||
|
||
The L2 norm of the document's vector representation.
|
||
|
||
> #### Example
|
||
>
|
||
> ```python
|
||
> doc1 = nlp(u"I like apples")
|
||
> doc2 = nlp(u"I like oranges")
|
||
> doc1.vector_norm # 4.54232424414368
|
||
> doc2.vector_norm # 3.304373298575751
|
||
> assert doc1.vector_norm != doc2.vector_norm
|
||
> ```
|
||
|
||
| Name | Type | Description |
|
||
| ----------- | ----- | ----------------------------------------- |
|
||
| **RETURNS** | float | The L2 norm of the vector representation. |
|
||
|
||
## Attributes {#attributes}
|
||
|
||
| Name | Type | Description |
|
||
| ----------------------------------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
|
||
| `text` | unicode | A unicode representation of the document text. |
|
||
| `text_with_ws` | unicode | An alias of `Doc.text`, provided for duck-type compatibility with `Span` and `Token`. |
|
||
| `mem` | `Pool` | The document's local memory heap, for all C data it owns. |
|
||
| `vocab` | `Vocab` | The store of lexical types. |
|
||
| `tensor` <Tag variant="new">2</Tag> | object | Container for dense vector representations. |
|
||
| `cats` <Tag variant="new">2</Tag> | dictionary | Maps either a label to a score for categories applied to whole document, or `(start_char, end_char, label)` to score for categories applied to spans. `start_char` and `end_char` should be character offsets, label can be either a string or an integer ID, and score should be a float. |
|
||
| `user_data` | - | A generic storage area, for user custom data. |
|
||
| `is_tagged` | bool | A flag indicating that the document has been part-of-speech tagged. |
|
||
| `is_parsed` | bool | A flag indicating that the document has been syntactically parsed. |
|
||
| `is_sentenced` | bool | A flag indicating that sentence boundaries have been applied to the document. |
|
||
| `sentiment` | float | The document's positivity/negativity score, if available. |
|
||
| `user_hooks` | dict | A dictionary that allows customization of the `Doc`'s properties. |
|
||
| `user_token_hooks` | dict | A dictionary that allows customization of properties of `Token` children. |
|
||
| `user_span_hooks` | dict | A dictionary that allows customization of properties of `Span` children. |
|
||
| `_` | `Underscore` | User space for adding custom [attribute extensions](/usage/processing-pipelines#custom-components-attributes). |
|