mirror of
https://github.com/explosion/spaCy.git
synced 2024-12-27 02:16:32 +03:00
* Add using/ docs.
parent
4b07c17d6f
commit
2f7110e852
70
docs/source/reference/using/document.rst
Normal file
@ -0,0 +1,70 @@
========
Document
========

.. autoclass:: spacy.tokens.Tokens

:code:`__getitem__`, :code:`__iter__`, :code:`__len__`
  The Tokens class behaves as a Python sequence, supporting the usual operators,
  len(), etc. Negative indexing is supported. Slices are not yet supported.

  .. code::

    >>> tokens = nlp(u'Zero one two three four five six')
    >>> tokens[0].orth_
    u'Zero'
    >>> tokens[-1].orth_
    u'six'
    >>> tokens[0:4]
    Error
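The sequence behaviour described above can be sketched in plain Python. This is only an illustration, not spaCy's implementation: the class name ``TokenSequence`` is hypothetical, and raising ``TypeError`` for slices is an assumption, since the example only says that slicing produces an error.

```python
class TokenSequence:
    """Hypothetical stand-in for a sequence that rejects slices."""

    def __init__(self, words):
        self._words = list(words)

    def __len__(self):
        return len(self._words)

    def __getitem__(self, i):
        # Slices are rejected (assumed error type: TypeError).
        if isinstance(i, slice):
            raise TypeError("slices are not supported")
        # Negative indices count from the end, as with a list.
        if i < 0:
            i += len(self._words)
        if not 0 <= i < len(self._words):
            raise IndexError("token index out of range")
        return self._words[i]

    def __iter__(self):
        for i in range(len(self)):
            yield self[i]


tokens = TokenSequence("Zero one two three four five six".split())
first, last = tokens[0], tokens[-1]
```

Here ``first`` is the first word and ``last`` the final one, while ``tokens[0:4]`` raises instead of returning a sub-sequence.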

:code:`sents`
  Iterate over sentences in the document.

:code:`ents`
  Iterate over entities in the document.

:code:`to_array`
  Given a list of M attribute IDs, export the tokens to a numpy ndarray
  of shape N*M, where N is the number of tokens in the document.

    Arguments:
        attr_ids (list[int]): A list of attribute ID ints.

    Returns:
        feat_array (numpy.ndarray[long, ndim=2]):
        A feature matrix, with one row per word, and one column per attribute
        indicated in the input attr_ids.

:code:`count_by`
  Produce a dict of {attribute (int): count (int)} frequencies, keyed
  by the values of the given attribute ID.

    >>> from spacy.en import English, attrs
    >>> nlp = English()
    >>> tokens = nlp(u'apple apple orange banana')
    >>> tokens.count_by(attrs.ORTH)
    {12800L: 1, 11880L: 2, 7561L: 1}
    >>> tokens.to_array([attrs.ORTH])
    array([[11880],
           [11880],
           [ 7561],
           [12800]])
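To make the relationship between the two exports concrete, here is a minimal pure-Python sketch. It is an assumption-laden illustration rather than the library code: ``ORTH_IDS``, ``to_array`` and ``count_by`` below are hypothetical helpers, plain lists stand in for the numpy array, and the integer IDs are simply copied from the example output above.

```python
from collections import Counter

# Toy string-to-ID table (illustrative only; IDs copied from the example).
ORTH_IDS = {"apple": 11880, "orange": 7561, "banana": 12800}


def to_array(words, attr_funcs):
    """Build an N x M list of lists: one row per word, one column per attribute."""
    return [[func(word) for func in attr_funcs] for word in words]


def count_by(rows, column):
    """Count how often each value occurs in one attribute column."""
    return dict(Counter(row[column] for row in rows))


words = "apple apple orange banana".split()
# Two attribute columns: the toy orth ID and the word length.
feat_array = to_array(words, [ORTH_IDS.__getitem__, len])
orth_counts = count_by(feat_array, 0)
```

Counting over column 0 of the feature matrix gives the same frequencies that a ``count_by`` on the orth attribute would report.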

:code:`merge`
  Merge a multi-word expression into a single token. Currently
  experimental; API is likely to change.



Internals
  A Tokens instance stores the annotations in a C-array of `TokenC` structs.
  Each TokenC struct holds a const pointer to a LexemeC struct, which describes
  a vocabulary item.

  The Token objects are built lazily, from this underlying C-data.

  For faster access, the underlying C data can be accessed from Cython. You
  can also export the data to a numpy array, via `Tokens.to_array`, if you need
  pure Python access with slightly better performance. However, this
  is both slower and has a worse API than Cython access.
11
docs/source/reference/using/index.rst
Normal file
@ -0,0 +1,11 @@
==================
Annotation Objects
==================


.. toctree::
    :maxdepth: 3

    document.rst
    token.rst
    span.rst
32
docs/source/reference/using/span.rst
Normal file
@ -0,0 +1,32 @@
====
Span
====

.. autoclass:: spacy.spans.Span

:code:`__getitem__`, :code:`__iter__`, :code:`__len__`
  Sequence API

:code:`head`
  Syntactic head, or None

:code:`left`
  Tokens to the left of the span

:code:`rights`
  Tokens to the right of the span

:code:`orth` / :code:`orth_`
  Orth string

:code:`lemma` / :code:`lemma_`
  Lemma string

:code:`string`
  String

:code:`label` / :code:`label_`
  Label

:code:`subtree`
  Lefts + [self] + Rights
124
docs/source/reference/using/token.rst
Normal file
@ -0,0 +1,124 @@
====================
spacy.tokens.Token
====================

A Token represents a single word, punctuation or significant whitespace symbol.

Integer IDs are provided for all string features. The (unicode) string is
provided by an attribute of the same name followed by an underscore, e.g.
token.orth is an integer ID, token.orth\_ is the unicode value.

The only exception is the Token.string attribute, which is (unicode)
string-typed.
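The ID/string pairing can be pictured with a small interning table. The sketch below is an assumption for illustration only: ``ToyStringStore`` and ``ToyToken`` are hypothetical names, and the IDs they hand out are arbitrary, not the values spaCy would assign.

```python
class ToyStringStore:
    """Hypothetical two-way map between strings and integer IDs."""

    def __init__(self):
        self._ids = {}
        self._strings = []

    def intern(self, string):
        # Assign the next free integer the first time a string is seen.
        if string not in self._ids:
            self._ids[string] = len(self._strings)
            self._strings.append(string)
        return self._ids[string]

    def lookup(self, id_):
        return self._strings[id_]


class ToyToken:
    """Stores integer IDs; the underscore properties decode them on demand."""

    def __init__(self, store, text):
        self._store = store
        self.orth = store.intern(text)
        self.lower = store.intern(text.lower())

    @property
    def orth_(self):
        return self._store.lookup(self.orth)

    @property
    def lower_(self):
        return self._store.lookup(self.lower)


store = ToyStringStore()
a = ToyToken(store, "Apple")
b = ToyToken(store, "apple")
```

Because both tokens share one table, ``a.lower`` and ``b.orth`` are the same integer, so comparisons can be done on IDs without touching the strings.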

**String Features**

:code:`string`
  The form of the word as it appears in the string, including trailing
  whitespace. This is useful when you need to use linguistic features to
  add inline mark-up to the string.

:code:`orth` / :code:`orth_`
  The form of the word with no string normalization or processing, as it
  appears in the string, without trailing whitespace.

:code:`lemma` / :code:`lemma_`
  The "base" of the word, with no inflectional suffixes, e.g. the lemma of
  "developing" is "develop", the lemma of "geese" is "goose", etc. Note that
  *derivational* suffixes are not stripped, e.g. the lemma of "institutions"
  is "institution", not "institute". Lemmatization is performed using the
  WordNet data, but extended to also cover closed-class words such as
  pronouns. By default, the WN lemmatizer returns "hi" as the lemma of "his".
  We assign pronouns the lemma -PRON-.

:code:`lower` / :code:`lower_`
  The form of the word, but forced to lower-case, i.e. lower = word.orth\_.lower()

:code:`norm` / :code:`norm_`
  The form of the word, after language-specific normalizations have been
  applied.

:code:`shape` / :code:`shape_`
  A transform of the word's string, to show orthographic features. The
  characters a-z are mapped to x, A-Z is mapped to X, 0-9 is mapped to d.
  After these mappings, sequences of 4 or more of the same character are
  truncated to length 4. Examples: C3Po --> XdXx, favorite --> xxxx,
  :) --> :)
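The shape transform described above is easy to reproduce in plain Python. The function below is a sketch written from that description (the name ``word_shape`` is hypothetical and this is not the library's own code); it only handles the ASCII ranges the text mentions.

```python
import re


def word_shape(text):
    """Map a-z to x, A-Z to X, 0-9 to d, then cap runs at 4 characters."""
    mapped = []
    for ch in text:
        if "a" <= ch <= "z":
            mapped.append("x")
        elif "A" <= ch <= "Z":
            mapped.append("X")
        elif "0" <= ch <= "9":
            mapped.append("d")
        else:
            # Punctuation and other characters are left unchanged.
            mapped.append(ch)
    shape = "".join(mapped)
    # Collapse any run of 4 or more identical characters to exactly 4.
    return re.sub(r"(.)\1{3,}", lambda m: m.group(1) * 4, shape)


shapes = [word_shape(w) for w in ("C3Po", "favorite", ":)")]
```

The three inputs are the examples from the text, so ``shapes`` should match the shapes listed there.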

:code:`prefix` / :code:`prefix_`
  A length-N substring from the start of the word. Length may vary by
  language; currently for English N=1, i.e. prefix = word.orth\_[:1]

:code:`suffix` / :code:`suffix_`
  A length-N substring from the end of the word. Length may vary by
  language; currently for English N=3, i.e. suffix = word.orth\_[-3:]

**Distributional Features**

:code:`prob`
  The unigram log-probability of the word, estimated from counts from a
  large corpus, smoothed using Simple Good Turing estimation.

:code:`cluster`
  The Brown cluster ID of the word. These are often useful features for
  linear models. If you're using a non-linear model, particularly
  a neural net or random forest, consider using the real-valued word
  representation vector, in Token.repvec, instead.

:code:`repvec`
  A "word embedding" representation: a dense real-valued vector that supports
  similarity queries between words. By default, spaCy currently loads
  vectors produced by the Levy and Goldberg (2014) dependency-based word2vec
  model.

**Syntactic Features**

:code:`tag`
  A morphosyntactic tag, e.g. NN, VBZ, DT, etc. These tags are
  language/corpus specific, and typically describe part-of-speech and some
  amount of morphological information. For instance, in the Penn Treebank
  tag set, VBZ is assigned to a present-tense singular verb.

:code:`pos`
  A part-of-speech tag, from the Google Universal Tag Set, e.g. NOUN, VERB,
  ADV. Constants for the 17 tag values are provided in spacy.parts\_of\_speech.

:code:`dep`
  The type of syntactic dependency relation between the word and its
  syntactic head.

:code:`n_lefts`
  The number of immediate syntactic children preceding the word in the
  string.

:code:`n_rights`
  The number of immediate syntactic children following the word in the
  string.

**Navigating the Dependency Tree**

:code:`head`
  The Token that is the immediate syntactic head of the word. If the word is
  the root of the dependency tree, the same word is returned.

:code:`lefts`
  An iterator for the immediate leftward syntactic children of the word.

:code:`rights`
  An iterator for the immediate rightward syntactic children of the word.

:code:`children`
  An iterator that yields from lefts, and then yields from rights.

:code:`subtree`
  An iterator for the part of the sentence syntactically governed by the
  word, including the word itself.
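The navigation attributes above can be illustrated on a toy parse stored as a list of head indices. This is a sketch under stated assumptions, not the library's data structure: ``WORDS``, ``HEADS`` and the helper functions are hypothetical, and the parse of "I like green eggs" is hand-written for the example.

```python
# HEADS[i] is the index of word i's head. The root ("like") is its own head,
# matching the description of `head` above.
WORDS = ["I", "like", "green", "eggs"]
HEADS = [1, 1, 3, 1]


def lefts(i):
    """Immediate children that precede word i."""
    return [j for j in range(i) if HEADS[j] == i]


def rights(i):
    """Immediate children that follow word i."""
    return [j for j in range(i + 1, len(HEADS)) if HEADS[j] == i]


def children(i):
    """Left children first, then right children."""
    return lefts(i) + rights(i)


def subtree(i):
    """Yield word i and everything it governs, left to right."""
    for j in lefts(i):
        yield from subtree(j)
    yield i
    for j in rights(i):
        yield from subtree(j)


root_children = [WORDS[j] for j in children(1)]
eggs_subtree = [WORDS[j] for j in subtree(3)]
```

In this toy parse the counts ``len(lefts(1))`` and ``len(rights(1))`` play the role of ``n_lefts`` and ``n_rights`` for the root.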


**Named Entities**

:code:`ent_type`
  If the token is part of an entity, its entity type.

:code:`ent_iob`
  The IOB (inside, outside, begin) entity recognition tag for the token.
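As a rough illustration of how per-token IOB tags relate to entity spans, the sketch below groups tags into ``(start, end, type)`` triples. It rests on several assumptions: ``iob_to_spans`` is a hypothetical helper, tags and types are given as plain strings rather than the values the library stores, and an ``I`` tag with no preceding ``B`` is simply ignored.

```python
def iob_to_spans(iob_tags, ent_types):
    """Group IOB tags into (start, end, type) spans; end is exclusive."""
    spans = []
    start = None
    for i, tag in enumerate(iob_tags):
        # A new "B" or an "O" closes any entity that is currently open.
        if tag in ("B", "O") and start is not None:
            spans.append((start, i, ent_types[start]))
            start = None
        if tag == "B":
            start = i
    # Close an entity that runs to the end of the sequence.
    if start is not None:
        spans.append((start, len(iob_tags), ent_types[start]))
    return spans


words = ["Barack", "Obama", "visited", "New", "York", "City", "today"]
tags = ["B", "I", "O", "B", "I", "I", "O"]
types = ["PERSON", "PERSON", "", "GPE", "GPE", "GPE", ""]
spans = iob_to_spans(tags, types)
```

Each span can then be used to pick out the entity's words, for example ``words[start:end]``.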