spaCy/website/docs/_api-token.jade

//-  Docs > API > Token
//- ============================================================================

+section('token')
    +h2('token', 'https://github.com/' + profiles.github + '/spaCy/blob/master/spacy/tokens/token.pyx#L31')
        | #[+label('tag') class] Token

    p.
        A Token represents a single word, punctuation or significant whitespace
        symbol. Integer IDs are provided for all string features. The (unicode)
        string is provided by an attribute of the same name followed by an underscore,
        e.g. #[code token.orth] is an integer ID, #[code token.orth_] is the unicode
        value. The only exception is the #[code token.text] attribute, which is (unicode)
        string-typed.

    +section('token-init')
        +h3('token-init')
            | Token.__init__

        +code('python', 'definition').
            def __init__(vocab, doc, offset):
                return Token()

        +table(['Name', 'Type', 'Description'], 'params')
                +row
                    +cell vocab
                    +cell
                    +cell A Vocab object

                +row
                    +cell doc
                    +cell
                    +cell The parent sequence

                +row
                    +cell offset
                    +cell #[a(href=link_int target='_blank') int]
                    +cell The index of the token within the document

    +section('token-stringfeatures')
        +h3('token-stringfeatures')
            | String Features

        +table(['Name', 'Description'], 'params')
            +row
                +cell lemma / lemma_
                +cell.
                    The "base" of the word, with no inflectional suffixes, e.g.
                    the lemma of "developing" is "develop", the lemma of "geese"
                    is "goose", etc.  Note that #[em derivational] suffixes are
                    not stripped, e.g. the lemma of "instutitions" is "institution",
                    not "institute".  Lemmatization is performed using the WordNet
                    data, but extended to also cover closed-class words such as
                    pronouns.  By default, the WN lemmatizer returns "hi" as the
                    lemma of "his". We assign pronouns the lemma #[code -PRON-].

            +row
                +cell orth / orth_
                +cell.
                    The form of the word with no string normalization or processing,
                    as it appears in the string, without trailing whitespace.

            +row
                +cell lower / lower_
                +cell.
                    The form of the word, but forced to lower-case, i.e.
                    #[code lower = word.orth_.lower()]

            +row
                +cell shape / shape_
                +cell.
                    A transform of the word's string, to show orthographic features.
                    The characters a-z are mapped to x, A-Z is mapped to X, 0-9
                    is mapped to d. After these mappings, sequences of 4 or more
                    of the same character are truncated to length 4. Examples:
                    C3Po --&gt; XdXx, favorite --&gt; xxxx, :) --&gt; :)

            +row
                +cell prefix / prefix_
                +cell.
                    A length-N substring from the start of the word. Length may
                    vary by language; currently for English n=1, i.e.
                    #[code prefix = word.orth_[:1]]

            +row
                +cell suffix / suffix_
                +cell.
                    A length-N substring from the end of the word. Length may
                    vary by language; currently for English n=3, i.e.
                    #[code suffix = word.orth_[-3:]]

    +section('token-booleanflags')
        +h3('token-booleanflags')
            | Boolean Flags

        +table(['Name', 'Description'], 'params')
            +row
                +cell is_alpha
                +cell.
                    Equivalent to #[code word.orth_.isalpha()]

            +row
                +cell is_ascii
                +cell.
                    Equivalent to any(ord(c) >= 128 for c in word.orth_)]

            +row
                +cell is_digit
                +cell.
                    Equivalent to #[code word.orth_.isdigit()]

            +row
                +cell is_lower
                +cell.
                    Equivalent to #[code word.orth_.islower()]

            +row
                +cell is_title
                +cell.
                    Equivalent to #[code word.orth_.istitle()]

            +row
                +cell is_punct
                +cell.
                    Equivalent to #[code word.orth_.ispunct()]

            +row
                +cell is_space
                +cell.
                    Equivalent to #[code word.orth_.isspace()]

            +row
                +cell like_url
                +cell.
                    Does the word resemble a URL?

            +row
                +cell like_num
                +cell.
                    Does the word represent a number? e.g. “10.9”, “10”, “ten”, etc.

            +row
                +cell like_email
                +cell.
                    Does the word resemble an email?

            +row
                +cell is_oov
                +cell.
                    Is the word out-of-vocabulary?

            +row
                +cell is_stop
                +cell.
                    Is the word part of a "stop list"? Stop lists are used to
                    improve the quality of topic models, by filtering out common,
                    domain-general words.

    +section('token-distributional')
        +h3('token-distributional')
            | Distributional Features

        +table(['Name', 'Description'], 'params')
            +row
                +cell prob
                +cell.
                    The unigram log-probability of the word, estimated from
                    counts from a large corpus, smoothed using Simple Good Turing
                    estimation.

            +row
                +cell cluster
                +cell.
                    The Brown cluster ID of the word. These are often useful features
                    for linear models. If you’re using a non-linear model, particularly
                        a neural net or random forest, consider using the real-valued
                        word representation vector, in #[code Token.repvec], instead.

            +row
                +cell vector
                +cell.
                    A "word embedding" representation: a dense real-valued vector
                    that supports similarity queries between words. By default,
                    spaCy currently loads vectors produced by the Levy and
                    Goldberg (2014) dependency-based word2vec model.

            +row
                +cell has_vector
                +cell.
                    A boolean value indicating whether a vector.

    +section('token-alignment')
        +h3('token-alignment')
            | Alignment and Output

        +table(['Name', 'Description'], 'params')
            +row
                +cell idx
                +cell.
                    Start index of the token in the string

            +row
                +cell len(token)
                +cell.
                    Length of the token's orth string, in unicode code-points.

            +row
                +cell unicode(token)
                +cell.
                    Same as #[code token.orth_].

            +row
                +cell str(token)
                +cell.
                    In Python 3, returns #[code token.orth_]. In Python 2, returns
                    #[code token.orth_.encode('utf8')].

            +row
                +cell text
                +cell.
                    An alias for #[code token.orth_].

            +row
                +cell text_with_ws
                +cell.
                    #[code token.orth_ + token.whitespace_], i.e. the form of the
                    word as it appears in the string, trailing whitespace. This is
                    useful when you need to use linguistic features to add inline
                    mark-up to the string.

            +row
                +cell whitespace_
                +cell.
                    The number of immediate syntactic children following the word
                    in the string.

    +section('token-postags')
        +h3('token-postags')
            | Part-of-Speech Tags

        +table(['Name', 'Description'], 'params')
            +row
                +cell pos / pos_
                +cell.
                    A coarse-grained, less detailed tag that represents the
                    word-class of the token. The set of #[code .pos] tags are
                    consistent across languages. The available tags are #[code ADJ],
                    #[code ADP], #[code ADV], #[code AUX], #[code CONJ], #[code DET],
                    #[code INTJ], #[code NOUN], #[code NUM], #[code PART],
                    #[code PRON], #[code PROPN], #[code PUNCT], #[code SCONJ],
                    #[code SYM], #[code VERB], #[code X], #[code EOL], #[code SPACE].

            +row
                +cell tag / tag_
                +cell.
                    A fine-grained, more detailed tag that represents the
                    word-class and some basic morphological information for the
                    token. These tags are primarily designed to be good features
                    for subsequent models, particularly the syntactic parser.
                        They are language and treebank dependent. The tagger is
                        trained to predict these fine-grained tags, and then a
                        mapping table is used to reduce them to the coarse-grained
                        #[code .pos] tags.

    +section('token-navigating')
        +h3('token-navigating')
            | Navigating the Parse Tree

        +table(['Name', 'Description'], 'params')
            +row
                +cell dep / dep_
                +cell.
                    The syntactic relation type, aka the dependency label, connecting the word to its head.
            +row
                +cell head
                +cell.
                    The immediate syntactic head of the token. If the token is the
                    root of its sentence, it is the token itself, i.e.
                    #[code root_token.head is root_token].

            +row
                +cell children
                +cell.
                    An iterator that yields from lefts, and then yields from rights.

            +row
                +cell subtree
                +cell.
                    An iterator for the part of the sentence syntactically governed
                    by the word, including the word itself.

            +row
                +cell left_edge
                +cell.
                    The leftmost edge of the token's subtree.

            +row
                +cell right_edge
                +cell.
                    The rightmost edge of the token's subtree.

            +row
                +cell nbor(i=1)
                +cell.
                    Get the #[code i]#[sup th] next / previous neighboring token.

    +section('token-namedentities')
        +h3('token-namedentities')
            | Named Entity Recognition

        +table(['Name', 'Description'], 'params')
            +row
                +cell ent_type
                +cell.
                    If the token is part of an entity, its entity type.

            +row
                +cell ent_iob
                +cell.
                    The IOB (inside, outside, begin) entity recognition tag for
                    the token.