spaCy.io

Legacy Docs (v0.100.6)

This page shows documentation for spaCy in the legacy style. We've kept this page accessible to ease your transition to our current documentation, since we know change can be jarring, especially when you're working against a deadline. This page will not be updated when the library changes, so if you're using a version of the library newer than v0.100.6, the information on this page may not be accurate.

API

class English

Load models into a callable object to process English text. Intended use is for one instance to be created per process. You can create more if you're doing something unusual. You may wish to make the instance a global variable or "singleton". We usually instantiate the object in the main() function and pass it around as an explicit argument.

__init__(self, data_dir=None, vocab=None, tokenizer=None, tagger=None, parser=None, entity=None, matcher=None, serializer=None)

Load the linguistic analysis pipeline. Loading may take up to a minute, and the instance consumes 2 to 3 gigabytes of memory. The pipeline class is responsible for loading and saving the components, and applying them in sequence. Each component can be passed as an argument to the __init__ function, or left as None, in which case it will be loaded from a classmethod, named e.g. default_vocab.

Common usage is to accept all defaults, in which case loading is simply:

nlp = spacy.en.English()

To keep the default components, but load data from a specified directory, use:

nlp = English(data_dir=u'path/to/data_directory')

To disable (and avoid loading) parts of the processing pipeline:

nlp = English(parser=False, tagger=False, entity=False)
  • data_dir – The data directory. If None, value is obtained via the default_data_dir() method.
  • vocab – The vocab object, which should be an instance of class spacy.vocab.Vocab. If None, the object is obtained from the default_vocab() class method. The vocab object manages all of the language-specific rules and definitions, maintains the cache of lexical types, and manages the word vectors. Because the vocab owns this important data, most objects hold a reference to the vocab.
  • tokenizer – The tokenizer, which should be a callable that accepts a unicode string, and returns a Doc object. If set to None, the default tokenizer is constructed from the default_tokenizer() method.
  • tagger – The part-of-speech tagger, which should be a callable that accepts a Doc object, and sets the part-of-speech tags in-place. If set to None, the default tagger is constructed from the default_tagger() method.
  • parser – The dependency parser, which should be a callable that accepts a Doc object, and sets the syntactic heads and dependency labels in-place. If set to None, the default parser is constructed from the default_parser() method.
  • entity – The named entity recognizer, which should be a callable that accepts a Doc object, and sets the named entity annotations in-place. If set to None, the default entity recognizer is constructed from the default_entity() method.
  • matcher – The pattern matcher, which should be a callable that accepts a Doc object, and sets annotations in-place. If set to None, the default matcher is constructed from the default_matcher() method.
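
Because the components are independent callables, you can also apply them yourself after the fact. A minimal sketch, assuming the loaded components are exposed as attributes matching the constructor arguments above (nlp.tagger, nlp.parser, nlp.entity):

# from spacy.en import English
# nlp = English()
doc = nlp(u'Some text.', parse=False, entity=False)  # tag only for now
nlp.parser(doc)  # apply the dependency parser afterwards
nlp.entity(doc)  # then the named entity recognizer
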
__call__(self, text, tag=True, parse=True, entity=True)

The main entry point to spaCy. Takes raw unicode text, and returns a Doc object, which can be iterated to access Token and Span objects. spaCy's models are all linear-time, so you can supply documents of arbitrary length, e.g. whole novels.

  • text (unicode) – The text to be processed. spaCy expects raw unicode text – you don't necessarily need to, say, split it into paragraphs. However, depending on your documents, you might be better off applying custom pre-processing. Non-text formatting, e.g. from HTML mark-up, should be removed before sending the document to spaCy. If your documents have a consistent format, you may be able to improve accuracy by pre-processing. For instance, if the first word of your documents is always in upper-case, it may be helpful to normalize them before supplying them to spaCy.
  • tag (bool) – Whether to apply the part-of-speech tagger. Required for parsing and entity recognition.
  • parse (bool) – Whether to apply the syntactic dependency parser.
  • entity (bool) – Whether to apply the named entity recognizer.
# from spacy.en import English
# nlp = English()
doc = nlp('Some text.') # Applies tagger, parser, entity
doc = nlp('Some text.', parse=False) # Applies tagger and entity, not parser
doc = nlp('Some text.', entity=False) # Applies tagger and parser, not entity
doc = nlp('Some text.', tag=False) # Does not apply tagger, entity or parser
doc = nlp('') # Zero-length tokens, not an error
# doc = nlp(b'Some text') <-- Error: need unicode
doc = nlp(b'Some text'.decode('utf8')) # Decode to unicode first.
pipe(self, texts_iterator, batch_size=1000, n_threads=2)

Parse a sequence of texts into a sequence of Doc objects. Accepts a generator as input, and produces a generator as output. spaCy releases the global interpreter lock around the parser and named entity recognizer, allowing shared-memory parallelism via OpenMP. However, OpenMP is not supported on OSX — so multiple threads will only be used on Linux and Windows.

Internally, .pipe accumulates a buffer of batch_size texts, works on them with n_threads workers in parallel, and then yields the Doc objects one by one. Increasing batch_size results in higher latency (a longer wait before the first document is yielded) and higher memory use (for the texts in the buffer), but can allow better parallelism.

  • n_threads (int) – The number of worker threads to use. If -1, OpenMP will decide how many to use at run time. Default is 2.
  • texts – A sequence of unicode objects. Usually you will want this to be a generator, so that you don't need to have all of your texts in memory.
  • batch_size (int) – The number of texts to buffer. Let's say you have a batch_size of 1,000. The input, texts, is a generator that yields the texts one-by-one. We want to operate on them in parallel. So, we accumulate a work queue. Instead of taking one document from texts and operating on it, we buffer batch_size documents, work on them in parallel, and then yield them one-by-one. Higher batch_size therefore often results in better parallelism, up to a point.
    texts = [u'One document.', u'...', u'Lots of documents']
    # .pipe streams input, and produces streaming output
    iter_texts = (texts[i % 3] for i in xrange(100000000))
    for i, doc in enumerate(nlp.pipe(iter_texts, batch_size=50, n_threads=4)):
        assert doc.is_parsed
        if i == 100:
            break
    
    class Doc

    A sequence of Token objects. Access sentences and named entities, export annotations to numpy arrays, losslessly serialize to compressed binary strings.

    Internally, the Doc object holds an array of TokenC structs. The Python-level Token and Span objects are views of this array, i.e. they don't own the data themselves. These details of the internals shouldn't matter for the API – but they may help you read the code, and understand how spaCy is designed.

    Constructors

    via English.__call__(unicode text)
    __init__(self, vocab, orth_and_spaces=None) This method of constructing a Doc object is usually only used for deserialization. Standard usage is to construct the document via a call to the language object.
    • vocab – A Vocabulary object, which must match any models you want to use (e.g. tokenizer, parser, entity recognizer).
    • orth_and_spaces – A list of (orth_id, has_space) tuples, where orth_id is an integer, and has_space is a boolean, indicating whether the token has a trailing space.
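
    For example, a minimal sketch of direct construction, assuming the orth IDs are obtained from the vocab's string store (normally you would let English.__call__ or Doc.from_bytes build the document for you):

    # from spacy.en import English
    # nlp = English()
    from spacy.tokens.doc import Doc
    words = [(nlp.vocab.strings[u'Hello'], True), (nlp.vocab.strings[u'world'], False)]
    doc = Doc(nlp.vocab, orth_and_spaces=words)
    assert [t.orth_ for t in doc] == [u'Hello', u'world']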

    Sequence API

    • doc[i] Get the Token object at position i, where i is an integer. Negative indexing is supported, and follows the usual Python semantics, i.e. doc[-2] is doc[len(doc) - 2].
    • doc[start : end] Get a Span object, starting at position start and ending at position end. For instance, doc[2:5] produces a span consisting of tokens 2, 3 and 4. Stepped slices (e.g. doc[start : end : step]) are not supported, as Span objects must be contiguous (cannot have gaps).
    • for token in doc Iterate over Token objects, from which the annotations can be easily accessed. This is the main way of accessing Token objects, which are the main way annotations are accessed from Python. If faster-than-Python speeds are required, you can instead access the annotations as a numpy array, or access the underlying C data directly from Cython, via Doc.data, an array of TokenC structs. The C API has not yet been finalized, and is subject to change.
    • len(doc) The number of tokens in the document.
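
    For example (the token texts here assume the default English tokenization):

    # from spacy.en import English
    # nlp = English()
    doc = nlp(u'I like New York in Autumn.')
    assert doc[0].orth_ == u'I'
    assert doc[-1].orth_ == u'.'
    span = doc[2:4]           # a Span covering 'New York'
    assert len(span) == 2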

    Sentence, entity and noun chunk spans

    sents

    Yields sentence Span objects. Iterate over the span to get individual Token objects. Sentence spans have no label.

    # from spacy.en import English
    # nlp = English()
    doc = nlp("This is a sentence. Here's another...")
    assert [s.root.orth_ for s in doc.sents] == ["is", "'s"]
    

    ents

    Yields named-entity Span objects. Iterate over the span to get individual Token objects, or access the label:

    # from spacy.en import English
    # nlp = English()
    tokens = nlp('Mr. Best flew to New York on Saturday morning.')
    ents = list(tokens.ents)
    assert ents[0].label == 346
    assert ents[0].label_ == 'PERSON'
    assert ents[0].orth_ == 'Best'
    assert ents[0].string == ents[0].string
    

    noun_chunks

    Yields base noun-phrase Span objects. A base noun phrase, or "NP chunk", is a noun phrase that does not permit other NPs to be nested within it – so no NP-level coordination, no prepositional phrases, and no relative clauses. For example:

    # from spacy.en import English
    # nlp = English()
    doc = nlp('The sentence in this example has three noun chunks.')
    for chunk in doc.noun_chunks:
        print(chunk.label, chunk.orth_, '<--', chunk.root.head.orth_)
    

    Export/Import

    to_array(attr_ids) Given a list of M attribute IDs, export the tokens to a numpy ndarray of shape N*M, where N is the length of the document.
    • attr_ids (list[int]) – A list of attribute ID ints. Attribute IDs can be imported from spacy.attrs.
    count_by(attr_id) Produce a dict of {attribute (int): count (int)} frequencies, keyed by the values of the given attribute ID.
    # from spacy.en import English
    # nlp = English()
    import numpy
    from spacy import attrs
    tokens = nlp('apple apple orange banana')
    assert tokens.count_by(attrs.ORTH) == {3699: 2, 3750: 1, 5965: 1}
    assert repr(tokens.to_array([attrs.ORTH])) == repr(numpy.array([[3699],
                                                        [3699],
                                                        [3750],
                                                        [5965]], dtype=numpy.int32))
    
    from_array(attrs, array) Write attribute values to the Doc object, from an N*M array of attributes (N tokens, M attributes).
    from_bytes(byte_string) Deserialize, loading from bytes.
    to_bytes() Serialize, producing a byte string.
    read_bytes(file_) A staticmethod, used to read serialized Doc objects from a file. For example:
    from spacy.tokens.doc import Doc
    loc = 'test_serialize.bin'
    with open(loc, 'wb') as file_:
        file_.write(nlp(u'This is a document.').to_bytes())
        file_.write(nlp(u'This is another.').to_bytes())
    docs = []
    with open(loc, 'rb') as file_:
        for byte_string in Doc.read_bytes(file_):
            docs.append(Doc(nlp.vocab).from_bytes(byte_string))
    assert len(docs) == 2
    
    class Token

    A Token represents a single word, punctuation or significant whitespace symbol. Integer IDs are provided for all string features. The (unicode) string is provided by an attribute of the same name followed by an underscore, e.g. token.orth is an integer ID, token.orth_ is the unicode value. The only exception is the Token.string attribute, which is (unicode) string-typed.
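
    A quick illustration of the integer/string pairing (the ID round-trips through the vocabulary's string store, described under StringStore below):

    # from spacy.en import English
    # nlp = English()
    token = nlp(u'Give it back.')[0]
    assert isinstance(token.orth, int)
    assert token.orth_ == u'Give'
    assert nlp.vocab.strings[token.orth] == token.orth_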

    String Features

    • lemma / lemma_ The "base" of the word, with no inflectional suffixes, e.g. the lemma of "developing" is "develop", the lemma of "geese" is "goose", etc. Note that derivational suffixes are not stripped, e.g. the lemma of "institutions" is "institution", not "institute". Lemmatization is performed using the WordNet data, but extended to also cover closed-class words such as pronouns. By default, the WN lemmatizer returns "hi" as the lemma of "his". We assign pronouns the lemma -PRON-.
    • orth / orth_ The form of the word with no string normalization or processing, as it appears in the string, without trailing whitespace.
    • lower / lower_ The form of the word, but forced to lower-case, i.e. lower = word.orth_.lower()
    • shape / shape_ A transform of the word's string, to show orthographic features. The characters a-z are mapped to x, A-Z is mapped to X, 0-9 is mapped to d. After these mappings, sequences of 4 or more of the same character are truncated to length 4. Examples: C3Po --> XdXx, favorite --> xxxx, :) --> :)
    • prefix / prefix_ A length-N substring from the start of the word. Length may vary by language; currently for English n=1, i.e. prefix = word.orth_[:1]
    • suffix / suffix_ A length-N substring from the end of the word. Length may vary by language; currently for English n=3, i.e. suffix = word.orth_[-3:]
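
    For example (the exact lemma depends on the predicted part-of-speech tag):

    # from spacy.en import English
    # nlp = English()
    token = nlp(u'Apples are delicious.')[0]
    print(token.orth_, token.lower_, token.lemma_)     # Apples apples apple
    print(token.shape_, token.prefix_, token.suffix_)  # Xxxxx A les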

    Boolean Flags

    • is_alpha Equivalent to word.orth_.isalpha()
    • is_ascii Equivalent to all(ord(c) < 128 for c in word.orth_)
    • is_digit Equivalent to word.orth_.isdigit()
    • is_lower Equivalent to word.orth_.islower()
    • is_title Equivalent to word.orth_.istitle()
    • is_punct True if the word consists of punctuation characters.
    • is_space Equivalent to word.orth_.isspace()
    • like_url Does the word resemble a URL?
    • like_num Does the word represent a number? e.g. “10.9”, “10”, “ten”, etc
    • like_email Does the word resemble an email?
    • is_oov Is the word out-of-vocabulary?
    • is_stop Is the word part of a "stop list"? Stop lists are used to improve the quality of topic models, by filtering out common, domain-general words.
    check_flag(flag_id) Get the value of one of the boolean flags.
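
    For example, assuming the flag constants in spacy.attrs (here IS_PUNCT) line up with these boolean attributes:

    # from spacy.en import English
    # nlp = English()
    from spacy.attrs import IS_PUNCT
    doc = nlp(u'Give it back!')
    assert doc[-1].is_punct
    assert doc[-1].check_flag(IS_PUNCT) == doc[-1].is_punct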

    Distributional Features

    • prob The unigram log-probability of the word, estimated from counts from a large corpus, smoothed using Simple Good Turing estimation.
    • cluster The Brown cluster ID of the word. These are often useful features for linear models. If you’re using a non-linear model, particularly a neural net or random forest, consider using the real-valued word representation vector, in Token.repvec, instead.
    • vector A “word embedding” representation: a dense real-valued vector that supports similarity queries between words. By default, spaCy currently loads vectors produced by the Levy and Goldberg (2014) dependency-based word2vec model.
    • has_vector A boolean value indicating whether the word has a vector representation.
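
    For example, a minimal similarity check computed directly from the raw vectors (only meaningful when the model provides vectors for both words):

    # from spacy.en import English
    # nlp = English()
    import numpy
    apple, banana = nlp(u'apple banana')
    print(apple.prob, apple.cluster)
    if apple.has_vector and banana.has_vector:
        cosine = numpy.dot(apple.vector, banana.vector) / (
            numpy.linalg.norm(apple.vector) * numpy.linalg.norm(banana.vector))
        print(cosine)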

    Alignment and Output

    • idx Start index of the token in the string.
    • len(token) Length of the token's orth string, in unicode code-points.
    • unicode(token) Same as token.orth_
    • str(token) In Python 3, returns token.orth_. In Python 2, returns token.orth_.encode('utf8')
    • text An alias for token.orth_.
    • text_with_ws token.orth_ + token.whitespace_, i.e. the form of the word as it appears in the string, including trailing whitespace. This is useful when you need to use linguistic features to add inline mark-up to the string.
    • whitespace_ The trailing whitespace attached to the token, if any: a single space character or the empty string.
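
    Because each token records its start index and its trailing whitespace, the original string can be reconstructed exactly:

    # from spacy.en import English
    # nlp = English()
    text = u'I like New York in Autumn.'
    doc = nlp(text)
    assert u''.join(token.text_with_ws for token in doc) == text
    assert text[doc[2].idx : doc[2].idx + len(doc[2])] == u'New'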

    Part-of-Speech Tags

    • pos / pos_ A coarse-grained, less detailed tag that represents the word-class of the token. The set of .pos tags is consistent across languages. The available tags are ADJ, ADP, ADV, AUX, CONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X, EOL, SPACE.
    • tag / tag_ A fine-grained, more detailed tag that represents the word-class and some basic morphological information for the token. These tags are primarily designed to be good features for subsequent models, particularly the syntactic parser. They are language and treebank dependent. The tagger is trained to predict these fine-grained tags, and then a mapping table is used to reduce them to the coarse-grained .pos tags.
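
    For example (the printed tags are what the default English model typically predicts here):

    # from spacy.en import English
    # nlp = English()
    doc = nlp(u'They flew to New York.')
    for token in doc:
        print(token.orth_, token.pos_, token.tag_)
    # e.g. flew VERB VBD, York PROPN NNP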

    Navigating the Parse Tree

    • head The immediate syntactic head of the token. If the token is the root of its sentence, it is the token itself, i.e. root_token.head is root_token
    • children An iterator that yields from lefts, and then yields from rights.
    • subtree An iterator for the part of the sentence syntactically governed by the word, including the word itself.
    • left_edge The leftmost edge of the token's subtree.
    • right_edge The rightmost edge of the token's subtree.
    nbor(i=1) Get the ith next / previous neighboring token.
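
    For example, using the same sentence as the Span.root example below:

    # from spacy.en import English
    # nlp = English()
    doc = nlp(u'I like New York in Autumn.')
    york = doc[3]
    assert york.head.orth_ == u'like'
    assert [w.orth_ for w in york.subtree] == [u'New', u'York']
    assert york.left_edge.orth_ == u'New'
    assert york.right_edge.orth_ == u'York'
    assert york.nbor().orth_ == u'in'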

    Named Entities

    • ent_type If the token is part of an entity, its entity type.
    • ent_iob The IOB (inside, outside, begin) entity recognition tag for the token.
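
    For example, using the sentence from the Doc.ents example above (ent_type and ent_iob are integer IDs):

    # from spacy.en import English
    # nlp = English()
    doc = nlp(u'Mr. Best flew to New York on Saturday morning.')
    for token in doc:
        print(token.orth_, token.ent_type, token.ent_iob)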

    Constructors

    __init__(self, vocab, doc, offset)
    • vocab – A Vocab object
    • doc – The parent sequence
    • offset (int) – The index of the token within the document
    class Span

    A Span is a slice of a Doc object, consisting of zero or more tokens. Spans are used to represent sentences, named entities, phrases, and arbitrary contiguous slices from the Doc object. Span objects are views – that is, they do not copy the underlying C data. This makes them cheap to construct, as internally they are simply a reference to the Doc object, a start position, an end position, and a label ID.
  • token = span[i] Get the Token object at position i, where i is an offset within the Span, not the document. That is:
    span = doc[4:6]
    token = span[0]
    assert token.i == 4
    
    • for token in span Iterate over the Token objects in the span.
    • __len__ Number of tokens in the span.
    • text The text content of the span, obtained from ''.join(token.text_with_ws for token in span)
    • start The start offset of the span, i.e. span[0].i.
    • end The end offset of the span, i.e. span[-1].i + 1.
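
    For example:

    # from spacy.en import English
    # nlp = English()
    doc = nlp(u'I like New York in Autumn.')
    span = doc[2:4]
    assert span.start == 2 and span.end == 4
    assert [token.orth_ for token in span] == [u'New', u'York']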

    Navigating the Parse Tree

    root

    The word with the shortest path to the root of the sentence is the root of the span.

    toks = nlp('I like New York in Autumn.')
    

    Let's name the indices --- easier than writing toks[4] etc.

    i, like, new, york, in_, autumn, dot = range(len(toks))
    

    The head of new is York, and the head of York is like

    assert toks[new].head.orth_ == 'York'
    assert toks[york].head.orth_ == 'like'
    

    Create a span for "New York". Its root is "York".

    new_york = toks[new:york+1]
    assert new_york.root.orth_ == 'York'
    

    When there are multiple words with external dependencies, we take the first:

    assert toks[autumn].head.orth_ == 'in'
    assert toks[dot].head.orth_ == 'like'
    autumn_dot = toks[autumn:]
    assert autumn_dot.root.orth_ == 'Autumn'
    

    lefts

    Tokens that are to the left of the span, whose head is within the span, i.e.

    # assumes `doc` is a parsed Doc object
    span = doc[:2]
    lefts = [span.doc[i] for i in range(0, span.start)
             if span.doc[i].head in span]
    

    rights

    Tokens that are to the right of the span, whose head is within the span, i.e.

    span = doc[:2]
    rights = [span.doc[i] for i in range(span.end, len(span.doc))
              if span.doc[i].head in span]
    

    subtree

    Tokens in the range (start, end+1), where start is the index of the leftmost word descended from a token in the span, and end is the index of the rightmost token descended from a token in the span.

    Constructors

    • doc[start : end]
    • for entity in doc.ents
    • for sentence in doc.sents
    • for noun_phrase in doc.noun_chunks
    • span = Span(doc, start, end, label=0)

    Strings

    • text_with_ws The form of the span as it appears in the string, including trailing whitespace. This is useful when you need to use linguistic features to add inline mark-up to the string.
    • lemma / lemma_ Whitespace-concatenated lemmas of each token in the span.
    • label / label_ The span label, used particularly for named entities.
    class Lexeme

    The Lexeme object represents a lexical type, stored in the vocabulary – as opposed to a token, occurring in a document.

    Each Token object receives a reference to a lexeme object (specifically, it receives a pointer to a LexemeC struct). This allows features to be computed and saved once per type, rather than once per token. As job sizes grow, this amounts to substantial efficiency improvements, as the vocabulary size (number of types) will be much smaller than the total number of words processed (number of tokens).

    All Lexeme attributes are therefore context independent, as a single lexeme is reused for all usages of that word. Lexemes are keyed by the “orth” attribute.

    Most Lexeme attributes can be set, with the exception of the primary key, orth. Assigning to an attribute of the Lexeme object writes to the underlying struct, so all tokens that are backed by that Lexeme will inherit the new value.
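
    A minimal sketch, assuming boolean flags such as is_stop are among the writable attributes:

    # from spacy.en import English
    # nlp = English()
    lexeme = nlp.vocab[u'dog']
    lexeme.is_stop = True          # writes to the shared LexemeC struct
    doc = nlp(u'The dog chased the dog.')
    assert doc[1].is_stop and doc[4].is_stop   # both 'dog' tokens see the new value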

    String Features

    • orth / orth_ The form of the word with no string normalization or processing, as it appears in the string, without trailing whitespace.
    • lower / lower_ The form of the word, but forced to lower-case, i.e. lower = word.orth_.lower()
    • shape / shape_ A transform of the word's string, to show orthographic features. The characters a-z are mapped to x, A-Z is mapped to X, 0-9 is mapped to d. After these mappings, sequences of 4 or more of the same character are truncated to length 4. Examples: C3Po --> XdXx, favorite --> xxxx, :) --> :)
    • prefix / prefix_ A length-N substring from the start of the word. Length may vary by language; currently for English n=1, i.e. prefix = word.orth_[:1]
    • suffix / suffix_ A length-N substring from the end of the word. Length may vary by language; currently for English n=3, i.e. suffix = word.orth_[-3:]

    Boolean Features

    • is_alpha Equivalent to word.orth_.isalpha()
    • is_ascii Equivalent to all(ord(c) < 128 for c in word.orth_)
    • is_digit Equivalent to word.orth_.isdigit()
    • is_lower Equivalent to word.orth_.islower()
    • is_title Equivalent to word.orth_.istitle()
    • is_punct True if the word consists of punctuation characters.
    • is_space Equivalent to word.orth_.isspace()
    • like_url Does the word resemble a URL?
    • like_num Does the word represent a number? e.g. “10.9”, “10”, “ten”, etc
    • like_email Does the word resemble an email?
    • is_oov Is the word out-of-vocabulary?
    • is_stop Is the word part of a "stop list"? Stop lists are used to improve the quality of topic models, by filtering out common, domain-general words.

    Distributional Features

    • prob The unigram log-probability of the word, estimated from counts from a large corpus, smoothed using Simple Good Turing estimation.
    • cluster The Brown cluster ID of the word. These are often useful features for linear models. If you’re using a non-linear model, particularly a neural net or random forest, consider using the real-valued word representation vector, in Token.repvec, instead.
    • vector A “word embedding” representation: a dense real-valued vector that supports similarity queries between words. By default, spaCy currently loads vectors produced by the Levy and Goldberg (2014) dependency-based word2vec model.
    • has_vector A boolean value indicating whether the word has a vector representation.

    Constructors

    • lexeme = vocab[string]
    • lexeme = vocab[i]
    class Vocab
    • lexeme = vocab[integer_id] Get a lexeme by its orth ID.
    • lexeme = vocab[string] Get a lexeme by the string corresponding to its orth ID.
    • for lexeme in vocab Iterate over Lexeme objects.
    • vocab[integer_id] = attributes_dict Set attributes of the lexeme with the given orth ID, from a dictionary of properties.
    • len(vocab) Number of lexemes (unique words) in the vocabulary.
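
    For example:

    # from spacy.en import English
    # nlp = English()
    apple = nlp.vocab[u'apple']
    assert nlp.vocab[apple.orth].orth_ == u'apple'
    assert len(nlp.vocab) > 0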

    Constructors

    • nlp.vocab
    • doc.vocab
    • span.vocab
    • token.vocab
    • lexeme.vocab

    Save and Load

    dump(loc)
    • loc (unicode) – Path where the vocabulary should be saved.
    load_lexemes(loc)
    • loc (unicode) – Path to load the lexemes.bin file from.
    load_vectors(file)
    • file – A file-like object, to load word vectors from.
    load_vectors_from_bin_loc(loc)
    • loc (unicode) – A path to a file, in spaCy's binary word-vectors file format.
    class StringStore

    Intern strings, and map them to sequential integer IDs. The mapping table is very efficient, and a small-string optimization is used to maintain a small memory footprint. Only the integer IDs are held by spaCy's data classes (Doc, Token, Span and Lexeme) – when you use a string-valued attribute like token.orth_, you access a property that computes token.vocab.strings[token.orth].

    • string = string_store[int_id] Retrieve a string from a given integer ID. If the integer ID is not found, raises IndexError.
    • int_id = string_store[unicode_string] Map a unicode string to an integer ID. If the string is previously unseen, it is interned, and a new ID is returned.
    • int_id = string_store[utf8_byte_string] Byte strings are assumed to be in UTF-8 encoding. Strings encoded with other codecs may fail silently. Given a utf8 string, the behaviour is the same as for unicode strings. Internally, strings are stored in UTF-8 format. So if you start with a UTF-8 byte string, it's less efficient to first decode it as unicode, as StringStore will then have to encode it as UTF-8 once again.
    • n_strings = len(string_store) Number of strings in the string-store.
    • for string in string_store Iterate over strings in the string store, in order, such that the ith string in the sequence has the ID i:
      string_store = doc.vocab.strings
      for i, string in enumerate(string_store):
          assert i == string_store[string]
      

    Constructors

    StringStore.__init__ takes no arguments, so a new instance can be constructed as follows:

    string_store = StringStore()

    However, in practice you'll usually use the instance owned by the language's vocab object, which all classes hold a reference to:

    • english.vocab.strings
    • doc.vocab.strings
    • span.vocab.strings
    • token.vocab.strings
    • lexeme.vocab.strings

    If you create another instance, it will map strings to different integers – which is usually not what you want.
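
    For example, round-tripping a string through the vocabulary's own store:

    # from spacy.en import English
    # nlp = English()
    string_store = nlp.vocab.strings
    hello_id = string_store[u'Hello']
    assert string_store[hello_id] == u'Hello'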

    Save and Load

    dump(loc)

    Save the strings mapping to the given location, in plain text. The format is subject to change; if you need to read/write compatible files, you can find details in the strings.pyx source.

    load(loc)

    Load the strings mapping from a plain-text file at the given location. The format is subject to change; if you need to read/write compatible files, you can find details in the strings.pyx source.

    Mark all adverbs, particularly for verbs of speech

    Let's say you're developing a proofreading tool, or possibly an IDE for writers. You're convinced by Stephen King's advice that adverbs are not your friend, so you want to highlight all adverbs.

    Search Reddit for comments about Google doing something

    Example use of the spaCy NLP tools for data exploration. Here we will look for Reddit comments that describe Google doing something, i.e. discuss the company's actions. This is difficult, because other senses of "Google" now dominate usage of the word in conversation, particularly references to using Google products.

    Finding Relevant Tweets

    In this tutorial, we will use word vectors to search for tweets about Jeb Bush. We'll do this by building up two word lists: one that represents the type of meanings in the Jeb Bush tweets, and another to help screen out irrelevant tweets that mention the common, ambiguous word 'bush'.

    Annotation Specifications

    Overview

    This document describes the target annotations spaCy is trained to predict. This is currently a work in progress. Please ask questions on the issue tracker, so that the answers can be integrated here to improve the documentation.

    Tokenization

    Tokenization standards are based on the OntoNotes 5 corpus.

    The tokenizer differs from most by including tokens for significant whitespace. Any sequence of whitespace characters beyond a single space (' ') is included as a token. For instance:

    from spacy.en import English
    nlp = English(parse=False)
    tokens = nlp(u'Some\nspaces  and\ttab characters')
    print([t.orth_ for t in tokens])

    Which produces:

    ['Some', '\n', 'spaces', ' ', 'and', '\t', 'tab', 'characters']

    The whitespace tokens are useful for much the same reason punctuation is – it's often an important delimiter in the text. By preserving it in the token output, we are able to maintain a simple alignment between the tokens and the original string, and we ensure that no information is lost during processing.
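
    For example, continuing the snippet above, the original string can be recovered exactly from the tokens:

    text = u'Some\nspaces  and\ttab characters'
    tokens = nlp(text)
    assert u''.join(t.text_with_ws for t in tokens) == text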

    Sentence boundary detection

    Sentence boundaries are calculated from the syntactic parse tree, so features such as punctuation and capitalisation play an important but non-decisive role in determining the sentence boundaries. Usually this means that the sentence boundaries will at least coincide with clause boundaries, even given poorly punctuated text.

    Part-of-speech Tagging

    The part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank tag set. We also map the tags to the simpler Google Universal POS Tag set.


    Lemmatization

    A "lemma" is the uninflected form of a word. In English, this means:

    • Adjectives: The form like "happy", not "happier" or "happiest"
    • Adverbs: The form like "badly", not "worse" or "worst"
    • Nouns: The form like "dog", not "dogs"; like "child", not "children"
    • Verbs: The form like "write", not "writes", "writing", "wrote" or "written"

    The lemmatization data is taken from WordNet. However, we also add a special case for pronouns: all pronouns are lemmatized to the special token -PRON-.
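
    For example (the expected lemmas are shown in the comment):

    # from spacy.en import English
    # nlp = English()
    doc = nlp(u'She wrote to his children.')
    print([token.lemma_ for token in doc])
    # e.g. [u'-PRON-', u'write', u'to', u'-PRON-', u'child', u'.']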

    Syntactic Dependency Parsing

    The parser is trained on data produced by the ClearNLP converter. Details of the annotation scheme can be found here.

    Named Entity Recognition

    Entity Type Description
    PERSON People, including fictional.
    NORP Nationalities or religious or political groups.
    FACILITY Buildings, airports, highways, bridges, etc.
    ORG Companies, agencies, institutions, etc.
    GPE Countries, cities, states.
    LOC Non-GPE locations, mountain ranges, bodies of water.
    PRODUCT Vehicles, weapons, foods, etc. (Not services.)
    EVENT Named hurricanes, battles, wars, sports events, etc.
    WORK_OF_ART Titles of books, songs, etc.
    LAW Named documents made into laws
    LANGUAGE Any named language

    The following values are also annotated in a style similar to names:

    Entity Type Description
    DATE Absolute or relative dates or periods
    TIME Times smaller than a day
    PERCENT Percentage (including “%”)
    MONEY Monetary values, including unit
    QUANTITY Measurements, as of weight or distance
    ORDINAL "first", "second", etc.
    CARDINAL Numerals that do not fall under another type