extends ./outline.jade

mixin columns(...names)
  tr
    each name in names
      th= name


mixin row(...cells)
  tr
    each cell in cells
      td= cell


block body_block
  article(class="page docs-page")
    p.
      This document describes the target annotations spaCy is trained to predict.
      This is currently a work in progress. Please ask questions on the issue tracker,
      so that the answers can be integrated here to improve the documentation.

    h2 Tokenization

    p Tokenization standards are based on the OntoNotes 5 corpus.

    p.
      The tokenizer differs from most by including tokens for significant
      whitespace. Any sequence of whitespace characters beyond a single space
      (' ') is included as a token. For instance:

    pre.language-python
      code
        | from spacy.en import English
        | nlp = English(parse=False)
        | tokens = nlp('Some\nspaces  and\ttab characters')
        | print([t.orth_ for t in tokens])
        
    p This produces:
    
    pre.language-python
      code
        | ['Some', '\n', 'spaces', ' ', 'and', '\t', 'tab', 'characters']

    p.
      Whitespace tokens are useful for much the same reason punctuation
      tokens are: whitespace is often an important delimiter in the text.
      Preserving it in the token output maintains a simple alignment
      between the tokens and the original string, and ensures that no
      information is lost during processing.
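
    p.
      As a rough illustration of that alignment, the whitespace rule can be
      sketched in plain Python. This is not spaCy's tokenizer: the names
      tokenize and detokenize are hypothetical, only whitespace splitting is
      modelled, and attaching a leading single space to the preceding token
      is an assumption made for the sketch.

    pre.language-python
      code
        | import re
        | # Hypothetical helper: returns (text, followed_by_space) pairs.
        | def tokenize(text):
        |     pairs = []
        |     for chunk in re.findall(r'\s+|\S+', text):
        |         if not chunk.isspace():
        |             pairs.append([chunk, False])
        |             continue
        |         # A leading single space is recorded on the previous token.
        |         if chunk.startswith(' ') and pairs:
        |             pairs[-1][1] = True
        |             chunk = chunk[1:]
        |         # Whatever whitespace remains becomes a token of its own.
        |         if chunk:
        |             pairs.append([chunk, False])
        |     return [tuple(pair) for pair in pairs]
        | # Hypothetical helper: rebuilds the input string from the pairs.
        | def detokenize(pairs):
        |     return ''.join(text + (' ' if space else '') for text, space in pairs)
        | text = 'Some\nspaces  and\ttab characters'
        | pairs = tokenize(text)
        | print([token for token, _ in pairs])
        | print(detokenize(pairs) == text)

    p.
      Under these assumptions the sketch prints the same token list as the
      example above, followed by True, because joining the pairs reproduces
      the input string exactly.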

    h3 Sentence Boundary Detection

    p.
      Sentence boundaries are calculated from the syntactic parse tree, so
      features such as punctuation and capitalisation play an important but
      non-decisive role in determining the sentence boundaries.  Usually this
      means that the sentence boundaries will at least coincide with clause
      boundaries, even given poorly punctuated text.
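
    p.
      To make the idea concrete, here is a minimal sketch of reading sentence
      spans off a parse. It assumes a toy representation that is not spaCy's
      API: heads[i] is the index of token i's head, a root is its own head,
      and each tree covers a contiguous run of tokens. The names find_root
      and sentence_spans are hypothetical.

    pre.language-python
      code
        | def find_root(heads, i):
        |     # Follow head pointers up to the token that is its own head.
        |     while heads[i] != i:
        |         i = heads[i]
        |     return i
        | def sentence_spans(heads):
        |     # Start a new span wherever the root changes.
        |     spans, start = [], 0
        |     for i in range(1, len(heads) + 1):
        |         if i == len(heads) or find_root(heads, i) != find_root(heads, start):
        |             spans.append((start, i))
        |             start = i
        |     return spans
        | # Unpunctuated input: "it rained we stayed home"
        | words = ['it', 'rained', 'we', 'stayed', 'home']
        | heads = [1, 1, 3, 3, 3]
        | for start, end in sentence_spans(heads):
        |     print(' '.join(words[start:end]))

    p.
      With the hand-written heads above, the sketch prints "it rained" and
      "we stayed home" on separate lines, although the input contains no
      punctuation or capitalisation.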

    h3 Part-of-speech Tagging

    p.
      The part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank
      tag set. We also map the tags to the simpler Google Universal POS Tag set.
      For details, see https://github.com/honnibal/spaCy/blob/master/spacy/en/pos.pyx#L124
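
    p.
      Conceptually, the mapping is a lookup from fine-grained to
      coarse-grained tags. The excerpt below is illustrative only: it is not
      copied from spaCy's source, the names UNIVERSAL_TAGS and to_universal
      are hypothetical, and falling back to "X" for unlisted tags is an
      assumption of the sketch.

    pre.language-python
      code
        | # A few Penn Treebank tags and their Universal POS equivalents.
        | UNIVERSAL_TAGS = {
        |     'NN': 'NOUN', 'NNS': 'NOUN',
        |     'VB': 'VERB', 'VBD': 'VERB', 'VBZ': 'VERB',
        |     'JJ': 'ADJ', 'JJR': 'ADJ',
        |     'RB': 'ADV', 'DT': 'DET', 'IN': 'ADP',
        | }
        | def to_universal(tag):
        |     # Unlisted tags fall back to 'X' (an assumption of this sketch).
        |     return UNIVERSAL_TAGS.get(tag, 'X')
        | print([to_universal(tag) for tag in ['DT', 'NNS', 'VBD', 'RB']])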

    h3 Lemmatization

    p.
      A "lemma" is the uninflected form of a word. In English, this means:

    ul
      li Adjectives: the base form, such as "happy", not "happier" or "happiest"
      li Adverbs: the base form, such as "badly", not "worse" or "worst"
      li Nouns: the singular form, such as "dog", not "dogs"; "child", not "children"
      li Verbs: the infinitive form, such as "write", not "writes", "writing", "wrote" or "written"

    p.
      The lemmatization data is taken from WordNet. However, we also add a
      special case for pronouns: all pronouns are lemmatized to the special
      token -PRON-.
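
    p.
      A minimal sketch of these rules follows. The tiny lookup table stands
      in for the WordNet data, and the lemmatize function, its coarse
      part-of-speech argument and the 'PRON' label are assumptions made for
      illustration, not spaCy's API.

    pre.language-python
      code
        | # Toy stand-in for the WordNet-derived data: (word, pos) to lemma.
        | LEMMAS = {
        |     ('happier', 'ADJ'): 'happy',
        |     ('worse', 'ADV'): 'badly',
        |     ('children', 'NOUN'): 'child',
        |     ('wrote', 'VERB'): 'write',
        | }
        | def lemmatize(word, pos):
        |     # Special case: every pronoun maps to the same token.
        |     if pos == 'PRON':
        |         return '-PRON-'
        |     # Words missing from the table are assumed to be lemmas already.
        |     return LEMMAS.get((word, pos), word)
        | print(lemmatize('children', 'NOUN'))
        | print(lemmatize('she', 'PRON'))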


    h3 Syntactic Dependency Parsing

    p.
      The parser is trained on data produced by the ClearNLP converter. Details
      of the annotation scheme can be found here:  http://www.mathcs.emory.edu/~choi/doc/clear-dependency-2012.pdf

    h3 Named Entity Recognition

    table
      thead
        +columns("Entity Type", "Description")
      
      tbody
        +row("PERSON", "People, including fictional.")
        +row("NORP", "Nationalities or religious or political groups.")
        +row("FACILITY", "Buildings, airports, highways, bridges, etc.")
        +row("ORG", "Companies, agencies, institutions, etc.")
        +row("GPE", "Countries, cities, states.")
        +row("LOC", "Non-GPE locations, mountain ranges, bodies of water.")
        +row("PRODUCT", "Vehicles, weapons, foods, etc. (Not services.)")
        +row("EVENT", "Named hurricanes, battles, wars, sports events, etc.")
        +row("WORK_OF_ART", "Titles of books, songs, etc.")
        +row("LAW", "Named documents made into laws.")
        +row("LANGUAGE", "Any named language.")

    p The following values are also annotated in a style similar to names:

    table
      thead
        +columns("Entity Type", "Description")
      
      tbody
        +row("DATE", "Absolute or relative dates or periods")
        +row("TIME", "Times smaller than a day")
        +row("PERCENT", 'Percentage (including “%”)')
        +row("MONEY", "Monetary values, including unit")
        +row("QUANTITY", "Measurements, as of weight or distance")
        +row("ORDINAL", '"first", "second"')
        +row("CARDINAL", "Numerals that do not fall under another type")