
mixin columns(...names)
  tr
    each name in names
      th= name


mixin row(...cells)
  tr
    each cell in cells
      td= cell


mixin Define(term)
  li #[code #{term}]
    block


details
  summary: h4 Overview

  p This document describes the target annotations spaCy is trained to predict. It is currently a work in progress. Please ask questions on the issue tracker, so that the answers can be integrated here to improve the documentation.

details
  summary: h4 Tokenization

  p Tokenization standards are based on the OntoNotes 5 corpus.

  p The tokenizer differs from most by including tokens for significant whitespace. Any sequence of whitespace characters beyond a single space (' ') is included as a token. For instance:

  pre.language-python
    code
      | from spacy.en import English
      | nlp = English(parse=False)
      | tokens = nlp('Some\nspaces  and\ttab characters')
      | print([t.orth_ for t in tokens])

  p Which produces:

  pre.language-python
    code
      | ['Some', '\n', 'spaces', ' ', 'and', '\t', 'tab', 'characters']

  p The whitespace tokens are useful for much the same reason punctuation is &ndash; whitespace is often an important delimiter in the text. By preserving it in the token output, we maintain a simple alignment between the tokens and the original string, and we ensure that no information is lost during processing.
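  p The whitespace rule and the token-to-string alignment can be sketched in plain Python. This is a minimal illustration, not spaCy's tokenizer: the function name #[code tokenize_keep_whitespace] and the regular expression are assumptions made for the example, and it handles only whitespace, not punctuation.

```python
import re


def tokenize_keep_whitespace(text):
    """Return (start_offset, token) pairs, keeping significant whitespace.

    Hypothetical helper for illustration only; not part of spaCy.
    """
    tokens = []
    for match in re.finditer(r"\S+|\s+", text):
        run, start = match.group(), match.start()
        if not run.isspace():
            tokens.append((start, run))          # ordinary word-like chunk
        elif run == " ":
            continue                             # a lone space is only a separator
        elif run.startswith(" ") and tokens:
            tokens.append((start + 1, run[1:]))  # first space separates, the rest is a token
        else:
            tokens.append((start, run))          # e.g. '\n' or '\t' right after a word
    return tokens


text = "Some\nspaces  and\ttab characters"
tokens = tokenize_keep_whitespace(text)
words = [token for _, token in tokens]
print(words)

# Every token can be located in the original string by its offset.
aligned = all(text[start:start + len(token)] == token for start, token in tokens)
print(aligned)
```

  p Storing the start offset next to each token is one simple way to keep the alignment described above; other representations would work equally well.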

details
  summary: h4 Sentence boundary detection

  p Sentence boundaries are derived from the syntactic parse tree, so features such as punctuation and capitalisation play an important but non-decisive role in determining them. Usually this means that the sentence boundaries will at least coincide with clause boundaries, even given poorly punctuated text.
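  p As a rough illustration of how boundaries can fall out of a parse, the sketch below groups tokens by the root of their dependency tree. It is not spaCy's parser: the head indices are hand-written, the function names are hypothetical, and the convention that a root token points to itself is an assumption made for the example.

```python
def find_root(heads, i):
    """Follow head links until reaching a token that is its own head."""
    while heads[i] != i:
        i = heads[i]
    return i


def sentences_from_heads(words, heads):
    """Start a new sentence whenever the root of the current token changes.

    Assumes each tree covers a contiguous span of tokens.
    """
    sentences = []
    previous_root = None
    for i, word in enumerate(words):
        root = find_root(heads, i)
        if root != previous_root:
            sentences.append([])
            previous_root = root
        sentences[-1].append(word)
    return sentences


# Unpunctuated, uncapitalised text with two clauses.
words = ["i", "like", "cats", "she", "likes", "dogs"]
# Hand-written head index for each token; roots (1 and 4) point to themselves.
heads = [1, 1, 1, 4, 4, 4]

sentences = sentences_from_heads(words, heads)
print(sentences)
```

  p Because the split depends only on the tree structure, the two clauses are separated even though the input has no punctuation or capital letters.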

details
  summary: h4 Part-of-speech Tagging

  p The part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank tag set. We also map the tags to the simpler Google Universal POS tag set.

  p Details can be found in the #[a(href="https://github.com/honnibal/spaCy/blob/master/spacy/tagger.pyx") tagger source].
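  p The idea of mapping fine-grained Penn Treebank tags to coarse universal tags can be sketched as a dictionary lookup. The table below is a small illustrative excerpt, not spaCy's actual mapping; the names #[code PTB_TO_UNIVERSAL] and #[code to_universal], and the fallback to #[code X] for tags missing from the excerpt, are assumptions made for the example.

```python
# Illustrative excerpt only; the real mapping covers the whole tag set.
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBZ": "VERB",
    "JJ": "ADJ",
    "RB": "ADV",
    "DT": "DET",
    "IN": "ADP",
    "PRP": "PRON",
    "CD": "NUM",
}


def to_universal(ptb_tag):
    """Map a fine-grained tag to a coarse one; unknown tags fall back to 'X'."""
    return PTB_TO_UNIVERSAL.get(ptb_tag, "X")


# Hand-tagged input, standing in for tagger output.
tagged = [("The", "DT"), ("dogs", "NNS"), ("barked", "VBD")]
coarse = [(word, to_universal(tag)) for word, tag in tagged]
print(coarse)
```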

details
  summary: h4 Lemmatization

  p.
    A "lemma" is the uninflected form of a word. In English, this means:

  ul
    li Adjectives: The form like "happy", not "happier" or "happiest"
    li Adverbs: The form like "badly", not "worse" or "worst"
    li Nouns: The form like "dog", not "dogs"; like "child", not "children"
    li Verbs: The form like "write", not "writes", "writing", "wrote" or "written"

  p.
    The lemmatization data is taken from WordNet. However, we also add a
    special case for pronouns: all pronouns are lemmatized to the special
    token #[code -PRON-].
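  p The behaviour described above, including the pronoun special case, can be sketched with a small lookup table. This is a toy example, not spaCy's lemmatizer: the word lists are hand-written rather than taken from WordNet, and the names #[code PRONOUNS], #[code IRREGULAR_FORMS] and #[code lemmatize] are hypothetical.

```python
# Hand-written toy data for illustration; spaCy takes its data from WordNet.
PRONOUNS = {"i", "me", "you", "he", "him", "she", "her", "it",
            "we", "us", "they", "them"}

IRREGULAR_FORMS = {
    "happier": "happy", "happiest": "happy",
    "dogs": "dog", "children": "child",
    "writes": "write", "writing": "write",
    "wrote": "write", "written": "write",
}


def lemmatize(word):
    """Return the lemma of a word, or '-PRON-' for any pronoun."""
    lower = word.lower()
    if lower in PRONOUNS:
        return "-PRON-"  # the special pronoun token
    # Words missing from the toy table are returned lowercased, unchanged.
    return IRREGULAR_FORMS.get(lower, lower)


lemmas = [lemmatize(w) for w in ["She", "wrote", "happier", "children"]]
print(lemmas)
```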


details
  summary: h4 Syntactic Dependency Parsing

  p The parser is trained on data produced by the ClearNLP converter. Details of the annotation scheme can be found #[a(href="http://www.mathcs.emory.edu/~choi/doc/clear-dependency-2012.pdf") here].

details
  summary: h4 Named Entity Recognition

  table
    thead
      +columns("Entity Type", "Description")

    tbody
      +row("PERSON", "People, including fictional.")
      +row("NORP", "Nationalities or religious or political groups.")
      +row("FACILITY", "Buildings, airports, highways, bridges, etc.")
      +row("ORG", "Companies, agencies, institutions, etc.")
      +row("GPE", "Countries, cities, states.")
      +row("LOC", "Non-GPE locations, mountain ranges, bodies of water.")
      +row("PRODUCT", "Vehicles, weapons, foods, etc. (Not services.)")
      +row("EVENT", "Named hurricanes, battles, wars, sports events, etc.")
      +row("WORK_OF_ART", "Titles of books, songs, etc.")
      +row("LAW", "Named documents made into laws.")
      +row("LANGUAGE", "Any named language.")

  p The following values are also annotated in a style similar to names:

  table
    thead
      +columns("Entity Type", "Description")

    tbody
      +row("DATE", "Absolute or relative dates or periods")
      +row("TIME", "Times smaller than a day")
      +row("PERCENT", 'Percentage (including “%”)')
      +row("MONEY", "Monetary values, including unit")
      +row("QUANTITY", "Measurements, as of weight or distance")
      +row("ORDINAL", '"first", "second"')
      +row("CARDINAL", "Numerals that do not fall under another type")