* Add spec.jade

Matthew Honnibal 2015-08-13 01:11:40 +02:00
parent b57a3ddd7e
commit ba00c72505

docs/redesign/spec.jade Normal file

@@ -0,0 +1,123 @@
extends ./outline.jade
mixin columns(...names)
tr
each name in names
th= name
mixin row(...cells)
tr
each cell in cells
td= cell
block body_block
article(class="page docs-page")
p.
This document describes the target annotations spaCy is trained to predict.
This is currently a work in progress. Please ask questions on the issue tracker,
so that the answers can be integrated here to improve the documentation.
h2 Tokenization
p Tokenization standards are based on the OntoNotes 5 corpus.
p.
The tokenizer differs from most by including tokens for significant
whitespace. Any sequence of whitespace characters beyond a single space
(' ') is included as a token. For instance:
pre.language-python
code
| from spacy.en import English
| nlp = English(parse=False)
| tokens = nlp('Some\nspaces  and\ttab characters')
| print([t.orth_ for t in tokens])
p Which produces:
pre.language-python
code
| ['Some', '\n', 'spaces', ' ', 'and', '\t', 'tab', 'characters']
p.
The whitespace tokens are useful for much the same reason punctuation is:
whitespace is often an important delimiter in the text. By preserving
it in the token output, we are able to maintain a simple alignment
between the tokens and the original string, and we ensure that no
information is lost during processing.
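p.
The alignment property can be illustrated with a minimal pure-Python sketch.
It does not call spaCy and is not spaCy's tokenizer: the helpers tokenize and
detokenize are hypothetical, punctuation is not split off, and the sketch
assumes that one space after a word is recorded as a flag on that word while
any remaining whitespace becomes its own token.

```python
import re


def tokenize(text):
    """Return (token, has_trailing_space) pairs for `text`.

    Assumption for illustration: a single space after a word is stored as a
    flag on that word; any other whitespace is emitted as a separate token.
    """
    pairs = []
    for match in re.finditer(r"\S+|\s+", text):
        chunk = match.group()
        if chunk.isspace():
            if pairs and chunk.startswith(" "):
                # Attach the first space to the preceding word.
                word, _ = pairs[-1]
                pairs[-1] = (word, True)
                chunk = chunk[1:]
            if chunk:
                # Whatever whitespace is left over is significant.
                pairs.append((chunk, False))
        else:
            pairs.append((chunk, False))
    return pairs


def detokenize(pairs):
    """Rebuild the original string from the (token, flag) pairs."""
    return "".join(tok + (" " if space else "") for tok, space in pairs)


text = 'Some\nspaces  and\ttab characters'
pairs = tokenize(text)
print([tok for tok, _ in pairs])
print(detokenize(pairs) == text)  # the round trip loses nothing
```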
h3 Sentence boundary detection
p.
Sentence boundaries are calculated from the syntactic parse tree, so
features such as punctuation and capitalisation play an important but
non-decisive role in determining the sentence boundaries. Usually this
means that the sentence boundaries will at least coincide with clause
boundaries, even given poorly punctuated text.
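p.
As a rough sketch of how boundaries can be read off a parse, assume each
token stores the index of its head and a root token points to itself. A
sentence is then a maximal run of adjacent tokens that share the same root.
The representation and the function names below are assumptions made for
illustration, not spaCy's internal data structures.

```python
def find_root(heads, i):
    """Follow head pointers from token i until reaching a self-headed root."""
    while heads[i] != i:
        i = heads[i]
    return i


def sentence_spans(heads):
    """Return (start, end) token spans, end exclusive, one per sentence.

    Assumes every tree covers a contiguous run of tokens, so a boundary
    falls wherever two adjacent tokens have different roots.
    """
    spans = []
    start = 0
    for i in range(1, len(heads) + 1):
        if i == len(heads) or find_root(heads, i) != find_root(heads, start):
            spans.append((start, i))
            start = i
    return spans


# "I left she stayed": no punctuation, but two separate trees.
words = ["I", "left", "she", "stayed"]
heads = [1, 1, 3, 3]
for start, end in sentence_spans(heads):
    print(words[start:end])
```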
h3 Part-of-speech Tagging
p.
The part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank
tag set. We also map the tags to the simpler Google Universal POS Tag set.
Details here: https://github.com/honnibal/spaCy/blob/master/spacy/en/pos.pyx#L124
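p.
The fine-to-coarse mapping amounts to a lookup table. The sketch below covers
only a small, illustrative subset of Penn Treebank tags and is not copied from
spaCy's source; the names UNIVERSAL_TAGS and coarse_tag are hypothetical, and
the linked file remains the authoritative mapping.

```python
# Illustrative subset only: Penn Treebank tag -> Universal POS tag.
UNIVERSAL_TAGS = {
    "NN": "NOUN", "NNS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "DT": "DET",
    "IN": "ADP",
    "CC": "CONJ",
    "CD": "NUM",
    "PRP": "PRON",
}


def coarse_tag(fine_tag):
    """Map a fine-grained tag to a coarse one; unknown tags fall back to X."""
    return UNIVERSAL_TAGS.get(fine_tag, "X")


print([coarse_tag(tag) for tag in ["DT", "NNS", "VBD", "RB"]])
```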
h3 Lemmatization
p.
A "lemma" is the uninflected form of a word. In English, this means:
ul
li Adjectives: The form like "happy", not "happier" or "happiest"
li Adverbs: The form like "badly", not "worse" or "worst"
li Nouns: The form like "dog", not "dogs"; like "child", not "children"
li Verbs: The form like "write", not "writes", "writing", "wrote" or "written"
p.
The lemmatization data is taken from WordNet. However, we also add a
special case for pronouns: all pronouns are lemmatized to the special
token -PRON-.
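p.
The behaviour described above can be sketched as a lookup keyed on the word
and its coarse part of speech. The small EXCEPTIONS table is a hand-written
stand-in for the WordNet data, built from the examples in the list above, and
lemmatize is a hypothetical helper rather than spaCy's lemmatizer.

```python
# Hand-written stand-in for the WordNet data (assumption for illustration).
EXCEPTIONS = {
    ("happier", "ADJ"): "happy",
    ("happiest", "ADJ"): "happy",
    ("worse", "ADV"): "badly",
    ("worst", "ADV"): "badly",
    ("dogs", "NOUN"): "dog",
    ("children", "NOUN"): "child",
    ("writes", "VERB"): "write",
    ("writing", "VERB"): "write",
    ("wrote", "VERB"): "write",
    ("written", "VERB"): "write",
}


def lemmatize(word, pos):
    """Return the lemma of `word` given its coarse part-of-speech tag."""
    if pos == "PRON":
        # Special case: every pronoun shares one placeholder lemma.
        return "-PRON-"
    lower = word.lower()
    # Words missing from the table are returned lower-cased and unchanged.
    return EXCEPTIONS.get((lower, pos), lower)


print(lemmatize("children", "NOUN"))
print(lemmatize("She", "PRON"))
```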
h3 Syntactic Dependency Parsing
p.
The parser is trained on data produced by the ClearNLP converter. Details
of the annotation scheme can be found here: http://www.mathcs.emory.edu/~choi/doc/clear-dependency-2012.pdf
h3 Named Entity Recognition
table
thead
+columns("Entity Type", "Description")
tbody
+row("PERSON", "People, including fictional.")
+row("NORP", "Nationalities or religious or political groups.")
+row("FACILITY", "Buildings, airports, highways, bridges, etc.")
+row("ORG", "Companies, agencies, institutions, etc.")
+row("GPE", "Countries, cities, states.")
+row("LOC", "Non-GPE locations, mountain ranges, bodies of water.")
+row("PRODUCT", "Vehicles, weapons, foods, etc. (Not services)")
+row("EVENT", "Named hurricanes, battles, wars, sports events, etc.")
+row("WORK_OF_ART", "Titles of books, songs, etc.")
+row("LAW", "Named documents made into laws")
+row("LANGUAGE", "Any named language")
p The following values are also annotated in a style similar to names:
table
thead
+columns("Entity Type", "Description")
tbody
+row("DATE", "Absolute or relative dates or periods")
+row("TIME", "Times smaller than a day")
+row("PERCENT", 'Percentage (including “%”)')
+row("MONEY", "Monetary values, including unit")
+row("QUANTITY", "Measurements, as of weight or distance")
+row("ORDINAL", '"first", "second"')
+row("CARDINAL", "Numerals that do not fall under another type")