From 15de2bb01d09eab61edc726abd61c612c5487a0f Mon Sep 17 00:00:00 2001 From: ines Date: Sun, 5 Nov 2017 16:09:48 +0100 Subject: [PATCH] Update and simplify other annotation scheme data --- website/api/_annotation/_text-processing.jade | 55 ++++++++++++++++ website/api/_data.json | 4 +- website/api/annotation.jade | 64 +------------------ 3 files changed, 59 insertions(+), 64 deletions(-) create mode 100644 website/api/_annotation/_text-processing.jade diff --git a/website/api/_annotation/_text-processing.jade b/website/api/_annotation/_text-processing.jade new file mode 100644 index 000000000..564e76f08 --- /dev/null +++ b/website/api/_annotation/_text-processing.jade @@ -0,0 +1,55 @@ +//- 💫 DOCS > API > ANNOTATION > TEXT PROCESSING + ++aside-code("Example"). + from spacy.lang.en import English + nlp = English() + tokens = nlp('Some\nspaces and\ttab characters') + tokens_text = [t.text for t in tokens] + assert tokens_text == ['Some', '\n', 'spaces', ' ', 'and', + '\t', 'tab', 'characters'] + +p + | Tokenization standards are based on the + | #[+a("https://catalog.ldc.upenn.edu/LDC2013T19") OntoNotes 5] corpus. + | The tokenizer differs from most by including + | #[strong tokens for significant whitespace]. Any sequence of + | whitespace characters beyond a single space (#[code ' ']) is included + | as a token. The whitespace tokens are useful for much the same reason + | punctuation is – it's often an important delimiter in the text. By + | preserving it in the token output, we are able to maintain a simple + | alignment between the tokens and the original string, and we ensure + | that #[strong no information is lost] during processing. 
+ ++h(3, "lemmatization") Lemmatization + ++aside("Examples") + | In English, this means:#[br] + | #[strong Adjectives]: happier, happiest → happy#[br] + | #[strong Adverbs]: worse, worst → badly#[br] + | #[strong Nouns]: dogs, children → dog, child#[br] + | #[strong Verbs]: writes, writing, wrote, written → write + + +p + | A lemma is the uninflected form of a word. The English lemmatization + | data is taken from #[+a("https://wordnet.princeton.edu") WordNet]. + | Lookup tables are taken from + | #[+a("http://www.lexiconista.com/datasets/lemmatization/") Lexiconista]. + | spaCy also adds a #[strong special case for pronouns]: all pronouns + | are lemmatized to the special token #[code -PRON-]. + ++infobox("About spaCy's custom pronoun lemma", "⚠️") + | Unlike verbs and common nouns, there's no clear base form of a personal + | pronoun. Should the lemma of "me" be "I", or should we normalize person + | as well, giving "it" — or maybe "he"? spaCy's solution is to introduce a + | novel symbol, #[code -PRON-], which is used as the lemma for + | all personal pronouns. + ++h(3, "sentence-boundary") Sentence boundary detection + +p + | Sentence boundaries are calculated from the syntactic parse tree, so + | features such as punctuation and capitalisation play an important but + | non-decisive role in determining the sentence boundaries. Usually this + | means that the sentence boundaries will at least coincide with clause + | boundaries, even given poorly punctuated text. 
diff --git a/website/api/_data.json b/website/api/_data.json index 9d447570f..67b9debf0 100644 --- a/website/api/_data.json +++ b/website/api/_data.json @@ -205,10 +205,8 @@ "title": "Annotation Specifications", "teaser": "Schemes used for labels, tags and training data.", "menu": { - "Tokenization": "tokenization", - "Sentence Boundaries": "sbd", + "Text Processing": "text-processing", "POS Tagging": "pos-tagging", - "Lemmatization": "lemmatization", "Dependencies": "dependency-parsing", "Named Entities": "named-entities", "Models & Training": "training" diff --git a/website/api/annotation.jade b/website/api/annotation.jade index 16598371d..bff9a71cb 100644 --- a/website/api/annotation.jade +++ b/website/api/annotation.jade @@ -2,43 +2,9 @@ include ../_includes/_mixins -p This document describes the target annotations spaCy is trained to predict. - - -+section("tokenization") - +h(2, "tokenization") Tokenization - - p - | Tokenization standards are based on the - | #[+a("https://catalog.ldc.upenn.edu/LDC2013T19") OntoNotes 5] corpus. - | The tokenizer differs from most by including tokens for significant - | whitespace. Any sequence of whitespace characters beyond a single space - | (#[code ' ']) is included as a token. - - +aside-code("Example"). - from spacy.lang.en import English - nlp = English() - tokens = nlp('Some\nspaces and\ttab characters') - tokens_text = [t.text for t in tokens] - assert tokens_text == ['Some', '\n', 'spaces', ' ', 'and', - '\t', 'tab', 'characters'] - - p - | The whitespace tokens are useful for much the same reason punctuation is - | – it's often an important delimiter in the text. By preserving it in the - | token output, we are able to maintain a simple alignment between the - | tokens and the original string, and we ensure that no information is - | lost during processing. 
- -+section("sbd") - +h(2, "sentence-boundary") Sentence boundary detection - - p - | Sentence boundaries are calculated from the syntactic parse tree, so - | features such as punctuation and capitalisation play an important but - | non-decisive role in determining the sentence boundaries. Usually this - | means that the sentence boundaries will at least coincide with clause - | boundaries, even given poorly punctuated text. ++section("text-processing") + +h(2, "text-processing") Text Processing + include _annotation/_text-processing +section("pos-tagging") +h(2, "pos-tagging") Part-of-speech Tagging @@ -50,30 +16,6 @@ p This document describes the target annotations spaCy is trained to predict. include _annotation/_pos-tags -+section("lemmatization") - +h(2, "lemmatization") Lemmatization - - p A "lemma" is the uninflected form of a word. In English, this means: - - +list - +item #[strong Adjectives]: The form like "happy", not "happier" or "happiest" - +item #[strong Adverbs]: The form like "badly", not "worse" or "worst" - +item #[strong Nouns]: The form like "dog", not "dogs"; like "child", not "children" - +item #[strong Verbs]: The form like "write", not "writes", "writing", "wrote" or "written" - - p - | The lemmatization data is taken from - | #[+a("https://wordnet.princeton.edu") WordNet]. However, we also add a - | special case for pronouns: all pronouns are lemmatized to the special - | token #[code -PRON-]. - - +infobox("About spaCy's custom pronoun lemma") - | Unlike verbs and common nouns, there's no clear base form of a personal - | pronoun. Should the lemma of "me" be "I", or should we normalize person - | as well, giving "it" — or maybe "he"? spaCy's solution is to introduce a - | novel symbol, #[code -PRON-], which is used as the lemma for - | all personal pronouns. - +section("dependency-parsing") +h(2, "dependency-parsing") Syntactic Dependency Parsing