From 6a793251c8d074dcf56070cbb313313d17ee76e0 Mon Sep 17 00:00:00 2001
From: Ines Montani
Date: Mon, 19 Dec 2016 13:41:47 +0100
Subject: [PATCH] Add aside on spaCy's custom pronoun lemma

---
 website/docs/api/annotation.jade | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/website/docs/api/annotation.jade b/website/docs/api/annotation.jade
index de678b472..342928add 100644
--- a/website/docs/api/annotation.jade
+++ b/website/docs/api/annotation.jade
@@ -50,6 +50,13 @@ p A "lemma" is the uninflected form of a word. In English, this means:
     +item #[strong Nouns]: The form like "dog", not "dogs"; like "child", not "children"
     +item #[strong Verbs]: The form like "write", not "writes", "writing", "wrote" or "written"
 
++aside("About spaCy's custom pronoun lemma")
+    | Unlike verbs and common nouns, there's no clear base form of a personal
+    | pronoun. Should the lemma of "me" be "I", or should we normalize person
+    | as well, giving "it" — or maybe "he"? spaCy's solution is to introduce a
+    | novel symbol, #[code.u-nowrap -PRON-], which is used as the lemma for
+    | all personal pronouns.
+
 p
     | The lemmatization data is taken from
     | #[+a("https://wordnet.princeton.edu") WordNet]. However, we also add a
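
Note on the behaviour the new aside describes: the sketch below is a toy illustration of the convention, not spaCy's lemmatizer. The pronoun set and the `lemma_of` helper are assumptions made up for this example. Non-pronoun tokens are only lowercased here, with no real lemmatization.

```python
# Toy illustration only: this is NOT spaCy's implementation.
# PERSONAL_PRONOUNS and lemma_of are hypothetical names for this sketch.
PRON_LEMMA = "-PRON-"  # the placeholder symbol named in the aside

PERSONAL_PRONOUNS = {
    "i", "me", "you", "he", "him", "she", "her",
    "it", "we", "us", "they", "them",
}


def lemma_of(token: str) -> str:
    """Return the shared placeholder for personal pronouns.

    Any other token is returned lowercased; real lemmatization of
    nouns and verbs is out of scope for this sketch.
    """
    lowered = token.lower()
    if lowered in PERSONAL_PRONOUNS:
        return PRON_LEMMA
    return lowered


lemmas = [lemma_of(t) for t in ["She", "gave", "me", "it"]]
print(lemmas)
```

Every personal pronoun maps to the same symbol, so the sketch never has to decide between "I", "it", or "he" as the base form, which is the ambiguity the aside points out.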