
//- 💫 DOCS > USAGE > SPACY 101 > POS TAGGING AND DEPENDENCY PARSING

p
    |  After tokenization, spaCy can #[strong parse] and #[strong tag] a
    |  given #[code Doc]. This is where the statistical model comes in, which
    |  enables spaCy to #[strong make a prediction] of which tag or label most
    |  likely applies in this context. A model consists of binary data and is
    |  produced by showing a system enough examples for it to make predictions
    |  that generalise across the language – for example, a word following "the"
    |  in English is most likely a noun.

p
    |  Linguistic annotations are available as
    |  #[+api("token#attributes") #[code Token] attributes]. Like many NLP
    |  libraries, spaCy #[strong encodes all strings to hash values] to reduce
    |  memory usage and improve efficiency. So to get the readable string
    |  representation of an attribute, we need to append an underscore
    |  #[code _] to its name, for example #[code token.pos_] instead of
    |  #[code token.pos]:

+code-exec.
    import spacy

    nlp = spacy.load('en_core_web_sm')
    doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

    for token in doc:
        print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
              token.shape_, token.is_alpha, token.is_stop)

+aside
    |  #[strong Text:] The original word text.#[br]
    |  #[strong Lemma:] The base form of the word.#[br]
    |  #[strong POS:] The simple part-of-speech tag.#[br]
    |  #[strong Tag:] The detailed part-of-speech tag.#[br]
    |  #[strong Dep:] Syntactic dependency, i.e. the relation between tokens.#[br]
    |  #[strong Shape:] The word shape – capitalisation, punctuation, digits.#[br]
    |  #[strong is alpha:] Does the token consist of alphabetic characters?#[br]
    |  #[strong is stop:] Is the token part of a stop list, i.e. the most common
    |  words of the language?#[br]
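
p
    |  The #[strong Shape] attribute can be approximated in a few lines of
    |  plain Python. The sketch below is an illustration only:
    |  #[code word_shape_sketch] is a hypothetical helper, not part of spaCy's
    |  API. It assumes that letters map to #[code X] or #[code x], digits map
    |  to #[code d], other characters stay unchanged, and runs of the same
    |  shape character are capped at four.

+code-exec.
    def word_shape_sketch(text):
        """Approximate a token's shape (hypothetical helper, not spaCy's API)."""
        shape = []
        last = ""
        run = 0
        for char in text:
            if char.isalpha():
                # Letters become "X" or "x" depending on their case
                shape_char = "X" if char.isupper() else "x"
            elif char.isdigit():
                shape_char = "d"
            else:
                # Punctuation and symbols are kept as they are
                shape_char = char
            run = run + 1 if shape_char == last else 1
            last = shape_char
            # Assumption: a run of identical shape characters is capped at 4
            if run <= 4:
                shape.append(shape_char)
        return "".join(shape)

    for word in ["Apple", "looking", "U.K.", "$", "1", "billion"]:
        print(word, word_shape_sketch(word))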

+table(["Text", "Lemma", "POS", "Tag", "Dep", "Shape", "alpha", "stop"])
    - var style = [0, 0, 1, 1, 1, 1, 1, 1]
    +annotation-row(["Apple", "apple", "PROPN", "NNP", "nsubj", "Xxxxx", true, false], style)
    +annotation-row(["is", "be", "VERB", "VBZ", "aux", "xx", true, true], style)
    +annotation-row(["looking", "look", "VERB", "VBG", "ROOT", "xxxx", true, false], style)
    +annotation-row(["at", "at", "ADP", "IN", "prep", "xx", true, true], style)
    +annotation-row(["buying", "buy", "VERB", "VBG", "pcomp", "xxxx", true, false], style)
    +annotation-row(["U.K.", "u.k.", "PROPN", "NNP", "compound", "X.X.", false, false], style)
    +annotation-row(["startup", "startup", "NOUN", "NN", "dobj", "xxxx", true, false], style)
    +annotation-row(["for", "for", "ADP", "IN", "prep", "xxx", true, true], style)
    +annotation-row(["$", "$", "SYM", "$", "quantmod", "$", false, false], style)
    +annotation-row(["1", "1", "NUM", "CD", "compound", "d", false, false], style)
    +annotation-row(["billion", "billion", "NUM", "CD", "pobj", "xxxx", true, false], style)
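
p
    |  The strings in the table are the underscore forms of the attributes.
    |  As a minimal sketch of what sits behind them, the example below
    |  compares a few attributes with and without the underscore and looks
    |  the values up in the vocabulary's string store. It uses a blank
    |  #[code English] pipeline (an assumption made here so that no
    |  statistical model has to be loaded), which means only lexical
    |  attributes such as #[code orth], #[code lower] and #[code shape]
    |  are filled in.

+code-exec.
    from spacy.lang.en import English

    # Blank English pipeline: tokenizer only, no statistical model required
    nlp = English()
    doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
    apple = doc[0]

    # Without the underscore, these attributes are integer hash values
    print(apple.orth, apple.lower, apple.shape)
    # With the underscore, they are the readable strings
    print(apple.orth_, apple.lower_, apple.shape_)

    # The string store maps in both directions: hash to string, string to hash
    strings = nlp.vocab.strings
    print(strings[apple.lower] == apple.lower_)
    print(strings[u'apple'] == apple.lower)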

+aside("Tip: Understanding tags and labels")
    |  Most of the tags and labels look pretty abstract, and they vary between
    |  languages. #[code spacy.explain()] will show you a short description –
    |  for example, #[code spacy.explain("VBZ")] returns "verb, 3rd person
    |  singular present".
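
p
    |  To look up several labels at once, #[code spacy.explain()] can be
    |  called in a loop. The labels below are taken from the example
    |  sentence; the last one is a made-up label, included to show that
    |  unknown terms return #[code None] rather than raising an error.

+code-exec.
    import spacy

    # Labels from the example sentence, plus one made-up label
    labels = ["PROPN", "VBZ", "nsubj", "pobj", "not-a-real-label"]

    for label in labels:
        # spacy.explain() returns None if the term is not in its glossary
        print(label, "->", spacy.explain(label))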

p
    |  Using spaCy's built-in #[+a("/usage/visualizers") displaCy visualizer],
    |  here's what our example sentence and its dependencies look like:

+codepen("030d1e4dfa6256cad8fdd59e6aefecbe", 460)