spaCy/website/usage/_spacy-101/_pos-deps.jade
Ines Montani 49cee4af92
2018-04-29 02:06:46 +02:00


//- 💫 DOCS > USAGE > SPACY 101 > POS TAGGING AND DEPENDENCY PARSING
p
    | After tokenization, spaCy can #[strong parse] and #[strong tag] a
    | given #[code Doc]. This is where the statistical model comes in, which
    | enables spaCy to #[strong make a prediction] of which tag or label most
    | likely applies in this context. A model consists of binary data and is
    | produced by showing a system enough examples for it to make predictions
    | that generalise across the language. For example, a word following
    | "the" in English is most likely a noun.
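p
    | As a rough illustration of how predictions can come from examples, the
    | sketch below counts which tag follows each word in a few hand-tagged
    | sentences and predicts the most frequent one. The training data and the
    | #[code count_following_tags] and #[code predict_next_tag] helpers are
    | hypothetical and far simpler than spaCy's statistical model.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged examples as (word, tag) pairs; not real spaCy data.
TRAINING = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("old", "ADJ"), ("house", "NOUN")],
    [("she", "PRON"), ("saw", "VERB"), ("the", "DET"), ("startup", "NOUN")],
]


def count_following_tags(sentences):
    """Count which tags follow each word across all example sentences."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # Pair every token with the token that comes right after it.
        for (word, _), (_, next_tag) in zip(sentence, sentence[1:]):
            counts[word.lower()][next_tag] += 1
    return counts


def predict_next_tag(counts, previous_word):
    """Return the tag seen most often after previous_word, or None if unseen."""
    seen = counts.get(previous_word.lower())
    if not seen:
        return None
    return seen.most_common(1)[0][0]


counts = count_following_tags(TRAINING)
print(predict_next_tag(counts, "the"))
```

p
    | With this toy data, most words following "the" are nouns, so that is
    | the tag the sketch predicts. A real model weighs far more context than
    | the single preceding word.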
p
    | Linguistic annotations are available as
    | #[+api("token#attributes") #[code Token] attributes]. Like many NLP
    | libraries, spaCy #[strong encodes all strings to hash values] to reduce
    | memory usage and improve efficiency. So to get the readable string
    | representation of an attribute, we need to add an underscore #[code _]
    | to its name:
+code-exec.
    import spacy

    # Load the small English model and process the example sentence
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
    # Print the annotations predicted for each token
    for token in doc:
        print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
              token.shape_, token.is_alpha, token.is_stop)
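p
    | To picture why an attribute such as #[code token.pos] holds an integer
    | while #[code token.pos_] holds the readable string, here is a minimal
    | sketch of a two-way string table. #[code ToyStringStore] is a
    | hypothetical class written for this illustration, and BLAKE2b only
    | stands in for whatever hash function spaCy uses internally.

```python
import hashlib


class ToyStringStore:
    """Hypothetical two-way lookup between strings and 64-bit integer IDs."""

    def __init__(self):
        self._by_hash = {}

    def add(self, string):
        # Assumption: BLAKE2b is a stand-in, not spaCy's actual hash function.
        digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
        key = int.from_bytes(digest, "big")
        self._by_hash[key] = string
        return key

    def __getitem__(self, key):
        # Resolve an integer ID back to the readable string.
        return self._by_hash[key]


store = ToyStringStore()
pos_id = store.add("PROPN")   # comparable to the integer in token.pos
print(pos_id, store[pos_id])  # the lookup is comparable to token.pos_
```

p
    | Storing each string once and passing integers around is what keeps
    | memory usage low. The underscore attributes simply perform the lookup
    | back to text.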
+aside
    | #[strong Text:] The original word text.#[br]
    | #[strong Lemma:] The base form of the word.#[br]
    | #[strong POS:] The simple part-of-speech tag.#[br]
    | #[strong Tag:] The detailed part-of-speech tag.#[br]
    | #[strong Dep:] Syntactic dependency, i.e. the relation between tokens.#[br]
    | #[strong Shape:] The word shape: capitalisation, punctuation, digits.#[br]
    | #[strong is alpha:] Does the token consist of alphabetic characters?#[br]
    | #[strong is stop:] Is the token part of a stop list, i.e. the most common
    | words of the language?#[br]
+table(["Text", "Lemma", "POS", "Tag", "Dep", "Shape", "alpha", "stop"])
    - var style = [0, 0, 1, 1, 1, 1, 1, 1]
    +annotation-row(["Apple", "apple", "PROPN", "NNP", "nsubj", "Xxxxx", true, false], style)
    +annotation-row(["is", "be", "VERB", "VBZ", "aux", "xx", true, true], style)
    +annotation-row(["looking", "look", "VERB", "VBG", "ROOT", "xxxx", true, false], style)
    +annotation-row(["at", "at", "ADP", "IN", "prep", "xx", true, true], style)
    +annotation-row(["buying", "buy", "VERB", "VBG", "pcomp", "xxxx", true, false], style)
    +annotation-row(["U.K.", "u.k.", "PROPN", "NNP", "compound", "X.X.", false, false], style)
    +annotation-row(["startup", "startup", "NOUN", "NN", "dobj", "xxxx", true, false], style)
    +annotation-row(["for", "for", "ADP", "IN", "prep", "xxx", true, true], style)
    +annotation-row(["$", "$", "SYM", "$", "quantmod", "$", false, false], style)
    +annotation-row(["1", "1", "NUM", "CD", "compound", "d", false, false], style)
    +annotation-row(["billion", "billion", "NUM", "CD", "pobj", "xxxx", true, false], style)
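p
    | The values in the Shape column follow a simple pattern that can be
    | approximated in a few lines. The #[code word_shape] function below is
    | a hypothetical sketch, not spaCy's own code: it assumes letters map to
    | #[code X] or #[code x], digits map to #[code d], other characters are
    | kept, and runs of the same shape character are capped at four.

```python
def word_shape(text):
    """Approximate a word shape: X/x for letters, d for digits, rest unchanged."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char
        # Track how long the current run of identical shape characters is.
        run = run + 1 if shape_char == last else 1
        last = shape_char
        # Assumption: runs longer than four are truncated, as the table suggests.
        if run <= 4:
            shape.append(shape_char)
    return "".join(shape)


for word in ["Apple", "looking", "U.K.", "$", "1", "billion"]:
    print(word, word_shape(word))
```

p
    | Capping long runs means that words of different lengths, such as
    | "looking" and "billion", share one shape.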
+aside("Tip: Understanding tags and labels")
    | Most of the tags and labels look pretty abstract, and they vary between
    | languages. #[code spacy.explain()] will show you a short description.
    | For example, #[code spacy.explain("VBZ")] returns "verb, 3rd person
    | singular present".
p
    | Using spaCy's built-in #[+a("/usage/visualizers") displaCy visualizer],
    | here's what our example sentence and its dependencies look like:
+codepen("030d1e4dfa6256cad8fdd59e6aefecbe", 460)