//- 💫 DOCS > USAGE > SPACY 101 > POS TAGGING AND DEPENDENCY PARSING

p
    | After tokenization, spaCy can #[strong parse] and #[strong tag] a
    | given #[code Doc]. This is where the statistical model comes in, which
    | enables spaCy to #[strong make a prediction] of which tag or label most
    | likely applies in this context. A model consists of binary data and is
    | produced by showing a system enough examples for it to make predictions
    | that generalise across the language – for example, a word following "the"
    | in English is most likely a noun.

p
    | Linguistic annotations are available as
    | #[+api("token#attributes") #[code Token] attributes]. Like many NLP
    | libraries, spaCy #[strong encodes all strings to hash values] to reduce
    | memory usage and improve efficiency. So to get the readable string
    | representation of an attribute, we need to add an underscore #[code _]
    | to its name:

+code-exec.
    import spacy

    # Load the small English model and process the example sentence
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

    # Attributes ending in an underscore return readable strings
    for token in doc:
        print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
              token.shape_, token.is_alpha, token.is_stop)

+aside
    | #[strong Text:] The original word text.#[br]
    | #[strong Lemma:] The base form of the word.#[br]
    | #[strong POS:] The simple part-of-speech tag.#[br]
    | #[strong Tag:] The detailed part-of-speech tag.#[br]
    | #[strong Dep:] Syntactic dependency, i.e. the relation between tokens.#[br]
    | #[strong Shape:] The word shape – capitalisation, punctuation, digits.#[br]
    | #[strong is alpha:] Does the token consist of alphabetic characters?#[br]
    | #[strong is stop:] Is the token part of a stop list, i.e. the most common
    | words of the language?#[br]

+table(["Text", "Lemma", "POS", "Tag", "Dep", "Shape", "alpha", "stop"])
    - var style = [0, 0, 1, 1, 1, 1, 1, 1]
    +annotation-row(["Apple", "apple", "PROPN", "NNP", "nsubj", "Xxxxx", true, false], style)
    +annotation-row(["is", "be", "VERB", "VBZ", "aux", "xx", true, true], style)
    +annotation-row(["looking", "look", "VERB", "VBG", "ROOT", "xxxx", true, false], style)
    +annotation-row(["at", "at", "ADP", "IN", "prep", "xx", true, true], style)
    +annotation-row(["buying", "buy", "VERB", "VBG", "pcomp", "xxxx", true, false], style)
    +annotation-row(["U.K.", "u.k.", "PROPN", "NNP", "compound", "X.X.", false, false], style)
    +annotation-row(["startup", "startup", "NOUN", "NN", "dobj", "xxxx", true, false], style)
    +annotation-row(["for", "for", "ADP", "IN", "prep", "xxx", true, true], style)
    +annotation-row(["$", "$", "SYM", "$", "quantmod", "$", false, false], style)
    +annotation-row(["1", "1", "NUM", "CD", "compound", "d", false, false], style)
    +annotation-row(["billion", "billion", "NUM", "CD", "pobj", "xxxx", true, false], style)

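p
    | The #[strong Shape] column in the table can be approximated in a few
    | lines of plain Python. The sketch below is an illustration only: the
    | function name #[code toy_word_shape] and its #[code max_run] parameter
    | are hypothetical and not part of spaCy's API. It assumes that letters
    | map to #[code X] or #[code x], digits map to #[code d], other
    | characters are kept, and runs of the same symbol are cut off after
    | four repeats.

```python
def toy_word_shape(text, max_run=4):
    """Approximate a word shape (hypothetical helper, not spaCy's API)."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            # Letters keep only their case information.
            symbol = "X" if char.isupper() else "x"
        elif char.isdigit():
            symbol = "d"
        else:
            # Punctuation and symbols are kept as they are.
            symbol = char
        # Count consecutive repeats of the same shape symbol.
        run = run + 1 if symbol == last else 1
        last = symbol
        # Keep at most max_run repeats so that long words collapse.
        if run <= max_run:
            shape.append(symbol)
    return "".join(shape)


for word in ["Apple", "looking", "U.K.", "$", "1", "billion"]:
    print(word, toy_word_shape(word))
```

p
    | For the words in the example sentence, this toy function gives the
    | same shapes as the table. spaCy's own implementation may treat other
    | inputs differently.
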
+aside("Tip: Understanding tags and labels")
    | Most of the tags and labels look pretty abstract, and they vary between
    | languages. #[code spacy.explain()] will show you a short description –
    | for example, #[code spacy.explain("VBZ")] returns "verb, 3rd person
    | singular present".

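p
    | The relationship between a hash-valued attribute and its underscore
    | counterpart, for example #[code token.pos] and #[code token.pos_],
    | can be pictured as a two-way lookup table. The sketch below is a toy
    | model under stated assumptions, not spaCy's implementation: the class
    | #[code ToyStringStore] is hypothetical, and it derives IDs from a
    | SHA-1 digest from the standard library purely for illustration.

```python
import hashlib


class ToyStringStore:
    """Hypothetical two-way string/ID table (not spaCy's implementation)."""

    def __init__(self):
        self._strings = {}

    def add(self, string):
        # Derive a stable 64-bit integer ID from the string's UTF-8 bytes.
        digest = hashlib.sha1(string.encode("utf8")).digest()
        key = int.from_bytes(digest[:8], "big")
        self._strings[key] = string
        return key

    def __getitem__(self, key):
        # Resolve an integer ID back to the readable string.
        return self._strings[key]


store = ToyStringStore()
pos_id = store.add("PROPN")   # plays the role of token.pos
pos_text = store[pos_id]      # plays the role of token.pos_
print(pos_id, pos_text)
```

p
    | Storing each string once and passing integers around keeps comparisons
    | cheap. The readable form is looked up only when it is needed.
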
p
    | Using spaCy's built-in #[+a("/usage/visualizers") displaCy visualizer],
    | here's what our example sentence and its dependencies look like:

+codepen("030d1e4dfa6256cad8fdd59e6aefecbe", 460)