After tokenization, spaCy can **parse** and **tag** a given `Doc`. This is where
the statistical model comes in, which enables spaCy to **make a prediction** of
which tag or label most likely applies in this context. A model consists of
binary data and is produced by showing a system enough examples for it to make
predictions that generalize across the language – for example, a word following
"the" in English is most likely a noun.

Linguistic annotations are available as
[`Token` attributes](/api/token#attributes). Like many NLP libraries, spaCy
**encodes all strings to hash values** to reduce memory usage and improve
efficiency. So to get the readable string representation of an attribute, we
need to add an underscore `_` to its name:

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)
```

> - **Text:** The original word text.
> - **Lemma:** The base form of the word.
> - **POS:** The simple [UPOS](https://universaldependencies.org/docs/u/pos/)
>   part-of-speech tag.
> - **Tag:** The detailed part-of-speech tag.
> - **Dep:** Syntactic dependency, i.e. the relation between tokens.
> - **Shape:** The word shape – capitalization, punctuation, digits.
> - **is alpha:** Does the token consist of alphabetic characters?
> - **is stop:** Is the token part of a stop list, i.e. the most common words of
>   the language?

| Text    | Lemma   | POS     | Tag   | Dep        | Shape   | alpha   | stop    |
| ------- | ------- | ------- | ----- | ---------- | ------- | ------- | ------- |
| Apple   | apple   | `PROPN` | `NNP` | `nsubj`    | `Xxxxx` | `True`  | `False` |
| is      | be      | `AUX`   | `VBZ` | `aux`      | `xx`    | `True`  | `True`  |
| looking | look    | `VERB`  | `VBG` | `ROOT`     | `xxxx`  | `True`  | `False` |
| at      | at      | `ADP`   | `IN`  | `prep`     | `xx`    | `True`  | `True`  |
| buying  | buy     | `VERB`  | `VBG` | `pcomp`    | `xxxx`  | `True`  | `False` |
| U.K.    | u.k.    | `PROPN` | `NNP` | `compound` | `X.X.`  | `False` | `False` |
| startup | startup | `NOUN`  | `NN`  | `dobj`     | `xxxx`  | `True`  | `False` |
| for     | for     | `ADP`   | `IN`  | `prep`     | `xxx`   | `True`  | `True`  |
| \$      | \$      | `SYM`   | `$`   | `quantmod` | `$`     | `False` | `False` |
| 1       | 1       | `NUM`   | `CD`  | `compound` | `d`     | `False` | `False` |
| billion | billion | `NUM`   | `CD`  | `pobj`     | `xxxx`  | `True`  | `False` |

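The hash encoding described earlier can be inspected directly. The following is a minimal sketch, assuming a blank English pipeline created with `spacy.blank("en")` (tokenizer only, so no trained model is required). It compares the integer and string forms of two lexical attributes and looks both up through `nlp.vocab.strings`; the variable names are illustrative only.

```python
import spacy

# Assumption: a blank English pipeline is enough here, because `orth` and
# `shape` are lexical attributes that do not need a statistical model.
nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

token = doc[0]

# Without the underscore, the attribute is an integer hash.
print(token.orth, token.shape)

# With the underscore, the same attribute is a readable string.
print(token.orth_, token.shape_)

# The vocabulary's string store maps in both directions.
strings = nlp.vocab.strings
hash_to_text = strings[token.orth]   # hash -> string
text_to_hash = strings["Apple"]      # string -> hash
print(hash_to_text, text_to_hash == token.orth)
```

Attributes that depend on the statistical model, such as `pos_`, `tag_` and `dep_`, follow the same naming pattern but are only filled in when a trained pipeline has processed the `Doc`.
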
> #### Tip: Understanding tags and labels
>
> Most of the tags and labels look pretty abstract, and they vary between
> languages. `spacy.explain` will show you a short description – for example,
> `spacy.explain("VBZ")` returns "verb, 3rd person singular present".

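As a small illustration of the tip above, the sketch below looks up several labels from the example table in one go. It relies only on `spacy.explain`, which reads from spaCy's built-in glossary and does not need a loaded model; the `labels` list and `descriptions` dictionary are illustrative names, not part of the library.

```python
import spacy

# Labels taken from the example table: one fine-grained tag, one coarse
# part-of-speech tag and one dependency label.
labels = ["VBZ", "PROPN", "nsubj"]

# spacy.explain returns a short description, or None for unknown labels.
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label:>6}  {description}")
```
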
Using spaCy's built-in [displaCy visualizer](/usage/visualizers), here's what
our example sentence and its dependencies look like:

import DisplaCyLongHtml from 'images/displacy-long.html'; import { Iframe } from
'components/embed'

<Iframe title="displaCy visualization of dependencies and entities" html={DisplaCyLongHtml} height={450} />
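
To get a feel for how such a visualization is produced, here is a minimal sketch using displaCy's manual mode, which accepts hand-written words and arcs instead of a parsed `Doc` and therefore runs without a trained pipeline. The three words and their tags and labels are copied from the table above; attaching both arcs to "looking" is an assumption made for this illustration, since the table lists labels but not heads.

```python
from spacy import displacy

# Hand-written parse for the first three tokens of the example sentence.
# Assumption: "Apple" (nsubj) and "is" (aux) both attach to "looking".
manual_parse = {
    "words": [
        {"text": "Apple", "tag": "PROPN"},
        {"text": "is", "tag": "AUX"},
        {"text": "looking", "tag": "VERB"},
    ],
    "arcs": [
        {"start": 0, "end": 2, "label": "nsubj", "dir": "left"},
        {"start": 1, "end": 2, "label": "aux", "dir": "left"},
    ],
}

# manual=True skips parsing; jupyter=False makes render return the markup
# as a string instead of trying to display it in a notebook.
svg = displacy.render(manual_parse, style="dep", manual=True, jupyter=False)

print(svg[:60])
```

With a trained pipeline loaded, the same function can be called on a processed `Doc` directly, for example `displacy.render(doc, style="dep")`.
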