spaCy/website/usage/spacy-101.jade

320 lines
13 KiB
Plaintext
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

//- 💫 DOCS > USAGE > SPACY 101
include ../_includes/_mixins
p
| Whether you're new to spaCy, or just want to brush up on some
| NLP basics and implementation details this page should have you covered.
| Each section will explain one of spaCy's features in simple terms and
| with examples or illustrations. Some sections will also reappear across
| the usage guides as a quick introduction.
+aside("Help us improve the docs")
| Did you spot a mistake or come across explanations that
| are unclear? We always appreciate improvement
| #[+a(gh("spaCy") + "/issues") suggestions] or
| #[+a(gh("spaCy") + "/pulls") pull requests]. You can find a "Suggest
| edits" link at the bottom of each page that points you to the source.
+h(2, "whats-spacy") What's spaCy?
+grid.o-no-block
+grid-col("half")
p
| spaCy is a #[strong free, open-source library] for advanced
| #[strong Natural Language Processing] (NLP) in Python.
p
| If you're working with a lot of text, you'll eventually want to
| know more about it. For example, what's it about? What do the
| words mean in context? Who is doing what to whom? What companies
| and products are mentioned? Which texts are similar to each other?
p
| spaCy is designed specifically for #[strong production use] and
| helps you build applications that process and "understand"
| large volumes of text. It can be used to build
| #[strong information extraction] or
| #[strong natural language understanding] systems, or to
| pre-process text for #[strong deep learning].
+table-of-contents
+item #[+a("#features") Features]
+item #[+a("#annotations") Linguistic annotations]
+item #[+a("#annotations-token") Tokenization]
+item #[+a("#annotations-pos-deps") POS tags and dependencies]
+item #[+a("#annotations-ner") Named entities]
+item #[+a("#vectors-similarity") Word vectors and similarity]
+item #[+a("#pipelines") Pipelines]
+item #[+a("#vocab") Vocab, hashes and lexemes]
+item #[+a("#serialization") Serialization]
+item #[+a("#training") Training]
+item #[+a("#language-data") Language data]
+item #[+a("#lightning-tour") Lightning tour]
+item #[+a("#architecture") Architecture]
+item #[+a("#community") Community & FAQ]
+h(3, "what-spacy-isnt") What spaCy isn't
+list
+item #[strong spaCy is not a platform or "an API"].
| Unlike a platform, spaCy does not provide a software as a service, or
| a web application. It's an open-source library designed to help you
| build NLP applications, not a consumable service.
+item #[strong spaCy is not an out-of-the-box chat bot engine].
| While spaCy can be used to power conversational applications, it's
| not designed specifically for chat bots, and only provides the
| underlying text processing capabilities.
+item #[strong spaCy is not research software].
| It's built on the latest research, but it's designed to get
| things done. This leads to fairly different design decisions than
| #[+a("https://github./nltk/nltk") NLTK]
| or #[+a("https://stanfordnlp.github.io/CoreNLP/") CoreNLP], which were
| created as platforms for teaching and research. The main difference
| is that spaCy is integrated and opinionated. spaCy tries to avoid asking
| the user to choose between multiple algorithms that deliver equivalent
| functionality. Keeping the menu small lets spaCy deliver generally better
| performance and developer experience.
+item #[strong spaCy is not a company].
| It's an open-source library. Our company publishing spaCy and other
| software is called #[+a(COMPANY_URL, true) Explosion AI].
+section("features")
+h(2, "features") Features
p
| In the documentation, you'll come across mentions of spaCy's
| features and capabilities. Some of them refer to linguistic concepts,
| while others are related to more general machine learning
| functionality.
+table(["Name", "Description"])
+row
+cell #[strong Tokenization]
+cell Segmenting text into words, punctuations marks etc.
+row
+cell #[strong Part-of-speech] (POS) #[strong Tagging]
+cell Assigning word types to tokens, like verb or noun.
+row
+cell #[strong Dependency Parsing]
+cell
| Assigning syntactic dependency labels, describing the
| relations between individual tokens, like subject or object.
+row
+cell #[strong Lemmatization]
+cell
| Assigning the base forms of words. For example, the lemma of
| "was" is "be", and the lemma of "rats" is "rat".
+row
+cell #[strong Sentence Boundary Detection] (SBD)
+cell Finding and segmenting individual sentences.
+row
+cell #[strong Named Entity Recognition] (NER)
+cell
| Labelling named "real-world" objects, like persons, companies
| or locations.
+row
+cell #[strong Similarity]
+cell
| Comparing words, text spans and documents and how similar
| they are to each other.
+row
+cell #[strong Text Classification]
+cell
| Assigning categories or labels to a whole document, or parts
| of a document.
+row
+cell #[strong Rule-based Matching]
+cell
| Finding sequences of tokens based on their texts and
| linguistic annotations, similar to regular expressions.
+row
+cell #[strong Training]
+cell Updating and improving a statistical model's predictions.
+row
+cell #[strong Serialization]
+cell Saving objects to files or byte strings.
+h(3, "statistical-models") Statistical models
p
| While some of spaCy's features work independently, others require
| #[+a("/models") statistical models] to be loaded, which enable spaCy
| to #[strong predict] linguistic annotations for example,
| whether a word is a verb or a noun. spaCy currently offers statistical
| models for #[strong #{MODEL_LANG_COUNT} languages], which can be
| installed as individual Python modules. Models can differ in size,
| speed, memory usage, accuracy and the data they include. The model
| you choose always depends on your use case and the texts you're
| working with. For a general-purpose use case, the small, default
| models are always a good start. They typically include the following
| components:
+list
+item
| #[strong Binary weights] for the part-of-speech tagger,
| dependency parser and named entity recognizer to predict those
| annotations in context.
+item
| #[strong Lexical entries] in the vocabulary, i.e. words and their
| context-independent attributes like the shape or spelling.
+item
| #[strong Word vectors], i.e. multi-dimensional meaning
| representations of words that let you determine how similar they
| are to each other.
+item
| #[strong Configuration] options, like the language and
| processing pipeline settings, to put spaCy in the correct state
| when you load in the model.
+h(2, "annotations") Linguistic annotations
p
| spaCy provides a variety of linguistic annotations to give you
| #[strong insights into a text's grammatical structure]. This
| includes the word types, like the parts of speech, and how the words
| are related to each other. For example, if you're analysing text, it
| makes a huge difference whether a noun is the subject of a sentence,
| or the object or whether "google" is used as a verb, or refers to
| the website or company in a specific context.
+aside-code("Loading models", "bash", "$").
python -m spacy download en
>>> import spacy
>>> nlp = spacy.load('en')
p
| Once you've #[+a("/usage/models") downloaded and installed] a model,
| you can load it via #[+api("spacy#load") #[code spacy.load()]]. This will
| return a #[code Language] object contaning all components and data needed
| to process text. We usually call it #[code nlp]. Calling the #[code nlp]
| object on a string of text will return a processed #[code Doc]:
+code.
import spacy
nlp = spacy.load('en')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
p
| Even though a #[code Doc] is processed e.g. split into individual words
| and annotated it still holds #[strong all information of the original text],
| like whitespace characters. You can always get the offset of a token into the
| original string, or reconstruct the original by joining the tokens and their
| trailing whitespace. This way, you'll never lose any information
| when processing text with spaCy.
+h(3, "annotations-token") Tokenization
include _spacy-101/_tokenization
+infobox
| To learn more about how spaCy's tokenization rules work in detail,
| how to #[strong customise and replace] the default tokenizer and how to
| #[strong add language-specific data], see the usage guides on
| #[+a("/usage/adding-languages") adding languages] and
| #[+a("/usage/linguistic-features#tokenization") customising the tokenizer].
+h(3, "annotations-pos-deps") Part-of-speech tags and dependencies
+tag-model("dependency parse")
include _spacy-101/_pos-deps
+infobox
| To learn more about #[strong part-of-speech tagging] and rule-based
| morphology, and how to #[strong navigate and use the parse tree]
| effectively, see the usage guides on
| #[+a("/usage/linguistic-features#pos-tagging") part-of-speech tagging] and
| #[+a("/usage/linguistic-features#dependency-parse") using the dependency parse].
+h(3, "annotations-ner") Named Entities
+tag-model("named entities")
include _spacy-101/_named-entities
+infobox
| To learn more about entity recognition in spaCy, how to
| #[strong add your own entities] to a document and how to
| #[strong train and update] the entity predictions of a model, see the
| usage guides on
| #[+a("/usage/linguistic-features#named-entities") named entity recognition] and
| #[+a("/usage/training#ner") training the named entity recognizer].
+h(2, "vectors-similarity") Word vectors and similarity
+tag-model("vectors")
include _spacy-101/_similarity
include _spacy-101/_word-vectors
+infobox
| To learn more about word vectors, how to #[strong customise them] and
| how to load #[strong your own vectors] into spaCy, see the usage
| guide on
| #[+a("/usage/vectors-similarity") using word vectors and semantic similarities].
+h(2, "pipelines") Pipelines
include _spacy-101/_pipelines
+infobox
| To learn more about #[strong how processing pipelines work] in detail,
| how to enable and disable their components, and how to
| #[strong create your own], see the usage guide on
| #[+a("/usage/processing-pipelines") language processing pipelines].
+h(2, "vocab") Vocab, hashes and lexemes
include _spacy-101/_vocab
+h(2, "serialization") Serialization
include _spacy-101/_serialization
+infobox
| To learn more about how to #[strong save and load your own models],
| see the usage guide on
| #[+a("/usage/training#saving-loading") saving and loading].
+h(2, "training") Training
include _spacy-101/_training
+infobox
| To learn more about #[strong training and updating] models, how to create
| training data and how to improve spaCy's named entity recognition models,
| see the usage guides on #[+a("/usage/training") training].
+h(2, "language-data") Language data
include _spacy-101/_language-data
+infobox
| To learn more about the individual components of the language data and
| how to #[strong add a new language] to spaCy in preparation for training
| a language model, see the usage guide on
| #[+a("/usage/adding-languages") adding languages].
+section("lightning-tour")
+h(2, "lightning-tour") Lightning tour
include _spacy-101/_lightning-tour
+section("architecture")
+h(2, "architecture") Architecture
include _spacy-101/_architecture
+section("community-faq")
+h(2, "community") Community & FAQ
include _spacy-101/_community-faq