mirror of
https://github.com/explosion/spaCy.git
synced 2024-11-11 04:08:09 +03:00
49cee4af92
* Integrate Python kernel via Binder * Add live model test for languages with examples * Update docs and code examples * Adjust margin (if not bootstrapped) * Add binder version to global config * Update terminal and executable code mixins * Pass attributes through infobox and section * Hide v-cloak * Fix example * Take out model comparison for now * Add meta text for compat * Remove chart.js dependency * Tidy up and simplify JS and port big components over to Vue * Remove chartjs example * Add Twitter icon * Add purple stylesheet option * Add utility for hand cursor (special cases only) * Add transition classes * Add small option for section * Add thumb object for small round thumbnail images * Allow unset code block language via "none" value (workaround to still allow unset language to default to DEFAULT_SYNTAX) * Pass through attributes * Add syntax highlighting definitions for Julia, R and Docker * Add website icon * Remove user survey from navigation * Don't hide GitHub icon on small screens * Make top navigation scrollable on small screens * Remove old resources page and references to it * Add Universe * Add helper functions for better page URL and title * Update site description * Increment versions * Update preview images * Update mentions of resources * Fix image * Fix social images * Fix problem with cover sizing and floats * Add divider and move badges into heading * Add docstrings * Reference converting section * Add section on converting word vectors * Move converting section to custom section and fix formatting * Remove old fastText example * Move extensions content to own section Keep weird ID to not break permalinks for now (we don't want to rewrite URLs if not absolutely necessary) * Use better component example and add factories section * Add note on larger model * Use better example for non-vector * Remove similarity in context section Only works via small models with tensors so has always been kind of confusing * Add note on init-model command * Fix lightning tour examples and make excutable if possible * Add spacy train CLI section to train * Fix formatting and add video * Fix formatting * Fix textcat example description (resolves #2246) * Add dummy file to try resolve conflict * Delete dummy file * Tidy up [ci skip] * Ensure sufficient height of loading container * Add loading animation to universe * Update Thebelab build and use better startup message * Fix asset versioning * Fix typo [ci skip] * Add note on project idea label
322 lines
13 KiB
Plaintext
322 lines
13 KiB
Plaintext
//- 💫 DOCS > USAGE > SPACY 101
|
||
|
||
include ../_includes/_mixins
|
||
|
||
p
|
||
| Whether you're new to spaCy, or just want to brush up on some
|
||
| NLP basics and implementation details – this page should have you covered.
|
||
| Each section will explain one of spaCy's features in simple terms and
|
||
| with examples or illustrations. Some sections will also reappear across
|
||
| the usage guides as a quick introduction.
|
||
|
||
+aside("Help us improve the docs")
|
||
| Did you spot a mistake or come across explanations that
|
||
| are unclear? We always appreciate improvement
|
||
| #[+a(gh("spaCy") + "/issues") suggestions] or
|
||
| #[+a(gh("spaCy") + "/pulls") pull requests]. You can find a "Suggest
|
||
| edits" link at the bottom of each page that points you to the source.
|
||
|
||
+h(2, "whats-spacy") What's spaCy?
|
||
|
||
+grid.o-no-block
|
||
+grid-col("half")
|
||
p
|
||
| spaCy is a #[strong free, open-source library] for advanced
|
||
| #[strong Natural Language Processing] (NLP) in Python.
|
||
|
||
p
|
||
| If you're working with a lot of text, you'll eventually want to
|
||
| know more about it. For example, what's it about? What do the
|
||
| words mean in context? Who is doing what to whom? What companies
|
||
| and products are mentioned? Which texts are similar to each other?
|
||
|
||
p
|
||
| spaCy is designed specifically for #[strong production use] and
|
||
| helps you build applications that process and "understand"
|
||
| large volumes of text. It can be used to build
|
||
| #[strong information extraction] or
|
||
| #[strong natural language understanding] systems, or to
|
||
| pre-process text for #[strong deep learning].
|
||
|
||
+table-of-contents
|
||
+item #[+a("#features") Features]
|
||
+item #[+a("#annotations") Linguistic annotations]
|
||
+item #[+a("#annotations-token") Tokenization]
|
||
+item #[+a("#annotations-pos-deps") POS tags and dependencies]
|
||
+item #[+a("#annotations-ner") Named entities]
|
||
+item #[+a("#vectors-similarity") Word vectors and similarity]
|
||
+item #[+a("#pipelines") Pipelines]
|
||
+item #[+a("#vocab") Vocab, hashes and lexemes]
|
||
+item #[+a("#serialization") Serialization]
|
||
+item #[+a("#training") Training]
|
||
+item #[+a("#language-data") Language data]
|
||
+item #[+a("#lightning-tour") Lightning tour]
|
||
+item #[+a("#architecture") Architecture]
|
||
+item #[+a("#community") Community & FAQ]
|
||
|
||
+h(3, "what-spacy-isnt") What spaCy isn't
|
||
|
||
+list
|
||
+item #[strong spaCy is not a platform or "an API"].
|
||
| Unlike a platform, spaCy does not provide a software as a service, or
|
||
| a web application. It's an open-source library designed to help you
|
||
| build NLP applications, not a consumable service.
|
||
+item #[strong spaCy is not an out-of-the-box chat bot engine].
|
||
| While spaCy can be used to power conversational applications, it's
|
||
| not designed specifically for chat bots, and only provides the
|
||
| underlying text processing capabilities.
|
||
+item #[strong spaCy is not research software].
|
||
| It's built on the latest research, but it's designed to get
|
||
| things done. This leads to fairly different design decisions than
|
||
| #[+a("https://github.com/nltk/nltk") NLTK]
|
||
| or #[+a("https://stanfordnlp.github.io/CoreNLP/") CoreNLP], which were
|
||
| created as platforms for teaching and research. The main difference
|
||
| is that spaCy is integrated and opinionated. spaCy tries to avoid asking
|
||
| the user to choose between multiple algorithms that deliver equivalent
|
||
| functionality. Keeping the menu small lets spaCy deliver generally better
|
||
| performance and developer experience.
|
||
+item #[strong spaCy is not a company].
|
||
| It's an open-source library. Our company publishing spaCy and other
|
||
| software is called #[+a(COMPANY_URL, true) Explosion AI].
|
||
|
||
+section("features")
|
||
+h(2, "features") Features
|
||
|
||
p
|
||
| In the documentation, you'll come across mentions of spaCy's
|
||
| features and capabilities. Some of them refer to linguistic concepts,
|
||
| while others are related to more general machine learning
|
||
| functionality.
|
||
|
||
+table(["Name", "Description"])
|
||
+row
|
||
+cell #[strong Tokenization]
|
||
+cell Segmenting text into words, punctuations marks etc.
|
||
|
||
+row
|
||
+cell #[strong Part-of-speech] (POS) #[strong Tagging]
|
||
+cell Assigning word types to tokens, like verb or noun.
|
||
|
||
+row
|
||
+cell #[strong Dependency Parsing]
|
||
+cell
|
||
| Assigning syntactic dependency labels, describing the
|
||
| relations between individual tokens, like subject or object.
|
||
|
||
+row
|
||
+cell #[strong Lemmatization]
|
||
+cell
|
||
| Assigning the base forms of words. For example, the lemma of
|
||
| "was" is "be", and the lemma of "rats" is "rat".
|
||
|
||
+row
|
||
+cell #[strong Sentence Boundary Detection] (SBD)
|
||
+cell Finding and segmenting individual sentences.
|
||
|
||
+row
|
||
+cell #[strong Named Entity Recognition] (NER)
|
||
+cell
|
||
| Labelling named "real-world" objects, like persons, companies
|
||
| or locations.
|
||
|
||
+row
|
||
+cell #[strong Similarity]
|
||
+cell
|
||
| Comparing words, text spans and documents and how similar
|
||
| they are to each other.
|
||
|
||
+row
|
||
+cell #[strong Text Classification]
|
||
+cell
|
||
| Assigning categories or labels to a whole document, or parts
|
||
| of a document.
|
||
|
||
+row
|
||
+cell #[strong Rule-based Matching]
|
||
+cell
|
||
| Finding sequences of tokens based on their texts and
|
||
| linguistic annotations, similar to regular expressions.
|
||
|
||
+row
|
||
+cell #[strong Training]
|
||
+cell Updating and improving a statistical model's predictions.
|
||
|
||
+row
|
||
+cell #[strong Serialization]
|
||
+cell Saving objects to files or byte strings.
|
||
|
||
+h(3, "statistical-models") Statistical models
|
||
|
||
p
|
||
| While some of spaCy's features work independently, others require
|
||
| #[+a("/models") statistical models] to be loaded, which enable spaCy
|
||
| to #[strong predict] linguistic annotations – for example,
|
||
| whether a word is a verb or a noun. spaCy currently offers statistical
|
||
| models for #[strong #{MODEL_LANG_COUNT} languages], which can be
|
||
| installed as individual Python modules. Models can differ in size,
|
||
| speed, memory usage, accuracy and the data they include. The model
|
||
| you choose always depends on your use case and the texts you're
|
||
| working with. For a general-purpose use case, the small, default
|
||
| models are always a good start. They typically include the following
|
||
| components:
|
||
|
||
+list
|
||
+item
|
||
| #[strong Binary weights] for the part-of-speech tagger,
|
||
| dependency parser and named entity recognizer to predict those
|
||
| annotations in context.
|
||
+item
|
||
| #[strong Lexical entries] in the vocabulary, i.e. words and their
|
||
| context-independent attributes like the shape or spelling.
|
||
+item
|
||
| #[strong Word vectors], i.e. multi-dimensional meaning
|
||
| representations of words that let you determine how similar they
|
||
| are to each other.
|
||
+item
|
||
| #[strong Configuration] options, like the language and
|
||
| processing pipeline settings, to put spaCy in the correct state
|
||
| when you load in the model.
|
||
|
||
+h(2, "annotations") Linguistic annotations
|
||
|
||
p
|
||
| spaCy provides a variety of linguistic annotations to give you
|
||
| #[strong insights into a text's grammatical structure]. This
|
||
| includes the word types, like the parts of speech, and how the words
|
||
| are related to each other. For example, if you're analysing text, it
|
||
| makes a huge difference whether a noun is the subject of a sentence,
|
||
| or the object – or whether "google" is used as a verb, or refers to
|
||
| the website or company in a specific context.
|
||
|
||
+aside-code("Loading models", "bash", "$").
|
||
python -m spacy download en_core_web_sm
|
||
>>> import spacy
|
||
>>> nlp = spacy.load('en_core_web_sm')
|
||
|
||
p
|
||
| Once you've #[+a("/usage/models") downloaded and installed] a model,
|
||
| you can load it via #[+api("spacy#load") #[code spacy.load()]]. This will
|
||
| return a #[code Language] object contaning all components and data needed
|
||
| to process text. We usually call it #[code nlp]. Calling the #[code nlp]
|
||
| object on a string of text will return a processed #[code Doc]:
|
||
|
||
+code-exec.
|
||
import spacy
|
||
|
||
nlp = spacy.load('en_core_web_sm')
|
||
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
|
||
for token in doc:
|
||
print(token.text, token.pos_, token.dep_)
|
||
|
||
p
|
||
| Even though a #[code Doc] is processed – e.g. split into individual words
|
||
| and annotated – it still holds #[strong all information of the original text],
|
||
| like whitespace characters. You can always get the offset of a token into the
|
||
| original string, or reconstruct the original by joining the tokens and their
|
||
| trailing whitespace. This way, you'll never lose any information
|
||
| when processing text with spaCy.
|
||
|
||
+h(3, "annotations-token") Tokenization
|
||
|
||
include _spacy-101/_tokenization
|
||
|
||
+infobox
|
||
| To learn more about how spaCy's tokenization rules work in detail,
|
||
| how to #[strong customise and replace] the default tokenizer and how to
|
||
| #[strong add language-specific data], see the usage guides on
|
||
| #[+a("/usage/adding-languages") adding languages] and
|
||
| #[+a("/usage/linguistic-features#tokenization") customising the tokenizer].
|
||
|
||
+h(3, "annotations-pos-deps") Part-of-speech tags and dependencies
|
||
+tag-model("dependency parse")
|
||
|
||
include _spacy-101/_pos-deps
|
||
|
||
+infobox
|
||
| To learn more about #[strong part-of-speech tagging] and rule-based
|
||
| morphology, and how to #[strong navigate and use the parse tree]
|
||
| effectively, see the usage guides on
|
||
| #[+a("/usage/linguistic-features#pos-tagging") part-of-speech tagging] and
|
||
| #[+a("/usage/linguistic-features#dependency-parse") using the dependency parse].
|
||
|
||
+h(3, "annotations-ner") Named Entities
|
||
+tag-model("named entities")
|
||
|
||
include _spacy-101/_named-entities
|
||
|
||
+infobox
|
||
| To learn more about entity recognition in spaCy, how to
|
||
| #[strong add your own entities] to a document and how to
|
||
| #[strong train and update] the entity predictions of a model, see the
|
||
| usage guides on
|
||
| #[+a("/usage/linguistic-features#named-entities") named entity recognition] and
|
||
| #[+a("/usage/training#ner") training the named entity recognizer].
|
||
|
||
+h(2, "vectors-similarity") Word vectors and similarity
|
||
+tag-model("vectors")
|
||
|
||
include _spacy-101/_similarity
|
||
|
||
include _spacy-101/_word-vectors
|
||
|
||
+infobox
|
||
| To learn more about word vectors, how to #[strong customise them] and
|
||
| how to load #[strong your own vectors] into spaCy, see the usage
|
||
| guide on
|
||
| #[+a("/usage/vectors-similarity") using word vectors and semantic similarities].
|
||
|
||
+h(2, "pipelines") Pipelines
|
||
|
||
include _spacy-101/_pipelines
|
||
|
||
+infobox
|
||
| To learn more about #[strong how processing pipelines work] in detail,
|
||
| how to enable and disable their components, and how to
|
||
| #[strong create your own], see the usage guide on
|
||
| #[+a("/usage/processing-pipelines") language processing pipelines].
|
||
|
||
+h(2, "vocab") Vocab, hashes and lexemes
|
||
|
||
include _spacy-101/_vocab
|
||
|
||
+h(2, "serialization") Serialization
|
||
|
||
include _spacy-101/_serialization
|
||
|
||
+infobox
|
||
| To learn more about how to #[strong save and load your own models],
|
||
| see the usage guide on
|
||
| #[+a("/usage/training#saving-loading") saving and loading].
|
||
|
||
+h(2, "training") Training
|
||
|
||
include _spacy-101/_training
|
||
|
||
+infobox
|
||
| To learn more about #[strong training and updating] models, how to create
|
||
| training data and how to improve spaCy's named entity recognition models,
|
||
| see the usage guides on #[+a("/usage/training") training].
|
||
|
||
+h(2, "language-data") Language data
|
||
|
||
include _spacy-101/_language-data
|
||
|
||
+infobox
|
||
| To learn more about the individual components of the language data and
|
||
| how to #[strong add a new language] to spaCy in preparation for training
|
||
| a language model, see the usage guide on
|
||
| #[+a("/usage/adding-languages") adding languages].
|
||
|
||
|
||
+section("lightning-tour")
|
||
+h(2, "lightning-tour") Lightning tour
|
||
include _spacy-101/_lightning-tour
|
||
|
||
+section("architecture")
|
||
+h(2, "architecture") Architecture
|
||
include _spacy-101/_architecture
|
||
|
||
+section("community-faq")
|
||
+h(2, "community") Community & FAQ
|
||
include _spacy-101/_community-faq
|