spaCy/website/docs/usage/spacy-101.jade
2017-06-03 22:16:26 +02:00

//- 💫 DOCS > USAGE > SPACY 101
include ../../_includes/_mixins
p
| Whether you're new to spaCy, or just want to brush up on some
| NLP basics and implementation details, this page should have you covered.
| Each section will explain one of spaCy's features in simple terms and
| with examples or illustrations. Some sections will also reappear across
| the usage guides as a quick introduction.
+aside("Help us improve the docs")
| Did you spot a mistake or come across explanations that
| are unclear? We always appreciate improvement
| #[+a(gh("spaCy") + "/issues") suggestions] or
| #[+a(gh("spaCy") + "/pulls") pull requests]. You can find a "Suggest
| edits" link at the bottom of each page that points you to the source.
+h(2, "whats-spacy") What's spaCy?
+grid.o-no-block
+grid-col("half")
p
| spaCy is a #[strong free, open-source library] for advanced
| #[strong Natural Language Processing] (NLP) in Python.
p
| If you're working with a lot of text, you'll eventually want to
| know more about it. For example, what's it about? What do the
| words mean in context? Who is doing what to whom? What companies
| and products are mentioned? Which texts are similar to each other?
p
| spaCy is designed specifically for #[strong production use] and
| helps you build applications that process and "understand"
| large volumes of text. It can be used to build
| #[strong information extraction] or
| #[strong natural language understanding] systems, or to
| pre-process text for #[strong deep learning].
+table-of-contents
+item #[+a("#features") Features]
+item #[+a("#annotations") Linguistic annotations]
+item #[+a("#annotations-token") Tokenization]
+item #[+a("#annotations-pos-deps") POS tags and dependencies]
+item #[+a("#annotations-ner") Named entities]
+item #[+a("#vectors-similarity") Word vectors and similarity]
+item #[+a("#pipelines") Pipelines]
+item #[+a("#vocab") Vocab, hashes and lexemes]
+item #[+a("#serialization") Serialization]
+item #[+a("#training") Training]
+item #[+a("#architecture") Architecture]
+item #[+a("#community") Community & FAQ]
+h(3, "what-spacy-isnt") What spaCy isn't
+list
+item #[strong spaCy is not a platform or "an API"].
| Unlike a platform, spaCy does not provide a software-as-a-service offering or
| a web application. It's an open-source library designed to help you
| build NLP applications, not a consumable service.
+item #[strong spaCy is not an out-of-the-box chat bot engine].
| While spaCy can be used to power conversational applications, it's
| not designed specifically for chat bots, and only provides the
| underlying text processing capabilities.
+item #[strong spaCy is not research software].
| It's built on the latest research, but unlike
| #[+a("https://github.com/nltk/nltk") NLTK], which is intended for
| teaching and research, spaCy follows a more opinionated approach and
| focuses on production usage. Its aim is to provide you with the best
| possible general-purpose solution for text processing and machine learning
| with text input, but this also means that there's only one implementation
| of each component.
+item #[strong spaCy is not a company].
| It's an open-source library. Our company publishing spaCy and other
| software is called #[+a(COMPANY_URL, true) Explosion AI].
+h(2, "features") Features
p
| Across the documentation, you'll come across mentions of spaCy's
| features and capabilities. Some of them refer to linguistic concepts,
| while others are related to more general machine learning functionality.
+aside
| If one of spaCy's functionalities #[strong needs a model], it means that
| you need to have one of the available
| #[+a("/docs/usage/models") statistical models] installed. Models are used
| to #[strong predict] linguistic annotations, for example whether a word is
| a verb or a noun.
+table(["Name", "Description", "Needs model"])
+row
+cell #[strong Tokenization]
+cell Segmenting text into words, punctuation marks etc.
+cell #[+procon("con")]
+row
+cell #[strong Part-of-speech] (POS) #[strong Tagging]
+cell Assigning word types to tokens, like verb or noun.
+cell #[+procon("pro")]
+row
+cell #[strong Dependency Parsing]
+cell
| Assigning syntactic dependency labels, describing the relations
| between individual tokens, like subject or object.
+cell #[+procon("pro")]
+row
+cell #[strong Sentence Boundary Detection] (SBD)
+cell Finding and segmenting individual sentences.
+cell #[+procon("pro")]
+row
+cell #[strong Named Entity Recognition] (NER)
+cell
| Labelling named "real-world" objects, like persons, companies or
| locations.
+cell #[+procon("pro")]
+row
+cell #[strong Rule-based Matching]
+cell
| Finding sequences of tokens based on their texts and linguistic
| annotations, similar to regular expressions.
+cell #[+procon("con")]
+row
+cell #[strong Similarity]
+cell
| Comparing words, text spans and documents to determine how similar
| they are to each other.
+cell #[+procon("pro")]
+row
+cell #[strong Training]
+cell Updating and improving a statistical model's predictions.
+cell #[+procon("neutral")]
+row
+cell #[strong Serialization]
+cell Saving objects to files or byte strings.
+cell #[+procon("neutral")]
+h(2, "annotations") Linguistic annotations
p
| spaCy provides a variety of linguistic annotations to give you
| #[strong insights into a text's grammatical structure]. This includes the
| word types, like the parts of speech, and how the words are related to
| each other. For example, if you're analysing text, it makes a huge
| difference whether a noun is the subject of a sentence or the object,
| or whether "google" is used as a verb or refers to the website or
| company in a specific context.
p
| Once you've downloaded and installed a #[+a("/docs/usage/models") model],
| you can load it via #[+api("spacy#load") #[code spacy.load()]]. This will
| return a #[code Language] object containing all components and data needed
| to process text. We usually call it #[code nlp]. Calling the #[code nlp]
| object on a string of text will return a processed #[code Doc]:
+code.
import spacy
nlp = spacy.load('en')
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
p
| Even though a #[code Doc] is processed (e.g. split into individual words
| and annotated), it still holds #[strong all information of the original text],
| like whitespace characters. This way, you'll never lose any information
| when processing text with spaCy.
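p
| To illustrate what "never losing information" means, here is a minimal
| pure-Python sketch. It is not spaCy's tokenizer: the
| #[code tokenize] and #[code detokenize] helpers are hypothetical and
| split on whitespace only. The idea it shows is that each token remembers
| the whitespace that followed it, so the original string can always be
| rebuilt exactly.

```python
import re


def tokenize(text):
    """Split text into (word, trailing_whitespace) pairs without dropping characters."""
    pairs = []
    # Each match is a run of non-space characters plus the whitespace after it.
    # Leading whitespace shows up as a pair with an empty word.
    for match in re.finditer(r"(\S*)(\s*)", text):
        word, space = match.groups()
        if word or space:  # skip the empty match at the end of the string
            pairs.append((word, space))
    return pairs


def detokenize(pairs):
    """Rebuild the original string from the (word, whitespace) pairs."""
    return "".join(word + space for word, space in pairs)


text = "Apple is looking at buying U.K. startup for $1 billion"
pairs = tokenize(text)
words = [word for word, _ in pairs]
restored = detokenize(pairs)
```

p
| In this sketch, #[code restored] is identical to #[code text], even for
| input with irregular spacing or newlines. A real tokenizer applies
| language-specific rules on top of whitespace splitting.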
+h(3, "annotations-token") Tokenization
include _spacy-101/_tokenization
+infobox
| To learn more about how spaCy's tokenization rules work in detail,
| how to #[strong customise and replace] the default tokenizer and how to
| #[strong add language-specific data], see the usage guides on
| #[+a("/docs/usage/adding-languages") adding languages] and
| #[+a("/docs/usage/customizing-tokenizer") customising the tokenizer].
+h(3, "annotations-pos-deps") Part-of-speech tags and dependencies
+tag-model("dependency parse")
include _spacy-101/_pos-deps
+infobox
| To learn more about #[strong part-of-speech tagging] and rule-based
| morphology, and how to #[strong navigate and use the parse tree]
| effectively, see the usage guides on
| #[+a("/docs/usage/pos-tagging") part-of-speech tagging] and
| #[+a("/docs/usage/dependency-parse") using the dependency parse].
+h(3, "annotations-ner") Named Entities
+tag-model("named entities")
include _spacy-101/_named-entities
+infobox
| To learn more about entity recognition in spaCy, how to
| #[strong add your own entities] to a document and how to
| #[strong train and update] the entity predictions of a model, see the
| usage guides on
| #[+a("/docs/usage/entity-recognition") named entity recognition] and
| #[+a("/docs/usage/training-ner") training the named entity recognizer].
+h(2, "vectors-similarity") Word vectors and similarity
+tag-model("vectors")
include _spacy-101/_similarity
include _spacy-101/_word-vectors
+infobox
| To learn more about word vectors, how to #[strong customise them] and
| how to load #[strong your own vectors] into spaCy, see the usage
| guide on
| #[+a("/docs/usage/word-vectors-similarities") using word vectors and semantic similarities].
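p
| As a rough illustration of comparing vectors, the sketch below computes
| the cosine similarity of two lists of floats in plain Python. The
| three-dimensional vectors are made up for the example, and
| #[code cosine_similarity] is a hypothetical helper, not part of spaCy's
| API. Real word vectors typically have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction, so treat it as not similar to anything.
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up toy vectors, for illustration only.
vectors = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

dog_cat = cosine_similarity(vectors["dog"], vectors["cat"])
dog_banana = cosine_similarity(vectors["dog"], vectors["banana"])
```

p
| With these toy values, #[code dog_cat] is larger than
| #[code dog_banana], which matches the intuition that vectors pointing in
| a similar direction represent more similar words.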
+h(2, "pipelines") Pipelines
include _spacy-101/_pipelines
+infobox
| To learn more about #[strong how processing pipelines work] in detail,
| how to enable and disable their components, and how to
| #[strong create your own], see the usage guide on
| #[+a("/docs/usage/language-processing-pipeline") language processing pipelines].
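p
| The general idea of a processing pipeline can be sketched in a few lines
| of plain Python. This is a simplified model and not spaCy's
| implementation: the #[code Pipeline] class, the dictionary-based "doc"
| and both components are hypothetical. It shows a tokenizer creating a
| document, named components being applied to it in order, and individual
| components being disabled by name.

```python
def tokenizer(text):
    """Create a toy 'doc': the raw text, its tokens and an annotation store."""
    return {"text": text, "tokens": text.split(), "annotations": {}}


def length_tagger(doc):
    """Toy component: annotate every token with its length."""
    doc["annotations"]["lengths"] = [len(token) for token in doc["tokens"]]
    return doc


def title_case_flagger(doc):
    """Toy component: flag tokens that are written in title case."""
    doc["annotations"]["is_title"] = [token.istitle() for token in doc["tokens"]]
    return doc


class Pipeline:
    """Apply named components to a doc, one after the other."""

    def __init__(self, make_doc, components):
        self.make_doc = make_doc
        self.components = list(components)  # list of (name, callable) pairs

    def __call__(self, text, disable=()):
        doc = self.make_doc(text)
        for name, component in self.components:
            if name in disable:
                continue  # skip components the caller switched off
            doc = component(doc)
        return doc


nlp = Pipeline(tokenizer, [("lengths", length_tagger), ("titles", title_case_flagger)])
doc = nlp("Apple is looking")
doc_without_titles = nlp("Apple is looking", disable=("titles",))
```

p
| Because every component takes a doc and returns it, components can be
| added, removed or reordered without changing the others.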
+h(2, "vocab") Vocab, hashes and lexemes
include _spacy-101/_vocab
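p
| To make the idea of mapping strings to and from hash values more
| concrete, here is a small self-contained sketch. It does not use spaCy's
| hash function or its #[code StringStore] class: #[code hash_string] and
| #[code TinyStringStore] are hypothetical names, and the truncated SHA-256
| digest is only a stand-in for a stable 64-bit hash.

```python
import hashlib


def hash_string(string):
    """Return a stable 64-bit integer for a string (stand-in hash, not spaCy's)."""
    digest = hashlib.sha256(string.encode("utf8")).digest()
    return int.from_bytes(digest[:8], "big")


class TinyStringStore:
    """Look up strings by hash and hashes by string."""

    def __init__(self):
        self._strings = {}

    def add(self, string):
        """Remember a string and return its hash."""
        key = hash_string(string)
        self._strings[key] = string
        return key

    def __getitem__(self, item):
        if isinstance(item, str):
            # The hash depends only on the string, so it can always be computed.
            return hash_string(item)
        # Going from hash to string only works if the string was added before.
        return self._strings[item]


store = TinyStringStore()
coffee_hash = store.add("coffee")
coffee_text = store[coffee_hash]
```

p
| Note the asymmetry in this sketch: a string can always be turned into
| its hash, but a hash can only be resolved back to a string by a store
| that has seen that string. An empty #[code TinyStringStore] raises a
| #[code KeyError] for #[code coffee_hash].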
+h(2, "serialization") Serialization
include _spacy-101/_serialization
+infobox
| To learn more about #[strong serialization] and how to
| #[strong save and load your own models], see the usage guide on
| #[+a("/docs/usage/saving-loading") saving, loading and data serialization].
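p
| The pattern of saving an object to a byte string or a file and loading
| it back can be sketched as follows. #[code TinyVocab] is a hypothetical
| class that stores its state as JSON. It only illustrates the round trip
| and says nothing about the format spaCy uses internally.

```python
import json
from pathlib import Path


class TinyVocab:
    """A toy word list that can be saved to bytes or to a file and restored."""

    def __init__(self, words=()):
        self.words = sorted(set(words))

    def to_bytes(self):
        """Serialize the state to a UTF-8 encoded JSON byte string."""
        return json.dumps({"words": self.words}).encode("utf8")

    def from_bytes(self, data):
        """Load the state from bytes in place and return the object itself."""
        self.words = json.loads(data.decode("utf8"))["words"]
        return self

    def to_disk(self, path):
        """Write the same bytes to a file."""
        Path(path).write_bytes(self.to_bytes())

    def from_disk(self, path):
        """Read the bytes back from a file."""
        return self.from_bytes(Path(path).read_bytes())


vocab = TinyVocab(["coffee", "tea", "coffee"])
data = vocab.to_bytes()
restored = TinyVocab().from_bytes(data)
```

p
| Here, #[code restored.words] equals #[code vocab.words]. Loading
| modifies an existing object in place, so an empty instance is created
| first and then filled with the saved state.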
+h(2, "training") Training
include _spacy-101/_training
+infobox
| To learn more about #[strong training and updating] models, how to create
| training data and how to improve spaCy's named entity recognition models,
| see the usage guides on #[+a("/docs/usage/training") training] and
| #[+a("/docs/usage/training-ner") training the named entity recognizer].
+h(2, "architecture") Architecture
+under-construction
+image
include ../../assets/img/docs/architecture.svg
.u-text-right
+button("/assets/img/docs/architecture.svg", false, "secondary").u-text-tag View large graphic
+table(["Name", "Description"])
+row
+cell #[+api("language") #[code Language]]
+cell
| A text-processing pipeline. Usually you'll load this once per
| process as #[code nlp] and pass the instance around your application.
+row
+cell #[+api("doc") #[code Doc]]
+cell A container for accessing linguistic annotations.
+row
+cell #[+api("span") #[code Span]]
+cell A slice from a #[code Doc] object.
+row
+cell #[+api("token") #[code Token]]
+cell
| An individual token — i.e. a word, punctuation symbol, whitespace,
| etc.
+row
+cell #[+api("lexeme") #[code Lexeme]]
+cell
| An entry in the vocabulary. It's a word type with no context, as
| opposed to a word token. It therefore has no part-of-speech tag,
| dependency parse etc.
+row
+cell #[+api("vocab") #[code Vocab]]
+cell
| A lookup table for the vocabulary that allows you to access
| #[code Lexeme] objects.
+row
+cell #[code Morphology]
+cell
| Assign linguistic features like lemmas, noun case, verb tense etc.
| based on the word and its part-of-speech tag.
+row
+cell #[+api("stringstore") #[code StringStore]]
+cell Map strings to and from hash values.
+row
+cell #[+api("tokenizer") #[code Tokenizer]]
+cell
| Segment text, and create #[code Doc] objects with the discovered
| segment boundaries.
+row
+cell #[+api("matcher") #[code Matcher]]
+cell
| Match sequences of tokens, based on pattern rules, similar to
| regular expressions.
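p
| Matching sequences of tokens against pattern rules can be illustrated
| with a short plain-Python sketch. This is not spaCy's #[code Matcher]:
| the dictionary-based tokens, the attribute names and the
| #[code find_matches] helper are assumptions made for the example. Each
| pattern entry describes one token, and a match is a run of consecutive
| tokens that satisfy the entries in order.

```python
def token_matches(token, spec):
    """Check that a token has every attribute value required by one pattern entry."""
    return all(token.get(attr) == value for attr, value in spec.items())


def find_matches(tokens, pattern):
    """Return (start, end) index pairs of token runs that match the pattern."""
    matches = []
    width = len(pattern)
    if width == 0:
        return matches
    # Slide a window of the pattern's length over the tokens.
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(token_matches(tok, spec) for tok, spec in zip(window, pattern)):
            matches.append((start, start + width))
    return matches


# Toy tokens with a few made-up attributes.
tokens = [
    {"text": "Hello", "lower": "hello", "is_punct": False},
    {"text": ",", "lower": ",", "is_punct": True},
    {"text": "world", "lower": "world", "is_punct": False},
    {"text": "!", "lower": "!", "is_punct": True},
]

# "hello", followed by any punctuation token, followed by "world".
pattern = [{"lower": "hello"}, {"is_punct": True}, {"lower": "world"}]
matches = find_matches(tokens, pattern)
```

p
| Unlike a regular expression over raw text, the pattern here refers to
| token attributes, so it keeps working if the punctuation or the
| capitalisation of the text changes.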
+h(3, "architecture-pipeline") Pipeline components
+table(["Name", "Description"])
+row
+cell #[+api("tagger") #[code Tagger]]
+cell Annotate part-of-speech tags on #[code Doc] objects.
+row
+cell #[+api("dependencyparser") #[code DependencyParser]]
+cell Annotate syntactic dependencies on #[code Doc] objects.
+row
+cell #[+api("entityrecognizer") #[code EntityRecognizer]]
+cell
| Annotate named entities, e.g. persons or products, on #[code Doc]
| objects.
+h(3, "architecture-other") Other classes
+table(["Name", "Description"])
+row
+cell #[+api("binder") #[code Binder]]
+cell Container class for serializing collections of #[code Doc] objects.
+row
+cell #[+api("goldparse") #[code GoldParse]]
+cell Collection for training annotations.
+row
+cell #[+api("goldcorpus") #[code GoldCorpus]]
+cell
| An annotated corpus, using the JSON file format. Manages
| annotations for tagging, dependency parsing and NER.
+h(2, "community") Community & FAQ
p
| We're very happy to see the spaCy community grow and include a mix of
| people from all kinds of different backgrounds: computational
| linguistics, data science, deep learning, research and more. If you'd
| like to get involved, below are some answers to the most important
| questions and resources for further reading.
+h(3, "faq-help-code") Help, my code isn't working!
p
| Bugs suck, and we're doing our best to continuously improve the tests
| and fix bugs as soon as possible. Before you submit an issue, do a
| quick search and check if the problem has already been reported. If
| you're having installation or loading problems, make sure to also check
| out the #[+a("/docs/usage#troubleshooting") troubleshooting guide]. Help
| with spaCy is available via the following platforms:
+aside("How do I know if something is a bug?")
| Of course, it's always hard to know for sure, so don't worry, we're not
| going to be mad if a bug report turns out to be a typo in your
| code. As a simple rule, any C-level error without a Python traceback,
| like a #[strong segmentation fault] or #[strong memory error],
| is #[strong always] a spaCy bug.#[br]#[br]
| Because models are statistical, their performance will never be
| #[em perfect]. However, if you come across
| #[strong patterns that might indicate an underlying issue], please do
| file a report. Similarly, we also care about behaviours that
| #[strong contradict our docs].
+table(["Platform", "Purpose"])
+row
+cell #[+a("https://stackoverflow.com/questions/tagged/spacy") StackOverflow]
+cell
| #[strong Usage questions] and everything related to problems with
| your specific code. The StackOverflow community is much larger
| than ours, so if your problem can be solved by others, you'll
| receive help much quicker.
+row
+cell #[+a("https://gitter.im/" + SOCIAL.gitter) Gitter chat]
+cell
| #[strong General discussion] about spaCy, meeting other community
| members and exchanging #[strong tips, tricks and best practices].
| If we're working on experimental models and features, we usually
| share them on Gitter first.
+row
+cell #[+a(gh("spaCy") + "/issues") GitHub issue tracker]
+cell
| #[strong Bug reports] and #[strong improvement suggestions], i.e.
| everything that's likely spaCy's fault. This also includes
| problems with the models beyond statistical imprecisions, like
| patterns that point to a bug.
+infobox
| Please understand that we won't be able to provide individual support via
| email. We also believe that help is much more valuable if it's shared
| publicly, so that #[strong more people can benefit from it]. If you come
| across an issue and you think you might be able to help, consider posting
| a quick update with your solution. No matter how simple, it can easily
| save someone a lot of time and headache, and the next time you need help,
| they might repay the favour.
+h(3, "faq-contributing") How can I contribute to spaCy?
p
| You don't have to be an NLP expert or Python pro to contribute, and we're
| happy to help you get started. If you're new to spaCy, a good place to
| start is the
| #[+a(gh("spaCy") + '/issues?q=is%3Aissue+is%3Aopen+label%3A"help+wanted+%28easy%29"') #[code help wanted (easy)] label]
| on GitHub, which we use to tag bugs and feature requests that are easy
| and self-contained. We also appreciate contributions to the docs, whether
| it's fixing a typo, improving an example or adding additional explanations.
| You'll find a "Suggest edits" link at the bottom of each page that points
| you to the source.
p
| Another way of getting involved is to help us improve the
| #[+a("/docs/usage/adding-languages#language-data") language data],
| especially if you happen to speak one of the languages currently in
| #[+a("/docs/api/language-models#alpha-support") alpha support]. Even
| adding simple tokenizer exceptions, stop words or lemmatizer data
| can make a big difference. It will also make it easier for us to provide
| a statistical model for the language in the future. Submitting a test
| that documents a bug or performance issue, or covers functionality that's
| especially important for your application is also very helpful. This way,
| you'll also make sure we never accidentally introduce regressions to the
| parts of the library that you care about the most.
p
strong
| For more details on the types of contributions we're looking for, the
| code conventions and other useful tips, make sure to check out the
| #[+a(gh("spaCy", "CONTRIBUTING.md")) contributing guidelines].
+infobox("Code of Conduct")
| spaCy adheres to the
| #[+a("http://contributor-covenant.org/version/1/4/") Contributor Covenant Code of Conduct].
| By participating, you are expected to uphold this code.
+h(3, "faq-project-with-spacy")
| I've built something cool with spaCy, how can I get the word out?
p
| First, congrats, we'd love to check it out! When you share your
| project on Twitter, don't forget to tag
| #[+a("https://twitter.com/" + SOCIAL.twitter) @#{SOCIAL.twitter}] so we
| don't miss it. If you think your project would be a good fit for the
| #[+a("/docs/usage/showcase") showcase], #[strong feel free to submit it!]
| Tutorials are also incredibly valuable to other users and a great way to
| get exposure. So we strongly encourage #[strong writing up your experiences],
| or sharing your code and some tips and tricks on your blog. Since our
| website is open-source, you can add your project or tutorial by making a
| pull request on GitHub.
+aside("Contributing to spacy.io")
| All showcase and tutorial links are stored in a
| #[+a(gh("spaCy", "website/docs/usage/_data.json")) JSON file], so you
| won't even have to edit any markup. For more info on how to submit
| your project, see the
| #[+a(gh("spaCy", "CONTRIBUTING.md#submitting-a-project-to-the-showcase")) contributing guidelines]
| and our #[+a(gh("spaCy", "website")) website docs].
p
| If you would like to use the spaCy logo on your site, please get in touch
| and ask us first. However, if you want to show support and tell others
| that your project is using spaCy, you can grab one of our
| #[strong spaCy badges] here:
- SPACY_BADGES = ["built%20with-spaCy-09a3d5.svg", "made%20with%20❤%20and-spaCy-09a3d5.svg", "spaCy-v2-09a3d5.svg"]
+quickstart([{id: "badge", input_style: "check", options: SPACY_BADGES.map(function(badge, i) { return {id: i, title: "<img class='o-icon' src='https://img.shields.io/badge/" + badge + "' height='20'/>", checked: (i == 0) ? true : false}}) }], false, false, true)
.c-code-block(data-qs-results)
for badge, i in SPACY_BADGES
- var url = "https://img.shields.io/badge/" + badge
+code(false, "text", "star").o-no-block(data-qs-badge=i)=url
+code(false, "text", "code").o-no-block(data-qs-badge=i).
&lt;a href="#{SITE_URL}"&gt;&lt;img src="#{url}" height="20"&gt;&lt;/a&gt;
+code(false, "text", "markdown").o-no-block(data-qs-badge=i).
[![spaCy](#{url})](#{SITE_URL})