💫 Update website (#3285)
<!--- Provide a general summary of your changes in the title. -->
## Description
The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.
This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.
### Types of change
enhancement
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-02-17 21:31:19 +03:00
---
title: 'spaCy 101: Everything you need to know'
teaser: The most important concepts, explained in simple terms
menu:
  - ["What's spaCy?", 'whats-spacy']
  - ['Features', 'features']
  - ['Linguistic Annotations', 'annotations']
  - ['Pipelines', 'pipelines']
  - ['Architecture', 'architecture']
  - ['Vocab', 'vocab']
  - ['Serialization', 'serialization']
  - ['Training', 'training']
  - ['Language Data', 'language-data']
  - ['Community & FAQ', 'community-faq']
---

Whether you're new to spaCy, or just want to brush up on some NLP basics and
implementation details – this page should have you covered. Each section will
explain one of spaCy's features in simple terms and with examples or
illustrations. Some sections will also reappear across the usage guides as a
quick introduction.

> #### Help us improve the docs
>
> Did you spot a mistake or come across explanations that are unclear? We always
> appreciate improvement
> [suggestions](https://github.com/explosion/spaCy/issues) or
> [pull requests](https://github.com/explosion/spaCy/pulls). You can find a
> "Suggest edits" link at the bottom of each page that points you to the source.

<Infobox title="Take the free interactive course">

[![Advanced NLP with spaCy](../images/course.jpg)](https://course.spacy.io)

In this course you'll learn how to use spaCy to build advanced natural language
understanding systems, using both rule-based and machine learning approaches. It
includes 55 exercises featuring interactive coding practice, multiple-choice
questions and slide decks.

<p><Button to="https://course.spacy.io" variant="primary">Start the course</Button></p>

</Infobox>

## What's spaCy? {#whats-spacy}

<Grid cols={2}>

<div>

spaCy is a **free, open-source library** for advanced **Natural Language
Processing** (NLP) in Python.

If you're working with a lot of text, you'll eventually want to know more about
it. For example, what's it about? What do the words mean in context? Who is
doing what to whom? What companies and products are mentioned? Which texts are
similar to each other?

spaCy is designed specifically for **production use** and helps you build
applications that process and "understand" large volumes of text. It can be used
to build **information extraction** or **natural language understanding**
systems, or to pre-process text for **deep learning**.

</div>

<Infobox title="Table of contents" id="toc">

- [Features](#features)
- [Linguistic annotations](#annotations)
  - [Tokenization](#annotations-token)
  - [POS tags and dependencies](#annotations-pos-deps)
  - [Named entities](#annotations-ner)
- [Word vectors and similarity](#vectors-similarity)
- [Pipelines](#pipelines)
- [Library architecture](#architecture)
- [Vocab, hashes and lexemes](#vocab)
- [Serialization](#serialization)
- [Training](#training)
- [Language data](#language-data)
- [Community & FAQ](#community)

</Infobox>

</Grid>

### What spaCy isn't {#what-spacy-isnt}

- ❌ **spaCy is not a platform or "an API"**. Unlike a platform, spaCy does not
  provide software as a service, or a web application. It's an open-source
  library designed to help you build NLP applications, not a consumable service.
- ❌ **spaCy is not an out-of-the-box chat bot engine**. While spaCy can be used
  to power conversational applications, it's not designed specifically for chat
  bots, and only provides the underlying text processing capabilities.
- ❌ **spaCy is not research software**. It's built on the latest research, but
  it's designed to get things done. This leads to fairly different design
  decisions than [NLTK](https://github.com/nltk/nltk) or
  [CoreNLP](https://stanfordnlp.github.io/CoreNLP/), which were created as
  platforms for teaching and research. The main difference is that spaCy is
  integrated and opinionated. spaCy tries to avoid asking the user to choose
  between multiple algorithms that deliver equivalent functionality. Keeping the
  menu small lets spaCy deliver generally better performance and developer
  experience.
- ❌ **spaCy is not a company**. It's an open-source library. Our company
  publishing spaCy and other software is called
  [Explosion](https://explosion.ai).

## Features {#features}

In the documentation, you'll come across mentions of spaCy's features and
capabilities. Some of them refer to linguistic concepts, while others are
related to more general machine learning functionality.

| Name                                  | Description                                                                                                        |
| ------------------------------------- | ------------------------------------------------------------------------------------------------------------------ |
| **Tokenization**                      | Segmenting text into words, punctuation marks etc.                                                                 |
| **Part-of-speech** (POS) **Tagging**  | Assigning word types to tokens, like verb or noun.                                                                 |
| **Dependency Parsing**                | Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object. |
| **Lemmatization**                     | Assigning the base forms of words. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat".      |
| **Sentence Boundary Detection** (SBD) | Finding and segmenting individual sentences.                                                                       |
| **Named Entity Recognition** (NER)    | Labelling named "real-world" objects, like persons, companies or locations.                                        |
| **Entity Linking** (EL)               | Disambiguating textual entities to unique identifiers in a knowledge base.                                         |
| **Similarity**                        | Comparing words, text spans and documents and how similar they are to each other.                                  |
| **Text Classification**               | Assigning categories or labels to a whole document, or parts of a document.                                        |
| **Rule-based Matching**               | Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.       |
| **Training**                          | Updating and improving a statistical model's predictions.                                                          |
| **Serialization**                     | Saving objects to files or byte strings.                                                                           |

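
A few of the features in the table can be tried without downloading anything. The following is a minimal sketch, assuming spaCy v3 is installed: it uses a blank English pipeline (tokenizer only) plus the rule-based `sentencizer` component, so it covers tokenization, sentence boundary detection and serialization, but none of the statistical features. The example sentence is made up for illustration.

```python
import spacy
from spacy.tokens import Doc

# Blank English pipeline: only the tokenizer, no trained components.
nlp = spacy.blank("en")
# Rule-based sentence boundary detection that splits on punctuation.
nlp.add_pipe("sentencizer")

doc = nlp("spaCy is a library for NLP. It is written in Python.")

# Tokenization: the Doc is a sequence of Token objects.
tokens = [token.text for token in doc]

# Sentence boundary detection: doc.sents yields one Span per sentence.
sentences = [sent.text for sent in doc.sents]

# Serialization: round-trip the Doc through a byte string.
doc_bytes = doc.to_bytes()
restored = Doc(nlp.vocab).from_bytes(doc_bytes)

print(tokens)
print(sentences)
print(restored.text == doc.text)
```

Features such as part-of-speech tagging, dependency parsing and named entity recognition would additionally need a trained pipeline to be loaded instead of the blank one.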
### Statistical models {#statistical-models}

While some of spaCy's features work independently, others require
[trained pipelines](/models) to be loaded, which enable spaCy to **predict**
linguistic annotations – for example, whether a word is a verb or a noun. A
trained pipeline can consist of multiple components that use a statistical model
trained on labeled data. spaCy currently offers trained pipelines for a variety
of languages, which can be installed as individual Python modules. Pipeline
packages can differ in size, speed, memory usage, accuracy and the data they
include. The package you choose always depends on your use case and the texts
you're working with. For a general-purpose use case, the small, default packages
are always a good start. They typically include the following components:

- **Binary weights** for the part-of-speech tagger, dependency parser and named
  entity recognizer to predict those annotations in context.
- **Lexical entries** in the vocabulary, i.e. words and their
  context-independent attributes like the shape or spelling.
- **Data files** like lemmatization rules and lookup tables.
- **Word vectors**, i.e. multi-dimensional meaning representations of words that
  let you determine how similar they are to each other.
- **Configuration** options, like the language and processing pipeline settings
  and model implementations to use, to put spaCy in the correct state when you
  load the pipeline.

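
To get a feel for what a loaded pipeline carries around, you can inspect the `nlp` object. The sketch below assumes spaCy v3 and uses a blank English pipeline with a single rule-based component, so it runs without a download. A trained package such as `en_core_web_sm` would list more components and also ship weights and data files. The word "coffee" is just an illustrative example.

```python
import spacy

# Blank English pipeline with one rule-based component added for illustration.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

# Configuration: the language and the names of the pipeline components.
lang = nlp.config["nlp"]["lang"]
pipeline = nlp.config["nlp"]["pipeline"]
print(lang, pipeline)

# The same component names are available directly on the Language object.
print(nlp.pipe_names)

# Lexical entries: context-independent attributes of a word in the vocabulary.
lexeme = nlp.vocab["coffee"]
print(lexeme.text, lexeme.shape_, lexeme.is_alpha)
```

Because lexeme attributes like the shape do not depend on context, they can be looked up in the vocabulary without processing any text first.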
|
💫 Update website (#3285)
<!--- Provide a general summary of your changes in the title. -->
## Description
The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.
This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.
### Types of change
enhancement
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-02-17 21:31:19 +03:00
|
|
|
|
|
|
|
|
|
## Linguistic annotations {#annotations}
|
|
|
|
|
|
|
|
|
|
spaCy provides a variety of linguistic annotations to give you **insights into a
|
|
|
|
|
text's grammatical structure**. This includes the word types, like the parts of
|
|
|
|
|
speech, and how the words are related to each other. For example, if you're
|
|
|
|
|
analyzing text, it makes a huge difference whether a noun is the subject of a
|
|
|
|
|
sentence, or the object – or whether "google" is used as a verb, or refers to
|
|
|
|
|
the website or company in a specific context.
|
|
|
|
|
|
2020-09-03 14:13:03 +03:00
|
|
|
|
> #### Loading pipelines
|
💫 Update website (#3285)
<!--- Provide a general summary of your changes in the title. -->
## Description
>
> ```cli
> $ python -m spacy download en_core_web_sm
>
> >>> import spacy
> >>> nlp = spacy.load("en_core_web_sm")
> ```

Once you've [downloaded and installed](/usage/models) a trained pipeline, you
can load it via [`spacy.load`](/api/top-level#spacy.load). This will return a
`Language` object containing all components and data needed to process text. We
usually call it `nlp`. Calling the `nlp` object on a string of text will return
a processed `Doc`:

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.pos_, token.dep_)
```

Even though a `Doc` is processed – e.g. split into individual words and
annotated – it still holds **all information of the original text**, like
whitespace characters. You can always get the offset of a token into the
original string, or reconstruct the original by joining the tokens and their
trailing whitespace. This way, you'll never lose any information when processing
text with spaCy.
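
The round trip described above can be sketched in plain Python. This is a toy illustration, not spaCy's tokenizer: `ToyToken` and `toy_tokenize` are hypothetical names, and the splitting rule (whitespace only, input assumed not to start with whitespace) is an assumption made to keep the example short. In spaCy itself, the corresponding attributes are `Token.idx` and `Token.text_with_ws`.

```python
import re
from typing import List, NamedTuple


class ToyToken(NamedTuple):
    """Hypothetical stand-in for a token; not spaCy's Token class."""

    text: str        # the token's characters
    idx: int         # character offset into the original string
    whitespace: str  # whitespace that followed the token, possibly ""


def toy_tokenize(text: str) -> List[ToyToken]:
    """Split on whitespace, keeping each offset and the trailing whitespace.

    Assumes the text does not start with whitespace.
    """
    return [
        ToyToken(match.group(1), match.start(1), match.group(2))
        for match in re.finditer(r"(\S+)(\s*)", text)
    ]


original = "Apple is  looking at buying U.K. startup"
tokens = toy_tokenize(original)

# Each offset points back at the token's position in the original string.
offsets_ok = all(original[t.idx : t.idx + len(t.text)] == t.text for t in tokens)

# Joining text and trailing whitespace restores the input exactly,
# including the double space after "is".
reconstructed = "".join(t.text + t.whitespace for t in tokens)
print(offsets_ok, reconstructed == original)
```

Because whitespace is stored alongside each token rather than thrown away, nothing about the input has to be guessed when the text is put back together.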

### Tokenization {#annotations-token}

import Tokenization101 from 'usage/101/\_tokenization.md'

<Tokenization101 />

<Infobox title="Tokenization rules" emoji="📖">

To learn more about how spaCy's tokenization rules work in detail, how to
**customize and replace** the default tokenizer and how to **add
language-specific data**, see the usage guides on
[language data](/usage/linguistic-features#language-data) and
[customizing the tokenizer](/usage/linguistic-features#tokenization).

</Infobox>

### Part-of-speech tags and dependencies {#annotations-pos-deps model="parser"}

import PosDeps101 from 'usage/101/\_pos-deps.md'

<PosDeps101 />

<Infobox title="Part-of-speech tagging and morphology" emoji="📖">

To learn more about **part-of-speech tagging** and rule-based morphology, and
how to **navigate and use the parse tree** effectively, see the usage guides on
[part-of-speech tagging](/usage/linguistic-features#pos-tagging) and
[using the dependency parse](/usage/linguistic-features#dependency-parse).

</Infobox>

### Named Entities {#annotations-ner model="ner"}

import NER101 from 'usage/101/\_named-entities.md'

<NER101 />

<Infobox title="Named Entity Recognition" emoji="📖">

To learn more about entity recognition in spaCy, how to **add your own
entities** to a document and how to **train and update** the entity predictions
of a model, see the usage guides on
[named entity recognition](/usage/linguistic-features#named-entities) and
[training pipelines](/usage/training).

</Infobox>

### Word vectors and similarity {#vectors-similarity model="vectors"}

import Vectors101 from 'usage/101/\_vectors-similarity.md'

<Vectors101 />

<Infobox title="Word vectors" emoji="📖">

To learn more about word vectors, how to **customize them** and how to load
**your own vectors** into spaCy, see the usage guide on
[using word vectors and semantic similarities](/usage/linguistic-features#vectors-similarity).

</Infobox>

## Pipelines {#pipelines}

import Pipelines101 from 'usage/101/\_pipelines.md'

<Pipelines101 />

<Infobox title="Processing pipelines" emoji="📖">

To learn more about **how processing pipelines work** in detail, how to enable
and disable their components, and how to **create your own**, see the usage
guide on [language processing pipelines](/usage/processing-pipelines).

</Infobox>

## Architecture {#architecture}

import Architecture101 from 'usage/101/\_architecture.md'

<Architecture101 />

## Vocab, hashes and lexemes {#vocab}

Whenever possible, spaCy tries to store data in a vocabulary, the
[`Vocab`](/api/vocab), that will be **shared by multiple documents**. To save
memory, spaCy also encodes all strings to **hash values** – for example,
"coffee" has the hash `3197928453018144401`. Entity labels like "ORG" and
part-of-speech tags like "VERB" are also encoded. Internally, spaCy only
"speaks" in hash values.

> - **Token**: A word, punctuation mark etc. _in context_, including its
>   attributes, tags and dependencies.
> - **Lexeme**: A "word type" with no context. Includes the word shape and
>   flags, e.g. if it's lowercase, a digit or punctuation.
> - **Doc**: A processed container of tokens in context.
> - **Vocab**: The collection of lexemes.
> - **StringStore**: The dictionary mapping hash values to strings, for example
>   `3197928453018144401` → "coffee".

![Doc, Vocab, Lexeme and StringStore](../images/vocab_stringstore.svg)

If you process lots of documents containing the word "coffee" in all kinds of
different contexts, storing the exact string "coffee" every time would take up
way too much space. So instead, spaCy hashes the string and stores it in the
[`StringStore`](/api/stringstore). You can think of the `StringStore` as a
**lookup table that works in both directions** – you can look up a string to get
its hash, or a hash to get its string:

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I love coffee")
print(doc.vocab.strings["coffee"])  # 3197928453018144401
print(doc.vocab.strings[3197928453018144401])  # 'coffee'
```

Now that all strings are encoded, the entries in the vocabulary **don't need to
include the word text** themselves. Instead, they can look it up in the
`StringStore` via its hash value. Each entry in the vocabulary, also called
[`Lexeme`](/api/lexeme), contains the **context-independent** information about
a word. For example, no matter if "love" is used as a verb or a noun in some
context, its spelling and whether it consists of alphabetic characters won't
ever change. Its hash value will also always be the same.

```python
### {executable="true"}
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I love coffee")
for word in doc:
    lexeme = doc.vocab[word.text]
    print(lexeme.text, lexeme.orth, lexeme.shape_, lexeme.prefix_, lexeme.suffix_,
            lexeme.is_alpha, lexeme.is_digit, lexeme.is_title, lexeme.lang_)
```

> - **Text**: The original text of the lexeme.
> - **Orth**: The hash value of the lexeme.
> - **Shape**: The abstract word shape of the lexeme.
> - **Prefix**: By default, the first letter of the word string.
> - **Suffix**: By default, the last three letters of the word string.
> - **is alpha**: Does the lexeme consist of alphabetic characters?
> - **is digit**: Does the lexeme consist of digits?

| Text   | Orth                  | Shape  | Prefix | Suffix | is_alpha | is_digit |
| ------ | --------------------- | ------ | ------ | ------ | -------- | -------- |
| I      | `4690420944186131903` | `X`    | I      | I      | `True`   | `False`  |
| love   | `3702023516439754181` | `xxxx` | l      | ove    | `True`   | `False`  |
| coffee | `3197928453018144401` | `xxxx` | c      | fee    | `True`   | `False`  |
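
To see why these attributes are context-independent, it can help to write them as plain functions of the word string. The sketch below is an approximation, not spaCy's implementation: `toy_shape` and `toy_features` are hypothetical helpers, the prefix and suffix rules follow the defaults listed above, and the shape rule (letters become `X` or `x`, digits become `d`, runs of the same symbol are capped at four) is an assumption chosen to reproduce the table.

```python
def toy_shape(word: str) -> str:
    """Map characters to X/x/d and cap runs of the same symbol at four.

    Assumed rule for illustration; spaCy's own shape function may differ.
    """
    shape = []
    last = ""
    run = 0
    for char in word:
        if char.isalpha():
            symbol = "X" if char.isupper() else "x"
        elif char.isdigit():
            symbol = "d"
        else:
            symbol = char  # punctuation and other characters are kept as-is
        run = run + 1 if symbol == last else 1
        last = symbol
        if run <= 4:
            shape.append(symbol)
    return "".join(shape)


def toy_features(word: str) -> dict:
    """Compute features that depend only on the string, never on a sentence."""
    return {
        "shape": toy_shape(word),
        "prefix": word[:1],   # first letter
        "suffix": word[-3:],  # last three letters (or fewer for short words)
        "is_alpha": word.isalpha(),
        "is_digit": word.isdigit(),
    }


for word in ["I", "love", "coffee"]:
    print(word, toy_features(word))
```

None of these functions takes a sentence or a position as input, which is what allows one entry per word type to be shared across every document that contains it.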

The mapping of words to hashes doesn't depend on any state. To make sure each
value is unique, spaCy uses a
[hash function](https://en.wikipedia.org/wiki/Hash_function) to calculate the
hash **based on the word string**. This also means that the hash for "coffee"
will always be the same, no matter which pipeline you're using or how you've
configured spaCy.
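
A minimal sketch of that idea, using only the standard library: the hash is computed from the string's bytes alone, so two stores that have never seen each other agree on it. `toy_hash` and `ToyStringStore` are hypothetical names, and `hashlib.blake2b` is a stand-in chosen for illustration, so the values will not match the hashes spaCy produces.

```python
import hashlib


def toy_hash(string: str) -> int:
    """Return a 64-bit integer derived only from the string's bytes."""
    digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class ToyStringStore:
    """Hypothetical two-way lookup: string -> hash and hash -> string."""

    def __init__(self):
        self._by_hash = {}

    def add(self, string: str) -> int:
        key = toy_hash(string)
        self._by_hash[key] = string
        return key

    def __getitem__(self, item):
        if isinstance(item, str):
            return toy_hash(item)   # hashing needs no stored state
        return self._by_hash[item]  # KeyError if the string was never added


store_a, store_b = ToyStringStore(), ToyStringStore()
key_a = store_a.add("coffee")
key_b = store_b.add("coffee")
print(key_a == key_b)  # True: same string, same hash, in any store
print(store_a[key_a])  # coffee
```

Going from string to hash works in any store, because it is just a computation. Going from hash to string only works in a store where the string was added, since that direction relies on the stored table.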

However, hashes **cannot be reversed** and there's no way to resolve
`3197928453018144401` back to "coffee". All spaCy can do is look it up in the
vocabulary. That's why you always need to make sure all objects you create have
access to the same vocabulary. If they don't, spaCy might not be able to find
the strings it needs.

```python
### {executable="true"}
import spacy
from spacy.tokens import Doc
from spacy.vocab import Vocab

nlp = spacy.load("en_core_web_sm")
doc = nlp("I love coffee")  # Original Doc
print(doc.vocab.strings["coffee"])  # 3197928453018144401
print(doc.vocab.strings[3197928453018144401])  # 'coffee' 👍

empty_doc = Doc(Vocab())  # New Doc with empty Vocab
# empty_doc.vocab.strings[3197928453018144401] will raise an error :(

empty_doc.vocab.strings.add("coffee")  # Add "coffee" and generate hash
|
💫 Update website (#3285)
<!--- Provide a general summary of your changes in the title. -->
## Description
The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.
This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.
### Types of change
enhancement
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-02-17 21:31:19 +03:00
|
|
|
|
print(empty_doc.vocab.strings[3197928453018144401]) # 'coffee' 👍
|
|
|
|
|
|
|
|
|
|
new_doc = Doc(doc.vocab) # Create new doc with first doc's vocab
|
|
|
|
|
print(new_doc.vocab.strings[3197928453018144401]) # 'coffee' 👍
|
|
|
|
|
```
|
|
|
|
|
|
|
|
|
|
If the vocabulary doesn't contain a string for `3197928453018144401`, spaCy will
|
|
|
|
|
raise an error. You can re-add "coffee" manually, but this only works if you
|
|
|
|
|
actually _know_ that the document contains that word. To prevent this problem,
|
|
|
|
|
spaCy will also export the `Vocab` when you save a `Doc` or `nlp` object. This
|
|
|
|
|
will give you the object and its encoded annotations, plus the "key" to decode
|
|
|
|
|
it.
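To make the "one-way" nature of hashes more concrete, here is a minimal sketch that uses only the standard library and no spaCy at all. `ToyStringStore` and `hash_string` are hypothetical names made up for this illustration, and the hash is derived from SHA-256 rather than the hash function spaCy uses, so the integers will not match spaCy's. The point is only that going from a hash back to a string requires a stored entry.

```python
import hashlib


def hash_string(text: str) -> int:
    """Map a string to a 64-bit integer (a stand-in, not spaCy's hash function)."""
    digest = hashlib.sha256(text.encode("utf8")).digest()
    return int.from_bytes(digest[:8], "big")


class ToyStringStore:
    """Hypothetical lookup table that maps hashes to strings and back."""

    def __init__(self):
        self._by_hash = {}

    def add(self, text: str) -> int:
        key = hash_string(text)
        self._by_hash[key] = text
        return key

    def __getitem__(self, item):
        if isinstance(item, str):
            return hash_string(item)  # string -> hash can always be computed
        return self._by_hash[item]  # hash -> string needs a stored entry


full = ToyStringStore()
coffee_hash = full.add("coffee")
print(full[coffee_hash])  # coffee

empty = ToyStringStore()
try:
    empty[coffee_hash]
except KeyError:
    print("This store has never seen the string behind that hash")

empty.add("coffee")  # Re-adding the string restores the lookup
print(empty[coffee_hash])  # coffee
```

Two stores that never shared entries agree on the hash for "coffee", but only the one that stored the string can turn the hash back into text. That is the reason a saved object needs its vocabulary alongside it.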

## Serialization {#serialization}

import Serialization101 from 'usage/101/\_serialization.md'

<Serialization101 />

<Infobox title="Saving and loading" emoji="📖">

To learn more about how to **save and load your own pipelines**, see the usage
guide on [saving and loading](/usage/saving-loading#models).

</Infobox>

## Training {#training}

import Training101 from 'usage/101/\_training.md'

<Training101 />

<Infobox title="Training pipelines and models" emoji="📖">

To learn more about **training and updating** pipelines, how to create training
data and how to improve spaCy's named models, see the usage guides on
[training](/usage/training).

</Infobox>

### Training config and lifecycle {#training-config}

Training config files include all **settings and hyperparameters** for training
your pipeline. Instead of providing lots of arguments on the command line, you
only need to pass your `config.cfg` file to [`spacy train`](/api/cli#train).
This also makes it easy to integrate custom models and architectures, written in
your framework of choice. A pipeline's `config.cfg` is considered the "single
source of truth", both at **training** and **runtime**.
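To get a feel for how such a file breaks down into nested settings, here is a rough sketch that reads a small `[training]` block with Python's standard `configparser` and `json` modules. This is not how spaCy loads its config (it has its own config system), and the helper `to_nested` is a hypothetical name for this illustration. The `warmup_linear` function only borrows its name from the registered schedule. Its body is an assumption about what "linear warm-up followed by linear decay" could look like, not the library's implementation.

```python
import configparser
import json

CONFIG_TEXT = """
[training]
accumulate_gradient = 3

[training.optimizer]
@optimizers = "Adam.v1"

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.01
"""


def to_nested(text: str) -> dict:
    """Turn dotted section names into nested dicts and parse values as JSON."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(text)
    tree = {}
    for section in parser.sections():
        node = tree
        for part in section.split("."):
            node = node.setdefault(part, {})
        for key, value in parser[section].items():
            node[key] = json.loads(value)
    return tree


def warmup_linear(initial_rate, warmup_steps, total_steps):
    """Assumed schedule: rise linearly during warm-up, then decay linearly to 0."""
    step = 0
    while True:
        if step < warmup_steps:
            factor = step / max(1, warmup_steps)
        else:
            remaining = max(1, total_steps - warmup_steps)
            factor = max(0.0, (total_steps - step) / remaining)
        yield factor * initial_rate
        step += 1


config = to_nested(CONFIG_TEXT)
schedule_settings = dict(config["training"]["optimizer"]["learn_rate"])
schedule_name = schedule_settings.pop("@schedules")  # name of the function to call
rates = warmup_linear(**schedule_settings)
first_rates = [next(rates) for _ in range(3)]
print(schedule_name, first_rates)
```

Keys that start with `@` name a function, and the remaining keys in the same block become its arguments. Keeping both in one file is what lets the config act as the single source of truth.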

> ```ini
> ### config.cfg (excerpt)
> [training]
> accumulate_gradient = 3
>
> [training.optimizer]
> @optimizers = "Adam.v1"
>
> [training.optimizer.learn_rate]
> @schedules = "warmup_linear.v1"
> warmup_steps = 250
> total_steps = 20000
> initial_rate = 0.01
> ```

![Illustration of pipeline lifecycle](../images/lifecycle.svg)

<Infobox title="Training configuration system" emoji="📖">

For more details on spaCy's **configuration system** and how to use it to
customize your pipeline components, component models, training settings and
hyperparameters, see the [training config](/usage/training#config) usage guide.

</Infobox>

### Trainable components {#training-components}

spaCy's [`Pipe`](/api/pipe) class helps you implement your own trainable
components that have their own model instance, make predictions over `Doc`
objects and can be updated using [`spacy train`](/api/cli#train). This lets you
plug fully custom machine learning components into your pipeline that can be
configured via a single training config.
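One way to picture how a config block can point at your own model code is a lookup table of functions keyed by name. The sketch below is a toy version of that idea written with plain Python. `ARCHITECTURES`, `register_architecture`, `build_my_model` and `resolve` are all hypothetical names for this illustration and are not part of spaCy's API. The "model" is just a dict that records its settings.

```python
from typing import Callable, Dict

# Hypothetical registry: maps a string name to a model-building function.
ARCHITECTURES: Dict[str, Callable] = {}


def register_architecture(name: str):
    """Decorator that stores a function under a string name."""

    def decorator(func):
        ARCHITECTURES[name] = func
        return func

    return decorator


@register_architecture("my_model.v1")
def build_my_model(width: int) -> dict:
    # Stand-in for a real model: it only records its settings.
    return {"kind": "my_model", "width": width}


def resolve(block: dict):
    """Call the function named by "@architectures", passing the other keys as arguments."""
    kwargs = {key: value for key, value in block.items() if not key.startswith("@")}
    return ARCHITECTURES[block["@architectures"]](**kwargs)


# Mirrors a [components.my_component.model] block with a name and one setting.
model_block = {"@architectures": "my_model.v1", "width": 128}
model = resolve(model_block)
print(model)  # {'kind': 'my_model', 'width': 128}
```

Because the config only stores a name and arguments, swapping in a different architecture means changing one string instead of editing the component's code.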

> #### config.cfg (excerpt)
>
> ```ini
> [components.my_component]
> factory = "my_component"
>
> [components.my_component.model]
> @architectures = "my_model.v1"
> width = 128
> ```

![Illustration of Pipe methods](../images/trainable_component.svg)

<Infobox title="Custom trainable components" emoji="📖">

To learn more about how to implement your own **model architectures** and use
them to power custom **trainable components**, see the usage guides on the
[trainable component API](/usage/processing-pipelines#trainable-components) and
implementing [layers and architectures](/usage/layers-architectures#components)
for trainable components.

</Infobox>

## Language data {#language-data}

import LanguageData101 from 'usage/101/\_language-data.md'

<LanguageData101 />

## Community & FAQ {#community-faq}

We're very happy to see the spaCy community grow and include a mix of people
from all kinds of different backgrounds – computational linguistics, data
science, deep learning, research and more. If you'd like to get involved, below
are some answers to the most important questions and resources for further
reading.

### Help, my code isn't working! {#faq-help-code}

Bugs suck, and we're doing our best to continuously improve the tests and fix
bugs as soon as possible. Before you submit an issue, do a quick search and
check if the problem has already been reported. If you're having installation or
loading problems, make sure to also check out the
[troubleshooting guide](/usage/#troubleshooting). Help with spaCy is available
via the following platforms:

> #### How do I know if something is a bug?
>
> Of course, it's always hard to know for sure, so don't worry – we're not going
> to be mad if a bug report turns out to be a typo in your code. As a simple
> rule, any C-level error without a Python traceback, like a **segmentation
> fault** or **memory error**, is **always** a spaCy bug.
>
> Because models are statistical, their performance will never be _perfect_.
> However, if you come across **patterns that might indicate an underlying
> issue**, please do file a report. Similarly, we also care about behaviors that
> **contradict our docs**.

- [Stack Overflow](https://stackoverflow.com/questions/tagged/spacy): **Usage
  questions** and everything related to problems with your specific code. The
  Stack Overflow community is much larger than ours, so if your problem can be
  solved by others, you'll receive help much quicker.
- [GitHub discussions](https://github.com/explosion/spaCy/discussions): **General
  discussion**, **project ideas** and **usage questions**. Meet other community
  members to get help with a specific code implementation, discuss ideas for new
  projects/plugins, support more languages, and share best practices.
- [GitHub issue tracker](https://github.com/explosion/spaCy/issues): **Bug
  reports** and **improvement suggestions**, i.e. everything that's likely
  spaCy's fault. This also includes problems with the trained pipelines beyond
  statistical imprecisions, like patterns that point to a bug.

<Infobox title="Important note" variant="warning">

Please understand that we won't be able to provide individual support via email.
We also believe that help is much more valuable if it's shared publicly, so that
**more people can benefit from it**. If you come across an issue and you think
you might be able to help, consider posting a quick update with your solution.
No matter how simple, it can easily save someone a lot of time and headache –
and the next time you need help, they might repay the favor.

</Infobox>

### How can I contribute to spaCy? {#faq-contributing}

You don't have to be an NLP expert or Python pro to contribute, and we're happy
to help you get started. If you're new to spaCy, a good place to start is the
[`help wanted (easy)` label](https://github.com/explosion/spaCy/issues?q=is%3Aissue+is%3Aopen+label%3A"help+wanted+%28easy%29")
on GitHub, which we use to tag bugs and feature requests that are easy and
self-contained. We also appreciate contributions to the docs – whether it's
fixing a typo, improving an example or adding additional explanations. You'll
find a "Suggest edits" link at the bottom of each page that points you to the
source.

Another way of getting involved is to help us improve the
[language data](/usage/linguistic-features#language-data) – especially if you
happen to speak one of the languages currently in
[alpha support](/usage/models#languages). Even adding simple tokenizer
exceptions, stop words or lemmatizer data can make a big difference. It will
also make it easier for us to provide a trained pipeline for the language in the
future. Submitting a test that documents a bug or performance issue, or covers
functionality that's especially important for your application is also very
helpful. This way, you'll also make sure we never accidentally introduce
regressions to the parts of the library that you care about the most.

**For more details on the types of contributions we're looking for, the code
conventions and other useful tips, make sure to check out the
[contributing guidelines](%%GITHUB_SPACY/CONTRIBUTING.md).**

<Infobox title="Code of Conduct" variant="warning">

spaCy adheres to the
[Contributor Covenant Code of Conduct](http://contributor-covenant.org/version/1/4/).
By participating, you are expected to uphold this code.

</Infobox>

### I've built something cool with spaCy – how can I get the word out? {#faq-project-with-spacy}

First, congrats – we'd love to check it out! When you share your project on
Twitter, don't forget to tag [@spacy_io](https://twitter.com/spacy_io) so we
don't miss it. If you think your project would be a good fit for the
[spaCy Universe](/universe), **feel free to submit it!** Tutorials are also
incredibly valuable to other users and a great way to get exposure. So we
strongly encourage **writing up your experiences**, or sharing your code and
some tips and tricks on your blog. Since our website is open-source, you can add
your project or tutorial by making a pull request on GitHub.

If you would like to use the spaCy logo on your site, please get in touch and
ask us first. However, if you want to show support and tell others that your
project is using spaCy, you can grab one of our **spaCy badges** here:

<img src={`https://img.shields.io/badge/built%20with-spaCy-09a3d5.svg`} />

```markdown
[![Built with spaCy](https://img.shields.io/badge/built%20with-spaCy-09a3d5.svg)](https://spacy.io)
```

<img src={`https://img.shields.io/badge/made%20with%20❤%20and-spaCy-09a3d5.svg`}
/>

```markdown
[![Built with spaCy](https://img.shields.io/badge/made%20with%20❤%20and-spaCy-09a3d5.svg)](https://spacy.io)
```