diff --git a/README.rst b/README.rst
index 671801061..2f4acd540 100644
--- a/README.rst
+++ b/README.rst
@@ -6,7 +6,7 @@ Cython. spaCy is built on the very latest research, but it isn't researchware.
It was designed from day 1 to be used in real products. It's commercial
open-source software, released under the MIT license.
-💫 **Version 1.3 out now!** `Read the release notes here. `_
+💫 **Version 1.4 out now!** `Read the release notes here. `_
.. image:: http://i.imgur.com/wFvLZyJ.png
:target: https://travis-ci.org/explosion/spaCy
@@ -241,8 +241,45 @@ calling ``spacy.load()``, or by passing a ``path`` argument to the ``spacy.en.En
Changelog
=========
-2016-12-03 `v1.3.0 `_: *Improve API consistency*
----------------------------------------------------------------------------------------------
+2016-12-18 `v1.4.0 `_: *Improved language data and alpha Dutch support*
+--------------------------------------------------------------------------------------------------------------------
+
+**✨ Major features and improvements**
+
+* **NEW:** Alpha support for Dutch tokenization.
+* Reorganise and improve format for language data.
+* Add shared tag map, entity rules, emoticons and punctuation to language data.
+* Convert entity rules, morphological rules and lemmatization rules from JSON to Python.
+* Update language data for English, German, Spanish, French, Italian and Portuguese.
+
+**🔴 Bug fixes**
+
+* Fix issue `#649 `_: Update and reorganise stop lists.
+* Fix issue `#672 `_: Make ``token.ent_iob_`` return unicode.
+* Fix issue `#674 `_: Add missing lemmas for contracted forms of "be" to ``TOKENIZER_EXCEPTIONS``.
+* Fix issue `#683 `_: ``Morphology`` class now supplies a tag map value for the special space tag if it's missing.
+* Fix issue `#684 `_: Ensure ``spacy.en.English()`` loads the GloVe vector data if available. Previously, this was inconsistent with the behaviour of ``spacy.load('en')``.
+* Fix issue `#685 `_: Expand ``TOKENIZER_EXCEPTIONS`` with unicode apostrophe (``’``).
+* Fix issue `#689 `_: Correct typo in ``STOP_WORDS``.
+* Fix issue `#691 `_: Add tokenizer exceptions for "gonna" and "Gonna".
+
+**⚠️ Backwards incompatibilities**
+
+No changes to the public, documented API, but the previously undocumented language data and model initialisation processes have been refactored and reorganised. If you were relying on the ``bin/init_model.py`` script, see the new `spaCy Developer Resources `_ repo. Code that references internals of the ``spacy.en`` or ``spacy.de`` packages should also be reviewed before updating to this version.
+
+**📖 Documentation and examples**
+
+* **NEW:** `"Adding languages" `_ workflow.
+* **NEW:** `"Part-of-speech tagging" `_ workflow.
+* **NEW:** `spaCy Developer Resources `_ repo – scripts, tools and resources for developing spaCy.
+* Fix various typos and inconsistencies.
+
+**👥 Contributors**
+
+Thanks to `@dafnevk `_, `@jvdzwaan `_, `@RvanNieuwpoort `_, `@wrvhage `_, `@jaspb `_, `@savvopoulos `_ and `@davedwards `_ for the pull requests!
+
+2016-12-03 `v1.3.0 `_: *Improve API consistency*
+--------------------------------------------------------------------------------------------------------
**✨ API improvements**
diff --git a/website/_harp.json b/website/_harp.json
index caa67a9f9..63021272f 100644
--- a/website/_harp.json
+++ b/website/_harp.json
@@ -12,7 +12,7 @@
"COMPANY_URL": "https://explosion.ai",
"DEMOS_URL": "https://demos.explosion.ai",
- "SPACY_VERSION": "1.2",
+ "SPACY_VERSION": "1.4",
"LATEST_NEWS": {
"url": "https://explosion.ai/blog/spacy-user-survey",
"title": "The results of the spaCy user survey"
@@ -53,7 +53,7 @@
}
},
- "V_CSS": "1.9",
+ "V_CSS": "1.10",
"V_JS": "1.0",
"DEFAULT_SYNTAX" : "python",
"ANALYTICS": "UA-58931649-1",
diff --git a/website/docs/usage/_data.json b/website/docs/usage/_data.json
index dce419d75..eb85c683d 100644
--- a/website/docs/usage/_data.json
+++ b/website/docs/usage/_data.json
@@ -8,6 +8,7 @@
"Loading the pipeline": "language-processing-pipeline",
"Processing text": "processing-text",
"spaCy's data model": "data-model",
+ "POS tagging": "pos-tagging",
"Using the parse": "dependency-parse",
"Entity recognition": "entity-recognition",
"Custom pipelines": "customizing-pipeline",
@@ -15,7 +16,8 @@
"Word vectors": "word-vectors-similarities",
"Deep learning": "deep-learning",
"Custom tokenization": "customizing-tokenizer",
- "Training": "training"
+ "Training": "training",
+ "Adding languages": "adding-languages"
},
"Examples": {
"Tutorials": "tutorials",
@@ -82,6 +84,16 @@
"title": "Training the tagger, parser and entity recognizer"
},
+ "pos-tagging": {
+ "title": "Part-of-speech tagging",
+ "next": "dependency-parse"
+ },
+
+ "adding-languages": {
+ "title": "Adding languages",
+ "next": "training"
+ },
+
"showcase": {
"title": "Showcase",
@@ -105,6 +117,11 @@
"url": "https://github.com/avisingh599/visual-qa",
"author": "Avi Singh",
"description": "Keras-based LSTM/CNN models for Visual Question Answering."
+ },
+ "rasa_nlu": {
+ "url": "https://github.com/golastmile/rasa_nlu",
+ "author": "LASTMILE",
+ "description": "High level APIs for building your own language parser using existing NLP and ML libraries."
}
},
"visualizations": {
@@ -169,6 +186,11 @@
}
},
"research": {
+ "Distributional semantics for understanding spoken meal descriptions": {
+ "url": "https://www.semanticscholar.org/paper/Distributional-semantics-for-understanding-spoken-Korpusik-Huang/5f55c5535e80d3e5ed7f1f0b89531e32725faff5",
+ "author": "Mandy Korpusik et al. (2016)"
+ },
+
"Refactoring the Genia Event Extraction Shared Task Toward a General Framework for IE-Driven KB Development": {
"url": "https://www.semanticscholar.org/paper/Refactoring-the-Genia-Event-Extraction-Shared-Task-Kim-Wang/06d94b64a7bd2d3433f57caddad5084435d6a91f",
"author": "Jin-Dong Kim et al. (2016)"
diff --git a/website/docs/usage/adding-languages.jade b/website/docs/usage/adding-languages.jade
new file mode 100644
index 000000000..349ab3b45
--- /dev/null
+++ b/website/docs/usage/adding-languages.jade
@@ -0,0 +1,463 @@
+//- 💫 DOCS > USAGE > ADDING LANGUAGES
+
+include ../../_includes/_mixins
+
+p
+ | Adding full support for a language touches many different parts of the
+ | spaCy library. This guide explains how to fit everything together, and
+ | points you to the specific workflows for each component. Obviously,
+ | there are lots of ways you can organise your code when you implement
+ | your own #[+api("language") #[code Language]] class. This guide will
+ | focus on how it's done within spaCy. For full language support, we'll
+ | need to:
+
++list("numbers")
+ +item
+ | Create a #[strong #[code Language] subclass] and
+ | #[a(href="#language-subclass") implement it].
+
+ +item
+ | Define custom #[strong language data], like a
+ | #[a(href="#stop-words") stop list], #[a(href="#tag-map") tag map]
+ | and #[a(href="#tokenizer-exceptions") tokenizer exceptions].
+
+ +item
+ | #[strong Build the vocabulary] including
+ | #[a(href="#word-probabilities") word probabilities],
+ | #[a(href="#brown-clusters") Brown clusters] and
+ | #[a(href="#word-vectors") word vectors].
+
+p
+ | Once you have the tokenizer and vocabulary, you can
+ | #[+a("/docs/usage/training") train the tagger, parser and entity recognizer].
+ | For some languages, you may also want to develop a solution for
+ | lemmatization and morphological analysis.
+
++h(2, "language-subclass") Creating a #[code Language] subclass
+
+p
+ | Language-specific code and resources should be organised into a
+ | subpackage of spaCy, named according to the language's
+ | #[+a("https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes") ISO code].
+ | For instance, code and resources specific to Spanish are placed into a
+ | folder #[code spacy/es], which can be imported as #[code spacy.es].
+
+p
+ | To get started, you can use our
+ | #[+src(gh("spacy-dev-resources", "templates/new_language")) templates]
+ | for the most important files. Here's what the class template looks like:
+
++code("__init__.py (excerpt)").
+    from ..language import Language
+    from ..attrs import LANG
+
+    # Import language-specific data
+    from .language_data import *
+
+ class Xxxxx(Language):
+ lang = 'xx' # ISO code
+
+ class Defaults(Language.Defaults):
+ lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
+ lex_attr_getters[LANG] = lambda text: 'xx'
+
+ # override defaults
+ tokenizer_exceptions = TOKENIZER_EXCEPTIONS
+ tag_map = TAG_MAP
+ stop_words = STOP_WORDS
+
+p Additionally, the new #[code Language] class needs to be registered in #[+src(gh("spaCy", "spacy/__init__.py")) spacy/__init__.py] using the #[code set_lang_class()] function, so that you can use #[code spacy.load()].
+
++code("spacy/__init__.py").
+ from . import en
+ from . import xx
+
+ set_lang_class(en.English.lang, en.English)
+ set_lang_class(xx.Xxxxx.lang, xx.Xxxxx)
+
+p You'll also need to list the new package in #[+src(gh("spaCy", "spacy/setup.py")) setup.py]:
+
++code("spacy/setup.py").
+ PACKAGES = [
+ 'spacy',
+ 'spacy.tokens',
+ 'spacy.en',
+ 'spacy.xx',
+ # ...
+ ]
+
++h(2, "language-data") Adding language data
+
+p
+ | Every language is full of exceptions and special cases, especially
+ | amongst the most common words. Some of these exceptions are shared
+ | between multiple languages, while others are entirely idiosyncratic.
+ | spaCy makes it easy to deal with these exceptions on a case-by-case
+ | basis, by defining simple rules and exceptions. The exceptions data is
+    | defined in Python as part of the
+ | #[+src(gh("spacy-dev-resources", "templates/new_language")) language data],
+ | so that Python functions can be used to help you generalise and combine
+ | the data as you require.
+
++h(3, "stop-words") Stop words
+
+p
+ | A #[+a("https://en.wikipedia.org/wiki/Stop_words") "stop list"] is a
+ | classic trick from the early days of information retrieval when search
+ | was largely about keyword presence and absence. It is still sometimes
+ | useful today to filter out common words from a bag-of-words model.
+
++aside("What does spaCy consider a stop word?")
+    | There's no particularly principled logic behind what words should be
+ | added to the stop list. Make a list that you think might be useful
+ | to people and is likely to be unsurprising. As a rule of thumb, words
+ | that are very rare are unlikely to be useful stop words.
+
+p
+ | To improve readability, #[code STOP_WORDS] are separated by spaces and
+ | newlines, and added as a multiline string:
+
++code("Example").
+ STOP_WORDS = set("""
+ a about above across after afterwards again against all almost alone along
+ already also although always am among amongst amount an and another any anyhow
+ anyone anything anyway anywhere are around as at
+
+ back be became because become becomes becoming been before beforehand behind
+ being below beside besides between beyond both bottom but by
+ """).split())
+
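+p
+    | As a quick sanity check, assuming your #[code Defaults] wire up
+    | #[code STOP_WORDS] the way the English class does, the list surfaces
+    | at runtime through the #[code is_stop] attribute:
+
++code("Example").
+    # hypothetical usage, with a pipeline loaded for your language
+    doc = nlp(u'All these words are very common')
+    print([token.text for token in doc if token.is_stop])
+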
++h(3, "tag-map") Tag map
+
+p
+ | Most treebanks define a custom part-of-speech tag scheme, striking a
+ | balance between level of detail and ease of prediction. While it's
+ | useful to have custom tagging schemes, it's also useful to have a common
+ | scheme, to which the more specific tags can be related. The tagger can
+ | learn a tag scheme with any arbitrary symbols. However, you need to
+ | define how those symbols map down to the
+ | #[+a("http://universaldependencies.org/u/pos/all.html") Universal Dependencies tag set].
+ | This is done by providing a tag map.
+
+p
+ | The keys of the tag map should be #[strong strings in your tag set]. The
+    | values should be a dictionary. The dictionary must have an entry #[code POS]
+ | whose value is one of the
+ | #[+a("http://universaldependencies.org/u/pos/all.html") Universal Dependencies]
+ | tags. Optionally, you can also include morphological features or other
+ | token attributes in the tag map as well. This allows you to do simple
+ | #[+a("/docs/usage/pos-tagging#rule-based-morphology") rule-based morphological analysis].
+
++code("Example").
+ TAG_MAP = {
+ "NNS": {POS: NOUN, "Number": "plur"},
+ "VBG": {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
+ "DT": {POS: DET}
+ }
+
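+p
+    | At runtime, these mappings show up on the token. For example, with
+    | entries like the ones above and a trained tagger in place, a
+    | hypothetical check could look like this:
+
++code("Example").
+    doc = nlp(u'She was reading papers')
+    token = doc[2]                 # "reading"
+    print(token.tag_, token.pos_)  # the treebank tag and the coarse tag
+                                   # looked up in TAG_MAP, e.g. VBG VERB
+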
++h(3, "tokenizer-exceptions") Tokenizer exceptions
+
+p
+ | spaCy's #[+a("/docs/usage/customizing-tokenizer#how-tokenizer-works") tokenization algorithm]
+ | lets you deal with whitespace-delimited chunks separately. This makes it
+ | easy to define special-case rules, without worrying about how they
+ | interact with the rest of the tokenizer. Whenever the key string is
+ | matched, the special-case rule is applied, giving the defined sequence of
+    | tokens. You can also attach attributes to the subtokens covered by your
+    | special case, such as the subtoken's #[code LEMMA] or #[code TAG].
+
+p
+ | Tokenizer exceptions can be added in the following format:
+
++code("language_data.py").
+ TOKENIZER_EXCEPTIONS = {
+ "don't": [
+ {ORTH: "do", LEMMA: "do"},
+ {ORTH: "n't", LEMMA: "not", TAG: "RB"}
+ ]
+ }
+
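+p
+    | To illustrate the effect, assuming the exception above is loaded as
+    | part of an English-like pipeline, the chunk is split into the defined
+    | subtokens with their attributes attached:
+
++code("Example").
+    doc = nlp(u"I don't understand")
+    print([(t.text, t.lemma_) for t in doc])
+    # hypothetical output:
+    # [('I', '-PRON-'), ('do', 'do'), ("n't", 'not'), ('understand', 'understand')]
+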
+p
+ | Some exceptions, like certain abbreviations, will always be mapped to a
+ | single token containing only an #[code ORTH] property. To make your data
+ | less verbose, you can use the helper function #[code strings_to_exc()]
+ | with a simple array of strings:
+
++code("Example").
+ from ..language_data import update_exc, strings_to_exc
+
+ ORTH_ONLY = ["a.", "b.", "c."]
+ converted = strings_to_exc(ORTH_ONLY)
+ # {"a.": [{ORTH: "a."}], "b.": [{ORTH: "b."}], "c.": [{ORTH: "c."}]}
+
+ update_exc(TOKENIZER_EXCEPTIONS, converted)
+
+p
+ | Unambiguous abbreviations, like month names or locations in English,
+ | should be added to #[code TOKENIZER_EXCEPTIONS] with a lemma assigned,
+ | for example #[code {ORTH: "Jan.", LEMMA: "January"}].
+
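+p
+    | A minimal sketch of what such entries could look like, using a couple
+    | of hypothetical abbreviations:
+
++code("Example").
+    ABBREVIATIONS = {
+        "Jan.": [{ORTH: "Jan.", LEMMA: "January"}],
+        "Dec.": [{ORTH: "Dec.", LEMMA: "December"}]
+    }
+
+    update_exc(TOKENIZER_EXCEPTIONS, ABBREVIATIONS)
+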
++h(3, "custom-tokenizer-exceptions") Custom tokenizer exceptions
+
+p
+ | For language-specific tokenizer exceptions, you can use the
+ | #[code update_exc()] function to update the existing exceptions with a
+ | custom dictionary. This is especially useful for exceptions that follow
+ | a consistent pattern. Instead of adding each exception manually, you can
+ | write a simple function that returns a dictionary of exceptions.
+
+p
+ | For example, here's how exceptions for time formats like "1a.m." and
+ | "1am" are generated in the English
+ | #[+src(gh("spaCy", "spacy/en/language_data.py")) language_data.py]:
+
++code("language_data.py").
+    from .. import language_data
+    from ..language_data import update_exc
+
+ def get_time_exc(hours):
+ exc = {}
+ for hour in hours:
+ exc["%da.m." % hour] = [{ORTH: hour}, {ORTH: "a.m."}]
+ exc["%dp.m." % hour] = [{ORTH: hour}, {ORTH: "p.m."}]
+ exc["%dam" % hour] = [{ORTH: hour}, {ORTH: "am", LEMMA: "a.m."}]
+ exc["%dpm" % hour] = [{ORTH: hour}, {ORTH: "pm", LEMMA: "p.m."}]
+ return exc
+
+
+ TOKENIZER_EXCEPTIONS = dict(language_data.TOKENIZER_EXCEPTIONS)
+
+ hours = 12
+ update_exc(TOKENIZER_EXCEPTIONS, get_time_exc(range(1, hours + 1)))
+
++h(3, "utils") Shared utils
+
+p
+ | The #[code spacy.language_data] package provides constants and functions
+ | that can be imported and used across languages.
+
++aside("About spaCy's custom pronoun lemma")
+ | Unlike verbs and common nouns, there's no clear base form of a personal
+ | pronoun. Should the lemma of "me" be "I", or should we normalize person
+ | as well, giving "it" — or maybe "he"? spaCy's solution is to introduce a
+ | novel symbol, #[code.u-nowrap -PRON-], which is used as the lemma for
+ | all personal pronouns.
+
++table(["Name", "Description"])
+ +row
+ +cell #[code PRON_LEMMA]
+ +cell
+ | Special value for pronoun lemmas (#[code "-PRON-"]).
+
+ +row
+ +cell #[code ENT_ID]
+ +cell
+ | Special value for entity IDs (#[code "ent_id"])
+
+ +row
+ +cell #[code update_exc(exc, additions)]
+ +cell
+ | Update an existing dictionary of exceptions #[code exc] with a
+ | dictionary of #[code additions].
+
+ +row
+ +cell #[code strings_to_exc(orths)]
+ +cell
+ | Convert an array of strings to a dictionary of exceptions of the
+ | format #[code {"string": [{ORTH: "string"}]}].
+
+ +row
+ +cell #[code expand_exc(excs, search, replace)]
+ +cell
+ | Search for a string #[code search] in a dictionary of exceptions
+ | #[code excs] and if found, copy the entry and replace
+ | #[code search] with #[code replace] in both the key and
+ | #[code ORTH] value. Useful to provide exceptions containing
+ | different versions of special unicode characters, like
+ | #[code '] and #[code ’].
+
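+p
+    | For instance, here's a sketch of how #[code expand_exc()] might be
+    | used to add variants of existing apostrophe exceptions that use the
+    | unicode apostrophe instead. This assumes the expanded entries come
+    | back as a new dictionary, which is then merged in with
+    | #[code update_exc()]:
+
++code("Example").
+    from ..language_data import update_exc, expand_exc
+
+    # copy every exception containing the ASCII apostrophe and create a
+    # variant keyed by the unicode apostrophe
+    unicode_variants = expand_exc(TOKENIZER_EXCEPTIONS, "'", "’")
+    update_exc(TOKENIZER_EXCEPTIONS, unicode_variants)
+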
+p
+ | If you've written a custom function that seems like it might be useful
+ | for several languages, consider adding it to
+ | #[+src(gh("spaCy", "spacy/language_data/util.py")) language_data/util.py]
+ | instead of the individual language module.
+
++h(3, "shared-data") Shared language data
+
+p
+ | Because languages can vary in quite arbitrary ways, spaCy avoids
+ | organising the language data into an explicit inheritance hierarchy.
+    | Instead, reusable functions and data are collected as atomic pieces in
+ | the #[code spacy.language_data] package.
+
++aside-code("Example").
+ from ..language_data import update_exc, strings_to_exc
+ from ..language_data import EMOTICONS
+
+ # Add custom emoticons
+ EMOTICONS = EMOTICONS + ["8===D", ":~)"]
+
+ # Add emoticons to tokenizer exceptions
+ update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(EMOTICONS))
+
++table(["Name", "Description", "Source"])
+ +row
+ +cell #[code EMOTICONS]
+
+ +cell
+ | Common unicode emoticons without whitespace.
+
+ +cell
+ +src(gh("spaCy", "spacy/language_data/emoticons.py")) emoticons.py
+
+ +row
+ +cell #[code TOKENIZER_PREFIXES]
+
+ +cell
+ | Regular expressions to match left-attaching tokens and
+ | punctuation, e.g. #[code $], #[code (], #[code "]
+
+ +cell
+ +src(gh("spaCy", "spacy/language_data/punctuation.py")) punctuation.py
+
+ +row
+ +cell #[code TOKENIZER_SUFFIXES]
+
+ +cell
+ | Regular expressions to match right-attaching tokens and
+ | punctuation, e.g. #[code %], #[code )], #[code "]
+
+ +cell
+ +src(gh("spaCy", "spacy/language_data/punctuation.py")) punctuation.py
+
+ +row
+ +cell #[code TOKENIZER_INFIXES]
+
+ +cell
+ | Regular expressions to match token separators, e.g. #[code -]
+
+ +cell
+ +src(gh("spaCy", "spacy/language_data/punctuation.py")) punctuation.py
+
+ +row
+ +cell #[code TAG_MAP]
+
+ +cell
+            | A tag map that maps the universal part-of-speech tags to
+            | themselves, with no morphological features.
+
+ +cell
+ +src(gh("spaCy", "spacy/language_data/tag_map.py")) tag_map.py
+
+ +row
+ +cell #[code ENTITY_RULES]
+
+ +cell
+ | Patterns for named entities commonly missed by the statistical
+ | entity recognizer, for use in the rule matcher.
+
+ +cell
+ +src(gh("spaCy", "spacy/language_data/entity_rules.py")) entity_rules.py
+
+ +row
+ +cell #[code FALSE_POSITIVES]
+
+ +cell
+ | Patterns for phrases commonly mistaken for named entities by the
+ | statistical entity recognizer, to use in the rule matcher.
+
+ +cell
+ +src(gh("spaCy", "spacy/language_data/entity_rules.py")) entity_rules.py
+
+p
+ | Individual languages can extend and override any of these expressions.
+ | Often, when a new language is added, you'll find a pattern or symbol
+ | that's missing. Even if this pattern or symbol isn't common in other
+ | languages, it might be best to add it to the base expressions, unless it
+ | has some conflicting interpretation. For instance, we don't expect to
+    | see guillemet quotation marks (#[code »] and #[code «]) in
+ | English text. But if we do see them, we'd probably prefer the tokenizer
+    | to split them off.
+
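+p
+    | The same pattern as in the #[code EMOTICONS] example above works here.
+    | As a rough sketch, assuming the shared expressions are plain lists of
+    | pattern strings, a language could add guillemets like this:
+
++code("Example").
+    from ..language_data import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES
+
+    # add guillemets as left- and right-attaching punctuation
+    TOKENIZER_PREFIXES = TOKENIZER_PREFIXES + [r'«']
+    TOKENIZER_SUFFIXES = TOKENIZER_SUFFIXES + [r'»']
+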
++h(2, "vocabulary") Building the vocabulary
+
+p
+ | spaCy expects that common words will be cached in a
+ | #[+api("vocab") #[code Vocab]] instance. The vocabulary caches lexical
+ | features, and makes it easy to use information from unlabelled text
+ | samples in your models. Specifically, you'll usually want to collect
+ | word frequencies, and train two types of distributional similarity model:
+ | Brown clusters, and word vectors. The Brown clusters are used as features
+ | by linear models, while the word vectors are useful for lexical
+ | similarity models and deep learning.
+
+p
+    | Once you've collected the word frequency, Brown cluster and word
+    | vector files, you can use the
+ | #[+src(gh("spacy-dev-resources", "training/init.py")) init.py]
+ | script from our
+ | #[+a(gh("spacy-developer-resources")) developer resources] to create a
+ | spaCy data directory:
+
++code(false, "bash").
+ python training/init.py xx your_data_directory/ my_data/word_freqs.txt my_data/clusters.txt my_data/word_vectors.bz2
+
++aside-code("your_data_directory", "yaml").
+    ├── vocab/
+    |   ├── lexemes.bin   # via nlp.vocab.dump(path)
+    |   ├── strings.json  # via nlp.vocab.strings.dump(file_)
+    |   └── oov_prob      # optional
+    ├── pos/              # optional
+    |   ├── model         # via nlp.tagger.model.dump(path)
+    |   └── config.json   # via Language.train
+    ├── deps/             # optional
+    |   ├── model         # via nlp.parser.model.dump(path)
+    |   └── config.json   # via Language.train
+    └── ner/              # optional
+        ├── model         # via nlp.entity.model.dump(path)
+        └── config.json   # via Language.train
+
+p
+ | This creates a spaCy data directory with a vocabulary model, ready to be
+ | loaded. By default, the
+ | #[+src(gh("spacy-dev-resources", "training/init.py")) init.py]
+ | script expects to be able to find your language class using
+ | #[code spacy.util.get_lang_class(lang_id)]. You can edit the script to
+ | help it find your language class if necessary.
+
++h(3, "word-frequencies") Word frequencies
+
+p
+ | The #[code init.py] script expects a tab-separated word frequencies file
+ | with three columns: the number of times the word occurred in your language
+ | sample, the number of distinct documents the word occurred in, and the
+ | word itself. You should make sure you use the spaCy tokenizer for your
+ | language to segment the text for your word frequencies. This will ensure
+ | that the frequencies refer to the same segmentation standards you'll be
+ | using at run-time. For instance, spaCy's English tokenizer segments "can't"
+ | into two tokens. If we segmented the text by whitespace to produce the
+    | frequency counts, we'd end up with incorrect counts for the tokens
+ | "ca" and "n't".
+
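+p
+    | As a rough sketch of how such a file could be produced with your
+    | language's tokenizer (the column layout is as described above;
+    | everything else, including the class and variable names, is just one
+    | possible approach):
+
++code("Example").
+    import io
+    from collections import Counter
+    from spacy.xx import Xxxxx  # hypothetical new language class
+
+    nlp = Xxxxx()
+    word_counts = Counter()
+    doc_counts = Counter()
+    for text in texts:  # your corpus, one document per string
+        words = [w.text for w in nlp.tokenizer(text)]
+        word_counts.update(words)
+        doc_counts.update(set(words))
+    with io.open('word_freqs.txt', 'w', encoding='utf8') as f:
+        for word, freq in word_counts.items():
+            f.write(u'%d\t%d\t%s\n' % (freq, doc_counts[word], word))
+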
++h(3, "brown-clusters") Training the Brown clusters
+
+p
+ | spaCy's tagger, parser and entity recognizer are designed to use
+ | distributional similarity features provided by the
+ | #[+a("https://github.com/percyliang/brown-cluster") Brown clustering algorithm].
+ | You should train a model with between 500 and 1000 clusters. A minimum
+ | frequency threshold of 10 usually works well.
+
++h(3, "word-vectors") Training the word vectors
+
+p
+ | #[+a("https://en.wikipedia.org/wiki/Word2vec") Word2vec] and related
+ | algorithms let you train useful word similarity models from unlabelled
+ | text. This is a key part of using
+ | #[+a("/docs/usage/deep-learning") deep learning] for NLP with limited
+ | labelled data. The vectors are also useful by themselves – they power
+ | the #[code .similarity()] methods in spaCy. For best results, you should
+ | pre-process the text with spaCy before training the Word2vec model. This
+ | ensures your tokenization will match.
+
+p
+ | You can use our
+ | #[+src(gh("spacy-dev-resources", "training/word_vectors.py")) word vectors training script],
+ | which pre-processes the text with your language-specific tokenizer and
+ | trains the model using #[+a("https://radimrehurek.com/gensim/") Gensim].
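+
+p
+    | If you'd rather call Gensim directly, a minimal sketch might look like
+    | the following. The exact parameters and the Gensim API can differ
+    | between versions, and the #[code Xxxxx] class and #[code texts] corpus
+    | are placeholders, so treat this as an outline rather than a recipe:
+
++code("Example").
+    from gensim.models import Word2Vec
+    from spacy.xx import Xxxxx  # hypothetical new language class
+
+    nlp = Xxxxx()
+    # pre-tokenize the corpus with spaCy so the vector keys match its tokens
+    sentences = [[w.text for w in nlp.tokenizer(text)] for text in texts]
+    model = Word2Vec(sentences, size=300, window=5, min_count=10, workers=4)
+    model.save_word2vec_format('word_vectors.bin', binary=True)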
diff --git a/website/docs/usage/pos-tagging.jade b/website/docs/usage/pos-tagging.jade
new file mode 100644
index 000000000..cded00b6c
--- /dev/null
+++ b/website/docs/usage/pos-tagging.jade
@@ -0,0 +1,93 @@
+//- 💫 DOCS > USAGE > PART-OF-SPEECH TAGGING
+
+include ../../_includes/_mixins
+
+p
+    | Part-of-speech tags are labels like noun, verb and adjective that are
+    | assigned to each token in the document. They're useful in rule-based
+    | processes, and can also serve as features in statistical models.
+
+p
+ | To use spaCy's tagger, you need to have a data pack installed that
+ | includes a tagging model. Tagging models are included in the data
+ | downloads for English and German. After you load the model, the tagger
+ | is applied automatically, as part of the default pipeline. You can then
+ | access the tags using the #[+api("token") #[code Token.tag]] and
+ | #[+api("token") #[code token.pos]] attributes. For English, the tagger
+ | also triggers some simple rule-based morphological processing, which
+ | gives you the lemma as well.
+
++code("Usage").
+ import spacy
+ nlp = spacy.load('en')
+ doc = nlp(u'They told us to duck.')
+ for word in doc:
+ print(word.text, word.lemma, word.lemma_, word.tag, word.tag_, word.pos, word.pos_)
+
++h(2, "rule-based-morphology") Rule-based morphology
+
+p
+ | Inflectional morphology is the process by which a root form of a word is
+ | modified by adding prefixes or suffixes that specify its grammatical
+    | function but do not change its part-of-speech. We say that a
+ | #[strong lemma] (root form) is #[strong inflected] (modified/combined)
+ | with one or more #[strong morphological features] to create a surface
+ | form. Here are some examples:
+
++table(["Context", "Surface", "Lemma", "POS", "Morphological Features"])
+ +row
+ +cell I was reading the paper
+ +cell reading
+ +cell read
+ +cell verb
+ +cell #[code VerbForm=Ger]
+
+ +row
+ +cell I don't watch the news, I read the paper.
+ +cell read
+ +cell read
+ +cell verb
+ +cell #[code VerbForm=Fin], #[code Mood=Ind], #[code Tense=Pres]
+
+ +row
+        +cell I read the paper yesterday
+ +cell read
+ +cell read
+ +cell verb
+ +cell #[code VerbForm=Fin], #[code Mood=Ind], #[code Tense=Past]
+
+p
+ | English has a relatively simple morphological system, which spaCy
+ | handles using rules that can be keyed by the token, the part-of-speech
+ | tag, or the combination of the two. The system works as follows:
+
++list("numbers")
+ +item
+ | The tokenizer consults a #[strong mapping table]
+ | #[code TOKENIZER_EXCEPTIONS], which allows sequences of characters
+ | to be mapped to multiple tokens. Each token may be assigned a part
+ | of speech and one or more morphological features.
+
+ +item
+ | The part-of-speech tagger then assigns each token an
+ | #[strong extended POS tag]. In the API, these tags are known as
+ | #[code Token.tag]. They express the part-of-speech (e.g.
+ | #[code VERB]) and some amount of morphological information, e.g.
+ | that the verb is past tense.
+
+ +item
+ | For words whose POS is not set by a prior process, a
+ | #[strong mapping table] #[code TAG_MAP] maps the tags to a
+ | part-of-speech and a set of morphological features.
+
+ +item
+ | Finally, a #[strong rule-based deterministic lemmatizer] maps the
+        | surface form to a lemma, in light of the previously assigned
+ | extended part-of-speech and morphological information, without
+ | consulting the context of the token. The lemmatizer also accepts
+ | list-based exception files, acquired from
+ | #[+a("https://wordnet.princeton.edu/") WordNet].
+
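+p
+    | Putting these steps together, here's a rough illustration using the
+    | "reading" example from the table above. The exact output depends on
+    | the statistical model:
+
++code("Example").
+    doc = nlp(u'I was reading the paper')
+    token = doc[2]       # "reading"
+    print(token.tag_)    # e.g. 'VBG', the extended tag from the tagger
+    print(token.pos_)    # e.g. 'VERB', the coarse tag via the tag map
+    print(token.lemma_)  # e.g. 'read', from the rule-based lemmatizer
+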
++h(2, "pos-schemes") Part-of-speech tag schemes
+
+include ../api/_annotation/_pos-tags