Merge branch 'master' of ssh://github.com/explosion/spaCy

2025-08-02 03:10:22 +03:00 · 2016-10-19 03:05:30 +02:00 · 2016-10-19 03:05:30 +02:00 · 9dd789d140
commit 9dd789d140
parent 89d2a5c8b3 3937d5a076
9 changed files with 331 additions and 65 deletions
--- a/README.rst
+++ b/README.rst
@ -7,6 +7,8 @@ the very latest research, but it isn't researchware.  It was designed from day 1
 to be used in real products. It's commercial open-source software, released under 
 the MIT license.

+💫 **Version 1.0.0 out now!** `Read the release notes here. <https://github.com/explosion/spaCy/releases/>`_
+
 .. image:: http://i.imgur.com/wFvLZyJ.png
    :target: https://travis-ci.org/explosion/spaCy
    :alt: spaCy on Travis CI
@ -201,7 +203,6 @@ OS X ships with Python and git preinstalled.

 Windows
 -------
-<<<<<<< HEAD

 Install a version of Visual Studio Express or higher that matches the version 
 that was used to compile your Python interpreter. For official distributions 
@ -221,27 +222,6 @@ Python install. Run:
 Run tests
 =========

-=======
-
-Install a version of Visual Studio Express or higher that matches the version 
-that was used to compile your Python interpreter. For official distributions 
-these are VS 2008 (Python 2.7), VS 2010 (Python 3.4) and VS 2015 (Python 3.5).
-
-Workaround for obsolete system Python
-=====================================
-
-If you're stuck using a system with an old version of Python, and you don't 
-have root access, we've prepared a bootstrap script to help you compile a local 
-Python install. Run:
-
-.. code:: bash
-
-    curl https://raw.githubusercontent.com/spacy-io/gist/master/bootstrap_python_env.sh | bash && source .env/bin/activate
-
-Run tests
-=========
-
->>>>>>> v1.0.0-rc1
 spaCy comes with an extensive test suite. First, find out where spaCy is 
 installed:

@ -273,8 +253,44 @@ For the detailed documentation, check out the `spaCy website <https://spacy.io/d
 Changelog
 =========

-2016-05-10 `v0.101.0 <../../releases/tag/0.101.0>`_: *Fixed German model*
-------------------------------------------------------------------------
+2016-10-18 `v1.0.0 <https://github.com/explosion/spaCy/releases/>`_: *Support for deep learning workflows and entity-aware rule matcher*
+----------------------------------------------------------------------------------------------------------------------------------------
+
+**✨ Major features and improvements**
+
+* **NEW:** `custom processing pipelines <https://spacy.io/docs/tutorials/custom-pipelines>`_, to support deep learning workflows
+* **NEW:** `Rule matcher <https://spacy.io/docs/tutorials/rule-based-matcher>`_ now supports entity IDs and attributes
+* **NEW:** Official/documented `training APIs <https://spacy.io/docs/tutorials/training>`_ and `GoldParse` class
+* Download and use GloVe vectors by default
+* Make it easier to load and unload word vectors
+* Improved rule matching functionality
+* Move basic data into the code, rather than the json files. This makes it simpler to use the tokenizer without the models installed, and makes adding new languages much easier.
+* Replace file-system strings with ``Path`` objects. You can now load resources over your network, or do similar trickery, by passing any object that supports the ``Path`` protocol.
+
+**⚠️  Backwards incompatibilities**
+
+* The ``data_dir`` keyword argument of ``Language.__init__`` (and its subclasses ``English.__init__`` and ``German.__init__``) has been renamed to ``path``.
+* Details of how the Language base-class and its sub-classes are loaded, and how defaults are accessed, have been heavily changed. If you have your own subclasses, you should review the changes.
+* The deprecated ``token.repvec`` name has been removed.
+* The ``.train()`` method of Tagger and Parser has been renamed to ``.update()``
+* The previously undocumented ``GoldParse`` class has a new ``__init__()`` method. The old method has been preserved in ``GoldParse.from_annot_tuples()``.
+* Previously undocumented details of the ``Parser`` class have changed.
+* The previously undocumented ``get_package`` and ``get_package_by_name`` helper functions have been moved into a new module, ``spacy.deprecated``, in case you still need them while you update.
+
+**🔴  Bug fixes**
+
+* Fix ``get_lang_class`` bug when GloVe vectors are used.
+* Fix Issue `#411 <https://github.com/explosion/spaCy/issues/411>`_: ``doc.sents`` raised IndexError on empty string.
+* Fix Issue `#455 <https://github.com/explosion/spaCy/issues/455>`_: Correct lemmatization logic
+* Fix Issue `#371 <https://github.com/explosion/spaCy/issues/371>`_: Make ``Lexeme`` objects hashable
+* Fix Issue `#469 <https://github.com/explosion/spaCy/issues/469>`_: Make ``noun_chunks`` detect root NPs
+
+**👥  Contributors**
+
+Thanks to `@daylen <https://github.com/daylen>`_, `@RahulKulhari <https://github.com/RahulKulhari>`_, `@stared <https://github.com/stared>`_, `@adamhadani <https://github.com/adamhadani>`_, `@izeye <https://github.com/izeye>`_ and `@crawfordcomeaux <https://github.com/crawfordcomeaux>`_ for the pull requests!
+
+2016-05-10 `v0.101.0 <https://github.com/explosion/spaCy/releases/tag/0.101.0>`_: *Fixed German model*
+------------------------------------------------------------------------------------------------------

 * Fixed bug that prevented German parses from being deprojectivised.
 * Bug fixes to sentence boundary detection.
@ -282,8 +298,8 @@ Changelog
 * Add missing ``Doc.has_vector`` and ``Span.has_vector`` properties.
 * Add missing ``Span.sent`` property.

-2016-05-05 `v0.100.7 <../../releases/tag/0.100.7>`_: *German!*
--------------------------------------------------------------
+2016-05-05 `v0.100.7 <https://github.com/explosion/spaCy/releases/tag/0.100.7>`_: *German!*
+-------------------------------------------------------------------------------------------

 spaCy finally supports another language, in addition to English. We're lucky 
 to have Wolfgang Seeker on the team, and the new German model is just the 
@ -327,13 +343,14 @@ and it doesn't yet recognise numeric entities such as numbers and dates.
 * Fix bug that led to inconsistent sentence boundaries before and after serialisation.
 * Fix bug from deserialising untagged documents.

-2016-03-08 `v0.100.6 <../../releases/tag/0.100.6>`_: *Add support for GloVe vectors*
------------------------------------------------------------------------------------
+2016-03-08 `v0.100.6 <https://github.com/explosion/spaCy/releases/tag/0.100.6>`_: *Add support for GloVe vectors*
+-----------------------------------------------------------------------------------------------------------------

 This release offers improved support for replacing the word vectors used by spaCy. 
 To install Stanford's GloVe vectors, trained on the Common Crawl, just run:

 .. code:: bash
+
    sputnik --name spacy install en_glove_cc_300_1m_vectors

 To reduce memory usage and loading time, we've trimmed the vocabulary down to 1m entries.
@ -343,20 +360,21 @@ will be released shortly. To assist in multi-lingual processing, we've added a `
 function. To load the English model with the GloVe vectors:

 .. code:: python
+
    spacy.load('en', vectors='en_glove_cc_300_1m_vectors')

-2016-02-07 `v0.100.5 <../../releases/tag/0.100.5>`_
---------------------------------------------------
+2016-02-07 `v0.100.5 <https://github.com/explosion/spaCy/releases/tag/0.100.5>`_
+--------------------------------------------------------------------------------

 Fix incorrect use of a header file, caused by a problem with thinc.

-2016-02-07 `v0.100.4 <../../releases/tag/0.100.4>`_: *Fix OSX problem introduced in 0.100.3*
--------------------------------------------------------------------------------------------
+2016-02-07 `v0.100.4 <https://github.com/explosion/spaCy/releases/tag/0.100.4>`_: *Fix OSX problem introduced in 0.100.3*
+-------------------------------------------------------------------------------------------------------------------------

 Small correction to right_edge calculation

-2016-02-06 `v0.100.3 <../../releases/tag/0.100.3>`_
---------------------------------------------------
+2016-02-06 `v0.100.3 <https://github.com/explosion/spaCy/releases/tag/0.100.3>`_
+--------------------------------------------------------------------------------

 Support multi-threading, via the ``.pipe`` method. spaCy now releases the GIL around the
 parser and entity recognizer, so systems that support OpenMP should be able to do
@ -364,20 +382,20 @@ shared memory parallelism at close to full efficiency.

 We've also greatly reduced loading time, and fixed a number of bugs.

-2016-01-21 `v0.100.2 <../../releases/tag/0.100.2>`_
---------------------------------------------------
+2016-01-21 `v0.100.2 <https://github.com/explosion/spaCy/releases/tag/0.100.2>`_
+--------------------------------------------------------------------------------

 Fix data version lock that affected v0.100.1

-2016-01-21 `v0.100.1 <../../releases/tag/0.100.1>`_: *Fix install for OSX*
--------------------------------------------------------------------------
+2016-01-21 `v0.100.1 <https://github.com/explosion/spaCy/releases/tag/0.100.1>`_: *Fix install for OSX*
+-------------------------------------------------------------------------------------------------------

 v0.100 included header files built on Linux that caused installation to fail on OSX.
 This should now be corrected. We also update the default data distribution, to
 include a small fix to the tokenizer.

-2016-01-19 `v0.100 <../../releases/tag/0.100>`_: *Revise setup.py, better model downloads, bug fixes*
-----------------------------------------------------------------------------------------------------
+2016-01-19 `v0.100 <https://github.com/explosion/spaCy/releases/tag/0.100>`_: *Revise setup.py, better model downloads, bug fixes*
+----------------------------------------------------------------------------------------------------------------------------------

 * Redo setup.py, and remove ugly headers_workaround hack. Should result in fewer install problems.
 * Update data downloading and installation functionality, by migrating to the Sputnik data-package manager. This will allow us to offer finer grained control of data installation in future.
@ -388,16 +406,16 @@ include a small fix to the tokenizer.
 * Fix problem that caused ``doc.merge()`` to sometimes hang
 * Fix problems in handling of whitespace

-2015-11-08 `v0.99 <../../releases/tag/0.99>`_: *Improve span merging, internal refactoring*
-------------------------------------------------------------------------------------------
+2015-11-08 `v0.99 <https://github.com/explosion/spaCy/releases/tag/0.99>`_: *Improve span merging, internal refactoring*
+------------------------------------------------------------------------------------------------------------------------

 * Merging multi-word tokens into one, via the ``doc.merge()`` and ``span.merge()`` methods, no longer invalidates existing ``Span`` objects. This makes it much easier to merge multiple spans, e.g. to merge all named entities, or all base noun phrases. Thanks to @andreasgrv for help on this patch.
 * Lots of internal refactoring, especially around the machine learning module, thinc. The thinc API has now been improved, and the spacy._ml wrapper module is no longer necessary.
 * The lemmatizer now lower-cases non-noun, non-verb and non-adjective words.
 * A new attribute, ``.rank``, is added to Token and Lexeme objects, giving the frequency rank of the word.

-2015-11-03 `v0.98 <../../releases/tag/0.98>`_: *Smaller package, bug fixes*
---------------------------------------------------------------------------
+2015-11-03 `v0.98 <https://github.com/explosion/spaCy/releases/tag/0.98>`_: *Smaller package, bug fixes*
+---------------------------------------------------------------------------------------------------------

 * Remove binary data from PyPi package.
 * Delete archive after downloading data
@ -405,21 +423,21 @@ include a small fix to the tokenizer.
 * Fix information loss in deserialize
 * Fix ``__str__`` methods for Python2

-2015-10-23 `v0.97 <../../releases/tag/0.97>`_: *Load the StringStore from a json list, instead of a text file*
--------------------------------------------------------------------------------------------------------------
+2015-10-23 `v0.97 <https://github.com/explosion/spaCy/releases/tag/0.97>`_: *Load the StringStore from a json list, instead of a text file*
+-------------------------------------------------------------------------------------------------------------------------------------------

 * Fix bugs in download.py
 * Require ``--force`` to over-write the data directory in download.py
 * Fix bugs in ``Matcher`` and ``doc.merge()``

-2015-10-19 `v0.96 <../../releases/tag/0.96>`_: *Hotfix to .merge method*
------------------------------------------------------------------------
+2015-10-19 `v0.96 <https://github.com/explosion/spaCy/releases/tag/0.96>`_: *Hotfix to .merge method*
+-----------------------------------------------------------------------------------------------------

 * Fix bug that caused text to be lost after ``.merge``
 * Fix bug in Matcher when matched entities overlapped

-2015-10-18 `v0.95 <../../releases/tag/0.95>`_: *Bugfixes*
---------------------------------------------------------
+2015-10-18 `v0.95 <https://github.com/explosion/spaCy/releases/tag/0.95>`_: *Bugfixes*
+--------------------------------------------------------------------------------------

 * Reform encoding of symbols
 * Fix bugs in ``Matcher``
@ -428,13 +446,13 @@ include a small fix to the tokenizer.
 * Add specific string-length cap in Tokenizer
 * Fix ``token.conjuncts``

-2015-10-09 `v0.94 <../../releases/tag/0.94>`_
---------------------------------------------
+2015-10-09 `v0.94 <https://github.com/explosion/spaCy/releases/tag/0.94>`_
+--------------------------------------------------------------------------

 * Fix memory error that caused crashes on 32bit platforms
 * Fix parse errors caused by smart quotes and em-dashes

-2015-09-22 `v0.93 <../../releases/tag/0.93>`_
---------------------------------------------
+2015-09-22 `v0.93 <https://github.com/explosion/spaCy/releases/tag/0.93>`_
+--------------------------------------------------------------------------

 Bug fixes to word vectors
--- a/website/README.md
+++ b/website/README.md
@ -4,14 +4,14 @@

 The [spacy.io](https://spacy.io) website is implemented in [Jade (aka Pug)](https://www.jade-lang.org), and is built or served by [Harp](https://harpjs.com). Jade is an extensible templating language with a readable syntax, that compiles to HTML.
 The website source makes extensive use of Jade mixins, so that the design system is abstracted away from the content you're
-writing. You can read more about our approach in our blog post, ["Rebuilding a Website with Modular Markup Components"](https://explosion.ai/blog/modular-markup).
+writing. You can read more about our approach in our blog post, ["Rebuilding a Website with Modular Markup"](https://explosion.ai/blog/modular-markup).


 ## Building the site

 ```bash
 sudo npm install --global harp
-git clone https://github.com/explosion/spacy
+git clone https://github.com/explosion/spaCy
 cd spaCy/website
 harp server
 ```
--- a/website/_layout.jade
+++ b/website/_layout.jade
@ -45,7 +45,7 @@ html(lang="en")
            if sidebar
                include _includes/_sidebar

-            main.o-content(class="#{(sidebar) ? 'o-content--sidebar' : '' } #{((current.path[0] == 'docs' && asides != false) || asides) ? 'o-content--asides' : '' }")
+            main.o-content(class="#{(sidebar) ? 'o-content--sidebar' : '' } #{((current.path[0] == 'docs' && asides != false) || asides) ? 'o-content--asides' : '' } #{(current.path[1] == 'tutorials') ? 'o-content--article' : '' }")
                if current.path[1] == "tutorials"
                    +h(1)=title

--- a/website/assets/css/_base/_layout.sass
+++ b/website/assets/css/_base/_layout.sass
@ -24,7 +24,10 @@ body
 //- Paragraphs

 p
-    @extend .u-text-regular, .o-block, .has-aside
+    @extend .o-block, .u-text-regular, .has-aside
+
+    .o-content--article &:not([class])
+        @extend .u-text-medium


 //- Links
--- a/website/assets/js/main.js
+++ b/website/assets/js/main.js
@ -24,8 +24,8 @@ const $$ = document.querySelectorAll.bind(document);
        scrollUp = newScrollY <= scrollY;
        scrollY = newScrollY;

-        if(scrollUp && !(isNaN(scrollY) || scrollY <= vh)) topnav.classList.add('is-fixed');
-        else if(!scrollUp || (isNaN(scrollY) || scrollY <= vh/2)) topnav.classList.remove('is-fixed');
+        if(scrollUp && !(isNaN(scrollY) || scrollY <= vh)) nav.classList.add('is-fixed');
+        else if(!scrollUp || (isNaN(scrollY) || scrollY <= vh/2)) nav.classList.remove('is-fixed');
    }

    const updateSidebar = () => {
--- a/website/docs/tutorials/_data.json
+++ b/website/docs/tutorials/_data.json
@ -1,34 +1,47 @@
 {
+    "training": {
+        "title": "Training the tagger, entity recogniser and parser",
+        "date": "2016-10-17",
+        "description": "This tutorial describes how to train new statistical models for spaCy's part-of-speech tagger, named entity recognizer and dependency parser."
+    },
+
+    "custom-pipelines": {
+        "title": "Custom Pipelines",
+        "date": "2016-10-17",
+        "description": "spaCy 1.0 introduces dynamic pipelines, so that you can easily create custom workflows. This tutorial describes the feature, and introduces experimental support for dynamic Token attributes. The tutorial also discusses how we can make it easier to use bidirectional LSTMs with spaCy."
+    },
+
+    "rule-based-matcher": {
+        "title": "Rule-based Matcher",
+        "date": "2016-10-17",
+        "description": "spaCy features a rule-matching engine that operates over tokens. The rules can refer to token annotations and flags, and matches support callbacks to accept, modify and/or act on the match. The rule matcher also allows you to associate patterns with entity IDs, to allow some basic entity linking or disambiguation."
+    },
+
    "load-new-word-vectors": {
-        "template": "article",
        "title": "Load new word vectors",
        "date": "2015-09-24",
        "description": "Word vectors allow simple similarity queries, and drive many NLP applications. This tutorial explains how to load custom word vectors into spaCy, to make use of task or data-specific representations."
    },

    "byo-annotations": {
-        "template": "article",
        "title": "Using Pre-existing Tokenization, Tags, and Other Annotations",
        "date": "2016-04-15",
        "description": "spaCy assumes by default that your data is raw text. However, sometimes your data is partially annotated, e.g. with pre-existing tokenization, part-of-speech tags, etc. This tutorial explains how to use these annotations in spaCy."
    },

    "mark-adverbs": {
-        "template": "article",
        "title": "Mark all adverbs, particularly for verbs of speech",
        "date": "2015-08-18",
        "description": "Let's say you're developing a proofreading tool, or possibly an IDE for writers.  You're convinced by Stephen King's advice that adverbs are not your friend so you want to highlight all adverbs."
    },

    "syntax-search": {
-        "template": "article",
        "title": "Search Reddit for comments about Google doing something",
        "date": "2015-08-18",
        "description": "Example use of the spaCy NLP tools for data exploration. Here we will look for Reddit comments that describe Google doing something, i.e. discuss the company's actions. This is difficult, because other senses of \"Google\" now dominate usage of the word in conversation, particularly references to using Google products."
    },

    "twitter-filter": {
-        "template": "article",
        "title": "Finding Relevant Tweets",
        "date": "2015-08-18",
        "description": "In this tutorial, we will use word vectors to search for tweets about Jeb Bush. We'll do this by building up two word lists: one that represents the type of meanings in the Jeb Bush tweets, and another to help screen out irrelevant tweets that mention the common, ambiguous word \"bush\"."
--- a/website/docs/tutorials/custom-pipelines.jade
+++ b/website/docs/tutorials/custom-pipelines.jade
@ -0,0 +1,89 @@
+include ../../_includes/_mixins
+
+p.u-text-large spaCy 1.0 introduces dynamic pipelines, so that you can easily create custom workflows. This tutorial describes the feature, and introduces experimental support for dynamic Token attributes. The tutorial also discusses how we can make it easier to use bidirectional LSTMs with spaCy.
+
+p Best practices in NLP are now already pretty different from when I first designed spaCy, even though it's only been two years. The spaCy 1.0 release has a new custom pipeline API to help you use the new hotness.
+
+p Before 1.0, spaCy's pipeline was hard-coded. When you called #[code nlp(text)], spaCy would apply the tokenizer, tagger, parser and named entity recognizer, in sequence. This design assumed that users should subclass the #[code Language] class to customize the pipeline. However, the #[code Language] class has gotten more complicated, and subclassing it now feels like a relatively "serious" thing to do. It feels hard.
+
+p In spaCy 1.0, the order of operations is no longer hard-coded. Instead, the new #[code Language.__call__] does something like this:
+
+code.
+    def __call__(self, text):
+        doc = self.make_doc(text)
+        for process in self.pipeline:
+            process(doc)
+        return doc
+
+p The pipeline can consist of any sequence of callables. They should accept a Doc object, and modify it in-place. You can install the pipeline by passing a callable to the #[code spacy.load()] function, or the constructor of the #[code Language] class:
+
+code("python", "Basic Example").
+    import spacy
+
+    def arbitrary_fixup_rules(doc):
+        for token in doc:
+            if token.text == u'bill' and token.tag_ == u'NNP':
+                token.tag_ = u'NN'
+
+    def custom_pipeline(nlp):
+        return (nlp.tagger, arbitrary_fixup_rules, nlp.parser, nlp.entity)
+
+    nlp = spacy.load('en', pipeline=custom_pipeline)
+
+
+p The value passed to the #[code pipeline] keyword should be a callable that takes the #[code Language] instance (i.e. #[code nlp]) as an argument. The callable should return a sequence of callables. Each member of the sequence should take a Doc object as its sole positional argument.
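The dispatch logic described above can be tried out without any models installed. The following is a minimal sketch, not spaCy's implementation: `ToyDoc`, `ToyLanguage` and the toy tagging rule are hypothetical stand-ins, and only the shape of the `pipeline` factory and the `__call__` loop follow the text.

```python
class ToyDoc(object):
    """Hypothetical stand-in for Doc: words, tags and a log of steps run."""
    def __init__(self, words):
        self.words = list(words)
        self.tags = [None] * len(self.words)
        self.log = []


class ToyLanguage(object):
    """Hypothetical stand-in for Language, showing only the dispatch loop."""
    def __init__(self, pipeline=None):
        # `pipeline` is a factory: it receives this instance and returns
        # a sequence of callables that each take a doc.
        self.pipeline = list(pipeline(self)) if pipeline is not None else []

    def make_doc(self, text):
        return ToyDoc(text.split())

    def tagger(self, doc):
        # Toy rule (an assumption): capitalised words are proper nouns.
        doc.tags = ['NNP' if w[:1].isupper() else 'NN' for w in doc.words]
        doc.log.append('tagger')

    def __call__(self, text):
        doc = self.make_doc(text)
        for process in self.pipeline:
            process(doc)  # each step modifies the doc in place
        return doc


def fixup_rules(doc):
    # Runs after the tagger, so it can overrule the tagger's decisions.
    for i, word in enumerate(doc.words):
        if word.lower() == 'bill' and doc.tags[i] == 'NNP':
            doc.tags[i] = 'NN'
    doc.log.append('fixup')


def custom_pipeline(nlp):
    return (nlp.tagger, fixup_rules)


nlp = ToyLanguage(pipeline=custom_pipeline)
doc = nlp('Pay the Bill today')
print(doc.log, doc.tags)
```

Because the factory receives the `Language` instance, it can mix the built-in components with your own functions in whatever order you need.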
+
+h(2, "experimental-lstm") Experimental: Bidirectional LSTM with custom pipeline
+
+p Probably the most important new technology in Natural Language Processing is the rise of bidirectional LSTM models. These models associate each word with a #[em context-specific] vector. You can also neatly include character level features, so that all relevant aspects of the word are captured. This is pretty much the best way to do feature extraction in NLP at the moment, for almost any task.
+
+p spaCy doesn't feature any pre-trained LSTM models yet, and the details of this API are still being refined. But, because BiLSTMs are proving so important, I wanted to get the proposal up.
+
+p Version 1.0 adds an attribute #[code tensor] to the #[code Doc] object. The #[code tensor] attribute expects a numpy ndarray object, and is publicly writeable. This gives you a place to store the output of the LSTM (or some other real-valued output you want to keep).
+
+code("python", "Basic Example").
+    import spacy
+    from spacy.symbols import LEMMA, TAG
+
+    class LSTMModel(object):
+        def __init__(self, **kwargs):
+            # Load your weights etc
+            pass
+
+        def __call__(self, doc):
+            features = doc.to_array([LEMMA, TAG])
+            doc.tensor = lstm(features)
+
+    def custom_pipeline(nlp):
+        return (nlp.tagger, LSTMModel(), nlp.parser, nlp.entity)
+
+    nlp = spacy.load('en', pipeline=custom_pipeline)
+
+p Now, so far we only have the LSTM output as an attribute of the #[code Doc] object. We'd like to be able to do stuff like #[code doc[0].vector], and have that get us the LSTM vector for the token. We can do #[code doc.tensor[doc[0].i]], but I'd like a little more sugar. The details of this part are still experimental — in particular, don't take the names too seriously at this point.
+
+p A relevant implementation detail of spaCy is that the #[code Token] objects are thin proxies, that can be created and destroyed as convenient. The #[code Doc] object owns all the data. This means that we can't simply assign a vector to the #[code Token] objects. Instead, we'll add a hook that gets called by #[code token.vector]. We'll also add space for hooks in other places we might need them.
+
+aside("Why don't Token and Span own their data?") Well, we want the sequence of tokens to be stored together in memory. That means we really want to have a sequence owned by the #[code Doc] object. But if we have that, then we would have to copy data to the #[code Token] objects. This gets super messy, especially if the tokens should be able to modify their state. The Token therefore proxies to the Doc, to maintain a single source of truth.
+
+p Here's what that looks like:
+
+code.
+    def install_vector_hook(doc):
+        doc.getters_for_tokens['vector'] = lambda token: doc.tensor[token.i]
+
+    def custom_pipeline(nlp):
+        return (nlp.tagger, LSTMModel(), install_vector_hook, nlp.parser, nlp.entity)
+
+    nlp = spacy.load('en', pipeline=custom_pipeline)
+
+p The #[code install_vector_hook] function will run after the LSTM. It modifies the #[code Doc], setting a value in a dictionary that the #[code Token] knows to look for. When you access the #[code token.vector] property, the token checks whether there's a special-case listener for that attribute:
+
+code.
+    @property
+    def vector(self):
+        if 'vector' in self.doc.getters_for_tokens:
+            return self.doc.getters_for_tokens['vector'](self)
+        else:
+            return self.c.lex.vector
+
+p As I said, don't take the names too seriously at this point. But do test out the feature: it should all be working. You should be able to customize the behaviour of a lot of attributes this way already. Possibly we should just make it everything on the Token and the Span, but I think it might not be nice to have so much uncertainty about how some values are being calculated. There's such a thing as being too dynamic.
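To make the proxy-and-hook idea concrete, here is a small self-contained sketch. It assumes nothing about spaCy's internals: `ToyDoc` and `ToyToken` are hypothetical classes, and the `getters_for_tokens` name is borrowed from the experimental snippet above, so it may well change.

```python
class ToyDoc(object):
    """Hypothetical Doc: owns all the data, including per-token vectors."""
    def __init__(self, words, default_vectors):
        self.words = list(words)
        self.default_vectors = default_vectors  # one vector per token
        self.tensor = None                      # filled in by a pipeline step
        self.getters_for_tokens = {}            # attribute name -> hook

    def __getitem__(self, i):
        # Tokens are thin proxies, created on demand and thrown away freely.
        return ToyToken(self, i)


class ToyToken(object):
    def __init__(self, doc, i):
        self.doc = doc
        self.i = i

    @property
    def vector(self):
        hook = self.doc.getters_for_tokens.get('vector')
        if hook is not None:
            return hook(self)                    # user-installed behaviour
        return self.doc.default_vectors[self.i]  # built-in fallback


def install_vector_hook(doc):
    doc.getters_for_tokens['vector'] = lambda token: doc.tensor[token.i]


doc = ToyDoc(['hello', 'world'], default_vectors=[[0.0, 0.0], [1.0, 1.0]])
before = doc[1].vector                 # served by the fallback
doc.tensor = [[0.5, 0.2], [0.9, 0.4]]  # pretend this came from the LSTM
install_vector_hook(doc)
after = doc[1].vector                  # now served by the hook
print(before, after)
```

Since the hook lives on the doc, every token proxy created afterwards picks it up, and no data has to be copied onto the tokens.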
--- a/website/docs/tutorials/rule-based-matcher.jade
+++ b/website/docs/tutorials/rule-based-matcher.jade
@ -0,0 +1,61 @@
+include ../../_includes/_mixins
+
+p.u-text-large spaCy features a rule-matching engine that operates over tokens. The rules can refer to token annotations and flags, and matches support callbacks to accept, modify and/or act on the match. The rule matcher also allows you to associate patterns with entity IDs, to allow some basic entity linking or disambiguation.
+
+code("python", "Matcher Example").
+    from spacy.matcher import Matcher
+    from spacy.attrs import ORTH, TAG
+    import spacy
+
+    nlp = spacy.load('en', parser=False, entity=False)
+
+    matcher = Matcher(nlp.vocab)
+
+    matcher.add_entity(
+        "GoogleNow", # Entity ID -- Helps you act on the match.
+        {"ent_type": "PRODUCT", "wiki_en": "Google_Now"}, # Arbitrary attributes (optional)
+        acceptor=None, # Accept or modify the match
+        on_match=merge_phrases # Callback to act on the matches
+    )
+    matcher.add_pattern(
+        "GoogleNow", # Entity ID -- Created if doesn't exist.
+        [ # The pattern is a list of *Token Specifiers*.
+            { # This Token Specifier matches tokens whose orth field is "Google"
+              ORTH: "Google"
+            },
+            { # This Token Specifier matches tokens whose orth field is "Now"
+              ORTH: "Now"
+            }
+        ],
+        label=None # Can associate a label to the pattern-match, to handle it better.
+    )
+    doc = nlp(u"I prefer Siri to Google Now.")
+    matches = matcher(doc)
+    for ent_id, label, start, end in matches:
+        print(nlp.strings[ent_id], nlp.strings[label], doc[start : end].text)
+        entity = matcher.get_entity(ent_id)
+        print(entity)
+
+    matcher.add_pattern(
+        "GoogleNow",
+        [ # This Surface Form matches "google now", verbatim, and requires
+          # "google" to have the NNP tag. This helps prevent the pattern from
+          # matching cases like "I will google now to look up the time"
+          {
+            ORTH: "google",
+            TAG: "NNP"
+          },
+          {
+            ORTH: "now"
+          }
+        ]
+    )
+
+    doc = nlp(u"I'll google now to find out how the google now service works.")
+    matches = matcher(doc)
+    for ent_id, label, start, end in matches:
+        print(ent_id, label, start, end, doc[start : end].text)
+    # Because we specified the on_match=merge_phrases callback,
+    # we should see 'google now' as a single token.
+    for token in doc:
+        print(token.text, token.lemma_, token.tag_, token.ent_type_)
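For intuition about what a list of Token Specifiers means, here is a rough pure-Python sketch of the matching step. It is not the `Matcher` implementation, which works over token attributes in Cython; tokens here are plain dicts, and `match_patterns` is a hypothetical helper that ignores acceptors, callbacks and labels.

```python
def match_patterns(tokens, patterns):
    """Find every span whose tokens satisfy a pattern.

    tokens:   list of dicts mapping attribute names to values
    patterns: dict mapping an entity ID to a list of patterns, where each
              pattern is a list of token specifiers (dicts)
    Returns a list of (ent_id, start, end) tuples, sorted by position.
    """
    matches = []
    for ent_id, pattern_list in sorted(patterns.items()):
        for pattern in pattern_list:
            n = len(pattern)
            if n == 0:
                continue  # an empty pattern would match everywhere
            for start in range(len(tokens) - n + 1):
                window = tokens[start:start + n]
                # Every attribute named in a specifier must match exactly.
                if all(tok.get(attr) == value
                       for tok, spec in zip(window, pattern)
                       for attr, value in spec.items()):
                    matches.append((ent_id, start, start + n))
    return sorted(matches, key=lambda m: (m[1], m[2]))


tokens = [
    {'orth': 'I', 'tag': 'PRP'},
    {'orth': 'will', 'tag': 'MD'},
    {'orth': 'google', 'tag': 'VB'},   # verb use: should not match
    {'orth': 'now', 'tag': 'RB'},
    {'orth': 'to', 'tag': 'TO'},
    {'orth': 'test', 'tag': 'VB'},
    {'orth': 'google', 'tag': 'NNP'},  # product name: should match
    {'orth': 'now', 'tag': 'NNP'},
]
patterns = {
    'GoogleNow': [
        [{'orth': 'Google'}, {'orth': 'Now'}],
        [{'orth': 'google', 'tag': 'NNP'}, {'orth': 'now'}],
    ],
}
matches = match_patterns(tokens, patterns)
print(matches)
```

The tag constraint on the first specifier is what keeps the verb reading of "google now" from being reported as the product.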
--- a/website/docs/tutorials/training.jade
+++ b/website/docs/tutorials/training.jade
@ -0,0 +1,82 @@
+include ../../_includes/_mixins
+
+p.u-text-large This tutorial describes how to train new statistical models for spaCy's part-of-speech tagger, named entity recognizer and dependency parser.
+
+p I'll start with some quick code examples, that describe how to train each model. I'll then provide a bit of background about the algorithms, and explain how the data and feature templates work.
+
+h(2, "train-pos-tagger") Training the part-of-speech tagger
+
+code('python', 'Simple Example').
+    from spacy.vocab import Vocab
+    from spacy.pipeline import Tagger
+    from spacy.tokens import Doc
+
+    vocab = Vocab(tag_map={'N': {'pos': 'NOUN'}, 'V': {'pos': 'VERB'}})
+    tagger = Tagger(vocab)
+
+    doc = Doc(vocab, words=['I', 'like', 'stuff'])
+    tagger.update(doc, ['N', 'V', 'N'])
+
+    tagger.model.end_training()
+
+p #[+a("https://github.com/" + SOCIAL.github + "/spaCy/examples/training/train_tagger.py") Full example]
+
+h(2, "train-entity") Training the named entity recognizer
+
+code('python', 'Simple Example').
+    from spacy.vocab import Vocab
+    from spacy.pipeline import EntityRecognizer
+    from spacy.tokens import Doc
+
+    vocab = Vocab()
+    entity = EntityRecognizer(vocab, entity_types=['PERSON', 'LOC'])
+
+    doc = Doc(vocab, words=['Who', 'is', 'Shaka', 'Khan', '?'])
+    entity.update(doc, ['O', 'O', 'B-PERSON', 'L-PERSON', 'O'])
+
+    entity.model.end_training()
+
+p #[+a("https://github.com/" + SOCIAL.github + "/spaCy/examples/training/train_ner.py") Full example]
+
+h(2, "train-parser") Training the dependency parser
+
+code('python', 'Simple Example').
+    from spacy.vocab import Vocab
+    from spacy.pipeline import DependencyParser
+    from spacy.tokens import Doc
+
+    vocab = Vocab()
+    parser = DependencyParser(vocab, labels=['nsubj', 'compound', 'dobj', 'punct'])
+
+    doc = Doc(vocab, words=['Who', 'is', 'Shaka', 'Khan', '?'])
+    parser.update(doc, [(1, 'nsubj'), (1, 'ROOT'), (3, 'compound'), (1, 'dobj'),
+                        (1, 'punct')])
+
+    parser.model.end_training()
+
+p #[+a("https://github.com/" + SOCIAL.github + "/spaCy/examples/training/train_parser.py") Full example]
+
+h(2, 'feature-templates') Customising the feature extraction
+
+p spaCy currently uses linear models for the tagger, parser and entity recognizer, with weights learned using the #[+a("https://explosion.ai/blog/part-of-speech-pos-tagger-in-python") Averaged Perceptron algorithm].
+
+p Because it's a linear model, it's important for accuracy to build conjunction features out of the atomic predictors. Let's say you have two atomic predictors asking, "What is the part-of-speech of the previous token?", and "What is the part-of-speech of the previous previous token?". These predictors will introduce a number of features, e.g. "Prev-pos=NN", "Prev-pos=VBZ", etc. A conjunction template combines the two predictors, introducing features such as "Prev-pos=NN&Prev-prev-pos=VBZ".
+
+p The feature extraction proceeds in two passes. In the first pass, we fill an array with the values of all of the atomic predictors. In the second pass, we iterate over the feature templates, and fill a small temporary array with the predictors that will be combined into a conjunction feature. Finally, we hash this array into a 64-bit integer, using the MurmurHash algorithm. You can see this at work in the #[+a("https://github.com/" + SOCIAL.github + "/thinc/blob/94dbe06fd3c8f24d86ab0f5c7984e52dbfcdc6cb/thinc/linear/features.pyx") thinc.linear.features] module.
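The two passes can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the thinc code: the slot names and templates are made up for the example, and a truncated MD5 digest stands in for MurmurHash so that the sketch needs nothing outside the standard library.

```python
import hashlib
import struct

# Slot indices for the atomic predictors (illustrative names).
P1_POS, P2_POS, W_ORTH = range(3)

# Each template lists the slots that are combined into one feature.
TEMPLATES = [(P1_POS,), (P2_POS,), (P1_POS, P2_POS), (W_ORTH, P1_POS)]


def string_id(s):
    """Stable 64-bit ID for a string (stand-in for a string store)."""
    return struct.unpack('<Q', hashlib.md5(s.encode('utf8')).digest()[:8])[0]


def fill_atoms(word_ids, tag_ids, i):
    """Pass 1: evaluate every atomic predictor at position i."""
    atoms = [0, 0, 0]  # 0 means "no value", e.g. before the sentence start
    atoms[P1_POS] = tag_ids[i - 1] if i >= 1 else 0
    atoms[P2_POS] = tag_ids[i - 2] if i >= 2 else 0
    atoms[W_ORTH] = word_ids[i]
    return atoms


def hash_feature(template_index, values):
    """Hash one template's values into a single 64-bit integer.

    MD5 truncated to 64 bits is an assumption made for portability;
    the real code uses MurmurHash.
    """
    payload = struct.pack('<%dQ' % (len(values) + 1), template_index, *values)
    return struct.unpack('<Q', hashlib.md5(payload).digest()[:8])[0]


def extract_features(atoms, templates=TEMPLATES):
    """Pass 2: gather each template's atoms and hash them together."""
    features = []
    for t, template in enumerate(templates):
        values = [atoms[slot] for slot in template]  # small temporary array
        features.append(hash_feature(t, values))
    return features


words = [string_id(w) for w in ['the', 'old', 'man', 'sleeps']]
tags = [string_id(t) for t in ['DT', 'JJ', 'NN', 'VBZ']]
feats = extract_features(fill_atoms(words, tags, 2))
print(feats)
```

Including the template index in the hashed payload keeps "Prev-pos=NN" and "Prev-prev-pos=NN" apart, even though both carry the same atom value.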
+
+p It's very easy to change the feature templates, to create novel combinations of the existing atomic predictors. There's currently no API available to add new atomic predictors, though. You'll have to create a subclass of the model, and write your own #[+code set_featuresC] method.
+
+p The feature templates are passed in using the #[+code features] keyword argument to the constructors of the Tagger, DependencyParser and EntityRecognizer:
+
+code('python', 'custom tagger templates').
+    from spacy.vocab import Vocab
+    from spacy.pipeline import Tagger
+    from spacy.tagger import P2_orth, P1_orth
+    from spacy.tagger import P2_cluster, P1_cluster, W_orth, N1_orth, N2_orth
+
+    vocab = Vocab(tag_map={'N': {'pos': 'NOUN'}, 'V': {'pos': 'VERB'}})
+    tagger = Tagger(vocab, features=[(P2_orth, P2_cluster), (P1_orth, P1_cluster),
+                                     (P2_orth,), (P1_orth,), (W_orth,),
+                                     (N1_orth,), (N2_orth,)])
+
+p Custom feature templates can be passed to the DependencyParser and EntityRecognizer as well, also using the #[+code features] keyword argument of the constructor.