Merge branch 'master' into develop

2025-09-03 10:54:55 +03:00 · 2018-07-04 14:52:25 +02:00 · 63666af328
commit 63666af328
parent 8feb7cfe2d a82c3153ad
57 changed files with 31968 additions and 304 deletions
--- a/.github/contributors/aliiae.md
+++ b/.github/contributors/aliiae.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Aliia Erofeeva       |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 13 June 2018         |
+| GitHub username                | aliiae               |
+| Website (optional)             |                      |
--- a/.github/contributors/btrungchi.md
+++ b/.github/contributors/btrungchi.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your 
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                              |
+|------------------------------- | --------------------               |
+| Name                           | Bui Trung Chi                      |
+| Company name (if applicable)   |                                    |
+| Title or role (if applicable)  |                                    |
+| Date                           | 2018-06-30                         |
+| GitHub username                | btrungchi                          |
+| Website (optional)             |                                    |
--- a/.github/contributors/coryhurst.md
+++ b/.github/contributors/coryhurst.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your 
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                        |
+|------------------------------- | -----------------------------|
+| Name                           | Cory Hurst                   |
+| Company name (if applicable)   | Samtec Smart Platform Group  |               
+| Title or role (if applicable)  | SoftwareDeveloper            |
+| Date                           | 2017-11-13                   |
+| GitHub username                | cjhurst                      |
+| Website (optional)             | https://blog.spg.ai/         |
--- a/.github/contributors/mirfan899.md
+++ b/.github/contributors/mirfan899.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+    assignment is or becomes invalid, ineffective or unenforceable, you hereby
+    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+    royalty-free, unrestricted license to exercise all rights under those
+    copyrights. This includes, at our option, the right to sublicense these same
+    rights to third parties through multiple levels of sublicensees or other
+    licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+    contribution as if each of us were the sole owners, and if one of us makes
+    a derivative work of your contribution, the one who makes the derivative
+    work (or has it made will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+    against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+    exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+    consent of, pay or render an accounting to the other for any use or
+    distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+    your contribution in whole or in part, alone or in combination with or
+    included in any product, work or materials arising out of the project to
+    which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+    multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+    authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+    third party's copyrights, trademarks, patents, or other intellectual
+    property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+    other applicable export and import laws. You agree to notify us if you
+    become aware of any circumstance which would make any of the foregoing
+    representations inaccurate in any respect. We may publicly disclose your 
+    participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+    or entity, including my employer, has or will have rights with respect to my
+    contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+    actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                    |
+|------------------------------- | ------------------------ |
+| Name                           | Muhammad Irfan           |
+| Company name (if applicable)   |                          |
+| Title or role (if applicable)  | AI & ML Developer        |
+| Date                           | 2018-09-06               |
+| GitHub username                | mirfan899                |
+| Website (optional)             |                          |
--- a/requirements.txt
+++ b/requirements.txt
@@ -11,5 +11,5 @@ ujson>=1.35
 dill>=0.2,<0.3
 regex==2017.4.5
 requests>=2.13.0,<3.0.0
-pytest>=3.0.6,<4.0.0
+pytest>=3.6.0,<4.0.0
 mock>=2.0.0,<3.0.0
--- a/spacy/__init__.py
+++ b/spacy/__init__.py
@@ -20,5 +20,5 @@ def blank(name, **kwargs):
    return LangClass(**kwargs)


-def info(model=None, markdown=False):
-    return cli_info(model, markdown)
+def info(model=None, markdown=False, silent=False):
+    return cli_info(model, markdown, silent)
--- a/spacy/displacy/__init__.py
+++ b/spacy/displacy/__init__.py
@@ -2,7 +2,7 @@
 from __future__ import unicode_literals

 from .render import DependencyRenderer, EntityRenderer
-from ..tokens import Doc
+from ..tokens import Doc, Span
 from ..compat import b_to_str
 from ..errors import Errors, Warnings, user_warning
 from ..util import prints, is_in_jupyter
@@ -29,8 +29,11 @@ def render(docs, style='dep', page=False, minify=False, jupyter=IS_JUPYTER,
                 'ent': (EntityRenderer, parse_ents)}
    if style not in factories:
        raise ValueError(Errors.E087.format(style=style))
-    if isinstance(docs, Doc) or isinstance(docs, dict):
+    if isinstance(docs, (Doc, Span, dict)):
        docs = [docs]
+    docs = [obj if not isinstance(obj, Span) else obj.as_doc() for obj in docs]
+    if not all(isinstance(obj, (Doc, Span, dict)) for obj in docs):
+        raise ValueError(Errors.E096)
    renderer, converter = factories[style]
    renderer = renderer(options=options)
    parsed = [converter(doc, options) for doc in docs] if not manual else docs
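Reviewer note: the input handling added to `render` in this hunk can be sketched in isolation. The `Doc` and `Span` classes below are hypothetical stand-ins, not spaCy's real classes, and `normalize_docs` is an illustrative name that does not appear in the diff. The sketch only mirrors the wrap, convert, validate order of the added lines.

```python
class Doc(object):
    """Hypothetical stand-in for spacy.tokens.Doc."""


class Span(object):
    """Hypothetical stand-in for spacy.tokens.Span."""

    def as_doc(self):
        # The real Span.as_doc() copies the span into a new Doc.
        return Doc()


def normalize_docs(docs):
    # A single Doc, Span or dict is wrapped in a list first.
    if isinstance(docs, (Doc, Span, dict)):
        docs = [docs]
    # Spans are converted to standalone Docs.
    docs = [obj.as_doc() if isinstance(obj, Span) else obj for obj in docs]
    # Anything else is rejected; the diff raises ValueError(Errors.E096) here.
    if not all(isinstance(obj, (Doc, Span, dict)) for obj in docs):
        raise ValueError("Can only visualize Doc or Span objects, or dicts")
    return docs
```

With this ordering a bare `Span` becomes a one-element list containing a `Doc`, while a list of plain strings fails the type check instead of reaching the renderer.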
--- a/spacy/displacy/render.py
+++ b/spacy/displacy/render.py
@@ -136,7 +136,7 @@ class DependencyRenderer(object):
        end (int): X-coordinate of arrow end point.
        RETURNS (unicode): Definition of the arrow head path ('d' attribute).
        """
-        if direction is 'left':
+        if direction == 'left':
            pos1, pos2, pos3 = (x, x-self.arrow_width+2, x+self.arrow_width-2)
        else:
            pos1, pos2, pos3 = (end, end+self.arrow_width-2,
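Reviewer note: the change from `is` to `==` matters because `is` tests object identity, not value. The sketch below is a minimal illustration with hypothetical names (`is_left`, `built`); it is not code from the renderer.

```python
def is_left(direction):
    # Equality compares the characters, which is what the renderer needs.
    return direction == "left"


literal = "left"
built = "".join(["le", "ft"])  # same value, constructed at runtime

print(is_left(built))    # value comparison
print(built is literal)  # identity; depends on interpreter string interning
```

The identity check may succeed or fail depending on how the string was created, so only the equality form is reliable.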
--- a/spacy/errors.py
+++ b/spacy/errors.py
@@ -257,6 +257,8 @@ class Errors(object):
    E094 = ("Error reading line {line_num} in vectors file {loc}.")
    E095 = ("Can't write to frozen dictionary. This is likely an internal "
            "error. Are you writing to a default function argument?")
+    E096 = ("Invalid object passed to displaCy: Can only visualize Doc or "
+             "Span objects, or dicts if set to manual=True.")


@add_codes
--- a/spacy/lang/char_classes.py
+++ b/spacy/lang/char_classes.py
@@ -16,9 +16,11 @@ _latin = r'[[\p{Ll}||\p{Lu}]&&\p{Latin}]'
 _persian = r'[\p{L}&&\p{Arabic}]'
 _russian_lower = r'[ёа-я]'
 _russian_upper = r'[ЁА-Я]'
+_tatar_lower = r'[әөүҗңһ]'
+_tatar_upper = r'[ӘӨҮҖҢҺ]'

-_upper = [_latin_upper, _russian_upper]
-_lower = [_latin_lower, _russian_lower]
+_upper = [_latin_upper, _russian_upper, _tatar_upper]
+_lower = [_latin_lower, _russian_lower, _tatar_lower]
 _uncased = [_bengali, _hebrew, _persian]

 ALPHA = merge_char_classes(_upper + _lower + _uncased)
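Reviewer note: the extra Tatar classes are needed because the Russian ranges do not cover the Tatar-specific Cyrillic letters. The sketch below checks this with the standard-library `re` module under Python 3. It reuses the two lowercase patterns from the hunk, but `is_lower_alpha` is a hypothetical helper and the real module builds its classes through `merge_char_classes`.

```python
import re

russian_lower = re.compile("[ёа-я]")
tatar_lower = re.compile("[әөүҗңһ]")


def is_lower_alpha(ch):
    # A character counts as lowercase if either class matches it.
    return bool(russian_lower.fullmatch(ch) or tatar_lower.fullmatch(ch))


for ch in "яәһ":
    print(ch, bool(russian_lower.fullmatch(ch)), is_lower_alpha(ch))
```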
--- a/spacy/lang/ja/__init__.py
+++ b/spacy/lang/ja/__init__.py
@@ -60,9 +60,8 @@ def detailed_tokens(tokenizer, text):
        parts = node.feature.split(',')
        pos = ','.join(parts[0:4])

-        if len(parts) > 6:
+        if len(parts) > 7:
            # this information is only available for words in the tokenizer dictionary
-            reading = parts[6]
            base = parts[7]

        words.append( ShortUnitWord(surface, base, pos) )
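Reviewer note: the old guard `len(parts) > 6` allowed a feature string with exactly seven fields through, after which `parts[7]` raises `IndexError`. The sketch below isolates the corrected guard. The function name `base_form` and the fallback to `surface` are assumptions for illustration, since the default value of `base` is not visible in this hunk.

```python
def base_form(feature, surface):
    parts = feature.split(",")
    base = surface  # assumed fallback; not shown in the diff
    if len(parts) > 7:
        # Index 7 exists only when there are at least eight fields.
        base = parts[7]
    return base


print(base_form("a,b,c,d,e,f,g", "x"))      # seven fields: guard is False
print(base_form("a,b,c,d,e,f,g,h,i", "x"))  # nine fields: uses parts[7]
```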
--- a/spacy/lang/tt/__init__.py
+++ b/spacy/lang/tt/__init__.py
@@ -0,0 +1,31 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from .lex_attrs import LEX_ATTRS
+from .punctuation import TOKENIZER_INFIXES
+from .stop_words import STOP_WORDS
+from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
+from ..tokenizer_exceptions import BASE_EXCEPTIONS
+from ...attrs import LANG
+from ...language import Language
+from ...util import update_exc
+
+
+class TatarDefaults(Language.Defaults):
+    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
+    lex_attr_getters[LANG] = lambda text: 'tt'
+
+    lex_attr_getters.update(LEX_ATTRS)
+
+    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
+    infixes = tuple(TOKENIZER_INFIXES)
+
+    stop_words = STOP_WORDS
+
+
+class Tatar(Language):
+    lang = 'tt'
+    Defaults = TatarDefaults
+
+
+__all__ = ['Tatar']
--- a/spacy/lang/tt/examples.py
+++ b/spacy/lang/tt/examples.py
@@ -0,0 +1,19 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+"""
+Example sentences to test spaCy and its language models.
+>>> from spacy.lang.tt.examples import sentences
+>>> docs = nlp.pipe(sentences)
+"""
+
+sentences = [
+    "Apple Бөекбритания стартабын $1 миллиард өчен сатып алыун исәпли.",
+    "Автоном автомобильләр иминият җаваплылыкны җитештерүчеләргә күчерә.",
+    "Сан-Франциско тротуар буенча йөри торган робот-курьерларны тыю мөмкинлеген карый.",
+    "Лондон - Бөекбританиядә урнашкан зур шәһәр.",
+    "Син кайда?",
+    "Францияда кем президент?",
+    "Америка Кушма Штатларының башкаласы нинди шәһәр?",
+    "Барак Обама кайчан туган?"
+]
--- a/spacy/lang/tt/lex_attrs.py
+++ b/spacy/lang/tt/lex_attrs.py
@@ -0,0 +1,29 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from ...attrs import LIKE_NUM
+
+_num_words = ['нуль', 'ноль', 'бер', 'ике', 'өч', 'дүрт', 'биш', 'алты', 'җиде',
+              'сигез', 'тугыз', 'ун', 'унбер', 'унике', 'унөч', 'ундүрт',
+              'унбиш', 'уналты', 'унҗиде', 'унсигез', 'унтугыз', 'егерме',
+              'утыз', 'кырык', 'илле', 'алтмыш', 'җитмеш', 'сиксән', 'туксан',
+              'йөз', 'мең', 'төмән', 'миллион', 'миллиард', 'триллион',
+              'триллиард']
+
+
+def like_num(text):
+    text = text.replace(',', '').replace('.', '')
+    if text.isdigit():
+        return True
+    if text.count('/') == 1:
+        num, denom = text.split('/')
+        if num.isdigit() and denom.isdigit():
+            return True
+    if text in _num_words:
+        return True
+    return False
+
+
+LEX_ATTRS = {
+    LIKE_NUM: like_num
+}
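Reviewer note: the behaviour of `like_num` can be checked without spaCy. The sketch below copies the function body from this file but uses an abbreviated `_num_words` list for illustration only. Note that the word lookup is case-sensitive as written, so a capitalised number word is not recognised.

```python
# Abbreviated copy of the list in the diff, for illustration only.
_num_words = ["нуль", "бер", "ике", "өч", "биш", "ун", "йөз", "мең"]


def like_num(text):
    # Thousands separators and decimal points are ignored.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Simple fractions such as 3/4.
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    return text in _num_words


for token in ["1,000", "3/4", "биш", "Биш", "китап"]:
    print(token, like_num(token))
```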
--- a/spacy/lang/tt/punctuation.py
+++ b/spacy/lang/tt/punctuation.py
@@ -0,0 +1,19 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from ..char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER, QUOTES, HYPHENS
+from ..char_classes import LIST_ELLIPSES, LIST_ICONS
+
+_hyphens_no_dash = HYPHENS.replace('-', '').strip('|').replace('||', '')
+_infixes = (LIST_ELLIPSES + LIST_ICONS +
+            [r'(?<=[{}])\.(?=[{}])'.format(ALPHA_LOWER, ALPHA_UPPER),
+             r'(?<=[{a}])[,!?/\(\)]+(?=[{a}])'.format(a=ALPHA),
+             r'(?<=[{a}{q}])[:<>=](?=[{a}])'.format(a=ALPHA, q=QUOTES),
+             r'(?<=[{a}])--(?=[{a}])'.format(a=ALPHA),
+             r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),
+             r'(?<=[{a}])([{q}\)\]\(\[])(?=[\-{a}])'.format(a=ALPHA, q=QUOTES),
+             r'(?<=[{a}])[?";:=,.]*(?:{h})(?=[{a}])'.format(a=ALPHA,
+                                                            h=_hyphens_no_dash),
+             r'(?<=[0-9])-(?=[0-9])'])
+
+TOKENIZER_INFIXES = _infixes
--- a/spacy/lang/tt/stop_words.py
+++ b/spacy/lang/tt/stop_words.py
@@ -0,0 +1,174 @@
+# encoding: utf8
+from __future__ import unicode_literals
+
+# Tatar stopwords are from https://github.com/aliiae/stopwords-tt
+
+STOP_WORDS = set("""алай алайса алар аларга аларда алардан аларны аларның аларча
+алары аларын аларынга аларында аларыннан аларының алтмыш алтмышынчы алтмышынчыга
+алтмышынчыда алтмышынчыдан алтмышынчылар алтмышынчыларга алтмышынчыларда
+алтмышынчылардан алтмышынчыларны алтмышынчыларның алтмышынчыны алтмышынчының
+алты алтылап алтынчы алтынчыга алтынчыда алтынчыдан алтынчылар алтынчыларга
+алтынчыларда алтынчылардан алтынчыларны алтынчыларның алтынчыны алтынчының
+алтышар анда андагы андай андый андыйга андыйда андыйдан андыйны андыйның аннан
+ансы анча аны аныкы аныкын аныкынга аныкында аныкыннан аныкының анысы анысын
+анысынга анысында анысыннан анысының аның аныңча аркылы ары аша аңа аңар аңарга
+аңарда аңардагы аңардан
+
+бар бара барлык барча барчасы барчасын барчасына барчасында барчасыннан
+барчасының бары башка башкача белән без безгә бездә бездән безне безнең безнеңчә
+белдерүенчә белән бер бергә беренче беренчегә беренчедә беренчедән беренчеләр
+беренчеләргә беренчеләрдә беренчеләрдән беренчеләрне беренчеләрнең беренчене
+беренченең беркайда беркайсы беркая беркаян беркем беркемгә беркемдә беркемне
+беркемнең беркемнән берлән берни бернигә бернидә бернидән бернинди бернине
+бернинең берничек берничә бернәрсә бернәрсәгә бернәрсәдә бернәрсәдән бернәрсәне
+бернәрсәнең беррәттән берсе берсен берсенгә берсендә берсенең берсеннән берәр
+берәрсе берәрсен берәрсендә берәрсенең берәрсеннән берәрсенә берәү бигрәк бик
+бирле бит биш бишенче бишенчегә бишенчедә бишенчедән бишенчеләр бишенчеләргә
+бишенчеләрдә бишенчеләрдән бишенчеләрне бишенчеләрнең бишенчене бишенченең
+бишләп болай болар боларга боларда болардан боларны боларның болары боларын
+боларынга боларында боларыннан боларының бу буе буена буенда буенча буйлап
+буларак булачак булды булмый булса булып булыр булырга бусы бүтән бәлки бән
+бәрабәренә бөтен бөтенесе бөтенесен бөтенесендә бөтенесенең бөтенесеннән
+бөтенесенә
+
+вә
+
+гел генә гына гүя гүяки гәрчә
+
+да ди дигән диде дип дистәләгән дистәләрчә дүрт дүртенче дүртенчегә дүртенчедә
+дүртенчедән дүртенчеләр дүртенчеләргә дүртенчеләрдә дүртенчеләрдән дүртенчеләрне
+дүртенчеләрнең дүртенчене дүртенченең дүртләп дә
+
+егерме егерменче егерменчегә егерменчедә егерменчедән егерменчеләр
+егерменчеләргә егерменчеләрдә егерменчеләрдән егерменчеләрне егерменчеләрнең
+егерменчене егерменченең ел елда
+
+иде идек идем ике икенче икенчегә икенчедә икенчедән икенчеләр икенчеләргә
+икенчеләрдә икенчеләрдән икенчеләрне икенчеләрнең икенчене икенченең икешәр икән
+илле илленче илленчегә илленчедә илленчедән илленчеләр илленчеләргә
+илленчеләрдә илленчеләрдән илленчеләрне илленчеләрнең илленчене илленченең илә
+илән инде исә итеп иткән итте итү итә итәргә иң
+
+йөз йөзенче йөзенчегә йөзенчедә йөзенчедән йөзенчеләр йөзенчеләргә йөзенчеләрдә
+йөзенчеләрдән йөзенчеләрне йөзенчеләрнең йөзенчене йөзенченең йөзләгән йөзләрчә
+йөзәрләгән
+
+кадәр кай кайбер кайберләре кайберсе кайберәү кайберәүгә кайберәүдә кайберәүдән
+кайберәүне кайберәүнең кайдагы кайсы кайсыбер кайсын кайсына кайсында кайсыннан
+кайсының кайчангы кайчандагы кайчаннан караганда карамастан карамый карата каршы
+каршына каршында каршындагы кебек кем кемгә кемдә кемне кемнең кемнән кенә ки
+килеп килә кирәк кына кырыгынчы кырыгынчыга кырыгынчыда кырыгынчыдан
+кырыгынчылар кырыгынчыларга кырыгынчыларда кырыгынчылардан кырыгынчыларны
+кырыгынчыларның кырыгынчыны кырыгынчының кырык күк күпләгән күпме күпмеләп
+күпмешәр күпмешәрләп күптән күрә
+
+ләкин
+
+максатында менә мең меңенче меңенчегә меңенчедә меңенчедән меңенчеләр
+меңенчеләргә меңенчеләрдә меңенчеләрдән меңенчеләрне меңенчеләрнең меңенчене
+меңенченең меңләгән меңләп меңнәрчә меңәрләгән меңәрләп миллиард миллиардлаган
+миллиардларча миллион миллионлаган миллионнарча миллионынчы миллионынчыга
+миллионынчыда миллионынчыдан миллионынчылар миллионынчыларга миллионынчыларда
+миллионынчылардан миллионынчыларны миллионынчыларның миллионынчыны
+миллионынчының мин миндә мине минем минемчә миннән миңа монда мондагы мондые
+мондыен мондыенгә мондыендә мондыеннән мондыеның мондый мондыйга мондыйда
+мондыйдан мондыйлар мондыйларга мондыйларда мондыйлардан мондыйларны
+мондыйларның мондыйлары мондыйларын мондыйларынга мондыйларында мондыйларыннан
+мондыйларының мондыйны мондыйның моннан монсыз монча моны моныкы моныкын
+моныкынга моныкында моныкыннан моныкының монысы монысын монысынга монысында
+монысыннан монысының моның моңа моңар моңарга мәгълүматынча мәгәр мән мөмкин
+
+ни нибарысы никадәре нинди ниндие ниндиен ниндиенгә ниндиендә ниндиенең
+ниндиеннән ниндиләр ниндиләргә ниндиләрдә ниндиләрдән ниндиләрен ниндиләренн
+ниндиләреннгә ниндиләренндә ниндиләреннең ниндиләренннән ниндиләрне ниндиләрнең
+ниндирәк нихәтле ничаклы ничек ничәшәр ничәшәрләп нуль нче нчы нәрсә нәрсәгә 
+нәрсәдә нәрсәдән нәрсәне нәрсәнең
+
+саен сез сезгә сездә сездән сезне сезнең сезнеңчә сигез сигезенче сигезенчегә
+сигезенчедә сигезенчедән сигезенчеләр сигезенчеләргә сигезенчеләрдә
+сигезенчеләрдән сигезенчеләрне сигезенчеләрнең сигезенчене сигезенченең
+сиксән син синдә сине синең синеңчә синнән сиңа соң сыман сүзенчә сүзләренчә
+
+та таба теге тегеләй тегеләр тегеләргә тегеләрдә тегеләрдән тегеләре тегеләрен
+тегеләренгә тегеләрендә тегеләренең тегеләреннән тегеләрне тегеләрнең тегенди 
+тегендигә тегендидә тегендидән тегендине тегендинең тегендә тегендәге тегене
+тегенеке тегенекен тегенекенгә тегенекендә тегенекенең тегенекеннән тегенең
+тегеннән тегесе тегесен тегесенгә тегесендә тегесенең тегесеннән тегеңә тиеш тик
+тикле тора триллиард триллион тугыз тугызлап тугызлашып тугызынчы тугызынчыга
+тугызынчыда тугызынчыдан тугызынчылар тугызынчыларга тугызынчыларда
+тугызынчылардан тугызынчыларны тугызынчыларның тугызынчыны тугызынчының туксан
+туксанынчы туксанынчыга туксанынчыда туксанынчыдан туксанынчылар туксанынчыларга
+туксанынчыларда туксанынчылардан туксанынчыларны туксанынчыларның туксанынчыны
+туксанынчының турында тыш түгел тә тәгаенләнгән төмән
+
+уенча уйлавынча ук ул ун уналты уналтынчы уналтынчыга уналтынчыда уналтынчыдан
+уналтынчылар уналтынчыларга уналтынчыларда уналтынчылардан уналтынчыларны
+уналтынчыларның уналтынчыны уналтынчының унарлаган унарлап унаула унаулап унбер
+унберенче унберенчегә унберенчедә унберенчедән унберенчеләр унберенчеләргә
+унберенчеләрдә унберенчеләрдән унберенчеләрне унберенчеләрнең унберенчене
+унберенченең унбиш унбишенче унбишенчегә унбишенчедә унбишенчедән унбишенчеләр
+унбишенчеләргә унбишенчеләрдә унбишенчеләрдән унбишенчеләрне унбишенчеләрнең
+унбишенчене унбишенченең ундүрт ундүртенче ундүртенчегә ундүртенчедә
+ундүртенчедән ундүртенчеләр ундүртенчеләргә ундүртенчеләрдә ундүртенчеләрдән
+ундүртенчеләрне ундүртенчеләрнең ундүртенчене ундүртенченең унике уникенче
+уникенчегә уникенчедә уникенчедән уникенчеләр уникенчеләргә уникенчеләрдә
+уникенчеләрдән уникенчеләрне уникенчеләрнең уникенчене уникенченең унлаган
+унлап уннарча унсигез унсигезенче унсигезенчегә унсигезенчедә унсигезенчедән
+унсигезенчеләр унсигезенчеләргә унсигезенчеләрдә унсигезенчеләрдән
+унсигезенчеләрне унсигезенчеләрнең унсигезенчене унсигезенченең унтугыз
+унтугызынчы унтугызынчыга унтугызынчыда унтугызынчыдан унтугызынчылар
+унтугызынчыларга унтугызынчыларда унтугызынчылардан унтугызынчыларны
+унтугызынчыларның унтугызынчыны унтугызынчының унынчы унынчыга унынчыда
+унынчыдан унынчылар унынчыларга унынчыларда унынчылардан унынчыларны
+унынчыларның унынчыны унынчының унҗиде унҗиденче унҗиденчегә унҗиденчедә
+унҗиденчедән унҗиденчеләр унҗиденчеләргә унҗиденчеләрдә унҗиденчеләрдән
+унҗиденчеләрне унҗиденчеләрнең унҗиденчене унҗиденченең унөч унөченче унөченчегә
+унөченчедә унөченчедән унөченчеләр унөченчеләргә унөченчеләрдә унөченчеләрдән
+унөченчеләрне унөченчеләрнең унөченчене унөченченең утыз утызынчы утызынчыга
+утызынчыда утызынчыдан утызынчылар утызынчыларга утызынчыларда утызынчылардан
+утызынчыларны утызынчыларның утызынчыны утызынчының
+
+фикеренчә фәкать
+
+хакында хәбәр хәлбуки хәтле хәтта
+
+чаклы чакта чөнки
+
+шикелле шул шулай шулар шуларга шуларда шулардан шуларны шуларның шулары шуларын
+шуларынга шуларында шуларыннан шуларының шулкадәр шултикле шултиклем шулхәтле
+шулчаклы шунда шундагы шундый шундыйга шундыйда шундыйдан шундыйны шундыйның
+шунлыктан шуннан шунсы шунча шуны шуныкы шуныкын шуныкынга шуныкында шуныкыннан
+шуныкының шунысы шунысын шунысынга шунысында шунысыннан шунысының шуның шушы
+шушында шушыннан шушыны шушының шушыңа шуңа шуңар шуңарга
+
+элек
+
+югыйсә юк юкса
+
+я ягъни язуынча яисә яки яктан якын ярашлы яхут яшь яшьлек
+
+җиде җиделәп җиденче җиденчегә җиденчедә җиденчедән җиденчеләр җиденчеләргә
+җиденчеләрдә җиденчеләрдән җиденчеләрне җиденчеләрнең җиденчене җиденченең
+җидешәр җитмеш җитмешенче җитмешенчегә җитмешенчедә җитмешенчедән җитмешенчеләр
+җитмешенчеләргә җитмешенчеләрдә җитмешенчеләрдән җитмешенчеләрне
+җитмешенчеләрнең җитмешенчене җитмешенченең җыенысы
+
+үз үзе үзем үземдә үземне үземнең үземнән үземә үзен үзендә үзенең үзеннән үзенә
+үк
+
+һичбер һичбере һичберен һичберендә һичберенең һичбереннән һичберенә һичберсе
+һичберсен һичберсендә һичберсенең һичберсеннән һичберсенә һичберәү һичберәүгә
+һичберәүдә һичберәүдән һичберәүне һичберәүнең һичкайсы һичкайсыга һичкайсыда
+һичкайсыдан һичкайсыны һичкайсының һичкем һичкемгә һичкемдә һичкемне һичкемнең
+һичкемнән һични һичнигә һичнидә һичнидән һичнинди һичнине һичнинең һичнәрсә
+һичнәрсәгә һичнәрсәдә һичнәрсәдән һичнәрсәне һичнәрсәнең һәм һәммә һәммәсе
+һәммәсен һәммәсендә һәммәсенең һәммәсеннән һәммәсенә һәр һәрбер һәрбере һәрберсе
+һәркайсы һәркайсыга һәркайсыда һәркайсыдан һәркайсыны һәркайсының һәркем
+һәркемгә һәркемдә һәркемне һәркемнең һәркемнән һәрни һәрнәрсә һәрнәрсәгә
+һәрнәрсәдә һәрнәрсәдән һәрнәрсәне һәрнәрсәнең һәртөрле
+
+ә әгәр әйтүенчә әйтүләренчә әлбәттә әле әлеге әллә әмма әнә
+
+өстәп өч өчен өченче өченчегә өченчедә өченчедән өченчеләр өченчеләргә
+өченчеләрдә өченчеләрдән өченчеләрне өченчеләрнең өченчене өченченең өчләп
+өчәрләп""".split())
--- a/spacy/lang/tt/tokenizer_exceptions.py
+++ b/spacy/lang/tt/tokenizer_exceptions.py
@ -0,0 +1,52 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from ...symbols import ORTH, LEMMA, NORM
+
+_exc = {}
+
+_abbrev_exc = [
+    # Weekdays abbreviations
+    {ORTH: "дш", LEMMA: "дүшәмбе"},
+    {ORTH: "сш", LEMMA: "сишәмбе"},
+    {ORTH: "чш", LEMMA: "чәршәмбе"},
+    {ORTH: "пш", LEMMA: "пәнҗешәмбе"},
+    {ORTH: "җм", LEMMA: "җомга"},
+    {ORTH: "шб", LEMMA: "шимбә"},
+    {ORTH: "яш", LEMMA: "якшәмбе"},
+
+    # Months abbreviations
+    {ORTH: "гый", LEMMA: "гыйнвар"},
+    {ORTH: "фев", LEMMA: "февраль"},
+    {ORTH: "мар", LEMMA: "март"},
+    {ORTH: "мар", LEMMA: "март"},
+    {ORTH: "апр", LEMMA: "апрель"},
+    {ORTH: "июн", LEMMA: "июнь"},
+    {ORTH: "июл", LEMMA: "июль"},
+    {ORTH: "авг", LEMMA: "август"},
+    {ORTH: "сен", LEMMA: "сентябрь"},
+    {ORTH: "окт", LEMMA: "октябрь"},
+    {ORTH: "ноя", LEMMA: "ноябрь"},
+    {ORTH: "дек", LEMMA: "декабрь"},
+
+    # Number abbreviations
+    {ORTH: "млрд", LEMMA: "миллиард"},
+    {ORTH: "млн", LEMMA: "миллион"},
+]
+
+for abbr in _abbrev_exc:
+    for orth in (abbr[ORTH], abbr[ORTH].capitalize(), abbr[ORTH].upper()):
+        _exc[orth] = [{ORTH: orth, LEMMA: abbr[LEMMA], NORM: abbr[LEMMA]}]
+        _exc[orth + "."] = [
+            {ORTH: orth + ".", LEMMA: abbr[LEMMA], NORM: abbr[LEMMA]}
+        ]
+
+for exc_data in [  # "etc." abbreviations
+    {ORTH: "һ.б.ш.", NORM: "һәм башка шундыйлар"},
+    {ORTH: "һ.б.", NORM: "һәм башка"},
+    {ORTH: "б.э.к.", NORM: "безнең эрага кадәр"},
+    {ORTH: "б.э.", NORM: "безнең эра"}]:
+    exc_data[LEMMA] = exc_data[NORM]
+    _exc[exc_data[ORTH]] = [exc_data]
+
+TOKENIZER_EXCEPTIONS = _exc
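The expansion loop in the Tatar `tokenizer_exceptions.py` above can be tried out in isolation. The following is a minimal sketch, not the module itself: it assumes plain strings as stand-ins for spaCy's `ORTH`, `LEMMA` and `NORM` symbols so that it runs without spaCy installed, and it uses only two of the abbreviations from the list.

```python
# Stand-ins for spaCy's ORTH / LEMMA / NORM symbols (assumption: plain
# strings are used here so the sketch runs without spaCy).
ORTH, LEMMA, NORM = "orth", "lemma", "norm"

# Two entries borrowed from _abbrev_exc for illustration.
abbrevs = [
    {ORTH: "дш", LEMMA: "дүшәмбе"},
    {ORTH: "млн", LEMMA: "миллион"},
]

exc = {}
for abbr in abbrevs:
    # Register the lowercase, capitalized and uppercase spellings,
    # each with and without a trailing period.
    for orth in (abbr[ORTH], abbr[ORTH].capitalize(), abbr[ORTH].upper()):
        for form in (orth, orth + "."):
            exc[form] = [{ORTH: form, LEMMA: abbr[LEMMA], NORM: abbr[LEMMA]}]

for form in sorted(exc):
    print(form, "->", exc[form][0][NORM])
```

Each abbreviation yields six keys, so the exception table grows quickly even from a short seed list; generating the variants in a loop keeps the hand-written list small.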
--- a/spacy/lang/ur/__init__.py
+++ b/spacy/lang/ur/__init__.py
@ -0,0 +1,30 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
+from .stop_words import STOP_WORDS
+from .lex_attrs import LEX_ATTRS
+from ..tag_map import TAG_MAP
+
+from ..tokenizer_exceptions import BASE_EXCEPTIONS
+from ...language import Language
+from ...attrs import LANG, NORM
+from ...util import update_exc
+
+
+class UrduDefaults(Language.Defaults):
+    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
+    lex_attr_getters.update(LEX_ATTRS)
+    lex_attr_getters[LANG] = lambda text: 'ur'
+
+    tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
+    tag_map = TAG_MAP
+    stop_words = STOP_WORDS
+
+
+class Urdu(Language):
+    lang = 'ur'
+    Defaults = UrduDefaults
+
+
+__all__ = ['Urdu']
--- a/spacy/lang/ur/examples.py
+++ b/spacy/lang/ur/examples.py
@ -0,0 +1,16 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+
+"""
+Example sentences to test spaCy and its language models.
+
+>>> from spacy.lang.ur.examples import sentences
+>>> docs = nlp.pipe(sentences)
+"""
+
+
+sentences = [
+    "اردو ہے جس کا نام ہم جانتے ہیں داغ",
+    "سارے جہاں میں دھوم ہماری زباں کی ہے",
+]
--- a/spacy/lang/ur/lemmatizer.py
+++ b/spacy/lang/ur/lemmatizer.py
--- a/spacy/lang/ur/lex_attrs.py
+++ b/spacy/lang/ur/lex_attrs.py
@ -0,0 +1,47 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from ...attrs import LIKE_NUM
+
+# Source https://quizlet.com/4271889/1-100-urdu-number-wordsurdu-numerals-flash-cards/
+# http://www.urduword.com/lessons.php?lesson=numbers
+# https://en.wikibooks.org/wiki/Urdu/Vocabulary/Numbers
+# https://www.urdu-english.com/lessons/beginner/numbers
+
+_num_words = """ایک دو تین چار پانچ چھ سات آٹھ نو دس گیارہ بارہ تیرہ چودہ پندرہ سولہ سترہ
+ اٹهارا انیس بیس اکیس بائیس تئیس چوبیس پچیس چھببیس 
+ستایس اٹھائس انتيس تیس اکتیس بتیس تینتیس چونتیس پینتیس
+ چھتیس سینتیس ارتیس انتالیس چالیس اکتالیس بیالیس تیتالیس 
+چوالیس پیتالیس چھیالیس سینتالیس اڑتالیس انچالیس پچاس اکاون باون
+ تریپن چون پچپن چھپن ستاون اٹھاون انسٹھ ساثھ 
+اکسٹھ باسٹھ تریسٹھ چوسٹھ پیسٹھ چھیاسٹھ سڑسٹھ اڑسٹھ 
+انھتر ستر اکھتر بھتتر تیھتر چوھتر تچھتر چھیتر ستتر
+اٹھتر انیاسی اسی اکیاسی بیاسی تیراسی چوراسی پچیاسی چھیاسی
+ سٹیاسی اٹھیاسی نواسی نوے اکانوے بانوے ترانوے 
+چورانوے پچانوے چھیانوے ستانوے اٹھانوے ننانوے سو
+""".split()
+
+# source https://www.google.com/intl/ur/inputtools/try/
+
+_ordinal_words = """پہلا دوسرا تیسرا چوتھا پانچواں چھٹا ساتواں آٹھواں نواں دسواں گیارہواں بارہواں تیرھواں چودھواں
+ پندرھواں سولہواں سترھواں اٹھارواں انیسواں بسیواں 
+""".split()
+
+
+def like_num(text):
+    text = text.replace(',', '').replace('.', '')
+    if text.isdigit():
+        return True
+    if text.count('/') == 1:
+        num, denom = text.split('/')
+        if num.isdigit() and denom.isdigit():
+            return True
+    if text in _num_words:
+        return True
+    if text in _ordinal_words:
+        return True
+    return False
+
+LEX_ATTRS = {
+    LIKE_NUM: like_num
+}
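The behaviour of `like_num` above is easiest to see on a few inputs. This is a standalone sketch under stated assumptions: the word set is a three-item subset chosen for illustration, and the function is a copy of the logic in the hunk rather than an import from spaCy.

```python
# Illustrative subset of the number words; the real module defines many more.
_num_words = {"ایک", "دو", "تین"}


def like_num(text):
    # Drop thousands separators and decimal points before testing.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Accept simple fractions such as "3/4".
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    return text in _num_words


samples = ["1,000", "3/4", "دو", "۱۲۳", "کتاب", "1/2/3"]
results = {s: like_num(s) for s in samples}
```

Because `str.isdigit` accepts any Unicode decimal digit, Extended Arabic-Indic numerals such as `۱۲۳` are recognised without extra handling. Note that stripping `.` also makes strings like `1.2.3` count as number-like, which may or may not be desirable.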
--- a/spacy/lang/ur/stop_words.py
+++ b/spacy/lang/ur/stop_words.py
@ -0,0 +1,515 @@
+# encoding: utf8
+from __future__ import unicode_literals
+
+# Source: collected from various resources on the internet
+
+STOP_WORDS = set("""
+ثھی
+ خو
+  گی
+   اپٌے
+    گئے
+     ثہت
+      طرف
+       ہوبری
+        پبئے
+         اپٌب
+          دوضری
+           گیب
+            کت
+             گب
+              ثھی
+               ضے
+                ہر
+پر
+اش
+ دی
+ گے
+لگیں
+ہے
+ثعذ
+ ضکتے
+  تھی
+   اى
+    دیب
+     لئے
+      والے
+       یہ
+        ثدبئے
+         ضکتی
+          تھب
+           اًذر
+            رریعے
+             لگی
+              ہوبرا
+               ہوًے
+                ثبہر
+                 ضکتب
+                  ًہیں
+                   تو
+                    اور
+رہب
+ لگے
+  ہوضکتب
+   ہوں
+    کب
+     ہوبرے
+      توبم
+       کیب
+        ایطے
+         رہی
+          هگر
+           ہوضکتی
+            ہیں
+             کریں
+              ہو
+               تک
+                کی
+                 ایک
+                  رہے
+                   هیں
+ ہوضکتے
+  کیطے
+   ہوًب
+    تت
+     کہ
+      ہوا
+       آئے
+        ضبت
+         تھے
+          کیوں
+           ہو
+            تب
+             کے
+              پھر
+               ثغیر
+                خبر
+                ہے
+                 رکھ
+                  کی
+                   طب
+                    کوئی
+                     رریعے
+ثبرے
+ خب
+  اضطرذ
+   ثلکہ
+    خجکہ
+     رکھ
+      تب
+       کی
+        طرف
+         ثراں
+          خبر
+            رریعہ
+ اضکب
+  ثٌذ
+   خص
+ کی
+  لئے
+ توہیں
+دوضرے
+ کررہی
+  اضکی
+   ثیچ
+    خوکہ
+     رکھتی
+      کیوًکہ
+       دوًوں
+        کر
+         رہے
+          خبر
+           ہی
+            ثرآں
+             اضکے
+              پچھلا
+               خیطب
+                رکھتے
+                 کے
+                  ثعذ
+                   تو
+                    ہی
+                     دورى
+کر
+ یہبں
+ آش
+  تھوڑا
+  چکے
+  زکویہ
+  دوضروں
+  ضکب
+  اوًچب
+  ثٌب
+  پل
+  تھوڑی
+  چلا
+  خبهوظ
+  دیتب
+  ضکٌب
+  اخبزت
+  اوًچبئی
+  ثٌبرہب
+پوچھب
+تھوڑے
+چلو
+ختن
+دیتی
+ضکی
+اچھب
+اوًچی
+ثٌبرہی
+پوچھتب
+تیي
+چلیں
+در
+دیتے
+ضکے
+اچھی
+اوًچے
+ثٌبرہے
+پوچھتی
+خبًب
+چلے
+درخبت
+دیر
+ضلطلہ
+اچھے
+اٹھبًب
+ثٌبًب
+پوچھتے
+خبًتب
+چھوٹب
+درخہ
+دیکھٌب
+ضوچ
+اختتبم
+اہن
+ثٌذ
+پوچھٌب
+خبًتی
+چھوٹوں
+درخے
+دیکھو
+ضوچب
+ادھر
+آئی
+ثٌذکرًب
+پوچھو
+خبًتے
+چھوٹی
+درزقیقت
+دیکھی
+ضوچتب
+ارد
+آئے
+ثٌذکرو
+پوچھوں
+خبًٌب
+چھوٹے
+درضت
+دیکھیں
+ضوچتی
+اردگرد
+آج
+ثٌذی
+پوچھیں
+خططرذ
+چھہ
+دش
+دیٌب
+ضوچتے
+ارکبى
+آخر
+ثڑا
+پورا
+خگہ
+چیسیں
+دفعہ
+دے
+ضوچٌب
+اضتعوبل
+آخر
+پہلا
+خگہوں
+زبصل
+دکھبئیں
+راضتوں
+ضوچو
+اضتعوبلات
+آدهی
+ثڑی
+پہلی
+خگہیں
+زبضر
+دکھبتب
+راضتہ
+ضوچی
+اغیب
+آًب
+ثڑے
+پہلےضی
+خلذی
+زبل
+دکھبتی
+راضتے
+ضوچیں
+اطراف
+آٹھ
+ثھر
+خٌبة
+زبل
+دکھبتے
+رکي
+ضیذھب
+افراد
+آیب
+ثھرا
+پہلے
+خواى
+زبلات
+دکھبًب
+رکھب
+ضیذھی
+اکثر
+ثب
+ہوا
+پیع
+خوًہی
+زبلیہ
+دکھبو
+رکھی
+ضیذھے
+اکٹھب
+ثھرپور
+تبزٍ
+خیطبکہ
+زصوں
+رکھے
+ضیکٌڈ
+اکٹھی
+ثبری
+ثہتر
+تر
+چبر
+زصہ
+دلچطپ
+زیبدٍ
+غبیذ
+اکٹھے
+ثبلا
+ثہتری
+ترتیت
+چبہب
+زصے
+دلچطپی
+ضبت
+غخص
+اکیلا
+ثبلترتیت
+ثہتریي
+تریي
+چبہٌب
+زقبئق
+دلچطپیبں
+ضبدٍ
+غذ
+اکیلی
+ثرش
+پبش
+تعذاد
+چبہے
+زقیتیں
+هٌبضت
+ضبرا
+غروع
+اکیلے
+ثغیر
+پبًب
+چکب
+زقیقت
+دو
+ضبرے
+غروعبت
+اگرچہ
+ثلٌذ
+پبًچ
+تن
+چکی
+زکن
+دور
+ضبل
+غے
+الگ
+پراًب
+تٌہب
+چکیں
+دوضرا
+ضبلوں
+صبف
+صسیر
+قجیلہ
+کوًطے
+لازهی
+هطئلے
+ًیب
+طریق
+کرتی
+کہتے
+صفر
+قطن
+کھولا
+لگتب
+هطبئل
+وار
+طریقوں
+کرتے
+کہٌب
+صورت
+کئی
+کھولٌب
+لگتی
+هطتعول
+وار
+طریقہ
+کرتے
+ہو
+کہٌب
+صورتسبل
+کئے
+کھولو
+لگتے
+هػتول
+ٹھیک
+طریقے
+کرًب
+کہو
+صورتوں
+کبفی
+هطلق
+ڈھوًڈا
+طور
+کرو
+کہوں
+صورتیں
+کبم
+کھولیں
+لگی
+هعلوم
+ڈھوًڈلیب
+طورپر
+کریں
+کہی
+ضرور
+کجھی
+کھولے
+لگے
+هکول
+ڈھوًڈًب
+ظبہر
+کرے
+کہیں
+ضرورت
+کرا
+کہب
+لوجب
+هلا
+ڈھوًڈو
+عذد
+کل
+کہیں
+کرتب
+کہتب
+لوجی
+هوکي
+ڈھوًڈی
+عظین
+کن
+کہے
+ضروری
+کرتبہوں
+کہتی
+لوجے
+هوکٌبت
+ڈھوًڈیں
+علاقوں
+کوتر
+کیے
+لوسبت
+هوکٌہ
+ہن
+لے
+ًبپطٌذ
+ہورہے
+علاقہ
+کورا
+کے
+رریعے
+لوسہ
+هڑا
+ہوئی
+هتعلق
+ًبگسیر
+ہوگئی
+علاقے
+کوروں
+گئی
+لو
+هڑًب
+ہوئے
+هسترم
+ًطجت
+ہو
+گئے
+علاوٍ
+کورٍ
+گرد
+لوگ
+هڑے
+ہوتی
+هسترهہ
+ًقطہ
+ہوگیب
+کورے
+گروپ
+لوگوں
+هہرثبى
+ہوتے
+هسطوش
+ًکبلٌب
+ہوًی
+عووهی
+کوطي
+گروٍ
+لڑکپي
+هیرا
+ہوچکب
+هختلف
+ًکتہ
+ہی
+فرد
+کوى
+گروہوں
+لی
+هیری
+ہوچکی
+هسیذ
+فی
+کوًطب
+گٌتی
+لیب
+هیرے
+ہوچکے
+هطئلہ
+ًوخواى
+یقیٌی
+قجل
+کوًطی
+لیٌب
+ًئی
+ہورہب
+لیں
+ًئے
+ہورہی
+ثبعث
+ضت
+""".split())
--- a/spacy/lang/ur/tag_map.py
+++ b/spacy/lang/ur/tag_map.py
@ -0,0 +1,65 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from ...symbols import POS, PUNCT, SYM, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB
+from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON
+
+TAG_MAP = {
+    ".":        {POS: PUNCT, "PunctType": "peri"},
+    ",":        {POS: PUNCT, "PunctType": "comm"},
+    "-LRB-":    {POS: PUNCT, "PunctType": "brck", "PunctSide": "ini"},
+    "-RRB-":    {POS: PUNCT, "PunctType": "brck", "PunctSide": "fin"},
+    "``":       {POS: PUNCT, "PunctType": "quot", "PunctSide": "ini"},
+    "\"\"":     {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
+    "''":       {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
+    ":":        {POS: PUNCT},
+    "$":        {POS: SYM, "Other": {"SymType": "currency"}},
+    "#":        {POS: SYM, "Other": {"SymType": "numbersign"}},
+    "AFX":      {POS: ADJ,  "Hyph": "yes"},
+    "CC":       {POS: CCONJ, "ConjType": "coor"},
+    "CD":       {POS: NUM, "NumType": "card"},
+    "DT":       {POS: DET},
+    "EX":       {POS: ADV, "AdvType": "ex"},
+    "FW":       {POS: X, "Foreign": "yes"},
+    "HYPH":     {POS: PUNCT, "PunctType": "dash"},
+    "IN":       {POS: ADP},
+    "JJ":       {POS: ADJ, "Degree": "pos"},
+    "JJR":      {POS: ADJ, "Degree": "comp"},
+    "JJS":      {POS: ADJ, "Degree": "sup"},
+    "LS":       {POS: PUNCT, "NumType": "ord"},
+    "MD":       {POS: VERB, "VerbType": "mod"},
+    "NIL":      {POS: ""},
+    "NN":       {POS: NOUN, "Number": "sing"},
+    "NNP":      {POS: PROPN, "NounType": "prop", "Number": "sing"},
+    "NNPS":     {POS: PROPN, "NounType": "prop", "Number": "plur"},
+    "NNS":      {POS: NOUN, "Number": "plur"},
+    "PDT":      {POS: ADJ, "AdjType": "pdt", "PronType": "prn"},
+    "POS":      {POS: PART, "Poss": "yes"},
+    "PRP":      {POS: PRON, "PronType": "prs"},
+    "PRP$":     {POS: ADJ, "PronType": "prs", "Poss": "yes"},
+    "RB":       {POS: ADV, "Degree": "pos"},
+    "RBR":      {POS: ADV, "Degree": "comp"},
+    "RBS":      {POS: ADV, "Degree": "sup"},
+    "RP":       {POS: PART},
+    "SP":       {POS: SPACE},
+    "SYM":      {POS: SYM},
+    "TO":       {POS: PART, "PartType": "inf", "VerbForm": "inf"},
+    "UH":       {POS: INTJ},
+    "VB":       {POS: VERB, "VerbForm": "inf"},
+    "VBD":      {POS: VERB, "VerbForm": "fin", "Tense": "past"},
+    "VBG":      {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
+    "VBN":      {POS: VERB, "VerbForm": "part", "Tense": "past", "Aspect": "perf"},
+    "VBP":      {POS: VERB, "VerbForm": "fin", "Tense": "pres"},
+    "VBZ":      {POS: VERB, "VerbForm": "fin", "Tense": "pres", "Number": "sing", "Person": 3},
+    "WDT":      {POS: ADJ, "PronType": "int|rel"},
+    "WP":       {POS: NOUN, "PronType": "int|rel"},
+    "WP$":      {POS: ADJ, "Poss": "yes", "PronType": "int|rel"},
+    "WRB":      {POS: ADV, "PronType": "int|rel"},
+    "ADD":      {POS: X},
+    "NFP":      {POS: PUNCT},
+    "GW":       {POS: X},
+    "XX":       {POS: X},
+    "BES":      {POS: VERB},
+    "HVS":      {POS: VERB},
+    "_SP":       {POS: SPACE},
+}
--- a/spacy/lang/ur/tokenizer_exceptions.py
+++ b/spacy/lang/ur/tokenizer_exceptions.py
@ -0,0 +1,22 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+# import symbols – if you need to use more, add them here
+from ...symbols import ORTH, LEMMA, TAG, NORM, ADP, DET
+
+# Add tokenizer exceptions
+# Documentation: https://spacy.io/docs/usage/adding-languages#tokenizer-exceptions
+# Feel free to use custom logic to generate repetitive exceptions more efficiently.
+# If an exception is split into more than one token, the ORTH values combined always
+# need to match the original string.
+
+# Exceptions should be added in the following format:
+
+_exc = {
+
+}
+
+# To keep things clean and readable, it's recommended to only declare the
+# TOKENIZER_EXCEPTIONS at the bottom:
+
+TOKENIZER_EXCEPTIONS = _exc
--- a/spacy/tests/conftest.py
+++ b/spacy/tests/conftest.py
@ -15,7 +15,8 @@ from .. import util
 # here if it's using spaCy's tokenizer (not a different library)
 # TODO: re-implement generic tokenizer tests
 _languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id',
-              'it', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sv', 'tr', 'ar', 'xx']
+              'it', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sv', 'tr', 'ar', 'ur', 'tt',
+              'xx']

 _models = {'en': ['en_core_web_sm'],
           'de': ['de_core_news_sm'],
@ -153,10 +154,18 @@ def th_tokenizer():
 def tr_tokenizer():
    return util.get_lang_class('tr').Defaults.create_tokenizer()

+@pytest.fixture
+def tt_tokenizer():
+    return util.get_lang_class('tt').Defaults.create_tokenizer()
+
@pytest.fixture
 def ar_tokenizer():
    return util.get_lang_class('ar').Defaults.create_tokenizer()

+@pytest.fixture
+def ur_tokenizer():
+    return util.get_lang_class('ur').Defaults.create_tokenizer()
+
@pytest.fixture
 def ru_tokenizer():
    pymorphy = pytest.importorskip('pymorphy2')
--- a/spacy/tests/lang/tt/__init__.py
+++ b/spacy/tests/lang/tt/__init__.py
--- a/spacy/tests/lang/tt/test_tokenizer.py
+++ b/spacy/tests/lang/tt/test_tokenizer.py
@ -0,0 +1,75 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+import pytest
+
+INFIX_HYPHEN_TESTS = [
+    ("Явым-төшем күләме.", "Явым-төшем күләме .".split()),
+    ("Хатын-кыз киеме.", "Хатын-кыз киеме .".split())
+]
+
+PUNC_INSIDE_WORDS_TESTS = [
+    ("Пассаҗир саны - 2,13 млн — кеше/көндә (2010), 783,9 млн. кеше/елда.",
+     "Пассаҗир саны - 2,13 млн — кеше / көндә ( 2010 ) ,"
+     " 783,9 млн. кеше / елда .".split()),
+    ("Ту\"кай", "Ту \" кай".split())
+]
+
+MIXED_ORDINAL_NUMS_TESTS = [
+    ("Иртәгә 22нче гыйнвар...", "Иртәгә 22нче гыйнвар ...".split())
+]
+
+ABBREV_TESTS = [
+    ("«3 елда (б.э.к.) туган", "« 3 елда ( б.э.к. ) туган".split()),
+    ("тукымадан һ.б.ш. тегелгән.", "тукымадан һ.б.ш. тегелгән .".split())
+]
+
+NAME_ABBREV_TESTS = [
+    ("Ә.Тукай", "Ә.Тукай".split()),
+    ("Ә.тукай", "Ә.тукай".split()),
+    ("ә.Тукай", "ә . Тукай".split()),
+    ("Миләүшә.", "Миләүшә .".split())
+]
+
+TYPOS_IN_PUNC_TESTS = [
+    ("«3 елда , туган", "« 3 елда , туган".split()),
+    ("«3 елда,туган", "« 3 елда , туган".split()),
+    ("«3 елда,туган.", "« 3 елда , туган .".split()),
+    ("Ул эшли(кайчан?)", "Ул эшли ( кайчан ? )".split()),
+    ("Ул (кайчан?)эшли", "Ул ( кайчан ?) эшли".split())  # "?)" => "?)" or "? )"
+]
+
+LONG_TEXTS_TESTS = [
+    ("Иң борынгы кешеләр суыклар һәм салкын кышлар булмый торган җылы "
+     "якларда яшәгәннәр, шуңа күрә аларга кием кирәк булмаган.Йөз "
+     "меңнәрчә еллар үткән, борынгы кешеләр акрынлап Европа һәм Азиянең "
+     "салкын илләрендә дә яши башлаганнар. Алар кырыс һәм салкын "
+     "кышлардан саклану өчен кием-салым уйлап тапканнар - итәк.",
+     "Иң борынгы кешеләр суыклар һәм салкын кышлар булмый торган җылы "
+     "якларда яшәгәннәр , шуңа күрә аларга кием кирәк булмаган . Йөз "
+     "меңнәрчә еллар үткән , борынгы кешеләр акрынлап Европа һәм Азиянең "
+     "салкын илләрендә дә яши башлаганнар . Алар кырыс һәм салкын "
+     "кышлардан саклану өчен кием-салым уйлап тапканнар - итәк .".split()
+     )
+]
+
+TESTCASES = (INFIX_HYPHEN_TESTS + PUNC_INSIDE_WORDS_TESTS +
+             MIXED_ORDINAL_NUMS_TESTS + ABBREV_TESTS + NAME_ABBREV_TESTS +
+             LONG_TEXTS_TESTS + TYPOS_IN_PUNC_TESTS)
+
+NORM_TESTCASES = [
+    ("тукымадан һ.б.ш. тегелгән.",
+     ["тукымадан", "һәм башка шундыйлар", "тегелгән", "."])
+]
+
+
+@pytest.mark.parametrize("text,expected_tokens", TESTCASES)
+def test_tokenizer_handles_testcases(tt_tokenizer, text, expected_tokens):
+    tokens = [token.text for token in tt_tokenizer(text) if not token.is_space]
+    assert expected_tokens == tokens
+
+
+@pytest.mark.parametrize('text,norms', NORM_TESTCASES)
+def test_tokenizer_handles_norm_exceptions(tt_tokenizer, text, norms):
+    tokens = tt_tokenizer(text)
+    assert [token.norm_ for token in tokens] == norms
--- a/spacy/tests/lang/ur/__init__.py
+++ b/spacy/tests/lang/ur/__init__.py
--- a/spacy/tests/lang/ur/test_text.py
+++ b/spacy/tests/lang/ur/test_text.py
@ -0,0 +1,26 @@
+# coding: utf-8
+
+"""Test that longer and mixed texts are tokenized correctly."""
+
+
+from __future__ import unicode_literals
+
+import pytest
+
+
+def test_tokenizer_handles_long_text(ur_tokenizer):
+    text = """اصل میں رسوا ہونے کی ہمیں
+     کچھ عادت سی ہو گئی ہے اس لئے جگ ہنسائی کا ذکر نہیں کرتا،ہوا کچھ یوں کہ عرصہ چھ سال بعد ہمیں بھی خیال آیا
+     کہ ایک عدد ٹیلی ویژن ہی کیوں نہ خرید لیں ، سوچا ورلڈ کپ ہی دیکھیں گے۔اپنے پاکستان کے کھلاڑیوں کو دیکھ کر 
+    ورلڈ کپ دیکھنے کا حوصلہ ہی نہ رہا تو اب یوں ہی ادھر اُدھر کے چینل گھمانے لگ پڑتے ہیں۔"""
+
+    tokens = ur_tokenizer(text)
+    assert len(tokens) == 77
+
+
+@pytest.mark.parametrize('text,length', [
+    ("تحریر باسط حبیب", 3),
+    ("میرا پاکستان", 2)])
+def test_tokenizer_handles_cnts(ur_tokenizer, text, length):
+    tokens = ur_tokenizer(text)
+    assert len(tokens) == length
--- a/spacy/tests/test_misc.py
+++ b/spacy/tests/test_misc.py
@ -3,7 +3,7 @@ from __future__ import unicode_literals

 from ..util import ensure_path
 from .. import util
-from ..displacy import parse_deps, parse_ents
+from .. import displacy
 from ..tokens import Span
 from .util import get_doc
 from .._ml import PrecomputableAffine
@ -34,18 +34,16 @@ def test_util_get_package_path(package):
    assert isinstance(path, Path)


-@pytest.mark.xfail
 def test_displacy_parse_ents(en_vocab):
    """Test that named entities on a Doc are converted into displaCy's format."""
    doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
    doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings[u'ORG'])]
-    ents = parse_ents(doc)
+    ents = displacy.parse_ents(doc)
    assert isinstance(ents, dict)
    assert ents['text'] == 'But Google is starting from behind '
    assert ents['ents'] == [{'start': 4, 'end': 10, 'label': 'ORG'}]


-@pytest.mark.xfail
 def test_displacy_parse_deps(en_vocab):
    """Test that deps and tags on a Doc are converted into displaCy's format."""
    words = ["This", "is", "a", "sentence"]
@ -55,7 +53,7 @@ def test_displacy_parse_deps(en_vocab):
    deps = ['nsubj', 'ROOT', 'det', 'attr']
    doc = get_doc(en_vocab, words=words, heads=heads, pos=pos, tags=tags,
                  deps=deps)
-    deps = parse_deps(doc)
+    deps = displacy.parse_deps(doc)
    assert isinstance(deps, dict)
    assert deps['words'] == [{'text': 'This', 'tag': 'DET'},
                            {'text': 'is', 'tag': 'VERB'},
@ -66,7 +64,19 @@ def test_displacy_parse_deps(en_vocab):
                            {'start': 1, 'end': 3, 'label': 'attr', 'dir': 'right'}]


-@pytest.mark.xfail
+def test_displacy_spans(en_vocab):
+    """Test that displaCy can render Spans."""
+    doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
+    doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings[u'ORG'])]
+    html = displacy.render(doc[1:4], style='ent')
+    assert html.startswith('<div')
+
+
+def test_displacy_raises_for_wrong_type(en_vocab):
+    with pytest.raises(ValueError):
+        displacy.render('hello world')
+
+
 def test_PrecomputableAffine(nO=4, nI=5, nF=3, nP=2):
    model = PrecomputableAffine(nO=nO, nI=nI, nF=nF, nP=nP)
    assert model.W.shape == (nF, nO, nP, nI)
--- a/spacy/tokenizer.pyx
+++ b/spacy/tokenizer.pyx
@ -394,7 +394,7 @@ cdef class Tokenizer:
        data = OrderedDict()
        deserializers = OrderedDict((
            ('vocab', lambda b: self.vocab.from_bytes(b)),
-            ('prefix_search', lambda b: data.setdefault('prefix', b)),
+            ('prefix_search', lambda b: data.setdefault('prefix_search', b)),
            ('suffix_search', lambda b: data.setdefault('suffix_search', b)),
            ('infix_finditer', lambda b: data.setdefault('infix_finditer', b)),
            ('token_match', lambda b: data.setdefault('token_match', b)),
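The one-word change in the `prefix_search` deserializer above matters because the value is stored under one key and, presumably, read back under another. The sketch below is a simplified model, not the tokenizer code: `load_prefix` is a hypothetical helper, and the assumption that the pattern is later looked up under `'prefix_search'` is not shown in the hunk.

```python
from collections import OrderedDict


def load_prefix(storage_key):
    """Hypothetical model of the deserializer: store the raw pattern under
    `storage_key`, then look it up under 'prefix_search'."""
    data = OrderedDict()
    deserializers = OrderedDict((
        ("prefix_search", lambda b: data.setdefault(storage_key, b)),
    ))
    # Simulate deserializing one serialized prefix pattern.
    deserializers["prefix_search"](r"^\(")
    # Assumed reader: expects the pattern under the canonical key.
    return data.get("prefix_search")


before_fix = load_prefix("prefix")         # key used before the change
after_fix = load_prefix("prefix_search")   # key used after the change
```

Under this assumption, the old key leaves the lookup empty, so a deserialized tokenizer would silently lose its prefix rules, while the corrected key round-trips the pattern.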
--- a/website/_harp.json
+++ b/website/_harp.json
@ -84,8 +84,8 @@
            }
        ],

-        "V_CSS": "2.1.3",
-        "V_JS": "2.1.2",
+        "V_CSS": "2.2.1",
+        "V_JS": "2.2.2",
        "DEFAULT_SYNTAX": "python",
        "ANALYTICS": "UA-58931649-1",
        "MAILCHIMP": {
--- a/website/_includes/_mixins.jade
+++ b/website/_includes/_mixins.jade
@ -124,6 +124,12 @@ mixin help(tooltip, icon_size)
        +icon("help_o", icon_size || 16).o-icon--inline


+//- Abbreviation
+
+mixin abbr(title)
+    abbr.o-abbr(data-tooltip=title data-tooltip-style="code" aria-label=title)&attributes(attributes)
+        block
+
 //- Aside wrapper
    label - [string] aside label

--- a/website/_includes/_sidebar.jade
+++ b/website/_includes/_sidebar.jade
@ -9,7 +9,7 @@ menu.c-sidebar.js-sidebar.u-text
                each url, item in items
                    - var is_current = CURRENT == url || (CURRENT == "index" && url == "./")
                    li.c-sidebar__item
-                        +a(url)(class=is_current ? "is-active" : null tabindex=is_current ? "-1" : null)=item
+                        +a(url)(class=is_current ? "is-active" : null tabindex=is_current ? "-1" : null data-sidebar-active=is_current ? "" : null)=item

                        if is_current
                            if IS_MODELS && CURRENT_MODELS.length
--- a/website/api/_architecture/_cython.jade
+++ b/website/api/_architecture/_cython.jade
@ -1,115 +0,0 @@
-//- 💫 DOCS > API > ARCHITECTURE > CYTHON
-
-+aside("What's Cython?")
-    |  #[+a("http://cython.org/") Cython] is a language for writing
-    |  C extensions for Python. Most Python code is also valid Cython, but
-    |  you can add type declarations to get efficient memory-managed code
-    |  just like C or C++.
-
-p
-    |  spaCy's core data structures are implemented as
-    |  #[+a("http://cython.org/") Cython] #[code cdef] classes. Memory is
-    |  managed through the #[+a(gh("cymem")) #[code cymem]]
-    |  #[code cymem.Pool] class, which allows you
-    |  to allocate memory which will be freed when the #[code Pool] object
-    |  is garbage collected. This means you usually don't have to worry
-    |  about freeing memory. You just have to decide which Python object
-    |  owns the memory, and make it own the #[code Pool]. When that object
-    |  goes out of scope, the memory will be freed. You do have to take
-    |  care that no pointers outlive the object that owns them — but this
-    |  is generally quite easy.
-
-p
-    |  All Cython modules should have the #[code # cython: infer_types=True]
-    |  compiler directive at the top of the file. This makes the code much
-    |  cleaner, as it avoids the need for many type declarations. If
-    |  possible, you should prefer to declare your functions #[code nogil],
-    |  even if you don't especially care about multi-threading. The reason
-    |  is that #[code nogil] functions help the Cython compiler reason about
-    |  your code quite a lot — you're telling the compiler that no Python
-    |  dynamics are possible. This lets many errors be raised, and ensures
-    |  your function will run at C speed.
-
-
-p
-    |  Cython gives you many choices of sequences: you could have a Python
-    |  list, a numpy array, a memory view, a C++ vector, or a pointer.
-    |  Pointers are preferred, because they are fastest, have the most
-    |  explicit semantics, and let the compiler check your code more
-    |  strictly. C++ vectors are also great — but you should only use them
-    |  internally in functions. It's less friendly to accept a vector as an
-    |  argument, because that asks the user to do much more work. Here's
-    |  how to get a pointer from a numpy array, memory view or vector:
-
-+code.
-    cdef void get_pointers(np.ndarray[int, mode='c'] numpy_array, vector[int] cpp_vector, int[::1] memory_view) nogil:
-    pointer1 = &lt;int*&gt;numpy_array.data
-    pointer2 = cpp_vector.data()
-    pointer3 = &memory_view[0]
-
-p
-    |  Both C arrays and C++ vectors reassure the compiler that no Python
-    |  operations are possible on your variable. This is a big advantage:
-    |  it lets the Cython compiler raise many more errors for you.
-
-p
-    |  When getting a pointer from a numpy array or memoryview, take care
-    |  that the data is actually stored in C-contiguous order — otherwise
-    |  you'll get a pointer to nonsense. The type-declarations in the code
-    |  above should generate runtime errors if buffers with incorrect
-    |  memory layouts are passed in. To iterate over the array, the
-    |  following style is preferred:
-
-+code.
-    cdef int c_total(const int* int_array, int length) nogil:
-        total = 0
-        for item in int_array[:length]:
-            total += item
-        return total
-
-p
-    |  If this is confusing, consider that the compiler couldn't deal with
-    |  #[code for item in int_array:] — there's no length attached to a raw
-    |  pointer, so how could we figure out where to stop? The length is
-    |  provided in the slice notation as a solution to this. Note that we
-    |  don't have to declare the type of #[code item] in the code above —
-    |  the compiler can easily infer it. This gives us tidy code that looks
-    |  quite like Python, but is exactly as fast as C — because we've made
-    |  sure the compilation to C is trivial.
-
-p
-    |  Your functions cannot be declared #[code nogil] if they need to
-    |  create Python objects or call Python functions. This is perfectly
-    |  okay — you shouldn't torture your code just to get #[code nogil]
-    |  functions. However, if your function isn't #[code nogil], you should
-    |  compile your module with #[code cython -a --cplus my_module.pyx] and
-    |  open the resulting #[code my_module.html] file in a browser. This
-    |  will let you see how Cython is compiling your code. Calls into the
-    |  Python run-time will be in bright yellow. This lets you easily see
-    |  whether Cython is able to correctly type your code, or whether there
-    |  are unexpected problems.
-
-p
-    |  Working in Cython is very rewarding once you're over the initial
-    |  learning curve. As with C and C++, the first way you write something
-    |  in Cython will often be the performance-optimal approach. In
-    |  contrast, Python optimisation generally requires a lot of
-    |  experimentation. Is it faster to have an #[code if item in my_dict]
-    |  check, or to use #[code .get()]? What about
-    |  #[code try]/#[code except]? Does this numpy operation create a copy?
-    |  There's no way to guess the answers to these questions, and you'll
-    |  usually be dissatisfied with your results — so there's no way to
-    |  know when to stop this process. In the worst case, you'll make a
-    |  mess that invites the next reader to try their luck too. This is
-    |  like one of those
-    |  #[+a("http://www.wemjournal.org/article/S1080-6032%2809%2970088-2/abstract") volcanic gas-traps],
-    |  where the rescuers keep passing out from low oxygen, causing
-    |  another rescuer to follow — only to succumb themselves. In short,
-    |  just say no to optimizing your Python. If it's not fast enough the
-    |  first time, just switch to Cython.
-
-+infobox("Resources")
-    +list.o-no-block
-        +item #[+a("http://docs.cython.org/en/latest/") Official Cython documentation] (cython.org)
-        +item #[+a("https://explosion.ai/blog/writing-c-in-cython", true) Writing C in Cython] (explosion.ai)
-        +item #[+a("https://explosion.ai/blog/multithreading-with-cython") Multi-threading spaCy’s parser and named entity recogniser] (explosion.ai)
--- a/website/api/_architecture/_nn-model.jade
+++ b/website/api/_architecture/_nn-model.jade
@ -1,149 +0,0 @@
-//- 💫 DOCS > API > ARCHITECTURE > NN MODEL ARCHITECTURE
-
-p
-    |  spaCy's statistical models have been custom-designed to give a
-    |  high-performance mix of speed and accuracy. The current architecture
-    |  hasn't been published yet, but in the meantime we prepared a video that
-    |  explains how the models work, with particular focus on NER.
-
-+youtube("sqDHBH9IjRU")
-
-p
-    |  The parsing model is a blend of recent results. The two recent
-    |  inspirations have been the work of Eli Klipperwasser and Yoav Goldberg at
-    |  Bar Ilan#[+fn(1)], and the SyntaxNet team from Google. The foundation of
-    |  the parser is still based on the work of Joakim Nivre#[+fn(2)], who
-    |  introduced the transition-based framework#[+fn(3)], the arc-eager
-    |  transition system, and the imitation learning objective. The model is
-    |  implemented using #[+a(gh("thinc")) Thinc], spaCy's machine learning
-    |  library. We first predict context-sensitive vectors for each word in the
-    |  input:
-
-+code.
-    (embed_lower | embed_prefix | embed_suffix | embed_shape)
-        &gt;&gt; Maxout(token_width)
-        &gt;&gt; convolution ** 4
-
-p
-    |  This convolutional layer is shared between the tagger, parser and NER,
-    |  and will also be shared by the future neural lemmatizer. Because the
-    |  parser shares these layers with the tagger, the parser does not require
-    |  tag features. I got this trick from David Weiss's "Stack Combination"
-    |  paper#[+fn(4)].
-
-p
-    |  To boost the representation, the tagger actually predicts a "super tag"
-    |  with POS, morphology and dependency label#[+fn(5)]. The tagger predicts
-    |  these supertags by adding a softmax layer onto the convolutional layer –
-    |  so, we're teaching the convolutional layer to give us a representation
-    |  that's one affine transform from this informative lexical information.
-    |  This is obviously good for the parser (which backprops to the
-    |  convolutions too). The parser model makes a state vector by concatenating
-    |  the vector representations for its context tokens.  The current context
-    |  tokens:
-
-+table
-    +row
-        +cell #[code S0], #[code S1], #[code S2]
-        +cell Top three words on the stack.
-
-    +row
-        +cell #[code B0], #[code B1]
-        +cell First two words of the buffer.
-
-    +row
-        +cell
-            |  #[code S0L1], #[code S1L1], #[code S2L1], #[code B0L1],
-            |  #[code B1L1]#[br]
-            |  #[code S0L2], #[code S1L2], #[code S2L2], #[code B0L2],
-            |  #[code B1L2]
-        +cell
-            |  Leftmost and second leftmost children of #[code S0], #[code S1],
-            |  #[code S2], #[code B0] and #[code B1].
-
-    +row
-        +cell
-            |  #[code S0R1], #[code S1R1], #[code S2R1], #[code B0R1],
-            |  #[code B1R1]#[br]
-            |  #[code S0R2], #[code S1R2], #[code S2R2], #[code B0R2],
-            |  #[code B1R2]
-        +cell
-            |  Rightmost and second rightmost children of #[code S0], #[code S1],
-            |  #[code S2], #[code B0] and #[code B1].
-
-p
-    |  This makes the state vector quite long: #[code 13*T], where #[code T] is
-    |  the token vector width (128 is working well). Fortunately, there's a way
-    |  to structure the computation to save some expense (and make it more
-    |  GPU-friendly).
-
-p
-    |  The parser typically visits #[code 2*N] states for a sentence of length
-    |  #[code N] (although it may visit more, if it back-tracks with a
-    |  non-monotonic transition#[+fn(4)]). A naive implementation would require
-    |  #[code 2*N (B, 13*T) @ (13*T, H)] matrix multiplications for a batch of
-    |  size #[code B]. We can instead perform one #[code (B*N, T) @ (T, 13*H)]
-    |  multiplication, to pre-compute the hidden weights for each positional
-    |  feature with respect to the words in the batch. (Note that our token
-    |  vectors come from the CNN — so we can't play this trick over the
-    |  vocabulary. That's how Stanford's NN parser#[+fn(3)] works — and why its
-    |  model is so big.)
-
-p
-    |  This pre-computation strategy allows a nice compromise between
-    |  GPU-friendliness and implementation simplicity. The CNN and the wide
-    |  lower layer are computed on the GPU, and then the precomputed hidden
-    |  weights are moved to the CPU, before we start the transition-based
-    |  parsing process. This makes a lot of things much easier. We don't have to
-    |  worry about variable-length batch sizes, and we don't have to implement
-    |  the dynamic oracle in CUDA to train.
-
-p
-    |  Currently the parser's loss function is multilabel log loss#[+fn(6)], as
-    |  the dynamic oracle allows multiple states to be 0 cost. This is defined
-    |  as follows, where #[code gZ] is the sum of the scores assigned to gold
-    |  classes:
-
-+code.
-    (exp(score) / Z) - (exp(score) / gZ)
-
-+bibliography
-    +item
-        |  #[+a("https://www.semanticscholar.org/paper/Simple-and-Accurate-Dependency-Parsing-Using-Bidir-Kiperwasser-Goldberg/3cf31ecb2724b5088783d7c96a5fc0d5604cbf41") Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations]
-        br
-        |  Eliyahu Kiperwasser, Yoav Goldberg. (2016)
-
-    +item
-        |  #[+a("https://www.semanticscholar.org/paper/A-Dynamic-Oracle-for-Arc-Eager-Dependency-Parsing-Goldberg-Nivre/22697256ec19ecc3e14fcfc63624a44cf9c22df4") A Dynamic Oracle for Arc-Eager Dependency Parsing]
-        br
-        |  Yoav Goldberg, Joakim Nivre (2012)
-
-    +item
-        |  #[+a("https://explosion.ai/blog/parsing-english-in-python") Parsing English in 500 Lines of Python]
-        br
-        |  Matthew Honnibal (2013)
-
-    +item
-        |  #[+a("https://www.semanticscholar.org/paper/Stack-propagation-Improved-Representation-Learning-Zhang-Weiss/0c133f79b23e8c680891d2e49a66f0e3d37f1466") Stack-propagation: Improved Representation Learning for Syntax]
-        br
-        |  Yuan Zhang, David Weiss (2016)
-
-    +item
-        |  #[+a("https://www.semanticscholar.org/paper/Deep-multi-task-learning-with-low-level-tasks-supe-S%C3%B8gaard-Goldberg/03ad06583c9721855ccd82c3d969a01360218d86") Deep multi-task learning with low level tasks supervised at lower layers]
-        br
-        |  Anders Søgaard, Yoav Goldberg (2016)
-
-    +item
-        |  #[+a("https://www.semanticscholar.org/paper/An-Improved-Non-monotonic-Transition-System-for-De-Honnibal-Johnson/4094cee47ade13b77b5ab4d2e6cb9dd2b8a2917c") An Improved Non-monotonic Transition System for Dependency Parsing]
-        br
-        |  Matthew Honnibal, Mark Johnson (2015)
-
-    +item
-        |  #[+a("http://cs.stanford.edu/people/danqi/papers/emnlp2014.pdf") A Fast and Accurate Dependency Parser using Neural Networks]
-        br
-        |  Danqi Cheng, Christopher D. Manning (2014)
-
-    +item
-        |  #[+a("https://www.semanticscholar.org/paper/Parsing-the-Wall-Street-Journal-using-a-Lexical-Fu-Riezler-King/0ad07862a91cd59b7eb5de38267e47725a62b8b2") Parsing the Wall Street Journal using a Lexical-Functional Grammar and Discriminative Estimation Techniques]
-        br
-        |  Stefan Riezler et al. (2002)
--- a/website/api/_cython/_doc.jade
+++ b/website/api/_cython/_doc.jade
@ -0,0 +1,71 @@
+//- 💫 DOCS > API > CYTHON > CLASSES > DOC
+
+p
+    |  The #[code Doc] object holds an array of
+    |  #[+api("cython-structs#tokenc") #[code TokenC]] structs.
+
++infobox
+    |  This section documents the extra C-level attributes and methods that
+    |  can't be accessed from Python. For the Python documentation, see
+    |  #[+api("doc") #[code Doc]].
+
++h(3, "doc_attributes") Attributes
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code mem]
+        +cell #[code cymem.Pool]
+        +cell
+            |  A memory pool. Allocated memory will be freed once the
+            |  #[code Doc] object is garbage collected.
+
+    +row
+        +cell #[code vocab]
+        +cell #[code Vocab]
+        +cell A reference to the shared #[code Vocab] object.
+
+    +row
+        +cell #[code c]
+        +cell #[code TokenC*]
+        +cell
+            |  A pointer to a #[+api("cython-structs#tokenc") #[code TokenC]]
+            |  struct.
+
+    +row
+        +cell #[code length]
+        +cell #[code int]
+        +cell The number of tokens in the document.
+
+    +row
+        +cell #[code max_length]
+        +cell #[code int]
+        +cell The underlying size of the #[code Doc.c] array.
+
++h(3, "doc_push_back") Doc.push_back
+    +tag method
+
+p
+    |  Append a token to the #[code Doc]. The token can be provided as a
+    |  #[+api("cython-structs#lexemec") #[code LexemeC]] or
+    |  #[+api("cython-structs#tokenc") #[code TokenC]] pointer, using Cython's
+    |  #[+a("http://cython.readthedocs.io/en/latest/src/userguide/fusedtypes.html") fused types].
+
++aside-code("Example").
+    from spacy.tokens cimport Doc
+    from spacy.vocab cimport Vocab
+
+    doc = Doc(Vocab())
+    lexeme = doc.vocab.get(u'hello')
+    doc.push_back(lexeme, True)
+    assert doc.text == u'hello '
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code lex_or_tok]
+        +cell #[code LexemeOrToken]
+        +cell The word to append to the #[code Doc].
+
+    +row
+        +cell #[code has_space]
+        +cell #[code bint]
+        +cell Whether the word has trailing whitespace.
--- a/website/api/_cython/_lexeme.jade
+++ b/website/api/_cython/_lexeme.jade
@ -0,0 +1,30 @@
+//- 💫 DOCS > API > CYTHON > CLASSES > LEXEME
+
+p
+    |  A Cython class providing access and methods for an entry in the
+    |  vocabulary.
+
++infobox
+    |  This section documents the extra C-level attributes and methods that
+    |  can't be accessed from Python. For the Python documentation, see
+    |  #[+api("lexeme") #[code Lexeme]].
+
++h(3, "lexeme_attributes") Attributes
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code c]
+        +cell #[code LexemeC*]
+        +cell
+            |  A pointer to a #[+api("cython-structs#lexemec") #[code LexemeC]]
+            |  struct.
+
+    +row
+        +cell #[code vocab]
+        +cell #[code Vocab]
+        +cell A reference to the shared #[code Vocab] object.
+
+    +row
+        +cell #[code orth]
+        +cell #[+abbr("uint64_t") #[code attr_t]]
+        +cell ID of the verbatim text content.
--- a/website/api/_cython/_lexemec.jade
+++ b/website/api/_cython/_lexemec.jade
@ -0,0 +1,200 @@
+//- 💫 DOCS > API > CYTHON > STRUCTS > LEXEMEC
+
+p
+    |  Struct holding information about a lexical type. #[code LexemeC]
+    |  structs are usually owned by the #[code Vocab], and accessed through a
+    |  read-only pointer on the #[code TokenC] struct.
+
++aside-code("Example").
+    lex = doc.c[3].lex
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code flags]
+        +cell #[+abbr("uint64_t") #[code flags_t]]
+        +cell Bit-field for binary lexical flag values.
+
+    +row
+        +cell #[code id]
+        +cell #[+abbr("uint64_t") #[code attr_t]]
+        +cell
+            |  Usually used to map lexemes to rows in a matrix, e.g. for word
+            |  vectors. Does not need to be unique, so currently misnamed.
+
+    +row
+        +cell #[code length]
+        +cell #[+abbr("uint64_t") #[code attr_t]]
+        +cell Number of unicode characters in the lexeme.
+
+    +row
+        +cell #[code orth]
+        +cell #[+abbr("uint64_t") #[code attr_t]]
+        +cell ID of the verbatim text content.
+
+    +row
+        +cell #[code lower]
+        +cell #[+abbr("uint64_t") #[code attr_t]]
+        +cell ID of the lowercase form of the lexeme.
+
+    +row
+        +cell #[code norm]
+        +cell #[+abbr("uint64_t") #[code attr_t]]
+        +cell ID of the lexeme's norm, i.e. a normalised form of the text.
+
+    +row
+        +cell #[code shape]
+        +cell #[+abbr("uint64_t") #[code attr_t]]
+        +cell Transform of the lexeme's string, to show orthographic features.
+
+    +row
+        +cell #[code prefix]
+        +cell #[+abbr("uint64_t") #[code attr_t]]
+        +cell
+            |  Length-N substring from the start of the lexeme. Defaults to
+            |  #[code N=1].
+
+    +row
+        +cell #[code suffix]
+        +cell #[+abbr("uint64_t") #[code attr_t]]
+        +cell
+            |  Length-N substring from the end of the lexeme. Defaults to
+            |  #[code N=3].
+
+    +row
+        +cell #[code cluster]
+        +cell #[+abbr("uint64_t") #[code attr_t]]
+        +cell Brown cluster ID.
+
+    +row
+        +cell #[code prob]
+        +cell #[code float]
+        +cell Smoothed log probability estimate of the lexeme's type.
+
+    +row
+        +cell #[code sentiment]
+        +cell #[code float]
+        +cell A scalar value indicating positivity or negativity.
+
++h(3, "lexeme_get_struct_attr", "spacy/lexeme.pxd") Lexeme.get_struct_attr
+    +tag staticmethod
+    +tag nogil
+
+p Get the value of an attribute from the #[code LexemeC] struct by attribute ID.
+
++aside-code("Example").
+    from spacy.attrs cimport IS_ALPHA
+    from spacy.lexeme cimport Lexeme
+
+    lexeme = doc.c[3].lex
+    is_alpha = Lexeme.get_struct_attr(lexeme, IS_ALPHA)
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code lex]
+        +cell #[code const LexemeC*]
+        +cell A pointer to a #[code LexemeC] struct.
+
+    +row
+        +cell #[code feat_name]
+        +cell #[code attr_id_t]
+        +cell
+            |  The ID of the attribute to look up. The attributes are
+            |  enumerated in #[code spacy.attrs].
+
+    +row("foot")
+        +cell returns
+        +cell #[+abbr("uint64_t") #[code attr_t]]
+        +cell The value of the attribute.
+
++h(3, "lexeme_set_struct_attr", "spacy/lexeme.pxd") Lexeme.set_struct_attr
+    +tag staticmethod
+    +tag nogil
+
+p Set the value of an attribute of the #[code LexemeC] struct by attribute ID.
+
++aside-code("Example").
+    from spacy.attrs cimport NORM
+    from spacy.lexeme cimport Lexeme
+
+    lexeme = doc.c[3].lex
+    Lexeme.set_struct_attr(lexeme, NORM, lexeme.lower)
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code lex]
+        +cell #[code const LexemeC*]
+        +cell A pointer to a #[code LexemeC] struct.
+
+    +row
+        +cell #[code feat_name]
+        +cell #[code attr_id_t]
+        +cell
+            |  The ID of the attribute to set. The attributes are
+            |  enumerated in #[code spacy.attrs].
+
+    +row
+        +cell #[code value]
+        +cell #[+abbr("uint64_t") #[code attr_t]]
+        +cell The value to set.
+
++h(3, "lexeme_c_check_flag", "spacy/lexeme.pxd") Lexeme.c_check_flag
+    +tag staticmethod
+    +tag nogil
+
+p Check the value of a binary flag attribute.
+
++aside-code("Example").
+    from spacy.attrs cimport IS_STOP
+    from spacy.lexeme cimport Lexeme
+
+    lexeme = doc.c[3].lex
+    is_stop = Lexeme.c_check_flag(lexeme, IS_STOP)
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code lexeme]
+        +cell #[code const LexemeC*]
+        +cell A pointer to a #[code LexemeC] struct.
+
+    +row
+        +cell #[code flag_id]
+        +cell #[code attr_id_t]
+        +cell
+            |  The ID of the flag to look up. The flag IDs are enumerated in
+            |  #[code spacy.attrs].
+
+    +row("foot")
+        +cell returns
+        +cell #[code bint]
+        +cell The boolean value of the flag.
+
++h(3, "lexeme_c_set_flag", "spacy/lexeme.pxd") Lexeme.c_set_flag
+    +tag staticmethod
+    +tag nogil
+
+p Set the value of a binary flag attribute.
+
++aside-code("Example").
+    from spacy.attrs cimport IS_STOP
+    from spacy.lexeme cimport Lexeme
+
+    lexeme = doc.c[3].lex
+    Lexeme.c_set_flag(lexeme, IS_STOP, 0)
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code lexeme]
+        +cell #[code const LexemeC*]
+        +cell A pointer to a #[code LexemeC] struct.
+
+    +row
+        +cell #[code flag_id]
+        +cell #[code attr_id_t]
+        +cell
+            |  The ID of the flag to set. The flag IDs are enumerated in
+            |  #[code spacy.attrs].
+
+    +row
+        +cell #[code value]
+        +cell #[code bint]
+        +cell The value to set.
--- a/website/api/_cython/_span.jade
+++ b/website/api/_cython/_span.jade
@ -0,0 +1,43 @@
+//- 💫 DOCS > API > CYTHON > CLASSES > SPAN
+
+p
+    |  A Cython class providing access and methods for a slice of a #[code Doc]
+    |  object.
+
++infobox
+    |  This section documents the extra C-level attributes and methods that
+    |  can't be accessed from Python. For the Python documentation, see
+    |  #[+api("span") #[code Span]].
+
++h(3, "span_attributes") Attributes
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code doc]
+        +cell #[code Doc]
+        +cell The parent document.
+
+    +row
+        +cell #[code start]
+        +cell #[code int]
+        +cell The index of the first token of the span.
+
+    +row
+        +cell #[code end]
+        +cell #[code int]
+        +cell The index of the first token after the span.
+
+    +row
+        +cell #[code start_char]
+        +cell #[code int]
+        +cell The index of the first character of the span.
+
+    +row
+        +cell #[code end_char]
+        +cell #[code int]
+        +cell The index of the last character of the span.
+
+    +row
+        +cell #[code label]
+        +cell #[+abbr("uint64_t") #[code attr_t]]
+        +cell A label to attach to the span, e.g. for named entities.
--- a/website/api/_cython/_stringstore.jade
+++ b/website/api/_cython/_stringstore.jade
@ -0,0 +1,23 @@
+//- 💫 DOCS > API > CYTHON > CLASSES > STRINGSTORE
+
+p A lookup table to retrieve strings by 64-bit hashes.
+
++infobox
+    |  This section documents the extra C-level attributes and methods that
+    |  can't be accessed from Python. For the Python documentation, see
+    |  #[+api("stringstore") #[code StringStore]].
+
++h(3, "stringstore_attributes") Attributes
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code mem]
+        +cell #[code cymem.Pool]
+        +cell
+            |  A memory pool. Allocated memory will be freed once the
+            |  #[code StringStore] object is garbage collected.
+
+    +row
+        +cell #[code keys]
+        +cell #[+abbr("vector[uint64_t]") #[code vector[hash_t]]]
+        +cell A list of hash values in the #[code StringStore].
--- a/website/api/_cython/_token.jade
+++ b/website/api/_cython/_token.jade
@ -0,0 +1,73 @@
+//- 💫 DOCS > API > CYTHON > CLASSES > TOKEN
+
+p
+    |  A Cython class providing access and methods for a
+    |  #[+api("cython-structs#tokenc") #[code TokenC]] struct. Note that the
+    |  #[code Token] object does not own the struct. It only receives a pointer
+    |  to it.
+
++infobox
+    |  This section documents the extra C-level attributes and methods that
+    |  can't be accessed from Python. For the Python documentation, see
+    |  #[+api("token") #[code Token]].
+
++h(3, "token_attributes") Attributes
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code vocab]
+        +cell #[code Vocab]
+        +cell A reference to the shared #[code Vocab] object.
+
+    +row
+        +cell #[code c]
+        +cell #[code TokenC*]
+        +cell
+            |  A pointer to a #[+api("cython-structs#tokenc") #[code TokenC]]
+            |  struct.
+
+    +row
+        +cell #[code i]
+        +cell #[code int]
+        +cell The offset of the token within the document.
+
+    +row
+        +cell #[code doc]
+        +cell #[code Doc]
+        +cell The parent document.
+
++h(3, "token_cinit") Token.cinit
+    +tag method
+
+p Create a #[code Token] object from a #[code TokenC*] pointer.
+
++aside-code("Example").
+    token = Token.cinit(doc.vocab, &doc.c[3], 3, doc)
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code vocab]
+        +cell #[code Vocab]
+        +cell A reference to the shared #[code Vocab].
+
+    +row
+        +cell #[code c]
+        +cell #[code TokenC*]
+        +cell
+            |  A pointer to a #[+api("cython-structs#tokenc") #[code TokenC]]
+            |  struct.
+
+    +row
+        +cell #[code offset]
+        +cell #[code int]
+        +cell The offset of the token within the document.
+
+    +row
+        +cell #[code doc]
+        +cell #[code Doc]
+        +cell The parent document.
+
+    +row("foot")
+        +cell returns
+        +cell #[code Token]
+        +cell The newly constructed object.
--- a/website/api/_cython/_tokenc.jade
+++ b/website/api/_cython/_tokenc.jade
@ -0,0 +1,270 @@
+//- 💫 DOCS > API > CYTHON > STRUCTS > TOKENC
+
+p
+    |  Cython data container for the #[code Token] object.
+
++aside-code("Example").
+    token = &doc.c[3]
+    token_ptr = &doc.c[3]
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code lex]
+        +cell #[code const LexemeC*]
+        +cell A pointer to the lexeme for the token.
+
+    +row
+        +cell #[code morph]
+        +cell #[code uint64_t]
+        +cell An ID allowing lookup of morphological attributes.
+
+    +row
+        +cell #[code pos]
+        +cell #[code univ_pos_t]
+        +cell Coarse-grained part-of-speech tag.
+
+    +row
+        +cell #[code spacy]
+        +cell #[code bint]
+        +cell A binary value indicating whether the token has trailing whitespace.
+
+    +row
+        +cell #[code tag]
+        +cell #[+abbr("uint64_t") #[code attr_t]]
+        +cell Fine-grained part-of-speech tag.
+
+    +row
+        +cell #[code idx]
+        +cell #[code int]
+        +cell The character offset of the token within the parent document.
+
+    +row
+        +cell #[code lemma]
+        +cell #[+abbr("uint64_t") #[code attr_t]]
+        +cell Base form of the token, with no inflectional suffixes.
+
+    +row
+        +cell #[code sense]
+        +cell #[+abbr("uint64_t") #[code attr_t]]
+        +cell Space for storing a word sense ID, currently unused.
+
+    +row
+        +cell #[code head]
+        +cell #[code int]
+        +cell Offset of the syntactic parent relative to the token.
+
+    +row
+        +cell #[code dep]
+        +cell #[+abbr("uint64_t") #[code attr_t]]
+        +cell Syntactic dependency relation.
+
+    +row
+        +cell #[code l_kids]
+        +cell #[code uint32_t]
+        +cell Number of left children.
+
+    +row
+        +cell #[code r_kids]
+        +cell #[code uint32_t]
+        +cell Number of right children.
+
+    +row
+        +cell #[code l_edge]
+        +cell #[code uint32_t]
+        +cell Offset of the leftmost token of this token's syntactic descendants.
+
+    +row
+        +cell #[code r_edge]
+        +cell #[code uint32_t]
+        +cell Offset of the rightmost token of this token's syntactic descendants.
+
+    +row
+        +cell #[code sent_start]
+        +cell #[code int]
+        +cell
+            |  Ternary value indicating whether the token is the first word of
+            |  a sentence. #[code 0] indicates a missing value, #[code -1]
+            |  indicates #[code False] and #[code 1] indicates #[code True]. The default value, 0,
+            |  is interpreted as no sentence break. Sentence boundary detectors will usually
+            |  set 0 for all tokens except tokens that follow a sentence boundary.
+
+    +row
+        +cell #[code ent_iob]
+        +cell #[code int]
+        +cell
+            |  IOB code of named entity tag. #[code 0] indicates a missing
+            |  value, #[code 1] indicates #[code I], #[code 2] indicates
+            |  #[code O] and #[code 3] indicates #[code B].
+
+    +row
+        +cell #[code ent_type]
+        +cell #[+abbr("uint64_t") #[code attr_t]]
+        +cell Named entity type.
+
+    +row
+        +cell #[code ent_id]
+        +cell #[+abbr("uint64_t") #[code hash_t]]
+        +cell
+            |  ID of the entity the token is an instance of, if any. Currently
+            |  not used, but potentially useful for coreference resolution.
+
++h(3, "token_get_struct_attr", "spacy/tokens/token.pxd") Token.get_struct_attr
+    +tag staticmethod
+    +tag nogil
+
+p Get the value of an attribute from the #[code TokenC] struct by attribute ID.
+
++aside-code("Example").
+    from spacy.attrs cimport IS_ALPHA
+    from spacy.tokens cimport Token
+
+    is_alpha = Token.get_struct_attr(&doc.c[3], IS_ALPHA)
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code token]
+        +cell #[code const TokenC*]
+        +cell A pointer to a #[code TokenC] struct.
+
+    +row
+        +cell #[code feat_name]
+        +cell #[code attr_id_t]
+        +cell
+            |  The ID of the attribute to look up. The attributes are
+            |  enumerated in #[code spacy.attrs].
+
+    +row("foot")
+        +cell returns
+        +cell #[+abbr("uint64_t") #[code attr_t]]
+        +cell The value of the attribute.
+
++h(3, "token_set_struct_attr", "spacy/tokens/token.pxd") Token.set_struct_attr
+    +tag staticmethod
+    +tag nogil
+
+p Set the value of an attribute of the #[code TokenC] struct by attribute ID.
+
++aside-code("Example").
+    from spacy.attrs cimport TAG
+    from spacy.tokens cimport Token
+
+    token = &doc.c[3]
+    Token.set_struct_attr(token, TAG, 0)
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code token]
+        +cell #[code const TokenC*]
+        +cell A pointer to a #[code TokenC] struct.
+
+    +row
+        +cell #[code feat_name]
+        +cell #[code attr_id_t]
+        +cell
+            |  The ID of the attribute to set. The attributes are
+            |  enumerated in #[code spacy.attrs].
+
+    +row
+        +cell #[code value]
+        +cell #[+abbr("uint64_t") #[code attr_t]]
+        +cell The value to set.
+
++h(3, "token_by_start", "spacy/tokens/doc.pxd") token_by_start
+    +tag function
+
+p Find a token in a #[code TokenC*] array by the offset of its first character.
+
++aside-code("Example").
+    from spacy.tokens.doc cimport Doc, token_by_start
+    from spacy.vocab cimport Vocab
+
+    doc = Doc(Vocab(), words=[u'hello', u'world'])
+    assert token_by_start(doc.c, doc.length, 6) == 1
+    assert token_by_start(doc.c, doc.length, 4) == -1
+
++table(["Name", "Type", "Description"])
+    +row
+        +cell #[code tokens]
+        +cell #[code const TokenC*]
+        +cell A #[code TokenC*] array.
+
+    +row
+        +cell #[code length]
+        +cell #[code int]
+        +cell The number of tokens in the array.
+
+    +row
+        +cell #[code start_char]
+        +cell #[code int]
+        +cell The start index to search for.
+
+    +row("foot")
+        +cell returns
+        +cell #[code int]
+        +cell The index of the token in the array or #[code -1] if not found.
+
+h(3, "token_by_end", "spacy/tokens/doc.pxd") token_by_end
+    +tag function
+
+p Find a token in a #[code TokenC*] array by the character offset at which it ends (one past its final character).
+
+aside-code("Example").
+    from spacy.tokens.doc cimport Doc, token_by_end
+    from spacy.vocab cimport Vocab
+
+    doc = Doc(Vocab(), words=[u'hello', u'world'])
+    assert token_by_end(doc.c, doc.length, 5) == 0
+    assert token_by_end(doc.c, doc.length, 1) == -1
+
+table(["Name", "Type", "Description"])
+    +row
+        +cell #[code tokens]
+        +cell #[code const TokenC*]
+        +cell A #[code TokenC*] array.
+
+    +row
+        +cell #[code length]
+        +cell #[code int]
+        +cell The number of tokens in the array.
+
+    +row
+        +cell #[code end_char]
+        +cell #[code int]
+        +cell The end index to search for.
+
+    +row("foot")
+        +cell returns
+        +cell #[code int]
+        +cell The index of the token in the array or #[code -1] if not found.
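+
+p
+    |  As a rough illustration of the lookup semantics documented for
+    |  #[code token_by_start] and #[code token_by_end], here is a pure-Python
+    |  sketch. It is not spaCy's implementation: the #[code by_start] and
+    |  #[code by_end] helpers and the #[code (start, length)] pairs are
+    |  hypothetical stand-ins for the #[code TokenC*] array. The sketch assumes
+    |  that only exact offsets match and that an end offset equals the start
+    |  offset plus the token length, as the examples above suggest.
+

```python
# Hypothetical stand-in for a TokenC* array: one (start_char, length) pair per
# token. For "hello world", "hello" starts at 0 and "world" starts at 6.
TOKENS = [(0, 5), (6, 5)]


def by_start(tokens, start_char):
    """Return the index of the token starting exactly at start_char, else -1."""
    for i, (start, _length) in enumerate(tokens):
        if start == start_char:
            return i
    return -1


def by_end(tokens, end_char):
    """Return the index of the token ending exactly at end_char, else -1.

    The end offset is assumed to be exclusive, i.e. start + length.
    """
    for i, (start, length) in enumerate(tokens):
        if start + length == end_char:
            return i
    return -1


print(by_start(TOKENS, 6), by_start(TOKENS, 4))
print(by_end(TOKENS, 5), by_end(TOKENS, 1))
```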
+
+h(3, "set_children_from_heads", "spacy/tokens/doc.pxd") set_children_from_heads
+    +tag function
+
+p
+    |  Set attributes that allow lookup of syntactic children on a
+    |  #[code TokenC*] array. This function must be called after making changes
+    |  to the #[code TokenC.head] attribute, in order to make the parse tree
+    |  navigation consistent.
+
+aside-code("Example").
+    from spacy.tokens.doc cimport Doc, set_children_from_heads
+    from spacy.vocab cimport Vocab
+
+    doc = Doc(Vocab(), words=[u'Baileys', u'from', u'a', u'shoe'])
+    doc.c[0].head = 0
+    doc.c[1].head = 0
+    doc.c[2].head = 3
+    doc.c[3].head = 1
+    set_children_from_heads(doc.c, doc.length)
+    assert doc.c[3].l_kids == 1
+
+table(["Name", "Type", "Description"])
+    +row
+        +cell #[code tokens]
+        +cell #[code const TokenC*]
+        +cell A #[code TokenC*] array.
+
+    +row
+        +cell #[code length]
+        +cell #[code int]
+        +cell The number of tokens in the array.
--- a/website/api/_cython/_vocab.jade
+++ b/website/api/_cython/_vocab.jade
@ -0,0 +1,88 @@
+//- 💫 DOCS > API > CYTHON > CLASSES > VOCAB
+
+p
+    |  A Cython class providing access and methods for a vocabulary and other
+    |  data shared across a language.
+
+infobox
+    |  This section documents the extra C-level attributes and methods that
+    |  can't be accessed from Python. For the Python documentation, see
+    |  #[+api("vocab") #[code Vocab]].
+
+h(3, "vocab_attributes") Attributes
+
+table(["Name", "Type", "Description"])
+    +row
+        +cell #[code mem]
+        +cell #[code cymem.Pool]
+        +cell
+            |  A memory pool. Allocated memory will be freed once the
+            |  #[code Vocab] object is garbage collected.
+
+    +row
+        +cell #[code strings]
+        +cell #[code StringStore]
+        +cell
+            |  A #[code StringStore] that maps strings to hash values and vice
+            |  versa.
+
+    +row
+        +cell #[code length]
+        +cell #[code int]
+        +cell The number of entries in the vocabulary.
+
+h(3, "vocab_get") Vocab.get
+    +tag method
+
+p
+    |  Retrieve a #[+api("cython-structs#lexemec") #[code LexemeC*]] pointer
+    |  from the vocabulary.
+
+aside-code("Example").
+    lexeme = vocab.get(vocab.mem, u'hello')
+
+table(["Name", "Type", "Description"])
+    +row
+        +cell #[code mem]
+        +cell #[code cymem.Pool]
+        +cell
+            |  A memory pool. Allocated memory will be freed once the
+            |  #[code Vocab] object is garbage collected.
+
+    +row
+        +cell #[code string]
+        +cell #[code unicode]
+        +cell The string of the word to look up.
+
+    +row("foot")
+        +cell returns
+        +cell #[code const LexemeC*]
+        +cell The lexeme in the vocabulary.
+
+h(3, "vocab_get_by_orth") Vocab.get_by_orth
+    +tag method
+
+p
+    |  Retrieve a #[+api("cython-structs#lexemec") #[code LexemeC*]] pointer
+    |  from the vocabulary by the ID of its verbatim text content.
+
+aside-code("Example").
+    lexeme = vocab.get_by_orth(vocab.mem, doc[0].lex.norm)
+
+table(["Name", "Type", "Description"])
+    +row
+        +cell #[code mem]
+        +cell #[code cymem.Pool]
+        +cell
+            |  A memory pool. Allocated memory will be freed once the
+            |  #[code Vocab] object is garbage collected.
+
+    +row
+        +cell #[code orth]
+        +cell #[+abbr("uint64_t") #[code attr_t]]
+        +cell ID of the verbatim text content.
+
+    +row("foot")
+        +cell returns
+        +cell #[code const LexemeC*]
+        +cell The lexeme in the vocabulary.
--- a/website/api/_data.json
+++ b/website/api/_data.json
@ -33,6 +33,12 @@
            "Vectors": "vectors",
            "GoldParse": "goldparse",
            "GoldCorpus": "goldcorpus"
+        },
+
+        "Cython": {
+            "Architecture": "cython",
+            "Structs": "cython-structs",
+            "Classes": "cython-classes"
        }
    },

@ -41,8 +47,7 @@
        "next": "annotation",
        "menu": {
            "Basics": "basics",
-            "Neural Network Model": "nn-model",
-            "Cython Conventions": "cython"
+            "Neural Network Model": "nn-model"
        }
    },

@ -211,5 +216,36 @@
            "Named Entities": "named-entities",
            "Models & Training": "training"
        }
+    },
+
+    "cython": {
+        "title": "Cython Architecture",
+        "next": "cython-structs",
+        "menu": {
+            "Overview": "overview",
+            "Conventions": "conventions"
+        }
+    },
+
+    "cython-structs": {
+        "title": "Cython Structs",
+        "teaser": "C-language objects that let you group variables together in a single contiguous block.",
+        "next": "cython-classes",
+        "menu": {
+            "TokenC": "tokenc",
+            "LexemeC": "lexemec"
+        }
+    },
+
+    "cython-classes": {
+        "title": "Cython Classes",
+        "menu": {
+            "Doc": "doc",
+            "Token": "token",
+            "Span": "span",
+            "Lexeme": "lexeme",
+            "Vocab": "vocab",
+            "StringStore": "stringstore"
+        }
    }
 }
--- a/website/api/cli.jade
+++ b/website/api/cli.jade
@ -280,7 +280,7 @@ p
    +row
        +cell #[code --n-iter], #[code -n]
        +cell option
-        +cell Number of iterations (default: #[code 20]).
+        +cell Number of iterations (default: #[code 30]).

    +row
        +cell #[code --n-sents], #[code -ns]
--- a/website/api/cython-classes.jade
+++ b/website/api/cython-classes.jade
@ -0,0 +1,39 @@
+//- 💫 DOCS > API > CYTHON > CLASSES
+
+include ../_includes/_mixins
+
+section("doc")
+    +h(2, "doc", "spacy/tokens/doc.pxd") Doc
+        +tag cdef class
+
+    include _cython/_doc
+
+section("token")
+    +h(2, "token", "spacy/tokens/token.pxd") Token
+        +tag cdef class
+
+    include _cython/_token
+
+section("span")
+    +h(2, "span", "spacy/tokens/span.pxd") Span
+        +tag cdef class
+
+    include _cython/_span
+
+section("lexeme")
+    +h(2, "lexeme", "spacy/lexeme.pxd") Lexeme
+        +tag cdef class
+
+    include _cython/_lexeme
+
+section("vocab")
+    +h(2, "vocab", "spacy/vocab.pxd") Vocab
+        +tag cdef class
+
+    include _cython/_vocab
+
+section("stringstore")
+    +h(2, "stringstore", "spacy/strings.pxd") StringStore
+        +tag cdef class
+
+    include _cython/_stringstore
--- a/website/api/cython-structs.jade
+++ b/website/api/cython-structs.jade
@ -0,0 +1,15 @@
+//- 💫 DOCS > API > CYTHON > STRUCTS
+
+include ../_includes/_mixins
+
+section("tokenc")
+    +h(2, "tokenc", "spacy/structs.pxd") TokenC
+        +tag C struct
+
+    include _cython/_tokenc
+
+section("lexemec")
+    +h(2, "lexemec", "spacy/structs.pxd") LexemeC
+        +tag C struct
+
+    include _cython/_lexemec
--- a/website/api/cython.jade
+++ b/website/api/cython.jade
@ -0,0 +1,176 @@
+//- 💫 DOCS > API > CYTHON > ARCHITECTURE
+
+include ../_includes/_mixins
+
+section("overview")
+    +aside("What's Cython?")
+        |  #[+a("http://cython.org/") Cython] is a language for writing
+        |  C extensions for Python. Most Python code is also valid Cython, but
+        |  you can add type declarations to get efficient memory-managed code
+        |  just like C or C++.
+
+    p
+        |  This section documents spaCy's C-level data structures and
+        |  interfaces, intended for use from Cython. Some of the attributes are
+        |  primarily for internal use, and all C-level functions and methods are
+        |  designed for speed over safety – if you make a mistake and access an
+        |  array out-of-bounds, the program may crash abruptly.
+
+    p
+        |  With Cython there are four ways of declaring complex data types.
+        |  Unfortunately we use all four in different places, as they all have
+        |  different utility:
+
+    +table(["Declaration", "Description", "Example"])
+        +row
+            +cell #[code class]
+            +cell A normal Python class.
+            +cell #[+api("language") #[code Language]]
+
+        +row
+            +cell #[code cdef class]
+            +cell
+                |  A Python extension type. Differs from a normal Python class
+                |  in that its attributes can be defined on the underlying
+                |  struct. Can have C-level objects as attributes (notably
+                |  structs and pointers), and can have methods which have
+                |  C-level objects as arguments or return types.
+            +cell #[+api("cython-classes#lexeme") #[code Lexeme]]
+
+        +row
+            +cell #[code cdef struct]
+            +cell
+                |  A struct is just a collection of variables, sort of like a
+                |  named tuple, except the memory is contiguous. Structs can't
+                |  have methods, only attributes.
+            +cell #[+api("cython-structs#lexemec") #[code LexemeC]]
+
+        +row
+            +cell #[code cdef cppclass]
+            +cell
+                |  A C++ class. Like a struct, this can be allocated on the
+                |  stack, but can have methods, a constructor and a destructor.
+                |  Differs from #[code cdef class] in that it can be created and
+                |  destroyed without acquiring the Python global interpreter
+                |  lock. This style is the most obscure.
+            +cell #[+src(gh("spacy", "spacy/syntax/_state.pxd")) #[code StateC]]
+
+    p
+        |  The most important classes in spaCy are defined as #[code cdef class]
+        |  objects. The underlying data for these objects is usually gathered
+        |  into a struct, which is usually named #[code c]. For instance, the
+        |  #[+api("cython-classes#lexeme") #[code Lexeme]] class holds a
+        |  #[+api("cython-structs#lexemec") #[code LexemeC]] struct, at
+        |  #[code Lexeme.c]. This lets you shed the Python container, and pass
+        |  a pointer to the underlying data into C-level functions.
+
+section("conventions")
+    +h(2, "conventions") Conventions
+
+    p
+        |  spaCy's core data structures are implemented as
+        |  #[+a("http://cython.org/") Cython] #[code cdef] classes. Memory is
+        |  managed through the #[+a(gh("cymem")) #[code cymem]]
+        |  #[code cymem.Pool] class, which allows you
+        |  to allocate memory which will be freed when the #[code Pool] object
+        |  is garbage collected. This means you usually don't have to worry
+        |  about freeing memory. You just have to decide which Python object
+        |  owns the memory, and make it own the #[code Pool]. When that object
+        |  goes out of scope, the memory will be freed. You do have to take
+        |  care that no pointers outlive the object that owns them — but this
+        |  is generally quite easy.
+
+    p
+        |  All Cython modules should have the #[code # cython: infer_types=True]
+        |  compiler directive at the top of the file. This makes the code much
+        |  cleaner, as it avoids the need for many type declarations. If
+        |  possible, you should prefer to declare your functions #[code nogil],
+        |  even if you don't especially care about multi-threading. The reason
+        |  is that #[code nogil] functions help the Cython compiler reason about
+        |  your code quite a lot — you're telling the compiler that no Python
+        |  dynamics are possible. This lets the compiler raise many more errors, and ensures
+        |  your function will run at C speed.
+
+
+    p
+        |  Cython gives you many choices of sequences: you could have a Python
+        |  list, a numpy array, a memory view, a C++ vector, or a pointer.
+        |  Pointers are preferred, because they are fastest, have the most
+        |  explicit semantics, and let the compiler check your code more
+        |  strictly. C++ vectors are also great — but you should only use them
+        |  internally in functions. It's less friendly to accept a vector as an
+        |  argument, because that asks the user to do much more work. Here's
+        |  how to get a pointer from a numpy array, memory view or vector:
+
+    +code.
+        cdef void get_pointers(np.ndarray[int, mode='c'] numpy_array, vector[int] cpp_vector, int[::1] memory_view) nogil:
+            pointer1 = &lt;int*&gt;numpy_array.data
+            pointer2 = cpp_vector.data()
+            pointer3 = &memory_view[0]
+
+    p
+        |  Both C arrays and C++ vectors reassure the compiler that no Python
+        |  operations are possible on your variable. This is a big advantage:
+        |  it lets the Cython compiler raise many more errors for you.
+
+    p
+        |  When getting a pointer from a numpy array or memoryview, take care
+        |  that the data is actually stored in C-contiguous order — otherwise
+        |  you'll get a pointer to nonsense. The type-declarations in the code
+        |  above should generate runtime errors if buffers with incorrect
+        |  memory layouts are passed in. To iterate over the array, the
+        |  following style is preferred:
+
+    +code.
+        cdef int c_total(const int* int_array, int length) nogil:
+            total = 0
+            for item in int_array[:length]:
+                total += item
+            return total
+
+    p
+        |  If this is confusing, consider that the compiler couldn't deal with
+        |  #[code for item in int_array:] — there's no length attached to a raw
+        |  pointer, so how could we figure out where to stop? The length is
+        |  provided in the slice notation as a solution to this. Note that we
+        |  don't have to declare the type of #[code item] in the code above —
+        |  the compiler can easily infer it. This gives us tidy code that looks
+        |  quite like Python, but is exactly as fast as C — because we've made
+        |  sure the compilation to C is trivial.
+
+    p
+        |  Your functions cannot be declared #[code nogil] if they need to
+        |  create Python objects or call Python functions. This is perfectly
+        |  okay — you shouldn't torture your code just to get #[code nogil]
+        |  functions. However, if your function isn't #[code nogil], you should
+        |  compile your module with #[code cython -a --cplus my_module.pyx] and
+        |  open the resulting #[code my_module.html] file in a browser. This
+        |  will let you see how Cython is compiling your code. Calls into the
+        |  Python run-time will be in bright yellow. This lets you easily see
+        |  whether Cython is able to correctly type your code, or whether there
+        |  are unexpected problems.
+
+    p
+        |  Working in Cython is very rewarding once you're over the initial
+        |  learning curve. As with C and C++, the first way you write something
+        |  in Cython will often be the performance-optimal approach. In
+        |  contrast, Python optimisation generally requires a lot of
+        |  experimentation. Is it faster to have an #[code if item in my_dict]
+        |  check, or to use #[code .get()]? What about
+        |  #[code try]/#[code except]? Does this numpy operation create a copy?
+        |  There's no way to guess the answers to these questions, and you'll
+        |  usually be dissatisfied with your results — so there's no way to
+        |  know when to stop this process. In the worst case, you'll make a
+        |  mess that invites the next reader to try their luck too. This is
+        |  like one of those
+        |  #[+a("http://www.wemjournal.org/article/S1080-6032%2809%2970088-2/abstract") volcanic gas-traps],
+        |  where the rescuers keep passing out from low oxygen, causing
+        |  another rescuer to follow — only to succumb themselves. In short,
+        |  just say no to optimizing your Python. If it's not fast enough the
+        |  first time, just switch to Cython.
+
+    +infobox("Resources")
+        +list.o-no-block
+            +item #[+a("http://docs.cython.org/en/latest/") Official Cython documentation] (cython.org)
+            +item #[+a("https://explosion.ai/blog/writing-c-in-cython", true) Writing C in Cython] (explosion.ai)
+            +item #[+a("https://explosion.ai/blog/multithreading-with-cython") Multi-threading spaCy’s parser and named entity recogniser] (explosion.ai)
--- a/website/api/index.jade
+++ b/website/api/index.jade
@ -7,8 +7,151 @@ include ../_includes/_mixins

 +section("nn-model")
    +h(2, "nn-model") Neural network model architecture
-    include _architecture/_nn-model

-+section("cython")
-    +h(2, "cython") Cython conventions
-    include _architecture/_cython
+    p
+        |  spaCy's statistical models have been custom-designed to give a
+        |  high-performance mix of speed and accuracy. The current architecture
+        |  hasn't been published yet, but in the meantime we prepared a video that
+        |  explains how the models work, with particular focus on NER.
+
+    +youtube("sqDHBH9IjRU")
+
+    p
+        |  The parsing model is a blend of recent results. The two recent
+        |  inspirations have been the work of Eliyahu Kiperwasser and Yoav Goldberg at
+        |  Bar-Ilan#[+fn(1)], and the SyntaxNet team from Google. The foundation of
+        |  the parser is still based on the work of Joakim Nivre#[+fn(2)], who
+        |  introduced the transition-based framework#[+fn(3)], the arc-eager
+        |  transition system, and the imitation learning objective. The model is
+        |  implemented using #[+a(gh("thinc")) Thinc], spaCy's machine learning
+        |  library. We first predict context-sensitive vectors for each word in the
+        |  input:
+
+    +code.
+        (embed_lower | embed_prefix | embed_suffix | embed_shape)
+            &gt;&gt; Maxout(token_width)
+            &gt;&gt; convolution ** 4
+
+    p
+        |  This convolutional layer is shared between the tagger, parser and NER,
+        |  and will also be shared by the future neural lemmatizer. Because the
+        |  parser shares these layers with the tagger, the parser does not require
+        |  tag features. I got this trick from David Weiss's "Stack-propagation"
+        |  paper#[+fn(4)].
+
+    p
+        |  To boost the representation, the tagger actually predicts a "super tag"
+        |  with POS, morphology and dependency label#[+fn(5)]. The tagger predicts
+        |  these supertags by adding a softmax layer onto the convolutional layer –
+        |  so, we're teaching the convolutional layer to give us a representation
+        |  that's one affine transform from this informative lexical information.
+        |  This is obviously good for the parser (which backprops to the
+        |  convolutions too). The parser model makes a state vector by concatenating
+        |  the vector representations for its context tokens. The current context
+        |  tokens are:
+
+    +table
+        +row
+            +cell #[code S0], #[code S1], #[code S2]
+            +cell Top three words on the stack.
+
+        +row
+            +cell #[code B0], #[code B1]
+            +cell First two words of the buffer.
+
+        +row
+            +cell
+                |  #[code S0L1], #[code S1L1], #[code S2L1], #[code B0L1],
+                |  #[code B1L1]#[br]
+                |  #[code S0L2], #[code S1L2], #[code S2L2], #[code B0L2],
+                |  #[code B1L2]
+            +cell
+                |  Leftmost and second leftmost children of #[code S0], #[code S1],
+                |  #[code S2], #[code B0] and #[code B1].
+
+        +row
+            +cell
+                |  #[code S0R1], #[code S1R1], #[code S2R1], #[code B0R1],
+                |  #[code B1R1]#[br]
+                |  #[code S0R2], #[code S1R2], #[code S2R2], #[code B0R2],
+                |  #[code B1R2]
+            +cell
+                |  Rightmost and second rightmost children of #[code S0], #[code S1],
+                |  #[code S2], #[code B0] and #[code B1].
+
+    p
+        |  This makes the state vector quite long: #[code 13*T], where #[code T] is
+        |  the token vector width (128 is working well). Fortunately, there's a way
+        |  to structure the computation to save some expense (and make it more
+        |  GPU-friendly).
+
+    p
+        |  The parser typically visits #[code 2*N] states for a sentence of length
+        |  #[code N] (although it may visit more, if it back-tracks with a
+        |  non-monotonic transition#[+fn(6)]). A naive implementation would require
+        |  #[code 2*N (B, 13*T) @ (13*T, H)] matrix multiplications for a batch of
+        |  size #[code B]. We can instead perform one #[code (B*N, T) @ (T, 13*H)]
+        |  multiplication, to pre-compute the hidden weights for each positional
+        |  feature with respect to the words in the batch. (Note that our token
+        |  vectors come from the CNN — so we can't play this trick over the
+        |  vocabulary. That's how Stanford's NN parser#[+fn(7)] works, and why its
+        |  model is so big.)
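+
+    p
+        |  The equivalence behind this pre-computation can be checked with a
+        |  small pure-Python sketch. The helper names (#[code matvec],
+        |  #[code naive_hidden], #[code precompute], #[code fast_hidden]) and the
+        |  toy sizes are assumptions made for illustration and are not part of
+        |  spaCy or Thinc. The sketch only shows that multiplying the
+        |  concatenated state by the #[code (13*T, H)] matrix gives the same
+        |  hidden vector as summing per-word, per-slot products computed once up
+        |  front.
+

```python
import random

# Toy sizes, chosen only for illustration (the text mentions T = 128).
T = 4    # token vector width
H = 3    # hidden layer width
F = 13   # number of context-token slots in the state
N = 6    # number of words in the sentence


def matvec(vec, rows):
    """Multiply a vector of length K by a matrix given as K rows of equal length."""
    width = len(rows[0])
    return [sum(v * row[h] for v, row in zip(vec, rows)) for h in range(width)]


def naive_hidden(token_vectors, W, context):
    """Concatenate the F context vectors (length F*T) and multiply by W (F*T x H)."""
    state = [x for i in context for x in token_vectors[i]]
    return matvec(state, W)


def precompute(token_vectors, W):
    """For every word and every slot f, multiply once by slot f's T x H block of W."""
    return [[matvec(vec, W[f * T:(f + 1) * T]) for f in range(F)]
            for vec in token_vectors]


def fast_hidden(cache, context):
    """Sum the cached rows picked out by the state's context tokens."""
    hidden = [0.0] * H
    for f, i in enumerate(context):
        for h in range(H):
            hidden[h] += cache[i][f][h]
    return hidden


rng = random.Random(0)
token_vectors = [[rng.uniform(-1, 1) for _ in range(T)] for _ in range(N)]
W = [[rng.uniform(-1, 1) for _ in range(H)] for _ in range(F * T)]
context = [rng.randrange(N) for _ in range(F)]  # which word fills each slot

cache = precompute(token_vectors, W)  # done once per sentence
slow = naive_hidden(token_vectors, W, context)
fast = fast_hidden(cache, context)
print(max(abs(a - b) for a, b in zip(slow, fast)))  # differs only by float rounding
```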
+
+    p
+        |  This pre-computation strategy allows a nice compromise between
+        |  GPU-friendliness and implementation simplicity. The CNN and the wide
+        |  lower layer are computed on the GPU, and then the precomputed hidden
+        |  weights are moved to the CPU, before we start the transition-based
+        |  parsing process. This makes a lot of things much easier. We don't have to
+        |  worry about variable-length batch sizes, and we don't have to implement
+        |  the dynamic oracle in CUDA to train.
+
+    p
+        |  Currently the parser's loss function is multilabel log loss#[+fn(8)], as
+        |  the dynamic oracle allows multiple states to be 0 cost. This is defined
+        |  as follows, where #[code Z] is the sum of the exponentiated scores over
+        |  all classes and #[code gZ] is the sum of the exponentiated scores
+        |  assigned to gold classes:
+
+    +code.
+        (exp(score) / Z) - (exp(score) / gZ)
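+
+    p
+        |  One way to read the expression above is as the per-class gradient of
+        |  #[code -log(gZ / Z)]. The sketch below works under that reading and is
+        |  not taken from spaCy's source: it assumes #[code Z] sums
+        |  #[code exp(score)] over all classes, #[code gZ] sums it over the
+        |  zero-cost (gold) classes only, and the second term applies only to
+        |  gold classes. The function name #[code multilabel_log_loss] and the
+        |  example scores are hypothetical.
+

```python
import math


def multilabel_log_loss(scores, is_gold):
    """Return (loss, gradient) for one parser state.

    scores  -- one raw score per class (transition)
    is_gold -- one bool per class, True where the oracle cost is zero
    Assumes at least one class is gold, so gZ is non-zero.
    """
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    Z = sum(exps)                                           # all classes
    gZ = sum(e for e, gold in zip(exps, is_gold) if gold)   # gold classes only
    loss = -math.log(gZ / Z)
    # Gold classes: exp/Z - exp/gZ (never positive). Other classes: exp/Z.
    grad = [e / Z - (e / gZ if gold else 0.0) for e, gold in zip(exps, is_gold)]
    return loss, grad


loss, grad = multilabel_log_loss([2.0, 1.0, 0.5, -1.0], [True, True, False, False])
print(loss, grad)
```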
+
+    +bibliography
+        +item
+            |  #[+a("https://www.semanticscholar.org/paper/Simple-and-Accurate-Dependency-Parsing-Using-Bidir-Kiperwasser-Goldberg/3cf31ecb2724b5088783d7c96a5fc0d5604cbf41") Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations]
+            br
+            |  Eliyahu Kiperwasser, Yoav Goldberg. (2016)
+
+        +item
+            |  #[+a("https://www.semanticscholar.org/paper/A-Dynamic-Oracle-for-Arc-Eager-Dependency-Parsing-Goldberg-Nivre/22697256ec19ecc3e14fcfc63624a44cf9c22df4") A Dynamic Oracle for Arc-Eager Dependency Parsing]
+            br
+            |  Yoav Goldberg, Joakim Nivre (2012)
+
+        +item
+            |  #[+a("https://explosion.ai/blog/parsing-english-in-python") Parsing English in 500 Lines of Python]
+            br
+            |  Matthew Honnibal (2013)
+
+        +item
+            |  #[+a("https://www.semanticscholar.org/paper/Stack-propagation-Improved-Representation-Learning-Zhang-Weiss/0c133f79b23e8c680891d2e49a66f0e3d37f1466") Stack-propagation: Improved Representation Learning for Syntax]
+            br
+            |  Yuan Zhang, David Weiss (2016)
+
+        +item
+            |  #[+a("https://www.semanticscholar.org/paper/Deep-multi-task-learning-with-low-level-tasks-supe-S%C3%B8gaard-Goldberg/03ad06583c9721855ccd82c3d969a01360218d86") Deep multi-task learning with low level tasks supervised at lower layers]
+            br
+            |  Anders Søgaard, Yoav Goldberg (2016)
+
+        +item
+            |  #[+a("https://www.semanticscholar.org/paper/An-Improved-Non-monotonic-Transition-System-for-De-Honnibal-Johnson/4094cee47ade13b77b5ab4d2e6cb9dd2b8a2917c") An Improved Non-monotonic Transition System for Dependency Parsing]
+            br
+            |  Matthew Honnibal, Mark Johnson (2015)
+
+        +item
+            |  #[+a("http://cs.stanford.edu/people/danqi/papers/emnlp2014.pdf") A Fast and Accurate Dependency Parser using Neural Networks]
+            br
+            |  Danqi Chen, Christopher D. Manning (2014)
+
+        +item
+            |  #[+a("https://www.semanticscholar.org/paper/Parsing-the-Wall-Street-Journal-using-a-Lexical-Fu-Riezler-King/0ad07862a91cd59b7eb5de38267e47725a62b8b2") Parsing the Wall Street Journal using a Lexical-Functional Grammar and Discriminative Estimation Techniques]
+            br
+            |  Stefan Riezler et al. (2002)
--- a/website/api/token.jade
+++ b/website/api/token.jade
@ -573,15 +573,15 @@ p The L2 norm of the token's vector representation.
        +cell #[code ent_id]
        +cell int
        +cell
-            |  ID of the entity the token is an instance of, if any. Usually
-            |  assigned by patterns in the Matcher.
+            |  ID of the entity the token is an instance of, if any. Currently
+            |  not used, but potentially useful for coreference resolution.

    +row
        +cell #[code ent_id_]
        +cell unicode
        +cell
-            |  ID of the entity the token is an instance of, if any. Usually
-            |  assigned by patterns in the Matcher.
+            |  ID of the entity the token is an instance of, if any. Currently
+            |  not used, but potentially useful for coreference resolution.

    +row
        +cell #[code lemma]
--- a/website/assets/css/_base/_objects.sass
+++ b/website/assets/css/_base/_objects.sass
@ -231,3 +231,19 @@
     border: none
     text-align-last: center
     width: 100%
+
+//- Abbreviations
+
+.o-abbr
+    +breakpoint(min, md)
+        cursor: help
+        border-bottom: 2px dotted $color-theme
+        padding-bottom: 3px
+
+    +breakpoint(max, sm)
+        &[data-tooltip]:before
+            content: none
+
+        &:after
+            content: " (" attr(aria-label) ")"
+            color: $color-subtle-dark
--- a/website/assets/js/main.js
+++ b/website/assets/js/main.js
@ -47,7 +47,10 @@ import initUniverse from './universe.vue.js';
 */
 {
    if (window.Juniper) {
-        new Juniper({ repo: 'ines/spacy-io-binder' });
+        new Juniper({
+            repo: 'ines/spacy-io-binder',
+            storageExpire: 60
+        });
    }
 }

@ -58,8 +61,13 @@ import initUniverse from './universe.vue.js';
    const sectionAttr = 'data-section';
    const navAttr = 'data-nav';
    const activeClass = 'is-active';
+    const sidebarAttr = 'data-sidebar-active';
    const sections = [...document.querySelectorAll(`[${navAttr}]`)];
+    const currentItem = document.querySelector(`[${sidebarAttr}]`);
    if (window.inView) {
+        if (currentItem && Element.prototype.scrollIntoView && !inView.is(currentItem)) {
+            currentItem.scrollIntoView();
+        }
        if (sections.length) {  // highlight first item regardless
            sections[0].classList.add(activeClass);
        }
@ -69,6 +77,9 @@ import initUniverse from './universe.vue.js';
            if (el) {
                sections.forEach(el => el.classList.remove(activeClass));
                el.classList.add(activeClass);
+                if (Element.prototype.scrollIntoView && !inView.is(el)) {
+                    el.scrollIntoView();
+                }
            }
        });
    }
--- a/website/assets/js/vendor/juniper.min.js
+++ b/website/assets/js/vendor/juniper.min.js
--- a/website/assets/js/vendor/prism.min.js
+++ b/website/assets/js/vendor/prism.min.js
@ -16,7 +16,7 @@ Prism.languages.json={property:/".*?"(?=\s*:)/gi,string:/"(?!:)(\\?[^"])*?"(?!:)
 !function(a){var e=/\\([^a-z()[\]]|[a-z\*]+)/i,n={"equation-command":{pattern:e,alias:"regex"}};a.languages.latex={comment:/%.*/m,cdata:{pattern:/(\\begin\{((?:verbatim|lstlisting)\*?)\})([\w\W]*?)(?=\\end\{\2\})/,lookbehind:!0},equation:[{pattern:/\$(?:\\?[\w\W])*?\$|\\\((?:\\?[\w\W])*?\\\)|\\\[(?:\\?[\w\W])*?\\\]/,inside:n,alias:"string"},{pattern:/(\\begin\{((?:equation|math|eqnarray|align|multline|gather)\*?)\})([\w\W]*?)(?=\\end\{\2\})/,lookbehind:!0,inside:n,alias:"string"}],keyword:{pattern:/(\\(?:begin|end|ref|cite|label|usepackage|documentclass)(?:\[[^\]]+\])?\{)[^}]+(?=\})/,lookbehind:!0},url:{pattern:/(\\url\{)[^}]+(?=\})/,lookbehind:!0},headline:{pattern:/(\\(?:part|chapter|section|subsection|frametitle|subsubsection|paragraph|subparagraph|subsubparagraph|subsubsubparagraph)\*?(?:\[[^\]]+\])?\{)[^}]+(?=\}(?:\[[^\]]+\])?)/,lookbehind:!0,alias:"class-name"},"function":{pattern:e,alias:"selector"},punctuation:/[[\]{}&]/}}(Prism);
 Prism.languages.makefile={comment:{pattern:/(^|[^\\])#(?:\\(?:\r\n|[\s\S])|.)*/,lookbehind:!0},string:/(["'])(?:\\(?:\r\n|[\s\S])|(?!\1)[^\\\r\n])*\1/,builtin:/\.[A-Z][^:#=\s]+(?=\s*:(?!=))/,symbol:{pattern:/^[^:=\r\n]+(?=\s*:(?!=))/m,inside:{variable:/\$+(?:[^(){}:#=\s]+|(?=[({]))/}},variable:/\$+(?:[^(){}:#=\s]+|\([@*%<^+?][DF]\)|(?=[({]))/,keyword:[/-include\b|\b(?:define|else|endef|endif|export|ifn?def|ifn?eq|include|override|private|sinclude|undefine|unexport|vpath)\b/,{pattern:/(\()(?:addsuffix|abspath|and|basename|call|dir|error|eval|file|filter(?:-out)?|findstring|firstword|flavor|foreach|guile|if|info|join|lastword|load|notdir|or|origin|patsubst|realpath|shell|sort|strip|subst|suffix|value|warning|wildcard|word(?:s|list)?)(?=[ \t])/,lookbehind:!0}],operator:/(?:::|[?:+!])?=|[|@]/,punctuation:/[:;(){}]/};
 Prism.languages.markdown=Prism.languages.extend("markup",{}),Prism.languages.insertBefore("markdown","prolog",{blockquote:{pattern:/^>(?:[\t ]*>)*/m,alias:"punctuation"},code:[{pattern:/^(?: {4}|\t).+/m,alias:"keyword"},{pattern:/``.+?``|`[^`\n]+`/,alias:"keyword"}],title:[{pattern:/\w+.*(?:\r?\n|\r)(?:==+|--+)/,alias:"important",inside:{punctuation:/==+$|--+$/}},{pattern:/(^\s*)#+.+/m,lookbehind:!0,alias:"important",inside:{punctuation:/^#+|#+$/}}],hr:{pattern:/(^\s*)([*-])([\t ]*\2){2,}(?=\s*$)/m,lookbehind:!0,alias:"punctuation"},list:{pattern:/(^\s*)(?:[*+-]|\d+\.)(?=[\t ].)/m,lookbehind:!0,alias:"punctuation"},"url-reference":{pattern:/!?\[[^\]]+\]:[\t ]+(?:\S+|<(?:\\.|[^>\\])+>)(?:[\t ]+(?:"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|\((?:\\.|[^)\\])*\)))?/,inside:{variable:{pattern:/^(!?\[)[^\]]+/,lookbehind:!0},string:/(?:"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|\((?:\\.|[^)\\])*\))$/,punctuation:/^[\[\]!:]|[<>]/},alias:"url"},bold:{pattern:/(^|[^\\])(\*\*|__)(?:(?:\r?\n|\r)(?!\r?\n|\r)|.)+?\2/,lookbehind:!0,inside:{punctuation:/^\*\*|^__|\*\*$|__$/}},italic:{pattern:/(^|[^\\])([*_])(?:(?:\r?\n|\r)(?!\r?\n|\r)|.)+?\2/,lookbehind:!0,inside:{punctuation:/^[*_]|[*_]$/}},url:{pattern:/!?\[[^\]]+\](?:\([^\s)]+(?:[\t ]+"(?:\\.|[^"\\])*")?\)| ?\[[^\]\n]*\])/,inside:{variable:{pattern:/(!?\[)[^\]]+(?=\]$)/,lookbehind:!0},string:{pattern:/"(?:\\.|[^"\\])*"(?=\)$)/}}}}),Prism.languages.markdown.bold.inside.url=Prism.util.clone(Prism.languages.markdown.url),Prism.languages.markdown.italic.inside.url=Prism.util.clone(Prism.languages.markdown.url),Prism.languages.markdown.bold.inside.italic=Prism.util.clone(Prism.languages.markdown.italic),Prism.languages.markdown.italic.inside.bold=Prism.util.clone(Prism.languages.markdown.bold);
-Prism.languages.python={"triple-quoted-string":{pattern:/"""[\s\S]+?"""|'''[\s\S]+?'''/,alias:"string"},comment:{pattern:/(^|[^\\])#.*/,lookbehind:!0},string:/("|')(?:\\?.)*?\1/,"function":{pattern:/((?:^|\s)def[ \t]+)[a-zA-Z_][a-zA-Z0-9_]*(?=\()/g,lookbehind:!0},"class-name":{pattern:/(\bclass\s+)[a-z0-9_]+/i,lookbehind:!0},keyword:/\b(?:as|assert|async|await|break|class|continue|def|del|elif|else|except|exec|finally|for|from|global|if|import|in|is|lambda|pass|print|raise|return|try|while|with|yield)\b/,"boolean":/\b(?:True|False|None)\b/,number:/\b-?(?:0[bo])?(?:(?:\d|0x[\da-f])[\da-f]*\.?\d*|\.\d+)(?:e[+-]?\d+)?j?\b/i,operator:/[-+%=]=?|!=|\*\*?=?|\/\/?=?|<[<=>]?|>[=>]?|[&|^~]|\b(?:or|and|not)\b/,punctuation:/[{}[\];(),.:]/,"constant":/\b[A-Z_]{2,}\b/};
+Prism.languages.python={"triple-quoted-string":{pattern:/"""[\s\S]+?"""|'''[\s\S]+?'''/,alias:"string"},comment:{pattern:/(^|[^\\])#.*/,lookbehind:!0},string:/("|')(?:\\?.)*?\1/,"function":{pattern:/((?:^|\s)def[ \t]+)[a-zA-Z_][a-zA-Z0-9_]*(?=\()/g,lookbehind:!0},"class-name":{pattern:/(\bclass\s+)[a-z0-9_]+/i,lookbehind:!0},keyword:/\b(?:as|assert|async|await|break|class|continue|def|del|elif|else|except|exec|finally|for|from|global|if|import|in|is|lambda|pass|print|raise|return|try|while|with|yield|cimport)\b/,"boolean":/\b(?:True|False|None)\b/,number:/\b-?(?:0[bo])?(?:(?:\d|0x[\da-f])[\da-f]*\.?\d*|\.\d+)(?:e[+-]?\d+)?j?\b/i,operator:/[-+%=]=?|!=|\*\*?=?|\/\/?=?|<[<=>]?|>[=>]?|[&|^~]|\b(?:or|and|not)\b/,punctuation:/[{}[\];(),.:]/,"constant":/\b[A-Z_]{2,}\b/};
 Prism.languages.rest={table:[{pattern:/(\s*)(?:\+[=-]+)+\+(?:\r?\n|\r)(?:\1(?:[+|].+)+[+|](?:\r?\n|\r))+\1(?:\+[=-]+)+\+/,lookbehind:!0,inside:{punctuation:/\||(?:\+[=-]+)+\+/}},{pattern:/(\s*)(?:=+ +)+=+((?:\r?\n|\r)\1.+)+(?:\r?\n|\r)\1(?:=+ +)+=+(?=(?:\r?\n|\r){2}|\s*$)/,lookbehind:!0,inside:{punctuation:/[=-]+/}}],"substitution-def":{pattern:/(^\s*\.\. )\|(?:[^|\s](?:[^|]*[^|\s])?)\| [^:]+::/m,lookbehind:!0,inside:{substitution:{pattern:/^\|(?:[^|\s]|[^|\s][^|]*[^|\s])\|/,alias:"attr-value",inside:{punctuation:/^\||\|$/}},directive:{pattern:/( +)[^:]+::/,lookbehind:!0,alias:"function",inside:{punctuation:/::$/}}}},"link-target":[{pattern:/(^\s*\.\. )\[[^\]]+\]/m,lookbehind:!0,alias:"string",inside:{punctuation:/^\[|\]$/}},{pattern:/(^\s*\.\. )_(?:`[^`]+`|(?:[^:\\]|\\.)+):/m,lookbehind:!0,alias:"string",inside:{punctuation:/^_|:$/}}],directive:{pattern:/(^\s*\.\. )[^:]+::/m,lookbehind:!0,alias:"function",inside:{punctuation:/::$/}},comment:{pattern:/(^\s*\.\.)(?:(?: .+)?(?:(?:\r?\n|\r).+)+| .+)(?=(?:\r?\n|\r){2}|$)/m,lookbehind:!0},title:[{pattern:/^(([!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~])\2+)(?:\r?\n|\r).+(?:\r?\n|\r)\1$/m,inside:{punctuation:/^[!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]+|[!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]+$/,important:/.+/}},{pattern:/(^|(?:\r?\n|\r){2}).+(?:\r?\n|\r)([!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~])\2+(?=\r?\n|\r|$)/,lookbehind:!0,inside:{punctuation:/[!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]+$/,important:/.+/}}],hr:{pattern:/((?:\r?\n|\r){2})([!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~])\2{3,}(?=(?:\r?\n|\r){2})/,lookbehind:!0,alias:"punctuation"},field:{pattern:/(^\s*):[^:\r\n]+:(?= )/m,lookbehind:!0,alias:"attr-name"},"command-line-option":{pattern:/(^\s*)(?:[+-][a-z\d]|(?:\-\-|\/)[a-z\d-]+)(?:[ =](?:[a-z][a-z\d_-]*|<[^<>]+>))?(?:, (?:[+-][a-z\d]|(?:\-\-|\/)[a-z\d-]+)(?:[ =](?:[a-z][a-z\d_-]*|<[^<>]+>))?)*(?=(?:\r?\n|\r)? 
{2,}\S)/im,lookbehind:!0,alias:"symbol"},"literal-block":{pattern:/::(?:\r?\n|\r){2}([ \t]+).+(?:(?:\r?\n|\r)\1.+)*/,inside:{"literal-block-punctuation":{pattern:/^::/,alias:"punctuation"}}},"quoted-literal-block":{pattern:/::(?:\r?\n|\r){2}([!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~]).*(?:(?:\r?\n|\r)\1.*)*/,inside:{"literal-block-punctuation":{pattern:/^(?:::|([!"#$%&'()*+,\-.\/:;<=>?@\[\\\]^_`{|}~])\1*)/m,alias:"punctuation"}}},"list-bullet":{pattern:/(^\s*)(?:[*+\-•‣⁃]|\(?(?:\d+|[a-z]|[ivxdclm]+)\)|(?:\d+|[a-z]|[ivxdclm]+)\.)(?= )/im,lookbehind:!0,alias:"punctuation"},"doctest-block":{pattern:/(^\s*)>>> .+(?:(?:\r?\n|\r).+)*/m,lookbehind:!0,inside:{punctuation:/^>>>/}},inline:[{pattern:/(^|[\s\-:\/'"<(\[{])(?::[^:]+:`.*?`|`.*?`:[^:]+:|(\*\*?|``?|\|)(?!\s).*?[^\s]\2(?=[\s\-.,:;!?\\\/'")\]}]|$))/m,lookbehind:!0,inside:{bold:{pattern:/(^\*\*).+(?=\*\*$)/,lookbehind:!0},italic:{pattern:/(^\*).+(?=\*$)/,lookbehind:!0},"inline-literal":{pattern:/(^``).+(?=``$)/,lookbehind:!0,alias:"symbol"},role:{pattern:/^:[^:]+:|:[^:]+:$/,alias:"function",inside:{punctuation:/^:|:$/}},"interpreted-text":{pattern:/(^`).+(?=`$)/,lookbehind:!0,alias:"attr-value"},substitution:{pattern:/(^\|).+(?=\|$)/,lookbehind:!0,alias:"attr-value"},punctuation:/\*\*?|``?|\|/}}],link:[{pattern:/\[[^\]]+\]_(?=[\s\-.,:;!?\\\/'")\]}]|$)/,alias:"string",inside:{punctuation:/^\[|\]_$/}},{pattern:/(?:\b[a-z\d](?:[_.:+]?[a-z\d]+)*_?_|`[^`]+`_?_|_`[^`]+`)(?=[\s\-.,:;!?\\\/'")\]}]|$)/i,alias:"string",inside:{punctuation:/^_?`|`$|`?_?_$/}}],punctuation:{pattern:/(^\s*)(?:\|(?= |$)|(?:---?|—|\.\.|__)(?= )|\.\.$)/m,lookbehind:!0}};
 !function(e){e.languages.sass=e.languages.extend("css",{comment:{pattern:/^([ \t]*)\/[\/*].*(?:(?:\r?\n|\r)\1[ \t]+.+)*/m,lookbehind:!0}}),e.languages.insertBefore("sass","atrule",{"atrule-line":{pattern:/^(?:[ \t]*)[@+=].+/m,inside:{atrule:/(?:@[\w-]+|[+=])/m}}}),delete e.languages.sass.atrule;var a=/((\$[-_\w]+)|(#\{\$[-_\w]+\}))/i,t=[/[+*\/%]|[=!]=|<=?|>=?|\b(?:and|or|not)\b/,{pattern:/(\s+)-(?=\s)/,lookbehind:!0}];e.languages.insertBefore("sass","property",{"variable-line":{pattern:/^[ \t]*\$.+/m,inside:{punctuation:/:/,variable:a,operator:t}},"property-line":{pattern:/^[ \t]*(?:[^:\s]+ *:.*|:[^:\s]+.*)/m,inside:{property:[/[^:\s]+(?=\s*:)/,{pattern:/(:)[^:\s]+/,lookbehind:!0}],punctuation:/:/,variable:a,operator:t,important:e.languages.sass.important}}}),delete e.languages.sass.property,delete e.languages.sass.important,delete e.languages.sass.selector,e.languages.insertBefore("sass","punctuation",{selector:{pattern:/([ \t]*)\S(?:,?[^,\r\n]+)*(?:,(?:\r?\n|\r)\1[ \t]+\S(?:,?[^,\r\n]+)*)*/,lookbehind:!0}})}(Prism);
 Prism.languages.scss=Prism.languages.extend("css",{comment:{pattern:/(^|[^\\])(?:\/\*[\w\W]*?\*\/|\/\/.*)/,lookbehind:!0},atrule:{pattern:/@[\w-]+(?:\([^()]+\)|[^(])*?(?=\s+[{;])/,inside:{rule:/@[\w-]+/}},url:/(?:[-a-z]+-)*url(?=\()/i,selector:{pattern:/(?=\S)[^@;\{\}\(\)]?([^@;\{\}\(\)]|&|#\{\$[-_\w]+\})+(?=\s*\{(\}|\s|[^\}]+(:|\{)[^\}]+))/m,inside:{placeholder:/%[-_\w]+/}}}),Prism.languages.insertBefore("scss","atrule",{keyword:[/@(?:if|else(?: if)?|for|each|while|import|extend|debug|warn|mixin|include|function|return|content)/i,{pattern:/( +)(?:from|through)(?= )/,lookbehind:!0}]}),Prism.languages.insertBefore("scss","property",{variable:/\$[-_\w]+|#\{\$[-_\w]+\}/}),Prism.languages.insertBefore("scss","function",{placeholder:{pattern:/%[-_\w]+/,alias:"selector"},statement:/\B!(?:default|optional)\b/i,"boolean":/\b(?:true|false)\b/,"null":/\bnull\b/,operator:{pattern:/(\s)(?:[-+*\/%]|[=!]=|<=?|>=?|and|or|not)(?=\s)/,lookbehind:!0}}),Prism.languages.scss.atrule.inside.rest=Prism.util.clone(Prism.languages.scss);
--- a/website/usage/_facts-figures/_benchmarks.jade
+++ b/website/usage/_facts-figures/_benchmarks.jade
@ -157,7 +157,13 @@ p

 +infobox("Important note", "⚠️")
    |  This evaluation was conducted in 2015. We're working on benchmarks on
-    |  current CPU and GPU hardware.
+    |  current CPU and GPU hardware. In the meantime, we're grateful to the
+    |  Stanford folks for drawing our attention to what seems
+    |  to be #[+a("https://nlp.stanford.edu/software/tokenizer.html#Speed") a long-standing error] 
+    |  in our CoreNLP benchmarks, especially for their 
+    |  tokenizer. Until we run corrected experiments, we have updated the table
+    |  using their figures.
+

 +aside("Methodology")
    |  #[strong Set up:] 100,000 plain-text documents were streamed from an
@ -183,14 +189,14 @@ p
    +row
        +cell #[strong spaCy]
        each data in [ "0.2ms", "1ms", "19ms"]
-            +cell("num") #[strong=data]
+            +cell("num")=data

        each data in ["1x", "1x", "1x"]
            +cell("num")=data

    +row
        +cell CoreNLP
-        each data in ["2ms", "10ms", "49ms", "10x", "10x", "2.6x"]
+        each data in ["0.18ms", "10ms", "49ms", "0.9x", "10x", "2.6x"]
            +cell("num")=data
    +row
        +cell ZPar
--- a/website/usage/_spacy-101/_lightning-tour.jade
+++ b/website/usage/_spacy-101/_lightning-tour.jade
@ -354,7 +354,7 @@ p
        string = ''.join(output)
        string = string.replace('\n', '')
        string = string.replace('\t', '    ')
-        return '&lt;pre&gt;{}&lt;/pre&gt;.format(string)
+        return '&lt;pre&gt;{}&lt;/pre&gt;'.format(string)

    nlp = spacy.load('en_core_web_sm')
    doc = nlp(u"This is a test.\n\nHello   world.")