Mirror of https://github.com/explosion/spaCy.git

Commit 48c712f1c1: Merge branch 'master' of ssh://github.com/explosion/spaCy
@@ -7,10 +7,11 @@ Following the v1.0 release, it's time to welcome more contributors into the spaC

## Table of contents

1. [Issues and bug reports](#issues-and-bug-reports)
2. [Contributing to the code base](#contributing-to-the-code-base)
3. [Updating the website](#updating-the-website)
4. [Submitting a tutorial](#submitting-a-tutorial)
5. [Submitting a project to the showcase](#submitting-a-project-to-the-showcase)
6. [Code of conduct](#code-of-conduct)
3. [Adding tests](#adding-tests)
4. [Updating the website](#updating-the-website)
5. [Submitting a tutorial](#submitting-a-tutorial)
6. [Submitting a project to the showcase](#submitting-a-project-to-the-showcase)
7. [Code of conduct](#code-of-conduct)

## Issues and bug reports

@@ -33,6 +34,7 @@ We use the following system to tag our issues:

| [`install`](https://github.com/explosion/spaCy/labels/install) | Installation problems |
| [`performance`](https://github.com/explosion/spaCy/labels/performance) | Accuracy, speed and memory use problems |
| [`tests`](https://github.com/explosion/spaCy/labels/tests) | Missing or incorrect [tests](spacy/tests) |
| [`examples`](https://github.com/explosion/spaCy/labels/examples) | Issues related to the [examples](spacy/examples) |
| [`english`](https://github.com/explosion/spaCy/labels/english), [`german`](https://github.com/explosion/spaCy/labels/german) | Issues related to the specific languages, models and data |
| [`linux`](https://github.com/explosion/spaCy/labels/linux), [`osx`](https://github.com/explosion/spaCy/labels/osx), [`windows`](https://github.com/explosion/spaCy/labels/windows) | Issues related to the specific operating systems |
| [`pip`](https://github.com/explosion/spaCy/labels/pip), [`conda`](https://github.com/explosion/spaCy/labels/conda) | Issues related to the specific package managers |

@@ -66,11 +68,21 @@ example_user would create the file `.github/contributors/example_user.md`.

### Fixing bugs

When fixing a bug, first create an [issue](https://github.com/explosion/spaCy/issues) if one does not already exist. The description text can be very short – we don't want to make this too bureaucratic. Next, create a test file named `test_issue[ISSUE NUMBER].py` in the [`spacy/tests/regression`](spacy/tests/regression) folder.

When fixing a bug, first create an [issue](https://github.com/explosion/spaCy/issues) if one does not already exist. The description text can be very short – we don't want to make this too bureaucratic.

Test for the bug you're fixing, and make sure the test fails. If the test requires the models to be loaded, mark it with the `pytest.mark.models` decorator.

Next, create a test file named `test_issue[ISSUE NUMBER].py` in the [`spacy/tests/regression`](spacy/tests/regression) folder. Test for the bug you're fixing, and make sure the test fails. Next, add and commit your test file referencing the issue number in the commit message. Finally, fix the bug, make sure your test passes and reference the issue in your commit message.

Next, add and commit your test file referencing the issue number in the commit message. Finally, fix the bug, make sure your test passes and reference the issue in your commit message.

📖 **For more information on how to add tests, check out the [tests README](spacy/tests/README.md).**
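To make this workflow concrete, here is a minimal sketch of what such a regression test could look like. The issue number, the text and the assertion are invented for illustration and are not from the codebase:

```python
# spacy/tests/regression/test_issue1234.py  (hypothetical issue number)
# coding: utf-8
from __future__ import unicode_literals

import pytest


@pytest.mark.models
def test_issue1234(EN):
    # Reproduce the behaviour reported in the issue; this should fail
    # until the bug is fixed.
    doc = EN(u"Text that triggers the reported bug.")
    assert len(doc) > 0
```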
## Adding tests

spaCy uses the [pytest](http://doc.pytest.org/) framework for testing. For more info on this, see the [pytest documentation](http://docs.pytest.org/en/latest/contents.html). Tests for spaCy modules and classes live in their own directories of the same name. For example, tests for the `Tokenizer` can be found in [`/spacy/tests/tokenizer`](spacy/tests/tokenizer). To be interpreted and run, all test files and test functions need to be prefixed with `test_`.

When adding tests, make sure to use descriptive names, keep the code short and concise and only test for one behaviour at a time. Try to `parametrize` test cases wherever possible, use our pre-defined fixtures for spaCy components and avoid unnecessary imports.

Extensive tests that take a long time should be marked with `@pytest.mark.slow`. Tests that require the model to be loaded should be marked with `@pytest.mark.models`. Loading the models is expensive and not necessary if you're not actually testing the model performance. If all you need is a `Doc` object with annotations like heads, POS tags or the dependency parse, you can use the `get_doc()` utility function to construct it manually.

📖 **For more guidelines and information on how to add tests, check out the [tests README](spacy/tests/README.md).**
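As a rough illustration of these markers and of `get_doc()` (the test names, texts and assertions below are hypothetical):

```python
# coding: utf-8
from __future__ import unicode_literals

import pytest

from ..util import get_doc


@pytest.mark.slow
def test_tokenizer_handles_very_long_text(en_tokenizer):
    # Only runs when the suite is invoked with --slow.
    tokens = en_tokenizer(u"Lorem ipsum dolor sit amet. " * 1000)
    assert len(tokens) > 1000


def test_doc_has_heads_without_models(en_tokenizer):
    # No @pytest.mark.models needed: annotations are set manually via get_doc().
    tokens = en_tokenizer(u"I like cheese")
    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=[1, 0, -1])
    assert doc[1].text == u"like"
```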
## Updating the website

@@ -86,7 +98,9 @@ harp server

The docs can always use another example or more detail, and they should always be up to date and not misleading. To quickly find the correct file to edit, simply click on the "Suggest edits" button at the bottom of a page.

To make it easy to add content components, we use a [collection of custom mixins](_includes/_mixins.jade), like `+table`, `+list` or `+code`. For more info and troubleshooting guides, check out the [website README](website).

To make it easy to add content components, we use a [collection of custom mixins](_includes/_mixins.jade), like `+table`, `+list` or `+code`.

📖 **For more info and troubleshooting guides, check out the [website README](website).**

### Resources to get you started
README.rst (10 changed lines)

@@ -10,10 +10,6 @@ released under the MIT license.

💫 **Version 1.5 out now!** `Read the release notes here. <https://github.com/explosion/spaCy/releases/>`_

.. image:: http://i.imgur.com/wFvLZyJ.png
    :target: https://travis-ci.org/explosion/spaCy
    :alt: spaCy on Travis CI

.. image:: https://travis-ci.org/explosion/spaCy.svg?branch=master
    :target: https://travis-ci.org/explosion/spaCy
    :alt: Build Status

@@ -26,9 +22,13 @@ released under the MIT license.

    :target: https://pypi.python.org/pypi/spacy
    :alt: pypi Version

.. image:: https://badges.gitter.im/explosion.png
.. image:: https://img.shields.io/badge/gitter-join%20chat%20%E2%86%92-09a3d5.svg
    :target: https://gitter.im/explosion/spaCy
    :alt: spaCy on Gitter

.. image:: https://img.shields.io/twitter/follow/spacy_io.svg?style=social&label=Follow
    :target: https://twitter.com/spacy_io
    :alt: spaCy on Twitter

📖 Documentation
================
@@ -47,11 +47,16 @@ First, install [Keras](https://keras.io/), [spaCy](https://spacy.io) and the spa

English models (about 1GB of data):

```bash
pip install keras spacy
pip install https://github.com/fchollet/keras/archive/master.zip
pip install spacy
python -m spacy.en.download
```

You'll also want to get keras working on your GPU. This will depend on your
⚠️ **Important:** In order for the example to run, you'll need to install Keras from
the master branch (and not via `pip install keras`). For more info on this, see
[#727](https://github.com/explosion/spaCy/issues/727).

You'll also want to get Keras working on your GPU. This will depend on your
set up, so you're mostly on your own for this step. If you're using AWS, try the
[NVidia AMI](https://aws.amazon.com/marketplace/pp/B00FYCDDTE). It made things pretty easy.
fabfile.py (20 changed lines, vendored)

@@ -1,22 +1,20 @@

from __future__ import print_function
# coding: utf-8
from __future__ import unicode_literals, print_function

from fabric.api import local, lcd, env, settings, prefix
from os.path import exists as file_exists
from fabtools.python import virtualenv
from os import path
import os
import shutil
from pathlib import Path
from os import path, environ


PWD = path.dirname(__file__)
VENV_DIR = path.join(PWD, '.env')
ENV = environ['VENV_DIR'] if 'VENV_DIR' in environ else '.env'
VENV_DIR = path.join(PWD, ENV)


def env(lang="python2.7"):
    if file_exists('.env'):
        local('rm -rf .env')
    local('virtualenv -p %s .env' % lang)
def env(lang='python2.7'):
    if path.exists(VENV_DIR):
        local('rm -rf {env}'.format(env=VENV_DIR))
    local('virtualenv -p {lang} {env}'.format(lang=lang, env=VENV_DIR))


def install():
setup.py (1 changed line)

@@ -37,7 +37,6 @@ PACKAGES = [

    'spacy.munge',
    'spacy.tests',
    'spacy.tests.matcher',
    'spacy.tests.munge',
    'spacy.tests.parser',
    'spacy.tests.serialize',
    'spacy.tests.spans',
@@ -7,7 +7,7 @@ from ..language_data import PRON_LEMMA

EXC = {}

EXCLUDE_EXC = ["Ill", "ill", "Its", "its", "Hell", "hell", "Well", "well", "Whore", "whore"]
EXCLUDE_EXC = ["Ill", "ill", "Its", "its", "Hell", "hell", "were", "Were", "Well", "well", "Whore", "whore"]


# Pronouns
@@ -1,6 +1,7 @@

# encoding: utf8
from __future__ import unicode_literals, print_function

from spacy.hu.tokenizer_exceptions import TOKEN_MATCH
from .language_data import *
from ..attrs import LANG
from ..language import Language

@@ -21,3 +22,5 @@ class Hungarian(Language):

        infixes = tuple(TOKENIZER_INFIXES)

        stop_words = set(STOP_WORDS)

        token_match = TOKEN_MATCH
@@ -1,8 +1,6 @@

# encoding: utf8
from __future__ import unicode_literals

import six

from spacy.language_data import strings_to_exc, update_exc
from .punctuation import *
from .stop_words import STOP_WORDS

@@ -10,19 +8,15 @@ from .tokenizer_exceptions import ABBREVIATIONS

from .tokenizer_exceptions import OTHER_EXC
from .. import language_data as base


STOP_WORDS = set(STOP_WORDS)


TOKENIZER_EXCEPTIONS = strings_to_exc(base.EMOTICONS)
update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(base.ABBREVIATIONS))
update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(OTHER_EXC))
update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(ABBREVIATIONS))


TOKENIZER_PREFIXES = base.TOKENIZER_PREFIXES
TOKENIZER_SUFFIXES = base.TOKENIZER_SUFFIXES + TOKENIZER_SUFFIXES
TOKENIZER_PREFIXES = TOKENIZER_PREFIXES
TOKENIZER_SUFFIXES = TOKENIZER_SUFFIXES
TOKENIZER_INFIXES = TOKENIZER_INFIXES


__all__ = ["TOKENIZER_EXCEPTIONS", "STOP_WORDS", "TOKENIZER_PREFIXES", "TOKENIZER_SUFFIXES", "TOKENIZER_INFIXES"]
@@ -1,25 +1,41 @@

# encoding: utf8
from __future__ import unicode_literals

from ..language_data.punctuation import ALPHA, ALPHA_LOWER, ALPHA_UPPER, LIST_ELLIPSES
from ..language_data.punctuation import ALPHA_LOWER, LIST_ELLIPSES, QUOTES, ALPHA_UPPER, LIST_QUOTES, UNITS, \
    CURRENCY, LIST_PUNCT, ALPHA, _QUOTES

CURRENCY_SYMBOLS = r"\$ ¢ £ € ¥ ฿"

TOKENIZER_SUFFIXES = [
    r'(?<=[{al})])-e'.format(al=ALPHA_LOWER)
]
TOKENIZER_PREFIXES = (
    [r'\+'] +
    LIST_PUNCT +
    LIST_ELLIPSES +
    LIST_QUOTES
)

TOKENIZER_INFIXES = [
    r'(?<=[0-9])-(?=[0-9])',
    r'(?<=[0-9])[+\-\*/^](?=[0-9])',
    r'(?<=[{a}])--(?=[{a}])',
    r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),
    r'(?<=[{al}])\.(?=[{au}])'.format(al=ALPHA_LOWER, au=ALPHA_UPPER),
    r'(?<=[0-9{a}])"(?=[\-{a}])'.format(a=ALPHA),
    r'(?<=[{a}"])[:<>=](?=[{a}])'.format(a=ALPHA)
]
TOKENIZER_SUFFIXES = (
    LIST_PUNCT +
    LIST_ELLIPSES +
    LIST_QUOTES +
    [
        r'(?<=[0-9])\+',
        r'(?<=°[FfCcKk])\.',
        r'(?<=[0-9])(?:{c})'.format(c=CURRENCY),
        r'(?<=[0-9])(?:{u})'.format(u=UNITS),
        r'(?<=[{al}{p}{c}(?:{q})])\.'.format(al=ALPHA_LOWER, p=r'%²\-\)\]\+', q=QUOTES, c=CURRENCY_SYMBOLS),
        r'(?<=[{al})])-e'.format(al=ALPHA_LOWER)
    ]
)


TOKENIZER_INFIXES += LIST_ELLIPSES


__all__ = ["TOKENIZER_SUFFIXES", "TOKENIZER_INFIXES"]
TOKENIZER_INFIXES = (
    LIST_ELLIPSES +
    [
        r'(?<=[{al}])\.(?=[{au}])'.format(al=ALPHA_LOWER, au=ALPHA_UPPER),
        r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),
        r'(?<=[{a}"])[:<>=](?=[{a}])'.format(a=ALPHA),
        r'(?<=[{a}])--(?=[{a}])'.format(a=ALPHA),
        r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),
        r'(?<=[{a}])([{q}\)\]\(\[])(?=[\-{a}])'.format(a=ALPHA, q=_QUOTES.replace("'", "").strip().replace(" ", "")),
    ]
)
__all__ = ["TOKENIZER_PREFIXES", "TOKENIZER_SUFFIXES", "TOKENIZER_INFIXES"]
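As a rough sketch of how such rule lists are consumed (this assumes spaCy's `compile_infix_regex` utility, the counterpart of the `compile_prefix_regex` helper used elsewhere in this diff, and an illustrative Hungarian sample string):

```python
# coding: utf8
from __future__ import unicode_literals

from spacy.util import compile_infix_regex
from spacy.hu.punctuation import TOKENIZER_INFIXES

infix_finditer = compile_infix_regex(TOKENIZER_INFIXES).finditer

# The lower-case/upper-case "." infix rule should yield a split point here.
assert any(infix_finditer(u"alma.Körte"))
```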
@@ -1,9 +1,17 @@

# encoding: utf8
from __future__ import unicode_literals

import re

from spacy.language_data.punctuation import ALPHA_LOWER, CURRENCY
from ..language_data.tokenizer_exceptions import _URL_PATTERN

ABBREVIATIONS = """
A.
|
||||
AG.
|
||||
AkH.
|
||||
Aö.
|
||||
B.
|
||||
B.CS.
|
||||
B.S.
|
||||
B.Sc.
|
||||
|
@ -13,57 +21,103 @@ BEK.
|
|||
BSC.
|
||||
BSc.
|
||||
BTK.
|
||||
Bat.
|
||||
Be.
|
||||
Bek.
|
||||
Bfok.
|
||||
Bk.
|
||||
Bp.
|
||||
Bros.
|
||||
Bt.
|
||||
Btk.
|
||||
Btke.
|
||||
Btét.
|
||||
C.
|
||||
CSC.
|
||||
Cal.
|
||||
Cg.
|
||||
Cgf.
|
||||
Cgt.
|
||||
Cia.
|
||||
Co.
|
||||
Colo.
|
||||
Comp.
|
||||
Copr.
|
||||
Corp.
|
||||
Cos.
|
||||
Cs.
|
||||
Csc.
|
||||
Csop.
|
||||
Cstv.
|
||||
Ctv.
|
||||
Ctvr.
|
||||
D.
|
||||
DR.
|
||||
Dipl.
|
||||
Dr.
|
||||
Dsz.
|
||||
Dzs.
|
||||
E.
|
||||
EK.
|
||||
EU.
|
||||
F.
|
||||
Fla.
|
||||
Folyt.
|
||||
Fpk.
|
||||
Főszerk.
|
||||
G.
|
||||
GK.
|
||||
GM.
|
||||
Gfv.
|
||||
Gmk.
|
||||
Gr.
|
||||
Group.
|
||||
Gt.
|
||||
Gy.
|
||||
H.
|
||||
HKsz.
|
||||
Hmvh.
|
||||
I.
|
||||
Ifj.
|
||||
Inc.
|
||||
Inform.
|
||||
Int.
|
||||
J.
|
||||
Jr.
|
||||
Jv.
|
||||
K.
|
||||
K.m.f.
|
||||
KB.
|
||||
KER.
|
||||
KFT.
|
||||
KRT.
|
||||
Kb.
|
||||
Ker.
|
||||
Kft.
|
||||
Kg.
|
||||
Kht.
|
||||
Kkt.
|
||||
Kong.
|
||||
Korm.
|
||||
Kr.
|
||||
Kr.e.
|
||||
Kr.u.
|
||||
Krt.
|
||||
L.
|
||||
LB.
|
||||
Llc.
|
||||
Ltd.
|
||||
M.
|
||||
M.A.
|
||||
M.S.
|
||||
M.SC.
|
||||
M.Sc.
|
||||
MA.
|
||||
MH.
|
||||
MSC.
|
||||
MSc.
|
||||
Mass.
|
||||
Max.
|
||||
Mlle.
|
||||
Mme.
|
||||
Mo.
|
||||
|
@ -71,45 +125,77 @@ Mr.
|
|||
Mrs.
|
||||
Ms.
|
||||
Mt.
|
||||
N.
|
||||
N.N.
|
||||
NB.
|
||||
NBr.
|
||||
Nat.
|
||||
No.
|
||||
Nr.
|
||||
Ny.
|
||||
Nyh.
|
||||
Nyr.
|
||||
Nyrt.
|
||||
O.
|
||||
OJ.
|
||||
Op.
|
||||
P.
|
||||
P.H.
|
||||
P.S.
|
||||
PH.D.
|
||||
PHD.
|
||||
PROF.
|
||||
Pf.
|
||||
Ph.D
|
||||
PhD.
|
||||
Pk.
|
||||
Pl.
|
||||
Plc.
|
||||
Pp.
|
||||
Proc.
|
||||
Prof.
|
||||
Ptk.
|
||||
R.
|
||||
RT.
|
||||
Rer.
|
||||
Rt.
|
||||
S.
|
||||
S.B.
|
||||
SZOLG.
|
||||
Salg.
|
||||
Sch.
|
||||
Spa.
|
||||
St.
|
||||
Sz.
|
||||
SzRt.
|
||||
Szerk.
|
||||
Szfv.
|
||||
Szjt.
|
||||
Szolg.
|
||||
Szt.
|
||||
Sztv.
|
||||
Szvt.
|
||||
Számv.
|
||||
T.
|
||||
TEL.
|
||||
Tel.
|
||||
Ty.
|
||||
Tyr.
|
||||
U.
|
||||
Ui.
|
||||
Ut.
|
||||
V.
|
||||
VB.
|
||||
Vcs.
|
||||
Vhr.
|
||||
Vht.
|
||||
Várm.
|
||||
W.
|
||||
X.
|
||||
X.Y.
|
||||
Y.
|
||||
Z.
|
||||
Zrt.
|
||||
Zs.
|
||||
a.C.
|
||||
ac.
|
||||
|
@ -119,11 +205,13 @@ ag.
|
|||
agit.
|
||||
alez.
|
||||
alk.
|
||||
all.
|
||||
altbgy.
|
||||
an.
|
||||
ang.
|
||||
arch.
|
||||
at.
|
||||
atc.
|
||||
aug.
|
||||
b.a.
|
||||
b.s.
|
||||
|
@ -161,6 +249,7 @@ dikt.
|
|||
dipl.
|
||||
dj.
|
||||
dk.
|
||||
dl.
|
||||
dny.
|
||||
dolg.
|
||||
dr.
|
||||
|
@ -184,6 +273,7 @@ eü.
|
|||
f.h.
|
||||
f.é.
|
||||
fam.
|
||||
fb.
|
||||
febr.
|
||||
fej.
|
||||
felv.
|
||||
|
@ -211,6 +301,7 @@ gazd.
|
|||
gimn.
|
||||
gk.
|
||||
gkv.
|
||||
gmk.
|
||||
gondn.
|
||||
gr.
|
||||
grav.
|
||||
|
@ -240,6 +331,7 @@ hőm.
|
|||
i.e.
|
||||
i.sz.
|
||||
id.
|
||||
ie.
|
||||
ifj.
|
||||
ig.
|
||||
igh.
|
||||
|
@ -254,6 +346,7 @@ io.
|
|||
ip.
|
||||
ir.
|
||||
irod.
|
||||
irod.
|
||||
isk.
|
||||
ism.
|
||||
izr.
|
||||
|
@ -261,6 +354,7 @@ iá.
|
|||
jan.
|
||||
jav.
|
||||
jegyz.
|
||||
jgmk.
|
||||
jjv.
|
||||
jkv.
|
||||
jogh.
|
||||
|
@ -271,6 +365,7 @@ júl.
|
|||
jún.
|
||||
karb.
|
||||
kat.
|
||||
kath.
|
||||
kb.
|
||||
kcs.
|
||||
kd.
|
||||
|
@ -285,6 +380,8 @@ kiv.
|
|||
kk.
|
||||
kkt.
|
||||
klin.
|
||||
km.
|
||||
korm.
|
||||
kp.
|
||||
krt.
|
||||
kt.
|
||||
|
@ -318,6 +415,7 @@ m.s.
|
|||
m.sc.
|
||||
ma.
|
||||
mat.
|
||||
max.
|
||||
mb.
|
||||
med.
|
||||
megh.
|
||||
|
@ -353,6 +451,7 @@ nat.
|
|||
nb.
|
||||
neg.
|
||||
nk.
|
||||
no.
|
||||
nov.
|
||||
nu.
|
||||
ny.
|
||||
|
@ -362,6 +461,7 @@ nyug.
|
|||
obj.
|
||||
okl.
|
||||
okt.
|
||||
old.
|
||||
olv.
|
||||
orsz.
|
||||
ort.
|
||||
|
@ -372,6 +472,8 @@ pg.
|
|||
ph.d
|
||||
ph.d.
|
||||
phd.
|
||||
phil.
|
||||
pjt.
|
||||
pk.
|
||||
pl.
|
||||
plb.
|
||||
|
@ -406,6 +508,7 @@ röv.
|
|||
s.b.
|
||||
s.k.
|
||||
sa.
|
||||
sb.
|
||||
sel.
|
||||
sgt.
|
||||
sm.
|
||||
|
@ -413,6 +516,7 @@ st.
|
|||
stat.
|
||||
stb.
|
||||
strat.
|
||||
stud.
|
||||
sz.
|
||||
szakm.
|
||||
szaksz.
|
||||
|
@ -467,6 +571,7 @@ vb.
|
|||
vegy.
|
||||
vh.
|
||||
vhol.
|
||||
vhr.
|
||||
vill.
|
||||
vizsg.
|
||||
vk.
|
||||
|
@ -478,13 +583,20 @@ vs.
|
|||
vsz.
|
||||
vv.
|
||||
vál.
|
||||
várm.
|
||||
vízv.
|
||||
vö.
|
||||
zrt.
|
||||
zs.
|
||||
Á.
|
||||
Áe.
|
||||
Áht.
|
||||
É.
|
||||
Épt.
|
||||
Ész.
|
||||
Új-Z.
|
||||
ÚjZ.
|
||||
Ún.
|
||||
á.
|
||||
ált.
|
||||
ápr.
|
||||
|
@ -500,6 +612,7 @@ zs.
|
|||
ötk.
|
||||
özv.
|
||||
ú.
|
||||
ú.n.
|
||||
úm.
|
||||
ún.
|
||||
út.
|
||||
|
@ -510,7 +623,6 @@ zs.
|
|||
ümk.
|
||||
ütk.
|
||||
üv.
|
||||
ő.
|
||||
ű.
|
||||
őrgy.
|
||||
őrpk.
|
||||
|
@@ -520,3 +632,17 @@ zs.

OTHER_EXC = """
-e
""".strip().split()

ORD_NUM_OR_DATE = "([A-Z0-9]+[./-])*(\d+\.?)"
_NUM = "[+\-]?\d+([,.]\d+)*"
_OPS = "[=<>+\-\*/^()÷%²]"
_SUFFIXES = "-[{a}]+".format(a=ALPHA_LOWER)
NUMERIC_EXP = "({n})(({o})({n}))*[%]?".format(n=_NUM, o=_OPS)
TIME_EXP = "\d+(:\d+)*(\.\d+)?"

NUMS = "(({ne})|({t})|({on})|({c}))({s})?".format(
    ne=NUMERIC_EXP, t=TIME_EXP, on=ORD_NUM_OR_DATE,
    c=CURRENCY, s=_SUFFIXES
)

TOKEN_MATCH = re.compile("^({u})|({n})$".format(u=_URL_PATTERN, n=NUMS)).match
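A quick sketch of what this composed `TOKEN_MATCH` pattern accepts; the sample strings are illustrative assumptions, not test data from the repository:

```python
from __future__ import unicode_literals
from spacy.hu.tokenizer_exceptions import TOKEN_MATCH

assert TOKEN_MATCH(u"10:30")        # a time expression
assert TOKEN_MATCH(u"-23,12")       # a signed decimal number
assert TOKEN_MATCH(u"3434/1992.")   # an ordinal/date-like form
assert not TOKEN_MATCH(u"alma")     # plain words are not matched
```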
@@ -57,14 +57,14 @@ LIST_PUNCT = list(_PUNCT.strip().split())

LIST_HYPHENS = list(_HYPHENS.strip().split())


ALPHA_LOWER = _ALPHA_LOWER.strip().replace(' ', '')
ALPHA_UPPER = _ALPHA_UPPER.strip().replace(' ', '')
ALPHA_LOWER = _ALPHA_LOWER.strip().replace(' ', '').replace('\n', '')
ALPHA_UPPER = _ALPHA_UPPER.strip().replace(' ', '').replace('\n', '')
ALPHA = ALPHA_LOWER + ALPHA_UPPER


QUOTES = _QUOTES.strip().replace(' ', '|')
CURRENCY = _CURRENCY.strip().replace(' ', '|')
UNITS = _UNITS.strip().replace(' ', '|')
UNITS = _UNITS.strip().replace(' ', '|').replace('\n', '|')
HYPHENS = _HYPHENS.strip().replace(' ', '|')
@@ -103,7 +103,7 @@ TOKENIZER_SUFFIXES = (

TOKENIZER_INFIXES = (
    LIST_ELLIPSES +
    [
        r'(?<=[0-9])[+\-\*/^](?=[0-9])',
        r'(?<=[0-9])[+\-\*^](?=[0-9-])',
        r'(?<=[{al}])\.(?=[{au}])'.format(al=ALPHA_LOWER, au=ALPHA_UPPER),
        r'(?<=[{a}]),(?=[{a}])'.format(a=ALPHA),
        r'(?<=[{a}])(?:{h})(?=[{a}])'.format(a=ALPHA, h=HYPHENS),
spacy/tests/README.md (169 lines, new file)

@@ -0,0 +1,169 @@

<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a>

# spaCy tests

spaCy uses the [pytest](http://doc.pytest.org/) framework for testing. For more info on this, see the [pytest documentation](http://docs.pytest.org/en/latest/contents.html).

Tests for spaCy modules and classes live in their own directories of the same name. For example, tests for the `Tokenizer` can be found in [`/tests/tokenizer`](tokenizer). All test modules (i.e. directories) also need to be listed in spaCy's [`setup.py`](../setup.py). To be interpreted and run, all test files and test functions need to be prefixed with `test_`.


## Table of contents

1. [Running the tests](#running-the-tests)
2. [Dos and don'ts](#dos-and-donts)
3. [Parameters](#parameters)
4. [Fixtures](#fixtures)
5. [Helpers and utilities](#helpers-and-utilities)
6. [Contributing to the tests](#contributing-to-the-tests)


## Running the tests

```bash
py.test spacy                 # run basic tests
py.test spacy --models        # run basic and model tests
py.test spacy --slow          # run basic and slow tests
py.test spacy --models --slow # run all tests
```

To show print statements, run the tests with `py.test -s`. To abort after the first failure, run them with `py.test -x`.


## Dos and don'ts

To keep the behaviour of the tests consistent and predictable, we try to follow a few basic conventions:

* **Test names** should follow a pattern of `test_[module]_[tested behaviour]`. For example: `test_tokenizer_keeps_email` or `test_spans_override_sentiment`.
* If you're testing for a bug reported in a specific issue, always create a **regression test**. Regression tests should be named `test_issue[ISSUE NUMBER]` and live in the [`regression`](regression) directory.
* Only use `@pytest.mark.xfail` for tests that **should pass, but currently fail**. To test for desired negative behaviour, use `assert not` in your test (see the sketch after this list).
* Very **extensive tests** that take a long time to run should be marked with `@pytest.mark.slow`. If your slow test is testing important behaviour, consider adding an additional simpler version.
* Tests that require **loading the models** should be marked with `@pytest.mark.models`.
* Before requiring the models, always make sure there is no other way to test the particular behaviour. In a lot of cases, it's sufficient to simply create a `Doc` object manually. See the section on [helpers and utility functions](#helpers-and-utilities) for more info on this.
* **Avoid unnecessary imports.** There should never be a need to explicitly import spaCy at the top of a file, and most components are available as [fixtures](#fixtures). You should also avoid wildcard imports (`from module import *`).
* If you're importing from spaCy, **always use relative imports**. Otherwise, you might accidentally be running the tests over a different copy of spaCy, e.g. one you have installed on your system.
* Don't forget the **unicode declarations** at the top of each file. This way, unicode strings won't have to be prefixed with `u`.
* Try to keep the tests **readable and concise**. Use clear and descriptive variable names (`doc`, `tokens` and `text` are great), keep it short and only test for one behaviour at a time.
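The following sketch illustrates the difference between `xfail` and testing for negative behaviour; the test names, example texts and the assumption that the first case is currently broken are all invented for illustration:

```python
# coding: utf-8
from __future__ import unicode_literals

import pytest


@pytest.mark.xfail
def test_tokenizer_keeps_hyphenated_measure(en_tokenizer):
    # Desired behaviour that is assumed, for this example, to be broken:
    # the test should pass once the underlying bug is fixed.
    tokens = en_tokenizer(u"It's a 0.5-inch margin.")
    assert u"0.5-inch" in [token.text for token in tokens]


def test_tokenizer_does_not_merge_sentences(en_tokenizer):
    # Desired negative behaviour: use `assert not` instead of xfail.
    tokens = en_tokenizer(u"One sentence. Another sentence.")
    assert not len(tokens) == 1
```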
## Parameters

If the test cases can be extracted from the test, always `parametrize` them instead of hard-coding them into the test:

```python
@pytest.mark.parametrize('text', ["google.com", "spacy.io"])
def test_tokenizer_keep_urls(tokenizer, text):
    tokens = tokenizer(text)
    assert len(tokens) == 1
```

This will run the test once for each `text` value. Even if you're only testing one example, it's usually best to specify it as a parameter. This will later make it easier for others to quickly add additional test cases without having to modify the test.

You can also specify parameters as tuples to test with multiple values per test:

```python
@pytest.mark.parametrize('text,length', [("U.S.", 1), ("us.", 2), ("(U.S.", 2)])
```

To test for combinations of parameters, you can add several `parametrize` markers:

```python
@pytest.mark.parametrize('text', ["A test sentence", "Another sentence"])
@pytest.mark.parametrize('punct', ['.', '!', '?'])
```

This will run the test with all combinations of the two parameters `text` and `punct`. **Use this feature sparingly**, though, as it can easily cause unnecessary or undesired test bloat.
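Putting it together, a complete combinatorial test might look like the sketch below; the test name and the assertion are illustrative assumptions, not part of the README:

```python
@pytest.mark.parametrize('text', ["A test sentence", "Another sentence"])
@pytest.mark.parametrize('punct', ['.', '!', '?'])
def test_tokenizer_splits_trailing_punct(en_tokenizer, text, punct):
    # Runs 2 x 3 = 6 times, once per (text, punct) combination.
    tokens = en_tokenizer(text + punct)
    assert tokens[-1].text == punct
```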
## Fixtures

Fixtures to create instances of spaCy objects and other components should only be defined once in the global [`conftest.py`](conftest.py). We avoid having per-directory conftest files, as this can easily lead to confusion.

These are the main fixtures that are currently available:

| Fixture | Description |
| --- | --- |
| `tokenizer` | Creates **all available** language tokenizers and runs the test for **each of them**. |
| `en_tokenizer` | Creates an English `Tokenizer` object. |
| `de_tokenizer` | Creates a German `Tokenizer` object. |
| `hu_tokenizer` | Creates a Hungarian `Tokenizer` object. |
| `en_vocab` | Creates an English `Vocab` object. |
| `en_entityrecognizer` | Creates an English `EntityRecognizer` object. |
| `lemmatizer` | Creates a `Lemmatizer` object from the installed language data (`None` if no data is found). |
| `EN` | Creates an instance of `English`. Only use for tests that require the models. |
| `DE` | Creates an instance of `German`. Only use for tests that require the models. |
| `text_file` | Creates an instance of `StringIO` to simulate reading from and writing to files. |
| `text_file_b` | Creates an instance of `BytesIO` to simulate reading from and writing to files. |

The fixtures can be used in all tests by simply setting them as an argument, like this:

```python
def test_module_do_something(en_tokenizer):
    tokens = en_tokenizer("Some text here")
```

If all tests in a file require a specific configuration, or use the same complex example, it can be helpful to create a separate fixture. This fixture should be added at the top of each file. Make sure to use descriptive names for these fixtures and don't override any of the global fixtures listed above. **From looking at a test, it should immediately be clear which fixtures are used, and where they are coming from.**
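For example, a file-specific fixture might look like the sketch below; the fixture name, example text and test are hypothetical:

```python
@pytest.fixture
def text():
    # Shared example used by all tests in this file.
    return "This is a complex example. It contains several sentences."


def test_tokenizer_handles_complex_text(en_tokenizer, text):
    tokens = en_tokenizer(text)
    assert len(tokens) > 0
```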
## Helpers and utilities

Our new test setup comes with a few handy utility functions that can be imported from [`util.py`](util.py).


### Constructing a `Doc` object manually with `get_doc()`

Loading the models is expensive and not necessary if you're not actually testing the model performance. If all you need is a `Doc` object with annotations like heads, POS tags or the dependency parse, you can use `get_doc()` to construct it manually.

```python
def test_doc_token_api_strings(en_tokenizer):
    text = "Give it back! He pleaded."
    pos = ['VERB', 'PRON', 'PART', 'PUNCT', 'PRON', 'VERB', 'PUNCT']
    heads = [0, -1, -2, -3, 1, 0, -1]
    deps = ['ROOT', 'dobj', 'prt', 'punct', 'nsubj', 'ROOT', 'punct']

    tokens = en_tokenizer(text)
    doc = get_doc(tokens.vocab, [t.text for t in tokens], pos=pos, heads=heads, deps=deps)
    assert doc[0].text == 'Give'
    assert doc[0].lower_ == 'give'
    assert doc[0].pos_ == 'VERB'
    assert doc[0].dep_ == 'ROOT'
```

You can construct a `Doc` with the following arguments:

| Argument | Description |
| --- | --- |
| `vocab` | `Vocab` instance to use. If you're tokenizing before creating a `Doc`, make sure to use the tokenizer's vocab. Otherwise, you can also use the `en_vocab` fixture. **(required)** |
| `words` | List of words, for example `[t.text for t in tokens]`. **(required)** |
| `heads` | List of heads as integers. |
| `pos` | List of POS tags as text values. |
| `tag` | List of tag names as text values. |
| `dep` | List of dependencies as text values. |
| `ents` | List of entity tuples with `ent_id`, `label`, `start`, `end` (for example `('Stewart Lee', 'PERSON', 0, 2)`). The `label` will be looked up in `vocab.strings[label]`. |

Here's how to quickly get these values from within spaCy:

```python
doc = nlp(u'Some text here')
print [token.head.i-token.i for token in doc]
print [token.tag_ for token in doc]
print [token.pos_ for token in doc]
print [token.dep_ for token in doc]
```

**Note:** There's currently no way of setting the serializer data for the parser without loading the models. If this is relevant to your test, constructing the `Doc` via `get_doc()` won't work.


### Other utilities

| Name | Description |
| --- | --- |
| `apply_transition_sequence(parser, doc, sequence)` | Perform a series of pre-specified transitions, to put the parser in a desired state. |
| `add_vecs_to_vocab(vocab, vectors)` | Add list of vector tuples (`[("text", [1, 2, 3])]`) to given vocab. All vectors need to have the same length. |
| `get_cosine(vec1, vec2)` | Get cosine for two given vectors. |
| `assert_docs_equal(doc1, doc2)` | Compare two `Doc` objects and `assert` that they're equal. Tests for tokens, tags, dependencies and entities. |
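As an illustration, a test could combine these helpers roughly like this; the vector values, the similarity threshold and the test name are invented for the example:

```python
from ..util import get_doc, add_vecs_to_vocab, get_cosine


def test_vocab_similar_vectors(en_vocab):
    # Attach two toy vectors of the same length to the vocab.
    vectors = [("apple", [1.0, 2.0, 3.0]), ("orange", [1.0, 2.0, 2.5])]
    add_vecs_to_vocab(en_vocab, vectors)
    doc = get_doc(en_vocab, ["apple", "orange"])
    assert get_cosine(doc[0].vector, doc[1].vector) > 0.9
```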
## Contributing to the tests

There's still a long way to go to finally reach **100% test coverage** – and we'd appreciate your help! 🙌 You can open an issue on our [issue tracker](https://github.com/explosion/spaCy/issues) and label it `tests`, or make a [pull request](https://github.com/explosion/spaCy/pulls) to this repository.

📖 **For more information on contributing to spaCy in general, check out our [contribution guidelines](https://github.com/explosion/spaCy/blob/master/CONTRIBUTING.md).**
@@ -12,9 +12,13 @@ from ..sv import Swedish

from ..hu import Hungarian
from ..tokens import Doc
from ..strings import StringStore
from ..lemmatizer import Lemmatizer
from ..attrs import ORTH, TAG, HEAD, DEP
from ..util import match_best_version, get_data_path

from io import StringIO
from io import StringIO, BytesIO
from pathlib import Path
import os
import pytest


@@ -52,22 +56,49 @@ def de_tokenizer():

def hu_tokenizer():
    return Hungarian.Defaults.create_tokenizer()


@pytest.fixture
def stringstore():
    return StringStore()


@pytest.fixture
def en_entityrecognizer():
    return English.Defaults.create_entity()


@pytest.fixture
def lemmatizer(path):
    if path is not None:
        return Lemmatizer.load(path)
    else:
        return None


@pytest.fixture
def text_file():
    return StringIO()

@pytest.fixture
def text_file_b():
    return BytesIO()

# deprecated, to be replaced with more specific instances

@pytest.fixture
def path():
    if 'SPACY_DATA' in os.environ:
        return Path(os.environ['SPACY_DATA'])
    else:
        return match_best_version('en', None, get_data_path())


# only used for tests that require loading the models
# in all other cases, use specific instances
@pytest.fixture(scope="session")
def EN():
    return English()


# deprecated, to be replaced with more specific instances
@pytest.fixture(scope="session")
def DE():
    return German()
@@ -31,7 +31,7 @@ def test_doc_api_getitem(en_tokenizer):

        tokens[len(tokens)]

    def to_str(span):
        return '/'.join(token.text for token in span)
        return '/'.join(token.text for token in span)

    span = tokens[1:1]
    assert not to_str(span)
@@ -193,7 +193,7 @@ def test_doc_api_runtime_error(en_tokenizer):


def test_doc_api_right_edge(en_tokenizer):
    # Test for bug occurring from Unshift action, causing incorrect right edge
    """Test for bug occurring from Unshift action, causing incorrect right edge"""
    text = "I have proposed to myself, for the sake of such as live under the government of the Romans, to translate those books into the Greek tongue."
    heads = [2, 1, 0, -1, -1, -3, 15, 1, -2, -1, 1, -3, -1, -1, 1, -2, -1, 1,
             -2, -7, 1, -19, 1, -2, -3, 2, 1, -3, -26]

@@ -202,7 +202,8 @@ def test_doc_api_right_edge(en_tokenizer):

    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
    assert doc[6].text == 'for'
    subtree = [w.text for w in doc[6].subtree]
    assert subtree == ['for' , 'the', 'sake', 'of', 'such', 'as', 'live', 'under', 'the', 'government', 'of', 'the', 'Romans', ',']
    assert subtree == ['for', 'the', 'sake', 'of', 'such', 'as',
                       'live', 'under', 'the', 'government', 'of', 'the', 'Romans', ',']
    assert doc[6].right_edge.text == ','
@@ -85,8 +85,8 @@ def test_doc_token_api_vectors(en_tokenizer, text_file, text, vectors):

    assert tokens[0].similarity(tokens[1]) == tokens[1].similarity(tokens[0])
    assert sum(tokens[0].vector) != sum(tokens[1].vector)
    assert numpy.isclose(
        tokens[0].vector_norm,
        numpy.sqrt(numpy.dot(tokens[0].vector, tokens[0].vector)))
        tokens[0].vector_norm,
        numpy.sqrt(numpy.dot(tokens[0].vector, tokens[0].vector)))


def test_doc_token_api_ancestors(en_tokenizer):
@ -10,9 +10,6 @@ from ...util import compile_prefix_regex
|
|||
from ...language_data import TOKENIZER_PREFIXES
|
||||
|
||||
|
||||
|
||||
en_search_prefixes = compile_prefix_regex(TOKENIZER_PREFIXES).search
|
||||
|
||||
PUNCT_OPEN = ['(', '[', '{', '*']
|
||||
PUNCT_CLOSE = [')', ']', '}', '*']
|
||||
PUNCT_PAIRED = [('(', ')'), ('[', ']'), ('{', '}'), ('*', '*')]
|
||||
|
@ -99,7 +96,8 @@ def test_tokenizer_splits_double_end_quote(en_tokenizer, text):
|
|||
|
||||
@pytest.mark.parametrize('punct_open,punct_close', PUNCT_PAIRED)
|
||||
@pytest.mark.parametrize('text', ["Hello"])
|
||||
def test_tokenizer_splits_open_close_punct(en_tokenizer, punct_open, punct_close, text):
|
||||
def test_tokenizer_splits_open_close_punct(en_tokenizer, punct_open,
|
||||
punct_close, text):
|
||||
tokens = en_tokenizer(punct_open + text + punct_close)
|
||||
assert len(tokens) == 3
|
||||
assert tokens[0].text == punct_open
|
||||
|
@ -108,20 +106,22 @@ def test_tokenizer_splits_open_close_punct(en_tokenizer, punct_open, punct_close
|
|||
|
||||
|
||||
@pytest.mark.parametrize('punct_open,punct_close', PUNCT_PAIRED)
|
||||
@pytest.mark.parametrize('punct_open_add,punct_close_add', [("`", "'")])
|
||||
@pytest.mark.parametrize('punct_open2,punct_close2', [("`", "'")])
|
||||
@pytest.mark.parametrize('text', ["Hello"])
|
||||
def test_two_different(en_tokenizer, punct_open, punct_close, punct_open_add, punct_close_add, text):
|
||||
tokens = en_tokenizer(punct_open_add + punct_open + text + punct_close + punct_close_add)
|
||||
def test_tokenizer_two_diff_punct(en_tokenizer, punct_open, punct_close,
|
||||
punct_open2, punct_close2, text):
|
||||
tokens = en_tokenizer(punct_open2 + punct_open + text + punct_close + punct_close2)
|
||||
assert len(tokens) == 5
|
||||
assert tokens[0].text == punct_open_add
|
||||
assert tokens[0].text == punct_open2
|
||||
assert tokens[1].text == punct_open
|
||||
assert tokens[2].text == text
|
||||
assert tokens[3].text == punct_close
|
||||
assert tokens[4].text == punct_close_add
|
||||
assert tokens[4].text == punct_close2
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text,punct', [("(can't", "(")])
|
||||
def test_tokenizer_splits_pre_punct_regex(text, punct):
|
||||
en_search_prefixes = compile_prefix_regex(TOKENIZER_PREFIXES).search
|
||||
match = en_search_prefixes(text)
|
||||
assert match.group() == punct
|
||||
|
||||
|
|
|
@@ -29,8 +29,7 @@ untimely death" of the rapier-tongued Scottish barrister and parliamentarian.

    ("""Yes! "I'd rather have a walk", Ms. Comble sighed. """, 15),
    ("""'Me too!', Mr. P. Delaware cried. """, 11),
    ("They ran about 10km.", 6),
    # ("But then the 6,000-year ice age came...", 10)
    ])
    pytest.mark.xfail(("But then the 6,000-year ice age came...", 10))])
def test_tokenizer_handles_cnts(en_tokenizer, text, length):
    tokens = en_tokenizer(text)
    assert len(tokens) == length
@ -1,48 +1,43 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...gold import biluo_tags_from_offsets
|
||||
from ...vocab import Vocab
|
||||
from ...tokens.doc import Doc
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def vocab():
|
||||
return Vocab()
|
||||
|
||||
|
||||
def test_U(vocab):
|
||||
orths_and_spaces = [('I', True), ('flew', True), ('to', True), ('London', False),
|
||||
('.', True)]
|
||||
doc = Doc(vocab, orths_and_spaces=orths_and_spaces)
|
||||
def test_gold_biluo_U(en_vocab):
|
||||
orths_and_spaces = [('I', True), ('flew', True), ('to', True),
|
||||
('London', False), ('.', True)]
|
||||
doc = Doc(en_vocab, orths_and_spaces=orths_and_spaces)
|
||||
entities = [(len("I flew to "), len("I flew to London"), 'LOC')]
|
||||
tags = biluo_tags_from_offsets(doc, entities)
|
||||
assert tags == ['O', 'O', 'O', 'U-LOC', 'O']
|
||||
|
||||
|
||||
def test_BL(vocab):
|
||||
def test_gold_biluo_BL(en_vocab):
|
||||
orths_and_spaces = [('I', True), ('flew', True), ('to', True), ('San', True),
|
||||
('Francisco', False), ('.', True)]
|
||||
doc = Doc(vocab, orths_and_spaces=orths_and_spaces)
|
||||
doc = Doc(en_vocab, orths_and_spaces=orths_and_spaces)
|
||||
entities = [(len("I flew to "), len("I flew to San Francisco"), 'LOC')]
|
||||
tags = biluo_tags_from_offsets(doc, entities)
|
||||
assert tags == ['O', 'O', 'O', 'B-LOC', 'L-LOC', 'O']
|
||||
|
||||
|
||||
def test_BIL(vocab):
|
||||
def test_gold_biluo_BIL(en_vocab):
|
||||
orths_and_spaces = [('I', True), ('flew', True), ('to', True), ('San', True),
|
||||
('Francisco', True), ('Valley', False), ('.', True)]
|
||||
doc = Doc(vocab, orths_and_spaces=orths_and_spaces)
|
||||
doc = Doc(en_vocab, orths_and_spaces=orths_and_spaces)
|
||||
entities = [(len("I flew to "), len("I flew to San Francisco Valley"), 'LOC')]
|
||||
tags = biluo_tags_from_offsets(doc, entities)
|
||||
assert tags == ['O', 'O', 'O', 'B-LOC', 'I-LOC', 'L-LOC', 'O']
|
||||
|
||||
|
||||
def test_misalign(vocab):
|
||||
def test_gold_biluo_misalign(en_vocab):
|
||||
orths_and_spaces = [('I', True), ('flew', True), ('to', True), ('San', True),
|
||||
('Francisco', True), ('Valley.', False)]
|
||||
doc = Doc(vocab, orths_and_spaces=orths_and_spaces)
|
||||
doc = Doc(en_vocab, orths_and_spaces=orths_and_spaces)
|
||||
entities = [(len("I flew to "), len("I flew to San Francisco Valley"), 'LOC')]
|
||||
tags = biluo_tags_from_offsets(doc, entities)
|
||||
assert tags == ['O', 'O', 'O', '-', '-', '-']
|
||||
|
|
36
spacy/tests/gold/test_lev_align.py
Normal file
36
spacy/tests/gold/test_lev_align.py
Normal file
|
@ -0,0 +1,36 @@
|
|||
# coding: utf-8
|
||||
"""Find the min-cost alignment between two tokenizations"""
|
||||
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...gold import _min_edit_path as min_edit_path
|
||||
from ...gold import align
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
@pytest.mark.parametrize('cand,gold,path', [
|
||||
(["U.S", ".", "policy"], ["U.S.", "policy"], (0, 'MDM')),
|
||||
(["U.N", ".", "policy"], ["U.S.", "policy"], (1, 'SDM')),
|
||||
(["The", "cat", "sat", "down"], ["The", "cat", "sat", "down"], (0, 'MMMM')),
|
||||
(["cat", "sat", "down"], ["The", "cat", "sat", "down"], (1, 'IMMM')),
|
||||
(["The", "cat", "down"], ["The", "cat", "sat", "down"], (1, 'MMIM')),
|
||||
(["The", "cat", "sag", "down"], ["The", "cat", "sat", "down"], (1, 'MMSM'))])
|
||||
def test_gold_lev_align_edit_path(cand, gold, path):
|
||||
assert min_edit_path(cand, gold) == path
|
||||
|
||||
|
||||
def test_gold_lev_align_edit_path2():
|
||||
cand = ["your", "stuff"]
|
||||
gold = ["you", "r", "stuff"]
|
||||
assert min_edit_path(cand, gold) in [(2, 'ISM'), (2, 'SIM')]
|
||||
|
||||
|
||||
@pytest.mark.parametrize('cand,gold,result', [
|
||||
(["U.S", ".", "policy"], ["U.S.", "policy"], [0, None, 1]),
|
||||
(["your", "stuff"], ["you", "r", "stuff"], [None, 2]),
|
||||
(["i", "like", "2", "guys", " ", "well", "id", "just", "come", "straight", "out"],
|
||||
["i", "like", "2", "guys", "well", "i", "d", "just", "come", "straight", "out"],
|
||||
[0, 1, 2, 3, None, 4, None, 7, 8, 9, 10])])
|
||||
def test_gold_lev_align(cand, gold, result):
|
||||
assert align(cand, gold) == result
|
|
@ -3,7 +3,6 @@ from __future__ import unicode_literals
|
|||
|
||||
import pytest
|
||||
|
||||
|
||||
DEFAULT_TESTS = [
|
||||
('N. kormányzósági\nszékhely.', ['N.', 'kormányzósági', 'székhely', '.']),
|
||||
('A .hu egy tld.', ['A', '.hu', 'egy', 'tld', '.']),
|
||||
|
@ -24,23 +23,27 @@ DEFAULT_TESTS = [
|
|||
|
||||
HYPHEN_TESTS = [
|
||||
('Egy -nak, -jaiért, -magyar, bel- van.', ['Egy', '-nak', ',', '-jaiért', ',', '-magyar', ',', 'bel-', 'van', '.']),
|
||||
('Szabolcs-Szatmár-Bereg megye', ['Szabolcs-Szatmár-Bereg', 'megye']),
|
||||
('Egy -nak.', ['Egy', '-nak', '.']),
|
||||
('Egy bel-.', ['Egy', 'bel-', '.']),
|
||||
('Dinnye-domb-.', ['Dinnye-domb-', '.']),
|
||||
('Ezen -e elcsatangolt.', ['Ezen', '-e', 'elcsatangolt', '.']),
|
||||
('Lakik-e', ['Lakik', '-e']),
|
||||
('A--B', ['A', '--', 'B']),
|
||||
('Lakik-e?', ['Lakik', '-e', '?']),
|
||||
('Lakik-e.', ['Lakik', '-e', '.']),
|
||||
('Lakik-e...', ['Lakik', '-e', '...']),
|
||||
('Lakik-e... van.', ['Lakik', '-e', '...', 'van', '.']),
|
||||
('Lakik-e van?', ['Lakik', '-e', 'van', '?']),
|
||||
('Lakik-elem van?', ['Lakik-elem', 'van', '?']),
|
||||
('Az életbiztosításáról- egy.', ['Az', 'életbiztosításáról-', 'egy', '.']),
|
||||
('Van lakik-elem.', ['Van', 'lakik-elem', '.']),
|
||||
('A 7-es busz?', ['A', '7-es', 'busz', '?']),
|
||||
('A 7-es?', ['A', '7-es', '?']),
|
||||
('A 7-es.', ['A', '7-es', '.']),
|
||||
('Ez (lakik)-e?', ['Ez', '(', 'lakik', ')', '-e', '?']),
|
||||
('A %-sal.', ['A', '%-sal', '.']),
|
||||
('A $-sal.', ['A', '$-sal', '.']),
|
||||
('A CD-ROM-okrol.', ['A', 'CD-ROM-okrol', '.'])
|
||||
]
|
||||
|
||||
|
@ -89,11 +92,15 @@ NUMBER_TESTS = [
|
|||
('A -23,12 van.', ['A', '-23,12', 'van', '.']),
|
||||
('A -23,12-ben van.', ['A', '-23,12-ben', 'van', '.']),
|
||||
('A -23,12-ben.', ['A', '-23,12-ben', '.']),
|
||||
('A 2+3 van.', ['A', '2', '+', '3', 'van', '.']),
|
||||
('A 2 +3 van.', ['A', '2', '+', '3', 'van', '.']),
|
||||
('A 2+3 van.', ['A', '2+3', 'van', '.']),
|
||||
('A 2<3 van.', ['A', '2<3', 'van', '.']),
|
||||
('A 2=3 van.', ['A', '2=3', 'van', '.']),
|
||||
('A 2÷3 van.', ['A', '2÷3', 'van', '.']),
|
||||
('A 1=(2÷3)-2/5 van.', ['A', '1=(2÷3)-2/5', 'van', '.']),
|
||||
('A 2 +3 van.', ['A', '2', '+3', 'van', '.']),
|
||||
('A 2+ 3 van.', ['A', '2', '+', '3', 'van', '.']),
|
||||
('A 2 + 3 van.', ['A', '2', '+', '3', 'van', '.']),
|
||||
('A 2*3 van.', ['A', '2', '*', '3', 'van', '.']),
|
||||
('A 2*3 van.', ['A', '2*3', 'van', '.']),
|
||||
('A 2 *3 van.', ['A', '2', '*', '3', 'van', '.']),
|
||||
('A 2* 3 van.', ['A', '2', '*', '3', 'van', '.']),
|
||||
('A 2 * 3 van.', ['A', '2', '*', '3', 'van', '.']),
|
||||
|
@ -141,7 +148,8 @@ NUMBER_TESTS = [
|
|||
('A 15.-ben.', ['A', '15.-ben', '.']),
|
||||
('A 2002--2003. van.', ['A', '2002--2003.', 'van', '.']),
|
||||
('A 2002--2003-ben van.', ['A', '2002--2003-ben', 'van', '.']),
|
||||
('A 2002--2003-ben.', ['A', '2002--2003-ben', '.']),
|
||||
('A 2002-2003-ben.', ['A', '2002-2003-ben', '.']),
|
||||
('A +0,99% van.', ['A', '+0,99%', 'van', '.']),
|
||||
('A -0,99% van.', ['A', '-0,99%', 'van', '.']),
|
||||
('A -0,99%-ben van.', ['A', '-0,99%-ben', 'van', '.']),
|
||||
('A -0,99%.', ['A', '-0,99%', '.']),
|
||||
|
@ -194,23 +202,33 @@ NUMBER_TESTS = [
|
|||
('A III/c-ben.', ['A', 'III/c-ben', '.']),
|
||||
('A TU–154 van.', ['A', 'TU–154', 'van', '.']),
|
||||
('A TU–154-ben van.', ['A', 'TU–154-ben', 'van', '.']),
|
||||
('A TU–154-ben.', ['A', 'TU–154-ben', '.'])
|
||||
('A TU–154-ben.', ['A', 'TU–154-ben', '.']),
|
||||
('A 5cm³', ['A', '5', 'cm³']),
|
||||
('A 5 $-ban', ['A', '5', '$-ban']),
|
||||
('A 5$-ban', ['A', '5$-ban']),
|
||||
('A 5$.', ['A', '5', '$', '.']),
|
||||
('A 5$', ['A', '5', '$']),
|
||||
('A $5', ['A', '$5']),
|
||||
('A 5km/h', ['A', '5', 'km/h']),
|
||||
('A 75%+1-100%-ig', ['A', '75%+1-100%-ig']),
|
||||
('A 5km/h.', ['A', '5', 'km/h', '.']),
|
||||
('3434/1992. évi elszámolás', ['3434/1992.', 'évi', 'elszámolás']),
|
||||
]
|
||||
|
||||
QUOTE_TESTS = [
|
||||
('Az "Ime, hat"-ban irja.', ['Az', '"', 'Ime', ',', 'hat', '"', '-ban', 'irja', '.']),
|
||||
('"Ime, hat"-ban irja.', ['"', 'Ime', ',', 'hat', '"', '-ban', 'irja', '.']),
|
||||
('Az "Ime, hat".', ['Az', '"', 'Ime', ',', 'hat', '"', '.']),
|
||||
('Egy 24"-os monitor.', ['Egy', '24', '"', '-os', 'monitor', '.']),
|
||||
("A don't van.", ['A', "don't", 'van', '.'])
|
||||
('Egy 24"-os monitor.', ['Egy', '24"-os', 'monitor', '.']),
|
||||
("A McDonald's van.", ['A', "McDonald's", 'van', '.'])
|
||||
]
|
||||
|
||||
DOT_TESTS = [
|
||||
('N. kormányzósági\nszékhely.', ['N.', 'kormányzósági', 'székhely', '.']),
|
||||
('A .hu egy tld.', ['A', '.hu', 'egy', 'tld', '.']),
|
||||
('Az egy.ketto pelda.', ['Az', 'egy.ketto', 'pelda', '.']),
|
||||
('A pl. rovidites.', ['A', 'pl.', 'rovidites', '.']),
|
||||
('A S.M.A.R.T. szo.', ['A', 'S.M.A.R.T.', 'szo', '.']),
|
||||
('A pl. rövidítés.', ['A', 'pl.', 'rövidítés', '.']),
|
||||
('A S.M.A.R.T. szó.', ['A', 'S.M.A.R.T.', 'szó', '.']),
|
||||
('A .hu.', ['A', '.hu', '.']),
|
||||
('Az egy.ketto.', ['Az', 'egy.ketto', '.']),
|
||||
('A pl.', ['A', 'pl.']),
|
||||
|
@ -223,8 +241,19 @@ DOT_TESTS = [
|
|||
('Valami ... más.', ['Valami', '...', 'más', '.'])
|
||||
]
|
||||
|
||||
WIKI_TESTS = [
|
||||
('!"', ['!', '"']),
|
||||
('lány"a', ['lány', '"', 'a']),
|
||||
('lány"a', ['lány', '"', 'a']),
|
||||
('!"-lel', ['!', '"', '-lel']),
|
||||
('""-sorozat ', ['"', '"', '-sorozat']),
|
||||
('"(Köszönöm', ['"', '(', 'Köszönöm']),
|
||||
('(törvénykönyv)-ben ', ['(', 'törvénykönyv', ')', '-ben']),
|
||||
('"(...)"–sokkal ', ['"', '(', '...', ')', '"', '–sokkal']),
|
||||
('cérium(IV)-oxid', ['cérium', '(', 'IV', ')', '-oxid'])
|
||||
]
|
||||
|
||||
TESTCASES = DEFAULT_TESTS + DOT_TESTS + QUOTE_TESTS # + NUMBER_TESTS + HYPHEN_TESTS
|
||||
TESTCASES = DEFAULT_TESTS + DOT_TESTS + QUOTE_TESTS + NUMBER_TESTS + HYPHEN_TESTS + WIKI_TESTS
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text,expected_tokens', TESTCASES)
|
||||
|
|
|
@ -1,21 +0,0 @@
|
|||
# encoding: utf8
|
||||
from __future__ import unicode_literals
|
||||
from ...fr import French
|
||||
from ...nl import Dutch
|
||||
|
||||
def test_load_french():
|
||||
nlp = French()
|
||||
doc = nlp(u'Parlez-vous français?')
|
||||
assert doc[0].text == u'Parlez'
|
||||
assert doc[1].text == u'-'
|
||||
assert doc[2].text == u'vous'
|
||||
assert doc[3].text == u'français'
|
||||
assert doc[4].text == u'?'
|
||||
|
||||
def test_load_dutch():
|
||||
nlp = Dutch()
|
||||
doc = nlp(u'Is dit Nederlands?')
|
||||
assert doc[0].text == u'Is'
|
||||
assert doc[1].text == u'dit'
|
||||
assert doc[2].text == u'Nederlands'
|
||||
assert doc[3].text == u'?'
|
|
@ -1,4 +1,5 @@
|
|||
# -*- coding: utf-8 -*-
|
||||
# coding: utf-8
|
||||
|
||||
import pytest
|
||||
import numpy
|
||||
|
||||
|
@ -47,7 +48,7 @@ class TestModelSanity:
|
|||
|
||||
def test_vectors(self, example):
|
||||
# if vectors are available, they should differ on different words
|
||||
# this isn't a perfect test since this could in principle fail
|
||||
# this isn't a perfect test since this could in principle fail
|
||||
# in a sane model as well,
|
||||
# but that's very unlikely and a good indicator if something is wrong
|
||||
vector0 = example[0].vector
|
||||
|
@ -58,9 +59,9 @@ class TestModelSanity:
|
|||
assert not numpy.array_equal(vector1,vector2)
|
||||
|
||||
def test_probs(self, example):
|
||||
# if frequencies/probabilities are okay, they should differ for
|
||||
# if frequencies/probabilities are okay, they should differ for
|
||||
# different words
|
||||
# this isn't a perfect test since this could in principle fail
|
||||
# this isn't a perfect test since this could in principle fail
|
||||
# in a sane model as well,
|
||||
# but that's very unlikely and a good indicator if something is wrong
|
||||
prob0 = example[0].prob
|
||||
|
|
|
@ -1,59 +1,53 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
import spacy
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.matcher import Matcher
|
||||
from spacy.tokens.doc import Doc
|
||||
from spacy.attrs import *
|
||||
|
||||
from ...matcher import Matcher
|
||||
from ...attrs import ORTH
|
||||
from ..util import get_doc
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def en_vocab():
|
||||
return spacy.get_lang_class('en').Defaults.create_vocab()
|
||||
|
||||
|
||||
def test_init_matcher(en_vocab):
|
||||
@pytest.mark.parametrize('words,entity', [
|
||||
(["Test", "Entity"], "TestEntity")])
|
||||
def test_matcher_add_empty_entity(en_vocab, words, entity):
|
||||
matcher = Matcher(en_vocab)
|
||||
matcher.add_entity(entity)
|
||||
doc = get_doc(en_vocab, words)
|
||||
assert matcher.n_patterns == 0
|
||||
assert matcher(Doc(en_vocab, words=[u'Some', u'words'])) == []
|
||||
assert matcher(doc) == []
|
||||
|
||||
|
||||
def test_add_empty_entity(en_vocab):
|
||||
@pytest.mark.parametrize('entity1,entity2,attrs', [
|
||||
("TestEntity", "TestEntity2", {"Hello": "World"})])
|
||||
def test_matcher_get_entity_attrs(en_vocab, entity1, entity2, attrs):
|
||||
matcher = Matcher(en_vocab)
|
||||
matcher.add_entity('TestEntity')
|
||||
matcher.add_entity(entity1)
|
||||
assert matcher.get_entity(entity1) == {}
|
||||
matcher.add_entity(entity2, attrs=attrs)
|
||||
assert matcher.get_entity(entity2) == attrs
|
||||
assert matcher.get_entity(entity1) == {}
|
||||
|
||||
|
||||
@pytest.mark.parametrize('words,entity,attrs',
|
||||
[(["Test", "Entity"], "TestEntity", {"Hello": "World"})])
|
||||
def test_matcher_get_entity_via_match(en_vocab, words, entity, attrs):
|
||||
matcher = Matcher(en_vocab)
|
||||
matcher.add_entity(entity, attrs=attrs)
|
||||
doc = get_doc(en_vocab, words)
|
||||
assert matcher.n_patterns == 0
|
||||
assert matcher(Doc(en_vocab, words=[u'Test', u'Entity'])) == []
|
||||
assert matcher(doc) == []
|
||||
|
||||
|
||||
def test_get_entity_attrs(en_vocab):
|
||||
matcher = Matcher(en_vocab)
|
||||
matcher.add_entity('TestEntity')
|
||||
entity = matcher.get_entity('TestEntity')
|
||||
assert entity == {}
|
||||
matcher.add_entity('TestEntity2', attrs={'Hello': 'World'})
|
||||
entity = matcher.get_entity('TestEntity2')
|
||||
assert entity == {'Hello': 'World'}
|
||||
assert matcher.get_entity('TestEntity') == {}
|
||||
|
||||
|
||||
def test_get_entity_via_match(en_vocab):
|
||||
matcher = Matcher(en_vocab)
|
||||
matcher.add_entity('TestEntity', attrs={u'Hello': u'World'})
|
||||
assert matcher.n_patterns == 0
|
||||
assert matcher(Doc(en_vocab, words=[u'Test', u'Entity'])) == []
|
||||
matcher.add_pattern(u'TestEntity', [{ORTH: u'Test'}, {ORTH: u'Entity'}])
|
||||
matcher.add_pattern(entity, [{ORTH: words[0]}, {ORTH: words[1]}])
|
||||
assert matcher.n_patterns == 1
|
||||
matches = matcher(Doc(en_vocab, words=[u'Test', u'Entity']))
|
||||
|
||||
matches = matcher(doc)
|
||||
assert len(matches) == 1
|
||||
assert len(matches[0]) == 4
|
||||
|
||||
ent_id, label, start, end = matches[0]
|
||||
assert ent_id == matcher.vocab.strings[u'TestEntity']
|
||||
assert ent_id == matcher.vocab.strings[entity]
|
||||
assert label == 0
|
||||
assert start == 0
|
||||
assert end == 2
|
||||
attrs = matcher.get_entity(ent_id)
|
||||
assert attrs == {u'Hello': u'World'}
|
||||
|
||||
|
||||
|
||||
assert matcher.get_entity(ent_id) == attrs
|
||||
|
|
spacy/tests/matcher/test_matcher.py (new file, +107)
@@ -0,0 +1,107 @@
# coding: utf-8
from __future__ import unicode_literals

from ...matcher import Matcher, PhraseMatcher
from ..util import get_doc

import pytest


@pytest.fixture
def matcher(en_vocab):
    patterns = {
        'JS': ['PRODUCT', {}, [[{'ORTH': 'JavaScript'}]]],
        'GoogleNow': ['PRODUCT', {}, [[{'ORTH': 'Google'}, {'ORTH': 'Now'}]]],
        'Java': ['PRODUCT', {}, [[{'LOWER': 'java'}]]]
    }
    return Matcher(en_vocab, patterns)


@pytest.mark.parametrize('words', [["Some", "words"]])
def test_matcher_init(en_vocab, words):
    matcher = Matcher(en_vocab)
    doc = get_doc(en_vocab, words)
    assert matcher.n_patterns == 0
    assert matcher(doc) == []


def test_matcher_no_match(matcher):
    words = ["I", "like", "cheese", "."]
    doc = get_doc(matcher.vocab, words)
    assert matcher(doc) == []


def test_matcher_compile(matcher):
    assert matcher.n_patterns == 3


def test_matcher_match_start(matcher):
    words = ["JavaScript", "is", "good"]
    doc = get_doc(matcher.vocab, words)
    assert matcher(doc) == [(matcher.vocab.strings['JS'],
                             matcher.vocab.strings['PRODUCT'], 0, 1)]


def test_matcher_match_end(matcher):
    words = ["I", "like", "java"]
    doc = get_doc(matcher.vocab, words)
    assert matcher(doc) == [(doc.vocab.strings['Java'],
                             doc.vocab.strings['PRODUCT'], 2, 3)]


def test_matcher_match_middle(matcher):
    words = ["I", "like", "Google", "Now", "best"]
    doc = get_doc(matcher.vocab, words)
    assert matcher(doc) == [(doc.vocab.strings['GoogleNow'],
                             doc.vocab.strings['PRODUCT'], 2, 4)]


def test_matcher_match_multi(matcher):
    words = ["I", "like", "Google", "Now", "and", "java", "best"]
    doc = get_doc(matcher.vocab, words)
    assert matcher(doc) == [(doc.vocab.strings['GoogleNow'],
                             doc.vocab.strings['PRODUCT'], 2, 4),
                            (doc.vocab.strings['Java'],
                             doc.vocab.strings['PRODUCT'], 5, 6)]


def test_matcher_phrase_matcher(en_vocab):
    words = ["Google", "Now"]
    doc = get_doc(en_vocab, words)
    matcher = PhraseMatcher(en_vocab, [doc])
    words = ["I", "like", "Google", "Now", "best"]
    doc = get_doc(en_vocab, words)
    assert len(matcher(doc)) == 1


def test_matcher_match_zero(matcher):
    words1 = 'He said , " some words " ...'.split()
    words2 = 'He said , " some three words " ...'.split()
    pattern1 = [{'ORTH': '"'},
                {'OP': '!', 'IS_PUNCT': True},
                {'OP': '!', 'IS_PUNCT': True},
                {'ORTH': '"'}]
    pattern2 = [{'ORTH': '"'},
                {'IS_PUNCT': True},
                {'IS_PUNCT': True},
                {'IS_PUNCT': True},
                {'ORTH': '"'}]

    matcher.add('Quote', '', {}, [pattern1])
    doc = get_doc(matcher.vocab, words1)
    assert len(matcher(doc)) == 1

    doc = get_doc(matcher.vocab, words2)
    assert len(matcher(doc)) == 0
    matcher.add('Quote', '', {}, [pattern2])
    assert len(matcher(doc)) == 0


def test_matcher_match_zero_plus(matcher):
    words = 'He said , " some words " ...'.split()
    pattern = [{'ORTH': '"'},
               {'OP': '*', 'IS_PUNCT': False},
               {'ORTH': '"'}]
    matcher.add('Quote', '', {}, [pattern])
    doc = get_doc(matcher.vocab, words)
    assert len(matcher(doc)) == 1
@ -1,200 +0,0 @@
|
|||
import pytest
|
||||
import numpy
|
||||
import os
|
||||
|
||||
import spacy
|
||||
from spacy.matcher import Matcher
|
||||
from spacy.attrs import ORTH, LOWER, ENT_IOB, ENT_TYPE
|
||||
from spacy.attrs import ORTH, TAG, LOWER, IS_ALPHA, FLAG63
|
||||
from spacy.symbols import DATE, LOC
|
||||
|
||||
|
||||
def test_overlap_issue118(EN):
|
||||
'''Test a bug that arose from having overlapping matches'''
|
||||
doc = EN.tokenizer(u'how many points did lebron james score against the boston celtics last night')
|
||||
ORG = doc.vocab.strings['ORG']
|
||||
matcher = Matcher(EN.vocab,
|
||||
{'BostonCeltics':
|
||||
('ORG', {},
|
||||
[
|
||||
[{LOWER: 'celtics'}],
|
||||
[{LOWER: 'boston'}, {LOWER: 'celtics'}],
|
||||
]
|
||||
)
|
||||
}
|
||||
)
|
||||
|
||||
assert len(list(doc.ents)) == 0
|
||||
matches = [(ent_type, start, end) for ent_id, ent_type, start, end in matcher(doc)]
|
||||
assert matches == [(ORG, 9, 11), (ORG, 10, 11)]
|
||||
doc.ents = matches[:1]
|
||||
ents = list(doc.ents)
|
||||
assert len(ents) == 1
|
||||
assert ents[0].label == ORG
|
||||
assert ents[0].start == 9
|
||||
assert ents[0].end == 11
|
||||
|
||||
|
||||
def test_overlap_issue242():
|
||||
'''Test overlapping multi-word phrases.'''
|
||||
|
||||
patterns = [
|
||||
[{LOWER: 'food'}, {LOWER: 'safety'}],
|
||||
[{LOWER: 'safety'}, {LOWER: 'standards'}],
|
||||
]
|
||||
|
||||
if os.environ.get('SPACY_DATA'):
|
||||
data_dir = os.environ.get('SPACY_DATA')
|
||||
else:
|
||||
data_dir = None
|
||||
|
||||
nlp = spacy.en.English(path=data_dir, tagger=False, parser=False, entity=False)
|
||||
nlp.matcher = Matcher(nlp.vocab)
|
||||
|
||||
nlp.matcher.add('FOOD', 'FOOD', {}, patterns)
|
||||
|
||||
doc = nlp.tokenizer(u'There are different food safety standards in different countries.')
|
||||
|
||||
matches = [(ent_type, start, end) for ent_id, ent_type, start, end in nlp.matcher(doc)]
|
||||
doc.ents += tuple(matches)
|
||||
food_safety, safety_standards = matches
|
||||
assert food_safety[1] == 3
|
||||
assert food_safety[2] == 5
|
||||
assert safety_standards[1] == 4
|
||||
assert safety_standards[2] == 6
|
||||
|
||||
|
||||
def test_overlap_reorder(EN):
|
||||
'''Test order dependence'''
|
||||
doc = EN.tokenizer(u'how many points did lebron james score against the boston celtics last night')
|
||||
ORG = doc.vocab.strings['ORG']
|
||||
matcher = Matcher(EN.vocab,
|
||||
{'BostonCeltics':
|
||||
('ORG', {},
|
||||
[
|
||||
[{LOWER: 'boston'}, {LOWER: 'celtics'}],
|
||||
[{LOWER: 'celtics'}],
|
||||
]
|
||||
)
|
||||
}
|
||||
)
|
||||
|
||||
assert len(list(doc.ents)) == 0
|
||||
matches = [(ent_type, start, end) for ent_id, ent_type, start, end in matcher(doc)]
|
||||
assert matches == [(ORG, 9, 11), (ORG, 10, 11)]
|
||||
doc.ents = matches[:1]
|
||||
ents = list(doc.ents)
|
||||
assert len(ents) == 1
|
||||
assert ents[0].label == ORG
|
||||
assert ents[0].start == 9
|
||||
assert ents[0].end == 11
|
||||
|
||||
|
||||
def test_overlap_prefix(EN):
|
||||
'''Test order dependence'''
|
||||
doc = EN.tokenizer(u'how many points did lebron james score against the boston celtics last night')
|
||||
ORG = doc.vocab.strings['ORG']
|
||||
matcher = Matcher(EN.vocab,
|
||||
{'BostonCeltics':
|
||||
('ORG', {},
|
||||
[
|
||||
[{LOWER: 'boston'}],
|
||||
[{LOWER: 'boston'}, {LOWER: 'celtics'}],
|
||||
]
|
||||
)
|
||||
}
|
||||
)
|
||||
|
||||
assert len(list(doc.ents)) == 0
|
||||
matches = [(ent_type, start, end) for ent_id, ent_type, start, end in matcher(doc)]
|
||||
doc.ents = matches[1:]
|
||||
assert matches == [(ORG, 9, 10), (ORG, 9, 11)]
|
||||
ents = list(doc.ents)
|
||||
assert len(ents) == 1
|
||||
assert ents[0].label == ORG
|
||||
assert ents[0].start == 9
|
||||
assert ents[0].end == 11
|
||||
|
||||
|
||||
def test_overlap_prefix_reorder(EN):
|
||||
'''Test order dependence'''
|
||||
doc = EN.tokenizer(u'how many points did lebron james score against the boston celtics last night')
|
||||
ORG = doc.vocab.strings['ORG']
|
||||
matcher = Matcher(EN.vocab,
|
||||
{'BostonCeltics':
|
||||
('ORG', {},
|
||||
[
|
||||
[{LOWER: 'boston'}, {LOWER: 'celtics'}],
|
||||
[{LOWER: 'boston'}],
|
||||
]
|
||||
)
|
||||
}
|
||||
)
|
||||
|
||||
assert len(list(doc.ents)) == 0
|
||||
matches = [(ent_type, start, end) for ent_id, ent_type, start, end in matcher(doc)]
|
||||
doc.ents += tuple(matches)[1:]
|
||||
assert matches == [(ORG, 9, 10), (ORG, 9, 11)]
|
||||
ents = doc.ents
|
||||
assert len(ents) == 1
|
||||
assert ents[0].label == ORG
|
||||
assert ents[0].start == 9
|
||||
assert ents[0].end == 11
|
||||
|
||||
|
||||
# @pytest.mark.models
|
||||
# def test_ner_interaction(EN):
|
||||
# EN.matcher.add('LAX_Airport', 'AIRPORT', {}, [[{ORTH: 'LAX'}]])
|
||||
# EN.matcher.add('SFO_Airport', 'AIRPORT', {}, [[{ORTH: 'SFO'}]])
|
||||
# doc = EN(u'get me a flight from SFO to LAX leaving 20 December and arriving on January 5th')
|
||||
|
||||
# ents = [(ent.label_, ent.text) for ent in doc.ents]
|
||||
# assert ents[0] == ('AIRPORT', 'SFO')
|
||||
# assert ents[1] == ('AIRPORT', 'LAX')
|
||||
# assert ents[2] == ('DATE', '20 December')
|
||||
# assert ents[3] == ('DATE', 'January 5th')
|
||||
|
||||
|
||||
# @pytest.mark.models
|
||||
# def test_ner_interaction(EN):
|
||||
# # ensure that matcher doesn't overwrite annotations set by the NER model
|
||||
# doc = EN.tokenizer.tokens_from_list(u'get me a flight from SFO to LAX leaving 20 December and arriving on January 5th'.split(' '))
|
||||
# EN.tagger(doc)
|
||||
|
||||
# columns = [ENT_IOB, ENT_TYPE]
|
||||
# values = numpy.ndarray(shape=(len(doc),len(columns)), dtype='int32')
|
||||
# # IOB values are 0=missing, 1=I, 2=O, 3=B
|
||||
# iobs = [2,2,2,2,2,3,2,3,2,3,1,2,2,2,3,1]
|
||||
# types = [0,0,0,0,0,LOC,0,LOC,0,DATE,DATE,0,0,0,DATE,DATE]
|
||||
# values[:] = zip(iobs,types)
|
||||
# doc.from_array(columns,values)
|
||||
|
||||
# assert doc[5].ent_type_ == 'LOC'
|
||||
# assert doc[7].ent_type_ == 'LOC'
|
||||
# assert doc[9].ent_type_ == 'DATE'
|
||||
# assert doc[10].ent_type_ == 'DATE'
|
||||
# assert doc[14].ent_type_ == 'DATE'
|
||||
# assert doc[15].ent_type_ == 'DATE'
|
||||
|
||||
# EN.matcher.add('LAX_Airport', 'AIRPORT', {}, [[{ORTH: 'LAX'}]])
|
||||
# EN.matcher.add('SFO_Airport', 'AIRPORT', {}, [[{ORTH: 'SFO'}]])
|
||||
# EN.matcher(doc)
|
||||
|
||||
# assert doc[5].ent_type_ != 'AIRPORT'
|
||||
# assert doc[7].ent_type_ != 'AIRPORT'
|
||||
# assert doc[5].ent_type_ == 'LOC'
|
||||
# assert doc[7].ent_type_ == 'LOC'
|
||||
# assert doc[9].ent_type_ == 'DATE'
|
||||
# assert doc[10].ent_type_ == 'DATE'
|
||||
# assert doc[14].ent_type_ == 'DATE'
|
||||
# assert doc[15].ent_type_ == 'DATE'
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
|
@ -1,32 +0,0 @@
|
|||
from spacy.deprecated import align_tokens
|
||||
|
||||
|
||||
def test_perfect_align():
|
||||
ref = ['I', 'align', 'perfectly']
|
||||
indices = []
|
||||
i = 0
|
||||
for token in ref:
|
||||
indices.append((i, i + len(token)))
|
||||
i += len(token)
|
||||
aligned = list(align_tokens(ref, indices))
|
||||
assert aligned[0] == ('I', [(0, 1)])
|
||||
assert aligned[1] == ('align', [(1, 6)])
|
||||
assert aligned[2] == ('perfectly', [(6, 15)])
|
||||
|
||||
|
||||
def test_hyphen_align():
|
||||
ref = ['I', 'must', 're-align']
|
||||
indices = [(0, 1), (1, 5), (5, 7), (7, 8), (8, 13)]
|
||||
aligned = list(align_tokens(ref, indices))
|
||||
assert aligned[0] == ('I', [(0, 1)])
|
||||
assert aligned[1] == ('must', [(1, 5)])
|
||||
assert aligned[2] == ('re-align', [(5, 7), (7, 8), (8, 13)])
|
||||
|
||||
|
||||
def test_align_continue():
|
||||
ref = ['I', 'must', 're-align', 'and', 'continue']
|
||||
indices = [(0, 1), (1, 5), (5, 7), (7, 8), (8, 13), (13, 16), (16, 24)]
|
||||
aligned = list(align_tokens(ref, indices))
|
||||
assert aligned[2] == ('re-align', [(5, 7), (7, 8), (8, 13)])
|
||||
assert aligned[3] == ('and', [(13, 16)])
|
||||
assert aligned[4] == ('continue', [(16, 24)])
|
|
@ -1,59 +0,0 @@
|
|||
import spacy.munge.read_conll
|
||||
|
||||
hongbin_example = """
|
||||
1 2. 0. LS _ 24 meta _ _ _
|
||||
2 . . . _ 1 punct _ _ _
|
||||
3 Wang wang NNP _ 4 compound _ _ _
|
||||
4 Hongbin hongbin NNP _ 16 nsubj _ _ _
|
||||
5 , , , _ 4 punct _ _ _
|
||||
6 the the DT _ 11 det _ _ _
|
||||
7 " " `` _ 11 punct _ _ _
|
||||
8 communist communist JJ _ 11 amod _ _ _
|
||||
9 trail trail NN _ 11 compound _ _ _
|
||||
10 - - HYPH _ 11 punct _ _ _
|
||||
11 blazer blazer NN _ 4 appos _ _ _
|
||||
12 , , , _ 16 punct _ _ _
|
||||
13 " " '' _ 16 punct _ _ _
|
||||
14 has have VBZ _ 16 aux _ _ _
|
||||
15 not not RB _ 16 neg _ _ _
|
||||
16 turned turn VBN _ 24 ccomp _ _ _
|
||||
17 into into IN syn=CLR 16 prep _ _ _
|
||||
18 a a DT _ 19 det _ _ _
|
||||
19 capitalist capitalist NN _ 17 pobj _ _ _
|
||||
20 ( ( -LRB- _ 24 punct _ _ _
|
||||
21 he he PRP _ 24 nsubj _ _ _
|
||||
22 does do VBZ _ 24 aux _ _ _
|
||||
23 n't not RB _ 24 neg _ _ _
|
||||
24 have have VB _ 0 root _ _ _
|
||||
25 any any DT _ 26 det _ _ _
|
||||
26 shares share NNS _ 24 dobj _ _ _
|
||||
27 , , , _ 24 punct _ _ _
|
||||
28 does do VBZ _ 30 aux _ _ _
|
||||
29 n't not RB _ 30 neg _ _ _
|
||||
30 have have VB _ 24 conj _ _ _
|
||||
31 any any DT _ 32 det _ _ _
|
||||
32 savings saving NNS _ 30 dobj _ _ _
|
||||
33 , , , _ 30 punct _ _ _
|
||||
34 does do VBZ _ 36 aux _ _ _
|
||||
35 n't not RB _ 36 neg _ _ _
|
||||
36 have have VB _ 30 conj _ _ _
|
||||
37 his his PRP$ _ 39 poss _ _ _
|
||||
38 own own JJ _ 39 amod _ _ _
|
||||
39 car car NN _ 36 dobj _ _ _
|
||||
40 , , , _ 36 punct _ _ _
|
||||
41 and and CC _ 36 cc _ _ _
|
||||
42 does do VBZ _ 44 aux _ _ _
|
||||
43 n't not RB _ 44 neg _ _ _
|
||||
44 have have VB _ 36 conj _ _ _
|
||||
45 a a DT _ 46 det _ _ _
|
||||
46 mansion mansion NN _ 44 dobj _ _ _
|
||||
47 ; ; . _ 24 punct _ _ _
|
||||
""".strip()
|
||||
|
||||
|
||||
def test_hongbin():
|
||||
words, annot = spacy.munge.read_conll.parse(hongbin_example, strip_bad_periods=True)
|
||||
assert words[annot[0]['head']] == 'have'
|
||||
assert words[annot[1]['head']] == 'Hongbin'
|
||||
|
||||
|
|
@ -1,21 +0,0 @@
|
|||
from spacy.deprecated import detokenize
|
||||
|
||||
def test_punct():
|
||||
tokens = 'Pierre Vinken , 61 years old .'.split()
|
||||
detoks = [(0,), (1, 2), (3,), (4,), (5, 6)]
|
||||
token_rules = ('<SEP>,', '<SEP>.')
|
||||
assert detokenize(token_rules, tokens) == detoks
|
||||
|
||||
|
||||
def test_contractions():
|
||||
tokens = "I ca n't even".split()
|
||||
detoks = [(0,), (1, 2), (3,)]
|
||||
token_rules = ("ca<SEP>n't",)
|
||||
assert detokenize(token_rules, tokens) == detoks
|
||||
|
||||
|
||||
def test_contractions_punct():
|
||||
tokens = "I ca n't !".split()
|
||||
detoks = [(0,), (1, 2, 3)]
|
||||
token_rules = ("ca<SEP>n't", '<SEP>!')
|
||||
assert detokenize(token_rules, tokens) == detoks
|
|
@ -1,42 +0,0 @@
|
|||
"""Find the min-cost alignment between two tokenizations"""
|
||||
from spacy.gold import _min_edit_path as min_edit_path
|
||||
from spacy.gold import align
|
||||
|
||||
|
||||
def test_edit_path():
|
||||
cand = ["U.S", ".", "policy"]
|
||||
gold = ["U.S.", "policy"]
|
||||
assert min_edit_path(cand, gold) == (0, 'MDM')
|
||||
cand = ["U.N", ".", "policy"]
|
||||
gold = ["U.S.", "policy"]
|
||||
assert min_edit_path(cand, gold) == (1, 'SDM')
|
||||
cand = ["The", "cat", "sat", "down"]
|
||||
gold = ["The", "cat", "sat", "down"]
|
||||
assert min_edit_path(cand, gold) == (0, 'MMMM')
|
||||
cand = ["cat", "sat", "down"]
|
||||
gold = ["The", "cat", "sat", "down"]
|
||||
assert min_edit_path(cand, gold) == (1, 'IMMM')
|
||||
cand = ["The", "cat", "down"]
|
||||
gold = ["The", "cat", "sat", "down"]
|
||||
assert min_edit_path(cand, gold) == (1, 'MMIM')
|
||||
cand = ["The", "cat", "sag", "down"]
|
||||
gold = ["The", "cat", "sat", "down"]
|
||||
assert min_edit_path(cand, gold) == (1, 'MMSM')
|
||||
cand = ["your", "stuff"]
|
||||
gold = ["you", "r", "stuff"]
|
||||
assert min_edit_path(cand, gold) in [(2, 'ISM'), (2, 'SIM')]
|
||||
|
||||
|
||||
def test_align():
|
||||
cand = ["U.S", ".", "policy"]
|
||||
gold = ["U.S.", "policy"]
|
||||
assert align(cand, gold) == [0, None, 1]
|
||||
cand = ["your", "stuff"]
|
||||
gold = ["you", "r", "stuff"]
|
||||
assert align(cand, gold) == [None, 2]
|
||||
cand = [u'i', u'like', u'2', u'guys', u' ', u'well', u'id', u'just',
|
||||
u'come', u'straight', u'out']
|
||||
gold = [u'i', u'like', u'2', u'guys', u'well', u'i', u'd', u'just', u'come',
|
||||
u'straight', u'out']
|
||||
assert align(cand, gold) == [0, 1, 2, 3, None, 4, None, 7, 8, 9, 10]
|
||||
|
|
@ -1,16 +0,0 @@
|
|||
from spacy.munge.read_ner import _get_text, _get_tag
|
||||
|
||||
|
||||
def test_get_text():
|
||||
assert _get_text('asbestos') == 'asbestos'
|
||||
assert _get_text('<ENAMEX TYPE="ORG">Lorillard</ENAMEX>') == 'Lorillard'
|
||||
assert _get_text('<ENAMEX TYPE="DATE">more') == 'more'
|
||||
assert _get_text('ago</ENAMEX>') == 'ago'
|
||||
|
||||
|
||||
def test_get_tag():
|
||||
assert _get_tag('asbestos', None) == ('O', None)
|
||||
assert _get_tag('asbestos', 'PER') == ('I-PER', 'PER')
|
||||
assert _get_tag('<ENAMEX TYPE="ORG">Lorillard</ENAMEX>', None) == ('U-ORG', None)
|
||||
assert _get_tag('<ENAMEX TYPE="DATE">more', None) == ('B-DATE', 'DATE')
|
||||
assert _get_tag('ago</ENAMEX>', 'DATE') == ('L-DATE', None)
|
|
@ -1,247 +0,0 @@
|
|||
# encoding: utf-8
|
||||
# SBD tests from "Pragmatic Segmenter"
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from spacy.en import English
|
||||
|
||||
EN = English()
|
||||
|
||||
|
||||
def get_sent_strings(text):
|
||||
tokens = EN(text)
|
||||
sents = []
|
||||
for sent in tokens.sents:
|
||||
sents.append(''.join(tokens[i].string
|
||||
for i in range(sent.start, sent.end)).strip())
|
||||
return sents
|
||||
|
||||
|
||||
def test_gr1():
|
||||
sents = get_sent_strings("Hello World. My name is Jonas.")
|
||||
assert sents == ["Hello World.", "My name is Jonas."]
|
||||
|
||||
|
||||
def test_gr2():
|
||||
sents = get_sent_strings("What is your name? My name is Jonas.")
|
||||
assert sents == ["What is your name?", "My name is Jonas."]
|
||||
|
||||
|
||||
def test_gr3():
|
||||
sents = get_sent_strings("There it is! I found it.")
|
||||
assert sents == ["There it is!", "I found it."]
|
||||
|
||||
|
||||
def test_gr4():
|
||||
sents = get_sent_strings("My name is Jonas E. Smith.")
|
||||
assert sents == ["My name is Jonas E. Smith."]
|
||||
|
||||
|
||||
def test_gr5():
|
||||
sents = get_sent_strings("Please turn to p. 55.")
|
||||
assert sents == ["Please turn to p. 55."]
|
||||
|
||||
|
||||
def test_gr6():
|
||||
sents = get_sent_strings("Were Jane and co. at the party?")
|
||||
assert sents == ["Were Jane and co. at the party?"]
|
||||
|
||||
|
||||
def test_gr7():
|
||||
sents = get_sent_strings("They closed the deal with Pitt, Briggs & Co. at noon.")
|
||||
assert sents == ["They closed the deal with Pitt, Briggs & Co. at noon."]
|
||||
|
||||
|
||||
def test_gr8():
|
||||
sents = get_sent_strings("Let's ask Jane and co. They should know.")
|
||||
assert sents == ["Let's ask Jane and co.", "They should know."]
|
||||
|
||||
|
||||
def test_gr9():
|
||||
sents = get_sent_strings("They closed the deal with Pitt, Briggs & Co. It closed yesterday.")
|
||||
assert sents == ["They closed the deal with Pitt, Briggs & Co.", "It closed yesterday."]
|
||||
|
||||
|
||||
def test_gr10():
|
||||
sents = get_sent_strings("I can see Mt. Fuji from here.")
|
||||
assert sents == ["I can see Mt. Fuji from here."]
|
||||
|
||||
|
||||
def test_gr11():
|
||||
sents = get_sent_strings("St. Michael's Church is on 5th st. near the light.")
|
||||
assert sents == ["St. Michael's Church is on 5th st. near the light."]
|
||||
|
||||
|
||||
def test_gr12():
|
||||
sents = get_sent_strings("That is JFK Jr.'s book.")
|
||||
assert sents == ["That is JFK Jr.'s book."]
|
||||
|
||||
|
||||
def test_gr13():
|
||||
sents = get_sent_strings("I visited the U.S.A. last year.")
|
||||
assert sents == ["I visited the U.S.A. last year."]
|
||||
|
||||
|
||||
def test_gr14():
|
||||
sents = get_sent_strings("I live in the E.U. How about you?")
|
||||
assert sents == ["I live in the E.U.", "How about you?"]
|
||||
|
||||
|
||||
def test_gr15():
|
||||
sents = get_sent_strings("I live in the U.S. How about you?")
|
||||
assert sents == ["I live in the U.S.", "How about you?"]
|
||||
|
||||
|
||||
def test_gr16():
|
||||
sents = get_sent_strings("I work for the U.S. Government in Virginia.")
|
||||
assert sents == ["I work for the U.S. Government in Virginia."]
|
||||
|
||||
|
||||
def test_gr17():
|
||||
sents = get_sent_strings("I have lived in the U.S. for 20 years.")
|
||||
assert sents == ["I have lived in the U.S. for 20 years."]
|
||||
|
||||
|
||||
def test_gr18():
|
||||
sents = get_sent_strings("At 5 a.m. Mr. Smith went to the bank. He left the bank at 6 P.M. Mr. Smith then went to the store.")
|
||||
assert sents == ["At 5 a.m. Mr. Smith went to the bank.", "He left the bank at 6 P.M.", "Mr. Smith then went to the store."]
|
||||
|
||||
|
||||
def test_gr19():
|
||||
sents = get_sent_strings("She has $100.00 in her bag.")
|
||||
assert sents == ["She has $100.00 in her bag."]
|
||||
|
||||
|
||||
def test_gr20():
|
||||
sents = get_sent_strings("She has $100.00. It is in her bag.")
|
||||
assert sents == ["She has $100.00.", "It is in her bag."]
|
||||
|
||||
|
||||
def test_gr21():
|
||||
sents = get_sent_strings("He teaches science (He previously worked for 5 years as an engineer.) at the local University.")
|
||||
assert sents == ["He teaches science (He previously worked for 5 years as an engineer.) at the local University."]
|
||||
|
||||
|
||||
def test_gr22():
|
||||
sents = get_sent_strings("Her email is Jane.Doe@example.com. I sent her an email.")
|
||||
assert sents == ["Her email is Jane.Doe@example.com.", "I sent her an email."]
|
||||
|
||||
|
||||
def test_gr23():
|
||||
sents = get_sent_strings("The site is: https://www.example.50.com/new-site/awesome_content.html. Please check it out.")
|
||||
assert sents == ["The site is: https://www.example.50.com/new-site/awesome_content.html.", "Please check it out."]
|
||||
|
||||
def test_gr24():
|
||||
sents = get_sent_strings("She turned to him, 'This is great.' she said.")
|
||||
assert sents == ["She turned to him, 'This is great.' she said."]
|
||||
|
||||
def test_gr25():
|
||||
sents = get_sent_strings('She turned to him, "This is great." she said.')
|
||||
assert sents == ['She turned to him, "This is great." she said.']
|
||||
|
||||
def test_gr26():
|
||||
sents = get_sent_strings('She turned to him, "This is great." She held the book out to show him.')
|
||||
assert sents == ['She turned to him, "This is great."', "She held the book out to show him."]
|
||||
|
||||
def test_gr27():
|
||||
sents = get_sent_strings("Hello!! Long time no see.")
|
||||
assert sents == ["Hello!!", "Long time no see."]
|
||||
|
||||
def test_gr28():
|
||||
sents = get_sent_strings("Hello?? Who is there?")
|
||||
assert sents == ["Hello??", "Who is there?"]
|
||||
|
||||
def test_gr29():
|
||||
sents = get_sent_strings("Hello!? Is that you?")
|
||||
assert sents == ["Hello!?", "Is that you?"]
|
||||
|
||||
def test_gr30():
|
||||
sents = get_sent_strings("Hello?! Is that you?")
|
||||
assert sents == ["Hello?!", "Is that you?"]
|
||||
|
||||
def test_gr31():
|
||||
sents = get_sent_strings("1.) The first item 2.) The second item")
|
||||
assert sents == ["1.) The first item", "2.) The second item"]
|
||||
|
||||
def test_gr32():
|
||||
sents = get_sent_strings("1.) The first item. 2.) The second item.")
|
||||
assert sents == ["1.) The first item.", "2.) The second item."]
|
||||
|
||||
def test_gr33():
|
||||
sents = get_sent_strings("1) The first item 2) The second item")
|
||||
assert sents == ["1) The first item", "2) The second item"]
|
||||
|
||||
def test_gr34():
|
||||
sents = get_sent_strings("1) The first item. 2) The second item.")
|
||||
assert sents == ["1) The first item.", "2) The second item."]
|
||||
|
||||
def test_gr35():
|
||||
sents = get_sent_strings("1. The first item 2. The second item")
|
||||
assert sents == ["1. The first item", "2. The second item"]
|
||||
|
||||
def test_gr36():
|
||||
sents = get_sent_strings("1. The first item. 2. The second item.")
|
||||
assert sents == ["1. The first item.", "2. The second item."]
|
||||
|
||||
def test_gr37():
|
||||
sents = get_sent_strings("• 9. The first item • 10. The second item")
|
||||
assert sents == ["• 9. The first item", "• 10. The second item"]
|
||||
|
||||
def test_gr38():
|
||||
sents = get_sent_strings("⁃9. The first item ⁃10. The second item")
|
||||
assert sents == ["⁃9. The first item", "⁃10. The second item"]
|
||||
|
||||
def test_gr39():
|
||||
sents = get_sent_strings("a. The first item b. The second item c. The third list item")
|
||||
assert sents == ["a. The first item", "b. The second item", "c. The third list item"]
|
||||
|
||||
def test_gr40():
|
||||
sents = get_sent_strings("This is a sentence\ncut off in the middle because pdf.")
|
||||
assert sents == ["This is a sentence\ncut off in the middle because pdf."]
|
||||
|
||||
def test_gr41():
|
||||
sents = get_sent_strings("It was a cold \nnight in the city.")
|
||||
assert sents == ["It was a cold \nnight in the city."]
|
||||
|
||||
def test_gr42():
|
||||
sents = get_sent_strings("features\ncontact manager\nevents, activities\n")
|
||||
assert sents == ["features", "contact manager", "events, activities"]
|
||||
|
||||
def test_gr43():
|
||||
sents = get_sent_strings("You can find it at N°. 1026.253.553. That is where the treasure is.")
|
||||
assert sents == ["You can find it at N°. 1026.253.553.", "That is where the treasure is."]
|
||||
|
||||
def test_gr44():
|
||||
sents = get_sent_strings("She works at Yahoo! in the accounting department.")
|
||||
assert sents == ["She works at Yahoo! in the accounting department."]
|
||||
|
||||
def test_gr45():
|
||||
sents = get_sent_strings("We make a good team, you and I. Did you see Albert I. Jones yesterday?")
|
||||
assert sents == ["We make a good team, you and I.", "Did you see Albert I. Jones yesterday?"]
|
||||
|
||||
def test_gr46():
|
||||
sents = get_sent_strings("Thoreau argues that by simplifying one’s life, “the laws of the universe will appear less complex. . . .”")
|
||||
assert sents == ["Thoreau argues that by simplifying one’s life, “the laws of the universe will appear less complex. . . .”"]
|
||||
|
||||
def test_gr47():
|
||||
sents = get_sent_strings(""""Bohr [...] used the analogy of parallel stairways [...]" (Smith 55).""")
|
||||
assert sents == ['"Bohr [...] used the analogy of parallel stairways [...]" (Smith 55).']
|
||||
|
||||
def test_gr48():
|
||||
sents = get_sent_strings("If words are left off at the end of a sentence, and that is all that is omitted, indicate the omission with ellipsis marks (preceded and followed by a space) and then indicate the end of the sentence with a period . . . . Next sentence.")
|
||||
assert sents == ["If words are left off at the end of a sentence, and that is all that is omitted, indicate the omission with ellipsis marks (preceded and followed by a space) and then indicate the end of the sentence with a period . . . .", "Next sentence."]
|
||||
|
||||
def test_gr49():
|
||||
sents = get_sent_strings("I never meant that.... She left the store.")
|
||||
assert sents == ["I never meant that....", "She left the store."]
|
||||
|
||||
def test_gr50():
|
||||
sents = get_sent_strings("I wasn’t really ... well, what I mean...see . . . what I'm saying, the thing is . . . I didn’t mean it.")
|
||||
assert sents == ["I wasn’t really ... well, what I mean...see . . . what I'm saying, the thing is . . . I didn’t mean it."]
|
||||
|
||||
def test_gr51():
|
||||
sents = get_sent_strings("One further habit which was somewhat weakened . . . was that of combining words into self-interpreting compounds. . . . The practice was not abandoned. . . .")
|
||||
assert sents == ["One further habit which was somewhat weakened . . . was that of combining words into self-interpreting compounds.", ". . . The practice was not abandoned. . . ."]
|
||||
|
||||
def test_gr52():
|
||||
sents = get_sent_strings("Hello world.Today is Tuesday.Mr. Smith went to the store and bought 1,000.That is a lot.",)
|
||||
assert sents == ["Hello world.", "Today is Tuesday.", "Mr. Smith went to the store and bought 1,000.", "That is a lot."]
|
spacy/tests/parser/test_noun_chunks.py (new file, +76)
@@ -0,0 +1,76 @@
# coding: utf-8
from __future__ import unicode_literals

from ..util import get_doc

import pytest


def test_parser_noun_chunks_standard(en_tokenizer):
    text = "A base phrase should be recognized."
    heads = [2, 1, 3, 2, 1, 0, -1]
    tags = ['DT', 'JJ', 'NN', 'MD', 'VB', 'VBN', '.']
    deps = ['det', 'amod', 'nsubjpass', 'aux', 'auxpass', 'ROOT', 'punct']

    tokens = en_tokenizer(text)
    doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags, deps=deps, heads=heads)
    chunks = list(doc.noun_chunks)
    assert len(chunks) == 1
    assert chunks[0].text_with_ws == "A base phrase "


def test_parser_noun_chunks_coordinated(en_tokenizer):
    text = "A base phrase and a good phrase are often the same."
    heads = [2, 1, 5, -1, 2, 1, -4, 0, -1, 1, -3, -4]
    tags = ['DT', 'NN', 'NN', 'CC', 'DT', 'JJ', 'NN', 'VBP', 'RB', 'DT', 'JJ', '.']
    deps = ['det', 'compound', 'nsubj', 'cc', 'det', 'amod', 'conj', 'ROOT', 'advmod', 'det', 'attr', 'punct']

    tokens = en_tokenizer(text)
    doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags, deps=deps, heads=heads)
    chunks = list(doc.noun_chunks)
    assert len(chunks) == 2
    assert chunks[0].text_with_ws == "A base phrase "
    assert chunks[1].text_with_ws == "a good phrase "


def test_parser_noun_chunks_pp_chunks(en_tokenizer):
    text = "A phrase with another phrase occurs."
    heads = [1, 4, -1, 1, -2, 0, -1]
    tags = ['DT', 'NN', 'IN', 'DT', 'NN', 'VBZ', '.']
    deps = ['det', 'nsubj', 'prep', 'det', 'pobj', 'ROOT', 'punct']

    tokens = en_tokenizer(text)
    doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags, deps=deps, heads=heads)
    chunks = list(doc.noun_chunks)
    assert len(chunks) == 2
    assert chunks[0].text_with_ws == "A phrase "
    assert chunks[1].text_with_ws == "another phrase "


def test_parser_noun_chunks_standard_de(de_tokenizer):
    text = "Eine Tasse steht auf dem Tisch."
    heads = [1, 1, 0, -1, 1, -2, -4]
    tags = ['ART', 'NN', 'VVFIN', 'APPR', 'ART', 'NN', '$.']
    deps = ['nk', 'sb', 'ROOT', 'mo', 'nk', 'nk', 'punct']

    tokens = de_tokenizer(text)
    doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags, deps=deps, heads=heads)
    chunks = list(doc.noun_chunks)
    assert len(chunks) == 2
    assert chunks[0].text_with_ws == "Eine Tasse "
    assert chunks[1].text_with_ws == "dem Tisch "


def test_de_extended_chunk(de_tokenizer):
    text = "Die Sängerin singt mit einer Tasse Kaffee Arien."
    heads = [1, 1, 0, -1, 1, -2, -1, -5, -6]
    tags = ['ART', 'NN', 'VVFIN', 'APPR', 'ART', 'NN', 'NN', 'NN', '$.']
    deps = ['nk', 'sb', 'ROOT', 'mo', 'nk', 'nk', 'nk', 'oa', 'punct']

    tokens = de_tokenizer(text)
    doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags, deps=deps, heads=heads)
    chunks = list(doc.noun_chunks)
    assert len(chunks) == 3
    assert chunks[0].text_with_ws == "Die Sängerin "
    assert chunks[1].text_with_ws == "einer Tasse Kaffee "
    assert chunks[2].text_with_ws == "Arien "
spacy/tests/parser/test_sbd_prag.py (new file, +71)
@@ -0,0 +1,71 @@
|
|||
# encoding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
TEST_CASES = [
|
||||
("Hello World. My name is Jonas.", ["Hello World.", "My name is Jonas."]),
|
||||
("What is your name? My name is Jonas.", ["What is your name?", "My name is Jonas."]),
|
||||
("There it is! I found it.", ["There it is!", "I found it."]),
|
||||
("My name is Jonas E. Smith.", ["My name is Jonas E. Smith."]),
|
||||
("Please turn to p. 55.", ["Please turn to p. 55."]),
|
||||
("Were Jane and co. at the party?", ["Were Jane and co. at the party?"]),
|
||||
("They closed the deal with Pitt, Briggs & Co. at noon.", ["They closed the deal with Pitt, Briggs & Co. at noon."]),
|
||||
pytest.mark.xfail(("Let's ask Jane and co. They should know.", ["Let's ask Jane and co.", "They should know."])),
|
||||
("They closed the deal with Pitt, Briggs & Co. It closed yesterday.", ["They closed the deal with Pitt, Briggs & Co.", "It closed yesterday."]),
|
||||
("I can see Mt. Fuji from here.", ["I can see Mt. Fuji from here."]),
|
||||
pytest.mark.xfail(("St. Michael's Church is on 5th st. near the light.", ["St. Michael's Church is on 5th st. near the light."])),
|
||||
("That is JFK Jr.'s book.", ["That is JFK Jr.'s book."]),
|
||||
("I visited the U.S.A. last year.", ["I visited the U.S.A. last year."]),
|
||||
pytest.mark.xfail(("I live in the E.U. How about you?", ["I live in the E.U.", "How about you?"])),
|
||||
pytest.mark.xfail(("I live in the U.S. How about you?", ["I live in the U.S.", "How about you?"])),
|
||||
("I work for the U.S. Government in Virginia.", ["I work for the U.S. Government in Virginia."]),
|
||||
("I have lived in the U.S. for 20 years.", ["I have lived in the U.S. for 20 years."]),
|
||||
pytest.mark.xfail(("At 5 a.m. Mr. Smith went to the bank. He left the bank at 6 P.M. Mr. Smith then went to the store.", ["At 5 a.m. Mr. Smith went to the bank.", "He left the bank at 6 P.M.", "Mr. Smith then went to the store."])),
|
||||
("She has $100.00 in her bag.", ["She has $100.00 in her bag."]),
|
||||
("She has $100.00. It is in her bag.", ["She has $100.00.", "It is in her bag."]),
|
||||
("He teaches science (He previously worked for 5 years as an engineer.) at the local University.", ["He teaches science (He previously worked for 5 years as an engineer.) at the local University."]),
|
||||
pytest.mark.xfail(("Her email is Jane.Doe@example.com. I sent her an email.", ["Her email is Jane.Doe@example.com.", "I sent her an email."])),
|
||||
pytest.mark.xfail(("The site is: https://www.example.50.com/new-site/awesome_content.html. Please check it out.", ["The site is: https://www.example.50.com/new-site/awesome_content.html.", "Please check it out."])),
|
||||
pytest.mark.xfail(("She turned to him, 'This is great.' she said.", ["She turned to him, 'This is great.' she said."])),
|
||||
('She turned to him, "This is great." she said.', ['She turned to him, "This is great." she said.']),
|
||||
('She turned to him, "This is great." She held the book out to show him.', ['She turned to him, "This is great."', "She held the book out to show him."]),
|
||||
("Hello!! Long time no see.", ["Hello!!", "Long time no see."]),
|
||||
("Hello?? Who is there?", ["Hello??", "Who is there?"]),
|
||||
("Hello!? Is that you?", ["Hello!?", "Is that you?"]),
|
||||
("Hello?! Is that you?", ["Hello?!", "Is that you?"]),
|
||||
pytest.mark.xfail(("1.) The first item 2.) The second item", ["1.) The first item", "2.) The second item"])),
|
||||
pytest.mark.xfail(("1.) The first item. 2.) The second item.", ["1.) The first item.", "2.) The second item."])),
|
||||
pytest.mark.xfail(("1) The first item 2) The second item", ["1) The first item", "2) The second item"])),
|
||||
pytest.mark.xfail(("1) The first item. 2) The second item.", ["1) The first item.", "2) The second item."])),
|
||||
pytest.mark.xfail(("1. The first item 2. The second item", ["1. The first item", "2. The second item"])),
|
||||
pytest.mark.xfail(("1. The first item. 2. The second item.", ["1. The first item.", "2. The second item."])),
|
||||
pytest.mark.xfail(("• 9. The first item • 10. The second item", ["• 9. The first item", "• 10. The second item"])),
|
||||
pytest.mark.xfail(("⁃9. The first item ⁃10. The second item", ["⁃9. The first item", "⁃10. The second item"])),
|
||||
pytest.mark.xfail(("a. The first item b. The second item c. The third list item", ["a. The first item", "b. The second item", "c. The third list item"])),
|
||||
("This is a sentence\ncut off in the middle because pdf.", ["This is a sentence\ncut off in the middle because pdf."]),
|
||||
("It was a cold \nnight in the city.", ["It was a cold \nnight in the city."]),
|
||||
pytest.mark.xfail(("features\ncontact manager\nevents, activities\n", ["features", "contact manager", "events, activities"])),
|
||||
("You can find it at N°. 1026.253.553. That is where the treasure is.", ["You can find it at N°. 1026.253.553.", "That is where the treasure is."]),
|
||||
("She works at Yahoo! in the accounting department.", ["She works at Yahoo! in the accounting department."]),
|
||||
pytest.mark.xfail(("We make a good team, you and I. Did you see Albert I. Jones yesterday?", ["We make a good team, you and I.", "Did you see Albert I. Jones yesterday?"])),
|
||||
pytest.mark.xfail(("Thoreau argues that by simplifying one’s life, “the laws of the universe will appear less complex. . . .”", ["Thoreau argues that by simplifying one’s life, “the laws of the universe will appear less complex. . . .”"])),
|
||||
(""""Bohr [...] used the analogy of parallel stairways [...]" (Smith 55).""", ['"Bohr [...] used the analogy of parallel stairways [...]" (Smith 55).']),
|
||||
pytest.mark.xfail(("If words are left off at the end of a sentence, and that is all that is omitted, indicate the omission with ellipsis marks (preceded and followed by a space) and then indicate the end of the sentence with a period . . . . Next sentence.", ["If words are left off at the end of a sentence, and that is all that is omitted, indicate the omission with ellipsis marks (preceded and followed by a space) and then indicate the end of the sentence with a period . . . .", "Next sentence."])),
|
||||
("I never meant that.... She left the store.", ["I never meant that....", "She left the store."]),
|
||||
pytest.mark.xfail(("I wasn’t really ... well, what I mean...see . . . what I'm saying, the thing is . . . I didn’t mean it.", ["I wasn’t really ... well, what I mean...see . . . what I'm saying, the thing is . . . I didn’t mean it."])),
|
||||
pytest.mark.xfail(("One further habit which was somewhat weakened . . . was that of combining words into self-interpreting compounds. . . . The practice was not abandoned. . . .", ["One further habit which was somewhat weakened . . . was that of combining words into self-interpreting compounds.", ". . . The practice was not abandoned. . . ."])),
|
||||
pytest.mark.xfail(("Hello world.Today is Tuesday.Mr. Smith went to the store and bought 1,000.That is a lot.", ["Hello world.", "Today is Tuesday.", "Mr. Smith went to the store and bought 1,000.", "That is a lot."]))
|
||||
]
|
||||
|
||||
@pytest.mark.slow
|
||||
@pytest.mark.models
|
||||
@pytest.mark.parametrize('text,expected_sents', TEST_CASES)
|
||||
def test_parser_sbd_prag(EN, text, expected_sents):
|
||||
"""SBD tests from Pragmatic Segmenter"""
|
||||
doc = EN(text)
|
||||
sents = []
|
||||
for sent in doc.sents:
|
||||
sents.append(''.join(doc[i].string for i in range(sent.start, sent.end)).strip())
|
||||
assert sents == expected_sents
|
spacy/tests/regression/test_issue118.py (new file, +54)
@@ -0,0 +1,54 @@
# coding: utf-8
from __future__ import unicode_literals

from ...matcher import Matcher
from ...attrs import ORTH, LOWER

import pytest


pattern1 = [[{LOWER: 'celtics'}], [{LOWER: 'boston'}, {LOWER: 'celtics'}]]
pattern2 = [[{LOWER: 'boston'}, {LOWER: 'celtics'}], [{LOWER: 'celtics'}]]
pattern3 = [[{LOWER: 'boston'}], [{LOWER: 'boston'}, {LOWER: 'celtics'}]]
pattern4 = [[{LOWER: 'boston'}, {LOWER: 'celtics'}], [{LOWER: 'boston'}]]


@pytest.fixture
def doc(en_tokenizer):
    text = "how many points did lebron james score against the boston celtics last night"
    doc = en_tokenizer(text)
    return doc


@pytest.mark.parametrize('pattern', [pattern1, pattern2])
def test_issue118(doc, pattern):
    """Test a bug that arose from having overlapping matches"""
    ORG = doc.vocab.strings['ORG']
    matcher = Matcher(doc.vocab, {'BostonCeltics': ('ORG', {}, pattern)})

    assert len(list(doc.ents)) == 0
    matches = [(ent_type, start, end) for ent_id, ent_type, start, end in matcher(doc)]
    assert matches == [(ORG, 9, 11), (ORG, 10, 11)]
    doc.ents = matches[:1]
    ents = list(doc.ents)
    assert len(ents) == 1
    assert ents[0].label == ORG
    assert ents[0].start == 9
    assert ents[0].end == 11


@pytest.mark.parametrize('pattern', [pattern3, pattern4])
def test_issue118_prefix_reorder(doc, pattern):
    """Test a bug that arose from having overlapping matches"""
    ORG = doc.vocab.strings['ORG']
    matcher = Matcher(doc.vocab, {'BostonCeltics': ('ORG', {}, pattern)})

    assert len(list(doc.ents)) == 0
    matches = [(ent_type, start, end) for ent_id, ent_type, start, end in matcher(doc)]
    doc.ents += tuple(matches)[1:]
    assert matches == [(ORG, 9, 10), (ORG, 9, 11)]
    ents = doc.ents
    assert len(ents) == 1
    assert ents[0].label == ORG
    assert ents[0].start == 9
    assert ents[0].end == 11
spacy/tests/regression/test_issue242.py (new file, +26)
@@ -0,0 +1,26 @@
# coding: utf-8
from __future__ import unicode_literals

from ...matcher import Matcher
from ...attrs import LOWER

import pytest


def test_issue242(en_tokenizer):
    """Test overlapping multi-word phrases."""
    text = "There are different food safety standards in different countries."
    patterns = [[{LOWER: 'food'}, {LOWER: 'safety'}],
                [{LOWER: 'safety'}, {LOWER: 'standards'}]]

    doc = en_tokenizer(text)
    matcher = Matcher(doc.vocab)
    matcher.add('FOOD', 'FOOD', {}, patterns)

    matches = [(ent_type, start, end) for ent_id, ent_type, start, end in matcher(doc)]
    doc.ents += tuple(matches)
    match1, match2 = matches
    assert match1[1] == 3
    assert match1[2] == 5
    assert match2[1] == 4
    assert match2[2] == 6
@@ -4,7 +4,7 @@ from __future__ import unicode_literals
from ..util import get_doc


def test_sbd_empty_string(en_tokenizer):
def test_issue309(en_tokenizer):
"""Test Issue #309: SBD fails on empty string"""
tokens = en_tokenizer(" ")
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=[0], deps=['ROOT'])
@@ -1,16 +1,9 @@
# coding: utf-8
from __future__ import unicode_literals

from ...en import English

import pytest


@pytest.fixture
def en_tokenizer():
return English.Defaults.create_tokenizer()


def test_issue351(en_tokenizer):
doc = en_tokenizer(" This is a cat.")
assert doc[0].idx == 0
@@ -1,16 +1,10 @@
# coding: utf-8
from __future__ import unicode_literals

from ...en import English

import pytest


@pytest.fixture
def en_tokenizer():
return English.Defaults.create_tokenizer()


def test_big_ellipsis(en_tokenizer):
def test_issue360(en_tokenizer):
"""Test tokenization of big ellipsis"""
tokens = en_tokenizer('$45...............Asking')
assert len(tokens) > 2
spacy/tests/regression/test_issue361.py (new file, +11)
@@ -0,0 +1,11 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest


@pytest.mark.parametrize('text1,text2', [("cat", "dog")])
def test_issue361(en_vocab, text1, text2):
    """Test Issue #361: Equality of lexemes"""
    assert en_vocab[text1] == en_vocab[text1]
    assert en_vocab[text1] != en_vocab[text2]
@@ -1,31 +1,25 @@
# coding: utf-8
from __future__ import unicode_literals

import spacy
from spacy.attrs import ORTH
from ...attrs import ORTH
from ...matcher import Matcher

import pytest


@pytest.mark.models
def test_issue429():

nlp = spacy.load('en', parser=False)


def test_issue429(EN):
def merge_phrases(matcher, doc, i, matches):
if i != len(matches) - 1:
return None
spans = [(ent_id, label, doc[start:end]) for ent_id, label, start, end in matches]
for ent_id, label, span in spans:
span.merge('NNP' if label else span.root.tag_, span.text, nlp.vocab.strings[label])
span.merge('NNP' if label else span.root.tag_, span.text, EN.vocab.strings[label])

doc = nlp('a')
nlp.matcher.add('key', label='TEST', attrs={}, specs=[[{ORTH: 'a'}]], on_match=merge_phrases)
doc = nlp.tokenizer('a b c')
nlp.tagger(doc)
nlp.matcher(doc)

for word in doc:
print(word.text, word.ent_iob_, word.ent_type_)
nlp.entity(doc)
doc = EN('a')
matcher = Matcher(EN.vocab)
matcher.add('key', label='TEST', attrs={}, specs=[[{ORTH: 'a'}]], on_match=merge_phrases)
doc = EN.tokenizer('a b c')
EN.tagger(doc)
matcher(doc)
EN.entity(doc)
spacy/tests/regression/test_issue514.py (new file, +21)
@@ -0,0 +1,21 @@
# coding: utf-8
from __future__ import unicode_literals

from ..util import get_doc

import pytest


@pytest.mark.models
def test_issue514(EN):
    """Test serializing after adding entity"""
    text = ["This", "is", "a", "sentence", "about", "pasta", "."]
    vocab = EN.entity.vocab
    doc = get_doc(vocab, text)
    EN.entity.add_label("Food")
    EN.entity(doc)
    label_id = vocab.strings[u'Food']
    doc.ents = [(label_id, 5,6)]
    assert [(ent.label_, ent.text) for ent in doc.ents] == [("Food", "pasta")]
    doc2 = get_doc(EN.entity.vocab).from_bytes(doc.to_bytes())
    assert [(ent.label_, ent.text) for ent in doc2.ents] == [("Food", "pasta")]
@@ -6,5 +6,5 @@ import pytest

@pytest.mark.models
def test_issue54(EN):
text = u'Talks given by women had a slightly higher number of questions asked (3.2$\pm$0.2) than talks given by men (2.6$\pm$0.1).'
text = "Talks given by women had a slightly higher number of questions asked (3.2$\pm$0.2) than talks given by men (2.6$\pm$0.1)."
tokens = EN(text)
@@ -1,21 +1,20 @@
# coding: utf-8
from __future__ import unicode_literals

import spacy
import spacy.matcher
from spacy.attrs import IS_PUNCT, ORTH
from ...matcher import Matcher
from ...attrs import IS_PUNCT, ORTH

import pytest


@pytest.mark.models
def test_matcher_segfault():
nlp = spacy.load('en', parser=False, entity=False)
matcher = spacy.matcher.Matcher(nlp.vocab)
def test_issue587(EN):
"""Test that Matcher doesn't segfault on particular input"""
matcher = Matcher(EN.vocab)
content = '''a b; c'''
matcher.add(entity_key='1', label='TEST', attrs={}, specs=[[{ORTH: 'a'}, {ORTH: 'b'}]])
matcher(nlp(content))
matcher(EN(content))
matcher.add(entity_key='2', label='TEST', attrs={}, specs=[[{ORTH: 'a'}, {ORTH: 'b'}, {IS_PUNCT: True}, {ORTH: 'c'}]])
matcher(nlp(content))
matcher(EN(content))
matcher.add(entity_key='3', label='TEST', attrs={}, specs=[[{ORTH: 'a'}, {ORTH: 'b'}, {IS_PUNCT: True}, {ORTH: 'd'}]])
matcher(nlp(content))
matcher(EN(content))
@@ -1,14 +1,12 @@
# coding: utf-8
from __future__ import unicode_literals

from ...vocab import Vocab
from ...tokens import Doc
from ...matcher import Matcher

import pytest


def test_issue588():
matcher = Matcher(Vocab())
def test_issue588(en_vocab):
matcher = Matcher(en_vocab)
with pytest.raises(ValueError):
matcher.add(entity_key='1', label='TEST', attrs={}, specs=[[]])
@@ -2,7 +2,7 @@
from __future__ import unicode_literals

from ...vocab import Vocab
from ...tokens import Doc
from ..util import get_doc

import pytest

@@ -10,4 +10,4 @@ import pytest
def test_issue589():
vocab = Vocab()
vocab.strings.set_frozen(True)
doc = Doc(vocab, words=['whata'])
doc = get_doc(vocab, ['whata'])
@ -1,37 +1,22 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...attrs import *
|
||||
from ...attrs import ORTH, IS_ALPHA, LIKE_NUM
|
||||
from ...matcher import Matcher
|
||||
from ...tokens import Doc
|
||||
from ...en import English
|
||||
from ..util import get_doc
|
||||
|
||||
|
||||
def test_overlapping_matches():
|
||||
vocab = English.Defaults.create_vocab()
|
||||
doc = Doc(vocab, words=['n', '=', '1', ';', 'a', ':', '5', '%'])
|
||||
|
||||
matcher = Matcher(vocab)
|
||||
matcher.add_entity(
|
||||
"ab",
|
||||
acceptor=None,
|
||||
on_match=None
|
||||
)
|
||||
matcher.add_pattern(
|
||||
'ab',
|
||||
[
|
||||
{IS_ALPHA: True},
|
||||
{ORTH: ':'},
|
||||
{LIKE_NUM: True},
|
||||
{ORTH: '%'}
|
||||
], label='a')
|
||||
matcher.add_pattern(
|
||||
'ab',
|
||||
[
|
||||
{IS_ALPHA: True},
|
||||
{ORTH: '='},
|
||||
{LIKE_NUM: True},
|
||||
], label='b')
|
||||
def test_issue590(en_vocab):
|
||||
"""Test overlapping matches"""
|
||||
doc = get_doc(en_vocab, ['n', '=', '1', ';', 'a', ':', '5', '%'])
|
||||
|
||||
matcher = Matcher(en_vocab)
|
||||
matcher.add_entity("ab", acceptor=None, on_match=None)
|
||||
matcher.add_pattern('ab', [{IS_ALPHA: True}, {ORTH: ':'},
|
||||
{LIKE_NUM: True}, {ORTH: '%'}],
|
||||
label='a')
|
||||
matcher.add_pattern('ab', [{IS_ALPHA: True}, {ORTH: '='},
|
||||
{LIKE_NUM: True}],
|
||||
label='b')
|
||||
matches = matcher(doc)
|
||||
assert len(matches) == 2
|
||||
|
|
|
@ -2,43 +2,23 @@
|
|||
from __future__ import unicode_literals
|
||||
|
||||
from ...symbols import POS, VERB, VerbForm_inf
|
||||
from ...tokens import Doc
|
||||
from ...vocab import Vocab
|
||||
from ...lemmatizer import Lemmatizer
|
||||
from ..util import get_doc
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def index():
|
||||
return {'verb': {}}
|
||||
def test_issue595():
|
||||
"""Test lemmatization of base forms"""
|
||||
words = ["Do", "n't", "feed", "the", "dog"]
|
||||
tag_map = {'VB': {POS: VERB, 'morph': VerbForm_inf}}
|
||||
rules = {"verb": [["ed", "e"]]}
|
||||
|
||||
@pytest.fixture
|
||||
def exceptions():
|
||||
return {'verb': {}}
|
||||
lemmatizer = Lemmatizer({'verb': {}}, {'verb': {}}, rules)
|
||||
vocab = Vocab(lemmatizer=lemmatizer, tag_map=tag_map)
|
||||
doc = get_doc(vocab, words)
|
||||
|
||||
@pytest.fixture
|
||||
def rules():
|
||||
return {"verb": [["ed", "e"]]}
|
||||
|
||||
@pytest.fixture
|
||||
def lemmatizer(index, exceptions, rules):
|
||||
return Lemmatizer(index, exceptions, rules)
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def tag_map():
|
||||
return {'VB': {POS: VERB, 'morph': VerbForm_inf}}
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def vocab(lemmatizer, tag_map):
|
||||
return Vocab(lemmatizer=lemmatizer, tag_map=tag_map)
|
||||
|
||||
|
||||
def test_not_lemmatize_base_forms(vocab):
|
||||
doc = Doc(vocab, words=["Do", "n't", "feed", "the", "dog"])
|
||||
feed = doc[2]
|
||||
feed.tag_ = 'VB'
|
||||
assert feed.text == 'feed'
|
||||
assert feed.lemma_ == 'feed'
|
||||
doc[2].tag_ = 'VB'
|
||||
assert doc[2].text == 'feed'
|
||||
assert doc[2].lemma_ == 'feed'
|
||||
|
|
|
@@ -1,15 +1,13 @@
# coding: utf-8
from __future__ import unicode_literals

from ...tokens import Doc
from ...vocab import Vocab
from ..util import get_doc


def test_issue599():
doc = Doc(Vocab())
def test_issue599(en_vocab):
doc = get_doc(en_vocab)
doc.is_tagged = True
doc.is_parsed = True
bytes_ = doc.to_bytes()
doc2 = Doc(doc.vocab)
doc2.from_bytes(bytes_)
doc2 = get_doc(doc.vocab)
doc2.from_bytes(doc.to_bytes())
assert doc2.is_parsed
@@ -1,11 +1,11 @@
# coding: utf-8
from __future__ import unicode_literals

from ...tokens import Doc
from ...vocab import Vocab
from ...attrs import POS
from ..util import get_doc


def test_issue600():
doc = Doc(Vocab(tag_map={'NN': {'pos': 'NOUN'}}), words=['hello'])
vocab = Vocab(tag_map={'NN': {'pos': 'NOUN'}})
doc = get_doc(vocab, ["hello"])
doc[0].tag_ = 'NN'
@@ -1,27 +1,21 @@
# coding: utf-8
from __future__ import unicode_literals

from ...attrs import LOWER, ORTH
from ...tokens import Doc
from ...vocab import Vocab
from ...attrs import ORTH
from ...matcher import Matcher
from ..util import get_doc


def return_false(doc, ent_id, label, start, end):
return False
def test_issue605(en_vocab):
def return_false(doc, ent_id, label, start, end):
return False


def test_matcher_accept():
doc = Doc(Vocab(), words=['The', 'golf', 'club', 'is', 'broken'])

golf_pattern = [
{ ORTH: "golf"},
{ ORTH: "club"}
]
words = ["The", "golf", "club", "is", "broken"]
pattern = [{ORTH: "golf"}, {ORTH: "club"}]
label = "Sport_Equipment"
doc = get_doc(en_vocab, words)
matcher = Matcher(doc.vocab)

matcher.add_entity('Sport_Equipment', acceptor=return_false)
matcher.add_pattern("Sport_Equipment", golf_pattern)
matcher.add_entity(label, acceptor=return_false)
matcher.add_pattern(label, pattern)
match = matcher(doc)

assert match == []
@ -1,35 +1,31 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import spacy
|
||||
from spacy.attrs import ORTH
|
||||
from ...matcher import Matcher
|
||||
from ...attrs import ORTH
|
||||
|
||||
|
||||
def merge_phrases(matcher, doc, i, matches):
|
||||
'''
|
||||
Merge a phrase. We have to be careful here because we'll change the token indices.
|
||||
To avoid problems, merge all the phrases once we're called on the last match.
|
||||
'''
|
||||
if i != len(matches)-1:
|
||||
return None
|
||||
# Get Span objects
|
||||
spans = [(ent_id, label, doc[start : end]) for ent_id, label, start, end in matches]
|
||||
for ent_id, label, span in spans:
|
||||
span.merge('NNP' if label else span.root.tag_, span.text, doc.vocab.strings[label])
|
||||
def test_issue615(en_tokenizer):
|
||||
def merge_phrases(matcher, doc, i, matches):
|
||||
"""Merge a phrase. We have to be careful here because we'll change the
|
||||
token indices. To avoid problems, merge all the phrases once we're called
|
||||
on the last match."""
|
||||
|
||||
def test_entity_ID_assignment():
|
||||
nlp = spacy.en.English()
|
||||
text = """The golf club is broken"""
|
||||
doc = nlp(text)
|
||||
if i != len(matches)-1:
|
||||
return None
|
||||
# Get Span objects
|
||||
spans = [(ent_id, label, doc[start : end]) for ent_id, label, start, end in matches]
|
||||
for ent_id, label, span in spans:
|
||||
span.merge('NNP' if label else span.root.tag_, span.text, doc.vocab.strings[label])
|
||||
|
||||
golf_pattern = [
|
||||
{ ORTH: "golf"},
|
||||
{ ORTH: "club"}
|
||||
]
|
||||
text = "The golf club is broken"
|
||||
pattern = [{ORTH: "golf"}, {ORTH: "club"}]
|
||||
label = "Sport_Equipment"
|
||||
|
||||
matcher = spacy.matcher.Matcher(nlp.vocab)
|
||||
matcher.add_entity('Sport_Equipment', on_match = merge_phrases)
|
||||
matcher.add_pattern("Sport_Equipment", golf_pattern, label = 'Sport_Equipment')
|
||||
doc = en_tokenizer(text)
|
||||
matcher = Matcher(doc.vocab)
|
||||
matcher.add_entity(label, on_match=merge_phrases)
|
||||
matcher.add_pattern(label, pattern, label=label)
|
||||
|
||||
match = matcher(doc)
|
||||
entities = list(doc.ents)
|
||||
|
|
|
@@ -4,7 +4,8 @@ from __future__ import unicode_literals
from ...vocab import Vocab


def test_load_vocab_with_string():
def test_issue617():
"""Test loading Vocab with string"""
try:
vocab = Vocab.load('/tmp/vocab')
except IOError:
@@ -1,7 +1,4 @@
# coding: utf-8
"""Test that times like "7am" are tokenized correctly and that numbers are converted to string."""


from __future__ import unicode_literals

import pytest

@@ -9,6 +6,7 @@ import pytest

@pytest.mark.parametrize('text,number', [("7am", "7"), ("11p.m.", "11")])
def test_issue736(en_tokenizer, text, number):
"""Test that times like "7am" are tokenized correctly and that numbers are converted to string."""
tokens = en_tokenizer(text)
assert len(tokens) == 2
assert tokens[0].text == number
spacy/tests/regression/test_issue740.py (new file, +12)
@@ -0,0 +1,12 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest


@pytest.mark.parametrize('text', ["3/4/2012", "01/12/1900"])
def test_issue740(en_tokenizer, text):
    """Test that dates are not split and kept as one token. This behaviour is currently inconsistent, since dates separated by hyphens are still split.
    This will be hard to prevent without causing clashes with numeric ranges."""
    tokens = en_tokenizer(text)
    assert len(tokens) == 1
spacy/tests/regression/test_issue744.py (new file, +13)
@@ -0,0 +1,13 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest


@pytest.mark.parametrize('text', ["We were scared", "We Were Scared"])
def test_issue744(en_tokenizer, text):
    """Test that 'were' and 'Were' are excluded from the contractions
    generated by the English tokenizer exceptions."""
    tokens = en_tokenizer(text)
    assert len(tokens) == 3
    assert tokens[1].text.lower() == "were"
@ -1,60 +1,51 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
import pytest
|
||||
|
||||
from ...serialize.packer import _BinaryCodec
|
||||
from ...serialize.huffman import HuffmanCodec
|
||||
from ...serialize.bits import BitArray
|
||||
|
||||
import numpy
|
||||
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.serialize.packer import _BinaryCodec
|
||||
from spacy.serialize.huffman import HuffmanCodec
|
||||
from spacy.serialize.bits import BitArray
|
||||
import pytest
|
||||
|
||||
|
||||
def test_binary():
|
||||
def test_serialize_codecs_binary():
|
||||
codec = _BinaryCodec()
|
||||
bits = BitArray()
|
||||
msg = numpy.array([0, 1, 0, 1, 1], numpy.int32)
|
||||
codec.encode(msg, bits)
|
||||
array = numpy.array([0, 1, 0, 1, 1], numpy.int32)
|
||||
codec.encode(array, bits)
|
||||
result = numpy.array([0, 0, 0, 0, 0], numpy.int32)
|
||||
bits.seek(0)
|
||||
codec.decode(bits, result)
|
||||
assert list(msg) == list(result)
|
||||
assert list(array) == list(result)
|
||||
|
||||
|
||||
def test_attribute():
|
||||
freqs = {'the': 10, 'quick': 3, 'brown': 4, 'fox': 1, 'jumped': 5, 'over': 8,
|
||||
'lazy': 1, 'dog': 2, '.': 9}
|
||||
|
||||
int_map = {'the': 0, 'quick': 1, 'brown': 2, 'fox': 3, 'jumped': 4, 'over': 5,
|
||||
'lazy': 6, 'dog': 7, '.': 8}
|
||||
def test_serialize_codecs_attribute():
|
||||
freqs = {'the': 10, 'quick': 3, 'brown': 4, 'fox': 1, 'jumped': 5,
|
||||
'over': 8, 'lazy': 1, 'dog': 2, '.': 9}
|
||||
int_map = {'the': 0, 'quick': 1, 'brown': 2, 'fox': 3, 'jumped': 4,
|
||||
'over': 5, 'lazy': 6, 'dog': 7, '.': 8}
|
||||
|
||||
codec = HuffmanCodec([(int_map[string], freq) for string, freq in freqs.items()])
|
||||
|
||||
bits = BitArray()
|
||||
|
||||
msg = numpy.array([1, 7], dtype=numpy.int32)
|
||||
msg_list = list(msg)
|
||||
codec.encode(msg, bits)
|
||||
array = numpy.array([1, 7], dtype=numpy.int32)
|
||||
codec.encode(array, bits)
|
||||
result = numpy.array([0, 0], dtype=numpy.int32)
|
||||
bits.seek(0)
|
||||
codec.decode(bits, result)
|
||||
assert msg_list == list(result)
|
||||
assert list(array) == list(result)
|
||||
|
||||
|
||||
def test_vocab_codec():
|
||||
vocab = Vocab()
|
||||
lex = vocab['dog']
|
||||
lex = vocab['the']
|
||||
lex = vocab['jumped']
|
||||
|
||||
codec = HuffmanCodec([(lex.orth, lex.prob) for lex in vocab])
|
||||
|
||||
def test_serialize_codecs_vocab(en_vocab):
|
||||
words = ["the", "dog", "jumped"]
|
||||
for word in words:
|
||||
_ = en_vocab[word]
|
||||
codec = HuffmanCodec([(lex.orth, lex.prob) for lex in en_vocab])
|
||||
bits = BitArray()
|
||||
|
||||
ids = [vocab[s].orth for s in ('the', 'dog', 'jumped')]
|
||||
msg = numpy.array(ids, dtype=numpy.int32)
|
||||
msg_list = list(msg)
|
||||
codec.encode(msg, bits)
|
||||
result = numpy.array(range(len(msg)), dtype=numpy.int32)
|
||||
ids = [en_vocab[s].orth for s in words]
|
||||
array = numpy.array(ids, dtype=numpy.int32)
|
||||
codec.encode(array, bits)
|
||||
result = numpy.array(range(len(array)), dtype=numpy.int32)
|
||||
bits.seek(0)
|
||||
codec.decode(bits, result)
|
||||
assert msg_list == list(result)
|
||||
assert list(array) == list(result)
|
||||
|
|
|
@@ -1,15 +1,15 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
from __future__ import division
|
||||
|
||||
import pytest
|
||||
from ...serialize.huffman import HuffmanCodec
|
||||
from ...serialize.bits import BitArray
|
||||
|
||||
from spacy.serialize.huffman import HuffmanCodec
|
||||
from spacy.serialize.bits import BitArray
|
||||
import numpy
|
||||
import math
|
||||
|
||||
from heapq import heappush, heappop, heapify
|
||||
from collections import defaultdict
|
||||
import numpy
|
||||
import pytest
|
||||
|
||||
|
||||
def py_encode(symb2freq):
|
||||
|
@@ -29,7 +29,7 @@ def py_encode(symb2freq):
|
|||
return dict(heappop(heap)[1:])
|
||||
|
||||
|
||||
def test1():
|
||||
def test_serialize_huffman_1():
|
||||
probs = numpy.zeros(shape=(10,), dtype=numpy.float32)
|
||||
probs[0] = 0.3
|
||||
probs[1] = 0.2
|
||||
|
@@ -41,45 +41,44 @@ def test1():
|
|||
probs[7] = 0.005
|
||||
probs[8] = 0.0001
|
||||
probs[9] = 0.000001
|
||||
|
||||
|
||||
codec = HuffmanCodec(list(enumerate(probs)))
|
||||
|
||||
py_codes = py_encode(dict(enumerate(probs)))
|
||||
py_codes = list(py_codes.items())
|
||||
py_codes.sort()
|
||||
assert codec.strings == [c for i, c in py_codes]
|
||||
|
||||
|
||||
def test_empty():
|
||||
def test_serialize_huffman_empty():
|
||||
codec = HuffmanCodec({})
|
||||
assert codec.strings == []
|
||||
|
||||
|
||||
def test_round_trip():
|
||||
freqs = {'the': 10, 'quick': 3, 'brown': 4, 'fox': 1, 'jumped': 5, 'over': 8,
|
||||
'lazy': 1, 'dog': 2, '.': 9}
|
||||
|
||||
def test_serialize_huffman_round_trip():
|
||||
words = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'the',
|
||||
'lazy', 'dog', '.']
|
||||
freqs = {'the': 10, 'quick': 3, 'brown': 4, 'fox': 1, 'jumped': 5,
|
||||
'over': 8, 'lazy': 1, 'dog': 2, '.': 9}
|
||||
|
||||
codec = HuffmanCodec(freqs.items())
|
||||
|
||||
message = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the',
|
||||
'the', 'lazy', 'dog', '.']
|
||||
strings = list(codec.strings)
|
||||
codes = dict([(codec.leaves[i], strings[i]) for i in range(len(codec.leaves))])
|
||||
bits = codec.encode(message)
|
||||
bits = codec.encode(words)
|
||||
string = ''.join('{0:b}'.format(c).rjust(8, '0')[::-1] for c in bits.as_bytes())
|
||||
for word in message:
|
||||
for word in words:
|
||||
code = codes[word]
|
||||
assert string[:len(code)] == code
|
||||
string = string[len(code):]
|
||||
unpacked = [0] * len(message)
|
||||
unpacked = [0] * len(words)
|
||||
bits.seek(0)
|
||||
codec.decode(bits, unpacked)
|
||||
assert message == unpacked
|
||||
assert words == unpacked
|
||||
|
||||
|
||||
def test_rosetta():
|
||||
txt = u"this is an example for huffman encoding"
|
||||
def test_serialize_huffman_rosetta():
|
||||
text = "this is an example for huffman encoding"
|
||||
symb2freq = defaultdict(int)
|
||||
for ch in txt:
|
||||
for ch in text:
|
||||
symb2freq[ch] += 1
|
||||
by_freq = list(symb2freq.items())
|
||||
by_freq.sort(reverse=True, key=lambda item: item[1])
|
||||
|
@@ -101,7 +100,7 @@ def test_rosetta():
|
|||
assert my_exp_len == py_exp_len
|
||||
|
||||
|
||||
@pytest.mark.slow
|
||||
@pytest.mark.models
|
||||
def test_vocab(EN):
|
||||
codec = HuffmanCodec([(w.orth, numpy.exp(w.prob)) for w in EN.vocab])
|
||||
expected_length = 0
|
||||
|
|
|
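Condensed, the Huffman round trip these tests exercise boils down to the following sketch, distilled from the test code above (same calls, illustrative data; not an API reference):

from spacy.serialize.huffman import HuffmanCodec

freqs = {'the': 10, 'quick': 3, 'brown': 4, 'fox': 1, 'jumped': 5,
         'over': 8, 'lazy': 1, 'dog': 2, '.': 9}
codec = HuffmanCodec(freqs.items())

words = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', '.']
bits = codec.encode(words)       # BitArray of variable-length codes
decoded = [0] * len(words)       # decode() fills a pre-sized buffer in place
bits.seek(0)
codec.decode(bits, decoded)
assert decoded == words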
@@ -1,58 +1,48 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...tokens import Doc
|
||||
from ..util import get_doc
|
||||
|
||||
import pytest
|
||||
|
||||
from spacy.serialize.packer import Packer
|
||||
from spacy.attrs import ORTH, SPACY
|
||||
from spacy.tokens import Doc
|
||||
import math
|
||||
import tempfile
|
||||
import shutil
|
||||
import os
|
||||
|
||||
def test_serialize_io_read_write(en_vocab, text_file_b):
|
||||
text1 = ["This", "is", "a", "simple", "test", ".", "With", "a", "couple", "of", "sentences", "."]
|
||||
text2 = ["This", "is", "another", "test", "document", "."]
|
||||
|
||||
doc1 = get_doc(en_vocab, text1)
|
||||
doc2 = get_doc(en_vocab, text2)
|
||||
text_file_b.write(doc1.to_bytes())
|
||||
text_file_b.write(doc2.to_bytes())
|
||||
text_file_b.seek(0)
|
||||
bytes1, bytes2 = Doc.read_bytes(text_file_b)
|
||||
result1 = get_doc(en_vocab).from_bytes(bytes1)
|
||||
result2 = get_doc(en_vocab).from_bytes(bytes2)
|
||||
assert result1.text_with_ws == doc1.text_with_ws
|
||||
assert result2.text_with_ws == doc2.text_with_ws
|
||||
|
||||
|
||||
@pytest.mark.models
|
||||
def test_read_write(EN):
|
||||
doc1 = EN(u'This is a simple test. With a couple of sentences.')
|
||||
doc2 = EN(u'This is another test document.')
|
||||
def test_serialize_io_left_right(en_vocab):
|
||||
text = ["This", "is", "a", "simple", "test", ".", "With", "a", "couple", "of", "sentences", "."]
|
||||
doc = get_doc(en_vocab, text)
|
||||
result = Doc(en_vocab).from_bytes(doc.to_bytes())
|
||||
|
||||
try:
|
||||
tmp_dir = tempfile.mkdtemp()
|
||||
with open(os.path.join(tmp_dir, 'spacy_docs.bin'), 'wb') as file_:
|
||||
file_.write(doc1.to_bytes())
|
||||
file_.write(doc2.to_bytes())
|
||||
|
||||
with open(os.path.join(tmp_dir, 'spacy_docs.bin'), 'rb') as file_:
|
||||
bytes1, bytes2 = Doc.read_bytes(file_)
|
||||
r1 = Doc(EN.vocab).from_bytes(bytes1)
|
||||
r2 = Doc(EN.vocab).from_bytes(bytes2)
|
||||
|
||||
assert r1.string == doc1.string
|
||||
assert r2.string == doc2.string
|
||||
finally:
|
||||
shutil.rmtree(tmp_dir)
|
||||
|
||||
|
||||
@pytest.mark.models
|
||||
def test_left_right(EN):
|
||||
orig = EN(u'This is a simple test. With a couple of sentences.')
|
||||
result = Doc(orig.vocab).from_bytes(orig.to_bytes())
|
||||
|
||||
for word in result:
|
||||
assert word.head.i == orig[word.i].head.i
|
||||
if word.head is not word:
|
||||
assert word.i in [w.i for w in word.head.children]
|
||||
for child in word.lefts:
|
||||
assert child.head.i == word.i
|
||||
for child in word.rights:
|
||||
assert child.head.i == word.i
|
||||
for token in result:
|
||||
assert token.head.i == doc[token.i].head.i
|
||||
if token.head is not token:
|
||||
assert token.i in [w.i for w in token.head.children]
|
||||
for child in token.lefts:
|
||||
assert child.head.i == token.i
|
||||
for child in token.rights:
|
||||
assert child.head.i == token.i
|
||||
|
||||
|
||||
@pytest.mark.models
|
||||
def test_lemmas(EN):
|
||||
orig = EN(u'The geese are flying')
|
||||
result = Doc(orig.vocab).from_bytes(orig.to_bytes())
|
||||
the, geese, are, flying = result
|
||||
assert geese.lemma_ == 'goose'
|
||||
assert are.lemma_ == 'be'
|
||||
assert flying.lemma_ == 'fly'
|
||||
|
||||
|
||||
text = "The geese are flying"
|
||||
doc = EN(text)
|
||||
result = Doc(doc.vocab).from_bytes(doc.to_bytes())
|
||||
assert result[1].lemma_ == 'goose'
|
||||
assert result[2].lemma_ == 'be'
|
||||
assert result[3].lemma_ == 'fly'
|
||||
|
|
|
@@ -1,55 +1,30 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import re
|
||||
from ...attrs import TAG, DEP, HEAD
|
||||
from ...serialize.packer import Packer
|
||||
from ...serialize.bits import BitArray
|
||||
|
||||
from ..util import get_doc
|
||||
|
||||
import pytest
|
||||
import numpy
|
||||
|
||||
from spacy.language import Language
|
||||
from spacy.en import English
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.tokens.doc import Doc
|
||||
from spacy.tokenizer import Tokenizer
|
||||
from os import path
|
||||
import os
|
||||
|
||||
from spacy import util
|
||||
from spacy.attrs import ORTH, SPACY, TAG, DEP, HEAD
|
||||
from spacy.serialize.packer import Packer
|
||||
|
||||
from spacy.serialize.bits import BitArray
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def vocab():
|
||||
path = os.environ.get('SPACY_DATA')
|
||||
if path is None:
|
||||
path = util.match_best_version('en', None, util.get_data_path())
|
||||
else:
|
||||
path = util.match_best_version('en', None, path)
|
||||
|
||||
vocab = English.Defaults.create_vocab()
|
||||
lex = vocab['dog']
|
||||
assert vocab[vocab.strings['dog']].orth_ == 'dog'
|
||||
lex = vocab['the']
|
||||
lex = vocab['quick']
|
||||
lex = vocab['jumped']
|
||||
return vocab
|
||||
def text():
|
||||
return "the dog jumped"
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def tokenizer(vocab):
|
||||
null_re = re.compile(r'!!!!!!!!!')
|
||||
tokenizer = Tokenizer(vocab, {}, null_re.search, null_re.search, null_re.finditer)
|
||||
return tokenizer
|
||||
def text_b():
|
||||
return b"the dog jumped"
|
||||
|
||||
|
||||
def test_char_packer(vocab):
|
||||
packer = Packer(vocab, [])
|
||||
def test_serialize_char_packer(en_vocab, text_b):
|
||||
packer = Packer(en_vocab, [])
|
||||
bits = BitArray()
|
||||
bits.seek(0)
|
||||
|
||||
byte_str = bytearray(b'the dog jumped')
|
||||
byte_str = bytearray(text_b)
|
||||
packer.char_codec.encode(byte_str, bits)
|
||||
bits.seek(0)
|
||||
result = [b''] * len(byte_str)
|
||||
|
@@ -57,79 +32,67 @@ def test_char_packer(vocab):
|
|||
assert bytearray(result) == byte_str
|
||||
|
||||
|
||||
def test_packer_unannotated(tokenizer):
|
||||
packer = Packer(tokenizer.vocab, [])
|
||||
|
||||
msg = tokenizer(u'the dog jumped')
|
||||
|
||||
assert msg.string == 'the dog jumped'
|
||||
|
||||
|
||||
bits = packer.pack(msg)
|
||||
|
||||
def test_serialize_packer_unannotated(en_tokenizer, text):
|
||||
packer = Packer(en_tokenizer.vocab, [])
|
||||
tokens = en_tokenizer(text)
|
||||
assert tokens.text_with_ws == text
|
||||
bits = packer.pack(tokens)
|
||||
result = packer.unpack(bits)
|
||||
|
||||
assert result.string == 'the dog jumped'
|
||||
assert result.text_with_ws == text
|
||||
|
||||
|
||||
@pytest.mark.models
|
||||
def test_packer_annotated(tokenizer):
|
||||
vocab = tokenizer.vocab
|
||||
nn = vocab.strings['NN']
|
||||
dt = vocab.strings['DT']
|
||||
vbd = vocab.strings['VBD']
|
||||
jj = vocab.strings['JJ']
|
||||
det = vocab.strings['det']
|
||||
nsubj = vocab.strings['nsubj']
|
||||
adj = vocab.strings['adj']
|
||||
root = vocab.strings['ROOT']
|
||||
def test_packer_annotated(en_vocab, text):
|
||||
heads = [1, 1, 0]
|
||||
deps = ['det', 'nsubj', 'ROOT']
|
||||
tags = ['DT', 'NN', 'VBD']
|
||||
|
||||
attr_freqs = [
|
||||
(TAG, [(nn, 0.1), (dt, 0.2), (jj, 0.01), (vbd, 0.05)]),
|
||||
(DEP, {det: 0.2, nsubj: 0.1, adj: 0.05, root: 0.1}.items()),
|
||||
(TAG, [(en_vocab.strings['NN'], 0.1),
|
||||
(en_vocab.strings['DT'], 0.2),
|
||||
(en_vocab.strings['JJ'], 0.01),
|
||||
(en_vocab.strings['VBD'], 0.05)]),
|
||||
(DEP, {en_vocab.strings['det']: 0.2,
|
||||
en_vocab.strings['nsubj']: 0.1,
|
||||
en_vocab.strings['adj']: 0.05,
|
||||
en_vocab.strings['ROOT']: 0.1}.items()),
|
||||
(HEAD, {0: 0.05, 1: 0.2, -1: 0.2, -2: 0.1, 2: 0.1}.items())
|
||||
]
|
||||
|
||||
packer = Packer(vocab, attr_freqs)
|
||||
packer = Packer(en_vocab, attr_freqs)
|
||||
doc = get_doc(en_vocab, [t for t in text.split()], tags=tags, deps=deps, heads=heads)
|
||||
|
||||
msg = tokenizer(u'the dog jumped')
|
||||
# assert doc.text_with_ws == text
|
||||
assert [t.tag_ for t in doc] == tags
|
||||
assert [t.dep_ for t in doc] == deps
|
||||
assert [(t.head.i-t.i) for t in doc] == heads
|
||||
|
||||
msg.from_array(
|
||||
[TAG, DEP, HEAD],
|
||||
numpy.array([
|
||||
[dt, det, 1],
|
||||
[nn, nsubj, 1],
|
||||
[vbd, root, 0]
|
||||
], dtype=numpy.int32))
|
||||
|
||||
assert msg.string == 'the dog jumped'
|
||||
assert [t.tag_ for t in msg] == ['DT', 'NN', 'VBD']
|
||||
assert [t.dep_ for t in msg] == ['det', 'nsubj', 'ROOT']
|
||||
assert [(t.head.i - t.i) for t in msg] == [1, 1, 0]
|
||||
|
||||
bits = packer.pack(msg)
|
||||
bits = packer.pack(doc)
|
||||
result = packer.unpack(bits)
|
||||
|
||||
assert result.string == 'the dog jumped'
|
||||
assert [t.tag_ for t in result] == ['DT', 'NN', 'VBD']
|
||||
assert [t.dep_ for t in result] == ['det', 'nsubj', 'ROOT']
|
||||
assert [(t.head.i - t.i) for t in result] == [1, 1, 0]
|
||||
# assert result.text_with_ws == text
|
||||
assert [t.tag_ for t in result] == tags
|
||||
assert [t.dep_ for t in result] == deps
|
||||
assert [(t.head.i-t.i) for t in result] == heads
|
||||
|
||||
|
||||
def test_packer_bad_chars(tokenizer):
|
||||
string = u'naja gut, is eher bl\xf6d und nicht mit reddit.com/digg.com vergleichbar; vielleicht auf dem weg dahin'
|
||||
packer = Packer(tokenizer.vocab, [])
|
||||
def test_packer_bad_chars(en_tokenizer):
|
||||
text = "naja gut, is eher bl\xf6d und nicht mit reddit.com/digg.com vergleichbar; vielleicht auf dem weg dahin"
|
||||
packer = Packer(en_tokenizer.vocab, [])
|
||||
|
||||
doc = tokenizer(string)
|
||||
doc = en_tokenizer(text)
|
||||
bits = packer.pack(doc)
|
||||
result = packer.unpack(bits)
|
||||
assert result.string == doc.string
|
||||
|
||||
|
||||
@pytest.mark.models
|
||||
def test_packer_bad_chars(EN):
|
||||
string = u'naja gut, is eher bl\xf6d und nicht mit reddit.com/digg.com vergleichbar; vielleicht auf dem weg dahin'
|
||||
doc = EN(string)
|
||||
def test_packer_bad_chars_tags(EN):
|
||||
text = "naja gut, is eher bl\xf6d und nicht mit reddit.com/digg.com vergleichbar; vielleicht auf dem weg dahin"
|
||||
tags = ['JJ', 'NN', ',', 'VBZ', 'DT', 'NN', 'JJ', 'NN', 'NN',
|
||||
'ADD', 'NN', ':', 'NN', 'NN', 'NN', 'NN', 'NN']
|
||||
|
||||
tokens = EN.tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags)
|
||||
byte_string = doc.to_bytes()
|
||||
result = Doc(EN.vocab).from_bytes(byte_string)
|
||||
result = get_doc(tokens.vocab).from_bytes(byte_string)
|
||||
assert [t.tag_ for t in result] == [t.tag_ for t in doc]
|
||||
|
|
|
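At its core, the Packer usage covered by these tests is a pack/unpack round trip; a sketch based only on calls that appear above (en_tokenizer is the shared fixture, and the empty attribute-frequency list mirrors the unannotated case):

from spacy.serialize.packer import Packer

def test_packer_roundtrip_sketch(en_tokenizer):
    doc = en_tokenizer("the dog jumped")
    packer = Packer(en_tokenizer.vocab, [])
    bits = packer.pack(doc)
    result = packer.unpack(bits)
    assert result.text_with_ws == doc.text_with_ws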
@@ -1,127 +1,40 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...serialize.packer import Packer
|
||||
from ..util import get_doc, assert_docs_equal
|
||||
|
||||
import pytest
|
||||
|
||||
from spacy.tokens import Doc
|
||||
import spacy.en
|
||||
from spacy.serialize.packer import Packer
|
||||
|
||||
TEXT = ["This", "is", "a", "test", "sentence", "."]
|
||||
TAGS = ['DT', 'VBZ', 'DT', 'NN', 'NN', '.']
|
||||
DEPS = ['nsubj', 'ROOT', 'det', 'compound', 'attr', 'punct']
|
||||
ENTS = [('hi', 'PERSON', 0, 1)]
|
||||
|
||||
|
||||
def equal(doc1, doc2):
|
||||
# tokens
|
||||
assert [ t.orth for t in doc1 ] == [ t.orth for t in doc2 ]
|
||||
|
||||
# tags
|
||||
assert [ t.pos for t in doc1 ] == [ t.pos for t in doc2 ]
|
||||
assert [ t.tag for t in doc1 ] == [ t.tag for t in doc2 ]
|
||||
|
||||
# parse
|
||||
assert [ t.head.i for t in doc1 ] == [ t.head.i for t in doc2 ]
|
||||
assert [ t.dep for t in doc1 ] == [ t.dep for t in doc2 ]
|
||||
if doc1.is_parsed and doc2.is_parsed:
|
||||
assert [ s for s in doc1.sents ] == [ s for s in doc2.sents ]
|
||||
|
||||
# entities
|
||||
assert [ t.ent_type for t in doc1 ] == [ t.ent_type for t in doc2 ]
|
||||
assert [ t.ent_iob for t in doc1 ] == [ t.ent_iob for t in doc2 ]
|
||||
assert [ ent for ent in doc1.ents ] == [ ent for ent in doc2.ents ]
|
||||
|
||||
|
||||
@pytest.mark.models
|
||||
def test_serialize_tokens(EN):
|
||||
doc1 = EN(u'This is a test sentence.',tag=False, parse=False, entity=False)
|
||||
|
||||
doc2 = Doc(EN.vocab).from_bytes(doc1.to_bytes())
|
||||
equal(doc1, doc2)
|
||||
|
||||
|
||||
@pytest.mark.models
|
||||
def test_serialize_tokens_tags(EN):
|
||||
doc1 = EN(u'This is a test sentence.',tag=True, parse=False, entity=False)
|
||||
doc2 = Doc(EN.vocab).from_bytes(doc1.to_bytes())
|
||||
equal(doc1, doc2)
|
||||
|
||||
|
||||
@pytest.mark.models
|
||||
def test_serialize_tokens_parse(EN):
|
||||
doc1 = EN(u'This is a test sentence.',tag=False, parse=True, entity=False)
|
||||
|
||||
doc2 = Doc(EN.vocab).from_bytes(doc1.to_bytes())
|
||||
equal(doc1, doc2)
|
||||
|
||||
|
||||
@pytest.mark.models
|
||||
def test_serialize_tokens_ner(EN):
|
||||
doc1 = EN(u'This is a test sentence.', tag=False, parse=False, entity=True)
|
||||
|
||||
doc2 = Doc(EN.vocab).from_bytes(doc1.to_bytes())
|
||||
equal(doc1, doc2)
|
||||
|
||||
|
||||
@pytest.mark.models
|
||||
def test_serialize_tokens_tags_parse(EN):
|
||||
doc1 = EN(u'This is a test sentence.', tag=True, parse=True, entity=False)
|
||||
|
||||
doc2 = Doc(EN.vocab).from_bytes(doc1.to_bytes())
|
||||
equal(doc1, doc2)
|
||||
|
||||
|
||||
@pytest.mark.models
|
||||
def test_serialize_tokens_tags_ner(EN):
|
||||
doc1 = EN(u'This is a test sentence.', tag=True, parse=False, entity=True)
|
||||
|
||||
doc2 = Doc(EN.vocab).from_bytes(doc1.to_bytes())
|
||||
equal(doc1, doc2)
|
||||
|
||||
|
||||
@pytest.mark.models
|
||||
def test_serialize_tokens_ner_parse(EN):
|
||||
doc1 = EN(u'This is a test sentence.', tag=False, parse=True, entity=True)
|
||||
|
||||
doc2 = Doc(EN.vocab).from_bytes(doc1.to_bytes())
|
||||
equal(doc1, doc2)
|
||||
|
||||
|
||||
@pytest.mark.models
|
||||
def test_serialize_tokens_tags_parse_ner(EN):
|
||||
doc1 = EN(u'This is a test sentence.', tag=True, parse=True, entity=True)
|
||||
|
||||
doc2 = Doc(EN.vocab).from_bytes(doc1.to_bytes())
|
||||
equal(doc1, doc2)
|
||||
|
||||
|
||||
def test_serialize_empty_doc():
|
||||
vocab = spacy.en.English.Defaults.create_vocab()
|
||||
doc = Doc(vocab)
|
||||
packer = Packer(vocab, {})
|
||||
def test_serialize_empty_doc(en_vocab):
|
||||
doc = get_doc(en_vocab)
|
||||
packer = Packer(en_vocab, {})
|
||||
b = packer.pack(doc)
|
||||
assert b == b''
|
||||
loaded = Doc(vocab).from_bytes(b)
|
||||
loaded = get_doc(en_vocab).from_bytes(b)
|
||||
assert len(loaded) == 0
|
||||
|
||||
|
||||
def test_serialize_after_adding_entity():
|
||||
# Re issue #514
|
||||
vocab = spacy.en.English.Defaults.create_vocab()
|
||||
entity_recognizer = spacy.en.English.Defaults.create_entity()
|
||||
|
||||
doc = Doc(vocab, words=u'This is a sentence about pasta .'.split())
|
||||
entity_recognizer.add_label('Food')
|
||||
entity_recognizer(doc)
|
||||
|
||||
label_id = vocab.strings[u'Food']
|
||||
doc.ents = [(label_id, 5,6)]
|
||||
|
||||
assert [(ent.label_, ent.text) for ent in doc.ents] == [(u'Food', u'pasta')]
|
||||
|
||||
byte_string = doc.to_bytes()
|
||||
@pytest.mark.parametrize('text', [TEXT])
|
||||
def test_serialize_tokens(en_vocab, text):
|
||||
doc1 = get_doc(en_vocab, [t for t in text])
|
||||
doc2 = get_doc(en_vocab).from_bytes(doc1.to_bytes())
|
||||
assert_docs_equal(doc1, doc2)
|
||||
|
||||
|
||||
@pytest.mark.models
|
||||
def test_serialize_after_adding_entity(EN):
|
||||
EN.entity.add_label(u'Food')
|
||||
doc = EN(u'This is a sentence about pasta.')
|
||||
label_id = EN.vocab.strings[u'Food']
|
||||
doc.ents = [(label_id, 5,6)]
|
||||
byte_string = doc.to_bytes()
|
||||
doc2 = Doc(EN.vocab).from_bytes(byte_string)
|
||||
assert [(ent.label_, ent.text) for ent in doc2.ents] == [(u'Food', u'pasta')]
|
||||
@pytest.mark.parametrize('text', [TEXT])
|
||||
@pytest.mark.parametrize('tags', [TAGS, []])
|
||||
@pytest.mark.parametrize('deps', [DEPS, []])
|
||||
@pytest.mark.parametrize('ents', [ENTS, []])
|
||||
def test_serialize_tokens_ner(EN, text, tags, deps, ents):
|
||||
doc1 = get_doc(EN.vocab, [t for t in text], tags=tags, deps=deps, ents=ents)
|
||||
doc2 = get_doc(EN.vocab).from_bytes(doc1.to_bytes())
|
||||
assert_docs_equal(doc1, doc2)
|
||||
|
|
|
@@ -1,33 +1,48 @@
|
|||
# -*- coding: utf-8 -*-
|
||||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
import os
|
||||
import io
|
||||
import pickle
|
||||
import pathlib
|
||||
|
||||
from spacy.lemmatizer import Lemmatizer, read_index, read_exc
|
||||
from spacy import util
|
||||
from ...lemmatizer import read_index, read_exc
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def path():
|
||||
if 'SPACY_DATA' in os.environ:
|
||||
return pathlib.Path(os.environ['SPACY_DATA'])
|
||||
else:
|
||||
return util.match_best_version('en', None, util.get_data_path())
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def lemmatizer(path):
|
||||
if path is not None:
|
||||
return Lemmatizer.load(path)
|
||||
else:
|
||||
@pytest.mark.models
|
||||
@pytest.mark.parametrize('text,lemmas', [("aardwolves", ["aardwolf"]),
|
||||
("aardwolf", ["aardwolf"]),
|
||||
("planets", ["planet"]),
|
||||
("ring", ["ring"]),
|
||||
("axes", ["axis", "axe", "ax"])])
|
||||
def test_tagger_lemmatizer_noun_lemmas(lemmatizer, text, lemmas):
|
||||
if lemmatizer is None:
|
||||
return None
|
||||
assert lemmatizer.noun(text) == set(lemmas)
|
||||
|
||||
|
||||
def test_read_index(path):
|
||||
@pytest.mark.models
|
||||
def test_tagger_lemmatizer_base_forms(lemmatizer):
|
||||
if lemmatizer is None:
|
||||
return None
|
||||
assert lemmatizer.noun('dive', {'number': 'sing'}) == set(['dive'])
|
||||
assert lemmatizer.noun('dive', {'number': 'plur'}) == set(['diva'])
|
||||
|
||||
|
||||
@pytest.mark.models
|
||||
def test_tagger_lemmatizer_base_form_verb(lemmatizer):
|
||||
if lemmatizer is None:
|
||||
return None
|
||||
assert lemmatizer.verb('saw', {'verbform': 'past'}) == set(['see'])
|
||||
|
||||
|
||||
@pytest.mark.models
|
||||
def test_tagger_lemmatizer_punct(lemmatizer):
|
||||
if lemmatizer is None:
|
||||
return None
|
||||
assert lemmatizer.punct('“') == set(['"'])
|
||||
assert lemmatizer.punct('“') == set(['"'])
|
||||
|
||||
|
||||
@pytest.mark.models
|
||||
def test_tagger_lemmatizer_read_index(path):
|
||||
if path is not None:
|
||||
with (path / 'wordnet' / 'index.noun').open() as file_:
|
||||
index = read_index(file_)
|
||||
|
@@ -36,67 +51,19 @@ def test_read_index(path):
|
|||
assert 'plant' in index
|
||||
|
||||
|
||||
def test_read_exc(path):
|
||||
@pytest.mark.models
|
||||
@pytest.mark.parametrize('text,lemma', [("was", "be")])
|
||||
def test_tagger_lemmatizer_read_exc(path, text, lemma):
|
||||
if path is not None:
|
||||
with (path / 'wordnet' / 'verb.exc').open() as file_:
|
||||
exc = read_exc(file_)
|
||||
assert exc['was'] == ('be',)
|
||||
|
||||
|
||||
def test_noun_lemmas(lemmatizer):
|
||||
if lemmatizer is None:
|
||||
return None
|
||||
do = lemmatizer.noun
|
||||
|
||||
assert do('aardwolves') == set(['aardwolf'])
|
||||
assert do('aardwolf') == set(['aardwolf'])
|
||||
assert do('planets') == set(['planet'])
|
||||
assert do('ring') == set(['ring'])
|
||||
assert do('axes') == set(['axis', 'axe', 'ax'])
|
||||
|
||||
|
||||
def test_base_form_dive(lemmatizer):
|
||||
if lemmatizer is None:
|
||||
return None
|
||||
|
||||
do = lemmatizer.noun
|
||||
assert do('dive', {'number': 'sing'}) == set(['dive'])
|
||||
assert do('dive', {'number': 'plur'}) == set(['diva'])
|
||||
|
||||
|
||||
def test_base_form_saw(lemmatizer):
|
||||
if lemmatizer is None:
|
||||
return None
|
||||
|
||||
do = lemmatizer.verb
|
||||
assert do('saw', {'verbform': 'past'}) == set(['see'])
|
||||
|
||||
|
||||
def test_smart_quotes(lemmatizer):
|
||||
if lemmatizer is None:
|
||||
return None
|
||||
|
||||
do = lemmatizer.punct
|
||||
assert do('“') == set(['"'])
|
||||
assert do('“') == set(['"'])
|
||||
|
||||
|
||||
def test_pickle_lemmatizer(lemmatizer):
|
||||
if lemmatizer is None:
|
||||
return None
|
||||
|
||||
file_ = io.BytesIO()
|
||||
pickle.dump(lemmatizer, file_)
|
||||
|
||||
file_.seek(0)
|
||||
|
||||
loaded = pickle.load(file_)
|
||||
assert exc[text] == (lemma,)
|
||||
|
||||
|
||||
@pytest.mark.models
|
||||
def test_lemma_assignment(EN):
|
||||
tokens = u'Bananas in pyjamas are geese .'.split(' ')
|
||||
doc = EN.tokenizer.tokens_from_list(tokens)
|
||||
assert all( t.lemma_ == u'' for t in doc )
|
||||
def test_tagger_lemmatizer_lemma_assignment(EN):
|
||||
text = "Bananas in pyjamas are geese."
|
||||
doc = EN.tokenizer(text)
|
||||
assert all(t.lemma_ == '' for t in doc)
|
||||
EN.tagger(doc)
|
||||
assert all( t.lemma_ != u'' for t in doc )
|
||||
assert all(t.lemma_ != '' for t in doc)
|
||||
|
|
|
@@ -1,37 +1,32 @@
|
|||
# coding: utf-8
|
||||
"""Ensure spaces are assigned the POS tag SPACE"""
|
||||
|
||||
|
||||
from __future__ import unicode_literals
|
||||
from spacy.parts_of_speech import SPACE
|
||||
from ...parts_of_speech import SPACE
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
@pytest.mark.models
|
||||
def test_tagger_spaces(EN):
|
||||
text = "Some\nspaces are\tnecessary."
|
||||
doc = EN(text, tag=True, parse=False)
|
||||
assert doc[0].pos != SPACE
|
||||
assert doc[0].pos_ != 'SPACE'
|
||||
assert doc[1].pos == SPACE
|
||||
assert doc[1].pos_ == 'SPACE'
|
||||
assert doc[1].tag_ == 'SP'
|
||||
assert doc[2].pos != SPACE
|
||||
assert doc[3].pos != SPACE
|
||||
assert doc[4].pos == SPACE
|
||||
|
||||
@pytest.fixture
|
||||
def tagged(EN):
|
||||
string = u'Some\nspaces are\tnecessary.'
|
||||
tokens = EN(string, tag=True, parse=False)
|
||||
return tokens
|
||||
|
||||
@pytest.mark.models
|
||||
def test_spaces(tagged):
|
||||
assert tagged[0].pos != SPACE
|
||||
assert tagged[0].pos_ != 'SPACE'
|
||||
assert tagged[1].pos == SPACE
|
||||
assert tagged[1].pos_ == 'SPACE'
|
||||
assert tagged[1].tag_ == 'SP'
|
||||
assert tagged[2].pos != SPACE
|
||||
assert tagged[3].pos != SPACE
|
||||
assert tagged[4].pos == SPACE
|
||||
|
||||
|
||||
@pytest.mark.xfail
|
||||
@pytest.mark.models
|
||||
def test_return_char(EN):
|
||||
string = ('hi Aaron,\r\n\r\nHow is your schedule today, I was wondering if '
|
||||
def test_tagger_return_char(EN):
|
||||
text = ('hi Aaron,\r\n\r\nHow is your schedule today, I was wondering if '
|
||||
'you had time for a phone\r\ncall this afternoon?\r\n\r\n\r\n')
|
||||
tokens = EN(string)
|
||||
tokens = EN(text)
|
||||
for token in tokens:
|
||||
if token.is_space:
|
||||
assert token.pos == SPACE
|
||||
|
|
|
@@ -1,14 +1,16 @@
|
|||
from spacy.en import English
|
||||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import six
|
||||
import pytest
|
||||
|
||||
|
||||
@pytest.mark.models
|
||||
def test_tag_names(EN):
|
||||
tokens = EN(u'I ate pizzas with anchovies.', parse=False, tag=True)
|
||||
pizza = tokens[2]
|
||||
assert type(pizza.pos) == int
|
||||
assert isinstance(pizza.pos_, six.text_type)
|
||||
assert type(pizza.dep) == int
|
||||
assert isinstance(pizza.dep_, six.text_type)
|
||||
assert pizza.tag_ == u'NNS'
|
||||
text = "I ate pizzas with anchovies."
|
||||
doc = EN(text, parse=False, tag=True)
|
||||
assert type(doc[2].pos) == int
|
||||
assert isinstance(doc[2].pos_, six.text_type)
|
||||
assert type(doc[2].dep) == int
|
||||
assert isinstance(doc[2].dep_, six.text_type)
|
||||
assert doc[2].tag_ == u'NNS'
|
||||
|
|
spacy/tests/test_attrs.py (new file, 27 lines)
@@ -0,0 +1,27 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+from ..attrs import intify_attrs, ORTH, NORM, LEMMA, IS_ALPHA
+
+import pytest
+
+
+@pytest.mark.parametrize('text', ["dog"])
+def test_attrs_key(text):
+    assert intify_attrs({"ORTH": text}) == {ORTH: text}
+    assert intify_attrs({"NORM": text}) == {NORM: text}
+    assert intify_attrs({"lemma": text}, strings_map={text: 10}) == {LEMMA: 10}
+
+
+@pytest.mark.parametrize('text', ["dog"])
+def test_attrs_idempotence(text):
+    int_attrs = intify_attrs({"lemma": text, 'is_alpha': True}, strings_map={text: 10})
+    assert intify_attrs(int_attrs) == {LEMMA: 10, IS_ALPHA: True}
+
+
+@pytest.mark.parametrize('text', ["dog"])
+def test_attrs_do_deprecated(text):
+    int_attrs = intify_attrs({"F": text, 'is_alpha': True},
+                             strings_map={text: 10},
+                             _do_deprecated=True)
+    assert int_attrs == {ORTH: 10, IS_ALPHA: True}
@@ -1,94 +0,0 @@
|
|||
from __future__ import unicode_literals
|
||||
import pytest
|
||||
|
||||
from spacy.strings import StringStore
|
||||
from spacy.matcher import *
|
||||
from spacy.attrs import LOWER
|
||||
from spacy.tokens.doc import Doc
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.en import English
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def matcher():
|
||||
patterns = {
|
||||
'JS': ['PRODUCT', {}, [[{'ORTH': 'JavaScript'}]]],
|
||||
'GoogleNow': ['PRODUCT', {}, [[{'ORTH': 'Google'}, {'ORTH': 'Now'}]]],
|
||||
'Java': ['PRODUCT', {}, [[{'LOWER': 'java'}]]],
|
||||
}
|
||||
return Matcher(Vocab(lex_attr_getters=English.Defaults.lex_attr_getters), patterns)
|
||||
|
||||
|
||||
def test_compile(matcher):
|
||||
assert matcher.n_patterns == 3
|
||||
|
||||
|
||||
def test_no_match(matcher):
|
||||
doc = Doc(matcher.vocab, words=['I', 'like', 'cheese', '.'])
|
||||
assert matcher(doc) == []
|
||||
|
||||
|
||||
def test_match_start(matcher):
|
||||
doc = Doc(matcher.vocab, words=['JavaScript', 'is', 'good'])
|
||||
assert matcher(doc) == [(matcher.vocab.strings['JS'],
|
||||
matcher.vocab.strings['PRODUCT'], 0, 1)]
|
||||
|
||||
|
||||
def test_match_end(matcher):
|
||||
doc = Doc(matcher.vocab, words=['I', 'like', 'java'])
|
||||
assert matcher(doc) == [(doc.vocab.strings['Java'],
|
||||
doc.vocab.strings['PRODUCT'], 2, 3)]
|
||||
|
||||
|
||||
def test_match_middle(matcher):
|
||||
doc = Doc(matcher.vocab, words=['I', 'like', 'Google', 'Now', 'best'])
|
||||
assert matcher(doc) == [(doc.vocab.strings['GoogleNow'],
|
||||
doc.vocab.strings['PRODUCT'], 2, 4)]
|
||||
|
||||
|
||||
def test_match_multi(matcher):
|
||||
doc = Doc(matcher.vocab, words='I like Google Now and java best'.split())
|
||||
assert matcher(doc) == [(doc.vocab.strings['GoogleNow'],
|
||||
doc.vocab.strings['PRODUCT'], 2, 4),
|
||||
(doc.vocab.strings['Java'],
|
||||
doc.vocab.strings['PRODUCT'], 5, 6)]
|
||||
|
||||
def test_match_zero(matcher):
|
||||
matcher.add('Quote', '', {}, [
|
||||
[
|
||||
{'ORTH': '"'},
|
||||
{'OP': '!', 'IS_PUNCT': True},
|
||||
{'OP': '!', 'IS_PUNCT': True},
|
||||
{'ORTH': '"'}
|
||||
]])
|
||||
doc = Doc(matcher.vocab, words='He said , " some words " ...'.split())
|
||||
assert len(matcher(doc)) == 1
|
||||
doc = Doc(matcher.vocab, words='He said , " some three words " ...'.split())
|
||||
assert len(matcher(doc)) == 0
|
||||
matcher.add('Quote', '', {}, [
|
||||
[
|
||||
{'ORTH': '"'},
|
||||
{'IS_PUNCT': True},
|
||||
{'IS_PUNCT': True},
|
||||
{'IS_PUNCT': True},
|
||||
{'ORTH': '"'}
|
||||
]])
|
||||
assert len(matcher(doc)) == 0
|
||||
|
||||
|
||||
def test_match_zero_plus(matcher):
|
||||
matcher.add('Quote', '', {}, [
|
||||
[
|
||||
{'ORTH': '"'},
|
||||
{'OP': '*', 'IS_PUNCT': False},
|
||||
{'ORTH': '"'}
|
||||
]])
|
||||
doc = Doc(matcher.vocab, words='He said , " some words " ...'.split())
|
||||
assert len(matcher(doc)) == 1
|
||||
|
||||
|
||||
def test_phrase_matcher():
|
||||
vocab = Vocab(lex_attr_getters=English.Defaults.lex_attr_getters)
|
||||
matcher = PhraseMatcher(vocab, [Doc(vocab, words='Google Now'.split())])
|
||||
doc = Doc(vocab, words=['I', 'like', 'Google', 'Now', 'best'])
|
||||
assert len(matcher(doc)) == 1
|
|
@@ -1,11 +1,13 @@
 # coding: utf-8
 from __future__ import unicode_literals
-from os import path
-
-import pytest
 
+from ...vocab import Vocab
+from ...tokenizer import Tokenizer
 from ...util import utf8open
+
+from os import path
+import pytest
 
 
 def test_tokenizer_handles_no_word(tokenizer):
     tokens = tokenizer("")
@@ -81,3 +83,25 @@ def test_tokenizer_suspected_freeing_strings(tokenizer):
     tokens2 = tokenizer(text2)
     assert tokens1[0].text == "Lorem"
     assert tokens2[0].text == "Lorem"
+
+
+@pytest.mark.parametrize('text,tokens', [
+    ("lorem", [{'orth': 'lo'}, {'orth': 'rem'}])])
+def test_tokenizer_add_special_case(tokenizer, text, tokens):
+    tokenizer.add_special_case(text, tokens)
+    doc = tokenizer(text)
+    assert doc[0].text == tokens[0]['orth']
+    assert doc[1].text == tokens[1]['orth']
+
+
+@pytest.mark.parametrize('text,tokens', [
+    ("lorem", [{'orth': 'lo', 'tag': 'NN'}, {'orth': 'rem'}])])
+def test_tokenizer_add_special_case_tag(text, tokens):
+    vocab = Vocab(tag_map={'NN': {'pos': 'NOUN'}})
+    tokenizer = Tokenizer(vocab, {}, None, None, None)
+    tokenizer.add_special_case(text, tokens)
+    doc = tokenizer(text)
+    assert doc[0].text == tokens[0]['orth']
+    assert doc[0].tag_ == tokens[0]['tag']
+    assert doc[0].pos_ == 'NOUN'
+    assert doc[1].text == tokens[1]['orth']
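The special-case API these new tests cover can also be exercised directly, outside the fixtures; a minimal sketch (the "dont" split below is an invented example, not part of this diff):

from spacy.vocab import Vocab
from spacy.tokenizer import Tokenizer

vocab = Vocab(tag_map={'NN': {'pos': 'NOUN'}})
tokenizer = Tokenizer(vocab, {}, None, None, None)  # no rules, prefix, suffix or infix patterns
tokenizer.add_special_case("dont", [{'orth': 'do'}, {'orth': 'nt'}])
doc = tokenizer("dont")
assert [t.text for t in doc] == ['do', 'nt']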
@@ -4,13 +4,17 @@ from __future__ import unicode_literals
|
|||
import pytest
|
||||
|
||||
|
||||
URLS = [
|
||||
URLS_BASIC = [
|
||||
"http://www.nytimes.com/2016/04/20/us/politics/new-york-primary-preview.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=a-lede-package-region®ion=top-news&WT.nav=top-news&_r=0",
|
||||
"www.google.com?q=google",
|
||||
"www.red-stars.com",
|
||||
"http://foo.com/blah_(wikipedia)#cite-1",
|
||||
"mailto:foo.bar@baz.com",
|
||||
"mailto:foo-bar@baz-co.com"
|
||||
|
||||
]
|
||||
|
||||
URLS_FULL = URLS_BASIC + [
|
||||
"mailto:foo-bar@baz-co.com",
|
||||
"www.google.com?q=google",
|
||||
"http://foo.com/blah_(wikipedia)#cite-1"
|
||||
]
|
||||
|
||||
|
||||
|
@@ -25,32 +29,14 @@ SUFFIXES = [
|
|||
'"', ":", ">"]
|
||||
|
||||
|
||||
@pytest.mark.parametrize("url", URLS)
|
||||
@pytest.mark.parametrize("url", URLS_BASIC)
|
||||
def test_tokenizer_handles_simple_url(tokenizer, url):
|
||||
tokens = tokenizer(url)
|
||||
assert len(tokens) == 1
|
||||
assert tokens[0].text == url
|
||||
|
||||
|
||||
@pytest.mark.parametrize("prefix", PREFIXES)
|
||||
@pytest.mark.parametrize("url", URLS)
|
||||
def test_tokenizer_handles_prefixed_url(tokenizer, prefix, url):
|
||||
tokens = tokenizer(prefix + url)
|
||||
assert len(tokens) == 2
|
||||
assert tokens[0].text == prefix
|
||||
assert tokens[1].text == url
|
||||
|
||||
|
||||
@pytest.mark.parametrize("suffix", SUFFIXES)
|
||||
@pytest.mark.parametrize("url", URLS)
|
||||
def test_tokenizer_handles_suffixed_url(tokenizer, url, suffix):
|
||||
tokens = tokenizer(url + suffix)
|
||||
assert len(tokens) == 2
|
||||
assert tokens[0].text == url
|
||||
assert tokens[1].text == suffix
|
||||
|
||||
|
||||
@pytest.mark.parametrize("url", URLS)
|
||||
@pytest.mark.parametrize("url", URLS_BASIC)
|
||||
def test_tokenizer_handles_simple_surround_url(tokenizer, url):
|
||||
tokens = tokenizer("(" + url + ")")
|
||||
assert len(tokens) == 3
|
||||
|
@@ -61,8 +47,28 @@ def test_tokenizer_handles_simple_surround_url(tokenizer, url):
|
|||
|
||||
@pytest.mark.slow
|
||||
@pytest.mark.parametrize("prefix", PREFIXES)
|
||||
@pytest.mark.parametrize("url", URLS_FULL)
|
||||
def test_tokenizer_handles_prefixed_url(tokenizer, prefix, url):
|
||||
tokens = tokenizer(prefix + url)
|
||||
assert len(tokens) == 2
|
||||
assert tokens[0].text == prefix
|
||||
assert tokens[1].text == url
|
||||
|
||||
|
||||
@pytest.mark.slow
|
||||
@pytest.mark.parametrize("suffix", SUFFIXES)
|
||||
@pytest.mark.parametrize("url", URLS)
|
||||
@pytest.mark.parametrize("url", URLS_FULL)
|
||||
def test_tokenizer_handles_suffixed_url(tokenizer, url, suffix):
|
||||
tokens = tokenizer(url + suffix)
|
||||
assert len(tokens) == 2
|
||||
assert tokens[0].text == url
|
||||
assert tokens[1].text == suffix
|
||||
|
||||
|
||||
@pytest.mark.slow
|
||||
@pytest.mark.parametrize("prefix", PREFIXES)
|
||||
@pytest.mark.parametrize("suffix", SUFFIXES)
|
||||
@pytest.mark.parametrize("url", URLS_FULL)
|
||||
def test_tokenizer_handles_surround_url(tokenizer, prefix, suffix, url):
|
||||
tokens = tokenizer(prefix + url + suffix)
|
||||
assert len(tokens) == 3
|
||||
|
@@ -74,7 +80,7 @@ def test_tokenizer_handles_surround_url(tokenizer, prefix, suffix, url):
|
|||
@pytest.mark.slow
|
||||
@pytest.mark.parametrize("prefix1", PREFIXES)
|
||||
@pytest.mark.parametrize("prefix2", PREFIXES)
|
||||
@pytest.mark.parametrize("url", URLS)
|
||||
@pytest.mark.parametrize("url", URLS_FULL)
|
||||
def test_tokenizer_handles_two_prefix_url(tokenizer, prefix1, prefix2, url):
|
||||
tokens = tokenizer(prefix1 + prefix2 + url)
|
||||
assert len(tokens) == 3
|
||||
|
@@ -86,7 +92,7 @@ def test_tokenizer_handles_two_prefix_url(tokenizer, suffix1, suffix2, url):
|
|||
@pytest.mark.slow
|
||||
@pytest.mark.parametrize("suffix1", SUFFIXES)
|
||||
@pytest.mark.parametrize("suffix2", SUFFIXES)
|
||||
@pytest.mark.parametrize("url", URLS)
|
||||
@pytest.mark.parametrize("url", URLS_FULL)
|
||||
def test_tokenizer_handles_two_prefix_url(tokenizer, suffix1, suffix2, url):
|
||||
tokens = tokenizer(url + suffix1 + suffix2)
|
||||
assert len(tokens) == 3
|
||||
|
|
|
@@ -1,35 +0,0 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...attrs import *
|
||||
|
||||
|
||||
def test_key_no_value():
|
||||
int_attrs = intify_attrs({"ORTH": "dog"})
|
||||
assert int_attrs == {ORTH: "dog"}
|
||||
|
||||
|
||||
def test_lower_key():
|
||||
int_attrs = intify_attrs({"norm": "dog"})
|
||||
assert int_attrs == {NORM: "dog"}
|
||||
|
||||
|
||||
|
||||
def test_lower_key_value():
|
||||
vals = {'dog': 10}
|
||||
int_attrs = intify_attrs({"lemma": "dog"}, strings_map=vals)
|
||||
assert int_attrs == {LEMMA: 10}
|
||||
|
||||
|
||||
def test_idempotence():
|
||||
vals = {'dog': 10}
|
||||
int_attrs = intify_attrs({"lemma": "dog", 'is_alpha': True}, strings_map=vals)
|
||||
int_attrs = intify_attrs(int_attrs)
|
||||
assert int_attrs == {LEMMA: 10, IS_ALPHA: True}
|
||||
|
||||
|
||||
def test_do_deprecated():
|
||||
vals = {'dog': 10}
|
||||
int_attrs = intify_attrs({"F": "dog", 'is_alpha': True}, strings_map=vals,
|
||||
_do_deprecated=True)
|
||||
assert int_attrs == {ORTH: 10, IS_ALPHA: True}
|
|
@@ -1,137 +0,0 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
import numpy
|
||||
|
||||
from ...attrs import HEAD, DEP
|
||||
|
||||
|
||||
@pytest.mark.models
|
||||
class TestNounChunks:
|
||||
@pytest.fixture(scope="class")
|
||||
def ex1_en(self, EN):
|
||||
example = EN.tokenizer.tokens_from_list('A base phrase should be recognized .'.split(' '))
|
||||
EN.tagger.tag_from_strings(example, 'DT NN NN MD VB VBN .'.split(' '))
|
||||
det,compound,nsubjpass,aux,auxpass,root,punct = tuple( EN.vocab.strings[l] for l in ['det','compound','nsubjpass','aux','auxpass','root','punct'] )
|
||||
example.from_array([HEAD, DEP],
|
||||
numpy.asarray(
|
||||
[
|
||||
[2, det],
|
||||
[1, compound],
|
||||
[3, nsubjpass],
|
||||
[2, aux],
|
||||
[1, auxpass],
|
||||
[0, root],
|
||||
[-1, punct]
|
||||
], dtype='int32'))
|
||||
return example
|
||||
|
||||
@pytest.fixture(scope="class")
|
||||
def ex2_en(self, EN):
|
||||
example = EN.tokenizer.tokens_from_list('A base phrase and a good phrase are often the same .'.split(' '))
|
||||
EN.tagger.tag_from_strings(example, 'DT NN NN CC DT JJ NN VBP RB DT JJ .'.split(' '))
|
||||
det,compound,nsubj,cc,amod,conj,root,advmod,attr,punct = tuple( EN.vocab.strings[l] for l in ['det','compound','nsubj','cc','amod','conj','root','advmod','attr','punct'] )
|
||||
example.from_array([HEAD, DEP],
|
||||
numpy.asarray(
|
||||
[
|
||||
[2, det],
|
||||
[1, compound],
|
||||
[5, nsubj],
|
||||
[-1, cc],
|
||||
[1, det],
|
||||
[1, amod],
|
||||
[-4, conj],
|
||||
[0, root],
|
||||
[-1, advmod],
|
||||
[1, det],
|
||||
[-3, attr],
|
||||
[-4, punct]
|
||||
], dtype='int32'))
|
||||
return example
|
||||
|
||||
@pytest.fixture(scope="class")
|
||||
def ex3_en(self, EN):
|
||||
example = EN.tokenizer.tokens_from_list('A phrase with another phrase occurs .'.split(' '))
|
||||
EN.tagger.tag_from_strings(example, 'DT NN IN DT NN VBZ .'.split(' '))
|
||||
det,nsubj,prep,pobj,root,punct = tuple( EN.vocab.strings[l] for l in ['det','nsubj','prep','pobj','root','punct'] )
|
||||
example.from_array([HEAD, DEP],
|
||||
numpy.asarray(
|
||||
[
|
||||
[1, det],
|
||||
[4, nsubj],
|
||||
[-1, prep],
|
||||
[1, det],
|
||||
[-2, pobj],
|
||||
[0, root],
|
||||
[-1, punct]
|
||||
], dtype='int32'))
|
||||
return example
|
||||
|
||||
@pytest.fixture(scope="class")
|
||||
def ex1_de(self, DE):
|
||||
example = DE.tokenizer.tokens_from_list('Eine Tasse steht auf dem Tisch .'.split(' '))
|
||||
DE.tagger.tag_from_strings(example, 'ART NN VVFIN APPR ART NN $.'.split(' '))
|
||||
nk,sb,root,mo,punct = tuple( DE.vocab.strings[l] for l in ['nk','sb','root','mo','punct'])
|
||||
example.from_array([HEAD, DEP],
|
||||
numpy.asarray(
|
||||
[
|
||||
[1, nk],
|
||||
[1, sb],
|
||||
[0, root],
|
||||
[-1, mo],
|
||||
[1, nk],
|
||||
[-2, nk],
|
||||
[-3, punct]
|
||||
], dtype='int32'))
|
||||
return example
|
||||
|
||||
@pytest.fixture(scope="class")
|
||||
def ex2_de(self, DE):
|
||||
example = DE.tokenizer.tokens_from_list('Die Sängerin singt mit einer Tasse Kaffee Arien .'.split(' '))
|
||||
DE.tagger.tag_from_strings(example, 'ART NN VVFIN APPR ART NN NN NN $.'.split(' '))
|
||||
nk,sb,root,mo,punct,oa = tuple( DE.vocab.strings[l] for l in ['nk','sb','root','mo','punct','oa'])
|
||||
example.from_array([HEAD, DEP],
|
||||
numpy.asarray(
|
||||
[
|
||||
[1, nk],
|
||||
[1, sb],
|
||||
[0, root],
|
||||
[-1, mo],
|
||||
[1, nk],
|
||||
[-2, nk],
|
||||
[-1, nk],
|
||||
[-5, oa],
|
||||
[-6, punct]
|
||||
], dtype='int32'))
|
||||
return example
|
||||
|
||||
def test_en_standard_chunk(self, ex1_en):
|
||||
chunks = list(ex1_en.noun_chunks)
|
||||
assert len(chunks) == 1
|
||||
assert chunks[0].string == 'A base phrase '
|
||||
|
||||
def test_en_coordinated_chunks(self, ex2_en):
|
||||
chunks = list(ex2_en.noun_chunks)
|
||||
assert len(chunks) == 2
|
||||
assert chunks[0].string == 'A base phrase '
|
||||
assert chunks[1].string == 'a good phrase '
|
||||
|
||||
def test_en_pp_chunks(self, ex3_en):
|
||||
chunks = list(ex3_en.noun_chunks)
|
||||
assert len(chunks) == 2
|
||||
assert chunks[0].string == 'A phrase '
|
||||
assert chunks[1].string == 'another phrase '
|
||||
|
||||
def test_de_standard_chunk(self, ex1_de):
|
||||
chunks = list(ex1_de.noun_chunks)
|
||||
assert len(chunks) == 2
|
||||
assert chunks[0].string == 'Eine Tasse '
|
||||
assert chunks[1].string == 'dem Tisch '
|
||||
|
||||
def test_de_extended_chunk(self, ex2_de):
|
||||
chunks = list(ex2_de.noun_chunks)
|
||||
assert len(chunks) == 3
|
||||
assert chunks[0].string == 'Die Sängerin '
|
||||
assert chunks[1].string == 'einer Tasse Kaffee '
|
||||
assert chunks[2].string == 'Arien '
|
|
@@ -1,50 +0,0 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...vocab import Vocab
|
||||
from ...tokenizer import Tokenizer
|
||||
|
||||
import re
|
||||
import pytest
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def vocab():
|
||||
return Vocab(tag_map={'NN': {'pos': 'NOUN'}})
|
||||
|
||||
@pytest.fixture
|
||||
def rules():
|
||||
return {}
|
||||
|
||||
@pytest.fixture
|
||||
def prefix_search():
|
||||
return None
|
||||
|
||||
@pytest.fixture
|
||||
def suffix_search():
|
||||
return None
|
||||
|
||||
@pytest.fixture
|
||||
def infix_finditer():
|
||||
return None
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def tokenizer(vocab, rules, prefix_search, suffix_search, infix_finditer):
|
||||
return Tokenizer(vocab, rules, prefix_search, suffix_search, infix_finditer)
|
||||
|
||||
|
||||
def test_add_special_case(tokenizer):
|
||||
tokenizer.add_special_case('dog', [{'orth': 'd'}, {'orth': 'og'}])
|
||||
doc = tokenizer('dog')
|
||||
assert doc[0].text == 'd'
|
||||
assert doc[1].text == 'og'
|
||||
|
||||
|
||||
def test_special_case_tag(tokenizer):
|
||||
tokenizer.add_special_case('dog', [{'orth': 'd', 'tag': 'NN'}, {'orth': 'og'}])
|
||||
doc = tokenizer('dog')
|
||||
assert doc[0].text == 'd'
|
||||
assert doc[0].tag_ == 'NN'
|
||||
assert doc[0].pos_ == 'NOUN'
|
||||
assert doc[1].text == 'og'
|
|
@@ -4,6 +4,8 @@ from __future__ import unicode_literals
 from ..tokens import Doc
 from ..attrs import ORTH, POS, HEAD, DEP
 
+import numpy
+
 
 def get_doc(vocab, words=[], pos=None, heads=None, deps=None, tags=None, ents=None):
     """Create Doc object from given vocab, words and annotations."""
@@ -36,3 +38,36 @@ def apply_transition_sequence(parser, doc, sequence):
     with parser.step_through(doc) as stepwise:
         for transition in sequence:
             stepwise.transition(transition)
+
+
+def add_vecs_to_vocab(vocab, vectors):
+    """Add list of vector tuples to given vocab. All vectors need to have the
+    same length. Format: [("text", [1, 2, 3])]"""
+    length = len(vectors[0][1])
+    vocab.resize_vectors(length)
+    for word, vec in vectors:
+        vocab[word].vector = vec
+    return vocab
+
+
+def get_cosine(vec1, vec2):
+    """Get cosine for two given vectors"""
+    return numpy.dot(vec1, vec2) / (numpy.linalg.norm(vec1) * numpy.linalg.norm(vec2))
+
+
+def assert_docs_equal(doc1, doc2):
+    """Compare two Doc objects and assert that they're equal. Tests for tokens,
+    tags, dependencies and entities."""
+    assert [ t.orth for t in doc1 ] == [ t.orth for t in doc2 ]
+
+    assert [ t.pos for t in doc1 ] == [ t.pos for t in doc2 ]
+    assert [ t.tag for t in doc1 ] == [ t.tag for t in doc2 ]
+
+    assert [ t.head.i for t in doc1 ] == [ t.head.i for t in doc2 ]
+    assert [ t.dep for t in doc1 ] == [ t.dep for t in doc2 ]
+    if doc1.is_parsed and doc2.is_parsed:
+        assert [ s for s in doc1.sents ] == [ s for s in doc2.sents ]
+
+    assert [ t.ent_type for t in doc1 ] == [ t.ent_type for t in doc2 ]
+    assert [ t.ent_iob for t in doc1 ] == [ t.ent_iob for t in doc2 ]
+    assert [ ent for ent in doc1.ents ] == [ ent for ent in doc2.ents ]
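Together these helpers let model-free tests build annotated Doc objects and compare them; a sketch of how a serialization test might use them (the function name and annotation values are illustrative, and en_vocab is assumed to be the suite's shared vocab fixture):

from ..util import get_doc, assert_docs_equal

def test_doc_bytes_roundtrip_sketch(en_vocab):
    words = ["This", "is", "a", "test", "."]
    tags = ["DT", "VBZ", "DT", "NN", "."]
    doc1 = get_doc(en_vocab, words, tags=tags)
    doc2 = get_doc(en_vocab).from_bytes(doc1.to_bytes())
    assert_docs_equal(doc1, doc2)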
@@ -1,96 +1,60 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
import spacy
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.tokens.doc import Doc
|
||||
import numpy
|
||||
import numpy.linalg
|
||||
|
||||
from ..util import get_doc, get_cosine, add_vecs_to_vocab
|
||||
|
||||
import numpy
|
||||
import pytest
|
||||
|
||||
|
||||
def get_vector(letters):
|
||||
return numpy.asarray([ord(letter) for letter in letters], dtype='float32')
|
||||
|
||||
|
||||
def get_cosine(vec1, vec2):
|
||||
return numpy.dot(vec1, vec2) / (numpy.linalg.norm(vec1) * numpy.linalg.norm(vec2))
|
||||
|
||||
|
||||
@pytest.fixture(scope='module')
|
||||
def en_vocab():
|
||||
vocab = spacy.get_lang_class('en').Defaults.create_vocab()
|
||||
vocab.resize_vectors(2)
|
||||
apple_ = vocab[u'apple']
|
||||
orange_ = vocab[u'orange']
|
||||
apple_.vector = get_vector('ap')
|
||||
orange_.vector = get_vector('or')
|
||||
return vocab
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def appleL(en_vocab):
|
||||
return en_vocab['apple']
|
||||
def vectors():
|
||||
return [("apple", [1, 2, 3]), ("orange", [-1, -2, -3])]
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def orangeL(en_vocab):
|
||||
return en_vocab['orange']
|
||||
@pytest.fixture()
|
||||
def vocab(en_vocab, vectors):
|
||||
return add_vecs_to_vocab(en_vocab, vectors)
|
||||
|
||||
|
||||
@pytest.fixture(scope='module')
|
||||
def apple_orange(en_vocab):
|
||||
return Doc(en_vocab, words=[u'apple', u'orange'])
|
||||
def test_vectors_similarity_LL(vocab, vectors):
|
||||
[(word1, vec1), (word2, vec2)] = vectors
|
||||
lex1 = vocab[word1]
|
||||
lex2 = vocab[word2]
|
||||
assert lex1.has_vector
|
||||
assert lex2.has_vector
|
||||
assert lex1.vector_norm != 0
|
||||
assert lex2.vector_norm != 0
|
||||
assert lex1.vector[0] != lex2.vector[0] and lex1.vector[1] != lex2.vector[1]
|
||||
assert numpy.isclose(lex1.similarity(lex2), get_cosine(vec1, vec2))
|
||||
assert numpy.isclose(lex2.similarity(lex2), lex1.similarity(lex1))
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def appleT(apple_orange):
|
||||
return apple_orange[0]
|
||||
def test_vectors_similarity_TT(vocab, vectors):
|
||||
[(word1, vec1), (word2, vec2)] = vectors
|
||||
doc = get_doc(vocab, words=[word1, word2])
|
||||
assert doc[0].has_vector
|
||||
assert doc[1].has_vector
|
||||
assert doc[0].vector_norm != 0
|
||||
assert doc[1].vector_norm != 0
|
||||
assert doc[0].vector[0] != doc[1].vector[0] and doc[0].vector[1] != doc[1].vector[1]
|
||||
assert numpy.isclose(doc[0].similarity(doc[1]), get_cosine(vec1, vec2))
|
||||
assert numpy.isclose(doc[1].similarity(doc[0]), doc[0].similarity(doc[1]))
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def orangeT(apple_orange):
|
||||
return apple_orange[1]
|
||||
def test_vectors_similarity_TD(vocab, vectors):
|
||||
[(word1, vec1), (word2, vec2)] = vectors
|
||||
doc = get_doc(vocab, words=[word1, word2])
|
||||
assert doc.similarity(doc[0]) == doc[0].similarity(doc)
|
||||
|
||||
|
||||
def test_LL_sim(appleL, orangeL):
|
||||
assert appleL.has_vector
|
||||
assert orangeL.has_vector
|
||||
assert appleL.vector_norm != 0
|
||||
assert orangeL.vector_norm != 0
|
||||
assert appleL.vector[0] != orangeL.vector[0] and appleL.vector[1] != orangeL.vector[1]
|
||||
assert numpy.isclose(
|
||||
appleL.similarity(orangeL),
|
||||
get_cosine(get_vector('ap'), get_vector('or')))
|
||||
assert numpy.isclose(
|
||||
orangeL.similarity(appleL),
|
||||
appleL.similarity(orangeL))
|
||||
|
||||
|
||||
def test_TT_sim(appleT, orangeT):
|
||||
assert appleT.has_vector
|
||||
assert orangeT.has_vector
|
||||
assert appleT.vector_norm != 0
|
||||
assert orangeT.vector_norm != 0
|
||||
assert appleT.vector[0] != orangeT.vector[0] and appleT.vector[1] != orangeT.vector[1]
|
||||
assert numpy.isclose(
|
||||
appleT.similarity(orangeT),
|
||||
get_cosine(get_vector('ap'), get_vector('or')))
|
||||
assert numpy.isclose(
|
||||
orangeT.similarity(appleT),
|
||||
appleT.similarity(orangeT))
|
||||
|
||||
|
||||
def test_TD_sim(apple_orange, appleT):
|
||||
assert apple_orange.similarity(appleT) == appleT.similarity(apple_orange)
|
||||
|
||||
def test_DS_sim(apple_orange, appleT):
|
||||
span = apple_orange[:2]
|
||||
assert apple_orange.similarity(span) == 1.0
|
||||
assert span.similarity(apple_orange) == 1.0
|
||||
|
||||
|
||||
def test_TS_sim(apple_orange, appleT):
|
||||
span = apple_orange[:2]
|
||||
assert span.similarity(appleT) == appleT.similarity(span)
|
||||
def test_vectors_similarity_DS(vocab, vectors):
|
||||
[(word1, vec1), (word2, vec2)] = vectors
|
||||
doc = get_doc(vocab, words=[word1, word2])
|
||||
assert doc.similarity(doc[:2]) == doc[:2].similarity(doc)
|
||||
|
||||
|
||||
def test_vectors_similarity_TS(vocab, vectors):
|
||||
[(word1, vec1), (word2, vec2)] = vectors
|
||||
doc = get_doc(vocab, words=[word1, word2])
|
||||
assert doc[:2].similarity(doc[0]) == doc[0].similarity(doc[:2])
|
||||
|
|
|
@@ -1,109 +1,126 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...tokenizer import Tokenizer
|
||||
from ..util import get_doc, add_vecs_to_vocab
|
||||
|
||||
import pytest
|
||||
|
||||
@pytest.mark.models
|
||||
def test_token_vector(EN):
|
||||
token = EN(u'Apples and oranges')[0]
|
||||
token.vector
|
||||
token.vector_norm
|
||||
|
||||
@pytest.mark.models
|
||||
def test_lexeme_vector(EN):
|
||||
lexeme = EN.vocab[u'apples']
|
||||
lexeme.vector
|
||||
lexeme.vector_norm
|
||||
@pytest.fixture
|
||||
def vectors():
|
||||
return [("apple", [0.0, 1.0, 2.0]), ("orange", [3.0, -2.0, 4.0])]
|
||||
|
||||
|
||||
@pytest.mark.models
|
||||
def test_doc_vector(EN):
|
||||
doc = EN(u'Apples and oranges')
|
||||
doc.vector
|
||||
doc.vector_norm
|
||||
|
||||
@pytest.mark.models
|
||||
def test_span_vector(EN):
|
||||
span = EN(u'Apples and oranges')[0:2]
|
||||
span.vector
|
||||
span.vector_norm
|
||||
|
||||
@pytest.mark.models
|
||||
def test_token_token_similarity(EN):
|
||||
apples, oranges = EN(u'apples oranges')
|
||||
assert apples.similarity(oranges) == oranges.similarity(apples)
|
||||
assert 0.0 < apples.similarity(oranges) < 1.0
|
||||
|
||||
|
||||
@pytest.mark.models
|
||||
def test_token_lexeme_similarity(EN):
|
||||
apples = EN(u'apples')
|
||||
oranges = EN.vocab[u'oranges']
|
||||
assert apples.similarity(oranges) == oranges.similarity(apples)
|
||||
assert 0.0 < apples.similarity(oranges) < 1.0
|
||||
|
||||
|
||||
@pytest.mark.models
|
||||
def test_token_span_similarity(EN):
|
||||
doc = EN(u'apples orange juice')
|
||||
apples = doc[0]
|
||||
oranges = doc[1:3]
|
||||
assert apples.similarity(oranges) == oranges.similarity(apples)
|
||||
assert 0.0 < apples.similarity(oranges) < 1.0
|
||||
|
||||
|
||||
@pytest.mark.models
|
||||
def test_token_doc_similarity(EN):
|
||||
doc = EN(u'apples orange juice')
|
||||
apples = doc[0]
|
||||
assert apples.similarity(doc) == doc.similarity(apples)
|
||||
assert 0.0 < apples.similarity(doc) < 1.0
|
||||
|
||||
|
||||
@pytest.mark.models
|
||||
def test_lexeme_span_similarity(EN):
|
||||
doc = EN(u'apples orange juice')
|
||||
apples = EN.vocab[u'apples']
|
||||
span = doc[1:3]
|
||||
assert apples.similarity(span) == span.similarity(apples)
|
||||
assert 0.0 < apples.similarity(span) < 1.0
|
||||
@pytest.fixture()
|
||||
def vocab(en_vocab, vectors):
|
||||
return add_vecs_to_vocab(en_vocab, vectors)
|
||||
|
||||
|
||||
@pytest.mark.models
|
||||
def test_lexeme_lexeme_similarity(EN):
|
||||
apples = EN.vocab[u'apples']
|
||||
oranges = EN.vocab[u'oranges']
|
||||
assert apples.similarity(oranges) == oranges.similarity(apples)
|
||||
assert 0.0 < apples.similarity(oranges) < 1.0
|
||||
|
||||
@pytest.fixture()
|
||||
def tokenizer_v(vocab):
|
||||
return Tokenizer(vocab, {}, None, None, None)
|
||||
|
||||
@pytest.mark.models
|
||||
def test_lexeme_doc_similarity(EN):
|
||||
doc = EN(u'apples orange juice')
|
||||
apples = EN.vocab[u'apples']
|
||||
assert apples.similarity(doc) == doc.similarity(apples)
|
||||
assert 0.0 < apples.similarity(doc) < 1.0
|
||||
|
||||
|
||||
@pytest.mark.models
|
||||
def test_span_span_similarity(EN):
|
||||
doc = EN(u'apples orange juice')
|
||||
apples = doc[0:2]
|
||||
oj = doc[1:3]
|
||||
assert apples.similarity(oj) == oj.similarity(apples)
|
||||
assert 0.0 < apples.similarity(oj) < 1.0
|
||||
|
||||
@pytest.mark.parametrize('text', ["apple and orange"])
|
||||
def test_vectors_token_vector(tokenizer_v, vectors, text):
|
||||
doc = tokenizer_v(text)
|
||||
assert vectors[0] == (doc[0].text, list(doc[0].vector))
|
||||
assert vectors[1] == (doc[2].text, list(doc[2].vector))
|
||||
|
||||
@pytest.mark.models
|
||||
def test_span_doc_similarity(EN):
|
||||
doc = EN(u'apples orange juice')
|
||||
apples = doc[0:2]
|
||||
oj = doc[1:3]
|
||||
assert apples.similarity(doc) == doc.similarity(apples)
|
||||
assert 0.0 < apples.similarity(doc) < 1.0
|
||||
|
||||
|
||||
@pytest.mark.models
|
||||
def test_doc_doc_similarity(EN):
|
||||
apples = EN(u'apples and apple pie')
|
||||
oranges = EN(u'orange juice')
|
||||
assert apples.similarity(oranges) == apples.similarity(oranges)
|
||||
assert 0.0 < apples.similarity(oranges) < 1.0
|
||||


@pytest.mark.parametrize('text', ["apple", "orange"])
def test_vectors_lexeme_vector(vocab, text):
    lex = vocab[text]
    assert list(lex.vector)
    assert lex.vector_norm


@pytest.mark.parametrize('text', [["apple", "and", "orange"]])
def test_vectors_doc_vector(vocab, text):
    doc = get_doc(vocab, text)
    assert list(doc.vector)
    assert doc.vector_norm


@pytest.mark.parametrize('text', [["apple", "and", "orange"]])
def test_vectors_span_vector(vocab, text):
    span = get_doc(vocab, text)[0:2]
    assert list(span.vector)
    assert span.vector_norm


@pytest.mark.parametrize('text', ["apple orange"])
def test_vectors_token_token_similarity(tokenizer_v, text):
    doc = tokenizer_v(text)
    assert doc[0].similarity(doc[1]) == doc[1].similarity(doc[0])
    assert 0.0 < doc[0].similarity(doc[1]) < 1.0


@pytest.mark.parametrize('text1,text2', [("apple", "orange")])
def test_vectors_token_lexeme_similarity(tokenizer_v, vocab, text1, text2):
    token = tokenizer_v(text1)
    lex = vocab[text2]
    assert token.similarity(lex) == lex.similarity(token)
    assert 0.0 < token.similarity(lex) < 1.0


@pytest.mark.parametrize('text', [["apple", "orange", "juice"]])
def test_vectors_token_span_similarity(vocab, text):
    doc = get_doc(vocab, text)
    assert doc[0].similarity(doc[1:3]) == doc[1:3].similarity(doc[0])
    assert 0.0 < doc[0].similarity(doc[1:3]) < 1.0


@pytest.mark.parametrize('text', [["apple", "orange", "juice"]])
def test_vectors_token_doc_similarity(vocab, text):
    doc = get_doc(vocab, text)
    assert doc[0].similarity(doc) == doc.similarity(doc[0])
    assert 0.0 < doc[0].similarity(doc) < 1.0


@pytest.mark.parametrize('text', [["apple", "orange", "juice"]])
def test_vectors_lexeme_span_similarity(vocab, text):
    doc = get_doc(vocab, text)
    lex = vocab[text[0]]
    assert lex.similarity(doc[1:3]) == doc[1:3].similarity(lex)
    assert 0.0 < lex.similarity(doc[1:3]) < 1.0


@pytest.mark.parametrize('text1,text2', [("apple", "orange")])
def test_vectors_lexeme_lexeme_similarity(vocab, text1, text2):
    lex1 = vocab[text1]
    lex2 = vocab[text2]
    assert lex1.similarity(lex2) == lex2.similarity(lex1)
    assert 0.0 < lex1.similarity(lex2) < 1.0


@pytest.mark.parametrize('text', [["apple", "orange", "juice"]])
def test_vectors_lexeme_doc_similarity(vocab, text):
    doc = get_doc(vocab, text)
    lex = vocab[text[0]]
    assert lex.similarity(doc) == doc.similarity(lex)
    assert 0.0 < lex.similarity(doc) < 1.0


@pytest.mark.parametrize('text', [["apple", "orange", "juice"]])
def test_vectors_span_span_similarity(vocab, text):
    doc = get_doc(vocab, text)
    assert doc[0:2].similarity(doc[1:3]) == doc[1:3].similarity(doc[0:2])
    assert 0.0 < doc[0:2].similarity(doc[1:3]) < 1.0


@pytest.mark.parametrize('text', [["apple", "orange", "juice"]])
def test_vectors_span_doc_similarity(vocab, text):
    doc = get_doc(vocab, text)
    assert doc[0:2].similarity(doc) == doc.similarity(doc[0:2])
    assert 0.0 < doc[0:2].similarity(doc) < 1.0


@pytest.mark.parametrize('text1,text2', [
    (["apple", "and", "apple", "pie"], ["orange", "juice"])])
def test_vectors_doc_doc_similarity(vocab, text1, text2):
    doc1 = get_doc(vocab, text1)
    doc2 = get_doc(vocab, text2)
    assert doc1.similarity(doc2) == doc2.similarity(doc1)
    assert 0.0 < doc1.similarity(doc2) < 1.0
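For context, the parametrized tests above rely on a toy `vectors` fixture plus the `add_vecs_to_vocab` and `get_doc` test helpers. A minimal sketch of what those could look like follows; the vector values, the `Vocab.resize_vectors` call and the exact signatures are illustrative assumptions, not the code from this commit:

import pytest
from spacy.tokens import Doc


@pytest.fixture
def vectors():
    # toy word vectors; any small, distinct float lists will do
    return [("apple", [0.0, 1.0, 2.0]), ("orange", [3.0, -2.0, 4.0])]


def add_vecs_to_vocab(vocab, vectors):
    # attach the toy vectors to the vocab so similarity scores are defined
    length = len(vectors[0][1])
    vocab.resize_vectors(length)
    for word, vec in vectors:
        vocab[word].vector = vec
    return vocab


def get_doc(vocab, words):
    # build a Doc straight from a list of words, bypassing the tokenizer
    return Doc(vocab, words=words)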
@@ -1,42 +1,58 @@
# coding: utf-8
from __future__ import unicode_literals

from ...attrs import *

import pytest

from spacy.attrs import *


@pytest.mark.parametrize('text1,prob1,text2,prob2', [("NOUN", -1, "opera", -2)])
def test_vocab_lexeme_lt(en_vocab, text1, text2, prob1, prob2):
    """More frequent is l.t. less frequent"""
    lex1 = en_vocab[text1]
    lex1.prob = prob1
    lex2 = en_vocab[text2]
    lex2.prob = prob2

    assert lex1 < lex2
    assert lex2 > lex1


def test_lexeme_eq(en_vocab):
    '''Test Issue #361: Equality of lexemes'''
    cat1 = en_vocab['cat']
    cat2 = en_vocab['cat']
    assert cat1 == cat2


def test_lexeme_neq(en_vocab):
    '''Inequality of lexemes'''
    cat = en_vocab['cat']
    dog = en_vocab['dog']
    assert cat != dog


def test_lexeme_lt(en_vocab):
    '''More frequent is l.t. less frequent'''
    noun = en_vocab['NOUN']
    opera = en_vocab['opera']
    assert noun < opera
    assert opera > noun


@pytest.mark.parametrize('text1,text2', [("phantom", "opera")])
def test_vocab_lexeme_hash(en_vocab, text1, text2):
    """Test that lexemes are hashable."""
    lex1 = en_vocab[text1]
    lex2 = en_vocab[text2]
    lexes = {lex1: lex1, lex2: lex2}
    assert lexes[lex1].orth_ == text1
    assert lexes[lex2].orth_ == text2


def test_lexeme_hash(en_vocab):
    '''Test that lexemes are hashable.'''
    phantom = en_vocab['phantom']
    opera = en_vocab['opera']
    lexes = {phantom: phantom, opera: opera}
    assert lexes[phantom].orth_ == 'phantom'
    assert lexes[opera].orth_ == 'opera'


def test_vocab_lexeme_is_alpha(en_vocab):
    assert en_vocab['the'].flags & (1 << IS_ALPHA)
    assert not en_vocab['1999'].flags & (1 << IS_ALPHA)
    assert not en_vocab['hello1'].flags & (1 << IS_ALPHA)


def test_vocab_lexeme_is_digit(en_vocab):
    assert not en_vocab['the'].flags & (1 << IS_DIGIT)
    assert en_vocab['1999'].flags & (1 << IS_DIGIT)
    assert not en_vocab['hello1'].flags & (1 << IS_DIGIT)


def test_vocab_lexeme_add_flag_auto_id(en_vocab):
    is_len4 = en_vocab.add_flag(lambda string: len(string) == 4)
    assert en_vocab['1999'].check_flag(is_len4) == True
    assert en_vocab['1999'].check_flag(IS_DIGIT) == True
    assert en_vocab['199'].check_flag(is_len4) == False
    assert en_vocab['199'].check_flag(IS_DIGIT) == True
    assert en_vocab['the'].check_flag(is_len4) == False
    assert en_vocab['dogs'].check_flag(is_len4) == True


def test_vocab_lexeme_add_flag_provided_id(en_vocab):
    is_len4 = en_vocab.add_flag(lambda string: len(string) == 4, flag_id=IS_DIGIT)
    assert en_vocab['1999'].check_flag(is_len4) == True
    assert en_vocab['199'].check_flag(is_len4) == False
    assert en_vocab['199'].check_flag(IS_DIGIT) == False
    assert en_vocab['the'].check_flag(is_len4) == False
    assert en_vocab['dogs'].check_flag(is_len4) == True
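The raw bit tests above (`flags & (1 << IS_ALPHA)`) and the `check_flag` calls read the same underlying flag bits. A small additional check one could write in the same style, reusing the `en_vocab` fixture (a sketch, not part of this commit):

def test_vocab_lexeme_check_flag_matches_raw_bits(en_vocab):
    # check_flag(FLAG_ID) should agree with the raw bitmask test on Lexeme.flags
    for text, flag in [('the', IS_ALPHA), ('1999', IS_DIGIT)]:
        lex = en_vocab[text]
        assert lex.check_flag(flag) == bool(lex.flags & (1 << flag))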
@@ -1,42 +0,0 @@
from __future__ import unicode_literals

import pytest

from spacy.attrs import *


def test_is_alpha(en_vocab):
    the = en_vocab['the']
    assert the.flags & (1 << IS_ALPHA)
    year = en_vocab['1999']
    assert not year.flags & (1 << IS_ALPHA)
    mixed = en_vocab['hello1']
    assert not mixed.flags & (1 << IS_ALPHA)


def test_is_digit(en_vocab):
    the = en_vocab['the']
    assert not the.flags & (1 << IS_DIGIT)
    year = en_vocab['1999']
    assert year.flags & (1 << IS_DIGIT)
    mixed = en_vocab['hello1']
    assert not mixed.flags & (1 << IS_DIGIT)


def test_add_flag_auto_id(en_vocab):
    is_len4 = en_vocab.add_flag(lambda string: len(string) == 4)
    assert en_vocab['1999'].check_flag(is_len4) == True
    assert en_vocab['1999'].check_flag(IS_DIGIT) == True
    assert en_vocab['199'].check_flag(is_len4) == False
    assert en_vocab['199'].check_flag(IS_DIGIT) == True
    assert en_vocab['the'].check_flag(is_len4) == False
    assert en_vocab['dogs'].check_flag(is_len4) == True


def test_add_flag_provided_id(en_vocab):
    is_len4 = en_vocab.add_flag(lambda string: len(string) == 4, flag_id=IS_DIGIT)
    assert en_vocab['1999'].check_flag(is_len4) == True
    assert en_vocab['199'].check_flag(is_len4) == False
    assert en_vocab['199'].check_flag(IS_DIGIT) == False
    assert en_vocab['the'].check_flag(is_len4) == False
    assert en_vocab['dogs'].check_flag(is_len4) == True
@@ -33,7 +33,7 @@ def test_vocab_api_symbols(en_vocab, string, symbol):


@pytest.mark.parametrize('text', "Hello")
def test_contains(en_vocab, text):
def test_vocab_api_contains(en_vocab, text):
    _ = en_vocab[text]
    assert text in en_vocab
    assert "LKsdjvlsakdvlaksdvlkasjdvljasdlkfvm" not in en_vocab
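As this last test suggests, `text in vocab` only reports strings the vocab has already interned, which is why the lexeme is touched via `en_vocab[text]` before the membership assertion. A quick usage sketch along the same lines (illustrative, not from this commit):

def test_vocab_api_contains_after_interning(en_vocab):
    # membership flips from False to True once the string has been interned
    assert 'zxqy-not-a-real-word' not in en_vocab
    _ = en_vocab['zxqy-not-a-real-word']
    assert 'zxqy-not-a-real-word' in en_vocab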