💫 Refactor test suite (#2568)

## Description

Related issues: #2379 (should be fixed by separating model tests)

* **total execution time down from > 300 seconds to under 60 seconds** 🎉
* removed all model-specific tests that could only really be run manually anyway – those will now live in a separate test suite in the [`spacy-models`](https://github.com/explosion/spacy-models) repository and are already integrated into our new model training infrastructure
* changed all relative imports to absolute imports to prepare for moving the test suite from `/spacy/tests` to `/tests` (it'll then always test against the installed version); see the short example after this list
* merged old regression tests into collections, e.g. `test_issue1001-1500.py` (about 90% of the regression tests are very short anyway)
* tidied up and rewrote existing tests wherever possible
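
As a minimal sketch of the new import style (the module and test body below are made up for illustration; only the import paths follow the refactored tests):

```python
# coding: utf-8
from __future__ import unicode_literals

# absolute imports into the installed spaCy package instead of the old
# relative style (e.g. `from ...tokens import Doc`); only the suite's own
# helpers keep a relative import, e.g. `from ..util import get_doc`
from spacy.tokens import Doc
from spacy.vocab import Vocab


def test_example_doc_words():
    # hypothetical test body, just to show the pattern
    doc = Doc(Vocab(), words=["This", "is", "a", "test"])
    assert [t.text for t in doc] == ["This", "is", "a", "test"]
```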

### Todo

- [ ] move tests to `/tests` and adjust CI commands accordingly
- [x] move model test suite from internal repo to `spacy-models`
- [x] ~~investigate why `pipeline/test_textcat.py` is flaky~~
- [x] review old regression tests (leftover files) and see if they can be merged, simplified or deleted
- [ ] update documentation on how to run tests


### Types of change
enhancement, tests

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.
Ines Montani 2018-07-24 23:38:44 +02:00 committed by GitHub
parent 82277f63a3
commit 75f3234404
218 changed files with 2117 additions and 3775 deletions

.gitignore

@@ -36,6 +36,7 @@ venv/
 .dev
 .denv
 .pypyenv
+.pytest_cache/
 # Distribution / packaging
 env/

@@ -11,5 +11,6 @@ dill>=0.2,<0.3
 regex==2017.4.5
 requests>=2.13.0,<3.0.0
 pytest>=3.6.0,<4.0.0
+pytest-timeout>=1.3.0,<2.0.0
 mock>=2.0.0,<3.0.0
 pathlib==1.0.1; python_version < "3.4"

@@ -6,6 +6,7 @@ spaCy uses the [pytest](http://doc.pytest.org/) framework for testing. For more
 Tests for spaCy modules and classes live in their own directories of the same name. For example, tests for the `Tokenizer` can be found in [`/tests/tokenizer`](tokenizer). All test modules (i.e. directories) also need to be listed in spaCy's [`setup.py`](../setup.py). To be interpreted and run, all test files and test functions need to be prefixed with `test_`.
+> ⚠️ **Important note:** As part of our new model training infrastructure, we've moved all model tests to the [`spacy-models`](https://github.com/explosion/spacy-models) repository. This allows us to test the models separately from the core library functionality.
 ## Table of contents
@@ -13,9 +14,8 @@ Tests for spaCy modules and classes live in their own directories of the same na
 2. [Dos and don'ts](#dos-and-donts)
 3. [Parameters](#parameters)
 4. [Fixtures](#fixtures)
-5. [Testing models](#testing-models)
-6. [Helpers and utilities](#helpers-and-utilities)
-7. [Contributing to the tests](#contributing-to-the-tests)
+5. [Helpers and utilities](#helpers-and-utilities)
+6. [Contributing to the tests](#contributing-to-the-tests)
 ## Running the tests
@@ -25,10 +25,7 @@ first failure, run them with `py.test -x`.
 ```bash
 py.test spacy # run basic tests
-py.test spacy --models --en # run basic and English model tests
-py.test spacy --models --all # run basic and all model tests
 py.test spacy --slow # run basic and slow tests
-py.test spacy --models --all --slow # run all tests
 ```
 You can also run tests in a specific file or directory, or even only one
@@ -48,10 +45,10 @@ To keep the behaviour of the tests consistent and predictable, we try to follow
 * If you're testing for a bug reported in a specific issue, always create a **regression test**. Regression tests should be named `test_issue[ISSUE NUMBER]` and live in the [`regression`](regression) directory.
 * Only use `@pytest.mark.xfail` for tests that **should pass, but currently fail**. To test for desired negative behaviour, use `assert not` in your test.
 * Very **extensive tests** that take a long time to run should be marked with `@pytest.mark.slow`. If your slow test is testing important behaviour, consider adding an additional simpler version.
-* Tests that require **loading the models** should be marked with `@pytest.mark.models`.
+* If tests require **loading the models**, they should be added to the [`spacy-models`](https://github.com/explosion/spacy-models) tests.
 * Before requiring the models, always make sure there is no other way to test the particular behaviour. In a lot of cases, it's sufficient to simply create a `Doc` object manually. See the section on [helpers and utility functions](#helpers-and-utilities) for more info on this.
-* **Avoid unnecessary imports.** There should never be a need to explicitly import spaCy at the top of a file, and most components are available as [fixtures](#fixtures). You should also avoid wildcard imports (`from module import *`).
+* **Avoid unnecessary imports.** There should never be a need to explicitly import spaCy at the top of a file, and many components are available as [fixtures](#fixtures). You should also avoid wildcard imports (`from module import *`).
-* If you're importing from spaCy, **always use relative imports**. Otherwise, you might accidentally be running the tests over a different copy of spaCy, e.g. one you have installed on your system.
+* If you're importing from spaCy, **always use absolute imports**. For example: `from spacy.language import Language`.
 * Don't forget the **unicode declarations** at the top of each file. This way, unicode strings won't have to be prefixed with `u`.
 * Try to keep the tests **readable and concise**. Use clear and descriptive variable names (`doc`, `tokens` and `text` are great), keep it short and only test for one behaviour at a time.
@@ -93,12 +90,9 @@ These are the main fixtures that are currently available:
 | Fixture | Description |
 | --- | --- |
-| `tokenizer` | Creates **all available** language tokenizers and runs the test for **each of them**. |
+| `tokenizer` | Basic, language-independent tokenizer. Identical to the `xx` language class. |
 | `en_tokenizer`, `de_tokenizer`, ... | Creates an English, German etc. tokenizer. |
-| `en_vocab`, `en_entityrecognizer`, ... | Creates an instance of the English `Vocab`, `EntityRecognizer` object etc. |
+| `en_vocab` | Creates an instance of the English `Vocab`. |
-| `EN`, `DE`, ... | Creates a language class with a loaded model. For more info, see [Testing models](#testing-models). |
-| `text_file` | Creates an instance of `StringIO` to simulate reading from and writing to files. |
-| `text_file_b` | Creates an instance of `ByteIO` to simulate reading from and writing to files. |
 The fixtures can be used in all tests by simply setting them as an argument, like this:
@@ -109,49 +103,6 @@ def test_module_do_something(en_tokenizer):
 If all tests in a file require a specific configuration, or use the same complex example, it can be helpful to create a separate fixture. This fixture should be added at the top of each file. Make sure to use descriptive names for these fixtures and don't override any of the global fixtures listed above. **From looking at a test, it should immediately be clear which fixtures are used, and where they are coming from.**
-## Testing models
-Models should only be loaded and tested **if absolutely necessary** – for example, if you're specifically testing a model's performance, or if your test is related to model loading. If you only need an annotated `Doc`, you should use the `get_doc()` helper function to create it manually instead.
-To specify which language models a test is related to, set the language ID as an argument of `@pytest.mark.models`. This allows you to later run the tests with `--models --en`. You can then use the `EN` [fixture](#fixtures) to get a language
-class with a loaded model.
-```python
-@pytest.mark.models('en')
-def test_english_model(EN):
-    doc = EN(u'This is a test')
-```
-> ⚠️ **Important note:** In order to test models, they need to be installed as a packge. The [conftest.py](conftest.py) includes a list of all available models, mapped to their IDs, e.g. `en`. Unless otherwise specified, each model that's installed in your environment will be imported and tested. If you don't have a model installed, **the test will be skipped**.
-Under the hood, `pytest.importorskip` is used to import a model package and skip the test if the package is not installed. The `EN` fixture for example gets all
-available models for `en`, [parametrizes](#parameters) them to run the test for *each of them*, and uses `load_test_model()` to import the model and run the test, or skip it if the model is not installed.
-### Testing specific models
-Using the `load_test_model()` helper function, you can also write tests for specific models, or combinations of them:
-```python
-from .util import load_test_model
-
-@pytest.mark.models('en')
-def test_en_md_only():
-    nlp = load_test_model('en_core_web_md')
-    # test something specific to en_core_web_md
-
-@pytest.mark.models('en', 'fr')
-@pytest.mark.parametrize('model', ['en_core_web_md', 'fr_depvec_web_lg'])
-def test_different_models(model):
-    nlp = load_test_model(model)
-    # test something specific to the parametrized models
-```
-### Known issues and future improvements
-Using `importorskip` on a list of model packages is not ideal and we're looking to improve this in the future. But at the moment, it's the best way to ensure that tests are performed on specific model packages only, and that you'll always be able to run the tests, even if you don't have *all available models* installed. (If the tests made a call to `spacy.load('en')` instead, this would load whichever model you've created an `en` shortcut for. This may be one of spaCy's default models, but it could just as easily be your own custom English model.)
-The current setup also doesn't provide an easy way to only run tests on specific model versions. The `minversion` keyword argument on `pytest.importorskip` can take care of this, but it currently only checks for the package's `__version__` attribute. An alternative solution would be to load a model package's meta.json and skip if the model's version does not match the one specified in the test.
 ## Helpers and utilities
 Our new test setup comes with a few handy utility functions that can be imported from [`util.py`](util.py).
@@ -186,7 +137,7 @@ You can construct a `Doc` with the following arguments:
 | `pos` | List of POS tags as text values. |
 | `tag` | List of tag names as text values. |
 | `dep` | List of dependencies as text values. |
-| `ents` | List of entity tuples with `ent_id`, `label`, `start`, `end` (for example `('Stewart Lee', 'PERSON', 0, 2)`). The `label` will be looked up in `vocab.strings[label]`. |
+| `ents` | List of entity tuples with `start`, `end`, `label` (for example `(0, 2, 'PERSON')`). The `label` will be looked up in `vocab.strings[label]`. |
 Here's how to quickly get these values from within spaCy:
@@ -196,6 +147,7 @@ print([token.head.i-token.i for token in doc])
 print([token.tag_ for token in doc])
 print([token.pos_ for token in doc])
 print([token.dep_ for token in doc])
+print([(ent.start, ent.end, ent.label_) for ent in doc.ents])
 ```
 **Note:** There's currently no way of setting the serializer data for the parser without loading the models. If this is relevant to your test, constructing the `Doc` via `get_doc()` won't work.
@@ -204,7 +156,6 @@ print([token.dep_ for token in doc])
 | Name | Description |
 | --- | --- |
-| `load_test_model` | Load a model if it's installed as a package, otherwise skip test. |
 | `apply_transition_sequence(parser, doc, sequence)` | Perform a series of pre-specified transitions, to put the parser in a desired state. |
 | `add_vecs_to_vocab(vocab, vectors)` | Add list of vector tuples (`[("text", [1, 2, 3])]`) to given vocab. All vectors need to have the same length. |
 | `get_cosine(vec1, vec2)` | Get cosine for two given vectors. |
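
As a quick illustration of the fixture pattern described in the README above (fixtures from `conftest.py` are requested simply by naming them as test arguments), here is a minimal, hypothetical test; only the `en_tokenizer` fixture itself is real:

```python
# coding: utf-8
from __future__ import unicode_literals


def test_en_tokenizer_splits_trailing_punct(en_tokenizer):
    # pytest injects the shared English tokenizer defined in conftest.py
    doc = en_tokenizer("This is a sentence.")
    assert [t.text for t in doc] == ["This", "is", "a", "sentence", "."]
```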

@@ -1,229 +1,145 @@
 # coding: utf-8
 from __future__ import unicode_literals
-from io import StringIO, BytesIO
-from pathlib import Path
 import pytest
+from io import StringIO, BytesIO
-from .util import load_test_model
-from ..tokens import Doc
-from ..strings import StringStore
-from .. import util
+from spacy.util import get_lang_class
-# These languages are used for generic tokenizer tests – only add a language
-# here if it's using spaCy's tokenizer (not a different library)
-# TODO: re-implement generic tokenizer tests
-_languages = ['bn', 'da', 'de', 'el', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id',
-              'it', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sv', 'tr', 'ar', 'ut', 'tt',
-              'xx']
-_models = {'en': ['en_core_web_sm'],
-           'de': ['de_core_news_sm'],
-           'fr': ['fr_core_news_sm'],
-           'xx': ['xx_ent_web_sm'],
-           'en_core_web_md': ['en_core_web_md'],
-           'es_core_news_md': ['es_core_news_md']}
-# only used for tests that require loading the models
-# in all other cases, use specific instances
-@pytest.fixture(params=_models['en'])
-def EN(request):
-    return load_test_model(request.param)
-@pytest.fixture(params=_models['de'])
-def DE(request):
-    return load_test_model(request.param)
-@pytest.fixture(params=_models['fr'])
-def FR(request):
-    return load_test_model(request.param)
-@pytest.fixture()
-def RU(request):
-    pymorphy = pytest.importorskip('pymorphy2')
-    return util.get_lang_class('ru')()
-@pytest.fixture()
-def JA(request):
-    mecab = pytest.importorskip("MeCab")
-    return util.get_lang_class('ja')()
-#@pytest.fixture(params=_languages)
-#def tokenizer(request):
-#    lang = util.get_lang_class(request.param)
-#    return lang.Defaults.create_tokenizer()
+def pytest_addoption(parser):
+    parser.addoption("--slow", action="store_true", help="include slow tests")
+def pytest_runtest_setup(item):
+    for opt in ['slow']:
+        if opt in item.keywords and not item.config.getoption("--%s" % opt):
+            pytest.skip("need --%s option to run" % opt)
-@pytest.fixture
+@pytest.fixture(scope='module')
 def tokenizer():
-    return util.get_lang_class('xx').Defaults.create_tokenizer()
+    return get_lang_class('xx').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def en_tokenizer():
-    return util.get_lang_class('en').Defaults.create_tokenizer()
+    return get_lang_class('en').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def en_vocab():
-    return util.get_lang_class('en').Defaults.create_vocab()
+    return get_lang_class('en').Defaults.create_vocab()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def en_parser(en_vocab):
-    nlp = util.get_lang_class('en')(en_vocab)
+    nlp = get_lang_class('en')(en_vocab)
     return nlp.create_pipe('parser')
-@pytest.fixture
+@pytest.fixture(scope='session')
 def es_tokenizer():
-    return util.get_lang_class('es').Defaults.create_tokenizer()
+    return get_lang_class('es').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def de_tokenizer():
-    return util.get_lang_class('de').Defaults.create_tokenizer()
+    return get_lang_class('de').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def fr_tokenizer():
-    return util.get_lang_class('fr').Defaults.create_tokenizer()
+    return get_lang_class('fr').Defaults.create_tokenizer()
 @pytest.fixture
 def hu_tokenizer():
-    return util.get_lang_class('hu').Defaults.create_tokenizer()
+    return get_lang_class('hu').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def fi_tokenizer():
-    return util.get_lang_class('fi').Defaults.create_tokenizer()
+    return get_lang_class('fi').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def ro_tokenizer():
-    return util.get_lang_class('ro').Defaults.create_tokenizer()
+    return get_lang_class('ro').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def id_tokenizer():
-    return util.get_lang_class('id').Defaults.create_tokenizer()
+    return get_lang_class('id').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def sv_tokenizer():
-    return util.get_lang_class('sv').Defaults.create_tokenizer()
+    return get_lang_class('sv').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def bn_tokenizer():
-    return util.get_lang_class('bn').Defaults.create_tokenizer()
+    return get_lang_class('bn').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def ga_tokenizer():
-    return util.get_lang_class('ga').Defaults.create_tokenizer()
+    return get_lang_class('ga').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def he_tokenizer():
-    return util.get_lang_class('he').Defaults.create_tokenizer()
+    return get_lang_class('he').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def nb_tokenizer():
-    return util.get_lang_class('nb').Defaults.create_tokenizer()
+    return get_lang_class('nb').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def da_tokenizer():
-    return util.get_lang_class('da').Defaults.create_tokenizer()
+    return get_lang_class('da').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def ja_tokenizer():
     mecab = pytest.importorskip("MeCab")
-    return util.get_lang_class('ja').Defaults.create_tokenizer()
+    return get_lang_class('ja').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def th_tokenizer():
     pythainlp = pytest.importorskip("pythainlp")
-    return util.get_lang_class('th').Defaults.create_tokenizer()
+    return get_lang_class('th').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def tr_tokenizer():
-    return util.get_lang_class('tr').Defaults.create_tokenizer()
+    return get_lang_class('tr').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def tt_tokenizer():
-    return util.get_lang_class('tt').Defaults.create_tokenizer()
+    return get_lang_class('tt').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def el_tokenizer():
-    return util.get_lang_class('el').Defaults.create_tokenizer()
+    return get_lang_class('el').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def ar_tokenizer():
-    return util.get_lang_class('ar').Defaults.create_tokenizer()
+    return get_lang_class('ar').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def ur_tokenizer():
-    return util.get_lang_class('ur').Defaults.create_tokenizer()
+    return get_lang_class('ur').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def ru_tokenizer():
     pymorphy = pytest.importorskip('pymorphy2')
-    return util.get_lang_class('ru').Defaults.create_tokenizer()
+    return get_lang_class('ru').Defaults.create_tokenizer()
-@pytest.fixture
-def stringstore():
-    return StringStore()
-@pytest.fixture
-def en_entityrecognizer():
-    return util.get_lang_class('en').Defaults.create_entity()
-@pytest.fixture
-def text_file():
-    return StringIO()
-@pytest.fixture
-def text_file_b():
-    return BytesIO()
-def pytest_addoption(parser):
-    parser.addoption("--models", action="store_true",
-                     help="include tests that require full models")
-    parser.addoption("--vectors", action="store_true",
-                     help="include word vectors tests")
-    parser.addoption("--slow", action="store_true",
-                     help="include slow tests")
-    for lang in _languages + ['all']:
-        parser.addoption("--%s" % lang, action="store_true", help="Use %s models" % lang)
-    for model in _models:
-        if model not in _languages:
-            parser.addoption("--%s" % model, action="store_true", help="Use %s model" % model)
-def pytest_runtest_setup(item):
-    for opt in ['models', 'vectors', 'slow']:
-        if opt in item.keywords and not item.config.getoption("--%s" % opt):
-            pytest.skip("need --%s option to run" % opt)
-    # Check if test is marked with models and has arguments set, i.e. specific
-    # language. If so, skip test if flag not set.
-    if item.get_marker('models'):
-        for arg in item.get_marker('models').args:
-            if not item.config.getoption("--%s" % arg) and not item.config.getoption("--all"):
-                pytest.skip("need --%s or --all option to run" % arg)

@@ -1,24 +0,0 @@
-# coding: utf-8
-from __future__ import unicode_literals
-from ...pipeline import EntityRecognizer
-from ..util import get_doc
-import pytest
-def test_doc_add_entities_set_ents_iob(en_vocab):
-    text = ["This", "is", "a", "lion"]
-    doc = get_doc(en_vocab, text)
-    ner = EntityRecognizer(en_vocab)
-    ner.begin_training([])
-    ner(doc)
-    assert len(list(doc.ents)) == 0
-    assert [w.ent_iob_ for w in doc] == (['O'] * len(doc))
-    doc.ents = [(doc.vocab.strings['ANIMAL'], 3, 4)]
-    assert [w.ent_iob_ for w in doc] == ['', '', '', 'B']
-    doc.ents = [(doc.vocab.strings['WORD'], 0, 2)]
-    assert [w.ent_iob_ for w in doc] == ['B', 'I', '', '']

@@ -1,10 +1,9 @@
 # coding: utf-8
 from __future__ import unicode_literals
-from ...attrs import ORTH, SHAPE, POS, DEP
-from ..util import get_doc
-import pytest
+from spacy.attrs import ORTH, SHAPE, POS, DEP
+from ..util import get_doc
 def test_doc_array_attr_of_token(en_tokenizer, en_vocab):
@@ -41,7 +40,7 @@ def test_doc_array_tag(en_tokenizer):
     text = "A nice sentence."
     pos = ['DET', 'ADJ', 'NOUN', 'PUNCT']
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], pos=pos)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], pos=pos)
     assert doc[0].pos != doc[1].pos != doc[2].pos != doc[3].pos
     feats_array = doc.to_array((ORTH, POS))
     assert feats_array[0][1] == doc[0].pos
@@ -54,7 +53,7 @@ def test_doc_array_dep(en_tokenizer):
     text = "A nice sentence."
     deps = ['det', 'amod', 'ROOT', 'punct']
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], deps=deps)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], deps=deps)
     feats_array = doc.to_array((ORTH, DEP))
     assert feats_array[0][1] == doc[0].dep
     assert feats_array[1][1] == doc[1].dep

@@ -1,10 +1,10 @@
-'''Test Doc sets up tokens correctly.'''
+# coding: utf-8
 from __future__ import unicode_literals
-import pytest
-from ...vocab import Vocab
-from ...tokens.doc import Doc
-from ...lemmatizer import Lemmatizer
+import pytest
+from spacy.vocab import Vocab
+from spacy.tokens import Doc
+from spacy.lemmatizer import Lemmatizer
 @pytest.fixture

@@ -1,18 +1,18 @@
 # coding: utf-8
 from __future__ import unicode_literals
-from ..util import get_doc
-from ...tokens import Doc
-from ...vocab import Vocab
-from ...attrs import LEMMA
 import pytest
 import numpy
+from spacy.tokens import Doc
+from spacy.vocab import Vocab
+from spacy.attrs import LEMMA
+from ..util import get_doc
 @pytest.mark.parametrize('text', [["one", "two", "three"]])
 def test_doc_api_compare_by_string_position(en_vocab, text):
-    doc = get_doc(en_vocab, text)
+    doc = Doc(en_vocab, words=text)
     # Get the tokens in this order, so their ID ordering doesn't match the idx
     token3 = doc[-1]
     token2 = doc[-2]
@@ -104,18 +104,18 @@ def test_doc_api_getitem(en_tokenizer):
     " Give it back! He pleaded. "])
 def test_doc_api_serialize(en_tokenizer, text):
     tokens = en_tokenizer(text)
-    new_tokens = get_doc(tokens.vocab).from_bytes(tokens.to_bytes())
+    new_tokens = Doc(tokens.vocab).from_bytes(tokens.to_bytes())
     assert tokens.text == new_tokens.text
     assert [t.text for t in tokens] == [t.text for t in new_tokens]
     assert [t.orth for t in tokens] == [t.orth for t in new_tokens]
-    new_tokens = get_doc(tokens.vocab).from_bytes(
+    new_tokens = Doc(tokens.vocab).from_bytes(
         tokens.to_bytes(tensor=False), tensor=False)
     assert tokens.text == new_tokens.text
     assert [t.text for t in tokens] == [t.text for t in new_tokens]
     assert [t.orth for t in tokens] == [t.orth for t in new_tokens]
-    new_tokens = get_doc(tokens.vocab).from_bytes(
+    new_tokens = Doc(tokens.vocab).from_bytes(
         tokens.to_bytes(sentiment=False), sentiment=False)
     assert tokens.text == new_tokens.text
     assert [t.text for t in tokens] == [t.text for t in new_tokens]
@@ -199,6 +199,20 @@ def test_doc_api_retokenizer_attrs(en_tokenizer):
     assert doc[4].ent_type_ == 'ORG'
+@pytest.mark.xfail
+def test_doc_api_retokenizer_lex_attrs(en_tokenizer):
+    """Test that lexical attributes can be changed (see #2390)."""
+    doc = en_tokenizer("WKRO played beach boys songs")
+    assert not any(token.is_stop for token in doc)
+    with doc.retokenize() as retokenizer:
+        retokenizer.merge(doc[2:4], attrs={'LEMMA': 'boys', 'IS_STOP': True})
+    assert doc[2].text == 'beach boys'
+    assert doc[2].lemma_ == 'boys'
+    assert doc[2].is_stop
+    new_doc = Doc(doc.vocab, words=['beach boys'])
+    assert new_doc[0].is_stop
 def test_doc_api_sents_empty_string(en_tokenizer):
     doc = en_tokenizer("")
     doc.is_parsed = True
@@ -215,7 +229,7 @@ def test_doc_api_runtime_error(en_tokenizer):
             'ROOT', 'amod', 'dobj']
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], deps=deps)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], deps=deps)
     nps = []
     for np in doc.noun_chunks:
@@ -235,7 +249,7 @@ def test_doc_api_right_edge(en_tokenizer):
             -2, -7, 1, -19, 1, -2, -3, 2, 1, -3, -26]
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
     assert doc[6].text == 'for'
     subtree = [w.text for w in doc[6].subtree]
     assert subtree == ['for', 'the', 'sake', 'of', 'such', 'as',
@@ -264,7 +278,7 @@ def test_doc_api_similarity_match():
 def test_lowest_common_ancestor(en_tokenizer):
     tokens = en_tokenizer('the lazy dog slept')
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=[2, 1, 1, 0])
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=[2, 1, 1, 0])
     lca = doc.get_lca_matrix()
     assert(lca[1, 1] == 1)
     assert(lca[0, 1] == 2)
@@ -277,7 +291,7 @@ def test_parse_tree(en_tokenizer):
     heads = [1, 0, 1, -2, -3, -1, -5]
     tags = ['PRP', 'IN', 'NNP', 'NNP', 'IN', 'NNP', '.']
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, tags=tags)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, tags=tags)
     # full method parse_tree(text) is a trivial composition
     trees = doc.print_tree()
     assert len(trees) > 0

@@ -1,12 +1,13 @@
+# coding: utf-8
 from __future__ import unicode_literals
-from ...language import Language
-from ...compat import pickle, unicode_
+from spacy.language import Language
+from spacy.compat import pickle, unicode_
 def test_pickle_single_doc():
     nlp = Language()
-    doc = nlp(u'pickle roundtrip')
+    doc = nlp('pickle roundtrip')
     data = pickle.dumps(doc, 1)
     doc2 = pickle.loads(data)
     assert doc2.text == 'pickle roundtrip'
@@ -16,7 +17,7 @@ def test_list_of_docs_pickles_efficiently():
     nlp = Language()
     for i in range(10000):
         _ = nlp.vocab[unicode_(i)]
-    one_pickled = pickle.dumps(nlp(u'0'), -1)
+    one_pickled = pickle.dumps(nlp('0'), -1)
     docs = list(nlp.pipe(unicode_(i) for i in range(100)))
     many_pickled = pickle.dumps(docs, -1)
     assert len(many_pickled) < (len(one_pickled) * 2)
@@ -28,7 +29,7 @@ def test_list_of_docs_pickles_efficiently():
 def test_user_data_from_disk():
     nlp = Language()
-    doc = nlp(u'Hello')
+    doc = nlp('Hello')
     doc.user_data[(0, 1)] = False
     b = doc.to_bytes()
     doc2 = doc.__class__(doc.vocab).from_bytes(b)
@@ -36,7 +37,7 @@ def test_user_data_from_disk():
 def test_user_data_unpickles():
     nlp = Language()
-    doc = nlp(u'Hello')
+    doc = nlp('Hello')
     doc.user_data[(0, 1)] = False
     b = pickle.dumps(doc)
     doc2 = pickle.loads(b)
@@ -47,7 +48,7 @@ def test_hooks_unpickle():
     def inner_func(d1, d2):
         return 'hello!'
     nlp = Language()
-    doc = nlp(u'Hello')
+    doc = nlp('Hello')
     doc.user_hooks['similarity'] = inner_func
     b = pickle.dumps(doc)
     doc2 = pickle.loads(b)

@@ -1,12 +1,12 @@
 # coding: utf-8
 from __future__ import unicode_literals
-from ..util import get_doc
-from ...attrs import ORTH, LENGTH
-from ...tokens import Doc
-from ...vocab import Vocab
 import pytest
+from spacy.attrs import ORTH, LENGTH
+from spacy.tokens import Doc
+from spacy.vocab import Vocab
+from ..util import get_doc
 @pytest.fixture
@@ -16,16 +16,16 @@ def doc(en_tokenizer):
     deps = ['nsubj', 'ROOT', 'det', 'attr', 'punct', 'nsubj', 'ROOT', 'det',
             'attr', 'punct', 'ROOT', 'det', 'npadvmod', 'punct']
     tokens = en_tokenizer(text)
-    return get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)
+    return get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
 @pytest.fixture
 def doc_not_parsed(en_tokenizer):
     text = "This is a sentence. This is another sentence. And a third."
     tokens = en_tokenizer(text)
-    d = get_doc(tokens.vocab, [t.text for t in tokens])
-    d.is_parsed = False
-    return d
+    doc = Doc(tokens.vocab, words=[t.text for t in tokens])
+    doc.is_parsed = False
+    return doc
 def test_spans_sent_spans(doc):
@@ -56,7 +56,7 @@ def test_spans_root2(en_tokenizer):
     text = "through North and South Carolina"
     heads = [0, 3, -1, -2, -4]
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
     assert doc[-2:].root.text == 'Carolina'
@@ -76,7 +76,7 @@ def test_spans_span_sent(doc, doc_not_parsed):
 def test_spans_lca_matrix(en_tokenizer):
     """Test span's lca matrix generation"""
     tokens = en_tokenizer('the lazy dog slept')
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=[2, 1, 1, 0])
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=[2, 1, 1, 0])
     lca = doc[:2].get_lca_matrix()
     assert(lca[0, 0] == 0)
     assert(lca[0, 1] == -1)
@@ -100,7 +100,7 @@ def test_spans_default_sentiment(en_tokenizer):
     tokens = en_tokenizer(text)
     tokens.vocab[tokens[0].text].sentiment = 3.0
     tokens.vocab[tokens[2].text].sentiment = -2.0
-    doc = get_doc(tokens.vocab, [t.text for t in tokens])
+    doc = Doc(tokens.vocab, words=[t.text for t in tokens])
     assert doc[:2].sentiment == 3.0 / 2
     assert doc[-2:].sentiment == -2. / 2
     assert doc[:-1].sentiment == (3.+-2) / 3.
@@ -112,7 +112,7 @@ def test_spans_override_sentiment(en_tokenizer):
     tokens = en_tokenizer(text)
     tokens.vocab[tokens[0].text].sentiment = 3.0
     tokens.vocab[tokens[2].text].sentiment = -2.0
-    doc = get_doc(tokens.vocab, [t.text for t in tokens])
+    doc = Doc(tokens.vocab, words=[t.text for t in tokens])
     doc.user_span_hooks['sentiment'] = lambda span: 10.0
     assert doc[:2].sentiment == 10.0
     assert doc[-2:].sentiment == 10.0
@@ -146,7 +146,7 @@ def test_span_to_array(doc):
     assert arr[0, 1] == len(span[0])
-#def test_span_as_doc(doc):
-#    span = doc[4:10]
-#    span_doc = span.as_doc()
-#    assert span.text == span_doc.text.strip()
+def test_span_as_doc(doc):
+    span = doc[4:10]
+    span_doc = span.as_doc()
+    assert span.text == span_doc.text.strip()
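
The last hunk above re-enables the previously commented-out `Span.as_doc()` test. As a hedged, standalone sketch of the behaviour it asserts (built on a bare `Vocab` rather than the suite's `doc` fixture, so it only approximates the real test):

```python
# coding: utf-8
from __future__ import unicode_literals

from spacy.tokens import Doc
from spacy.vocab import Vocab

doc = Doc(Vocab(), words=["This", "is", "a", "sentence", "."])
span = doc[1:4]           # covers "is a sentence"
span_doc = span.as_doc()  # a new Doc containing only the span's tokens
assert span.text == span_doc.text.strip()
```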

@@ -1,18 +1,17 @@
 # coding: utf-8
 from __future__ import unicode_literals
-from ..util import get_doc
-from ...vocab import Vocab
-from ...tokens import Doc
-import pytest
+from spacy.vocab import Vocab
+from spacy.tokens import Doc
+from ..util import get_doc
 def test_spans_merge_tokens(en_tokenizer):
     text = "Los Angeles start."
     heads = [1, 1, 0, -1]
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
     assert len(doc) == 4
     assert doc[0].head.text == 'Angeles'
     assert doc[1].head.text == 'start'
@@ -21,7 +20,7 @@ def test_spans_merge_tokens(en_tokenizer):
     assert doc[0].text == 'Los Angeles'
     assert doc[0].head.text == 'start'
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
     assert len(doc) == 4
     assert doc[0].head.text == 'Angeles'
     assert doc[1].head.text == 'start'
@@ -35,7 +34,7 @@ def test_spans_merge_heads(en_tokenizer):
     text = "I found a pilates class near work."
     heads = [1, 0, 2, 1, -3, -1, -1, -6]
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
     assert len(doc) == 8
     doc.merge(doc[3].idx, doc[4].idx + len(doc[4]), tag=doc[4].tag_,
@@ -53,7 +52,7 @@ def test_span_np_merges(en_tokenizer):
     text = "displaCy is a parse tool built with Javascript"
     heads = [1, 0, 2, 1, -3, -1, -1, -1]
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
     assert doc[4].head.i == 1
     doc.merge(doc[2].idx, doc[4].idx + len(doc[4]), tag='NP', lemma='tool',
@@ -63,7 +62,7 @@ def test_span_np_merges(en_tokenizer):
     text = "displaCy is a lightweight and modern dependency parse tree visualization tool built with CSS3 and JavaScript."
     heads = [1, 0, 8, 3, -1, -2, 4, 3, 1, 1, -9, -1, -1, -1, -1, -2, -15]
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
     ents = [(e[0].idx, e[-1].idx + len(e[-1]), e.label_, e.lemma_) for e in doc.ents]
     for start, end, label, lemma in ents:
@@ -74,8 +73,7 @@ def test_span_np_merges(en_tokenizer):
     text = "One test with entities like New York City so the ents list is not void"
     heads = [1, 11, -1, -1, -1, 1, 1, -3, 4, 2, 1, 1, 0, -1, -2]
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
     for span in doc.ents:
         merged = doc.merge()
         assert merged != None, (span.start, span.end, span.label_, span.lemma_)
@@ -85,10 +83,9 @@ def test_spans_entity_merge(en_tokenizer):
     text = "Stewart Lee is a stand up comedian who lives in England and loves Joe Pasquale.\n"
     heads = [1, 1, 0, 1, 2, -1, -4, 1, -2, -1, -1, -3, -10, 1, -2, -13, -1]
     tags = ['NNP', 'NNP', 'VBZ', 'DT', 'VB', 'RP', 'NN', 'WP', 'VBZ', 'IN', 'NNP', 'CC', 'VBZ', 'NNP', 'NNP', '.', 'SP']
-    ents = [('Stewart Lee', 'PERSON', 0, 2), ('England', 'GPE', 10, 11), ('Joe Pasquale', 'PERSON', 13, 15)]
+    ents = [(0, 2, 'PERSON'), (10, 11, 'GPE'), (13, 15, 'PERSON')]
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, tags=tags, ents=ents)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, tags=tags, ents=ents)
     assert len(doc) == 17
     for ent in doc.ents:
         label, lemma, type_ = (ent.root.tag_, ent.root.lemma_, max(w.ent_type_ for w in ent))
@@ -120,7 +117,7 @@ def test_spans_sentence_update_after_merge(en_tokenizer):
             'compound', 'dobj', 'punct']
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
     sent1, sent2 = list(doc.sents)
     init_len = len(sent1)
     init_len2 = len(sent2)
@@ -138,7 +135,7 @@ def test_spans_subtree_size_check(en_tokenizer):
             'dobj']
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
     sent1 = list(doc.sents)[0]
     init_len = len(list(sent1.root.subtree))
     doc[0:2].merge(label='none', lemma='none', ent_type='none')

@@ -1,14 +1,24 @@
 # coding: utf-8
 from __future__ import unicode_literals
-from ...attrs import IS_ALPHA, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_TITLE, IS_STOP
-from ...symbols import NOUN, VERB
-from ..util import get_doc
-from ...vocab import Vocab
-from ...tokens import Doc
 import pytest
 import numpy
+from spacy.attrs import IS_ALPHA, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_TITLE, IS_STOP
+from spacy.symbols import VERB
+from spacy.vocab import Vocab
+from spacy.tokens import Doc
+from ..util import get_doc
+@pytest.fixture
+def doc(en_tokenizer):
+    text = "This is a sentence. This is another sentence. And a third."
+    heads = [1, 0, 1, -2, -3, 1, 0, 1, -2, -3, 0, 1, -2, -1]
+    deps = ['nsubj', 'ROOT', 'det', 'attr', 'punct', 'nsubj', 'ROOT', 'det',
+            'attr', 'punct', 'ROOT', 'det', 'npadvmod', 'punct']
+    tokens = en_tokenizer(text)
+    return get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
 def test_doc_token_api_strings(en_tokenizer):
@@ -18,7 +28,7 @@ def test_doc_token_api_strings(en_tokenizer):
     deps = ['ROOT', 'dobj', 'prt', 'punct', 'nsubj', 'ROOT', 'punct']
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], pos=pos, heads=heads, deps=deps)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps)
     assert doc[0].orth_ == 'Give'
     assert doc[0].text == 'Give'
     assert doc[0].text_with_ws == 'Give '
@@ -57,18 +67,9 @@ def test_doc_token_api_str_builtin(en_tokenizer, text):
     assert str(tokens[0]) == text.split(' ')[0]
     assert str(tokens[1]) == text.split(' ')[1]
-@pytest.fixture
-def doc(en_tokenizer):
-    text = "This is a sentence. This is another sentence. And a third."
-    heads = [1, 0, 1, -2, -3, 1, 0, 1, -2, -3, 0, 1, -2, -1]
-    deps = ['nsubj', 'ROOT', 'det', 'attr', 'punct', 'nsubj', 'ROOT', 'det',
-            'attr', 'punct', 'ROOT', 'det', 'npadvmod', 'punct']
-    tokens = en_tokenizer(text)
-    return get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)
 def test_doc_token_api_is_properties(en_vocab):
-    text = ["Hi", ",", "my", "email", "is", "test@me.com"]
-    doc = get_doc(en_vocab, text)
+    doc = Doc(en_vocab, words=["Hi", ",", "my", "email", "is", "test@me.com"])
     assert doc[0].is_title
     assert doc[0].is_alpha
     assert not doc[0].is_digit
@@ -86,7 +87,6 @@ def test_doc_token_api_vectors():
     vocab.set_vector('oranges', vector=numpy.asarray([0., 1.], dtype='f'))
     doc = Doc(vocab, words=['apples', 'oranges', 'oov'])
     assert doc.has_vector
     assert doc[0].has_vector
     assert doc[1].has_vector
     assert not doc[2].has_vector
@@ -101,7 +101,7 @@ def test_doc_token_api_ancestors(en_tokenizer):
     text = "Yesterday I saw a dog that barked loudly."
     heads = [2, 1, 0, 1, -2, 1, -2, -1, -6]
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
     assert [t.text for t in doc[6].ancestors] == ["dog", "saw"]
     assert [t.text for t in doc[1].ancestors] == ["saw"]
    assert [t.text for t in doc[2].ancestors] == []
@@ -115,7 +115,7 @@ def test_doc_token_api_head_setter(en_tokenizer):
     text = "Yesterday I saw a dog that barked loudly."
     heads = [2, 1, 0, 1, -2, 1, -2, -1, -6]
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
     assert doc[6].n_lefts == 1
     assert doc[6].n_rights == 1
@@ -165,7 +165,7 @@ def test_doc_token_api_head_setter(en_tokenizer):
 def test_is_sent_start(en_tokenizer):
-    doc = en_tokenizer(u'This is a sentence. This is another.')
+    doc = en_tokenizer('This is a sentence. This is another.')
     assert doc[5].is_sent_start is None
     doc[5].is_sent_start = True
     assert doc[5].is_sent_start is True

@@ -3,10 +3,8 @@ from __future__ import unicode_literals
 import pytest
 from mock import Mock
-from ..vocab import Vocab
-from ..tokens import Doc, Span, Token
-from ..tokens.underscore import Underscore
+from spacy.tokens import Doc, Span, Token
+from spacy.tokens.underscore import Underscore
 def test_create_doc_underscore():

@@ -4,15 +4,14 @@ from __future__ import unicode_literals
 import pytest
-@pytest.mark.parametrize('text',
-    ["ق.م", "إلخ", "ص.ب", "ت."])
+@pytest.mark.parametrize('text', ["ق.م", "إلخ", "ص.ب", "ت."])
 def test_ar_tokenizer_handles_abbr(ar_tokenizer, text):
     tokens = ar_tokenizer(text)
     assert len(tokens) == 1
 def test_ar_tokenizer_handles_exc_in_text(ar_tokenizer):
-    text = u"تعود الكتابة الهيروغليفية إلى سنة 3200 ق.م"
+    text = "تعود الكتابة الهيروغليفية إلى سنة 3200 ق.م"
     tokens = ar_tokenizer(text)
     assert len(tokens) == 7
     assert tokens[6].text == "ق.م"
@@ -20,7 +19,6 @@ def test_ar_tokenizer_handles_exc_in_text(ar_tokenizer):
 def test_ar_tokenizer_handles_exc_in_text(ar_tokenizer):
-    text = u"يبلغ طول مضيق طارق 14كم "
+    text = "يبلغ طول مضيق طارق 14كم "
     tokens = ar_tokenizer(text)
-    print([(tokens[i].text, tokens[i].suffix_) for i in range(len(tokens))])
     assert len(tokens) == 6

@@ -2,7 +2,7 @@
 from __future__ import unicode_literals
-def test_tokenizer_handles_long_text(ar_tokenizer):
+def test_ar_tokenizer_handles_long_text(ar_tokenizer):
     text = """نجيب محفوظ مؤلف و كاتب روائي عربي، يعد من أهم الأدباء العرب خلال القرن العشرين.
 ولد نجيب محفوظ في مدينة القاهرة، حيث ترعرع و تلقى تعليمه الجامعي في جامعتها،
 فتمكن من نيل شهادة في الفلسفة. ألف محفوظ على مدار حياته الكثير من الأعمال الأدبية، و في مقدمتها ثلاثيته الشهيرة.

View File

@ -3,38 +3,32 @@ from __future__ import unicode_literals
import pytest import pytest
TESTCASES = []
PUNCTUATION_TESTS = [ TESTCASES = [
(u'আমি বাংলায় গান গাই!', [u'আমি', u'বাংলায়', u'গান', u'গাই', u'!']), # punctuation tests
(u'আমি বাংলায় কথা কই।', [u'আমি', u'বাংলায়', u'কথা', u'কই', u'']), ('আমি বাংলায় গান গাই!', ['আমি', 'বাংলায়', 'গান', 'গাই', '!']),
(u'বসুন্ধরা জনসম্মুখে দোষ স্বীকার করলো না?', [u'বসুন্ধরা', u'জনসম্মুখে', u'দোষ', u'স্বীকার', u'করলো', u'না', u'?']), ('আমি বাংলায় কথা কই।', ['আমি', 'বাংলায়', 'কথা', 'কই', '']),
(u'টাকা থাকলে কি না হয়!', [u'টাকা', u'থাকলে', u'কি', u'না', u'হয়', u'!']), ('বসুন্ধরা জনসম্মুখে দোষ স্বীকার করলো না?', ['বসুন্ধরা', 'জনসম্মুখে', 'দোষ', 'স্বীকার', 'করলো', 'না', '?']),
('টাকা থাকলে কি না হয়!', ['টাকা', 'থাকলে', 'কি', 'না', 'হয়', '!']),
# abbreviations
('ডঃ খালেদ বললেন ঢাকায় ৩৫ ডিগ্রি সে.।', ['ডঃ', 'খালেদ', 'বললেন', 'ঢাকায়', '৩৫', 'ডিগ্রি', 'সে.', ''])
] ]
ABBREVIATIONS = [
(u'ডঃ খালেদ বললেন ঢাকায় ৩৫ ডিগ্রি সে.।', [u'ডঃ', u'খালেদ', u'বললেন', u'ঢাকায়', u'৩৫', u'ডিগ্রি', u'সে.', u''])
]
TESTCASES.extend(PUNCTUATION_TESTS)
TESTCASES.extend(ABBREVIATIONS)
@pytest.mark.parametrize('text,expected_tokens', TESTCASES) @pytest.mark.parametrize('text,expected_tokens', TESTCASES)
def test_tokenizer_handles_testcases(bn_tokenizer, text, expected_tokens): def test_bn_tokenizer_handles_testcases(bn_tokenizer, text, expected_tokens):
tokens = bn_tokenizer(text) tokens = bn_tokenizer(text)
token_list = [token.text for token in tokens if not token.is_space] token_list = [token.text for token in tokens if not token.is_space]
assert expected_tokens == token_list assert expected_tokens == token_list
def test_tokenizer_handles_long_text(bn_tokenizer): def test_bn_tokenizer_handles_long_text(bn_tokenizer):
text = u"""নর্থ সাউথ বিশ্ববিদ্যালয়ে সারাবছর কোন না কোন বিষয়ে গবেষণা চলতেই থাকে। \ text = """নর্থ সাউথ বিশ্ববিদ্যালয়ে সারাবছর কোন না কোন বিষয়ে গবেষণা চলতেই থাকে। \
অভি ি রগণ ি ি িি গবষণ রকল কর, \ অভি ি রগণ ি ি িি গবষণ রকল কর, \
মধ রয বট ি ি ি আরিিি ইনি \ মধ রয বট ি ি ি আরিিি ইনি \
এসকল রকল কর যম ি যথ পরি ইজড হওয সমভব \ এসকল রকল কর যম ি যথ পরি ইজড হওয সমভব \
আর গবষণ িরক ি অনকখি! \ আর গবষণ িরক ি অনকখি! \
কন হও, গবষক ি লপ - নর উথ ইউনিিি রতি ি রয \ কন হও, গবষক ি লপ - নর উথ ইউনিিি রতি ি রয \
নর উথ অসরণ কমিউনিি দর আমনরণ""" নর উথ অসরণ কমিউনিি দর আমনরণ"""
tokens = bn_tokenizer(text) tokens = bn_tokenizer(text)
assert len(tokens) == 84 assert len(tokens) == 84

@@ -3,28 +3,32 @@ from __future__ import unicode_literals
 import pytest
-@pytest.mark.parametrize('text',
-    ["ca.", "m.a.o.", "Jan.", "Dec.", "kr.", "jf."])
+@pytest.mark.parametrize('text', ["ca.", "m.a.o.", "Jan.", "Dec.", "kr.", "jf."])
 def test_da_tokenizer_handles_abbr(da_tokenizer, text):
     tokens = da_tokenizer(text)
     assert len(tokens) == 1
 @pytest.mark.parametrize('text', ["Jul.", "jul.", "Tor.", "Tors."])
 def test_da_tokenizer_handles_ambiguous_abbr(da_tokenizer, text):
     tokens = da_tokenizer(text)
     assert len(tokens) == 2
 @pytest.mark.parametrize('text', ["1.", "10.", "31."])
 def test_da_tokenizer_handles_dates(da_tokenizer, text):
     tokens = da_tokenizer(text)
     assert len(tokens) == 1
 def test_da_tokenizer_handles_exc_in_text(da_tokenizer):
     text = "Det er bl.a. ikke meningen"
     tokens = da_tokenizer(text)
     assert len(tokens) == 5
     assert tokens[2].text == "bl.a."
 def test_da_tokenizer_handles_custom_base_exc(da_tokenizer):
     text = "Her er noget du kan kigge i."
     tokens = da_tokenizer(text)
@@ -32,8 +36,9 @@ def test_da_tokenizer_handles_custom_base_exc(da_tokenizer):
     assert tokens[6].text == "i"
     assert tokens[7].text == "."
-@pytest.mark.parametrize('text,norm',
-    [("akvarium", "akvarie"), ("bedstemoder", "bedstemor")])
+@pytest.mark.parametrize('text,norm', [
+    ("akvarium", "akvarie"), ("bedstemoder", "bedstemor")])
 def test_da_tokenizer_norm_exceptions(da_tokenizer, text, norm):
     tokens = da_tokenizer(text)
     assert tokens[0].norm_ == norm

View File

@ -4,10 +4,11 @@ from __future__ import unicode_literals
import pytest


@pytest.mark.parametrize('string,lemma', [
    ('affaldsgruppernes', 'affaldsgruppe'),
    ('detailhandelsstrukturernes', 'detailhandelsstruktur'),
    ('kolesterols', 'kolesterol'),
    ('åsyns', 'åsyn')])
def test_da_lemmatizer_lookup_assigns(da_tokenizer, string, lemma):
    tokens = da_tokenizer(string)
    assert tokens[0].lemma_ == lemma

View File

@ -1,24 +1,23 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest


@pytest.mark.parametrize('text', ["(under)"])
def test_da_tokenizer_splits_no_special(da_tokenizer, text):
    tokens = da_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ["ta'r", "Søren's", "Lars'"])
def test_da_tokenizer_handles_no_punct(da_tokenizer, text):
    tokens = da_tokenizer(text)
    assert len(tokens) == 1


@pytest.mark.parametrize('text', ["(ta'r"])
def test_da_tokenizer_splits_prefix_punct(da_tokenizer, text):
    tokens = da_tokenizer(text)
    assert len(tokens) == 2
    assert tokens[0].text == "("

@ -26,22 +25,23 @@ def test_tokenizer_splits_prefix_punct(da_tokenizer, text):

@pytest.mark.parametrize('text', ["ta'r)"])
def test_da_tokenizer_splits_suffix_punct(da_tokenizer, text):
    tokens = da_tokenizer(text)
    assert len(tokens) == 2
    assert tokens[0].text == "ta'r"
    assert tokens[1].text == ")"


@pytest.mark.parametrize('text,expected', [
    ("(ta'r)", ["(", "ta'r", ")"]), ("'ta'r'", ["'", "ta'r", "'"])])
def test_da_tokenizer_splits_even_wrap(da_tokenizer, text, expected):
    tokens = da_tokenizer(text)
    assert len(tokens) == len(expected)
    assert [t.text for t in tokens] == expected


@pytest.mark.parametrize('text', ["(ta'r?)"])
def test_da_tokenizer_splits_uneven_wrap(da_tokenizer, text):
    tokens = da_tokenizer(text)
    assert len(tokens) == 4
    assert tokens[0].text == "("

@ -50,15 +50,16 @@ def test_tokenizer_splits_uneven_wrap(da_tokenizer, text):

    assert tokens[3].text == ")"


@pytest.mark.parametrize('text,expected', [
    ("f.eks.", ["f.eks."]), ("fe.", ["fe", "."]), ("(f.eks.", ["(", "f.eks."])])
def test_da_tokenizer_splits_prefix_interact(da_tokenizer, text, expected):
    tokens = da_tokenizer(text)
    assert len(tokens) == len(expected)
    assert [t.text for t in tokens] == expected


@pytest.mark.parametrize('text', ["f.eks.)"])
def test_da_tokenizer_splits_suffix_interact(da_tokenizer, text):
    tokens = da_tokenizer(text)
    assert len(tokens) == 2
    assert tokens[0].text == "f.eks."

@ -66,7 +67,7 @@ def test_tokenizer_splits_suffix_interact(da_tokenizer, text):

@pytest.mark.parametrize('text', ["(f.eks.)"])
def test_da_tokenizer_splits_even_wrap_interact(da_tokenizer, text):
    tokens = da_tokenizer(text)
    assert len(tokens) == 3
    assert tokens[0].text == "("

@ -75,7 +76,7 @@ def test_tokenizer_splits_even_wrap_interact(da_tokenizer, text):

@pytest.mark.parametrize('text', ["(f.eks.?)"])
def test_da_tokenizer_splits_uneven_wrap_interact(da_tokenizer, text):
    tokens = da_tokenizer(text)
    assert len(tokens) == 4
    assert tokens[0].text == "("

@ -85,19 +86,19 @@ def test_tokenizer_splits_uneven_wrap_interact(da_tokenizer, text):

@pytest.mark.parametrize('text', ["0,1-13,5", "0,0-0,1", "103,27-300", "1/2-3/4"])
def test_da_tokenizer_handles_numeric_range(da_tokenizer, text):
    tokens = da_tokenizer(text)
    assert len(tokens) == 1


@pytest.mark.parametrize('text', ["sort.Gul", "Hej.Verden"])
def test_da_tokenizer_splits_period_infix(da_tokenizer, text):
    tokens = da_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ["Hej,Verden", "en,to"])
def test_da_tokenizer_splits_comma_infix(da_tokenizer, text):
    tokens = da_tokenizer(text)
    assert len(tokens) == 3
    assert tokens[0].text == text.split(",")[0]

@ -106,18 +107,18 @@ def test_tokenizer_splits_comma_infix(da_tokenizer, text):

@pytest.mark.parametrize('text', ["sort...Gul", "sort...gul"])
def test_da_tokenizer_splits_ellipsis_infix(da_tokenizer, text):
    tokens = da_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ['gå-på-mod', '4-hjulstræk', '100-Pfennig-frimærke', 'TV-2-spots', 'trofæ-vaeggen'])
def test_da_tokenizer_keeps_hyphens(da_tokenizer, text):
    tokens = da_tokenizer(text)
    assert len(tokens) == 1


def test_da_tokenizer_splits_double_hyphen_infix(da_tokenizer):
    tokens = da_tokenizer("Mange regler--eksempelvis bindestregs-reglerne--er komplicerede.")
    assert len(tokens) == 9
    assert tokens[0].text == "Mange"

@ -130,7 +131,7 @@ def test_tokenizer_splits_double_hyphen_infix(da_tokenizer):

    assert tokens[7].text == "komplicerede"


def test_da_tokenizer_handles_posessives_and_contractions(da_tokenizer):
    tokens = da_tokenizer("'DBA's, Lars' og Liz' bil sku' sgu' ik' ha' en bule, det ka' han ik' li' mere', sagde hun.")
    assert len(tokens) == 25
    assert tokens[0].text == "'"

View File

@ -1,10 +1,9 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest

from spacy.lang.da.lex_attrs import like_num


def test_da_tokenizer_handles_long_text(da_tokenizer):
    text = """Der var så dejligt ude på landet. Det var sommer, kornet stod gult, havren grøn,

@ -15,6 +14,7 @@ Rundt om ager og eng var der store skove, og midt i skovene dybe søer; jo, der

    tokens = da_tokenizer(text)
    assert len(tokens) == 84


@pytest.mark.parametrize('text,match', [
    ('10', True), ('1', True), ('10.000', True), ('10.00', True),
    ('999,0', True), ('en', True), ('treoghalvfemsindstyvende', True), ('hundrede', True),

@ -22,6 +22,10 @@ Rundt om ager og eng var der store skove, og midt i skovene dybe søer; jo, der

def test_lex_attrs_like_number(da_tokenizer, text, match):
    tokens = da_tokenizer(text)
    assert len(tokens) == 1
    print(tokens[0])
    assert tokens[0].like_num == match


@pytest.mark.parametrize('word', ['elleve', 'første'])
def test_da_lex_attrs_capitals(word):
    assert like_num(word)
    assert like_num(word.upper())

View File

@ -1,7 +1,4 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest

View File

@ -4,12 +4,13 @@ from __future__ import unicode_literals
import pytest


@pytest.mark.parametrize('string,lemma', [
    ('Abgehängten', 'Abgehängte'),
    ('engagierte', 'engagieren'),
    ('schließt', 'schließen'),
    ('vorgebenden', 'vorgebend'),
    ('die', 'der'),
    ('Die', 'der')])
def test_de_lemmatizer_lookup_assigns(de_tokenizer, string, lemma):
    tokens = de_tokenizer(string)
    assert tokens[0].lemma_ == lemma

View File

@ -1,77 +0,0 @@
# coding: utf-8
from __future__ import unicode_literals
import numpy
import pytest
@pytest.fixture
def example(DE):
"""
This is to make sure the model works as expected. The tests make sure that
values are properly set. Tests are not meant to evaluate the content of the
output, only make sure the output is formally okay.
"""
assert DE.entity != None
return DE('An der großen Straße stand eine merkwürdige Gestalt und führte Selbstgespräche.')
@pytest.mark.models('de')
def test_de_models_tokenization(example):
# tokenization should split the document into tokens
assert len(example) > 1
@pytest.mark.xfail
@pytest.mark.models('de')
def test_de_models_tagging(example):
# if tagging was done properly, pos tags shouldn't be empty
assert example.is_tagged
assert all(t.pos != 0 for t in example)
assert all(t.tag != 0 for t in example)
@pytest.mark.models('de')
def test_de_models_parsing(example):
# if parsing was done properly
# - dependency labels shouldn't be empty
# - the head of some tokens should not be root
assert example.is_parsed
assert all(t.dep != 0 for t in example)
assert any(t.dep != i for i,t in enumerate(example))
@pytest.mark.models('de')
def test_de_models_ner(example):
# if ner was done properly, ent_iob shouldn't be empty
assert all([t.ent_iob != 0 for t in example])
@pytest.mark.models('de')
def test_de_models_vectors(example):
# if vectors are available, they should differ on different words
# this isn't a perfect test since this could in principle fail
# in a sane model as well,
# but that's very unlikely and a good indicator if something is wrong
vector0 = example[0].vector
vector1 = example[1].vector
vector2 = example[2].vector
assert not numpy.array_equal(vector0,vector1)
assert not numpy.array_equal(vector0,vector2)
assert not numpy.array_equal(vector1,vector2)
@pytest.mark.xfail
@pytest.mark.models('de')
def test_de_models_probs(example):
# if frequencies/probabilities are okay, they should differ for
# different words
# this isn't a perfect test since this could in principle fail
# in a sane model as well,
# but that's very unlikely and a good indicator if something is wrong
prob0 = example[0].prob
prob1 = example[1].prob
prob2 = example[2].prob
assert not prob0 == prob1
assert not prob0 == prob2
assert not prob1 == prob2

View File

@ -3,17 +3,14 @@ from __future__ import unicode_literals
from ...util import get_doc


def test_de_parser_noun_chunks_standard_de(de_tokenizer):
    text = "Eine Tasse steht auf dem Tisch."
    heads = [1, 1, 0, -1, 1, -2, -4]
    tags = ['ART', 'NN', 'VVFIN', 'APPR', 'ART', 'NN', '$.']
    deps = ['nk', 'sb', 'ROOT', 'mo', 'nk', 'nk', 'punct']
    tokens = de_tokenizer(text)
    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
    chunks = list(doc.noun_chunks)
    assert len(chunks) == 2
    assert chunks[0].text_with_ws == "Eine Tasse "

@ -25,9 +22,8 @@ def test_de_extended_chunk(de_tokenizer):

    heads = [1, 1, 0, -1, 1, -2, -1, -5, -6]
    tags = ['ART', 'NN', 'VVFIN', 'APPR', 'ART', 'NN', 'NN', 'NN', '$.']
    deps = ['nk', 'sb', 'ROOT', 'mo', 'nk', 'nk', 'nk', 'oa', 'punct']
    tokens = de_tokenizer(text)
    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
    chunks = list(doc.noun_chunks)
    assert len(chunks) == 3
    assert chunks[0].text_with_ws == "Die Sängerin "
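
These noun-chunk tests build annotated Doc objects through the get_doc helper in tests/util.py, which is not part of this hunk. A simplified sketch of what such a helper does (a hypothetical stand-in, not the actual implementation) looks roughly like this:

from spacy.tokens import Doc

def make_annotated_doc(vocab, words, heads=None, tags=None, deps=None):
    # Hypothetical, simplified stand-in for tests/util.get_doc: build a Doc
    # from plain words and attach tags, dependency labels and heads per token.
    doc = Doc(vocab, words=words)
    for i, token in enumerate(doc):
        if tags:
            token.tag_ = tags[i]
        if deps:
            token.dep_ = deps[i]
        if heads:
            # heads are given as offsets relative to the token itself
            token.head = doc[i + heads[i]]
    return doc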

View File

@ -1,86 +1,83 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest


@pytest.mark.parametrize('text', ["(unter)"])
def test_de_tokenizer_splits_no_special(de_tokenizer, text):
    tokens = de_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ["unter'm"])
def test_de_tokenizer_splits_no_punct(de_tokenizer, text):
    tokens = de_tokenizer(text)
    assert len(tokens) == 2


@pytest.mark.parametrize('text', ["(unter'm"])
def test_de_tokenizer_splits_prefix_punct(de_tokenizer, text):
    tokens = de_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ["unter'm)"])
def test_de_tokenizer_splits_suffix_punct(de_tokenizer, text):
    tokens = de_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ["(unter'm)"])
def test_de_tokenizer_splits_even_wrap(de_tokenizer, text):
    tokens = de_tokenizer(text)
    assert len(tokens) == 4


@pytest.mark.parametrize('text', ["(unter'm?)"])
def test_de_tokenizer_splits_uneven_wrap(de_tokenizer, text):
    tokens = de_tokenizer(text)
    assert len(tokens) == 5


@pytest.mark.parametrize('text,length', [("z.B.", 1), ("zb.", 2), ("(z.B.", 2)])
def test_de_tokenizer_splits_prefix_interact(de_tokenizer, text, length):
    tokens = de_tokenizer(text)
    assert len(tokens) == length


@pytest.mark.parametrize('text', ["z.B.)"])
def test_de_tokenizer_splits_suffix_interact(de_tokenizer, text):
    tokens = de_tokenizer(text)
    assert len(tokens) == 2


@pytest.mark.parametrize('text', ["(z.B.)"])
def test_de_tokenizer_splits_even_wrap_interact(de_tokenizer, text):
    tokens = de_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ["(z.B.?)"])
def test_de_tokenizer_splits_uneven_wrap_interact(de_tokenizer, text):
    tokens = de_tokenizer(text)
    assert len(tokens) == 4


@pytest.mark.parametrize('text', ["0.1-13.5", "0.0-0.1", "103.27-300"])
def test_de_tokenizer_splits_numeric_range(de_tokenizer, text):
    tokens = de_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ["blau.Rot", "Hallo.Welt"])
def test_de_tokenizer_splits_period_infix(de_tokenizer, text):
    tokens = de_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ["Hallo,Welt", "eins,zwei"])
def test_de_tokenizer_splits_comma_infix(de_tokenizer, text):
    tokens = de_tokenizer(text)
    assert len(tokens) == 3
    assert tokens[0].text == text.split(",")[0]

@ -89,18 +86,18 @@ def test_tokenizer_splits_comma_infix(de_tokenizer, text):

@pytest.mark.parametrize('text', ["blau...Rot", "blau...rot"])
def test_de_tokenizer_splits_ellipsis_infix(de_tokenizer, text):
    tokens = de_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ['Islam-Konferenz', 'Ost-West-Konflikt'])
def test_de_tokenizer_keeps_hyphens(de_tokenizer, text):
    tokens = de_tokenizer(text)
    assert len(tokens) == 1


def test_de_tokenizer_splits_double_hyphen_infix(de_tokenizer):
    tokens = de_tokenizer("Viele Regeln--wie die Bindestrich-Regeln--sind kompliziert.")
    assert len(tokens) == 10
    assert tokens[0].text == "Viele"

View File

@ -1,13 +1,10 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest


def test_de_tokenizer_handles_long_text(de_tokenizer):
    text = """Die Verwandlung
Als Gregor Samsa eines Morgens aus unruhigen Träumen erwachte, fand er sich in

@ -29,17 +26,15 @@ Umfang kläglich dünnen Beine flimmerten ihm hilflos vor den Augen.

    "Donaudampfschifffahrtsgesellschaftskapitänsanwärterposten",
    "Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz",
    "Kraftfahrzeug-Haftpflichtversicherung",
    "Vakuum-Mittelfrequenz-Induktionsofen"])
def test_de_tokenizer_handles_long_words(de_tokenizer, text):
    tokens = de_tokenizer(text)
    assert len(tokens) == 1


@pytest.mark.parametrize('text,length', [
    ("»Was ist mit mir geschehen?«, dachte er.", 12),
    ("“Dies frühzeitige Aufstehen”, dachte er, “macht einen ganz blödsinnig. ", 15)])
def test_de_tokenizer_handles_examples(de_tokenizer, text, length):
    tokens = de_tokenizer(text)
    assert len(tokens) == length

View File

@ -1,17 +1,16 @@
# coding: utf8
from __future__ import unicode_literals

import pytest


@pytest.mark.parametrize('text', ["αριθ.", "τρισ.", "δισ.", "σελ."])
def test_el_tokenizer_handles_abbr(el_tokenizer, text):
    tokens = el_tokenizer(text)
    assert len(tokens) == 1


def test_el_tokenizer_handles_exc_in_text(el_tokenizer):
    text = "Στα 14 τρισ. δολάρια το κόστος από την άνοδο της στάθμης της θάλασσας."
    tokens = el_tokenizer(text)
    assert len(tokens) == 14

View File

@ -1,11 +1,10 @@
# coding: utf8
from __future__ import unicode_literals

import pytest


def test_el_tokenizer_handles_long_text(el_tokenizer):
    text = """Η Ελλάδα (παλαιότερα Ελλάς), επίσημα γνωστή ως Ελληνική Δημοκρατία,\
είναι χώρα της νοτιοανατολικής Ευρώπης στο νοτιότερο άκρο της Βαλκανικής χερσονήσου.\
Συνορεύει στα βορειοδυτικά με την Αλβανία, στα βόρεια με την πρώην\

@ -20,6 +19,6 @@ def test_tokenizer_handles_long_text(el_tokenizer):

    ("Η Ελλάδα είναι μία από τις χώρες της Ευρωπαϊκής Ένωσης (ΕΕ) που διαθέτει σηµαντικό ορυκτό πλούτο.", 19),
    ("Η ναυτιλία αποτέλεσε ένα σημαντικό στοιχείο της Ελληνικής οικονομικής δραστηριότητας από τα αρχαία χρόνια.", 15),
    ("Η Ελλάδα είναι μέλος σε αρκετούς διεθνείς οργανισμούς.", 9)])
def test_el_tokenizer_handles_cnts(el_tokenizer, text, length):
    tokens = el_tokenizer(text)
    assert len(tokens) == length

View File

@ -2,23 +2,22 @@
from __future__ import unicode_literals

import pytest

from spacy.lang.en import English
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex
from spacy.util import compile_infix_regex


@pytest.fixture
def custom_en_tokenizer(en_vocab):
    prefix_re = compile_prefix_regex(English.Defaults.prefixes)
    suffix_re = compile_suffix_regex(English.Defaults.suffixes)
    custom_infixes = ['\.\.\.+',
                      '(?<=[0-9])-(?=[0-9])',
                      # '(?<=[0-9]+),(?=[0-9]+)',
                      '[0-9]+(,[0-9]+)+',
                      '[\[\]!&:,()\*—–\/-]']
    infix_re = compile_infix_regex(custom_infixes)
    return Tokenizer(en_vocab,
                     English.Defaults.tokenizer_exceptions,
                     prefix_re.search,

@ -27,13 +26,12 @@ def custom_en_tokenizer(en_vocab):

                     token_match=None)


def test_en_customized_tokenizer_handles_infixes(custom_en_tokenizer):
    sentence = "The 8 and 10-county definitions are not used for the greater Southern California Megaregion."
    context = [word.text for word in custom_en_tokenizer(sentence)]
    assert context == ['The', '8', 'and', '10', '-', 'county', 'definitions',
                       'are', 'not', 'used', 'for', 'the', 'greater',
                       'Southern', 'California', 'Megaregion', '.']
    # the trailing '-' may cause Assertion Error
    sentence = "The 8- and 10-county definitions are not used for the greater Southern California Megaregion."
    context = [word.text for word in custom_en_tokenizer(sentence)]
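
The fixture above returns a standalone Tokenizer. As a usage note (assumed, not part of this diff), the same kind of object can also replace the tokenizer of a full pipeline, which is how custom prefix/suffix/infix rules are normally deployed outside the test suite:

from spacy.lang.en import English
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex

nlp = English()
# Assumed usage: rebuild the default tokenizer with one extra infix rule and
# attach it to the pipeline, so nlp() applies the customised rules.
infix_re = compile_infix_regex(list(English.Defaults.infixes) + ['(?<=[0-9])-(?=[0-9])'])
nlp.tokenizer = Tokenizer(nlp.vocab,
                          English.Defaults.tokenizer_exceptions,
                          prefix_search=nlp.tokenizer.prefix_search,
                          suffix_search=nlp.tokenizer.suffix_search,
                          infix_finditer=infix_re.finditer,
                          token_match=None)
print([t.text for t in nlp("The 8 and 10-county definitions.")])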

View File

@ -38,7 +38,7 @@ def test_en_tokenizer_splits_trailing_apos(en_tokenizer, text):
@pytest.mark.parametrize('text', ["'em", "nothin'", "ol'"])
def test_en_tokenizer_doesnt_split_apos_exc(en_tokenizer, text):
    tokens = en_tokenizer(text)
    assert len(tokens) == 1
    assert tokens[0].text == text

View File

@ -1,11 +1,6 @@
# coding: utf-8
from __future__ import unicode_literals


def test_en_simple_punct(en_tokenizer):
    text = "to walk, do foo"

View File

@ -1,63 +0,0 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from ....tokens.doc import Doc
@pytest.fixture
def en_lemmatizer(EN):
return EN.Defaults.create_lemmatizer()
@pytest.mark.models('en')
def test_doc_lemmatization(EN):
doc = Doc(EN.vocab, words=['bleed'])
doc[0].tag_ = 'VBP'
assert doc[0].lemma_ == 'bleed'
@pytest.mark.models('en')
@pytest.mark.parametrize('text,lemmas', [("aardwolves", ["aardwolf"]),
("aardwolf", ["aardwolf"]),
("planets", ["planet"]),
("ring", ["ring"]),
("axes", ["axis", "axe", "ax"])])
def test_en_lemmatizer_noun_lemmas(en_lemmatizer, text, lemmas):
assert en_lemmatizer.noun(text) == lemmas
@pytest.mark.models('en')
@pytest.mark.parametrize('text,lemmas', [("bleed", ["bleed"]),
("feed", ["feed"]),
("need", ["need"]),
("ring", ["ring"])])
def test_en_lemmatizer_noun_lemmas(en_lemmatizer, text, lemmas):
# Cases like this are problematic -- not clear what we should do to resolve
# ambiguity?
# ("axes", ["ax", "axes", "axis"])])
assert en_lemmatizer.noun(text) == lemmas
@pytest.mark.xfail
@pytest.mark.models('en')
def test_en_lemmatizer_base_forms(en_lemmatizer):
assert en_lemmatizer.noun('dive', {'number': 'sing'}) == ['dive']
assert en_lemmatizer.noun('dive', {'number': 'plur'}) == ['diva']
@pytest.mark.models('en')
def test_en_lemmatizer_base_form_verb(en_lemmatizer):
assert en_lemmatizer.verb('saw', {'verbform': 'past'}) == ['see']
@pytest.mark.models('en')
def test_en_lemmatizer_punct(en_lemmatizer):
assert en_lemmatizer.punct('') == ['"']
assert en_lemmatizer.punct('') == ['"']
@pytest.mark.models('en')
def test_en_lemmatizer_lemma_assignment(EN):
text = "Bananas in pyjamas are geese."
doc = EN.make_doc(text)
EN.tagger(doc)
assert all(t.lemma_ != '' for t in doc)

View File

@ -1,85 +0,0 @@
# coding: utf-8
from __future__ import unicode_literals
import numpy
import pytest
@pytest.fixture
def example(EN):
"""
This is to make sure the model works as expected. The tests make sure that
values are properly set. Tests are not meant to evaluate the content of the
output, only make sure the output is formally okay.
"""
assert EN.entity != None
return EN('There was a stranger standing at the big street talking to herself.')
@pytest.mark.models('en')
def test_en_models_tokenization(example):
# tokenization should split the document into tokens
assert len(example) > 1
@pytest.mark.models('en')
def test_en_models_tagging(example):
# if tagging was done properly, pos tags shouldn't be empty
assert example.is_tagged
assert all(t.pos != 0 for t in example)
assert all(t.tag != 0 for t in example)
@pytest.mark.models('en')
def test_en_models_parsing(example):
# if parsing was done properly
# - dependency labels shouldn't be empty
# - the head of some tokens should not be root
assert example.is_parsed
assert all(t.dep != 0 for t in example)
assert any(t.dep != i for i,t in enumerate(example))
@pytest.mark.models('en')
def test_en_models_ner(example):
# if ner was done properly, ent_iob shouldn't be empty
assert all([t.ent_iob != 0 for t in example])
@pytest.mark.models('en')
def test_en_models_vectors(example):
# if vectors are available, they should differ on different words
# this isn't a perfect test since this could in principle fail
# in a sane model as well,
# but that's very unlikely and a good indicator if something is wrong
if example.vocab.vectors_length:
vector0 = example[0].vector
vector1 = example[1].vector
vector2 = example[2].vector
assert not numpy.array_equal(vector0,vector1)
assert not numpy.array_equal(vector0,vector2)
assert not numpy.array_equal(vector1,vector2)
@pytest.mark.xfail
@pytest.mark.models('en')
def test_en_models_probs(example):
# if frequencies/probabilities are okay, they should differ for
# different words
# this isn't a perfect test since this could in principle fail
# in a sane model as well,
# but that's very unlikely and a good indicator if something is wrong
prob0 = example[0].prob
prob1 = example[1].prob
prob2 = example[2].prob
assert not prob0 == prob1
assert not prob0 == prob2
assert not prob1 == prob2
@pytest.mark.models('en')
def test_no_vectors_similarity(EN):
doc1 = EN(u'hallo')
doc2 = EN(u'hi')
assert doc1.similarity(doc2) > 0

View File

@ -1,42 +0,0 @@
from __future__ import unicode_literals, print_function
import pytest
from spacy.attrs import LOWER
from spacy.matcher import Matcher
@pytest.mark.models('en')
def test_en_ner_simple_types(EN):
tokens = EN(u'Mr. Best flew to New York on Saturday morning.')
ents = list(tokens.ents)
assert ents[0].start == 1
assert ents[0].end == 2
assert ents[0].label_ == 'PERSON'
assert ents[1].start == 4
assert ents[1].end == 6
assert ents[1].label_ == 'GPE'
@pytest.mark.skip
@pytest.mark.models('en')
def test_en_ner_consistency_bug(EN):
'''Test an arbitrary sequence-consistency bug encountered during speed test'''
tokens = EN(u'Where rap essentially went mainstream, illustrated by seminal Public Enemy, Beastie Boys and L.L. Cool J. tracks.')
tokens = EN(u'''Charity and other short-term aid have buoyed them so far, and a tax-relief bill working its way through Congress would help. But the September 11 Victim Compensation Fund, enacted by Congress to discourage people from filing lawsuits, will determine the shape of their lives for years to come.\n\n''', disable=['ner'])
tokens.ents += tuple(EN.matcher(tokens))
EN.entity(tokens)
@pytest.mark.skip
@pytest.mark.models('en')
def test_en_ner_unit_end_gazetteer(EN):
'''Test a bug in the interaction between the NER model and the gazetteer'''
matcher = Matcher(EN.vocab)
matcher.add('MemberNames', None, [{LOWER: 'cal'}], [{LOWER: 'cal'}, {LOWER: 'henderson'}])
doc = EN(u'who is cal the manager of?')
if len(list(doc.ents)) == 0:
ents = matcher(doc)
assert len(ents) == 1
doc.ents += tuple(ents)
EN.entity(doc)
assert list(doc.ents)[0].text == 'cal'

View File

@ -1,22 +1,20 @@
# coding: utf-8
from __future__ import unicode_literals

import numpy

from spacy.attrs import HEAD, DEP
from spacy.symbols import nsubj, dobj, amod, nmod, conj, cc, root
from spacy.lang.en.syntax_iterators import SYNTAX_ITERATORS

from ...util import get_doc


def test_en_noun_chunks_not_nested(en_tokenizer):
    text = "Peter has chronic command and control issues"
    heads = [1, 0, 4, 3, -1, -2, -5]
    deps = ['nsubj', 'ROOT', 'amod', 'nmod', 'cc', 'conj', 'dobj']
    tokens = en_tokenizer(text)
    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
    tokens.from_array(
        [HEAD, DEP],
        numpy.asarray([[1, nsubj], [0, root], [4, amod], [3, nmod], [-1, cc],

View File

@ -3,58 +3,52 @@ from __future__ import unicode_literals
from ...util import get_doc


def test_en_parser_noun_chunks_standard(en_tokenizer):
    text = "A base phrase should be recognized."
    heads = [2, 1, 3, 2, 1, 0, -1]
    tags = ['DT', 'JJ', 'NN', 'MD', 'VB', 'VBN', '.']
    deps = ['det', 'amod', 'nsubjpass', 'aux', 'auxpass', 'ROOT', 'punct']
    tokens = en_tokenizer(text)
    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
    chunks = list(doc.noun_chunks)
    assert len(chunks) == 1
    assert chunks[0].text_with_ws == "A base phrase "


def test_en_parser_noun_chunks_coordinated(en_tokenizer):
    text = "A base phrase and a good phrase are often the same."
    heads = [2, 1, 5, -1, 2, 1, -4, 0, -1, 1, -3, -4]
    tags = ['DT', 'NN', 'NN', 'CC', 'DT', 'JJ', 'NN', 'VBP', 'RB', 'DT', 'JJ', '.']
    deps = ['det', 'compound', 'nsubj', 'cc', 'det', 'amod', 'conj', 'ROOT', 'advmod', 'det', 'attr', 'punct']
    tokens = en_tokenizer(text)
    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
    chunks = list(doc.noun_chunks)
    assert len(chunks) == 2
    assert chunks[0].text_with_ws == "A base phrase "
    assert chunks[1].text_with_ws == "a good phrase "


def test_en_parser_noun_chunks_pp_chunks(en_tokenizer):
    text = "A phrase with another phrase occurs."
    heads = [1, 4, -1, 1, -2, 0, -1]
    tags = ['DT', 'NN', 'IN', 'DT', 'NN', 'VBZ', '.']
    deps = ['det', 'nsubj', 'prep', 'det', 'pobj', 'ROOT', 'punct']
    tokens = en_tokenizer(text)
    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
    chunks = list(doc.noun_chunks)
    assert len(chunks) == 2
    assert chunks[0].text_with_ws == "A phrase "
    assert chunks[1].text_with_ws == "another phrase "


def test_en_parser_noun_chunks_appositional_modifiers(en_tokenizer):
    text = "Sam, my brother, arrived to the house."
    heads = [5, -1, 1, -3, -4, 0, -1, 1, -2, -4]
    tags = ['NNP', ',', 'PRP$', 'NN', ',', 'VBD', 'IN', 'DT', 'NN', '.']
    deps = ['nsubj', 'punct', 'poss', 'appos', 'punct', 'ROOT', 'prep', 'det', 'pobj', 'punct']
    tokens = en_tokenizer(text)
    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
    chunks = list(doc.noun_chunks)
    assert len(chunks) == 3
    assert chunks[0].text_with_ws == "Sam "

@ -62,14 +56,13 @@ def test_parser_noun_chunks_appositional_modifiers(en_tokenizer):

    assert chunks[2].text_with_ws == "the house "


def test_en_parser_noun_chunks_dative(en_tokenizer):
    text = "She gave Bob a raise."
    heads = [1, 0, -1, 1, -3, -4]
    tags = ['PRP', 'VBD', 'NNP', 'DT', 'NN', '.']
    deps = ['nsubj', 'ROOT', 'dative', 'det', 'dobj', 'punct']
    tokens = en_tokenizer(text)
    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
    chunks = list(doc.noun_chunks)
    assert len(chunks) == 3
    assert chunks[0].text_with_ws == "She "

View File

@ -1,92 +1,89 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest


@pytest.mark.parametrize('text', ["(can)"])
def test_en_tokenizer_splits_no_special(en_tokenizer, text):
    tokens = en_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ["can't"])
def test_en_tokenizer_splits_no_punct(en_tokenizer, text):
    tokens = en_tokenizer(text)
    assert len(tokens) == 2


@pytest.mark.parametrize('text', ["(can't"])
def test_en_tokenizer_splits_prefix_punct(en_tokenizer, text):
    tokens = en_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ["can't)"])
def test_en_tokenizer_splits_suffix_punct(en_tokenizer, text):
    tokens = en_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ["(can't)"])
def test_en_tokenizer_splits_even_wrap(en_tokenizer, text):
    tokens = en_tokenizer(text)
    assert len(tokens) == 4


@pytest.mark.parametrize('text', ["(can't?)"])
def test_en_tokenizer_splits_uneven_wrap(en_tokenizer, text):
    tokens = en_tokenizer(text)
    assert len(tokens) == 5


@pytest.mark.parametrize('text,length', [("U.S.", 1), ("us.", 2), ("(U.S.", 2)])
def test_en_tokenizer_splits_prefix_interact(en_tokenizer, text, length):
    tokens = en_tokenizer(text)
    assert len(tokens) == length


@pytest.mark.parametrize('text', ["U.S.)"])
def test_en_tokenizer_splits_suffix_interact(en_tokenizer, text):
    tokens = en_tokenizer(text)
    assert len(tokens) == 2


@pytest.mark.parametrize('text', ["(U.S.)"])
def test_en_tokenizer_splits_even_wrap_interact(en_tokenizer, text):
    tokens = en_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ["(U.S.?)"])
def test_en_tokenizer_splits_uneven_wrap_interact(en_tokenizer, text):
    tokens = en_tokenizer(text)
    assert len(tokens) == 4


@pytest.mark.parametrize('text', ["best-known"])
def test_en_tokenizer_splits_hyphens(en_tokenizer, text):
    tokens = en_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ["0.1-13.5", "0.0-0.1", "103.27-300"])
def test_en_tokenizer_splits_numeric_range(en_tokenizer, text):
    tokens = en_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ["best.Known", "Hello.World"])
def test_en_tokenizer_splits_period_infix(en_tokenizer, text):
    tokens = en_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ["Hello,world", "one,two"])
def test_en_tokenizer_splits_comma_infix(en_tokenizer, text):
    tokens = en_tokenizer(text)
    assert len(tokens) == 3
    assert tokens[0].text == text.split(",")[0]

@ -95,12 +92,12 @@ def test_tokenizer_splits_comma_infix(en_tokenizer, text):

@pytest.mark.parametrize('text', ["best...Known", "best...known"])
def test_en_tokenizer_splits_ellipsis_infix(en_tokenizer, text):
    tokens = en_tokenizer(text)
    assert len(tokens) == 3


def test_en_tokenizer_splits_double_hyphen_infix(en_tokenizer):
    tokens = en_tokenizer("No decent--let alone well-bred--people.")
    assert tokens[0].text == "No"
    assert tokens[1].text == "decent"

@ -115,7 +112,7 @@ def test_tokenizer_splits_double_hyphen_infix(en_tokenizer):

@pytest.mark.xfail
def test_en_tokenizer_splits_period_abbr(en_tokenizer):
    text = "Today is Tuesday.Mr."
    tokens = en_tokenizer(text)
    assert len(tokens) == 5

@ -127,7 +124,7 @@ def test_tokenizer_splits_period_abbr(en_tokenizer):

@pytest.mark.xfail
def test_en_tokenizer_splits_em_dash_infix(en_tokenizer):
    # Re Issue #225
    tokens = en_tokenizer("""Will this road take me to Puddleton?\u2014No, """
                          """you'll have to walk there.\u2014Ariel.""")

View File

@ -1,13 +1,9 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest

from spacy.util import compile_prefix_regex
from spacy.lang.punctuation import TOKENIZER_PREFIXES


PUNCT_OPEN = ['(', '[', '{', '*']
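
For context (an assumed illustration, not part of this file), compile_prefix_regex turns the rule list imported above into a single anchored regex whose .search method the tokenizer uses to peel punctuation off the start of a string:

from spacy.util import compile_prefix_regex
from spacy.lang.punctuation import TOKENIZER_PREFIXES

prefix_re = compile_prefix_regex(TOKENIZER_PREFIXES)
match = prefix_re.search('"Hello')
# The leading quote matches a prefix rule and would be split off as its own token.
assert match is not None and match.start() == 0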

View File

@ -1,18 +1,17 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest

from ...util import get_doc, apply_transition_sequence


@pytest.mark.parametrize('text', ["A test sentence"])
@pytest.mark.parametrize('punct', ['.', '!', '?', ''])
def test_en_sbd_single_punct(en_tokenizer, text, punct):
    heads = [2, 1, 0, -1] if punct else [2, 1, 0]
    tokens = en_tokenizer(text + punct)
    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
    assert len(doc) == 4 if punct else 3
    assert len(list(doc.sents)) == 1
    assert sum(len(sent) for sent in doc.sents) == len(doc)

@ -26,102 +25,10 @@ def test_en_sentence_breaks(en_tokenizer, en_parser):

            'attr', 'punct']
    transition = ['L-nsubj', 'S', 'L-det', 'R-attr', 'D', 'R-punct', 'B-ROOT',
                  'L-nsubj', 'S', 'L-attr', 'R-attr', 'D', 'R-punct']
    tokens = en_tokenizer(text)
    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
    apply_transition_sequence(en_parser, doc, transition)
    assert len(list(doc.sents)) == 2
    for token in doc:
        assert token.dep != 0 or token.is_space
    assert [token.head.i for token in doc] == [1, 1, 3, 1, 1, 6, 6, 8, 6, 6]
# Currently, there's no way of setting the serializer data for the parser
# without loading the models, so we can't remove the model dependency here yet.
@pytest.mark.xfail
@pytest.mark.models('en')
def test_en_sbd_serialization_projective(EN):
"""Test that before and after serialization, the sentence boundaries are
the same."""
text = "I bought a couch from IKEA It wasn't very comfortable."
transition = ['L-nsubj', 'S', 'L-det', 'R-dobj', 'D', 'R-prep', 'R-pobj',
'B-ROOT', 'L-nsubj', 'R-neg', 'D', 'S', 'L-advmod',
'R-acomp', 'D', 'R-punct']
doc = EN.tokenizer(text)
apply_transition_sequence(EN.parser, doc, transition)
doc_serialized = Doc(EN.vocab).from_bytes(doc.to_bytes())
assert doc.is_parsed == True
assert doc_serialized.is_parsed == True
assert doc.to_bytes() == doc_serialized.to_bytes()
assert [s.text for s in doc.sents] == [s.text for s in doc_serialized.sents]
TEST_CASES = [
pytest.mark.xfail(("Hello World. My name is Jonas.", ["Hello World.", "My name is Jonas."])),
("What is your name? My name is Jonas.", ["What is your name?", "My name is Jonas."]),
("There it is! I found it.", ["There it is!", "I found it."]),
("My name is Jonas E. Smith.", ["My name is Jonas E. Smith."]),
("Please turn to p. 55.", ["Please turn to p. 55."]),
("Were Jane and co. at the party?", ["Were Jane and co. at the party?"]),
("They closed the deal with Pitt, Briggs & Co. at noon.", ["They closed the deal with Pitt, Briggs & Co. at noon."]),
("Let's ask Jane and co. They should know.", ["Let's ask Jane and co.", "They should know."]),
("They closed the deal with Pitt, Briggs & Co. It closed yesterday.", ["They closed the deal with Pitt, Briggs & Co.", "It closed yesterday."]),
("I can see Mt. Fuji from here.", ["I can see Mt. Fuji from here."]),
pytest.mark.xfail(("St. Michael's Church is on 5th st. near the light.", ["St. Michael's Church is on 5th st. near the light."])),
("That is JFK Jr.'s book.", ["That is JFK Jr.'s book."]),
("I visited the U.S.A. last year.", ["I visited the U.S.A. last year."]),
("I live in the E.U. How about you?", ["I live in the E.U.", "How about you?"]),
("I live in the U.S. How about you?", ["I live in the U.S.", "How about you?"]),
("I work for the U.S. Government in Virginia.", ["I work for the U.S. Government in Virginia."]),
("I have lived in the U.S. for 20 years.", ["I have lived in the U.S. for 20 years."]),
pytest.mark.xfail(("At 5 a.m. Mr. Smith went to the bank. He left the bank at 6 P.M. Mr. Smith then went to the store.", ["At 5 a.m. Mr. Smith went to the bank.", "He left the bank at 6 P.M.", "Mr. Smith then went to the store."])),
("She has $100.00 in her bag.", ["She has $100.00 in her bag."]),
("She has $100.00. It is in her bag.", ["She has $100.00.", "It is in her bag."]),
("He teaches science (He previously worked for 5 years as an engineer.) at the local University.", ["He teaches science (He previously worked for 5 years as an engineer.) at the local University."]),
("Her email is Jane.Doe@example.com. I sent her an email.", ["Her email is Jane.Doe@example.com.", "I sent her an email."]),
("The site is: https://www.example.50.com/new-site/awesome_content.html. Please check it out.", ["The site is: https://www.example.50.com/new-site/awesome_content.html.", "Please check it out."]),
pytest.mark.xfail(("She turned to him, 'This is great.' she said.", ["She turned to him, 'This is great.' she said."])),
pytest.mark.xfail(('She turned to him, "This is great." she said.', ['She turned to him, "This is great." she said.'])),
('She turned to him, "This is great." She held the book out to show him.', ['She turned to him, "This is great."', "She held the book out to show him."]),
("Hello!! Long time no see.", ["Hello!!", "Long time no see."]),
("Hello?? Who is there?", ["Hello??", "Who is there?"]),
("Hello!? Is that you?", ["Hello!?", "Is that you?"]),
("Hello?! Is that you?", ["Hello?!", "Is that you?"]),
pytest.mark.xfail(("1.) The first item 2.) The second item", ["1.) The first item", "2.) The second item"])),
pytest.mark.xfail(("1.) The first item. 2.) The second item.", ["1.) The first item.", "2.) The second item."])),
pytest.mark.xfail(("1) The first item 2) The second item", ["1) The first item", "2) The second item"])),
("1) The first item. 2) The second item.", ["1) The first item.", "2) The second item."]),
pytest.mark.xfail(("1. The first item 2. The second item", ["1. The first item", "2. The second item"])),
pytest.mark.xfail(("1. The first item. 2. The second item.", ["1. The first item.", "2. The second item."])),
pytest.mark.xfail(("• 9. The first item • 10. The second item", ["• 9. The first item", "• 10. The second item"])),
pytest.mark.xfail(("9. The first item 10. The second item", ["9. The first item", "10. The second item"])),
pytest.mark.xfail(("a. The first item b. The second item c. The third list item", ["a. The first item", "b. The second item", "c. The third list item"])),
("This is a sentence\ncut off in the middle because pdf.", ["This is a sentence\ncut off in the middle because pdf."]),
("It was a cold \nnight in the city.", ["It was a cold \nnight in the city."]),
pytest.mark.xfail(("features\ncontact manager\nevents, activities\n", ["features", "contact manager", "events, activities"])),
pytest.mark.xfail(("You can find it at N°. 1026.253.553. That is where the treasure is.", ["You can find it at N°. 1026.253.553.", "That is where the treasure is."])),
("She works at Yahoo! in the accounting department.", ["She works at Yahoo! in the accounting department."]),
("We make a good team, you and I. Did you see Albert I. Jones yesterday?", ["We make a good team, you and I.", "Did you see Albert I. Jones yesterday?"]),
("Thoreau argues that by simplifying ones life, “the laws of the universe will appear less complex. . . .”", ["Thoreau argues that by simplifying ones life, “the laws of the universe will appear less complex. . . .”"]),
pytest.mark.xfail((""""Bohr [...] used the analogy of parallel stairways [...]" (Smith 55).""", ['"Bohr [...] used the analogy of parallel stairways [...]" (Smith 55).'])),
("If words are left off at the end of a sentence, and that is all that is omitted, indicate the omission with ellipsis marks (preceded and followed by a space) and then indicate the end of the sentence with a period . . . . Next sentence.", ["If words are left off at the end of a sentence, and that is all that is omitted, indicate the omission with ellipsis marks (preceded and followed by a space) and then indicate the end of the sentence with a period . . . .", "Next sentence."]),
("I never meant that.... She left the store.", ["I never meant that....", "She left the store."]),
pytest.mark.xfail(("I wasnt really ... well, what I mean...see . . . what I'm saying, the thing is . . . I didnt mean it.", ["I wasnt really ... well, what I mean...see . . . what I'm saying, the thing is . . . I didnt mean it."])),
pytest.mark.xfail(("One further habit which was somewhat weakened . . . was that of combining words into self-interpreting compounds. . . . The practice was not abandoned. . . .", ["One further habit which was somewhat weakened . . . was that of combining words into self-interpreting compounds.", ". . . The practice was not abandoned. . . ."])),
pytest.mark.xfail(("Hello world.Today is Tuesday.Mr. Smith went to the store and bought 1,000.That is a lot.", ["Hello world.", "Today is Tuesday.", "Mr. Smith went to the store and bought 1,000.", "That is a lot."]))
]
@pytest.mark.skip
@pytest.mark.models('en')
@pytest.mark.parametrize('text,expected_sents', TEST_CASES)
def test_en_sbd_prag(EN, text, expected_sents):
"""SBD tests from Pragmatic Segmenter"""
doc = EN(text)
sents = []
for sent in doc.sents:
sents.append(''.join(doc[i].string for i in range(sent.start, sent.end)).strip())
assert sents == expected_sents
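A note on the `pytest.mark.xfail((...))` entries wrapped directly inside `TEST_CASES` above: recent pytest releases expect expected-failure cases inside `parametrize` to be spelled with `pytest.param` instead. A minimal sketch of the equivalent form, using a hypothetical test name and two cases lifted from the list above:

```python
import pytest

SBD_CASES = [
    ("What is your name? My name is Jonas.",
     ["What is your name?", "My name is Jonas."]),
    # expected failure expressed via pytest.param(..., marks=...)
    pytest.param("Hello World. My name is Jonas.",
                 ["Hello World.", "My name is Jonas."],
                 marks=pytest.mark.xfail),
]

@pytest.mark.parametrize('text,expected_sents', SBD_CASES)
def test_sbd_cases(text, expected_sents):
    # placeholder body; the real test would run the pipeline and compare
    # [s.text.strip() for s in doc.sents] against expected_sents
    assert isinstance(expected_sents, list)
```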


@ -1,12 +1,8 @@
# coding: utf-8 # coding: utf-8
from __future__ import unicode_literals from __future__ import unicode_literals
from ....parts_of_speech import SPACE
from ....compat import unicode_
from ...util import get_doc from ...util import get_doc
import pytest
def test_en_tagger_load_morph_exc(en_tokenizer): def test_en_tagger_load_morph_exc(en_tokenizer):
text = "I like his style." text = "I like his style."
@ -14,47 +10,6 @@ def test_en_tagger_load_morph_exc(en_tokenizer):
morph_exc = {'VBP': {'like': {'lemma': 'luck'}}} morph_exc = {'VBP': {'like': {'lemma': 'luck'}}}
en_tokenizer.vocab.morphology.load_morph_exceptions(morph_exc) en_tokenizer.vocab.morphology.load_morph_exceptions(morph_exc)
tokens = en_tokenizer(text) tokens = en_tokenizer(text)
doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags) doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags)
assert doc[1].tag_ == 'VBP' assert doc[1].tag_ == 'VBP'
assert doc[1].lemma_ == 'luck' assert doc[1].lemma_ == 'luck'
@pytest.mark.models('en')
def test_tag_names(EN):
text = "I ate pizzas with anchovies."
doc = EN(text, disable=['parser'])
assert type(doc[2].pos) == int
assert isinstance(doc[2].pos_, unicode_)
assert isinstance(doc[2].dep_, unicode_)
assert doc[2].tag_ == u'NNS'
@pytest.mark.xfail
@pytest.mark.models('en')
def test_en_tagger_spaces(EN):
"""Ensure spaces are assigned the POS tag SPACE"""
text = "Some\nspaces are\tnecessary."
doc = EN(text, disable=['parser'])
assert doc[0].pos != SPACE
assert doc[0].pos_ != 'SPACE'
assert doc[1].pos == SPACE
assert doc[1].pos_ == 'SPACE'
assert doc[1].tag_ == 'SP'
assert doc[2].pos != SPACE
assert doc[3].pos != SPACE
assert doc[4].pos == SPACE
@pytest.mark.xfail
@pytest.mark.models('en')
def test_en_tagger_return_char(EN):
"""Ensure spaces are assigned the POS tag SPACE"""
text = ('hi Aaron,\r\n\r\nHow is your schedule today, I was wondering if '
'you had time for a phone\r\ncall this afternoon?\r\n\r\n\r\n')
tokens = EN(text)
for token in tokens:
if token.is_space:
assert token.pos == SPACE
assert tokens[3].text == '\r\n\r\n'
assert tokens[3].is_space
assert tokens[3].pos == SPACE


@ -1,10 +1,8 @@
# coding: utf-8 # coding: utf-8
"""Test that longer and mixed texts are tokenized correctly."""
from __future__ import unicode_literals from __future__ import unicode_literals
import pytest import pytest
from spacy.lang.en.lex_attrs import like_num
def test_en_tokenizer_handles_long_text(en_tokenizer): def test_en_tokenizer_handles_long_text(en_tokenizer):
@ -43,3 +41,9 @@ def test_lex_attrs_like_number(en_tokenizer, text, match):
tokens = en_tokenizer(text) tokens = en_tokenizer(text)
assert len(tokens) == 1 assert len(tokens) == 1
assert tokens[0].like_num == match assert tokens[0].like_num == match
@pytest.mark.parametrize('word', ['eleven'])
def test_en_lex_attrs_capitals(word):
assert like_num(word)
assert like_num(word.upper())
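The `test_en_lex_attrs_capitals` check added here (and its id/nl/pt/ru counterparts further down) pins down that `like_num` is expected to be case-insensitive. A minimal sketch of that contract, with a made-up word list standing in for the real one rather than the actual `spacy.lang.en.lex_attrs` code:

```python
# Illustrative only; the real implementation lives in spacy.lang.*.lex_attrs.
_num_words = ['zero', 'one', 'two', 'ten', 'eleven', 'twelve', 'hundred']

def like_num(text):
    text = text.replace(',', '').replace('.', '')
    if text.isdigit():
        return True
    return text.lower() in _num_words   # case-insensitive lookup

assert like_num('eleven') and like_num('ELEVEN') and like_num('11')
```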


@ -1,22 +1,21 @@
# coding: utf-8 # coding: utf-8
from __future__ import unicode_literals from __future__ import unicode_literals
import pytest import pytest
@pytest.mark.parametrize('text,lemma', [("aprox.", "aproximadamente"), @pytest.mark.parametrize('text,lemma', [
("aprox.", "aproximadamente"),
("esq.", "esquina"), ("esq.", "esquina"),
("pág.", "página"), ("pág.", "página"),
("p.ej.", "por ejemplo") ("p.ej.", "por ejemplo")])
]) def test_es_tokenizer_handles_abbr(es_tokenizer, text, lemma):
def test_tokenizer_handles_abbr(es_tokenizer, text, lemma):
tokens = es_tokenizer(text) tokens = es_tokenizer(text)
assert len(tokens) == 1 assert len(tokens) == 1
assert tokens[0].lemma_ == lemma assert tokens[0].lemma_ == lemma
def test_tokenizer_handles_exc_in_text(es_tokenizer): def test_es_tokenizer_handles_exc_in_text(es_tokenizer):
text = "Mariano Rajoy ha corrido aprox. medio kilómetro" text = "Mariano Rajoy ha corrido aprox. medio kilómetro"
tokens = es_tokenizer(text) tokens = es_tokenizer(text)
assert len(tokens) == 7 assert len(tokens) == 7


@ -1,14 +1,10 @@
# coding: utf-8 # coding: utf-8
"""Test that longer and mixed texts are tokenized correctly."""
from __future__ import unicode_literals from __future__ import unicode_literals
import pytest import pytest
def test_tokenizer_handles_long_text(es_tokenizer): def test_es_tokenizer_handles_long_text(es_tokenizer):
text = """Cuando a José Mujica lo invitaron a dar una conferencia text = """Cuando a José Mujica lo invitaron a dar una conferencia
en Oxford este verano, su cabeza hizo "crac". La "más antigua" universidad de habla en Oxford este verano, su cabeza hizo "crac". La "más antigua" universidad de habla
@ -30,6 +26,6 @@ en Montevideo y que pregona las bondades de la vida austera."""
("""¡Sí! "Vámonos", contestó José Arcadio Buendía""", 11), ("""¡Sí! "Vámonos", contestó José Arcadio Buendía""", 11),
("Corrieron aprox. 10km.", 5), ("Corrieron aprox. 10km.", 5),
("Y entonces por qué...", 5)]) ("Y entonces por qué...", 5)])
def test_tokenizer_handles_cnts(es_tokenizer, text, length): def test_es_tokenizer_handles_cnts(es_tokenizer, text, length):
tokens = es_tokenizer(text) tokens = es_tokenizer(text)
assert len(tokens) == length assert len(tokens) == length


@ -11,7 +11,7 @@ ABBREVIATION_TESTS = [
@pytest.mark.parametrize('text,expected_tokens', ABBREVIATION_TESTS) @pytest.mark.parametrize('text,expected_tokens', ABBREVIATION_TESTS)
def test_tokenizer_handles_testcases(fi_tokenizer, text, expected_tokens): def test_fi_tokenizer_handles_testcases(fi_tokenizer, text, expected_tokens):
tokens = fi_tokenizer(text) tokens = fi_tokenizer(text)
token_list = [token.text for token in tokens if not token.is_space] token_list = [token.text for token in tokens if not token.is_space]
assert expected_tokens == token_list assert expected_tokens == token_list


@ -1,29 +1,29 @@
# coding: utf-8 # coding: utf-8
from __future__ import unicode_literals from __future__ import unicode_literals
import pytest import pytest
@pytest.mark.parametrize('text', ["aujourd'hui", "Aujourd'hui", "prud'hommes", @pytest.mark.parametrize('text', [
"prudhommal"]) "aujourd'hui", "Aujourd'hui", "prud'hommes", "prudhommal"])
def test_tokenizer_infix_exceptions(fr_tokenizer, text): def test_fr_tokenizer_infix_exceptions(fr_tokenizer, text):
tokens = fr_tokenizer(text) tokens = fr_tokenizer(text)
assert len(tokens) == 1 assert len(tokens) == 1
@pytest.mark.parametrize('text,lemma', [("janv.", "janvier"), @pytest.mark.parametrize('text,lemma', [
("janv.", "janvier"),
("juill.", "juillet"), ("juill.", "juillet"),
("Dr.", "docteur"), ("Dr.", "docteur"),
("av.", "avant"), ("av.", "avant"),
("sept.", "septembre")]) ("sept.", "septembre")])
def test_tokenizer_handles_abbr(fr_tokenizer, text, lemma): def test_fr_tokenizer_handles_abbr(fr_tokenizer, text, lemma):
tokens = fr_tokenizer(text) tokens = fr_tokenizer(text)
assert len(tokens) == 1 assert len(tokens) == 1
assert tokens[0].lemma_ == lemma assert tokens[0].lemma_ == lemma
def test_tokenizer_handles_exc_in_text(fr_tokenizer): def test_fr_tokenizer_handles_exc_in_text(fr_tokenizer):
text = "Je suis allé au mois de janv. aux prudhommes." text = "Je suis allé au mois de janv. aux prudhommes."
tokens = fr_tokenizer(text) tokens = fr_tokenizer(text)
assert len(tokens) == 10 assert len(tokens) == 10
@ -32,14 +32,15 @@ def test_tokenizer_handles_exc_in_text(fr_tokenizer):
assert tokens[8].text == "prudhommes" assert tokens[8].text == "prudhommes"
def test_tokenizer_handles_exc_in_text_2(fr_tokenizer): def test_fr_tokenizer_handles_exc_in_text_2(fr_tokenizer):
text = "Cette après-midi, je suis allé dans un restaurant italo-mexicain." text = "Cette après-midi, je suis allé dans un restaurant italo-mexicain."
tokens = fr_tokenizer(text) tokens = fr_tokenizer(text)
assert len(tokens) == 11 assert len(tokens) == 11
assert tokens[1].text == "après-midi" assert tokens[1].text == "après-midi"
assert tokens[9].text == "italo-mexicain" assert tokens[9].text == "italo-mexicain"
def test_tokenizer_handles_title(fr_tokenizer):
def test_fr_tokenizer_handles_title(fr_tokenizer):
text = "N'est-ce pas génial?" text = "N'est-ce pas génial?"
tokens = fr_tokenizer(text) tokens = fr_tokenizer(text)
assert len(tokens) == 6 assert len(tokens) == 6
@ -50,14 +51,16 @@ def test_tokenizer_handles_title(fr_tokenizer):
assert tokens[2].text == "-ce" assert tokens[2].text == "-ce"
assert tokens[2].lemma_ == "ce" assert tokens[2].lemma_ == "ce"
def test_tokenizer_handles_title_2(fr_tokenizer):
def test_fr_tokenizer_handles_title_2(fr_tokenizer):
text = "Est-ce pas génial?" text = "Est-ce pas génial?"
tokens = fr_tokenizer(text) tokens = fr_tokenizer(text)
assert len(tokens) == 6 assert len(tokens) == 6
assert tokens[0].text == "Est" assert tokens[0].text == "Est"
assert tokens[0].lemma_ == "être" assert tokens[0].lemma_ == "être"
def test_tokenizer_handles_title_2(fr_tokenizer):
def test_fr_tokenizer_handles_title_2(fr_tokenizer):
text = "Qu'est-ce que tu fais?" text = "Qu'est-ce que tu fais?"
tokens = fr_tokenizer(text) tokens = fr_tokenizer(text)
assert len(tokens) == 7 assert len(tokens) == 7


@ -4,25 +4,25 @@ from __future__ import unicode_literals
import pytest import pytest
def test_lemmatizer_verb(fr_tokenizer): def test_fr_lemmatizer_verb(fr_tokenizer):
tokens = fr_tokenizer("Qu'est-ce que tu fais?") tokens = fr_tokenizer("Qu'est-ce que tu fais?")
assert tokens[0].lemma_ == "que" assert tokens[0].lemma_ == "que"
assert tokens[1].lemma_ == "être" assert tokens[1].lemma_ == "être"
assert tokens[5].lemma_ == "faire" assert tokens[5].lemma_ == "faire"
def test_lemmatizer_noun_verb_2(fr_tokenizer): def test_fr_lemmatizer_noun_verb_2(fr_tokenizer):
tokens = fr_tokenizer("Les abaissements de température sont gênants.") tokens = fr_tokenizer("Les abaissements de température sont gênants.")
assert tokens[4].lemma_ == "être" assert tokens[4].lemma_ == "être"
@pytest.mark.xfail(reason="Costaricienne TAG is PROPN instead of NOUN and spacy don't lemmatize PROPN") @pytest.mark.xfail(reason="Costaricienne TAG is PROPN instead of NOUN and spacy don't lemmatize PROPN")
def test_lemmatizer_noun(fr_tokenizer): def test_fr_lemmatizer_noun(fr_tokenizer):
tokens = fr_tokenizer("il y a des Costaricienne.") tokens = fr_tokenizer("il y a des Costaricienne.")
assert tokens[4].lemma_ == "Costaricain" assert tokens[4].lemma_ == "Costaricain"
def test_lemmatizer_noun_2(fr_tokenizer): def test_fr_lemmatizer_noun_2(fr_tokenizer):
tokens = fr_tokenizer("Les abaissements de température sont gênants.") tokens = fr_tokenizer("Les abaissements de température sont gênants.")
assert tokens[1].lemma_ == "abaissement" assert tokens[1].lemma_ == "abaissement"
assert tokens[5].lemma_ == "gênant" assert tokens[5].lemma_ == "gênant"


@ -0,0 +1,23 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from spacy.language import Language
from spacy.lang.punctuation import TOKENIZER_INFIXES
from spacy.lang.char_classes import ALPHA
@pytest.mark.parametrize('text,expected_tokens', [
("l'avion", ["l'", "avion"]), ("j'ai", ["j'", "ai"])])
def test_issue768(text, expected_tokens):
"""Allow zero-width 'infix' token during the tokenization process."""
SPLIT_INFIX = r'(?<=[{a}]\')(?=[{a}])'.format(a=ALPHA)
class FrenchTest(Language):
class Defaults(Language.Defaults):
infixes = TOKENIZER_INFIXES + [SPLIT_INFIX]
fr_tokenizer_w_infix = FrenchTest.Defaults.create_tokenizer()
tokens = fr_tokenizer_w_infix(text)
assert len(tokens) == 2
assert [t.text for t in tokens] == expected_tokens


@ -1,6 +1,9 @@
# coding: utf8 # coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
import pytest
from spacy.lang.fr.lex_attrs import like_num
def test_tokenizer_handles_long_text(fr_tokenizer): def test_tokenizer_handles_long_text(fr_tokenizer):
text = """L'histoire du TAL commence dans les années 1950, bien que l'on puisse \ text = """L'histoire du TAL commence dans les années 1950, bien que l'on puisse \
@ -12,6 +15,11 @@ un humain dans une conversation écrite en temps réel, de façon suffisamment \
convaincante que l'interlocuteur humain ne peut distinguer sûrement — sur la \ convaincante que l'interlocuteur humain ne peut distinguer sûrement — sur la \
base du seul contenu de la conversation s'il interagit avec un programme \ base du seul contenu de la conversation s'il interagit avec un programme \
ou avec un autre vrai humain.""" ou avec un autre vrai humain."""
tokens = fr_tokenizer(text) tokens = fr_tokenizer(text)
assert len(tokens) == 113 assert len(tokens) == 113
@pytest.mark.parametrize('word', ['onze', 'onzième'])
def test_fr_lex_attrs_capitals(word):
assert like_num(word)
assert like_num(word.upper())


@ -11,7 +11,7 @@ GA_TOKEN_EXCEPTION_TESTS = [
@pytest.mark.parametrize('text,expected_tokens', GA_TOKEN_EXCEPTION_TESTS) @pytest.mark.parametrize('text,expected_tokens', GA_TOKEN_EXCEPTION_TESTS)
def test_tokenizer_handles_exception_cases(ga_tokenizer, text, expected_tokens): def test_ga_tokenizer_handles_exception_cases(ga_tokenizer, text, expected_tokens):
tokens = ga_tokenizer(text) tokens = ga_tokenizer(text)
token_list = [token.text for token in tokens if not token.is_space] token_list = [token.text for token in tokens if not token.is_space]
assert expected_tokens == token_list assert expected_tokens == token_list


@ -6,7 +6,7 @@ import pytest
@pytest.mark.parametrize('text,expected_tokens', @pytest.mark.parametrize('text,expected_tokens',
[('פייתון היא שפת תכנות דינמית', ['פייתון', 'היא', 'שפת', 'תכנות', 'דינמית'])]) [('פייתון היא שפת תכנות דינמית', ['פייתון', 'היא', 'שפת', 'תכנות', 'דינמית'])])
def test_tokenizer_handles_abbreviation(he_tokenizer, text, expected_tokens): def test_he_tokenizer_handles_abbreviation(he_tokenizer, text, expected_tokens):
tokens = he_tokenizer(text) tokens = he_tokenizer(text)
token_list = [token.text for token in tokens if not token.is_space] token_list = [token.text for token in tokens if not token.is_space]
assert expected_tokens == token_list assert expected_tokens == token_list
@ -18,6 +18,6 @@ def test_tokenizer_handles_abbreviation(he_tokenizer, text, expected_tokens):
('עקבת אחריו בכל רחבי המדינה!', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '!']), ('עקבת אחריו בכל רחבי המדינה!', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '!']),
('עקבת אחריו בכל רחבי המדינה..', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '..']), ('עקבת אחריו בכל רחבי המדינה..', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '..']),
('עקבת אחריו בכל רחבי המדינה...', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '...'])]) ('עקבת אחריו בכל רחבי המדינה...', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '...'])])
def test_tokenizer_handles_punct(he_tokenizer, text, expected_tokens): def test_he_tokenizer_handles_punct(he_tokenizer, text, expected_tokens):
tokens = he_tokenizer(text) tokens = he_tokenizer(text)
assert expected_tokens == [token.text for token in tokens] assert expected_tokens == [token.text for token in tokens]


@ -3,6 +3,7 @@ from __future__ import unicode_literals
import pytest import pytest
DEFAULT_TESTS = [ DEFAULT_TESTS = [
('N. kormányzósági\nszékhely.', ['N.', 'kormányzósági', 'székhely', '.']), ('N. kormányzósági\nszékhely.', ['N.', 'kormányzósági', 'székhely', '.']),
pytest.mark.xfail(('A .hu egy tld.', ['A', '.hu', 'egy', 'tld', '.'])), pytest.mark.xfail(('A .hu egy tld.', ['A', '.hu', 'egy', 'tld', '.'])),
@ -277,7 +278,7 @@ TESTCASES = DEFAULT_TESTS + DOT_TESTS + QUOTE_TESTS + NUMBER_TESTS + HYPHEN_TEST
@pytest.mark.parametrize('text,expected_tokens', TESTCASES) @pytest.mark.parametrize('text,expected_tokens', TESTCASES)
def test_tokenizer_handles_testcases(hu_tokenizer, text, expected_tokens): def test_hu_tokenizer_handles_testcases(hu_tokenizer, text, expected_tokens):
tokens = hu_tokenizer(text) tokens = hu_tokenizer(text)
token_list = [token.text for token in tokens if not token.is_space] token_list = [token.text for token in tokens if not token.is_space]
assert expected_tokens == token_list assert expected_tokens == token_list


@ -1,38 +1,35 @@
# coding: utf-8 # coding: utf-8
"""Test that tokenizer prefixes, suffixes and infixes are handled correctly."""
from __future__ import unicode_literals from __future__ import unicode_literals
import pytest import pytest
@pytest.mark.parametrize('text', ["(Ma'arif)"]) @pytest.mark.parametrize('text', ["(Ma'arif)"])
def test_tokenizer_splits_no_special(id_tokenizer, text): def test_id_tokenizer_splits_no_special(id_tokenizer, text):
tokens = id_tokenizer(text) tokens = id_tokenizer(text)
assert len(tokens) == 3 assert len(tokens) == 3
@pytest.mark.parametrize('text', ["Ma'arif"]) @pytest.mark.parametrize('text', ["Ma'arif"])
def test_tokenizer_splits_no_punct(id_tokenizer, text): def test_id_tokenizer_splits_no_punct(id_tokenizer, text):
tokens = id_tokenizer(text) tokens = id_tokenizer(text)
assert len(tokens) == 1 assert len(tokens) == 1
@pytest.mark.parametrize('text', ["(Ma'arif"]) @pytest.mark.parametrize('text', ["(Ma'arif"])
def test_tokenizer_splits_prefix_punct(id_tokenizer, text): def test_id_tokenizer_splits_prefix_punct(id_tokenizer, text):
tokens = id_tokenizer(text) tokens = id_tokenizer(text)
assert len(tokens) == 2 assert len(tokens) == 2
@pytest.mark.parametrize('text', ["Ma'arif)"]) @pytest.mark.parametrize('text', ["Ma'arif)"])
def test_tokenizer_splits_suffix_punct(id_tokenizer, text): def test_id_tokenizer_splits_suffix_punct(id_tokenizer, text):
tokens = id_tokenizer(text) tokens = id_tokenizer(text)
assert len(tokens) == 2 assert len(tokens) == 2
@pytest.mark.parametrize('text', ["(Ma'arif)"]) @pytest.mark.parametrize('text', ["(Ma'arif)"])
def test_tokenizer_splits_even_wrap(id_tokenizer, text): def test_id_tokenizer_splits_even_wrap(id_tokenizer, text):
tokens = id_tokenizer(text) tokens = id_tokenizer(text)
assert len(tokens) == 3 assert len(tokens) == 3
@ -44,49 +41,49 @@ def test_tokenizer_splits_uneven_wrap(id_tokenizer, text):
@pytest.mark.parametrize('text,length', [("S.Kom.", 1), ("SKom.", 2), ("(S.Kom.", 2)]) @pytest.mark.parametrize('text,length', [("S.Kom.", 1), ("SKom.", 2), ("(S.Kom.", 2)])
def test_tokenizer_splits_prefix_interact(id_tokenizer, text, length): def test_id_tokenizer_splits_prefix_interact(id_tokenizer, text, length):
tokens = id_tokenizer(text) tokens = id_tokenizer(text)
assert len(tokens) == length assert len(tokens) == length
@pytest.mark.parametrize('text', ["S.Kom.)"]) @pytest.mark.parametrize('text', ["S.Kom.)"])
def test_tokenizer_splits_suffix_interact(id_tokenizer, text): def test_id_tokenizer_splits_suffix_interact(id_tokenizer, text):
tokens = id_tokenizer(text) tokens = id_tokenizer(text)
assert len(tokens) == 2 assert len(tokens) == 2
@pytest.mark.parametrize('text', ["(S.Kom.)"]) @pytest.mark.parametrize('text', ["(S.Kom.)"])
def test_tokenizer_splits_even_wrap_interact(id_tokenizer, text): def test_id_tokenizer_splits_even_wrap_interact(id_tokenizer, text):
tokens = id_tokenizer(text) tokens = id_tokenizer(text)
assert len(tokens) == 3 assert len(tokens) == 3
@pytest.mark.parametrize('text', ["(S.Kom.?)"]) @pytest.mark.parametrize('text', ["(S.Kom.?)"])
def test_tokenizer_splits_uneven_wrap_interact(id_tokenizer, text): def test_id_tokenizer_splits_uneven_wrap_interact(id_tokenizer, text):
tokens = id_tokenizer(text) tokens = id_tokenizer(text)
assert len(tokens) == 4 assert len(tokens) == 4
@pytest.mark.parametrize('text,length', [("gara-gara", 1), ("Jokowi-Ahok", 3), ("Sukarno-Hatta", 3)]) @pytest.mark.parametrize('text,length', [("gara-gara", 1), ("Jokowi-Ahok", 3), ("Sukarno-Hatta", 3)])
def test_tokenizer_splits_hyphens(id_tokenizer, text, length): def test_id_tokenizer_splits_hyphens(id_tokenizer, text, length):
tokens = id_tokenizer(text) tokens = id_tokenizer(text)
assert len(tokens) == length assert len(tokens) == length
@pytest.mark.parametrize('text', ["0.1-13.5", "0.0-0.1", "103.27-300"]) @pytest.mark.parametrize('text', ["0.1-13.5", "0.0-0.1", "103.27-300"])
def test_tokenizer_splits_numeric_range(id_tokenizer, text): def test_id_tokenizer_splits_numeric_range(id_tokenizer, text):
tokens = id_tokenizer(text) tokens = id_tokenizer(text)
assert len(tokens) == 3 assert len(tokens) == 3
@pytest.mark.parametrize('text', ["ini.Budi", "Halo.Bandung"]) @pytest.mark.parametrize('text', ["ini.Budi", "Halo.Bandung"])
def test_tokenizer_splits_period_infix(id_tokenizer, text): def test_id_tokenizer_splits_period_infix(id_tokenizer, text):
tokens = id_tokenizer(text) tokens = id_tokenizer(text)
assert len(tokens) == 3 assert len(tokens) == 3
@pytest.mark.parametrize('text', ["Halo,Bandung", "satu,dua"]) @pytest.mark.parametrize('text', ["Halo,Bandung", "satu,dua"])
def test_tokenizer_splits_comma_infix(id_tokenizer, text): def test_id_tokenizer_splits_comma_infix(id_tokenizer, text):
tokens = id_tokenizer(text) tokens = id_tokenizer(text)
assert len(tokens) == 3 assert len(tokens) == 3
assert tokens[0].text == text.split(",")[0] assert tokens[0].text == text.split(",")[0]
@ -95,12 +92,12 @@ def test_tokenizer_splits_comma_infix(id_tokenizer, text):
@pytest.mark.parametrize('text', ["halo...Bandung", "dia...pergi"]) @pytest.mark.parametrize('text', ["halo...Bandung", "dia...pergi"])
def test_tokenizer_splits_ellipsis_infix(id_tokenizer, text): def test_id_tokenizer_splits_ellipsis_infix(id_tokenizer, text):
tokens = id_tokenizer(text) tokens = id_tokenizer(text)
assert len(tokens) == 3 assert len(tokens) == 3
def test_tokenizer_splits_double_hyphen_infix(id_tokenizer): def test_id_tokenizer_splits_double_hyphen_infix(id_tokenizer):
tokens = id_tokenizer("Arsene Wenger--manajer Arsenal--melakukan konferensi pers.") tokens = id_tokenizer("Arsene Wenger--manajer Arsenal--melakukan konferensi pers.")
assert len(tokens) == 10 assert len(tokens) == 10
assert tokens[0].text == "Arsene" assert tokens[0].text == "Arsene"


@ -0,0 +1,11 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from spacy.lang.id.lex_attrs import like_num
@pytest.mark.parametrize('word', ['sebelas'])
def test_id_lex_attrs_capitals(word):
assert like_num(word)
assert like_num(word.upper())


@ -1,18 +0,0 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
LEMMAS = (
('新しく', '新しい'),
('赤く', '赤い'),
('すごく', '凄い'),
('いただきました', '頂く'),
('なった', '成る'))
@pytest.mark.parametrize('word,lemma', LEMMAS)
def test_japanese_lemmas(JA, word, lemma):
test_lemma = JA(word)[0].lemma_
assert test_lemma == lemma


@ -0,0 +1,15 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
@pytest.mark.parametrize('word,lemma', [
('新しく', '新しい'),
('赤く', '赤い'),
('すごく', '凄い'),
('いただきました', '頂く'),
('なった', '成る')])
def test_ja_lemmatizer_assigns(ja_tokenizer, word, lemma):
test_lemma = ja_tokenizer(word)[0].lemma_
assert test_lemma == lemma


@ -30,16 +30,18 @@ POS_TESTS = [
@pytest.mark.parametrize('text,expected_tokens', TOKENIZER_TESTS) @pytest.mark.parametrize('text,expected_tokens', TOKENIZER_TESTS)
def test_japanese_tokenizer(ja_tokenizer, text, expected_tokens): def test_ja_tokenizer(ja_tokenizer, text, expected_tokens):
tokens = [token.text for token in ja_tokenizer(text)] tokens = [token.text for token in ja_tokenizer(text)]
assert tokens == expected_tokens assert tokens == expected_tokens
@pytest.mark.parametrize('text,expected_tags', TAG_TESTS) @pytest.mark.parametrize('text,expected_tags', TAG_TESTS)
def test_japanese_tokenizer(ja_tokenizer, text, expected_tags): def test_ja_tokenizer(ja_tokenizer, text, expected_tags):
tags = [token.tag_ for token in ja_tokenizer(text)] tags = [token.tag_ for token in ja_tokenizer(text)]
assert tags == expected_tags assert tags == expected_tags
@pytest.mark.parametrize('text,expected_pos', POS_TESTS) @pytest.mark.parametrize('text,expected_pos', POS_TESTS)
def test_japanese_tokenizer(ja_tokenizer, text, expected_pos): def test_ja_tokenizer(ja_tokenizer, text, expected_pos):
pos = [token.pos_ for token in ja_tokenizer(text)] pos = [token.pos_ for token in ja_tokenizer(text)]
assert pos == expected_pos assert pos == expected_pos


@ -11,7 +11,7 @@ NB_TOKEN_EXCEPTION_TESTS = [
@pytest.mark.parametrize('text,expected_tokens', NB_TOKEN_EXCEPTION_TESTS) @pytest.mark.parametrize('text,expected_tokens', NB_TOKEN_EXCEPTION_TESTS)
def test_tokenizer_handles_exception_cases(nb_tokenizer, text, expected_tokens): def test_nb_tokenizer_handles_exception_cases(nb_tokenizer, text, expected_tokens):
tokens = nb_tokenizer(text) tokens = nb_tokenizer(text)
token_list = [token.text for token in tokens if not token.is_space] token_list = [token.text for token in tokens if not token.is_space]
assert expected_tokens == token_list assert expected_tokens == token_list


@ -0,0 +1,11 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from spacy.lang.nl.lex_attrs import like_num
@pytest.mark.parametrize('word', ['elf', 'elfde'])
def test_nl_lex_attrs_capitals(word):
assert like_num(word)
assert like_num(word.upper())


@ -0,0 +1,11 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from spacy.lang.pt.lex_attrs import like_num
@pytest.mark.parametrize('word', ['onze', 'quadragésimo'])
def test_pt_lex_attrs_capitals(word):
assert like_num(word)
assert like_num(word.upper())


@ -4,10 +4,11 @@ from __future__ import unicode_literals
import pytest import pytest
@pytest.mark.parametrize('string,lemma', [('câini', 'câine'), @pytest.mark.parametrize('string,lemma', [
('câini', 'câine'),
('expedițiilor', 'expediție'), ('expedițiilor', 'expediție'),
('pensete', 'pensetă'), ('pensete', 'pensetă'),
('erau', 'fi')]) ('erau', 'fi')])
def test_lemmatizer_lookup_assigns(ro_tokenizer, string, lemma): def test_ro_lemmatizer_lookup_assigns(ro_tokenizer, string, lemma):
tokens = ro_tokenizer(string) tokens = ro_tokenizer(string)
assert tokens[0].lemma_ == lemma assert tokens[0].lemma_ == lemma


@ -3,23 +3,20 @@ from __future__ import unicode_literals
import pytest import pytest
DEFAULT_TESTS = [
TEST_CASES = [
('Adresa este str. Principală nr. 5.', ['Adresa', 'este', 'str.', 'Principală', 'nr.', '5', '.']), ('Adresa este str. Principală nr. 5.', ['Adresa', 'este', 'str.', 'Principală', 'nr.', '5', '.']),
('Teste, etc.', ['Teste', ',', 'etc.']), ('Teste, etc.', ['Teste', ',', 'etc.']),
('Lista, ș.a.m.d.', ['Lista', ',', 'ș.a.m.d.']), ('Lista, ș.a.m.d.', ['Lista', ',', 'ș.a.m.d.']),
('Și d.p.d.v. al...', ['Și', 'd.p.d.v.', 'al', '...']) ('Și d.p.d.v. al...', ['Și', 'd.p.d.v.', 'al', '...']),
] # number tests
NUMBER_TESTS = [
('Clasa a 4-a.', ['Clasa', 'a', '4-a', '.']), ('Clasa a 4-a.', ['Clasa', 'a', '4-a', '.']),
('Al 12-lea ceas.', ['Al', '12-lea', 'ceas', '.']) ('Al 12-lea ceas.', ['Al', '12-lea', 'ceas', '.'])
] ]
TESTCASES = DEFAULT_TESTS + NUMBER_TESTS
@pytest.mark.parametrize('text,expected_tokens', TEST_CASES)
@pytest.mark.parametrize('text,expected_tokens', TESTCASES) def test_ro_tokenizer_handles_testcases(ro_tokenizer, text, expected_tokens):
def test_tokenizer_handles_testcases(ro_tokenizer, text, expected_tokens):
tokens = ro_tokenizer(text) tokens = ro_tokenizer(text)
token_list = [token.text for token in tokens if not token.is_space] token_list = [token.text for token in tokens if not token.is_space]
assert expected_tokens == token_list assert expected_tokens == token_list


@ -0,0 +1,14 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
@pytest.mark.parametrize('text,norms', [
("пн.", ["понедельник"]),
("пт.", ["пятница"]),
("дек.", ["декабрь"])])
def test_ru_tokenizer_abbrev_exceptions(ru_tokenizer, text, norms):
tokens = ru_tokenizer(text)
assert len(tokens) == 1
assert [token.norm_ for token in tokens] == norms


@ -2,27 +2,29 @@
from __future__ import unicode_literals from __future__ import unicode_literals
import pytest import pytest
from ....tokens.doc import Doc from spacy.lang.ru import Russian
from ...util import get_doc
@pytest.fixture @pytest.fixture
def ru_lemmatizer(RU): def ru_lemmatizer():
return RU.Defaults.create_lemmatizer() pymorphy = pytest.importorskip('pymorphy2')
return Russian.Defaults.create_lemmatizer()
@pytest.mark.models('ru') def test_ru_doc_lemmatization(ru_tokenizer):
def test_doc_lemmatization(RU): words = ['мама', 'мыла', 'раму']
doc = Doc(RU.vocab, words=['мама', 'мыла', 'раму']) tags = ['NOUN__Animacy=Anim|Case=Nom|Gender=Fem|Number=Sing',
doc[0].tag_ = 'NOUN__Animacy=Anim|Case=Nom|Gender=Fem|Number=Sing' 'VERB__Aspect=Imp|Gender=Fem|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act',
doc[1].tag_ = 'VERB__Aspect=Imp|Gender=Fem|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act' 'NOUN__Animacy=Anim|Case=Acc|Gender=Fem|Number=Sing']
doc[2].tag_ = 'NOUN__Animacy=Anim|Case=Acc|Gender=Fem|Number=Sing' doc = get_doc(ru_tokenizer.vocab, words=words, tags=tags)
lemmas = [token.lemma_ for token in doc] lemmas = [token.lemma_ for token in doc]
assert lemmas == ['мама', 'мыть', 'рама'] assert lemmas == ['мама', 'мыть', 'рама']
@pytest.mark.models('ru') @pytest.mark.parametrize('text,lemmas', [
@pytest.mark.parametrize('text,lemmas', [('гвоздики', ['гвоздик', 'гвоздика']), ('гвоздики', ['гвоздик', 'гвоздика']),
('люди', ['человек']), ('люди', ['человек']),
('реки', ['река']), ('реки', ['река']),
('кольцо', ['кольцо']), ('кольцо', ['кольцо']),
@ -32,7 +34,8 @@ def test_ru_lemmatizer_noun_lemmas(ru_lemmatizer, text, lemmas):
@pytest.mark.models('ru') @pytest.mark.models('ru')
@pytest.mark.parametrize('text,pos,morphology,lemma', [('рой', 'NOUN', None, 'рой'), @pytest.mark.parametrize('text,pos,morphology,lemma', [
('рой', 'NOUN', None, 'рой'),
('рой', 'VERB', None, 'рыть'), ('рой', 'VERB', None, 'рыть'),
('клей', 'NOUN', None, 'клей'), ('клей', 'NOUN', None, 'клей'),
('клей', 'VERB', None, 'клеить'), ('клей', 'VERB', None, 'клеить'),
@ -41,31 +44,20 @@ def test_ru_lemmatizer_noun_lemmas(ru_lemmatizer, text, lemmas):
('кос', 'NOUN', {'Number': 'Plur'}, 'коса'), ('кос', 'NOUN', {'Number': 'Plur'}, 'коса'),
('кос', 'ADJ', None, 'косой'), ('кос', 'ADJ', None, 'косой'),
('потом', 'NOUN', None, 'пот'), ('потом', 'NOUN', None, 'пот'),
('потом', 'ADV', None, 'потом') ('потом', 'ADV', None, 'потом')])
])
def test_ru_lemmatizer_works_with_different_pos_homonyms(ru_lemmatizer, text, pos, morphology, lemma): def test_ru_lemmatizer_works_with_different_pos_homonyms(ru_lemmatizer, text, pos, morphology, lemma):
assert ru_lemmatizer(text, pos, morphology) == [lemma] assert ru_lemmatizer(text, pos, morphology) == [lemma]
@pytest.mark.models('ru') @pytest.mark.parametrize('text,morphology,lemma', [
@pytest.mark.parametrize('text,morphology,lemma', [('гвоздики', {'Gender': 'Fem'}, 'гвоздика'), ('гвоздики', {'Gender': 'Fem'}, 'гвоздика'),
('гвоздики', {'Gender': 'Masc'}, 'гвоздик'), ('гвоздики', {'Gender': 'Masc'}, 'гвоздик'),
('вина', {'Gender': 'Fem'}, 'вина'), ('вина', {'Gender': 'Fem'}, 'вина'),
('вина', {'Gender': 'Neut'}, 'вино') ('вина', {'Gender': 'Neut'}, 'вино')])
])
def test_ru_lemmatizer_works_with_noun_homonyms(ru_lemmatizer, text, morphology, lemma): def test_ru_lemmatizer_works_with_noun_homonyms(ru_lemmatizer, text, morphology, lemma):
assert ru_lemmatizer.noun(text, morphology) == [lemma] assert ru_lemmatizer.noun(text, morphology) == [lemma]
@pytest.mark.models('ru')
def test_ru_lemmatizer_punct(ru_lemmatizer): def test_ru_lemmatizer_punct(ru_lemmatizer):
assert ru_lemmatizer.punct('«') == ['"'] assert ru_lemmatizer.punct('«') == ['"']
assert ru_lemmatizer.punct('»') == ['"'] assert ru_lemmatizer.punct('»') == ['"']
# @pytest.mark.models('ru')
# def test_ru_lemmatizer_lemma_assignment(RU):
# text = "А роза упала на лапу Азора."
# doc = RU.make_doc(text)
# RU.tagger(doc)
# assert all(t.lemma_ != '' for t in doc)


@ -0,0 +1,11 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from spacy.lang.ru.lex_attrs import like_num
@pytest.mark.parametrize('word', ['одиннадцать'])
def test_ru_lex_attrs_capitals(word):
assert like_num(word)
assert like_num(word.upper())


@ -1,7 +1,4 @@
# coding: utf-8 # coding: utf-8
"""Test that open, closed and paired punctuation is split off correctly."""
from __future__ import unicode_literals from __future__ import unicode_literals
import pytest import pytest


@ -1,16 +0,0 @@
# coding: utf-8
"""Test that tokenizer exceptions are parsed correctly."""
from __future__ import unicode_literals
import pytest
@pytest.mark.parametrize('text,norms', [("пн.", ["понедельник"]),
("пт.", ["пятница"]),
("дек.", ["декабрь"])])
def test_ru_tokenizer_abbrev_exceptions(ru_tokenizer, text, norms):
tokens = ru_tokenizer(text)
assert len(tokens) == 1
assert [token.norm_ for token in tokens] == norms


@ -11,14 +11,14 @@ SV_TOKEN_EXCEPTION_TESTS = [
@pytest.mark.parametrize('text,expected_tokens', SV_TOKEN_EXCEPTION_TESTS) @pytest.mark.parametrize('text,expected_tokens', SV_TOKEN_EXCEPTION_TESTS)
def test_tokenizer_handles_exception_cases(sv_tokenizer, text, expected_tokens): def test_sv_tokenizer_handles_exception_cases(sv_tokenizer, text, expected_tokens):
tokens = sv_tokenizer(text) tokens = sv_tokenizer(text)
token_list = [token.text for token in tokens if not token.is_space] token_list = [token.text for token in tokens if not token.is_space]
assert expected_tokens == token_list assert expected_tokens == token_list
@pytest.mark.parametrize('text', ["driveru", "hajaru", "Serru", "Fixaru"]) @pytest.mark.parametrize('text', ["driveru", "hajaru", "Serru", "Fixaru"])
def test_tokenizer_handles_verb_exceptions(sv_tokenizer, text): def test_sv_tokenizer_handles_verb_exceptions(sv_tokenizer, text):
tokens = sv_tokenizer(text) tokens = sv_tokenizer(text)
assert len(tokens) == 2 assert len(tokens) == 2
assert tokens[1].text == "u" assert tokens[1].text == "u"


@ -1,10 +1,9 @@
# coding: utf-8 # coding: utf-8
from __future__ import unicode_literals from __future__ import unicode_literals
from ...attrs import intify_attrs, ORTH, NORM, LEMMA, IS_ALPHA
from ...lang.lex_attrs import is_punct, is_ascii, is_currency, like_url, word_shape
import pytest import pytest
from spacy.attrs import intify_attrs, ORTH, NORM, LEMMA, IS_ALPHA
from spacy.lang.lex_attrs import is_punct, is_ascii, is_currency, like_url, word_shape
@pytest.mark.parametrize('text', ["dog"]) @pytest.mark.parametrize('text', ["dog"])


@ -3,11 +3,9 @@ from __future__ import unicode_literals
import pytest import pytest
TOKENIZER_TESTS = [
("คุณรักผมไหม", ['คุณ', 'รัก', 'ผม', 'ไหม'])
]
@pytest.mark.parametrize('text,expected_tokens', TOKENIZER_TESTS) @pytest.mark.parametrize('text,expected_tokens', [
def test_thai_tokenizer(th_tokenizer, text, expected_tokens): ("คุณรักผมไหม", ['คุณ', 'รัก', 'ผม', 'ไหม'])])
def test_th_tokenizer(th_tokenizer, text, expected_tokens):
tokens = [token.text for token in th_tokenizer(text)] tokens = [token.text for token in th_tokenizer(text)]
assert tokens == expected_tokens assert tokens == expected_tokens


@ -3,13 +3,15 @@ from __future__ import unicode_literals
import pytest import pytest
@pytest.mark.parametrize('string,lemma', [('evlerimizdeki', 'ev'),
@pytest.mark.parametrize('string,lemma', [
('evlerimizdeki', 'ev'),
('işlerimizi', ''), ('işlerimizi', ''),
('biran', 'biran'), ('biran', 'biran'),
('bitirmeliyiz', 'bitir'), ('bitirmeliyiz', 'bitir'),
('isteklerimizi', 'istek'), ('isteklerimizi', 'istek'),
('karşılaştırmamızın', 'karşılaştır'), ('karşılaştırmamızın', 'karşılaştır'),
('çoğulculuktan', 'çoğulcu')]) ('çoğulculuktan', 'çoğulcu')])
def test_lemmatizer_lookup_assigns(tr_tokenizer, string, lemma): def test_tr_lemmatizer_lookup_assigns(tr_tokenizer, string, lemma):
tokens = tr_tokenizer(string) tokens = tr_tokenizer(string)
assert tokens[0].lemma_ == lemma assert tokens[0].lemma_ == lemma


@ -3,6 +3,7 @@ from __future__ import unicode_literals
import pytest import pytest
INFIX_HYPHEN_TESTS = [ INFIX_HYPHEN_TESTS = [
("Явым-төшем күләме.", "Явым-төшем күләме .".split()), ("Явым-төшем күләме.", "Явым-төшем күләме .".split()),
("Хатын-кыз киеме.", "Хатын-кыз киеме .".split()) ("Хатын-кыз киеме.", "Хатын-кыз киеме .".split())
@ -64,12 +65,12 @@ NORM_TESTCASES = [
@pytest.mark.parametrize("text,expected_tokens", TESTCASES) @pytest.mark.parametrize("text,expected_tokens", TESTCASES)
def test_tokenizer_handles_testcases(tt_tokenizer, text, expected_tokens): def test_tt_tokenizer_handles_testcases(tt_tokenizer, text, expected_tokens):
tokens = [token.text for token in tt_tokenizer(text) if not token.is_space] tokens = [token.text for token in tt_tokenizer(text) if not token.is_space]
assert expected_tokens == tokens assert expected_tokens == tokens
@pytest.mark.parametrize('text,norms', NORM_TESTCASES) @pytest.mark.parametrize('text,norms', NORM_TESTCASES)
def test_tokenizer_handles_norm_exceptions(tt_tokenizer, text, norms): def test_tt_tokenizer_handles_norm_exceptions(tt_tokenizer, text, norms):
tokens = tt_tokenizer(text) tokens = tt_tokenizer(text)
assert [token.norm_ for token in tokens] == norms assert [token.norm_ for token in tokens] == norms


@ -1,19 +1,14 @@
# coding: utf-8 # coding: utf-8
"""Test that longer and mixed texts are tokenized correctly."""
from __future__ import unicode_literals from __future__ import unicode_literals
import pytest import pytest
def test_tokenizer_handles_long_text(ur_tokenizer): def test_ur_tokenizer_handles_long_text(ur_tokenizer):
text = """اصل میں رسوا ہونے کی ہمیں text = """اصل میں رسوا ہونے کی ہمیں
کچھ عادت سی ہو گئی ہے اس لئے جگ ہنسائی کا ذکر نہیں کرتا،ہوا کچھ یوں کہ عرصہ چھ سال بعد ہمیں بھی خیال آیا کچھ عادت سی ہو گئی ہے اس لئے جگ ہنسائی کا ذکر نہیں کرتا،ہوا کچھ یوں کہ عرصہ چھ سال بعد ہمیں بھی خیال آیا
کہ ایک عدد ٹیلی ویژن ہی کیوں نہ خرید لیں ، سوچا ورلڈ کپ ہی دیکھیں گے۔اپنے پاکستان کے کھلاڑیوں کو دیکھ کر کہ ایک عدد ٹیلی ویژن ہی کیوں نہ خرید لیں ، سوچا ورلڈ کپ ہی دیکھیں گے۔اپنے پاکستان کے کھلاڑیوں کو دیکھ کر
ورلڈ کپ دیکھنے کا حوصلہ ہی نہ رہا تو اب یوں ہی ادھر اُدھر کے چینل گھمانے لگ پڑتے ہیں۔""" ورلڈ کپ دیکھنے کا حوصلہ ہی نہ رہا تو اب یوں ہی ادھر اُدھر کے چینل گھمانے لگ پڑتے ہیں۔"""
tokens = ur_tokenizer(text) tokens = ur_tokenizer(text)
assert len(tokens) == 77 assert len(tokens) == 77
@ -21,6 +16,6 @@ def test_tokenizer_handles_long_text(ur_tokenizer):
@pytest.mark.parametrize('text,length', [ @pytest.mark.parametrize('text,length', [
("تحریر باسط حبیب", 3), ("تحریر باسط حبیب", 3),
("میرا پاکستان", 2)]) ("میرا پاکستان", 2)])
def test_tokenizer_handles_cnts(ur_tokenizer, text, length): def test_ur_tokenizer_handles_cnts(ur_tokenizer, text, length):
tokens = ur_tokenizer(text) tokens = ur_tokenizer(text)
assert len(tokens) == length assert len(tokens) == length


@ -1,19 +1,16 @@
# coding: utf-8 # coding: utf-8
from __future__ import unicode_literals from __future__ import unicode_literals
from ..matcher import Matcher, PhraseMatcher
from .util import get_doc
import pytest import pytest
from spacy.matcher import Matcher
from spacy.tokens import Doc
@pytest.fixture @pytest.fixture
def matcher(en_vocab): def matcher(en_vocab):
rules = { rules = {'JS': [[{'ORTH': 'JavaScript'}]],
'JS': [[{'ORTH': 'JavaScript'}]],
'GoogleNow': [[{'ORTH': 'Google'}, {'ORTH': 'Now'}]], 'GoogleNow': [[{'ORTH': 'Google'}, {'ORTH': 'Now'}]],
'Java': [[{'LOWER': 'java'}]] 'Java': [[{'LOWER': 'java'}]]}
}
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
for key, patterns in rules.items(): for key, patterns in rules.items():
matcher.add(key, None, *patterns) matcher.add(key, None, *patterns)
@ -36,7 +33,7 @@ def test_matcher_from_api_docs(en_vocab):
def test_matcher_from_usage_docs(en_vocab): def test_matcher_from_usage_docs(en_vocab):
text = "Wow 😀 This is really cool! 😂 😂" text = "Wow 😀 This is really cool! 😂 😂"
doc = get_doc(en_vocab, words=text.split(' ')) doc = Doc(en_vocab, words=text.split(' '))
pos_emoji = ['😀', '😃', '😂', '🤣', '😊', '😍'] pos_emoji = ['😀', '😃', '😂', '🤣', '😊', '😍']
pos_patterns = [[{'ORTH': emoji}] for emoji in pos_emoji] pos_patterns = [[{'ORTH': emoji}] for emoji in pos_emoji]
@ -55,68 +52,46 @@ def test_matcher_from_usage_docs(en_vocab):
assert doc[1].norm_ == 'happy emoji' assert doc[1].norm_ == 'happy emoji'
@pytest.mark.parametrize('words', [["Some", "words"]]) def test_matcher_len_contains(matcher):
def test_matcher_init(en_vocab, words): assert len(matcher) == 3
matcher = Matcher(en_vocab)
doc = get_doc(en_vocab, words)
assert len(matcher) == 0
assert matcher(doc) == []
def test_matcher_contains(matcher):
matcher.add('TEST', None, [{'ORTH': 'test'}]) matcher.add('TEST', None, [{'ORTH': 'test'}])
assert 'TEST' in matcher assert 'TEST' in matcher
assert 'TEST2' not in matcher assert 'TEST2' not in matcher
def test_matcher_no_match(matcher): def test_matcher_no_match(matcher):
words = ["I", "like", "cheese", "."] doc = Doc(matcher.vocab, words=["I", "like", "cheese", "."])
doc = get_doc(matcher.vocab, words)
assert matcher(doc) == [] assert matcher(doc) == []
def test_matcher_compile(en_vocab):
rules = {
'JS': [[{'ORTH': 'JavaScript'}]],
'GoogleNow': [[{'ORTH': 'Google'}, {'ORTH': 'Now'}]],
'Java': [[{'LOWER': 'java'}]]
}
matcher = Matcher(en_vocab)
for key, patterns in rules.items():
matcher.add(key, None, *patterns)
assert len(matcher) == 3
def test_matcher_match_start(matcher): def test_matcher_match_start(matcher):
words = ["JavaScript", "is", "good"] doc = Doc(matcher.vocab, words=["JavaScript", "is", "good"])
doc = get_doc(matcher.vocab, words)
assert matcher(doc) == [(matcher.vocab.strings['JS'], 0, 1)] assert matcher(doc) == [(matcher.vocab.strings['JS'], 0, 1)]
def test_matcher_match_end(matcher): def test_matcher_match_end(matcher):
words = ["I", "like", "java"] words = ["I", "like", "java"]
doc = get_doc(matcher.vocab, words) doc = Doc(matcher.vocab, words=words)
assert matcher(doc) == [(doc.vocab.strings['Java'], 2, 3)] assert matcher(doc) == [(doc.vocab.strings['Java'], 2, 3)]
def test_matcher_match_middle(matcher): def test_matcher_match_middle(matcher):
words = ["I", "like", "Google", "Now", "best"] words = ["I", "like", "Google", "Now", "best"]
doc = get_doc(matcher.vocab, words) doc = Doc(matcher.vocab, words=words)
assert matcher(doc) == [(doc.vocab.strings['GoogleNow'], 2, 4)] assert matcher(doc) == [(doc.vocab.strings['GoogleNow'], 2, 4)]
def test_matcher_match_multi(matcher): def test_matcher_match_multi(matcher):
words = ["I", "like", "Google", "Now", "and", "java", "best"] words = ["I", "like", "Google", "Now", "and", "java", "best"]
doc = get_doc(matcher.vocab, words) doc = Doc(matcher.vocab, words=words)
assert matcher(doc) == [(doc.vocab.strings['GoogleNow'], 2, 4), assert matcher(doc) == [(doc.vocab.strings['GoogleNow'], 2, 4),
(doc.vocab.strings['Java'], 5, 6)] (doc.vocab.strings['Java'], 5, 6)]
def test_matcher_empty_dict(en_vocab): def test_matcher_empty_dict(en_vocab):
'''Test matcher allows empty token specs, meaning match on any token.''' """Test matcher allows empty token specs, meaning match on any token."""
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
abc = ["a", "b", "c"] doc = Doc(matcher.vocab, words=["a", "b", "c"])
doc = get_doc(matcher.vocab, abc)
matcher.add('A.C', None, [{'ORTH': 'a'}, {}, {'ORTH': 'c'}]) matcher.add('A.C', None, [{'ORTH': 'a'}, {}, {'ORTH': 'c'}])
matches = matcher(doc) matches = matcher(doc)
assert len(matches) == 1 assert len(matches) == 1
@ -129,8 +104,7 @@ def test_matcher_empty_dict(en_vocab):
def test_matcher_operator_shadow(en_vocab): def test_matcher_operator_shadow(en_vocab):
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
abc = ["a", "b", "c"] doc = Doc(matcher.vocab, words=["a", "b", "c"])
doc = get_doc(matcher.vocab, abc)
pattern = [{'ORTH': 'a'}, {"IS_ALPHA": True, "OP": "+"}, {'ORTH': 'c'}] pattern = [{'ORTH': 'a'}, {"IS_ALPHA": True, "OP": "+"}, {'ORTH': 'c'}]
matcher.add('A.C', None, pattern) matcher.add('A.C', None, pattern)
matches = matcher(doc) matches = matcher(doc)
@ -138,32 +112,6 @@ def test_matcher_operator_shadow(en_vocab):
assert matches[0][1:] == (0, 3) assert matches[0][1:] == (0, 3)
def test_matcher_phrase_matcher(en_vocab):
words = ["Google", "Now"]
doc = get_doc(en_vocab, words)
matcher = PhraseMatcher(en_vocab)
matcher.add('COMPANY', None, doc)
words = ["I", "like", "Google", "Now", "best"]
doc = get_doc(en_vocab, words)
assert len(matcher(doc)) == 1
def test_phrase_matcher_length(en_vocab):
matcher = PhraseMatcher(en_vocab)
assert len(matcher) == 0
matcher.add('TEST', None, get_doc(en_vocab, ['test']))
assert len(matcher) == 1
matcher.add('TEST2', None, get_doc(en_vocab, ['test2']))
assert len(matcher) == 2
def test_phrase_matcher_contains(en_vocab):
matcher = PhraseMatcher(en_vocab)
matcher.add('TEST', None, get_doc(en_vocab, ['test']))
assert 'TEST' in matcher
assert 'TEST2' not in matcher
def test_matcher_match_zero(matcher): def test_matcher_match_zero(matcher):
words1 = 'He said , " some words " ...'.split() words1 = 'He said , " some words " ...'.split()
words2 = 'He said , " some three words " ...'.split() words2 = 'He said , " some three words " ...'.split()
@ -176,12 +124,10 @@ def test_matcher_match_zero(matcher):
{'IS_PUNCT': True}, {'IS_PUNCT': True},
{'IS_PUNCT': True}, {'IS_PUNCT': True},
{'ORTH': '"'}] {'ORTH': '"'}]
matcher.add('Quote', None, pattern1) matcher.add('Quote', None, pattern1)
doc = get_doc(matcher.vocab, words1) doc = Doc(matcher.vocab, words=words1)
assert len(matcher(doc)) == 1 assert len(matcher(doc)) == 1
doc = Doc(matcher.vocab, words=words2)
doc = get_doc(matcher.vocab, words2)
assert len(matcher(doc)) == 0 assert len(matcher(doc)) == 0
matcher.add('Quote', None, pattern2) matcher.add('Quote', None, pattern2)
assert len(matcher(doc)) == 0 assert len(matcher(doc)) == 0
@ -194,14 +140,14 @@ def test_matcher_match_zero_plus(matcher):
{'ORTH': '"'}] {'ORTH': '"'}]
matcher = Matcher(matcher.vocab) matcher = Matcher(matcher.vocab)
matcher.add('Quote', None, pattern) matcher.add('Quote', None, pattern)
doc = get_doc(matcher.vocab, words) doc = Doc(matcher.vocab, words=words)
assert len(matcher(doc)) == 1 assert len(matcher(doc)) == 1
def test_matcher_match_one_plus(matcher): def test_matcher_match_one_plus(matcher):
control = Matcher(matcher.vocab) control = Matcher(matcher.vocab)
control.add('BasicPhilippe', None, [{'ORTH': 'Philippe'}]) control.add('BasicPhilippe', None, [{'ORTH': 'Philippe'}])
doc = get_doc(control.vocab, ['Philippe', 'Philippe']) doc = Doc(control.vocab, words=['Philippe', 'Philippe'])
m = control(doc) m = control(doc)
assert len(m) == 2 assert len(m) == 2
matcher.add('KleenePhilippe', None, [{'ORTH': 'Philippe', 'OP': '1'}, matcher.add('KleenePhilippe', None, [{'ORTH': 'Philippe', 'OP': '1'},
@ -210,61 +156,11 @@ def test_matcher_match_one_plus(matcher):
assert len(m) == 1 assert len(m) == 1
def test_operator_combos(matcher):
cases = [
('aaab', 'a a a b', True),
('aaab', 'a+ b', True),
('aaab', 'a+ a+ b', True),
('aaab', 'a+ a+ a b', True),
('aaab', 'a+ a+ a+ b', True),
('aaab', 'a+ a a b', True),
('aaab', 'a+ a a', True),
('aaab', 'a+', True),
('aaa', 'a+ b', False),
('aaa', 'a+ a+ b', False),
('aaa', 'a+ a+ a+ b', False),
('aaa', 'a+ a b', False),
('aaa', 'a+ a a b', False),
('aaab', 'a+ a a', True),
('aaab', 'a+', True),
('aaab', 'a+ a b', True)
]
for string, pattern_str, result in cases:
matcher = Matcher(matcher.vocab)
doc = get_doc(matcher.vocab, words=list(string))
pattern = []
for part in pattern_str.split():
if part.endswith('+'):
pattern.append({'ORTH': part[0], 'op': '+'})
else:
pattern.append({'ORTH': part})
matcher.add('PATTERN', None, pattern)
matches = matcher(doc)
if result:
assert matches, (string, pattern_str)
else:
assert not matches, (string, pattern_str)
def test_matcher_end_zero_plus(matcher):
"""Test matcher works when patterns end with * operator. (issue 1450)"""
matcher = Matcher(matcher.vocab)
pattern = [{'ORTH': "a"}, {'ORTH': "b", 'OP': "*"}]
matcher.add("TSTEND", None, pattern)
nlp = lambda string: get_doc(matcher.vocab, string.split())
assert len(matcher(nlp('a'))) == 1
assert len(matcher(nlp('a b'))) == 2
assert len(matcher(nlp('a c'))) == 1
assert len(matcher(nlp('a b c'))) == 2
assert len(matcher(nlp('a b b c'))) == 3
assert len(matcher(nlp('a b b'))) == 3
def test_matcher_any_token_operator(en_vocab): def test_matcher_any_token_operator(en_vocab):
"""Test that patterns with "any token" {} work with operators.""" """Test that patterns with "any token" {} work with operators."""
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
matcher.add('TEST', None, [{'ORTH': 'test'}, {'OP': '*'}]) matcher.add('TEST', None, [{'ORTH': 'test'}, {'OP': '*'}])
doc = get_doc(en_vocab, ['test', 'hello', 'world']) doc = Doc(en_vocab, words=['test', 'hello', 'world'])
matches = [doc[start:end].text for _, start, end in matcher(doc)] matches = [doc[start:end].text for _, start, end in matcher(doc)]
assert len(matches) == 3 assert len(matches) == 3
assert matches[0] == 'test' assert matches[0] == 'test'


@ -0,0 +1,116 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
import re
from spacy.matcher import Matcher
from spacy.tokens import Doc
pattern1 = [{'ORTH':'A', 'OP':'1'}, {'ORTH':'A', 'OP':'*'}]
pattern2 = [{'ORTH':'A', 'OP':'*'}, {'ORTH':'A', 'OP':'1'}]
pattern3 = [{'ORTH':'A', 'OP':'1'}, {'ORTH':'A', 'OP':'1'}]
pattern4 = [{'ORTH':'B', 'OP':'1'}, {'ORTH':'A', 'OP':'*'}, {'ORTH':'B', 'OP':'1'}]
pattern5 = [{'ORTH':'B', 'OP':'*'}, {'ORTH':'A', 'OP':'*'}, {'ORTH':'B', 'OP':'1'}]
re_pattern1 = 'AA*'
re_pattern2 = 'A*A'
re_pattern3 = 'AA'
re_pattern4 = 'BA*B'
re_pattern5 = 'B*A*B'
@pytest.fixture
def text():
return "(ABBAAAAAB)."
@pytest.fixture
def doc(en_tokenizer, text):
doc = en_tokenizer(' '.join(text))
return doc
@pytest.mark.xfail
@pytest.mark.parametrize('pattern,re_pattern', [
(pattern1, re_pattern1),
(pattern2, re_pattern2),
(pattern3, re_pattern3),
(pattern4, re_pattern4),
(pattern5, re_pattern5)])
def test_greedy_matching(doc, text, pattern, re_pattern):
"""Test that the greedy matching behavior of the * op is consistant with
other re implementations."""
matcher = Matcher(doc.vocab)
matcher.add(re_pattern, None, pattern)
matches = matcher(doc)
re_matches = [m.span() for m in re.finditer(re_pattern, text)]
for match, re_match in zip(matches, re_matches):
assert match[1:] == re_match
@pytest.mark.xfail
@pytest.mark.parametrize('pattern,re_pattern', [
(pattern1, re_pattern1),
(pattern2, re_pattern2),
(pattern3, re_pattern3),
(pattern4, re_pattern4),
(pattern5, re_pattern5)])
def test_match_consuming(doc, text, pattern, re_pattern):
"""Test that matcher.__call__ consumes tokens on a match similar to
re.findall."""
matcher = Matcher(doc.vocab)
matcher.add(re_pattern, None, pattern)
matches = matcher(doc)
re_matches = [m.span() for m in re.finditer(re_pattern, text)]
assert len(matches) == len(re_matches)
def test_operator_combos(en_vocab):
cases = [
('aaab', 'a a a b', True),
('aaab', 'a+ b', True),
('aaab', 'a+ a+ b', True),
('aaab', 'a+ a+ a b', True),
('aaab', 'a+ a+ a+ b', True),
('aaab', 'a+ a a b', True),
('aaab', 'a+ a a', True),
('aaab', 'a+', True),
('aaa', 'a+ b', False),
('aaa', 'a+ a+ b', False),
('aaa', 'a+ a+ a+ b', False),
('aaa', 'a+ a b', False),
('aaa', 'a+ a a b', False),
('aaab', 'a+ a a', True),
('aaab', 'a+', True),
('aaab', 'a+ a b', True)
]
for string, pattern_str, result in cases:
matcher = Matcher(en_vocab)
doc = Doc(matcher.vocab, words=list(string))
pattern = []
for part in pattern_str.split():
if part.endswith('+'):
pattern.append({'ORTH': part[0], 'OP': '+'})
else:
pattern.append({'ORTH': part})
matcher.add('PATTERN', None, pattern)
matches = matcher(doc)
if result:
assert matches, (string, pattern_str)
else:
assert not matches, (string, pattern_str)
def test_matcher_end_zero_plus(en_vocab):
"""Test matcher works when patterns end with * operator. (issue 1450)"""
matcher = Matcher(en_vocab)
pattern = [{'ORTH': "a"}, {'ORTH': "b", 'OP': "*"}]
matcher.add('TSTEND', None, pattern)
nlp = lambda string: Doc(matcher.vocab, words=string.split())
assert len(matcher(nlp('a'))) == 1
assert len(matcher(nlp('a b'))) == 2
assert len(matcher(nlp('a c'))) == 1
assert len(matcher(nlp('a b c'))) == 2
assert len(matcher(nlp('a b b c'))) == 3
assert len(matcher(nlp('a b b'))) == 3


@ -0,0 +1,30 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc
def test_matcher_phrase_matcher(en_vocab):
doc = Doc(en_vocab, words=["Google", "Now"])
matcher = PhraseMatcher(en_vocab)
matcher.add('COMPANY', None, doc)
doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"])
assert len(matcher(doc)) == 1
def test_phrase_matcher_length(en_vocab):
matcher = PhraseMatcher(en_vocab)
assert len(matcher) == 0
matcher.add('TEST', None, Doc(en_vocab, words=['test']))
assert len(matcher) == 1
matcher.add('TEST2', None, Doc(en_vocab, words=['test2']))
assert len(matcher) == 2
def test_phrase_matcher_contains(en_vocab):
matcher = PhraseMatcher(en_vocab)
matcher.add('TEST', None, Doc(en_vocab, words=['test']))
assert 'TEST' in matcher
assert 'TEST2' not in matcher

View File

@ -1,15 +1,16 @@
-'''Test the ability to add a label to a (potentially trained) parsing model.'''
+# coding: utf8
from __future__ import unicode_literals
import pytest
import numpy.random
from thinc.neural.optimizers import Adam
from thinc.neural.ops import NumpyOps
-from ...attrs import NORM
-from ...gold import GoldParse
-from ...vocab import Vocab
-from ...tokens import Doc
-from ...pipeline import DependencyParser
+from spacy.attrs import NORM
+from spacy.gold import GoldParse
+from spacy.vocab import Vocab
+from spacy.tokens import Doc
+from spacy.pipeline import DependencyParser
numpy.random.seed(0)
@ -37,9 +38,11 @@ def parser(vocab):
    parser.update([doc], [gold], sgd=sgd, losses=losses)
    return parser

def test_init_parser(parser):
    pass

# TODO: This is flakey, because it depends on what the parser first learns.
@pytest.mark.xfail
def test_add_label(parser):
@ -69,4 +72,3 @@ def test_add_label(parser):
    doc = parser(doc)
    assert doc[0].dep_ == 'right'
    assert doc[2].dep_ == 'left'

View File

@ -1,13 +1,14 @@
+# coding: utf8
from __future__ import unicode_literals
-import pytest
-from ...vocab import Vocab
-from ...pipeline import DependencyParser
-from ...tokens import Doc
-from ...gold import GoldParse
-from ...syntax.nonproj import projectivize
-from ...syntax.stateclass import StateClass
-from ...syntax.arc_eager import ArcEager
+import pytest
+from spacy.vocab import Vocab
+from spacy.pipeline import DependencyParser
+from spacy.tokens import Doc
+from spacy.gold import GoldParse
+from spacy.syntax.nonproj import projectivize
+from spacy.syntax.stateclass import StateClass
+from spacy.syntax.arc_eager import ArcEager

def get_sequence_costs(M, words, heads, deps, transitions):

View File

@ -1,23 +0,0 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
from ...language import Language
from ...pipeline import DependencyParser
@pytest.mark.models('en')
def test_beam_parse_en(EN):
doc = EN(u'Australia is a country', disable=['ner'])
ents = EN.entity(doc, beam_width=2)
print(ents)
def test_beam_parse():
nlp = Language()
nlp.add_pipe(DependencyParser(nlp.vocab), name='parser')
nlp.parser.add_label('nsubj')
nlp.parser.begin_training([], token_vector_width=8, hidden_width=8)
doc = nlp.make_doc(u'Australia is a country')
nlp.parser(doc, beam_width=2)

View File

@ -1,11 +1,12 @@
+# coding: utf-8
from __future__ import unicode_literals
import pytest
-from ...vocab import Vocab
-from ...syntax.ner import BiluoPushDown
-from ...gold import GoldParse
-from ...tokens import Doc
+from spacy.pipeline import EntityRecognizer
+from spacy.vocab import Vocab
+from spacy.syntax.ner import BiluoPushDown
+from spacy.gold import GoldParse
+from spacy.tokens import Doc

@pytest.fixture
@ -71,3 +72,16 @@ def test_get_oracle_moves_negative_O(tsys, vocab):
    tsys.preprocess_gold(gold)
    act_classes = tsys.get_oracle_sequence(doc, gold)
    names = [tsys.get_class_name(act) for act in act_classes]

+def test_doc_add_entities_set_ents_iob(en_vocab):
+    doc = Doc(en_vocab, words=["This", "is", "a", "lion"])
+    ner = EntityRecognizer(en_vocab)
+    ner.begin_training([])
+    ner(doc)
+    assert len(list(doc.ents)) == 0
+    assert [w.ent_iob_ for w in doc] == (['O'] * len(doc))
+    doc.ents = [(doc.vocab.strings['ANIMAL'], 3, 4)]
+    assert [w.ent_iob_ for w in doc] == ['', '', '', 'B']
+    doc.ents = [(doc.vocab.strings['WORD'], 0, 2)]
+    assert [w.ent_iob_ for w in doc] == ['B', 'I', '', '']

View File

@ -1,16 +1,13 @@
# coding: utf8
from __future__ import unicode_literals
-from thinc.neural import Model
-import pytest
-import numpy
-from ..._ml import chain, Tok2Vec, doc2feats
-from ...vocab import Vocab
-from ...pipeline import Tensorizer
-from ...syntax.arc_eager import ArcEager
-from ...syntax.nn_parser import Parser
-from ...tokens.doc import Doc
-from ...gold import GoldParse
+import pytest
+from spacy._ml import Tok2Vec
+from spacy.vocab import Vocab
+from spacy.syntax.arc_eager import ArcEager
+from spacy.syntax.nn_parser import Parser
+from spacy.tokens.doc import Doc
+from spacy.gold import GoldParse

@pytest.fixture
@ -37,10 +34,12 @@ def parser(vocab, arc_eager):
def model(arc_eager, tok2vec):
    return Parser.Model(arc_eager.n_moves, token_vector_width=tok2vec.nO)[0]

@pytest.fixture
def doc(vocab):
    return Doc(vocab, words=['a', 'b', 'c'])

@pytest.fixture
def gold(doc):
    return GoldParse(doc, heads=[1, 1, 1], deps=['L', 'ROOT', 'R'])
@ -80,5 +79,3 @@ def test_update_doc_beam(parser, model, doc, gold):
    def optimize(weights, gradient, key=None):
        weights -= 0.001 * gradient
    parser.update_beam([doc], [gold], sgd=optimize)

View File

@ -1,20 +1,23 @@
+# coding: utf8
from __future__ import unicode_literals

import pytest
import numpy
-from thinc.api import layerize
-from ...vocab import Vocab
-from ...syntax.arc_eager import ArcEager
-from ...tokens import Doc
-from ...gold import GoldParse
-from ...syntax._beam_utils import ParserBeam, update_beam
-from ...syntax.stateclass import StateClass
+from spacy.vocab import Vocab
+from spacy.language import Language
+from spacy.pipeline import DependencyParser
+from spacy.syntax.arc_eager import ArcEager
+from spacy.tokens import Doc
+from spacy.syntax._beam_utils import ParserBeam
+from spacy.syntax.stateclass import StateClass
+from spacy.gold import GoldParse

@pytest.fixture
def vocab():
    return Vocab()

@pytest.fixture
def moves(vocab):
    aeager = ArcEager(vocab.strings, {})
@ -65,6 +68,7 @@ def vector_size():
def beam(moves, states, golds, beam_width):
    return ParserBeam(moves, states, golds, width=beam_width, density=0.0)

@pytest.fixture
def scores(moves, batch_size, beam_width):
    return [
@ -85,3 +89,12 @@ def test_beam_advance(beam, scores):
def test_beam_advance_too_few_scores(beam, scores):
    with pytest.raises(IndexError):
        beam.advance(scores[:-1])

+def test_beam_parse():
+    nlp = Language()
+    nlp.add_pipe(DependencyParser(nlp.vocab), name='parser')
+    nlp.parser.add_label('nsubj')
+    nlp.parser.begin_training([], token_vector_width=8, hidden_width=8)
+    doc = nlp.make_doc('Australia is a country')
+    nlp.parser(doc, beam_width=2)

View File

@ -1,35 +1,39 @@
# coding: utf-8
from __future__ import unicode_literals

-from ...syntax.nonproj import ancestors, contains_cycle, is_nonproj_arc
-from ...syntax.nonproj import is_nonproj_tree
-from ...syntax import nonproj
-from ...attrs import DEP, HEAD
-from ..util import get_doc
import pytest
+from spacy.syntax.nonproj import ancestors, contains_cycle, is_nonproj_arc
+from spacy.syntax.nonproj import is_nonproj_tree
+from spacy.syntax import nonproj
+from ..util import get_doc

@pytest.fixture
def tree():
    return [1, 2, 2, 4, 5, 2, 2]

@pytest.fixture
def cyclic_tree():
    return [1, 2, 2, 4, 5, 3, 2]

@pytest.fixture
def partial_tree():
    return [1, 2, 2, 4, 5, None, 7, 4, 2]

@pytest.fixture
def nonproj_tree():
    return [1, 2, 2, 4, 5, 2, 7, 4, 2]

@pytest.fixture
def proj_tree():
    return [1, 2, 2, 4, 5, 2, 7, 5, 2]

@pytest.fixture
def multirooted_tree():
    return [3, 2, 0, 3, 3, 7, 7, 3, 7, 10, 7, 10, 11, 12, 18, 16, 18, 17, 12, 3]
@ -75,14 +79,14 @@ def test_parser_pseudoprojectivity(en_tokenizer):
    def deprojectivize(proj_heads, deco_labels):
        tokens = en_tokenizer('whatever ' * len(proj_heads))
        rel_proj_heads = [head-i for i, head in enumerate(proj_heads)]
-        doc = get_doc(tokens.vocab, [t.text for t in tokens], deps=deco_labels, heads=rel_proj_heads)
+        doc = get_doc(tokens.vocab, words=[t.text for t in tokens],
+                      deps=deco_labels, heads=rel_proj_heads)
        nonproj.deprojectivize(doc)
        return [t.head.i for t in doc], [token.dep_ for token in doc]
    tree = [1, 2, 2]
    nonproj_tree = [1, 2, 2, 4, 5, 2, 7, 4, 2]
    nonproj_tree2 = [9, 1, 3, 1, 5, 6, 9, 8, 6, 1, 6, 12, 13, 10, 1]
    labels = ['det', 'nsubj', 'root', 'det', 'dobj', 'aux', 'nsubj', 'acl', 'punct']
    labels2 = ['advmod', 'root', 'det', 'nsubj', 'advmod', 'det', 'dobj', 'det', 'nmod', 'aux', 'nmod', 'advmod', 'det', 'amod', 'punct']

View File

@ -1,17 +1,17 @@
# coding: utf-8
from __future__ import unicode_literals

-from ..util import get_doc, apply_transition_sequence
import pytest
+from ..util import get_doc, apply_transition_sequence

def test_parser_root(en_tokenizer):
    text = "i don't have other assistance"
    heads = [3, 2, 1, 0, 1, -2]
    deps = ['nsubj', 'aux', 'neg', 'ROOT', 'amod', 'dobj']
    tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
    for t in doc:
        assert t.dep != 0, t.text
@ -20,7 +20,7 @@ def test_parser_root(en_tokenizer):
@pytest.mark.parametrize('text', ["Hello"])
def test_parser_parse_one_word_sentence(en_tokenizer, en_parser, text):
    tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=[0], deps=['ROOT'])
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=[0], deps=['ROOT'])
    assert len(doc) == 1
    with en_parser.step_through(doc) as _:
@ -33,10 +33,8 @@ def test_parser_initial(en_tokenizer, en_parser):
    text = "I ate the pizza with anchovies."
    heads = [1, 0, 1, -2, -3, -1, -5]
    transition = ['L-nsubj', 'S', 'L-det']
    tokens = en_tokenizer(text)
    apply_transition_sequence(en_parser, tokens, transition)
    assert tokens[0].head.i == 1
    assert tokens[1].head.i == 1
    assert tokens[2].head.i == 3
@ -47,8 +45,7 @@ def test_parser_parse_subtrees(en_tokenizer, en_parser):
    text = "The four wheels on the bus turned quickly"
    heads = [2, 1, 4, -1, 1, -2, 0, -1]
    tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
    assert len(list(doc[2].lefts)) == 2
    assert len(list(doc[2].rights)) == 1
    assert len(list(doc[2].children)) == 3
@ -63,11 +60,9 @@ def test_parser_merge_pp(en_tokenizer):
    heads = [1, 4, -1, 1, -2, 0]
    deps = ['det', 'nsubj', 'prep', 'det', 'pobj', 'ROOT']
    tags = ['DT', 'NN', 'IN', 'DT', 'NN', 'VBZ']
    tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], deps=deps, heads=heads, tags=tags)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], deps=deps, heads=heads, tags=tags)
    nps = [(np[0].idx, np[-1].idx + len(np[-1]), np.lemma_) for np in doc.noun_chunks]
    for start, end, lemma in nps:
        doc.merge(start, end, label='NP', lemma=lemma)
    assert doc[0].text == 'A phrase'

View File

@ -1,14 +1,14 @@
# coding: utf-8
from __future__ import unicode_literals

-from ..util import get_doc
import pytest
+from ..util import get_doc

@pytest.fixture
def text():
-    return u"""
+    return """
It was a bright cold day in April, and the clocks were striking thirteen.
Winston Smith, his chin nuzzled into his breast in an effort to escape the
vile wind, slipped quickly through the glass doors of Victory Mansions,
@ -54,7 +54,7 @@ def heads():
def test_parser_parse_navigate_consistency(en_tokenizer, text, heads):
    tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
    for head in doc:
        for child in head.lefts:
            assert child.head == head
@ -64,7 +64,7 @@ def test_parser_parse_navigate_consistency(en_tokenizer, text, heads):
def test_parser_parse_navigate_child_consistency(en_tokenizer, text, heads):
    tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
    lefts = {}
    rights = {}
@ -97,7 +97,7 @@ def test_parser_parse_navigate_child_consistency(en_tokenizer, text, heads):
def test_parser_parse_navigate_edges(en_tokenizer, text, heads):
    tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
    for token in doc:
        subtree = list(token.subtree)
        debug = '\t'.join((token.text, token.left_edge.text, subtree[0].text))

View File

@ -1,19 +1,21 @@
-'''Test that the parser respects preset sentence boundaries.'''
+# coding: utf8
from __future__ import unicode_literals
import pytest
from thinc.neural.optimizers import Adam
from thinc.neural.ops import NumpyOps
-from ...attrs import NORM
-from ...gold import GoldParse
-from ...vocab import Vocab
-from ...tokens import Doc
-from ...pipeline import DependencyParser
+from spacy.attrs import NORM
+from spacy.gold import GoldParse
+from spacy.vocab import Vocab
+from spacy.tokens import Doc
+from spacy.pipeline import DependencyParser

@pytest.fixture
def vocab():
    return Vocab(lex_attr_getters={NORM: lambda s: s})

@pytest.fixture
def parser(vocab):
    parser = DependencyParser(vocab)
@ -32,6 +34,7 @@ def parser(vocab):
    parser.update([doc], [gold], sgd=sgd, losses=losses)
    return parser

def test_no_sentences(parser):
    doc = Doc(parser.vocab, words=['a', 'b', 'c', 'd'])
    doc = parser(doc)

View File

@ -1,19 +1,18 @@
# coding: utf-8
from __future__ import unicode_literals

-from ...tokens.doc import Doc
-from ...attrs import HEAD
-from ..util import get_doc, apply_transition_sequence
import pytest
+from spacy.tokens.doc import Doc
+from ..util import get_doc, apply_transition_sequence

def test_parser_space_attachment(en_tokenizer):
    text = "This is a test.\nTo ensure spaces are attached well."
    heads = [1, 0, 1, -2, -3, -1, 1, 4, -1, 2, 1, 0, -1, -2]
    tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
    for sent in doc.sents:
        if len(sent) == 1:
            assert not sent[-1].is_space
@ -26,7 +25,7 @@ def test_parser_sentence_space(en_tokenizer):
            'nsubjpass', 'aux', 'auxpass', 'ROOT', 'nsubj', 'aux', 'ccomp',
            'poss', 'nsubj', 'ccomp', 'punct']
    tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
    assert len(list(doc.sents)) == 2
@ -35,7 +34,7 @@ def test_parser_space_attachment_leading(en_tokenizer, en_parser):
    text = "\t \n This is a sentence ."
    heads = [1, 1, 0, 1, -2, -3]
    tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, text.split(' '), heads=heads)
+    doc = get_doc(tokens.vocab, words=text.split(' '), heads=heads)
    assert doc[0].is_space
    assert doc[1].is_space
    assert doc[2].text == 'This'
@ -52,7 +51,7 @@ def test_parser_space_attachment_intermediate_trailing(en_tokenizer, en_parser):
    heads = [1, 0, -1, 2, -1, -4, -5, -1]
    transition = ['L-nsubj', 'S', 'L-det', 'R-attr', 'D', 'R-punct']
    tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, text.split(' '), heads=heads)
+    doc = get_doc(tokens.vocab, words=text.split(' '), heads=heads)
    assert doc[2].is_space
    assert doc[4].is_space
    assert doc[5].is_space

View File

@ -1,28 +0,0 @@
import pytest
from ...pipeline import DependencyParser
@pytest.fixture
def parser(en_vocab):
parser = DependencyParser(en_vocab)
parser.add_label('nsubj')
parser.model, cfg = parser.Model(parser.moves.n_moves)
parser.cfg.update(cfg)
return parser
@pytest.fixture
def blank_parser(en_vocab):
parser = DependencyParser(en_vocab)
return parser
def test_to_from_bytes(parser, blank_parser):
assert parser.model is not True
assert blank_parser.model is True
assert blank_parser.moves.n_moves != parser.moves.n_moves
bytes_data = parser.to_bytes()
blank_parser.from_bytes(bytes_data)
assert blank_parser.model is not True
assert blank_parser.moves.n_moves == parser.moves.n_moves

View File

@ -2,10 +2,9 @@
from __future__ import unicode_literals

import pytest
-from ...tokens import Span
-from ...language import Language
-from ...pipeline import EntityRuler
+from spacy.tokens import Span
+from spacy.language import Language
+from spacy.pipeline import EntityRuler

@pytest.fixture

View File

@ -2,11 +2,11 @@
from __future__ import unicode_literals

import pytest
+from spacy.language import Language
+from spacy.tokens import Span
from ..util import get_doc
-from ...language import Language
-from ...tokens import Span
-from ... import util

@pytest.fixture
def doc(en_tokenizer):
@ -16,7 +16,7 @@ def doc(en_tokenizer):
    pos = ['PRON', 'VERB', 'PROPN', 'PROPN', 'ADP', 'PROPN', 'PUNCT']
    deps = ['ROOT', 'prep', 'compound', 'pobj', 'prep', 'pobj', 'punct']
    tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads,
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads,
                  tags=tags, pos=pos, deps=deps)
    doc.ents = [Span(doc, 2, 4, doc.vocab.strings['GPE'])]
    doc.is_parsed = True

View File

@ -2,8 +2,7 @@
from __future__ import unicode_literals

import pytest
-from ...language import Language
+from spacy.language import Language

@pytest.fixture

View File

@ -1,7 +1,13 @@
# coding: utf8
from __future__ import unicode_literals

-from ...language import Language
+import pytest
+import random
+import numpy.random
+from spacy.language import Language
+from spacy.pipeline import TextCategorizer
+from spacy.tokens import Doc
+from spacy.gold import GoldParse

def test_simple_train():
@ -13,6 +19,40 @@ def test_simple_train():
    for text, answer in [('aaaa', 1.), ('bbbb', 0), ('aa', 1.),
                         ('bbbbbbbbb', 0.), ('aaaaaa', 1)]:
        nlp.update([text], [{'cats': {'answer': answer}}])
-    doc = nlp(u'aaa')
+    doc = nlp('aaa')
    assert 'answer' in doc.cats
    assert doc.cats['answer'] >= 0.5

+@pytest.mark.skip(reason="Test is flakey when run with others")
+def test_textcat_learns_multilabel():
+    random.seed(5)
+    numpy.random.seed(5)
+    docs = []
+    nlp = Language()
+    letters = ['a', 'b', 'c']
+    for w1 in letters:
+        for w2 in letters:
+            cats = {letter: float(w2==letter) for letter in letters}
+            docs.append((Doc(nlp.vocab, words=['d']*3 + [w1, w2] + ['d']*3), cats))
+    random.shuffle(docs)
+    model = TextCategorizer(nlp.vocab, width=8)
+    for letter in letters:
+        model.add_label(letter)
+    optimizer = model.begin_training()
+    for i in range(30):
+        losses = {}
+        Ys = [GoldParse(doc, cats=cats) for doc, cats in docs]
+        Xs = [doc for doc, cats in docs]
+        model.update(Xs, Ys, sgd=optimizer, losses=losses)
+        random.shuffle(docs)
+    for w1 in letters:
+        for w2 in letters:
+            doc = Doc(nlp.vocab, words=['d']*3 + [w1, w2] + ['d']*3)
+            truth = {letter: w2==letter for letter in letters}
+            model(doc)
+            for cat, score in doc.cats.items():
+                if not truth[cat]:
+                    assert score < 0.5
+                else:
+                    assert score > 0.5

View File

@ -0,0 +1,420 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
import random
from spacy.matcher import Matcher
from spacy.attrs import IS_PUNCT, ORTH, LOWER
from spacy.symbols import POS, VERB, VerbForm_inf
from spacy.vocab import Vocab
from spacy.language import Language
from spacy.lemmatizer import Lemmatizer
from spacy.tokens import Doc
from ..util import get_doc, make_tempdir
@pytest.mark.parametrize('patterns', [
[[{'LOWER': 'celtics'}], [{'LOWER': 'boston'}, {'LOWER': 'celtics'}]],
[[{'LOWER': 'boston'}, {'LOWER': 'celtics'}], [{'LOWER': 'celtics'}]]])
def test_issue118(en_tokenizer, patterns):
"""Test a bug that arose from having overlapping matches"""
text = "how many points did lebron james score against the boston celtics last night"
doc = en_tokenizer(text)
ORG = doc.vocab.strings['ORG']
matcher = Matcher(doc.vocab)
matcher.add("BostonCeltics", None, *patterns)
assert len(list(doc.ents)) == 0
matches = [(ORG, start, end) for _, start, end in matcher(doc)]
assert matches == [(ORG, 9, 11), (ORG, 10, 11)]
doc.ents = matches[:1]
ents = list(doc.ents)
assert len(ents) == 1
assert ents[0].label == ORG
assert ents[0].start == 9
assert ents[0].end == 11
@pytest.mark.parametrize('patterns', [
[[{'LOWER': 'boston'}], [{'LOWER': 'boston'}, {'LOWER': 'celtics'}]],
[[{'LOWER': 'boston'}, {'LOWER': 'celtics'}], [{'LOWER': 'boston'}]]])
def test_issue118_prefix_reorder(en_tokenizer, patterns):
"""Test a bug that arose from having overlapping matches"""
text = "how many points did lebron james score against the boston celtics last night"
doc = en_tokenizer(text)
ORG = doc.vocab.strings['ORG']
matcher = Matcher(doc.vocab)
matcher.add('BostonCeltics', None, *patterns)
assert len(list(doc.ents)) == 0
matches = [(ORG, start, end) for _, start, end in matcher(doc)]
doc.ents += tuple(matches)[1:]
assert matches == [(ORG, 9, 10), (ORG, 9, 11)]
ents = doc.ents
assert len(ents) == 1
assert ents[0].label == ORG
assert ents[0].start == 9
assert ents[0].end == 11
def test_issue242(en_tokenizer):
"""Test overlapping multi-word phrases."""
text = "There are different food safety standards in different countries."
patterns = [[{'LOWER': 'food'}, {'LOWER': 'safety'}],
[{'LOWER': 'safety'}, {'LOWER': 'standards'}]]
doc = en_tokenizer(text)
matcher = Matcher(doc.vocab)
matcher.add('FOOD', None, *patterns)
matches = [(ent_type, start, end) for ent_type, start, end in matcher(doc)]
doc.ents += tuple(matches)
match1, match2 = matches
assert match1[1] == 3
assert match1[2] == 5
assert match2[1] == 4
assert match2[2] == 6
def test_issue309(en_tokenizer):
"""Test Issue #309: SBD fails on empty string"""
tokens = en_tokenizer(" ")
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=[0], deps=['ROOT'])
doc.is_parsed = True
assert len(doc) == 1
sents = list(doc.sents)
assert len(sents) == 1
def test_issue351(en_tokenizer):
doc = en_tokenizer(" This is a cat.")
assert doc[0].idx == 0
assert len(doc[0]) == 3
assert doc[1].idx == 3
def test_issue360(en_tokenizer):
"""Test tokenization of big ellipsis"""
tokens = en_tokenizer('$45...............Asking')
assert len(tokens) > 2
@pytest.mark.parametrize('text1,text2', [("cat", "dog")])
def test_issue361(en_vocab, text1, text2):
"""Test Issue #361: Equality of lexemes"""
assert en_vocab[text1] == en_vocab[text1]
assert en_vocab[text1] != en_vocab[text2]
def test_issue587(en_tokenizer):
"""Test that Matcher doesn't segfault on particular input"""
doc = en_tokenizer('a b; c')
matcher = Matcher(doc.vocab)
matcher.add('TEST1', None, [{ORTH: 'a'}, {ORTH: 'b'}])
matches = matcher(doc)
assert len(matches) == 1
matcher.add('TEST2', None, [{ORTH: 'a'}, {ORTH: 'b'}, {IS_PUNCT: True}, {ORTH: 'c'}])
matches = matcher(doc)
assert len(matches) == 2
matcher.add('TEST3', None, [{ORTH: 'a'}, {ORTH: 'b'}, {IS_PUNCT: True}, {ORTH: 'd'}])
matches = matcher(doc)
assert len(matches) == 2
def test_issue588(en_vocab):
matcher = Matcher(en_vocab)
with pytest.raises(ValueError):
matcher.add('TEST', None, [])
@pytest.mark.xfail
def test_issue589():
vocab = Vocab()
vocab.strings.set_frozen(True)
doc = Doc(vocab, words=['whata'])
def test_issue590(en_vocab):
"""Test overlapping matches"""
doc = Doc(en_vocab, words=['n', '=', '1', ';', 'a', ':', '5', '%'])
matcher = Matcher(en_vocab)
matcher.add('ab', None, [{'IS_ALPHA': True}, {'ORTH': ':'}, {'LIKE_NUM': True}, {'ORTH': '%'}])
matcher.add('ab', None, [{'IS_ALPHA': True}, {'ORTH': '='}, {'LIKE_NUM': True}])
matches = matcher(doc)
assert len(matches) == 2
def test_issue595():
"""Test lemmatization of base forms"""
words = ["Do", "n't", "feed", "the", "dog"]
tag_map = {'VB': {POS: VERB, VerbForm_inf: True}}
rules = {"verb": [["ed", "e"]]}
lemmatizer = Lemmatizer({'verb': {}}, {'verb': {}}, rules)
vocab = Vocab(lemmatizer=lemmatizer, tag_map=tag_map)
doc = Doc(vocab, words=words)
doc[2].tag_ = 'VB'
assert doc[2].text == 'feed'
assert doc[2].lemma_ == 'feed'
def test_issue599(en_vocab):
doc = Doc(en_vocab)
doc.is_tagged = True
doc.is_parsed = True
doc2 = Doc(doc.vocab)
doc2.from_bytes(doc.to_bytes())
assert doc2.is_parsed
def test_issue600():
vocab = Vocab(tag_map={'NN': {'pos': 'NOUN'}})
doc = Doc(vocab, words=["hello"])
doc[0].tag_ = 'NN'
def test_issue615(en_tokenizer):
def merge_phrases(matcher, doc, i, matches):
"""Merge a phrase. We have to be careful here because we'll change the
token indices. To avoid problems, merge all the phrases once we're called
on the last match."""
if i != len(matches)-1:
return None
spans = [(ent_id, ent_id, doc[start : end]) for ent_id, start, end in matches]
for ent_id, label, span in spans:
span.merge(tag='NNP' if label else span.root.tag_, lemma=span.text,
label=label)
doc.ents = doc.ents + ((label, span.start, span.end),)
text = "The golf club is broken"
pattern = [{'ORTH': "golf"}, {'ORTH': "club"}]
label = "Sport_Equipment"
doc = en_tokenizer(text)
matcher = Matcher(doc.vocab)
matcher.add(label, merge_phrases, pattern)
match = matcher(doc)
entities = list(doc.ents)
assert entities != []
assert entities[0].label != 0
@pytest.mark.parametrize('text,number', [("7am", "7"), ("11p.m.", "11")])
def test_issue736(en_tokenizer, text, number):
"""Test that times like "7am" are tokenized correctly and that numbers are
converted to string."""
tokens = en_tokenizer(text)
assert len(tokens) == 2
assert tokens[0].text == number
@pytest.mark.parametrize('text', ["3/4/2012", "01/12/1900"])
def test_issue740(en_tokenizer, text):
"""Test that dates are not split and kept as one token. This behaviour is
currently inconsistent, since dates separated by hyphens are still split.
This will be hard to prevent without causing clashes with numeric ranges."""
tokens = en_tokenizer(text)
assert len(tokens) == 1
def test_issue743():
doc = Doc(Vocab(), ['hello', 'world'])
token = doc[0]
s = set([token])
items = list(s)
assert items[0] is token
@pytest.mark.parametrize('text', ["We were scared", "We Were Scared"])
def test_issue744(en_tokenizer, text):
"""Test that 'were' and 'Were' are excluded from the contractions
generated by the English tokenizer exceptions."""
tokens = en_tokenizer(text)
assert len(tokens) == 3
assert tokens[1].text.lower() == "were"
@pytest.mark.parametrize('text,is_num', [("one", True), ("ten", True),
("teneleven", False)])
def test_issue759(en_tokenizer, text, is_num):
tokens = en_tokenizer(text)
assert tokens[0].like_num == is_num
@pytest.mark.parametrize('text', ["Shell", "shell", "Shed", "shed"])
def test_issue775(en_tokenizer, text):
"""Test that 'Shell' and 'shell' are excluded from the contractions
generated by the English tokenizer exceptions."""
tokens = en_tokenizer(text)
assert len(tokens) == 1
assert tokens[0].text == text
@pytest.mark.parametrize('text', ["This is a string ", "This is a string\u0020"])
def test_issue792(en_tokenizer, text):
"""Test for Issue #792: Trailing whitespace is removed after tokenization."""
doc = en_tokenizer(text)
assert ''.join([token.text_with_ws for token in doc]) == text
@pytest.mark.parametrize('text', ["This is a string", "This is a string\n"])
def test_control_issue792(en_tokenizer, text):
"""Test base case for Issue #792: Non-trailing whitespace"""
doc = en_tokenizer(text)
assert ''.join([token.text_with_ws for token in doc]) == text
@pytest.mark.parametrize('text,tokens', [
('"deserve,"--and', ['"', "deserve", ',"--', "and"]),
("exception;--exclusive", ["exception", ";--", "exclusive"]),
("day.--Is", ["day", ".--", "Is"]),
("refinement:--just", ["refinement", ":--", "just"]),
("memories?--To", ["memories", "?--", "To"]),
("Useful.=--Therefore", ["Useful", ".=--", "Therefore"]),
("=Hope.=--Pandora", ["=", "Hope", ".=--", "Pandora"])])
def test_issue801(en_tokenizer, text, tokens):
"""Test that special characters + hyphens are split correctly."""
doc = en_tokenizer(text)
assert len(doc) == len(tokens)
assert [t.text for t in doc] == tokens
@pytest.mark.parametrize('text,expected_tokens', [
('Smörsåsen används bl.a. till fisk', ['Smörsåsen', 'används', 'bl.a.', 'till', 'fisk']),
('Jag kommer först kl. 13 p.g.a. diverse förseningar', ['Jag', 'kommer', 'först', 'kl.', '13', 'p.g.a.', 'diverse', 'förseningar'])
])
def test_issue805(sv_tokenizer, text, expected_tokens):
tokens = sv_tokenizer(text)
token_list = [token.text for token in tokens if not token.is_space]
assert expected_tokens == token_list
def test_issue850():
"""The variable-length pattern matches the succeeding token. Check we
handle the ambiguity correctly."""
vocab = Vocab(lex_attr_getters={LOWER: lambda string: string.lower()})
matcher = Matcher(vocab)
IS_ANY_TOKEN = matcher.vocab.add_flag(lambda x: True)
pattern = [{'LOWER': "bob"}, {'OP': '*', 'IS_ANY_TOKEN': True}, {'LOWER': 'frank'}]
matcher.add('FarAway', None, pattern)
doc = Doc(matcher.vocab, words=['bob', 'and', 'and', 'frank'])
match = matcher(doc)
assert len(match) == 1
ent_id, start, end = match[0]
assert start == 0
assert end == 4
def test_issue850_basic():
"""Test Matcher matches with '*' operator and Boolean flag"""
vocab = Vocab(lex_attr_getters={LOWER: lambda string: string.lower()})
matcher = Matcher(vocab)
IS_ANY_TOKEN = matcher.vocab.add_flag(lambda x: True)
pattern = [{'LOWER': "bob"}, {'OP': '*', 'LOWER': 'and'}, {'LOWER': 'frank'}]
matcher.add('FarAway', None, pattern)
doc = Doc(matcher.vocab, words=['bob', 'and', 'and', 'frank'])
match = matcher(doc)
assert len(match) == 1
ent_id, start, end = match[0]
assert start == 0
assert end == 4
@pytest.mark.parametrize('text', ["au-delàs", "pair-programmâmes",
"terra-formées", "σ-compacts"])
def test_issue852(fr_tokenizer, text):
"""Test that French tokenizer exceptions are imported correctly."""
tokens = fr_tokenizer(text)
assert len(tokens) == 1
@pytest.mark.parametrize('text', ["aaabbb@ccc.com\nThank you!",
"aaabbb@ccc.com \nThank you!"])
def test_issue859(en_tokenizer, text):
"""Test that no extra space is added in doc.text method."""
doc = en_tokenizer(text)
assert doc.text == text
@pytest.mark.parametrize('text', ["Datum:2014-06-02\nDokument:76467"])
def test_issue886(en_tokenizer, text):
"""Test that token.idx matches the original text index for texts with newlines."""
doc = en_tokenizer(text)
for token in doc:
assert len(token.text) == len(token.text_with_ws)
assert text[token.idx] == token.text[0]
@pytest.mark.parametrize('text', ["want/need"])
def test_issue891(en_tokenizer, text):
"""Test that / infixes are split correctly."""
tokens = en_tokenizer(text)
assert len(tokens) == 3
assert tokens[1].text == "/"
@pytest.mark.parametrize('text,tag,lemma', [
("anus", "NN", "anus"),
("princess", "NN", "princess"),
("inner", "JJ", "inner")
])
def test_issue912(en_vocab, text, tag, lemma):
"""Test base-forms are preserved."""
doc = Doc(en_vocab, words=[text])
doc[0].tag_ = tag
assert doc[0].lemma_ == lemma
def test_issue957(en_tokenizer):
"""Test that spaCy doesn't hang on many periods."""
# skip test if pytest-timeout is not installed
timeout = pytest.importorskip('pytest-timeout')
string = '0'
for i in range(1, 100):
string += '.%d' % i
doc = en_tokenizer(string)
@pytest.mark.xfail
def test_issue999(train_data):
"""Test that adding entities and resuming training works passably OK.
There are two issues here:
1) We have to readd labels. This isn't very nice.
2) There's no way to set the learning rate for the weight update, so we
end up out-of-scale, causing it to learn too fast.
"""
TRAIN_DATA = [
["hey", []],
["howdy", []],
["hey there", []],
["hello", []],
["hi", []],
["i'm looking for a place to eat", []],
["i'm looking for a place in the north of town", [[31,36,"LOCATION"]]],
["show me chinese restaurants", [[8,15,"CUISINE"]]],
["show me chines restaurants", [[8,14,"CUISINE"]]],
]
nlp = Language()
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)
for _, offsets in TRAIN_DATA:
for start, end, label in offsets:
ner.add_label(label)
nlp.begin_training()
ner.model.learn_rate = 0.001
for itn in range(100):
random.shuffle(TRAIN_DATA)
for raw_text, entity_offsets in TRAIN_DATA:
nlp.update([raw_text], [{'entities': entity_offsets}])
with make_tempdir() as model_dir:
nlp.to_disk(model_dir)
nlp2 = Language().from_disk(model_dir)
for raw_text, entity_offsets in TRAIN_DATA:
doc = nlp2(raw_text)
ents = {(ent.start_char, ent.end_char): ent.label_ for ent in doc.ents}
for start, end, label in entity_offsets:
if (start, end) in ents:
assert ents[(start, end)] == label
break
else:
if entity_offsets:
raise Exception(ents)

View File

@ -0,0 +1,127 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
import re
from spacy.tokens import Doc
from spacy.vocab import Vocab
from spacy.lang.en import English
from spacy.lang.lex_attrs import LEX_ATTRS
from spacy.matcher import Matcher
from spacy.tokenizer import Tokenizer
from spacy.lemmatizer import Lemmatizer
from spacy.symbols import ORTH, LEMMA, POS, VERB, VerbForm_part
def test_issue1242():
nlp = English()
doc = nlp('')
assert len(doc) == 0
docs = list(nlp.pipe(['', 'hello']))
assert len(docs[0]) == 0
assert len(docs[1]) == 1
def test_issue1250():
"""Test cached special cases."""
special_case = [{ORTH: 'reimbur', LEMMA: 'reimburse', POS: 'VERB'}]
nlp = English()
nlp.tokenizer.add_special_case('reimbur', special_case)
lemmas = [w.lemma_ for w in nlp('reimbur, reimbur...')]
assert lemmas == ['reimburse', ',', 'reimburse', '...']
lemmas = [w.lemma_ for w in nlp('reimbur, reimbur...')]
assert lemmas == ['reimburse', ',', 'reimburse', '...']
def test_issue1257():
"""Test that tokens compare correctly."""
doc1 = Doc(Vocab(), words=['a', 'b', 'c'])
doc2 = Doc(Vocab(), words=['a', 'c', 'e'])
assert doc1[0] != doc2[0]
assert not doc1[0] == doc2[0]
def test_issue1375():
"""Test that token.nbor() raises IndexError for out-of-bounds access."""
doc = Doc(Vocab(), words=['0', '1', '2'])
with pytest.raises(IndexError):
assert doc[0].nbor(-1)
assert doc[1].nbor(-1).text == '0'
with pytest.raises(IndexError):
assert doc[2].nbor(1)
assert doc[1].nbor(1).text == '2'
def test_issue1387():
tag_map = {'VBG': {POS: VERB, VerbForm_part: True}}
index = {"verb": ("cope","cop")}
exc = {"verb": {"coping": ("cope",)}}
rules = {"verb": [["ing", ""]]}
lemmatizer = Lemmatizer(index, exc, rules)
vocab = Vocab(lemmatizer=lemmatizer, tag_map=tag_map)
doc = Doc(vocab, words=["coping"])
doc[0].tag_ = 'VBG'
assert doc[0].text == "coping"
assert doc[0].lemma_ == "cope"
def test_issue1434():
"""Test matches occur when optional element at end of short doc."""
pattern = [{'ORTH': 'Hello' }, {'IS_ALPHA': True, 'OP': '?'}]
vocab = Vocab(lex_attr_getters=LEX_ATTRS)
hello_world = Doc(vocab, words=['Hello', 'World'])
hello = Doc(vocab, words=['Hello'])
matcher = Matcher(vocab)
matcher.add('MyMatcher', None, pattern)
matches = matcher(hello_world)
assert matches
matches = matcher(hello)
assert matches
@pytest.mark.parametrize('string,start,end', [
('a', 0, 1), ('a b', 0, 2), ('a c', 0, 1), ('a b c', 0, 2),
('a b b c', 0, 3), ('a b b', 0, 3),])
def test_issue1450(string, start, end):
"""Test matcher works when patterns end with * operator."""
pattern = [{'ORTH': "a"}, {'ORTH': "b", 'OP': "*"}]
matcher = Matcher(Vocab())
matcher.add("TSTEND", None, pattern)
doc = Doc(Vocab(), words=string.split())
matches = matcher(doc)
if start is None or end is None:
assert matches == []
assert matches[-1][1] == start
assert matches[-1][2] == end
def test_issue1488():
prefix_re = re.compile(r'''[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']''')
infix_re = re.compile(r'''[-~\.]''')
simple_url_re = re.compile(r'''^https?://''')
def my_tokenizer(nlp):
return Tokenizer(nlp.vocab, {},
prefix_search=prefix_re.search,
suffix_search=suffix_re.search,
infix_finditer=infix_re.finditer,
token_match=simple_url_re.match)
nlp = English()
nlp.tokenizer = my_tokenizer(nlp)
doc = nlp("This is a test.")
for token in doc:
assert token.text
def test_issue1494():
infix_re = re.compile(r'''[^a-z]''')
test_cases = [('token 123test', ['token', '1', '2', '3', 'test']),
('token 1test', ['token', '1test']),
('hello...test', ['hello', '.', '.', '.', 'test'])]
new_tokenizer = lambda nlp: Tokenizer(nlp.vocab, {}, infix_finditer=infix_re.finditer)
nlp = English()
nlp.tokenizer = new_tokenizer(nlp)
for text, expected in test_cases:
assert [token.text for token in nlp(text)] == expected

View File

@ -1,55 +0,0 @@
# coding: utf-8
from __future__ import unicode_literals
from ...matcher import Matcher
import pytest
pattern1 = [[{'LOWER': 'celtics'}], [{'LOWER': 'boston'}, {'LOWER': 'celtics'}]]
pattern2 = [[{'LOWER': 'boston'}, {'LOWER': 'celtics'}], [{'LOWER': 'celtics'}]]
pattern3 = [[{'LOWER': 'boston'}], [{'LOWER': 'boston'}, {'LOWER': 'celtics'}]]
pattern4 = [[{'LOWER': 'boston'}, {'LOWER': 'celtics'}], [{'LOWER': 'boston'}]]
@pytest.fixture
def doc(en_tokenizer):
text = "how many points did lebron james score against the boston celtics last night"
doc = en_tokenizer(text)
return doc
@pytest.mark.parametrize('pattern', [pattern1, pattern2])
def test_issue118(doc, pattern):
"""Test a bug that arose from having overlapping matches"""
ORG = doc.vocab.strings['ORG']
matcher = Matcher(doc.vocab)
matcher.add("BostonCeltics", None, *pattern)
assert len(list(doc.ents)) == 0
matches = [(ORG, start, end) for _, start, end in matcher(doc)]
assert matches == [(ORG, 9, 11), (ORG, 10, 11)]
doc.ents = matches[:1]
ents = list(doc.ents)
assert len(ents) == 1
assert ents[0].label == ORG
assert ents[0].start == 9
assert ents[0].end == 11
@pytest.mark.parametrize('pattern', [pattern3, pattern4])
def test_issue118_prefix_reorder(doc, pattern):
"""Test a bug that arose from having overlapping matches"""
ORG = doc.vocab.strings['ORG']
matcher = Matcher(doc.vocab)
matcher.add('BostonCeltics', None, *pattern)
assert len(list(doc.ents)) == 0
matches = [(ORG, start, end) for _, start, end in matcher(doc)]
doc.ents += tuple(matches)[1:]
assert matches == [(ORG, 9, 10), (ORG, 9, 11)]
ents = doc.ents
assert len(ents) == 1
assert ents[0].label == ORG
assert ents[0].start == 9
assert ents[0].end == 11

View File

@ -1,13 +0,0 @@
from __future__ import unicode_literals
import pytest
@pytest.mark.models('en')
def test_issue1207(EN):
text = 'Employees are recruiting talented staffers from overseas.'
doc = EN(text)
assert [i.text for i in doc.noun_chunks] == ['Employees', 'talented staffers']
sent = list(doc.sents)[0]
assert [i.text for i in sent.noun_chunks] == ['Employees', 'talented staffers']

View File

@ -1,23 +0,0 @@
from __future__ import unicode_literals
import pytest
from ...lang.en import English
from ...util import load_model
def test_issue1242_empty_strings():
nlp = English()
doc = nlp('')
assert len(doc) == 0
docs = list(nlp.pipe(['', 'hello']))
assert len(docs[0]) == 0
assert len(docs[1]) == 1
@pytest.mark.models('en')
def test_issue1242_empty_strings_en_core_web_sm():
nlp = load_model('en_core_web_sm')
doc = nlp('')
assert len(doc) == 0
docs = list(nlp.pipe(['', 'hello']))
assert len(docs[0]) == 0
assert len(docs[1]) == 1

View File

@ -1,13 +0,0 @@
from __future__ import unicode_literals
from ...tokenizer import Tokenizer
from ...symbols import ORTH, LEMMA, POS
from ...lang.en import English
def test_issue1250_cached_special_cases():
nlp = English()
nlp.tokenizer.add_special_case(u'reimbur', [{ORTH: u'reimbur', LEMMA: u'reimburse', POS: u'VERB'}])
lemmas = [w.lemma_ for w in nlp(u'reimbur, reimbur...')]
assert lemmas == ['reimburse', ',', 'reimburse', '...']
lemmas = [w.lemma_ for w in nlp(u'reimbur, reimbur...')]
assert lemmas == ['reimburse', ',', 'reimburse', '...']

Some files were not shown because too many files have changed in this diff.