diff --git a/.gitignore b/.gitignore
index d9e75a229..55f3de909 100644
--- a/.gitignore
+++ b/.gitignore
@@ -36,6 +36,7 @@ venv/
 .dev
 .denv
 .pypyenv
+.pytest_cache/
 
 # Distribution / packaging
 env/
diff --git a/requirements.txt b/requirements.txt
index bd4bf03b3..47690b713 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -11,5 +11,6 @@ dill>=0.2,<0.3
 regex==2017.4.5
 requests>=2.13.0,<3.0.0
 pytest>=3.6.0,<4.0.0
+pytest-timeout>=1.3.0,<2.0.0
 mock>=2.0.0,<3.0.0
 pathlib==1.0.1; python_version < "3.4"
diff --git a/spacy/tests/README.md b/spacy/tests/README.md
index fd47ae579..7aa7f6166 100644
--- a/spacy/tests/README.md
+++ b/spacy/tests/README.md
@@ -6,6 +6,7 @@ spaCy uses the [pytest](http://doc.pytest.org/) framework for testing. For more
 Tests for spaCy modules and classes live in their own directories of the same name. For example, tests for the `Tokenizer` can be found in [`/tests/tokenizer`](tokenizer). All test modules (i.e. directories) also need to be listed in spaCy's [`setup.py`](../setup.py). To be interpreted and run, all test files and test functions need to be prefixed with `test_`.
 
+> ⚠️ **Important note:** As part of our new model training infrastructure, we've moved all model tests to the [`spacy-models`](https://github.com/explosion/spacy-models) repository. This allows us to test the models separately from the core library functionality.
 
 ## Table of contents
 
@@ -13,9 +14,8 @@ Tests for spaCy modules and classes live in their own directories of the same na
 2. [Dos and don'ts](#dos-and-donts)
 3. [Parameters](#parameters)
 4. [Fixtures](#fixtures)
-5. [Testing models](#testing-models)
-6. [Helpers and utilities](#helpers-and-utilities)
-7. [Contributing to the tests](#contributing-to-the-tests)
+5. [Helpers and utilities](#helpers-and-utilities)
+6. [Contributing to the tests](#contributing-to-the-tests)
 
 ## Running the tests
 
@@ -25,10 +25,7 @@ first failure, run them with `py.test -x`.
 
 ```bash
 py.test spacy                       # run basic tests
-py.test spacy --models --en        # run basic and English model tests
-py.test spacy --models --all       # run basic and all model tests
 py.test spacy --slow               # run basic and slow tests
-py.test spacy --models --all --slow # run all tests
 ```
 
 You can also run tests in a specific file or directory, or even only one
@@ -48,10 +45,10 @@ To keep the behaviour of the tests consistent and predictable, we try to follow
 * If you're testing for a bug reported in a specific issue, always create a **regression test**. Regression tests should be named `test_issue[ISSUE NUMBER]` and live in the [`regression`](regression) directory.
 * Only use `@pytest.mark.xfail` for tests that **should pass, but currently fail**. To test for desired negative behaviour, use `assert not` in your test.
 * Very **extensive tests** that take a long time to run should be marked with `@pytest.mark.slow`. If your slow test is testing important behaviour, consider adding an additional simpler version.
-* Tests that require **loading the models** should be marked with `@pytest.mark.models`.
+* If tests require **loading the models**, they should be added to the [`spacy-models`](https://github.com/explosion/spacy-models) tests.
 * Before requiring the models, always make sure there is no other way to test the particular behaviour. In a lot of cases, it's sufficient to simply create a `Doc` object manually. See the section on [helpers and utility functions](#helpers-and-utilities) for more info on this.
-* **Avoid unnecessary imports.** There should never be a need to explicitly import spaCy at the top of a file, and most components are available as [fixtures](#fixtures). You should also avoid wildcard imports (`from module import *`).
-* If you're importing from spaCy, **always use relative imports**. Otherwise, you might accidentally be running the tests over a different copy of spaCy, e.g. one you have installed on your system.
+* **Avoid unnecessary imports.** There should never be a need to explicitly import spaCy at the top of a file, and many components are available as [fixtures](#fixtures). You should also avoid wildcard imports (`from module import *`).
+* If you're importing from spaCy, **always use absolute imports**. For example: `from spacy.language import Language`.
 * Don't forget the **unicode declarations** at the top of each file. This way, unicode strings won't have to be prefixed with `u`.
 * Try to keep the tests **readable and concise**. Use clear and descriptive variable names (`doc`, `tokens` and `text` are great), keep it short and only test for one behaviour at a time.
@@ -93,12 +90,9 @@ These are the main fixtures that are currently available:
 
 | Fixture | Description |
 | --- | --- |
-| `tokenizer` | Creates **all available** language tokenizers and runs the test for **each of them**. |
+| `tokenizer` | Basic, language-independent tokenizer. Identical to the `xx` language class. |
 | `en_tokenizer`, `de_tokenizer`, ... | Creates an English, German etc. tokenizer. |
-| `en_vocab`, `en_entityrecognizer`, ... | Creates an instance of the English `Vocab`, `EntityRecognizer` object etc. |
-| `EN`, `DE`, ... | Creates a language class with a loaded model. For more info, see [Testing models](#testing-models). |
-| `text_file` | Creates an instance of `StringIO` to simulate reading from and writing to files. |
-| `text_file_b` | Creates an instance of `ByteIO` to simulate reading from and writing to files. |
+| `en_vocab` | Creates an instance of the English `Vocab`. |
 
 The fixtures can be used in all tests by simply setting them as an argument, like this:
 
@@ -109,49 +103,6 @@ def test_module_do_something(en_tokenizer):
 
 If all tests in a file require a specific configuration, or use the same complex example, it can be helpful to create a separate fixture. This fixture should be added at the top of each file. Make sure to use descriptive names for these fixtures and don't override any of the global fixtures listed above. **From looking at a test, it should immediately be clear which fixtures are used, and where they are coming from.**
-## Testing models
-
-Models should only be loaded and tested **if absolutely necessary** – for example, if you're specifically testing a model's performance, or if your test is related to model loading. If you only need an annotated `Doc`, you should use the `get_doc()` helper function to create it manually instead.
-
-To specify which language models a test is related to, set the language ID as an argument of `@pytest.mark.models`. This allows you to later run the tests with `--models --en`. You can then use the `EN` [fixture](#fixtures) to get a language
-class with a loaded model.
-
-```python
-@pytest.mark.models('en')
-def test_english_model(EN):
-    doc = EN(u'This is a test')
-```
-
-> ⚠️ **Important note:** In order to test models, they need to be installed as a packge. The [conftest.py](conftest.py) includes a list of all available models, mapped to their IDs, e.g. `en`. Unless otherwise specified, each model that's installed in your environment will be imported and tested. If you don't have a model installed, **the test will be skipped**.
-
-Under the hood, `pytest.importorskip` is used to import a model package and skip the test if the package is not installed. The `EN` fixture for example gets all
-available models for `en`, [parametrizes](#parameters) them to run the test for *each of them*, and uses `load_test_model()` to import the model and run the test, or skip it if the model is not installed.
-
-### Testing specific models
-
-Using the `load_test_model()` helper function, you can also write tests for specific models, or combinations of them:
-
-```python
-from .util import load_test_model
-
-@pytest.mark.models('en')
-def test_en_md_only():
-    nlp = load_test_model('en_core_web_md')
-    # test something specific to en_core_web_md
-
-@pytest.mark.models('en', 'fr')
-@pytest.mark.parametrize('model', ['en_core_web_md', 'fr_depvec_web_lg'])
-def test_different_models(model):
-    nlp = load_test_model(model)
-    # test something specific to the parametrized models
-```
-
-### Known issues and future improvements
-
-Using `importorskip` on a list of model packages is not ideal and we're looking to improve this in the future. But at the moment, it's the best way to ensure that tests are performed on specific model packages only, and that you'll always be able to run the tests, even if you don't have *all available models* installed. (If the tests made a call to `spacy.load('en')` instead, this would load whichever model you've created an `en` shortcut for. This may be one of spaCy's default models, but it could just as easily be your own custom English model.)
-
-The current setup also doesn't provide an easy way to only run tests on specific model versions. The `minversion` keyword argument on `pytest.importorskip` can take care of this, but it currently only checks for the package's `__version__` attribute. An alternative solution would be to load a model package's meta.json and skip if the model's version does not match the one specified in the test.
-
 ## Helpers and utilities
 
 Our new test setup comes with a few handy utility functions that can be imported from [`util.py`](util.py).
@@ -186,7 +137,7 @@ You can construct a `Doc` with the following arguments:
 | `pos` | List of POS tags as text values. |
 | `tag` | List of tag names as text values. |
 | `dep` | List of dependencies as text values. |
-| `ents` | List of entity tuples with `ent_id`, `label`, `start`, `end` (for example `('Stewart Lee', 'PERSON', 0, 2)`). The `label` will be looked up in `vocab.strings[label]`. |
+| `ents` | List of entity tuples with `start`, `end`, `label` (for example `(0, 2, 'PERSON')`). The `label` will be looked up in `vocab.strings[label]`. |
 
 Here's how to quickly get these values from within spaCy:
 
@@ -196,6 +147,7 @@ print([token.head.i-token.i for token in doc])
 print([token.tag_ for token in doc])
 print([token.pos_ for token in doc])
 print([token.dep_ for token in doc])
+print([(ent.start, ent.end, ent.label_) for ent in doc.ents])
 ```
 
 **Note:** There's currently no way of setting the serializer data for the parser without loading the models. If this is relevant to your test, constructing the `Doc` via `get_doc()` won't work.
@@ -204,7 +156,6 @@ print([token.dep_ for token in doc])
 
 | Name | Description |
 | --- | --- |
-| `load_test_model` | Load a model if it's installed as a package, otherwise skip test. |
 | `apply_transition_sequence(parser, doc, sequence)` | Perform a series of pre-specified transitions, to put the parser in a desired state. |
 | `add_vecs_to_vocab(vocab, vectors)` | Add list of vector tuples (`[("text", [1, 2, 3])]`) to given vocab. All vectors need to have the same length. |
 | `get_cosine(vec1, vec2)` | Get cosine for two given vectors. |
diff --git a/spacy/tests/conftest.py b/spacy/tests/conftest.py
index ce2618970..418c08e89 100644
--- a/spacy/tests/conftest.py
+++ b/spacy/tests/conftest.py
@@ -1,229 +1,145 @@
 # coding: utf-8
 from __future__ import unicode_literals
 
-from io import StringIO, BytesIO
-from pathlib import Path
 import pytest
-
-from .util import load_test_model
-from ..tokens import Doc
-from ..strings import StringStore
-from .. import util
+from io import StringIO, BytesIO
+from spacy.util import get_lang_class
 
 
-# These languages are used for generic tokenizer tests – only add a language
-# here if it's using spaCy's tokenizer (not a different library)
-# TODO: re-implement generic tokenizer tests
-_languages = ['bn', 'da', 'de', 'el', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id',
-              'it', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sv', 'tr', 'ar', 'ut', 'tt',
-              'xx']
-
-_models = {'en': ['en_core_web_sm'],
-           'de': ['de_core_news_sm'],
-           'fr': ['fr_core_news_sm'],
-           'xx': ['xx_ent_web_sm'],
-           'en_core_web_md': ['en_core_web_md'],
-           'es_core_news_md': ['es_core_news_md']}
+def pytest_addoption(parser):
+    parser.addoption("--slow", action="store_true", help="include slow tests")
 
 
-# only used for tests that require loading the models
-# in all other cases, use specific instances
-
-@pytest.fixture(params=_models['en'])
-def EN(request):
-    return load_test_model(request.param)
+def pytest_runtest_setup(item):
+    for opt in ['slow']:
+        if opt in item.keywords and not item.config.getoption("--%s" % opt):
+            pytest.skip("need --%s option to run" % opt)
 
 
-@pytest.fixture(params=_models['de'])
-def DE(request):
-    return load_test_model(request.param)
-
-
-@pytest.fixture(params=_models['fr'])
-def FR(request):
-    return load_test_model(request.param)
-
-
-@pytest.fixture()
-def RU(request):
-    pymorphy = pytest.importorskip('pymorphy2')
-    return util.get_lang_class('ru')()
-
-@pytest.fixture()
-def JA(request):
-    mecab = pytest.importorskip("MeCab")
-    return util.get_lang_class('ja')()
-
-
-#@pytest.fixture(params=_languages)
-#def tokenizer(request):
-#lang = util.get_lang_class(request.param)
-#return lang.Defaults.create_tokenizer()
-
-
-@pytest.fixture
+@pytest.fixture(scope='module')
 def tokenizer():
-    return util.get_lang_class('xx').Defaults.create_tokenizer()
+    return get_lang_class('xx').Defaults.create_tokenizer()
 
 
-@pytest.fixture
+@pytest.fixture(scope='session')
 def en_tokenizer():
-    return util.get_lang_class('en').Defaults.create_tokenizer()
+    return get_lang_class('en').Defaults.create_tokenizer()
 
 
-@pytest.fixture
+@pytest.fixture(scope='session')
 def en_vocab():
-    return util.get_lang_class('en').Defaults.create_vocab()
+    return get_lang_class('en').Defaults.create_vocab()
 
 
-@pytest.fixture
+@pytest.fixture(scope='session')
 def en_parser(en_vocab):
-    nlp = util.get_lang_class('en')(en_vocab)
+    nlp = get_lang_class('en')(en_vocab)
     return nlp.create_pipe('parser')
 
 
-@pytest.fixture
+@pytest.fixture(scope='session')
 def es_tokenizer():
-    return util.get_lang_class('es').Defaults.create_tokenizer()
+    return get_lang_class('es').Defaults.create_tokenizer()
 
 
-@pytest.fixture
+@pytest.fixture(scope='session')
 def de_tokenizer():
-    return util.get_lang_class('de').Defaults.create_tokenizer()
+    return get_lang_class('de').Defaults.create_tokenizer()
 
 
-@pytest.fixture
+@pytest.fixture(scope='session')
 def fr_tokenizer():
-    return util.get_lang_class('fr').Defaults.create_tokenizer()
+    return get_lang_class('fr').Defaults.create_tokenizer()
 
 
 @pytest.fixture
 def hu_tokenizer():
-    return util.get_lang_class('hu').Defaults.create_tokenizer()
+    return get_lang_class('hu').Defaults.create_tokenizer()
 
 
-@pytest.fixture
+@pytest.fixture(scope='session')
def fi_tokenizer():
-    return util.get_lang_class('fi').Defaults.create_tokenizer()
+    return get_lang_class('fi').Defaults.create_tokenizer()
 
 
-@pytest.fixture
+@pytest.fixture(scope='session')
 def ro_tokenizer():
-    return util.get_lang_class('ro').Defaults.create_tokenizer()
+    return get_lang_class('ro').Defaults.create_tokenizer()
 
 
-@pytest.fixture
+@pytest.fixture(scope='session')
 def id_tokenizer():
-    return util.get_lang_class('id').Defaults.create_tokenizer()
+    return get_lang_class('id').Defaults.create_tokenizer()
 
 
-@pytest.fixture
+@pytest.fixture(scope='session')
 def sv_tokenizer():
-    return util.get_lang_class('sv').Defaults.create_tokenizer()
+    return get_lang_class('sv').Defaults.create_tokenizer()
 
 
-@pytest.fixture
+@pytest.fixture(scope='session')
 def bn_tokenizer():
-    return util.get_lang_class('bn').Defaults.create_tokenizer()
+    return get_lang_class('bn').Defaults.create_tokenizer()
 
 
-@pytest.fixture
+@pytest.fixture(scope='session')
 def ga_tokenizer():
-    return util.get_lang_class('ga').Defaults.create_tokenizer()
+    return get_lang_class('ga').Defaults.create_tokenizer()
 
 
-@pytest.fixture
+@pytest.fixture(scope='session')
 def he_tokenizer():
-    return util.get_lang_class('he').Defaults.create_tokenizer()
+    return get_lang_class('he').Defaults.create_tokenizer()
 
 
-@pytest.fixture
+@pytest.fixture(scope='session')
 def nb_tokenizer():
-    return util.get_lang_class('nb').Defaults.create_tokenizer()
+    return get_lang_class('nb').Defaults.create_tokenizer()
 
 
-@pytest.fixture
+
+@pytest.fixture(scope='session')
 def da_tokenizer():
-    return util.get_lang_class('da').Defaults.create_tokenizer()
+    return get_lang_class('da').Defaults.create_tokenizer()
 
 
-@pytest.fixture
+
+@pytest.fixture(scope='session')
 def ja_tokenizer():
     mecab = pytest.importorskip("MeCab")
-    return util.get_lang_class('ja').Defaults.create_tokenizer()
+    return get_lang_class('ja').Defaults.create_tokenizer()
 
 
-@pytest.fixture
+
+@pytest.fixture(scope='session')
 def th_tokenizer():
     pythainlp = pytest.importorskip("pythainlp")
-    return util.get_lang_class('th').Defaults.create_tokenizer()
+    return get_lang_class('th').Defaults.create_tokenizer()
 
 
-@pytest.fixture
+
+@pytest.fixture(scope='session')
 def tr_tokenizer():
-    return util.get_lang_class('tr').Defaults.create_tokenizer()
+    return get_lang_class('tr').Defaults.create_tokenizer()
 
 
-@pytest.fixture
+
+@pytest.fixture(scope='session')
 def tt_tokenizer():
-    return util.get_lang_class('tt').Defaults.create_tokenizer()
+    return get_lang_class('tt').Defaults.create_tokenizer()
 
 
-@pytest.fixture
+
+@pytest.fixture(scope='session')
 def el_tokenizer():
-    return util.get_lang_class('el').Defaults.create_tokenizer()
+    return get_lang_class('el').Defaults.create_tokenizer()
 
 
-@pytest.fixture
+
+@pytest.fixture(scope='session')
 def ar_tokenizer():
-    return util.get_lang_class('ar').Defaults.create_tokenizer()
+    return get_lang_class('ar').Defaults.create_tokenizer()
 
 
-@pytest.fixture
+
+@pytest.fixture(scope='session')
 def ur_tokenizer():
-    return util.get_lang_class('ur').Defaults.create_tokenizer()
+    return get_lang_class('ur').Defaults.create_tokenizer()
 
 
-@pytest.fixture
+
+@pytest.fixture(scope='session')
 def ru_tokenizer():
     pymorphy = pytest.importorskip('pymorphy2')
-    return util.get_lang_class('ru').Defaults.create_tokenizer()
-
-
-@pytest.fixture
-def stringstore():
-    return StringStore()
-
-
-@pytest.fixture
-def en_entityrecognizer():
-    return util.get_lang_class('en').Defaults.create_entity()
-
-
-@pytest.fixture
-def text_file():
-    return StringIO()
-
-
-@pytest.fixture
-def text_file_b():
-    return BytesIO()
-
-
-def pytest_addoption(parser):
-    parser.addoption("--models", action="store_true",
-                     help="include tests that require full models")
-    parser.addoption("--vectors", action="store_true",
-                     help="include word vectors tests")
-    parser.addoption("--slow", action="store_true",
-                     help="include slow tests")
-
-    for lang in _languages + ['all']:
-        parser.addoption("--%s" % lang, action="store_true", help="Use %s models" % lang)
-    for model in _models:
-        if model not in _languages:
-            parser.addoption("--%s" % model, action="store_true", help="Use %s model" % model)
-
-
-def pytest_runtest_setup(item):
-    for opt in ['models', 'vectors', 'slow']:
-        if opt in item.keywords and not item.config.getoption("--%s" % opt):
-            pytest.skip("need --%s option to run" % opt)
-
-    # Check if test is marked with models and has arguments set, i.e. specific
-    # language. If so, skip test if flag not set.
-    if item.get_marker('models'):
-        for arg in item.get_marker('models').args:
-            if not item.config.getoption("--%s" % arg) and not item.config.getoption("--all"):
-                pytest.skip("need --%s or --all option to run" % arg)
+    return get_lang_class('ru').Defaults.create_tokenizer()
diff --git a/spacy/tests/doc/test_add_entities.py b/spacy/tests/doc/test_add_entities.py
deleted file mode 100644
index 31d2b8420..000000000
--- a/spacy/tests/doc/test_add_entities.py
+++ /dev/null
@@ -1,24 +0,0 @@
-# coding: utf-8
-from __future__ import unicode_literals
-
-from ...pipeline import EntityRecognizer
-from ..util import get_doc
-
-import pytest
-
-
-def test_doc_add_entities_set_ents_iob(en_vocab):
-    text = ["This", "is", "a", "lion"]
-    doc = get_doc(en_vocab, text)
-    ner = EntityRecognizer(en_vocab)
-    ner.begin_training([])
-    ner(doc)
-
-    assert len(list(doc.ents)) == 0
-    assert [w.ent_iob_ for w in doc] == (['O'] * len(doc))
-
-    doc.ents = [(doc.vocab.strings['ANIMAL'], 3, 4)]
-    assert [w.ent_iob_ for w in doc] == ['', '', '', 'B']
-
-    doc.ents = [(doc.vocab.strings['WORD'], 0, 2)]
-    assert [w.ent_iob_ for w in doc] == ['B', 'I', '', '']
diff --git a/spacy/tests/doc/test_array.py b/spacy/tests/doc/test_array.py
index ff10394d1..27541875b 100644
--- a/spacy/tests/doc/test_array.py
+++ b/spacy/tests/doc/test_array.py
@@ -1,10 +1,9 @@
 # coding: utf-8
 from __future__ import unicode_literals
 
-from ...attrs import ORTH, SHAPE, POS, DEP
-from ..util import get_doc
+from spacy.attrs import ORTH, SHAPE, POS, DEP
 
-import pytest
+from ..util import get_doc
 
 
 def test_doc_array_attr_of_token(en_tokenizer, en_vocab):
@@ -41,7 +40,7 @@ def test_doc_array_tag(en_tokenizer):
     text = "A nice sentence."
     pos = ['DET', 'ADJ', 'NOUN', 'PUNCT']
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], pos=pos)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], pos=pos)
     assert doc[0].pos != doc[1].pos != doc[2].pos != doc[3].pos
     feats_array = doc.to_array((ORTH, POS))
     assert feats_array[0][1] == doc[0].pos
@@ -54,7 +53,7 @@ def test_doc_array_dep(en_tokenizer):
     text = "A nice sentence."
     deps = ['det', 'amod', 'ROOT', 'punct']
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], deps=deps)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], deps=deps)
     feats_array = doc.to_array((ORTH, DEP))
     assert feats_array[0][1] == doc[0].dep
     assert feats_array[1][1] == doc[1].dep
diff --git a/spacy/tests/doc/test_creation.py b/spacy/tests/doc/test_creation.py
index c14fdfbe9..652ea00cf 100644
--- a/spacy/tests/doc/test_creation.py
+++ b/spacy/tests/doc/test_creation.py
@@ -1,10 +1,10 @@
-'''Test Doc sets up tokens correctly.'''
+# coding: utf-8
 from __future__ import unicode_literals
 
-import pytest
-from ...vocab import Vocab
-from ...tokens.doc import Doc
-from ...lemmatizer import Lemmatizer
+import pytest
+from spacy.vocab import Vocab
+from spacy.tokens import Doc
+from spacy.lemmatizer import Lemmatizer
 
 
 @pytest.fixture
diff --git a/spacy/tests/doc/test_doc_api.py b/spacy/tests/doc/test_doc_api.py
index 10f99223b..658fcb128 100644
--- a/spacy/tests/doc/test_doc_api.py
+++ b/spacy/tests/doc/test_doc_api.py
@@ -1,18 +1,18 @@
 # coding: utf-8
 from __future__ import unicode_literals
 
-from ..util import get_doc
-from ...tokens import Doc
-from ...vocab import Vocab
-from ...attrs import LEMMA
-
 import pytest
 import numpy
+from spacy.tokens import Doc
+from spacy.vocab import Vocab
+from spacy.attrs import LEMMA
+
+from ..util import get_doc
 
 
 @pytest.mark.parametrize('text', [["one", "two", "three"]])
 def test_doc_api_compare_by_string_position(en_vocab, text):
-    doc = get_doc(en_vocab, text)
+    doc = Doc(en_vocab, words=text)
     # Get the tokens in this order, so their ID ordering doesn't match the idx
     token3 = doc[-1]
     token2 = doc[-2]
@@ -104,18 +104,18 @@ " Give it back! He pleaded. "])
 def test_doc_api_serialize(en_tokenizer, text):
     tokens = en_tokenizer(text)
-    new_tokens = get_doc(tokens.vocab).from_bytes(tokens.to_bytes())
+    new_tokens = Doc(tokens.vocab).from_bytes(tokens.to_bytes())
     assert tokens.text == new_tokens.text
     assert [t.text for t in tokens] == [t.text for t in new_tokens]
     assert [t.orth for t in tokens] == [t.orth for t in new_tokens]
 
-    new_tokens = get_doc(tokens.vocab).from_bytes(
+    new_tokens = Doc(tokens.vocab).from_bytes(
         tokens.to_bytes(tensor=False), tensor=False)
     assert tokens.text == new_tokens.text
     assert [t.text for t in tokens] == [t.text for t in new_tokens]
     assert [t.orth for t in tokens] == [t.orth for t in new_tokens]
 
-    new_tokens = get_doc(tokens.vocab).from_bytes(
+    new_tokens = Doc(tokens.vocab).from_bytes(
         tokens.to_bytes(sentiment=False), sentiment=False)
     assert tokens.text == new_tokens.text
     assert [t.text for t in tokens] == [t.text for t in new_tokens]
@@ -199,6 +199,20 @@ def test_doc_api_retokenizer_attrs(en_tokenizer):
     assert doc[4].ent_type_ == 'ORG'
 
 
+@pytest.mark.xfail
+def test_doc_api_retokenizer_lex_attrs(en_tokenizer):
+    """Test that lexical attributes can be changed (see #2390)."""
+    doc = en_tokenizer("WKRO played beach boys songs")
+    assert not any(token.is_stop for token in doc)
+    with doc.retokenize() as retokenizer:
+        retokenizer.merge(doc[2:4], attrs={'LEMMA': 'boys', 'IS_STOP': True})
+    assert doc[2].text == 'beach boys'
+    assert doc[2].lemma_ == 'boys'
+    assert doc[2].is_stop
+    new_doc = Doc(doc.vocab, words=['beach boys'])
+    assert new_doc[0].is_stop
+
+
 def test_doc_api_sents_empty_string(en_tokenizer):
     doc = en_tokenizer("")
     doc.is_parsed = True
@@ -215,7 +229,7 @@ def test_doc_api_runtime_error(en_tokenizer):
             'ROOT', 'amod', 'dobj']
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], deps=deps)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], deps=deps)
 
     nps = []
     for np in doc.noun_chunks:
@@ -235,7 +249,7 @@ def test_doc_api_right_edge(en_tokenizer):
             -2, -7, 1, -19, 1, -2, -3, 2, 1, -3, -26]
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
     assert doc[6].text == 'for'
     subtree = [w.text for w in doc[6].subtree]
     assert subtree == ['for', 'the', 'sake', 'of', 'such', 'as',
@@ -264,7 +278,7 @@ def test_doc_api_similarity_match():
 def test_lowest_common_ancestor(en_tokenizer):
     tokens = en_tokenizer('the lazy dog slept')
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=[2, 1, 1, 0])
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=[2, 1, 1, 0])
     lca = doc.get_lca_matrix()
     assert(lca[1, 1] == 1)
     assert(lca[0, 1] == 2)
@@ -277,7 +291,7 @@ def test_parse_tree(en_tokenizer):
     heads = [1, 0, 1, -2, -3, -1, -5]
     tags = ['PRP', 'IN', 'NNP', 'NNP', 'IN', 'NNP', '.']
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, tags=tags)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, tags=tags)
     # full method parse_tree(text) is a trivial composition
     trees = doc.print_tree()
     assert len(trees) > 0
diff --git a/spacy/tests/doc/test_pickle_doc.py b/spacy/tests/doc/test_pickle_doc.py
index 93f06f2c3..85133461b 100644
--- a/spacy/tests/doc/test_pickle_doc.py
+++ b/spacy/tests/doc/test_pickle_doc.py
@@ -1,12 +1,13 @@
+# coding: utf-8
 from __future__ import unicode_literals
 
-from ...language import Language
-from ...compat import pickle, unicode_
+from spacy.language import Language
+from spacy.compat import pickle, unicode_
 
 
 def test_pickle_single_doc():
     nlp = Language()
-    doc = nlp(u'pickle roundtrip')
+    doc = nlp('pickle roundtrip')
     data = pickle.dumps(doc, 1)
     doc2 = pickle.loads(data)
     assert doc2.text == 'pickle roundtrip'
@@ -16,7 +17,7 @@ def test_list_of_docs_pickles_efficiently():
     nlp = Language()
     for i in range(10000):
         _ = nlp.vocab[unicode_(i)]
-    one_pickled = pickle.dumps(nlp(u'0'), -1)
+    one_pickled = pickle.dumps(nlp('0'), -1)
     docs = list(nlp.pipe(unicode_(i) for i in range(100)))
     many_pickled = pickle.dumps(docs, -1)
     assert len(many_pickled) < (len(one_pickled) * 2)
@@ -28,7 +29,7 @@ def test_user_data_from_disk():
     nlp = Language()
-    doc = nlp(u'Hello')
+    doc = nlp('Hello')
     doc.user_data[(0, 1)] = False
     b = doc.to_bytes()
     doc2 = doc.__class__(doc.vocab).from_bytes(b)
@@ -36,7 +37,7 @@ def test_user_data_unpickles():
     nlp = Language()
-    doc = nlp(u'Hello')
+    doc = nlp('Hello')
     doc.user_data[(0, 1)] = False
     b = pickle.dumps(doc)
     doc2 = pickle.loads(b)
@@ -47,7 +48,7 @@ def test_hooks_unpickle():
     def inner_func(d1, d2):
         return 'hello!'
     nlp = Language()
-    doc = nlp(u'Hello')
+    doc = nlp('Hello')
     doc.user_hooks['similarity'] = inner_func
     b = pickle.dumps(doc)
     doc2 = pickle.loads(b)
diff --git a/spacy/tests/doc/test_span.py b/spacy/tests/doc/test_span.py
index d355f06c5..c311e71c1 100644
--- a/spacy/tests/doc/test_span.py
+++ b/spacy/tests/doc/test_span.py
@@ -1,12 +1,12 @@
 # coding: utf-8
 from __future__ import unicode_literals
 
-from ..util import get_doc
-from ...attrs import ORTH, LENGTH
-from ...tokens import Doc
-from ...vocab import Vocab
-
 import pytest
+from spacy.attrs import ORTH, LENGTH
+from spacy.tokens import Doc
+from spacy.vocab import Vocab
+
+from ..util import get_doc
 
 
 @pytest.fixture
@@ -16,16 +16,16 @@ def doc(en_tokenizer):
     deps = ['nsubj', 'ROOT', 'det', 'attr', 'punct', 'nsubj', 'ROOT', 'det',
             'attr', 'punct', 'ROOT', 'det', 'npadvmod', 'punct']
     tokens = en_tokenizer(text)
-    return get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)
+    return get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
 
 
 @pytest.fixture
 def doc_not_parsed(en_tokenizer):
     text = "This is a sentence. This is another sentence. And a third."
     tokens = en_tokenizer(text)
-    d = get_doc(tokens.vocab, [t.text for t in tokens])
-    d.is_parsed = False
-    return d
+    doc = Doc(tokens.vocab, words=[t.text for t in tokens])
+    doc.is_parsed = False
+    return doc
 
 
 def test_spans_sent_spans(doc):
@@ -56,7 +56,7 @@ def test_spans_root2(en_tokenizer):
     text = "through North and South Carolina"
     heads = [0, 3, -1, -2, -4]
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
     assert doc[-2:].root.text == 'Carolina'
 
 
@@ -76,7 +76,7 @@ def test_spans_span_sent(doc, doc_not_parsed):
 def test_spans_lca_matrix(en_tokenizer):
     """Test span's lca matrix generation"""
     tokens = en_tokenizer('the lazy dog slept')
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=[2, 1, 1, 0])
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=[2, 1, 1, 0])
     lca = doc[:2].get_lca_matrix()
     assert(lca[0, 0] == 0)
     assert(lca[0, 1] == -1)
@@ -100,7 +100,7 @@ def test_spans_default_sentiment(en_tokenizer):
     tokens = en_tokenizer(text)
     tokens.vocab[tokens[0].text].sentiment = 3.0
     tokens.vocab[tokens[2].text].sentiment = -2.0
-    doc = get_doc(tokens.vocab, [t.text for t in tokens])
+    doc = Doc(tokens.vocab, words=[t.text for t in tokens])
     assert doc[:2].sentiment == 3.0 / 2
     assert doc[-2:].sentiment == -2. / 2
     assert doc[:-1].sentiment == (3.+-2) / 3.
@@ -112,7 +112,7 @@ def test_spans_override_sentiment(en_tokenizer):
     tokens = en_tokenizer(text)
     tokens.vocab[tokens[0].text].sentiment = 3.0
     tokens.vocab[tokens[2].text].sentiment = -2.0
-    doc = get_doc(tokens.vocab, [t.text for t in tokens])
+    doc = Doc(tokens.vocab, words=[t.text for t in tokens])
     doc.user_span_hooks['sentiment'] = lambda span: 10.0
     assert doc[:2].sentiment == 10.0
    assert doc[-2:].sentiment == 10.0
@@ -146,7 +146,7 @@ def test_span_to_array(doc):
     assert arr[0, 1] == len(span[0])
 
 
-#def test_span_as_doc(doc):
-#    span = doc[4:10]
-#    span_doc = span.as_doc()
-#    assert span.text == span_doc.text.strip()
+def test_span_as_doc(doc):
+    span = doc[4:10]
+    span_doc = span.as_doc()
+    assert span.text == span_doc.text.strip()
diff --git a/spacy/tests/doc/test_span_merge.py b/spacy/tests/doc/test_span_merge.py
index ae1f4f4a1..24ab17b8e 100644
--- a/spacy/tests/doc/test_span_merge.py
+++ b/spacy/tests/doc/test_span_merge.py
@@ -1,18 +1,17 @@
 # coding: utf-8
 from __future__ import unicode_literals
 
-from ..util import get_doc
-from ...vocab import Vocab
-from ...tokens import Doc
+from spacy.vocab import Vocab
+from spacy.tokens import Doc
 
-import pytest
+from ..util import get_doc
 
 
 def test_spans_merge_tokens(en_tokenizer):
     text = "Los Angeles start."
     heads = [1, 1, 0, -1]
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
     assert len(doc) == 4
     assert doc[0].head.text == 'Angeles'
     assert doc[1].head.text == 'start'
@@ -21,7 +20,7 @@ def test_spans_merge_tokens(en_tokenizer):
     assert doc[0].text == 'Los Angeles'
     assert doc[0].head.text == 'start'
 
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
     assert len(doc) == 4
     assert doc[0].head.text == 'Angeles'
     assert doc[1].head.text == 'start'
@@ -35,7 +34,7 @@ def test_spans_merge_heads(en_tokenizer):
     text = "I found a pilates class near work."
     heads = [1, 0, 2, 1, -3, -1, -1, -6]
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
     assert len(doc) == 8
 
     doc.merge(doc[3].idx, doc[4].idx + len(doc[4]), tag=doc[4].tag_,
@@ -53,7 +52,7 @@ def test_span_np_merges(en_tokenizer):
     text = "displaCy is a parse tool built with Javascript"
     heads = [1, 0, 2, 1, -3, -1, -1, -1]
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
     assert doc[4].head.i == 1
 
     doc.merge(doc[2].idx, doc[4].idx + len(doc[4]), tag='NP', lemma='tool',
@@ -63,7 +62,7 @@ text = "displaCy is a lightweight and modern dependency parse tree visualization tool built with CSS3 and JavaScript."
     heads = [1, 0, 8, 3, -1, -2, 4, 3, 1, 1, -9, -1, -1, -1, -1, -2, -15]
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
     ents = [(e[0].idx, e[-1].idx + len(e[-1]), e.label_, e.lemma_)
             for e in doc.ents]
     for start, end, label, lemma in ents:
@@ -74,8 +73,7 @@ def test_span_np_merges(en_tokenizer):
     text = "One test with entities like New York City so the ents list is not void"
     heads = [1, 11, -1, -1, -1, 1, 1, -3, 4, 2, 1, 1, 0, -1, -2]
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
-
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
     for span in doc.ents:
         merged = doc.merge()
         assert merged != None, (span.start, span.end, span.label_, span.lemma_)
@@ -85,10 +83,9 @@ def test_spans_entity_merge(en_tokenizer):
     text = "Stewart Lee is a stand up comedian who lives in England and loves Joe Pasquale.\n"
     heads = [1, 1, 0, 1, 2, -1, -4, 1, -2, -1, -1, -3, -10, 1, -2, -13, -1]
     tags = ['NNP', 'NNP', 'VBZ', 'DT', 'VB', 'RP', 'NN', 'WP', 'VBZ', 'IN', 'NNP', 'CC', 'VBZ', 'NNP', 'NNP', '.', 'SP']
-    ents = [('Stewart Lee', 'PERSON', 0, 2), ('England', 'GPE', 10, 11), ('Joe Pasquale', 'PERSON', 13, 15)]
-
+    ents = [(0, 2, 'PERSON'), (10, 11, 'GPE'), (13, 15, 'PERSON')]
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, tags=tags, ents=ents)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, tags=tags, ents=ents)
     assert len(doc) == 17
     for ent in doc.ents:
         label, lemma, type_ = (ent.root.tag_, ent.root.lemma_, max(w.ent_type_ for w in ent))
@@ -120,7 +117,7 @@ def test_spans_sentence_update_after_merge(en_tokenizer):
             'compound', 'dobj', 'punct']

     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
     sent1, sent2 = list(doc.sents)
     init_len = len(sent1)
     init_len2 = len(sent2)
@@ -138,7 +135,7 @@ def test_spans_subtree_size_check(en_tokenizer):
             'dobj']

     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
     sent1 = list(doc.sents)[0]
     init_len = len(list(sent1.root.subtree))
     doc[0:2].merge(label='none', lemma='none', ent_type='none')
diff --git a/spacy/tests/doc/test_token_api.py b/spacy/tests/doc/test_token_api.py
index 9de1e6fc1..511ebbad0 100644
--- a/spacy/tests/doc/test_token_api.py
+++ b/spacy/tests/doc/test_token_api.py
@@ -1,14 +1,24 @@
 # coding: utf-8
 from __future__ import unicode_literals

-from ...attrs import IS_ALPHA, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_TITLE, IS_STOP
-from ...symbols import NOUN, VERB
-from ..util import get_doc
-from ...vocab import Vocab
-from ...tokens import Doc
-
 import pytest
 import numpy

+from spacy.attrs import IS_ALPHA, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_TITLE, IS_STOP
+from spacy.symbols import VERB
+from spacy.vocab import Vocab
+from spacy.tokens import Doc
+
+from ..util import get_doc
+
+
+@pytest.fixture
+def doc(en_tokenizer):
+    text = "This is a sentence. This is another sentence. And a third."
+    heads = [1, 0, 1, -2, -3, 1, 0, 1, -2, -3, 0, 1, -2, -1]
+    deps = ['nsubj', 'ROOT', 'det', 'attr', 'punct', 'nsubj', 'ROOT', 'det',
+            'attr', 'punct', 'ROOT', 'det', 'npadvmod', 'punct']
+    tokens = en_tokenizer(text)
+    return get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)


 def test_doc_token_api_strings(en_tokenizer):
@@ -18,7 +28,7 @@
     deps = ['ROOT', 'dobj', 'prt', 'punct', 'nsubj', 'ROOT', 'punct']

     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], pos=pos, heads=heads, deps=deps)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps)
     assert doc[0].orth_ == 'Give'
     assert doc[0].text == 'Give'
     assert doc[0].text_with_ws == 'Give '
@@ -57,18 +67,9 @@ def test_doc_token_api_str_builtin(en_tokenizer, text):
     assert str(tokens[0]) == text.split(' ')[0]
     assert str(tokens[1]) == text.split(' ')[1]

-@pytest.fixture
-def doc(en_tokenizer):
-    text = "This is a sentence. This is another sentence. And a third."
-    heads = [1, 0, 1, -2, -3, 1, 0, 1, -2, -3, 0, 1, -2, -1]
-    deps = ['nsubj', 'ROOT', 'det', 'attr', 'punct', 'nsubj', 'ROOT', 'det',
-            'attr', 'punct', 'ROOT', 'det', 'npadvmod', 'punct']
-    tokens = en_tokenizer(text)
-    return get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)

 def test_doc_token_api_is_properties(en_vocab):
-    text = ["Hi", ",", "my", "email", "is", "test@me.com"]
-    doc = get_doc(en_vocab, text)
+    doc = Doc(en_vocab, words=["Hi", ",", "my", "email", "is", "test@me.com"])
     assert doc[0].is_title
     assert doc[0].is_alpha
     assert not doc[0].is_digit
@@ -86,7 +87,6 @@ def test_doc_token_api_vectors():
     vocab.set_vector('oranges', vector=numpy.asarray([0., 1.], dtype='f'))
     doc = Doc(vocab, words=['apples', 'oranges', 'oov'])
     assert doc.has_vector
-    assert doc[0].has_vector
     assert doc[1].has_vector
     assert not doc[2].has_vector
@@ -101,7 +101,7 @@ def test_doc_token_api_ancestors(en_tokenizer):
     text = "Yesterday I saw a dog that barked loudly."
     heads = [2, 1, 0, 1, -2, 1, -2, -1, -6]
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
     assert [t.text for t in doc[6].ancestors] == ["dog", "saw"]
     assert [t.text for t in doc[1].ancestors] == ["saw"]
     assert [t.text for t in doc[2].ancestors] == []
@@ -115,7 +115,7 @@ def test_doc_token_api_head_setter(en_tokenizer):
     text = "Yesterday I saw a dog that barked loudly."
     heads = [2, 1, 0, 1, -2, 1, -2, -1, -6]
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)

     assert doc[6].n_lefts == 1
     assert doc[6].n_rights == 1
@@ -165,7 +165,7 @@ def test_doc_token_api_head_setter(en_tokenizer):


 def test_is_sent_start(en_tokenizer):
-    doc = en_tokenizer(u'This is a sentence. This is another.')
+    doc = en_tokenizer('This is a sentence. This is another.')
     assert doc[5].is_sent_start is None
     doc[5].is_sent_start = True
     assert doc[5].is_sent_start is True
diff --git a/spacy/tests/test_underscore.py b/spacy/tests/doc/test_underscore.py
similarity index 96%
rename from spacy/tests/test_underscore.py
rename to spacy/tests/doc/test_underscore.py
index 15ad0e6f4..0c3143245 100644
--- a/spacy/tests/test_underscore.py
+++ b/spacy/tests/doc/test_underscore.py
@@ -3,10 +3,8 @@ from __future__ import unicode_literals

 import pytest
 from mock import Mock
-
-from ..vocab import Vocab
-from ..tokens import Doc, Span, Token
-from ..tokens.underscore import Underscore
+from spacy.tokens import Doc, Span, Token
+from spacy.tokens.underscore import Underscore


 def test_create_doc_underscore():
diff --git a/spacy/tests/lang/ar/test_exceptions.py b/spacy/tests/lang/ar/test_exceptions.py
index e8da7f621..323118002 100644
--- a/spacy/tests/lang/ar/test_exceptions.py
+++ b/spacy/tests/lang/ar/test_exceptions.py
@@ -4,15 +4,14 @@ from __future__ import unicode_literals

 import pytest


-@pytest.mark.parametrize('text',
-                         ["ق.م", "إلخ", "ص.ب", "ت."])
+@pytest.mark.parametrize('text', ["ق.م", "إلخ", "ص.ب", "ت."])
 def test_ar_tokenizer_handles_abbr(ar_tokenizer, text):
     tokens = ar_tokenizer(text)
     assert len(tokens) == 1


 def test_ar_tokenizer_handles_exc_in_text(ar_tokenizer):
-    text = u"تعود الكتابة الهيروغليفية إلى سنة 3200 ق.م"
+    text = "تعود الكتابة الهيروغليفية إلى سنة 3200 ق.م"
     tokens = ar_tokenizer(text)
     assert len(tokens) == 7
     assert tokens[6].text == "ق.م"
@@ -20,7 +19,6 @@ def test_ar_tokenizer_handles_exc_in_text(ar_tokenizer):


 def test_ar_tokenizer_handles_exc_in_text(ar_tokenizer):
-    text = u"يبلغ طول مضيق طارق 14كم "
+    text = "يبلغ طول مضيق طارق 14كم "
     tokens = ar_tokenizer(text)
-    print([(tokens[i].text, tokens[i].suffix_) for i in range(len(tokens))])
     assert len(tokens) == 6
diff --git a/spacy/tests/lang/ar/test_text.py b/spacy/tests/lang/ar/test_text.py
index 7c5e9f9c7..951c24fa6 100644
--- a/spacy/tests/lang/ar/test_text.py
+++ b/spacy/tests/lang/ar/test_text.py
@@ -2,7 +2,7 @@
 from __future__ import unicode_literals


-def test_tokenizer_handles_long_text(ar_tokenizer):
+def test_ar_tokenizer_handles_long_text(ar_tokenizer):
     text = """نجيب محفوظ مؤلف و كاتب روائي عربي، يعد من أهم الأدباء العرب خلال القرن العشرين.
 ولد نجيب محفوظ في مدينة القاهرة، حيث ترعرع و تلقى تعليمه الجامعي في جامعتها، فتمكن من نيل شهادة في الفلسفة.
 ألف محفوظ على مدار حياته الكثير من الأعمال الأدبية، و في مقدمتها ثلاثيته الشهيرة.
diff --git a/spacy/tests/lang/bn/test_tokenizer.py b/spacy/tests/lang/bn/test_tokenizer.py
index b82aa7525..772fc07fa 100644
--- a/spacy/tests/lang/bn/test_tokenizer.py
+++ b/spacy/tests/lang/bn/test_tokenizer.py
@@ -3,38 +3,32 @@ from __future__ import unicode_literals

 import pytest

-TESTCASES = []
-PUNCTUATION_TESTS = [
-    (u'আমি বাংলায় গান গাই!', [u'আমি', u'বাংলায়', u'গান', u'গাই', u'!']),
-    (u'আমি বাংলায় কথা কই।', [u'আমি', u'বাংলায়', u'কথা', u'কই', u'।']),
-    (u'বসুন্ধরা জনসম্মুখে দোষ স্বীকার করলো না?', [u'বসুন্ধরা', u'জনসম্মুখে', u'দোষ', u'স্বীকার', u'করলো', u'না', u'?']),
-    (u'টাকা থাকলে কি না হয়!', [u'টাকা', u'থাকলে', u'কি', u'না', u'হয়', u'!']),
+TESTCASES = [
+    # punctuation tests
+    ('আমি বাংলায় গান গাই!', ['আমি', 'বাংলায়', 'গান', 'গাই', '!']),
+    ('আমি বাংলায় কথা কই।', ['আমি', 'বাংলায়', 'কথা', 'কই', '।']),
+    ('বসুন্ধরা জনসম্মুখে দোষ স্বীকার করলো না?', ['বসুন্ধরা', 'জনসম্মুখে', 'দোষ', 'স্বীকার', 'করলো', 'না', '?']),
+    ('টাকা থাকলে কি না হয়!', ['টাকা', 'থাকলে', 'কি', 'না', 'হয়', '!']),
+    # abbreviations
+    ('ডঃ খালেদ বললেন ঢাকায় ৩৫ ডিগ্রি সে.।', ['ডঃ', 'খালেদ', 'বললেন', 'ঢাকায়', '৩৫', 'ডিগ্রি', 'সে.', '।'])
 ]

-ABBREVIATIONS = [
-    (u'ডঃ খালেদ বললেন ঢাকায় ৩৫ ডিগ্রি সে.।', [u'ডঃ', u'খালেদ', u'বললেন', u'ঢাকায়', u'৩৫', u'ডিগ্রি', u'সে.', u'।'])
-]
-
-TESTCASES.extend(PUNCTUATION_TESTS)
-TESTCASES.extend(ABBREVIATIONS)
-

 @pytest.mark.parametrize('text,expected_tokens', TESTCASES)
-def test_tokenizer_handles_testcases(bn_tokenizer, text, expected_tokens):
+def test_bn_tokenizer_handles_testcases(bn_tokenizer, text, expected_tokens):
     tokens = bn_tokenizer(text)
     token_list = [token.text for token in tokens if not token.is_space]
     assert expected_tokens == token_list


-def test_tokenizer_handles_long_text(bn_tokenizer):
-    text = u"""নর্থ সাউথ বিশ্ববিদ্যালয়ে সারাবছর কোন না কোন বিষয়ে গবেষণা চলতেই থাকে। \
+def test_bn_tokenizer_handles_long_text(bn_tokenizer):
+    text = """নর্থ সাউথ বিশ্ববিদ্যালয়ে সারাবছর কোন না কোন বিষয়ে গবেষণা চলতেই থাকে। \
 অভিজ্ঞ ফ্যাকাল্টি মেম্বারগণ প্রায়ই শিক্ষার্থীদের নিয়ে বিভিন্ন গবেষণা প্রকল্পে কাজ করেন, \
 যার মধ্যে রয়েছে রোবট থেকে মেশিন লার্নিং সিস্টেম ও আর্টিফিশিয়াল ইন্টেলিজেন্স। \
 এসকল প্রকল্পে কাজ করার মাধ্যমে সংশ্লিষ্ট ক্ষেত্রে যথেষ্ঠ পরিমাণ স্পেশালাইজড হওয়া সম্ভব। \
 আর গবেষণার কাজ তোমার ক্যারিয়ারকে ঠেলে নিয়ে যাবে অনেকখানি! \
 কন্টেস্ট প্রোগ্রামার হও, গবেষক কিংবা ডেভেলপার - নর্থ সাউথ ইউনিভার্সিটিতে তোমার প্রতিভা বিকাশের সুযোগ রয়েছেই। \
 নর্থ সাউথের অসাধারণ কমিউনিটিতে তোমাকে সাদর আমন্ত্রণ।"""
-
     tokens = bn_tokenizer(text)
     assert len(tokens) == 84
diff --git a/spacy/tests/lang/da/test_exceptions.py b/spacy/tests/lang/da/test_exceptions.py
index e3ad6fcb4..2e6493210 100644
--- a/spacy/tests/lang/da/test_exceptions.py
+++ b/spacy/tests/lang/da/test_exceptions.py
@@ -3,28 +3,32 @@ from __future__ import unicode_literals

 import pytest

-@pytest.mark.parametrize('text',
-                         ["ca.", "m.a.o.", "Jan.", "Dec.", "kr.", "jf."])
+
+@pytest.mark.parametrize('text', ["ca.", "m.a.o.", "Jan.", "Dec.", "kr.", "jf."])
 def test_da_tokenizer_handles_abbr(da_tokenizer, text):
     tokens = da_tokenizer(text)
     assert len(tokens) == 1

+
 @pytest.mark.parametrize('text', ["Jul.", "jul.", "Tor.", "Tors."])
 def test_da_tokenizer_handles_ambiguous_abbr(da_tokenizer, text):
     tokens = da_tokenizer(text)
     assert len(tokens) == 2

+
 @pytest.mark.parametrize('text', ["1.", "10.", "31."])
 def test_da_tokenizer_handles_dates(da_tokenizer, text):
     tokens = da_tokenizer(text)
     assert len(tokens) == 1

+
 def test_da_tokenizer_handles_exc_in_text(da_tokenizer):
     text = "Det er bl.a. ikke meningen"
     tokens = da_tokenizer(text)
     assert len(tokens) == 5
     assert tokens[2].text == "bl.a."

+
 def test_da_tokenizer_handles_custom_base_exc(da_tokenizer):
     text = "Her er noget du kan kigge i."
     tokens = da_tokenizer(text)
@@ -32,8 +36,9 @@ def test_da_tokenizer_handles_custom_base_exc(da_tokenizer):
     assert tokens[6].text == "i"
     assert tokens[7].text == "."

-@pytest.mark.parametrize('text,norm',
-                         [("akvarium", "akvarie"), ("bedstemoder", "bedstemor")])
+
+@pytest.mark.parametrize('text,norm', [
+    ("akvarium", "akvarie"), ("bedstemoder", "bedstemor")])
 def test_da_tokenizer_norm_exceptions(da_tokenizer, text, norm):
     tokens = da_tokenizer(text)
     assert tokens[0].norm_ == norm
diff --git a/spacy/tests/lang/da/test_lemma.py b/spacy/tests/lang/da/test_lemma.py
index a2f301bcb..3cfd7f329 100644
--- a/spacy/tests/lang/da/test_lemma.py
+++ b/spacy/tests/lang/da/test_lemma.py
@@ -4,10 +4,11 @@ from __future__ import unicode_literals

 import pytest


-@pytest.mark.parametrize('string,lemma', [('affaldsgruppernes', 'affaldsgruppe'),
-                                          ('detailhandelsstrukturernes', 'detailhandelsstruktur'),
-                                          ('kolesterols', 'kolesterol'),
-                                          ('åsyns', 'åsyn')])
-def test_lemmatizer_lookup_assigns(da_tokenizer, string, lemma):
+@pytest.mark.parametrize('string,lemma', [
+    ('affaldsgruppernes', 'affaldsgruppe'),
+    ('detailhandelsstrukturernes', 'detailhandelsstruktur'),
+    ('kolesterols', 'kolesterol'),
+    ('åsyns', 'åsyn')])
+def test_da_lemmatizer_lookup_assigns(da_tokenizer, string, lemma):
     tokens = da_tokenizer(string)
     assert tokens[0].lemma_ == lemma
diff --git a/spacy/tests/lang/da/test_prefix_suffix_infix.py b/spacy/tests/lang/da/test_prefix_suffix_infix.py
index 4cf0719c9..d313aebe5 100644
--- a/spacy/tests/lang/da/test_prefix_suffix_infix.py
+++ b/spacy/tests/lang/da/test_prefix_suffix_infix.py
@@ -1,24 +1,23 @@
 # coding: utf-8
-"""Test that tokenizer prefixes, suffixes and infixes are handled correctly."""
 from __future__ import unicode_literals

 import pytest


 @pytest.mark.parametrize('text', ["(under)"])
-def test_tokenizer_splits_no_special(da_tokenizer, text):
+def test_da_tokenizer_splits_no_special(da_tokenizer, text):
     tokens = da_tokenizer(text)
     assert len(tokens) == 3


 @pytest.mark.parametrize('text', ["ta'r", "Søren's", "Lars'"])
-def test_tokenizer_handles_no_punct(da_tokenizer, text):
+def test_da_tokenizer_handles_no_punct(da_tokenizer, text):
     tokens = da_tokenizer(text)
     assert len(tokens) == 1


 @pytest.mark.parametrize('text', ["(ta'r"])
-def test_tokenizer_splits_prefix_punct(da_tokenizer, text):
+def test_da_tokenizer_splits_prefix_punct(da_tokenizer, text):
     tokens = da_tokenizer(text)
     assert len(tokens) == 2
     assert tokens[0].text == "("
@@ -26,22 +25,23 @@ def test_tokenizer_splits_prefix_punct(da_tokenizer, text):


 @pytest.mark.parametrize('text', ["ta'r)"])
-def test_tokenizer_splits_suffix_punct(da_tokenizer, text):
+def test_da_tokenizer_splits_suffix_punct(da_tokenizer, text):
     tokens = da_tokenizer(text)
     assert len(tokens) == 2
     assert tokens[0].text == "ta'r"
     assert tokens[1].text == ")"


-@pytest.mark.parametrize('text,expected', [("(ta'r)", ["(", "ta'r", ")"]), ("'ta'r'", ["'", "ta'r", "'"])])
-def test_tokenizer_splits_even_wrap(da_tokenizer, text, expected):
+@pytest.mark.parametrize('text,expected', [
+    ("(ta'r)", ["(", "ta'r", ")"]), ("'ta'r'", ["'", "ta'r", "'"])])
+def test_da_tokenizer_splits_even_wrap(da_tokenizer, text, expected):
     tokens = da_tokenizer(text)
     assert len(tokens) == len(expected)
     assert [t.text for t in tokens] == expected


 @pytest.mark.parametrize('text', ["(ta'r?)"])
-def test_tokenizer_splits_uneven_wrap(da_tokenizer, text):
+def test_da_tokenizer_splits_uneven_wrap(da_tokenizer, text):
     tokens = da_tokenizer(text)
     assert len(tokens) == 4
     assert tokens[0].text == "("
@@ -50,15 +50,16 @@ def test_tokenizer_splits_uneven_wrap(da_tokenizer, text):
     assert tokens[3].text == ")"


-@pytest.mark.parametrize('text,expected', [("f.eks.", ["f.eks."]), ("fe.", ["fe", "."]), ("(f.eks.", ["(", "f.eks."])])
-def test_tokenizer_splits_prefix_interact(da_tokenizer, text, expected):
+@pytest.mark.parametrize('text,expected', [
+    ("f.eks.", ["f.eks."]), ("fe.", ["fe", "."]), ("(f.eks.", ["(", "f.eks."])])
+def test_da_tokenizer_splits_prefix_interact(da_tokenizer, text, expected):
     tokens = da_tokenizer(text)
     assert len(tokens) == len(expected)
     assert [t.text for t in tokens] == expected


 @pytest.mark.parametrize('text', ["f.eks.)"])
-def test_tokenizer_splits_suffix_interact(da_tokenizer, text):
+def test_da_tokenizer_splits_suffix_interact(da_tokenizer, text):
     tokens = da_tokenizer(text)
     assert len(tokens) == 2
     assert tokens[0].text == "f.eks."
@@ -66,7 +67,7 @@ def test_da_tokenizer_splits_suffix_interact(da_tokenizer, text):


 @pytest.mark.parametrize('text', ["(f.eks.)"])
-def test_tokenizer_splits_even_wrap_interact(da_tokenizer, text):
+def test_da_tokenizer_splits_even_wrap_interact(da_tokenizer, text):
     tokens = da_tokenizer(text)
     assert len(tokens) == 3
     assert tokens[0].text == "("
@@ -75,7 +76,7 @@ def test_da_tokenizer_splits_even_wrap_interact(da_tokenizer, text):


 @pytest.mark.parametrize('text', ["(f.eks.?)"])
-def test_tokenizer_splits_uneven_wrap_interact(da_tokenizer, text):
+def test_da_tokenizer_splits_uneven_wrap_interact(da_tokenizer, text):
     tokens = da_tokenizer(text)
     assert len(tokens) == 4
     assert tokens[0].text == "("
@@ -85,19 +86,19 @@ def test_da_tokenizer_splits_uneven_wrap_interact(da_tokenizer, text):


 @pytest.mark.parametrize('text', ["0,1-13,5", "0,0-0,1", "103,27-300", "1/2-3/4"])
-def test_tokenizer_handles_numeric_range(da_tokenizer, text):
+def test_da_tokenizer_handles_numeric_range(da_tokenizer, text):
     tokens = da_tokenizer(text)
     assert len(tokens) == 1


 @pytest.mark.parametrize('text', ["sort.Gul", "Hej.Verden"])
-def test_tokenizer_splits_period_infix(da_tokenizer, text):
+def test_da_tokenizer_splits_period_infix(da_tokenizer, text):
     tokens = da_tokenizer(text)
     assert len(tokens) == 3


 @pytest.mark.parametrize('text', ["Hej,Verden", "en,to"])
-def test_tokenizer_splits_comma_infix(da_tokenizer, text):
+def test_da_tokenizer_splits_comma_infix(da_tokenizer, text):
     tokens = da_tokenizer(text)
     assert len(tokens) == 3
     assert tokens[0].text == text.split(",")[0]
@@ -106,18 +107,18 @@ def test_tokenizer_splits_comma_infix(da_tokenizer, text):


 @pytest.mark.parametrize('text', ["sort...Gul", "sort...gul"])
-def test_tokenizer_splits_ellipsis_infix(da_tokenizer, text):
+def test_da_tokenizer_splits_ellipsis_infix(da_tokenizer, text):
     tokens = da_tokenizer(text)
     assert len(tokens) == 3


 @pytest.mark.parametrize('text', ['gå-på-mod', '4-hjulstræk', '100-Pfennig-frimærke', 'TV-2-spots', 'trofæ-vaeggen'])
-def test_tokenizer_keeps_hyphens(da_tokenizer, text):
+def test_da_tokenizer_keeps_hyphens(da_tokenizer, text):
     tokens = da_tokenizer(text)
     assert len(tokens) == 1


-def test_tokenizer_splits_double_hyphen_infix(da_tokenizer):
+def test_da_tokenizer_splits_double_hyphen_infix(da_tokenizer):
     tokens = da_tokenizer("Mange regler--eksempelvis bindestregs-reglerne--er komplicerede.")
     assert len(tokens) == 9
     assert tokens[0].text == "Mange"
@@ -130,7 +131,7 @@ def test_tokenizer_splits_double_hyphen_infix(da_tokenizer):
     assert tokens[7].text == "komplicerede"


-def test_tokenizer_handles_posessives_and_contractions(da_tokenizer):
+def test_da_tokenizer_handles_posessives_and_contractions(da_tokenizer):
     tokens = da_tokenizer("'DBA's, Lars' og Liz' bil sku' sgu' ik' ha' en bule, det ka' han ik' li' mere', sagde hun.")
     assert len(tokens) == 25
     assert tokens[0].text == "'"
diff --git a/spacy/tests/lang/da/test_text.py b/spacy/tests/lang/da/test_text.py
index fa6a935f6..d2f0bd0c2 100644
--- a/spacy/tests/lang/da/test_text.py
+++ b/spacy/tests/lang/da/test_text.py
@@ -1,10 +1,9 @@
 # coding: utf-8
-"""Test that longer and mixed texts are tokenized correctly."""
-
-
 from __future__ import unicode_literals

 import pytest

+from spacy.lang.da.lex_attrs import like_num
+

 def test_da_tokenizer_handles_long_text(da_tokenizer):
     text = """Der var så dejligt ude på landet. Det var sommer, kornet stod gult, havren grøn,
@@ -15,6 +14,7 @@ Rundt om ager og eng var der store skove, og midt i skovene dybe søer; jo, der
     tokens = da_tokenizer(text)
     assert len(tokens) == 84

+
 @pytest.mark.parametrize('text,match', [
     ('10', True), ('1', True), ('10.000', True), ('10.00', True),
     ('999,0', True), ('en', True), ('treoghalvfemsindstyvende', True), ('hundrede', True),
@@ -22,6 +22,10 @@ Rundt om ager og eng var der store skove, og midt i skovene dybe søer; jo, der
 def test_lex_attrs_like_number(da_tokenizer, text, match):
     tokens = da_tokenizer(text)
     assert len(tokens) == 1
-    print(tokens[0])
     assert tokens[0].like_num == match
+
+
+@pytest.mark.parametrize('word', ['elleve', 'første'])
+def test_da_lex_attrs_capitals(word):
+    assert like_num(word)
+    assert like_num(word.upper())
diff --git a/spacy/tests/lang/de/test_exceptions.py b/spacy/tests/lang/de/test_exceptions.py
index f7db648c9..e66eeb781 100644
--- a/spacy/tests/lang/de/test_exceptions.py
+++ b/spacy/tests/lang/de/test_exceptions.py
@@ -1,7 +1,4 @@
 # coding: utf-8
-"""Test that tokenizer exceptions and emoticons are handles correctly."""
-
-
 from __future__ import unicode_literals

 import pytest
diff --git a/spacy/tests/lang/de/test_lemma.py b/spacy/tests/lang/de/test_lemma.py
index 56f6c20d6..6c55ed76d 100644
--- a/spacy/tests/lang/de/test_lemma.py
+++ b/spacy/tests/lang/de/test_lemma.py
@@ -4,12 +4,13 @@ from __future__ import unicode_literals

 import pytest


-@pytest.mark.parametrize('string,lemma', [('Abgehängten', 'Abgehängte'),
-                                          ('engagierte', 'engagieren'),
-                                          ('schließt', 'schließen'),
-                                          ('vorgebenden', 'vorgebend'),
-                                          ('die', 'der'),
-                                          ('Die', 'der')])
-def test_lemmatizer_lookup_assigns(de_tokenizer, string, lemma):
+@pytest.mark.parametrize('string,lemma', [
+    ('Abgehängten', 'Abgehängte'),
+    ('engagierte', 'engagieren'),
+    ('schließt', 'schließen'),
+    ('vorgebenden', 'vorgebend'),
+    ('die', 'der'),
+    ('Die', 'der')])
+def test_de_lemmatizer_lookup_assigns(de_tokenizer, string, lemma):
     tokens = de_tokenizer(string)
     assert tokens[0].lemma_ == lemma
diff --git a/spacy/tests/lang/de/test_models.py b/spacy/tests/lang/de/test_models.py
deleted file mode 100644
index 85a04a183..000000000
--- a/spacy/tests/lang/de/test_models.py
+++ /dev/null
@@ -1,77 +0,0 @@
-# coding: utf-8
-from __future__ import unicode_literals
-
-import numpy
-import pytest
-
-
-@pytest.fixture
-def example(DE):
-    """
-    This is to make sure the model works as expected. The tests make sure that
-    values are properly set. Tests are not meant to evaluate the content of the
-    output, only make sure the output is formally okay.
-    """
-    assert DE.entity != None
-    return DE('An der großen Straße stand eine merkwürdige Gestalt und führte Selbstgespräche.')
-
-
-@pytest.mark.models('de')
-def test_de_models_tokenization(example):
-    # tokenization should split the document into tokens
-    assert len(example) > 1
-
-
-@pytest.mark.xfail
-@pytest.mark.models('de')
-def test_de_models_tagging(example):
-    # if tagging was done properly, pos tags shouldn't be empty
-    assert example.is_tagged
-    assert all(t.pos != 0 for t in example)
-    assert all(t.tag != 0 for t in example)
-
-
-@pytest.mark.models('de')
-def test_de_models_parsing(example):
-    # if parsing was done properly
-    # - dependency labels shouldn't be empty
-    # - the head of some tokens should not be root
-    assert example.is_parsed
-    assert all(t.dep != 0 for t in example)
-    assert any(t.dep != i for i,t in enumerate(example))
-
-
-@pytest.mark.models('de')
-def test_de_models_ner(example):
-    # if ner was done properly, ent_iob shouldn't be empty
-    assert all([t.ent_iob != 0 for t in example])
-
-
-@pytest.mark.models('de')
-def test_de_models_vectors(example):
-    # if vectors are available, they should differ on different words
-    # this isn't a perfect test since this could in principle fail
-    # in a sane model as well,
-    # but that's very unlikely and a good indicator if something is wrong
-    vector0 = example[0].vector
-    vector1 = example[1].vector
-    vector2 = example[2].vector
-    assert not numpy.array_equal(vector0,vector1)
-    assert not numpy.array_equal(vector0,vector2)
-    assert not numpy.array_equal(vector1,vector2)
-
-
-@pytest.mark.xfail
-@pytest.mark.models('de')
-def test_de_models_probs(example):
-    # if frequencies/probabilities are okay, they should differ for
-    # different words
-    # this isn't a perfect test since this could in principle fail
-    # in a sane model as well,
-    # but that's very unlikely and a good indicator if something is wrong
-    prob0 = example[0].prob
-    prob1 = example[1].prob
-    prob2 = example[2].prob
-    assert not prob0 == prob1
-    assert not prob0 == prob2
-    assert not prob1 == prob2
diff --git a/spacy/tests/lang/de/test_parser.py b/spacy/tests/lang/de/test_parser.py
index 6b5b25901..1dc79c28f 100644
--- a/spacy/tests/lang/de/test_parser.py
+++ b/spacy/tests/lang/de/test_parser.py
@@ -3,17 +3,14 @@ from __future__ import unicode_literals

 from ...util import get_doc

-import pytest
-

 def test_de_parser_noun_chunks_standard_de(de_tokenizer):
     text = "Eine Tasse steht auf dem Tisch."
     heads = [1, 1, 0, -1, 1, -2, -4]
     tags = ['ART', 'NN', 'VVFIN', 'APPR', 'ART', 'NN', '$.']
     deps = ['nk', 'sb', 'ROOT', 'mo', 'nk', 'nk', 'punct']
-
     tokens = de_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags, deps=deps, heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
     chunks = list(doc.noun_chunks)
     assert len(chunks) == 2
     assert chunks[0].text_with_ws == "Eine Tasse "
@@ -25,9 +22,8 @@ def test_de_extended_chunk(de_tokenizer):
     heads = [1, 1, 0, -1, 1, -2, -1, -5, -6]
     tags = ['ART', 'NN', 'VVFIN', 'APPR', 'ART', 'NN', 'NN', 'NN', '$.']
     deps = ['nk', 'sb', 'ROOT', 'mo', 'nk', 'nk', 'nk', 'oa', 'punct']
-
     tokens = de_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags, deps=deps, heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
     chunks = list(doc.noun_chunks)
     assert len(chunks) == 3
     assert chunks[0].text_with_ws == "Die Sängerin "
diff --git a/spacy/tests/lang/de/test_prefix_suffix_infix.py b/spacy/tests/lang/de/test_prefix_suffix_infix.py
index bdc68037e..b76fb6e3d 100644
--- a/spacy/tests/lang/de/test_prefix_suffix_infix.py
+++ b/spacy/tests/lang/de/test_prefix_suffix_infix.py
@@ -1,86 +1,83 @@
 # coding: utf-8
-"""Test that tokenizer prefixes, suffixes and infixes are handled correctly."""
-
-
 from __future__ import unicode_literals

 import pytest


 @pytest.mark.parametrize('text', ["(unter)"])
-def test_tokenizer_splits_no_special(de_tokenizer, text):
+def test_de_tokenizer_splits_no_special(de_tokenizer, text):
     tokens = de_tokenizer(text)
     assert len(tokens) == 3


 @pytest.mark.parametrize('text', ["unter'm"])
-def test_tokenizer_splits_no_punct(de_tokenizer, text):
+def test_de_tokenizer_splits_no_punct(de_tokenizer, text):
     tokens = de_tokenizer(text)
     assert len(tokens) == 2


 @pytest.mark.parametrize('text', ["(unter'm"])
-def test_tokenizer_splits_prefix_punct(de_tokenizer, text):
+def test_de_tokenizer_splits_prefix_punct(de_tokenizer, text):
     tokens = de_tokenizer(text)
     assert len(tokens) == 3


 @pytest.mark.parametrize('text', ["unter'm)"])
-def test_tokenizer_splits_suffix_punct(de_tokenizer, text):
+def test_de_tokenizer_splits_suffix_punct(de_tokenizer, text):
     tokens = de_tokenizer(text)
     assert len(tokens) == 3


 @pytest.mark.parametrize('text', ["(unter'm)"])
-def test_tokenizer_splits_even_wrap(de_tokenizer, text):
+def test_de_tokenizer_splits_even_wrap(de_tokenizer, text):
     tokens = de_tokenizer(text)
     assert len(tokens) == 4


 @pytest.mark.parametrize('text', ["(unter'm?)"])
-def test_tokenizer_splits_uneven_wrap(de_tokenizer, text):
+def test_de_tokenizer_splits_uneven_wrap(de_tokenizer, text):
     tokens = de_tokenizer(text)
     assert len(tokens) == 5


 @pytest.mark.parametrize('text,length', [("z.B.", 1), ("zb.", 2), ("(z.B.", 2)])
-def test_tokenizer_splits_prefix_interact(de_tokenizer, text, length):
+def test_de_tokenizer_splits_prefix_interact(de_tokenizer, text, length):
     tokens = de_tokenizer(text)
     assert len(tokens) == length


 @pytest.mark.parametrize('text', ["z.B.)"])
-def test_tokenizer_splits_suffix_interact(de_tokenizer, text):
+def test_de_tokenizer_splits_suffix_interact(de_tokenizer, text):
     tokens = de_tokenizer(text)
     assert len(tokens) == 2


 @pytest.mark.parametrize('text', ["(z.B.)"])
-def test_tokenizer_splits_even_wrap_interact(de_tokenizer, text):
+def test_de_tokenizer_splits_even_wrap_interact(de_tokenizer, text):
     tokens = de_tokenizer(text)
     assert len(tokens) == 3


 @pytest.mark.parametrize('text', ["(z.B.?)"])
-def test_tokenizer_splits_uneven_wrap_interact(de_tokenizer, text):
+def test_de_tokenizer_splits_uneven_wrap_interact(de_tokenizer, text):
     tokens = de_tokenizer(text)
     assert len(tokens) == 4


 @pytest.mark.parametrize('text', ["0.1-13.5", "0.0-0.1", "103.27-300"])
-def test_tokenizer_splits_numeric_range(de_tokenizer, text):
+def test_de_tokenizer_splits_numeric_range(de_tokenizer, text):
     tokens = de_tokenizer(text)
     assert len(tokens) == 3


 @pytest.mark.parametrize('text', ["blau.Rot", "Hallo.Welt"])
-def test_tokenizer_splits_period_infix(de_tokenizer, text):
+def test_de_tokenizer_splits_period_infix(de_tokenizer, text):
     tokens = de_tokenizer(text)
     assert len(tokens) == 3


 @pytest.mark.parametrize('text', ["Hallo,Welt", "eins,zwei"])
-def test_tokenizer_splits_comma_infix(de_tokenizer, text):
+def test_de_tokenizer_splits_comma_infix(de_tokenizer, text):
     tokens = de_tokenizer(text)
     assert len(tokens) == 3
     assert tokens[0].text == text.split(",")[0]
@@ -89,18 +86,18 @@ def test_tokenizer_splits_comma_infix(de_tokenizer, text):


 @pytest.mark.parametrize('text', ["blau...Rot", "blau...rot"])
-def test_tokenizer_splits_ellipsis_infix(de_tokenizer, text):
+def test_de_tokenizer_splits_ellipsis_infix(de_tokenizer, text):
     tokens = de_tokenizer(text)
     assert len(tokens) == 3


 @pytest.mark.parametrize('text', ['Islam-Konferenz', 'Ost-West-Konflikt'])
-def test_tokenizer_keeps_hyphens(de_tokenizer, text):
+def test_de_tokenizer_keeps_hyphens(de_tokenizer, text):
     tokens = de_tokenizer(text)
     assert len(tokens) == 1


-def test_tokenizer_splits_double_hyphen_infix(de_tokenizer):
+def test_de_tokenizer_splits_double_hyphen_infix(de_tokenizer):
     tokens = de_tokenizer("Viele Regeln--wie die Bindestrich-Regeln--sind kompliziert.")
     assert len(tokens) == 10
     assert tokens[0].text == "Viele"
diff --git a/spacy/tests/lang/de/test_text.py b/spacy/tests/lang/de/test_text.py
index 34180b982..7f4097939 100644
--- a/spacy/tests/lang/de/test_text.py
+++ b/spacy/tests/lang/de/test_text.py
@@ -1,13 +1,10 @@
 # coding: utf-8
-"""Test that longer and mixed texts are tokenized correctly."""
-
-
 from __future__ import unicode_literals

 import pytest


-def test_tokenizer_handles_long_text(de_tokenizer):
+def test_de_tokenizer_handles_long_text(de_tokenizer):
     text = """Die Verwandlung

Als Gregor Samsa eines Morgens aus unruhigen Träumen erwachte, fand er sich in
@@ -29,17 +26,15 @@ Umfang kläglich dünnen Beine flimmerten ihm hilflos vor den Augen.
     "Donaudampfschifffahrtsgesellschaftskapitänsanwärterposten",
     "Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz",
     "Kraftfahrzeug-Haftpflichtversicherung",
-    "Vakuum-Mittelfrequenz-Induktionsofen"
-    ])
-def test_tokenizer_handles_long_words(de_tokenizer, text):
+    "Vakuum-Mittelfrequenz-Induktionsofen"])
+def test_de_tokenizer_handles_long_words(de_tokenizer, text):
     tokens = de_tokenizer(text)
     assert len(tokens) == 1


 @pytest.mark.parametrize('text,length', [
     ("»Was ist mit mir geschehen?«, dachte er.", 12),
-    ("“Dies frühzeitige Aufstehen”, dachte er, “macht einen ganz blödsinnig. ", 15)
-    ])
-def test_tokenizer_handles_examples(de_tokenizer, text, length):
+    ("“Dies frühzeitige Aufstehen”, dachte er, “macht einen ganz blödsinnig. ", 15)])
+def test_de_tokenizer_handles_examples(de_tokenizer, text, length):
     tokens = de_tokenizer(text)
     assert len(tokens) == length
diff --git a/spacy/tests/lang/el/test_exception.py b/spacy/tests/lang/el/test_exception.py
index 268cee4d2..ef265afee 100644
--- a/spacy/tests/lang/el/test_exception.py
+++ b/spacy/tests/lang/el/test_exception.py
@@ -1,18 +1,17 @@
-# -*- coding: utf-8 -*-
-
+# coding: utf8
 from __future__ import unicode_literals

 import pytest


 @pytest.mark.parametrize('text', ["αριθ.", "τρισ.", "δισ.", "σελ."])
-def test_tokenizer_handles_abbr(el_tokenizer, text):
+def test_el_tokenizer_handles_abbr(el_tokenizer, text):
     tokens = el_tokenizer(text)
     assert len(tokens) == 1


-def test_tokenizer_handles_exc_in_text(el_tokenizer):
+def test_el_tokenizer_handles_exc_in_text(el_tokenizer):
     text = "Στα 14 τρισ. δολάρια το κόστος από την άνοδο της στάθμης της θάλασσας."
     tokens = el_tokenizer(text)
     assert len(tokens) == 14
-    assert tokens[2].text == "τρισ."
\ No newline at end of file
+    assert tokens[2].text == "τρισ."
diff --git a/spacy/tests/lang/el/test_text.py b/spacy/tests/lang/el/test_text.py
index 367191f79..79b0b23ac 100644
--- a/spacy/tests/lang/el/test_text.py
+++ b/spacy/tests/lang/el/test_text.py
@@ -1,11 +1,10 @@
-# -*- coding: utf-8 -*-
-
+# coding: utf8
 from __future__ import unicode_literals
 
 import pytest
 
 
-def test_tokenizer_handles_long_text(el_tokenizer):
+def test_el_tokenizer_handles_long_text(el_tokenizer):
     text = """Η Ελλάδα (παλαιότερα Ελλάς), επίσημα γνωστή ως Ελληνική Δημοκρατία,\
 είναι χώρα της νοτιοανατολικής Ευρώπης στο νοτιότερο άκρο της Βαλκανικής χερσονήσου.\
 Συνορεύει στα βορειοδυτικά με την Αλβανία, στα βόρεια με την πρώην\
@@ -20,6 +19,6 @@ def test_tokenizer_handles_long_text(el_tokenizer):
     ("Η Ελλάδα είναι μία από τις χώρες της Ευρωπαϊκής Ένωσης (ΕΕ) που διαθέτει σηµαντικό ορυκτό πλούτο.", 19),
     ("Η ναυτιλία αποτέλεσε ένα σημαντικό στοιχείο της Ελληνικής οικονομικής δραστηριότητας από τα αρχαία χρόνια.", 15),
     ("Η Ελλάδα είναι μέλος σε αρκετούς διεθνείς οργανισμούς.", 9)])
-def test_tokenizer_handles_cnts(el_tokenizer,text, length):
+def test_el_tokenizer_handles_cnts(el_tokenizer,text, length):
     tokens = el_tokenizer(text)
-    assert len(tokens) == length
\ No newline at end of file
+    assert len(tokens) == length
diff --git a/spacy/tests/lang/en/test_customized_tokenizer.py b/spacy/tests/lang/en/test_customized_tokenizer.py
index 1d35fb128..c7efcb4ee 100644
--- a/spacy/tests/lang/en/test_customized_tokenizer.py
+++ b/spacy/tests/lang/en/test_customized_tokenizer.py
@@ -2,23 +2,22 @@ from __future__ import unicode_literals
 
 import pytest
-
-from ....lang.en import English
-from ....tokenizer import Tokenizer
-from .... import util
+from spacy.lang.en import English
+from spacy.tokenizer import Tokenizer
+from spacy.util import compile_prefix_regex, compile_suffix_regex
+from spacy.util import compile_infix_regex
 
 
 @pytest.fixture
 def custom_en_tokenizer(en_vocab):
-    prefix_re = util.compile_prefix_regex(English.Defaults.prefixes)
-    suffix_re = util.compile_suffix_regex(English.Defaults.suffixes)
+    prefix_re = compile_prefix_regex(English.Defaults.prefixes)
+    suffix_re = compile_suffix_regex(English.Defaults.suffixes)
     custom_infixes = ['\.\.\.+',
                       '(?<=[0-9])-(?=[0-9])',
                       # '(?<=[0-9]+),(?=[0-9]+)',
                       '[0-9]+(,[0-9]+)+',
                       '[\[\]!&:,()\*—–\/-]']
-
-    infix_re = util.compile_infix_regex(custom_infixes)
+    infix_re = compile_infix_regex(custom_infixes)
     return Tokenizer(en_vocab,
                      English.Defaults.tokenizer_exceptions,
                      prefix_re.search,
@@ -27,13 +26,12 @@ def custom_en_tokenizer(en_vocab):
                      token_match=None)
 
 
-def test_customized_tokenizer_handles_infixes(custom_en_tokenizer):
+def test_en_customized_tokenizer_handles_infixes(custom_en_tokenizer):
     sentence = "The 8 and 10-county definitions are not used for the greater Southern California Megaregion."
     context = [word.text for word in custom_en_tokenizer(sentence)]
     assert context == ['The', '8', 'and', '10', '-', 'county', 'definitions', 'are', 'not', 'used', 'for', 'the', 'greater', 'Southern', 'California', 'Megaregion', '.']
-    # the trailing '-' may cause Assertion Error
     sentence = "The 8- and 10-county definitions are not used for the greater Southern California Megaregion."
     context = [word.text for word in custom_en_tokenizer(sentence)]
diff --git a/spacy/tests/lang/en/test_exceptions.py b/spacy/tests/lang/en/test_exceptions.py
index 3115354bb..3fc14c59d 100644
--- a/spacy/tests/lang/en/test_exceptions.py
+++ b/spacy/tests/lang/en/test_exceptions.py
@@ -38,7 +38,7 @@ def test_en_tokenizer_splits_trailing_apos(en_tokenizer, text):
 
 
 @pytest.mark.parametrize('text', ["'em", "nothin'", "ol'"])
-def text_tokenizer_doesnt_split_apos_exc(en_tokenizer, text):
+def test_en_tokenizer_doesnt_split_apos_exc(en_tokenizer, text):
     tokens = en_tokenizer(text)
     assert len(tokens) == 1
     assert tokens[0].text == text
diff --git a/spacy/tests/lang/en/test_indices.py b/spacy/tests/lang/en/test_indices.py
index c8f4c4b61..8a7bc0323 100644
--- a/spacy/tests/lang/en/test_indices.py
+++ b/spacy/tests/lang/en/test_indices.py
@@ -1,11 +1,6 @@
 # coding: utf-8
-"""Test that token.idx correctly computes index into the original string."""
-
-
 from __future__ import unicode_literals
 
-import pytest
-
 
 def test_en_simple_punct(en_tokenizer):
     text = "to walk, do foo"
diff --git a/spacy/tests/lang/en/test_lemmatizer.py b/spacy/tests/lang/en/test_lemmatizer.py
deleted file mode 100644
index 169cb2695..000000000
--- a/spacy/tests/lang/en/test_lemmatizer.py
+++ /dev/null
@@ -1,63 +0,0 @@
-# coding: utf-8
-from __future__ import unicode_literals
-
-import pytest
-from ....tokens.doc import Doc
-
-
-@pytest.fixture
-def en_lemmatizer(EN):
-    return EN.Defaults.create_lemmatizer()
-
-@pytest.mark.models('en')
-def test_doc_lemmatization(EN):
-    doc = Doc(EN.vocab, words=['bleed'])
-    doc[0].tag_ = 'VBP'
-    assert doc[0].lemma_ == 'bleed'
-
-@pytest.mark.models('en')
-@pytest.mark.parametrize('text,lemmas', [("aardwolves", ["aardwolf"]),
-                                         ("aardwolf", ["aardwolf"]),
-                                         ("planets", ["planet"]),
-                                         ("ring", ["ring"]),
-                                         ("axes", ["axis", "axe", "ax"])])
-def test_en_lemmatizer_noun_lemmas(en_lemmatizer, text, lemmas):
-    assert en_lemmatizer.noun(text) == lemmas
-
-
-@pytest.mark.models('en')
-@pytest.mark.parametrize('text,lemmas', [("bleed", ["bleed"]),
-                                         ("feed", ["feed"]),
-                                         ("need", ["need"]),
-                                         ("ring", ["ring"])])
-def test_en_lemmatizer_noun_lemmas(en_lemmatizer, text, lemmas):
-    # Cases like this are problematic -- not clear what we should do to resolve
-    # ambiguity?
-    # ("axes", ["ax", "axes", "axis"])])
-    assert en_lemmatizer.noun(text) == lemmas
-
-
-@pytest.mark.xfail
-@pytest.mark.models('en')
-def test_en_lemmatizer_base_forms(en_lemmatizer):
-    assert en_lemmatizer.noun('dive', {'number': 'sing'}) == ['dive']
-    assert en_lemmatizer.noun('dive', {'number': 'plur'}) == ['diva']
-
-
-@pytest.mark.models('en')
-def test_en_lemmatizer_base_form_verb(en_lemmatizer):
-    assert en_lemmatizer.verb('saw', {'verbform': 'past'}) == ['see']
-
-
-@pytest.mark.models('en')
-def test_en_lemmatizer_punct(en_lemmatizer):
-    assert en_lemmatizer.punct('“') == ['"']
-    assert en_lemmatizer.punct('”') == ['"']
-
-
-@pytest.mark.models('en')
-def test_en_lemmatizer_lemma_assignment(EN):
-    text = "Bananas in pyjamas are geese."
-    doc = EN.make_doc(text)
-    EN.tagger(doc)
-    assert all(t.lemma_ != '' for t in doc)
diff --git a/spacy/tests/lang/en/test_models.py b/spacy/tests/lang/en/test_models.py
deleted file mode 100644
index a6006caba..000000000
--- a/spacy/tests/lang/en/test_models.py
+++ /dev/null
@@ -1,85 +0,0 @@
-# coding: utf-8
-from __future__ import unicode_literals
-
-import numpy
-import pytest
-
-
-@pytest.fixture
-def example(EN):
-    """
-    This is to make sure the model works as expected. The tests make sure that
-    values are properly set. Tests are not meant to evaluate the content of the
-    output, only make sure the output is formally okay.
- """ - assert EN.entity != None - return EN('There was a stranger standing at the big street talking to herself.') - - -@pytest.mark.models('en') -def test_en_models_tokenization(example): - # tokenization should split the document into tokens - assert len(example) > 1 - - -@pytest.mark.models('en') -def test_en_models_tagging(example): - # if tagging was done properly, pos tags shouldn't be empty - assert example.is_tagged - assert all(t.pos != 0 for t in example) - assert all(t.tag != 0 for t in example) - - -@pytest.mark.models('en') -def test_en_models_parsing(example): - # if parsing was done properly - # - dependency labels shouldn't be empty - # - the head of some tokens should not be root - assert example.is_parsed - assert all(t.dep != 0 for t in example) - assert any(t.dep != i for i,t in enumerate(example)) - - -@pytest.mark.models('en') -def test_en_models_ner(example): - # if ner was done properly, ent_iob shouldn't be empty - assert all([t.ent_iob != 0 for t in example]) - - -@pytest.mark.models('en') -def test_en_models_vectors(example): - # if vectors are available, they should differ on different words - # this isn't a perfect test since this could in principle fail - # in a sane model as well, - # but that's very unlikely and a good indicator if something is wrong - if example.vocab.vectors_length: - vector0 = example[0].vector - vector1 = example[1].vector - vector2 = example[2].vector - assert not numpy.array_equal(vector0,vector1) - assert not numpy.array_equal(vector0,vector2) - assert not numpy.array_equal(vector1,vector2) - - -@pytest.mark.xfail -@pytest.mark.models('en') -def test_en_models_probs(example): - # if frequencies/probabilities are okay, they should differ for - # different words - # this isn't a perfect test since this could in principle fail - # in a sane model as well, - # but that's very unlikely and a good indicator if something is wrong - prob0 = example[0].prob - prob1 = example[1].prob - prob2 = example[2].prob - assert 
not prob0 == prob1 - assert not prob0 == prob2 - assert not prob1 == prob2 - - -@pytest.mark.models('en') -def test_no_vectors_similarity(EN): - doc1 = EN(u'hallo') - doc2 = EN(u'hi') - assert doc1.similarity(doc2) > 0 - diff --git a/spacy/tests/lang/en/test_ner.py b/spacy/tests/lang/en/test_ner.py deleted file mode 100644 index 8a7838625..000000000 --- a/spacy/tests/lang/en/test_ner.py +++ /dev/null @@ -1,42 +0,0 @@ -from __future__ import unicode_literals, print_function -import pytest - -from spacy.attrs import LOWER -from spacy.matcher import Matcher - - -@pytest.mark.models('en') -def test_en_ner_simple_types(EN): - tokens = EN(u'Mr. Best flew to New York on Saturday morning.') - ents = list(tokens.ents) - assert ents[0].start == 1 - assert ents[0].end == 2 - assert ents[0].label_ == 'PERSON' - assert ents[1].start == 4 - assert ents[1].end == 6 - assert ents[1].label_ == 'GPE' - - -@pytest.mark.skip -@pytest.mark.models('en') -def test_en_ner_consistency_bug(EN): - '''Test an arbitrary sequence-consistency bug encountered during speed test''' - tokens = EN(u'Where rap essentially went mainstream, illustrated by seminal Public Enemy, Beastie Boys and L.L. Cool J. tracks.') - tokens = EN(u'''Charity and other short-term aid have buoyed them so far, and a tax-relief bill working its way through Congress would help. 
But the September 11 Victim Compensation Fund, enacted by Congress to discourage people from filing lawsuits, will determine the shape of their lives for years to come.\n\n''', disable=['ner']) - tokens.ents += tuple(EN.matcher(tokens)) - EN.entity(tokens) - - -@pytest.mark.skip -@pytest.mark.models('en') -def test_en_ner_unit_end_gazetteer(EN): - '''Test a bug in the interaction between the NER model and the gazetteer''' - matcher = Matcher(EN.vocab) - matcher.add('MemberNames', None, [{LOWER: 'cal'}], [{LOWER: 'cal'}, {LOWER: 'henderson'}]) - doc = EN(u'who is cal the manager of?') - if len(list(doc.ents)) == 0: - ents = matcher(doc) - assert len(ents) == 1 - doc.ents += tuple(ents) - EN.entity(doc) - assert list(doc.ents)[0].text == 'cal' diff --git a/spacy/tests/lang/en/test_noun_chunks.py b/spacy/tests/lang/en/test_noun_chunks.py index 2bfe041f9..5ef3721fe 100644 --- a/spacy/tests/lang/en/test_noun_chunks.py +++ b/spacy/tests/lang/en/test_noun_chunks.py @@ -1,22 +1,20 @@ # coding: utf-8 from __future__ import unicode_literals -from ....attrs import HEAD, DEP -from ....symbols import nsubj, dobj, amod, nmod, conj, cc, root -from ....lang.en.syntax_iterators import SYNTAX_ITERATORS -from ...util import get_doc - import numpy +from spacy.attrs import HEAD, DEP +from spacy.symbols import nsubj, dobj, amod, nmod, conj, cc, root +from spacy.lang.en.syntax_iterators import SYNTAX_ITERATORS + +from ...util import get_doc def test_en_noun_chunks_not_nested(en_tokenizer): text = "Peter has chronic command and control issues" heads = [1, 0, 4, 3, -1, -2, -5] deps = ['nsubj', 'ROOT', 'amod', 'nmod', 'cc', 'conj', 'dobj'] - tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps) - + doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps) tokens.from_array( [HEAD, DEP], numpy.asarray([[1, nsubj], [0, root], [4, amod], [3, nmod], [-1, cc], diff --git a/spacy/tests/lang/en/test_parser.py 
b/spacy/tests/lang/en/test_parser.py index 9468fe09d..566ea4295 100644 --- a/spacy/tests/lang/en/test_parser.py +++ b/spacy/tests/lang/en/test_parser.py @@ -3,58 +3,52 @@ from __future__ import unicode_literals from ...util import get_doc -import pytest - -def test_parser_noun_chunks_standard(en_tokenizer): +def test_en_parser_noun_chunks_standard(en_tokenizer): text = "A base phrase should be recognized." heads = [2, 1, 3, 2, 1, 0, -1] tags = ['DT', 'JJ', 'NN', 'MD', 'VB', 'VBN', '.'] deps = ['det', 'amod', 'nsubjpass', 'aux', 'auxpass', 'ROOT', 'punct'] - tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags, deps=deps, heads=heads) + doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads) chunks = list(doc.noun_chunks) assert len(chunks) == 1 assert chunks[0].text_with_ws == "A base phrase " -def test_parser_noun_chunks_coordinated(en_tokenizer): +def test_en_parser_noun_chunks_coordinated(en_tokenizer): text = "A base phrase and a good phrase are often the same." heads = [2, 1, 5, -1, 2, 1, -4, 0, -1, 1, -3, -4] tags = ['DT', 'NN', 'NN', 'CC', 'DT', 'JJ', 'NN', 'VBP', 'RB', 'DT', 'JJ', '.'] deps = ['det', 'compound', 'nsubj', 'cc', 'det', 'amod', 'conj', 'ROOT', 'advmod', 'det', 'attr', 'punct'] - tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags, deps=deps, heads=heads) + doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads) chunks = list(doc.noun_chunks) assert len(chunks) == 2 assert chunks[0].text_with_ws == "A base phrase " assert chunks[1].text_with_ws == "a good phrase " -def test_parser_noun_chunks_pp_chunks(en_tokenizer): +def test_en_parser_noun_chunks_pp_chunks(en_tokenizer): text = "A phrase with another phrase occurs." 
     heads = [1, 4, -1, 1, -2, 0, -1]
     tags = ['DT', 'NN', 'IN', 'DT', 'NN', 'VBZ', '.']
     deps = ['det', 'nsubj', 'prep', 'det', 'pobj', 'ROOT', 'punct']
-
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags, deps=deps, heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
     chunks = list(doc.noun_chunks)
     assert len(chunks) == 2
     assert chunks[0].text_with_ws == "A phrase "
     assert chunks[1].text_with_ws == "another phrase "
 
 
-def test_parser_noun_chunks_appositional_modifiers(en_tokenizer):
+def test_en_parser_noun_chunks_appositional_modifiers(en_tokenizer):
     text = "Sam, my brother, arrived to the house."
     heads = [5, -1, 1, -3, -4, 0, -1, 1, -2, -4]
     tags = ['NNP', ',', 'PRP$', 'NN', ',', 'VBD', 'IN', 'DT', 'NN', '.']
     deps = ['nsubj', 'punct', 'poss', 'appos', 'punct', 'ROOT', 'prep', 'det', 'pobj', 'punct']
-
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags, deps=deps, heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
     chunks = list(doc.noun_chunks)
     assert len(chunks) == 3
     assert chunks[0].text_with_ws == "Sam "
@@ -62,14 +56,13 @@ def test_parser_noun_chunks_appositional_modifiers(en_tokenizer):
     assert chunks[2].text_with_ws == "the house "
 
 
-def test_parser_noun_chunks_dative(en_tokenizer):
+def test_en_parser_noun_chunks_dative(en_tokenizer):
     text = "She gave Bob a raise."
     heads = [1, 0, -1, 1, -3, -4]
     tags = ['PRP', 'VBD', 'NNP', 'DT', 'NN', '.']
     deps = ['nsubj', 'ROOT', 'dative', 'det', 'dobj', 'punct']
-
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags, deps=deps, heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
     chunks = list(doc.noun_chunks)
     assert len(chunks) == 3
     assert chunks[0].text_with_ws == "She "
diff --git a/spacy/tests/lang/en/test_prefix_suffix_infix.py b/spacy/tests/lang/en/test_prefix_suffix_infix.py
index 042934d4e..987f7b7bc 100644
--- a/spacy/tests/lang/en/test_prefix_suffix_infix.py
+++ b/spacy/tests/lang/en/test_prefix_suffix_infix.py
@@ -1,92 +1,89 @@
 # coding: utf-8
-"""Test that tokenizer prefixes, suffixes and infixes are handled correctly."""
-
-
 from __future__ import unicode_literals
 
 import pytest
 
 
 @pytest.mark.parametrize('text', ["(can)"])
-def test_tokenizer_splits_no_special(en_tokenizer, text):
+def test_en_tokenizer_splits_no_special(en_tokenizer, text):
     tokens = en_tokenizer(text)
     assert len(tokens) == 3
 
 
 @pytest.mark.parametrize('text', ["can't"])
-def test_tokenizer_splits_no_punct(en_tokenizer, text):
+def test_en_tokenizer_splits_no_punct(en_tokenizer, text):
     tokens = en_tokenizer(text)
     assert len(tokens) == 2
 
 
 @pytest.mark.parametrize('text', ["(can't"])
-def test_tokenizer_splits_prefix_punct(en_tokenizer, text):
+def test_en_tokenizer_splits_prefix_punct(en_tokenizer, text):
    tokens = en_tokenizer(text)
    assert len(tokens) == 3
 
 
 @pytest.mark.parametrize('text', ["can't)"])
-def test_tokenizer_splits_suffix_punct(en_tokenizer, text):
+def test_en_tokenizer_splits_suffix_punct(en_tokenizer, text):
     tokens = en_tokenizer(text)
     assert len(tokens) == 3
 
 
 @pytest.mark.parametrize('text', ["(can't)"])
-def test_tokenizer_splits_even_wrap(en_tokenizer, text):
+def test_en_tokenizer_splits_even_wrap(en_tokenizer, text):
     tokens = en_tokenizer(text)
     assert len(tokens) == 4
 
 
 @pytest.mark.parametrize('text', ["(can't?)"])
-def test_tokenizer_splits_uneven_wrap(en_tokenizer, text):
+def test_en_tokenizer_splits_uneven_wrap(en_tokenizer, text):
     tokens = en_tokenizer(text)
     assert len(tokens) == 5
 
 
 @pytest.mark.parametrize('text,length', [("U.S.", 1), ("us.", 2), ("(U.S.", 2)])
-def test_tokenizer_splits_prefix_interact(en_tokenizer, text, length):
+def test_en_tokenizer_splits_prefix_interact(en_tokenizer, text, length):
     tokens = en_tokenizer(text)
     assert len(tokens) == length
 
 
 @pytest.mark.parametrize('text', ["U.S.)"])
-def test_tokenizer_splits_suffix_interact(en_tokenizer, text):
+def test_en_tokenizer_splits_suffix_interact(en_tokenizer, text):
     tokens = en_tokenizer(text)
     assert len(tokens) == 2
 
 
 @pytest.mark.parametrize('text', ["(U.S.)"])
-def test_tokenizer_splits_even_wrap_interact(en_tokenizer, text):
+def test_en_tokenizer_splits_even_wrap_interact(en_tokenizer, text):
     tokens = en_tokenizer(text)
     assert len(tokens) == 3
 
 
 @pytest.mark.parametrize('text', ["(U.S.?)"])
-def test_tokenizer_splits_uneven_wrap_interact(en_tokenizer, text):
+def test_en_tokenizer_splits_uneven_wrap_interact(en_tokenizer, text):
     tokens = en_tokenizer(text)
     assert len(tokens) == 4
 
 
 @pytest.mark.parametrize('text', ["best-known"])
-def test_tokenizer_splits_hyphens(en_tokenizer, text):
+def test_en_tokenizer_splits_hyphens(en_tokenizer, text):
     tokens = en_tokenizer(text)
     assert len(tokens) == 3
 
 
 @pytest.mark.parametrize('text', ["0.1-13.5", "0.0-0.1", "103.27-300"])
-def test_tokenizer_splits_numeric_range(en_tokenizer, text):
+def test_en_tokenizer_splits_numeric_range(en_tokenizer, text):
     tokens = en_tokenizer(text)
     assert len(tokens) == 3
 
 
 @pytest.mark.parametrize('text', ["best.Known", "Hello.World"])
-def test_tokenizer_splits_period_infix(en_tokenizer, text):
+def test_en_tokenizer_splits_period_infix(en_tokenizer, text):
     tokens = en_tokenizer(text)
     assert len(tokens) == 3
 
 
 @pytest.mark.parametrize('text', ["Hello,world", "one,two"])
-def test_tokenizer_splits_comma_infix(en_tokenizer, text):
+def test_en_tokenizer_splits_comma_infix(en_tokenizer, text):
     tokens = en_tokenizer(text)
     assert len(tokens) == 3
     assert tokens[0].text == text.split(",")[0]
@@ -95,12 +92,12 @@ def test_tokenizer_splits_comma_infix(en_tokenizer, text):
 
 
 @pytest.mark.parametrize('text', ["best...Known", "best...known"])
-def test_tokenizer_splits_ellipsis_infix(en_tokenizer, text):
+def test_en_tokenizer_splits_ellipsis_infix(en_tokenizer, text):
     tokens = en_tokenizer(text)
     assert len(tokens) == 3
 
 
-def test_tokenizer_splits_double_hyphen_infix(en_tokenizer):
+def test_en_tokenizer_splits_double_hyphen_infix(en_tokenizer):
     tokens = en_tokenizer("No decent--let alone well-bred--people.")
     assert tokens[0].text == "No"
     assert tokens[1].text == "decent"
@@ -115,7 +112,7 @@ def test_tokenizer_splits_double_hyphen_infix(en_tokenizer):
 
 
 @pytest.mark.xfail
-def test_tokenizer_splits_period_abbr(en_tokenizer):
+def test_en_tokenizer_splits_period_abbr(en_tokenizer):
     text = "Today is Tuesday.Mr."
     tokens = en_tokenizer(text)
     assert len(tokens) == 5
@@ -127,7 +124,7 @@ def test_tokenizer_splits_period_abbr(en_tokenizer):
 
 
 @pytest.mark.xfail
-def test_tokenizer_splits_em_dash_infix(en_tokenizer):
+def test_en_tokenizer_splits_em_dash_infix(en_tokenizer):
     # Re Issue #225
     tokens = en_tokenizer("""Will this road take me to Puddleton?\u2014No, """
                           """you'll have to walk there.\u2014Ariel.""")
diff --git a/spacy/tests/lang/en/test_punct.py b/spacy/tests/lang/en/test_punct.py
index 750008603..dd344ae34 100644
--- a/spacy/tests/lang/en/test_punct.py
+++ b/spacy/tests/lang/en/test_punct.py
@@ -1,13 +1,9 @@
 # coding: utf-8
-"""Test that open, closed and paired punctuation is split off correctly."""
-
-
 from __future__ import unicode_literals
 
 import pytest
-
-from ....util import compile_prefix_regex
-from ....lang.punctuation import TOKENIZER_PREFIXES
+from spacy.util import compile_prefix_regex
+from spacy.lang.punctuation import TOKENIZER_PREFIXES
 
 
 PUNCT_OPEN = ['(', '[', '{', '*']
diff --git a/spacy/tests/lang/en/test_sbd.py b/spacy/tests/lang/en/test_sbd.py
index 8378b186f..6bd1ee249 100644
--- a/spacy/tests/lang/en/test_sbd.py
+++ b/spacy/tests/lang/en/test_sbd.py
@@ -1,18 +1,17 @@
 # coding: utf-8
 from __future__ import unicode_literals
 
-from ....tokens import Doc
-from ...util import get_doc, apply_transition_sequence
-
 import pytest
 
+from ...util import get_doc, apply_transition_sequence
+
 
 @pytest.mark.parametrize('text', ["A test sentence"])
 @pytest.mark.parametrize('punct', ['.', '!', '?', ''])
 def test_en_sbd_single_punct(en_tokenizer, text, punct):
     heads = [2, 1, 0, -1] if punct else [2, 1, 0]
     tokens = en_tokenizer(text + punct)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
     assert len(doc) == 4 if punct else 3
     assert len(list(doc.sents)) == 1
     assert sum(len(sent) for sent in doc.sents) == len(doc)
@@ -26,102 +25,10 @@ def test_en_sentence_breaks(en_tokenizer, en_parser):
             'attr', 'punct']
     transition = ['L-nsubj', 'S', 'L-det', 'R-attr', 'D', 'R-punct', 'B-ROOT',
                   'L-nsubj', 'S', 'L-attr', 'R-attr', 'D', 'R-punct']
-
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
     apply_transition_sequence(en_parser, doc, transition)
-
     assert len(list(doc.sents)) == 2
     for token in doc:
         assert token.dep != 0 or token.is_space
     assert [token.head.i for token in doc ] == [1, 1, 3, 1, 1, 6, 6, 8, 6, 6]
-
-
-# Currently, there's no way of setting the serializer data for the parser
-# without loading the models, so we can't remove the model dependency here yet.
-
-@pytest.mark.xfail
-@pytest.mark.models('en')
-def test_en_sbd_serialization_projective(EN):
-    """Test that before and after serialization, the sentence boundaries are
-    the same."""
-
-    text = "I bought a couch from IKEA It wasn't very comfortable."
-    transition = ['L-nsubj', 'S', 'L-det', 'R-dobj', 'D', 'R-prep', 'R-pobj',
-                  'B-ROOT', 'L-nsubj', 'R-neg', 'D', 'S', 'L-advmod',
-                  'R-acomp', 'D', 'R-punct']
-
-    doc = EN.tokenizer(text)
-    apply_transition_sequence(EN.parser, doc, transition)
-    doc_serialized = Doc(EN.vocab).from_bytes(doc.to_bytes())
-    assert doc.is_parsed == True
-    assert doc_serialized.is_parsed == True
-    assert doc.to_bytes() == doc_serialized.to_bytes()
-    assert [s.text for s in doc.sents] == [s.text for s in doc_serialized.sents]
-
-
-TEST_CASES = [
-    pytest.mark.xfail(("Hello World. My name is Jonas.", ["Hello World.", "My name is Jonas."])),
-    ("What is your name? My name is Jonas.", ["What is your name?", "My name is Jonas."]),
-    ("There it is! I found it.", ["There it is!", "I found it."]),
-    ("My name is Jonas E. Smith.", ["My name is Jonas E. Smith."]),
-    ("Please turn to p. 55.", ["Please turn to p. 55."]),
-    ("Were Jane and co. at the party?", ["Were Jane and co. at the party?"]),
-    ("They closed the deal with Pitt, Briggs & Co. at noon.", ["They closed the deal with Pitt, Briggs & Co. at noon."]),
-    ("Let's ask Jane and co. They should know.", ["Let's ask Jane and co.", "They should know."]),
-    ("They closed the deal with Pitt, Briggs & Co. It closed yesterday.", ["They closed the deal with Pitt, Briggs & Co.", "It closed yesterday."]),
-    ("I can see Mt. Fuji from here.", ["I can see Mt. Fuji from here."]),
-    pytest.mark.xfail(("St. Michael's Church is on 5th st. near the light.", ["St. Michael's Church is on 5th st. near the light."])),
-    ("That is JFK Jr.'s book.", ["That is JFK Jr.'s book."]),
-    ("I visited the U.S.A. last year.", ["I visited the U.S.A. last year."]),
-    ("I live in the E.U. How about you?", ["I live in the E.U.", "How about you?"]),
-    ("I live in the U.S. How about you?", ["I live in the U.S.", "How about you?"]),
-    ("I work for the U.S. Government in Virginia.", ["I work for the U.S. Government in Virginia."]),
-    ("I have lived in the U.S. for 20 years.", ["I have lived in the U.S. for 20 years."]),
-    pytest.mark.xfail(("At 5 a.m. Mr. Smith went to the bank. He left the bank at 6 P.M. Mr. Smith then went to the store.", ["At 5 a.m. Mr. Smith went to the bank.", "He left the bank at 6 P.M.", "Mr. Smith then went to the store."])),
-    ("She has $100.00 in her bag.", ["She has $100.00 in her bag."]),
-    ("She has $100.00. It is in her bag.", ["She has $100.00.", "It is in her bag."]),
-    ("He teaches science (He previously worked for 5 years as an engineer.) at the local University.", ["He teaches science (He previously worked for 5 years as an engineer.) at the local University."]),
-    ("Her email is Jane.Doe@example.com. I sent her an email.", ["Her email is Jane.Doe@example.com.", "I sent her an email."]),
-    ("The site is: https://www.example.50.com/new-site/awesome_content.html. Please check it out.", ["The site is: https://www.example.50.com/new-site/awesome_content.html.", "Please check it out."]),
-    pytest.mark.xfail(("She turned to him, 'This is great.' she said.", ["She turned to him, 'This is great.' she said."])),
-    pytest.mark.xfail(('She turned to him, "This is great." she said.', ['She turned to him, "This is great." she said.'])),
-    ('She turned to him, "This is great." She held the book out to show him.', ['She turned to him, "This is great."', "She held the book out to show him."]),
-    ("Hello!! Long time no see.", ["Hello!!", "Long time no see."]),
-    ("Hello?? Who is there?", ["Hello??", "Who is there?"]),
-    ("Hello!? Is that you?", ["Hello!?", "Is that you?"]),
-    ("Hello?! Is that you?", ["Hello?!", "Is that you?"]),
-    pytest.mark.xfail(("1.) The first item 2.) The second item", ["1.) The first item", "2.) The second item"])),
-    pytest.mark.xfail(("1.) The first item. 2.) The second item.", ["1.) The first item.", "2.) The second item."])),
-    pytest.mark.xfail(("1) The first item 2) The second item", ["1) The first item", "2) The second item"])),
-    ("1) The first item. 2) The second item.", ["1) The first item.", "2) The second item."]),
-    pytest.mark.xfail(("1. The first item 2. The second item", ["1. The first item", "2. The second item"])),
-    pytest.mark.xfail(("1. The first item. 2. The second item.", ["1. The first item.", "2. The second item."])),
-    pytest.mark.xfail(("• 9. The first item • 10. The second item", ["• 9. The first item", "• 10. The second item"])),
-    pytest.mark.xfail(("⁃9. The first item ⁃10. The second item", ["⁃9. The first item", "⁃10. The second item"])),
-    pytest.mark.xfail(("a. The first item b. The second item c. The third list item", ["a. The first item", "b. The second item", "c. The third list item"])),
-    ("This is a sentence\ncut off in the middle because pdf.", ["This is a sentence\ncut off in the middle because pdf."]),
-    ("It was a cold \nnight in the city.", ["It was a cold \nnight in the city."]),
-    pytest.mark.xfail(("features\ncontact manager\nevents, activities\n", ["features", "contact manager", "events, activities"])),
-    pytest.mark.xfail(("You can find it at N°. 1026.253.553. That is where the treasure is.", ["You can find it at N°. 1026.253.553.", "That is where the treasure is."])),
-    ("She works at Yahoo! in the accounting department.", ["She works at Yahoo! in the accounting department."]),
-    ("We make a good team, you and I. Did you see Albert I. Jones yesterday?", ["We make a good team, you and I.", "Did you see Albert I. Jones yesterday?"]),
-    ("Thoreau argues that by simplifying one’s life, “the laws of the universe will appear less complex. . . .”", ["Thoreau argues that by simplifying one’s life, “the laws of the universe will appear less complex. . . .”"]),
-    pytest.mark.xfail((""""Bohr [...] used the analogy of parallel stairways [...]" (Smith 55).""", ['"Bohr [...] used the analogy of parallel stairways [...]" (Smith 55).'])),
-    ("If words are left off at the end of a sentence, and that is all that is omitted, indicate the omission with ellipsis marks (preceded and followed by a space) and then indicate the end of the sentence with a period . . . . Next sentence.", ["If words are left off at the end of a sentence, and that is all that is omitted, indicate the omission with ellipsis marks (preceded and followed by a space) and then indicate the end of the sentence with a period . . . .", "Next sentence."]),
-    ("I never meant that.... She left the store.", ["I never meant that....", "She left the store."]),
-    pytest.mark.xfail(("I wasn’t really ... well, what I mean...see . . . what I'm saying, the thing is . . . I didn’t mean it.", ["I wasn’t really ... well, what I mean...see . . . what I'm saying, the thing is . . . I didn’t mean it."])),
-    pytest.mark.xfail(("One further habit which was somewhat weakened . . . was that of combining words into self-interpreting compounds. . . . The practice was not abandoned. . . .", ["One further habit which was somewhat weakened . . . was that of combining words into self-interpreting compounds.", ". . . The practice was not abandoned. . . ."])),
-    pytest.mark.xfail(("Hello world.Today is Tuesday.Mr. Smith went to the store and bought 1,000.That is a lot.", ["Hello world.", "Today is Tuesday.", "Mr. Smith went to the store and bought 1,000.", "That is a lot."]))
-]
-
-@pytest.mark.skip
-@pytest.mark.models('en')
-@pytest.mark.parametrize('text,expected_sents', TEST_CASES)
-def test_en_sbd_prag(EN, text, expected_sents):
-    """SBD tests from Pragmatic Segmenter"""
-    doc = EN(text)
-    sents = []
-    for sent in doc.sents:
-        sents.append(''.join(doc[i].string for i in range(sent.start, sent.end)).strip())
-    assert sents == expected_sents
diff --git a/spacy/tests/lang/en/test_tagger.py b/spacy/tests/lang/en/test_tagger.py
index 0959ba7c7..a59f6f806 100644
--- a/spacy/tests/lang/en/test_tagger.py
+++ b/spacy/tests/lang/en/test_tagger.py
@@ -1,12 +1,8 @@
 # coding: utf-8
 from __future__ import unicode_literals
 
-from ....parts_of_speech import SPACE
-from ....compat import unicode_
 from ...util import get_doc
 
-import pytest
-
 
 def test_en_tagger_load_morph_exc(en_tokenizer):
     text = "I like his style."
@@ -14,47 +10,6 @@ def test_en_tagger_load_morph_exc(en_tokenizer):
     morph_exc = {'VBP': {'like': {'lemma': 'luck'}}}
     en_tokenizer.vocab.morphology.load_morph_exceptions(morph_exc)
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags)
     assert doc[1].tag_ == 'VBP'
     assert doc[1].lemma_ == 'luck'
-
-
-@pytest.mark.models('en')
-def test_tag_names(EN):
-    text = "I ate pizzas with anchovies."
-    doc = EN(text, disable=['parser'])
-    assert type(doc[2].pos) == int
-    assert isinstance(doc[2].pos_, unicode_)
-    assert isinstance(doc[2].dep_, unicode_)
-    assert doc[2].tag_ == u'NNS'
-
-
-@pytest.mark.xfail
-@pytest.mark.models('en')
-def test_en_tagger_spaces(EN):
-    """Ensure spaces are assigned the POS tag SPACE"""
-    text = "Some\nspaces are\tnecessary."
- doc = EN(text, disable=['parser']) - assert doc[0].pos != SPACE - assert doc[0].pos_ != 'SPACE' - assert doc[1].pos == SPACE - assert doc[1].pos_ == 'SPACE' - assert doc[1].tag_ == 'SP' - assert doc[2].pos != SPACE - assert doc[3].pos != SPACE - assert doc[4].pos == SPACE - - -@pytest.mark.xfail -@pytest.mark.models('en') -def test_en_tagger_return_char(EN): - """Ensure spaces are assigned the POS tag SPACE""" - text = ('hi Aaron,\r\n\r\nHow is your schedule today, I was wondering if ' - 'you had time for a phone\r\ncall this afternoon?\r\n\r\n\r\n') - tokens = EN(text) - for token in tokens: - if token.is_space: - assert token.pos == SPACE - assert tokens[3].text == '\r\n\r\n' - assert tokens[3].is_space - assert tokens[3].pos == SPACE diff --git a/spacy/tests/lang/en/test_text.py b/spacy/tests/lang/en/test_text.py index a2ffaf7ea..91a7d6e4d 100644 --- a/spacy/tests/lang/en/test_text.py +++ b/spacy/tests/lang/en/test_text.py @@ -1,10 +1,8 @@ # coding: utf-8 -"""Test that longer and mixed texts are tokenized correctly.""" - - from __future__ import unicode_literals import pytest +from spacy.lang.en.lex_attrs import like_num def test_en_tokenizer_handles_long_text(en_tokenizer): @@ -43,3 +41,9 @@ def test_lex_attrs_like_number(en_tokenizer, text, match): tokens = en_tokenizer(text) assert len(tokens) == 1 assert tokens[0].like_num == match + + +@pytest.mark.parametrize('word', ['eleven']) +def test_en_lex_attrs_capitals(word): + assert like_num(word) + assert like_num(word.upper()) diff --git a/spacy/tests/lang/es/test_exception.py b/spacy/tests/lang/es/test_exception.py index 2303e6095..f3e424e09 100644 --- a/spacy/tests/lang/es/test_exception.py +++ b/spacy/tests/lang/es/test_exception.py @@ -1,22 +1,21 @@ # coding: utf-8 - from __future__ import unicode_literals import pytest -@pytest.mark.parametrize('text,lemma', [("aprox.", "aproximadamente"), - ("esq.", "esquina"), - ("pág.", "página"), - ("p.ej.", "por ejemplo") - ]) -def 
test_tokenizer_handles_abbr(es_tokenizer, text, lemma): +@pytest.mark.parametrize('text,lemma', [ + ("aprox.", "aproximadamente"), + ("esq.", "esquina"), + ("pág.", "página"), + ("p.ej.", "por ejemplo")]) +def test_es_tokenizer_handles_abbr(es_tokenizer, text, lemma): tokens = es_tokenizer(text) assert len(tokens) == 1 assert tokens[0].lemma_ == lemma -def test_tokenizer_handles_exc_in_text(es_tokenizer): +def test_es_tokenizer_handles_exc_in_text(es_tokenizer): text = "Mariano Rajoy ha corrido aprox. medio kilómetro" tokens = es_tokenizer(text) assert len(tokens) == 7 diff --git a/spacy/tests/lang/es/test_text.py b/spacy/tests/lang/es/test_text.py index 7081ea12d..b03a9ee4a 100644 --- a/spacy/tests/lang/es/test_text.py +++ b/spacy/tests/lang/es/test_text.py @@ -1,14 +1,10 @@ # coding: utf-8 - -"""Test that longer and mixed texts are tokenized correctly.""" - - from __future__ import unicode_literals import pytest -def test_tokenizer_handles_long_text(es_tokenizer): +def test_es_tokenizer_handles_long_text(es_tokenizer): text = """Cuando a José Mujica lo invitaron a dar una conferencia en Oxford este verano, su cabeza hizo "crac". La "más antigua" universidad de habla @@ -30,6 +26,6 @@ en Montevideo y que pregona las bondades de la vida austera.""" ("""¡Sí! "Vámonos", contestó José Arcadio Buendía""", 11), ("Corrieron aprox. 
10km.", 5), ("Y entonces por qué...", 5)]) -def test_tokenizer_handles_cnts(es_tokenizer, text, length): +def test_es_tokenizer_handles_cnts(es_tokenizer, text, length): tokens = es_tokenizer(text) assert len(tokens) == length diff --git a/spacy/tests/lang/fi/test_tokenizer.py b/spacy/tests/lang/fi/test_tokenizer.py index 14858b677..ff18b9eac 100644 --- a/spacy/tests/lang/fi/test_tokenizer.py +++ b/spacy/tests/lang/fi/test_tokenizer.py @@ -11,7 +11,7 @@ ABBREVIATION_TESTS = [ @pytest.mark.parametrize('text,expected_tokens', ABBREVIATION_TESTS) -def test_tokenizer_handles_testcases(fi_tokenizer, text, expected_tokens): +def test_fi_tokenizer_handles_testcases(fi_tokenizer, text, expected_tokens): tokens = fi_tokenizer(text) token_list = [token.text for token in tokens if not token.is_space] assert expected_tokens == token_list diff --git a/spacy/tests/lang/fr/test_exceptions.py b/spacy/tests/lang/fr/test_exceptions.py index b3ae78e20..886b2c8bf 100644 --- a/spacy/tests/lang/fr/test_exceptions.py +++ b/spacy/tests/lang/fr/test_exceptions.py @@ -1,29 +1,29 @@ # coding: utf-8 - from __future__ import unicode_literals import pytest -@pytest.mark.parametrize('text', ["aujourd'hui", "Aujourd'hui", "prud'hommes", - "prud’hommal"]) -def test_tokenizer_infix_exceptions(fr_tokenizer, text): +@pytest.mark.parametrize('text', [ + "aujourd'hui", "Aujourd'hui", "prud'hommes", "prud’hommal"]) +def test_fr_tokenizer_infix_exceptions(fr_tokenizer, text): tokens = fr_tokenizer(text) assert len(tokens) == 1 -@pytest.mark.parametrize('text,lemma', [("janv.", "janvier"), - ("juill.", "juillet"), - ("Dr.", "docteur"), - ("av.", "avant"), - ("sept.", "septembre")]) -def test_tokenizer_handles_abbr(fr_tokenizer, text, lemma): +@pytest.mark.parametrize('text,lemma', [ + ("janv.", "janvier"), + ("juill.", "juillet"), + ("Dr.", "docteur"), + ("av.", "avant"), + ("sept.", "septembre")]) +def test_fr_tokenizer_handles_abbr(fr_tokenizer, text, lemma): tokens = fr_tokenizer(text) assert 
len(tokens) == 1 assert tokens[0].lemma_ == lemma -def test_tokenizer_handles_exc_in_text(fr_tokenizer): +def test_fr_tokenizer_handles_exc_in_text(fr_tokenizer): text = "Je suis allé au mois de janv. aux prud’hommes." tokens = fr_tokenizer(text) assert len(tokens) == 10 @@ -32,14 +32,15 @@ def test_tokenizer_handles_exc_in_text(fr_tokenizer): assert tokens[8].text == "prud’hommes" -def test_tokenizer_handles_exc_in_text_2(fr_tokenizer): +def test_fr_tokenizer_handles_exc_in_text_2(fr_tokenizer): text = "Cette après-midi, je suis allé dans un restaurant italo-mexicain." tokens = fr_tokenizer(text) assert len(tokens) == 11 assert tokens[1].text == "après-midi" assert tokens[9].text == "italo-mexicain" -def test_tokenizer_handles_title(fr_tokenizer): + +def test_fr_tokenizer_handles_title(fr_tokenizer): text = "N'est-ce pas génial?" tokens = fr_tokenizer(text) assert len(tokens) == 6 @@ -50,16 +51,18 @@ def test_tokenizer_handles_title(fr_tokenizer): assert tokens[2].text == "-ce" assert tokens[2].lemma_ == "ce" -def test_tokenizer_handles_title_2(fr_tokenizer): + +def test_fr_tokenizer_handles_title_2(fr_tokenizer): text = "Est-ce pas génial?" tokens = fr_tokenizer(text) assert len(tokens) == 6 assert tokens[0].text == "Est" assert tokens[0].lemma_ == "être" -def test_tokenizer_handles_title_2(fr_tokenizer): + +def test_fr_tokenizer_handles_title_3(fr_tokenizer): text = "Qu'est-ce que tu fais?"
tokens = fr_tokenizer(text) assert len(tokens) == 7 assert tokens[0].text == "Qu'" - assert tokens[0].lemma_ == "que" \ No newline at end of file + assert tokens[0].lemma_ == "que" diff --git a/spacy/tests/lang/fr/test_lemmatization.py b/spacy/tests/lang/fr/test_lemmatization.py index 49ee83531..a61ca001e 100644 --- a/spacy/tests/lang/fr/test_lemmatization.py +++ b/spacy/tests/lang/fr/test_lemmatization.py @@ -4,25 +4,25 @@ from __future__ import unicode_literals import pytest -def test_lemmatizer_verb(fr_tokenizer): +def test_fr_lemmatizer_verb(fr_tokenizer): tokens = fr_tokenizer("Qu'est-ce que tu fais?") assert tokens[0].lemma_ == "que" assert tokens[1].lemma_ == "être" assert tokens[5].lemma_ == "faire" -def test_lemmatizer_noun_verb_2(fr_tokenizer): +def test_fr_lemmatizer_noun_verb_2(fr_tokenizer): tokens = fr_tokenizer("Les abaissements de température sont gênants.") assert tokens[4].lemma_ == "être" @pytest.mark.xfail(reason="Costaricienne TAG is PROPN instead of NOUN and spacy don't lemmatize PROPN") -def test_lemmatizer_noun(fr_tokenizer): +def test_fr_lemmatizer_noun(fr_tokenizer): tokens = fr_tokenizer("il y a des Costaricienne.") assert tokens[4].lemma_ == "Costaricain" -def test_lemmatizer_noun_2(fr_tokenizer): +def test_fr_lemmatizer_noun_2(fr_tokenizer): tokens = fr_tokenizer("Les abaissements de température sont gênants.") assert tokens[1].lemma_ == "abaissement" assert tokens[5].lemma_ == "gênant" diff --git a/spacy/tests/lang/fr/test_prefix_suffix_infix.py b/spacy/tests/lang/fr/test_prefix_suffix_infix.py new file mode 100644 index 000000000..b9fb7bbb1 --- /dev/null +++ b/spacy/tests/lang/fr/test_prefix_suffix_infix.py @@ -0,0 +1,23 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest +from spacy.language import Language +from spacy.lang.punctuation import TOKENIZER_INFIXES +from spacy.lang.char_classes import ALPHA + + +@pytest.mark.parametrize('text,expected_tokens', [ + ("l'avion", ["l'", "avion"]), ("j'ai", ["j'", 
"ai"])]) +def test_issue768(text, expected_tokens): + """Allow zero-width 'infix' token during the tokenization process.""" + SPLIT_INFIX = r'(?<=[{a}]\')(?=[{a}])'.format(a=ALPHA) + + class FrenchTest(Language): + class Defaults(Language.Defaults): + infixes = TOKENIZER_INFIXES + [SPLIT_INFIX] + + fr_tokenizer_w_infix = FrenchTest.Defaults.create_tokenizer() + tokens = fr_tokenizer_w_infix(text) + assert len(tokens) == 2 + assert [t.text for t in tokens] == expected_tokens diff --git a/spacy/tests/lang/fr/test_text.py b/spacy/tests/lang/fr/test_text.py index 94a12e3b6..76db57f71 100644 --- a/spacy/tests/lang/fr/test_text.py +++ b/spacy/tests/lang/fr/test_text.py @@ -1,6 +1,9 @@ # coding: utf8 from __future__ import unicode_literals +import pytest +from spacy.lang.fr.lex_attrs import like_num + def test_tokenizer_handles_long_text(fr_tokenizer): text = """L'histoire du TAL commence dans les années 1950, bien que l'on puisse \ @@ -12,6 +15,11 @@ un humain dans une conversation écrite en temps réel, de façon suffisamment \ convaincante que l'interlocuteur humain ne peut distinguer sûrement — sur la \ base du seul contenu de la conversation — s'il interagit avec un programme \ ou avec un autre vrai humain.""" - tokens = fr_tokenizer(text) assert len(tokens) == 113 + + +@pytest.mark.parametrize('word', ['onze', 'onzième']) +def test_fr_lex_attrs_capitals(word): + assert like_num(word) + assert like_num(word.upper()) diff --git a/spacy/tests/lang/ga/test_tokenizer.py b/spacy/tests/lang/ga/test_tokenizer.py index 85c44f3ef..db490315a 100644 --- a/spacy/tests/lang/ga/test_tokenizer.py +++ b/spacy/tests/lang/ga/test_tokenizer.py @@ -11,7 +11,7 @@ GA_TOKEN_EXCEPTION_TESTS = [ @pytest.mark.parametrize('text,expected_tokens', GA_TOKEN_EXCEPTION_TESTS) -def test_tokenizer_handles_exception_cases(ga_tokenizer, text, expected_tokens): +def test_ga_tokenizer_handles_exception_cases(ga_tokenizer, text, expected_tokens): tokens = ga_tokenizer(text) token_list = [token.text for 
token in tokens if not token.is_space] assert expected_tokens == token_list diff --git a/spacy/tests/lang/he/test_tokenizer.py b/spacy/tests/lang/he/test_tokenizer.py index 62ae84223..b3672c652 100644 --- a/spacy/tests/lang/he/test_tokenizer.py +++ b/spacy/tests/lang/he/test_tokenizer.py @@ -6,7 +6,7 @@ import pytest @pytest.mark.parametrize('text,expected_tokens', [('פייתון היא שפת תכנות דינמית', ['פייתון', 'היא', 'שפת', 'תכנות', 'דינמית'])]) -def test_tokenizer_handles_abbreviation(he_tokenizer, text, expected_tokens): +def test_he_tokenizer_handles_abbreviation(he_tokenizer, text, expected_tokens): tokens = he_tokenizer(text) token_list = [token.text for token in tokens if not token.is_space] assert expected_tokens == token_list @@ -18,6 +18,6 @@ def test_tokenizer_handles_abbreviation(he_tokenizer, text, expected_tokens): ('עקבת אחריו בכל רחבי המדינה!', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '!']), ('עקבת אחריו בכל רחבי המדינה..', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '..']), ('עקבת אחריו בכל רחבי המדינה...', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '...'])]) -def test_tokenizer_handles_punct(he_tokenizer, text, expected_tokens): +def test_he_tokenizer_handles_punct(he_tokenizer, text, expected_tokens): tokens = he_tokenizer(text) assert expected_tokens == [token.text for token in tokens] diff --git a/spacy/tests/lang/hu/test_tokenizer.py b/spacy/tests/lang/hu/test_tokenizer.py index 5845b8614..ad725b2f9 100644 --- a/spacy/tests/lang/hu/test_tokenizer.py +++ b/spacy/tests/lang/hu/test_tokenizer.py @@ -3,6 +3,7 @@ from __future__ import unicode_literals import pytest + DEFAULT_TESTS = [ ('N. 
kormányzósági\nszékhely.', ['N.', 'kormányzósági', 'székhely', '.']), pytest.mark.xfail(('A .hu egy tld.', ['A', '.hu', 'egy', 'tld', '.'])), @@ -277,7 +278,7 @@ TESTCASES = DEFAULT_TESTS + DOT_TESTS + QUOTE_TESTS + NUMBER_TESTS + HYPHEN_TEST @pytest.mark.parametrize('text,expected_tokens', TESTCASES) -def test_tokenizer_handles_testcases(hu_tokenizer, text, expected_tokens): +def test_hu_tokenizer_handles_testcases(hu_tokenizer, text, expected_tokens): tokens = hu_tokenizer(text) token_list = [token.text for token in tokens if not token.is_space] assert expected_tokens == token_list diff --git a/spacy/tests/lang/id/test_prefix_suffix_infix.py b/spacy/tests/lang/id/test_prefix_suffix_infix.py index 539fd1a13..125213fb0 100644 --- a/spacy/tests/lang/id/test_prefix_suffix_infix.py +++ b/spacy/tests/lang/id/test_prefix_suffix_infix.py @@ -1,38 +1,35 @@ # coding: utf-8 -"""Test that tokenizer prefixes, suffixes and infixes are handled correctly.""" - - from __future__ import unicode_literals import pytest @pytest.mark.parametrize('text', ["(Ma'arif)"]) -def test_tokenizer_splits_no_special(id_tokenizer, text): +def test_id_tokenizer_splits_no_special(id_tokenizer, text): tokens = id_tokenizer(text) assert len(tokens) == 3 @pytest.mark.parametrize('text', ["Ma'arif"]) -def test_tokenizer_splits_no_punct(id_tokenizer, text): +def test_id_tokenizer_splits_no_punct(id_tokenizer, text): tokens = id_tokenizer(text) assert len(tokens) == 1 @pytest.mark.parametrize('text', ["(Ma'arif"]) -def test_tokenizer_splits_prefix_punct(id_tokenizer, text): +def test_id_tokenizer_splits_prefix_punct(id_tokenizer, text): tokens = id_tokenizer(text) assert len(tokens) == 2 @pytest.mark.parametrize('text', ["Ma'arif)"]) -def test_tokenizer_splits_suffix_punct(id_tokenizer, text): +def test_id_tokenizer_splits_suffix_punct(id_tokenizer, text): tokens = id_tokenizer(text) assert len(tokens) == 2 @pytest.mark.parametrize('text', ["(Ma'arif)"]) -def test_tokenizer_splits_even_wrap(id_tokenizer, 
text): +def test_id_tokenizer_splits_even_wrap(id_tokenizer, text): tokens = id_tokenizer(text) assert len(tokens) == 3 @@ -44,49 +41,49 @@ def test_tokenizer_splits_uneven_wrap(id_tokenizer, text): @pytest.mark.parametrize('text,length', [("S.Kom.", 1), ("SKom.", 2), ("(S.Kom.", 2)]) -def test_tokenizer_splits_prefix_interact(id_tokenizer, text, length): +def test_id_tokenizer_splits_prefix_interact(id_tokenizer, text, length): tokens = id_tokenizer(text) assert len(tokens) == length @pytest.mark.parametrize('text', ["S.Kom.)"]) -def test_tokenizer_splits_suffix_interact(id_tokenizer, text): +def test_id_tokenizer_splits_suffix_interact(id_tokenizer, text): tokens = id_tokenizer(text) assert len(tokens) == 2 @pytest.mark.parametrize('text', ["(S.Kom.)"]) -def test_tokenizer_splits_even_wrap_interact(id_tokenizer, text): +def test_id_tokenizer_splits_even_wrap_interact(id_tokenizer, text): tokens = id_tokenizer(text) assert len(tokens) == 3 @pytest.mark.parametrize('text', ["(S.Kom.?)"]) -def test_tokenizer_splits_uneven_wrap_interact(id_tokenizer, text): +def test_id_tokenizer_splits_uneven_wrap_interact(id_tokenizer, text): tokens = id_tokenizer(text) assert len(tokens) == 4 @pytest.mark.parametrize('text,length', [("gara-gara", 1), ("Jokowi-Ahok", 3), ("Sukarno-Hatta", 3)]) -def test_tokenizer_splits_hyphens(id_tokenizer, text, length): +def test_id_tokenizer_splits_hyphens(id_tokenizer, text, length): tokens = id_tokenizer(text) assert len(tokens) == length @pytest.mark.parametrize('text', ["0.1-13.5", "0.0-0.1", "103.27-300"]) -def test_tokenizer_splits_numeric_range(id_tokenizer, text): +def test_id_tokenizer_splits_numeric_range(id_tokenizer, text): tokens = id_tokenizer(text) assert len(tokens) == 3 @pytest.mark.parametrize('text', ["ini.Budi", "Halo.Bandung"]) -def test_tokenizer_splits_period_infix(id_tokenizer, text): +def test_id_tokenizer_splits_period_infix(id_tokenizer, text): tokens = id_tokenizer(text) assert len(tokens) == 3 
@pytest.mark.parametrize('text', ["Halo,Bandung", "satu,dua"]) -def test_tokenizer_splits_comma_infix(id_tokenizer, text): +def test_id_tokenizer_splits_comma_infix(id_tokenizer, text): tokens = id_tokenizer(text) assert len(tokens) == 3 assert tokens[0].text == text.split(",")[0] @@ -95,12 +92,12 @@ def test_tokenizer_splits_comma_infix(id_tokenizer, text): @pytest.mark.parametrize('text', ["halo...Bandung", "dia...pergi"]) -def test_tokenizer_splits_ellipsis_infix(id_tokenizer, text): +def test_id_tokenizer_splits_ellipsis_infix(id_tokenizer, text): tokens = id_tokenizer(text) assert len(tokens) == 3 -def test_tokenizer_splits_double_hyphen_infix(id_tokenizer): +def test_id_tokenizer_splits_double_hyphen_infix(id_tokenizer): tokens = id_tokenizer("Arsene Wenger--manajer Arsenal--melakukan konferensi pers.") assert len(tokens) == 10 assert tokens[0].text == "Arsene" diff --git a/spacy/tests/lang/id/test_text.py b/spacy/tests/lang/id/test_text.py new file mode 100644 index 000000000..947804162 --- /dev/null +++ b/spacy/tests/lang/id/test_text.py @@ -0,0 +1,11 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest +from spacy.lang.id.lex_attrs import like_num + + +@pytest.mark.parametrize('word', ['sebelas']) +def test_id_lex_attrs_capitals(word): + assert like_num(word) + assert like_num(word.upper()) diff --git a/spacy/tests/lang/ja/test_lemma.py b/spacy/tests/lang/ja/test_lemma.py deleted file mode 100644 index 9730b8b78..000000000 --- a/spacy/tests/lang/ja/test_lemma.py +++ /dev/null @@ -1,18 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -import pytest - -LEMMAS = ( - ('新しく', '新しい'), - ('赤く', '赤い'), - ('すごく', '凄い'), - ('いただきました', '頂く'), - ('なった', '成る')) - -@pytest.mark.parametrize('word,lemma', LEMMAS) -def test_japanese_lemmas(JA, word, lemma): - test_lemma = JA(word)[0].lemma_ - assert test_lemma == lemma - - diff --git a/spacy/tests/lang/ja/test_lemmatization.py b/spacy/tests/lang/ja/test_lemmatization.py new file 
mode 100644 index 000000000..52f3535c0 --- /dev/null +++ b/spacy/tests/lang/ja/test_lemmatization.py @@ -0,0 +1,15 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest + + +@pytest.mark.parametrize('word,lemma', [ + ('新しく', '新しい'), + ('赤く', '赤い'), + ('すごく', '凄い'), + ('いただきました', '頂く'), + ('なった', '成る')]) +def test_ja_lemmatizer_assigns(ja_tokenizer, word, lemma): + test_lemma = ja_tokenizer(word)[0].lemma_ + assert test_lemma == lemma diff --git a/spacy/tests/lang/ja/test_tokenizer.py b/spacy/tests/lang/ja/test_tokenizer.py index e79c3a5ab..345429c9e 100644 --- a/spacy/tests/lang/ja/test_tokenizer.py +++ b/spacy/tests/lang/ja/test_tokenizer.py @@ -5,41 +5,43 @@ import pytest TOKENIZER_TESTS = [ - ("日本語だよ", ['日本', '語', 'だ', 'よ']), - ("東京タワーの近くに住んでいます。", ['東京', 'タワー', 'の', '近く', 'に', '住ん', 'で', 'い', 'ます', '。']), - ("吾輩は猫である。", ['吾輩', 'は', '猫', 'で', 'ある', '。']), - ("月に代わって、お仕置きよ!", ['月', 'に', '代わっ', 'て', '、', 'お', '仕置き', 'よ', '!']), - ("すもももももももものうち", ['すもも', 'も', 'もも', 'も', 'もも', 'の', 'うち']) + ("日本語だよ", ['日本', '語', 'だ', 'よ']), + ("東京タワーの近くに住んでいます。", ['東京', 'タワー', 'の', '近く', 'に', '住ん', 'で', 'い', 'ます', '。']), + ("吾輩は猫である。", ['吾輩', 'は', '猫', 'で', 'ある', '。']), + ("月に代わって、お仕置きよ!", ['月', 'に', '代わっ', 'て', '、', 'お', '仕置き', 'よ', '!']), + ("すもももももももものうち", ['すもも', 'も', 'もも', 'も', 'もも', 'の', 'うち']) ] TAG_TESTS = [ - ("日本語だよ", ['日本語だよ', '名詞-固有名詞-地名-国', '名詞-普通名詞-一般', '助動詞', '助詞-終助詞']), - ("東京タワーの近くに住んでいます。", ['名詞-固有名詞-地名-一般', '名詞-普通名詞-一般', '助詞-格助詞', '名詞-普通名詞-副詞可能', '助詞-格助詞', '動詞-一般', '助詞-接続助詞', '動詞-非自立可能', '助動詞', '補助記号-句点']), - ("吾輩は猫である。", ['代名詞', '助詞-係助詞', '名詞-普通名詞-一般', '助動詞', '動詞-非自立可能', '補助記号-句点']), - ("月に代わって、お仕置きよ!", ['名詞-普通名詞-助数詞可能', '助詞-格助詞', '動詞-一般', '助詞-接続助詞', '補助記号-読点', '接頭辞', '名詞-普通名詞-一般', '助詞-終助詞', '補助記号-句点 ']), - ("すもももももももものうち", ['名詞-普通名詞-一般', '助詞-係助詞', '名詞-普通名詞-一般', '助詞-係助詞', '名詞-普通名詞-一般', '助詞-格助詞', '名詞-普通名詞-副詞可能']) + ("日本語だよ", ['日本語だよ', '名詞-固有名詞-地名-国', '名詞-普通名詞-一般', '助動詞', '助詞-終助詞']), + ("東京タワーの近くに住んでいます。", ['名詞-固有名詞-地名-一般', '名詞-普通名詞-一般', 
'助詞-格助詞', '名詞-普通名詞-副詞可能', '助詞-格助詞', '動詞-一般', '助詞-接続助詞', '動詞-非自立可能', '助動詞', '補助記号-句点']), + ("吾輩は猫である。", ['代名詞', '助詞-係助詞', '名詞-普通名詞-一般', '助動詞', '動詞-非自立可能', '補助記号-句点']), + ("月に代わって、お仕置きよ!", ['名詞-普通名詞-助数詞可能', '助詞-格助詞', '動詞-一般', '助詞-接続助詞', '補助記号-読点', '接頭辞', '名詞-普通名詞-一般', '助詞-終助詞', '補助記号-句点 ']), + ("すもももももももものうち", ['名詞-普通名詞-一般', '助詞-係助詞', '名詞-普通名詞-一般', '助詞-係助詞', '名詞-普通名詞-一般', '助詞-格助詞', '名詞-普通名詞-副詞可能']) ] POS_TESTS = [ - ('日本語だよ', ['PROPN', 'NOUN', 'AUX', 'PART']), - ('東京タワーの近くに住んでいます。', ['PROPN', 'NOUN', 'ADP', 'NOUN', 'ADP', 'VERB', 'SCONJ', 'VERB', 'AUX', 'PUNCT']), - ('吾輩は猫である。', ['PRON', 'ADP', 'NOUN', 'AUX', 'VERB', 'PUNCT']), - ('月に代わって、お仕置きよ!', ['NOUN', 'ADP', 'VERB', 'SCONJ', 'PUNCT', 'NOUN', 'NOUN', 'PART', 'PUNCT']), - ('すもももももももものうち', ['NOUN', 'ADP', 'NOUN', 'ADP', 'NOUN', 'ADP', 'NOUN']) + ('日本語だよ', ['PROPN', 'NOUN', 'AUX', 'PART']), + ('東京タワーの近くに住んでいます。', ['PROPN', 'NOUN', 'ADP', 'NOUN', 'ADP', 'VERB', 'SCONJ', 'VERB', 'AUX', 'PUNCT']), + ('吾輩は猫である。', ['PRON', 'ADP', 'NOUN', 'AUX', 'VERB', 'PUNCT']), + ('月に代わって、お仕置きよ!', ['NOUN', 'ADP', 'VERB', 'SCONJ', 'PUNCT', 'NOUN', 'NOUN', 'PART', 'PUNCT']), + ('すもももももももものうち', ['NOUN', 'ADP', 'NOUN', 'ADP', 'NOUN', 'ADP', 'NOUN']) ] @pytest.mark.parametrize('text,expected_tokens', TOKENIZER_TESTS) -def test_japanese_tokenizer(ja_tokenizer, text, expected_tokens): +def test_ja_tokenizer(ja_tokenizer, text, expected_tokens): tokens = [token.text for token in ja_tokenizer(text)] assert tokens == expected_tokens + @pytest.mark.parametrize('text,expected_tags', TAG_TESTS) -def test_japanese_tokenizer(ja_tokenizer, text, expected_tags): +def test_ja_tokenizer_tags(ja_tokenizer, text, expected_tags): tags = [token.tag_ for token in ja_tokenizer(text)] assert tags == expected_tags + @pytest.mark.parametrize('text,expected_pos', POS_TESTS) -def test_japanese_tokenizer(ja_tokenizer, text, expected_pos): +def test_ja_tokenizer_pos(ja_tokenizer, text, expected_pos): pos = [token.pos_ for token in ja_tokenizer(text)] assert pos ==
expected_pos diff --git a/spacy/tests/lang/nb/test_tokenizer.py b/spacy/tests/lang/nb/test_tokenizer.py index b55901339..806bae136 100644 --- a/spacy/tests/lang/nb/test_tokenizer.py +++ b/spacy/tests/lang/nb/test_tokenizer.py @@ -11,7 +11,7 @@ NB_TOKEN_EXCEPTION_TESTS = [ @pytest.mark.parametrize('text,expected_tokens', NB_TOKEN_EXCEPTION_TESTS) -def test_tokenizer_handles_exception_cases(nb_tokenizer, text, expected_tokens): +def test_nb_tokenizer_handles_exception_cases(nb_tokenizer, text, expected_tokens): tokens = nb_tokenizer(text) token_list = [token.text for token in tokens if not token.is_space] assert expected_tokens == token_list diff --git a/spacy/tests/gold/__init__.py b/spacy/tests/lang/nl/__init__.py similarity index 100% rename from spacy/tests/gold/__init__.py rename to spacy/tests/lang/nl/__init__.py diff --git a/spacy/tests/lang/nl/test_text.py b/spacy/tests/lang/nl/test_text.py new file mode 100644 index 000000000..f98d1d105 --- /dev/null +++ b/spacy/tests/lang/nl/test_text.py @@ -0,0 +1,11 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest +from spacy.lang.nl.lex_attrs import like_num + + +@pytest.mark.parametrize('word', ['elf', 'elfde']) +def test_nl_lex_attrs_capitals(word): + assert like_num(word) + assert like_num(word.upper()) diff --git a/spacy/tests/stringstore/__init__.py b/spacy/tests/lang/pt/__init__.py similarity index 100% rename from spacy/tests/stringstore/__init__.py rename to spacy/tests/lang/pt/__init__.py diff --git a/spacy/tests/lang/pt/test_text.py b/spacy/tests/lang/pt/test_text.py new file mode 100644 index 000000000..8e6fecc45 --- /dev/null +++ b/spacy/tests/lang/pt/test_text.py @@ -0,0 +1,11 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest +from spacy.lang.pt.lex_attrs import like_num + + +@pytest.mark.parametrize('word', ['onze', 'quadragésimo']) +def test_pt_lex_attrs_capitals(word): + assert like_num(word) + assert like_num(word.upper()) diff --git 
a/spacy/tests/lang/ro/test_lemmatizer.py b/spacy/tests/lang/ro/test_lemmatizer.py index bcb1e82d9..8d238bcd5 100644 --- a/spacy/tests/lang/ro/test_lemmatizer.py +++ b/spacy/tests/lang/ro/test_lemmatizer.py @@ -4,10 +4,11 @@ from __future__ import unicode_literals import pytest -@pytest.mark.parametrize('string,lemma', [('câini', 'câine'), - ('expedițiilor', 'expediție'), - ('pensete', 'pensetă'), - ('erau', 'fi')]) -def test_lemmatizer_lookup_assigns(ro_tokenizer, string, lemma): +@pytest.mark.parametrize('string,lemma', [ + ('câini', 'câine'), + ('expedițiilor', 'expediție'), + ('pensete', 'pensetă'), + ('erau', 'fi')]) +def test_ro_lemmatizer_lookup_assigns(ro_tokenizer, string, lemma): tokens = ro_tokenizer(string) assert tokens[0].lemma_ == lemma diff --git a/spacy/tests/lang/ro/test_tokenizer.py b/spacy/tests/lang/ro/test_tokenizer.py index e754eaeae..6ed3f2c90 100644 --- a/spacy/tests/lang/ro/test_tokenizer.py +++ b/spacy/tests/lang/ro/test_tokenizer.py @@ -3,23 +3,20 @@ from __future__ import unicode_literals import pytest -DEFAULT_TESTS = [ + +TEST_CASES = [ ('Adresa este str. Principală nr. 5.', ['Adresa', 'este', 'str.', 'Principală', 'nr.', '5', '.']), ('Teste, etc.', ['Teste', ',', 'etc.']), ('Lista, ș.a.m.d.', ['Lista', ',', 'ș.a.m.d.']), - ('Și d.p.d.v. al...', ['Și', 'd.p.d.v.', 'al', '...']) -] - -NUMBER_TESTS = [ + ('Și d.p.d.v. 
al...', ['Și', 'd.p.d.v.', 'al', '...']), + # number tests ('Clasa a 4-a.', ['Clasa', 'a', '4-a', '.']), ('Al 12-lea ceas.', ['Al', '12-lea', 'ceas', '.']) ] -TESTCASES = DEFAULT_TESTS + NUMBER_TESTS - -@pytest.mark.parametrize('text,expected_tokens', TESTCASES) -def test_tokenizer_handles_testcases(ro_tokenizer, text, expected_tokens): +@pytest.mark.parametrize('text,expected_tokens', TEST_CASES) +def test_ro_tokenizer_handles_testcases(ro_tokenizer, text, expected_tokens): tokens = ro_tokenizer(text) token_list = [token.text for token in tokens if not token.is_space] assert expected_tokens == token_list diff --git a/spacy/tests/lang/ru/test_exceptions.py b/spacy/tests/lang/ru/test_exceptions.py new file mode 100644 index 000000000..ea731df44 --- /dev/null +++ b/spacy/tests/lang/ru/test_exceptions.py @@ -0,0 +1,14 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest + + +@pytest.mark.parametrize('text,norms', [ + ("пн.", ["понедельник"]), + ("пт.", ["пятница"]), + ("дек.", ["декабрь"])]) +def test_ru_tokenizer_abbrev_exceptions(ru_tokenizer, text, norms): + tokens = ru_tokenizer(text) + assert len(tokens) == 1 + assert [token.norm_ for token in tokens] == norms diff --git a/spacy/tests/lang/ru/test_lemmatizer.py b/spacy/tests/lang/ru/test_lemmatizer.py index 2d9dd8b85..21c0923d7 100644 --- a/spacy/tests/lang/ru/test_lemmatizer.py +++ b/spacy/tests/lang/ru/test_lemmatizer.py @@ -2,70 +2,62 @@ from __future__ import unicode_literals import pytest -from ....tokens.doc import Doc +from spacy.lang.ru import Russian + +from ...util import get_doc @pytest.fixture -def ru_lemmatizer(RU): - return RU.Defaults.create_lemmatizer() +def ru_lemmatizer(): + pymorphy = pytest.importorskip('pymorphy2') + return Russian.Defaults.create_lemmatizer() -@pytest.mark.models('ru') -def test_doc_lemmatization(RU): - doc = Doc(RU.vocab, words=['мама', 'мыла', 'раму']) - doc[0].tag_ = 'NOUN__Animacy=Anim|Case=Nom|Gender=Fem|Number=Sing' - doc[1].tag_ = 
'VERB__Aspect=Imp|Gender=Fem|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act' - doc[2].tag_ = 'NOUN__Animacy=Anim|Case=Acc|Gender=Fem|Number=Sing' - +def test_ru_doc_lemmatization(ru_tokenizer): + words = ['мама', 'мыла', 'раму'] + tags = ['NOUN__Animacy=Anim|Case=Nom|Gender=Fem|Number=Sing', + 'VERB__Aspect=Imp|Gender=Fem|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act', + 'NOUN__Animacy=Anim|Case=Acc|Gender=Fem|Number=Sing'] + doc = get_doc(ru_tokenizer.vocab, words=words, tags=tags) lemmas = [token.lemma_ for token in doc] assert lemmas == ['мама', 'мыть', 'рама'] -@pytest.mark.models('ru') -@pytest.mark.parametrize('text,lemmas', [('гвоздики', ['гвоздик', 'гвоздика']), - ('люди', ['человек']), - ('реки', ['река']), - ('кольцо', ['кольцо']), - ('пепперони', ['пепперони'])]) +@pytest.mark.parametrize('text,lemmas', [ + ('гвоздики', ['гвоздик', 'гвоздика']), + ('люди', ['человек']), + ('реки', ['река']), + ('кольцо', ['кольцо']), + ('пепперони', ['пепперони'])]) def test_ru_lemmatizer_noun_lemmas(ru_lemmatizer, text, lemmas): assert sorted(ru_lemmatizer.noun(text)) == lemmas @pytest.mark.models('ru') -@pytest.mark.parametrize('text,pos,morphology,lemma', [('рой', 'NOUN', None, 'рой'), - ('рой', 'VERB', None, 'рыть'), - ('клей', 'NOUN', None, 'клей'), - ('клей', 'VERB', None, 'клеить'), - ('три', 'NUM', None, 'три'), - ('кос', 'NOUN', {'Number': 'Sing'}, 'кос'), - ('кос', 'NOUN', {'Number': 'Plur'}, 'коса'), - ('кос', 'ADJ', None, 'косой'), - ('потом', 'NOUN', None, 'пот'), - ('потом', 'ADV', None, 'потом') - ]) +@pytest.mark.parametrize('text,pos,morphology,lemma', [ + ('рой', 'NOUN', None, 'рой'), + ('рой', 'VERB', None, 'рыть'), + ('клей', 'NOUN', None, 'клей'), + ('клей', 'VERB', None, 'клеить'), + ('три', 'NUM', None, 'три'), + ('кос', 'NOUN', {'Number': 'Sing'}, 'кос'), + ('кос', 'NOUN', {'Number': 'Plur'}, 'коса'), + ('кос', 'ADJ', None, 'косой'), + ('потом', 'NOUN', None, 'пот'), + ('потом', 'ADV', None, 'потом')]) def 
test_ru_lemmatizer_works_with_different_pos_homonyms(ru_lemmatizer, text, pos, morphology, lemma): assert ru_lemmatizer(text, pos, morphology) == [lemma] -@pytest.mark.models('ru') -@pytest.mark.parametrize('text,morphology,lemma', [('гвоздики', {'Gender': 'Fem'}, 'гвоздика'), - ('гвоздики', {'Gender': 'Masc'}, 'гвоздик'), - ('вина', {'Gender': 'Fem'}, 'вина'), - ('вина', {'Gender': 'Neut'}, 'вино') - ]) +@pytest.mark.parametrize('text,morphology,lemma', [ + ('гвоздики', {'Gender': 'Fem'}, 'гвоздика'), + ('гвоздики', {'Gender': 'Masc'}, 'гвоздик'), + ('вина', {'Gender': 'Fem'}, 'вина'), + ('вина', {'Gender': 'Neut'}, 'вино')]) def test_ru_lemmatizer_works_with_noun_homonyms(ru_lemmatizer, text, morphology, lemma): assert ru_lemmatizer.noun(text, morphology) == [lemma] -@pytest.mark.models('ru') def test_ru_lemmatizer_punct(ru_lemmatizer): assert ru_lemmatizer.punct('«') == ['"'] assert ru_lemmatizer.punct('»') == ['"'] - - -# @pytest.mark.models('ru') -# def test_ru_lemmatizer_lemma_assignment(RU): -# text = "А роза упала на лапу Азора." 
-# doc = RU.make_doc(text) -# RU.tagger(doc) -# assert all(t.lemma_ != '' for t in doc) diff --git a/spacy/tests/lang/ru/test_text.py b/spacy/tests/lang/ru/test_text.py new file mode 100644 index 000000000..6d26988a9 --- /dev/null +++ b/spacy/tests/lang/ru/test_text.py @@ -0,0 +1,11 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest +from spacy.lang.ru.lex_attrs import like_num + + +@pytest.mark.parametrize('word', ['одиннадцать']) +def test_ru_lex_attrs_capitals(word): + assert like_num(word) + assert like_num(word.upper()) diff --git a/spacy/tests/lang/ru/test_tokenizer.py b/spacy/tests/lang/ru/test_tokenizer.py index 1c4d55d2d..350b8a6c2 100644 --- a/spacy/tests/lang/ru/test_tokenizer.py +++ b/spacy/tests/lang/ru/test_tokenizer.py @@ -1,7 +1,4 @@ # coding: utf-8 -"""Test that open, closed and paired punctuation is split off correctly.""" - - from __future__ import unicode_literals import pytest diff --git a/spacy/tests/lang/ru/test_tokenizer_exc.py b/spacy/tests/lang/ru/test_tokenizer_exc.py deleted file mode 100644 index 554036537..000000000 --- a/spacy/tests/lang/ru/test_tokenizer_exc.py +++ /dev/null @@ -1,16 +0,0 @@ -# coding: utf-8 -"""Test that tokenizer exceptions are parsed correctly.""" - - -from __future__ import unicode_literals - -import pytest - - -@pytest.mark.parametrize('text,norms', [("пн.", ["понедельник"]), - ("пт.", ["пятница"]), - ("дек.", ["декабрь"])]) -def test_ru_tokenizer_abbrev_exceptions(ru_tokenizer, text, norms): - tokens = ru_tokenizer(text) - assert len(tokens) == 1 - assert [token.norm_ for token in tokens] == norms diff --git a/spacy/tests/lang/sv/test_tokenizer.py b/spacy/tests/lang/sv/test_tokenizer.py index dbb3e1dd1..7bca715e2 100644 --- a/spacy/tests/lang/sv/test_tokenizer.py +++ b/spacy/tests/lang/sv/test_tokenizer.py @@ -11,14 +11,14 @@ SV_TOKEN_EXCEPTION_TESTS = [ @pytest.mark.parametrize('text,expected_tokens', SV_TOKEN_EXCEPTION_TESTS) -def test_tokenizer_handles_exception_cases(sv_tokenizer, 
text, expected_tokens): +def test_sv_tokenizer_handles_exception_cases(sv_tokenizer, text, expected_tokens): tokens = sv_tokenizer(text) token_list = [token.text for token in tokens if not token.is_space] assert expected_tokens == token_list @pytest.mark.parametrize('text', ["driveru", "hajaru", "Serru", "Fixaru"]) -def test_tokenizer_handles_verb_exceptions(sv_tokenizer, text): +def test_sv_tokenizer_handles_verb_exceptions(sv_tokenizer, text): tokens = sv_tokenizer(text) assert len(tokens) == 2 assert tokens[1].text == "u" diff --git a/spacy/tests/lang/test_attrs.py b/spacy/tests/lang/test_attrs.py index 67485ee60..05bbe9534 100644 --- a/spacy/tests/lang/test_attrs.py +++ b/spacy/tests/lang/test_attrs.py @@ -1,10 +1,9 @@ # coding: utf-8 from __future__ import unicode_literals -from ...attrs import intify_attrs, ORTH, NORM, LEMMA, IS_ALPHA -from ...lang.lex_attrs import is_punct, is_ascii, is_currency, like_url, word_shape - import pytest +from spacy.attrs import intify_attrs, ORTH, NORM, LEMMA, IS_ALPHA +from spacy.lang.lex_attrs import is_punct, is_ascii, is_currency, like_url, word_shape @pytest.mark.parametrize('text', ["dog"]) diff --git a/spacy/tests/lang/th/test_tokenizer.py b/spacy/tests/lang/th/test_tokenizer.py index f5925da1e..8f40fb040 100644 --- a/spacy/tests/lang/th/test_tokenizer.py +++ b/spacy/tests/lang/th/test_tokenizer.py @@ -3,11 +3,9 @@ from __future__ import unicode_literals import pytest -TOKENIZER_TESTS = [ - ("คุณรักผมไหม", ['คุณ', 'รัก', 'ผม', 'ไหม']) -] -@pytest.mark.parametrize('text,expected_tokens', TOKENIZER_TESTS) -def test_thai_tokenizer(th_tokenizer, text, expected_tokens): - tokens = [token.text for token in th_tokenizer(text)] - assert tokens == expected_tokens +@pytest.mark.parametrize('text,expected_tokens', [ + ("คุณรักผมไหม", ['คุณ', 'รัก', 'ผม', 'ไหม'])]) +def test_th_tokenizer(th_tokenizer, text, expected_tokens): + tokens = [token.text for token in th_tokenizer(text)] + assert tokens == expected_tokens diff --git 
a/spacy/tests/lang/tr/test_lemmatization.py b/spacy/tests/lang/tr/test_lemmatization.py index deb2bb19b..52141109f 100644 --- a/spacy/tests/lang/tr/test_lemmatization.py +++ b/spacy/tests/lang/tr/test_lemmatization.py @@ -3,13 +3,15 @@ from __future__ import unicode_literals import pytest -@pytest.mark.parametrize('string,lemma', [('evlerimizdeki', 'ev'), - ('işlerimizi', 'iş'), - ('biran', 'biran'), - ('bitirmeliyiz', 'bitir'), - ('isteklerimizi', 'istek'), - ('karşılaştırmamızın', 'karşılaştır'), - ('çoğulculuktan', 'çoğulcu')]) -def test_lemmatizer_lookup_assigns(tr_tokenizer, string, lemma): + +@pytest.mark.parametrize('string,lemma', [ + ('evlerimizdeki', 'ev'), + ('işlerimizi', 'iş'), + ('biran', 'biran'), + ('bitirmeliyiz', 'bitir'), + ('isteklerimizi', 'istek'), + ('karşılaştırmamızın', 'karşılaştır'), + ('çoğulculuktan', 'çoğulcu')]) +def test_tr_lemmatizer_lookup_assigns(tr_tokenizer, string, lemma): tokens = tr_tokenizer(string) - assert tokens[0].lemma_ == lemma \ No newline at end of file + assert tokens[0].lemma_ == lemma diff --git a/spacy/tests/lang/tt/test_tokenizer.py b/spacy/tests/lang/tt/test_tokenizer.py index c052c1846..e95a7acd5 100644 --- a/spacy/tests/lang/tt/test_tokenizer.py +++ b/spacy/tests/lang/tt/test_tokenizer.py @@ -3,6 +3,7 @@ from __future__ import unicode_literals import pytest + INFIX_HYPHEN_TESTS = [ ("Явым-төшем күләме.", "Явым-төшем күләме .".split()), ("Хатын-кыз киеме.", "Хатын-кыз киеме .".split()) @@ -64,12 +65,12 @@ NORM_TESTCASES = [ @pytest.mark.parametrize("text,expected_tokens", TESTCASES) -def test_tokenizer_handles_testcases(tt_tokenizer, text, expected_tokens): +def test_tt_tokenizer_handles_testcases(tt_tokenizer, text, expected_tokens): tokens = [token.text for token in tt_tokenizer(text) if not token.is_space] assert expected_tokens == tokens @pytest.mark.parametrize('text,norms', NORM_TESTCASES) -def test_tokenizer_handles_norm_exceptions(tt_tokenizer, text, norms): +def 
test_tt_tokenizer_handles_norm_exceptions(tt_tokenizer, text, norms): tokens = tt_tokenizer(text) assert [token.norm_ for token in tokens] == norms diff --git a/spacy/tests/lang/ur/test_text.py b/spacy/tests/lang/ur/test_text.py index 1e94b820c..d872799b8 100644 --- a/spacy/tests/lang/ur/test_text.py +++ b/spacy/tests/lang/ur/test_text.py @@ -1,19 +1,14 @@ # coding: utf-8 - -"""Test that longer and mixed texts are tokenized correctly.""" - - from __future__ import unicode_literals import pytest -def test_tokenizer_handles_long_text(ur_tokenizer): +def test_ur_tokenizer_handles_long_text(ur_tokenizer): text = """اصل میں رسوا ہونے کی ہمیں کچھ عادت سی ہو گئی ہے اس لئے جگ ہنسائی کا ذکر نہیں کرتا،ہوا کچھ یوں کہ عرصہ چھ سال بعد ہمیں بھی خیال آیا - کہ ایک عدد ٹیلی ویژن ہی کیوں نہ خرید لیں ، سوچا ورلڈ کپ ہی دیکھیں گے۔اپنے پاکستان کے کھلاڑیوں کو دیکھ کر + کہ ایک عدد ٹیلی ویژن ہی کیوں نہ خرید لیں ، سوچا ورلڈ کپ ہی دیکھیں گے۔اپنے پاکستان کے کھلاڑیوں کو دیکھ کر ورلڈ کپ دیکھنے کا حوصلہ ہی نہ رہا تو اب یوں ہی ادھر اُدھر کے چینل گھمانے لگ پڑتے ہیں۔""" - tokens = ur_tokenizer(text) assert len(tokens) == 77 @@ -21,6 +16,6 @@ def test_tokenizer_handles_long_text(ur_tokenizer): @pytest.mark.parametrize('text,length', [ ("تحریر باسط حبیب", 3), ("میرا پاکستان", 2)]) -def test_tokenizer_handles_cnts(ur_tokenizer, text, length): +def test_ur_tokenizer_handles_cnts(ur_tokenizer, text, length): tokens = ur_tokenizer(text) assert len(tokens) == length diff --git a/spacy/tests/vectors/__init__.py b/spacy/tests/matcher/__init__.py similarity index 100% rename from spacy/tests/vectors/__init__.py rename to spacy/tests/matcher/__init__.py diff --git a/spacy/tests/test_matcher.py b/spacy/tests/matcher/test_matcher_api.py similarity index 52% rename from spacy/tests/test_matcher.py rename to spacy/tests/matcher/test_matcher_api.py index 29dcc248f..18779c32c 100644 --- a/spacy/tests/test_matcher.py +++ b/spacy/tests/matcher/test_matcher_api.py @@ -1,19 +1,16 @@ # coding: utf-8 from __future__ 
import unicode_literals -from ..matcher import Matcher, PhraseMatcher -from .util import get_doc - import pytest +from spacy.matcher import Matcher +from spacy.tokens import Doc @pytest.fixture def matcher(en_vocab): - rules = { - 'JS': [[{'ORTH': 'JavaScript'}]], - 'GoogleNow': [[{'ORTH': 'Google'}, {'ORTH': 'Now'}]], - 'Java': [[{'LOWER': 'java'}]] - } + rules = {'JS': [[{'ORTH': 'JavaScript'}]], + 'GoogleNow': [[{'ORTH': 'Google'}, {'ORTH': 'Now'}]], + 'Java': [[{'LOWER': 'java'}]]} matcher = Matcher(en_vocab) for key, patterns in rules.items(): matcher.add(key, None, *patterns) @@ -36,7 +33,7 @@ def test_matcher_from_api_docs(en_vocab): def test_matcher_from_usage_docs(en_vocab): text = "Wow 😀 This is really cool! 😂 😂" - doc = get_doc(en_vocab, words=text.split(' ')) + doc = Doc(en_vocab, words=text.split(' ')) pos_emoji = ['😀', '😃', '😂', '🤣', '😊', '😍'] pos_patterns = [[{'ORTH': emoji}] for emoji in pos_emoji] @@ -55,68 +52,46 @@ def test_matcher_from_usage_docs(en_vocab): assert doc[1].norm_ == 'happy emoji' -@pytest.mark.parametrize('words', [["Some", "words"]]) -def test_matcher_init(en_vocab, words): - matcher = Matcher(en_vocab) - doc = get_doc(en_vocab, words) - assert len(matcher) == 0 - assert matcher(doc) == [] - - -def test_matcher_contains(matcher): +def test_matcher_len_contains(matcher): + assert len(matcher) == 3 matcher.add('TEST', None, [{'ORTH': 'test'}]) assert 'TEST' in matcher assert 'TEST2' not in matcher def test_matcher_no_match(matcher): - words = ["I", "like", "cheese", "."] - doc = get_doc(matcher.vocab, words) + doc = Doc(matcher.vocab, words=["I", "like", "cheese", "."]) assert matcher(doc) == [] -def test_matcher_compile(en_vocab): - rules = { - 'JS': [[{'ORTH': 'JavaScript'}]], - 'GoogleNow': [[{'ORTH': 'Google'}, {'ORTH': 'Now'}]], - 'Java': [[{'LOWER': 'java'}]] - } - matcher = Matcher(en_vocab) - for key, patterns in rules.items(): - matcher.add(key, None, *patterns) - assert len(matcher) == 3 - - def 
test_matcher_match_start(matcher): - words = ["JavaScript", "is", "good"] - doc = get_doc(matcher.vocab, words) + doc = Doc(matcher.vocab, words=["JavaScript", "is", "good"]) assert matcher(doc) == [(matcher.vocab.strings['JS'], 0, 1)] def test_matcher_match_end(matcher): words = ["I", "like", "java"] - doc = get_doc(matcher.vocab, words) + doc = Doc(matcher.vocab, words=words) assert matcher(doc) == [(doc.vocab.strings['Java'], 2, 3)] def test_matcher_match_middle(matcher): words = ["I", "like", "Google", "Now", "best"] - doc = get_doc(matcher.vocab, words) + doc = Doc(matcher.vocab, words=words) assert matcher(doc) == [(doc.vocab.strings['GoogleNow'], 2, 4)] def test_matcher_match_multi(matcher): words = ["I", "like", "Google", "Now", "and", "java", "best"] - doc = get_doc(matcher.vocab, words) + doc = Doc(matcher.vocab, words=words) assert matcher(doc) == [(doc.vocab.strings['GoogleNow'], 2, 4), (doc.vocab.strings['Java'], 5, 6)] def test_matcher_empty_dict(en_vocab): - '''Test matcher allows empty token specs, meaning match on any token.''' + """Test matcher allows empty token specs, meaning match on any token.""" matcher = Matcher(en_vocab) - abc = ["a", "b", "c"] - doc = get_doc(matcher.vocab, abc) + doc = Doc(matcher.vocab, words=["a", "b", "c"]) matcher.add('A.C', None, [{'ORTH': 'a'}, {}, {'ORTH': 'c'}]) matches = matcher(doc) assert len(matches) == 1 @@ -129,8 +104,7 @@ def test_matcher_empty_dict(en_vocab): def test_matcher_operator_shadow(en_vocab): matcher = Matcher(en_vocab) - abc = ["a", "b", "c"] - doc = get_doc(matcher.vocab, abc) + doc = Doc(matcher.vocab, words=["a", "b", "c"]) pattern = [{'ORTH': 'a'}, {"IS_ALPHA": True, "OP": "+"}, {'ORTH': 'c'}] matcher.add('A.C', None, pattern) matches = matcher(doc) @@ -138,32 +112,6 @@ def test_matcher_operator_shadow(en_vocab): assert matches[0][1:] == (0, 3) -def test_matcher_phrase_matcher(en_vocab): - words = ["Google", "Now"] - doc = get_doc(en_vocab, words) - matcher = PhraseMatcher(en_vocab) - 
matcher.add('COMPANY', None, doc) - words = ["I", "like", "Google", "Now", "best"] - doc = get_doc(en_vocab, words) - assert len(matcher(doc)) == 1 - - -def test_phrase_matcher_length(en_vocab): - matcher = PhraseMatcher(en_vocab) - assert len(matcher) == 0 - matcher.add('TEST', None, get_doc(en_vocab, ['test'])) - assert len(matcher) == 1 - matcher.add('TEST2', None, get_doc(en_vocab, ['test2'])) - assert len(matcher) == 2 - - -def test_phrase_matcher_contains(en_vocab): - matcher = PhraseMatcher(en_vocab) - matcher.add('TEST', None, get_doc(en_vocab, ['test'])) - assert 'TEST' in matcher - assert 'TEST2' not in matcher - - def test_matcher_match_zero(matcher): words1 = 'He said , " some words " ...'.split() words2 = 'He said , " some three words " ...'.split() @@ -176,12 +124,10 @@ def test_matcher_match_zero(matcher): {'IS_PUNCT': True}, {'IS_PUNCT': True}, {'ORTH': '"'}] - matcher.add('Quote', None, pattern1) - doc = get_doc(matcher.vocab, words1) + doc = Doc(matcher.vocab, words=words1) assert len(matcher(doc)) == 1 - - doc = get_doc(matcher.vocab, words2) + doc = Doc(matcher.vocab, words=words2) assert len(matcher(doc)) == 0 matcher.add('Quote', None, pattern2) assert len(matcher(doc)) == 0 @@ -194,14 +140,14 @@ def test_matcher_match_zero_plus(matcher): {'ORTH': '"'}] matcher = Matcher(matcher.vocab) matcher.add('Quote', None, pattern) - doc = get_doc(matcher.vocab, words) + doc = Doc(matcher.vocab, words=words) assert len(matcher(doc)) == 1 def test_matcher_match_one_plus(matcher): control = Matcher(matcher.vocab) control.add('BasicPhilippe', None, [{'ORTH': 'Philippe'}]) - doc = get_doc(control.vocab, ['Philippe', 'Philippe']) + doc = Doc(control.vocab, words=['Philippe', 'Philippe']) m = control(doc) assert len(m) == 2 matcher.add('KleenePhilippe', None, [{'ORTH': 'Philippe', 'OP': '1'}, @@ -210,61 +156,11 @@ def test_matcher_match_one_plus(matcher): assert len(m) == 1 -def test_operator_combos(matcher): - cases = [ - ('aaab', 'a a a b', True), - ('aaab', 
'a+ b', True), - ('aaab', 'a+ a+ b', True), - ('aaab', 'a+ a+ a b', True), - ('aaab', 'a+ a+ a+ b', True), - ('aaab', 'a+ a a b', True), - ('aaab', 'a+ a a', True), - ('aaab', 'a+', True), - ('aaa', 'a+ b', False), - ('aaa', 'a+ a+ b', False), - ('aaa', 'a+ a+ a+ b', False), - ('aaa', 'a+ a b', False), - ('aaa', 'a+ a a b', False), - ('aaab', 'a+ a a', True), - ('aaab', 'a+', True), - ('aaab', 'a+ a b', True) - ] - for string, pattern_str, result in cases: - matcher = Matcher(matcher.vocab) - doc = get_doc(matcher.vocab, words=list(string)) - pattern = [] - for part in pattern_str.split(): - if part.endswith('+'): - pattern.append({'ORTH': part[0], 'op': '+'}) - else: - pattern.append({'ORTH': part}) - matcher.add('PATTERN', None, pattern) - matches = matcher(doc) - if result: - assert matches, (string, pattern_str) - else: - assert not matches, (string, pattern_str) - - -def test_matcher_end_zero_plus(matcher): - """Test matcher works when patterns end with * operator. (issue 1450)""" - matcher = Matcher(matcher.vocab) - pattern = [{'ORTH': "a"}, {'ORTH': "b", 'OP': "*"}] - matcher.add("TSTEND", None, pattern) - nlp = lambda string: get_doc(matcher.vocab, string.split()) - assert len(matcher(nlp('a'))) == 1 - assert len(matcher(nlp('a b'))) == 2 - assert len(matcher(nlp('a c'))) == 1 - assert len(matcher(nlp('a b c'))) == 2 - assert len(matcher(nlp('a b b c'))) == 3 - assert len(matcher(nlp('a b b'))) == 3 - - def test_matcher_any_token_operator(en_vocab): """Test that patterns with "any token" {} work with operators.""" matcher = Matcher(en_vocab) matcher.add('TEST', None, [{'ORTH': 'test'}, {'OP': '*'}]) - doc = get_doc(en_vocab, ['test', 'hello', 'world']) + doc = Doc(en_vocab, words=['test', 'hello', 'world']) matches = [doc[start:end].text for _, start, end in matcher(doc)] assert len(matches) == 3 assert matches[0] == 'test' diff --git a/spacy/tests/matcher/test_matcher_logic.py b/spacy/tests/matcher/test_matcher_logic.py new file mode 100644 index 
000000000..825f23cb3 --- /dev/null +++ b/spacy/tests/matcher/test_matcher_logic.py @@ -0,0 +1,116 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest +import re +from spacy.matcher import Matcher +from spacy.tokens import Doc + + +pattern1 = [{'ORTH':'A', 'OP':'1'}, {'ORTH':'A', 'OP':'*'}] +pattern2 = [{'ORTH':'A', 'OP':'*'}, {'ORTH':'A', 'OP':'1'}] +pattern3 = [{'ORTH':'A', 'OP':'1'}, {'ORTH':'A', 'OP':'1'}] +pattern4 = [{'ORTH':'B', 'OP':'1'}, {'ORTH':'A', 'OP':'*'}, {'ORTH':'B', 'OP':'1'}] +pattern5 = [{'ORTH':'B', 'OP':'*'}, {'ORTH':'A', 'OP':'*'}, {'ORTH':'B', 'OP':'1'}] + +re_pattern1 = 'AA*' +re_pattern2 = 'A*A' +re_pattern3 = 'AA' +re_pattern4 = 'BA*B' +re_pattern5 = 'B*A*B' + + +@pytest.fixture +def text(): + return "(ABBAAAAAB)." + + +@pytest.fixture +def doc(en_tokenizer, text): + doc = en_tokenizer(' '.join(text)) + return doc + + +@pytest.mark.xfail +@pytest.mark.parametrize('pattern,re_pattern', [ + (pattern1, re_pattern1), + (pattern2, re_pattern2), + (pattern3, re_pattern3), + (pattern4, re_pattern4), + (pattern5, re_pattern5)]) +def test_greedy_matching(doc, text, pattern, re_pattern): + """Test that the greedy matching behavior of the * op is consistent with + other re implementations.""" + matcher = Matcher(doc.vocab) + matcher.add(re_pattern, None, pattern) + matches = matcher(doc) + re_matches = [m.span() for m in re.finditer(re_pattern, text)] + for match, re_match in zip(matches, re_matches): + assert match[1:] == re_match + + +@pytest.mark.xfail +@pytest.mark.parametrize('pattern,re_pattern', [ + (pattern1, re_pattern1), + (pattern2, re_pattern2), + (pattern3, re_pattern3), + (pattern4, re_pattern4), + (pattern5, re_pattern5)]) +def test_match_consuming(doc, text, pattern, re_pattern): + """Test that matcher.__call__ consumes tokens on a match similar to + re.findall.""" + matcher = Matcher(doc.vocab) + matcher.add(re_pattern, None, pattern) + matches = matcher(doc) + re_matches = [m.span() for m in 
re.finditer(re_pattern, text)] + assert len(matches) == len(re_matches) + + +def test_operator_combos(en_vocab): + cases = [ + ('aaab', 'a a a b', True), + ('aaab', 'a+ b', True), + ('aaab', 'a+ a+ b', True), + ('aaab', 'a+ a+ a b', True), + ('aaab', 'a+ a+ a+ b', True), + ('aaab', 'a+ a a b', True), + ('aaab', 'a+ a a', True), + ('aaab', 'a+', True), + ('aaa', 'a+ b', False), + ('aaa', 'a+ a+ b', False), + ('aaa', 'a+ a+ a+ b', False), + ('aaa', 'a+ a b', False), + ('aaa', 'a+ a a b', False), + ('aaab', 'a+ a a', True), + ('aaab', 'a+', True), + ('aaab', 'a+ a b', True) + ] + for string, pattern_str, result in cases: + matcher = Matcher(en_vocab) + doc = Doc(matcher.vocab, words=list(string)) + pattern = [] + for part in pattern_str.split(): + if part.endswith('+'): + pattern.append({'ORTH': part[0], 'OP': '+'}) + else: + pattern.append({'ORTH': part}) + matcher.add('PATTERN', None, pattern) + matches = matcher(doc) + if result: + assert matches, (string, pattern_str) + else: + assert not matches, (string, pattern_str) + + +def test_matcher_end_zero_plus(en_vocab): + """Test matcher works when patterns end with * operator. 
(issue 1450)""" + matcher = Matcher(en_vocab) + pattern = [{'ORTH': "a"}, {'ORTH': "b", 'OP': "*"}] + matcher.add('TSTEND', None, pattern) + nlp = lambda string: Doc(matcher.vocab, words=string.split()) + assert len(matcher(nlp('a'))) == 1 + assert len(matcher(nlp('a b'))) == 2 + assert len(matcher(nlp('a c'))) == 1 + assert len(matcher(nlp('a b c'))) == 2 + assert len(matcher(nlp('a b b c'))) == 3 + assert len(matcher(nlp('a b b'))) == 3 diff --git a/spacy/tests/matcher/test_phrase_matcher.py b/spacy/tests/matcher/test_phrase_matcher.py new file mode 100644 index 000000000..578f2b5d0 --- /dev/null +++ b/spacy/tests/matcher/test_phrase_matcher.py @@ -0,0 +1,30 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest +from spacy.matcher import PhraseMatcher +from spacy.tokens import Doc + + +def test_matcher_phrase_matcher(en_vocab): + doc = Doc(en_vocab, words=["Google", "Now"]) + matcher = PhraseMatcher(en_vocab) + matcher.add('COMPANY', None, doc) + doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"]) + assert len(matcher(doc)) == 1 + + +def test_phrase_matcher_length(en_vocab): + matcher = PhraseMatcher(en_vocab) + assert len(matcher) == 0 + matcher.add('TEST', None, Doc(en_vocab, words=['test'])) + assert len(matcher) == 1 + matcher.add('TEST2', None, Doc(en_vocab, words=['test2'])) + assert len(matcher) == 2 + + +def test_phrase_matcher_contains(en_vocab): + matcher = PhraseMatcher(en_vocab) + matcher.add('TEST', None, Doc(en_vocab, words=['test'])) + assert 'TEST' in matcher + assert 'TEST2' not in matcher diff --git a/spacy/tests/parser/test_add_label.py b/spacy/tests/parser/test_add_label.py index 9493452a1..5b57ee38b 100644 --- a/spacy/tests/parser/test_add_label.py +++ b/spacy/tests/parser/test_add_label.py @@ -1,15 +1,16 @@ -'''Test the ability to add a label to a (potentially trained) parsing model.''' +# coding: utf8 from __future__ import unicode_literals + import pytest import numpy.random from 
thinc.neural.optimizers import Adam from thinc.neural.ops import NumpyOps +from spacy.attrs import NORM +from spacy.gold import GoldParse +from spacy.vocab import Vocab +from spacy.tokens import Doc +from spacy.pipeline import DependencyParser -from ...attrs import NORM -from ...gold import GoldParse -from ...vocab import Vocab -from ...tokens import Doc -from ...pipeline import DependencyParser numpy.random.seed(0) @@ -37,9 +38,11 @@ def parser(vocab): parser.update([doc], [gold], sgd=sgd, losses=losses) return parser + def test_init_parser(parser): pass + # TODO: This is flakey, because it depends on what the parser first learns. @pytest.mark.xfail def test_add_label(parser): @@ -69,4 +72,3 @@ def test_add_label(parser): doc = parser(doc) assert doc[0].dep_ == 'right' assert doc[2].dep_ == 'left' - diff --git a/spacy/tests/parser/test_arc_eager_oracle.py b/spacy/tests/parser/test_arc_eager_oracle.py index 7148b0b97..f126fc961 100644 --- a/spacy/tests/parser/test_arc_eager_oracle.py +++ b/spacy/tests/parser/test_arc_eager_oracle.py @@ -1,13 +1,14 @@ +# coding: utf8 from __future__ import unicode_literals -import pytest -from ...vocab import Vocab -from ...pipeline import DependencyParser -from ...tokens import Doc -from ...gold import GoldParse -from ...syntax.nonproj import projectivize -from ...syntax.stateclass import StateClass -from ...syntax.arc_eager import ArcEager +import pytest +from spacy.vocab import Vocab +from spacy.pipeline import DependencyParser +from spacy.tokens import Doc +from spacy.gold import GoldParse +from spacy.syntax.nonproj import projectivize +from spacy.syntax.stateclass import StateClass +from spacy.syntax.arc_eager import ArcEager def get_sequence_costs(M, words, heads, deps, transitions): @@ -105,7 +106,7 @@ annot_tuples = [ (31, 'going', 'VBG', 26, 'parataxis', 'O'), (32, 'to', 'TO', 33, 'aux', 'O'), (33, 'spend', 'VB', 31, 'xcomp', 'O'), - (34, 'the', 'DT', 35, 'det', 'B-TIME'), + (34, 'the', 'DT', 35, 'det', 'B-TIME'), (35, 
'night', 'NN', 33, 'dobj', 'L-TIME'), (36, 'there', 'RB', 33, 'advmod', 'O'), (37, 'presumably', 'RB', 33, 'advmod', 'O'), diff --git a/spacy/tests/parser/test_beam_parse.py b/spacy/tests/parser/test_beam_parse.py deleted file mode 100644 index 855b2e1e9..000000000 --- a/spacy/tests/parser/test_beam_parse.py +++ /dev/null @@ -1,23 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest -from ...language import Language -from ...pipeline import DependencyParser - - -@pytest.mark.models('en') -def test_beam_parse_en(EN): - doc = EN(u'Australia is a country', disable=['ner']) - ents = EN.entity(doc, beam_width=2) - print(ents) - - -def test_beam_parse(): - nlp = Language() - nlp.add_pipe(DependencyParser(nlp.vocab), name='parser') - nlp.parser.add_label('nsubj') - nlp.parser.begin_training([], token_vector_width=8, hidden_width=8) - - doc = nlp.make_doc(u'Australia is a country') - nlp.parser(doc, beam_width=2) diff --git a/spacy/tests/parser/test_ner.py b/spacy/tests/parser/test_ner.py index e74c9ccb0..4de0d25c2 100644 --- a/spacy/tests/parser/test_ner.py +++ b/spacy/tests/parser/test_ner.py @@ -1,11 +1,12 @@ +# coding: utf-8 from __future__ import unicode_literals import pytest - -from ...vocab import Vocab -from ...syntax.ner import BiluoPushDown -from ...gold import GoldParse -from ...tokens import Doc +from spacy.pipeline import EntityRecognizer +from spacy.vocab import Vocab +from spacy.syntax.ner import BiluoPushDown +from spacy.gold import GoldParse +from spacy.tokens import Doc @pytest.fixture @@ -71,3 +72,16 @@ def test_get_oracle_moves_negative_O(tsys, vocab): tsys.preprocess_gold(gold) act_classes = tsys.get_oracle_sequence(doc, gold) names = [tsys.get_class_name(act) for act in act_classes] + + +def test_doc_add_entities_set_ents_iob(en_vocab): + doc = Doc(en_vocab, words=["This", "is", "a", "lion"]) + ner = EntityRecognizer(en_vocab) + ner.begin_training([]) + ner(doc) + assert len(list(doc.ents)) == 0 + assert [w.ent_iob_ for w 
in doc] == (['O'] * len(doc)) + doc.ents = [(doc.vocab.strings['ANIMAL'], 3, 4)] + assert [w.ent_iob_ for w in doc] == ['', '', '', 'B'] + doc.ents = [(doc.vocab.strings['WORD'], 0, 2)] + assert [w.ent_iob_ for w in doc] == ['B', 'I', '', ''] diff --git a/spacy/tests/parser/test_neural_parser.py b/spacy/tests/parser/test_neural_parser.py index febd4da05..4ab49cb4e 100644 --- a/spacy/tests/parser/test_neural_parser.py +++ b/spacy/tests/parser/test_neural_parser.py @@ -1,16 +1,13 @@ # coding: utf8 from __future__ import unicode_literals -from thinc.neural import Model -import pytest -import numpy -from ..._ml import chain, Tok2Vec, doc2feats -from ...vocab import Vocab -from ...pipeline import Tensorizer -from ...syntax.arc_eager import ArcEager -from ...syntax.nn_parser import Parser -from ...tokens.doc import Doc -from ...gold import GoldParse +import pytest +from spacy._ml import Tok2Vec +from spacy.vocab import Vocab +from spacy.syntax.arc_eager import ArcEager +from spacy.syntax.nn_parser import Parser +from spacy.tokens.doc import Doc +from spacy.gold import GoldParse @pytest.fixture @@ -37,10 +34,12 @@ def parser(vocab, arc_eager): def model(arc_eager, tok2vec): return Parser.Model(arc_eager.n_moves, token_vector_width=tok2vec.nO)[0] + @pytest.fixture def doc(vocab): return Doc(vocab, words=['a', 'b', 'c']) + @pytest.fixture def gold(doc): return GoldParse(doc, heads=[1, 1, 1], deps=['L', 'ROOT', 'R']) @@ -80,5 +79,3 @@ def test_update_doc_beam(parser, model, doc, gold): def optimize(weights, gradient, key=None): weights -= 0.001 * gradient parser.update_beam([doc], [gold], sgd=optimize) - - diff --git a/spacy/tests/parser/test_nn_beam.py b/spacy/tests/parser/test_nn_beam.py index ab8bf012b..a6943d49e 100644 --- a/spacy/tests/parser/test_nn_beam.py +++ b/spacy/tests/parser/test_nn_beam.py @@ -1,20 +1,23 @@ +# coding: utf8 from __future__ import unicode_literals + import pytest import numpy -from thinc.api import layerize - -from ...vocab import Vocab -from 
...syntax.arc_eager import ArcEager -from ...tokens import Doc -from ...gold import GoldParse -from ...syntax._beam_utils import ParserBeam, update_beam -from ...syntax.stateclass import StateClass +from spacy.vocab import Vocab +from spacy.language import Language +from spacy.pipeline import DependencyParser +from spacy.syntax.arc_eager import ArcEager +from spacy.tokens import Doc +from spacy.syntax._beam_utils import ParserBeam +from spacy.syntax.stateclass import StateClass +from spacy.gold import GoldParse @pytest.fixture def vocab(): return Vocab() + @pytest.fixture def moves(vocab): aeager = ArcEager(vocab.strings, {}) @@ -65,6 +68,7 @@ def vector_size(): def beam(moves, states, golds, beam_width): return ParserBeam(moves, states, golds, width=beam_width, density=0.0) + @pytest.fixture def scores(moves, batch_size, beam_width): return [ @@ -85,3 +89,12 @@ def test_beam_advance(beam, scores): def test_beam_advance_too_few_scores(beam, scores): with pytest.raises(IndexError): beam.advance(scores[:-1]) + + +def test_beam_parse(): + nlp = Language() + nlp.add_pipe(DependencyParser(nlp.vocab), name='parser') + nlp.parser.add_label('nsubj') + nlp.parser.begin_training([], token_vector_width=8, hidden_width=8) + doc = nlp.make_doc('Australia is a country') + nlp.parser(doc, beam_width=2) diff --git a/spacy/tests/parser/test_nonproj.py b/spacy/tests/parser/test_nonproj.py index 237f0debd..fad04f340 100644 --- a/spacy/tests/parser/test_nonproj.py +++ b/spacy/tests/parser/test_nonproj.py @@ -1,35 +1,39 @@ # coding: utf-8 from __future__ import unicode_literals -from ...syntax.nonproj import ancestors, contains_cycle, is_nonproj_arc -from ...syntax.nonproj import is_nonproj_tree -from ...syntax import nonproj -from ...attrs import DEP, HEAD -from ..util import get_doc - import pytest +from spacy.syntax.nonproj import ancestors, contains_cycle, is_nonproj_arc +from spacy.syntax.nonproj import is_nonproj_tree +from spacy.syntax import nonproj + +from ..util import 
get_doc @pytest.fixture def tree(): return [1, 2, 2, 4, 5, 2, 2] + @pytest.fixture def cyclic_tree(): return [1, 2, 2, 4, 5, 3, 2] + @pytest.fixture def partial_tree(): return [1, 2, 2, 4, 5, None, 7, 4, 2] + @pytest.fixture def nonproj_tree(): return [1, 2, 2, 4, 5, 2, 7, 4, 2] + @pytest.fixture def proj_tree(): return [1, 2, 2, 4, 5, 2, 7, 5, 2] + @pytest.fixture def multirooted_tree(): return [3, 2, 0, 3, 3, 7, 7, 3, 7, 10, 7, 10, 11, 12, 18, 16, 18, 17, 12, 3] @@ -75,14 +79,14 @@ def test_parser_pseudoprojectivity(en_tokenizer): def deprojectivize(proj_heads, deco_labels): tokens = en_tokenizer('whatever ' * len(proj_heads)) rel_proj_heads = [head-i for i, head in enumerate(proj_heads)] - doc = get_doc(tokens.vocab, [t.text for t in tokens], deps=deco_labels, heads=rel_proj_heads) + doc = get_doc(tokens.vocab, words=[t.text for t in tokens], + deps=deco_labels, heads=rel_proj_heads) nonproj.deprojectivize(doc) return [t.head.i for t in doc], [token.dep_ for token in doc] tree = [1, 2, 2] nonproj_tree = [1, 2, 2, 4, 5, 2, 7, 4, 2] nonproj_tree2 = [9, 1, 3, 1, 5, 6, 9, 8, 6, 1, 6, 12, 13, 10, 1] - labels = ['det', 'nsubj', 'root', 'det', 'dobj', 'aux', 'nsubj', 'acl', 'punct'] labels2 = ['advmod', 'root', 'det', 'nsubj', 'advmod', 'det', 'dobj', 'det', 'nmod', 'aux', 'nmod', 'advmod', 'det', 'amod', 'punct'] diff --git a/spacy/tests/parser/test_parse.py b/spacy/tests/parser/test_parse.py index 9cabc5662..9706d9b9b 100644 --- a/spacy/tests/parser/test_parse.py +++ b/spacy/tests/parser/test_parse.py @@ -1,17 +1,17 @@ # coding: utf-8 from __future__ import unicode_literals -from ..util import get_doc, apply_transition_sequence - import pytest +from ..util import get_doc, apply_transition_sequence + def test_parser_root(en_tokenizer): text = "i don't have other assistance" heads = [3, 2, 1, 0, 1, -2] deps = ['nsubj', 'aux', 'neg', 'ROOT', 'amod', 'dobj'] tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps) + doc 
= get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps) for t in doc: assert t.dep != 0, t.text @@ -20,7 +20,7 @@ def test_parser_root(en_tokenizer): @pytest.mark.parametrize('text', ["Hello"]) def test_parser_parse_one_word_sentence(en_tokenizer, en_parser, text): tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=[0], deps=['ROOT']) + doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=[0], deps=['ROOT']) assert len(doc) == 1 with en_parser.step_through(doc) as _: @@ -33,10 +33,8 @@ def test_parser_initial(en_tokenizer, en_parser): text = "I ate the pizza with anchovies." heads = [1, 0, 1, -2, -3, -1, -5] transition = ['L-nsubj', 'S', 'L-det'] - tokens = en_tokenizer(text) apply_transition_sequence(en_parser, tokens, transition) - assert tokens[0].head.i == 1 assert tokens[1].head.i == 1 assert tokens[2].head.i == 3 @@ -47,8 +45,7 @@ def test_parser_parse_subtrees(en_tokenizer, en_parser): text = "The four wheels on the bus turned quickly" heads = [2, 1, 4, -1, 1, -2, 0, -1] tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads) - + doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads) assert len(list(doc[2].lefts)) == 2 assert len(list(doc[2].rights)) == 1 assert len(list(doc[2].children)) == 3 @@ -63,11 +60,9 @@ def test_parser_merge_pp(en_tokenizer): heads = [1, 4, -1, 1, -2, 0] deps = ['det', 'nsubj', 'prep', 'det', 'pobj', 'ROOT'] tags = ['DT', 'NN', 'IN', 'DT', 'NN', 'VBZ'] - tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, [t.text for t in tokens], deps=deps, heads=heads, tags=tags) + doc = get_doc(tokens.vocab, words=[t.text for t in tokens], deps=deps, heads=heads, tags=tags) nps = [(np[0].idx, np[-1].idx + len(np[-1]), np.lemma_) for np in doc.noun_chunks] - for start, end, lemma in nps: doc.merge(start, end, label='NP', lemma=lemma) assert doc[0].text == 'A phrase' diff --git 
a/spacy/tests/parser/test_parse_navigate.py b/spacy/tests/parser/test_parse_navigate.py index da59b0b59..8cfe3c280 100644 --- a/spacy/tests/parser/test_parse_navigate.py +++ b/spacy/tests/parser/test_parse_navigate.py @@ -1,14 +1,14 @@ # coding: utf-8 from __future__ import unicode_literals -from ..util import get_doc - import pytest +from ..util import get_doc + @pytest.fixture def text(): - return u""" + return """ It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, @@ -54,7 +54,7 @@ def heads(): def test_parser_parse_navigate_consistency(en_tokenizer, text, heads): tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads) + doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads) for head in doc: for child in head.lefts: assert child.head == head @@ -64,7 +64,7 @@ def test_parser_parse_navigate_consistency(en_tokenizer, text, heads): def test_parser_parse_navigate_child_consistency(en_tokenizer, text, heads): tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads) + doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads) lefts = {} rights = {} @@ -97,7 +97,7 @@ def test_parser_parse_navigate_child_consistency(en_tokenizer, text, heads): def test_parser_parse_navigate_edges(en_tokenizer, text, heads): tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads) + doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads) for token in doc: subtree = list(token.subtree) debug = '\t'.join((token.text, token.left_edge.text, subtree[0].text)) diff --git a/spacy/tests/parser/test_preset_sbd.py b/spacy/tests/parser/test_preset_sbd.py index 5d58ad173..c54e01a6d 100644 --- a/spacy/tests/parser/test_preset_sbd.py +++ 
b/spacy/tests/parser/test_preset_sbd.py @@ -1,19 +1,21 @@ -'''Test that the parser respects preset sentence boundaries.''' +# coding: utf8 from __future__ import unicode_literals + import pytest from thinc.neural.optimizers import Adam from thinc.neural.ops import NumpyOps +from spacy.attrs import NORM +from spacy.gold import GoldParse +from spacy.vocab import Vocab +from spacy.tokens import Doc +from spacy.pipeline import DependencyParser -from ...attrs import NORM -from ...gold import GoldParse -from ...vocab import Vocab -from ...tokens import Doc -from ...pipeline import DependencyParser @pytest.fixture def vocab(): return Vocab(lex_attr_getters={NORM: lambda s: s}) + @pytest.fixture def parser(vocab): parser = DependencyParser(vocab) @@ -32,6 +34,7 @@ def parser(vocab): parser.update([doc], [gold], sgd=sgd, losses=losses) return parser + def test_no_sentences(parser): doc = Doc(parser.vocab, words=['a', 'b', 'c', 'd']) doc = parser(doc) diff --git a/spacy/tests/parser/test_space_attachment.py b/spacy/tests/parser/test_space_attachment.py index 1ee0d7584..216915882 100644 --- a/spacy/tests/parser/test_space_attachment.py +++ b/spacy/tests/parser/test_space_attachment.py @@ -1,19 +1,18 @@ # coding: utf-8 from __future__ import unicode_literals -from ...tokens.doc import Doc -from ...attrs import HEAD -from ..util import get_doc, apply_transition_sequence - import pytest +from spacy.tokens.doc import Doc + +from ..util import get_doc, apply_transition_sequence + def test_parser_space_attachment(en_tokenizer): text = "This is a test.\nTo ensure spaces are attached well." 
heads = [1, 0, 1, -2, -3, -1, 1, 4, -1, 2, 1, 0, -1, -2] tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads) - + doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads) for sent in doc.sents: if len(sent) == 1: assert not sent[-1].is_space @@ -26,7 +25,7 @@ def test_parser_sentence_space(en_tokenizer): 'nsubjpass', 'aux', 'auxpass', 'ROOT', 'nsubj', 'aux', 'ccomp', 'poss', 'nsubj', 'ccomp', 'punct'] tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps) + doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps) assert len(list(doc.sents)) == 2 @@ -35,7 +34,7 @@ def test_parser_space_attachment_leading(en_tokenizer, en_parser): text = "\t \n This is a sentence ." heads = [1, 1, 0, 1, -2, -3] tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, text.split(' '), heads=heads) + doc = get_doc(tokens.vocab, words=text.split(' '), heads=heads) assert doc[0].is_space assert doc[1].is_space assert doc[2].text == 'This' @@ -52,7 +51,7 @@ def test_parser_space_attachment_intermediate_trailing(en_tokenizer, en_parser): heads = [1, 0, -1, 2, -1, -4, -5, -1] transition = ['L-nsubj', 'S', 'L-det', 'R-attr', 'D', 'R-punct'] tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, text.split(' '), heads=heads) + doc = get_doc(tokens.vocab, words=text.split(' '), heads=heads) assert doc[2].is_space assert doc[4].is_space assert doc[5].is_space diff --git a/spacy/tests/parser/test_to_from_bytes_disk.py b/spacy/tests/parser/test_to_from_bytes_disk.py deleted file mode 100644 index 48c412b7a..000000000 --- a/spacy/tests/parser/test_to_from_bytes_disk.py +++ /dev/null @@ -1,28 +0,0 @@ -import pytest - -from ...pipeline import DependencyParser - - -@pytest.fixture -def parser(en_vocab): - parser = DependencyParser(en_vocab) - parser.add_label('nsubj') - parser.model, cfg = parser.Model(parser.moves.n_moves) - parser.cfg.update(cfg) - return 
parser - - -@pytest.fixture -def blank_parser(en_vocab): - parser = DependencyParser(en_vocab) - return parser - - -def test_to_from_bytes(parser, blank_parser): - assert parser.model is not True - assert blank_parser.model is True - assert blank_parser.moves.n_moves != parser.moves.n_moves - bytes_data = parser.to_bytes() - blank_parser.from_bytes(bytes_data) - assert blank_parser.model is not True - assert blank_parser.moves.n_moves == parser.moves.n_moves diff --git a/spacy/tests/pipeline/test_entity_ruler.py b/spacy/tests/pipeline/test_entity_ruler.py index 49f6cab61..67fb4e003 100644 --- a/spacy/tests/pipeline/test_entity_ruler.py +++ b/spacy/tests/pipeline/test_entity_ruler.py @@ -2,10 +2,9 @@ from __future__ import unicode_literals import pytest - -from ...tokens import Span -from ...language import Language -from ...pipeline import EntityRuler +from spacy.tokens import Span +from spacy.language import Language +from spacy.pipeline import EntityRuler @pytest.fixture diff --git a/spacy/tests/pipeline/test_factories.py b/spacy/tests/pipeline/test_factories.py index 35c42ce56..d6ed68c4a 100644 --- a/spacy/tests/pipeline/test_factories.py +++ b/spacy/tests/pipeline/test_factories.py @@ -2,11 +2,11 @@ from __future__ import unicode_literals import pytest +from spacy.language import Language +from spacy.tokens import Span from ..util import get_doc -from ...language import Language -from ...tokens import Span -from ... 
import util + @pytest.fixture def doc(en_tokenizer): @@ -16,7 +16,7 @@ def doc(en_tokenizer): pos = ['PRON', 'VERB', 'PROPN', 'PROPN', 'ADP', 'PROPN', 'PUNCT'] deps = ['ROOT', 'prep', 'compound', 'pobj', 'prep', 'pobj', 'punct'] tokens = en_tokenizer(text) - doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, + doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, tags=tags, pos=pos, deps=deps) doc.ents = [Span(doc, 2, 4, doc.vocab.strings['GPE'])] doc.is_parsed = True diff --git a/spacy/tests/pipeline/test_pipe_methods.py b/spacy/tests/pipeline/test_pipe_methods.py index 6c12e162e..225c9acf8 100644 --- a/spacy/tests/pipeline/test_pipe_methods.py +++ b/spacy/tests/pipeline/test_pipe_methods.py @@ -2,8 +2,7 @@ from __future__ import unicode_literals import pytest - -from ...language import Language +from spacy.language import Language @pytest.fixture diff --git a/spacy/tests/pipeline/test_textcat.py b/spacy/tests/pipeline/test_textcat.py index 99f0f8908..3cf32b6b4 100644 --- a/spacy/tests/pipeline/test_textcat.py +++ b/spacy/tests/pipeline/test_textcat.py @@ -1,7 +1,13 @@ # coding: utf8 - from __future__ import unicode_literals -from ...language import Language + +import pytest +import random +import numpy.random +from spacy.language import Language +from spacy.pipeline import TextCategorizer +from spacy.tokens import Doc +from spacy.gold import GoldParse def test_simple_train(): @@ -13,6 +19,40 @@ def test_simple_train(): for text, answer in [('aaaa', 1.), ('bbbb', 0), ('aa', 1.), ('bbbbbbbbb', 0.), ('aaaaaa', 1)]: nlp.update([text], [{'cats': {'answer': answer}}]) - doc = nlp(u'aaa') + doc = nlp('aaa') assert 'answer' in doc.cats assert doc.cats['answer'] >= 0.5 + + +@pytest.mark.skip(reason="Test is flakey when run with others") +def test_textcat_learns_multilabel(): + random.seed(5) + numpy.random.seed(5) + docs = [] + nlp = Language() + letters = ['a', 'b', 'c'] + for w1 in letters: + for w2 in letters: + cats = {letter: 
float(w2==letter) for letter in letters} + docs.append((Doc(nlp.vocab, words=['d']*3 + [w1, w2] + ['d']*3), cats)) + random.shuffle(docs) + model = TextCategorizer(nlp.vocab, width=8) + for letter in letters: + model.add_label(letter) + optimizer = model.begin_training() + for i in range(30): + losses = {} + Ys = [GoldParse(doc, cats=cats) for doc, cats in docs] + Xs = [doc for doc, cats in docs] + model.update(Xs, Ys, sgd=optimizer, losses=losses) + random.shuffle(docs) + for w1 in letters: + for w2 in letters: + doc = Doc(nlp.vocab, words=['d']*3 + [w1, w2] + ['d']*3) + truth = {letter: w2==letter for letter in letters} + model(doc) + for cat, score in doc.cats.items(): + if not truth[cat]: + assert score < 0.5 + else: + assert score > 0.5 diff --git a/spacy/tests/regression/test_issue1-1000.py b/spacy/tests/regression/test_issue1-1000.py new file mode 100644 index 000000000..d3fe240d8 --- /dev/null +++ b/spacy/tests/regression/test_issue1-1000.py @@ -0,0 +1,420 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest +import random +from spacy.matcher import Matcher +from spacy.attrs import IS_PUNCT, ORTH, LOWER +from spacy.symbols import POS, VERB, VerbForm_inf +from spacy.vocab import Vocab +from spacy.language import Language +from spacy.lemmatizer import Lemmatizer +from spacy.tokens import Doc + +from ..util import get_doc, make_tempdir + + +@pytest.mark.parametrize('patterns', [ + [[{'LOWER': 'celtics'}], [{'LOWER': 'boston'}, {'LOWER': 'celtics'}]], + [[{'LOWER': 'boston'}, {'LOWER': 'celtics'}], [{'LOWER': 'celtics'}]]]) +def test_issue118(en_tokenizer, patterns): + """Test a bug that arose from having overlapping matches""" + text = "how many points did lebron james score against the boston celtics last night" + doc = en_tokenizer(text) + ORG = doc.vocab.strings['ORG'] + matcher = Matcher(doc.vocab) + matcher.add("BostonCeltics", None, *patterns) + assert len(list(doc.ents)) == 0 + matches = [(ORG, start, end) for _, start, end in 
matcher(doc)] + assert matches == [(ORG, 9, 11), (ORG, 10, 11)] + doc.ents = matches[:1] + ents = list(doc.ents) + assert len(ents) == 1 + assert ents[0].label == ORG + assert ents[0].start == 9 + assert ents[0].end == 11 + + +@pytest.mark.parametrize('patterns', [ + [[{'LOWER': 'boston'}], [{'LOWER': 'boston'}, {'LOWER': 'celtics'}]], + [[{'LOWER': 'boston'}, {'LOWER': 'celtics'}], [{'LOWER': 'boston'}]]]) +def test_issue118_prefix_reorder(en_tokenizer, patterns): + """Test a bug that arose from having overlapping matches""" + text = "how many points did lebron james score against the boston celtics last night" + doc = en_tokenizer(text) + ORG = doc.vocab.strings['ORG'] + matcher = Matcher(doc.vocab) + matcher.add('BostonCeltics', None, *patterns) + assert len(list(doc.ents)) == 0 + matches = [(ORG, start, end) for _, start, end in matcher(doc)] + doc.ents += tuple(matches)[1:] + assert matches == [(ORG, 9, 10), (ORG, 9, 11)] + ents = doc.ents + assert len(ents) == 1 + assert ents[0].label == ORG + assert ents[0].start == 9 + assert ents[0].end == 11 + + +def test_issue242(en_tokenizer): + """Test overlapping multi-word phrases.""" + text = "There are different food safety standards in different countries." 
+ patterns = [[{'LOWER': 'food'}, {'LOWER': 'safety'}], + [{'LOWER': 'safety'}, {'LOWER': 'standards'}]] + doc = en_tokenizer(text) + matcher = Matcher(doc.vocab) + matcher.add('FOOD', None, *patterns) + + matches = [(ent_type, start, end) for ent_type, start, end in matcher(doc)] + doc.ents += tuple(matches) + match1, match2 = matches + assert match1[1] == 3 + assert match1[2] == 5 + assert match2[1] == 4 + assert match2[2] == 6 + + +def test_issue309(en_tokenizer): + """Test Issue #309: SBD fails on empty string""" + tokens = en_tokenizer(" ") + doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=[0], deps=['ROOT']) + doc.is_parsed = True + assert len(doc) == 1 + sents = list(doc.sents) + assert len(sents) == 1 + + +def test_issue351(en_tokenizer): + doc = en_tokenizer(" This is a cat.") + assert doc[0].idx == 0 + assert len(doc[0]) == 3 + assert doc[1].idx == 3 + + +def test_issue360(en_tokenizer): + """Test tokenization of big ellipsis""" + tokens = en_tokenizer('$45...............Asking') + assert len(tokens) > 2 + + +@pytest.mark.parametrize('text1,text2', [("cat", "dog")]) +def test_issue361(en_vocab, text1, text2): + """Test Issue #361: Equality of lexemes""" + assert en_vocab[text1] == en_vocab[text1] + assert en_vocab[text1] != en_vocab[text2] + + +def test_issue587(en_tokenizer): + """Test that Matcher doesn't segfault on particular input""" + doc = en_tokenizer('a b; c') + matcher = Matcher(doc.vocab) + matcher.add('TEST1', None, [{ORTH: 'a'}, {ORTH: 'b'}]) + matches = matcher(doc) + assert len(matches) == 1 + matcher.add('TEST2', None, [{ORTH: 'a'}, {ORTH: 'b'}, {IS_PUNCT: True}, {ORTH: 'c'}]) + matches = matcher(doc) + assert len(matches) == 2 + matcher.add('TEST3', None, [{ORTH: 'a'}, {ORTH: 'b'}, {IS_PUNCT: True}, {ORTH: 'd'}]) + matches = matcher(doc) + assert len(matches) == 2 + + +def test_issue588(en_vocab): + matcher = Matcher(en_vocab) + with pytest.raises(ValueError): + matcher.add('TEST', None, []) + + +@pytest.mark.xfail +def 
test_issue589(): + vocab = Vocab() + vocab.strings.set_frozen(True) + doc = Doc(vocab, words=['whata']) + + +def test_issue590(en_vocab): + """Test overlapping matches""" + doc = Doc(en_vocab, words=['n', '=', '1', ';', 'a', ':', '5', '%']) + matcher = Matcher(en_vocab) + matcher.add('ab', None, [{'IS_ALPHA': True}, {'ORTH': ':'}, {'LIKE_NUM': True}, {'ORTH': '%'}]) + matcher.add('ab', None, [{'IS_ALPHA': True}, {'ORTH': '='}, {'LIKE_NUM': True}]) + matches = matcher(doc) + assert len(matches) == 2 + + +def test_issue595(): + """Test lemmatization of base forms""" + words = ["Do", "n't", "feed", "the", "dog"] + tag_map = {'VB': {POS: VERB, VerbForm_inf: True}} + rules = {"verb": [["ed", "e"]]} + lemmatizer = Lemmatizer({'verb': {}}, {'verb': {}}, rules) + vocab = Vocab(lemmatizer=lemmatizer, tag_map=tag_map) + doc = Doc(vocab, words=words) + doc[2].tag_ = 'VB' + assert doc[2].text == 'feed' + assert doc[2].lemma_ == 'feed' + + +def test_issue599(en_vocab): + doc = Doc(en_vocab) + doc.is_tagged = True + doc.is_parsed = True + doc2 = Doc(doc.vocab) + doc2.from_bytes(doc.to_bytes()) + assert doc2.is_parsed + + +def test_issue600(): + vocab = Vocab(tag_map={'NN': {'pos': 'NOUN'}}) + doc = Doc(vocab, words=["hello"]) + doc[0].tag_ = 'NN' + + +def test_issue615(en_tokenizer): + def merge_phrases(matcher, doc, i, matches): + """Merge a phrase. We have to be careful here because we'll change the + token indices. 
To avoid problems, merge all the phrases once we're called + on the last match.""" + if i != len(matches)-1: + return None + spans = [(ent_id, ent_id, doc[start : end]) for ent_id, start, end in matches] + for ent_id, label, span in spans: + span.merge(tag='NNP' if label else span.root.tag_, lemma=span.text, + label=label) + doc.ents = doc.ents + ((label, span.start, span.end),) + + text = "The golf club is broken" + pattern = [{'ORTH': "golf"}, {'ORTH': "club"}] + label = "Sport_Equipment" + doc = en_tokenizer(text) + matcher = Matcher(doc.vocab) + matcher.add(label, merge_phrases, pattern) + match = matcher(doc) + entities = list(doc.ents) + assert entities != [] + assert entities[0].label != 0 + + +@pytest.mark.parametrize('text,number', [("7am", "7"), ("11p.m.", "11")]) +def test_issue736(en_tokenizer, text, number): + """Test that times like "7am" are tokenized correctly and that numbers are + converted to string.""" + tokens = en_tokenizer(text) + assert len(tokens) == 2 + assert tokens[0].text == number + + +@pytest.mark.parametrize('text', ["3/4/2012", "01/12/1900"]) +def test_issue740(en_tokenizer, text): + """Test that dates are not split and kept as one token. This behaviour is + currently inconsistent, since dates separated by hyphens are still split. 
+ This will be hard to prevent without causing clashes with numeric ranges.""" + tokens = en_tokenizer(text) + assert len(tokens) == 1 + + +def test_issue743(): + doc = Doc(Vocab(), ['hello', 'world']) + token = doc[0] + s = set([token]) + items = list(s) + assert items[0] is token + + +@pytest.mark.parametrize('text', ["We were scared", "We Were Scared"]) +def test_issue744(en_tokenizer, text): + """Test that 'were' and 'Were' are excluded from the contractions + generated by the English tokenizer exceptions.""" + tokens = en_tokenizer(text) + assert len(tokens) == 3 + assert tokens[1].text.lower() == "were" + + +@pytest.mark.parametrize('text,is_num', [("one", True), ("ten", True), + ("teneleven", False)]) +def test_issue759(en_tokenizer, text, is_num): + tokens = en_tokenizer(text) + assert tokens[0].like_num == is_num + + +@pytest.mark.parametrize('text', ["Shell", "shell", "Shed", "shed"]) +def test_issue775(en_tokenizer, text): + """Test that 'Shell' and 'shell' are excluded from the contractions + generated by the English tokenizer exceptions.""" + tokens = en_tokenizer(text) + assert len(tokens) == 1 + assert tokens[0].text == text + + +@pytest.mark.parametrize('text', ["This is a string ", "This is a string\u0020"]) +def test_issue792(en_tokenizer, text): + """Test for Issue #792: Trailing whitespace is removed after tokenization.""" + doc = en_tokenizer(text) + assert ''.join([token.text_with_ws for token in doc]) == text + + +@pytest.mark.parametrize('text', ["This is a string", "This is a string\n"]) +def test_control_issue792(en_tokenizer, text): + """Test base case for Issue #792: Non-trailing whitespace""" + doc = en_tokenizer(text) + assert ''.join([token.text_with_ws for token in doc]) == text + + +@pytest.mark.parametrize('text,tokens', [ + ('"deserve,"--and', ['"', "deserve", ',"--', "and"]), + ("exception;--exclusive", ["exception", ";--", "exclusive"]), + ("day.--Is", ["day", ".--", "Is"]), + ("refinement:--just", ["refinement", ":--", 
"just"]), + ("memories?--To", ["memories", "?--", "To"]), + ("Useful.=--Therefore", ["Useful", ".=--", "Therefore"]), + ("=Hope.=--Pandora", ["=", "Hope", ".=--", "Pandora"])]) +def test_issue801(en_tokenizer, text, tokens): + """Test that special characters + hyphens are split correctly.""" + doc = en_tokenizer(text) + assert len(doc) == len(tokens) + assert [t.text for t in doc] == tokens + + +@pytest.mark.parametrize('text,expected_tokens', [ + ('Smörsåsen används bl.a. till fisk', ['Smörsåsen', 'används', 'bl.a.', 'till', 'fisk']), + ('Jag kommer först kl. 13 p.g.a. diverse förseningar', ['Jag', 'kommer', 'först', 'kl.', '13', 'p.g.a.', 'diverse', 'förseningar']) +]) +def test_issue805(sv_tokenizer, text, expected_tokens): + tokens = sv_tokenizer(text) + token_list = [token.text for token in tokens if not token.is_space] + assert expected_tokens == token_list + + +def test_issue850(): + """The variable-length pattern matches the succeeding token. Check we + handle the ambiguity correctly.""" + vocab = Vocab(lex_attr_getters={LOWER: lambda string: string.lower()}) + matcher = Matcher(vocab) + IS_ANY_TOKEN = matcher.vocab.add_flag(lambda x: True) + pattern = [{'LOWER': "bob"}, {'OP': '*', 'IS_ANY_TOKEN': True}, {'LOWER': 'frank'}] + matcher.add('FarAway', None, pattern) + doc = Doc(matcher.vocab, words=['bob', 'and', 'and', 'frank']) + match = matcher(doc) + assert len(match) == 1 + ent_id, start, end = match[0] + assert start == 0 + assert end == 4 + + +def test_issue850_basic(): + """Test Matcher matches with '*' operator and Boolean flag""" + vocab = Vocab(lex_attr_getters={LOWER: lambda string: string.lower()}) + matcher = Matcher(vocab) + IS_ANY_TOKEN = matcher.vocab.add_flag(lambda x: True) + pattern = [{'LOWER': "bob"}, {'OP': '*', 'LOWER': 'and'}, {'LOWER': 'frank'}] + matcher.add('FarAway', None, pattern) + doc = Doc(matcher.vocab, words=['bob', 'and', 'and', 'frank']) + match = matcher(doc) + assert len(match) == 1 + ent_id, start, end = match[0] + 
assert start == 0 + assert end == 4 + + +@pytest.mark.parametrize('text', ["au-delàs", "pair-programmâmes", + "terra-formées", "σ-compacts"]) +def test_issue852(fr_tokenizer, text): + """Test that French tokenizer exceptions are imported correctly.""" + tokens = fr_tokenizer(text) + assert len(tokens) == 1 + + +@pytest.mark.parametrize('text', ["aaabbb@ccc.com\nThank you!", + "aaabbb@ccc.com \nThank you!"]) +def test_issue859(en_tokenizer, text): + """Test that no extra space is added in doc.text method.""" + doc = en_tokenizer(text) + assert doc.text == text + + +@pytest.mark.parametrize('text', ["Datum:2014-06-02\nDokument:76467"]) +def test_issue886(en_tokenizer, text): + """Test that token.idx matches the original text index for texts with newlines.""" + doc = en_tokenizer(text) + for token in doc: + assert len(token.text) == len(token.text_with_ws) + assert text[token.idx] == token.text[0] + + +@pytest.mark.parametrize('text', ["want/need"]) +def test_issue891(en_tokenizer, text): + """Test that / infixes are split correctly.""" + tokens = en_tokenizer(text) + assert len(tokens) == 3 + assert tokens[1].text == "/" + + +@pytest.mark.parametrize('text,tag,lemma', [ + ("anus", "NN", "anus"), + ("princess", "NN", "princess"), + ("inner", "JJ", "inner") +]) +def test_issue912(en_vocab, text, tag, lemma): + """Test base-forms are preserved.""" + doc = Doc(en_vocab, words=[text]) + doc[0].tag_ = tag + assert doc[0].lemma_ == lemma + + +def test_issue957(en_tokenizer): + """Test that spaCy doesn't hang on many periods.""" + # skip test if pytest-timeout is not installed (module name uses an underscore) + pytest.importorskip('pytest_timeout') + string = '0' + for i in range(1, 100): + string += '.%d' % i + doc = en_tokenizer(string) + + +@pytest.mark.xfail +def test_issue999(): + """Test that adding entities and resuming training works passably OK. + There are two issues here: + 1) We have to re-add labels. This isn't very nice.
+ 2) There's no way to set the learning rate for the weight update, so we + end up out-of-scale, causing it to learn too fast. + """ + TRAIN_DATA = [ + ["hey", []], + ["howdy", []], + ["hey there", []], + ["hello", []], + ["hi", []], + ["i'm looking for a place to eat", []], + ["i'm looking for a place in the north of town", [[31,36,"LOCATION"]]], + ["show me chinese restaurants", [[8,15,"CUISINE"]]], + ["show me chines restaurants", [[8,14,"CUISINE"]]], + ] + + nlp = Language() + ner = nlp.create_pipe('ner') + nlp.add_pipe(ner) + for _, offsets in TRAIN_DATA: + for start, end, label in offsets: + ner.add_label(label) + nlp.begin_training() + ner.model.learn_rate = 0.001 + for itn in range(100): + random.shuffle(TRAIN_DATA) + for raw_text, entity_offsets in TRAIN_DATA: + nlp.update([raw_text], [{'entities': entity_offsets}]) + + with make_tempdir() as model_dir: + nlp.to_disk(model_dir) + nlp2 = Language().from_disk(model_dir) + + for raw_text, entity_offsets in TRAIN_DATA: + doc = nlp2(raw_text) + ents = {(ent.start_char, ent.end_char): ent.label_ for ent in doc.ents} + for start, end, label in entity_offsets: + if (start, end) in ents: + assert ents[(start, end)] == label + break + else: + if entity_offsets: + raise Exception(ents) diff --git a/spacy/tests/regression/test_issue1001-1500.py b/spacy/tests/regression/test_issue1001-1500.py new file mode 100644 index 000000000..e85d19ccd --- /dev/null +++ b/spacy/tests/regression/test_issue1001-1500.py @@ -0,0 +1,127 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest +import re +from spacy.tokens import Doc +from spacy.vocab import Vocab +from spacy.lang.en import English +from spacy.lang.lex_attrs import LEX_ATTRS +from spacy.matcher import Matcher +from spacy.tokenizer import Tokenizer +from spacy.lemmatizer import Lemmatizer +from spacy.symbols import ORTH, LEMMA, POS, VERB, VerbForm_part + + +def test_issue1242(): + nlp = English() + doc = nlp('') + assert len(doc) == 0 + docs = 
list(nlp.pipe(['', 'hello'])) + assert len(docs[0]) == 0 + assert len(docs[1]) == 1 + + +def test_issue1250(): + """Test cached special cases.""" + special_case = [{ORTH: 'reimbur', LEMMA: 'reimburse', POS: 'VERB'}] + nlp = English() + nlp.tokenizer.add_special_case('reimbur', special_case) + lemmas = [w.lemma_ for w in nlp('reimbur, reimbur...')] + assert lemmas == ['reimburse', ',', 'reimburse', '...'] + lemmas = [w.lemma_ for w in nlp('reimbur, reimbur...')] + assert lemmas == ['reimburse', ',', 'reimburse', '...'] + + +def test_issue1257(): + """Test that tokens compare correctly.""" + doc1 = Doc(Vocab(), words=['a', 'b', 'c']) + doc2 = Doc(Vocab(), words=['a', 'c', 'e']) + assert doc1[0] != doc2[0] + assert not doc1[0] == doc2[0] + + +def test_issue1375(): + """Test that token.nbor() raises IndexError for out-of-bounds access.""" + doc = Doc(Vocab(), words=['0', '1', '2']) + with pytest.raises(IndexError): + assert doc[0].nbor(-1) + assert doc[1].nbor(-1).text == '0' + with pytest.raises(IndexError): + assert doc[2].nbor(1) + assert doc[1].nbor(1).text == '2' + + +def test_issue1387(): + tag_map = {'VBG': {POS: VERB, VerbForm_part: True}} + index = {"verb": ("cope","cop")} + exc = {"verb": {"coping": ("cope",)}} + rules = {"verb": [["ing", ""]]} + lemmatizer = Lemmatizer(index, exc, rules) + vocab = Vocab(lemmatizer=lemmatizer, tag_map=tag_map) + doc = Doc(vocab, words=["coping"]) + doc[0].tag_ = 'VBG' + assert doc[0].text == "coping" + assert doc[0].lemma_ == "cope" + + +def test_issue1434(): + """Test matches occur when optional element at end of short doc.""" + pattern = [{'ORTH': 'Hello' }, {'IS_ALPHA': True, 'OP': '?'}] + vocab = Vocab(lex_attr_getters=LEX_ATTRS) + hello_world = Doc(vocab, words=['Hello', 'World']) + hello = Doc(vocab, words=['Hello']) + matcher = Matcher(vocab) + matcher.add('MyMatcher', None, pattern) + matches = matcher(hello_world) + assert matches + matches = matcher(hello) + assert matches + + 
+@pytest.mark.parametrize('string,start,end', [ + ('a', 0, 1), ('a b', 0, 2), ('a c', 0, 1), ('a b c', 0, 2), + ('a b b c', 0, 3), ('a b b', 0, 3),]) +def test_issue1450(string, start, end): + """Test matcher works when patterns end with * operator.""" + pattern = [{'ORTH': "a"}, {'ORTH': "b", 'OP': "*"}] + matcher = Matcher(Vocab()) + matcher.add("TSTEND", None, pattern) + doc = Doc(Vocab(), words=string.split()) + matches = matcher(doc) + if start is None or end is None: + assert matches == [] + assert matches[-1][1] == start + assert matches[-1][2] == end + + +def test_issue1488(): + prefix_re = re.compile(r'''[\[\("']''') + suffix_re = re.compile(r'''[\]\)"']''') + infix_re = re.compile(r'''[-~\.]''') + simple_url_re = re.compile(r'''^https?://''') + + def my_tokenizer(nlp): + return Tokenizer(nlp.vocab, {}, + prefix_search=prefix_re.search, + suffix_search=suffix_re.search, + infix_finditer=infix_re.finditer, + token_match=simple_url_re.match) + + nlp = English() + nlp.tokenizer = my_tokenizer(nlp) + doc = nlp("This is a test.") + for token in doc: + assert token.text + + +def test_issue1494(): + infix_re = re.compile(r'''[^a-z]''') + test_cases = [('token 123test', ['token', '1', '2', '3', 'test']), + ('token 1test', ['token', '1test']), + ('hello...test', ['hello', '.', '.', '.', 'test'])] + new_tokenizer = lambda nlp: Tokenizer(nlp.vocab, {}, infix_finditer=infix_re.finditer) + nlp = English() + nlp.tokenizer = new_tokenizer(nlp) + for text, expected in test_cases: + assert [token.text for token in nlp(text)] == expected diff --git a/spacy/tests/regression/test_issue118.py b/spacy/tests/regression/test_issue118.py deleted file mode 100644 index b4e1f02b2..000000000 --- a/spacy/tests/regression/test_issue118.py +++ /dev/null @@ -1,55 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from ...matcher import Matcher - -import pytest - - -pattern1 = [[{'LOWER': 'celtics'}], [{'LOWER': 'boston'}, {'LOWER': 'celtics'}]] -pattern2 = [[{'LOWER': 
'boston'}, {'LOWER': 'celtics'}], [{'LOWER': 'celtics'}]] -pattern3 = [[{'LOWER': 'boston'}], [{'LOWER': 'boston'}, {'LOWER': 'celtics'}]] -pattern4 = [[{'LOWER': 'boston'}, {'LOWER': 'celtics'}], [{'LOWER': 'boston'}]] - - -@pytest.fixture -def doc(en_tokenizer): - text = "how many points did lebron james score against the boston celtics last night" - doc = en_tokenizer(text) - return doc - - -@pytest.mark.parametrize('pattern', [pattern1, pattern2]) -def test_issue118(doc, pattern): - """Test a bug that arose from having overlapping matches""" - ORG = doc.vocab.strings['ORG'] - matcher = Matcher(doc.vocab) - matcher.add("BostonCeltics", None, *pattern) - - assert len(list(doc.ents)) == 0 - matches = [(ORG, start, end) for _, start, end in matcher(doc)] - assert matches == [(ORG, 9, 11), (ORG, 10, 11)] - doc.ents = matches[:1] - ents = list(doc.ents) - assert len(ents) == 1 - assert ents[0].label == ORG - assert ents[0].start == 9 - assert ents[0].end == 11 - - -@pytest.mark.parametrize('pattern', [pattern3, pattern4]) -def test_issue118_prefix_reorder(doc, pattern): - """Test a bug that arose from having overlapping matches""" - ORG = doc.vocab.strings['ORG'] - matcher = Matcher(doc.vocab) - matcher.add('BostonCeltics', None, *pattern) - - assert len(list(doc.ents)) == 0 - matches = [(ORG, start, end) for _, start, end in matcher(doc)] - doc.ents += tuple(matches)[1:] - assert matches == [(ORG, 9, 10), (ORG, 9, 11)] - ents = doc.ents - assert len(ents) == 1 - assert ents[0].label == ORG - assert ents[0].start == 9 - assert ents[0].end == 11 diff --git a/spacy/tests/regression/test_issue1207.py b/spacy/tests/regression/test_issue1207.py deleted file mode 100644 index f8a53e05c..000000000 --- a/spacy/tests/regression/test_issue1207.py +++ /dev/null @@ -1,13 +0,0 @@ -from __future__ import unicode_literals - -import pytest - - -@pytest.mark.models('en') -def test_issue1207(EN): - text = 'Employees are recruiting talented staffers from overseas.' 
- doc = EN(text) - - assert [i.text for i in doc.noun_chunks] == ['Employees', 'talented staffers'] - sent = list(doc.sents)[0] - assert [i.text for i in sent.noun_chunks] == ['Employees', 'talented staffers'] diff --git a/spacy/tests/regression/test_issue1242.py b/spacy/tests/regression/test_issue1242.py deleted file mode 100644 index 50dc8c37e..000000000 --- a/spacy/tests/regression/test_issue1242.py +++ /dev/null @@ -1,23 +0,0 @@ -from __future__ import unicode_literals -import pytest -from ...lang.en import English -from ...util import load_model - - -def test_issue1242_empty_strings(): - nlp = English() - doc = nlp('') - assert len(doc) == 0 - docs = list(nlp.pipe(['', 'hello'])) - assert len(docs[0]) == 0 - assert len(docs[1]) == 1 - - -@pytest.mark.models('en') -def test_issue1242_empty_strings_en_core_web_sm(): - nlp = load_model('en_core_web_sm') - doc = nlp('') - assert len(doc) == 0 - docs = list(nlp.pipe(['', 'hello'])) - assert len(docs[0]) == 0 - assert len(docs[1]) == 1 diff --git a/spacy/tests/regression/test_issue1250.py b/spacy/tests/regression/test_issue1250.py deleted file mode 100644 index 3b6e0bbf2..000000000 --- a/spacy/tests/regression/test_issue1250.py +++ /dev/null @@ -1,13 +0,0 @@ -from __future__ import unicode_literals -from ...tokenizer import Tokenizer -from ...symbols import ORTH, LEMMA, POS -from ...lang.en import English - -def test_issue1250_cached_special_cases(): - nlp = English() - nlp.tokenizer.add_special_case(u'reimbur', [{ORTH: u'reimbur', LEMMA: u'reimburse', POS: u'VERB'}]) - - lemmas = [w.lemma_ for w in nlp(u'reimbur, reimbur...')] - assert lemmas == ['reimburse', ',', 'reimburse', '...'] - lemmas = [w.lemma_ for w in nlp(u'reimbur, reimbur...')] - assert lemmas == ['reimburse', ',', 'reimburse', '...'] diff --git a/spacy/tests/regression/test_issue1253.py b/spacy/tests/regression/test_issue1253.py deleted file mode 100644 index 2fe77d6d8..000000000 --- a/spacy/tests/regression/test_issue1253.py +++ /dev/null @@ -1,20 
+0,0 @@ -from __future__ import unicode_literals -import pytest -import spacy - - -def ss(tt): - for i in range(len(tt)-1): - for j in range(i+1, len(tt)): - tt[i:j].root - - -@pytest.mark.models('en') -def test_access_parse_for_merged(): - nlp = spacy.load('en_core_web_sm') - t_t = nlp.tokenizer("Highly rated - I'll definitely") - nlp.tagger(t_t) - nlp.parser(t_t) - nlp.parser(t_t) - ss(t_t) - diff --git a/spacy/tests/regression/test_issue1257.py b/spacy/tests/regression/test_issue1257.py deleted file mode 100644 index de6b014a6..000000000 --- a/spacy/tests/regression/test_issue1257.py +++ /dev/null @@ -1,12 +0,0 @@ -'''Test tokens compare correctly''' -from __future__ import unicode_literals - -from ..util import get_doc -from ...vocab import Vocab - - -def test_issue1257(): - doc1 = get_doc(Vocab(), ['a', 'b', 'c']) - doc2 = get_doc(Vocab(), ['a', 'c', 'e']) - assert doc1[0] != doc2[0] - assert not doc1[0] == doc2[0] diff --git a/spacy/tests/regression/test_issue1305.py b/spacy/tests/regression/test_issue1305.py deleted file mode 100644 index 342cdd081..000000000 --- a/spacy/tests/regression/test_issue1305.py +++ /dev/null @@ -1,10 +0,0 @@ -import pytest -import spacy - -@pytest.mark.models('en') -def test_issue1305(): - '''Test lemmatization of English VBZ''' - nlp = spacy.load('en_core_web_sm') - assert nlp.vocab.morphology.lemmatizer('works', 'verb') == ['work'] - doc = nlp(u'This app works well') - assert doc[2].lemma_ == 'work' diff --git a/spacy/tests/regression/test_issue1375.py b/spacy/tests/regression/test_issue1375.py deleted file mode 100644 index 6f74d9a6d..000000000 --- a/spacy/tests/regression/test_issue1375.py +++ /dev/null @@ -1,16 +0,0 @@ -from __future__ import unicode_literals -import pytest -from ...vocab import Vocab -from ...tokens.doc import Doc - - -def test_issue1375(): - '''Test that token.nbor() raises IndexError for out-of-bounds access.''' - doc = Doc(Vocab(), words=['0', '1', '2']) - with pytest.raises(IndexError): - assert 
doc[0].nbor(-1) - assert doc[1].nbor(-1).text == '0' - with pytest.raises(IndexError): - assert doc[2].nbor(1) - assert doc[1].nbor(1).text == '2' - diff --git a/spacy/tests/regression/test_issue1380.py b/spacy/tests/regression/test_issue1380.py deleted file mode 100644 index b2d610954..000000000 --- a/spacy/tests/regression/test_issue1380.py +++ /dev/null @@ -1,14 +0,0 @@ -from __future__ import unicode_literals -import pytest - -from ...language import Language - -def test_issue1380_empty_string(): - nlp = Language() - doc = nlp('') - assert len(doc) == 0 - -@pytest.mark.models('en') -def test_issue1380_en(EN): - doc = EN('') - assert len(doc) == 0 diff --git a/spacy/tests/regression/test_issue1387.py b/spacy/tests/regression/test_issue1387.py deleted file mode 100644 index 4bd0092d0..000000000 --- a/spacy/tests/regression/test_issue1387.py +++ /dev/null @@ -1,22 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from ...symbols import POS, VERB, VerbForm_part -from ...vocab import Vocab -from ...lemmatizer import Lemmatizer -from ..util import get_doc - -import pytest - - -def test_issue1387(): - tag_map = {'VBG': {POS: VERB, VerbForm_part: True}} - index = {"verb": ("cope","cop")} - exc = {"verb": {"coping": ("cope",)}} - rules = {"verb": [["ing", ""]]} - lemmatizer = Lemmatizer(index, exc, rules) - vocab = Vocab(lemmatizer=lemmatizer, tag_map=tag_map) - doc = get_doc(vocab, ["coping"]) - doc[0].tag_ = 'VBG' - assert doc[0].text == "coping" - assert doc[0].lemma_ == "cope" diff --git a/spacy/tests/regression/test_issue1434.py b/spacy/tests/regression/test_issue1434.py deleted file mode 100644 index fc88cc3e6..000000000 --- a/spacy/tests/regression/test_issue1434.py +++ /dev/null @@ -1,22 +0,0 @@ -from __future__ import unicode_literals - -from ...vocab import Vocab -from ...lang.lex_attrs import LEX_ATTRS -from ...tokens import Doc -from ...matcher import Matcher - - -def test_issue1434(): - '''Test matches occur when optional element at end 
of short doc''' - vocab = Vocab(lex_attr_getters=LEX_ATTRS) - hello_world = Doc(vocab, words=['Hello', 'World']) - hello = Doc(vocab, words=['Hello']) - - matcher = Matcher(vocab) - matcher.add('MyMatcher', None, - [ {'ORTH': 'Hello' }, {'IS_ALPHA': True, 'OP': '?'} ]) - - matches = matcher(hello_world) - assert matches - matches = matcher(hello) - assert matches diff --git a/spacy/tests/regression/test_issue1450.py b/spacy/tests/regression/test_issue1450.py deleted file mode 100644 index 3cfec349f..000000000 --- a/spacy/tests/regression/test_issue1450.py +++ /dev/null @@ -1,59 +0,0 @@ -from __future__ import unicode_literals -import pytest - -from ...matcher import Matcher -from ...tokens import Doc -from ...vocab import Vocab - - -@pytest.mark.parametrize( - 'string,start,end', - [ - ('a', 0, 1), - ('a b', 0, 2), - ('a c', 0, 1), - ('a b c', 0, 2), - ('a b b c', 0, 3), - ('a b b', 0, 3), - ] -) -def test_issue1450_matcher_end_zero_plus(string, start, end): - '''Test matcher works when patterns end with * operator. 
- - Original example (rewritten to avoid model usage) - - nlp = spacy.load('en_core_web_sm') - matcher = Matcher(nlp.vocab) - matcher.add( - "TSTEND", - on_match_1, - [ - {TAG: "JJ", LOWER: "new"}, - {TAG: "NN", 'OP': "*"} - ] - ) - doc = nlp(u'Could you create a new ticket for me?') - print([(w.tag_, w.text, w.lower_) for w in doc]) - matches = matcher(doc) - print(matches) - assert len(matches) == 1 - assert matches[0][1] == 4 - assert matches[0][2] == 5 - ''' - matcher = Matcher(Vocab()) - matcher.add( - "TSTEND", - None, - [ - {'ORTH': "a"}, - {'ORTH': "b", 'OP': "*"} - ] - ) - doc = Doc(Vocab(), words=string.split()) - matches = matcher(doc) - if start is None or end is None: - assert matches == [] - - print(matches) - assert matches[-1][1] == start - assert matches[-1][2] == end diff --git a/spacy/tests/regression/test_issue1488.py b/spacy/tests/regression/test_issue1488.py deleted file mode 100644 index 6b9ab9a70..000000000 --- a/spacy/tests/regression/test_issue1488.py +++ /dev/null @@ -1,26 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import regex as re -from ...lang.en import English -from ...tokenizer import Tokenizer - - -def test_issue1488(): - prefix_re = re.compile(r'''[\[\("']''') - suffix_re = re.compile(r'''[\]\)"']''') - infix_re = re.compile(r'''[-~\.]''') - simple_url_re = re.compile(r'''^https?://''') - - def my_tokenizer(nlp): - return Tokenizer(nlp.vocab, {}, - prefix_search=prefix_re.search, - suffix_search=suffix_re.search, - infix_finditer=infix_re.finditer, - token_match=simple_url_re.match) - - nlp = English() - nlp.tokenizer = my_tokenizer(nlp) - doc = nlp("This is a test.") - for token in doc: - assert token.text diff --git a/spacy/tests/regression/test_issue1494.py b/spacy/tests/regression/test_issue1494.py deleted file mode 100644 index 693e81e81..000000000 --- a/spacy/tests/regression/test_issue1494.py +++ /dev/null @@ -1,39 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest 
-import re - -from ...lang.en import English -from ...tokenizer import Tokenizer - - -def test_issue1494(): - infix_re = re.compile(r'''[^a-z]''') - text_to_tokenize1 = 'token 123test' - expected_tokens1 = ['token', '1', '2', '3', 'test'] - - text_to_tokenize2 = 'token 1test' - expected_tokens2 = ['token', '1test'] - - text_to_tokenize3 = 'hello...test' - expected_tokens3 = ['hello', '.', '.', '.', 'test'] - - def my_tokenizer(nlp): - return Tokenizer(nlp.vocab, - {}, - infix_finditer=infix_re.finditer - ) - - nlp = English() - - nlp.tokenizer = my_tokenizer(nlp) - - tokenized_words1 = [token.text for token in nlp(text_to_tokenize1)] - assert tokenized_words1 == expected_tokens1 - - tokenized_words2 = [token.text for token in nlp(text_to_tokenize2)] - assert tokenized_words2 == expected_tokens2 - - tokenized_words3 = [token.text for token in nlp(text_to_tokenize3)] - assert tokenized_words3 == expected_tokens3 diff --git a/spacy/tests/regression/test_issue1501-2000.py b/spacy/tests/regression/test_issue1501-2000.py new file mode 100644 index 000000000..ea329f55b --- /dev/null +++ b/spacy/tests/regression/test_issue1501-2000.py @@ -0,0 +1,246 @@ +# coding: utf8 +from __future__ import unicode_literals + +import pytest +import gc +import numpy +import copy +from spacy.lang.en import English +from spacy.lang.en.stop_words import STOP_WORDS +from spacy.lang.lex_attrs import is_stop +from spacy.vectors import Vectors +from spacy.vocab import Vocab +from spacy.language import Language +from spacy.tokens import Doc, Span +from spacy.pipeline import Tagger, EntityRecognizer +from spacy.attrs import HEAD, DEP +from spacy.matcher import Matcher + +from ..util import make_tempdir + + +def test_issue1506(): + def string_generator(): + for _ in range(10001): + yield "It's sentence produced by that bug." + for _ in range(10001): + yield "I erase some hbdsaj lemmas." + for _ in range(10001): + yield "I erase lemmas." 
+ for _ in range(10001): + yield "It's sentence produced by that bug." + for _ in range(10001): + yield "It's sentence produced by that bug." + + nlp = English() + for i, d in enumerate(nlp.pipe(string_generator())): + # We need to run cleanup more than once to actually free the data: + # the first pass only marks the strings as unused. + if i == 10000 or i == 20000 or i == 30000: + gc.collect() + for t in d: + str(t.lemma_) + + +def test_issue1518(): + """Test vectors.resize() works.""" + vectors = Vectors(shape=(10, 10)) + vectors.add('hello', row=2) + vectors.resize((5, 9)) + + +def test_issue1537(): + """Test that Span.as_doc() doesn't segfault.""" + string = 'The sky is blue . The man is pink . The dog is purple .' + doc = Doc(Vocab(), words=string.split()) + doc[0].sent_start = True + for word in doc[1:]: + if word.nbor(-1).text == '.': + word.sent_start = True + else: + word.sent_start = False + sents = list(doc.sents) + sent0 = sents[0].as_doc() + sent1 = sents[1].as_doc() + assert isinstance(sent0, Doc) + assert isinstance(sent1, Doc) + + +# TODO: Currently segfaulting, due to l_edge and r_edge misalignment +#def test_issue1537_model(): +# nlp = load_spacy('en') +# doc = nlp('The sky is blue. The man is pink. 
The dog is purple.') +# sents = [s.as_doc() for s in doc.sents] +# print(list(sents[0].noun_chunks)) +# print(list(sents[1].noun_chunks)) + + +def test_issue1539(): + """Ensure vectors.resize() doesn't try to modify dictionary during iteration.""" + v = Vectors(shape=(10, 10), keys=[5,3,98,100]) + v.resize((100,100)) + + +def test_issue1547(): + """Test that entity labels still match after merging tokens.""" + words = ['\n', 'worda', '.', '\n', 'wordb', '-', 'Biosphere', '2', '-', ' \n'] + doc = Doc(Vocab(), words=words) + doc.ents = [Span(doc, 6, 8, label=doc.vocab.strings['PRODUCT'])] + doc[5:7].merge() + assert [ent.text for ent in doc.ents] + + +def test_issue1612(en_tokenizer): + doc = en_tokenizer('The black cat purrs.') + span = doc[1: 3] + assert span.orth_ == span.text + + +def test_issue1654(): + nlp = Language(Vocab()) + assert not nlp.pipeline + nlp.add_pipe(lambda doc: doc, name='1') + nlp.add_pipe(lambda doc: doc, name='2', after='1') + nlp.add_pipe(lambda doc: doc, name='3', after='2') + assert nlp.pipe_names == ['1', '2', '3'] + nlp2 = Language(Vocab()) + assert not nlp2.pipeline + nlp2.add_pipe(lambda doc: doc, name='3') + nlp2.add_pipe(lambda doc: doc, name='2', before='3') + nlp2.add_pipe(lambda doc: doc, name='1', before='2') + assert nlp2.pipe_names == ['1', '2', '3'] + + +@pytest.mark.parametrize('text', ['test@example.com', 'john.doe@example.co.uk']) +def test_issue1698(en_tokenizer, text): + doc = en_tokenizer(text) + assert len(doc) == 1 + assert not doc[0].like_url + + +def test_issue1727(): + """Test that models with no pretrained vectors can be deserialized + correctly after vectors are added.""" + data = numpy.ones((3, 300), dtype='f') + vectors = Vectors(data=data, keys=['I', 'am', 'Matt']) + tagger = Tagger(Vocab()) + tagger.add_label('PRP') + tagger.begin_training() + assert tagger.cfg.get('pretrained_dims', 0) == 0 + tagger.vocab.vectors = vectors + with make_tempdir() as path: + tagger.to_disk(path) + tagger = 
Tagger(Vocab()).from_disk(path) + assert tagger.cfg.get('pretrained_dims', 0) == 0 + + +def test_issue1757(): + """Test comparison against None doesn't cause segfault.""" + doc = Doc(Vocab(), words=['a', 'b', 'c']) + assert not doc[0] < None + assert not doc[0] == None + assert doc[0] >= None + assert not doc[:2] < None + assert not doc[:2] == None + assert doc[:2] >= None + assert not doc.vocab['a'] == None + assert not doc.vocab['a'] < None + + +def test_issue1758(en_tokenizer): + """Test that "would've" is handled by the English tokenizer exceptions.""" + tokens = en_tokenizer("would've") + assert len(tokens) == 2 + assert tokens[0].tag_ == "MD" + assert tokens[1].lemma_ == "have" + + +def test_issue1799(): + """Test sentence boundaries are deserialized correctly, even for + non-projective sentences.""" + heads_deps = numpy.asarray([[1, 397], [4, 436], [2, 426], [1, 402], + [0, 8206900633647566924], [18446744073709551615, 440], + [18446744073709551614, 442]], dtype='uint64') + doc = Doc(Vocab(), words='Just what I was looking for .'.split()) + doc.vocab.strings.add('ROOT') + doc = doc.from_array([HEAD, DEP], heads_deps) + assert len(list(doc.sents)) == 1 + + +def test_issue1807(): + """Test vocab.set_vector also adds the word to the vocab.""" + vocab = Vocab() + assert 'hello' not in vocab + vocab.set_vector('hello', numpy.ones((50,), dtype='f')) + assert 'hello' in vocab + + +def test_issue1834(): + """Test that sentence boundaries & parse/tag flags are not lost + during serialization.""" + string = "This is a first sentence . 
And another one" + doc = Doc(Vocab(), words=string.split()) + doc[6].sent_start = True + new_doc = Doc(doc.vocab).from_bytes(doc.to_bytes()) + assert new_doc[6].sent_start + assert not new_doc.is_parsed + assert not new_doc.is_tagged + doc.is_parsed = True + doc.is_tagged = True + new_doc = Doc(doc.vocab).from_bytes(doc.to_bytes()) + assert new_doc.is_parsed + assert new_doc.is_tagged + + +def test_issue1868(): + """Test Vocab.__contains__ works with int keys.""" + vocab = Vocab() + lex = vocab['hello'] + assert lex.orth in vocab + assert lex.orth_ in vocab + assert 'some string' not in vocab + int_id = vocab.strings.add('some string') + assert int_id not in vocab + + +def test_issue1883(): + matcher = Matcher(Vocab()) + matcher.add('pat1', None, [{'orth': 'hello'}]) + doc = Doc(matcher.vocab, words=['hello']) + assert len(matcher(doc)) == 1 + new_matcher = copy.deepcopy(matcher) + new_doc = Doc(new_matcher.vocab, words=['hello']) + assert len(new_matcher(new_doc)) == 1 + + +@pytest.mark.parametrize('word', ['the']) +def test_issue1889(word): + assert is_stop(word, STOP_WORDS) == is_stop(word.upper(), STOP_WORDS) + + +def test_issue1915(): + cfg = {'hidden_depth': 2} # should error out + nlp = Language() + nlp.add_pipe(nlp.create_pipe('ner')) + nlp.get_pipe('ner').add_label('answer') + with pytest.raises(ValueError): + nlp.begin_training(**cfg) + + +def test_issue1945(): + """Test regression in Matcher introduced in v2.0.6.""" + matcher = Matcher(Vocab()) + matcher.add('MWE', None, [{'orth': 'a'}, {'orth': 'a'}]) + doc = Doc(matcher.vocab, words=['a', 'a', 'a']) + matches = matcher(doc) # we should see two overlapping matches here + assert len(matches) == 2 + assert matches[0][1:] == (0, 2) + assert matches[1][1:] == (1, 3) + + +@pytest.mark.parametrize('label', ['U-JOB-NAME']) +def test_issue1967(label): + ner = EntityRecognizer(Vocab()) + entry = ([0], ['word'], ['tag'], [0], ['dep'], [label]) + gold_parses = [(None, [(entry, None)])] + 
ner.moves.get_actions(gold_parses=gold_parses) diff --git a/spacy/tests/regression/test_issue1506.py b/spacy/tests/regression/test_issue1506.py deleted file mode 100644 index 71702a6d4..000000000 --- a/spacy/tests/regression/test_issue1506.py +++ /dev/null @@ -1,35 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import gc - -from ...lang.en import English - - -def test_issue1506(): - nlp = English() - - def string_generator(): - for _ in range(10001): - yield u"It's sentence produced by that bug." - - for _ in range(10001): - yield u"I erase some hbdsaj lemmas." - - for _ in range(10001): - yield u"I erase lemmas." - - for _ in range(10001): - yield u"It's sentence produced by that bug." - - for _ in range(10001): - yield u"It's sentence produced by that bug." - - for i, d in enumerate(nlp.pipe(string_generator())): - # We should run cleanup more than one time to actually cleanup data. - # In first run — clean up only mark strings as «not hitted». - if i == 10000 or i == 20000 or i == 30000: - gc.collect() - - for t in d: - str(t.lemma_) diff --git a/spacy/tests/regression/test_issue1518.py b/spacy/tests/regression/test_issue1518.py deleted file mode 100644 index 428eb46e8..000000000 --- a/spacy/tests/regression/test_issue1518.py +++ /dev/null @@ -1,10 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ...vectors import Vectors - -def test_issue1518(): - '''Test vectors.resize() works.''' - vectors = Vectors(shape=(10, 10)) - vectors.add(u'hello', row=2) - vectors.resize((5, 9)) diff --git a/spacy/tests/regression/test_issue1537.py b/spacy/tests/regression/test_issue1537.py deleted file mode 100644 index 2fce727d6..000000000 --- a/spacy/tests/regression/test_issue1537.py +++ /dev/null @@ -1,31 +0,0 @@ -'''Test Span.as_doc() doesn't segfault''' -from __future__ import unicode_literals -from ...tokens import Doc -from ...vocab import Vocab -from ... 
import load as load_spacy - - -def test_issue1537(): - string = 'The sky is blue . The man is pink . The dog is purple .' - doc = Doc(Vocab(), words=string.split()) - doc[0].sent_start = True - for word in doc[1:]: - if word.nbor(-1).text == '.': - word.sent_start = True - else: - word.sent_start = False - - sents = list(doc.sents) - sent0 = sents[0].as_doc() - sent1 = sents[1].as_doc() - assert isinstance(sent0, Doc) - assert isinstance(sent1, Doc) - - -# Currently segfaulting, due to l_edge and r_edge misalignment -#def test_issue1537_model(): -# nlp = load_spacy('en') -# doc = nlp(u'The sky is blue. The man is pink. The dog is purple.') -# sents = [s.as_doc() for s in doc.sents] -# print(list(sents[0].noun_chunks)) -# print(list(sents[1].noun_chunks)) diff --git a/spacy/tests/regression/test_issue1539.py b/spacy/tests/regression/test_issue1539.py deleted file mode 100644 index 6665f8087..000000000 --- a/spacy/tests/regression/test_issue1539.py +++ /dev/null @@ -1,10 +0,0 @@ -'''Ensure vectors.resize() doesn't try to modify dictionary during iteration.''' -from __future__ import unicode_literals - -from ...vectors import Vectors - - -def test_issue1539(): - v = Vectors(shape=(10, 10), keys=[5,3,98,100]) - v.resize((100,100)) - diff --git a/spacy/tests/regression/test_issue1547.py b/spacy/tests/regression/test_issue1547.py deleted file mode 100644 index aa7076dd3..000000000 --- a/spacy/tests/regression/test_issue1547.py +++ /dev/null @@ -1,17 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest - -from ...vocab import Vocab -from ...tokens import Doc, Span - - -@pytest.mark.xfail -def test_issue1547(): - """Test that entity labels still match after merging tokens.""" - words = ['\n', 'worda', '.', '\n', 'wordb', '-', 'Biosphere', '2', '-', ' \n'] - doc = Doc(Vocab(), words=words) - doc.ents = [Span(doc, 6, 8, label=doc.vocab.strings['PRODUCT'])] - doc[5:7].merge() - assert [ent.text for ent in doc.ents] diff --git 
a/spacy/tests/regression/test_issue1612.py b/spacy/tests/regression/test_issue1612.py deleted file mode 100644 index 6cae17e77..000000000 --- a/spacy/tests/regression/test_issue1612.py +++ /dev/null @@ -1,8 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - - -def test_issue1612(en_tokenizer): - doc = en_tokenizer('The black cat purrs.') - span = doc[1: 3] - assert span.orth_ == span.text diff --git a/spacy/tests/regression/test_issue1622.py b/spacy/tests/regression/test_issue1622.py deleted file mode 100644 index 86e7ad162..000000000 --- a/spacy/tests/regression/test_issue1622.py +++ /dev/null @@ -1,90 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals -import json -from tempfile import NamedTemporaryFile -import pytest - -from ...cli.train import train - - -@pytest.mark.xfail -def test_cli_trained_model_can_be_saved(tmpdir): - lang = 'nl' - output_dir = str(tmpdir) - train_file = NamedTemporaryFile('wb', dir=output_dir, delete=False) - train_corpus = [ - { - "id": "identifier_0", - "paragraphs": [ - { - "raw": "Jan houdt van Marie.\n", - "sentences": [ - { - "tokens": [ - { - "id": 0, - "dep": "nsubj", - "head": 1, - "tag": "NOUN", - "orth": "Jan", - "ner": "B-PER" - }, - { - "id": 1, - "dep": "ROOT", - "head": 0, - "tag": "VERB", - "orth": "houdt", - "ner": "O" - }, - { - "id": 2, - "dep": "case", - "head": 1, - "tag": "ADP", - "orth": "van", - "ner": "O" - }, - { - "id": 3, - "dep": "obj", - "head": -2, - "tag": "NOUN", - "orth": "Marie", - "ner": "B-PER" - }, - { - "id": 4, - "dep": "punct", - "head": -3, - "tag": "PUNCT", - "orth": ".", - "ner": "O" - }, - { - "id": 5, - "dep": "", - "head": -1, - "tag": "SPACE", - "orth": "\n", - "ner": "O" - } - ], - "brackets": [] - } - ] - } - ] - } - ] - - train_file.write(json.dumps(train_corpus).encode('utf-8')) - train_file.close() - train_data = train_file.name - dev_data = train_data - - # spacy train -n 1 -g -1 nl output_nl training_corpus.json training \ - # corpus.json - train(lang, 
output_dir, train_data, dev_data, n_iter=1) - - assert True diff --git a/spacy/tests/regression/test_issue1654.py b/spacy/tests/regression/test_issue1654.py deleted file mode 100644 index 531c00757..000000000 --- a/spacy/tests/regression/test_issue1654.py +++ /dev/null @@ -1,23 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest - -from ...language import Language -from ...vocab import Vocab - - -def test_issue1654(): - nlp = Language(Vocab()) - assert not nlp.pipeline - nlp.add_pipe(lambda doc: doc, name='1') - nlp.add_pipe(lambda doc: doc, name='2', after='1') - nlp.add_pipe(lambda doc: doc, name='3', after='2') - assert nlp.pipe_names == ['1', '2', '3'] - - nlp2 = Language(Vocab()) - assert not nlp2.pipeline - nlp2.add_pipe(lambda doc: doc, name='3') - nlp2.add_pipe(lambda doc: doc, name='2', before='3') - nlp2.add_pipe(lambda doc: doc, name='1', before='2') - assert nlp2.pipe_names == ['1', '2', '3'] diff --git a/spacy/tests/regression/test_issue1660.py b/spacy/tests/regression/test_issue1660.py deleted file mode 100644 index d46de0465..000000000 --- a/spacy/tests/regression/test_issue1660.py +++ /dev/null @@ -1,12 +0,0 @@ -from __future__ import unicode_literals -import pytest -from ...util import load_model - -@pytest.mark.models("en_core_web_md") -@pytest.mark.models("es_core_news_md") -def test_models_with_different_vectors(): - nlp = load_model('en_core_web_md') - doc = nlp(u'hello world') - nlp2 = load_model('es_core_news_md') - doc2 = nlp2(u'hola') - doc = nlp(u'hello world') diff --git a/spacy/tests/regression/test_issue1698.py b/spacy/tests/regression/test_issue1698.py deleted file mode 100644 index fa25cbdd5..000000000 --- a/spacy/tests/regression/test_issue1698.py +++ /dev/null @@ -1,11 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest - - -@pytest.mark.parametrize('text', ['test@example.com', 'john.doe@example.co.uk']) -def test_issue1698(en_tokenizer, text): - doc = en_tokenizer(text) - 
assert len(doc) == 1 - assert not doc[0].like_url diff --git a/spacy/tests/regression/test_issue1727.py b/spacy/tests/regression/test_issue1727.py deleted file mode 100644 index 90256821b..000000000 --- a/spacy/tests/regression/test_issue1727.py +++ /dev/null @@ -1,25 +0,0 @@ -'''Test that models with no pretrained vectors can be deserialized correctly -after vectors are added.''' -from __future__ import unicode_literals -import numpy -from ...pipeline import Tagger -from ...vectors import Vectors -from ...vocab import Vocab -from ..util import make_tempdir - - -def test_issue1727(): - data = numpy.ones((3, 300), dtype='f') - keys = [u'I', u'am', u'Matt'] - vectors = Vectors(data=data, keys=keys) - tagger = Tagger(Vocab()) - tagger.add_label('PRP') - tagger.begin_training() - - assert tagger.cfg.get('pretrained_dims', 0) == 0 - tagger.vocab.vectors = vectors - - with make_tempdir() as path: - tagger.to_disk(path) - tagger = Tagger(Vocab()).from_disk(path) - assert tagger.cfg.get('pretrained_dims', 0) == 0 diff --git a/spacy/tests/regression/test_issue1757.py b/spacy/tests/regression/test_issue1757.py deleted file mode 100644 index 782d767b5..000000000 --- a/spacy/tests/regression/test_issue1757.py +++ /dev/null @@ -1,18 +0,0 @@ -'''Test comparison against None doesn't cause segfault''' -from __future__ import unicode_literals - -from ...tokens import Doc -from ...vocab import Vocab - -def test_issue1757(): - doc = Doc(Vocab(), words=['a', 'b', 'c']) - assert not doc[0] < None - assert not doc[0] == None - assert doc[0] >= None - span = doc[:2] - assert not span < None - assert not span == None - assert span >= None - lex = doc.vocab['a'] - assert not lex == None - assert not lex < None diff --git a/spacy/tests/regression/test_issue1758.py b/spacy/tests/regression/test_issue1758.py deleted file mode 100644 index 2c10591e0..000000000 --- a/spacy/tests/regression/test_issue1758.py +++ /dev/null @@ -1,13 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals 
- -import pytest - - -@pytest.mark.parametrize('text', ["would've"]) -def test_issue1758(en_tokenizer, text): - """Test that "would've" is handled by the English tokenizer exceptions.""" - tokens = en_tokenizer(text) - assert len(tokens) == 2 - assert tokens[0].tag_ == "MD" - assert tokens[1].lemma_ == "have" diff --git a/spacy/tests/regression/test_issue1769.py b/spacy/tests/regression/test_issue1769.py deleted file mode 100644 index 63ec83908..000000000 --- a/spacy/tests/regression/test_issue1769.py +++ /dev/null @@ -1,61 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals -from ...util import get_lang_class -from ...attrs import LIKE_NUM - -import pytest - - -@pytest.mark.parametrize('word', ['eleven']) -def test_en_lex_attrs(word): - lang = get_lang_class('en') - like_num = lang.Defaults.lex_attr_getters[LIKE_NUM] - assert like_num(word) == like_num(word.upper()) - - -@pytest.mark.slow -@pytest.mark.parametrize('word', ['elleve', 'første']) -def test_da_lex_attrs(word): - lang = get_lang_class('da') - like_num = lang.Defaults.lex_attr_getters[LIKE_NUM] - assert like_num(word) == like_num(word.upper()) - - -@pytest.mark.slow -@pytest.mark.parametrize('word', ['onze', 'onzième']) -def test_fr_lex_attrs(word): - lang = get_lang_class('fr') - like_num = lang.Defaults.lex_attr_getters[LIKE_NUM] - assert like_num(word) == like_num(word.upper()) - - -@pytest.mark.slow -@pytest.mark.parametrize('word', ['sebelas']) -def test_id_lex_attrs(word): - lang = get_lang_class('id') - like_num = lang.Defaults.lex_attr_getters[LIKE_NUM] - assert like_num(word) == like_num(word.upper()) - - -@pytest.mark.slow -@pytest.mark.parametrize('word', ['elf', 'elfde']) -def test_nl_lex_attrs(word): - lang = get_lang_class('nl') - like_num = lang.Defaults.lex_attr_getters[LIKE_NUM] - assert like_num(word) == like_num(word.upper()) - - -@pytest.mark.slow -@pytest.mark.parametrize('word', ['onze', 'quadragésimo']) -def test_pt_lex_attrs(word): - lang = get_lang_class('pt') - 
like_num = lang.Defaults.lex_attr_getters[LIKE_NUM] - assert like_num(word) == like_num(word.upper()) - - -@pytest.mark.slow -@pytest.mark.parametrize('word', ['одиннадцать']) -def test_ru_lex_attrs(word): - lang = get_lang_class('ru') - like_num = lang.Defaults.lex_attr_getters[LIKE_NUM] - assert like_num(word) == like_num(word.upper()) diff --git a/spacy/tests/regression/test_issue1799.py b/spacy/tests/regression/test_issue1799.py deleted file mode 100644 index 6695a5357..000000000 --- a/spacy/tests/regression/test_issue1799.py +++ /dev/null @@ -1,20 +0,0 @@ -'''Test sentence boundaries are deserialized correctly, -even for non-projective sentences.''' -from __future__ import unicode_literals - -import pytest -import numpy -from ... tokens import Doc -from ... vocab import Vocab -from ... attrs import HEAD, DEP - - -def test_issue1799(): - problem_sentence = 'Just what I was looking for.' - heads_deps = numpy.asarray([[1, 397], [4, 436], [2, 426], [1, 402], - [0, 8206900633647566924], [18446744073709551615, 440], - [18446744073709551614, 442]], dtype='uint64') - doc = Doc(Vocab(), words='Just what I was looking for .'.split()) - doc.vocab.strings.add('ROOT') - doc = doc.from_array([HEAD, DEP], heads_deps) - assert len(list(doc.sents)) == 1 diff --git a/spacy/tests/regression/test_issue1807.py b/spacy/tests/regression/test_issue1807.py deleted file mode 100644 index c73f008eb..000000000 --- a/spacy/tests/regression/test_issue1807.py +++ /dev/null @@ -1,14 +0,0 @@ -'''Test vocab.set_vector also adds the word to the vocab.''' -from __future__ import unicode_literals -from ...vocab import Vocab - -import numpy - - -def test_issue1807(): - vocab = Vocab() - arr = numpy.ones((50,), dtype='f') - assert 'hello' not in vocab - vocab.set_vector('hello', arr) - assert 'hello' in vocab - diff --git a/spacy/tests/regression/test_issue1834.py b/spacy/tests/regression/test_issue1834.py deleted file mode 100644 index 00b0c76eb..000000000 --- 
a/spacy/tests/regression/test_issue1834.py +++ /dev/null @@ -1,27 +0,0 @@ -from __future__ import unicode_literals -from ...tokens import Doc -from ...vocab import Vocab - - -def test_issue1834(): - """test if sentence boundaries & parse/tag flags are not lost - during serialization - """ - words = "This is a first sentence . And another one".split() - vocab = Vocab() - doc = Doc(vocab, words=words) - vocab = doc.vocab - doc[6].sent_start = True - deser_doc = Doc(vocab).from_bytes(doc.to_bytes()) - assert deser_doc[6].sent_start - assert not deser_doc.is_parsed - assert not deser_doc.is_tagged - doc.is_parsed = True - doc.is_tagged = True - deser_doc = Doc(vocab).from_bytes(doc.to_bytes()) - assert deser_doc.is_parsed - assert deser_doc.is_tagged - - - - diff --git a/spacy/tests/regression/test_issue1855.py b/spacy/tests/regression/test_issue1855.py deleted file mode 100644 index b12b5c251..000000000 --- a/spacy/tests/regression/test_issue1855.py +++ /dev/null @@ -1,65 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals -import re - -from ...matcher import Matcher - -import pytest - -pattern1 = [{'ORTH':'A','OP':'1'},{'ORTH':'A','OP':'*'}] -pattern2 = [{'ORTH':'A','OP':'*'},{'ORTH':'A','OP':'1'}] -pattern3 = [{'ORTH':'A','OP':'1'},{'ORTH':'A','OP':'1'}] -pattern4 = [{'ORTH':'B','OP':'1'},{'ORTH':'A','OP':'*'},{'ORTH':'B','OP':'1'}] -pattern5 = [{'ORTH':'B','OP':'*'},{'ORTH':'A','OP':'*'},{'ORTH':'B','OP':'1'}] - -re_pattern1 = 'AA*' -re_pattern2 = 'A*A' -re_pattern3 = 'AA' -re_pattern4 = 'BA*B' -re_pattern5 = 'B*A*B' - -@pytest.fixture -def text(): - return "(ABBAAAAAB)." 
- -@pytest.fixture -def doc(en_tokenizer,text): - doc = en_tokenizer(' '.join(text)) - return doc - -@pytest.mark.xfail -@pytest.mark.parametrize('pattern,re_pattern',[ - (pattern1,re_pattern1), - (pattern2,re_pattern2), - (pattern3,re_pattern3), - (pattern4,re_pattern4), - (pattern5,re_pattern5)]) -def test_greedy_matching(doc,text,pattern,re_pattern): - """ - Test that the greedy matching behavior of the * op - is consistant with other re implementations - """ - matcher = Matcher(doc.vocab) - matcher.add(re_pattern,None,pattern) - matches = matcher(doc) - re_matches = [m.span() for m in re.finditer(re_pattern,text)] - for match,re_match in zip(matches,re_matches): - assert match[1:]==re_match - -@pytest.mark.xfail -@pytest.mark.parametrize('pattern,re_pattern',[ - (pattern1,re_pattern1), - (pattern2,re_pattern2), - (pattern3,re_pattern3), - (pattern4,re_pattern4), - (pattern5,re_pattern5)]) -def test_match_consuming(doc,text,pattern,re_pattern): - """ - Test that matcher.__call__ consumes tokens on a match - similar to re.findall - """ - matcher = Matcher(doc.vocab) - matcher.add(re_pattern,None,pattern) - matches = matcher(doc) - re_matches = [m.span() for m in re.finditer(re_pattern,text)] - assert len(matches)==len(re_matches) diff --git a/spacy/tests/regression/test_issue1868.py b/spacy/tests/regression/test_issue1868.py deleted file mode 100644 index 6d9da45cc..000000000 --- a/spacy/tests/regression/test_issue1868.py +++ /dev/null @@ -1,13 +0,0 @@ -'''Test Vocab.__contains__ works with int keys''' -from __future__ import unicode_literals - -from ... 
vocab import Vocab - -def test_issue1868(): - vocab = Vocab() - lex = vocab['hello'] - assert lex.orth in vocab - assert lex.orth_ in vocab - assert 'some string' not in vocab - int_id = vocab.strings.add('some string') - assert int_id not in vocab diff --git a/spacy/tests/regression/test_issue1883.py b/spacy/tests/regression/test_issue1883.py deleted file mode 100644 index 3fcf905c1..000000000 --- a/spacy/tests/regression/test_issue1883.py +++ /dev/null @@ -1,18 +0,0 @@ -'''Check Matcher can be unpickled correctly.''' -from __future__ import unicode_literals - -import copy - -from ... vocab import Vocab -from ... matcher import Matcher -from ... tokens import Doc - - -def test_issue1883(): - m = Matcher(Vocab()) - m.add('pat1', None, [{'orth': 'hello'}]) - doc = Doc(m.vocab, words=['hello']) - assert len(m(doc)) == 1 - m2 = copy.deepcopy(m) - doc2 = Doc(m2.vocab, words=['hello']) - assert len(m2(doc2)) == 1 diff --git a/spacy/tests/regression/test_issue1889.py b/spacy/tests/regression/test_issue1889.py deleted file mode 100644 index a0e20abcf..000000000 --- a/spacy/tests/regression/test_issue1889.py +++ /dev/null @@ -1,11 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals -from ...lang.lex_attrs import is_stop -from ...lang.en.stop_words import STOP_WORDS - -import pytest - - -@pytest.mark.parametrize('word', ['the']) -def test_lex_attrs_stop_words_case_sensitivity(word): - assert is_stop(word, STOP_WORDS) == is_stop(word.upper(), STOP_WORDS) diff --git a/spacy/tests/regression/test_issue1915.py b/spacy/tests/regression/test_issue1915.py deleted file mode 100644 index 23cf6dc73..000000000 --- a/spacy/tests/regression/test_issue1915.py +++ /dev/null @@ -1,19 +0,0 @@ -# coding: utf8 - -from __future__ import unicode_literals -from ...language import Language - - -def test_simple_ner(): - cfg = { - 'hidden_depth': 2, # should error out - } - - nlp = Language() - nlp.add_pipe(nlp.create_pipe('ner')) - nlp.get_pipe('ner').add_label('answer') - try: - 
nlp.begin_training(**cfg) - assert False # should error out - except ValueError: - assert True diff --git a/spacy/tests/regression/test_issue1919.py b/spacy/tests/regression/test_issue1919.py deleted file mode 100644 index ffb592b1e..000000000 --- a/spacy/tests/regression/test_issue1919.py +++ /dev/null @@ -1,10 +0,0 @@ -'''Test that nlp.begin_training() doesn't require missing cfg properties.''' -from __future__ import unicode_literals -import pytest -from ... import load as load_spacy - -@pytest.mark.models('en') -def test_issue1919(): - nlp = load_spacy('en') - opt = nlp.begin_training() - diff --git a/spacy/tests/regression/test_issue1945.py b/spacy/tests/regression/test_issue1945.py deleted file mode 100644 index 052f699fb..000000000 --- a/spacy/tests/regression/test_issue1945.py +++ /dev/null @@ -1,18 +0,0 @@ -'''Test regression in Matcher introduced in v2.0.6.''' -from __future__ import unicode_literals -import pytest - -from ...vocab import Vocab -from ...tokens import Doc -from ...matcher import Matcher - -def test_issue1945(): - text = "a a a" - matcher = Matcher(Vocab()) - matcher.add('MWE', None, [{'orth': 'a'}, {'orth': 'a'}]) - doc = Doc(matcher.vocab, words=['a', 'a', 'a']) - matches = matcher(doc) - # We should see two overlapping matches here - assert len(matches) == 2 - assert matches[0][1:] == (0, 2) - assert matches[1][1:] == (1, 3) diff --git a/spacy/tests/regression/test_issue1959.py b/spacy/tests/regression/test_issue1959.py deleted file mode 100644 index 0787af3b7..000000000 --- a/spacy/tests/regression/test_issue1959.py +++ /dev/null @@ -1,23 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals -import pytest - - -@pytest.mark.models('en') -def test_issue1959(EN): - texts = ['Apple is looking at buying U.K. 
startup for $1 billion.'] - # nlp = load_test_model('en_core_web_sm') - EN.add_pipe(clean_component, name='cleaner', after='ner') - doc = EN(texts[0]) - doc_pipe = [doc_pipe for doc_pipe in EN.pipe(texts)] - assert doc == doc_pipe[0] - - -def clean_component(doc): - """ Clean up text. Make lowercase and remove punctuation and stopwords """ - # Remove punctuation, symbols (#) and stopwords - doc = [tok.text.lower() for tok in doc if (not tok.is_stop - and tok.pos_ != 'PUNCT' and - tok.pos_ != 'SYM')] - doc = ' '.join(doc) - return doc diff --git a/spacy/tests/regression/test_issue1967.py b/spacy/tests/regression/test_issue1967.py deleted file mode 100644 index a6cd489cc..000000000 --- a/spacy/tests/regression/test_issue1967.py +++ /dev/null @@ -1,15 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest - -from ...pipeline import EntityRecognizer -from ...vocab import Vocab - - -@pytest.mark.parametrize('label', ['U-JOB-NAME']) -def test_issue1967(label): - ner = EntityRecognizer(Vocab()) - entry = ([0], ['word'], ['tag'], [0], ['dep'], [label]) - gold_parses = [(None, [(entry, None)])] - ner.moves.get_actions(gold_parses=gold_parses) diff --git a/spacy/tests/regression/test_issue2001-2500.py b/spacy/tests/regression/test_issue2001-2500.py new file mode 100644 index 000000000..d9febb152 --- /dev/null +++ b/spacy/tests/regression/test_issue2001-2500.py @@ -0,0 +1,47 @@ +# coding: utf8 +from __future__ import unicode_literals + +import pytest +from spacy.tokens import Doc +from spacy.displacy import render +from spacy.gold import iob_to_biluo + +from ..util import add_vecs_to_vocab + + +def test_issue2219(en_vocab): + vectors = [("a", [1, 2, 3]), ("letter", [4, 5, 6])] + add_vecs_to_vocab(en_vocab, vectors) + [(word1, vec1), (word2, vec2)] = vectors + doc = Doc(en_vocab, words=[word1, word2]) + assert doc[0].similarity(doc[1]) == doc[1].similarity(doc[0]) + + +def test_issue2361(de_tokenizer): + chars = ('<', '>', '&', '"') + doc = 
de_tokenizer('< > & " ') + doc.is_parsed = True + doc.is_tagged = True + html = render(doc) + for char in chars: + assert char in html + + +def test_issue2385(): + """Test that IOB tags are correctly converted to BILUO tags.""" + # fix bug in labels with a 'b' character + tags1 = ('B-BRAWLER', 'I-BRAWLER', 'I-BRAWLER') + assert iob_to_biluo(tags1) == ['B-BRAWLER', 'I-BRAWLER', 'L-BRAWLER'] + # maintain support for iob1 format + tags2 = ('I-ORG', 'I-ORG', 'B-ORG') + assert iob_to_biluo(tags2) == ['B-ORG', 'L-ORG', 'U-ORG'] + # maintain support for iob2 format + tags3 = ('B-PERSON', 'I-PERSON', 'B-PERSON') + assert iob_to_biluo(tags3) ==['B-PERSON', 'L-PERSON', 'U-PERSON'] + + +@pytest.mark.parametrize('tags', [ + ('B-ORG', 'L-ORG'), ('B-PERSON', 'I-PERSON', 'L-PERSON'), ('U-BRAWLER', 'U-BRAWLER')]) +def test_issue2385_biluo(tags): + """Test that BILUO-compatible tags aren't modified.""" + assert iob_to_biluo(tags) == list(tags) diff --git a/spacy/tests/regression/test_issue2219.py b/spacy/tests/regression/test_issue2219.py deleted file mode 100644 index ff7f9123d..000000000 --- a/spacy/tests/regression/test_issue2219.py +++ /dev/null @@ -1,18 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals -from ..util import add_vecs_to_vocab, get_doc -import pytest - -@pytest.fixture -def vectors(): - return [("a", [1, 2, 3]), ("letter", [4, 5, 6])] - -@pytest.fixture -def vocab(en_vocab, vectors): - add_vecs_to_vocab(en_vocab, vectors) - return en_vocab - -def test_issue2219(vocab, vectors): - [(word1, vec1), (word2, vec2)] = vectors - doc = get_doc(vocab, words=[word1, word2]) - assert doc[0].similarity(doc[1]) == doc[1].similarity(doc[0]) diff --git a/spacy/tests/regression/test_issue2361.py b/spacy/tests/regression/test_issue2361.py deleted file mode 100644 index a2ed38077..000000000 --- a/spacy/tests/regression/test_issue2361.py +++ /dev/null @@ -1,14 +0,0 @@ -from __future__ import unicode_literals -import pytest - -from ...displacy import render -from 
..util import get_doc - -def test_issue2361(de_tokenizer): - tokens = de_tokenizer('< > & " ') - html = render(get_doc(tokens.vocab, [t.text for t in tokens])) - - assert '<' in html - assert '>' in html - assert '&' in html - assert '"' in html diff --git a/spacy/tests/regression/test_issue2385.py b/spacy/tests/regression/test_issue2385.py deleted file mode 100644 index b3e4ba11a..000000000 --- a/spacy/tests/regression/test_issue2385.py +++ /dev/null @@ -1,34 +0,0 @@ -# coding: utf-8 -import pytest - -from ...gold import iob_to_biluo - - -@pytest.mark.xfail -@pytest.mark.parametrize('tags', [('B-ORG', 'L-ORG'), - ('B-PERSON', 'I-PERSON', 'L-PERSON'), - ('U-BRAWLER', 'U-BRAWLER')]) -def test_issue2385_biluo(tags): - """already biluo format""" - assert iob_to_biluo(tags) == list(tags) - - -@pytest.mark.xfail -@pytest.mark.parametrize('tags', [('B-BRAWLER', 'I-BRAWLER', 'I-BRAWLER')]) -def test_issue2385_iob_bcharacter(tags): - """fix bug in labels with a 'b' character""" - assert iob_to_biluo(tags) == ['B-BRAWLER', 'I-BRAWLER', 'L-BRAWLER'] - - -@pytest.mark.xfail -@pytest.mark.parametrize('tags', [('I-ORG', 'I-ORG', 'B-ORG')]) -def test_issue2385_iob1(tags): - """maintain support for iob1 format""" - assert iob_to_biluo(tags) == ['B-ORG', 'L-ORG', 'U-ORG'] - - -@pytest.mark.xfail -@pytest.mark.parametrize('tags', [('B-PERSON', 'I-PERSON', 'B-PERSON')]) -def test_issue2385_iob2(tags): - """maintain support for iob2 format""" - assert iob_to_biluo(tags) == ['B-PERSON', 'L-PERSON', 'U-PERSON'] diff --git a/spacy/tests/regression/test_issue242.py b/spacy/tests/regression/test_issue242.py deleted file mode 100644 index b5909fe65..000000000 --- a/spacy/tests/regression/test_issue242.py +++ /dev/null @@ -1,25 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from ...matcher import Matcher - -import pytest - - -def test_issue242(en_tokenizer): - """Test overlapping multi-word phrases.""" - text = "There are different food safety standards in different 
countries." - patterns = [[{'LOWER': 'food'}, {'LOWER': 'safety'}], - [{'LOWER': 'safety'}, {'LOWER': 'standards'}]] - - doc = en_tokenizer(text) - matcher = Matcher(doc.vocab) - matcher.add('FOOD', None, *patterns) - - matches = [(ent_type, start, end) for ent_type, start, end in matcher(doc)] - doc.ents += tuple(matches) - match1, match2 = matches - assert match1[1] == 3 - assert match1[2] == 5 - assert match2[1] == 4 - assert match2[2] == 6 diff --git a/spacy/tests/regression/test_issue2564.py b/spacy/tests/regression/test_issue2564.py new file mode 100644 index 000000000..ef629efc1 --- /dev/null +++ b/spacy/tests/regression/test_issue2564.py @@ -0,0 +1,17 @@ +# coding: utf8 +from __future__ import unicode_literals + +from spacy.language import Language + + +def test_issue2564(): + """Test the tagger sets is_tagged correctly when used via Language.pipe.""" + nlp = Language() + tagger = nlp.create_pipe('tagger') + tagger.begin_training() # initialise weights + nlp.add_pipe(tagger) + doc = nlp('hello world') + assert doc.is_tagged + docs = nlp.pipe(['hello', 'world']) + piped_doc = next(docs) + assert piped_doc.is_tagged diff --git a/spacy/tests/regression/test_issue2569.py b/spacy/tests/regression/test_issue2569.py new file mode 100644 index 000000000..b1db67508 --- /dev/null +++ b/spacy/tests/regression/test_issue2569.py @@ -0,0 +1,17 @@ +# coding: utf8 +from __future__ import unicode_literals + +from spacy.matcher import Matcher +from spacy.tokens import Span + + +def test_issue2569(en_tokenizer): + doc = en_tokenizer("It is May 15, 1993.") + doc.ents = [Span(doc, 2, 6, label=doc.vocab.strings['DATE'])] + matcher = Matcher(doc.vocab) + matcher.add("RULE", None, [{'ENT_TYPE':'DATE', 'OP':'+'}]) + matched = [doc[start:end] for _, start, end in matcher(doc)] + matched = sorted(matched, key=len, reverse=True) + assert len(matched) == 10 + assert len(matched[0]) == 4 + assert matched[0].text == 'May 15, 1993' diff --git a/spacy/tests/regression/test_issue309.py 
b/spacy/tests/regression/test_issue309.py deleted file mode 100644 index 84756c6b1..000000000 --- a/spacy/tests/regression/test_issue309.py +++ /dev/null @@ -1,14 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from ..util import get_doc - - -def test_issue309(en_tokenizer): - """Test Issue #309: SBD fails on empty string""" - tokens = en_tokenizer(" ") - doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=[0], deps=['ROOT']) - doc.is_parsed = True - assert len(doc) == 1 - sents = list(doc.sents) - assert len(sents) == 1 diff --git a/spacy/tests/regression/test_issue351.py b/spacy/tests/regression/test_issue351.py deleted file mode 100644 index 95dbec35a..000000000 --- a/spacy/tests/regression/test_issue351.py +++ /dev/null @@ -1,11 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -import pytest - - -def test_issue351(en_tokenizer): - doc = en_tokenizer(" This is a cat.") - assert doc[0].idx == 0 - assert len(doc[0]) == 3 - assert doc[1].idx == 3 diff --git a/spacy/tests/regression/test_issue360.py b/spacy/tests/regression/test_issue360.py deleted file mode 100644 index a2c007f16..000000000 --- a/spacy/tests/regression/test_issue360.py +++ /dev/null @@ -1,10 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -import pytest - - -def test_issue360(en_tokenizer): - """Test tokenization of big ellipsis""" - tokens = en_tokenizer('$45...............Asking') - assert len(tokens) > 2 diff --git a/spacy/tests/regression/test_issue361.py b/spacy/tests/regression/test_issue361.py deleted file mode 100644 index cc07567f4..000000000 --- a/spacy/tests/regression/test_issue361.py +++ /dev/null @@ -1,11 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -import pytest - - -@pytest.mark.parametrize('text1,text2', [("cat", "dog")]) -def test_issue361(en_vocab, text1, text2): - """Test Issue #361: Equality of lexemes""" - assert en_vocab[text1] == en_vocab[text1] - assert en_vocab[text1] != 
en_vocab[text2] diff --git a/spacy/tests/regression/test_issue401.py b/spacy/tests/regression/test_issue401.py deleted file mode 100644 index e5b72d472..000000000 --- a/spacy/tests/regression/test_issue401.py +++ /dev/null @@ -1,13 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest - - -@pytest.mark.models('en') -@pytest.mark.parametrize('text,i', [("Jane's got a new car", 1), - ("Jane thinks that's a nice car", 3)]) -def test_issue401(EN, text, i): - """Text that 's in contractions is not lemmatized as '.""" - tokens = EN(text) - assert tokens[i].lemma_ != "'" diff --git a/spacy/tests/regression/test_issue429.py b/spacy/tests/regression/test_issue429.py deleted file mode 100644 index 4804225ac..000000000 --- a/spacy/tests/regression/test_issue429.py +++ /dev/null @@ -1,27 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from ...matcher import Matcher - -import pytest - - -@pytest.mark.models('en') -def test_issue429(EN): - def merge_phrases(matcher, doc, i, matches): - if i != len(matches) - 1: - return None - spans = [(ent_id, ent_id, doc[start:end]) for ent_id, start, end in matches] - for ent_id, label, span in spans: - span.merge( - tag=('NNP' if label else span.root.tag_), - lemma=span.text, - label='PERSON') - - doc = EN('a') - matcher = Matcher(EN.vocab) - matcher.add('TEST', merge_phrases, [{'ORTH': 'a'}]) - doc = EN.make_doc('a b c') - EN.tagger(doc) - matcher(doc) - EN.entity(doc) diff --git a/spacy/tests/regression/test_issue514.py b/spacy/tests/regression/test_issue514.py deleted file mode 100644 index 6021efd44..000000000 --- a/spacy/tests/regression/test_issue514.py +++ /dev/null @@ -1,22 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from ..util import get_doc - -import pytest - - -@pytest.mark.skip -@pytest.mark.models('en') -def test_issue514(EN): - """Test serializing after adding entity""" - text = ["This", "is", "a", "sentence", "about", "pasta", "."] - vocab = 
EN.entity.vocab - doc = get_doc(vocab, text) - EN.entity.add_label("Food") - EN.entity(doc) - label_id = vocab.strings[u'Food'] - doc.ents = [(label_id, 5,6)] - assert [(ent.label_, ent.text) for ent in doc.ents] == [("Food", "pasta")] - doc2 = get_doc(EN.entity.vocab).from_bytes(doc.to_bytes()) - assert [(ent.label_, ent.text) for ent in doc2.ents] == [("Food", "pasta")] diff --git a/spacy/tests/regression/test_issue54.py b/spacy/tests/regression/test_issue54.py deleted file mode 100644 index 9867a4989..000000000 --- a/spacy/tests/regression/test_issue54.py +++ /dev/null @@ -1,10 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -import pytest - - -@pytest.mark.models('en') -def test_issue54(EN): - text = "Talks given by women had a slightly higher number of questions asked (3.2$\pm$0.2) than talks given by men (2.6$\pm$0.1)." - tokens = EN(text) diff --git a/spacy/tests/regression/test_issue587.py b/spacy/tests/regression/test_issue587.py deleted file mode 100644 index fdc23c284..000000000 --- a/spacy/tests/regression/test_issue587.py +++ /dev/null @@ -1,22 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from ...matcher import Matcher -from ...attrs import IS_PUNCT, ORTH - -import pytest - - -def test_issue587(en_tokenizer): - """Test that Matcher doesn't segfault on particular input""" - doc = en_tokenizer('a b; c') - matcher = Matcher(doc.vocab) - matcher.add('TEST1', None, [{ORTH: 'a'}, {ORTH: 'b'}]) - matches = matcher(doc) - assert len(matches) == 1 - matcher.add('TEST2', None, [{ORTH: 'a'}, {ORTH: 'b'}, {IS_PUNCT: True}, {ORTH: 'c'}]) - matches = matcher(doc) - assert len(matches) == 2 - matcher.add('TEST3', None, [{ORTH: 'a'}, {ORTH: 'b'}, {IS_PUNCT: True}, {ORTH: 'd'}]) - matches = matcher(doc) - assert len(matches) == 2 diff --git a/spacy/tests/regression/test_issue588.py b/spacy/tests/regression/test_issue588.py deleted file mode 100644 index 438f0d161..000000000 --- a/spacy/tests/regression/test_issue588.py +++ 
/dev/null @@ -1,12 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from ...matcher import Matcher - -import pytest - - -def test_issue588(en_vocab): - matcher = Matcher(en_vocab) - with pytest.raises(ValueError): - matcher.add('TEST', None, []) diff --git a/spacy/tests/regression/test_issue589.py b/spacy/tests/regression/test_issue589.py deleted file mode 100644 index 96ea4be61..000000000 --- a/spacy/tests/regression/test_issue589.py +++ /dev/null @@ -1,14 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from ...vocab import Vocab -from ..util import get_doc - -import pytest - - -@pytest.mark.xfail -def test_issue589(): - vocab = Vocab() - vocab.strings.set_frozen(True) - doc = get_doc(vocab, ['whata']) diff --git a/spacy/tests/regression/test_issue590.py b/spacy/tests/regression/test_issue590.py deleted file mode 100644 index be7c1db48..000000000 --- a/spacy/tests/regression/test_issue590.py +++ /dev/null @@ -1,15 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from ...matcher import Matcher -from ..util import get_doc - - -def test_issue590(en_vocab): - """Test overlapping matches""" - doc = get_doc(en_vocab, ['n', '=', '1', ';', 'a', ':', '5', '%']) - matcher = Matcher(en_vocab) - matcher.add('ab', None, [{'IS_ALPHA': True}, {'ORTH': ':'}, {'LIKE_NUM': True}, {'ORTH': '%'}]) - matcher.add('ab', None, [{'IS_ALPHA': True}, {'ORTH': '='}, {'LIKE_NUM': True}]) - matches = matcher(doc) - assert len(matches) == 2 diff --git a/spacy/tests/regression/test_issue595.py b/spacy/tests/regression/test_issue595.py deleted file mode 100644 index 4a83a4020..000000000 --- a/spacy/tests/regression/test_issue595.py +++ /dev/null @@ -1,24 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from ...symbols import POS, VERB, VerbForm_inf -from ...vocab import Vocab -from ...lemmatizer import Lemmatizer -from ..util import get_doc - -import pytest - - -def test_issue595(): - """Test lemmatization of base 
forms""" - words = ["Do", "n't", "feed", "the", "dog"] - tag_map = {'VB': {POS: VERB, VerbForm_inf: True}} - rules = {"verb": [["ed", "e"]]} - - lemmatizer = Lemmatizer({'verb': {}}, {'verb': {}}, rules) - vocab = Vocab(lemmatizer=lemmatizer, tag_map=tag_map) - doc = get_doc(vocab, words) - - doc[2].tag_ = 'VB' - assert doc[2].text == 'feed' - assert doc[2].lemma_ == 'feed' diff --git a/spacy/tests/regression/test_issue599.py b/spacy/tests/regression/test_issue599.py deleted file mode 100644 index 9e187b3d4..000000000 --- a/spacy/tests/regression/test_issue599.py +++ /dev/null @@ -1,13 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from ..util import get_doc - - -def test_issue599(en_vocab): - doc = get_doc(en_vocab) - doc.is_tagged = True - doc.is_parsed = True - doc2 = get_doc(doc.vocab) - doc2.from_bytes(doc.to_bytes()) - assert doc2.is_parsed diff --git a/spacy/tests/regression/test_issue600.py b/spacy/tests/regression/test_issue600.py deleted file mode 100644 index 45511fd48..000000000 --- a/spacy/tests/regression/test_issue600.py +++ /dev/null @@ -1,11 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from ...vocab import Vocab -from ..util import get_doc - - -def test_issue600(): - vocab = Vocab(tag_map={'NN': {'pos': 'NOUN'}}) - doc = get_doc(vocab, ["hello"]) - doc[0].tag_ = 'NN' diff --git a/spacy/tests/regression/test_issue615.py b/spacy/tests/regression/test_issue615.py deleted file mode 100644 index 2e36dae04..000000000 --- a/spacy/tests/regression/test_issue615.py +++ /dev/null @@ -1,33 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from ...matcher import Matcher - - -def test_issue615(en_tokenizer): - def merge_phrases(matcher, doc, i, matches): - """Merge a phrase. We have to be careful here because we'll change the - token indices. 
To avoid problems, merge all the phrases once we're called - on the last match.""" - - if i != len(matches)-1: - return None - # Get Span objects - spans = [(ent_id, ent_id, doc[start : end]) for ent_id, start, end in matches] - for ent_id, label, span in spans: - span.merge(tag='NNP' if label else span.root.tag_, lemma=span.text, - label=label) - doc.ents = doc.ents + ((label, span.start, span.end),) - - text = "The golf club is broken" - pattern = [{'ORTH': "golf"}, {'ORTH': "club"}] - label = "Sport_Equipment" - - doc = en_tokenizer(text) - matcher = Matcher(doc.vocab) - matcher.add(label, merge_phrases, pattern) - match = matcher(doc) - entities = list(doc.ents) - - assert entities != [] #assertion 1 - assert entities[0].label != 0 #assertion 2 diff --git a/spacy/tests/regression/test_issue686.py b/spacy/tests/regression/test_issue686.py deleted file mode 100644 index 1323393db..000000000 --- a/spacy/tests/regression/test_issue686.py +++ /dev/null @@ -1,12 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest - - -@pytest.mark.models('en') -@pytest.mark.parametrize('text', ["He is the man", "he is the man"]) -def test_issue686(EN, text): - """Test that pronoun lemmas are assigned correctly.""" - tokens = EN(text) - assert tokens[0].lemma_ == "-PRON-" diff --git a/spacy/tests/regression/test_issue693.py b/spacy/tests/regression/test_issue693.py deleted file mode 100644 index c3541ea91..000000000 --- a/spacy/tests/regression/test_issue693.py +++ /dev/null @@ -1,19 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest - - -@pytest.mark.xfail -@pytest.mark.models('en') -def test_issue693(EN): - """Test that doc.noun_chunks parses the complete sentence.""" - - text1 = "the TopTown International Airport Board and the Goodwill Space Exploration Partnership." - text2 = "the Goodwill Space Exploration Partnership and the TopTown International Airport Board." 
- doc1 = EN(text1) - doc2 = EN(text2) - chunks1 = [chunk for chunk in doc1.noun_chunks] - chunks2 = [chunk for chunk in doc2.noun_chunks] - assert len(chunks1) == 2 - assert len(chunks2) == 2 diff --git a/spacy/tests/regression/test_issue704.py b/spacy/tests/regression/test_issue704.py deleted file mode 100644 index 51f481a3f..000000000 --- a/spacy/tests/regression/test_issue704.py +++ /dev/null @@ -1,15 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest - - -@pytest.mark.xfail -@pytest.mark.models('en') -def test_issue704(EN): - """Test that sentence boundaries are detected correctly.""" - - text = '“Atticus said to Jem one day, “I’d rather you shot at tin cans in the backyard, but I know you’ll go after birds. Shoot all the blue jays you want, if you can hit ‘em, but remember it’s a sin to kill a mockingbird.”' - doc = EN(text) - sents = list([sent for sent in doc.sents]) - assert len(sents) == 3 diff --git a/spacy/tests/regression/test_issue717.py b/spacy/tests/regression/test_issue717.py deleted file mode 100644 index 69c0705cb..000000000 --- a/spacy/tests/regression/test_issue717.py +++ /dev/null @@ -1,17 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest - - -@pytest.mark.models('en') -@pytest.mark.parametrize('text1,text2', - [("You're happy", "You are happy"), - ("I'm happy", "I am happy"), - ("he's happy", "he's happy")]) -def test_issue717(EN, text1, text2): - """Test that contractions are assigned the correct lemma.""" - doc1 = EN(text1) - doc2 = EN(text2) - assert doc1[1].lemma_ == doc2[1].lemma_ - assert doc1[1].lemma == doc2[1].lemma diff --git a/spacy/tests/regression/test_issue719.py b/spacy/tests/regression/test_issue719.py deleted file mode 100644 index 9b4838bdb..000000000 --- a/spacy/tests/regression/test_issue719.py +++ /dev/null @@ -1,12 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest - - -@pytest.mark.models('en') -@pytest.mark.parametrize('text', 
["s..."]) -def test_issue719(EN, text): - """Test that the token 's' is not lemmatized into empty string.""" - tokens = EN(text) - assert tokens[0].lemma_ != '' diff --git a/spacy/tests/regression/test_issue736.py b/spacy/tests/regression/test_issue736.py deleted file mode 100644 index eee4912eb..000000000 --- a/spacy/tests/regression/test_issue736.py +++ /dev/null @@ -1,12 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -import pytest - - -@pytest.mark.parametrize('text,number', [("7am", "7"), ("11p.m.", "11")]) -def test_issue736(en_tokenizer, text, number): - """Test that times like "7am" are tokenized correctly and that numbers are converted to string.""" - tokens = en_tokenizer(text) - assert len(tokens) == 2 - assert tokens[0].text == number diff --git a/spacy/tests/regression/test_issue740.py b/spacy/tests/regression/test_issue740.py deleted file mode 100644 index babaf8fdb..000000000 --- a/spacy/tests/regression/test_issue740.py +++ /dev/null @@ -1,12 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -import pytest - - -@pytest.mark.parametrize('text', ["3/4/2012", "01/12/1900"]) -def test_issue740(en_tokenizer, text): - """Test that dates are not split and kept as one token. This behaviour is currently inconsistent, since dates separated by hyphens are still split. 
- This will be hard to prevent without causing clashes with numeric ranges.""" - tokens = en_tokenizer(text) - assert len(tokens) == 1 diff --git a/spacy/tests/regression/test_issue743.py b/spacy/tests/regression/test_issue743.py deleted file mode 100644 index 7a9ee0298..000000000 --- a/spacy/tests/regression/test_issue743.py +++ /dev/null @@ -1,12 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals -from ...vocab import Vocab -from ...tokens.doc import Doc - - -def test_token_is_hashable(): - doc = Doc(Vocab(), ['hello', 'world']) - token = doc[0] - s = set([token]) - items = list(s) - assert items[0] is token diff --git a/spacy/tests/regression/test_issue744.py b/spacy/tests/regression/test_issue744.py deleted file mode 100644 index 4e5eb2e10..000000000 --- a/spacy/tests/regression/test_issue744.py +++ /dev/null @@ -1,13 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -import pytest - - -@pytest.mark.parametrize('text', ["We were scared", "We Were Scared"]) -def test_issue744(en_tokenizer, text): - """Test that 'were' and 'Were' are excluded from the contractions - generated by the English tokenizer exceptions.""" - tokens = en_tokenizer(text) - assert len(tokens) == 3 - assert tokens[1].text.lower() == "were" diff --git a/spacy/tests/regression/test_issue758.py b/spacy/tests/regression/test_issue758.py deleted file mode 100644 index 48e27be02..000000000 --- a/spacy/tests/regression/test_issue758.py +++ /dev/null @@ -1,13 +0,0 @@ -from __future__ import unicode_literals - -import pytest - - -@pytest.mark.xfail -@pytest.mark.models('en') -def test_issue758(EN): - '''Test parser transition bug after label added.''' - from ...matcher import merge_phrase - nlp = EN() - nlp.matcher.add('splash', merge_phrase, [[{'LEMMA': 'splash'}, {'LEMMA': 'on'}]]) - doc = nlp('splash On', parse=False) diff --git a/spacy/tests/regression/test_issue759.py b/spacy/tests/regression/test_issue759.py deleted file mode 100644 index b7cf69f1a..000000000 --- 
a/spacy/tests/regression/test_issue759.py +++ /dev/null @@ -1,12 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -import pytest - - -@pytest.mark.parametrize('text,is_num', [("one", True), ("ten", True), - ("teneleven", False)]) -def test_issue759(en_tokenizer, text, is_num): - """Test that numbers are recognised correctly.""" - tokens = en_tokenizer(text) - assert tokens[0].like_num == is_num diff --git a/spacy/tests/regression/test_issue768.py b/spacy/tests/regression/test_issue768.py deleted file mode 100644 index b98610ac7..000000000 --- a/spacy/tests/regression/test_issue768.py +++ /dev/null @@ -1,40 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from ...language import Language -from ...attrs import LANG -from ...lang.fr.stop_words import STOP_WORDS -from ...lang.fr.tokenizer_exceptions import TOKENIZER_EXCEPTIONS -from ...lang.punctuation import TOKENIZER_INFIXES -from ...lang.char_classes import ALPHA -from ...util import update_exc - -import pytest - - -@pytest.fixture -def fr_tokenizer_w_infix(): - SPLIT_INFIX = r'(?<=[{a}]\')(?=[{a}])'.format(a=ALPHA) - - # create new Language subclass to add to default infixes - class French(Language): - lang = 'fr' - - class Defaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters[LANG] = lambda text: 'fr' - tokenizer_exceptions = update_exc(TOKENIZER_EXCEPTIONS) - stop_words = STOP_WORDS - infixes = TOKENIZER_INFIXES + [SPLIT_INFIX] - - return French.Defaults.create_tokenizer() - - -@pytest.mark.skip -@pytest.mark.parametrize('text,expected_tokens', [("l'avion", ["l'", "avion"]), - ("j'ai", ["j'", "ai"])]) -def test_issue768(fr_tokenizer_w_infix, text, expected_tokens): - """Allow zero-width 'infix' token during the tokenization process.""" - tokens = fr_tokenizer_w_infix(text) - assert len(tokens) == 2 - assert [t.text for t in tokens] == expected_tokens diff --git a/spacy/tests/regression/test_issue775.py 
b/spacy/tests/regression/test_issue775.py deleted file mode 100644 index 2b5cd2df5..000000000 --- a/spacy/tests/regression/test_issue775.py +++ /dev/null @@ -1,13 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -import pytest - - -@pytest.mark.parametrize('text', ["Shell", "shell", "Shed", "shed"]) -def test_issue775(en_tokenizer, text): - """Test that 'Shell' and 'shell' are excluded from the contractions - generated by the English tokenizer exceptions.""" - tokens = en_tokenizer(text) - assert len(tokens) == 1 - assert tokens[0].text == text diff --git a/spacy/tests/regression/test_issue781.py b/spacy/tests/regression/test_issue781.py deleted file mode 100644 index 805e68228..000000000 --- a/spacy/tests/regression/test_issue781.py +++ /dev/null @@ -1,12 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -import pytest - - -# Note: "chromosomes" worked previous the bug fix -@pytest.mark.models('en') -@pytest.mark.parametrize('word,lemmas', [("chromosomes", ["chromosome"]), ("endosomes", ["endosome"]), ("colocalizes", ["colocaliz", "colocalize"])]) -def test_issue781(EN, word, lemmas): - lemmatizer = EN.Defaults.create_lemmatizer() - assert sorted(lemmatizer(word, 'noun', morphology={'number': 'plur'})) == sorted(lemmas) diff --git a/spacy/tests/regression/test_issue792.py b/spacy/tests/regression/test_issue792.py deleted file mode 100644 index df8b5ef50..000000000 --- a/spacy/tests/regression/test_issue792.py +++ /dev/null @@ -1,18 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -import pytest - - -@pytest.mark.parametrize('text', ["This is a string ", "This is a string\u0020"]) -def test_issue792(en_tokenizer, text): - """Test for Issue #792: Trailing whitespace is removed after tokenization.""" - doc = en_tokenizer(text) - assert ''.join([token.text_with_ws for token in doc]) == text - - -@pytest.mark.parametrize('text', ["This is a string", "This is a string\n"]) -def test_control_issue792(en_tokenizer, 
text): - """Test base case for Issue #792: Non-trailing whitespace""" - doc = en_tokenizer(text) - assert ''.join([token.text_with_ws for token in doc]) == text diff --git a/spacy/tests/regression/test_issue801.py b/spacy/tests/regression/test_issue801.py deleted file mode 100644 index 3d83e707b..000000000 --- a/spacy/tests/regression/test_issue801.py +++ /dev/null @@ -1,19 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -import pytest - - -@pytest.mark.parametrize('text,tokens', [ - ('"deserve,"--and', ['"', "deserve", ',"--', "and"]), - ("exception;--exclusive", ["exception", ";--", "exclusive"]), - ("day.--Is", ["day", ".--", "Is"]), - ("refinement:--just", ["refinement", ":--", "just"]), - ("memories?--To", ["memories", "?--", "To"]), - ("Useful.=--Therefore", ["Useful", ".=--", "Therefore"]), - ("=Hope.=--Pandora", ["=", "Hope", ".=--", "Pandora"])]) -def test_issue801(en_tokenizer, text, tokens): - """Test that special characters + hyphens are split correctly.""" - doc = en_tokenizer(text) - assert len(doc) == len(tokens) - assert [t.text for t in doc] == tokens diff --git a/spacy/tests/regression/test_issue805.py b/spacy/tests/regression/test_issue805.py deleted file mode 100644 index 090dd0f3b..000000000 --- a/spacy/tests/regression/test_issue805.py +++ /dev/null @@ -1,15 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest - -SV_TOKEN_EXCEPTION_TESTS = [ - ('Smörsåsen används bl.a. till fisk', ['Smörsåsen', 'används', 'bl.a.', 'till', 'fisk']), - ('Jag kommer först kl. 13 p.g.a. 
diverse förseningar', ['Jag', 'kommer', 'först', 'kl.', '13', 'p.g.a.', 'diverse', 'förseningar']) -] - -@pytest.mark.parametrize('text,expected_tokens', SV_TOKEN_EXCEPTION_TESTS) -def test_issue805(sv_tokenizer, text, expected_tokens): - tokens = sv_tokenizer(text) - token_list = [token.text for token in tokens if not token.is_space] - assert expected_tokens == token_list diff --git a/spacy/tests/regression/test_issue834.py b/spacy/tests/regression/test_issue834.py deleted file mode 100644 index d3dee49e8..000000000 --- a/spacy/tests/regression/test_issue834.py +++ /dev/null @@ -1,18 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals -import pytest - - -word2vec_str = """, -0.046107 -0.035951 -0.560418 -de -0.648927 -0.400976 -0.527124 -. 0.113685 0.439990 -0.634510 -\u00A0 -1.499184 -0.184280 -0.598371""" - - -@pytest.mark.xfail -def test_issue834(en_vocab, text_file): - """Test that no-break space (U+00A0) is detected as space by the load_vectors function.""" - text_file.write(word2vec_str) - text_file.seek(0) - vector_length = en_vocab.load_vectors(text_file) - assert vector_length == 3 diff --git a/spacy/tests/regression/test_issue850.py b/spacy/tests/regression/test_issue850.py deleted file mode 100644 index e83b4d8af..000000000 --- a/spacy/tests/regression/test_issue850.py +++ /dev/null @@ -1,37 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals -import pytest - -from ...matcher import Matcher -from ...vocab import Vocab -from ...attrs import LOWER -from ...tokens import Doc - - -def test_basic_case(): - """Test Matcher matches with '*' operator and Boolean flag""" - matcher = Matcher(Vocab( - lex_attr_getters={LOWER: lambda string: string.lower()})) - IS_ANY_TOKEN = matcher.vocab.add_flag(lambda x: True) - matcher.add('FarAway', None, [{'LOWER': "bob"}, {'OP': '*', 'LOWER': 'and'}, {'LOWER': 'frank'}]) - doc = Doc(matcher.vocab, words=['bob', 'and', 'and', 'frank']) - match = matcher(doc) - assert len(match) == 1 - ent_id, 
start, end = match[0] - assert start == 0 - assert end == 4 - - -def test_issue850(): - """The variable-length pattern matches the - succeeding token. Check we handle the ambiguity correctly.""" - matcher = Matcher(Vocab( - lex_attr_getters={LOWER: lambda string: string.lower()})) - IS_ANY_TOKEN = matcher.vocab.add_flag(lambda x: True) - matcher.add('FarAway', None, [{'LOWER': "bob"}, {'OP': '*', 'IS_ANY_TOKEN': True}, {'LOWER': 'frank'}]) - doc = Doc(matcher.vocab, words=['bob', 'and', 'and', 'frank']) - match = matcher(doc) - assert len(match) == 1 - ent_id, start, end = match[0] - assert start == 0 - assert end == 4 diff --git a/spacy/tests/regression/test_issue852.py b/spacy/tests/regression/test_issue852.py deleted file mode 100644 index 2bfbe99bb..000000000 --- a/spacy/tests/regression/test_issue852.py +++ /dev/null @@ -1,12 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest - - -@pytest.mark.parametrize('text', ["au-delàs", "pair-programmâmes", - "terra-formées", "σ-compacts"]) -def test_issue852(fr_tokenizer, text): - """Test that French tokenizer exceptions are imported correctly.""" - tokens = fr_tokenizer(text) - assert len(tokens) == 1 diff --git a/spacy/tests/regression/test_issue859.py b/spacy/tests/regression/test_issue859.py deleted file mode 100644 index f6225a5f4..000000000 --- a/spacy/tests/regression/test_issue859.py +++ /dev/null @@ -1,12 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest - - -@pytest.mark.parametrize('text', ["aaabbb@ccc.com\nThank you!", - "aaabbb@ccc.com \nThank you!"]) -def test_issue859(en_tokenizer, text): - """Test that no extra space is added in doc.text method.""" - doc = en_tokenizer(text) - assert doc.text == text diff --git a/spacy/tests/regression/test_issue886.py b/spacy/tests/regression/test_issue886.py deleted file mode 100644 index 719036a09..000000000 --- a/spacy/tests/regression/test_issue886.py +++ /dev/null @@ -1,13 +0,0 @@ -# coding: utf8 -from 
__future__ import unicode_literals - -import pytest - - -@pytest.mark.parametrize('text', ["Datum:2014-06-02\nDokument:76467"]) -def test_issue886(en_tokenizer, text): - """Test that token.idx matches the original text index for texts with newlines.""" - doc = en_tokenizer(text) - for token in doc: - assert len(token.text) == len(token.text_with_ws) - assert text[token.idx] == token.text[0] diff --git a/spacy/tests/regression/test_issue891.py b/spacy/tests/regression/test_issue891.py deleted file mode 100644 index 6e57a750f..000000000 --- a/spacy/tests/regression/test_issue891.py +++ /dev/null @@ -1,12 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest - - -@pytest.mark.parametrize('text', ["want/need"]) -def test_issue891(en_tokenizer, text): - """Test that / infixes are split correctly.""" - tokens = en_tokenizer(text) - assert len(tokens) == 3 - assert tokens[1].text == "/" diff --git a/spacy/tests/regression/test_issue903.py b/spacy/tests/regression/test_issue903.py deleted file mode 100644 index 36acd2dfc..000000000 --- a/spacy/tests/regression/test_issue903.py +++ /dev/null @@ -1,16 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest -from ...tokens import Doc - - -@pytest.mark.parametrize('text,tag,lemma', - [("anus", "NN", "anus"), - ("princess", "NN", "princess")]) -def test_issue912(en_vocab, text, tag, lemma): - '''Test base-forms of adjectives are preserved.''' - doc = Doc(en_vocab, words=[text]) - doc[0].tag_ = tag - assert doc[0].lemma_ == lemma - diff --git a/spacy/tests/regression/test_issue910.py b/spacy/tests/regression/test_issue910.py deleted file mode 100644 index 94a2562fd..000000000 --- a/spacy/tests/regression/test_issue910.py +++ /dev/null @@ -1,104 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import json -import random -import contextlib -import shutil -import pytest -import tempfile -from pathlib import Path -from thinc.neural.optimizers import Adam - 
-from ...gold import GoldParse -from ...pipeline import EntityRecognizer -from ...lang.en import English - -try: - unicode -except NameError: - unicode = str - - -@pytest.fixture -def train_data(): - return [ - ["hey",[]], - ["howdy",[]], - ["hey there",[]], - ["hello",[]], - ["hi",[]], - ["i'm looking for a place to eat",[]], - ["i'm looking for a place in the north of town",[[31,36,"location"]]], - ["show me chinese restaurants",[[8,15,"cuisine"]]], - ["show me chines restaurants",[[8,14,"cuisine"]]], - ["yes",[]], - ["yep",[]], - ["yeah",[]], - ["show me a mexican place in the centre",[[31,37,"location"], [10,17,"cuisine"]]], - ["bye",[]],["goodbye",[]], - ["good bye",[]], - ["stop",[]], - ["end",[]], - ["i am looking for an indian spot",[[20,26,"cuisine"]]], - ["search for restaurants",[]], - ["anywhere in the west",[[16,20,"location"]]], - ["central indian restaurant",[[0,7,"location"],[8,14,"cuisine"]]], - ["indeed",[]], - ["that's right",[]], - ["ok",[]], - ["great",[]] - ] - -@pytest.fixture -def additional_entity_types(): - return ['cuisine', 'location'] - - -@contextlib.contextmanager -def temp_save_model(model): - model_dir = tempfile.mkdtemp() - model.to_disk(model_dir) - yield model_dir - shutil.rmtree(model_dir.as_posix()) - - -@pytest.mark.xfail -@pytest.mark.models('en') -def test_issue910(EN, train_data, additional_entity_types): - '''Test that adding entities and resuming training works passably OK. - There are two issues here: - - 1) We have to readd labels. This isn't very nice. - 2) There's no way to set the learning rate for the weight update, so we - end up out-of-scale, causing it to learn too fast. 
- ''' - nlp = EN - doc = nlp(u"I am looking for a restaurant in Berlin") - ents_before_train = [(ent.label_, ent.text) for ent in doc.ents] - # Fine tune the ner model - for entity_type in additional_entity_types: - nlp.entity.add_label(entity_type) - - sgd = Adam(nlp.entity.model[0].ops, 0.001) - for itn in range(10): - random.shuffle(train_data) - for raw_text, entity_offsets in train_data: - doc = nlp.make_doc(raw_text) - nlp.tagger(doc) - nlp.tensorizer(doc) - gold = GoldParse(doc, entities=entity_offsets) - loss = nlp.entity.update(doc, gold, sgd=sgd, drop=0.5) - - with temp_save_model(nlp.entity) as model_dir: - # Load the fine tuned model - loaded_ner = EntityRecognizer(nlp.vocab) - loaded_ner.from_disk(model_dir) - - for raw_text, entity_offsets in train_data: - doc = nlp.make_doc(raw_text) - nlp.tagger(doc) - loaded_ner(doc) - ents = {(ent.start_char, ent.end_char): ent.label_ for ent in doc.ents} - for start, end, label in entity_offsets: - assert ents[(start, end)] == label diff --git a/spacy/tests/regression/test_issue912.py b/spacy/tests/regression/test_issue912.py deleted file mode 100644 index 791e2e152..000000000 --- a/spacy/tests/regression/test_issue912.py +++ /dev/null @@ -1,14 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -import pytest -from ...tokens import Doc - - -@pytest.mark.parametrize('text,tag,lemma', [("inner", "JJ", "inner")]) -def test_issue912(en_vocab, text, tag, lemma): - '''Test base-forms of adjectives are preserved.''' - doc = Doc(en_vocab, words=[text]) - doc[0].tag_ = tag - assert doc[0].lemma_ == lemma - diff --git a/spacy/tests/regression/test_issue957.py b/spacy/tests/regression/test_issue957.py deleted file mode 100644 index 4dffda1aa..000000000 --- a/spacy/tests/regression/test_issue957.py +++ /dev/null @@ -1,18 +0,0 @@ -from __future__ import unicode_literals - -import pytest -from ... 
import load as load_spacy - - -def test_issue957(en_tokenizer): - '''Test that spaCy doesn't hang on many periods.''' - string = '0' - for i in range(1, 100): - string += '.%d' % i - doc = en_tokenizer(string) - -# Don't want tests to fail if they haven't installed pytest-timeout plugin -try: - test_issue913 = pytest.mark.timeout(5)(test_issue913) -except NameError: - pass diff --git a/spacy/tests/regression/test_issue995.py b/spacy/tests/regression/test_issue995.py deleted file mode 100644 index 420185bab..000000000 --- a/spacy/tests/regression/test_issue995.py +++ /dev/null @@ -1,16 +0,0 @@ -from __future__ import unicode_literals - -import pytest - - -@pytest.mark.models('en') -def test_issue955(EN): - '''Test that we don't have any nested noun chunks''' - doc = EN('Does flight number three fifty-four require a connecting flight' - ' to get to Boston?') - seen_tokens = set() - for np in doc.noun_chunks: - for word in np: - key = (word.i, word.text) - assert key not in seen_tokens - seen_tokens.add(key) diff --git a/spacy/tests/regression/test_issue999.py b/spacy/tests/regression/test_issue999.py deleted file mode 100644 index fb176c1fa..000000000 --- a/spacy/tests/regression/test_issue999.py +++ /dev/null @@ -1,81 +0,0 @@ -from __future__ import unicode_literals -import os -import random -import contextlib -import shutil -import pytest -import tempfile -from pathlib import Path - - -import pathlib -from ...gold import GoldParse -from ...pipeline import EntityRecognizer -from ...language import Language - -try: - unicode -except NameError: - unicode = str - - -@pytest.fixture -def train_data(): - return [ - ["hey",[]], - ["howdy",[]], - ["hey there",[]], - ["hello",[]], - ["hi",[]], - ["i'm looking for a place to eat",[]], - ["i'm looking for a place in the north of town",[[31,36,"location"]]], - ["show me chinese restaurants",[[8,15,"cuisine"]]], - ["show me chines restaurants",[[8,14,"cuisine"]]], - ] - - -@contextlib.contextmanager -def temp_save_model(model): 
- model_dir = Path(tempfile.mkdtemp()) - model.save_to_directory(model_dir) - yield model_dir - shutil.rmtree(model_dir.as_posix()) - - -# TODO: Fix when saving/loading is fixed. -@pytest.mark.xfail -def test_issue999(train_data): - '''Test that adding entities and resuming training works passably OK. - There are two issues here: - - 1) We have to readd labels. This isn't very nice. - 2) There's no way to set the learning rate for the weight update, so we - end up out-of-scale, causing it to learn too fast. - ''' - nlp = Language(pipeline=[]) - nlp.entity = EntityRecognizer(nlp.vocab, features=Language.Defaults.entity_features) - nlp.pipeline.append(nlp.entity) - for _, offsets in train_data: - for start, end, ent_type in offsets: - nlp.entity.add_label(ent_type) - nlp.entity.model.learn_rate = 0.001 - for itn in range(100): - random.shuffle(train_data) - for raw_text, entity_offsets in train_data: - doc = nlp.make_doc(raw_text) - gold = GoldParse(doc, entities=entity_offsets) - loss = nlp.entity.update(doc, gold) - - with temp_save_model(nlp) as model_dir: - nlp2 = Language(path=model_dir) - - for raw_text, entity_offsets in train_data: - doc = nlp2(raw_text) - ents = {(ent.start_char, ent.end_char): ent.label_ for ent in doc.ents} - for start, end, label in entity_offsets: - if (start, end) in ents: - assert ents[(start, end)] == label - break - else: - if entity_offsets: - raise Exception(ents) diff --git a/spacy/tests/serialize/test_serialization.py b/spacy/tests/serialize/test_serialization.py deleted file mode 100644 index 036035095..000000000 --- a/spacy/tests/serialize/test_serialization.py +++ /dev/null @@ -1,51 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from ..util import get_doc, assert_docs_equal -from ...tokens import Doc -from ...vocab import Vocab - -import pytest - - -TEXT = ["This", "is", "a", "test", "sentence", "."] -TAGS = ['DT', 'VBZ', 'DT', 'NN', 'NN', '.'] -DEPS = ['nsubj', 'ROOT', 'det', 'compound', 'attr', 'punct'] 
-ENTS = [('hi', 'PERSON', 0, 1)] - - -def test_serialize_empty_doc(en_vocab): - doc = get_doc(en_vocab) - data = doc.to_bytes() - doc2 = Doc(en_vocab) - doc2.from_bytes(data) - assert len(doc) == len(doc2) - for token1, token2 in zip(doc, doc2): - assert token1.text == token2.text - - -@pytest.mark.xfail -@pytest.mark.parametrize('text', ['rat']) -def test_serialize_vocab(en_vocab, text): - text_hash = en_vocab.strings.add(text) - vocab_bytes = en_vocab.to_bytes() - new_vocab = Vocab().from_bytes(vocab_bytes) - assert new_vocab.strings(text_hash) == text - -# -#@pytest.mark.parametrize('text', [TEXT]) -#def test_serialize_tokens(en_vocab, text): -# doc1 = get_doc(en_vocab, [t for t in text]) -# doc2 = get_doc(en_vocab).from_bytes(doc1.to_bytes()) -# assert_docs_equal(doc1, doc2) -# -# -#@pytest.mark.models -#@pytest.mark.parametrize('text', [TEXT]) -#@pytest.mark.parametrize('tags', [TAGS, []]) -#@pytest.mark.parametrize('deps', [DEPS, []]) -#@pytest.mark.parametrize('ents', [ENTS, []]) -#def test_serialize_tokens_ner(EN, text, tags, deps, ents): -# doc1 = get_doc(EN.vocab, [t for t in text], tags=tags, deps=deps, ents=ents) -# doc2 = get_doc(EN.vocab).from_bytes(doc1.to_bytes()) -# assert_docs_equal(doc1, doc2) diff --git a/spacy/tests/serialize/test_serialize_doc.py b/spacy/tests/serialize/test_serialize_doc.py index 5a10b656a..f9e092050 100644 --- a/spacy/tests/serialize/test_serialize_doc.py +++ b/spacy/tests/serialize/test_serialize_doc.py @@ -1,22 +1,31 @@ # coding: utf-8 from __future__ import unicode_literals -from ..util import make_tempdir, get_doc -from ...tokens import Doc -from ...compat import path2str +from spacy.tokens import Doc +from spacy.compat import path2str -import pytest +from ..util import make_tempdir + + +def test_serialize_empty_doc(en_vocab): + doc = Doc(en_vocab) + data = doc.to_bytes() + doc2 = Doc(en_vocab) + doc2.from_bytes(data) + assert len(doc) == len(doc2) + for token1, token2 in zip(doc, doc2): + assert token1.text == 
token2.text def test_serialize_doc_roundtrip_bytes(en_vocab): - doc = get_doc(en_vocab, words=['hello', 'world']) + doc = Doc(en_vocab, words=['hello', 'world']) doc_b = doc.to_bytes() new_doc = Doc(en_vocab).from_bytes(doc_b) assert new_doc.to_bytes() == doc_b def test_serialize_doc_roundtrip_disk(en_vocab): - doc = get_doc(en_vocab, words=['hello', 'world']) + doc = Doc(en_vocab, words=['hello', 'world']) with make_tempdir() as d: file_path = d / 'doc' doc.to_disk(file_path) @@ -25,7 +34,7 @@ def test_serialize_doc_roundtrip_disk(en_vocab): def test_serialize_doc_roundtrip_disk_str_path(en_vocab): - doc = get_doc(en_vocab, words=['hello', 'world']) + doc = Doc(en_vocab, words=['hello', 'world']) with make_tempdir() as d: file_path = d / 'doc' file_path = path2str(file_path) diff --git a/spacy/tests/serialize/test_serialize_empty_model.py b/spacy/tests/serialize/test_serialize_empty_model.py deleted file mode 100644 index b614a3648..000000000 --- a/spacy/tests/serialize/test_serialize_empty_model.py +++ /dev/null @@ -1,9 +0,0 @@ -import spacy -import spacy.lang.en -from spacy.pipeline import TextCategorizer - -def test_bytes_serialize_issue_1105(): - nlp = spacy.lang.en.English() - tokenizer = nlp.tokenizer - textcat = TextCategorizer(tokenizer.vocab, labels=['ENTITY', 'ACTION', 'MODIFIER']) - textcat_bytes = textcat.to_bytes() diff --git a/spacy/tests/serialize/test_serialize_extension_attrs.py b/spacy/tests/serialize/test_serialize_extension_attrs.py index 8919ebe1e..251aaf4f0 100644 --- a/spacy/tests/serialize/test_serialize_extension_attrs.py +++ b/spacy/tests/serialize/test_serialize_extension_attrs.py @@ -2,9 +2,8 @@ from __future__ import unicode_literals import pytest - -from ...tokens.doc import Doc -from ...vocab import Vocab +from spacy.tokens import Doc +from spacy.vocab import Vocab @pytest.fixture diff --git a/spacy/tests/serialize/test_serialize_language.py b/spacy/tests/serialize/test_serialize_language.py index 5d1ac4c92..210729340 100644 --- 
a/spacy/tests/serialize/test_serialize_language.py +++ b/spacy/tests/serialize/test_serialize_language.py @@ -1,12 +1,12 @@ # coding: utf-8 from __future__ import unicode_literals -from ..util import make_tempdir -from ...language import Language -from ...tokenizer import Tokenizer - import pytest import re +from spacy.language import Language +from spacy.tokenizer import Tokenizer + +from ..util import make_tempdir @pytest.fixture diff --git a/spacy/tests/serialize/test_serialize_parser_ner.py b/spacy/tests/serialize/test_serialize_parser_ner.py deleted file mode 100644 index cbe97b716..000000000 --- a/spacy/tests/serialize/test_serialize_parser_ner.py +++ /dev/null @@ -1,34 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from ..util import make_tempdir -from ...pipeline import DependencyParser -from ...pipeline import EntityRecognizer - -import pytest - - -test_parsers = [DependencyParser, EntityRecognizer] - - -@pytest.mark.parametrize('Parser', test_parsers) -def test_serialize_parser_roundtrip_bytes(en_vocab, Parser): - parser = Parser(en_vocab) - parser.model, _ = parser.Model(10) - new_parser = Parser(en_vocab) - new_parser.model, _ = new_parser.Model(10) - new_parser = new_parser.from_bytes(parser.to_bytes()) - assert new_parser.to_bytes() == parser.to_bytes() - - -@pytest.mark.parametrize('Parser', test_parsers) -def test_serialize_parser_roundtrip_disk(en_vocab, Parser): - parser = Parser(en_vocab) - parser.model, _ = parser.Model(0) - with make_tempdir() as d: - file_path = d / 'parser' - parser.to_disk(file_path) - parser_d = Parser(en_vocab) - parser_d.model, _ = parser_d.Model(0) - parser_d = parser_d.from_disk(file_path) - assert parser.to_bytes(model=False) == parser_d.to_bytes(model=False) diff --git a/spacy/tests/serialize/test_serialize_pipeline.py b/spacy/tests/serialize/test_serialize_pipeline.py new file mode 100644 index 000000000..b177f9bd8 --- /dev/null +++ b/spacy/tests/serialize/test_serialize_pipeline.py @@ -0,0 
+1,114 @@ +# coding: utf-8 +from __future__ import unicode_literals + +import pytest +from spacy.pipeline import Tagger, DependencyParser, EntityRecognizer, Tensorizer, TextCategorizer + +from ..util import make_tempdir + + +test_parsers = [DependencyParser, EntityRecognizer] + + +@pytest.fixture +def parser(en_vocab): + parser = DependencyParser(en_vocab) + parser.add_label('nsubj') + parser.model, cfg = parser.Model(parser.moves.n_moves) + parser.cfg.update(cfg) + return parser + + +@pytest.fixture +def blank_parser(en_vocab): + parser = DependencyParser(en_vocab) + return parser + + +@pytest.fixture +def taggers(en_vocab): + tagger1 = Tagger(en_vocab) + tagger2 = Tagger(en_vocab) + tagger1.model = tagger1.Model(8) + tagger2.model = tagger1.model + return (tagger1, tagger2) + + +@pytest.mark.parametrize('Parser', test_parsers) +def test_serialize_parser_roundtrip_bytes(en_vocab, Parser): + parser = Parser(en_vocab) + parser.model, _ = parser.Model(10) + new_parser = Parser(en_vocab) + new_parser.model, _ = new_parser.Model(10) + new_parser = new_parser.from_bytes(parser.to_bytes()) + assert new_parser.to_bytes() == parser.to_bytes() + + +@pytest.mark.parametrize('Parser', test_parsers) +def test_serialize_parser_roundtrip_disk(en_vocab, Parser): + parser = Parser(en_vocab) + parser.model, _ = parser.Model(0) + with make_tempdir() as d: + file_path = d / 'parser' + parser.to_disk(file_path) + parser_d = Parser(en_vocab) + parser_d.model, _ = parser_d.Model(0) + parser_d = parser_d.from_disk(file_path) + assert parser.to_bytes(model=False) == parser_d.to_bytes(model=False) + + +def test_to_from_bytes(parser, blank_parser): + assert parser.model is not True + assert blank_parser.model is True + assert blank_parser.moves.n_moves != parser.moves.n_moves + bytes_data = parser.to_bytes() + blank_parser.from_bytes(bytes_data) + assert blank_parser.model is not True + assert blank_parser.moves.n_moves == parser.moves.n_moves + + +@pytest.mark.skip(reason="This seems to be 
a dict ordering bug somewhere. Only failing on some platforms.") +def test_serialize_tagger_roundtrip_bytes(en_vocab, taggers): + tagger1, tagger2 = taggers + tagger1_b = tagger1.to_bytes() + tagger2_b = tagger2.to_bytes() + tagger1 = tagger1.from_bytes(tagger1_b) + assert tagger1.to_bytes() == tagger1_b + new_tagger1 = Tagger(en_vocab).from_bytes(tagger1_b) + assert new_tagger1.to_bytes() == tagger1_b + + +def test_serialize_tagger_roundtrip_disk(en_vocab, taggers): + tagger1, tagger2 = taggers + with make_tempdir() as d: + file_path1 = d / 'tagger1' + file_path2 = d / 'tagger2' + tagger1.to_disk(file_path1) + tagger2.to_disk(file_path2) + tagger1_d = Tagger(en_vocab).from_disk(file_path1) + tagger2_d = Tagger(en_vocab).from_disk(file_path2) + assert tagger1_d.to_bytes() == tagger2_d.to_bytes() + + +def test_serialize_tensorizer_roundtrip_bytes(en_vocab): + tensorizer = Tensorizer(en_vocab) + tensorizer.model = tensorizer.Model() + tensorizer_b = tensorizer.to_bytes() + new_tensorizer = Tensorizer(en_vocab).from_bytes(tensorizer_b) + assert new_tensorizer.to_bytes() == tensorizer_b + + +def test_serialize_tensorizer_roundtrip_disk(en_vocab): + tensorizer = Tensorizer(en_vocab) + tensorizer.model = tensorizer.Model() + with make_tempdir() as d: + file_path = d / 'tensorizer' + tensorizer.to_disk(file_path) + tensorizer_d = Tensorizer(en_vocab).from_disk(file_path) + assert tensorizer.to_bytes() == tensorizer_d.to_bytes() + + +def test_serialize_textcat_empty(en_vocab): + # See issue #1105 + textcat = TextCategorizer(en_vocab, labels=['ENTITY', 'ACTION', 'MODIFIER']) + textcat_bytes = textcat.to_bytes() diff --git a/spacy/tests/serialize/test_serialize_stringstore.py b/spacy/tests/serialize/test_serialize_stringstore.py deleted file mode 100644 index 594413922..000000000 --- a/spacy/tests/serialize/test_serialize_stringstore.py +++ /dev/null @@ -1,46 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from ..util import make_tempdir -from ...strings 
import StringStore - -import pytest - - -test_strings = [([], []), (['rats', 'are', 'cute'], ['i', 'like', 'rats'])] - - -@pytest.mark.parametrize('strings1,strings2', test_strings) -def test_serialize_stringstore_roundtrip_bytes(strings1,strings2): - sstore1 = StringStore(strings=strings1) - sstore2 = StringStore(strings=strings2) - sstore1_b = sstore1.to_bytes() - sstore2_b = sstore2.to_bytes() - if strings1 == strings2: - assert sstore1_b == sstore2_b - else: - assert sstore1_b != sstore2_b - sstore1 = sstore1.from_bytes(sstore1_b) - assert sstore1.to_bytes() == sstore1_b - new_sstore1 = StringStore().from_bytes(sstore1_b) - assert new_sstore1.to_bytes() == sstore1_b - assert list(new_sstore1) == strings1 - - -@pytest.mark.parametrize('strings1,strings2', test_strings) -def test_serialize_stringstore_roundtrip_disk(strings1,strings2): - sstore1 = StringStore(strings=strings1) - sstore2 = StringStore(strings=strings2) - with make_tempdir() as d: - file_path1 = d / 'strings1' - file_path2 = d / 'strings2' - sstore1.to_disk(file_path1) - sstore2.to_disk(file_path2) - sstore1_d = StringStore().from_disk(file_path1) - sstore2_d = StringStore().from_disk(file_path2) - assert list(sstore1_d) == list(sstore1) - assert list(sstore2_d) == list(sstore2) - if strings1 == strings2: - assert list(sstore1_d) == list(sstore2_d) - else: - assert list(sstore1_d) != list(sstore2_d) diff --git a/spacy/tests/serialize/test_serialize_tagger.py b/spacy/tests/serialize/test_serialize_tagger.py deleted file mode 100644 index 4ed422dda..000000000 --- a/spacy/tests/serialize/test_serialize_tagger.py +++ /dev/null @@ -1,40 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from ..util import make_tempdir -from ...pipeline import Tagger - -import pytest - - -@pytest.fixture -def taggers(en_vocab): - tagger1 = Tagger(en_vocab) - tagger2 = Tagger(en_vocab) - tagger1.model = tagger1.Model(8) - tagger2.model = tagger1.model - return (tagger1, tagger2) - - -# This seems to be a 
dict ordering bug somewhere. Only failing on some platforms -@pytest.mark.xfail -def test_serialize_tagger_roundtrip_bytes(en_vocab, taggers): - tagger1, tagger2 = taggers - tagger1_b = tagger1.to_bytes() - tagger2_b = tagger2.to_bytes() - tagger1 = tagger1.from_bytes(tagger1_b) - assert tagger1.to_bytes() == tagger1_b - new_tagger1 = Tagger(en_vocab).from_bytes(tagger1_b) - assert new_tagger1.to_bytes() == tagger1_b - - -def test_serialize_tagger_roundtrip_disk(en_vocab, taggers): - tagger1, tagger2 = taggers - with make_tempdir() as d: - file_path1 = d / 'tagger1' - file_path2 = d / 'tagger2' - tagger1.to_disk(file_path1) - tagger2.to_disk(file_path2) - tagger1_d = Tagger(en_vocab).from_disk(file_path1) - tagger2_d = Tagger(en_vocab).from_disk(file_path2) - assert tagger1_d.to_bytes() == tagger2_d.to_bytes() diff --git a/spacy/tests/serialize/test_serialize_tensorizer.py b/spacy/tests/serialize/test_serialize_tensorizer.py deleted file mode 100644 index bc751a686..000000000 --- a/spacy/tests/serialize/test_serialize_tensorizer.py +++ /dev/null @@ -1,25 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -from ..util import make_tempdir -from ...pipeline import Tensorizer - -import pytest - - -def test_serialize_tensorizer_roundtrip_bytes(en_vocab): - tensorizer = Tensorizer(en_vocab) - tensorizer.model = tensorizer.Model() - tensorizer_b = tensorizer.to_bytes() - new_tensorizer = Tensorizer(en_vocab).from_bytes(tensorizer_b) - assert new_tensorizer.to_bytes() == tensorizer_b - - -def test_serialize_tensorizer_roundtrip_disk(en_vocab): - tensorizer = Tensorizer(en_vocab) - tensorizer.model = tensorizer.Model() - with make_tempdir() as d: - file_path = d / 'tensorizer' - tensorizer.to_disk(file_path) - tensorizer_d = Tensorizer(en_vocab).from_disk(file_path) - assert tensorizer.to_bytes() == tensorizer_d.to_bytes() diff --git a/spacy/tests/serialize/test_serialize_tokenizer.py b/spacy/tests/serialize/test_serialize_tokenizer.py index 
de022a263..2e1256f2d 100644 --- a/spacy/tests/serialize/test_serialize_tokenizer.py +++ b/spacy/tests/serialize/test_serialize_tokenizer.py @@ -1,11 +1,11 @@ # coding: utf-8 from __future__ import unicode_literals -from ...util import get_lang_class -from ...tokenizer import Tokenizer -from ..util import make_tempdir, assert_packed_msg_equal - import pytest +from spacy.util import get_lang_class +from spacy.tokenizer import Tokenizer + +from ..util import make_tempdir, assert_packed_msg_equal def load_tokenizer(b): diff --git a/spacy/tests/serialize/test_serialize_vocab.py b/spacy/tests/serialize/test_serialize_vocab_strings.py similarity index 57% rename from spacy/tests/serialize/test_serialize_vocab.py rename to spacy/tests/serialize/test_serialize_vocab_strings.py index 47749e69f..352620e92 100644 --- a/spacy/tests/serialize/test_serialize_vocab.py +++ b/spacy/tests/serialize/test_serialize_vocab_strings.py @@ -1,18 +1,28 @@ # coding: utf-8 from __future__ import unicode_literals -from ..util import make_tempdir -from ...vocab import Vocab - import pytest +from spacy.vocab import Vocab +from spacy.strings import StringStore + +from ..util import make_tempdir test_strings = [([], []), (['rats', 'are', 'cute'], ['i', 'like', 'rats'])] test_strings_attrs = [(['rats', 'are', 'cute'], 'Hello')] +@pytest.mark.xfail +@pytest.mark.parametrize('text', ['rat']) +def test_serialize_vocab(en_vocab, text): + text_hash = en_vocab.strings.add(text) + vocab_bytes = en_vocab.to_bytes() + new_vocab = Vocab().from_bytes(vocab_bytes) + assert new_vocab.strings(text_hash) == text + + @pytest.mark.parametrize('strings1,strings2', test_strings) -def test_serialize_vocab_roundtrip_bytes(strings1,strings2): +def test_serialize_vocab_roundtrip_bytes(strings1, strings2): vocab1 = Vocab(strings=strings1) vocab2 = Vocab(strings=strings2) vocab1_b = vocab1.to_bytes() @@ -71,3 +81,39 @@ def test_serialize_vocab_lex_attrs_disk(strings, lex_attr): vocab1.to_disk(file_path) vocab2 = 
vocab2.from_disk(file_path) assert vocab2[strings[0]].norm_ == lex_attr + + +@pytest.mark.parametrize('strings1,strings2', test_strings) +def test_serialize_stringstore_roundtrip_bytes(strings1, strings2): + sstore1 = StringStore(strings=strings1) + sstore2 = StringStore(strings=strings2) + sstore1_b = sstore1.to_bytes() + sstore2_b = sstore2.to_bytes() + if strings1 == strings2: + assert sstore1_b == sstore2_b + else: + assert sstore1_b != sstore2_b + sstore1 = sstore1.from_bytes(sstore1_b) + assert sstore1.to_bytes() == sstore1_b + new_sstore1 = StringStore().from_bytes(sstore1_b) + assert new_sstore1.to_bytes() == sstore1_b + assert list(new_sstore1) == strings1 + + +@pytest.mark.parametrize('strings1,strings2', test_strings) +def test_serialize_stringstore_roundtrip_disk(strings1, strings2): + sstore1 = StringStore(strings=strings1) + sstore2 = StringStore(strings=strings2) + with make_tempdir() as d: + file_path1 = d / 'strings1' + file_path2 = d / 'strings2' + sstore1.to_disk(file_path1) + sstore2.to_disk(file_path2) + sstore1_d = StringStore().from_disk(file_path1) + sstore2_d = StringStore().from_disk(file_path2) + assert list(sstore1_d) == list(sstore1) + assert list(sstore2_d) == list(sstore2) + if strings1 == strings2: + assert list(sstore1_d) == list(sstore2_d) + else: + assert list(sstore1_d) != list(sstore2_d) diff --git a/spacy/tests/stringstore/test_freeze_string_store.py b/spacy/tests/stringstore/test_freeze_string_store.py deleted file mode 100644 index ebfddccac..000000000 --- a/spacy/tests/stringstore/test_freeze_string_store.py +++ /dev/null @@ -1,24 +0,0 @@ -# coding: utf-8 -"""Test the possibly temporary workaround of flushing the stringstore of OOV words.""" - - -from __future__ import unicode_literals - -import pytest - - -@pytest.mark.xfail -@pytest.mark.parametrize('text', [["a", "b", "c"]]) -def test_stringstore_freeze_oov(stringstore, text): - assert stringstore[text[0]] == 1 - assert stringstore[text[1]] == 2 - - 
stringstore.set_frozen(True) - s = stringstore[text[2]] - assert s >= 4 - s_ = stringstore[s] - assert s_ == text[2] - - stringstore.flush_oov() - with pytest.raises(IndexError): - s_ = stringstore[s] diff --git a/spacy/tests/test_align.py b/spacy/tests/test_align.py index 758808f6a..2b5af0d2c 100644 --- a/spacy/tests/test_align.py +++ b/spacy/tests/test_align.py @@ -1,6 +1,8 @@ +# coding: utf-8 from __future__ import unicode_literals + import pytest -from .._align import align, multi_align +from spacy._align import align, multi_align @pytest.mark.parametrize('string1,string2,cost', [ diff --git a/spacy/tests/gold/test_biluo.py b/spacy/tests/test_gold.py similarity index 95% rename from spacy/tests/gold/test_biluo.py rename to spacy/tests/test_gold.py index b89dd46b8..e2354c9db 100644 --- a/spacy/tests/gold/test_biluo.py +++ b/spacy/tests/test_gold.py @@ -1,10 +1,8 @@ # coding: utf-8 from __future__ import unicode_literals -from ...gold import biluo_tags_from_offsets, offsets_from_biluo_tags -from ...tokens.doc import Doc - -import pytest +from spacy.gold import biluo_tags_from_offsets, offsets_from_biluo_tags +from spacy.tokens import Doc def test_gold_biluo_U(en_vocab): diff --git a/spacy/tests/test_misc.py b/spacy/tests/test_misc.py index 33a02a4e4..eb52e8a94 100644 --- a/spacy/tests/test_misc.py +++ b/spacy/tests/test_misc.py @@ -1,18 +1,14 @@ # coding: utf-8 from __future__ import unicode_literals -from ..util import ensure_path -from .. import util -from .. 
import displacy -from ..tokens import Span -from .util import get_doc -from .._ml import PrecomputableAffine - -from pathlib import Path import pytest -from thinc.neural._classes.maxout import Maxout -from thinc.neural._classes.softmax import Softmax -from thinc.api import chain +from pathlib import Path +from spacy import util +from spacy import displacy +from spacy.tokens import Span +from spacy._ml import PrecomputableAffine + +from .util import get_doc @pytest.mark.parametrize('text', ['hello/world', 'hello world']) @@ -37,7 +33,7 @@ def test_util_get_package_path(package): def test_displacy_parse_ents(en_vocab): """Test that named entities on a Doc are converted into displaCy's format.""" doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"]) - doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings[u'ORG'])] + doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings['ORG'])] ents = displacy.parse_ents(doc) assert isinstance(ents, dict) assert ents['text'] == 'But Google is starting from behind ' @@ -67,7 +63,7 @@ def test_displacy_parse_deps(en_vocab): def test_displacy_spans(en_vocab): """Test that displaCy can render Spans.""" doc = get_doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"]) - doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings[u'ORG'])] + doc.ents = [Span(doc, 1, 2, label=doc.vocab.strings['ORG'])] html = displacy.render(doc[1:4], style='ent') assert html.startswith('<div') diff --git a/spacy/tests/tokenizer/test_tokenizer.py b/spacy/tests/tokenizer/test_tokenizer.py index da79b43a8..276ae7f04 100644 --- a/spacy/tests/tokenizer/test_tokenizer.py +++ b/spacy/tests/tokenizer/test_tokenizer.py @@ -1,11 +1,10 @@ # coding: utf-8 from __future__ import unicode_literals -from ...vocab import Vocab -from ...tokenizer import Tokenizer -from ... 
import util - import pytest +from spacy.vocab import Vocab +from spacy.tokenizer import Tokenizer +from spacy.util import ensure_path def test_tokenizer_handles_no_word(tokenizer): @@ -74,7 +73,7 @@ Phasellus tincidunt, augue quis porta finibus, massa sapien consectetur augue, n @pytest.mark.parametrize('file_name', ["sun.txt"]) def test_tokenizer_handle_text_from_file(tokenizer, file_name): - loc = util.ensure_path(__file__).parent / file_name + loc = ensure_path(__file__).parent / file_name text = loc.open('r', encoding='utf8').read() assert len(text) != 0 tokens = tokenizer(text) diff --git a/spacy/tests/tokenizer/test_whitespace.py b/spacy/tests/tokenizer/test_whitespace.py index 7ff3106a8..7c53584cf 100644 --- a/spacy/tests/tokenizer/test_whitespace.py +++ b/spacy/tests/tokenizer/test_whitespace.py @@ -1,7 +1,4 @@ # coding: utf-8 -"""Test that tokens are created correctly for whitespace.""" - - from __future__ import unicode_literals import pytest diff --git a/spacy/tests/util.py b/spacy/tests/util.py index 2de97583c..0d97c7907 100644 --- a/spacy/tests/util.py +++ b/spacy/tests/util.py @@ -1,28 +1,15 @@ # coding: utf-8 from __future__ import unicode_literals -from ..tokens import Doc -from ..attrs import ORTH, POS, HEAD, DEP -from ..compat import path2str - -import pytest import numpy import tempfile import shutil import contextlib import msgpack from pathlib import Path - - -MODELS = {} - - -def load_test_model(model): - """Load a model if it's installed as a package, otherwise skip.""" - if model not in MODELS: - module = pytest.importorskip(model) - MODELS[model] = module.load() - return MODELS[model] +from spacy.tokens import Doc, Span +from spacy.attrs import POS, HEAD, DEP +from spacy.compat import path2str @contextlib.contextmanager @@ -56,7 +43,8 @@ def get_doc(vocab, words=[], pos=None, heads=None, deps=None, tags=None, ents=No attrs[i, 2] = doc.vocab.strings[dep] doc.from_array([POS, HEAD, DEP], attrs) if ents: - doc.ents = [(ent_id, 
doc.vocab.strings[label], start, end) for ent_id, label, start, end in ents] + doc.ents = [Span(doc, start, end, label=doc.vocab.strings[label]) + for start, end, label in ents] if tags: for token in doc: token.tag_ = tags[token.i] diff --git a/spacy/tests/vocab/test_add_vectors.py b/spacy/tests/vocab/test_add_vectors.py deleted file mode 100644 index 3ef599678..000000000 --- a/spacy/tests/vocab/test_add_vectors.py +++ /dev/null @@ -1,40 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -import numpy -from numpy.testing import assert_allclose -from ...vocab import Vocab -from ..._ml import cosine - - -def test_vocab_add_vector(): - vocab = Vocab() - data = numpy.ndarray((5,3), dtype='f') - data[0] = 1. - data[1] = 2. - vocab.set_vector(u'cat', data[0]) - vocab.set_vector(u'dog', data[1]) - cat = vocab[u'cat'] - assert list(cat.vector) == [1., 1., 1.] - dog = vocab[u'dog'] - assert list(dog.vector) == [2., 2., 2.] - - -def test_vocab_prune_vectors(): - vocab = Vocab() - _ = vocab[u'cat'] - _ = vocab[u'dog'] - _ = vocab[u'kitten'] - data = numpy.ndarray((5,3), dtype='f') - data[0] = 1. - data[1] = 2. 
- data[2] = 1.1 - vocab.set_vector(u'cat', data[0]) - vocab.set_vector(u'dog', data[1]) - vocab.set_vector(u'kitten', data[2]) - - remap = vocab.prune_vectors(2) - assert list(remap.keys()) == [u'kitten'] - neighbour, similarity = list(remap.values())[0] - assert neighbour == u'cat', remap - assert_allclose(similarity, cosine(data[0], data[2]), atol=1e-6) diff --git a/spacy/tests/vocab/__init__.py b/spacy/tests/vocab_vectors/__init__.py similarity index 100% rename from spacy/tests/vocab/__init__.py rename to spacy/tests/vocab_vectors/__init__.py diff --git a/spacy/tests/vocab/test_lexeme.py b/spacy/tests/vocab_vectors/test_lexeme.py similarity index 98% rename from spacy/tests/vocab/test_lexeme.py rename to spacy/tests/vocab_vectors/test_lexeme.py index 0140b256a..bc16267f1 100644 --- a/spacy/tests/vocab/test_lexeme.py +++ b/spacy/tests/vocab_vectors/test_lexeme.py @@ -1,9 +1,9 @@ # coding: utf-8 from __future__ import unicode_literals -from ...attrs import * - import pytest +from spacy.attrs import IS_ALPHA, IS_DIGIT + @pytest.mark.parametrize('text1,prob1,text2,prob2', [("NOUN", -1, "opera", -2)]) def test_vocab_lexeme_lt(en_vocab, text1, text2, prob1, prob2): @@ -69,4 +69,3 @@ def test_lexeme_bytes_roundtrip(en_vocab): assert one.orth == alpha.orth assert one.lower == alpha.lower assert one.lower_ == alpha.lower_ - diff --git a/spacy/tests/vectors/test_similarity.py b/spacy/tests/vocab_vectors/test_similarity.py similarity index 87% rename from spacy/tests/vectors/test_similarity.py rename to spacy/tests/vocab_vectors/test_similarity.py index 231e641de..5e73041b5 100644 --- a/spacy/tests/vectors/test_similarity.py +++ b/spacy/tests/vocab_vectors/test_similarity.py @@ -1,10 +1,11 @@ # coding: utf-8 from __future__ import unicode_literals -from ..util import get_doc, get_cosine, add_vecs_to_vocab - -import numpy import pytest +import numpy +from spacy.tokens import Doc + +from ..util import get_cosine, add_vecs_to_vocab @pytest.fixture @@ -32,7 +33,7 @@ def 
test_vectors_similarity_LL(vocab, vectors): def test_vectors_similarity_TT(vocab, vectors): [(word1, vec1), (word2, vec2)] = vectors - doc = get_doc(vocab, words=[word1, word2]) + doc = Doc(vocab, words=[word1, word2]) assert doc[0].has_vector assert doc[1].has_vector assert doc[0].vector_norm != 0 @@ -44,19 +45,19 @@ def test_vectors_similarity_TT(vocab, vectors): def test_vectors_similarity_TD(vocab, vectors): [(word1, vec1), (word2, vec2)] = vectors - doc = get_doc(vocab, words=[word1, word2]) + doc = Doc(vocab, words=[word1, word2]) with pytest.warns(None): assert doc.similarity(doc[0]) == doc[0].similarity(doc) def test_vectors_similarity_DS(vocab, vectors): [(word1, vec1), (word2, vec2)] = vectors - doc = get_doc(vocab, words=[word1, word2]) + doc = Doc(vocab, words=[word1, word2]) assert doc.similarity(doc[:2]) == doc[:2].similarity(doc) def test_vectors_similarity_TS(vocab, vectors): [(word1, vec1), (word2, vec2)] = vectors - doc = get_doc(vocab, words=[word1, word2]) + doc = Doc(vocab, words=[word1, word2]) with pytest.warns(None): assert doc[:2].similarity(doc[0]) == doc[0].similarity(doc[:2]) diff --git a/spacy/tests/stringstore/test_stringstore.py b/spacy/tests/vocab_vectors/test_stringstore.py similarity index 73% rename from spacy/tests/stringstore/test_stringstore.py rename to spacy/tests/vocab_vectors/test_stringstore.py index 3f2992a6f..f70498344 100644 --- a/spacy/tests/stringstore/test_stringstore.py +++ b/spacy/tests/vocab_vectors/test_stringstore.py @@ -1,38 +1,37 @@ # coding: utf-8 from __future__ import unicode_literals -from ...strings import StringStore - import pytest +from spacy.strings import StringStore + + +@pytest.fixture +def stringstore(): + return StringStore() def test_string_hash(stringstore): - '''Test that string hashing is stable across platforms''' - ss = stringstore - assert ss.add('apple') == 8566208034543834098 + """Test that string hashing is stable across platforms""" + assert stringstore.add('apple') == 
8566208034543834098 heart = '\U0001f499' - print(heart) - h = ss.add(heart) + h = stringstore.add(heart) assert h == 11841826740069053588 - + def test_stringstore_from_api_docs(stringstore): apple_hash = stringstore.add('apple') assert apple_hash == 8566208034543834098 - assert stringstore[apple_hash] == u'apple' - - assert u'apple' in stringstore - assert u'cherry' not in stringstore - + assert stringstore[apple_hash] == 'apple' + assert 'apple' in stringstore + assert 'cherry' not in stringstore orange_hash = stringstore.add('orange') all_strings = [s for s in stringstore] - assert all_strings == [u'apple', u'orange'] - + assert all_strings == ['apple', 'orange'] banana_hash = stringstore.add('banana') assert len(stringstore) == 3 assert banana_hash == 2525716904149915114 - assert stringstore[banana_hash] == u'banana' - assert stringstore[u'banana'] == banana_hash + assert stringstore[banana_hash] == 'banana' + assert stringstore['banana'] == banana_hash @pytest.mark.parametrize('text1,text2,text3', [(b'Hello', b'goodbye', b'hello')]) @@ -99,3 +98,22 @@ def test_stringstore_to_bytes(stringstore, text): serialized = stringstore.to_bytes() new_stringstore = StringStore().from_bytes(serialized) assert new_stringstore[store] == text + + +@pytest.mark.xfail +@pytest.mark.parametrize('text', [["a", "b", "c"]]) +def test_stringstore_freeze_oov(stringstore, text): + """Test the possibly temporary workaround of flushing the stringstore of + OOV words.""" + assert stringstore[text[0]] == 1 + assert stringstore[text[1]] == 2 + + stringstore.set_frozen(True) + s = stringstore[text[2]] + assert s >= 4 + s_ = stringstore[s] + assert s_ == text[2] + + stringstore.flush_oov() + with pytest.raises(IndexError): + s_ = stringstore[s] diff --git a/spacy/tests/vectors/test_vectors.py b/spacy/tests/vocab_vectors/test_vectors.py similarity index 81% rename from spacy/tests/vectors/test_vectors.py rename to spacy/tests/vocab_vectors/test_vectors.py index 831fbf003..0b95da59c 100644 --- 
a/spacy/tests/vectors/test_vectors.py +++ b/spacy/tests/vocab_vectors/test_vectors.py @@ -1,19 +1,24 @@ # coding: utf-8 from __future__ import unicode_literals -from ...vectors import Vectors -from ...tokenizer import Tokenizer -from ...strings import hash_string -from ..util import add_vecs_to_vocab, get_doc - -import numpy import pytest +import numpy +from numpy.testing import assert_allclose +from spacy._ml import cosine +from spacy.vocab import Vocab +from spacy.vectors import Vectors +from spacy.tokenizer import Tokenizer +from spacy.strings import hash_string +from spacy.tokens import Doc + +from ..util import add_vecs_to_vocab @pytest.fixture def strings(): return ["apple", "orange"] + @pytest.fixture def vectors(): return [ @@ -23,6 +28,7 @@ def vectors(): ('juice', [5, 5, 10]), ('pie', [7, 6.3, 8.9])] + @pytest.fixture def ngrams_vectors(): return [ @@ -31,36 +37,49 @@ def ngrams_vectors(): ('ppl', [-0.2, -0.3, -0.4]), ('pl', [0.7, 0.8, 0.9]) ] + + @pytest.fixture() def ngrams_vocab(en_vocab, ngrams_vectors): add_vecs_to_vocab(en_vocab, ngrams_vectors) return en_vocab + @pytest.fixture def data(): return numpy.asarray([[0.0, 1.0, 2.0], [3.0, -2.0, 4.0]], dtype='f') + @pytest.fixture def resize_data(): return numpy.asarray([[0.0, 1.0], [2.0, 3.0]], dtype='f') + @pytest.fixture() def vocab(en_vocab, vectors): add_vecs_to_vocab(en_vocab, vectors) return en_vocab + +@pytest.fixture() +def tokenizer_v(vocab): + return Tokenizer(vocab, {}, None, None, None) + + def test_init_vectors_with_resize_shape(strings,resize_data): v = Vectors(shape=(len(strings), 3)) v.resize(shape=resize_data.shape) assert v.shape == resize_data.shape assert v.shape != (len(strings), 3) + def test_init_vectors_with_resize_data(data,resize_data): v = Vectors(data=data) v.resize(shape=resize_data.shape) assert v.shape == resize_data.shape assert v.shape != data.shape + def test_get_vector_resize(strings, data,resize_data): v = Vectors(data=data) v.resize(shape=resize_data.shape) @@ -73,10 
+92,12 @@ def test_get_vector_resize(strings, data,resize_data): assert list(v[strings[1]]) != list(resize_data[0]) assert list(v[strings[1]]) == list(resize_data[1]) + def test_init_vectors_with_data(strings, data): v = Vectors(data=data) assert v.shape == data.shape + def test_init_vectors_with_shape(strings): v = Vectors(shape=(len(strings), 3)) assert v.shape == (len(strings), 3) @@ -105,11 +126,6 @@ def test_set_vector(strings, data): assert list(v[strings[0]]) != list(orig[0]) -@pytest.fixture() -def tokenizer_v(vocab): - return Tokenizer(vocab, {}, None, None, None) - - @pytest.mark.parametrize('text', ["apple and orange"]) def test_vectors_token_vector(tokenizer_v, vectors, text): doc = tokenizer_v(text) @@ -138,14 +154,14 @@ def test_vectors_lexeme_vector(vocab, text): @pytest.mark.parametrize('text', [["apple", "and", "orange"]]) def test_vectors_doc_vector(vocab, text): - doc = get_doc(vocab, text) + doc = Doc(vocab, words=text) assert list(doc.vector) assert doc.vector_norm @pytest.mark.parametrize('text', [["apple", "and", "orange"]]) def test_vectors_span_vector(vocab, text): - span = get_doc(vocab, text)[0:2] + span = Doc(vocab, words=text)[0:2] assert list(span.vector) assert span.vector_norm @@ -167,21 +183,21 @@ def test_vectors_token_lexeme_similarity(tokenizer_v, vocab, text1, text2): @pytest.mark.parametrize('text', [["apple", "orange", "juice"]]) def test_vectors_token_span_similarity(vocab, text): - doc = get_doc(vocab, text) + doc = Doc(vocab, words=text) assert doc[0].similarity(doc[1:3]) == doc[1:3].similarity(doc[0]) assert -1. < doc[0].similarity(doc[1:3]) < 1.0 @pytest.mark.parametrize('text', [["apple", "orange", "juice"]]) def test_vectors_token_doc_similarity(vocab, text): - doc = get_doc(vocab, text) + doc = Doc(vocab, words=text) assert doc[0].similarity(doc) == doc.similarity(doc[0]) assert -1. 
< doc[0].similarity(doc) < 1.0 @pytest.mark.parametrize('text', [["apple", "orange", "juice"]]) def test_vectors_lexeme_span_similarity(vocab, text): - doc = get_doc(vocab, text) + doc = Doc(vocab, words=text) lex = vocab[text[0]] assert lex.similarity(doc[1:3]) == doc[1:3].similarity(lex) assert -1. < doc.similarity(doc[1:3]) < 1.0 @@ -197,7 +213,7 @@ def test_vectors_lexeme_lexeme_similarity(vocab, text1, text2): @pytest.mark.parametrize('text', [["apple", "orange", "juice"]]) def test_vectors_lexeme_doc_similarity(vocab, text): - doc = get_doc(vocab, text) + doc = Doc(vocab, words=text) lex = vocab[text[0]] assert lex.similarity(doc) == doc.similarity(lex) assert -1. < lex.similarity(doc) < 1.0 @@ -205,7 +221,7 @@ def test_vectors_lexeme_doc_similarity(vocab, text): @pytest.mark.parametrize('text', [["apple", "orange", "juice"]]) def test_vectors_span_span_similarity(vocab, text): - doc = get_doc(vocab, text) + doc = Doc(vocab, words=text) with pytest.warns(None): assert doc[0:2].similarity(doc[1:3]) == doc[1:3].similarity(doc[0:2]) assert -1. < doc[0:2].similarity(doc[1:3]) < 1.0 @@ -213,7 +229,7 @@ def test_vectors_span_span_similarity(vocab, text): @pytest.mark.parametrize('text', [["apple", "orange", "juice"]]) def test_vectors_span_doc_similarity(vocab, text): - doc = get_doc(vocab, text) + doc = Doc(vocab, words=text) with pytest.warns(None): assert doc[0:2].similarity(doc) == doc.similarity(doc[0:2]) assert -1. < doc[0:2].similarity(doc) < 1.0 @@ -222,7 +238,40 @@ def test_vectors_span_doc_similarity(vocab, text): @pytest.mark.parametrize('text1,text2', [ (["apple", "and", "apple", "pie"], ["orange", "juice"])]) def test_vectors_doc_doc_similarity(vocab, text1, text2): - doc1 = get_doc(vocab, text1) - doc2 = get_doc(vocab, text2) + doc1 = Doc(vocab, words=text1) + doc2 = Doc(vocab, words=text2) assert doc1.similarity(doc2) == doc2.similarity(doc1) assert -1. 
< doc1.similarity(doc2) < 1.0 + + +def test_vocab_add_vector(): + vocab = Vocab() + data = numpy.ndarray((5,3), dtype='f') + data[0] = 1. + data[1] = 2. + vocab.set_vector('cat', data[0]) + vocab.set_vector('dog', data[1]) + cat = vocab['cat'] + assert list(cat.vector) == [1., 1., 1.] + dog = vocab['dog'] + assert list(dog.vector) == [2., 2., 2.] + + +def test_vocab_prune_vectors(): + vocab = Vocab() + _ = vocab['cat'] + _ = vocab['dog'] + _ = vocab['kitten'] + data = numpy.ndarray((5,3), dtype='f') + data[0] = 1. + data[1] = 2. + data[2] = 1.1 + vocab.set_vector('cat', data[0]) + vocab.set_vector('dog', data[1]) + vocab.set_vector('kitten', data[2]) + + remap = vocab.prune_vectors(2) + assert list(remap.keys()) == ['kitten'] + neighbour, similarity = list(remap.values())[0] + assert neighbour == 'cat', remap + assert_allclose(similarity, cosine(data[0], data[2]), atol=1e-6) diff --git a/spacy/tests/vocab/test_vocab_api.py b/spacy/tests/vocab_vectors/test_vocab_api.py similarity index 91% rename from spacy/tests/vocab/test_vocab_api.py rename to spacy/tests/vocab_vectors/test_vocab_api.py index 8d845a2ce..dc504c2f6 100644 --- a/spacy/tests/vocab/test_vocab_api.py +++ b/spacy/tests/vocab_vectors/test_vocab_api.py @@ -1,10 +1,9 @@ # coding: utf-8 from __future__ import unicode_literals -from ...attrs import LEMMA, ORTH, PROB, IS_ALPHA -from ...parts_of_speech import NOUN, VERB - import pytest +from spacy.attrs import LEMMA, ORTH, PROB, IS_ALPHA +from spacy.parts_of_speech import NOUN, VERB @pytest.mark.parametrize('text1,text2', [ diff --git a/website/usage/_adding-languages/_testing.jade b/website/usage/_adding-languages/_testing.jade index 825d8db6f..4c63bb8d5 100644 --- a/website/usage/_adding-languages/_testing.jade +++ b/website/usage/_adding-languages/_testing.jade @@ -13,31 +13,6 @@ p | practices for writing your own tests, see our | #[+a(gh("spaCy", "spacy/tests")) tests documentation]. 
-p - | The easiest way to test your new tokenizer is to run the - | language-independent "tokenizer sanity" tests located in - | #[+src(gh("spaCy", "spacy/tests/tokenizer")) #[code tests/tokenizer]]. - | This will test for basic behaviours like punctuation splitting, URL - | matching and correct handling of whitespace. In the - | #[+src(gh("spaCy", "spacy/tests/conftest.py")) #[code conftest.py]], add - | the new language ID to the list of #[code _languages]: - -+code. - _languages = ['bn', 'da', 'de', 'en', 'es', 'fi', 'fr', 'he', 'hu', 'it', 'nb', - 'nl', 'pl', 'pt', 'sv', 'xx'] # new language here - -+aside-code("Global tokenizer test example"). - # use fixture by adding it as an argument - def test_with_all_languages(tokenizer): - # will be performed on ALL language tokenizers - tokens = tokenizer(u'Some text here.') - -p - | The language will now be included in the #[code tokenizer] test fixture, - | which is used by the basic tokenizer tests. If you want to add your own - | tests that should be run over all languages, you can use this fixture as - | an argument of your test function. - +h(3, "testing-custom") Writing language-specific tests p
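The `to_disk`/`from_disk` roundtrip tests added above lean on the `make_tempdir` context manager from `spacy/tests/util.py` (its `contextlib`/`tempfile`/`shutil` imports are visible in the diff). As a minimal, standard-library sketch of that pattern — using `str()` where spaCy's helper goes through its `path2str` compat shim:

```python
import contextlib
import shutil
import tempfile
from pathlib import Path


@contextlib.contextmanager
def make_tempdir():
    """Yield a temporary directory as a Path, removing it on exit."""
    d = Path(tempfile.mkdtemp())
    try:
        yield d
    finally:
        shutil.rmtree(str(d))


# Usage mirroring the serialization roundtrip tests: write to a path
# inside the managed directory, then rely on automatic cleanup.
with make_tempdir() as d:
    file_path = d / 'strings.json'
    file_path.write_text('["apple", "orange"]')
    assert file_path.exists()

# The directory (and everything in it) is gone once the block exits.
assert not d.exists()
```

Wrapping the cleanup in `try`/`finally` means the temporary directory is removed even when an assertion inside the `with` block fails, which keeps repeated test runs from leaking directories.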