spaCy/tests/tokenizer/test_special_affix.py

"""Test entries in the tokenization special-case interacting with prefix
and suffix punctuation."""
from __future__ import unicode_literals
import pytest

from spacy.en import English

@pytest.fixture
def EN():
    return English().tokenizer


def test_no_special(EN):
    assert len(EN("(can)")) == 3


def test_no_punct(EN):
    assert len(EN("can't")) == 2


def test_prefix(EN):
    assert len(EN("(can't")) == 3


def test_suffix(EN):
    assert len(EN("can't)")) == 3


def test_wrap(EN):
    assert len(EN("(can't)")) == 4


def test_uneven_wrap(EN):
    assert len(EN("(can't?)")) == 5


def test_prefix_interact(EN):
    assert len(EN("U.S.")) == 1
    assert len(EN("us.")) == 2
    assert len(EN("(U.S.")) == 2


def test_suffix_interact(EN):
    assert len(EN("U.S.)")) == 2


def test_even_wrap_interact(EN):
    assert len(EN("(U.S.)")) == 3


def test_uneven_wrap_interact(EN):
    assert len(EN("(U.S.?)")) == 4
* Refactor tokenization, enable cache, and ensure we look up specials correctly even when there's confusing punctuation surrounding the token. 2014-09-16 20:01:46 +04:00			`"""Test entries in the tokenization special-case interacting with prefix`
			`and suffix punctuation."""`
			`from __future__ import unicode_literals`
			`import pytest`

* Tests passing except for morphology/lemmatization stuff 2014-12-23 03:40:32 +03:00			`from spacy.en import English`
* Refactor tokenization, enable cache, and ensure we look up specials correctly even when there's confusing punctuation surrounding the token. 2014-09-16 20:01:46 +04:00
* Tests passing except for morphology/lemmatization stuff 2014-12-23 03:40:32 +03:00			`@pytest.fixture`
			`def EN():`
* Upd tests, avoiding unnecessary processing to make testing faster 2015-01-30 02:41:55 +03:00			`return English().tokenizer`
* Refactor tokenization, enable cache, and ensure we look up specials correctly even when there's confusing punctuation surrounding the token. 2014-09-16 20:01:46 +04:00

* Tests passing except for morphology/lemmatization stuff 2014-12-23 03:40:32 +03:00			`def test_no_special(EN):`
			`assert len(EN("(can)")) == 3`
* Refactor tokenization, enable cache, and ensure we look up specials correctly even when there's confusing punctuation surrounding the token. 2014-09-16 20:01:46 +04:00
Tweak line spacing 2015-04-19 22:39:18 +03:00
* Tests passing except for morphology/lemmatization stuff 2014-12-23 03:40:32 +03:00			`def test_no_punct(EN):`
			`assert len(EN("can't")) == 2`
* Refactor tokenization, enable cache, and ensure we look up specials correctly even when there's confusing punctuation surrounding the token. 2014-09-16 20:01:46 +04:00
Tweak line spacing 2015-04-19 22:39:18 +03:00
* Tests passing except for morphology/lemmatization stuff 2014-12-23 03:40:32 +03:00			`def test_prefix(EN):`
			`assert len(EN("(can't")) == 3`
* Refactor tokenization, enable cache, and ensure we look up specials correctly even when there's confusing punctuation surrounding the token. 2014-09-16 20:01:46 +04:00

* Tests passing except for morphology/lemmatization stuff 2014-12-23 03:40:32 +03:00			`def test_suffix(EN):`
			`assert len(EN("can't)")) == 3`
* Refactor tokenization, enable cache, and ensure we look up specials correctly even when there's confusing punctuation surrounding the token. 2014-09-16 20:01:46 +04:00

* Tests passing except for morphology/lemmatization stuff 2014-12-23 03:40:32 +03:00			`def test_wrap(EN):`
			`assert len(EN("(can't)")) == 4`
* Refactor tokenization, enable cache, and ensure we look up specials correctly even when there's confusing punctuation surrounding the token. 2014-09-16 20:01:46 +04:00

* Tests passing except for morphology/lemmatization stuff 2014-12-23 03:40:32 +03:00			`def test_uneven_wrap(EN):`
			`assert len(EN("(can't?)")) == 5`
* Refactor tokenization, enable cache, and ensure we look up specials correctly even when there's confusing punctuation surrounding the token. 2014-09-16 20:01:46 +04:00

* Tests passing except for morphology/lemmatization stuff 2014-12-23 03:40:32 +03:00			`def test_prefix_interact(EN):`
			`assert len(EN("U.S.")) == 1`
			`assert len(EN("us.")) == 2`
			`assert len(EN("(U.S.")) == 2`
* Refactor tokenization, enable cache, and ensure we look up specials correctly even when there's confusing punctuation surrounding the token. 2014-09-16 20:01:46 +04:00

* Tests passing except for morphology/lemmatization stuff 2014-12-23 03:40:32 +03:00			`def test_suffix_interact(EN):`
			`assert len(EN("U.S.)")) == 2`
* Refactor tokenization, enable cache, and ensure we look up specials correctly even when there's confusing punctuation surrounding the token. 2014-09-16 20:01:46 +04:00

* Tests passing except for morphology/lemmatization stuff 2014-12-23 03:40:32 +03:00			`def test_even_wrap_interact(EN):`
			`assert len(EN("(U.S.)")) == 3`


			`def test_uneven_wrap_interact(EN):`
			`assert len(EN("(U.S.?)")) == 4`