💫 Refactor test suite (#2568)

## Description

Related issues: #2379 (should be fixed by separating model tests)

* **total execution time down from > 300 seconds to under 60 seconds** 🎉
* removed all model-specific tests that could only really be run manually anyway – those will now live in a separate test suite in the [`spacy-models`](https://github.com/explosion/spacy-models) repository and are already integrated into our new model training infrastructure
* changed all relative imports to absolute imports to prepare for moving the test suite from `/spacy/tests` to `/tests` (it'll then always test against the installed version); see the short example after this list
* merged old regression tests into collections, e.g. `test_issue1001-1500.py` (about 90% of the regression tests are very short anyway)
* tidied up and rewrote existing tests wherever possible
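
As a minimal sketch of the new import style (the module and test body below are made up for illustration; only the import paths follow the refactored tests):

```python
# coding: utf-8
from __future__ import unicode_literals

# absolute imports into the installed spaCy package instead of the old
# relative style (e.g. `from ...tokens import Doc`); only the suite's own
# helpers keep a relative import, e.g. `from ..util import get_doc`
from spacy.tokens import Doc
from spacy.vocab import Vocab


def test_example_doc_words():
    # hypothetical test body, just to show the pattern
    doc = Doc(Vocab(), words=["This", "is", "a", "test"])
    assert [t.text for t in doc] == ["This", "is", "a", "test"]
```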

### Todo

- [ ] move tests to `/tests` and adjust CI commands accordingly
- [x] move model test suite from internal repo to `spacy-models`
- [x] ~~investigate why `pipeline/test_textcat.py` is flaky~~
- [x] review old regression tests (leftover files) and see if they can be merged, simplified or deleted
- [ ] update documentation on how to run tests


### Types of change
enhancement, tests

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.
Ines Montani 2018-07-24 23:38:44 +02:00 committed by GitHub
parent 82277f63a3
commit 75f3234404
218 changed files with 2117 additions and 3775 deletions

.gitignore

@@ -36,6 +36,7 @@ venv/
 .dev
 .denv
 .pypyenv
+.pytest_cache/
 # Distribution / packaging
 env/

@@ -11,5 +11,6 @@ dill>=0.2,<0.3
 regex==2017.4.5
 requests>=2.13.0,<3.0.0
 pytest>=3.6.0,<4.0.0
+pytest-timeout>=1.3.0,<2.0.0
 mock>=2.0.0,<3.0.0
 pathlib==1.0.1; python_version < "3.4"

@@ -6,6 +6,7 @@ spaCy uses the [pytest](http://doc.pytest.org/) framework for testing. For more
 Tests for spaCy modules and classes live in their own directories of the same name. For example, tests for the `Tokenizer` can be found in [`/tests/tokenizer`](tokenizer). All test modules (i.e. directories) also need to be listed in spaCy's [`setup.py`](../setup.py). To be interpreted and run, all test files and test functions need to be prefixed with `test_`.
+> ⚠️ **Important note:** As part of our new model training infrastructure, we've moved all model tests to the [`spacy-models`](https://github.com/explosion/spacy-models) repository. This allows us to test the models separately from the core library functionality.
 ## Table of contents
@@ -13,9 +14,8 @@ Tests for spaCy modules and classes live in their own directories of the same na
 2. [Dos and don'ts](#dos-and-donts)
 3. [Parameters](#parameters)
 4. [Fixtures](#fixtures)
-5. [Testing models](#testing-models)
-6. [Helpers and utilities](#helpers-and-utilities)
-7. [Contributing to the tests](#contributing-to-the-tests)
+5. [Helpers and utilities](#helpers-and-utilities)
+6. [Contributing to the tests](#contributing-to-the-tests)
 ## Running the tests
@@ -25,10 +25,7 @@ first failure, run them with `py.test -x`.
 ```bash
 py.test spacy # run basic tests
-py.test spacy --models --en # run basic and English model tests
-py.test spacy --models --all # run basic and all model tests
 py.test spacy --slow # run basic and slow tests
-py.test spacy --models --all --slow # run all tests
 ```
 You can also run tests in a specific file or directory, or even only one
@@ -48,10 +45,10 @@ To keep the behaviour of the tests consistent and predictable, we try to follow
 * If you're testing for a bug reported in a specific issue, always create a **regression test**. Regression tests should be named `test_issue[ISSUE NUMBER]` and live in the [`regression`](regression) directory.
 * Only use `@pytest.mark.xfail` for tests that **should pass, but currently fail**. To test for desired negative behaviour, use `assert not` in your test.
 * Very **extensive tests** that take a long time to run should be marked with `@pytest.mark.slow`. If your slow test is testing important behaviour, consider adding an additional simpler version.
-* Tests that require **loading the models** should be marked with `@pytest.mark.models`.
+* If tests require **loading the models**, they should be added to the [`spacy-models`](https://github.com/explosion/spacy-models) tests.
 * Before requiring the models, always make sure there is no other way to test the particular behaviour. In a lot of cases, it's sufficient to simply create a `Doc` object manually. See the section on [helpers and utility functions](#helpers-and-utilities) for more info on this.
-* **Avoid unnecessary imports.** There should never be a need to explicitly import spaCy at the top of a file, and most components are available as [fixtures](#fixtures). You should also avoid wildcard imports (`from module import *`).
+* **Avoid unnecessary imports.** There should never be a need to explicitly import spaCy at the top of a file, and many components are available as [fixtures](#fixtures). You should also avoid wildcard imports (`from module import *`).
-* If you're importing from spaCy, **always use relative imports**. Otherwise, you might accidentally be running the tests over a different copy of spaCy, e.g. one you have installed on your system.
+* If you're importing from spaCy, **always use absolute imports**. For example: `from spacy.language import Language`.
 * Don't forget the **unicode declarations** at the top of each file. This way, unicode strings won't have to be prefixed with `u`.
 * Try to keep the tests **readable and concise**. Use clear and descriptive variable names (`doc`, `tokens` and `text` are great), keep it short and only test for one behaviour at a time.
@@ -93,12 +90,9 @@ These are the main fixtures that are currently available:
 | Fixture | Description |
 | --- | --- |
-| `tokenizer` | Creates **all available** language tokenizers and runs the test for **each of them**. |
+| `tokenizer` | Basic, language-independent tokenizer. Identical to the `xx` language class. |
 | `en_tokenizer`, `de_tokenizer`, ... | Creates an English, German etc. tokenizer. |
-| `en_vocab`, `en_entityrecognizer`, ... | Creates an instance of the English `Vocab`, `EntityRecognizer` object etc. |
+| `en_vocab` | Creates an instance of the English `Vocab`. |
-| `EN`, `DE`, ... | Creates a language class with a loaded model. For more info, see [Testing models](#testing-models). |
-| `text_file` | Creates an instance of `StringIO` to simulate reading from and writing to files. |
-| `text_file_b` | Creates an instance of `ByteIO` to simulate reading from and writing to files. |
 The fixtures can be used in all tests by simply setting them as an argument, like this:
@@ -109,49 +103,6 @@ def test_module_do_something(en_tokenizer):
 If all tests in a file require a specific configuration, or use the same complex example, it can be helpful to create a separate fixture. This fixture should be added at the top of each file. Make sure to use descriptive names for these fixtures and don't override any of the global fixtures listed above. **From looking at a test, it should immediately be clear which fixtures are used, and where they are coming from.**
-## Testing models
-Models should only be loaded and tested **if absolutely necessary** – for example, if you're specifically testing a model's performance, or if your test is related to model loading. If you only need an annotated `Doc`, you should use the `get_doc()` helper function to create it manually instead.
-To specify which language models a test is related to, set the language ID as an argument of `@pytest.mark.models`. This allows you to later run the tests with `--models --en`. You can then use the `EN` [fixture](#fixtures) to get a language
-class with a loaded model.
-```python
-@pytest.mark.models('en')
-def test_english_model(EN):
-    doc = EN(u'This is a test')
-```
-> ⚠️ **Important note:** In order to test models, they need to be installed as a packge. The [conftest.py](conftest.py) includes a list of all available models, mapped to their IDs, e.g. `en`. Unless otherwise specified, each model that's installed in your environment will be imported and tested. If you don't have a model installed, **the test will be skipped**.
-Under the hood, `pytest.importorskip` is used to import a model package and skip the test if the package is not installed. The `EN` fixture for example gets all
-available models for `en`, [parametrizes](#parameters) them to run the test for *each of them*, and uses `load_test_model()` to import the model and run the test, or skip it if the model is not installed.
-### Testing specific models
-Using the `load_test_model()` helper function, you can also write tests for specific models, or combinations of them:
-```python
-from .util import load_test_model
-
-@pytest.mark.models('en')
-def test_en_md_only():
-    nlp = load_test_model('en_core_web_md')
-    # test something specific to en_core_web_md
-
-@pytest.mark.models('en', 'fr')
-@pytest.mark.parametrize('model', ['en_core_web_md', 'fr_depvec_web_lg'])
-def test_different_models(model):
-    nlp = load_test_model(model)
-    # test something specific to the parametrized models
-```
-### Known issues and future improvements
-Using `importorskip` on a list of model packages is not ideal and we're looking to improve this in the future. But at the moment, it's the best way to ensure that tests are performed on specific model packages only, and that you'll always be able to run the tests, even if you don't have *all available models* installed. (If the tests made a call to `spacy.load('en')` instead, this would load whichever model you've created an `en` shortcut for. This may be one of spaCy's default models, but it could just as easily be your own custom English model.)
-The current setup also doesn't provide an easy way to only run tests on specific model versions. The `minversion` keyword argument on `pytest.importorskip` can take care of this, but it currently only checks for the package's `__version__` attribute. An alternative solution would be to load a model package's meta.json and skip if the model's version does not match the one specified in the test.
 ## Helpers and utilities
 Our new test setup comes with a few handy utility functions that can be imported from [`util.py`](util.py).
@@ -186,7 +137,7 @@ You can construct a `Doc` with the following arguments:
 | `pos` | List of POS tags as text values. |
 | `tag` | List of tag names as text values. |
 | `dep` | List of dependencies as text values. |
-| `ents` | List of entity tuples with `ent_id`, `label`, `start`, `end` (for example `('Stewart Lee', 'PERSON', 0, 2)`). The `label` will be looked up in `vocab.strings[label]`. |
+| `ents` | List of entity tuples with `start`, `end`, `label` (for example `(0, 2, 'PERSON')`). The `label` will be looked up in `vocab.strings[label]`. |
 Here's how to quickly get these values from within spaCy:
@@ -196,6 +147,7 @@ print([token.head.i-token.i for token in doc])
 print([token.tag_ for token in doc])
 print([token.pos_ for token in doc])
 print([token.dep_ for token in doc])
+print([(ent.start, ent.end, ent.label_) for ent in doc.ents])
 ```
 **Note:** There's currently no way of setting the serializer data for the parser without loading the models. If this is relevant to your test, constructing the `Doc` via `get_doc()` won't work.
@@ -204,7 +156,6 @@ print([token.dep_ for token in doc])
 | Name | Description |
 | --- | --- |
-| `load_test_model` | Load a model if it's installed as a package, otherwise skip test. |
 | `apply_transition_sequence(parser, doc, sequence)` | Perform a series of pre-specified transitions, to put the parser in a desired state. |
 | `add_vecs_to_vocab(vocab, vectors)` | Add list of vector tuples (`[("text", [1, 2, 3])]`) to given vocab. All vectors need to have the same length. |
 | `get_cosine(vec1, vec2)` | Get cosine for two given vectors. |
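
As a quick illustration of the fixture pattern described in the README above (fixtures from `conftest.py` are requested simply by naming them as test arguments), here is a minimal, hypothetical test; only the `en_tokenizer` fixture itself is real:

```python
# coding: utf-8
from __future__ import unicode_literals


def test_en_tokenizer_splits_trailing_punct(en_tokenizer):
    # pytest injects the shared English tokenizer defined in conftest.py
    doc = en_tokenizer("This is a sentence.")
    assert [t.text for t in doc] == ["This", "is", "a", "sentence", "."]
```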

@@ -1,229 +1,145 @@
 # coding: utf-8
 from __future__ import unicode_literals
-from io import StringIO, BytesIO
-from pathlib import Path
 import pytest
+from io import StringIO, BytesIO
-from .util import load_test_model
-from ..tokens import Doc
-from ..strings import StringStore
-from .. import util
+from spacy.util import get_lang_class
-# These languages are used for generic tokenizer tests – only add a language
-# here if it's using spaCy's tokenizer (not a different library)
-# TODO: re-implement generic tokenizer tests
-_languages = ['bn', 'da', 'de', 'el', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id',
-              'it', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sv', 'tr', 'ar', 'ut', 'tt',
-              'xx']
-_models = {'en': ['en_core_web_sm'],
-           'de': ['de_core_news_sm'],
-           'fr': ['fr_core_news_sm'],
-           'xx': ['xx_ent_web_sm'],
-           'en_core_web_md': ['en_core_web_md'],
-           'es_core_news_md': ['es_core_news_md']}
-# only used for tests that require loading the models
-# in all other cases, use specific instances
-@pytest.fixture(params=_models['en'])
-def EN(request):
-    return load_test_model(request.param)
-@pytest.fixture(params=_models['de'])
-def DE(request):
-    return load_test_model(request.param)
-@pytest.fixture(params=_models['fr'])
-def FR(request):
-    return load_test_model(request.param)
-@pytest.fixture()
-def RU(request):
-    pymorphy = pytest.importorskip('pymorphy2')
-    return util.get_lang_class('ru')()
-@pytest.fixture()
-def JA(request):
-    mecab = pytest.importorskip("MeCab")
-    return util.get_lang_class('ja')()
-#@pytest.fixture(params=_languages)
-#def tokenizer(request):
-#    lang = util.get_lang_class(request.param)
-#    return lang.Defaults.create_tokenizer()
+def pytest_addoption(parser):
+    parser.addoption("--slow", action="store_true", help="include slow tests")
+def pytest_runtest_setup(item):
+    for opt in ['slow']:
+        if opt in item.keywords and not item.config.getoption("--%s" % opt):
+            pytest.skip("need --%s option to run" % opt)
-@pytest.fixture
+@pytest.fixture(scope='module')
 def tokenizer():
-    return util.get_lang_class('xx').Defaults.create_tokenizer()
+    return get_lang_class('xx').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def en_tokenizer():
-    return util.get_lang_class('en').Defaults.create_tokenizer()
+    return get_lang_class('en').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def en_vocab():
-    return util.get_lang_class('en').Defaults.create_vocab()
+    return get_lang_class('en').Defaults.create_vocab()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def en_parser(en_vocab):
-    nlp = util.get_lang_class('en')(en_vocab)
+    nlp = get_lang_class('en')(en_vocab)
     return nlp.create_pipe('parser')
-@pytest.fixture
+@pytest.fixture(scope='session')
 def es_tokenizer():
-    return util.get_lang_class('es').Defaults.create_tokenizer()
+    return get_lang_class('es').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def de_tokenizer():
-    return util.get_lang_class('de').Defaults.create_tokenizer()
+    return get_lang_class('de').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def fr_tokenizer():
-    return util.get_lang_class('fr').Defaults.create_tokenizer()
+    return get_lang_class('fr').Defaults.create_tokenizer()
 @pytest.fixture
 def hu_tokenizer():
-    return util.get_lang_class('hu').Defaults.create_tokenizer()
+    return get_lang_class('hu').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def fi_tokenizer():
-    return util.get_lang_class('fi').Defaults.create_tokenizer()
+    return get_lang_class('fi').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def ro_tokenizer():
-    return util.get_lang_class('ro').Defaults.create_tokenizer()
+    return get_lang_class('ro').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def id_tokenizer():
-    return util.get_lang_class('id').Defaults.create_tokenizer()
+    return get_lang_class('id').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def sv_tokenizer():
-    return util.get_lang_class('sv').Defaults.create_tokenizer()
+    return get_lang_class('sv').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def bn_tokenizer():
-    return util.get_lang_class('bn').Defaults.create_tokenizer()
+    return get_lang_class('bn').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def ga_tokenizer():
-    return util.get_lang_class('ga').Defaults.create_tokenizer()
+    return get_lang_class('ga').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def he_tokenizer():
-    return util.get_lang_class('he').Defaults.create_tokenizer()
+    return get_lang_class('he').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def nb_tokenizer():
-    return util.get_lang_class('nb').Defaults.create_tokenizer()
+    return get_lang_class('nb').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def da_tokenizer():
-    return util.get_lang_class('da').Defaults.create_tokenizer()
+    return get_lang_class('da').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def ja_tokenizer():
     mecab = pytest.importorskip("MeCab")
-    return util.get_lang_class('ja').Defaults.create_tokenizer()
+    return get_lang_class('ja').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def th_tokenizer():
     pythainlp = pytest.importorskip("pythainlp")
-    return util.get_lang_class('th').Defaults.create_tokenizer()
+    return get_lang_class('th').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def tr_tokenizer():
-    return util.get_lang_class('tr').Defaults.create_tokenizer()
+    return get_lang_class('tr').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def tt_tokenizer():
-    return util.get_lang_class('tt').Defaults.create_tokenizer()
+    return get_lang_class('tt').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def el_tokenizer():
-    return util.get_lang_class('el').Defaults.create_tokenizer()
+    return get_lang_class('el').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def ar_tokenizer():
-    return util.get_lang_class('ar').Defaults.create_tokenizer()
+    return get_lang_class('ar').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def ur_tokenizer():
-    return util.get_lang_class('ur').Defaults.create_tokenizer()
+    return get_lang_class('ur').Defaults.create_tokenizer()
-@pytest.fixture
+@pytest.fixture(scope='session')
 def ru_tokenizer():
     pymorphy = pytest.importorskip('pymorphy2')
-    return util.get_lang_class('ru').Defaults.create_tokenizer()
+    return get_lang_class('ru').Defaults.create_tokenizer()
-@pytest.fixture
-def stringstore():
-    return StringStore()
-@pytest.fixture
-def en_entityrecognizer():
-    return util.get_lang_class('en').Defaults.create_entity()
-@pytest.fixture
-def text_file():
-    return StringIO()
-@pytest.fixture
-def text_file_b():
-    return BytesIO()
-def pytest_addoption(parser):
-    parser.addoption("--models", action="store_true",
-                     help="include tests that require full models")
-    parser.addoption("--vectors", action="store_true",
-                     help="include word vectors tests")
-    parser.addoption("--slow", action="store_true",
-                     help="include slow tests")
-    for lang in _languages + ['all']:
-        parser.addoption("--%s" % lang, action="store_true", help="Use %s models" % lang)
-    for model in _models:
-        if model not in _languages:
-            parser.addoption("--%s" % model, action="store_true", help="Use %s model" % model)
-def pytest_runtest_setup(item):
-    for opt in ['models', 'vectors', 'slow']:
-        if opt in item.keywords and not item.config.getoption("--%s" % opt):
-            pytest.skip("need --%s option to run" % opt)
-    # Check if test is marked with models and has arguments set, i.e. specific
-    # language. If so, skip test if flag not set.
-    if item.get_marker('models'):
-        for arg in item.get_marker('models').args:
-            if not item.config.getoption("--%s" % arg) and not item.config.getoption("--all"):
-                pytest.skip("need --%s or --all option to run" % arg)

@@ -1,24 +0,0 @@
-# coding: utf-8
-from __future__ import unicode_literals
-from ...pipeline import EntityRecognizer
-from ..util import get_doc
-import pytest
-def test_doc_add_entities_set_ents_iob(en_vocab):
-    text = ["This", "is", "a", "lion"]
-    doc = get_doc(en_vocab, text)
-    ner = EntityRecognizer(en_vocab)
-    ner.begin_training([])
-    ner(doc)
-    assert len(list(doc.ents)) == 0
-    assert [w.ent_iob_ for w in doc] == (['O'] * len(doc))
-    doc.ents = [(doc.vocab.strings['ANIMAL'], 3, 4)]
-    assert [w.ent_iob_ for w in doc] == ['', '', '', 'B']
-    doc.ents = [(doc.vocab.strings['WORD'], 0, 2)]
-    assert [w.ent_iob_ for w in doc] == ['B', 'I', '', '']

@@ -1,10 +1,9 @@
 # coding: utf-8
 from __future__ import unicode_literals
-from ...attrs import ORTH, SHAPE, POS, DEP
-from ..util import get_doc
-import pytest
+from spacy.attrs import ORTH, SHAPE, POS, DEP
+from ..util import get_doc
 def test_doc_array_attr_of_token(en_tokenizer, en_vocab):
@@ -41,7 +40,7 @@ def test_doc_array_tag(en_tokenizer):
     text = "A nice sentence."
     pos = ['DET', 'ADJ', 'NOUN', 'PUNCT']
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], pos=pos)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], pos=pos)
     assert doc[0].pos != doc[1].pos != doc[2].pos != doc[3].pos
     feats_array = doc.to_array((ORTH, POS))
     assert feats_array[0][1] == doc[0].pos
@@ -54,7 +53,7 @@ def test_doc_array_dep(en_tokenizer):
     text = "A nice sentence."
     deps = ['det', 'amod', 'ROOT', 'punct']
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], deps=deps)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], deps=deps)
     feats_array = doc.to_array((ORTH, DEP))
     assert feats_array[0][1] == doc[0].dep
     assert feats_array[1][1] == doc[1].dep

@@ -1,10 +1,10 @@
-'''Test Doc sets up tokens correctly.'''
+# coding: utf-8
 from __future__ import unicode_literals
-import pytest
-from ...vocab import Vocab
-from ...tokens.doc import Doc
-from ...lemmatizer import Lemmatizer
+import pytest
+from spacy.vocab import Vocab
+from spacy.tokens import Doc
+from spacy.lemmatizer import Lemmatizer
 @pytest.fixture

@@ -1,18 +1,18 @@
 # coding: utf-8
 from __future__ import unicode_literals
-from ..util import get_doc
-from ...tokens import Doc
-from ...vocab import Vocab
-from ...attrs import LEMMA
 import pytest
 import numpy
+from spacy.tokens import Doc
+from spacy.vocab import Vocab
+from spacy.attrs import LEMMA
+from ..util import get_doc
 @pytest.mark.parametrize('text', [["one", "two", "three"]])
 def test_doc_api_compare_by_string_position(en_vocab, text):
-    doc = get_doc(en_vocab, text)
+    doc = Doc(en_vocab, words=text)
     # Get the tokens in this order, so their ID ordering doesn't match the idx
     token3 = doc[-1]
     token2 = doc[-2]
@@ -104,18 +104,18 @@ def test_doc_api_getitem(en_tokenizer):
     " Give it back! He pleaded. "])
 def test_doc_api_serialize(en_tokenizer, text):
     tokens = en_tokenizer(text)
-    new_tokens = get_doc(tokens.vocab).from_bytes(tokens.to_bytes())
+    new_tokens = Doc(tokens.vocab).from_bytes(tokens.to_bytes())
     assert tokens.text == new_tokens.text
     assert [t.text for t in tokens] == [t.text for t in new_tokens]
     assert [t.orth for t in tokens] == [t.orth for t in new_tokens]
-    new_tokens = get_doc(tokens.vocab).from_bytes(
+    new_tokens = Doc(tokens.vocab).from_bytes(
         tokens.to_bytes(tensor=False), tensor=False)
     assert tokens.text == new_tokens.text
     assert [t.text for t in tokens] == [t.text for t in new_tokens]
     assert [t.orth for t in tokens] == [t.orth for t in new_tokens]
-    new_tokens = get_doc(tokens.vocab).from_bytes(
+    new_tokens = Doc(tokens.vocab).from_bytes(
         tokens.to_bytes(sentiment=False), sentiment=False)
     assert tokens.text == new_tokens.text
     assert [t.text for t in tokens] == [t.text for t in new_tokens]
@@ -199,6 +199,20 @@ def test_doc_api_retokenizer_attrs(en_tokenizer):
     assert doc[4].ent_type_ == 'ORG'
+@pytest.mark.xfail
+def test_doc_api_retokenizer_lex_attrs(en_tokenizer):
+    """Test that lexical attributes can be changed (see #2390)."""
+    doc = en_tokenizer("WKRO played beach boys songs")
+    assert not any(token.is_stop for token in doc)
+    with doc.retokenize() as retokenizer:
+        retokenizer.merge(doc[2:4], attrs={'LEMMA': 'boys', 'IS_STOP': True})
+    assert doc[2].text == 'beach boys'
+    assert doc[2].lemma_ == 'boys'
+    assert doc[2].is_stop
+    new_doc = Doc(doc.vocab, words=['beach boys'])
+    assert new_doc[0].is_stop
 def test_doc_api_sents_empty_string(en_tokenizer):
     doc = en_tokenizer("")
     doc.is_parsed = True
@@ -215,7 +229,7 @@ def test_doc_api_runtime_error(en_tokenizer):
             'ROOT', 'amod', 'dobj']
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], deps=deps)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], deps=deps)
     nps = []
     for np in doc.noun_chunks:
@@ -235,7 +249,7 @@ def test_doc_api_right_edge(en_tokenizer):
             -2, -7, 1, -19, 1, -2, -3, 2, 1, -3, -26]
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
     assert doc[6].text == 'for'
     subtree = [w.text for w in doc[6].subtree]
     assert subtree == ['for', 'the', 'sake', 'of', 'such', 'as',
@@ -264,7 +278,7 @@ def test_doc_api_similarity_match():
 def test_lowest_common_ancestor(en_tokenizer):
     tokens = en_tokenizer('the lazy dog slept')
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=[2, 1, 1, 0])
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=[2, 1, 1, 0])
     lca = doc.get_lca_matrix()
     assert(lca[1, 1] == 1)
     assert(lca[0, 1] == 2)
@@ -277,7 +291,7 @@ def test_parse_tree(en_tokenizer):
     heads = [1, 0, 1, -2, -3, -1, -5]
     tags = ['PRP', 'IN', 'NNP', 'NNP', 'IN', 'NNP', '.']
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, tags=tags)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, tags=tags)
     # full method parse_tree(text) is a trivial composition
     trees = doc.print_tree()
     assert len(trees) > 0

@@ -1,12 +1,13 @@
+# coding: utf-8
 from __future__ import unicode_literals
-from ...language import Language
-from ...compat import pickle, unicode_
+from spacy.language import Language
+from spacy.compat import pickle, unicode_
 def test_pickle_single_doc():
     nlp = Language()
-    doc = nlp(u'pickle roundtrip')
+    doc = nlp('pickle roundtrip')
     data = pickle.dumps(doc, 1)
     doc2 = pickle.loads(data)
     assert doc2.text == 'pickle roundtrip'
@@ -16,7 +17,7 @@ def test_list_of_docs_pickles_efficiently():
     nlp = Language()
     for i in range(10000):
         _ = nlp.vocab[unicode_(i)]
-    one_pickled = pickle.dumps(nlp(u'0'), -1)
+    one_pickled = pickle.dumps(nlp('0'), -1)
     docs = list(nlp.pipe(unicode_(i) for i in range(100)))
     many_pickled = pickle.dumps(docs, -1)
     assert len(many_pickled) < (len(one_pickled) * 2)
@@ -28,7 +29,7 @@ def test_list_of_docs_pickles_efficiently():
 def test_user_data_from_disk():
     nlp = Language()
-    doc = nlp(u'Hello')
+    doc = nlp('Hello')
     doc.user_data[(0, 1)] = False
     b = doc.to_bytes()
     doc2 = doc.__class__(doc.vocab).from_bytes(b)
@@ -36,7 +37,7 @@ def test_user_data_from_disk():
 def test_user_data_unpickles():
     nlp = Language()
-    doc = nlp(u'Hello')
+    doc = nlp('Hello')
     doc.user_data[(0, 1)] = False
     b = pickle.dumps(doc)
     doc2 = pickle.loads(b)
@@ -47,7 +48,7 @@ def test_hooks_unpickle():
     def inner_func(d1, d2):
         return 'hello!'
     nlp = Language()
-    doc = nlp(u'Hello')
+    doc = nlp('Hello')
     doc.user_hooks['similarity'] = inner_func
     b = pickle.dumps(doc)
     doc2 = pickle.loads(b)

@@ -1,12 +1,12 @@
 # coding: utf-8
 from __future__ import unicode_literals
-from ..util import get_doc
-from ...attrs import ORTH, LENGTH
-from ...tokens import Doc
-from ...vocab import Vocab
 import pytest
+from spacy.attrs import ORTH, LENGTH
+from spacy.tokens import Doc
+from spacy.vocab import Vocab
+from ..util import get_doc
 @pytest.fixture
@@ -16,16 +16,16 @@ def doc(en_tokenizer):
     deps = ['nsubj', 'ROOT', 'det', 'attr', 'punct', 'nsubj', 'ROOT', 'det',
             'attr', 'punct', 'ROOT', 'det', 'npadvmod', 'punct']
     tokens = en_tokenizer(text)
-    return get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)
+    return get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
 @pytest.fixture
 def doc_not_parsed(en_tokenizer):
     text = "This is a sentence. This is another sentence. And a third."
     tokens = en_tokenizer(text)
-    d = get_doc(tokens.vocab, [t.text for t in tokens])
-    d.is_parsed = False
-    return d
+    doc = Doc(tokens.vocab, words=[t.text for t in tokens])
+    doc.is_parsed = False
+    return doc
 def test_spans_sent_spans(doc):
@@ -56,7 +56,7 @@ def test_spans_root2(en_tokenizer):
     text = "through North and South Carolina"
     heads = [0, 3, -1, -2, -4]
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
     assert doc[-2:].root.text == 'Carolina'
@@ -76,7 +76,7 @@ def test_spans_span_sent(doc, doc_not_parsed):
 def test_spans_lca_matrix(en_tokenizer):
     """Test span's lca matrix generation"""
     tokens = en_tokenizer('the lazy dog slept')
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=[2, 1, 1, 0])
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=[2, 1, 1, 0])
     lca = doc[:2].get_lca_matrix()
     assert(lca[0, 0] == 0)
     assert(lca[0, 1] == -1)
@@ -100,7 +100,7 @@ def test_spans_default_sentiment(en_tokenizer):
     tokens = en_tokenizer(text)
     tokens.vocab[tokens[0].text].sentiment = 3.0
     tokens.vocab[tokens[2].text].sentiment = -2.0
-    doc = get_doc(tokens.vocab, [t.text for t in tokens])
+    doc = Doc(tokens.vocab, words=[t.text for t in tokens])
     assert doc[:2].sentiment == 3.0 / 2
     assert doc[-2:].sentiment == -2. / 2
     assert doc[:-1].sentiment == (3.+-2) / 3.
@@ -112,7 +112,7 @@ def test_spans_override_sentiment(en_tokenizer):
     tokens = en_tokenizer(text)
     tokens.vocab[tokens[0].text].sentiment = 3.0
     tokens.vocab[tokens[2].text].sentiment = -2.0
-    doc = get_doc(tokens.vocab, [t.text for t in tokens])
+    doc = Doc(tokens.vocab, words=[t.text for t in tokens])
     doc.user_span_hooks['sentiment'] = lambda span: 10.0
     assert doc[:2].sentiment == 10.0
     assert doc[-2:].sentiment == 10.0
@@ -146,7 +146,7 @@ def test_span_to_array(doc):
     assert arr[0, 1] == len(span[0])
-#def test_span_as_doc(doc):
-#    span = doc[4:10]
-#    span_doc = span.as_doc()
-#    assert span.text == span_doc.text.strip()
+def test_span_as_doc(doc):
+    span = doc[4:10]
+    span_doc = span.as_doc()
+    assert span.text == span_doc.text.strip()
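
The last hunk above re-enables the previously commented-out `Span.as_doc()` test. As a hedged, standalone sketch of the behaviour it asserts (built on a bare `Vocab` rather than the suite's `doc` fixture, so it only approximates the real test):

```python
# coding: utf-8
from __future__ import unicode_literals

from spacy.tokens import Doc
from spacy.vocab import Vocab

doc = Doc(Vocab(), words=["This", "is", "a", "sentence", "."])
span = doc[1:4]           # covers "is a sentence"
span_doc = span.as_doc()  # a new Doc containing only the span's tokens
assert span.text == span_doc.text.strip()
```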

@@ -1,18 +1,17 @@
 # coding: utf-8
 from __future__ import unicode_literals
-from ..util import get_doc
-from ...vocab import Vocab
-from ...tokens import Doc
-import pytest
+from spacy.vocab import Vocab
+from spacy.tokens import Doc
+from ..util import get_doc
 def test_spans_merge_tokens(en_tokenizer):
     text = "Los Angeles start."
     heads = [1, 1, 0, -1]
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
     assert len(doc) == 4
     assert doc[0].head.text == 'Angeles'
     assert doc[1].head.text == 'start'
@@ -21,7 +20,7 @@ def test_spans_merge_tokens(en_tokenizer):
     assert doc[0].text == 'Los Angeles'
     assert doc[0].head.text == 'start'
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
     assert len(doc) == 4
     assert doc[0].head.text == 'Angeles'
     assert doc[1].head.text == 'start'
@@ -35,7 +34,7 @@ def test_spans_merge_heads(en_tokenizer):
     text = "I found a pilates class near work."
     heads = [1, 0, 2, 1, -3, -1, -1, -6]
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
     assert len(doc) == 8
     doc.merge(doc[3].idx, doc[4].idx + len(doc[4]), tag=doc[4].tag_,
@@ -53,7 +52,7 @@ def test_span_np_merges(en_tokenizer):
     text = "displaCy is a parse tool built with Javascript"
     heads = [1, 0, 2, 1, -3, -1, -1, -1]
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
     assert doc[4].head.i == 1
     doc.merge(doc[2].idx, doc[4].idx + len(doc[4]), tag='NP', lemma='tool',
@@ -63,7 +62,7 @@ def test_span_np_merges(en_tokenizer):
     text = "displaCy is a lightweight and modern dependency parse tree visualization tool built with CSS3 and JavaScript."
     heads = [1, 0, 8, 3, -1, -2, 4, 3, 1, 1, -9, -1, -1, -1, -1, -2, -15]
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
     ents = [(e[0].idx, e[-1].idx + len(e[-1]), e.label_, e.lemma_) for e in doc.ents]
     for start, end, label, lemma in ents:
@@ -74,8 +73,7 @@ def test_span_np_merges(en_tokenizer):
     text = "One test with entities like New York City so the ents list is not void"
     heads = [1, 11, -1, -1, -1, 1, 1, -3, 4, 2, 1, 1, 0, -1, -2]
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
     for span in doc.ents:
         merged = doc.merge()
         assert merged != None, (span.start, span.end, span.label_, span.lemma_)
@@ -85,10 +83,9 @@ def test_spans_entity_merge(en_tokenizer):
     text = "Stewart Lee is a stand up comedian who lives in England and loves Joe Pasquale.\n"
     heads = [1, 1, 0, 1, 2, -1, -4, 1, -2, -1, -1, -3, -10, 1, -2, -13, -1]
     tags = ['NNP', 'NNP', 'VBZ', 'DT', 'VB', 'RP', 'NN', 'WP', 'VBZ', 'IN', 'NNP', 'CC', 'VBZ', 'NNP', 'NNP', '.', 'SP']
-    ents = [('Stewart Lee', 'PERSON', 0, 2), ('England', 'GPE', 10, 11), ('Joe Pasquale', 'PERSON', 13, 15)]
+    ents = [(0, 2, 'PERSON'), (10, 11, 'GPE'), (13, 15, 'PERSON')]
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, tags=tags, ents=ents)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, tags=tags, ents=ents)
     assert len(doc) == 17
     for ent in doc.ents:
         label, lemma, type_ = (ent.root.tag_, ent.root.lemma_, max(w.ent_type_ for w in ent))
@@ -120,7 +117,7 @@ def test_spans_sentence_update_after_merge(en_tokenizer):
             'compound', 'dobj', 'punct']
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
     sent1, sent2 = list(doc.sents)
     init_len = len(sent1)
     init_len2 = len(sent2)
@@ -138,7 +135,7 @@ def test_spans_subtree_size_check(en_tokenizer):
             'dobj']
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
     sent1 = list(doc.sents)[0]
     init_len = len(list(sent1.root.subtree))
     doc[0:2].merge(label='none', lemma='none', ent_type='none')

@@ -1,14 +1,24 @@
 # coding: utf-8
 from __future__ import unicode_literals
-from ...attrs import IS_ALPHA, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_TITLE, IS_STOP
-from ...symbols import NOUN, VERB
-from ..util import get_doc
-from ...vocab import Vocab
-from ...tokens import Doc
 import pytest
 import numpy
+from spacy.attrs import IS_ALPHA, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_TITLE, IS_STOP
+from spacy.symbols import VERB
+from spacy.vocab import Vocab
+from spacy.tokens import Doc
+from ..util import get_doc
+@pytest.fixture
+def doc(en_tokenizer):
+    text = "This is a sentence. This is another sentence. And a third."
+    heads = [1, 0, 1, -2, -3, 1, 0, 1, -2, -3, 0, 1, -2, -1]
+    deps = ['nsubj', 'ROOT', 'det', 'attr', 'punct', 'nsubj', 'ROOT', 'det',
+            'attr', 'punct', 'ROOT', 'det', 'npadvmod', 'punct']
+    tokens = en_tokenizer(text)
+    return get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
 def test_doc_token_api_strings(en_tokenizer):
@@ -18,7 +28,7 @@ def test_doc_token_api_strings(en_tokenizer):
     deps = ['ROOT', 'dobj', 'prt', 'punct', 'nsubj', 'ROOT', 'punct']
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], pos=pos, heads=heads, deps=deps)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps)
     assert doc[0].orth_ == 'Give'
     assert doc[0].text == 'Give'
     assert doc[0].text_with_ws == 'Give '
@@ -57,18 +67,9 @@ def test_doc_token_api_str_builtin(en_tokenizer, text):
     assert str(tokens[0]) == text.split(' ')[0]
     assert str(tokens[1]) == text.split(' ')[1]
-@pytest.fixture
-def doc(en_tokenizer):
-    text = "This is a sentence. This is another sentence. And a third."
-    heads = [1, 0, 1, -2, -3, 1, 0, 1, -2, -3, 0, 1, -2, -1]
-    deps = ['nsubj', 'ROOT', 'det', 'attr', 'punct', 'nsubj', 'ROOT', 'det',
-            'attr', 'punct', 'ROOT', 'det', 'npadvmod', 'punct']
-    tokens = en_tokenizer(text)
-    return get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)
 def test_doc_token_api_is_properties(en_vocab):
-    text = ["Hi", ",", "my", "email", "is", "test@me.com"]
-    doc = get_doc(en_vocab, text)
+    doc = Doc(en_vocab, words=["Hi", ",", "my", "email", "is", "test@me.com"])
     assert doc[0].is_title
     assert doc[0].is_alpha
     assert not doc[0].is_digit
@@ -86,7 +87,6 @@ def test_doc_token_api_vectors():
     vocab.set_vector('oranges', vector=numpy.asarray([0., 1.], dtype='f'))
     doc = Doc(vocab, words=['apples', 'oranges', 'oov'])
     assert doc.has_vector
     assert doc[0].has_vector
     assert doc[1].has_vector
     assert not doc[2].has_vector
@@ -101,7 +101,7 @@ def test_doc_token_api_ancestors(en_tokenizer):
     text = "Yesterday I saw a dog that barked loudly."
     heads = [2, 1, 0, 1, -2, 1, -2, -1, -6]
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
     assert [t.text for t in doc[6].ancestors] == ["dog", "saw"]
     assert [t.text for t in doc[1].ancestors] == ["saw"]
    assert [t.text for t in doc[2].ancestors] == []
@@ -115,7 +115,7 @@ def test_doc_token_api_head_setter(en_tokenizer):
     text = "Yesterday I saw a dog that barked loudly."
     heads = [2, 1, 0, 1, -2, 1, -2, -1, -6]
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
     assert doc[6].n_lefts == 1
     assert doc[6].n_rights == 1
@@ -165,7 +165,7 @@ def test_doc_token_api_head_setter(en_tokenizer):
 def test_is_sent_start(en_tokenizer):
-    doc = en_tokenizer(u'This is a sentence. This is another.')
+    doc = en_tokenizer('This is a sentence. This is another.')
     assert doc[5].is_sent_start is None
     doc[5].is_sent_start = True
     assert doc[5].is_sent_start is True

@@ -3,10 +3,8 @@ from __future__ import unicode_literals
 import pytest
 from mock import Mock
-from ..vocab import Vocab
-from ..tokens import Doc, Span, Token
-from ..tokens.underscore import Underscore
+from spacy.tokens import Doc, Span, Token
+from spacy.tokens.underscore import Underscore
 def test_create_doc_underscore():

@@ -4,15 +4,14 @@ from __future__ import unicode_literals
 import pytest
-@pytest.mark.parametrize('text',
-    ["ق.م", "إلخ", "ص.ب", "ت."])
+@pytest.mark.parametrize('text', ["ق.م", "إلخ", "ص.ب", "ت."])
 def test_ar_tokenizer_handles_abbr(ar_tokenizer, text):
     tokens = ar_tokenizer(text)
     assert len(tokens) == 1
 def test_ar_tokenizer_handles_exc_in_text(ar_tokenizer):
-    text = u"تعود الكتابة الهيروغليفية إلى سنة 3200 ق.م"
+    text = "تعود الكتابة الهيروغليفية إلى سنة 3200 ق.م"
     tokens = ar_tokenizer(text)
     assert len(tokens) == 7
     assert tokens[6].text == "ق.م"
@@ -20,7 +19,6 @@ def test_ar_tokenizer_handles_exc_in_text(ar_tokenizer):
 def test_ar_tokenizer_handles_exc_in_text(ar_tokenizer):
-    text = u"يبلغ طول مضيق طارق 14كم "
+    text = "يبلغ طول مضيق طارق 14كم "
     tokens = ar_tokenizer(text)
-    print([(tokens[i].text, tokens[i].suffix_) for i in range(len(tokens))])
     assert len(tokens) == 6

@@ -2,7 +2,7 @@
 from __future__ import unicode_literals
-def test_tokenizer_handles_long_text(ar_tokenizer):
+def test_ar_tokenizer_handles_long_text(ar_tokenizer):
     text = """نجيب محفوظ مؤلف و كاتب روائي عربي، يعد من أهم الأدباء العرب خلال القرن العشرين.
 ولد نجيب محفوظ في مدينة القاهرة، حيث ترعرع و تلقى تعليمه الجامعي في جامعتها،
 فتمكن من نيل شهادة في الفلسفة. ألف محفوظ على مدار حياته الكثير من الأعمال الأدبية، و في مقدمتها ثلاثيته الشهيرة.

View File

@ -3,38 +3,32 @@ from __future__ import unicode_literals
import pytest import pytest
TESTCASES = []
PUNCTUATION_TESTS = [ TESTCASES = [
(u'আমি বাংলায় গান গাই!', [u'আমি', u'বাংলায়', u'গান', u'গাই', u'!']), # punctuation tests
(u'আমি বাংলায় কথা কই।', [u'আমি', u'বাংলায়', u'কথা', u'কই', u'']), ('আমি বাংলায় গান গাই!', ['আমি', 'বাংলায়', 'গান', 'গাই', '!']),
(u'বসুন্ধরা জনসম্মুখে দোষ স্বীকার করলো না?', [u'বসুন্ধরা', u'জনসম্মুখে', u'দোষ', u'স্বীকার', u'করলো', u'না', u'?']), ('আমি বাংলায় কথা কই।', ['আমি', 'বাংলায়', 'কথা', 'কই', '']),
(u'টাকা থাকলে কি না হয়!', [u'টাকা', u'থাকলে', u'কি', u'না', u'হয়', u'!']), ('বসুন্ধরা জনসম্মুখে দোষ স্বীকার করলো না?', ['বসুন্ধরা', 'জনসম্মুখে', 'দোষ', 'স্বীকার', 'করলো', 'না', '?']),
('টাকা থাকলে কি না হয়!', ['টাকা', 'থাকলে', 'কি', 'না', 'হয়', '!']),
# abbreviations
('ডঃ খালেদ বললেন ঢাকায় ৩৫ ডিগ্রি সে.।', ['ডঃ', 'খালেদ', 'বললেন', 'ঢাকায়', '৩৫', 'ডিগ্রি', 'সে.', ''])
] ]
ABBREVIATIONS = [
(u'ডঃ খালেদ বললেন ঢাকায় ৩৫ ডিগ্রি সে.।', [u'ডঃ', u'খালেদ', u'বললেন', u'ঢাকায়', u'৩৫', u'ডিগ্রি', u'সে.', u''])
]
TESTCASES.extend(PUNCTUATION_TESTS)
TESTCASES.extend(ABBREVIATIONS)
@pytest.mark.parametrize('text,expected_tokens', TESTCASES) @pytest.mark.parametrize('text,expected_tokens', TESTCASES)
def test_tokenizer_handles_testcases(bn_tokenizer, text, expected_tokens): def test_bn_tokenizer_handles_testcases(bn_tokenizer, text, expected_tokens):
tokens = bn_tokenizer(text) tokens = bn_tokenizer(text)
token_list = [token.text for token in tokens if not token.is_space] token_list = [token.text for token in tokens if not token.is_space]
assert expected_tokens == token_list assert expected_tokens == token_list
def test_tokenizer_handles_long_text(bn_tokenizer): def test_bn_tokenizer_handles_long_text(bn_tokenizer):
text = u"""নর্থ সাউথ বিশ্ববিদ্যালয়ে সারাবছর কোন না কোন বিষয়ে গবেষণা চলতেই থাকে। \ text = """নর্থ সাউথ বিশ্ববিদ্যালয়ে সারাবছর কোন না কোন বিষয়ে গবেষণা চলতেই থাকে। \
অভি ি রগণ ি ি িি গবষণ রকল কর, \ অভি ি রগণ ি ি িি গবষণ রকল কর, \
মধ রয বট ি ি ি আরিিি ইনি \ মধ রয বট ি ি ি আরিিি ইনি \
এসকল রকল কর যম ি যথ পরি ইজড হওয সমভব \ এসকল রকল কর যম ি যথ পরি ইজড হওয সমভব \
আর গবষণ িরক ি অনকখি! \ আর গবষণ িরক ি অনকখি! \
কন হও, গবষক ি লপ - নর উথ ইউনিিি রতি ি রয \ কন হও, গবষক ি লপ - নর উথ ইউনিিি রতি ি রয \
নর উথ অসরণ কমিউনিি দর আমনরণ""" নর উথ অসরণ কমিউনিি দর আমনরণ"""
tokens = bn_tokenizer(text) tokens = bn_tokenizer(text)
assert len(tokens) == 84 assert len(tokens) == 84

@@ -3,28 +3,32 @@ from __future__ import unicode_literals
 import pytest
-@pytest.mark.parametrize('text',
-    ["ca.", "m.a.o.", "Jan.", "Dec.", "kr.", "jf."])
+@pytest.mark.parametrize('text', ["ca.", "m.a.o.", "Jan.", "Dec.", "kr.", "jf."])
 def test_da_tokenizer_handles_abbr(da_tokenizer, text):
     tokens = da_tokenizer(text)
     assert len(tokens) == 1
 @pytest.mark.parametrize('text', ["Jul.", "jul.", "Tor.", "Tors."])
 def test_da_tokenizer_handles_ambiguous_abbr(da_tokenizer, text):
     tokens = da_tokenizer(text)
     assert len(tokens) == 2
 @pytest.mark.parametrize('text', ["1.", "10.", "31."])
 def test_da_tokenizer_handles_dates(da_tokenizer, text):
     tokens = da_tokenizer(text)
     assert len(tokens) == 1
 def test_da_tokenizer_handles_exc_in_text(da_tokenizer):
     text = "Det er bl.a. ikke meningen"
     tokens = da_tokenizer(text)
     assert len(tokens) == 5
     assert tokens[2].text == "bl.a."
 def test_da_tokenizer_handles_custom_base_exc(da_tokenizer):
     text = "Her er noget du kan kigge i."
     tokens = da_tokenizer(text)
@@ -32,8 +36,9 @@ def test_da_tokenizer_handles_custom_base_exc(da_tokenizer):
     assert tokens[6].text == "i"
     assert tokens[7].text == "."
-@pytest.mark.parametrize('text,norm',
-    [("akvarium", "akvarie"), ("bedstemoder", "bedstemor")])
+@pytest.mark.parametrize('text,norm', [
+    ("akvarium", "akvarie"), ("bedstemoder", "bedstemor")])
 def test_da_tokenizer_norm_exceptions(da_tokenizer, text, norm):
     tokens = da_tokenizer(text)
     assert tokens[0].norm_ == norm

View File

@ -4,10 +4,11 @@ from __future__ import unicode_literals
import pytest


@pytest.mark.parametrize('string,lemma', [
    ('affaldsgruppernes', 'affaldsgruppe'),
    ('detailhandelsstrukturernes', 'detailhandelsstruktur'),
    ('kolesterols', 'kolesterol'),
    ('åsyns', 'åsyn')])
def test_da_lemmatizer_lookup_assigns(da_tokenizer, string, lemma):
    tokens = da_tokenizer(string)
    assert tokens[0].lemma_ == lemma

View File

@ -1,24 +1,23 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest


@pytest.mark.parametrize('text', ["(under)"])
def test_da_tokenizer_splits_no_special(da_tokenizer, text):
    tokens = da_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ["ta'r", "Søren's", "Lars'"])
def test_da_tokenizer_handles_no_punct(da_tokenizer, text):
    tokens = da_tokenizer(text)
    assert len(tokens) == 1


@pytest.mark.parametrize('text', ["(ta'r"])
def test_da_tokenizer_splits_prefix_punct(da_tokenizer, text):
    tokens = da_tokenizer(text)
    assert len(tokens) == 2
    assert tokens[0].text == "("

@ -26,22 +25,23 @@ def test_tokenizer_splits_prefix_punct(da_tokenizer, text):

@pytest.mark.parametrize('text', ["ta'r)"])
def test_da_tokenizer_splits_suffix_punct(da_tokenizer, text):
    tokens = da_tokenizer(text)
    assert len(tokens) == 2
    assert tokens[0].text == "ta'r"
    assert tokens[1].text == ")"


@pytest.mark.parametrize('text,expected', [
    ("(ta'r)", ["(", "ta'r", ")"]), ("'ta'r'", ["'", "ta'r", "'"])])
def test_da_tokenizer_splits_even_wrap(da_tokenizer, text, expected):
    tokens = da_tokenizer(text)
    assert len(tokens) == len(expected)
    assert [t.text for t in tokens] == expected


@pytest.mark.parametrize('text', ["(ta'r?)"])
def test_da_tokenizer_splits_uneven_wrap(da_tokenizer, text):
    tokens = da_tokenizer(text)
    assert len(tokens) == 4
    assert tokens[0].text == "("

@ -50,15 +50,16 @@ def test_tokenizer_splits_uneven_wrap(da_tokenizer, text):

    assert tokens[3].text == ")"


@pytest.mark.parametrize('text,expected', [
    ("f.eks.", ["f.eks."]), ("fe.", ["fe", "."]), ("(f.eks.", ["(", "f.eks."])])
def test_da_tokenizer_splits_prefix_interact(da_tokenizer, text, expected):
    tokens = da_tokenizer(text)
    assert len(tokens) == len(expected)
    assert [t.text for t in tokens] == expected


@pytest.mark.parametrize('text', ["f.eks.)"])
def test_da_tokenizer_splits_suffix_interact(da_tokenizer, text):
    tokens = da_tokenizer(text)
    assert len(tokens) == 2
    assert tokens[0].text == "f.eks."

@ -66,7 +67,7 @@ def test_tokenizer_splits_suffix_interact(da_tokenizer, text):

@pytest.mark.parametrize('text', ["(f.eks.)"])
def test_da_tokenizer_splits_even_wrap_interact(da_tokenizer, text):
    tokens = da_tokenizer(text)
    assert len(tokens) == 3
    assert tokens[0].text == "("

@ -75,7 +76,7 @@ def test_tokenizer_splits_even_wrap_interact(da_tokenizer, text):

@pytest.mark.parametrize('text', ["(f.eks.?)"])
def test_da_tokenizer_splits_uneven_wrap_interact(da_tokenizer, text):
    tokens = da_tokenizer(text)
    assert len(tokens) == 4
    assert tokens[0].text == "("

@ -85,19 +86,19 @@ def test_tokenizer_splits_uneven_wrap_interact(da_tokenizer, text):

@pytest.mark.parametrize('text', ["0,1-13,5", "0,0-0,1", "103,27-300", "1/2-3/4"])
def test_da_tokenizer_handles_numeric_range(da_tokenizer, text):
    tokens = da_tokenizer(text)
    assert len(tokens) == 1


@pytest.mark.parametrize('text', ["sort.Gul", "Hej.Verden"])
def test_da_tokenizer_splits_period_infix(da_tokenizer, text):
    tokens = da_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ["Hej,Verden", "en,to"])
def test_da_tokenizer_splits_comma_infix(da_tokenizer, text):
    tokens = da_tokenizer(text)
    assert len(tokens) == 3
    assert tokens[0].text == text.split(",")[0]

@ -106,18 +107,18 @@ def test_tokenizer_splits_comma_infix(da_tokenizer, text):

@pytest.mark.parametrize('text', ["sort...Gul", "sort...gul"])
def test_da_tokenizer_splits_ellipsis_infix(da_tokenizer, text):
    tokens = da_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ['gå-på-mod', '4-hjulstræk', '100-Pfennig-frimærke', 'TV-2-spots', 'trofæ-vaeggen'])
def test_da_tokenizer_keeps_hyphens(da_tokenizer, text):
    tokens = da_tokenizer(text)
    assert len(tokens) == 1


def test_da_tokenizer_splits_double_hyphen_infix(da_tokenizer):
    tokens = da_tokenizer("Mange regler--eksempelvis bindestregs-reglerne--er komplicerede.")
    assert len(tokens) == 9
    assert tokens[0].text == "Mange"

@ -130,7 +131,7 @@ def test_tokenizer_splits_double_hyphen_infix(da_tokenizer):

    assert tokens[7].text == "komplicerede"


def test_da_tokenizer_handles_posessives_and_contractions(da_tokenizer):
    tokens = da_tokenizer("'DBA's, Lars' og Liz' bil sku' sgu' ik' ha' en bule, det ka' han ik' li' mere', sagde hun.")
    assert len(tokens) == 25
    assert tokens[0].text == "'"

View File

@ -1,10 +1,9 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest

from spacy.lang.da.lex_attrs import like_num


def test_da_tokenizer_handles_long_text(da_tokenizer):
    text = """Der var så dejligt ude på landet. Det var sommer, kornet stod gult, havren grøn,

@ -15,6 +14,7 @@ Rundt om ager og eng var der store skove, og midt i skovene dybe søer; jo, der

    tokens = da_tokenizer(text)
    assert len(tokens) == 84


@pytest.mark.parametrize('text,match', [
    ('10', True), ('1', True), ('10.000', True), ('10.00', True),
    ('999,0', True), ('en', True), ('treoghalvfemsindstyvende', True), ('hundrede', True),

@ -22,6 +22,10 @@ Rundt om ager og eng var der store skove, og midt i skovene dybe søer; jo, der

def test_lex_attrs_like_number(da_tokenizer, text, match):
    tokens = da_tokenizer(text)
    assert len(tokens) == 1
    print(tokens[0])
    assert tokens[0].like_num == match


@pytest.mark.parametrize('word', ['elleve', 'første'])
def test_da_lex_attrs_capitals(word):
    assert like_num(word)
    assert like_num(word.upper())

View File

@ -1,7 +1,4 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest

View File

@ -4,12 +4,13 @@ from __future__ import unicode_literals
import pytest


@pytest.mark.parametrize('string,lemma', [
    ('Abgehängten', 'Abgehängte'),
    ('engagierte', 'engagieren'),
    ('schließt', 'schließen'),
    ('vorgebenden', 'vorgebend'),
    ('die', 'der'),
    ('Die', 'der')])
def test_de_lemmatizer_lookup_assigns(de_tokenizer, string, lemma):
    tokens = de_tokenizer(string)
    assert tokens[0].lemma_ == lemma

View File

@ -1,77 +0,0 @@
# coding: utf-8
from __future__ import unicode_literals
import numpy
import pytest
@pytest.fixture
def example(DE):
"""
This is to make sure the model works as expected. The tests make sure that
values are properly set. Tests are not meant to evaluate the content of the
output, only make sure the output is formally okay.
"""
assert DE.entity != None
return DE('An der großen Straße stand eine merkwürdige Gestalt und führte Selbstgespräche.')
@pytest.mark.models('de')
def test_de_models_tokenization(example):
# tokenization should split the document into tokens
assert len(example) > 1
@pytest.mark.xfail
@pytest.mark.models('de')
def test_de_models_tagging(example):
# if tagging was done properly, pos tags shouldn't be empty
assert example.is_tagged
assert all(t.pos != 0 for t in example)
assert all(t.tag != 0 for t in example)
@pytest.mark.models('de')
def test_de_models_parsing(example):
# if parsing was done properly
# - dependency labels shouldn't be empty
# - the head of some tokens should not be root
assert example.is_parsed
assert all(t.dep != 0 for t in example)
assert any(t.dep != i for i,t in enumerate(example))
@pytest.mark.models('de')
def test_de_models_ner(example):
# if ner was done properly, ent_iob shouldn't be empty
assert all([t.ent_iob != 0 for t in example])
@pytest.mark.models('de')
def test_de_models_vectors(example):
# if vectors are available, they should differ on different words
# this isn't a perfect test since this could in principle fail
# in a sane model as well,
# but that's very unlikely and a good indicator if something is wrong
vector0 = example[0].vector
vector1 = example[1].vector
vector2 = example[2].vector
assert not numpy.array_equal(vector0,vector1)
assert not numpy.array_equal(vector0,vector2)
assert not numpy.array_equal(vector1,vector2)
@pytest.mark.xfail
@pytest.mark.models('de')
def test_de_models_probs(example):
# if frequencies/probabilities are okay, they should differ for
# different words
# this isn't a perfect test since this could in principle fail
# in a sane model as well,
# but that's very unlikely and a good indicator if something is wrong
prob0 = example[0].prob
prob1 = example[1].prob
prob2 = example[2].prob
assert not prob0 == prob1
assert not prob0 == prob2
assert not prob1 == prob2

View File

@ -3,17 +3,14 @@ from __future__ import unicode_literals
from ...util import get_doc


def test_de_parser_noun_chunks_standard_de(de_tokenizer):
    text = "Eine Tasse steht auf dem Tisch."
    heads = [1, 1, 0, -1, 1, -2, -4]
    tags = ['ART', 'NN', 'VVFIN', 'APPR', 'ART', 'NN', '$.']
    deps = ['nk', 'sb', 'ROOT', 'mo', 'nk', 'nk', 'punct']
    tokens = de_tokenizer(text)
    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
    chunks = list(doc.noun_chunks)
    assert len(chunks) == 2
    assert chunks[0].text_with_ws == "Eine Tasse "

@ -25,9 +22,8 @@ def test_de_extended_chunk(de_tokenizer):

    heads = [1, 1, 0, -1, 1, -2, -1, -5, -6]
    tags = ['ART', 'NN', 'VVFIN', 'APPR', 'ART', 'NN', 'NN', 'NN', '$.']
    deps = ['nk', 'sb', 'ROOT', 'mo', 'nk', 'nk', 'nk', 'oa', 'punct']
    tokens = de_tokenizer(text)
    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
    chunks = list(doc.noun_chunks)
    assert len(chunks) == 3
    assert chunks[0].text_with_ws == "Die Sängerin "
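
These noun-chunk tests build annotated Doc objects through the get_doc helper in tests/util.py, which is not part of this hunk. A simplified sketch of what such a helper does (a hypothetical stand-in, not the actual implementation) looks roughly like this:

from spacy.tokens import Doc

def make_annotated_doc(vocab, words, heads=None, tags=None, deps=None):
    # Hypothetical, simplified stand-in for tests/util.get_doc: build a Doc
    # from plain words and attach tags, dependency labels and heads per token.
    doc = Doc(vocab, words=words)
    for i, token in enumerate(doc):
        if tags:
            token.tag_ = tags[i]
        if deps:
            token.dep_ = deps[i]
        if heads:
            # heads are given as offsets relative to the token itself
            token.head = doc[i + heads[i]]
    return doc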

View File

@ -1,86 +1,83 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest


@pytest.mark.parametrize('text', ["(unter)"])
def test_de_tokenizer_splits_no_special(de_tokenizer, text):
    tokens = de_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ["unter'm"])
def test_de_tokenizer_splits_no_punct(de_tokenizer, text):
    tokens = de_tokenizer(text)
    assert len(tokens) == 2


@pytest.mark.parametrize('text', ["(unter'm"])
def test_de_tokenizer_splits_prefix_punct(de_tokenizer, text):
    tokens = de_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ["unter'm)"])
def test_de_tokenizer_splits_suffix_punct(de_tokenizer, text):
    tokens = de_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ["(unter'm)"])
def test_de_tokenizer_splits_even_wrap(de_tokenizer, text):
    tokens = de_tokenizer(text)
    assert len(tokens) == 4


@pytest.mark.parametrize('text', ["(unter'm?)"])
def test_de_tokenizer_splits_uneven_wrap(de_tokenizer, text):
    tokens = de_tokenizer(text)
    assert len(tokens) == 5


@pytest.mark.parametrize('text,length', [("z.B.", 1), ("zb.", 2), ("(z.B.", 2)])
def test_de_tokenizer_splits_prefix_interact(de_tokenizer, text, length):
    tokens = de_tokenizer(text)
    assert len(tokens) == length


@pytest.mark.parametrize('text', ["z.B.)"])
def test_de_tokenizer_splits_suffix_interact(de_tokenizer, text):
    tokens = de_tokenizer(text)
    assert len(tokens) == 2


@pytest.mark.parametrize('text', ["(z.B.)"])
def test_de_tokenizer_splits_even_wrap_interact(de_tokenizer, text):
    tokens = de_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ["(z.B.?)"])
def test_de_tokenizer_splits_uneven_wrap_interact(de_tokenizer, text):
    tokens = de_tokenizer(text)
    assert len(tokens) == 4


@pytest.mark.parametrize('text', ["0.1-13.5", "0.0-0.1", "103.27-300"])
def test_de_tokenizer_splits_numeric_range(de_tokenizer, text):
    tokens = de_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ["blau.Rot", "Hallo.Welt"])
def test_de_tokenizer_splits_period_infix(de_tokenizer, text):
    tokens = de_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ["Hallo,Welt", "eins,zwei"])
def test_de_tokenizer_splits_comma_infix(de_tokenizer, text):
    tokens = de_tokenizer(text)
    assert len(tokens) == 3
    assert tokens[0].text == text.split(",")[0]

@ -89,18 +86,18 @@ def test_tokenizer_splits_comma_infix(de_tokenizer, text):

@pytest.mark.parametrize('text', ["blau...Rot", "blau...rot"])
def test_de_tokenizer_splits_ellipsis_infix(de_tokenizer, text):
    tokens = de_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ['Islam-Konferenz', 'Ost-West-Konflikt'])
def test_de_tokenizer_keeps_hyphens(de_tokenizer, text):
    tokens = de_tokenizer(text)
    assert len(tokens) == 1


def test_de_tokenizer_splits_double_hyphen_infix(de_tokenizer):
    tokens = de_tokenizer("Viele Regeln--wie die Bindestrich-Regeln--sind kompliziert.")
    assert len(tokens) == 10
    assert tokens[0].text == "Viele"

View File

@ -1,13 +1,10 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest


def test_de_tokenizer_handles_long_text(de_tokenizer):
    text = """Die Verwandlung
Als Gregor Samsa eines Morgens aus unruhigen Träumen erwachte, fand er sich in

@ -29,17 +26,15 @@ Umfang kläglich dünnen Beine flimmerten ihm hilflos vor den Augen.

    "Donaudampfschifffahrtsgesellschaftskapitänsanwärterposten",
    "Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz",
    "Kraftfahrzeug-Haftpflichtversicherung",
    "Vakuum-Mittelfrequenz-Induktionsofen"])
def test_de_tokenizer_handles_long_words(de_tokenizer, text):
    tokens = de_tokenizer(text)
    assert len(tokens) == 1


@pytest.mark.parametrize('text,length', [
    ("»Was ist mit mir geschehen?«, dachte er.", 12),
    ("“Dies frühzeitige Aufstehen”, dachte er, “macht einen ganz blödsinnig. ", 15)])
def test_de_tokenizer_handles_examples(de_tokenizer, text, length):
    tokens = de_tokenizer(text)
    assert len(tokens) == length

View File

@ -1,17 +1,16 @@
# coding: utf8
from __future__ import unicode_literals

import pytest


@pytest.mark.parametrize('text', ["αριθ.", "τρισ.", "δισ.", "σελ."])
def test_el_tokenizer_handles_abbr(el_tokenizer, text):
    tokens = el_tokenizer(text)
    assert len(tokens) == 1


def test_el_tokenizer_handles_exc_in_text(el_tokenizer):
    text = "Στα 14 τρισ. δολάρια το κόστος από την άνοδο της στάθμης της θάλασσας."
    tokens = el_tokenizer(text)
    assert len(tokens) == 14

View File

@ -1,11 +1,10 @@
# coding: utf8
from __future__ import unicode_literals

import pytest


def test_el_tokenizer_handles_long_text(el_tokenizer):
    text = """Η Ελλάδα (παλαιότερα Ελλάς), επίσημα γνωστή ως Ελληνική Δημοκρατία,\
είναι χώρα της νοτιοανατολικής Ευρώπης στο νοτιότερο άκρο της Βαλκανικής χερσονήσου.\
Συνορεύει στα βορειοδυτικά με την Αλβανία, στα βόρεια με την πρώην\

@ -20,6 +19,6 @@ def test_tokenizer_handles_long_text(el_tokenizer):

    ("Η Ελλάδα είναι μία από τις χώρες της Ευρωπαϊκής Ένωσης (ΕΕ) που διαθέτει σηµαντικό ορυκτό πλούτο.", 19),
    ("Η ναυτιλία αποτέλεσε ένα σημαντικό στοιχείο της Ελληνικής οικονομικής δραστηριότητας από τα αρχαία χρόνια.", 15),
    ("Η Ελλάδα είναι μέλος σε αρκετούς διεθνείς οργανισμούς.", 9)])
def test_el_tokenizer_handles_cnts(el_tokenizer, text, length):
    tokens = el_tokenizer(text)
    assert len(tokens) == length

View File

@ -2,23 +2,22 @@
from __future__ import unicode_literals

import pytest

from spacy.lang.en import English
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex
from spacy.util import compile_infix_regex


@pytest.fixture
def custom_en_tokenizer(en_vocab):
    prefix_re = compile_prefix_regex(English.Defaults.prefixes)
    suffix_re = compile_suffix_regex(English.Defaults.suffixes)
    custom_infixes = ['\.\.\.+',
                      '(?<=[0-9])-(?=[0-9])',
                      # '(?<=[0-9]+),(?=[0-9]+)',
                      '[0-9]+(,[0-9]+)+',
                      '[\[\]!&:,()\*—–\/-]']
    infix_re = compile_infix_regex(custom_infixes)
    return Tokenizer(en_vocab,
                     English.Defaults.tokenizer_exceptions,
                     prefix_re.search,

@ -27,13 +26,12 @@ def custom_en_tokenizer(en_vocab):

                     token_match=None)


def test_en_customized_tokenizer_handles_infixes(custom_en_tokenizer):
    sentence = "The 8 and 10-county definitions are not used for the greater Southern California Megaregion."
    context = [word.text for word in custom_en_tokenizer(sentence)]
    assert context == ['The', '8', 'and', '10', '-', 'county', 'definitions',
                       'are', 'not', 'used', 'for', 'the', 'greater',
                       'Southern', 'California', 'Megaregion', '.']
    # the trailing '-' may cause Assertion Error
    sentence = "The 8- and 10-county definitions are not used for the greater Southern California Megaregion."
    context = [word.text for word in custom_en_tokenizer(sentence)]
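
The fixture above returns a standalone Tokenizer. As a usage note (assumed, not part of this diff), the same kind of object can also replace the tokenizer of a full pipeline, which is how custom prefix/suffix/infix rules are normally deployed outside the test suite:

from spacy.lang.en import English
from spacy.tokenizer import Tokenizer
from spacy.util import compile_infix_regex

nlp = English()
# Assumed usage: rebuild the default tokenizer with one extra infix rule and
# attach it to the pipeline, so nlp() applies the customised rules.
infix_re = compile_infix_regex(list(English.Defaults.infixes) + ['(?<=[0-9])-(?=[0-9])'])
nlp.tokenizer = Tokenizer(nlp.vocab,
                          English.Defaults.tokenizer_exceptions,
                          prefix_search=nlp.tokenizer.prefix_search,
                          suffix_search=nlp.tokenizer.suffix_search,
                          infix_finditer=infix_re.finditer,
                          token_match=None)
print([t.text for t in nlp("The 8 and 10-county definitions.")])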

View File

@ -38,7 +38,7 @@ def test_en_tokenizer_splits_trailing_apos(en_tokenizer, text):
@pytest.mark.parametrize('text', ["'em", "nothin'", "ol'"])
def test_en_tokenizer_doesnt_split_apos_exc(en_tokenizer, text):
    tokens = en_tokenizer(text)
    assert len(tokens) == 1
    assert tokens[0].text == text

View File

@ -1,11 +1,6 @@
# coding: utf-8
from __future__ import unicode_literals


def test_en_simple_punct(en_tokenizer):
    text = "to walk, do foo"

View File

@ -1,63 +0,0 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from ....tokens.doc import Doc
@pytest.fixture
def en_lemmatizer(EN):
return EN.Defaults.create_lemmatizer()
@pytest.mark.models('en')
def test_doc_lemmatization(EN):
doc = Doc(EN.vocab, words=['bleed'])
doc[0].tag_ = 'VBP'
assert doc[0].lemma_ == 'bleed'
@pytest.mark.models('en')
@pytest.mark.parametrize('text,lemmas', [("aardwolves", ["aardwolf"]),
("aardwolf", ["aardwolf"]),
("planets", ["planet"]),
("ring", ["ring"]),
("axes", ["axis", "axe", "ax"])])
def test_en_lemmatizer_noun_lemmas(en_lemmatizer, text, lemmas):
assert en_lemmatizer.noun(text) == lemmas
@pytest.mark.models('en')
@pytest.mark.parametrize('text,lemmas', [("bleed", ["bleed"]),
("feed", ["feed"]),
("need", ["need"]),
("ring", ["ring"])])
def test_en_lemmatizer_noun_lemmas(en_lemmatizer, text, lemmas):
# Cases like this are problematic -- not clear what we should do to resolve
# ambiguity?
# ("axes", ["ax", "axes", "axis"])])
assert en_lemmatizer.noun(text) == lemmas
@pytest.mark.xfail
@pytest.mark.models('en')
def test_en_lemmatizer_base_forms(en_lemmatizer):
assert en_lemmatizer.noun('dive', {'number': 'sing'}) == ['dive']
assert en_lemmatizer.noun('dive', {'number': 'plur'}) == ['diva']
@pytest.mark.models('en')
def test_en_lemmatizer_base_form_verb(en_lemmatizer):
assert en_lemmatizer.verb('saw', {'verbform': 'past'}) == ['see']
@pytest.mark.models('en')
def test_en_lemmatizer_punct(en_lemmatizer):
assert en_lemmatizer.punct('') == ['"']
assert en_lemmatizer.punct('') == ['"']
@pytest.mark.models('en')
def test_en_lemmatizer_lemma_assignment(EN):
text = "Bananas in pyjamas are geese."
doc = EN.make_doc(text)
EN.tagger(doc)
assert all(t.lemma_ != '' for t in doc)

View File

@ -1,85 +0,0 @@
# coding: utf-8
from __future__ import unicode_literals
import numpy
import pytest
@pytest.fixture
def example(EN):
"""
This is to make sure the model works as expected. The tests make sure that
values are properly set. Tests are not meant to evaluate the content of the
output, only make sure the output is formally okay.
"""
assert EN.entity != None
return EN('There was a stranger standing at the big street talking to herself.')
@pytest.mark.models('en')
def test_en_models_tokenization(example):
# tokenization should split the document into tokens
assert len(example) > 1
@pytest.mark.models('en')
def test_en_models_tagging(example):
# if tagging was done properly, pos tags shouldn't be empty
assert example.is_tagged
assert all(t.pos != 0 for t in example)
assert all(t.tag != 0 for t in example)
@pytest.mark.models('en')
def test_en_models_parsing(example):
# if parsing was done properly
# - dependency labels shouldn't be empty
# - the head of some tokens should not be root
assert example.is_parsed
assert all(t.dep != 0 for t in example)
assert any(t.dep != i for i,t in enumerate(example))
@pytest.mark.models('en')
def test_en_models_ner(example):
# if ner was done properly, ent_iob shouldn't be empty
assert all([t.ent_iob != 0 for t in example])
@pytest.mark.models('en')
def test_en_models_vectors(example):
# if vectors are available, they should differ on different words
# this isn't a perfect test since this could in principle fail
# in a sane model as well,
# but that's very unlikely and a good indicator if something is wrong
if example.vocab.vectors_length:
vector0 = example[0].vector
vector1 = example[1].vector
vector2 = example[2].vector
assert not numpy.array_equal(vector0,vector1)
assert not numpy.array_equal(vector0,vector2)
assert not numpy.array_equal(vector1,vector2)
@pytest.mark.xfail
@pytest.mark.models('en')
def test_en_models_probs(example):
# if frequencies/probabilities are okay, they should differ for
# different words
# this isn't a perfect test since this could in principle fail
# in a sane model as well,
# but that's very unlikely and a good indicator if something is wrong
prob0 = example[0].prob
prob1 = example[1].prob
prob2 = example[2].prob
assert not prob0 == prob1
assert not prob0 == prob2
assert not prob1 == prob2
@pytest.mark.models('en')
def test_no_vectors_similarity(EN):
doc1 = EN(u'hallo')
doc2 = EN(u'hi')
assert doc1.similarity(doc2) > 0

View File

@ -1,42 +0,0 @@
from __future__ import unicode_literals, print_function
import pytest
from spacy.attrs import LOWER
from spacy.matcher import Matcher
@pytest.mark.models('en')
def test_en_ner_simple_types(EN):
tokens = EN(u'Mr. Best flew to New York on Saturday morning.')
ents = list(tokens.ents)
assert ents[0].start == 1
assert ents[0].end == 2
assert ents[0].label_ == 'PERSON'
assert ents[1].start == 4
assert ents[1].end == 6
assert ents[1].label_ == 'GPE'
@pytest.mark.skip
@pytest.mark.models('en')
def test_en_ner_consistency_bug(EN):
'''Test an arbitrary sequence-consistency bug encountered during speed test'''
tokens = EN(u'Where rap essentially went mainstream, illustrated by seminal Public Enemy, Beastie Boys and L.L. Cool J. tracks.')
tokens = EN(u'''Charity and other short-term aid have buoyed them so far, and a tax-relief bill working its way through Congress would help. But the September 11 Victim Compensation Fund, enacted by Congress to discourage people from filing lawsuits, will determine the shape of their lives for years to come.\n\n''', disable=['ner'])
tokens.ents += tuple(EN.matcher(tokens))
EN.entity(tokens)
@pytest.mark.skip
@pytest.mark.models('en')
def test_en_ner_unit_end_gazetteer(EN):
'''Test a bug in the interaction between the NER model and the gazetteer'''
matcher = Matcher(EN.vocab)
matcher.add('MemberNames', None, [{LOWER: 'cal'}], [{LOWER: 'cal'}, {LOWER: 'henderson'}])
doc = EN(u'who is cal the manager of?')
if len(list(doc.ents)) == 0:
ents = matcher(doc)
assert len(ents) == 1
doc.ents += tuple(ents)
EN.entity(doc)
assert list(doc.ents)[0].text == 'cal'

View File

@ -1,22 +1,20 @@
# coding: utf-8
from __future__ import unicode_literals

import numpy

from spacy.attrs import HEAD, DEP
from spacy.symbols import nsubj, dobj, amod, nmod, conj, cc, root
from spacy.lang.en.syntax_iterators import SYNTAX_ITERATORS

from ...util import get_doc


def test_en_noun_chunks_not_nested(en_tokenizer):
    text = "Peter has chronic command and control issues"
    heads = [1, 0, 4, 3, -1, -2, -5]
    deps = ['nsubj', 'ROOT', 'amod', 'nmod', 'cc', 'conj', 'dobj']
    tokens = en_tokenizer(text)
    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
    tokens.from_array(
        [HEAD, DEP],
        numpy.asarray([[1, nsubj], [0, root], [4, amod], [3, nmod], [-1, cc],

View File

@ -3,58 +3,52 @@ from __future__ import unicode_literals
from ...util import get_doc


def test_en_parser_noun_chunks_standard(en_tokenizer):
    text = "A base phrase should be recognized."
    heads = [2, 1, 3, 2, 1, 0, -1]
    tags = ['DT', 'JJ', 'NN', 'MD', 'VB', 'VBN', '.']
    deps = ['det', 'amod', 'nsubjpass', 'aux', 'auxpass', 'ROOT', 'punct']
    tokens = en_tokenizer(text)
    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
    chunks = list(doc.noun_chunks)
    assert len(chunks) == 1
    assert chunks[0].text_with_ws == "A base phrase "


def test_en_parser_noun_chunks_coordinated(en_tokenizer):
    text = "A base phrase and a good phrase are often the same."
    heads = [2, 1, 5, -1, 2, 1, -4, 0, -1, 1, -3, -4]
    tags = ['DT', 'NN', 'NN', 'CC', 'DT', 'JJ', 'NN', 'VBP', 'RB', 'DT', 'JJ', '.']
    deps = ['det', 'compound', 'nsubj', 'cc', 'det', 'amod', 'conj', 'ROOT', 'advmod', 'det', 'attr', 'punct']
    tokens = en_tokenizer(text)
    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
    chunks = list(doc.noun_chunks)
    assert len(chunks) == 2
    assert chunks[0].text_with_ws == "A base phrase "
    assert chunks[1].text_with_ws == "a good phrase "


def test_en_parser_noun_chunks_pp_chunks(en_tokenizer):
    text = "A phrase with another phrase occurs."
    heads = [1, 4, -1, 1, -2, 0, -1]
    tags = ['DT', 'NN', 'IN', 'DT', 'NN', 'VBZ', '.']
    deps = ['det', 'nsubj', 'prep', 'det', 'pobj', 'ROOT', 'punct']
    tokens = en_tokenizer(text)
    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
    chunks = list(doc.noun_chunks)
    assert len(chunks) == 2
    assert chunks[0].text_with_ws == "A phrase "
    assert chunks[1].text_with_ws == "another phrase "


def test_en_parser_noun_chunks_appositional_modifiers(en_tokenizer):
    text = "Sam, my brother, arrived to the house."
    heads = [5, -1, 1, -3, -4, 0, -1, 1, -2, -4]
    tags = ['NNP', ',', 'PRP$', 'NN', ',', 'VBD', 'IN', 'DT', 'NN', '.']
    deps = ['nsubj', 'punct', 'poss', 'appos', 'punct', 'ROOT', 'prep', 'det', 'pobj', 'punct']
    tokens = en_tokenizer(text)
    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
    chunks = list(doc.noun_chunks)
    assert len(chunks) == 3
    assert chunks[0].text_with_ws == "Sam "

@ -62,14 +56,13 @@ def test_parser_noun_chunks_appositional_modifiers(en_tokenizer):

    assert chunks[2].text_with_ws == "the house "


def test_en_parser_noun_chunks_dative(en_tokenizer):
    text = "She gave Bob a raise."
    heads = [1, 0, -1, 1, -3, -4]
    tags = ['PRP', 'VBD', 'NNP', 'DT', 'NN', '.']
    deps = ['nsubj', 'ROOT', 'dative', 'det', 'dobj', 'punct']
    tokens = en_tokenizer(text)
    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
    chunks = list(doc.noun_chunks)
    assert len(chunks) == 3
    assert chunks[0].text_with_ws == "She "

View File

@ -1,92 +1,89 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest


@pytest.mark.parametrize('text', ["(can)"])
def test_en_tokenizer_splits_no_special(en_tokenizer, text):
    tokens = en_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ["can't"])
def test_en_tokenizer_splits_no_punct(en_tokenizer, text):
    tokens = en_tokenizer(text)
    assert len(tokens) == 2


@pytest.mark.parametrize('text', ["(can't"])
def test_en_tokenizer_splits_prefix_punct(en_tokenizer, text):
    tokens = en_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ["can't)"])
def test_en_tokenizer_splits_suffix_punct(en_tokenizer, text):
    tokens = en_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ["(can't)"])
def test_en_tokenizer_splits_even_wrap(en_tokenizer, text):
    tokens = en_tokenizer(text)
    assert len(tokens) == 4


@pytest.mark.parametrize('text', ["(can't?)"])
def test_en_tokenizer_splits_uneven_wrap(en_tokenizer, text):
    tokens = en_tokenizer(text)
    assert len(tokens) == 5


@pytest.mark.parametrize('text,length', [("U.S.", 1), ("us.", 2), ("(U.S.", 2)])
def test_en_tokenizer_splits_prefix_interact(en_tokenizer, text, length):
    tokens = en_tokenizer(text)
    assert len(tokens) == length


@pytest.mark.parametrize('text', ["U.S.)"])
def test_en_tokenizer_splits_suffix_interact(en_tokenizer, text):
    tokens = en_tokenizer(text)
    assert len(tokens) == 2


@pytest.mark.parametrize('text', ["(U.S.)"])
def test_en_tokenizer_splits_even_wrap_interact(en_tokenizer, text):
    tokens = en_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ["(U.S.?)"])
def test_en_tokenizer_splits_uneven_wrap_interact(en_tokenizer, text):
    tokens = en_tokenizer(text)
    assert len(tokens) == 4


@pytest.mark.parametrize('text', ["best-known"])
def test_en_tokenizer_splits_hyphens(en_tokenizer, text):
    tokens = en_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ["0.1-13.5", "0.0-0.1", "103.27-300"])
def test_en_tokenizer_splits_numeric_range(en_tokenizer, text):
    tokens = en_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ["best.Known", "Hello.World"])
def test_en_tokenizer_splits_period_infix(en_tokenizer, text):
    tokens = en_tokenizer(text)
    assert len(tokens) == 3


@pytest.mark.parametrize('text', ["Hello,world", "one,two"])
def test_en_tokenizer_splits_comma_infix(en_tokenizer, text):
    tokens = en_tokenizer(text)
    assert len(tokens) == 3
    assert tokens[0].text == text.split(",")[0]

@ -95,12 +92,12 @@ def test_tokenizer_splits_comma_infix(en_tokenizer, text):

@pytest.mark.parametrize('text', ["best...Known", "best...known"])
def test_en_tokenizer_splits_ellipsis_infix(en_tokenizer, text):
    tokens = en_tokenizer(text)
    assert len(tokens) == 3


def test_en_tokenizer_splits_double_hyphen_infix(en_tokenizer):
    tokens = en_tokenizer("No decent--let alone well-bred--people.")
    assert tokens[0].text == "No"
    assert tokens[1].text == "decent"

@ -115,7 +112,7 @@ def test_tokenizer_splits_double_hyphen_infix(en_tokenizer):

@pytest.mark.xfail
def test_en_tokenizer_splits_period_abbr(en_tokenizer):
    text = "Today is Tuesday.Mr."
    tokens = en_tokenizer(text)
    assert len(tokens) == 5

@ -127,7 +124,7 @@ def test_tokenizer_splits_period_abbr(en_tokenizer):

@pytest.mark.xfail
def test_en_tokenizer_splits_em_dash_infix(en_tokenizer):
    # Re Issue #225
    tokens = en_tokenizer("""Will this road take me to Puddleton?\u2014No, """
                          """you'll have to walk there.\u2014Ariel.""")

View File

@ -1,13 +1,9 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest

from spacy.util import compile_prefix_regex
from spacy.lang.punctuation import TOKENIZER_PREFIXES


PUNCT_OPEN = ['(', '[', '{', '*']
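
For context (an assumed illustration, not part of this file), compile_prefix_regex turns the rule list imported above into a single anchored regex whose .search method the tokenizer uses to peel punctuation off the start of a string:

from spacy.util import compile_prefix_regex
from spacy.lang.punctuation import TOKENIZER_PREFIXES

prefix_re = compile_prefix_regex(TOKENIZER_PREFIXES)
match = prefix_re.search('"Hello')
# The leading quote matches a prefix rule and would be split off as its own token.
assert match is not None and match.start() == 0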

View File

@ -1,18 +1,17 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest

from ...util import get_doc, apply_transition_sequence


@pytest.mark.parametrize('text', ["A test sentence"])
@pytest.mark.parametrize('punct', ['.', '!', '?', ''])
def test_en_sbd_single_punct(en_tokenizer, text, punct):
    heads = [2, 1, 0, -1] if punct else [2, 1, 0]
    tokens = en_tokenizer(text + punct)
    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
    assert len(doc) == 4 if punct else 3
    assert len(list(doc.sents)) == 1
    assert sum(len(sent) for sent in doc.sents) == len(doc)

@ -26,102 +25,10 @@ def test_en_sentence_breaks(en_tokenizer, en_parser):

            'attr', 'punct']
    transition = ['L-nsubj', 'S', 'L-det', 'R-attr', 'D', 'R-punct', 'B-ROOT',
                  'L-nsubj', 'S', 'L-attr', 'R-attr', 'D', 'R-punct']
    tokens = en_tokenizer(text)
    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
    apply_transition_sequence(en_parser, doc, transition)
    assert len(list(doc.sents)) == 2
    for token in doc:
        assert token.dep != 0 or token.is_space
    assert [token.head.i for token in doc] == [1, 1, 3, 1, 1, 6, 6, 8, 6, 6]
# Currently, there's no way of setting the serializer data for the parser
# without loading the models, so we can't remove the model dependency here yet.
@pytest.mark.xfail
@pytest.mark.models('en')
def test_en_sbd_serialization_projective(EN):
"""Test that before and after serialization, the sentence boundaries are
the same."""
text = "I bought a couch from IKEA It wasn't very comfortable."
transition = ['L-nsubj', 'S', 'L-det', 'R-dobj', 'D', 'R-prep', 'R-pobj',
'B-ROOT', 'L-nsubj', 'R-neg', 'D', 'S', 'L-advmod',
'R-acomp', 'D', 'R-punct']
doc = EN.tokenizer(text)
apply_transition_sequence(EN.parser, doc, transition)
doc_serialized = Doc(EN.vocab).from_bytes(doc.to_bytes())
assert doc.is_parsed == True
assert doc_serialized.is_parsed == True
assert doc.to_bytes() == doc_serialized.to_bytes()
assert [s.text for s in doc.sents] == [s.text for s in doc_serialized.sents]
TEST_CASES = [
pytest.mark.xfail(("Hello World. My name is Jonas.", ["Hello World.", "My name is Jonas."])),
("What is your name? My name is Jonas.", ["What is your name?", "My name is Jonas."]),
("There it is! I found it.", ["There it is!", "I found it."]),
("My name is Jonas E. Smith.", ["My name is Jonas E. Smith."]),
("Please turn to p. 55.", ["Please turn to p. 55."]),
("Were Jane and co. at the party?", ["Were Jane and co. at the party?"]),
("They closed the deal with Pitt, Briggs & Co. at noon.", ["They closed the deal with Pitt, Briggs & Co. at noon."]),
("Let's ask Jane and co. They should know.", ["Let's ask Jane and co.", "They should know."]),
("They closed the deal with Pitt, Briggs & Co. It closed yesterday.", ["They closed the deal with Pitt, Briggs & Co.", "It closed yesterday."]),
("I can see Mt. Fuji from here.", ["I can see Mt. Fuji from here."]),
pytest.mark.xfail(("St. Michael's Church is on 5th st. near the light.", ["St. Michael's Church is on 5th st. near the light."])),
("That is JFK Jr.'s book.", ["That is JFK Jr.'s book."]),
("I visited the U.S.A. last year.", ["I visited the U.S.A. last year."]),
("I live in the E.U. How about you?", ["I live in the E.U.", "How about you?"]),
("I live in the U.S. How about you?", ["I live in the U.S.", "How about you?"]),
("I work for the U.S. Government in Virginia.", ["I work for the U.S. Government in Virginia."]),
("I have lived in the U.S. for 20 years.", ["I have lived in the U.S. for 20 years."]),
pytest.mark.xfail(("At 5 a.m. Mr. Smith went to the bank. He left the bank at 6 P.M. Mr. Smith then went to the store.", ["At 5 a.m. Mr. Smith went to the bank.", "He left the bank at 6 P.M.", "Mr. Smith then went to the store."])),
("She has $100.00 in her bag.", ["She has $100.00 in her bag."]),
("She has $100.00. It is in her bag.", ["She has $100.00.", "It is in her bag."]),
("He teaches science (He previously worked for 5 years as an engineer.) at the local University.", ["He teaches science (He previously worked for 5 years as an engineer.) at the local University."]),
("Her email is Jane.Doe@example.com. I sent her an email.", ["Her email is Jane.Doe@example.com.", "I sent her an email."]),
("The site is: https://www.example.50.com/new-site/awesome_content.html. Please check it out.", ["The site is: https://www.example.50.com/new-site/awesome_content.html.", "Please check it out."]),
pytest.mark.xfail(("She turned to him, 'This is great.' she said.", ["She turned to him, 'This is great.' she said."])),
pytest.mark.xfail(('She turned to him, "This is great." she said.', ['She turned to him, "This is great." she said.'])),
('She turned to him, "This is great." She held the book out to show him.', ['She turned to him, "This is great."', "She held the book out to show him."]),
("Hello!! Long time no see.", ["Hello!!", "Long time no see."]),
("Hello?? Who is there?", ["Hello??", "Who is there?"]),
("Hello!? Is that you?", ["Hello!?", "Is that you?"]),
("Hello?! Is that you?", ["Hello?!", "Is that you?"]),
pytest.mark.xfail(("1.) The first item 2.) The second item", ["1.) The first item", "2.) The second item"])),
pytest.mark.xfail(("1.) The first item. 2.) The second item.", ["1.) The first item.", "2.) The second item."])),
pytest.mark.xfail(("1) The first item 2) The second item", ["1) The first item", "2) The second item"])),
("1) The first item. 2) The second item.", ["1) The first item.", "2) The second item."]),
pytest.mark.xfail(("1. The first item 2. The second item", ["1. The first item", "2. The second item"])),
pytest.mark.xfail(("1. The first item. 2. The second item.", ["1. The first item.", "2. The second item."])),
pytest.mark.xfail(("• 9. The first item • 10. The second item", ["• 9. The first item", "• 10. The second item"])),
pytest.mark.xfail(("9. The first item 10. The second item", ["9. The first item", "10. The second item"])),
pytest.mark.xfail(("a. The first item b. The second item c. The third list item", ["a. The first item", "b. The second item", "c. The third list item"])),
("This is a sentence\ncut off in the middle because pdf.", ["This is a sentence\ncut off in the middle because pdf."]),
("It was a cold \nnight in the city.", ["It was a cold \nnight in the city."]),
pytest.mark.xfail(("features\ncontact manager\nevents, activities\n", ["features", "contact manager", "events, activities"])),
pytest.mark.xfail(("You can find it at N°. 1026.253.553. That is where the treasure is.", ["You can find it at N°. 1026.253.553.", "That is where the treasure is."])),
("She works at Yahoo! in the accounting department.", ["She works at Yahoo! in the accounting department."]),
("We make a good team, you and I. Did you see Albert I. Jones yesterday?", ["We make a good team, you and I.", "Did you see Albert I. Jones yesterday?"]),
("Thoreau argues that by simplifying ones life, “the laws of the universe will appear less complex. . . .”", ["Thoreau argues that by simplifying ones life, “the laws of the universe will appear less complex. . . .”"]),
pytest.mark.xfail((""""Bohr [...] used the analogy of parallel stairways [...]" (Smith 55).""", ['"Bohr [...] used the analogy of parallel stairways [...]" (Smith 55).'])),
("If words are left off at the end of a sentence, and that is all that is omitted, indicate the omission with ellipsis marks (preceded and followed by a space) and then indicate the end of the sentence with a period . . . . Next sentence.", ["If words are left off at the end of a sentence, and that is all that is omitted, indicate the omission with ellipsis marks (preceded and followed by a space) and then indicate the end of the sentence with a period . . . .", "Next sentence."]),
("I never meant that.... She left the store.", ["I never meant that....", "She left the store."]),
pytest.mark.xfail(("I wasnt really ... well, what I mean...see . . . what I'm saying, the thing is . . . I didnt mean it.", ["I wasnt really ... well, what I mean...see . . . what I'm saying, the thing is . . . I didnt mean it."])),
pytest.mark.xfail(("One further habit which was somewhat weakened . . . was that of combining words into self-interpreting compounds. . . . The practice was not abandoned. . . .", ["One further habit which was somewhat weakened . . . was that of combining words into self-interpreting compounds.", ". . . The practice was not abandoned. . . ."])),
pytest.mark.xfail(("Hello world.Today is Tuesday.Mr. Smith went to the store and bought 1,000.That is a lot.", ["Hello world.", "Today is Tuesday.", "Mr. Smith went to the store and bought 1,000.", "That is a lot."]))
]
@pytest.mark.skip
@pytest.mark.models('en')
@pytest.mark.parametrize('text,expected_sents', TEST_CASES)
def test_en_sbd_prag(EN, text, expected_sents):
"""SBD tests from Pragmatic Segmenter"""
doc = EN(text)
sents = []
for sent in doc.sents:
sents.append(''.join(doc[i].string for i in range(sent.start, sent.end)).strip())
assert sents == expected_sents
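A note on the `pytest.mark.xfail((...))` entries wrapped directly inside `TEST_CASES` above: recent pytest releases expect expected-failure cases inside `parametrize` to be spelled with `pytest.param` instead. A minimal sketch of the equivalent form, using a hypothetical test name and two cases lifted from the list above:

```python
import pytest

SBD_CASES = [
    ("What is your name? My name is Jonas.",
     ["What is your name?", "My name is Jonas."]),
    # expected failure expressed via pytest.param(..., marks=...)
    pytest.param("Hello World. My name is Jonas.",
                 ["Hello World.", "My name is Jonas."],
                 marks=pytest.mark.xfail),
]

@pytest.mark.parametrize('text,expected_sents', SBD_CASES)
def test_sbd_cases(text, expected_sents):
    # placeholder body; the real test would run the pipeline and compare
    # [s.text.strip() for s in doc.sents] against expected_sents
    assert isinstance(expected_sents, list)
```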


@ -1,12 +1,8 @@
# coding: utf-8 # coding: utf-8
from __future__ import unicode_literals from __future__ import unicode_literals
from ....parts_of_speech import SPACE
from ....compat import unicode_
from ...util import get_doc from ...util import get_doc
import pytest
def test_en_tagger_load_morph_exc(en_tokenizer): def test_en_tagger_load_morph_exc(en_tokenizer):
text = "I like his style." text = "I like his style."
@ -14,47 +10,6 @@ def test_en_tagger_load_morph_exc(en_tokenizer):
morph_exc = {'VBP': {'like': {'lemma': 'luck'}}} morph_exc = {'VBP': {'like': {'lemma': 'luck'}}}
en_tokenizer.vocab.morphology.load_morph_exceptions(morph_exc) en_tokenizer.vocab.morphology.load_morph_exceptions(morph_exc)
tokens = en_tokenizer(text) tokens = en_tokenizer(text)
doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags) doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags)
assert doc[1].tag_ == 'VBP' assert doc[1].tag_ == 'VBP'
assert doc[1].lemma_ == 'luck' assert doc[1].lemma_ == 'luck'
@pytest.mark.models('en')
def test_tag_names(EN):
text = "I ate pizzas with anchovies."
doc = EN(text, disable=['parser'])
assert type(doc[2].pos) == int
assert isinstance(doc[2].pos_, unicode_)
assert isinstance(doc[2].dep_, unicode_)
assert doc[2].tag_ == u'NNS'
@pytest.mark.xfail
@pytest.mark.models('en')
def test_en_tagger_spaces(EN):
"""Ensure spaces are assigned the POS tag SPACE"""
text = "Some\nspaces are\tnecessary."
doc = EN(text, disable=['parser'])
assert doc[0].pos != SPACE
assert doc[0].pos_ != 'SPACE'
assert doc[1].pos == SPACE
assert doc[1].pos_ == 'SPACE'
assert doc[1].tag_ == 'SP'
assert doc[2].pos != SPACE
assert doc[3].pos != SPACE
assert doc[4].pos == SPACE
@pytest.mark.xfail
@pytest.mark.models('en')
def test_en_tagger_return_char(EN):
"""Ensure spaces are assigned the POS tag SPACE"""
text = ('hi Aaron,\r\n\r\nHow is your schedule today, I was wondering if '
'you had time for a phone\r\ncall this afternoon?\r\n\r\n\r\n')
tokens = EN(text)
for token in tokens:
if token.is_space:
assert token.pos == SPACE
assert tokens[3].text == '\r\n\r\n'
assert tokens[3].is_space
assert tokens[3].pos == SPACE


@ -1,10 +1,8 @@
# coding: utf-8 # coding: utf-8
"""Test that longer and mixed texts are tokenized correctly."""
from __future__ import unicode_literals from __future__ import unicode_literals
import pytest import pytest
from spacy.lang.en.lex_attrs import like_num
def test_en_tokenizer_handles_long_text(en_tokenizer): def test_en_tokenizer_handles_long_text(en_tokenizer):
@ -43,3 +41,9 @@ def test_lex_attrs_like_number(en_tokenizer, text, match):
tokens = en_tokenizer(text) tokens = en_tokenizer(text)
assert len(tokens) == 1 assert len(tokens) == 1
assert tokens[0].like_num == match assert tokens[0].like_num == match
@pytest.mark.parametrize('word', ['eleven'])
def test_en_lex_attrs_capitals(word):
assert like_num(word)
assert like_num(word.upper())
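The `test_en_lex_attrs_capitals` check added here (and its id/nl/pt/ru counterparts further down) pins down that `like_num` is expected to be case-insensitive. A minimal sketch of that contract, with a made-up word list standing in for the real one rather than the actual `spacy.lang.en.lex_attrs` code:

```python
# Illustrative only; the real implementation lives in spacy.lang.*.lex_attrs.
_num_words = ['zero', 'one', 'two', 'ten', 'eleven', 'twelve', 'hundred']

def like_num(text):
    text = text.replace(',', '').replace('.', '')
    if text.isdigit():
        return True
    return text.lower() in _num_words   # case-insensitive lookup

assert like_num('eleven') and like_num('ELEVEN') and like_num('11')
```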


@ -1,22 +1,21 @@
# coding: utf-8 # coding: utf-8
from __future__ import unicode_literals from __future__ import unicode_literals
import pytest import pytest
@pytest.mark.parametrize('text,lemma', [("aprox.", "aproximadamente"), @pytest.mark.parametrize('text,lemma', [
("aprox.", "aproximadamente"),
("esq.", "esquina"), ("esq.", "esquina"),
("pág.", "página"), ("pág.", "página"),
("p.ej.", "por ejemplo") ("p.ej.", "por ejemplo")])
]) def test_es_tokenizer_handles_abbr(es_tokenizer, text, lemma):
def test_tokenizer_handles_abbr(es_tokenizer, text, lemma):
tokens = es_tokenizer(text) tokens = es_tokenizer(text)
assert len(tokens) == 1 assert len(tokens) == 1
assert tokens[0].lemma_ == lemma assert tokens[0].lemma_ == lemma
def test_tokenizer_handles_exc_in_text(es_tokenizer): def test_es_tokenizer_handles_exc_in_text(es_tokenizer):
text = "Mariano Rajoy ha corrido aprox. medio kilómetro" text = "Mariano Rajoy ha corrido aprox. medio kilómetro"
tokens = es_tokenizer(text) tokens = es_tokenizer(text)
assert len(tokens) == 7 assert len(tokens) == 7


@ -1,14 +1,10 @@
# coding: utf-8 # coding: utf-8
"""Test that longer and mixed texts are tokenized correctly."""
from __future__ import unicode_literals from __future__ import unicode_literals
import pytest import pytest
def test_tokenizer_handles_long_text(es_tokenizer): def test_es_tokenizer_handles_long_text(es_tokenizer):
text = """Cuando a José Mujica lo invitaron a dar una conferencia text = """Cuando a José Mujica lo invitaron a dar una conferencia
en Oxford este verano, su cabeza hizo "crac". La "más antigua" universidad de habla en Oxford este verano, su cabeza hizo "crac". La "más antigua" universidad de habla
@ -30,6 +26,6 @@ en Montevideo y que pregona las bondades de la vida austera."""
("""¡Sí! "Vámonos", contestó José Arcadio Buendía""", 11), ("""¡Sí! "Vámonos", contestó José Arcadio Buendía""", 11),
("Corrieron aprox. 10km.", 5), ("Corrieron aprox. 10km.", 5),
("Y entonces por qué...", 5)]) ("Y entonces por qué...", 5)])
def test_tokenizer_handles_cnts(es_tokenizer, text, length): def test_es_tokenizer_handles_cnts(es_tokenizer, text, length):
tokens = es_tokenizer(text) tokens = es_tokenizer(text)
assert len(tokens) == length assert len(tokens) == length


@ -11,7 +11,7 @@ ABBREVIATION_TESTS = [
@pytest.mark.parametrize('text,expected_tokens', ABBREVIATION_TESTS) @pytest.mark.parametrize('text,expected_tokens', ABBREVIATION_TESTS)
def test_tokenizer_handles_testcases(fi_tokenizer, text, expected_tokens): def test_fi_tokenizer_handles_testcases(fi_tokenizer, text, expected_tokens):
tokens = fi_tokenizer(text) tokens = fi_tokenizer(text)
token_list = [token.text for token in tokens if not token.is_space] token_list = [token.text for token in tokens if not token.is_space]
assert expected_tokens == token_list assert expected_tokens == token_list


@ -1,29 +1,29 @@
# coding: utf-8 # coding: utf-8
from __future__ import unicode_literals from __future__ import unicode_literals
import pytest import pytest
@pytest.mark.parametrize('text', ["aujourd'hui", "Aujourd'hui", "prud'hommes", @pytest.mark.parametrize('text', [
"prudhommal"]) "aujourd'hui", "Aujourd'hui", "prud'hommes", "prudhommal"])
def test_tokenizer_infix_exceptions(fr_tokenizer, text): def test_fr_tokenizer_infix_exceptions(fr_tokenizer, text):
tokens = fr_tokenizer(text) tokens = fr_tokenizer(text)
assert len(tokens) == 1 assert len(tokens) == 1
@pytest.mark.parametrize('text,lemma', [("janv.", "janvier"), @pytest.mark.parametrize('text,lemma', [
("janv.", "janvier"),
("juill.", "juillet"), ("juill.", "juillet"),
("Dr.", "docteur"), ("Dr.", "docteur"),
("av.", "avant"), ("av.", "avant"),
("sept.", "septembre")]) ("sept.", "septembre")])
def test_tokenizer_handles_abbr(fr_tokenizer, text, lemma): def test_fr_tokenizer_handles_abbr(fr_tokenizer, text, lemma):
tokens = fr_tokenizer(text) tokens = fr_tokenizer(text)
assert len(tokens) == 1 assert len(tokens) == 1
assert tokens[0].lemma_ == lemma assert tokens[0].lemma_ == lemma
def test_tokenizer_handles_exc_in_text(fr_tokenizer): def test_fr_tokenizer_handles_exc_in_text(fr_tokenizer):
text = "Je suis allé au mois de janv. aux prudhommes." text = "Je suis allé au mois de janv. aux prudhommes."
tokens = fr_tokenizer(text) tokens = fr_tokenizer(text)
assert len(tokens) == 10 assert len(tokens) == 10
@ -32,14 +32,15 @@ def test_tokenizer_handles_exc_in_text(fr_tokenizer):
assert tokens[8].text == "prudhommes" assert tokens[8].text == "prudhommes"
def test_tokenizer_handles_exc_in_text_2(fr_tokenizer): def test_fr_tokenizer_handles_exc_in_text_2(fr_tokenizer):
text = "Cette après-midi, je suis allé dans un restaurant italo-mexicain." text = "Cette après-midi, je suis allé dans un restaurant italo-mexicain."
tokens = fr_tokenizer(text) tokens = fr_tokenizer(text)
assert len(tokens) == 11 assert len(tokens) == 11
assert tokens[1].text == "après-midi" assert tokens[1].text == "après-midi"
assert tokens[9].text == "italo-mexicain" assert tokens[9].text == "italo-mexicain"
def test_tokenizer_handles_title(fr_tokenizer):
def test_fr_tokenizer_handles_title(fr_tokenizer):
text = "N'est-ce pas génial?" text = "N'est-ce pas génial?"
tokens = fr_tokenizer(text) tokens = fr_tokenizer(text)
assert len(tokens) == 6 assert len(tokens) == 6
@ -50,14 +51,16 @@ def test_tokenizer_handles_title(fr_tokenizer):
assert tokens[2].text == "-ce" assert tokens[2].text == "-ce"
assert tokens[2].lemma_ == "ce" assert tokens[2].lemma_ == "ce"
def test_tokenizer_handles_title_2(fr_tokenizer):
def test_fr_tokenizer_handles_title_2(fr_tokenizer):
text = "Est-ce pas génial?" text = "Est-ce pas génial?"
tokens = fr_tokenizer(text) tokens = fr_tokenizer(text)
assert len(tokens) == 6 assert len(tokens) == 6
assert tokens[0].text == "Est" assert tokens[0].text == "Est"
assert tokens[0].lemma_ == "être" assert tokens[0].lemma_ == "être"
def test_tokenizer_handles_title_2(fr_tokenizer):
def test_fr_tokenizer_handles_title_2(fr_tokenizer):
text = "Qu'est-ce que tu fais?" text = "Qu'est-ce que tu fais?"
tokens = fr_tokenizer(text) tokens = fr_tokenizer(text)
assert len(tokens) == 7 assert len(tokens) == 7


@ -4,25 +4,25 @@ from __future__ import unicode_literals
import pytest import pytest
def test_lemmatizer_verb(fr_tokenizer): def test_fr_lemmatizer_verb(fr_tokenizer):
tokens = fr_tokenizer("Qu'est-ce que tu fais?") tokens = fr_tokenizer("Qu'est-ce que tu fais?")
assert tokens[0].lemma_ == "que" assert tokens[0].lemma_ == "que"
assert tokens[1].lemma_ == "être" assert tokens[1].lemma_ == "être"
assert tokens[5].lemma_ == "faire" assert tokens[5].lemma_ == "faire"
def test_lemmatizer_noun_verb_2(fr_tokenizer): def test_fr_lemmatizer_noun_verb_2(fr_tokenizer):
tokens = fr_tokenizer("Les abaissements de température sont gênants.") tokens = fr_tokenizer("Les abaissements de température sont gênants.")
assert tokens[4].lemma_ == "être" assert tokens[4].lemma_ == "être"
@pytest.mark.xfail(reason="Costaricienne TAG is PROPN instead of NOUN and spacy don't lemmatize PROPN") @pytest.mark.xfail(reason="Costaricienne TAG is PROPN instead of NOUN and spacy don't lemmatize PROPN")
def test_lemmatizer_noun(fr_tokenizer): def test_fr_lemmatizer_noun(fr_tokenizer):
tokens = fr_tokenizer("il y a des Costaricienne.") tokens = fr_tokenizer("il y a des Costaricienne.")
assert tokens[4].lemma_ == "Costaricain" assert tokens[4].lemma_ == "Costaricain"
def test_lemmatizer_noun_2(fr_tokenizer): def test_fr_lemmatizer_noun_2(fr_tokenizer):
tokens = fr_tokenizer("Les abaissements de température sont gênants.") tokens = fr_tokenizer("Les abaissements de température sont gênants.")
assert tokens[1].lemma_ == "abaissement" assert tokens[1].lemma_ == "abaissement"
assert tokens[5].lemma_ == "gênant" assert tokens[5].lemma_ == "gênant"


@ -0,0 +1,23 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from spacy.language import Language
from spacy.lang.punctuation import TOKENIZER_INFIXES
from spacy.lang.char_classes import ALPHA
@pytest.mark.parametrize('text,expected_tokens', [
("l'avion", ["l'", "avion"]), ("j'ai", ["j'", "ai"])])
def test_issue768(text, expected_tokens):
"""Allow zero-width 'infix' token during the tokenization process."""
SPLIT_INFIX = r'(?<=[{a}]\')(?=[{a}])'.format(a=ALPHA)
class FrenchTest(Language):
class Defaults(Language.Defaults):
infixes = TOKENIZER_INFIXES + [SPLIT_INFIX]
fr_tokenizer_w_infix = FrenchTest.Defaults.create_tokenizer()
tokens = fr_tokenizer_w_infix(text)
assert len(tokens) == 2
assert [t.text for t in tokens] == expected_tokens


@ -1,6 +1,9 @@
# coding: utf8 # coding: utf8
from __future__ import unicode_literals from __future__ import unicode_literals
import pytest
from spacy.lang.fr.lex_attrs import like_num
def test_tokenizer_handles_long_text(fr_tokenizer): def test_tokenizer_handles_long_text(fr_tokenizer):
text = """L'histoire du TAL commence dans les années 1950, bien que l'on puisse \ text = """L'histoire du TAL commence dans les années 1950, bien que l'on puisse \
@ -12,6 +15,11 @@ un humain dans une conversation écrite en temps réel, de façon suffisamment \
convaincante que l'interlocuteur humain ne peut distinguer sûrement — sur la \ convaincante que l'interlocuteur humain ne peut distinguer sûrement — sur la \
base du seul contenu de la conversation s'il interagit avec un programme \ base du seul contenu de la conversation s'il interagit avec un programme \
ou avec un autre vrai humain.""" ou avec un autre vrai humain."""
tokens = fr_tokenizer(text) tokens = fr_tokenizer(text)
assert len(tokens) == 113 assert len(tokens) == 113
@pytest.mark.parametrize('word', ['onze', 'onzième'])
def test_fr_lex_attrs_capitals(word):
assert like_num(word)
assert like_num(word.upper())


@ -11,7 +11,7 @@ GA_TOKEN_EXCEPTION_TESTS = [
@pytest.mark.parametrize('text,expected_tokens', GA_TOKEN_EXCEPTION_TESTS) @pytest.mark.parametrize('text,expected_tokens', GA_TOKEN_EXCEPTION_TESTS)
def test_tokenizer_handles_exception_cases(ga_tokenizer, text, expected_tokens): def test_ga_tokenizer_handles_exception_cases(ga_tokenizer, text, expected_tokens):
tokens = ga_tokenizer(text) tokens = ga_tokenizer(text)
token_list = [token.text for token in tokens if not token.is_space] token_list = [token.text for token in tokens if not token.is_space]
assert expected_tokens == token_list assert expected_tokens == token_list


@ -6,7 +6,7 @@ import pytest
@pytest.mark.parametrize('text,expected_tokens', @pytest.mark.parametrize('text,expected_tokens',
[('פייתון היא שפת תכנות דינמית', ['פייתון', 'היא', 'שפת', 'תכנות', 'דינמית'])]) [('פייתון היא שפת תכנות דינמית', ['פייתון', 'היא', 'שפת', 'תכנות', 'דינמית'])])
def test_tokenizer_handles_abbreviation(he_tokenizer, text, expected_tokens): def test_he_tokenizer_handles_abbreviation(he_tokenizer, text, expected_tokens):
tokens = he_tokenizer(text) tokens = he_tokenizer(text)
token_list = [token.text for token in tokens if not token.is_space] token_list = [token.text for token in tokens if not token.is_space]
assert expected_tokens == token_list assert expected_tokens == token_list
@ -18,6 +18,6 @@ def test_tokenizer_handles_abbreviation(he_tokenizer, text, expected_tokens):
('עקבת אחריו בכל רחבי המדינה!', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '!']), ('עקבת אחריו בכל רחבי המדינה!', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '!']),
('עקבת אחריו בכל רחבי המדינה..', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '..']), ('עקבת אחריו בכל רחבי המדינה..', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '..']),
('עקבת אחריו בכל רחבי המדינה...', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '...'])]) ('עקבת אחריו בכל רחבי המדינה...', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '...'])])
def test_tokenizer_handles_punct(he_tokenizer, text, expected_tokens): def test_he_tokenizer_handles_punct(he_tokenizer, text, expected_tokens):
tokens = he_tokenizer(text) tokens = he_tokenizer(text)
assert expected_tokens == [token.text for token in tokens] assert expected_tokens == [token.text for token in tokens]


@ -3,6 +3,7 @@ from __future__ import unicode_literals
import pytest import pytest
DEFAULT_TESTS = [ DEFAULT_TESTS = [
('N. kormányzósági\nszékhely.', ['N.', 'kormányzósági', 'székhely', '.']), ('N. kormányzósági\nszékhely.', ['N.', 'kormányzósági', 'székhely', '.']),
pytest.mark.xfail(('A .hu egy tld.', ['A', '.hu', 'egy', 'tld', '.'])), pytest.mark.xfail(('A .hu egy tld.', ['A', '.hu', 'egy', 'tld', '.'])),
@ -277,7 +278,7 @@ TESTCASES = DEFAULT_TESTS + DOT_TESTS + QUOTE_TESTS + NUMBER_TESTS + HYPHEN_TEST
@pytest.mark.parametrize('text,expected_tokens', TESTCASES) @pytest.mark.parametrize('text,expected_tokens', TESTCASES)
def test_tokenizer_handles_testcases(hu_tokenizer, text, expected_tokens): def test_hu_tokenizer_handles_testcases(hu_tokenizer, text, expected_tokens):
tokens = hu_tokenizer(text) tokens = hu_tokenizer(text)
token_list = [token.text for token in tokens if not token.is_space] token_list = [token.text for token in tokens if not token.is_space]
assert expected_tokens == token_list assert expected_tokens == token_list


@ -1,38 +1,35 @@
# coding: utf-8 # coding: utf-8
"""Test that tokenizer prefixes, suffixes and infixes are handled correctly."""
from __future__ import unicode_literals from __future__ import unicode_literals
import pytest import pytest
@pytest.mark.parametrize('text', ["(Ma'arif)"]) @pytest.mark.parametrize('text', ["(Ma'arif)"])
def test_tokenizer_splits_no_special(id_tokenizer, text): def test_id_tokenizer_splits_no_special(id_tokenizer, text):
tokens = id_tokenizer(text) tokens = id_tokenizer(text)
assert len(tokens) == 3 assert len(tokens) == 3
@pytest.mark.parametrize('text', ["Ma'arif"]) @pytest.mark.parametrize('text', ["Ma'arif"])
def test_tokenizer_splits_no_punct(id_tokenizer, text): def test_id_tokenizer_splits_no_punct(id_tokenizer, text):
tokens = id_tokenizer(text) tokens = id_tokenizer(text)
assert len(tokens) == 1 assert len(tokens) == 1
@pytest.mark.parametrize('text', ["(Ma'arif"]) @pytest.mark.parametrize('text', ["(Ma'arif"])
def test_tokenizer_splits_prefix_punct(id_tokenizer, text): def test_id_tokenizer_splits_prefix_punct(id_tokenizer, text):
tokens = id_tokenizer(text) tokens = id_tokenizer(text)
assert len(tokens) == 2 assert len(tokens) == 2
@pytest.mark.parametrize('text', ["Ma'arif)"]) @pytest.mark.parametrize('text', ["Ma'arif)"])
def test_tokenizer_splits_suffix_punct(id_tokenizer, text): def test_id_tokenizer_splits_suffix_punct(id_tokenizer, text):
tokens = id_tokenizer(text) tokens = id_tokenizer(text)
assert len(tokens) == 2 assert len(tokens) == 2
@pytest.mark.parametrize('text', ["(Ma'arif)"]) @pytest.mark.parametrize('text', ["(Ma'arif)"])
def test_tokenizer_splits_even_wrap(id_tokenizer, text): def test_id_tokenizer_splits_even_wrap(id_tokenizer, text):
tokens = id_tokenizer(text) tokens = id_tokenizer(text)
assert len(tokens) == 3 assert len(tokens) == 3
@ -44,49 +41,49 @@ def test_tokenizer_splits_uneven_wrap(id_tokenizer, text):
@pytest.mark.parametrize('text,length', [("S.Kom.", 1), ("SKom.", 2), ("(S.Kom.", 2)]) @pytest.mark.parametrize('text,length', [("S.Kom.", 1), ("SKom.", 2), ("(S.Kom.", 2)])
def test_tokenizer_splits_prefix_interact(id_tokenizer, text, length): def test_id_tokenizer_splits_prefix_interact(id_tokenizer, text, length):
tokens = id_tokenizer(text) tokens = id_tokenizer(text)
assert len(tokens) == length assert len(tokens) == length
@pytest.mark.parametrize('text', ["S.Kom.)"]) @pytest.mark.parametrize('text', ["S.Kom.)"])
def test_tokenizer_splits_suffix_interact(id_tokenizer, text): def test_id_tokenizer_splits_suffix_interact(id_tokenizer, text):
tokens = id_tokenizer(text) tokens = id_tokenizer(text)
assert len(tokens) == 2 assert len(tokens) == 2
@pytest.mark.parametrize('text', ["(S.Kom.)"]) @pytest.mark.parametrize('text', ["(S.Kom.)"])
def test_tokenizer_splits_even_wrap_interact(id_tokenizer, text): def test_id_tokenizer_splits_even_wrap_interact(id_tokenizer, text):
tokens = id_tokenizer(text) tokens = id_tokenizer(text)
assert len(tokens) == 3 assert len(tokens) == 3
@pytest.mark.parametrize('text', ["(S.Kom.?)"]) @pytest.mark.parametrize('text', ["(S.Kom.?)"])
def test_tokenizer_splits_uneven_wrap_interact(id_tokenizer, text): def test_id_tokenizer_splits_uneven_wrap_interact(id_tokenizer, text):
tokens = id_tokenizer(text) tokens = id_tokenizer(text)
assert len(tokens) == 4 assert len(tokens) == 4
@pytest.mark.parametrize('text,length', [("gara-gara", 1), ("Jokowi-Ahok", 3), ("Sukarno-Hatta", 3)]) @pytest.mark.parametrize('text,length', [("gara-gara", 1), ("Jokowi-Ahok", 3), ("Sukarno-Hatta", 3)])
def test_tokenizer_splits_hyphens(id_tokenizer, text, length): def test_id_tokenizer_splits_hyphens(id_tokenizer, text, length):
tokens = id_tokenizer(text) tokens = id_tokenizer(text)
assert len(tokens) == length assert len(tokens) == length
@pytest.mark.parametrize('text', ["0.1-13.5", "0.0-0.1", "103.27-300"]) @pytest.mark.parametrize('text', ["0.1-13.5", "0.0-0.1", "103.27-300"])
def test_tokenizer_splits_numeric_range(id_tokenizer, text): def test_id_tokenizer_splits_numeric_range(id_tokenizer, text):
tokens = id_tokenizer(text) tokens = id_tokenizer(text)
assert len(tokens) == 3 assert len(tokens) == 3
@pytest.mark.parametrize('text', ["ini.Budi", "Halo.Bandung"]) @pytest.mark.parametrize('text', ["ini.Budi", "Halo.Bandung"])
def test_tokenizer_splits_period_infix(id_tokenizer, text): def test_id_tokenizer_splits_period_infix(id_tokenizer, text):
tokens = id_tokenizer(text) tokens = id_tokenizer(text)
assert len(tokens) == 3 assert len(tokens) == 3
@pytest.mark.parametrize('text', ["Halo,Bandung", "satu,dua"]) @pytest.mark.parametrize('text', ["Halo,Bandung", "satu,dua"])
def test_tokenizer_splits_comma_infix(id_tokenizer, text): def test_id_tokenizer_splits_comma_infix(id_tokenizer, text):
tokens = id_tokenizer(text) tokens = id_tokenizer(text)
assert len(tokens) == 3 assert len(tokens) == 3
assert tokens[0].text == text.split(",")[0] assert tokens[0].text == text.split(",")[0]
@ -95,12 +92,12 @@ def test_tokenizer_splits_comma_infix(id_tokenizer, text):
@pytest.mark.parametrize('text', ["halo...Bandung", "dia...pergi"]) @pytest.mark.parametrize('text', ["halo...Bandung", "dia...pergi"])
def test_tokenizer_splits_ellipsis_infix(id_tokenizer, text): def test_id_tokenizer_splits_ellipsis_infix(id_tokenizer, text):
tokens = id_tokenizer(text) tokens = id_tokenizer(text)
assert len(tokens) == 3 assert len(tokens) == 3
def test_tokenizer_splits_double_hyphen_infix(id_tokenizer): def test_id_tokenizer_splits_double_hyphen_infix(id_tokenizer):
tokens = id_tokenizer("Arsene Wenger--manajer Arsenal--melakukan konferensi pers.") tokens = id_tokenizer("Arsene Wenger--manajer Arsenal--melakukan konferensi pers.")
assert len(tokens) == 10 assert len(tokens) == 10
assert tokens[0].text == "Arsene" assert tokens[0].text == "Arsene"


@ -0,0 +1,11 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from spacy.lang.id.lex_attrs import like_num
@pytest.mark.parametrize('word', ['sebelas'])
def test_id_lex_attrs_capitals(word):
assert like_num(word)
assert like_num(word.upper())


@ -1,18 +0,0 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
LEMMAS = (
('新しく', '新しい'),
('赤く', '赤い'),
('すごく', '凄い'),
('いただきました', '頂く'),
('なった', '成る'))
@pytest.mark.parametrize('word,lemma', LEMMAS)
def test_japanese_lemmas(JA, word, lemma):
test_lemma = JA(word)[0].lemma_
assert test_lemma == lemma


@ -0,0 +1,15 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
@pytest.mark.parametrize('word,lemma', [
('新しく', '新しい'),
('赤く', '赤い'),
('すごく', '凄い'),
('いただきました', '頂く'),
('なった', '成る')])
def test_ja_lemmatizer_assigns(ja_tokenizer, word, lemma):
test_lemma = ja_tokenizer(word)[0].lemma_
assert test_lemma == lemma


@ -30,16 +30,18 @@ POS_TESTS = [
@pytest.mark.parametrize('text,expected_tokens', TOKENIZER_TESTS) @pytest.mark.parametrize('text,expected_tokens', TOKENIZER_TESTS)
def test_japanese_tokenizer(ja_tokenizer, text, expected_tokens): def test_ja_tokenizer(ja_tokenizer, text, expected_tokens):
tokens = [token.text for token in ja_tokenizer(text)] tokens = [token.text for token in ja_tokenizer(text)]
assert tokens == expected_tokens assert tokens == expected_tokens
@pytest.mark.parametrize('text,expected_tags', TAG_TESTS) @pytest.mark.parametrize('text,expected_tags', TAG_TESTS)
def test_japanese_tokenizer(ja_tokenizer, text, expected_tags): def test_ja_tokenizer(ja_tokenizer, text, expected_tags):
tags = [token.tag_ for token in ja_tokenizer(text)] tags = [token.tag_ for token in ja_tokenizer(text)]
assert tags == expected_tags assert tags == expected_tags
@pytest.mark.parametrize('text,expected_pos', POS_TESTS) @pytest.mark.parametrize('text,expected_pos', POS_TESTS)
def test_japanese_tokenizer(ja_tokenizer, text, expected_pos): def test_ja_tokenizer(ja_tokenizer, text, expected_pos):
pos = [token.pos_ for token in ja_tokenizer(text)] pos = [token.pos_ for token in ja_tokenizer(text)]
assert pos == expected_pos assert pos == expected_pos


@ -11,7 +11,7 @@ NB_TOKEN_EXCEPTION_TESTS = [
@pytest.mark.parametrize('text,expected_tokens', NB_TOKEN_EXCEPTION_TESTS) @pytest.mark.parametrize('text,expected_tokens', NB_TOKEN_EXCEPTION_TESTS)
def test_tokenizer_handles_exception_cases(nb_tokenizer, text, expected_tokens): def test_nb_tokenizer_handles_exception_cases(nb_tokenizer, text, expected_tokens):
tokens = nb_tokenizer(text) tokens = nb_tokenizer(text)
token_list = [token.text for token in tokens if not token.is_space] token_list = [token.text for token in tokens if not token.is_space]
assert expected_tokens == token_list assert expected_tokens == token_list


@ -0,0 +1,11 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from spacy.lang.nl.lex_attrs import like_num
@pytest.mark.parametrize('word', ['elf', 'elfde'])
def test_nl_lex_attrs_capitals(word):
assert like_num(word)
assert like_num(word.upper())


@ -0,0 +1,11 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from spacy.lang.pt.lex_attrs import like_num
@pytest.mark.parametrize('word', ['onze', 'quadragésimo'])
def test_pt_lex_attrs_capitals(word):
assert like_num(word)
assert like_num(word.upper())


@ -4,10 +4,11 @@ from __future__ import unicode_literals
import pytest import pytest
@pytest.mark.parametrize('string,lemma', [('câini', 'câine'), @pytest.mark.parametrize('string,lemma', [
('câini', 'câine'),
('expedițiilor', 'expediție'), ('expedițiilor', 'expediție'),
('pensete', 'pensetă'), ('pensete', 'pensetă'),
('erau', 'fi')]) ('erau', 'fi')])
def test_lemmatizer_lookup_assigns(ro_tokenizer, string, lemma): def test_ro_lemmatizer_lookup_assigns(ro_tokenizer, string, lemma):
tokens = ro_tokenizer(string) tokens = ro_tokenizer(string)
assert tokens[0].lemma_ == lemma assert tokens[0].lemma_ == lemma


@ -3,23 +3,20 @@ from __future__ import unicode_literals
import pytest import pytest
DEFAULT_TESTS = [
TEST_CASES = [
('Adresa este str. Principală nr. 5.', ['Adresa', 'este', 'str.', 'Principală', 'nr.', '5', '.']), ('Adresa este str. Principală nr. 5.', ['Adresa', 'este', 'str.', 'Principală', 'nr.', '5', '.']),
('Teste, etc.', ['Teste', ',', 'etc.']), ('Teste, etc.', ['Teste', ',', 'etc.']),
('Lista, ș.a.m.d.', ['Lista', ',', 'ș.a.m.d.']), ('Lista, ș.a.m.d.', ['Lista', ',', 'ș.a.m.d.']),
('Și d.p.d.v. al...', ['Și', 'd.p.d.v.', 'al', '...']) ('Și d.p.d.v. al...', ['Și', 'd.p.d.v.', 'al', '...']),
] # number tests
NUMBER_TESTS = [
('Clasa a 4-a.', ['Clasa', 'a', '4-a', '.']), ('Clasa a 4-a.', ['Clasa', 'a', '4-a', '.']),
('Al 12-lea ceas.', ['Al', '12-lea', 'ceas', '.']) ('Al 12-lea ceas.', ['Al', '12-lea', 'ceas', '.'])
] ]
TESTCASES = DEFAULT_TESTS + NUMBER_TESTS
@pytest.mark.parametrize('text,expected_tokens', TEST_CASES)
@pytest.mark.parametrize('text,expected_tokens', TESTCASES) def test_ro_tokenizer_handles_testcases(ro_tokenizer, text, expected_tokens):
def test_tokenizer_handles_testcases(ro_tokenizer, text, expected_tokens):
tokens = ro_tokenizer(text) tokens = ro_tokenizer(text)
token_list = [token.text for token in tokens if not token.is_space] token_list = [token.text for token in tokens if not token.is_space]
assert expected_tokens == token_list assert expected_tokens == token_list


@ -0,0 +1,14 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
@pytest.mark.parametrize('text,norms', [
("пн.", ["понедельник"]),
("пт.", ["пятница"]),
("дек.", ["декабрь"])])
def test_ru_tokenizer_abbrev_exceptions(ru_tokenizer, text, norms):
tokens = ru_tokenizer(text)
assert len(tokens) == 1
assert [token.norm_ for token in tokens] == norms


@ -2,27 +2,29 @@
from __future__ import unicode_literals from __future__ import unicode_literals
import pytest import pytest
from ....tokens.doc import Doc from spacy.lang.ru import Russian
from ...util import get_doc
@pytest.fixture @pytest.fixture
def ru_lemmatizer(RU): def ru_lemmatizer():
return RU.Defaults.create_lemmatizer() pymorphy = pytest.importorskip('pymorphy2')
return Russian.Defaults.create_lemmatizer()
@pytest.mark.models('ru') def test_ru_doc_lemmatization(ru_tokenizer):
def test_doc_lemmatization(RU): words = ['мама', 'мыла', 'раму']
doc = Doc(RU.vocab, words=['мама', 'мыла', 'раму']) tags = ['NOUN__Animacy=Anim|Case=Nom|Gender=Fem|Number=Sing',
doc[0].tag_ = 'NOUN__Animacy=Anim|Case=Nom|Gender=Fem|Number=Sing' 'VERB__Aspect=Imp|Gender=Fem|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act',
doc[1].tag_ = 'VERB__Aspect=Imp|Gender=Fem|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act' 'NOUN__Animacy=Anim|Case=Acc|Gender=Fem|Number=Sing']
doc[2].tag_ = 'NOUN__Animacy=Anim|Case=Acc|Gender=Fem|Number=Sing' doc = get_doc(ru_tokenizer.vocab, words=words, tags=tags)
lemmas = [token.lemma_ for token in doc] lemmas = [token.lemma_ for token in doc]
assert lemmas == ['мама', 'мыть', 'рама'] assert lemmas == ['мама', 'мыть', 'рама']
@pytest.mark.models('ru') @pytest.mark.parametrize('text,lemmas', [
@pytest.mark.parametrize('text,lemmas', [('гвоздики', ['гвоздик', 'гвоздика']), ('гвоздики', ['гвоздик', 'гвоздика']),
('люди', ['человек']), ('люди', ['человек']),
('реки', ['река']), ('реки', ['река']),
('кольцо', ['кольцо']), ('кольцо', ['кольцо']),
@ -32,7 +34,8 @@ def test_ru_lemmatizer_noun_lemmas(ru_lemmatizer, text, lemmas):
@pytest.mark.models('ru') @pytest.mark.models('ru')
@pytest.mark.parametrize('text,pos,morphology,lemma', [('рой', 'NOUN', None, 'рой'), @pytest.mark.parametrize('text,pos,morphology,lemma', [
('рой', 'NOUN', None, 'рой'),
('рой', 'VERB', None, 'рыть'), ('рой', 'VERB', None, 'рыть'),
('клей', 'NOUN', None, 'клей'), ('клей', 'NOUN', None, 'клей'),
('клей', 'VERB', None, 'клеить'), ('клей', 'VERB', None, 'клеить'),
@ -41,31 +44,20 @@ def test_ru_lemmatizer_noun_lemmas(ru_lemmatizer, text, lemmas):
('кос', 'NOUN', {'Number': 'Plur'}, 'коса'), ('кос', 'NOUN', {'Number': 'Plur'}, 'коса'),
('кос', 'ADJ', None, 'косой'), ('кос', 'ADJ', None, 'косой'),
('потом', 'NOUN', None, 'пот'), ('потом', 'NOUN', None, 'пот'),
('потом', 'ADV', None, 'потом') ('потом', 'ADV', None, 'потом')])
])
def test_ru_lemmatizer_works_with_different_pos_homonyms(ru_lemmatizer, text, pos, morphology, lemma): def test_ru_lemmatizer_works_with_different_pos_homonyms(ru_lemmatizer, text, pos, morphology, lemma):
assert ru_lemmatizer(text, pos, morphology) == [lemma] assert ru_lemmatizer(text, pos, morphology) == [lemma]
@pytest.mark.models('ru') @pytest.mark.parametrize('text,morphology,lemma', [
@pytest.mark.parametrize('text,morphology,lemma', [('гвоздики', {'Gender': 'Fem'}, 'гвоздика'), ('гвоздики', {'Gender': 'Fem'}, 'гвоздика'),
('гвоздики', {'Gender': 'Masc'}, 'гвоздик'), ('гвоздики', {'Gender': 'Masc'}, 'гвоздик'),
('вина', {'Gender': 'Fem'}, 'вина'), ('вина', {'Gender': 'Fem'}, 'вина'),
('вина', {'Gender': 'Neut'}, 'вино') ('вина', {'Gender': 'Neut'}, 'вино')])
])
def test_ru_lemmatizer_works_with_noun_homonyms(ru_lemmatizer, text, morphology, lemma): def test_ru_lemmatizer_works_with_noun_homonyms(ru_lemmatizer, text, morphology, lemma):
assert ru_lemmatizer.noun(text, morphology) == [lemma] assert ru_lemmatizer.noun(text, morphology) == [lemma]
@pytest.mark.models('ru')
def test_ru_lemmatizer_punct(ru_lemmatizer): def test_ru_lemmatizer_punct(ru_lemmatizer):
assert ru_lemmatizer.punct('«') == ['"'] assert ru_lemmatizer.punct('«') == ['"']
assert ru_lemmatizer.punct('»') == ['"'] assert ru_lemmatizer.punct('»') == ['"']
# @pytest.mark.models('ru')
# def test_ru_lemmatizer_lemma_assignment(RU):
# text = "А роза упала на лапу Азора."
# doc = RU.make_doc(text)
# RU.tagger(doc)
# assert all(t.lemma_ != '' for t in doc)


@ -0,0 +1,11 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from spacy.lang.ru.lex_attrs import like_num
@pytest.mark.parametrize('word', ['одиннадцать'])
def test_ru_lex_attrs_capitals(word):
assert like_num(word)
assert like_num(word.upper())


@ -1,7 +1,4 @@
# coding: utf-8 # coding: utf-8
"""Test that open, closed and paired punctuation is split off correctly."""
from __future__ import unicode_literals from __future__ import unicode_literals
import pytest import pytest


@ -1,16 +0,0 @@
# coding: utf-8
"""Test that tokenizer exceptions are parsed correctly."""
from __future__ import unicode_literals
import pytest
@pytest.mark.parametrize('text,norms', [("пн.", ["понедельник"]),
("пт.", ["пятница"]),
("дек.", ["декабрь"])])
def test_ru_tokenizer_abbrev_exceptions(ru_tokenizer, text, norms):
tokens = ru_tokenizer(text)
assert len(tokens) == 1
assert [token.norm_ for token in tokens] == norms


@ -11,14 +11,14 @@ SV_TOKEN_EXCEPTION_TESTS = [
@pytest.mark.parametrize('text,expected_tokens', SV_TOKEN_EXCEPTION_TESTS) @pytest.mark.parametrize('text,expected_tokens', SV_TOKEN_EXCEPTION_TESTS)
def test_tokenizer_handles_exception_cases(sv_tokenizer, text, expected_tokens): def test_sv_tokenizer_handles_exception_cases(sv_tokenizer, text, expected_tokens):
tokens = sv_tokenizer(text) tokens = sv_tokenizer(text)
token_list = [token.text for token in tokens if not token.is_space] token_list = [token.text for token in tokens if not token.is_space]
assert expected_tokens == token_list assert expected_tokens == token_list
@pytest.mark.parametrize('text', ["driveru", "hajaru", "Serru", "Fixaru"]) @pytest.mark.parametrize('text', ["driveru", "hajaru", "Serru", "Fixaru"])
def test_tokenizer_handles_verb_exceptions(sv_tokenizer, text): def test_sv_tokenizer_handles_verb_exceptions(sv_tokenizer, text):
tokens = sv_tokenizer(text) tokens = sv_tokenizer(text)
assert len(tokens) == 2 assert len(tokens) == 2
assert tokens[1].text == "u" assert tokens[1].text == "u"


@ -1,10 +1,9 @@
# coding: utf-8 # coding: utf-8
from __future__ import unicode_literals from __future__ import unicode_literals
from ...attrs import intify_attrs, ORTH, NORM, LEMMA, IS_ALPHA
from ...lang.lex_attrs import is_punct, is_ascii, is_currency, like_url, word_shape
import pytest import pytest
from spacy.attrs import intify_attrs, ORTH, NORM, LEMMA, IS_ALPHA
from spacy.lang.lex_attrs import is_punct, is_ascii, is_currency, like_url, word_shape
@pytest.mark.parametrize('text', ["dog"]) @pytest.mark.parametrize('text', ["dog"])


@ -3,11 +3,9 @@ from __future__ import unicode_literals
import pytest import pytest
TOKENIZER_TESTS = [
("คุณรักผมไหม", ['คุณ', 'รัก', 'ผม', 'ไหม'])
]
@pytest.mark.parametrize('text,expected_tokens', TOKENIZER_TESTS) @pytest.mark.parametrize('text,expected_tokens', [
def test_thai_tokenizer(th_tokenizer, text, expected_tokens): ("คุณรักผมไหม", ['คุณ', 'รัก', 'ผม', 'ไหม'])])
def test_th_tokenizer(th_tokenizer, text, expected_tokens):
tokens = [token.text for token in th_tokenizer(text)] tokens = [token.text for token in th_tokenizer(text)]
assert tokens == expected_tokens assert tokens == expected_tokens


@ -3,13 +3,15 @@ from __future__ import unicode_literals
import pytest import pytest
@pytest.mark.parametrize('string,lemma', [('evlerimizdeki', 'ev'),
@pytest.mark.parametrize('string,lemma', [
('evlerimizdeki', 'ev'),
('işlerimizi', ''), ('işlerimizi', ''),
('biran', 'biran'), ('biran', 'biran'),
('bitirmeliyiz', 'bitir'), ('bitirmeliyiz', 'bitir'),
('isteklerimizi', 'istek'), ('isteklerimizi', 'istek'),
('karşılaştırmamızın', 'karşılaştır'), ('karşılaştırmamızın', 'karşılaştır'),
('çoğulculuktan', 'çoğulcu')]) ('çoğulculuktan', 'çoğulcu')])
def test_lemmatizer_lookup_assigns(tr_tokenizer, string, lemma): def test_tr_lemmatizer_lookup_assigns(tr_tokenizer, string, lemma):
tokens = tr_tokenizer(string) tokens = tr_tokenizer(string)
assert tokens[0].lemma_ == lemma assert tokens[0].lemma_ == lemma


@ -3,6 +3,7 @@ from __future__ import unicode_literals
import pytest import pytest
INFIX_HYPHEN_TESTS = [ INFIX_HYPHEN_TESTS = [
("Явым-төшем күләме.", "Явым-төшем күләме .".split()), ("Явым-төшем күләме.", "Явым-төшем күләме .".split()),
("Хатын-кыз киеме.", "Хатын-кыз киеме .".split()) ("Хатын-кыз киеме.", "Хатын-кыз киеме .".split())
@ -64,12 +65,12 @@ NORM_TESTCASES = [
@pytest.mark.parametrize("text,expected_tokens", TESTCASES) @pytest.mark.parametrize("text,expected_tokens", TESTCASES)
def test_tokenizer_handles_testcases(tt_tokenizer, text, expected_tokens): def test_tt_tokenizer_handles_testcases(tt_tokenizer, text, expected_tokens):
tokens = [token.text for token in tt_tokenizer(text) if not token.is_space] tokens = [token.text for token in tt_tokenizer(text) if not token.is_space]
assert expected_tokens == tokens assert expected_tokens == tokens
@pytest.mark.parametrize('text,norms', NORM_TESTCASES) @pytest.mark.parametrize('text,norms', NORM_TESTCASES)
def test_tokenizer_handles_norm_exceptions(tt_tokenizer, text, norms): def test_tt_tokenizer_handles_norm_exceptions(tt_tokenizer, text, norms):
tokens = tt_tokenizer(text) tokens = tt_tokenizer(text)
assert [token.norm_ for token in tokens] == norms assert [token.norm_ for token in tokens] == norms


@ -1,19 +1,14 @@
# coding: utf-8 # coding: utf-8
"""Test that longer and mixed texts are tokenized correctly."""
from __future__ import unicode_literals from __future__ import unicode_literals
import pytest import pytest
def test_tokenizer_handles_long_text(ur_tokenizer): def test_ur_tokenizer_handles_long_text(ur_tokenizer):
text = """اصل میں رسوا ہونے کی ہمیں text = """اصل میں رسوا ہونے کی ہمیں
کچھ عادت سی ہو گئی ہے اس لئے جگ ہنسائی کا ذکر نہیں کرتا،ہوا کچھ یوں کہ عرصہ چھ سال بعد ہمیں بھی خیال آیا کچھ عادت سی ہو گئی ہے اس لئے جگ ہنسائی کا ذکر نہیں کرتا،ہوا کچھ یوں کہ عرصہ چھ سال بعد ہمیں بھی خیال آیا
کہ ایک عدد ٹیلی ویژن ہی کیوں نہ خرید لیں ، سوچا ورلڈ کپ ہی دیکھیں گے۔اپنے پاکستان کے کھلاڑیوں کو دیکھ کر کہ ایک عدد ٹیلی ویژن ہی کیوں نہ خرید لیں ، سوچا ورلڈ کپ ہی دیکھیں گے۔اپنے پاکستان کے کھلاڑیوں کو دیکھ کر
ورلڈ کپ دیکھنے کا حوصلہ ہی نہ رہا تو اب یوں ہی ادھر اُدھر کے چینل گھمانے لگ پڑتے ہیں۔""" ورلڈ کپ دیکھنے کا حوصلہ ہی نہ رہا تو اب یوں ہی ادھر اُدھر کے چینل گھمانے لگ پڑتے ہیں۔"""
tokens = ur_tokenizer(text) tokens = ur_tokenizer(text)
assert len(tokens) == 77 assert len(tokens) == 77
@ -21,6 +16,6 @@ def test_tokenizer_handles_long_text(ur_tokenizer):
@pytest.mark.parametrize('text,length', [ @pytest.mark.parametrize('text,length', [
("تحریر باسط حبیب", 3), ("تحریر باسط حبیب", 3),
("میرا پاکستان", 2)]) ("میرا پاکستان", 2)])
def test_tokenizer_handles_cnts(ur_tokenizer, text, length): def test_ur_tokenizer_handles_cnts(ur_tokenizer, text, length):
tokens = ur_tokenizer(text) tokens = ur_tokenizer(text)
assert len(tokens) == length assert len(tokens) == length


@ -1,19 +1,16 @@
# coding: utf-8 # coding: utf-8
from __future__ import unicode_literals from __future__ import unicode_literals
from ..matcher import Matcher, PhraseMatcher
from .util import get_doc
import pytest import pytest
from spacy.matcher import Matcher
from spacy.tokens import Doc
@pytest.fixture @pytest.fixture
def matcher(en_vocab): def matcher(en_vocab):
rules = { rules = {'JS': [[{'ORTH': 'JavaScript'}]],
'JS': [[{'ORTH': 'JavaScript'}]],
'GoogleNow': [[{'ORTH': 'Google'}, {'ORTH': 'Now'}]], 'GoogleNow': [[{'ORTH': 'Google'}, {'ORTH': 'Now'}]],
'Java': [[{'LOWER': 'java'}]] 'Java': [[{'LOWER': 'java'}]]}
}
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
for key, patterns in rules.items(): for key, patterns in rules.items():
matcher.add(key, None, *patterns) matcher.add(key, None, *patterns)
@ -36,7 +33,7 @@ def test_matcher_from_api_docs(en_vocab):
def test_matcher_from_usage_docs(en_vocab): def test_matcher_from_usage_docs(en_vocab):
text = "Wow 😀 This is really cool! 😂 😂" text = "Wow 😀 This is really cool! 😂 😂"
doc = get_doc(en_vocab, words=text.split(' ')) doc = Doc(en_vocab, words=text.split(' '))
pos_emoji = ['😀', '😃', '😂', '🤣', '😊', '😍'] pos_emoji = ['😀', '😃', '😂', '🤣', '😊', '😍']
pos_patterns = [[{'ORTH': emoji}] for emoji in pos_emoji] pos_patterns = [[{'ORTH': emoji}] for emoji in pos_emoji]
@ -55,68 +52,46 @@ def test_matcher_from_usage_docs(en_vocab):
assert doc[1].norm_ == 'happy emoji' assert doc[1].norm_ == 'happy emoji'
@pytest.mark.parametrize('words', [["Some", "words"]]) def test_matcher_len_contains(matcher):
def test_matcher_init(en_vocab, words): assert len(matcher) == 3
matcher = Matcher(en_vocab)
doc = get_doc(en_vocab, words)
assert len(matcher) == 0
assert matcher(doc) == []
def test_matcher_contains(matcher):
matcher.add('TEST', None, [{'ORTH': 'test'}]) matcher.add('TEST', None, [{'ORTH': 'test'}])
assert 'TEST' in matcher assert 'TEST' in matcher
assert 'TEST2' not in matcher assert 'TEST2' not in matcher
def test_matcher_no_match(matcher): def test_matcher_no_match(matcher):
words = ["I", "like", "cheese", "."] doc = Doc(matcher.vocab, words=["I", "like", "cheese", "."])
doc = get_doc(matcher.vocab, words)
assert matcher(doc) == [] assert matcher(doc) == []
def test_matcher_compile(en_vocab):
rules = {
'JS': [[{'ORTH': 'JavaScript'}]],
'GoogleNow': [[{'ORTH': 'Google'}, {'ORTH': 'Now'}]],
'Java': [[{'LOWER': 'java'}]]
}
matcher = Matcher(en_vocab)
for key, patterns in rules.items():
matcher.add(key, None, *patterns)
assert len(matcher) == 3
def test_matcher_match_start(matcher): def test_matcher_match_start(matcher):
words = ["JavaScript", "is", "good"] doc = Doc(matcher.vocab, words=["JavaScript", "is", "good"])
doc = get_doc(matcher.vocab, words)
assert matcher(doc) == [(matcher.vocab.strings['JS'], 0, 1)] assert matcher(doc) == [(matcher.vocab.strings['JS'], 0, 1)]
def test_matcher_match_end(matcher): def test_matcher_match_end(matcher):
words = ["I", "like", "java"] words = ["I", "like", "java"]
doc = get_doc(matcher.vocab, words) doc = Doc(matcher.vocab, words=words)
assert matcher(doc) == [(doc.vocab.strings['Java'], 2, 3)] assert matcher(doc) == [(doc.vocab.strings['Java'], 2, 3)]
def test_matcher_match_middle(matcher): def test_matcher_match_middle(matcher):
words = ["I", "like", "Google", "Now", "best"] words = ["I", "like", "Google", "Now", "best"]
doc = get_doc(matcher.vocab, words) doc = Doc(matcher.vocab, words=words)
assert matcher(doc) == [(doc.vocab.strings['GoogleNow'], 2, 4)] assert matcher(doc) == [(doc.vocab.strings['GoogleNow'], 2, 4)]
def test_matcher_match_multi(matcher): def test_matcher_match_multi(matcher):
words = ["I", "like", "Google", "Now", "and", "java", "best"] words = ["I", "like", "Google", "Now", "and", "java", "best"]
doc = get_doc(matcher.vocab, words) doc = Doc(matcher.vocab, words=words)
assert matcher(doc) == [(doc.vocab.strings['GoogleNow'], 2, 4), assert matcher(doc) == [(doc.vocab.strings['GoogleNow'], 2, 4),
(doc.vocab.strings['Java'], 5, 6)] (doc.vocab.strings['Java'], 5, 6)]
def test_matcher_empty_dict(en_vocab): def test_matcher_empty_dict(en_vocab):
'''Test matcher allows empty token specs, meaning match on any token.''' """Test matcher allows empty token specs, meaning match on any token."""
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
abc = ["a", "b", "c"] doc = Doc(matcher.vocab, words=["a", "b", "c"])
doc = get_doc(matcher.vocab, abc)
matcher.add('A.C', None, [{'ORTH': 'a'}, {}, {'ORTH': 'c'}]) matcher.add('A.C', None, [{'ORTH': 'a'}, {}, {'ORTH': 'c'}])
matches = matcher(doc) matches = matcher(doc)
assert len(matches) == 1 assert len(matches) == 1
@ -129,8 +104,7 @@ def test_matcher_empty_dict(en_vocab):
def test_matcher_operator_shadow(en_vocab): def test_matcher_operator_shadow(en_vocab):
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
abc = ["a", "b", "c"] doc = Doc(matcher.vocab, words=["a", "b", "c"])
doc = get_doc(matcher.vocab, abc)
pattern = [{'ORTH': 'a'}, {"IS_ALPHA": True, "OP": "+"}, {'ORTH': 'c'}] pattern = [{'ORTH': 'a'}, {"IS_ALPHA": True, "OP": "+"}, {'ORTH': 'c'}]
matcher.add('A.C', None, pattern) matcher.add('A.C', None, pattern)
matches = matcher(doc) matches = matcher(doc)
@ -138,32 +112,6 @@ def test_matcher_operator_shadow(en_vocab):
assert matches[0][1:] == (0, 3) assert matches[0][1:] == (0, 3)
def test_matcher_phrase_matcher(en_vocab):
words = ["Google", "Now"]
doc = get_doc(en_vocab, words)
matcher = PhraseMatcher(en_vocab)
matcher.add('COMPANY', None, doc)
words = ["I", "like", "Google", "Now", "best"]
doc = get_doc(en_vocab, words)
assert len(matcher(doc)) == 1
def test_phrase_matcher_length(en_vocab):
matcher = PhraseMatcher(en_vocab)
assert len(matcher) == 0
matcher.add('TEST', None, get_doc(en_vocab, ['test']))
assert len(matcher) == 1
matcher.add('TEST2', None, get_doc(en_vocab, ['test2']))
assert len(matcher) == 2
def test_phrase_matcher_contains(en_vocab):
matcher = PhraseMatcher(en_vocab)
matcher.add('TEST', None, get_doc(en_vocab, ['test']))
assert 'TEST' in matcher
assert 'TEST2' not in matcher
def test_matcher_match_zero(matcher): def test_matcher_match_zero(matcher):
words1 = 'He said , " some words " ...'.split() words1 = 'He said , " some words " ...'.split()
words2 = 'He said , " some three words " ...'.split() words2 = 'He said , " some three words " ...'.split()
@ -176,12 +124,10 @@ def test_matcher_match_zero(matcher):
{'IS_PUNCT': True}, {'IS_PUNCT': True},
{'IS_PUNCT': True}, {'IS_PUNCT': True},
{'ORTH': '"'}] {'ORTH': '"'}]
matcher.add('Quote', None, pattern1) matcher.add('Quote', None, pattern1)
doc = get_doc(matcher.vocab, words1) doc = Doc(matcher.vocab, words=words1)
assert len(matcher(doc)) == 1 assert len(matcher(doc)) == 1
doc = Doc(matcher.vocab, words=words2)
doc = get_doc(matcher.vocab, words2)
assert len(matcher(doc)) == 0 assert len(matcher(doc)) == 0
matcher.add('Quote', None, pattern2) matcher.add('Quote', None, pattern2)
assert len(matcher(doc)) == 0 assert len(matcher(doc)) == 0
@ -194,14 +140,14 @@ def test_matcher_match_zero_plus(matcher):
{'ORTH': '"'}] {'ORTH': '"'}]
matcher = Matcher(matcher.vocab) matcher = Matcher(matcher.vocab)
matcher.add('Quote', None, pattern) matcher.add('Quote', None, pattern)
doc = get_doc(matcher.vocab, words) doc = Doc(matcher.vocab, words=words)
assert len(matcher(doc)) == 1 assert len(matcher(doc)) == 1
def test_matcher_match_one_plus(matcher): def test_matcher_match_one_plus(matcher):
control = Matcher(matcher.vocab) control = Matcher(matcher.vocab)
control.add('BasicPhilippe', None, [{'ORTH': 'Philippe'}]) control.add('BasicPhilippe', None, [{'ORTH': 'Philippe'}])
doc = get_doc(control.vocab, ['Philippe', 'Philippe']) doc = Doc(control.vocab, words=['Philippe', 'Philippe'])
m = control(doc) m = control(doc)
assert len(m) == 2 assert len(m) == 2
matcher.add('KleenePhilippe', None, [{'ORTH': 'Philippe', 'OP': '1'}, matcher.add('KleenePhilippe', None, [{'ORTH': 'Philippe', 'OP': '1'},
@ -210,61 +156,11 @@ def test_matcher_match_one_plus(matcher):
assert len(m) == 1 assert len(m) == 1
def test_operator_combos(matcher):
cases = [
('aaab', 'a a a b', True),
('aaab', 'a+ b', True),
('aaab', 'a+ a+ b', True),
('aaab', 'a+ a+ a b', True),
('aaab', 'a+ a+ a+ b', True),
('aaab', 'a+ a a b', True),
('aaab', 'a+ a a', True),
('aaab', 'a+', True),
('aaa', 'a+ b', False),
('aaa', 'a+ a+ b', False),
('aaa', 'a+ a+ a+ b', False),
('aaa', 'a+ a b', False),
('aaa', 'a+ a a b', False),
('aaab', 'a+ a a', True),
('aaab', 'a+', True),
('aaab', 'a+ a b', True)
]
for string, pattern_str, result in cases:
matcher = Matcher(matcher.vocab)
doc = get_doc(matcher.vocab, words=list(string))
pattern = []
for part in pattern_str.split():
if part.endswith('+'):
pattern.append({'ORTH': part[0], 'op': '+'})
else:
pattern.append({'ORTH': part})
matcher.add('PATTERN', None, pattern)
matches = matcher(doc)
if result:
assert matches, (string, pattern_str)
else:
assert not matches, (string, pattern_str)
def test_matcher_end_zero_plus(matcher):
"""Test matcher works when patterns end with * operator. (issue 1450)"""
matcher = Matcher(matcher.vocab)
pattern = [{'ORTH': "a"}, {'ORTH': "b", 'OP': "*"}]
matcher.add("TSTEND", None, pattern)
nlp = lambda string: get_doc(matcher.vocab, string.split())
assert len(matcher(nlp('a'))) == 1
assert len(matcher(nlp('a b'))) == 2
assert len(matcher(nlp('a c'))) == 1
assert len(matcher(nlp('a b c'))) == 2
assert len(matcher(nlp('a b b c'))) == 3
assert len(matcher(nlp('a b b'))) == 3
def test_matcher_any_token_operator(en_vocab): def test_matcher_any_token_operator(en_vocab):
"""Test that patterns with "any token" {} work with operators.""" """Test that patterns with "any token" {} work with operators."""
matcher = Matcher(en_vocab) matcher = Matcher(en_vocab)
matcher.add('TEST', None, [{'ORTH': 'test'}, {'OP': '*'}]) matcher.add('TEST', None, [{'ORTH': 'test'}, {'OP': '*'}])
doc = get_doc(en_vocab, ['test', 'hello', 'world']) doc = Doc(en_vocab, words=['test', 'hello', 'world'])
matches = [doc[start:end].text for _, start, end in matcher(doc)] matches = [doc[start:end].text for _, start, end in matcher(doc)]
assert len(matches) == 3 assert len(matches) == 3
assert matches[0] == 'test' assert matches[0] == 'test'


@ -0,0 +1,116 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
import re
from spacy.matcher import Matcher
from spacy.tokens import Doc
pattern1 = [{'ORTH':'A', 'OP':'1'}, {'ORTH':'A', 'OP':'*'}]
pattern2 = [{'ORTH':'A', 'OP':'*'}, {'ORTH':'A', 'OP':'1'}]
pattern3 = [{'ORTH':'A', 'OP':'1'}, {'ORTH':'A', 'OP':'1'}]
pattern4 = [{'ORTH':'B', 'OP':'1'}, {'ORTH':'A', 'OP':'*'}, {'ORTH':'B', 'OP':'1'}]
pattern5 = [{'ORTH':'B', 'OP':'*'}, {'ORTH':'A', 'OP':'*'}, {'ORTH':'B', 'OP':'1'}]
re_pattern1 = 'AA*'
re_pattern2 = 'A*A'
re_pattern3 = 'AA'
re_pattern4 = 'BA*B'
re_pattern5 = 'B*A*B'
@pytest.fixture
def text():
return "(ABBAAAAAB)."
@pytest.fixture
def doc(en_tokenizer, text):
doc = en_tokenizer(' '.join(text))
return doc
@pytest.mark.xfail
@pytest.mark.parametrize('pattern,re_pattern', [
(pattern1, re_pattern1),
(pattern2, re_pattern2),
(pattern3, re_pattern3),
(pattern4, re_pattern4),
(pattern5, re_pattern5)])
def test_greedy_matching(doc, text, pattern, re_pattern):
"""Test that the greedy matching behavior of the * op is consistant with
other re implementations."""
matcher = Matcher(doc.vocab)
matcher.add(re_pattern, None, pattern)
matches = matcher(doc)
re_matches = [m.span() for m in re.finditer(re_pattern, text)]
for match, re_match in zip(matches, re_matches):
assert match[1:] == re_match
@pytest.mark.xfail
@pytest.mark.parametrize('pattern,re_pattern', [
(pattern1, re_pattern1),
(pattern2, re_pattern2),
(pattern3, re_pattern3),
(pattern4, re_pattern4),
(pattern5, re_pattern5)])
def test_match_consuming(doc, text, pattern, re_pattern):
"""Test that matcher.__call__ consumes tokens on a match similar to
re.findall."""
matcher = Matcher(doc.vocab)
matcher.add(re_pattern, None, pattern)
matches = matcher(doc)
re_matches = [m.span() for m in re.finditer(re_pattern, text)]
assert len(matches) == len(re_matches)
def test_operator_combos(en_vocab):
cases = [
('aaab', 'a a a b', True),
('aaab', 'a+ b', True),
('aaab', 'a+ a+ b', True),
('aaab', 'a+ a+ a b', True),
('aaab', 'a+ a+ a+ b', True),
('aaab', 'a+ a a b', True),
('aaab', 'a+ a a', True),
('aaab', 'a+', True),
('aaa', 'a+ b', False),
('aaa', 'a+ a+ b', False),
('aaa', 'a+ a+ a+ b', False),
('aaa', 'a+ a b', False),
('aaa', 'a+ a a b', False),
('aaab', 'a+ a a', True),
('aaab', 'a+', True),
('aaab', 'a+ a b', True)
]
for string, pattern_str, result in cases:
matcher = Matcher(en_vocab)
doc = Doc(matcher.vocab, words=list(string))
pattern = []
for part in pattern_str.split():
if part.endswith('+'):
pattern.append({'ORTH': part[0], 'OP': '+'})
else:
pattern.append({'ORTH': part})
matcher.add('PATTERN', None, pattern)
matches = matcher(doc)
if result:
assert matches, (string, pattern_str)
else:
assert not matches, (string, pattern_str)
def test_matcher_end_zero_plus(en_vocab):
"""Test matcher works when patterns end with * operator. (issue 1450)"""
matcher = Matcher(en_vocab)
pattern = [{'ORTH': "a"}, {'ORTH': "b", 'OP': "*"}]
matcher.add('TSTEND', None, pattern)
nlp = lambda string: Doc(matcher.vocab, words=string.split())
assert len(matcher(nlp('a'))) == 1
assert len(matcher(nlp('a b'))) == 2
assert len(matcher(nlp('a c'))) == 1
assert len(matcher(nlp('a b c'))) == 2
assert len(matcher(nlp('a b b c'))) == 3
assert len(matcher(nlp('a b b'))) == 3


@ -0,0 +1,30 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc
def test_matcher_phrase_matcher(en_vocab):
doc = Doc(en_vocab, words=["Google", "Now"])
matcher = PhraseMatcher(en_vocab)
matcher.add('COMPANY', None, doc)
doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"])
assert len(matcher(doc)) == 1
def test_phrase_matcher_length(en_vocab):
matcher = PhraseMatcher(en_vocab)
assert len(matcher) == 0
matcher.add('TEST', None, Doc(en_vocab, words=['test']))
assert len(matcher) == 1
matcher.add('TEST2', None, Doc(en_vocab, words=['test2']))
assert len(matcher) == 2
def test_phrase_matcher_contains(en_vocab):
matcher = PhraseMatcher(en_vocab)
matcher.add('TEST', None, Doc(en_vocab, words=['test']))
assert 'TEST' in matcher
assert 'TEST2' not in matcher

View File

@ -1,15 +1,16 @@
-'''Test the ability to add a label to a (potentially trained) parsing model.'''
+# coding: utf8
from __future__ import unicode_literals
import pytest
import numpy.random
from thinc.neural.optimizers import Adam
from thinc.neural.ops import NumpyOps
-from ...attrs import NORM
-from ...gold import GoldParse
-from ...vocab import Vocab
-from ...tokens import Doc
-from ...pipeline import DependencyParser
+from spacy.attrs import NORM
+from spacy.gold import GoldParse
+from spacy.vocab import Vocab
+from spacy.tokens import Doc
+from spacy.pipeline import DependencyParser
numpy.random.seed(0)
@ -37,9 +38,11 @@ def parser(vocab):
    parser.update([doc], [gold], sgd=sgd, losses=losses)
    return parser

def test_init_parser(parser):
    pass

# TODO: This is flakey, because it depends on what the parser first learns.
@pytest.mark.xfail
def test_add_label(parser):
@ -69,4 +72,3 @@ def test_add_label(parser):
    doc = parser(doc)
    assert doc[0].dep_ == 'right'
    assert doc[2].dep_ == 'left'

View File

@ -1,13 +1,14 @@
+# coding: utf8
from __future__ import unicode_literals
-import pytest
-from ...vocab import Vocab
-from ...pipeline import DependencyParser
-from ...tokens import Doc
-from ...gold import GoldParse
-from ...syntax.nonproj import projectivize
-from ...syntax.stateclass import StateClass
-from ...syntax.arc_eager import ArcEager
+import pytest
+from spacy.vocab import Vocab
+from spacy.pipeline import DependencyParser
+from spacy.tokens import Doc
+from spacy.gold import GoldParse
+from spacy.syntax.nonproj import projectivize
+from spacy.syntax.stateclass import StateClass
+from spacy.syntax.arc_eager import ArcEager

def get_sequence_costs(M, words, heads, deps, transitions):

View File

@ -1,23 +0,0 @@
# coding: utf8
from __future__ import unicode_literals
import pytest
from ...language import Language
from ...pipeline import DependencyParser
@pytest.mark.models('en')
def test_beam_parse_en(EN):
doc = EN(u'Australia is a country', disable=['ner'])
ents = EN.entity(doc, beam_width=2)
print(ents)
def test_beam_parse():
nlp = Language()
nlp.add_pipe(DependencyParser(nlp.vocab), name='parser')
nlp.parser.add_label('nsubj')
nlp.parser.begin_training([], token_vector_width=8, hidden_width=8)
doc = nlp.make_doc(u'Australia is a country')
nlp.parser(doc, beam_width=2)

View File

@ -1,11 +1,12 @@
+# coding: utf-8
from __future__ import unicode_literals
import pytest
-from ...vocab import Vocab
-from ...syntax.ner import BiluoPushDown
-from ...gold import GoldParse
-from ...tokens import Doc
+from spacy.pipeline import EntityRecognizer
+from spacy.vocab import Vocab
+from spacy.syntax.ner import BiluoPushDown
+from spacy.gold import GoldParse
+from spacy.tokens import Doc

@pytest.fixture
@ -71,3 +72,16 @@ def test_get_oracle_moves_negative_O(tsys, vocab):
    tsys.preprocess_gold(gold)
    act_classes = tsys.get_oracle_sequence(doc, gold)
    names = [tsys.get_class_name(act) for act in act_classes]

+def test_doc_add_entities_set_ents_iob(en_vocab):
+    doc = Doc(en_vocab, words=["This", "is", "a", "lion"])
+    ner = EntityRecognizer(en_vocab)
+    ner.begin_training([])
+    ner(doc)
+    assert len(list(doc.ents)) == 0
+    assert [w.ent_iob_ for w in doc] == (['O'] * len(doc))
+    doc.ents = [(doc.vocab.strings['ANIMAL'], 3, 4)]
+    assert [w.ent_iob_ for w in doc] == ['', '', '', 'B']
+    doc.ents = [(doc.vocab.strings['WORD'], 0, 2)]
+    assert [w.ent_iob_ for w in doc] == ['B', 'I', '', '']

View File

@ -1,16 +1,13 @@
# coding: utf8
from __future__ import unicode_literals
-from thinc.neural import Model
-import pytest
-import numpy
-from ..._ml import chain, Tok2Vec, doc2feats
-from ...vocab import Vocab
-from ...pipeline import Tensorizer
-from ...syntax.arc_eager import ArcEager
-from ...syntax.nn_parser import Parser
-from ...tokens.doc import Doc
-from ...gold import GoldParse
+import pytest
+from spacy._ml import Tok2Vec
+from spacy.vocab import Vocab
+from spacy.syntax.arc_eager import ArcEager
+from spacy.syntax.nn_parser import Parser
+from spacy.tokens.doc import Doc
+from spacy.gold import GoldParse

@pytest.fixture
@ -37,10 +34,12 @@ def parser(vocab, arc_eager):
def model(arc_eager, tok2vec):
    return Parser.Model(arc_eager.n_moves, token_vector_width=tok2vec.nO)[0]

@pytest.fixture
def doc(vocab):
    return Doc(vocab, words=['a', 'b', 'c'])

@pytest.fixture
def gold(doc):
    return GoldParse(doc, heads=[1, 1, 1], deps=['L', 'ROOT', 'R'])
@ -80,5 +79,3 @@ def test_update_doc_beam(parser, model, doc, gold):
    def optimize(weights, gradient, key=None):
        weights -= 0.001 * gradient
    parser.update_beam([doc], [gold], sgd=optimize)

View File

@ -1,20 +1,23 @@
+# coding: utf8
from __future__ import unicode_literals

import pytest
import numpy
-from thinc.api import layerize
-from ...vocab import Vocab
-from ...syntax.arc_eager import ArcEager
-from ...tokens import Doc
-from ...gold import GoldParse
-from ...syntax._beam_utils import ParserBeam, update_beam
-from ...syntax.stateclass import StateClass
+from spacy.vocab import Vocab
+from spacy.language import Language
+from spacy.pipeline import DependencyParser
+from spacy.syntax.arc_eager import ArcEager
+from spacy.tokens import Doc
+from spacy.syntax._beam_utils import ParserBeam
+from spacy.syntax.stateclass import StateClass
+from spacy.gold import GoldParse

@pytest.fixture
def vocab():
    return Vocab()

@pytest.fixture
def moves(vocab):
    aeager = ArcEager(vocab.strings, {})
@ -65,6 +68,7 @@ def vector_size():
def beam(moves, states, golds, beam_width):
    return ParserBeam(moves, states, golds, width=beam_width, density=0.0)

@pytest.fixture
def scores(moves, batch_size, beam_width):
    return [
@ -85,3 +89,12 @@ def test_beam_advance(beam, scores):
def test_beam_advance_too_few_scores(beam, scores):
    with pytest.raises(IndexError):
        beam.advance(scores[:-1])

+def test_beam_parse():
+    nlp = Language()
+    nlp.add_pipe(DependencyParser(nlp.vocab), name='parser')
+    nlp.parser.add_label('nsubj')
+    nlp.parser.begin_training([], token_vector_width=8, hidden_width=8)
+    doc = nlp.make_doc('Australia is a country')
+    nlp.parser(doc, beam_width=2)

View File

@ -1,35 +1,39 @@
# coding: utf-8
from __future__ import unicode_literals

-from ...syntax.nonproj import ancestors, contains_cycle, is_nonproj_arc
-from ...syntax.nonproj import is_nonproj_tree
-from ...syntax import nonproj
-from ...attrs import DEP, HEAD
-from ..util import get_doc
import pytest
+from spacy.syntax.nonproj import ancestors, contains_cycle, is_nonproj_arc
+from spacy.syntax.nonproj import is_nonproj_tree
+from spacy.syntax import nonproj
+from ..util import get_doc

@pytest.fixture
def tree():
    return [1, 2, 2, 4, 5, 2, 2]

@pytest.fixture
def cyclic_tree():
    return [1, 2, 2, 4, 5, 3, 2]

@pytest.fixture
def partial_tree():
    return [1, 2, 2, 4, 5, None, 7, 4, 2]

@pytest.fixture
def nonproj_tree():
    return [1, 2, 2, 4, 5, 2, 7, 4, 2]

@pytest.fixture
def proj_tree():
    return [1, 2, 2, 4, 5, 2, 7, 5, 2]

@pytest.fixture
def multirooted_tree():
    return [3, 2, 0, 3, 3, 7, 7, 3, 7, 10, 7, 10, 11, 12, 18, 16, 18, 17, 12, 3]
@ -75,14 +79,14 @@ def test_parser_pseudoprojectivity(en_tokenizer):
    def deprojectivize(proj_heads, deco_labels):
        tokens = en_tokenizer('whatever ' * len(proj_heads))
        rel_proj_heads = [head-i for i, head in enumerate(proj_heads)]
-        doc = get_doc(tokens.vocab, [t.text for t in tokens], deps=deco_labels, heads=rel_proj_heads)
+        doc = get_doc(tokens.vocab, words=[t.text for t in tokens],
+                      deps=deco_labels, heads=rel_proj_heads)
        nonproj.deprojectivize(doc)
        return [t.head.i for t in doc], [token.dep_ for token in doc]
    tree = [1, 2, 2]
    nonproj_tree = [1, 2, 2, 4, 5, 2, 7, 4, 2]
    nonproj_tree2 = [9, 1, 3, 1, 5, 6, 9, 8, 6, 1, 6, 12, 13, 10, 1]
    labels = ['det', 'nsubj', 'root', 'det', 'dobj', 'aux', 'nsubj', 'acl', 'punct']
    labels2 = ['advmod', 'root', 'det', 'nsubj', 'advmod', 'det', 'dobj', 'det', 'nmod', 'aux', 'nmod', 'advmod', 'det', 'amod', 'punct']

View File

@ -1,17 +1,17 @@
# coding: utf-8
from __future__ import unicode_literals

-from ..util import get_doc, apply_transition_sequence
import pytest
+from ..util import get_doc, apply_transition_sequence

def test_parser_root(en_tokenizer):
    text = "i don't have other assistance"
    heads = [3, 2, 1, 0, 1, -2]
    deps = ['nsubj', 'aux', 'neg', 'ROOT', 'amod', 'dobj']
    tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
    for t in doc:
        assert t.dep != 0, t.text
@ -20,7 +20,7 @@ def test_parser_root(en_tokenizer):
@pytest.mark.parametrize('text', ["Hello"])
def test_parser_parse_one_word_sentence(en_tokenizer, en_parser, text):
    tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=[0], deps=['ROOT'])
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=[0], deps=['ROOT'])
    assert len(doc) == 1
    with en_parser.step_through(doc) as _:
@ -33,10 +33,8 @@ def test_parser_initial(en_tokenizer, en_parser):
    text = "I ate the pizza with anchovies."
    heads = [1, 0, 1, -2, -3, -1, -5]
    transition = ['L-nsubj', 'S', 'L-det']
    tokens = en_tokenizer(text)
    apply_transition_sequence(en_parser, tokens, transition)
    assert tokens[0].head.i == 1
    assert tokens[1].head.i == 1
    assert tokens[2].head.i == 3
@ -47,8 +45,7 @@ def test_parser_parse_subtrees(en_tokenizer, en_parser):
    text = "The four wheels on the bus turned quickly"
    heads = [2, 1, 4, -1, 1, -2, 0, -1]
    tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
    assert len(list(doc[2].lefts)) == 2
    assert len(list(doc[2].rights)) == 1
    assert len(list(doc[2].children)) == 3
@ -63,11 +60,9 @@ def test_parser_merge_pp(en_tokenizer):
    heads = [1, 4, -1, 1, -2, 0]
    deps = ['det', 'nsubj', 'prep', 'det', 'pobj', 'ROOT']
    tags = ['DT', 'NN', 'IN', 'DT', 'NN', 'VBZ']
    tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], deps=deps, heads=heads, tags=tags)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], deps=deps, heads=heads, tags=tags)
    nps = [(np[0].idx, np[-1].idx + len(np[-1]), np.lemma_) for np in doc.noun_chunks]
    for start, end, lemma in nps:
        doc.merge(start, end, label='NP', lemma=lemma)
    assert doc[0].text == 'A phrase'

View File

@ -1,14 +1,14 @@
# coding: utf-8
from __future__ import unicode_literals

-from ..util import get_doc
import pytest
+from ..util import get_doc

@pytest.fixture
def text():
-    return u"""
+    return """
It was a bright cold day in April, and the clocks were striking thirteen.
Winston Smith, his chin nuzzled into his breast in an effort to escape the
vile wind, slipped quickly through the glass doors of Victory Mansions,
@ -54,7 +54,7 @@ def heads():
def test_parser_parse_navigate_consistency(en_tokenizer, text, heads):
    tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
    for head in doc:
        for child in head.lefts:
            assert child.head == head
@ -64,7 +64,7 @@ def test_parser_parse_navigate_consistency(en_tokenizer, text, heads):
def test_parser_parse_navigate_child_consistency(en_tokenizer, text, heads):
    tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
    lefts = {}
    rights = {}
@ -97,7 +97,7 @@ def test_parser_parse_navigate_child_consistency(en_tokenizer, text, heads):
def test_parser_parse_navigate_edges(en_tokenizer, text, heads):
    tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
    for token in doc:
        subtree = list(token.subtree)
        debug = '\t'.join((token.text, token.left_edge.text, subtree[0].text))

View File

@ -1,19 +1,21 @@
-'''Test that the parser respects preset sentence boundaries.'''
+# coding: utf8
from __future__ import unicode_literals
import pytest
from thinc.neural.optimizers import Adam
from thinc.neural.ops import NumpyOps
-from ...attrs import NORM
-from ...gold import GoldParse
-from ...vocab import Vocab
-from ...tokens import Doc
-from ...pipeline import DependencyParser
+from spacy.attrs import NORM
+from spacy.gold import GoldParse
+from spacy.vocab import Vocab
+from spacy.tokens import Doc
+from spacy.pipeline import DependencyParser

@pytest.fixture
def vocab():
    return Vocab(lex_attr_getters={NORM: lambda s: s})

@pytest.fixture
def parser(vocab):
    parser = DependencyParser(vocab)
@ -32,6 +34,7 @@ def parser(vocab):
    parser.update([doc], [gold], sgd=sgd, losses=losses)
    return parser

def test_no_sentences(parser):
    doc = Doc(parser.vocab, words=['a', 'b', 'c', 'd'])
    doc = parser(doc)

View File

@ -1,19 +1,18 @@
# coding: utf-8
from __future__ import unicode_literals

-from ...tokens.doc import Doc
-from ...attrs import HEAD
-from ..util import get_doc, apply_transition_sequence
import pytest
+from spacy.tokens.doc import Doc
+from ..util import get_doc, apply_transition_sequence

def test_parser_space_attachment(en_tokenizer):
    text = "This is a test.\nTo ensure spaces are attached well."
    heads = [1, 0, 1, -2, -3, -1, 1, 4, -1, 2, 1, 0, -1, -2]
    tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
    for sent in doc.sents:
        if len(sent) == 1:
            assert not sent[-1].is_space
@ -26,7 +25,7 @@ def test_parser_sentence_space(en_tokenizer):
            'nsubjpass', 'aux', 'auxpass', 'ROOT', 'nsubj', 'aux', 'ccomp',
            'poss', 'nsubj', 'ccomp', 'punct']
    tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
    assert len(list(doc.sents)) == 2
@ -35,7 +34,7 @@ def test_parser_space_attachment_leading(en_tokenizer, en_parser):
    text = "\t \n This is a sentence ."
    heads = [1, 1, 0, 1, -2, -3]
    tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, text.split(' '), heads=heads)
+    doc = get_doc(tokens.vocab, words=text.split(' '), heads=heads)
    assert doc[0].is_space
    assert doc[1].is_space
    assert doc[2].text == 'This'
@ -52,7 +51,7 @@ def test_parser_space_attachment_intermediate_trailing(en_tokenizer, en_parser):
    heads = [1, 0, -1, 2, -1, -4, -5, -1]
    transition = ['L-nsubj', 'S', 'L-det', 'R-attr', 'D', 'R-punct']
    tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, text.split(' '), heads=heads)
+    doc = get_doc(tokens.vocab, words=text.split(' '), heads=heads)
    assert doc[2].is_space
    assert doc[4].is_space
    assert doc[5].is_space

View File

@ -1,28 +0,0 @@
import pytest
from ...pipeline import DependencyParser
@pytest.fixture
def parser(en_vocab):
parser = DependencyParser(en_vocab)
parser.add_label('nsubj')
parser.model, cfg = parser.Model(parser.moves.n_moves)
parser.cfg.update(cfg)
return parser
@pytest.fixture
def blank_parser(en_vocab):
parser = DependencyParser(en_vocab)
return parser
def test_to_from_bytes(parser, blank_parser):
assert parser.model is not True
assert blank_parser.model is True
assert blank_parser.moves.n_moves != parser.moves.n_moves
bytes_data = parser.to_bytes()
blank_parser.from_bytes(bytes_data)
assert blank_parser.model is not True
assert blank_parser.moves.n_moves == parser.moves.n_moves

View File

@ -2,10 +2,9 @@
from __future__ import unicode_literals

import pytest
-from ...tokens import Span
-from ...language import Language
-from ...pipeline import EntityRuler
+from spacy.tokens import Span
+from spacy.language import Language
+from spacy.pipeline import EntityRuler

@pytest.fixture

View File

@ -2,11 +2,11 @@
from __future__ import unicode_literals

import pytest
+from spacy.language import Language
+from spacy.tokens import Span
from ..util import get_doc
-from ...language import Language
-from ...tokens import Span
-from ... import util

@pytest.fixture
def doc(en_tokenizer):
@ -16,7 +16,7 @@ def doc(en_tokenizer):
    pos = ['PRON', 'VERB', 'PROPN', 'PROPN', 'ADP', 'PROPN', 'PUNCT']
    deps = ['ROOT', 'prep', 'compound', 'pobj', 'prep', 'pobj', 'punct']
    tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads,
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads,
                  tags=tags, pos=pos, deps=deps)
    doc.ents = [Span(doc, 2, 4, doc.vocab.strings['GPE'])]
    doc.is_parsed = True

View File

@ -2,8 +2,7 @@
from __future__ import unicode_literals

import pytest
-from ...language import Language
+from spacy.language import Language

@pytest.fixture

View File

@ -1,7 +1,13 @@
# coding: utf8
from __future__ import unicode_literals

-from ...language import Language
+import pytest
+import random
+import numpy.random
+from spacy.language import Language
+from spacy.pipeline import TextCategorizer
+from spacy.tokens import Doc
+from spacy.gold import GoldParse

def test_simple_train():
@ -13,6 +19,40 @@ def test_simple_train():
    for text, answer in [('aaaa', 1.), ('bbbb', 0), ('aa', 1.),
                         ('bbbbbbbbb', 0.), ('aaaaaa', 1)]:
        nlp.update([text], [{'cats': {'answer': answer}}])
-    doc = nlp(u'aaa')
+    doc = nlp('aaa')
    assert 'answer' in doc.cats
    assert doc.cats['answer'] >= 0.5

+@pytest.mark.skip(reason="Test is flakey when run with others")
+def test_textcat_learns_multilabel():
+    random.seed(5)
+    numpy.random.seed(5)
+    docs = []
+    nlp = Language()
+    letters = ['a', 'b', 'c']
+    for w1 in letters:
+        for w2 in letters:
+            cats = {letter: float(w2==letter) for letter in letters}
+            docs.append((Doc(nlp.vocab, words=['d']*3 + [w1, w2] + ['d']*3), cats))
+    random.shuffle(docs)
+    model = TextCategorizer(nlp.vocab, width=8)
+    for letter in letters:
+        model.add_label(letter)
+    optimizer = model.begin_training()
+    for i in range(30):
+        losses = {}
+        Ys = [GoldParse(doc, cats=cats) for doc, cats in docs]
+        Xs = [doc for doc, cats in docs]
+        model.update(Xs, Ys, sgd=optimizer, losses=losses)
+        random.shuffle(docs)
+    for w1 in letters:
+        for w2 in letters:
+            doc = Doc(nlp.vocab, words=['d']*3 + [w1, w2] + ['d']*3)
+            truth = {letter: w2==letter for letter in letters}
+            model(doc)
+            for cat, score in doc.cats.items():
+                if not truth[cat]:
+                    assert score < 0.5
+                else:
+                    assert score > 0.5

View File

@ -0,0 +1,420 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
import random
from spacy.matcher import Matcher
from spacy.attrs import IS_PUNCT, ORTH, LOWER
from spacy.symbols import POS, VERB, VerbForm_inf
from spacy.vocab import Vocab
from spacy.language import Language
from spacy.lemmatizer import Lemmatizer
from spacy.tokens import Doc
from ..util import get_doc, make_tempdir
@pytest.mark.parametrize('patterns', [
[[{'LOWER': 'celtics'}], [{'LOWER': 'boston'}, {'LOWER': 'celtics'}]],
[[{'LOWER': 'boston'}, {'LOWER': 'celtics'}], [{'LOWER': 'celtics'}]]])
def test_issue118(en_tokenizer, patterns):
"""Test a bug that arose from having overlapping matches"""
text = "how many points did lebron james score against the boston celtics last night"
doc = en_tokenizer(text)
ORG = doc.vocab.strings['ORG']
matcher = Matcher(doc.vocab)
matcher.add("BostonCeltics", None, *patterns)
assert len(list(doc.ents)) == 0
matches = [(ORG, start, end) for _, start, end in matcher(doc)]
assert matches == [(ORG, 9, 11), (ORG, 10, 11)]
doc.ents = matches[:1]
ents = list(doc.ents)
assert len(ents) == 1
assert ents[0].label == ORG
assert ents[0].start == 9
assert ents[0].end == 11
@pytest.mark.parametrize('patterns', [
[[{'LOWER': 'boston'}], [{'LOWER': 'boston'}, {'LOWER': 'celtics'}]],
[[{'LOWER': 'boston'}, {'LOWER': 'celtics'}], [{'LOWER': 'boston'}]]])
def test_issue118_prefix_reorder(en_tokenizer, patterns):
"""Test a bug that arose from having overlapping matches"""
text = "how many points did lebron james score against the boston celtics last night"
doc = en_tokenizer(text)
ORG = doc.vocab.strings['ORG']
matcher = Matcher(doc.vocab)
matcher.add('BostonCeltics', None, *patterns)
assert len(list(doc.ents)) == 0
matches = [(ORG, start, end) for _, start, end in matcher(doc)]
doc.ents += tuple(matches)[1:]
assert matches == [(ORG, 9, 10), (ORG, 9, 11)]
ents = doc.ents
assert len(ents) == 1
assert ents[0].label == ORG
assert ents[0].start == 9
assert ents[0].end == 11
def test_issue242(en_tokenizer):
"""Test overlapping multi-word phrases."""
text = "There are different food safety standards in different countries."
patterns = [[{'LOWER': 'food'}, {'LOWER': 'safety'}],
[{'LOWER': 'safety'}, {'LOWER': 'standards'}]]
doc = en_tokenizer(text)
matcher = Matcher(doc.vocab)
matcher.add('FOOD', None, *patterns)
matches = [(ent_type, start, end) for ent_type, start, end in matcher(doc)]
doc.ents += tuple(matches)
match1, match2 = matches
assert match1[1] == 3
assert match1[2] == 5
assert match2[1] == 4
assert match2[2] == 6
def test_issue309(en_tokenizer):
"""Test Issue #309: SBD fails on empty string"""
tokens = en_tokenizer(" ")
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=[0], deps=['ROOT'])
doc.is_parsed = True
assert len(doc) == 1
sents = list(doc.sents)
assert len(sents) == 1
def test_issue351(en_tokenizer):
doc = en_tokenizer(" This is a cat.")
assert doc[0].idx == 0
assert len(doc[0]) == 3
assert doc[1].idx == 3
def test_issue360(en_tokenizer):
"""Test tokenization of big ellipsis"""
tokens = en_tokenizer('$45...............Asking')
assert len(tokens) > 2
@pytest.mark.parametrize('text1,text2', [("cat", "dog")])
def test_issue361(en_vocab, text1, text2):
"""Test Issue #361: Equality of lexemes"""
assert en_vocab[text1] == en_vocab[text1]
assert en_vocab[text1] != en_vocab[text2]
def test_issue587(en_tokenizer):
"""Test that Matcher doesn't segfault on particular input"""
doc = en_tokenizer('a b; c')
matcher = Matcher(doc.vocab)
matcher.add('TEST1', None, [{ORTH: 'a'}, {ORTH: 'b'}])
matches = matcher(doc)
assert len(matches) == 1
matcher.add('TEST2', None, [{ORTH: 'a'}, {ORTH: 'b'}, {IS_PUNCT: True}, {ORTH: 'c'}])
matches = matcher(doc)
assert len(matches) == 2
matcher.add('TEST3', None, [{ORTH: 'a'}, {ORTH: 'b'}, {IS_PUNCT: True}, {ORTH: 'd'}])
matches = matcher(doc)
assert len(matches) == 2
def test_issue588(en_vocab):
matcher = Matcher(en_vocab)
with pytest.raises(ValueError):
matcher.add('TEST', None, [])
@pytest.mark.xfail
def test_issue589():
vocab = Vocab()
vocab.strings.set_frozen(True)
doc = Doc(vocab, words=['whata'])
def test_issue590(en_vocab):
"""Test overlapping matches"""
doc = Doc(en_vocab, words=['n', '=', '1', ';', 'a', ':', '5', '%'])
matcher = Matcher(en_vocab)
matcher.add('ab', None, [{'IS_ALPHA': True}, {'ORTH': ':'}, {'LIKE_NUM': True}, {'ORTH': '%'}])
matcher.add('ab', None, [{'IS_ALPHA': True}, {'ORTH': '='}, {'LIKE_NUM': True}])
matches = matcher(doc)
assert len(matches) == 2
def test_issue595():
"""Test lemmatization of base forms"""
words = ["Do", "n't", "feed", "the", "dog"]
tag_map = {'VB': {POS: VERB, VerbForm_inf: True}}
rules = {"verb": [["ed", "e"]]}
lemmatizer = Lemmatizer({'verb': {}}, {'verb': {}}, rules)
vocab = Vocab(lemmatizer=lemmatizer, tag_map=tag_map)
doc = Doc(vocab, words=words)
doc[2].tag_ = 'VB'
assert doc[2].text == 'feed'
assert doc[2].lemma_ == 'feed'
def test_issue599(en_vocab):
doc = Doc(en_vocab)
doc.is_tagged = True
doc.is_parsed = True
doc2 = Doc(doc.vocab)
doc2.from_bytes(doc.to_bytes())
assert doc2.is_parsed
def test_issue600():
vocab = Vocab(tag_map={'NN': {'pos': 'NOUN'}})
doc = Doc(vocab, words=["hello"])
doc[0].tag_ = 'NN'
def test_issue615(en_tokenizer):
def merge_phrases(matcher, doc, i, matches):
"""Merge a phrase. We have to be careful here because we'll change the
token indices. To avoid problems, merge all the phrases once we're called
on the last match."""
if i != len(matches)-1:
return None
spans = [(ent_id, ent_id, doc[start : end]) for ent_id, start, end in matches]
for ent_id, label, span in spans:
span.merge(tag='NNP' if label else span.root.tag_, lemma=span.text,
label=label)
doc.ents = doc.ents + ((label, span.start, span.end),)
text = "The golf club is broken"
pattern = [{'ORTH': "golf"}, {'ORTH': "club"}]
label = "Sport_Equipment"
doc = en_tokenizer(text)
matcher = Matcher(doc.vocab)
matcher.add(label, merge_phrases, pattern)
match = matcher(doc)
entities = list(doc.ents)
assert entities != []
assert entities[0].label != 0
@pytest.mark.parametrize('text,number', [("7am", "7"), ("11p.m.", "11")])
def test_issue736(en_tokenizer, text, number):
"""Test that times like "7am" are tokenized correctly and that numbers are
converted to string."""
tokens = en_tokenizer(text)
assert len(tokens) == 2
assert tokens[0].text == number
@pytest.mark.parametrize('text', ["3/4/2012", "01/12/1900"])
def test_issue740(en_tokenizer, text):
"""Test that dates are not split and kept as one token. This behaviour is
currently inconsistent, since dates separated by hyphens are still split.
This will be hard to prevent without causing clashes with numeric ranges."""
tokens = en_tokenizer(text)
assert len(tokens) == 1
def test_issue743():
doc = Doc(Vocab(), ['hello', 'world'])
token = doc[0]
s = set([token])
items = list(s)
assert items[0] is token
@pytest.mark.parametrize('text', ["We were scared", "We Were Scared"])
def test_issue744(en_tokenizer, text):
"""Test that 'were' and 'Were' are excluded from the contractions
generated by the English tokenizer exceptions."""
tokens = en_tokenizer(text)
assert len(tokens) == 3
assert tokens[1].text.lower() == "were"
@pytest.mark.parametrize('text,is_num', [("one", True), ("ten", True),
("teneleven", False)])
def test_issue759(en_tokenizer, text, is_num):
tokens = en_tokenizer(text)
assert tokens[0].like_num == is_num
@pytest.mark.parametrize('text', ["Shell", "shell", "Shed", "shed"])
def test_issue775(en_tokenizer, text):
"""Test that 'Shell' and 'shell' are excluded from the contractions
generated by the English tokenizer exceptions."""
tokens = en_tokenizer(text)
assert len(tokens) == 1
assert tokens[0].text == text
@pytest.mark.parametrize('text', ["This is a string ", "This is a string\u0020"])
def test_issue792(en_tokenizer, text):
"""Test for Issue #792: Trailing whitespace is removed after tokenization."""
doc = en_tokenizer(text)
assert ''.join([token.text_with_ws for token in doc]) == text
@pytest.mark.parametrize('text', ["This is a string", "This is a string\n"])
def test_control_issue792(en_tokenizer, text):
"""Test base case for Issue #792: Non-trailing whitespace"""
doc = en_tokenizer(text)
assert ''.join([token.text_with_ws for token in doc]) == text
@pytest.mark.parametrize('text,tokens', [
('"deserve,"--and', ['"', "deserve", ',"--', "and"]),
("exception;--exclusive", ["exception", ";--", "exclusive"]),
("day.--Is", ["day", ".--", "Is"]),
("refinement:--just", ["refinement", ":--", "just"]),
("memories?--To", ["memories", "?--", "To"]),
("Useful.=--Therefore", ["Useful", ".=--", "Therefore"]),
("=Hope.=--Pandora", ["=", "Hope", ".=--", "Pandora"])])
def test_issue801(en_tokenizer, text, tokens):
"""Test that special characters + hyphens are split correctly."""
doc = en_tokenizer(text)
assert len(doc) == len(tokens)
assert [t.text for t in doc] == tokens
@pytest.mark.parametrize('text,expected_tokens', [
('Smörsåsen används bl.a. till fisk', ['Smörsåsen', 'används', 'bl.a.', 'till', 'fisk']),
('Jag kommer först kl. 13 p.g.a. diverse förseningar', ['Jag', 'kommer', 'först', 'kl.', '13', 'p.g.a.', 'diverse', 'förseningar'])
])
def test_issue805(sv_tokenizer, text, expected_tokens):
tokens = sv_tokenizer(text)
token_list = [token.text for token in tokens if not token.is_space]
assert expected_tokens == token_list
def test_issue850():
"""The variable-length pattern matches the succeeding token. Check we
handle the ambiguity correctly."""
vocab = Vocab(lex_attr_getters={LOWER: lambda string: string.lower()})
matcher = Matcher(vocab)
IS_ANY_TOKEN = matcher.vocab.add_flag(lambda x: True)
pattern = [{'LOWER': "bob"}, {'OP': '*', 'IS_ANY_TOKEN': True}, {'LOWER': 'frank'}]
matcher.add('FarAway', None, pattern)
doc = Doc(matcher.vocab, words=['bob', 'and', 'and', 'frank'])
match = matcher(doc)
assert len(match) == 1
ent_id, start, end = match[0]
assert start == 0
assert end == 4
def test_issue850_basic():
"""Test Matcher matches with '*' operator and Boolean flag"""
vocab = Vocab(lex_attr_getters={LOWER: lambda string: string.lower()})
matcher = Matcher(vocab)
IS_ANY_TOKEN = matcher.vocab.add_flag(lambda x: True)
pattern = [{'LOWER': "bob"}, {'OP': '*', 'LOWER': 'and'}, {'LOWER': 'frank'}]
matcher.add('FarAway', None, pattern)
doc = Doc(matcher.vocab, words=['bob', 'and', 'and', 'frank'])
match = matcher(doc)
assert len(match) == 1
ent_id, start, end = match[0]
assert start == 0
assert end == 4
@pytest.mark.parametrize('text', ["au-delàs", "pair-programmâmes",
"terra-formées", "σ-compacts"])
def test_issue852(fr_tokenizer, text):
"""Test that French tokenizer exceptions are imported correctly."""
tokens = fr_tokenizer(text)
assert len(tokens) == 1
@pytest.mark.parametrize('text', ["aaabbb@ccc.com\nThank you!",
"aaabbb@ccc.com \nThank you!"])
def test_issue859(en_tokenizer, text):
"""Test that no extra space is added in doc.text method."""
doc = en_tokenizer(text)
assert doc.text == text
@pytest.mark.parametrize('text', ["Datum:2014-06-02\nDokument:76467"])
def test_issue886(en_tokenizer, text):
"""Test that token.idx matches the original text index for texts with newlines."""
doc = en_tokenizer(text)
for token in doc:
assert len(token.text) == len(token.text_with_ws)
assert text[token.idx] == token.text[0]
@pytest.mark.parametrize('text', ["want/need"])
def test_issue891(en_tokenizer, text):
"""Test that / infixes are split correctly."""
tokens = en_tokenizer(text)
assert len(tokens) == 3
assert tokens[1].text == "/"
@pytest.mark.parametrize('text,tag,lemma', [
("anus", "NN", "anus"),
("princess", "NN", "princess"),
("inner", "JJ", "inner")
])
def test_issue912(en_vocab, text, tag, lemma):
"""Test base-forms are preserved."""
doc = Doc(en_vocab, words=[text])
doc[0].tag_ = tag
assert doc[0].lemma_ == lemma
def test_issue957(en_tokenizer):
"""Test that spaCy doesn't hang on many periods."""
# skip test if pytest-timeout is not installed
timeout = pytest.importorskip('pytest-timeout')
string = '0'
for i in range(1, 100):
string += '.%d' % i
doc = en_tokenizer(string)
@pytest.mark.xfail
def test_issue999(train_data):
"""Test that adding entities and resuming training works passably OK.
There are two issues here:
1) We have to readd labels. This isn't very nice.
2) There's no way to set the learning rate for the weight update, so we
end up out-of-scale, causing it to learn too fast.
"""
TRAIN_DATA = [
["hey", []],
["howdy", []],
["hey there", []],
["hello", []],
["hi", []],
["i'm looking for a place to eat", []],
["i'm looking for a place in the north of town", [[31,36,"LOCATION"]]],
["show me chinese restaurants", [[8,15,"CUISINE"]]],
["show me chines restaurants", [[8,14,"CUISINE"]]],
]
nlp = Language()
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)
for _, offsets in TRAIN_DATA:
for start, end, label in offsets:
ner.add_label(label)
nlp.begin_training()
ner.model.learn_rate = 0.001
for itn in range(100):
random.shuffle(TRAIN_DATA)
for raw_text, entity_offsets in TRAIN_DATA:
nlp.update([raw_text], [{'entities': entity_offsets}])
with make_tempdir() as model_dir:
nlp.to_disk(model_dir)
nlp2 = Language().from_disk(model_dir)
for raw_text, entity_offsets in TRAIN_DATA:
doc = nlp2(raw_text)
ents = {(ent.start_char, ent.end_char): ent.label_ for ent in doc.ents}
for start, end, label in entity_offsets:
if (start, end) in ents:
assert ents[(start, end)] == label
break
else:
if entity_offsets:
raise Exception(ents)

View File

@ -0,0 +1,127 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
import re
from spacy.tokens import Doc
from spacy.vocab import Vocab
from spacy.lang.en import English
from spacy.lang.lex_attrs import LEX_ATTRS
from spacy.matcher import Matcher
from spacy.tokenizer import Tokenizer
from spacy.lemmatizer import Lemmatizer
from spacy.symbols import ORTH, LEMMA, POS, VERB, VerbForm_part
def test_issue1242():
nlp = English()
doc = nlp('')
assert len(doc) == 0
docs = list(nlp.pipe(['', 'hello']))
assert len(docs[0]) == 0
assert len(docs[1]) == 1
def test_issue1250():
"""Test cached special cases."""
special_case = [{ORTH: 'reimbur', LEMMA: 'reimburse', POS: 'VERB'}]
nlp = English()
nlp.tokenizer.add_special_case('reimbur', special_case)
lemmas = [w.lemma_ for w in nlp('reimbur, reimbur...')]
assert lemmas == ['reimburse', ',', 'reimburse', '...']
lemmas = [w.lemma_ for w in nlp('reimbur, reimbur...')]
assert lemmas == ['reimburse', ',', 'reimburse', '...']
def test_issue1257():
"""Test that tokens compare correctly."""
doc1 = Doc(Vocab(), words=['a', 'b', 'c'])
doc2 = Doc(Vocab(), words=['a', 'c', 'e'])
assert doc1[0] != doc2[0]
assert not doc1[0] == doc2[0]
def test_issue1375():
"""Test that token.nbor() raises IndexError for out-of-bounds access."""
doc = Doc(Vocab(), words=['0', '1', '2'])
with pytest.raises(IndexError):
assert doc[0].nbor(-1)
assert doc[1].nbor(-1).text == '0'
with pytest.raises(IndexError):
assert doc[2].nbor(1)
assert doc[1].nbor(1).text == '2'
def test_issue1387():
tag_map = {'VBG': {POS: VERB, VerbForm_part: True}}
index = {"verb": ("cope","cop")}
exc = {"verb": {"coping": ("cope",)}}
rules = {"verb": [["ing", ""]]}
lemmatizer = Lemmatizer(index, exc, rules)
vocab = Vocab(lemmatizer=lemmatizer, tag_map=tag_map)
doc = Doc(vocab, words=["coping"])
doc[0].tag_ = 'VBG'
assert doc[0].text == "coping"
assert doc[0].lemma_ == "cope"
def test_issue1434():
"""Test matches occur when optional element at end of short doc."""
pattern = [{'ORTH': 'Hello' }, {'IS_ALPHA': True, 'OP': '?'}]
vocab = Vocab(lex_attr_getters=LEX_ATTRS)
hello_world = Doc(vocab, words=['Hello', 'World'])
hello = Doc(vocab, words=['Hello'])
matcher = Matcher(vocab)
matcher.add('MyMatcher', None, pattern)
matches = matcher(hello_world)
assert matches
matches = matcher(hello)
assert matches
@pytest.mark.parametrize('string,start,end', [
('a', 0, 1), ('a b', 0, 2), ('a c', 0, 1), ('a b c', 0, 2),
('a b b c', 0, 3), ('a b b', 0, 3),])
def test_issue1450(string, start, end):
"""Test matcher works when patterns end with * operator."""
pattern = [{'ORTH': "a"}, {'ORTH': "b", 'OP': "*"}]
matcher = Matcher(Vocab())
matcher.add("TSTEND", None, pattern)
doc = Doc(Vocab(), words=string.split())
matches = matcher(doc)
if start is None or end is None:
assert matches == []
assert matches[-1][1] == start
assert matches[-1][2] == end
def test_issue1488():
prefix_re = re.compile(r'''[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']''')
infix_re = re.compile(r'''[-~\.]''')
simple_url_re = re.compile(r'''^https?://''')
def my_tokenizer(nlp):
return Tokenizer(nlp.vocab, {},
prefix_search=prefix_re.search,
suffix_search=suffix_re.search,
infix_finditer=infix_re.finditer,
token_match=simple_url_re.match)
nlp = English()
nlp.tokenizer = my_tokenizer(nlp)
doc = nlp("This is a test.")
for token in doc:
assert token.text
def test_issue1494():
infix_re = re.compile(r'''[^a-z]''')
test_cases = [('token 123test', ['token', '1', '2', '3', 'test']),
('token 1test', ['token', '1test']),
('hello...test', ['hello', '.', '.', '.', 'test'])]
new_tokenizer = lambda nlp: Tokenizer(nlp.vocab, {}, infix_finditer=infix_re.finditer)
nlp = English()
nlp.tokenizer = new_tokenizer(nlp)
for text, expected in test_cases:
assert [token.text for token in nlp(text)] == expected

View File

@ -1,55 +0,0 @@
# coding: utf-8
from __future__ import unicode_literals
from ...matcher import Matcher
import pytest
pattern1 = [[{'LOWER': 'celtics'}], [{'LOWER': 'boston'}, {'LOWER': 'celtics'}]]
pattern2 = [[{'LOWER': 'boston'}, {'LOWER': 'celtics'}], [{'LOWER': 'celtics'}]]
pattern3 = [[{'LOWER': 'boston'}], [{'LOWER': 'boston'}, {'LOWER': 'celtics'}]]
pattern4 = [[{'LOWER': 'boston'}, {'LOWER': 'celtics'}], [{'LOWER': 'boston'}]]
@pytest.fixture
def doc(en_tokenizer):
text = "how many points did lebron james score against the boston celtics last night"
doc = en_tokenizer(text)
return doc
@pytest.mark.parametrize('pattern', [pattern1, pattern2])
def test_issue118(doc, pattern):
"""Test a bug that arose from having overlapping matches"""
ORG = doc.vocab.strings['ORG']
matcher = Matcher(doc.vocab)
matcher.add("BostonCeltics", None, *pattern)
assert len(list(doc.ents)) == 0
matches = [(ORG, start, end) for _, start, end in matcher(doc)]
assert matches == [(ORG, 9, 11), (ORG, 10, 11)]
doc.ents = matches[:1]
ents = list(doc.ents)
assert len(ents) == 1
assert ents[0].label == ORG
assert ents[0].start == 9
assert ents[0].end == 11
@pytest.mark.parametrize('pattern', [pattern3, pattern4])
def test_issue118_prefix_reorder(doc, pattern):
"""Test a bug that arose from having overlapping matches"""
ORG = doc.vocab.strings['ORG']
matcher = Matcher(doc.vocab)
matcher.add('BostonCeltics', None, *pattern)
assert len(list(doc.ents)) == 0
matches = [(ORG, start, end) for _, start, end in matcher(doc)]
doc.ents += tuple(matches)[1:]
assert matches == [(ORG, 9, 10), (ORG, 9, 11)]
ents = doc.ents
assert len(ents) == 1
assert ents[0].label == ORG
assert ents[0].start == 9
assert ents[0].end == 11

View File

@ -1,13 +0,0 @@
from __future__ import unicode_literals
import pytest
@pytest.mark.models('en')
def test_issue1207(EN):
text = 'Employees are recruiting talented staffers from overseas.'
doc = EN(text)
assert [i.text for i in doc.noun_chunks] == ['Employees', 'talented staffers']
sent = list(doc.sents)[0]
assert [i.text for i in sent.noun_chunks] == ['Employees', 'talented staffers']

View File

@ -1,23 +0,0 @@
from __future__ import unicode_literals
import pytest
from ...lang.en import English
from ...util import load_model
def test_issue1242_empty_strings():
nlp = English()
doc = nlp('')
assert len(doc) == 0
docs = list(nlp.pipe(['', 'hello']))
assert len(docs[0]) == 0
assert len(docs[1]) == 1
@pytest.mark.models('en')
def test_issue1242_empty_strings_en_core_web_sm():
nlp = load_model('en_core_web_sm')
doc = nlp('')
assert len(doc) == 0
docs = list(nlp.pipe(['', 'hello']))
assert len(docs[0]) == 0
assert len(docs[1]) == 1

View File

@ -1,13 +0,0 @@
from __future__ import unicode_literals
from ...tokenizer import Tokenizer
from ...symbols import ORTH, LEMMA, POS
from ...lang.en import English
def test_issue1250_cached_special_cases():
nlp = English()
nlp.tokenizer.add_special_case(u'reimbur', [{ORTH: u'reimbur', LEMMA: u'reimburse', POS: u'VERB'}])
lemmas = [w.lemma_ for w in nlp(u'reimbur, reimbur...')]
assert lemmas == ['reimburse', ',', 'reimburse', '...']
lemmas = [w.lemma_ for w in nlp(u'reimbur, reimbur...')]
assert lemmas == ['reimburse', ',', 'reimburse', '...']

Some files were not shown because too many files have changed in this diff.