mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-12 10:16:27 +03:00
💫 Refactor test suite (#2568)
## Description

Related issues: #2379 (should be fixed by separating model tests)

* **total execution time down from > 300 seconds to under 60 seconds** 🎉
* removed all model-specific tests that could only really be run manually anyway – those will now live in a separate test suite in the [`spacy-models`](https://github.com/explosion/spacy-models) repository and are already integrated into our new model training infrastructure
* changed all relative imports to absolute imports to prepare for moving the test suite from `/spacy/tests` to `/tests` (it'll now always test against the installed version)
* merged old regression tests into collections, e.g. `test_issue1001-1500.py` (about 90% of the regression tests are very short anyways)
* tidied up and rewrote existing tests wherever possible

### Todo

- [ ] move tests to `/tests` and adjust CI commands accordingly
- [x] move model test suite from internal repo to `spacy-models`
- [x] ~~investigate why `pipeline/test_textcat.py` is flakey~~
- [x] review old regression tests (leftover files) and see if they can be merged, simplified or deleted
- [ ] update documentation on how to run tests

### Types of change

enhancement, tests

## Checklist

<!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.
This commit is contained in:
parent
82277f63a3
commit
75f3234404
1
.gitignore
vendored
|
@ -36,6 +36,7 @@ venv/
|
||||||
.dev
|
.dev
|
||||||
.denv
|
.denv
|
||||||
.pypyenv
|
.pypyenv
|
||||||
|
.pytest_cache/
|
||||||
|
|
||||||
# Distribution / packaging
|
# Distribution / packaging
|
||||||
env/
|
env/
|
||||||
|
|
|
@ -11,5 +11,6 @@ dill>=0.2,<0.3
|
||||||
regex==2017.4.5
|
regex==2017.4.5
|
||||||
requests>=2.13.0,<3.0.0
|
requests>=2.13.0,<3.0.0
|
||||||
pytest>=3.6.0,<4.0.0
|
pytest>=3.6.0,<4.0.0
|
||||||
|
pytest-timeout>=1.3.0,<2.0.0
|
||||||
mock>=2.0.0,<3.0.0
|
mock>=2.0.0,<3.0.0
|
||||||
pathlib==1.0.1; python_version < "3.4"
|
pathlib==1.0.1; python_version < "3.4"
|
||||||
|
|
|
@ -6,6 +6,7 @@ spaCy uses the [pytest](http://doc.pytest.org/) framework for testing. For more
|
||||||
|
|
||||||
Tests for spaCy modules and classes live in their own directories of the same name. For example, tests for the `Tokenizer` can be found in [`/tests/tokenizer`](tokenizer). All test modules (i.e. directories) also need to be listed in spaCy's [`setup.py`](../setup.py). To be interpreted and run, all test files and test functions need to be prefixed with `test_`.
|
Tests for spaCy modules and classes live in their own directories of the same name. For example, tests for the `Tokenizer` can be found in [`/tests/tokenizer`](tokenizer). All test modules (i.e. directories) also need to be listed in spaCy's [`setup.py`](../setup.py). To be interpreted and run, all test files and test functions need to be prefixed with `test_`.
|
||||||
|
|
||||||
|
> ⚠️ **Important note:** As part of our new model training infrastructure, we've moved all model tests to the [`spacy-models`](https://github.com/explosion/spacy-models) repository. This allows us to test the models separately from the core library functionality.
|
||||||
|
|
||||||
## Table of contents
|
## Table of contents
|
||||||
|
|
||||||
|
@ -13,9 +14,8 @@ Tests for spaCy modules and classes live in their own directories of the same na
|
||||||
2. [Dos and don'ts](#dos-and-donts)
|
2. [Dos and don'ts](#dos-and-donts)
|
||||||
3. [Parameters](#parameters)
|
3. [Parameters](#parameters)
|
||||||
4. [Fixtures](#fixtures)
|
4. [Fixtures](#fixtures)
|
||||||
5. [Testing models](#testing-models)
|
5. [Helpers and utilities](#helpers-and-utilities)
|
||||||
6. [Helpers and utilities](#helpers-and-utilities)
|
6. [Contributing to the tests](#contributing-to-the-tests)
|
||||||
7. [Contributing to the tests](#contributing-to-the-tests)
|
|
||||||
|
|
||||||
|
|
||||||
## Running the tests
|
## Running the tests
|
||||||
|
@ -25,10 +25,7 @@ first failure, run them with `py.test -x`.
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
py.test spacy # run basic tests
|
py.test spacy # run basic tests
|
||||||
py.test spacy --models --en # run basic and English model tests
|
|
||||||
py.test spacy --models --all # run basic and all model tests
|
|
||||||
py.test spacy --slow # run basic and slow tests
|
py.test spacy --slow # run basic and slow tests
|
||||||
py.test spacy --models --all --slow # run all tests
|
|
||||||
```
|
```
|
||||||
|
|
||||||
You can also run tests in a specific file or directory, or even only one
|
You can also run tests in a specific file or directory, or even only one
|
||||||
|
@ -48,10 +45,10 @@ To keep the behaviour of the tests consistent and predictable, we try to follow
|
||||||
* If you're testing for a bug reported in a specific issue, always create a **regression test**. Regression tests should be named `test_issue[ISSUE NUMBER]` and live in the [`regression`](regression) directory.
|
* If you're testing for a bug reported in a specific issue, always create a **regression test**. Regression tests should be named `test_issue[ISSUE NUMBER]` and live in the [`regression`](regression) directory.
|
||||||
* Only use `@pytest.mark.xfail` for tests that **should pass, but currently fail**. To test for desired negative behaviour, use `assert not` in your test.
|
* Only use `@pytest.mark.xfail` for tests that **should pass, but currently fail**. To test for desired negative behaviour, use `assert not` in your test.
|
||||||
* Very **extensive tests** that take a long time to run should be marked with `@pytest.mark.slow`. If your slow test is testing important behaviour, consider adding an additional simpler version.
|
* Very **extensive tests** that take a long time to run should be marked with `@pytest.mark.slow`. If your slow test is testing important behaviour, consider adding an additional simpler version.
|
||||||
* Tests that require **loading the models** should be marked with `@pytest.mark.models`.
|
* If tests require **loading the models**, they should be added to the [`spacy-models`](https://github.com/explosion/spacy-models) tests.
|
||||||
* Before requiring the models, always make sure there is no other way to test the particular behaviour. In a lot of cases, it's sufficient to simply create a `Doc` object manually. See the section on [helpers and utility functions](#helpers-and-utilities) for more info on this.
|
* Before requiring the models, always make sure there is no other way to test the particular behaviour. In a lot of cases, it's sufficient to simply create a `Doc` object manually. See the section on [helpers and utility functions](#helpers-and-utilities) for more info on this.
|
||||||
* **Avoid unnecessary imports.** There should never be a need to explicitly import spaCy at the top of a file, and most components are available as [fixtures](#fixtures). You should also avoid wildcard imports (`from module import *`).
|
* **Avoid unnecessary imports.** There should never be a need to explicitly import spaCy at the top of a file, and many components are available as [fixtures](#fixtures). You should also avoid wildcard imports (`from module import *`).
|
||||||
* If you're importing from spaCy, **always use relative imports**. Otherwise, you might accidentally be running the tests over a different copy of spaCy, e.g. one you have installed on your system.
|
* If you're importing from spaCy, **always use absolute imports**. For example: `from spacy.language import Language`.
|
||||||
* Don't forget the **unicode declarations** at the top of each file. This way, unicode strings won't have to be prefixed with `u`.
|
* Don't forget the **unicode declarations** at the top of each file. This way, unicode strings won't have to be prefixed with `u`.
|
||||||
* Try to keep the tests **readable and concise**. Use clear and descriptive variable names (`doc`, `tokens` and `text` are great), keep it short and only test for one behaviour at a time.
|
* Try to keep the tests **readable and concise**. Use clear and descriptive variable names (`doc`, `tokens` and `text` are great), keep it short and only test for one behaviour at a time.
|
||||||
|
|
||||||
|
@ -93,12 +90,9 @@ These are the main fixtures that are currently available:
|
||||||
|
|
||||||
| Fixture | Description |
|
| Fixture | Description |
|
||||||
| --- | --- |
|
| --- | --- |
|
||||||
| `tokenizer` | Creates **all available** language tokenizers and runs the test for **each of them**. |
|
| `tokenizer` | Basic, language-independent tokenizer. Identical to the `xx` language class. |
|
||||||
| `en_tokenizer`, `de_tokenizer`, ... | Creates an English, German etc. tokenizer. |
|
| `en_tokenizer`, `de_tokenizer`, ... | Creates an English, German etc. tokenizer. |
|
||||||
| `en_vocab`, `en_entityrecognizer`, ... | Creates an instance of the English `Vocab`, `EntityRecognizer` object etc. |
|
| `en_vocab` | Creates an instance of the English `Vocab`. |
|
||||||
| `EN`, `DE`, ... | Creates a language class with a loaded model. For more info, see [Testing models](#testing-models). |
|
|
||||||
| `text_file` | Creates an instance of `StringIO` to simulate reading from and writing to files. |
|
|
||||||
| `text_file_b` | Creates an instance of `BytesIO` to simulate reading from and writing to files. |
|
|
||||||
|
|
||||||
The fixtures can be used in all tests by simply setting them as an argument, like this:
|
The fixtures can be used in all tests by simply setting them as an argument, like this:
|
||||||
|
|
||||||
|
@ -109,49 +103,6 @@ def test_module_do_something(en_tokenizer):
|
||||||
|
|
||||||
If all tests in a file require a specific configuration, or use the same complex example, it can be helpful to create a separate fixture. This fixture should be added at the top of each file. Make sure to use descriptive names for these fixtures and don't override any of the global fixtures listed above. **From looking at a test, it should immediately be clear which fixtures are used, and where they are coming from.**
|
If all tests in a file require a specific configuration, or use the same complex example, it can be helpful to create a separate fixture. This fixture should be added at the top of each file. Make sure to use descriptive names for these fixtures and don't override any of the global fixtures listed above. **From looking at a test, it should immediately be clear which fixtures are used, and where they are coming from.**
|
||||||
|
|
||||||
## Testing models
|
|
||||||
|
|
||||||
Models should only be loaded and tested **if absolutely necessary** – for example, if you're specifically testing a model's performance, or if your test is related to model loading. If you only need an annotated `Doc`, you should use the `get_doc()` helper function to create it manually instead.
|
|
||||||
|
|
||||||
To specify which language models a test is related to, set the language ID as an argument of `@pytest.mark.models`. This allows you to later run the tests with `--models --en`. You can then use the `EN` [fixture](#fixtures) to get a language
|
|
||||||
class with a loaded model.
|
|
||||||
|
|
||||||
```python
|
|
||||||
@pytest.mark.models('en')
|
|
||||||
def test_english_model(EN):
|
|
||||||
doc = EN(u'This is a test')
|
|
||||||
```
|
|
||||||
|
|
||||||
> ⚠️ **Important note:** In order to test models, they need to be installed as a package. The [conftest.py](conftest.py) includes a list of all available models, mapped to their IDs, e.g. `en`. Unless otherwise specified, each model that's installed in your environment will be imported and tested. If you don't have a model installed, **the test will be skipped**.
|
|
||||||
|
|
||||||
Under the hood, `pytest.importorskip` is used to import a model package and skip the test if the package is not installed. The `EN` fixture for example gets all
|
|
||||||
available models for `en`, [parametrizes](#parameters) them to run the test for *each of them*, and uses `load_test_model()` to import the model and run the test, or skip it if the model is not installed.
|
|
||||||
|
|
||||||
### Testing specific models
|
|
||||||
|
|
||||||
Using the `load_test_model()` helper function, you can also write tests for specific models, or combinations of them:
|
|
||||||
|
|
||||||
```python
|
|
||||||
from .util import load_test_model
|
|
||||||
|
|
||||||
@pytest.mark.models('en')
|
|
||||||
def test_en_md_only():
|
|
||||||
nlp = load_test_model('en_core_web_md')
|
|
||||||
# test something specific to en_core_web_md
|
|
||||||
|
|
||||||
@pytest.mark.models('en', 'fr')
|
|
||||||
@pytest.mark.parametrize('model', ['en_core_web_md', 'fr_depvec_web_lg'])
|
|
||||||
def test_different_models(model):
|
|
||||||
nlp = load_test_model(model)
|
|
||||||
# test something specific to the parametrized models
|
|
||||||
```
|
|
||||||
|
|
||||||
### Known issues and future improvements
|
|
||||||
|
|
||||||
Using `importorskip` on a list of model packages is not ideal and we're looking to improve this in the future. But at the moment, it's the best way to ensure that tests are performed on specific model packages only, and that you'll always be able to run the tests, even if you don't have *all available models* installed. (If the tests made a call to `spacy.load('en')` instead, this would load whichever model you've created an `en` shortcut for. This may be one of spaCy's default models, but it could just as easily be your own custom English model.)
|
|
||||||
|
|
||||||
The current setup also doesn't provide an easy way to only run tests on specific model versions. The `minversion` keyword argument on `pytest.importorskip` can take care of this, but it currently only checks for the package's `__version__` attribute. An alternative solution would be to load a model package's meta.json and skip if the model's version does not match the one specified in the test.
|
|
||||||
|
|
||||||
## Helpers and utilities
|
## Helpers and utilities
|
||||||
|
|
||||||
Our new test setup comes with a few handy utility functions that can be imported from [`util.py`](util.py).
|
Our new test setup comes with a few handy utility functions that can be imported from [`util.py`](util.py).
|
||||||
|
@ -186,7 +137,7 @@ You can construct a `Doc` with the following arguments:
|
||||||
| `pos` | List of POS tags as text values. |
|
| `pos` | List of POS tags as text values. |
|
||||||
| `tag` | List of tag names as text values. |
|
| `tag` | List of tag names as text values. |
|
||||||
| `dep` | List of dependencies as text values. |
|
| `dep` | List of dependencies as text values. |
|
||||||
| `ents` | List of entity tuples with `ent_id`, `label`, `start`, `end` (for example `('Stewart Lee', 'PERSON', 0, 2)`). The `label` will be looked up in `vocab.strings[label]`. |
|
| `ents` | List of entity tuples with `start`, `end`, `label` (for example `(0, 2, 'PERSON')`). The `label` will be looked up in `vocab.strings[label]`. |
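
To illustrate the `(start, end, label)` format, here is a minimal sketch in plain Python that does not use spaCy at all. It assumes that `start` and `end` are token indices and that `end` is exclusive, which matches the "Stewart Lee" example spanning `0` to `2`. The helper name `describe_ents` and the sample sentence are hypothetical and only serve the illustration.

```python
# Hypothetical token list; in a real test these would be the words passed to get_doc().
words = ["Stewart", "Lee", "is", "a", "stand", "up", "comedian"]

# One entity in (start, end, label) order; end is assumed to be exclusive.
ents = [(0, 2, "PERSON")]


def describe_ents(words, ents):
    """Return (text, label) pairs for (start, end, label) tuples over token indices."""
    return [(" ".join(words[start:end]), label) for start, end, label in ents]


entity_texts = describe_ents(words, ents)
print(entity_texts)  # [('Stewart Lee', 'PERSON')]
```

Writing the expected span text out like this can be a quick sanity check that the indices in a test really cover the tokens you intended.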
|
||||||
|
|
||||||
Here's how to quickly get these values from within spaCy:
|
Here's how to quickly get these values from within spaCy:
|
||||||
|
|
||||||
|
@ -196,6 +147,7 @@ print([token.head.i-token.i for token in doc])
|
||||||
print([token.tag_ for token in doc])
|
print([token.tag_ for token in doc])
|
||||||
print([token.pos_ for token in doc])
|
print([token.pos_ for token in doc])
|
||||||
print([token.dep_ for token in doc])
|
print([token.dep_ for token in doc])
|
||||||
|
print([(ent.start, ent.end, ent.label_) for ent in doc.ents])
|
||||||
```
|
```
|
||||||
|
|
||||||
**Note:** There's currently no way of setting the serializer data for the parser without loading the models. If this is relevant to your test, constructing the `Doc` via `get_doc()` won't work.
|
**Note:** There's currently no way of setting the serializer data for the parser without loading the models. If this is relevant to your test, constructing the `Doc` via `get_doc()` won't work.
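
The expression `token.head.i-token.i` in the snippet above yields each head as an offset relative to its token rather than as an absolute index. As a rough sketch of that conversion, assuming you start from absolute head indices written down by hand, the hypothetical helper below produces the relative form. It is plain Python and not part of `util.py`.

```python
def to_relative_heads(absolute_heads):
    """Convert absolute head indices into offsets relative to each token's position."""
    return [head - i for i, head in enumerate(absolute_heads)]


# "A nice sentence ." where every token attaches to "sentence" (index 2);
# the root points at itself, so its relative offset is 0.
absolute_heads = [2, 2, 2, 2]
relative_heads = to_relative_heads(absolute_heads)
print(relative_heads)  # [2, 1, 0, -1]
```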
|
||||||
|
@ -204,7 +156,6 @@ print([token.dep_ for token in doc])
|
||||||
|
|
||||||
| Name | Description |
|
| Name | Description |
|
||||||
| --- | --- |
|
| --- | --- |
|
||||||
| `load_test_model` | Load a model if it's installed as a package, otherwise skip test. |
|
|
||||||
| `apply_transition_sequence(parser, doc, sequence)` | Perform a series of pre-specified transitions, to put the parser in a desired state. |
|
| `apply_transition_sequence(parser, doc, sequence)` | Perform a series of pre-specified transitions, to put the parser in a desired state. |
|
||||||
| `add_vecs_to_vocab(vocab, vectors)` | Add list of vector tuples (`[("text", [1, 2, 3])]`) to given vocab. All vectors need to have the same length. |
|
| `add_vecs_to_vocab(vocab, vectors)` | Add list of vector tuples (`[("text", [1, 2, 3])]`) to given vocab. All vectors need to have the same length. |
|
||||||
| `get_cosine(vec1, vec2)` | Get cosine for two given vectors. |
|
| `get_cosine(vec1, vec2)` | Get cosine for two given vectors. |
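
For readers unfamiliar with the measure, the following is a minimal standard-library sketch of cosine similarity between two vectors of equal length. It is not the implementation in `util.py`; the function name `cosine_similarity` and the error handling are assumptions made for illustration.

```python
import math


def cosine_similarity(vec1, vec2):
    """Return the cosine of the angle between two equal-length, non-zero vectors."""
    if len(vec1) != len(vec2):
        raise ValueError("vectors need to have the same length")
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    if norm1 == 0 or norm2 == 0:
        raise ValueError("cosine is undefined for zero vectors")
    return dot / (norm1 * norm2)


# Parallel vectors have a cosine close to 1, orthogonal vectors close to 0.
parallel = cosine_similarity([1, 2, 3], [2, 4, 6])
orthogonal = cosine_similarity([1, 0], [0, 1])
```

Because of floating-point rounding, tests that compare cosine values are usually safer with a small tolerance than with exact equality.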
|
||||||
|
|
|
@ -1,229 +1,145 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
from io import StringIO, BytesIO
|
|
||||||
from pathlib import Path
|
|
||||||
import pytest
|
import pytest
|
||||||
|
from io import StringIO, BytesIO
|
||||||
from .util import load_test_model
|
from spacy.util import get_lang_class
|
||||||
from ..tokens import Doc
|
|
||||||
from ..strings import StringStore
|
|
||||||
from .. import util
|
|
||||||
|
|
||||||
|
|
||||||
# These languages are used for generic tokenizer tests – only add a language
|
def pytest_addoption(parser):
|
||||||
# here if it's using spaCy's tokenizer (not a different library)
|
parser.addoption("--slow", action="store_true", help="include slow tests")
|
||||||
# TODO: re-implement generic tokenizer tests
|
|
||||||
_languages = ['bn', 'da', 'de', 'el', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id',
|
|
||||||
'it', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sv', 'tr', 'ar', 'ut', 'tt',
|
|
||||||
'xx']
|
|
||||||
|
|
||||||
_models = {'en': ['en_core_web_sm'],
|
|
||||||
'de': ['de_core_news_sm'],
|
|
||||||
'fr': ['fr_core_news_sm'],
|
|
||||||
'xx': ['xx_ent_web_sm'],
|
|
||||||
'en_core_web_md': ['en_core_web_md'],
|
|
||||||
'es_core_news_md': ['es_core_news_md']}
|
|
||||||
|
|
||||||
|
|
||||||
# only used for tests that require loading the models
|
def pytest_runtest_setup(item):
|
||||||
# in all other cases, use specific instances
|
for opt in ['slow']:
|
||||||
|
if opt in item.keywords and not item.config.getoption("--%s" % opt):
|
||||||
@pytest.fixture(params=_models['en'])
|
pytest.skip("need --%s option to run" % opt)
|
||||||
def EN(request):
|
|
||||||
return load_test_model(request.param)
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture(params=_models['de'])
|
@pytest.fixture(scope='module')
|
||||||
def DE(request):
|
|
||||||
return load_test_model(request.param)
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture(params=_models['fr'])
|
|
||||||
def FR(request):
|
|
||||||
return load_test_model(request.param)
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture()
|
|
||||||
def RU(request):
|
|
||||||
pymorphy = pytest.importorskip('pymorphy2')
|
|
||||||
return util.get_lang_class('ru')()
|
|
||||||
|
|
||||||
@pytest.fixture()
|
|
||||||
def JA(request):
|
|
||||||
mecab = pytest.importorskip("MeCab")
|
|
||||||
return util.get_lang_class('ja')()
|
|
||||||
|
|
||||||
|
|
||||||
#@pytest.fixture(params=_languages)
|
|
||||||
#def tokenizer(request):
|
|
||||||
#lang = util.get_lang_class(request.param)
|
|
||||||
#return lang.Defaults.create_tokenizer()
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
|
||||||
def tokenizer():
|
def tokenizer():
|
||||||
return util.get_lang_class('xx').Defaults.create_tokenizer()
|
return get_lang_class('xx').Defaults.create_tokenizer()
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture(scope='session')
|
||||||
def en_tokenizer():
|
def en_tokenizer():
|
||||||
return util.get_lang_class('en').Defaults.create_tokenizer()
|
return get_lang_class('en').Defaults.create_tokenizer()
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture(scope='session')
|
||||||
def en_vocab():
|
def en_vocab():
|
||||||
return util.get_lang_class('en').Defaults.create_vocab()
|
return get_lang_class('en').Defaults.create_vocab()
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture(scope='session')
|
||||||
def en_parser(en_vocab):
|
def en_parser(en_vocab):
|
||||||
nlp = util.get_lang_class('en')(en_vocab)
|
nlp = get_lang_class('en')(en_vocab)
|
||||||
return nlp.create_pipe('parser')
|
return nlp.create_pipe('parser')
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture(scope='session')
|
||||||
def es_tokenizer():
|
def es_tokenizer():
|
||||||
return util.get_lang_class('es').Defaults.create_tokenizer()
|
return get_lang_class('es').Defaults.create_tokenizer()
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture(scope='session')
|
||||||
def de_tokenizer():
|
def de_tokenizer():
|
||||||
return util.get_lang_class('de').Defaults.create_tokenizer()
|
return get_lang_class('de').Defaults.create_tokenizer()
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture(scope='session')
|
||||||
def fr_tokenizer():
|
def fr_tokenizer():
|
||||||
return util.get_lang_class('fr').Defaults.create_tokenizer()
|
return get_lang_class('fr').Defaults.create_tokenizer()
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
def hu_tokenizer():
|
def hu_tokenizer():
|
||||||
return util.get_lang_class('hu').Defaults.create_tokenizer()
|
return get_lang_class('hu').Defaults.create_tokenizer()
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture(scope='session')
|
||||||
def fi_tokenizer():
|
def fi_tokenizer():
|
||||||
return util.get_lang_class('fi').Defaults.create_tokenizer()
|
return get_lang_class('fi').Defaults.create_tokenizer()
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture(scope='session')
|
||||||
def ro_tokenizer():
|
def ro_tokenizer():
|
||||||
return util.get_lang_class('ro').Defaults.create_tokenizer()
|
return get_lang_class('ro').Defaults.create_tokenizer()
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture(scope='session')
|
||||||
def id_tokenizer():
|
def id_tokenizer():
|
||||||
return util.get_lang_class('id').Defaults.create_tokenizer()
|
return get_lang_class('id').Defaults.create_tokenizer()
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture(scope='session')
|
||||||
def sv_tokenizer():
|
def sv_tokenizer():
|
||||||
return util.get_lang_class('sv').Defaults.create_tokenizer()
|
return get_lang_class('sv').Defaults.create_tokenizer()
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture(scope='session')
|
||||||
def bn_tokenizer():
|
def bn_tokenizer():
|
||||||
return util.get_lang_class('bn').Defaults.create_tokenizer()
|
return get_lang_class('bn').Defaults.create_tokenizer()
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture(scope='session')
|
||||||
def ga_tokenizer():
|
def ga_tokenizer():
|
||||||
return util.get_lang_class('ga').Defaults.create_tokenizer()
|
return get_lang_class('ga').Defaults.create_tokenizer()
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture(scope='session')
|
||||||
def he_tokenizer():
|
def he_tokenizer():
|
||||||
return util.get_lang_class('he').Defaults.create_tokenizer()
|
return get_lang_class('he').Defaults.create_tokenizer()
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture(scope='session')
|
||||||
def nb_tokenizer():
|
def nb_tokenizer():
|
||||||
return util.get_lang_class('nb').Defaults.create_tokenizer()
|
return get_lang_class('nb').Defaults.create_tokenizer()
|
||||||
|
|
||||||
@pytest.fixture
|
|
||||||
|
@pytest.fixture(scope='session')
|
||||||
def da_tokenizer():
|
def da_tokenizer():
|
||||||
return util.get_lang_class('da').Defaults.create_tokenizer()
|
return get_lang_class('da').Defaults.create_tokenizer()
|
||||||
|
|
||||||
@pytest.fixture
|
|
||||||
|
@pytest.fixture(scope='session')
|
||||||
def ja_tokenizer():
|
def ja_tokenizer():
|
||||||
mecab = pytest.importorskip("MeCab")
|
mecab = pytest.importorskip("MeCab")
|
||||||
return util.get_lang_class('ja').Defaults.create_tokenizer()
|
return get_lang_class('ja').Defaults.create_tokenizer()
|
||||||
|
|
||||||
@pytest.fixture
|
|
||||||
|
@pytest.fixture(scope='session')
|
||||||
def th_tokenizer():
|
def th_tokenizer():
|
||||||
pythainlp = pytest.importorskip("pythainlp")
|
pythainlp = pytest.importorskip("pythainlp")
|
||||||
return util.get_lang_class('th').Defaults.create_tokenizer()
|
return get_lang_class('th').Defaults.create_tokenizer()
|
||||||
|
|
||||||
@pytest.fixture
|
|
||||||
|
@pytest.fixture(scope='session')
|
||||||
def tr_tokenizer():
|
def tr_tokenizer():
|
||||||
return util.get_lang_class('tr').Defaults.create_tokenizer()
|
return get_lang_class('tr').Defaults.create_tokenizer()
|
||||||
|
|
||||||
@pytest.fixture
|
|
||||||
|
@pytest.fixture(scope='session')
|
||||||
def tt_tokenizer():
|
def tt_tokenizer():
|
||||||
return util.get_lang_class('tt').Defaults.create_tokenizer()
|
return get_lang_class('tt').Defaults.create_tokenizer()
|
||||||
|
|
||||||
@pytest.fixture
|
|
||||||
|
@pytest.fixture(scope='session')
|
||||||
def el_tokenizer():
|
def el_tokenizer():
|
||||||
return util.get_lang_class('el').Defaults.create_tokenizer()
|
return get_lang_class('el').Defaults.create_tokenizer()
|
||||||
|
|
||||||
@pytest.fixture
|
|
||||||
|
@pytest.fixture(scope='session')
|
||||||
def ar_tokenizer():
|
def ar_tokenizer():
|
||||||
return util.get_lang_class('ar').Defaults.create_tokenizer()
|
return get_lang_class('ar').Defaults.create_tokenizer()
|
||||||
|
|
||||||
@pytest.fixture
|
|
||||||
|
@pytest.fixture(scope='session')
|
||||||
def ur_tokenizer():
|
def ur_tokenizer():
|
||||||
return util.get_lang_class('ur').Defaults.create_tokenizer()
|
return get_lang_class('ur').Defaults.create_tokenizer()
|
||||||
|
|
||||||
@pytest.fixture
|
|
||||||
|
@pytest.fixture(scope='session')
|
||||||
def ru_tokenizer():
|
def ru_tokenizer():
|
||||||
pymorphy = pytest.importorskip('pymorphy2')
|
pymorphy = pytest.importorskip('pymorphy2')
|
||||||
return util.get_lang_class('ru').Defaults.create_tokenizer()
|
return get_lang_class('ru').Defaults.create_tokenizer()
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
|
||||||
def stringstore():
|
|
||||||
return StringStore()
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
|
||||||
def en_entityrecognizer():
|
|
||||||
return util.get_lang_class('en').Defaults.create_entity()
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
|
||||||
def text_file():
|
|
||||||
return StringIO()
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
|
||||||
def text_file_b():
|
|
||||||
return BytesIO()
|
|
||||||
|
|
||||||
|
|
||||||
def pytest_addoption(parser):
|
|
||||||
parser.addoption("--models", action="store_true",
|
|
||||||
help="include tests that require full models")
|
|
||||||
parser.addoption("--vectors", action="store_true",
|
|
||||||
help="include word vectors tests")
|
|
||||||
parser.addoption("--slow", action="store_true",
|
|
||||||
help="include slow tests")
|
|
||||||
|
|
||||||
for lang in _languages + ['all']:
|
|
||||||
parser.addoption("--%s" % lang, action="store_true", help="Use %s models" % lang)
|
|
||||||
for model in _models:
|
|
||||||
if model not in _languages:
|
|
||||||
parser.addoption("--%s" % model, action="store_true", help="Use %s model" % model)
|
|
||||||
|
|
||||||
|
|
||||||
def pytest_runtest_setup(item):
|
|
||||||
for opt in ['models', 'vectors', 'slow']:
|
|
||||||
if opt in item.keywords and not item.config.getoption("--%s" % opt):
|
|
||||||
pytest.skip("need --%s option to run" % opt)
|
|
||||||
|
|
||||||
# Check if test is marked with models and has arguments set, i.e. specific
|
|
||||||
# language. If so, skip test if flag not set.
|
|
||||||
if item.get_marker('models'):
|
|
||||||
for arg in item.get_marker('models').args:
|
|
||||||
if not item.config.getoption("--%s" % arg) and not item.config.getoption("--all"):
|
|
||||||
pytest.skip("need --%s or --all option to run" % arg)
|
|
||||||
|
|
|
@ -1,24 +0,0 @@
|
||||||
# coding: utf-8
|
|
||||||
from __future__ import unicode_literals
|
|
||||||
|
|
||||||
from ...pipeline import EntityRecognizer
|
|
||||||
from ..util import get_doc
|
|
||||||
|
|
||||||
import pytest
|
|
||||||
|
|
||||||
|
|
||||||
def test_doc_add_entities_set_ents_iob(en_vocab):
|
|
||||||
text = ["This", "is", "a", "lion"]
|
|
||||||
doc = get_doc(en_vocab, text)
|
|
||||||
ner = EntityRecognizer(en_vocab)
|
|
||||||
ner.begin_training([])
|
|
||||||
ner(doc)
|
|
||||||
|
|
||||||
assert len(list(doc.ents)) == 0
|
|
||||||
assert [w.ent_iob_ for w in doc] == (['O'] * len(doc))
|
|
||||||
|
|
||||||
doc.ents = [(doc.vocab.strings['ANIMAL'], 3, 4)]
|
|
||||||
assert [w.ent_iob_ for w in doc] == ['', '', '', 'B']
|
|
||||||
|
|
||||||
doc.ents = [(doc.vocab.strings['WORD'], 0, 2)]
|
|
||||||
assert [w.ent_iob_ for w in doc] == ['B', 'I', '', '']
|
|
|
@ -1,10 +1,9 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
from ...attrs import ORTH, SHAPE, POS, DEP
|
from spacy.attrs import ORTH, SHAPE, POS, DEP
|
||||||
from ..util import get_doc
|
|
||||||
|
|
||||||
import pytest
|
from ..util import get_doc
|
||||||
|
|
||||||
|
|
||||||
def test_doc_array_attr_of_token(en_tokenizer, en_vocab):
|
def test_doc_array_attr_of_token(en_tokenizer, en_vocab):
|
||||||
|
@ -41,7 +40,7 @@ def test_doc_array_tag(en_tokenizer):
|
||||||
text = "A nice sentence."
|
text = "A nice sentence."
|
||||||
pos = ['DET', 'ADJ', 'NOUN', 'PUNCT']
|
pos = ['DET', 'ADJ', 'NOUN', 'PUNCT']
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], pos=pos)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], pos=pos)
|
||||||
assert doc[0].pos != doc[1].pos != doc[2].pos != doc[3].pos
|
assert doc[0].pos != doc[1].pos != doc[2].pos != doc[3].pos
|
||||||
feats_array = doc.to_array((ORTH, POS))
|
feats_array = doc.to_array((ORTH, POS))
|
||||||
assert feats_array[0][1] == doc[0].pos
|
assert feats_array[0][1] == doc[0].pos
|
||||||
|
@ -54,7 +53,7 @@ def test_doc_array_dep(en_tokenizer):
|
||||||
text = "A nice sentence."
|
text = "A nice sentence."
|
||||||
deps = ['det', 'amod', 'ROOT', 'punct']
|
deps = ['det', 'amod', 'ROOT', 'punct']
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], deps=deps)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], deps=deps)
|
||||||
feats_array = doc.to_array((ORTH, DEP))
|
feats_array = doc.to_array((ORTH, DEP))
|
||||||
assert feats_array[0][1] == doc[0].dep
|
assert feats_array[0][1] == doc[0].dep
|
||||||
assert feats_array[1][1] == doc[1].dep
|
assert feats_array[1][1] == doc[1].dep
|
||||||
|
|
|
@ -1,10 +1,10 @@
|
||||||
'''Test Doc sets up tokens correctly.'''
|
# coding: utf-8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
import pytest
|
|
||||||
|
|
||||||
from ...vocab import Vocab
|
import pytest
|
||||||
from ...tokens.doc import Doc
|
from spacy.vocab import Vocab
|
||||||
from ...lemmatizer import Lemmatizer
|
from spacy.tokens import Doc
|
||||||
|
from spacy.lemmatizer import Lemmatizer
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
|
|
|
@@ -1,18 +1,18 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
from ..util import get_doc
|
|
||||||
from ...tokens import Doc
|
|
||||||
from ...vocab import Vocab
|
|
||||||
from ...attrs import LEMMA
|
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
import numpy
|
import numpy
|
||||||
|
from spacy.tokens import Doc
|
||||||
|
from spacy.vocab import Vocab
|
||||||
|
from spacy.attrs import LEMMA
|
||||||
|
|
||||||
|
from ..util import get_doc
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', [["one", "two", "three"]])
|
@pytest.mark.parametrize('text', [["one", "two", "three"]])
|
||||||
def test_doc_api_compare_by_string_position(en_vocab, text):
|
def test_doc_api_compare_by_string_position(en_vocab, text):
|
||||||
doc = get_doc(en_vocab, text)
|
doc = Doc(en_vocab, words=text)
|
||||||
# Get the tokens in this order, so their ID ordering doesn't match the idx
|
# Get the tokens in this order, so their ID ordering doesn't match the idx
|
||||||
token3 = doc[-1]
|
token3 = doc[-1]
|
||||||
token2 = doc[-2]
|
token2 = doc[-2]
|
||||||
|
@@ -104,18 +104,18 @@ def test_doc_api_getitem(en_tokenizer):
|
||||||
" Give it back! He pleaded. "])
|
" Give it back! He pleaded. "])
|
||||||
def test_doc_api_serialize(en_tokenizer, text):
|
def test_doc_api_serialize(en_tokenizer, text):
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
new_tokens = get_doc(tokens.vocab).from_bytes(tokens.to_bytes())
|
new_tokens = Doc(tokens.vocab).from_bytes(tokens.to_bytes())
|
||||||
assert tokens.text == new_tokens.text
|
assert tokens.text == new_tokens.text
|
||||||
assert [t.text for t in tokens] == [t.text for t in new_tokens]
|
assert [t.text for t in tokens] == [t.text for t in new_tokens]
|
||||||
assert [t.orth for t in tokens] == [t.orth for t in new_tokens]
|
assert [t.orth for t in tokens] == [t.orth for t in new_tokens]
|
||||||
|
|
||||||
new_tokens = get_doc(tokens.vocab).from_bytes(
|
new_tokens = Doc(tokens.vocab).from_bytes(
|
||||||
tokens.to_bytes(tensor=False), tensor=False)
|
tokens.to_bytes(tensor=False), tensor=False)
|
||||||
assert tokens.text == new_tokens.text
|
assert tokens.text == new_tokens.text
|
||||||
assert [t.text for t in tokens] == [t.text for t in new_tokens]
|
assert [t.text for t in tokens] == [t.text for t in new_tokens]
|
||||||
assert [t.orth for t in tokens] == [t.orth for t in new_tokens]
|
assert [t.orth for t in tokens] == [t.orth for t in new_tokens]
|
||||||
|
|
||||||
new_tokens = get_doc(tokens.vocab).from_bytes(
|
new_tokens = Doc(tokens.vocab).from_bytes(
|
||||||
tokens.to_bytes(sentiment=False), sentiment=False)
|
tokens.to_bytes(sentiment=False), sentiment=False)
|
||||||
assert tokens.text == new_tokens.text
|
assert tokens.text == new_tokens.text
|
||||||
assert [t.text for t in tokens] == [t.text for t in new_tokens]
|
assert [t.text for t in tokens] == [t.text for t in new_tokens]
|
||||||
|
@@ -199,6 +199,20 @@ def test_doc_api_retokenizer_attrs(en_tokenizer):
|
||||||
assert doc[4].ent_type_ == 'ORG'
|
assert doc[4].ent_type_ == 'ORG'
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.xfail
|
||||||
|
def test_doc_api_retokenizer_lex_attrs(en_tokenizer):
|
||||||
|
"""Test that lexical attributes can be changed (see #2390)."""
|
||||||
|
doc = en_tokenizer("WKRO played beach boys songs")
|
||||||
|
assert not any(token.is_stop for token in doc)
|
||||||
|
with doc.retokenize() as retokenizer:
|
||||||
|
retokenizer.merge(doc[2:4], attrs={'LEMMA': 'boys', 'IS_STOP': True})
|
||||||
|
assert doc[2].text == 'beach boys'
|
||||||
|
assert doc[2].lemma_ == 'boys'
|
||||||
|
assert doc[2].is_stop
|
||||||
|
new_doc = Doc(doc.vocab, words=['beach boys'])
|
||||||
|
assert new_doc[0].is_stop
|
||||||
|
|
||||||
|
|
||||||
def test_doc_api_sents_empty_string(en_tokenizer):
|
def test_doc_api_sents_empty_string(en_tokenizer):
|
||||||
doc = en_tokenizer("")
|
doc = en_tokenizer("")
|
||||||
doc.is_parsed = True
|
doc.is_parsed = True
|
||||||
|
@@ -215,7 +229,7 @@ def test_doc_api_runtime_error(en_tokenizer):
|
||||||
'ROOT', 'amod', 'dobj']
|
'ROOT', 'amod', 'dobj']
|
||||||
|
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], deps=deps)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], deps=deps)
|
||||||
|
|
||||||
nps = []
|
nps = []
|
||||||
for np in doc.noun_chunks:
|
for np in doc.noun_chunks:
|
||||||
|
@@ -235,7 +249,7 @@ def test_doc_api_right_edge(en_tokenizer):
|
||||||
-2, -7, 1, -19, 1, -2, -3, 2, 1, -3, -26]
|
-2, -7, 1, -19, 1, -2, -3, 2, 1, -3, -26]
|
||||||
|
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||||
assert doc[6].text == 'for'
|
assert doc[6].text == 'for'
|
||||||
subtree = [w.text for w in doc[6].subtree]
|
subtree = [w.text for w in doc[6].subtree]
|
||||||
assert subtree == ['for', 'the', 'sake', 'of', 'such', 'as',
|
assert subtree == ['for', 'the', 'sake', 'of', 'such', 'as',
|
||||||
|
@@ -264,7 +278,7 @@ def test_doc_api_similarity_match():
|
||||||
|
|
||||||
def test_lowest_common_ancestor(en_tokenizer):
|
def test_lowest_common_ancestor(en_tokenizer):
|
||||||
tokens = en_tokenizer('the lazy dog slept')
|
tokens = en_tokenizer('the lazy dog slept')
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=[2, 1, 1, 0])
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=[2, 1, 1, 0])
|
||||||
lca = doc.get_lca_matrix()
|
lca = doc.get_lca_matrix()
|
||||||
assert(lca[1, 1] == 1)
|
assert(lca[1, 1] == 1)
|
||||||
assert(lca[0, 1] == 2)
|
assert(lca[0, 1] == 2)
|
||||||
|
@@ -277,7 +291,7 @@ def test_parse_tree(en_tokenizer):
|
||||||
heads = [1, 0, 1, -2, -3, -1, -5]
|
heads = [1, 0, 1, -2, -3, -1, -5]
|
||||||
tags = ['PRP', 'IN', 'NNP', 'NNP', 'IN', 'NNP', '.']
|
tags = ['PRP', 'IN', 'NNP', 'NNP', 'IN', 'NNP', '.']
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, tags=tags)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, tags=tags)
|
||||||
# full method parse_tree(text) is a trivial composition
|
# full method parse_tree(text) is a trivial composition
|
||||||
trees = doc.print_tree()
|
trees = doc.print_tree()
|
||||||
assert len(trees) > 0
|
assert len(trees) > 0
|
||||||
|
|
|
@@ -1,12 +1,13 @@
|
||||||
|
# coding: utf-8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
from ...language import Language
|
from spacy.language import Language
|
||||||
from ...compat import pickle, unicode_
|
from spacy.compat import pickle, unicode_
|
||||||
|
|
||||||
|
|
||||||
def test_pickle_single_doc():
|
def test_pickle_single_doc():
|
||||||
nlp = Language()
|
nlp = Language()
|
||||||
doc = nlp(u'pickle roundtrip')
|
doc = nlp('pickle roundtrip')
|
||||||
data = pickle.dumps(doc, 1)
|
data = pickle.dumps(doc, 1)
|
||||||
doc2 = pickle.loads(data)
|
doc2 = pickle.loads(data)
|
||||||
assert doc2.text == 'pickle roundtrip'
|
assert doc2.text == 'pickle roundtrip'
|
||||||
|
@@ -16,7 +17,7 @@ def test_list_of_docs_pickles_efficiently():
|
||||||
nlp = Language()
|
nlp = Language()
|
||||||
for i in range(10000):
|
for i in range(10000):
|
||||||
_ = nlp.vocab[unicode_(i)]
|
_ = nlp.vocab[unicode_(i)]
|
||||||
one_pickled = pickle.dumps(nlp(u'0'), -1)
|
one_pickled = pickle.dumps(nlp('0'), -1)
|
||||||
docs = list(nlp.pipe(unicode_(i) for i in range(100)))
|
docs = list(nlp.pipe(unicode_(i) for i in range(100)))
|
||||||
many_pickled = pickle.dumps(docs, -1)
|
many_pickled = pickle.dumps(docs, -1)
|
||||||
assert len(many_pickled) < (len(one_pickled) * 2)
|
assert len(many_pickled) < (len(one_pickled) * 2)
|
||||||
|
@@ -28,7 +29,7 @@ def test_list_of_docs_pickles_efficiently():
|
||||||
|
|
||||||
def test_user_data_from_disk():
|
def test_user_data_from_disk():
|
||||||
nlp = Language()
|
nlp = Language()
|
||||||
doc = nlp(u'Hello')
|
doc = nlp('Hello')
|
||||||
doc.user_data[(0, 1)] = False
|
doc.user_data[(0, 1)] = False
|
||||||
b = doc.to_bytes()
|
b = doc.to_bytes()
|
||||||
doc2 = doc.__class__(doc.vocab).from_bytes(b)
|
doc2 = doc.__class__(doc.vocab).from_bytes(b)
|
||||||
|
@@ -36,7 +37,7 @@ def test_user_data_from_disk():
|
||||||
|
|
||||||
def test_user_data_unpickles():
|
def test_user_data_unpickles():
|
||||||
nlp = Language()
|
nlp = Language()
|
||||||
doc = nlp(u'Hello')
|
doc = nlp('Hello')
|
||||||
doc.user_data[(0, 1)] = False
|
doc.user_data[(0, 1)] = False
|
||||||
b = pickle.dumps(doc)
|
b = pickle.dumps(doc)
|
||||||
doc2 = pickle.loads(b)
|
doc2 = pickle.loads(b)
|
||||||
|
@@ -47,7 +48,7 @@ def test_hooks_unpickle():
|
||||||
def inner_func(d1, d2):
|
def inner_func(d1, d2):
|
||||||
return 'hello!'
|
return 'hello!'
|
||||||
nlp = Language()
|
nlp = Language()
|
||||||
doc = nlp(u'Hello')
|
doc = nlp('Hello')
|
||||||
doc.user_hooks['similarity'] = inner_func
|
doc.user_hooks['similarity'] = inner_func
|
||||||
b = pickle.dumps(doc)
|
b = pickle.dumps(doc)
|
||||||
doc2 = pickle.loads(b)
|
doc2 = pickle.loads(b)
|
||||||
|
|
|
@@ -1,12 +1,12 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
from ..util import get_doc
|
|
||||||
from ...attrs import ORTH, LENGTH
|
|
||||||
from ...tokens import Doc
|
|
||||||
from ...vocab import Vocab
|
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
from spacy.attrs import ORTH, LENGTH
|
||||||
|
from spacy.tokens import Doc
|
||||||
|
from spacy.vocab import Vocab
|
||||||
|
|
||||||
|
from ..util import get_doc
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
|
@@ -16,16 +16,16 @@ def doc(en_tokenizer):
|
||||||
deps = ['nsubj', 'ROOT', 'det', 'attr', 'punct', 'nsubj', 'ROOT', 'det',
|
deps = ['nsubj', 'ROOT', 'det', 'attr', 'punct', 'nsubj', 'ROOT', 'det',
|
||||||
'attr', 'punct', 'ROOT', 'det', 'npadvmod', 'punct']
|
'attr', 'punct', 'ROOT', 'det', 'npadvmod', 'punct']
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
return get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)
|
return get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
def doc_not_parsed(en_tokenizer):
|
def doc_not_parsed(en_tokenizer):
|
||||||
text = "This is a sentence. This is another sentence. And a third."
|
text = "This is a sentence. This is another sentence. And a third."
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
d = get_doc(tokens.vocab, [t.text for t in tokens])
|
doc = Doc(tokens.vocab, words=[t.text for t in tokens])
|
||||||
d.is_parsed = False
|
doc.is_parsed = False
|
||||||
return d
|
return doc
|
||||||
|
|
||||||
|
|
||||||
def test_spans_sent_spans(doc):
|
def test_spans_sent_spans(doc):
|
||||||
|
@@ -56,7 +56,7 @@ def test_spans_root2(en_tokenizer):
|
||||||
text = "through North and South Carolina"
|
text = "through North and South Carolina"
|
||||||
heads = [0, 3, -1, -2, -4]
|
heads = [0, 3, -1, -2, -4]
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||||
assert doc[-2:].root.text == 'Carolina'
|
assert doc[-2:].root.text == 'Carolina'
|
||||||
|
|
||||||
|
|
||||||
|
@@ -76,7 +76,7 @@ def test_spans_span_sent(doc, doc_not_parsed):
|
||||||
def test_spans_lca_matrix(en_tokenizer):
|
def test_spans_lca_matrix(en_tokenizer):
|
||||||
"""Test span's lca matrix generation"""
|
"""Test span's lca matrix generation"""
|
||||||
tokens = en_tokenizer('the lazy dog slept')
|
tokens = en_tokenizer('the lazy dog slept')
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=[2, 1, 1, 0])
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=[2, 1, 1, 0])
|
||||||
lca = doc[:2].get_lca_matrix()
|
lca = doc[:2].get_lca_matrix()
|
||||||
assert(lca[0, 0] == 0)
|
assert(lca[0, 0] == 0)
|
||||||
assert(lca[0, 1] == -1)
|
assert(lca[0, 1] == -1)
|
||||||
|
@@ -100,7 +100,7 @@ def test_spans_default_sentiment(en_tokenizer):
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
tokens.vocab[tokens[0].text].sentiment = 3.0
|
tokens.vocab[tokens[0].text].sentiment = 3.0
|
||||||
tokens.vocab[tokens[2].text].sentiment = -2.0
|
tokens.vocab[tokens[2].text].sentiment = -2.0
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens])
|
doc = Doc(tokens.vocab, words=[t.text for t in tokens])
|
||||||
assert doc[:2].sentiment == 3.0 / 2
|
assert doc[:2].sentiment == 3.0 / 2
|
||||||
assert doc[-2:].sentiment == -2. / 2
|
assert doc[-2:].sentiment == -2. / 2
|
||||||
assert doc[:-1].sentiment == (3.+-2) / 3.
|
assert doc[:-1].sentiment == (3.+-2) / 3.
|
||||||
|
@@ -112,7 +112,7 @@ def test_spans_override_sentiment(en_tokenizer):
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
tokens.vocab[tokens[0].text].sentiment = 3.0
|
tokens.vocab[tokens[0].text].sentiment = 3.0
|
||||||
tokens.vocab[tokens[2].text].sentiment = -2.0
|
tokens.vocab[tokens[2].text].sentiment = -2.0
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens])
|
doc = Doc(tokens.vocab, words=[t.text for t in tokens])
|
||||||
doc.user_span_hooks['sentiment'] = lambda span: 10.0
|
doc.user_span_hooks['sentiment'] = lambda span: 10.0
|
||||||
assert doc[:2].sentiment == 10.0
|
assert doc[:2].sentiment == 10.0
|
||||||
assert doc[-2:].sentiment == 10.0
|
assert doc[-2:].sentiment == 10.0
|
||||||
|
@@ -146,7 +146,7 @@ def test_span_to_array(doc):
|
||||||
assert arr[0, 1] == len(span[0])
|
assert arr[0, 1] == len(span[0])
|
||||||
|
|
||||||
|
|
||||||
#def test_span_as_doc(doc):
|
def test_span_as_doc(doc):
|
||||||
# span = doc[4:10]
|
span = doc[4:10]
|
||||||
# span_doc = span.as_doc()
|
span_doc = span.as_doc()
|
||||||
# assert span.text == span_doc.text.strip()
|
assert span.text == span_doc.text.strip()
|
||||||
|
|
|
@@ -1,18 +1,17 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
from ..util import get_doc
|
from spacy.vocab import Vocab
|
||||||
from ...vocab import Vocab
|
from spacy.tokens import Doc
|
||||||
from ...tokens import Doc
|
|
||||||
|
|
||||||
import pytest
|
from ..util import get_doc
|
||||||
|
|
||||||
|
|
||||||
def test_spans_merge_tokens(en_tokenizer):
|
def test_spans_merge_tokens(en_tokenizer):
|
||||||
text = "Los Angeles start."
|
text = "Los Angeles start."
|
||||||
heads = [1, 1, 0, -1]
|
heads = [1, 1, 0, -1]
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||||
assert len(doc) == 4
|
assert len(doc) == 4
|
||||||
assert doc[0].head.text == 'Angeles'
|
assert doc[0].head.text == 'Angeles'
|
||||||
assert doc[1].head.text == 'start'
|
assert doc[1].head.text == 'start'
|
||||||
|
@@ -21,7 +20,7 @@ def test_spans_merge_tokens(en_tokenizer):
|
||||||
assert doc[0].text == 'Los Angeles'
|
assert doc[0].text == 'Los Angeles'
|
||||||
assert doc[0].head.text == 'start'
|
assert doc[0].head.text == 'start'
|
||||||
|
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||||
assert len(doc) == 4
|
assert len(doc) == 4
|
||||||
assert doc[0].head.text == 'Angeles'
|
assert doc[0].head.text == 'Angeles'
|
||||||
assert doc[1].head.text == 'start'
|
assert doc[1].head.text == 'start'
|
||||||
|
@@ -35,7 +34,7 @@ def test_spans_merge_heads(en_tokenizer):
|
||||||
text = "I found a pilates class near work."
|
text = "I found a pilates class near work."
|
||||||
heads = [1, 0, 2, 1, -3, -1, -1, -6]
|
heads = [1, 0, 2, 1, -3, -1, -1, -6]
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||||
|
|
||||||
assert len(doc) == 8
|
assert len(doc) == 8
|
||||||
doc.merge(doc[3].idx, doc[4].idx + len(doc[4]), tag=doc[4].tag_,
|
doc.merge(doc[3].idx, doc[4].idx + len(doc[4]), tag=doc[4].tag_,
|
||||||
|
@@ -53,7 +52,7 @@ def test_span_np_merges(en_tokenizer):
|
||||||
text = "displaCy is a parse tool built with Javascript"
|
text = "displaCy is a parse tool built with Javascript"
|
||||||
heads = [1, 0, 2, 1, -3, -1, -1, -1]
|
heads = [1, 0, 2, 1, -3, -1, -1, -1]
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||||
|
|
||||||
assert doc[4].head.i == 1
|
assert doc[4].head.i == 1
|
||||||
doc.merge(doc[2].idx, doc[4].idx + len(doc[4]), tag='NP', lemma='tool',
|
doc.merge(doc[2].idx, doc[4].idx + len(doc[4]), tag='NP', lemma='tool',
|
||||||
|
@@ -63,7 +62,7 @@ def test_span_np_merges(en_tokenizer):
|
||||||
text = "displaCy is a lightweight and modern dependency parse tree visualization tool built with CSS3 and JavaScript."
|
text = "displaCy is a lightweight and modern dependency parse tree visualization tool built with CSS3 and JavaScript."
|
||||||
heads = [1, 0, 8, 3, -1, -2, 4, 3, 1, 1, -9, -1, -1, -1, -1, -2, -15]
|
heads = [1, 0, 8, 3, -1, -2, 4, 3, 1, 1, -9, -1, -1, -1, -1, -2, -15]
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||||
|
|
||||||
ents = [(e[0].idx, e[-1].idx + len(e[-1]), e.label_, e.lemma_) for e in doc.ents]
|
ents = [(e[0].idx, e[-1].idx + len(e[-1]), e.label_, e.lemma_) for e in doc.ents]
|
||||||
for start, end, label, lemma in ents:
|
for start, end, label, lemma in ents:
|
||||||
|
@@ -74,8 +73,7 @@ def test_span_np_merges(en_tokenizer):
|
||||||
text = "One test with entities like New York City so the ents list is not void"
|
text = "One test with entities like New York City so the ents list is not void"
|
||||||
heads = [1, 11, -1, -1, -1, 1, 1, -3, 4, 2, 1, 1, 0, -1, -2]
|
heads = [1, 11, -1, -1, -1, 1, 1, -3, 4, 2, 1, 1, 0, -1, -2]
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||||
|
|
||||||
for span in doc.ents:
|
for span in doc.ents:
|
||||||
merged = doc.merge()
|
merged = doc.merge()
|
||||||
assert merged != None, (span.start, span.end, span.label_, span.lemma_)
|
assert merged != None, (span.start, span.end, span.label_, span.lemma_)
|
||||||
|
@@ -85,10 +83,9 @@ def test_spans_entity_merge(en_tokenizer):
|
||||||
text = "Stewart Lee is a stand up comedian who lives in England and loves Joe Pasquale.\n"
|
text = "Stewart Lee is a stand up comedian who lives in England and loves Joe Pasquale.\n"
|
||||||
heads = [1, 1, 0, 1, 2, -1, -4, 1, -2, -1, -1, -3, -10, 1, -2, -13, -1]
|
heads = [1, 1, 0, 1, 2, -1, -4, 1, -2, -1, -1, -3, -10, 1, -2, -13, -1]
|
||||||
tags = ['NNP', 'NNP', 'VBZ', 'DT', 'VB', 'RP', 'NN', 'WP', 'VBZ', 'IN', 'NNP', 'CC', 'VBZ', 'NNP', 'NNP', '.', 'SP']
|
tags = ['NNP', 'NNP', 'VBZ', 'DT', 'VB', 'RP', 'NN', 'WP', 'VBZ', 'IN', 'NNP', 'CC', 'VBZ', 'NNP', 'NNP', '.', 'SP']
|
||||||
ents = [('Stewart Lee', 'PERSON', 0, 2), ('England', 'GPE', 10, 11), ('Joe Pasquale', 'PERSON', 13, 15)]
|
ents = [(0, 2, 'PERSON'), (10, 11, 'GPE'), (13, 15, 'PERSON')]
|
||||||
|
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, tags=tags, ents=ents)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, tags=tags, ents=ents)
|
||||||
assert len(doc) == 17
|
assert len(doc) == 17
|
||||||
for ent in doc.ents:
|
for ent in doc.ents:
|
||||||
label, lemma, type_ = (ent.root.tag_, ent.root.lemma_, max(w.ent_type_ for w in ent))
|
label, lemma, type_ = (ent.root.tag_, ent.root.lemma_, max(w.ent_type_ for w in ent))
|
||||||
|
@@ -120,7 +117,7 @@ def test_spans_sentence_update_after_merge(en_tokenizer):
|
||||||
'compound', 'dobj', 'punct']
|
'compound', 'dobj', 'punct']
|
||||||
|
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
|
||||||
sent1, sent2 = list(doc.sents)
|
sent1, sent2 = list(doc.sents)
|
||||||
init_len = len(sent1)
|
init_len = len(sent1)
|
||||||
init_len2 = len(sent2)
|
init_len2 = len(sent2)
|
||||||
|
@@ -138,7 +135,7 @@ def test_spans_subtree_size_check(en_tokenizer):
|
||||||
'dobj']
|
'dobj']
|
||||||
|
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
|
||||||
sent1 = list(doc.sents)[0]
|
sent1 = list(doc.sents)[0]
|
||||||
init_len = len(list(sent1.root.subtree))
|
init_len = len(list(sent1.root.subtree))
|
||||||
doc[0:2].merge(label='none', lemma='none', ent_type='none')
|
doc[0:2].merge(label='none', lemma='none', ent_type='none')
|
||||||
|
|
|
@@ -1,14 +1,24 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
from ...attrs import IS_ALPHA, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_TITLE, IS_STOP
|
|
||||||
from ...symbols import NOUN, VERB
|
|
||||||
from ..util import get_doc
|
|
||||||
from ...vocab import Vocab
|
|
||||||
from ...tokens import Doc
|
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
import numpy
|
import numpy
|
||||||
|
from spacy.attrs import IS_ALPHA, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_TITLE, IS_STOP
|
||||||
|
from spacy.symbols import VERB
|
||||||
|
from spacy.vocab import Vocab
|
||||||
|
from spacy.tokens import Doc
|
||||||
|
|
||||||
|
from ..util import get_doc
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture
|
||||||
|
def doc(en_tokenizer):
|
||||||
|
text = "This is a sentence. This is another sentence. And a third."
|
||||||
|
heads = [1, 0, 1, -2, -3, 1, 0, 1, -2, -3, 0, 1, -2, -1]
|
||||||
|
deps = ['nsubj', 'ROOT', 'det', 'attr', 'punct', 'nsubj', 'ROOT', 'det',
|
||||||
|
'attr', 'punct', 'ROOT', 'det', 'npadvmod', 'punct']
|
||||||
|
tokens = en_tokenizer(text)
|
||||||
|
return get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
|
||||||
|
|
||||||
|
|
||||||
def test_doc_token_api_strings(en_tokenizer):
|
def test_doc_token_api_strings(en_tokenizer):
|
||||||
|
@@ -18,7 +28,7 @@ def test_doc_token_api_strings(en_tokenizer):
|
||||||
deps = ['ROOT', 'dobj', 'prt', 'punct', 'nsubj', 'ROOT', 'punct']
|
deps = ['ROOT', 'dobj', 'prt', 'punct', 'nsubj', 'ROOT', 'punct']
|
||||||
|
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], pos=pos, heads=heads, deps=deps)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps)
|
||||||
assert doc[0].orth_ == 'Give'
|
assert doc[0].orth_ == 'Give'
|
||||||
assert doc[0].text == 'Give'
|
assert doc[0].text == 'Give'
|
||||||
assert doc[0].text_with_ws == 'Give '
|
assert doc[0].text_with_ws == 'Give '
|
||||||
|
@@ -57,18 +67,9 @@ def test_doc_token_api_str_builtin(en_tokenizer, text):
|
||||||
assert str(tokens[0]) == text.split(' ')[0]
|
assert str(tokens[0]) == text.split(' ')[0]
|
||||||
assert str(tokens[1]) == text.split(' ')[1]
|
assert str(tokens[1]) == text.split(' ')[1]
|
||||||
|
|
||||||
@pytest.fixture
|
|
||||||
def doc(en_tokenizer):
|
|
||||||
text = "This is a sentence. This is another sentence. And a third."
|
|
||||||
heads = [1, 0, 1, -2, -3, 1, 0, 1, -2, -3, 0, 1, -2, -1]
|
|
||||||
deps = ['nsubj', 'ROOT', 'det', 'attr', 'punct', 'nsubj', 'ROOT', 'det',
|
|
||||||
'attr', 'punct', 'ROOT', 'det', 'npadvmod', 'punct']
|
|
||||||
tokens = en_tokenizer(text)
|
|
||||||
return get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)
|
|
||||||
|
|
||||||
def test_doc_token_api_is_properties(en_vocab):
|
def test_doc_token_api_is_properties(en_vocab):
|
||||||
text = ["Hi", ",", "my", "email", "is", "test@me.com"]
|
doc = Doc(en_vocab, words=["Hi", ",", "my", "email", "is", "test@me.com"])
|
||||||
doc = get_doc(en_vocab, text)
|
|
||||||
assert doc[0].is_title
|
assert doc[0].is_title
|
||||||
assert doc[0].is_alpha
|
assert doc[0].is_alpha
|
||||||
assert not doc[0].is_digit
|
assert not doc[0].is_digit
|
||||||
|
@@ -86,7 +87,6 @@ def test_doc_token_api_vectors():
|
||||||
vocab.set_vector('oranges', vector=numpy.asarray([0., 1.], dtype='f'))
|
vocab.set_vector('oranges', vector=numpy.asarray([0., 1.], dtype='f'))
|
||||||
doc = Doc(vocab, words=['apples', 'oranges', 'oov'])
|
doc = Doc(vocab, words=['apples', 'oranges', 'oov'])
|
||||||
assert doc.has_vector
|
assert doc.has_vector
|
||||||
|
|
||||||
assert doc[0].has_vector
|
assert doc[0].has_vector
|
||||||
assert doc[1].has_vector
|
assert doc[1].has_vector
|
||||||
assert not doc[2].has_vector
|
assert not doc[2].has_vector
|
||||||
|
@@ -101,7 +101,7 @@ def test_doc_token_api_ancestors(en_tokenizer):
|
||||||
text = "Yesterday I saw a dog that barked loudly."
|
text = "Yesterday I saw a dog that barked loudly."
|
||||||
heads = [2, 1, 0, 1, -2, 1, -2, -1, -6]
|
heads = [2, 1, 0, 1, -2, 1, -2, -1, -6]
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||||
assert [t.text for t in doc[6].ancestors] == ["dog", "saw"]
|
assert [t.text for t in doc[6].ancestors] == ["dog", "saw"]
|
||||||
assert [t.text for t in doc[1].ancestors] == ["saw"]
|
assert [t.text for t in doc[1].ancestors] == ["saw"]
|
||||||
assert [t.text for t in doc[2].ancestors] == []
|
assert [t.text for t in doc[2].ancestors] == []
|
||||||
|
@@ -115,7 +115,7 @@ def test_doc_token_api_head_setter(en_tokenizer):
|
||||||
text = "Yesterday I saw a dog that barked loudly."
|
text = "Yesterday I saw a dog that barked loudly."
|
||||||
heads = [2, 1, 0, 1, -2, 1, -2, -1, -6]
|
heads = [2, 1, 0, 1, -2, 1, -2, -1, -6]
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||||
|
|
||||||
assert doc[6].n_lefts == 1
|
assert doc[6].n_lefts == 1
|
||||||
assert doc[6].n_rights == 1
|
assert doc[6].n_rights == 1
|
||||||
|
@@ -165,7 +165,7 @@ def test_doc_token_api_head_setter(en_tokenizer):
|
||||||
|
|
||||||
|
|
||||||
def test_is_sent_start(en_tokenizer):
|
def test_is_sent_start(en_tokenizer):
|
||||||
doc = en_tokenizer(u'This is a sentence. This is another.')
|
doc = en_tokenizer('This is a sentence. This is another.')
|
||||||
assert doc[5].is_sent_start is None
|
assert doc[5].is_sent_start is None
|
||||||
doc[5].is_sent_start = True
|
doc[5].is_sent_start = True
|
||||||
assert doc[5].is_sent_start is True
|
assert doc[5].is_sent_start is True
|
||||||
|
|
|
@@ -3,10 +3,8 @@ from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
from mock import Mock
|
from mock import Mock
|
||||||
|
from spacy.tokens import Doc, Span, Token
|
||||||
from ..vocab import Vocab
|
from spacy.tokens.underscore import Underscore
|
||||||
from ..tokens import Doc, Span, Token
|
|
||||||
from ..tokens.underscore import Underscore
|
|
||||||
|
|
||||||
|
|
||||||
def test_create_doc_underscore():
|
def test_create_doc_underscore():
|
|
@@ -4,15 +4,14 @@ from __future__ import unicode_literals
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text',
|
@pytest.mark.parametrize('text', ["ق.م", "إلخ", "ص.ب", "ت."])
|
||||||
["ق.م", "إلخ", "ص.ب", "ت."])
|
|
||||||
def test_ar_tokenizer_handles_abbr(ar_tokenizer, text):
|
def test_ar_tokenizer_handles_abbr(ar_tokenizer, text):
|
||||||
tokens = ar_tokenizer(text)
|
tokens = ar_tokenizer(text)
|
||||||
assert len(tokens) == 1
|
assert len(tokens) == 1
|
||||||
|
|
||||||
|
|
||||||
def test_ar_tokenizer_handles_exc_in_text(ar_tokenizer):
|
def test_ar_tokenizer_handles_exc_in_text(ar_tokenizer):
|
||||||
text = u"تعود الكتابة الهيروغليفية إلى سنة 3200 ق.م"
|
text = "تعود الكتابة الهيروغليفية إلى سنة 3200 ق.م"
|
||||||
tokens = ar_tokenizer(text)
|
tokens = ar_tokenizer(text)
|
||||||
assert len(tokens) == 7
|
assert len(tokens) == 7
|
||||||
assert tokens[6].text == "ق.م"
|
assert tokens[6].text == "ق.م"
|
||||||
|
@@ -20,7 +19,6 @@ def test_ar_tokenizer_handles_exc_in_text(ar_tokenizer):
|
||||||
|
|
||||||
|
|
||||||
def test_ar_tokenizer_handles_exc_in_text(ar_tokenizer):
|
def test_ar_tokenizer_handles_exc_in_text(ar_tokenizer):
|
||||||
text = u"يبلغ طول مضيق طارق 14كم "
|
text = "يبلغ طول مضيق طارق 14كم "
|
||||||
tokens = ar_tokenizer(text)
|
tokens = ar_tokenizer(text)
|
||||||
print([(tokens[i].text, tokens[i].suffix_) for i in range(len(tokens))])
|
|
||||||
assert len(tokens) == 6
|
assert len(tokens) == 6
|
||||||
|
|
|
@@ -2,7 +2,7 @@
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
|
||||||
def test_tokenizer_handles_long_text(ar_tokenizer):
|
def test_ar_tokenizer_handles_long_text(ar_tokenizer):
|
||||||
text = """نجيب محفوظ مؤلف و كاتب روائي عربي، يعد من أهم الأدباء العرب خلال القرن العشرين.
|
text = """نجيب محفوظ مؤلف و كاتب روائي عربي، يعد من أهم الأدباء العرب خلال القرن العشرين.
|
||||||
ولد نجيب محفوظ في مدينة القاهرة، حيث ترعرع و تلقى تعليمه الجامعي في جامعتها،
|
ولد نجيب محفوظ في مدينة القاهرة، حيث ترعرع و تلقى تعليمه الجامعي في جامعتها،
|
||||||
فتمكن من نيل شهادة في الفلسفة. ألف محفوظ على مدار حياته الكثير من الأعمال الأدبية، و في مقدمتها ثلاثيته الشهيرة.
|
فتمكن من نيل شهادة في الفلسفة. ألف محفوظ على مدار حياته الكثير من الأعمال الأدبية، و في مقدمتها ثلاثيته الشهيرة.
|
||||||
|
|
|
@@ -3,38 +3,32 @@ from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
TESTCASES = []
|
|
||||||
|
|
||||||
PUNCTUATION_TESTS = [
|
TESTCASES = [
|
||||||
(u'আমি বাংলায় গান গাই!', [u'আমি', u'বাংলায়', u'গান', u'গাই', u'!']),
|
# punctuation tests
|
||||||
(u'আমি বাংলায় কথা কই।', [u'আমি', u'বাংলায়', u'কথা', u'কই', u'।']),
|
('আমি বাংলায় গান গাই!', ['আমি', 'বাংলায়', 'গান', 'গাই', '!']),
|
||||||
(u'বসুন্ধরা জনসম্মুখে দোষ স্বীকার করলো না?', [u'বসুন্ধরা', u'জনসম্মুখে', u'দোষ', u'স্বীকার', u'করলো', u'না', u'?']),
|
('আমি বাংলায় কথা কই।', ['আমি', 'বাংলায়', 'কথা', 'কই', '।']),
|
||||||
(u'টাকা থাকলে কি না হয়!', [u'টাকা', u'থাকলে', u'কি', u'না', u'হয়', u'!']),
|
('বসুন্ধরা জনসম্মুখে দোষ স্বীকার করলো না?', ['বসুন্ধরা', 'জনসম্মুখে', 'দোষ', 'স্বীকার', 'করলো', 'না', '?']),
|
||||||
|
('টাকা থাকলে কি না হয়!', ['টাকা', 'থাকলে', 'কি', 'না', 'হয়', '!']),
|
||||||
|
# abbreviations
|
||||||
|
('ডঃ খালেদ বললেন ঢাকায় ৩৫ ডিগ্রি সে.।', ['ডঃ', 'খালেদ', 'বললেন', 'ঢাকায়', '৩৫', 'ডিগ্রি', 'সে.', '।'])
|
||||||
]
|
]
|
||||||
|
|
||||||
ABBREVIATIONS = [
|
|
||||||
(u'ডঃ খালেদ বললেন ঢাকায় ৩৫ ডিগ্রি সে.।', [u'ডঃ', u'খালেদ', u'বললেন', u'ঢাকায়', u'৩৫', u'ডিগ্রি', u'সে.', u'।'])
|
|
||||||
]
|
|
||||||
|
|
||||||
TESTCASES.extend(PUNCTUATION_TESTS)
|
|
||||||
TESTCASES.extend(ABBREVIATIONS)
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text,expected_tokens', TESTCASES)
|
@pytest.mark.parametrize('text,expected_tokens', TESTCASES)
|
||||||
def test_tokenizer_handles_testcases(bn_tokenizer, text, expected_tokens):
|
def test_bn_tokenizer_handles_testcases(bn_tokenizer, text, expected_tokens):
|
||||||
tokens = bn_tokenizer(text)
|
tokens = bn_tokenizer(text)
|
||||||
token_list = [token.text for token in tokens if not token.is_space]
|
token_list = [token.text for token in tokens if not token.is_space]
|
||||||
assert expected_tokens == token_list
|
assert expected_tokens == token_list
|
||||||
|
|
||||||
|
|
||||||
def test_tokenizer_handles_long_text(bn_tokenizer):
|
def test_bn_tokenizer_handles_long_text(bn_tokenizer):
|
||||||
text = u"""নর্থ সাউথ বিশ্ববিদ্যালয়ে সারাবছর কোন না কোন বিষয়ে গবেষণা চলতেই থাকে। \
|
text = """নর্থ সাউথ বিশ্ববিদ্যালয়ে সারাবছর কোন না কোন বিষয়ে গবেষণা চলতেই থাকে। \
|
||||||
অভিজ্ঞ ফ্যাকাল্টি মেম্বারগণ প্রায়ই শিক্ষার্থীদের নিয়ে বিভিন্ন গবেষণা প্রকল্পে কাজ করেন, \
|
অভিজ্ঞ ফ্যাকাল্টি মেম্বারগণ প্রায়ই শিক্ষার্থীদের নিয়ে বিভিন্ন গবেষণা প্রকল্পে কাজ করেন, \
|
||||||
যার মধ্যে রয়েছে রোবট থেকে মেশিন লার্নিং সিস্টেম ও আর্টিফিশিয়াল ইন্টেলিজেন্স। \
|
যার মধ্যে রয়েছে রোবট থেকে মেশিন লার্নিং সিস্টেম ও আর্টিফিশিয়াল ইন্টেলিজেন্স। \
|
||||||
এসকল প্রকল্পে কাজ করার মাধ্যমে সংশ্লিষ্ট ক্ষেত্রে যথেষ্ঠ পরিমাণ স্পেশালাইজড হওয়া সম্ভব। \
|
এসকল প্রকল্পে কাজ করার মাধ্যমে সংশ্লিষ্ট ক্ষেত্রে যথেষ্ঠ পরিমাণ স্পেশালাইজড হওয়া সম্ভব। \
|
||||||
আর গবেষণার কাজ তোমার ক্যারিয়ারকে ঠেলে নিয়ে যাবে অনেকখানি! \
|
আর গবেষণার কাজ তোমার ক্যারিয়ারকে ঠেলে নিয়ে যাবে অনেকখানি! \
|
||||||
কন্টেস্ট প্রোগ্রামার হও, গবেষক কিংবা ডেভেলপার - নর্থ সাউথ ইউনিভার্সিটিতে তোমার প্রতিভা বিকাশের সুযোগ রয়েছেই। \
|
কন্টেস্ট প্রোগ্রামার হও, গবেষক কিংবা ডেভেলপার - নর্থ সাউথ ইউনিভার্সিটিতে তোমার প্রতিভা বিকাশের সুযোগ রয়েছেই। \
|
||||||
নর্থ সাউথের অসাধারণ কমিউনিটিতে তোমাকে সাদর আমন্ত্রণ।"""
|
নর্থ সাউথের অসাধারণ কমিউনিটিতে তোমাকে সাদর আমন্ত্রণ।"""
|
||||||
|
|
||||||
tokens = bn_tokenizer(text)
|
tokens = bn_tokenizer(text)
|
||||||
assert len(tokens) == 84
|
assert len(tokens) == 84
|
||||||
|
|
|
@@ -3,28 +3,32 @@ from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
@pytest.mark.parametrize('text',
|
|
||||||
["ca.", "m.a.o.", "Jan.", "Dec.", "kr.", "jf."])
|
@pytest.mark.parametrize('text', ["ca.", "m.a.o.", "Jan.", "Dec.", "kr.", "jf."])
|
||||||
def test_da_tokenizer_handles_abbr(da_tokenizer, text):
|
def test_da_tokenizer_handles_abbr(da_tokenizer, text):
|
||||||
tokens = da_tokenizer(text)
|
tokens = da_tokenizer(text)
|
||||||
assert len(tokens) == 1
|
assert len(tokens) == 1
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["Jul.", "jul.", "Tor.", "Tors."])
|
@pytest.mark.parametrize('text', ["Jul.", "jul.", "Tor.", "Tors."])
|
||||||
def test_da_tokenizer_handles_ambiguous_abbr(da_tokenizer, text):
|
def test_da_tokenizer_handles_ambiguous_abbr(da_tokenizer, text):
|
||||||
tokens = da_tokenizer(text)
|
tokens = da_tokenizer(text)
|
||||||
assert len(tokens) == 2
|
assert len(tokens) == 2
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["1.", "10.", "31."])
|
@pytest.mark.parametrize('text', ["1.", "10.", "31."])
|
||||||
def test_da_tokenizer_handles_dates(da_tokenizer, text):
|
def test_da_tokenizer_handles_dates(da_tokenizer, text):
|
||||||
tokens = da_tokenizer(text)
|
tokens = da_tokenizer(text)
|
||||||
assert len(tokens) == 1
|
assert len(tokens) == 1
|
||||||
|
|
||||||
|
|
||||||
def test_da_tokenizer_handles_exc_in_text(da_tokenizer):
|
def test_da_tokenizer_handles_exc_in_text(da_tokenizer):
|
||||||
text = "Det er bl.a. ikke meningen"
|
text = "Det er bl.a. ikke meningen"
|
||||||
tokens = da_tokenizer(text)
|
tokens = da_tokenizer(text)
|
||||||
assert len(tokens) == 5
|
assert len(tokens) == 5
|
||||||
assert tokens[2].text == "bl.a."
|
assert tokens[2].text == "bl.a."
|
||||||
|
|
||||||
|
|
||||||
def test_da_tokenizer_handles_custom_base_exc(da_tokenizer):
|
def test_da_tokenizer_handles_custom_base_exc(da_tokenizer):
|
||||||
text = "Her er noget du kan kigge i."
|
text = "Her er noget du kan kigge i."
|
||||||
tokens = da_tokenizer(text)
|
tokens = da_tokenizer(text)
|
||||||
|
@@ -32,8 +36,9 @@ def test_da_tokenizer_handles_custom_base_exc(da_tokenizer):
|
||||||
assert tokens[6].text == "i"
|
assert tokens[6].text == "i"
|
||||||
assert tokens[7].text == "."
|
assert tokens[7].text == "."
|
||||||
|
|
||||||
@pytest.mark.parametrize('text,norm',
|
|
||||||
[("akvarium", "akvarie"), ("bedstemoder", "bedstemor")])
|
@pytest.mark.parametrize('text,norm', [
|
||||||
|
("akvarium", "akvarie"), ("bedstemoder", "bedstemor")])
|
||||||
def test_da_tokenizer_norm_exceptions(da_tokenizer, text, norm):
|
def test_da_tokenizer_norm_exceptions(da_tokenizer, text, norm):
|
||||||
tokens = da_tokenizer(text)
|
tokens = da_tokenizer(text)
|
||||||
assert tokens[0].norm_ == norm
|
assert tokens[0].norm_ == norm
|
||||||
|
|
|
@@ -4,10 +4,11 @@ from __future__ import unicode_literals
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('string,lemma', [('affaldsgruppernes', 'affaldsgruppe'),
|
@pytest.mark.parametrize('string,lemma', [
|
||||||
('detailhandelsstrukturernes', 'detailhandelsstruktur'),
|
('affaldsgruppernes', 'affaldsgruppe'),
|
||||||
('kolesterols', 'kolesterol'),
|
('detailhandelsstrukturernes', 'detailhandelsstruktur'),
|
||||||
('åsyns', 'åsyn')])
|
('kolesterols', 'kolesterol'),
|
||||||
def test_lemmatizer_lookup_assigns(da_tokenizer, string, lemma):
|
('åsyns', 'åsyn')])
|
||||||
|
def test_da_lemmatizer_lookup_assigns(da_tokenizer, string, lemma):
|
||||||
tokens = da_tokenizer(string)
|
tokens = da_tokenizer(string)
|
||||||
assert tokens[0].lemma_ == lemma
|
assert tokens[0].lemma_ == lemma
|
||||||
|
|
|
@@ -1,24 +1,23 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
"""Test that tokenizer prefixes, suffixes and infixes are handled correctly."""
|
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["(under)"])
|
@pytest.mark.parametrize('text', ["(under)"])
|
||||||
def test_tokenizer_splits_no_special(da_tokenizer, text):
|
def test_da_tokenizer_splits_no_special(da_tokenizer, text):
|
||||||
tokens = da_tokenizer(text)
|
tokens = da_tokenizer(text)
|
||||||
assert len(tokens) == 3
|
assert len(tokens) == 3
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["ta'r", "Søren's", "Lars'"])
|
@pytest.mark.parametrize('text', ["ta'r", "Søren's", "Lars'"])
|
||||||
def test_tokenizer_handles_no_punct(da_tokenizer, text):
|
def test_da_tokenizer_handles_no_punct(da_tokenizer, text):
|
||||||
tokens = da_tokenizer(text)
|
tokens = da_tokenizer(text)
|
||||||
assert len(tokens) == 1
|
assert len(tokens) == 1
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["(ta'r"])
|
@pytest.mark.parametrize('text', ["(ta'r"])
|
||||||
def test_tokenizer_splits_prefix_punct(da_tokenizer, text):
|
def test_da_tokenizer_splits_prefix_punct(da_tokenizer, text):
|
||||||
tokens = da_tokenizer(text)
|
tokens = da_tokenizer(text)
|
||||||
assert len(tokens) == 2
|
assert len(tokens) == 2
|
||||||
assert tokens[0].text == "("
|
assert tokens[0].text == "("
|
||||||
|
@@ -26,22 +25,23 @@ def test_tokenizer_splits_prefix_punct(da_tokenizer, text):
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["ta'r)"])
|
@pytest.mark.parametrize('text', ["ta'r)"])
|
||||||
def test_tokenizer_splits_suffix_punct(da_tokenizer, text):
|
def test_da_tokenizer_splits_suffix_punct(da_tokenizer, text):
|
||||||
tokens = da_tokenizer(text)
|
tokens = da_tokenizer(text)
|
||||||
assert len(tokens) == 2
|
assert len(tokens) == 2
|
||||||
assert tokens[0].text == "ta'r"
|
assert tokens[0].text == "ta'r"
|
||||||
assert tokens[1].text == ")"
|
assert tokens[1].text == ")"
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text,expected', [("(ta'r)", ["(", "ta'r", ")"]), ("'ta'r'", ["'", "ta'r", "'"])])
|
@pytest.mark.parametrize('text,expected', [
|
||||||
def test_tokenizer_splits_even_wrap(da_tokenizer, text, expected):
|
("(ta'r)", ["(", "ta'r", ")"]), ("'ta'r'", ["'", "ta'r", "'"])])
|
||||||
|
def test_da_tokenizer_splits_even_wrap(da_tokenizer, text, expected):
|
||||||
tokens = da_tokenizer(text)
|
tokens = da_tokenizer(text)
|
||||||
assert len(tokens) == len(expected)
|
assert len(tokens) == len(expected)
|
||||||
assert [t.text for t in tokens] == expected
|
assert [t.text for t in tokens] == expected
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["(ta'r?)"])
|
@pytest.mark.parametrize('text', ["(ta'r?)"])
|
||||||
def test_tokenizer_splits_uneven_wrap(da_tokenizer, text):
|
def test_da_tokenizer_splits_uneven_wrap(da_tokenizer, text):
|
||||||
tokens = da_tokenizer(text)
|
tokens = da_tokenizer(text)
|
||||||
assert len(tokens) == 4
|
assert len(tokens) == 4
|
||||||
assert tokens[0].text == "("
|
assert tokens[0].text == "("
|
||||||
|
@@ -50,15 +50,16 @@ def test_tokenizer_splits_uneven_wrap(da_tokenizer, text):
|
||||||
assert tokens[3].text == ")"
|
assert tokens[3].text == ")"
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text,expected', [("f.eks.", ["f.eks."]), ("fe.", ["fe", "."]), ("(f.eks.", ["(", "f.eks."])])
|
@pytest.mark.parametrize('text,expected', [
|
||||||
def test_tokenizer_splits_prefix_interact(da_tokenizer, text, expected):
|
("f.eks.", ["f.eks."]), ("fe.", ["fe", "."]), ("(f.eks.", ["(", "f.eks."])])
|
||||||
|
def test_da_tokenizer_splits_prefix_interact(da_tokenizer, text, expected):
|
||||||
tokens = da_tokenizer(text)
|
tokens = da_tokenizer(text)
|
||||||
assert len(tokens) == len(expected)
|
assert len(tokens) == len(expected)
|
||||||
assert [t.text for t in tokens] == expected
|
assert [t.text for t in tokens] == expected
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["f.eks.)"])
|
@pytest.mark.parametrize('text', ["f.eks.)"])
|
||||||
def test_tokenizer_splits_suffix_interact(da_tokenizer, text):
|
def test_da_tokenizer_splits_suffix_interact(da_tokenizer, text):
|
||||||
tokens = da_tokenizer(text)
|
tokens = da_tokenizer(text)
|
||||||
assert len(tokens) == 2
|
assert len(tokens) == 2
|
||||||
assert tokens[0].text == "f.eks."
|
assert tokens[0].text == "f.eks."
|
||||||
|
@@ -66,7 +67,7 @@ def test_tokenizer_splits_suffix_interact(da_tokenizer, text):
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["(f.eks.)"])
|
@pytest.mark.parametrize('text', ["(f.eks.)"])
|
||||||
def test_tokenizer_splits_even_wrap_interact(da_tokenizer, text):
|
def test_da_tokenizer_splits_even_wrap_interact(da_tokenizer, text):
|
||||||
tokens = da_tokenizer(text)
|
tokens = da_tokenizer(text)
|
||||||
assert len(tokens) == 3
|
assert len(tokens) == 3
|
||||||
assert tokens[0].text == "("
|
assert tokens[0].text == "("
|
||||||
|
@@ -75,7 +76,7 @@ def test_tokenizer_splits_even_wrap_interact(da_tokenizer, text):
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["(f.eks.?)"])
|
@pytest.mark.parametrize('text', ["(f.eks.?)"])
|
||||||
def test_tokenizer_splits_uneven_wrap_interact(da_tokenizer, text):
|
def test_da_tokenizer_splits_uneven_wrap_interact(da_tokenizer, text):
|
||||||
tokens = da_tokenizer(text)
|
tokens = da_tokenizer(text)
|
||||||
assert len(tokens) == 4
|
assert len(tokens) == 4
|
||||||
assert tokens[0].text == "("
|
assert tokens[0].text == "("
|
||||||
|
@@ -85,19 +86,19 @@ def test_tokenizer_splits_uneven_wrap_interact(da_tokenizer, text):
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["0,1-13,5", "0,0-0,1", "103,27-300", "1/2-3/4"])
|
@pytest.mark.parametrize('text', ["0,1-13,5", "0,0-0,1", "103,27-300", "1/2-3/4"])
|
||||||
def test_tokenizer_handles_numeric_range(da_tokenizer, text):
|
def test_da_tokenizer_handles_numeric_range(da_tokenizer, text):
|
||||||
tokens = da_tokenizer(text)
|
tokens = da_tokenizer(text)
|
||||||
assert len(tokens) == 1
|
assert len(tokens) == 1
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["sort.Gul", "Hej.Verden"])
|
@pytest.mark.parametrize('text', ["sort.Gul", "Hej.Verden"])
|
||||||
def test_tokenizer_splits_period_infix(da_tokenizer, text):
|
def test_da_tokenizer_splits_period_infix(da_tokenizer, text):
|
||||||
tokens = da_tokenizer(text)
|
tokens = da_tokenizer(text)
|
||||||
assert len(tokens) == 3
|
assert len(tokens) == 3
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["Hej,Verden", "en,to"])
|
@pytest.mark.parametrize('text', ["Hej,Verden", "en,to"])
|
||||||
def test_tokenizer_splits_comma_infix(da_tokenizer, text):
|
def test_da_tokenizer_splits_comma_infix(da_tokenizer, text):
|
||||||
tokens = da_tokenizer(text)
|
tokens = da_tokenizer(text)
|
||||||
assert len(tokens) == 3
|
assert len(tokens) == 3
|
||||||
assert tokens[0].text == text.split(",")[0]
|
assert tokens[0].text == text.split(",")[0]
|
||||||
|
@@ -106,18 +107,18 @@ def test_tokenizer_splits_comma_infix(da_tokenizer, text):
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["sort...Gul", "sort...gul"])
|
@pytest.mark.parametrize('text', ["sort...Gul", "sort...gul"])
|
||||||
def test_tokenizer_splits_ellipsis_infix(da_tokenizer, text):
|
def test_da_tokenizer_splits_ellipsis_infix(da_tokenizer, text):
|
||||||
tokens = da_tokenizer(text)
|
tokens = da_tokenizer(text)
|
||||||
assert len(tokens) == 3
|
assert len(tokens) == 3
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ['gå-på-mod', '4-hjulstræk', '100-Pfennig-frimærke', 'TV-2-spots', 'trofæ-vaeggen'])
|
@pytest.mark.parametrize('text', ['gå-på-mod', '4-hjulstræk', '100-Pfennig-frimærke', 'TV-2-spots', 'trofæ-vaeggen'])
|
||||||
def test_tokenizer_keeps_hyphens(da_tokenizer, text):
|
def test_da_tokenizer_keeps_hyphens(da_tokenizer, text):
|
||||||
tokens = da_tokenizer(text)
|
tokens = da_tokenizer(text)
|
||||||
assert len(tokens) == 1
|
assert len(tokens) == 1
|
||||||
|
|
||||||
|
|
||||||
def test_tokenizer_splits_double_hyphen_infix(da_tokenizer):
|
def test_da_tokenizer_splits_double_hyphen_infix(da_tokenizer):
|
||||||
tokens = da_tokenizer("Mange regler--eksempelvis bindestregs-reglerne--er komplicerede.")
|
tokens = da_tokenizer("Mange regler--eksempelvis bindestregs-reglerne--er komplicerede.")
|
||||||
assert len(tokens) == 9
|
assert len(tokens) == 9
|
||||||
assert tokens[0].text == "Mange"
|
assert tokens[0].text == "Mange"
|
||||||
|
@@ -130,7 +131,7 @@ def test_tokenizer_splits_double_hyphen_infix(da_tokenizer):
|
||||||
assert tokens[7].text == "komplicerede"
|
assert tokens[7].text == "komplicerede"
|
||||||
|
|
||||||
|
|
||||||
def test_tokenizer_handles_posessives_and_contractions(da_tokenizer):
|
def test_da_tokenizer_handles_posessives_and_contractions(da_tokenizer):
|
||||||
tokens = da_tokenizer("'DBA's, Lars' og Liz' bil sku' sgu' ik' ha' en bule, det ka' han ik' li' mere', sagde hun.")
|
tokens = da_tokenizer("'DBA's, Lars' og Liz' bil sku' sgu' ik' ha' en bule, det ka' han ik' li' mere', sagde hun.")
|
||||||
assert len(tokens) == 25
|
assert len(tokens) == 25
|
||||||
assert tokens[0].text == "'"
|
assert tokens[0].text == "'"
|
||||||
|
|
|
@@ -1,10 +1,9 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
"""Test that longer and mixed texts are tokenized correctly."""
|
|
||||||
|
|
||||||
|
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
from spacy.lang.da.lex_attrs import like_num
|
||||||
|
|
||||||
|
|
||||||
def test_da_tokenizer_handles_long_text(da_tokenizer):
|
def test_da_tokenizer_handles_long_text(da_tokenizer):
|
||||||
text = """Der var så dejligt ude på landet. Det var sommer, kornet stod gult, havren grøn,
|
text = """Der var så dejligt ude på landet. Det var sommer, kornet stod gult, havren grøn,
|
||||||
|
@@ -15,6 +14,7 @@ Rundt om ager og eng var der store skove, og midt i skovene dybe søer; jo, der
|
||||||
tokens = da_tokenizer(text)
|
tokens = da_tokenizer(text)
|
||||||
assert len(tokens) == 84
|
assert len(tokens) == 84
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text,match', [
|
@pytest.mark.parametrize('text,match', [
|
||||||
('10', True), ('1', True), ('10.000', True), ('10.00', True),
|
('10', True), ('1', True), ('10.000', True), ('10.00', True),
|
||||||
('999,0', True), ('en', True), ('treoghalvfemsindstyvende', True), ('hundrede', True),
|
('999,0', True), ('en', True), ('treoghalvfemsindstyvende', True), ('hundrede', True),
|
||||||
|
@@ -22,6 +22,10 @@ Rundt om ager og eng var der store skove, og midt i skovene dybe søer; jo, der
|
||||||
def test_lex_attrs_like_number(da_tokenizer, text, match):
|
def test_lex_attrs_like_number(da_tokenizer, text, match):
|
||||||
tokens = da_tokenizer(text)
|
tokens = da_tokenizer(text)
|
||||||
assert len(tokens) == 1
|
assert len(tokens) == 1
|
||||||
print(tokens[0])
|
|
||||||
assert tokens[0].like_num == match
|
assert tokens[0].like_num == match
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('word', ['elleve', 'første'])
|
||||||
|
def test_da_lex_attrs_capitals(word):
|
||||||
|
assert like_num(word)
|
||||||
|
assert like_num(word.upper())
|
||||||
|
|
|
@@ -1,7 +1,4 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
"""Test that tokenizer exceptions and emoticons are handles correctly."""
|
|
||||||
|
|
||||||
|
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
|
@@ -4,12 +4,13 @@ from __future__ import unicode_literals
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('string,lemma', [('Abgehängten', 'Abgehängte'),
|
@pytest.mark.parametrize('string,lemma', [
|
||||||
('engagierte', 'engagieren'),
|
('Abgehängten', 'Abgehängte'),
|
||||||
('schließt', 'schließen'),
|
('engagierte', 'engagieren'),
|
||||||
('vorgebenden', 'vorgebend'),
|
('schließt', 'schließen'),
|
||||||
('die', 'der'),
|
('vorgebenden', 'vorgebend'),
|
||||||
('Die', 'der')])
|
('die', 'der'),
|
||||||
def test_lemmatizer_lookup_assigns(de_tokenizer, string, lemma):
|
('Die', 'der')])
|
||||||
|
def test_de_lemmatizer_lookup_assigns(de_tokenizer, string, lemma):
|
||||||
tokens = de_tokenizer(string)
|
tokens = de_tokenizer(string)
|
||||||
assert tokens[0].lemma_ == lemma
|
assert tokens[0].lemma_ == lemma
|
||||||
|
|
|
@@ -1,77 +0,0 @@
|
||||||
# coding: utf-8
|
|
||||||
from __future__ import unicode_literals
|
|
||||||
|
|
||||||
import numpy
|
|
||||||
import pytest
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
|
||||||
def example(DE):
|
|
||||||
"""
|
|
||||||
This is to make sure the model works as expected. The tests make sure that
|
|
||||||
values are properly set. Tests are not meant to evaluate the content of the
|
|
||||||
output, only make sure the output is formally okay.
|
|
||||||
"""
|
|
||||||
assert DE.entity != None
|
|
||||||
return DE('An der großen Straße stand eine merkwürdige Gestalt und führte Selbstgespräche.')
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.models('de')
|
|
||||||
def test_de_models_tokenization(example):
|
|
||||||
# tokenization should split the document into tokens
|
|
||||||
assert len(example) > 1
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.xfail
|
|
||||||
@pytest.mark.models('de')
|
|
||||||
def test_de_models_tagging(example):
|
|
||||||
# if tagging was done properly, pos tags shouldn't be empty
|
|
||||||
assert example.is_tagged
|
|
||||||
assert all(t.pos != 0 for t in example)
|
|
||||||
assert all(t.tag != 0 for t in example)
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.models('de')
|
|
||||||
def test_de_models_parsing(example):
|
|
||||||
# if parsing was done properly
|
|
||||||
# - dependency labels shouldn't be empty
|
|
||||||
# - the head of some tokens should not be root
|
|
||||||
assert example.is_parsed
|
|
||||||
assert all(t.dep != 0 for t in example)
|
|
||||||
assert any(t.dep != i for i,t in enumerate(example))
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.models('de')
|
|
||||||
def test_de_models_ner(example):
|
|
||||||
# if ner was done properly, ent_iob shouldn't be empty
|
|
||||||
assert all([t.ent_iob != 0 for t in example])
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.models('de')
|
|
||||||
def test_de_models_vectors(example):
|
|
||||||
# if vectors are available, they should differ on different words
|
|
||||||
# this isn't a perfect test since this could in principle fail
|
|
||||||
# in a sane model as well,
|
|
||||||
# but that's very unlikely and a good indicator if something is wrong
|
|
||||||
vector0 = example[0].vector
|
|
||||||
vector1 = example[1].vector
|
|
||||||
vector2 = example[2].vector
|
|
||||||
assert not numpy.array_equal(vector0,vector1)
|
|
||||||
assert not numpy.array_equal(vector0,vector2)
|
|
||||||
assert not numpy.array_equal(vector1,vector2)
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.xfail
|
|
||||||
@pytest.mark.models('de')
|
|
||||||
def test_de_models_probs(example):
|
|
||||||
# if frequencies/probabilities are okay, they should differ for
|
|
||||||
# different words
|
|
||||||
# this isn't a perfect test since this could in principle fail
|
|
||||||
# in a sane model as well,
|
|
||||||
# but that's very unlikely and a good indicator if something is wrong
|
|
||||||
prob0 = example[0].prob
|
|
||||||
prob1 = example[1].prob
|
|
||||||
prob2 = example[2].prob
|
|
||||||
assert not prob0 == prob1
|
|
||||||
assert not prob0 == prob2
|
|
||||||
assert not prob1 == prob2
|
|
|
@@ -3,17 +3,14 @@ from __future__ import unicode_literals
|
||||||
|
|
||||||
from ...util import get_doc
|
from ...util import get_doc
|
||||||
|
|
||||||
import pytest
|
|
||||||
|
|
||||||
|
|
||||||
def test_de_parser_noun_chunks_standard_de(de_tokenizer):
|
def test_de_parser_noun_chunks_standard_de(de_tokenizer):
|
||||||
text = "Eine Tasse steht auf dem Tisch."
|
text = "Eine Tasse steht auf dem Tisch."
|
||||||
heads = [1, 1, 0, -1, 1, -2, -4]
|
heads = [1, 1, 0, -1, 1, -2, -4]
|
||||||
tags = ['ART', 'NN', 'VVFIN', 'APPR', 'ART', 'NN', '$.']
|
tags = ['ART', 'NN', 'VVFIN', 'APPR', 'ART', 'NN', '$.']
|
||||||
deps = ['nk', 'sb', 'ROOT', 'mo', 'nk', 'nk', 'punct']
|
deps = ['nk', 'sb', 'ROOT', 'mo', 'nk', 'nk', 'punct']
|
||||||
|
|
||||||
tokens = de_tokenizer(text)
|
tokens = de_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags, deps=deps, heads=heads)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
|
||||||
chunks = list(doc.noun_chunks)
|
chunks = list(doc.noun_chunks)
|
||||||
assert len(chunks) == 2
|
assert len(chunks) == 2
|
||||||
assert chunks[0].text_with_ws == "Eine Tasse "
|
assert chunks[0].text_with_ws == "Eine Tasse "
|
||||||
|
@@ -25,9 +22,8 @@ def test_de_extended_chunk(de_tokenizer):
|
||||||
heads = [1, 1, 0, -1, 1, -2, -1, -5, -6]
|
heads = [1, 1, 0, -1, 1, -2, -1, -5, -6]
|
||||||
tags = ['ART', 'NN', 'VVFIN', 'APPR', 'ART', 'NN', 'NN', 'NN', '$.']
|
tags = ['ART', 'NN', 'VVFIN', 'APPR', 'ART', 'NN', 'NN', 'NN', '$.']
|
||||||
deps = ['nk', 'sb', 'ROOT', 'mo', 'nk', 'nk', 'nk', 'oa', 'punct']
|
deps = ['nk', 'sb', 'ROOT', 'mo', 'nk', 'nk', 'nk', 'oa', 'punct']
|
||||||
|
|
||||||
tokens = de_tokenizer(text)
|
tokens = de_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags, deps=deps, heads=heads)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
|
||||||
chunks = list(doc.noun_chunks)
|
chunks = list(doc.noun_chunks)
|
||||||
assert len(chunks) == 3
|
assert len(chunks) == 3
|
||||||
assert chunks[0].text_with_ws == "Die Sängerin "
|
assert chunks[0].text_with_ws == "Die Sängerin "
|
||||||
|
|
|
@@ -1,86 +1,83 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
"""Test that tokenizer prefixes, suffixes and infixes are handled correctly."""
|
|
||||||
|
|
||||||
|
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["(unter)"])
|
@pytest.mark.parametrize('text', ["(unter)"])
|
||||||
def test_tokenizer_splits_no_special(de_tokenizer, text):
|
def test_de_tokenizer_splits_no_special(de_tokenizer, text):
|
||||||
tokens = de_tokenizer(text)
|
tokens = de_tokenizer(text)
|
||||||
assert len(tokens) == 3
|
assert len(tokens) == 3
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["unter'm"])
|
@pytest.mark.parametrize('text', ["unter'm"])
|
||||||
def test_tokenizer_splits_no_punct(de_tokenizer, text):
|
def test_de_tokenizer_splits_no_punct(de_tokenizer, text):
|
||||||
tokens = de_tokenizer(text)
|
tokens = de_tokenizer(text)
|
||||||
assert len(tokens) == 2
|
assert len(tokens) == 2
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["(unter'm"])
|
@pytest.mark.parametrize('text', ["(unter'm"])
|
||||||
def test_tokenizer_splits_prefix_punct(de_tokenizer, text):
|
def test_de_tokenizer_splits_prefix_punct(de_tokenizer, text):
|
||||||
tokens = de_tokenizer(text)
|
tokens = de_tokenizer(text)
|
||||||
assert len(tokens) == 3
|
assert len(tokens) == 3
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["unter'm)"])
|
@pytest.mark.parametrize('text', ["unter'm)"])
|
||||||
def test_tokenizer_splits_suffix_punct(de_tokenizer, text):
|
def test_de_tokenizer_splits_suffix_punct(de_tokenizer, text):
|
||||||
tokens = de_tokenizer(text)
|
tokens = de_tokenizer(text)
|
||||||
assert len(tokens) == 3
|
assert len(tokens) == 3
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["(unter'm)"])
|
@pytest.mark.parametrize('text', ["(unter'm)"])
|
||||||
def test_tokenizer_splits_even_wrap(de_tokenizer, text):
|
def test_de_tokenizer_splits_even_wrap(de_tokenizer, text):
|
||||||
tokens = de_tokenizer(text)
|
tokens = de_tokenizer(text)
|
||||||
assert len(tokens) == 4
|
assert len(tokens) == 4
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["(unter'm?)"])
|
@pytest.mark.parametrize('text', ["(unter'm?)"])
|
||||||
def test_tokenizer_splits_uneven_wrap(de_tokenizer, text):
|
def test_de_tokenizer_splits_uneven_wrap(de_tokenizer, text):
|
||||||
tokens = de_tokenizer(text)
|
tokens = de_tokenizer(text)
|
||||||
assert len(tokens) == 5
|
assert len(tokens) == 5
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text,length', [("z.B.", 1), ("zb.", 2), ("(z.B.", 2)])
|
@pytest.mark.parametrize('text,length', [("z.B.", 1), ("zb.", 2), ("(z.B.", 2)])
|
||||||
def test_tokenizer_splits_prefix_interact(de_tokenizer, text, length):
|
def test_de_tokenizer_splits_prefix_interact(de_tokenizer, text, length):
|
||||||
tokens = de_tokenizer(text)
|
tokens = de_tokenizer(text)
|
||||||
assert len(tokens) == length
|
assert len(tokens) == length
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["z.B.)"])
|
@pytest.mark.parametrize('text', ["z.B.)"])
|
||||||
def test_tokenizer_splits_suffix_interact(de_tokenizer, text):
|
def test_de_tokenizer_splits_suffix_interact(de_tokenizer, text):
|
||||||
tokens = de_tokenizer(text)
|
tokens = de_tokenizer(text)
|
||||||
assert len(tokens) == 2
|
assert len(tokens) == 2
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["(z.B.)"])
|
@pytest.mark.parametrize('text', ["(z.B.)"])
|
||||||
def test_tokenizer_splits_even_wrap_interact(de_tokenizer, text):
|
def test_de_tokenizer_splits_even_wrap_interact(de_tokenizer, text):
|
||||||
tokens = de_tokenizer(text)
|
tokens = de_tokenizer(text)
|
||||||
assert len(tokens) == 3
|
assert len(tokens) == 3
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["(z.B.?)"])
|
@pytest.mark.parametrize('text', ["(z.B.?)"])
|
||||||
def test_tokenizer_splits_uneven_wrap_interact(de_tokenizer, text):
|
def test_de_tokenizer_splits_uneven_wrap_interact(de_tokenizer, text):
|
||||||
tokens = de_tokenizer(text)
|
tokens = de_tokenizer(text)
|
||||||
assert len(tokens) == 4
|
assert len(tokens) == 4
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["0.1-13.5", "0.0-0.1", "103.27-300"])
|
@pytest.mark.parametrize('text', ["0.1-13.5", "0.0-0.1", "103.27-300"])
|
||||||
def test_tokenizer_splits_numeric_range(de_tokenizer, text):
|
def test_de_tokenizer_splits_numeric_range(de_tokenizer, text):
|
||||||
tokens = de_tokenizer(text)
|
tokens = de_tokenizer(text)
|
||||||
assert len(tokens) == 3
|
assert len(tokens) == 3
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["blau.Rot", "Hallo.Welt"])
|
@pytest.mark.parametrize('text', ["blau.Rot", "Hallo.Welt"])
|
||||||
def test_tokenizer_splits_period_infix(de_tokenizer, text):
|
def test_de_tokenizer_splits_period_infix(de_tokenizer, text):
|
||||||
tokens = de_tokenizer(text)
|
tokens = de_tokenizer(text)
|
||||||
assert len(tokens) == 3
|
assert len(tokens) == 3
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["Hallo,Welt", "eins,zwei"])
|
@pytest.mark.parametrize('text', ["Hallo,Welt", "eins,zwei"])
|
||||||
def test_tokenizer_splits_comma_infix(de_tokenizer, text):
|
def test_de_tokenizer_splits_comma_infix(de_tokenizer, text):
|
||||||
tokens = de_tokenizer(text)
|
tokens = de_tokenizer(text)
|
||||||
assert len(tokens) == 3
|
assert len(tokens) == 3
|
||||||
assert tokens[0].text == text.split(",")[0]
|
assert tokens[0].text == text.split(",")[0]
|
||||||
|
@@ -89,18 +86,18 @@ def test_tokenizer_splits_comma_infix(de_tokenizer, text):
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["blau...Rot", "blau...rot"])
|
@pytest.mark.parametrize('text', ["blau...Rot", "blau...rot"])
|
||||||
def test_tokenizer_splits_ellipsis_infix(de_tokenizer, text):
|
def test_de_tokenizer_splits_ellipsis_infix(de_tokenizer, text):
|
||||||
tokens = de_tokenizer(text)
|
tokens = de_tokenizer(text)
|
||||||
assert len(tokens) == 3
|
assert len(tokens) == 3
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ['Islam-Konferenz', 'Ost-West-Konflikt'])
|
@pytest.mark.parametrize('text', ['Islam-Konferenz', 'Ost-West-Konflikt'])
|
||||||
def test_tokenizer_keeps_hyphens(de_tokenizer, text):
|
def test_de_tokenizer_keeps_hyphens(de_tokenizer, text):
|
||||||
tokens = de_tokenizer(text)
|
tokens = de_tokenizer(text)
|
||||||
assert len(tokens) == 1
|
assert len(tokens) == 1
|
||||||
|
|
||||||
|
|
||||||
def test_tokenizer_splits_double_hyphen_infix(de_tokenizer):
|
def test_de_tokenizer_splits_double_hyphen_infix(de_tokenizer):
|
||||||
tokens = de_tokenizer("Viele Regeln--wie die Bindestrich-Regeln--sind kompliziert.")
|
tokens = de_tokenizer("Viele Regeln--wie die Bindestrich-Regeln--sind kompliziert.")
|
||||||
assert len(tokens) == 10
|
assert len(tokens) == 10
|
||||||
assert tokens[0].text == "Viele"
|
assert tokens[0].text == "Viele"
|
||||||
|
|
|
@@ -1,13 +1,10 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
"""Test that longer and mixed texts are tokenized correctly."""
|
|
||||||
|
|
||||||
|
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
def test_tokenizer_handles_long_text(de_tokenizer):
|
def test_de_tokenizer_handles_long_text(de_tokenizer):
|
||||||
text = """Die Verwandlung
|
text = """Die Verwandlung
|
||||||
|
|
||||||
Als Gregor Samsa eines Morgens aus unruhigen Träumen erwachte, fand er sich in
|
Als Gregor Samsa eines Morgens aus unruhigen Träumen erwachte, fand er sich in
|
||||||
|
@ -29,17 +26,15 @@ Umfang kläglich dünnen Beine flimmerten ihm hilflos vor den Augen.
|
||||||
"Donaudampfschifffahrtsgesellschaftskapitänsanwärterposten",
|
"Donaudampfschifffahrtsgesellschaftskapitänsanwärterposten",
|
||||||
"Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz",
|
"Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz",
|
||||||
"Kraftfahrzeug-Haftpflichtversicherung",
|
"Kraftfahrzeug-Haftpflichtversicherung",
|
||||||
"Vakuum-Mittelfrequenz-Induktionsofen"
|
"Vakuum-Mittelfrequenz-Induktionsofen"])
|
||||||
])
|
def test_de_tokenizer_handles_long_words(de_tokenizer, text):
|
||||||
def test_tokenizer_handles_long_words(de_tokenizer, text):
|
|
||||||
tokens = de_tokenizer(text)
|
tokens = de_tokenizer(text)
|
||||||
assert len(tokens) == 1
|
assert len(tokens) == 1
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text,length', [
|
@pytest.mark.parametrize('text,length', [
|
||||||
("»Was ist mit mir geschehen?«, dachte er.", 12),
|
("»Was ist mit mir geschehen?«, dachte er.", 12),
|
||||||
("“Dies frühzeitige Aufstehen”, dachte er, “macht einen ganz blödsinnig. ", 15)
|
("“Dies frühzeitige Aufstehen”, dachte er, “macht einen ganz blödsinnig. ", 15)])
|
||||||
])
|
def test_de_tokenizer_handles_examples(de_tokenizer, text, length):
|
||||||
def test_tokenizer_handles_examples(de_tokenizer, text, length):
|
|
||||||
tokens = de_tokenizer(text)
|
tokens = de_tokenizer(text)
|
||||||
assert len(tokens) == length
|
assert len(tokens) == length
|
||||||
|
|
|
@ -1,17 +1,16 @@
|
||||||
# -*- coding: utf-8 -*-
|
# coding: utf8
|
||||||
|
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["αριθ.", "τρισ.", "δισ.", "σελ."])
|
@pytest.mark.parametrize('text', ["αριθ.", "τρισ.", "δισ.", "σελ."])
|
||||||
def test_tokenizer_handles_abbr(el_tokenizer, text):
|
def test_el_tokenizer_handles_abbr(el_tokenizer, text):
|
||||||
tokens = el_tokenizer(text)
|
tokens = el_tokenizer(text)
|
||||||
assert len(tokens) == 1
|
assert len(tokens) == 1
|
||||||
|
|
||||||
|
|
||||||
def test_tokenizer_handles_exc_in_text(el_tokenizer):
|
def test_el_tokenizer_handles_exc_in_text(el_tokenizer):
|
||||||
text = "Στα 14 τρισ. δολάρια το κόστος από την άνοδο της στάθμης της θάλασσας."
|
text = "Στα 14 τρισ. δολάρια το κόστος από την άνοδο της στάθμης της θάλασσας."
|
||||||
tokens = el_tokenizer(text)
|
tokens = el_tokenizer(text)
|
||||||
assert len(tokens) == 14
|
assert len(tokens) == 14
|
||||||
|
|
|
@ -1,11 +1,10 @@
|
||||||
# -*- coding: utf-8 -*-
|
# coding: utf8
|
||||||
|
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
def test_tokenizer_handles_long_text(el_tokenizer):
|
def test_el_tokenizer_handles_long_text(el_tokenizer):
|
||||||
text = """Η Ελλάδα (παλαιότερα Ελλάς), επίσημα γνωστή ως Ελληνική Δημοκρατία,\
|
text = """Η Ελλάδα (παλαιότερα Ελλάς), επίσημα γνωστή ως Ελληνική Δημοκρατία,\
|
||||||
είναι χώρα της νοτιοανατολικής Ευρώπης στο νοτιότερο άκρο της Βαλκανικής χερσονήσου.\
|
είναι χώρα της νοτιοανατολικής Ευρώπης στο νοτιότερο άκρο της Βαλκανικής χερσονήσου.\
|
||||||
Συνορεύει στα βορειοδυτικά με την Αλβανία, στα βόρεια με την πρώην\
|
Συνορεύει στα βορειοδυτικά με την Αλβανία, στα βόρεια με την πρώην\
|
||||||
|
@ -20,6 +19,6 @@ def test_tokenizer_handles_long_text(el_tokenizer):
|
||||||
("Η Ελλάδα είναι μία από τις χώρες της Ευρωπαϊκής Ένωσης (ΕΕ) που διαθέτει σηµαντικό ορυκτό πλούτο.", 19),
|
("Η Ελλάδα είναι μία από τις χώρες της Ευρωπαϊκής Ένωσης (ΕΕ) που διαθέτει σηµαντικό ορυκτό πλούτο.", 19),
|
||||||
("Η ναυτιλία αποτέλεσε ένα σημαντικό στοιχείο της Ελληνικής οικονομικής δραστηριότητας από τα αρχαία χρόνια.", 15),
|
("Η ναυτιλία αποτέλεσε ένα σημαντικό στοιχείο της Ελληνικής οικονομικής δραστηριότητας από τα αρχαία χρόνια.", 15),
|
||||||
("Η Ελλάδα είναι μέλος σε αρκετούς διεθνείς οργανισμούς.", 9)])
|
("Η Ελλάδα είναι μέλος σε αρκετούς διεθνείς οργανισμούς.", 9)])
|
||||||
def test_tokenizer_handles_cnts(el_tokenizer,text, length):
|
def test_el_tokenizer_handles_cnts(el_tokenizer,text, length):
|
||||||
tokens = el_tokenizer(text)
|
tokens = el_tokenizer(text)
|
||||||
assert len(tokens) == length
|
assert len(tokens) == length
|
|
@ -2,23 +2,22 @@
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
from spacy.lang.en import English
|
||||||
from ....lang.en import English
|
from spacy.tokenizer import Tokenizer
|
||||||
from ....tokenizer import Tokenizer
|
from spacy.util import compile_prefix_regex, compile_suffix_regex
|
||||||
from .... import util
|
from spacy.util import compile_infix_regex
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
def custom_en_tokenizer(en_vocab):
|
def custom_en_tokenizer(en_vocab):
|
||||||
prefix_re = util.compile_prefix_regex(English.Defaults.prefixes)
|
prefix_re = compile_prefix_regex(English.Defaults.prefixes)
|
||||||
suffix_re = util.compile_suffix_regex(English.Defaults.suffixes)
|
suffix_re = compile_suffix_regex(English.Defaults.suffixes)
|
||||||
custom_infixes = ['\.\.\.+',
|
custom_infixes = ['\.\.\.+',
|
||||||
'(?<=[0-9])-(?=[0-9])',
|
'(?<=[0-9])-(?=[0-9])',
|
||||||
# '(?<=[0-9]+),(?=[0-9]+)',
|
# '(?<=[0-9]+),(?=[0-9]+)',
|
||||||
'[0-9]+(,[0-9]+)+',
|
'[0-9]+(,[0-9]+)+',
|
||||||
'[\[\]!&:,()\*—–\/-]']
|
'[\[\]!&:,()\*—–\/-]']
|
||||||
|
infix_re = compile_infix_regex(custom_infixes)
|
||||||
infix_re = util.compile_infix_regex(custom_infixes)
|
|
||||||
return Tokenizer(en_vocab,
|
return Tokenizer(en_vocab,
|
||||||
English.Defaults.tokenizer_exceptions,
|
English.Defaults.tokenizer_exceptions,
|
||||||
prefix_re.search,
|
prefix_re.search,
|
||||||
|
@ -27,13 +26,12 @@ def custom_en_tokenizer(en_vocab):
|
||||||
token_match=None)
|
token_match=None)
|
||||||
|
|
||||||
|
|
||||||
def test_customized_tokenizer_handles_infixes(custom_en_tokenizer):
|
def test_en_customized_tokenizer_handles_infixes(custom_en_tokenizer):
|
||||||
sentence = "The 8 and 10-county definitions are not used for the greater Southern California Megaregion."
|
sentence = "The 8 and 10-county definitions are not used for the greater Southern California Megaregion."
|
||||||
context = [word.text for word in custom_en_tokenizer(sentence)]
|
context = [word.text for word in custom_en_tokenizer(sentence)]
|
||||||
assert context == ['The', '8', 'and', '10', '-', 'county', 'definitions',
|
assert context == ['The', '8', 'and', '10', '-', 'county', 'definitions',
|
||||||
'are', 'not', 'used', 'for', 'the', 'greater',
|
'are', 'not', 'used', 'for', 'the', 'greater',
|
||||||
'Southern', 'California', 'Megaregion', '.']
|
'Southern', 'California', 'Megaregion', '.']
|
||||||
|
|
||||||
# the trailing '-' may cause Assertion Error
|
# the trailing '-' may cause Assertion Error
|
||||||
sentence = "The 8- and 10-county definitions are not used for the greater Southern California Megaregion."
|
sentence = "The 8- and 10-county definitions are not used for the greater Southern California Megaregion."
|
||||||
context = [word.text for word in custom_en_tokenizer(sentence)]
|
context = [word.text for word in custom_en_tokenizer(sentence)]
|
||||||
|
|
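Reviewer note: a minimal sketch of why the first assertion in the hunk above expects `'10', '-', 'county'`. It assumes that `compile_infix_regex` simply joins the patterns with `|`, which is an assumption about spaCy's helper and is not confirmed by this diff. `split_on_infixes` is a hypothetical stand-in for the tokenizer's infix step; prefixes, suffixes and tokenizer exceptions are ignored.

```python
import re

# Infix patterns copied from the custom_en_tokenizer fixture, written as raw strings.
custom_infixes = [r'\.\.\.+',
                  r'(?<=[0-9])-(?=[0-9])',
                  r'[0-9]+(,[0-9]+)+',
                  r'[\[\]!&:,()\*—–\/-]']

# Assumption: the compiled infix regex is the alternation of all patterns.
infix_re = re.compile('|'.join(custom_infixes))


def split_on_infixes(chunk):
    """Hypothetical helper: split one whitespace-delimited chunk on infix matches."""
    pieces = []
    start = 0
    for match in infix_re.finditer(chunk):
        # Keep the text before the infix, then the infix itself.
        if match.start() > start:
            pieces.append(chunk[start:match.start()])
        pieces.append(match.group())
        start = match.end()
    # Keep whatever follows the last infix.
    if start < len(chunk):
        pieces.append(chunk[start:])
    return pieces


print(split_on_infixes("10-county"))  # ['10', '-', 'county']
```

The digits-only hyphen rule does not fire here because the hyphen is followed by a letter; under the stated assumption, the split comes from the final character class.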
|
@ -38,7 +38,7 @@ def test_en_tokenizer_splits_trailing_apos(en_tokenizer, text):
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["'em", "nothin'", "ol'"])
|
@pytest.mark.parametrize('text', ["'em", "nothin'", "ol'"])
|
||||||
def text_tokenizer_doesnt_split_apos_exc(en_tokenizer, text):
|
def test_en_tokenizer_doesnt_split_apos_exc(en_tokenizer, text):
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
assert len(tokens) == 1
|
assert len(tokens) == 1
|
||||||
assert tokens[0].text == text
|
assert tokens[0].text == text
|
||||||
|
|
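Reviewer note: the rename from `text_tokenizer_doesnt_split_apos_exc` to `test_en_tokenizer_doesnt_split_apos_exc` matters beyond naming consistency. With pytest's default settings only functions whose names start with `test` are collected, so the old function was presumably never run. The sketch below mimics that default prefix rule with plain string matching; `would_be_collected` is a hypothetical helper, not part of pytest.

```python
def would_be_collected(function_name, prefix="test"):
    """Approximate pytest's default rule: collect functions whose name starts with 'test'."""
    return function_name.startswith(prefix)


old_name = "text_tokenizer_doesnt_split_apos_exc"
new_name = "test_en_tokenizer_doesnt_split_apos_exc"

print(would_be_collected(old_name))  # False
print(would_be_collected(new_name))  # True
```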
|
@ -1,11 +1,6 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
"""Test that token.idx correctly computes index into the original string."""
|
|
||||||
|
|
||||||
|
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
|
||||||
|
|
||||||
|
|
||||||
def test_en_simple_punct(en_tokenizer):
|
def test_en_simple_punct(en_tokenizer):
|
||||||
text = "to walk, do foo"
|
text = "to walk, do foo"
|
||||||
|
|
|
@ -1,63 +0,0 @@
|
||||||
# coding: utf-8
|
|
||||||
from __future__ import unicode_literals
|
|
||||||
|
|
||||||
import pytest
|
|
||||||
from ....tokens.doc import Doc
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
|
||||||
def en_lemmatizer(EN):
|
|
||||||
return EN.Defaults.create_lemmatizer()
|
|
||||||
|
|
||||||
@pytest.mark.models('en')
|
|
||||||
def test_doc_lemmatization(EN):
|
|
||||||
doc = Doc(EN.vocab, words=['bleed'])
|
|
||||||
doc[0].tag_ = 'VBP'
|
|
||||||
assert doc[0].lemma_ == 'bleed'
|
|
||||||
|
|
||||||
@pytest.mark.models('en')
|
|
||||||
@pytest.mark.parametrize('text,lemmas', [("aardwolves", ["aardwolf"]),
|
|
||||||
("aardwolf", ["aardwolf"]),
|
|
||||||
("planets", ["planet"]),
|
|
||||||
("ring", ["ring"]),
|
|
||||||
("axes", ["axis", "axe", "ax"])])
|
|
||||||
def test_en_lemmatizer_noun_lemmas(en_lemmatizer, text, lemmas):
|
|
||||||
assert en_lemmatizer.noun(text) == lemmas
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.models('en')
|
|
||||||
@pytest.mark.parametrize('text,lemmas', [("bleed", ["bleed"]),
|
|
||||||
("feed", ["feed"]),
|
|
||||||
("need", ["need"]),
|
|
||||||
("ring", ["ring"])])
|
|
||||||
def test_en_lemmatizer_noun_lemmas(en_lemmatizer, text, lemmas):
|
|
||||||
# Cases like this are problematic -- not clear what we should do to resolve
|
|
||||||
# ambiguity?
|
|
||||||
# ("axes", ["ax", "axes", "axis"])])
|
|
||||||
assert en_lemmatizer.noun(text) == lemmas
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.xfail
|
|
||||||
@pytest.mark.models('en')
|
|
||||||
def test_en_lemmatizer_base_forms(en_lemmatizer):
|
|
||||||
assert en_lemmatizer.noun('dive', {'number': 'sing'}) == ['dive']
|
|
||||||
assert en_lemmatizer.noun('dive', {'number': 'plur'}) == ['diva']
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.models('en')
|
|
||||||
def test_en_lemmatizer_base_form_verb(en_lemmatizer):
|
|
||||||
assert en_lemmatizer.verb('saw', {'verbform': 'past'}) == ['see']
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.models('en')
|
|
||||||
def test_en_lemmatizer_punct(en_lemmatizer):
|
|
||||||
assert en_lemmatizer.punct('“') == ['"']
|
|
||||||
assert en_lemmatizer.punct('“') == ['"']
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.models('en')
|
|
||||||
def test_en_lemmatizer_lemma_assignment(EN):
|
|
||||||
text = "Bananas in pyjamas are geese."
|
|
||||||
doc = EN.make_doc(text)
|
|
||||||
EN.tagger(doc)
|
|
||||||
assert all(t.lemma_ != '' for t in doc)
|
|
|
@ -1,85 +0,0 @@
|
||||||
# coding: utf-8
|
|
||||||
from __future__ import unicode_literals
|
|
||||||
|
|
||||||
import numpy
|
|
||||||
import pytest
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
|
||||||
def example(EN):
|
|
||||||
"""
|
|
||||||
This is to make sure the model works as expected. The tests make sure that
|
|
||||||
values are properly set. Tests are not meant to evaluate the content of the
|
|
||||||
output, only make sure the output is formally okay.
|
|
||||||
"""
|
|
||||||
assert EN.entity != None
|
|
||||||
return EN('There was a stranger standing at the big street talking to herself.')
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.models('en')
|
|
||||||
def test_en_models_tokenization(example):
|
|
||||||
# tokenization should split the document into tokens
|
|
||||||
assert len(example) > 1
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.models('en')
|
|
||||||
def test_en_models_tagging(example):
|
|
||||||
# if tagging was done properly, pos tags shouldn't be empty
|
|
||||||
assert example.is_tagged
|
|
||||||
assert all(t.pos != 0 for t in example)
|
|
||||||
assert all(t.tag != 0 for t in example)
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.models('en')
|
|
||||||
def test_en_models_parsing(example):
|
|
||||||
# if parsing was done properly
|
|
||||||
# - dependency labels shouldn't be empty
|
|
||||||
# - the head of some tokens should not be root
|
|
||||||
assert example.is_parsed
|
|
||||||
assert all(t.dep != 0 for t in example)
|
|
||||||
assert any(t.dep != i for i,t in enumerate(example))
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.models('en')
|
|
||||||
def test_en_models_ner(example):
|
|
||||||
# if ner was done properly, ent_iob shouldn't be empty
|
|
||||||
assert all([t.ent_iob != 0 for t in example])
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.models('en')
|
|
||||||
def test_en_models_vectors(example):
|
|
||||||
# if vectors are available, they should differ on different words
|
|
||||||
# this isn't a perfect test since this could in principle fail
|
|
||||||
# in a sane model as well,
|
|
||||||
# but that's very unlikely and a good indicator if something is wrong
|
|
||||||
if example.vocab.vectors_length:
|
|
||||||
vector0 = example[0].vector
|
|
||||||
vector1 = example[1].vector
|
|
||||||
vector2 = example[2].vector
|
|
||||||
assert not numpy.array_equal(vector0,vector1)
|
|
||||||
assert not numpy.array_equal(vector0,vector2)
|
|
||||||
assert not numpy.array_equal(vector1,vector2)
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.xfail
|
|
||||||
@pytest.mark.models('en')
|
|
||||||
def test_en_models_probs(example):
|
|
||||||
# if frequencies/probabilities are okay, they should differ for
|
|
||||||
# different words
|
|
||||||
# this isn't a perfect test since this could in principle fail
|
|
||||||
# in a sane model as well,
|
|
||||||
# but that's very unlikely and a good indicator if something is wrong
|
|
||||||
prob0 = example[0].prob
|
|
||||||
prob1 = example[1].prob
|
|
||||||
prob2 = example[2].prob
|
|
||||||
assert not prob0 == prob1
|
|
||||||
assert not prob0 == prob2
|
|
||||||
assert not prob1 == prob2
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.models('en')
|
|
||||||
def test_no_vectors_similarity(EN):
|
|
||||||
doc1 = EN(u'hallo')
|
|
||||||
doc2 = EN(u'hi')
|
|
||||||
assert doc1.similarity(doc2) > 0
|
|
||||||
|
|
|
@ -1,42 +0,0 @@
|
||||||
from __future__ import unicode_literals, print_function
|
|
||||||
import pytest
|
|
||||||
|
|
||||||
from spacy.attrs import LOWER
|
|
||||||
from spacy.matcher import Matcher
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.models('en')
|
|
||||||
def test_en_ner_simple_types(EN):
|
|
||||||
tokens = EN(u'Mr. Best flew to New York on Saturday morning.')
|
|
||||||
ents = list(tokens.ents)
|
|
||||||
assert ents[0].start == 1
|
|
||||||
assert ents[0].end == 2
|
|
||||||
assert ents[0].label_ == 'PERSON'
|
|
||||||
assert ents[1].start == 4
|
|
||||||
assert ents[1].end == 6
|
|
||||||
assert ents[1].label_ == 'GPE'
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.skip
|
|
||||||
@pytest.mark.models('en')
|
|
||||||
def test_en_ner_consistency_bug(EN):
|
|
||||||
'''Test an arbitrary sequence-consistency bug encountered during speed test'''
|
|
||||||
tokens = EN(u'Where rap essentially went mainstream, illustrated by seminal Public Enemy, Beastie Boys and L.L. Cool J. tracks.')
|
|
||||||
tokens = EN(u'''Charity and other short-term aid have buoyed them so far, and a tax-relief bill working its way through Congress would help. But the September 11 Victim Compensation Fund, enacted by Congress to discourage people from filing lawsuits, will determine the shape of their lives for years to come.\n\n''', disable=['ner'])
|
|
||||||
tokens.ents += tuple(EN.matcher(tokens))
|
|
||||||
EN.entity(tokens)
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.skip
|
|
||||||
@pytest.mark.models('en')
|
|
||||||
def test_en_ner_unit_end_gazetteer(EN):
|
|
||||||
'''Test a bug in the interaction between the NER model and the gazetteer'''
|
|
||||||
matcher = Matcher(EN.vocab)
|
|
||||||
matcher.add('MemberNames', None, [{LOWER: 'cal'}], [{LOWER: 'cal'}, {LOWER: 'henderson'}])
|
|
||||||
doc = EN(u'who is cal the manager of?')
|
|
||||||
if len(list(doc.ents)) == 0:
|
|
||||||
ents = matcher(doc)
|
|
||||||
assert len(ents) == 1
|
|
||||||
doc.ents += tuple(ents)
|
|
||||||
EN.entity(doc)
|
|
||||||
assert list(doc.ents)[0].text == 'cal'
|
|
|
@ -1,22 +1,20 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
from ....attrs import HEAD, DEP
|
|
||||||
from ....symbols import nsubj, dobj, amod, nmod, conj, cc, root
|
|
||||||
from ....lang.en.syntax_iterators import SYNTAX_ITERATORS
|
|
||||||
from ...util import get_doc
|
|
||||||
|
|
||||||
import numpy
|
import numpy
|
||||||
|
from spacy.attrs import HEAD, DEP
|
||||||
|
from spacy.symbols import nsubj, dobj, amod, nmod, conj, cc, root
|
||||||
|
from spacy.lang.en.syntax_iterators import SYNTAX_ITERATORS
|
||||||
|
|
||||||
|
from ...util import get_doc
|
||||||
|
|
||||||
|
|
||||||
def test_en_noun_chunks_not_nested(en_tokenizer):
|
def test_en_noun_chunks_not_nested(en_tokenizer):
|
||||||
text = "Peter has chronic command and control issues"
|
text = "Peter has chronic command and control issues"
|
||||||
heads = [1, 0, 4, 3, -1, -2, -5]
|
heads = [1, 0, 4, 3, -1, -2, -5]
|
||||||
deps = ['nsubj', 'ROOT', 'amod', 'nmod', 'cc', 'conj', 'dobj']
|
deps = ['nsubj', 'ROOT', 'amod', 'nmod', 'cc', 'conj', 'dobj']
|
||||||
|
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
|
||||||
|
|
||||||
tokens.from_array(
|
tokens.from_array(
|
||||||
[HEAD, DEP],
|
[HEAD, DEP],
|
||||||
numpy.asarray([[1, nsubj], [0, root], [4, amod], [3, nmod], [-1, cc],
|
numpy.asarray([[1, nsubj], [0, root], [4, amod], [3, nmod], [-1, cc],
|
||||||
|
|
|
@ -3,58 +3,52 @@ from __future__ import unicode_literals
|
||||||
|
|
||||||
from ...util import get_doc
|
from ...util import get_doc
|
||||||
|
|
||||||
import pytest
|
|
||||||
|
|
||||||
|
def test_en_parser_noun_chunks_standard(en_tokenizer):
|
||||||
def test_parser_noun_chunks_standard(en_tokenizer):
|
|
||||||
text = "A base phrase should be recognized."
|
text = "A base phrase should be recognized."
|
||||||
heads = [2, 1, 3, 2, 1, 0, -1]
|
heads = [2, 1, 3, 2, 1, 0, -1]
|
||||||
tags = ['DT', 'JJ', 'NN', 'MD', 'VB', 'VBN', '.']
|
tags = ['DT', 'JJ', 'NN', 'MD', 'VB', 'VBN', '.']
|
||||||
deps = ['det', 'amod', 'nsubjpass', 'aux', 'auxpass', 'ROOT', 'punct']
|
deps = ['det', 'amod', 'nsubjpass', 'aux', 'auxpass', 'ROOT', 'punct']
|
||||||
|
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags, deps=deps, heads=heads)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
|
||||||
chunks = list(doc.noun_chunks)
|
chunks = list(doc.noun_chunks)
|
||||||
assert len(chunks) == 1
|
assert len(chunks) == 1
|
||||||
assert chunks[0].text_with_ws == "A base phrase "
|
assert chunks[0].text_with_ws == "A base phrase "
|
||||||
|
|
||||||
|
|
||||||
def test_parser_noun_chunks_coordinated(en_tokenizer):
|
def test_en_parser_noun_chunks_coordinated(en_tokenizer):
|
||||||
text = "A base phrase and a good phrase are often the same."
|
text = "A base phrase and a good phrase are often the same."
|
||||||
heads = [2, 1, 5, -1, 2, 1, -4, 0, -1, 1, -3, -4]
|
heads = [2, 1, 5, -1, 2, 1, -4, 0, -1, 1, -3, -4]
|
||||||
tags = ['DT', 'NN', 'NN', 'CC', 'DT', 'JJ', 'NN', 'VBP', 'RB', 'DT', 'JJ', '.']
|
tags = ['DT', 'NN', 'NN', 'CC', 'DT', 'JJ', 'NN', 'VBP', 'RB', 'DT', 'JJ', '.']
|
||||||
deps = ['det', 'compound', 'nsubj', 'cc', 'det', 'amod', 'conj', 'ROOT', 'advmod', 'det', 'attr', 'punct']
|
deps = ['det', 'compound', 'nsubj', 'cc', 'det', 'amod', 'conj', 'ROOT', 'advmod', 'det', 'attr', 'punct']
|
||||||
|
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags, deps=deps, heads=heads)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
|
||||||
chunks = list(doc.noun_chunks)
|
chunks = list(doc.noun_chunks)
|
||||||
assert len(chunks) == 2
|
assert len(chunks) == 2
|
||||||
assert chunks[0].text_with_ws == "A base phrase "
|
assert chunks[0].text_with_ws == "A base phrase "
|
||||||
assert chunks[1].text_with_ws == "a good phrase "
|
assert chunks[1].text_with_ws == "a good phrase "
|
||||||
|
|
||||||
|
|
||||||
def test_parser_noun_chunks_pp_chunks(en_tokenizer):
|
def test_en_parser_noun_chunks_pp_chunks(en_tokenizer):
|
||||||
text = "A phrase with another phrase occurs."
|
text = "A phrase with another phrase occurs."
|
||||||
heads = [1, 4, -1, 1, -2, 0, -1]
|
heads = [1, 4, -1, 1, -2, 0, -1]
|
||||||
tags = ['DT', 'NN', 'IN', 'DT', 'NN', 'VBZ', '.']
|
tags = ['DT', 'NN', 'IN', 'DT', 'NN', 'VBZ', '.']
|
||||||
deps = ['det', 'nsubj', 'prep', 'det', 'pobj', 'ROOT', 'punct']
|
deps = ['det', 'nsubj', 'prep', 'det', 'pobj', 'ROOT', 'punct']
|
||||||
|
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags, deps=deps, heads=heads)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
|
||||||
chunks = list(doc.noun_chunks)
|
chunks = list(doc.noun_chunks)
|
||||||
assert len(chunks) == 2
|
assert len(chunks) == 2
|
||||||
assert chunks[0].text_with_ws == "A phrase "
|
assert chunks[0].text_with_ws == "A phrase "
|
||||||
assert chunks[1].text_with_ws == "another phrase "
|
assert chunks[1].text_with_ws == "another phrase "
|
||||||
|
|
||||||
|
|
||||||
def test_parser_noun_chunks_appositional_modifiers(en_tokenizer):
|
def test_en_parser_noun_chunks_appositional_modifiers(en_tokenizer):
|
||||||
text = "Sam, my brother, arrived to the house."
|
text = "Sam, my brother, arrived to the house."
|
||||||
heads = [5, -1, 1, -3, -4, 0, -1, 1, -2, -4]
|
heads = [5, -1, 1, -3, -4, 0, -1, 1, -2, -4]
|
||||||
tags = ['NNP', ',', 'PRP$', 'NN', ',', 'VBD', 'IN', 'DT', 'NN', '.']
|
tags = ['NNP', ',', 'PRP$', 'NN', ',', 'VBD', 'IN', 'DT', 'NN', '.']
|
||||||
deps = ['nsubj', 'punct', 'poss', 'appos', 'punct', 'ROOT', 'prep', 'det', 'pobj', 'punct']
|
deps = ['nsubj', 'punct', 'poss', 'appos', 'punct', 'ROOT', 'prep', 'det', 'pobj', 'punct']
|
||||||
|
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags, deps=deps, heads=heads)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
|
||||||
chunks = list(doc.noun_chunks)
|
chunks = list(doc.noun_chunks)
|
||||||
assert len(chunks) == 3
|
assert len(chunks) == 3
|
||||||
assert chunks[0].text_with_ws == "Sam "
|
assert chunks[0].text_with_ws == "Sam "
|
||||||
|
@ -62,14 +56,13 @@ def test_parser_noun_chunks_appositional_modifiers(en_tokenizer):
|
||||||
assert chunks[2].text_with_ws == "the house "
|
assert chunks[2].text_with_ws == "the house "
|
||||||
|
|
||||||
|
|
||||||
def test_parser_noun_chunks_dative(en_tokenizer):
|
def test_en_parser_noun_chunks_dative(en_tokenizer):
|
||||||
text = "She gave Bob a raise."
|
text = "She gave Bob a raise."
|
||||||
heads = [1, 0, -1, 1, -3, -4]
|
heads = [1, 0, -1, 1, -3, -4]
|
||||||
tags = ['PRP', 'VBD', 'NNP', 'DT', 'NN', '.']
|
tags = ['PRP', 'VBD', 'NNP', 'DT', 'NN', '.']
|
||||||
deps = ['nsubj', 'ROOT', 'dative', 'det', 'dobj', 'punct']
|
deps = ['nsubj', 'ROOT', 'dative', 'det', 'dobj', 'punct']
|
||||||
|
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags, deps=deps, heads=heads)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
|
||||||
chunks = list(doc.noun_chunks)
|
chunks = list(doc.noun_chunks)
|
||||||
assert len(chunks) == 3
|
assert len(chunks) == 3
|
||||||
assert chunks[0].text_with_ws == "She "
|
assert chunks[0].text_with_ws == "She "
|
||||||
|
|
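Reviewer note: the `heads` lists in these noun-chunk tests read as offsets relative to each token rather than absolute indices, with `0` marking the root. That reading is inferred from the test data; how `get_doc` consumes the list is an assumption here. A small sketch converting the offsets from the dative example into absolute head indices, with `to_absolute_heads` as a hypothetical helper:

```python
def to_absolute_heads(relative_heads):
    """Turn per-token head offsets into absolute token indices (hypothetical helper)."""
    return [index + offset for index, offset in enumerate(relative_heads)]


# Words and offsets taken from the dative test above.
words = ["She", "gave", "Bob", "a", "raise", "."]
heads = [1, 0, -1, 1, -3, -4]

absolute = to_absolute_heads(heads)
for word, head_index in zip(words, absolute):
    # The root ("gave") points at itself because its offset is 0.
    print(word, "->", words[head_index])
```

Under this reading every token except "a" attaches to "gave", and "a" attaches to "raise", which matches the `det` and `dobj` labels in `deps`.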
|
@ -1,92 +1,89 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
"""Test that tokenizer prefixes, suffixes and infixes are handled correctly."""
|
|
||||||
|
|
||||||
|
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["(can)"])
|
@pytest.mark.parametrize('text', ["(can)"])
|
||||||
def test_tokenizer_splits_no_special(en_tokenizer, text):
|
def test_en_tokenizer_splits_no_special(en_tokenizer, text):
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
assert len(tokens) == 3
|
assert len(tokens) == 3
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["can't"])
|
@pytest.mark.parametrize('text', ["can't"])
|
||||||
def test_tokenizer_splits_no_punct(en_tokenizer, text):
|
def test_en_tokenizer_splits_no_punct(en_tokenizer, text):
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
assert len(tokens) == 2
|
assert len(tokens) == 2
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["(can't"])
|
@pytest.mark.parametrize('text', ["(can't"])
|
||||||
def test_tokenizer_splits_prefix_punct(en_tokenizer, text):
|
def test_en_tokenizer_splits_prefix_punct(en_tokenizer, text):
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
assert len(tokens) == 3
|
assert len(tokens) == 3
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["can't)"])
|
@pytest.mark.parametrize('text', ["can't)"])
|
||||||
def test_tokenizer_splits_suffix_punct(en_tokenizer, text):
|
def test_en_tokenizer_splits_suffix_punct(en_tokenizer, text):
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
assert len(tokens) == 3
|
assert len(tokens) == 3
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["(can't)"])
|
@pytest.mark.parametrize('text', ["(can't)"])
|
||||||
def test_tokenizer_splits_even_wrap(en_tokenizer, text):
|
def test_en_tokenizer_splits_even_wrap(en_tokenizer, text):
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
assert len(tokens) == 4
|
assert len(tokens) == 4
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["(can't?)"])
|
@pytest.mark.parametrize('text', ["(can't?)"])
|
||||||
def test_tokenizer_splits_uneven_wrap(en_tokenizer, text):
|
def test_en_tokenizer_splits_uneven_wrap(en_tokenizer, text):
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
assert len(tokens) == 5
|
assert len(tokens) == 5
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text,length', [("U.S.", 1), ("us.", 2), ("(U.S.", 2)])
|
@pytest.mark.parametrize('text,length', [("U.S.", 1), ("us.", 2), ("(U.S.", 2)])
|
||||||
def test_tokenizer_splits_prefix_interact(en_tokenizer, text, length):
|
def test_en_tokenizer_splits_prefix_interact(en_tokenizer, text, length):
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
assert len(tokens) == length
|
assert len(tokens) == length
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["U.S.)"])
|
@pytest.mark.parametrize('text', ["U.S.)"])
|
||||||
def test_tokenizer_splits_suffix_interact(en_tokenizer, text):
|
def test_en_tokenizer_splits_suffix_interact(en_tokenizer, text):
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
assert len(tokens) == 2
|
assert len(tokens) == 2
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["(U.S.)"])
|
@pytest.mark.parametrize('text', ["(U.S.)"])
|
||||||
def test_tokenizer_splits_even_wrap_interact(en_tokenizer, text):
|
def test_en_tokenizer_splits_even_wrap_interact(en_tokenizer, text):
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
assert len(tokens) == 3
|
assert len(tokens) == 3
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["(U.S.?)"])
|
@pytest.mark.parametrize('text', ["(U.S.?)"])
|
||||||
def test_tokenizer_splits_uneven_wrap_interact(en_tokenizer, text):
|
def test_en_tokenizer_splits_uneven_wrap_interact(en_tokenizer, text):
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
assert len(tokens) == 4
|
assert len(tokens) == 4
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["best-known"])
|
@pytest.mark.parametrize('text', ["best-known"])
|
||||||
def test_tokenizer_splits_hyphens(en_tokenizer, text):
|
def test_en_tokenizer_splits_hyphens(en_tokenizer, text):
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
assert len(tokens) == 3
|
assert len(tokens) == 3
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["0.1-13.5", "0.0-0.1", "103.27-300"])
|
@pytest.mark.parametrize('text', ["0.1-13.5", "0.0-0.1", "103.27-300"])
|
||||||
def test_tokenizer_splits_numeric_range(en_tokenizer, text):
|
def test_en_tokenizer_splits_numeric_range(en_tokenizer, text):
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
assert len(tokens) == 3
|
assert len(tokens) == 3
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["best.Known", "Hello.World"])
|
@pytest.mark.parametrize('text', ["best.Known", "Hello.World"])
|
||||||
def test_tokenizer_splits_period_infix(en_tokenizer, text):
|
def test_en_tokenizer_splits_period_infix(en_tokenizer, text):
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
assert len(tokens) == 3
|
assert len(tokens) == 3
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["Hello,world", "one,two"])
|
@pytest.mark.parametrize('text', ["Hello,world", "one,two"])
|
||||||
def test_tokenizer_splits_comma_infix(en_tokenizer, text):
|
def test_en_tokenizer_splits_comma_infix(en_tokenizer, text):
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
assert len(tokens) == 3
|
assert len(tokens) == 3
|
||||||
assert tokens[0].text == text.split(",")[0]
|
assert tokens[0].text == text.split(",")[0]
|
||||||
|
@ -95,12 +92,12 @@ def test_tokenizer_splits_comma_infix(en_tokenizer, text):
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["best...Known", "best...known"])
|
@pytest.mark.parametrize('text', ["best...Known", "best...known"])
|
||||||
def test_tokenizer_splits_ellipsis_infix(en_tokenizer, text):
|
def test_en_tokenizer_splits_ellipsis_infix(en_tokenizer, text):
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
assert len(tokens) == 3
|
assert len(tokens) == 3
|
||||||
|
|
||||||
|
|
||||||
def test_tokenizer_splits_double_hyphen_infix(en_tokenizer):
|
def test_en_tokenizer_splits_double_hyphen_infix(en_tokenizer):
|
||||||
tokens = en_tokenizer("No decent--let alone well-bred--people.")
|
tokens = en_tokenizer("No decent--let alone well-bred--people.")
|
||||||
assert tokens[0].text == "No"
|
assert tokens[0].text == "No"
|
||||||
assert tokens[1].text == "decent"
|
assert tokens[1].text == "decent"
|
||||||
|
@ -115,7 +112,7 @@ def test_tokenizer_splits_double_hyphen_infix(en_tokenizer):
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.xfail
|
@pytest.mark.xfail
|
||||||
def test_tokenizer_splits_period_abbr(en_tokenizer):
|
def test_en_tokenizer_splits_period_abbr(en_tokenizer):
|
||||||
text = "Today is Tuesday.Mr."
|
text = "Today is Tuesday.Mr."
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
assert len(tokens) == 5
|
assert len(tokens) == 5
|
||||||
|
@ -127,7 +124,7 @@ def test_tokenizer_splits_period_abbr(en_tokenizer):
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.xfail
|
@pytest.mark.xfail
|
||||||
def test_tokenizer_splits_em_dash_infix(en_tokenizer):
|
def test_en_tokenizer_splits_em_dash_infix(en_tokenizer):
|
||||||
# Re Issue #225
|
# Re Issue #225
|
||||||
tokens = en_tokenizer("""Will this road take me to Puddleton?\u2014No, """
|
tokens = en_tokenizer("""Will this road take me to Puddleton?\u2014No, """
|
||||||
"""you'll have to walk there.\u2014Ariel.""")
|
"""you'll have to walk there.\u2014Ariel.""")
|
||||||
|
|
|
@ -1,13 +1,9 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
"""Test that open, closed and paired punctuation is split off correctly."""
|
|
||||||
|
|
||||||
|
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
from spacy.util import compile_prefix_regex
|
||||||
from ....util import compile_prefix_regex
|
from spacy.lang.punctuation import TOKENIZER_PREFIXES
|
||||||
from ....lang.punctuation import TOKENIZER_PREFIXES
|
|
||||||
|
|
||||||
|
|
||||||
PUNCT_OPEN = ['(', '[', '{', '*']
|
PUNCT_OPEN = ['(', '[', '{', '*']
|
||||||
|
|
|
@ -1,18 +1,17 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
from ....tokens import Doc
|
|
||||||
from ...util import get_doc, apply_transition_sequence
|
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
|
from ...util import get_doc, apply_transition_sequence
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["A test sentence"])
|
@pytest.mark.parametrize('text', ["A test sentence"])
|
||||||
@pytest.mark.parametrize('punct', ['.', '!', '?', ''])
|
@pytest.mark.parametrize('punct', ['.', '!', '?', ''])
|
||||||
def test_en_sbd_single_punct(en_tokenizer, text, punct):
|
def test_en_sbd_single_punct(en_tokenizer, text, punct):
|
||||||
heads = [2, 1, 0, -1] if punct else [2, 1, 0]
|
heads = [2, 1, 0, -1] if punct else [2, 1, 0]
|
||||||
tokens = en_tokenizer(text + punct)
|
tokens = en_tokenizer(text + punct)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||||
assert len(doc) == 4 if punct else 3
|
assert len(doc) == 4 if punct else 3
|
||||||
assert len(list(doc.sents)) == 1
|
assert len(list(doc.sents)) == 1
|
||||||
assert sum(len(sent) for sent in doc.sents) == len(doc)
|
assert sum(len(sent) for sent in doc.sents) == len(doc)
|
||||||
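A note on the `assert len(doc) == 4 if punct else 3` form used in this hunk: Python's conditional expression binds more loosely than `==`, so the statement is read as `assert (len(doc) == 4) if punct else 3`. When `punct` is empty, the assertion only evaluates the constant `3`, which is always truthy. The sketch below is a standalone illustration, not part of the test suite. The two helper functions are hypothetical names introduced only to show the difference parentheses make.

```python
def unparenthesized(length, punct):
    # Parsed as: (length == 4) if punct else 3
    return length == 4 if punct else 3


def parenthesized(length, punct):
    # Compares length against 4 or 3, depending on punct
    return length == (4 if punct else 3)


# With no punctuation and a clearly wrong length, the first form is still truthy.
print(bool(unparenthesized(99, "")))  # True
print(parenthesized(99, ""))          # False
print(parenthesized(3, ""))           # True
```

Writing the check as `assert len(doc) == (4 if punct else 3)` would make both branches compare against the token count.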
|
@ -26,102 +25,10 @@ def test_en_sentence_breaks(en_tokenizer, en_parser):
|
||||||
'attr', 'punct']
|
'attr', 'punct']
|
||||||
transition = ['L-nsubj', 'S', 'L-det', 'R-attr', 'D', 'R-punct', 'B-ROOT',
|
transition = ['L-nsubj', 'S', 'L-det', 'R-attr', 'D', 'R-punct', 'B-ROOT',
|
||||||
'L-nsubj', 'S', 'L-attr', 'R-attr', 'D', 'R-punct']
|
'L-nsubj', 'S', 'L-attr', 'R-attr', 'D', 'R-punct']
|
||||||
|
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
|
||||||
apply_transition_sequence(en_parser, doc, transition)
|
apply_transition_sequence(en_parser, doc, transition)
|
||||||
|
|
||||||
assert len(list(doc.sents)) == 2
|
assert len(list(doc.sents)) == 2
|
||||||
for token in doc:
|
for token in doc:
|
||||||
assert token.dep != 0 or token.is_space
|
assert token.dep != 0 or token.is_space
|
||||||
assert [token.head.i for token in doc ] == [1, 1, 3, 1, 1, 6, 6, 8, 6, 6]
|
assert [token.head.i for token in doc ] == [1, 1, 3, 1, 1, 6, 6, 8, 6, 6]
|
||||||
|
|
||||||
|
|
||||||
# Currently, there's no way of setting the serializer data for the parser
|
|
||||||
# without loading the models, so we can't remove the model dependency here yet.
|
|
||||||
|
|
||||||
@pytest.mark.xfail
|
|
||||||
@pytest.mark.models('en')
|
|
||||||
def test_en_sbd_serialization_projective(EN):
|
|
||||||
"""Test that before and after serialization, the sentence boundaries are
|
|
||||||
the same."""
|
|
||||||
|
|
||||||
text = "I bought a couch from IKEA It wasn't very comfortable."
|
|
||||||
transition = ['L-nsubj', 'S', 'L-det', 'R-dobj', 'D', 'R-prep', 'R-pobj',
|
|
||||||
'B-ROOT', 'L-nsubj', 'R-neg', 'D', 'S', 'L-advmod',
|
|
||||||
'R-acomp', 'D', 'R-punct']
|
|
||||||
|
|
||||||
doc = EN.tokenizer(text)
|
|
||||||
apply_transition_sequence(EN.parser, doc, transition)
|
|
||||||
doc_serialized = Doc(EN.vocab).from_bytes(doc.to_bytes())
|
|
||||||
assert doc.is_parsed == True
|
|
||||||
assert doc_serialized.is_parsed == True
|
|
||||||
assert doc.to_bytes() == doc_serialized.to_bytes()
|
|
||||||
assert [s.text for s in doc.sents] == [s.text for s in doc_serialized.sents]
|
|
||||||
|
|
||||||
|
|
||||||
TEST_CASES = [
|
|
||||||
pytest.mark.xfail(("Hello World. My name is Jonas.", ["Hello World.", "My name is Jonas."])),
|
|
||||||
("What is your name? My name is Jonas.", ["What is your name?", "My name is Jonas."]),
|
|
||||||
("There it is! I found it.", ["There it is!", "I found it."]),
|
|
||||||
("My name is Jonas E. Smith.", ["My name is Jonas E. Smith."]),
|
|
||||||
("Please turn to p. 55.", ["Please turn to p. 55."]),
|
|
||||||
("Were Jane and co. at the party?", ["Were Jane and co. at the party?"]),
|
|
||||||
("They closed the deal with Pitt, Briggs & Co. at noon.", ["They closed the deal with Pitt, Briggs & Co. at noon."]),
|
|
||||||
("Let's ask Jane and co. They should know.", ["Let's ask Jane and co.", "They should know."]),
|
|
||||||
("They closed the deal with Pitt, Briggs & Co. It closed yesterday.", ["They closed the deal with Pitt, Briggs & Co.", "It closed yesterday."]),
|
|
||||||
("I can see Mt. Fuji from here.", ["I can see Mt. Fuji from here."]),
|
|
||||||
pytest.mark.xfail(("St. Michael's Church is on 5th st. near the light.", ["St. Michael's Church is on 5th st. near the light."])),
|
|
||||||
("That is JFK Jr.'s book.", ["That is JFK Jr.'s book."]),
|
|
||||||
("I visited the U.S.A. last year.", ["I visited the U.S.A. last year."]),
|
|
||||||
("I live in the E.U. How about you?", ["I live in the E.U.", "How about you?"]),
|
|
||||||
("I live in the U.S. How about you?", ["I live in the U.S.", "How about you?"]),
|
|
||||||
("I work for the U.S. Government in Virginia.", ["I work for the U.S. Government in Virginia."]),
|
|
||||||
("I have lived in the U.S. for 20 years.", ["I have lived in the U.S. for 20 years."]),
|
|
||||||
pytest.mark.xfail(("At 5 a.m. Mr. Smith went to the bank. He left the bank at 6 P.M. Mr. Smith then went to the store.", ["At 5 a.m. Mr. Smith went to the bank.", "He left the bank at 6 P.M.", "Mr. Smith then went to the store."])),
|
|
||||||
("She has $100.00 in her bag.", ["She has $100.00 in her bag."]),
|
|
||||||
("She has $100.00. It is in her bag.", ["She has $100.00.", "It is in her bag."]),
|
|
||||||
("He teaches science (He previously worked for 5 years as an engineer.) at the local University.", ["He teaches science (He previously worked for 5 years as an engineer.) at the local University."]),
|
|
||||||
("Her email is Jane.Doe@example.com. I sent her an email.", ["Her email is Jane.Doe@example.com.", "I sent her an email."]),
|
|
||||||
("The site is: https://www.example.50.com/new-site/awesome_content.html. Please check it out.", ["The site is: https://www.example.50.com/new-site/awesome_content.html.", "Please check it out."]),
|
|
||||||
pytest.mark.xfail(("She turned to him, 'This is great.' she said.", ["She turned to him, 'This is great.' she said."])),
|
|
||||||
pytest.mark.xfail(('She turned to him, "This is great." she said.', ['She turned to him, "This is great." she said.'])),
|
|
||||||
('She turned to him, "This is great." She held the book out to show him.', ['She turned to him, "This is great."', "She held the book out to show him."]),
|
|
||||||
("Hello!! Long time no see.", ["Hello!!", "Long time no see."]),
|
|
||||||
("Hello?? Who is there?", ["Hello??", "Who is there?"]),
|
|
||||||
("Hello!? Is that you?", ["Hello!?", "Is that you?"]),
|
|
||||||
("Hello?! Is that you?", ["Hello?!", "Is that you?"]),
|
|
||||||
pytest.mark.xfail(("1.) The first item 2.) The second item", ["1.) The first item", "2.) The second item"])),
|
|
||||||
pytest.mark.xfail(("1.) The first item. 2.) The second item.", ["1.) The first item.", "2.) The second item."])),
|
|
||||||
pytest.mark.xfail(("1) The first item 2) The second item", ["1) The first item", "2) The second item"])),
|
|
||||||
("1) The first item. 2) The second item.", ["1) The first item.", "2) The second item."]),
|
|
||||||
pytest.mark.xfail(("1. The first item 2. The second item", ["1. The first item", "2. The second item"])),
|
|
||||||
pytest.mark.xfail(("1. The first item. 2. The second item.", ["1. The first item.", "2. The second item."])),
|
|
||||||
pytest.mark.xfail(("• 9. The first item • 10. The second item", ["• 9. The first item", "• 10. The second item"])),
|
|
||||||
pytest.mark.xfail(("⁃9. The first item ⁃10. The second item", ["⁃9. The first item", "⁃10. The second item"])),
|
|
||||||
pytest.mark.xfail(("a. The first item b. The second item c. The third list item", ["a. The first item", "b. The second item", "c. The third list item"])),
|
|
||||||
("This is a sentence\ncut off in the middle because pdf.", ["This is a sentence\ncut off in the middle because pdf."]),
|
|
||||||
("It was a cold \nnight in the city.", ["It was a cold \nnight in the city."]),
|
|
||||||
pytest.mark.xfail(("features\ncontact manager\nevents, activities\n", ["features", "contact manager", "events, activities"])),
|
|
||||||
pytest.mark.xfail(("You can find it at N°. 1026.253.553. That is where the treasure is.", ["You can find it at N°. 1026.253.553.", "That is where the treasure is."])),
|
|
||||||
("She works at Yahoo! in the accounting department.", ["She works at Yahoo! in the accounting department."]),
|
|
||||||
("We make a good team, you and I. Did you see Albert I. Jones yesterday?", ["We make a good team, you and I.", "Did you see Albert I. Jones yesterday?"]),
|
|
||||||
("Thoreau argues that by simplifying one’s life, “the laws of the universe will appear less complex. . . .”", ["Thoreau argues that by simplifying one’s life, “the laws of the universe will appear less complex. . . .”"]),
|
|
||||||
pytest.mark.xfail((""""Bohr [...] used the analogy of parallel stairways [...]" (Smith 55).""", ['"Bohr [...] used the analogy of parallel stairways [...]" (Smith 55).'])),
|
|
||||||
("If words are left off at the end of a sentence, and that is all that is omitted, indicate the omission with ellipsis marks (preceded and followed by a space) and then indicate the end of the sentence with a period . . . . Next sentence.", ["If words are left off at the end of a sentence, and that is all that is omitted, indicate the omission with ellipsis marks (preceded and followed by a space) and then indicate the end of the sentence with a period . . . .", "Next sentence."]),
|
|
||||||
("I never meant that.... She left the store.", ["I never meant that....", "She left the store."]),
|
|
||||||
pytest.mark.xfail(("I wasn’t really ... well, what I mean...see . . . what I'm saying, the thing is . . . I didn’t mean it.", ["I wasn’t really ... well, what I mean...see . . . what I'm saying, the thing is . . . I didn’t mean it."])),
|
|
||||||
pytest.mark.xfail(("One further habit which was somewhat weakened . . . was that of combining words into self-interpreting compounds. . . . The practice was not abandoned. . . .", ["One further habit which was somewhat weakened . . . was that of combining words into self-interpreting compounds.", ". . . The practice was not abandoned. . . ."])),
|
|
||||||
pytest.mark.xfail(("Hello world.Today is Tuesday.Mr. Smith went to the store and bought 1,000.That is a lot.", ["Hello world.", "Today is Tuesday.", "Mr. Smith went to the store and bought 1,000.", "That is a lot."]))
|
|
||||||
]
|
|
||||||
|
|
||||||
@pytest.mark.skip
|
|
||||||
@pytest.mark.models('en')
|
|
||||||
@pytest.mark.parametrize('text,expected_sents', TEST_CASES)
|
|
||||||
def test_en_sbd_prag(EN, text, expected_sents):
|
|
||||||
"""SBD tests from Pragmatic Segmenter"""
|
|
||||||
doc = EN(text)
|
|
||||||
sents = []
|
|
||||||
for sent in doc.sents:
|
|
||||||
sents.append(''.join(doc[i].string for i in range(sent.start, sent.end)).strip())
|
|
||||||
assert sents == expected_sents
|
|
||||||
|
|
|
@ -1,12 +1,8 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
from ....parts_of_speech import SPACE
|
|
||||||
from ....compat import unicode_
|
|
||||||
from ...util import get_doc
|
from ...util import get_doc
|
||||||
|
|
||||||
import pytest
|
|
||||||
|
|
||||||
|
|
||||||
def test_en_tagger_load_morph_exc(en_tokenizer):
|
def test_en_tagger_load_morph_exc(en_tokenizer):
|
||||||
text = "I like his style."
|
text = "I like his style."
|
||||||
|
@ -14,47 +10,6 @@ def test_en_tagger_load_morph_exc(en_tokenizer):
|
||||||
morph_exc = {'VBP': {'like': {'lemma': 'luck'}}}
|
morph_exc = {'VBP': {'like': {'lemma': 'luck'}}}
|
||||||
en_tokenizer.vocab.morphology.load_morph_exceptions(morph_exc)
|
en_tokenizer.vocab.morphology.load_morph_exceptions(morph_exc)
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags)
|
||||||
assert doc[1].tag_ == 'VBP'
|
assert doc[1].tag_ == 'VBP'
|
||||||
assert doc[1].lemma_ == 'luck'
|
assert doc[1].lemma_ == 'luck'
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.models('en')
|
|
||||||
def test_tag_names(EN):
|
|
||||||
text = "I ate pizzas with anchovies."
|
|
||||||
doc = EN(text, disable=['parser'])
|
|
||||||
assert type(doc[2].pos) == int
|
|
||||||
assert isinstance(doc[2].pos_, unicode_)
|
|
||||||
assert isinstance(doc[2].dep_, unicode_)
|
|
||||||
assert doc[2].tag_ == u'NNS'
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.xfail
|
|
||||||
@pytest.mark.models('en')
|
|
||||||
def test_en_tagger_spaces(EN):
|
|
||||||
"""Ensure spaces are assigned the POS tag SPACE"""
|
|
||||||
text = "Some\nspaces are\tnecessary."
|
|
||||||
doc = EN(text, disable=['parser'])
|
|
||||||
assert doc[0].pos != SPACE
|
|
||||||
assert doc[0].pos_ != 'SPACE'
|
|
||||||
assert doc[1].pos == SPACE
|
|
||||||
assert doc[1].pos_ == 'SPACE'
|
|
||||||
assert doc[1].tag_ == 'SP'
|
|
||||||
assert doc[2].pos != SPACE
|
|
||||||
assert doc[3].pos != SPACE
|
|
||||||
assert doc[4].pos == SPACE
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.xfail
|
|
||||||
@pytest.mark.models('en')
|
|
||||||
def test_en_tagger_return_char(EN):
|
|
||||||
"""Ensure spaces are assigned the POS tag SPACE"""
|
|
||||||
text = ('hi Aaron,\r\n\r\nHow is your schedule today, I was wondering if '
|
|
||||||
'you had time for a phone\r\ncall this afternoon?\r\n\r\n\r\n')
|
|
||||||
tokens = EN(text)
|
|
||||||
for token in tokens:
|
|
||||||
if token.is_space:
|
|
||||||
assert token.pos == SPACE
|
|
||||||
assert tokens[3].text == '\r\n\r\n'
|
|
||||||
assert tokens[3].is_space
|
|
||||||
assert tokens[3].pos == SPACE
|
|
||||||
|
|
|
@ -1,10 +1,8 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
"""Test that longer and mixed texts are tokenized correctly."""
|
|
||||||
|
|
||||||
|
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
from spacy.lang.en.lex_attrs import like_num
|
||||||
|
|
||||||
|
|
||||||
def test_en_tokenizer_handles_long_text(en_tokenizer):
|
def test_en_tokenizer_handles_long_text(en_tokenizer):
|
||||||
|
@ -43,3 +41,9 @@ def test_lex_attrs_like_number(en_tokenizer, text, match):
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
assert len(tokens) == 1
|
assert len(tokens) == 1
|
||||||
assert tokens[0].like_num == match
|
assert tokens[0].like_num == match
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('word', ['eleven'])
|
||||||
|
def test_en_lex_attrs_capitals(word):
|
||||||
|
assert like_num(word)
|
||||||
|
assert like_num(word.upper())
|
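The `test_en_lex_attrs_capitals` test added here checks that `like_num` accepts a number word regardless of case. A minimal sketch of that behaviour follows. It assumes a small hypothetical word list (`NUM_WORDS_ASSUMED`) and a stand-in function `like_num_sketch`; neither is spaCy's actual implementation.

```python
# Hypothetical word list for illustration only; a real list would be longer.
NUM_WORDS_ASSUMED = {"zero", "one", "two", "three", "ten", "eleven", "twelve"}


def like_num_sketch(text):
    """Return True if text looks like a number (digits, fraction or number word)."""
    stripped = text.replace(",", "").replace(".", "")
    if stripped.isdigit():
        return True
    if stripped.count("/") == 1:
        num, denom = stripped.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    # Lower-casing before the lookup makes the check case-insensitive.
    return stripped.lower() in NUM_WORDS_ASSUMED


for word in ["eleven", "ELEVEN", "Eleven", "10,000", "1/2", "dog"]:
    print(word, like_num_sketch(word))
```

The point the test guards is the lower-casing step: without it, `"ELEVEN"` would miss a lookup against a lower-case word list.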
||||||
|
|
|
@ -1,22 +1,21 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
|
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text,lemma', [("aprox.", "aproximadamente"),
|
@pytest.mark.parametrize('text,lemma', [
|
||||||
("esq.", "esquina"),
|
("aprox.", "aproximadamente"),
|
||||||
("pág.", "página"),
|
("esq.", "esquina"),
|
||||||
("p.ej.", "por ejemplo")
|
("pág.", "página"),
|
||||||
])
|
("p.ej.", "por ejemplo")])
|
||||||
def test_tokenizer_handles_abbr(es_tokenizer, text, lemma):
|
def test_es_tokenizer_handles_abbr(es_tokenizer, text, lemma):
|
||||||
tokens = es_tokenizer(text)
|
tokens = es_tokenizer(text)
|
||||||
assert len(tokens) == 1
|
assert len(tokens) == 1
|
||||||
assert tokens[0].lemma_ == lemma
|
assert tokens[0].lemma_ == lemma
|
||||||
|
|
||||||
|
|
||||||
def test_tokenizer_handles_exc_in_text(es_tokenizer):
|
def test_es_tokenizer_handles_exc_in_text(es_tokenizer):
|
||||||
text = "Mariano Rajoy ha corrido aprox. medio kilómetro"
|
text = "Mariano Rajoy ha corrido aprox. medio kilómetro"
|
||||||
tokens = es_tokenizer(text)
|
tokens = es_tokenizer(text)
|
||||||
assert len(tokens) == 7
|
assert len(tokens) == 7
|
||||||
|
|
|
@ -1,14 +1,10 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
|
|
||||||
"""Test that longer and mixed texts are tokenized correctly."""
|
|
||||||
|
|
||||||
|
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
def test_tokenizer_handles_long_text(es_tokenizer):
|
def test_es_tokenizer_handles_long_text(es_tokenizer):
|
||||||
text = """Cuando a José Mujica lo invitaron a dar una conferencia
|
text = """Cuando a José Mujica lo invitaron a dar una conferencia
|
||||||
|
|
||||||
en Oxford este verano, su cabeza hizo "crac". La "más antigua" universidad de habla
|
en Oxford este verano, su cabeza hizo "crac". La "más antigua" universidad de habla
|
||||||
|
@ -30,6 +26,6 @@ en Montevideo y que pregona las bondades de la vida austera."""
|
||||||
("""¡Sí! "Vámonos", contestó José Arcadio Buendía""", 11),
|
("""¡Sí! "Vámonos", contestó José Arcadio Buendía""", 11),
|
||||||
("Corrieron aprox. 10km.", 5),
|
("Corrieron aprox. 10km.", 5),
|
||||||
("Y entonces por qué...", 5)])
|
("Y entonces por qué...", 5)])
|
||||||
def test_tokenizer_handles_cnts(es_tokenizer, text, length):
|
def test_es_tokenizer_handles_cnts(es_tokenizer, text, length):
|
||||||
tokens = es_tokenizer(text)
|
tokens = es_tokenizer(text)
|
||||||
assert len(tokens) == length
|
assert len(tokens) == length
|
||||||
|
|
|
@ -11,7 +11,7 @@ ABBREVIATION_TESTS = [
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text,expected_tokens', ABBREVIATION_TESTS)
|
@pytest.mark.parametrize('text,expected_tokens', ABBREVIATION_TESTS)
|
||||||
def test_tokenizer_handles_testcases(fi_tokenizer, text, expected_tokens):
|
def test_fi_tokenizer_handles_testcases(fi_tokenizer, text, expected_tokens):
|
||||||
tokens = fi_tokenizer(text)
|
tokens = fi_tokenizer(text)
|
||||||
token_list = [token.text for token in tokens if not token.is_space]
|
token_list = [token.text for token in tokens if not token.is_space]
|
||||||
assert expected_tokens == token_list
|
assert expected_tokens == token_list
|
||||||
|
|
|
@ -1,29 +1,29 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
|
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["aujourd'hui", "Aujourd'hui", "prud'hommes",
|
@pytest.mark.parametrize('text', [
|
||||||
"prud’hommal"])
|
"aujourd'hui", "Aujourd'hui", "prud'hommes", "prud’hommal"])
|
||||||
def test_tokenizer_infix_exceptions(fr_tokenizer, text):
|
def test_fr_tokenizer_infix_exceptions(fr_tokenizer, text):
|
||||||
tokens = fr_tokenizer(text)
|
tokens = fr_tokenizer(text)
|
||||||
assert len(tokens) == 1
|
assert len(tokens) == 1
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text,lemma', [("janv.", "janvier"),
|
@pytest.mark.parametrize('text,lemma', [
|
||||||
("juill.", "juillet"),
|
("janv.", "janvier"),
|
||||||
("Dr.", "docteur"),
|
("juill.", "juillet"),
|
||||||
("av.", "avant"),
|
("Dr.", "docteur"),
|
||||||
("sept.", "septembre")])
|
("av.", "avant"),
|
||||||
def test_tokenizer_handles_abbr(fr_tokenizer, text, lemma):
|
("sept.", "septembre")])
|
||||||
|
def test_fr_tokenizer_handles_abbr(fr_tokenizer, text, lemma):
|
||||||
tokens = fr_tokenizer(text)
|
tokens = fr_tokenizer(text)
|
||||||
assert len(tokens) == 1
|
assert len(tokens) == 1
|
||||||
assert tokens[0].lemma_ == lemma
|
assert tokens[0].lemma_ == lemma
|
||||||
|
|
||||||
|
|
||||||
def test_tokenizer_handles_exc_in_text(fr_tokenizer):
|
def test_fr_tokenizer_handles_exc_in_text(fr_tokenizer):
|
||||||
text = "Je suis allé au mois de janv. aux prud’hommes."
|
text = "Je suis allé au mois de janv. aux prud’hommes."
|
||||||
tokens = fr_tokenizer(text)
|
tokens = fr_tokenizer(text)
|
||||||
assert len(tokens) == 10
|
assert len(tokens) == 10
|
||||||
|
@ -32,14 +32,15 @@ def test_tokenizer_handles_exc_in_text(fr_tokenizer):
|
||||||
assert tokens[8].text == "prud’hommes"
|
assert tokens[8].text == "prud’hommes"
|
||||||
|
|
||||||
|
|
||||||
def test_tokenizer_handles_exc_in_text_2(fr_tokenizer):
|
def test_fr_tokenizer_handles_exc_in_text_2(fr_tokenizer):
|
||||||
text = "Cette après-midi, je suis allé dans un restaurant italo-mexicain."
|
text = "Cette après-midi, je suis allé dans un restaurant italo-mexicain."
|
||||||
tokens = fr_tokenizer(text)
|
tokens = fr_tokenizer(text)
|
||||||
assert len(tokens) == 11
|
assert len(tokens) == 11
|
||||||
assert tokens[1].text == "après-midi"
|
assert tokens[1].text == "après-midi"
|
||||||
assert tokens[9].text == "italo-mexicain"
|
assert tokens[9].text == "italo-mexicain"
|
||||||
|
|
||||||
def test_tokenizer_handles_title(fr_tokenizer):
|
|
||||||
|
def test_fr_tokenizer_handles_title(fr_tokenizer):
|
||||||
text = "N'est-ce pas génial?"
|
text = "N'est-ce pas génial?"
|
||||||
tokens = fr_tokenizer(text)
|
tokens = fr_tokenizer(text)
|
||||||
assert len(tokens) == 6
|
assert len(tokens) == 6
|
||||||
|
@ -50,14 +51,16 @@ def test_tokenizer_handles_title(fr_tokenizer):
|
||||||
assert tokens[2].text == "-ce"
|
assert tokens[2].text == "-ce"
|
||||||
assert tokens[2].lemma_ == "ce"
|
assert tokens[2].lemma_ == "ce"
|
||||||
|
|
||||||
def test_tokenizer_handles_title_2(fr_tokenizer):
|
|
||||||
|
def test_fr_tokenizer_handles_title_2(fr_tokenizer):
|
||||||
text = "Est-ce pas génial?"
|
text = "Est-ce pas génial?"
|
||||||
tokens = fr_tokenizer(text)
|
tokens = fr_tokenizer(text)
|
||||||
assert len(tokens) == 6
|
assert len(tokens) == 6
|
||||||
assert tokens[0].text == "Est"
|
assert tokens[0].text == "Est"
|
||||||
assert tokens[0].lemma_ == "être"
|
assert tokens[0].lemma_ == "être"
|
||||||
|
|
||||||
def test_tokenizer_handles_title_2(fr_tokenizer):
|
|
||||||
|
def test_fr_tokenizer_handles_title_3(fr_tokenizer):
|
||||||
text = "Qu'est-ce que tu fais?"
|
text = "Qu'est-ce que tu fais?"
|
||||||
tokens = fr_tokenizer(text)
|
tokens = fr_tokenizer(text)
|
||||||
assert len(tokens) == 7
|
assert len(tokens) == 7
|
||||||
|
|
|
@ -4,25 +4,25 @@ from __future__ import unicode_literals
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
def test_lemmatizer_verb(fr_tokenizer):
|
def test_fr_lemmatizer_verb(fr_tokenizer):
|
||||||
tokens = fr_tokenizer("Qu'est-ce que tu fais?")
|
tokens = fr_tokenizer("Qu'est-ce que tu fais?")
|
||||||
assert tokens[0].lemma_ == "que"
|
assert tokens[0].lemma_ == "que"
|
||||||
assert tokens[1].lemma_ == "être"
|
assert tokens[1].lemma_ == "être"
|
||||||
assert tokens[5].lemma_ == "faire"
|
assert tokens[5].lemma_ == "faire"
|
||||||
|
|
||||||
|
|
||||||
def test_lemmatizer_noun_verb_2(fr_tokenizer):
|
def test_fr_lemmatizer_noun_verb_2(fr_tokenizer):
|
||||||
tokens = fr_tokenizer("Les abaissements de température sont gênants.")
|
tokens = fr_tokenizer("Les abaissements de température sont gênants.")
|
||||||
assert tokens[4].lemma_ == "être"
|
assert tokens[4].lemma_ == "être"
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.xfail(reason="Costaricienne TAG is PROPN instead of NOUN and spacy don't lemmatize PROPN")
|
@pytest.mark.xfail(reason="Costaricienne TAG is PROPN instead of NOUN and spacy don't lemmatize PROPN")
|
||||||
def test_lemmatizer_noun(fr_tokenizer):
|
def test_fr_lemmatizer_noun(fr_tokenizer):
|
||||||
tokens = fr_tokenizer("il y a des Costaricienne.")
|
tokens = fr_tokenizer("il y a des Costaricienne.")
|
||||||
assert tokens[4].lemma_ == "Costaricain"
|
assert tokens[4].lemma_ == "Costaricain"
|
||||||
|
|
||||||
|
|
||||||
def test_lemmatizer_noun_2(fr_tokenizer):
|
def test_fr_lemmatizer_noun_2(fr_tokenizer):
|
||||||
tokens = fr_tokenizer("Les abaissements de température sont gênants.")
|
tokens = fr_tokenizer("Les abaissements de température sont gênants.")
|
||||||
assert tokens[1].lemma_ == "abaissement"
|
assert tokens[1].lemma_ == "abaissement"
|
||||||
assert tokens[5].lemma_ == "gênant"
|
assert tokens[5].lemma_ == "gênant"
|
||||||
|
|
23
spacy/tests/lang/fr/test_prefix_suffix_infix.py
Normal file
|
@ -0,0 +1,23 @@
|
||||||
|
# coding: utf-8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
from spacy.language import Language
|
||||||
|
from spacy.lang.punctuation import TOKENIZER_INFIXES
|
||||||
|
from spacy.lang.char_classes import ALPHA
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('text,expected_tokens', [
|
||||||
|
("l'avion", ["l'", "avion"]), ("j'ai", ["j'", "ai"])])
|
||||||
|
def test_issue768(text, expected_tokens):
|
||||||
|
"""Allow zero-width 'infix' token during the tokenization process."""
|
||||||
|
SPLIT_INFIX = r'(?<=[{a}]\')(?=[{a}])'.format(a=ALPHA)
|
||||||
|
|
||||||
|
class FrenchTest(Language):
|
||||||
|
class Defaults(Language.Defaults):
|
||||||
|
infixes = TOKENIZER_INFIXES + [SPLIT_INFIX]
|
||||||
|
|
||||||
|
fr_tokenizer_w_infix = FrenchTest.Defaults.create_tokenizer()
|
||||||
|
tokens = fr_tokenizer_w_infix(text)
|
||||||
|
assert len(tokens) == 2
|
||||||
|
assert [t.text for t in tokens] == expected_tokens
|
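The `SPLIT_INFIX` pattern in `test_issue768` consists only of a lookbehind and a lookahead, so it matches a position between characters rather than any characters themselves. The sketch below illustrates the idea with the standard `re` module alone. `ALPHA_ASSUMED` is a simplified stand-in for spaCy's `ALPHA` character class, and `split_at_infix` is a hypothetical helper, not the tokenizer's own algorithm.

```python
import re

# Simplified stand-in for spaCy's ALPHA class (an assumption for this sketch).
ALPHA_ASSUMED = "a-zA-ZÀ-ÖØ-öø-ÿ"

# Zero-width pattern: the position after "<letter>'" and before another letter.
SPLIT_INFIX = r"(?<=[{a}]')(?=[{a}])".format(a=ALPHA_ASSUMED)


def split_at_infix(text, pattern=SPLIT_INFIX):
    """Cut text at every position where the zero-width pattern matches."""
    pieces = []
    start = 0
    for match in re.finditer(pattern, text):
        pieces.append(text[start:match.start()])
        start = match.end()
    pieces.append(text[start:])
    # Drop empty strings that could appear at the edges.
    return [piece for piece in pieces if piece]


print(split_at_infix("l'avion"))
print(split_at_infix("j'ai"))
```

Because the match is empty, no characters are consumed: the apostrophe stays attached to the first piece, which is the shape the test expects (`["l'", "avion"]`).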
|
@ -1,6 +1,9 @@
|
||||||
# coding: utf8
|
# coding: utf8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
from spacy.lang.fr.lex_attrs import like_num
|
||||||
|
|
||||||
|
|
||||||
def test_tokenizer_handles_long_text(fr_tokenizer):
|
def test_tokenizer_handles_long_text(fr_tokenizer):
|
||||||
text = """L'histoire du TAL commence dans les années 1950, bien que l'on puisse \
|
text = """L'histoire du TAL commence dans les années 1950, bien que l'on puisse \
|
||||||
|
@ -12,6 +15,11 @@ un humain dans une conversation écrite en temps réel, de façon suffisamment \
|
||||||
convaincante que l'interlocuteur humain ne peut distinguer sûrement — sur la \
|
convaincante que l'interlocuteur humain ne peut distinguer sûrement — sur la \
|
||||||
base du seul contenu de la conversation — s'il interagit avec un programme \
|
base du seul contenu de la conversation — s'il interagit avec un programme \
|
||||||
ou avec un autre vrai humain."""
|
ou avec un autre vrai humain."""
|
||||||
|
|
||||||
tokens = fr_tokenizer(text)
|
tokens = fr_tokenizer(text)
|
||||||
assert len(tokens) == 113
|
assert len(tokens) == 113
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('word', ['onze', 'onzième'])
|
||||||
|
def test_fr_lex_attrs_capitals(word):
|
||||||
|
assert like_num(word)
|
||||||
|
assert like_num(word.upper())
|
||||||
|
|
|
@ -11,7 +11,7 @@ GA_TOKEN_EXCEPTION_TESTS = [
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text,expected_tokens', GA_TOKEN_EXCEPTION_TESTS)
|
@pytest.mark.parametrize('text,expected_tokens', GA_TOKEN_EXCEPTION_TESTS)
|
||||||
def test_tokenizer_handles_exception_cases(ga_tokenizer, text, expected_tokens):
|
def test_ga_tokenizer_handles_exception_cases(ga_tokenizer, text, expected_tokens):
|
||||||
tokens = ga_tokenizer(text)
|
tokens = ga_tokenizer(text)
|
||||||
token_list = [token.text for token in tokens if not token.is_space]
|
token_list = [token.text for token in tokens if not token.is_space]
|
||||||
assert expected_tokens == token_list
|
assert expected_tokens == token_list
|
||||||
|
|
|
@ -6,7 +6,7 @@ import pytest
|
||||||
|
|
||||||
@pytest.mark.parametrize('text,expected_tokens',
|
@pytest.mark.parametrize('text,expected_tokens',
|
||||||
[('פייתון היא שפת תכנות דינמית', ['פייתון', 'היא', 'שפת', 'תכנות', 'דינמית'])])
|
[('פייתון היא שפת תכנות דינמית', ['פייתון', 'היא', 'שפת', 'תכנות', 'דינמית'])])
|
||||||
def test_tokenizer_handles_abbreviation(he_tokenizer, text, expected_tokens):
|
def test_he_tokenizer_handles_abbreviation(he_tokenizer, text, expected_tokens):
|
||||||
tokens = he_tokenizer(text)
|
tokens = he_tokenizer(text)
|
||||||
token_list = [token.text for token in tokens if not token.is_space]
|
token_list = [token.text for token in tokens if not token.is_space]
|
||||||
assert expected_tokens == token_list
|
assert expected_tokens == token_list
|
||||||
|
@ -18,6 +18,6 @@ def test_tokenizer_handles_abbreviation(he_tokenizer, text, expected_tokens):
|
||||||
('עקבת אחריו בכל רחבי המדינה!', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '!']),
|
('עקבת אחריו בכל רחבי המדינה!', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '!']),
|
||||||
('עקבת אחריו בכל רחבי המדינה..', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '..']),
|
('עקבת אחריו בכל רחבי המדינה..', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '..']),
|
||||||
('עקבת אחריו בכל רחבי המדינה...', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '...'])])
|
('עקבת אחריו בכל רחבי המדינה...', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '...'])])
|
||||||
def test_tokenizer_handles_punct(he_tokenizer, text, expected_tokens):
|
def test_he_tokenizer_handles_punct(he_tokenizer, text, expected_tokens):
|
||||||
tokens = he_tokenizer(text)
|
tokens = he_tokenizer(text)
|
||||||
assert expected_tokens == [token.text for token in tokens]
|
assert expected_tokens == [token.text for token in tokens]
|
||||||
|
|
|
@ -3,6 +3,7 @@ from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
DEFAULT_TESTS = [
|
DEFAULT_TESTS = [
|
||||||
('N. kormányzósági\nszékhely.', ['N.', 'kormányzósági', 'székhely', '.']),
|
('N. kormányzósági\nszékhely.', ['N.', 'kormányzósági', 'székhely', '.']),
|
||||||
pytest.mark.xfail(('A .hu egy tld.', ['A', '.hu', 'egy', 'tld', '.'])),
|
pytest.mark.xfail(('A .hu egy tld.', ['A', '.hu', 'egy', 'tld', '.'])),
|
||||||
|
@ -277,7 +278,7 @@ TESTCASES = DEFAULT_TESTS + DOT_TESTS + QUOTE_TESTS + NUMBER_TESTS + HYPHEN_TEST
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text,expected_tokens', TESTCASES)
|
@pytest.mark.parametrize('text,expected_tokens', TESTCASES)
|
||||||
def test_tokenizer_handles_testcases(hu_tokenizer, text, expected_tokens):
|
def test_hu_tokenizer_handles_testcases(hu_tokenizer, text, expected_tokens):
|
||||||
tokens = hu_tokenizer(text)
|
tokens = hu_tokenizer(text)
|
||||||
token_list = [token.text for token in tokens if not token.is_space]
|
token_list = [token.text for token in tokens if not token.is_space]
|
||||||
assert expected_tokens == token_list
|
assert expected_tokens == token_list
|
||||||
|
|
|
@@ -1,38 +1,35 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
"""Test that tokenizer prefixes, suffixes and infixes are handled correctly."""
|
|
||||||
|
|
||||||
|
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["(Ma'arif)"])
|
@pytest.mark.parametrize('text', ["(Ma'arif)"])
|
||||||
def test_tokenizer_splits_no_special(id_tokenizer, text):
|
def test_id_tokenizer_splits_no_special(id_tokenizer, text):
|
||||||
tokens = id_tokenizer(text)
|
tokens = id_tokenizer(text)
|
||||||
assert len(tokens) == 3
|
assert len(tokens) == 3
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["Ma'arif"])
|
@pytest.mark.parametrize('text', ["Ma'arif"])
|
||||||
def test_tokenizer_splits_no_punct(id_tokenizer, text):
|
def test_id_tokenizer_splits_no_punct(id_tokenizer, text):
|
||||||
tokens = id_tokenizer(text)
|
tokens = id_tokenizer(text)
|
||||||
assert len(tokens) == 1
|
assert len(tokens) == 1
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["(Ma'arif"])
|
@pytest.mark.parametrize('text', ["(Ma'arif"])
|
||||||
def test_tokenizer_splits_prefix_punct(id_tokenizer, text):
|
def test_id_tokenizer_splits_prefix_punct(id_tokenizer, text):
|
||||||
tokens = id_tokenizer(text)
|
tokens = id_tokenizer(text)
|
||||||
assert len(tokens) == 2
|
assert len(tokens) == 2
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["Ma'arif)"])
|
@pytest.mark.parametrize('text', ["Ma'arif)"])
|
||||||
def test_tokenizer_splits_suffix_punct(id_tokenizer, text):
|
def test_id_tokenizer_splits_suffix_punct(id_tokenizer, text):
|
||||||
tokens = id_tokenizer(text)
|
tokens = id_tokenizer(text)
|
||||||
assert len(tokens) == 2
|
assert len(tokens) == 2
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["(Ma'arif)"])
|
@pytest.mark.parametrize('text', ["(Ma'arif)"])
|
||||||
def test_tokenizer_splits_even_wrap(id_tokenizer, text):
|
def test_id_tokenizer_splits_even_wrap(id_tokenizer, text):
|
||||||
tokens = id_tokenizer(text)
|
tokens = id_tokenizer(text)
|
||||||
assert len(tokens) == 3
|
assert len(tokens) == 3
|
||||||
|
|
||||||
|
@@ -44,49 +41,49 @@ def test_tokenizer_splits_uneven_wrap(id_tokenizer, text):
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text,length', [("S.Kom.", 1), ("SKom.", 2), ("(S.Kom.", 2)])
|
@pytest.mark.parametrize('text,length', [("S.Kom.", 1), ("SKom.", 2), ("(S.Kom.", 2)])
|
||||||
def test_tokenizer_splits_prefix_interact(id_tokenizer, text, length):
|
def test_id_tokenizer_splits_prefix_interact(id_tokenizer, text, length):
|
||||||
tokens = id_tokenizer(text)
|
tokens = id_tokenizer(text)
|
||||||
assert len(tokens) == length
|
assert len(tokens) == length
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["S.Kom.)"])
|
@pytest.mark.parametrize('text', ["S.Kom.)"])
|
||||||
def test_tokenizer_splits_suffix_interact(id_tokenizer, text):
|
def test_id_tokenizer_splits_suffix_interact(id_tokenizer, text):
|
||||||
tokens = id_tokenizer(text)
|
tokens = id_tokenizer(text)
|
||||||
assert len(tokens) == 2
|
assert len(tokens) == 2
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["(S.Kom.)"])
|
@pytest.mark.parametrize('text', ["(S.Kom.)"])
|
||||||
def test_tokenizer_splits_even_wrap_interact(id_tokenizer, text):
|
def test_id_tokenizer_splits_even_wrap_interact(id_tokenizer, text):
|
||||||
tokens = id_tokenizer(text)
|
tokens = id_tokenizer(text)
|
||||||
assert len(tokens) == 3
|
assert len(tokens) == 3
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["(S.Kom.?)"])
|
@pytest.mark.parametrize('text', ["(S.Kom.?)"])
|
||||||
def test_tokenizer_splits_uneven_wrap_interact(id_tokenizer, text):
|
def test_id_tokenizer_splits_uneven_wrap_interact(id_tokenizer, text):
|
||||||
tokens = id_tokenizer(text)
|
tokens = id_tokenizer(text)
|
||||||
assert len(tokens) == 4
|
assert len(tokens) == 4
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text,length', [("gara-gara", 1), ("Jokowi-Ahok", 3), ("Sukarno-Hatta", 3)])
|
@pytest.mark.parametrize('text,length', [("gara-gara", 1), ("Jokowi-Ahok", 3), ("Sukarno-Hatta", 3)])
|
||||||
def test_tokenizer_splits_hyphens(id_tokenizer, text, length):
|
def test_id_tokenizer_splits_hyphens(id_tokenizer, text, length):
|
||||||
tokens = id_tokenizer(text)
|
tokens = id_tokenizer(text)
|
||||||
assert len(tokens) == length
|
assert len(tokens) == length
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["0.1-13.5", "0.0-0.1", "103.27-300"])
|
@pytest.mark.parametrize('text', ["0.1-13.5", "0.0-0.1", "103.27-300"])
|
||||||
def test_tokenizer_splits_numeric_range(id_tokenizer, text):
|
def test_id_tokenizer_splits_numeric_range(id_tokenizer, text):
|
||||||
tokens = id_tokenizer(text)
|
tokens = id_tokenizer(text)
|
||||||
assert len(tokens) == 3
|
assert len(tokens) == 3
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["ini.Budi", "Halo.Bandung"])
|
@pytest.mark.parametrize('text', ["ini.Budi", "Halo.Bandung"])
|
||||||
def test_tokenizer_splits_period_infix(id_tokenizer, text):
|
def test_id_tokenizer_splits_period_infix(id_tokenizer, text):
|
||||||
tokens = id_tokenizer(text)
|
tokens = id_tokenizer(text)
|
||||||
assert len(tokens) == 3
|
assert len(tokens) == 3
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["Halo,Bandung", "satu,dua"])
|
@pytest.mark.parametrize('text', ["Halo,Bandung", "satu,dua"])
|
||||||
def test_tokenizer_splits_comma_infix(id_tokenizer, text):
|
def test_id_tokenizer_splits_comma_infix(id_tokenizer, text):
|
||||||
tokens = id_tokenizer(text)
|
tokens = id_tokenizer(text)
|
||||||
assert len(tokens) == 3
|
assert len(tokens) == 3
|
||||||
assert tokens[0].text == text.split(",")[0]
|
assert tokens[0].text == text.split(",")[0]
|
||||||
|
@@ -95,12 +92,12 @@ def test_tokenizer_splits_comma_infix(id_tokenizer, text):
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["halo...Bandung", "dia...pergi"])
|
@pytest.mark.parametrize('text', ["halo...Bandung", "dia...pergi"])
|
||||||
def test_tokenizer_splits_ellipsis_infix(id_tokenizer, text):
|
def test_id_tokenizer_splits_ellipsis_infix(id_tokenizer, text):
|
||||||
tokens = id_tokenizer(text)
|
tokens = id_tokenizer(text)
|
||||||
assert len(tokens) == 3
|
assert len(tokens) == 3
|
||||||
|
|
||||||
|
|
||||||
def test_tokenizer_splits_double_hyphen_infix(id_tokenizer):
|
def test_id_tokenizer_splits_double_hyphen_infix(id_tokenizer):
|
||||||
tokens = id_tokenizer("Arsene Wenger--manajer Arsenal--melakukan konferensi pers.")
|
tokens = id_tokenizer("Arsene Wenger--manajer Arsenal--melakukan konferensi pers.")
|
||||||
assert len(tokens) == 10
|
assert len(tokens) == 10
|
||||||
assert tokens[0].text == "Arsene"
|
assert tokens[0].text == "Arsene"
|
||||||
|
|
11
spacy/tests/lang/id/test_text.py
Normal file
|
@@ -0,0 +1,11 @@
|
||||||
|
# coding: utf-8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
from spacy.lang.id.lex_attrs import like_num
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('word', ['sebelas'])
|
||||||
|
def test_id_lex_attrs_capitals(word):
|
||||||
|
assert like_num(word)
|
||||||
|
assert like_num(word.upper())
|
|
@@ -1,18 +0,0 @@
|
||||||
# coding: utf-8
|
|
||||||
from __future__ import unicode_literals
|
|
||||||
|
|
||||||
import pytest
|
|
||||||
|
|
||||||
LEMMAS = (
|
|
||||||
('新しく', '新しい'),
|
|
||||||
('赤く', '赤い'),
|
|
||||||
('すごく', '凄い'),
|
|
||||||
('いただきました', '頂く'),
|
|
||||||
('なった', '成る'))
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('word,lemma', LEMMAS)
|
|
||||||
def test_japanese_lemmas(JA, word, lemma):
|
|
||||||
test_lemma = JA(word)[0].lemma_
|
|
||||||
assert test_lemma == lemma
|
|
||||||
|
|
||||||
|
|
15
spacy/tests/lang/ja/test_lemmatization.py
Normal file
|
@@ -0,0 +1,15 @@
|
||||||
|
# coding: utf-8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('word,lemma', [
|
||||||
|
('新しく', '新しい'),
|
||||||
|
('赤く', '赤い'),
|
||||||
|
('すごく', '凄い'),
|
||||||
|
('いただきました', '頂く'),
|
||||||
|
('なった', '成る')])
|
||||||
|
def test_ja_lemmatizer_assigns(ja_tokenizer, word, lemma):
|
||||||
|
test_lemma = ja_tokenizer(word)[0].lemma_
|
||||||
|
assert test_lemma == lemma
|
|
@@ -5,41 +5,43 @@ import pytest
|
||||||
|
|
||||||
|
|
||||||
TOKENIZER_TESTS = [
|
TOKENIZER_TESTS = [
|
||||||
("日本語だよ", ['日本', '語', 'だ', 'よ']),
|
("日本語だよ", ['日本', '語', 'だ', 'よ']),
|
||||||
("東京タワーの近くに住んでいます。", ['東京', 'タワー', 'の', '近く', 'に', '住ん', 'で', 'い', 'ます', '。']),
|
("東京タワーの近くに住んでいます。", ['東京', 'タワー', 'の', '近く', 'に', '住ん', 'で', 'い', 'ます', '。']),
|
||||||
("吾輩は猫である。", ['吾輩', 'は', '猫', 'で', 'ある', '。']),
|
("吾輩は猫である。", ['吾輩', 'は', '猫', 'で', 'ある', '。']),
|
||||||
("月に代わって、お仕置きよ!", ['月', 'に', '代わっ', 'て', '、', 'お', '仕置き', 'よ', '!']),
|
("月に代わって、お仕置きよ!", ['月', 'に', '代わっ', 'て', '、', 'お', '仕置き', 'よ', '!']),
|
||||||
("すもももももももものうち", ['すもも', 'も', 'もも', 'も', 'もも', 'の', 'うち'])
|
("すもももももももものうち", ['すもも', 'も', 'もも', 'も', 'もも', 'の', 'うち'])
|
||||||
]
|
]
|
||||||
|
|
||||||
TAG_TESTS = [
|
TAG_TESTS = [
|
||||||
("日本語だよ", ['日本語だよ', '名詞-固有名詞-地名-国', '名詞-普通名詞-一般', '助動詞', '助詞-終助詞']),
|
("日本語だよ", ['日本語だよ', '名詞-固有名詞-地名-国', '名詞-普通名詞-一般', '助動詞', '助詞-終助詞']),
|
||||||
("東京タワーの近くに住んでいます。", ['名詞-固有名詞-地名-一般', '名詞-普通名詞-一般', '助詞-格助詞', '名詞-普通名詞-副詞可能', '助詞-格助詞', '動詞-一般', '助詞-接続助詞', '動詞-非自立可能', '助動詞', '補助記号-句点']),
|
("東京タワーの近くに住んでいます。", ['名詞-固有名詞-地名-一般', '名詞-普通名詞-一般', '助詞-格助詞', '名詞-普通名詞-副詞可能', '助詞-格助詞', '動詞-一般', '助詞-接続助詞', '動詞-非自立可能', '助動詞', '補助記号-句点']),
|
||||||
("吾輩は猫である。", ['代名詞', '助詞-係助詞', '名詞-普通名詞-一般', '助動詞', '動詞-非自立可能', '補助記号-句点']),
|
("吾輩は猫である。", ['代名詞', '助詞-係助詞', '名詞-普通名詞-一般', '助動詞', '動詞-非自立可能', '補助記号-句点']),
|
||||||
("月に代わって、お仕置きよ!", ['名詞-普通名詞-助数詞可能', '助詞-格助詞', '動詞-一般', '助詞-接続助詞', '補助記号-読点', '接頭辞', '名詞-普通名詞-一般', '助詞-終助詞', '補助記号-句点 ']),
|
("月に代わって、お仕置きよ!", ['名詞-普通名詞-助数詞可能', '助詞-格助詞', '動詞-一般', '助詞-接続助詞', '補助記号-読点', '接頭辞', '名詞-普通名詞-一般', '助詞-終助詞', '補助記号-句点 ']),
|
||||||
("すもももももももものうち", ['名詞-普通名詞-一般', '助詞-係助詞', '名詞-普通名詞-一般', '助詞-係助詞', '名詞-普通名詞-一般', '助詞-格助詞', '名詞-普通名詞-副詞可能'])
|
("すもももももももものうち", ['名詞-普通名詞-一般', '助詞-係助詞', '名詞-普通名詞-一般', '助詞-係助詞', '名詞-普通名詞-一般', '助詞-格助詞', '名詞-普通名詞-副詞可能'])
|
||||||
]
|
]
|
||||||
|
|
||||||
POS_TESTS = [
|
POS_TESTS = [
|
||||||
('日本語だよ', ['PROPN', 'NOUN', 'AUX', 'PART']),
|
('日本語だよ', ['PROPN', 'NOUN', 'AUX', 'PART']),
|
||||||
('東京タワーの近くに住んでいます。', ['PROPN', 'NOUN', 'ADP', 'NOUN', 'ADP', 'VERB', 'SCONJ', 'VERB', 'AUX', 'PUNCT']),
|
('東京タワーの近くに住んでいます。', ['PROPN', 'NOUN', 'ADP', 'NOUN', 'ADP', 'VERB', 'SCONJ', 'VERB', 'AUX', 'PUNCT']),
|
||||||
('吾輩は猫である。', ['PRON', 'ADP', 'NOUN', 'AUX', 'VERB', 'PUNCT']),
|
('吾輩は猫である。', ['PRON', 'ADP', 'NOUN', 'AUX', 'VERB', 'PUNCT']),
|
||||||
('月に代わって、お仕置きよ!', ['NOUN', 'ADP', 'VERB', 'SCONJ', 'PUNCT', 'NOUN', 'NOUN', 'PART', 'PUNCT']),
|
('月に代わって、お仕置きよ!', ['NOUN', 'ADP', 'VERB', 'SCONJ', 'PUNCT', 'NOUN', 'NOUN', 'PART', 'PUNCT']),
|
||||||
('すもももももももものうち', ['NOUN', 'ADP', 'NOUN', 'ADP', 'NOUN', 'ADP', 'NOUN'])
|
('すもももももももものうち', ['NOUN', 'ADP', 'NOUN', 'ADP', 'NOUN', 'ADP', 'NOUN'])
|
||||||
]
|
]
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text,expected_tokens', TOKENIZER_TESTS)
|
@pytest.mark.parametrize('text,expected_tokens', TOKENIZER_TESTS)
|
||||||
def test_japanese_tokenizer(ja_tokenizer, text, expected_tokens):
|
def test_ja_tokenizer(ja_tokenizer, text, expected_tokens):
|
||||||
tokens = [token.text for token in ja_tokenizer(text)]
|
tokens = [token.text for token in ja_tokenizer(text)]
|
||||||
assert tokens == expected_tokens
|
assert tokens == expected_tokens
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text,expected_tags', TAG_TESTS)
|
@pytest.mark.parametrize('text,expected_tags', TAG_TESTS)
|
||||||
def test_japanese_tokenizer(ja_tokenizer, text, expected_tags):
|
def test_ja_tokenizer(ja_tokenizer, text, expected_tags):
|
||||||
tags = [token.tag_ for token in ja_tokenizer(text)]
|
tags = [token.tag_ for token in ja_tokenizer(text)]
|
||||||
assert tags == expected_tags
|
assert tags == expected_tags
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text,expected_pos', POS_TESTS)
|
@pytest.mark.parametrize('text,expected_pos', POS_TESTS)
|
||||||
def test_japanese_tokenizer(ja_tokenizer, text, expected_pos):
|
def test_ja_tokenizer(ja_tokenizer, text, expected_pos):
|
||||||
pos = [token.pos_ for token in ja_tokenizer(text)]
|
pos = [token.pos_ for token in ja_tokenizer(text)]
|
||||||
assert pos == expected_pos
|
assert pos == expected_pos
|
||||||
|
|
|
@@ -11,7 +11,7 @@ NB_TOKEN_EXCEPTION_TESTS = [
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text,expected_tokens', NB_TOKEN_EXCEPTION_TESTS)
|
@pytest.mark.parametrize('text,expected_tokens', NB_TOKEN_EXCEPTION_TESTS)
|
||||||
def test_tokenizer_handles_exception_cases(nb_tokenizer, text, expected_tokens):
|
def test_nb_tokenizer_handles_exception_cases(nb_tokenizer, text, expected_tokens):
|
||||||
tokens = nb_tokenizer(text)
|
tokens = nb_tokenizer(text)
|
||||||
token_list = [token.text for token in tokens if not token.is_space]
|
token_list = [token.text for token in tokens if not token.is_space]
|
||||||
assert expected_tokens == token_list
|
assert expected_tokens == token_list
|
||||||
|
|
11
spacy/tests/lang/nl/test_text.py
Normal file
|
@@ -0,0 +1,11 @@
|
||||||
|
# coding: utf-8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
from spacy.lang.nl.lex_attrs import like_num
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('word', ['elf', 'elfde'])
|
||||||
|
def test_nl_lex_attrs_capitals(word):
|
||||||
|
assert like_num(word)
|
||||||
|
assert like_num(word.upper())
|
11
spacy/tests/lang/pt/test_text.py
Normal file
|
@@ -0,0 +1,11 @@
|
||||||
|
# coding: utf-8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
from spacy.lang.pt.lex_attrs import like_num
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('word', ['onze', 'quadragésimo'])
|
||||||
|
def test_pt_lex_attrs_capitals(word):
|
||||||
|
assert like_num(word)
|
||||||
|
assert like_num(word.upper())
|
|
@@ -4,10 +4,11 @@ from __future__ import unicode_literals
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('string,lemma', [('câini', 'câine'),
|
@pytest.mark.parametrize('string,lemma', [
|
||||||
('expedițiilor', 'expediție'),
|
('câini', 'câine'),
|
||||||
('pensete', 'pensetă'),
|
('expedițiilor', 'expediție'),
|
||||||
('erau', 'fi')])
|
('pensete', 'pensetă'),
|
||||||
def test_lemmatizer_lookup_assigns(ro_tokenizer, string, lemma):
|
('erau', 'fi')])
|
||||||
|
def test_ro_lemmatizer_lookup_assigns(ro_tokenizer, string, lemma):
|
||||||
tokens = ro_tokenizer(string)
|
tokens = ro_tokenizer(string)
|
||||||
assert tokens[0].lemma_ == lemma
|
assert tokens[0].lemma_ == lemma
|
||||||
|
|
|
@@ -3,23 +3,20 @@ from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
DEFAULT_TESTS = [
|
|
||||||
|
TEST_CASES = [
|
||||||
('Adresa este str. Principală nr. 5.', ['Adresa', 'este', 'str.', 'Principală', 'nr.', '5', '.']),
|
('Adresa este str. Principală nr. 5.', ['Adresa', 'este', 'str.', 'Principală', 'nr.', '5', '.']),
|
||||||
('Teste, etc.', ['Teste', ',', 'etc.']),
|
('Teste, etc.', ['Teste', ',', 'etc.']),
|
||||||
('Lista, ș.a.m.d.', ['Lista', ',', 'ș.a.m.d.']),
|
('Lista, ș.a.m.d.', ['Lista', ',', 'ș.a.m.d.']),
|
||||||
('Și d.p.d.v. al...', ['Și', 'd.p.d.v.', 'al', '...'])
|
('Și d.p.d.v. al...', ['Și', 'd.p.d.v.', 'al', '...']),
|
||||||
]
|
# number tests
|
||||||
|
|
||||||
NUMBER_TESTS = [
|
|
||||||
('Clasa a 4-a.', ['Clasa', 'a', '4-a', '.']),
|
('Clasa a 4-a.', ['Clasa', 'a', '4-a', '.']),
|
||||||
('Al 12-lea ceas.', ['Al', '12-lea', 'ceas', '.'])
|
('Al 12-lea ceas.', ['Al', '12-lea', 'ceas', '.'])
|
||||||
]
|
]
|
||||||
|
|
||||||
TESTCASES = DEFAULT_TESTS + NUMBER_TESTS
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('text,expected_tokens', TEST_CASES)
|
||||||
@pytest.mark.parametrize('text,expected_tokens', TESTCASES)
|
def test_ro_tokenizer_handles_testcases(ro_tokenizer, text, expected_tokens):
|
||||||
def test_tokenizer_handles_testcases(ro_tokenizer, text, expected_tokens):
|
|
||||||
tokens = ro_tokenizer(text)
|
tokens = ro_tokenizer(text)
|
||||||
token_list = [token.text for token in tokens if not token.is_space]
|
token_list = [token.text for token in tokens if not token.is_space]
|
||||||
assert expected_tokens == token_list
|
assert expected_tokens == token_list
|
||||||
|
|
14
spacy/tests/lang/ru/test_exceptions.py
Normal file
|
@@ -0,0 +1,14 @@
|
||||||
|
# coding: utf-8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('text,norms', [
|
||||||
|
("пн.", ["понедельник"]),
|
||||||
|
("пт.", ["пятница"]),
|
||||||
|
("дек.", ["декабрь"])])
|
||||||
|
def test_ru_tokenizer_abbrev_exceptions(ru_tokenizer, text, norms):
|
||||||
|
tokens = ru_tokenizer(text)
|
||||||
|
assert len(tokens) == 1
|
||||||
|
assert [token.norm_ for token in tokens] == norms
|
|
@@ -2,70 +2,62 @@
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
from ....tokens.doc import Doc
|
from spacy.lang.ru import Russian
|
||||||
|
|
||||||
|
from ...util import get_doc
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
def ru_lemmatizer(RU):
|
def ru_lemmatizer():
|
||||||
return RU.Defaults.create_lemmatizer()
|
pymorphy = pytest.importorskip('pymorphy2')
|
||||||
|
return Russian.Defaults.create_lemmatizer()
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.models('ru')
|
def test_ru_doc_lemmatization(ru_tokenizer):
|
||||||
def test_doc_lemmatization(RU):
|
words = ['мама', 'мыла', 'раму']
|
||||||
doc = Doc(RU.vocab, words=['мама', 'мыла', 'раму'])
|
tags = ['NOUN__Animacy=Anim|Case=Nom|Gender=Fem|Number=Sing',
|
||||||
doc[0].tag_ = 'NOUN__Animacy=Anim|Case=Nom|Gender=Fem|Number=Sing'
|
'VERB__Aspect=Imp|Gender=Fem|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act',
|
||||||
doc[1].tag_ = 'VERB__Aspect=Imp|Gender=Fem|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act'
|
'NOUN__Animacy=Anim|Case=Acc|Gender=Fem|Number=Sing']
|
||||||
doc[2].tag_ = 'NOUN__Animacy=Anim|Case=Acc|Gender=Fem|Number=Sing'
|
doc = get_doc(ru_tokenizer.vocab, words=words, tags=tags)
|
||||||
|
|
||||||
lemmas = [token.lemma_ for token in doc]
|
lemmas = [token.lemma_ for token in doc]
|
||||||
assert lemmas == ['мама', 'мыть', 'рама']
|
assert lemmas == ['мама', 'мыть', 'рама']
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.models('ru')
|
@pytest.mark.parametrize('text,lemmas', [
|
||||||
@pytest.mark.parametrize('text,lemmas', [('гвоздики', ['гвоздик', 'гвоздика']),
|
('гвоздики', ['гвоздик', 'гвоздика']),
|
||||||
('люди', ['человек']),
|
('люди', ['человек']),
|
||||||
('реки', ['река']),
|
('реки', ['река']),
|
||||||
('кольцо', ['кольцо']),
|
('кольцо', ['кольцо']),
|
||||||
('пепперони', ['пепперони'])])
|
('пепперони', ['пепперони'])])
|
||||||
def test_ru_lemmatizer_noun_lemmas(ru_lemmatizer, text, lemmas):
|
def test_ru_lemmatizer_noun_lemmas(ru_lemmatizer, text, lemmas):
|
||||||
assert sorted(ru_lemmatizer.noun(text)) == lemmas
|
assert sorted(ru_lemmatizer.noun(text)) == lemmas
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.models('ru')
|
@pytest.mark.models('ru')
|
||||||
@pytest.mark.parametrize('text,pos,morphology,lemma', [('рой', 'NOUN', None, 'рой'),
|
@pytest.mark.parametrize('text,pos,morphology,lemma', [
|
||||||
('рой', 'VERB', None, 'рыть'),
|
('рой', 'NOUN', None, 'рой'),
|
||||||
('клей', 'NOUN', None, 'клей'),
|
('рой', 'VERB', None, 'рыть'),
|
||||||
('клей', 'VERB', None, 'клеить'),
|
('клей', 'NOUN', None, 'клей'),
|
||||||
('три', 'NUM', None, 'три'),
|
('клей', 'VERB', None, 'клеить'),
|
||||||
('кос', 'NOUN', {'Number': 'Sing'}, 'кос'),
|
('три', 'NUM', None, 'три'),
|
||||||
('кос', 'NOUN', {'Number': 'Plur'}, 'коса'),
|
('кос', 'NOUN', {'Number': 'Sing'}, 'кос'),
|
||||||
('кос', 'ADJ', None, 'косой'),
|
('кос', 'NOUN', {'Number': 'Plur'}, 'коса'),
|
||||||
('потом', 'NOUN', None, 'пот'),
|
('кос', 'ADJ', None, 'косой'),
|
||||||
('потом', 'ADV', None, 'потом')
|
('потом', 'NOUN', None, 'пот'),
|
||||||
])
|
('потом', 'ADV', None, 'потом')])
|
||||||
def test_ru_lemmatizer_works_with_different_pos_homonyms(ru_lemmatizer, text, pos, morphology, lemma):
|
def test_ru_lemmatizer_works_with_different_pos_homonyms(ru_lemmatizer, text, pos, morphology, lemma):
|
||||||
assert ru_lemmatizer(text, pos, morphology) == [lemma]
|
assert ru_lemmatizer(text, pos, morphology) == [lemma]
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.models('ru')
|
@pytest.mark.parametrize('text,morphology,lemma', [
|
||||||
@pytest.mark.parametrize('text,morphology,lemma', [('гвоздики', {'Gender': 'Fem'}, 'гвоздика'),
|
('гвоздики', {'Gender': 'Fem'}, 'гвоздика'),
|
||||||
('гвоздики', {'Gender': 'Masc'}, 'гвоздик'),
|
('гвоздики', {'Gender': 'Masc'}, 'гвоздик'),
|
||||||
('вина', {'Gender': 'Fem'}, 'вина'),
|
('вина', {'Gender': 'Fem'}, 'вина'),
|
||||||
('вина', {'Gender': 'Neut'}, 'вино')
|
('вина', {'Gender': 'Neut'}, 'вино')])
|
||||||
])
|
|
||||||
def test_ru_lemmatizer_works_with_noun_homonyms(ru_lemmatizer, text, morphology, lemma):
|
def test_ru_lemmatizer_works_with_noun_homonyms(ru_lemmatizer, text, morphology, lemma):
|
||||||
assert ru_lemmatizer.noun(text, morphology) == [lemma]
|
assert ru_lemmatizer.noun(text, morphology) == [lemma]
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.models('ru')
|
|
||||||
def test_ru_lemmatizer_punct(ru_lemmatizer):
|
def test_ru_lemmatizer_punct(ru_lemmatizer):
|
||||||
assert ru_lemmatizer.punct('«') == ['"']
|
assert ru_lemmatizer.punct('«') == ['"']
|
||||||
assert ru_lemmatizer.punct('»') == ['"']
|
assert ru_lemmatizer.punct('»') == ['"']
|
||||||
|
|
||||||
|
|
||||||
# @pytest.mark.models('ru')
|
|
||||||
# def test_ru_lemmatizer_lemma_assignment(RU):
|
|
||||||
# text = "А роза упала на лапу Азора."
|
|
||||||
# doc = RU.make_doc(text)
|
|
||||||
# RU.tagger(doc)
|
|
||||||
# assert all(t.lemma_ != '' for t in doc)
|
|
||||||
|
|
11
spacy/tests/lang/ru/test_text.py
Normal file
|
@@ -0,0 +1,11 @@
|
||||||
|
# coding: utf-8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
from spacy.lang.ru.lex_attrs import like_num
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('word', ['одиннадцать'])
|
||||||
|
def test_ru_lex_attrs_capitals(word):
|
||||||
|
assert like_num(word)
|
||||||
|
assert like_num(word.upper())
|
|
@@ -1,7 +1,4 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
"""Test that open, closed and paired punctuation is split off correctly."""
|
|
||||||
|
|
||||||
|
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
|
@@ -1,16 +0,0 @@
|
||||||
# coding: utf-8
|
|
||||||
"""Test that tokenizer exceptions are parsed correctly."""
|
|
||||||
|
|
||||||
|
|
||||||
from __future__ import unicode_literals
|
|
||||||
|
|
||||||
import pytest
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text,norms', [("пн.", ["понедельник"]),
|
|
||||||
("пт.", ["пятница"]),
|
|
||||||
("дек.", ["декабрь"])])
|
|
||||||
def test_ru_tokenizer_abbrev_exceptions(ru_tokenizer, text, norms):
|
|
||||||
tokens = ru_tokenizer(text)
|
|
||||||
assert len(tokens) == 1
|
|
||||||
assert [token.norm_ for token in tokens] == norms
|
|
|
@@ -11,14 +11,14 @@ SV_TOKEN_EXCEPTION_TESTS = [
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text,expected_tokens', SV_TOKEN_EXCEPTION_TESTS)
|
@pytest.mark.parametrize('text,expected_tokens', SV_TOKEN_EXCEPTION_TESTS)
|
||||||
def test_tokenizer_handles_exception_cases(sv_tokenizer, text, expected_tokens):
|
def test_sv_tokenizer_handles_exception_cases(sv_tokenizer, text, expected_tokens):
|
||||||
tokens = sv_tokenizer(text)
|
tokens = sv_tokenizer(text)
|
||||||
token_list = [token.text for token in tokens if not token.is_space]
|
token_list = [token.text for token in tokens if not token.is_space]
|
||||||
assert expected_tokens == token_list
|
assert expected_tokens == token_list
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["driveru", "hajaru", "Serru", "Fixaru"])
|
@pytest.mark.parametrize('text', ["driveru", "hajaru", "Serru", "Fixaru"])
|
||||||
def test_tokenizer_handles_verb_exceptions(sv_tokenizer, text):
|
def test_sv_tokenizer_handles_verb_exceptions(sv_tokenizer, text):
|
||||||
tokens = sv_tokenizer(text)
|
tokens = sv_tokenizer(text)
|
||||||
assert len(tokens) == 2
|
assert len(tokens) == 2
|
||||||
assert tokens[1].text == "u"
|
assert tokens[1].text == "u"
|
||||||
|
|
|
@@ -1,10 +1,9 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
from ...attrs import intify_attrs, ORTH, NORM, LEMMA, IS_ALPHA
|
|
||||||
from ...lang.lex_attrs import is_punct, is_ascii, is_currency, like_url, word_shape
|
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
from spacy.attrs import intify_attrs, ORTH, NORM, LEMMA, IS_ALPHA
|
||||||
|
from spacy.lang.lex_attrs import is_punct, is_ascii, is_currency, like_url, word_shape
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text', ["dog"])
|
@pytest.mark.parametrize('text', ["dog"])
|
||||||
|
|
|
@@ -3,11 +3,9 @@ from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
TOKENIZER_TESTS = [
|
|
||||||
("คุณรักผมไหม", ['คุณ', 'รัก', 'ผม', 'ไหม'])
|
|
||||||
]
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text,expected_tokens', TOKENIZER_TESTS)
|
@pytest.mark.parametrize('text,expected_tokens', [
|
||||||
def test_thai_tokenizer(th_tokenizer, text, expected_tokens):
|
("คุณรักผมไหม", ['คุณ', 'รัก', 'ผม', 'ไหม'])])
|
||||||
tokens = [token.text for token in th_tokenizer(text)]
|
def test_th_tokenizer(th_tokenizer, text, expected_tokens):
|
||||||
assert tokens == expected_tokens
|
tokens = [token.text for token in th_tokenizer(text)]
|
||||||
|
assert tokens == expected_tokens
|
||||||
|
|
|
@@ -3,13 +3,15 @@ from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
@pytest.mark.parametrize('string,lemma', [('evlerimizdeki', 'ev'),
|
|
||||||
('işlerimizi', 'iş'),
|
@pytest.mark.parametrize('string,lemma', [
|
||||||
('biran', 'biran'),
|
('evlerimizdeki', 'ev'),
|
||||||
('bitirmeliyiz', 'bitir'),
|
('işlerimizi', 'iş'),
|
||||||
('isteklerimizi', 'istek'),
|
('biran', 'biran'),
|
||||||
('karşılaştırmamızın', 'karşılaştır'),
|
('bitirmeliyiz', 'bitir'),
|
||||||
('çoğulculuktan', 'çoğulcu')])
|
('isteklerimizi', 'istek'),
|
||||||
def test_lemmatizer_lookup_assigns(tr_tokenizer, string, lemma):
|
('karşılaştırmamızın', 'karşılaştır'),
|
||||||
|
('çoğulculuktan', 'çoğulcu')])
|
||||||
|
def test_tr_lemmatizer_lookup_assigns(tr_tokenizer, string, lemma):
|
||||||
tokens = tr_tokenizer(string)
|
tokens = tr_tokenizer(string)
|
||||||
assert tokens[0].lemma_ == lemma
|
assert tokens[0].lemma_ == lemma
|
|
@@ -3,6 +3,7 @@ from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
INFIX_HYPHEN_TESTS = [
|
INFIX_HYPHEN_TESTS = [
|
||||||
("Явым-төшем күләме.", "Явым-төшем күләме .".split()),
|
("Явым-төшем күләме.", "Явым-төшем күләме .".split()),
|
||||||
("Хатын-кыз киеме.", "Хатын-кыз киеме .".split())
|
("Хатын-кыз киеме.", "Хатын-кыз киеме .".split())
|
||||||
|
@@ -64,12 +65,12 @@ NORM_TESTCASES = [
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize("text,expected_tokens", TESTCASES)
|
@pytest.mark.parametrize("text,expected_tokens", TESTCASES)
|
||||||
def test_tokenizer_handles_testcases(tt_tokenizer, text, expected_tokens):
|
def test_tt_tokenizer_handles_testcases(tt_tokenizer, text, expected_tokens):
|
||||||
tokens = [token.text for token in tt_tokenizer(text) if not token.is_space]
|
tokens = [token.text for token in tt_tokenizer(text) if not token.is_space]
|
||||||
assert expected_tokens == tokens
|
assert expected_tokens == tokens
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('text,norms', NORM_TESTCASES)
|
@pytest.mark.parametrize('text,norms', NORM_TESTCASES)
|
||||||
def test_tokenizer_handles_norm_exceptions(tt_tokenizer, text, norms):
|
def test_tt_tokenizer_handles_norm_exceptions(tt_tokenizer, text, norms):
|
||||||
tokens = tt_tokenizer(text)
|
tokens = tt_tokenizer(text)
|
||||||
assert [token.norm_ for token in tokens] == norms
|
assert [token.norm_ for token in tokens] == norms
|
||||||
|
|
|
@@ -1,19 +1,14 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
|
|
||||||
"""Test that longer and mixed texts are tokenized correctly."""
|
|
||||||
|
|
||||||
|
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
|
|
||||||
def test_tokenizer_handles_long_text(ur_tokenizer):
|
def test_ur_tokenizer_handles_long_text(ur_tokenizer):
|
||||||
text = """اصل میں رسوا ہونے کی ہمیں
|
text = """اصل میں رسوا ہونے کی ہمیں
|
||||||
کچھ عادت سی ہو گئی ہے اس لئے جگ ہنسائی کا ذکر نہیں کرتا،ہوا کچھ یوں کہ عرصہ چھ سال بعد ہمیں بھی خیال آیا
|
کچھ عادت سی ہو گئی ہے اس لئے جگ ہنسائی کا ذکر نہیں کرتا،ہوا کچھ یوں کہ عرصہ چھ سال بعد ہمیں بھی خیال آیا
|
||||||
کہ ایک عدد ٹیلی ویژن ہی کیوں نہ خرید لیں ، سوچا ورلڈ کپ ہی دیکھیں گے۔اپنے پاکستان کے کھلاڑیوں کو دیکھ کر
|
کہ ایک عدد ٹیلی ویژن ہی کیوں نہ خرید لیں ، سوچا ورلڈ کپ ہی دیکھیں گے۔اپنے پاکستان کے کھلاڑیوں کو دیکھ کر
|
||||||
ورلڈ کپ دیکھنے کا حوصلہ ہی نہ رہا تو اب یوں ہی ادھر اُدھر کے چینل گھمانے لگ پڑتے ہیں۔"""
|
ورلڈ کپ دیکھنے کا حوصلہ ہی نہ رہا تو اب یوں ہی ادھر اُدھر کے چینل گھمانے لگ پڑتے ہیں۔"""
|
||||||
|
|
||||||
tokens = ur_tokenizer(text)
|
tokens = ur_tokenizer(text)
|
||||||
assert len(tokens) == 77
|
assert len(tokens) == 77
|
||||||
|
|
||||||
|
@@ -21,6 +16,6 @@ def test_tokenizer_handles_long_text(ur_tokenizer):
|
||||||
@pytest.mark.parametrize('text,length', [
|
@pytest.mark.parametrize('text,length', [
|
||||||
("تحریر باسط حبیب", 3),
|
("تحریر باسط حبیب", 3),
|
||||||
("میرا پاکستان", 2)])
|
("میرا پاکستان", 2)])
|
||||||
def test_tokenizer_handles_cnts(ur_tokenizer, text, length):
|
def test_ur_tokenizer_handles_cnts(ur_tokenizer, text, length):
|
||||||
tokens = ur_tokenizer(text)
|
tokens = ur_tokenizer(text)
|
||||||
assert len(tokens) == length
|
assert len(tokens) == length
|
||||||
|
|
|
@@ -1,19 +1,16 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
from ..matcher import Matcher, PhraseMatcher
|
|
||||||
from .util import get_doc
|
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
from spacy.matcher import Matcher
|
||||||
|
from spacy.tokens import Doc
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
def matcher(en_vocab):
|
def matcher(en_vocab):
|
||||||
rules = {
|
rules = {'JS': [[{'ORTH': 'JavaScript'}]],
|
||||||
'JS': [[{'ORTH': 'JavaScript'}]],
|
'GoogleNow': [[{'ORTH': 'Google'}, {'ORTH': 'Now'}]],
|
||||||
'GoogleNow': [[{'ORTH': 'Google'}, {'ORTH': 'Now'}]],
|
'Java': [[{'LOWER': 'java'}]]}
|
||||||
'Java': [[{'LOWER': 'java'}]]
|
|
||||||
}
|
|
||||||
matcher = Matcher(en_vocab)
|
matcher = Matcher(en_vocab)
|
||||||
for key, patterns in rules.items():
|
for key, patterns in rules.items():
|
||||||
matcher.add(key, None, *patterns)
|
matcher.add(key, None, *patterns)
|
||||||
|
@ -36,7 +33,7 @@ def test_matcher_from_api_docs(en_vocab):
|
||||||
|
|
||||||
def test_matcher_from_usage_docs(en_vocab):
|
def test_matcher_from_usage_docs(en_vocab):
|
||||||
text = "Wow 😀 This is really cool! 😂 😂"
|
text = "Wow 😀 This is really cool! 😂 😂"
|
||||||
doc = get_doc(en_vocab, words=text.split(' '))
|
doc = Doc(en_vocab, words=text.split(' '))
|
||||||
pos_emoji = ['😀', '😃', '😂', '🤣', '😊', '😍']
|
pos_emoji = ['😀', '😃', '😂', '🤣', '😊', '😍']
|
||||||
pos_patterns = [[{'ORTH': emoji}] for emoji in pos_emoji]
|
pos_patterns = [[{'ORTH': emoji}] for emoji in pos_emoji]
|
||||||
|
|
||||||
|
@ -55,68 +52,46 @@ def test_matcher_from_usage_docs(en_vocab):
|
||||||
assert doc[1].norm_ == 'happy emoji'
|
assert doc[1].norm_ == 'happy emoji'
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('words', [["Some", "words"]])
|
def test_matcher_len_contains(matcher):
|
||||||
def test_matcher_init(en_vocab, words):
|
assert len(matcher) == 3
|
||||||
matcher = Matcher(en_vocab)
|
|
||||||
doc = get_doc(en_vocab, words)
|
|
||||||
assert len(matcher) == 0
|
|
||||||
assert matcher(doc) == []
|
|
||||||
|
|
||||||
|
|
||||||
def test_matcher_contains(matcher):
|
|
||||||
matcher.add('TEST', None, [{'ORTH': 'test'}])
|
matcher.add('TEST', None, [{'ORTH': 'test'}])
|
||||||
assert 'TEST' in matcher
|
assert 'TEST' in matcher
|
||||||
assert 'TEST2' not in matcher
|
assert 'TEST2' not in matcher
|
||||||
|
|
||||||
|
|
||||||
def test_matcher_no_match(matcher):
|
def test_matcher_no_match(matcher):
|
||||||
words = ["I", "like", "cheese", "."]
|
doc = Doc(matcher.vocab, words=["I", "like", "cheese", "."])
|
||||||
doc = get_doc(matcher.vocab, words)
|
|
||||||
assert matcher(doc) == []
|
assert matcher(doc) == []
|
||||||
|
|
||||||
|
|
||||||
def test_matcher_compile(en_vocab):
|
|
||||||
rules = {
|
|
||||||
'JS': [[{'ORTH': 'JavaScript'}]],
|
|
||||||
'GoogleNow': [[{'ORTH': 'Google'}, {'ORTH': 'Now'}]],
|
|
||||||
'Java': [[{'LOWER': 'java'}]]
|
|
||||||
}
|
|
||||||
matcher = Matcher(en_vocab)
|
|
||||||
for key, patterns in rules.items():
|
|
||||||
matcher.add(key, None, *patterns)
|
|
||||||
assert len(matcher) == 3
|
|
||||||
|
|
||||||
|
|
||||||
def test_matcher_match_start(matcher):
|
def test_matcher_match_start(matcher):
|
||||||
words = ["JavaScript", "is", "good"]
|
doc = Doc(matcher.vocab, words=["JavaScript", "is", "good"])
|
||||||
doc = get_doc(matcher.vocab, words)
|
|
||||||
assert matcher(doc) == [(matcher.vocab.strings['JS'], 0, 1)]
|
assert matcher(doc) == [(matcher.vocab.strings['JS'], 0, 1)]
|
||||||
|
|
||||||
|
|
||||||
def test_matcher_match_end(matcher):
|
def test_matcher_match_end(matcher):
|
||||||
words = ["I", "like", "java"]
|
words = ["I", "like", "java"]
|
||||||
doc = get_doc(matcher.vocab, words)
|
doc = Doc(matcher.vocab, words=words)
|
||||||
assert matcher(doc) == [(doc.vocab.strings['Java'], 2, 3)]
|
assert matcher(doc) == [(doc.vocab.strings['Java'], 2, 3)]
|
||||||
|
|
||||||
|
|
||||||
def test_matcher_match_middle(matcher):
|
def test_matcher_match_middle(matcher):
|
||||||
words = ["I", "like", "Google", "Now", "best"]
|
words = ["I", "like", "Google", "Now", "best"]
|
||||||
doc = get_doc(matcher.vocab, words)
|
doc = Doc(matcher.vocab, words=words)
|
||||||
assert matcher(doc) == [(doc.vocab.strings['GoogleNow'], 2, 4)]
|
assert matcher(doc) == [(doc.vocab.strings['GoogleNow'], 2, 4)]
|
||||||
|
|
||||||
|
|
||||||
def test_matcher_match_multi(matcher):
|
def test_matcher_match_multi(matcher):
|
||||||
words = ["I", "like", "Google", "Now", "and", "java", "best"]
|
words = ["I", "like", "Google", "Now", "and", "java", "best"]
|
||||||
doc = get_doc(matcher.vocab, words)
|
doc = Doc(matcher.vocab, words=words)
|
||||||
assert matcher(doc) == [(doc.vocab.strings['GoogleNow'], 2, 4),
|
assert matcher(doc) == [(doc.vocab.strings['GoogleNow'], 2, 4),
|
||||||
(doc.vocab.strings['Java'], 5, 6)]
|
(doc.vocab.strings['Java'], 5, 6)]
|
||||||
|
|
||||||
|
|
||||||
def test_matcher_empty_dict(en_vocab):
|
def test_matcher_empty_dict(en_vocab):
|
||||||
'''Test matcher allows empty token specs, meaning match on any token.'''
|
"""Test matcher allows empty token specs, meaning match on any token."""
|
||||||
matcher = Matcher(en_vocab)
|
matcher = Matcher(en_vocab)
|
||||||
abc = ["a", "b", "c"]
|
doc = Doc(matcher.vocab, words=["a", "b", "c"])
|
||||||
doc = get_doc(matcher.vocab, abc)
|
|
||||||
matcher.add('A.C', None, [{'ORTH': 'a'}, {}, {'ORTH': 'c'}])
|
matcher.add('A.C', None, [{'ORTH': 'a'}, {}, {'ORTH': 'c'}])
|
||||||
matches = matcher(doc)
|
matches = matcher(doc)
|
||||||
assert len(matches) == 1
|
assert len(matches) == 1
|
||||||
|
@ -129,8 +104,7 @@ def test_matcher_empty_dict(en_vocab):
|
||||||
|
|
||||||
def test_matcher_operator_shadow(en_vocab):
|
def test_matcher_operator_shadow(en_vocab):
|
||||||
matcher = Matcher(en_vocab)
|
matcher = Matcher(en_vocab)
|
||||||
abc = ["a", "b", "c"]
|
doc = Doc(matcher.vocab, words=["a", "b", "c"])
|
||||||
doc = get_doc(matcher.vocab, abc)
|
|
||||||
pattern = [{'ORTH': 'a'}, {"IS_ALPHA": True, "OP": "+"}, {'ORTH': 'c'}]
|
pattern = [{'ORTH': 'a'}, {"IS_ALPHA": True, "OP": "+"}, {'ORTH': 'c'}]
|
||||||
matcher.add('A.C', None, pattern)
|
matcher.add('A.C', None, pattern)
|
||||||
matches = matcher(doc)
|
matches = matcher(doc)
|
||||||
|
@ -138,32 +112,6 @@ def test_matcher_operator_shadow(en_vocab):
|
||||||
assert matches[0][1:] == (0, 3)
|
assert matches[0][1:] == (0, 3)
|
||||||
|
|
||||||
|
|
||||||
def test_matcher_phrase_matcher(en_vocab):
|
|
||||||
words = ["Google", "Now"]
|
|
||||||
doc = get_doc(en_vocab, words)
|
|
||||||
matcher = PhraseMatcher(en_vocab)
|
|
||||||
matcher.add('COMPANY', None, doc)
|
|
||||||
words = ["I", "like", "Google", "Now", "best"]
|
|
||||||
doc = get_doc(en_vocab, words)
|
|
||||||
assert len(matcher(doc)) == 1
|
|
||||||
|
|
||||||
|
|
||||||
def test_phrase_matcher_length(en_vocab):
|
|
||||||
matcher = PhraseMatcher(en_vocab)
|
|
||||||
assert len(matcher) == 0
|
|
||||||
matcher.add('TEST', None, get_doc(en_vocab, ['test']))
|
|
||||||
assert len(matcher) == 1
|
|
||||||
matcher.add('TEST2', None, get_doc(en_vocab, ['test2']))
|
|
||||||
assert len(matcher) == 2
|
|
||||||
|
|
||||||
|
|
||||||
def test_phrase_matcher_contains(en_vocab):
|
|
||||||
matcher = PhraseMatcher(en_vocab)
|
|
||||||
matcher.add('TEST', None, get_doc(en_vocab, ['test']))
|
|
||||||
assert 'TEST' in matcher
|
|
||||||
assert 'TEST2' not in matcher
|
|
||||||
|
|
||||||
|
|
||||||
def test_matcher_match_zero(matcher):
|
def test_matcher_match_zero(matcher):
|
||||||
words1 = 'He said , " some words " ...'.split()
|
words1 = 'He said , " some words " ...'.split()
|
||||||
words2 = 'He said , " some three words " ...'.split()
|
words2 = 'He said , " some three words " ...'.split()
|
||||||
|
@ -176,12 +124,10 @@ def test_matcher_match_zero(matcher):
|
||||||
{'IS_PUNCT': True},
|
{'IS_PUNCT': True},
|
||||||
{'IS_PUNCT': True},
|
{'IS_PUNCT': True},
|
||||||
{'ORTH': '"'}]
|
{'ORTH': '"'}]
|
||||||
|
|
||||||
matcher.add('Quote', None, pattern1)
|
matcher.add('Quote', None, pattern1)
|
||||||
doc = get_doc(matcher.vocab, words1)
|
doc = Doc(matcher.vocab, words=words1)
|
||||||
assert len(matcher(doc)) == 1
|
assert len(matcher(doc)) == 1
|
||||||
|
doc = Doc(matcher.vocab, words=words2)
|
||||||
doc = get_doc(matcher.vocab, words2)
|
|
||||||
assert len(matcher(doc)) == 0
|
assert len(matcher(doc)) == 0
|
||||||
matcher.add('Quote', None, pattern2)
|
matcher.add('Quote', None, pattern2)
|
||||||
assert len(matcher(doc)) == 0
|
assert len(matcher(doc)) == 0
|
||||||
|
@ -194,14 +140,14 @@ def test_matcher_match_zero_plus(matcher):
|
||||||
{'ORTH': '"'}]
|
{'ORTH': '"'}]
|
||||||
matcher = Matcher(matcher.vocab)
|
matcher = Matcher(matcher.vocab)
|
||||||
matcher.add('Quote', None, pattern)
|
matcher.add('Quote', None, pattern)
|
||||||
doc = get_doc(matcher.vocab, words)
|
doc = Doc(matcher.vocab, words=words)
|
||||||
assert len(matcher(doc)) == 1
|
assert len(matcher(doc)) == 1
|
||||||
|
|
||||||
|
|
||||||
def test_matcher_match_one_plus(matcher):
|
def test_matcher_match_one_plus(matcher):
|
||||||
control = Matcher(matcher.vocab)
|
control = Matcher(matcher.vocab)
|
||||||
control.add('BasicPhilippe', None, [{'ORTH': 'Philippe'}])
|
control.add('BasicPhilippe', None, [{'ORTH': 'Philippe'}])
|
||||||
doc = get_doc(control.vocab, ['Philippe', 'Philippe'])
|
doc = Doc(control.vocab, words=['Philippe', 'Philippe'])
|
||||||
m = control(doc)
|
m = control(doc)
|
||||||
assert len(m) == 2
|
assert len(m) == 2
|
||||||
matcher.add('KleenePhilippe', None, [{'ORTH': 'Philippe', 'OP': '1'},
|
matcher.add('KleenePhilippe', None, [{'ORTH': 'Philippe', 'OP': '1'},
|
||||||
|
@ -210,61 +156,11 @@ def test_matcher_match_one_plus(matcher):
|
||||||
assert len(m) == 1
|
assert len(m) == 1
|
||||||
|
|
||||||
|
|
||||||
def test_operator_combos(matcher):
|
|
||||||
cases = [
|
|
||||||
('aaab', 'a a a b', True),
|
|
||||||
('aaab', 'a+ b', True),
|
|
||||||
('aaab', 'a+ a+ b', True),
|
|
||||||
('aaab', 'a+ a+ a b', True),
|
|
||||||
('aaab', 'a+ a+ a+ b', True),
|
|
||||||
('aaab', 'a+ a a b', True),
|
|
||||||
('aaab', 'a+ a a', True),
|
|
||||||
('aaab', 'a+', True),
|
|
||||||
('aaa', 'a+ b', False),
|
|
||||||
('aaa', 'a+ a+ b', False),
|
|
||||||
('aaa', 'a+ a+ a+ b', False),
|
|
||||||
('aaa', 'a+ a b', False),
|
|
||||||
('aaa', 'a+ a a b', False),
|
|
||||||
('aaab', 'a+ a a', True),
|
|
||||||
('aaab', 'a+', True),
|
|
||||||
('aaab', 'a+ a b', True)
|
|
||||||
]
|
|
||||||
for string, pattern_str, result in cases:
|
|
||||||
matcher = Matcher(matcher.vocab)
|
|
||||||
doc = get_doc(matcher.vocab, words=list(string))
|
|
||||||
pattern = []
|
|
||||||
for part in pattern_str.split():
|
|
||||||
if part.endswith('+'):
|
|
||||||
pattern.append({'ORTH': part[0], 'op': '+'})
|
|
||||||
else:
|
|
||||||
pattern.append({'ORTH': part})
|
|
||||||
matcher.add('PATTERN', None, pattern)
|
|
||||||
matches = matcher(doc)
|
|
||||||
if result:
|
|
||||||
assert matches, (string, pattern_str)
|
|
||||||
else:
|
|
||||||
assert not matches, (string, pattern_str)
|
|
||||||
|
|
||||||
|
|
||||||
def test_matcher_end_zero_plus(matcher):
|
|
||||||
"""Test matcher works when patterns end with * operator. (issue 1450)"""
|
|
||||||
matcher = Matcher(matcher.vocab)
|
|
||||||
pattern = [{'ORTH': "a"}, {'ORTH': "b", 'OP': "*"}]
|
|
||||||
matcher.add("TSTEND", None, pattern)
|
|
||||||
nlp = lambda string: get_doc(matcher.vocab, string.split())
|
|
||||||
assert len(matcher(nlp('a'))) == 1
|
|
||||||
assert len(matcher(nlp('a b'))) == 2
|
|
||||||
assert len(matcher(nlp('a c'))) == 1
|
|
||||||
assert len(matcher(nlp('a b c'))) == 2
|
|
||||||
assert len(matcher(nlp('a b b c'))) == 3
|
|
||||||
assert len(matcher(nlp('a b b'))) == 3
|
|
||||||
|
|
||||||
|
|
||||||
def test_matcher_any_token_operator(en_vocab):
|
def test_matcher_any_token_operator(en_vocab):
|
||||||
"""Test that patterns with "any token" {} work with operators."""
|
"""Test that patterns with "any token" {} work with operators."""
|
||||||
matcher = Matcher(en_vocab)
|
matcher = Matcher(en_vocab)
|
||||||
matcher.add('TEST', None, [{'ORTH': 'test'}, {'OP': '*'}])
|
matcher.add('TEST', None, [{'ORTH': 'test'}, {'OP': '*'}])
|
||||||
doc = get_doc(en_vocab, ['test', 'hello', 'world'])
|
doc = Doc(en_vocab, words=['test', 'hello', 'world'])
|
||||||
matches = [doc[start:end].text for _, start, end in matcher(doc)]
|
matches = [doc[start:end].text for _, start, end in matcher(doc)]
|
||||||
assert len(matches) == 3
|
assert len(matches) == 3
|
||||||
assert matches[0] == 'test'
|
assert matches[0] == 'test'
|
116
spacy/tests/matcher/test_matcher_logic.py
Normal file
|
@ -0,0 +1,116 @@
|
||||||
|
# coding: utf-8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
import re
|
||||||
|
from spacy.matcher import Matcher
|
||||||
|
from spacy.tokens import Doc
|
||||||
|
|
||||||
|
|
||||||
|
pattern1 = [{'ORTH':'A', 'OP':'1'}, {'ORTH':'A', 'OP':'*'}]
|
||||||
|
pattern2 = [{'ORTH':'A', 'OP':'*'}, {'ORTH':'A', 'OP':'1'}]
|
||||||
|
pattern3 = [{'ORTH':'A', 'OP':'1'}, {'ORTH':'A', 'OP':'1'}]
|
||||||
|
pattern4 = [{'ORTH':'B', 'OP':'1'}, {'ORTH':'A', 'OP':'*'}, {'ORTH':'B', 'OP':'1'}]
|
||||||
|
pattern5 = [{'ORTH':'B', 'OP':'*'}, {'ORTH':'A', 'OP':'*'}, {'ORTH':'B', 'OP':'1'}]
|
||||||
|
|
||||||
|
re_pattern1 = 'AA*'
|
||||||
|
re_pattern2 = 'A*A'
|
||||||
|
re_pattern3 = 'AA'
|
||||||
|
re_pattern4 = 'BA*B'
|
||||||
|
re_pattern5 = 'B*A*B'
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture
|
||||||
|
def text():
|
||||||
|
return "(ABBAAAAAB)."
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.fixture
|
||||||
|
def doc(en_tokenizer, text):
|
||||||
|
doc = en_tokenizer(' '.join(text))
|
||||||
|
return doc
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.xfail
|
||||||
|
@pytest.mark.parametrize('pattern,re_pattern', [
|
||||||
|
(pattern1, re_pattern1),
|
||||||
|
(pattern2, re_pattern2),
|
||||||
|
(pattern3, re_pattern3),
|
||||||
|
(pattern4, re_pattern4),
|
||||||
|
(pattern5, re_pattern5)])
|
||||||
|
def test_greedy_matching(doc, text, pattern, re_pattern):
|
||||||
|
"""Test that the greedy matching behavior of the * op is consistent with
|
||||||
|
other re implementations."""
|
||||||
|
matcher = Matcher(doc.vocab)
|
||||||
|
matcher.add(re_pattern, None, pattern)
|
||||||
|
matches = matcher(doc)
|
||||||
|
re_matches = [m.span() for m in re.finditer(re_pattern, text)]
|
||||||
|
for match, re_match in zip(matches, re_matches):
|
||||||
|
assert match[1:] == re_match
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.xfail
|
||||||
|
@pytest.mark.parametrize('pattern,re_pattern', [
|
||||||
|
(pattern1, re_pattern1),
|
||||||
|
(pattern2, re_pattern2),
|
||||||
|
(pattern3, re_pattern3),
|
||||||
|
(pattern4, re_pattern4),
|
||||||
|
(pattern5, re_pattern5)])
|
||||||
|
def test_match_consuming(doc, text, pattern, re_pattern):
|
||||||
|
"""Test that matcher.__call__ consumes tokens on a match similar to
|
||||||
|
re.findall."""
|
||||||
|
matcher = Matcher(doc.vocab)
|
||||||
|
matcher.add(re_pattern, None, pattern)
|
||||||
|
matches = matcher(doc)
|
||||||
|
re_matches = [m.span() for m in re.finditer(re_pattern, text)]
|
||||||
|
assert len(matches) == len(re_matches)
|
||||||
|
|
||||||
|
|
||||||
|
def test_operator_combos(en_vocab):
|
||||||
|
cases = [
|
||||||
|
('aaab', 'a a a b', True),
|
||||||
|
('aaab', 'a+ b', True),
|
||||||
|
('aaab', 'a+ a+ b', True),
|
||||||
|
('aaab', 'a+ a+ a b', True),
|
||||||
|
('aaab', 'a+ a+ a+ b', True),
|
||||||
|
('aaab', 'a+ a a b', True),
|
||||||
|
('aaab', 'a+ a a', True),
|
||||||
|
('aaab', 'a+', True),
|
||||||
|
('aaa', 'a+ b', False),
|
||||||
|
('aaa', 'a+ a+ b', False),
|
||||||
|
('aaa', 'a+ a+ a+ b', False),
|
||||||
|
('aaa', 'a+ a b', False),
|
||||||
|
('aaa', 'a+ a a b', False),
|
||||||
|
('aaab', 'a+ a a', True),
|
||||||
|
('aaab', 'a+', True),
|
||||||
|
('aaab', 'a+ a b', True)
|
||||||
|
]
|
||||||
|
for string, pattern_str, result in cases:
|
||||||
|
matcher = Matcher(en_vocab)
|
||||||
|
doc = Doc(matcher.vocab, words=list(string))
|
||||||
|
pattern = []
|
||||||
|
for part in pattern_str.split():
|
||||||
|
if part.endswith('+'):
|
||||||
|
pattern.append({'ORTH': part[0], 'OP': '+'})
|
||||||
|
else:
|
||||||
|
pattern.append({'ORTH': part})
|
||||||
|
matcher.add('PATTERN', None, pattern)
|
||||||
|
matches = matcher(doc)
|
||||||
|
if result:
|
||||||
|
assert matches, (string, pattern_str)
|
||||||
|
else:
|
||||||
|
assert not matches, (string, pattern_str)
|
||||||
|
|
||||||
|
|
||||||
|
def test_matcher_end_zero_plus(en_vocab):
|
||||||
|
"""Test matcher works when patterns end with * operator. (issue 1450)"""
|
||||||
|
matcher = Matcher(en_vocab)
|
||||||
|
pattern = [{'ORTH': "a"}, {'ORTH': "b", 'OP': "*"}]
|
||||||
|
matcher.add('TSTEND', None, pattern)
|
||||||
|
nlp = lambda string: Doc(matcher.vocab, words=string.split())
|
||||||
|
assert len(matcher(nlp('a'))) == 1
|
||||||
|
assert len(matcher(nlp('a b'))) == 2
|
||||||
|
assert len(matcher(nlp('a c'))) == 1
|
||||||
|
assert len(matcher(nlp('a b c'))) == 2
|
||||||
|
assert len(matcher(nlp('a b b c'))) == 3
|
||||||
|
assert len(matcher(nlp('a b b'))) == 3
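
The two `xfail` tests in `test_matcher_logic.py` compare `Matcher` results against Python's `re` module. As a minimal sketch using only the standard library, the regex side of that comparison could be computed as below. `reference_spans` is a hypothetical helper name, not part of spaCy or this test suite. Reading the character spans as token spans assumes the tokenizer yields one token per character of `' '.join(text)`, which the diff does not state.

```python
import re

# Text and regex patterns copied from test_matcher_logic.py.
TEXT = "(ABBAAAAAB)."
RE_PATTERNS = ["AA*", "A*A", "AA", "BA*B", "B*A*B"]


def reference_spans(pattern, text=TEXT):
    """Return the (start, end) span of each non-overlapping regex match.

    Hypothetical helper for illustration only; it mirrors the
    `re.finditer` call used inside the tests.
    """
    return [m.span() for m in re.finditer(pattern, text)]


if __name__ == "__main__":
    # Print the spans that the Matcher output is compared against.
    for pattern in RE_PATTERNS:
        print(pattern, reference_spans(pattern))
```

Because `re.finditer` consumes the characters of each match before searching again, the spans never overlap. That is the behavior `test_match_consuming` checks `Matcher.__call__` against.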
|
30
spacy/tests/matcher/test_phrase_matcher.py
Normal file
|
@ -0,0 +1,30 @@
|
||||||
|
# coding: utf-8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
from spacy.matcher import PhraseMatcher
|
||||||
|
from spacy.tokens import Doc
|
||||||
|
|
||||||
|
|
||||||
|
def test_matcher_phrase_matcher(en_vocab):
|
||||||
|
doc = Doc(en_vocab, words=["Google", "Now"])
|
||||||
|
matcher = PhraseMatcher(en_vocab)
|
||||||
|
matcher.add('COMPANY', None, doc)
|
||||||
|
doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"])
|
||||||
|
assert len(matcher(doc)) == 1
|
||||||
|
|
||||||
|
|
||||||
|
def test_phrase_matcher_length(en_vocab):
|
||||||
|
matcher = PhraseMatcher(en_vocab)
|
||||||
|
assert len(matcher) == 0
|
||||||
|
matcher.add('TEST', None, Doc(en_vocab, words=['test']))
|
||||||
|
assert len(matcher) == 1
|
||||||
|
matcher.add('TEST2', None, Doc(en_vocab, words=['test2']))
|
||||||
|
assert len(matcher) == 2
|
||||||
|
|
||||||
|
|
||||||
|
def test_phrase_matcher_contains(en_vocab):
|
||||||
|
matcher = PhraseMatcher(en_vocab)
|
||||||
|
matcher.add('TEST', None, Doc(en_vocab, words=['test']))
|
||||||
|
assert 'TEST' in matcher
|
||||||
|
assert 'TEST2' not in matcher
|
|
@ -1,15 +1,16 @@
|
||||||
'''Test the ability to add a label to a (potentially trained) parsing model.'''
|
# coding: utf8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
import numpy.random
|
import numpy.random
|
||||||
from thinc.neural.optimizers import Adam
|
from thinc.neural.optimizers import Adam
|
||||||
from thinc.neural.ops import NumpyOps
|
from thinc.neural.ops import NumpyOps
|
||||||
|
from spacy.attrs import NORM
|
||||||
|
from spacy.gold import GoldParse
|
||||||
|
from spacy.vocab import Vocab
|
||||||
|
from spacy.tokens import Doc
|
||||||
|
from spacy.pipeline import DependencyParser
|
||||||
|
|
||||||
from ...attrs import NORM
|
|
||||||
from ...gold import GoldParse
|
|
||||||
from ...vocab import Vocab
|
|
||||||
from ...tokens import Doc
|
|
||||||
from ...pipeline import DependencyParser
|
|
||||||
|
|
||||||
numpy.random.seed(0)
|
numpy.random.seed(0)
|
||||||
|
|
||||||
|
@ -37,9 +38,11 @@ def parser(vocab):
|
||||||
parser.update([doc], [gold], sgd=sgd, losses=losses)
|
parser.update([doc], [gold], sgd=sgd, losses=losses)
|
||||||
return parser
|
return parser
|
||||||
|
|
||||||
|
|
||||||
def test_init_parser(parser):
|
def test_init_parser(parser):
|
||||||
pass
|
pass
|
||||||
|
|
||||||
|
|
||||||
# TODO: This is flakey, because it depends on what the parser first learns.
|
# TODO: This is flakey, because it depends on what the parser first learns.
|
||||||
@pytest.mark.xfail
|
@pytest.mark.xfail
|
||||||
def test_add_label(parser):
|
def test_add_label(parser):
|
||||||
|
@ -69,4 +72,3 @@ def test_add_label(parser):
|
||||||
doc = parser(doc)
|
doc = parser(doc)
|
||||||
assert doc[0].dep_ == 'right'
|
assert doc[0].dep_ == 'right'
|
||||||
assert doc[2].dep_ == 'left'
|
assert doc[2].dep_ == 'left'
|
||||||
|
|
||||||
|
|
|
@ -1,13 +1,14 @@
|
||||||
|
# coding: utf8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
import pytest
|
|
||||||
|
|
||||||
from ...vocab import Vocab
|
import pytest
|
||||||
from ...pipeline import DependencyParser
|
from spacy.vocab import Vocab
|
||||||
from ...tokens import Doc
|
from spacy.pipeline import DependencyParser
|
||||||
from ...gold import GoldParse
|
from spacy.tokens import Doc
|
||||||
from ...syntax.nonproj import projectivize
|
from spacy.gold import GoldParse
|
||||||
from ...syntax.stateclass import StateClass
|
from spacy.syntax.nonproj import projectivize
|
||||||
from ...syntax.arc_eager import ArcEager
|
from spacy.syntax.stateclass import StateClass
|
||||||
|
from spacy.syntax.arc_eager import ArcEager
|
||||||
|
|
||||||
|
|
||||||
def get_sequence_costs(M, words, heads, deps, transitions):
|
def get_sequence_costs(M, words, heads, deps, transitions):
|
||||||
|
|
|
@ -1,23 +0,0 @@
|
||||||
# coding: utf8
|
|
||||||
from __future__ import unicode_literals
|
|
||||||
|
|
||||||
import pytest
|
|
||||||
from ...language import Language
|
|
||||||
from ...pipeline import DependencyParser
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.models('en')
|
|
||||||
def test_beam_parse_en(EN):
|
|
||||||
doc = EN(u'Australia is a country', disable=['ner'])
|
|
||||||
ents = EN.entity(doc, beam_width=2)
|
|
||||||
print(ents)
|
|
||||||
|
|
||||||
|
|
||||||
def test_beam_parse():
|
|
||||||
nlp = Language()
|
|
||||||
nlp.add_pipe(DependencyParser(nlp.vocab), name='parser')
|
|
||||||
nlp.parser.add_label('nsubj')
|
|
||||||
nlp.parser.begin_training([], token_vector_width=8, hidden_width=8)
|
|
||||||
|
|
||||||
doc = nlp.make_doc(u'Australia is a country')
|
|
||||||
nlp.parser(doc, beam_width=2)
|
|
|
@ -1,11 +1,12 @@
|
||||||
|
# coding: utf-8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
from spacy.pipeline import EntityRecognizer
|
||||||
from ...vocab import Vocab
|
from spacy.vocab import Vocab
|
||||||
from ...syntax.ner import BiluoPushDown
|
from spacy.syntax.ner import BiluoPushDown
|
||||||
from ...gold import GoldParse
|
from spacy.gold import GoldParse
|
||||||
from ...tokens import Doc
|
from spacy.tokens import Doc
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
|
@ -71,3 +72,16 @@ def test_get_oracle_moves_negative_O(tsys, vocab):
|
||||||
tsys.preprocess_gold(gold)
|
tsys.preprocess_gold(gold)
|
||||||
act_classes = tsys.get_oracle_sequence(doc, gold)
|
act_classes = tsys.get_oracle_sequence(doc, gold)
|
||||||
names = [tsys.get_class_name(act) for act in act_classes]
|
names = [tsys.get_class_name(act) for act in act_classes]
|
||||||
|
|
||||||
|
|
||||||
|
def test_doc_add_entities_set_ents_iob(en_vocab):
|
||||||
|
doc = Doc(en_vocab, words=["This", "is", "a", "lion"])
|
||||||
|
ner = EntityRecognizer(en_vocab)
|
||||||
|
ner.begin_training([])
|
||||||
|
ner(doc)
|
||||||
|
assert len(list(doc.ents)) == 0
|
||||||
|
assert [w.ent_iob_ for w in doc] == (['O'] * len(doc))
|
||||||
|
doc.ents = [(doc.vocab.strings['ANIMAL'], 3, 4)]
|
||||||
|
assert [w.ent_iob_ for w in doc] == ['', '', '', 'B']
|
||||||
|
doc.ents = [(doc.vocab.strings['WORD'], 0, 2)]
|
||||||
|
assert [w.ent_iob_ for w in doc] == ['B', 'I', '', '']
|
||||||
|
|
|
@ -1,16 +1,13 @@
|
||||||
# coding: utf8
|
# coding: utf8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
from thinc.neural import Model
|
|
||||||
import pytest
|
|
||||||
import numpy
|
|
||||||
|
|
||||||
from ..._ml import chain, Tok2Vec, doc2feats
|
import pytest
|
||||||
from ...vocab import Vocab
|
from spacy._ml import Tok2Vec
|
||||||
from ...pipeline import Tensorizer
|
from spacy.vocab import Vocab
|
||||||
from ...syntax.arc_eager import ArcEager
|
from spacy.syntax.arc_eager import ArcEager
|
||||||
from ...syntax.nn_parser import Parser
|
from spacy.syntax.nn_parser import Parser
|
||||||
from ...tokens.doc import Doc
|
from spacy.tokens.doc import Doc
|
||||||
from ...gold import GoldParse
|
from spacy.gold import GoldParse
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
|
@ -37,10 +34,12 @@ def parser(vocab, arc_eager):
|
||||||
def model(arc_eager, tok2vec):
|
def model(arc_eager, tok2vec):
|
||||||
return Parser.Model(arc_eager.n_moves, token_vector_width=tok2vec.nO)[0]
|
return Parser.Model(arc_eager.n_moves, token_vector_width=tok2vec.nO)[0]
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
def doc(vocab):
|
def doc(vocab):
|
||||||
return Doc(vocab, words=['a', 'b', 'c'])
|
return Doc(vocab, words=['a', 'b', 'c'])
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
def gold(doc):
|
def gold(doc):
|
||||||
return GoldParse(doc, heads=[1, 1, 1], deps=['L', 'ROOT', 'R'])
|
return GoldParse(doc, heads=[1, 1, 1], deps=['L', 'ROOT', 'R'])
|
||||||
|
@ -80,5 +79,3 @@ def test_update_doc_beam(parser, model, doc, gold):
|
||||||
def optimize(weights, gradient, key=None):
|
def optimize(weights, gradient, key=None):
|
||||||
weights -= 0.001 * gradient
|
weights -= 0.001 * gradient
|
||||||
parser.update_beam([doc], [gold], sgd=optimize)
|
parser.update_beam([doc], [gold], sgd=optimize)
|
||||||
|
|
||||||
|
|
||||||
|
|
|
@ -1,20 +1,23 @@
|
||||||
|
# coding: utf8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
import numpy
|
import numpy
|
||||||
from thinc.api import layerize
|
from spacy.vocab import Vocab
|
||||||
|
from spacy.language import Language
|
||||||
from ...vocab import Vocab
|
from spacy.pipeline import DependencyParser
|
||||||
from ...syntax.arc_eager import ArcEager
|
from spacy.syntax.arc_eager import ArcEager
|
||||||
from ...tokens import Doc
|
from spacy.tokens import Doc
|
||||||
from ...gold import GoldParse
|
from spacy.syntax._beam_utils import ParserBeam
|
||||||
from ...syntax._beam_utils import ParserBeam, update_beam
|
from spacy.syntax.stateclass import StateClass
|
||||||
from ...syntax.stateclass import StateClass
|
from spacy.gold import GoldParse
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
def vocab():
|
def vocab():
|
||||||
return Vocab()
|
return Vocab()
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
def moves(vocab):
|
def moves(vocab):
|
||||||
aeager = ArcEager(vocab.strings, {})
|
aeager = ArcEager(vocab.strings, {})
|
||||||
|
@ -65,6 +68,7 @@ def vector_size():
|
||||||
def beam(moves, states, golds, beam_width):
|
def beam(moves, states, golds, beam_width):
|
||||||
return ParserBeam(moves, states, golds, width=beam_width, density=0.0)
|
return ParserBeam(moves, states, golds, width=beam_width, density=0.0)
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
def scores(moves, batch_size, beam_width):
|
def scores(moves, batch_size, beam_width):
|
||||||
return [
|
return [
|
||||||
|
@ -85,3 +89,12 @@ def test_beam_advance(beam, scores):
|
||||||
def test_beam_advance_too_few_scores(beam, scores):
|
def test_beam_advance_too_few_scores(beam, scores):
|
||||||
with pytest.raises(IndexError):
|
with pytest.raises(IndexError):
|
||||||
beam.advance(scores[:-1])
|
beam.advance(scores[:-1])
|
||||||
|
|
||||||
|
|
||||||
|
def test_beam_parse():
|
||||||
|
nlp = Language()
|
||||||
|
nlp.add_pipe(DependencyParser(nlp.vocab), name='parser')
|
||||||
|
nlp.parser.add_label('nsubj')
|
||||||
|
nlp.parser.begin_training([], token_vector_width=8, hidden_width=8)
|
||||||
|
doc = nlp.make_doc('Australia is a country')
|
||||||
|
nlp.parser(doc, beam_width=2)
|
||||||
|
|
|
@ -1,35 +1,39 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
from ...syntax.nonproj import ancestors, contains_cycle, is_nonproj_arc
|
|
||||||
from ...syntax.nonproj import is_nonproj_tree
|
|
||||||
from ...syntax import nonproj
|
|
||||||
from ...attrs import DEP, HEAD
|
|
||||||
from ..util import get_doc
|
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
from spacy.syntax.nonproj import ancestors, contains_cycle, is_nonproj_arc
|
||||||
|
from spacy.syntax.nonproj import is_nonproj_tree
|
||||||
|
from spacy.syntax import nonproj
|
||||||
|
|
||||||
|
from ..util import get_doc
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
def tree():
|
def tree():
|
||||||
return [1, 2, 2, 4, 5, 2, 2]
|
return [1, 2, 2, 4, 5, 2, 2]
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
def cyclic_tree():
|
def cyclic_tree():
|
||||||
return [1, 2, 2, 4, 5, 3, 2]
|
return [1, 2, 2, 4, 5, 3, 2]
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
def partial_tree():
|
def partial_tree():
|
||||||
return [1, 2, 2, 4, 5, None, 7, 4, 2]
|
return [1, 2, 2, 4, 5, None, 7, 4, 2]
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
def nonproj_tree():
|
def nonproj_tree():
|
||||||
return [1, 2, 2, 4, 5, 2, 7, 4, 2]
|
return [1, 2, 2, 4, 5, 2, 7, 4, 2]
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
def proj_tree():
|
def proj_tree():
|
||||||
return [1, 2, 2, 4, 5, 2, 7, 5, 2]
|
return [1, 2, 2, 4, 5, 2, 7, 5, 2]
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
def multirooted_tree():
|
def multirooted_tree():
|
||||||
return [3, 2, 0, 3, 3, 7, 7, 3, 7, 10, 7, 10, 11, 12, 18, 16, 18, 17, 12, 3]
|
return [3, 2, 0, 3, 3, 7, 7, 3, 7, 10, 7, 10, 11, 12, 18, 16, 18, 17, 12, 3]
|
||||||
|
@ -75,14 +79,14 @@ def test_parser_pseudoprojectivity(en_tokenizer):
|
||||||
def deprojectivize(proj_heads, deco_labels):
|
def deprojectivize(proj_heads, deco_labels):
|
||||||
tokens = en_tokenizer('whatever ' * len(proj_heads))
|
tokens = en_tokenizer('whatever ' * len(proj_heads))
|
||||||
rel_proj_heads = [head-i for i, head in enumerate(proj_heads)]
|
rel_proj_heads = [head-i for i, head in enumerate(proj_heads)]
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], deps=deco_labels, heads=rel_proj_heads)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens],
|
||||||
|
deps=deco_labels, heads=rel_proj_heads)
|
||||||
nonproj.deprojectivize(doc)
|
nonproj.deprojectivize(doc)
|
||||||
return [t.head.i for t in doc], [token.dep_ for token in doc]
|
return [t.head.i for t in doc], [token.dep_ for token in doc]
|
||||||
|
|
||||||
tree = [1, 2, 2]
|
tree = [1, 2, 2]
|
||||||
nonproj_tree = [1, 2, 2, 4, 5, 2, 7, 4, 2]
|
nonproj_tree = [1, 2, 2, 4, 5, 2, 7, 4, 2]
|
||||||
nonproj_tree2 = [9, 1, 3, 1, 5, 6, 9, 8, 6, 1, 6, 12, 13, 10, 1]
|
nonproj_tree2 = [9, 1, 3, 1, 5, 6, 9, 8, 6, 1, 6, 12, 13, 10, 1]
|
||||||
|
|
||||||
labels = ['det', 'nsubj', 'root', 'det', 'dobj', 'aux', 'nsubj', 'acl', 'punct']
|
labels = ['det', 'nsubj', 'root', 'det', 'dobj', 'aux', 'nsubj', 'acl', 'punct']
|
||||||
labels2 = ['advmod', 'root', 'det', 'nsubj', 'advmod', 'det', 'dobj', 'det', 'nmod', 'aux', 'nmod', 'advmod', 'det', 'amod', 'punct']
|
labels2 = ['advmod', 'root', 'det', 'nsubj', 'advmod', 'det', 'dobj', 'det', 'nmod', 'aux', 'nmod', 'advmod', 'det', 'amod', 'punct']
|
||||||
|
|
||||||
|
|
|
@ -1,17 +1,17 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
from ..util import get_doc, apply_transition_sequence
|
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
|
from ..util import get_doc, apply_transition_sequence
|
||||||
|
|
||||||
|
|
||||||
def test_parser_root(en_tokenizer):
|
def test_parser_root(en_tokenizer):
|
||||||
text = "i don't have other assistance"
|
text = "i don't have other assistance"
|
||||||
heads = [3, 2, 1, 0, 1, -2]
|
heads = [3, 2, 1, 0, 1, -2]
|
||||||
deps = ['nsubj', 'aux', 'neg', 'ROOT', 'amod', 'dobj']
|
deps = ['nsubj', 'aux', 'neg', 'ROOT', 'amod', 'dobj']
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
|
||||||
for t in doc:
|
for t in doc:
|
||||||
assert t.dep != 0, t.text
|
assert t.dep != 0, t.text
|
||||||
|
|
||||||
|
@ -20,7 +20,7 @@ def test_parser_root(en_tokenizer):
|
||||||
@pytest.mark.parametrize('text', ["Hello"])
|
@pytest.mark.parametrize('text', ["Hello"])
|
||||||
def test_parser_parse_one_word_sentence(en_tokenizer, en_parser, text):
|
def test_parser_parse_one_word_sentence(en_tokenizer, en_parser, text):
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=[0], deps=['ROOT'])
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=[0], deps=['ROOT'])
|
||||||
|
|
||||||
assert len(doc) == 1
|
assert len(doc) == 1
|
||||||
with en_parser.step_through(doc) as _:
|
with en_parser.step_through(doc) as _:
|
||||||
|
@ -33,10 +33,8 @@ def test_parser_initial(en_tokenizer, en_parser):
|
||||||
text = "I ate the pizza with anchovies."
|
text = "I ate the pizza with anchovies."
|
||||||
heads = [1, 0, 1, -2, -3, -1, -5]
|
heads = [1, 0, 1, -2, -3, -1, -5]
|
||||||
transition = ['L-nsubj', 'S', 'L-det']
|
transition = ['L-nsubj', 'S', 'L-det']
|
||||||
|
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
apply_transition_sequence(en_parser, tokens, transition)
|
apply_transition_sequence(en_parser, tokens, transition)
|
||||||
|
|
||||||
assert tokens[0].head.i == 1
|
assert tokens[0].head.i == 1
|
||||||
assert tokens[1].head.i == 1
|
assert tokens[1].head.i == 1
|
||||||
assert tokens[2].head.i == 3
|
assert tokens[2].head.i == 3
|
||||||
|
@ -47,8 +45,7 @@ def test_parser_parse_subtrees(en_tokenizer, en_parser):
|
||||||
text = "The four wheels on the bus turned quickly"
|
text = "The four wheels on the bus turned quickly"
|
||||||
heads = [2, 1, 4, -1, 1, -2, 0, -1]
|
heads = [2, 1, 4, -1, 1, -2, 0, -1]
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||||
|
|
||||||
assert len(list(doc[2].lefts)) == 2
|
assert len(list(doc[2].lefts)) == 2
|
||||||
assert len(list(doc[2].rights)) == 1
|
assert len(list(doc[2].rights)) == 1
|
||||||
assert len(list(doc[2].children)) == 3
|
assert len(list(doc[2].children)) == 3
|
||||||
|
@ -63,11 +60,9 @@ def test_parser_merge_pp(en_tokenizer):
|
||||||
heads = [1, 4, -1, 1, -2, 0]
|
heads = [1, 4, -1, 1, -2, 0]
|
||||||
deps = ['det', 'nsubj', 'prep', 'det', 'pobj', 'ROOT']
|
deps = ['det', 'nsubj', 'prep', 'det', 'pobj', 'ROOT']
|
||||||
tags = ['DT', 'NN', 'IN', 'DT', 'NN', 'VBZ']
|
tags = ['DT', 'NN', 'IN', 'DT', 'NN', 'VBZ']
|
||||||
|
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], deps=deps, heads=heads, tags=tags)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], deps=deps, heads=heads, tags=tags)
|
||||||
nps = [(np[0].idx, np[-1].idx + len(np[-1]), np.lemma_) for np in doc.noun_chunks]
|
nps = [(np[0].idx, np[-1].idx + len(np[-1]), np.lemma_) for np in doc.noun_chunks]
|
||||||
|
|
||||||
for start, end, lemma in nps:
|
for start, end, lemma in nps:
|
||||||
doc.merge(start, end, label='NP', lemma=lemma)
|
doc.merge(start, end, label='NP', lemma=lemma)
|
||||||
assert doc[0].text == 'A phrase'
|
assert doc[0].text == 'A phrase'
|
||||||
|
|
|
@ -1,14 +1,14 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
from ..util import get_doc
|
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
|
from ..util import get_doc
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
def text():
|
def text():
|
||||||
return u"""
|
return """
|
||||||
It was a bright cold day in April, and the clocks were striking thirteen.
|
It was a bright cold day in April, and the clocks were striking thirteen.
|
||||||
Winston Smith, his chin nuzzled into his breast in an effort to escape the
|
Winston Smith, his chin nuzzled into his breast in an effort to escape the
|
||||||
vile wind, slipped quickly through the glass doors of Victory Mansions,
|
vile wind, slipped quickly through the glass doors of Victory Mansions,
|
||||||
|
@ -54,7 +54,7 @@ def heads():
|
||||||
|
|
||||||
def test_parser_parse_navigate_consistency(en_tokenizer, text, heads):
|
def test_parser_parse_navigate_consistency(en_tokenizer, text, heads):
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||||
for head in doc:
|
for head in doc:
|
||||||
for child in head.lefts:
|
for child in head.lefts:
|
||||||
assert child.head == head
|
assert child.head == head
|
||||||
|
@ -64,7 +64,7 @@ def test_parser_parse_navigate_consistency(en_tokenizer, text, heads):
|
||||||
|
|
||||||
def test_parser_parse_navigate_child_consistency(en_tokenizer, text, heads):
|
def test_parser_parse_navigate_child_consistency(en_tokenizer, text, heads):
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||||
|
|
||||||
lefts = {}
|
lefts = {}
|
||||||
rights = {}
|
rights = {}
|
||||||
|
@ -97,7 +97,7 @@ def test_parser_parse_navigate_child_consistency(en_tokenizer, text, heads):
|
||||||
|
|
||||||
def test_parser_parse_navigate_edges(en_tokenizer, text, heads):
|
def test_parser_parse_navigate_edges(en_tokenizer, text, heads):
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||||
for token in doc:
|
for token in doc:
|
||||||
subtree = list(token.subtree)
|
subtree = list(token.subtree)
|
||||||
debug = '\t'.join((token.text, token.left_edge.text, subtree[0].text))
|
debug = '\t'.join((token.text, token.left_edge.text, subtree[0].text))
|
||||||
|
|
|
@ -1,19 +1,21 @@
|
||||||
'''Test that the parser respects preset sentence boundaries.'''
|
# coding: utf8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
from thinc.neural.optimizers import Adam
|
from thinc.neural.optimizers import Adam
|
||||||
from thinc.neural.ops import NumpyOps
|
from thinc.neural.ops import NumpyOps
|
||||||
|
from spacy.attrs import NORM
|
||||||
|
from spacy.gold import GoldParse
|
||||||
|
from spacy.vocab import Vocab
|
||||||
|
from spacy.tokens import Doc
|
||||||
|
from spacy.pipeline import DependencyParser
|
||||||
|
|
||||||
from ...attrs import NORM
|
|
||||||
from ...gold import GoldParse
|
|
||||||
from ...vocab import Vocab
|
|
||||||
from ...tokens import Doc
|
|
||||||
from ...pipeline import DependencyParser
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
def vocab():
|
def vocab():
|
||||||
return Vocab(lex_attr_getters={NORM: lambda s: s})
|
return Vocab(lex_attr_getters={NORM: lambda s: s})
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
def parser(vocab):
|
def parser(vocab):
|
||||||
parser = DependencyParser(vocab)
|
parser = DependencyParser(vocab)
|
||||||
|
@ -32,6 +34,7 @@ def parser(vocab):
|
||||||
parser.update([doc], [gold], sgd=sgd, losses=losses)
|
parser.update([doc], [gold], sgd=sgd, losses=losses)
|
||||||
return parser
|
return parser
|
||||||
|
|
||||||
|
|
||||||
def test_no_sentences(parser):
|
def test_no_sentences(parser):
|
||||||
doc = Doc(parser.vocab, words=['a', 'b', 'c', 'd'])
|
doc = Doc(parser.vocab, words=['a', 'b', 'c', 'd'])
|
||||||
doc = parser(doc)
|
doc = parser(doc)
|
||||||
|
|
|
@ -1,19 +1,18 @@
|
||||||
# coding: utf-8
|
# coding: utf-8
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
from ...tokens.doc import Doc
|
|
||||||
from ...attrs import HEAD
|
|
||||||
from ..util import get_doc, apply_transition_sequence
|
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
|
||||||
|
from spacy.tokens.doc import Doc
|
||||||
|
|
||||||
|
from ..util import get_doc, apply_transition_sequence
|
||||||
|
|
||||||
|
|
||||||
def test_parser_space_attachment(en_tokenizer):
|
def test_parser_space_attachment(en_tokenizer):
|
||||||
text = "This is a test.\nTo ensure spaces are attached well."
|
text = "This is a test.\nTo ensure spaces are attached well."
|
||||||
heads = [1, 0, 1, -2, -3, -1, 1, 4, -1, 2, 1, 0, -1, -2]
|
heads = [1, 0, 1, -2, -3, -1, 1, 4, -1, 2, 1, 0, -1, -2]
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||||
|
|
||||||
for sent in doc.sents:
|
for sent in doc.sents:
|
||||||
if len(sent) == 1:
|
if len(sent) == 1:
|
||||||
assert not sent[-1].is_space
|
assert not sent[-1].is_space
|
||||||
|
@ -26,7 +25,7 @@ def test_parser_sentence_space(en_tokenizer):
|
||||||
'nsubjpass', 'aux', 'auxpass', 'ROOT', 'nsubj', 'aux', 'ccomp',
|
'nsubjpass', 'aux', 'auxpass', 'ROOT', 'nsubj', 'aux', 'ccomp',
|
||||||
'poss', 'nsubj', 'ccomp', 'punct']
|
'poss', 'nsubj', 'ccomp', 'punct']
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
|
||||||
assert len(list(doc.sents)) == 2
|
assert len(list(doc.sents)) == 2
|
||||||
|
|
||||||
|
|
||||||
|
@ -35,7 +34,7 @@ def test_parser_space_attachment_leading(en_tokenizer, en_parser):
|
||||||
text = "\t \n This is a sentence ."
|
text = "\t \n This is a sentence ."
|
||||||
heads = [1, 1, 0, 1, -2, -3]
|
heads = [1, 1, 0, 1, -2, -3]
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, text.split(' '), heads=heads)
|
doc = get_doc(tokens.vocab, words=text.split(' '), heads=heads)
|
||||||
assert doc[0].is_space
|
assert doc[0].is_space
|
||||||
assert doc[1].is_space
|
assert doc[1].is_space
|
||||||
assert doc[2].text == 'This'
|
assert doc[2].text == 'This'
|
||||||
|
@ -52,7 +51,7 @@ def test_parser_space_attachment_intermediate_trailing(en_tokenizer, en_parser):
|
||||||
heads = [1, 0, -1, 2, -1, -4, -5, -1]
|
heads = [1, 0, -1, 2, -1, -4, -5, -1]
|
||||||
transition = ['L-nsubj', 'S', 'L-det', 'R-attr', 'D', 'R-punct']
|
transition = ['L-nsubj', 'S', 'L-det', 'R-attr', 'D', 'R-punct']
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, text.split(' '), heads=heads)
|
doc = get_doc(tokens.vocab, words=text.split(' '), heads=heads)
|
||||||
assert doc[2].is_space
|
assert doc[2].is_space
|
||||||
assert doc[4].is_space
|
assert doc[4].is_space
|
||||||
assert doc[5].is_space
|
assert doc[5].is_space
|
||||||
|
|
|
@ -1,28 +0,0 @@
|
||||||
import pytest
|
|
||||||
|
|
||||||
from ...pipeline import DependencyParser
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
|
||||||
def parser(en_vocab):
|
|
||||||
parser = DependencyParser(en_vocab)
|
|
||||||
parser.add_label('nsubj')
|
|
||||||
parser.model, cfg = parser.Model(parser.moves.n_moves)
|
|
||||||
parser.cfg.update(cfg)
|
|
||||||
return parser
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
|
||||||
def blank_parser(en_vocab):
|
|
||||||
parser = DependencyParser(en_vocab)
|
|
||||||
return parser
|
|
||||||
|
|
||||||
|
|
||||||
def test_to_from_bytes(parser, blank_parser):
|
|
||||||
assert parser.model is not True
|
|
||||||
assert blank_parser.model is True
|
|
||||||
assert blank_parser.moves.n_moves != parser.moves.n_moves
|
|
||||||
bytes_data = parser.to_bytes()
|
|
||||||
blank_parser.from_bytes(bytes_data)
|
|
||||||
assert blank_parser.model is not True
|
|
||||||
assert blank_parser.moves.n_moves == parser.moves.n_moves
|
|
|
@ -2,10 +2,9 @@
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
from spacy.tokens import Span
|
||||||
from ...tokens import Span
|
from spacy.language import Language
|
||||||
from ...language import Language
|
from spacy.pipeline import EntityRuler
|
||||||
from ...pipeline import EntityRuler
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
|
|
|
@ -2,11 +2,11 @@
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
from spacy.language import Language
|
||||||
|
from spacy.tokens import Span
|
||||||
|
|
||||||
from ..util import get_doc
|
from ..util import get_doc
|
||||||
from ...language import Language
|
|
||||||
from ...tokens import Span
|
|
||||||
from ... import util
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
def doc(en_tokenizer):
|
def doc(en_tokenizer):
|
||||||
|
@ -16,7 +16,7 @@ def doc(en_tokenizer):
|
||||||
pos = ['PRON', 'VERB', 'PROPN', 'PROPN', 'ADP', 'PROPN', 'PUNCT']
|
pos = ['PRON', 'VERB', 'PROPN', 'PROPN', 'ADP', 'PROPN', 'PUNCT']
|
||||||
deps = ['ROOT', 'prep', 'compound', 'pobj', 'prep', 'pobj', 'punct']
|
deps = ['ROOT', 'prep', 'compound', 'pobj', 'prep', 'pobj', 'punct']
|
||||||
tokens = en_tokenizer(text)
|
tokens = en_tokenizer(text)
|
||||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads,
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads,
|
||||||
tags=tags, pos=pos, deps=deps)
|
tags=tags, pos=pos, deps=deps)
|
||||||
doc.ents = [Span(doc, 2, 4, doc.vocab.strings['GPE'])]
|
doc.ents = [Span(doc, 2, 4, doc.vocab.strings['GPE'])]
|
||||||
doc.is_parsed = True
|
doc.is_parsed = True
|
||||||
|
|
|
@ -2,8 +2,7 @@
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
import pytest
|
import pytest
|
||||||
|
from spacy.language import Language
|
||||||
from ...language import Language
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
@pytest.fixture
|
||||||
|
|
|
@ -1,7 +1,13 @@
|
||||||
# coding: utf8
|
# coding: utf8
|
||||||
|
|
||||||
from __future__ import unicode_literals
|
from __future__ import unicode_literals
|
||||||
from ...language import Language
|
|
||||||
|
import pytest
|
||||||
|
import random
|
||||||
|
import numpy.random
|
||||||
|
from spacy.language import Language
|
||||||
|
from spacy.pipeline import TextCategorizer
|
||||||
|
from spacy.tokens import Doc
|
||||||
|
from spacy.gold import GoldParse
|
||||||
|
|
||||||
|
|
||||||
def test_simple_train():
|
def test_simple_train():
|
||||||
|
@ -13,6 +19,40 @@ def test_simple_train():
|
||||||
for text, answer in [('aaaa', 1.), ('bbbb', 0), ('aa', 1.),
|
for text, answer in [('aaaa', 1.), ('bbbb', 0), ('aa', 1.),
|
||||||
('bbbbbbbbb', 0.), ('aaaaaa', 1)]:
|
('bbbbbbbbb', 0.), ('aaaaaa', 1)]:
|
||||||
nlp.update([text], [{'cats': {'answer': answer}}])
|
nlp.update([text], [{'cats': {'answer': answer}}])
|
||||||
doc = nlp(u'aaa')
|
doc = nlp('aaa')
|
||||||
assert 'answer' in doc.cats
|
assert 'answer' in doc.cats
|
||||||
assert doc.cats['answer'] >= 0.5
|
assert doc.cats['answer'] >= 0.5
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.skip(reason="Test is flakey when run with others")
|
||||||
|
def test_textcat_learns_multilabel():
|
||||||
|
random.seed(5)
|
||||||
|
numpy.random.seed(5)
|
||||||
|
docs = []
|
||||||
|
nlp = Language()
|
||||||
|
letters = ['a', 'b', 'c']
|
||||||
|
for w1 in letters:
|
||||||
|
for w2 in letters:
|
||||||
|
cats = {letter: float(w2==letter) for letter in letters}
|
||||||
|
docs.append((Doc(nlp.vocab, words=['d']*3 + [w1, w2] + ['d']*3), cats))
|
||||||
|
random.shuffle(docs)
|
||||||
|
model = TextCategorizer(nlp.vocab, width=8)
|
||||||
|
for letter in letters:
|
||||||
|
model.add_label(letter)
|
||||||
|
optimizer = model.begin_training()
|
||||||
|
for i in range(30):
|
||||||
|
losses = {}
|
||||||
|
Ys = [GoldParse(doc, cats=cats) for doc, cats in docs]
|
||||||
|
Xs = [doc for doc, cats in docs]
|
||||||
|
model.update(Xs, Ys, sgd=optimizer, losses=losses)
|
||||||
|
random.shuffle(docs)
|
||||||
|
for w1 in letters:
|
||||||
|
for w2 in letters:
|
||||||
|
doc = Doc(nlp.vocab, words=['d']*3 + [w1, w2] + ['d']*3)
|
||||||
|
truth = {letter: w2==letter for letter in letters}
|
||||||
|
model(doc)
|
||||||
|
for cat, score in doc.cats.items():
|
||||||
|
if not truth[cat]:
|
||||||
|
assert score < 0.5
|
||||||
|
else:
|
||||||
|
assert score > 0.5
|
||||||
|
|
420
spacy/tests/regression/test_issue1-1000.py
Normal file
|
@ -0,0 +1,420 @@
|
||||||
|
# coding: utf-8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
import random
|
||||||
|
from spacy.matcher import Matcher
|
||||||
|
from spacy.attrs import IS_PUNCT, ORTH, LOWER
|
||||||
|
from spacy.symbols import POS, VERB, VerbForm_inf
|
||||||
|
from spacy.vocab import Vocab
|
||||||
|
from spacy.language import Language
|
||||||
|
from spacy.lemmatizer import Lemmatizer
|
||||||
|
from spacy.tokens import Doc
|
||||||
|
|
||||||
|
from ..util import get_doc, make_tempdir
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('patterns', [
|
||||||
|
[[{'LOWER': 'celtics'}], [{'LOWER': 'boston'}, {'LOWER': 'celtics'}]],
|
||||||
|
[[{'LOWER': 'boston'}, {'LOWER': 'celtics'}], [{'LOWER': 'celtics'}]]])
|
||||||
|
def test_issue118(en_tokenizer, patterns):
|
||||||
|
"""Test a bug that arose from having overlapping matches"""
|
||||||
|
text = "how many points did lebron james score against the boston celtics last night"
|
||||||
|
doc = en_tokenizer(text)
|
||||||
|
ORG = doc.vocab.strings['ORG']
|
||||||
|
matcher = Matcher(doc.vocab)
|
||||||
|
matcher.add("BostonCeltics", None, *patterns)
|
||||||
|
assert len(list(doc.ents)) == 0
|
||||||
|
matches = [(ORG, start, end) for _, start, end in matcher(doc)]
|
||||||
|
assert matches == [(ORG, 9, 11), (ORG, 10, 11)]
|
||||||
|
doc.ents = matches[:1]
|
||||||
|
ents = list(doc.ents)
|
||||||
|
assert len(ents) == 1
|
||||||
|
assert ents[0].label == ORG
|
||||||
|
assert ents[0].start == 9
|
||||||
|
assert ents[0].end == 11
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('patterns', [
|
||||||
|
[[{'LOWER': 'boston'}], [{'LOWER': 'boston'}, {'LOWER': 'celtics'}]],
|
||||||
|
[[{'LOWER': 'boston'}, {'LOWER': 'celtics'}], [{'LOWER': 'boston'}]]])
|
||||||
|
def test_issue118_prefix_reorder(en_tokenizer, patterns):
|
||||||
|
"""Test a bug that arose from having overlapping matches"""
|
||||||
|
text = "how many points did lebron james score against the boston celtics last night"
|
||||||
|
doc = en_tokenizer(text)
|
||||||
|
ORG = doc.vocab.strings['ORG']
|
||||||
|
matcher = Matcher(doc.vocab)
|
||||||
|
matcher.add('BostonCeltics', None, *patterns)
|
||||||
|
assert len(list(doc.ents)) == 0
|
||||||
|
matches = [(ORG, start, end) for _, start, end in matcher(doc)]
|
||||||
|
doc.ents += tuple(matches)[1:]
|
||||||
|
assert matches == [(ORG, 9, 10), (ORG, 9, 11)]
|
||||||
|
ents = doc.ents
|
||||||
|
assert len(ents) == 1
|
||||||
|
assert ents[0].label == ORG
|
||||||
|
assert ents[0].start == 9
|
||||||
|
assert ents[0].end == 11
|
||||||
|
|
||||||
|
|
||||||
|
def test_issue242(en_tokenizer):
|
||||||
|
"""Test overlapping multi-word phrases."""
|
||||||
|
text = "There are different food safety standards in different countries."
|
||||||
|
patterns = [[{'LOWER': 'food'}, {'LOWER': 'safety'}],
|
||||||
|
[{'LOWER': 'safety'}, {'LOWER': 'standards'}]]
|
||||||
|
doc = en_tokenizer(text)
|
||||||
|
matcher = Matcher(doc.vocab)
|
||||||
|
matcher.add('FOOD', None, *patterns)
|
||||||
|
|
||||||
|
matches = [(ent_type, start, end) for ent_type, start, end in matcher(doc)]
|
||||||
|
doc.ents += tuple(matches)
|
||||||
|
match1, match2 = matches
|
||||||
|
assert match1[1] == 3
|
||||||
|
assert match1[2] == 5
|
||||||
|
assert match2[1] == 4
|
||||||
|
assert match2[2] == 6
|
||||||
|
|
||||||
|
|
||||||
|
def test_issue309(en_tokenizer):
|
||||||
|
"""Test Issue #309: SBD fails on empty string"""
|
||||||
|
tokens = en_tokenizer(" ")
|
||||||
|
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=[0], deps=['ROOT'])
|
||||||
|
doc.is_parsed = True
|
||||||
|
assert len(doc) == 1
|
||||||
|
sents = list(doc.sents)
|
||||||
|
assert len(sents) == 1
|
||||||
|
|
||||||
|
|
||||||
|
def test_issue351(en_tokenizer):
|
||||||
|
doc = en_tokenizer(" This is a cat.")
|
||||||
|
assert doc[0].idx == 0
|
||||||
|
assert len(doc[0]) == 3
|
||||||
|
assert doc[1].idx == 3
|
||||||
|
|
||||||
|
|
||||||
|
def test_issue360(en_tokenizer):
|
||||||
|
"""Test tokenization of big ellipsis"""
|
||||||
|
tokens = en_tokenizer('$45...............Asking')
|
||||||
|
assert len(tokens) > 2
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('text1,text2', [("cat", "dog")])
|
||||||
|
def test_issue361(en_vocab, text1, text2):
|
||||||
|
"""Test Issue #361: Equality of lexemes"""
|
||||||
|
assert en_vocab[text1] == en_vocab[text1]
|
||||||
|
assert en_vocab[text1] != en_vocab[text2]
|
||||||
|
|
||||||
|
|
||||||
|
def test_issue587(en_tokenizer):
|
||||||
|
"""Test that Matcher doesn't segfault on particular input"""
|
||||||
|
doc = en_tokenizer('a b; c')
|
||||||
|
matcher = Matcher(doc.vocab)
|
||||||
|
matcher.add('TEST1', None, [{ORTH: 'a'}, {ORTH: 'b'}])
|
||||||
|
matches = matcher(doc)
|
||||||
|
assert len(matches) == 1
|
||||||
|
matcher.add('TEST2', None, [{ORTH: 'a'}, {ORTH: 'b'}, {IS_PUNCT: True}, {ORTH: 'c'}])
|
||||||
|
matches = matcher(doc)
|
||||||
|
assert len(matches) == 2
|
||||||
|
matcher.add('TEST3', None, [{ORTH: 'a'}, {ORTH: 'b'}, {IS_PUNCT: True}, {ORTH: 'd'}])
|
||||||
|
matches = matcher(doc)
|
||||||
|
assert len(matches) == 2
|
||||||
|
|
||||||
|
|
||||||
|
def test_issue588(en_vocab):
|
||||||
|
matcher = Matcher(en_vocab)
|
||||||
|
with pytest.raises(ValueError):
|
||||||
|
matcher.add('TEST', None, [])
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.xfail
|
||||||
|
def test_issue589():
|
||||||
|
vocab = Vocab()
|
||||||
|
vocab.strings.set_frozen(True)
|
||||||
|
doc = Doc(vocab, words=['whata'])
|
||||||
|
|
||||||
|
|
||||||
|
def test_issue590(en_vocab):
|
||||||
|
"""Test overlapping matches"""
|
||||||
|
doc = Doc(en_vocab, words=['n', '=', '1', ';', 'a', ':', '5', '%'])
|
||||||
|
matcher = Matcher(en_vocab)
|
||||||
|
matcher.add('ab', None, [{'IS_ALPHA': True}, {'ORTH': ':'}, {'LIKE_NUM': True}, {'ORTH': '%'}])
|
||||||
|
matcher.add('ab', None, [{'IS_ALPHA': True}, {'ORTH': '='}, {'LIKE_NUM': True}])
|
||||||
|
matches = matcher(doc)
|
||||||
|
assert len(matches) == 2
|
||||||
|
|
||||||
|
|
||||||
|
def test_issue595():
|
||||||
|
"""Test lemmatization of base forms"""
|
||||||
|
words = ["Do", "n't", "feed", "the", "dog"]
|
||||||
|
tag_map = {'VB': {POS: VERB, VerbForm_inf: True}}
|
||||||
|
rules = {"verb": [["ed", "e"]]}
|
||||||
|
lemmatizer = Lemmatizer({'verb': {}}, {'verb': {}}, rules)
|
||||||
|
vocab = Vocab(lemmatizer=lemmatizer, tag_map=tag_map)
|
||||||
|
doc = Doc(vocab, words=words)
|
||||||
|
doc[2].tag_ = 'VB'
|
||||||
|
assert doc[2].text == 'feed'
|
||||||
|
assert doc[2].lemma_ == 'feed'
|
||||||
|
|
||||||
|
|
||||||
|
def test_issue599(en_vocab):
|
||||||
|
doc = Doc(en_vocab)
|
||||||
|
doc.is_tagged = True
|
||||||
|
doc.is_parsed = True
|
||||||
|
doc2 = Doc(doc.vocab)
|
||||||
|
doc2.from_bytes(doc.to_bytes())
|
||||||
|
assert doc2.is_parsed
|
||||||
|
|
||||||
|
|
||||||
|
def test_issue600():
|
||||||
|
vocab = Vocab(tag_map={'NN': {'pos': 'NOUN'}})
|
||||||
|
doc = Doc(vocab, words=["hello"])
|
||||||
|
doc[0].tag_ = 'NN'
|
||||||
|
|
||||||
|
|
||||||
|
def test_issue615(en_tokenizer):
|
||||||
|
def merge_phrases(matcher, doc, i, matches):
|
||||||
|
"""Merge a phrase. We have to be careful here because we'll change the
|
||||||
|
token indices. To avoid problems, merge all the phrases once we're called
|
||||||
|
on the last match."""
|
||||||
|
if i != len(matches)-1:
|
||||||
|
return None
|
||||||
|
spans = [(ent_id, ent_id, doc[start : end]) for ent_id, start, end in matches]
|
||||||
|
for ent_id, label, span in spans:
|
||||||
|
span.merge(tag='NNP' if label else span.root.tag_, lemma=span.text,
|
||||||
|
label=label)
|
||||||
|
doc.ents = doc.ents + ((label, span.start, span.end),)
|
||||||
|
|
||||||
|
text = "The golf club is broken"
|
||||||
|
pattern = [{'ORTH': "golf"}, {'ORTH': "club"}]
|
||||||
|
label = "Sport_Equipment"
|
||||||
|
doc = en_tokenizer(text)
|
||||||
|
matcher = Matcher(doc.vocab)
|
||||||
|
matcher.add(label, merge_phrases, pattern)
|
||||||
|
match = matcher(doc)
|
||||||
|
entities = list(doc.ents)
|
||||||
|
assert entities != []
|
||||||
|
assert entities[0].label != 0
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('text,number', [("7am", "7"), ("11p.m.", "11")])
|
||||||
|
def test_issue736(en_tokenizer, text, number):
|
||||||
|
"""Test that times like "7am" are tokenized correctly and that numbers are
|
||||||
|
converted to string."""
|
||||||
|
tokens = en_tokenizer(text)
|
||||||
|
assert len(tokens) == 2
|
||||||
|
assert tokens[0].text == number
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('text', ["3/4/2012", "01/12/1900"])
|
||||||
|
def test_issue740(en_tokenizer, text):
|
||||||
|
"""Test that dates are not split and kept as one token. This behaviour is
|
||||||
|
currently inconsistent, since dates separated by hyphens are still split.
|
||||||
|
This will be hard to prevent without causing clashes with numeric ranges."""
|
||||||
|
tokens = en_tokenizer(text)
|
||||||
|
assert len(tokens) == 1
|
||||||
|
|
||||||
|
|
||||||
|
def test_issue743():
|
||||||
|
doc = Doc(Vocab(), ['hello', 'world'])
|
||||||
|
token = doc[0]
|
||||||
|
s = set([token])
|
||||||
|
items = list(s)
|
||||||
|
assert items[0] is token
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('text', ["We were scared", "We Were Scared"])
|
||||||
|
def test_issue744(en_tokenizer, text):
|
||||||
|
"""Test that 'were' and 'Were' are excluded from the contractions
|
||||||
|
generated by the English tokenizer exceptions."""
|
||||||
|
tokens = en_tokenizer(text)
|
||||||
|
assert len(tokens) == 3
|
||||||
|
assert tokens[1].text.lower() == "were"
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('text,is_num', [("one", True), ("ten", True),
|
||||||
|
("teneleven", False)])
|
||||||
|
def test_issue759(en_tokenizer, text, is_num):
|
||||||
|
tokens = en_tokenizer(text)
|
||||||
|
assert tokens[0].like_num == is_num
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('text', ["Shell", "shell", "Shed", "shed"])
|
||||||
|
def test_issue775(en_tokenizer, text):
|
||||||
|
"""Test that 'Shell' and 'shell' are excluded from the contractions
|
||||||
|
generated by the English tokenizer exceptions."""
|
||||||
|
tokens = en_tokenizer(text)
|
||||||
|
assert len(tokens) == 1
|
||||||
|
assert tokens[0].text == text
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('text', ["This is a string ", "This is a string\u0020"])
|
||||||
|
def test_issue792(en_tokenizer, text):
|
||||||
|
"""Test for Issue #792: Trailing whitespace is removed after tokenization."""
|
||||||
|
doc = en_tokenizer(text)
|
||||||
|
assert ''.join([token.text_with_ws for token in doc]) == text
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('text', ["This is a string", "This is a string\n"])
|
||||||
|
def test_control_issue792(en_tokenizer, text):
|
||||||
|
"""Test base case for Issue #792: Non-trailing whitespace"""
|
||||||
|
doc = en_tokenizer(text)
|
||||||
|
assert ''.join([token.text_with_ws for token in doc]) == text
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('text,tokens', [
|
||||||
|
('"deserve,"--and', ['"', "deserve", ',"--', "and"]),
|
||||||
|
("exception;--exclusive", ["exception", ";--", "exclusive"]),
|
||||||
|
("day.--Is", ["day", ".--", "Is"]),
|
||||||
|
("refinement:--just", ["refinement", ":--", "just"]),
|
||||||
|
("memories?--To", ["memories", "?--", "To"]),
|
||||||
|
("Useful.=--Therefore", ["Useful", ".=--", "Therefore"]),
|
||||||
|
("=Hope.=--Pandora", ["=", "Hope", ".=--", "Pandora"])])
|
||||||
|
def test_issue801(en_tokenizer, text, tokens):
|
||||||
|
"""Test that special characters + hyphens are split correctly."""
|
||||||
|
doc = en_tokenizer(text)
|
||||||
|
assert len(doc) == len(tokens)
|
||||||
|
assert [t.text for t in doc] == tokens
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('text,expected_tokens', [
|
||||||
|
('Smörsåsen används bl.a. till fisk', ['Smörsåsen', 'används', 'bl.a.', 'till', 'fisk']),
|
||||||
|
('Jag kommer först kl. 13 p.g.a. diverse förseningar', ['Jag', 'kommer', 'först', 'kl.', '13', 'p.g.a.', 'diverse', 'förseningar'])
|
||||||
|
])
|
||||||
|
def test_issue805(sv_tokenizer, text, expected_tokens):
|
||||||
|
tokens = sv_tokenizer(text)
|
||||||
|
token_list = [token.text for token in tokens if not token.is_space]
|
||||||
|
assert expected_tokens == token_list
|
||||||
|
|
||||||
|
|
||||||
|
def test_issue850():
|
||||||
|
"""The variable-length pattern matches the succeeding token. Check we
|
||||||
|
handle the ambiguity correctly."""
|
||||||
|
vocab = Vocab(lex_attr_getters={LOWER: lambda string: string.lower()})
|
||||||
|
matcher = Matcher(vocab)
|
||||||
|
IS_ANY_TOKEN = matcher.vocab.add_flag(lambda x: True)
|
||||||
|
pattern = [{'LOWER': "bob"}, {'OP': '*', 'IS_ANY_TOKEN': True}, {'LOWER': 'frank'}]
|
||||||
|
matcher.add('FarAway', None, pattern)
|
||||||
|
doc = Doc(matcher.vocab, words=['bob', 'and', 'and', 'frank'])
|
||||||
|
match = matcher(doc)
|
||||||
|
assert len(match) == 1
|
||||||
|
ent_id, start, end = match[0]
|
||||||
|
assert start == 0
|
||||||
|
assert end == 4
|
||||||
|
|
||||||
|
|
||||||
|
def test_issue850_basic():
|
||||||
|
"""Test Matcher matches with '*' operator and Boolean flag"""
|
||||||
|
vocab = Vocab(lex_attr_getters={LOWER: lambda string: string.lower()})
|
||||||
|
matcher = Matcher(vocab)
|
||||||
|
IS_ANY_TOKEN = matcher.vocab.add_flag(lambda x: True)
|
||||||
|
pattern = [{'LOWER': "bob"}, {'OP': '*', 'LOWER': 'and'}, {'LOWER': 'frank'}]
|
||||||
|
matcher.add('FarAway', None, pattern)
|
||||||
|
doc = Doc(matcher.vocab, words=['bob', 'and', 'and', 'frank'])
|
||||||
|
match = matcher(doc)
|
||||||
|
assert len(match) == 1
|
||||||
|
ent_id, start, end = match[0]
|
||||||
|
assert start == 0
|
||||||
|
assert end == 4
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('text', ["au-delàs", "pair-programmâmes",
|
||||||
|
"terra-formées", "σ-compacts"])
|
||||||
|
def test_issue852(fr_tokenizer, text):
|
||||||
|
"""Test that French tokenizer exceptions are imported correctly."""
|
||||||
|
tokens = fr_tokenizer(text)
|
||||||
|
assert len(tokens) == 1
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('text', ["aaabbb@ccc.com\nThank you!",
|
||||||
|
"aaabbb@ccc.com \nThank you!"])
|
||||||
|
def test_issue859(en_tokenizer, text):
|
||||||
|
"""Test that no extra space is added in doc.text method."""
|
||||||
|
doc = en_tokenizer(text)
|
||||||
|
assert doc.text == text
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('text', ["Datum:2014-06-02\nDokument:76467"])
|
||||||
|
def test_issue886(en_tokenizer, text):
|
||||||
|
"""Test that token.idx matches the original text index for texts with newlines."""
|
||||||
|
doc = en_tokenizer(text)
|
||||||
|
for token in doc:
|
||||||
|
assert len(token.text) == len(token.text_with_ws)
|
||||||
|
assert text[token.idx] == token.text[0]
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('text', ["want/need"])
|
||||||
|
def test_issue891(en_tokenizer, text):
|
||||||
|
"""Test that / infixes are split correctly."""
|
||||||
|
tokens = en_tokenizer(text)
|
||||||
|
assert len(tokens) == 3
|
||||||
|
assert tokens[1].text == "/"
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('text,tag,lemma', [
|
||||||
|
("anus", "NN", "anus"),
|
||||||
|
("princess", "NN", "princess"),
|
||||||
|
("inner", "JJ", "inner")
|
||||||
|
])
|
||||||
|
def test_issue912(en_vocab, text, tag, lemma):
|
||||||
|
"""Test base-forms are preserved."""
|
||||||
|
doc = Doc(en_vocab, words=[text])
|
||||||
|
doc[0].tag_ = tag
|
||||||
|
assert doc[0].lemma_ == lemma
|
||||||
|
|
||||||
|
|
||||||
|
def test_issue957(en_tokenizer):
    """Test that spaCy doesn't hang on many periods."""
    # Skip the test if pytest-timeout is not installed. The plugin is
    # imported as `pytest_timeout`; the hyphenated name is not importable,
    # so passing it would always skip the test.
    timeout = pytest.importorskip('pytest_timeout')
    string = '0'
    for i in range(1, 100):
        string += '.%d' % i
    doc = en_tokenizer(string)
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.xfail
|
||||||
|
def test_issue999(train_data):
|
||||||
|
"""Test that adding entities and resuming training works passably OK.
|
||||||
|
There are two issues here:
|
||||||
|
1) We have to re-add labels. This isn't very nice.
|
||||||
|
2) There's no way to set the learning rate for the weight update, so we
|
||||||
|
end up out-of-scale, causing it to learn too fast.
|
||||||
|
"""
|
||||||
|
TRAIN_DATA = [
|
||||||
|
["hey", []],
|
||||||
|
["howdy", []],
|
||||||
|
["hey there", []],
|
||||||
|
["hello", []],
|
||||||
|
["hi", []],
|
||||||
|
["i'm looking for a place to eat", []],
|
||||||
|
["i'm looking for a place in the north of town", [[31,36,"LOCATION"]]],
|
||||||
|
["show me chinese restaurants", [[8,15,"CUISINE"]]],
|
||||||
|
["show me chines restaurants", [[8,14,"CUISINE"]]],
|
||||||
|
]
|
||||||
|
|
||||||
|
nlp = Language()
|
||||||
|
ner = nlp.create_pipe('ner')
|
||||||
|
nlp.add_pipe(ner)
|
||||||
|
for _, offsets in TRAIN_DATA:
|
||||||
|
for start, end, label in offsets:
|
||||||
|
ner.add_label(label)
|
||||||
|
nlp.begin_training()
|
||||||
|
ner.model.learn_rate = 0.001
|
||||||
|
for itn in range(100):
|
||||||
|
random.shuffle(TRAIN_DATA)
|
||||||
|
for raw_text, entity_offsets in TRAIN_DATA:
|
||||||
|
nlp.update([raw_text], [{'entities': entity_offsets}])
|
||||||
|
|
||||||
|
with make_tempdir() as model_dir:
|
||||||
|
nlp.to_disk(model_dir)
|
||||||
|
nlp2 = Language().from_disk(model_dir)
|
||||||
|
|
||||||
|
for raw_text, entity_offsets in TRAIN_DATA:
|
||||||
|
doc = nlp2(raw_text)
|
||||||
|
ents = {(ent.start_char, ent.end_char): ent.label_ for ent in doc.ents}
|
||||||
|
for start, end, label in entity_offsets:
|
||||||
|
if (start, end) in ents:
|
||||||
|
assert ents[(start, end)] == label
|
||||||
|
break
|
||||||
|
else:
|
||||||
|
if entity_offsets:
|
||||||
|
raise Exception(ents)
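The `TRAIN_DATA` entries in `test_issue999` pair each text with `[start, end, label]` character offsets. As a quick sanity check that does not need spaCy, the sketch below slices each text with its offsets to show which substring each label covers. The helper name `labelled_spans` is hypothetical and not part of the test suite, and only a subset of the training examples is copied here.

```python
# Character-offset annotations copied from test_issue999 (subset).
TRAIN_DATA = [
    ["i'm looking for a place to eat", []],
    ["i'm looking for a place in the north of town", [[31, 36, "LOCATION"]]],
    ["show me chinese restaurants", [[8, 15, "CUISINE"]]],
    ["show me chines restaurants", [[8, 14, "CUISINE"]]],
]


def labelled_spans(text, offsets):
    """Return (substring, label) pairs for [start, end, label] offsets.

    Hypothetical helper for illustration only. It treats `end` as
    exclusive, the same way a Python slice does.
    """
    return [(text[start:end], label) for start, end, label in offsets]


for text, offsets in TRAIN_DATA:
    print(labelled_spans(text, offsets))
```

Running a check like this before training can catch off-by-one offsets that would otherwise show up only as poor entity predictions.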
|
127
spacy/tests/regression/test_issue1001-1500.py
Normal file
|
@ -0,0 +1,127 @@
|
||||||
|
# coding: utf-8
|
||||||
|
from __future__ import unicode_literals
|
||||||
|
|
||||||
|
import pytest
|
||||||
|
import re
|
||||||
|
from spacy.tokens import Doc
|
||||||
|
from spacy.vocab import Vocab
|
||||||
|
from spacy.lang.en import English
|
||||||
|
from spacy.lang.lex_attrs import LEX_ATTRS
|
||||||
|
from spacy.matcher import Matcher
|
||||||
|
from spacy.tokenizer import Tokenizer
|
||||||
|
from spacy.lemmatizer import Lemmatizer
|
||||||
|
from spacy.symbols import ORTH, LEMMA, POS, VERB, VerbForm_part
|
||||||
|
|
||||||
|
|
||||||
|
def test_issue1242():
|
||||||
|
nlp = English()
|
||||||
|
doc = nlp('')
|
||||||
|
assert len(doc) == 0
|
||||||
|
docs = list(nlp.pipe(['', 'hello']))
|
||||||
|
assert len(docs[0]) == 0
|
||||||
|
assert len(docs[1]) == 1
|
||||||
|
|
||||||
|
|
||||||
|
def test_issue1250():
|
||||||
|
"""Test cached special cases."""
|
||||||
|
special_case = [{ORTH: 'reimbur', LEMMA: 'reimburse', POS: 'VERB'}]
|
||||||
|
nlp = English()
|
||||||
|
nlp.tokenizer.add_special_case('reimbur', special_case)
|
||||||
|
lemmas = [w.lemma_ for w in nlp('reimbur, reimbur...')]
|
||||||
|
assert lemmas == ['reimburse', ',', 'reimburse', '...']
|
||||||
|
lemmas = [w.lemma_ for w in nlp('reimbur, reimbur...')]
|
||||||
|
assert lemmas == ['reimburse', ',', 'reimburse', '...']
|
||||||
|
|
||||||
|
|
||||||
|
def test_issue1257():
|
||||||
|
"""Test that tokens compare correctly."""
|
||||||
|
doc1 = Doc(Vocab(), words=['a', 'b', 'c'])
|
||||||
|
doc2 = Doc(Vocab(), words=['a', 'c', 'e'])
|
||||||
|
assert doc1[0] != doc2[0]
|
||||||
|
assert not doc1[0] == doc2[0]
|
||||||
|
|
||||||
|
|
||||||
|
def test_issue1375():
|
||||||
|
"""Test that token.nbor() raises IndexError for out-of-bounds access."""
|
||||||
|
doc = Doc(Vocab(), words=['0', '1', '2'])
|
||||||
|
with pytest.raises(IndexError):
|
||||||
|
assert doc[0].nbor(-1)
|
||||||
|
assert doc[1].nbor(-1).text == '0'
|
||||||
|
with pytest.raises(IndexError):
|
||||||
|
assert doc[2].nbor(1)
|
||||||
|
assert doc[1].nbor(1).text == '2'
|
||||||
|
|
||||||
|
|
||||||
|
def test_issue1387():
|
||||||
|
tag_map = {'VBG': {POS: VERB, VerbForm_part: True}}
|
||||||
|
index = {"verb": ("cope","cop")}
|
||||||
|
exc = {"verb": {"coping": ("cope",)}}
|
||||||
|
rules = {"verb": [["ing", ""]]}
|
||||||
|
lemmatizer = Lemmatizer(index, exc, rules)
|
||||||
|
vocab = Vocab(lemmatizer=lemmatizer, tag_map=tag_map)
|
||||||
|
doc = Doc(vocab, words=["coping"])
|
||||||
|
doc[0].tag_ = 'VBG'
|
||||||
|
assert doc[0].text == "coping"
|
||||||
|
assert doc[0].lemma_ == "cope"
|
||||||
|
|
||||||
|
|
||||||
|
def test_issue1434():
|
||||||
|
"""Test matches occur when optional element at end of short doc."""
|
||||||
|
pattern = [{'ORTH': 'Hello' }, {'IS_ALPHA': True, 'OP': '?'}]
|
||||||
|
vocab = Vocab(lex_attr_getters=LEX_ATTRS)
|
||||||
|
hello_world = Doc(vocab, words=['Hello', 'World'])
|
||||||
|
hello = Doc(vocab, words=['Hello'])
|
||||||
|
matcher = Matcher(vocab)
|
||||||
|
matcher.add('MyMatcher', None, pattern)
|
||||||
|
matches = matcher(hello_world)
|
||||||
|
assert matches
|
||||||
|
matches = matcher(hello)
|
||||||
|
assert matches
|
||||||
|
|
||||||
|
|
||||||
|
@pytest.mark.parametrize('string,start,end', [
    ('a', 0, 1), ('a b', 0, 2), ('a c', 0, 1), ('a b c', 0, 2),
    ('a b b c', 0, 3), ('a b b', 0, 3)])
def test_issue1450(string, start, end):
    """Test matcher works when patterns end with * operator."""
    pattern = [{'ORTH': "a"}, {'ORTH': "b", 'OP': "*"}]
    matcher = Matcher(Vocab())
    matcher.add("TSTEND", None, pattern)
    doc = Doc(Vocab(), words=string.split())
    matches = matcher(doc)
    if start is None or end is None:
        assert matches == []
    else:
        # Only index into matches when a match is expected.
        assert matches[-1][1] == start
        assert matches[-1][2] == end
|
||||||
|
|
||||||
|
|
||||||
|
def test_issue1488():
|
||||||
|
prefix_re = re.compile(r'''[\[\("']''')
|
||||||
|
suffix_re = re.compile(r'''[\]\)"']''')
|
||||||
|
infix_re = re.compile(r'''[-~\.]''')
|
||||||
|
simple_url_re = re.compile(r'''^https?://''')
|
||||||
|
|
||||||
|
def my_tokenizer(nlp):
|
||||||
|
return Tokenizer(nlp.vocab, {},
|
||||||
|
prefix_search=prefix_re.search,
|
||||||
|
suffix_search=suffix_re.search,
|
||||||
|
infix_finditer=infix_re.finditer,
|
||||||
|
token_match=simple_url_re.match)
|
||||||
|
|
||||||
|
nlp = English()
|
||||||
|
nlp.tokenizer = my_tokenizer(nlp)
|
||||||
|
doc = nlp("This is a test.")
|
||||||
|
for token in doc:
|
||||||
|
assert token.text
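To see what the four regular expressions in `test_issue1488` match on their own, the following sketch exercises them with Python's `re` module only. The sample strings are made up for illustration, and nothing here shows or asserts how spaCy's `Tokenizer` consumes these callables.

```python
import re

# Same patterns as in test_issue1488.
prefix_re = re.compile(r'''[\[\("']''')
suffix_re = re.compile(r'''[\]\)"']''')
infix_re = re.compile(r'''[-~\.]''')
simple_url_re = re.compile(r'''^https?://''')

# Illustrative inputs (assumptions, not taken from the test).
infixes = [m.group() for m in infix_re.finditer('well-known~test.txt')]
prefix = prefix_re.search('("quoted')
suffix = suffix_re.search('done)')
url = simple_url_re.match('https://example.com')
not_url = simple_url_re.match('see http://example.com')

print(infixes)                         # every '-', '~' or '.' in the string
print(prefix.group(), prefix.start())  # first opening bracket or quote
print(suffix.group(), suffix.start())  # first closing bracket or quote
print(url.group(), not_url)            # the URL scheme is matched only at the start
```

Note that `prefix_re` and `suffix_re` are not anchored, so `search` finds the first matching character anywhere in the string, while `simple_url_re` is anchored with `^`.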
|
||||||
|
|
||||||
|
|
||||||
|
def test_issue1494():
|
||||||
|
infix_re = re.compile(r'''[^a-z]''')
|
||||||
|
test_cases = [('token 123test', ['token', '1', '2', '3', 'test']),
|
||||||
|
('token 1test', ['token', '1test']),
|
||||||
|
('hello...test', ['hello', '.', '.', '.', 'test'])]
|
||||||
|
new_tokenizer = lambda nlp: Tokenizer(nlp.vocab, {}, infix_finditer=infix_re.finditer)
|
||||||
|
nlp = English()
|
||||||
|
nlp.tokenizer = new_tokenizer(nlp)
|
||||||
|
for text, expected in test_cases:
|
||||||
|
assert [token.text for token in nlp(text)] == expected
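The parametrized cases in `test_issue1450` expect the last match for the pattern "`a` followed by zero or more `b`" to cover the longest such span. A minimal plain-Python sketch of that expectation is shown below. It is not spaCy's `Matcher` and says nothing about how the Matcher orders or enumerates matches; the function name `longest_a_then_bs` is hypothetical.

```python
def longest_a_then_bs(words):
    """Return (start, end) for the first 'a' plus all directly following 'b's.

    Hypothetical illustration of the span test_issue1450 expects for the
    last match; `end` is exclusive. Returns None if there is no 'a'.
    """
    for start, word in enumerate(words):
        if word != "a":
            continue
        end = start + 1
        # Greedily extend the span over consecutive 'b' tokens.
        while end < len(words) and words[end] == "b":
            end += 1
        return start, end
    return None


# Same cases as the test's parametrization.
CASES = [
    ("a", 0, 1), ("a b", 0, 2), ("a c", 0, 1), ("a b c", 0, 2),
    ("a b b c", 0, 3), ("a b b", 0, 3),
]

for text, start, end in CASES:
    print(text, longest_a_then_bs(text.split()), (start, end))
```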
|
|
@ -1,55 +0,0 @@
|
||||||
# coding: utf-8
|
|
||||||
from __future__ import unicode_literals
|
|
||||||
|
|
||||||
from ...matcher import Matcher
|
|
||||||
|
|
||||||
import pytest
|
|
||||||
|
|
||||||
|
|
||||||
pattern1 = [[{'LOWER': 'celtics'}], [{'LOWER': 'boston'}, {'LOWER': 'celtics'}]]
|
|
||||||
pattern2 = [[{'LOWER': 'boston'}, {'LOWER': 'celtics'}], [{'LOWER': 'celtics'}]]
|
|
||||||
pattern3 = [[{'LOWER': 'boston'}], [{'LOWER': 'boston'}, {'LOWER': 'celtics'}]]
|
|
||||||
pattern4 = [[{'LOWER': 'boston'}, {'LOWER': 'celtics'}], [{'LOWER': 'boston'}]]
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.fixture
|
|
||||||
def doc(en_tokenizer):
|
|
||||||
text = "how many points did lebron james score against the boston celtics last night"
|
|
||||||
doc = en_tokenizer(text)
|
|
||||||
return doc
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('pattern', [pattern1, pattern2])
|
|
||||||
def test_issue118(doc, pattern):
|
|
||||||
"""Test a bug that arose from having overlapping matches"""
|
|
||||||
ORG = doc.vocab.strings['ORG']
|
|
||||||
matcher = Matcher(doc.vocab)
|
|
||||||
matcher.add("BostonCeltics", None, *pattern)
|
|
||||||
|
|
||||||
assert len(list(doc.ents)) == 0
|
|
||||||
matches = [(ORG, start, end) for _, start, end in matcher(doc)]
|
|
||||||
assert matches == [(ORG, 9, 11), (ORG, 10, 11)]
|
|
||||||
doc.ents = matches[:1]
|
|
||||||
ents = list(doc.ents)
|
|
||||||
assert len(ents) == 1
|
|
||||||
assert ents[0].label == ORG
|
|
||||||
assert ents[0].start == 9
|
|
||||||
assert ents[0].end == 11
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.parametrize('pattern', [pattern3, pattern4])
|
|
||||||
def test_issue118_prefix_reorder(doc, pattern):
|
|
||||||
"""Test a bug that arose from having overlapping matches"""
|
|
||||||
ORG = doc.vocab.strings['ORG']
|
|
||||||
matcher = Matcher(doc.vocab)
|
|
||||||
matcher.add('BostonCeltics', None, *pattern)
|
|
||||||
|
|
||||||
assert len(list(doc.ents)) == 0
|
|
||||||
matches = [(ORG, start, end) for _, start, end in matcher(doc)]
|
|
||||||
doc.ents += tuple(matches)[1:]
|
|
||||||
assert matches == [(ORG, 9, 10), (ORG, 9, 11)]
|
|
||||||
ents = doc.ents
|
|
||||||
assert len(ents) == 1
|
|
||||||
assert ents[0].label == ORG
|
|
||||||
assert ents[0].start == 9
|
|
||||||
assert ents[0].end == 11
|
|
|
@ -1,13 +0,0 @@
|
||||||
from __future__ import unicode_literals
|
|
||||||
|
|
||||||
import pytest
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.models('en')
|
|
||||||
def test_issue1207(EN):
|
|
||||||
text = 'Employees are recruiting talented staffers from overseas.'
|
|
||||||
doc = EN(text)
|
|
||||||
|
|
||||||
assert [i.text for i in doc.noun_chunks] == ['Employees', 'talented staffers']
|
|
||||||
sent = list(doc.sents)[0]
|
|
||||||
assert [i.text for i in sent.noun_chunks] == ['Employees', 'talented staffers']
|
|
|
@ -1,23 +0,0 @@
|
||||||
from __future__ import unicode_literals
|
|
||||||
import pytest
|
|
||||||
from ...lang.en import English
|
|
||||||
from ...util import load_model
|
|
||||||
|
|
||||||
|
|
||||||
def test_issue1242_empty_strings():
|
|
||||||
nlp = English()
|
|
||||||
doc = nlp('')
|
|
||||||
assert len(doc) == 0
|
|
||||||
docs = list(nlp.pipe(['', 'hello']))
|
|
||||||
assert len(docs[0]) == 0
|
|
||||||
assert len(docs[1]) == 1
|
|
||||||
|
|
||||||
|
|
||||||
@pytest.mark.models('en')
|
|
||||||
def test_issue1242_empty_strings_en_core_web_sm():
|
|
||||||
nlp = load_model('en_core_web_sm')
|
|
||||||
doc = nlp('')
|
|
||||||
assert len(doc) == 0
|
|
||||||
docs = list(nlp.pipe(['', 'hello']))
|
|
||||||
assert len(docs[0]) == 0
|
|
||||||
assert len(docs[1]) == 1
|
|
|
@ -1,13 +0,0 @@
|
||||||
from __future__ import unicode_literals
|
|
||||||
from ...tokenizer import Tokenizer
|
|
||||||
from ...symbols import ORTH, LEMMA, POS
|
|
||||||
from ...lang.en import English
|
|
||||||
|
|
||||||
def test_issue1250_cached_special_cases():
|
|
||||||
nlp = English()
|
|
||||||
nlp.tokenizer.add_special_case(u'reimbur', [{ORTH: u'reimbur', LEMMA: u'reimburse', POS: u'VERB'}])
|
|
||||||
|
|
||||||
lemmas = [w.lemma_ for w in nlp(u'reimbur, reimbur...')]
|
|
||||||
assert lemmas == ['reimburse', ',', 'reimburse', '...']
|
|
||||||
lemmas = [w.lemma_ for w in nlp(u'reimbur, reimbur...')]
|
|
||||||
assert lemmas == ['reimburse', ',', 'reimburse', '...']
|
|
Some files were not shown because too many files have changed in this diff.