mirror of
https://github.com/explosion/spaCy.git
synced 2025-01-11 17:56:30 +03:00
💫 Refactor test suite (#2568)
## Description

Related issues: #2379 (should be fixed by separating model tests)

* **total execution time down from > 300 seconds to under 60 seconds** 🎉
* removed all model-specific tests that could only really be run manually anyway – those will now live in a separate test suite in the [`spacy-models`](https://github.com/explosion/spacy-models) repository and are already integrated into our new model training infrastructure
* changed all relative imports to absolute imports to prepare for moving the test suite from `/spacy/tests` to `/tests` (it'll now always test against the installed version)
* merged old regression tests into collections, e.g. `test_issue1001-1500.py` (about 90% of the regression tests are very short anyways)
* tidied up and rewrote existing tests wherever possible

### Todo

- [ ] move tests to `/tests` and adjust CI commands accordingly
- [x] move model test suite from internal repo to `spacy-models`
- [x] ~~investigate why `pipeline/test_textcat.py` is flakey~~
- [x] review old regression tests (leftover files) and see if they can be merged, simplified or deleted
- [ ] update documentation on how to run tests

### Types of change

enhancement, tests

## Checklist

<!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.
parent 82277f63a3
commit 75f3234404
1
.gitignore
vendored
|
@ -36,6 +36,7 @@ venv/
|
|||
.dev
|
||||
.denv
|
||||
.pypyenv
|
||||
.pytest_cache/
|
||||
|
||||
# Distribution / packaging
|
||||
env/
|
||||
|
|
|
@ -11,5 +11,6 @@ dill>=0.2,<0.3
|
|||
regex==2017.4.5
|
||||
requests>=2.13.0,<3.0.0
|
||||
pytest>=3.6.0,<4.0.0
|
||||
pytest-timeout>=1.3.0,<2.0.0
|
||||
mock>=2.0.0,<3.0.0
|
||||
pathlib==1.0.1; python_version < "3.4"
|
||||
|
|
|
@ -6,6 +6,7 @@ spaCy uses the [pytest](http://doc.pytest.org/) framework for testing. For more
|
|||
|
||||
Tests for spaCy modules and classes live in their own directories of the same name. For example, tests for the `Tokenizer` can be found in [`/tests/tokenizer`](tokenizer). All test modules (i.e. directories) also need to be listed in spaCy's [`setup.py`](../setup.py). To be interpreted and run, all test files and test functions need to be prefixed with `test_`.
|
||||
|
||||
> ⚠️ **Important note:** As part of our new model training infrastructure, we've moved all model tests to the [`spacy-models`](https://github.com/explosion/spacy-models) repository. This allows us to test the models separately from the core library functionality.
|
||||
|
||||
## Table of contents
|
||||
|
||||
|
@ -13,9 +14,8 @@ Tests for spaCy modules and classes live in their own directories of the same na
|
|||
2. [Dos and don'ts](#dos-and-donts)
|
||||
3. [Parameters](#parameters)
|
||||
4. [Fixtures](#fixtures)
|
||||
5. [Testing models](#testing-models)
|
||||
6. [Helpers and utilities](#helpers-and-utilities)
|
||||
7. [Contributing to the tests](#contributing-to-the-tests)
|
||||
5. [Helpers and utilities](#helpers-and-utilities)
|
||||
6. [Contributing to the tests](#contributing-to-the-tests)
|
||||
|
||||
|
||||
## Running the tests
|
||||
|
@ -25,10 +25,7 @@ first failure, run them with `py.test -x`.
|
|||
|
||||
```bash
|
||||
py.test spacy # run basic tests
|
||||
py.test spacy --models --en # run basic and English model tests
|
||||
py.test spacy --models --all # run basic and all model tests
|
||||
py.test spacy --slow # run basic and slow tests
|
||||
py.test spacy --models --all --slow # run all tests
|
||||
```
|
||||
|
||||
You can also run tests in a specific file or directory, or even only one
|
||||
|
@ -48,10 +45,10 @@ To keep the behaviour of the tests consistent and predictable, we try to follow
|
|||
* If you're testing for a bug reported in a specific issue, always create a **regression test**. Regression tests should be named `test_issue[ISSUE NUMBER]` and live in the [`regression`](regression) directory.
|
||||
* Only use `@pytest.mark.xfail` for tests that **should pass, but currently fail**. To test for desired negative behaviour, use `assert not` in your test.
|
||||
* Very **extensive tests** that take a long time to run should be marked with `@pytest.mark.slow`. If your slow test is testing important behaviour, consider adding an additional simpler version.
|
||||
* Tests that require **loading the models** should be marked with `@pytest.mark.models`.
|
||||
* If tests require **loading the models**, they should be added to the [`spacy-models`](https://github.com/explosion/spacy-models) tests.
|
||||
* Before requiring the models, always make sure there is no other way to test the particular behaviour. In a lot of cases, it's sufficient to simply create a `Doc` object manually. See the section on [helpers and utility functions](#helpers-and-utilities) for more info on this.
|
||||
* **Avoid unnecessary imports.** There should never be a need to explicitly import spaCy at the top of a file, and most components are available as [fixtures](#fixtures). You should also avoid wildcard imports (`from module import *`).
|
||||
* If you're importing from spaCy, **always use relative imports**. Otherwise, you might accidentally be running the tests over a different copy of spaCy, e.g. one you have installed on your system.
|
||||
* **Avoid unnecessary imports.** There should never be a need to explicitly import spaCy at the top of a file, and many components are available as [fixtures](#fixtures). You should also avoid wildcard imports (`from module import *`).
|
||||
* If you're importing from spaCy, **always use absolute imports**. For example: `from spacy.language import Language`.
|
||||
* Don't forget the **unicode declarations** at the top of each file. This way, unicode strings won't have to be prefixed with `u`.
|
||||
* Try to keep the tests **readable and concise**. Use clear and descriptive variable names (`doc`, `tokens` and `text` are great), keep it short and only test for one behaviour at a time.
|
||||
|
||||
|
@ -93,12 +90,9 @@ These are the main fixtures that are currently available:
|
|||
|
||||
| Fixture | Description |
|
||||
| --- | --- |
|
||||
| `tokenizer` | Creates **all available** language tokenizers and runs the test for **each of them**. |
|
||||
| `tokenizer` | Basic, language-independent tokenizer. Identical to the `xx` language class. |
|
||||
| `en_tokenizer`, `de_tokenizer`, ... | Creates an English, German etc. tokenizer. |
|
||||
| `en_vocab`, `en_entityrecognizer`, ... | Creates an instance of the English `Vocab`, `EntityRecognizer` object etc. |
|
||||
| `EN`, `DE`, ... | Creates a language class with a loaded model. For more info, see [Testing models](#testing-models). |
|
||||
| `text_file` | Creates an instance of `StringIO` to simulate reading from and writing to files. |
|
||||
| `text_file_b` | Creates an instance of `BytesIO` to simulate reading from and writing to files. |
|
||||
| `en_vocab` | Creates an instance of the English `Vocab`. |
|
||||
|
||||
The fixtures can be used in all tests by simply setting them as an argument, like this:
|
||||
|
||||
|
@ -109,49 +103,6 @@ def test_module_do_something(en_tokenizer):
|
|||
|
||||
If all tests in a file require a specific configuration, or use the same complex example, it can be helpful to create a separate fixture. This fixture should be added at the top of each file. Make sure to use descriptive names for these fixtures and don't override any of the global fixtures listed above. **From looking at a test, it should immediately be clear which fixtures are used, and where they are coming from.**
|
||||
|
||||
## Testing models
|
||||
|
||||
Models should only be loaded and tested **if absolutely necessary** – for example, if you're specifically testing a model's performance, or if your test is related to model loading. If you only need an annotated `Doc`, you should use the `get_doc()` helper function to create it manually instead.
|
||||
|
||||
To specify which language models a test is related to, set the language ID as an argument of `@pytest.mark.models`. This allows you to later run the tests with `--models --en`. You can then use the `EN` [fixture](#fixtures) to get a language
|
||||
class with a loaded model.
|
||||
|
||||
```python
|
||||
@pytest.mark.models('en')
|
||||
def test_english_model(EN):
|
||||
doc = EN(u'This is a test')
|
||||
```
|
||||
|
||||
> ⚠️ **Important note:** In order to test models, they need to be installed as a package. The [conftest.py](conftest.py) includes a list of all available models, mapped to their IDs, e.g. `en`. Unless otherwise specified, each model that's installed in your environment will be imported and tested. If you don't have a model installed, **the test will be skipped**.
|
||||
|
||||
Under the hood, `pytest.importorskip` is used to import a model package and skip the test if the package is not installed. The `EN` fixture for example gets all
|
||||
available models for `en`, [parametrizes](#parameters) them to run the test for *each of them*, and uses `load_test_model()` to import the model and run the test, or skip it if the model is not installed.
|
||||
|
||||
### Testing specific models
|
||||
|
||||
Using the `load_test_model()` helper function, you can also write tests for specific models, or combinations of them:
|
||||
|
||||
```python
|
||||
from .util import load_test_model
|
||||
|
||||
@pytest.mark.models('en')
|
||||
def test_en_md_only():
|
||||
nlp = load_test_model('en_core_web_md')
|
||||
# test something specific to en_core_web_md
|
||||
|
||||
@pytest.mark.models('en', 'fr')
|
||||
@pytest.mark.parametrize('model', ['en_core_web_md', 'fr_depvec_web_lg'])
|
||||
def test_different_models(model):
|
||||
nlp = load_test_model(model)
|
||||
# test something specific to the parametrized models
|
||||
```
|
||||
|
||||
### Known issues and future improvements
|
||||
|
||||
Using `importorskip` on a list of model packages is not ideal and we're looking to improve this in the future. But at the moment, it's the best way to ensure that tests are performed on specific model packages only, and that you'll always be able to run the tests, even if you don't have *all available models* installed. (If the tests made a call to `spacy.load('en')` instead, this would load whichever model you've created an `en` shortcut for. This may be one of spaCy's default models, but it could just as easily be your own custom English model.)
|
||||
|
||||
The current setup also doesn't provide an easy way to only run tests on specific model versions. The `minversion` keyword argument on `pytest.importorskip` can take care of this, but it currently only checks for the package's `__version__` attribute. An alternative solution would be to load a model package's meta.json and skip if the model's version does not match the one specified in the test.
|
||||
|
||||
## Helpers and utilities
|
||||
|
||||
Our new test setup comes with a few handy utility functions that can be imported from [`util.py`](util.py).
|
||||
|
@ -186,7 +137,7 @@ You can construct a `Doc` with the following arguments:
|
|||
| `pos` | List of POS tags as text values. |
|
||||
| `tag` | List of tag names as text values. |
|
||||
| `dep` | List of dependencies as text values. |
|
||||
| `ents` | List of entity tuples with `ent_id`, `label`, `start`, `end` (for example `('Stewart Lee', 'PERSON', 0, 2)`). The `label` will be looked up in `vocab.strings[label]`. |
|
||||
| `ents` | List of entity tuples with `start`, `end`, `label` (for example `(0, 2, 'PERSON')`). The `label` will be looked up in `vocab.strings[label]`. |
|
||||
|
||||
Here's how to quickly get these values from within spaCy:
|
||||
|
||||
|
@ -196,6 +147,7 @@ print([token.head.i-token.i for token in doc])
|
|||
print([token.tag_ for token in doc])
|
||||
print([token.pos_ for token in doc])
|
||||
print([token.dep_ for token in doc])
|
||||
print([(ent.start, ent.end, ent.label_) for ent in doc.ents])
|
||||
```
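The head values printed by the snippet above are relative offsets (`token.head.i - token.i`), and the tests in this commit pass heads to `get_doc()` in that same form. As a rough, spaCy-free sketch, converting such offsets back to absolute token indices could look like the following. `absolute_heads` is a hypothetical helper written only for this illustration, and the sample offsets are the ones used for "the lazy dog slept" in one of the tests further down.

```python
def absolute_heads(relative_heads):
    """Convert per-token head offsets into absolute token indices."""
    return [i + offset for i, offset in enumerate(relative_heads)]


# Offsets used for "the lazy dog slept" elsewhere in this commit.
relative = [2, 1, 1, 0]
absolute = absolute_heads(relative)

# A token whose offset is 0 is its own head, i.e. the root of the parse.
roots = [i for i, offset in enumerate(relative) if offset == 0]

print(absolute)
print(roots)
```

Writing the offsets out like this can make it easier to check a hand-written `heads` list before handing it to a test.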
|
||||
|
||||
**Note:** There's currently no way of setting the serializer data for the parser without loading the models. If this is relevant to your test, constructing the `Doc` via `get_doc()` won't work.
|
||||
|
@ -204,7 +156,6 @@ print([token.dep_ for token in doc])
|
|||
|
||||
| Name | Description |
|
||||
| --- | --- |
|
||||
| `load_test_model` | Load a model if it's installed as a package, otherwise skip test. |
|
||||
| `apply_transition_sequence(parser, doc, sequence)` | Perform a series of pre-specified transitions, to put the parser in a desired state. |
|
||||
| `add_vecs_to_vocab(vocab, vectors)` | Add list of vector tuples (`[("text", [1, 2, 3])]`) to given vocab. All vectors need to have the same length. |
|
||||
| `get_cosine(vec1, vec2)` | Get cosine for two given vectors. |
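For orientation, cosine similarity is the dot product of two vectors divided by the product of their norms. The following is a minimal pure-Python sketch of that formula. It is an assumption about what `get_cosine` computes, not a copy of the helper in `util.py`, and the name `cosine` is made up for this example.

```python
import math


def cosine(vec1, vec2):
    """Cosine similarity: dot(vec1, vec2) / (|vec1| * |vec2|)."""
    if len(vec1) != len(vec2):
        raise ValueError("vectors need to have the same length")
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    # Note: an all-zero vector has norm 0 and would raise ZeroDivisionError here.
    return dot / (norm1 * norm2)


# Vectors pointing in the same direction, and two orthogonal vectors.
same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])

print(same_direction, orthogonal)
```

The length check mirrors the requirement stated for `add_vecs_to_vocab` that all vectors have the same length.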
|
||||
|
|
|
@ -1,229 +1,145 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from io import StringIO, BytesIO
|
||||
from pathlib import Path
|
||||
import pytest
|
||||
|
||||
from .util import load_test_model
|
||||
from ..tokens import Doc
|
||||
from ..strings import StringStore
|
||||
from .. import util
|
||||
from io import StringIO, BytesIO
|
||||
from spacy.util import get_lang_class
|
||||
|
||||
|
||||
# These languages are used for generic tokenizer tests – only add a language
|
||||
# here if it's using spaCy's tokenizer (not a different library)
|
||||
# TODO: re-implement generic tokenizer tests
|
||||
_languages = ['bn', 'da', 'de', 'el', 'en', 'es', 'fi', 'fr', 'ga', 'he', 'hu', 'id',
|
||||
'it', 'nb', 'nl', 'pl', 'pt', 'ro', 'ru', 'sv', 'tr', 'ar', 'ut', 'tt',
|
||||
'xx']
|
||||
|
||||
_models = {'en': ['en_core_web_sm'],
|
||||
'de': ['de_core_news_sm'],
|
||||
'fr': ['fr_core_news_sm'],
|
||||
'xx': ['xx_ent_web_sm'],
|
||||
'en_core_web_md': ['en_core_web_md'],
|
||||
'es_core_news_md': ['es_core_news_md']}
|
||||
def pytest_addoption(parser):
|
||||
parser.addoption("--slow", action="store_true", help="include slow tests")
|
||||
|
||||
|
||||
# only used for tests that require loading the models
|
||||
# in all other cases, use specific instances
|
||||
|
||||
@pytest.fixture(params=_models['en'])
|
||||
def EN(request):
|
||||
return load_test_model(request.param)
|
||||
def pytest_runtest_setup(item):
|
||||
for opt in ['slow']:
|
||||
if opt in item.keywords and not item.config.getoption("--%s" % opt):
|
||||
pytest.skip("need --%s option to run" % opt)
|
||||
|
||||
|
||||
@pytest.fixture(params=_models['de'])
|
||||
def DE(request):
|
||||
return load_test_model(request.param)
|
||||
|
||||
|
||||
@pytest.fixture(params=_models['fr'])
|
||||
def FR(request):
|
||||
return load_test_model(request.param)
|
||||
|
||||
|
||||
@pytest.fixture()
|
||||
def RU(request):
|
||||
pymorphy = pytest.importorskip('pymorphy2')
|
||||
return util.get_lang_class('ru')()
|
||||
|
||||
@pytest.fixture()
|
||||
def JA(request):
|
||||
mecab = pytest.importorskip("MeCab")
|
||||
return util.get_lang_class('ja')()
|
||||
|
||||
|
||||
#@pytest.fixture(params=_languages)
|
||||
#def tokenizer(request):
|
||||
#lang = util.get_lang_class(request.param)
|
||||
#return lang.Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
@pytest.fixture(scope='module')
|
||||
def tokenizer():
|
||||
return util.get_lang_class('xx').Defaults.create_tokenizer()
|
||||
return get_lang_class('xx').Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
@pytest.fixture(scope='session')
|
||||
def en_tokenizer():
|
||||
return util.get_lang_class('en').Defaults.create_tokenizer()
|
||||
return get_lang_class('en').Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
@pytest.fixture(scope='session')
|
||||
def en_vocab():
|
||||
return util.get_lang_class('en').Defaults.create_vocab()
|
||||
return get_lang_class('en').Defaults.create_vocab()
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
@pytest.fixture(scope='session')
|
||||
def en_parser(en_vocab):
|
||||
nlp = util.get_lang_class('en')(en_vocab)
|
||||
nlp = get_lang_class('en')(en_vocab)
|
||||
return nlp.create_pipe('parser')
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
@pytest.fixture(scope='session')
|
||||
def es_tokenizer():
|
||||
return util.get_lang_class('es').Defaults.create_tokenizer()
|
||||
return get_lang_class('es').Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
@pytest.fixture(scope='session')
|
||||
def de_tokenizer():
|
||||
return util.get_lang_class('de').Defaults.create_tokenizer()
|
||||
return get_lang_class('de').Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
@pytest.fixture(scope='session')
|
||||
def fr_tokenizer():
|
||||
return util.get_lang_class('fr').Defaults.create_tokenizer()
|
||||
return get_lang_class('fr').Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def hu_tokenizer():
|
||||
return util.get_lang_class('hu').Defaults.create_tokenizer()
|
||||
return get_lang_class('hu').Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
@pytest.fixture(scope='session')
|
||||
def fi_tokenizer():
|
||||
return util.get_lang_class('fi').Defaults.create_tokenizer()
|
||||
return get_lang_class('fi').Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
@pytest.fixture(scope='session')
|
||||
def ro_tokenizer():
|
||||
return util.get_lang_class('ro').Defaults.create_tokenizer()
|
||||
return get_lang_class('ro').Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
@pytest.fixture(scope='session')
|
||||
def id_tokenizer():
|
||||
return util.get_lang_class('id').Defaults.create_tokenizer()
|
||||
return get_lang_class('id').Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
@pytest.fixture(scope='session')
|
||||
def sv_tokenizer():
|
||||
return util.get_lang_class('sv').Defaults.create_tokenizer()
|
||||
return get_lang_class('sv').Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
@pytest.fixture(scope='session')
|
||||
def bn_tokenizer():
|
||||
return util.get_lang_class('bn').Defaults.create_tokenizer()
|
||||
return get_lang_class('bn').Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
@pytest.fixture(scope='session')
|
||||
def ga_tokenizer():
|
||||
return util.get_lang_class('ga').Defaults.create_tokenizer()
|
||||
return get_lang_class('ga').Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
@pytest.fixture(scope='session')
|
||||
def he_tokenizer():
|
||||
return util.get_lang_class('he').Defaults.create_tokenizer()
|
||||
return get_lang_class('he').Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
@pytest.fixture(scope='session')
|
||||
def nb_tokenizer():
|
||||
return util.get_lang_class('nb').Defaults.create_tokenizer()
|
||||
return get_lang_class('nb').Defaults.create_tokenizer()
|
||||
|
||||
@pytest.fixture
|
||||
|
||||
@pytest.fixture(scope='session')
|
||||
def da_tokenizer():
|
||||
return util.get_lang_class('da').Defaults.create_tokenizer()
|
||||
return get_lang_class('da').Defaults.create_tokenizer()
|
||||
|
||||
@pytest.fixture
|
||||
|
||||
@pytest.fixture(scope='session')
|
||||
def ja_tokenizer():
|
||||
mecab = pytest.importorskip("MeCab")
|
||||
return util.get_lang_class('ja').Defaults.create_tokenizer()
|
||||
return get_lang_class('ja').Defaults.create_tokenizer()
|
||||
|
||||
@pytest.fixture
|
||||
|
||||
@pytest.fixture(scope='session')
|
||||
def th_tokenizer():
|
||||
pythainlp = pytest.importorskip("pythainlp")
|
||||
return util.get_lang_class('th').Defaults.create_tokenizer()
|
||||
return get_lang_class('th').Defaults.create_tokenizer()
|
||||
|
||||
@pytest.fixture
|
||||
|
||||
@pytest.fixture(scope='session')
|
||||
def tr_tokenizer():
|
||||
return util.get_lang_class('tr').Defaults.create_tokenizer()
|
||||
return get_lang_class('tr').Defaults.create_tokenizer()
|
||||
|
||||
@pytest.fixture
|
||||
|
||||
@pytest.fixture(scope='session')
|
||||
def tt_tokenizer():
|
||||
return util.get_lang_class('tt').Defaults.create_tokenizer()
|
||||
return get_lang_class('tt').Defaults.create_tokenizer()
|
||||
|
||||
@pytest.fixture
|
||||
|
||||
@pytest.fixture(scope='session')
|
||||
def el_tokenizer():
|
||||
return util.get_lang_class('el').Defaults.create_tokenizer()
|
||||
return get_lang_class('el').Defaults.create_tokenizer()
|
||||
|
||||
@pytest.fixture
|
||||
|
||||
@pytest.fixture(scope='session')
|
||||
def ar_tokenizer():
|
||||
return util.get_lang_class('ar').Defaults.create_tokenizer()
|
||||
return get_lang_class('ar').Defaults.create_tokenizer()
|
||||
|
||||
@pytest.fixture
|
||||
|
||||
@pytest.fixture(scope='session')
|
||||
def ur_tokenizer():
|
||||
return util.get_lang_class('ur').Defaults.create_tokenizer()
|
||||
return get_lang_class('ur').Defaults.create_tokenizer()
|
||||
|
||||
@pytest.fixture
|
||||
|
||||
@pytest.fixture(scope='session')
|
||||
def ru_tokenizer():
|
||||
pymorphy = pytest.importorskip('pymorphy2')
|
||||
return util.get_lang_class('ru').Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def stringstore():
|
||||
return StringStore()
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def en_entityrecognizer():
|
||||
return util.get_lang_class('en').Defaults.create_entity()
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def text_file():
|
||||
return StringIO()
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def text_file_b():
|
||||
return BytesIO()
|
||||
|
||||
|
||||
def pytest_addoption(parser):
|
||||
parser.addoption("--models", action="store_true",
|
||||
help="include tests that require full models")
|
||||
parser.addoption("--vectors", action="store_true",
|
||||
help="include word vectors tests")
|
||||
parser.addoption("--slow", action="store_true",
|
||||
help="include slow tests")
|
||||
|
||||
for lang in _languages + ['all']:
|
||||
parser.addoption("--%s" % lang, action="store_true", help="Use %s models" % lang)
|
||||
for model in _models:
|
||||
if model not in _languages:
|
||||
parser.addoption("--%s" % model, action="store_true", help="Use %s model" % model)
|
||||
|
||||
|
||||
def pytest_runtest_setup(item):
|
||||
for opt in ['models', 'vectors', 'slow']:
|
||||
if opt in item.keywords and not item.config.getoption("--%s" % opt):
|
||||
pytest.skip("need --%s option to run" % opt)
|
||||
|
||||
# Check if test is marked with models and has arguments set, i.e. specific
|
||||
# language. If so, skip test if flag not set.
|
||||
if item.get_marker('models'):
|
||||
for arg in item.get_marker('models').args:
|
||||
if not item.config.getoption("--%s" % arg) and not item.config.getoption("--all"):
|
||||
pytest.skip("need --%s or --all option to run" % arg)
|
||||
return get_lang_class('ru').Defaults.create_tokenizer()
|
||||
|
|
|
@ -1,24 +0,0 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...pipeline import EntityRecognizer
|
||||
from ..util import get_doc
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
def test_doc_add_entities_set_ents_iob(en_vocab):
|
||||
text = ["This", "is", "a", "lion"]
|
||||
doc = get_doc(en_vocab, text)
|
||||
ner = EntityRecognizer(en_vocab)
|
||||
ner.begin_training([])
|
||||
ner(doc)
|
||||
|
||||
assert len(list(doc.ents)) == 0
|
||||
assert [w.ent_iob_ for w in doc] == (['O'] * len(doc))
|
||||
|
||||
doc.ents = [(doc.vocab.strings['ANIMAL'], 3, 4)]
|
||||
assert [w.ent_iob_ for w in doc] == ['', '', '', 'B']
|
||||
|
||||
doc.ents = [(doc.vocab.strings['WORD'], 0, 2)]
|
||||
assert [w.ent_iob_ for w in doc] == ['B', 'I', '', '']
|
|
@ -1,10 +1,9 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...attrs import ORTH, SHAPE, POS, DEP
|
||||
from ..util import get_doc
|
||||
from spacy.attrs import ORTH, SHAPE, POS, DEP
|
||||
|
||||
import pytest
|
||||
from ..util import get_doc
|
||||
|
||||
|
||||
def test_doc_array_attr_of_token(en_tokenizer, en_vocab):
|
||||
|
@ -41,7 +40,7 @@ def test_doc_array_tag(en_tokenizer):
|
|||
text = "A nice sentence."
|
||||
pos = ['DET', 'ADJ', 'NOUN', 'PUNCT']
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], pos=pos)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], pos=pos)
|
||||
assert doc[0].pos != doc[1].pos != doc[2].pos != doc[3].pos
|
||||
feats_array = doc.to_array((ORTH, POS))
|
||||
assert feats_array[0][1] == doc[0].pos
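A note on the chained `!=` assertion in `test_doc_array_tag`: Python evaluates `a != b != c` as `(a != b) and (b != c)`, so only neighbouring values are compared. The sketch below uses stand-in integers rather than real POS IDs to show the difference between that form and a strict "all distinct" check.

```python
# Stand-in integer IDs; real `token.pos` values come from spaCy's symbol table.
values = [1, 2, 1, 3]

# Chained form: only adjacent pairs are compared, so 1 ... 1 goes unnoticed.
chained = values[0] != values[1] != values[2] != values[3]

# Stricter form: every value must be unique.
all_distinct = len(set(values)) == len(values)

print(chained, all_distinct)
```

For the four different tags used in the test both forms agree, so this only matters if the check is reused with repeated tags.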
|
||||
|
@ -54,7 +53,7 @@ def test_doc_array_dep(en_tokenizer):
|
|||
text = "A nice sentence."
|
||||
deps = ['det', 'amod', 'ROOT', 'punct']
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], deps=deps)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], deps=deps)
|
||||
feats_array = doc.to_array((ORTH, DEP))
|
||||
assert feats_array[0][1] == doc[0].dep
|
||||
assert feats_array[1][1] == doc[1].dep
|
||||
|
|
|
@ -1,10 +1,10 @@
|
|||
'''Test Doc sets up tokens correctly.'''
|
||||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
import pytest
|
||||
|
||||
from ...vocab import Vocab
|
||||
from ...tokens.doc import Doc
|
||||
from ...lemmatizer import Lemmatizer
|
||||
import pytest
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.tokens import Doc
|
||||
from spacy.lemmatizer import Lemmatizer
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
|
|
|
@ -1,18 +1,18 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ..util import get_doc
|
||||
from ...tokens import Doc
|
||||
from ...vocab import Vocab
|
||||
from ...attrs import LEMMA
|
||||
|
||||
import pytest
|
||||
import numpy
|
||||
from spacy.tokens import Doc
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.attrs import LEMMA
|
||||
|
||||
from ..util import get_doc
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', [["one", "two", "three"]])
|
||||
def test_doc_api_compare_by_string_position(en_vocab, text):
|
||||
doc = get_doc(en_vocab, text)
|
||||
doc = Doc(en_vocab, words=text)
|
||||
# Get the tokens in this order, so their ID ordering doesn't match the idx
|
||||
token3 = doc[-1]
|
||||
token2 = doc[-2]
|
||||
|
@ -104,18 +104,18 @@ def test_doc_api_getitem(en_tokenizer):
|
|||
" Give it back! He pleaded. "])
|
||||
def test_doc_api_serialize(en_tokenizer, text):
|
||||
tokens = en_tokenizer(text)
|
||||
new_tokens = get_doc(tokens.vocab).from_bytes(tokens.to_bytes())
|
||||
new_tokens = Doc(tokens.vocab).from_bytes(tokens.to_bytes())
|
||||
assert tokens.text == new_tokens.text
|
||||
assert [t.text for t in tokens] == [t.text for t in new_tokens]
|
||||
assert [t.orth for t in tokens] == [t.orth for t in new_tokens]
|
||||
|
||||
new_tokens = get_doc(tokens.vocab).from_bytes(
|
||||
new_tokens = Doc(tokens.vocab).from_bytes(
|
||||
tokens.to_bytes(tensor=False), tensor=False)
|
||||
assert tokens.text == new_tokens.text
|
||||
assert [t.text for t in tokens] == [t.text for t in new_tokens]
|
||||
assert [t.orth for t in tokens] == [t.orth for t in new_tokens]
|
||||
|
||||
new_tokens = get_doc(tokens.vocab).from_bytes(
|
||||
new_tokens = Doc(tokens.vocab).from_bytes(
|
||||
tokens.to_bytes(sentiment=False), sentiment=False)
|
||||
assert tokens.text == new_tokens.text
|
||||
assert [t.text for t in tokens] == [t.text for t in new_tokens]
|
||||
|
@ -199,6 +199,20 @@ def test_doc_api_retokenizer_attrs(en_tokenizer):
|
|||
assert doc[4].ent_type_ == 'ORG'
|
||||
|
||||
|
||||
@pytest.mark.xfail
|
||||
def test_doc_api_retokenizer_lex_attrs(en_tokenizer):
|
||||
"""Test that lexical attributes can be changed (see #2390)."""
|
||||
doc = en_tokenizer("WKRO played beach boys songs")
|
||||
assert not any(token.is_stop for token in doc)
|
||||
with doc.retokenize() as retokenizer:
|
||||
retokenizer.merge(doc[2:4], attrs={'LEMMA': 'boys', 'IS_STOP': True})
|
||||
assert doc[2].text == 'beach boys'
|
||||
assert doc[2].lemma_ == 'boys'
|
||||
assert doc[2].is_stop
|
||||
new_doc = Doc(doc.vocab, words=['beach boys'])
|
||||
assert new_doc[0].is_stop
|
||||
|
||||
|
||||
def test_doc_api_sents_empty_string(en_tokenizer):
|
||||
doc = en_tokenizer("")
|
||||
doc.is_parsed = True
|
||||
|
@ -215,7 +229,7 @@ def test_doc_api_runtime_error(en_tokenizer):
|
|||
'ROOT', 'amod', 'dobj']
|
||||
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], deps=deps)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], deps=deps)
|
||||
|
||||
nps = []
|
||||
for np in doc.noun_chunks:
|
||||
|
@ -235,7 +249,7 @@ def test_doc_api_right_edge(en_tokenizer):
|
|||
-2, -7, 1, -19, 1, -2, -3, 2, 1, -3, -26]
|
||||
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||
assert doc[6].text == 'for'
|
||||
subtree = [w.text for w in doc[6].subtree]
|
||||
assert subtree == ['for', 'the', 'sake', 'of', 'such', 'as',
|
||||
|
@ -264,7 +278,7 @@ def test_doc_api_similarity_match():
|
|||
|
||||
def test_lowest_common_ancestor(en_tokenizer):
|
||||
tokens = en_tokenizer('the lazy dog slept')
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=[2, 1, 1, 0])
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=[2, 1, 1, 0])
|
||||
lca = doc.get_lca_matrix()
|
||||
assert(lca[1, 1] == 1)
|
||||
assert(lca[0, 1] == 2)
|
||||
|
@ -277,7 +291,7 @@ def test_parse_tree(en_tokenizer):
|
|||
heads = [1, 0, 1, -2, -3, -1, -5]
|
||||
tags = ['PRP', 'IN', 'NNP', 'NNP', 'IN', 'NNP', '.']
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, tags=tags)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, tags=tags)
|
||||
# full method parse_tree(text) is a trivial composition
|
||||
trees = doc.print_tree()
|
||||
assert len(trees) > 0
|
||||
|
|
|
@ -1,12 +1,13 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...language import Language
|
||||
from ...compat import pickle, unicode_
|
||||
from spacy.language import Language
|
||||
from spacy.compat import pickle, unicode_
|
||||
|
||||
|
||||
def test_pickle_single_doc():
|
||||
nlp = Language()
|
||||
doc = nlp(u'pickle roundtrip')
|
||||
doc = nlp('pickle roundtrip')
|
||||
data = pickle.dumps(doc, 1)
|
||||
doc2 = pickle.loads(data)
|
||||
assert doc2.text == 'pickle roundtrip'
|
||||
|
@ -16,7 +17,7 @@ def test_list_of_docs_pickles_efficiently():
|
|||
nlp = Language()
|
||||
for i in range(10000):
|
||||
_ = nlp.vocab[unicode_(i)]
|
||||
one_pickled = pickle.dumps(nlp(u'0'), -1)
|
||||
one_pickled = pickle.dumps(nlp('0'), -1)
|
||||
docs = list(nlp.pipe(unicode_(i) for i in range(100)))
|
||||
many_pickled = pickle.dumps(docs, -1)
|
||||
assert len(many_pickled) < (len(one_pickled) * 2)
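The size check above presumably passes because all docs share one vocab, and `pickle` writes an object that occurs several times within a single `dumps` call only once. Here is a stdlib-only sketch of that effect, with plain dicts and a list standing in for `Doc` and `Vocab`. These stand-ins are assumptions for illustration, not spaCy's serialization code.

```python
import pickle

# Stand-in for a large shared vocab.
shared_vocab = [str(i) for i in range(10000)]

# Stand-ins for docs: a small payload plus a reference to the same vocab object.
one_doc = {"text": "0", "vocab": shared_vocab}
many_docs = [{"text": str(i), "vocab": shared_vocab} for i in range(100)]

one_pickled = pickle.dumps(one_doc, -1)
many_pickled = pickle.dumps(many_docs, -1)

# The shared list is serialized once; later occurrences become short back-references.
print(len(one_pickled), len(many_pickled))
```

Pickling each stand-in doc in a separate `dumps` call would repeat the vocab every time, which is why the comparison is made against a single pickled list.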
|
||||
|
@ -28,7 +29,7 @@ def test_list_of_docs_pickles_efficiently():
|
|||
|
||||
def test_user_data_from_disk():
|
||||
nlp = Language()
|
||||
doc = nlp(u'Hello')
|
||||
doc = nlp('Hello')
|
||||
doc.user_data[(0, 1)] = False
|
||||
b = doc.to_bytes()
|
||||
doc2 = doc.__class__(doc.vocab).from_bytes(b)
|
||||
|
@ -36,7 +37,7 @@ def test_user_data_from_disk():
|
|||
|
||||
def test_user_data_unpickles():
|
||||
nlp = Language()
|
||||
doc = nlp(u'Hello')
|
||||
doc = nlp('Hello')
|
||||
doc.user_data[(0, 1)] = False
|
||||
b = pickle.dumps(doc)
|
||||
doc2 = pickle.loads(b)
|
||||
|
@ -47,7 +48,7 @@ def test_hooks_unpickle():
|
|||
def inner_func(d1, d2):
|
||||
return 'hello!'
|
||||
nlp = Language()
|
||||
doc = nlp(u'Hello')
|
||||
doc = nlp('Hello')
|
||||
doc.user_hooks['similarity'] = inner_func
|
||||
b = pickle.dumps(doc)
|
||||
doc2 = pickle.loads(b)
|
||||
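Note on the size assertion in `test_list_of_docs_pickles_efficiently`: it presumably relies on pickle writing an object that is shared by many items only once per `dumps` call. The sketch below reproduces that behaviour with plain Python containers. `shared_table` is a hypothetical stand-in for the shared vocab, not spaCy's actual data structure.

```python
import pickle

# Hypothetical stand-in for a large object shared by many documents.
shared_table = [str(i) for i in range(10000)]

# One "document": a tuple holding the shared table plus its own text.
one_pickled = pickle.dumps((shared_table, '0'), -1)

# A hundred "documents" that all reference the same table object.
docs = [(shared_table, str(i)) for i in range(100)]
many_pickled = pickle.dumps(docs, -1)

# pickle memoizes objects within a single dumps() call, so the table is
# serialized once and every later reference is a short back-pointer.
assert len(many_pickled) < len(one_pickled) * 2

# The shared object is still a single object after loading.
loaded = pickle.loads(many_pickled)
assert loaded[0][0] is loaded[99][0]
```

Pickling each item with its own `dumps` call would lose this sharing, which is probably why the test pickles the whole list at once.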
|
|
|
@ -1,12 +1,12 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ..util import get_doc
|
||||
from ...attrs import ORTH, LENGTH
|
||||
from ...tokens import Doc
|
||||
from ...vocab import Vocab
|
||||
|
||||
import pytest
|
||||
from spacy.attrs import ORTH, LENGTH
|
||||
from spacy.tokens import Doc
|
||||
from spacy.vocab import Vocab
|
||||
|
||||
from ..util import get_doc
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
|
@ -16,16 +16,16 @@ def doc(en_tokenizer):
|
|||
deps = ['nsubj', 'ROOT', 'det', 'attr', 'punct', 'nsubj', 'ROOT', 'det',
|
||||
'attr', 'punct', 'ROOT', 'det', 'npadvmod', 'punct']
|
||||
tokens = en_tokenizer(text)
|
||||
return get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)
|
||||
return get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def doc_not_parsed(en_tokenizer):
|
||||
text = "This is a sentence. This is another sentence. And a third."
|
||||
tokens = en_tokenizer(text)
|
||||
d = get_doc(tokens.vocab, [t.text for t in tokens])
|
||||
d.is_parsed = False
|
||||
return d
|
||||
doc = Doc(tokens.vocab, words=[t.text for t in tokens])
|
||||
doc.is_parsed = False
|
||||
return doc
|
||||
|
||||
|
||||
def test_spans_sent_spans(doc):
|
||||
|
@ -56,7 +56,7 @@ def test_spans_root2(en_tokenizer):
|
|||
text = "through North and South Carolina"
|
||||
heads = [0, 3, -1, -2, -4]
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||
assert doc[-2:].root.text == 'Carolina'
|
||||
|
||||
|
||||
|
@ -76,7 +76,7 @@ def test_spans_span_sent(doc, doc_not_parsed):
|
|||
def test_spans_lca_matrix(en_tokenizer):
|
||||
"""Test span's lca matrix generation"""
|
||||
tokens = en_tokenizer('the lazy dog slept')
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=[2, 1, 1, 0])
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=[2, 1, 1, 0])
|
||||
lca = doc[:2].get_lca_matrix()
|
||||
assert(lca[0, 0] == 0)
|
||||
assert(lca[0, 1] == -1)
|
||||
|
@ -100,7 +100,7 @@ def test_spans_default_sentiment(en_tokenizer):
|
|||
tokens = en_tokenizer(text)
|
||||
tokens.vocab[tokens[0].text].sentiment = 3.0
|
||||
tokens.vocab[tokens[2].text].sentiment = -2.0
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens])
|
||||
doc = Doc(tokens.vocab, words=[t.text for t in tokens])
|
||||
assert doc[:2].sentiment == 3.0 / 2
|
||||
assert doc[-2:].sentiment == -2. / 2
|
||||
assert doc[:-1].sentiment == (3.+-2) / 3.
|
||||
|
@ -112,7 +112,7 @@ def test_spans_override_sentiment(en_tokenizer):
|
|||
tokens = en_tokenizer(text)
|
||||
tokens.vocab[tokens[0].text].sentiment = 3.0
|
||||
tokens.vocab[tokens[2].text].sentiment = -2.0
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens])
|
||||
doc = Doc(tokens.vocab, words=[t.text for t in tokens])
|
||||
doc.user_span_hooks['sentiment'] = lambda span: 10.0
|
||||
assert doc[:2].sentiment == 10.0
|
||||
assert doc[-2:].sentiment == 10.0
|
||||
|
@ -146,7 +146,7 @@ def test_span_to_array(doc):
|
|||
assert arr[0, 1] == len(span[0])
|
||||
|
||||
|
||||
#def test_span_as_doc(doc):
|
||||
# span = doc[4:10]
|
||||
# span_doc = span.as_doc()
|
||||
# assert span.text == span_doc.text.strip()
|
||||
def test_span_as_doc(doc):
|
||||
span = doc[4:10]
|
||||
span_doc = span.as_doc()
|
||||
assert span.text == span_doc.text.strip()
|
||||
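Note on the two sentiment tests above: their assertions are consistent with a span's default sentiment being the mean of its tokens' sentiment values, and with a callable registered under `user_span_hooks['sentiment']` taking precedence. The following is a rough sketch in plain Python, not spaCy's implementation. The helper `span_sentiment` is hypothetical, and the four-token length is inferred from the slices used in the assertions.

```python
def span_sentiment(token_sentiments, hooks=None):
    """Return the hook's value if one is registered, else the mean sentiment."""
    hooks = hooks or {}
    if 'sentiment' in hooks:
        return hooks['sentiment'](token_sentiments)
    return sum(token_sentiments) / len(token_sentiments)


# Token 0 is set to 3.0 and token 2 to -2.0; the others are assumed to be 0.0.
sentiments = [3.0, 0.0, -2.0, 0.0]

# Default behaviour: arithmetic mean over the slice.
assert span_sentiment(sentiments[:2]) == 3.0 / 2
assert span_sentiment(sentiments[-2:]) == -2. / 2
assert span_sentiment(sentiments[:-1]) == (3. + -2) / 3.

# Override behaviour: the hook decides the value for every span.
hooks = {'sentiment': lambda span: 10.0}
assert span_sentiment(sentiments[:2], hooks) == 10.0
assert span_sentiment(sentiments[-2:], hooks) == 10.0
```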
|
|
|
@ -1,18 +1,17 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ..util import get_doc
|
||||
from ...vocab import Vocab
|
||||
from ...tokens import Doc
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.tokens import Doc
|
||||
|
||||
import pytest
|
||||
from ..util import get_doc
|
||||
|
||||
|
||||
def test_spans_merge_tokens(en_tokenizer):
|
||||
text = "Los Angeles start."
|
||||
heads = [1, 1, 0, -1]
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||
assert len(doc) == 4
|
||||
assert doc[0].head.text == 'Angeles'
|
||||
assert doc[1].head.text == 'start'
|
||||
|
@ -21,7 +20,7 @@ def test_spans_merge_tokens(en_tokenizer):
|
|||
assert doc[0].text == 'Los Angeles'
|
||||
assert doc[0].head.text == 'start'
|
||||
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||
assert len(doc) == 4
|
||||
assert doc[0].head.text == 'Angeles'
|
||||
assert doc[1].head.text == 'start'
|
||||
|
@ -35,7 +34,7 @@ def test_spans_merge_heads(en_tokenizer):
|
|||
text = "I found a pilates class near work."
|
||||
heads = [1, 0, 2, 1, -3, -1, -1, -6]
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||
|
||||
assert len(doc) == 8
|
||||
doc.merge(doc[3].idx, doc[4].idx + len(doc[4]), tag=doc[4].tag_,
|
||||
|
@ -53,7 +52,7 @@ def test_span_np_merges(en_tokenizer):
|
|||
text = "displaCy is a parse tool built with Javascript"
|
||||
heads = [1, 0, 2, 1, -3, -1, -1, -1]
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||
|
||||
assert doc[4].head.i == 1
|
||||
doc.merge(doc[2].idx, doc[4].idx + len(doc[4]), tag='NP', lemma='tool',
|
||||
|
@ -63,7 +62,7 @@ def test_span_np_merges(en_tokenizer):
|
|||
text = "displaCy is a lightweight and modern dependency parse tree visualization tool built with CSS3 and JavaScript."
|
||||
heads = [1, 0, 8, 3, -1, -2, 4, 3, 1, 1, -9, -1, -1, -1, -1, -2, -15]
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||
|
||||
ents = [(e[0].idx, e[-1].idx + len(e[-1]), e.label_, e.lemma_) for e in doc.ents]
|
||||
for start, end, label, lemma in ents:
|
||||
|
@ -74,8 +73,7 @@ def test_span_np_merges(en_tokenizer):
|
|||
text = "One test with entities like New York City so the ents list is not void"
|
||||
heads = [1, 11, -1, -1, -1, 1, 1, -3, 4, 2, 1, 1, 0, -1, -2]
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
|
||||
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||
for span in doc.ents:
|
||||
merged = doc.merge()
|
||||
assert merged != None, (span.start, span.end, span.label_, span.lemma_)
|
||||
|
@ -85,10 +83,9 @@ def test_spans_entity_merge(en_tokenizer):
|
|||
text = "Stewart Lee is a stand up comedian who lives in England and loves Joe Pasquale.\n"
|
||||
heads = [1, 1, 0, 1, 2, -1, -4, 1, -2, -1, -1, -3, -10, 1, -2, -13, -1]
|
||||
tags = ['NNP', 'NNP', 'VBZ', 'DT', 'VB', 'RP', 'NN', 'WP', 'VBZ', 'IN', 'NNP', 'CC', 'VBZ', 'NNP', 'NNP', '.', 'SP']
|
||||
ents = [('Stewart Lee', 'PERSON', 0, 2), ('England', 'GPE', 10, 11), ('Joe Pasquale', 'PERSON', 13, 15)]
|
||||
|
||||
ents = [(0, 2, 'PERSON'), (10, 11, 'GPE'), (13, 15, 'PERSON')]
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, tags=tags, ents=ents)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, tags=tags, ents=ents)
|
||||
assert len(doc) == 17
|
||||
for ent in doc.ents:
|
||||
label, lemma, type_ = (ent.root.tag_, ent.root.lemma_, max(w.ent_type_ for w in ent))
|
||||
|
@ -120,7 +117,7 @@ def test_spans_sentence_update_after_merge(en_tokenizer):
|
|||
'compound', 'dobj', 'punct']
|
||||
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
|
||||
sent1, sent2 = list(doc.sents)
|
||||
init_len = len(sent1)
|
||||
init_len2 = len(sent2)
|
||||
|
@ -138,7 +135,7 @@ def test_spans_subtree_size_check(en_tokenizer):
|
|||
'dobj']
|
||||
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
|
||||
sent1 = list(doc.sents)[0]
|
||||
init_len = len(list(sent1.root.subtree))
|
||||
doc[0:2].merge(label='none', lemma='none', ent_type='none')
|
||||
|
|
|
@ -1,14 +1,24 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...attrs import IS_ALPHA, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_TITLE, IS_STOP
|
||||
from ...symbols import NOUN, VERB
|
||||
from ..util import get_doc
|
||||
from ...vocab import Vocab
|
||||
from ...tokens import Doc
|
||||
|
||||
import pytest
|
||||
import numpy
|
||||
from spacy.attrs import IS_ALPHA, IS_DIGIT, IS_LOWER, IS_PUNCT, IS_TITLE, IS_STOP
|
||||
from spacy.symbols import VERB
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.tokens import Doc
|
||||
|
||||
from ..util import get_doc
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def doc(en_tokenizer):
|
||||
text = "This is a sentence. This is another sentence. And a third."
|
||||
heads = [1, 0, 1, -2, -3, 1, 0, 1, -2, -3, 0, 1, -2, -1]
|
||||
deps = ['nsubj', 'ROOT', 'det', 'attr', 'punct', 'nsubj', 'ROOT', 'det',
|
||||
'attr', 'punct', 'ROOT', 'det', 'npadvmod', 'punct']
|
||||
tokens = en_tokenizer(text)
|
||||
return get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
|
||||
|
||||
|
||||
def test_doc_token_api_strings(en_tokenizer):
|
||||
|
@ -18,7 +28,7 @@ def test_doc_token_api_strings(en_tokenizer):
|
|||
deps = ['ROOT', 'dobj', 'prt', 'punct', 'nsubj', 'ROOT', 'punct']
|
||||
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], pos=pos, heads=heads, deps=deps)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], pos=pos, heads=heads, deps=deps)
|
||||
assert doc[0].orth_ == 'Give'
|
||||
assert doc[0].text == 'Give'
|
||||
assert doc[0].text_with_ws == 'Give '
|
||||
|
@ -57,18 +67,9 @@ def test_doc_token_api_str_builtin(en_tokenizer, text):
|
|||
assert str(tokens[0]) == text.split(' ')[0]
|
||||
assert str(tokens[1]) == text.split(' ')[1]
|
||||
|
||||
@pytest.fixture
|
||||
def doc(en_tokenizer):
|
||||
text = "This is a sentence. This is another sentence. And a third."
|
||||
heads = [1, 0, 1, -2, -3, 1, 0, 1, -2, -3, 0, 1, -2, -1]
|
||||
deps = ['nsubj', 'ROOT', 'det', 'attr', 'punct', 'nsubj', 'ROOT', 'det',
|
||||
'attr', 'punct', 'ROOT', 'det', 'npadvmod', 'punct']
|
||||
tokens = en_tokenizer(text)
|
||||
return get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)
|
||||
|
||||
def test_doc_token_api_is_properties(en_vocab):
|
||||
text = ["Hi", ",", "my", "email", "is", "test@me.com"]
|
||||
doc = get_doc(en_vocab, text)
|
||||
doc = Doc(en_vocab, words=["Hi", ",", "my", "email", "is", "test@me.com"])
|
||||
assert doc[0].is_title
|
||||
assert doc[0].is_alpha
|
||||
assert not doc[0].is_digit
|
||||
|
@ -86,7 +87,6 @@ def test_doc_token_api_vectors():
|
|||
vocab.set_vector('oranges', vector=numpy.asarray([0., 1.], dtype='f'))
|
||||
doc = Doc(vocab, words=['apples', 'oranges', 'oov'])
|
||||
assert doc.has_vector
|
||||
|
||||
assert doc[0].has_vector
|
||||
assert doc[1].has_vector
|
||||
assert not doc[2].has_vector
|
||||
|
@ -101,7 +101,7 @@ def test_doc_token_api_ancestors(en_tokenizer):
|
|||
text = "Yesterday I saw a dog that barked loudly."
|
||||
heads = [2, 1, 0, 1, -2, 1, -2, -1, -6]
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||
assert [t.text for t in doc[6].ancestors] == ["dog", "saw"]
|
||||
assert [t.text for t in doc[1].ancestors] == ["saw"]
|
||||
assert [t.text for t in doc[2].ancestors] == []
|
||||
|
@ -115,7 +115,7 @@ def test_doc_token_api_head_setter(en_tokenizer):
|
|||
text = "Yesterday I saw a dog that barked loudly."
|
||||
heads = [2, 1, 0, 1, -2, 1, -2, -1, -6]
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||
|
||||
assert doc[6].n_lefts == 1
|
||||
assert doc[6].n_rights == 1
|
||||
|
@ -165,7 +165,7 @@ def test_doc_token_api_head_setter(en_tokenizer):
|
|||
|
||||
|
||||
def test_is_sent_start(en_tokenizer):
|
||||
doc = en_tokenizer(u'This is a sentence. This is another.')
|
||||
doc = en_tokenizer('This is a sentence. This is another.')
|
||||
assert doc[5].is_sent_start is None
|
||||
doc[5].is_sent_start = True
|
||||
assert doc[5].is_sent_start is True
|
||||
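Note on the `heads` lists passed to `get_doc` throughout these tests: they appear to be offsets relative to each token's own position, with `0` marking a root. This is an assumption inferred from the assertions, not confirmed against the helper's source. The sketch below uses the sentence and expected ancestors from `test_doc_token_api_ancestors`; the helper names `absolute_heads` and `ancestors` are hypothetical.

```python
def absolute_heads(rel_heads):
    """Convert per-token relative head offsets into absolute token indices."""
    return [i + offset for i, offset in enumerate(rel_heads)]


def ancestors(rel_heads, i):
    """Walk from token i up to the root, collecting each head index on the way."""
    heads = absolute_heads(rel_heads)
    chain = []
    while heads[i] != i:  # a root token is its own head (offset 0)
        i = heads[i]
        chain.append(i)
    return chain


words = ["Yesterday", "I", "saw", "a", "dog", "that", "barked", "loudly", "."]
heads = [2, 1, 0, 1, -2, 1, -2, -1, -6]

# "barked" attaches to "dog", which attaches to the root "saw".
assert [words[i] for i in ancestors(heads, 6)] == ["dog", "saw"]
assert [words[i] for i in ancestors(heads, 1)] == ["saw"]
assert [words[i] for i in ancestors(heads, 2)] == []
```

Reading the lists this way also matches `test_spans_merge_tokens`, where `heads = [1, 1, 0, -1]` gives "Los" the head "Angeles" and "Angeles" the head "start".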
|
|
|
@ -3,10 +3,8 @@ from __future__ import unicode_literals
|
|||
|
||||
import pytest
|
||||
from mock import Mock
|
||||
|
||||
from ..vocab import Vocab
|
||||
from ..tokens import Doc, Span, Token
|
||||
from ..tokens.underscore import Underscore
|
||||
from spacy.tokens import Doc, Span, Token
|
||||
from spacy.tokens.underscore import Underscore
|
||||
|
||||
|
||||
def test_create_doc_underscore():
|
|
@ -4,15 +4,14 @@ from __future__ import unicode_literals
|
|||
import pytest
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text',
|
||||
["ق.م", "إلخ", "ص.ب", "ت."])
|
||||
@pytest.mark.parametrize('text', ["ق.م", "إلخ", "ص.ب", "ت."])
|
||||
def test_ar_tokenizer_handles_abbr(ar_tokenizer, text):
|
||||
tokens = ar_tokenizer(text)
|
||||
assert len(tokens) == 1
|
||||
|
||||
|
||||
def test_ar_tokenizer_handles_exc_in_text(ar_tokenizer):
|
||||
text = u"تعود الكتابة الهيروغليفية إلى سنة 3200 ق.م"
|
||||
text = "تعود الكتابة الهيروغليفية إلى سنة 3200 ق.م"
|
||||
tokens = ar_tokenizer(text)
|
||||
assert len(tokens) == 7
|
||||
assert tokens[6].text == "ق.م"
|
||||
|
@ -20,7 +19,6 @@ def test_ar_tokenizer_handles_exc_in_text(ar_tokenizer):
|
|||
|
||||
|
||||
def test_ar_tokenizer_handles_exc_in_text(ar_tokenizer):
|
||||
text = u"يبلغ طول مضيق طارق 14كم "
|
||||
text = "يبلغ طول مضيق طارق 14كم "
|
||||
tokens = ar_tokenizer(text)
|
||||
print([(tokens[i].text, tokens[i].suffix_) for i in range(len(tokens))])
|
||||
assert len(tokens) == 6
|
||||
|
|
|
@ -2,7 +2,7 @@
|
|||
from __future__ import unicode_literals
|
||||
|
||||
|
||||
def test_tokenizer_handles_long_text(ar_tokenizer):
|
||||
def test_ar_tokenizer_handles_long_text(ar_tokenizer):
|
||||
text = """نجيب محفوظ مؤلف و كاتب روائي عربي، يعد من أهم الأدباء العرب خلال القرن العشرين.
|
||||
ولد نجيب محفوظ في مدينة القاهرة، حيث ترعرع و تلقى تعليمه الجامعي في جامعتها،
|
||||
فتمكن من نيل شهادة في الفلسفة. ألف محفوظ على مدار حياته الكثير من الأعمال الأدبية، و في مقدمتها ثلاثيته الشهيرة.
|
||||
|
|
|
@ -3,38 +3,32 @@ from __future__ import unicode_literals
|
|||
|
||||
import pytest
|
||||
|
||||
TESTCASES = []
|
||||
|
||||
PUNCTUATION_TESTS = [
|
||||
(u'আমি বাংলায় গান গাই!', [u'আমি', u'বাংলায়', u'গান', u'গাই', u'!']),
|
||||
(u'আমি বাংলায় কথা কই।', [u'আমি', u'বাংলায়', u'কথা', u'কই', u'।']),
|
||||
(u'বসুন্ধরা জনসম্মুখে দোষ স্বীকার করলো না?', [u'বসুন্ধরা', u'জনসম্মুখে', u'দোষ', u'স্বীকার', u'করলো', u'না', u'?']),
|
||||
(u'টাকা থাকলে কি না হয়!', [u'টাকা', u'থাকলে', u'কি', u'না', u'হয়', u'!']),
|
||||
TESTCASES = [
|
||||
# punctuation tests
|
||||
('আমি বাংলায় গান গাই!', ['আমি', 'বাংলায়', 'গান', 'গাই', '!']),
|
||||
('আমি বাংলায় কথা কই।', ['আমি', 'বাংলায়', 'কথা', 'কই', '।']),
|
||||
('বসুন্ধরা জনসম্মুখে দোষ স্বীকার করলো না?', ['বসুন্ধরা', 'জনসম্মুখে', 'দোষ', 'স্বীকার', 'করলো', 'না', '?']),
|
||||
('টাকা থাকলে কি না হয়!', ['টাকা', 'থাকলে', 'কি', 'না', 'হয়', '!']),
|
||||
# abbreviations
|
||||
('ডঃ খালেদ বললেন ঢাকায় ৩৫ ডিগ্রি সে.।', ['ডঃ', 'খালেদ', 'বললেন', 'ঢাকায়', '৩৫', 'ডিগ্রি', 'সে.', '।'])
|
||||
]
|
||||
|
||||
ABBREVIATIONS = [
|
||||
(u'ডঃ খালেদ বললেন ঢাকায় ৩৫ ডিগ্রি সে.।', [u'ডঃ', u'খালেদ', u'বললেন', u'ঢাকায়', u'৩৫', u'ডিগ্রি', u'সে.', u'।'])
|
||||
]
|
||||
|
||||
TESTCASES.extend(PUNCTUATION_TESTS)
|
||||
TESTCASES.extend(ABBREVIATIONS)
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text,expected_tokens', TESTCASES)
|
||||
def test_tokenizer_handles_testcases(bn_tokenizer, text, expected_tokens):
|
||||
def test_bn_tokenizer_handles_testcases(bn_tokenizer, text, expected_tokens):
|
||||
tokens = bn_tokenizer(text)
|
||||
token_list = [token.text for token in tokens if not token.is_space]
|
||||
assert expected_tokens == token_list
|
||||
|
||||
|
||||
def test_tokenizer_handles_long_text(bn_tokenizer):
|
||||
text = u"""নর্থ সাউথ বিশ্ববিদ্যালয়ে সারাবছর কোন না কোন বিষয়ে গবেষণা চলতেই থাকে। \
|
||||
def test_bn_tokenizer_handles_long_text(bn_tokenizer):
|
||||
text = """নর্থ সাউথ বিশ্ববিদ্যালয়ে সারাবছর কোন না কোন বিষয়ে গবেষণা চলতেই থাকে। \
|
||||
অভিজ্ঞ ফ্যাকাল্টি মেম্বারগণ প্রায়ই শিক্ষার্থীদের নিয়ে বিভিন্ন গবেষণা প্রকল্পে কাজ করেন, \
|
||||
যার মধ্যে রয়েছে রোবট থেকে মেশিন লার্নিং সিস্টেম ও আর্টিফিশিয়াল ইন্টেলিজেন্স। \
|
||||
এসকল প্রকল্পে কাজ করার মাধ্যমে সংশ্লিষ্ট ক্ষেত্রে যথেষ্ঠ পরিমাণ স্পেশালাইজড হওয়া সম্ভব। \
|
||||
আর গবেষণার কাজ তোমার ক্যারিয়ারকে ঠেলে নিয়ে যাবে অনেকখানি! \
|
||||
কন্টেস্ট প্রোগ্রামার হও, গবেষক কিংবা ডেভেলপার - নর্থ সাউথ ইউনিভার্সিটিতে তোমার প্রতিভা বিকাশের সুযোগ রয়েছেই। \
|
||||
নর্থ সাউথের অসাধারণ কমিউনিটিতে তোমাকে সাদর আমন্ত্রণ।"""
|
||||
|
||||
tokens = bn_tokenizer(text)
|
||||
assert len(tokens) == 84
|
||||
|
|
|
@ -3,28 +3,32 @@ from __future__ import unicode_literals
|
|||
|
||||
import pytest
|
||||
|
||||
@pytest.mark.parametrize('text',
|
||||
["ca.", "m.a.o.", "Jan.", "Dec.", "kr.", "jf."])
|
||||
|
||||
@pytest.mark.parametrize('text', ["ca.", "m.a.o.", "Jan.", "Dec.", "kr.", "jf."])
|
||||
def test_da_tokenizer_handles_abbr(da_tokenizer, text):
|
||||
tokens = da_tokenizer(text)
|
||||
assert len(tokens) == 1
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["Jul.", "jul.", "Tor.", "Tors."])
|
||||
def test_da_tokenizer_handles_ambiguous_abbr(da_tokenizer, text):
|
||||
tokens = da_tokenizer(text)
|
||||
assert len(tokens) == 2
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["1.", "10.", "31."])
|
||||
def test_da_tokenizer_handles_dates(da_tokenizer, text):
|
||||
tokens = da_tokenizer(text)
|
||||
assert len(tokens) == 1
|
||||
|
||||
|
||||
def test_da_tokenizer_handles_exc_in_text(da_tokenizer):
|
||||
text = "Det er bl.a. ikke meningen"
|
||||
tokens = da_tokenizer(text)
|
||||
assert len(tokens) == 5
|
||||
assert tokens[2].text == "bl.a."
|
||||
|
||||
|
||||
def test_da_tokenizer_handles_custom_base_exc(da_tokenizer):
|
||||
text = "Her er noget du kan kigge i."
|
||||
tokens = da_tokenizer(text)
|
||||
|
@ -32,8 +36,9 @@ def test_da_tokenizer_handles_custom_base_exc(da_tokenizer):
|
|||
assert tokens[6].text == "i"
|
||||
assert tokens[7].text == "."
|
||||
|
||||
@pytest.mark.parametrize('text,norm',
|
||||
[("akvarium", "akvarie"), ("bedstemoder", "bedstemor")])
|
||||
|
||||
@pytest.mark.parametrize('text,norm', [
|
||||
("akvarium", "akvarie"), ("bedstemoder", "bedstemor")])
|
||||
def test_da_tokenizer_norm_exceptions(da_tokenizer, text, norm):
|
||||
tokens = da_tokenizer(text)
|
||||
assert tokens[0].norm_ == norm
|
||||
|
|
|
@ -4,10 +4,11 @@ from __future__ import unicode_literals
|
|||
import pytest
|
||||
|
||||
|
||||
@pytest.mark.parametrize('string,lemma', [('affaldsgruppernes', 'affaldsgruppe'),
|
||||
('detailhandelsstrukturernes', 'detailhandelsstruktur'),
|
||||
('kolesterols', 'kolesterol'),
|
||||
('åsyns', 'åsyn')])
|
||||
def test_lemmatizer_lookup_assigns(da_tokenizer, string, lemma):
|
||||
@pytest.mark.parametrize('string,lemma', [
|
||||
('affaldsgruppernes', 'affaldsgruppe'),
|
||||
('detailhandelsstrukturernes', 'detailhandelsstruktur'),
|
||||
('kolesterols', 'kolesterol'),
|
||||
('åsyns', 'åsyn')])
|
||||
def test_da_lemmatizer_lookup_assigns(da_tokenizer, string, lemma):
|
||||
tokens = da_tokenizer(string)
|
||||
assert tokens[0].lemma_ == lemma
|
||||
|
|
|
@ -1,24 +1,23 @@
|
|||
# coding: utf-8
|
||||
"""Test that tokenizer prefixes, suffixes and infixes are handled correctly."""
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["(under)"])
|
||||
def test_tokenizer_splits_no_special(da_tokenizer, text):
|
||||
def test_da_tokenizer_splits_no_special(da_tokenizer, text):
|
||||
tokens = da_tokenizer(text)
|
||||
assert len(tokens) == 3
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["ta'r", "Søren's", "Lars'"])
|
||||
def test_tokenizer_handles_no_punct(da_tokenizer, text):
|
||||
def test_da_tokenizer_handles_no_punct(da_tokenizer, text):
|
||||
tokens = da_tokenizer(text)
|
||||
assert len(tokens) == 1
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["(ta'r"])
|
||||
def test_tokenizer_splits_prefix_punct(da_tokenizer, text):
|
||||
def test_da_tokenizer_splits_prefix_punct(da_tokenizer, text):
|
||||
tokens = da_tokenizer(text)
|
||||
assert len(tokens) == 2
|
||||
assert tokens[0].text == "("
|
||||
|
@ -26,22 +25,23 @@ def test_tokenizer_splits_prefix_punct(da_tokenizer, text):
|
|||
|
||||
|
||||
@pytest.mark.parametrize('text', ["ta'r)"])
|
||||
def test_tokenizer_splits_suffix_punct(da_tokenizer, text):
|
||||
def test_da_tokenizer_splits_suffix_punct(da_tokenizer, text):
|
||||
tokens = da_tokenizer(text)
|
||||
assert len(tokens) == 2
|
||||
assert tokens[0].text == "ta'r"
|
||||
assert tokens[1].text == ")"
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text,expected', [("(ta'r)", ["(", "ta'r", ")"]), ("'ta'r'", ["'", "ta'r", "'"])])
|
||||
def test_tokenizer_splits_even_wrap(da_tokenizer, text, expected):
|
||||
@pytest.mark.parametrize('text,expected', [
|
||||
("(ta'r)", ["(", "ta'r", ")"]), ("'ta'r'", ["'", "ta'r", "'"])])
|
||||
def test_da_tokenizer_splits_even_wrap(da_tokenizer, text, expected):
|
||||
tokens = da_tokenizer(text)
|
||||
assert len(tokens) == len(expected)
|
||||
assert [t.text for t in tokens] == expected
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["(ta'r?)"])
|
||||
def test_tokenizer_splits_uneven_wrap(da_tokenizer, text):
|
||||
def test_da_tokenizer_splits_uneven_wrap(da_tokenizer, text):
|
||||
tokens = da_tokenizer(text)
|
||||
assert len(tokens) == 4
|
||||
assert tokens[0].text == "("
|
||||
|
@ -50,15 +50,16 @@ def test_tokenizer_splits_uneven_wrap(da_tokenizer, text):
|
|||
assert tokens[3].text == ")"
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text,expected', [("f.eks.", ["f.eks."]), ("fe.", ["fe", "."]), ("(f.eks.", ["(", "f.eks."])])
|
||||
def test_tokenizer_splits_prefix_interact(da_tokenizer, text, expected):
|
||||
@pytest.mark.parametrize('text,expected', [
|
||||
("f.eks.", ["f.eks."]), ("fe.", ["fe", "."]), ("(f.eks.", ["(", "f.eks."])])
|
||||
def test_da_tokenizer_splits_prefix_interact(da_tokenizer, text, expected):
|
||||
tokens = da_tokenizer(text)
|
||||
assert len(tokens) == len(expected)
|
||||
assert [t.text for t in tokens] == expected
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["f.eks.)"])
|
||||
def test_tokenizer_splits_suffix_interact(da_tokenizer, text):
|
||||
def test_da_tokenizer_splits_suffix_interact(da_tokenizer, text):
|
||||
tokens = da_tokenizer(text)
|
||||
assert len(tokens) == 2
|
||||
assert tokens[0].text == "f.eks."
|
||||
|
@ -66,7 +67,7 @@ def test_tokenizer_splits_suffix_interact(da_tokenizer, text):
|
|||
|
||||
|
||||
@pytest.mark.parametrize('text', ["(f.eks.)"])
|
||||
def test_tokenizer_splits_even_wrap_interact(da_tokenizer, text):
|
||||
def test_da_tokenizer_splits_even_wrap_interact(da_tokenizer, text):
|
||||
tokens = da_tokenizer(text)
|
||||
assert len(tokens) == 3
|
||||
assert tokens[0].text == "("
|
||||
|
@ -75,7 +76,7 @@ def test_tokenizer_splits_even_wrap_interact(da_tokenizer, text):
|
|||
|
||||
|
||||
@pytest.mark.parametrize('text', ["(f.eks.?)"])
|
||||
def test_tokenizer_splits_uneven_wrap_interact(da_tokenizer, text):
|
||||
def test_da_tokenizer_splits_uneven_wrap_interact(da_tokenizer, text):
|
||||
tokens = da_tokenizer(text)
|
||||
assert len(tokens) == 4
|
||||
assert tokens[0].text == "("
|
||||
|
@ -85,19 +86,19 @@ def test_tokenizer_splits_uneven_wrap_interact(da_tokenizer, text):
|
|||
|
||||
|
||||
@pytest.mark.parametrize('text', ["0,1-13,5", "0,0-0,1", "103,27-300", "1/2-3/4"])
|
||||
def test_tokenizer_handles_numeric_range(da_tokenizer, text):
|
||||
def test_da_tokenizer_handles_numeric_range(da_tokenizer, text):
|
||||
tokens = da_tokenizer(text)
|
||||
assert len(tokens) == 1
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["sort.Gul", "Hej.Verden"])
|
||||
def test_tokenizer_splits_period_infix(da_tokenizer, text):
|
||||
def test_da_tokenizer_splits_period_infix(da_tokenizer, text):
|
||||
tokens = da_tokenizer(text)
|
||||
assert len(tokens) == 3
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["Hej,Verden", "en,to"])
|
||||
def test_tokenizer_splits_comma_infix(da_tokenizer, text):
|
||||
def test_da_tokenizer_splits_comma_infix(da_tokenizer, text):
|
||||
tokens = da_tokenizer(text)
|
||||
assert len(tokens) == 3
|
||||
assert tokens[0].text == text.split(",")[0]
|
||||
|
@ -106,18 +107,18 @@ def test_tokenizer_splits_comma_infix(da_tokenizer, text):
|
|||
|
||||
|
||||
@pytest.mark.parametrize('text', ["sort...Gul", "sort...gul"])
|
||||
def test_tokenizer_splits_ellipsis_infix(da_tokenizer, text):
|
||||
def test_da_tokenizer_splits_ellipsis_infix(da_tokenizer, text):
|
||||
tokens = da_tokenizer(text)
|
||||
assert len(tokens) == 3
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ['gå-på-mod', '4-hjulstræk', '100-Pfennig-frimærke', 'TV-2-spots', 'trofæ-vaeggen'])
|
||||
def test_tokenizer_keeps_hyphens(da_tokenizer, text):
|
||||
def test_da_tokenizer_keeps_hyphens(da_tokenizer, text):
|
||||
tokens = da_tokenizer(text)
|
||||
assert len(tokens) == 1
|
||||
|
||||
|
||||
def test_tokenizer_splits_double_hyphen_infix(da_tokenizer):
|
||||
def test_da_tokenizer_splits_double_hyphen_infix(da_tokenizer):
|
||||
tokens = da_tokenizer("Mange regler--eksempelvis bindestregs-reglerne--er komplicerede.")
|
||||
assert len(tokens) == 9
|
||||
assert tokens[0].text == "Mange"
|
||||
|
@ -130,7 +131,7 @@ def test_tokenizer_splits_double_hyphen_infix(da_tokenizer):
|
|||
assert tokens[7].text == "komplicerede"
|
||||
|
||||
|
||||
def test_tokenizer_handles_posessives_and_contractions(da_tokenizer):
|
||||
def test_da_tokenizer_handles_posessives_and_contractions(da_tokenizer):
|
||||
tokens = da_tokenizer("'DBA's, Lars' og Liz' bil sku' sgu' ik' ha' en bule, det ka' han ik' li' mere', sagde hun.")
|
||||
assert len(tokens) == 25
|
||||
assert tokens[0].text == "'"
|
||||
|
|
|
@ -1,10 +1,9 @@
|
|||
# coding: utf-8
|
||||
"""Test that longer and mixed texts are tokenized correctly."""
|
||||
|
||||
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from spacy.lang.da.lex_attrs import like_num
|
||||
|
||||
|
||||
def test_da_tokenizer_handles_long_text(da_tokenizer):
|
||||
text = """Der var så dejligt ude på landet. Det var sommer, kornet stod gult, havren grøn,
|
||||
|
@ -15,6 +14,7 @@ Rundt om ager og eng var der store skove, og midt i skovene dybe søer; jo, der
|
|||
tokens = da_tokenizer(text)
|
||||
assert len(tokens) == 84
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text,match', [
|
||||
('10', True), ('1', True), ('10.000', True), ('10.00', True),
|
||||
('999,0', True), ('en', True), ('treoghalvfemsindstyvende', True), ('hundrede', True),
|
||||
|
@ -22,6 +22,10 @@ Rundt om ager og eng var der store skove, og midt i skovene dybe søer; jo, der
|
|||
def test_lex_attrs_like_number(da_tokenizer, text, match):
|
||||
tokens = da_tokenizer(text)
|
||||
assert len(tokens) == 1
|
||||
print(tokens[0])
|
||||
assert tokens[0].like_num == match
|
||||
|
||||
|
||||
@pytest.mark.parametrize('word', ['elleve', 'første'])
|
||||
def test_da_lex_attrs_capitals(word):
|
||||
assert like_num(word)
|
||||
assert like_num(word.upper())
|
||||
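Note on `test_da_lex_attrs_capitals`: it checks that `like_num` accepts a number word regardless of case. The toy function below sketches one way such a check could work. It is not the code in `spacy.lang.da.lex_attrs`; the word sets are hypothetical and contain only examples that appear in the tests above.

```python
# Hypothetical, deliberately tiny word lists taken from the test parameters.
_NUM_WORDS = {'en', 'hundrede', 'elleve'}
_ORDINAL_WORDS = {'første', 'treoghalvfemsindstyvende'}


def like_num(text):
    """Return True if the text looks like a number (toy version)."""
    # Drop thousands and decimal separators before the digit check.
    stripped = text.replace(',', '').replace('.', '')
    if stripped.isdigit():
        return True
    # Lowercase before the lookup so that capitalised forms also match.
    lowered = text.lower()
    return lowered in _NUM_WORDS or lowered in _ORDINAL_WORDS


for word in ['elleve', 'første']:
    assert like_num(word)
    assert like_num(word.upper())
```

Lowercasing before the lookup is the design choice the test exercises: without it, `'ELLEVE'` would be rejected even though `'elleve'` is accepted.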
|
|
|
@ -1,7 +1,4 @@
|
|||
# coding: utf-8
|
||||
"""Test that tokenizer exceptions and emoticons are handles correctly."""
|
||||
|
||||
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
|
|
|
@ -4,12 +4,13 @@ from __future__ import unicode_literals
|
|||
import pytest
|
||||
|
||||
|
||||
@pytest.mark.parametrize('string,lemma', [('Abgehängten', 'Abgehängte'),
|
||||
('engagierte', 'engagieren'),
|
||||
('schließt', 'schließen'),
|
||||
('vorgebenden', 'vorgebend'),
|
||||
('die', 'der'),
|
||||
('Die', 'der')])
|
||||
def test_lemmatizer_lookup_assigns(de_tokenizer, string, lemma):
|
||||
@pytest.mark.parametrize('string,lemma', [
|
||||
('Abgehängten', 'Abgehängte'),
|
||||
('engagierte', 'engagieren'),
|
||||
('schließt', 'schließen'),
|
||||
('vorgebenden', 'vorgebend'),
|
||||
('die', 'der'),
|
||||
('Die', 'der')])
|
||||
def test_de_lemmatizer_lookup_assigns(de_tokenizer, string, lemma):
|
||||
tokens = de_tokenizer(string)
|
||||
assert tokens[0].lemma_ == lemma
|
||||
|
|
|
@ -1,77 +0,0 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import numpy
|
||||
import pytest
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def example(DE):
|
||||
"""
|
||||
This is to make sure the model works as expected. The tests make sure that
|
||||
values are properly set. Tests are not meant to evaluate the content of the
|
||||
output, only make sure the output is formally okay.
|
||||
"""
|
||||
assert DE.entity != None
|
||||
return DE('An der großen Straße stand eine merkwürdige Gestalt und führte Selbstgespräche.')
|
||||
|
||||
|
||||
@pytest.mark.models('de')
|
||||
-def test_de_models_tokenization(example):
-    # tokenization should split the document into tokens
-    assert len(example) > 1
-
-
-@pytest.mark.xfail
-@pytest.mark.models('de')
-def test_de_models_tagging(example):
-    # if tagging was done properly, pos tags shouldn't be empty
-    assert example.is_tagged
-    assert all(t.pos != 0 for t in example)
-    assert all(t.tag != 0 for t in example)
-
-
-@pytest.mark.models('de')
-def test_de_models_parsing(example):
-    # if parsing was done properly
-    # - dependency labels shouldn't be empty
-    # - the head of some tokens should not be root
-    assert example.is_parsed
-    assert all(t.dep != 0 for t in example)
-    assert any(t.dep != i for i,t in enumerate(example))
-
-
-@pytest.mark.models('de')
-def test_de_models_ner(example):
-    # if ner was done properly, ent_iob shouldn't be empty
-    assert all([t.ent_iob != 0 for t in example])
-
-
-@pytest.mark.models('de')
-def test_de_models_vectors(example):
-    # if vectors are available, they should differ on different words
-    # this isn't a perfect test since this could in principle fail
-    # in a sane model as well,
-    # but that's very unlikely and a good indicator if something is wrong
-    vector0 = example[0].vector
-    vector1 = example[1].vector
-    vector2 = example[2].vector
-    assert not numpy.array_equal(vector0,vector1)
-    assert not numpy.array_equal(vector0,vector2)
-    assert not numpy.array_equal(vector1,vector2)
-
-
-@pytest.mark.xfail
-@pytest.mark.models('de')
-def test_de_models_probs(example):
-    # if frequencies/probabilities are okay, they should differ for
-    # different words
-    # this isn't a perfect test since this could in principle fail
-    # in a sane model as well,
-    # but that's very unlikely and a good indicator if something is wrong
-    prob0 = example[0].prob
-    prob1 = example[1].prob
-    prob2 = example[2].prob
-    assert not prob0 == prob1
-    assert not prob0 == prob2
-    assert not prob1 == prob2
@@ -3,17 +3,14 @@ from __future__ import unicode_literals
 
 from ...util import get_doc
-
-import pytest
 
 
 def test_de_parser_noun_chunks_standard_de(de_tokenizer):
     text = "Eine Tasse steht auf dem Tisch."
     heads = [1, 1, 0, -1, 1, -2, -4]
     tags = ['ART', 'NN', 'VVFIN', 'APPR', 'ART', 'NN', '$.']
     deps = ['nk', 'sb', 'ROOT', 'mo', 'nk', 'nk', 'punct']
-
     tokens = de_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags, deps=deps, heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
     chunks = list(doc.noun_chunks)
     assert len(chunks) == 2
     assert chunks[0].text_with_ws == "Eine Tasse "
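The `heads` lists passed to `get_doc` in these tests appear to be relative offsets: each value is the distance from a token to its head, and 0 marks the root. A minimal stdlib sketch of that reading follows; `relative_to_absolute` is a hypothetical helper for illustration, not part of spaCy or its test utilities.

```python
def relative_to_absolute(heads):
    """Convert per-token head offsets into absolute token indices.

    Assumes each entry equals head_index - token_index, so 0 marks a root.
    """
    return [i + offset for i, offset in enumerate(heads)]


# Values copied from the German noun chunk test above.
words = ["Eine", "Tasse", "steht", "auf", "dem", "Tisch", "."]
heads = [1, 1, 0, -1, 1, -2, -4]

absolute = relative_to_absolute(heads)
for word, head in zip(words, absolute):
    print(word, "->", words[head])
```

Under this assumption "steht" is its own head (the root), and "Tisch" attaches to "auf".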
@@ -25,9 +22,8 @@ def test_de_extended_chunk(de_tokenizer):
     heads = [1, 1, 0, -1, 1, -2, -1, -5, -6]
     tags = ['ART', 'NN', 'VVFIN', 'APPR', 'ART', 'NN', 'NN', 'NN', '$.']
     deps = ['nk', 'sb', 'ROOT', 'mo', 'nk', 'nk', 'nk', 'oa', 'punct']
-
     tokens = de_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags, deps=deps, heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
     chunks = list(doc.noun_chunks)
     assert len(chunks) == 3
     assert chunks[0].text_with_ws == "Die Sängerin "
@@ -1,86 +1,83 @@
 # coding: utf-8
-"""Test that tokenizer prefixes, suffixes and infixes are handled correctly."""
-
-
 from __future__ import unicode_literals
 
 import pytest
 
 
 @pytest.mark.parametrize('text', ["(unter)"])
-def test_tokenizer_splits_no_special(de_tokenizer, text):
+def test_de_tokenizer_splits_no_special(de_tokenizer, text):
     tokens = de_tokenizer(text)
     assert len(tokens) == 3
 
 
 @pytest.mark.parametrize('text', ["unter'm"])
-def test_tokenizer_splits_no_punct(de_tokenizer, text):
+def test_de_tokenizer_splits_no_punct(de_tokenizer, text):
     tokens = de_tokenizer(text)
     assert len(tokens) == 2
 
 
 @pytest.mark.parametrize('text', ["(unter'm"])
-def test_tokenizer_splits_prefix_punct(de_tokenizer, text):
+def test_de_tokenizer_splits_prefix_punct(de_tokenizer, text):
     tokens = de_tokenizer(text)
     assert len(tokens) == 3
 
 
 @pytest.mark.parametrize('text', ["unter'm)"])
-def test_tokenizer_splits_suffix_punct(de_tokenizer, text):
+def test_de_tokenizer_splits_suffix_punct(de_tokenizer, text):
     tokens = de_tokenizer(text)
     assert len(tokens) == 3
 
 
 @pytest.mark.parametrize('text', ["(unter'm)"])
-def test_tokenizer_splits_even_wrap(de_tokenizer, text):
+def test_de_tokenizer_splits_even_wrap(de_tokenizer, text):
     tokens = de_tokenizer(text)
     assert len(tokens) == 4
 
 
 @pytest.mark.parametrize('text', ["(unter'm?)"])
-def test_tokenizer_splits_uneven_wrap(de_tokenizer, text):
+def test_de_tokenizer_splits_uneven_wrap(de_tokenizer, text):
     tokens = de_tokenizer(text)
     assert len(tokens) == 5
 
 
 @pytest.mark.parametrize('text,length', [("z.B.", 1), ("zb.", 2), ("(z.B.", 2)])
-def test_tokenizer_splits_prefix_interact(de_tokenizer, text, length):
+def test_de_tokenizer_splits_prefix_interact(de_tokenizer, text, length):
     tokens = de_tokenizer(text)
     assert len(tokens) == length
 
 
 @pytest.mark.parametrize('text', ["z.B.)"])
-def test_tokenizer_splits_suffix_interact(de_tokenizer, text):
+def test_de_tokenizer_splits_suffix_interact(de_tokenizer, text):
     tokens = de_tokenizer(text)
     assert len(tokens) == 2
 
 
 @pytest.mark.parametrize('text', ["(z.B.)"])
-def test_tokenizer_splits_even_wrap_interact(de_tokenizer, text):
+def test_de_tokenizer_splits_even_wrap_interact(de_tokenizer, text):
     tokens = de_tokenizer(text)
     assert len(tokens) == 3
 
 
 @pytest.mark.parametrize('text', ["(z.B.?)"])
-def test_tokenizer_splits_uneven_wrap_interact(de_tokenizer, text):
+def test_de_tokenizer_splits_uneven_wrap_interact(de_tokenizer, text):
     tokens = de_tokenizer(text)
     assert len(tokens) == 4
 
 
 @pytest.mark.parametrize('text', ["0.1-13.5", "0.0-0.1", "103.27-300"])
-def test_tokenizer_splits_numeric_range(de_tokenizer, text):
+def test_de_tokenizer_splits_numeric_range(de_tokenizer, text):
     tokens = de_tokenizer(text)
     assert len(tokens) == 3
 
 
 @pytest.mark.parametrize('text', ["blau.Rot", "Hallo.Welt"])
-def test_tokenizer_splits_period_infix(de_tokenizer, text):
+def test_de_tokenizer_splits_period_infix(de_tokenizer, text):
     tokens = de_tokenizer(text)
     assert len(tokens) == 3
 
 
 @pytest.mark.parametrize('text', ["Hallo,Welt", "eins,zwei"])
-def test_tokenizer_splits_comma_infix(de_tokenizer, text):
+def test_de_tokenizer_splits_comma_infix(de_tokenizer, text):
     tokens = de_tokenizer(text)
     assert len(tokens) == 3
     assert tokens[0].text == text.split(",")[0]
@@ -89,18 +86,18 @@ def test_tokenizer_splits_comma_infix(de_tokenizer, text):
 
 
 @pytest.mark.parametrize('text', ["blau...Rot", "blau...rot"])
-def test_tokenizer_splits_ellipsis_infix(de_tokenizer, text):
+def test_de_tokenizer_splits_ellipsis_infix(de_tokenizer, text):
     tokens = de_tokenizer(text)
     assert len(tokens) == 3
 
 
 @pytest.mark.parametrize('text', ['Islam-Konferenz', 'Ost-West-Konflikt'])
-def test_tokenizer_keeps_hyphens(de_tokenizer, text):
+def test_de_tokenizer_keeps_hyphens(de_tokenizer, text):
     tokens = de_tokenizer(text)
     assert len(tokens) == 1
 
 
-def test_tokenizer_splits_double_hyphen_infix(de_tokenizer):
+def test_de_tokenizer_splits_double_hyphen_infix(de_tokenizer):
     tokens = de_tokenizer("Viele Regeln--wie die Bindestrich-Regeln--sind kompliziert.")
     assert len(tokens) == 10
     assert tokens[0].text == "Viele"
@@ -1,13 +1,10 @@
 # coding: utf-8
-"""Test that longer and mixed texts are tokenized correctly."""
-
-
 from __future__ import unicode_literals
 
 import pytest
 
 
-def test_tokenizer_handles_long_text(de_tokenizer):
+def test_de_tokenizer_handles_long_text(de_tokenizer):
     text = """Die Verwandlung
 
 Als Gregor Samsa eines Morgens aus unruhigen Träumen erwachte, fand er sich in
@@ -29,17 +26,15 @@ Umfang kläglich dünnen Beine flimmerten ihm hilflos vor den Augen.
     "Donaudampfschifffahrtsgesellschaftskapitänsanwärterposten",
     "Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz",
     "Kraftfahrzeug-Haftpflichtversicherung",
-    "Vakuum-Mittelfrequenz-Induktionsofen"
-    ])
-def test_tokenizer_handles_long_words(de_tokenizer, text):
+    "Vakuum-Mittelfrequenz-Induktionsofen"])
+def test_de_tokenizer_handles_long_words(de_tokenizer, text):
     tokens = de_tokenizer(text)
     assert len(tokens) == 1
 
 
 @pytest.mark.parametrize('text,length', [
     ("»Was ist mit mir geschehen?«, dachte er.", 12),
-    ("“Dies frühzeitige Aufstehen”, dachte er, “macht einen ganz blödsinnig. ", 15)
-    ])
-def test_tokenizer_handles_examples(de_tokenizer, text, length):
+    ("“Dies frühzeitige Aufstehen”, dachte er, “macht einen ganz blödsinnig. ", 15)])
+def test_de_tokenizer_handles_examples(de_tokenizer, text, length):
     tokens = de_tokenizer(text)
     assert len(tokens) == length
@@ -1,18 +1,17 @@
-# -*- coding: utf-8 -*-
-
+# coding: utf8
 from __future__ import unicode_literals
 
 import pytest
 
 
 @pytest.mark.parametrize('text', ["αριθ.", "τρισ.", "δισ.", "σελ."])
-def test_tokenizer_handles_abbr(el_tokenizer, text):
+def test_el_tokenizer_handles_abbr(el_tokenizer, text):
     tokens = el_tokenizer(text)
     assert len(tokens) == 1
 
 
-def test_tokenizer_handles_exc_in_text(el_tokenizer):
+def test_el_tokenizer_handles_exc_in_text(el_tokenizer):
     text = "Στα 14 τρισ. δολάρια το κόστος από την άνοδο της στάθμης της θάλασσας."
     tokens = el_tokenizer(text)
     assert len(tokens) == 14
-    assert tokens[2].text == "τρισ."
+    assert tokens[2].text == "τρισ."
@@ -1,11 +1,10 @@
-# -*- coding: utf-8 -*-
-
+# coding: utf8
 from __future__ import unicode_literals
 
 import pytest
 
 
-def test_tokenizer_handles_long_text(el_tokenizer):
+def test_el_tokenizer_handles_long_text(el_tokenizer):
     text = """Η Ελλάδα (παλαιότερα Ελλάς), επίσημα γνωστή ως Ελληνική Δημοκρατία,\
 είναι χώρα της νοτιοανατολικής Ευρώπης στο νοτιότερο άκρο της Βαλκανικής χερσονήσου.\
 Συνορεύει στα βορειοδυτικά με την Αλβανία, στα βόρεια με την πρώην\
@@ -20,6 +19,6 @@ def test_tokenizer_handles_long_text(el_tokenizer):
     ("Η Ελλάδα είναι μία από τις χώρες της Ευρωπαϊκής Ένωσης (ΕΕ) που διαθέτει σηµαντικό ορυκτό πλούτο.", 19),
     ("Η ναυτιλία αποτέλεσε ένα σημαντικό στοιχείο της Ελληνικής οικονομικής δραστηριότητας από τα αρχαία χρόνια.", 15),
     ("Η Ελλάδα είναι μέλος σε αρκετούς διεθνείς οργανισμούς.", 9)])
-def test_tokenizer_handles_cnts(el_tokenizer,text, length):
+def test_el_tokenizer_handles_cnts(el_tokenizer,text, length):
     tokens = el_tokenizer(text)
-    assert len(tokens) == length
+    assert len(tokens) == length
@@ -2,23 +2,22 @@
 from __future__ import unicode_literals
 
 import pytest
-
-from ....lang.en import English
-from ....tokenizer import Tokenizer
-from .... import util
+from spacy.lang.en import English
+from spacy.tokenizer import Tokenizer
+from spacy.util import compile_prefix_regex, compile_suffix_regex
+from spacy.util import compile_infix_regex
 
 
 @pytest.fixture
 def custom_en_tokenizer(en_vocab):
-    prefix_re = util.compile_prefix_regex(English.Defaults.prefixes)
-    suffix_re = util.compile_suffix_regex(English.Defaults.suffixes)
+    prefix_re = compile_prefix_regex(English.Defaults.prefixes)
+    suffix_re = compile_suffix_regex(English.Defaults.suffixes)
     custom_infixes = ['\.\.\.+',
                       '(?<=[0-9])-(?=[0-9])',
                       # '(?<=[0-9]+),(?=[0-9]+)',
                       '[0-9]+(,[0-9]+)+',
                       '[\[\]!&:,()\*—–\/-]']
-
-    infix_re = util.compile_infix_regex(custom_infixes)
+    infix_re = compile_infix_regex(custom_infixes)
     return Tokenizer(en_vocab,
                      English.Defaults.tokenizer_exceptions,
                      prefix_re.search,
@@ -27,13 +26,12 @@ def custom_en_tokenizer(en_vocab):
                      token_match=None)
 
 
-def test_customized_tokenizer_handles_infixes(custom_en_tokenizer):
+def test_en_customized_tokenizer_handles_infixes(custom_en_tokenizer):
     sentence = "The 8 and 10-county definitions are not used for the greater Southern California Megaregion."
     context = [word.text for word in custom_en_tokenizer(sentence)]
     assert context == ['The', '8', 'and', '10', '-', 'county', 'definitions',
                        'are', 'not', 'used', 'for', 'the', 'greater',
                        'Southern', 'California', 'Megaregion', '.']
-
     # the trailing '-' may cause Assertion Error
     sentence = "The 8- and 10-county definitions are not used for the greater Southern California Megaregion."
     context = [word.text for word in custom_en_tokenizer(sentence)]
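To see why `10-county` is expected to come out as `'10', '-', 'county'`, here is a simplified stdlib sketch of infix splitting with the same four patterns. It assumes the patterns are combined into one alternation and applied left to right; `split_infixes` is a hypothetical helper, and the real tokenizer additionally handles prefixes, suffixes and exceptions, which this sketch ignores.

```python
import re

# The custom infix patterns from the fixture above, written as raw strings.
CUSTOM_INFIXES = [
    r'\.\.\.+',
    r'(?<=[0-9])-(?=[0-9])',
    r'[0-9]+(,[0-9]+)+',
    r'[\[\]!&:,()\*—–\/-]',
]


def split_infixes(token, patterns):
    """Split one whitespace-delimited token on every infix match."""
    infix_re = re.compile("|".join(patterns))  # assumed combination rule
    pieces = []
    start = 0
    for match in infix_re.finditer(token):
        if match.start() > start:
            pieces.append(token[start:match.start()])
        pieces.append(match.group())
        start = match.end()
    if start < len(token):
        pieces.append(token[start:])
    return pieces


print(split_infixes("10-county", CUSTOM_INFIXES))
print(split_infixes("8-", CUSTOM_INFIXES))
```

In `10-county` the digit-range pattern does not fire because the hyphen is followed by a letter, so the split comes from the final character class. The trailing hyphen in `8-` is the case the comment in the test calls out.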
@@ -38,7 +38,7 @@ def test_en_tokenizer_splits_trailing_apos(en_tokenizer, text):
 
 
 @pytest.mark.parametrize('text', ["'em", "nothin'", "ol'"])
-def text_tokenizer_doesnt_split_apos_exc(en_tokenizer, text):
+def test_en_tokenizer_doesnt_split_apos_exc(en_tokenizer, text):
     tokens = en_tokenizer(text)
     assert len(tokens) == 1
     assert tokens[0].text == text
@@ -1,11 +1,6 @@
 # coding: utf-8
-"""Test that token.idx correctly computes index into the original string."""
-
-
 from __future__ import unicode_literals
-
-import pytest
 
 
 def test_en_simple_punct(en_tokenizer):
     text = "to walk, do foo"
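`token.idx` is the character offset of a token in the original string. A stdlib-only sketch of the offsets one would expect for the example text follows; `token_offsets` is a hypothetical helper, and the word list is an assumed tokenization rather than output from the English tokenizer.

```python
def token_offsets(text, words):
    """Return the start offset of each word, scanning left to right."""
    offsets = []
    cursor = 0
    for word in words:
        start = text.index(word, cursor)  # raises ValueError if not found
        offsets.append(start)
        cursor = start + len(word)
    return offsets


text = "to walk, do foo"
words = ["to", "walk", ",", "do", "foo"]  # assumed tokenization
offsets = token_offsets(text, words)
print(list(zip(words, offsets)))
```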
@@ -1,63 +0,0 @@
-# coding: utf-8
-from __future__ import unicode_literals
-
-import pytest
-from ....tokens.doc import Doc
-
-
-@pytest.fixture
-def en_lemmatizer(EN):
-    return EN.Defaults.create_lemmatizer()
-
-@pytest.mark.models('en')
-def test_doc_lemmatization(EN):
-    doc = Doc(EN.vocab, words=['bleed'])
-    doc[0].tag_ = 'VBP'
-    assert doc[0].lemma_ == 'bleed'
-
-@pytest.mark.models('en')
-@pytest.mark.parametrize('text,lemmas', [("aardwolves", ["aardwolf"]),
-                                         ("aardwolf", ["aardwolf"]),
-                                         ("planets", ["planet"]),
-                                         ("ring", ["ring"]),
-                                         ("axes", ["axis", "axe", "ax"])])
-def test_en_lemmatizer_noun_lemmas(en_lemmatizer, text, lemmas):
-    assert en_lemmatizer.noun(text) == lemmas
-
-
-@pytest.mark.models('en')
-@pytest.mark.parametrize('text,lemmas', [("bleed", ["bleed"]),
-                                         ("feed", ["feed"]),
-                                         ("need", ["need"]),
-                                         ("ring", ["ring"])])
-def test_en_lemmatizer_noun_lemmas(en_lemmatizer, text, lemmas):
-    # Cases like this are problematic -- not clear what we should do to resolve
-    # ambiguity?
-    # ("axes", ["ax", "axes", "axis"])])
-    assert en_lemmatizer.noun(text) == lemmas
-
-
-@pytest.mark.xfail
-@pytest.mark.models('en')
-def test_en_lemmatizer_base_forms(en_lemmatizer):
-    assert en_lemmatizer.noun('dive', {'number': 'sing'}) == ['dive']
-    assert en_lemmatizer.noun('dive', {'number': 'plur'}) == ['diva']
-
-
-@pytest.mark.models('en')
-def test_en_lemmatizer_base_form_verb(en_lemmatizer):
-    assert en_lemmatizer.verb('saw', {'verbform': 'past'}) == ['see']
-
-
-@pytest.mark.models('en')
-def test_en_lemmatizer_punct(en_lemmatizer):
-    assert en_lemmatizer.punct('“') == ['"']
-    assert en_lemmatizer.punct('“') == ['"']
-
-
-@pytest.mark.models('en')
-def test_en_lemmatizer_lemma_assignment(EN):
-    text = "Bananas in pyjamas are geese."
-    doc = EN.make_doc(text)
-    EN.tagger(doc)
-    assert all(t.lemma_ != '' for t in doc)
@@ -1,85 +0,0 @@
-# coding: utf-8
-from __future__ import unicode_literals
-
-import numpy
-import pytest
-
-
-@pytest.fixture
-def example(EN):
-    """
-    This is to make sure the model works as expected. The tests make sure that
-    values are properly set. Tests are not meant to evaluate the content of the
-    output, only make sure the output is formally okay.
-    """
-    assert EN.entity != None
-    return EN('There was a stranger standing at the big street talking to herself.')
-
-
-@pytest.mark.models('en')
-def test_en_models_tokenization(example):
-    # tokenization should split the document into tokens
-    assert len(example) > 1
-
-
-@pytest.mark.models('en')
-def test_en_models_tagging(example):
-    # if tagging was done properly, pos tags shouldn't be empty
-    assert example.is_tagged
-    assert all(t.pos != 0 for t in example)
-    assert all(t.tag != 0 for t in example)
-
-
-@pytest.mark.models('en')
-def test_en_models_parsing(example):
-    # if parsing was done properly
-    # - dependency labels shouldn't be empty
-    # - the head of some tokens should not be root
-    assert example.is_parsed
-    assert all(t.dep != 0 for t in example)
-    assert any(t.dep != i for i,t in enumerate(example))
-
-
-@pytest.mark.models('en')
-def test_en_models_ner(example):
-    # if ner was done properly, ent_iob shouldn't be empty
-    assert all([t.ent_iob != 0 for t in example])
-
-
-@pytest.mark.models('en')
-def test_en_models_vectors(example):
-    # if vectors are available, they should differ on different words
-    # this isn't a perfect test since this could in principle fail
-    # in a sane model as well,
-    # but that's very unlikely and a good indicator if something is wrong
-    if example.vocab.vectors_length:
-        vector0 = example[0].vector
-        vector1 = example[1].vector
-        vector2 = example[2].vector
-        assert not numpy.array_equal(vector0,vector1)
-        assert not numpy.array_equal(vector0,vector2)
-        assert not numpy.array_equal(vector1,vector2)
-
-
-@pytest.mark.xfail
-@pytest.mark.models('en')
-def test_en_models_probs(example):
-    # if frequencies/probabilities are okay, they should differ for
-    # different words
-    # this isn't a perfect test since this could in principle fail
-    # in a sane model as well,
-    # but that's very unlikely and a good indicator if something is wrong
-    prob0 = example[0].prob
-    prob1 = example[1].prob
-    prob2 = example[2].prob
-    assert not prob0 == prob1
-    assert not prob0 == prob2
-    assert not prob1 == prob2
-
-
-@pytest.mark.models('en')
-def test_no_vectors_similarity(EN):
-    doc1 = EN(u'hallo')
-    doc2 = EN(u'hi')
-    assert doc1.similarity(doc2) > 0
@@ -1,42 +0,0 @@
-from __future__ import unicode_literals, print_function
-import pytest
-
-from spacy.attrs import LOWER
-from spacy.matcher import Matcher
-
-
-@pytest.mark.models('en')
-def test_en_ner_simple_types(EN):
-    tokens = EN(u'Mr. Best flew to New York on Saturday morning.')
-    ents = list(tokens.ents)
-    assert ents[0].start == 1
-    assert ents[0].end == 2
-    assert ents[0].label_ == 'PERSON'
-    assert ents[1].start == 4
-    assert ents[1].end == 6
-    assert ents[1].label_ == 'GPE'
-
-
-@pytest.mark.skip
-@pytest.mark.models('en')
-def test_en_ner_consistency_bug(EN):
-    '''Test an arbitrary sequence-consistency bug encountered during speed test'''
-    tokens = EN(u'Where rap essentially went mainstream, illustrated by seminal Public Enemy, Beastie Boys and L.L. Cool J. tracks.')
-    tokens = EN(u'''Charity and other short-term aid have buoyed them so far, and a tax-relief bill working its way through Congress would help. But the September 11 Victim Compensation Fund, enacted by Congress to discourage people from filing lawsuits, will determine the shape of their lives for years to come.\n\n''', disable=['ner'])
-    tokens.ents += tuple(EN.matcher(tokens))
-    EN.entity(tokens)
-
-
-@pytest.mark.skip
-@pytest.mark.models('en')
-def test_en_ner_unit_end_gazetteer(EN):
-    '''Test a bug in the interaction between the NER model and the gazetteer'''
-    matcher = Matcher(EN.vocab)
-    matcher.add('MemberNames', None, [{LOWER: 'cal'}], [{LOWER: 'cal'}, {LOWER: 'henderson'}])
-    doc = EN(u'who is cal the manager of?')
-    if len(list(doc.ents)) == 0:
-        ents = matcher(doc)
-        assert len(ents) == 1
-        doc.ents += tuple(ents)
-        EN.entity(doc)
-    assert list(doc.ents)[0].text == 'cal'
@@ -1,22 +1,20 @@
 # coding: utf-8
 from __future__ import unicode_literals
 
-from ....attrs import HEAD, DEP
-from ....symbols import nsubj, dobj, amod, nmod, conj, cc, root
-from ....lang.en.syntax_iterators import SYNTAX_ITERATORS
-from ...util import get_doc
-
 import numpy
+from spacy.attrs import HEAD, DEP
+from spacy.symbols import nsubj, dobj, amod, nmod, conj, cc, root
+from spacy.lang.en.syntax_iterators import SYNTAX_ITERATORS
+
+from ...util import get_doc
 
 
 def test_en_noun_chunks_not_nested(en_tokenizer):
     text = "Peter has chronic command and control issues"
     heads = [1, 0, 4, 3, -1, -2, -5]
     deps = ['nsubj', 'ROOT', 'amod', 'nmod', 'cc', 'conj', 'dobj']
-
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)
-
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
     tokens.from_array(
         [HEAD, DEP],
         numpy.asarray([[1, nsubj], [0, root], [4, amod], [3, nmod], [-1, cc],
@@ -3,58 +3,52 @@ from __future__ import unicode_literals
 
 from ...util import get_doc
-
-import pytest
 
 
-def test_parser_noun_chunks_standard(en_tokenizer):
+def test_en_parser_noun_chunks_standard(en_tokenizer):
     text = "A base phrase should be recognized."
     heads = [2, 1, 3, 2, 1, 0, -1]
     tags = ['DT', 'JJ', 'NN', 'MD', 'VB', 'VBN', '.']
     deps = ['det', 'amod', 'nsubjpass', 'aux', 'auxpass', 'ROOT', 'punct']
-
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags, deps=deps, heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
     chunks = list(doc.noun_chunks)
     assert len(chunks) == 1
     assert chunks[0].text_with_ws == "A base phrase "
 
 
-def test_parser_noun_chunks_coordinated(en_tokenizer):
+def test_en_parser_noun_chunks_coordinated(en_tokenizer):
     text = "A base phrase and a good phrase are often the same."
     heads = [2, 1, 5, -1, 2, 1, -4, 0, -1, 1, -3, -4]
     tags = ['DT', 'NN', 'NN', 'CC', 'DT', 'JJ', 'NN', 'VBP', 'RB', 'DT', 'JJ', '.']
     deps = ['det', 'compound', 'nsubj', 'cc', 'det', 'amod', 'conj', 'ROOT', 'advmod', 'det', 'attr', 'punct']
-
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags, deps=deps, heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
     chunks = list(doc.noun_chunks)
     assert len(chunks) == 2
     assert chunks[0].text_with_ws == "A base phrase "
     assert chunks[1].text_with_ws == "a good phrase "
 
 
-def test_parser_noun_chunks_pp_chunks(en_tokenizer):
+def test_en_parser_noun_chunks_pp_chunks(en_tokenizer):
     text = "A phrase with another phrase occurs."
     heads = [1, 4, -1, 1, -2, 0, -1]
     tags = ['DT', 'NN', 'IN', 'DT', 'NN', 'VBZ', '.']
     deps = ['det', 'nsubj', 'prep', 'det', 'pobj', 'ROOT', 'punct']
-
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags, deps=deps, heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
     chunks = list(doc.noun_chunks)
     assert len(chunks) == 2
     assert chunks[0].text_with_ws == "A phrase "
     assert chunks[1].text_with_ws == "another phrase "
 
 
-def test_parser_noun_chunks_appositional_modifiers(en_tokenizer):
+def test_en_parser_noun_chunks_appositional_modifiers(en_tokenizer):
     text = "Sam, my brother, arrived to the house."
     heads = [5, -1, 1, -3, -4, 0, -1, 1, -2, -4]
     tags = ['NNP', ',', 'PRP$', 'NN', ',', 'VBD', 'IN', 'DT', 'NN', '.']
     deps = ['nsubj', 'punct', 'poss', 'appos', 'punct', 'ROOT', 'prep', 'det', 'pobj', 'punct']
-
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags, deps=deps, heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
     chunks = list(doc.noun_chunks)
     assert len(chunks) == 3
     assert chunks[0].text_with_ws == "Sam "
@@ -62,14 +56,13 @@ def test_parser_noun_chunks_appositional_modifiers(en_tokenizer):
     assert chunks[2].text_with_ws == "the house "
 
 
-def test_parser_noun_chunks_dative(en_tokenizer):
+def test_en_parser_noun_chunks_dative(en_tokenizer):
     text = "She gave Bob a raise."
     heads = [1, 0, -1, 1, -3, -4]
     tags = ['PRP', 'VBD', 'NNP', 'DT', 'NN', '.']
     deps = ['nsubj', 'ROOT', 'dative', 'det', 'dobj', 'punct']
-
     tokens = en_tokenizer(text)
-    doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags, deps=deps, heads=heads)
+    doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags, deps=deps, heads=heads)
     chunks = list(doc.noun_chunks)
     assert len(chunks) == 3
     assert chunks[0].text_with_ws == "She "
@@ -1,92 +1,89 @@
 # coding: utf-8
-"""Test that tokenizer prefixes, suffixes and infixes are handled correctly."""
-
-
 from __future__ import unicode_literals
 
 import pytest
 
 
 @pytest.mark.parametrize('text', ["(can)"])
-def test_tokenizer_splits_no_special(en_tokenizer, text):
+def test_en_tokenizer_splits_no_special(en_tokenizer, text):
     tokens = en_tokenizer(text)
     assert len(tokens) == 3
 
 
 @pytest.mark.parametrize('text', ["can't"])
-def test_tokenizer_splits_no_punct(en_tokenizer, text):
+def test_en_tokenizer_splits_no_punct(en_tokenizer, text):
     tokens = en_tokenizer(text)
     assert len(tokens) == 2
 
 
 @pytest.mark.parametrize('text', ["(can't"])
-def test_tokenizer_splits_prefix_punct(en_tokenizer, text):
+def test_en_tokenizer_splits_prefix_punct(en_tokenizer, text):
     tokens = en_tokenizer(text)
     assert len(tokens) == 3
 
 
 @pytest.mark.parametrize('text', ["can't)"])
-def test_tokenizer_splits_suffix_punct(en_tokenizer, text):
+def test_en_tokenizer_splits_suffix_punct(en_tokenizer, text):
     tokens = en_tokenizer(text)
     assert len(tokens) == 3
 
 
 @pytest.mark.parametrize('text', ["(can't)"])
-def test_tokenizer_splits_even_wrap(en_tokenizer, text):
+def test_en_tokenizer_splits_even_wrap(en_tokenizer, text):
     tokens = en_tokenizer(text)
     assert len(tokens) == 4
 
 
 @pytest.mark.parametrize('text', ["(can't?)"])
-def test_tokenizer_splits_uneven_wrap(en_tokenizer, text):
+def test_en_tokenizer_splits_uneven_wrap(en_tokenizer, text):
     tokens = en_tokenizer(text)
     assert len(tokens) == 5
 
 
 @pytest.mark.parametrize('text,length', [("U.S.", 1), ("us.", 2), ("(U.S.", 2)])
-def test_tokenizer_splits_prefix_interact(en_tokenizer, text, length):
+def test_en_tokenizer_splits_prefix_interact(en_tokenizer, text, length):
     tokens = en_tokenizer(text)
     assert len(tokens) == length
 
 
 @pytest.mark.parametrize('text', ["U.S.)"])
-def test_tokenizer_splits_suffix_interact(en_tokenizer, text):
+def test_en_tokenizer_splits_suffix_interact(en_tokenizer, text):
     tokens = en_tokenizer(text)
     assert len(tokens) == 2
 
 
 @pytest.mark.parametrize('text', ["(U.S.)"])
-def test_tokenizer_splits_even_wrap_interact(en_tokenizer, text):
+def test_en_tokenizer_splits_even_wrap_interact(en_tokenizer, text):
     tokens = en_tokenizer(text)
     assert len(tokens) == 3
 
 
 @pytest.mark.parametrize('text', ["(U.S.?)"])
-def test_tokenizer_splits_uneven_wrap_interact(en_tokenizer, text):
+def test_en_tokenizer_splits_uneven_wrap_interact(en_tokenizer, text):
     tokens = en_tokenizer(text)
     assert len(tokens) == 4
 
 
 @pytest.mark.parametrize('text', ["best-known"])
-def test_tokenizer_splits_hyphens(en_tokenizer, text):
+def test_en_tokenizer_splits_hyphens(en_tokenizer, text):
     tokens = en_tokenizer(text)
     assert len(tokens) == 3
 
 
 @pytest.mark.parametrize('text', ["0.1-13.5", "0.0-0.1", "103.27-300"])
-def test_tokenizer_splits_numeric_range(en_tokenizer, text):
+def test_en_tokenizer_splits_numeric_range(en_tokenizer, text):
     tokens = en_tokenizer(text)
     assert len(tokens) == 3
 
 
 @pytest.mark.parametrize('text', ["best.Known", "Hello.World"])
-def test_tokenizer_splits_period_infix(en_tokenizer, text):
+def test_en_tokenizer_splits_period_infix(en_tokenizer, text):
     tokens = en_tokenizer(text)
     assert len(tokens) == 3
 
 
 @pytest.mark.parametrize('text', ["Hello,world", "one,two"])
-def test_tokenizer_splits_comma_infix(en_tokenizer, text):
+def test_en_tokenizer_splits_comma_infix(en_tokenizer, text):
     tokens = en_tokenizer(text)
     assert len(tokens) == 3
     assert tokens[0].text == text.split(",")[0]
@@ -95,12 +92,12 @@ def test_tokenizer_splits_comma_infix(en_tokenizer, text):
 
 
 @pytest.mark.parametrize('text', ["best...Known", "best...known"])
-def test_tokenizer_splits_ellipsis_infix(en_tokenizer, text):
+def test_en_tokenizer_splits_ellipsis_infix(en_tokenizer, text):
     tokens = en_tokenizer(text)
     assert len(tokens) == 3
 
 
-def test_tokenizer_splits_double_hyphen_infix(en_tokenizer):
+def test_en_tokenizer_splits_double_hyphen_infix(en_tokenizer):
     tokens = en_tokenizer("No decent--let alone well-bred--people.")
     assert tokens[0].text == "No"
     assert tokens[1].text == "decent"
@@ -115,7 +112,7 @@ def test_tokenizer_splits_double_hyphen_infix(en_tokenizer):
 
 
 @pytest.mark.xfail
-def test_tokenizer_splits_period_abbr(en_tokenizer):
+def test_en_tokenizer_splits_period_abbr(en_tokenizer):
     text = "Today is Tuesday.Mr."
     tokens = en_tokenizer(text)
     assert len(tokens) == 5
@@ -127,7 +124,7 @@ def test_tokenizer_splits_period_abbr(en_tokenizer):
 
 
 @pytest.mark.xfail
-def test_tokenizer_splits_em_dash_infix(en_tokenizer):
+def test_en_tokenizer_splits_em_dash_infix(en_tokenizer):
     # Re Issue #225
     tokens = en_tokenizer("""Will this road take me to Puddleton?\u2014No, """
                           """you'll have to walk there.\u2014Ariel.""")
@@ -1,13 +1,9 @@
 # coding: utf-8
-"""Test that open, closed and paired punctuation is split off correctly."""
-
-
 from __future__ import unicode_literals
 
 import pytest
-
-from ....util import compile_prefix_regex
-from ....lang.punctuation import TOKENIZER_PREFIXES
+from spacy.util import compile_prefix_regex
+from spacy.lang.punctuation import TOKENIZER_PREFIXES
 
 
 PUNCT_OPEN = ['(', '[', '{', '*']
@ -1,18 +1,17 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ....tokens import Doc
|
||||
from ...util import get_doc, apply_transition_sequence
|
||||
|
||||
import pytest
|
||||
|
||||
from ...util import get_doc, apply_transition_sequence
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["A test sentence"])
|
||||
@pytest.mark.parametrize('punct', ['.', '!', '?', ''])
|
||||
def test_en_sbd_single_punct(en_tokenizer, text, punct):
|
||||
heads = [2, 1, 0, -1] if punct else [2, 1, 0]
|
||||
tokens = en_tokenizer(text + punct)
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||
assert len(doc) == (4 if punct else 3)
|
||||
assert len(list(doc.sents)) == 1
|
||||
assert sum(len(sent) for sent in doc.sents) == len(doc)
|
||||
|
@ -26,102 +25,10 @@ def test_en_sentence_breaks(en_tokenizer, en_parser):
|
|||
'attr', 'punct']
|
||||
transition = ['L-nsubj', 'S', 'L-det', 'R-attr', 'D', 'R-punct', 'B-ROOT',
|
||||
'L-nsubj', 'S', 'L-attr', 'R-attr', 'D', 'R-punct']
|
||||
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
|
||||
apply_transition_sequence(en_parser, doc, transition)
|
||||
|
||||
assert len(list(doc.sents)) == 2
|
||||
for token in doc:
|
||||
assert token.dep != 0 or token.is_space
|
||||
assert [token.head.i for token in doc ] == [1, 1, 3, 1, 1, 6, 6, 8, 6, 6]
|
||||
|
||||
|
||||
# Currently, there's no way of setting the serializer data for the parser
|
||||
# without loading the models, so we can't remove the model dependency here yet.
|
||||
|
||||
@pytest.mark.xfail
|
||||
@pytest.mark.models('en')
|
||||
def test_en_sbd_serialization_projective(EN):
|
||||
"""Test that before and after serialization, the sentence boundaries are
|
||||
the same."""
|
||||
|
||||
text = "I bought a couch from IKEA It wasn't very comfortable."
|
||||
transition = ['L-nsubj', 'S', 'L-det', 'R-dobj', 'D', 'R-prep', 'R-pobj',
|
||||
'B-ROOT', 'L-nsubj', 'R-neg', 'D', 'S', 'L-advmod',
|
||||
'R-acomp', 'D', 'R-punct']
|
||||
|
||||
doc = EN.tokenizer(text)
|
||||
apply_transition_sequence(EN.parser, doc, transition)
|
||||
doc_serialized = Doc(EN.vocab).from_bytes(doc.to_bytes())
|
||||
assert doc.is_parsed == True
|
||||
assert doc_serialized.is_parsed == True
|
||||
assert doc.to_bytes() == doc_serialized.to_bytes()
|
||||
assert [s.text for s in doc.sents] == [s.text for s in doc_serialized.sents]
|
||||
|
||||
|
||||
TEST_CASES = [
|
||||
pytest.mark.xfail(("Hello World. My name is Jonas.", ["Hello World.", "My name is Jonas."])),
|
||||
("What is your name? My name is Jonas.", ["What is your name?", "My name is Jonas."]),
|
||||
("There it is! I found it.", ["There it is!", "I found it."]),
|
||||
("My name is Jonas E. Smith.", ["My name is Jonas E. Smith."]),
|
||||
("Please turn to p. 55.", ["Please turn to p. 55."]),
|
||||
("Were Jane and co. at the party?", ["Were Jane and co. at the party?"]),
|
||||
("They closed the deal with Pitt, Briggs & Co. at noon.", ["They closed the deal with Pitt, Briggs & Co. at noon."]),
|
||||
("Let's ask Jane and co. They should know.", ["Let's ask Jane and co.", "They should know."]),
|
||||
("They closed the deal with Pitt, Briggs & Co. It closed yesterday.", ["They closed the deal with Pitt, Briggs & Co.", "It closed yesterday."]),
|
||||
("I can see Mt. Fuji from here.", ["I can see Mt. Fuji from here."]),
|
||||
pytest.mark.xfail(("St. Michael's Church is on 5th st. near the light.", ["St. Michael's Church is on 5th st. near the light."])),
|
||||
("That is JFK Jr.'s book.", ["That is JFK Jr.'s book."]),
|
||||
("I visited the U.S.A. last year.", ["I visited the U.S.A. last year."]),
|
||||
("I live in the E.U. How about you?", ["I live in the E.U.", "How about you?"]),
|
||||
("I live in the U.S. How about you?", ["I live in the U.S.", "How about you?"]),
|
||||
("I work for the U.S. Government in Virginia.", ["I work for the U.S. Government in Virginia."]),
|
||||
("I have lived in the U.S. for 20 years.", ["I have lived in the U.S. for 20 years."]),
|
||||
pytest.mark.xfail(("At 5 a.m. Mr. Smith went to the bank. He left the bank at 6 P.M. Mr. Smith then went to the store.", ["At 5 a.m. Mr. Smith went to the bank.", "He left the bank at 6 P.M.", "Mr. Smith then went to the store."])),
|
||||
("She has $100.00 in her bag.", ["She has $100.00 in her bag."]),
|
||||
("She has $100.00. It is in her bag.", ["She has $100.00.", "It is in her bag."]),
|
||||
("He teaches science (He previously worked for 5 years as an engineer.) at the local University.", ["He teaches science (He previously worked for 5 years as an engineer.) at the local University."]),
|
||||
("Her email is Jane.Doe@example.com. I sent her an email.", ["Her email is Jane.Doe@example.com.", "I sent her an email."]),
|
||||
("The site is: https://www.example.50.com/new-site/awesome_content.html. Please check it out.", ["The site is: https://www.example.50.com/new-site/awesome_content.html.", "Please check it out."]),
|
||||
pytest.mark.xfail(("She turned to him, 'This is great.' she said.", ["She turned to him, 'This is great.' she said."])),
|
||||
pytest.mark.xfail(('She turned to him, "This is great." she said.', ['She turned to him, "This is great." she said.'])),
|
||||
('She turned to him, "This is great." She held the book out to show him.', ['She turned to him, "This is great."', "She held the book out to show him."]),
|
||||
("Hello!! Long time no see.", ["Hello!!", "Long time no see."]),
|
||||
("Hello?? Who is there?", ["Hello??", "Who is there?"]),
|
||||
("Hello!? Is that you?", ["Hello!?", "Is that you?"]),
|
||||
("Hello?! Is that you?", ["Hello?!", "Is that you?"]),
|
||||
pytest.mark.xfail(("1.) The first item 2.) The second item", ["1.) The first item", "2.) The second item"])),
|
||||
pytest.mark.xfail(("1.) The first item. 2.) The second item.", ["1.) The first item.", "2.) The second item."])),
|
||||
pytest.mark.xfail(("1) The first item 2) The second item", ["1) The first item", "2) The second item"])),
|
||||
("1) The first item. 2) The second item.", ["1) The first item.", "2) The second item."]),
|
||||
pytest.mark.xfail(("1. The first item 2. The second item", ["1. The first item", "2. The second item"])),
|
||||
pytest.mark.xfail(("1. The first item. 2. The second item.", ["1. The first item.", "2. The second item."])),
|
||||
pytest.mark.xfail(("• 9. The first item • 10. The second item", ["• 9. The first item", "• 10. The second item"])),
|
||||
pytest.mark.xfail(("⁃9. The first item ⁃10. The second item", ["⁃9. The first item", "⁃10. The second item"])),
|
||||
pytest.mark.xfail(("a. The first item b. The second item c. The third list item", ["a. The first item", "b. The second item", "c. The third list item"])),
|
||||
("This is a sentence\ncut off in the middle because pdf.", ["This is a sentence\ncut off in the middle because pdf."]),
|
||||
("It was a cold \nnight in the city.", ["It was a cold \nnight in the city."]),
|
||||
pytest.mark.xfail(("features\ncontact manager\nevents, activities\n", ["features", "contact manager", "events, activities"])),
|
||||
pytest.mark.xfail(("You can find it at N°. 1026.253.553. That is where the treasure is.", ["You can find it at N°. 1026.253.553.", "That is where the treasure is."])),
|
||||
("She works at Yahoo! in the accounting department.", ["She works at Yahoo! in the accounting department."]),
|
||||
("We make a good team, you and I. Did you see Albert I. Jones yesterday?", ["We make a good team, you and I.", "Did you see Albert I. Jones yesterday?"]),
|
||||
("Thoreau argues that by simplifying one’s life, “the laws of the universe will appear less complex. . . .”", ["Thoreau argues that by simplifying one’s life, “the laws of the universe will appear less complex. . . .”"]),
|
||||
pytest.mark.xfail((""""Bohr [...] used the analogy of parallel stairways [...]" (Smith 55).""", ['"Bohr [...] used the analogy of parallel stairways [...]" (Smith 55).'])),
|
||||
("If words are left off at the end of a sentence, and that is all that is omitted, indicate the omission with ellipsis marks (preceded and followed by a space) and then indicate the end of the sentence with a period . . . . Next sentence.", ["If words are left off at the end of a sentence, and that is all that is omitted, indicate the omission with ellipsis marks (preceded and followed by a space) and then indicate the end of the sentence with a period . . . .", "Next sentence."]),
|
||||
("I never meant that.... She left the store.", ["I never meant that....", "She left the store."]),
|
||||
pytest.mark.xfail(("I wasn’t really ... well, what I mean...see . . . what I'm saying, the thing is . . . I didn’t mean it.", ["I wasn’t really ... well, what I mean...see . . . what I'm saying, the thing is . . . I didn’t mean it."])),
|
||||
pytest.mark.xfail(("One further habit which was somewhat weakened . . . was that of combining words into self-interpreting compounds. . . . The practice was not abandoned. . . .", ["One further habit which was somewhat weakened . . . was that of combining words into self-interpreting compounds.", ". . . The practice was not abandoned. . . ."])),
|
||||
pytest.mark.xfail(("Hello world.Today is Tuesday.Mr. Smith went to the store and bought 1,000.That is a lot.", ["Hello world.", "Today is Tuesday.", "Mr. Smith went to the store and bought 1,000.", "That is a lot."]))
|
||||
]
|
||||
|
||||
@pytest.mark.skip
|
||||
@pytest.mark.models('en')
|
||||
@pytest.mark.parametrize('text,expected_sents', TEST_CASES)
|
||||
def test_en_sbd_prag(EN, text, expected_sents):
|
||||
"""SBD tests from Pragmatic Segmenter"""
|
||||
doc = EN(text)
|
||||
sents = []
|
||||
for sent in doc.sents:
|
||||
sents.append(''.join(doc[i].string for i in range(sent.start, sent.end)).strip())
|
||||
assert sents == expected_sents
|
||||
|
|
|
@ -1,12 +1,8 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ....parts_of_speech import SPACE
|
||||
from ....compat import unicode_
|
||||
from ...util import get_doc
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
def test_en_tagger_load_morph_exc(en_tokenizer):
|
||||
text = "I like his style."
|
||||
|
@ -14,47 +10,6 @@ def test_en_tagger_load_morph_exc(en_tokenizer):
|
|||
morph_exc = {'VBP': {'like': {'lemma': 'luck'}}}
|
||||
en_tokenizer.vocab.morphology.load_morph_exceptions(morph_exc)
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], tags=tags)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], tags=tags)
|
||||
assert doc[1].tag_ == 'VBP'
|
||||
assert doc[1].lemma_ == 'luck'
|
||||
|
||||
|
||||
@pytest.mark.models('en')
|
||||
def test_tag_names(EN):
|
||||
text = "I ate pizzas with anchovies."
|
||||
doc = EN(text, disable=['parser'])
|
||||
assert type(doc[2].pos) == int
|
||||
assert isinstance(doc[2].pos_, unicode_)
|
||||
assert isinstance(doc[2].dep_, unicode_)
|
||||
assert doc[2].tag_ == u'NNS'
|
||||
|
||||
|
||||
@pytest.mark.xfail
|
||||
@pytest.mark.models('en')
|
||||
def test_en_tagger_spaces(EN):
|
||||
"""Ensure spaces are assigned the POS tag SPACE"""
|
||||
text = "Some\nspaces are\tnecessary."
|
||||
doc = EN(text, disable=['parser'])
|
||||
assert doc[0].pos != SPACE
|
||||
assert doc[0].pos_ != 'SPACE'
|
||||
assert doc[1].pos == SPACE
|
||||
assert doc[1].pos_ == 'SPACE'
|
||||
assert doc[1].tag_ == 'SP'
|
||||
assert doc[2].pos != SPACE
|
||||
assert doc[3].pos != SPACE
|
||||
assert doc[4].pos == SPACE
|
||||
|
||||
|
||||
@pytest.mark.xfail
|
||||
@pytest.mark.models('en')
|
||||
def test_en_tagger_return_char(EN):
|
||||
"""Ensure spaces are assigned the POS tag SPACE"""
|
||||
text = ('hi Aaron,\r\n\r\nHow is your schedule today, I was wondering if '
|
||||
'you had time for a phone\r\ncall this afternoon?\r\n\r\n\r\n')
|
||||
tokens = EN(text)
|
||||
for token in tokens:
|
||||
if token.is_space:
|
||||
assert token.pos == SPACE
|
||||
assert tokens[3].text == '\r\n\r\n'
|
||||
assert tokens[3].is_space
|
||||
assert tokens[3].pos == SPACE
|
||||
|
|
|
@ -1,10 +1,8 @@
|
|||
# coding: utf-8
|
||||
"""Test that longer and mixed texts are tokenized correctly."""
|
||||
|
||||
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from spacy.lang.en.lex_attrs import like_num
|
||||
|
||||
|
||||
def test_en_tokenizer_handles_long_text(en_tokenizer):
|
||||
|
@ -43,3 +41,9 @@ def test_lex_attrs_like_number(en_tokenizer, text, match):
|
|||
tokens = en_tokenizer(text)
|
||||
assert len(tokens) == 1
|
||||
assert tokens[0].like_num == match
|
||||
|
||||
|
||||
@pytest.mark.parametrize('word', ['eleven'])
|
||||
def test_en_lex_attrs_capitals(word):
|
||||
assert like_num(word)
|
||||
assert like_num(word.upper())
|
||||
|
|
|
@ -1,22 +1,21 @@
|
|||
# coding: utf-8
|
||||
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text,lemma', [("aprox.", "aproximadamente"),
|
||||
("esq.", "esquina"),
|
||||
("pág.", "página"),
|
||||
("p.ej.", "por ejemplo")
|
||||
])
|
||||
def test_tokenizer_handles_abbr(es_tokenizer, text, lemma):
|
||||
@pytest.mark.parametrize('text,lemma', [
|
||||
("aprox.", "aproximadamente"),
|
||||
("esq.", "esquina"),
|
||||
("pág.", "página"),
|
||||
("p.ej.", "por ejemplo")])
|
||||
def test_es_tokenizer_handles_abbr(es_tokenizer, text, lemma):
|
||||
tokens = es_tokenizer(text)
|
||||
assert len(tokens) == 1
|
||||
assert tokens[0].lemma_ == lemma
|
||||
|
||||
|
||||
def test_tokenizer_handles_exc_in_text(es_tokenizer):
|
||||
def test_es_tokenizer_handles_exc_in_text(es_tokenizer):
|
||||
text = "Mariano Rajoy ha corrido aprox. medio kilómetro"
|
||||
tokens = es_tokenizer(text)
|
||||
assert len(tokens) == 7
|
||||
|
|
|
@ -1,14 +1,10 @@
|
|||
# coding: utf-8
|
||||
|
||||
"""Test that longer and mixed texts are tokenized correctly."""
|
||||
|
||||
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
def test_tokenizer_handles_long_text(es_tokenizer):
|
||||
def test_es_tokenizer_handles_long_text(es_tokenizer):
|
||||
text = """Cuando a José Mujica lo invitaron a dar una conferencia
|
||||
|
||||
en Oxford este verano, su cabeza hizo "crac". La "más antigua" universidad de habla
|
||||
|
@ -30,6 +26,6 @@ en Montevideo y que pregona las bondades de la vida austera."""
|
|||
("""¡Sí! "Vámonos", contestó José Arcadio Buendía""", 11),
|
||||
("Corrieron aprox. 10km.", 5),
|
||||
("Y entonces por qué...", 5)])
|
||||
def test_tokenizer_handles_cnts(es_tokenizer, text, length):
|
||||
def test_es_tokenizer_handles_cnts(es_tokenizer, text, length):
|
||||
tokens = es_tokenizer(text)
|
||||
assert len(tokens) == length
|
||||
|
|
|
@ -11,7 +11,7 @@ ABBREVIATION_TESTS = [
|
|||
|
||||
|
||||
@pytest.mark.parametrize('text,expected_tokens', ABBREVIATION_TESTS)
|
||||
def test_tokenizer_handles_testcases(fi_tokenizer, text, expected_tokens):
|
||||
def test_fi_tokenizer_handles_testcases(fi_tokenizer, text, expected_tokens):
|
||||
tokens = fi_tokenizer(text)
|
||||
token_list = [token.text for token in tokens if not token.is_space]
|
||||
assert expected_tokens == token_list
|
||||
|
|
|
@ -1,29 +1,29 @@
|
|||
# coding: utf-8
|
||||
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["aujourd'hui", "Aujourd'hui", "prud'hommes",
|
||||
"prud’hommal"])
|
||||
def test_tokenizer_infix_exceptions(fr_tokenizer, text):
|
||||
@pytest.mark.parametrize('text', [
|
||||
"aujourd'hui", "Aujourd'hui", "prud'hommes", "prud’hommal"])
|
||||
def test_fr_tokenizer_infix_exceptions(fr_tokenizer, text):
|
||||
tokens = fr_tokenizer(text)
|
||||
assert len(tokens) == 1
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text,lemma', [("janv.", "janvier"),
|
||||
("juill.", "juillet"),
|
||||
("Dr.", "docteur"),
|
||||
("av.", "avant"),
|
||||
("sept.", "septembre")])
|
||||
def test_tokenizer_handles_abbr(fr_tokenizer, text, lemma):
|
||||
@pytest.mark.parametrize('text,lemma', [
|
||||
("janv.", "janvier"),
|
||||
("juill.", "juillet"),
|
||||
("Dr.", "docteur"),
|
||||
("av.", "avant"),
|
||||
("sept.", "septembre")])
|
||||
def test_fr_tokenizer_handles_abbr(fr_tokenizer, text, lemma):
|
||||
tokens = fr_tokenizer(text)
|
||||
assert len(tokens) == 1
|
||||
assert tokens[0].lemma_ == lemma
|
||||
|
||||
|
||||
def test_tokenizer_handles_exc_in_text(fr_tokenizer):
|
||||
def test_fr_tokenizer_handles_exc_in_text(fr_tokenizer):
|
||||
text = "Je suis allé au mois de janv. aux prud’hommes."
|
||||
tokens = fr_tokenizer(text)
|
||||
assert len(tokens) == 10
|
||||
|
@ -32,14 +32,15 @@ def test_tokenizer_handles_exc_in_text(fr_tokenizer):
|
|||
assert tokens[8].text == "prud’hommes"
|
||||
|
||||
|
||||
def test_tokenizer_handles_exc_in_text_2(fr_tokenizer):
|
||||
def test_fr_tokenizer_handles_exc_in_text_2(fr_tokenizer):
|
||||
text = "Cette après-midi, je suis allé dans un restaurant italo-mexicain."
|
||||
tokens = fr_tokenizer(text)
|
||||
assert len(tokens) == 11
|
||||
assert tokens[1].text == "après-midi"
|
||||
assert tokens[9].text == "italo-mexicain"
|
||||
|
||||
def test_tokenizer_handles_title(fr_tokenizer):
|
||||
|
||||
def test_fr_tokenizer_handles_title(fr_tokenizer):
|
||||
text = "N'est-ce pas génial?"
|
||||
tokens = fr_tokenizer(text)
|
||||
assert len(tokens) == 6
|
||||
|
@ -50,16 +51,18 @@ def test_tokenizer_handles_title(fr_tokenizer):
|
|||
assert tokens[2].text == "-ce"
|
||||
assert tokens[2].lemma_ == "ce"
|
||||
|
||||
def test_tokenizer_handles_title_2(fr_tokenizer):
|
||||
|
||||
def test_fr_tokenizer_handles_title_2(fr_tokenizer):
|
||||
text = "Est-ce pas génial?"
|
||||
tokens = fr_tokenizer(text)
|
||||
assert len(tokens) == 6
|
||||
assert tokens[0].text == "Est"
|
||||
assert tokens[0].lemma_ == "être"
|
||||
|
||||
def test_tokenizer_handles_title_2(fr_tokenizer):
|
||||
|
||||
def test_fr_tokenizer_handles_title_3(fr_tokenizer):
|
||||
text = "Qu'est-ce que tu fais?"
|
||||
tokens = fr_tokenizer(text)
|
||||
assert len(tokens) == 7
|
||||
assert tokens[0].text == "Qu'"
|
||||
assert tokens[0].lemma_ == "que"
|
||||
assert tokens[0].lemma_ == "que"
|
||||
|
|
|
@ -4,25 +4,25 @@ from __future__ import unicode_literals
|
|||
import pytest
|
||||
|
||||
|
||||
def test_lemmatizer_verb(fr_tokenizer):
|
||||
def test_fr_lemmatizer_verb(fr_tokenizer):
|
||||
tokens = fr_tokenizer("Qu'est-ce que tu fais?")
|
||||
assert tokens[0].lemma_ == "que"
|
||||
assert tokens[1].lemma_ == "être"
|
||||
assert tokens[5].lemma_ == "faire"
|
||||
|
||||
|
||||
def test_lemmatizer_noun_verb_2(fr_tokenizer):
|
||||
def test_fr_lemmatizer_noun_verb_2(fr_tokenizer):
|
||||
tokens = fr_tokenizer("Les abaissements de température sont gênants.")
|
||||
assert tokens[4].lemma_ == "être"
|
||||
|
||||
|
||||
@pytest.mark.xfail(reason="Costaricienne TAG is PROPN instead of NOUN and spacy don't lemmatize PROPN")
|
||||
def test_lemmatizer_noun(fr_tokenizer):
|
||||
def test_fr_lemmatizer_noun(fr_tokenizer):
|
||||
tokens = fr_tokenizer("il y a des Costaricienne.")
|
||||
assert tokens[4].lemma_ == "Costaricain"
|
||||
|
||||
|
||||
def test_lemmatizer_noun_2(fr_tokenizer):
|
||||
def test_fr_lemmatizer_noun_2(fr_tokenizer):
|
||||
tokens = fr_tokenizer("Les abaissements de température sont gênants.")
|
||||
assert tokens[1].lemma_ == "abaissement"
|
||||
assert tokens[5].lemma_ == "gênant"
|
||||
|
|
23
spacy/tests/lang/fr/test_prefix_suffix_infix.py
Normal file
|
@ -0,0 +1,23 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from spacy.language import Language
|
||||
from spacy.lang.punctuation import TOKENIZER_INFIXES
|
||||
from spacy.lang.char_classes import ALPHA
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text,expected_tokens', [
|
||||
("l'avion", ["l'", "avion"]), ("j'ai", ["j'", "ai"])])
|
||||
def test_issue768(text, expected_tokens):
|
||||
"""Allow zero-width 'infix' token during the tokenization process."""
|
||||
SPLIT_INFIX = r'(?<=[{a}]\')(?=[{a}])'.format(a=ALPHA)
|
||||
|
||||
class FrenchTest(Language):
|
||||
class Defaults(Language.Defaults):
|
||||
infixes = TOKENIZER_INFIXES + [SPLIT_INFIX]
|
||||
|
||||
fr_tokenizer_w_infix = FrenchTest.Defaults.create_tokenizer()
|
||||
tokens = fr_tokenizer_w_infix(text)
|
||||
assert len(tokens) == 2
|
||||
assert [t.text for t in tokens] == expected_tokens
|
|
@ -1,6 +1,9 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from spacy.lang.fr.lex_attrs import like_num
|
||||
|
||||
|
||||
def test_tokenizer_handles_long_text(fr_tokenizer):
|
||||
text = """L'histoire du TAL commence dans les années 1950, bien que l'on puisse \
|
||||
|
@ -12,6 +15,11 @@ un humain dans une conversation écrite en temps réel, de façon suffisamment \
|
|||
convaincante que l'interlocuteur humain ne peut distinguer sûrement — sur la \
|
||||
base du seul contenu de la conversation — s'il interagit avec un programme \
|
||||
ou avec un autre vrai humain."""
|
||||
|
||||
tokens = fr_tokenizer(text)
|
||||
assert len(tokens) == 113
|
||||
|
||||
|
||||
@pytest.mark.parametrize('word', ['onze', 'onzième'])
|
||||
def test_fr_lex_attrs_capitals(word):
|
||||
assert like_num(word)
|
||||
assert like_num(word.upper())
|
||||
|
|
|
@ -11,7 +11,7 @@ GA_TOKEN_EXCEPTION_TESTS = [
|
|||
|
||||
|
||||
@pytest.mark.parametrize('text,expected_tokens', GA_TOKEN_EXCEPTION_TESTS)
|
||||
def test_tokenizer_handles_exception_cases(ga_tokenizer, text, expected_tokens):
|
||||
def test_ga_tokenizer_handles_exception_cases(ga_tokenizer, text, expected_tokens):
|
||||
tokens = ga_tokenizer(text)
|
||||
token_list = [token.text for token in tokens if not token.is_space]
|
||||
assert expected_tokens == token_list
|
||||
|
|
|
@ -6,7 +6,7 @@ import pytest
|
|||
|
||||
@pytest.mark.parametrize('text,expected_tokens',
|
||||
[('פייתון היא שפת תכנות דינמית', ['פייתון', 'היא', 'שפת', 'תכנות', 'דינמית'])])
|
||||
def test_tokenizer_handles_abbreviation(he_tokenizer, text, expected_tokens):
|
||||
def test_he_tokenizer_handles_abbreviation(he_tokenizer, text, expected_tokens):
|
||||
tokens = he_tokenizer(text)
|
||||
token_list = [token.text for token in tokens if not token.is_space]
|
||||
assert expected_tokens == token_list
|
||||
|
@ -18,6 +18,6 @@ def test_tokenizer_handles_abbreviation(he_tokenizer, text, expected_tokens):
|
|||
('עקבת אחריו בכל רחבי המדינה!', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '!']),
|
||||
('עקבת אחריו בכל רחבי המדינה..', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '..']),
|
||||
('עקבת אחריו בכל רחבי המדינה...', ['עקבת', 'אחריו', 'בכל', 'רחבי', 'המדינה', '...'])])
|
||||
def test_tokenizer_handles_punct(he_tokenizer, text, expected_tokens):
|
||||
def test_he_tokenizer_handles_punct(he_tokenizer, text, expected_tokens):
|
||||
tokens = he_tokenizer(text)
|
||||
assert expected_tokens == [token.text for token in tokens]
|
||||
|
|
|
@ -3,6 +3,7 @@ from __future__ import unicode_literals
|
|||
|
||||
import pytest
|
||||
|
||||
|
||||
DEFAULT_TESTS = [
|
||||
('N. kormányzósági\nszékhely.', ['N.', 'kormányzósági', 'székhely', '.']),
|
||||
pytest.mark.xfail(('A .hu egy tld.', ['A', '.hu', 'egy', 'tld', '.'])),
|
||||
|
@ -277,7 +278,7 @@ TESTCASES = DEFAULT_TESTS + DOT_TESTS + QUOTE_TESTS + NUMBER_TESTS + HYPHEN_TEST
|
|||
|
||||
|
||||
@pytest.mark.parametrize('text,expected_tokens', TESTCASES)
|
||||
def test_tokenizer_handles_testcases(hu_tokenizer, text, expected_tokens):
|
||||
def test_hu_tokenizer_handles_testcases(hu_tokenizer, text, expected_tokens):
|
||||
tokens = hu_tokenizer(text)
|
||||
token_list = [token.text for token in tokens if not token.is_space]
|
||||
assert expected_tokens == token_list
|
||||
|
|
|
@ -1,38 +1,35 @@
|
|||
# coding: utf-8
|
||||
"""Test that tokenizer prefixes, suffixes and infixes are handled correctly."""
|
||||
|
||||
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["(Ma'arif)"])
|
||||
def test_tokenizer_splits_no_special(id_tokenizer, text):
|
||||
def test_id_tokenizer_splits_no_special(id_tokenizer, text):
|
||||
tokens = id_tokenizer(text)
|
||||
assert len(tokens) == 3
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["Ma'arif"])
|
||||
def test_tokenizer_splits_no_punct(id_tokenizer, text):
|
||||
def test_id_tokenizer_splits_no_punct(id_tokenizer, text):
|
||||
tokens = id_tokenizer(text)
|
||||
assert len(tokens) == 1
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["(Ma'arif"])
|
||||
def test_tokenizer_splits_prefix_punct(id_tokenizer, text):
|
||||
def test_id_tokenizer_splits_prefix_punct(id_tokenizer, text):
|
||||
tokens = id_tokenizer(text)
|
||||
assert len(tokens) == 2
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["Ma'arif)"])
|
||||
def test_tokenizer_splits_suffix_punct(id_tokenizer, text):
|
||||
def test_id_tokenizer_splits_suffix_punct(id_tokenizer, text):
|
||||
tokens = id_tokenizer(text)
|
||||
assert len(tokens) == 2
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["(Ma'arif)"])
|
||||
def test_tokenizer_splits_even_wrap(id_tokenizer, text):
|
||||
def test_id_tokenizer_splits_even_wrap(id_tokenizer, text):
|
||||
tokens = id_tokenizer(text)
|
||||
assert len(tokens) == 3
|
||||
|
||||
|
@ -44,49 +41,49 @@ def test_tokenizer_splits_uneven_wrap(id_tokenizer, text):
|
|||
|
||||
|
||||
@pytest.mark.parametrize('text,length', [("S.Kom.", 1), ("SKom.", 2), ("(S.Kom.", 2)])
|
||||
def test_tokenizer_splits_prefix_interact(id_tokenizer, text, length):
|
||||
def test_id_tokenizer_splits_prefix_interact(id_tokenizer, text, length):
|
||||
tokens = id_tokenizer(text)
|
||||
assert len(tokens) == length
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["S.Kom.)"])
|
||||
def test_tokenizer_splits_suffix_interact(id_tokenizer, text):
|
||||
def test_id_tokenizer_splits_suffix_interact(id_tokenizer, text):
|
||||
tokens = id_tokenizer(text)
|
||||
assert len(tokens) == 2
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["(S.Kom.)"])
|
||||
def test_tokenizer_splits_even_wrap_interact(id_tokenizer, text):
|
||||
def test_id_tokenizer_splits_even_wrap_interact(id_tokenizer, text):
|
||||
tokens = id_tokenizer(text)
|
||||
assert len(tokens) == 3
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["(S.Kom.?)"])
|
||||
def test_tokenizer_splits_uneven_wrap_interact(id_tokenizer, text):
|
||||
def test_id_tokenizer_splits_uneven_wrap_interact(id_tokenizer, text):
|
||||
tokens = id_tokenizer(text)
|
||||
assert len(tokens) == 4
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text,length', [("gara-gara", 1), ("Jokowi-Ahok", 3), ("Sukarno-Hatta", 3)])
|
||||
def test_tokenizer_splits_hyphens(id_tokenizer, text, length):
|
||||
def test_id_tokenizer_splits_hyphens(id_tokenizer, text, length):
|
||||
tokens = id_tokenizer(text)
|
||||
assert len(tokens) == length
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["0.1-13.5", "0.0-0.1", "103.27-300"])
|
||||
def test_tokenizer_splits_numeric_range(id_tokenizer, text):
|
||||
def test_id_tokenizer_splits_numeric_range(id_tokenizer, text):
|
||||
tokens = id_tokenizer(text)
|
||||
assert len(tokens) == 3
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["ini.Budi", "Halo.Bandung"])
|
||||
def test_tokenizer_splits_period_infix(id_tokenizer, text):
|
||||
def test_id_tokenizer_splits_period_infix(id_tokenizer, text):
|
||||
tokens = id_tokenizer(text)
|
||||
assert len(tokens) == 3
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["Halo,Bandung", "satu,dua"])
|
||||
def test_tokenizer_splits_comma_infix(id_tokenizer, text):
|
||||
def test_id_tokenizer_splits_comma_infix(id_tokenizer, text):
|
||||
tokens = id_tokenizer(text)
|
||||
assert len(tokens) == 3
|
||||
assert tokens[0].text == text.split(",")[0]
|
||||
|
@ -95,12 +92,12 @@ def test_tokenizer_splits_comma_infix(id_tokenizer, text):
|
|||
|
||||
|
||||
@pytest.mark.parametrize('text', ["halo...Bandung", "dia...pergi"])
|
||||
def test_tokenizer_splits_ellipsis_infix(id_tokenizer, text):
|
||||
def test_id_tokenizer_splits_ellipsis_infix(id_tokenizer, text):
|
||||
tokens = id_tokenizer(text)
|
||||
assert len(tokens) == 3
|
||||
|
||||
|
||||
def test_tokenizer_splits_double_hyphen_infix(id_tokenizer):
|
||||
def test_id_tokenizer_splits_double_hyphen_infix(id_tokenizer):
|
||||
tokens = id_tokenizer("Arsene Wenger--manajer Arsenal--melakukan konferensi pers.")
|
||||
assert len(tokens) == 10
|
||||
assert tokens[0].text == "Arsene"
|
||||
|
|
11
spacy/tests/lang/id/test_text.py
Normal file
|
@ -0,0 +1,11 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from spacy.lang.id.lex_attrs import like_num
|
||||
|
||||
|
||||
@pytest.mark.parametrize('word', ['sebelas'])
|
||||
def test_id_lex_attrs_capitals(word):
|
||||
assert like_num(word)
|
||||
assert like_num(word.upper())
|
|
@ -1,18 +0,0 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
|
||||
LEMMAS = (
|
||||
('新しく', '新しい'),
|
||||
('赤く', '赤い'),
|
||||
('すごく', '凄い'),
|
||||
('いただきました', '頂く'),
|
||||
('なった', '成る'))
|
||||
|
||||
@pytest.mark.parametrize('word,lemma', LEMMAS)
|
||||
def test_japanese_lemmas(JA, word, lemma):
|
||||
test_lemma = JA(word)[0].lemma_
|
||||
assert test_lemma == lemma
|
||||
|
||||
|
15
spacy/tests/lang/ja/test_lemmatization.py
Normal file
|
@ -0,0 +1,15 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
@pytest.mark.parametrize('word,lemma', [
|
||||
('新しく', '新しい'),
|
||||
('赤く', '赤い'),
|
||||
('すごく', '凄い'),
|
||||
('いただきました', '頂く'),
|
||||
('なった', '成る')])
|
||||
def test_ja_lemmatizer_assigns(ja_tokenizer, word, lemma):
|
||||
test_lemma = ja_tokenizer(word)[0].lemma_
|
||||
assert test_lemma == lemma
|
|
@ -5,41 +5,43 @@ import pytest
|
|||
|
||||
|
||||
TOKENIZER_TESTS = [
|
||||
("日本語だよ", ['日本', '語', 'だ', 'よ']),
|
||||
("東京タワーの近くに住んでいます。", ['東京', 'タワー', 'の', '近く', 'に', '住ん', 'で', 'い', 'ます', '。']),
|
||||
("吾輩は猫である。", ['吾輩', 'は', '猫', 'で', 'ある', '。']),
|
||||
("月に代わって、お仕置きよ!", ['月', 'に', '代わっ', 'て', '、', 'お', '仕置き', 'よ', '!']),
|
||||
("すもももももももものうち", ['すもも', 'も', 'もも', 'も', 'もも', 'の', 'うち'])
|
||||
("日本語だよ", ['日本', '語', 'だ', 'よ']),
|
||||
("東京タワーの近くに住んでいます。", ['東京', 'タワー', 'の', '近く', 'に', '住ん', 'で', 'い', 'ます', '。']),
|
||||
("吾輩は猫である。", ['吾輩', 'は', '猫', 'で', 'ある', '。']),
|
||||
("月に代わって、お仕置きよ!", ['月', 'に', '代わっ', 'て', '、', 'お', '仕置き', 'よ', '!']),
|
||||
("すもももももももものうち", ['すもも', 'も', 'もも', 'も', 'もも', 'の', 'うち'])
|
||||
]
|
||||
|
||||
TAG_TESTS = [
|
||||
("日本語だよ", ['日本語だよ', '名詞-固有名詞-地名-国', '名詞-普通名詞-一般', '助動詞', '助詞-終助詞']),
|
||||
("東京タワーの近くに住んでいます。", ['名詞-固有名詞-地名-一般', '名詞-普通名詞-一般', '助詞-格助詞', '名詞-普通名詞-副詞可能', '助詞-格助詞', '動詞-一般', '助詞-接続助詞', '動詞-非自立可能', '助動詞', '補助記号-句点']),
|
||||
("吾輩は猫である。", ['代名詞', '助詞-係助詞', '名詞-普通名詞-一般', '助動詞', '動詞-非自立可能', '補助記号-句点']),
|
||||
("月に代わって、お仕置きよ!", ['名詞-普通名詞-助数詞可能', '助詞-格助詞', '動詞-一般', '助詞-接続助詞', '補助記号-読点', '接頭辞', '名詞-普通名詞-一般', '助詞-終助詞', '補助記号-句点 ']),
|
||||
("すもももももももものうち", ['名詞-普通名詞-一般', '助詞-係助詞', '名詞-普通名詞-一般', '助詞-係助詞', '名詞-普通名詞-一般', '助詞-格助詞', '名詞-普通名詞-副詞可能'])
|
||||
("日本語だよ", ['日本語だよ', '名詞-固有名詞-地名-国', '名詞-普通名詞-一般', '助動詞', '助詞-終助詞']),
|
||||
("東京タワーの近くに住んでいます。", ['名詞-固有名詞-地名-一般', '名詞-普通名詞-一般', '助詞-格助詞', '名詞-普通名詞-副詞可能', '助詞-格助詞', '動詞-一般', '助詞-接続助詞', '動詞-非自立可能', '助動詞', '補助記号-句点']),
|
||||
("吾輩は猫である。", ['代名詞', '助詞-係助詞', '名詞-普通名詞-一般', '助動詞', '動詞-非自立可能', '補助記号-句点']),
|
||||
("月に代わって、お仕置きよ!", ['名詞-普通名詞-助数詞可能', '助詞-格助詞', '動詞-一般', '助詞-接続助詞', '補助記号-読点', '接頭辞', '名詞-普通名詞-一般', '助詞-終助詞', '補助記号-句点 ']),
|
||||
("すもももももももものうち", ['名詞-普通名詞-一般', '助詞-係助詞', '名詞-普通名詞-一般', '助詞-係助詞', '名詞-普通名詞-一般', '助詞-格助詞', '名詞-普通名詞-副詞可能'])
|
||||
]
|
||||
|
||||
POS_TESTS = [
|
||||
('日本語だよ', ['PROPN', 'NOUN', 'AUX', 'PART']),
|
||||
('東京タワーの近くに住んでいます。', ['PROPN', 'NOUN', 'ADP', 'NOUN', 'ADP', 'VERB', 'SCONJ', 'VERB', 'AUX', 'PUNCT']),
|
||||
('吾輩は猫である。', ['PRON', 'ADP', 'NOUN', 'AUX', 'VERB', 'PUNCT']),
|
||||
('月に代わって、お仕置きよ!', ['NOUN', 'ADP', 'VERB', 'SCONJ', 'PUNCT', 'NOUN', 'NOUN', 'PART', 'PUNCT']),
|
||||
('すもももももももものうち', ['NOUN', 'ADP', 'NOUN', 'ADP', 'NOUN', 'ADP', 'NOUN'])
|
||||
('日本語だよ', ['PROPN', 'NOUN', 'AUX', 'PART']),
|
||||
('東京タワーの近くに住んでいます。', ['PROPN', 'NOUN', 'ADP', 'NOUN', 'ADP', 'VERB', 'SCONJ', 'VERB', 'AUX', 'PUNCT']),
|
||||
('吾輩は猫である。', ['PRON', 'ADP', 'NOUN', 'AUX', 'VERB', 'PUNCT']),
|
||||
('月に代わって、お仕置きよ!', ['NOUN', 'ADP', 'VERB', 'SCONJ', 'PUNCT', 'NOUN', 'NOUN', 'PART', 'PUNCT']),
|
||||
('すもももももももものうち', ['NOUN', 'ADP', 'NOUN', 'ADP', 'NOUN', 'ADP', 'NOUN'])
|
||||
]
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text,expected_tokens', TOKENIZER_TESTS)
|
||||
def test_japanese_tokenizer(ja_tokenizer, text, expected_tokens):
|
||||
def test_ja_tokenizer(ja_tokenizer, text, expected_tokens):
|
||||
tokens = [token.text for token in ja_tokenizer(text)]
|
||||
assert tokens == expected_tokens
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text,expected_tags', TAG_TESTS)
|
||||
def test_japanese_tokenizer(ja_tokenizer, text, expected_tags):
|
||||
def test_ja_tokenizer(ja_tokenizer, text, expected_tags):
|
||||
tags = [token.tag_ for token in ja_tokenizer(text)]
|
||||
assert tags == expected_tags
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text,expected_pos', POS_TESTS)
|
||||
def test_japanese_tokenizer(ja_tokenizer, text, expected_pos):
|
||||
def test_ja_tokenizer(ja_tokenizer, text, expected_pos):
|
||||
pos = [token.pos_ for token in ja_tokenizer(text)]
|
||||
assert pos == expected_pos
|
||||
|
|
|
@ -11,7 +11,7 @@ NB_TOKEN_EXCEPTION_TESTS = [
|
|||
|
||||
|
||||
@pytest.mark.parametrize('text,expected_tokens', NB_TOKEN_EXCEPTION_TESTS)
|
||||
def test_tokenizer_handles_exception_cases(nb_tokenizer, text, expected_tokens):
|
||||
def test_nb_tokenizer_handles_exception_cases(nb_tokenizer, text, expected_tokens):
|
||||
tokens = nb_tokenizer(text)
|
||||
token_list = [token.text for token in tokens if not token.is_space]
|
||||
assert expected_tokens == token_list
|
||||
|
|
11
spacy/tests/lang/nl/test_text.py
Normal file
|
@ -0,0 +1,11 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from spacy.lang.nl.lex_attrs import like_num
|
||||
|
||||
|
||||
@pytest.mark.parametrize('word', ['elf', 'elfde'])
|
||||
def test_nl_lex_attrs_capitals(word):
|
||||
assert like_num(word)
|
||||
assert like_num(word.upper())
|
11
spacy/tests/lang/pt/test_text.py
Normal file
|
@ -0,0 +1,11 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from spacy.lang.pt.lex_attrs import like_num
|
||||
|
||||
|
||||
@pytest.mark.parametrize('word', ['onze', 'quadragésimo'])
|
||||
def test_pt_lex_attrs_capitals(word):
|
||||
assert like_num(word)
|
||||
assert like_num(word.upper())
|
|
@ -4,10 +4,11 @@ from __future__ import unicode_literals
|
|||
import pytest
|
||||
|
||||
|
||||
@pytest.mark.parametrize('string,lemma', [('câini', 'câine'),
|
||||
('expedițiilor', 'expediție'),
|
||||
('pensete', 'pensetă'),
|
||||
('erau', 'fi')])
|
||||
def test_lemmatizer_lookup_assigns(ro_tokenizer, string, lemma):
|
||||
@pytest.mark.parametrize('string,lemma', [
|
||||
('câini', 'câine'),
|
||||
('expedițiilor', 'expediție'),
|
||||
('pensete', 'pensetă'),
|
||||
('erau', 'fi')])
|
||||
def test_ro_lemmatizer_lookup_assigns(ro_tokenizer, string, lemma):
|
||||
tokens = ro_tokenizer(string)
|
||||
assert tokens[0].lemma_ == lemma
|
||||
|
|
|
@ -3,23 +3,20 @@ from __future__ import unicode_literals
|
|||
|
||||
import pytest
|
||||
|
||||
DEFAULT_TESTS = [
|
||||
|
||||
TEST_CASES = [
|
||||
('Adresa este str. Principală nr. 5.', ['Adresa', 'este', 'str.', 'Principală', 'nr.', '5', '.']),
|
||||
('Teste, etc.', ['Teste', ',', 'etc.']),
|
||||
('Lista, ș.a.m.d.', ['Lista', ',', 'ș.a.m.d.']),
|
||||
('Și d.p.d.v. al...', ['Și', 'd.p.d.v.', 'al', '...'])
|
||||
]
|
||||
|
||||
NUMBER_TESTS = [
|
||||
('Și d.p.d.v. al...', ['Și', 'd.p.d.v.', 'al', '...']),
|
||||
# number tests
|
||||
('Clasa a 4-a.', ['Clasa', 'a', '4-a', '.']),
|
||||
('Al 12-lea ceas.', ['Al', '12-lea', 'ceas', '.'])
|
||||
]
|
||||
|
||||
TESTCASES = DEFAULT_TESTS + NUMBER_TESTS
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text,expected_tokens', TESTCASES)
|
||||
def test_tokenizer_handles_testcases(ro_tokenizer, text, expected_tokens):
|
||||
@pytest.mark.parametrize('text,expected_tokens', TEST_CASES)
|
||||
def test_ro_tokenizer_handles_testcases(ro_tokenizer, text, expected_tokens):
|
||||
tokens = ro_tokenizer(text)
|
||||
token_list = [token.text for token in tokens if not token.is_space]
|
||||
assert expected_tokens == token_list
|
||||
|
|
14
spacy/tests/lang/ru/test_exceptions.py
Normal file
|
@ -0,0 +1,14 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text,norms', [
|
||||
("пн.", ["понедельник"]),
|
||||
("пт.", ["пятница"]),
|
||||
("дек.", ["декабрь"])])
|
||||
def test_ru_tokenizer_abbrev_exceptions(ru_tokenizer, text, norms):
|
||||
tokens = ru_tokenizer(text)
|
||||
assert len(tokens) == 1
|
||||
assert [token.norm_ for token in tokens] == norms
|
|
@ -2,70 +2,62 @@
|
|||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from ....tokens.doc import Doc
|
||||
from spacy.lang.ru import Russian
|
||||
|
||||
from ...util import get_doc
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def ru_lemmatizer(RU):
|
||||
return RU.Defaults.create_lemmatizer()
|
||||
def ru_lemmatizer():
|
||||
pymorphy = pytest.importorskip('pymorphy2')
|
||||
return Russian.Defaults.create_lemmatizer()
|
||||
|
||||
|
||||
@pytest.mark.models('ru')
|
||||
def test_doc_lemmatization(RU):
|
||||
doc = Doc(RU.vocab, words=['мама', 'мыла', 'раму'])
|
||||
doc[0].tag_ = 'NOUN__Animacy=Anim|Case=Nom|Gender=Fem|Number=Sing'
|
||||
doc[1].tag_ = 'VERB__Aspect=Imp|Gender=Fem|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act'
|
||||
doc[2].tag_ = 'NOUN__Animacy=Anim|Case=Acc|Gender=Fem|Number=Sing'
|
||||
|
||||
def test_ru_doc_lemmatization(ru_tokenizer):
|
||||
words = ['мама', 'мыла', 'раму']
|
||||
tags = ['NOUN__Animacy=Anim|Case=Nom|Gender=Fem|Number=Sing',
|
||||
'VERB__Aspect=Imp|Gender=Fem|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act',
|
||||
'NOUN__Animacy=Anim|Case=Acc|Gender=Fem|Number=Sing']
|
||||
doc = get_doc(ru_tokenizer.vocab, words=words, tags=tags)
|
||||
lemmas = [token.lemma_ for token in doc]
|
||||
assert lemmas == ['мама', 'мыть', 'рама']
|
||||
|
||||
|
||||
@pytest.mark.models('ru')
|
||||
@pytest.mark.parametrize('text,lemmas', [('гвоздики', ['гвоздик', 'гвоздика']),
|
||||
('люди', ['человек']),
|
||||
('реки', ['река']),
|
||||
('кольцо', ['кольцо']),
|
||||
('пепперони', ['пепперони'])])
|
||||
@pytest.mark.parametrize('text,lemmas', [
|
||||
('гвоздики', ['гвоздик', 'гвоздика']),
|
||||
('люди', ['человек']),
|
||||
('реки', ['река']),
|
||||
('кольцо', ['кольцо']),
|
||||
('пепперони', ['пепперони'])])
|
||||
def test_ru_lemmatizer_noun_lemmas(ru_lemmatizer, text, lemmas):
|
||||
assert sorted(ru_lemmatizer.noun(text)) == lemmas
|
||||
|
||||
|
||||
@pytest.mark.models('ru')
|
||||
@pytest.mark.parametrize('text,pos,morphology,lemma', [('рой', 'NOUN', None, 'рой'),
|
||||
('рой', 'VERB', None, 'рыть'),
|
||||
('клей', 'NOUN', None, 'клей'),
|
||||
('клей', 'VERB', None, 'клеить'),
|
||||
('три', 'NUM', None, 'три'),
|
||||
('кос', 'NOUN', {'Number': 'Sing'}, 'кос'),
|
||||
('кос', 'NOUN', {'Number': 'Plur'}, 'коса'),
|
||||
('кос', 'ADJ', None, 'косой'),
|
||||
('потом', 'NOUN', None, 'пот'),
|
||||
('потом', 'ADV', None, 'потом')
|
||||
])
|
||||
@pytest.mark.parametrize('text,pos,morphology,lemma', [
|
||||
('рой', 'NOUN', None, 'рой'),
|
||||
('рой', 'VERB', None, 'рыть'),
|
||||
('клей', 'NOUN', None, 'клей'),
|
||||
('клей', 'VERB', None, 'клеить'),
|
||||
('три', 'NUM', None, 'три'),
|
||||
('кос', 'NOUN', {'Number': 'Sing'}, 'кос'),
|
||||
('кос', 'NOUN', {'Number': 'Plur'}, 'коса'),
|
||||
('кос', 'ADJ', None, 'косой'),
|
||||
('потом', 'NOUN', None, 'пот'),
|
||||
('потом', 'ADV', None, 'потом')])
|
||||
def test_ru_lemmatizer_works_with_different_pos_homonyms(ru_lemmatizer, text, pos, morphology, lemma):
|
||||
assert ru_lemmatizer(text, pos, morphology) == [lemma]
|
||||
|
||||
|
||||
@pytest.mark.models('ru')
|
||||
@pytest.mark.parametrize('text,morphology,lemma', [('гвоздики', {'Gender': 'Fem'}, 'гвоздика'),
|
||||
('гвоздики', {'Gender': 'Masc'}, 'гвоздик'),
|
||||
('вина', {'Gender': 'Fem'}, 'вина'),
|
||||
('вина', {'Gender': 'Neut'}, 'вино')
|
||||
])
|
||||
@pytest.mark.parametrize('text,morphology,lemma', [
|
||||
('гвоздики', {'Gender': 'Fem'}, 'гвоздика'),
|
||||
('гвоздики', {'Gender': 'Masc'}, 'гвоздик'),
|
||||
('вина', {'Gender': 'Fem'}, 'вина'),
|
||||
('вина', {'Gender': 'Neut'}, 'вино')])
|
||||
def test_ru_lemmatizer_works_with_noun_homonyms(ru_lemmatizer, text, morphology, lemma):
|
||||
assert ru_lemmatizer.noun(text, morphology) == [lemma]
|
||||
|
||||
|
||||
@pytest.mark.models('ru')
|
||||
def test_ru_lemmatizer_punct(ru_lemmatizer):
|
||||
assert ru_lemmatizer.punct('«') == ['"']
|
||||
assert ru_lemmatizer.punct('»') == ['"']
|
||||
|
||||
|
||||
# @pytest.mark.models('ru')
|
||||
# def test_ru_lemmatizer_lemma_assignment(RU):
|
||||
# text = "А роза упала на лапу Азора."
|
||||
# doc = RU.make_doc(text)
|
||||
# RU.tagger(doc)
|
||||
# assert all(t.lemma_ != '' for t in doc)
|
||||
|
|
11
spacy/tests/lang/ru/test_text.py
Normal file
|
@ -0,0 +1,11 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from spacy.lang.ru.lex_attrs import like_num
|
||||
|
||||
|
||||
@pytest.mark.parametrize('word', ['одиннадцать'])
|
||||
def test_ru_lex_attrs_capitals(word):
|
||||
assert like_num(word)
|
||||
assert like_num(word.upper())
|
|
@ -1,7 +1,4 @@
|
|||
# coding: utf-8
|
||||
"""Test that open, closed and paired punctuation is split off correctly."""
|
||||
|
||||
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
|
|
|
@ -1,16 +0,0 @@
|
|||
# coding: utf-8
|
||||
"""Test that tokenizer exceptions are parsed correctly."""
|
||||
|
||||
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text,norms', [("пн.", ["понедельник"]),
|
||||
("пт.", ["пятница"]),
|
||||
("дек.", ["декабрь"])])
|
||||
def test_ru_tokenizer_abbrev_exceptions(ru_tokenizer, text, norms):
|
||||
tokens = ru_tokenizer(text)
|
||||
assert len(tokens) == 1
|
||||
assert [token.norm_ for token in tokens] == norms
|
|
@ -11,14 +11,14 @@ SV_TOKEN_EXCEPTION_TESTS = [
|
|||
|
||||
|
||||
@pytest.mark.parametrize('text,expected_tokens', SV_TOKEN_EXCEPTION_TESTS)
|
||||
def test_tokenizer_handles_exception_cases(sv_tokenizer, text, expected_tokens):
|
||||
def test_sv_tokenizer_handles_exception_cases(sv_tokenizer, text, expected_tokens):
|
||||
tokens = sv_tokenizer(text)
|
||||
token_list = [token.text for token in tokens if not token.is_space]
|
||||
assert expected_tokens == token_list
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["driveru", "hajaru", "Serru", "Fixaru"])
|
||||
def test_tokenizer_handles_verb_exceptions(sv_tokenizer, text):
|
||||
def test_sv_tokenizer_handles_verb_exceptions(sv_tokenizer, text):
|
||||
tokens = sv_tokenizer(text)
|
||||
assert len(tokens) == 2
|
||||
assert tokens[1].text == "u"
|
||||
|
|
|
@ -1,10 +1,9 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...attrs import intify_attrs, ORTH, NORM, LEMMA, IS_ALPHA
|
||||
from ...lang.lex_attrs import is_punct, is_ascii, is_currency, like_url, word_shape
|
||||
|
||||
import pytest
|
||||
from spacy.attrs import intify_attrs, ORTH, NORM, LEMMA, IS_ALPHA
|
||||
from spacy.lang.lex_attrs import is_punct, is_ascii, is_currency, like_url, word_shape
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["dog"])
|
||||
|
|
|
@ -3,11 +3,9 @@ from __future__ import unicode_literals
|
|||
|
||||
import pytest
|
||||
|
||||
TOKENIZER_TESTS = [
|
||||
("คุณรักผมไหม", ['คุณ', 'รัก', 'ผม', 'ไหม'])
|
||||
]
|
||||
|
||||
@pytest.mark.parametrize('text,expected_tokens', TOKENIZER_TESTS)
|
||||
def test_thai_tokenizer(th_tokenizer, text, expected_tokens):
|
||||
tokens = [token.text for token in th_tokenizer(text)]
|
||||
assert tokens == expected_tokens
|
||||
@pytest.mark.parametrize('text,expected_tokens', [
|
||||
("คุณรักผมไหม", ['คุณ', 'รัก', 'ผม', 'ไหม'])])
|
||||
def test_th_tokenizer(th_tokenizer, text, expected_tokens):
|
||||
tokens = [token.text for token in th_tokenizer(text)]
|
||||
assert tokens == expected_tokens
|
||||
|
|
|
@ -3,13 +3,15 @@ from __future__ import unicode_literals
|
|||
|
||||
import pytest
|
||||
|
||||
@pytest.mark.parametrize('string,lemma', [('evlerimizdeki', 'ev'),
|
||||
('işlerimizi', 'iş'),
|
||||
('biran', 'biran'),
|
||||
('bitirmeliyiz', 'bitir'),
|
||||
('isteklerimizi', 'istek'),
|
||||
('karşılaştırmamızın', 'karşılaştır'),
|
||||
('çoğulculuktan', 'çoğulcu')])
|
||||
def test_lemmatizer_lookup_assigns(tr_tokenizer, string, lemma):
|
||||
|
||||
@pytest.mark.parametrize('string,lemma', [
|
||||
('evlerimizdeki', 'ev'),
|
||||
('işlerimizi', 'iş'),
|
||||
('biran', 'biran'),
|
||||
('bitirmeliyiz', 'bitir'),
|
||||
('isteklerimizi', 'istek'),
|
||||
('karşılaştırmamızın', 'karşılaştır'),
|
||||
('çoğulculuktan', 'çoğulcu')])
|
||||
def test_tr_lemmatizer_lookup_assigns(tr_tokenizer, string, lemma):
|
||||
tokens = tr_tokenizer(string)
|
||||
assert tokens[0].lemma_ == lemma
|
||||
assert tokens[0].lemma_ == lemma
|
||||
|
|
|
@ -3,6 +3,7 @@ from __future__ import unicode_literals
|
|||
|
||||
import pytest
|
||||
|
||||
|
||||
INFIX_HYPHEN_TESTS = [
|
||||
("Явым-төшем күләме.", "Явым-төшем күләме .".split()),
|
||||
("Хатын-кыз киеме.", "Хатын-кыз киеме .".split())
|
||||
|
@ -64,12 +65,12 @@ NORM_TESTCASES = [
|
|||
|
||||
|
||||
@pytest.mark.parametrize("text,expected_tokens", TESTCASES)
|
||||
def test_tokenizer_handles_testcases(tt_tokenizer, text, expected_tokens):
|
||||
def test_tt_tokenizer_handles_testcases(tt_tokenizer, text, expected_tokens):
|
||||
tokens = [token.text for token in tt_tokenizer(text) if not token.is_space]
|
||||
assert expected_tokens == tokens
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text,norms', NORM_TESTCASES)
|
||||
def test_tokenizer_handles_norm_exceptions(tt_tokenizer, text, norms):
|
||||
def test_tt_tokenizer_handles_norm_exceptions(tt_tokenizer, text, norms):
|
||||
tokens = tt_tokenizer(text)
|
||||
assert [token.norm_ for token in tokens] == norms
|
||||
|
|
|
@ -1,19 +1,14 @@
|
|||
# coding: utf-8
|
||||
|
||||
"""Test that longer and mixed texts are tokenized correctly."""
|
||||
|
||||
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
def test_tokenizer_handles_long_text(ur_tokenizer):
|
||||
def test_ur_tokenizer_handles_long_text(ur_tokenizer):
|
||||
text = """اصل میں رسوا ہونے کی ہمیں
|
||||
کچھ عادت سی ہو گئی ہے اس لئے جگ ہنسائی کا ذکر نہیں کرتا،ہوا کچھ یوں کہ عرصہ چھ سال بعد ہمیں بھی خیال آیا
|
||||
کہ ایک عدد ٹیلی ویژن ہی کیوں نہ خرید لیں ، سوچا ورلڈ کپ ہی دیکھیں گے۔اپنے پاکستان کے کھلاڑیوں کو دیکھ کر
|
||||
کہ ایک عدد ٹیلی ویژن ہی کیوں نہ خرید لیں ، سوچا ورلڈ کپ ہی دیکھیں گے۔اپنے پاکستان کے کھلاڑیوں کو دیکھ کر
|
||||
ورلڈ کپ دیکھنے کا حوصلہ ہی نہ رہا تو اب یوں ہی ادھر اُدھر کے چینل گھمانے لگ پڑتے ہیں۔"""
|
||||
|
||||
tokens = ur_tokenizer(text)
|
||||
assert len(tokens) == 77
|
||||
|
||||
|
@ -21,6 +16,6 @@ def test_tokenizer_handles_long_text(ur_tokenizer):
|
|||
@pytest.mark.parametrize('text,length', [
|
||||
("تحریر باسط حبیب", 3),
|
||||
("میرا پاکستان", 2)])
|
||||
def test_tokenizer_handles_cnts(ur_tokenizer, text, length):
|
||||
def test_ur_tokenizer_handles_cnts(ur_tokenizer, text, length):
|
||||
tokens = ur_tokenizer(text)
|
||||
assert len(tokens) == length
|
||||
|
|
|
@ -1,19 +1,16 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ..matcher import Matcher, PhraseMatcher
|
||||
from .util import get_doc
|
||||
|
||||
import pytest
|
||||
from spacy.matcher import Matcher
|
||||
from spacy.tokens import Doc
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def matcher(en_vocab):
|
||||
rules = {
|
||||
'JS': [[{'ORTH': 'JavaScript'}]],
|
||||
'GoogleNow': [[{'ORTH': 'Google'}, {'ORTH': 'Now'}]],
|
||||
'Java': [[{'LOWER': 'java'}]]
|
||||
}
|
||||
rules = {'JS': [[{'ORTH': 'JavaScript'}]],
|
||||
'GoogleNow': [[{'ORTH': 'Google'}, {'ORTH': 'Now'}]],
|
||||
'Java': [[{'LOWER': 'java'}]]}
|
||||
matcher = Matcher(en_vocab)
|
||||
for key, patterns in rules.items():
|
||||
matcher.add(key, None, *patterns)
|
||||
|
@ -36,7 +33,7 @@ def test_matcher_from_api_docs(en_vocab):
|
|||
|
||||
def test_matcher_from_usage_docs(en_vocab):
|
||||
text = "Wow 😀 This is really cool! 😂 😂"
|
||||
doc = get_doc(en_vocab, words=text.split(' '))
|
||||
doc = Doc(en_vocab, words=text.split(' '))
|
||||
pos_emoji = ['😀', '😃', '😂', '🤣', '😊', '😍']
|
||||
pos_patterns = [[{'ORTH': emoji}] for emoji in pos_emoji]
|
||||
|
||||
|
@ -55,68 +52,46 @@ def test_matcher_from_usage_docs(en_vocab):
|
|||
assert doc[1].norm_ == 'happy emoji'
|
||||
|
||||
|
||||
@pytest.mark.parametrize('words', [["Some", "words"]])
|
||||
def test_matcher_init(en_vocab, words):
|
||||
matcher = Matcher(en_vocab)
|
||||
doc = get_doc(en_vocab, words)
|
||||
assert len(matcher) == 0
|
||||
assert matcher(doc) == []
|
||||
|
||||
|
||||
def test_matcher_contains(matcher):
|
||||
def test_matcher_len_contains(matcher):
|
||||
assert len(matcher) == 3
|
||||
matcher.add('TEST', None, [{'ORTH': 'test'}])
|
||||
assert 'TEST' in matcher
|
||||
assert 'TEST2' not in matcher
|
||||
|
||||
|
||||
def test_matcher_no_match(matcher):
|
||||
words = ["I", "like", "cheese", "."]
|
||||
doc = get_doc(matcher.vocab, words)
|
||||
doc = Doc(matcher.vocab, words=["I", "like", "cheese", "."])
|
||||
assert matcher(doc) == []
|
||||
|
||||
|
||||
def test_matcher_compile(en_vocab):
|
||||
rules = {
|
||||
'JS': [[{'ORTH': 'JavaScript'}]],
|
||||
'GoogleNow': [[{'ORTH': 'Google'}, {'ORTH': 'Now'}]],
|
||||
'Java': [[{'LOWER': 'java'}]]
|
||||
}
|
||||
matcher = Matcher(en_vocab)
|
||||
for key, patterns in rules.items():
|
||||
matcher.add(key, None, *patterns)
|
||||
assert len(matcher) == 3
|
||||
|
||||
|
||||
def test_matcher_match_start(matcher):
|
||||
words = ["JavaScript", "is", "good"]
|
||||
doc = get_doc(matcher.vocab, words)
|
||||
doc = Doc(matcher.vocab, words=["JavaScript", "is", "good"])
|
||||
assert matcher(doc) == [(matcher.vocab.strings['JS'], 0, 1)]
|
||||
|
||||
|
||||
def test_matcher_match_end(matcher):
|
||||
words = ["I", "like", "java"]
|
||||
doc = get_doc(matcher.vocab, words)
|
||||
doc = Doc(matcher.vocab, words=words)
|
||||
assert matcher(doc) == [(doc.vocab.strings['Java'], 2, 3)]
|
||||
|
||||
|
||||
def test_matcher_match_middle(matcher):
|
||||
words = ["I", "like", "Google", "Now", "best"]
|
||||
doc = get_doc(matcher.vocab, words)
|
||||
doc = Doc(matcher.vocab, words=words)
|
||||
assert matcher(doc) == [(doc.vocab.strings['GoogleNow'], 2, 4)]
|
||||
|
||||
|
||||
def test_matcher_match_multi(matcher):
|
||||
words = ["I", "like", "Google", "Now", "and", "java", "best"]
|
||||
doc = get_doc(matcher.vocab, words)
|
||||
doc = Doc(matcher.vocab, words=words)
|
||||
assert matcher(doc) == [(doc.vocab.strings['GoogleNow'], 2, 4),
|
||||
(doc.vocab.strings['Java'], 5, 6)]
|
||||
|
||||
|
||||
def test_matcher_empty_dict(en_vocab):
|
||||
'''Test matcher allows empty token specs, meaning match on any token.'''
|
||||
"""Test matcher allows empty token specs, meaning match on any token."""
|
||||
matcher = Matcher(en_vocab)
|
||||
abc = ["a", "b", "c"]
|
||||
doc = get_doc(matcher.vocab, abc)
|
||||
doc = Doc(matcher.vocab, words=["a", "b", "c"])
|
||||
matcher.add('A.C', None, [{'ORTH': 'a'}, {}, {'ORTH': 'c'}])
|
||||
matches = matcher(doc)
|
||||
assert len(matches) == 1
|
||||
|
@ -129,8 +104,7 @@ def test_matcher_empty_dict(en_vocab):
|
|||
|
||||
def test_matcher_operator_shadow(en_vocab):
|
||||
matcher = Matcher(en_vocab)
|
||||
abc = ["a", "b", "c"]
|
||||
doc = get_doc(matcher.vocab, abc)
|
||||
doc = Doc(matcher.vocab, words=["a", "b", "c"])
|
||||
pattern = [{'ORTH': 'a'}, {"IS_ALPHA": True, "OP": "+"}, {'ORTH': 'c'}]
|
||||
matcher.add('A.C', None, pattern)
|
||||
matches = matcher(doc)
|
||||
|
@ -138,32 +112,6 @@ def test_matcher_operator_shadow(en_vocab):
|
|||
assert matches[0][1:] == (0, 3)
|
||||
|
||||
|
||||
def test_matcher_phrase_matcher(en_vocab):
|
||||
words = ["Google", "Now"]
|
||||
doc = get_doc(en_vocab, words)
|
||||
matcher = PhraseMatcher(en_vocab)
|
||||
matcher.add('COMPANY', None, doc)
|
||||
words = ["I", "like", "Google", "Now", "best"]
|
||||
doc = get_doc(en_vocab, words)
|
||||
assert len(matcher(doc)) == 1
|
||||
|
||||
|
||||
def test_phrase_matcher_length(en_vocab):
|
||||
matcher = PhraseMatcher(en_vocab)
|
||||
assert len(matcher) == 0
|
||||
matcher.add('TEST', None, get_doc(en_vocab, ['test']))
|
||||
assert len(matcher) == 1
|
||||
matcher.add('TEST2', None, get_doc(en_vocab, ['test2']))
|
||||
assert len(matcher) == 2
|
||||
|
||||
|
||||
def test_phrase_matcher_contains(en_vocab):
|
||||
matcher = PhraseMatcher(en_vocab)
|
||||
matcher.add('TEST', None, get_doc(en_vocab, ['test']))
|
||||
assert 'TEST' in matcher
|
||||
assert 'TEST2' not in matcher
|
||||
|
||||
|
||||
def test_matcher_match_zero(matcher):
|
||||
words1 = 'He said , " some words " ...'.split()
|
||||
words2 = 'He said , " some three words " ...'.split()
|
||||
|
@ -176,12 +124,10 @@ def test_matcher_match_zero(matcher):
|
|||
{'IS_PUNCT': True},
|
||||
{'IS_PUNCT': True},
|
||||
{'ORTH': '"'}]
|
||||
|
||||
matcher.add('Quote', None, pattern1)
|
||||
doc = get_doc(matcher.vocab, words1)
|
||||
doc = Doc(matcher.vocab, words=words1)
|
||||
assert len(matcher(doc)) == 1
|
||||
|
||||
doc = get_doc(matcher.vocab, words2)
|
||||
doc = Doc(matcher.vocab, words=words2)
|
||||
assert len(matcher(doc)) == 0
|
||||
matcher.add('Quote', None, pattern2)
|
||||
assert len(matcher(doc)) == 0
|
||||
|
@ -194,14 +140,14 @@ def test_matcher_match_zero_plus(matcher):
|
|||
{'ORTH': '"'}]
|
||||
matcher = Matcher(matcher.vocab)
|
||||
matcher.add('Quote', None, pattern)
|
||||
doc = get_doc(matcher.vocab, words)
|
||||
doc = Doc(matcher.vocab, words=words)
|
||||
assert len(matcher(doc)) == 1
|
||||
|
||||
|
||||
def test_matcher_match_one_plus(matcher):
|
||||
control = Matcher(matcher.vocab)
|
||||
control.add('BasicPhilippe', None, [{'ORTH': 'Philippe'}])
|
||||
doc = get_doc(control.vocab, ['Philippe', 'Philippe'])
|
||||
doc = Doc(control.vocab, words=['Philippe', 'Philippe'])
|
||||
m = control(doc)
|
||||
assert len(m) == 2
|
||||
matcher.add('KleenePhilippe', None, [{'ORTH': 'Philippe', 'OP': '1'},
|
||||
|
@ -210,61 +156,11 @@ def test_matcher_match_one_plus(matcher):
|
|||
assert len(m) == 1
|
||||
|
||||
|
||||
def test_operator_combos(matcher):
|
||||
cases = [
|
||||
('aaab', 'a a a b', True),
|
||||
('aaab', 'a+ b', True),
|
||||
('aaab', 'a+ a+ b', True),
|
||||
('aaab', 'a+ a+ a b', True),
|
||||
('aaab', 'a+ a+ a+ b', True),
|
||||
('aaab', 'a+ a a b', True),
|
||||
('aaab', 'a+ a a', True),
|
||||
('aaab', 'a+', True),
|
||||
('aaa', 'a+ b', False),
|
||||
('aaa', 'a+ a+ b', False),
|
||||
('aaa', 'a+ a+ a+ b', False),
|
||||
('aaa', 'a+ a b', False),
|
||||
('aaa', 'a+ a a b', False),
|
||||
('aaab', 'a+ a a', True),
|
||||
('aaab', 'a+', True),
|
||||
('aaab', 'a+ a b', True)
|
||||
]
|
||||
for string, pattern_str, result in cases:
|
||||
matcher = Matcher(matcher.vocab)
|
||||
doc = get_doc(matcher.vocab, words=list(string))
|
||||
pattern = []
|
||||
for part in pattern_str.split():
|
||||
if part.endswith('+'):
|
||||
pattern.append({'ORTH': part[0], 'op': '+'})
|
||||
else:
|
||||
pattern.append({'ORTH': part})
|
||||
matcher.add('PATTERN', None, pattern)
|
||||
matches = matcher(doc)
|
||||
if result:
|
||||
assert matches, (string, pattern_str)
|
||||
else:
|
||||
assert not matches, (string, pattern_str)
|
||||
|
||||
|
||||
def test_matcher_end_zero_plus(matcher):
|
||||
"""Test matcher works when patterns end with * operator. (issue 1450)"""
|
||||
matcher = Matcher(matcher.vocab)
|
||||
pattern = [{'ORTH': "a"}, {'ORTH': "b", 'OP': "*"}]
|
||||
matcher.add("TSTEND", None, pattern)
|
||||
nlp = lambda string: get_doc(matcher.vocab, string.split())
|
||||
assert len(matcher(nlp('a'))) == 1
|
||||
assert len(matcher(nlp('a b'))) == 2
|
||||
assert len(matcher(nlp('a c'))) == 1
|
||||
assert len(matcher(nlp('a b c'))) == 2
|
||||
assert len(matcher(nlp('a b b c'))) == 3
|
||||
assert len(matcher(nlp('a b b'))) == 3
|
||||
|
||||
|
||||
def test_matcher_any_token_operator(en_vocab):
|
||||
"""Test that patterns with "any token" {} work with operators."""
|
||||
matcher = Matcher(en_vocab)
|
||||
matcher.add('TEST', None, [{'ORTH': 'test'}, {'OP': '*'}])
|
||||
doc = get_doc(en_vocab, ['test', 'hello', 'world'])
|
||||
doc = Doc(en_vocab, words=['test', 'hello', 'world'])
|
||||
matches = [doc[start:end].text for _, start, end in matcher(doc)]
|
||||
assert len(matches) == 3
|
||||
assert matches[0] == 'test'
|
116
spacy/tests/matcher/test_matcher_logic.py
Normal file
|
@ -0,0 +1,116 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
import re
|
||||
from spacy.matcher import Matcher
|
||||
from spacy.tokens import Doc
|
||||
|
||||
|
||||
pattern1 = [{'ORTH':'A', 'OP':'1'}, {'ORTH':'A', 'OP':'*'}]
|
||||
pattern2 = [{'ORTH':'A', 'OP':'*'}, {'ORTH':'A', 'OP':'1'}]
|
||||
pattern3 = [{'ORTH':'A', 'OP':'1'}, {'ORTH':'A', 'OP':'1'}]
|
||||
pattern4 = [{'ORTH':'B', 'OP':'1'}, {'ORTH':'A', 'OP':'*'}, {'ORTH':'B', 'OP':'1'}]
|
||||
pattern5 = [{'ORTH':'B', 'OP':'*'}, {'ORTH':'A', 'OP':'*'}, {'ORTH':'B', 'OP':'1'}]
|
||||
|
||||
re_pattern1 = 'AA*'
|
||||
re_pattern2 = 'A*A'
|
||||
re_pattern3 = 'AA'
|
||||
re_pattern4 = 'BA*B'
|
||||
re_pattern5 = 'B*A*B'
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def text():
|
||||
return "(ABBAAAAAB)."
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def doc(en_tokenizer, text):
|
||||
doc = en_tokenizer(' '.join(text))
|
||||
return doc
|
||||
|
||||
|
||||
@pytest.mark.xfail
|
||||
@pytest.mark.parametrize('pattern,re_pattern', [
|
||||
(pattern1, re_pattern1),
|
||||
(pattern2, re_pattern2),
|
||||
(pattern3, re_pattern3),
|
||||
(pattern4, re_pattern4),
|
||||
(pattern5, re_pattern5)])
|
||||
def test_greedy_matching(doc, text, pattern, re_pattern):
|
||||
"""Test that the greedy matching behavior of the * op is consistent with
|
||||
other re implementations."""
|
||||
matcher = Matcher(doc.vocab)
|
||||
matcher.add(re_pattern, None, pattern)
|
||||
matches = matcher(doc)
|
||||
re_matches = [m.span() for m in re.finditer(re_pattern, text)]
|
||||
for match, re_match in zip(matches, re_matches):
|
||||
assert match[1:] == re_match
|
||||
|
||||
|
||||
@pytest.mark.xfail
|
||||
@pytest.mark.parametrize('pattern,re_pattern', [
|
||||
(pattern1, re_pattern1),
|
||||
(pattern2, re_pattern2),
|
||||
(pattern3, re_pattern3),
|
||||
(pattern4, re_pattern4),
|
||||
(pattern5, re_pattern5)])
|
||||
def test_match_consuming(doc, text, pattern, re_pattern):
|
||||
"""Test that matcher.__call__ consumes tokens on a match similar to
|
||||
re.findall."""
|
||||
matcher = Matcher(doc.vocab)
|
||||
matcher.add(re_pattern, None, pattern)
|
||||
matches = matcher(doc)
|
||||
re_matches = [m.span() for m in re.finditer(re_pattern, text)]
|
||||
assert len(matches) == len(re_matches)
|
||||
|
||||
|
||||
def test_operator_combos(en_vocab):
|
||||
cases = [
|
||||
('aaab', 'a a a b', True),
|
||||
('aaab', 'a+ b', True),
|
||||
('aaab', 'a+ a+ b', True),
|
||||
('aaab', 'a+ a+ a b', True),
|
||||
('aaab', 'a+ a+ a+ b', True),
|
||||
('aaab', 'a+ a a b', True),
|
||||
('aaab', 'a+ a a', True),
|
||||
('aaab', 'a+', True),
|
||||
('aaa', 'a+ b', False),
|
||||
('aaa', 'a+ a+ b', False),
|
||||
('aaa', 'a+ a+ a+ b', False),
|
||||
('aaa', 'a+ a b', False),
|
||||
('aaa', 'a+ a a b', False),
|
||||
('aaab', 'a+ a a', True),
|
||||
('aaab', 'a+', True),
|
||||
('aaab', 'a+ a b', True)
|
||||
]
|
||||
for string, pattern_str, result in cases:
|
||||
matcher = Matcher(en_vocab)
|
||||
doc = Doc(matcher.vocab, words=list(string))
|
||||
pattern = []
|
||||
for part in pattern_str.split():
|
||||
if part.endswith('+'):
|
||||
pattern.append({'ORTH': part[0], 'OP': '+'})
|
||||
else:
|
||||
pattern.append({'ORTH': part})
|
||||
matcher.add('PATTERN', None, pattern)
|
||||
matches = matcher(doc)
|
||||
if result:
|
||||
assert matches, (string, pattern_str)
|
||||
else:
|
||||
assert not matches, (string, pattern_str)
|
||||
|
||||
|
||||
def test_matcher_end_zero_plus(en_vocab):
|
||||
"""Test matcher works when patterns end with * operator. (issue 1450)"""
|
||||
matcher = Matcher(en_vocab)
|
||||
pattern = [{'ORTH': "a"}, {'ORTH': "b", 'OP': "*"}]
|
||||
matcher.add('TSTEND', None, pattern)
|
||||
nlp = lambda string: Doc(matcher.vocab, words=string.split())
|
||||
assert len(matcher(nlp('a'))) == 1
|
||||
assert len(matcher(nlp('a b'))) == 2
|
||||
assert len(matcher(nlp('a c'))) == 1
|
||||
assert len(matcher(nlp('a b c'))) == 2
|
||||
assert len(matcher(nlp('a b b c'))) == 3
|
||||
assert len(matcher(nlp('a b b'))) == 3
|
30
spacy/tests/matcher/test_phrase_matcher.py
Normal file
|
@ -0,0 +1,30 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from spacy.matcher import PhraseMatcher
|
||||
from spacy.tokens import Doc
|
||||
|
||||
|
||||
def test_matcher_phrase_matcher(en_vocab):
|
||||
doc = Doc(en_vocab, words=["Google", "Now"])
|
||||
matcher = PhraseMatcher(en_vocab)
|
||||
matcher.add('COMPANY', None, doc)
|
||||
doc = Doc(en_vocab, words=["I", "like", "Google", "Now", "best"])
|
||||
assert len(matcher(doc)) == 1
|
||||
|
||||
|
||||
def test_phrase_matcher_length(en_vocab):
|
||||
matcher = PhraseMatcher(en_vocab)
|
||||
assert len(matcher) == 0
|
||||
matcher.add('TEST', None, Doc(en_vocab, words=['test']))
|
||||
assert len(matcher) == 1
|
||||
matcher.add('TEST2', None, Doc(en_vocab, words=['test2']))
|
||||
assert len(matcher) == 2
|
||||
|
||||
|
||||
def test_phrase_matcher_contains(en_vocab):
|
||||
matcher = PhraseMatcher(en_vocab)
|
||||
matcher.add('TEST', None, Doc(en_vocab, words=['test']))
|
||||
assert 'TEST' in matcher
|
||||
assert 'TEST2' not in matcher
|
|
@ -1,15 +1,16 @@
|
|||
'''Test the ability to add a label to a (potentially trained) parsing model.'''
|
||||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
import numpy.random
|
||||
from thinc.neural.optimizers import Adam
|
||||
from thinc.neural.ops import NumpyOps
|
||||
from spacy.attrs import NORM
|
||||
from spacy.gold import GoldParse
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.tokens import Doc
|
||||
from spacy.pipeline import DependencyParser
|
||||
|
||||
from ...attrs import NORM
|
||||
from ...gold import GoldParse
|
||||
from ...vocab import Vocab
|
||||
from ...tokens import Doc
|
||||
from ...pipeline import DependencyParser
|
||||
|
||||
numpy.random.seed(0)
|
||||
|
||||
|
@ -37,9 +38,11 @@ def parser(vocab):
|
|||
parser.update([doc], [gold], sgd=sgd, losses=losses)
|
||||
return parser
|
||||
|
||||
|
||||
def test_init_parser(parser):
|
||||
pass
|
||||
|
||||
|
||||
# TODO: This is flakey, because it depends on what the parser first learns.
|
||||
@pytest.mark.xfail
|
||||
def test_add_label(parser):
|
||||
|
@ -69,4 +72,3 @@ def test_add_label(parser):
|
|||
doc = parser(doc)
|
||||
assert doc[0].dep_ == 'right'
|
||||
assert doc[2].dep_ == 'left'
|
||||
|
||||
|
|
|
@ -1,13 +1,14 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
import pytest
|
||||
|
||||
from ...vocab import Vocab
|
||||
from ...pipeline import DependencyParser
|
||||
from ...tokens import Doc
|
||||
from ...gold import GoldParse
|
||||
from ...syntax.nonproj import projectivize
|
||||
from ...syntax.stateclass import StateClass
|
||||
from ...syntax.arc_eager import ArcEager
|
||||
import pytest
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.pipeline import DependencyParser
|
||||
from spacy.tokens import Doc
|
||||
from spacy.gold import GoldParse
|
||||
from spacy.syntax.nonproj import projectivize
|
||||
from spacy.syntax.stateclass import StateClass
|
||||
from spacy.syntax.arc_eager import ArcEager
|
||||
|
||||
|
||||
def get_sequence_costs(M, words, heads, deps, transitions):
|
||||
|
@ -105,7 +106,7 @@ annot_tuples = [
|
|||
(31, 'going', 'VBG', 26, 'parataxis', 'O'),
|
||||
(32, 'to', 'TO', 33, 'aux', 'O'),
|
||||
(33, 'spend', 'VB', 31, 'xcomp', 'O'),
|
||||
(34, 'the', 'DT', 35, 'det', 'B-TIME'),
|
||||
(34, 'the', 'DT', 35, 'det', 'B-TIME'),
|
||||
(35, 'night', 'NN', 33, 'dobj', 'L-TIME'),
|
||||
(36, 'there', 'RB', 33, 'advmod', 'O'),
|
||||
(37, 'presumably', 'RB', 33, 'advmod', 'O'),
|
||||
|
|
|
@ -1,23 +0,0 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from ...language import Language
|
||||
from ...pipeline import DependencyParser
|
||||
|
||||
|
||||
@pytest.mark.models('en')
|
||||
def test_beam_parse_en(EN):
|
||||
doc = EN(u'Australia is a country', disable=['ner'])
|
||||
ents = EN.entity(doc, beam_width=2)
|
||||
print(ents)
|
||||
|
||||
|
||||
def test_beam_parse():
|
||||
nlp = Language()
|
||||
nlp.add_pipe(DependencyParser(nlp.vocab), name='parser')
|
||||
nlp.parser.add_label('nsubj')
|
||||
nlp.parser.begin_training([], token_vector_width=8, hidden_width=8)
|
||||
|
||||
doc = nlp.make_doc(u'Australia is a country')
|
||||
nlp.parser(doc, beam_width=2)
|
|
@ -1,11 +1,12 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
|
||||
from ...vocab import Vocab
|
||||
from ...syntax.ner import BiluoPushDown
|
||||
from ...gold import GoldParse
|
||||
from ...tokens import Doc
|
||||
from spacy.pipeline import EntityRecognizer
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.syntax.ner import BiluoPushDown
|
||||
from spacy.gold import GoldParse
|
||||
from spacy.tokens import Doc
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
|
@ -71,3 +72,16 @@ def test_get_oracle_moves_negative_O(tsys, vocab):
|
|||
tsys.preprocess_gold(gold)
|
||||
act_classes = tsys.get_oracle_sequence(doc, gold)
|
||||
names = [tsys.get_class_name(act) for act in act_classes]
|
||||
|
||||
|
||||
def test_doc_add_entities_set_ents_iob(en_vocab):
|
||||
doc = Doc(en_vocab, words=["This", "is", "a", "lion"])
|
||||
ner = EntityRecognizer(en_vocab)
|
||||
ner.begin_training([])
|
||||
ner(doc)
|
||||
assert len(list(doc.ents)) == 0
|
||||
assert [w.ent_iob_ for w in doc] == (['O'] * len(doc))
|
||||
doc.ents = [(doc.vocab.strings['ANIMAL'], 3, 4)]
|
||||
assert [w.ent_iob_ for w in doc] == ['', '', '', 'B']
|
||||
doc.ents = [(doc.vocab.strings['WORD'], 0, 2)]
|
||||
assert [w.ent_iob_ for w in doc] == ['B', 'I', '', '']
|
||||
|
|
|
@ -1,16 +1,13 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
from thinc.neural import Model
|
||||
import pytest
|
||||
import numpy
|
||||
|
||||
from ..._ml import chain, Tok2Vec, doc2feats
|
||||
from ...vocab import Vocab
|
||||
from ...pipeline import Tensorizer
|
||||
from ...syntax.arc_eager import ArcEager
|
||||
from ...syntax.nn_parser import Parser
|
||||
from ...tokens.doc import Doc
|
||||
from ...gold import GoldParse
|
||||
import pytest
|
||||
from spacy._ml import Tok2Vec
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.syntax.arc_eager import ArcEager
|
||||
from spacy.syntax.nn_parser import Parser
|
||||
from spacy.tokens.doc import Doc
|
||||
from spacy.gold import GoldParse
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
|
@ -37,10 +34,12 @@ def parser(vocab, arc_eager):
|
|||
def model(arc_eager, tok2vec):
|
||||
return Parser.Model(arc_eager.n_moves, token_vector_width=tok2vec.nO)[0]
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def doc(vocab):
|
||||
return Doc(vocab, words=['a', 'b', 'c'])
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def gold(doc):
|
||||
return GoldParse(doc, heads=[1, 1, 1], deps=['L', 'ROOT', 'R'])
|
||||
|
@ -80,5 +79,3 @@ def test_update_doc_beam(parser, model, doc, gold):
|
|||
def optimize(weights, gradient, key=None):
|
||||
weights -= 0.001 * gradient
|
||||
parser.update_beam([doc], [gold], sgd=optimize)
|
||||
|
||||
|
||||
|
|
|
@ -1,20 +1,23 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
import numpy
|
||||
from thinc.api import layerize
|
||||
|
||||
from ...vocab import Vocab
|
||||
from ...syntax.arc_eager import ArcEager
|
||||
from ...tokens import Doc
|
||||
from ...gold import GoldParse
|
||||
from ...syntax._beam_utils import ParserBeam, update_beam
|
||||
from ...syntax.stateclass import StateClass
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.language import Language
|
||||
from spacy.pipeline import DependencyParser
|
||||
from spacy.syntax.arc_eager import ArcEager
|
||||
from spacy.tokens import Doc
|
||||
from spacy.syntax._beam_utils import ParserBeam
|
||||
from spacy.syntax.stateclass import StateClass
|
||||
from spacy.gold import GoldParse
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def vocab():
|
||||
return Vocab()
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def moves(vocab):
|
||||
aeager = ArcEager(vocab.strings, {})
|
||||
|
@ -65,6 +68,7 @@ def vector_size():
|
|||
def beam(moves, states, golds, beam_width):
|
||||
return ParserBeam(moves, states, golds, width=beam_width, density=0.0)
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def scores(moves, batch_size, beam_width):
|
||||
return [
|
||||
|
@ -85,3 +89,12 @@ def test_beam_advance(beam, scores):
|
|||
def test_beam_advance_too_few_scores(beam, scores):
|
||||
with pytest.raises(IndexError):
|
||||
beam.advance(scores[:-1])
|
||||
|
||||
|
||||
def test_beam_parse():
|
||||
nlp = Language()
|
||||
nlp.add_pipe(DependencyParser(nlp.vocab), name='parser')
|
||||
nlp.parser.add_label('nsubj')
|
||||
nlp.parser.begin_training([], token_vector_width=8, hidden_width=8)
|
||||
doc = nlp.make_doc('Australia is a country')
|
||||
nlp.parser(doc, beam_width=2)
|
||||
|
|
|
@ -1,35 +1,39 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...syntax.nonproj import ancestors, contains_cycle, is_nonproj_arc
|
||||
from ...syntax.nonproj import is_nonproj_tree
|
||||
from ...syntax import nonproj
|
||||
from ...attrs import DEP, HEAD
|
||||
from ..util import get_doc
|
||||
|
||||
import pytest
|
||||
from spacy.syntax.nonproj import ancestors, contains_cycle, is_nonproj_arc
|
||||
from spacy.syntax.nonproj import is_nonproj_tree
|
||||
from spacy.syntax import nonproj
|
||||
|
||||
from ..util import get_doc
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def tree():
|
||||
return [1, 2, 2, 4, 5, 2, 2]
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def cyclic_tree():
|
||||
return [1, 2, 2, 4, 5, 3, 2]
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def partial_tree():
|
||||
return [1, 2, 2, 4, 5, None, 7, 4, 2]
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def nonproj_tree():
|
||||
return [1, 2, 2, 4, 5, 2, 7, 4, 2]
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def proj_tree():
|
||||
return [1, 2, 2, 4, 5, 2, 7, 5, 2]
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def multirooted_tree():
|
||||
return [3, 2, 0, 3, 3, 7, 7, 3, 7, 10, 7, 10, 11, 12, 18, 16, 18, 17, 12, 3]
|
||||
|
@ -75,14 +79,14 @@ def test_parser_pseudoprojectivity(en_tokenizer):
|
|||
def deprojectivize(proj_heads, deco_labels):
|
||||
tokens = en_tokenizer('whatever ' * len(proj_heads))
|
||||
rel_proj_heads = [head-i for i, head in enumerate(proj_heads)]
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], deps=deco_labels, heads=rel_proj_heads)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens],
|
||||
deps=deco_labels, heads=rel_proj_heads)
|
||||
nonproj.deprojectivize(doc)
|
||||
return [t.head.i for t in doc], [token.dep_ for token in doc]
|
||||
|
||||
tree = [1, 2, 2]
|
||||
nonproj_tree = [1, 2, 2, 4, 5, 2, 7, 4, 2]
|
||||
nonproj_tree2 = [9, 1, 3, 1, 5, 6, 9, 8, 6, 1, 6, 12, 13, 10, 1]
|
||||
|
||||
labels = ['det', 'nsubj', 'root', 'det', 'dobj', 'aux', 'nsubj', 'acl', 'punct']
|
||||
labels2 = ['advmod', 'root', 'det', 'nsubj', 'advmod', 'det', 'dobj', 'det', 'nmod', 'aux', 'nmod', 'advmod', 'det', 'amod', 'punct']
|
||||
|
||||
|
|
|
@ -1,17 +1,17 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ..util import get_doc, apply_transition_sequence
|
||||
|
||||
import pytest
|
||||
|
||||
from ..util import get_doc, apply_transition_sequence
|
||||
|
||||
|
||||
def test_parser_root(en_tokenizer):
|
||||
text = "i don't have other assistance"
|
||||
heads = [3, 2, 1, 0, 1, -2]
|
||||
deps = ['nsubj', 'aux', 'neg', 'ROOT', 'amod', 'dobj']
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
|
||||
for t in doc:
|
||||
assert t.dep != 0, t.text
|
||||
|
||||
|
@ -20,7 +20,7 @@ def test_parser_root(en_tokenizer):
|
|||
@pytest.mark.parametrize('text', ["Hello"])
|
||||
def test_parser_parse_one_word_sentence(en_tokenizer, en_parser, text):
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=[0], deps=['ROOT'])
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=[0], deps=['ROOT'])
|
||||
|
||||
assert len(doc) == 1
|
||||
with en_parser.step_through(doc) as _:
|
||||
|
@ -33,10 +33,8 @@ def test_parser_initial(en_tokenizer, en_parser):
|
|||
text = "I ate the pizza with anchovies."
|
||||
heads = [1, 0, 1, -2, -3, -1, -5]
|
||||
transition = ['L-nsubj', 'S', 'L-det']
|
||||
|
||||
tokens = en_tokenizer(text)
|
||||
apply_transition_sequence(en_parser, tokens, transition)
|
||||
|
||||
assert tokens[0].head.i == 1
|
||||
assert tokens[1].head.i == 1
|
||||
assert tokens[2].head.i == 3
|
||||
|
@ -47,8 +45,7 @@ def test_parser_parse_subtrees(en_tokenizer, en_parser):
|
|||
text = "The four wheels on the bus turned quickly"
|
||||
heads = [2, 1, 4, -1, 1, -2, 0, -1]
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
|
||||
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||
assert len(list(doc[2].lefts)) == 2
|
||||
assert len(list(doc[2].rights)) == 1
|
||||
assert len(list(doc[2].children)) == 3
|
||||
|
@ -63,11 +60,9 @@ def test_parser_merge_pp(en_tokenizer):
|
|||
heads = [1, 4, -1, 1, -2, 0]
|
||||
deps = ['det', 'nsubj', 'prep', 'det', 'pobj', 'ROOT']
|
||||
tags = ['DT', 'NN', 'IN', 'DT', 'NN', 'VBZ']
|
||||
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], deps=deps, heads=heads, tags=tags)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], deps=deps, heads=heads, tags=tags)
|
||||
nps = [(np[0].idx, np[-1].idx + len(np[-1]), np.lemma_) for np in doc.noun_chunks]
|
||||
|
||||
for start, end, lemma in nps:
|
||||
doc.merge(start, end, label='NP', lemma=lemma)
|
||||
assert doc[0].text == 'A phrase'
|
||||
|
|
|
@ -1,14 +1,14 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ..util import get_doc
|
||||
|
||||
import pytest
|
||||
|
||||
from ..util import get_doc
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def text():
|
||||
return u"""
|
||||
return """
|
||||
It was a bright cold day in April, and the clocks were striking thirteen.
|
||||
Winston Smith, his chin nuzzled into his breast in an effort to escape the
|
||||
vile wind, slipped quickly through the glass doors of Victory Mansions,
|
||||
|
@ -54,7 +54,7 @@ def heads():
|
|||
|
||||
def test_parser_parse_navigate_consistency(en_tokenizer, text, heads):
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||
for head in doc:
|
||||
for child in head.lefts:
|
||||
assert child.head == head
|
||||
|
@ -64,7 +64,7 @@ def test_parser_parse_navigate_consistency(en_tokenizer, text, heads):
|
|||
|
||||
def test_parser_parse_navigate_child_consistency(en_tokenizer, text, heads):
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||
|
||||
lefts = {}
|
||||
rights = {}
|
||||
|
@ -97,7 +97,7 @@ def test_parser_parse_navigate_child_consistency(en_tokenizer, text, heads):
|
|||
|
||||
def test_parser_parse_navigate_edges(en_tokenizer, text, heads):
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||
for token in doc:
|
||||
subtree = list(token.subtree)
|
||||
debug = '\t'.join((token.text, token.left_edge.text, subtree[0].text))
|
||||
|
|
|
@ -1,19 +1,21 @@
|
|||
'''Test that the parser respects preset sentence boundaries.'''
|
||||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from thinc.neural.optimizers import Adam
|
||||
from thinc.neural.ops import NumpyOps
|
||||
from spacy.attrs import NORM
|
||||
from spacy.gold import GoldParse
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.tokens import Doc
|
||||
from spacy.pipeline import DependencyParser
|
||||
|
||||
from ...attrs import NORM
|
||||
from ...gold import GoldParse
|
||||
from ...vocab import Vocab
|
||||
from ...tokens import Doc
|
||||
from ...pipeline import DependencyParser
|
||||
|
||||
@pytest.fixture
|
||||
def vocab():
|
||||
return Vocab(lex_attr_getters={NORM: lambda s: s})
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def parser(vocab):
|
||||
parser = DependencyParser(vocab)
|
||||
|
@ -32,6 +34,7 @@ def parser(vocab):
|
|||
parser.update([doc], [gold], sgd=sgd, losses=losses)
|
||||
return parser
|
||||
|
||||
|
||||
def test_no_sentences(parser):
|
||||
doc = Doc(parser.vocab, words=['a', 'b', 'c', 'd'])
|
||||
doc = parser(doc)
|
||||
|
|
|
@ -1,19 +1,18 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...tokens.doc import Doc
|
||||
from ...attrs import HEAD
|
||||
from ..util import get_doc, apply_transition_sequence
|
||||
|
||||
import pytest
|
||||
|
||||
from spacy.tokens.doc import Doc
|
||||
|
||||
from ..util import get_doc, apply_transition_sequence
|
||||
|
||||
|
||||
def test_parser_space_attachment(en_tokenizer):
|
||||
text = "This is a test.\nTo ensure spaces are attached well."
|
||||
heads = [1, 0, 1, -2, -3, -1, 1, 4, -1, 2, 1, 0, -1, -2]
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads)
|
||||
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||
for sent in doc.sents:
|
||||
if len(sent) == 1:
|
||||
assert not sent[-1].is_space
|
||||
|
@ -26,7 +25,7 @@ def test_parser_sentence_space(en_tokenizer):
|
|||
'nsubjpass', 'aux', 'auxpass', 'ROOT', 'nsubj', 'aux', 'ccomp',
|
||||
'poss', 'nsubj', 'ccomp', 'punct']
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads, deps=deps)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads, deps=deps)
|
||||
assert len(list(doc.sents)) == 2
|
||||
|
||||
|
||||
|
@ -35,7 +34,7 @@ def test_parser_space_attachment_leading(en_tokenizer, en_parser):
|
|||
text = "\t \n This is a sentence ."
|
||||
heads = [1, 1, 0, 1, -2, -3]
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, text.split(' '), heads=heads)
|
||||
doc = get_doc(tokens.vocab, words=text.split(' '), heads=heads)
|
||||
assert doc[0].is_space
|
||||
assert doc[1].is_space
|
||||
assert doc[2].text == 'This'
|
||||
|
@ -52,7 +51,7 @@ def test_parser_space_attachment_intermediate_trailing(en_tokenizer, en_parser):
|
|||
heads = [1, 0, -1, 2, -1, -4, -5, -1]
|
||||
transition = ['L-nsubj', 'S', 'L-det', 'R-attr', 'D', 'R-punct']
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, text.split(' '), heads=heads)
|
||||
doc = get_doc(tokens.vocab, words=text.split(' '), heads=heads)
|
||||
assert doc[2].is_space
|
||||
assert doc[4].is_space
|
||||
assert doc[5].is_space
|
||||
|
|
|
@ -1,28 +0,0 @@
|
|||
import pytest
|
||||
|
||||
from ...pipeline import DependencyParser
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def parser(en_vocab):
|
||||
parser = DependencyParser(en_vocab)
|
||||
parser.add_label('nsubj')
|
||||
parser.model, cfg = parser.Model(parser.moves.n_moves)
|
||||
parser.cfg.update(cfg)
|
||||
return parser
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def blank_parser(en_vocab):
|
||||
parser = DependencyParser(en_vocab)
|
||||
return parser
|
||||
|
||||
|
||||
def test_to_from_bytes(parser, blank_parser):
|
||||
assert parser.model is not True
|
||||
assert blank_parser.model is True
|
||||
assert blank_parser.moves.n_moves != parser.moves.n_moves
|
||||
bytes_data = parser.to_bytes()
|
||||
blank_parser.from_bytes(bytes_data)
|
||||
assert blank_parser.model is not True
|
||||
assert blank_parser.moves.n_moves == parser.moves.n_moves
|
|
@ -2,10 +2,9 @@
|
|||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
|
||||
from ...tokens import Span
|
||||
from ...language import Language
|
||||
from ...pipeline import EntityRuler
|
||||
from spacy.tokens import Span
|
||||
from spacy.language import Language
|
||||
from spacy.pipeline import EntityRuler
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
|
|
|
@ -2,11 +2,11 @@
|
|||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from spacy.language import Language
|
||||
from spacy.tokens import Span
|
||||
|
||||
from ..util import get_doc
|
||||
from ...language import Language
|
||||
from ...tokens import Span
|
||||
from ... import util
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def doc(en_tokenizer):
|
||||
|
@ -16,7 +16,7 @@ def doc(en_tokenizer):
|
|||
pos = ['PRON', 'VERB', 'PROPN', 'PROPN', 'ADP', 'PROPN', 'PUNCT']
|
||||
deps = ['ROOT', 'prep', 'compound', 'pobj', 'prep', 'pobj', 'punct']
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, [t.text for t in tokens], heads=heads,
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads,
|
||||
tags=tags, pos=pos, deps=deps)
|
||||
doc.ents = [Span(doc, 2, 4, doc.vocab.strings['GPE'])]
|
||||
doc.is_parsed = True
|
||||
|
|
|
@ -2,8 +2,7 @@
|
|||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
|
||||
from ...language import Language
|
||||
from spacy.language import Language
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
|
|
|
@ -1,7 +1,13 @@
|
|||
# coding: utf8
|
||||
|
||||
from __future__ import unicode_literals
|
||||
from ...language import Language
|
||||
|
||||
import pytest
|
||||
import random
|
||||
import numpy.random
|
||||
from spacy.language import Language
|
||||
from spacy.pipeline import TextCategorizer
|
||||
from spacy.tokens import Doc
|
||||
from spacy.gold import GoldParse
|
||||
|
||||
|
||||
def test_simple_train():
|
||||
|
@ -13,6 +19,40 @@ def test_simple_train():
|
|||
for text, answer in [('aaaa', 1.), ('bbbb', 0), ('aa', 1.),
|
||||
('bbbbbbbbb', 0.), ('aaaaaa', 1)]:
|
||||
nlp.update([text], [{'cats': {'answer': answer}}])
|
||||
doc = nlp(u'aaa')
|
||||
doc = nlp('aaa')
|
||||
assert 'answer' in doc.cats
|
||||
assert doc.cats['answer'] >= 0.5
|
||||
|
||||
|
||||
@pytest.mark.skip(reason="Test is flakey when run with others")
|
||||
def test_textcat_learns_multilabel():
|
||||
random.seed(5)
|
||||
numpy.random.seed(5)
|
||||
docs = []
|
||||
nlp = Language()
|
||||
letters = ['a', 'b', 'c']
|
||||
for w1 in letters:
|
||||
for w2 in letters:
|
||||
cats = {letter: float(w2==letter) for letter in letters}
|
||||
docs.append((Doc(nlp.vocab, words=['d']*3 + [w1, w2] + ['d']*3), cats))
|
||||
random.shuffle(docs)
|
||||
model = TextCategorizer(nlp.vocab, width=8)
|
||||
for letter in letters:
|
||||
model.add_label(letter)
|
||||
optimizer = model.begin_training()
|
||||
for i in range(30):
|
||||
losses = {}
|
||||
Ys = [GoldParse(doc, cats=cats) for doc, cats in docs]
|
||||
Xs = [doc for doc, cats in docs]
|
||||
model.update(Xs, Ys, sgd=optimizer, losses=losses)
|
||||
random.shuffle(docs)
|
||||
for w1 in letters:
|
||||
for w2 in letters:
|
||||
doc = Doc(nlp.vocab, words=['d']*3 + [w1, w2] + ['d']*3)
|
||||
truth = {letter: w2==letter for letter in letters}
|
||||
model(doc)
|
||||
for cat, score in doc.cats.items():
|
||||
if not truth[cat]:
|
||||
assert score < 0.5
|
||||
else:
|
||||
assert score > 0.5
|
||||
|
|
420
spacy/tests/regression/test_issue1-1000.py
Normal file
|
@ -0,0 +1,420 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
import random
|
||||
from spacy.matcher import Matcher
|
||||
from spacy.attrs import IS_PUNCT, ORTH, LOWER
|
||||
from spacy.symbols import POS, VERB, VerbForm_inf
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.language import Language
|
||||
from spacy.lemmatizer import Lemmatizer
|
||||
from spacy.tokens import Doc
|
||||
|
||||
from ..util import get_doc, make_tempdir
|
||||
|
||||
|
||||
@pytest.mark.parametrize('patterns', [
|
||||
[[{'LOWER': 'celtics'}], [{'LOWER': 'boston'}, {'LOWER': 'celtics'}]],
|
||||
[[{'LOWER': 'boston'}, {'LOWER': 'celtics'}], [{'LOWER': 'celtics'}]]])
|
||||
def test_issue118(en_tokenizer, patterns):
|
||||
"""Test a bug that arose from having overlapping matches"""
|
||||
text = "how many points did lebron james score against the boston celtics last night"
|
||||
doc = en_tokenizer(text)
|
||||
ORG = doc.vocab.strings['ORG']
|
||||
matcher = Matcher(doc.vocab)
|
||||
matcher.add("BostonCeltics", None, *patterns)
|
||||
assert len(list(doc.ents)) == 0
|
||||
matches = [(ORG, start, end) for _, start, end in matcher(doc)]
|
||||
assert matches == [(ORG, 9, 11), (ORG, 10, 11)]
|
||||
doc.ents = matches[:1]
|
||||
ents = list(doc.ents)
|
||||
assert len(ents) == 1
|
||||
assert ents[0].label == ORG
|
||||
assert ents[0].start == 9
|
||||
assert ents[0].end == 11
|
||||
|
||||
|
||||
@pytest.mark.parametrize('patterns', [
|
||||
[[{'LOWER': 'boston'}], [{'LOWER': 'boston'}, {'LOWER': 'celtics'}]],
|
||||
[[{'LOWER': 'boston'}, {'LOWER': 'celtics'}], [{'LOWER': 'boston'}]]])
|
||||
def test_issue118_prefix_reorder(en_tokenizer, patterns):
|
||||
"""Test a bug that arose from having overlapping matches"""
|
||||
text = "how many points did lebron james score against the boston celtics last night"
|
||||
doc = en_tokenizer(text)
|
||||
ORG = doc.vocab.strings['ORG']
|
||||
matcher = Matcher(doc.vocab)
|
||||
matcher.add('BostonCeltics', None, *patterns)
|
||||
assert len(list(doc.ents)) == 0
|
||||
matches = [(ORG, start, end) for _, start, end in matcher(doc)]
|
||||
doc.ents += tuple(matches)[1:]
|
||||
assert matches == [(ORG, 9, 10), (ORG, 9, 11)]
|
||||
ents = doc.ents
|
||||
assert len(ents) == 1
|
||||
assert ents[0].label == ORG
|
||||
assert ents[0].start == 9
|
||||
assert ents[0].end == 11
|
||||
|
||||
|
||||
def test_issue242(en_tokenizer):
|
||||
"""Test overlapping multi-word phrases."""
|
||||
text = "There are different food safety standards in different countries."
|
||||
patterns = [[{'LOWER': 'food'}, {'LOWER': 'safety'}],
|
||||
[{'LOWER': 'safety'}, {'LOWER': 'standards'}]]
|
||||
doc = en_tokenizer(text)
|
||||
matcher = Matcher(doc.vocab)
|
||||
matcher.add('FOOD', None, *patterns)
|
||||
|
||||
matches = [(ent_type, start, end) for ent_type, start, end in matcher(doc)]
|
||||
doc.ents += tuple(matches)
|
||||
match1, match2 = matches
|
||||
assert match1[1] == 3
|
||||
assert match1[2] == 5
|
||||
assert match2[1] == 4
|
||||
assert match2[2] == 6
|
||||
|
||||
|
||||
def test_issue309(en_tokenizer):
|
||||
"""Test Issue #309: SBD fails on empty string"""
|
||||
tokens = en_tokenizer(" ")
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=[0], deps=['ROOT'])
|
||||
doc.is_parsed = True
|
||||
assert len(doc) == 1
|
||||
sents = list(doc.sents)
|
||||
assert len(sents) == 1
|
||||
|
||||
|
||||
def test_issue351(en_tokenizer):
|
||||
doc = en_tokenizer(" This is a cat.")
|
||||
assert doc[0].idx == 0
|
||||
assert len(doc[0]) == 3
|
||||
assert doc[1].idx == 3
|
||||
|
||||
|
||||
def test_issue360(en_tokenizer):
|
||||
"""Test tokenization of big ellipsis"""
|
||||
tokens = en_tokenizer('$45...............Asking')
|
||||
assert len(tokens) > 2
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text1,text2', [("cat", "dog")])
|
||||
def test_issue361(en_vocab, text1, text2):
|
||||
"""Test Issue #361: Equality of lexemes"""
|
||||
assert en_vocab[text1] == en_vocab[text1]
|
||||
assert en_vocab[text1] != en_vocab[text2]
|
||||
|
||||
|
||||
def test_issue587(en_tokenizer):
|
||||
"""Test that Matcher doesn't segfault on particular input"""
|
||||
doc = en_tokenizer('a b; c')
|
||||
matcher = Matcher(doc.vocab)
|
||||
matcher.add('TEST1', None, [{ORTH: 'a'}, {ORTH: 'b'}])
|
||||
matches = matcher(doc)
|
||||
assert len(matches) == 1
|
||||
matcher.add('TEST2', None, [{ORTH: 'a'}, {ORTH: 'b'}, {IS_PUNCT: True}, {ORTH: 'c'}])
|
||||
matches = matcher(doc)
|
||||
assert len(matches) == 2
|
||||
matcher.add('TEST3', None, [{ORTH: 'a'}, {ORTH: 'b'}, {IS_PUNCT: True}, {ORTH: 'd'}])
|
||||
matches = matcher(doc)
|
||||
assert len(matches) == 2
|
||||
|
||||
|
||||
def test_issue588(en_vocab):
|
||||
matcher = Matcher(en_vocab)
|
||||
with pytest.raises(ValueError):
|
||||
matcher.add('TEST', None, [])
|
||||
|
||||
|
||||
@pytest.mark.xfail
|
||||
def test_issue589():
|
||||
vocab = Vocab()
|
||||
vocab.strings.set_frozen(True)
|
||||
doc = Doc(vocab, words=['whata'])
|
||||
|
||||
|
||||
def test_issue590(en_vocab):
|
||||
"""Test overlapping matches"""
|
||||
doc = Doc(en_vocab, words=['n', '=', '1', ';', 'a', ':', '5', '%'])
|
||||
matcher = Matcher(en_vocab)
|
||||
matcher.add('ab', None, [{'IS_ALPHA': True}, {'ORTH': ':'}, {'LIKE_NUM': True}, {'ORTH': '%'}])
|
||||
matcher.add('ab', None, [{'IS_ALPHA': True}, {'ORTH': '='}, {'LIKE_NUM': True}])
|
||||
matches = matcher(doc)
|
||||
assert len(matches) == 2
|
||||
|
||||
|
||||
def test_issue595():
|
||||
"""Test lemmatization of base forms"""
|
||||
words = ["Do", "n't", "feed", "the", "dog"]
|
||||
tag_map = {'VB': {POS: VERB, VerbForm_inf: True}}
|
||||
rules = {"verb": [["ed", "e"]]}
|
||||
lemmatizer = Lemmatizer({'verb': {}}, {'verb': {}}, rules)
|
||||
vocab = Vocab(lemmatizer=lemmatizer, tag_map=tag_map)
|
||||
doc = Doc(vocab, words=words)
|
||||
doc[2].tag_ = 'VB'
|
||||
assert doc[2].text == 'feed'
|
||||
assert doc[2].lemma_ == 'feed'
|
||||
|
||||
|
||||
def test_issue599(en_vocab):
|
||||
doc = Doc(en_vocab)
|
||||
doc.is_tagged = True
|
||||
doc.is_parsed = True
|
||||
doc2 = Doc(doc.vocab)
|
||||
doc2.from_bytes(doc.to_bytes())
|
||||
assert doc2.is_parsed
|
||||
|
||||
|
||||
def test_issue600():
|
||||
vocab = Vocab(tag_map={'NN': {'pos': 'NOUN'}})
|
||||
doc = Doc(vocab, words=["hello"])
|
||||
doc[0].tag_ = 'NN'
|
||||
|
||||
|
||||
def test_issue615(en_tokenizer):
|
||||
def merge_phrases(matcher, doc, i, matches):
|
||||
"""Merge a phrase. We have to be careful here because we'll change the
|
||||
token indices. To avoid problems, merge all the phrases once we're called
|
||||
on the last match."""
|
||||
if i != len(matches)-1:
|
||||
return None
|
||||
spans = [(ent_id, ent_id, doc[start : end]) for ent_id, start, end in matches]
|
||||
for ent_id, label, span in spans:
|
||||
span.merge(tag='NNP' if label else span.root.tag_, lemma=span.text,
|
||||
label=label)
|
||||
doc.ents = doc.ents + ((label, span.start, span.end),)
|
||||
|
||||
text = "The golf club is broken"
|
||||
pattern = [{'ORTH': "golf"}, {'ORTH': "club"}]
|
||||
label = "Sport_Equipment"
|
||||
doc = en_tokenizer(text)
|
||||
matcher = Matcher(doc.vocab)
|
||||
matcher.add(label, merge_phrases, pattern)
|
||||
match = matcher(doc)
|
||||
entities = list(doc.ents)
|
||||
assert entities != []
|
||||
assert entities[0].label != 0
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text,number', [("7am", "7"), ("11p.m.", "11")])
|
||||
def test_issue736(en_tokenizer, text, number):
|
||||
"""Test that times like "7am" are tokenized correctly and that numbers are
|
||||
converted to string."""
|
||||
tokens = en_tokenizer(text)
|
||||
assert len(tokens) == 2
|
||||
assert tokens[0].text == number
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["3/4/2012", "01/12/1900"])
|
||||
def test_issue740(en_tokenizer, text):
|
||||
"""Test that dates are not split and kept as one token. This behaviour is
|
||||
currently inconsistent, since dates separated by hyphens are still split.
|
||||
This will be hard to prevent without causing clashes with numeric ranges."""
|
||||
tokens = en_tokenizer(text)
|
||||
assert len(tokens) == 1
|
||||
|
||||
|
||||
def test_issue743():
|
||||
doc = Doc(Vocab(), ['hello', 'world'])
|
||||
token = doc[0]
|
||||
s = set([token])
|
||||
items = list(s)
|
||||
assert items[0] is token
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["We were scared", "We Were Scared"])
|
||||
def test_issue744(en_tokenizer, text):
|
||||
"""Test that 'were' and 'Were' are excluded from the contractions
|
||||
generated by the English tokenizer exceptions."""
|
||||
tokens = en_tokenizer(text)
|
||||
assert len(tokens) == 3
|
||||
assert tokens[1].text.lower() == "were"
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text,is_num', [("one", True), ("ten", True),
|
||||
("teneleven", False)])
|
||||
def test_issue759(en_tokenizer, text, is_num):
|
||||
tokens = en_tokenizer(text)
|
||||
assert tokens[0].like_num == is_num
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["Shell", "shell", "Shed", "shed"])
|
||||
def test_issue775(en_tokenizer, text):
|
||||
"""Test that 'Shell' and 'shell' are excluded from the contractions
|
||||
generated by the English tokenizer exceptions."""
|
||||
tokens = en_tokenizer(text)
|
||||
assert len(tokens) == 1
|
||||
assert tokens[0].text == text
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["This is a string ", "This is a string\u0020"])
|
||||
def test_issue792(en_tokenizer, text):
|
||||
"""Test for Issue #792: Trailing whitespace is removed after tokenization."""
|
||||
doc = en_tokenizer(text)
|
||||
assert ''.join([token.text_with_ws for token in doc]) == text
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["This is a string", "This is a string\n"])
|
||||
def test_control_issue792(en_tokenizer, text):
|
||||
"""Test base case for Issue #792: Non-trailing whitespace"""
|
||||
doc = en_tokenizer(text)
|
||||
assert ''.join([token.text_with_ws for token in doc]) == text
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text,tokens', [
|
||||
('"deserve,"--and', ['"', "deserve", ',"--', "and"]),
|
||||
("exception;--exclusive", ["exception", ";--", "exclusive"]),
|
||||
("day.--Is", ["day", ".--", "Is"]),
|
||||
("refinement:--just", ["refinement", ":--", "just"]),
|
||||
("memories?--To", ["memories", "?--", "To"]),
|
||||
("Useful.=--Therefore", ["Useful", ".=--", "Therefore"]),
|
||||
("=Hope.=--Pandora", ["=", "Hope", ".=--", "Pandora"])])
|
||||
def test_issue801(en_tokenizer, text, tokens):
|
||||
"""Test that special characters + hyphens are split correctly."""
|
||||
doc = en_tokenizer(text)
|
||||
assert len(doc) == len(tokens)
|
||||
assert [t.text for t in doc] == tokens
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text,expected_tokens', [
|
||||
('Smörsåsen används bl.a. till fisk', ['Smörsåsen', 'används', 'bl.a.', 'till', 'fisk']),
|
||||
('Jag kommer först kl. 13 p.g.a. diverse förseningar', ['Jag', 'kommer', 'först', 'kl.', '13', 'p.g.a.', 'diverse', 'förseningar'])
|
||||
])
|
||||
def test_issue805(sv_tokenizer, text, expected_tokens):
|
||||
tokens = sv_tokenizer(text)
|
||||
token_list = [token.text for token in tokens if not token.is_space]
|
||||
assert expected_tokens == token_list
|
||||
|
||||
|
||||
def test_issue850():
|
||||
"""The variable-length pattern matches the succeeding token. Check we
|
||||
handle the ambiguity correctly."""
|
||||
vocab = Vocab(lex_attr_getters={LOWER: lambda string: string.lower()})
|
||||
matcher = Matcher(vocab)
|
||||
IS_ANY_TOKEN = matcher.vocab.add_flag(lambda x: True)
|
||||
pattern = [{'LOWER': "bob"}, {'OP': '*', 'IS_ANY_TOKEN': True}, {'LOWER': 'frank'}]
|
||||
matcher.add('FarAway', None, pattern)
|
||||
doc = Doc(matcher.vocab, words=['bob', 'and', 'and', 'frank'])
|
||||
match = matcher(doc)
|
||||
assert len(match) == 1
|
||||
ent_id, start, end = match[0]
|
||||
assert start == 0
|
||||
assert end == 4
|
||||
|
||||
|
||||
def test_issue850_basic():
|
||||
"""Test Matcher matches with '*' operator and Boolean flag"""
|
||||
vocab = Vocab(lex_attr_getters={LOWER: lambda string: string.lower()})
|
||||
matcher = Matcher(vocab)
|
||||
IS_ANY_TOKEN = matcher.vocab.add_flag(lambda x: True)
|
||||
pattern = [{'LOWER': "bob"}, {'OP': '*', 'LOWER': 'and'}, {'LOWER': 'frank'}]
|
||||
matcher.add('FarAway', None, pattern)
|
||||
doc = Doc(matcher.vocab, words=['bob', 'and', 'and', 'frank'])
|
||||
match = matcher(doc)
|
||||
assert len(match) == 1
|
||||
ent_id, start, end = match[0]
|
||||
assert start == 0
|
||||
assert end == 4
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["au-delàs", "pair-programmâmes",
|
||||
"terra-formées", "σ-compacts"])
|
||||
def test_issue852(fr_tokenizer, text):
|
||||
"""Test that French tokenizer exceptions are imported correctly."""
|
||||
tokens = fr_tokenizer(text)
|
||||
assert len(tokens) == 1
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["aaabbb@ccc.com\nThank you!",
|
||||
"aaabbb@ccc.com \nThank you!"])
|
||||
def test_issue859(en_tokenizer, text):
|
||||
"""Test that no extra space is added in doc.text."""
|
||||
doc = en_tokenizer(text)
|
||||
assert doc.text == text
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["Datum:2014-06-02\nDokument:76467"])
|
||||
def test_issue886(en_tokenizer, text):
|
||||
"""Test that token.idx matches the original text index for texts with newlines."""
|
||||
doc = en_tokenizer(text)
|
||||
for token in doc:
|
||||
assert len(token.text) == len(token.text_with_ws)
|
||||
assert text[token.idx] == token.text[0]
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text', ["want/need"])
|
||||
def test_issue891(en_tokenizer, text):
|
||||
"""Test that / infixes are split correctly."""
|
||||
tokens = en_tokenizer(text)
|
||||
assert len(tokens) == 3
|
||||
assert tokens[1].text == "/"
|
||||
|
||||
|
||||
@pytest.mark.parametrize('text,tag,lemma', [
|
||||
("anus", "NN", "anus"),
|
||||
("princess", "NN", "princess"),
|
||||
("inner", "JJ", "inner")
|
||||
])
|
||||
def test_issue912(en_vocab, text, tag, lemma):
|
||||
"""Test base-forms are preserved."""
|
||||
doc = Doc(en_vocab, words=[text])
|
||||
doc[0].tag_ = tag
|
||||
assert doc[0].lemma_ == lemma
|
||||
|
||||
|
||||
def test_issue957(en_tokenizer):
|
||||
"""Test that spaCy doesn't hang on many periods."""
|
||||
# skip test if pytest-timeout is not installed
|
||||
timeout = pytest.importorskip('pytest_timeout')
|
||||
string = '0'
|
||||
for i in range(1, 100):
|
||||
string += '.%d' % i
|
||||
doc = en_tokenizer(string)
|
||||
|
||||
|
||||
@pytest.mark.xfail
|
||||
def test_issue999(train_data):
|
||||
"""Test that adding entities and resuming training works passably OK.
|
||||
There are two issues here:
|
||||
1) We have to re-add labels. This isn't very nice.
|
||||
2) There's no way to set the learning rate for the weight update, so we
|
||||
end up out-of-scale, causing it to learn too fast.
|
||||
"""
|
||||
TRAIN_DATA = [
|
||||
["hey", []],
|
||||
["howdy", []],
|
||||
["hey there", []],
|
||||
["hello", []],
|
||||
["hi", []],
|
||||
["i'm looking for a place to eat", []],
|
||||
["i'm looking for a place in the north of town", [[31,36,"LOCATION"]]],
|
||||
["show me chinese restaurants", [[8,15,"CUISINE"]]],
|
||||
["show me chines restaurants", [[8,14,"CUISINE"]]],
|
||||
]
|
||||
|
||||
nlp = Language()
|
||||
ner = nlp.create_pipe('ner')
|
||||
nlp.add_pipe(ner)
|
||||
for _, offsets in TRAIN_DATA:
|
||||
for start, end, label in offsets:
|
||||
ner.add_label(label)
|
||||
nlp.begin_training()
|
||||
ner.model.learn_rate = 0.001
|
||||
for itn in range(100):
|
||||
random.shuffle(TRAIN_DATA)
|
||||
for raw_text, entity_offsets in TRAIN_DATA:
|
||||
nlp.update([raw_text], [{'entities': entity_offsets}])
|
||||
|
||||
with make_tempdir() as model_dir:
|
||||
nlp.to_disk(model_dir)
|
||||
nlp2 = Language().from_disk(model_dir)
|
||||
|
||||
for raw_text, entity_offsets in TRAIN_DATA:
|
||||
doc = nlp2(raw_text)
|
||||
ents = {(ent.start_char, ent.end_char): ent.label_ for ent in doc.ents}
|
||||
for start, end, label in entity_offsets:
|
||||
if (start, end) in ents:
|
||||
assert ents[(start, end)] == label
|
||||
break
|
||||
else:
|
||||
if entity_offsets:
|
||||
raise Exception(ents)
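Several of the tests above (issues 792, 859 and 886) check the same property: tokenization is non-destructive, so joining each token's `text_with_ws` reproduces the input and `token.idx` points back into the original string. The sketch below illustrates that invariant with the standard library only. `ToyToken`, `TOKEN_RE`, `toy_tokenize` and `round_trips` are hypothetical names invented for this illustration; this is not how spaCy's tokenizer is implemented.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for a token: its text, trailing space, and offset.
ToyToken = namedtuple('ToyToken', ['text', 'whitespace', 'idx'])

# A token is either a run of non-whitespace characters or a single
# whitespace character, optionally followed by one trailing space.
TOKEN_RE = re.compile(r'(\S+|\s)( ?)')


def toy_tokenize(text):
    """Split text so that every character lands in exactly one token."""
    return [ToyToken(m.group(1), m.group(2), m.start())
            for m in TOKEN_RE.finditer(text)]


def round_trips(text):
    """Check the two invariants asserted by the regression tests."""
    tokens = toy_tokenize(text)
    rebuilt = ''.join(t.text + t.whitespace for t in tokens)
    offsets_ok = all(text[t.idx] == t.text[0] for t in tokens)
    return rebuilt == text and offsets_ok
```

Because the pattern can match at every position, the matches tile the whole string, which is why trailing spaces and newlines survive the round trip in this toy version.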
|
127
spacy/tests/regression/test_issue1001-1500.py
Normal file
|
@ -0,0 +1,127 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
import re
|
||||
from spacy.tokens import Doc
|
||||
from spacy.vocab import Vocab
|
||||
from spacy.lang.en import English
|
||||
from spacy.lang.lex_attrs import LEX_ATTRS
|
||||
from spacy.matcher import Matcher
|
||||
from spacy.tokenizer import Tokenizer
|
||||
from spacy.lemmatizer import Lemmatizer
|
||||
from spacy.symbols import ORTH, LEMMA, POS, VERB, VerbForm_part
|
||||
|
||||
|
||||
def test_issue1242():
|
||||
nlp = English()
|
||||
doc = nlp('')
|
||||
assert len(doc) == 0
|
||||
docs = list(nlp.pipe(['', 'hello']))
|
||||
assert len(docs[0]) == 0
|
||||
assert len(docs[1]) == 1
|
||||
|
||||
|
||||
def test_issue1250():
|
||||
"""Test cached special cases."""
|
||||
special_case = [{ORTH: 'reimbur', LEMMA: 'reimburse', POS: 'VERB'}]
|
||||
nlp = English()
|
||||
nlp.tokenizer.add_special_case('reimbur', special_case)
|
||||
lemmas = [w.lemma_ for w in nlp('reimbur, reimbur...')]
|
||||
assert lemmas == ['reimburse', ',', 'reimburse', '...']
|
||||
lemmas = [w.lemma_ for w in nlp('reimbur, reimbur...')]
|
||||
assert lemmas == ['reimburse', ',', 'reimburse', '...']
|
||||
|
||||
|
||||
def test_issue1257():
|
||||
"""Test that tokens compare correctly."""
|
||||
doc1 = Doc(Vocab(), words=['a', 'b', 'c'])
|
||||
doc2 = Doc(Vocab(), words=['a', 'c', 'e'])
|
||||
assert doc1[0] != doc2[0]
|
||||
assert not doc1[0] == doc2[0]
|
||||
|
||||
|
||||
def test_issue1375():
|
||||
"""Test that token.nbor() raises IndexError for out-of-bounds access."""
|
||||
doc = Doc(Vocab(), words=['0', '1', '2'])
|
||||
with pytest.raises(IndexError):
|
||||
assert doc[0].nbor(-1)
|
||||
assert doc[1].nbor(-1).text == '0'
|
||||
with pytest.raises(IndexError):
|
||||
assert doc[2].nbor(1)
|
||||
assert doc[1].nbor(1).text == '2'
|
||||
|
||||
|
||||
def test_issue1387():
|
||||
tag_map = {'VBG': {POS: VERB, VerbForm_part: True}}
|
||||
index = {"verb": ("cope", "cop")}
|
||||
exc = {"verb": {"coping": ("cope",)}}
|
||||
rules = {"verb": [["ing", ""]]}
|
||||
lemmatizer = Lemmatizer(index, exc, rules)
|
||||
vocab = Vocab(lemmatizer=lemmatizer, tag_map=tag_map)
|
||||
doc = Doc(vocab, words=["coping"])
|
||||
doc[0].tag_ = 'VBG'
|
||||
assert doc[0].text == "coping"
|
||||
assert doc[0].lemma_ == "cope"
|
||||
|
||||
|
||||
def test_issue1434():
|
||||
"""Test matches occur when optional element at end of short doc."""
|
||||
pattern = [{'ORTH': 'Hello'}, {'IS_ALPHA': True, 'OP': '?'}]
|
||||
vocab = Vocab(lex_attr_getters=LEX_ATTRS)
|
||||
hello_world = Doc(vocab, words=['Hello', 'World'])
|
||||
hello = Doc(vocab, words=['Hello'])
|
||||
matcher = Matcher(vocab)
|
||||
matcher.add('MyMatcher', None, pattern)
|
||||
matches = matcher(hello_world)
|
||||
assert matches
|
||||
matches = matcher(hello)
|
||||
assert matches
|
||||
|
||||
|
||||
@pytest.mark.parametrize('string,start,end', [
|
||||
('a', 0, 1), ('a b', 0, 2), ('a c', 0, 1), ('a b c', 0, 2),
|
||||
('a b b c', 0, 3), ('a b b', 0, 3)])
|
||||
def test_issue1450(string, start, end):
|
||||
"""Test matcher works when patterns end with * operator."""
|
||||
pattern = [{'ORTH': "a"}, {'ORTH': "b", 'OP': "*"}]
|
||||
matcher = Matcher(Vocab())
|
||||
matcher.add("TSTEND", None, pattern)
|
||||
doc = Doc(Vocab(), words=string.split())
|
||||
matches = matcher(doc)
|
||||
    if start is None or end is None:
        assert matches == []
    else:
        # Only index into matches when a match is expected
        assert matches[-1][1] == start
        assert matches[-1][2] == end
|
||||
|
||||
|
||||
def test_issue1488():
|
||||
prefix_re = re.compile(r'''[\[\("']''')
|
||||
suffix_re = re.compile(r'''[\]\)"']''')
|
||||
infix_re = re.compile(r'''[-~\.]''')
|
||||
simple_url_re = re.compile(r'''^https?://''')
|
||||
|
||||
def my_tokenizer(nlp):
|
||||
return Tokenizer(nlp.vocab, {},
|
||||
prefix_search=prefix_re.search,
|
||||
suffix_search=suffix_re.search,
|
||||
infix_finditer=infix_re.finditer,
|
||||
token_match=simple_url_re.match)
|
||||
|
||||
nlp = English()
|
||||
nlp.tokenizer = my_tokenizer(nlp)
|
||||
doc = nlp("This is a test.")
|
||||
for token in doc:
|
||||
assert token.text
|
||||
|
||||
|
||||
def test_issue1494():
|
||||
infix_re = re.compile(r'''[^a-z]''')
|
||||
test_cases = [('token 123test', ['token', '1', '2', '3', 'test']),
|
||||
('token 1test', ['token', '1test']),
|
||||
('hello...test', ['hello', '.', '.', '.', 'test'])]
|
||||
new_tokenizer = lambda nlp: Tokenizer(nlp.vocab, {}, infix_finditer=infix_re.finditer)
|
||||
nlp = English()
|
||||
nlp.tokenizer = new_tokenizer(nlp)
|
||||
for text, expected in test_cases:
|
||||
assert [token.text for token in nlp(text)] == expected
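`test_issue1488` and `test_issue1494` pass a compiled pattern's `finditer` to the tokenizer as `infix_finditer`. As a rough illustration of what splitting a chunk around `finditer` matches looks like, here is a standard-library sketch. The `split_on_infixes` helper is hypothetical and is not spaCy's implementation; spaCy's own edge-case handling differs (the test above expects `'1test'` to stay whole, which this toy version would split).

```python
import re


def split_on_infixes(chunk, infix_finditer):
    """Split one whitespace-free chunk around every infix match (toy version)."""
    pieces = []
    start = 0
    for match in infix_finditer(chunk):
        # Text between the previous infix and this one becomes a piece.
        if match.start() > start:
            pieces.append(chunk[start:match.start()])
        # The infix itself is kept as its own piece.
        pieces.append(match.group())
        start = match.end()
    # Whatever follows the last infix is the final piece.
    if start < len(chunk):
        pieces.append(chunk[start:])
    return pieces


# Same infix pattern as in test_issue1494: anything that is not a-z.
infix_re = re.compile(r'''[^a-z]''')
result = split_on_infixes('hello...test', infix_re.finditer)
```

Passing the bound `finditer` method rather than the compiled pattern mirrors how the tests hand `infix_re.finditer` to `Tokenizer`, so the splitting code never needs to know which regular expression is in use.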
|
|
@ -1,55 +0,0 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...matcher import Matcher
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
pattern1 = [[{'LOWER': 'celtics'}], [{'LOWER': 'boston'}, {'LOWER': 'celtics'}]]
|
||||
pattern2 = [[{'LOWER': 'boston'}, {'LOWER': 'celtics'}], [{'LOWER': 'celtics'}]]
|
||||
pattern3 = [[{'LOWER': 'boston'}], [{'LOWER': 'boston'}, {'LOWER': 'celtics'}]]
|
||||
pattern4 = [[{'LOWER': 'boston'}, {'LOWER': 'celtics'}], [{'LOWER': 'boston'}]]
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def doc(en_tokenizer):
|
||||
text = "how many points did lebron james score against the boston celtics last night"
|
||||
doc = en_tokenizer(text)
|
||||
return doc
|
||||
|
||||
|
||||
@pytest.mark.parametrize('pattern', [pattern1, pattern2])
|
||||
def test_issue118(doc, pattern):
|
||||
"""Test a bug that arose from having overlapping matches"""
|
||||
ORG = doc.vocab.strings['ORG']
|
||||
matcher = Matcher(doc.vocab)
|
||||
matcher.add("BostonCeltics", None, *pattern)
|
||||
|
||||
assert len(list(doc.ents)) == 0
|
||||
matches = [(ORG, start, end) for _, start, end in matcher(doc)]
|
||||
assert matches == [(ORG, 9, 11), (ORG, 10, 11)]
|
||||
doc.ents = matches[:1]
|
||||
ents = list(doc.ents)
|
||||
assert len(ents) == 1
|
||||
assert ents[0].label == ORG
|
||||
assert ents[0].start == 9
|
||||
assert ents[0].end == 11
|
||||
|
||||
|
||||
@pytest.mark.parametrize('pattern', [pattern3, pattern4])
|
||||
def test_issue118_prefix_reorder(doc, pattern):
|
||||
"""Test a bug that arose from having overlapping matches"""
|
||||
ORG = doc.vocab.strings['ORG']
|
||||
matcher = Matcher(doc.vocab)
|
||||
matcher.add('BostonCeltics', None, *pattern)
|
||||
|
||||
assert len(list(doc.ents)) == 0
|
||||
matches = [(ORG, start, end) for _, start, end in matcher(doc)]
|
||||
doc.ents += tuple(matches)[1:]
|
||||
assert matches == [(ORG, 9, 10), (ORG, 9, 11)]
|
||||
ents = doc.ents
|
||||
assert len(ents) == 1
|
||||
assert ents[0].label == ORG
|
||||
assert ents[0].start == 9
|
||||
assert ents[0].end == 11
|
|
@ -1,13 +0,0 @@
|
|||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
@pytest.mark.models('en')
|
||||
def test_issue1207(EN):
|
||||
text = 'Employees are recruiting talented staffers from overseas.'
|
||||
doc = EN(text)
|
||||
|
||||
assert [i.text for i in doc.noun_chunks] == ['Employees', 'talented staffers']
|
||||
sent = list(doc.sents)[0]
|
||||
assert [i.text for i in sent.noun_chunks] == ['Employees', 'talented staffers']
|
|
@ -1,23 +0,0 @@
|
|||
from __future__ import unicode_literals
|
||||
import pytest
|
||||
from ...lang.en import English
|
||||
from ...util import load_model
|
||||
|
||||
|
||||
def test_issue1242_empty_strings():
|
||||
nlp = English()
|
||||
doc = nlp('')
|
||||
assert len(doc) == 0
|
||||
docs = list(nlp.pipe(['', 'hello']))
|
||||
assert len(docs[0]) == 0
|
||||
assert len(docs[1]) == 1
|
||||
|
||||
|
||||
@pytest.mark.models('en')
|
||||
def test_issue1242_empty_strings_en_core_web_sm():
|
||||
nlp = load_model('en_core_web_sm')
|
||||
doc = nlp('')
|
||||
assert len(doc) == 0
|
||||
docs = list(nlp.pipe(['', 'hello']))
|
||||
assert len(docs[0]) == 0
|
||||
assert len(docs[1]) == 1
|
|
@ -1,13 +0,0 @@
|
|||
from __future__ import unicode_literals
|
||||
from ...tokenizer import Tokenizer
|
||||
from ...symbols import ORTH, LEMMA, POS
|
||||
from ...lang.en import English
|
||||
|
||||
def test_issue1250_cached_special_cases():
|
||||
nlp = English()
|
||||
nlp.tokenizer.add_special_case(u'reimbur', [{ORTH: u'reimbur', LEMMA: u'reimburse', POS: u'VERB'}])
|
||||
|
||||
lemmas = [w.lemma_ for w in nlp(u'reimbur, reimbur...')]
|
||||
assert lemmas == ['reimburse', ',', 'reimburse', '...']
|
||||
lemmas = [w.lemma_ for w in nlp(u'reimbur, reimbur...')]
|
||||
assert lemmas == ['reimburse', ',', 'reimburse', '...']
|
Some files were not shown because too many files have changed in this diff.