## Description

Related issues: #2379 (should be fixed by separating model tests)

* **total execution time down from > 300 seconds to under 60 seconds**
* removed all model-specific tests that could only really be run manually anyway; those will now live in a separate test suite in the [`spacy-models`](https://github.com/explosion/spacy-models) repository and are already integrated into our new model training infrastructure
* changed all relative imports to absolute imports to prepare for moving the test suite from `/spacy/tests` to `/tests` (it'll then always test against the installed version)
* merged old regression tests into collections, e.g. `test_issue1001-1500.py` (about 90% of the regression tests are very short anyway; see the sketch below for what a collection entry looks like)
* tidied up and rewrote existing tests wherever possible

### Todo

- [ ] move tests to `/tests` and adjust CI commands accordingly
- [x] move model test suite from internal repo to `spacy-models`
- [x] ~~investigate why `pipeline/test_textcat.py` is flaky~~
- [x] review old regression tests (leftover files) and see if they can be merged, simplified or deleted
- [ ] update documentation on how to run tests

### Types of change

enhancement, tests

## Checklist

<!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] -->

- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.
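For context on the merged regression collections: a file like `test_issue1001-1500.py` simply gathers many short, independent functions, one per issue. A minimal sketch of what such entries might look like (the issue numbers, texts and assertions here are hypothetical illustrations, not taken from the actual suite):

```python
# coding: utf8
from __future__ import unicode_literals

import pytest

from spacy.lang.en import English


def test_issue1234():
    # Hypothetical case: one short, self-contained function per issue.
    doc = English().make_doc("This is a sentence.")
    assert len(doc) == 5


@pytest.mark.parametrize("text", ["don't", "can't"])
def test_issue1235(text):
    # Hypothetical parametrized case living in the same collection file.
    assert len(English().make_doc(text)) == 2
```

Because every function stays independent, merging is mostly a matter of concatenating the old single-issue files and deduplicating their imports.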
35 lines
2.9 KiB
Python
# coding: utf8
from __future__ import unicode_literals

import pytest


TESTCASES = [
    # punctuation tests
    ('আমি বাংলায় গান গাই!', ['আমি', 'বাংলায়', 'গান', 'গাই', '!']),
    ('আমি বাংলায় কথা কই।', ['আমি', 'বাংলায়', 'কথা', 'কই', '।']),
    ('বসুন্ধরা জনসম্মুখে দোষ স্বীকার করলো না?', ['বসুন্ধরা', 'জনসম্মুখে', 'দোষ', 'স্বীকার', 'করলো', 'না', '?']),
    ('টাকা থাকলে কি না হয়!', ['টাকা', 'থাকলে', 'কি', 'না', 'হয়', '!']),
    # abbreviations
    ('ডঃ খালেদ বললেন ঢাকায় ৩৫ ডিগ্রি সে.।', ['ডঃ', 'খালেদ', 'বললেন', 'ঢাকায়', '৩৫', 'ডিগ্রি', 'সে.', '।'])
]


@pytest.mark.parametrize('text,expected_tokens', TESTCASES)
def test_bn_tokenizer_handles_testcases(bn_tokenizer, text, expected_tokens):
    tokens = bn_tokenizer(text)
    token_list = [token.text for token in tokens if not token.is_space]
    assert expected_tokens == token_list
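
# The `is_space` filter above drops whitespace tokens, so the assertion
# compares only the visible tokens of each test case.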


def test_bn_tokenizer_handles_long_text(bn_tokenizer):
    text = """নর্থ সাউথ বিশ্ববিদ্যালয়ে সারাবছর কোন না কোন বিষয়ে গবেষণা চলতেই থাকে। \
অভিজ্ঞ ফ্যাকাল্টি মেম্বারগণ প্রায়ই শিক্ষার্থীদের নিয়ে বিভিন্ন গবেষণা প্রকল্পে কাজ করেন, \
যার মধ্যে রয়েছে রোবট থেকে মেশিন লার্নিং সিস্টেম ও আর্টিফিশিয়াল ইন্টেলিজেন্স। \
এসকল প্রকল্পে কাজ করার মাধ্যমে সংশ্লিষ্ট ক্ষেত্রে যথেষ্ট পরিমাণ স্পেশালাইজড হওয়া সম্ভব। \
আর গবেষণার কাজ তোমার ক্যারিয়ারকে ঠেলে নিয়ে যাবে অনেকখানি! \
কন্টেস্ট প্রোগ্রামার হও, গবেষক কিংবা ডেভেলপার - নর্থ সাউথ ইউনিভার্সিটিতে তোমার প্রতিভা বিকাশের সুযোগ রয়েছেই। \
নর্থ সাউথের অসাধারণ কমিউনিটিতে তোমাকে সাদর আমন্ত্রণ।"""
    tokens = bn_tokenizer(text)
    assert len(tokens) == 84
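
# Note: `bn_tokenizer` is not defined in this file; it is a pytest fixture
# provided by the test suite's shared conftest.py. A minimal sketch of such
# a fixture (an illustrative assumption; the real conftest may differ):
#
#     import pytest
#     from spacy.util import get_lang_class
#
#     @pytest.fixture(scope='session')
#     def bn_tokenizer():
#         return get_lang_class('bn').Defaults.create_tokenizer()
#
# With spaCy installed, this file can then be run on its own, e.g.:
#     python -m pytest spacy/tests/lang/bn/test_tokenizer.py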