spaCy/website/usage/_adding-languages/_testing.jade

//- 💫 DOCS > USAGE > ADDING LANGUAGES > TESTING

p
    | Before using the new language or submitting a
    | #[+a(gh("spaCy") + "/pulls") pull request] to spaCy, you should make sure
    | it works as expected. This is especially important if you've added custom
    | regular expressions for token matching or punctuation rules, as you don't
    | want them to cause regressions.
+infobox("spaCy's test suite")
| spaCy uses the #[+a("https://docs.pytest.org/en/latest/") pytest framework]
| for testing. For more details on how the tests are structured and best
| practices for writing your own tests, see our
| #[+a(gh("spaCy", "spacy/tests")) tests documentation].
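
p
    | Because pytest collects tests by path, you can also run only the tests
    | for a single language. For example, to run the English tests (assuming
    | the test suite lives in #[code spacy/tests], as described below):

+code(false, "bash").
    python -m pytest spacy/tests/lang/en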

+h(3, "testing-custom") Writing language-specific tests

p
    | It's recommended to always add at least some tests with examples specific
    | to the language. Language tests should be located in
    | #[+src(gh("spaCy", "spacy/tests/lang")) #[code tests/lang]] in a
    | directory named after the language ID. You'll also need to create a
    | fixture for your tokenizer in the
    | #[+src(gh("spaCy", "spacy/tests/conftest.py")) #[code conftest.py]].
    | Always use the #[+api("util#get_lang_class") #[code get_lang_class()]]
    | helper function within the fixture, instead of importing the class at the
    | top of the file. This will load the language data only when it's needed.
    | (Otherwise, #[em all data] would be loaded every time you run a test.)

+code.
    @pytest.fixture
    def en_tokenizer():
        return util.get_lang_class('en').Defaults.create_tokenizer()
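
p
    | pytest injects the fixture into any test that names it as an argument.
    | As a minimal sketch, a simple (hypothetical) test using the fixture
    | could look like this:

+code.
    def test_en_tokenizer_splits_sentence(en_tokenizer):
        tokens = en_tokenizer("This is a sentence.")
        assert len(tokens) == 5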

p
    | When adding test cases, always
    | #[+a(gh("spaCy", "spacy/tests#parameters")) #[code parametrize]] them –
    | this will make it easier for others to add more test cases without having
    | to modify the test itself. You can also add parameter tuples, for example,
    | a test sentence and its expected length, or a list of expected tokens.
    | Here's an example of an English tokenizer test for combinations of
    | punctuation and abbreviations:
+code("Example test").
@pytest.mark.parametrize('text,length', [
("The U.S. Army likes Shock and Awe.", 8),
("U.N. regulations are not a part of their concern.", 10),
("“Isn't it?”", 6)])
def test_en_tokenizer_handles_punct_abbrev(en_tokenizer, text, length):
tokens = en_tokenizer(text)
assert len(tokens) == length
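
p
    | The same pattern works for checking the exact tokens instead of just the
    | length. The test below is a hypothetical sketch, parametrized with a
    | text and its expected token texts:

+code.
    @pytest.mark.parametrize('text,expected_tokens', [
        ("Don't segment this!", ["Do", "n't", "segment", "this", "!"])])
    def test_en_tokenizer_handles_contractions(en_tokenizer, text, expected_tokens):
        tokens = en_tokenizer(text)
        assert [token.text for token in tokens] == expected_tokens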