
spaCy tests

spaCy uses the pytest framework for testing. For more info on this, see the pytest documentation.

Tests for spaCy modules and classes live in their own directories of the same name. For example, tests for the Tokenizer can be found in /tests/tokenizer. All test modules (i.e. directories) also need to be listed in spaCy's setup.py. To be interpreted and run, all test files and test functions need to be prefixed with test_.

⚠️ Important note: As part of our new model training infrastructure, we've moved all model tests to the spacy-models repository. This allows us to test the models separately from the core library functionality.

Table of contents

  1. Running the tests
  2. Dos and don'ts
  3. Parameters
  4. Fixtures
  5. Helpers and utilities
  6. Contributing to the tests

Running the tests

To show print statements, run the tests with py.test -s. To abort after the first failure, run them with py.test -x.

py.test spacy                        # run basic tests
py.test spacy --slow                 # run basic and slow tests

You can also run tests in a specific file or directory, or even only one specific test:

py.test spacy/tests/tokenizer  # run all tests in directory
py.test spacy/tests/tokenizer/test_exceptions.py # run all tests in file
py.test spacy/tests/tokenizer/test_exceptions.py::test_tokenizer_handles_emoji # run specific test

Dos and don'ts

To keep the behaviour of the tests consistent and predictable, we try to follow a few basic conventions:

  • Test names should follow a pattern of test_[module]_[tested behaviour]. For example: test_tokenizer_keeps_email or test_spans_override_sentiment.
  • If you're testing for a bug reported in a specific issue, always create a regression test. Regression tests should be named test_issue[ISSUE NUMBER] and live in the regression directory.
  • Only use @pytest.mark.xfail for tests that should pass but currently fail. To test for desired negative behaviour, use assert not in your test (see the sketch after this list).
  • Very extensive tests that take a long time to run should be marked with @pytest.mark.slow. If your slow test is testing important behaviour, consider adding an additional simpler version.
  • If tests require loading the models, they should be added to the spacy-models tests.
  • Before requiring the models, always make sure there is no other way to test the particular behaviour. In a lot of cases, it's sufficient to simply create a Doc object manually. See the section on helpers and utility functions for more info on this.
  • Avoid unnecessary imports. There should never be a need to explicitly import spaCy at the top of a file, and many components are available as fixtures. You should also avoid wildcard imports (from module import *).
  • If you're importing from spaCy, always use absolute imports. For example: from spacy.language import Language.
  • Don't forget the unicode declarations at the top of each file. This way, unicode strings won't have to be prefixed with u.
  • Try to keep the tests readable and concise. Use clear and descriptive variable names (doc, tokens and text are great), keep it short and only test for one behaviour at a time.
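
For instance, here's a minimal sketch of these conventions: a regression test (the issue number is purely a placeholder) that lives in the regression directory and uses assert not to check for desired negative behaviour:

def test_issue1234(en_tokenizer):
    # Illustrative only: the issue number above is made up for this example
    tokens = en_tokenizer("This is a test")
    # Desired negative behaviour: no token should consist of whitespace only
    assert not any(token.is_space for token in tokens)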

Parameters

If the test cases can be extracted from the test, always parametrize them instead of hard-coding them into the test:

@pytest.mark.parametrize('text', ["google.com", "spacy.io"])
def test_tokenizer_keep_urls(tokenizer, text):
    tokens = tokenizer(text)
    assert len(tokens) == 1

This will run the test once for each text value. Even if you're only testing one example, it's usually best to specify it as a parameter. This will later make it easier for others to quickly add additional test cases without having to modify the test.

You can also specify parameters as tuples to test with multiple values per test:

@pytest.mark.parametrize('text,length', [("U.S.", 1), ("us.", 2), ("(U.S.", 2)])
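
A complete test consuming both values could then look like the following sketch (the test name is illustrative; the expected lengths assume the default punctuation rules):

@pytest.mark.parametrize('text,length', [("U.S.", 1), ("us.", 2), ("(U.S.", 2)])
def test_tokenizer_handles_abbr_punct(tokenizer, text, length):
    # Each tuple supplies one value per name listed in 'text,length'
    tokens = tokenizer(text)
    assert len(tokens) == length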

To test for combinations of parameters, you can add several parametrize markers:

@pytest.mark.parametrize('text', ["A test sentence", "Another sentence"])
@pytest.mark.parametrize('punct', ['.', '!', '?'])

This will run the test with all combinations of the two parameters text and punct. Use this feature sparingly, though, as it can easily cause unnecessary or undesired test bloat.
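
As an illustrative sketch, such a combined test could look like this (six runs in total, one per combination):

@pytest.mark.parametrize('text', ["A test sentence", "Another sentence"])
@pytest.mark.parametrize('punct', ['.', '!', '?'])
def test_tokenizer_splits_trailing_punct(tokenizer, text, punct):
    # The trailing punctuation should be split off as its own token
    tokens = tokenizer(text + punct)
    assert tokens[-1].text == punct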

Fixtures

Fixtures to create instances of spaCy objects and other components should only be defined once in the global conftest.py. We avoid having per-directory conftest files, as this can easily lead to confusion.

These are the main fixtures that are currently available:

| Fixture | Description |
| --- | --- |
| tokenizer | Basic, language-independent tokenizer. Identical to the xx language class. |
| en_tokenizer, de_tokenizer, ... | Creates an English, German etc. tokenizer. |
| en_vocab | Creates an instance of the English Vocab. |

The fixtures can be used in all tests by simply setting them as an argument, like this:

def test_module_do_something(en_tokenizer):
    tokens = en_tokenizer("Some text here")

If all tests in a file require a specific configuration, or use the same complex example, it can be helpful to create a separate fixture. This fixture should be added at the top of each file. Make sure to use descriptive names for these fixtures and don't override any of the global fixtures listed above. From looking at a test, it should immediately be clear which fixtures are used, and where they are coming from.
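
For example, a sketch of such a file-level fixture (the fixture name and example text are illustrative):

@pytest.fixture
def example_text():
    # Shared example used by several tests in this file
    return "Give it back! He pleaded."

def test_module_shared_example(en_tokenizer, example_text):
    tokens = en_tokenizer(example_text)
    assert tokens[0].text == "Give"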

Helpers and utilities

Our new test setup comes with a few handy utility functions that can be imported from util.py.

Constructing a Doc object manually with get_doc()

Loading the models is expensive and not necessary if you're not actually testing the model performance. If all you need is a Doc object with annotations like heads, POS tags or the dependency parse, you can use get_doc() to construct it manually.

def test_doc_token_api_strings(en_tokenizer):
    text = "Give it back! He pleaded."
    pos = ['VERB', 'PRON', 'PART', 'PUNCT', 'PRON', 'VERB', 'PUNCT']
    heads = [0, -1, -2, -3, 1, 0, -1]
    deps = ['ROOT', 'dobj', 'prt', 'punct', 'nsubj', 'ROOT', 'punct']

    tokens = en_tokenizer(text)
    doc = get_doc(tokens.vocab, [t.text for t in tokens], pos=pos, heads=heads, deps=deps)
    assert doc[0].text == 'Give'
    assert doc[0].lower_ == 'give'
    assert doc[0].pos_ == 'VERB'
    assert doc[0].dep_ == 'ROOT'

You can construct a Doc with the following arguments:

| Argument | Description |
| --- | --- |
| vocab | Vocab instance to use. If you're tokenizing before creating a Doc, make sure to use the tokenizer's vocab. Otherwise, you can also use the en_vocab fixture. (required) |
| words | List of words, for example [t.text for t in tokens]. (required) |
| heads | List of heads as integers. |
| pos | List of POS tags as text values. |
| tag | List of tag names as text values. |
| dep | List of dependencies as text values. |
| ents | List of entity tuples with start, end, label (for example (0, 2, 'PERSON')). The label will be looked up in vocab.strings[label]. |
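
For instance, a sketch of the ents argument (the words and entity span are illustrative; PERSON is a built-in label):

def test_doc_has_person_ent(en_vocab):
    words = ["Stewart", "Lee", "is", "a", "comedian"]
    doc = get_doc(en_vocab, words, ents=[(0, 2, 'PERSON')])
    # The first two tokens form a single PERSON entity
    assert [(ent.text, ent.label_) for ent in doc.ents] == [("Stewart Lee", 'PERSON')]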

Here's how to quickly get these values from within spaCy:

doc = nlp(u'Some text here')
print([token.head.i-token.i for token in doc])
print([token.tag_ for token in doc])
print([token.pos_ for token in doc])
print([token.dep_ for token in doc])
print([(ent.start, ent.end, ent.label_) for ent in doc.ents])

Note: There's currently no way of setting the serializer data for the parser without loading the models. If this is relevant to your test, constructing the Doc via get_doc() won't work.

Other utilities

| Name | Description |
| --- | --- |
| apply_transition_sequence(parser, doc, sequence) | Perform a series of pre-specified transitions, to put the parser in a desired state. |
| add_vecs_to_vocab(vocab, vectors) | Add list of vector tuples ([("text", [1, 2, 3])]) to given vocab. All vectors need to have the same length. |
| get_cosine(vec1, vec2) | Get cosine for two given vectors. |
| assert_docs_equal(doc1, doc2) | Compare two Doc objects and assert that they're equal. Tests for tokens, tags, dependencies and entities. |
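
As a small sketch of the vector helpers (the vector values are illustrative):

def test_util_vector_helpers(en_vocab):
    # Attach two toy vectors to the vocab
    add_vecs_to_vocab(en_vocab, [("apple", [1, 2, 3]), ("orange", [-1, -2, -3])])
    # Opposite vectors should have a cosine close to -1
    assert get_cosine([1, 2, 3], [-1, -2, -3]) < -0.99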

Contributing to the tests

There's still a long way to go to finally reach 100% test coverage and we'd appreciate your help! 🙌 You can open an issue on our issue tracker and label it tests, or make a pull request to this repository.

📖 For more information on contributing to spaCy in general, check out our contribution guidelines.