spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-01-22 08:14:18 +03:00

Author	SHA1	Message	Date
Lj Miranda	addeb34bc4	Decorate regression tests Even if the issue number is already in the file, I still decorated them just to follow the convention found in test_issue8168.py	2021-11-05 11:47:44 +08:00
Lj Miranda	91dec2c76e	Decorate non-regression tests	2021-11-05 11:47:33 +08:00
Lj Miranda	199943deb4	Add simple script to add pytest marks	2021-11-05 11:47:28 +08:00
Duygu Altinok	f0e8c9fe58	Spanish noun chunks review (#9537 ) * updated syntax iters * formatted the code * added prepositional objects * code clean up * eliminated left attached adp * added es vocab * added basic tests * fixed typo * fixed typo * list to set * fixed doc name * added code for conj * more tests * differentiated adjectives and flat * fixed typo * added compounds * more compounds * tests for compounds * tests for nominal modifiers * fixed typo * fixed typo * formatted file * reformatted tests * fixed typo * fixed punct typo * formatted after changes * added indirect object * added full sentence examples * added longer full sentence examples * fixed sentence length of test * added passive subj * added test case by Damian	2021-11-05 00:46:36 +01:00
Duygu Altinok	6e6650307d	Portuguese noun chunks review (#9559 ) * added tests * added pt vocab * transferred spanish * added syntax iters * fixed parenthesis * added nmod example * added relative pron * fixed rel pron * added rel subclause * corrected typo * added more NP chains * long sentence * fixed typo * fixed typo * fixed typo * corrected heads * added passive subj * added pass subj * added passive obj * refinement to rights * went back to old * fixed test * fixed typo * fixed typo * formatted * Format * Format test cases Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-11-04 23:55:49 +01:00
Adriane Boyd	07dea324f6	Merge remote-tracking branch 'upstream/develop' into chore/switch-to-master-v3.2.0	2021-11-03 15:32:18 +01:00
Bram Vanroy	cab9209c3d	use metaclass to decorate errors (#9593 )	2021-11-03 15:29:32 +01:00
Adriane Boyd	db0d8c56d0	Add test for Language.pipe as_tuples with custom error handlers (#9608 ) * make nlp.pipe() return None docs when no exceptions are (re-)raised during error handling * Remove changes other than as_tuples test * Only check warning count for one process * Fix types * Format Co-authored-by: Xi Bai <xi.bai.ed@gmail.com>	2021-11-03 10:57:34 +01:00
Adriane Boyd	6eee024ff6	Pickle Doc._context (#9603 )	2021-11-03 09:14:29 +01:00
Adriane Boyd	667572adca	Temporarily skip compat tests (#9594 )	2021-11-02 14:10:48 +01:00
Lj Miranda	f1bc655a38	Add initial Tagalog (tl) tests (#9582 ) * Add tl_tokenizer to test fixtures * Add tagalog tests	2021-11-02 08:35:49 +01:00
Adriane Boyd	2d430958e1	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-3	2021-10-29 12:18:15 +02:00
Adriane Boyd	5477453ea3	Docs for thinc-apple-ops (#9549 ) * Docs for thinc-apple-ops * Ignore thinc-apple-ops in reqs tests * Fix install quickstart * Add cupy cuda 113, 114 extras * Remove draft section Co-authored-by: Ines Montani <ines@ines.io>	2021-10-29 10:35:31 +02:00
Adriane Boyd	12974bf4d9	Add micro PRF for morph scoring (#9546 ) * Add micro PRF for morph scoring For pipelines where morph features are added by more than one component and a reference training corpus may not contain all features, a micro PRF score is more flexible than a simple accuracy score. An example is the reading and inflection features added by the Japanese tokenizer. * Use `morph_micro_f` as the default morph score for Japanese morphologizers. * Update docstring * Fix typo in docstring * Update Scorer API docs * Fix results type * Organize score list by attribute prefix	2021-10-29 10:29:29 +02:00
Adriane Boyd	c053f158c5	Add support for floret vectors (#8909 ) * Add support for fasttext-bloom hash-only vectors Overview: * Extend `Vectors` to have two modes: `default` and `ngram` * `default` is the default mode and equivalent to the current `Vectors` * `ngram` supports the hash-only ngram tables from `fasttext-bloom` * Extend `spacy.StaticVectors.v2` to handle both modes with no changes for `default` vectors * Extend `spacy init vectors` to support ngram tables The `ngram` mode only supports vector tables produced by this fork of fastText, which adds an option to represent all vectors using only the ngram buckets table and which uses the exact same ngram generation algorithm and hash function (`MurmurHash3_x64_128`). `fasttext-bloom` produces an additional `.hashvec` table, which can be loaded by `spacy init vectors --fasttext-bloom-vectors`. https://github.com/adrianeboyd/fastText/tree/feature/bloom Implementation details: * `Vectors` now includes the `StringStore` as `Vectors.strings` so that the API can stay consistent for both `default` (which can look up from `str` or `int`) and `ngram` (which requires `str` to calculate the ngrams). * In ngram mode `Vectors` uses a default `Vectors` object as a cache since the ngram vectors lookups are relatively expensive. * The default cache size is the same size as the provided ngram vector table. * Once the cache is full, no more entries are added. The user is responsible for managing the cache in cases where the initial documents are not representative of the texts. * The cache can be resized by setting `Vectors.ngram_cache_size` or cleared with `vectors._ngram_cache.clear()`. * The API ends up a bit split between methods for `default` and for `ngram`, so functions that only make sense for `default` or `ngram` include warnings with custom messages suggesting alternatives where possible. * `Vocab.vectors` becomes a property so that the string stores can be synced when assigning vectors to a vocab. 
* `Vectors` serializes its own config settings as `vectors.cfg`. * The `Vectors` serialization methods have added support for `exclude` so that the `Vocab` can exclude the `Vectors` strings while serializing. Removed: * The `minn` and `maxn` options and related code from `Vocab.get_vector`, which does not work in a meaningful way for default vector tables. * The unused `GlobalRegistry` in `Vectors`. * Refactor to use reduce_mean Refactor to use reduce_mean and remove the ngram vectors cache. * Rename to floret * Rename to floret in error messages * Use --vectors-mode in CLI, vector init * Fix vectors mode in init * Remove unused var * Minor API and docstrings adjustments * Rename `--vectors-mode` to `--mode` in `init vectors` CLI * Rename `Vectors.get_floret_vectors` to `Vectors.get_batch` and support both modes. * Minor updates to Vectors docstrings. * Update API docs for Vectors and init vectors CLI * Update types for StaticVectors	2021-10-27 14:08:31 +02:00
Adriane Boyd	0c97ed2746	Rename ja morph features to Inflection and Reading (#9520 ) * Rename ja morph features to Inflection and Reading	2021-10-27 13:13:03 +02:00
Adriane Boyd	2ea9b58006	Ignore prefix in suffix matches (#9155 ) * Ignore prefix in suffix matches Ignore the currently matched prefix when looking for suffix matches in the tokenizer. Otherwise a lookbehind in the suffix pattern may match incorrectly due the presence of the prefix in the token string. * Move °[cfkCFK]. to a tokenizer exception * Adjust exceptions for same tokenization as v3.1 * Also update test accordingly * Continue to split . after °CFK if ° is not a prefix * Exclude new ° exceptions for pl * Switch back to default tokenization of "° C ." * Revert "Exclude new ° exceptions for pl" This reverts commit `952013a5b4`. * Add exceptions for °C for hu	2021-10-27 13:02:25 +02:00
Adriane Boyd	386dcada1c	Address random results in slow readers tests (#9544 ) * Set random seed for dataset shuffling * Use more dev examples for non-zero scores	2021-10-26 16:53:10 +02:00
Adriane Boyd	a803af9dfa	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1	2021-10-26 11:53:50 +02:00
github-actions[bot]	b0b115ff39	Auto-format code with black (#9530 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2021-10-22 13:03:10 +02:00
Daniël de Kok	f31ac6fd4f	Print a warning when multiprocessing is used on a GPU (#9475 ) * Raise an error when multiprocessing is used on a GPU As reported in #5507, a confusing exception is thrown when multiprocessing is used with a GPU model and the `fork` multiprocessing start method: cupy.cuda.runtime.CUDARuntimeError: cudaErrorInitializationError: initialization error This change checks whether one of the models uses the GPU when multiprocessing is used. If so, raise a friendly error message. Even though multiprocessing can work on a GPU with the `spawn` method, it quickly runs the GPU out-of-memory on real-world data. Also, multiprocessing on a single GPU typically does not provide large performance gains. * Move GPU multiprocessing check to Language.pipe * Warn rather than error when using multiprocessing with GPU models * Improve GPU multiprocessing warning message. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Reduce API assumptions Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/language.py * Update spacy/language.py * Test that warning is thrown with GPU + multiprocessing Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-10-21 16:14:23 +02:00
Sofie Van Landeghem	5a38f79f18	Custom component types in spacy.ty (#9469 ) * add custom protocols in spacy.ty * add a test for the new types in spacy.ty * import Example when type checking * some type fixes * put Protocol in compat * revert update check back to hasattr * runtime_checkable in compat as well	2021-10-21 15:31:06 +02:00
Ines Montani	ad9f57cbbf	Allow conftest.py to run twice for build envs	2021-10-19 15:13:25 +02:00
Sofie Van Landeghem	da578c3d3b	Fix kb.set_entities (#9463 ) * avoid creating _vectors_table when also using c_add_vector * write to self._vectors_table directly in set_entities	2021-10-19 09:39:17 +02:00
Adriane Boyd	271e8e7856	Skip compat table tests for prerelease versions (#9476 )	2021-10-15 14:28:02 +02:00
github-actions[bot]	29e83f0819	Auto-format code with black (#9474 ) * Auto-format code with black * Update spacy/pipeline/pipe.pyi Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-10-15 11:36:49 +02:00
Connor Brinton	657af5f91f	🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167 ) * 🚨 Ignore all existing Mypy errors * 🏗 Add Mypy check to CI * Add types-mock and types-requests as dev requirements * Add additional type ignore directives * Add types packages to dev-only list in reqs test * Add types-dataclasses for python 3.6 * Add ignore to pretrain * 🏷 Improve type annotation on `run_command` helper The `run_command` helper previously declared that it returned an `Optional[subprocess.CompletedProcess]`, but it isn't actually possible for the function to return `None`. These changes modify the type annotation of the `run_command` helper and remove all now-unnecessary `# type: ignore` directives. * 🔧 Allow variable type redefinition in limited contexts These changes modify how Mypy is configured to allow variables to have their type automatically redefined under certain conditions. The Mypy documentation contains the following example: ```python def process(items: List[str]) -> None: # 'items' has type List[str] items = [item.split() for item in items] # 'items' now has type List[List[str]] ... ``` This configuration change is especially helpful in reducing the number of `# type: ignore` directives needed to handle the common pattern of: * Accepting a filepath as a string * Overwriting the variable using `filepath = ensure_path(filepath)` These changes enable redefinition and remove all `# type: ignore` directives rendered redundant by this change. 
* 🏷 Add type annotation to converters mapping * 🚨 Fix Mypy error in convert CLI argument verification * 🏷 Improve type annotation on `resolve_dot_names` helper * 🏷 Add type annotations for `Vocab` attributes `strings` and `vectors` * 🏷 Add type annotations for more `Vocab` attributes * 🏷 Add loose type annotation for gold data compilation * 🏷 Improve `_format_labels` type annotation * 🏷 Fix `get_lang_class` type annotation * 🏷 Loosen return type of `Language.evaluate` * 🏷 Don't accept `Scorer` in `handle_scores_per_type` * 🏷 Add `string_to_list` overloads * 🏷 Fix non-Optional command-line options * 🙈 Ignore redefinition of `wandb_logger` in `loggers.py` * ➕ Install `typing_extensions` in Python 3.8+ The `typing_extensions` package states that it should be used when "writing code that must be compatible with multiple Python versions". Since SpaCy needs to support multiple Python versions, it should be used when newer `typing` module members are required. One example of this is `Literal`, which is available starting with Python 3.8. Previously SpaCy tried to import `Literal` from `typing`, falling back to `typing_extensions` if the import failed. However, Mypy doesn't seem to be able to understand what `Literal` means when the initial import means. Therefore, these changes modify how `compat` imports `Literal` by always importing it from `typing_extensions`. These changes also modify how `typing_extensions` is installed, so that it is a requirement for all Python versions, including those greater than or equal to 3.8. * 🏷 Improve type annotation for `Language.pipe` These changes add a missing overload variant to the type signature of `Language.pipe`. Additionally, the type signature is enhanced to allow type checkers to differentiate between the two overload variants based on the `as_tuple` parameter. 
Fixes #8772 * ➖ Don't install `typing-extensions` in Python 3.8+ After more detailed analysis of how to implement Python version-specific type annotations using SpaCy, it has been determined that by branching on a comparison against `sys.version_info` can be statically analyzed by Mypy well enough to enable us to conditionally use `typing_extensions.Literal`. This means that we no longer need to install `typing_extensions` for Python versions greater than or equal to 3.8! 🎉 These changes revert previous changes installing `typing-extensions` regardless of Python version and modify how we import the `Literal` type to ensure that Mypy treats it properly. * resolve mypy errors for Strict pydantic types * refactor code to avoid missing return statement * fix types of convert CLI command * avoid list-set confustion in debug_data * fix typo and formatting * small fixes to avoid type ignores * fix types in profile CLI command and make it more efficient * type fixes in projects CLI * put one ignore back * type fixes for render * fix render types - the sequel * fix BaseDefault in language definitions * fix type of noun_chunks iterator - yields tuple instead of span * fix types in language-specific modules * 🏷 Expand accepted inputs of `get_string_id` `get_string_id` accepts either a string (in which case it returns its ID) or an ID (in which case it immediately returns the ID). These changes extend the type annotation of `get_string_id` to indicate that it can accept either strings or IDs. * 🏷 Handle override types in `combine_score_weights` The `combine_score_weights` function allows users to pass an `overrides` mapping to override data extracted from the `weights` argument. Since it allows `Optional` dictionary values, the return value may also include `Optional` dictionary values. These changes update the type annotations for `combine_score_weights` to reflect this fact. 
* 🏷 Fix tokenizer serialization method signatures in `DummyTokenizer` * 🏷 Fix redefinition of `wandb_logger` These changes fix the redefinition of `wandb_logger` by giving a separate name to each `WandbLogger` version. For backwards-compatibility, `spacy.train` still exports `wandb_logger_v3` as `wandb_logger` for now. * more fixes for typing in language * type fixes in model definitions * 🏷 Annotate `_RandomWords.probs` as `NDArray` * 🏷 Annotate `tok2vec` layers to help Mypy * 🐛 Fix `_RandomWords.probs` type annotations for Python 3.6 Also remove an import that I forgot to move to the top of the module 😅 * more fixes for matchers and other pipeline components * quick fix for entity linker * fixing types for spancat, textcat, etc * bugfix for tok2vec * type annotations for scorer * add runtime_checkable for Protocol * type and import fixes in tests * mypy fixes for training utilities * few fixes in util * fix import * 🐵 Remove unused `# type: ignore` directives * 🏷 Annotate `Language._components` * 🏷 Annotate `spacy.pipeline.Pipe` * add doc as property to span.pyi * small fixes and cleanup * explicit type annotations instead of via comment Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com> Co-authored-by: svlandeg <svlandeg@github.com>	2021-10-14 15:21:40 +02:00
Adriane Boyd	d98d525bc8	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.1-3	2021-10-14 09:41:46 +02:00
Paul O'Leary McCann	a3b7519aba	Fix JA Morph Values (#9449 ) * Don't set empty / weird values in morph * Update tests to handle empty morph values * Fix everything * Replace potentially problematic characters * Fix test	2021-10-14 09:21:36 +02:00
Ines Montani	c48564688f	Merge pull request #9423 from explosion/tests/issue-marker	2021-10-13 16:53:40 +02:00
Jette16	78365452d3	Moved test for universe into .github folder (#9447 ) * Moved universe-test into .github folder * Cleaned code * Changed a file name	2021-10-13 14:13:06 +02:00
Sofie Van Landeghem	2e3d6b8b5a	Fix test for spancat (#9446 ) * fix test for spancat * increase tolerance for almost equal checks * Update spacy/tests/test_models.py * Update spacy/tests/test_models.py	2021-10-13 10:47:56 +02:00
Paul O'Leary McCann	efe5beefe0	Add test for case where parser overwrite annotations (#9406 ) * Add test for case where parser overwrite annotations * Move test to its own file Also add note about how other tokens modify results. * Fix xfail decorator	2021-10-11 14:57:45 +02:00
Ines Montani	1fa7c4e73b	Support issue marker via pytest	2021-10-11 13:56:24 +02:00
Jette16	3b144a3a51	Add universe test (#9278 ) * Added test for universe.json * Added contributor agreement * Ran black on test_universe_json.py	2021-10-11 11:08:46 +02:00
Paul O'Leary McCann	2a7e327310	Fix Dependency Matcher Ordering Issue (#9337 ) * Fix inconsistency This makes the failing test pass, so that behavior is consistent whether patterns are added in one call or two. The issue is that the hash for patterns depended on the index of the pattern in the list of current patterns, not the list of total patterns, so a second call would get identical match ids. * Add illustrative test case * Add failing test for remove case Patterns are not removed from the internal matcher on calls to remove, which causes spurious weird matches (or misses). * Fix removal issue Remove patterns from the internal matcher. * Check that the single add call also gets no matches	2021-10-11 10:26:13 +02:00
Paul O'Leary McCann	113d53ab6c	Fix tests for changes to inflection structure (#9390 )	2021-10-07 13:42:18 +02:00
Elia Robyn Lake (Robyn Speer)	53b5f245ed	Allow IETF language codes, aliases, and close matches (#9342 ) * use language-matching to allow language code aliases Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * link to "IETF language tags" in docs Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * Make requirements consistent Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * change "two-letter language ID" to "IETF language tag" in language docs Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * use langcodes 3.2 and handle language-tag errors better Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * all unknown language codes are ImportErrors Signed-off-by: Elia Robyn Speer <elia@explosion.ai> Co-authored-by: Elia Robyn Speer <elia@explosion.ai>	2021-10-05 09:52:22 +02:00
Paul O'Leary McCann	1ee6541ab0	Moving Japanese tokenizer extra info to Token.morph (#8977 ) * Use morph for extra Japanese tokenizer info Previously Japanese tokenizer info that didn't correspond to Token fields was put in user data. Since spaCy core should avoid touching user data, this moves most information to the Token.morph attribute. It also adds the normalized form, which wasn't exposed before. The subtokens, which are a list of full tokens, are still added to user data, except with the default tokenizer granualarity. With the default tokenizer settings the subtokens are all None, so in this case the user data is simply not set. * Update tests Also adds a new test for norm data. * Update docs * Add Japanese morphologizer factory Set the default to `extend=True` so that the morphologizer does not clobber the values set by the tokenizer. * Use the norm_ field for normalized forms Before this commit, normalized forms were put in the "norm" field in the morph attributes. I am not sure why I did that instead of using the token morph, I think I just forgot about it. * Skip test if sudachipy is not installed * Fix import Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-10-01 19:19:26 +02:00
Paul O'Leary McCann	8f2409e514	Don't serialize user data in DocBin if not saving it (fix #9190 ) (#9226 ) * Don't store user data if told not to (fix #9190) * Add unit tests for the store_user_data setting	2021-10-01 12:37:39 +02:00
Adriane Boyd	b3192ddea3	Sync thinc install dep in setup, fix test packaging (#9336 ) * Sync thinc install dep in setup * Add __init__.py to include package tests in package * Include *.toml in package	2021-09-30 19:02:10 +02:00
Adriane Boyd	03fefa37e2	Add overwrite settings for more components (#9050 ) * Add overwrite settings for more components For pipeline components where it's relevant and not already implemented, add an explicit `overwrite` setting that controls whether `set_annotations` overwrites existing annotation. For the `morphologizer`, add an additional setting `extend`, which controls whether the existing features are preserved. * +overwrite, +extend: overwrite values of existing features, add any new features * +overwrite, -extend: overwrite completely, removing any existing features * -overwrite, +extend: keep values of existing features, add any new features * -overwrite, -extend: do not modify the existing value if set In all cases an unset value will be set by `set_annotations`. Preserve current overwrite defaults: * True: morphologizer, entity linker * False: tagger, sentencizer, senter * Add backwards compat overwrite settings * Put empty line back Removed by accident in last commit * Set backwards-compatible defaults in __init__ Because the `TrainablePipe` serialization methods update `cfg`, there's no straightforward way to detect whether models serialized with a previous version are missing the overwrite settings. It would be possible in the sentencizer due to its separate serialization methods, however to keep the changes parallel, this also sets the default in `__init__`. * Remove traces Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>	2021-09-30 15:35:55 +02:00
Adriane Boyd	effae12cbd	Update slow readers test to use textcat_multilabel (#9300 )	2021-09-27 20:04:02 +02:00
Adriane Boyd	fe5f5d6ac6	Update Catalan tokenizer (#9297 ) * Update Makefile For more recent python version * updated for bsc changes New tokenization changes * Update test_text.py * updating tests and requirements * changed failed test in test/lang/ca changed failed test in test/lang/ca * Update .gitignore deleted stashed changes line * back to python 3.6 and remove transformer requirements As per request * Update test_exception.py Change the test * Update test_exception.py Remove test print * Update Makefile For more recent python version * updated for bsc changes New tokenization changes * updating tests and requirements * Update requirements.txt Removed spacy-transfromers from requirements * Update test_exception.py Added final punctuation to ensure consistency * Update Makefile Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Format * Update test to check all tokens Co-authored-by: cayorodriguez <crodriguezp@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-09-27 14:42:30 +02:00
Adriane Boyd	03f234b739	Merge remote-tracking branch 'upstream/master' into develop	2021-09-27 09:10:45 +02:00
github-actions[bot]	4da2af4e0e	Auto-format code with black (#9284 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2021-09-24 10:46:43 +02:00
Jette16	5eced281d8	Add universe test (#9278 ) * Added test for universe.json * Added contributor agreement * Ran black on test_universe_json.py	2021-09-23 14:31:42 +02:00
Adriane Boyd	2f0bb77920	Accept Doc input in pipelines (#9069 ) * Accept Doc input in pipelines Allow `Doc` input to `Language.__call__` and `Language.pipe`, which skips `Language.make_doc` and passes the doc directly to the pipeline. * ensure_doc helper function * avoid running multiple processes on GPU * Update spacy/tests/test_language.py Co-authored-by: svlandeg <svlandeg@github.com>	2021-09-22 09:41:05 +02:00
Adriane Boyd	00bdb31150	Fix vector for 0-length span (#9244 )	2021-09-20 20:22:49 +02:00
github-actions[bot]	015d439eb6	Auto-format code with black (#9234 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2021-09-20 08:49:19 +02:00
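
The four `overwrite`/`extend` combinations described in commit 03fefa37e2 (#9050) can be sketched in plain Python. This is only an illustration of the rules as the commit message states them, not spaCy's implementation: the helper name `merge_morph_features` and the use of plain dicts for morphological features are assumptions made for the example.

```python
from typing import Dict


def merge_morph_features(
    existing: Dict[str, str],
    predicted: Dict[str, str],
    *,
    overwrite: bool,
    extend: bool,
) -> Dict[str, str]:
    """Hypothetical helper mirroring the rules listed in the commit message."""
    # In all cases an unset value is set from the prediction.
    if not existing:
        return dict(predicted)
    if overwrite and extend:
        # Overwrite values of existing features, add any new features.
        return {**existing, **predicted}
    if overwrite:
        # Overwrite completely, removing any existing features.
        return dict(predicted)
    if extend:
        # Keep values of existing features, add any new features.
        return {**predicted, **existing}
    # Neither flag: do not modify the existing value.
    return dict(existing)


# Example input: the feature names follow the Japanese tokenizer commits above,
# the values are made up.
existing = {"Inflection": "old", "Reading": "old"}
predicted = {"Reading": "new", "Case": "Nom"}
merged = merge_morph_features(existing, predicted, overwrite=False, extend=True)
```

With `overwrite=False, extend=True`, which is the combination the Japanese morphologizer factory in commit 1ee6541ab0 relies on via `extend=True`, values already set by an earlier component survive and only new features are added.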


2334 Commits
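
Commit 657af5f91f (#9167) notes that branching on `sys.version_info` lets Mypy analyze a conditional `Literal` import statically. A minimal sketch of that pattern follows; the `VectorsMode` alias and `describe_mode` function are hypothetical names used only to show the imported type in use, and the `typing_extensions` branch is reached only on Python versions below 3.8.

```python
import sys

# Mypy understands comparisons against sys.version_info, so it can pick
# the correct branch for the Python version being checked.
if sys.version_info >= (3, 8):
    from typing import Literal
else:
    # Backport package, only needed on older interpreters.
    from typing_extensions import Literal

# Hypothetical alias for illustration only.
VectorsMode = Literal["default", "floret"]


def describe_mode(mode: VectorsMode) -> str:
    """Return a short label for the given mode (illustrative only)."""
    return f"vectors mode: {mode}"


label = describe_mode("floret")
```

Unlike a `try`/`except ImportError` fallback, this form gives the type checker a single unambiguous definition of `Literal` per Python version.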