spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-25 17:36:30 +03:00

Author	SHA1	Message	Date
Adriane Boyd	bb26550e22	Fix StaticVectors after floret+mypy merge (#9566 )	2021-10-29 16:25:43 +02:00
Adriane Boyd	322635e371	Set version to v3.2.0 (#9565 )	2021-10-29 15:22:40 +02:00
Adriane Boyd	2d430958e1	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-3	2021-10-29 12:18:15 +02:00
Paul O'Leary McCann	006df1ae1f	Clarify error when words are of wrong type (#9541 ) * Clarify error when words are of wrong type See #9437 * Update docs * Use try/except * Apply suggestions from code review Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-10-29 12:08:40 +02:00
Paul O'Leary McCann	2fd8d616e7	Add docs section for spacy.cli.train.train (#9545 ) * Add section for spacy.cli.train.train * Add link from training page to train function * Ensure path in train helper * Update docs Co-authored-by: Ines Montani <ines@ines.io>	2021-10-29 10:36:34 +02:00
Adriane Boyd	5477453ea3	Docs for thinc-apple-ops (#9549 ) * Docs for thinc-apple-ops * Ignore thinc-apple-ops in reqs tests * Fix install quickstart * Add cupy cuda 113, 114 extras * Remove draft section Co-authored-by: Ines Montani <ines@ines.io>	2021-10-29 10:35:31 +02:00
Adriane Boyd	12974bf4d9	Add micro PRF for morph scoring (#9546 ) * Add micro PRF for morph scoring For pipelines where morph features are added by more than one component and a reference training corpus may not contain all features, a micro PRF score is more flexible than a simple accuracy score. An example is the reading and inflection features added by the Japanese tokenizer. * Use `morph_micro_f` as the default morph score for Japanese morphologizers. * Update docstring * Fix typo in docstring * Update Scorer API docs * Fix results type * Organize score list by attribute prefix	2021-10-29 10:29:29 +02:00
Adriane Boyd	c053f158c5	Add support for floret vectors (#8909 ) * Add support for fasttext-bloom hash-only vectors Overview: * Extend `Vectors` to have two modes: `default` and `ngram` * `default` is the default mode and equivalent to the current `Vectors` * `ngram` supports the hash-only ngram tables from `fasttext-bloom` * Extend `spacy.StaticVectors.v2` to handle both modes with no changes for `default` vectors * Extend `spacy init vectors` to support ngram tables The `ngram` mode only supports vector tables produced by this fork of fastText, which adds an option to represent all vectors using only the ngram buckets table and which uses the exact same ngram generation algorithm and hash function (`MurmurHash3_x64_128`). `fasttext-bloom` produces an additional `.hashvec` table, which can be loaded by `spacy init vectors --fasttext-bloom-vectors`. https://github.com/adrianeboyd/fastText/tree/feature/bloom Implementation details: * `Vectors` now includes the `StringStore` as `Vectors.strings` so that the API can stay consistent for both `default` (which can look up from `str` or `int`) and `ngram` (which requires `str` to calculate the ngrams). * In ngram mode `Vectors` uses a default `Vectors` object as a cache since the ngram vectors lookups are relatively expensive. * The default cache size is the same size as the provided ngram vector table. * Once the cache is full, no more entries are added. The user is responsible for managing the cache in cases where the initial documents are not representative of the texts. * The cache can be resized by setting `Vectors.ngram_cache_size` or cleared with `vectors._ngram_cache.clear()`. * The API ends up a bit split between methods for `default` and for `ngram`, so functions that only make sense for `default` or `ngram` include warnings with custom messages suggesting alternatives where possible. * `Vocab.vectors` becomes a property so that the string stores can be synced when assigning vectors to a vocab. 
* `Vectors` serializes its own config settings as `vectors.cfg`. * The `Vectors` serialization methods have added support for `exclude` so that the `Vocab` can exclude the `Vectors` strings while serializing. Removed: * The `minn` and `maxn` options and related code from `Vocab.get_vector`, which does not work in a meaningful way for default vector tables. * The unused `GlobalRegistry` in `Vectors`. * Refactor to use reduce_mean Refactor to use reduce_mean and remove the ngram vectors cache. * Rename to floret * Rename to floret in error messages * Use --vectors-mode in CLI, vector init * Fix vectors mode in init * Remove unused var * Minor API and docstrings adjustments * Rename `--vectors-mode` to `--mode` in `init vectors` CLI * Rename `Vectors.get_floret_vectors` to `Vectors.get_batch` and support both modes. * Minor updates to Vectors docstrings. * Update API docs for Vectors and init vectors CLI * Update types for StaticVectors	2021-10-27 14:08:31 +02:00
Adriane Boyd	0c97ed2746	Rename ja morph features to Inflection and Reading (#9520 ) * Rename ja morph features to Inflection and Reading	2021-10-27 13:13:03 +02:00
Adriane Boyd	2ea9b58006	Ignore prefix in suffix matches (#9155 ) * Ignore prefix in suffix matches Ignore the currently matched prefix when looking for suffix matches in the tokenizer. Otherwise a lookbehind in the suffix pattern may match incorrectly due to the presence of the prefix in the token string. * Move °[cfkCFK]. to a tokenizer exception * Adjust exceptions for same tokenization as v3.1 * Also update test accordingly * Continue to split . after °CFK if ° is not a prefix * Exclude new ° exceptions for pl * Switch back to default tokenization of "° C ." * Revert "Exclude new ° exceptions for pl" This reverts commit `952013a5b4`. * Add exceptions for °C for hu	2021-10-27 13:02:25 +02:00
Adriane Boyd	386dcada1c	Address random results in slow readers tests (#9544 ) * Set random seed for dataset shuffling * Use more dev examples for non-zero scores	2021-10-26 16:53:10 +02:00
Adriane Boyd	a803af9dfa	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.2-1	2021-10-26 11:53:50 +02:00
github-actions[bot]	b0b115ff39	Auto-format code with black (#9530 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2021-10-22 13:03:10 +02:00
Sofie Van Landeghem	c7ed631f3c	bump version to 3.1.4 (#9524 )	2021-10-21 20:34:57 +02:00
Daniël de Kok	f31ac6fd4f	Print a warning when multiprocessing is used on a GPU (#9475 ) * Raise an error when multiprocessing is used on a GPU As reported in #5507, a confusing exception is thrown when multiprocessing is used with a GPU model and the `fork` multiprocessing start method: cupy.cuda.runtime.CUDARuntimeError: cudaErrorInitializationError: initialization error This change checks whether one of the models uses the GPU when multiprocessing is used. If so, raise a friendly error message. Even though multiprocessing can work on a GPU with the `spawn` method, it quickly runs the GPU out-of-memory on real-world data. Also, multiprocessing on a single GPU typically does not provide large performance gains. * Move GPU multiprocessing check to Language.pipe * Warn rather than error when using multiprocessing with GPU models * Improve GPU multiprocessing warning message. Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Reduce API assumptions Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update spacy/language.py * Update spacy/language.py * Test that warning is thrown with GPU + multiprocessing Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-10-21 16:14:23 +02:00
Sofie Van Landeghem	5a38f79f18	Custom component types in spacy.ty (#9469 ) * add custom protocols in spacy.ty * add a test for the new types in spacy.ty * import Example when type checking * some type fixes * put Protocol in compat * revert update check back to hasattr * runtime_checkable in compat as well	2021-10-21 15:31:06 +02:00
Daniël de Kok	d0631e3005	Replace use_ops("numpy") by use_ops("cpu") in the parser (#9501 ) * Replace use_ops("numpy") by use_ops("cpu") in the parser This ensures that the best available CPU implementation is chosen (e.g. Thinc Apple Ops on macOS). * Run spaCy tests with apple-thinc-ops on macOS	2021-10-21 11:22:45 +02:00
Paul O'Leary McCann	28ecf399da	Remove some old version refs in the docs (#9448 ) * Remove some old version refs in the docs * Remove warning * Update spacy/matcher/matcher.pyx * Remove all references to the punctuation warning Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-10-21 11:17:59 +02:00
Edward	014da12f1d	Don't add tok2vec when efficiency textcat (#9502 )	2021-10-20 17:30:19 +02:00
Daniël de Kok	1f05f56433	Add the spacy.models_with_nvtx_range.v1 callback (#9124 ) * Add the spacy.models_with_nvtx_range.v1 callback This callback recursively adds NVTX ranges to the Models in each pipe in a pipeline. * Fix create_models_with_nvtx_range type signature * NVTX range: wrap models of all trainable pipes jointly This avoids that (sub-)models that are shared between pipes get wrapped twice. * NVTX range callback: make color configurable Add forward_color and backprop_color options to set the color for the NVTX range. * Move create_models_with_nvtx_range to spacy.ml * Update create_models_with_nvtx_range for thinc changes with_nvtx_range now updates an existing node, rather than returning a wrapper node. So, we can simply walk over the nodes and update them. * NVTX: use after_pipeline_creation in example	2021-10-20 11:59:48 +02:00
Ines Montani	ad9f57cbbf	Allow conftest.py to run twice for build envs	2021-10-19 15:13:25 +02:00
Sofie Van Landeghem	da578c3d3b	Fix kb.set_entities (#9463 ) * avoid creating _vectors_table when also using c_add_vector * write to self._vectors_table directly in set_entities	2021-10-19 09:39:17 +02:00
Sofie Van Landeghem	3fd3531e12	Docs for new spacy-trf architectures (#8954 ) * use TransformerModel.v2 in quickstart * update docs for new transformer architectures * bump spacy_transformers to 1.1.0 * Add new arguments spacy-transformers.TransformerModel.v3 * Mention that mixed-precision support is experimental * Describe delta transformers.Tok2VecTransformer versions * add dot * add dot, again * Update some more TransformerModel references v2 -> v3 * Add mixed-precision options to the training quickstart Disable mixed-precision training/prediction by default. * Update setup.cfg Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Apply suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Update website/docs/usage/embeddings-transformers.md Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Daniël de Kok <me@danieldk.eu> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-10-18 14:15:06 +02:00
Edward	a7cb8de0d7	Fix assertion error in staticvectors (#9481 ) * Fix assertion error in staticvectors * Update spacy/ml/staticvectors.py * Update spacy/ml/staticvectors.py Co-authored-by: Ines Montani <ines@ines.io> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Ines Montani <ines@ines.io>	2021-10-18 09:10:45 +02:00
Adriane Boyd	271e8e7856	Skip compat table tests for prerelease versions (#9476 )	2021-10-15 14:28:02 +02:00
github-actions[bot]	29e83f0819	Auto-format code with black (#9474 ) * Auto-format code with black * Update spacy/pipeline/pipe.pyi Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-10-15 11:36:49 +02:00
Aviora	9a824255d3	Add examples and num_words for Vietnamese (#9412 ) * add examples and num_words * add contributor agreement * Update spacy/lang/vi/examples.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * consistent format add empty line at the end of file Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-10-14 19:15:51 +02:00
Adriane Boyd	b5143b1b84	Minor fixes to convert CLI (#9465 ) * Provide default value for `msg` * Compare paths correctly for file conversion	2021-10-14 18:37:34 +02:00
Connor Brinton	657af5f91f	🏷 Add Mypy check to CI and ignore all existing Mypy errors (#9167 ) * 🚨 Ignore all existing Mypy errors * 🏗 Add Mypy check to CI * Add types-mock and types-requests as dev requirements * Add additional type ignore directives * Add types packages to dev-only list in reqs test * Add types-dataclasses for python 3.6 * Add ignore to pretrain * 🏷 Improve type annotation on `run_command` helper The `run_command` helper previously declared that it returned an `Optional[subprocess.CompletedProcess]`, but it isn't actually possible for the function to return `None`. These changes modify the type annotation of the `run_command` helper and remove all now-unnecessary `# type: ignore` directives. * 🔧 Allow variable type redefinition in limited contexts These changes modify how Mypy is configured to allow variables to have their type automatically redefined under certain conditions. The Mypy documentation contains the following example:
```python
def process(items: List[str]) -> None:
    # 'items' has type List[str]
    items = [item.split() for item in items]
    # 'items' now has type List[List[str]]
    ...
```
This configuration change is especially helpful in reducing the number of `# type: ignore` directives needed to handle the common pattern of: * Accepting a filepath as a string * Overwriting the variable using `filepath = ensure_path(filepath)` These changes enable redefinition and remove all `# type: ignore` directives rendered redundant by this change. 
* 🏷 Add type annotation to converters mapping * 🚨 Fix Mypy error in convert CLI argument verification * 🏷 Improve type annotation on `resolve_dot_names` helper * 🏷 Add type annotations for `Vocab` attributes `strings` and `vectors` * 🏷 Add type annotations for more `Vocab` attributes * 🏷 Add loose type annotation for gold data compilation * 🏷 Improve `_format_labels` type annotation * 🏷 Fix `get_lang_class` type annotation * 🏷 Loosen return type of `Language.evaluate` * 🏷 Don't accept `Scorer` in `handle_scores_per_type` * 🏷 Add `string_to_list` overloads * 🏷 Fix non-Optional command-line options * 🙈 Ignore redefinition of `wandb_logger` in `loggers.py` * ➕ Install `typing_extensions` in Python 3.8+ The `typing_extensions` package states that it should be used when "writing code that must be compatible with multiple Python versions". Since spaCy needs to support multiple Python versions, it should be used when newer `typing` module members are required. One example of this is `Literal`, which is available starting with Python 3.8. Previously spaCy tried to import `Literal` from `typing`, falling back to `typing_extensions` if the import failed. However, Mypy doesn't seem to be able to understand what `Literal` means when the initial import fails. Therefore, these changes modify how `compat` imports `Literal` by always importing it from `typing_extensions`. These changes also modify how `typing_extensions` is installed, so that it is a requirement for all Python versions, including those greater than or equal to 3.8. * 🏷 Improve type annotation for `Language.pipe` These changes add a missing overload variant to the type signature of `Language.pipe`. Additionally, the type signature is enhanced to allow type checkers to differentiate between the two overload variants based on the `as_tuple` parameter. 
Fixes #8772 * ➖ Don't install `typing-extensions` in Python 3.8+ After more detailed analysis of how to implement Python version-specific type annotations using spaCy, it has been determined that branching on a comparison against `sys.version_info` can be statically analyzed by Mypy well enough to enable us to conditionally use `typing_extensions.Literal`. This means that we no longer need to install `typing_extensions` for Python versions greater than or equal to 3.8! 🎉 These changes revert previous changes installing `typing-extensions` regardless of Python version and modify how we import the `Literal` type to ensure that Mypy treats it properly. * resolve mypy errors for Strict pydantic types * refactor code to avoid missing return statement * fix types of convert CLI command * avoid list-set confusion in debug_data * fix typo and formatting * small fixes to avoid type ignores * fix types in profile CLI command and make it more efficient * type fixes in projects CLI * put one ignore back * type fixes for render * fix render types - the sequel * fix BaseDefault in language definitions * fix type of noun_chunks iterator - yields tuple instead of span * fix types in language-specific modules * 🏷 Expand accepted inputs of `get_string_id` `get_string_id` accepts either a string (in which case it returns its ID) or an ID (in which case it immediately returns the ID). These changes extend the type annotation of `get_string_id` to indicate that it can accept either strings or IDs. * 🏷 Handle override types in `combine_score_weights` The `combine_score_weights` function allows users to pass an `overrides` mapping to override data extracted from the `weights` argument. Since it allows `Optional` dictionary values, the return value may also include `Optional` dictionary values. These changes update the type annotations for `combine_score_weights` to reflect this fact. 
* 🏷 Fix tokenizer serialization method signatures in `DummyTokenizer` * 🏷 Fix redefinition of `wandb_logger` These changes fix the redefinition of `wandb_logger` by giving a separate name to each `WandbLogger` version. For backwards-compatibility, `spacy.train` still exports `wandb_logger_v3` as `wandb_logger` for now. * more fixes for typing in language * type fixes in model definitions * 🏷 Annotate `_RandomWords.probs` as `NDArray` * 🏷 Annotate `tok2vec` layers to help Mypy * 🐛 Fix `_RandomWords.probs` type annotations for Python 3.6 Also remove an import that I forgot to move to the top of the module 😅 * more fixes for matchers and other pipeline components * quick fix for entity linker * fixing types for spancat, textcat, etc * bugfix for tok2vec * type annotations for scorer * add runtime_checkable for Protocol * type and import fixes in tests * mypy fixes for training utilities * few fixes in util * fix import * 🐵 Remove unused `# type: ignore` directives * 🏷 Annotate `Language._components` * 🏷 Annotate `spacy.pipeline.Pipe` * add doc as property to span.pyi * small fixes and cleanup * explicit type annotations instead of via comment Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com> Co-authored-by: svlandeg <svlandeg@github.com>	2021-10-14 15:21:40 +02:00
Adriane Boyd	8a018f5207	Set version to v3.2.0.dev0	2021-10-14 10:31:11 +02:00
Adriane Boyd	d98d525bc8	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.1-3	2021-10-14 09:41:46 +02:00
Paul O'Leary McCann	a3b7519aba	Fix JA Morph Values (#9449 ) * Don't set empty / weird values in morph * Update tests to handle empty morph values * Fix everything * Replace potentially problematic characters * Fix test	2021-10-14 09:21:36 +02:00
Ines Montani	c48564688f	Merge pull request #9423 from explosion/tests/issue-marker	2021-10-13 16:53:40 +02:00
Jette16	78365452d3	Moved test for universe into .github folder (#9447 ) * Moved universe-test into .github folder * Cleaned code * Changed a file name	2021-10-13 14:13:06 +02:00
Sofie Van Landeghem	2e3d6b8b5a	Fix test for spancat (#9446 ) * fix test for spancat * increase tolerance for almost equal checks * Update spacy/tests/test_models.py * Update spacy/tests/test_models.py	2021-10-13 10:47:56 +02:00
Sofie Van Landeghem	5e8e8525f0	fix W108 filter (#9438 ) * remove text argument from W108 to enable 'once' filtering * include the option of partial POS annotation * fix typo * Update spacy/errors.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-10-12 19:56:44 +02:00
Lj Miranda	6425b9a1c4	Include JsonlCorpus from the imports (#9431 )	2021-10-12 15:39:14 +02:00
Paul O'Leary McCann	efe5beefe0	Add test for case where parser overwrites annotations (#9406 ) * Add test for case where parser overwrites annotations * Move test to its own file Also add note about how other tokens modify results. * Fix xfail decorator	2021-10-11 14:57:45 +02:00
Ines Montani	1fa7c4e73b	Support issue marker via pytest	2021-10-11 13:56:24 +02:00
Paul O'Leary McCann	fd759a881b	Fix inconsistent lemmas (#9405 ) * Add util function to unique lists and preserve order * Use unique function instead of list(set()) list(set()) has the issue that it's not consistent between runs of the Python interpreter, so order can vary. list(set()) calls were left in a few places where they were behind calls to sorted(). I think in this case the calls to list() can be removed, but this commit doesn't do that. * Use the existing pattern for this	2021-10-11 11:38:45 +02:00
Adriane Boyd	a5231cb044	Remove traces of lexemes from vocab serialization (#9400 )	2021-10-11 11:13:35 +02:00
Jette16	3b144a3a51	Add universe test (#9278 ) * Added test for universe.json * Added contributor agreement * Ran black on test_universe_json.py	2021-10-11 11:08:46 +02:00
Ines Montani	5003a9c3c7	Move core training logic in CLI into standalone function (#9398 )	2021-10-11 10:56:14 +02:00
Paul O'Leary McCann	2a7e327310	Fix Dependency Matcher Ordering Issue (#9337 ) * Fix inconsistency This makes the failing test pass, so that behavior is consistent whether patterns are added in one call or two. The issue is that the hash for patterns depended on the index of the pattern in the list of current patterns, not the list of total patterns, so a second call would get identical match ids. * Add illustrative test case * Add failing test for remove case Patterns are not removed from the internal matcher on calls to remove, which causes spurious weird matches (or misses). * Fix removal issue Remove patterns from the internal matcher. * Check that the single add call also gets no matches	2021-10-11 10:26:13 +02:00
Paul O'Leary McCann	113d53ab6c	Fix tests for changes to inflection structure (#9390 )	2021-10-07 13:42:18 +02:00
Paul O'Leary McCann	c4e3b7a5db	Change JA inflection separator to semicolon Hyphen is unsuitable because of interactions with the JA data fields, but pipe is also unsuitable because it has a different meaning in UD data, so it's better to use something that has no significance in either case. So this uses semicolon.	2021-10-07 17:28:15 +09:00
Paul O'Leary McCann	227f98081b	Use a pipe for separating Japanese inflections Inflection values look like this pipe separated: 五段-ラ行|連用形-促音便 So using a hyphen erases the original fields.	2021-10-07 17:14:05 +09:00
Paul O'Leary McCann	f975690cc9	Use hyphen to join parts of inflection in JA tokenizer	2021-10-07 17:09:38 +09:00
Elia Robyn Lake (Robyn Speer)	53b5f245ed	Allow IETF language codes, aliases, and close matches (#9342 ) * use language-matching to allow language code aliases Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * link to "IETF language tags" in docs Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * Make requirements consistent Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * change "two-letter language ID" to "IETF language tag" in language docs Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * use langcodes 3.2 and handle language-tag errors better Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * all unknown language codes are ImportErrors Signed-off-by: Elia Robyn Speer <elia@explosion.ai> Co-authored-by: Elia Robyn Speer <elia@explosion.ai>	2021-10-05 09:52:22 +02:00
Adriane Boyd	4192e71599	Sync vocab in vectors and components sourced in configs (#9335 ) Since a component may reference anything in the vocab, share the full vocab when loading source components and vectors (which will include `strings` as of #8909). When loading a source component from a config, save and restore the vocab state after loading source pipelines, in particular to preserve the original state without vectors, since `[initialize.vectors] = null` skips rather than resets the vectors. The vocab references are not synced for components loaded with `Language.add_pipe(source=)` because the pipelines are already loaded and not necessarily with the same vocab. A warning could be added in `Language.create_pipe_from_source` that it may be necessary to save and reload before training, but it's a rare enough case that this kind of warning may be too noisy overall.	2021-10-04 12:19:02 +02:00
Paul O'Leary McCann	1ee6541ab0	Moving Japanese tokenizer extra info to Token.morph (#8977 ) * Use morph for extra Japanese tokenizer info Previously Japanese tokenizer info that didn't correspond to Token fields was put in user data. Since spaCy core should avoid touching user data, this moves most information to the Token.morph attribute. It also adds the normalized form, which wasn't exposed before. The subtokens, which are a list of full tokens, are still added to user data, except with the default tokenizer granularity. With the default tokenizer settings the subtokens are all None, so in this case the user data is simply not set. * Update tests Also adds a new test for norm data. * Update docs * Add Japanese morphologizer factory Set the default to `extend=True` so that the morphologizer does not clobber the values set by the tokenizer. * Use the norm_ field for normalized forms Before this commit, normalized forms were put in the "norm" field in the morph attributes. I am not sure why I did that instead of using the token morph, I think I just forgot about it. * Skip test if sudachipy is not installed * Fix import Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-10-01 19:19:26 +02:00
Paul O'Leary McCann	8f2409e514	Don't serialize user data in DocBin if not saving it (fix #9190 ) (#9226 ) * Don't store user data if told not to (fix #9190) * Add unit tests for the store_user_data setting	2021-10-01 12:37:39 +02:00
github-actions[bot]	42a76c758f	Auto-format code with black (#9346 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2021-10-01 11:17:11 +02:00
Adriane Boyd	b3192ddea3	Sync thinc install dep in setup, fix test packaging (#9336 ) * Sync thinc install dep in setup * Add __init__.py to include package tests in package * Include *.toml in package	2021-09-30 19:02:10 +02:00
Adriane Boyd	03fefa37e2	Add overwrite settings for more components (#9050 ) * Add overwrite settings for more components For pipeline components where it's relevant and not already implemented, add an explicit `overwrite` setting that controls whether `set_annotations` overwrites existing annotation. For the `morphologizer`, add an additional setting `extend`, which controls whether the existing features are preserved. * +overwrite, +extend: overwrite values of existing features, add any new features * +overwrite, -extend: overwrite completely, removing any existing features * -overwrite, +extend: keep values of existing features, add any new features * -overwrite, -extend: do not modify the existing value if set In all cases an unset value will be set by `set_annotations`. Preserve current overwrite defaults: * True: morphologizer, entity linker * False: tagger, sentencizer, senter * Add backwards compat overwrite settings * Put empty line back Removed by accident in last commit * Set backwards-compatible defaults in __init__ Because the `TrainablePipe` serialization methods update `cfg`, there's no straightforward way to detect whether models serialized with a previous version are missing the overwrite settings. It would be possible in the sentencizer due to its separate serialization methods, however to keep the changes parallel, this also sets the default in `__init__`. * Remove traces Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>	2021-09-30 15:35:55 +02:00
Jim O’Regan	8fe525beb5	Add an Irish lemmatiser, based on BuNaMo (#9102 ) * add tréis/théis * remove previous contents, add demutate/unponc * fmt off/on wrapping * type hints * IrishLemmatizer (sic) * Use spacy-lookups-data>=1.0.3 * Minor bug fixes, refactoring for IrishLemmatizer * Fix return type for ADP list lookups * Fix and refactor lookup table lookups for missing/string/list * Remove unused variables * skip lookup of verbal substantives and adjectives; just demutate * Fix morph checks API details * Add types and format * Move helper methods into lemmatizer Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-09-30 14:18:47 +02:00
Elia Robyn Lake (Robyn Speer)	5b0b0ca809	Move WandB loggers into spacy-loggers (#9223 ) * factor out the WandB logger into spacy-loggers Signed-off-by: Elia Robyn Speer <gh@arborelia.net> * depend on spacy-loggers so they are available Signed-off-by: Elia Robyn Speer <gh@arborelia.net> * remove docs of spacy.WandbLogger.v2 (moved to spacy-loggers) Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * Version number suggestions from code review Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * update references to WandbLogger Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * make order of deps more consistent Signed-off-by: Elia Robyn Speer <elia@explosion.ai> Co-authored-by: Elia Robyn Speer <elia@explosion.ai> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2021-09-29 11:12:50 +02:00
Adriane Boyd	e750c1760c	Restore tokenization timing in Language.evaluate (#9305 ) Restore tokenization timing steps that were accidentally removed in #6765.	2021-09-27 20:44:14 +02:00
Sofie Van Landeghem	a361df00cd	Raise E983 early on in docbin init (#9247 ) * raise E983 early on in docbin init * catch situation before error is raised * add more info on the spacy debug command	2021-09-27 20:43:03 +02:00
Adriane Boyd	effae12cbd	Update slow readers test to use textcat_multilabel (#9300 )	2021-09-27 20:04:02 +02:00
Adriane Boyd	fe5f5d6ac6	Update Catalan tokenizer (#9297 ) * Update Makefile For more recent python version * updated for bsc changes New tokenization changes * Update test_text.py * updating tests and requirements * changed failed test in test/lang/ca changed failed test in test/lang/ca * Update .gitignore deleted stashed changes line * back to python 3.6 and remove transformer requirements As per request * Update test_exception.py Change the test * Update test_exception.py Remove test print * Update Makefile For more recent python version * updated for bsc changes New tokenization changes * updating tests and requirements * Update requirements.txt Removed spacy-transformers from requirements * Update test_exception.py Added final punctuation to ensure consistency * Update Makefile Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Format * Update test to check all tokens Co-authored-by: cayorodriguez <crodriguezp@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-09-27 14:42:30 +02:00
Adriane Boyd	03f234b739	Merge remote-tracking branch 'upstream/master' into develop	2021-09-27 09:10:45 +02:00
github-actions[bot]	4da2af4e0e	Auto-format code with black (#9284 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2021-09-24 10:46:43 +02:00
Jette16	5eced281d8	Add universe test (#9278 ) * Added test for universe.json * Added contributor agreement * Ran black on test_universe_json.py	2021-09-23 14:31:42 +02:00
Ines Montani	6bb0324b81	Adjust kb_id visualizer templating and docs	2021-09-23 11:59:02 +02:00
Ines Montani	beb4a8c524	Merge pull request #9199 from shigapov/master (resolves #9129 )	2021-09-23 19:41:53 +10:00
Ines Montani	57b5fc1995	Apply suggestions from code review Co-authored-by: Renat Shigapov <57352291+shigapov@users.noreply.github.com>	2021-09-23 17:58:32 +10:00
Sofie Van Landeghem	3fc3b7a13a	avoid crash when unicode in title (#9254 )	2021-09-22 21:01:34 +02:00
Rumesh Madhusanka	68264b4cee	Updating the stop word list for Sinhala language (#9270 )	2021-09-22 20:43:42 +02:00
Adriane Boyd	2f0bb77920	Accept Doc input in pipelines (#9069 ) * Accept Doc input in pipelines Allow `Doc` input to `Language.__call__` and `Language.pipe`, which skips `Language.make_doc` and passes the doc directly to the pipeline. * ensure_doc helper function * avoid running multiple processes on GPU * Update spacy/tests/test_language.py Co-authored-by: svlandeg <svlandeg@github.com>	2021-09-22 09:41:05 +02:00
Daniël de Kok	17802836be	Allow overriding vars in the project assets subcommand (#9248 ) This change makes the `project assets` subcommand accept variables to override as well, making the interface more similar to `project run`.	2021-09-21 10:49:45 +02:00
Adriane Boyd	00bdb31150	Fix vector for 0-length span (#9244 )	2021-09-20 20:22:49 +02:00
github-actions[bot]	015d439eb6	Auto-format code with black (#9234 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2021-09-20 08:49:19 +02:00
Paul O'Leary McCann	c4f0800fb8	Validate pos values when creating Doc (#9148 ) * Validate pos values when creating Doc * Add clear error when setting invalid pos This also changes the error language slightly. * Fix variable name * Update spacy/tokens/doc.pyx * Test that setting invalid pos raises an error Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-09-16 13:28:05 +02:00
Jozef Harag	865cfbc903	feat: add `spacy.WandbLogger.v3` with optional `run_name` and `entity` parameters (#9202 ) * feat: add `spacy.WandbLogger.v3` with optional `run_name` and `entity` parameters * update versioning in docs Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>	2021-09-16 12:26:41 +02:00
Sofie Van Landeghem	00836c2d7d	Update spacy/displacy/templates.py	2021-09-16 09:23:21 +02:00
Sofie Van Landeghem	4bf2606adf	Update spacy/displacy/render.py Co-authored-by: Renat Shigapov <57352291+shigapov@users.noreply.github.com>	2021-09-16 09:22:38 +02:00
Ines Montani	20f63e7154	Only include runtime-relevant config in package CLI dependency detection (#9211 )	2021-09-15 23:16:01 +02:00
Paul O'Leary McCann	cd75f96501	Remove two attributes marked for removal in 3.1 (#9150 ) * Remove two attributes marked for removal in 3.1 * Add back unused ints with changed names * Change data_dir to _unused_object This is still kept in the type definition, but I removed it from the serialization code. * Put serialization code back for now Not sure how this interacts with old serialized models yet.	2021-09-15 23:07:21 +02:00
Adriane Boyd	d74870d38c	Prepare for v3.1.3 (#9200 ) * Update thinc and spacy-legacy requirements * Set version to v3.1.3	2021-09-14 11:03:51 +02:00
Paul O'Leary McCann	0f01f46e02	Update Cython string types (#9143 ) * Replace all basestring references with unicode `basestring` was a compatibility type introduced by Cython to make dealing with utf-8 strings in Python2 easier. In Python3 it is equivalent to the unicode (or str) type. I replaced all references to basestring with unicode, since that was used elsewhere, but we could also just replace them with str, which should also be equivalent. All tests pass locally. * Replace all references to unicode type with str Since we only support python3 this is simpler. * Remove all references to unicode type This removes all references to the unicode type across the codebase and replaces them with `str`, which makes it more drastic than the prior commits. In order to make this work importing `unicode_literals` had to be removed, and one explicit unicode literal also had to be removed (it is unclear why this is necessary in Cython with language level 3, but without doing it there were errors about implicit conversion). When `unicode` is used as a type in comments it was also edited to be `str`. Additionally `coding: utf8` headers were removed from a few files.	2021-09-13 17:02:17 +02:00
Renat Shigapov	d5cc009faf	Merge branch 'explosion:master' into master	2021-09-13 08:43:48 +02:00
Renat Shigapov	f4b5c4209d	specify kb_id and kb_url for URL visualisation	2021-09-13 08:15:07 +02:00
Renat Shigapov	7562fb5354	add links to entities into the TPL_ENT-template	2021-09-13 08:06:54 +02:00
j-frei	462b009648	Correct parser.py use_upper param info (#9180 )	2021-09-10 16:19:58 +02:00
Adriane Boyd	aba6ce3a43	Handle spacy-legacy in package CLI for dependencies (#9163 ) * Handle spacy-legacy in package CLI for dependencies * Implement legacy backoff in spacy registry.find * Remove unused import * Update and format test	2021-09-08 11:46:40 +02:00
github-actions[bot]	584fae5807	Auto-format code with black (#9130 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2021-09-03 10:47:03 +02:00
Kevin Humphreys	ca93504660	Pass alignments to Matcher callbacks (#9001 ) * pass alignments to callbacks * refactor for single callback loop * Update spacy/matcher/matcher.pyx Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2021-09-02 12:58:05 +02:00
Sofie Van Landeghem	8895e3c9ad	matcher doc corrections (#9115 ) * update error message to current UX * clarify uppercase effect * fix docstring	2021-09-02 09:26:33 +02:00
Robyn Speer	d60b748e3c	Fix surprises when asking for the root of a git repo (#9074 ) * Fix surprises when asking for the root of a git repo In the case of the first asset I wanted to get from git, the data I wanted was the entire repository. I tried leaving "path" blank, which gave a less-than-helpful error, and then I tried `path: "/"`, which started copying my entire filesystem into the project. The path I should have used was "". I've made two changes to make this smoother for others: - The 'path' within a git clone defaults to "" - If the path points outside of the tmpdir that the git clone goes into, we fail with an error Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * use a descriptive error instead of a default plus some minor fixes from PR review Signed-off-by: Elia Robyn Speer <elia@explosion.ai> * check for None values in assets Signed-off-by: Elia Robyn Speer <elia@explosion.ai> Co-authored-by: Elia Robyn Speer <elia@explosion.ai>	2021-09-01 22:52:08 +02:00
Paul O'Leary McCann	f803a84571	Fix inference of epoch_resume (#9084 ) * Fix inference of epoch_resume When an epoch_resume value is not specified individually, it can often be inferred from the filename. The value inference code was there but the value wasn't passed back to the training loop. This also adds a specific error in the case where no epoch_resume value is provided and it can't be inferred from the filename. * Add new error * Always use the epoch resume value if specified Before this the value in the filename was used if found	2021-09-01 14:17:42 +09:00
Adriane Boyd	1e9b4b55ee	Pass overrides to subcommands in workflows (#9059 ) * Pass overrides to subcommands in workflows * Add missing docstring	2021-08-30 09:23:54 +02:00
Sofie Van Landeghem	1e974de837	config is not Optional (#9024 )	2021-08-27 11:44:31 +02:00
github-actions[bot]	fb9c31fbda	Auto-format code with black (#9065 ) Co-authored-by: explosion-bot <explosion-bot@users.noreply.github.com>	2021-08-27 11:42:27 +02:00
Sofie Van Landeghem	4d39430b82	Document use-case of freezing tok2vec (#8992 ) * update error msg * add sentence to docs * expand note on frozen components	2021-08-26 09:50:35 +02:00
Sofie Van Landeghem	94fb840443	fix docs for Span constructor arguments (#9023 )	2021-08-25 16:06:22 +02:00
David Strouk	31e9b126a0	Fix verbs list in lang/fr/tokenizer_exceptions.py (#9033 )	2021-08-25 15:55:09 +02:00
Ines Montani	4cd052e81d	Include component factories in third-party dependencies resolver (#9009 ) * Include component factories in third-party dependencies resolver * Increment catalogue and update test	2021-08-25 14:58:01 +02:00
Sofie Van Landeghem	e1f88de729	bump to 3.1.2 (#9008 )	2021-08-20 12:41:09 +02:00
Sofie Van Landeghem	4d52d7051c	Fix spancat training on nested entities (#9007 ) * overfitting test on non-overlapping entities * add failing overfitting test for overlapping entities * failing test for list comprehension * remove test that was put in separate PR * bugfix * cleanup	2021-08-20 12:37:50 +02:00

8921 Commits