* Support registered vectors
* Format
* Auto-fill [nlp] on load from config and from bytes/disk
* Only auto-fill [nlp]
* Undo all changes to Language.from_disk
* Expand BaseVectors
These methods are needed in various places for training and vector
similarity.
* isort
* More linting
* Only fill [nlp.vectors]
* Update spacy/vocab.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Revert changes to test related to auto-filling [nlp]
* Add vectors registry
* Rephrase error about vocab methods for vectors
* Switch to dummy implementation for BaseVectors.to_ops
* Add initial draft of docs
* Remove example from BaseVectors docs
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update website/docs/api/basevectors.mdx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix type and lint bpemb example
* Update website/docs/api/basevectors.mdx
---------
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Support custom token/lexeme attribute for vectors
* Fix imports
* Back off to ORTH without Vectors.attr
* Fallback if vectors.attr doesn't exist
* Update docs
* Use isort with Black profile
* isort all the things
* Fix import cycles as a result of import sorting
* Add DOCBIN_ALL_ATTRS type definition
* Add isort to requirements
* Remove isort from build dependencies check
* Typo
* Add vector deduplication
* Add `Vocab.deduplicate_vectors()`
* Always run deduplication in `spacy init vectors`
* Clean up a few vector-related error messages and docs examples
* Always unique with numpy
* Fix types
* Use Vectors.shape rather than Vectors.data.shape
* Use Vectors.size rather than Vectors.data.size
* Add Vectors.to_ops to move data between different ops
* Add documentation for Vector.to_ops
* Add support for fasttext-bloom hash-only vectors
Overview:
* Extend `Vectors` to have two modes: `default` and `ngram`
* `default` is the default mode and equivalent to the current
`Vectors`
* `ngram` supports the hash-only ngram tables from `fasttext-bloom`
* Extend `spacy.StaticVectors.v2` to handle both modes with no changes
for `default` vectors
* Extend `spacy init vectors` to support ngram tables
The `ngram` mode **only** supports vector tables produced by this
fork of fastText, which adds an option to represent all vectors using
only the ngram buckets table and which uses the exact same ngram
generation algorithm and hash function (`MurmurHash3_x64_128`).
`fasttext-bloom` produces an additional `.hashvec` table, which can be
loaded by `spacy init vectors --fasttext-bloom-vectors`.
https://github.com/adrianeboyd/fastText/tree/feature/bloom
Implementation details:
* `Vectors` now includes the `StringStore` as `Vectors.strings` so that
the API can stay consistent for both `default` (which can look up from
`str` or `int`) and `ngram` (which requires `str` to calculate the
ngrams).
* In ngram mode `Vectors` uses a default `Vectors` object as a cache
since the ngram vectors lookups are relatively expensive.
* The default cache size is the same size as the provided ngram vector
table.
* Once the cache is full, no more entries are added. The user is
responsible for managing the cache in cases where the initial
documents are not representative of the texts.
* The cache can be resized by setting `Vectors.ngram_cache_size` or
cleared with `vectors._ngram_cache.clear()`.
* The API ends up a bit split between methods for `default` and for
`ngram`, so functions that only make sense for `default` or `ngram`
include warnings with custom messages suggesting alternatives where
possible.
* `Vocab.vectors` becomes a property so that the string stores can be
synced when assigning vectors to a vocab.
* `Vectors` serializes its own config settings as `vectors.cfg`.
* The `Vectors` serialization methods have added support for `exclude`
so that the `Vocab` can exclude the `Vectors` strings while serializing.
Removed:
* The `minn` and `maxn` options and related code from
`Vocab.get_vector`, which does not work in a meaningful way for default
vector tables.
* The unused `GlobalRegistry` in `Vectors`.
* Refactor to use reduce_mean
Refactor to use reduce_mean and remove the ngram vectors cache.
* Rename to floret
* Rename to floret in error messages
* Use --vectors-mode in CLI, vector init
* Fix vectors mode in init
* Remove unused var
* Minor API and docstrings adjustments
* Rename `--vectors-mode` to `--mode` in `init vectors` CLI
* Rename `Vectors.get_floret_vectors` to `Vectors.get_batch` and support
both modes.
* Minor updates to Vectors docstrings.
* Update API docs for Vectors and init vectors CLI
* Update types for StaticVectors
* 🚨 Ignore all existing Mypy errors
* 🏗 Add Mypy check to CI
* Add types-mock and types-requests as dev requirements
* Add additional type ignore directives
* Add types packages to dev-only list in reqs test
* Add types-dataclasses for python 3.6
* Add ignore to pretrain
* 🏷 Improve type annotation on `run_command` helper
The `run_command` helper previously declared that it returned an
`Optional[subprocess.CompletedProcess]`, but it isn't actually possible
for the function to return `None`. These changes modify the type
annotation of the `run_command` helper and remove all now-unnecessary
`# type: ignore` directives.
* 🔧 Allow variable type redefinition in limited contexts
These changes modify how Mypy is configured to allow variables to have
their type automatically redefined under certain conditions. The Mypy
documentation contains the following example:
```python
def process(items: List[str]) -> None:
# 'items' has type List[str]
items = [item.split() for item in items]
# 'items' now has type List[List[str]]
...
```
This configuration change is especially helpful in reducing the number
of `# type: ignore` directives needed to handle the common pattern of:
* Accepting a filepath as a string
* Overwriting the variable using `filepath = ensure_path(filepath)`
These changes enable redefinition and remove all `# type: ignore`
directives rendered redundant by this change.
* 🏷 Add type annotation to converters mapping
* 🚨 Fix Mypy error in convert CLI argument verification
* 🏷 Improve type annotation on `resolve_dot_names` helper
* 🏷 Add type annotations for `Vocab` attributes `strings` and `vectors`
* 🏷 Add type annotations for more `Vocab` attributes
* 🏷 Add loose type annotation for gold data compilation
* 🏷 Improve `_format_labels` type annotation
* 🏷 Fix `get_lang_class` type annotation
* 🏷 Loosen return type of `Language.evaluate`
* 🏷 Don't accept `Scorer` in `handle_scores_per_type`
* 🏷 Add `string_to_list` overloads
* 🏷 Fix non-Optional command-line options
* 🙈 Ignore redefinition of `wandb_logger` in `loggers.py`
* ➕ Install `typing_extensions` in Python 3.8+
The `typing_extensions` package states that it should be used when
"writing code that must be compatible with multiple Python versions".
Since SpaCy needs to support multiple Python versions, it should be used
when newer `typing` module members are required. One example of this is
`Literal`, which is available starting with Python 3.8.
Previously SpaCy tried to import `Literal` from `typing`, falling back
to `typing_extensions` if the import failed. However, Mypy doesn't seem
to be able to understand what `Literal` means when the initial import
means. Therefore, these changes modify how `compat` imports `Literal` by
always importing it from `typing_extensions`.
These changes also modify how `typing_extensions` is installed, so that
it is a requirement for all Python versions, including those greater
than or equal to 3.8.
* 🏷 Improve type annotation for `Language.pipe`
These changes add a missing overload variant to the type signature of
`Language.pipe`. Additionally, the type signature is enhanced to allow
type checkers to differentiate between the two overload variants based
on the `as_tuple` parameter.
Fixes#8772
* ➖ Don't install `typing-extensions` in Python 3.8+
After more detailed analysis of how to implement Python version-specific
type annotations using SpaCy, it has been determined that by branching
on a comparison against `sys.version_info` can be statically analyzed by
Mypy well enough to enable us to conditionally use
`typing_extensions.Literal`. This means that we no longer need to
install `typing_extensions` for Python versions greater than or equal to
3.8! 🎉
These changes revert previous changes installing `typing-extensions`
regardless of Python version and modify how we import the `Literal` type
to ensure that Mypy treats it properly.
* resolve mypy errors for Strict pydantic types
* refactor code to avoid missing return statement
* fix types of convert CLI command
* avoid list-set confustion in debug_data
* fix typo and formatting
* small fixes to avoid type ignores
* fix types in profile CLI command and make it more efficient
* type fixes in projects CLI
* put one ignore back
* type fixes for render
* fix render types - the sequel
* fix BaseDefault in language definitions
* fix type of noun_chunks iterator - yields tuple instead of span
* fix types in language-specific modules
* 🏷 Expand accepted inputs of `get_string_id`
`get_string_id` accepts either a string (in which case it returns its
ID) or an ID (in which case it immediately returns the ID). These
changes extend the type annotation of `get_string_id` to indicate that
it can accept either strings or IDs.
* 🏷 Handle override types in `combine_score_weights`
The `combine_score_weights` function allows users to pass an `overrides`
mapping to override data extracted from the `weights` argument. Since it
allows `Optional` dictionary values, the return value may also include
`Optional` dictionary values.
These changes update the type annotations for `combine_score_weights` to
reflect this fact.
* 🏷 Fix tokenizer serialization method signatures in `DummyTokenizer`
* 🏷 Fix redefinition of `wandb_logger`
These changes fix the redefinition of `wandb_logger` by giving a
separate name to each `WandbLogger` version. For
backwards-compatibility, `spacy.train` still exports `wandb_logger_v3`
as `wandb_logger` for now.
* more fixes for typing in language
* type fixes in model definitions
* 🏷 Annotate `_RandomWords.probs` as `NDArray`
* 🏷 Annotate `tok2vec` layers to help Mypy
* 🐛 Fix `_RandomWords.probs` type annotations for Python 3.6
Also remove an import that I forgot to move to the top of the module 😅
* more fixes for matchers and other pipeline components
* quick fix for entity linker
* fixing types for spancat, textcat, etc
* bugfix for tok2vec
* type annotations for scorer
* add runtime_checkable for Protocol
* type and import fixes in tests
* mypy fixes for training utilities
* few fixes in util
* fix import
* 🐵 Remove unused `# type: ignore` directives
* 🏷 Annotate `Language._components`
* 🏷 Annotate `spacy.pipeline.Pipe`
* add doc as property to span.pyi
* small fixes and cleanup
* explicit type annotations instead of via comment
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: svlandeg <svlandeg@github.com>
* Remove two attributes marked for removal in 3.1
* Add back unused ints with changed names
* Change data_dir to _unused_object
This is still kept in the type definition, but I removed it from the
serialization code.
* Put serialization code back for now
Not sure how this interacts with old serialized models yet.
* Replace all basestring references with unicode
`basestring` was a compatability type introduced by Cython to make
dealing with utf-8 strings in Python2 easier. In Python3 it is
equivalent to the unicode (or str) type.
I replaced all references to basestring with unicode, since that was
used elsewhere, but we could also just replace them with str, which
shoudl also be equivalent.
All tests pass locally.
* Replace all references to unicode type with str
Since we only support python3 this is simpler.
* Remove all references to unicode type
This removes all references to the unicode type across the codebase and
replaces them with `str`, which makes it more drastic than the prior
commits. In order to make this work importing `unicode_literals` had to
be removed, and one explicit unicode literal also had to be removed (it
is unclear why this is necessary in Cython with language level 3, but
without doing it there were errors about implicit conversion).
When `unicode` is used as a type in comments it was also edited to be
`str`.
Additionally `coding: utf8` headers were removed from a few files.
* ensure vectors data is stored on right device
* ensure the added vector is on the right device
* move vector to numpy before iterating
* move best_rows to numpy before iterating
Similar to how vectors are handled, move the vocab lookups to be loaded
at the start of training rather than when the vocab is initialized,
since the vocab doesn't have access to the full config when it's
created.
The option moves from `nlp.load_vocab_data` to `training.lookups`.
Typically these tables will come from `spacy-lookups-data`, but any
`Lookups` object can be provided.
The loading from `spacy-lookups-data` is now strict, so configs for each
language should specify the exact tables required. This also makes it
easier to control whether the larger clusters and probs tables are
included.
To load `lexeme_norm` from `spacy-lookups-data`:
```
[training.lookups]
@misc = "spacy.LoadLookupsData.v1"
lang = ${nlp.lang}
tables = ["lexeme_norm"]
```
* Add Lemmatizer and simplify related components
* Add `Lemmatizer` pipe with `lookup` and `rule` modes using the
`Lookups` tables.
* Reduce `Tagger` to a simple tagger that sets `Token.tag` (no pos or lemma)
* Reduce `Morphology` to only keep track of morph tags (no tag map, lemmatizer,
or morph rules)
* Remove lemmatizer from `Vocab`
* Adjust many many tests
Differences:
* No default lookup lemmas
* No special treatment of TAG in `from_array` and similar required
* Easier to modify labels in a `Tagger`
* No extra strings added from morphology / tag map
* Fix test
* Initial fix for Lemmatizer config/serialization
* Adjust init test to be more generic
* Adjust init test to force empty Lookups
* Add simple cache to rule-based lemmatizer
* Convert language-specific lemmatizers
Convert language-specific lemmatizers to component lemmatizers. Remove
previous lemmatizer class.
* Fix French and Polish lemmatizers
* Remove outdated UPOS conversions
* Update Russian lemmatizer init in tests
* Add minimal init/run tests for custom lemmatizers
* Add option to overwrite existing lemmas
* Update mode setting, lookup loading, and caching
* Make `mode` an immutable property
* Only enforce strict `load_lookups` for known supported modes
* Move caching into individual `_lemmatize` methods
* Implement strict when lang is not found in lookups
* Fix tables/lookups in make_lemmatizer
* Reallow provided lookups and allow for stricter checks
* Add lookups asset to all Lemmatizer pipe tests
* Rename lookups in lemmatizer init test
* Clean up merge
* Refactor lookup table loading
* Add helper from `load_lemmatizer_lookups` that loads required and
optional lookups tables based on settings provided by a config.
Additional slight refactor of lookups:
* Add `Lookups.set_table` to set a table from a provided `Table`
* Reorder class definitions to be able to specify type as `Table`
* Move registry assets into test methods
* Refactor lookups tables config
Use class methods within `Lemmatizer` to provide the config for
particular modes and to load the lookups from a config.
* Add pipe and score to lemmatizer
* Simplify Tagger.score
* Add missing import
* Clean up imports and auto-format
* Remove unused kwarg
* Tidy up and auto-format
* Update docstrings for Lemmatizer
Update docstrings for Lemmatizer.
Additionally modify `is_base_form` API to take `Token` instead of
individual features.
* Update docstrings
* Remove tag map values from Tagger.add_label
* Update API docs
* Fix relative link in Lemmatizer API docs
* Update with WIP
* Update with WIP
* Update with pipeline serialization
* Update types and pipe factories
* Add deep merge, tidy up and add tests
* Fix pipe creation from config
* Don't validate default configs on load
* Update spacy/language.py
Co-authored-by: Ines Montani <ines@ines.io>
* Adjust factory/component meta error
* Clean up factory args and remove defaults
* Add test for failing empty dict defaults
* Update pipeline handling and methods
* provide KB as registry function instead of as object
* small change in test to make functionality more clear
* update example script for EL configuration
* Fix typo
* Simplify test
* Simplify test
* splitting pipes.pyx into separate files
* moving default configs to each component file
* fix batch_size type
* removing default values from component constructors where possible (TODO: test 4725)
* skip instead of xfail
* Add test for config -> nlp with multiple instances
* pipeline.pipes -> pipeline.pipe
* Tidy up, document, remove kwargs
* small cleanup/generalization for Tok2VecListener
* use DEFAULT_UPSTREAM field
* revert to avoid circular imports
* Fix tests
* Replace deprecated arg
* Make model dirs require config
* fix pickling of keyword-only arguments in constructor
* WIP: clean up and integrate full config
* Add helper to handle function args more reliably
Now also includes keyword-only args
* Fix config composition and serialization
* Improve config debugging and add visual diff
* Remove unused defaults and fix type
* Remove pipeline and factories from meta
* Update spacy/default_config.cfg
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/default_config.cfg
* small UX edits
* avoid printing stack trace for debug CLI commands
* Add support for language-specific factories
* specify the section of the config which holds the model to debug
* WIP: add Language.from_config
* Update with language data refactor WIP
* Auto-format
* Add backwards-compat handling for Language.factories
* Update morphologizer.pyx
* Fix morphologizer
* Update and simplify lemmatizers
* Fix Japanese tests
* Port over tagger changes
* Fix Chinese and tests
* Update to latest Thinc
* WIP: xfail first Russian lemmatizer test
* Fix component-specific overrides
* fix nO for output layers in debug_model
* Fix default value
* Fix tests and don't pass objects in config
* Fix deep merging
* Fix lemma lookup data registry
Only load the lookups if an entry is available in the registry (and if spacy-lookups-data is installed)
* Add types
* Add Vocab.from_config
* Fix typo
* Fix tests
* Make config copying more elegant
* Fix pipe analysis
* Fix lemmatizers and is_base_form
* WIP: move language defaults to config
* Fix morphology type
* Fix vocab
* Remove comment
* Update to latest Thinc
* Add morph rules to config
* Tidy up
* Remove set_morphology option from tagger factory
* Hack use_gpu
* Move [pipeline] to top-level block and make [nlp.pipeline] list
Allows separating component blocks from component order – otherwise, ordering the config would mean a changed component order, which is bad. Also allows initial config to define more components and not use all of them
* Fix use_gpu and resume in CLI
* Auto-format
* Remove resume from config
* Fix formatting and error
* [pipeline] -> [components]
* Fix types
* Fix tagger test: requires set_morphology?
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* Reduce stored lexemes data, move feats to lookups
* Move non-derivable lexemes features (`norm / cluster / prob`) to
`spacy-lookups-data` as lookups
* Get/set `norm` in both lookups and `LexemeC`, serialize in lookups
* Remove `cluster` and `prob` from `LexemesC`, get/set/serialize in
lookups only
* Remove serialization of lexemes data as `vocab/lexemes.bin`
* Remove `SerializedLexemeC`
* Remove `Lexeme.to_bytes/from_bytes`
* Modify normalization exception loading:
* Always create `Vocab.lookups` table `lexeme_norm` for
normalization exceptions
* Load base exceptions from `lang.norm_exceptions`, but load
language-specific exceptions from lookups
* Set `lex_attr_getter[NORM]` including new lookups table in
`BaseDefaults.create_vocab()` and when deserializing `Vocab`
* Remove all cached lexemes when deserializing vocab to override
existing normalizations with the new normalizations (as a replacement
for the previous step that replaced all lexemes data with the
deserialized data)
* Skip English normalization test
Skip English normalization test because the data is now in
`spacy-lookups-data`.
* Remove norm exceptions
Moved to spacy-lookups-data.
* Move norm exceptions test to spacy-lookups-data
* Load extra lookups from spacy-lookups-data lazily
Load extra lookups (currently for cluster and prob) lazily from the
entry point `lg_extra` as `Vocab.lookups_extra`.
* Skip creating lexeme cache on load
To improve model loading times, do not create the full lexeme cache when
loading. The lexemes will be created on demand when processing.
* Identify numeric values in Lexeme.set_attrs()
With the removal of a special case for `PROB`, also identify `float` to
avoid trying to convert it with the `StringStore`.
* Skip lexeme cache init in from_bytes
* Unskip and update lookups tests for python3.6+
* Update vocab pickle to include lookups_extra
* Update vocab serialization tests
Check strings rather than lexemes since lexemes aren't initialized
automatically, account for addition of "_SP".
* Re-skip lookups test because of python3.5
* Skip PROB/float values in Lexeme.set_attrs
* Convert is_oov from lexeme flag to lex in vectors
Instead of storing `is_oov` as a lexeme flag, `is_oov` reports whether
the lexeme has a vector.
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
Check that row is within bounds for the vector data array when adding a
vector.
Don't add vectors with rank OOV_RANK in `init-model` (change is due to
shift from OOV as 0 to OOV as OOV_RANK).