* Support registered vectors
* Format
* Auto-fill [nlp] on load from config and from bytes/disk
* Only auto-fill [nlp]
* Undo all changes to Language.from_disk
* Expand BaseVectors
These methods are needed in various places for training and vector
similarity.
* isort
* More linting
* Only fill [nlp.vectors]
* Update spacy/vocab.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Revert changes to test related to auto-filling [nlp]
* Add vectors registry
* Rephrase error about vocab methods for vectors
* Switch to dummy implementation for BaseVectors.to_ops
* Add initial draft of docs
* Remove example from BaseVectors docs
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update website/docs/api/basevectors.mdx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix type and lint bpemb example
* Update website/docs/api/basevectors.mdx
---------
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* remove migration support form
* initial test commit
* add fixture
* add combo test
* pull out parameter example data
* fix formatting on examples
* remove unused import
* remove unncessary fmt:off instructions
* only set logger level if verbose flag is explicitly set
---------
Co-authored-by: svlandeg <svlandeg@github.com>
* Add data structures to docs
* Adjusted descriptions for more consistency
* Add _optional_ flag to parameters
* Add tests and adjust optional title key in doc
* Add title to dep visualizations
* fix typo
---------
Co-authored-by: thomashacker <EdwardSchmuhl@web.de>
* Add cli for finding locations of registered func
* fixes: naming and typing
* isort
* update naming
* remove to find-function
* remove file:// bit
* use registry name if given and exit gracefully if a registry was not found
* clean up failure msg
* specify registry_name options
* mypy fixes
* return location for internal usage
* add documentation
* more mypy fixes
* clean up example
* add section to menu
* add tests
---------
Co-authored-by: svlandeg <svlandeg@github.com>
* Update numpy build constraints for numpy 1.25
Starting in numpy 1.25 (see
https://github.com/numpy/numpy/releases/tag/v1.25.0), the numpy C API is
backwards-compatible by default.
For python 3.9+, we should be able to drop the specific numpy build
requirements and use `numpy>=1.25`, which is currently
backwards-compatible to `numpy>=1.19`.
In the future, the python <3.9 requirements could be dropped and the
lower numpy pin could correspond to the oldest supported version for the
current lower python pin.
* Turn off fail-fast
* Revert "Turn off fail-fast"
This reverts commit 4306f516bc.
* Update for python 3.6
* Fix typo
* Update universe.json
* Update universe.json
add some missing commas in the greCy's description.
* Update punctuation.py
Add mathematical left and right angle brackets as punctuation for ancient Greek for better tokenization.
* modified: spacy/language.py
- corrected typo in docstring for :method:`Language.replace_listeners`
- added noqa comment on unused local variable assignment in :method:`Language.from_config` as I wasn't sure if it should be unassigned
modified: website/docs/api/language.mdx
- corrected typo in `Language.replace_listeners` markdown
* modified: spacy/language.py
- removed noqa comment
---------
Co-authored-by: Ian Thompson <ian.thompson@hrblock.com>
These changes add a missing call to `escape_html` in the displaCy span
renderer. Previously span-annotated tokens would be inserted into the
page markup without being escaped, resulting in potentially incorrect
rendering. When I encountered this issue, it resulted in some docs and
span underlines being superimposed on top of properly rendered docs and
span underlines near the beginning of the visualization (due to an
unescaped `<span>` tag).
* Setting up weasel branch (#12456)
* remove project-specific functionality
* remove project-specific tests
* remove project-specific schemas
* remove project-specific information in about
* remove project-specific functions in util.py
* remove project-specific error strings
* remove project-specific CLI commands
* black formatting
* restore some functions that are used beyond projects
* remove project imports
* remove imports
* remove remote_storage tests
* remove one more project unit test
* update for PR 12394
* remove get_hash and get_checksum
* remove upload_ and download_file methods
* remove ensure_pathy
* revert clumsy fingers
* reinstate E970
* feat: use weasel as spacy project command (#12473)
* feat: use weasel as spacy project command
* build: use constrained requirement for weasel
* feat: add weasel to the library requirements
* build: update weasel to new version
* build: use specific weasel tag
* build: use weasel-0.1.0rc1 from PyPI
* fix: remove weasel from requirements.txt
* fix: requirements.txt and setup.cfg need to reflect each other
* feat: remove legacy spacy project code
* bump version
* further merge fixes
* isort
---------
Co-authored-by: Basile Dura <bdura@users.noreply.github.com>
When the default `max_length` is not set and there are longer training
documents, it can be difficult to train and evaluate the span finder due
to memory limits and the time it takes to evaluate a huge number of
predicted spans.
* `Language.replace_listeners`: Pass the replaced listener and the `tok2vec` pipe to the callback
* Update developer docs
* `isort` fixes
* Add error message to assertion
* Add clarification to dev docs
* Replace assertion with exception
* Doc fixes
* Fix problem with universe pages using `docker` language
* Fix problem with universe pages using `r` language
* Add fallback, in case code language is unknown
* Support custom token/lexeme attribute for vectors
* Fix imports
* Back off to ORTH without Vectors.attr
* Fallback if vectors.attr doesn't exist
* Update docs
When sourcing a component, the object from the original pipeline is added to the new pipeline as the same object. This creates a situation where there are several attributes that cannot be in sync between the original pipeline and the new pipeline at the same time for this one object:
* component.name
* component.listener_map / component.listening_components for tok2vec and transformer
When running replace_listeners on a component, the config is not updated correctly if the state of the component is incorrect for the current pipeline (in particular changes that should be applied from model.attrs["replace_listener_cfg"] as used in spacy-transformers) due to the fact that:
* find_listeners relies on component.name to set the name in the listener_map
* replace_listeners relies on listener_map to determine how to modify the configs
In addition, there are several places where pipeline components are modified and the listener map and/or internal component names aren't currently updated.
In cases where there is a component shared by two pipelines that cannot be in sync, this PR chooses to prioritize the most recently modified or initialized pipeline. There is no actual solution with the current source behavior that will make both pipelines usable, so the current pipeline is updated whenever components are added/renamed/removed or the pipeline is initialized for training.