A mistake in the regex pattern kept it from matching all the desired tokens. With the `r` raw string literal prefix, a backslash must not be doubled: in a raw string, two backslashes make the regex match a literal backslash.
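The raw-string issue can be reproduced with a minimal sketch. The pattern and sample text below are illustrative assumptions, not the pattern that was actually fixed:

```python
import re

text = "order 66"

# In a raw string, "\d" reaches the regex engine as the digit class.
correct = re.compile(r"\d+")

# Doubling the backslash in a raw string asks the regex engine for a literal
# backslash followed by one or more "d" characters.
doubled = re.compile(r"\\d+")

print(correct.findall(text))       # ['66']
print(doubled.findall(text))       # []
print(doubled.findall("a\\dd b"))  # ['\\dd']
```

The doubled form only matches text that contains an actual backslash, which is why the intended tokens were missed.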
* feat: add example stubs
* fix: add required annotations
* fix: mypy issues
* fix: use Py36-compatible Protocol
* Minor reformatting
* adding further type specifications and removing internal methods
* black formatting
* widen type to iterable
* add private methods that are being used by the built-in converters
* revert changes to corpus.py
* fixes
* fixes
* fix typing of PlainTextCorpus
---------
Co-authored-by: Basile Dura <basile@bdura.me>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Support registered vectors
* Format
* Auto-fill [nlp] on load from config and from bytes/disk
* Only auto-fill [nlp]
* Undo all changes to Language.from_disk
* Expand BaseVectors
These methods are needed in various places for training and vector
similarity.
* isort
* More linting
* Only fill [nlp.vectors]
* Update spacy/vocab.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Revert changes to test related to auto-filling [nlp]
* Add vectors registry
* Rephrase error about vocab methods for vectors
* Switch to dummy implementation for BaseVectors.to_ops
* Add initial draft of docs
* Remove example from BaseVectors docs
* Apply suggestions from code review
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update website/docs/api/basevectors.mdx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix type and lint bpemb example
* Update website/docs/api/basevectors.mdx
---------
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* remove migration support form
* initial test commit
* add fixture
* add combo test
* pull out parameter example data
* fix formatting on examples
* remove unused import
* remove unnecessary fmt:off instructions
* only set logger level if verbose flag is explicitly set
---------
Co-authored-by: svlandeg <svlandeg@github.com>
* Add data structures to docs
* Adjusted descriptions for more consistency
* Add _optional_ flag to parameters
* Add tests and adjust optional title key in doc
* Add title to dep visualizations
* fix typo
---------
Co-authored-by: thomashacker <EdwardSchmuhl@web.de>
* Add CLI for finding locations of registered functions
* fixes: naming and typing
* isort
* update naming
* rename to find-function
* remove file:// bit
* use registry name if given and exit gracefully if a registry was not found
* clean up failure msg
* specify registry_name options
* mypy fixes
* return location for internal usage
* add documentation
* more mypy fixes
* clean up example
* add section to menu
* add tests
---------
Co-authored-by: svlandeg <svlandeg@github.com>
* Update numpy build constraints for numpy 1.25
Starting in numpy 1.25 (see
https://github.com/numpy/numpy/releases/tag/v1.25.0), the numpy C API is
backwards-compatible by default.
For python 3.9+, we should be able to drop the specific numpy build
requirements and use `numpy>=1.25`, which is currently
backwards-compatible to `numpy>=1.19`.
In the future, the python <3.9 requirements could be dropped and the
lower numpy pin could correspond to the oldest supported version for the
current lower python pin.
* Turn off fail-fast
* Revert "Turn off fail-fast"
This reverts commit 4306f516bc.
* Update for python 3.6
* Fix typo
* Update universe.json
* Update universe.json
Add some missing commas in greCy's description.
* Update punctuation.py
Add mathematical left and right angle brackets as punctuation for ancient Greek for better tokenization.
* modified: spacy/language.py
- corrected typo in docstring for :method:`Language.replace_listeners`
- added noqa comment on unused local variable assignment in :method:`Language.from_config` as I wasn't sure if it should be unassigned
modified: website/docs/api/language.mdx
- corrected typo in `Language.replace_listeners` markdown
* modified: spacy/language.py
- removed noqa comment
---------
Co-authored-by: Ian Thompson <ian.thompson@hrblock.com>
These changes add a missing call to `escape_html` in the displaCy span
renderer. Previously span-annotated tokens would be inserted into the
page markup without being escaped, resulting in potentially incorrect
rendering. When I encountered this issue, it resulted in some docs and
span underlines being superimposed on top of properly rendered docs and
span underlines near the beginning of the visualization (due to an
unescaped `<span>` tag).
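A minimal sketch of why the escaping matters, using the standard library's `html.escape` as a stand-in for `escape_html`. The `render_span_token` helper is hypothetical and is not the displaCy renderer:

```python
from html import escape


def render_span_token(text: str) -> str:
    """Hypothetical renderer: wrap one token's text in a <span> element."""
    # Escaping keeps token text such as "<span>" from being parsed as markup.
    return f'<span class="token">{escape(text)}</span>'


print(render_span_token("<span>"))  # <span class="token">&lt;span&gt;</span>
```

Without the `escape` call, the token text would open a new, unclosed element and disturb the layout of everything rendered after it.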
* Setting up weasel branch (#12456)
* remove project-specific functionality
* remove project-specific tests
* remove project-specific schemas
* remove project-specific information in about
* remove project-specific functions in util.py
* remove project-specific error strings
* remove project-specific CLI commands
* black formatting
* restore some functions that are used beyond projects
* remove project imports
* remove imports
* remove remote_storage tests
* remove one more project unit test
* update for PR 12394
* remove get_hash and get_checksum
* remove upload_ and download_file methods
* remove ensure_pathy
* revert clumsy fingers
* reinstate E970
* feat: use weasel as spacy project command (#12473)
* feat: use weasel as spacy project command
* build: use constrained requirement for weasel
* feat: add weasel to the library requirements
* build: update weasel to new version
* build: use specific weasel tag
* build: use weasel-0.1.0rc1 from PyPI
* fix: remove weasel from requirements.txt
* fix: requirements.txt and setup.cfg need to reflect each other
* feat: remove legacy spacy project code
* bump version
* further merge fixes
* isort
---------
Co-authored-by: Basile Dura <bdura@users.noreply.github.com>
When the default `max_length` is not set and there are longer training
documents, it can be difficult to train and evaluate the span finder due
to memory limits and the time it takes to evaluate a huge number of
predicted spans.
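To give a rough sense of scale, the sketch below counts every contiguous span in a document, with and without a length cap. It is only an upper bound under the assumption that any start/end pair could be predicted; `count_candidate_spans` is a hypothetical helper, not part of the span finder:

```python
from typing import Optional


def count_candidate_spans(n_tokens: int, max_length: Optional[int] = None) -> int:
    """Count contiguous spans in a document, optionally capped in length."""
    longest = n_tokens if max_length is None else min(max_length, n_tokens)
    # There are n_tokens - length + 1 spans of each length.
    return sum(n_tokens - length + 1 for length in range(1, longest + 1))


print(count_candidate_spans(1000))      # 500500
print(count_candidate_spans(1000, 25))  # 24700
```

For a 1000-token document, capping spans at 25 tokens cuts the upper bound by roughly a factor of 20.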
* `Language.replace_listeners`: Pass the replaced listener and the `tok2vec` pipe to the callback
* Update developer docs
* `isort` fixes
* Add error message to assertion
* Add clarification to dev docs
* Replace assertion with exception
* Doc fixes
* Fix problem with universe pages using `docker` language
* Fix problem with universe pages using `r` language
* Add fallback, in case code language is unknown
* Support custom token/lexeme attribute for vectors
* Fix imports
* Back off to ORTH without Vectors.attr
* Fallback if vectors.attr doesn't exist
* Update docs
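The ORTH fallback described in the last few commits could look roughly like the following sketch. The classes and the `vectors_key_attr` helper are hypothetical stand-ins, and attributes are plain strings here rather than spaCy's attribute IDs:

```python
ORTH = "ORTH"  # stand-in for spaCy's ORTH attribute ID


class LegacyVectors:
    """Hypothetical vectors table created before `attr` existed."""


class LowerVectors:
    """Hypothetical vectors table keyed on the lowercase form."""

    attr = "LOWER"


def vectors_key_attr(vectors: object) -> str:
    # Fall back to ORTH when the vectors table has no `attr`.
    return getattr(vectors, "attr", ORTH)


print(vectors_key_attr(LegacyVectors()))  # ORTH
print(vectors_key_attr(LowerVectors()))   # LOWER
```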