This PR removes the dependency on langcodes introduced in #9342.
While the introduction of langcodes allows a significantly wider range of language codes, there are some unexpected side effects:
zh-Hant (Traditional Chinese) should be mapped to zh intead of None, as spaCy's Chinese model is based on pkuseg which supports tokenization of both Simplified and Traditional Chinese.
Since it is possible that spaCy may have a model for Norwegian Nynorsk in the future, mapping no (macrolanguage Norwegian) to nb (Norwegian Bokmål) might be misleading. In that case, the user should be asked to specify nb or nn (Norwegian Nynorsk) specifically or consult the doc.
Same as above for regional variants of languages such as en_gb and en_us.
Overall, IMHO, introducing an extra dependency just for the conversion of language codes is an overkill. It is possible that most user just need the conversion between 2/3-letter ISO codes and a simple dictionary lookup should suffice.
With this PR, ISO 639-1 and ISO 639-3 codes are supported. ISO 639-2/B (bibliographic codes which are not favored and used in ISO 639-3) and deprecated ISO 639-1/2 codes are also supported to maximize backward compatibility.
In order to support Python 3.13, we had to migrate to Cython 3.0. This caused some tricky interaction with our Pydantic usage, because Cython 3 uses the from __future__ import annotations semantics, which causes type annotations to be saved as strings.
The end result is that we can't have Language.factory decorated functions in Cython modules anymore, as the Language.factory decorator expects to inspect the signature of the functions and build a Pydantic model. If the function is implemented in Cython, an error is raised because the type is not resolved.
To address this I've moved the factory functions into a new module, spacy.pipeline.factories. I've added __getattr__ importlib hooks to the previous locations, in case anyone was importing these functions directly. The change should have no backwards compatibility implications.
Along the way I've also refactored the registration of functions for the config. Previously these ran as import-time side-effects, using the registry decorator. I've created instead a new module spacy.registrations. When the registry is accessed it calls a function ensure_populated(), which cases the registrations to occur.
I've made a similar change to the Language.factory registrations in the new spacy.pipeline.factories module.
I want to remove these import-time side-effects so that we can speed up the loading time of the library, which can be especially painful on the CLI. I also find that I'm often working to track down the implementations of functions referenced by strings in the config. Having the registrations all happen in one place will make this easier.
With these changes I've fortunately avoided the need to migrate to Pydantic v2 properly --- we're still using the v1 compatibility shim. We might not be able to hold out forever though: Pydantic (reasonably) aren't actively supporting the v1 shims. I put a lot of work into v2 migration when investigating the 3.13 support, and it's definitely challenging. In any case, it's a relief that we don't have to do the v2 migration at the same time as the Cython 3.0/Python 3.13 support.
* Add spacy.TextCatParametricAttention.v1
This layer provides is a simplification of the ensemble classifier that
only uses paramteric attention. We have found empirically that with a
sufficient amount of training data, using the ensemble classifier with
BoW does not provide significant improvement in classifier accuracy.
However, plugging in a BoW classifier does reduce GPU training and
inference performance substantially, since it uses a GPU-only kernel.
* Fix merge fallout
* Update numpy build constraints for numpy 1.25
Starting in numpy 1.25 (see
https://github.com/numpy/numpy/releases/tag/v1.25.0), the numpy C API is
backwards-compatible by default.
For python 3.9+, we should be able to drop the specific numpy build
requirements and use `numpy>=1.25`, which is currently
backwards-compatible to `numpy>=1.19`.
In the future, the python <3.9 requirements could be dropped and the
lower numpy pin could correspond to the oldest supported version for the
current lower python pin.
* Turn off fail-fast
* Revert "Turn off fail-fast"
This reverts commit 4306f516bc.
* Update for python 3.6
* Fix typo
* Setting up weasel branch (#12456)
* remove project-specific functionality
* remove project-specific tests
* remove project-specific schemas
* remove project-specific information in about
* remove project-specific functions in util.py
* remove project-specific error strings
* remove project-specific CLI commands
* black formatting
* restore some functions that are used beyond projects
* remove project imports
* remove imports
* remove remote_storage tests
* remove one more project unit test
* update for PR 12394
* remove get_hash and get_checksum
* remove upload_ and download_file methods
* remove ensure_pathy
* revert clumsy fingers
* reinstate E970
* feat: use weasel as spacy project command (#12473)
* feat: use weasel as spacy project command
* build: use constrained requirement for weasel
* feat: add weasel to the library requirements
* build: update weasel to new version
* build: use specific weasel tag
* build: use weasel-0.1.0rc1 from PyPI
* fix: remove weasel from requirements.txt
* fix: requirements.txt and setup.cfg need to reflect each other
* feat: remove legacy spacy project code
* bump version
* further merge fixes
* isort
---------
Co-authored-by: Basile Dura <bdura@users.noreply.github.com>
* Use isort with Black profile
* isort all the things
* Fix import cycles as a result of import sorting
* Add DOCBIN_ALL_ATTRS type definition
* Add isort to requirements
* Remove isort from build dependencies check
* Typo
* Change GPU efficient textcat to use CNN, not BOW
If you generate a config with a textcat component using GPU
(transformers), the defaut option (efficiency) uses a BOW architecture,
which does not use tok2vec features. While that can make sense as part
of a larger pipeline, in the case of just a transformer and a textcat,
that means the transformer is doing a lot of work for no purpose.
This changes it so that the CNN architecture is used instead. It could
also be changed to be the same as the accuracy config, which uses the
ensemble architecture.
* Add the transformer when using a textcat with GPU
* Switch ubuntu-latest to ubuntu-20.04 in main tests (#11928)
* Switch ubuntu-latest to ubuntu-20.04 in main tests
* Only use 20.04 for 3.6
* Require thinc v8.1.7
* Require thinc v8.1.8
* Break up longer expression
---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Extend to wasabi v1.1
* Temporarily run mypy and tests with newest wasabi
* Temporarily skip check requirements test
* Revert "Temporarily skip check requirements test"
This reverts commit 44f4ce20a8.
* Revert "Temporarily run mypy and tests with newest wasabi"
This reverts commit e677a2257c.
* Support local filesystem remotes for projects
* Fix support for local filesystem remotes for projects
* Use `FluidPath` instead of `Pathy` to support both filesystem and
remote paths
* Create missing parent directories if required for local filesystem
* Add a more general `_file_exists` method to support both `Pathy`,
`Path`, and `smart_open`-compatible URLs
* Add explicit `smart_open` dependency starting with support for
`compression` flag
* Update `pathy` dependency to exclude older versions that aren't
compatible with required `smart_open` version
* Update docs to refer to `Pathy` instead of `smart_open` for project
remotes (technically you can still push to any `smart_open`-compatible
path but you can't pull from them)
* Add tests for local filesystem remotes
* Update pathy for general BlobStat sorting
* Add import
* Remove _file_exists since only Pathy remotes are supported
* Format CLI docs
* Clean up merge
* Add a dry run flag to download
* Remove --dry-run, add --url option to `spacy info` instead
* Make mypy happy
* Print only the URL, so it's easier to use in scripts
* Don't add the egg hash unless downloading an sdist
* Update spacy/cli/info.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add two implementations of requirements
* Clean up requirements sample slightly
This should make mypy happy
* Update URL help string
* Remove requirements option
* Add url option to docs
* Add URL to spacy info model output, when available
* Add types-setuptools to testing reqs
* Add types-setuptools to requirements
* Add "compatible", expand docstring
* Update spacy/cli/info.py
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Run prettier on CLI docs
* Update docs
Add a sidebar about finding download URLs, with some examples of the new
command.
* Add download URLs to table on model page
* Apply suggestions from code review
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Updates from review
* download url -> download link
* Update docs
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Parser: use C saxpy/sgemm provided by the Ops implementation
This is a backport of https://github.com/explosion/spaCy/pull/10747
from the parser refactor branch. It eliminates the explicit calls
to BLIS, instead using the saxpy/sgemm provided by the Ops
implementation.
This allows us to use Accelerate in the parser on M1 Macs (with
an updated thinc-apple-ops).
Performance of the de_core_news_lg pipe:
BLIS 0.7.0, no thinc-apple-ops: 6385 WPS
BLIS 0.7.0, thinc-apple-ops: 36455 WPS
BLIS 0.9.0, no thinc-apple-ops: 19188 WPS
BLIS 0.9.0, thinc-apple-ops: 36682 WPS
This PR, thinc-apple-ops: 38726 WPS
Performance of the de_core_news_lg pipe (only tok2vec -> parser):
BLIS 0.7.0, no thinc-apple-ops: 13907 WPS
BLIS 0.7.0, thinc-apple-ops: 73172 WPS
BLIS 0.9.0, no thinc-apple-ops: 41576 WPS
BLIS 0.9.0, thinc-apple-ops: 72569 WPS
This PR, thinc-apple-ops: 87061 WPS
* Require thinc >=8.1.0,<8.2.0
* Lower thinc lowerbound to 8.1.0.dev0
* Use best CPU ops for CBLAS when the parser model is on the GPU
* Fix another unguarded cblas() call
* Fix: use ops as a shorthand for self.model.ops
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
* Make changes to typing
* Correction
* Format with black
* Corrections based on review
* Bumped Thinc dependency version
* Bumped blis requirement
* Correction for older Python versions
* Update spacy/ml/models/textcat.py
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
* Corrections based on review feedback
* Readd deleted docstring line
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>