* Require that all SpanGroup spans are from the current doc
The restriction on only adding spans from the current doc were already
implemented for all operations except for `SpanGroup.__init__`.
Initialize copied spans for `SpanGroup.copy` with `Doc.char_span` in
order to validate the character offsets and to make it possible to copy
spans between documents with differing tokenization. Currently there is
no validation that the document texts are identical, but the span char
offsets must be valid spans in the target doc, which prevents you from
ending up with completely invalid spans.
* Undo change in test_beam_overfitting_IO
* add vetiver to spacy universe
* remove image
* update logo to render correctly in thumbnail
* apply Basil's suggestion
Co-authored-by: Basile Dura <bdura@users.noreply.github.com>
* refer to the same model
---------
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Basile Dura <bdura@users.noreply.github.com>
* Address upcoming numpy v1.25 deprecations in test suite
* Temporarily test most recent numpy prerelease in CI
* Revert "Temporarily test most recent numpy prerelease in CI"
This reverts commit d75a66e55e.
While the typing_extensions/pydantic `Literal` bugs are being sorted
out, disable fail-fast so the rest of the CI is available for
development purposes.
* Add scorer option to return per-component scores
Add `per_component` option to `Language.evaluate` and `Scorer.score` to
return scores keyed by `tokenizer` (hard-coded) or by component name.
Add option to `evaluate` CLI to score by component. Per-component scores
can only be saved to JSON.
* Update help text and messages
This reverts commit 6f314f99c4.
We are reverting this until we can support this normalization more
consistently across vectors, training corpora, and lemmatizer data.
* parsigs universe
* added model installation explanation in the description
* Update website/meta/universe.json
Co-authored-by: Basile Dura <bdura@users.noreply.github.com>
* added model installement instruction in the code example
---------
Co-authored-by: Basile Dura <bdura@users.noreply.github.com>
* Use Latin normalization for Serbian attrs
Use Latin normalization for Serbian `NORM`, `PREFIX`, and `SUFFIX`.
* Update NORMs in tokenizer exceptions and related tests
* Add tests for all custom lex attrs
* Remove unused imports
* Add spans in spacy benchmark
The current implementation of spaCy benchmark accuracy / spacy evaluate
doesn't include the "spans" type, so calling the command doesn't render
the HTML displaCy file needed.
This PR attempts to fix that by creating a new parameter for "spans"
and calling the appropriate displaCy value.
* Reformat file with black
* Add tests for evaluate
* Fix spans -> span for displacy style
* Update test to check render instead
* Update source so mypy passes
* Add parser information to avoid warnings
* CI: Only run test suite once with thinc-apple-ops for macos python 3.11
* Adjust syntax
* Try alternate syntax
* Try alternate syntax
* Try alternate syntax
* avoid nesting then flattening
* mypy fix
* Apply suggestions from code review
* Add type for indices
* Run full matrix for mypy
* Add back modified type: ignore
* Revert "Run full matrix for mypy"
This reverts commit e218873d04.
---------
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add noun chunking to la syntax iterators
* Expand list of numeral, ordinal words
* Expand abbreviations in la tokenizer_exceptions
* Add example sents
* Update spacy/lang/la/syntax_iterators.py
Reorganize la syntax iterators
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Minor updates based on review
* fix call
---------
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add default to MorphAnalysis.get
Similar to `dict`, allow a `default` option for `MorphAnalysis.get` for
the user to provide a default return value if the field is not found.
The default return value remains `[]`, which is not the same as
`dict.get`, but is already established as this method's default return
value with the return type `List[str]`. However the new `default` option
does not enforce that the user-provided default is actually `List[str]`.
* Restore test case
In `Tokenizer.from_bytes`, the exceptions should be loaded last so that
they are only processed once as part of loading the model.
The exceptions are tokenized as phrase matcher patterns in the
background and the internal tokenization needs to be synced with all the
remaining tokenizer settings. If the exceptions are not loaded last,
there are speed regressions for `Tokenizer.from_bytes/disk` vs.
`Tokenizer.add_special_case` as the caches are reloaded more than
necessary during deserialization.
Replace `progress_bar = "all_steps"` with `progress_bar = "eval"`, which is consistent with the default behavior for `spacy.ConsoleLogger.v1` and `spacy.ConsoleLogger.v2`.