* added ruler coe
* added error for none existing pattern
* changed error to warning
* changed error to warning
* added basic tests
* fixed place
* added test files
* went back to error
* went back to pattern error
* minor change to docs
* changed style
* changed doc
* changed error slightly
* added remove to phrasem api
* error key already existed
* phrase matcher match code to api
* blacked tests
* moved comments before expr
* corrected error no
* Update website/docs/api/entityruler.md
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update website/docs/api/entityruler.md
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added sents property to Span class that returns a generator of sentences the Span belongs to
* Added description to Span.sents property
* Update test_span to clarify the difference between span.sent and span.sents
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/tests/doc/test_span.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix documentation typos in spacy/tokens/span.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update Span.sents doc string in spacy/tokens/span.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Parametrized test_span_spans
* Corrected Span.sents to check for span-level hook first. Also, made Span.sent respect doc-level sents hook if no span-level hook is provided
* Corrected Span ocumentation copy/paste issue
* Put back accidentally deleted lines
* Fixed formatting in span.pyx
* Moved check for SENT_START annotation after user hooks in Span.sents
* add version where the property was introduced
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Migrate regressions 1-1000
* Move serialize test to correct file
* Remove tests that won't work in v3
* Migrate regressions 1000-1500
Removed regression test 1250 because v3 doesn't support the old LEX
scheme anymore.
* Add missing imports in serializer tests
* Migrate tests 1500-2000
* Migrate regressions from 2000-2500
* Migrate regressions from 2501-3000
* Migrate regressions from 3000-3501
* Migrate regressions from 3501-4000
* Migrate regressions from 4001-4500
* Migrate regressions from 4501-5000
* Migrate regressions from 5001-5501
* Migrate regressions from 5501 to 7000
* Migrate regressions from 7001 to 8000
* Migrate remaining regression tests
* Fixing missing imports
* Update docs with new system [ci skip]
* Update CONTRIBUTING.md
- Fix formatting
- Update wording
* Remove lemmatizer tests in el lang
* Move a few tests into the general tokenizer
* Separate Doc and DocBin tests
* Add support for kb_id to be displayed via displacy.serve. The current support is only limited to the manual option in displacy.render
* Commit to check pre-commit hooks are run.
* Update spacy/displacy/__init__.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Changes as per suggestions on the PR.
* Update website/docs/api/top-level.md
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update website/docs/api/top-level.md
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* tag option as new from 3.2.1 onwards
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
* Use internal names for factories
If a component factory is registered like `@French.factory(...)` instead
of `@Language.factory(...)`, the name in the factories registry will be
prefixed with the language code. However in the nlp.config object the
factory will be listed without the language code. The `add_pipe` code
has fallback logic to handle this, but packaging code and the registry
itself don't.
This change makes it so that the factory name in nlp.config is the
language-specific form. It's not clear if this will break anything else,
but it does seem to fix the inconsistency and resolve the specific user
issue that brought this to our attention.
* Change approach to use fallback in package lookup
This adds fallback logic to the package lookup, so it doesn't have to
touch the way the config is built. It seems to fix the tests too.
* Remove unecessary line
* Add test
Thsi also adds an assert that seems to have been forgotten.
* Added Slovak
* Added Slovenian tests
* Added Estonian tests
* Added Croatian tests
* Added Latvian tests
* Added Icelandic tests
* Added Afrikaans tests
* Added language-independent tests
* Added Kannada tests
* Tidied up
* Added Albanian tests
* Formatted with black
* Added failing tests for anomalies
* Update spacy/tests/lang/af/test_text.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added context to failing Estonian tokenizer test
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added context to failing Croatian tokenizer test
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added context to failing Icelandic tokenizer test
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added context to failing Latvian tokenizer test
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added context to failing Slovak tokenizer test
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added context to failing Slovenian tokenizer test
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Added ENT_ID and ENT_KB_ID into the list of the attributes that Matcher matches on
* Added ENT_ID and ENT_KB_ID to TEST_PATTERNS in test_pattern_validation.py. Disabled tests that I added before
* Update website/docs/api/matcher.md
* Format
* Remove skipped tests
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* added error string
* added serialization test
* added more to if statements
* wrote file to tempdir
* added tempdir
* changed parameter a bit
* Update spacy/tests/pipeline/test_entity_ruler.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* make nlp.pipe() return None docs when no exceptions are (re-)raised during error handling
* Remove changes other than as_tuples test
* Only check warning count for one process
* Fix types
* Format
Co-authored-by: Xi Bai <xi.bai.ed@gmail.com>
* Add micro PRF for morph scoring
For pipelines where morph features are added by more than one component
and a reference training corpus may not contain all features, a micro
PRF score is more flexible than a simple accuracy score. An example is
the reading and inflection features added by the Japanese tokenizer.
* Use `morph_micro_f` as the default morph score for Japanese
morphologizers.
* Update docstring
* Fix typo in docstring
* Update Scorer API docs
* Fix results type
* Organize score list by attribute prefix
* Add support for fasttext-bloom hash-only vectors
Overview:
* Extend `Vectors` to have two modes: `default` and `ngram`
* `default` is the default mode and equivalent to the current
`Vectors`
* `ngram` supports the hash-only ngram tables from `fasttext-bloom`
* Extend `spacy.StaticVectors.v2` to handle both modes with no changes
for `default` vectors
* Extend `spacy init vectors` to support ngram tables
The `ngram` mode **only** supports vector tables produced by this
fork of fastText, which adds an option to represent all vectors using
only the ngram buckets table and which uses the exact same ngram
generation algorithm and hash function (`MurmurHash3_x64_128`).
`fasttext-bloom` produces an additional `.hashvec` table, which can be
loaded by `spacy init vectors --fasttext-bloom-vectors`.
https://github.com/adrianeboyd/fastText/tree/feature/bloom
Implementation details:
* `Vectors` now includes the `StringStore` as `Vectors.strings` so that
the API can stay consistent for both `default` (which can look up from
`str` or `int`) and `ngram` (which requires `str` to calculate the
ngrams).
* In ngram mode `Vectors` uses a default `Vectors` object as a cache
since the ngram vectors lookups are relatively expensive.
* The default cache size is the same size as the provided ngram vector
table.
* Once the cache is full, no more entries are added. The user is
responsible for managing the cache in cases where the initial
documents are not representative of the texts.
* The cache can be resized by setting `Vectors.ngram_cache_size` or
cleared with `vectors._ngram_cache.clear()`.
* The API ends up a bit split between methods for `default` and for
`ngram`, so functions that only make sense for `default` or `ngram`
include warnings with custom messages suggesting alternatives where
possible.
* `Vocab.vectors` becomes a property so that the string stores can be
synced when assigning vectors to a vocab.
* `Vectors` serializes its own config settings as `vectors.cfg`.
* The `Vectors` serialization methods have added support for `exclude`
so that the `Vocab` can exclude the `Vectors` strings while serializing.
Removed:
* The `minn` and `maxn` options and related code from
`Vocab.get_vector`, which does not work in a meaningful way for default
vector tables.
* The unused `GlobalRegistry` in `Vectors`.
* Refactor to use reduce_mean
Refactor to use reduce_mean and remove the ngram vectors cache.
* Rename to floret
* Rename to floret in error messages
* Use --vectors-mode in CLI, vector init
* Fix vectors mode in init
* Remove unused var
* Minor API and docstrings adjustments
* Rename `--vectors-mode` to `--mode` in `init vectors` CLI
* Rename `Vectors.get_floret_vectors` to `Vectors.get_batch` and support
both modes.
* Minor updates to Vectors docstrings.
* Update API docs for Vectors and init vectors CLI
* Update types for StaticVectors
* Ignore prefix in suffix matches
Ignore the currently matched prefix when looking for suffix matches in
the tokenizer. Otherwise a lookbehind in the suffix pattern may match
incorrectly due the presence of the prefix in the token string.
* Move °[cfkCFK]. to a tokenizer exception
* Adjust exceptions for same tokenization as v3.1
* Also update test accordingly
* Continue to split . after °CFK if ° is not a prefix
* Exclude new ° exceptions for pl
* Switch back to default tokenization of "° C ."
* Revert "Exclude new ° exceptions for pl"
This reverts commit 952013a5b4.
* Add exceptions for °C for hu
* Raise an error when multiprocessing is used on a GPU
As reported in #5507, a confusing exception is thrown when
multiprocessing is used with a GPU model and the `fork` multiprocessing
start method:
cupy.cuda.runtime.CUDARuntimeError: cudaErrorInitializationError: initialization error
This change checks whether one of the models uses the GPU when
multiprocessing is used. If so, raise a friendly error message.
Even though multiprocessing can work on a GPU with the `spawn` method,
it quickly runs the GPU out-of-memory on real-world data. Also,
multiprocessing on a single GPU typically does not provide large
performance gains.
* Move GPU multiprocessing check to Language.pipe
* Warn rather than error when using multiprocessing with GPU models
* Improve GPU multiprocessing warning message.
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Reduce API assumptions
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update spacy/language.py
* Update spacy/language.py
* Test that warning is thrown with GPU + multiprocessing
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* add custom protocols in spacy.ty
* add a test for the new types in spacy.ty
* import Example when type checking
* some type fixes
* put Protocol in compat
* revert update check back to hasattr
* runtime_checkable in compat as well
* 🚨 Ignore all existing Mypy errors
* 🏗 Add Mypy check to CI
* Add types-mock and types-requests as dev requirements
* Add additional type ignore directives
* Add types packages to dev-only list in reqs test
* Add types-dataclasses for python 3.6
* Add ignore to pretrain
* 🏷 Improve type annotation on `run_command` helper
The `run_command` helper previously declared that it returned an
`Optional[subprocess.CompletedProcess]`, but it isn't actually possible
for the function to return `None`. These changes modify the type
annotation of the `run_command` helper and remove all now-unnecessary
`# type: ignore` directives.
* 🔧 Allow variable type redefinition in limited contexts
These changes modify how Mypy is configured to allow variables to have
their type automatically redefined under certain conditions. The Mypy
documentation contains the following example:
```python
def process(items: List[str]) -> None:
# 'items' has type List[str]
items = [item.split() for item in items]
# 'items' now has type List[List[str]]
...
```
This configuration change is especially helpful in reducing the number
of `# type: ignore` directives needed to handle the common pattern of:
* Accepting a filepath as a string
* Overwriting the variable using `filepath = ensure_path(filepath)`
These changes enable redefinition and remove all `# type: ignore`
directives rendered redundant by this change.
* 🏷 Add type annotation to converters mapping
* 🚨 Fix Mypy error in convert CLI argument verification
* 🏷 Improve type annotation on `resolve_dot_names` helper
* 🏷 Add type annotations for `Vocab` attributes `strings` and `vectors`
* 🏷 Add type annotations for more `Vocab` attributes
* 🏷 Add loose type annotation for gold data compilation
* 🏷 Improve `_format_labels` type annotation
* 🏷 Fix `get_lang_class` type annotation
* 🏷 Loosen return type of `Language.evaluate`
* 🏷 Don't accept `Scorer` in `handle_scores_per_type`
* 🏷 Add `string_to_list` overloads
* 🏷 Fix non-Optional command-line options
* 🙈 Ignore redefinition of `wandb_logger` in `loggers.py`
* ➕ Install `typing_extensions` in Python 3.8+
The `typing_extensions` package states that it should be used when
"writing code that must be compatible with multiple Python versions".
Since SpaCy needs to support multiple Python versions, it should be used
when newer `typing` module members are required. One example of this is
`Literal`, which is available starting with Python 3.8.
Previously SpaCy tried to import `Literal` from `typing`, falling back
to `typing_extensions` if the import failed. However, Mypy doesn't seem
to be able to understand what `Literal` means when the initial import
means. Therefore, these changes modify how `compat` imports `Literal` by
always importing it from `typing_extensions`.
These changes also modify how `typing_extensions` is installed, so that
it is a requirement for all Python versions, including those greater
than or equal to 3.8.
* 🏷 Improve type annotation for `Language.pipe`
These changes add a missing overload variant to the type signature of
`Language.pipe`. Additionally, the type signature is enhanced to allow
type checkers to differentiate between the two overload variants based
on the `as_tuple` parameter.
Fixes#8772
* ➖ Don't install `typing-extensions` in Python 3.8+
After more detailed analysis of how to implement Python version-specific
type annotations using SpaCy, it has been determined that by branching
on a comparison against `sys.version_info` can be statically analyzed by
Mypy well enough to enable us to conditionally use
`typing_extensions.Literal`. This means that we no longer need to
install `typing_extensions` for Python versions greater than or equal to
3.8! 🎉
These changes revert previous changes installing `typing-extensions`
regardless of Python version and modify how we import the `Literal` type
to ensure that Mypy treats it properly.
* resolve mypy errors for Strict pydantic types
* refactor code to avoid missing return statement
* fix types of convert CLI command
* avoid list-set confustion in debug_data
* fix typo and formatting
* small fixes to avoid type ignores
* fix types in profile CLI command and make it more efficient
* type fixes in projects CLI
* put one ignore back
* type fixes for render
* fix render types - the sequel
* fix BaseDefault in language definitions
* fix type of noun_chunks iterator - yields tuple instead of span
* fix types in language-specific modules
* 🏷 Expand accepted inputs of `get_string_id`
`get_string_id` accepts either a string (in which case it returns its
ID) or an ID (in which case it immediately returns the ID). These
changes extend the type annotation of `get_string_id` to indicate that
it can accept either strings or IDs.
* 🏷 Handle override types in `combine_score_weights`
The `combine_score_weights` function allows users to pass an `overrides`
mapping to override data extracted from the `weights` argument. Since it
allows `Optional` dictionary values, the return value may also include
`Optional` dictionary values.
These changes update the type annotations for `combine_score_weights` to
reflect this fact.
* 🏷 Fix tokenizer serialization method signatures in `DummyTokenizer`
* 🏷 Fix redefinition of `wandb_logger`
These changes fix the redefinition of `wandb_logger` by giving a
separate name to each `WandbLogger` version. For
backwards-compatibility, `spacy.train` still exports `wandb_logger_v3`
as `wandb_logger` for now.
* more fixes for typing in language
* type fixes in model definitions
* 🏷 Annotate `_RandomWords.probs` as `NDArray`
* 🏷 Annotate `tok2vec` layers to help Mypy
* 🐛 Fix `_RandomWords.probs` type annotations for Python 3.6
Also remove an import that I forgot to move to the top of the module 😅
* more fixes for matchers and other pipeline components
* quick fix for entity linker
* fixing types for spancat, textcat, etc
* bugfix for tok2vec
* type annotations for scorer
* add runtime_checkable for Protocol
* type and import fixes in tests
* mypy fixes for training utilities
* few fixes in util
* fix import
* 🐵 Remove unused `# type: ignore` directives
* 🏷 Annotate `Language._components`
* 🏷 Annotate `spacy.pipeline.Pipe`
* add doc as property to span.pyi
* small fixes and cleanup
* explicit type annotations instead of via comment
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: svlandeg <svlandeg@github.com>
* Add test for case where parser overwrite annotations
* Move test to its own file
Also add note about how other tokens modify results.
* Fix xfail decorator
* Fix inconsistency
This makes the failing test pass, so that behavior is consistent whether
patterns are added in one call or two.
The issue is that the hash for patterns depended on the index of the
pattern in the list of current patterns, not the list of total patterns,
so a second call would get identical match ids.
* Add illustrative test case
* Add failing test for remove case
Patterns are not removed from the internal matcher on calls to remove,
which causes spurious weird matches (or misses).
* Fix removal issue
Remove patterns from the internal matcher.
* Check that the single add call also gets no matches
* use language-matching to allow language code aliases
Signed-off-by: Elia Robyn Speer <elia@explosion.ai>
* link to "IETF language tags" in docs
Signed-off-by: Elia Robyn Speer <elia@explosion.ai>
* Make requirements consistent
Signed-off-by: Elia Robyn Speer <elia@explosion.ai>
* change "two-letter language ID" to "IETF language tag" in language docs
Signed-off-by: Elia Robyn Speer <elia@explosion.ai>
* use langcodes 3.2 and handle language-tag errors better
Signed-off-by: Elia Robyn Speer <elia@explosion.ai>
* all unknown language codes are ImportErrors
Signed-off-by: Elia Robyn Speer <elia@explosion.ai>
Co-authored-by: Elia Robyn Speer <elia@explosion.ai>
* Use morph for extra Japanese tokenizer info
Previously Japanese tokenizer info that didn't correspond to Token
fields was put in user data. Since spaCy core should avoid touching user
data, this moves most information to the Token.morph attribute. It also
adds the normalized form, which wasn't exposed before.
The subtokens, which are a list of full tokens, are still added to user
data, except with the default tokenizer granualarity. With the default
tokenizer settings the subtokens are all None, so in this case the user
data is simply not set.
* Update tests
Also adds a new test for norm data.
* Update docs
* Add Japanese morphologizer factory
Set the default to `extend=True` so that the morphologizer does not
clobber the values set by the tokenizer.
* Use the norm_ field for normalized forms
Before this commit, normalized forms were put in the "norm" field in the
morph attributes. I am not sure why I did that instead of using the
token morph, I think I just forgot about it.
* Skip test if sudachipy is not installed
* Fix import
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add overwrite settings for more components
For pipeline components where it's relevant and not already implemented,
add an explicit `overwrite` setting that controls whether
`set_annotations` overwrites existing annotation.
For the `morphologizer`, add an additional setting `extend`, which
controls whether the existing features are preserved.
* +overwrite, +extend: overwrite values of existing features, add any new
features
* +overwrite, -extend: overwrite completely, removing any existing
features
* -overwrite, +extend: keep values of existing features, add any new
features
* -overwrite, -extend: do not modify the existing value if set
In all cases an unset value will be set by `set_annotations`.
Preserve current overwrite defaults:
* True: morphologizer, entity linker
* False: tagger, sentencizer, senter
* Add backwards compat overwrite settings
* Put empty line back
Removed by accident in last commit
* Set backwards-compatible defaults in __init__
Because the `TrainablePipe` serialization methods update `cfg`, there's
no straightforward way to detect whether models serialized with a
previous version are missing the overwrite settings.
It would be possible in the sentencizer due to its separate
serialization methods, however to keep the changes parallel, this also
sets the default in `__init__`.
* Remove traces
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
* Update Makefile
For more recent python version
* updated for bsc changes
New tokenization changes
* Update test_text.py
* updating tests and requirements
* changed failed test in test/lang/ca
changed failed test in test/lang/ca
* Update .gitignore
deleted stashed changes line
* back to python 3.6 and remove transformer requirements
As per request
* Update test_exception.py
Change the test
* Update test_exception.py
Remove test print
* Update Makefile
For more recent python version
* updated for bsc changes
New tokenization changes
* updating tests and requirements
* Update requirements.txt
Removed spacy-transfromers from requirements
* Update test_exception.py
Added final punctuation to ensure consistency
* Update Makefile
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Format
* Update test to check all tokens
Co-authored-by: cayorodriguez <crodriguezp@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Accept Doc input in pipelines
Allow `Doc` input to `Language.__call__` and `Language.pipe`, which
skips `Language.make_doc` and passes the doc directly to the pipeline.
* ensure_doc helper function
* avoid running multiple processes on GPU
* Update spacy/tests/test_language.py
Co-authored-by: svlandeg <svlandeg@github.com>
* Validate pos values when creating Doc
* Add clear error when setting invalid pos
This also changes the error language slightly.
* Fix variable name
* Update spacy/tokens/doc.pyx
* Test that setting invalid pos raises an error
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Replace all basestring references with unicode
`basestring` was a compatability type introduced by Cython to make
dealing with utf-8 strings in Python2 easier. In Python3 it is
equivalent to the unicode (or str) type.
I replaced all references to basestring with unicode, since that was
used elsewhere, but we could also just replace them with str, which
shoudl also be equivalent.
All tests pass locally.
* Replace all references to unicode type with str
Since we only support python3 this is simpler.
* Remove all references to unicode type
This removes all references to the unicode type across the codebase and
replaces them with `str`, which makes it more drastic than the prior
commits. In order to make this work importing `unicode_literals` had to
be removed, and one explicit unicode literal also had to be removed (it
is unclear why this is necessary in Cython with language level 3, but
without doing it there were errors about implicit conversion).
When `unicode` is used as a type in comments it was also edited to be
`str`.
Additionally `coding: utf8` headers were removed from a few files.
* Handle spacy-legacy in package CLI for dependencies
* Implement legacy backoff in spacy registry.find
* Remove unused import
* Update and format test
* pass alignments to callbacks
* refactor for single callback loop
* Update spacy/matcher/matcher.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix surprises when asking for the root of a git repo
In the case of the first asset I wanted to get from git, the data I
wanted was the entire repository. I tried leaving "path" blank, which
gave a less-than-helpful error, and then I tried `path: "/"`, which
started copying my entire filesystem into the project. The path I should
have used was "".
I've made two changes to make this smoother for others:
- The 'path' within a git clone defaults to ""
- If the path points outside of the tmpdir that the git clone goes
into, we fail with an error
Signed-off-by: Elia Robyn Speer <elia@explosion.ai>
* use a descriptive error instead of a default
plus some minor fixes from PR review
Signed-off-by: Elia Robyn Speer <elia@explosion.ai>
* check for None values in assets
Signed-off-by: Elia Robyn Speer <elia@explosion.ai>
Co-authored-by: Elia Robyn Speer <elia@explosion.ai>
* overfitting test on non-overlapping entities
* add failing overfitting test for overlapping entities
* failing test for list comprehension
* remove test that was put in separate PR
* bugfix
* cleanup
* test for error after Doc has been garbage collected
* warn about using a SpanGroup when the Doc has been garbage collected
* add warning to the docs
* rephrase slightly
* raise error instead of warning
* update
* move warning to doc property
* Fix incorrect pickling of Japanese and Korean pipelines, which led to
the entire pipeline being reset if pickled
* Enable pickling of Vietnamese tokenizer
* Update tokenizer APIs for Chinese, Japanese, Korean, Thai, and
Vietnamese so that only the `Vocab` is required for initialization
* Add scorer option to components
Add an optional `scorer` parameter to all pipeline components. If a
scoring function is provided, it overrides the default scoring method
for that component.
* Add registered scorers for all components
* Add `scorers` registry
* Move all scoring methods outside of components as independent
functions and register
* Use the registered scoring methods as defaults in configs and inits
Additional:
* The scoring methods no longer have access to the full component, so
use settings from `cfg` as default scorer options to handle settings
such as `labels`, `threshold`, and `positive_label`
* The `attribute_ruler` scoring method no longer has access to the
patterns, so all scoring methods are called
* Bug fix: `spancat` scoring method is updated to set `allow_overlap` to
score overlapping spans correctly
* Update Russian lemmatizer to use direct score method
* Check type of cfg in Pipe.score
* Fix check
* Update spacy/pipeline/sentencizer.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Remove validate_examples from scoring functions
* Use Pipe.labels instead of Pipe.cfg["labels"]
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add missing punctuation for Tigrinya and Amharic
* Fix numeral and ordinal numbers for Tigrinya
- Amharic was used in many cases
- Also fixed some typos
* Update Tigrinya stop-words
* Contributor agreement for fgaim
* Fix typo in "ti" lang test
* Remove multi-word entries from numbers and ordinals
* Add scores to output in spancat
This exposes the scores as an attribute on the SpanGroup. Includes a
basic test.
* Add basic doc note
* Vectorize score calcs
* Add "annotation format" section
* Update website/docs/api/spancategorizer.md
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Clean up doc section
* Ran prettier on docs
* Get arrays off the gpu before iterating over them
* Remove int() calls
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Pass excludes when serializing vocab
Additional minor bug fix:
* Deserialize vocab in `EntityLinker.from_disk`
* Add test for excluding strings on load
* Fix formatting
* Support list values and IS_INTERSECT in Matcher
* Support list values as token attributes for set operators, not just as
pattern values.
* Add `IS_INTERSECT` operator.
* Fix incorrect `ISSUBSET` and `ISSUPERSET` in schema and docs.
* Rename IS_INTERSECT to INTERSECTS
* Add ancient Greek language support
Initial commit
* Contributor Agreement
* grc tokenizer test added and files formatted with black, unnecessary import removed
Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Commas in lists fixed. __init__py added to test
* Update lex_attrs.py
* Update stop_words.py
* Update stop_words.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>