* Validate pos values when creating Doc
* Add clear error when setting invalid pos
This also changes the error language slightly.
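A minimal sketch of the kind of check described above, in plain Python rather than the actual Cython code. `VALID_POS` and `check_pos` are hypothetical names, and the tag set is assumed to be the 17 Universal POS tags plus the empty string for "not set".

```python
from typing import Iterable, List

# Assumption: the 17 Universal POS tags, plus "" meaning "no value set".
VALID_POS = {
    "", "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}


def check_pos(pos_values: Iterable[str]) -> List[str]:
    """Return the values as a list, or raise a clear error for an invalid one."""
    checked = []
    for value in pos_values:
        if value not in VALID_POS:
            raise ValueError(
                f"Invalid coarse-grained POS value {value!r}. "
                f"Expected one of: {sorted(VALID_POS)}"
            )
        checked.append(value)
    return checked
```

A fine-grained tag such as "NN" passed by mistake would then fail at creation time with a message naming the bad value.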
* Fix variable name
* Update spacy/tokens/doc.pyx
* Test that setting invalid pos raises an error
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Remove two attributes marked for removal in 3.1
* Add back unused ints with changed names
* Change data_dir to _unused_object
This is still kept in the type definition, but I removed it from the
serialization code.
* Put serialization code back for now
Not sure how this interacts with old serialized models yet.
* Replace all basestring references with unicode
`basestring` was a compatibility type introduced by Cython to make
dealing with utf-8 strings in Python 2 easier. In Python 3 it is
equivalent to the unicode (or str) type.
I replaced all references to basestring with unicode, since that was
used elsewhere, but we could also just replace them with str, which
should also be equivalent.
All tests pass locally.
* Replace all references to unicode type with str
Since we only support Python 3 this is simpler.
* Remove all references to unicode type
This removes all references to the unicode type across the codebase and
replaces them with `str`, which makes it more drastic than the prior
commits. In order to make this work importing `unicode_literals` had to
be removed, and one explicit unicode literal also had to be removed (it
is unclear why this is necessary in Cython with language level 3, but
without doing it there were errors about implicit conversion).
When `unicode` is used as a type in comments it was also edited to be
`str`.
Additionally `coding: utf8` headers were removed from a few files.
* Handle spacy-legacy in package CLI for dependencies
* Implement legacy backoff in spacy registry.find
* Remove unused import
* Update and format test
* pass alignments to callbacks
* refactor for single callback loop
* Update spacy/matcher/matcher.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix surprises when asking for the root of a git repo
In the case of the first asset I wanted to get from git, the data I
wanted was the entire repository. I tried leaving "path" blank, which
gave a less-than-helpful error, and then I tried `path: "/"`, which
started copying my entire filesystem into the project. The path I should
have used was "".
I've made two changes to make this smoother for others:
- The 'path' within a git clone defaults to ""
- If the path points outside of the tmpdir that the git clone goes
into, we fail with an error
Signed-off-by: Elia Robyn Speer <elia@explosion.ai>
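A minimal sketch of the two changes above, assuming a helper that resolves the requested path inside the temporary clone directory. `resolve_asset_path` is a hypothetical name, not the function used in the project code.

```python
from pathlib import Path


def resolve_asset_path(clone_dir: Path, path: str = "") -> Path:
    """Resolve `path` inside the clone and refuse anything outside it."""
    root = clone_dir.resolve()
    # Joining with an absolute path such as "/" discards `root` entirely,
    # which is how `path: "/"` ended up pointing at the whole filesystem.
    target = (root / path).resolve()
    if target != root and root not in target.parents:
        raise ValueError(
            f"Path {path!r} points outside the cloned repository; "
            'use "" to refer to the repository root.'
        )
    return target
```

With the default of "", the function returns the clone directory itself, so the whole repository is used.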
* use a descriptive error instead of a default
plus some minor fixes from PR review
Signed-off-by: Elia Robyn Speer <elia@explosion.ai>
* check for None values in assets
Signed-off-by: Elia Robyn Speer <elia@explosion.ai>
Co-authored-by: Elia Robyn Speer <elia@explosion.ai>
* Fix inference of epoch_resume
When an epoch_resume value is not specified individually, it can often
be inferred from the filename. The value inference code was there but
the value wasn't passed back to the training loop.
This also adds a specific error in the case where no epoch_resume value
is provided and it can't be inferred from the filename.
* Add new error
* Always use the epoch resume value if specified
Before this the value in the filename was used if found
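The resulting precedence can be sketched roughly as follows. This is an illustration only: `get_epoch_resume` is a hypothetical helper, and both the `model<N>.bin` filename pattern and resuming at the epoch after the saved one are assumptions.

```python
import re
from typing import Optional


def get_epoch_resume(resume_path: str, epoch_resume: Optional[int] = None) -> int:
    """Pick the epoch to resume from: explicit value first, then the filename."""
    # An explicitly specified value always wins.
    if epoch_resume is not None:
        return epoch_resume
    # Assumption: weights files are named like "model9.bin".
    match = re.search(r"model(\d+)\.bin$", resume_path)
    if match is None:
        raise ValueError(
            f"Can't infer epoch_resume from {resume_path!r}; "
            "please specify it explicitly."
        )
    # Assumption: training resumes at the epoch after the saved one.
    return int(match.group(1)) + 1
```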
* overfitting test on non-overlapping entities
* add failing overfitting test for overlapping entities
* failing test for list comprehension
* remove test that was put in separate PR
* bugfix
* cleanup
* test for error after Doc has been garbage collected
* warn about using a SpanGroup when the Doc has been garbage collected
* add warning to the docs
* rephrase slightly
* raise error instead of warning
* update
* move warning to doc property
* Fix incorrect pickling of Japanese and Korean pipelines, which led to
the entire pipeline being reset if pickled
* Enable pickling of Vietnamese tokenizer
* Update tokenizer APIs for Chinese, Japanese, Korean, Thai, and
Vietnamese so that only the `Vocab` is required for initialization
* Refactor to use list comps and enumerate.
Replace loops that append to a list with list comprehensions where this does not change the behavior; replace range(len(...)) loops with enumerate. Correct one typo in a comment. Replace a call to set() with a set literal.
* Undo double assignment.
Expand `tokens_to_key[j] = k = self._get_matcher_key(key, i, j)` to two statements.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Sign contributors agreement
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Allow passing in array vars for speedup
This fixes #8845. Not sure about the docstring changes here...
* Update docs
Types maybe need more detail? Maybe not?
* Run prettier on docs
* Update spacy/tokens/span.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add scorer option to components
Add an optional `scorer` parameter to all pipeline components. If a
scoring function is provided, it overrides the default scoring method
for that component.
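The override behaviour can be sketched with a toy component. Nothing here is the real pipeline API: `ToyComponent`, `default_accuracy_scorer`, and the dict-based examples are made up for illustration.

```python
from typing import Callable, Dict, Iterable, Optional

Scorer = Callable[[Iterable[dict]], Dict[str, float]]


def default_accuracy_scorer(examples: Iterable[dict]) -> Dict[str, float]:
    """Default scoring function, kept outside the component."""
    examples = list(examples)
    correct = sum(1 for eg in examples if eg["predicted"] == eg["gold"])
    return {"acc": correct / len(examples) if examples else 0.0}


class ToyComponent:
    def __init__(self, scorer: Optional[Scorer] = default_accuracy_scorer):
        # A scorer passed by the caller replaces the default one.
        self.scorer = scorer

    def score(self, examples: Iterable[dict]) -> Dict[str, float]:
        if self.scorer is None:
            return {}
        return self.scorer(examples)
```

Because the scorer is a plain function, it only sees the examples, not the component, which is why any settings it needs have to be passed in separately.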
* Add registered scorers for all components
* Add `scorers` registry
* Move all scoring methods outside of components as independent
functions and register
* Use the registered scoring methods as defaults in configs and inits
Additional:
* The scoring methods no longer have access to the full component, so
use settings from `cfg` as default scorer options to handle settings
such as `labels`, `threshold`, and `positive_label`
* The `attribute_ruler` scoring method no longer has access to the
patterns, so all scoring methods are called
* Bug fix: `spancat` scoring method is updated to set `allow_overlap` to
score overlapping spans correctly
* Update Russian lemmatizer to use direct score method
* Check type of cfg in Pipe.score
* Fix check
* Update spacy/pipeline/sentencizer.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Remove validate_examples from scoring functions
* Use Pipe.labels instead of Pipe.cfg["labels"]
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add missing punctuation for Tigrinya and Amharic
* Fix numeral and ordinal numbers for Tigrinya
- Amharic was used in many cases
- Also fixed some typos
* Update Tigrinya stop-words
* Contributor agreement for fgaim
* Fix typo in "ti" lang test
* Remove multi-word entries from numbers and ordinals
* Add scores to output in spancat
This exposes the scores as an attribute on the SpanGroup. Includes a
basic test.
* Add basic doc note
* Vectorize score calcs
* Add "annotation format" section
* Update website/docs/api/spancategorizer.md
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Clean up doc section
* Ran prettier on docs
* Get arrays off the gpu before iterating over them
* Remove int() calls
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add more stop words and improve readability
* Add and categorize the tokenizer exceptions for `bg` lang
* Create syrull.md
* Add references for the additional stop words and tokenizer exc abbrs
* Add stub files for main API classes
* Add contributor agreement for ezorita
* Update types for ndarray and hash()
* Fix __getitem__ and __iter__
* Add attributes of Doc and Token classes
* Overload type hints for Span.__getitem__
* Fix type hint overload for Span.__getitem__
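For reference, overloading `__getitem__` so that an integer index and a slice get different return types looks roughly like this. `ToySpan` is a stand-in class over a list of strings, not the real stub.

```python
from typing import List, Union, overload


class ToySpan:
    def __init__(self, words: List[str]) -> None:
        self.words = words

    # Type checkers pick the matching overload; only the last
    # definition below exists at runtime.
    @overload
    def __getitem__(self, i: int) -> str: ...

    @overload
    def __getitem__(self, i: slice) -> "ToySpan": ...

    def __getitem__(self, i: Union[int, slice]) -> Union[str, "ToySpan"]:
        if isinstance(i, slice):
            return ToySpan(self.words[i])
        return self.words[i]
```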
Co-authored-by: Luca Dorigo <dorigoluca@gmail.com>
* Fix check for RIGHT_ATTRs in dep matcher
If a non-anchor node does not have RIGHT_ATTRS, the dep matcher throws
an E100, which says that non-anchor nodes must have LEFT_ID, REL_OP, and
RIGHT_ID. It specifically does not say RIGHT_ATTRS is required.
A blank RIGHT_ATTRS is also valid, and patterns with one will be
accepted. While not normal, sometimes a REL_OP is enough to specify a
non-anchor node - maybe you just want the head of another node
unconditionally, for example.
This change just sets RIGHT_ATTRS to {} if not present. Alternatively
changing E100 to state RIGHT_ATTRS is required could also be reasonable.
* Fix test
This test was written on the assumption that if `RIGHT_ATTRS` isn't
present an error will be raised. Since the proposed changes make it so
an error won't be raised this is no longer necessary.
* Revert test, update error message
Error message now lists missing keys, and RIGHT_ATTRS is required.
* Use list of required keys in error message
Also removes unused key param arg.
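A rough sketch of the final behaviour, where the error lists exactly which required keys are missing. `check_non_anchor_node` and the message wording are hypothetical; only the four key names come from the commits above.

```python
from typing import Dict

REQUIRED_KEYS = ("LEFT_ID", "REL_OP", "RIGHT_ID", "RIGHT_ATTRS")


def check_non_anchor_node(node: Dict[str, object]) -> None:
    """Raise if a non-anchor pattern node lacks any required key."""
    missing = [key for key in REQUIRED_KEYS if key not in node]
    if missing:
        raise ValueError(
            f"Non-anchor nodes must have the keys {list(REQUIRED_KEYS)}; "
            f"missing: {missing}"
        )
```

An empty `RIGHT_ATTRS` of `{}` still passes this check, since only the presence of the key is tested.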
* Pass excludes when serializing vocab
Additional minor bug fix:
* Deserialize vocab in `EntityLinker.from_disk`
* Add test for excluding strings on load
* Fix formatting
* Support list values and IS_INTERSECT in Matcher
* Support list values as token attributes for set operators, not just as
pattern values.
* Add `IS_INTERSECT` operator.
* Fix incorrect `ISSUBSET` and `ISSUPERSET` in schema and docs.
* Rename IS_INTERSECT to INTERSECTS
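The intended semantics of the set operators on list-valued token attributes can be illustrated in plain Python. This is a sketch of the set logic only, with a hypothetical `set_predicate` helper; it is not the Matcher implementation.

```python
from typing import Iterable


def as_set(value) -> set:
    """Treat a single value as a one-element set, a list as a set."""
    if isinstance(value, (list, tuple, set, frozenset)):
        return set(value)
    return {value}


def set_predicate(op: str, token_value, pattern_value: Iterable) -> bool:
    token_set = as_set(token_value)
    pattern_set = set(pattern_value)
    if op == "IS_SUBSET":
        return token_set <= pattern_set
    if op == "IS_SUPERSET":
        return token_set >= pattern_set
    if op == "INTERSECTS":
        # True if the two sets share at least one member.
        return bool(token_set & pattern_set)
    raise ValueError(f"Unknown operator: {op}")
```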
* Add ancient Greek language support
Initial commit
* Contributor Agreement
* grc tokenizer test added and files formatted with black, unnecessary import removed
Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Commas in lists fixed. __init__.py added to test
* Update lex_attrs.py
* Update stop_words.py
* Update stop_words.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* ✨ implement noun_chunks for dutch language
* copy/paste FR and SV syntax iterators to accommodate UD tags
* added tests with dutch text
* signed contributor agreement
* 🐛 fix noun chunks generator
* built from scratch
* define noun chunk as a single Noun-Phrase
* includes some corner cases debugging (incorrect POS tagging)
* test with provided annotated sample (POS, DEP)
* ✅ fix failing test
* CI pipeline did not like the added sample file
* add the sample as a pytest fixture
* Update spacy/lang/nl/syntax_iterators.py
* Update spacy/lang/nl/syntax_iterators.py
Code readability
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/tests/lang/nl/test_noun_chunks.py
correct comment
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* finalize code
* change "if next_word" into "if next_word is not None"
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* avoid msg var implicitness
* rename local msg
* Add CI tests for debug data and train
* Adjust debug data CLI test
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add the right return type for Language.pipe and an overload for the as_tuples version
* Reformat, tidy up
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Fix vectors check for sourced components
Since vectors are not loaded when components are sourced, store a hash
for the vectors of each sourced component and compare it to the loaded
vectors after the vectors are loaded from the `[initialize]` block.
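The idea can be sketched without any real vectors table. The helpers below are hypothetical and hash a plain nested list; the actual hashing of the vectors data is not shown in this commit message.

```python
import hashlib
from typing import Dict, List, Sequence


def vectors_hash(vectors: Sequence[Sequence[float]]) -> str:
    """Hash a toy vectors table (a nested sequence of floats)."""
    data = repr([list(row) for row in vectors]).encode("utf8")
    return hashlib.sha256(data).hexdigest()


def find_vector_mismatches(
    source_hashes: Dict[str, str], loaded_vectors: Sequence[Sequence[float]]
) -> List[str]:
    """Return names of sourced components whose stored hash differs."""
    loaded = vectors_hash(loaded_vectors)
    return [name for name, stored in source_hashes.items() if stored != loaded]
```

The stored hashes are only needed until the comparison has run, which is why they are treated as temporary info afterwards.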
* Pop temporary info
* Remove stored hash in remove_pipe
* Add default for pop
* Add additional convert/debug/assemble CLI tests
* Raise an error for textcat with <2 labels
Raise an error if initializing a `textcat` component without at least
two labels.
* Add similar note to docs
* Update positive_label description in API docs
* Draft spancat model
* Add spancat model
* Add test for extract_spans
* Add extract_spans layer
* Upd extract_spans
* Add spancat model
* Add test for spancat model
* Upd spancat model
* Update spancat component
* Upd spancat
* Update spancat model
* Add quick spancat test
* Import SpanCategorizer
* Fix SpanCategorizer component
* Import SpanGroup
* Fix span extraction
* Fix import
* Fix import
* Upd model
* Update spancat models
* Add scoring, update defaults
* Update and add docs
* Fix type
* Update spacy/ml/extract_spans.py
* Auto-format and fix import
* Fix comment
* Fix type
* Fix type
* Update website/docs/api/spancategorizer.md
* Fix comment
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Better defense
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix labels list
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/ml/extract_spans.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/pipeline/spancat.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Set annotations during update
* Set annotations in spancat
* fix imports in test
* Update spacy/pipeline/spancat.py
* replace MaxoutLogistic with LinearLogistic
* fix config
* various small fixes
* remove set_annotations parameter in update
* use our beloved tupley format with recent support for doc.spans
* bugfix to allow renaming the default span_key (scores weren't showing up)
* use different key in docs example
* change defaults to better-working parameters from project (WIP)
* register spacy.extract_spans.v1 for legacy purposes
* Upd dev version so can build wheel
* layers instead of architectures for smaller building blocks
* Update website/docs/api/spancategorizer.md
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update website/docs/api/spancategorizer.md
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Include additional scores from overrides in combined score weights
* Parameterize spans key in scoring
Parameterize the `SpanCategorizer` `spans_key` for scoring purposes so
that it's possible to evaluate multiple `spancat` components in the same
pipeline.
* Use the (intentionally very short) default spans key `sc` in the
`SpanCategorizer`
* Adjust the default score weights to include the default key
* Adjust the scorer to use `spans_{spans_key}` as the prefix for the
returned score
* Revert addition of `attr_name` argument to `score_spans` and adjust
the key in the `getter` instead.
Note that for `spancat` components with a custom `span_key`, the score
weights currently need to be modified manually in
`[training.score_weights]` for them to be available during training. To
suppress the default score weights `spans_sc_p/r/f` during training, set
them to `null` in `[training.score_weights]`.
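As a small illustration of the naming scheme, the score names for a given spans key can be built like this. `span_score_names` is a hypothetical helper; only the `spans_{spans_key}` prefix, the `p/r/f` suffixes, and the default key `sc` come from the description above.

```python
from typing import Dict


def span_score_names(spans_key: str = "sc") -> Dict[str, str]:
    """Build the precision/recall/F-score names for one spans key."""
    prefix = f"spans_{spans_key}"
    return {metric: f"{prefix}_{metric}" for metric in ("p", "r", "f")}
```

For a component with a custom key, these are the names that would need to be added by hand to `[training.score_weights]`.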
* Update website/docs/api/scorer.md
* Fix scorer for spans key containing underscore
* Increment version
* Add Spans to Evaluate CLI (#8439)
* Add Spans to Evaluate CLI
* Change to spans_key
* Add spans per_type output
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Fix spancat GPU issues (#8455)
* Fix GPU issues
* Require thinc >=8.0.6
* Switch to glorot_uniform_init
* Fix and test ngram suggester
* Include final ngram in doc for all sizes
* Fix ngrams for docs of the same length as ngram size
* Handle batches of docs that result in no ngrams
* Add tests
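The edge cases listed above can be shown with a plain-Python n-gram generator over token offsets. This is a sketch of the indexing only, not the suggester itself, and `ngram_spans` is a hypothetical name.

```python
from typing import List, Sequence, Tuple


def ngram_spans(
    doc_lengths: Sequence[int], sizes: Sequence[int]
) -> List[List[Tuple[int, int]]]:
    """Return (start, end) token offsets of all n-grams for each doc."""
    all_spans = []
    for length in doc_lengths:
        spans = []
        for size in sizes:
            # range(length - size + 1) includes the final n-gram, yields one
            # span when length == size, and is empty when the doc is shorter.
            for start in range(length - size + 1):
                spans.append((start, start + size))
        all_spans.append(spans)
    return all_spans
```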
Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Nirant <NirantK@users.noreply.github.com>
* Use minor version for compatibility check
* Use minor version of compatibility table
* Soften warning message about incompatible models
* Add test for presence of current version in compatibility table
* Add test for download compatibility table
* Use minor version of lower pin in error message if possible
* Fall back to spacy_git_version if available
* Fix unknown version string
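Reducing a full version string to its minor version, as used for the table lookup, might look like the sketch below. The helper names and the shape of the compatibility table are assumptions for illustration.

```python
from typing import Dict, List, Optional


def get_minor_version(version: str) -> Optional[str]:
    """Return "major.minor" for a version string, or None if unparsable."""
    parts = version.split(".")
    if len(parts) < 2 or not all(part.isdigit() for part in parts[:2]):
        return None
    return ".".join(parts[:2])


def compatible_models(table: Dict[str, List[str]], version: str) -> List[str]:
    """Look up models by minor version (toy table: minor version -> names)."""
    minor = get_minor_version(version)
    return table.get(minor, []) if minor is not None else []
```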
* Don't use the same vocab for source models
The source models should not be loaded with the vocab from the current
pipeline because this loads the vectors from the source model into the
current vocab.
The strings are all copied in `Language.create_pipe_from_source`, so if
the vectors are configured correctly in the current pipeline, the
sourced component will work as expected. If there is a vector mismatch,
a warning is shown. (It's not possible to inspect whether the vectors
are actually used by the component, so a warning is the best option.)
* Update comment on source model loading
* Copy rather than move files to top-level of package
* Add all files to `MANIFEST.in` (primarily for older versions of pip)
* Include the `README.md` contents as `long_description` in the setup
* Support a cfg field in transition system
* Make NER 'has gold' check use right alignment for span
* Pass 'negative_samples_key' property into NER transition system
* Add field for negative samples to NER transition system
* Check neg_key in NER has_gold
* Support negative examples in NER oracle
* Test for negative examples in NER
* Fix name of config variable in NER
* Remove vestiges of old-style partial annotation
* Remove obsolete tests
* Add comment noting lack of support for negative samples in parser
* Additions to "neg examples" PR (#8201)
* add custom error and test for deprecated format
* add test for unlearning an entity
* add break also for Begin's cost
* add negative_samples_key property on Parser
* rename
* extend docs & fix some older docs issues
* add subclass constructors, clean up tests, fix docs
* add flaky test with ValueError if gold parse was not found
* remove ValueError if n_gold == 0
* fix docstring
* Hack in environment variables to try out training
* Remove hack
* Remove NER hack, and support 'negative O' samples
* Fix O oracle
* Fix transition parser
* Remove 'not O' from oracle
* Fix NER oracle
* check for spans in both gold.ents and gold.spans and raise if so, to prevent memory access violation
* use set instead of list in consistency check
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* implement textcat resizing for TextCatCNN
* resizing textcat in-place
* simplify code
* ensure predictions for old textcat labels remain the same after resizing (WIP)
* fix for softmax
* store softmax as attr
* fix ensemble weight copy and cleanup
* restructure slightly
* adjust documentation, update tests and quickstart templates to use latest versions
* extend unit test slightly
* revert unnecessary edits
* fix typo
* ensemble architecture won't be resizable for now
* use resizable layer (WIP)
* revert using resizable layer
* resizable container while avoiding shape inference trouble
* cleanup
* ensure model continues training after resizing
* use fill_b parameter
* use fill_defaults
* resize_layer callback
* format
* bump thinc to 8.0.4
* bump spacy-legacy to 3.0.6
* Added Italian POS-aware lemmatizer.
Also added the code used to build the lookup tables by POS.
* Create gtoffoli.md
* Add imports and format
* Remove helper script
* Use lemma_lookup instead of lemma_lookup_legacy
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
To avoid config errors during training when `[corpora.pretrain.path]` is
`None` with the default `spacy.JsonlCorpus.v1` reader, make the reader
path optional, similar to `spacy.Corpus.v1`.
* Change span lemmas to use original whitespace (fix #8368)
This is a redo of #8371 based off master.
The test for this required some changes to existing tests. I don't think
the changes were significant but I'd like someone to check them.
* Remove mystery docstring
This sentence was uncompleted for years, and now we will never know how
it ends.
* Fill in deps if not provided with heads
Before this change, if heads were passed without deps they would be
silently ignored, which could be confusing. See #8334.
* Use "dep" instead of a blank string
This is the customary placeholder dep. It might be better to show an
error here instead though.
* Throw error on heads without deps
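The check that replaced the silent fallback can be sketched as a standalone function. `check_heads_and_deps` and the message text are hypothetical.

```python
from typing import List, Optional


def check_heads_and_deps(
    heads: Optional[List[int]], deps: Optional[List[str]]
) -> None:
    """Raise instead of silently ignoring heads that come without deps."""
    if heads is not None and deps is None:
        raise ValueError(
            "heads were provided without deps; "
            "provide a dependency label for each token as well."
        )
```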
* Add a test
* Fix tests
* Formatting
* Fix all tests
* Fix a test I missed
* Revise error message
* Clean up whitespace
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update Catalan language data
Update Catalan language data based on contributions from the Text Mining
Unit at the Barcelona Supercomputing Center:
https://github.com/TeMU-BSC/spacy4release/tree/main/lang_data
* Update tokenizer settings for UD Catalan AnCora
Update for UD Catalan AnCora v2.7 with merged multi-word tokens.
* Update test
* Move prefix pattern to more generic infix pattern
* Clean up
For the Russian and Ukrainian lemmatizers, restrict the `pymorphy2`
requirement to the mode `pymorphy2` so that lookup or other lemmatizer
modes can be loaded without installing `pymorphy2`.
* Show warning if entity_ruler runs without patterns
* Show warning if matcher runs without patterns
* fix wording
* unit test for warning once (WIP)
* warn W036 only once
* cleanup
* create filter_warning helper
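A helper of this kind can be built on the standard `warnings` module. The version below is a sketch under that assumption; the body and the example message text are not taken from the codebase.

```python
import re
import warnings


def filter_warning(action: str, error_msg: str) -> None:
    """Install a warnings filter for messages starting with `error_msg`."""
    # The message argument is a regular expression, so the brackets in a
    # code like "[W036]" are escaped to be matched literally.
    warnings.filterwarnings(action, message=re.escape(error_msg))
```

Calling `filter_warning("once", "[W036]")` asks Python to show a matching warning only the first time it is raised.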
* Don't add duplicate patterns (fix #8216)
* Refactor EntityRuler init
This simplifies the EntityRuler init code. This is helpful as prep for
allowing the EntityRuler to reset itself.
* Make EntityRuler.clear reset matchers
Includes a new test for this.
* Tidy PhraseMatcher instantiation
Since the attr can be None safely now, the `if` guard is no longer
required here.
Also renamed the `_validate` attr. Maybe it's not needed?
* Fix NER test
* Add test to make sure patterns aren't increasing
* Move test to regression tests
* "y" etc.
Many changes described in pull request
* Update spacy/lang/fr/stop_words.py
* Update spacy/lang/fr/stop_words.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
The attributes `PROB`, `CLUSTER` and `SENT_END` are not supported by
`Lexeme.get_struct_attr` so should not be included through `attrs.IDS`
as supported attributes in `Doc.to_array` and other methods.
* Add all symbols in Unicode Currency Symbols block
In #8102 it came up that the rupee symbol was treated differently from
dollar / euro / yen symbols. This adds many symbols not already
included.
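For illustration, the characters in that block can be enumerated with the standard library. The Currency Symbols block spans U+20A0 to U+20CF; the function name is hypothetical, and how the symbols are wired into the language data is not shown.

```python
import unicodedata

# The Unicode "Currency Symbols" block spans U+20A0 to U+20CF.
BLOCK_START, BLOCK_END = 0x20A0, 0x20CF


def currency_block_symbols() -> str:
    """Return all assigned currency symbols in the block as one string."""
    chars = (chr(cp) for cp in range(BLOCK_START, BLOCK_END + 1))
    # Unassigned code points in the block have category "Cn" and are skipped.
    return "".join(ch for ch in chars if unicodedata.category(ch) == "Sc")
```

The dollar and yen signs live outside this block, so they still have to be listed separately.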
* Fix test
* Fix training test
The behavior of `spacy.Corpus.v1` is unexpected enough for `max_length
!= 0` that `0` is a better default for users creating a new config with
the quickstart.
If not, documents are skipped, sometimes the entire corpus is skipped,
and sometimes documents are (quite unexpectedly for your average user)
split into sentences.
* unit test for pickling KB
* add pickling test for NEL
* KB to_bytes and from_bytes
* NEL to_bytes and from_bytes
* xfail pickle tests for now
* fix docs
* cleanup
* Fix range in Span.get_lca_matrix
Fix the adjusted token index / lca matrix index ranges for
`_get_lca_matrix` for spans.
* The range for `k` should correspond to the adjusted indices in
`lca_matrix` with the `start` indexed at `0`
* Update test for v3.x