* overfitting test on non-overlapping entities
* add failing overfitting test for overlapping entities
* failing test for list comprehension
* remove test that was put in separate PR
* bugfix
* cleanup
* test for error after Doc has been garbage collected
* warn about using a SpanGroup when the Doc has been garbage collected
* add warning to the docs
* rephrase slightly
* raise error instead of warning
* update
* move warning to doc property
* Refactor to use list comps and enumerate.
Replace loops that append to a list with a list comprehensions where this does not change the behavior; replace range(len(...)) loops with enumerate. Correct one typo in a comment. Replace a call to set() with a set literal.
* Undo double assignment.
Expand `tokens_to_key[j] = k = self._get_matcher_key(key, i, j)` to two statements.
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Sign contributors agreement
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Allow passing in array vars for speedup
This fixes#8845. Not sure about the docstring changes here...
* Update docs
Types maybe need more detail? Maybe not?
* Run prettier on docs
* Update spacy/tokens/span.pyx
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Add scores to output in spancat
This exposes the scores as an attribute on the SpanGroup. Includes a
basic test.
* Add basic doc note
* Vectorize score calcs
* Add "annotation format" section
* Update website/docs/api/spancategorizer.md
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Clean up doc section
* Ran prettier on docs
* Get arrays off the gpu before iterating over them
* Remove int() calls
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add stub files for main API classes
* Add contributor agreement for ezorita
* Update types for ndarray and hash()
* Fix __getitem__ and __iter__
* Add attributes of Doc and Token classes
* Overload type hints for Span.__getitem__
* Fix type hint overload for Span.__getitem__
Co-authored-by: Luca Dorigo <dorigoluca@gmail.com>
* Fix check for RIGHT_ATTRs in dep matcher
If a non-anchor node does not have RIGHT_ATTRS, the dep matcher throws
an E100, which says that non-anchor nodes must have LEFT_ID, REL_OP, and
RIGHT_ID. It specifically does not say RIGHT_ATTRS is required.
A blank RIGHT_ATTRS is also valid, and patterns with one will be
excepted. While not normal, sometimes a REL_OP is enough to specify a
non-anchor node - maybe you just want the head of another node
unconditionally, for example.
This change just sets RIGHT_ATTRS to {} if not present. Alternatively
changing E100 to state RIGHT_ATTRS is required could also be reasonable.
* Fix test
This test was written on the assumption that if `RIGHT_ATTRS` isn't
present an error will be raised. Since the proposed changes make it so
an error won't be raised this is no longer necessary.
* Revert test, update error message
Error message now lists missing keys, and RIGHT_ATTRS is required.
* Use list of required keys in error message
Also removes unused key param arg.
* Pass excludes when serializing vocab
Additional minor bug fix:
* Deserialize vocab in `EntityLinker.from_disk`
* Add test for excluding strings on load
* Fix formatting
* Support list values and IS_INTERSECT in Matcher
* Support list values as token attributes for set operators, not just as
pattern values.
* Add `IS_INTERSECT` operator.
* Fix incorrect `ISSUBSET` and `ISSUPERSET` in schema and docs.
* Rename IS_INTERSECT to INTERSECTS
* Add ancient Greek language support
Initial commit
* Contributor Agreement
* grc tokenizer test added and files formatted with black, unnecessary import removed
Co-Authored-By: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Commas in lists fixed. __init__py added to test
* Update lex_attrs.py
* Update stop_words.py
* Update stop_words.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* ✨ implement noun_chunks for dutch language
* copy/paste FR and SV syntax iterators to accomodate UD tags
* added tests with dutch text
* signed contributor agreement
* 🐛 fix noun chunks generator
* built from scratch
* define noun chunk as a single Noun-Phrase
* includes some corner cases debugging (incorrect POS tagging)
* test with provided annotated sample (POS, DEP)
* ✅ fix failing test
* CI pipeline did not like the added sample file
* add the sample as a pytest fixture
* Update spacy/lang/nl/syntax_iterators.py
* Update spacy/lang/nl/syntax_iterators.py
Code readability
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/tests/lang/nl/test_noun_chunks.py
correct comment
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* finalize code
* change "if next_word" into "if next_word is not None"
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* avoid msg var impliciteness
* rename local msg
* Add CI tests for debug data and train
* Adjust debug data CLI test
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Add the right return type for Language.pipe and an overload for the as_tuples version
* Reformat, tidy up
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Fix vectors check for sourced components
Since vectors are not loaded when components are sourced, store a hash
for the vectors of each sourced component and compare it to the loaded
vectors after the vectors are loaded from the `[initialize]` block.
* Pop temporary info
* Remove stored hash in remove_pipe
* Add default for pop
* Add additional convert/debug/assemble CLI tests
* Raise an error for textcat with <2 labels
Raise an error if initializing a `textcat` component without at least
two labels.
* Add similar note to docs
* Update positive_label description in API docs
* Draft spancat model
* Add spancat model
* Add test for extract_spans
* Add extract_spans layer
* Upd extract_spans
* Add spancat model
* Add test for spancat model
* Upd spancat model
* Update spancat component
* Upd spancat
* Update spancat model
* Add quick spancat test
* Import SpanCategorizer
* Fix SpanCategorizer component
* Import SpanGroup
* Fix span extraction
* Fix import
* Fix import
* Upd model
* Update spancat models
* Add scoring, update defaults
* Update and add docs
* Fix type
* Update spacy/ml/extract_spans.py
* Auto-format and fix import
* Fix comment
* Fix type
* Fix type
* Update website/docs/api/spancategorizer.md
* Fix comment
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Better defense
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fix labels list
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/ml/extract_spans.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/pipeline/spancat.py
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Set annotations during update
* Set annotations in spancat
* fix imports in test
* Update spacy/pipeline/spancat.py
* replace MaxoutLogistic with LinearLogistic
* fix config
* various small fixes
* remove set_annotations parameter in update
* use our beloved tupley format with recent support for doc.spans
* bugfix to allow renaming the default span_key (scores weren't showing up)
* use different key in docs example
* change defaults to better-working parameters from project (WIP)
* register spacy.extract_spans.v1 for legacy purposes
* Upd dev version so can build wheel
* layers instead of architectures for smaller building blocks
* Update website/docs/api/spancategorizer.md
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update website/docs/api/spancategorizer.md
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Include additional scores from overrides in combined score weights
* Parameterize spans key in scoring
Parameterize the `SpanCategorizer` `spans_key` for scoring purposes so
that it's possible to evaluate multiple `spancat` components in the same
pipeline.
* Use the (intentionally very short) default spans key `sc` in the
`SpanCategorizer`
* Adjust the default score weights to include the default key
* Adjust the scorer to use `spans_{spans_key}` as the prefix for the
returned score
* Revert addition of `attr_name` argument to `score_spans` and adjust
the key in the `getter` instead.
Note that for `spancat` components with a custom `span_key`, the score
weights currently need to be modified manually in
`[training.score_weights]` for them to be available during training. To
suppress the default score weights `spans_sc_p/r/f` during training, set
them to `null` in `[training.score_weights]`.
* Update website/docs/api/scorer.md
* Fix scorer for spans key containing underscore
* Increment version
* Add Spans to Evaluate CLI (#8439)
* Add Spans to Evaluate CLI
* Change to spans_key
* Add spans per_type output
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Fix spancat GPU issues (#8455)
* Fix GPU issues
* Require thinc >=8.0.6
* Switch to glorot_uniform_init
* Fix and test ngram suggester
* Include final ngram in doc for all sizes
* Fix ngrams for docs of the same length as ngram size
* Handle batches of docs that result in no ngrams
* Add tests
Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Nirant <NirantK@users.noreply.github.com>
* Use minor version for compatibility check
* Use minor version of compatibility table
* Soften warning message about incompatible models
* Add test for presence of current version in compatibility table
* Add test for download compatibility table
* Use minor version of lower pin in error message if possible
* Fall back to spacy_git_version if available
* Fix unknown version string
* Don't use the same vocab for source models
The source models should not be loaded with the vocab from the current
pipeline because this loads the vectors from the source model into the
current vocab.
The strings are all copied in `Language.create_pipe_from_source`, so if
the vectors are configured correctly in the current pipeline, the
sourced component will work as expected. If there is a vector mismatch,
a warning is shown. (It's not possible to inspect whether the vectors
are actually used by the component, so a warning is the best option.)
* Update comment on source model loading
* Copy rather than move files to top-level of package
* Add all files to `MANIFEST.in` (primarily for older versions of pip)
* Include the `README.md` contents as `long_description` in the setup
* Support a cfg field in transition system
* Make NER 'has gold' check use right alignment for span
* Pass 'negative_samples_key' property into NER transition system
* Add field for negative samples to NER transition system
* Check neg_key in NER has_gold
* Support negative examples in NER oracle
* Test for negative examples in NER
* Fix name of config variable in NER
* Remove vestiges of old-style partial annotation
* Remove obsolete tests
* Add comment noting lack of support for negative samples in parser
* Additions to "neg examples" PR (#8201)
* add custom error and test for deprecated format
* add test for unlearning an entity
* add break also for Begin's cost
* add negative_samples_key property on Parser
* rename
* extend docs & fix some older docs issues
* add subclass constructors, clean up tests, fix docs
* add flaky test with ValueError if gold parse was not found
* remove ValueError if n_gold == 0
* fix docstring
* Hack in environment variables to try out training
* Remove hack
* Remove NER hack, and support 'negative O' samples
* Fix O oracle
* Fix transition parser
* Remove 'not O' from oracle
* Fix NER oracle
* check for spans in both gold.ents and gold.spans and raise if so, to prevent memory access violation
* use set instead of list in consistency check
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* implement textcat resizing for TextCatCNN
* resizing textcat in-place
* simplify code
* ensure predictions for old textcat labels remain the same after resizing (WIP)
* fix for softmax
* store softmax as attr
* fix ensemble weight copy and cleanup
* restructure slightly
* adjust documentation, update tests and quickstart templates to use latest versions
* extend unit test slightly
* revert unnecessary edits
* fix typo
* ensemble architecture won't be resizable for now
* use resizable layer (WIP)
* revert using resizable layer
* resizable container while avoid shape inference trouble
* cleanup
* ensure model continues training after resizing
* use fill_b parameter
* use fill_defaults
* resize_layer callback
* format
* bump thinc to 8.0.4
* bump spacy-legacy to 3.0.6
* Added Italian POS-aware lemmatizer.
Also added the code used to build the lookup tables by POS.
* Create gtoffoli.md
* Add imports and format
* Remove helper script
* Use lemma_lookup instead of lemma_lookup_legacy
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
To avoid config errors during training when `[corpora.pretrain.path]` is
`None` with the default `spacy.JsonlCorpus.v1` reader, make the reader
path optional, similar to `spacy.Corpus.v1`.
* Change span lemmas to use original whitespace (fix#8368)
This is a redo of #8371 based off master.
The test for this required some changes to existing tests. I don't think
the changes were significant but I'd like someone to check them.
* Remove mystery docstring
This sentence was uncompleted for years, and now we will never know how
it ends.
* Fill in deps if not provided with heads
Before this change, if heads were passed without deps they would be
silently ignored, which could be confusing. See #8334.
* Use "dep" instead of a blank string
This is the customary placeholder dep. It might be better to show an
error here instead though.
* Throw error on heads without deps
* Add a test
* Fix tests
* Formatting
* Fix all tests
* Fix a test I missed
* Revise error message
* Clean up whitespace
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Update Catalan language data
Update Catalan language data based on contributions from the Text Mining
Unit at the Barcelona Supercomputing Center:
https://github.com/TeMU-BSC/spacy4release/tree/main/lang_data
* Update tokenizer settings for UD Catalan AnCora
Update for UD Catalan AnCora v2.7 with merged multi-word tokens.
* Update test
* Move prefix patternt to more generic infix pattern
* Clean up
For the Russian and Ukrainian lemmatizers, restrict the `pymorphy2`
requirement to the mode `pymorphy2` so that lookup or other lemmatizer
modes can be loaded without installing `pymorphy2`.
* Show warning if entity_ruler runs without patterns
* Show warning if matcher runs without patterns
* fix wording
* unit test for warning once (WIP)
* warn W036 only once
* cleanup
* create filter_warning helper