* Fix scoring normalization (#7629)
* fix scoring normalization
* score weights by total sum instead of per component
* cleanup
* more cleanup
* Use a context manager when reading model (fix#7036) (#8244)
* Fix other open calls without context managers (#8245)
* Don't add duplicate patterns all the time in EntityRuler (fix#8216) (#8246)
* Don't add duplicate patterns (fix#8216)
* Refactor EntityRuler init
This simplifies the EntityRuler init code. This is helpful as prep for
allowing the EntityRuler to reset itself.
* Make EntityRuler.clear reset matchers
Includes a new test for this.
* Tidy PhraseMatcher instantiation
Since the attr can be None safely now, the guard if is no longer
required here.
Also renamed the `_validate` attr. Maybe it's not needed?
* Fix NER test
* Add test to make sure patterns aren't increasing
* Move test to regression tests
* Exclude generated .cpp files from package (#8271)
* Fix non-deterministic deduplication in Greek lemmatizer (#8421)
* Fix setting empty entities in Example.from_dict (#8426)
* Filter W036 for entity ruler, etc. (#8424)
* Preserve paths.vectors/initialize.vectors setting in quickstart template
* Various fixes for spans in Docs.from_docs (#8487)
* Fix spans offsets if a doc ends in a single space and no space is
inserted
* Also include spans key in merged doc for empty spans lists
* Fix duplicate spacy package CLI opts (#8551)
Use `-c` for `--code` and not additionally for `--create-meta`, in line
with the docs.
* Raise an error for textcat with <2 labels (#8584)
* Raise an error for textcat with <2 labels
Raise an error if initializing a `textcat` component without at least
two labels.
* Add similar note to docs
* Update positive_label description in API docs
* Add Macedonian models to website (#8637)
* Fix Azerbaijani init, extend lang init tests (#8656)
* Extend langs in initialize tests
* Fix az init
* Fix ru/uk lemmatizer mp with spawn (#8657)
Use an instance variable instead a class variable for the morphological
analzyer so that multiprocessing with spawn is possible.
* Use 0-vector for OOV lexemes (#8639)
* Set version to v3.0.7
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Paul O'Leary McCann <polm@dampfkraft.com>
* Update sent_starts in Example.from_dict
Update `sent_starts` for `Example.from_dict` so that `Optional[bool]`
values have the same meaning as for `Token.is_sent_start`.
Use `Optional[bool]` as the type for sent start values in the docs.
* Use helper function for conversion to ternary ints
* extend span scorer with consider_label and allow_overlap
* unit test for spans y2x overlap
* add score_spans unit test
* docs for new fields in scorer.score_spans
* rename to include_label
* spell out if-else for clarity
* rename to 'labeled'
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Get basic beam tests working
* Get basic beam tests working
* Compile _beam_utils
* Remove prints
* Test beam density
* Beam parser seems to train
* Draft beam NER
* Upd beam
* Add hypothesis as dev dependency
* Implement missing is-gold-parse method
* Implement early update
* Fix state hashing
* Fix test
* Fix test
* Default to non-beam in parser constructor
* Improve oracle for beam
* Start refactoring beam
* Update test
* Refactor beam
* Update nn
* Refactor beam and weight by cost
* Update ner beam settings
* Update test
* Add __init__.pxd
* Upd test
* Fix test
* Upd test
* Fix test
* Remove ring buffer history from StateC
* WIP change arc-eager transitions
* Add state tests
* Support ternary sent start values
* Fix arc eager
* Fix NER
* Pass oracle cut size for beam
* Fix ner test
* Fix beam
* Improve StateC.clone
* Improve StateClass.borrow
* Work directly with StateC, not StateClass
* Remove print statements
* Fix state copy
* Improve state class
* Refactor parser oracles
* Fix arc eager oracle
* Fix arc eager oracle
* Use a vector to implement the stack
* Refactor state data structure
* Fix alignment of sent start
* Add get_aligned_sent_starts method
* Add test for ae oracle when bad sentence starts
* Fix sentence segment handling
* Avoid Reduce that inserts illegal sentence
* Update preset SBD test
* Fix test
* Remove prints
* Fix sent starts in Example
* Improve python API of StateClass
* Tweak comments and debug output of arc eager
* Upd test
* Fix state test
* Fix state test
* Replace pytokenizations with internal alignment
Replace pytokenizations with internal alignment algorithm that is
restricted to only allow differences in whitespace and capitalization.
* Rename `spacy.training.align` to `spacy.training.alignment` to contain
the `Alignment` dataclass
* Implement `get_alignments` in `spacy.training.align`
* Refactor trailing whitespace handling
* Remove unnecessary exception for empty docs
Allow a non-empty whitespace-only doc to be aligned with an empty doc
* Remove empty docs exceptions completely
* rename Pipe to TrainablePipe
* split functionality between Pipe and TrainablePipe
* remove unnecessary methods from certain components
* cleanup
* hasattr(component, "pipe") should be sufficient again
* remove serialization and vocab/cfg from Pipe
* unify _ensure_examples and validate_examples
* small fixes
* hasattr checks for self.cfg and self.vocab
* make is_resizable and is_trainable properties
* serialize strings.json instead of vocab
* fix KB IO + tests
* fix typos
* more typos
* _added_strings as a set
* few more tests specifically for _added_strings field
* bump to 3.0.0a36
* Refactor Token morph setting
* Remove `Token.morph_`
* Add `Token.set_morph()`
* `0` resets `token.c.morph` to unset
* Any other values are passed to `Morphology.add`
* Add token.morph setter to set from MorphAnalysis
In order to make it easier to construct `Doc` objects as training data,
modify how missing and blocked entity tokens are set to prioritize
setting `O` and missing entity tokens for training purposes over setting
blocked entity tokens.
* `Doc.ents` setter sets tokens outside entity spans to `O` regardless
of the current state of each token
* For `Doc.ents`, setting a span with a missing label sets the `ent_iob`
to missing instead of blocked
* `Doc.block_ents(spans)` marks spans as hard `O` for use with the
`EntityRecognizer`