* Support a cfg field in transition system
* Make NER 'has gold' check use right alignment for span
* Pass 'negative_samples_key' property into NER transition system
* Add field for negative samples to NER transition system
* Check neg_key in NER has_gold
* Support negative examples in NER oracle
* Test for negative examples in NER
* Fix name of config variable in NER
* Remove vestiges of old-style partial annotation
* Remove obsolete tests
* Add comment noting lack of support for negative samples in parser
* Additions to "neg examples" PR (#8201)
* add custom error and test for deprecated format
* add test for unlearning an entity
* add break also for Begin's cost
* add negative_samples_key property on Parser
* rename
* extend docs & fix some older docs issues
* add subclass constructors, clean up tests, fix docs
* add flaky test with ValueError if gold parse was not found
* remove ValueError if n_gold == 0
* fix docstring
* Hack in environment variables to try out training
* Remove hack
* Remove NER hack, and support 'negative O' samples
* Fix O oracle
* Fix transition parser
* Remove 'not O' from oracle
* Fix NER oracle
* check for spans in both gold.ents and gold.spans and raise if so, to prevent memory access violation
* use set instead of list in consistency check
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Fill in deps if not provided with heads
Before this change, if heads were passed without deps they would be
silently ignored, which could be confusing. See #8334.
* Use "dep" instead of a blank string
This is the customary placeholder dep. It might be better to show an
error here instead though.
* Throw error on heads without deps
* Add a test
* Fix tests
* Formatting
* Fix all tests
* Fix a test I missed
* Revise error message
* Clean up whitespace
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Don't add duplicate patterns (fix#8216)
* Refactor EntityRuler init
This simplifies the EntityRuler init code. This is helpful as prep for
allowing the EntityRuler to reset itself.
* Make EntityRuler.clear reset matchers
Includes a new test for this.
* Tidy PhraseMatcher instantiation
Since the attr can be None safely now, the guard if is no longer
required here.
Also renamed the `_validate` attr. Maybe it's not needed?
* Fix NER test
* Add test to make sure patterns aren't increasing
* Move test to regression tests
* Preserve existing ENT_KB_ID annotation in NER
Preserve `ent_kb_id` annotation on existing entity spans, which is not
preserved by the transition system.
* Simplify kb_id assignment
* Simplify further
* clean up of ner tests
* beam_parser tests
* implement get_beam_parses and scored_parses for the dep parser
* we don't have to add the parse if there are no arcs
* small fixes and formatting
* bring test_issue4313 up-to-date, currently fails
* formatting
* add get_beam_parses method back
* add scored_ents function
* delete tag map
* Get basic beam tests working
* Get basic beam tests working
* Compile _beam_utils
* Remove prints
* Test beam density
* Beam parser seems to train
* Draft beam NER
* Upd beam
* Add hypothesis as dev dependency
* Implement missing is-gold-parse method
* Implement early update
* Fix state hashing
* Fix test
* Fix test
* Default to non-beam in parser constructor
* Improve oracle for beam
* Start refactoring beam
* Update test
* Refactor beam
* Update nn
* Refactor beam and weight by cost
* Update ner beam settings
* Update test
* Add __init__.pxd
* Upd test
* Fix test
* Upd test
* Fix test
* Remove ring buffer history from StateC
* WIP change arc-eager transitions
* Add state tests
* Support ternary sent start values
* Fix arc eager
* Fix NER
* Pass oracle cut size for beam
* Fix ner test
* Fix beam
* Improve StateC.clone
* Improve StateClass.borrow
* Work directly with StateC, not StateClass
* Remove print statements
* Fix state copy
* Improve state class
* Refactor parser oracles
* Fix arc eager oracle
* Fix arc eager oracle
* Use a vector to implement the stack
* Refactor state data structure
* Fix alignment of sent start
* Add get_aligned_sent_starts method
* Add test for ae oracle when bad sentence starts
* Fix sentence segment handling
* Avoid Reduce that inserts illegal sentence
* Update preset SBD test
* Fix test
* Remove prints
* Fix sent starts in Example
* Improve python API of StateClass
* Tweak comments and debug output of arc eager
* Upd test
* Fix state test
* Fix state test
In order to make it easier to construct `Doc` objects as training data,
modify how missing and blocked entity tokens are set to prioritize
setting `O` and missing entity tokens for training purposes over setting
blocked entity tokens.
* `Doc.ents` setter sets tokens outside entity spans to `O` regardless
of the current state of each token
* For `Doc.ents`, setting a span with a missing label sets the `ent_iob`
to missing instead of blocked
* `Doc.block_ents(spans)` marks spans as hard `O` for use with the
`EntityRecognizer`
* Refactor Docs.is_ flags
* Add derived `Doc.has_annotation` method
* `Doc.has_annotation(attr)` returns `True` for partial annotation
* `Doc.has_annotation(attr, require_complete=True)` returns `True` for
complete annotation
* Add deprecation warnings to `is_tagged`, `is_parsed`, `is_sentenced`
and `is_nered`
* Add `Doc._get_array_attrs()`, which returns a full list of `Doc` attrs
for use with `Doc.to_array`, `Doc.to_bytes` and `Doc.from_docs`. The
list is the `DocBin` attributes list plus `SPACY` and `LENGTH`.
Notes on `Doc.has_annotation`:
* `HEAD` is converted to `DEP` because heads don't have an unset state
* Accept `IS_SENT_START` as a synonym of `SENT_START`
Additional changes:
* Add `NORM`, `ENT_ID` and `SENT_START` to default attributes for
`DocBin`
* In `Doc.from_array()` the presence of `DEP` causes `HEAD` to override
`SENT_START`
* In `Doc.from_array()` using `attrs` other than
`Doc._get_array_attrs()` (i.e., a user's custom list rather than our
default internal list) with both `HEAD` and `SENT_START` shows a warning
that `HEAD` will override `SENT_START`
* `set_children_from_heads` does not require dependency labels to set
sentence boundaries and sets `sent_start` for all non-sentence starts to
`-1`
* Fix call to set_children_form_heads
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* Clean up spacy.tokens
* Update `set_children_from_heads`:
* Don't check `dep` when setting lr_* or sentence starts
* Set all non-sentence starts to `False`
* Use `set_children_from_heads` in `Token.head` setter
* Reduce similar/duplicate code (admittedly adds a bit of overhead)
* Update sentence starts consistently
* Remove unused `Doc.set_parse`
* Minor changes:
* Declare cython variables (to avoid cython warnings)
* Clean up imports
* Modify set_children_from_heads to set token range
Modify `set_children_from_heads` so that it adjust tokens within a
specified range rather then the whole document.
Modify the `Token.head` setter to adjust only the tokens affected by the
new head assignment.
* ensure Language passes on valid examples for initialization
* fix tagger model initialization
* check for valid get_examples across components
* assume labels were added before begin_training
* fix senter initialization
* fix morphologizer initialization
* use methods to check arguments
* test textcat init, requires thinc>=8.0.0a31
* fix tok2vec init
* fix entity linker init
* use islice
* fix simple NER
* cleanup debug model
* fix assert statements
* fix tests
* throw error when adding a label if the output layer can't be resized anymore
* fix test
* add failing test for simple_ner
* UX improvements
* morphologizer UX
* assume begin_training gets a representative set and processes the labels
* remove assumptions for output of untrained NER model
* restore test for original purpose
* Add Lemmatizer and simplify related components
* Add `Lemmatizer` pipe with `lookup` and `rule` modes using the
`Lookups` tables.
* Reduce `Tagger` to a simple tagger that sets `Token.tag` (no pos or lemma)
* Reduce `Morphology` to only keep track of morph tags (no tag map, lemmatizer,
or morph rules)
* Remove lemmatizer from `Vocab`
* Adjust many many tests
Differences:
* No default lookup lemmas
* No special treatment of TAG in `from_array` and similar required
* Easier to modify labels in a `Tagger`
* No extra strings added from morphology / tag map
* Fix test
* Initial fix for Lemmatizer config/serialization
* Adjust init test to be more generic
* Adjust init test to force empty Lookups
* Add simple cache to rule-based lemmatizer
* Convert language-specific lemmatizers
Convert language-specific lemmatizers to component lemmatizers. Remove
previous lemmatizer class.
* Fix French and Polish lemmatizers
* Remove outdated UPOS conversions
* Update Russian lemmatizer init in tests
* Add minimal init/run tests for custom lemmatizers
* Add option to overwrite existing lemmas
* Update mode setting, lookup loading, and caching
* Make `mode` an immutable property
* Only enforce strict `load_lookups` for known supported modes
* Move caching into individual `_lemmatize` methods
* Implement strict when lang is not found in lookups
* Fix tables/lookups in make_lemmatizer
* Reallow provided lookups and allow for stricter checks
* Add lookups asset to all Lemmatizer pipe tests
* Rename lookups in lemmatizer init test
* Clean up merge
* Refactor lookup table loading
* Add helper from `load_lemmatizer_lookups` that loads required and
optional lookups tables based on settings provided by a config.
Additional slight refactor of lookups:
* Add `Lookups.set_table` to set a table from a provided `Table`
* Reorder class definitions to be able to specify type as `Table`
* Move registry assets into test methods
* Refactor lookups tables config
Use class methods within `Lemmatizer` to provide the config for
particular modes and to load the lookups from a config.
* Add pipe and score to lemmatizer
* Simplify Tagger.score
* Add missing import
* Clean up imports and auto-format
* Remove unused kwarg
* Tidy up and auto-format
* Update docstrings for Lemmatizer
Update docstrings for Lemmatizer.
Additionally modify `is_base_form` API to take `Token` instead of
individual features.
* Update docstrings
* Remove tag map values from Tagger.add_label
* Update API docs
* Fix relative link in Lemmatizer API docs
* moving syntax folder to _parser_internals
* moving nn_parser and transition_system
* move nn_parser and transition_system out of internals folder
* moving nn_parser code into transition_system file
* rename transition_system to transition_parser
* moving parser_model and _state to ml
* move _state back to internals
* The Parser now inherits from Pipe!
* small code fixes
* removing unnecessary imports
* remove link_vectors_to_models
* transition_system to internals folder
* little bit more cleanup
* newlines
* Update with WIP
* Update with WIP
* Update with pipeline serialization
* Update types and pipe factories
* Add deep merge, tidy up and add tests
* Fix pipe creation from config
* Don't validate default configs on load
* Update spacy/language.py
Co-authored-by: Ines Montani <ines@ines.io>
* Adjust factory/component meta error
* Clean up factory args and remove defaults
* Add test for failing empty dict defaults
* Update pipeline handling and methods
* provide KB as registry function instead of as object
* small change in test to make functionality more clear
* update example script for EL configuration
* Fix typo
* Simplify test
* Simplify test
* splitting pipes.pyx into separate files
* moving default configs to each component file
* fix batch_size type
* removing default values from component constructors where possible (TODO: test 4725)
* skip instead of xfail
* Add test for config -> nlp with multiple instances
* pipeline.pipes -> pipeline.pipe
* Tidy up, document, remove kwargs
* small cleanup/generalization for Tok2VecListener
* use DEFAULT_UPSTREAM field
* revert to avoid circular imports
* Fix tests
* Replace deprecated arg
* Make model dirs require config
* fix pickling of keyword-only arguments in constructor
* WIP: clean up and integrate full config
* Add helper to handle function args more reliably
Now also includes keyword-only args
* Fix config composition and serialization
* Improve config debugging and add visual diff
* Remove unused defaults and fix type
* Remove pipeline and factories from meta
* Update spacy/default_config.cfg
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update spacy/default_config.cfg
* small UX edits
* avoid printing stack trace for debug CLI commands
* Add support for language-specific factories
* specify the section of the config which holds the model to debug
* WIP: add Language.from_config
* Update with language data refactor WIP
* Auto-format
* Add backwards-compat handling for Language.factories
* Update morphologizer.pyx
* Fix morphologizer
* Update and simplify lemmatizers
* Fix Japanese tests
* Port over tagger changes
* Fix Chinese and tests
* Update to latest Thinc
* WIP: xfail first Russian lemmatizer test
* Fix component-specific overrides
* fix nO for output layers in debug_model
* Fix default value
* Fix tests and don't pass objects in config
* Fix deep merging
* Fix lemma lookup data registry
Only load the lookups if an entry is available in the registry (and if spacy-lookups-data is installed)
* Add types
* Add Vocab.from_config
* Fix typo
* Fix tests
* Make config copying more elegant
* Fix pipe analysis
* Fix lemmatizers and is_base_form
* WIP: move language defaults to config
* Fix morphology type
* Fix vocab
* Remove comment
* Update to latest Thinc
* Add morph rules to config
* Tidy up
* Remove set_morphology option from tagger factory
* Hack use_gpu
* Move [pipeline] to top-level block and make [nlp.pipeline] list
Allows separating component blocks from component order – otherwise, ordering the config would mean a changed component order, which is bad. Also allows initial config to define more components and not use all of them
* Fix use_gpu and resume in CLI
* Auto-format
* Remove resume from config
* Fix formatting and error
* [pipeline] -> [components]
* Fix types
* Fix tagger test: requires set_morphology?
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
* step_through tests: skip instead of xfail
* test_empty_doc should be fixed with new Thinc version
* remove outdated test (there are other misaligned tests now)
* xfail reason
* fix test according to french exceptions
* clarified some skipped tests
* skip ukranian test instead of xfail
* skip instead of xfail
* skip + reason instead of xfail
* removed obsolete tests referring to removed "set_frozen" functionality
* fix test 999
* remove unused AlignmentError
* remove xfail where possible, skip otherwise
* increment thinc release for empty_doc test