* Make Example require a Doc object (previously optional)
* Clarify methods in GoldCorpus
* WIP refactor Example
* Refactor Example.split_sents
* Fix test
* Fix augment
* Update test
* Update test
* Fix import
* Update test_scorer
* Update Example
* Move get_parses_from_example to spacy.syntax
* Get GoldParse out of Example
* Avoid expecting GoldParse input in parser
* Add Alignment to spacy.gold.align
* Update Example object
* Add comment
* Update pipeline
* Fix imports
* Simplify gold_io
* WIP on GoldCorpus
* Update test
* Xfail some gold tests
* Remove ignore_misaligned option from GoldCorpus
* Fix Example constructor
* Update test
* Fix usage of Example
* Add deprecated_get_gold method on Example
* Patch scorer
* Fix test
* Fix test
* Update tests
* Xfail a test
* Fix passing of make_projective
* Pass make_projective by default
* Hack data format in Example.from_dict
* Update tests
* Fix Example.from_dict
* Update morphologizer
* Fix entity linker
* Add get_field to TokenAnnotation
* Fix Example.get_aligned
* Update test
* Fix alignment
* Fix corpus
* Fix GoldCorpus
* Handle misaligned
* Format
* Fix missing import
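Taken together, these commits converge on an `Example` object that wraps a required `Doc` plus its reference annotations. A minimal usage sketch of the intended API (the exact signature of `Example.from_dict` was still in flux at this point, so treat the details as assumptions):

```python
from spacy.lang.en import English
from spacy.gold import Example  # Example lived in spacy.gold during v3 development

nlp = English()
# The Doc argument is now required (previously optional)
doc = nlp.make_doc("Apple is looking at buying U.K. startup")
annots = {"entities": [(0, 5, "ORG"), (27, 31, "GPE")]}
example = Example.from_dict(doc, annots)
```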
During `nlp.update`, components can be passed a boolean `set_annotations`
flag to indicate whether they should assign annotations to the `Doc`. This
flag needs to be set if downstream components expect to use the
annotations during training, e.g. if we want to use tagger features in
the parser.
Components can specify their assignments and requirements, so we can
work out which components have these inter-dependencies and use that to
decide whether to pass `set_annotations=True`.
We could also pass `set_annotations=True` always, or even make this the
only behaviour. The downside is that it would require the `Doc` objects
to be created afresh to avoid problematic modifications. One approach
would be to make a fresh copy of the `Doc` objects within
`nlp.update()`, so that we can write to the objects without any
problems. If we did that, we could drop this logic and also drop the
`set_annotations` mechanism. I would be fine with that approach,
although it risks introducing some performance overhead, and we'd have
to take care to copy all extension attributes etc.
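As a rough illustration of the dependency check described above (a hypothetical sketch, not the actual implementation; it assumes each component exposes `assigns` and `requires` as lists of strings):

```python
def needs_annotations(pipeline, index):
    """Guess whether the component at `index` should set annotations:
    True if anything it assigns is required by a later component."""
    _, component = pipeline[index]
    assigns = set(getattr(component, "assigns", []))
    for _, later in pipeline[index + 1:]:
        if assigns & set(getattr(later, "requires", [])):
            return True
    return False

# Inside a hypothetical nlp.update loop:
# for i, (name, proc) in enumerate(nlp.pipeline):
#     proc.update(examples, set_annotations=needs_annotations(nlp.pipeline, i))
```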
* setting the KB in the entity linker (EL) constructor, similar to how the model is passed in
* removing wikipedia example files - moved to projects
* throw an error when `nlp.update` is called with 2 positional arguments (see the sketch after this list)
* rewriting the config logic in `create_pipe` to accommodate other objects (e.g. the KB) in the config
* update config files with new parameters
* avoid training pipeline components that don't have a model (like sentencizer)
* various small fixes + UX improvements
* small fixes
* set thinc to 8.0.0a9 everywhere
* remove outdated comment
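A sketch of the calling convention these changes enforce: `nlp.update` now takes a single batch of examples rather than separate texts and gold annotations, so the old two-positional-argument form raises an error (the training-loop details here are assumptions):

```python
from spacy.lang.en import English
from spacy.gold import Example

nlp = English()
ner = nlp.create_pipe("ner")
ner.add_label("ORG")
nlp.add_pipe(ner)
optimizer = nlp.begin_training()

train_data = [("Apple is huge", {"entities": [(0, 5, "ORG")]})]
examples = [Example.from_dict(nlp.make_doc(t), a) for t, a in train_data]
nlp.update(examples, sgd=optimizer)  # new: one batch of Example objects
# nlp.update(texts, golds)           # old two-positional form now raises an error
```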
* Fix most_similar for vectors with unused rows
Address issues related to the unused rows in the vector table and
`most_similar`:
* Update `most_similar()` to search only through rows that are in use
according to `key2row`.
* Raise an error when `n` in `most_similar(n=n)` is larger than the
number of vectors in the table.
* Set and restore `_unset` correctly when vectors are added or
deserialized so that new vectors are added in the correct row.
* Set data and keys to the same length in `Vocab.prune_vectors()` to
avoid spurious entries in `key2row`.
* Fix regression test using `most_similar`
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
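A small usage sketch of the fixed behaviour (hypothetical values): only rows in use are searched, and asking for more neighbours than there are stored vectors now raises an error:

```python
import numpy
from spacy.vocab import Vocab

vocab = Vocab()
vocab.set_vector("cat", numpy.random.rand(5).astype("f"))
vocab.set_vector("dog", numpy.random.rand(5).astype("f"))

queries = numpy.asarray([vocab.get_vector("cat")])
keys, rows, scores = vocab.vectors.most_similar(queries, n=2)  # searches only used rows
# vocab.vectors.most_similar(queries, n=3) now raises an error:
# only 2 rows are in use according to key2row
```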
* Add warning for misaligned character offset spans
* Resolve conflict
* Filter warnings in example scripts
Filter warnings in example scripts to show warnings once, in particular
warnings about misaligned entities.
Co-authored-by: Ines Montani <ines@ines.io>
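One way to get the once-only behaviour described above, using the standard library (a sketch; the exact filter used in the example scripts may differ):

```python
import warnings

# Show each distinct warning (e.g. about misaligned entity spans) only once
warnings.filterwarnings("once", category=UserWarning, module="spacy")
```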
Update Polish tokenizer for UD_Polish-PDB, which is a relatively major
change from the existing tokenizer. Unused exception files and
conflicting test cases have been removed.
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>