* Add convert CLI option to merge CoNLL-U subtokens
Add `-T` option to convert CLI that merges CoNLL-U subtokens into one
token in the converted data. Each CoNLL-U sentence is read into a `Doc`
and the `Retokenizer` is used to merge subtokens with features as
follows:
* `orth` is the merged token orth (should correspond to raw text and `#
text`)
* `tag` is all subtoken tags concatenated with `_`, e.g. `ADP_DET`
* `pos` is the POS of the syntactic root of the span (as determined by
the Retokenizer)
* `morph` is all morphological features merged
* `lemma` is all subtoken lemmas concatenated with ` `, e.g. `de o`
* with `-m` all morphological features are combined with the tag using
the separator `__`, e.g.
`ADP_DET__Definite=Def|Gender=Masc|Number=Sing|PronType=Art`
* `dep` is the dependency relation for the syntactic root of the span
(as determined by the Retokenizer)
Concatenated tags will be mapped to the UD POS of the syntactic root
(e.g., `ADP`) and the morphological features will be the combined
features.
In many cases, the original UD subtokens can be reconstructed from the
available features given a language-specific lookup table, e.g.,
Portuguese `do / ADP_DET /
Definite=Def|Gender=Masc|Number=Sing|PronType=Art` is `de / ADP`, `o /
DET / Definite=Def|Gender=Masc|Number=Sing|PronType=Art` or lookup rules
for forms containing open class words like Spanish `hablarlo / VERB_PRON
/
Case=Acc|Gender=Masc|Number=Sing|Person=3|PrepCase=Npr|PronType=Prs|VerbForm=Inf`.
* Clean up imports
Instead of a hard-coded NER tag simplification function that was only
intended for NorNE, map NER tags in CoNLL-U converter using a dict
provided as JSON as a command-line option.
Map NER entity types or new tag or to "" for 'O', e.g.:
```
{"PER": "PERSON", "BAD": ""}
=>
B-PER -> B-PERSON
B-BAD -> O
```
* Switch to train_dataset() function in train CLI
* Fixes for pipe() methods in pipeline components
* Don't clobber `examples` variable with `as_example` in pipe() methods
* Remove unnecessary traversals of `examples`
* Update Parser.pipe() for Examples
* Add `as_examples` kwarg to `pipe()` with implementation to return
`Example`s
* Accept `Doc` or `Example` in `pipe()` with `_get_doc()` (copied from
`Pipe`)
* Fixes to Example implementation in spacy.gold
* Move `make_projective` from an attribute of Example to an argument of
`Example.get_gold_parses()`
* Head of 0 are not treated as unset
* Unset heads are set to self rather than `None` (which causes problems
while projectivizing)
* Check for `Doc` (not just not `None`) when creating GoldParses for
pre-merged example
* Don't clobber `examples` variable in `iter_gold_docs()`
* Add/modify gold tests for handling projectivity
* In JSON roundtrip compare results from `dev_dataset` rather than
`train_dataset` to avoid projectivization (and other potential
modifications)
* Add test for projective train vs. nonprojective dev versions of the
same `Doc`
* Handle ignore_misaligned as arg rather than attr
Move `ignore_misaligned` from an attribute of `Example` to an argument
to `Example.get_gold_parses()`, which makes it parallel to
`make_projective`.
Add test with old and new align that checks whether `ignore_misaligned`
errors are raised as expected (only for new align).
* Remove unused attrs from gold.pxd
Remove `ignore_misaligned` and `make_projective` from `gold.pxd`
* Restructure Example with merged sents as default
An `Example` now includes a single `TokenAnnotation` that includes all
the information from one `Doc` (=JSON `paragraph`). If required, the
individual sentences can be returned as a list of examples with
`Example.split_sents()` with no raw text available.
* Input/output a single `Example.token_annotation`
* Add `sent_starts` to `TokenAnnotation` to handle sentence boundaries
* Replace `Example.merge_sents()` with `Example.split_sents()`
* Modify components to use a single `Example.token_annotation`
* Pipeline components
* conllu2json converter
* Rework/rename `add_token_annotation()` and `add_doc_annotation()` to
`set_token_annotation()` and `set_doc_annotation()`, functions that set
rather then appending/extending.
* Rename `morphology` to `morphs` in `TokenAnnotation` and `GoldParse`
* Add getters to `TokenAnnotation` to supply default values when a given
attribute is not available
* `Example.get_gold_parses()` in `spacy.gold._make_golds()` is only
applied on single examples, so the `GoldParse` is returned saved in the
provided `Example` rather than creating a new `Example` with no other
internal annotation
* Update tests for API changes and `merge_sents()` vs. `split_sents()`
* Refer to Example.goldparse in iter_gold_docs()
Use `Example.goldparse` in `iter_gold_docs()` instead of `Example.gold`
because a `None` `GoldParse` is generated with ignore_misaligned and
generating it on-the-fly can raise an unwanted AlignmentError
* Fix make_orth_variants()
Fix bug in make_orth_variants() related to conversion from multiple to
one TokenAnnotation per Example.
* Add basic test for make_orth_variants()
* Replace try/except with conditionals
* Replace default morph value with set
* OrigAnnot class instead of gold.orig_annot list of zipped tuples
* from_orig to replace from_annot_tuples
* rename to RawAnnot
* some unit tests for GoldParse creation and internal format
* removing orig_annot and switching to lists instead of tuple
* rewriting tuples to use RawAnnot (+ debug statements, WIP)
* fix pop() changing the data
* small fixes
* pop-append fixes
* return RawAnnot for existing GoldParse to have uniform interface
* clean up imports
* fix merge_sents
* add unit test for 4402 with new structure (not working yet)
* introduce DocAnnot
* typo fixes
* add unit test for merge_sents
* rename from_orig to from_raw
* fixing unit tests
* fix nn parser
* read_annots to produce text, doc_annot pairs
* _make_golds fix
* rename golds_to_gold_annots
* small fixes
* fix encoding
* have golds_to_gold_annots use DocAnnot
* missed a spot
* merge_sents as function in DocAnnot
* allow specifying only part of the token-level annotations
* refactor with Example class + underlying dicts
* pipeline components to work with Example objects (wip)
* input checking
* fix yielding
* fix calls to update
* small fixes
* fix scorer unit test with new format
* fix kwargs order
* fixes for ud and conllu scripts
* fix reading data for conllu script
* add in proper errors (not fixed numbering yet to avoid merge conflicts)
* fixing few more small bugs
* fix EL script
* merging conllu/conll and conllubio scripts
* tabs to spaces
* removing conllubio2json from converters/__init__.py
* Move not-really-CLI tests to misc
* Add converter test using no-ud data
* Fix test I broke
* removing include_biluo parameter
* fixing read_conllx
* remove include_biluo from convert.py
* Support nowrap setting in util.prints
* Tidy up and fix whitespace
* Simplify script and use read_jsonl helper
* Add JSON schemas (see #2928)
* Deprecate Doc.print_tree
Will be replaced with Doc.to_json, which will produce a unified format
* Add Doc.to_json() method (see #2928)
Converts Doc objects to JSON using the same unified format as the training data. Method also supports serializing selected custom attributes in the doc._. space.
* Remove outdated test
* Add write_json and write_jsonl helpers
* WIP: Update spacy train
* Tidy up spacy train
* WIP: Use wasabi for formatting
* Add GoldParse helpers for JSON format
* WIP: add debug-data command
* Fix typo
* Add missing import
* Update wasabi pin
* Add missing import
* 💫 Refactor CLI (#2943)
To be merged into #2932.
## Description
- [x] refactor CLI To use [`wasabi`](https://github.com/ines/wasabi)
- [x] use [`black`](https://github.com/ambv/black) for auto-formatting
- [x] add `flake8` config
- [x] move all messy UD-related scripts to `cli.ud`
- [x] make converters function that take the opened file and return the converted data (instead of having them handle the IO)
### Types of change
enhancement
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
* Update wasabi pin
* Delete old test
* Update errors
* Fix typo
* Tidy up and format remaining code
* Fix formatting
* Improve formatting of messages
* Auto-format remaining code
* Add tok2vec stuff to spacy.train
* Fix typo
* Update wasabi pin
* Fix path checks for when train() is called as function
* Reformat and tidy up pretrain script
* Update argument annotations
* Raise error if model language doesn't match lang
* Document new train command
* Add spacy.errors module
* Update deprecation and user warnings
* Replace errors and asserts with new error message system
* Remove redundant asserts
* Fix whitespace
* Add messages for print/util.prints statements
* Fix typo
* Fix typos
* Move CLI messages to spacy.cli._messages
* Add decorator to display error code with message
An implementation like this is nice because it only modifies the string when it's retrieved from the containing class – so we don't have to worry about manipulating tracebacks etc.
* Remove unused link in spacy.about
* Update errors for invalid pipeline components
* Improve error for unknown factories
* Add displaCy warnings
* Update formatting consistency
* Move error message to spacy.errors
* Update errors and check if doc returned by component is None