* OrigAnnot class instead of gold.orig_annot list of zipped tuples
* from_orig to replace from_annot_tuples
* rename to RawAnnot
* some unit tests for GoldParse creation and internal format
* removing orig_annot and switching to lists instead of tuple
* rewriting tuples to use RawAnnot (+ debug statements, WIP)
* fix pop() changing the data
* small fixes
* pop-append fixes
* return RawAnnot for existing GoldParse to have uniform interface
* clean up imports
* fix merge_sents
* add unit test for 4402 with new structure (not working yet)
* introduce DocAnnot
* typo fixes
* add unit test for merge_sents
* rename from_orig to from_raw
* fixing unit tests
* fix nn parser
* read_annots to produce text, doc_annot pairs
* _make_golds fix
* rename golds_to_gold_annots
* small fixes
* fix encoding
* have golds_to_gold_annots use DocAnnot
* missed a spot
* merge_sents as function in DocAnnot
* allow specifying only part of the token-level annotations
* refactor with Example class + underlying dicts
* pipeline components to work with Example objects (wip)
* input checking
* fix yielding
* fix calls to update
* small fixes
* fix scorer unit test with new format
* fix kwargs order
* fixes for ud and conllu scripts
* fix reading data for conllu script
* add in proper errors (not fixed numbering yet to avoid merge conflicts)
* fixing few more small bugs
* fix EL script
* Set extensions when write_conllu() is called
`run_eval.py` uses the `write_conllu()` function from `ud_train.py` by
itself, so it needs to set the token extensions if necessary.
* Switch from try to if
* Rework Chinese language initialization
* Create a `ChineseTokenizer` class
* Modify jieba post-processing to handle whitespace correctly
* Modify non-jieba character tokenization to handle whitespace correctly
* Add a `create_tokenizer()` method to `ChineseDefaults`
* Load lexical attributes
* Update Chinese tag_map for UD v2
* Add very basic Chinese tests
* Test tokenization with and without jieba
* Test `like_num` attribute
* Fix try_jieba_import()
* Fix zh code formatting
* force extensions to avoid clash between example scripts
* fix arg order and default file encoding
* add example config for conllu script
* newline
* move extension definitions to main function
* few more encodings fixes
The model registry refactor of the Tok2Vec function broke loading models
trained with the previous function, because the model tree was slightly
different. Specifically, the new function wrote:
concatenate(norm, prefix, suffix, shape)
To build the embedding layer. In the previous implementation, I had used
the operator overloading shortcut:
( norm | prefix | suffix | shape )
This actually gets mapped to a binary association, giving something
like:
concatenate(norm, concatenate(prefix, concatenate(suffix, shape)))
This is a different tree, so the layers iterate differently and we
loaded the weights wrongly.
* Xfail new tokenization test
* Put new alignment behind feature flag
* Move USE_ALIGN to top of the file [ci skip]
Co-authored-by: Ines Montani <ines@ines.io>
The `Matcher` in `merge_subtokens()` returns all possible subsequences
of `subtok`, so for sequences of two or more subtoks it's necessary to
filter the matches so that the retokenizer is only merging the longest
matches with no overlapping spans.
* Add arch for MishWindowEncoder
* Support mish in tok2vec and conv window >=2
* Pass new tok2vec settings from parser
* Syntax error
* Fix tok2vec setting
* Fix registration of MishWindowEncoder
* Fix receptive field setting
* Fix mish arch
* Pass more options from parser
* Support more tok2vec options in pretrain
* Require thinc 7.3
* Add docs [ci skip]
* Require thinc 7.3.0.dev0 to run CI
* Run black
* Fix typo
* Update Thinc version
Co-authored-by: Ines Montani <ines@ines.io>