UD_Danish-DDT has (as far as I can tell) hallucinated periods after
abbreviations, so the changes are an artifact of the corpus and not due
to anything meaningful about Danish tokenization.
* avoid changing original config
* fix elif structure, batch with just int crashes otherwise
* tok2vec example with doc2feats, encode and embed architectures
* further clean up MultiHashEmbed
* further generalize Tok2Vec to work with extract-embed-encode parts
* avoid initializing the charembed layer with Docs (for now?)
* small fixes for bilstm config (still does not run)
* rename to core layer
* move new configs
* walk model to set nI instead of using core ref
* fix senter overfitting test to be more similar to the training data (avoid flaky behaviour)
* merge_entities sets the vector in the vocab for the merged token
* add unit test
* import unicode_literals
* move code to _merge function
* only set vector if vocab has non-zero vectors
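A rough sketch of the guard described in the bullets above, using a toy stand-in rather than spaCy's real classes. `ToyVocab`, `merge_tokens`, and the choice to average the part vectors for the merged entry are all assumptions made for illustration:

```python
class ToyVocab:
    """Hypothetical stand-in for a vocab with an optional vector table."""

    def __init__(self, width):
        self.width = width  # 0 means "no vectors loaded"
        self.vectors = {}

    def set_vector(self, text, vector):
        self.vectors[text] = vector


def merge_tokens(vocab, words, start, end):
    """Merge words[start:end] into one token and return the new word list."""
    parts = words[start:end]
    merged = " ".join(parts)
    # Only touch the vector table if the vocab actually has vectors.
    if vocab.width > 0:
        rows = [vocab.vectors.get(w, [0.0] * vocab.width) for w in parts]
        # Assumption: the merged token gets the mean of its parts.
        mean = [sum(col) / len(rows) for col in zip(*rows)]
        vocab.set_vector(merged, mean)
    return words[:start] + [merged] + words[end:]
```

With a zero-width vocab the function still merges the words but leaves the vector table empty.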
* Update sentence recognizer
* rename `sentrec` to `senter`
* use `spacy.HashEmbedCNN.v1` by default
* update to follow `Tagger` modifications
* remove component methods that can be inherited from `Tagger`
* add simple initialization and overfitting pipeline tests
* Update serialization test for senter
* Improve token head verification
Improve the verification for valid token heads when heads are set:
* in `Token.head`: heads come from the same document
* in `Doc.from_array()`: head indices are within the bounds of the
document
* Improve error message
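The bounds check for `Doc.from_array()` can be sketched in plain Python. This is not the actual implementation: `check_heads` is a hypothetical helper, and it assumes heads are stored as offsets relative to each token:

```python
def check_heads(rel_heads):
    """Raise ValueError if any relative head offset points outside the doc.

    rel_heads[i] is the offset from token i to its head (0 means root).
    This relative encoding is an assumption of the sketch.
    """
    n = len(rel_heads)
    for i, offset in enumerate(rel_heads):
        head = i + offset
        if not 0 <= head < n:
            raise ValueError(
                f"Token {i} has head index {head}, outside the range 0..{n - 1}"
            )
```

Calling `check_heads([1, 0, -1])` passes silently, while `check_heads([0, 5])` raises because token 1 would point at index 6.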
* fix grad_clip naming
* cleaning up pretrained_vectors out of cfg
* further refactoring of Model inits
* move Model building out of pipes
* further refactor to require a model config when creating a pipe
* small fixes
* making cfg in nn_parser more consistent
* fixing nr_class for parser
* fixing nn_parser's nO
* fix printing of loss
* architectures in own file per type, consistent naming
* convenience methods default_tagger_config and default_tok2vec_config
* let create_pipe access default config if available for that component
* default_parser_config
* move defaults to separate folder
* allow reading nlp from package or dir with argument 'name'
* architecture spacy.VocabVectors.v1 to read static vectors from file
* cleanup
* default configs for nel, textcat, morphologizer, tensorizer
* fix imports
* fixing unit tests
* fixes and clean up
* fixing defaults, nO, fix unit tests
* restore parser IO
* fix IO
* 'fix' serialization test
* add *.cfg to manifest
* fix example configs with additional arguments
* replace Morphologizer with Tagger
* add IO bit when testing overfitting of tagger (currently failing)
* fix IO - don't initialize when reading from disk
* expand overfitting tests to also check IO goes OK
* remove dropout from HashEmbed to fix Tagger performance
* add defaults for sentrec
* update thinc
* always pass a Model instance to a Pipe
* fix piped_added statement
* remove obsolete W029
* remove obsolete errors
* restore byte checking tests (work again)
* clean up test
* further test cleanup
* convert from config to Model in create_pipe
* bring back error when component is not initialized
* cleanup
* remove calls for nlp2.begin_training
* use thinc.api in imports
* allow setting charembed's nM and nC
* fix for hardcoded nM/nC + unit test
* formatting fixes
* trigger build
* add lemma option to displacy 'dep' visualiser
* more compact list comprehension
* add option to doc
* fix test and add lemmas to util.get_doc
* fix capital
* remove lemma from get_doc
* cleanup
* Sync Span __eq__ and __hash__
Use the same tuple for `__eq__` and `__hash__`, including all attributes
except `vector` and `vector_norm`.
* Update entity comparison in tests
Update `assert_docs_equal()` test util to compare `Span` properties for
ents rather than `Span` objects.
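To illustrate the idea of driving `__eq__` and `__hash__` from one tuple, here is a toy class. `ToySpan` and its attribute list are hypothetical and do not reproduce the real `Span` fields; the point is only that `vector` is excluded from the key:

```python
class ToySpan:
    """Hypothetical span that compares and hashes on the same key tuple."""

    def __init__(self, doc_id, start, end, label="", kb_id="", vector=None):
        self.doc_id = doc_id
        self.start = start
        self.end = end
        self.label = label
        self.kb_id = kb_id
        self.vector = vector  # deliberately left out of the key

    def _key(self):
        # Single source of truth for both equality and hashing.
        return (self.doc_id, self.start, self.end, self.label, self.kb_id)

    def __eq__(self, other):
        if not isinstance(other, ToySpan):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        return hash(self._key())
```

Because both methods share `_key()`, two spans that compare equal always land in the same set or dict slot, even if their vectors differ.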
Modify flag settings so that `DEP` is not sufficient to set `is_parsed`
and only run `set_children_from_heads()` if `HEAD` is provided.
Then the combination `[SENT_START, DEP]` will set deps without clobbering
the sent starts by splitting the doc into a lot of one-word sentences.
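The flag logic can be summarized in a small decision function. This is a sketch only: `plan_from_array` and its keys are hypothetical, and the exact conditions (in particular that `HEAD` alone decides `is_parsed`) are assumptions read off the description above:

```python
def plan_from_array(attrs):
    """Return which flags/steps a from_array-style loader would apply."""
    has_head = "HEAD" in attrs
    return {
        # DEP alone no longer marks the doc as parsed (assumed: HEAD decides).
        "is_parsed": has_head,
        # Rebuilding children and sentence boundaries needs real heads.
        "set_children_from_heads": has_head,
        # Assumption: explicit sentence starts survive when no heads are given.
        "keep_sent_starts": "SENT_START" in attrs and not has_head,
    }
```

For `["SENT_START", "DEP"]` the sketch leaves the doc unparsed and keeps the provided sentence starts, which matches the behaviour described above.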