* Use morph for extra Japanese tokenizer info
Previously Japanese tokenizer info that didn't correspond to Token
fields was put in user data. Since spaCy core should avoid touching user
data, this moves most information to the Token.morph attribute. It also
adds the normalized form, which wasn't exposed before.
The subtokens, which are a list of full tokens, are still added to user
data, except with the default tokenizer granualarity. With the default
tokenizer settings the subtokens are all None, so in this case the user
data is simply not set.
* Update tests
Also adds a new test for norm data.
* Update docs
* Add Japanese morphologizer factory
Set the default to `extend=True` so that the morphologizer does not
clobber the values set by the tokenizer.
* Use the norm_ field for normalized forms
Before this commit, normalized forms were put in the "norm" field in the
morph attributes. I am not sure why I did that instead of using the
token morph, I think I just forgot about it.
* Skip test if sudachipy is not installed
* Fix import
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* factor out the WandB logger into spacy-loggers
Signed-off-by: Elia Robyn Speer <gh@arborelia.net>
* depend on spacy-loggers so they are available
Signed-off-by: Elia Robyn Speer <gh@arborelia.net>
* remove docs of spacy.WandbLogger.v2 (moved to spacy-loggers)
Signed-off-by: Elia Robyn Speer <elia@explosion.ai>
* Version number suggestions from code review
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* update references to WandbLogger
Signed-off-by: Elia Robyn Speer <elia@explosion.ai>
* make order of deps more consistent
Signed-off-by: Elia Robyn Speer <elia@explosion.ai>
Co-authored-by: Elia Robyn Speer <elia@explosion.ai>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
* Fix surprises when asking for the root of a git repo
In the case of the first asset I wanted to get from git, the data I
wanted was the entire repository. I tried leaving "path" blank, which
gave a less-than-helpful error, and then I tried `path: "/"`, which
started copying my entire filesystem into the project. The path I should
have used was "".
I've made two changes to make this smoother for others:
- The 'path' within a git clone defaults to ""
- If the path points outside of the tmpdir that the git clone goes
into, we fail with an error
Signed-off-by: Elia Robyn Speer <elia@explosion.ai>
* use a descriptive error instead of a default
plus some minor fixes from PR review
Signed-off-by: Elia Robyn Speer <elia@explosion.ai>
* check for None values in assets
Signed-off-by: Elia Robyn Speer <elia@explosion.ai>
Co-authored-by: Elia Robyn Speer <elia@explosion.ai>
* Add training data section
Not entirely sure this is in the right location on the page - maybe it
should be after quickstart?
* Add pointer from binary format to training data section
* Minor cleanup
* Add to ToC, fix filename
* Update website/docs/usage/training.md
Co-authored-by: Ines Montani <ines@ines.io>
* Update website/docs/usage/training.md
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update website/docs/usage/training.md
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Move the training data section further down the page
* Update website/docs/usage/training.md
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Update website/docs/usage/training.md
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Run prettier
Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
* Support list values and IS_INTERSECT in Matcher
* Support list values as token attributes for set operators, not just as
pattern values.
* Add `IS_INTERSECT` operator.
* Fix incorrect `ISSUBSET` and `ISSUPERSET` in schema and docs.
* Rename IS_INTERSECT to INTERSECTS
* implement textcat resizing for TextCatCNN
* resizing textcat in-place
* simplify code
* ensure predictions for old textcat labels remain the same after resizing (WIP)
* fix for softmax
* store softmax as attr
* fix ensemble weight copy and cleanup
* restructure slightly
* adjust documentation, update tests and quickstart templates to use latest versions
* extend unit test slightly
* revert unnecessary edits
* fix typo
* ensemble architecture won't be resizable for now
* use resizable layer (WIP)
* revert using resizable layer
* resizable container while avoid shape inference trouble
* cleanup
* ensure model continues training after resizing
* use fill_b parameter
* use fill_defaults
* resize_layer callback
* format
* bump thinc to 8.0.4
* bump spacy-legacy to 3.0.6