Provide more customized normalization table warnings when training a new
model. Only suggest installing `spacy-lookups-data` if it's not already
installed and it includes a table for this language (currently checked
in a hard-coded list).
* Fix warning message for lemmatization tables
* Add a warning when the `lexeme_norm` table is empty. (Given the
relatively lang-specific loading for `Lookups`, it seemed like too much
overhead to dynamically extract the list of languages, so for now it's
hard-coded.)
This reverts commit 9393253b66.
The model shouldn't need to see all examples, and actually in v3 there's
no equivalent step. All examples are provided to the component, for the
component to do stuff like figuring out the labels. The model just needs
to do stuff like shape inference.
* Add arch for MishWindowEncoder
* Support mish in tok2vec and conv window >=2
* Pass new tok2vec settings from parser
* Syntax error
* Fix tok2vec setting
* Fix registration of MishWindowEncoder
* Fix receptive field setting
* Fix mish arch
* Pass more options from parser
* Support more tok2vec options in pretrain
* Require thinc 7.3
* Add docs [ci skip]
* Require thinc 7.3.0.dev0 to run CI
* Run black
* Fix typo
* Update Thinc version
Co-authored-by: Ines Montani <ines@ines.io>
The previous version worked with previous thinc, but only
because some thinc ops happened to have gpu/cpu compatible
implementations. It's better to call the right Ops instance.
* Fix get labels for textcat
* Fix char_embed for gpu
* Revert "Fix char_embed for gpu"
This reverts commit 055b9a9e85.
* Fix passing of cats in gold.pyx
* Revert "Match pop with append for training format (#4516)"
This reverts commit 8e7414dace.
* Fix popping gold parses
* Fix handling of cats in gold tuples
* Fix name
* Fix ner_multitask_objective script
* Add test for 4402
* trying to fix script - not succesful yet
* match pop() with extend() to avoid changing the data
* few more pop-extend fixes
* reinsert deleted print statement
* fix print statement
* add last tested version
* append instead of extend
* add in few comments
* quick fix for 4402 + unit test
* fixing number of docs (not counting cats)
* more fixes
* fix len
* print tmp file instead of using data from examples dir
* print tmp file instead of using data from examples dir (2)
* Add work in progress
* Update analysis helpers and component decorator
* Fix porting of docstrings for Python 2
* Fix docstring stuff on Python 2
* Support meta factories when loading model
* Put auto pipeline analysis behind flag for now
* Analyse pipes on remove_pipe and replace_pipe
* Move analysis to root for now
Try to find a better place for it, but it needs to go for now to avoid circular imports
* Simplify decorator
Don't return a wrapped class and instead just write to the object
* Update existing components and factories
* Add condition in factory for classes vs. functions
* Add missing from_nlp classmethods
* Add "retokenizes" to printed overview
* Update assigns/requires declarations of builtins
* Only return data if no_print is enabled
* Use multiline table for overview
* Don't support Span
* Rewrite errors/warnings and move them to spacy.errors
* raise specific error when removing a matcher rule that doesn't exist
* rephrasing
* goldparse init: allocate fields only if doc is not empty
* avoid zero length alloc in saving tokenizer cache
* avoid allocating zero length mem in matcher
* asserts to avoid allocating zero length mem
* fix zero-length allocation in matcher
* bump cymem version
* revert cymem version bump
* Free pointers in ActivationsC
* Restructure alloc/free for parser activations
* Rewrite/restructure to have allocation and free in parallel functions
in `_parser_model` rather than partially in `_parseC()` in `Parser`.
* Remove `resize_activations` from `_parser_model.pxd`.
* Replace MatchStruct with Entity
Replace MatchStruct with Entity since the existing Entity struct is
nearly identical.
* Replace Entity with more general SpanC
* test and fix for second bug of issue 4042
* fix for first bug in 4042
* crashing test for Issue 4313
* forgot one instance of resize
* remove prints
* undo uncomment
* delete test for 4313 (uses third party lib)
* add fix for Issue 4313
* unit test for 4313
* remove duplicate unit test
* unit test (currently failing) for issue 4267
* bugfix: ensure doc.ents preserves kb_id annotations
* fix in setting doc.ents with empty label
* rename
* test for presetting an entity to a certain type
* allow overwriting Outside + blocking presets
* fix actions when previous label needs to be kept
* fix default ent_iob in set entities
* cleaner solution with U- action
* remove debugging print statements
* unit tests with explicit transitions and is_valid testing
* remove U- from move_names explicitly
* remove unit tests with pre-trained models that don't work
* remove (working) unit tests with pre-trained models
* clean up unit tests
* move unit tests
* small fixes
* remove two TODO's from doc.ents comments
* Add doc.cats to spacy.gold at the paragraph level
Support `doc.cats` as `"cats": [{"label": string, "value": number}]` in
the spacy JSON training format at the paragraph level.
* `spacy.gold.docs_to_json()` writes `docs.cats`
* `GoldCorpus` reads in cats in each `GoldParse`
* Update instances of gold_tuples to handle cats
Update iteration over gold_tuples / gold_parses to handle addition of
cats at the paragraph level.
* Add textcat to train CLI
* Add textcat options to train CLI
* Add textcat labels in `TextCategorizer.begin_training()`
* Add textcat evaluation to `Scorer`:
* For binary exclusive classes with provided label: F1 for label
* For 2+ exclusive classes: F1 macro average
* For multilabel (not exclusive): ROC AUC macro average (currently
relying on sklearn)
* Provide user info on textcat evaluation settings, potential
incompatibilities
* Provide pipeline to Scorer in `Language.evaluate` for textcat config
* Customize train CLI output to include only metrics relevant to current
pipeline
* Add textcat evaluation to evaluate CLI
* Fix handling of unset arguments and config params
Fix handling of unset arguments and model confiug parameters in Scorer
initialization.
* Temporarily add sklearn requirement
* Remove sklearn version number
* Improve Scorer handling of models without textcats
* Fixing Scorer handling of models without textcats
* Update Scorer output for python 2.7
* Modify inf in Scorer for python 2.7
* Auto-format
Also make small adjustments to make auto-formatting with black easier and produce nicer results
* Move error message to Errors
* Update documentation
* Add cats to annotation JSON format [ci skip]
* Fix tpl flag and docs [ci skip]
* Switch to internal roc_auc_score
Switch to internal `roc_auc_score()` adapted from scikit-learn.
* Add AUCROCScore tests and improve errors/warnings
* Add tests for AUCROCScore and roc_auc_score
* Add missing error for only positive/negative values
* Remove unnecessary warnings and errors
* Make reduced roc_auc_score functions private
Because most of the checks and warnings have been stripped for the
internal functions and access is only intended through `ROCAUCScore`,
make the functions for roc_auc_score adapted from scikit-learn private.
* Check that data corresponds with multilabel flag
Check that the training instances correspond with the multilabel flag,
adding the multilabel flag if required.
* Add textcat score to early stopping check
* Add more checks to debug-data for textcat
* Add example training data for textcat
* Add more checks to textcat train CLI
* Check configuration when extending base model
* Fix typos
* Update textcat example data
* Provide licensing details and licenses for data
* Remove two labels with no positive instances from jigsaw-toxic-comment
data.
Co-authored-by: Ines Montani <ines@ines.io>
* Prevent subtok label if not learning tokens
The parser introduces the subtok label to mark tokens that should be
merged during post-processing. Previously this happened even if we did
not have the --learn-tokens flag set. This patch passes the config
through to the parser, to prevent the problem.
* Make merge_subtokens a parser post-process if learn_subtokens
* Fix train script
* Add test for 3830: subtok problem
* Fix handlign of non-subtok in parser training
* Improve error message when model.from_bytes() dies
When Thinc's model.from_bytes() is called with a mismatched model, often
we get a particularly ungraceful error,
e.g. "AttributeError: FunctionLayer has no attribute G"
This is because we're trying to load the parameters for something like
a LayerNorm layer, and the model architecture has some other layer there
instead. This is obviously terrible, especially since the error *type*
is wrong.
I've changed it to raise a ValueError. The error message is still
probably a bit terse, but it's hard to be sure exactly what's gone
wrong.
* Update spacy/pipeline/pipes.pyx
* Update spacy/pipeline/pipes.pyx
* Update spacy/pipeline/pipes.pyx
* Update spacy/syntax/nn_parser.pyx
* Update spacy/syntax/nn_parser.pyx
* Update spacy/pipeline/pipes.pyx
Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>
* Update spacy/pipeline/pipes.pyx
Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>
Co-authored-by: Ines Montani <ines@ines.io>
v2.1 introduced a regression when deserializing the parser after
parser.add_label() had been called. The code around the class mapping is
pretty confusing currently, as it was written to accommodate backwards
model compatibility. It needs to be revised when the models are next
retrained.
Closes#3433
* Make serialization methods consistent
exclude keyword argument instead of random named keyword arguments and deprecation handling
* Update docs and add section on serialization fields
* Improve handling of missing NER tags
GoldParse can accept missing NER tags, if entities is provided
in BILUO format (rather than as spans). Missing tags can be provided
as None values.
Fix bug that occurred when first tag was a None value. Closes#2603.
* Document specification of missing NER tags.
The output weights often return negative scores for classes, especially
via the bias terms. This means that when we add a new class, we can't
rely on just zeroing the weights, or we'll end up with positive
predictions for those labels.
To solve this, we use nan values as the initial weights for new labels.
This prevents them from ever coming out on top. During backprop, we
replace the nan values with the minimum assigned score, so that we're
still able to learn these classes.
After creating a component, the `.model` attribute is left with the value `True`, to indicate it should be created later during `from_disk()`, `from_bytes()` or `begin_training()`. This had led to confusing errors if you try to use the component without initializing the model.
To fix this, we add a method `require_model()` to the `Pipe` base class. The `require_model()` method needs to be called at the start of the `.predict()` and `.update()` methods of the components. It raises a `ValueError` if the model is not initialized. An error message has been added to `spacy.errors`.