Currently the TextCategorizer defaults to a fairly complicated model, designed partly around the active learning requirements of Prodigy. The model's a bit slow, and not very GPU-friendly.
This patch implements a straightforward CNN model that still performs pretty well. The replacement model also makes it easy to use the LMAO pretraining, since most of the parameters are in the CNN.
The replacement model has a flag to specify whether labels are mutually exclusive, which defaults to True. Confusion over whether labels are mutually exclusive has been a common problem with the text classifier. We'll also now be able to support adding labels to pretrained models again.
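To illustrate what a mutually-exclusive flag typically changes, here is a minimal sketch. It assumes the usual approach of a softmax output for exclusive labels and independent sigmoids otherwise; the function names and the `exclusive` parameter are hypothetical and are not taken from this patch.

```python
import math


def softmax(scores):
    # Subtract the max score for numerical stability before exponentiating.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))


def label_probabilities(scores, exclusive=True):
    # `exclusive` is a hypothetical stand-in for the flag described above.
    if exclusive:
        # Exactly one label applies: probabilities compete and sum to 1.
        return softmax(scores)
    # Any number of labels may apply: each one is scored independently.
    return [sigmoid(s) for s in scores]


exclusive_probs = label_probabilities([2.0, 1.0, 0.1])
multilabel_probs = label_probabilities([2.0, 1.0, 0.1], exclusive=False)
```

With exclusive labels the probabilities sum to 1, so raising one label lowers the others. With non-exclusive labels each score stands alone, so several labels can be likely at once.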
Resolves #2934, #2756, #1798, #1748.
Fixes #3027.
* Allow Span.__init__ to take unicode values for the `label` argument.
* Allow `Span.label_` to be writeable.
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.
* Add todo
* Auto-format
* Update wasabi pin
* Format training results with wasabi
* Remove loading animation from model saving
Currently behaves weirdly
* Inline messages
* Remove unnecessary path2str
Already taken care of by printer
* Inline messages in CLI
* Remove unused function
* Move loading indicator into loading function
* Check for invalid whitespace entities
See #3028. The solution in this patch is pretty debatable.
What we do is give the TokenC struct a .norm field, by repurposing the previously idle .sense attribute. It's nice to repurpose a previous field because it means the TokenC doesn't change size, so even if someone's using the internals very deeply, nothing will break.
The weird thing here is that the TokenC and the LexemeC both have an attribute named NORM. This arguably assists in backwards compatibility. On the other hand, maybe it's really bad! We're changing the semantics of the attribute subtly, so maybe it's better if someone calling lex.norm gets a breakage, and instead is told to write lex.default_norm?
Overall I believe this patch makes the NORM feature work the way we sort of expected it to work. Certainly it's much more like how the docs describe it, and more in line with how we've been directing people to use the norm attribute. We'll also be able to use token.norm to do stuff like spelling correction, which is pretty cool.
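The intended semantics can be sketched in plain Python. This is only an illustration, assuming a token-level norm that falls back to the lexeme's default norm when unset; the `Lexeme` and `Token` classes here are hypothetical stand-ins, not the actual `LexemeC` and `TokenC` structs.

```python
from dataclasses import dataclass


@dataclass
class Lexeme:
    text: str
    norm: str  # context-independent default norm, shared by every occurrence


@dataclass
class Token:
    lex: Lexeme
    _norm: str = ""  # empty means "no token-specific norm has been set"

    @property
    def norm(self):
        # Prefer the per-token value; otherwise fall back to the lexeme default.
        return self._norm or self.lex.norm

    @norm.setter
    def norm(self, value):
        self._norm = value


lexeme = Lexeme(text="teh", norm="teh")
token = Token(lexeme)
before = token.norm   # falls back to the lexeme's norm
token.norm = "the"    # e.g. a spelling-correction step sets a per-token norm
after = token.norm
```

Setting the norm on one token leaves the lexeme untouched, so other occurrences of the same word keep the default.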
Fix a bug in the JSON streaming code that GoldCorpus uses. Escaped
slashes were being handled incorrectly. This bug caused low scores for
French in the early v2.1.0 alphas, because most of the data was not
being read in.
Fittingly, the document that triggered the bug was a Wikipedia article about
Perl. Parsing Perl remains difficult!
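For reference, this is the behaviour a correct JSON reader should show for escaped slashes, demonstrated with Python's standard `json` module. It does not reproduce the streaming code fixed here, and the sample record is made up for illustration.

```python
import json

# In JSON, "\/" is an optional escape for "/" and "\\" is an escaped backslash.
# The raw string keeps the backslashes exactly as they would appear on disk.
raw = r'{"url": "https:\/\/example.org\/wiki\/Perl", "path": "C:\\temp"}'

record = json.loads(raw)
url = record["url"]    # escaped slashes decode to plain slashes
path = record["path"]  # the escaped backslash decodes to a single backslash
```

A hand-written streaming parser has to treat the character after a backslash as part of the escape, otherwise it can misjudge where a string ends and drop the rest of the record.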
* Add note that Unidic is required for Japanese
This addresses #3001. -POLM
* Add extras_require for mecab with old version
Related to issue #3018.
* mecab → ja
Co-Authored-By: polm <polm@dampfkraft.com>
* Update Unidic link in the docs to the latest version
This patch improves on #3017. The Unidic link pointed to an old version, so it now points to the latest version.
* Add contributor agreement
* Use more specific link for unidic-cwj
* modifying FR lemmatization for nouns
* adding contributor agreement for amperinet
* adding rules for words with inclusive parentheses wrongly tokenized
* adding contributor agreement for amperinet
* adding a missing comma
* updating rules and vocabulary for French lemmatization of verbs
* updating the file with French auxiliary verb
* updating rules and vocabulary for French lemmatization of verbs
* adding contributor agreement for amperinet
* adding rules for words with inclusive parentheses wrongly tokenized
* Updated wordforms for Norwegian lemmatizer
Upload of updated lists of wordforms for the Norwegian lemmatizer (nouns, verbs, adverbs, adjectives and lookup).
* Add spaCy contributor agreement for user beatesi
* Updated wordforms for Norwegian lemmatizer