Commit Graph

656 Commits

Author SHA1 Message Date
Ines Montani
cf65a80f36 Refactor lemmatizer and data table integration (#4353)
* Move test

* Allow default in Lookups.get_table

* Start with blank tables in Lookups.from_bytes

* Refactor lemmatizer to hold instance of Lookups

* Get lookups table within the lemmatization methods to make sure it references the correct table (even if the table was replaced or modified, e.g. when loading a model from disk)
* Deprecate other arguments on Lemmatizer.__init__ and expect Lookups for consistency
* Remove old and unsupported Lemmatizer.load classmethod
* Refactor language-specific lemmatizers to inherit as much as possible from base class and override only what they need

* Update tests and docs

* Fix more tests

* Fix lemmatizer

* Upgrade pytest to try and fix weird CI errors

* Try pytest 4.6.5
2019-10-01 21:36:03 +02:00
Ines Montani
bc7e7db208 Fix wording [ci skip] 2019-10-01 14:20:44 +02:00
Ines Montani
2a3a4565cd Update infobox [ci skip] 2019-10-01 14:19:34 +02:00
Ines Montani
66aa0d479f Update v2.2 page [ci skip] 2019-10-01 14:11:05 +02:00
Ines Montani
a8a1800f2a Update lemma data documentation [ci skip] 2019-10-01 13:22:13 +02:00
Ines Montani
932ad9cb91 Fix typos and formatting [ci skip] 2019-10-01 12:30:04 +02:00
Ines Montani
3d8fd4b461 Revert #4334 2019-09-29 17:32:12 +02:00
Ines Montani
3bd4da068e Fix link [ci skip] 2019-09-29 17:30:38 +02:00
Ines Montani
089f44cc56 Update serialization docs [ci skip] 2019-09-29 17:11:13 +02:00
Ines Montani
c9cd516d96 Move tests out of package (#4334)
* Move tests out of package

* Fix typo
2019-09-28 18:05:00 +02:00
Ines Montani
10742d3219 Update v2 docs [ci skip] 2019-09-28 15:57:22 +02:00
Ines Montani
f8d1e2f214 Update CLI docs [ci skip] 2019-09-28 13:12:30 +02:00
Ines Montani
59beab8405 Update v2-2.md [ci skip] 2019-09-27 18:10:43 +02:00
Ines Montani
685e4b2554 Update v2-2.md [ci skip] 2019-09-27 16:35:01 +02:00
Ines Montani
aad66d9bb9 Document PhraseMatcher.remove [ci skip] 2019-09-27 16:34:53 +02:00
Ines Montani
eb0649e38e Fix tag [ci skip] 2019-09-26 16:22:33 +02:00
Ines Montani
da9a869d3f Update vectors name docs [ci skip] 2019-09-26 16:21:32 +02:00
Em Zhan
aafa091541 Fix typo in documentation (#4322)
* Fix typo 'probj' instead of 'pobj'

* Add spaCy contributor agreement for zqianem
2019-09-25 19:42:18 +02:00
Matthew Honnibal
92ed4dc5e0
Allow vectors name to be set in init-model (#4321)
* Allow vectors name to be specified in init-model

* Document --vectors-name argument to init-model

* Update website/docs/api/cli.md

Co-Authored-By: Ines Montani <ines@ines.io>
2019-09-25 13:11:00 +02:00
Ines Montani
197406de1d Update v2-2.md [ci skip] 2019-09-19 14:33:58 +02:00
Ines Montani
ddc09b08ed Update v2-2.md [ci skip] 2019-09-19 00:58:30 +02:00
Matthew Honnibal
e2047576c4 Fix merge conflict 2019-09-18 21:42:11 +02:00
Matthew Honnibal
46c02d25b1 Merge changes to test_ner 2019-09-18 21:41:24 +02:00
Ines Montani
9c940eab94 Update version in examples [ci skip] 2019-09-18 21:23:26 +02:00
Ines Montani
f873548f6c Add backwards incompatibility [ci skip] 2019-09-18 21:21:48 +02:00
Ines Montani
6ebdc5f7d2 Update download docs [ci skip] 2019-09-18 21:21:39 +02:00
Ines Montani
dd1810f05a Update DocBin and add docs 2019-09-18 20:23:21 +02:00
Ines Montani
d62690b3ba Update examples 2019-09-18 19:57:36 +02:00
Ines Montani
bd435faddd Add note about usage docs [ci skip] 2019-09-18 19:56:43 +02:00
Matthew Honnibal
931e96b6c7 DocPallet->DocBin in docs 2019-09-18 15:17:26 +02:00
Matthew Honnibal
f537cbeacc Update v2-2 docs 2019-09-18 14:07:55 +02:00
Ines Montani
ee15fdfe88 Fix wording [ci skip] 2019-09-17 14:59:42 +02:00
Ines Montani
f566e69f38 Fix --vectors-loc docs (closes #4270) 2019-09-17 14:59:12 +02:00
Ines Montani
25c2b4b9a5 Improve init-model docs (see #4137) 2019-09-17 14:51:44 +02:00
Ines Montani
198b7e9789 Auto-format [ci skip] 2019-09-17 14:48:35 +02:00
adrianeboyd
b5d999e510 Add textcat to train CLI (#4226)
* Add doc.cats to spacy.gold at the paragraph level

Support `doc.cats` as `"cats": [{"label": string, "value": number}]` in
the spacy JSON training format at the paragraph level.

* `spacy.gold.docs_to_json()` writes `docs.cats`

* `GoldCorpus` reads in cats in each `GoldParse`

* Update instances of gold_tuples to handle cats

Update iteration over gold_tuples / gold_parses to handle addition of
cats at the paragraph level.

* Add textcat to train CLI

* Add textcat options to train CLI
* Add textcat labels in `TextCategorizer.begin_training()`
* Add textcat evaluation to `Scorer`:
  * For binary exclusive classes with provided label: F1 for label
  * For 2+ exclusive classes: F1 macro average
  * For multilabel (not exclusive): ROC AUC macro average (currently
relying on sklearn)
* Provide user info on textcat evaluation settings, potential
incompatibilities
* Provide pipeline to Scorer in `Language.evaluate` for textcat config
* Customize train CLI output to include only metrics relevant to current
pipeline
* Add textcat evaluation to evaluate CLI

* Fix handling of unset arguments and config params

Fix handling of unset arguments and model confiug parameters in Scorer
initialization.

* Temporarily add sklearn requirement

* Remove sklearn version number

* Improve Scorer handling of models without textcats

* Fixing Scorer handling of models without textcats

* Update Scorer output for python 2.7

* Modify inf in Scorer for python 2.7

* Auto-format

Also make small adjustments to make auto-formatting with black easier and produce nicer results

* Move error message to Errors

* Update documentation

* Add cats to annotation JSON format [ci skip]

* Fix tpl flag and docs [ci skip]

* Switch to internal roc_auc_score

Switch to internal `roc_auc_score()` adapted from scikit-learn.

* Add AUCROCScore tests and improve errors/warnings

* Add tests for AUCROCScore and roc_auc_score
* Add missing error for only positive/negative values
* Remove unnecessary warnings and errors

* Make reduced roc_auc_score functions private

Because most of the checks and warnings have been stripped for the
internal functions and access is only intended through `ROCAUCScore`,
make the functions for roc_auc_score adapted from scikit-learn private.

* Check that data corresponds with multilabel flag

Check that the training instances correspond with the multilabel flag,
adding the multilabel flag if required.

* Add textcat score to early stopping check

* Add more checks to debug-data for textcat

* Add example training data for textcat

* Add more checks to textcat train CLI

* Check configuration when extending base model
* Fix typos

* Update textcat example data

* Provide licensing details and licenses for data
* Remove two labels with no positive instances from jigsaw-toxic-comment
data.


Co-authored-by: Ines Montani <ines@ines.io>
2019-09-15 22:31:31 +02:00
Ines Montani
bab9976d9a
💫 Adjust Table API and add docs (#4289)
* Adjust Table API and add docs

* Add attributes and update description [ci skip]

* Use strings.get_string_id instead of hash_string

* Fix table method calls

* Make orth arg in Lemmatizer.lookup optional

Fall back to string, which is now handled by Table.__contains__ out-of-the-box

* Fix method name

* Auto-format
2019-09-15 22:08:13 +02:00
Ines Montani
16c2522791 Merge branch 'master' into develop 2019-09-14 16:42:01 +02:00
Ines Montani
86befc80bf WIP: Add v2.2 page [ci skip] 2019-09-14 16:41:48 +02:00
Ines Montani
04d36d2471 Remove unused link [ci skip] 2019-09-14 16:41:19 +02:00
Ines Montani
5c8b5e68ec Fix docs consistency [ci skip] 2019-09-14 16:23:37 +02:00
Ines Montani
bbf7337eaf Update adding languages docs [ci skip] 2019-09-14 15:32:15 +02:00
Ines Montani
3126dd0904 Tidy up and auto-format [ci skip] 2019-09-14 12:58:06 +02:00
Ines Montani
3c3658ef9f Merge branch 'master' into develop 2019-09-12 18:03:01 +02:00
Sofie Van Landeghem
9be4d1c105 Allow copying of user_data in as_doc (#4282)
* Allow copying the user_data with as_doc + unit test

* add option to docs

* add typing

* import fix

* workaround to avoid bool clashing ...

* bint instead of bool
2019-09-12 17:08:14 +02:00
Ines Montani
ff51fba96a Update lemmaitzer docs [ci skip] 2019-09-12 16:26:33 +02:00
Ines Montani
25b2b3ff45 Remove LEMMA from exception examples [ci skip] 2019-09-12 16:26:27 +02:00
Ines Montani
82c16b7943 Remove u-strings and fix formatting [ci skip] 2019-09-12 16:11:15 +02:00
Ines Montani
a31e9e1cd5 Update training docs [ci skip] 2019-09-12 15:32:39 +02:00
Ines Montani
b544dcb3c5 Document debug-data [ci skip] 2019-09-12 15:26:20 +02:00