Commit Graph

11162 Commits

Author SHA1 Message Date
Matthew Honnibal
7d510c833e Fix orth replacement 2019-09-19 00:03:24 +02:00
Ines Montani
89d1dc4afa Merge branch 'master' into develop 2019-09-18 22:12:24 +02:00
Sean Löfgren
31c683d87d add return_matches and as_tuples back to Matcher.pipe (#4303)
* add contributor agreement [ci skip]

* add return_matches and as_tuples back to Matcher.pipe
2019-09-18 22:00:33 +02:00
Matthew Honnibal
42df49133d Also lower-case in orth variants 2019-09-18 21:54:51 +02:00
Matthew Honnibal
19d99fc9e7 Set version to v2.2.0.dev7 2019-09-18 21:43:59 +02:00
Matthew Honnibal
e2047576c4 Fix merge conflict 2019-09-18 21:42:11 +02:00
Matthew Honnibal
46c02d25b1 Merge changes to test_ner 2019-09-18 21:41:24 +02:00
Sofie Van Landeghem
de5a9ecdf3 Distinction between outside, missing and blocked NER annotations (#4307)
* remove duplicate unit test

* unit test (currently failing) for issue 4267

* bugfix: ensure doc.ents preserves kb_id annotations

* fix in setting doc.ents with empty label

* rename

* test for presetting an entity to a certain type

* allow overwriting Outside + blocking presets

* fix actions when previous label needs to be kept

* fix default ent_iob in set entities

* cleaner solution with U- action

* remove debugging print statements

* unit tests with explicit transitions and is_valid testing

* remove U- from move_names explicitly

* remove unit tests with pre-trained models that don't work

* remove (working) unit tests with pre-trained models

* clean up unit tests

* move unit tests

* small fixes

* remove two TODO's from doc.ents comments
2019-09-18 21:37:17 +02:00
Moshe Hazoom
72463b062f Improve speed of _merge method (#4300)
* make merge more efficient

* fix offsets

* merge works with relative indices

* remove printing

* Add the SCA

* fix SCA date

* more cythonize _retokenize.pyx

* more cythonize _retokenize.pyx

* fix only declaration in _retokenize.pyx

* switch back to absolute head

* switch back to absolute head

* fix comment

* merge from origin repo
2019-09-18 21:34:34 +02:00
Ines Montani
63a584c6d4 Update README.md [ci skip] 2019-09-18 21:34:24 +02:00
tamuhey
875f3e5d8c remove redundant __call__ method in pipes.TextCategorizer (#4305)
* remove redundant __call__ method in pipes.TextCategorizer

Because the parent __call__ method behaves in the same way.

* fix: Pipe.__call__ arg

* fix: invalid arg in Pipe.__call__

* modified:   spacy/tests/regression/test_issue4278.py (#4278)

* deleted:    Pipfile
2019-09-18 21:31:27 +02:00
Ines Montani
d84763727c Remove unused setting [ci skip] 2019-09-18 21:24:14 +02:00
Ines Montani
9c940eab94 Update version in examples [ci skip] 2019-09-18 21:23:26 +02:00
Ines Montani
f873548f6c Add backwards incompatibility [ci skip] 2019-09-18 21:21:48 +02:00
Ines Montani
6ebdc5f7d2 Update download docs [ci skip] 2019-09-18 21:21:39 +02:00
Ines Montani
00a8cbc306 Tidy up and auto-format 2019-09-18 20:27:03 +02:00
Ines Montani
f2c8b1e362 Simplify lookup hashing
Just use get_string_id, which already does everything ensure_hash was supposed to do
2019-09-18 20:24:41 +02:00
Ines Montani
dd1810f05a Update DocBin and add docs 2019-09-18 20:23:21 +02:00
Ines Montani
d62690b3ba Update examples 2019-09-18 19:57:36 +02:00
Ines Montani
7e810cced6 Add references to docs pages 2019-09-18 19:57:21 +02:00
Ines Montani
2e5ab5b59c Make except more explicit 2019-09-18 19:57:08 +02:00
Ines Montani
1f648ecb76 Auto-format 2019-09-18 19:56:55 +02:00
Ines Montani
bd435faddd Add note about usage docs [ci skip] 2019-09-18 19:56:43 +02:00
Ines Montani
0f7fe5e7a7 Auto-format and fix typo and consistency 2019-09-18 19:18:30 +02:00
Matthew Honnibal
931e96b6c7 DocPallet->DocBin in docs 2019-09-18 15:17:26 +02:00
Matthew Honnibal
e53b86751f DocPallet -> DocBin 2019-09-18 15:15:37 +02:00
Matthew Honnibal
f537cbeacc Update v2-2 docs 2019-09-18 14:07:55 +02:00
Matthew Honnibal
fa9a283128 Fix name 2019-09-18 13:40:03 +02:00
Matthew Honnibal
88a23cf49a Fix name 2019-09-18 13:38:29 +02:00
Matthew Honnibal
3507943b15 Add docstring for DocPallet 2019-09-18 13:25:47 +02:00
Matthew Honnibal
1c8de6b2e5 Rename DocBox->DocPallet 2019-09-18 13:13:51 +02:00
Ines Montani
c922f8e8b0 Fix sources rendering [ci skip] 2019-09-18 12:09:21 +02:00
Ines Montani
ea2a686cf7 Support new model sources format [ci skip] 2019-09-18 11:42:45 +02:00
Ines Montani
ee15fdfe88 Fix wording [ci skip] 2019-09-17 14:59:42 +02:00
Ines Montani
f566e69f38 Fix --vectors-loc docs (closes #4270) 2019-09-17 14:59:12 +02:00
Ines Montani
25c2b4b9a5 Improve init-model docs (see #4137) 2019-09-17 14:51:44 +02:00
Ines Montani
198b7e9789 Auto-format [ci skip] 2019-09-17 14:48:35 +02:00
Ines Montani
691e0088cf Remove duplicate tok2vec property (closes #4302) 2019-09-17 11:22:03 +02:00
Ines Montani
a84025d70b Remove --no-deps from default pip args on download
Add warning if user is executing spaCy without having it installed and add --no-deps to prevent the package from being redownloaded
2019-09-16 23:32:41 +02:00
Matthew Honnibal
84c65f9455 Merge branch 'master' into develop 2019-09-16 22:12:20 +02:00
Matthew Honnibal
47055d5988 Fix type declarations in _merge method 2019-09-16 22:10:13 +02:00
Sofie Van Landeghem
03ac29f437 Ensure that doc.ents preserves kb_id annotations (#4294)
* bugfix: ensure doc.ents preserves kb_id annotations

* fix backward compatibility

* additional test
2019-09-16 15:18:37 +02:00
Ines Montani
139428c20f Set unique vector names in tests 2019-09-16 15:16:54 +02:00
Ines Montani
bf06d9d537 Allow passing vectors_name to Vocab 2019-09-16 15:16:41 +02:00
Ines Montani
cb6c68a573 Pass vectors name correctly in prune_vectors 2019-09-16 15:16:29 +02:00
Ines Montani
3ba5238282 Make "unnamed vectors" warning a real warning 2019-09-16 15:16:12 +02:00
adrianeboyd
b5d999e510 Add textcat to train CLI (#4226)
* Add doc.cats to spacy.gold at the paragraph level

Support `doc.cats` as `"cats": [{"label": string, "value": number}]` in
the spacy JSON training format at the paragraph level.

* `spacy.gold.docs_to_json()` writes `docs.cats`

* `GoldCorpus` reads in cats in each `GoldParse`

* Update instances of gold_tuples to handle cats

Update iteration over gold_tuples / gold_parses to handle addition of
cats at the paragraph level.

* Add textcat to train CLI

* Add textcat options to train CLI
* Add textcat labels in `TextCategorizer.begin_training()`
* Add textcat evaluation to `Scorer`:
  * For binary exclusive classes with provided label: F1 for label
  * For 2+ exclusive classes: F1 macro average
  * For multilabel (not exclusive): ROC AUC macro average (currently
relying on sklearn)
* Provide user info on textcat evaluation settings, potential
incompatibilities
* Provide pipeline to Scorer in `Language.evaluate` for textcat config
* Customize train CLI output to include only metrics relevant to current
pipeline
* Add textcat evaluation to evaluate CLI

* Fix handling of unset arguments and config params

Fix handling of unset arguments and model confiug parameters in Scorer
initialization.

* Temporarily add sklearn requirement

* Remove sklearn version number

* Improve Scorer handling of models without textcats

* Fixing Scorer handling of models without textcats

* Update Scorer output for python 2.7

* Modify inf in Scorer for python 2.7

* Auto-format

Also make small adjustments to make auto-formatting with black easier and produce nicer results

* Move error message to Errors

* Update documentation

* Add cats to annotation JSON format [ci skip]

* Fix tpl flag and docs [ci skip]

* Switch to internal roc_auc_score

Switch to internal `roc_auc_score()` adapted from scikit-learn.

* Add AUCROCScore tests and improve errors/warnings

* Add tests for AUCROCScore and roc_auc_score
* Add missing error for only positive/negative values
* Remove unnecessary warnings and errors

* Make reduced roc_auc_score functions private

Because most of the checks and warnings have been stripped for the
internal functions and access is only intended through `ROCAUCScore`,
make the functions for roc_auc_score adapted from scikit-learn private.

* Check that data corresponds with multilabel flag

Check that the training instances correspond with the multilabel flag,
adding the multilabel flag if required.

* Add textcat score to early stopping check

* Add more checks to debug-data for textcat

* Add example training data for textcat

* Add more checks to textcat train CLI

* Check configuration when extending base model
* Fix typos

* Update textcat example data

* Provide licensing details and licenses for data
* Remove two labels with no positive instances from jigsaw-toxic-comment
data.


Co-authored-by: Ines Montani <ines@ines.io>
2019-09-15 22:31:31 +02:00
Ines Montani
bab9976d9a
💫 Adjust Table API and add docs (#4289)
* Adjust Table API and add docs

* Add attributes and update description [ci skip]

* Use strings.get_string_id instead of hash_string

* Fix table method calls

* Make orth arg in Lemmatizer.lookup optional

Fall back to string, which is now handled by Table.__contains__ out-of-the-box

* Fix method name

* Auto-format
2019-09-15 22:08:13 +02:00
Ines Montani
88a9d87f6f Fix test 2019-09-15 18:04:44 +02:00
Ines Montani
23e28e2844 Merge branch 'master' into develop 2019-09-15 17:57:09 +02:00