spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-10-03 10:26:48 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	7d510c833e	Fix orth replacement	2019-09-19 00:03:24 +02:00
Ines Montani	89d1dc4afa	Merge branch 'master' into develop	2019-09-18 22:12:24 +02:00
Sean Löfgren	31c683d87d	add return_matches and as_tuples back to Matcher.pipe (#4303) * add contributor agreement [ci skip] * add return_matches and as_tuples back to Matcher.pipe	2019-09-18 22:00:33 +02:00
Matthew Honnibal	42df49133d	Also lower-case in orth variants	2019-09-18 21:54:51 +02:00
Matthew Honnibal	19d99fc9e7	Set version to v2.2.0.dev7	2019-09-18 21:43:59 +02:00
Matthew Honnibal	e2047576c4	Fix merge conflict	2019-09-18 21:42:11 +02:00
Matthew Honnibal	46c02d25b1	Merge changes to test_ner	2019-09-18 21:41:24 +02:00
Sofie Van Landeghem	de5a9ecdf3	Distinction between outside, missing and blocked NER annotations (#4307) * remove duplicate unit test * unit test (currently failing) for issue 4267 * bugfix: ensure doc.ents preserves kb_id annotations * fix in setting doc.ents with empty label * rename * test for presetting an entity to a certain type * allow overwriting Outside + blocking presets * fix actions when previous label needs to be kept * fix default ent_iob in set entities * cleaner solution with U- action * remove debugging print statements * unit tests with explicit transitions and is_valid testing * remove U- from move_names explicitly * remove unit tests with pre-trained models that don't work * remove (working) unit tests with pre-trained models * clean up unit tests * move unit tests * small fixes * remove two TODO's from doc.ents comments	2019-09-18 21:37:17 +02:00
Moshe Hazoom	72463b062f	Improve speed of _merge method (#4300) * make merge more efficient * fix offsets * merge works with relative indices * remove printing * Add the SCA * fix SCA date * more cythonize _retokenize.pyx * more cythonize _retokenize.pyx * fix only declaration in _retokenize.pyx * switch back to absolute head * switch back to absolute head * fix comment * merge from origin repo	2019-09-18 21:34:34 +02:00
Ines Montani	63a584c6d4	Update README.md [ci skip]	2019-09-18 21:34:24 +02:00
tamuhey	875f3e5d8c	remove redundant __call__ method in pipes.TextCategorizer (#4305) * remove redundant __call__ method in pipes.TextCategorizer Because the parent __call__ method behaves in the same way. * fix: Pipe.__call__ arg * fix: invalid arg in Pipe.__call__ * modified: spacy/tests/regression/test_issue4278.py (#4278) * deleted: Pipfile	2019-09-18 21:31:27 +02:00
Ines Montani	d84763727c	Remove unused setting [ci skip]	2019-09-18 21:24:14 +02:00
Ines Montani	9c940eab94	Update version in examples [ci skip]	2019-09-18 21:23:26 +02:00
Ines Montani	f873548f6c	Add backwards incompatibility [ci skip]	2019-09-18 21:21:48 +02:00
Ines Montani	6ebdc5f7d2	Update download docs [ci skip]	2019-09-18 21:21:39 +02:00
Ines Montani	00a8cbc306	Tidy up and auto-format	2019-09-18 20:27:03 +02:00
Ines Montani	f2c8b1e362	Simplify lookup hashing Just use get_string_id, which already does everything ensure_hash was supposed to do	2019-09-18 20:24:41 +02:00
Ines Montani	dd1810f05a	Update DocBin and add docs	2019-09-18 20:23:21 +02:00
Ines Montani	d62690b3ba	Update examples	2019-09-18 19:57:36 +02:00
Ines Montani	7e810cced6	Add references to docs pages	2019-09-18 19:57:21 +02:00
Ines Montani	2e5ab5b59c	Make except more explicit	2019-09-18 19:57:08 +02:00
Ines Montani	1f648ecb76	Auto-format	2019-09-18 19:56:55 +02:00
Ines Montani	bd435faddd	Add note about usage docs [ci skip]	2019-09-18 19:56:43 +02:00
Ines Montani	0f7fe5e7a7	Auto-format and fix typo and consistency	2019-09-18 19:18:30 +02:00
Matthew Honnibal	931e96b6c7	DocPallet->DocBin in docs	2019-09-18 15:17:26 +02:00
Matthew Honnibal	e53b86751f	DocPallet -> DocBin	2019-09-18 15:15:37 +02:00
Matthew Honnibal	f537cbeacc	Update v2-2 docs	2019-09-18 14:07:55 +02:00
Matthew Honnibal	fa9a283128	Fix name	2019-09-18 13:40:03 +02:00
Matthew Honnibal	88a23cf49a	Fix name	2019-09-18 13:38:29 +02:00
Matthew Honnibal	3507943b15	Add docstring for DocPallet	2019-09-18 13:25:47 +02:00
Matthew Honnibal	1c8de6b2e5	Rename DocBox->DocPallet	2019-09-18 13:13:51 +02:00
Ines Montani	c922f8e8b0	Fix sources rendering [ci skip]	2019-09-18 12:09:21 +02:00
Ines Montani	ea2a686cf7	Support new model sources format [ci skip]	2019-09-18 11:42:45 +02:00
Ines Montani	ee15fdfe88	Fix wording [ci skip]	2019-09-17 14:59:42 +02:00
Ines Montani	f566e69f38	Fix --vectors-loc docs (closes #4270)	2019-09-17 14:59:12 +02:00
Ines Montani	25c2b4b9a5	Improve init-model docs (see #4137)	2019-09-17 14:51:44 +02:00
Ines Montani	198b7e9789	Auto-format [ci skip]	2019-09-17 14:48:35 +02:00
Ines Montani	691e0088cf	Remove duplicate tok2vec property (closes #4302)	2019-09-17 11:22:03 +02:00
Ines Montani	a84025d70b	Remove --no-deps from default pip args on download Add warning if user is executing spaCy without having it installed and add --no-deps to prevent the package from being redownloaded	2019-09-16 23:32:41 +02:00
Matthew Honnibal	84c65f9455	Merge branch 'master' into develop	2019-09-16 22:12:20 +02:00
Matthew Honnibal	47055d5988	Fix type declarations in _merge method	2019-09-16 22:10:13 +02:00
Sofie Van Landeghem	03ac29f437	Ensure that doc.ents preserves kb_id annotations (#4294) * bugfix: ensure doc.ents preserves kb_id annotations * fix backward compatibility * additional test	2019-09-16 15:18:37 +02:00
Ines Montani	139428c20f	Set unique vector names in tests	2019-09-16 15:16:54 +02:00
Ines Montani	bf06d9d537	Allow passing vectors_name to Vocab	2019-09-16 15:16:41 +02:00
Ines Montani	cb6c68a573	Pass vectors name correctly in prune_vectors	2019-09-16 15:16:29 +02:00
Ines Montani	3ba5238282	Make "unnamed vectors" warning a real warning	2019-09-16 15:16:12 +02:00
adrianeboyd	b5d999e510	Add textcat to train CLI (#4226) * Add doc.cats to spacy.gold at the paragraph level Support `doc.cats` as `"cats": [{"label": string, "value": number}]` in the spacy JSON training format at the paragraph level. * `spacy.gold.docs_to_json()` writes `docs.cats` * `GoldCorpus` reads in cats in each `GoldParse` * Update instances of gold_tuples to handle cats Update iteration over gold_tuples / gold_parses to handle addition of cats at the paragraph level. * Add textcat to train CLI * Add textcat options to train CLI * Add textcat labels in `TextCategorizer.begin_training()` * Add textcat evaluation to `Scorer`: * For binary exclusive classes with provided label: F1 for label * For 2+ exclusive classes: F1 macro average * For multilabel (not exclusive): ROC AUC macro average (currently relying on sklearn) * Provide user info on textcat evaluation settings, potential incompatibilities * Provide pipeline to Scorer in `Language.evaluate` for textcat config * Customize train CLI output to include only metrics relevant to current pipeline * Add textcat evaluation to evaluate CLI * Fix handling of unset arguments and config params Fix handling of unset arguments and model config parameters in Scorer initialization. * Temporarily add sklearn requirement * Remove sklearn version number * Improve Scorer handling of models without textcats * Fixing Scorer handling of models without textcats * Update Scorer output for python 2.7 * Modify inf in Scorer for python 2.7 * Auto-format Also make small adjustments to make auto-formatting with black easier and produce nicer results * Move error message to Errors * Update documentation * Add cats to annotation JSON format [ci skip] * Fix tpl flag and docs [ci skip] * Switch to internal roc_auc_score Switch to internal `roc_auc_score()` adapted from scikit-learn. * Add AUCROCScore tests and improve errors/warnings * Add tests for AUCROCScore and roc_auc_score * Add missing error for only positive/negative values * Remove unnecessary warnings and errors * Make reduced roc_auc_score functions private Because most of the checks and warnings have been stripped for the internal functions and access is only intended through `ROCAUCScore`, make the functions for roc_auc_score adapted from scikit-learn private. * Check that data corresponds with multilabel flag Check that the training instances correspond with the multilabel flag, adding the multilabel flag if required. * Add textcat score to early stopping check * Add more checks to debug-data for textcat * Add example training data for textcat * Add more checks to textcat train CLI * Check configuration when extending base model * Fix typos * Update textcat example data * Provide licensing details and licenses for data * Remove two labels with no positive instances from jigsaw-toxic-comment data. Co-authored-by: Ines Montani <ines@ines.io>	2019-09-15 22:31:31 +02:00
Ines Montani	bab9976d9a	💫 Adjust Table API and add docs (#4289) * Adjust Table API and add docs * Add attributes and update description [ci skip] * Use strings.get_string_id instead of hash_string * Fix table method calls * Make orth arg in Lemmatizer.lookup optional Fall back to string, which is now handled by Table.__contains__ out-of-the-box * Fix method name * Auto-format	2019-09-15 22:08:13 +02:00
Ines Montani	88a9d87f6f	Fix test	2019-09-15 18:04:44 +02:00
Ines Montani	23e28e2844	Merge branch 'master' into develop	2019-09-15 17:57:09 +02:00

11162 Commits