spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-04-17 07:31:59 +03:00

Author	SHA1	Message	Date
adrianeboyd	2c876eb672	Add tokenizer explain() debugging method (#4596 ) * Expose tokenizer rules as a property Expose the tokenizer rules property in the same way as the other core properties. (The cache resetting is overkill, but consistent with `from_bytes` for now.) Add tests and update Tokenizer API docs. * Update Hungarian punctuation to remove empty string Update Hungarian punctuation definitions so that `_units` does not match an empty string. * Use _load_special_tokenization consistently Use `_load_special_tokenization()` and have it to handle `None` checks. * Fix precedence of `token_match` vs. special cases Remove `token_match` check from `_split_affixes()` so that special cases have precedence over `token_match`. `token_match` is checked only before infixes are split. * Add `make_debug_doc()` to the Tokenizer Add `make_debug_doc()` to the Tokenizer as a working implementation of the pseudo-code in the docs. Add a test (marked as slow) that checks that `nlp.tokenizer()` and `nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens for all languages that have `examples.sentences` that can be imported. * Update tokenization usage docs Update pseudo-code and algorithm description to correspond to `nlp.tokenizer.make_debug_doc()` with example debugging usage. Add more examples for customizing tokenizers while preserving the existing defaults. Minor edits / clarifications. * Revert "Update Hungarian punctuation to remove empty string" This reverts commit `f0a577f7a5`. * Rework `make_debug_doc()` as `explain()` Rework `make_debug_doc()` as `explain()`, which returns a list of `(pattern_string, token_string)` tuples rather than a non-standard `Doc`. Update docs and tests accordingly, leaving the visualization for future work. * Handle cases with bad tokenizer patterns Detect when tokenizer patterns match empty prefixes and suffixes so that `explain()` does not hang on bad patterns. 
* Remove unused displacy image * Add tokenizer.explain() to usage docs	2019-11-20 13:07:25 +01:00
Matthew Honnibal	4b123952aa	Add option for improved NER feature extraction (#4671 ) * Support option of three NER features * Expose nr_feature parser model setting * Give feature tokens better name * Test nr_feature=3 for NER * Format	2019-11-19 15:03:14 +01:00
Ines Montani	74b951fe61	Fix xpassing tests (#4657 ) * Ignore internal warnings * Un-xfail passing tests * Skip instead of xfail	2019-11-16 20:20:53 +01:00
Ines Montani	3bd15055ce	Fix bug in Language.evaluate for components without .pipe (#4662 )	2019-11-16 20:20:37 +01:00
Christoph Purschke	433748e867	Fix basic language support for Luxembourgish (by adding punctuation.py) (#4648 ) * Update __init__.py * Create punctuation.py * Update tokenizer_exceptions.py * Create questoph.md * Update questoph.md * Update test_text.py * Update test_text.py * Update test_text.py * Update test_text.py	2019-11-15 16:16:47 +01:00
adrianeboyd	91f89f9693	Fix realloc in retokenizer.split() (#4606 ) Always realloc to a size larger than `doc.max_length` in `retokenizer.split()` (or cymem will throw errors).	2019-11-11 16:26:46 +01:00
adrianeboyd	0b9a5f4074	Rework Chinese language initialization and tokenization (#4619 ) * Rework Chinese language initialization * Create a `ChineseTokenizer` class * Modify jieba post-processing to handle whitespace correctly * Modify non-jieba character tokenization to handle whitespace correctly * Add a `create_tokenizer()` method to `ChineseDefaults` * Load lexical attributes * Update Chinese tag_map for UD v2 * Add very basic Chinese tests * Test tokenization with and without jieba * Test `like_num` attribute * Fix try_jieba_import() * Fix zh code formatting	2019-11-11 14:23:21 +01:00
Priscilla de Abreu Lopes	39e79fcc86	Bugfix/dep matcher issue 4590 (#4601 ) * add contributor agreement for prilopes * add test for issue #4590 * fix on_match params for DependencyMatcher (#4590)	2019-11-07 12:01:06 +01:00
Ines Montani	09cec3e41b	Replace function registries with catalogue (#4584 ) * Replace functions registries with catalogue * Update __init__.py * Fix test * Revert unrelated flag [ci skip]	2019-11-07 11:45:22 +01:00
adrianeboyd	56ad3a3988	Add LAS per dependency to Scorer (#4560 )	2019-10-31 21:18:16 +01:00
Matthew Honnibal	e82306937e	Put Tok2Vec refactor behind feature flag (#4563 ) * Add back pre-2.2.2 tok2vec * Add simple tok2vec tests * Add simple tok2vec tests * Reformat * Fix CharacterEmbed in new tok2vec * Fix legacy tok2vec * Resolve circular imports * Fix test for Python 2	2019-10-31 15:01:15 +01:00
Ines Montani	5e9849b60f	Auto-format [ci skip]	2019-10-30 19:27:18 +01:00
Ines Montani	afe4a428f7	Fix pipeline analysis on remove pipe (#4557 ) Validate after component is removed, not before	2019-10-30 19:04:17 +01:00
Ines Montani	85f2b04c45	Support span._. in component decorator attrs (#4555 ) * Support span._. in component decorator attrs * Adjust error [ci skip]	2019-10-30 17:19:36 +01:00
Matthew Honnibal	a927b3a21e	Put new alignment behind flag for v2.2.2 release (#4541 ) * Xfail new tokenization test * Put new alignment behind feature flag * Move USE_ALIGN to top of the file [ci skip] Co-authored-by: Ines Montani <ines@ines.io>	2019-10-28 16:12:32 +01:00
Ines Montani	a90025b277	Fix serialization of extension attr values in DocBin (#4540 )	2019-10-28 16:02:13 +01:00
tamuhey	df293f3894	modified gold.align to handle space tokens (#4537 ) Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>	2019-10-28 15:44:28 +01:00
adrianeboyd	f2bfaa1b38	Filter subtoken matches in merge_subtokens() (#4539 ) The `Matcher` in `merge_subtokens()` returns all possible subsequences of `subtok`, so for sequences of two or more subtoks it's necessary to filter the matches so that the retokenizer is only merging the longest matches with no overlapping spans.	2019-10-28 15:40:28 +01:00
Ines Montani	96bb8f2187	Add regression test for #4528 [ci skip]	2019-10-28 14:36:03 +01:00
Ines Montani	c5e41247e8	Tidy up and auto-format	2019-10-28 12:43:55 +01:00
Matthw Honnibal	426b745640	Fix tests for gpu	2019-10-27 22:19:18 +01:00
Sofie Van Landeghem	8e7414dace	Match pop with append for training format (#4516 ) * trying to fix script - not successful yet * match pop() with extend() to avoid changing the data * few more pop-extend fixes * reinsert deleted print statement * fix print statement * add last tested version * append instead of extend * add in few comments * quick fix for 4402 + unit test * fixing number of docs (not counting cats) * more fixes * fix len * print tmp file instead of using data from examples dir * print tmp file instead of using data from examples dir (2)	2019-10-27 16:01:32 +01:00
tamuhey	fcd25db033	[#4529 ] fix: gold pyx (#4530 ) * fix: gold pyx * remove print * skip test in python2 * Add unicode declarations and don't skip test on Python 2	2019-10-27 13:50:07 +01:00
tamuhey	554850206c	[#4525 ] fix gold.align (#4526 ) * fix: gold.align * fix align * remove old align	2019-10-27 13:38:04 +01:00
Ines Montani	a9c6104047	Component decorator and component analysis (#4517 ) * Add work in progress * Update analysis helpers and component decorator * Fix porting of docstrings for Python 2 * Fix docstring stuff on Python 2 * Support meta factories when loading model * Put auto pipeline analysis behind flag for now * Analyse pipes on remove_pipe and replace_pipe * Move analysis to root for now Try to find a better place for it, but it needs to go for now to avoid circular imports * Simplify decorator Don't return a wrapped class and instead just write to the object * Update existing components and factories * Add condition in factory for classes vs. functions * Add missing from_nlp classmethods * Add "retokenizes" to printed overview * Update assigns/requires declarations of builtins * Only return data if no_print is enabled * Use multiline table for overview * Don't support Span * Rewrite errors/warnings and move them to spacy.errors	2019-10-27 13:35:49 +01:00
Ines Montani	cfffdba7b1	Implement new API for {Phrase}Matcher.add (backwards-compatible) (#4522 ) * Implement new API for {Phrase}Matcher.add (backwards-compatible) * Update docs * Also update DependencyMatcher.add * Update internals * Rewrite tests to use new API * Add basic check for common mistake Raise error with suggestion if user likely passed in a pattern instead of a list of patterns * Fix typo [ci skip]	2019-10-25 22:21:08 +02:00
Ines Montani	d2da117114	Also support passing list to Language.disable_pipes (#4521 ) * Also support passing list to Language.disable_pipes * Adjust internals	2019-10-25 16:19:08 +02:00
Ines Montani	cc05d9dad6	Auto-format [ci skip]	2019-10-24 16:21:08 +02:00
Ines Montani	73dc63d3bf	Tidy up and auto-format [ci skip]	2019-10-24 16:20:48 +02:00
Ines Montani	9a849fe54e	Explicitly catch warning in test	2019-10-24 16:16:27 +02:00
adrianeboyd	8516e9d53b	Support train dict format as JSONL (#4471 ) * Support train dict format as JSONL * Add (overly simple) check for dict vs. tuple to read JSONL lines as either train dicts or train tuples * Extend JSON/JSONL roundtrip conversion tests using `docs_to_json()` and `GoldCorpus.train_tuples` * Revert docs to default JSON output with convert	2019-10-23 16:01:44 +02:00
Matthew Honnibal	9489c5f6b2	Clip most_similar to range [-1, 1] (fixes #4506 ) (#4507 ) * Clip most_similar to range [-1, 1] * Add/fix vectors tests * Fix test	2019-10-22 20:10:42 +02:00
Ines Montani	74a19aeb1c	Add xfailing test [ci skip]	2019-10-22 18:18:43 +02:00
gustavengstrom	050e2445a8	Adding noun_chunks to the Swedish language model (sv) (#4422 ) * Create syntax_iterators.py Replica of spacy/lang/fr/syntax_iterators.py * Added import statements for SYNTAX_ITERATORS * Create gustavengstrom.md * Added "dobj" to list of labels in noun_chunks method and a test_noun_chunks method to the Swedish language model. * Delete README-checkpoint.md Co-authored-by: Gustav <gustav@davcon.se> Co-authored-by: Ines Montani <ines@ines.io>	2019-10-21 12:57:06 +02:00
adrianeboyd	f5c551a43a	Checks/errors related to ill-formed IOB input in CLI convert and debug-data (#4487 ) * Error for ill-formed input to iob_to_biluo() Check for empty label in iob_to_biluo(), which can result from ill-formed input. * Check for empty NER label in debug-data	2019-10-21 12:20:28 +02:00
Sofie Van Landeghem	d5d55312b2	prevent division by zero in most_similar method (#4488 )	2019-10-21 12:04:46 +02:00
Ines Montani	181c01f629	Tidy up and auto-format	2019-10-18 11:27:38 +02:00
Daniel King	e646956176	Most similar bug (#4446 ) * Add batch size indexing * Don't sort if n == 1 * Add test for most similar vectors issue * Change > to >=	2019-10-16 23:18:55 +02:00
adrianeboyd	275c9ad872	Allow int values in token patterns (#4444 ) * Add missing int value option to top-level pattern validation in Matcher * Adjust existing tests accordingly * Add new test for valid pattern `{"LENGTH": int}`	2019-10-16 13:40:18 +02:00
Sofie Van Landeghem	7d1efac4eb	Fix remove pattern from matcher (#4454 ) * raise specific error when removing a matcher rule that doesn't exist * rephrasing * bugfix in remove matcher + extended unit test	2019-10-16 13:34:58 +02:00
Sofie Van Landeghem	2d249a9502	KB extensions and better parsing of WikiData (#4375 ) * fix overflow error on windows * more documentation & logging fixes * md fix * 3 different limit parameters to play with execution time * bug fixes directory locations * small fixes * exclude dev test articles from prior probabilities stats * small fixes * filtering wikidata entities, removing numeric and meta items * adding aliases from wikidata also to the KB * fix adding WD aliases * adding also new aliases to previously added entities * fixing comma's * small doc fixes * adding subclassof filtering * append alias functionality in KB * prevent appending the same entity-alias pair * fix for appending WD aliases * remove date filter * remove unnecessary import * small corrections and reformatting * remove WD aliases for now (too slow) * removing numeric entities from training and evaluation * small fixes * shortcut during prediction if there is only one candidate * add counts and fscore logging, remove FP NER from evaluation * fix entity_linker.predict to take docs instead of single sentences * remove enumeration sentences from the WP dataset * entity_linker.update to process full doc instead of single sentence * spelling corrections and dump locations in readme * NLP IO fix * reading KB is unnecessary at the end of the pipeline * small logging fix * remove empty files	2019-10-14 12:28:53 +02:00
Peter Gilles	428887b8f2	Initial commit: New language Luxembourgish (lb) (#4424 ) * new language: Luxembourgish (lb) * update * update * Update and rename .github/CONTRIBUTOR_AGREEMENT.md to .github/contributors/PeterGilles.md * Update and rename .github/contributors/PeterGilles.md to .github/CONTRIBUTOR_AGREEMENT.md * Update norm_exceptions.py * Delete README.md * moved test_lemma.py * deactivated 'lemma_lookup = LOOKUP' * update * Update conftest.py * update * tests updated * import unicode_literals * Update spacy/tests/lang/lb/test_text.py Co-Authored-By: Ines Montani <ines@ines.io> * Create PeterGilles.md	2019-10-14 12:27:50 +02:00
adrianeboyd	98a961a60e	Fix PhraseMatcher.remove for overlapping patterns (#4437 )	2019-10-14 12:19:51 +02:00
adrianeboyd	6f54e59fe7	Fix util.filter_spans() to prefer first span in overlapping sam… (#4414 ) * Update util.filter_spans() to prefer earlier spans * Add filter_spans test for first same-length span * Update entity relation example to refer to util.filter_spans()	2019-10-10 17:00:03 +02:00
Sofie Van Landeghem	5efae495f1	Error when removing a matcher rule that doesn't exist (#4420 ) * raise specific error when removing a matcher rule that doesn't exist * rephrasing	2019-10-10 14:01:53 +02:00
Matthew Honnibal	ddd6fda59c	Add registry for model creation functions ('architectures') (#4395 ) * Add architecture registry * Add test for arch registry * Add error for model architectures	2019-10-08 12:21:03 +02:00
tamuhey	650cbfe82d	multiprocessing pipe (#1303 ) (#4371 ) * refactor: separate formatting docs and golds in Language.update * fix return typo * add pipe test * unpickleable object cannot be assigned to p.map * passed test pipe * passed test! * pipe terminate * try pipe * passed test * fix ch * add comments * fix len(texts) * add comment * add comment * fix: multiprocessing of pipe is not supported in 2 * test: use assert_docs_equal * fix: is_python3 -> is_python2 * fix: change _pipe arg to use functools.partial * test: add vector modification test * test: add sample ner_pipe and user_data pipe * add warnings test * test: fix user warnings * test: fix warnings capture * fix: remove islice import * test: remove warnings test * test: add stream test * test: rename * fix: multiproc stream * fix: stream pipe * add comment * mp.Pipe seems to be able to use with relative small data * test: skip stream test in python2 * sort imports * test: add reason to skiptest * fix: use pipe for docs communucation * add comments * add comment	2019-10-08 12:20:55 +02:00
adrianeboyd	14841d0aa6	Fix PhraseMatcher callback and add tests (#4399 ) * Fix callback lookup in PhraseMatcher (string key rather than hash key) * Add callback tests for Matcher and PhraseMatcher	2019-10-08 12:07:02 +02:00
adrianeboyd	a3509f67d4	Extend unicode character block for Sinhala (#4378 ) * Extend unicode character block for Sinhala * Add sentencizer tests for more languages	2019-10-07 13:17:03 +02:00
adrianeboyd	cbc2cee2c8	Improve URL_PATTERN and handling in tokenizer (#4374 ) * Move prefix and suffix detection for URL_PATTERN Move prefix and suffix detection for `URL_PATTERN` into the tokenizer. Remove associated lookahead and lookbehind from `URL_PATTERN`. Fix tokenization for Hungarian given new modified handling of prefixes and suffixes. * Match a wider range of URI schemes	2019-10-05 13:00:09 +02:00
Ines Montani	fec9433044	Make PhraseMatcher.vocab consistent with Matcher.vocab (closes #4373 )	2019-10-04 12:18:41 +02:00
Sofie Van Landeghem	4e7259c6cf	Bugfix initializing DocBin with attributes (#4368 ) * docbin init fix + documentation fix + unit tests * newline * try with zlib instead of gzip (python 2 incompatibilities)	2019-10-03 14:48:45 +02:00
Ben Taylor	1db79a33cb	most_similar() return the k most similar vectors (#4364 ) * most_similar return n-most similar vectors * updated most_similar comment * add bintay contributor agreement * sign bintay contributor agreement * fix most_similar documentation typo * fixed error in prune_vectors * updated prune_vectors test	2019-10-03 14:09:44 +02:00
Sofie Van Landeghem	9d3ce7cba2	Ensure training doesn't crash with empty batches (#4360 ) * unit test for previously resolved unflatten issue * prevent batch of empty docs to cause problems	2019-10-02 12:50:47 +02:00
Ines Montani	cf65a80f36	Refactor lemmatizer and data table integration (#4353 ) * Move test * Allow default in Lookups.get_table * Start with blank tables in Lookups.from_bytes * Refactor lemmatizer to hold instance of Lookups * Get lookups table within the lemmatization methods to make sure it references the correct table (even if the table was replaced or modified, e.g. when loading a model from disk) * Deprecate other arguments on Lemmatizer.__init__ and expect Lookups for consistency * Remove old and unsupported Lemmatizer.load classmethod * Refactor language-specific lemmatizers to inherit as much as possible from base class and override only what they need * Update tests and docs * Fix more tests * Fix lemmatizer * Upgrade pytest to try and fix weird CI errors * Try pytest 4.6.5	2019-10-01 21:36:03 +02:00
Ines Montani	3297a19545	Warn in Tagger.begin_training if no lemma tables are available (#4351 )	2019-10-01 15:13:55 +02:00
Ines Montani	e0cf4796a5	Move lookup tables out of the core library (#4346 ) * Add default to util.get_entry_point * Tidy up entry points * Read lookups from entry points * Remove lookup tables and related tests * Add lookups install option * Remove lemmatizer tests * Remove logic to process language data files * Update setup.cfg	2019-10-01 00:01:27 +02:00
Ines Montani	4f905ac9e6	Add test for ASCII filenames (#4345 )	2019-09-30 18:45:30 +02:00
Ines Montani	f7d1736241	Skip duplicate spans in Doc.retokenize (#4339 )	2019-09-30 12:43:48 +02:00
Ines Montani	0226b3bf0e	Fix test imports	2019-09-29 17:34:56 +02:00
Ines Montani	3d8fd4b461	Revert #4334	2019-09-29 17:32:12 +02:00
Ines Montani	c9cd516d96	Move tests out of package (#4334 ) * Move tests out of package * Fix typo	2019-09-28 18:05:00 +02:00
Ines Montani	5fe61539c4	Fix unicode "e" in filename	2019-09-28 15:45:16 +02:00
Sofie Van Landeghem	22b9e12159	Ensure the NER remains consistent after resizing (#4330 ) * test and fix for second bug of issue 4042 * fix for first bug in 4042 * crashing test for Issue 4313 * forgot one instance of resize * remove prints * undo uncomment * delete test for 4313 (uses third party lib) * add fix for Issue 4313 * unit test for 4313	2019-09-27 20:57:13 +02:00
Ines Montani	acd5bcb0b3	Tidy up fixtures	2019-09-27 17:57:59 +02:00
adrianeboyd	c23edf302b	Replace PhraseMatcher with trie-based search (#4309 ) * Replace PhraseMatcher with Aho-Corasick Replace PhraseMatcher with the Aho-Corasick algorithm over numpy arrays of the hash values for the relevant attribute. The implementation is based on FlashText. The speed should be similar to the previous PhraseMatcher. It is now possible to easily remove match IDs and matches don't go missing with large keyword lists / vocabularies. Fixes #4308. * Restore support for pickling * Fix internal keyword add/remove for numpy arrays * Add missing loop for match ID set in search loop * Remove cruft in matching loop for partial matches There was a bit of unnecessary code left over from FlashText in the matching loop to handle partial token matches, which we don't have with PhraseMatcher. * Replace dict trie with MapStruct trie * Fix how match ID hash is stored/added * Update fix for match ID vocab * Switch from map_get_unless_missing to map_get * Switch from numpy array to Token.get_struct_attr Access token attributes directly in Doc instead of making a copy of the relevant values in a numpy array. Add unsatisfactory warning for hash collision with reserved terminal hash key. (Ideally it would change the reserved terminal hash and redo the whole trie, but for now, I'm hoping there won't be collisions.) * Restructure imports to export find_matches * Implement full remove() Remove unnecessary trie paths and free unused maps. Parallel to Matcher, raise KeyError when attempting to remove a match ID that has not been added. * Store docs internally only as attr lists * Reduces size for pickle * Remove duplicate keywords store Now that docs are stored as lists of attr hashes, there's no need to have the duplicate _keywords store.	2019-09-27 16:22:34 +02:00
Ines Montani	52904b7270	Raise if on_match is not callable or None	2019-09-24 23:06:24 +02:00
Ines Montani	9bf69bfbb2	Remove test	2019-09-19 17:38:41 +02:00
Matthew Honnibal	46c02d25b1	Merge changes to test_ner	2019-09-18 21:41:24 +02:00
Sofie Van Landeghem	de5a9ecdf3	Distinction between outside, missing and blocked NER annotations (#4307 ) * remove duplicate unit test * unit test (currently failing) for issue 4267 * bugfix: ensure doc.ents preserves kb_id annotations * fix in setting doc.ents with empty label * rename * test for presetting an entity to a certain type * allow overwriting Outside + blocking presets * fix actions when previous label needs to be kept * fix default ent_iob in set entities * cleaner solution with U- action * remove debugging print statements * unit tests with explicit transitions and is_valid testing * remove U- from move_names explicitly * remove unit tests with pre-trained models that don't work * remove (working) unit tests with pre-trained models * clean up unit tests * move unit tests * small fixes * remove two TODO's from doc.ents comments	2019-09-18 21:37:17 +02:00
tamuhey	875f3e5d8c	remove redundant __call__ method in pipes.TextCategorizer (#4305 ) * remove redundant __call__ method in pipes.TextCategorizer Because the parent __call__ method behaves in the same way. * fix: Pipe.__call__ arg * fix: invalid arg in Pipe.__call__ * modified: spacy/tests/regression/test_issue4278.py (#4278) * deleted: Pipfile	2019-09-18 21:31:27 +02:00
Ines Montani	00a8cbc306	Tidy up and auto-format	2019-09-18 20:27:03 +02:00
Ines Montani	f2c8b1e362	Simplify lookup hashing Just use get_string_id, which already does everything ensure_hash was supposed to do	2019-09-18 20:24:41 +02:00
Matthew Honnibal	84c65f9455	Merge branch 'master' into develop	2019-09-16 22:12:20 +02:00
Sofie Van Landeghem	03ac29f437	Ensure that doc.ents preserves kb_id annotations (#4294 ) * bugfix: ensure doc.ents preserves kb_id annotations * fix backward compatibility * additional test	2019-09-16 15:18:37 +02:00
Ines Montani	139428c20f	Set unique vector names in tests	2019-09-16 15:16:54 +02:00
adrianeboyd	b5d999e510	Add textcat to train CLI (#4226 ) * Add doc.cats to spacy.gold at the paragraph level Support `doc.cats` as `"cats": [{"label": string, "value": number}]` in the spacy JSON training format at the paragraph level. * `spacy.gold.docs_to_json()` writes `docs.cats` * `GoldCorpus` reads in cats in each `GoldParse` * Update instances of gold_tuples to handle cats Update iteration over gold_tuples / gold_parses to handle addition of cats at the paragraph level. * Add textcat to train CLI * Add textcat options to train CLI * Add textcat labels in `TextCategorizer.begin_training()` * Add textcat evaluation to `Scorer`: * For binary exclusive classes with provided label: F1 for label * For 2+ exclusive classes: F1 macro average * For multilabel (not exclusive): ROC AUC macro average (currently relying on sklearn) * Provide user info on textcat evaluation settings, potential incompatibilities * Provide pipeline to Scorer in `Language.evaluate` for textcat config * Customize train CLI output to include only metrics relevant to current pipeline * Add textcat evaluation to evaluate CLI * Fix handling of unset arguments and config params Fix handling of unset arguments and model confiug parameters in Scorer initialization. * Temporarily add sklearn requirement * Remove sklearn version number * Improve Scorer handling of models without textcats * Fixing Scorer handling of models without textcats * Update Scorer output for python 2.7 * Modify inf in Scorer for python 2.7 * Auto-format Also make small adjustments to make auto-formatting with black easier and produce nicer results * Move error message to Errors * Update documentation * Add cats to annotation JSON format [ci skip] * Fix tpl flag and docs [ci skip] * Switch to internal roc_auc_score Switch to internal `roc_auc_score()` adapted from scikit-learn. 
* Add AUCROCScore tests and improve errors/warnings * Add tests for AUCROCScore and roc_auc_score * Add missing error for only positive/negative values * Remove unnecessary warnings and errors * Make reduced roc_auc_score functions private Because most of the checks and warnings have been stripped for the internal functions and access is only intended through `ROCAUCScore`, make the functions for roc_auc_score adapted from scikit-learn private. * Check that data corresponds with multilabel flag Check that the training instances correspond with the multilabel flag, adding the multilabel flag if required. * Add textcat score to early stopping check * Add more checks to debug-data for textcat * Add example training data for textcat * Add more checks to textcat train CLI * Check configuration when extending base model * Fix typos * Update textcat example data * Provide licensing details and licenses for data * Remove two labels with no positive instances from jigsaw-toxic-comment data. Co-authored-by: Ines Montani <ines@ines.io>	2019-09-15 22:31:31 +02:00
Ines Montani	bab9976d9a	💫 Adjust Table API and add docs (#4289 ) * Adjust Table API and add docs * Add attributes and update description [ci skip] * Use strings.get_string_id instead of hash_string * Fix table method calls * Make orth arg in Lemmatizer.lookup optional Fall back to string, which is now handled by Table.__contains__ out-of-the-box * Fix method name * Auto-format	2019-09-15 22:08:13 +02:00
Ines Montani	88a9d87f6f	Fix test	2019-09-15 18:04:44 +02:00
Ines Montani	7194845234	Skip tests properly instead of xfailing them	2019-09-15 17:00:17 +02:00
Ines Montani	16c2522791	Merge branch 'master' into develop	2019-09-14 16:42:01 +02:00
adrianeboyd	6942a6a69b	Extend default punct for sentencizer (#4290 ) Most of these characters are for languages / writing systems that aren't supported by spacy, but I don't think it causes problems to include them. In the UD evals, Hindi and Urdu improve a lot as expected (from 0-10% to 70-80%) and Persian improves a little (90% to 96%). Tamil improves in combination with #4288. The punctuation list is converted to a set internally because of its increased length. Sentence final punctuation generated with: ``` unichars -gas '[\p{Sentence_Break=STerm}\p{Sentence_Break=ATerm}]' '\p{Terminal_Punctuation}' ``` See: https://stackoverflow.com/a/9508766/461847 Fixes #4269.	2019-09-14 15:25:48 +02:00
Ines Montani	3126dd0904	Tidy up and auto-format [ci skip]	2019-09-14 12:58:06 +02:00
Paul O'Leary McCann	29a9e636eb	Fix half-width space handling in JA (#4284 ) (closes #4262 ) Before this patch, half-width spaces between words were simply lost in Japanese text. This wasn't immediately noticeable because much Japanese text never uses spaces at all.	2019-09-13 16:28:12 +02:00
Ines Montani	3c3658ef9f	Merge branch 'master' into develop	2019-09-12 18:03:01 +02:00
Paul O'Leary McCann	7d8df69158	Bloom-filter backed Lookup Tables (#4268 ) * Improve load_language_data helper * WIP: Add Lookups implementation * Start moving lemma data over to JSON * WIP: move data over for more languages * Convert more languages * Fix lemmatizer fixtures in tests * Finish conversion * Auto-format JSON files * Fix test for now * Make sure tables are stored on instance * Update docstrings * Update docstrings and errors * Update test * Add Lookups.__len__ * Add serialization methods * Add Lookups.remove_table * Use msgpack for serialization to disk * Fix file exists check * Try using OrderedDict for everything * Update .flake8 [ci skip] * Try fixing serialization * Update test_lookups.py * Update test_serialize_vocab_strings.py * Lookups / Tables now work This implements the stubs in the Lookups/Table classes. Currently this is in Cython but with no type declarations, so that could be improved. * Add lookups to setup.py * Actually add lookups pyx The previous commit added the old py file... * Lookups work-in-progress * Move from pyx back to py * Add string based lookups, fix serialization * Update tests, language/lemmatizer to work with string lookups There are some outstanding issues here: - a pickling-related test fails due to the bloom filter - some custom lemmatizers (fr/nl at least) have issues More generally, there's a question of how to deal with the case where you have a string but want to use the lookup table. Currently the table allows access by string or id, but that's getting pretty awkward. * Change lemmatizer lookup method to pass (orth, string) * Fix token lookup * Fix French lookup * Fix lt lemmatizer test * Fix Dutch lemmatizer * Fix lemmatizer lookup test This was using a normal dict instead of a Table, so checks for the string instead of an integer key failed. * Make uk/nl/ru lemmatizer lookup methods consistent The mentioned tokenizers all have their own implementation of the `lookup` method, which accesses a `Lookups` table. 
The way that was called in `token.pyx` was changed so this should be updated to have the same arguments as `lookup` in `lemmatizer.py` (specificially (orth/id, string)). Prior to this change tests weren't failing, but there would probably be issues with normal use of a model. More tests should proably be added. Additionally, the language-specific `lookup` implementations seem like they might not be needed, since they handle things like lower-casing that aren't actually language specific. * Make recently added Greek method compatible * Remove redundant class/method Leftovers from a merge not cleaned up adequately.	2019-09-12 17:26:11 +02:00
Sofie Van Landeghem	9be4d1c105	Allow copying of user_data in as_doc (#4282 ) * Allow copying the user_data with as_doc + unit test * add option to docs * add typing * import fix * workaround to avoid bool clashing ... * bint instead of bool	2019-09-12 17:08:14 +02:00
Ines Montani	655b434553	Merge branch 'master' into develop	2019-09-12 11:39:18 +02:00
Ines Montani	ac0e27a825	💫 Add Language.pipe_labels (#4276 ) * Add Language.pipe_labels * Update spacy/language.py Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>	2019-09-12 10:56:28 +02:00
tamuhey	71909cdf22	Fix iss4278 (#4279 ) * fix: len(tuple) == 2 * (#4278) add fail test * add contributor's agreement	2019-09-12 10:44:49 +02:00
Ines Montani	8ebc3711dc	Fix bug in Parser.labels and add test (#4275 )	2019-09-11 18:29:35 +02:00
Ines Montani	af25323653	Tidy up and auto-format	2019-09-11 14:00:36 +02:00
Ines Montani	e82a8d0d7a	Merge branch 'master' into develop	2019-09-11 11:52:38 +02:00
Ines Montani	8f9f48b04c	Add GreekLemmatizer.lookup (resolves #4272 )	2019-09-11 11:44:40 +02:00
Ines Montani	6279d74c65	Tidy up and auto-format	2019-09-11 11:38:22 +02:00
Matthew Honnibal	7b858ba606	Update from master	2019-09-10 20:14:08 +02:00
Ines Montani	669a7d37ce	Exclude vocab when testing to_bytes	2019-09-10 19:45:16 +02:00
adrianeboyd	c32126359a	Allow period as suffix following punctuation (#4248 ) Addresses rare cases (such as `_MATH_.`, see #1061) where the final period was not recognized as a suffix following punctuation.	2019-09-09 19:19:22 +02:00
Ines Montani	3e8f136ba7	💫 WIP: Basic lookup class scaffolding and JSON for all lemmatizer data (#4178 ) * Improve load_language_data helper * WIP: Add Lookups implementation * Start moving lemma data over to JSON * WIP: move data over for more languages * Convert more languages * Fix lemmatizer fixtures in tests * Finish conversion * Auto-format JSON files * Fix test for now * Make sure tables are stored on instance * Update docstrings * Update docstrings and errors * Update test * Add Lookups.__len__ * Add serialization methods * Add Lookups.remove_table * Use msgpack for serialization to disk * Fix file exists check * Try using OrderedDict for everything * Update .flake8 [ci skip] * Try fixing serialization * Update test_lookups.py * Update test_serialize_vocab_strings.py * Fix serialization for lookups * Fix lookups * Fix lookups * Fix lookups * Try to fix serialization * Try to fix serialization * Try to fix serialization * Try to fix serialization * Give up on serialization test * Xfail more serialization tests for 3.5 * Fix lookups for 2.7	2019-09-09 19:17:55 +02:00
adrianeboyd	3780e2ff50	Flush tokenizer cache when necessary (#4258 ) Flush tokenizer cache when affixes, token_match, or special cases are modified. Fixes #4238, same issue as in #1250.	2019-09-08 20:52:46 +02:00
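
Note on the span-filtering commits above (6f54e59fe7 and f2bfaa1b38): both describe keeping the longest matches with no overlapping spans, and preferring the earlier span when overlapping spans have the same length. The following is a minimal pure-Python sketch of that rule, not the library's own code. It assumes spans are plain, non-empty `(start, end)` token-offset tuples with an exclusive end, and the function name `filter_overlapping_spans` is hypothetical.

```python
def filter_overlapping_spans(spans):
    """Illustrative sketch only: spans are assumed to be non-empty
    (start, end) token-offset tuples with an exclusive end."""
    # Longest spans first; among spans of equal length, the earliest first.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    seen_tokens = set()  # token indices already covered by a kept span
    for start, end in ordered:
        # Candidates are visited longest-first, so a candidate that overlaps
        # a kept span always has its first or last token inside that span.
        if start not in seen_tokens and (end - 1) not in seen_tokens:
            kept.append((start, end))
            seen_tokens.update(range(start, end))
    # Return the surviving spans in document order.
    return sorted(kept)


longest_first = filter_overlapping_spans([(0, 2), (1, 3), (0, 3), (4, 5), (3, 5)])
same_length = filter_overlapping_spans([(1, 3), (0, 2)])
```

With these inputs, the three-token span `(0, 3)` wins over the two shorter spans it overlaps, and for the two same-length spans the earlier one, `(0, 2)`, is kept.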


1503 Commits