spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-07-13 17:52:31 +03:00

Author	SHA1	Message	Date
Adriane Boyd	72c2f98dc9	Switch special case reload threshold to variable Refer to variable instead of hard-coded threshold	2019-09-27 09:24:52 +02:00
Adriane Boyd	669bc1a314	Switch to local cdef functions for span filtering	2019-09-26 21:00:46 +02:00
Adriane Boyd	ae348bee43	Switch to PhraseMatcher.find_matches	2019-09-26 14:43:22 +02:00
Adriane Boyd	63b014d09f	Merge branch 'feature/hashmatcher' into bugfix/tokenizer-special-cases-matcher	2019-09-26 14:34:09 +02:00
Adriane Boyd	3fdb22d832	Implement full remove() Remove unnecessary trie paths and free unused maps. Parallel to Matcher, raise KeyError when attempting to remove a match ID that has not been added.	2019-09-26 11:31:03 +02:00
Adriane Boyd	7862a6eb01	Restructure imports to export find_matches	2019-09-25 11:03:58 +02:00
Adriane Boyd	3c6f1d7e3a	Switch from numpy array to Token.get_struct_attr Access token attributes directly in Doc instead of making a copy of the relevant values in a numpy array. Add unsatisfactory warning for hash collision with reserved terminal hash key. (Ideally it would change the reserved terminal hash and redo the whole trie, but for now, I'm hoping there won't be collisions.)	2019-09-25 09:41:27 +02:00
Adriane Boyd	d995a7849e	Switch from map_get_unless_missing to map_get	2019-09-24 16:20:24 +02:00
Adriane Boyd	34550ef662	Update fix for match ID vocab	2019-09-24 16:07:38 +02:00
Adriane Boyd	d4141302b6	Fix how match ID hash is stored/added	2019-09-24 15:36:26 +02:00
Adriane Boyd	39540ed1ce	Replace dict trie with MapStruct trie	2019-09-24 14:39:50 +02:00
Adriane Boyd	a7e9c0fd3e	Remove cruft in matching loop for partial matches There was a bit of unnecessary code left over from FlashText in the matching loop to handle partial token matches, which we don't have with PhraseMatcher.	2019-09-23 09:11:13 +02:00
Adriane Boyd	c38c330585	Add missing loop for match ID set in search loop	2019-09-21 15:57:38 +02:00
Adriane Boyd	d92e8c8ac8	Update error message number	2019-09-20 20:36:53 +02:00
Adriane Boyd	73ca0ce4f3	Merge remote-tracking branch 'upstream/master' into bugfix/tokenizer-special-cases-matcher	2019-09-20 16:44:33 +02:00
Adriane Boyd	d3990d080c	Improve efficiency of special cases handling * Use PhraseMatcher instead of Matcher * Improve efficiency of merging/splitting special cases in document * Process merge/splits in one pass without repeated token shifting * Merge in place if no splits	2019-09-20 16:39:30 +02:00
Adriane Boyd	e74963acd4	Add test for #4248, clean up test	2019-09-20 09:20:57 +02:00
Adriane Boyd	3a4e1f5ca7	Fix internal keyword add/remove for numpy arrays	2019-09-20 09:18:38 +02:00
Adriane Boyd	0d851db6d9	Restore support for pickling	2019-09-19 20:20:53 +02:00
Adriane Boyd	3931368ce8	Merge remote-tracking branch 'upstream/master' into feature/hashmatcher	2019-09-19 17:42:17 +02:00
Ines Montani	9bf69bfbb2	Remove test	2019-09-19 17:38:41 +02:00
Adriane Boyd	0d9740e826	Replace PhraseMatcher with Aho-Corasick Replace PhraseMatcher with the Aho-Corasick algorithm over numpy arrays of the hash values for the relevant attribute. The implementation is based on FlashText. The speed should be similar to the previous PhraseMatcher. It is now possible to easily remove match IDs and matches don't go missing with large keyword lists / vocabularies. Fixes #4308.	2019-09-19 16:49:05 +02:00
Ines Montani	8cd3763678	Update about.py [ci skip]	2019-09-19 01:02:25 +02:00
Matthew Honnibal	f52b857953	Update version	2019-09-19 00:56:35 +02:00
Matthew Honnibal	e34b4a38b0	Fix set labels meta	2019-09-19 00:56:07 +02:00
Matthew Honnibal	9d399fe63a	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2019-09-19 00:04:06 +02:00
Matthew Honnibal	7d510c833e	Fix orth replacement	2019-09-19 00:03:24 +02:00
Ines Montani	89d1dc4afa	Merge branch 'master' into develop	2019-09-18 22:12:24 +02:00
Sean Löfgren	31c683d87d	add return_matches and as_tuples back to Matcher.pipe (#4303) * add contributor agreement [ci skip] * add return_matches and as_tuples back to Matcher.pipe	2019-09-18 22:00:33 +02:00
Matthew Honnibal	42df49133d	Also lower-case in orth variants	2019-09-18 21:54:51 +02:00
Matthew Honnibal	19d99fc9e7	Set version to v2.2.0.dev7	2019-09-18 21:43:59 +02:00
Matthew Honnibal	46c02d25b1	Merge changes to test_ner	2019-09-18 21:41:24 +02:00
Sofie Van Landeghem	de5a9ecdf3	Distinction between outside, missing and blocked NER annotations (#4307) * remove duplicate unit test * unit test (currently failing) for issue 4267 * bugfix: ensure doc.ents preserves kb_id annotations * fix in setting doc.ents with empty label * rename * test for presetting an entity to a certain type * allow overwriting Outside + blocking presets * fix actions when previous label needs to be kept * fix default ent_iob in set entities * cleaner solution with U- action * remove debugging print statements * unit tests with explicit transitions and is_valid testing * remove U- from move_names explicitly * remove unit tests with pre-trained models that don't work * remove (working) unit tests with pre-trained models * clean up unit tests * move unit tests * small fixes * remove two TODO's from doc.ents comments	2019-09-18 21:37:17 +02:00
Moshe Hazoom	72463b062f	Improve speed of _merge method (#4300) * make merge more efficient * fix offsets * merge works with relative indices * remove printing * Add the SCA * fix SCA date * more cythonize _retokenize.pyx * more cythonize _retokenize.pyx * fix only declaration in _retokenize.pyx * switch back to absolute head * switch back to absolute head * fix comment * merge from origin repo	2019-09-18 21:34:34 +02:00
tamuhey	875f3e5d8c	remove redundant __call__ method in pipes.TextCategorizer (#4305) * remove redundant __call__ method in pipes.TextCategorizer Because the parent __call__ method behaves in the same way. * fix: Pipe.__call__ arg * fix: invalid arg in Pipe.__call__ * modified: spacy/tests/regression/test_issue4278.py (#4278) * deleted: Pipfile	2019-09-18 21:31:27 +02:00
Ines Montani	00a8cbc306	Tidy up and auto-format	2019-09-18 20:27:03 +02:00
Ines Montani	f2c8b1e362	Simplify lookup hashing Just use get_string_id, which already does everything ensure_hash was supposed to do	2019-09-18 20:24:41 +02:00
Ines Montani	dd1810f05a	Update DocBin and add docs	2019-09-18 20:23:21 +02:00
Ines Montani	7e810cced6	Add references to docs pages	2019-09-18 19:57:21 +02:00
Ines Montani	2e5ab5b59c	Make except more explicit	2019-09-18 19:57:08 +02:00
Ines Montani	1f648ecb76	Auto-format	2019-09-18 19:56:55 +02:00
Ines Montani	0f7fe5e7a7	Auto-format and fix typo and consistency	2019-09-18 19:18:30 +02:00
Matthew Honnibal	e53b86751f	DocPallet -> DocBin	2019-09-18 15:15:37 +02:00
Matthew Honnibal	fa9a283128	Fix name	2019-09-18 13:40:03 +02:00
Matthew Honnibal	88a23cf49a	Fix name	2019-09-18 13:38:29 +02:00
Matthew Honnibal	3507943b15	Add docstring for DocPallet	2019-09-18 13:25:47 +02:00
Matthew Honnibal	1c8de6b2e5	Rename DocBox->DocPallet	2019-09-18 13:13:51 +02:00
Ines Montani	691e0088cf	Remove duplicate tok2vec property (closes #4302)	2019-09-17 11:22:03 +02:00
Ines Montani	a84025d70b	Remove --no-deps from default pip args on download Add warning if user is executing spaCy without having it installed and add --no-deps to prevent the package from being redownloaded	2019-09-16 23:32:41 +02:00
Matthew Honnibal	84c65f9455	Merge branch 'master' into develop	2019-09-16 22:12:20 +02:00

6431 Commits