spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-07-11 00:32:40 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	82277f63a3	💫 Small efficiency fixes to tokenizer (#2587) This patch improves tokenizer speed by about 10% and reduces memory usage in the `Vocab` by removing a redundant index. `vocab._by_orth` and `vocab._by_hash` were indexed on different data in v1, but in v2 the orth and the hash are identical. The patch also fixes an uninitialized variable in the tokenizer, the `has_special` flag. This flag records whether a chunk we're tokenizing triggers a special-case rule. If it does, we avoid caching within the chunk. Because the flag was uninitialized, the check incorrectly rejected some chunks from the cache. With the `en_core_web_md` model, we now tokenize the IMDB train data at 503,104 words per second. Prior to this patch, we had 465,764 words per second. Before switching to the regex library and supporting more languages, we had 1.3m words per second for the tokenizer. In order to recover the missing speed, we need to: * Fix the variable-length lookarounds in the suffix, infix and `token_match` rules * Improve the performance of the `token_match` regex * Switch back from the `regex` library to the `re` library.	2018-07-24 23:35:54 +02:00
Suraj Krishnan Rajan	69d041148f	Implement Fast-Text vectors with subword features	2018-04-21 01:34:14 +05:30
Ines Montani	3141e04822	💫 New system for error messages and warnings (#2163) * Add spacy.errors module * Update deprecation and user warnings * Replace errors and asserts with new error message system * Remove redundant asserts * Fix whitespace * Add messages for print/util.prints statements * Fix typo * Fix typos * Move CLI messages to spacy.cli._messages * Add decorator to display error code with message An implementation like this is nice because it only modifies the string when it's retrieved from the containing class – so we don't have to worry about manipulating tracebacks etc. * Remove unused link in spacy.about * Update errors for invalid pipeline components * Improve error for unknown factories * Add displaCy warnings * Update formatting consistency * Move error message to spacy.errors * Update errors and check if doc returned by component is None	2018-04-03 15:50:31 +02:00
Matthew Honnibal	95a9615221	Fix loading of multiple pre-trained vectors. This patch addresses #1660, which was caused by keying all pre-trained vectors with the same ID when telling Thinc how to refer to them. This meant that if multiple models with pre-trained vectors were loaded, errors or incorrect behaviour resulted. The vectors class now includes a .name attribute, which defaults to: {nlp.meta['lang']}_{nlp.meta['name']}.vectors The vectors name is set in the cfg of the pipeline components under the key pretrained_vectors. This replaces the previous cfg key pretrained_dims. In order to make existing models compatible with this change, we check for the pretrained_dims key when loading models in from_disk and from_bytes, and add the cfg key pretrained_vectors if we find it.	2018-03-28 16:02:59 +02:00
Matthew Honnibal	43f381ce36	Make Vocab.__contains__ work with ints. Fixes #1868	2018-01-23 23:26:47 +01:00
Matthew Honnibal	0153220304	Make set_vector add word to vocab. Fixes #1807	2018-01-14 13:57:57 +01:00
Matthew Honnibal	07acb43a85	Merge branch 'master' of https://github.com/explosion/spaCy	2017-12-04 14:42:52 +01:00
Matthew Honnibal	79f11d4f85	Pickle vectors with vocab	2017-11-23 17:19:50 +01:00
Roman Domrachev	b3311100c7	Merge branch 'master' of github.com:explosion/spaCy	2017-11-15 18:30:04 +03:00
ines	8e65247886	Fix lex.id if vectors is None	2017-11-15 14:23:58 +01:00
Matthew Honnibal	2f169fdb0a	Set lex ID correctly for new tokens in Vocab	2017-11-15 13:58:03 +01:00
Roman Domrachev	3e21680814	Use safer method to get string without hit	2017-11-14 22:58:46 +03:00
Roman Domrachev	91e2fa6561	Clean all caches	2017-11-14 21:15:04 +03:00
Matthew Honnibal	c16310d156	Update vectors with find method	2017-11-01 00:34:55 +01:00
Matthew Honnibal	c5799ecc7b	Remove print statement	2017-10-31 21:12:33 +01:00
Matthew Honnibal	77d8f5de9a	Revise and simplify Vectors class	2017-10-31 18:25:08 +01:00
Matthew Honnibal	cb5217012f	Fix vector remapping	2017-10-31 11:40:46 +01:00
Matthew Honnibal	9c11ee4a1c	WIP on vectors fixes	2017-10-31 11:22:56 +01:00
Matthew Honnibal	368fdb389a	WIP on refactoring and fixing vectors	2017-10-31 02:00:26 +01:00
Matthew Honnibal	4e3006cec7	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-10-30 19:44:58 +01:00
Matthew Honnibal	4112a991ec	Fix vector pruning	2017-10-30 19:44:40 +01:00
ines	ec657c1ddc	Update vocab docs and document Vocab.prune_vectors	2017-10-30 19:35:41 +01:00
Matthew Honnibal	e026b29ea9	Add prune_vectors method to Vocab	2017-10-30 17:59:43 +01:00
Explosion Bot	aa64031751	Fix clear_vectors() method on Vocab	2017-10-30 16:09:04 +01:00
Explosion Bot	7b56b2f04b	Add Vocab.cfg attr, to hold stuff like oov probs	2017-10-30 16:08:50 +01:00
ines	d96e72f656	Tidy up rest	2017-10-27 21:07:59 +02:00
ines	5167a0cce2	Tidy up Vectors and docs	2017-10-27 19:45:19 +02:00
Matthew Honnibal	c52671420c	Remove old cfile import	2017-10-26 13:28:19 +02:00
Matthew Honnibal	ef3e5a361b	Merge pull request #1442 from explosion/feature/fix-sp 💫 Fix SP tag, tweak Vectors.__init__, fix Morphology	2017-10-24 10:24:07 +02:00
Matthew Honnibal	8f8bccecb9	Patch deserialisation for invalid loads, to avoid model failure	2017-10-21 00:51:42 +02:00
Matthew Honnibal	9010a1a060	Create vectors correctly	2017-10-20 14:19:46 +02:00
Matthew Honnibal	33229b1c9e	Remove print statement	2017-10-20 14:19:29 +02:00
Matthew Honnibal	92ac9316b5	Fix initialization of vectors, to address serialization problem	2017-10-20 13:59:24 +02:00
Matthew Honnibal	0d57b9748a	Serialize lex_attr_getters with dill, for better pickle support	2017-10-17 18:17:45 +02:00
ines	b776f48e58	Fix typo	2017-10-01 21:58:45 +02:00
Matthew Honnibal	2cf0f4622f	Fix loading of models with pre-trained vectors	2017-10-01 14:05:32 -05:00
Matthew Honnibal	5aaef3e7b8	Don't link vectors in vocab deserialize	2017-09-26 06:45:47 -05:00
Matthew Honnibal	d9124f1aa3	Add link_vectors_to_models function	2017-09-22 09:38:22 -05:00
Matthew Honnibal	039d609362	Remove hard-coded default vectors width	2017-09-17 12:29:39 -05:00
Matthew Honnibal	83f8e98450	Fix retrieval of OOV vectors	2017-08-22 19:46:35 +02:00
Matthew Honnibal	5b329acbf2	Fix vectors_length property in vocab	2017-08-22 19:00:27 +02:00
Matthew Honnibal	6a94648373	Fix serialization	2017-08-19 21:27:35 +02:00
Matthew Honnibal	1157294434	Improve vector handling	2017-08-19 20:35:33 +02:00
Matthew Honnibal	93fb8b64e9	Fix vector loading	2017-08-19 19:52:25 +02:00
Matthew Honnibal	49a615e7d9	Create Vectors object in Vocab	2017-08-19 18:50:16 +02:00
Matthew Honnibal	2993b54fff	Load vectors in vocab	2017-08-18 20:46:56 +02:00
Matthew Honnibal	add9a33782	Return False for vocab.has_vector	2017-06-04 14:26:14 -05:00
ines	05fe6758a7	Set lexeme attributes for tokenizer special cases	2017-06-03 19:44:39 +02:00
ines	41a6adf1f6	Initialise Vocab length correctly	2017-06-02 10:57:25 +02:00
ines	53b82f972a	Add strings to Vocab in init, instead of StringStore	2017-06-02 10:57:06 +02:00
ines	023f38bdd4	Fix return value of Vocab.from_bytes	2017-06-02 10:56:40 +02:00
Matthew Honnibal	307d615c5f	Fix serialization for tagger when tag_map has changed	2017-06-01 12:18:36 -05:00
Matthew Honnibal	9805e0e369	Fix vocab pickling	2017-05-31 08:25:01 -05:00
Matthew Honnibal	a131981f3b	Work on vectors	2017-05-30 23:34:50 +02:00
Matthew Honnibal	9bf22a94aa	Fix tag set serialisation	2017-05-29 17:52:36 -05:00
Matthew Honnibal	920887f4e4	Specify order of vocab deserialization	2017-05-29 13:04:40 +02:00
Matthew Honnibal	6b019b0540	Update to/from bytes methods	2017-05-29 10:14:20 +02:00
Matthew Honnibal	6dad4117ad	Work on serialization for models	2017-05-29 01:37:57 +02:00
Matthew Honnibal	2edd96ce47	Draft Vocab to/from disk/bytes	2017-05-28 23:34:12 +02:00
Matthew Honnibal	fe11564b8e	Finish stringstore change. Also xfail vectors tests	2017-05-28 15:10:22 +02:00
Matthew Honnibal	fe4a746300	Accommodate symbols in new string scheme	2017-05-28 13:03:16 +02:00
Matthew Honnibal	a5606c3eda	Work on changing StringStore to return hashes.	2017-05-28 12:36:27 +02:00
Matthew Honnibal	39293ab2ee	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-28 11:46:57 +02:00
Matthew Honnibal	15f6efc127	Remove vectors from vocab	2017-05-28 11:45:32 +02:00
ines	c8543c8237	Fix formatting and docstrings and remove deprecated function	2017-05-28 00:22:40 +02:00
ines	251346b59f	Fix typos and formatting	2017-05-21 14:18:46 +02:00
ines	d82ae9a585	Change "function" to "callable" in docs	2017-05-21 13:17:40 +02:00
ines	f0cc642bb9	Update docstrings and API docs for Vocab	2017-05-20 14:00:41 +02:00
Matthew Honnibal	793430aa7a	Get spaCy train command working with neural network * Integrate models into pipeline * Add basic serialization (maybe incorrect) * Fix pickle on vocab	2017-05-17 12:04:50 +02:00
Matthew Honnibal	9e167b7bb6	Strip serializer from code	2017-05-09 17:28:50 +02:00
ines	e1efd589c3	Fix json imports and use ujson	2017-04-15 12:13:34 +02:00
ines	c05ec4b89a	Add compat functions and remove old workarounds. Add ensure_path util function to handle checking instance of path	2017-04-15 12:11:16 +02:00
ines	d24589aa72	Clean up imports, unused code, whitespace, docstrings	2017-04-15 12:05:47 +02:00
ines	561f2a3eb4	Use consistent formatting for docstrings	2017-04-15 11:59:21 +02:00
Matthew Honnibal	d013aba7b5	Merge branch 'master' of https://github.com/explosion/spaCy	2017-03-17 18:30:53 +01:00
Matthew Honnibal	854cfce7cf	Make vocabs more compatible across versions. Previously, symbols were inserted into the string-store before strings were loaded. This meant that adding a symbol would invalidate saved models. We now make sure that strings are loaded faithfully, so that compatibility is maintained.	2017-03-17 18:29:04 +01:00
Matthew Honnibal	1cc841e600	Merge branch 'master' of https://github.com/explosion/spaCy	2017-03-17 08:18:11 -05:00
Matthew Honnibal	4bfc55b532	Auto-add words to vocab when loading vectors. When calling vocab.load_vectors_from_bin_loc, ensure that missing entries are added to the vocab. Otherwise, loading vectors into an empty vocab object resulted in no vectors being added.	2017-03-17 08:15:59 -05:00
Matthew Honnibal	4382f175b3	Squelch compiler warnings	2017-03-11 12:44:43 -06:00
Matthew Honnibal	d814892805	Hackish pickle support for Vocab.	2017-03-07 20:25:12 +01:00
ines	aa92d4e9b5	Fix unicode regex for Python 2 (see #834)	2017-02-16 23:49:54 +01:00
ines	85d249d451	Revert "Revert "Merge pull request #836 from raphael0202/load_vectors (closes #834)"" This reverts commit `ea05f78660`.	2017-02-16 23:26:25 +01:00
ines	ea05f78660	Revert "Merge pull request #836 from raphael0202/load_vectors (closes #834)" This reverts commit `7d8c9eee7f`, reversing changes made to `f6b69babcc`.	2017-02-16 15:27:12 +01:00
Raphaël Bournhonesque	e17dc2db75	Remove useless import	2017-02-16 12:10:24 +01:00
Raphaël Bournhonesque	3fd2742649	load_vectors should accept arbitrary space characters as word tokens. Fix bug #834	2017-02-16 12:08:30 +01:00
Daniel Hershcovich	99eb494a82	Fix #737: support loading word vectors with " " as a word	2017-01-12 17:00:14 +02:00
Daniel Hershcovich	8e603cc917	Avoid "True if ... else False"	2017-01-11 11:18:22 +02:00
Matthew Honnibal	cade536d1e	Merge branch 'master' of ssh://github.com/explosion/spaCy	2016-12-27 21:04:10 +01:00
Matthew Honnibal	ce4539dafd	Allow the vocabulary to grow to 10,000, to prevent cold-start problem.	2016-12-27 21:03:45 +01:00
Ines Montani	8978806ea6	Allow Vocab to load without serializer_freqs	2016-12-21 18:05:23 +01:00
Ines Montani	be8ed811f6	Remove trailing whitespace	2016-12-21 18:04:41 +01:00
Matthew Honnibal	6ee1df93c5	Set tag_map to None if it's not seen in the data by vocab	2016-12-18 16:51:10 +01:00
Matthew Honnibal	1e0f566d95	Fix #656, #624: Support arbitrary token attributes when adding special-case rules.	2016-11-25 12:43:24 +01:00
Matthew Honnibal	f123f92e0c	Fix #617: Vocab.load() required Path. Should work with string as well.	2016-11-10 22:48:48 +01:00
Matthew Honnibal	b86f8af0c1	Fix doc strings	2016-11-01 12:25:36 +01:00
Matthew Honnibal	6036ec7c77	Fix vector norm when loading lexemes.	2016-10-23 19:40:18 +02:00
Matthew Honnibal	3e688e6d4b	Fix issue #514 -- serializer fails when new entity type has been added. The fix here is quite ugly. It's best to add the entities ASAP after loading the NLP pipeline, to mitigate the brittleness.	2016-10-23 17:45:44 +02:00
Matthew Honnibal	f62088d646	Fix compile error	2016-10-23 14:50:50 +02:00
Matthew Honnibal	a0a4ada42a	Fix calculation of L2-norm for Lexeme	2016-10-23 14:44:45 +02:00
Matthew Honnibal	7ab03050d4	Add resize_vectors method to Vocab	2016-10-21 01:44:50 +02:00

276 Commits
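
The cfg change described in commit 95a9615221 (a default vectors name built from the pipeline meta, plus a compatibility step that adds `pretrained_vectors` when a legacy `pretrained_dims` key is found) could look roughly like the sketch below. This is an illustration only, not spaCy's actual code: the helper names `default_vectors_name` and `upgrade_component_cfg` are hypothetical, and the meta and cfg values are made up for the example.

```python
# Hypothetical helpers illustrating the behaviour described in the commit
# message; these names do not come from spaCy itself.

def default_vectors_name(meta):
    """Build a default vectors name of the form '<lang>_<name>.vectors'."""
    return "{}_{}.vectors".format(meta["lang"], meta["name"])


def upgrade_component_cfg(cfg, vectors_name):
    """Return a copy of cfg with 'pretrained_vectors' added if only the
    legacy 'pretrained_dims' key is present. The legacy key is kept."""
    upgraded = dict(cfg)
    if "pretrained_dims" in upgraded and "pretrained_vectors" not in upgraded:
        upgraded["pretrained_vectors"] = vectors_name
    return upgraded


# Example values are assumptions chosen for illustration.
meta = {"lang": "en", "name": "core_web_md"}
legacy_cfg = {"pretrained_dims": 300}
new_cfg = upgrade_component_cfg(legacy_cfg, default_vectors_name(meta))
print(new_cfg["pretrained_vectors"])
```

Because each loaded pipeline gets a name derived from its own meta, two pipelines with pre-trained vectors no longer share one key, which is the collision the commit message attributes to #1660.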