spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-27 18:36:36 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	0d57b9748a	Serialize lex_attr_getters with dill, for better pickle support	2017-10-17 18:17:45 +02:00
ines	b776f48e58	Fix typo	2017-10-01 21:58:45 +02:00
Matthew Honnibal	2cf0f4622f	Fix loading of models with pre-trained vectors	2017-10-01 14:05:32 -05:00
Matthew Honnibal	5aaef3e7b8	Dont link vectors in vocab deserialize	2017-09-26 06:45:47 -05:00
Matthew Honnibal	d9124f1aa3	Add link_vectors_to_models function	2017-09-22 09:38:22 -05:00
Matthew Honnibal	039d609362	Remove hard-coded default vectors width	2017-09-17 12:29:39 -05:00
Matthew Honnibal	83f8e98450	Fix retrieval of OOV vectors	2017-08-22 19:46:35 +02:00
Matthew Honnibal	5b329acbf2	Fix vectors_length property in vocab	2017-08-22 19:00:27 +02:00
Matthew Honnibal	6a94648373	Fix serialization	2017-08-19 21:27:35 +02:00
Matthew Honnibal	1157294434	Improve vector handling	2017-08-19 20:35:33 +02:00
Matthew Honnibal	93fb8b64e9	Fix vector loading	2017-08-19 19:52:25 +02:00
Matthew Honnibal	49a615e7d9	Create Vectors object in Vocab	2017-08-19 18:50:16 +02:00
Matthew Honnibal	2993b54fff	Load vectors in vocab	2017-08-18 20:46:56 +02:00
Matthew Honnibal	add9a33782	Return False for vocab.has_vector	2017-06-04 14:26:14 -05:00
ines	05fe6758a7	Set lexeme attributes for tokenizer special cases	2017-06-03 19:44:39 +02:00
ines	41a6adf1f6	Initialise Vocab length correctly	2017-06-02 10:57:25 +02:00
ines	53b82f972a	Add strings to Vocab in init, instead of StringStore	2017-06-02 10:57:06 +02:00
ines	023f38bdd4	Fix return value of Vocab.from_bytes	2017-06-02 10:56:40 +02:00
Matthew Honnibal	307d615c5f	Fix serialization for tagger when tag_map has changed	2017-06-01 12:18:36 -05:00
Matthew Honnibal	9805e0e369	Fix vocab pickling	2017-05-31 08:25:01 -05:00
Matthew Honnibal	a131981f3b	Work on vectors	2017-05-30 23:34:50 +02:00
Matthew Honnibal	9bf22a94aa	Fix tag set serialisation	2017-05-29 17:52:36 -05:00
Matthew Honnibal	920887f4e4	Specify order of vocab deserialization	2017-05-29 13:04:40 +02:00
Matthew Honnibal	6b019b0540	Update to/from bytes methods	2017-05-29 10:14:20 +02:00
Matthew Honnibal	6dad4117ad	Work on serialization for models	2017-05-29 01:37:57 +02:00
Matthew Honnibal	2edd96ce47	Draft Vocab to/from disk/bytes	2017-05-28 23:34:12 +02:00
Matthew Honnibal	fe11564b8e	Finish stringstore change. Also xfail vectors tests	2017-05-28 15:10:22 +02:00
Matthew Honnibal	fe4a746300	Accomodate symbols in new string scheme	2017-05-28 13:03:16 +02:00
Matthew Honnibal	a5606c3eda	Work on changing StringStore to return hashes.	2017-05-28 12:36:27 +02:00
Matthew Honnibal	39293ab2ee	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-05-28 11:46:57 +02:00
Matthew Honnibal	15f6efc127	Remove vectors from vocab	2017-05-28 11:45:32 +02:00
ines	c8543c8237	Fix formatting and docstrings and remove deprecated function	2017-05-28 00:22:40 +02:00
ines	251346b59f	Fix typos and formatting	2017-05-21 14:18:46 +02:00
ines	d82ae9a585	Change "function" to "callable" in docs	2017-05-21 13:17:40 +02:00
ines	f0cc642bb9	Update docstrings and API docs for Vocab	2017-05-20 14:00:41 +02:00
Matthew Honnibal	793430aa7a	Get spaCy train command working with neural network * Integrate models into pipeline * Add basic serialization (maybe incorrect) * Fix pickle on vocab	2017-05-17 12:04:50 +02:00
Matthew Honnibal	9e167b7bb6	Strip serializer from code	2017-05-09 17:28:50 +02:00
ines	e1efd589c3	Fix json imports and use ujson	2017-04-15 12:13:34 +02:00
ines	c05ec4b89a	Add compat functions and remove old workarounds Add ensure_path util function to handle checking instance of path	2017-04-15 12:11:16 +02:00
ines	d24589aa72	Clean up imports, unused code, whitespace, docstrings	2017-04-15 12:05:47 +02:00
ines	561f2a3eb4	Use consistent formatting for docstrings	2017-04-15 11:59:21 +02:00
Matthew Honnibal	d013aba7b5	Merge branch 'master' of https://github.com/explosion/spaCy	2017-03-17 18:30:53 +01:00
Matthew Honnibal	854cfce7cf	Make vocabs more compatible across versions Previously, symbols were inserted into the string-store before strings were loaded. This meant that adding a symbol would invalidate saved models. We now make sure that strings are loaded faithfully, so that compatibility is maintained.	2017-03-17 18:29:04 +01:00
Matthew Honnibal	1cc841e600	Merge branch 'master' of https://github.com/explosion/spaCy	2017-03-17 08:18:11 -05:00
Matthew Honnibal	4bfc55b532	Auto-add words to vocab when loading vectors When calling vocab.load_vectors_from_bin_loc, ensure that missing entries are added to the vocab. Otherwise, loading vectors into an empty vocab object resulted in no vectors being added.	2017-03-17 08:15:59 -05:00
Matthew Honnibal	4382f175b3	Squelch compiler warnings	2017-03-11 12:44:43 -06:00
Matthew Honnibal	d814892805	Hackish pickle support for Vocab.	2017-03-07 20:25:12 +01:00
ines	aa92d4e9b5	Fix unicode regex for Python 2 (see #834)	2017-02-16 23:49:54 +01:00
ines	85d249d451	Revert "Revert "Merge pull request #836 from raphael0202/load_vectors (closes #834)"" This reverts commit `ea05f78660`.	2017-02-16 23:26:25 +01:00
ines	ea05f78660	Revert "Merge pull request #836 from raphael0202/load_vectors (closes #834)" This reverts commit `7d8c9eee7f`, reversing changes made to `f6b69babcc`.	2017-02-16 15:27:12 +01:00
Raphaël Bournhonesque	e17dc2db75	Remove useless import	2017-02-16 12:10:24 +01:00
Raphaël Bournhonesque	3fd2742649	load_vectors should accept arbitrary space characters as word tokens Fix bug #834	2017-02-16 12:08:30 +01:00
Daniel Hershcovich	99eb494a82	Fix #737: support loading word vectors with " " as a word	2017-01-12 17:00:14 +02:00
Daniel Hershcovich	8e603cc917	Avoid "True if ... else False"	2017-01-11 11:18:22 +02:00
Matthew Honnibal	cade536d1e	Merge branch 'master' of ssh://github.com/explosion/spaCy	2016-12-27 21:04:10 +01:00
Matthew Honnibal	ce4539dafd	Allow the vocabulary to grow to 10,000, to prevent cold-start problem.	2016-12-27 21:03:45 +01:00
Ines Montani	8978806ea6	Allow Vocab to load without serializer_freqs	2016-12-21 18:05:23 +01:00
Ines Montani	be8ed811f6	Remove trailing whitespace	2016-12-21 18:04:41 +01:00
Matthew Honnibal	6ee1df93c5	Set tag_map to None if it's not seen in the data by vocab	2016-12-18 16:51:10 +01:00
Matthew Honnibal	1e0f566d95	Fix #656, #624: Support arbitrary token attributes when adding special-case rules.	2016-11-25 12:43:24 +01:00
Matthew Honnibal	f123f92e0c	Fix #617: Vocab.load() required Path. Should work with string as well.	2016-11-10 22:48:48 +01:00
Matthew Honnibal	b86f8af0c1	Fix doc strings	2016-11-01 12:25:36 +01:00
Matthew Honnibal	6036ec7c77	Fix vector norm when loading lexemes.	2016-10-23 19:40:18 +02:00
Matthew Honnibal	3e688e6d4b	Fix issue #514 -- serializer fails when new entity type has been added. The fix here is quite ugly. It's best to add the entities ASAP after loading the NLP pipeline, to mitigate the brittleness.	2016-10-23 17:45:44 +02:00
Matthew Honnibal	f62088d646	Fix compile error	2016-10-23 14:50:50 +02:00
Matthew Honnibal	a0a4ada42a	Fix calculation of L2-norm for Lexeme	2016-10-23 14:44:45 +02:00
Matthew Honnibal	7ab03050d4	Add resize_vectors method to Vocab	2016-10-21 01:44:50 +02:00
Matthew Honnibal	f5fe4f595b	Fix json loading, for Python 3.	2016-10-20 21:23:26 +02:00
Matthew Honnibal	5ec32f5d97	Fix loading of GloVe vectors, to address Issue #541	2016-10-20 18:27:48 +02:00
Matthew Honnibal	d10c17f2a4	Fix Issue #536: oov_prob was 0 for OOV words.	2016-10-19 23:38:47 +02:00
Matthew Honnibal	2bbb050500	Fix default of serializer_freqs	2016-10-18 19:55:41 +02:00
Matthew Honnibal	2cc515b2ed	Add add_flag method to Vocab, re Issue #504.	2016-10-14 12:15:38 +02:00
Matthew Honnibal	ea23b64cc8	Refactor training, with new spacy.train module. Defaults still a little awkward.	2016-10-09 12:24:24 +02:00
Matthew Honnibal	ca32a1ab01	Revert "Work on Issue #285: intern strings into document-specific pools, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good." This reverts commit `8423e8627f`.	2016-09-30 20:20:22 +02:00
Matthew Honnibal	1f1cd5013f	Revert "Changes to vocab for new stringstore scheme" This reverts commit `a51149a717`.	2016-09-30 20:10:30 +02:00
Matthew Honnibal	a51149a717	Changes to vocab for new stringstore scheme	2016-09-30 20:01:19 +02:00
Matthew Honnibal	8423e8627f	Work on Issue #285: intern strings into document-specific pools, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good.	2016-09-30 10:14:47 +02:00
Matthew Honnibal	95aaea0d3f	Refactor so that the tokenizer data is read from Python data, rather than from disk	2016-09-25 14:49:53 +02:00
Matthew Honnibal	df88690177	Fix encoding of path variable	2016-09-24 21:13:15 +02:00
Matthew Honnibal	af847e07fc	Fix usage of pathlib for Python3 -- turning paths to strings.	2016-09-24 21:05:27 +02:00
Matthew Honnibal	453683aaf0	Fix spacy/vocab.pyx	2016-09-24 20:50:31 +02:00
Matthew Honnibal	fd65cf6cbb	Finish refactoring data loading	2016-09-24 20:26:17 +02:00
Matthew Honnibal	83e364188c	Mostly finished loading refactoring. Design is in place, but doesn't work yet.	2016-09-24 15:42:01 +02:00
Matthew Honnibal	872695759d	Merge pull request #306 from wbwseeker/german_noun_chunks add German noun chunk functionality	2016-04-08 00:54:24 +10:00
Henning Peters	b8f63071eb	add lang registration facility	2016-03-25 18:54:45 +01:00
Wolfgang Seeker	5e2e8e951a	add baseclass DocIterator for iterators over documents add classes for English and German noun chunks the respective iterators are set for the document when created by the parser as they depend on the annotation scheme of the parsing model	2016-03-16 15:53:35 +01:00
Wolfgang Seeker	03fb498dbe	introduce lang field for LexemeC to hold language id put noun_chunk logic into iterators.py for each language separately	2016-03-10 13:01:34 +01:00
Matthew Honnibal	963fe5258e	* Add missing __contains__ method to vocab	2016-03-08 15:49:10 +00:00
Matthew Honnibal	478aa21cb0	* Remove broken __reduce__ method on vocab	2016-03-08 15:48:21 +00:00
Henning Peters	931c07a609	initial proposal for separate vector package	2016-03-04 11:09:06 +01:00
Matthew Honnibal	a95974ad3f	* Fix oov probability	2016-02-06 15:13:55 +01:00
Matthew Honnibal	dcb401f3e1	* Remove broken Vocab pickling	2016-02-06 14:08:47 +01:00
Matthew Honnibal	63e3d4e27f	* Add comment on Vocab.__reduce__	2016-01-19 20:11:25 +01:00
Henning Peters	235f094534	untangle data_path/via	2016-01-16 12:23:45 +01:00
Henning Peters	846fa49b2a	distinct load() and from_package() methods	2016-01-16 10:00:57 +01:00
Henning Peters	788f734513	refactored data_dir->via, add zip_safe, add spacy.load()	2016-01-15 18:01:02 +01:00
Henning Peters	bc229790ac	integrate with sputnik	2016-01-13 19:46:17 +01:00
Matthew Honnibal	eaf2ad59f1	* Fix use of mock Package object	2015-12-31 04:13:15 +01:00
Matthew Honnibal	aec130af56	Use util.Package class for io Previous Sputnik integration caused API change: Vocab, Tagger, etc were loaded via a from_package classmethod, that required a sputnik.Package instance. This forced users to first create a sputnik.Sputnik() instance, in order to acquire a Package via sp.pool(). Instead I've created a small file-system shim, util.Package, which allows classes to have a .load() classmethod, that accepts either util.Package objects, or strings. We can later gut the internals of this and make it a proxy for Sputnik if we need more functionality that should live in the Sputnik library. Sputnik is now only used to download and install the data, in spacy.en.download	2015-12-29 18:00:48 +01:00
Matthew Honnibal	0e2498da00	* Replace from_package with load() classmethod in Vocab	2015-12-29 16:56:51 +01:00


243 Commits