spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-12-05 17:24:29 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	2c4a6d66fa	Merge master into develop. Big merge, many conflicts -- need to review	2018-04-29 14:49:26 +02:00
ines	1c6d77610c	Add remove_extension method on Doc, Token and Span (closes #2242)	2018-04-28 23:33:09 +02:00
ines	abdb853ebf	Simplify underscore tests	2018-04-28 23:30:33 +02:00
Jens Dahl Møllerhøj	e5055e3cf6	Add Danish lemmatizer (#2184) * add danish lemmatizer * fill contributor agreement	2018-04-07 19:07:28 +02:00
ines	10462816bc	Fix tests for Python 2	2018-04-03 18:51:31 +02:00
ines	62b4b527d7	Don't raise error if set_extension has getter and setter (closes #2177) Improve error messages, raise error if setter is specified without a getter and compare against _unset to allow default=None. Also add more tests.	2018-04-03 18:30:17 +02:00
Suraj Rajan	1cdbb7c97c	[2032] - Changed python set to cpp stl set (#2170) Changed python set to cpp stl set #2032 ## Description Changed python set to cpp stl set. CPP stl set works better due to the logarithmic run time of its methods. Finding the minimum in the cpp set is done in constant time, as opposed to the worst-case linear runtime of python set. Operations such as find, count, insert and delete are also done in either constant or logarithmic time, thus making cpp set a better option to manage vectors. Reference: http://www.cplusplus.com/reference/set/set/ ### Types of change Enhancement for `Vectors` for faster initialising of word vectors (fasttext)	2018-03-31 13:28:25 +02:00
Ines Montani	0de599b16b	Merge pull request #2159 from explosion/feature/fix-merged-entity-iob (resolves #1554, resolves #1752) 💫 Fix token.ent_iob after doc.merge(), and ensure consistency in doc.ents	2018-03-28 23:10:00 +02:00
Ines Montani	98e9cda677	Merge pull request #2158 from explosion/feature/fix-multiple-vectors (resolves #1660) 💫 Fix loading of multiple vector models	2018-03-28 23:08:24 +02:00
ines	3eb67bbe4b	Allow entity types with dashes (resolves #1967)	2018-03-28 20:51:26 +02:00
Matthew Honnibal	cf5fcf0546	Update serialization test	2018-03-28 20:12:53 +02:00
Matthew Honnibal	95fa89c4b8	Update doc.ents test	2018-03-28 18:39:03 +02:00
Matthew Honnibal	cbd2794be0	Add test for ent_iob during span merge	2018-03-28 18:36:53 +02:00
Matthew Honnibal	fd9e259414	Add test for #1660	2018-03-28 18:22:51 +02:00
Matthew Honnibal	95a9615221	Fix loading of multiple pre-trained vectors This patch addresses #1660, which was caused by keying all pre-trained vectors with the same ID when telling Thinc how to refer to them. This meant that if multiple models were loaded that had pre-trained vectors, errors or incorrect behaviour resulted. The vectors class now includes a .name attribute, which defaults to: {nlp.meta['lang']_nlp.meta['name']}.vectors The vectors name is set in the cfg of the pipeline components under the key pretrained_vectors. This replaces the previous cfg key pretrained_dims. In order to make existing models compatible with this change, we check for the pretrained_dims key when loading models in from_disk and from_bytes, and add the cfg key pretrained_vectors if we find it.	2018-03-28 16:02:59 +02:00
ines	6d2c85f428	Drop six and related hacks as a dependency	2018-03-28 10:45:25 +02:00
Matthew Honnibal	de9fd091ac	Fix #2014: token.pos_ not writeable	2018-03-27 21:21:11 +02:00
Matthew Honnibal	1f7229f40f	Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" This reverts commit `c9ba3d3c2d`, reversing changes made to `92c26a35d4`.	2018-03-27 19:23:02 +02:00
Matthew Honnibal	d2118792e7	Merge changes from master	2018-03-27 13:38:41 +02:00
Matthew Honnibal	7d4687162f	Update doc.ents test	2018-03-26 07:14:35 +02:00
Matthew Honnibal	938436455a	Add test for ent_iob during span merge	2018-03-25 22:16:19 +02:00
Matthew Honnibal	bede11b67c	Improve label management in parser and NER (#2108) This patch does a few smallish things that tighten up the training workflow a little, and allow memory use during training to be reduced by letting the GoldCorpus stream data properly. Previously, the parser and entity recognizer read and saved labels as lists, with extra labels noted separately. Lists were used because ordering is very important, to ensure that the label-to-class mapping is stable. We now manage labels as nested dictionaries, first keyed by the action, and then keyed by the label. Values are frequencies. The trick is, how do we save new labels? We need to make sure we iterate over these in the same order they're added. Otherwise, we'll get different class IDs, and the model's predictions won't make sense. To allow stable sorting, we map the new labels to negative values. If we have two new labels, they'll be noted as having "frequency" -1 and -2. The next new label will then have "frequency" -3. When we sort by (frequency, label), we then get a stable sort. Storing frequencies then allows us to make the next nice improvement. Previously we had to iterate over the whole training set, to pre-process it for the deprojectivisation. This led to storing the whole training set in memory. This was most of the required memory during training. To prevent this, we now store the frequencies as we stream in the data, and deprojectivize as we go. Once we've built the frequencies, we can then apply a frequency cut-off when we decide how many classes to make. Finally, to allow proper data streaming, we also have to have some way of shuffling the iterator. This is awkward if the training files have multiple documents in them. To solve this, the GoldCorpus class now writes the training data to disk in msgpack files, one per document. We can then shuffle the data by shuffling the paths. This is a squash merge, as I made a lot of very small commits. Individual commit messages below.
* Simplify label management for TransitionSystem and its subclasses * Fix serialization for new label handling format in parser * Simplify and improve GoldCorpus class. Reduce memory use, write to temp dir * Set actions in transition system * Require thinc 6.11.1.dev4 * Fix error in parser init * Add unicode declaration * Fix unicode declaration * Update textcat test * Try to get model training on less memory * Print json loc for now * Try rapidjson to reduce memory use * Remove rapidjson requirement * Try rapidjson for reduced mem usage * Handle None heads when projectivising * Stream json docs * Fix train script * Handle projectivity in GoldParse * Fix projectivity handling * Add minibatch_by_words util from ud_train * Minibatch by number of words in spacy.cli.train * Move minibatch_by_words util to spacy.util * Fix label handling * More hacking at label management in parser * Fix encoding in msgpack serialization in GoldParse * Adjust batch sizes in parser training * Fix minibatch_by_words * Add merge_subtokens function to pipeline.pyx * Register merge_subtokens factory * Restore use of msgpack tmp directory * Use minibatch-by-words in train * Handle retokenization in scorer * Change back-off approach for missing labels. Use 'dep' label * Update NER for new label management * Set NER tags for over-segmented words * Fix label alignment in gold * Fix label back-off for infrequent labels * Fix int type in labels dict key * Fix int type in labels dict key * Update feature definition for 8 feature set * Update ud-train script for new label stuff * Fix json streamer * Print the line number if conll eval fails * Update children and sentence boundaries after deprojectivisation * Export set_children_from_heads from doc.pxd * Render parses during UD training * Remove print statement * Require thinc 6.11.1.dev6. 
Try adding wheel as install_requires * Set different dev version, to flush pip cache * Update thinc version * Update GoldCorpus docs * Remove print statements * Fix formatting and links [ci skip]	2018-03-19 02:58:08 +01:00
Matthew Honnibal	ff42b726c1	Fix unicode declaration on test	2018-03-19 02:04:24 +01:00
Matthew Honnibal	7dc76c6ff6	Add test for textcat	2018-03-16 12:39:45 +01:00
ines	f3f8bfc367	Add built-in factories for merge_entities and merge_noun_chunks Allows adding those components to the pipeline out-of-the-box if they're defined in a model's meta.json. Also allows usage as nlp.add_pipe(nlp.create_pipe('merge_entities')).	2018-03-15 17:16:54 +01:00
ines	d854f69fe3	Add built-in factories for merge_entities and merge_noun_chunks Allows adding those components to the pipeline out-of-the-box if they're defined in a model's meta.json. Also allows usage as nlp.add_pipe(nlp.create_pipe('merge_entities')).	2018-03-15 00:18:51 +01:00
Matthew Honnibal	c2f4759257	Fix test for Python 2	2018-03-12 23:03:05 +01:00
Matthew Honnibal	53b3249e06	Add tests for arc eager oracle	2018-03-10 23:42:56 +01:00
Matthew Honnibal	5cc3bd1c1d	Update alignment tests	2018-02-24 16:03:58 +01:00
Matthew Honnibal	7865746574	Support many-to-one alignment	2018-02-24 02:09:53 +01:00
Matthew Honnibal	458710b831	Poke matcher test for appveyor	2018-02-23 23:53:48 +01:00
Matthew Honnibal	2c9c8b8d72	Try commenting out emoji test in matcher	2018-02-23 23:34:35 +01:00
Matthew Honnibal	980ad68cbe	Try to find test that fails on appveyor	2018-02-23 21:27:53 +01:00
Matthew Honnibal	39de8cd4d3	Try to find test failing on appveyor	2018-02-23 20:59:21 +01:00
Matthew Honnibal	7b575a119e	Try to reduce memory usage of test_matcher	2018-02-23 15:34:37 +01:00
Matthew Honnibal	875411b875	Set unicode types in _align.pyx and test	2018-02-23 14:35:38 +01:00
Matthew Honnibal	51d9679aa3	Fix broken span.as_doc test	2018-02-23 14:22:24 +01:00
Matthew Honnibal	3e6c1111b7	Remove obsolete test	2018-02-23 03:22:07 +01:00
Matthew Honnibal	c0734ba526	Make alignment work with strings	2018-02-20 17:51:49 +01:00
Matthew Honnibal	8180c84a98	Add tests for new Levenshtein alignment	2018-02-20 17:32:25 +01:00
Matthew Honnibal	2bccad8815	Fix incorrect matcher test	2018-02-18 14:56:12 +01:00
Matthew Honnibal	530172d57a	Merge branch 'master' of https://github.com/explosion/spaCy into feature/better-faster-matcher	2018-02-18 14:40:42 +01:00
Matthew Honnibal	1e5aeb4eec	Merge pull request #1987 from thomasopsomer/span-sent Make span.sent work when only manual / custom sbd	2018-02-18 14:05:37 +01:00
Matthew Honnibal	eb3040ce46	Merge pull request #1891 from fucking-signup/master Fix issue #1889	2018-02-18 13:47:47 +01:00
Matthew Honnibal	3d7285870b	Update matcher branch with v2.0.8 master	2018-02-18 13:42:58 +01:00
ines	6bba1db4cc	Drop six and related hacks as a dependency	2018-02-18 13:29:56 +01:00
Matthew Honnibal	f9f46e5a07	Revert matcher fixes from GregDubbin	2018-02-18 10:59:28 +01:00
Matthew Honnibal	f7dc64d2a3	Merge branch 'master' of https://github.com/explosion/spaCy into feature/better-faster-matcher	2018-02-17 16:47:35 +01:00
Aaron Marquez	f0d3672e17	Changed loading EN model	2018-02-15 14:28:38 -08:00
Aaron Marquez	7ba4111554	Add test for issue-1959	2018-02-15 12:46:22 -08:00

963 Commits