spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-03-08 05:41:29 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	a040fca99e	Import json into cli.train	2018-06-25 11:50:37 +02:00
Matthew Honnibal	2c703d99c2	Fix collation of best models	2018-06-25 01:21:34 +02:00
Matthew Honnibal	2c80b7c013	Collate best model after training	2018-06-24 23:39:52 +02:00
ines	330c039106	Merge branch 'master' into develop	2018-05-26 18:30:52 +02:00
James Messinger	4515e96e90	Better formatting for `spacy train` CLI (#2357)
* Better formatting for `spacy train` CLI
Changed to use fixed-spaces rather than tabs to align table headers and data.
### Before:
```
Itn. P.Loss N.Loss UAS NER P. NER R. NER F. Tag % Token %
0 4618.857 2910.004 76.172 79.645 67.987 88.732 88.261 100.000 4436.9 6376.4
1 4671.972 3764.812 74.481 78.046 62.374 82.680 88.377 100.000 4672.2 6227.1
2 4742.756 3673.473 71.994 77.380 63.966 84.494 90.620 100.000 4298.0 5983.9
```
### After:
```
Itn.  Dep Loss  NER Loss  UAS     NER P.  NER R.  NER F.  Tag %   Token %  CPU WPS  GPU WPS
0     4618.857  2910.004  76.172  79.645  67.987  88.732  88.261  100.000  4436.9   6376.4
1     4671.972  3764.812  74.481  78.046  62.374  82.680  88.377  100.000  4672.2   6227.1
2     4742.756  3673.473  71.994  77.380  63.966  84.494  90.620  100.000  4298.0   5983.9
```
* Added contributor file	2018-05-25 13:08:45 +02:00
Matthew Honnibal	ce458c2428	Fix spacy requirement constraint in package template	2018-05-22 20:50:46 +02:00
Matthew Honnibal	f3b4f6a4ec	Merge setup.py	2018-05-20 23:21:00 +02:00
Ines Montani	d4cc736b7c	💫 Improve model downloads: check for existing install, customise pip and use requests library again (#2346) * Go back to using requests instead of urllib (closes #2320) Fewer dependencies are good, but this one was simply causing too many other problems around SSL verification and Python 2/3 compatibility. requests is a popular enough package that it's okay for spaCy to depend on it – and this will hopefully make model downloads less flakey. * Only download model if not installed (see #1456) Use #egg=model==version to allow pip to check for existing installations. The download is only started if no installation matching the package/version is found. Fixes a long-standing inconvenience. * Pass additional options to pip when installing model (resolves #1456) Treat all additional arguments passed to the download command as pip options to allow user to customise the command. For example: python -m spacy download en --user * Add CLI option to enable installing model package dependencies * Revert "Add CLI option to enable installing model package dependencies" This reverts commit `9336ffe695`. * Update documentation	2018-05-20 20:26:56 +02:00
Matthew Honnibal	74d5c625b3	Use rising beam update prob	2018-05-16 20:11:59 +02:00
Matthew Honnibal	dc1a479fbd	Merge branch 'develop' into feature/refactor-parser	2018-05-15 18:39:21 +02:00
Matthew Honnibal	546dd99cdf	Merge master into develop -- mostly Arabic and website	2018-05-15 18:14:28 +02:00
Matthew Honnibal	a6ae1ee6f7	Don't modify Token in global scope	2018-05-09 00:43:00 +02:00
Matthew Honnibal	f94f721f40	Avoid importing fused token symbol in ud-run-test, until that's added	2018-05-09 00:28:03 +02:00
Matthew Honnibal	659ec5b975	Avoid importing fused token symbol in ud-run-test, until that's added	2018-05-08 19:40:33 +02:00
Matthew Honnibal	fc4dd49b77	Support oracle segmentation in ud-train CLI command	2018-05-08 13:47:45 +02:00
ines	7a3599c21a	Fix formatting and consistency	2018-05-07 23:02:11 +02:00
Matthew Honnibal	eddc0e0c74	Set gold.sent_starts in ud_train	2018-05-07 15:52:47 +02:00
G.Pruvost	cc8e804648	#2211 - Support for ssl certs config on download command (#2212) * Add support for SSL/Certs customization on download CLI * Add a note on SSL options for the 'download' CLI in the README * Add contributor agreement	2018-05-03 18:37:02 +02:00
Matthew Honnibal	723b328062	Add script to run UD test	2018-04-29 15:50:25 +02:00
Matthew Honnibal	17af6aa3a4	Update ud_train script	2018-04-29 15:49:32 +02:00
Matthew Honnibal	2c4a6d66fa	Merge master into develop. Big merge, many conflicts -- need to review	2018-04-29 14:49:26 +02:00
ines	3c80f69ff5	Return data in cli.info and add silent option (resolves #2196)	2018-04-29 01:59:44 +02:00
ines	0299d5fac8	Update argument annotations and formatting	2018-04-10 21:45:11 +02:00
ines	49b1e48bf5	Fix syntax error	2018-04-10 21:44:59 +02:00
ines	70052e46e9	Fix formatting [ci skip]	2018-04-10 21:42:46 +02:00
Matthew Honnibal	0ddb152be0	Improve error message when reading vectors	2018-04-10 21:26:50 +02:00
Matthew Honnibal	db50ac524e	Support zipped vector files in init-model	2018-04-10 21:21:00 +02:00
ines	270fcfd925	Fix typo in package command message (closes #2200)	2018-04-10 19:14:31 +02:00
ines	24d8bf348d	Revert "Add support for .zip to init_model" This reverts commit `7ee880a0ad`.	2018-04-10 19:08:06 +02:00
Matthew Honnibal	7ee880a0ad	Add support for .zip to init_model	2018-04-10 14:30:04 +00:00
Ines Montani	3141e04822	💫 New system for error messages and warnings (#2163) * Add spacy.errors module * Update deprecation and user warnings * Replace errors and asserts with new error message system * Remove redundant asserts * Fix whitespace * Add messages for print/util.prints statements * Fix typo * Fix typos * Move CLI messages to spacy.cli._messages * Add decorator to display error code with message An implementation like this is nice because it only modifies the string when it's retrieved from the containing class – so we don't have to worry about manipulating tracebacks etc. * Remove unused link in spacy.about * Update errors for invalid pipeline components * Improve error for unknown factories * Add displaCy warnings * Update formatting consistency * Move error message to spacy.errors * Update errors and check if doc returned by component is None	2018-04-03 15:50:31 +02:00
Ines Montani	a609a1ca29	Merge pull request #2152 from explosion/feature/tidy-up-dependencies 💫 Tidy up dependencies	2018-03-29 14:35:09 +02:00
Matthew Honnibal	b5098079d8	Fix error on urllib	2018-03-29 00:08:16 +02:00
Ines Montani	98e9cda677	Merge pull request #2158 from explosion/feature/fix-multiple-vectors (resolves #1660) 💫 Fix loading of multiple vector models	2018-03-28 23:08:24 +02:00
Matthew Honnibal	17c3e7efa2	Add message noting vectors	2018-03-28 16:33:43 +02:00
ines	7fbc9e5874	Replace requests with urllib	2018-03-28 12:46:07 +02:00
ines	ac88c72c9a	Fix ftfy workaround and remove old import	2018-03-28 12:14:28 +02:00
Matthew Honnibal	070b6c6495	Remove dependency on ftfy	2018-03-28 12:07:02 +02:00
Matthew Honnibal	b7136cb094	Support zipped vector files in init-model	2018-03-27 21:01:18 +00:00
Matthew Honnibal	1f7229f40f	Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" This reverts commit `c9ba3d3c2d`, reversing changes made to `92c26a35d4`.	2018-03-27 19:23:02 +02:00
Matthew Honnibal	f57bfbccdc	Fix non-projective label filtering	2018-03-27 13:41:33 +02:00
Matthew Honnibal	8bbd26579c	Support GPU in UD training script	2018-03-27 09:53:35 +00:00
Matthew Honnibal	406548b976	Support .gz and .tar.gz files in spacy init-model	2018-03-24 17:18:32 +01:00
Matthew Honnibal	85717f570c	Merge branch 'master' of https://github.com/explosion/spaCy	2018-03-23 20:30:42 +01:00
Matthew Honnibal	8902754f0b	Fix vector loading for ud_train	2018-03-23 20:30:00 +01:00
Xiaoquan Kong	a71b99d7ff	bugfix for global-variable-change-in-runtime related issue (#2135) * Bugfix: setting pollution from spacy/cli/ud_train.py to whole package * Add contributor agreement of howl-anderson	2018-03-23 11:36:38 +01:00
Matthew Honnibal	044397e269	Support .gz and .tar.gz files in spacy init-model	2018-03-21 14:33:23 +01:00
Matthew Honnibal	bede11b67c	Improve label management in parser and NER (#2108) This patch does a few smallish things that tighten up the training workflow a little, and allow memory use during training to be reduced by letting the GoldCorpus stream data properly. Previously, the parser and entity recognizer read and saved labels as lists, with extra labels noted separately. Lists were used because ordering is very important, to ensure that the label-to-class mapping is stable. We now manage labels as nested dictionaries, first keyed by the action, and then keyed by the label. Values are frequencies. The trick is, how do we save new labels? We need to make sure we iterate over these in the same order they're added. Otherwise, we'll get different class IDs, and the model's predictions won't make sense. To allow stable sorting, we map the new labels to negative values. If we have two new labels, they'll be noted as having "frequency" -1 and -2. The next new label will then have "frequency" -3. When we sort by (frequency, label), we then get a stable sort. Storing frequencies then allows us to make the next nice improvement. Previously we had to iterate over the whole training set, to pre-process it for the deprojectivisation. This led to storing the whole training set in memory. This was most of the required memory during training. To prevent this, we now store the frequencies as we stream in the data, and deprojectivize as we go. Once we've built the frequencies, we can then apply a frequency cut-off when we decide how many classes to make. Finally, to allow proper data streaming, we also have to have some way of shuffling the iterator. This is awkward if the training files have multiple documents in them. To solve this, the GoldCorpus class now writes the training data to disk in msgpack files, one per document. We can then shuffle the data by shuffling the paths. This is a squash merge, as I made a lot of very small commits. Individual commit messages below.
* Simplify label management for TransitionSystem and its subclasses * Fix serialization for new label handling format in parser * Simplify and improve GoldCorpus class. Reduce memory use, write to temp dir * Set actions in transition system * Require thinc 6.11.1.dev4 * Fix error in parser init * Add unicode declaration * Fix unicode declaration * Update textcat test * Try to get model training on less memory * Print json loc for now * Try rapidjson to reduce memory use * Remove rapidjson requirement * Try rapidjson for reduced mem usage * Handle None heads when projectivising * Stream json docs * Fix train script * Handle projectivity in GoldParse * Fix projectivity handling * Add minibatch_by_words util from ud_train * Minibatch by number of words in spacy.cli.train * Move minibatch_by_words util to spacy.util * Fix label handling * More hacking at label management in parser * Fix encoding in msgpack serialization in GoldParse * Adjust batch sizes in parser training * Fix minibatch_by_words * Add merge_subtokens function to pipeline.pyx * Register merge_subtokens factory * Restore use of msgpack tmp directory * Use minibatch-by-words in train * Handle retokenization in scorer * Change back-off approach for missing labels. Use 'dep' label * Update NER for new label management * Set NER tags for over-segmented words * Fix label alignment in gold * Fix label back-off for infrequent labels * Fix int type in labels dict key * Fix int type in labels dict key * Update feature definition for 8 feature set * Update ud-train script for new label stuff * Fix json streamer * Print the line number if conll eval fails * Update children and sentence boundaries after deprojectivisation * Export set_children_from_heads from doc.pxd * Render parses during UD training * Remove print statement * Require thinc 6.11.1.dev6. 
Try adding wheel as install_requires * Set different dev version, to flush pip cache * Update thinc version * Update GoldCorpus docs * Remove print statements * Fix formatting and links [ci skip]	2018-03-19 02:58:08 +01:00
Matthew Honnibal	d7ce6527fb	Use increasing batch sizes in ud-train	2018-03-14 20:15:28 +01:00
Matthew Honnibal	5dddb30e5b	Fix ud-train script	2018-03-11 01:26:45 +01:00

306 Commits
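
The label scheme described in commit `bede11b67c` (nested dictionaries keyed by action and then by label, with new labels given negative "frequencies" so that their order stays stable) can be sketched roughly as below. This is an illustration only, not spaCy's actual code: the function names `add_new_label` and `class_order`, the per-action counter for new labels, and the choice to sort by descending frequency are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# action name -> label -> frequency (negative values mark labels added later)
Labels = Dict[str, Dict[str, int]]


def add_new_label(labels: Labels, action: str, label: str) -> None:
    """Register a label that has no observed frequency yet.

    Assumption: the n-th new label for an action gets pseudo-frequency -n,
    so the first new label is -1, the second -2, and so on.
    """
    by_label = labels.setdefault(action, {})
    if label in by_label:
        return  # already known, keep its existing value
    n_new = sum(1 for freq in by_label.values() if freq < 0)
    by_label[label] = -(n_new + 1)


def class_order(labels: Labels) -> List[Tuple[str, str]]:
    """Return (action, label) pairs in a deterministic order.

    Assumption: within an action, labels are sorted by descending frequency,
    with the label string as tie-breaker. New labels (-1, -2, ...) therefore
    come after all observed labels, in the order they were added.
    """
    order: List[Tuple[str, str]] = []
    for action in sorted(labels):
        items = sorted(labels[action].items(), key=lambda kv: (-kv[1], kv[0]))
        order.extend((action, label) for label, _ in items)
    return order
```

Because the order depends only on the stored values and not on dictionary insertion order, reloading the dictionary from disk yields the same class IDs, which is the property the commit message is after.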