spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-12-27 02:16:32 +03:00

Author	SHA1	Message	Date
ines	ac88c72c9a	Fix ftfy workaround and remove old import	2018-03-28 12:14:28 +02:00
Matthew Honnibal	070b6c6495	Remove dependency on ftfy	2018-03-28 12:07:02 +02:00
Matthew Honnibal	b7136cb094	Support zipped vector files in init-model	2018-03-27 21:01:18 +00:00
Matthew Honnibal	1f7229f40f	Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" This reverts commit `c9ba3d3c2d`, reversing changes made to `92c26a35d4`.	2018-03-27 19:23:02 +02:00
Matthew Honnibal	f57bfbccdc	Fix non-projective label filtering	2018-03-27 13:41:33 +02:00
Matthew Honnibal	8bbd26579c	Support GPU in UD training script	2018-03-27 09:53:35 +00:00
Matthew Honnibal	406548b976	Support .gz and .tar.gz files in spacy init-model	2018-03-24 17:18:32 +01:00
Matthew Honnibal	85717f570c	Merge branch 'master' of https://github.com/explosion/spaCy	2018-03-23 20:30:42 +01:00
Matthew Honnibal	8902754f0b	Fix vector loading for ud_train	2018-03-23 20:30:00 +01:00
Xiaoquan Kong	a71b99d7ff	bugfix for global-variable-change-in-runtime related issue (#2135) * Bugfix: setting pollution from spacy/cli/ud_train.py to whole package * Add contributor agreement of howl-anderson	2018-03-23 11:36:38 +01:00
Matthew Honnibal	044397e269	Support .gz and .tar.gz files in spacy init-model	2018-03-21 14:33:23 +01:00
Matthew Honnibal	bede11b67c	Improve label management in parser and NER (#2108) This patch does a few smallish things that tighten up the training workflow a little, and allow memory use during training to be reduced by letting the GoldCorpus stream data properly. Previously, the parser and entity recognizer read and saved labels as lists, with extra labels noted separately. Lists were used because ordering is very important, to ensure that the label-to-class mapping is stable. We now manage labels as nested dictionaries, first keyed by the action, and then keyed by the label. Values are frequencies. The trick is, how do we save new labels? We need to make sure we iterate over these in the same order they're added. Otherwise, we'll get different class IDs, and the model's predictions won't make sense. To allow stable sorting, we map the new labels to negative values. If we have two new labels, they'll be noted as having "frequency" -1 and -2. The next new label will then have "frequency" -3. When we sort by (frequency, label), we then get a stable sort. Storing frequencies then allows us to make the next nice improvement. Previously we had to iterate over the whole training set, to pre-process it for the deprojectivisation. This led to storing the whole training set in memory. This was most of the required memory during training. To prevent this, we now store the frequencies as we stream in the data, and deprojectivize as we go. Once we've built the frequencies, we can then apply a frequency cut-off when we decide how many classes to make. Finally, to allow proper data streaming, we also have to have some way of shuffling the iterator. This is awkward if the training files have multiple documents in them. To solve this, the GoldCorpus class now writes the training data to disk in msgpack files, one per document. We can then shuffle the data by shuffling the paths. This is a squash merge, as I made a lot of very small commits. Individual commit messages below. 
* Simplify label management for TransitionSystem and its subclasses * Fix serialization for new label handling format in parser * Simplify and improve GoldCorpus class. Reduce memory use, write to temp dir * Set actions in transition system * Require thinc 6.11.1.dev4 * Fix error in parser init * Add unicode declaration * Fix unicode declaration * Update textcat test * Try to get model training on less memory * Print json loc for now * Try rapidjson to reduce memory use * Remove rapidjson requirement * Try rapidjson for reduced mem usage * Handle None heads when projectivising * Stream json docs * Fix train script * Handle projectivity in GoldParse * Fix projectivity handling * Add minibatch_by_words util from ud_train * Minibatch by number of words in spacy.cli.train * Move minibatch_by_words util to spacy.util * Fix label handling * More hacking at label management in parser * Fix encoding in msgpack serialization in GoldParse * Adjust batch sizes in parser training * Fix minibatch_by_words * Add merge_subtokens function to pipeline.pyx * Register merge_subtokens factory * Restore use of msgpack tmp directory * Use minibatch-by-words in train * Handle retokenization in scorer * Change back-off approach for missing labels. Use 'dep' label * Update NER for new label management * Set NER tags for over-segmented words * Fix label alignment in gold * Fix label back-off for infrequent labels * Fix int type in labels dict key * Fix int type in labels dict key * Update feature definition for 8 feature set * Update ud-train script for new label stuff * Fix json streamer * Print the line number if conll eval fails * Update children and sentence boundaries after deprojectivisation * Export set_children_from_heads from doc.pxd * Render parses during UD training * Remove print statement * Require thinc 6.11.1.dev6. 
Try adding wheel as install_requires * Set different dev version, to flush pip cache * Update thinc version * Update GoldCorpus docs * Remove print statements * Fix formatting and links [ci skip]	2018-03-19 02:58:08 +01:00
Matthew Honnibal	d7ce6527fb	Use increasing batch sizes in ud-train	2018-03-14 20:15:28 +01:00
Matthew Honnibal	5dddb30e5b	Fix ud-train script	2018-03-11 01:26:45 +01:00
Matthew Honnibal	2cab4d6517	Remove use of attr module in ud_train	2018-03-11 00:59:39 +01:00
Matthew Honnibal	754ea1b2f7	Link in spaCy CoNLL commands	2018-03-10 23:42:15 +01:00
Matthew Honnibal	3478ea76d1	Add ud_train and ud_evaluate CLI commands	2018-03-10 23:41:55 +01:00
Matthew Honnibal	b59765ca9f	Stream gold during spacy train	2018-03-10 22:32:45 +01:00
Matthew Honnibal	86405e4ad1	Fix CLI for multitask objectives	2018-02-18 10:59:11 +01:00
Matthew Honnibal	a34749b2bf	Add multitask objectives options to train CLI	2018-02-17 22:03:54 +01:00
Matthew Honnibal	262d0a3148	Fix overwriting of lexical attributes when loading vectors during training	2018-02-17 18:11:11 +01:00
Johannes Dollinger	bf94c13382	Don't fix random seeds on import	2018-02-13 12:42:23 +01:00
Ali Zarezade	9df9da34a3	Fix init_model issue Fixing issue #1928	2018-02-03 17:21:34 +03:30
ines	3c1fb9d02d	Make validate command fail more gracefully if version not found Mostly relevant during development when working with .dev versions	2018-01-31 22:06:28 +01:00
Adam Binford	1a2c2f7d7f	Fixed auto linking after download and added simple test to check	2018-01-29 14:25:21 -05:00
Matthew Honnibal	7ca49c2061	Merge branch 'master' into feature-improve-model-download	2018-01-10 18:21:55 +01:00
Søren Lind Kristiansen	10dab8eef8	Remove dummy variable from function calls	2018-01-05 09:37:05 +01:00
Søren Lind Kristiansen	7f0ab145e9	Don't pass CLI command name as dummy argument	2018-01-04 21:33:47 +01:00
ines	2c656f90fb	Exit with 1 if incompatible models found (see #1714)	2018-01-03 21:20:35 +01:00
ines	dacfaa2ca4	Ensure that download command exits properly (resolves #1714)	2018-01-03 21:03:36 +01:00
Søren Lind Kristiansen	a9ff6eadc9	Prefix dummy argument names with underscore	2018-01-03 20:48:12 +01:00
ines	1081e08efb	Fix formatting	2018-01-03 20:14:50 +01:00
ines	d8109964d6	Use --no-deps on model install In general, it's nice for models to specify spaCy as a dependency. However, this tends to cause problems in conda environments, as pip will re-install spaCy and its dependencies (especially Thinc)	2018-01-03 17:40:37 +01:00
ines	319d754309	Fix overwriting of existing symlinks Check for is_symlink() to also overwrite invalid and outdated symlinks. Also show better error message if link path exists but is not symlink (i.e. file or directory).	2018-01-03 17:39:36 +01:00
ines	8ba0dfd017	Make message on failed linking more clear	2018-01-03 17:38:09 +01:00
Søren Lind Kristiansen	d6327e8495	Fix handling case when vectors not specified	2018-01-03 12:20:49 +01:00
Søren Lind Kristiansen	bcc51d7d8b	Fix shifted positional arguments	2018-01-03 12:19:47 +01:00
Søren Lind Kristiansen	5a9d377580	Remove abbreviation for positional plac argument	2017-12-11 11:08:29 +01:00
Isaac Sijaranamual	20ae0c459a	Fixes "Error saving model" #1622	2017-12-10 23:07:13 +01:00
Isaac Sijaranamual	e188b61960	Make cli/train.py not eat exception	2017-12-10 22:53:08 +01:00
ines	5eaa61c2b8	Fix formatting	2017-12-07 10:23:09 +01:00
ines	24e80c51b8	Document init-model command	2017-12-07 10:14:37 +01:00
Matthew Honnibal	c91f451b0f	Fix imports and CLI in init-model	2017-12-07 10:03:07 +01:00
ines	82e80ff928	Rename model command to init_model and fix formatting	2017-12-07 09:59:23 +01:00
Ines Montani	2feeb428d6	Merge pull request #1646 from GreenRiverRUS/master Added model command to create models from raw data	2017-12-07 08:54:26 +00:00
Thomas Werkmeister	94eac75b7c	fix setup.py spacy req string for packaging Requirement should be `spacy>=2.0.2` instead of `spacy2.0.2`	2017-12-03 04:16:28 -06:00
Vadim Mazaev	495eacf470	Merge branch 'model_command'	2017-11-30 12:30:26 +03:00
Vadim Mazaev	c332ffdde1	Added model command to create model from raw data: words counts, brown clusters and vectors	2017-11-27 01:21:47 +03:00
Matthew Honnibal	2acc907d55	Improve profiling	2017-11-23 12:33:03 +00:00
Matthew Honnibal	8d692771f6	Improve profiling	2017-11-15 13:51:25 +01:00
ines	4c5d2c80d5	Re-add python -m to commands, too brittle :( (see #1536)	2017-11-10 02:30:55 +01:00
Matthew Honnibal	de45702bbe	Strip dev suffixes from version for compatibility check	2017-11-08 18:40:21 +01:00
Matthew Honnibal	a2f980de4e	Exclude .devN versioning from compatibility check	2017-11-08 18:03:52 +01:00
ines	a4662a31a9	Move model package templates to cli.package and update docs	2017-11-07 12:15:35 +01:00
Matthew Honnibal	c2bbf076a4	Add document length cap for training	2017-11-03 01:54:54 +01:00
Matthew Honnibal	eca41f0cf6	Fix filename conversion for conllu	2017-11-01 21:26:49 +01:00
Matthew Honnibal	e237472cdc	Fix tag and filename conversion for conllu	2017-11-01 21:25:33 +01:00
ines	affd3404ab	Remove old model command (now "vocab")	2017-11-01 13:14:03 +01:00
ines	37e62ab0e2	Update vector meta in meta.json	2017-11-01 01:25:09 +01:00
Matthew Honnibal	c390f2d745	Make it easier to pass explicit no-pruning to vocab	2017-10-31 20:14:47 +01:00
Matthew Honnibal	3659a807b0	Remove vector pruning arg from train CLI	2017-10-31 19:21:05 +01:00
Matthew Honnibal	59203a2e8a	Move vector pruning command into spacy vocab cli tool	2017-10-31 19:10:01 +01:00
ines	803e41bc66	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-10-30 18:39:51 +01:00
ines	abf8aa05d3	Populate --create-meta defaults from file if available If meta.json is found in directory and user chooses to overwrite it, show existing data as defaults.	2017-10-30 18:39:38 +01:00
ines	ce98fa7934	Fix formatting	2017-10-30 18:38:55 +01:00
ines	98c35d2585	Fix spacy vocab command	2017-10-30 18:38:41 +01:00
Matthew Honnibal	e98451b5f7	Add -prune-vectors argument to spacy.cli.train	2017-10-30 18:00:10 +01:00
Explosion Bot	05a1dd570e	Fix vocab script	2017-10-30 16:19:22 +01:00
Explosion Bot	b46bdce8d2	Add missing import	2017-10-30 16:18:10 +01:00
Explosion Bot	0fc1209421	Wire up new vocab command	2017-10-30 16:14:50 +01:00
Matthew Honnibal	64e4ff7c4b	Merge 'tidy-up' changes into branch. Resolve conflicts	2017-10-28 13:16:06 +02:00
ines	d941fc3667	Tidy up CLI	2017-10-27 14:38:39 +02:00
Matthew Honnibal	531142a933	Merge remote-tracking branch 'origin/develop' into feature/better-parser	2017-10-27 12:34:48 +00:00
Matthew Honnibal	b9616419e1	Add try/except around bz2 import	2017-10-27 01:18:05 +00:00
ines	11e3f19764	Fix vectors data added after training (see #1457)	2017-10-25 16:08:26 +02:00
ines	057954695b	Read pipeline and vector data off model in --generate-meta	2017-10-25 16:03:26 +02:00
ines	273e638183	Add vector data to model meta after training (see #1457)	2017-10-25 16:03:05 +02:00
ines	95f6174516	Remove tensorizer from model pipeline example in spacy package	2017-10-24 16:00:56 +02:00
ines	24512420b1	Show error if data_path does not exist or is None (see #1102)	2017-10-19 00:53:49 +02:00
Matthew Honnibal	dc01acd821	Escape encoding in validate function	2017-10-12 22:23:21 +02:00
ines	fff1028391	Add validate CLI command	2017-10-12 20:05:06 +02:00
Matthew Honnibal	a955843684	Increase default number of epochs	2017-10-12 13:13:01 +02:00
Matthew Honnibal	acba2e1051	Fix metadata in training	2017-10-11 08:55:52 +02:00
Matthew Honnibal	74c2c6a58c	Add default name and lang to meta	2017-10-11 08:49:12 +02:00
Matthew Honnibal	5156074df1	Make loading code more consistent in train command	2017-10-10 12:51:20 -05:00
Matthew Honnibal	97c9b5db8b	Patch spacy.train for new pipeline management	2017-10-09 23:41:16 -05:00
Matthew Honnibal	a635240398	Add conll_ner2json converter	2017-10-09 22:03:26 -05:00
Matthew Honnibal	735d18654d	Add NER converter for CoNLL 2003 data	2017-10-09 20:06:28 -05:00
Matthew Honnibal	808d8740d6	Remove print statement	2017-10-09 08:45:20 -05:00
Matthew Honnibal	0f41b25f60	Add speed benchmarks to metadata	2017-10-09 08:05:37 -05:00
Matthew Honnibal	be4f0b6460	Update defaults	2017-10-08 02:08:12 -05:00
Matthew Honnibal	9d66a915da	Update training defaults	2017-10-07 21:02:38 -05:00
Matthew Honnibal	09442d25ec	Merge remote-tracking branch 'origin/develop' into feature/parser-history-model	2017-10-07 07:05:04 -05:00
Matthew Honnibal	f4c9a98166	Fix spacy evaluate command on non-GPU	2017-10-06 13:17:47 -05:00
Matthew Honnibal	c6cd81f192	Wrap try/except around model saving	2017-10-05 08:14:24 -05:00
Matthew Honnibal	5743b06e36	Wrap model saving in try/except	2017-10-05 08:12:50 -05:00
ines	73ac0aa0b5	Update spacy evaluate and add displaCy option	2017-10-04 00:03:15 +02:00
Matthew Honnibal	f24c2e3a8a	Fix evaluate for non-GPU	2017-10-03 22:47:31 +02:00
Matthew Honnibal	1289187279	Fix circular import	2017-10-03 09:33:21 -05:00
Matthew Honnibal	a44c4c3a5b	Add timer to evaluate	2017-10-03 09:15:35 -05:00
Matthew Honnibal	8902df44de	Fix component disabling during training	2017-10-02 21:07:23 +02:00
Matthew Honnibal	c617d288d8	Update pipeline component names in spaCy train	2017-10-02 17:20:19 +02:00
Matthew Honnibal	f942903429	Improve sentence merging in iob2json	2017-10-02 17:02:10 +02:00
Matthew Honnibal	31681d20e0	Fix concatenation in iob2json converter	2017-10-02 16:50:26 +02:00
Matthew Honnibal	4896ce3320	Remove misleading comment	2017-10-02 00:09:14 +02:00
Matthew Honnibal	94df115a81	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-10-01 14:06:23 -05:00
Matthew Honnibal	69c7c642c2	Add spacy evaluate	2017-10-01 14:05:04 -05:00
ines	fd1a9225d8	Handle conversion of pipeline components correctly Allow both comma and comma + whitespace as separators	2017-09-29 20:52:56 +02:00
Matthew Honnibal	ac8481a7b0	Print NER loss	2017-09-28 08:05:31 -05:00
Matthew Honnibal	542ebfa498	Improve defaults	2017-09-27 18:54:37 -05:00
Matthew Honnibal	dcb86bdc43	Default batch size to 32	2017-09-27 11:48:19 -05:00
ines	1ff62eaee7	Fix option shortcut to avoid conflict	2017-09-26 17:59:34 +02:00
ines	7fdfb78141	Add version option to cli.train	2017-09-26 17:34:52 +02:00
Matthew Honnibal	698fc0d016	Remove merge artefact	2017-09-26 08:31:37 -05:00
Matthew Honnibal	defb68e94f	Update feature/noshare with recent develop changes	2017-09-26 08:15:14 -05:00
ines	edf7e4881d	Add meta.json option to cli.train and add relevant properties Add accuracy scores to meta.json instead of accuracy.json and replace all relevant properties like lang, pipeline, spacy_version in existing meta.json. If not present, also add name and version placeholders to make it packagable.	2017-09-25 19:00:47 +02:00
Matthew Honnibal	204b58c864	Fix evaluation during training	2017-09-24 05:01:03 -05:00
Matthew Honnibal	dc3a623d00	Remove unused update_shared argument	2017-09-24 05:00:37 -05:00
Matthew Honnibal	4348c479fc	Merge pre-trained vectors and noshare patches	2017-09-22 20:07:28 -05:00
Matthew Honnibal	e93d43a43a	Fix training with preset vectors	2017-09-22 20:00:40 -05:00
Matthew Honnibal	a2357cce3f	Set random seed in train script	2017-09-23 02:57:31 +02:00
Matthew Honnibal	0a9016cade	Fix serialization during training	2017-09-21 13:06:45 -05:00
Matthew Honnibal	20193371f5	Don't share CNN, to reduce complexities	2017-09-21 14:59:48 +02:00
Matthew Honnibal	1d73dec8b1	Refactor train script	2017-09-20 19:17:10 -05:00
Matthew Honnibal	a0c4b33d03	Support resuming a model during spacy train	2017-09-18 18:04:47 -05:00
Matthew Honnibal	8496d76224	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-09-14 09:21:20 -05:00
Matthew Honnibal	24ff6b0ad9	Fix parsing and tok2vec models	2017-09-06 05:50:58 -05:00
Matthew Honnibal	e920885676	Fix pickle during train	2017-09-02 12:46:01 -05:00
ines	7e04b7f89c	Fix info text on pipeline in package cli	2017-08-26 18:30:59 +02:00
Matthew Honnibal	876f38c548	Merge pull request #1279 from oroszgy/model_cli_v2 Added vector loading to model cli	2017-08-26 15:57:50 +02:00
ines	bb1abbeba5	Only link model if download was successful	2017-08-23 12:36:31 +02:00
Matthew Honnibal	7be5f30f17	Add profile function	2017-08-21 23:22:49 +02:00
Gyorgy Orosz	b3576bfc86	Added vector loading to model cli	2017-08-20 23:16:12 +02:00
Matthew Honnibal	7a6edeea68	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-08-20 12:55:39 -05:00
Matthew Honnibal	f2f9229964	Fix name of update_shared flag	2017-08-20 18:19:06 +02:00
Matthew Honnibal	80a5146ec2	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2017-08-20 11:07:08 -05:00
Matthew Honnibal	84bb543e4d	Add gold_preproc flag to cli/train	2017-08-20 11:07:00 -05:00
Gyorgy Orosz	e5344b83a3	Ported model cli from v1	2017-08-19 21:45:23 +02:00
Matthew Honnibal	11c31d285c	Restore changes from nn-beam-parser	2017-08-18 22:26:12 +02:00
Matthew Honnibal	52c180ecf5	Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop" This reverts commit `ea8de11ad5`, reversing changes made to `08e443e083`.	2017-08-14 13:00:23 +02:00
Matthew Honnibal	4ae0d5e1e6	Set defaults for convert command	2017-08-13 09:03:38 +02:00
ines	d4f2baf7dd	Add create_meta option to package command Re-create meta.json in model directory, even if it exists. Especially useful when updating existing spaCy models or training with Prodigy. Ensures user won't end up with multiple "en_core_web_sm" models, and offers easy way to change the model's name and settings without having to edit the meta.json file.	2017-08-12 21:44:18 +02:00
Matthew Honnibal	8870d491f1	Remove redundant pickling during training	2017-08-12 08:55:53 -05:00
ines	28e2fec23b	Fix autolinking failure on fresh model install (resolves #1138) On fresh install via subprocess, pip.get_installed_distributions() won't show new model, so is_package check in link command fails. Solution for now is to get model package path explicitly and pass it to link command.	2017-08-09 11:52:38 +02:00
Matthew Honnibal	0a566dc320	Add update_tensors flag to Language.update. Experimental, re #1182	2017-08-06 02:18:12 +02:00
György Orosz	62dbf9025c	Fixed conllu converter	2017-06-09 22:53:56 +02:00
ines	03db56f48c	Detect spaCy version and add package title Package title allows customised package names (like spacy-nightly)	2017-06-05 20:11:02 +02:00
Matthew Honnibal	c52fde40f4	Improve train CLI	2017-06-04 20:18:37 -05:00
ines	848e47669e	Fix typo	2017-06-04 20:44:15 +02:00
ines	7b7d46b64e	Fix typo and success message	2017-06-04 13:45:50 +02:00
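
The message for commit bede11b67c describes keeping label order stable by giving newly added labels negative "frequencies" (-1, -2, -3, ...) and sorting on (frequency, label). The following is a minimal sketch of that idea, not spaCy's actual code: the helper names `add_label` and `ordered_labels`, the action key `"LEFT"`, the example counts, and the choice of a descending sort (counted labels first, new labels after them in insertion order) are all assumptions made for illustration.

```python
def add_label(labels, action, label):
    """Register a label that was not seen when frequencies were counted.

    `labels` is a nested dict: action -> {label: frequency}.
    A new label gets the next unused negative "frequency" (-1, -2, ...).
    """
    by_label = labels.setdefault(action, {})
    if label in by_label:
        return
    lowest = min(by_label.values(), default=0)
    # If only counted (positive) labels exist so far, start at -1.
    by_label[label] = min(lowest, 0) - 1


def ordered_labels(labels, action):
    """Return labels sorted by (frequency, label), highest first.

    Assumption: a descending sort, so the position in this list can serve
    as a class ID that does not shift when later labels are appended.
    """
    pairs = sorted(
        ((freq, label) for label, freq in labels[action].items()),
        reverse=True,
    )
    return [label for _, label in pairs]


# Hypothetical counts gathered while streaming training data.
labels = {"LEFT": {"nsubj": 120, "det": 300}}
add_label(labels, "LEFT", "zzz_new")
add_label(labels, "LEFT", "aaa_new")
before = ordered_labels(labels, "LEFT")

add_label(labels, "LEFT", "mmm_new")
after = ordered_labels(labels, "LEFT")
```

In this sketch the new labels keep the order in which they were added, regardless of their alphabetical order, and a later addition only extends the end of the list.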

370 Commits