Commit Graph

236 Commits

Author SHA1 Message Date
Matthew Honnibal
e9dd5943b9 Support exclusive_classes setting for textcat models 2019-02-23 11:57:16 +01:00
Matthew Honnibal
83ac227bd3
💫 Better support for semi-supervised learning (#3035)
The new spacy pretrain command implements BERT/ULMFit/etc-like transfer learning, using our Language Modelling with Approximate Outputs version of BERT's cloze task. Pretraining is convenient, but in some ways it's a bit of a strange solution. All we're doing is initialising the weights. At the same time, we're putting a lot of work into our optimisation so that it's less sensitive to initial conditions, and more likely to find good optima. I discuss this a bit in the pseudo-rehearsal blog post: https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting
Support semi-supervised learning in spacy train

One obvious way to improve these pretraining methods is to do multi-task learning, instead of just transfer learning. This has been shown to work very well: https://arxiv.org/pdf/1809.08370.pdf . This patch makes it easy to do this sort of thing.

    Add a new argument to spacy train, --raw-text. This takes a jsonl file with unlabelled data that can be used in arbitrary ways to do semi-supervised learning.

    Add a new method to the Language class and to pipeline components, .rehearse(). This is like .update(), but doesn't expect GoldParse objects. It takes a batch of Doc objects, and performs an update on some semi-supervised objective.

    Move the BERT-LMAO objective out from spacy/cli/pretrain.py into spacy/_ml.py, so we can create a new pipeline component, ClozeMultitask. This can be specified as a parser or NER multitask in the spacy train command. Example usage:

python -m spacy train en ./tmp ~/data/en-core-web/train/nw.json ~/data/en-core-web/dev/nw.json --pipeline parser --raw-text ~/data/unlabelled/reddit-100k.jsonl --vectors en_vectors_web_lg --parser-multitasks cloze

Implement rehearsal methods for pipeline components

The new --raw-text argument and nlp.rehearse() method also give us a good place to implement the idea from the pseudo-rehearsal blog post in the parser (a usage sketch follows at the end of this list). This works as follows:

    Add a new nlp.resume_training() method. This allocates copies of pre-trained models in the pipeline, setting things up for the rehearsal updates. It also returns an optimizer object. This also greatly reduces confusion around the nlp.begin_training() method, which randomises the weights, making it not suitable for adding new labels or otherwise fine-tuning a pre-trained model.

    Implement rehearsal updates on the Parser class, making it available for the dependency parser and NER. During rehearsal, the initial model is used to supervise the model being trained. The current model is asked to match the predictions of the initial model on some data. This minimises catastrophic forgetting, by keeping the model's predictions close to the original. See the blog post for details.

    Implement rehearsal updates for tagger

    Implement rehearsal updates for text categorizer
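
A minimal usage sketch of the rehearsal API described above, assuming a spaCy version that ships resume_training() and rehearse() (the batching and loop are illustrative, not part of this patch):

```python
import random
import spacy

nlp = spacy.load("en_core_web_sm")  # any pre-trained pipeline
optimizer = nlp.resume_training()   # copy current weights; returns an optimizer

raw_texts = [
    "Unlabelled text, e.g. read from the --raw-text jsonl file.",
    "More raw text for the semi-supervised objective.",
]
losses = {}
for _ in range(10):
    random.shuffle(raw_texts)
    raw_batch = [nlp.make_doc(text) for text in raw_texts]
    # Like nlp.update(), but no GoldParse objects: the initial model's
    # predictions supervise the model being trained.
    nlp.rehearse(raw_batch, sgd=optimizer, losses=losses)
print(losses)
```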
2018-12-10 16:25:33 +01:00
Matthew Honnibal
375f0dc529
💫 Make TextCategorizer default to a simpler, GPU-friendly model (#3038)
Currently the TextCategorizer defaults to a fairly complicated model, designed partly around the active learning requirements of Prodigy. The model's a bit slow, and not very GPU-friendly.

This patch implements a straightforward CNN model that still performs pretty well. The replacement model also makes it easy to use the LMAO pretraining, since most of the parameters are in the CNN.

The replacement model has a flag to specify whether labels are mutually exclusive, which defaults to True. This has been a common problem with the text classifier. We'll also now be able to support adding labels to pretrained models again.
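
A hedged configuration sketch: in spaCy versions that expose the replacement model this way, the flag can be set when creating the pipe (the "architecture" key and "simple_cnn" value are assumptions about how the model is selected):

```python
import spacy

nlp = spacy.blank("en")
# exclusive_classes=True: labels form one mutually exclusive distribution
textcat = nlp.create_pipe(
    "textcat", config={"exclusive_classes": True, "architecture": "simple_cnn"}
)
nlp.add_pipe(textcat)
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")
```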

Resolves #2934, #2756, #1798, #1748.
2018-12-10 14:37:39 +01:00
Matthew Honnibal
d2ac618af1 Set cnn_maxout_pieces=3 2018-12-08 23:27:29 +01:00
Matthew Honnibal
cabaadd793
Fix build error from bad import
Thinc v7.0.0.dev6 moved FeatureExtracter around and didn't add a compatibility import.
2018-12-06 15:12:39 +01:00
Ines Montani
323fc26880 Tidy up and format remaining files 2018-11-30 17:43:08 +01:00
Ines Montani
eddeb36c96
💫 Tidy up and auto-format .py files (#2983)

## Description
- [x] Use [`black`](https://github.com/ambv/black) to auto-format all `.py` files.
- [x] Update flake8 config to exclude very large files (lemmatization tables etc.)
- [x] Update code to be compatible with flake8 rules
- [x] Fix various small bugs, inconsistencies and messy stuff in the language data
- [x] Update docs to explain new code style (`black`, `flake8`, when to use `# fmt: off` and `# fmt: on` and what `# noqa` means)

Once #2932 is merged, which auto-formats and tidies up the CLI, we'll be able to run `flake8 spacy` and actually get meaningful results.

At the moment, the code style and linting isn't applied automatically, but I'm hoping that the new [GitHub Actions](https://github.com/features/actions) will let us auto-format pull requests and post comments with relevant linting information.

### Types of change
enhancement, code style

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-11-30 17:03:03 +01:00
Matthew Honnibal
ef0820827a
Update hyper-parameters after NER random search (#2972)
These experiments were completed a few weeks ago, but I didn't make the PR, pending model release.

    Token vector width: 128->96
    Hidden width: 128->64
    Embed size: 5000->2000
    Dropout: 0.2->0.1
    Updated optimizer defaults (unclear how important?)

This should improve speed, model size and load time, while keeping
similar or slightly better accuracy.

The tl;dr is we prefer to prevent over-fitting by reducing model size,
rather than using more dropout.
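
As a hedged illustration of applying such values: spaCy 2.x reads a number of hyper-parameters through util.env_opt-style environment lookups, so the settings above could be overridden without code changes (which names are actually read from the environment varies by version):

```python
import os

# Hypothetical overrides, set before the model is built so that
# util.env_opt-style lookups can pick them up. Names follow the
# list in this commit message; values are the new defaults.
os.environ["token_vector_width"] = "96"
os.environ["hidden_width"] = "64"
os.environ["embed_size"] = "2000"
```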
2018-11-27 18:49:52 +01:00
Matthew Honnibal
2527ba68e5 Fix tensorizer 2018-11-02 23:29:54 +00:00
Matthew Honnibal
53eb96db09 Fix definition of morphology model 2018-09-25 22:12:32 +02:00
Matthew Honnibal
e6dde97295 Add function to make morphologizer model 2018-09-25 10:57:59 +02:00
Matthew Honnibal
b10d0cce05 Add MultiSoftmax class
Add a new class for the Tagger model, MultiSoftmax. This allows softmax
prediction of multiple classes on the same output layer, e.g. one
variable with 3 classes, another with 4 classes. This makes a layer with
7 output neurons, which we softmax into two distributions.
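
A minimal numpy sketch of the idea (not the Thinc implementation): the 7 output neurons are sliced into a 3-class field and a 4-class field, and each slice is softmaxed independently:

```python
import numpy as np

def multi_softmax(scores, field_sizes=(3, 4)):
    """Softmax each contiguous field of the output layer separately."""
    outputs = []
    start = 0
    for size in field_sizes:
        field = scores[start : start + size]
        exps = np.exp(field - field.max())  # numerically stable softmax
        outputs.append(exps / exps.sum())
        start += size
    return np.concatenate(outputs)

scores = np.random.randn(7)  # 3 + 4 output neurons
probs = multi_softmax(scores)
assert np.isclose(probs[:3].sum(), 1.0) and np.isclose(probs[3:].sum(), 1.0)
```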
2018-09-24 17:35:28 +02:00
Matthew Honnibal
99a6011580 Avoid adding empty layer in model, to keep models backwards compatible 2018-09-14 22:51:58 +02:00
Matthew Honnibal
afeddfff26 Fix PyTorch BiLSTM 2018-09-13 22:54:34 +00:00
Matthew Honnibal
45032fe9e1 Support option of BiLSTM in Tok2Vec (requires pytorch) 2018-09-13 19:28:35 +02:00
Matthew Honnibal
4d2d7d5866 Fix new feature flags 2018-08-27 02:12:39 +02:00
Matthew Honnibal
8051136d70 Support subword_features and conv_depth params in Tok2Vec 2018-08-27 01:50:48 +02:00
Matthew Honnibal
401213fb1f Only warn about unnamed vectors if non-zero sized. 2018-05-19 18:51:55 +02:00
Matthew Honnibal
2338e8c7fc Update develop from master 2018-05-02 01:36:12 +00:00
Matthew Honnibal
548bdff943 Update default Adam settings 2018-05-01 15:18:20 +02:00
Matthew Honnibal
2c4a6d66fa Merge master into develop. Big merge, many conflicts -- need to review 2018-04-29 14:49:26 +02:00
Ines Montani
3141e04822
💫 New system for error messages and warnings (#2163)
* Add spacy.errors module

* Update deprecation and user warnings

* Replace errors and asserts with new error message system

* Remove redundant asserts

* Fix whitespace

* Add messages for print/util.prints statements

* Fix typo

* Fix typos

* Move CLI messages to spacy.cli._messages

* Add decorator to display error code with message

An implementation like this is nice because it only modifies the string when it's retrieved from the containing class – so we don't have to worry about manipulating tracebacks etc. (A sketch of the pattern follows at the end of this list.)

* Remove unused link in spacy.about

* Update errors for invalid pipeline components

* Improve error for unknown factories

* Add displaCy warnings

* Update formatting consistency

* Move error message to spacy.errors

* Update errors and check if doc returned by component is None
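
A sketch of the decorator pattern described above (close to, though not necessarily identical to, the shipped spacy.errors implementation):

```python
def add_codes(err_cls):
    """Wrap an errors class so each message is prefixed with its
    attribute name (the error code) only when it's looked up."""
    class ErrorsWithCodes(object):
        def __getattribute__(self, code):
            msg = getattr(err_cls, code)
            return "[{code}] {msg}".format(code=code, msg=msg)
    return ErrorsWithCodes()

@add_codes
class Errors(object):
    E001 = "No component '{name}' found in pipeline."

# Errors.E001 -> "[E001] No component '{name}' found in pipeline."
```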
2018-04-03 15:50:31 +02:00
Matthew Honnibal
4555e3e251 Don't assume pretrained_vectors cfg set in build_tagger 2018-03-28 20:12:45 +02:00
Matthew Honnibal
f8dd905a24 Warn and fallback if vectors have no name 2018-03-28 18:24:53 +02:00
Matthew Honnibal
95a9615221 Fix loading of multiple pre-trained vectors
This patch addresses #1660, which was caused by keying all pre-trained
vectors with the same ID when telling Thinc how to refer to them. This
meant that if multiple models were loaded that had pre-trained vectors,
errors or incorrect behaviour resulted.

The vectors class now includes a .name attribute, which defaults to:
{nlp.meta['lang']}_{nlp.meta['name']}.vectors
The vectors name is set in the cfg of the pipeline components under the
key pretrained_vectors. This replaces the previous cfg key
pretrained_dims.

In order to make existing models compatible with this change, we check
for the pretrained_dims key when loading models in from_disk and
from_bytes, and add the cfg key pretrained_vectors if we find it.
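
For illustration, the default naming scheme spelled out as code (a hedged sketch, not the actual implementation):

```python
def default_vectors_name(meta):
    """Build the default Vectors.name from a model's meta.json data."""
    return "{}_{}.vectors".format(meta["lang"], meta["name"])

meta = {"lang": "en", "name": "core_web_lg"}
assert default_vectors_name(meta) == "en_core_web_lg.vectors"
```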
2018-03-28 16:02:59 +02:00
Matthew Honnibal
1f7229f40f Revert "Merge branch 'develop' of https://github.com/explosion/spaCy into develop"
This reverts commit c9ba3d3c2d, reversing
changes made to 92c26a35d4.
2018-03-27 19:23:02 +02:00
Matthew Honnibal
6e641f46d4 Create a preprocess function that gets bigrams 2017-11-12 00:43:41 +01:00
Matthew Honnibal
d5537e5516 Work on Windows test failure 2017-11-08 13:25:18 +01:00
Matthew Honnibal
1d5599cd28 Fix dtype 2017-11-08 12:18:32 +01:00
Matthew Honnibal
a8b592783b Make a dtype more specific, to fix a windows build 2017-11-08 11:24:35 +01:00
Matthew Honnibal
13336a6197 Fix Adam import 2017-11-06 14:25:37 +01:00
Matthew Honnibal
2eb11d60f2 Add function create_default_optimizer to spacy._ml 2017-11-06 14:11:59 +01:00
Matthew Honnibal
33bd2428db Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-11-03 13:29:56 +01:00
Matthew Honnibal
c9b118a7e9 Set softmax attr in tagger model 2017-11-03 11:22:01 +01:00
Matthew Honnibal
b3264aa5f0 Expose the softmax layer in the tagger model, to allow setting tensors 2017-11-03 11:19:51 +01:00
Matthew Honnibal
6771780d3f Fix backprop of padding variable 2017-11-03 01:54:34 +01:00
Matthew Honnibal
260e6ee3fb Improve efficiency of backprop of padding variable 2017-11-03 00:49:11 +01:00
Matthew Honnibal
e85e31cfbd Fix backprop of d_pad 2017-11-01 19:27:26 +01:00
Matthew Honnibal
d17a12c71d Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-11-01 16:38:26 +01:00
Matthew Honnibal
9f9439667b Don't create low-data text classifier if no vectors 2017-11-01 16:34:09 +01:00
Matthew Honnibal
8075726838 Restore vector usage in models 2017-10-31 19:21:17 +01:00
Matthew Honnibal
cb5217012f Fix vector remapping 2017-10-31 11:40:46 +01:00
Matthew Honnibal
ce876c551e Fix GPU usage 2017-10-31 02:33:34 +01:00
Matthew Honnibal
368fdb389a WIP on refactoring and fixing vectors 2017-10-31 02:00:26 +01:00
Matthew Honnibal
3b91097321 Whitespace 2017-10-28 17:05:11 +00:00
Matthew Honnibal
6ef72864fa Improve initialization for hidden layers 2017-10-28 17:05:01 +00:00
Matthew Honnibal
df4803cc6d Add learned missing values for parser 2017-10-28 16:45:14 +00:00
Matthew Honnibal
64e4ff7c4b Merge 'tidy-up' changes into branch. Resolve conflicts 2017-10-28 13:16:06 +02:00
Explosion Bot
b22e42af7f Merge changes to parser and _ml 2017-10-28 11:52:10 +02:00
ines
d96e72f656 Tidy up rest 2017-10-27 21:07:59 +02:00
ines
e33b7e0b3c Tidy up parser and ML 2017-10-27 14:39:30 +02:00
Matthew Honnibal
531142a933 Merge remote-tracking branch 'origin/develop' into feature/better-parser 2017-10-27 12:34:48 +00:00
Matthew Honnibal
c9987cf131 Avoid use of numpy.tensordot 2017-10-27 10:18:36 +00:00
Matthew Honnibal
f6fef30adc Remove dead code from spacy._ml 2017-10-27 10:16:41 +00:00
ines
4eb5bd02e7 Update textcat pre-processing after to_array change 2017-10-27 00:32:12 +02:00
Matthew Honnibal
35977bdbb9 Update better-parser branch with develop 2017-10-26 00:55:53 +00:00
Matthew Honnibal
075e8118ea Update from develop 2017-10-25 12:45:21 +02:00
ines
0b1dcbac14 Remove unused function 2017-10-25 12:08:46 +02:00
Matthew Honnibal
3faf9189a2 Make parser hidden shape consistent even if maxout==1 2017-10-20 16:23:31 +02:00
Matthew Honnibal
b101736555 Fix precomputed layer 2017-10-20 12:14:52 +02:00
Matthew Honnibal
64658e02e5 Implement fancier initialisation for precomputed layer 2017-10-20 03:07:45 +02:00
Matthew Honnibal
a17a1b60c7 Clean up redundant PrecomputableMaxouts class 2017-10-19 20:26:37 +02:00
Matthew Honnibal
b00d0a2c97 Fix bias in parser 2017-10-19 18:42:11 +02:00
Matthew Honnibal
03a215c5fd Make PrecomputableAffines work 2017-10-19 13:44:49 +02:00
Matthew Honnibal
76fe24f44d Improve embedding defaults 2017-10-11 09:44:17 +02:00
Matthew Honnibal
b2b8506f2c Remove whitespace 2017-10-09 03:35:57 +02:00
Matthew Honnibal
d163115e91 Add non-linearity after history features 2017-10-07 21:00:43 -05:00
Matthew Honnibal
5c750a9c2f Reserve 0 for 'missing' in history features 2017-10-06 06:10:13 -05:00
Matthew Honnibal
fbba7c517e Pass dropout through to embed tables 2017-10-06 06:09:18 -05:00
Matthew Honnibal
3db0a32fd6 Fix dropout for history features 2017-10-05 22:21:30 -05:00
Matthew Honnibal
fc06b0a333 Fix training when hist_size==0 2017-10-05 21:52:28 -05:00
Matthew Honnibal
dcdfa071aa Disable LayerNorm hack 2017-10-04 20:06:52 -05:00
Matthew Honnibal
bfabc333be Merge remote-tracking branch 'origin/develop' into feature/parser-history-model 2017-10-04 20:00:36 -05:00
Matthew Honnibal
92066b04d6 Fix Embed and HistoryFeatures 2017-10-04 19:55:34 -05:00
Matthew Honnibal
bd8e84998a Add nO attribute to TextCategorizer model 2017-10-04 16:07:30 +02:00
Matthew Honnibal
f8a0614527 Improve textcat model slightly 2017-10-04 15:15:53 +02:00
Matthew Honnibal
39798b0172 Uncomment layernorm adjustment hack 2017-10-04 15:12:09 +02:00
Matthew Honnibal
774f5732bd Fix dimensionality of textcat when no vectors available 2017-10-04 14:55:15 +02:00
Matthew Honnibal
af75b74208 Unset LayerNorm backwards compat hack 2017-10-03 20:47:10 -05:00
Matthew Honnibal
246612cb53 Merge remote-tracking branch 'origin/develop' into feature/parser-history-model 2017-10-03 16:56:42 -05:00
Matthew Honnibal
5cbefcba17 Set backwards compatibility flag 2017-10-03 20:29:58 +02:00
Matthew Honnibal
5454b20cd7 Update thinc imports for 6.9 2017-10-03 20:07:17 +02:00
Matthew Honnibal
e514d6aa0a Import thinc modules more explicitly, to avoid cycles 2017-10-03 18:49:25 +02:00
Matthew Honnibal
b770f4e108 Fix embed class in history features 2017-10-03 13:26:55 +02:00
Matthew Honnibal
6aa6a5bc25 Add a layer type for history features 2017-10-03 12:43:09 +02:00
Matthew Honnibal
f6330d69e6 Default embed size to 7000 2017-09-28 08:07:41 -05:00
Matthew Honnibal
1a37a2c0a0 Update training defaults 2017-09-27 11:48:07 -05:00
Matthew Honnibal
e34e70673f Allow tagger models to be built with pre-defined tok2vec layer 2017-09-26 05:51:52 -05:00
Matthew Honnibal
63bd87508d Don't use iterated convolutions 2017-09-23 04:39:17 -05:00
Matthew Honnibal
4348c479fc Merge pre-trained vectors and noshare patches 2017-09-22 20:07:28 -05:00
Matthew Honnibal
4bd6a12b1f Fix Tok2Vec 2017-09-23 02:58:54 +02:00
Matthew Honnibal
980fb6e854 Refactor Tok2Vec 2017-09-22 09:38:36 -05:00
Matthew Honnibal
d9124f1aa3 Add link_vectors_to_models function 2017-09-22 09:38:22 -05:00
Matthew Honnibal
a186596307 Add 'reapply' combinator, for iterated CNN 2017-09-22 09:37:03 -05:00
Matthew Honnibal
40a4873b70 Fix serialization of model options 2017-09-21 13:07:26 -05:00
Matthew Honnibal
20193371f5 Don't share CNN, to reduce complexities 2017-09-21 14:59:48 +02:00
Matthew Honnibal
f5144f04be Add argument for CNN maxout pieces 2017-09-20 19:14:41 -05:00
Matthew Honnibal
78301b2d29 Avoid comparison to None in Tok2Vec 2017-09-20 00:19:34 +02:00
Matthew Honnibal
3fa76c17d1 Refactor Tok2Vec 2017-09-18 15:00:05 -05:00
Matthew Honnibal
7b3f391f80 Try dropping the Affine layer, conditionally 2017-09-18 11:35:59 -05:00