spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-11-11 12:18:04 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	29f9fec267	Improve spacy pretrain (#4393 ) * Support bilstm_depth arg in spacy pretrain * Add option to ignore zero vectors in get_cossim_loss * Use cosine loss in Cloze multitask	2019-10-07 23:34:58 +02:00
Ines Montani	3ba5238282	Make "unnamed vectors" warning a real warning	2019-09-16 15:16:12 +02:00
Ines Montani	af25323653	Tidy up and auto-format	2019-09-11 14:00:36 +02:00
Matthew Honnibal	c308cf3e3e	Merge branch 'master' into feature/lemmatizer	2019-08-25 13:52:27 +02:00
Matthew Honnibal	bcd08f20af	Merge changes from master	2019-08-21 14:18:52 +02:00
Ines Montani	f65e36925d	Fix absolute imports and avoid importing from cli	2019-08-20 15:08:59 +02:00
Ines Montani	009280fbc5	Tidy up and auto-format	2019-08-18 15:09:16 +02:00
Sofie Van Landeghem	0ba1b5eebc	CLI scripts for entity linking (wikipedia & generic) (#4091 ) * document token ent_kb_id * document span kb_id * update pipeline documentation * prior and context weights as bool's instead * entitylinker api documentation * drop for both models * finish entitylinker documentation * small fixes * documentation for KB * candidate documentation * links to api pages in code * small fix * frequency examples as counts for consistency * consistent documentation about tensors returned by predict * add entity linking to usage 101 * add entity linking infobox and KB section to 101 * entity-linking in linguistic features * small typo corrections * training example and docs for entity_linker * predefined nlp and kb * revert back to similarity encodings for simplicity (for now) * set prior probabilities to 0 when excluded * code clean up * bugfix: deleting kb ID from tokens when entities were removed * refactor train el example to use either model or vocab * pretrain_kb example for example kb generation * add to training docs for KB + EL example scripts * small fixes * error numbering * ensure the language of vocab and nlp stay consistent across serialization * equality with = * avoid conflict in errors file * add error 151 * final adjustements to the train scripts - consistency * update of goldparse documentation * small corrections * push commit * turn kb_creator into CLI script (wip) * proper parameters for training entity vectors * wikidata pipeline split up into two executable scripts * remove context_width * move wikidata scripts in bin directory, remove old dummy script * refine KB script with logs and preprocessing options * small edits * small improvements to logging of EL CLI script	2019-08-13 15:38:59 +02:00
svlandeg	cdc589d344	small fix	2019-07-15 12:04:45 +02:00
svlandeg	6e809e9b8b	proper error for missing cfg arguments	2019-07-15 11:42:50 +02:00
Matthew Honnibal	09dc01a426	Fix #3853 , and add warning	2019-07-11 14:46:47 +02:00
Matthew Honnibal	e19f4ee719	Add warning message re Issue #3853	2019-07-11 12:50:38 +02:00
Ines Montani	0b8406a05c	Tidy up and auto-format	2019-07-11 12:02:25 +02:00
svlandeg	2d2dea9924	experiment with adding NER types to the feature vector	2019-06-29 14:52:36 +02:00
svlandeg	c664f58246	adding prior probability as feature in the model	2019-06-28 16:22:58 +02:00
svlandeg	1c80b85241	fix tests	2019-06-28 08:59:23 +02:00
svlandeg	68a0662019	context encoder with Tok2Vec + linking model instead of cosine	2019-06-28 08:29:31 +02:00
svlandeg	bee23cd8af	try Tok2Vec instead of SpacyVectors	2019-06-25 16:09:22 +02:00
svlandeg	0d177c1146	clean up code, remove old code, move to bin	2019-06-18 13:20:40 +02:00
svlandeg	fb37cdb2d3	implementing el pipe in pipes.pyx (not tested yet)	2019-06-03 21:32:54 +02:00
Ines Montani	c23e234d65	Auto-format	2019-04-01 12:11:27 +02:00
Matthew Honnibal	f77bf2bdb1	Fix GPU training for textcat. Closes #3473	2019-03-26 13:36:11 +01:00
Matthew Honnibal	f436efd8a4	Small tweak to ensemble textcat model	2019-03-23 16:47:26 +01:00
Matthew Honnibal	6c783f8045	Bug fixes and options for TextCategorizer (#3472 ) * Fix code for bag-of-words feature extraction The _ml.py module had a redundant copy of a function to extract unigram bag-of-words features, except one had a bug that set values to 0. Another function allowed extraction of bigram features. Replace all three with a new function that supports arbitrary ngram sizes and also allows control of which attribute is used (e.g. ORTH, LOWER, etc). * Support 'bow' architecture for TextCategorizer This allows efficient ngram bag-of-words models, which are better when the classifier needs to run quickly, especially when the texts are long. Pass architecture="bow" to use it. The extra arguments ngram_size and attr are also available, e.g. ngram_size=2 means unigram and bigram features will be extracted. * Fix size limits in train_textcat example * Explain architectures better in docs	2019-03-23 16:44:44 +01:00
Matthew Honnibal	61617c64d5	Revert changes to optimizer default hyper-params (WIP) (#3415 ) While developing v2.1, I ran a bunch of hyper-parameter search experiments to find settings that performed well for spaCy's NER and parser. I ended up changing the default Adam settings from beta1=0.9, beta2=0.999, eps=1e-8 to beta1=0.8, beta2=0.8, eps=1e-5. This was giving a small improvement in accuracy (like, 0.4%). Months later, I run the models with Prodigy, which uses beam-search decoding even when the model has been trained with a greedy objective. The new models performed terribly...So, wtf? After a couple of days debugging, I figured out that the new optimizer settings was causing the model to converge to solutions where the top-scoring class often had a score of like, -80. The variance on the weights had gone up enormously. I guess I needed to update the L2 regularisation as well? Anyway. Let's just revert the change --- if the optimizer is finding such extreme solutions, that seems bad, and not nearly worth the small improvement in accuracy. Currently training a slate of models, to verify the accuracy change is minimal. Once the training is complete, we can merge this. <!--- Provide a general summary of your changes in the title. --> ## Description <!--- Use this section to describe your changes. If your changes required testing, include information about the testing environment and the tests you ran. If your test fixes a bug reported in an issue, don't forget to include the issue number. If your PR is still a work in progress, that's totally fine – just include a note to let us know. --> ### Types of change <!-- What type of change does your PR cover? Is it a bug fix, an enhancement or new feature, or a change to the documentation? --> ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2019-03-16 21:39:02 +01:00
Ines Montani	278e9d2eb0	Merge branch 'master' into feature/lemmatizer	2019-03-16 13:44:22 +01:00
Ines Montani	c998cde7e2	Auto-format [ci skip]	2019-03-10 19:22:59 +01:00
Matthew Honnibal	78aba46530	Update feature/lemmatizer from develop	2019-03-10 02:45:33 +01:00
Matthew Honnibal	0f12082465	Refactor morphologizer	2019-03-09 22:54:59 +00:00
Matthew Honnibal	ce1fe8a510	Add comment	2019-03-09 17:51:17 +00:00
Matthew Honnibal	28c26e212d	Fix textcat model for GPU	2019-03-09 17:50:08 +00:00
Matthew Honnibal	e1a83d15ed	Add support for character features to Tok2Vec	2019-03-09 11:50:08 +00:00
Ines Montani	ad834be494	Tidy up and auto-format	2019-03-08 13:28:53 +01:00
Matthew Honnibal	3993f41cc4	Update morphology branch from develop	2019-03-07 00:14:43 +01:00
Ines Montani	2982f82934	Auto-format	2019-02-24 14:09:15 +01:00
Matthew Honnibal	d13b9373bf	Improve initialization for mutually textcat	2019-02-23 12:27:45 +01:00
Matthew Honnibal	e9dd5943b9	Support exclusive_classes setting for textcat models	2019-02-23 11:57:16 +01:00
Matthew Honnibal	83ac227bd3	💫 Better support for semi-supervised learning (#3035 ) The new spacy pretrain command implemented BERT/ULMFit/etc-like transfer learning, using our Language Modelling with Approximate Outputs version of BERT's cloze task. Pretraining is convenient, but in some ways it's a bit of a strange solution. All we're doing is initialising the weights. At the same time, we're putting a lot of work into our optimisation so that it's less sensitive to initial conditions, and more likely to find good optima. I discuss this a bit in the pseudo-rehearsal blog post: https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting Support semi-supervised learning in spacy train One obvious way to improve these pretraining methods is to do multi-task learning, instead of just transfer learning. This has been shown to work very well: https://arxiv.org/pdf/1809.08370.pdf . This patch makes it easy to do this sort of thing. Add a new argument to spacy train, --raw-text. This takes a jsonl file with unlabelled data that can be used in arbitrary ways to do semi-supervised learning. Add a new method to the Language class and to pipeline components, .rehearse(). This is like .update(), but doesn't expect GoldParse objects. It takes a batch of Doc objects, and performs an update on some semi-supervised objective. Move the BERT-LMAO objective out from spacy/cli/pretrain.py into spacy/_ml.py, so we can create a new pipeline component, ClozeMultitask. This can be specified as a parser or NER multitask in the spacy train command. Example usage: python -m spacy train en ./tmp ~/data/en-core-web/train/nw.json ~/data/en-core-web/dev/nw.json --pipeline parser --raw-textt ~/data/unlabelled/reddit-100k.jsonl --vectors en_vectors_web_lg --parser-multitasks cloze Implement rehearsal methods for pipeline components The new --raw-text argument and nlp.rehearse() method also gives us a good place to implement the the idea in the pseudo-rehearsal blog post in the parser. This works as follows: Add a new nlp.resume_training() method. This allocates copies of pre-trained models in the pipeline, setting things up for the rehearsal updates. It also returns an optimizer object. This also greatly reduces confusion around the nlp.begin_training() method, which randomises the weights, making it not suitable for adding new labels or otherwise fine-tuning a pre-trained model. Implement rehearsal updates on the Parser class, making it available for the dependency parser and NER. During rehearsal, the initial model is used to supervise the model being trained. The current model is asked to match the predictions of the initial model on some data. This minimises catastrophic forgetting, by keeping the model's predictions close to the original. See the blog post for details. Implement rehearsal updates for tagger Implement rehearsal updates for text categoriz	2018-12-10 16:25:33 +01:00
Matthew Honnibal	375f0dc529	💫 Make TextCategorizer default to a simpler, GPU-friendly model (#3038 ) Currently the TextCategorizer defaults to a fairly complicated model, designed partly around the active learning requirements of Prodigy. The model's a bit slow, and not very GPU-friendly. This patch implements a straightforward CNN model that still performs pretty well. The replacement model also makes it easy to use the LMAO pretraining, since most of the parameters are in the CNN. The replacement model has a flag to specify whether labels are mutually exclusive, which defaults to True. This has been a common problem with the text classifier. We'll also now be able to support adding labels to pretrained models again. Resolves #2934, #2756, #1798, #1748.	2018-12-10 14:37:39 +01:00
Matthew Honnibal	d2ac618af1	Set cbb_maxout_pieces=3	2018-12-08 23:27:29 +01:00
Matthew Honnibal	cabaadd793	Fix build error from bad import Thinc v7.0.0.dev6 moved FeatureExtracter around and didn't add a compatibility import.	2018-12-06 15:12:39 +01:00
Ines Montani	323fc26880	Tidy up and format remaining files	2018-11-30 17:43:08 +01:00
Ines Montani	eddeb36c96	💫 Tidy up and auto-format .py files (#2983 ) <!--- Provide a general summary of your changes in the title. --> ## Description - [x] Use [`black`](https://github.com/ambv/black) to auto-format all `.py` files. - [x] Update flake8 config to exclude very large files (lemmatization tables etc.) - [x] Update code to be compatible with flake8 rules - [x] Fix various small bugs, inconsistencies and messy stuff in the language data - [x] Update docs to explain new code style (`black`, `flake8`, when to use `# fmt: off` and `# fmt: on` and what `# noqa` means) Once #2932 is merged, which auto-formats and tidies up the CLI, we'll be able to run `flake8 spacy` actually get meaningful results. At the moment, the code style and linting isn't applied automatically, but I'm hoping that the new [GitHub Actions](https://github.com/features/actions) will let us auto-format pull requests and post comments with relevant linting information. ### Types of change enhancement, code style ## Checklist <!--- Before you submit the PR, go over this checklist and make sure you can tick off all the boxes. [] -> [x] --> - [x] I have submitted the spaCy Contributor Agreement. - [x] I ran the tests, and all new and existing tests passed. - [x] My changes don't require a change to the documentation, or if they do, I've added all required information.	2018-11-30 17:03:03 +01:00
Matthew Honnibal	ef0820827a	Update hyper-parameters after NER random search (#2972 ) These experiments were completed a few weeks ago, but I didn't make the PR, pending model release. Token vector width: 128->96 Hidden width: 128->64 Embed size: 5000->2000 Dropout: 0.2->0.1 Updated optimizer defaults (unclear how important?) This should improve speed, model size and load time, while keeping similar or slightly better accuracy. The tl;dr is we prefer to prevent over-fitting by reducing model size, rather than using more dropout.	2018-11-27 18:49:52 +01:00
Matthew Honnibal	2527ba68e5	Fix tensorizer	2018-11-02 23:29:54 +00:00
Matthew Honnibal	53eb96db09	Fix definition of morphology model	2018-09-25 22:12:32 +02:00
Matthew Honnibal	e6dde97295	Add function to make morphologizer model	2018-09-25 10:57:59 +02:00
Matthew Honnibal	b10d0cce05	Add MultiSoftmax class Add a new class for the Tagger model, MultiSoftmax. This allows softmax prediction of multiple classes on the same output layer, e.g. one variable with 3 classes, another with 4 classes. This makes a layer with 7 output neurons, which we softmax into two distributions.	2018-09-24 17:35:28 +02:00
Matthew Honnibal	99a6011580	Avoid adding empty layer in model, to keep models backwards compatible	2018-09-14 22:51:58 +02:00
Matthew Honnibal	afeddfff26	Fix PyTorch BiLSTM	2018-09-13 22:54:34 +00:00

1 2 3 4 5

222 Commits