spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-09-14 16:12:39 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	aeb59f6791	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-20 16:15:01 +01:00
Matthew Honnibal	f57bea8ab6	💫 Prevent parser from predicting unseen classes (#3075) The output weights often return negative scores for classes, especially via the bias terms. This means that when we add a new class, we can't rely on just zeroing the weights, or we'll end up with positive predictions for those labels. To solve this, we use nan values as the initial weights for new labels. This prevents them from ever coming out on top. During backprop, we replace the nan values with the minimum assigned score, so that we're still able to learn these classes.	2018-12-20 16:12:22 +01:00
Matthew Honnibal	9ec9f89b99	💫 Raise better error when using uninitialized pipeline component (#3074) After creating a component, the `.model` attribute is left with the value `True`, to indicate it should be created later during `from_disk()`, `from_bytes()` or `begin_training()`. This has led to confusing errors if you try to use the component without initializing the model. To fix this, we add a method `require_model()` to the `Pipe` base class. The `require_model()` method needs to be called at the start of the `.predict()` and `.update()` methods of the components. It raises a `ValueError` if the model is not initialized. An error message has been added to `spacy.errors`.	2018-12-20 15:54:53 +01:00
Matthew Honnibal	1788bf1af7	Unbreak progress bar	2018-12-20 13:57:00 +01:00
Muhammad Irfan	2e84ec1513	Fixed ISO code for Urdu. (#3073)	2018-12-20 12:28:53 +01:00
Matthew Honnibal	c315e08e6e	Fix formatting of meta.json after spacy package	2018-12-19 14:36:08 +01:00
Matthew Honnibal	b7ce85a6f3	Fix packaging of json schemas	2018-12-19 13:54:02 +01:00
Matthew Honnibal	35ff889852	Fix OSX wheel building	2018-12-19 13:14:57 +01:00
Matthew Honnibal	e24f94ce39	Fix handling of preset entities. closes #2779	2018-12-19 02:13:31 +01:00
Matthew Honnibal	faa8656582	Port parser fix for large label sets from master	2018-12-19 02:11:26 +01:00
Matthew Honnibal	99a84e4d0e	Make ParserModel.resize_output idempotent	2018-12-19 02:10:36 +01:00
Matthew Honnibal	9fc8ce0c4d	Add schemas to MANIFEST	2018-12-19 01:18:50 +01:00
Matthew Honnibal	a2b75036e9	Try to make sure json schemas are packaged	2018-12-19 01:08:51 +01:00
Matthew Honnibal	0f83b98afa	Remove unused code from spacy pretrain	2018-12-18 19:19:26 +01:00
Ken	5f0c5fbfa4	issue #3012: add test (#3021) * issue #3012: add test * add contributor agreement * Make test work without models and fix typos ten.pos_ instead of ten.orth_ and comparison against "10" instead of integer 10	2018-12-18 15:02:49 +01:00
Ines Montani	77a47b2b20	Auto-format	2018-12-18 15:02:11 +01:00
Kirill Bulygin	2fb004832f	Fix the first `nlp` call for `ja` (closes #2901) (#3065) * Fix the first `nlp` call for `ja` (closes #2901) * Add unicode declaration, formatting and use relative import	2018-12-18 15:01:06 +01:00
Kirill Bulygin	10189d9092	Fix the first `nlp` call for `ja` (closes #2901) (#3065) * Fix the first `nlp` call for `ja` (closes #2901) * Add unicode declaration, formatting and use relative import	2018-12-18 14:53:50 +01:00
Ines Montani	ae880ef912	Tidy up merge conflict leftovers	2018-12-18 13:58:30 +01:00
Ines Montani	61d09c481b	Merge branch 'master' into develop	2018-12-18 13:48:10 +01:00
Brixjohn	52f3c95004	Added alpha support for Tagalog language (#3062) I have added alpha support for the Tagalog language from the Philippines. It is the basis for the country's national language Filipino. I have heavily based the format on the EN and ES languages. I have provided several words in the lemmatizer lookup table, added stop words from a source, translated numeric words to their Tagalog counterparts, added some tokenizer exceptions, and kept the tag map the same as the English language. While the alpha language passed the preliminary testing that you provided, I think it needs more data to be useful for most cases. * Added alpha support for Tagalog language * Edited contributor template * Included SCA; Reverted templates * Fixed SCA template * Fixed changes in SCA template	2018-12-18 13:08:38 +01:00
Matthew Honnibal	92f4b9c8ea	set max batch size to 1000	2018-12-17 23:15:39 +00:00
Matthew Honnibal	3c4a2edf4a	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-17 23:08:40 +00:00
Matthew Honnibal	95fc0176d1	Pass tagger options in begin_training	2018-12-17 23:08:31 +00:00
Matthew Honnibal	7c504b6ddb	Try to implement more losses for pretraining * Try to implement cosine loss This one seems to be correct? Still unsure, but it performs okay * Try to implement the von Mises-Fisher loss This one's definitely not right yet.	2018-12-17 14:48:27 +00:00
Ines Montani	e3405f8af3	Don't call begin_training if updating new model (see #3059) [ci skip]	2018-12-17 13:45:49 +01:00
Ines Montani	c9a89bba50	Don't call begin_training if updating new model (see #3059) [ci skip]	2018-12-17 13:45:28 +01:00
Ines Montani	6f1438b5d9	Auto-format example	2018-12-17 13:44:38 +01:00
Matthew Honnibal	ab4b61fb6e	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-16 20:11:43 +01:00
Matthew Honnibal	9ef30b0cde	Accept 'text' in matcher as an alternative to ORTH	2018-12-16 20:10:43 +01:00
Amandine Périnet	361554f629	Lemmatization of Adjectives - French : adding rules and vocabulary (#3045) * modifying FR lemmatisation for Adjectives * adding contributor agreement for amperinet * correcting some errors in vocabulary files	2018-12-16 18:11:07 +01:00
Sofie	c6ad557cea	French regular expressions instead of extensive exceptions list (on develop) (#3046) (resolves #2679) * merge changes of PR 3023 into develop branch instead of master * further deletions from exception list according to PR 3023	2018-12-16 18:04:55 +01:00
Ines Montani	7bbdffd36e	Remove pre-set lemma for "cause" (resolves #2165)	2018-12-14 12:51:18 +01:00
Shooter23	6ae8e49bff	Fix docstring for is_right_punct(). (#3044)	2018-12-14 10:11:11 +01:00
Matthew Honnibal	ab9494b2a3	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-12 21:08:50 +00:00
Matthew Honnibal	fb56028476	Remove b1 and b2 decay	2018-12-12 12:37:07 +01:00
Matthew Honnibal	df15279e88	Reduce batch size during pretrain	2018-12-10 15:30:23 +00:00
Matthew Honnibal	83ac227bd3	💫 Better support for semi-supervised learning (#3035) The new spacy pretrain command implemented BERT/ULMFit/etc-like transfer learning, using our Language Modelling with Approximate Outputs version of BERT's cloze task. Pretraining is convenient, but in some ways it's a bit of a strange solution. All we're doing is initialising the weights. At the same time, we're putting a lot of work into our optimisation so that it's less sensitive to initial conditions, and more likely to find good optima. I discuss this a bit in the pseudo-rehearsal blog post: https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting Support semi-supervised learning in spacy train One obvious way to improve these pretraining methods is to do multi-task learning, instead of just transfer learning. This has been shown to work very well: https://arxiv.org/pdf/1809.08370.pdf . This patch makes it easy to do this sort of thing. Add a new argument to spacy train, --raw-text. This takes a jsonl file with unlabelled data that can be used in arbitrary ways to do semi-supervised learning. Add a new method to the Language class and to pipeline components, .rehearse(). This is like .update(), but doesn't expect GoldParse objects. It takes a batch of Doc objects, and performs an update on some semi-supervised objective. Move the BERT-LMAO objective out from spacy/cli/pretrain.py into spacy/_ml.py, so we can create a new pipeline component, ClozeMultitask. This can be specified as a parser or NER multitask in the spacy train command. Example usage: python -m spacy train en ./tmp ~/data/en-core-web/train/nw.json ~/data/en-core-web/dev/nw.json --pipeline parser --raw-text ~/data/unlabelled/reddit-100k.jsonl --vectors en_vectors_web_lg --parser-multitasks cloze Implement rehearsal methods for pipeline components The new --raw-text argument and nlp.rehearse() method also give us a good place to implement the idea in the pseudo-rehearsal blog post in the parser. This works as follows: Add a new nlp.resume_training() method. This allocates copies of pre-trained models in the pipeline, setting things up for the rehearsal updates. It also returns an optimizer object. This also greatly reduces confusion around the nlp.begin_training() method, which randomises the weights, making it not suitable for adding new labels or otherwise fine-tuning a pre-trained model. Implement rehearsal updates on the Parser class, making it available for the dependency parser and NER. During rehearsal, the initial model is used to supervise the model being trained. The current model is asked to match the predictions of the initial model on some data. This minimises catastrophic forgetting, by keeping the model's predictions close to the original. See the blog post for details. Implement rehearsal updates for tagger Implement rehearsal updates for text categoriz	2018-12-10 16:25:33 +01:00
Matthew Honnibal	449b889454	Fix KeyError in Vectors.most_similar. Fixes #2648	2018-12-10 16:19:18 +01:00
Matthew Honnibal	90aec6d2f6	Fix vectors for reserved words. Closes #2871	2018-12-10 16:09:49 +01:00
Matthew Honnibal	16fd8dce1d	Add get_string_id helper to spacy.strings	2018-12-10 16:09:26 +01:00
Matthew Honnibal	cc1ea03004	Add test for issue #2871 -- vectors for reserved words	2018-12-10 16:09:10 +01:00
Matthew Honnibal	375f0dc529	💫 Make TextCategorizer default to a simpler, GPU-friendly model (#3038) Currently the TextCategorizer defaults to a fairly complicated model, designed partly around the active learning requirements of Prodigy. The model's a bit slow, and not very GPU-friendly. This patch implements a straightforward CNN model that still performs pretty well. The replacement model also makes it easy to use the LMAO pretraining, since most of the parameters are in the CNN. The replacement model has a flag to specify whether labels are mutually exclusive, which defaults to True. This has been a common problem with the text classifier. We'll also now be able to support adding labels to pretrained models again. Resolves #2934, #2756, #1798, #1748.	2018-12-10 14:37:39 +01:00
Matthew Honnibal	b1c8731b4d	Make spacy train respect LOG_FRIENDLY	2018-12-10 09:46:53 +01:00
Matthew Honnibal	6936ca1664	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-10 09:44:07 +01:00
Matthew Honnibal	4405b5c875	Fix resizing edge-case for NER	2018-12-10 06:25:17 +00:00
Matthew Honnibal	0994dc50d8	Merge branch 'develop' of https://github.com/explosion/spaCy into develop	2018-12-10 05:35:01 +00:00
Matthew Honnibal	24f2e9bc07	Tweak training params	2018-12-09 17:08:58 +00:00
Matthew Honnibal	16c5861d29	Fix NER space constraints Allow entities to end on spaces, to avoid stumping the oracle when we're inside an entity, and there's a space just before a correct entity.	2018-12-09 08:06:45 +01:00
Matthew Honnibal	1b1a1af193	Fix printing in spacy train	2018-12-09 06:03:49 +01:00
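
The message for commit f57bea8ab6 in the table above describes giving new labels nan initial weights so they can never win, then substituting the minimum assigned score during backprop. A minimal pure-Python sketch of that idea follows. It is not spaCy's implementation: the helper names (`resize_output`, `score`, `best_class`, `fill_unseen`) and the list-based weight layout are assumptions made for illustration.

```python
import math


def resize_output(weights, bias, n_new):
    """Append n_new output rows whose parameters are NaN (hypothetical helper)."""
    width = len(weights[0])
    new_weights = weights + [[math.nan] * width for _ in range(n_new)]
    new_bias = bias + [math.nan] * n_new
    return new_weights, new_bias


def score(weights, bias, x):
    """Linear output layer: one score per class."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


def best_class(scores):
    """Pick the highest-scoring class, skipping NaN (unseen) classes."""
    seen = [i for i, s in enumerate(scores) if not math.isnan(s)]
    return max(seen, key=lambda i: scores[i])


def fill_unseen(scores):
    """Replace NaN scores with the minimum assigned score, as done before backprop."""
    floor = min(s for s in scores if not math.isnan(s))
    return [floor if math.isnan(s) else s for s in scores]


# Both existing classes score below zero here, so a zero-initialised
# third class would have scored 0.0 and been predicted.
weights, bias = resize_output([[0.5, -1.0], [-0.2, -0.3]], [-0.4, -0.1], 1)
scores = score(weights, bias, [1.0, 1.0])
print(best_class(scores))
print(fill_unseen(scores))
```

In this sketch the new class stays out of the prediction while its score is nan, and `fill_unseen` gives it a finite value that a gradient could be computed against.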

9669 Commits
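
The `require_model()` guard described in commit 9ec9f89b99 can be illustrated with a small stand-in class. This is a sketch under stated assumptions, not spaCy's `Pipe`: the class body, the toy model assigned in `begin_training()`, and the error message text are all hypothetical; only the method names and the `True` placeholder convention come from the commit message.

```python
class Pipe:
    """Stand-in for a pipeline component (not spaCy's actual base class)."""

    name = "pipe"

    def __init__(self):
        # Placeholder meaning "create the model later", per the commit message.
        self.model = True

    def require_model(self):
        """Raise ValueError if the model is still a placeholder."""
        if self.model is None or self.model is True or self.model is False:
            # Hypothetical message text; the real one lives in spacy.errors.
            raise ValueError(
                "Component '%s' has no initialized model. Call "
                "begin_training(), from_disk() or from_bytes() first." % self.name
            )

    def begin_training(self):
        # Toy model for illustration: maps each input to its length.
        self.model = lambda docs: [len(doc) for doc in docs]

    def predict(self, docs):
        self.require_model()  # guard at the start of predict(), as described
        return self.model(docs)


pipe = Pipe()
try:
    pipe.predict(["ab"])
except ValueError as err:
    print("refused:", err)

pipe.begin_training()
print(pipe.predict(["ab", "cde"]))
```

The identity checks against `True`, `False` and `None` keep the guard from misfiring on a real model object that merely compares equal to one of those values.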