spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-02-13 02:30:41 +03:00

Author	SHA1	Message	Date
svlandeg	60b54ae8ce	bulk entity writing and experiment with regex wikidata reader to speed up processing	2019-05-01 00:00:38 +02:00
svlandeg	653b7d9c87	calculate entity raw counts offline to speed up KB construction	2019-04-30 11:39:42 +02:00
svlandeg	19e8f339cb	deduce entity freq from WP corpus and serialize vocab in WP test	2019-04-29 17:37:29 +02:00
svlandeg	54d0cea062	unit test for KB serialization	2019-04-24 23:52:34 +02:00
svlandeg	3e0cb69065	KB aliases to and from file	2019-04-24 20:24:24 +02:00
svlandeg	ad6c5e581c	writing and reading number of entries to/from header	2019-04-24 15:31:44 +02:00
svlandeg	6e3223f234	bulk loading in proper order of entity indices	2019-04-24 11:26:38 +02:00
svlandeg	694fea597a	dumping all entryC entries + (inefficient) reading back in	2019-04-23 18:36:50 +02:00
svlandeg	8e70a564f1	custom reader and writer for _EntryC fields (first stab at it - not complete)	2019-04-23 16:33:40 +02:00
svlandeg	004e5e7d1c	little fixes	2019-04-19 14:24:02 +02:00
svlandeg	9a8197185b	fix alias capitalization	2019-04-18 22:37:50 +02:00
svlandeg	9f308eb5dc	fixes for prior prob and linking wikidata IDs with wikipedia titles	2019-04-18 16:14:25 +02:00
svlandeg	10ee8dfea2	poc with few entities and collecting aliases from the WP links	2019-04-18 14:12:17 +02:00
svlandeg	6763e025e1	parse wp dump for links to determine prior probabilities	2019-04-15 11:41:57 +02:00
svlandeg	3163331b1e	wikipedia dump parser and mediawiki format regex cleanup	2019-04-14 21:52:01 +02:00
svlandeg	b31a390a9a	reading types, claims and sitelinks	2019-04-11 21:42:44 +02:00
svlandeg	6e997be4b4	reading wikidata descriptions and aliases	2019-04-11 21:08:22 +02:00
svlandeg	9a7d534b1b	enable nogil for cython functions in kb.pxd	2019-04-10 17:25:10 +02:00
Ines Montani	24cecdb44f	Update compatibility [ci skip]	2019-04-01 16:25:16 +02:00
Sofie	a4a6bfa4e1	Merge branch 'master' into feature/el-framework	2019-03-26 11:00:02 +01:00
svlandeg	8814b9010d	entity as one field instead of both ID and name	2019-03-25 18:10:41 +01:00
Matthew Honnibal	6c783f8045	Bug fixes and options for TextCategorizer (#3472) * Fix code for bag-of-words feature extraction The _ml.py module had a redundant copy of a function to extract unigram bag-of-words features, except one had a bug that set values to 0. Another function allowed extraction of bigram features. Replace all three with a new function that supports arbitrary ngram sizes and also allows control of which attribute is used (e.g. ORTH, LOWER, etc). * Support 'bow' architecture for TextCategorizer This allows efficient ngram bag-of-words models, which are better when the classifier needs to run quickly, especially when the texts are long. Pass architecture="bow" to use it. The extra arguments ngram_size and attr are also available, e.g. ngram_size=2 means unigram and bigram features will be extracted. * Fix size limits in train_textcat example * Explain architectures better in docs	2019-03-23 16:44:44 +01:00
svlandeg	9de9900510	adding future import unicode literals to .py files	2019-03-22 16:18:04 +01:00
Matthew Honnibal	4c5f265884	Fix train loop for train_textcat example	2019-03-22 16:10:11 +01:00
svlandeg	5318ce88fa	'entity_linker' instead of 'el'	2019-03-22 13:55:10 +01:00
svlandeg	a48241e9a2	use nlp's vocab for stringstore	2019-03-22 11:36:45 +01:00
svlandeg	1ee0e78fd7	select candidate with highest prior probability	2019-03-22 11:36:45 +01:00
Matthew Honnibal	4e3ed2ea88	Add -t2v argument to train_textcat script	2019-03-20 23:05:42 +01:00
Ines Montani	399987c216	Test and update examples [ci skip]	2019-03-16 14:15:49 +01:00
Ines Montani	cb5dbfa63a	Tidy up references to n_threads and fix default	2019-03-15 16:24:26 +01:00
Matthew Honnibal	4dc57d9e15	Update train_new_entity_type example	2019-02-24 16:41:03 +01:00
Matthew Honnibal	7ac0f9626c	Update rehearsal example	2019-02-24 16:17:41 +01:00
Matthew Honnibal	981cb89194	Fix f-score calculation if zero	2019-02-23 12:45:41 +01:00
Matthew Honnibal	5063d999e5	Set architecture in textcat example	2019-02-23 11:57:59 +01:00
Matthew Honnibal	582be8746c	Update multi_processing example	2019-02-21 10:33:16 +01:00
Ines Montani	9696cf16c1	Merge branch 'master' into develop	2019-02-20 21:31:27 +01:00
Michael Liberman	386cec1979	- JSON fix in comment (#3294)	2019-02-19 18:01:35 +01:00
Ines Montani	5d0b60999d	Merge branch 'master' into develop	2019-02-07 20:54:07 +01:00
Laura Baakman	04aa041c9e	Update Example input JSON file to adhere to specification. (#3243) * Example file does not adhere to json input spec. According to the [json input spec](https://spacy.io/api/annotation#json-input) the `id` needs to be an `int`, not a string. Using a string as `id` results in a `TypeError` when calling `spacy.gold.read_json_file()`. * Add spaCy Contributor Agreement.	2019-02-07 16:18:01 +01:00
mak	8fc6aaf134	Updated main to make use of lang variable (#3220) Updated main to make use of language variable when initializing spacy.	2019-01-31 23:43:22 +01:00
Hunter Kelly	f28a1c7271	Update call to `mkdir()` to create the parents (#3139) * Update call to `mkdir()` to create the parents - Update the call to `output_dir.mkdir()` to also create the parents if needed * don't automatically create parents but fail fast if cannot create directory * add signed contributors agreement for retnuh	2019-01-11 03:02:18 +01:00
Ines Montani	61d09c481b	Merge branch 'master' into develop	2018-12-18 13:48:10 +01:00
Ines Montani	e3405f8af3	Don't call begin_training if updating new model (see #3059) [ci skip]	2018-12-17 13:45:49 +01:00
Ines Montani	c9a89bba50	Don't call begin_training if updating new model (see #3059) [ci skip]	2018-12-17 13:45:28 +01:00
Ines Montani	6f1438b5d9	Auto-format example	2018-12-17 13:44:38 +01:00
Matthew Honnibal	83ac227bd3	💫 Better support for semi-supervised learning (#3035) The new spacy pretrain command implemented BERT/ULMFit/etc-like transfer learning, using our Language Modelling with Approximate Outputs version of BERT's cloze task. Pretraining is convenient, but in some ways it's a bit of a strange solution. All we're doing is initialising the weights. At the same time, we're putting a lot of work into our optimisation so that it's less sensitive to initial conditions, and more likely to find good optima. I discuss this a bit in the pseudo-rehearsal blog post: https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting Support semi-supervised learning in spacy train One obvious way to improve these pretraining methods is to do multi-task learning, instead of just transfer learning. This has been shown to work very well: https://arxiv.org/pdf/1809.08370.pdf . This patch makes it easy to do this sort of thing. Add a new argument to spacy train, --raw-text. This takes a jsonl file with unlabelled data that can be used in arbitrary ways to do semi-supervised learning. Add a new method to the Language class and to pipeline components, .rehearse(). This is like .update(), but doesn't expect GoldParse objects. It takes a batch of Doc objects, and performs an update on some semi-supervised objective. Move the BERT-LMAO objective out from spacy/cli/pretrain.py into spacy/_ml.py, so we can create a new pipeline component, ClozeMultitask. This can be specified as a parser or NER multitask in the spacy train command. Example usage: python -m spacy train en ./tmp ~/data/en-core-web/train/nw.json ~/data/en-core-web/dev/nw.json --pipeline parser --raw-text ~/data/unlabelled/reddit-100k.jsonl --vectors en_vectors_web_lg --parser-multitasks cloze Implement rehearsal methods for pipeline components The new --raw-text argument and nlp.rehearse() method also give us a good place to implement the idea in the pseudo-rehearsal blog post in the parser. This works as follows: Add a new nlp.resume_training() method. This allocates copies of pre-trained models in the pipeline, setting things up for the rehearsal updates. It also returns an optimizer object. This also greatly reduces confusion around the nlp.begin_training() method, which randomises the weights, making it not suitable for adding new labels or otherwise fine-tuning a pre-trained model. Implement rehearsal updates on the Parser class, making it available for the dependency parser and NER. During rehearsal, the initial model is used to supervise the model being trained. The current model is asked to match the predictions of the initial model on some data. This minimises catastrophic forgetting, by keeping the model's predictions close to the original. See the blog post for details. Implement rehearsal updates for tagger Implement rehearsal updates for text categoriz	2018-12-10 16:25:33 +01:00
Matthew Honnibal	375f0dc529	💫 Make TextCategorizer default to a simpler, GPU-friendly model (#3038) Currently the TextCategorizer defaults to a fairly complicated model, designed partly around the active learning requirements of Prodigy. The model's a bit slow, and not very GPU-friendly. This patch implements a straightforward CNN model that still performs pretty well. The replacement model also makes it easy to use the LMAO pretraining, since most of the parameters are in the CNN. The replacement model has a flag to specify whether labels are mutually exclusive, which defaults to True. This has been a common problem with the text classifier. We'll also now be able to support adding labels to pretrained models again. Resolves #2934, #2756, #1798, #1748.	2018-12-10 14:37:39 +01:00
Matthew Honnibal	e5685d98a2	Fix averaging in textcat example (closes #2745) (#3032) [ci skip]	2018-12-08 13:27:05 +01:00
Ines Montani	5b2741f751	Remove unused cytoolz / itertools imports	2018-12-03 02:12:07 +01:00
Gavriel Loria	ae5601beae	Initialize trues to 0.0 in training example (#3004) * added contributor agreement * if there are no true positives, precision should be 0.0	2018-12-03 01:33:22 +01:00


291 Commits
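
Two of the commits listed above (981cb89194, "Fix f-score calculation if zero", and ae5601beae, "if there are no true positives, precision should be 0.0") concern evaluation scores when counts are zero. The following is a minimal sketch of that kind of guard, not the code from those commits; the function name `precision_recall_fscore` and its count-based signature are hypothetical and chosen only for illustration.

```python
def precision_recall_fscore(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts.

    Hypothetical helper for illustration: every ratio falls back to 0.0
    when its denominator is zero, so no ZeroDivisionError is raised.
    """
    # Precision: share of predicted positives that are correct.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    # Recall: share of actual positives that were found.
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    # F1 is the harmonic mean; guard the case where both terms are zero.
    if precision + recall == 0:
        f_score = 0.0
    else:
        f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

Calling `precision_recall_fscore(0, 0, 0)` exercises all three guards at once, which is the situation an evaluation loop hits when a model predicts no positives on a batch that also contains none.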