spaCy

mirror of https://github.com/explosion/spaCy.git synced 2026-02-02 13:36:18 +03:00

Author	SHA1	Message	Date
svlandeg	bbcb9da466	creating training data with clean WP texts and QID entities true/false	2019-05-03 10:44:29 +02:00
svlandeg	cba9680d13	run NER on clean WP text and link to gold-standard entity IDs	2019-05-02 17:24:52 +02:00
svlandeg	581dc9742d	parsing clean text from WP articles to use as input data for NER and NEL	2019-05-02 17:09:56 +02:00
svlandeg	8353552191	cleanup	2019-05-01 23:26:16 +02:00
svlandeg	1ae41daaa9	allow small rounding errors	2019-05-01 23:05:40 +02:00
svlandeg	3629a52ede	reading all persons in wikidata	2019-05-01 01:00:59 +02:00
svlandeg	60b54ae8ce	bulk entity writing and experiment with regex wikidata reader to speed up processing	2019-05-01 00:00:38 +02:00
svlandeg	653b7d9c87	calculate entity raw counts offline to speed up KB construction	2019-04-30 11:39:42 +02:00
svlandeg	19e8f339cb	deduce entity freq from WP corpus and serialize vocab in WP test	2019-04-29 17:37:29 +02:00
svlandeg	54d0cea062	unit test for KB serialization	2019-04-24 23:52:34 +02:00
svlandeg	3e0cb69065	KB aliases to and from file	2019-04-24 20:24:24 +02:00
svlandeg	ad6c5e581c	writing and reading number of entries to/from header	2019-04-24 15:31:44 +02:00
svlandeg	6e3223f234	bulk loading in proper order of entity indices	2019-04-24 11:26:38 +02:00
svlandeg	694fea597a	dumping all entryC entries + (inefficient) reading back in	2019-04-23 18:36:50 +02:00
svlandeg	8e70a564f1	custom reader and writer for _EntryC fields (first stab at it - not complete)	2019-04-23 16:33:40 +02:00
svlandeg	004e5e7d1c	little fixes	2019-04-19 14:24:02 +02:00
svlandeg	9a8197185b	fix alias capitalization	2019-04-18 22:37:50 +02:00
svlandeg	9f308eb5dc	fixes for prior prob and linking wikidata IDs with wikipedia titles	2019-04-18 16:14:25 +02:00
svlandeg	10ee8dfea2	poc with few entities and collecting aliases from the WP links	2019-04-18 14:12:17 +02:00
svlandeg	6763e025e1	parse wp dump for links to determine prior probabilities	2019-04-15 11:41:57 +02:00
svlandeg	3163331b1e	wikipedia dump parser and mediawiki format regex cleanup	2019-04-14 21:52:01 +02:00
svlandeg	b31a390a9a	reading types, claims and sitelinks	2019-04-11 21:42:44 +02:00
svlandeg	6e997be4b4	reading wikidata descriptions and aliases	2019-04-11 21:08:22 +02:00
svlandeg	9a7d534b1b	enable nogil for cython functions in kb.pxd	2019-04-10 17:25:10 +02:00
Ines Montani	24cecdb44f	Update compatibility [ci skip]	2019-04-01 16:25:16 +02:00
Sofie	a4a6bfa4e1	Merge branch 'master' into feature/el-framework	2019-03-26 11:00:02 +01:00
svlandeg	8814b9010d	entity as one field instead of both ID and name	2019-03-25 18:10:41 +01:00
Matthew Honnibal	6c783f8045	Bug fixes and options for TextCategorizer (#3472) * Fix code for bag-of-words feature extraction The _ml.py module had a redundant copy of a function to extract unigram bag-of-words features, except one had a bug that set values to 0. Another function allowed extraction of bigram features. Replace all three with a new function that supports arbitrary ngram sizes and also allows control of which attribute is used (e.g. ORTH, LOWER, etc). * Support 'bow' architecture for TextCategorizer This allows efficient ngram bag-of-words models, which are better when the classifier needs to run quickly, especially when the texts are long. Pass architecture="bow" to use it. The extra arguments ngram_size and attr are also available, e.g. ngram_size=2 means unigram and bigram features will be extracted. * Fix size limits in train_textcat example * Explain architectures better in docs	2019-03-23 16:44:44 +01:00
svlandeg	9de9900510	adding future import unicode literals to .py files	2019-03-22 16:18:04 +01:00
Matthew Honnibal	4c5f265884	Fix train loop for train_textcat example	2019-03-22 16:10:11 +01:00
svlandeg	5318ce88fa	'entity_linker' instead of 'el'	2019-03-22 13:55:10 +01:00
svlandeg	a48241e9a2	use nlp's vocab for stringstore	2019-03-22 11:36:45 +01:00
svlandeg	1ee0e78fd7	select candidate with highest prior probability	2019-03-22 11:36:45 +01:00
Matthew Honnibal	4e3ed2ea88	Add -t2v argument to train_textcat script	2019-03-20 23:05:42 +01:00
Ines Montani	399987c216	Test and update examples [ci skip]	2019-03-16 14:15:49 +01:00
Ines Montani	cb5dbfa63a	Tidy up references to n_threads and fix default	2019-03-15 16:24:26 +01:00
Matthew Honnibal	4dc57d9e15	Update train_new_entity_type example	2019-02-24 16:41:03 +01:00
Matthew Honnibal	7ac0f9626c	Update rehearsal example	2019-02-24 16:17:41 +01:00
Matthew Honnibal	981cb89194	Fix f-score calculation if zero	2019-02-23 12:45:41 +01:00
Matthew Honnibal	5063d999e5	Set architecture in textcat example	2019-02-23 11:57:59 +01:00
Matthew Honnibal	582be8746c	Update multi_processing example	2019-02-21 10:33:16 +01:00
Ines Montani	9696cf16c1	Merge branch 'master' into develop	2019-02-20 21:31:27 +01:00
Michael Liberman	386cec1979	- Json fix in comment (#3294)	2019-02-19 18:01:35 +01:00
Ines Montani	5d0b60999d	Merge branch 'master' into develop	2019-02-07 20:54:07 +01:00
Laura Baakman	04aa041c9e	Update Example input JSON file to adhere to specification. (#3243) * Example file does not adhere to json input spec. According to the [json input spec](https://spacy.io/api/annotation#json-input) the `id` needs to be an `int` not a string. Using a string as `id` results in a `TypeError` when calling `spacy.gold.read_json_file()`. * Add spaCy Contributor Agreement.	2019-02-07 16:18:01 +01:00
mak	8fc6aaf134	Updated main to make use of lang variable (#3220) Updated main to make use of language variable when initializing spacy.	2019-01-31 23:43:22 +01:00
Hunter Kelly	f28a1c7271	Update call to `mkdir()` to create the parents (#3139) * Update call to `mkdir()` to create the parents - Update the call to `output_dir.mkdir()` to also create the parents if needed * don't automatically create parents but fail fast if cannot create directory * add signed contributors agreement for retnuh	2019-01-11 03:02:18 +01:00
Ines Montani	61d09c481b	Merge branch 'master' into develop	2018-12-18 13:48:10 +01:00
Ines Montani	e3405f8af3	Don't call begin_training if updating new model (see #3059) [ci skip]	2018-12-17 13:45:49 +01:00
Ines Montani	c9a89bba50	Don't call begin_training if updating new model (see #3059) [ci skip]	2018-12-17 13:45:28 +01:00

297 Commits