svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							7e348d7f7f 
							
						 
					 
					
						
						
							
							baseline evaluation using highest-freq candidate  
						
						
						
					 
					
						2019-05-06 15:13:50 +02:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							6961215578 
							
						 
					 
					
						
						
							
							refactor code to separate functionality into different files  
						
						
						
					 
					
						2019-05-06 10:56:56 +02:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							f5190267e7 
							
						 
					 
					
						
						
							
							run only 100M of WP data as training dataset (9%)  
						
						
						
					 
					
						2019-05-03 18:09:09 +02:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							4e929600e5 
							
						 
					 
					
						
						
							
							fix WP id parsing, speed up processing and remove ambiguous strings in one doc (for now)  
						
						
						
					 
					
						2019-05-03 17:37:47 +02:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							34600c92bd 
							
						 
					 
					
						
						
							
							try catch per article to ensure the pipeline goes on  
						
						
						
					 
					
						2019-05-03 15:10:09 +02:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							bbcb9da466 
							
						 
					 
					
						
						
							
							creating training data with clean WP texts and QID entities true/false  
						
						
						
					 
					
						2019-05-03 10:44:29 +02:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							cba9680d13 
							
						 
					 
					
						
						
							
							run NER on clean WP text and link to gold-standard entity IDs  
						
						
						
					 
					
						2019-05-02 17:24:52 +02:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							581dc9742d 
							
						 
					 
					
						
						
							
							parsing clean text from WP articles to use as input data for NER and NEL  
						
						
						
					 
					
						2019-05-02 17:09:56 +02:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							8353552191 
							
						 
					 
					
						
						
							
							cleanup  
						
						
						
					 
					
						2019-05-01 23:26:16 +02:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							1ae41daaa9 
							
						 
					 
					
						
						
							
							allow small rounding errors  
						
						
						
					 
					
						2019-05-01 23:05:40 +02:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							3629a52ede 
							
						 
					 
					
						
						
							
							reading all persons in wikidata  
						
						
						
					 
					
						2019-05-01 01:00:59 +02:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							60b54ae8ce 
							
						 
					 
					
						
						
							
							bulk entity writing and experiment with regex wikidata reader to speed up processing  
						
						
						
					 
					
						2019-05-01 00:00:38 +02:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							653b7d9c87 
							
						 
					 
					
						
						
							
							calculate entity raw counts offline to speed up KB construction  
						
						
						
					 
					
						2019-04-30 11:39:42 +02:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							19e8f339cb 
							
						 
					 
					
						
						
							
							deduce entity freq from WP corpus and serialize vocab in WP test  
						
						
						
					 
					
						2019-04-29 17:37:29 +02:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							54d0cea062 
							
						 
					 
					
						
						
							
							unit test for KB serialization  
						
						
						
					 
					
						2019-04-24 23:52:34 +02:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							3e0cb69065 
							
						 
					 
					
						
						
							
							KB aliases to and from file  
						
						
						
					 
					
						2019-04-24 20:24:24 +02:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							ad6c5e581c 
							
						 
					 
					
						
						
							
							writing and reading number of entries to/from header  
						
						
						
					 
					
						2019-04-24 15:31:44 +02:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							6e3223f234 
							
						 
					 
					
						
						
							
							bulk loading in proper order of entity indices  
						
						
						
					 
					
						2019-04-24 11:26:38 +02:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							694fea597a 
							
						 
					 
					
						
						
							
							dumping all entryC entries + (inefficient) reading back in  
						
						
						
					 
					
						2019-04-23 18:36:50 +02:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							8e70a564f1 
							
						 
					 
					
						
						
							
							custom reader and writer for _EntryC fields (first stab at it - not complete)  
						
						
						
					 
					
						2019-04-23 16:33:40 +02:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							004e5e7d1c 
							
						 
					 
					
						
						
							
							little fixes  
						
						
						
					 
					
						2019-04-19 14:24:02 +02:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							9a8197185b 
							
						 
					 
					
						
						
							
							fix alias capitalization  
						
						
						
					 
					
						2019-04-18 22:37:50 +02:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							9f308eb5dc 
							
						 
					 
					
						
						
							
							fixes for prior prob and linking wikidata IDs with wikipedia titles  
						
						
						
					 
					
						2019-04-18 16:14:25 +02:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							10ee8dfea2 
							
						 
					 
					
						
						
							
							poc with few entities and collecting aliases from the WP links  
						
						
						
					 
					
						2019-04-18 14:12:17 +02:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							6763e025e1 
							
						 
					 
					
						
						
							
							parse wp dump for links to determine prior probabilities  
						
						
						
					 
					
						2019-04-15 11:41:57 +02:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							3163331b1e 
							
						 
					 
					
						
						
							
							wikipedia dump parser and mediawiki format regex cleanup  
						
						
						
					 
					
						2019-04-14 21:52:01 +02:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							b31a390a9a 
							
						 
					 
					
						
						
							
							reading types, claims and sitelinks  
						
						
						
					 
					
						2019-04-11 21:42:44 +02:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							6e997be4b4 
							
						 
					 
					
						
						
							
							reading wikidata descriptions and aliases  
						
						
						
					 
					
						2019-04-11 21:08:22 +02:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							9a7d534b1b 
							
						 
					 
					
						
						
							
							enable nogil for cython functions in kb.pxd  
						
						
						
					 
					
						2019-04-10 17:25:10 +02:00 
						 
				 
			
				
					
						
							
							
								Ines Montani 
							
						 
					 
					
						
						
						
						
							
						
						
							24cecdb44f 
							
						 
					 
					
						
						
							
							Update compatibility [ci skip]  
						
						
						
					 
					
						2019-04-01 16:25:16 +02:00 
						 
				 
			
				
					
						
							
							
								Sofie 
							
						 
					 
					
						
						
							
							
						
						
						
							
						
						
							a4a6bfa4e1 
							
						 
					 
					
						
						
							
							Merge branch 'master' into feature/el-framework  
						
						
						
					 
					
						2019-03-26 11:00:02 +01:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							8814b9010d 
							
						 
					 
					
						
						
							
							entity as one field instead of both ID and name  
						
						
						
					 
					
						2019-03-25 18:10:41 +01:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							6c783f8045 
							
						 
					 
					
						
						
							
							Bug fixes and options for TextCategorizer (#3472)
						
						
						
						
* Fix code for bag-of-words feature extraction

  The _ml.py module had a redundant copy of a function to extract unigram
  bag-of-words features, except one had a bug that set values to 0.
  Another function allowed extraction of bigram features. Replace all three
  with a new function that supports arbitrary ngram sizes and also allows
  control of which attribute is used (e.g. ORTH, LOWER, etc).

* Support 'bow' architecture for TextCategorizer

  This allows efficient ngram bag-of-words models, which are better when
  the classifier needs to run quickly, especially when the texts are long.
  Pass architecture="bow" to use it. The extra arguments ngram_size and
  attr are also available, e.g. ngram_size=2 means unigram and bigram
  features will be extracted.

* Fix size limits in train_textcat example

* Explain architectures better in docs
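
To illustrate what "ngram_size=2 means unigram and bigram features" amounts to, here is a minimal pure-Python sketch of n-gram bag-of-words counting. It is not the code from `_ml.py`; the function name `ngram_bow` and its `lower` flag (a rough stand-in for choosing LOWER over ORTH) are hypothetical and used only for illustration.

```python
from collections import Counter


def ngram_bow(tokens, ngram_size=1, lower=False):
    """Count all n-grams of length 1..ngram_size in a list of token strings.

    Hypothetical helper for illustration only, not spaCy's implementation.
    """
    if ngram_size < 1:
        raise ValueError("ngram_size must be at least 1")
    if lower:
        # Rough analogue of using the LOWER attribute instead of ORTH.
        tokens = [t.lower() for t in tokens]
    counts = Counter()
    for n in range(1, ngram_size + 1):
        # Slide a window of width n over the token sequence.
        for start in range(len(tokens) - n + 1):
            counts[tuple(tokens[start:start + n])] += 1
    return counts


features = ngram_bow(["The", "cat", "saw", "the", "cat"], ngram_size=2, lower=True)
print(features[("the",)])        # unigram count
print(features[("the", "cat")])  # bigram count
```

With `ngram_size=2` the counter holds both unigrams and bigrams, matching the behaviour the commit message describes; a real model would typically hash these keys into a fixed-size feature table instead of keeping tuples.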
						
					 
					
						2019-03-23 16:44:44 +01:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							9de9900510 
							
						 
					 
					
						
						
							
							adding future import unicode literals to .py files  
						
						
						
					 
					
						2019-03-22 16:18:04 +01:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
							
							
						
						
						
							
						
						
							4c5f265884 
							
						 
					 
					
						
						
							
							Fix train loop for train_textcat example  
						
						
						
					 
					
						2019-03-22 16:10:11 +01:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							5318ce88fa 
							
						 
					 
					
						
						
							
							'entity_linker' instead of 'el'  
						
						
						
					 
					
						2019-03-22 13:55:10 +01:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							a48241e9a2 
							
						 
					 
					
						
						
							
							use nlp's vocab for stringstore  
						
						
						
					 
					
						2019-03-22 11:36:45 +01:00 
						 
				 
			
				
					
						
							
							
								svlandeg 
							
						 
					 
					
						
						
						
						
							
						
						
							1ee0e78fd7 
							
						 
					 
					
						
						
							
							select candidate with highest prior probability
						
						
						
					 
					
						2019-03-22 11:36:45 +01:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							4e3ed2ea88 
							
						 
					 
					
						
						
							
							Add -t2v argument to train_textcat script  
						
						
						
					 
					
						2019-03-20 23:05:42 +01:00 
						 
				 
			
				
					
						
							
							
								Ines Montani 
							
						 
					 
					
						
						
						
						
							
						
						
							399987c216 
							
						 
					 
					
						
						
							
							Test and update examples [ci skip]  
						
						
						
					 
					
						2019-03-16 14:15:49 +01:00 
						 
				 
			
				
					
						
							
							
								Ines Montani 
							
						 
					 
					
						
						
						
						
							
						
						
							cb5dbfa63a 
							
						 
					 
					
						
						
							
							Tidy up references to n_threads and fix default  
						
						
						
					 
					
						2019-03-15 16:24:26 +01:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							4dc57d9e15 
							
						 
					 
					
						
						
							
							Update train_new_entity_type example  
						
						
						
					 
					
						2019-02-24 16:41:03 +01:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							7ac0f9626c 
							
						 
					 
					
						
						
							
							Update rehearsal example  
						
						
						
					 
					
						2019-02-24 16:17:41 +01:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							981cb89194 
							
						 
					 
					
						
						
							
							Fix f-score calculation if zero  
						
						
						
					 
					
						2019-02-23 12:45:41 +01:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							5063d999e5 
							
						 
					 
					
						
						
							
							Set architecture in textcat example  
						
						
						
					 
					
						2019-02-23 11:57:59 +01:00 
						 
				 
			
				
					
						
							
							
								Matthew Honnibal 
							
						 
					 
					
						
						
						
						
							
						
						
							582be8746c 
							
						 
					 
					
						
						
							
							Update multi_processing example  
						
						
						
					 
					
						2019-02-21 10:33:16 +01:00 
						 
				 
			
				
					
						
							
							
								Ines Montani 
							
						 
					 
					
						
						
						
						
							
						
						
							9696cf16c1 
							
						 
					 
					
						
						
							
							Merge branch 'master' into develop  
						
						
						
					 
					
						2019-02-20 21:31:27 +01:00 
						 
				 
			
				
					
						
							
							
								Michael Liberman 
							
						 
					 
					
						
						
						
						
							
						
						
							386cec1979 
							
						 
					 
					
						
						
							
							- Json fix in comment (#3294)
						
						
						
					 
					
						2019-02-19 18:01:35 +01:00 
						 
				 
			
				
					
						
							
							
								Ines Montani 
							
						 
					 
					
						
						
						
						
							
						
						
							5d0b60999d 
							
						 
					 
					
						
						
							
							Merge branch 'master' into develop  
						
						
						
					 
					
						2019-02-07 20:54:07 +01:00 
						 
				 
			
				
					
						
							
							
								Laura Baakman 
							
						 
					 
					
						
						
						
						
							
						
						
							04aa041c9e 
							
						 
					 
					
						
						
							
							Update Example input JSON file to adhere to specification. (#3243)
						
						
						
						
						* Example file does not adhere to json input spec.
According to the [json input spec](https://spacy.io/api/annotation#json-input) the `id` needs to be an `int`, not a string. Using a string as `id` results in a `TypeError` when calling `spacy.gold.read_json_file()`.
* Add spaCy Contributor Agreement. 
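
The `id` type problem described above can be caught before the file is handed to the reader. The sketch below is a hypothetical pre-check, not part of spaCy; it assumes the input is a JSON list of document objects, each carrying a top-level `id` key, and the name `find_non_int_ids` is made up for illustration.

```python
import json


def find_non_int_ids(docs):
    """Return the indices of documents whose `id` is missing or not an int.

    Hypothetical helper; assumes `docs` is a list of dicts with an "id" key.
    """
    bad = []
    for index, doc in enumerate(docs):
        doc_id = doc.get("id")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(doc_id, int) or isinstance(doc_id, bool):
            bad.append(index)
    return bad


sample = json.loads('[{"id": 0, "paragraphs": []}, {"id": "1", "paragraphs": []}]')
print(find_non_int_ids(sample))  # the second document uses a string id
```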
						
					 
					
						2019-02-07 16:18:01 +01:00