spaCy

mirror of https://github.com/explosion/spaCy.git synced 2024-11-11 12:18:04 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	6b34a2f34b	* Move morphological analysis into its own module, morphology.pyx	2014-12-09 21:16:17 +11:00
Matthew Honnibal	495e1c7366	* Use fused type in Tokens.push_back, simplifying the use of the cache	2014-12-09 16:50:01 +11:00
Matthew Honnibal	302e09018b	* Work on fixing special-cases, reading them in as JSON objects so that they can specify lemmas	2014-12-09 14:48:01 +11:00
Matthew Honnibal	99bbbb6feb	* Work on morphological processing	2014-12-08 21:12:15 +11:00
Matthew Honnibal	ef4398b204	* Rearrange POS stuff, so that language-specific stuff can live in language-specific modules	2014-12-07 23:52:41 +11:00
Matthew Honnibal	5caabec789	* Link in tagger, to work on integrating POS tagging	2014-12-07 15:29:41 +11:00
Matthew Honnibal	1c9253701d	* Introduce a TokenC struct, to handle token indices, pos tags and sense tags	2014-12-05 15:56:14 +11:00
Matthew Honnibal	187372c7f3	* Allow the lexicon to create lexemes using an external memory pool, so that it can decide to make some lexemes temporary, rather than cached	2014-12-05 03:29:50 +11:00
Matthew Honnibal	d70d31aa45	* Introduce first attempt at const-ness	2014-12-03 15:44:25 +11:00
Matthew Honnibal	b463a7eb86	* Make flag-setting a language-specific thing	2014-12-03 11:04:32 +11:00
Matthew Honnibal	8c2938fe01	* Rename Lexicon._dict to Lexicon._map	2014-12-02 23:46:59 +11:00
Matthew Honnibal	c788633429	* Add tokens_from_list method to Language	2014-11-11 23:43:14 +11:00
Matthew Honnibal	ff8989b63c	* Use greedy NER parser	2014-11-11 21:08:35 +11:00
Matthew Honnibal	4ecbe8c893	* Complete refactor of Tagger features, to use a generic list of context names.	2014-11-05 20:45:29 +11:00
Matthew Honnibal	3733444101	* Generalize tagger code, in preparation for NER and supersense tagging.	2014-11-05 03:42:14 +11:00
Matthew Honnibal	fcd9490d56	* Add pos_tag method to Language	2014-11-02 14:21:43 +11:00
Matthew Honnibal	a8ca078b24	* Restore lexemes field to lexicon	2014-10-31 17:43:25 +11:00
Matthew Honnibal	ea8f1e7053	* Tighten interfaces	2014-10-30 18:14:42 +11:00
Matthew Honnibal	ea85bf3a0a	* Tighten the interface to Language	2014-10-30 18:01:27 +11:00
Matthew Honnibal	87c2418a89	* Fiddle with data types on Lexeme, to compress them to a much smaller size.	2014-10-30 15:42:15 +11:00
Matthew Honnibal	e6b87766fe	* Remove lexemes vector from Lexicon, and the id and hash attributes from Lexeme	2014-10-30 15:21:38 +11:00
Matthew Honnibal	08ce602243	* Large refactor, particularly to Python API	2014-10-24 00:59:17 +11:00
Matthew Honnibal	e5e951ae67	* Remove the feature array stuff from Tokens class, and replace vector with array-based implementation, with padding.	2014-10-23 01:57:59 +11:00
Matthew Honnibal	43743a5d63	* Work on efficiency	2014-10-14 18:22:41 +11:00
Matthew Honnibal	6fb42c4919	* Add offsets to Tokens class. Some changes to interfaces, and reorganization of spacy.Lang	2014-10-14 16:17:45 +11:00
Matthew Honnibal	868e558037	* Preparations in place to handle hyphenation etc	2014-10-10 20:23:23 +11:00
Matthew Honnibal	02e948e7d5	* Remove counts stuff from Language class	2014-10-10 19:25:01 +11:00
Matthew Honnibal	71ee921055	* Slight cleaning of tokenizer code	2014-10-10 19:17:22 +11:00
Matthew Honnibal	d73d89a2de	* Add i attribute to lexeme, giving lexemes sequential IDs.	2014-10-09 13:50:05 +11:00
Matthew Honnibal	096ef2b199	* Rename external hashing lib, from trustyc to preshed	2014-09-26 18:40:03 +02:00
Matthew Honnibal	b15619e170	* Use PointerHash instead of locally provided _hashing module	2014-09-25 18:23:35 +02:00
Matthew Honnibal	ac522e2553	* Switch from own memory class to cymem, in pip	2014-09-17 23:09:24 +02:00
Matthew Honnibal	6266cac593	* Switch to using a Python ref counted gateway to malloc/free, to prevent memory leaks	2014-09-17 20:02:26 +02:00
Matthew Honnibal	0152831c89	* Refactor tokenization, enable cache, and ensure we look up specials correctly even when there's confusing punctuation surrounding the token.	2014-09-16 18:01:46 +02:00
Matthew Honnibal	143e51ec73	* Refactor tokenization, splitting it into a clearer life-cycle.	2014-09-16 13:16:02 +02:00
Matthew Honnibal	0bb547ab98	* Fix memory error in cache, where entry wasn't being null-terminated. Various other changes, some good for performance	2014-09-15 06:34:10 +02:00
Matthew Honnibal	e68a431e5e	* Pass only the tokens vector to _tokenize, instead of the whole python object.	2014-09-15 04:01:38 +02:00
Matthew Honnibal	df24e3708c	* Move EnglishTokens stuff to Tokens	2014-09-15 01:31:44 +02:00
Matthew Honnibal	f3393cf57c	* Improve interface for PointerHash	2014-09-13 17:29:58 +02:00
Matthew Honnibal	0447279c57	* PointerHash working, efficiency is good. 6-7 mins	2014-09-13 16:43:59 +02:00
Matthew Honnibal	85d68e8e95	* Replaced cache with own hash table. Similar timing	2014-09-13 03:14:43 +02:00
Matthew Honnibal	a8e7cce30f	* Efficiency tweaks	2014-09-13 00:14:05 +02:00
Matthew Honnibal	126a8453a5	* Fix performance issues by implementing a better cache. Add own String struct to help	2014-09-12 23:50:37 +02:00
Matthew Honnibal	9298e36b36	* Move special tokenization into its own lookup table, away from the cache.	2014-09-12 19:43:14 +02:00
Matthew Honnibal	985bc68327	* Fix bug with trailing punct on contractions. Reduced efficiency, and slightly hacky implementation.	2014-09-12 18:26:26 +02:00
Matthew Honnibal	4817277d66	* Replace main lexicon dict with dense_hash_map. May be unsuitable, if strings need recovery.	2014-09-12 04:29:09 +02:00
Matthew Honnibal	8b20e9ad97	* Delete unused _split method	2014-09-12 04:03:52 +02:00
Matthew Honnibal	a4863686ec	* Changed cache to use a linked-list data structure, to take out Python list code. Taking 6-7 mins for gigaword.	2014-09-12 03:30:50 +02:00
Matthew Honnibal	e096f30161	* Tweak signatures and refactor slightly. Processing gigaword taking 8-9 mins. Tests passing, but some sort of memory bug on exit.	2014-09-12 02:43:36 +02:00
Matthew Honnibal	073ee0de63	* Restore dense_hash_map for cache dictionary. Seems to double efficiency	2014-09-12 02:23:51 +02:00

58 Commits