spaCy

mirror of https://github.com/explosion/spaCy.git synced 2025-07-11 08:42:28 +03:00

Author	SHA1	Message	Date
Matthew Honnibal	1c9253701d	* Introduce a TokenC struct, to handle token indices, pos tags and sense tags	2014-12-05 15:56:14 +11:00
Matthew Honnibal	187372c7f3	* Allow the lexicon to create lexemes using an external memory pool, so that it can decide to make some lexemes temporary, rather than cached	2014-12-05 03:29:50 +11:00
Matthew Honnibal	7e04c22f8f	* const added to Lexicon interface. Seems to work.	2014-12-03 15:58:17 +11:00
Matthew Honnibal	d70d31aa45	* Introduce first attempt at const-ness	2014-12-03 15:44:25 +11:00
Matthew Honnibal	b463a7eb86	* Make flag-setting a language-specific thing	2014-12-03 11:04:32 +11:00
Matthew Honnibal	8c2938fe01	* Rename Lexicon._dict to Lexicon._map	2014-12-02 23:46:59 +11:00
Matthew Honnibal	33dfb4933c	* Remove taggers from Language class. Work on doc strings	2014-11-26 19:53:55 +11:00
Matthew Honnibal	c788633429	* Add tokens_from_list method to Language	2014-11-11 23:43:14 +11:00
Matthew Honnibal	ff8989b63c	* Use greedy NER parser	2014-11-11 21:08:35 +11:00
Matthew Honnibal	4ecbe8c893	* Complete refactor of Tagger features, to use a generic list of context names.	2014-11-05 20:45:29 +11:00
Matthew Honnibal	3733444101	* Generalize tagger code, in preparation for NER and supersense tagging.	2014-11-05 03:42:14 +11:00
Matthew Honnibal	fcd9490d56	* Add pos_tag method to Language	2014-11-02 14:21:43 +11:00
Matthew Honnibal	a8ca078b24	* Restore lexemes field to lexicon	2014-10-31 17:43:25 +11:00
Matthew Honnibal	ea8f1e7053	* Tighten interfaces	2014-10-30 18:14:42 +11:00
Matthew Honnibal	ea85bf3a0a	* Tighten the interface to Language	2014-10-30 18:01:27 +11:00
Matthew Honnibal	87c2418a89	* Fiddle with data types on Lexeme, to compress them to a much smaller size.	2014-10-30 15:42:15 +11:00
Matthew Honnibal	e6b87766fe	* Remove lexemes vector from Lexicon, and the id and hash attributes from Lexeme	2014-10-30 15:21:38 +11:00
Matthew Honnibal	889b7b48b4	* Fix POS tagger, so that it loads correctly. Lexemes are being read in.	2014-10-30 13:38:55 +11:00
Matthew Honnibal	67c8c8019f	* Update lexeme serialization, using a binary file format	2014-10-30 01:01:00 +11:00
Matthew Honnibal	13909a2e24	* Rewriting Lexeme serialization.	2014-10-29 23:19:38 +11:00
Matthew Honnibal	234d49bf4d	* Seems to be working after refactor. Need to wire up more POS tag features, and wire up save/load of POS tags.	2014-10-24 02:23:42 +11:00
Matthew Honnibal	08ce602243	* Large refactor, particularly to Python API	2014-10-24 00:59:17 +11:00
Matthew Honnibal	e5e951ae67	* Remove the feature array stuff from Tokens class, and replace vector with array-based implementation, with padding.	2014-10-23 01:57:59 +11:00
Matthew Honnibal	99f5e59286	* Have tokenizer emit tokens for whitespace other than single spaces	2014-10-14 20:25:57 +11:00
Matthew Honnibal	43743a5d63	* Work on efficiency	2014-10-14 18:22:41 +11:00
Matthew Honnibal	6fb42c4919	* Add offsets to Tokens class. Some changes to interfaces, and reorganization of spacy.Lang	2014-10-14 16:17:45 +11:00
Matthew Honnibal	868e558037	* Preparations in place to handle hyphenation etc	2014-10-10 20:23:23 +11:00
Matthew Honnibal	ff79dbac2e	* More slight cleaning for lang.pyx	2014-10-10 20:11:22 +11:00
Matthew Honnibal	3d82ed1e5e	* More slight cleaning for lang.pyx	2014-10-10 19:50:07 +11:00
Matthew Honnibal	02e948e7d5	* Remove counts stuff from Language class	2014-10-10 19:25:01 +11:00
Matthew Honnibal	71ee921055	* Slight cleaning of tokenizer code	2014-10-10 19:17:22 +11:00
Matthew Honnibal	59b41a9fd3	* Switch to new data model, tests passing	2014-10-10 08:11:31 +11:00
Matthew Honnibal	e40caae51f	* Update Lexicon class to expect a list of lexeme dict descriptions	2014-10-09 14:51:35 +11:00
Matthew Honnibal	d73d89a2de	* Add i attribute to lexeme, giving lexemes sequential IDs.	2014-10-09 13:50:05 +11:00
Matthew Honnibal	096ef2b199	* Rename external hashing lib, from trustyc to preshed	2014-09-26 18:40:03 +02:00
Matthew Honnibal	b15619e170	* Use PointerHash instead of locally provided _hashing module	2014-09-25 18:23:35 +02:00
Matthew Honnibal	ac522e2553	* Switch from own memory class to cymem, in pip	2014-09-17 23:09:24 +02:00
Matthew Honnibal	6266cac593	* Switch to using a Python ref counted gateway to malloc/free, to prevent memory leaks	2014-09-17 20:02:26 +02:00
Matthew Honnibal	0152831c89	* Refactor tokenization, enable cache, and ensure we look up specials correctly even when there's confusing punctuation surrounding the token.	2014-09-16 18:01:46 +02:00
Matthew Honnibal	143e51ec73	* Refactor tokenization, splitting it into a clearer life-cycle.	2014-09-16 13:16:02 +02:00
Matthew Honnibal	0bb547ab98	* Fix memory error in cache, where entry wasn't being null-terminated. Various other changes, some good for performance	2014-09-15 06:34:10 +02:00
Matthew Honnibal	e68a431e5e	* Pass only the tokens vector to _tokenize, instead of the whole python object.	2014-09-15 04:01:38 +02:00
Matthew Honnibal	08cef75ffd	* Switch to using a heap-allocated vector in tokens	2014-09-15 03:46:14 +02:00
Matthew Honnibal	f77b7098c0	* Upd Tokens to use vector, with bounds checking.	2014-09-15 03:22:40 +02:00
Matthew Honnibal	df24e3708c	* Move EnglishTokens stuff to Tokens	2014-09-15 01:31:44 +02:00
Matthew Honnibal	f3393cf57c	* Improve interface for PointerHash	2014-09-13 17:29:58 +02:00
Matthew Honnibal	45865be37e	* Switch hash interface, using void* instead of size_t, to avoid casts.	2014-09-13 17:02:06 +02:00
Matthew Honnibal	0447279c57	* PointerHash working, efficiency is good. 6-7 mins	2014-09-13 16:43:59 +02:00
Matthew Honnibal	85d68e8e95	* Replaced cache with own hash table. Similar timing	2014-09-13 03:14:43 +02:00
Matthew Honnibal	afdc9b7ac2	* More performance fiddling, particularly moving the specials into the cache, so that we can just lookup the cache in _tokenize	2014-09-13 00:59:34 +02:00

79 Commits