61 Commits

Author SHA1 Message Date
Matthew Honnibal a97bed9359 * Fix POS and dependency label tag names. Add parse and string navigation functions. 2015-01-24 17:29:04 +11:00
Matthew Honnibal 5fd72bc220 * Have 'string' refer to the whitespace-padded string 2015-01-24 07:32:38 +11:00
Matthew Honnibal fda94271af * Rename NORM1 and NORM2 attrs to lower and norm 2015-01-24 06:17:03 +11:00
Matthew Honnibal 5ed8b2b98f * Rename sic to orth 2015-01-23 02:08:25 +11:00
Matthew Honnibal 8b9d913d97 * Rename vec to repvec 2015-01-22 02:05:58 +11:00
Matthew Honnibal 56e6cf0672 * Add _string attr to Tokens object 2015-01-21 18:57:09 +11:00
Matthew Honnibal b65b0c07bf * Messily hook up vector in tokens 2015-01-19 19:59:55 +11:00
Matthew Honnibal 8ff5b8bd84 * Add attribute for POS scheme 2015-01-17 17:33:16 +11:00
Matthew Honnibal 802867e96a * Revise interface to Token. Strings now have attribute names like norm1_ 2015-01-15 03:51:47 +11:00
Matthew Honnibal 0930892fc1 * Tmp. Working on refactor. Compiles, must hook up lexical feats. 2015-01-14 00:03:48 +11:00
Matthew Honnibal ce2edd6312 * Tmp commit. Refactoring to create a Python Lexeme class. 2015-01-12 10:26:22 +11:00
Matthew Honnibal 3f1944d688 * Make PyPy work 2015-01-05 17:54:38 +11:00
Matthew Honnibal b8b65903fc * Tmp 2014-12-24 17:42:00 +11:00
Matthew Honnibal ab61673edd * Fix api of array method 2014-12-23 15:18:48 +11:00
Matthew Honnibal 73f200436f * Tests passing except for morphology/lemmatization stuff 2014-12-23 11:40:32 +11:00
Matthew Honnibal cf8d26c3d2 * POS tagger training working after reorg 2014-12-22 08:54:47 +11:00
Matthew Honnibal 4c4aa2c5c9 * Work on train 2014-12-22 07:25:43 +11:00
Matthew Honnibal 4c6ce7ee84 * Update tokens.pyx as part of reorg 2014-12-20 07:03:26 +11:00
Matthew Honnibal 87e9487d76 * Work on parser 2014-12-17 21:10:12 +11:00
Matthew Honnibal 95ccea03b2 * Work on greedy parser 2014-12-16 22:46:55 +11:00
Matthew Honnibal 9959a64f7b * Working morphology and lemmatisation. POS tagging quite fast. 2014-12-10 08:09:32 +11:00
Matthew Honnibal accdbe989b * Remove Tokens.extend method 2014-12-09 17:09:23 +11:00
Matthew Honnibal 495e1c7366 * Use fused type in Tokens.push_back, simplifying the use of the cache 2014-12-09 16:50:01 +11:00
Matthew Honnibal 302e09018b * Work on fixing special-cases, reading them in as JSON objects so that they can specify lemmas 2014-12-09 14:48:01 +11:00
Matthew Honnibal 99bbbb6feb * Work on morphological processing 2014-12-08 21:12:15 +11:00
Matthew Honnibal 9f17467c2e * Fix EMPTY_TOKEN 2014-12-07 22:07:41 +11:00
Matthew Honnibal e27b912ef9 * Remove need for confusing _data pointer to be stored on Tokens 2014-12-05 16:31:30 +11:00
Matthew Honnibal 1c9253701d * Introduce a TokenC struct, to handle token indices, pos tags and sense tags 2014-12-05 15:56:14 +11:00
Matthew Honnibal 69bb022204 * Add as_array and count_by method 2014-12-04 20:46:55 +11:00
Matthew Honnibal d70d31aa45 * Introduce first attempt at const-ness 2014-12-03 15:44:25 +11:00
Matthew Honnibal e170faf5b0 * Hack Tokens to work without tagger.pyx 2014-12-03 11:05:15 +11:00
Matthew Honnibal 522bb0346e * Work on get_array method of Tokens 2014-12-02 23:48:05 +11:00
Matthew Honnibal 4ecbe8c893 * Complete refactor of Tagger features, to use a generic list of context names. 2014-11-05 20:45:29 +11:00
Matthew Honnibal 3733444101 * Generalize tagger code, in preparation for NER and supersense tagging. 2014-11-05 03:42:14 +11:00
Matthew Honnibal ae52f9f38c * Remove vocab10k from tokens 2014-11-03 00:23:20 +11:00
Matthew Honnibal b186a66bae * Rename Token.lex_pos to Token.postype, and Token.lex_supersense to Token.sensetype 2014-10-31 17:44:39 +11:00
Matthew Honnibal e6b87766fe * Remove lexemes vector from Lexicon, and the id and hash attributes from Lexeme 2014-10-30 15:21:38 +11:00
Matthew Honnibal 13909a2e24 * Rewriting Lexeme serialization. 2014-10-29 23:19:38 +11:00
Matthew Honnibal 08ce602243 * Large refactor, particularly to Python API 2014-10-24 00:59:17 +11:00
Matthew Honnibal e5e951ae67 * Remove the feature array stuff from Tokens class, and replace vector with array-based implementation, with padding. 2014-10-23 01:57:59 +11:00
Matthew Honnibal 7018b53d3a * Improve array features in tokens 2014-10-22 12:55:42 +11:00
Matthew Honnibal 43743a5d63 * Work on efficiency 2014-10-14 18:22:41 +11:00
Matthew Honnibal 6fb42c4919 * Add offsets to Tokens class. Some changes to interfaces, and reorganization of spacy.Lang 2014-10-14 16:17:45 +11:00
Matthew Honnibal 2805068ca8 * Have tokens track tuples that record the start offset and pos tag as well as a lexeme pointer 2014-10-14 15:21:03 +11:00
Matthew Honnibal 71ee921055 * Slight cleaning of tokenizer code 2014-10-10 19:17:22 +11:00
Matthew Honnibal 59b41a9fd3 * Switch to new data model, tests passing 2014-10-10 08:11:31 +11:00
Matthew Honnibal 08cef75ffd * Switch to using a heap-allocated vector in tokens 2014-09-15 03:46:14 +02:00
Matthew Honnibal f77b7098c0 * Upd Tokens to use vector, with bounds checking. 2014-09-15 03:22:40 +02:00
Matthew Honnibal df24e3708c * Move EnglishTokens stuff to Tokens 2014-09-15 01:31:44 +02:00
Matthew Honnibal 5aa591106b * Fiddle with token features 2014-09-12 15:49:36 +02:00