Matthew Honnibal
|
9959a64f7b
|
* Working morphology and lemmatisation. POS tagging quite fast.
|
2014-12-10 08:09:32 +11:00 |
|
Matthew Honnibal
|
accdbe989b
|
* Remove Tokens.extend method
|
2014-12-09 17:09:23 +11:00 |
|
Matthew Honnibal
|
495e1c7366
|
* Use fused type in Tokens.push_back, simplifying the use of the cache
|
2014-12-09 16:50:01 +11:00 |
|
Matthew Honnibal
|
99bbbb6feb
|
* Work on morphological processing
|
2014-12-08 21:12:15 +11:00 |
|
Matthew Honnibal
|
ef4398b204
|
* Rearrange POS stuff, so that language-specific stuff can live in language-specific modules
|
2014-12-07 23:52:41 +11:00 |
|
Matthew Honnibal
|
9f17467c2e
|
* Fix EMPTY_TOKEN
|
2014-12-07 22:07:41 +11:00 |
|
Matthew Honnibal
|
e27b912ef9
|
* Remove need for confusing _data pointer to be stored on Tokens
|
2014-12-05 16:31:30 +11:00 |
|
Matthew Honnibal
|
1c9253701d
|
* Introduce a TokenC struct, to handle token indices, pos tags and sense tags
|
2014-12-05 15:56:14 +11:00 |
|
Matthew Honnibal
|
564082e48e
|
* Hack Token class to take lex.dense in place of the old lex.norm. This needs to be fixed...
|
2014-12-04 20:51:29 +11:00 |
|
Matthew Honnibal
|
69bb022204
|
* Add as_array and count_by methods
|
2014-12-04 20:46:55 +11:00 |
|
Matthew Honnibal
|
d70d31aa45
|
* Introduce first attempt at const-ness
|
2014-12-03 15:44:25 +11:00 |
|
Matthew Honnibal
|
e170faf5b0
|
* Hack Tokens to work without tagger.pyx
|
2014-12-03 11:05:15 +11:00 |
|
Matthew Honnibal
|
522bb0346e
|
* Work on get_array method of Tokens
|
2014-12-02 23:48:05 +11:00 |
|
Matthew Honnibal
|
4ecbe8c893
|
* Complete refactor of Tagger features, to use a generic list of context names.
|
2014-11-05 20:45:29 +11:00 |
|
Matthew Honnibal
|
3733444101
|
* Generalize tagger code, in preparation for NER and supersense tagging.
|
2014-11-05 03:42:14 +11:00 |
|
Matthew Honnibal
|
954c970415
|
* Add __iter__ method to tokens
|
2014-11-04 01:07:08 +11:00 |
|
Matthew Honnibal
|
ae52f9f38c
|
* Remove vocab10k from tokens
|
2014-11-03 00:23:20 +11:00 |
|
Matthew Honnibal
|
b186a66bae
|
* Rename Token.lex_pos to Token.postype, and Token.lex_supersense to Token.sensetype
|
2014-10-31 17:44:39 +11:00 |
|
Matthew Honnibal
|
ac88893232
|
* Fix Token after lexeme changes
|
2014-10-30 15:30:52 +11:00 |
|
Matthew Honnibal
|
e6b87766fe
|
* Remove lexemes vector from Lexicon, and the id and hash attributes from Lexeme
|
2014-10-30 15:21:38 +11:00 |
|
Matthew Honnibal
|
13909a2e24
|
* Rewriting Lexeme serialization.
|
2014-10-29 23:19:38 +11:00 |
|
Matthew Honnibal
|
08ce602243
|
* Large refactor, particularly to Python API
|
2014-10-24 00:59:17 +11:00 |
|
Matthew Honnibal
|
7baef5b7ff
|
* Fix padding on tokens
|
2014-10-23 04:01:17 +11:00 |
|
Matthew Honnibal
|
e5e951ae67
|
* Remove the feature array stuff from Tokens class, and replace vector with array-based implementation, with padding.
|
2014-10-23 01:57:59 +11:00 |
|
Matthew Honnibal
|
7018b53d3a
|
* Improve array features in tokens
|
2014-10-22 12:55:42 +11:00 |
|
Matthew Honnibal
|
43743a5d63
|
* Work on efficiency
|
2014-10-14 18:22:41 +11:00 |
|
Matthew Honnibal
|
6fb42c4919
|
* Add offsets to Tokens class. Some changes to interfaces, and reorganization of spacy.Lang
|
2014-10-14 16:17:45 +11:00 |
|
Matthew Honnibal
|
2805068ca8
|
* Have tokens track tuples that record the start offset and pos tag as well as a lexeme pointer
|
2014-10-14 15:21:03 +11:00 |
|
Matthew Honnibal
|
71ee921055
|
* Slight cleaning of tokenizer code
|
2014-10-10 19:17:22 +11:00 |
|
Matthew Honnibal
|
59b41a9fd3
|
* Switch to new data model, tests passing
|
2014-10-10 08:11:31 +11:00 |
|
Matthew Honnibal
|
6266cac593
|
* Switch to using a Python ref counted gateway to malloc/free, to prevent memory leaks
|
2014-09-17 20:02:26 +02:00 |
|
Matthew Honnibal
|
08cef75ffd
|
* Switch to using a heap-allocated vector in tokens
|
2014-09-15 03:46:14 +02:00 |
|
Matthew Honnibal
|
f77b7098c0
|
* Upd Tokens to use vector, with bounds checking.
|
2014-09-15 03:22:40 +02:00 |
|
Matthew Honnibal
|
0f6bf2a2ee
|
* Fix niggling memory error, which was caused by bug in the way tokens resized their internal vector.
|
2014-09-15 02:08:39 +02:00 |
|
Matthew Honnibal
|
df24e3708c
|
* Move EnglishTokens stuff to Tokens
|
2014-09-15 01:31:44 +02:00 |
|
Matthew Honnibal
|
5aa591106b
|
* Fiddle with token features
|
2014-09-12 15:49:36 +02:00 |
|
Matthew Honnibal
|
073ee0de63
|
* Restore dense_hash_map for cache dictionary. Seems to double efficiency
|
2014-09-12 02:23:51 +02:00 |
|
Matthew Honnibal
|
2389bd1b10
|
* Improve cache mechanism by including a random element depending on the size of the cache.
|
2014-09-12 00:19:16 +02:00 |
|
Matthew Honnibal
|
563047e90f
|
* Switch to returning a Tokens object
|
2014-09-11 21:37:32 +02:00 |
|
Matthew Honnibal
|
1a3222af4b
|
* Moving tokens to use an array internally, instead of a list of Lexeme objects.
|
2014-09-11 16:57:08 +02:00 |
|
Matthew Honnibal
|
cf412adba8
|
* Refactoring to use Tokens object
|
2014-09-10 18:11:13 +02:00 |
|
Matthew Honnibal
|
68bae2fec6
|
* More refactoring
|
2014-08-25 16:42:22 +02:00 |
|
Matthew Honnibal
|
07ecf5d2f4
|
* Fixed group_by, removed idea of general attr_of function.
|
2014-08-22 00:02:37 +02:00 |
|
Matthew Honnibal
|
811b7a6b91
|
* Struggling with arbitrary attr access...
|
2014-08-21 23:49:14 +02:00 |
|
Matthew Honnibal
|
a78ad4152d
|
* Broken version being refactored for docs
|
2014-08-20 13:39:39 +02:00 |
|
Matthew Honnibal
|
365a2af756
|
* Restore happax. Commit uncommitted work
|
2014-08-02 21:27:03 +01:00 |
|
Matthew Honnibal
|
a895fe5ddb
|
* Upd from spacy
|
2014-07-23 17:35:18 +01:00 |
|
Matthew Honnibal
|
571808a274
|
Group-by seems to be working
|
2014-07-07 20:27:02 +02:00 |
|
Matthew Honnibal
|
80b36f9f27
|
* 710k words per second for counts
|
2014-07-07 19:12:19 +02:00 |
|
Matthew Honnibal
|
057c21969b
|
* Refactor for string view features. Working on setting up flags and enums.
|
2014-07-07 16:58:48 +02:00 |
|