Matthew Honnibal
|
ef4398b204
|
* Rearrange POS stuff, so that language-specific stuff can live in language-specific modules
|
2014-12-07 23:52:41 +11:00 |
|
Matthew Honnibal
|
75b8dfb348
|
* Remove upper_pc from lexeme.pyx
|
2014-12-04 22:14:34 +11:00 |
|
Matthew Honnibal
|
e1b1f45cc9
|
* Add STEM attribute to lexeme
|
2014-12-04 20:46:20 +11:00 |
|
Matthew Honnibal
|
d70d31aa45
|
* Introduce first attempt at const-ness
|
2014-12-03 15:44:25 +11:00 |
|
Matthew Honnibal
|
b463a7eb86
|
* Make flag-setting a language-specific thing
|
2014-12-03 11:04:32 +11:00 |
|
Matthew Honnibal
|
70ea862703
|
* Remove vocab10k field, and add flags for gazetteers
|
2014-11-03 00:13:51 +11:00 |
|
Matthew Honnibal
|
8335706321
|
* Add LIKE_URL and LIKE_NUMBER flag features
|
2014-11-02 13:19:23 +11:00 |
|
Matthew Honnibal
|
6c807aa45f
|
* Restore id attribute to lexeme, and rename pos field to postype, to store clustered tag dictionaries
|
2014-10-31 17:43:00 +11:00 |
|
Matthew Honnibal
|
c6fcd03692
|
* Small efficiency tweak to lexeme init
|
2014-10-30 17:56:11 +11:00 |
|
Matthew Honnibal
|
87c2418a89
|
* Fiddle with data types on Lexeme, to compress them to a much smaller size.
|
2014-10-30 15:42:15 +11:00 |
|
Matthew Honnibal
|
e6b87766fe
|
* Remove lexemes vector from Lexicon, and the id and hash attributes from Lexeme
|
2014-10-30 15:21:38 +11:00 |
|
Matthew Honnibal
|
67c8c8019f
|
* Update lexeme serialization, using a binary file format
|
2014-10-30 01:01:00 +11:00 |
|
Matthew Honnibal
|
13909a2e24
|
* Rewriting Lexeme serialization.
|
2014-10-29 23:19:38 +11:00 |
|
Matthew Honnibal
|
08ce602243
|
* Large refactor, particularly to Python API
|
2014-10-24 00:59:17 +11:00 |
|
Matthew Honnibal
|
e5e951ae67
|
* Remove the feature array stuff from Tokens class, and replace vector with array-based implementation, with padding.
|
2014-10-23 01:57:59 +11:00 |
|
Matthew Honnibal
|
0a0e41f6c8
|
* Add prefix and suffix features
|
2014-10-22 12:56:09 +11:00 |
|
Matthew Honnibal
|
65d3ead4fd
|
* Rename LexStr_casefix to LexStr_norm and LexInt_i to LexInt_id
|
2014-10-14 15:19:07 +11:00 |
|
Matthew Honnibal
|
71ee921055
|
* Slight cleaning of tokenizer code
|
2014-10-10 19:17:22 +11:00 |
|
Matthew Honnibal
|
59b41a9fd3
|
* Switch to new data model, tests passing
|
2014-10-10 08:11:31 +11:00 |
|
Matthew Honnibal
|
1b0e01d3d8
|
* Revising data model of lexeme. Compiles.
|
2014-10-09 19:53:30 +11:00 |
|
Matthew Honnibal
|
51d75b244b
|
* Add serialize/deserialize functions for lexeme, transport to/from python dict.
|
2014-10-09 14:10:46 +11:00 |
|
Matthew Honnibal
|
d73d89a2de
|
* Add i attribute to lexeme, giving lexemes sequential IDs.
|
2014-10-09 13:50:05 +11:00 |
|
Matthew Honnibal
|
ac522e2553
|
* Switch from own memory class to cymem, in pip
|
2014-09-17 23:09:24 +02:00 |
|
Matthew Honnibal
|
6266cac593
|
* Switch to using a Python ref counted gateway to malloc/free, to prevent memory leaks
|
2014-09-17 20:02:26 +02:00 |
|
Matthew Honnibal
|
c396581a0b
|
* Fiddle with the way strings are interned in lexeme
|
2014-09-15 06:34:45 +02:00 |
|
Matthew Honnibal
|
f77b7098c0
|
* Upd Tokens to use vector, with bounds checking.
|
2014-09-15 03:22:40 +02:00 |
|
Matthew Honnibal
|
df24e3708c
|
* Move EnglishTokens stuff to Tokens
|
2014-09-15 01:31:44 +02:00 |
|
Matthew Honnibal
|
b488224c09
|
* Restoring Lexeme-as-struct
|
2014-09-10 20:41:37 +02:00 |
|
Matthew Honnibal
|
88095666dc
|
* Remove Lexeme struct, preparing to rename Word to Lexeme.
|
2014-08-24 19:24:42 +02:00 |
|
Matthew Honnibal
|
e289896603
|
* Fix ptb3 module
|
2014-08-22 16:36:17 +02:00 |
|
Matthew Honnibal
|
d10993f41a
|
* More docs work
|
2014-08-21 16:37:13 +02:00 |
|
Matthew Honnibal
|
a78ad4152d
|
* Broken version being refactored for docs
|
2014-08-20 13:39:39 +02:00 |
|
Matthew Honnibal
|
5fddb8d165
|
* Working refactor, with updated data model for Lexemes
|
2014-08-19 04:21:20 +02:00 |
|
Matthew Honnibal
|
3379d7a571
|
* Reforming data model for lexemes
|
2014-08-19 02:40:37 +02:00 |
|
Matthew Honnibal
|
01469b0888
|
* Refactor spacy so that chunks return arrays of lexemes, so that there is properly one lexeme per word.
|
2014-08-18 19:14:00 +02:00 |
|
Matthew Honnibal
|
6319ff0f22
|
* Add length property
|
2014-08-02 21:26:44 +01:00 |
|
Matthew Honnibal
|
571808a274
|
Group-by seems to be working
|
2014-07-07 20:27:02 +02:00 |
|
Matthew Honnibal
|
80b36f9f27
|
* 710k words per second for counts
|
2014-07-07 19:12:19 +02:00 |
|
Matthew Honnibal
|
057c21969b
|
* Refactor for string view features. Working on setting up flags and enums.
|
2014-07-07 16:58:48 +02:00 |
|
Matthew Honnibal
|
f1bcbd4c4e
|
* Reorganized code to accommodate Tokens class. Need string views before group_by and count_by can be done well.
|
2014-07-07 12:47:21 +02:00 |
|
Matthew Honnibal
|
ff1869ff07
|
* Fixed major efficiency problem, from not quite grokking pass by reference in cython c++
|
2014-07-07 07:36:43 +02:00 |
|
Matthew Honnibal
|
d5bef02c72
|
* Reorganized, moving language-independent stuff to spacy. The functions in spacy ask for the dictionaries and split function on input, but the language-specific modules are curried versions that use the globals
|
2014-07-07 04:21:06 +02:00 |
|
Matthew Honnibal
|
556f6a18ca
|
* Initial commit. Tests passing for punctuation handling. Need contractions, file transport, tokenize function, etc.
|
2014-07-05 20:51:42 +02:00 |
|