adrianeboyd
98c59027ed
Use max(uint64) for OOV lexeme rank (#5303)
* Use max(uint64) for OOV lexeme rank
* Add test for default OOV rank
* Revert back to thinc==7.4.0
Requiring the updated version of thinc was unnecessary.
* Define OOV_RANK in one place
Define OOV_RANK in one place in `util`.
* Fix formatting [ci skip]
* Switch to external definitions of max(uint64)
Switch to external definitions of max(uint64) and confirm that they are
equal.
2020-04-15 13:49:47 +02:00
Ines Montani
62b558ab72
💫 Support lexical attributes in retokenizer attrs (closes #2390) (#3325)
* Fix formatting and whitespace
* Add support for lexical attributes (closes #2390)
* Document lexical attribute setting during retokenization
* Assign variable outside of nested loop
2019-02-24 21:13:51 +01:00
Matthew Honnibal
84e66ca6d4
WIP on stringstore change. 27 failures
2017-05-28 14:06:40 +02:00
Matthew Honnibal
f51e6a6c16
Adjust lexeme sizing for attr_t being 64 bit
2017-05-28 12:51:09 +02:00
Matthew Honnibal
793430aa7a
Get spaCy train command working with neural network
* Integrate models into pipeline
* Add basic serialization (maybe incorrect)
* Fix pickle on vocab
2017-05-17 12:04:50 +02:00
Wolfgang Seeker
03fb498dbe
introduce lang field for LexemeC to hold language id
put noun_chunk logic into iterators.py for each language separately
2016-03-10 13:01:34 +01:00
Matthew Honnibal
193f127f81
* Fix ugly py_check_flag and py_set_flag functions in Lexeme
2015-09-15 13:06:18 +10:00
Matthew Honnibal
e7e529edf4
* Fix Lexeme.check_flag
2015-09-10 14:45:43 +02:00
Matthew Honnibal
4f8e38271d
* Fix merge errors in lexeme.pxd
2015-09-06 20:19:08 +02:00
Matthew Honnibal
86c888667f
* Merge in changes from de branch
2015-09-06 19:49:28 +02:00
Matthew Honnibal
d2fc104a26
* Begin merge of Gazetteer and DE branches
2015-09-06 19:45:15 +02:00
Matthew Honnibal
e35bb36be7
* Ensure Lexeme.check_flag returns a boolean value
2015-09-06 17:52:32 +02:00
Matthew Honnibal
6f1743692a
* Work on language-independent refactoring
2015-08-23 20:49:18 +02:00
Matthew Honnibal
cad0cca4e3
* Tmp
2015-08-22 22:04:34 +02:00
Matthew Honnibal
c263577424
* Fix lower attribute in lexeme.pxd
2015-08-06 16:07:41 +02:00
Matthew Honnibal
6bb96c122d
* Host IS_ flags in attrs.pxd, and add properties for them on Token and Lexeme objects
2015-07-26 16:37:16 +02:00
Matthew Honnibal
4dddc8a69b
* Fix type declarations for attr_t. Remove unused id_t.
2015-07-18 22:39:57 +02:00
Matthew Honnibal
a6d040bd11
* Import Lexeme attrs from spacy.attrs, not spacy.typedefs
2015-07-16 11:20:08 +02:00
Matthew Honnibal
65251e7625
* Remove redundant attr_id_t from typedefs.pxd
2015-07-16 00:58:51 +02:00
Matthew Honnibal
78db7e32f7
* Remove has_sense method from Lexeme declaration
2015-07-08 19:41:20 +02:00
Matthew Honnibal
b64c843861
* Remove senses attr
2015-07-08 19:26:24 +02:00
Matthew Honnibal
2b8459d9a8
* Add senses flag to Lexeme
2015-07-01 20:10:41 +02:00
Matthew Honnibal
c04e6ebca6
* Allow user to load different sized vectors.
2015-06-05 16:26:39 +02:00
Jordan Suchow
3a8d9b37a6
Remove trailing whitespace
2015-04-19 13:01:38 -07:00
Matthew Honnibal
321b402739
* Store the l2 norm of the word's vector
2015-02-07 08:42:16 -05:00
Matthew Honnibal
fda94271af
* Rename NORM1 and NORM2 attrs to lower and norm
2015-01-24 06:17:03 +11:00
Matthew Honnibal
5ed8b2b98f
* Rename sic to orth
2015-01-23 02:08:25 +11:00
Matthew Honnibal
5e63c606ad
* Rename vec to repvec
2015-01-22 02:03:54 +11:00
Matthew Honnibal
6c7e44140b
* Work on word vectors, and other stuff
2015-01-17 16:21:17 +11:00
Matthew Honnibal
7d3c40de7d
* Tests passing after refactor. API has obvious warts, particularly in Token and Lexeme
2015-01-15 00:33:16 +11:00
Matthew Honnibal
0930892fc1
* Tmp. Working on refactor. Compiles, must hook up lexical feats.
2015-01-14 00:03:48 +11:00
Matthew Honnibal
46da3d74d2
* Tmp. Refactoring, introducing a Lexeme PyObject.
2015-01-12 11:23:44 +11:00
Matthew Honnibal
ce2edd6312
* Tmp commit. Refactoring to create a Python Lexeme class.
2015-01-12 10:26:22 +11:00
Matthew Honnibal
4c4aa2c5c9
* Work on train
2014-12-22 07:25:43 +11:00
Matthew Honnibal
f6556d8e5d
* Refactor, move Lexeme struct to structs.pxd
2014-12-20 06:51:03 +11:00
Matthew Honnibal
9959a64f7b
* Working morphology and lemmatisation. POS tagging quite fast.
2014-12-10 08:09:32 +11:00
Matthew Honnibal
ef4398b204
* Rearrange POS stuff, so that language-specific stuff can live in language-specific modules
2014-12-07 23:52:41 +11:00
Matthew Honnibal
49f3780ff5
* Fiddle with lexeme attrs
2014-12-04 21:22:38 +11:00
Matthew Honnibal
e1b1f45cc9
* Add STEM attribute to lexeme
2014-12-04 20:46:20 +11:00
Matthew Honnibal
d70d31aa45
* Introduce first attempt at const-ness
2014-12-03 15:44:25 +11:00
Matthew Honnibal
b463a7eb86
* Make flag-setting a language-specific thing
2014-12-03 11:04:32 +11:00
Matthew Honnibal
50309e6e49
* Fix context vector, importing all features
2014-11-05 22:11:39 +11:00
Matthew Honnibal
70ea862703
* Remove vocab10k field, and add flags for gazetteers
2014-11-03 00:13:51 +11:00
Matthew Honnibal
8335706321
* Add LIKE_URL and LIKE_NUMBER flag features
2014-11-02 13:19:23 +11:00
Matthew Honnibal
6c807aa45f
* Restore id attribute to lexeme, and rename pos field to postype, to store clustered tag dictionaries
2014-10-31 17:43:00 +11:00
Matthew Honnibal
87c2418a89
* Fiddle with data types on Lexeme, to compress them to a much smaller size.
2014-10-30 15:42:15 +11:00
Matthew Honnibal
e6b87766fe
* Remove lexemes vector from Lexicon, and the id and hash attributes from Lexeme
2014-10-30 15:21:38 +11:00
Matthew Honnibal
13909a2e24
* Rewriting Lexeme serialization.
2014-10-29 23:19:38 +11:00
Matthew Honnibal
08ce602243
* Large refactor, particularly to Python API
2014-10-24 00:59:17 +11:00
Matthew Honnibal
e5e951ae67
* Remove the feature array stuff from Tokens class, and replace vector with array-based implementation, with padding.
2014-10-23 01:57:59 +11:00