Matthew Honnibal
|
53d1f5b2eb
|
* Rename Span.head to Span.root.
|
2015-07-09 17:30:58 +02:00 |
|
Matthew Honnibal
|
c0255ed7d8
|
* Allow slice indexing in Doc.__getitem__, returning a Span object
|
2015-07-09 15:15:32 +02:00 |
|
Matthew Honnibal
|
89a91ad726
|
* Add SPACE part-of-speech tag, and train tagger to assign it. Also train tagger not to make whitespace an entity
|
2015-07-09 13:30:41 +02:00 |
|
Matthew Honnibal
|
55f1042443
|
* Improve efficiency of L and R features, correcting the non-linear-in-length problem.
|
2015-07-09 12:17:26 +02:00 |
|
Matthew Honnibal
|
70d2acb579
|
* Fix edge features
|
2015-07-09 12:15:01 +02:00 |
|
Matthew Honnibal
|
adb868bdad
|
* Add warning for models not found in parser
|
2015-07-08 20:04:55 +02:00 |
|
Matthew Honnibal
|
05b28ec9eb
|
* Add warning for models not found in parser
|
2015-07-08 20:02:13 +02:00 |
|
Matthew Honnibal
|
ef700401a6
|
* Add warning for models not found in parser
|
2015-07-08 20:00:46 +02:00 |
|
Matthew Honnibal
|
6218d8b389
|
* Add warning for models not found in parser
|
2015-07-08 19:59:16 +02:00 |
|
Matthew Honnibal
|
f6a6c39ce8
|
* Add warning for models not found in parser
|
2015-07-08 19:52:30 +02:00 |
|
Matthew Honnibal
|
78db7e32f7
|
* Remove has_sense method from Lexeme declaration
|
2015-07-08 19:41:20 +02:00 |
|
Matthew Honnibal
|
6ddb2f5e45
|
* Restore merge_mwe in English class
|
2015-07-08 19:35:30 +02:00 |
|
Matthew Honnibal
|
6859f6adac
|
* Restore merge_mwe in English class
|
2015-07-08 19:34:55 +02:00 |
|
Matthew Honnibal
|
3c270fc8ff
|
* Remove has_sense method from Lexeme
|
2015-07-08 19:28:29 +02:00 |
|
Matthew Honnibal
|
b64c843861
|
* Remove senses attr
|
2015-07-08 19:26:24 +02:00 |
|
Matthew Honnibal
|
1d3a592edf
|
* Remove the senses attr from LexemeC, to keep data compatibility
|
2015-07-08 19:24:44 +02:00 |
|
Matthew Honnibal
|
0ceb1f71c2
|
* Update parse features
|
2015-07-08 19:11:36 +02:00 |
|
Matthew Honnibal
|
2e51b5027a
|
* Alias Doc to Tokens, for backwards compatibility
|
2015-07-08 18:59:35 +02:00 |
|
Matthew Honnibal
|
e3c53f5ecd
|
* Fix mention of Tokens in docstring
|
2015-07-08 18:56:27 +02:00 |
|
Matthew Honnibal
|
bb522496dd
|
* Rename Tokens to Doc
|
2015-07-08 18:53:00 +02:00 |
|
Matthew Honnibal
|
b24e8be2b9
|
* Whitespace in docstring
|
2015-07-08 12:37:03 +02:00 |
|
Matthew Honnibal
|
abc43b852d
|
* Add pos_tags attr to Vocab.
|
2015-07-08 12:36:38 +02:00 |
|
Matthew Honnibal
|
935bcdf3e5
|
* Remove redundant tag_names argument to Tokenizer
|
2015-07-08 12:36:04 +02:00 |
|
Matthew Honnibal
|
ff885e8511
|
* Add ParserFactory convenience function
|
2015-07-08 12:35:46 +02:00 |
|
Matthew Honnibal
|
4e4fac452b
|
* Refactor __init__ for simplicity. Allow parse=True, tag=True etc flags to be passed at top-level. Do not lazy-load parser.
|
2015-07-08 12:35:29 +02:00 |
|
Matthew Honnibal
|
1d2deb4616
|
* Work on refactoring default arguments to English.__init__
|
2015-07-07 15:53:25 +02:00 |
|
Matthew Honnibal
|
2d0e99a096
|
* Pass pos_tags into Tokenizer.from_dir
|
2015-07-07 14:23:08 +02:00 |
|
Matthew Honnibal
|
6788c86b2f
|
* Begin refactor
|
2015-07-07 14:00:07 +02:00 |
|
Matthew Honnibal
|
52fd80c6c6
|
* Add experimental supersense features for parsing, based on lookup into wordnet.
|
2015-07-01 20:12:44 +02:00 |
|
Matthew Honnibal
|
e6d828a9af
|
* Set up an array POS_SENSES that denotes the set of valid senses for each POS tag. This way, we can do bitwise & between a lexeme's senses and the ones available for its POS tag, to get the allowable senses for the token.
|
2015-07-01 20:12:13 +02:00 |
|
Matthew Honnibal
|
2b8459d9a8
|
* Add senses flag to Lexeme
|
2015-07-01 20:10:41 +02:00 |
|
Matthew Honnibal
|
e23d1582a2
|
* Add supersense data to Lexeme objects. Add simple has_sense method to check the flag.
|
2015-07-01 18:50:37 +02:00 |
|
Matthew Honnibal
|
64fafa98be
|
* Add senses.pyx and senses.pxd
|
2015-07-01 18:49:44 +02:00 |
|
Matthew Honnibal
|
94dab94e5f
|
Merge branch 'master' of https://github.com/honnibal/spaCy
|
2015-06-30 18:16:26 +02:00 |
|
Matthew Honnibal
|
9af86b0b0b
|
* Fix attrs.pxd
|
2015-06-30 18:16:30 +02:00 |
|
Matthew Honnibal
|
af9c82f7a6
|
Merge branch 'master' of https://github.com/honnibal/spaCy
|
2015-06-30 18:11:37 +02:00 |
|
Matthew Honnibal
|
5d595b5a8c
|
* Inc versions
|
2015-06-30 18:11:06 +02:00 |
|
Matthew Honnibal
|
d2eeba6667
|
* Start wiring up color and emotion lexicons. Hopefully we get to use them.
|
2015-06-30 16:22:23 +02:00 |
|
Matthew Honnibal
|
e20106fdff
|
* Begin reorganizing neuralnet work
|
2015-06-30 14:26:32 +02:00 |
|
Matthew Honnibal
|
5cd3ed42d4
|
* Reenable averaging
|
2015-06-29 16:44:42 +02:00 |
|
Matthew Honnibal
|
894cbef8ba
|
* Wire eta and mu parameters up for neural net
|
2015-06-29 07:10:33 +02:00 |
|
Matthew Honnibal
|
3bb5876c5a
|
* Inline methods in StateClass
|
2015-06-29 01:10:14 +02:00 |
|
Matthew Honnibal
|
313a7f87b3
|
* Inline methods in StateClass
|
2015-06-29 01:06:28 +02:00 |
|
Matthew Honnibal
|
a02fd3af5d
|
* Check valency in L and R feature methods, to make feature calculation faster
|
2015-06-29 00:27:56 +02:00 |
|
Matthew Honnibal
|
5d870720bc
|
* Check valency in L and R feature methods, to make feature calculation faster
|
2015-06-29 00:17:29 +02:00 |
|
Matthew Honnibal
|
f4986d5d3c
|
* Use new Example class
|
2015-06-28 22:36:03 +02:00 |
|
Matthew Honnibal
|
735f1af91f
|
* Fix neural net stuff
|
2015-06-28 11:44:58 +02:00 |
|
Matthew Honnibal
|
e7003f1cf3
|
* Remove hard-coding of vector lengths
|
2015-06-28 11:37:17 +02:00 |
|
Matthew Honnibal
|
897dd0dd0b
|
* Merge changes, and adjust Example to use memoryview
|
2015-06-28 11:36:11 +02:00 |
|
Matthew Honnibal
|
9282a8e72c
|
* Prepare for new models to be plugged in by using Example class
|
2015-06-28 11:02:35 +02:00 |
|
Matthew Honnibal
|
75aeccc064
|
* Rejig parser interface to use new thinc.api.Example class, in prep of theano model. Comment out beam search
|
2015-06-28 11:02:34 +02:00 |
|
Matthew Honnibal
|
bf33598b34
|
* Work on a theano-driven model for the parser
|
2015-06-28 11:02:34 +02:00 |
|
Matthew Honnibal
|
bbef71f213
|
* Fix min function in fill_context
|
2015-06-28 10:46:39 +02:00 |
|
Matthew Honnibal
|
142b6f9510
|
* Revert last changes
|
2015-06-28 10:44:28 +02:00 |
|
Matthew Honnibal
|
b06962f18b
|
* Pad buffers in state
|
2015-06-28 10:36:14 +02:00 |
|
Matthew Honnibal
|
53be72387c
|
* Hack at fill_context to investigate performance loss
|
2015-06-28 10:34:28 +02:00 |
|
Matthew Honnibal
|
71a4e876a9
|
* Fix parse features
|
2015-06-28 09:27:33 +02:00 |
|
Matthew Honnibal
|
0c4b5a2bb0
|
* Start scoring tokens
|
2015-06-28 06:21:38 +02:00 |
|
Matthew Honnibal
|
5af500909c
|
* Remove unused directive from parser.pyx
|
2015-06-28 06:20:21 +02:00 |
|
Matthew Honnibal
|
d5b4090705
|
* Add profile directive
|
2015-06-28 06:19:33 +02:00 |
|
Matthew Honnibal
|
2b5421e60c
|
* Add profile directive
|
2015-06-28 06:07:04 +02:00 |
|
Matthew Honnibal
|
8b5de4a411
|
* Add word / tag / label sets, for use in neural net
|
2015-06-28 05:46:53 +02:00 |
|
Matthew Honnibal
|
cfcbd8d256
|
* Fix punctuation eval in scorer.py
|
2015-06-28 01:31:39 +02:00 |
|
Matthew Honnibal
|
ed40a8380e
|
* Remove hard-coding of vector lengths
|
2015-06-27 04:18:47 +02:00 |
|
Matthew Honnibal
|
ebe630cc8d
|
* Enable more features for NN
|
2015-06-27 04:17:29 +02:00 |
|
Matthew Honnibal
|
f8bb43475e
|
* Bridge to Theano working. Very disorganised. Using thinc adb60aba966ed2
|
2015-06-27 02:39:18 +02:00 |
|
Matthew Honnibal
|
2fe98b8a9a
|
* Prepare for new models to be plugged in by using Example class
|
2015-06-26 13:51:39 +02:00 |
|
Matthew Honnibal
|
6896455884
|
* Rejig parser interface to use new thinc.api.Example class, in prep of theano model. Comment out beam search
|
2015-06-26 06:25:36 +02:00 |
|
Matthew Honnibal
|
b266a63f2c
|
* Inc version of downloadable data
|
2015-06-24 04:53:08 +02:00 |
|
Matthew Honnibal
|
02b171ee67
|
* Bug fixes to edge calculation
|
2015-06-24 04:28:02 +02:00 |
|
Matthew Honnibal
|
a4e9bdf4c1
|
* Work on a theano-driven model for the parser
|
2015-06-24 01:02:40 +02:00 |
|
Matthew Honnibal
|
7f9384f53c
|
* Remove deprecated _state module
|
2015-06-23 17:28:24 +02:00 |
|
Matthew Honnibal
|
6dbe182491
|
* Fix merge conflicts
|
2015-06-23 17:28:00 +02:00 |
|
Matthew Honnibal
|
579735a095
|
* Remove import of _state module
|
2015-06-23 17:25:08 +02:00 |
|
Matthew Honnibal
|
88f55d136b
|
* Remove deprecated _state module
|
2015-06-23 17:19:51 +02:00 |
|
Matthew Honnibal
|
9ab9dd2bf7
|
* Clean up unused orig_arc_eager and tree_arc_eager modules, which were only added for EMNLP experiments
|
2015-06-23 17:17:33 +02:00 |
|
Matthew Honnibal
|
7ebfe4b983
|
* Fixes to edge features
|
2015-06-23 16:32:54 +02:00 |
|
Matthew Honnibal
|
7b125f5a86
|
* Fixes to edge features
|
2015-06-23 16:31:01 +02:00 |
|
Matthew Honnibal
|
8d4bbacfc5
|
* Fix edge navigation in Token objects
|
2015-06-23 16:07:34 +02:00 |
|
Matthew Honnibal
|
35c290bee4
|
* Fix edge features
|
2015-06-23 15:50:56 +02:00 |
|
Matthew Honnibal
|
221e2e485f
|
* Assign 'ROOT' as label, not 'root'
|
2015-06-23 15:09:54 +02:00 |
|
Matthew Honnibal
|
a7bf7b0626
|
* Rename sent_start to sent_end, to reflect its new usage in the Break transition
|
2015-06-23 05:39:43 +02:00 |
|
Matthew Honnibal
|
ee3e56f27b
|
* Fix bounds checking on entities
|
2015-06-23 04:35:08 +02:00 |
|
Matthew Honnibal
|
43ef5ddea5
|
* Ensure root label is spelled ROOT, for backwards compatibility
|
2015-06-23 04:14:03 +02:00 |
|
Matthew Honnibal
|
065c2e1d2d
|
* Add some bounds checking around state arrays
|
2015-06-23 04:13:09 +02:00 |
|
Matthew Honnibal
|
89ae218b75
|
* Add import to tokens.pyx from weird Cython compiler issue with casting from memory views
|
2015-06-23 03:04:34 +02:00 |
|
Matthew Honnibal
|
f01b3d043e
|
* Add padding to arrays in stateclass. May be papering over a deeper bug.
|
2015-06-23 03:03:41 +02:00 |
|
Matthew Honnibal
|
5e94b5d581
|
* Have Tokens return proper numpy arrays, not Cython views.
|
2015-06-23 00:07:34 +02:00 |
|
Matthew Honnibal
|
69507bc729
|
* Re-enable Break transition in arc_eager.pyx
|
2015-06-23 00:03:30 +02:00 |
|
Matthew Honnibal
|
cc579ed429
|
* Add __len__ function to StringStore
|
2015-06-23 00:02:50 +02:00 |
|
Matthew Honnibal
|
46fb24e9fd
|
* Add cycle-checking code in gold.pyx
|
2015-06-23 00:02:22 +02:00 |
|
Matthew Honnibal
|
60d26243e3
|
* Fix head alignment in read_conll.parse, which was causing corrupt parses when strip_bad_periods=True. A similar problem may apply to other data readers.
|
2015-06-18 16:35:27 +02:00 |
|
Matthew Honnibal
|
f868175e43
|
* Whitespace
|
2015-06-16 23:37:46 +02:00 |
|
Matthew Honnibal
|
ab110be125
|
* Remove debugging in parser.pyx
|
2015-06-16 23:37:25 +02:00 |
|
Matthew Honnibal
|
9b13d11ab3
|
* Fix handling of entities in StateClass
|
2015-06-16 23:35:21 +02:00 |
|
Matthew Honnibal
|
c40a2c661c
|
* Add tree_arc_eager
|
2015-06-15 08:23:24 +02:00 |
|
Matthew Honnibal
|
5da5cf7084
|
* Add some more features for S1/S0
|
2015-06-15 04:07:13 +02:00 |
|
Matthew Honnibal
|
8156a01bca
|
* Fix root label for orig_arc_eager
|
2015-06-15 02:54:55 +02:00 |
|
Matthew Honnibal
|
21930ede15
|
* Switch toggle on USE_ROOT_ARC_SEGMENT
|
2015-06-15 02:54:32 +02:00 |
|
Matthew Honnibal
|
38a6afa484
|
* Make possibly dubious correction to the unshift oracle
|
2015-06-15 02:50:00 +02:00 |
|
Matthew Honnibal
|
f66228f253
|
* Add some more features, esp for labels
|
2015-06-14 21:18:02 +02:00 |
|
Matthew Honnibal
|
3da8e0f317
|
* Add orig_arc_eager
|
2015-06-14 20:31:44 +02:00 |
|
Matthew Honnibal
|
ea8a103007
|
* Fix import of TransitionSystem in parser.pyx
|
2015-06-14 19:01:26 +02:00 |
|
Matthew Honnibal
|
e0984ca139
|
* Fix valency features in StateClass
|
2015-06-14 17:50:26 +02:00 |
|
Matthew Honnibal
|
e50ac1a47f
|
* Add verbose printing to scorer
|
2015-06-14 17:45:50 +02:00 |
|
Matthew Honnibal
|
763cbd23d5
|
* Upd stateclass.print_state
|
2015-06-14 17:44:29 +02:00 |
|
Matthew Honnibal
|
bdd07bf000
|
* Fix Break oracle, but disable the Break transition for now, while we finalize the gold-standard experiments
|
2015-06-14 17:44:03 +02:00 |
|
Matthew Honnibal
|
399f15fbdf
|
* Add flag to toggle handling of multi-root inputs without the Break transition. Clear up now unused best_valid stuff.
|
2015-06-14 00:28:37 +02:00 |
|
Matthew Honnibal
|
75289b4761
|
* Don't refuse to parse single token sentences, in case some transition system needs them, e.g. single word entity. Instead fix error in _init_state.
|
2015-06-13 22:55:55 +02:00 |
|
Matthew Honnibal
|
77d7e79c7e
|
* Fix r/l and distance features.
|
2015-06-12 13:06:15 +02:00 |
|
Matthew Honnibal
|
b643cb3d5c
|
* Allow training documents to be filtered in gold.pyx
|
2015-06-12 02:42:08 +02:00 |
|
Matthew Honnibal
|
15e177d7a1
|
* Fixes to unshift/fast-forward strategy. Getting 91.55 greedy on NW dev, gold preproc
|
2015-06-12 01:50:23 +02:00 |
|
Matthew Honnibal
|
afd77a529b
|
* Prepare for break transition, with fast-forwarding. 86.5 on 1k nw gold preproc
|
2015-06-10 14:08:30 +02:00 |
|
Matthew Honnibal
|
495f528709
|
* Add support for sentence breaks in stateclass
|
2015-06-10 12:34:28 +02:00 |
|
Matthew Honnibal
|
b7b18c279d
|
* Fix Reduce oracle. Getting 86.35
|
2015-06-10 11:33:39 +02:00 |
|
Matthew Honnibal
|
bb09b5d91a
|
* Fix shifted bit vector in stateclass --- should reflect whether the word has been *unshifted*.
|
2015-06-10 11:33:09 +02:00 |
|
Matthew Honnibal
|
aa9625f688
|
* Do non-monotonic Unshift. Every word can be shifted at most 1 time. When the Reduce move is used, if S0 has no head, we put the word back on the buffer. Gets 86.4 on nw 1k with gold pre-proc. Break transition not yet implemented for this.
|
2015-06-10 10:15:56 +02:00 |
|
Matthew Honnibal
|
7bf6b7de3e
|
* Add unshift action to StateClass, and track which moves have been shifted
|
2015-06-10 10:13:03 +02:00 |
|
Matthew Honnibal
|
f7c8069e65
|
* Fix bug in distance feature
|
2015-06-10 10:12:17 +02:00 |
|
Matthew Honnibal
|
abd07c067a
|
* Inline B and S methods on stateclass
|
2015-06-10 07:22:33 +02:00 |
|
Matthew Honnibal
|
e2f9a80713
|
* Remove old _state imports
|
2015-06-10 07:09:17 +02:00 |
|
Matthew Honnibal
|
e9aaecc619
|
* Remove from_struct method from StateClass
|
2015-06-10 06:58:27 +02:00 |
|
Matthew Honnibal
|
18cc326dc0
|
* Bug fixes to ner.pyx
|
2015-06-10 06:57:41 +02:00 |
|
Matthew Honnibal
|
e5570c9700
|
* Set nogil for oracle functions
|
2015-06-10 06:56:56 +02:00 |
|
Matthew Honnibal
|
4575e7a60f
|
* Fix beam search with new StateClass
|
2015-06-10 06:33:39 +02:00 |
|
Matthew Honnibal
|
04b1cd9b8c
|
* Greedy parsing working with new StateClass. Beam parsing broken
|
2015-06-10 04:20:23 +02:00 |
|
Matthew Honnibal
|
6a94b64eca
|
* Remove State* from parser.pyx entirely, switching over to StateClass. Beam parsing still untested.
|
2015-06-10 02:03:38 +02:00 |
|
Matthew Honnibal
|
f14a1526aa
|
* Remove version of fill_context that takes State*
|
2015-06-10 01:39:07 +02:00 |
|
Matthew Honnibal
|
d68c686ec1
|
* Move StateClass into interface of transition functions
|
2015-06-10 01:35:28 +02:00 |
|
Matthew Honnibal
|
4b98b3e9c8
|
* Cost functions now take StateClass argument, instead of State*.
|
2015-06-10 00:40:43 +02:00 |
|
Matthew Honnibal
|
e0cf61f591
|
* Move StateClass into the interface for is_valid
|
2015-06-09 23:23:28 +02:00 |
|
Matthew Honnibal
|
0895d454fb
|
* Prepare to switch to using state class, instead of state struct
|
2015-06-09 21:20:14 +02:00 |
|
Matthew Honnibal
|
2b9629ed62
|
* Begin adding stateclass to ArcEager
|
2015-06-09 01:41:09 +02:00 |
|
Matthew Honnibal
|
ba10fd8af5
|
* Add StateClass, to replace/refactor the mess in _state
|
2015-06-09 01:39:54 +02:00 |
|
Matthew Honnibal
|
c7e3dfc1dc
|
* Don't automatically push words when stack is empty, as it messes up beam parsing. Add hash method to beam state.
|
2015-06-08 14:49:04 +02:00 |
|
Matthew Honnibal
|
00a0dfcb59
|
* Avoid shipping the spacy.munge package
|
2015-06-08 00:54:13 +02:00 |
|
Matthew Honnibal
|
7d265a9c62
|
* Revert to wget in spacy.en.download
|
2015-06-08 00:48:56 +02:00 |
|
Matthew Honnibal
|
a8fc5f1285
|
* Fix munge/read_ner
|
2015-06-08 00:35:04 +02:00 |
|
Matthew Honnibal
|
1515862861
|
* Fix download.py
|
2015-06-08 00:08:05 +02:00 |
|
Matthew Honnibal
|
7e9e8f654a
|
* Use urllib in spacy.en.download
|
2015-06-07 23:51:38 +02:00 |
|
Matthew Honnibal
|
80cff41a9c
|
* Upd download.py
|
2015-06-07 19:13:28 +02:00 |
|
Matthew Honnibal
|
6e2564239d
|
* Bug fixes to beam parser. Search still broken on non-gold sentences
|
2015-06-07 19:12:59 +02:00 |
|
Matthew Honnibal
|
1ec4e6fc95
|
* Don't score whitespace tokens
|
2015-06-07 19:10:32 +02:00 |
|
Matthew Honnibal
|
731e5f1e46
|
* Add get() function in spacy/syntax/Config
|
2015-06-07 19:09:15 +02:00 |
|
Matthew Honnibal
|
8f142c1838
|
* Refactor transition system oracles, to split out move and label cost. Preparing to add Unshift move. Will exclude non-monotonic.
|
2015-06-07 03:21:29 +02:00 |
|
Matthew Honnibal
|
89b8775887
|
* Fix output from _min_edit_path when inputs match.
|
2015-06-06 05:58:53 +02:00 |
|
Matthew Honnibal
|
98cfd84123
|
* Remove hyphenation from main tokenizer loop: do it in infix.txt instead. This lets emoticons work
|
2015-06-06 05:57:03 +02:00 |
|
Matthew Honnibal
|
1fee7ade61
|
* Tweak to ner
|
2015-06-05 23:48:43 +02:00 |
|
Matthew Honnibal
|
33e70b167f
|
* Remove dead code from ner.pyx
|
2015-06-05 17:12:47 +02:00 |
|
Matthew Honnibal
|
88ac5c6e98
|
* Send beam_width < 0 to greedy parser
|
2015-06-05 17:12:06 +02:00 |
|
Matthew Honnibal
|
0114e7600d
|
* Fix NER oracle
|
2015-06-05 17:11:26 +02:00 |
|
Matthew Honnibal
|
c04e6ebca6
|
* Allow user to load different sized vectors.
|
2015-06-05 16:26:39 +02:00 |
|
Matthew Honnibal
|
6bf35cecc3
|
* Refactor transition system to use classes with staticmethods.
|
2015-06-05 02:27:17 +02:00 |
|
Matthew Honnibal
|
36a34d544b
|
* Refactoring arc_eager, grouping oracle functions into transitions
|
2015-06-04 22:43:03 +02:00 |
|
Matthew Honnibal
|
4433396005
|
* Improve efficiency of dynamic oracle, making beam training faster
|
2015-06-04 21:15:14 +02:00 |
|
Matthew Honnibal
|
079dad28a7
|
* Update for faster beam training
|
2015-06-04 19:32:32 +02:00 |
|
Matthew Honnibal
|
f8843906ad
|
Merge branch 'constituency'
Add beam parsing and training from JSON files, with Levenshtein alignment.
|
2015-06-03 06:07:24 +02:00 |
|
Matthew Honnibal
|
ae653b850a
|
* Remove unused import from gold.pyx
|
2015-06-03 06:07:15 +02:00 |
|
Matthew Honnibal
|
a2627b6102
|
* Fix bug in refactored init_transition
|
2015-06-03 06:01:26 +02:00 |
|
Matthew Honnibal
|
dd0867645d
|
* Remove stray const from State header
|
2015-06-03 00:10:04 +02:00 |
|
Matthew Honnibal
|
6c47b10a6e
|
* Make optimization to children_in_buffer: stop searching when we would cross a bracket.
|
2015-06-02 21:05:24 +02:00 |
|
Matthew Honnibal
|
a513ec500f
|
* Have oracle functions take a struct instead of a Python object
|
2015-06-02 20:01:06 +02:00 |
|
Matthew Honnibal
|
d1b55310a1
|
* Refactor _advance_beam function
|
2015-06-02 18:38:41 +02:00 |
|
Matthew Honnibal
|
0786d9b3c7
|
* Refactor TransitionSystem, adding set_valid method
|
2015-06-02 18:38:07 +02:00 |
|
Matthew Honnibal
|
bd82a49994
|
* Add set_scores method to Model
|
2015-06-02 18:37:10 +02:00 |
|
Matthew Honnibal
|
a3964957f6
|
* Add profiling for _state.pyx
|
2015-06-02 18:36:27 +02:00 |
|
Matthew Honnibal
|
e822df0867
|
* Fix bugs in new greedy/beam parser
|
2015-06-02 02:01:33 +02:00 |
|
Matthew Honnibal
|
66dfa95847
|
* Revise greedy_parse/beam_parse ownership goof
|
2015-06-02 01:34:19 +02:00 |
|
Matthew Honnibal
|
75658b2ed3
|
* Remove use of new beam.loss property, to maintain compatibility with older versions of thinc for now.
|
2015-06-02 00:57:09 +02:00 |
|
Matthew Honnibal
|
7c29362d60
|
* Rename parser class in parser.pxd, now that beam parsing is supported
|
2015-06-02 00:53:49 +02:00 |
|
Matthew Honnibal
|
58d5ac0944
|
* Add beam search capabilities to Parser. Rename GreedyParser to Parser.
|
2015-06-02 00:28:02 +02:00 |
|
Matthew Honnibal
|
62424e6c76
|
* Remove unused regularize argument from _ml.Model
|
2015-06-02 00:27:07 +02:00 |
|
Matthew Honnibal
|
adeb57cb1e
|
* Fix long line
|
2015-06-01 23:07:00 +02:00 |
|
Matthew Honnibal
|
e09a08bd00
|
* Add copy_state function
|
2015-06-01 23:06:30 +02:00 |
|
Matthew Honnibal
|
c7876aa8b6
|
* Add get_valid method
|
2015-06-01 23:06:00 +02:00 |
|
Matthew Honnibal
|
d82f9d958d
|
* Remove regularization cruft from _ml, move score from .pxd file to .pyx
|
2015-05-31 18:48:05 +02:00 |
|
Matthew Honnibal
|
5e99ff94c8
|
* Edits to arc eager oracle. Couldn't figure out how the non-monotonic lines made sense. They seem covered by children_in_stack
|
2015-05-31 15:14:37 +02:00 |
|
Matthew Honnibal
|
6c5632b71c
|
* Roll back proposed change to Break transition while investigating effect
|
2015-05-31 06:49:52 +02:00 |
|
Matthew Honnibal
|
6bba793df3
|
* Disable the Zipf-reweighting thing while investigating effect
|
2015-05-31 06:48:43 +02:00 |
|
Matthew Honnibal
|
e77940565d
|
* Add length cap to distance feature
|
2015-05-31 05:25:30 +02:00 |
|
Matthew Honnibal
|
fd596351ba
|
* Fix valency features
|
2015-05-31 05:24:33 +02:00 |
|
Matthew Honnibal
|
87d6551d19
|
* Allow gold parse to cut non-projective arcs
|
2015-05-31 01:11:56 +02:00 |
|
Matthew Honnibal
|
c4f0914b4e
|
* Fix POS tag evaluation in scorer.py: do evaluate punctuation tags
|
2015-05-30 18:24:32 +02:00 |
|
Matthew Honnibal
|
9e39a206da
|
* Fix efficiency of JSON reading, by using ujson instead of stream
|
2015-05-30 17:54:52 +02:00 |
|
Matthew Honnibal
|
76300bbb1b
|
* Use updated JSON format, with sentences below paragraphs. Allows use of gold preprocessing flag.
|
2015-05-30 01:25:46 +02:00 |
|
Matthew Honnibal
|
b76bbbd12c
|
* Read json files recursively from a directory, instead of requiring a single .json file
|
2015-05-29 03:52:55 +02:00 |
|
Matthew Honnibal
|
8f31d3b864
|
* Relax constraint on Break transition for non-monotonic parsing.
|
2015-05-28 23:39:52 +02:00 |
|
Matthew Honnibal
|
6b2e5c4b8a
|
* Avoid NER scoring for sentences with some missing NER values.
|
2015-05-28 22:39:08 +02:00 |
|
Matthew Honnibal
|
d25d31442d
|
* Hackishly support broken NER annotations. Should fix this.
|
2015-05-27 19:14:31 +02:00 |
|
Matthew Honnibal
|
7a2725bca4
|
* Read input json in a streaming way
|
2015-05-27 19:13:11 +02:00 |
|
Matthew Honnibal
|
6a1c91675e
|
* Add file to read ENAMEX ner data
|
2015-05-27 17:36:23 +02:00 |
|
Matthew Honnibal
|
732fa7709a
|
* Edits to align_raw script, for use in prepare_treebank
|
2015-05-27 04:23:31 +02:00 |
|
Matthew Honnibal
|
4010b9b6d9
|
* Pass parameter for regularization in parser.pyx
|
2015-05-27 03:18:50 +02:00 |
|
Matthew Honnibal
|
4c6058baa7
|
* Fix evaluation of NER in scorer.py
|
2015-05-27 03:18:16 +02:00 |
|
Matthew Honnibal
|
6016ee83a6
|
* Fix reading of NER in gold.pyx
|
2015-05-27 03:17:50 +02:00 |
|
Matthew Honnibal
|
04bda8648d
|
* Pass parameter for regularization to model
|
2015-05-27 03:16:58 +02:00 |
|
Matthew Honnibal
|
f69fe6a635
|
* Fix heads problem in read_conll
|
2015-05-27 01:14:54 +02:00 |
|
Matthew Honnibal
|
0eec1d12af
|
* Add comment about zipf reweighting
|
2015-05-27 01:14:07 +02:00 |
|
Matthew Honnibal
|
4d37b66c55
|
* Make Zipf regularization a bit more efficient
|
2015-05-27 01:12:50 +02:00 |
|
Matthew Honnibal
|
7fc24821bc
|
* Experiment with Zipfian corruptions when calculating prediction
|
2015-05-26 22:17:15 +02:00 |
|
Matthew Honnibal
|
eba7b34f66
|
* Add flag to disable loading of word vectors
|
2015-05-25 01:02:42 +02:00 |
|
Matthew Honnibal
|
3593babd35
|
* Add functions for Levenshtein distance alignment
|
2015-05-24 21:50:48 +02:00 |
|
Matthew Honnibal
|
744f06abf5
|
* Add script to read OntoNotes source documents
|
2015-05-24 21:49:58 +02:00 |
|
Matthew Honnibal
|
fc75210941
|
* Move spacy.syntax.conll to spacy.gold
|
2015-05-24 21:35:02 +02:00 |
|
Matthew Honnibal
|
765b61cac4
|
* Update spacy.scorer, to use P/R/F to support tokenization errors
|
2015-05-24 20:07:18 +02:00 |
|
Matthew Honnibal
|
efe7a7d7d6
|
* Clean unused functions from spacy.syntax.conll
|
2015-05-24 20:06:46 +02:00 |
|
Matthew Honnibal
|
78487f3e66
|
* Update parser oracle for missing heads
|
2015-05-24 20:05:58 +02:00 |
|
Matthew Honnibal
|
1044a13413
|
* Begin refactoring scorer to use recall over gold dependencies
|
2015-05-24 17:40:15 +02:00 |
|
Matthew Honnibal
|
acd1245ad4
|
* Remove cruft from conll.pyx --- unused stuff about evaluation, which now lives in spacy.scorer
|
2015-05-24 17:35:49 +02:00 |
|
Matthew Honnibal
|
20f1d868a3
|
* Tmp commit. Working on whole document parsing
|
2015-05-24 02:49:56 +02:00 |
|
Matthew Honnibal
|
f2ee9c4feb
|
* Comment out constituency parsing stuff, so that code compiles
|
2015-05-20 16:55:05 +02:00 |
|
Matthew Honnibal
|
8ee7c541f1
|
* Update Constituent definition
|
2015-05-20 16:03:26 +02:00 |
|
Matthew Honnibal
|
9dfc9c039c
|
* Work on constituency parsing.
|
2015-05-20 16:02:51 +02:00 |
|
Matthew Honnibal
|
5a5710e711
|
* Fix Span.subtree property
|
2015-05-13 21:53:15 +02:00 |
|
Matthew Honnibal
|
badf030b6c
|
* Add parse navigation to Span objects
|
2015-05-13 21:45:19 +02:00 |
|
Matthew Honnibal
|
ca320afe86
|
* Add docstring for ents attribute
|
2015-05-13 21:20:47 +02:00 |
|
Matthew Honnibal
|
ba07b925a7
|
* Fix compile error in conll.pyx
|
2015-05-12 22:33:47 +02:00 |
|
Matthew Honnibal
|
f1e0272b18
|
* Disable c-parsing transitions
|
2015-05-12 22:33:25 +02:00 |
|
Matthew Honnibal
|
03a6626545
|
* Tmp commit
|
2015-05-12 20:27:56 +02:00 |
|
Matthew Honnibal
|
9568ebed08
|
* Fix off-by-one in head reading
|
2015-05-12 20:27:56 +02:00 |
|
Matthew Honnibal
|
69840d8cc3
|
* Tweak verbose output printing in scorer.py
|
2015-05-12 20:27:56 +02:00 |
|
Matthew Honnibal
|
0605af6838
|
* Fix head misalignment in read_conll, when periods are ignored
|
2015-05-12 20:27:56 +02:00 |
|
Matthew Honnibal
|
d2ac8d8007
|
* Add ctnt field to State, in preparation for constituency parsing
|
2015-05-12 20:27:56 +02:00 |
|
Matthew Honnibal
|
ab67693393
|
* Add read_json_file to conll.pyx
|
2015-05-12 20:27:55 +02:00 |
|
Matthew Honnibal
|
aff9359a8d
|
* Update ner.pyx to expect brackets from gold_tuples
|
2015-05-12 20:27:55 +02:00 |
|
Matthew Honnibal
|
0ad72a77ce
|
* Write JSON files, with both dependency and PSG parses
|
2015-05-12 20:27:55 +02:00 |
|
Matthew Honnibal
|
d48218f4b2
|
* Add left_edge and right_edge properties
|
2015-05-12 20:27:55 +02:00 |
|
Matthew Honnibal
|
53cf77e1c8
|
* Bug fix: when non-monotonically correcting a dependency, make sure to delete the old one from the child list
|
2015-05-12 20:26:41 +02:00 |
|
Matthew Honnibal
|
a4e2af54f9
|
* Add support for l/r edge to add_dep, and move inlined methods into _state.pyx where possible
|
2015-05-12 20:26:41 +02:00 |
|
Matthew Honnibal
|
d634038eb6
|
* Add l_edge and r_edge props in TokenC for tracking the parse-yield of the token
|
2015-05-12 20:26:41 +02:00 |
|
Matthew Honnibal
|
03ebf70a66
|
* Inc version to 0.84
|
2015-05-12 02:38:51 +02:00 |
|
Matthew Honnibal
|
e73eaf2d05
|
* Replace some assertions with proper errors
|
2015-05-08 16:52:17 +02:00 |
|
Matthew Honnibal
|
fb8d50b3d5
|
Merge branch 'master' of ssh://github.com/honnibal/spaCy
|
2015-04-30 12:45:15 +02:00 |
|
Matthew Honnibal
|
ed8e8c3bd0
|
* Whitespace
|
2015-04-29 14:22:47 +02:00 |
|
Matthew Honnibal
|
378c2a6435
|
* Fix POS model: make it use tag instead of pos in history features
|
2015-04-29 00:02:53 +02:00 |
|
Matthew Honnibal
|
763ef01575
|
* Fix two bugs in feature calculation
|
2015-04-28 23:25:09 +02:00 |
|
Matthew Honnibal
|
b3fd48c97b
|
* Fix missing root labels bug identified in Issue #57
|
2015-04-28 20:45:51 +02:00 |
|
Jordan Suchow
|
3a8d9b37a6
|
Remove trailing whitespace
|
2015-04-19 13:01:38 -07:00 |
|
Jordan Suchow
|
5f0f940a1f
|
Remove unused imports
|
2015-04-19 01:05:22 -07:00 |
|
Matthew Honnibal
|
cc4e395927
|
* Add some ad hoc regexes, for multi-word location prepositions
|
2015-04-17 04:44:24 +02:00 |
|
Matthew Honnibal
|
f7ffd94e6a
|
* Add Token.conjuncts property
|
2015-04-17 01:40:53 +02:00 |
|
Matthew Honnibal
|
684d0e5e85
|
* Download updated data
|
2015-04-16 04:29:15 +02:00 |
|
Matthew Honnibal
|
2ef170a991
|
* Fix Issue #54: Error merging multi-word token when there's a mid-token match.
|
2015-04-16 04:28:06 +02:00 |
|
Matthew Honnibal
|
42617548af
|
* Disable merge_mwes by default
|
2015-04-16 04:20:31 +02:00 |
|
Matthew Honnibal
|
99dbf8a38c
|
* Fix error type in lookup_transition
|
2015-04-16 01:36:22 +02:00 |
|
Matthew Honnibal
|
77d0700caf
|
* Add on X way regexes
|
2015-04-16 01:35:46 +02:00 |
|
Matthew Honnibal
|
9f16848b60
|
* Add (N0w, N1w) unigram pair to NER features, prompted by failure to detect 'this weekend'
|
2015-04-15 06:01:18 +02:00 |
|
Matthew Honnibal
|
c6707778dd
|
* Fix Issue #51: Handle non-ascii lemmas correctly
|
2015-04-13 22:28:59 +02:00 |
|
Matthew Honnibal
|
bf0aff5124
|
* Fix bug in Tokens.ents where entity wasn't being emitted if another started immediately after
|
2015-04-13 21:34:33 +02:00 |
|
Matthew Honnibal
|
2b84a90bbb
|
* Fix Issue #50: Python 3 compatibility of v0.80
|
2015-04-13 05:59:43 +02:00 |
|
Matthew Honnibal
|
fbd48c571d
|
* Rearrange code in tokens.pyx
|
2015-04-13 05:41:25 +02:00 |
|
Matthew Honnibal
|
507048dc45
|
* Rename StandardError to Exception, for Python 3 compatibility
|
2015-04-12 07:28:34 +02:00 |
|
Matthew Honnibal
|
761a19113a
|
* Fix /tmp moving thing in download.py
|
2015-04-12 07:04:10 +02:00 |
|
Matthew Honnibal
|
248a2b4b0f
|
* Remove Spans class
|
2015-04-12 04:07:29 +02:00 |
|
Matthew Honnibal
|
1d05e6da00
|
* Add ne_iob and ne_type features to NER
|
2015-04-10 19:07:08 +02:00 |
|
Matthew Honnibal
|
4df8a3d90f
|
* Add ne_iob and ne_type attributes to context vector
|
2015-04-10 05:02:15 +02:00 |
|
Matthew Honnibal
|
8c354c432b
|
* Add ValueError condition to ner_tag reading
|
2015-04-10 04:59:59 +02:00 |
|
Matthew Honnibal
|
435cccf098
|
* Add read_conll03_file function to conll.pyx
|
2015-04-10 04:59:11 +02:00 |
|
Matthew Honnibal
|
99c9ecfc18
|
* Fix bug in prefix, suffix and word shape features in parser and NER
|
2015-04-10 03:53:33 +02:00 |
|
Matthew Honnibal
|
cff2b13fef
|
* Fix Issue #44: Broken Token.string attribute when single word sentence
|
2015-04-07 06:08:25 +02:00 |
|
Matthew Honnibal
|
6640386b25
|
* Fix Issue #43: TAG attr not supported. Also add DEP attr, while I'm at it. Need better way of ensuring future changes don't break in similar way.
|
2015-04-07 06:00:57 +02:00 |
|
Matthew Honnibal
|
b64b2bd910
|
* Fix Issue #43: TAG attr not supported. Also add DEP attr, while I'm at it. Need better way of ensuring future changes don't break in similar way.
|
2015-04-07 06:00:30 +02:00 |
|
Matthew Honnibal
|
f9e510a893
|
* Whitespace
|
2015-04-07 04:53:59 +02:00 |
|
Matthew Honnibal
|
66c7ccf6cc
|
* Fix Spans.orth_
|
2015-04-07 04:53:40 +02:00 |
|
Matthew Honnibal
|
b8d34531c4
|
* Add support for units to English.__init__, by loading and applying regular expressions
|
2015-04-07 04:02:32 +02:00 |
|
Matthew Honnibal
|
0ea5af88b6
|
* Add multi-word expression RegexMatcher
|
2015-04-07 03:45:40 +02:00 |
|
Matthew Honnibal
|
2fee67cfa3
|
* Add regular expressions for English multi-word expressions
|
2015-04-07 03:45:18 +02:00 |
|
Matthew Honnibal
|
5a075ea3fc
|
* Ensure NER moves are available for single-word tokens
|
2015-04-05 22:30:58 +02:00 |
|
Matthew Honnibal
|
a60a366b2c
|
* Support 'punct' dep label in conll.pyx
|
2015-04-05 22:30:19 +02:00 |
|
Matthew Honnibal
|
021c972137
|
* Print parse if verbose in scorer
|
2015-04-05 22:29:30 +02:00 |
|
Matthew Honnibal
|
fbf19049cf
|
* Add ent_type_ property
|
2015-03-31 02:01:29 +02:00 |
|
Matthew Honnibal
|
e70b87efeb
|
* Add merge() method to Tokens, with fairly brittle/hacky implementation, but quite easy to test. Passing minimal tests. Still need to fix left/right deps in C data
|
2015-03-30 01:37:41 +02:00 |
|
Matthew Honnibal
|
557856e84c
|
* Allow regular expressions to specify labels for merged spans
|
2015-03-27 17:40:52 +01:00 |
|
Matthew Honnibal
|
a3af6b7c3d
|
* Left-Arc from Root, to allow non-monotonic reduce to compete with left-arc when the stack is not empty.
|
2015-03-27 17:39:16 +01:00 |
|
Matthew Honnibal
|
db5a43318c
|
* Improve print_state debug printer
|
2015-03-27 17:29:58 +01:00 |
|
Matthew Honnibal
|
1705eccbbe
|
* Remove whitespace
|
2015-03-27 15:22:39 +01:00 |
|
Matthew Honnibal
|
3feb52374c
|
* Break apart a condition, for ease of debug printing
|
2015-03-27 15:21:38 +01:00 |
|
Matthew Honnibal
|
b32f581acb
|
* Fix bug in ArcEager.get_labels
|
2015-03-27 15:21:06 +01:00 |
|
Matthew Honnibal
|
5f2a4ff36d
|
* Fix spans.lemma_
|
2015-03-26 16:45:38 +01:00 |
|
Matthew Honnibal
|
f4cc222ec3
|
* Fix NER scoring
|
2015-03-26 16:45:38 +01:00 |
|
Matthew Honnibal
|
1320bd19db
|
* Move Span class to own file
|
2015-03-26 16:45:38 +01:00 |
|
Matthew Honnibal
|
6f47a667cf
|
* Move Span class to own file
|
2015-03-26 16:45:38 +01:00 |
|
Matthew Honnibal
|
f02c39dfaf
|
* Compare to is not None, for more robustness
|
2015-03-26 16:44:48 +01:00 |
|
Matthew Honnibal
|
8f68b864c4
|
* Move Span/Spans to separate files. Currently duplicates lots of Tokens functionality. Should probably be integrated into Tokens
|
2015-03-26 16:44:48 +01:00 |
|
Matthew Honnibal
|
e854ba0a13
|
* Remove support for force_gold flag from GreedyParser, since it's not so useful, and it's clutter
|
2015-03-26 16:44:47 +01:00 |
|
Matthew Honnibal
|
6a6085f8b9
|
* Clean up GreedyParser.train function a bit
|
2015-03-26 16:44:47 +01:00 |
|
Matthew Honnibal
|
b3157927e6
|
* Clean up unused feature templates
|
2015-03-26 16:44:47 +01:00 |
|
Matthew Honnibal
|
411bf377d4
|
* Remove dependency on ner_util module
|
2015-03-26 16:44:47 +01:00 |
|
Matthew Honnibal
|
01c892f583
|
* Add comment to fill_context
|
2015-03-26 16:44:47 +01:00 |
|
Matthew Honnibal
|
2741179aff
|
* Important bug fix: Fill token N2w, which was being unfilled, after a bad edit while writing the NER features.
|
2015-03-26 16:44:47 +01:00 |
|
Matthew Honnibal
|
2b2dec95d3
|
* Add comment to set_parse
|
2015-03-26 16:44:47 +01:00 |
|
Matthew Honnibal
|
e770fade1e
|
* Don't set dependency labels in set_parse, as this may be used by the Entity recogniser instead. Need to clean this method up...
|
2015-03-26 16:44:47 +01:00 |
|
Matthew Honnibal
|
71648205d9
|
* Add support for debug feature set. Just use unigrams for this.
|
2015-03-26 16:44:47 +01:00 |
|
Matthew Honnibal
|
3b70b304b2
|
* Add words to gold_tuples from gold conll file
|
2015-03-26 16:44:47 +01:00 |
|
Matthew Honnibal
|
2e12dec76e
|
* Adjust scorer to account for tokenization mistakes
|
2015-03-26 16:44:47 +01:00 |
|
Matthew Honnibal
|
05d6065e2e
|
* Add assertion
|
2015-03-26 16:44:46 +01:00 |
|
Matthew Honnibal
|
377e9b29b1
|
* Whitespace
|
2015-03-26 16:44:46 +01:00 |
|
Matthew Honnibal
|
670959f40c
|
* Fix iteration order on Tokens.rights
|
2015-03-26 16:44:46 +01:00 |
|
Matthew Honnibal
|
231ce2dae5
|
* Assign ROOT label by default. May be papering over another bug.
|
2015-03-26 16:44:46 +01:00 |
|
Matthew Honnibal
|
9f4ad8fdfb
|
* Assign root words the ROOT label via the Break transition. Something is still wrong here...
|
2015-03-26 16:44:46 +01:00 |
|
Matthew Honnibal
|
f729164c01
|
* Fix bug in label assignment: ensure null-label transitions receive the label 0
|
2015-03-26 16:44:46 +01:00 |
|
Matthew Honnibal
|
7237c805c7
|
* Load tag for specials.json token
|
2015-03-26 16:44:46 +01:00 |
|
Matthew Honnibal
|
567388e38d
|
* Use values encoded by StringStore in POS tagging, rather than indices into a list of tags
|
2015-03-26 16:44:45 +01:00 |
|
Matthew Honnibal
|
3105c7f8ba
|
* Don't pass label_ids dict to Tokens, since we now use the StringStore to manage string-to-int mapping for labels
|
2015-03-26 16:44:45 +01:00 |
|
Matthew Honnibal
|
801bf14f4f
|
* Clean up handling of dep_strings and ent_strings, using StringStore to encode the label names.
|
2015-03-26 16:44:45 +01:00 |
|
Matthew Honnibal
|
31fad99518
|
* Use StringStore to encode label names, instead of label_ids
|
2015-03-26 16:44:45 +01:00 |
|
Matthew Honnibal
|
64db61bff1
|
* Add Span class to Python API
|
2015-03-26 16:44:45 +01:00 |
|
Matthew Honnibal
|
b9b695fb1b
|
* Remove debug word list
|
2015-03-26 16:44:45 +01:00 |
|
Matthew Honnibal
|
f21ab2d7fb
|
* Fix bug in ugly ent_strings hack on English class
|
2015-03-26 16:44:45 +01:00 |
|
Matthew Honnibal
|
1c843934be
|
* Fix oracle bug in NER. Now getting 77% F on ontonotes
|
2015-03-26 16:44:44 +01:00 |
|
Matthew Honnibal
|
903f196b3f
|
* Fix verbose printing for scorer
|
2015-03-26 16:44:44 +01:00 |
|
Matthew Honnibal
|
e181c051d5
|
* Improve features for NER
|
2015-03-26 16:44:44 +01:00 |
|
Matthew Honnibal
|
7ecb52c0ed
|
* Add scorer script
|
2015-03-26 16:44:44 +01:00 |
|
Matthew Honnibal
|
8057a95f20
|
* NER seems to be working, scoring 69 F. Need to add decision-history features --- currently only use current word, 2 words context. Need refactoring.
|
2015-03-26 16:44:44 +01:00 |
|
Matthew Honnibal
|
ae235e07b9
|
* Refactoring working for parser, but now need to rig up features for NER, and then debug oracle etc.
|
2015-03-26 16:44:44 +01:00 |
|
Matthew Honnibal
|
b3eda03c9c
|
* Tmp
|
2015-03-26 16:44:44 +01:00 |
|
Matthew Honnibal
|
220ce8bfed
|
* Prepare English class for NER
|
2015-03-26 16:44:44 +01:00 |
|
Matthew Honnibal
|
f5830dc1c1
|
* Remove _transitions.pyx
|
2015-03-26 16:44:44 +01:00 |
|
Matthew Honnibal
|
6865c2fb4d
|
* Fix assignment of dep strings in tokens.pyx
|
2015-03-26 16:44:43 +01:00 |
|
Matthew Honnibal
|
6b6bce9e7a
|
* Fix label loading for transition system
|
2015-03-26 16:44:43 +01:00 |
|
Matthew Honnibal
|
5278c7504b
|
* Hacks to conll.pyx. Should clean these up.
|
2015-03-26 16:44:43 +01:00 |
|
Matthew Honnibal
|
f321b2b2eb
|
* Remove TODO comment
|
2015-03-26 16:44:43 +01:00 |
|
Matthew Honnibal
|
fdabd93bfb
|
* Ensure high loss for invalid moves, and fix label reading for arc-eager
|
2015-03-26 16:44:43 +01:00 |
|
Matthew Honnibal
|
10ed738df2
|
* Tmp commit
|
2015-03-26 16:44:43 +01:00 |
|
Matthew Honnibal
|
4f83c9b3d5
|
* Make costs label-sensitive
|
2015-03-26 16:44:43 +01:00 |
|
Matthew Honnibal
|
179b7eb0a7
|
* Specify parser transition system in language
|
2015-03-26 16:44:43 +01:00 |
|
Matthew Honnibal
|
8c883cef58
|
* Refactored transition system code now compiling. Still need to hook up label oracle, and test
|
2015-03-26 16:44:43 +01:00 |
|
Matthew Honnibal
|
f0159ab4b6
|
* Add file to hold GoldParse class
|
2015-03-26 16:44:42 +01:00 |
|
Matthew Honnibal
|
8eadb984cb
|
* Refactor arc_eager to use new TransitionSystem base class. Need to fix oracle
|
2015-03-26 16:44:42 +01:00 |
|
Matthew Honnibal
|
b063001596
|
* Add base TransitionSystem class. Still need to rethink how non-monotonic labelling will work for best_valid
|
2015-03-26 16:44:42 +01:00 |
|
Matthew Honnibal
|
01bc4d6815
|
* Add set_parse method, to assign parse to tokens in a less hacky way.
|
2015-03-26 16:44:42 +01:00 |
|
Matthew Honnibal
|
dc986dbc0b
|
* Work on refactored parser, where TransitionSystem can be easily subclassed
|
2015-03-26 16:44:42 +01:00 |
|
Matthew Honnibal
|
1cc6329b18
|
* Add base class to do transitions
|
2015-03-26 16:44:42 +01:00 |
|
Matthew Honnibal
|
135756ac3d
|
* Tmp commit of NER refactoring
|
2015-03-26 16:44:42 +01:00 |
|
Matthew Honnibal
|
23c1f6fc04
|
* Merge changes from stash
|
2015-03-26 16:44:41 +01:00 |
|
Matthew Honnibal
|
0ff078876a
|
* Commit some work on ner.pyx done on the plane
|
2015-03-26 16:44:41 +01:00 |
|
Matthew Honnibal
|
d81b7be6a2
|
* Merge train.py
|
2015-03-26 16:44:41 +01:00 |
|
Matthew Honnibal
|
2e3dc3dfe2
|
* Merge changes in tokens.pyx
|
2015-03-26 16:44:41 +01:00 |
|
Matthew Honnibal
|
8cc3524dc9
|
* Ws
|
2015-03-26 16:44:41 +01:00 |
|
Matthew Honnibal
|
3d0570685c
|
* Add NER transition system
|
2015-03-26 16:44:41 +01:00 |
|
Matthew Honnibal
|
043b758cf4
|
* Resurrect old NER code. This version won't be the one that runs; we want to re-use the parser code. But for now this is a useful reference.
|
2015-03-26 16:44:41 +01:00 |
|
Matthew Honnibal
|
b139aa92ba
|
* Start setting out how NER will be implemented in the data model
|
2015-03-26 16:44:41 +01:00 |
|
Matthew Honnibal
|
0962ffc095
|
* Fix issue #37: missing check_flag attribute from Token class
|
2015-03-26 15:06:26 +01:00 |
|
Matthew Honnibal
|
2e8d0e5d45
|
* Upd download script
|
2015-03-03 05:47:16 -05:00 |
|
Matthew Honnibal
|
dbe26f5793
|
* Add children and subtree methods to Token, which are generators to assist parse-tree navigation.
|
2015-03-03 04:18:41 -05:00 |
|
Matthew Honnibal
|
ea90d136e8
|
* Fix bug in labelled parsing, that caused an 8% drop in labelled accuracy.
|
2015-02-27 03:56:10 -05:00 |
|
Matthew Honnibal
|
caf046b220
|
* Hastily add method to apply tags from a list of strings, instead of predicting the tags.
|
2015-02-23 15:40:17 -05:00 |
|
Matthew Honnibal
|
cae077b583
|
* Work on fixing orphaned Token objects bug
|
2015-02-16 15:20:31 -05:00 |
|
Matthew Honnibal
|
7572e31f5e
|
* Pass ownership of C data to Token instances if Tokens object is being garbage-collected, but Token instances are staying alive.
|
2015-02-11 18:05:06 -05:00 |
|
Matthew Honnibal
|
64645a1c2f
|
* Improve docstring on English
|
2015-02-11 15:13:20 -05:00 |
|
Matthew Honnibal
|
594e50bd45
|
* Add option to download speech-parsing data set.
|
2015-02-11 14:20:29 -05:00 |
|
Matthew Honnibal
|
0b7e769211
|
* Add POS tags to support SWBD tag set
|
2015-02-11 14:08:28 -05:00 |
|
Matthew Honnibal
|
312b3a45f3
|
* Fix issue #19: Allow parsing/pos tagging of empty strings
|
2015-02-10 10:15:58 -05:00 |
|
Matthew Honnibal
|
2a0615104b
|
* Upd download script
|
2015-02-09 10:22:59 -05:00 |
|
Matthew Honnibal
|
5c3513583d
|
* Clear buffered python tokens when modifying the Tokens object. Need to clean this up, and modify via a method on Tokens.
|
2015-02-09 03:57:10 -05:00 |
|
Matthew Honnibal
|
be5536d239
|
* Fix Issue #22: PRP and PRP$ were mapped to NOUN. Should be PRON.
|
2015-02-08 18:36:18 -05:00 |
|
Matthew Honnibal
|
0492cee8b4
|
* Fix Issue #24: Lemmas are empty when the L field is missing for special-cased tokens
|
2015-02-08 18:30:30 -05:00 |
|
Matthew Honnibal
|
d229fbd228
|
* Give better error on out-of-bounds array access
|
2015-02-07 12:59:12 -05:00 |
|
Matthew Honnibal
|
ab8bb047d0
|
* Fix negative index for __getitem__
|
2015-02-07 12:58:46 -05:00 |
|
Matthew Honnibal
|
44c7eafe44
|
* Fix download.py
|
2015-02-07 12:00:36 -05:00 |
|
Matthew Honnibal
|
6ca7f2eedc
|
* Upd download script
|
2015-02-07 11:32:33 -05:00 |
|
Matthew Honnibal
|
f0e0588833
|
* Fill L2 norm attribute on LexemeC struct
|
2015-02-07 08:44:42 -05:00 |
|
Matthew Honnibal
|
75f9b7d6bf
|
* Add L2 norm field to LexemeC struct
|
2015-02-07 08:43:17 -05:00 |
|
Matthew Honnibal
|
51b618d646
|
* Add a has_repvec property to Lexeme, and a check function to check flags
|
2015-02-07 08:42:44 -05:00 |
|
Matthew Honnibal
|
321b402739
|
* Store the l2 norm of the word's vector
|
2015-02-07 08:42:16 -05:00 |
|
Matthew Honnibal
|
c7d8644149
|
* Fix regression on 'prob' attr of Token.
|
2015-02-03 03:32:18 +11:00 |
|
Matthew Honnibal
|
c55a33d045
|
* Catch oracle errors
|
2015-02-02 23:02:04 +11:00 |
|
Matthew Honnibal
|
de772088e6
|
* Use parse tree for sbd in Tokens.sents
|
2015-02-02 12:17:32 +11:00 |
|
Matthew Honnibal
|
56c2ef2982
|
* Tweak POS features for web text
|
2015-02-02 11:59:36 +11:00 |
|
Matthew Honnibal
|
d68678a93e
|
* Add Exception class, OracleError
|
2015-02-02 11:57:32 +11:00 |
|
Matthew Honnibal
|
a20fdbd8ee
|
* Upd download script
|
2015-02-01 13:22:23 +11:00 |
|
Matthew Honnibal
|
76d9394cb4
|
* Fix vocab.pyx for Python3
|
2015-02-01 13:14:04 +11:00 |
|
Matthew Honnibal
|
63abdf154c
|
* Hastily hack download file
|
2015-01-31 22:48:32 +11:00 |
|
Matthew Honnibal
|
7de00c5a79
|
* Try not holding a reference to Pool, since that seems to confuse the GC
|
2015-01-31 22:10:22 +11:00 |
|
Matthew Honnibal
|
ce3ae8b5d9
|
* Fix platform-specific lexicon bug.
|
2015-01-31 16:38:58 +11:00 |
|
Matthew Honnibal
|
a1ed574b7b
|
* Fix default model path for English
|
2015-01-31 16:38:27 +11:00 |
|
Matthew Honnibal
|
018e0bfa24
|
* Bug fixes to parse navigation
|
2015-01-31 16:37:13 +11:00 |
|
Matthew Honnibal
|
e013555b25
|
* Add option to download script
|
2015-01-31 13:51:56 +11:00 |
|
Matthew Honnibal
|
08ca5c8970
|
* Add sent_end flag to TokenC struct
|
2015-01-31 13:44:16 +11:00 |
|
Matthew Honnibal
|
024cfd485c
|
* Pass tag_strings as a tuple, to support new Tokens API
|
2015-01-31 13:43:37 +11:00 |
|
Matthew Honnibal
|
77d62d0179
|
* Large refactor of Token objects, making them much thinner. This is to support fast parse-tree navigation.
|
2015-01-31 13:42:58 +11:00 |
|
Matthew Honnibal
|
88170e6295
|
* Supply dep_strings as a tuple, for the changed API on Tokens
|
2015-01-31 13:42:09 +11:00 |
|
Matthew Honnibal
|
0981d68022
|
* Set a sent_end flag during parsing, for later use
|
2015-01-31 13:41:46 +11:00 |
|
Matthew Honnibal
|
251dbf24d7
|
* Fix uninitialised variable error
|
2015-01-30 20:46:34 +11:00 |
|
Matthew Honnibal
|
83a4df5a1a
|
* Fix download script
|
2015-01-30 20:40:42 +11:00 |
|
Matthew Honnibal
|
6f9ebc2f34
|
* Fix download script
|
2015-01-30 20:33:19 +11:00 |
|
Matthew Honnibal
|
8b85d0bb8a
|
* Only download small data if no data dir exists
|
2015-01-30 20:27:14 +11:00 |
|
Matthew Honnibal
|
1a7a1c2771
|
* Fix Issue #16: tokens recurse when printing
|
2015-01-30 19:47:50 +11:00 |
|
Matthew Honnibal
|
cb95ef6934
|
* Fix download script
|
2015-01-30 19:28:43 +11:00 |
|
Matthew Honnibal
|
e578bd37bd
|
* Fix download script
|
2015-01-30 18:59:31 +11:00 |
|
Matthew Honnibal
|
df52014d12
|
* Fix download script
|
2015-01-30 18:36:24 +11:00 |
|
Matthew Honnibal
|
0f95712189
|
* Improve accuracy reporting during training
|
2015-01-30 18:05:06 +11:00 |
|
Matthew Honnibal
|
b68f563c2f
|
* Fix Issue #14: Improve parsing API
|
2015-01-30 18:04:41 +11:00 |
|
Matthew Honnibal
|
998b607f65
|
* Upd download script, having it download all data if there's no data/ directory, allowing easier compilation from source
|
2015-01-30 18:04:01 +11:00 |
|
Matthew Honnibal
|
67d6e53a69
|
* Ensure parser and tagger function correctly when training from missing values, indicated by -1
|
2015-01-30 14:08:56 +11:00 |
|
Matthew Honnibal
|
4ff180db74
|
* Fix off-by-one error in commit 0a7fceb
|
2015-01-30 12:49:33 +11:00 |
|
Matthew Honnibal
|
0a7fcebdf7
|
* Fix Issue #12: Incorrect token.idx calculations for some punctuation, in the presence of token cache
|
2015-01-30 12:33:38 +11:00 |
|
Matthew Honnibal
|
ebf7d2fab1
|
* Use non-joint sbd, for more simplicity and fewer classes
|
2015-01-29 06:22:03 +11:00 |
|
Matthew Honnibal
|
d05c5bf141
|
* Remove comment
|
2015-01-29 05:19:27 +11:00 |
|
Matthew Honnibal
|
320b045daa
|
* Oracle now consistent over gold standard derivation
|
2015-01-29 03:41:58 +11:00 |
|
Matthew Honnibal
|
f590382134
|
* Work on sbd
|
2015-01-29 03:18:29 +11:00 |
|
Matthew Honnibal
|
1884a7a0be
|
* Attach comment with paper
|
2015-01-28 03:18:43 +11:00 |
|
Matthew Honnibal
|
a2d6b195db
|
* Add messy Break transitions, carefully following the scheme of Dd Zhang et al (2013)
|
2015-01-28 03:09:45 +11:00 |
|
Matthew Honnibal
|
f9ee5d9934
|
* Build a python list of word strings, for debugging
|
2015-01-28 01:06:13 +11:00 |
|
Matthew Honnibal
|
d819101571
|
* Improve error message on oracle failure
|
2015-01-28 00:58:03 +11:00 |
|
Matthew Honnibal
|
e6c3d3471f
|
* Tweak documentation for Tokens, and hide constructor as __cinit__
|
2015-01-27 18:57:52 +11:00 |
|
Matthew Honnibal
|
c38c62d4a3
|
* Add docstring to English class
|
2015-01-27 02:45:21 +11:00 |
|
Matthew Honnibal
|
d4c99f7dec
|
* Add attrs.pxd
|
2015-01-26 22:22:09 +11:00 |
|
Matthew Honnibal
|
d4a493855e
|
* Fix error msg
|
2015-01-25 23:01:30 +11:00 |
|
Matthew Honnibal
|
7f87716cf7
|
* Fix download script
|
2015-01-25 23:01:10 +11:00 |
|
Matthew Honnibal
|
92fb9257dd
|
* Add parts-of-speech file
|
2015-01-25 22:00:39 +11:00 |
|
Matthew Honnibal
|
c1c3dba4cb
|
* Check whether vector files are present before trying to load them.
|
2015-01-25 18:16:48 +11:00 |
|
Matthew Honnibal
|
5049d4c2e6
|
* Add parts_of_speech.pyx
|
2015-01-25 16:32:26 +11:00 |
|
Matthew Honnibal
|
12b034e3ef
|
* Move POS tag definitions to parts_of_speech.pxd
|
2015-01-25 16:31:07 +11:00 |
|
Matthew Honnibal
|
7431c133d8
|
* Add error if try to access head and not is_parsed
|
2015-01-25 15:33:54 +11:00 |
|
Matthew Honnibal
|
951d06c824
|
* Silently don't parse if data is not present
|
2015-01-25 14:47:38 +11:00 |
|
Matthew Honnibal
|
4e857ab7a6
|
* Fix bug in POS tagger feature
|
2015-01-25 02:20:15 +11:00 |
|
Matthew Honnibal
|
dd56e298e2
|
* Ensure tagging is applied if parse=True
|
2015-01-25 02:19:44 +11:00 |
|
Matthew Honnibal
|
94750819cd
|
* Set parse=True by default --- i.e. parse unless told not to.
|
2015-01-25 01:28:28 +11:00 |
|
Matthew Honnibal
|
71b95202eb
|
* Add docstring to StringStore
|
2015-01-24 20:49:15 +11:00 |
|
Matthew Honnibal
|
6d1c08dafd
|
* Add docstring to Lexeme
|
2015-01-24 20:48:34 +11:00 |
|
Matthew Honnibal
|
a97bed9359
|
* Fix POS and dependency label tag names. Add parse and string navigation functions.
|
2015-01-24 17:29:04 +11:00 |
|
Matthew Honnibal
|
76cd024095
|
* Add whitespace property to Token
|
2015-01-24 07:41:21 +11:00 |
|
Matthew Honnibal
|
5fd72bc220
|
* Have 'string' refer to the whitespace-padded string
|
2015-01-24 07:32:38 +11:00 |
|
Matthew Honnibal
|
fda94271af
|
* Rename NORM1 and NORM2 attrs to lower and norm
|
2015-01-24 06:17:03 +11:00 |
|
Matthew Honnibal
|
5ed8b2b98f
|
* Rename sic to orth
|
2015-01-23 02:08:25 +11:00 |
|
Matthew Honnibal
|
a27b23cc8f
|
* Have SBD return start/end indices
|
2015-01-22 22:24:44 +11:00 |
|
Matthew Honnibal
|
d460c28838
|
* Rename vec to repvec
|
2015-01-22 02:06:22 +11:00 |
|
Matthew Honnibal
|
8b9d913d97
|
* Rename vec to repvec
|
2015-01-22 02:05:58 +11:00 |
|
Matthew Honnibal
|
9cd0b6b3e9
|
* Various tweaks to Tokens class
|
2015-01-22 02:05:37 +11:00 |
|
Matthew Honnibal
|
5928d158ce
|
* Pass the string to Tokens
|
2015-01-22 02:04:58 +11:00 |
|
Matthew Honnibal
|
45264e356b
|
* Rename vec to repvec
|
2015-01-22 02:04:24 +11:00 |
|
Matthew Honnibal
|
5e63c606ad
|
* Rename vec to repvec
|
2015-01-22 02:03:54 +11:00 |
|
Matthew Honnibal
|
56e6cf0672
|
* Add _string attr to Tokens object
|
2015-01-21 18:57:09 +11:00 |
|
Matthew Honnibal
|
d6ac60e91c
|
* Bug fixes to sentences method, and improved vector transport for tokens
|
2015-01-21 18:56:32 +11:00 |
|
Matthew Honnibal
|
f2a229136c
|
* Fix data_dir=None argument to English class
|
2015-01-21 18:27:31 +11:00 |
|
Matthew Honnibal
|
ef49b8c179
|
* Add stop-word flag
|
2015-01-21 18:22:31 +11:00 |
|
Matthew Honnibal
|
6646bfc5df
|
* Add LOWER attr
|
2015-01-21 18:19:08 +11:00 |
|
Matthew Honnibal
|
f149259bf5
|
* Fix negative indices in tokens
|
2015-01-20 01:16:29 +11:00 |
|
Matthew Honnibal
|
b65b0c07bf
|
* Messily hook up vector in tokens
|
2015-01-19 19:59:55 +11:00 |
|
Matthew Honnibal
|
8ff5b8bd84
|
* Add attribute for POS scheme
|
2015-01-17 17:33:16 +11:00 |
|
Matthew Honnibal
|
6c7e44140b
|
* Work on word vectors, and other stuff
|
2015-01-17 16:21:17 +11:00 |
|
Matthew Honnibal
|
802867e96a
|
* Revise interface to Token. Strings now have attribute names like norm1_
|
2015-01-15 03:51:47 +11:00 |
|
Matthew Honnibal
|
7d3c40de7d
|
* Tests passing after refactor. API has obvious warts, particularly in Token and Lexeme
|
2015-01-15 00:33:16 +11:00 |
|
Matthew Honnibal
|
0930892fc1
|
* Tmp. Working on refactor. Compiles, must hook up lexical feats.
|
2015-01-14 00:03:48 +11:00 |
|
Matthew Honnibal
|
46da3d74d2
|
* Tmp. Refactoring, introducing a Lexeme PyObject.
|
2015-01-12 11:23:44 +11:00 |
|
Matthew Honnibal
|
ce2edd6312
|
* Tmp commit. Refactoring to create a Python Lexeme class.
|
2015-01-12 10:26:22 +11:00 |
|
Matthew Honnibal
|
aacaf1a0f0
|
* Fix parser
|
2015-01-08 01:19:23 +11:00 |
|
Matthew Honnibal
|
9a21127bf7
|
* Fix parser, which was importing the wrong model
|
2015-01-08 00:10:15 +11:00 |
|
Matthew Honnibal
|
6a3e39cdd1
|
* Add typedefs.pyx
|
2015-01-06 04:51:40 +11:00 |
|
Matthew Honnibal
|
a58920cc5e
|
* Import orth.word_shape as a C module
|
2015-01-06 03:18:22 +11:00 |
|
Matthew Honnibal
|
6b68f7ef75
|
* Finally get string types right for orth function
|
2015-01-06 03:17:39 +11:00 |
|
Matthew Honnibal
|
90c143bd85
|
* Fix orth import
|
2015-01-05 18:49:19 +11:00 |
|
Matthew Honnibal
|
7689dccd0f
|
* Remove unused import
|
2015-01-05 18:48:48 +11:00 |
|
Matthew Honnibal
|
3f1944d688
|
* Make PyPy work
|
2015-01-05 17:54:38 +11:00 |
|
Matthew Honnibal
|
a510d9f677
|
* Another assertion removed
|
2015-01-05 13:01:40 +11:00 |
|
Matthew Honnibal
|
2856946a66
|
* Remove assertion that doesn't work on Python 3
|
2015-01-05 12:51:16 +11:00 |
|
Matthew Honnibal
|
94034f1112
|
* Fix encoding in lemmatization
|
2015-01-05 11:54:29 +11:00 |
|
Matthew Honnibal
|
b132b3caa6
|
* Fix unicode error in lemmatizer
|
2015-01-05 11:53:54 +11:00 |
|
Matthew Honnibal
|
477e7fbffe
|
* Fix data reading for lemmatizer
|
2015-01-05 06:01:32 +11:00 |
|
Matthew Honnibal
|
58f75abaca
|
* Fix unicode error in orth
|
2015-01-05 05:53:08 +11:00 |
|
Matthew Honnibal
|
4e085d5166
|
* Fix lemmatizer for Python3
|
2015-01-05 05:51:26 +11:00 |
|
Matthew Honnibal
|
ae7c811fd1
|
* Use Exception instead of StandardError
|
2015-01-04 01:22:12 +11:00 |
|
Matthew Honnibal
|
0e4c2ba036
|
* Fix loading of special morph words
|
2015-01-03 23:13:00 +11:00 |
|
Matthew Honnibal
|
f5d41028b5
|
* Move around data files for test release
|
2015-01-03 01:59:22 +11:00 |
|
Matthew Honnibal
|
a24321b63a
|
* Add downloader
|
2015-01-02 21:44:41 +11:00 |
|
Matthew Honnibal
|
5d9a096e2f
|
* Some minor clean-up after HastyModel
|
2014-12-31 19:46:04 +11:00 |
|
Matthew Honnibal
|
aafaf58cbe
|
* Refactor _ml.Model, and finish implementing HastyModel. So far not worthwhile.
|
2014-12-31 19:40:59 +11:00 |
|
Matthew Honnibal
|
bcd038e7b6
|
* Implement HastyModel
|
2014-12-31 01:16:47 +11:00 |
|
Matthew Honnibal
|
1a075f77ff
|
* Don't over-ride pre-loaded POS tags, if set by special-cases
|
2014-12-30 23:26:32 +11:00 |
|
Matthew Honnibal
|
785c7ba76a
|
* Embed signature on attrs
|
2014-12-30 23:25:31 +11:00 |
|
Matthew Honnibal
|
30e5805656
|
* Lazy-load tagger and parser
|
2014-12-30 23:25:09 +11:00 |
|
Matthew Honnibal
|
9976aa976e
|
* Messily fix morphology and POS tags on special tokens.
|
2014-12-30 23:24:37 +11:00 |
|
Matthew Honnibal
|
c1ef3febee
|
* Embedsignature in tokens.pyx
|
2014-12-30 21:22:00 +11:00 |
|
Matthew Honnibal
|
aac5028b6e
|
* Move tagger to _ml
|
2014-12-30 21:21:38 +11:00 |
|
Matthew Honnibal
|
1ffb0229ed
|
* Import tokens in parser.pxd
|
2014-12-30 21:21:17 +11:00 |
|
Matthew Honnibal
|
bb0b00f819
|
* Repurpose the Tagger class as a generic Model, wrapping thinc's interface
|
2014-12-30 21:20:15 +11:00 |
|
Matthew Honnibal
|
fe2a5e0370
|
* Work on docstrings
|
2014-12-27 21:46:04 +11:00 |
|
Matthew Honnibal
|
bb80937544
|
* Upd docstrings
|
2014-12-27 18:45:16 +11:00 |
|
Matthew Honnibal
|
b8b65903fc
|
* Tmp
|
2014-12-24 17:42:00 +11:00 |
|
Matthew Honnibal
|
ab61673edd
|
* Fix api of array method
|
2014-12-23 15:18:48 +11:00 |
|
Matthew Honnibal
|
7708d0e24a
|
* Move lemmatizer to en dir
|
2014-12-23 15:16:57 +11:00 |
|
Matthew Honnibal
|
98eb4c0426
|
* Fix path to parser model
|
2014-12-23 15:09:09 +11:00 |
|
Matthew Honnibal
|
b00bc01d8c
|
* All tests now passing for reorg
|
2014-12-23 13:18:59 +11:00 |
|
Matthew Honnibal
|
73f200436f
|
* Tests passing except for morphology/lemmatization stuff
|
2014-12-23 11:40:32 +11:00 |
|
Matthew Honnibal
|
cf8d26c3d2
|
* POS tagger training working after reorg
|
2014-12-22 08:54:47 +11:00 |
|
Matthew Honnibal
|
4c4aa2c5c9
|
* Work on train
|
2014-12-22 07:25:43 +11:00 |
|
Matthew Honnibal
|
61df50b598
|
* Add English-subclass POS tagger
|
2014-12-21 20:59:07 +11:00 |
|
Matthew Honnibal
|
9f3f07cab6
|
* Add attrs file for English
|
2014-12-21 11:29:11 +11:00 |
|
Matthew Honnibal
|
2a89d70429
|
* Add vocab.pyx to setup, and ensure we can import spacy.en.lang
|
2014-12-21 06:03:53 +11:00 |
|
Matthew Honnibal
|
b34a1325d3
|
* Everything compiling after reorg. About to start testing.
|
2014-12-21 05:42:23 +11:00 |
|
Matthew Honnibal
|
e1c1a4b868
|
* Tmp
|
2014-12-21 05:36:29 +11:00 |
|
Matthew Honnibal
|
d11c1edf8c
|
* Import slice_unicode from strings.pyx
|
2014-12-20 07:56:26 +11:00 |
|
Matthew Honnibal
|
be1bdcbd85
|
* Move lang.pyx to tokenizer.pyx
|
2014-12-20 07:55:40 +11:00 |
|
Matthew Honnibal
|
89a1cc1a48
|
* Move murmurhash to .pxd in strings file
|
2014-12-20 07:41:08 +11:00 |
|
Matthew Honnibal
|
d5a942c4a4
|
* Rename lang.pyx to tokenizer.pyx
|
2014-12-20 07:30:39 +11:00 |
|
Matthew Honnibal
|
a60ae261ae
|
* Move tokenizer to its own file, and refactor
|
2014-12-20 07:29:16 +11:00 |
|
Matthew Honnibal
|
867a4a000c
|
* Export set_morph_from_dict function
|
2014-12-20 07:28:27 +11:00 |
|
Matthew Honnibal
|
4e30195c6d
|
* Refactor morphology.pyx
|
2014-12-20 07:27:28 +11:00 |
|
Matthew Honnibal
|
4c6ce7ee84
|
* Update tokens.pyx as part of reorg
|
2014-12-20 07:03:26 +11:00 |
|
Matthew Honnibal
|
116f7f3bc1
|
* Rename Lexicon to Vocab, and move it to its own file
|
2014-12-20 06:54:03 +11:00 |
|
Matthew Honnibal
|
780cbd68b1
|
* Move all struct definitions to structs.pxd, to avoid circular dependencies
|
2014-12-20 06:51:33 +11:00 |
|
Matthew Honnibal
|
f6556d8e5d
|
* Refactor, move Lexeme struct to structs.pxd
|
2014-12-20 06:51:03 +11:00 |
|
Matthew Honnibal
|
7d48bba6c4
|
* Move StringStore class to its own file
|
2014-12-20 06:42:01 +11:00 |
|
Matthew Honnibal
|
b066102d2d
|
* Remove POS cache for now
|
2014-12-20 03:49:58 +11:00 |
|
Matthew Honnibal
|
ff252dd535
|
* Clean up 'guess_cache' idea, which didn't work well enough
|
2014-12-20 03:49:11 +11:00 |
|
Matthew Honnibal
|
9d3ca13909
|
* Start work on parse-tree iteration classes
|
2014-12-20 03:48:10 +11:00 |
|
Matthew Honnibal
|
bed680c632
|
* Remove commented-out features
|
2014-12-20 03:47:32 +11:00 |
|
Matthew Honnibal
|
3d178c03ae
|
* Prune the features a bit
|
2014-12-20 02:46:14 +11:00 |
|
Matthew Honnibal
|
a0408e1758
|
* Working DecisionMemory class
|
2014-12-20 01:43:26 +11:00 |
|
Matthew Honnibal
|
7920ea72b4
|
* Working parser with the decision memory idea. Disabling that for now, for simplicity
|
2014-12-20 01:43:15 +11:00 |
|
Matthew Honnibal
|
a2f2a48da9
|
* Add some extra features
|
2014-12-20 01:42:24 +11:00 |
|
Matthew Honnibal
|
8fd9762d91
|
* Start laying out parse tree iteration methods
|
2014-12-20 01:42:09 +11:00 |
|
Matthew Honnibal
|
53b8bc1f3c
|
* Work on implementing a trainable cache for the parser. So far, doesn't improve efficiency
|
2014-12-19 09:30:50 +11:00 |
|
Matthew Honnibal
|
033d6c9ac2
|
* Adapt POS tagger decision-memory for use in parser
|
2014-12-19 07:23:04 +11:00 |
|
Matthew Honnibal
|
809ddf7887
|
* Add index.pxd
|
2014-12-19 07:23:00 +11:00 |
|
Matthew Honnibal
|
1879abd16a
|
* Set const-correctness for tagger
|
2014-12-18 20:41:52 +11:00 |
|
Matthew Honnibal
|
f72243b156
|
* Set const-correctness for Feature* array
|
2014-12-18 20:41:32 +11:00 |
|
Matthew Honnibal
|
6ab7e40590
|
* Add non-monotonic parsing with cost-sensitive update. 92.26 on Y&M set
|
2014-12-18 11:33:25 +11:00 |
|
Matthew Honnibal
|
7e0c692daf
|
* Automatically push when the stack is empty
|
2014-12-18 09:16:10 +11:00 |
|
Matthew Honnibal
|
61142a8eff
|
* Tweak features
|
2014-12-18 09:15:03 +11:00 |
|
Matthew Honnibal
|
8446ebfbbb
|
* Work on parser. Up to 92 UAS on YM labels
|
2014-12-18 09:05:31 +11:00 |
|
Matthew Honnibal
|
55de747bfc
|
* Remove .cpp files
|
2014-12-18 02:43:13 +11:00 |
|
Matthew Honnibal
|
4448a840f7
|
* Work on greedy parsing. Scoring about 91.2
|
2014-12-18 02:42:55 +11:00 |
|
Matthew Honnibal
|
87e9487d76
|
* Work on parser
|
2014-12-17 21:10:12 +11:00 |
|
Matthew Honnibal
|
9d7d97978d
|
* Work on greedy parser
|
2014-12-17 21:09:29 +11:00 |
|
Matthew Honnibal
|
d524dd306a
|
* Work on greedy parser
|
2014-12-17 03:19:43 +11:00 |
|
Matthew Honnibal
|
95ccea03b2
|
* Work on greedy parser
|
2014-12-16 22:46:55 +11:00 |
|
Matthew Honnibal
|
a432862fde
|
* Add exception type to _arg_max_among in tagger
|
2014-12-16 09:44:19 +11:00 |
|
Matthew Honnibal
|
9e00798820
|
* Work on integrating a greedy dependency parser
|
2014-12-16 08:06:04 +11:00 |
|
Matthew Honnibal
|
792802b2b9
|
* POS tag memoisation working, with good speed-up
|
2014-12-12 14:33:51 +11:00 |
|
Matthew Honnibal
|
ca54d58638
|
* Merge setup.py
|
2014-12-10 15:21:27 +11:00 |
|
Matthew Honnibal
|
9959a64f7b
|
* Working morphology and lemmatisation. POS tagging quite fast.
|
2014-12-10 08:09:32 +11:00 |
|
Matthew Honnibal
|
df3be14987
|
* Add pos_type features to POS tagger
|
2014-12-10 08:08:55 +11:00 |
|
Matthew Honnibal
|
42973c4b37
|
* Improve efficiency of tagger, and improve morphological processing
|
2014-12-10 01:02:04 +11:00 |
|
Matthew Honnibal
|
6b34a2f34b
|
* Move morphological analysis into its own module, morphology.pyx
|
2014-12-09 21:16:17 +11:00 |
|
Matthew Honnibal
|
b962fe73d7
|
* Make suffixes file use full-power regex, so that we can handle periods properly
|
2014-12-09 19:04:27 +11:00 |
|
Matthew Honnibal
|
accdbe989b
|
* Remove Tokens.extend method
|
2014-12-09 17:09:23 +11:00 |
|
Matthew Honnibal
|
495e1c7366
|
* Use fused type in Tokens.push_back, simplifying the use of the cache
|
2014-12-09 16:50:01 +11:00 |
|
Matthew Honnibal
|
302e09018b
|
* Work on fixing special-cases, reading them in as JSON objects so that they can specify lemmas
|
2014-12-09 14:48:01 +11:00 |
|
Matthew Honnibal
|
99bbbb6feb
|
* Work on morphological processing
|
2014-12-08 21:12:15 +11:00 |
|
Matthew Honnibal
|
7b68f911cf
|
* Add WordNet lemmatizer
|
2014-12-08 01:39:13 +11:00 |
|
Matthew Honnibal
|
c20dd79748
|
* Fiddle with const correctness and comments
|
2014-12-08 00:03:55 +11:00 |
|
Matthew Honnibal
|
b031c7c430
|
* Remove language-general context module
|
2014-12-07 23:53:01 +11:00 |
|
Matthew Honnibal
|
ef4398b204
|
* Rearrange POS stuff, so that language-specific stuff can live in language-specific modules
|
2014-12-07 23:52:41 +11:00 |
|
Matthew Honnibal
|
327383e38a
|
* Remove unused code in tagger.pyx
|
2014-12-07 22:16:17 +11:00 |
|
Matthew Honnibal
|
9f17467c2e
|
* Fix EMPTY_TOKEN
|
2014-12-07 22:07:41 +11:00 |
|
Matthew Honnibal
|
3819a88e1b
|
* Add support for tag dictionary, and fix error-code for predict method
|
2014-12-07 22:07:16 +11:00 |
|
Matthew Honnibal
|
f00afe12c4
|
* Load POS tagger in load() function if path exists
|
2014-12-07 22:05:57 +11:00 |
|
Matthew Honnibal
|
5fe5e6e66b
|
* Move context functions to header, inlining them.
|
2014-12-07 21:59:04 +11:00 |
|
Matthew Honnibal
|
5caabec789
|
* Link in tagger, to work on integrating POS tagging
|
2014-12-07 15:29:41 +11:00 |
|
Matthew Honnibal
|
0c7aeb9de7
|
* Begin revising tagger, focussing on POS tagging
|
2014-12-07 15:29:04 +11:00 |
|
Matthew Honnibal
|
f5c4f2eb52
|
* Revise context, focussing on POS tagging for now
|
2014-12-07 15:28:22 +11:00 |
|
Matthew Honnibal
|
e27b912ef9
|
* Remove need for confusing _data pointer to be stored on Tokens
|
2014-12-05 16:31:30 +11:00 |
|
Matthew Honnibal
|
1c9253701d
|
* Introduce a TokenC struct, to handle token indices, pos tags and sense tags
|
2014-12-05 15:56:14 +11:00 |
|
Matthew Honnibal
|
187372c7f3
|
* Allow the lexicon to create lexemes using an external memory pool, so that it can decide to make some lexemes temporary, rather than cached
|
2014-12-05 03:29:50 +11:00 |
|
Matthew Honnibal
|
75b8dfb348
|
* Remove upper_pc from lexeme.pyx
|
2014-12-04 22:14:34 +11:00 |
|
Matthew Honnibal
|
49f3780ff5
|
* Fiddle with lexeme attrs
|
2014-12-04 21:22:38 +11:00 |
|
Matthew Honnibal
|
564082e48e
|
* Hack Token class to take lex.dense inplace of the old lex.norm. This needs to be fixed...
|
2014-12-04 20:51:29 +11:00 |
|
Matthew Honnibal
|
69bb022204
|
* Add as_array and count_by method
|
2014-12-04 20:46:55 +11:00 |
|
Matthew Honnibal
|
e1b1f45cc9
|
* Add STEM attribute to lexeme
|
2014-12-04 20:46:20 +11:00 |
|
Matthew Honnibal
|
d7952634ca
|
* Make the string-store serve const pointers to Utf8Str
|
2014-12-03 16:01:47 +11:00 |
|
Matthew Honnibal
|
7e04c22f8f
|
* const added to Lexicon interface. Seems to work.
|
2014-12-03 15:58:17 +11:00 |
|
Matthew Honnibal
|
d70d31aa45
|
* Introduce first attempt at const-ness
|
2014-12-03 15:44:25 +11:00 |
|
Matthew Honnibal
|
4560ada85b
|
* Add typedef for attr_t. Change flag_t to flags_t
|
2014-12-03 11:06:31 +11:00 |
|
Matthew Honnibal
|
e600f7b327
|
* Move String struct stuff into the utf8string module, from spacy.lang
|
2014-12-03 11:06:00 +11:00 |
|
Matthew Honnibal
|
e170faf5b0
|
* Hack Tokens to work without tagger.pyx
|
2014-12-03 11:05:15 +11:00 |
|
Matthew Honnibal
|
b463a7eb86
|
* Make flag-setting a language-specific thing
|
2014-12-03 11:04:32 +11:00 |
|
Matthew Honnibal
|
71b009e323
|
* Fix bug in refactored StringStore.__getitem__
|
2014-12-03 11:02:24 +11:00 |
|
Matthew Honnibal
|
14097311ae
|
* Make StringStore.__getitem__ accept unicode-typed keys.
|
2014-12-03 01:33:20 +11:00 |
|
Matthew Honnibal
|
522bb0346e
|
* Work on get_array method of Tokens
|
2014-12-02 23:48:05 +11:00 |
|
Matthew Honnibal
|
8c2938fe01
|
* Rename Lexicon._dict to Lexicon._map
|
2014-12-02 23:46:59 +11:00 |
|
Matthew Honnibal
|
33dfb4933c
|
* Remove taggers from Language class. Work on doc strings
|
2014-11-26 19:53:55 +11:00 |
|
Matthew Honnibal
|
80baa2e3db
|
* Work on beam parser
|
2014-11-20 19:49:33 +11:00 |
|
Matthew Honnibal
|
5c3016bac8
|
* Tmp commit of ner code
|
2014-11-14 18:27:47 +11:00 |
|
Matthew Honnibal
|
33c421bcf8
|
* More feature tweaks
|
2014-11-12 23:59:16 +11:00 |
|
Matthew Honnibal
|
41dedfb14e
|
* Add label features for NER parsing
|
2014-11-12 23:55:10 +11:00 |
|
Matthew Honnibal
|
cf55b48ba6
|
* Switch to predict label on shift. Big increase in accuracy.
|
2014-11-12 23:50:12 +11:00 |
|
Matthew Honnibal
|
8f84e8a78b
|
* Neaten oracle
|
2014-11-12 23:38:07 +11:00 |
|
Matthew Honnibal
|
7e0a9077dd
|
* Add context files
|
2014-11-12 23:22:36 +11:00 |
|
Matthew Honnibal
|
3b0b902384
|
* IOB-style parsing working. Accuracy down from BILOU, from 87-88 to 85-86
|
2014-11-12 23:21:09 +11:00 |
|
Matthew Honnibal
|
e6bb8aa3a9
|
* Move moves to bilou_moves. Refactor context, returning to the simpler giant-enum style
|
2014-11-12 00:54:50 +11:00 |
|
Matthew Honnibal
|
c788633429
|
* Add tokens_from_list method to Language
|
2014-11-11 23:43:14 +11:00 |
|
Matthew Honnibal
|
95282d4993
|
* Use the dynamic oracle 'follow' strategy
|
2014-11-11 21:11:17 +11:00 |
|
Matthew Honnibal
|
5aaf7a024d
|
* Move ner features to ner subdir
|
2014-11-11 21:09:03 +11:00 |
|
Matthew Honnibal
|
ff8989b63c
|
* Use greedy NER parser
|
2014-11-11 21:08:35 +11:00 |
|
Matthew Honnibal
|
0d943ab358
|
* Fixed greedy NER parsing. With static oracle, replicates accuracy from tagger.
|
2014-11-11 17:17:54 +11:00 |
|
Matthew Honnibal
|
399239760b
|
* Fix moves for new State struct
|
2014-11-10 22:16:05 +11:00 |
|
Matthew Honnibal
|
82247169f2
|
* Implement validation and oracle on pystate, for testing
|
2014-11-10 22:15:32 +11:00 |
|
Matthew Honnibal
|
3709ed9d6d
|
* Add curr field to State, to handle entity being built
|
2014-11-10 22:14:36 +11:00 |
|
Matthew Honnibal
|
af9ed18cf1
|
* Bug fixes to NER
|
2014-11-10 17:39:23 +11:00 |
|
Matthew Honnibal
|
9f2587f5ec
|
* Work on shift-reduce NER
|
2014-11-10 16:28:56 +11:00 |
|
Matthew Honnibal
|
f307eb2e36
|
* Refactor context extraction, and start breaking out gold standards into their own functions
|
2014-11-09 15:43:07 +11:00 |
|
Matthew Honnibal
|
602f993af9
|
* Moving tagger to accept multiple correct answers
|
2014-11-09 15:18:33 +11:00 |
|
Matthew Honnibal
|
f37d896a42
|
* Upd NER feats. With adadelta learner, getting 76.9 on NER
|
2014-11-07 04:43:54 +11:00 |
|
Matthew Honnibal
|
68d1cdad62
|
* When encoding POS/NER tags, accept '-' as a missing value
|
2014-11-07 04:42:31 +11:00 |
|
Matthew Honnibal
|
949a6245f9
|
* Increase default number of iterations from 5 to 10
|
2014-11-07 04:42:04 +11:00 |
|
Matthew Honnibal
|
3cab1d9a29
|
* Refine word_shape feature, by trimming the max sequence length
|
2014-11-07 04:41:29 +11:00 |
|
Matthew Honnibal
|
b4454cf036
|
* Add extra context tokens
|
2014-11-07 04:40:36 +11:00 |
|
Matthew Honnibal
|
50309e6e49
|
* Fix context vector, importing all features
|
2014-11-05 22:11:39 +11:00 |
|
Matthew Honnibal
|
07a23768de
|
* Play with NER feats a bit. Up to 82.00 training on MUC7.
|
2014-11-05 21:47:17 +11:00 |
|
Matthew Honnibal
|
4ecbe8c893
|
* Complete refactor of Tagger features, to use a generic list of context names.
|
2014-11-05 20:45:29 +11:00 |
|
Matthew Honnibal
|
0a8c84625d
|
* Moving feature context stuff to a generalized place
|
2014-11-05 19:55:10 +11:00 |
|
Matthew Honnibal
|
3733444101
|
* Generalize tagger code, in preparation for NER and supersense tagging.
|
2014-11-05 03:42:14 +11:00 |
|
Matthew Honnibal
|
abbe3e44b0
|
* Move spacy.pos tagger to spacy.tagger, and generalize it so that it can take on other tagging tasks, given a different set of feature templates.
|
2014-11-05 00:37:59 +11:00 |
|
Matthew Honnibal
|
954c970415
|
* Add __iter__ method to tokens
|
2014-11-04 01:07:08 +11:00 |
|
Matthew Honnibal
|
f07457a91f
|
* Remove POS alignment stuff. Now use training data based on raw text, instead of clumsy detokenization stuff
|
2014-11-04 01:06:43 +11:00 |
|
Matthew Honnibal
|
ae52f9f38c
|
* Remove vocab10k from tokens
|
2014-11-03 00:23:20 +11:00 |
|
Matthew Honnibal
|
32fb50dc35
|
* Remove non_sparse method --- features wanting this can do it easily enough.
|
2014-11-03 00:15:47 +11:00 |
|
Matthew Honnibal
|
b5ae1471db
|
* Fiddle with POS tag features
|
2014-11-03 00:15:03 +11:00 |
|
Matthew Honnibal
|
70ea862703
|
* Remove vocab10k field, and add flags for gazetteers
|
2014-11-03 00:13:51 +11:00 |
|
Matthew Honnibal
|
711ed0f636
|
* Whitespace
|
2014-11-02 14:22:32 +11:00 |
|
Matthew Honnibal
|
fcd9490d56
|
* Add pos_tag method to Language
|
2014-11-02 14:21:43 +11:00 |
|
Matthew Honnibal
|
829bb2bdbe
|
* Add mappings to Twitter POS tag corpus
|
2014-11-02 13:21:19 +11:00 |
|
Matthew Honnibal
|
437cd2217d
|
* Fix strings i/o, removing use of ujson library in favour of plain text file. Allows better control of codecs.
|
2014-11-02 13:20:37 +11:00 |
|
Matthew Honnibal
|
3352e89e21
|
* Use LIKE_URL and LIKE_NUMBER flag features. Seems to improve accuracy on onto web
|
2014-11-02 13:19:54 +11:00 |
|
Matthew Honnibal
|
8335706321
|
* Add LIKE_URL and LIKE_NUMBER flag features
|
2014-11-02 13:19:23 +11:00 |
|
Matthew Honnibal
|
5484fbea69
|
* Implement is_number
|
2014-11-01 19:13:24 +11:00 |
|
Matthew Honnibal
|
f685218e21
|
* Add is_urlish function
|
2014-11-01 17:39:34 +11:00 |
|
Matthew Honnibal
|
09a3e54176
|
* Delete print statements from stringstore
|
2014-10-31 17:45:26 +11:00 |
|
Matthew Honnibal
|
b186a66bae
|
* Rename Token.lex_pos to Token.postype, and Token.lex_supersense to Token.sensetype
|
2014-10-31 17:44:39 +11:00 |
|
Matthew Honnibal
|
a8ca078b24
|
* Restore lexemes field to lexicon
|
2014-10-31 17:43:25 +11:00 |
|
Matthew Honnibal
|
6c807aa45f
|
* Restore id attribute to lexeme, and rename pos field to postype, to store clustered tag dictionaries
|
2014-10-31 17:43:00 +11:00 |
|
Matthew Honnibal
|
aaf6953fe0
|
* Add count_tags function to pos.pyx, which should probably live in another file. Feature set achieves 97.9 on wsj19-21, 95.85 on onto web.
|
2014-10-31 17:42:15 +11:00 |
|
Matthew Honnibal
|
f67cb9a5a3
|
* Add count_tags function to pos.pyx, which should probably live in another file. Feature set achieves 97.9 on wsj19-21, 95.85 on onto web.
|
2014-10-31 17:42:04 +11:00 |
|
Matthew Honnibal
|
ea8f1e7053
|
* Tighten interfaces
|
2014-10-30 18:14:42 +11:00 |
|
Matthew Honnibal
|
ea85bf3a0a
|
* Tighten the interface to Language
|
2014-10-30 18:01:27 +11:00 |
|
Matthew Honnibal
|
c6fcd03692
|
* Small efficiency tweak to lexeme init
|
2014-10-30 17:56:11 +11:00 |
|
Matthew Honnibal
|
87c2418a89
|
* Fiddle with data types on Lexeme, to compress them to a much smaller size.
|
2014-10-30 15:42:15 +11:00 |
|
Matthew Honnibal
|
ac88893232
|
* Fix Token after lexeme changes
|
2014-10-30 15:30:52 +11:00 |
|
Matthew Honnibal
|
e6b87766fe
|
* Remove lexemes vector from Lexicon, and the id and hash attributes from Lexeme
|
2014-10-30 15:21:38 +11:00 |
|
Matthew Honnibal
|
889b7b48b4
|
* Fix POS tagger, so that it loads correctly. Lexemes are being read in.
|
2014-10-30 13:38:55 +11:00 |
|
Matthew Honnibal
|
67c8c8019f
|
* Update lexeme serialization, using a binary file format
|
2014-10-30 01:01:00 +11:00 |
|
Matthew Honnibal
|
13909a2e24
|
* Rewriting Lexeme serialization.
|
2014-10-29 23:19:38 +11:00 |
|
Matthew Honnibal
|
234d49bf4d
|
* Seems to be working after refactor. Need to wire up more POS tag features, and wire up save/load of POS tags.
|
2014-10-24 02:23:42 +11:00 |
|
Matthew Honnibal
|
08ce602243
|
* Large refactor, particularly to Python API
|
2014-10-24 00:59:17 +11:00 |
|
Matthew Honnibal
|
7baef5b7ff
|
* Fix padding on tokens
|
2014-10-23 04:01:17 +11:00 |
|
Matthew Honnibal
|
96b835a3d4
|
* Upd for refactored Tokens class. Now gets 95.74, 185ms training on swbd_wsj_ewtb, eval on onto_web, Google POS tags.
|
2014-10-23 03:20:02 +11:00 |
|
Matthew Honnibal
|
e5e951ae67
|
* Remove the feature array stuff from Tokens class, and replace vector with array-based implementation, with padding.
|
2014-10-23 01:57:59 +11:00 |
|
Matthew Honnibal
|
ea1d4a81eb
|
* Refactoring get_atoms, improving tokens API
|
2014-10-22 13:10:56 +11:00 |
|
Matthew Honnibal
|
ad49e2482e
|
* Tagger now gets 97pc on wsj, parsing 19-21 in 500ms. Gets 92.7 on web text.
|
2014-10-22 12:57:06 +11:00 |
|
Matthew Honnibal
|
0a0e41f6c8
|
* Add prefix and suffix features
|
2014-10-22 12:56:09 +11:00 |
|
Matthew Honnibal
|
7018b53d3a
|
* Improve array features in tokens
|
2014-10-22 12:55:42 +11:00 |
|
Matthew Honnibal
|
43d5964e13
|
* Add function to read detokenization rules
|
2014-10-22 12:54:59 +11:00 |
|
Matthew Honnibal
|
224bdae996
|
* Add POS utilities
|
2014-10-22 10:17:57 +11:00 |
|
Matthew Honnibal
|
5ebe14f353
|
* Add greedy pos tagger
|
2014-10-22 10:17:26 +11:00 |
|
Matthew Honnibal
|
12742f4f83
|
* Add detokenize method and test
|
2014-10-18 18:07:29 +11:00 |
|
Matthew Honnibal
|
99f5e59286
|
* Have tokenizer emit tokens for whitespace other than single spaces
|
2014-10-14 20:25:57 +11:00 |
|
Matthew Honnibal
|
43743a5d63
|
* Work on efficiency
|
2014-10-14 18:22:41 +11:00 |
|
Matthew Honnibal
|
6fb42c4919
|
* Add offsets to Tokens class. Some changes to interfaces, and reorganization of spacy.Lang
|
2014-10-14 16:17:45 +11:00 |
|
Matthew Honnibal
|
2805068ca8
|
* Have tokens track tuples that record the start offset and pos tag as well as a lexeme pointer
|
2014-10-14 15:21:03 +11:00 |
|
Matthew Honnibal
|
65d3ead4fd
|
* Rename LexStr_casefix to LexStr_norm and LexInt_i to LexInt_id
|
2014-10-14 15:19:07 +11:00 |
|
Matthew Honnibal
|
868e558037
|
* Preparations in place to handle hyphenation etc
|
2014-10-10 20:23:23 +11:00 |
|
Matthew Honnibal
|
ff79dbac2e
|
* More slight cleaning for lang.pyx
|
2014-10-10 20:11:22 +11:00 |
|
Matthew Honnibal
|
3d82ed1e5e
|
* More slight cleaning for lang.pyx
|
2014-10-10 19:50:07 +11:00 |
|
Matthew Honnibal
|
02e948e7d5
|
* Remove counts stuff from Language class
|
2014-10-10 19:25:01 +11:00 |
|
Matthew Honnibal
|
71ee921055
|
* Slight cleaning of tokenizer code
|
2014-10-10 19:17:22 +11:00 |
|
Matthew Honnibal
|
59b41a9fd3
|
* Switch to new data model, tests passing
|
2014-10-10 08:11:31 +11:00 |
|
Matthew Honnibal
|
1b0e01d3d8
|
* Revising data model of lexeme. Compiles.
|
2014-10-09 19:53:30 +11:00 |
|
Matthew Honnibal
|
e40caae51f
|
* Update Lexicon class to expect a list of lexeme dict descriptions
|
2014-10-09 14:51:35 +11:00 |
|
Matthew Honnibal
|
51d75b244b
|
* Add serialize/deserialize functions for lexeme, transport to/from python dict.
|
2014-10-09 14:10:46 +11:00 |
|
Matthew Honnibal
|
d73d89a2de
|
* Add i attribute to lexeme, giving lexemes sequential IDs.
|
2014-10-09 13:50:05 +11:00 |
|
Matthew Honnibal
|
096ef2b199
|
* Rename external hashing lib, from trustyc to preshed
|
2014-09-26 18:40:03 +02:00 |
|
Matthew Honnibal
|
11a346fd5e
|
* Remove hashing modules, which are now taken over by external lib
|
2014-09-26 18:39:40 +02:00 |
|
Matthew Honnibal
|
93505276ed
|
* Add German tokenizer files
|
2014-09-25 18:29:13 +02:00 |
|
Matthew Honnibal
|
2e44fa7179
|
* Add util.py
|
2014-09-25 18:26:22 +02:00 |
|
Matthew Honnibal
|
b15619e170
|
* Use PointerHash instead of locally provided _hashing module
|
2014-09-25 18:23:35 +02:00 |
|
Matthew Honnibal
|
ed446c67ad
|
* Add typedefs file
|
2014-09-17 23:10:32 +02:00 |
|
Matthew Honnibal
|
316a57c4be
|
* Remove own memory classes, which have now been broken out into their own package
|
2014-09-17 23:10:07 +02:00 |
|
Matthew Honnibal
|
ac522e2553
|
* Switch from own memory class to cymem, in pip
|
2014-09-17 23:09:24 +02:00 |
|
Matthew Honnibal
|
6266cac593
|
* Switch to using a Python ref counted gateway to malloc/free, to prevent memory leaks
|
2014-09-17 20:02:26 +02:00 |
|
Matthew Honnibal
|
5a20dfc03e
|
* Add memory management code
|
2014-09-17 20:02:06 +02:00 |
|
Matthew Honnibal
|
0152831c89
|
* Refactor tokenization, enable cache, and ensure we look up specials correctly even when there's confusing punctuation surrounding the token.
|
2014-09-16 18:01:46 +02:00 |
|
Matthew Honnibal
|
143e51ec73
|
* Refactor tokenization, splitting it into a clearer life-cycle.
|
2014-09-16 13:16:02 +02:00 |
|
Matthew Honnibal
|
c396581a0b
|
* Fiddle with the way strings are interned in lexeme
|
2014-09-15 06:34:45 +02:00 |
|
Matthew Honnibal
|
0bb547ab98
|
* Fix memory error in cache, where entry wasn't being null-terminated. Various other changes, some good for performance
|
2014-09-15 06:34:10 +02:00 |
|
Matthew Honnibal
|
7959141d36
|
* Add a few abbreviations, to get tests to pass
|
2014-09-15 06:32:18 +02:00 |
|
Matthew Honnibal
|
d235299260
|
* Few nips and tucks to hash table
|
2014-09-15 05:03:44 +02:00 |
|
Matthew Honnibal
|
e68a431e5e
|
* Pass only the tokens vector to _tokenize, instead of the whole python object.
|
2014-09-15 04:01:38 +02:00 |
|
Matthew Honnibal
|
08cef75ffd
|
* Switch to using a heap-allocated vector in tokens
|
2014-09-15 03:46:14 +02:00 |
|
Matthew Honnibal
|
f77b7098c0
|
* Upd Tokens to use vector, with bounds checking.
|
2014-09-15 03:22:40 +02:00 |
|
Matthew Honnibal
|
0f6bf2a2ee
|
* Fix niggling memory error, which was caused by bug in the way tokens resized their internal vector.
|
2014-09-15 02:08:39 +02:00 |
|
Matthew Honnibal
|
df24e3708c
|
* Move EnglishTokens stuff to Tokens
|
2014-09-15 01:31:44 +02:00 |
|
Matthew Honnibal
|
bd08cb09a2
|
* Remove short-circuiting of initial_size argument for PointerHash
|
2014-09-15 01:30:49 +02:00 |
|
Matthew Honnibal
|
f3393cf57c
|
* Improve interface for PointerHash
|
2014-09-13 17:29:58 +02:00 |
|
Matthew Honnibal
|
45865be37e
|
* Switch hash interface, using void* instead of size_t, to avoid casts.
|
2014-09-13 17:02:06 +02:00 |
|
Matthew Honnibal
|
0447279c57
|
* PointerHash working, efficiency is good. 6-7 mins
|
2014-09-13 16:43:59 +02:00 |
|
Matthew Honnibal
|
85d68e8e95
|
* Replaced cache with own hash table. Similar timing
|
2014-09-13 03:14:43 +02:00 |
|
Matthew Honnibal
|
c8db76e3e1
|
* Add initial work on simple hash table
|
2014-09-13 02:02:41 +02:00 |
|
Matthew Honnibal
|
afdc9b7ac2
|
* More performance fiddling, particularly moving the specials into the cache, so that we can just lookup the cache in _tokenize
|
2014-09-13 00:59:34 +02:00 |
|
Matthew Honnibal
|
7d239df4c8
|
* Fiddle with declarations, for small efficiency boost
|
2014-09-13 00:31:53 +02:00 |
|
Matthew Honnibal
|
a8e7cce30f
|
* Efficiency tweaks
|
2014-09-13 00:14:05 +02:00 |
|
Matthew Honnibal
|
126a8453a5
|
* Fix performance issues by implementing a better cache. Add own String struct to help
|
2014-09-12 23:50:37 +02:00 |
|
Matthew Honnibal
|
9298e36b36
|
* Move special tokenization into its own lookup table, away from the cache.
|
2014-09-12 19:43:14 +02:00 |
|
Matthew Honnibal
|
985bc68327
|
* Fix bug with trailing punct on contractions. Reduced efficiency, and slightly hacky implementation.
|
2014-09-12 18:26:26 +02:00 |
|
Matthew Honnibal
|
7eab281194
|
* Fiddle with token features
|
2014-09-12 15:49:55 +02:00 |
|
Matthew Honnibal
|
5aa591106b
|
* Fiddle with token features
|
2014-09-12 15:49:36 +02:00 |
|
Matthew Honnibal
|
1533041885
|
* Update the split_one method, so that it doesn't need to cast back to a Python object
|
2014-09-12 05:10:59 +02:00 |
|
Matthew Honnibal
|
4817277d66
|
* Replace main lexicon dict with dense_hash_map. May be unsuitable, if strings need recovery.
|
2014-09-12 04:29:09 +02:00 |
|
Matthew Honnibal
|
8b20e9ad97
|
* Delete unused _split method
|
2014-09-12 04:03:52 +02:00 |
|