Matthew Honnibal
6c99e5f4aa
* Move serialization into Serializer class, with __call__ and train() api
2015-07-16 11:22:35 +02:00
Matthew Honnibal
e2133d990e
* Move serialization functionality out into a Serializer object
2015-07-16 11:21:44 +02:00
Matthew Honnibal
a6d040bd11
* Import Lexeme attrs from spacy.attrs, not spacy.typedefs
2015-07-16 11:20:08 +02:00
Matthew Honnibal
45ae1ce428
* Remove unused declaration in parser
2015-07-16 01:27:11 +02:00
Matthew Honnibal
efa80096f1
* Upd attrs id list
2015-07-16 01:26:54 +02:00
Matthew Honnibal
01fab6bb90
* Improve de/serialize functions
2015-07-16 01:26:35 +02:00
Matthew Honnibal
0e07c1ed2a
* draft de/serialization functions in doc.pyx
2015-07-16 01:16:33 +02:00
Matthew Honnibal
9d956b07e9
* Fix import of attrs in doc.pyx, and update the get_token_attr function.
2015-07-16 01:15:34 +02:00
Matthew Honnibal
65251e7625
* Remove redundant attr_id_t from typedefs.pxd
2015-07-16 00:58:51 +02:00
Matthew Honnibal
9a8db9743c
* Remove gil from parser.call
2015-07-14 23:47:33 +02:00
Matthew Honnibal
38ca0c33f5
Merge branch 'neuralnet' into refactor
...
Mostly refactors parser, to use new thinc3.2 Example class.
Aim is to remove use of shared memory, so that we can parallelize
over documents easily.
Conflicts:
setup.py
spacy/syntax/parser.pxd
spacy/syntax/parser.pyx
spacy/syntax/stateclass.pyx
2015-07-14 14:13:47 +02:00
Matthew Honnibal
935ac53ee3
* Extend count_by method
2015-07-14 03:20:09 +02:00
Matthew Honnibal
3b5baa660f
* Fix tokenizer
2015-07-14 00:10:51 +02:00
Matthew Honnibal
2ae0b439b2
* Fix space check in gold.pyx
2015-07-14 00:10:27 +02:00
Matthew Honnibal
81aa4e6dcc
* Go back to having token reference doc, instead of complicated gymnastics. Rename the attr 'doc', to expose it in the API
2015-07-14 00:10:11 +02:00
Matthew Honnibal
24d6ce99ec
* Add comment to tokenizer, explaining the spacy attr
2015-07-13 22:29:13 +02:00
Matthew Honnibal
8214b74eec
* Restore _py_tokens cache, to handle orphan tokens.
2015-07-13 22:28:10 +02:00
Matthew Honnibal
67641f3b58
* Refactor tokenizer, to set the 'spacy' field on TokenC instead of passing a string
2015-07-13 21:46:02 +02:00
Matthew Honnibal
6eef0bf9ab
* Break up tokens.pyx into tokens/doc.pyx, tokens/token.pyx, tokens/spans.pyx
2015-07-13 20:20:58 +02:00
Matthew Honnibal
3ea8756c24
* Add spacy/tokens/doc.pyx, for Doc class in its own file
2015-07-13 19:58:26 +02:00
Matthew Honnibal
c99387155f
* Refactor tokens, moving classes into a module instead of a single file
2015-07-13 19:49:55 +02:00
Matthew Honnibal
d27899658e
* Import classes in spacy.tokens.__init__
2015-07-13 19:48:55 +02:00
Matthew Honnibal
aa82caf8f5
* Add TokenC.spacy attr
2015-07-13 19:48:07 +02:00
Matthew Honnibal
dba6b47d4e
* Refactor monster tokens.pyx file, into a tokens/ subpackage. Try to break the cycle between Doc and Token, and remove the need to pass around a unicode string reference
2015-07-13 19:20:48 +02:00
Matthew Honnibal
5b0a7190c9
* Round-trip for serialization finally working. Needs a lot of optimization.
2015-07-13 18:39:38 +02:00
Matthew Honnibal
edd371246c
* Make huffman coder take BitArray in encode/decode. Add __iter__ method to BitArray.
2015-07-13 17:33:33 +02:00
Matthew Honnibal
af5cc926a4
* Add codec property to Vocab, to use the Huffman encoding
2015-07-13 13:55:14 +02:00
Matthew Honnibal
77385d5580
* Make .pxd file for huffman codec
2015-07-13 13:54:51 +02:00
Matthew Honnibal
083b6ea7ae
* Clean up encoder a bit. now read for integration into Vocab.
2015-07-13 12:57:22 +02:00
Matthew Honnibal
8d0f1d98da
* Draft dockstring for HuffmanCache
2015-07-13 12:01:18 +02:00
Matthew Honnibal
281f1faefb
* Nearly finished huffman coder
2015-07-12 23:48:46 +02:00
Matthew Honnibal
e1a25fba32
* Work on huffman coder
2015-07-12 19:58:05 +02:00
Matthew Honnibal
3fb9de2d13
* Remove vector[bint], in favor of simple Code struct.
2015-07-12 17:58:27 +02:00
Matthew Honnibal
aa7bfd932b
* Work on compressor
2015-07-12 16:03:43 +02:00
Matthew Honnibal
14eafcab15
* Refactor to use vector[bint]
2015-07-12 05:27:47 +02:00
Matthew Honnibal
6a6e852a39
* Refactor huffman coding stuff into class
2015-07-12 05:06:36 +02:00
Matthew Honnibal
aad96fdb5c
* Improve efficiency of huffman coding
2015-07-12 01:31:37 +02:00
Matthew Honnibal
ff9ff6f3fa
* Ensure unseen words are given low log probability
2015-07-12 01:31:09 +02:00
Matthew Honnibal
9d3b0d83de
* Refactor huffman coding
2015-07-11 22:27:43 +02:00
Matthew Honnibal
8d29406cd6
* Rename span.right to span.rights
2015-07-11 22:15:04 +02:00
Matthew Honnibal
da9f358166
* Fix span getting
2015-07-11 21:41:41 +02:00
Matthew Honnibal
11e8f2ffb4
* Huffman codes working
2015-07-11 20:01:10 +02:00
Matthew Honnibal
cb6fc81909
* Work on huffman coding.
2015-07-11 15:23:35 +02:00
Matthew Honnibal
4c9b77fe95
* Begin working on serialization code
2015-07-11 10:57:30 +02:00
Matthew Honnibal
53d1f5b2eb
* Rename Span.head to Span.root.
2015-07-09 17:30:58 +02:00
Matthew Honnibal
c0255ed7d8
* Allow slice indexing in Doc.__getitem__, returning a Span object
2015-07-09 15:15:32 +02:00
Matthew Honnibal
89a91ad726
* Add SPACE part-of-speech tag, and train tagger to assign it. Also train tagger not to make whitespace an entity
2015-07-09 13:30:41 +02:00
Matthew Honnibal
55f1042443
* Improve efficiency of L and R features, correcting the non-linear-in-length problem.
2015-07-09 12:17:26 +02:00
Matthew Honnibal
70d2acb579
* Fix edge features
2015-07-09 12:15:01 +02:00
Matthew Honnibal
adb868bdad
* Add warning for models not found in parser
2015-07-08 20:04:55 +02:00
Matthew Honnibal
05b28ec9eb
* Add warning for models not found in parser
2015-07-08 20:02:13 +02:00
Matthew Honnibal
ef700401a6
* Add warning for models not found in parser
2015-07-08 20:00:46 +02:00
Matthew Honnibal
6218d8b389
* Add warning for models not found in parser
2015-07-08 19:59:16 +02:00
Matthew Honnibal
f6a6c39ce8
* Add warning for models not found in parser
2015-07-08 19:52:30 +02:00
Matthew Honnibal
78db7e32f7
* Remove has_sense method from Lexeme declaration
2015-07-08 19:41:20 +02:00
Matthew Honnibal
6ddb2f5e45
* Restore merge_mwe in English class
2015-07-08 19:35:30 +02:00
Matthew Honnibal
6859f6adac
* Restore merge_mwe in English class
2015-07-08 19:34:55 +02:00
Matthew Honnibal
3c270fc8ff
* Remove has_sense method from Lexeme
2015-07-08 19:28:29 +02:00
Matthew Honnibal
b64c843861
* Remove senses attr
2015-07-08 19:26:24 +02:00
Matthew Honnibal
1d3a592edf
* Remove the senses attr from LexemeC, to keep data compatibility
2015-07-08 19:24:44 +02:00
Matthew Honnibal
0ceb1f71c2
* Update parse features
2015-07-08 19:11:36 +02:00
Matthew Honnibal
2e51b5027a
* Alias Doc to Tokens, for backwards compatibility
2015-07-08 18:59:35 +02:00
Matthew Honnibal
e3c53f5ecd
* Fix mention of Tokens in docstring
2015-07-08 18:56:27 +02:00
Matthew Honnibal
bb522496dd
* Rename Tokens to Doc
2015-07-08 18:53:00 +02:00
Matthew Honnibal
b24e8be2b9
* Whitespace in docstring
2015-07-08 12:37:03 +02:00
Matthew Honnibal
abc43b852d
* Add pos_tags attr to Vocab.
2015-07-08 12:36:38 +02:00
Matthew Honnibal
935bcdf3e5
* Remove redundant tag_names argument to Tokenizer
2015-07-08 12:36:04 +02:00
Matthew Honnibal
ff885e8511
* Add ParserFactory convenience function
2015-07-08 12:35:46 +02:00
Matthew Honnibal
4e4fac452b
* Refactor __init__ for simplicity. Allow parse=True, tag=True etc flags to be passed at top-level. Do not lazy-load parser.
2015-07-08 12:35:29 +02:00
Matthew Honnibal
1d2deb4616
* Work on refactoring default arguments to English.__init__
2015-07-07 15:53:25 +02:00
Matthew Honnibal
2d0e99a096
* Pass pos_tags into Tokenizer.from_dir
2015-07-07 14:23:08 +02:00
Matthew Honnibal
6788c86b2f
* Begin refactor
2015-07-07 14:00:07 +02:00
Matthew Honnibal
52fd80c6c6
* Add experimental supersense features for parsing, based on lookup into wordnet.
2015-07-01 20:12:44 +02:00
Matthew Honnibal
e6d828a9af
* Set up an array POS_SENSES that denotes the set of valid senses for each POS tag. This way, we can do bitwise & between a lexeme's senses and the ones available for its POS tag, to get the allowable senses for the token.
2015-07-01 20:12:13 +02:00
Matthew Honnibal
2b8459d9a8
* Add senses flag to Lexeme
2015-07-01 20:10:41 +02:00
Matthew Honnibal
e23d1582a2
* Add supersense data to Lexeme objects. Add simple has_sense method to check the flag.
2015-07-01 18:50:37 +02:00
Matthew Honnibal
64fafa98be
* Add senses.pyx and senses.pxd
2015-07-01 18:49:44 +02:00
Matthew Honnibal
94dab94e5f
uerge branch 'master' of https://github.com/honnibal/spaCy
2015-06-30 18:16:26 +02:00
Matthew Honnibal
9af86b0b0b
* Fix attrs.pxd
2015-06-30 18:16:30 +02:00
Matthew Honnibal
af9c82f7a6
Merge branch 'master' of https://github.com/honnibal/spaCy
2015-06-30 18:11:37 +02:00
Matthew Honnibal
5d595b5a8c
* Inc versions
2015-06-30 18:11:06 +02:00
Matthew Honnibal
d2eeba6667
* Start wiring up color and emotion lexicons. Hopefully we get to use them.
2015-06-30 16:22:23 +02:00
Matthew Honnibal
e20106fdff
* Begin reorganizing neuralnet work
2015-06-30 14:26:32 +02:00
Matthew Honnibal
5cd3ed42d4
* Reenable averaging
2015-06-29 16:44:42 +02:00
Matthew Honnibal
894cbef8ba
* Wire eta and mu parameters up for neural net
2015-06-29 07:10:33 +02:00
Matthew Honnibal
3bb5876c5a
* Inline methods in StateClass
2015-06-29 01:10:14 +02:00
Matthew Honnibal
313a7f87b3
* Inline methods in StateClass
2015-06-29 01:06:28 +02:00
Matthew Honnibal
a02fd3af5d
* Check valency in L and R feature methods, to make feaure calculation faster
2015-06-29 00:27:56 +02:00
Matthew Honnibal
5d870720bc
* Check valency in L and R feature methods, to make feaure calculation faster
2015-06-29 00:17:29 +02:00
Matthew Honnibal
f4986d5d3c
* Use new Example class
2015-06-28 22:36:03 +02:00
Matthew Honnibal
735f1af91f
* Fix neural net stuff
2015-06-28 11:44:58 +02:00
Matthew Honnibal
e7003f1cf3
* Remove hard-coding of vector lengths
2015-06-28 11:37:17 +02:00
Matthew Honnibal
897dd0dd0b
* Merge changes, and adjust Example to use memoryview
2015-06-28 11:36:11 +02:00
Matthew Honnibal
9282a8e72c
* Prepare for new models to be plugged in by using Example class
2015-06-28 11:02:35 +02:00
Matthew Honnibal
75aeccc064
* Rejig parser interface to use new thinc.api.Example class, in prep of theano model. Comment out beam search
2015-06-28 11:02:34 +02:00
Matthew Honnibal
bf33598b34
* Work on a theano-driven model for the parser
2015-06-28 11:02:34 +02:00
Matthew Honnibal
bbef71f213
* Fix min function in fill_context
2015-06-28 10:46:39 +02:00
Matthew Honnibal
142b6f9510
* Revert last changes
2015-06-28 10:44:28 +02:00
Matthew Honnibal
b06962f18b
* Pad buffers in state
2015-06-28 10:36:14 +02:00
Matthew Honnibal
53be72387c
* Hack at fill_context to investigate performance loss
2015-06-28 10:34:28 +02:00
Matthew Honnibal
71a4e876a9
* Fix parse features
2015-06-28 09:27:33 +02:00
Matthew Honnibal
0c4b5a2bb0
* Start scoring tokens
2015-06-28 06:21:38 +02:00
Matthew Honnibal
5af500909c
* Remove unused directve from parser.pyx
2015-06-28 06:20:21 +02:00
Matthew Honnibal
d5b4090705
* Add profile directive
2015-06-28 06:19:33 +02:00
Matthew Honnibal
2b5421e60c
* Add profile directive
2015-06-28 06:07:04 +02:00
Matthew Honnibal
8b5de4a411
* Add word / tag / label sets, for use in neural net
2015-06-28 05:46:53 +02:00
Matthew Honnibal
cfcbd8d256
* Fix punctuation eval in scorer.py
2015-06-28 01:31:39 +02:00
Matthew Honnibal
ed40a8380e
* Remove hard-coding of vector lengths
2015-06-27 04:18:47 +02:00
Matthew Honnibal
ebe630cc8d
* Enable more features for NN
2015-06-27 04:17:29 +02:00
Matthew Honnibal
f8bb43475e
* Bridge to Theano working. Very disorganised. Using thinc adb60aba966ed2
2015-06-27 02:39:18 +02:00
Matthew Honnibal
2fe98b8a9a
* Prepare for new models to be plugged in by using Example class
2015-06-26 13:51:39 +02:00
Matthew Honnibal
6896455884
* Rejig parser interface to use new thinc.api.Example class, in prep of theano model. Comment out beam search
2015-06-26 06:25:36 +02:00
Matthew Honnibal
b266a63f2c
* Inc version of downloadble data
2015-06-24 04:53:08 +02:00
Matthew Honnibal
02b171ee67
* Bug fixes to edge calculation
2015-06-24 04:28:02 +02:00
Matthew Honnibal
a4e9bdf4c1
* Work on a theano-driven model for the parser
2015-06-24 01:02:40 +02:00
Matthew Honnibal
7f9384f53c
* Remove deprecated _state module
2015-06-23 17:28:24 +02:00
Matthew Honnibal
6dbe182491
* Fix merge conflicts
2015-06-23 17:28:00 +02:00
Matthew Honnibal
579735a095
* Remove import of _state module
2015-06-23 17:25:08 +02:00
Matthew Honnibal
88f55d136b
* Remove deprecated _state module
2015-06-23 17:19:51 +02:00
Matthew Honnibal
9ab9dd2bf7
* Clean up unused orig_arc_eager and tree_arc_eager modules, which were only added for EMNLP experiments
2015-06-23 17:17:33 +02:00
Matthew Honnibal
7ebfe4b983
* Fixes to edge features
2015-06-23 16:32:54 +02:00
Matthew Honnibal
7b125f5a86
* Fixes to edge features
2015-06-23 16:31:01 +02:00
Matthew Honnibal
8d4bbacfc5
* Fix edge navigation in Token objects
2015-06-23 16:07:34 +02:00
Matthew Honnibal
35c290bee4
* Fix edge features
2015-06-23 15:50:56 +02:00
Matthew Honnibal
221e2e485f
* Assign 'ROOT' as label, not 'root'
2015-06-23 15:09:54 +02:00
Matthew Honnibal
a7bf7b0626
* Rename sent_start to sent_end, to reflect its new usage in the Break transition
2015-06-23 05:39:43 +02:00
Matthew Honnibal
ee3e56f27b
* Fix bounds checking on entities
2015-06-23 04:35:08 +02:00
Matthew Honnibal
43ef5ddea5
* Ensure root albel is spelled ROOT, for backwards compatibility
2015-06-23 04:14:03 +02:00
Matthew Honnibal
065c2e1d2d
* Add some bounds checking around state arrays
2015-06-23 04:13:09 +02:00
Matthew Honnibal
89ae218b75
* Add import to tokens.pyx from weird Cython compiler issue with casting from memory views
2015-06-23 03:04:34 +02:00
Matthew Honnibal
f01b3d043e
* Add padding to arrays in stateclass. May be papering over a deeper bug.
2015-06-23 03:03:41 +02:00
Matthew Honnibal
5e94b5d581
* Have Tokens return proper numpy arrays, not Cython views.
2015-06-23 00:07:34 +02:00
Matthew Honnibal
69507bc729
* Re-enable Break transition in arc_eager.pyx
2015-06-23 00:03:30 +02:00
Matthew Honnibal
cc579ed429
* Add __len__ function to StringStore
2015-06-23 00:02:50 +02:00
Matthew Honnibal
46fb24e9fd
* Add cycle-checking code in gold.pyx
2015-06-23 00:02:22 +02:00
Matthew Honnibal
60d26243e3
* Fix head alignment in read_conll.parse, which was causing corrupt parses when strip_bad_periods=True. A similar problem may apply to other data readers.
2015-06-18 16:35:27 +02:00
Matthew Honnibal
f868175e43
* Whitespace
2015-06-16 23:37:46 +02:00
Matthew Honnibal
ab110be125
* Remove debugging in parser.pyx
2015-06-16 23:37:25 +02:00
Matthew Honnibal
9b13d11ab3
* Fix handling of entities in StateClass
2015-06-16 23:35:21 +02:00
Matthew Honnibal
c40a2c661c
* Add tree_arc_eager
2015-06-15 08:23:24 +02:00
Matthew Honnibal
5da5cf7084
* Add some more features for S1/S0
2015-06-15 04:07:13 +02:00
Matthew Honnibal
8156a01bca
* Fix root label for orig_arc_eager
2015-06-15 02:54:55 +02:00
Matthew Honnibal
21930ede15
* Switch toggle on USE_ROOT_ARC_SEGMENT
2015-06-15 02:54:32 +02:00
Matthew Honnibal
38a6afa484
* Make possibly dubious correction to the unshift oracle
2015-06-15 02:50:00 +02:00
Matthew Honnibal
f66228f253
* Add some more features, esp for labels
2015-06-14 21:18:02 +02:00
Matthew Honnibal
3da8e0f317
* Add orig_arc_eager
2015-06-14 20:31:44 +02:00
Matthew Honnibal
ea8a103007
* Fix import of TransitionSystem in parser.pyx
2015-06-14 19:01:26 +02:00
Matthew Honnibal
e0984ca139
* Fix valency features in StateClass
2015-06-14 17:50:26 +02:00
Matthew Honnibal
e50ac1a47f
* Add verbose printing to scorer
2015-06-14 17:45:50 +02:00
Matthew Honnibal
763cbd23d5
* Upd stateclass.print_state
2015-06-14 17:44:29 +02:00