Matthew Honnibal
128b6d9714
* Move Utf8Str struct to strings module, as that's the only place it's relevant
2015-07-20 12:06:41 +02:00
Matthew Honnibal
01a97b90f3
* Fix header for string store
2015-07-20 12:06:10 +02:00
Matthew Honnibal
52d538ea42
* Fix short string optimization in strings.pyx. StringStore tests now all pass.
2015-07-20 12:05:23 +02:00
Matthew Honnibal
09a3055630
* Work on short string optimization in Utf8Str
2015-07-20 11:26:46 +02:00
Matthew Honnibal
bb0ba1f0cd
* Improve serialization speed
2015-07-20 03:27:59 +02:00
Matthew Honnibal
8743a8c084
* Update Doc serialization for new Packer interface
2015-07-20 01:38:04 +02:00
Matthew Honnibal
1f7170e0e1
* Reinstate the fixed vocabulary --- words are only added to the lexicon in init_model, after that we create LexemeC structs with the Pool given to us.
2015-07-20 01:37:34 +02:00
Matthew Honnibal
5a7d060d9c
* Switch between the orth and char codecs depending on which is shorter for that message. Mostly orth is shorter, except if there are OOV words.
2015-07-20 01:36:22 +02:00
Matthew Honnibal
5a042ee0d3
* Add function to predict number of bits needed to encode message
2015-07-20 01:35:11 +02:00
Matthew Honnibal
b89b489bb4
* Implement both character and orth encoding in Packer, so that we can decide which to use per-text
2015-07-19 22:39:45 +02:00
Matthew Honnibal
ae78c9e3ce
* Implement character-based codec, so that we can do word/char backoff
2015-07-19 22:03:39 +02:00
Matthew Honnibal
cd1d047cb8
* Delete out-dated HuffmanCodec comment
2015-07-19 18:28:14 +02:00
Matthew Honnibal
b8086067d5
* Build Huffman codec from unsorted inputs
2015-07-19 17:58:44 +02:00
Matthew Honnibal
317cbbc015
* Serialization round trip now working with decent API, but with rough spots in the organisation and requiring vocabulary to be fixed ahead of time.
2015-07-19 15:18:17 +02:00
Matthew Honnibal
6b13e7227c
* Remove duplicate get_lex_attr method from doc.pyx
2015-07-18 22:46:07 +02:00
Matthew Honnibal
e49c7f1478
* Update oov check in tokenizer
2015-07-18 22:45:28 +02:00
Matthew Honnibal
cfd842769e
* Allow infix tokens to be variable length
2015-07-18 22:45:00 +02:00
Matthew Honnibal
5b4c78bbb2
* Use an AttributeCodec based on orth for words. Still no oov handling mechanism.
2015-07-18 22:43:18 +02:00
Matthew Honnibal
82d84b0f2b
* Index lexemes by orth, instead of a lexemes vector. Breaks the mechanism for deciding not to own LexemeC structs during parsing. Need to reinstate this.
2015-07-18 22:42:15 +02:00
Matthew Honnibal
4dddc8a69b
* Fix type declarations for attr_t. Remove unused id_t.
2015-07-18 22:39:57 +02:00
Matthew Honnibal
ced59ab9ea
* Make minor efficiency improvement in Doc.__iter__
2015-07-18 04:10:53 +02:00
Matthew Honnibal
cd91914dd8
* Fix hard-coded length
2015-07-18 04:09:56 +02:00
Matthew Honnibal
b1d74ce60d
* Remove unused joint.pyx and joint.pxd files
2015-07-17 23:31:44 +02:00
Matthew Honnibal
c27514512b
* Remove cruft ner/ directory
2015-07-17 23:24:32 +02:00
Matthew Honnibal
f8d6d319f4
* Remove cruft module
2015-07-17 23:23:05 +02:00
Matthew Honnibal
fb0a641a2d
* Don't release the gil around Parser.parse. Does this indicate thread problems?
2015-07-17 23:07:37 +02:00
Matthew Honnibal
e29daea85f
* Fix bint/int typing problem in TransitionSystem. In C++ bint* means bool*, but in C it means int*. So, type-casting to bint* is unsafe.
2015-07-17 22:37:24 +02:00
Matthew Honnibal
cf0c788892
* Tests passing on round-trip pack/unpack on basic example
2015-07-17 21:20:48 +02:00
Matthew Honnibal
44f39a876f
* Add a blank attrs.pyx
2015-07-17 16:40:42 +02:00
Matthew Honnibal
c2c83120d4
* Remove codec property from Vocab
2015-07-17 16:40:11 +02:00
Matthew Honnibal
dfdf19f6a9
* Draft a from_orth method for Doc
2015-07-17 16:39:54 +02:00
Matthew Honnibal
9e3f17051b
* Move to ORTH instead of ID for encoding lexemes. Basic tests of the codec wrappers now passing
2015-07-17 16:38:29 +02:00
Matthew Honnibal
15ff739996
* Fix passing of ID attribute in string store
2015-07-17 14:49:42 +02:00
Matthew Honnibal
95e57c2780
* Remove unnecessary key and id properties from Utf8String.
2015-07-17 01:40:18 +02:00
Matthew Honnibal
234c7e440a
* Add spacy/serialize/__init__ files
2015-07-17 01:37:33 +02:00
Matthew Honnibal
db9dfd2e23
* Major refactor of serialization. Nearly complete now.
2015-07-17 01:27:54 +02:00
Matthew Honnibal
c8282f9934
* Work on serialization. Needs more reorganisation
2015-07-16 19:56:02 +02:00
Matthew Honnibal
d8458d6a25
* Fix attr_id_t import in Spans
2015-07-16 19:55:21 +02:00
Matthew Honnibal
d1cb30dbc4
* Remove unnecessary key and id properties from Utf8String.
2015-07-16 19:29:02 +02:00
Matthew Honnibal
897de2d438
* Add 'bitter' property for serializer in English class
2015-07-16 17:47:53 +02:00
Matthew Honnibal
fb54052ae0
* Work on serializer design
2015-07-16 17:46:46 +02:00
Matthew Honnibal
a6f401580d
* Add from_array function to Doc.
2015-07-16 17:46:11 +02:00
Matthew Honnibal
2a5d050134
* Give codec loading back to Vocab.
2015-07-16 17:45:42 +02:00
Matthew Honnibal
8bf0f65f1c
* Remove dead code in strings.pyx
2015-07-16 17:35:53 +02:00
Matthew Honnibal
a9c3863665
* Fix inefficiency in StringStore.dump function
2015-07-16 17:34:32 +02:00
Matthew Honnibal
b59d271510
* Move serialization functionality into Serializer class
2015-07-16 11:23:48 +02:00
Matthew Honnibal
30be4f15da
* Import attrs from spacy.attrs, not spacy.typedefs
2015-07-16 11:23:25 +02:00
Matthew Honnibal
6c99e5f4aa
* Move serialization into Serializer class, with __call__ and train() api
2015-07-16 11:22:35 +02:00
Matthew Honnibal
e2133d990e
* Move serialization functionality out into a Serializer object
2015-07-16 11:21:44 +02:00
Matthew Honnibal
a6d040bd11
* Import Lexeme attrs from spacy.attrs, not spacy.typedefs
2015-07-16 11:20:08 +02:00
Matthew Honnibal
45ae1ce428
* Remove unused declaration in parser
2015-07-16 01:27:11 +02:00
Matthew Honnibal
efa80096f1
* Upd attrs id list
2015-07-16 01:26:54 +02:00
Matthew Honnibal
01fab6bb90
* Improve de/serialize functions
2015-07-16 01:26:35 +02:00
Matthew Honnibal
0e07c1ed2a
* draft de/serialization functions in doc.pyx
2015-07-16 01:16:33 +02:00
Matthew Honnibal
9d956b07e9
* Fix import of attrs in doc.pyx, and update the get_token_attr function.
2015-07-16 01:15:34 +02:00
Matthew Honnibal
65251e7625
* Remove redundant attr_id_t from typedefs.pxd
2015-07-16 00:58:51 +02:00
Matthew Honnibal
9a8db9743c
* Remove gil from parser.call
2015-07-14 23:47:33 +02:00
Matthew Honnibal
38ca0c33f5
Merge branch 'neuralnet' into refactor
...
Mostly refactors parser, to use new thinc3.2 Example class.
Aim is to remove use of shared memory, so that we can parallelize
over documents easily.
Conflicts:
setup.py
spacy/syntax/parser.pxd
spacy/syntax/parser.pyx
spacy/syntax/stateclass.pyx
2015-07-14 14:13:47 +02:00
Matthew Honnibal
935ac53ee3
* Extend count_by method
2015-07-14 03:20:09 +02:00
Matthew Honnibal
3b5baa660f
* Fix tokenizer
2015-07-14 00:10:51 +02:00
Matthew Honnibal
2ae0b439b2
* Fix space check in gold.pyx
2015-07-14 00:10:27 +02:00
Matthew Honnibal
81aa4e6dcc
* Go back to having token reference doc, instead of complicated gymnastics. Rename the attr 'doc', to expose it in the API
2015-07-14 00:10:11 +02:00
Matthew Honnibal
24d6ce99ec
* Add comment to tokenizer, explaining the spacy attr
2015-07-13 22:29:13 +02:00
Matthew Honnibal
8214b74eec
* Restore _py_tokens cache, to handle orphan tokens.
2015-07-13 22:28:10 +02:00
Matthew Honnibal
67641f3b58
* Refactor tokenizer, to set the 'spacy' field on TokenC instead of passing a string
2015-07-13 21:46:02 +02:00
Matthew Honnibal
6eef0bf9ab
* Break up tokens.pyx into tokens/doc.pyx, tokens/token.pyx, tokens/spans.pyx
2015-07-13 20:20:58 +02:00
Matthew Honnibal
3ea8756c24
* Add spacy/tokens/doc.pyx, for Doc class in its own file
2015-07-13 19:58:26 +02:00
Matthew Honnibal
c99387155f
* Refactor tokens, moving classes into a module instead of a single file
2015-07-13 19:49:55 +02:00
Matthew Honnibal
d27899658e
* Import classes in spacy.tokens.__init__
2015-07-13 19:48:55 +02:00
Matthew Honnibal
aa82caf8f5
* Add TokenC.spacy attr
2015-07-13 19:48:07 +02:00
Matthew Honnibal
dba6b47d4e
* Refactor monster tokens.pyx file, into a tokens/ subpackage. Try to break the cycle between Doc and Token, and remove the need to pass around a unicode string reference
2015-07-13 19:20:48 +02:00
Matthew Honnibal
5b0a7190c9
* Round-trip for serialization finally working. Needs a lot of optimization.
2015-07-13 18:39:38 +02:00
Matthew Honnibal
edd371246c
* Make huffman coder take BitArray in encode/decode. Add __iter__ method to BitArray.
2015-07-13 17:33:33 +02:00
Matthew Honnibal
af5cc926a4
* Add codec property to Vocab, to use the Huffman encoding
2015-07-13 13:55:14 +02:00
Matthew Honnibal
77385d5580
* Make .pxd file for huffman codec
2015-07-13 13:54:51 +02:00
Matthew Honnibal
083b6ea7ae
* Clean up encoder a bit. now read for integration into Vocab.
2015-07-13 12:57:22 +02:00
Matthew Honnibal
8d0f1d98da
* Draft dockstring for HuffmanCache
2015-07-13 12:01:18 +02:00
Matthew Honnibal
281f1faefb
* Nearly finished huffman coder
2015-07-12 23:48:46 +02:00
Matthew Honnibal
e1a25fba32
* Work on huffman coder
2015-07-12 19:58:05 +02:00
Matthew Honnibal
3fb9de2d13
* Remove vector[bint], in favor of simple Code struct.
2015-07-12 17:58:27 +02:00
Matthew Honnibal
aa7bfd932b
* Work on compressor
2015-07-12 16:03:43 +02:00
Matthew Honnibal
14eafcab15
* Refactor to use vector[bint]
2015-07-12 05:27:47 +02:00
Matthew Honnibal
6a6e852a39
* Refactor huffman coding stuff into class
2015-07-12 05:06:36 +02:00
Matthew Honnibal
aad96fdb5c
* Improve efficiency of huffman coding
2015-07-12 01:31:37 +02:00
Matthew Honnibal
ff9ff6f3fa
* Ensure unseen words are given low log probability
2015-07-12 01:31:09 +02:00
Matthew Honnibal
9d3b0d83de
* Refactor huffman coding
2015-07-11 22:27:43 +02:00
Matthew Honnibal
8d29406cd6
* Rename span.right to span.rights
2015-07-11 22:15:04 +02:00
Matthew Honnibal
da9f358166
* Fix span getting
2015-07-11 21:41:41 +02:00
Matthew Honnibal
11e8f2ffb4
* Huffman codes working
2015-07-11 20:01:10 +02:00
Matthew Honnibal
cb6fc81909
* Work on huffman coding.
2015-07-11 15:23:35 +02:00
Matthew Honnibal
4c9b77fe95
* Begin working on serialization code
2015-07-11 10:57:30 +02:00
Matthew Honnibal
53d1f5b2eb
* Rename Span.head to Span.root.
2015-07-09 17:30:58 +02:00
Matthew Honnibal
c0255ed7d8
* Allow slice indexing in Doc.__getitem__, returning a Span object
2015-07-09 15:15:32 +02:00
Matthew Honnibal
89a91ad726
* Add SPACE part-of-speech tag, and train tagger to assign it. Also train tagger not to make whitespace an entity
2015-07-09 13:30:41 +02:00
Matthew Honnibal
55f1042443
* Improve efficiency of L and R features, correcting the non-linear-in-length problem.
2015-07-09 12:17:26 +02:00
Matthew Honnibal
70d2acb579
* Fix edge features
2015-07-09 12:15:01 +02:00
Matthew Honnibal
adb868bdad
* Add warning for models not found in parser
2015-07-08 20:04:55 +02:00
Matthew Honnibal
05b28ec9eb
* Add warning for models not found in parser
2015-07-08 20:02:13 +02:00
Matthew Honnibal
ef700401a6
* Add warning for models not found in parser
2015-07-08 20:00:46 +02:00
Matthew Honnibal
6218d8b389
* Add warning for models not found in parser
2015-07-08 19:59:16 +02:00
Matthew Honnibal
f6a6c39ce8
* Add warning for models not found in parser
2015-07-08 19:52:30 +02:00
Matthew Honnibal
78db7e32f7
* Remove has_sense method from Lexeme declaration
2015-07-08 19:41:20 +02:00
Matthew Honnibal
6ddb2f5e45
* Restore merge_mwe in English class
2015-07-08 19:35:30 +02:00
Matthew Honnibal
6859f6adac
* Restore merge_mwe in English class
2015-07-08 19:34:55 +02:00
Matthew Honnibal
3c270fc8ff
* Remove has_sense method from Lexeme
2015-07-08 19:28:29 +02:00
Matthew Honnibal
b64c843861
* Remove senses attr
2015-07-08 19:26:24 +02:00
Matthew Honnibal
1d3a592edf
* Remove the senses attr from LexemeC, to keep data compatibility
2015-07-08 19:24:44 +02:00
Matthew Honnibal
0ceb1f71c2
* Update parse features
2015-07-08 19:11:36 +02:00
Matthew Honnibal
2e51b5027a
* Alias Doc to Tokens, for backwards compatibility
2015-07-08 18:59:35 +02:00
Matthew Honnibal
e3c53f5ecd
* Fix mention of Tokens in docstring
2015-07-08 18:56:27 +02:00
Matthew Honnibal
bb522496dd
* Rename Tokens to Doc
2015-07-08 18:53:00 +02:00
Matthew Honnibal
b24e8be2b9
* Whitespace in docstring
2015-07-08 12:37:03 +02:00
Matthew Honnibal
abc43b852d
* Add pos_tags attr to Vocab.
2015-07-08 12:36:38 +02:00
Matthew Honnibal
935bcdf3e5
* Remove redundant tag_names argument to Tokenizer
2015-07-08 12:36:04 +02:00
Matthew Honnibal
ff885e8511
* Add ParserFactory convenience function
2015-07-08 12:35:46 +02:00
Matthew Honnibal
4e4fac452b
* Refactor __init__ for simplicity. Allow parse=True, tag=True etc flags to be passed at top-level. Do not lazy-load parser.
2015-07-08 12:35:29 +02:00
Matthew Honnibal
1d2deb4616
* Work on refactoring default arguments to English.__init__
2015-07-07 15:53:25 +02:00
Matthew Honnibal
2d0e99a096
* Pass pos_tags into Tokenizer.from_dir
2015-07-07 14:23:08 +02:00
Matthew Honnibal
6788c86b2f
* Begin refactor
2015-07-07 14:00:07 +02:00
Matthew Honnibal
52fd80c6c6
* Add experimental supersense features for parsing, based on lookup into wordnet.
2015-07-01 20:12:44 +02:00
Matthew Honnibal
e6d828a9af
* Set up an array POS_SENSES that denotes the set of valid senses for each POS tag. This way, we can do bitwise & between a lexeme's senses and the ones available for its POS tag, to get the allowable senses for the token.
2015-07-01 20:12:13 +02:00
Matthew Honnibal
2b8459d9a8
* Add senses flag to Lexeme
2015-07-01 20:10:41 +02:00
Matthew Honnibal
e23d1582a2
* Add supersense data to Lexeme objects. Add simple has_sense method to check the flag.
2015-07-01 18:50:37 +02:00
Matthew Honnibal
64fafa98be
* Add senses.pyx and senses.pxd
2015-07-01 18:49:44 +02:00
Matthew Honnibal
94dab94e5f
uerge branch 'master' of https://github.com/honnibal/spaCy
2015-06-30 18:16:26 +02:00
Matthew Honnibal
9af86b0b0b
* Fix attrs.pxd
2015-06-30 18:16:30 +02:00
Matthew Honnibal
af9c82f7a6
Merge branch 'master' of https://github.com/honnibal/spaCy
2015-06-30 18:11:37 +02:00
Matthew Honnibal
5d595b5a8c
* Inc versions
2015-06-30 18:11:06 +02:00
Matthew Honnibal
d2eeba6667
* Start wiring up color and emotion lexicons. Hopefully we get to use them.
2015-06-30 16:22:23 +02:00
Matthew Honnibal
e20106fdff
* Begin reorganizing neuralnet work
2015-06-30 14:26:32 +02:00
Matthew Honnibal
5cd3ed42d4
* Reenable averaging
2015-06-29 16:44:42 +02:00
Matthew Honnibal
894cbef8ba
* Wire eta and mu parameters up for neural net
2015-06-29 07:10:33 +02:00
Matthew Honnibal
3bb5876c5a
* Inline methods in StateClass
2015-06-29 01:10:14 +02:00
Matthew Honnibal
313a7f87b3
* Inline methods in StateClass
2015-06-29 01:06:28 +02:00
Matthew Honnibal
a02fd3af5d
* Check valency in L and R feature methods, to make feaure calculation faster
2015-06-29 00:27:56 +02:00
Matthew Honnibal
5d870720bc
* Check valency in L and R feature methods, to make feaure calculation faster
2015-06-29 00:17:29 +02:00
Matthew Honnibal
f4986d5d3c
* Use new Example class
2015-06-28 22:36:03 +02:00
Matthew Honnibal
735f1af91f
* Fix neural net stuff
2015-06-28 11:44:58 +02:00
Matthew Honnibal
e7003f1cf3
* Remove hard-coding of vector lengths
2015-06-28 11:37:17 +02:00
Matthew Honnibal
897dd0dd0b
* Merge changes, and adjust Example to use memoryview
2015-06-28 11:36:11 +02:00
Matthew Honnibal
9282a8e72c
* Prepare for new models to be plugged in by using Example class
2015-06-28 11:02:35 +02:00
Matthew Honnibal
75aeccc064
* Rejig parser interface to use new thinc.api.Example class, in prep of theano model. Comment out beam search
2015-06-28 11:02:34 +02:00
Matthew Honnibal
bf33598b34
* Work on a theano-driven model for the parser
2015-06-28 11:02:34 +02:00
Matthew Honnibal
bbef71f213
* Fix min function in fill_context
2015-06-28 10:46:39 +02:00
Matthew Honnibal
142b6f9510
* Revert last changes
2015-06-28 10:44:28 +02:00
Matthew Honnibal
b06962f18b
* Pad buffers in state
2015-06-28 10:36:14 +02:00
Matthew Honnibal
53be72387c
* Hack at fill_context to investigate performance loss
2015-06-28 10:34:28 +02:00
Matthew Honnibal
71a4e876a9
* Fix parse features
2015-06-28 09:27:33 +02:00
Matthew Honnibal
0c4b5a2bb0
* Start scoring tokens
2015-06-28 06:21:38 +02:00
Matthew Honnibal
5af500909c
* Remove unused directve from parser.pyx
2015-06-28 06:20:21 +02:00
Matthew Honnibal
d5b4090705
* Add profile directive
2015-06-28 06:19:33 +02:00
Matthew Honnibal
2b5421e60c
* Add profile directive
2015-06-28 06:07:04 +02:00
Matthew Honnibal
8b5de4a411
* Add word / tag / label sets, for use in neural net
2015-06-28 05:46:53 +02:00
Matthew Honnibal
cfcbd8d256
* Fix punctuation eval in scorer.py
2015-06-28 01:31:39 +02:00
Matthew Honnibal
ed40a8380e
* Remove hard-coding of vector lengths
2015-06-27 04:18:47 +02:00
Matthew Honnibal
ebe630cc8d
* Enable more features for NN
2015-06-27 04:17:29 +02:00
Matthew Honnibal
f8bb43475e
* Bridge to Theano working. Very disorganised. Using thinc adb60aba966ed2
2015-06-27 02:39:18 +02:00
Matthew Honnibal
2fe98b8a9a
* Prepare for new models to be plugged in by using Example class
2015-06-26 13:51:39 +02:00
Matthew Honnibal
6896455884
* Rejig parser interface to use new thinc.api.Example class, in prep of theano model. Comment out beam search
2015-06-26 06:25:36 +02:00
Matthew Honnibal
b266a63f2c
* Inc version of downloadble data
2015-06-24 04:53:08 +02:00
Matthew Honnibal
02b171ee67
* Bug fixes to edge calculation
2015-06-24 04:28:02 +02:00
Matthew Honnibal
a4e9bdf4c1
* Work on a theano-driven model for the parser
2015-06-24 01:02:40 +02:00
Matthew Honnibal
7f9384f53c
* Remove deprecated _state module
2015-06-23 17:28:24 +02:00
Matthew Honnibal
6dbe182491
* Fix merge conflicts
2015-06-23 17:28:00 +02:00
Matthew Honnibal
579735a095
* Remove import of _state module
2015-06-23 17:25:08 +02:00
Matthew Honnibal
88f55d136b
* Remove deprecated _state module
2015-06-23 17:19:51 +02:00
Matthew Honnibal
9ab9dd2bf7
* Clean up unused orig_arc_eager and tree_arc_eager modules, which were only added for EMNLP experiments
2015-06-23 17:17:33 +02:00
Matthew Honnibal
7ebfe4b983
* Fixes to edge features
2015-06-23 16:32:54 +02:00
Matthew Honnibal
7b125f5a86
* Fixes to edge features
2015-06-23 16:31:01 +02:00
Matthew Honnibal
8d4bbacfc5
* Fix edge navigation in Token objects
2015-06-23 16:07:34 +02:00
Matthew Honnibal
35c290bee4
* Fix edge features
2015-06-23 15:50:56 +02:00
Matthew Honnibal
221e2e485f
* Assign 'ROOT' as label, not 'root'
2015-06-23 15:09:54 +02:00
Matthew Honnibal
a7bf7b0626
* Rename sent_start to sent_end, to reflect its new usage in the Break transition
2015-06-23 05:39:43 +02:00
Matthew Honnibal
ee3e56f27b
* Fix bounds checking on entities
2015-06-23 04:35:08 +02:00
Matthew Honnibal
43ef5ddea5
* Ensure root albel is spelled ROOT, for backwards compatibility
2015-06-23 04:14:03 +02:00
Matthew Honnibal
065c2e1d2d
* Add some bounds checking around state arrays
2015-06-23 04:13:09 +02:00
Matthew Honnibal
89ae218b75
* Add import to tokens.pyx from weird Cython compiler issue with casting from memory views
2015-06-23 03:04:34 +02:00
Matthew Honnibal
f01b3d043e
* Add padding to arrays in stateclass. May be papering over a deeper bug.
2015-06-23 03:03:41 +02:00
Matthew Honnibal
5e94b5d581
* Have Tokens return proper numpy arrays, not Cython views.
2015-06-23 00:07:34 +02:00
Matthew Honnibal
69507bc729
* Re-enable Break transition in arc_eager.pyx
2015-06-23 00:03:30 +02:00
Matthew Honnibal
cc579ed429
* Add __len__ function to StringStore
2015-06-23 00:02:50 +02:00
Matthew Honnibal
46fb24e9fd
* Add cycle-checking code in gold.pyx
2015-06-23 00:02:22 +02:00
Matthew Honnibal
60d26243e3
* Fix head alignment in read_conll.parse, which was causing corrupt parses when strip_bad_periods=True. A similar problem may apply to other data readers.
2015-06-18 16:35:27 +02:00
Matthew Honnibal
f868175e43
* Whitespace
2015-06-16 23:37:46 +02:00
Matthew Honnibal
ab110be125
* Remove debugging in parser.pyx
2015-06-16 23:37:25 +02:00
Matthew Honnibal
9b13d11ab3
* Fix handling of entities in StateClass
2015-06-16 23:35:21 +02:00
Matthew Honnibal
c40a2c661c
* Add tree_arc_eager
2015-06-15 08:23:24 +02:00
Matthew Honnibal
5da5cf7084
* Add some more features for S1/S0
2015-06-15 04:07:13 +02:00
Matthew Honnibal
8156a01bca
* Fix root label for orig_arc_eager
2015-06-15 02:54:55 +02:00
Matthew Honnibal
21930ede15
* Switch toggle on USE_ROOT_ARC_SEGMENT
2015-06-15 02:54:32 +02:00
Matthew Honnibal
38a6afa484
* Make possibly dubious correction to the unshift oracle
2015-06-15 02:50:00 +02:00
Matthew Honnibal
f66228f253
* Add some more features, esp for labels
2015-06-14 21:18:02 +02:00
Matthew Honnibal
3da8e0f317
* Add orig_arc_eager
2015-06-14 20:31:44 +02:00
Matthew Honnibal
ea8a103007
* Fix import of TransitionSystem in parser.pyx
2015-06-14 19:01:26 +02:00
Matthew Honnibal
e0984ca139
* Fix valency features in StateClass
2015-06-14 17:50:26 +02:00
Matthew Honnibal
e50ac1a47f
* Add verbose printing to scorer
2015-06-14 17:45:50 +02:00
Matthew Honnibal
763cbd23d5
* Upd stateclass.print_state
2015-06-14 17:44:29 +02:00
Matthew Honnibal
bdd07bf000
* Fix Break oracle, but disable the Break transition for now, while we finalize the gold-standard experiments
2015-06-14 17:44:03 +02:00
Matthew Honnibal
399f15fbdf
* Add flag to toggle handling of multi-root inputs without the Break transition. Clear up now unused best_valid stuff.
2015-06-14 00:28:37 +02:00
Matthew Honnibal
75289b4761
* Don't refuse to parse single token sentences, incase some transition system needs them, e.g. single word entity. Instead fix error in _init_state.
2015-06-13 22:55:55 +02:00
Matthew Honnibal
77d7e79c7e
* Fix r/l and distance features.
2015-06-12 13:06:15 +02:00
Matthew Honnibal
b643cb3d5c
* Allow training documents to be filtered in gold.pyx
2015-06-12 02:42:08 +02:00
Matthew Honnibal
15e177d7a1
* Fixes to unshift/fast-forward strategy. Getting 91.55 greedy on NW dev, gold preproc
2015-06-12 01:50:23 +02:00
Matthew Honnibal
afd77a529b
* Prepare for break transition, with fast-forwarding. 86.5 on 1k nw gold preproc
2015-06-10 14:08:30 +02:00
Matthew Honnibal
495f528709
* Add support for sentence breaks in stateclass
2015-06-10 12:34:28 +02:00
Matthew Honnibal
b7b18c279d
* Fix Reduce oracle. Getting 86.35
2015-06-10 11:33:39 +02:00
Matthew Honnibal
bb09b5d91a
* Fix shifted bit vector in stateclass --- should reflect whether the word has been *unshifted*.
2015-06-10 11:33:09 +02:00
Matthew Honnibal
aa9625f688
* Do non-monotonic Unshift. Every word can be shifted at most 1 time. When the Reduce move is used, if S0 has no head, we put the word back on the buffer. Gets 86.4 on nw 1k with gold pre-proc. Break transition not yet implemented for this.
2015-06-10 10:15:56 +02:00
Matthew Honnibal
7bf6b7de3e
* Add unshift action to StateClass, and track which moves have been shifted
2015-06-10 10:13:03 +02:00
Matthew Honnibal
f7c8069e65
* Fix bug in distance feature
2015-06-10 10:12:17 +02:00
Matthew Honnibal
abd07c067a
* Inline B and S methods on stateclass
2015-06-10 07:22:33 +02:00
Matthew Honnibal
e2f9a80713
* Remove old _state imports
2015-06-10 07:09:17 +02:00
Matthew Honnibal
e9aaecc619
* Remove from_struct method from StateClass
2015-06-10 06:58:27 +02:00
Matthew Honnibal
18cc326dc0
* Bug fixes to ner.pyx
2015-06-10 06:57:41 +02:00
Matthew Honnibal
e5570c9700
* Set nogil for oracle functions
2015-06-10 06:56:56 +02:00
Matthew Honnibal
4575e7a60f
* Fix beam search with new StateClass
2015-06-10 06:33:39 +02:00
Matthew Honnibal
04b1cd9b8c
* Greedy parsing working with new StateClass. Beam parsing broken
2015-06-10 04:20:23 +02:00
Matthew Honnibal
6a94b64eca
* Remove State* from parser.pyx entirely, switching over to StateClass. Beam parsing still untested.
2015-06-10 02:03:38 +02:00
Matthew Honnibal
f14a1526aa
* Remove version of fill_context that takes State*
2015-06-10 01:39:07 +02:00
Matthew Honnibal
d68c686ec1
* Move StateClass into interface of transition functions
2015-06-10 01:35:28 +02:00
Matthew Honnibal
4b98b3e9c8
* Cost functions now take StateClass argument, instead of State*.
2015-06-10 00:40:43 +02:00
Matthew Honnibal
e0cf61f591
* Move StateClass into the interface for is_valid
2015-06-09 23:23:28 +02:00
Matthew Honnibal
0895d454fb
* Prepare to switch to using state class, instead of state struct
2015-06-09 21:20:14 +02:00
Matthew Honnibal
2b9629ed62
* Begin adding stateclass to ArcEager
2015-06-09 01:41:09 +02:00
Matthew Honnibal
ba10fd8af5
* Add StateClass, to replace/refactor the mess in _state
2015-06-09 01:39:54 +02:00
Matthew Honnibal
c7e3dfc1dc
* Don't automatically push words when stack is empty, as it messes up beam parsing. Add hash method to beam state.
2015-06-08 14:49:04 +02:00
Matthew Honnibal
00a0dfcb59
* Avoid shipping the spacy.munge package
2015-06-08 00:54:13 +02:00
Matthew Honnibal
7d265a9c62
* Revert to wget in spacy.en.download
2015-06-08 00:48:56 +02:00
Matthew Honnibal
a8fc5f1285
* Fix munge/read_ner
2015-06-08 00:35:04 +02:00
Matthew Honnibal
1515862861
* Fix download.py
2015-06-08 00:08:05 +02:00
Matthew Honnibal
7e9e8f654a
* Use urllib in spacy.en.download
2015-06-07 23:51:38 +02:00
Matthew Honnibal
80cff41a9c
* Upd download.py
2015-06-07 19:13:28 +02:00
Matthew Honnibal
6e2564239d
* Bug fixes to beam parser. Search still broken on non-gold sentences
2015-06-07 19:12:59 +02:00
Matthew Honnibal
1ec4e6fc95
* Don't score whitespace tokens
2015-06-07 19:10:32 +02:00
Matthew Honnibal
731e5f1e46
* Add get() function in spacy/syntax/Config
2015-06-07 19:09:15 +02:00
Matthew Honnibal
8f142c1838
* Refactor transition system oracles, to split out move and label cost. Preparing to add Unshift move. Will exclude non-monotonic.
2015-06-07 03:21:29 +02:00
Matthew Honnibal
89b8775887
* Fix output from _min_edit_path when inputs match.
2015-06-06 05:58:53 +02:00
Matthew Honnibal
98cfd84123
* Remove hyphenation from main tokenizer loop: do it in infix.txt instead. This lets emoticons work
2015-06-06 05:57:03 +02:00
Matthew Honnibal
1fee7ade61
* Tweak to ner
2015-06-05 23:48:43 +02:00
Matthew Honnibal
33e70b167f
* Remove dead code from ner.pyx
2015-06-05 17:12:47 +02:00
Matthew Honnibal
88ac5c6e98
* Send beam_width < 0 to greedy parser
2015-06-05 17:12:06 +02:00
Matthew Honnibal
0114e7600d
* Fix NER oracle
2015-06-05 17:11:26 +02:00
Matthew Honnibal
c04e6ebca6
* Allow user to load different sized vectors.
2015-06-05 16:26:39 +02:00
Matthew Honnibal
6bf35cecc3
* Refactor transition system to use classes with staticmethods.
2015-06-05 02:27:17 +02:00
Matthew Honnibal
36a34d544b
* Refactoring arc_eager, grouping oracle functions into transitions
2015-06-04 22:43:03 +02:00
Matthew Honnibal
4433396005
* Impove efficiency of dynamic oracle, making beam training faster
2015-06-04 21:15:14 +02:00
Matthew Honnibal
079dad28a7
* Update for faster beam training
2015-06-04 19:32:32 +02:00
Matthew Honnibal
f8843906ad
Merge branch 'constituency'
...
Add beam parsing and training from JSON files, with Levenshtein alignment.
2015-06-03 06:07:24 +02:00
Matthew Honnibal
ae653b850a
* Remove unused import from gold.pyx
2015-06-03 06:07:15 +02:00
Matthew Honnibal
a2627b6102
* Fix bug in refactored init_transition
2015-06-03 06:01:26 +02:00
Matthew Honnibal
dd0867645d
* Remove stray const from State header
2015-06-03 00:10:04 +02:00
Matthew Honnibal
6c47b10a6e
* Make optimization to children_in_buffer: stop searching when we would cross a bracket.
2015-06-02 21:05:24 +02:00
Matthew Honnibal
a513ec500f
* Have oracle functions take a struct instead of a Python object
2015-06-02 20:01:06 +02:00
Matthew Honnibal
d1b55310a1
* Refactor _advance_beam function
2015-06-02 18:38:41 +02:00
Matthew Honnibal
0786d9b3c7
* Refactor TransitionSystem, adding set_valid method
2015-06-02 18:38:07 +02:00
Matthew Honnibal
bd82a49994
* Add set_scores method to Model
2015-06-02 18:37:10 +02:00
Matthew Honnibal
a3964957f6
* Add profiling for _state.pyx
2015-06-02 18:36:27 +02:00
Matthew Honnibal
e822df0867
* Fix bugs in new greedy/beam parser
2015-06-02 02:01:33 +02:00
Matthew Honnibal
66dfa95847
* Revise greedy_parse/beam_parse ownership goof
2015-06-02 01:34:19 +02:00
Matthew Honnibal
75658b2ed3
* Remove use of new beam.loss property, to maintain compatibility with older versions of thinc for now.
2015-06-02 00:57:09 +02:00
Matthew Honnibal
7c29362d60
* Rename parser class in parser.pxd, now that beam parsing is supported
2015-06-02 00:53:49 +02:00
Matthew Honnibal
58d5ac0944
* Add beam search capabilities to Parser. Rename GreedyParser to Parser.
2015-06-02 00:28:02 +02:00
Matthew Honnibal
62424e6c76
* Remove unused regularize argument from _ml.Model
2015-06-02 00:27:07 +02:00
Matthew Honnibal
adeb57cb1e
* Fix long line
2015-06-01 23:07:00 +02:00
Matthew Honnibal
e09a08bd00
* Add copy_state function
2015-06-01 23:06:30 +02:00
Matthew Honnibal
c7876aa8b6
* Add get_valid method
2015-06-01 23:06:00 +02:00
Matthew Honnibal
d82f9d958d
* Remove regularization cruft from _ml, move score from .pxd file to .pyx
2015-05-31 18:48:05 +02:00
Matthew Honnibal
5e99ff94c8
* Edits to arc eager oracle. Couldn't figure out how the non-monotonic lines made sense. They seem covered by children_in_stack
2015-05-31 15:14:37 +02:00
Matthew Honnibal
6c5632b71c
* Roll back proposed change to Break transition while investigate effect
2015-05-31 06:49:52 +02:00
Matthew Honnibal
6bba793df3
* Disable the Zipf-reweighting thing while investigate effect
2015-05-31 06:48:43 +02:00
Matthew Honnibal
e77940565d
* Add length cap to distance feature
2015-05-31 05:25:30 +02:00
Matthew Honnibal
fd596351ba
* Fix valency features
2015-05-31 05:24:33 +02:00
Matthew Honnibal
87d6551d19
* Allow gold parse to cut non-projective arcs
2015-05-31 01:11:56 +02:00
Matthew Honnibal
c4f0914b4e
* Fix POS tag evaluation in scorer.py: do evaluate punctuation tags
2015-05-30 18:24:32 +02:00
Matthew Honnibal
9e39a206da
* Fix efficiency of JSON reading, by using ujson instead of stream
2015-05-30 17:54:52 +02:00
Matthew Honnibal
76300bbb1b
* Use updated JSON format, with sentences below paragraphs. Allows use of gold preprocessing flag.
2015-05-30 01:25:46 +02:00
Matthew Honnibal
b76bbbd12c
* Read json files recursively from a directory, instead of requiring a single .json file
2015-05-29 03:52:55 +02:00
Matthew Honnibal
8f31d3b864
* Relax constraint on Break transition for non-monotonic parsing.
2015-05-28 23:39:52 +02:00
Matthew Honnibal
6b2e5c4b8a
* Avoid NER scoring for sentences with some missing NER values.
2015-05-28 22:39:08 +02:00
Matthew Honnibal
d25d31442d
* Hackishly support broken NER annotations. Should fix this.
2015-05-27 19:14:31 +02:00
Matthew Honnibal
7a2725bca4
* Read input json in a streaming way
2015-05-27 19:13:11 +02:00
Matthew Honnibal
6a1c91675e
* Add file to read ENAMEX ner data
2015-05-27 17:36:23 +02:00
Matthew Honnibal
732fa7709a
* Edits to align_raw script, for use in prepare_treebank
2015-05-27 04:23:31 +02:00
Matthew Honnibal
4010b9b6d9
* Pass parameter for regularization in parser.pyx
2015-05-27 03:18:50 +02:00
Matthew Honnibal
4c6058baa7
* Fix evaluation of NER in scorer.py
2015-05-27 03:18:16 +02:00
Matthew Honnibal
6016ee83a6
* Fix reading of NER in gold.pyx
2015-05-27 03:17:50 +02:00
Matthew Honnibal
04bda8648d
* Pass parameter for regularization to model
2015-05-27 03:16:58 +02:00
Matthew Honnibal
f69fe6a635
* Fix heads problem in read_conll
2015-05-27 01:14:54 +02:00
Matthew Honnibal
0eec1d12af
* Add comment about zipf reweighting
2015-05-27 01:14:07 +02:00
Matthew Honnibal
4d37b66c55
* Make Zipf regularization a bit more efficient
2015-05-27 01:12:50 +02:00
Matthew Honnibal
7fc24821bc
* Experiment with Zipfian corruptions when calculating prediction
2015-05-26 22:17:15 +02:00
Matthew Honnibal
eba7b34f66
* Add flag to disable loading of word vectors
2015-05-25 01:02:42 +02:00
Matthew Honnibal
3593babd35
* Add functions for Levenshtein distance alignment
2015-05-24 21:50:48 +02:00
Matthew Honnibal
744f06abf5
* Add script to read OntoNotes source documents
2015-05-24 21:49:58 +02:00
Matthew Honnibal
fc75210941
* Move spacy.syntax.conll to spacy.gold
2015-05-24 21:35:02 +02:00
Matthew Honnibal
765b61cac4
* Update spacy.scorer, to use P/R/F to support tokenization errors
2015-05-24 20:07:18 +02:00
Matthew Honnibal
efe7a7d7d6
* Clean unused functions from spacy.syntax.conll
2015-05-24 20:06:46 +02:00
Matthew Honnibal
78487f3e66
* Update parser oracle for missing heads
2015-05-24 20:05:58 +02:00
Matthew Honnibal
1044a13413
* Begin refactoring scorer to use recall over gold dependencies
2015-05-24 17:40:15 +02:00
Matthew Honnibal
acd1245ad4
* Remove cruft from conll.pyx --- unused stuff about evlauation, which now lives in spacy.scorer
2015-05-24 17:35:49 +02:00
Matthew Honnibal
20f1d868a3
* Tmp commit. Working on whole document parsing
2015-05-24 02:49:56 +02:00
Matthew Honnibal
f2ee9c4feb
* Comment out constituency parsing stuff, so that code compiles
2015-05-20 16:55:05 +02:00
Matthew Honnibal
8ee7c541f1
* Update Constituent definition
2015-05-20 16:03:26 +02:00
Matthew Honnibal
9dfc9c039c
* Work on constituency parsing.
2015-05-20 16:02:51 +02:00
Matthew Honnibal
5a5710e711
* Fix Span.subtree property
2015-05-13 21:53:15 +02:00
Matthew Honnibal
badf030b6c
* Add parse navigation to Span objects
2015-05-13 21:45:19 +02:00
Matthew Honnibal
ca320afe86
* Add docstring for ents attribute
2015-05-13 21:20:47 +02:00
Matthew Honnibal
ba07b925a7
* Fix compile error in conll.pyx
2015-05-12 22:33:47 +02:00
Matthew Honnibal
f1e0272b18
* Disable c-parsing transitions
2015-05-12 22:33:25 +02:00
Matthew Honnibal
03a6626545
* Tmp commit
2015-05-12 20:27:56 +02:00
Matthew Honnibal
9568ebed08
* Fix off-by-one in head reading
2015-05-12 20:27:56 +02:00
Matthew Honnibal
69840d8cc3
* Tweak verbose output printing in scorer.py
2015-05-12 20:27:56 +02:00
Matthew Honnibal
0605af6838
* Fix head misalignment in read_conll, when periods are ignored
2015-05-12 20:27:56 +02:00
Matthew Honnibal
d2ac8d8007
* Add ctnt field to State, in preparation for constituency parsing
2015-05-12 20:27:56 +02:00
Matthew Honnibal
ab67693393
* Add read_json_file to conll.pyx
2015-05-12 20:27:55 +02:00
Matthew Honnibal
aff9359a8d
* Update ner.pyx to expect brackets from gold_tuples
2015-05-12 20:27:55 +02:00
Matthew Honnibal
0ad72a77ce
* Write JSON files, with both dependency and PSG parses
2015-05-12 20:27:55 +02:00
Matthew Honnibal
d48218f4b2
* Add left_edge and right_edge properties
2015-05-12 20:27:55 +02:00
Matthew Honnibal
53cf77e1c8
* Bug fix: when non-monotonically correct a dependency, make sure to delete the old one from the child list
2015-05-12 20:26:41 +02:00
Matthew Honnibal
a4e2af54f9
* Add support for l/r edge to add_dep, and move inlined methods into _state.pyx where possible
2015-05-12 20:26:41 +02:00
Matthew Honnibal
d634038eb6
* Add l_edge and r_edge props in TokenC for tracking the parse-yield of the token
2015-05-12 20:26:41 +02:00
Matthew Honnibal
03ebf70a66
* Inc version to 0.84
2015-05-12 02:38:51 +02:00
Matthew Honnibal
e73eaf2d05
* Replace some assertions with proper errors
2015-05-08 16:52:17 +02:00
Matthew Honnibal
fb8d50b3d5
Merge branch 'master' of ssh://github.com/honnibal/spaCy
2015-04-30 12:45:15 +02:00
Matthew Honnibal
ed8e8c3bd0
* Whitespace
2015-04-29 14:22:47 +02:00
Matthew Honnibal
378c2a6435
* Fix POS model: make it use tag instead of pos in history features
2015-04-29 00:02:53 +02:00
Matthew Honnibal
763ef01575
* Fix two bugs in feature calculation
2015-04-28 23:25:09 +02:00
Matthew Honnibal
b3fd48c97b
* Fix missing root labels bug identified in Issue #57
2015-04-28 20:45:51 +02:00
Jordan Suchow
3a8d9b37a6
Remove trailing whitespace
2015-04-19 13:01:38 -07:00
Jordan Suchow
5f0f940a1f
Remove unused imports
2015-04-19 01:05:22 -07:00
Matthew Honnibal
cc4e395927
* Add some ad hoc regexes, for multi-word location prepositions
2015-04-17 04:44:24 +02:00
Matthew Honnibal
f7ffd94e6a
* Add Token.conjuncts property
2015-04-17 01:40:53 +02:00
Matthew Honnibal
684d0e5e85
* Download updated data
2015-04-16 04:29:15 +02:00
Matthew Honnibal
2ef170a991
* Fix Issue #54 : Error merging multi-word token when there's a mid-token match.
2015-04-16 04:28:06 +02:00
Matthew Honnibal
42617548af
* Disable merge_mwes by default
2015-04-16 04:20:31 +02:00
Matthew Honnibal
99dbf8a38c
* Fix error type in lookup_transition
2015-04-16 01:36:22 +02:00
Matthew Honnibal
77d0700caf
* Add on X way regexes
2015-04-16 01:35:46 +02:00
Matthew Honnibal
9f16848b60
* Add (N0w, N1w) unigram pair to NER features, prompted by failure to detect 'this weekend'
2015-04-15 06:01:18 +02:00
Matthew Honnibal
c6707778dd
* Fix Issue #51 : Handle non-ascii lemmas correctly
2015-04-13 22:28:59 +02:00
Matthew Honnibal
bf0aff5124
* Fix bug in Tokens.ents where entity wasn't being emitted if another started immediately after
2015-04-13 21:34:33 +02:00
Matthew Honnibal
2b84a90bbb
* Fix Issue #50 : Python 3 compatibility of v0.80
2015-04-13 05:59:43 +02:00
Matthew Honnibal
fbd48c571d
* Rearrange code in tokens.pyx
2015-04-13 05:41:25 +02:00
Matthew Honnibal
507048dc45
* Rename StandardError to Exception, for Python 3 compatibility
2015-04-12 07:28:34 +02:00
Matthew Honnibal
761a19113a
* Fix /tmp moving thing in download.py
2015-04-12 07:04:10 +02:00
Matthew Honnibal
248a2b4b0f
* Remove Spans class
2015-04-12 04:07:29 +02:00
Matthew Honnibal
1d05e6da00
* Add ne_iob and ne_type features to NER
2015-04-10 19:07:08 +02:00
Matthew Honnibal
4df8a3d90f
* Add ne_iob and ne_type attributes to context vector
2015-04-10 05:02:15 +02:00
Matthew Honnibal
8c354c432b
* Add ValueError condition to ner_tag reading
2015-04-10 04:59:59 +02:00
Matthew Honnibal
435cccf098
* Add read_conll03_file function to conll.pyx
2015-04-10 04:59:11 +02:00
Matthew Honnibal
99c9ecfc18
* Fix bug in prefix, suffix and word shape features in parser and NER
2015-04-10 03:53:33 +02:00
Matthew Honnibal
cff2b13fef
* Fix Issue #44 : Broken Token.string attribute when single word sentence
2015-04-07 06:08:25 +02:00
Matthew Honnibal
6640386b25
* Fix Issue #43 : TAG attr not supported. Also add DEP attr, while I'm at it. Need better way of ensuring future changes don't break in similar way.
2015-04-07 06:00:57 +02:00
Matthew Honnibal
b64b2bd910
* Fix Issue #43 : TAG attr not supported. Also add DEP attr, while I'm at it. Need better way of ensuring future changes don't break in similar way.
2015-04-07 06:00:30 +02:00
Matthew Honnibal
f9e510a893
* Whitespace
2015-04-07 04:53:59 +02:00
Matthew Honnibal
66c7ccf6cc
* Fix Spans.orth_
2015-04-07 04:53:40 +02:00
Matthew Honnibal
b8d34531c4
* Add support for units to English.__init__, by loading and applying regular expressions
2015-04-07 04:02:32 +02:00
Matthew Honnibal
0ea5af88b6
* Add multi-word expression RegexMatcher
2015-04-07 03:45:40 +02:00
Matthew Honnibal
2fee67cfa3
* Add regular expressions for English multi-word expressions
2015-04-07 03:45:18 +02:00
Matthew Honnibal
5a075ea3fc
* Ensure NER moves are available for single-word tokens
2015-04-05 22:30:58 +02:00
Matthew Honnibal
a60a366b2c
* Support 'punct' dep label in conll.pyx
2015-04-05 22:30:19 +02:00
Matthew Honnibal
021c972137
* Print parse if verbose in scorer
2015-04-05 22:29:30 +02:00
Matthew Honnibal
fbf19049cf
* Add ent_type_ property
2015-03-31 02:01:29 +02:00
Matthew Honnibal
e70b87efeb
* Add merge() method to Tokens, with fairly brittle/hacky implementation, but quite easy to test. Passing minimal tests. Still need to fix left/right deps in C data
2015-03-30 01:37:41 +02:00
Matthew Honnibal
557856e84c
* Allow regular expressions to specify labels for merged spans
2015-03-27 17:40:52 +01:00
Matthew Honnibal
a3af6b7c3d
* Left-Arc from Root, to allow non-monotonic reduce to compete with left-arc when the stack is not empty.
2015-03-27 17:39:16 +01:00
Matthew Honnibal
db5a43318c
* Improve print_state debug printer
2015-03-27 17:29:58 +01:00
Matthew Honnibal
1705eccbbe
* Remove whitespace
2015-03-27 15:22:39 +01:00
Matthew Honnibal
3feb52374c
* Break apart a condition, for ease of debug printing
2015-03-27 15:21:38 +01:00
Matthew Honnibal
b32f581acb
* Fix bug in ArcEager.get_labels
2015-03-27 15:21:06 +01:00
Matthew Honnibal
5f2a4ff36d
* Fix spans.lemma_
2015-03-26 16:45:38 +01:00
Matthew Honnibal
f4cc222ec3
* Fix NER scoring
2015-03-26 16:45:38 +01:00
Matthew Honnibal
1320bd19db
* Move Span class to own file
2015-03-26 16:45:38 +01:00
Matthew Honnibal
6f47a667cf
* Move Span class to own file
2015-03-26 16:45:38 +01:00
Matthew Honnibal
f02c39dfaf
* Compare to is not None, for more robustness
2015-03-26 16:44:48 +01:00
Matthew Honnibal
8f68b864c4
* Move Span/Spans to separate files. Currently duplicates lots of Tokens functionality. Should probably be integrated into Tokens
2015-03-26 16:44:48 +01:00
Matthew Honnibal
e854ba0a13
* Remove support for force_gold flag from GreedyParser, since it's not so useful, and it's clutter
2015-03-26 16:44:47 +01:00
Matthew Honnibal
6a6085f8b9
* Clean up GreedyParser.train function a bit
2015-03-26 16:44:47 +01:00
Matthew Honnibal
b3157927e6
* Clean up unused feature templates
2015-03-26 16:44:47 +01:00
Matthew Honnibal
411bf377d4
* Remove dependency on ner_util module
2015-03-26 16:44:47 +01:00
Matthew Honnibal
01c892f583
* Add comment to fill_context
2015-03-26 16:44:47 +01:00
Matthew Honnibal
2741179aff
* Important bug fix: Fill token N2w, which was being unfilled, after a bad edit while writing the NER features.
2015-03-26 16:44:47 +01:00
Matthew Honnibal
2b2dec95d3
* Add comment to set_parse
2015-03-26 16:44:47 +01:00
Matthew Honnibal
e770fade1e
* Don't set dependency labels in set_parse, as this may be used by the Entity recogniser instead. Need to clean this method up...
2015-03-26 16:44:47 +01:00
Matthew Honnibal
71648205d9
* Add support for debug feature set. Just use unigrams for this.
2015-03-26 16:44:47 +01:00
Matthew Honnibal
3b70b304b2
* Add words to gold_tuples from gold conll file
2015-03-26 16:44:47 +01:00
Matthew Honnibal
2e12dec76e
* Adjust scorer to account for tokenization mistakes
2015-03-26 16:44:47 +01:00
Matthew Honnibal
05d6065e2e
* Add assertion
2015-03-26 16:44:46 +01:00
Matthew Honnibal
377e9b29b1
* Whitespace
2015-03-26 16:44:46 +01:00
Matthew Honnibal
670959f40c
* Fix iteration order on Tokens.rights
2015-03-26 16:44:46 +01:00
Matthew Honnibal
231ce2dae5
* Assign ROOT label by default. May be papering over another bug.
2015-03-26 16:44:46 +01:00
Matthew Honnibal
9f4ad8fdfb
* Assign root words the ROOT label via the Break transition. Something is still wrong here...
2015-03-26 16:44:46 +01:00
Matthew Honnibal
f729164c01
* Fix bug in label assignment: ensure null-label transitions receive the label 0
2015-03-26 16:44:46 +01:00
Matthew Honnibal
7237c805c7
* Load tag for specials.json token
2015-03-26 16:44:46 +01:00
Matthew Honnibal
567388e38d
* Use values encoded by StringStore in POS tagging, rather than indices into a list of tags
2015-03-26 16:44:45 +01:00
Matthew Honnibal
3105c7f8ba
* Don't pass label_ids dict to Tokens, since we now use the StringStore to manage string-to-int mapping for labels
2015-03-26 16:44:45 +01:00
Matthew Honnibal
801bf14f4f
* Clean up handling of dep_strings and ent_strings, using StringStore to encode the label names.
2015-03-26 16:44:45 +01:00
Matthew Honnibal
31fad99518
* Use StringStore to encode label names, instead of label_ids
2015-03-26 16:44:45 +01:00
Matthew Honnibal
64db61bff1
* Add Span class to Python API
2015-03-26 16:44:45 +01:00
Matthew Honnibal
b9b695fb1b
* Remove debug word list
2015-03-26 16:44:45 +01:00
Matthew Honnibal
f21ab2d7fb
* Fix bug in ugly ent_strings hack on English class
2015-03-26 16:44:45 +01:00
Matthew Honnibal
1c843934be
* Fix oracle bug in NER. Now getting 77% F on ontonotes
2015-03-26 16:44:44 +01:00
Matthew Honnibal
903f196b3f
* Fix verbose printing for scorer
2015-03-26 16:44:44 +01:00
Matthew Honnibal
e181c051d5
* Improve features for NER
2015-03-26 16:44:44 +01:00
Matthew Honnibal
7ecb52c0ed
* Add scorer script
2015-03-26 16:44:44 +01:00
Matthew Honnibal
8057a95f20
* NER seems to be working, scoring 69 F. Need to add decision-history features --- currently only use current word, 2 words context. Need refactoring.
2015-03-26 16:44:44 +01:00
Matthew Honnibal
ae235e07b9
* Refactoring working for parser, but now need to rig up features for NER, and then debug oracle etc.
2015-03-26 16:44:44 +01:00
Matthew Honnibal
b3eda03c9c
* Tmp
2015-03-26 16:44:44 +01:00
Matthew Honnibal
220ce8bfed
* Prepare English class for NER
2015-03-26 16:44:44 +01:00
Matthew Honnibal
f5830dc1c1
* Remove _transitions.pyx
2015-03-26 16:44:44 +01:00
Matthew Honnibal
6865c2fb4d
* Fix assignment of dep strings in tokens.pyx
2015-03-26 16:44:43 +01:00
Matthew Honnibal
6b6bce9e7a
* Fix label loading for transition system
2015-03-26 16:44:43 +01:00
Matthew Honnibal
5278c7504b
* Hacks to conll.pyx. Should clean these up.
2015-03-26 16:44:43 +01:00
Matthew Honnibal
f321b2b2eb
* Remove TODO comment
2015-03-26 16:44:43 +01:00
Matthew Honnibal
fdabd93bfb
* Ensure high loss for invalid moves, and fix label reading for arc-eager
2015-03-26 16:44:43 +01:00
Matthew Honnibal
10ed738df2
* Tmp commit
2015-03-26 16:44:43 +01:00
Matthew Honnibal
4f83c9b3d5
* Make costs label-sensitive
2015-03-26 16:44:43 +01:00
Matthew Honnibal
179b7eb0a7
* Specify parser transition system in language
2015-03-26 16:44:43 +01:00
Matthew Honnibal
8c883cef58
* Refactored transition system code now compiling. Still need to hook up label oracle, and test
2015-03-26 16:44:43 +01:00
Matthew Honnibal
f0159ab4b6
* Add file to hold GoldParse class
2015-03-26 16:44:42 +01:00
Matthew Honnibal
8eadb984cb
* Refactor arc_eager to use new TransitionSystem base class. Need to fix oracle
2015-03-26 16:44:42 +01:00
Matthew Honnibal
b063001596
* Add base TransitionSystem class. Still need to rethink how non-monotonic labelling will work for best_valid
2015-03-26 16:44:42 +01:00
Matthew Honnibal
01bc4d6815
* Add set_parse method, to assign parse to tokens in a less hacky way.
2015-03-26 16:44:42 +01:00
Matthew Honnibal
dc986dbc0b
* Work on refactored parser, where TransitionSystem can be easily subclassed
2015-03-26 16:44:42 +01:00
Matthew Honnibal
1cc6329b18
* Add base class to do transitions
2015-03-26 16:44:42 +01:00
Matthew Honnibal
135756ac3d
* Tmp commit of NER refactoring
2015-03-26 16:44:42 +01:00
Matthew Honnibal
23c1f6fc04
* Merge changes from stash
2015-03-26 16:44:41 +01:00
Matthew Honnibal
0ff078876a
* Commit some work on ner.yx done on the plane
2015-03-26 16:44:41 +01:00
Matthew Honnibal
d81b7be6a2
* Merge train.py
2015-03-26 16:44:41 +01:00
Matthew Honnibal
2e3dc3dfe2
* Merge changes in tokens.pyx
2015-03-26 16:44:41 +01:00
Matthew Honnibal
8cc3524dc9
* Ws
2015-03-26 16:44:41 +01:00
Matthew Honnibal
3d0570685c
* Add NER transition system
2015-03-26 16:44:41 +01:00
Matthew Honnibal
043b758cf4
* Resurrect old NER code. This version won't be the one that runs; we want to re-use the parser code. But for now this is a useful reference.
2015-03-26 16:44:41 +01:00
Matthew Honnibal
b139aa92ba
* Start setting out how NER will be implemented in the data model
2015-03-26 16:44:41 +01:00
Matthew Honnibal
0962ffc095
* Fix issue #37 : missing check_flag attribute from Token class
2015-03-26 15:06:26 +01:00
Matthew Honnibal
2e8d0e5d45
* Upd download script
2015-03-03 05:47:16 -05:00
Matthew Honnibal
dbe26f5793
* Add children and subtree methods to Token, which are generators to assist parse-tree navigation.
2015-03-03 04:18:41 -05:00
Matthew Honnibal
ea90d136e8
* Fix bug in labelled parsing, that caused an 8% drop in labelled accuracy.
2015-02-27 03:56:10 -05:00
Matthew Honnibal
caf046b220
* Hastily add method to apply tags from a list of strings, instead of predicting the tags.
2015-02-23 15:40:17 -05:00
Matthew Honnibal
cae077b583
* Work on fixing orphaned Token objects bug
2015-02-16 15:20:31 -05:00
Matthew Honnibal
7572e31f5e
* Pass ownership of C data to Token instances if Tokens object is being garbage-collected, but Token instances are staying alive.
2015-02-11 18:05:06 -05:00
Matthew Honnibal
64645a1c2f
* Improve docstring on English
2015-02-11 15:13:20 -05:00
Matthew Honnibal
594e50bd45
* Add option to download speech-parsing data set.
2015-02-11 14:20:29 -05:00
Matthew Honnibal
0b7e769211
* Add POS tags to support SWBD tag set
2015-02-11 14:08:28 -05:00
Matthew Honnibal
312b3a45f3
* Fix issue #19 : Allow parsing/pos tagging of empty strings
2015-02-10 10:15:58 -05:00
Matthew Honnibal
2a0615104b
* Upd download script
2015-02-09 10:22:59 -05:00
Matthew Honnibal
5c3513583d
* Clear buffered python tokens when modifying the Tokens object. Need to clean this up, and modify via a method on Tokens.
2015-02-09 03:57:10 -05:00
Matthew Honnibal
be5536d239
* Fix Issue #22 : PRP and PRP$ were mapped to NOUN. Should be PRON.
2015-02-08 18:36:18 -05:00
Matthew Honnibal
0492cee8b4
* Fix Issue #24 : Lemmas are empty when the L field is missing for special-cased tokens
2015-02-08 18:30:30 -05:00
Matthew Honnibal
d229fbd228
* Give better error on out-of-bounds array access
2015-02-07 12:59:12 -05:00
Matthew Honnibal
ab8bb047d0
* Fix negative index for __getitem__
2015-02-07 12:58:46 -05:00
Matthew Honnibal
44c7eafe44
* Fix download.py
2015-02-07 12:00:36 -05:00
Matthew Honnibal
6ca7f2eedc
* Upd download script
2015-02-07 11:32:33 -05:00
Matthew Honnibal
f0e0588833
* Fill L2 norm attribute on LexemeC struct
2015-02-07 08:44:42 -05:00
Matthew Honnibal
75f9b7d6bf
* Add L2 norm field to LexemeC struct
2015-02-07 08:43:17 -05:00
Matthew Honnibal
51b618d646
* Add a has_repvec property to Lexeme, and a check function to check flags
2015-02-07 08:42:44 -05:00
Matthew Honnibal
321b402739
* Store the l2 norm of the word's vector
2015-02-07 08:42:16 -05:00
Matthew Honnibal
c7d8644149
* Fix regression on 'prob' attr of Token.
2015-02-03 03:32:18 +11:00
Matthew Honnibal
c55a33d045
* Catch oracle errors
2015-02-02 23:02:04 +11:00
Matthew Honnibal
de772088e6
* Use parse tree for sbd in Tokens.sents
2015-02-02 12:17:32 +11:00
Matthew Honnibal
56c2ef2982
* Tweak POS features for web text
2015-02-02 11:59:36 +11:00
Matthew Honnibal
d68678a93e
* Add Exception class, OracleError
2015-02-02 11:57:32 +11:00
Matthew Honnibal
a20fdbd8ee
* Upd download script
2015-02-01 13:22:23 +11:00
Matthew Honnibal
76d9394cb4
* Fix vocab.pyx for Python3
2015-02-01 13:14:04 +11:00
Matthew Honnibal
63abdf154c
* Hastily hack download file
2015-01-31 22:48:32 +11:00
Matthew Honnibal
7de00c5a79
* Try not holding a reference to Pool, since that seems to confuse the GC
2015-01-31 22:10:22 +11:00
Matthew Honnibal
ce3ae8b5d9
* Fix platform-specific lexicon bug.
2015-01-31 16:38:58 +11:00
Matthew Honnibal
a1ed574b7b
* Fix default model path for English
2015-01-31 16:38:27 +11:00
Matthew Honnibal
018e0bfa24
* Bug fixes to parse navigation
2015-01-31 16:37:13 +11:00
Matthew Honnibal
e013555b25
* Add option to download script
2015-01-31 13:51:56 +11:00
Matthew Honnibal
08ca5c8970
* Add sent_end flag to TokenC struct
2015-01-31 13:44:16 +11:00
Matthew Honnibal
024cfd485c
* Pass tag_strings as a tuple, to support new Tokens API
2015-01-31 13:43:37 +11:00
Matthew Honnibal
77d62d0179
* Large refactor of Token objects, making them much thinner. This is to support fast parse-tree navigation.
2015-01-31 13:42:58 +11:00
Matthew Honnibal
88170e6295
* Supply dep_strings as a tuple, for the changed API on Tokens
2015-01-31 13:42:09 +11:00
Matthew Honnibal
0981d68022
* Set a sent_end flag during parsing, for later use
2015-01-31 13:41:46 +11:00
Matthew Honnibal
251dbf24d7
* Fix unintialised variable error
2015-01-30 20:46:34 +11:00
Matthew Honnibal
83a4df5a1a
* Fix download script
2015-01-30 20:40:42 +11:00
Matthew Honnibal
6f9ebc2f34
* Fix download script
2015-01-30 20:33:19 +11:00
Matthew Honnibal
8b85d0bb8a
* Only download small data if no data dir exists
2015-01-30 20:27:14 +11:00
Matthew Honnibal
1a7a1c2771
* Fix Issue #16 : tokens recurse when printing
2015-01-30 19:47:50 +11:00
Matthew Honnibal
cb95ef6934
* Fix download script
2015-01-30 19:28:43 +11:00
Matthew Honnibal
e578bd37bd
* Fix download script
2015-01-30 18:59:31 +11:00
Matthew Honnibal
df52014d12
* Fix download script
2015-01-30 18:36:24 +11:00
Matthew Honnibal
0f95712189
* Improve accuracy reporting during training
2015-01-30 18:05:06 +11:00
Matthew Honnibal
b68f563c2f
* Fix Issue #14 : Improve parsing API
2015-01-30 18:04:41 +11:00
Matthew Honnibal
998b607f65
* Upd download script, having it download all data if there's no data/ directory, allowing easier compilation from source
2015-01-30 18:04:01 +11:00
Matthew Honnibal
67d6e53a69
* Ensure parser and tagger function correctly when training from missing values, indicated by -1
2015-01-30 14:08:56 +11:00
Matthew Honnibal
4ff180db74
* Fix off-by-one error in commit 0a7fceb
2015-01-30 12:49:33 +11:00
Matthew Honnibal
0a7fcebdf7
* Fix Issue #12 : Incorrect token.idx calculations for some punctuation, in the presence of token cache
2015-01-30 12:33:38 +11:00
Matthew Honnibal
ebf7d2fab1
* Use non-joint sbd, for more simplicity and fewer classes
2015-01-29 06:22:03 +11:00
Matthew Honnibal
d05c5bf141
* Remove comment
2015-01-29 05:19:27 +11:00
Matthew Honnibal
320b045daa
* Oracle now consistent over gold standard derivation
2015-01-29 03:41:58 +11:00
Matthew Honnibal
f590382134
* Work on sbd
2015-01-29 03:18:29 +11:00
Matthew Honnibal
1884a7a0be
* Attach comment with paper
2015-01-28 03:18:43 +11:00
Matthew Honnibal
a2d6b195db
* Add messy Break transitions, carefully following the scheme of Dd Zhang et al (2013)
2015-01-28 03:09:45 +11:00
Matthew Honnibal
f9ee5d9934
* Build a python list of word strings, for debugging
2015-01-28 01:06:13 +11:00
Matthew Honnibal
d819101571
* Improve error message on oracle failure
2015-01-28 00:58:03 +11:00
Matthew Honnibal
e6c3d3471f
* Tweak documentation for Tokens, and hide constructor as __cinit__
2015-01-27 18:57:52 +11:00
Matthew Honnibal
c38c62d4a3
* Add docstring to English class
2015-01-27 02:45:21 +11:00
Matthew Honnibal
d4c99f7dec
* Add attrs.pxd
2015-01-26 22:22:09 +11:00
Matthew Honnibal
d4a493855e
* Fix error msg
2015-01-25 23:01:30 +11:00
Matthew Honnibal
7f87716cf7
* Fix download script
2015-01-25 23:01:10 +11:00
Matthew Honnibal
92fb9257dd
* Add parts-of-speech file
2015-01-25 22:00:39 +11:00
Matthew Honnibal
c1c3dba4cb
* Check whether vector files are present before trying to load them.
2015-01-25 18:16:48 +11:00
Matthew Honnibal
5049d4c2e6
* Add parts_of_speech.pyx
2015-01-25 16:32:26 +11:00
Matthew Honnibal
12b034e3ef
* Move POS tag definitions to parts_of_speech.pxd
2015-01-25 16:31:07 +11:00
Matthew Honnibal
7431c133d8
* Add error if try to access head and not is_parsed
2015-01-25 15:33:54 +11:00
Matthew Honnibal
951d06c824
* Silently don't parse if data is not present
2015-01-25 14:47:38 +11:00
Matthew Honnibal
4e857ab7a6
* Fix bug in POS tagger feature
2015-01-25 02:20:15 +11:00
Matthew Honnibal
dd56e298e2
* Ensure tagging is applied if parse=True
2015-01-25 02:19:44 +11:00
Matthew Honnibal
94750819cd
* Set parse=True by default --- i.e. parse unless told not to.
2015-01-25 01:28:28 +11:00
Matthew Honnibal
71b95202eb
* Add docstring to StringStore
2015-01-24 20:49:15 +11:00
Matthew Honnibal
6d1c08dafd
* Add docstring to Lexeme
2015-01-24 20:48:34 +11:00
Matthew Honnibal
a97bed9359
* Fix POS and dependency label tag names. Add parse and string navigation functions.
2015-01-24 17:29:04 +11:00
Matthew Honnibal
76cd024095
* Add whitespace property to Token
2015-01-24 07:41:21 +11:00
Matthew Honnibal
5fd72bc220
* Have 'string' refer to the whitespace-padded string
2015-01-24 07:32:38 +11:00
Matthew Honnibal
fda94271af
* Rename NORM1 and NORM2 attrs to lower and norm
2015-01-24 06:17:03 +11:00
Matthew Honnibal
5ed8b2b98f
* Rename sic to orth
2015-01-23 02:08:25 +11:00
Matthew Honnibal
a27b23cc8f
* Have SBD return start/end indices
2015-01-22 22:24:44 +11:00
Matthew Honnibal
d460c28838
* Rename vec to repvec
2015-01-22 02:06:22 +11:00
Matthew Honnibal
8b9d913d97
* Rename vec to repvec
2015-01-22 02:05:58 +11:00
Matthew Honnibal
9cd0b6b3e9
* Various tweaks to Tokens class
2015-01-22 02:05:37 +11:00
Matthew Honnibal
5928d158ce
* Pass the string to Tokens
2015-01-22 02:04:58 +11:00
Matthew Honnibal
45264e356b
* Rename vec to repvec
2015-01-22 02:04:24 +11:00
Matthew Honnibal
5e63c606ad
* Rename vec to repvec
2015-01-22 02:03:54 +11:00
Matthew Honnibal
56e6cf0672
* Add _string attr to Tokens object
2015-01-21 18:57:09 +11:00
Matthew Honnibal
d6ac60e91c
* Bug fixes to sentences method, and improved vector transport for tokens
2015-01-21 18:56:32 +11:00
Matthew Honnibal
f2a229136c
* Fix data_dir=None argument to English class
2015-01-21 18:27:31 +11:00
Matthew Honnibal
ef49b8c179
* Add stop-word flag
2015-01-21 18:22:31 +11:00
Matthew Honnibal
6646bfc5df
* Add LOWER attr
2015-01-21 18:19:08 +11:00
Matthew Honnibal
f149259bf5
* Fix negative indices in tokens
2015-01-20 01:16:29 +11:00
Matthew Honnibal
b65b0c07bf
* Messily hook up vector in tokens
2015-01-19 19:59:55 +11:00
Matthew Honnibal
8ff5b8bd84
* Add attribute for POS scheme
2015-01-17 17:33:16 +11:00
Matthew Honnibal
6c7e44140b
* Work on word vectors, and other stuff
2015-01-17 16:21:17 +11:00
Matthew Honnibal
802867e96a
* Revise interface to Token. Strings now have attribute names like norm1_
2015-01-15 03:51:47 +11:00
Matthew Honnibal
7d3c40de7d
* Tests passing after refactor. API has obvious warts, particularly in Token and Lexeme
2015-01-15 00:33:16 +11:00
Matthew Honnibal
0930892fc1
* Tmp. Working on refactor. Compiles, must hook up lexical feats.
2015-01-14 00:03:48 +11:00
Matthew Honnibal
46da3d74d2
* Tmp. Refactoring, introducing a Lexeme PyObject.
2015-01-12 11:23:44 +11:00
Matthew Honnibal
ce2edd6312
* Tmp commit. Refactoring to create a Python Lexeme class.
2015-01-12 10:26:22 +11:00
Matthew Honnibal
aacaf1a0f0
* Fix parser
2015-01-08 01:19:23 +11:00
Matthew Honnibal
9a21127bf7
* Fix parser, which was importing the wrong model
2015-01-08 00:10:15 +11:00
Matthew Honnibal
6a3e39cdd1
* Add typedefs.pyx
2015-01-06 04:51:40 +11:00
Matthew Honnibal
a58920cc5e
* Import orth.word_shape as a C module
2015-01-06 03:18:22 +11:00
Matthew Honnibal
6b68f7ef75
* Finally get string types right for orth function
2015-01-06 03:17:39 +11:00
Matthew Honnibal
90c143bd85
* Fix orth import
2015-01-05 18:49:19 +11:00
Matthew Honnibal
7689dccd0f
* Remove unused import
2015-01-05 18:48:48 +11:00
Matthew Honnibal
3f1944d688
* Make PyPy work
2015-01-05 17:54:38 +11:00
Matthew Honnibal
a510d9f677
* Another assertion removed
2015-01-05 13:01:40 +11:00
Matthew Honnibal
2856946a66
* Remove assertion that doesn't work on Python 3
2015-01-05 12:51:16 +11:00
Matthew Honnibal
94034f1112
* Fix encoding in lemmatization
2015-01-05 11:54:29 +11:00
Matthew Honnibal
b132b3caa6
* Fix unicode error in lemmatizer
2015-01-05 11:53:54 +11:00
Matthew Honnibal
477e7fbffe
* Fix data reading for lemmatizer
2015-01-05 06:01:32 +11:00
Matthew Honnibal
58f75abaca
* Fix unicode error in orth
2015-01-05 05:53:08 +11:00
Matthew Honnibal
4e085d5166
* Fix lemmatizer for Python3
2015-01-05 05:51:26 +11:00
Matthew Honnibal
ae7c811fd1
* Use Exception instead of StandardError
2015-01-04 01:22:12 +11:00
Matthew Honnibal
0e4c2ba036
* Fix loading of special morph words
2015-01-03 23:13:00 +11:00
Matthew Honnibal
f5d41028b5
* Move around data files for test release
2015-01-03 01:59:22 +11:00
Matthew Honnibal
a24321b63a
* Add downloader
2015-01-02 21:44:41 +11:00
Matthew Honnibal
5d9a096e2f
* Some minor clean-up after HastyModel
2014-12-31 19:46:04 +11:00
Matthew Honnibal
aafaf58cbe
* Refactor _ml.Model, and finish implementing HastyModel so far not worthwhile.
2014-12-31 19:40:59 +11:00
Matthew Honnibal
bcd038e7b6
* Implement HastyModel
2014-12-31 01:16:47 +11:00
Matthew Honnibal
1a075f77ff
* Don't over-ride pre-loaded POS tags, if set by special-cases
2014-12-30 23:26:32 +11:00
Matthew Honnibal
785c7ba76a
* Embed signature on attrs
2014-12-30 23:25:31 +11:00
Matthew Honnibal
30e5805656
* Lazy-load tagger and parser
2014-12-30 23:25:09 +11:00
Matthew Honnibal
9976aa976e
* Messily fix morphology and POS tags on special tokens.
2014-12-30 23:24:37 +11:00
Matthew Honnibal
c1ef3febee
* Embedsignature in tokens.pyx
2014-12-30 21:22:00 +11:00
Matthew Honnibal
aac5028b6e
* Move tagger to _ml
2014-12-30 21:21:38 +11:00
Matthew Honnibal
1ffb0229ed
* Import tokens in parser.pxd
2014-12-30 21:21:17 +11:00
Matthew Honnibal
bb0b00f819
* Repurporse the Tagger class as a generic Model, wrapping thinc's interface
2014-12-30 21:20:15 +11:00
Matthew Honnibal
fe2a5e0370
* Work on docstrings
2014-12-27 21:46:04 +11:00
Matthew Honnibal
bb80937544
* Upd docstrings
2014-12-27 18:45:16 +11:00
Matthew Honnibal
b8b65903fc
* Tmp
2014-12-24 17:42:00 +11:00
Matthew Honnibal
ab61673edd
* Fix api of array method
2014-12-23 15:18:48 +11:00
Matthew Honnibal
7708d0e24a
* Move lemmatizer to en dir
2014-12-23 15:16:57 +11:00
Matthew Honnibal
98eb4c0426
* Fix path to parser model
2014-12-23 15:09:09 +11:00
Matthew Honnibal
b00bc01d8c
* All tests now passing for reorg
2014-12-23 13:18:59 +11:00
Matthew Honnibal
73f200436f
* Tests passing except for morphology/lemmatization stuff
2014-12-23 11:40:32 +11:00
Matthew Honnibal
cf8d26c3d2
* POS tagger training working after reorg
2014-12-22 08:54:47 +11:00
Matthew Honnibal
4c4aa2c5c9
* Work on train
2014-12-22 07:25:43 +11:00
Matthew Honnibal
61df50b598
* Add English-subclass POS tagger
2014-12-21 20:59:07 +11:00
Matthew Honnibal
9f3f07cab6
* Add attrs file for English
2014-12-21 11:29:11 +11:00
Matthew Honnibal
2a89d70429
* Add vocab.pyx to setup, and ensure we can import spacy.en.lang
2014-12-21 06:03:53 +11:00
Matthew Honnibal
b34a1325d3
* Everything compiling after reorg. About to start testing.
2014-12-21 05:42:23 +11:00
Matthew Honnibal
e1c1a4b868
* Tmp
2014-12-21 05:36:29 +11:00
Matthew Honnibal
d11c1edf8c
* Import slice_unicode from strings.pyx
2014-12-20 07:56:26 +11:00
Matthew Honnibal
be1bdcbd85
* Move lang.pyx to tokenizer.pyx
2014-12-20 07:55:40 +11:00
Matthew Honnibal
89a1cc1a48
* Move murmurhash to .pxd in strings file
2014-12-20 07:41:08 +11:00
Matthew Honnibal
d5a942c4a4
* Rename lang.pyx to tokenizer.pyx
2014-12-20 07:30:39 +11:00
Matthew Honnibal
a60ae261ae
* Move tokenizer to its own file, and refactor
2014-12-20 07:29:16 +11:00
Matthew Honnibal
867a4a000c
* Export set_morph_from_dict function
2014-12-20 07:28:27 +11:00
Matthew Honnibal
4e30195c6d
* Refactor morphology.pyx
2014-12-20 07:27:28 +11:00
Matthew Honnibal
4c6ce7ee84
* Update tokens.pyx as part of reorg
2014-12-20 07:03:26 +11:00
Matthew Honnibal
116f7f3bc1
* Rename Lexicon to Vocab, and move it to its own file
2014-12-20 06:54:03 +11:00
Matthew Honnibal
780cbd68b1
* Move all struct definitions to structs.pxd, to avoid circular dependencies
2014-12-20 06:51:33 +11:00
Matthew Honnibal
f6556d8e5d
* Refactor, move Lexeme struct to structs.pxd
2014-12-20 06:51:03 +11:00
Matthew Honnibal
7d48bba6c4
* Move StringStore class to its own file
2014-12-20 06:42:01 +11:00
Matthew Honnibal
b066102d2d
* Remove POS cache for now
2014-12-20 03:49:58 +11:00
Matthew Honnibal
ff252dd535
* Clean up 'guess_cache' idea, which didnt work well enough
2014-12-20 03:49:11 +11:00
Matthew Honnibal
9d3ca13909
* Start work on parse-tree iteration classes
2014-12-20 03:48:10 +11:00
Matthew Honnibal
bed680c632
* Remove commented-out features
2014-12-20 03:47:32 +11:00
Matthew Honnibal
3d178c03ae
* Prune the features a bit
2014-12-20 02:46:14 +11:00
Matthew Honnibal
a0408e1758
* Working DecisionMemory class
2014-12-20 01:43:26 +11:00
Matthew Honnibal
7920ea72b4
* Working parser with the decision memory idea. Disabling that for now, for simplicity
2014-12-20 01:43:15 +11:00
Matthew Honnibal
a2f2a48da9
* Add some extra features
2014-12-20 01:42:24 +11:00
Matthew Honnibal
8fd9762d91
* Start laying out parse tree iteration methods
2014-12-20 01:42:09 +11:00
Matthew Honnibal
53b8bc1f3c
* Work on implementing a trainable cache for the parser. So far, doesn't improve efficiency
2014-12-19 09:30:50 +11:00
Matthew Honnibal
033d6c9ac2
* Adapt POS tagger decision-memory for use in parser
2014-12-19 07:23:04 +11:00
Matthew Honnibal
809ddf7887
* Add index.pxd
2014-12-19 07:23:00 +11:00
Matthew Honnibal
1879abd16a
* Set const-correctness for tagger
2014-12-18 20:41:52 +11:00
Matthew Honnibal
f72243b156
* Set const-correctness for Feature* array
2014-12-18 20:41:32 +11:00
Matthew Honnibal
6ab7e40590
* Add non-monotonic parsing with cost-sensitive update. 92.26 on Y&M set
2014-12-18 11:33:25 +11:00
Matthew Honnibal
7e0c692daf
* Automatically push when the stack is empty
2014-12-18 09:16:10 +11:00
Matthew Honnibal
61142a8eff
* Tweak features
2014-12-18 09:15:03 +11:00
Matthew Honnibal
8446ebfbbb
* Work on parser. Up to 92 UAS on YM labels
2014-12-18 09:05:31 +11:00
Matthew Honnibal
55de747bfc
* Remove .cpp files
2014-12-18 02:43:13 +11:00
Matthew Honnibal
4448a840f7
* Work on greedy parsing. Scoring about 91.2
2014-12-18 02:42:55 +11:00
Matthew Honnibal
87e9487d76
* Work on parser
2014-12-17 21:10:12 +11:00
Matthew Honnibal
9d7d97978d
* Work on greedy parser
2014-12-17 21:09:29 +11:00
Matthew Honnibal
d524dd306a
* Work on greedy parser
2014-12-17 03:19:43 +11:00
Matthew Honnibal
95ccea03b2
* Work on greedy parser
2014-12-16 22:46:55 +11:00
Matthew Honnibal
a432862fde
* Add exception type to _arg_max_among in tagger
2014-12-16 09:44:19 +11:00
Matthew Honnibal
9e00798820
* Work on integrating a greedy dependency parser
2014-12-16 08:06:04 +11:00
Matthew Honnibal
792802b2b9
* POS tag memoisation working, with good speed-up
2014-12-12 14:33:51 +11:00
Matthew Honnibal
ca54d58638
* Merge setup.py
2014-12-10 15:21:27 +11:00
Matthew Honnibal
9959a64f7b
* Working morphology and lemmatisation. POS tagging quite fast.
2014-12-10 08:09:32 +11:00
Matthew Honnibal
df3be14987
* Add pos_type features to POS tagger
2014-12-10 08:08:55 +11:00
Matthew Honnibal
42973c4b37
* Improve efficiency of tagger, and improve morphological processing
2014-12-10 01:02:04 +11:00
Matthew Honnibal
6b34a2f34b
* Move morphological analysis into its own module, morphology.pyx
2014-12-09 21:16:17 +11:00
Matthew Honnibal
b962fe73d7
* Make suffixes file use full-power regex, so that we can handle periods properly
2014-12-09 19:04:27 +11:00
Matthew Honnibal
accdbe989b
* Remove Tokens.extend method
2014-12-09 17:09:23 +11:00
Matthew Honnibal
495e1c7366
* Use fused type in Tokens.push_back, simplifying the use of the cache
2014-12-09 16:50:01 +11:00
Matthew Honnibal
302e09018b
* Work on fixing special-cases, reading them in as JSON objects so that they can specify lemmas
2014-12-09 14:48:01 +11:00
Matthew Honnibal
99bbbb6feb
* Work on morphological processing
2014-12-08 21:12:15 +11:00
Matthew Honnibal
7b68f911cf
* Add WordNet lemmatizer
2014-12-08 01:39:13 +11:00
Matthew Honnibal
c20dd79748
* Fiddle with const correctness and comments
2014-12-08 00:03:55 +11:00
Matthew Honnibal
b031c7c430
* Remove language-general context module
2014-12-07 23:53:01 +11:00
Matthew Honnibal
ef4398b204
* Rearrange POS stuff, so that language-specific stuff can live in language-specific modules
2014-12-07 23:52:41 +11:00
Matthew Honnibal
327383e38a
* Remove unused code in tagger.pyx
2014-12-07 22:16:17 +11:00
Matthew Honnibal
9f17467c2e
* Fix EMPTY_TOKEN
2014-12-07 22:07:41 +11:00
Matthew Honnibal
3819a88e1b
* Add support for tag dictionary, and fix error-code for predict method
2014-12-07 22:07:16 +11:00
Matthew Honnibal
f00afe12c4
* Load POS tagger in load() function if path exists
2014-12-07 22:05:57 +11:00
Matthew Honnibal
5fe5e6e66b
* Move context functions to header, inlining them.
2014-12-07 21:59:04 +11:00
Matthew Honnibal
5caabec789
* Link in tagger, to work on integrating POS tagging
2014-12-07 15:29:41 +11:00
Matthew Honnibal
0c7aeb9de7
* Begin revising tagger, focussing on POS tagging
2014-12-07 15:29:04 +11:00
Matthew Honnibal
f5c4f2eb52
* Revise context, focussing on POS tagging for now
2014-12-07 15:28:22 +11:00
Matthew Honnibal
e27b912ef9
* Remove need for confusing _data pointer to be stored on Tokens
2014-12-05 16:31:30 +11:00
Matthew Honnibal
1c9253701d
* Introduce a TokenC struct, to handle token indices, pos tags and sense tags
2014-12-05 15:56:14 +11:00
Matthew Honnibal
187372c7f3
* Allow the lexicon to create lexemes using an external memory pool, so that it can decide to make some lexemes temporary, rather than cached
2014-12-05 03:29:50 +11:00
Matthew Honnibal
75b8dfb348
* Remove upper_pc from lexeme.pyx
2014-12-04 22:14:34 +11:00
Matthew Honnibal
49f3780ff5
* Fiddle with lexeme attrs
2014-12-04 21:22:38 +11:00
Matthew Honnibal
564082e48e
* Hack Token class to take lex.dense inplace of the old lex.norm. This needs to be fixed...
2014-12-04 20:51:29 +11:00
Matthew Honnibal
69bb022204
* Add as_array and count_by method
2014-12-04 20:46:55 +11:00
Matthew Honnibal
e1b1f45cc9
* Add STEM attribute to lexeme
2014-12-04 20:46:20 +11:00
Matthew Honnibal
d7952634ca
* Make the string-store serve const pointers to Utf8Str
2014-12-03 16:01:47 +11:00
Matthew Honnibal
7e04c22f8f
* const added to Lexicon interface. Seems to work.
2014-12-03 15:58:17 +11:00
Matthew Honnibal
d70d31aa45
* Introduce first attempt at const-ness
2014-12-03 15:44:25 +11:00
Matthew Honnibal
4560ada85b
* Add typedef for attr_t. Change flag_t to flags_t
2014-12-03 11:06:31 +11:00
Matthew Honnibal
e600f7b327
* Move String struct stuff into the utf8string module, from spacy.lang
2014-12-03 11:06:00 +11:00
Matthew Honnibal
e170faf5b0
* Hack Tokens to work without tagger.pyx
2014-12-03 11:05:15 +11:00
Matthew Honnibal
b463a7eb86
* Make flag-setting a language-specific thing
2014-12-03 11:04:32 +11:00
Matthew Honnibal
71b009e323
* Fix bug in refactored StringStore.__getitem__
2014-12-03 11:02:24 +11:00
Matthew Honnibal
14097311ae
* Make StringStore.__getitem__ accept unicode-typed keys.
2014-12-03 01:33:20 +11:00
Matthew Honnibal
522bb0346e
* Work on get_array method of Tokens
2014-12-02 23:48:05 +11:00
Matthew Honnibal
8c2938fe01
* Rename Lexicon._dict to Lexicon._map
2014-12-02 23:46:59 +11:00
Matthew Honnibal
33dfb4933c
* Remove taggers from Language class. Work on doc strings
2014-11-26 19:53:55 +11:00
Matthew Honnibal
80baa2e3db
* Work on beam parser
2014-11-20 19:49:33 +11:00
Matthew Honnibal
5c3016bac8
* Tmp commit of ner code
2014-11-14 18:27:47 +11:00
Matthew Honnibal
33c421bcf8
* More feature tweaks
2014-11-12 23:59:16 +11:00
Matthew Honnibal
41dedfb14e
* Add label features for NER parsing
2014-11-12 23:55:10 +11:00
Matthew Honnibal
cf55b48ba6
* Switch to predict label on shift. Big increase in accuracy.
2014-11-12 23:50:12 +11:00
Matthew Honnibal
8f84e8a78b
* Neaten oracle
2014-11-12 23:38:07 +11:00
Matthew Honnibal
7e0a9077dd
* Add context files
2014-11-12 23:22:36 +11:00
Matthew Honnibal
3b0b902384
* IOB-style parsing working. Accuracy down from BILOU, form 87-88 to 85-86
2014-11-12 23:21:09 +11:00
Matthew Honnibal
e6bb8aa3a9
* Move moves to bilou_moves. Refactor context, returning to the simpler giant-enum style
2014-11-12 00:54:50 +11:00
Matthew Honnibal
c788633429
* Add tokens_from_list method to Language
2014-11-11 23:43:14 +11:00
Matthew Honnibal
95282d4993
* Use the dynamic oracle 'follow' strategy
2014-11-11 21:11:17 +11:00
Matthew Honnibal
5aaf7a024d
* Move ner features to ner subdir
2014-11-11 21:09:03 +11:00
Matthew Honnibal
ff8989b63c
* Use greedy NER parser
2014-11-11 21:08:35 +11:00
Matthew Honnibal
0d943ab358
* Fixed greedy NER parsing. With static oracle, replicates accuracy from tagger.
2014-11-11 17:17:54 +11:00
Matthew Honnibal
399239760b
* Fix moves for new State struct
2014-11-10 22:16:05 +11:00
Matthew Honnibal
82247169f2
* Implement validation and oracle on pystate, for testing
2014-11-10 22:15:32 +11:00
Matthew Honnibal
3709ed9d6d
* Add curr field to State, to handle entity being built
2014-11-10 22:14:36 +11:00
Matthew Honnibal
af9ed18cf1
* Bug fixes to NER
2014-11-10 17:39:23 +11:00
Matthew Honnibal
9f2587f5ec
* Work on shift-reduce NER
2014-11-10 16:28:56 +11:00
Matthew Honnibal
f307eb2e36
* Refactor context extraction, and start breaking out gold standards into their own functions
2014-11-09 15:43:07 +11:00
Matthew Honnibal
602f993af9
* Moving tagger to accept multiple correct answers
2014-11-09 15:18:33 +11:00
Matthew Honnibal
f37d896a42
* Upd NER feats. With adadelta learner, getting 76.9 on NER
2014-11-07 04:43:54 +11:00
Matthew Honnibal
68d1cdad62
* When encoding POS/NER tags, accept '-' as a missing value
2014-11-07 04:42:31 +11:00
Matthew Honnibal
949a6245f9
* Increase default number of iterations from 5 to 10
2014-11-07 04:42:04 +11:00
Matthew Honnibal
3cab1d9a29
* Refine word_shape feature, by trimming the max sequence length
2014-11-07 04:41:29 +11:00
Matthew Honnibal
b4454cf036
* Add extra context tokens
2014-11-07 04:40:36 +11:00
Matthew Honnibal
50309e6e49
* Fix context vector, importing all features
2014-11-05 22:11:39 +11:00
Matthew Honnibal
07a23768de
* Play with NER feats a bit. Up to 82.00 training on MUC7.
2014-11-05 21:47:17 +11:00
Matthew Honnibal
4ecbe8c893
* Complete refactor of Tagger features, to use a generic list of context names.
2014-11-05 20:45:29 +11:00
Matthew Honnibal
0a8c84625d
* Moving feature context stuff to a generalized place
2014-11-05 19:55:10 +11:00
Matthew Honnibal
3733444101
* Generalize tagger code, in preparation for NER and supersense tagging.
2014-11-05 03:42:14 +11:00
Matthew Honnibal
abbe3e44b0
* Move spacy.pos tagger to spacy.tagger, and generalize it so that it can take on other tagging tasks, given a different set of feature templates.
2014-11-05 00:37:59 +11:00
Matthew Honnibal
954c970415
* Add __iter__ method to tokens
2014-11-04 01:07:08 +11:00
Matthew Honnibal
f07457a91f
* Remove POS alignment stuff. Now use training data based on raw text, instead of clumsy detokenization stuff
2014-11-04 01:06:43 +11:00
Matthew Honnibal
ae52f9f38c
* Remove vocab10k from tokens
2014-11-03 00:23:20 +11:00
Matthew Honnibal
32fb50dc35
* Remove non_sparse method --- features wanting this can do it easily enough.
2014-11-03 00:15:47 +11:00
Matthew Honnibal
b5ae1471db
* Fiddle with POS tag features
2014-11-03 00:15:03 +11:00
Matthew Honnibal
70ea862703
* Remove vocab10k field, and add flags for gazetteers
2014-11-03 00:13:51 +11:00
Matthew Honnibal
711ed0f636
* Whitespace
2014-11-02 14:22:32 +11:00
Matthew Honnibal
fcd9490d56
* Add pos_tag method to Language
2014-11-02 14:21:43 +11:00
Matthew Honnibal
829bb2bdbe
* Add mappings to Twitter POS tag corpus
2014-11-02 13:21:19 +11:00
Matthew Honnibal
437cd2217d
* Fix strings i/o, removing use of ujson library in favour of plain text file. Allows better control of codecs.
2014-11-02 13:20:37 +11:00
Matthew Honnibal
3352e89e21
* Use LIKE_URL and LIKE_NUMBER flag features. Seems to improve accuracy on onto web
2014-11-02 13:19:54 +11:00
Matthew Honnibal
8335706321
* Add LIKE_URL and LIKE_NUMBER flag features
2014-11-02 13:19:23 +11:00
Matthew Honnibal
5484fbea69
* Implement is_number
2014-11-01 19:13:24 +11:00
Matthew Honnibal
f685218e21
* Add is_urlish function
2014-11-01 17:39:34 +11:00
Matthew Honnibal
09a3e54176
* Delete print statements from stringstore
2014-10-31 17:45:26 +11:00
Matthew Honnibal
b186a66bae
* Rename Token.lex_pos to Token.postype, and Token.lex_supersense to Token.sensetype
2014-10-31 17:44:39 +11:00
Matthew Honnibal
a8ca078b24
* Restore lexemes field to lexicon
2014-10-31 17:43:25 +11:00
Matthew Honnibal
6c807aa45f
* Restore id attribute to lexeme, and rename pos field to postype, to store clustered tag dictionaries
2014-10-31 17:43:00 +11:00
Matthew Honnibal
aaf6953fe0
* Add count_tags functionto pos.pyx, which should probably live in another file. Feature set achieves 97.9 on wsj19-21, 95.85 on onto web.
2014-10-31 17:42:15 +11:00
Matthew Honnibal
f67cb9a5a3
* Add count_tags functionto pos.pyx, which should probably live in another file. Feature set achieves 97.9 on wsj19-21, 95.85 on onto web.
2014-10-31 17:42:04 +11:00
Matthew Honnibal
ea8f1e7053
* Tighten interfaces
2014-10-30 18:14:42 +11:00
Matthew Honnibal
ea85bf3a0a
* Tighten the interface to Language
2014-10-30 18:01:27 +11:00
Matthew Honnibal
c6fcd03692
* Small efficiency tweak to lexeme init
2014-10-30 17:56:11 +11:00
Matthew Honnibal
87c2418a89
* Fiddle with data types on Lexeme, to compress them to a much smaller size.
2014-10-30 15:42:15 +11:00
Matthew Honnibal
ac88893232
* Fix Token after lexeme changes
2014-10-30 15:30:52 +11:00
Matthew Honnibal
e6b87766fe
* Remove lexemes vector from Lexicon, and the id and hash attributes from Lexeme
2014-10-30 15:21:38 +11:00
Matthew Honnibal
889b7b48b4
* Fix POS tagger, so that it loads correctly. Lexemes are being read in.
2014-10-30 13:38:55 +11:00
Matthew Honnibal
67c8c8019f
* Update lexeme serialization, using a binary file format
2014-10-30 01:01:00 +11:00
Matthew Honnibal
13909a2e24
* Rewriting Lexeme serialization.
2014-10-29 23:19:38 +11:00
Matthew Honnibal
234d49bf4d
* Seems to be working after refactor. Need to wire up more POS tag features, and wire up save/load of POS tags.
2014-10-24 02:23:42 +11:00
Matthew Honnibal
08ce602243
* Large refactor, particularly to Python API
2014-10-24 00:59:17 +11:00
Matthew Honnibal
7baef5b7ff
* Fix padding on tokens
2014-10-23 04:01:17 +11:00
Matthew Honnibal
96b835a3d4
* Upd for refactored Tokens class. Now gets 95.74, 185ms training on swbd_wsj_ewtb, eval on onto_web, Google POS tags.
2014-10-23 03:20:02 +11:00
Matthew Honnibal
e5e951ae67
* Remove the feature array stuff from Tokens class, and replace vector with array-based implementation, with padding.
2014-10-23 01:57:59 +11:00
Matthew Honnibal
ea1d4a81eb
* Refactoring get_atoms, improving tokens API
2014-10-22 13:10:56 +11:00
Matthew Honnibal
ad49e2482e
* Tagger now gets 97pc on wsj, parsing 19-21 in 500ms. Gets 92.7 on web text.
2014-10-22 12:57:06 +11:00
Matthew Honnibal
0a0e41f6c8
* Add prefix and suffix features
2014-10-22 12:56:09 +11:00
Matthew Honnibal
7018b53d3a
* Improve array features in tokens
2014-10-22 12:55:42 +11:00
Matthew Honnibal
43d5964e13
* Add function to read detokenization rules
2014-10-22 12:54:59 +11:00
Matthew Honnibal
224bdae996
* Add POS utilities
2014-10-22 10:17:57 +11:00
Matthew Honnibal
5ebe14f353
* Add greedy pos tagger
2014-10-22 10:17:26 +11:00
Matthew Honnibal
12742f4f83
* Add detokenize method and test
2014-10-18 18:07:29 +11:00
Matthew Honnibal
99f5e59286
* Have tokenizer emit tokens for whitespace other than single spaces
2014-10-14 20:25:57 +11:00
Matthew Honnibal
43743a5d63
* Work on efficiency
2014-10-14 18:22:41 +11:00
Matthew Honnibal
6fb42c4919
* Add offsets to Tokens class. Some changes to interfaces, and reorganization of spacy.Lang
2014-10-14 16:17:45 +11:00
Matthew Honnibal
2805068ca8
* Have tokens track tuples that record the start offset and pos tag as well as a lexeme pointer
2014-10-14 15:21:03 +11:00
Matthew Honnibal
65d3ead4fd
* Rename LexStr_casefix to LexStr_norm and LexInt_i to LexInt_id
2014-10-14 15:19:07 +11:00
Matthew Honnibal
868e558037
* Preparations in place to handle hyphenation etc
2014-10-10 20:23:23 +11:00
Matthew Honnibal
ff79dbac2e
* More slight cleaning for lang.pyx
2014-10-10 20:11:22 +11:00
Matthew Honnibal
3d82ed1e5e
* More slight cleaning for lang.pyx
2014-10-10 19:50:07 +11:00
Matthew Honnibal
02e948e7d5
* Remove counts stuff from Language class
2014-10-10 19:25:01 +11:00
Matthew Honnibal
71ee921055
* Slight cleaning of tokenizer code
2014-10-10 19:17:22 +11:00
Matthew Honnibal
59b41a9fd3
* Switch to new data model, tests passing
2014-10-10 08:11:31 +11:00
Matthew Honnibal
1b0e01d3d8
* Revising data model of lexeme. Compiles.
2014-10-09 19:53:30 +11:00
Matthew Honnibal
e40caae51f
* Update Lexicon class to expect a list of lexeme dict descriptions
2014-10-09 14:51:35 +11:00
Matthew Honnibal
51d75b244b
* Add serialize/deserialize functions for lexeme, transport to/from python dict.
2014-10-09 14:10:46 +11:00
Matthew Honnibal
d73d89a2de
* Add i attribute to lexeme, giving lexemes sequential IDs.
2014-10-09 13:50:05 +11:00
Matthew Honnibal
096ef2b199
* Rename external hashing lib, from trustyc to preshed
2014-09-26 18:40:03 +02:00
Matthew Honnibal
11a346fd5e
* Remove hashing modules, which are now taken over by external lib
2014-09-26 18:39:40 +02:00
Matthew Honnibal
93505276ed
* Add German tokenizer files
2014-09-25 18:29:13 +02:00
Matthew Honnibal
2e44fa7179
* Add util.py
2014-09-25 18:26:22 +02:00
Matthew Honnibal
b15619e170
* Use PointerHash instead of locally provided _hashing module
2014-09-25 18:23:35 +02:00
Matthew Honnibal
ed446c67ad
* Add typedefs file
2014-09-17 23:10:32 +02:00
Matthew Honnibal
316a57c4be
* Remove own memory classes, which have now been broken out into their own package
2014-09-17 23:10:07 +02:00
Matthew Honnibal
ac522e2553
* Switch from own memory class to cymem, in pip
2014-09-17 23:09:24 +02:00
Matthew Honnibal
6266cac593
* Switch to using a Python ref counted gateway to malloc/free, to prevent memory leaks
2014-09-17 20:02:26 +02:00
Matthew Honnibal
5a20dfc03e
* Add memory management code
2014-09-17 20:02:06 +02:00
Matthew Honnibal
0152831c89
* Refactor tokenization, enable cache, and ensure we look up specials correctly even when there's confusing punctuation surrounding the token.
2014-09-16 18:01:46 +02:00
Matthew Honnibal
143e51ec73
* Refactor tokenization, splitting it into a clearer life-cycle.
2014-09-16 13:16:02 +02:00
Matthew Honnibal
c396581a0b
* Fiddle with the way strings are interned in lexeme
2014-09-15 06:34:45 +02:00
Matthew Honnibal
0bb547ab98
* Fix memory error in cache, where entry wasn't being null-terminated. Various other changes, some good for performance
2014-09-15 06:34:10 +02:00
Matthew Honnibal
7959141d36
* Add a few abbreviations, to get tests to pass
2014-09-15 06:32:18 +02:00
Matthew Honnibal
d235299260
* Few nips and tucks to hash table
2014-09-15 05:03:44 +02:00
Matthew Honnibal
e68a431e5e
* Pass only the tokens vector to _tokenize, instead of the whole python object.
2014-09-15 04:01:38 +02:00
Matthew Honnibal
08cef75ffd
* Switch to using a heap-allocated vector in tokens
2014-09-15 03:46:14 +02:00
Matthew Honnibal
f77b7098c0
* Upd Tokens to use vector, with bounds checking.
2014-09-15 03:22:40 +02:00
Matthew Honnibal
0f6bf2a2ee
* Fix niggling memory error, which was caused by bug in the way tokens resized their internal vector.
2014-09-15 02:08:39 +02:00
Matthew Honnibal
df24e3708c
* Move EnglishTokens stuff to Tokens
2014-09-15 01:31:44 +02:00
Matthew Honnibal
bd08cb09a2
* Remove short-circuiting of initial_size argument for PointerHash
2014-09-15 01:30:49 +02:00
Matthew Honnibal
f3393cf57c
* Improve interface for PointerHash
2014-09-13 17:29:58 +02:00
Matthew Honnibal
45865be37e
* Switch hash interface, using void* instead of size_t, to avoid casts.
2014-09-13 17:02:06 +02:00
Matthew Honnibal
0447279c57
* PointerHash working, efficiency is good. 6-7 mins
2014-09-13 16:43:59 +02:00
Matthew Honnibal
85d68e8e95
* Replaced cache with own hash table. Similar timing
2014-09-13 03:14:43 +02:00
Matthew Honnibal
c8db76e3e1
* Add initial work on simple hash table
2014-09-13 02:02:41 +02:00
Matthew Honnibal
afdc9b7ac2
* More performance fiddling, particularly moving the specials into the cache, so that we can just lookup the cache in _tokenize
2014-09-13 00:59:34 +02:00
Matthew Honnibal
7d239df4c8
* Fiddle with declarations, for small efficiency boost
2014-09-13 00:31:53 +02:00
Matthew Honnibal
a8e7cce30f
* Efficiency tweaks
2014-09-13 00:14:05 +02:00
Matthew Honnibal
126a8453a5
* Fix performance issues by implementing a better cache. Add own String struct to help
2014-09-12 23:50:37 +02:00
Matthew Honnibal
9298e36b36
* Move special tokenization into its own lookup table, away from the cache.
2014-09-12 19:43:14 +02:00
Matthew Honnibal
985bc68327
* Fix bug with trailing punct on contractions. Reduced efficiency, and slightly hacky implementation.
2014-09-12 18:26:26 +02:00
Matthew Honnibal
7eab281194
* Fiddle with token features
2014-09-12 15:49:55 +02:00
Matthew Honnibal
5aa591106b
* Fiddle with token features
2014-09-12 15:49:36 +02:00
Matthew Honnibal
1533041885
* Update the split_one method, so that it doesn't need to cast back to a Python object
2014-09-12 05:10:59 +02:00
Matthew Honnibal
4817277d66
* Replace main lexicon dict with dense_hash_map. May be unsuitable, if strings need recovery.
2014-09-12 04:29:09 +02:00
Matthew Honnibal
8b20e9ad97
* Delete ununused _split method
2014-09-12 04:03:52 +02:00
Matthew Honnibal
a4863686ec
* Changed cache to use a linked-list data structure, to take out Python list code. Taking 6-7 mins for gigaword.
2014-09-12 03:30:50 +02:00
Matthew Honnibal
51e2006a65
* Increase cache size. Processing now 6-7 mins
2014-09-12 02:52:34 +02:00
Matthew Honnibal
e096f30161
* Tweak signatures and refactor slightly. Processing gigaword taking 8-9 mins. Tests passing, but some sort of memory bug on exit.
2014-09-12 02:43:36 +02:00
Matthew Honnibal
073ee0de63
* Restore dense_hash_map for cache dictionary. Seems to double efficiency
2014-09-12 02:23:51 +02:00
Matthew Honnibal
3c928fb5e0
* Switch to 64 bit hashes, for better reliability
2014-09-12 02:04:47 +02:00
Matthew Honnibal
2389bd1b10
* Improve cache mechanism by including a random element depending on the size of the cache.
2014-09-12 00:19:16 +02:00
Matthew Honnibal
c8f7c8bfde
* Moving to storing LexemeC structs internally
2014-09-11 21:54:34 +02:00
Matthew Honnibal
bf9c60c31c
* Moving to storing LexemeC structs internally
2014-09-11 21:44:58 +02:00
Matthew Honnibal
563047e90f
* Switch to returning a Tokens object
2014-09-11 21:37:32 +02:00
Matthew Honnibal
1a3222af4b
* Moving tokens to use an array internally, instead of a list of Lexeme objects.
2014-09-11 16:57:08 +02:00
Matthew Honnibal
5b1c651661
* Only store LexemeC structs in the vocabulary, transforming them to Lexeme objects for output. Moving away from Lexeme objects for Tokens soon.
2014-09-11 12:28:38 +02:00
Matthew Honnibal
e567713429
* Moving back to lexeme structs
2014-09-10 20:41:47 +02:00
Matthew Honnibal
b488224c09
* Restoring Lexeme-as-struct
2014-09-10 20:41:37 +02:00
Matthew Honnibal
7c09c73a14
* Refactor to use tokens class.
2014-09-10 18:27:44 +02:00
Matthew Honnibal
cf412adba8
* Refactoring to use Tokens object
2014-09-10 18:11:13 +02:00
Matthew Honnibal
8fbe9b6f97
* Bug fixes to flag features
2014-09-01 23:41:31 +02:00
Matthew Honnibal
151aa14bba
* Add asciify string transform, and other bits.
2014-09-01 23:25:28 +02:00
Matthew Honnibal
c4ba216642
* Switch canon_case to get value, to avoid keyerror
2014-09-01 17:27:36 +02:00
Matthew Honnibal
a779275a59
* Add canon_case function
2014-08-30 20:57:43 +02:00
Matthew Honnibal
8bbfadfced
* Pass tests. Need to implement more feature functions.
2014-08-30 20:36:06 +02:00
Matthew Honnibal
dcab14ede2
* Begin testing more functionality
2014-08-30 19:01:15 +02:00
Matthew Honnibal
3e3ff99ca0
* Add orth features
2014-08-30 19:01:00 +02:00
Matthew Honnibal
4e5b2d47e2
* More docs
2014-08-29 03:01:40 +02:00
Matthew Honnibal
5233f110c4
* Adding PTB3 tokenizer back in, so can understand how much boilerplate is in the docs for multiple tokenizers
2014-08-29 02:30:27 +02:00
Matthew Honnibal
45a22d6b2c
* Docs coming together
2014-08-29 01:59:23 +02:00
Matthew Honnibal
c282e6d5fb
* Redesign proceeding
2014-08-28 19:45:09 +02:00
Matthew Honnibal
fd4e61e58b
* Fixed contraction tests. Need to correct problem with the way case stats and tag stats are supposed to work.
2014-08-27 20:22:33 +02:00
Matthew Honnibal
fdaf24604a
* Basic punct tests updated and passing
2014-08-27 19:38:57 +02:00
Matthew Honnibal
8d20617dfd
* Whitespace
2014-08-27 17:16:16 +02:00
Matthew Honnibal
e9a62b6eba
* Refactoring with Lexeme as a class now compiles. Basic design seems to work
2014-08-27 17:15:39 +02:00
Matthew Honnibal
68bae2fec6
* More refactoring
2014-08-25 16:42:22 +02:00
Matthew Honnibal
88095666dc
* Remove Lexeme struct, preparing to rename Word to Lexeme.
2014-08-24 19:24:42 +02:00
Matthew Honnibal
ce59526011
* Add Word classes
2014-08-24 18:14:08 +02:00
Matthew Honnibal
3b793cf4f7
* Tests passing for new Word object version
2014-08-24 18:13:53 +02:00
Matthew Honnibal
9815c7649e
* Refactor around Word objects, adapting tests. Tests passing, except for string views.
2014-08-23 19:55:06 +02:00
Matthew Honnibal
4f01df9152
* Moving to Word objects in place of the Lexeme struct.
2014-08-22 17:32:16 +02:00
Matthew Honnibal
782806df08
* Moving to Word objects in place of the Lexeme struct.
2014-08-22 17:28:23 +02:00
Matthew Honnibal
47fbd0475a
* Replace the use of dense_hash_map with Python dict
2014-08-22 17:13:09 +02:00
Matthew Honnibal
e289896603
* Fix ptb3 module
2014-08-22 16:36:17 +02:00
Matthew Honnibal
89d6faa9c9
* Move en_ptb to ptb3
2014-08-22 04:24:05 +02:00
Matthew Honnibal
07ecf5d2f4
* Fixed group_by, removed idea of general attr_of function.
2014-08-22 00:02:37 +02:00
Matthew Honnibal
811b7a6b91
* Struggling with arbitrary attr access...
2014-08-21 23:49:14 +02:00
Matthew Honnibal
314658b31c
* Improve module docstring
2014-08-21 18:42:47 +02:00
Matthew Honnibal
d10993f41a
* More docs work
2014-08-21 16:37:13 +02:00
Matthew Honnibal
248cbb6d07
* Update doc strings
2014-08-21 03:29:15 +02:00
Matthew Honnibal
76afbd7d69
* Remove compiled orthography file
2014-08-20 17:04:07 +02:00
Matthew Honnibal
f39dcb1d89
* Add orthography
2014-08-20 17:03:44 +02:00
Matthew Honnibal
a78ad4152d
* Broken version being refactored for docs
2014-08-20 13:39:39 +02:00
Matthew Honnibal
5fddb8d165
* Working refactor, with updated data model for Lexemes
2014-08-19 04:21:20 +02:00
Matthew Honnibal
3379d7a571
* Reforming data model for lexemes
2014-08-19 02:40:37 +02:00
Matthew Honnibal
ab9b0daabf
* Whitespace
2014-08-18 23:21:49 +02:00
Matthew Honnibal
1b71cbfe28
* Roll back to using unicode, and never Py_UNICODE. No dependence on murmurhash either.
2014-08-18 20:48:48 +02:00
Matthew Honnibal
bbf9a2c944
* Working version that uses arrays for chunks, which should be more memory efficient
2014-08-18 20:23:54 +02:00
Matthew Honnibal
8d3f6082be
* Working version, adding improvements
2014-08-18 19:59:59 +02:00
Matthew Honnibal
01469b0888
* Refactor spacy so that chunks return arrays of lexemes, so that there is properly one lexeme per word.
2014-08-18 19:14:00 +02:00
Matthew Honnibal
b94c9b72c9
* WordTree in use. Need to reform the way chunks are handled. Should be properly one Lexeme per word, with split points being the things that are cached.
2014-08-16 20:10:22 +02:00
Matthew Honnibal
34b68a18ab
* Progress to getting WordTree working. Tests pass, but so far it's slower.
2014-08-16 19:59:38 +02:00
Matthew Honnibal
865cacfaf7
* Remove dependence on murmurhash
2014-08-16 17:37:09 +02:00
Matthew Honnibal
515d41d325
* Restore string saving to spacy
2014-08-16 16:09:24 +02:00
Matthew Honnibal
36073b89fe
* Restore unicode, work on improving string storage.
2014-08-16 14:35:34 +02:00
Matthew Honnibal
a225ca5b0d
* Refactoring tokenizer
2014-08-16 03:22:03 +02:00
Matthew Honnibal
213a440ffc
* Add string decode and encode helpers to string_tools
2014-08-15 23:57:27 +02:00
Matthew Honnibal
f11c8e22eb
* Remove happax stuff
2014-08-02 22:11:28 +01:00
Matthew Honnibal
d6e07aa922
* Switch to 32bit hash for strings
2014-08-02 21:51:52 +01:00
Matthew Honnibal
365a2af756
* Restore happax. commit uncommited work
2014-08-02 21:27:03 +01:00
Matthew Honnibal
6319ff0f22
* Add length property
2014-08-02 21:26:44 +01:00
Matthew Honnibal
18fb76b2c4
* Removed happax. Not sure if good idea.
2014-08-02 20:53:35 +01:00
Matthew Honnibal
edd38a84b1
* Removing happax stuff. Added length
2014-08-02 20:45:12 +01:00
Matthew Honnibal
fc7c10d7f8
* Ugly but seemingly working fix to the token memory leak
2014-08-01 09:43:19 +01:00
Matthew Honnibal
c7bb6b329c
* Don't free clobbered lexemes, as they might be part of a tail
2014-08-01 08:22:38 +01:00
Matthew Honnibal
c48214460e
* Free lexemes clobbered as happaxes
2014-08-01 07:40:20 +01:00
Matthew Honnibal
5b6457e80e
* Free lexemes clobbered as happaxes
2014-08-01 07:37:50 +01:00
Matthew Honnibal
d8cb2288ce
* Roll back to using murmurhash2 for now
2014-08-01 07:28:47 +01:00
Matthew Honnibal
f39211b2b1
* Add FixedTable for hashing
2014-08-01 07:27:21 +01:00
Matthew Honnibal
a44e15f623
* Hack around lack of distribution features for now.
2014-07-31 18:24:51 +01:00
Matthew Honnibal
4cb88c940b
* Fix memory leak in tokenizer, caused by having a fixed vocab.
2014-07-31 18:19:38 +01:00
Matthew Honnibal
5b81ee716f
* Use a sparse_hash_map to store happax vocab items, with a max size.
2014-07-31 17:40:43 +01:00
Matthew Honnibal
b9016c4633
* Switch to using sparsehash and murmurhash libraries out of pip
2014-07-25 15:47:27 +01:00
Matthew Honnibal
a895fe5ddb
* Upd from spacy
2014-07-23 17:35:18 +01:00
Matthew Honnibal
87bf205b82
* Fix open apostrophe bug
2014-07-07 23:26:01 +02:00
Matthew Honnibal
571808a274
Group-by seems to be working
2014-07-07 20:27:02 +02:00
Matthew Honnibal
80b36f9f27
* 710k words per second for counts
2014-07-07 19:12:19 +02:00
Matthew Honnibal
057c21969b
* Refactor for string view features. Working on setting up flags and enums.
2014-07-07 16:58:48 +02:00
Matthew Honnibal
f1bcbd4c4e
* Reorganized code to accomodate Tokens class. Need string views before group_by and count_by can be done well.
2014-07-07 12:47:21 +02:00
Matthew Honnibal
6668e44961
* Whitespace
2014-07-07 08:15:44 +02:00
Matthew Honnibal
0074ae2fc0
* Switch to dynamically allocating array, based on the document length
2014-07-07 08:05:29 +02:00
Matthew Honnibal
ff1869ff07
* Fixed major efficiency problem, from not quite grokking pass by reference in cython c++
2014-07-07 07:36:43 +02:00
Matthew Honnibal
0c76143b72
* Give value for assert
2014-07-07 05:10:46 +02:00
Matthew Honnibal
e244739dfe
* Fix ptb tokenization
2014-07-07 05:10:09 +02:00
Matthew Honnibal
dc20500920
* Remove cpp files
2014-07-07 05:09:05 +02:00
Matthew Honnibal
25849fc926
* Generalize tokenization rules to capitals
2014-07-07 05:07:21 +02:00
Matthew Honnibal
df0458001d
* Begin work on full PTB-compatible English tokenization
2014-07-07 04:29:24 +02:00
Matthew Honnibal
d5bef02c72
* Reorganized, moving language-independent stuff to spacy. The functions in spacy ask for the dictionaries and split function on input, but the language-specific modules are curried versions that use the globals
2014-07-07 04:21:06 +02:00
Matthew Honnibal
a62c38e1ef
* Working tokenization. en doesn't match PTB perfectly. Need to reorganize before adding more schemes.
2014-07-07 01:15:59 +02:00
Matthew Honnibal
4e79446dc2
* Reading in tokenization rules correctly. Passing tests.
2014-07-07 00:02:55 +02:00
Matthew Honnibal
72159e7011
* Fixes to tokenization. Now segment sequences of the same punctuation.
2014-07-06 19:28:42 +02:00
Matthew Honnibal
e98e97d483
* Possessive test passing
2014-07-06 18:35:55 +02:00
Matthew Honnibal
556f6a18ca
* Initial commit. Tests passing for punctuation handling. Need contractions, file transport, tokenize function, etc.
2014-07-05 20:51:42 +02:00