Commit Graph

1220 Commits

Author SHA1 Message Date
Matthew Honnibal
4c87a696b3 * Add draft dfa matcher, in Python. Passing tests. 2015-08-04 15:55:28 +02:00
Matthew Honnibal
eb7138c761 * Add attr relation in base NP detection 2015-08-01 00:34:40 +02:00
Matthew Honnibal
4988356cf0 * Fix dependency type bug from merged tokens 2015-08-01 00:33:24 +02:00
Matthew Honnibal
78a9068319 * Fix spacy attr on merged tokens 2015-07-30 04:25:58 +02:00
Matthew Honnibal
430e2edb96 * Fix noun_chunks issue 2015-07-30 03:51:50 +02:00
Matthew Honnibal
9590968fc1 * Fix negative indices in Span 2015-07-30 02:30:24 +02:00
Matthew Honnibal
74d8cb3980 * Add noun_chunks iterator, and fix left/right child setting in Doc.merge 2015-07-30 02:29:49 +02:00
Matthew Honnibal
d153f18969 * Fix negative indices on spans 2015-07-29 22:36:03 +02:00
Matthew Honnibal
b5132bed7d * Set left and right children when loading parse from byte string 2015-07-28 21:03:18 +02:00
Matthew Honnibal
6609fcf4b2 * Make mem and vocab python-visible in Doc 2015-07-28 20:46:59 +02:00
Matthew Honnibal
d42fe2e694 * Add unicode_literals to strings.pyx 2015-07-28 16:15:53 +02:00
Matthew Honnibal
bb910cff92 * Fix Python3 problem in align_raw 2015-07-28 16:06:53 +02:00
Matthew Honnibal
dcafb181b9 * Fix Python3 problem in align_raw 2015-07-28 15:52:10 +02:00
Matthew Honnibal
c609ea18f0 * Increment version in download script 2015-07-28 15:22:17 +02:00
Matthew Honnibal
9c4d0aae62 * Switch to better Python2/3 compatible unicode handling 2015-07-28 14:45:37 +02:00
Matthew Honnibal
7606d9936f * Python3 correction for GoldParse 2015-07-28 14:44:53 +02:00
Matthew Honnibal
ddc1a5cfe5 * Fix training under python3 2015-07-28 14:09:30 +02:00
Matthew Honnibal
a8bbd7312c * Hackishly patch long dependencies problem 2015-07-28 00:14:29 +02:00
Matthew Honnibal
bb583f7f09 * Hackishly patch long dependencies problem 2015-07-27 23:14:33 +02:00
Matthew Honnibal
aa7a964a4f * Add a type declaration for doc.from_array 2015-07-27 22:57:22 +02:00
Matthew Honnibal
25a8774f42 * Fix regression in packer 2015-07-27 21:53:38 +02:00
Matthew Honnibal
1601e488ee * Fix bug in decoding non-ascii characters 2015-07-27 21:43:58 +02:00
Matthew Honnibal
6a95409cd2 * Fix type on bits 2015-07-27 21:16:49 +02:00
Matthew Honnibal
a296d72b54 * Fix en/attrs 2015-07-27 21:16:33 +02:00
Matthew Honnibal
45460f505c * Fix data type on read32 in BitArray 2015-07-27 21:12:13 +02:00
Matthew Honnibal
3d43f49f69 * Revert prev change 2015-07-27 10:58:15 +02:00
Matthew Honnibal
6b586cdad4 * Change lexemes.bin format. Add a header specifying size of LexemeC and number of lexemes, and don't have the redundant orth information. 2015-07-27 08:31:51 +02:00
Matthew Honnibal
af6ed18f2a * Ensure we don't use orth_encode on OOV words. 2015-07-27 02:12:01 +02:00
Matthew Honnibal
8535d872e8 * Set is_oov property in get_flags 2015-07-27 01:51:24 +02:00
Matthew Honnibal
8e4c69ee8c * Add is_oov property, and fix up handling of attributes 2015-07-27 01:50:06 +02:00
Matthew Honnibal
fc268f03eb * Assert against null pointer exceptions in vocab 2015-07-27 01:00:10 +02:00
Matthew Honnibal
0f093fdb30 * Fix get_by_orth for py3 2015-07-26 19:26:41 +02:00
Matthew Honnibal
ceeda5a739 * Fix get_by_orth for py3 2015-07-26 18:39:27 +02:00
Matthew Honnibal
6bb96c122d * Host IS_ flags in attrs.pxd, and add properties for them on Token and Lexeme objects 2015-07-26 16:37:16 +02:00
Matthew Honnibal
eeaea25f0c * Check oov_prob file is present 2015-07-26 16:36:38 +02:00
Matthew Honnibal
7eb2446082 * Return empty lexeme on empty string 2015-07-26 00:18:30 +02:00
Matthew Honnibal
1b5d1da2a7 * Allow an OOV probability to be specified in get_lex_props 2015-07-26 00:03:43 +02:00
Matthew Honnibal
cd6e25132b * Allow an OOV probability to be specified in get_lex_props 2015-07-26 00:01:46 +02:00
Matthew Honnibal
fd525f0675 * Pass OOV probability around 2015-07-25 23:29:51 +02:00
Matthew Honnibal
3fe14b8ed6 * Fix CFile for Python2 2015-07-25 22:55:53 +02:00
Matthew Honnibal
823ef4a00b * Remove profile declarations 2015-07-25 18:13:06 +02:00
Matthew Honnibal
f4809e562f * Allow json to be used as a fallback if ujson is not available 2015-07-25 18:11:36 +02:00
Matthew Honnibal
9da06671cf * Remove unused import 2015-07-25 18:11:16 +02:00
Matthew Honnibal
2060935cdb * Remove explicit bytes type in doc.from_bytes, to accept bytearray 2015-07-24 04:54:13 +02:00
Matthew Honnibal
aa28e2e01d * Release the GIL around parse function 2015-07-24 04:53:27 +02:00
Matthew Honnibal
d62eb34b76 * More Py 2/3 compatibility in bit strings 2015-07-24 04:52:06 +02:00
Matthew Honnibal
0bb839d299 * Fix string coercion for Python 3 2015-07-24 03:49:30 +02:00
Matthew Honnibal
c4ff410fdb * Fix bytes problems for Python3 2015-07-24 03:48:23 +02:00
Matthew Honnibal
1ab25e4dad * Fix python3 type error 2015-07-24 02:45:34 +02:00
Matthew Honnibal
f35ff173b0 * Fix bits.pyx unicode error 2015-07-23 20:37:57 +02:00
Matthew Honnibal
1406e24327 * Fix unicode error for Python3 2015-07-23 19:36:21 +02:00
Matthew Honnibal
dbda6c27fa * Fix python3 error 2015-07-23 14:52:30 +02:00
Matthew Honnibal
99387f9572 * Fix python3 error 2015-07-23 14:30:29 +02:00
Matthew Honnibal
b81ffe9032 * Fix typing on mode string in CFile 2015-07-23 13:24:43 +02:00
Matthew Honnibal
22028602a9 * Add unicode_literals declaration in vocab.pyx 2015-07-23 13:24:20 +02:00
Matthew Honnibal
5b41744270 * Check for directory presence before loading annotators 2015-07-23 09:27:37 +02:00
Matthew Honnibal
df01a88763 Merge branch 'refactor' (and serializaton)
Add Huffman-code serialization, and do a lot of
refactoring. Highlights include:

* Much more efficient StringStore
* Vocab maintains a by-orth mapping of Lexemes
* Avoid manually slicing Py_UNICODE buffers,
  simplifying tokenizer and vocab C APIs
* Remove various bits of dead code
* Work on removing GIL around parser
* Work on bridge to Theano

Conflicts:
	spacy/strings.pxd
	spacy/strings.pyx
	spacy/structs.pxd
2015-07-23 02:18:35 +02:00
Matthew Honnibal
a7c4d72e83 * Add serializer property to Vocab, and lazy-load it. Add get_by_orth method. 2015-07-23 01:18:19 +02:00
Matthew Honnibal
6ab1696b15 * Remove read_encoding_freqs from util.py 2015-07-23 01:17:32 +02:00
Matthew Honnibal
d5255aad77 * Update freqs for missing tags in ner, for serializer 2015-07-23 01:17:11 +02:00
Matthew Honnibal
12699a1152 * Set initial freqs, to avoid missing values in serializer 2015-07-23 01:16:27 +02:00
Matthew Honnibal
680bb47b55 * Write serializer freqs to single file, vocab/serializer.json 2015-07-23 01:15:25 +02:00
Matthew Honnibal
a0e36e8efc * Add working to/from bytes API to Doc 2015-07-23 01:14:45 +02:00
Matthew Honnibal
1f31d96bf9 * Fix Packer API, so that it reads and writes bytes strings, instead of BitArray. Docs are always byte aligned anyway. 2015-07-23 01:13:02 +02:00
Matthew Honnibal
38ef986b29 * Update spacy/en/attrs.pxd 2015-07-23 01:10:58 +02:00
Matthew Honnibal
06eac32610 * Add cfile.pyx 2015-07-23 01:10:36 +02:00
Matthew Honnibal
0c507bd80a * Fix tokenizer 2015-07-22 14:10:30 +02:00
Matthew Honnibal
c86dbe4944 * Update English.save_models for new Packer save/load stuff 2015-07-22 13:40:23 +02:00
Matthew Honnibal
bf77bcd6b9 * Add comment explaining hash_string 2015-07-22 13:39:42 +02:00
Matthew Honnibal
815bda201d * Remove UniStr struct 2015-07-22 13:39:17 +02:00
Matthew Honnibal
2fc66e3723 * Use Py_UNICODE in tokenizer for now, while sort out Py_UCS4 stuff 2015-07-22 13:38:45 +02:00
Matthew Honnibal
4d61239eac * Reorganize the serialization functions on Doc 2015-07-22 04:53:01 +02:00
Matthew Honnibal
109106a949 * Replace UniStr, using unicode objects instead 2015-07-22 04:52:05 +02:00
Matthew Honnibal
424854028f * Fix decode_int32 2015-07-21 20:09:59 +00:00
Matthew Honnibal
304d0e2633 * Use decode_int32 in _orth_decode 2015-07-21 20:40:55 +02:00
Matthew Honnibal
9cfa59ec33 * Optimistically try orth encoding, with char as a back-off 2015-07-21 20:22:45 +02:00
Matthew Honnibal
c8b89e37a5 * Bug fix to faster huffman decoding 2015-07-21 20:05:53 +02:00
Matthew Honnibal
b166d1d2a2 * Use encode32 and decode32 2015-07-21 19:59:06 +02:00
Matthew Honnibal
c6cd0ddce8 * Add faster encode_int32 and decode_int32 methods 2015-07-21 19:58:45 +02:00
Matthew Honnibal
dd60594f41 * Fix double encoding error in strings.pyx 2015-07-20 13:52:56 +02:00
Matthew Honnibal
06639dc497 * Add length cap to word shape feature 2015-07-20 12:06:59 +02:00
Matthew Honnibal
128b6d9714 * Move Utf8Str struct to strings module, as that's the only place it's relevant 2015-07-20 12:06:41 +02:00
Matthew Honnibal
01a97b90f3 * Fix header for string store 2015-07-20 12:06:10 +02:00
Matthew Honnibal
52d538ea42 * Fix short string optimization in strings.pyx. StringStore tests now all pass. 2015-07-20 12:05:23 +02:00
Matthew Honnibal
09a3055630 * Work on short string optimization in Utf8Str 2015-07-20 11:26:46 +02:00
Matthew Honnibal
bb0ba1f0cd * Improve serialization speed 2015-07-20 03:27:59 +02:00
Matthew Honnibal
8743a8c084 * Update Doc serialization for new Packer interface 2015-07-20 01:38:04 +02:00
Matthew Honnibal
1f7170e0e1 * Reinstate the fixed vocabulary --- words are only added to the lexicon in init_model, after that we create LexemeC structs with the Pool given to us. 2015-07-20 01:37:34 +02:00
Matthew Honnibal
5a7d060d9c * Switch between the orth and char codecs depending on which is shorter for that message. Mostly orth is shorter, except if there are OOV words. 2015-07-20 01:36:22 +02:00
Matthew Honnibal
5a042ee0d3 * Add function to predict number of bits needed to encode message 2015-07-20 01:35:11 +02:00
Matthew Honnibal
b89b489bb4 * Implement both character and orth encoding in Packer, so that we can decide which to use per-text 2015-07-19 22:39:45 +02:00
Matthew Honnibal
ae78c9e3ce * Implement character-based codec, so that we can do word/char backoff 2015-07-19 22:03:39 +02:00
Matthew Honnibal
cd1d047cb8 * Delete out-dated HuffmanCodec comment 2015-07-19 18:28:14 +02:00
Matthew Honnibal
b8086067d5 * Build Huffman codec from unsorted inputs 2015-07-19 17:58:44 +02:00
Matthew Honnibal
317cbbc015 * Serialization round trip now working with decent API, but with rough spots in the organisation and requiring vocabulary to be fixed ahead of time. 2015-07-19 15:18:17 +02:00
Matthew Honnibal
6b13e7227c * Remove duplicate get_lex_attr method from doc.pyx 2015-07-18 22:46:07 +02:00
Matthew Honnibal
e49c7f1478 * Update oov check in tokenizer 2015-07-18 22:45:28 +02:00
Matthew Honnibal
cfd842769e * Allow infix tokens to be variable length 2015-07-18 22:45:00 +02:00
Matthew Honnibal
5b4c78bbb2 * Use an AttributeCodec based on orth for words. Still no oov handling mechanism. 2015-07-18 22:43:18 +02:00
Matthew Honnibal
82d84b0f2b * Index lexemes by orth, instead of a lexemes vector. Breaks the mechanism for deciding not to own LexemeC structs during parsing. Need to reinstate this. 2015-07-18 22:42:15 +02:00
Matthew Honnibal
4dddc8a69b * Fix type declarations for attr_t. Remove unused id_t. 2015-07-18 22:39:57 +02:00
Matthew Honnibal
ced59ab9ea * Make minor efficiency improvement in Doc.__iter__ 2015-07-18 04:10:53 +02:00
Matthew Honnibal
cd91914dd8 * Fix hard-coded length 2015-07-18 04:09:56 +02:00
Matthew Honnibal
b1d74ce60d * Remove unused joint.pyx and joint.pxd files 2015-07-17 23:31:44 +02:00
Matthew Honnibal
c27514512b * Remove cruft ner/ directory 2015-07-17 23:24:32 +02:00
Matthew Honnibal
f8d6d319f4 * Remove cruft module 2015-07-17 23:23:05 +02:00
Matthew Honnibal
fb0a641a2d * Don't release the gil around Parser.parse. Does this indicate thread problems? 2015-07-17 23:07:37 +02:00
Matthew Honnibal
e29daea85f * Fix bint/int typing problem in TransitionSystem. In C++ bint* means bool*, but in C it means int*. So, type-casting to bint* is unsafe. 2015-07-17 22:37:24 +02:00
Matthew Honnibal
cf0c788892 * Tests passing on round-trip pack/unpack on basic example 2015-07-17 21:20:48 +02:00
Matthew Honnibal
44f39a876f * Add a blank attrs.pyx 2015-07-17 16:40:42 +02:00
Matthew Honnibal
c2c83120d4 * Remove codec property from Vocab 2015-07-17 16:40:11 +02:00
Matthew Honnibal
dfdf19f6a9 * Draft a from_orth method for Doc 2015-07-17 16:39:54 +02:00
Matthew Honnibal
9e3f17051b * Move to ORTH instead of ID for encoding lexemes. Basic tests of the codec wrappers now passing 2015-07-17 16:38:29 +02:00
Matthew Honnibal
15ff739996 * Fix passing of ID attribute in string store 2015-07-17 14:49:42 +02:00
Matthew Honnibal
95e57c2780 * Remove unnecessary key and id properties from Utf8String. 2015-07-17 01:40:18 +02:00
Matthew Honnibal
234c7e440a * Add spacy/serialize/__init__ files 2015-07-17 01:37:33 +02:00
Matthew Honnibal
db9dfd2e23 * Major refactor of serialization. Nearly complete now. 2015-07-17 01:27:54 +02:00
Matthew Honnibal
c8282f9934 * Work on serialization. Needs more reorganisation 2015-07-16 19:56:02 +02:00
Matthew Honnibal
d8458d6a25 * Fix attr_id_t import in Spans 2015-07-16 19:55:21 +02:00
Matthew Honnibal
d1cb30dbc4 * Remove unnecessary key and id properties from Utf8String. 2015-07-16 19:29:02 +02:00
Matthew Honnibal
897de2d438 * Add 'bitter' property for serializer in English class 2015-07-16 17:47:53 +02:00
Matthew Honnibal
fb54052ae0 * Work on serializer design 2015-07-16 17:46:46 +02:00
Matthew Honnibal
a6f401580d * Add from_array function to Doc. 2015-07-16 17:46:11 +02:00
Matthew Honnibal
2a5d050134 * Give codec loading back to Vocab. 2015-07-16 17:45:42 +02:00
Matthew Honnibal
8bf0f65f1c * Remove dead code in strings.pyx 2015-07-16 17:35:53 +02:00
Matthew Honnibal
a9c3863665 * Fix inefficiency in StringStore.dump function 2015-07-16 17:34:32 +02:00
Matthew Honnibal
b59d271510 * Move serialization functionality into Serializer class 2015-07-16 11:23:48 +02:00
Matthew Honnibal
30be4f15da * Import attrs from spacy.attrs, not spacy.typedefs 2015-07-16 11:23:25 +02:00
Matthew Honnibal
6c99e5f4aa * Move serialization into Serializer class, with __call__ and train() api 2015-07-16 11:22:35 +02:00
Matthew Honnibal
e2133d990e * Move serialization functionality out into a Serializer object 2015-07-16 11:21:44 +02:00
Matthew Honnibal
a6d040bd11 * Import Lexeme attrs from spacy.attrs, not spacy.typedefs 2015-07-16 11:20:08 +02:00
Matthew Honnibal
45ae1ce428 * Remove unused declaration in parser 2015-07-16 01:27:11 +02:00
Matthew Honnibal
efa80096f1 * Upd attrs id list 2015-07-16 01:26:54 +02:00
Matthew Honnibal
01fab6bb90 * Improve de/serialize functions 2015-07-16 01:26:35 +02:00
Matthew Honnibal
0e07c1ed2a * draft de/serialization functions in doc.pyx 2015-07-16 01:16:33 +02:00
Matthew Honnibal
9d956b07e9 * Fix import of attrs in doc.pyx, and update the get_token_attr function. 2015-07-16 01:15:34 +02:00
Matthew Honnibal
65251e7625 * Remove redundant attr_id_t from typedefs.pxd 2015-07-16 00:58:51 +02:00
Matthew Honnibal
9a8db9743c * Remove gil from parser.call 2015-07-14 23:47:33 +02:00
Matthew Honnibal
38ca0c33f5 Merge branch 'neuralnet' into refactor
Mostly refactors parser, to use new thinc3.2 Example class.
Aim is to remove use of shared memory, so that we can parallelize
over documents easily.

Conflicts:
	setup.py
	spacy/syntax/parser.pxd
	spacy/syntax/parser.pyx
	spacy/syntax/stateclass.pyx
2015-07-14 14:13:47 +02:00
Matthew Honnibal
935ac53ee3 * Extend count_by method 2015-07-14 03:20:09 +02:00
Matthew Honnibal
3b5baa660f * Fix tokenizer 2015-07-14 00:10:51 +02:00
Matthew Honnibal
2ae0b439b2 * Fix space check in gold.pyx 2015-07-14 00:10:27 +02:00
Matthew Honnibal
81aa4e6dcc * Go back to having token reference doc, instead of complicated gymnastics. Rename the attr 'doc', to expose it in the API 2015-07-14 00:10:11 +02:00
Matthew Honnibal
24d6ce99ec * Add comment to tokenizer, explaining the spacy attr 2015-07-13 22:29:13 +02:00
Matthew Honnibal
8214b74eec * Restore _py_tokens cache, to handle orphan tokens. 2015-07-13 22:28:10 +02:00
Matthew Honnibal
67641f3b58 * Refactor tokenizer, to set the 'spacy' field on TokenC instead of passing a string 2015-07-13 21:46:02 +02:00
Matthew Honnibal
6eef0bf9ab * Break up tokens.pyx into tokens/doc.pyx, tokens/token.pyx, tokens/spans.pyx 2015-07-13 20:20:58 +02:00
Matthew Honnibal
3ea8756c24 * Add spacy/tokens/doc.pyx, for Doc class in its own file 2015-07-13 19:58:26 +02:00
Matthew Honnibal
c99387155f * Refactor tokens, moving classes into a module instead of a single file 2015-07-13 19:49:55 +02:00
Matthew Honnibal
d27899658e * Import classes in spacy.tokens.__init__ 2015-07-13 19:48:55 +02:00
Matthew Honnibal
aa82caf8f5 * Add TokenC.spacy attr 2015-07-13 19:48:07 +02:00
Matthew Honnibal
dba6b47d4e * Refactor monster tokens.pyx file, into a tokens/ subpackage. Try to break the cycle between Doc and Token, and remove the need to pass around a unicode string reference 2015-07-13 19:20:48 +02:00
Matthew Honnibal
5b0a7190c9 * Round-trip for serialization finally working. Needs a lot of optimization. 2015-07-13 18:39:38 +02:00
Matthew Honnibal
edd371246c * Make huffman coder take BitArray in encode/decode. Add __iter__ method to BitArray. 2015-07-13 17:33:33 +02:00
Matthew Honnibal
af5cc926a4 * Add codec property to Vocab, to use the Huffman encoding 2015-07-13 13:55:14 +02:00
Matthew Honnibal
77385d5580 * Make .pxd file for huffman codec 2015-07-13 13:54:51 +02:00
Matthew Honnibal
083b6ea7ae * Clean up encoder a bit. now read for integration into Vocab. 2015-07-13 12:57:22 +02:00
Matthew Honnibal
8d0f1d98da * Draft dockstring for HuffmanCache 2015-07-13 12:01:18 +02:00
Matthew Honnibal
281f1faefb * Nearly finished huffman coder 2015-07-12 23:48:46 +02:00
Matthew Honnibal
e1a25fba32 * Work on huffman coder 2015-07-12 19:58:05 +02:00
Matthew Honnibal
3fb9de2d13 * Remove vector[bint], in favor of simple Code struct. 2015-07-12 17:58:27 +02:00
Matthew Honnibal
aa7bfd932b * Work on compressor 2015-07-12 16:03:43 +02:00
Matthew Honnibal
14eafcab15 * Refactor to use vector[bint] 2015-07-12 05:27:47 +02:00
Matthew Honnibal
6a6e852a39 * Refactor huffman coding stuff into class 2015-07-12 05:06:36 +02:00
Matthew Honnibal
aad96fdb5c * Improve efficiency of huffman coding 2015-07-12 01:31:37 +02:00
Matthew Honnibal
ff9ff6f3fa * Ensure unseen words are given low log probability 2015-07-12 01:31:09 +02:00
Matthew Honnibal
9d3b0d83de * Refactor huffman coding 2015-07-11 22:27:43 +02:00
Matthew Honnibal
8d29406cd6 * Rename span.right to span.rights 2015-07-11 22:15:04 +02:00
Matthew Honnibal
da9f358166 * Fix span getting 2015-07-11 21:41:41 +02:00
Matthew Honnibal
11e8f2ffb4 * Huffman codes working 2015-07-11 20:01:10 +02:00
Matthew Honnibal
cb6fc81909 * Work on huffman coding. 2015-07-11 15:23:35 +02:00
Matthew Honnibal
4c9b77fe95 * Begin working on serialization code 2015-07-11 10:57:30 +02:00
Matthew Honnibal
53d1f5b2eb * Rename Span.head to Span.root. 2015-07-09 17:30:58 +02:00
Matthew Honnibal
c0255ed7d8 * Allow slice indexing in Doc.__getitem__, returning a Span object 2015-07-09 15:15:32 +02:00
Matthew Honnibal
89a91ad726 * Add SPACE part-of-speech tag, and train tagger to assign it. Also train tagger not to make whitespace an entity 2015-07-09 13:30:41 +02:00
Matthew Honnibal
55f1042443 * Improve efficiency of L and R features, correcting the non-linear-in-length problem. 2015-07-09 12:17:26 +02:00
Matthew Honnibal
70d2acb579 * Fix edge features 2015-07-09 12:15:01 +02:00
Matthew Honnibal
adb868bdad * Add warning for models not found in parser 2015-07-08 20:04:55 +02:00
Matthew Honnibal
05b28ec9eb * Add warning for models not found in parser 2015-07-08 20:02:13 +02:00
Matthew Honnibal
ef700401a6 * Add warning for models not found in parser 2015-07-08 20:00:46 +02:00
Matthew Honnibal
6218d8b389 * Add warning for models not found in parser 2015-07-08 19:59:16 +02:00
Matthew Honnibal
f6a6c39ce8 * Add warning for models not found in parser 2015-07-08 19:52:30 +02:00
Matthew Honnibal
78db7e32f7 * Remove has_sense method from Lexeme declaration 2015-07-08 19:41:20 +02:00
Matthew Honnibal
6ddb2f5e45 * Restore merge_mwe in English class 2015-07-08 19:35:30 +02:00
Matthew Honnibal
6859f6adac * Restore merge_mwe in English class 2015-07-08 19:34:55 +02:00
Matthew Honnibal
3c270fc8ff * Remove has_sense method from Lexeme 2015-07-08 19:28:29 +02:00
Matthew Honnibal
b64c843861 * Remove senses attr 2015-07-08 19:26:24 +02:00
Matthew Honnibal
1d3a592edf * Remove the senses attr from LexemeC, to keep data compatibility 2015-07-08 19:24:44 +02:00
Matthew Honnibal
0ceb1f71c2 * Update parse features 2015-07-08 19:11:36 +02:00
Matthew Honnibal
2e51b5027a * Alias Doc to Tokens, for backwards compatibility 2015-07-08 18:59:35 +02:00
Matthew Honnibal
e3c53f5ecd * Fix mention of Tokens in docstring 2015-07-08 18:56:27 +02:00
Matthew Honnibal
bb522496dd * Rename Tokens to Doc 2015-07-08 18:53:00 +02:00
Matthew Honnibal
b24e8be2b9 * Whitespace in docstring 2015-07-08 12:37:03 +02:00
Matthew Honnibal
abc43b852d * Add pos_tags attr to Vocab. 2015-07-08 12:36:38 +02:00
Matthew Honnibal
935bcdf3e5 * Remove redundant tag_names argument to Tokenizer 2015-07-08 12:36:04 +02:00
Matthew Honnibal
ff885e8511 * Add ParserFactory convenience function 2015-07-08 12:35:46 +02:00
Matthew Honnibal
4e4fac452b * Refactor __init__ for simplicity. Allow parse=True, tag=True etc flags to be passed at top-level. Do not lazy-load parser. 2015-07-08 12:35:29 +02:00
Matthew Honnibal
1d2deb4616 * Work on refactoring default arguments to English.__init__ 2015-07-07 15:53:25 +02:00
Matthew Honnibal
2d0e99a096 * Pass pos_tags into Tokenizer.from_dir 2015-07-07 14:23:08 +02:00
Matthew Honnibal
6788c86b2f * Begin refactor 2015-07-07 14:00:07 +02:00
Matthew Honnibal
52fd80c6c6 * Add experimental supersense features for parsing, based on lookup into wordnet. 2015-07-01 20:12:44 +02:00
Matthew Honnibal
e6d828a9af * Set up an array POS_SENSES that denotes the set of valid senses for each POS tag. This way, we can do bitwise & between a lexeme's senses and the ones available for its POS tag, to get the allowable senses for the token. 2015-07-01 20:12:13 +02:00
Matthew Honnibal
2b8459d9a8 * Add senses flag to Lexeme 2015-07-01 20:10:41 +02:00
Matthew Honnibal
e23d1582a2 * Add supersense data to Lexeme objects. Add simple has_sense method to check the flag. 2015-07-01 18:50:37 +02:00
Matthew Honnibal
64fafa98be * Add senses.pyx and senses.pxd 2015-07-01 18:49:44 +02:00
Matthew Honnibal
94dab94e5f uerge branch 'master' of https://github.com/honnibal/spaCy 2015-06-30 18:16:26 +02:00
Matthew Honnibal
9af86b0b0b * Fix attrs.pxd 2015-06-30 18:16:30 +02:00
Matthew Honnibal
af9c82f7a6 Merge branch 'master' of https://github.com/honnibal/spaCy 2015-06-30 18:11:37 +02:00
Matthew Honnibal
5d595b5a8c * Inc versions 2015-06-30 18:11:06 +02:00
Matthew Honnibal
d2eeba6667 * Start wiring up color and emotion lexicons. Hopefully we get to use them. 2015-06-30 16:22:23 +02:00
Matthew Honnibal
e20106fdff * Begin reorganizing neuralnet work 2015-06-30 14:26:32 +02:00
Matthew Honnibal
5cd3ed42d4 * Reenable averaging 2015-06-29 16:44:42 +02:00
Matthew Honnibal
894cbef8ba * Wire eta and mu parameters up for neural net 2015-06-29 07:10:33 +02:00
Matthew Honnibal
3bb5876c5a * Inline methods in StateClass 2015-06-29 01:10:14 +02:00
Matthew Honnibal
313a7f87b3 * Inline methods in StateClass 2015-06-29 01:06:28 +02:00
Matthew Honnibal
a02fd3af5d * Check valency in L and R feature methods, to make feaure calculation faster 2015-06-29 00:27:56 +02:00
Matthew Honnibal
5d870720bc * Check valency in L and R feature methods, to make feaure calculation faster 2015-06-29 00:17:29 +02:00
Matthew Honnibal
f4986d5d3c * Use new Example class 2015-06-28 22:36:03 +02:00
Matthew Honnibal
735f1af91f * Fix neural net stuff 2015-06-28 11:44:58 +02:00
Matthew Honnibal
e7003f1cf3 * Remove hard-coding of vector lengths 2015-06-28 11:37:17 +02:00
Matthew Honnibal
897dd0dd0b * Merge changes, and adjust Example to use memoryview 2015-06-28 11:36:11 +02:00
Matthew Honnibal
9282a8e72c * Prepare for new models to be plugged in by using Example class 2015-06-28 11:02:35 +02:00
Matthew Honnibal
75aeccc064 * Rejig parser interface to use new thinc.api.Example class, in prep of theano model. Comment out beam search 2015-06-28 11:02:34 +02:00
Matthew Honnibal
bf33598b34 * Work on a theano-driven model for the parser 2015-06-28 11:02:34 +02:00
Matthew Honnibal
bbef71f213 * Fix min function in fill_context 2015-06-28 10:46:39 +02:00
Matthew Honnibal
142b6f9510 * Revert last changes 2015-06-28 10:44:28 +02:00
Matthew Honnibal
b06962f18b * Pad buffers in state 2015-06-28 10:36:14 +02:00
Matthew Honnibal
53be72387c * Hack at fill_context to investigate performance loss 2015-06-28 10:34:28 +02:00
Matthew Honnibal
71a4e876a9 * Fix parse features 2015-06-28 09:27:33 +02:00
Matthew Honnibal
0c4b5a2bb0 * Start scoring tokens 2015-06-28 06:21:38 +02:00
Matthew Honnibal
5af500909c * Remove unused directve from parser.pyx 2015-06-28 06:20:21 +02:00
Matthew Honnibal
d5b4090705 * Add profile directive 2015-06-28 06:19:33 +02:00
Matthew Honnibal
2b5421e60c * Add profile directive 2015-06-28 06:07:04 +02:00
Matthew Honnibal
8b5de4a411 * Add word / tag / label sets, for use in neural net 2015-06-28 05:46:53 +02:00
Matthew Honnibal
cfcbd8d256 * Fix punctuation eval in scorer.py 2015-06-28 01:31:39 +02:00
Matthew Honnibal
ed40a8380e * Remove hard-coding of vector lengths 2015-06-27 04:18:47 +02:00
Matthew Honnibal
ebe630cc8d * Enable more features for NN 2015-06-27 04:17:29 +02:00
Matthew Honnibal
f8bb43475e * Bridge to Theano working. Very disorganised. Using thinc adb60aba966ed2 2015-06-27 02:39:18 +02:00
Matthew Honnibal
2fe98b8a9a * Prepare for new models to be plugged in by using Example class 2015-06-26 13:51:39 +02:00
Matthew Honnibal
6896455884 * Rejig parser interface to use new thinc.api.Example class, in prep of theano model. Comment out beam search 2015-06-26 06:25:36 +02:00
Matthew Honnibal
b266a63f2c * Inc version of downloadble data 2015-06-24 04:53:08 +02:00
Matthew Honnibal
02b171ee67 * Bug fixes to edge calculation 2015-06-24 04:28:02 +02:00
Matthew Honnibal
a4e9bdf4c1 * Work on a theano-driven model for the parser 2015-06-24 01:02:40 +02:00
Matthew Honnibal
7f9384f53c * Remove deprecated _state module 2015-06-23 17:28:24 +02:00
Matthew Honnibal
6dbe182491 * Fix merge conflicts 2015-06-23 17:28:00 +02:00
Matthew Honnibal
579735a095 * Remove import of _state module 2015-06-23 17:25:08 +02:00
Matthew Honnibal
88f55d136b * Remove deprecated _state module 2015-06-23 17:19:51 +02:00
Matthew Honnibal
9ab9dd2bf7 * Clean up unused orig_arc_eager and tree_arc_eager modules, which were only added for EMNLP experiments 2015-06-23 17:17:33 +02:00
Matthew Honnibal
7ebfe4b983 * Fixes to edge features 2015-06-23 16:32:54 +02:00
Matthew Honnibal
7b125f5a86 * Fixes to edge features 2015-06-23 16:31:01 +02:00
Matthew Honnibal
8d4bbacfc5 * Fix edge navigation in Token objects 2015-06-23 16:07:34 +02:00
Matthew Honnibal
35c290bee4 * Fix edge features 2015-06-23 15:50:56 +02:00
Matthew Honnibal
221e2e485f * Assign 'ROOT' as label, not 'root' 2015-06-23 15:09:54 +02:00
Matthew Honnibal
a7bf7b0626 * Rename sent_start to sent_end, to reflect its new usage in the Break transition 2015-06-23 05:39:43 +02:00
Matthew Honnibal
ee3e56f27b * Fix bounds checking on entities 2015-06-23 04:35:08 +02:00
Matthew Honnibal
43ef5ddea5 * Ensure root albel is spelled ROOT, for backwards compatibility 2015-06-23 04:14:03 +02:00
Matthew Honnibal
065c2e1d2d * Add some bounds checking around state arrays 2015-06-23 04:13:09 +02:00
Matthew Honnibal
89ae218b75 * Add import to tokens.pyx from weird Cython compiler issue with casting from memory views 2015-06-23 03:04:34 +02:00
Matthew Honnibal
f01b3d043e * Add padding to arrays in stateclass. May be papering over a deeper bug. 2015-06-23 03:03:41 +02:00
Matthew Honnibal
5e94b5d581 * Have Tokens return proper numpy arrays, not Cython views. 2015-06-23 00:07:34 +02:00
Matthew Honnibal
69507bc729 * Re-enable Break transition in arc_eager.pyx 2015-06-23 00:03:30 +02:00
Matthew Honnibal
cc579ed429 * Add __len__ function to StringStore 2015-06-23 00:02:50 +02:00
Matthew Honnibal
46fb24e9fd * Add cycle-checking code in gold.pyx 2015-06-23 00:02:22 +02:00
Matthew Honnibal
60d26243e3 * Fix head alignment in read_conll.parse, which was causing corrupt parses when strip_bad_periods=True. A similar problem may apply to other data readers. 2015-06-18 16:35:27 +02:00
Matthew Honnibal
f868175e43 * Whitespace 2015-06-16 23:37:46 +02:00
Matthew Honnibal
ab110be125 * Remove debugging in parser.pyx 2015-06-16 23:37:25 +02:00
Matthew Honnibal
9b13d11ab3 * Fix handling of entities in StateClass 2015-06-16 23:35:21 +02:00
Matthew Honnibal
c40a2c661c * Add tree_arc_eager 2015-06-15 08:23:24 +02:00
Matthew Honnibal
5da5cf7084 * Add some more features for S1/S0 2015-06-15 04:07:13 +02:00
Matthew Honnibal
8156a01bca * Fix root label for orig_arc_eager 2015-06-15 02:54:55 +02:00
Matthew Honnibal
21930ede15 * Switch toggle on USE_ROOT_ARC_SEGMENT 2015-06-15 02:54:32 +02:00
Matthew Honnibal
38a6afa484 * Make possibly dubious correction to the unshift oracle 2015-06-15 02:50:00 +02:00
Matthew Honnibal
f66228f253 * Add some more features, esp for labels 2015-06-14 21:18:02 +02:00
Matthew Honnibal
3da8e0f317 * Add orig_arc_eager 2015-06-14 20:31:44 +02:00
Matthew Honnibal
ea8a103007 * Fix import of TransitionSystem in parser.pyx 2015-06-14 19:01:26 +02:00
Matthew Honnibal
e0984ca139 * Fix valency features in StateClass 2015-06-14 17:50:26 +02:00
Matthew Honnibal
e50ac1a47f * Add verbose printing to scorer 2015-06-14 17:45:50 +02:00
Matthew Honnibal
763cbd23d5 * Upd stateclass.print_state 2015-06-14 17:44:29 +02:00
Matthew Honnibal
bdd07bf000 * Fix Break oracle, but disable the Break transition for now, while we finalize the gold-standard experiments 2015-06-14 17:44:03 +02:00
Matthew Honnibal
399f15fbdf * Add flag to toggle handling of multi-root inputs without the Break transition. Clear up now unused best_valid stuff. 2015-06-14 00:28:37 +02:00
Matthew Honnibal
75289b4761 * Don't refuse to parse single token sentences, incase some transition system needs them, e.g. single word entity. Instead fix error in _init_state. 2015-06-13 22:55:55 +02:00
Matthew Honnibal
77d7e79c7e * Fix r/l and distance features. 2015-06-12 13:06:15 +02:00
Matthew Honnibal
b643cb3d5c * Allow training documents to be filtered in gold.pyx 2015-06-12 02:42:08 +02:00
Matthew Honnibal
15e177d7a1 * Fixes to unshift/fast-forward strategy. Getting 91.55 greedy on NW dev, gold preproc 2015-06-12 01:50:23 +02:00
Matthew Honnibal
afd77a529b * Prepare for break transition, with fast-forwarding. 86.5 on 1k nw gold preproc 2015-06-10 14:08:30 +02:00
Matthew Honnibal
495f528709 * Add support for sentence breaks in stateclass 2015-06-10 12:34:28 +02:00
Matthew Honnibal
b7b18c279d * Fix Reduce oracle. Getting 86.35 2015-06-10 11:33:39 +02:00
Matthew Honnibal
bb09b5d91a * Fix shifted bit vector in stateclass --- should reflect whether the word has been *unshifted*. 2015-06-10 11:33:09 +02:00
Matthew Honnibal
aa9625f688 * Do non-monotonic Unshift. Every word can be shifted at most 1 time. When the Reduce move is used, if S0 has no head, we put the word back on the buffer. Gets 86.4 on nw 1k with gold pre-proc. Break transition not yet implemented for this. 2015-06-10 10:15:56 +02:00
Matthew Honnibal
7bf6b7de3e * Add unshift action to StateClass, and track which moves have been shifted 2015-06-10 10:13:03 +02:00
Matthew Honnibal
f7c8069e65 * Fix bug in distance feature 2015-06-10 10:12:17 +02:00
Matthew Honnibal
abd07c067a * Inline B and S methods on stateclass 2015-06-10 07:22:33 +02:00
Matthew Honnibal
e2f9a80713 * Remove old _state imports 2015-06-10 07:09:17 +02:00
Matthew Honnibal
e9aaecc619 * Remove from_struct method from StateClass 2015-06-10 06:58:27 +02:00
Matthew Honnibal
18cc326dc0 * Bug fixes to ner.pyx 2015-06-10 06:57:41 +02:00
Matthew Honnibal
e5570c9700 * Set nogil for oracle functions 2015-06-10 06:56:56 +02:00
Matthew Honnibal
4575e7a60f * Fix beam search with new StateClass 2015-06-10 06:33:39 +02:00
Matthew Honnibal
04b1cd9b8c * Greedy parsing working with new StateClass. Beam parsing broken 2015-06-10 04:20:23 +02:00
Matthew Honnibal
6a94b64eca * Remove State* from parser.pyx entirely, switching over to StateClass. Beam parsing still untested. 2015-06-10 02:03:38 +02:00
Matthew Honnibal
f14a1526aa * Remove version of fill_context that takes State* 2015-06-10 01:39:07 +02:00