Commit Graph

1861 Commits

Author SHA1 Message Date
Matthew Honnibal
6405d2384c * Add first draft of annotation standards doc 2015-07-14 12:50:13 +02:00
Matthew Honnibal
af54d05d60 * Remove sense stuff from init_model 2015-07-14 10:56:17 +02:00
Matthew Honnibal
3de1b3ef1d * Change get_freqs to take a list of files 2015-07-14 10:55:56 +02:00
Matthew Honnibal
935ac53ee3 * Extend count_by method 2015-07-14 03:20:09 +02:00
Matthew Honnibal
39c93116eb * Add get_freqs script 2015-07-14 02:31:32 +02:00
Matthew Honnibal
3b5baa660f * Fix tokenizer 2015-07-14 00:10:51 +02:00
Matthew Honnibal
2ae0b439b2 * Fix space check in gold.pyx 2015-07-14 00:10:27 +02:00
Matthew Honnibal
81aa4e6dcc * Go back to having token reference doc, instead of complicated gymnastics. Rename the attr 'doc', to expose it in the API 2015-07-14 00:10:11 +02:00
Matthew Honnibal
e1c702e498 * Upd tests after refactor 2015-07-14 00:08:50 +02:00
Matthew Honnibal
ba9a22ae0b * Ignore cpp files in spacy/tokens 2015-07-13 22:30:15 +02:00
Matthew Honnibal
98382bd7a0 * Update tests after refactor 2015-07-13 22:30:01 +02:00
Matthew Honnibal
d87d71caf4 * Compile the new modules after refactor 2015-07-13 22:29:33 +02:00
Matthew Honnibal
24d6ce99ec * Add comment to tokenizer, explaining the spacy attr 2015-07-13 22:29:13 +02:00
Matthew Honnibal
8214b74eec * Restore _py_tokens cache, to handle orphan tokens. 2015-07-13 22:28:10 +02:00
Matthew Honnibal
67641f3b58 * Refactor tokenizer, to set the 'spacy' field on TokenC instead of passing a string 2015-07-13 21:46:02 +02:00
Matthew Honnibal
6eef0bf9ab * Break up tokens.pyx into tokens/doc.pyx, tokens/token.pyx, tokens/spans.pyx 2015-07-13 20:20:58 +02:00
Matthew Honnibal
3ea8756c24 * Add spacy/tokens/doc.pyx, for Doc class in its own file 2015-07-13 19:58:26 +02:00
Matthew Honnibal
c99387155f * Refactor tokens, moving classes into a module instead of a single file 2015-07-13 19:49:55 +02:00
Matthew Honnibal
d27899658e * Import classes in spacy.tokens.__init__ 2015-07-13 19:48:55 +02:00
Matthew Honnibal
aa82caf8f5 * Add TokenC.spacy attr 2015-07-13 19:48:07 +02:00
Matthew Honnibal
dba6b47d4e * Refactor monster tokens.pyx file, into a tokens/ subpackage. Try to break the cycle between Doc and Token, and remove the need to pass around a unicode string reference 2015-07-13 19:20:48 +02:00
Matthew Honnibal
5b0a7190c9 * Round-trip for serialization finally working. Needs a lot of optimization. 2015-07-13 18:39:38 +02:00
Matthew Honnibal
edd371246c * Make huffman coder take BitArray in encode/decode. Add __iter__ method to BitArray. 2015-07-13 17:33:33 +02:00
Matthew Honnibal
af5cc926a4 * Add codec property to Vocab, to use the Huffman encoding 2015-07-13 13:55:14 +02:00
Matthew Honnibal
77385d5580 * Make .pxd file for huffman codec 2015-07-13 13:54:51 +02:00
Matthew Honnibal
0628e0e2a8 * Add tests for huffman encoding 2015-07-13 12:58:07 +02:00
Matthew Honnibal
083b6ea7ae * Clean up encoder a bit. now read for integration into Vocab. 2015-07-13 12:57:22 +02:00
Matthew Honnibal
8d0f1d98da * Draft dockstring for HuffmanCache 2015-07-13 12:01:18 +02:00
Matthew Honnibal
281f1faefb * Nearly finished huffman coder 2015-07-12 23:48:46 +02:00
Matthew Honnibal
e1a25fba32 * Work on huffman coder 2015-07-12 19:58:05 +02:00
Matthew Honnibal
3fb9de2d13 * Remove vector[bint], in favor of simple Code struct. 2015-07-12 17:58:27 +02:00
Matthew Honnibal
aa7bfd932b * Work on compressor 2015-07-12 16:03:43 +02:00
Matthew Honnibal
14eafcab15 * Refactor to use vector[bint] 2015-07-12 05:27:47 +02:00
Matthew Honnibal
6a6e852a39 * Refactor huffman coding stuff into class 2015-07-12 05:06:36 +02:00
Matthew Honnibal
aad96fdb5c * Improve efficiency of huffman coding 2015-07-12 01:31:37 +02:00
Matthew Honnibal
ff9ff6f3fa * Ensure unseen words are given low log probability 2015-07-12 01:31:09 +02:00
Matthew Honnibal
9d3b0d83de * Refactor huffman coding 2015-07-11 22:27:43 +02:00
Matthew Honnibal
8d29406cd6 * Rename span.right to span.rights 2015-07-11 22:15:04 +02:00
Matthew Honnibal
da9f358166 * Fix span getting 2015-07-11 21:41:41 +02:00
Matthew Honnibal
11e8f2ffb4 * Huffman codes working 2015-07-11 20:01:10 +02:00
Matthew Honnibal
cb6fc81909 * Work on huffman coding. 2015-07-11 15:23:35 +02:00
Matthew Honnibal
4c9b77fe95 * Begin working on serialization code 2015-07-11 10:57:30 +02:00
Matthew Honnibal
11a380e00f * Draft v0.89 update notes 2015-07-10 19:41:42 +02:00
Matthew Honnibal
53d1f5b2eb * Rename Span.head to Span.root. 2015-07-09 17:30:58 +02:00
Matthew Honnibal
c0255ed7d8 * Allow slice indexing in Doc.__getitem__, returning a Span object 2015-07-09 15:15:32 +02:00
Matthew Honnibal
7d2964f673 * Test that whitespace is not assigned a tag 2015-07-09 13:31:40 +02:00
Matthew Honnibal
b5223c4824 * Add whitespace to specials.json 2015-07-09 13:31:12 +02:00
Matthew Honnibal
89a91ad726 * Add SPACE part-of-speech tag, and train tagger to assign it. Also train tagger not to make whitespace an entity 2015-07-09 13:30:41 +02:00
Matthew Honnibal
f95da0bd52 * Allow tests to read model dir from SPACY_DATA environment variable 2015-07-09 12:18:02 +02:00
Matthew Honnibal
55f1042443 * Improve efficiency of L and R features, correcting the non-linear-in-length problem. 2015-07-09 12:17:26 +02:00