Matthew Honnibal
38ca0c33f5
Merge branch 'neuralnet' into refactor
...
Mostly refactors parser, to use new thinc3.2 Example class.
Aim is to remove use of shared memory, so that we can parallelize
over documents easily.
Conflicts:
setup.py
spacy/syntax/parser.pxd
spacy/syntax/parser.pyx
spacy/syntax/stateclass.pyx
2015-07-14 14:13:47 +02:00
Matthew Honnibal
6405d2384c
* Add first draft of annotation standards doc
2015-07-14 12:50:13 +02:00
Matthew Honnibal
935ac53ee3
* Extend count_by method
2015-07-14 03:20:09 +02:00
Matthew Honnibal
39c93116eb
* Add get_freqs script
2015-07-14 02:31:32 +02:00
Matthew Honnibal
3b5baa660f
* Fix tokenizer
2015-07-14 00:10:51 +02:00
Matthew Honnibal
2ae0b439b2
* Fix space check in gold.pyx
2015-07-14 00:10:27 +02:00
Matthew Honnibal
81aa4e6dcc
* Go back to having token reference doc, instead of complicated gymnastics. Rename the attr 'doc', to expose it in the API
2015-07-14 00:10:11 +02:00
Matthew Honnibal
e1c702e498
* Upd tests after refactor
2015-07-14 00:08:50 +02:00
Matthew Honnibal
ba9a22ae0b
* Ignore cpp files in spacy/tokens
2015-07-13 22:30:15 +02:00
Matthew Honnibal
98382bd7a0
* Update tests after refactor
2015-07-13 22:30:01 +02:00
Matthew Honnibal
d87d71caf4
* Compile the new modules after refactor
2015-07-13 22:29:33 +02:00
Matthew Honnibal
24d6ce99ec
* Add comment to tokenizer, explaining the spacy attr
2015-07-13 22:29:13 +02:00
Matthew Honnibal
8214b74eec
* Restore _py_tokens cache, to handle orphan tokens.
2015-07-13 22:28:10 +02:00
Matthew Honnibal
67641f3b58
* Refactor tokenizer, to set the 'spacy' field on TokenC instead of passing a string
2015-07-13 21:46:02 +02:00
Matthew Honnibal
6eef0bf9ab
* Break up tokens.pyx into tokens/doc.pyx, tokens/token.pyx, tokens/spans.pyx
2015-07-13 20:20:58 +02:00
Matthew Honnibal
3ea8756c24
* Add spacy/tokens/doc.pyx, for Doc class in its own file
2015-07-13 19:58:26 +02:00
Matthew Honnibal
c99387155f
* Refactor tokens, moving classes into a module instead of a single file
2015-07-13 19:49:55 +02:00
Matthew Honnibal
d27899658e
* Import classes in spacy.tokens.__init__
2015-07-13 19:48:55 +02:00
Matthew Honnibal
aa82caf8f5
* Add TokenC.spacy attr
2015-07-13 19:48:07 +02:00
Matthew Honnibal
dba6b47d4e
* Refactor monster tokens.pyx file, into a tokens/ subpackage. Try to break the cycle between Doc and Token, and remove the need to pass around a unicode string reference
2015-07-13 19:20:48 +02:00
Matthew Honnibal
5b0a7190c9
* Round-trip for serialization finally working. Needs a lot of optimization.
2015-07-13 18:39:38 +02:00
Matthew Honnibal
edd371246c
* Make huffman coder take BitArray in encode/decode. Add __iter__ method to BitArray.
2015-07-13 17:33:33 +02:00
Matthew Honnibal
af5cc926a4
* Add codec property to Vocab, to use the Huffman encoding
2015-07-13 13:55:14 +02:00
Matthew Honnibal
77385d5580
* Make .pxd file for huffman codec
2015-07-13 13:54:51 +02:00
Matthew Honnibal
0628e0e2a8
* Add tests for huffman encoding
2015-07-13 12:58:07 +02:00
Matthew Honnibal
083b6ea7ae
* Clean up encoder a bit. now read for integration into Vocab.
2015-07-13 12:57:22 +02:00
Matthew Honnibal
8d0f1d98da
* Draft dockstring for HuffmanCache
2015-07-13 12:01:18 +02:00
Matthew Honnibal
281f1faefb
* Nearly finished huffman coder
2015-07-12 23:48:46 +02:00
Matthew Honnibal
e1a25fba32
* Work on huffman coder
2015-07-12 19:58:05 +02:00
Matthew Honnibal
3fb9de2d13
* Remove vector[bint], in favor of simple Code struct.
2015-07-12 17:58:27 +02:00
Matthew Honnibal
aa7bfd932b
* Work on compressor
2015-07-12 16:03:43 +02:00
Matthew Honnibal
14eafcab15
* Refactor to use vector[bint]
2015-07-12 05:27:47 +02:00
Matthew Honnibal
6a6e852a39
* Refactor huffman coding stuff into class
2015-07-12 05:06:36 +02:00
Matthew Honnibal
aad96fdb5c
* Improve efficiency of huffman coding
2015-07-12 01:31:37 +02:00
Matthew Honnibal
ff9ff6f3fa
* Ensure unseen words are given low log probability
2015-07-12 01:31:09 +02:00
Matthew Honnibal
9d3b0d83de
* Refactor huffman coding
2015-07-11 22:27:43 +02:00
Matthew Honnibal
8d29406cd6
* Rename span.right to span.rights
2015-07-11 22:15:04 +02:00
Matthew Honnibal
da9f358166
* Fix span getting
2015-07-11 21:41:41 +02:00
Matthew Honnibal
11e8f2ffb4
* Huffman codes working
2015-07-11 20:01:10 +02:00
Matthew Honnibal
cb6fc81909
* Work on huffman coding.
2015-07-11 15:23:35 +02:00
Matthew Honnibal
4c9b77fe95
* Begin working on serialization code
2015-07-11 10:57:30 +02:00
Matthew Honnibal
11a380e00f
* Draft v0.89 update notes
2015-07-10 19:41:42 +02:00
Matthew Honnibal
53d1f5b2eb
* Rename Span.head to Span.root.
2015-07-09 17:30:58 +02:00
Matthew Honnibal
c0255ed7d8
* Allow slice indexing in Doc.__getitem__, returning a Span object
2015-07-09 15:15:32 +02:00
Matthew Honnibal
7d2964f673
* Test that whitespace is not assigned a tag
2015-07-09 13:31:40 +02:00
Matthew Honnibal
b5223c4824
* Add whitespace to specials.json
2015-07-09 13:31:12 +02:00
Matthew Honnibal
89a91ad726
* Add SPACE part-of-speech tag, and train tagger to assign it. Also train tagger not to make whitespace an entity
2015-07-09 13:30:41 +02:00
Matthew Honnibal
f95da0bd52
* Allow tests to read model dir from SPACY_DATA environment variable
2015-07-09 12:18:02 +02:00
Matthew Honnibal
55f1042443
* Improve efficiency of L and R features, correcting the non-linear-in-length problem.
2015-07-09 12:17:26 +02:00
Matthew Honnibal
70d2acb579
* Fix edge features
2015-07-09 12:15:01 +02:00