Matthew Honnibal
d30029979e
* Avoid import of morphology in spans
2015-08-26 19:20:46 +02:00
Matthew Honnibal
119c0f8c3f
* Hack out morphology stuff from tokenizer, while morphology is being reimplemented.
2015-08-26 19:20:11 +02:00
Matthew Honnibal
b4faf551f5
* Refactor language-independent tagger class
2015-08-26 19:19:21 +02:00
Matthew Honnibal
a3d5e6c0dd
* Reform constructor and save/load workflow in parser model
2015-08-26 19:19:01 +02:00
Matthew Honnibal
1d7f2d3abc
* Hack on morphology structs
2015-08-26 19:18:36 +02:00
Matthew Honnibal
f8f2f4e545
* Temporarily add PUNC name to parts_of_speech dictionary, until better solution
2015-08-26 19:18:19 +02:00
Matthew Honnibal
008b02b035
* Comment out enums in Morphology for now
2015-08-26 19:17:35 +02:00
Matthew Honnibal
378729f81a
* Hack Morphology class towards usability
2015-08-26 19:17:21 +02:00
Matthew Honnibal
430affc347
* Fix missing n_patterns property in Matcher class. Fix from_dir method
2015-08-26 19:17:02 +02:00
Matthew Honnibal
3acf60df06
* Add missing properties in Lexeme class
2015-08-26 19:16:28 +02:00
Matthew Honnibal
76996f4145
* Hack on generic Language class. Still needs work for morphology, defaults, etc
2015-08-26 19:16:09 +02:00
Matthew Honnibal
e2ef78b29c
* Gut pos.pyx module, since functionality moved to spacy/tagger.pyx
2015-08-26 19:15:42 +02:00
Matthew Honnibal
c4d8754385
* Specify LOCAL_DATA_DIR global in spacy.en.__init__.py
2015-08-26 19:15:07 +02:00
Matthew Honnibal
c2d8edd0bd
* Add PROB attribute in attrs.pxd
2015-08-26 19:14:19 +02:00
Matthew Honnibal
c5a27d1821
* Move lemmatizer to spacy
2015-08-25 15:47:08 +02:00
Matthew Honnibal
82217c6ec6
* Generalize lemmatizer
2015-08-25 15:46:19 +02:00
Matthew Honnibal
8083a07c3e
* Use language base class
2015-08-25 15:37:30 +02:00
Matthew Honnibal
f2f699ac18
* Add language base class
2015-08-25 15:37:17 +02:00
Matthew Honnibal
5dd76be446
* Split EnPosTagger up into base class and subclass
2015-08-24 05:25:55 +02:00
Matthew Honnibal
5d5922dbfa
* Begin laying out morphological features
2015-08-24 01:04:30 +02:00
Matthew Honnibal
6f1743692a
* Work on language-independent refactoring
2015-08-23 20:49:18 +02:00
Matthew Honnibal
3879d28457
* Fix https for url detection
2015-08-23 02:40:35 +02:00
Matthew Honnibal
cad0cca4e3
* Tmp
2015-08-22 22:04:34 +02:00
Matthew Honnibal
bf38b3b883
* Hack on l/r reversal bug
2015-08-10 05:58:43 +02:00
Matthew Honnibal
6116413b47
* Fix label prediction in StepwiseState
2015-08-10 05:05:31 +02:00
Matthew Honnibal
2c9753eff2
* Whitespace
2015-08-10 00:09:02 +02:00
Matthew Honnibal
9de98f5a6f
* Add Parser.stepthrough method, with context manager
2015-08-10 00:08:46 +02:00
Matthew Honnibal
fe43f8cf39
* Whitespace
2015-08-09 02:31:53 +02:00
Matthew Honnibal
9c090945e0
* Add Parser.predict method, and clean up Parser.get_state
2015-08-09 02:29:58 +02:00
Matthew Honnibal
04fccfb984
* Fix get_state for parser prediction
2015-08-09 02:11:22 +02:00
Matthew Honnibal
55fde0e240
* Fix get_state
2015-08-09 01:45:30 +02:00
Matthew Honnibal
f0f4fa9838
* Fix Parser.get_state
2015-08-09 01:40:13 +02:00
Matthew Honnibal
18331dca89
* Add continue_for argument to parser 'partial' function, which is now renamed to get_state
2015-08-09 01:31:54 +02:00
Matthew Honnibal
0653288fa5
* Fix stateclass.queue
2015-08-09 00:39:02 +02:00
Matthew Honnibal
9de218b7ba
* Fix Parser.partial function
2015-08-08 23:45:18 +02:00
Matthew Honnibal
01be34d55a
* Whitespace
2015-08-08 23:37:44 +02:00
Matthew Honnibal
cc9deae960
* Add is_valid method to transition_system
2015-08-08 23:36:18 +02:00
Matthew Honnibal
2a46c77324
* Whitespace
2015-08-08 23:35:59 +02:00
Matthew Honnibal
7bafc789e7
* Add stack and queue properties to stateclass, for python access
2015-08-08 23:32:42 +02:00
Matthew Honnibal
3af938365f
* Add function partial to Parser
2015-08-08 23:32:15 +02:00
Matthew Honnibal
76a1f0481a
* Whitespace
2015-08-08 23:31:54 +02:00
Matthew Honnibal
b0f5c39084
* Fix handling of exclusion entities
2015-08-06 17:28:43 +02:00
Matthew Honnibal
9f65879991
* Fix shape attr bug, and fix handling of false positive matches
2015-08-06 17:28:14 +02:00
Matthew Honnibal
10d869d102
* Don't allow conjunction between NPs in base NP chunks
2015-08-06 16:31:53 +02:00
Matthew Honnibal
383dfabd67
* Fix matcher setting of entities
2015-08-06 16:27:01 +02:00
Matthew Honnibal
59c3bf60a6
* Ensure entity recognizer doesn't over-write preset types
2015-08-06 16:09:08 +02:00
Matthew Honnibal
cd7d1682cd
* Fix loading of gazetteer.json file
2015-08-06 16:08:25 +02:00
Matthew Honnibal
9c667b7f15
* Set a value in attrs.pxd on the first flag, to reduce bugs
2015-08-06 16:08:04 +02:00
Matthew Honnibal
c263577424
* Fix lower attribute in lexeme.pxd
2015-08-06 16:07:41 +02:00
Matthew Honnibal
5737115e1e
* Work on gazetteer matching
2015-08-06 14:33:21 +02:00
Matthew Honnibal
9c1724ecae
* Gazetteer stuff working, now need to wire up to API
2015-08-06 00:35:40 +02:00
Matthew Honnibal
5bc0e83f9a
* Reimplement matching in Cython, instead of Python.
2015-08-05 01:05:54 +02:00
Matthew Honnibal
4c87a696b3
* Add draft dfa matcher, in Python. Passing tests.
2015-08-04 15:55:28 +02:00
Matthew Honnibal
eb7138c761
* Add attr relation in base NP detection
2015-08-01 00:34:40 +02:00
Matthew Honnibal
4988356cf0
* Fix dependency type bug from merged tokens
2015-08-01 00:33:24 +02:00
Matthew Honnibal
78a9068319
* Fix spacy attr on merged tokens
2015-07-30 04:25:58 +02:00
Matthew Honnibal
430e2edb96
* Fix noun_chunks issue
2015-07-30 03:51:50 +02:00
Matthew Honnibal
9590968fc1
* Fix negative indices in Span
2015-07-30 02:30:24 +02:00
Matthew Honnibal
74d8cb3980
* Add noun_chunks iterator, and fix left/right child setting in Doc.merge
2015-07-30 02:29:49 +02:00
Matthew Honnibal
d153f18969
* Fix negative indices on spans
2015-07-29 22:36:03 +02:00
Matthew Honnibal
b5132bed7d
* Set left and right children when loading parse from byte string
2015-07-28 21:03:18 +02:00
Matthew Honnibal
6609fcf4b2
* Make mem and vocab python-visible in Doc
2015-07-28 20:46:59 +02:00
Matthew Honnibal
d42fe2e694
* Add unicode_literals to strings.pyx
2015-07-28 16:15:53 +02:00
Matthew Honnibal
bb910cff92
* Fix Python3 problem in align_raw
2015-07-28 16:06:53 +02:00
Matthew Honnibal
dcafb181b9
* Fix Python3 problem in align_raw
2015-07-28 15:52:10 +02:00
Matthew Honnibal
c609ea18f0
* Increment version in download script
2015-07-28 15:22:17 +02:00
Matthew Honnibal
9c4d0aae62
* Switch to better Python2/3 compatible unicode handling
2015-07-28 14:45:37 +02:00
Matthew Honnibal
7606d9936f
* Python3 correction for GoldParse
2015-07-28 14:44:53 +02:00
Matthew Honnibal
ddc1a5cfe5
* Fix training under python3
2015-07-28 14:09:30 +02:00
Matthew Honnibal
a8bbd7312c
* Hackishly patch long dependencies problem
2015-07-28 00:14:29 +02:00
Matthew Honnibal
bb583f7f09
* Hackishly patch long dependencies problem
2015-07-27 23:14:33 +02:00
Matthew Honnibal
aa7a964a4f
* Add a type declaration for doc.from_array
2015-07-27 22:57:22 +02:00
Matthew Honnibal
25a8774f42
* Fix regression in packer
2015-07-27 21:53:38 +02:00
Matthew Honnibal
1601e488ee
* Fix bug in decoding non-ascii characters
2015-07-27 21:43:58 +02:00
Matthew Honnibal
6a95409cd2
* Fix type on bits
2015-07-27 21:16:49 +02:00
Matthew Honnibal
a296d72b54
* Fix en/attrs
2015-07-27 21:16:33 +02:00
Matthew Honnibal
45460f505c
* Fix data type on read32 in BitArray
2015-07-27 21:12:13 +02:00
Matthew Honnibal
3d43f49f69
* Revert prev change
2015-07-27 10:58:15 +02:00
Matthew Honnibal
6b586cdad4
* Change lexemes.bin format. Add a header specifying size of LexemeC and number of lexemes, and don't have the redundant orth information.
2015-07-27 08:31:51 +02:00
Matthew Honnibal
af6ed18f2a
* Ensure we don't use orth_encode on OOV words.
2015-07-27 02:12:01 +02:00
Matthew Honnibal
8535d872e8
* Set is_oov property in get_flags
2015-07-27 01:51:24 +02:00
Matthew Honnibal
8e4c69ee8c
* Add is_oov property, and fix up handling of attributes
2015-07-27 01:50:06 +02:00
Matthew Honnibal
fc268f03eb
* Assert against null pointer exceptions in vocab
2015-07-27 01:00:10 +02:00
Matthew Honnibal
0f093fdb30
* Fix get_by_orth for py3
2015-07-26 19:26:41 +02:00
Matthew Honnibal
ceeda5a739
* Fix get_by_orth for py3
2015-07-26 18:39:27 +02:00
Matthew Honnibal
6bb96c122d
* Host IS_ flags in attrs.pxd, and add properties for them on Token and Lexeme objects
2015-07-26 16:37:16 +02:00
Matthew Honnibal
eeaea25f0c
* Check oov_prob file is present
2015-07-26 16:36:38 +02:00
Matthew Honnibal
7eb2446082
* Return empty lexeme on empty string
2015-07-26 00:18:30 +02:00
Matthew Honnibal
1b5d1da2a7
* Allow an OOV probability to be specified in get_lex_props
2015-07-26 00:03:43 +02:00
Matthew Honnibal
cd6e25132b
* Allow an OOV probability to be specified in get_lex_props
2015-07-26 00:01:46 +02:00
Matthew Honnibal
fd525f0675
* Pass OOV probability around
2015-07-25 23:29:51 +02:00
Matthew Honnibal
3fe14b8ed6
* Fix CFile for Python2
2015-07-25 22:55:53 +02:00
Matthew Honnibal
823ef4a00b
* Remove profile declarations
2015-07-25 18:13:06 +02:00
Matthew Honnibal
f4809e562f
* Allow json to be used as a fallback if ujson is not available
2015-07-25 18:11:36 +02:00
Matthew Honnibal
9da06671cf
* Remove unused import
2015-07-25 18:11:16 +02:00
Matthew Honnibal
2060935cdb
* Remove explicit bytes type in doc.from_bytes, to accept bytearray
2015-07-24 04:54:13 +02:00
Matthew Honnibal
aa28e2e01d
* Release the GIL around parse function
2015-07-24 04:53:27 +02:00
Matthew Honnibal
d62eb34b76
* More Py 2/3 compatibility in bit strings
2015-07-24 04:52:06 +02:00
Matthew Honnibal
0bb839d299
* Fix string coercion for Python 3
2015-07-24 03:49:30 +02:00
Matthew Honnibal
c4ff410fdb
* Fix bytes problems for Python3
2015-07-24 03:48:23 +02:00
Matthew Honnibal
1ab25e4dad
* Fix python3 type error
2015-07-24 02:45:34 +02:00
Matthew Honnibal
f35ff173b0
* Fix bits.pyx unicode error
2015-07-23 20:37:57 +02:00
Matthew Honnibal
1406e24327
* Fix unicode error for Python3
2015-07-23 19:36:21 +02:00
Matthew Honnibal
dbda6c27fa
* Fix python3 error
2015-07-23 14:52:30 +02:00
Matthew Honnibal
99387f9572
* Fix python3 error
2015-07-23 14:30:29 +02:00
Matthew Honnibal
b81ffe9032
* Fix typing on mode string in CFile
2015-07-23 13:24:43 +02:00
Matthew Honnibal
22028602a9
* Add unicode_literals declaration in vocab.pyx
2015-07-23 13:24:20 +02:00
Matthew Honnibal
5b41744270
* Check for directory presence before loading annotators
2015-07-23 09:27:37 +02:00
Matthew Honnibal
df01a88763
Merge branch 'refactor' (and serialization)
Add Huffman-code serialization, and do a lot of
refactoring. Highlights include:
* Much more efficient StringStore
* Vocab maintains a by-orth mapping of Lexemes
* Avoid manually slicing Py_UNICODE buffers,
simplifying tokenizer and vocab C APIs
* Remove various bits of dead code
* Work on removing GIL around parser
* Work on bridge to Theano
Conflicts:
spacy/strings.pxd
spacy/strings.pyx
spacy/structs.pxd
2015-07-23 02:18:35 +02:00
Matthew Honnibal
a7c4d72e83
* Add serializer property to Vocab, and lazy-load it. Add get_by_orth method.
2015-07-23 01:18:19 +02:00
Matthew Honnibal
6ab1696b15
* Remove read_encoding_freqs from util.py
2015-07-23 01:17:32 +02:00
Matthew Honnibal
d5255aad77
* Update freqs for missing tags in ner, for serializer
2015-07-23 01:17:11 +02:00
Matthew Honnibal
12699a1152
* Set initial freqs, to avoid missing values in serializer
2015-07-23 01:16:27 +02:00
Matthew Honnibal
680bb47b55
* Write serializer freqs to single file, vocab/serializer.json
2015-07-23 01:15:25 +02:00
Matthew Honnibal
a0e36e8efc
* Add working to/from bytes API to Doc
2015-07-23 01:14:45 +02:00
Matthew Honnibal
1f31d96bf9
* Fix Packer API, so that it reads and writes bytes strings, instead of BitArray. Docs are always byte aligned anyway.
2015-07-23 01:13:02 +02:00
Matthew Honnibal
38ef986b29
* Update spacy/en/attrs.pxd
2015-07-23 01:10:58 +02:00
Matthew Honnibal
06eac32610
* Add cfile.pyx
2015-07-23 01:10:36 +02:00
Matthew Honnibal
0c507bd80a
* Fix tokenizer
2015-07-22 14:10:30 +02:00
Matthew Honnibal
c86dbe4944
* Update English.save_models for new Packer save/load stuff
2015-07-22 13:40:23 +02:00
Matthew Honnibal
bf77bcd6b9
* Add comment explaining hash_string
2015-07-22 13:39:42 +02:00
Matthew Honnibal
815bda201d
* Remove UniStr struct
2015-07-22 13:39:17 +02:00
Matthew Honnibal
2fc66e3723
* Use Py_UNICODE in tokenizer for now, while sorting out Py_UCS4 stuff
2015-07-22 13:38:45 +02:00
Matthew Honnibal
4d61239eac
* Reorganize the serialization functions on Doc
2015-07-22 04:53:01 +02:00
Matthew Honnibal
109106a949
* Replace UniStr, using unicode objects instead
2015-07-22 04:52:05 +02:00
Matthew Honnibal
424854028f
* Fix decode_int32
2015-07-21 20:09:59 +00:00
Matthew Honnibal
304d0e2633
* Use decode_int32 in _orth_decode
2015-07-21 20:40:55 +02:00
Matthew Honnibal
9cfa59ec33
* Optimistically try orth encoding, with char as a back-off
2015-07-21 20:22:45 +02:00
Matthew Honnibal
c8b89e37a5
* Bug fix to faster huffman decoding
2015-07-21 20:05:53 +02:00
Matthew Honnibal
b166d1d2a2
* Use encode32 and decode32
2015-07-21 19:59:06 +02:00
Matthew Honnibal
c6cd0ddce8
* Add faster encode_int32 and decode_int32 methods
2015-07-21 19:58:45 +02:00
Matthew Honnibal
dd60594f41
* Fix double encoding error in strings.pyx
2015-07-20 13:52:56 +02:00
Matthew Honnibal
06639dc497
* Add length cap to word shape feature
2015-07-20 12:06:59 +02:00
Matthew Honnibal
128b6d9714
* Move Utf8Str struct to strings module, as that's the only place it's relevant
2015-07-20 12:06:41 +02:00
Matthew Honnibal
01a97b90f3
* Fix header for string store
2015-07-20 12:06:10 +02:00
Matthew Honnibal
52d538ea42
* Fix short string optimization in strings.pyx. StringStore tests now all pass.
2015-07-20 12:05:23 +02:00
Matthew Honnibal
09a3055630
* Work on short string optimization in Utf8Str
2015-07-20 11:26:46 +02:00
Matthew Honnibal
bb0ba1f0cd
* Improve serialization speed
2015-07-20 03:27:59 +02:00
Matthew Honnibal
8743a8c084
* Update Doc serialization for new Packer interface
2015-07-20 01:38:04 +02:00
Matthew Honnibal
1f7170e0e1
* Reinstate the fixed vocabulary --- words are only added to the lexicon in init_model, after that we create LexemeC structs with the Pool given to us.
2015-07-20 01:37:34 +02:00
Matthew Honnibal
5a7d060d9c
* Switch between the orth and char codecs depending on which is shorter for that message. Mostly orth is shorter, except if there are OOV words.
2015-07-20 01:36:22 +02:00
Matthew Honnibal
5a042ee0d3
* Add function to predict number of bits needed to encode message
2015-07-20 01:35:11 +02:00
Matthew Honnibal
b89b489bb4
* Implement both character and orth encoding in Packer, so that we can decide which to use per-text
2015-07-19 22:39:45 +02:00
Matthew Honnibal
ae78c9e3ce
* Implement character-based codec, so that we can do word/char backoff
2015-07-19 22:03:39 +02:00
Matthew Honnibal
cd1d047cb8
* Delete out-dated HuffmanCodec comment
2015-07-19 18:28:14 +02:00
Matthew Honnibal
b8086067d5
* Build Huffman codec from unsorted inputs
2015-07-19 17:58:44 +02:00
Matthew Honnibal
317cbbc015
* Serialization round trip now working with decent API, but with rough spots in the organisation and requiring vocabulary to be fixed ahead of time.
2015-07-19 15:18:17 +02:00
Matthew Honnibal
6b13e7227c
* Remove duplicate get_lex_attr method from doc.pyx
2015-07-18 22:46:07 +02:00
Matthew Honnibal
e49c7f1478
* Update oov check in tokenizer
2015-07-18 22:45:28 +02:00
Matthew Honnibal
cfd842769e
* Allow infix tokens to be variable length
2015-07-18 22:45:00 +02:00
Matthew Honnibal
5b4c78bbb2
* Use an AttributeCodec based on orth for words. Still no oov handling mechanism.
2015-07-18 22:43:18 +02:00
Matthew Honnibal
82d84b0f2b
* Index lexemes by orth, instead of a lexemes vector. Breaks the mechanism for deciding not to own LexemeC structs during parsing. Need to reinstate this.
2015-07-18 22:42:15 +02:00
Matthew Honnibal
4dddc8a69b
* Fix type declarations for attr_t. Remove unused id_t.
2015-07-18 22:39:57 +02:00
Matthew Honnibal
ced59ab9ea
* Make minor efficiency improvement in Doc.__iter__
2015-07-18 04:10:53 +02:00
Matthew Honnibal
cd91914dd8
* Fix hard-coded length
2015-07-18 04:09:56 +02:00
Matthew Honnibal
b1d74ce60d
* Remove unused joint.pyx and joint.pxd files
2015-07-17 23:31:44 +02:00
Matthew Honnibal
c27514512b
* Remove cruft ner/ directory
2015-07-17 23:24:32 +02:00
Matthew Honnibal
f8d6d319f4
* Remove cruft module
2015-07-17 23:23:05 +02:00
Matthew Honnibal
fb0a641a2d
* Don't release the gil around Parser.parse. Does this indicate thread problems?
2015-07-17 23:07:37 +02:00
Matthew Honnibal
e29daea85f
* Fix bint/int typing problem in TransitionSystem. In C++ bint* means bool*, but in C it means int*. So, type-casting to bint* is unsafe.
2015-07-17 22:37:24 +02:00
Matthew Honnibal
cf0c788892
* Tests passing on round-trip pack/unpack on basic example
2015-07-17 21:20:48 +02:00
Matthew Honnibal
44f39a876f
* Add a blank attrs.pyx
2015-07-17 16:40:42 +02:00
Matthew Honnibal
c2c83120d4
* Remove codec property from Vocab
2015-07-17 16:40:11 +02:00
Matthew Honnibal
dfdf19f6a9
* Draft a from_orth method for Doc
2015-07-17 16:39:54 +02:00
Matthew Honnibal
9e3f17051b
* Move to ORTH instead of ID for encoding lexemes. Basic tests of the codec wrappers now passing
2015-07-17 16:38:29 +02:00
Matthew Honnibal
15ff739996
* Fix passing of ID attribute in string store
2015-07-17 14:49:42 +02:00
Matthew Honnibal
95e57c2780
* Remove unnecessary key and id properties from Utf8String.
2015-07-17 01:40:18 +02:00
Matthew Honnibal
234c7e440a
* Add spacy/serialize/__init__ files
2015-07-17 01:37:33 +02:00
Matthew Honnibal
db9dfd2e23
* Major refactor of serialization. Nearly complete now.
2015-07-17 01:27:54 +02:00
Matthew Honnibal
c8282f9934
* Work on serialization. Needs more reorganisation
2015-07-16 19:56:02 +02:00
Matthew Honnibal
d8458d6a25
* Fix attr_id_t import in Spans
2015-07-16 19:55:21 +02:00
Matthew Honnibal
d1cb30dbc4
* Remove unnecessary key and id properties from Utf8String.
2015-07-16 19:29:02 +02:00
Matthew Honnibal
897de2d438
* Add 'bitter' property for serializer in English class
2015-07-16 17:47:53 +02:00
Matthew Honnibal
fb54052ae0
* Work on serializer design
2015-07-16 17:46:46 +02:00
Matthew Honnibal
a6f401580d
* Add from_array function to Doc.
2015-07-16 17:46:11 +02:00
Matthew Honnibal
2a5d050134
* Give codec loading back to Vocab.
2015-07-16 17:45:42 +02:00
Matthew Honnibal
8bf0f65f1c
* Remove dead code in strings.pyx
2015-07-16 17:35:53 +02:00
Matthew Honnibal
a9c3863665
* Fix inefficiency in StringStore.dump function
2015-07-16 17:34:32 +02:00
Matthew Honnibal
b59d271510
* Move serialization functionality into Serializer class
2015-07-16 11:23:48 +02:00
Matthew Honnibal
30be4f15da
* Import attrs from spacy.attrs, not spacy.typedefs
2015-07-16 11:23:25 +02:00
Matthew Honnibal
6c99e5f4aa
* Move serialization into Serializer class, with __call__ and train() api
2015-07-16 11:22:35 +02:00
Matthew Honnibal
e2133d990e
* Move serialization functionality out into a Serializer object
2015-07-16 11:21:44 +02:00
Matthew Honnibal
a6d040bd11
* Import Lexeme attrs from spacy.attrs, not spacy.typedefs
2015-07-16 11:20:08 +02:00
Matthew Honnibal
45ae1ce428
* Remove unused declaration in parser
2015-07-16 01:27:11 +02:00
Matthew Honnibal
efa80096f1
* Update attrs id list
2015-07-16 01:26:54 +02:00
Matthew Honnibal
01fab6bb90
* Improve de/serialize functions
2015-07-16 01:26:35 +02:00
Matthew Honnibal
0e07c1ed2a
* draft de/serialization functions in doc.pyx
2015-07-16 01:16:33 +02:00
Matthew Honnibal
9d956b07e9
* Fix import of attrs in doc.pyx, and update the get_token_attr function.
2015-07-16 01:15:34 +02:00
Matthew Honnibal
65251e7625
* Remove redundant attr_id_t from typedefs.pxd
2015-07-16 00:58:51 +02:00
Matthew Honnibal
9a8db9743c
* Remove gil from parser.call
2015-07-14 23:47:33 +02:00
Matthew Honnibal
38ca0c33f5
Merge branch 'neuralnet' into refactor
Mostly refactors parser, to use new thinc3.2 Example class.
Aim is to remove use of shared memory, so that we can parallelize
over documents easily.
Conflicts:
setup.py
spacy/syntax/parser.pxd
spacy/syntax/parser.pyx
spacy/syntax/stateclass.pyx
2015-07-14 14:13:47 +02:00
Matthew Honnibal
935ac53ee3
* Extend count_by method
2015-07-14 03:20:09 +02:00
Matthew Honnibal
3b5baa660f
* Fix tokenizer
2015-07-14 00:10:51 +02:00
Matthew Honnibal
2ae0b439b2
* Fix space check in gold.pyx
2015-07-14 00:10:27 +02:00
Matthew Honnibal
81aa4e6dcc
* Go back to having token reference doc, instead of complicated gymnastics. Rename the attr 'doc', to expose it in the API
2015-07-14 00:10:11 +02:00
Matthew Honnibal
24d6ce99ec
* Add comment to tokenizer, explaining the spacy attr
2015-07-13 22:29:13 +02:00
Matthew Honnibal
8214b74eec
* Restore _py_tokens cache, to handle orphan tokens.
2015-07-13 22:28:10 +02:00
Matthew Honnibal
67641f3b58
* Refactor tokenizer, to set the 'spacy' field on TokenC instead of passing a string
2015-07-13 21:46:02 +02:00
Matthew Honnibal
6eef0bf9ab
* Break up tokens.pyx into tokens/doc.pyx, tokens/token.pyx, tokens/spans.pyx
2015-07-13 20:20:58 +02:00
Matthew Honnibal
3ea8756c24
* Add spacy/tokens/doc.pyx, for Doc class in its own file
2015-07-13 19:58:26 +02:00
Matthew Honnibal
c99387155f
* Refactor tokens, moving classes into a module instead of a single file
2015-07-13 19:49:55 +02:00
Matthew Honnibal
d27899658e
* Import classes in spacy.tokens.__init__
2015-07-13 19:48:55 +02:00
Matthew Honnibal
aa82caf8f5
* Add TokenC.spacy attr
2015-07-13 19:48:07 +02:00
Matthew Honnibal
dba6b47d4e
* Refactor monster tokens.pyx file, into a tokens/ subpackage. Try to break the cycle between Doc and Token, and remove the need to pass around a unicode string reference
2015-07-13 19:20:48 +02:00
Matthew Honnibal
5b0a7190c9
* Round-trip for serialization finally working. Needs a lot of optimization.
2015-07-13 18:39:38 +02:00
Matthew Honnibal
edd371246c
* Make huffman coder take BitArray in encode/decode. Add __iter__ method to BitArray.
2015-07-13 17:33:33 +02:00
Matthew Honnibal
af5cc926a4
* Add codec property to Vocab, to use the Huffman encoding
2015-07-13 13:55:14 +02:00
Matthew Honnibal
77385d5580
* Make .pxd file for huffman codec
2015-07-13 13:54:51 +02:00
Matthew Honnibal
083b6ea7ae
* Clean up encoder a bit. Now ready for integration into Vocab.
2015-07-13 12:57:22 +02:00
Matthew Honnibal
8d0f1d98da
* Draft docstring for HuffmanCache
2015-07-13 12:01:18 +02:00
Matthew Honnibal
281f1faefb
* Nearly finished huffman coder
2015-07-12 23:48:46 +02:00
Matthew Honnibal
e1a25fba32
* Work on huffman coder
2015-07-12 19:58:05 +02:00
Matthew Honnibal
3fb9de2d13
* Remove vector[bint], in favor of simple Code struct.
2015-07-12 17:58:27 +02:00
Matthew Honnibal
aa7bfd932b
* Work on compressor
2015-07-12 16:03:43 +02:00
Matthew Honnibal
14eafcab15
* Refactor to use vector[bint]
2015-07-12 05:27:47 +02:00
Matthew Honnibal
6a6e852a39
* Refactor huffman coding stuff into class
2015-07-12 05:06:36 +02:00
Matthew Honnibal
aad96fdb5c
* Improve efficiency of huffman coding
2015-07-12 01:31:37 +02:00
Matthew Honnibal
ff9ff6f3fa
* Ensure unseen words are given low log probability
2015-07-12 01:31:09 +02:00
Matthew Honnibal
9d3b0d83de
* Refactor huffman coding
2015-07-11 22:27:43 +02:00
Matthew Honnibal
8d29406cd6
* Rename span.right to span.rights
2015-07-11 22:15:04 +02:00
Matthew Honnibal
da9f358166
* Fix span getting
2015-07-11 21:41:41 +02:00
Matthew Honnibal
11e8f2ffb4
* Huffman codes working
2015-07-11 20:01:10 +02:00
Matthew Honnibal
cb6fc81909
* Work on huffman coding.
2015-07-11 15:23:35 +02:00
Matthew Honnibal
4c9b77fe95
* Begin working on serialization code
2015-07-11 10:57:30 +02:00
Matthew Honnibal
53d1f5b2eb
* Rename Span.head to Span.root.
2015-07-09 17:30:58 +02:00
Matthew Honnibal
c0255ed7d8
* Allow slice indexing in Doc.__getitem__, returning a Span object
2015-07-09 15:15:32 +02:00
Matthew Honnibal
89a91ad726
* Add SPACE part-of-speech tag, and train tagger to assign it. Also train tagger not to make whitespace an entity
2015-07-09 13:30:41 +02:00
Matthew Honnibal
55f1042443
* Improve efficiency of L and R features, correcting the non-linear-in-length problem.
2015-07-09 12:17:26 +02:00
Matthew Honnibal
70d2acb579
* Fix edge features
2015-07-09 12:15:01 +02:00
Matthew Honnibal
adb868bdad
* Add warning for models not found in parser
2015-07-08 20:04:55 +02:00
Matthew Honnibal
05b28ec9eb
* Add warning for models not found in parser
2015-07-08 20:02:13 +02:00
Matthew Honnibal
ef700401a6
* Add warning for models not found in parser
2015-07-08 20:00:46 +02:00
Matthew Honnibal
6218d8b389
* Add warning for models not found in parser
2015-07-08 19:59:16 +02:00
Matthew Honnibal
f6a6c39ce8
* Add warning for models not found in parser
2015-07-08 19:52:30 +02:00
Matthew Honnibal
78db7e32f7
* Remove has_sense method from Lexeme declaration
2015-07-08 19:41:20 +02:00
Matthew Honnibal
6ddb2f5e45
* Restore merge_mwe in English class
2015-07-08 19:35:30 +02:00
Matthew Honnibal
6859f6adac
* Restore merge_mwe in English class
2015-07-08 19:34:55 +02:00
Matthew Honnibal
3c270fc8ff
* Remove has_sense method from Lexeme
2015-07-08 19:28:29 +02:00
Matthew Honnibal
b64c843861
* Remove senses attr
2015-07-08 19:26:24 +02:00
Matthew Honnibal
1d3a592edf
* Remove the senses attr from LexemeC, to keep data compatibility
2015-07-08 19:24:44 +02:00
Matthew Honnibal
0ceb1f71c2
* Update parse features
2015-07-08 19:11:36 +02:00
Matthew Honnibal
2e51b5027a
* Alias Doc to Tokens, for backwards compatibility
2015-07-08 18:59:35 +02:00
Matthew Honnibal
e3c53f5ecd
* Fix mention of Tokens in docstring
2015-07-08 18:56:27 +02:00
Matthew Honnibal
bb522496dd
* Rename Tokens to Doc
2015-07-08 18:53:00 +02:00
Matthew Honnibal
b24e8be2b9
* Whitespace in docstring
2015-07-08 12:37:03 +02:00
Matthew Honnibal
abc43b852d
* Add pos_tags attr to Vocab.
2015-07-08 12:36:38 +02:00
Matthew Honnibal
935bcdf3e5
* Remove redundant tag_names argument to Tokenizer
2015-07-08 12:36:04 +02:00
Matthew Honnibal
ff885e8511
* Add ParserFactory convenience function
2015-07-08 12:35:46 +02:00
Matthew Honnibal
4e4fac452b
* Refactor __init__ for simplicity. Allow parse=True, tag=True etc flags to be passed at top-level. Do not lazy-load parser.
2015-07-08 12:35:29 +02:00
Matthew Honnibal
1d2deb4616
* Work on refactoring default arguments to English.__init__
2015-07-07 15:53:25 +02:00
Matthew Honnibal
2d0e99a096
* Pass pos_tags into Tokenizer.from_dir
2015-07-07 14:23:08 +02:00
Matthew Honnibal
6788c86b2f
* Begin refactor
2015-07-07 14:00:07 +02:00
Matthew Honnibal
52fd80c6c6
* Add experimental supersense features for parsing, based on lookup into wordnet.
2015-07-01 20:12:44 +02:00
Matthew Honnibal
e6d828a9af
* Set up an array POS_SENSES that denotes the set of valid senses for each POS tag. This way, we can do bitwise & between a lexeme's senses and the ones available for its POS tag, to get the allowable senses for the token.
2015-07-01 20:12:13 +02:00
Matthew Honnibal
2b8459d9a8
* Add senses flag to Lexeme
2015-07-01 20:10:41 +02:00
Matthew Honnibal
e23d1582a2
* Add supersense data to Lexeme objects. Add simple has_sense method to check the flag.
2015-07-01 18:50:37 +02:00
Matthew Honnibal
64fafa98be
* Add senses.pyx and senses.pxd
2015-07-01 18:49:44 +02:00
Matthew Honnibal
94dab94e5f
Merge branch 'master' of https://github.com/honnibal/spaCy
2015-06-30 18:16:26 +02:00
Matthew Honnibal
9af86b0b0b
* Fix attrs.pxd
2015-06-30 18:16:30 +02:00
Matthew Honnibal
af9c82f7a6
Merge branch 'master' of https://github.com/honnibal/spaCy
2015-06-30 18:11:37 +02:00
Matthew Honnibal
5d595b5a8c
* Inc versions
2015-06-30 18:11:06 +02:00
Matthew Honnibal
d2eeba6667
* Start wiring up color and emotion lexicons. Hopefully we get to use them.
2015-06-30 16:22:23 +02:00
Matthew Honnibal
e20106fdff
* Begin reorganizing neuralnet work
2015-06-30 14:26:32 +02:00
Matthew Honnibal
5cd3ed42d4
* Reenable averaging
2015-06-29 16:44:42 +02:00
Matthew Honnibal
894cbef8ba
* Wire eta and mu parameters up for neural net
2015-06-29 07:10:33 +02:00
Matthew Honnibal
3bb5876c5a
* Inline methods in StateClass
2015-06-29 01:10:14 +02:00
Matthew Honnibal
313a7f87b3
* Inline methods in StateClass
2015-06-29 01:06:28 +02:00
Matthew Honnibal
a02fd3af5d
* Check valency in L and R feature methods, to make feature calculation faster
2015-06-29 00:27:56 +02:00
Matthew Honnibal
5d870720bc
* Check valency in L and R feature methods, to make feature calculation faster
2015-06-29 00:17:29 +02:00
Matthew Honnibal
f4986d5d3c
* Use new Example class
2015-06-28 22:36:03 +02:00
Matthew Honnibal
735f1af91f
* Fix neural net stuff
2015-06-28 11:44:58 +02:00
Matthew Honnibal
e7003f1cf3
* Remove hard-coding of vector lengths
2015-06-28 11:37:17 +02:00
Matthew Honnibal
897dd0dd0b
* Merge changes, and adjust Example to use memoryview
2015-06-28 11:36:11 +02:00
Matthew Honnibal
9282a8e72c
* Prepare for new models to be plugged in by using Example class
2015-06-28 11:02:35 +02:00
Matthew Honnibal
75aeccc064
* Rejig parser interface to use new thinc.api.Example class, in prep of theano model. Comment out beam search
2015-06-28 11:02:34 +02:00
Matthew Honnibal
bf33598b34
* Work on a theano-driven model for the parser
2015-06-28 11:02:34 +02:00
Matthew Honnibal
bbef71f213
* Fix min function in fill_context
2015-06-28 10:46:39 +02:00
Matthew Honnibal
142b6f9510
* Revert last changes
2015-06-28 10:44:28 +02:00
Matthew Honnibal
b06962f18b
* Pad buffers in state
2015-06-28 10:36:14 +02:00
Matthew Honnibal
53be72387c
* Hack at fill_context to investigate performance loss
2015-06-28 10:34:28 +02:00
Matthew Honnibal
71a4e876a9
* Fix parse features
2015-06-28 09:27:33 +02:00
Matthew Honnibal
0c4b5a2bb0
* Start scoring tokens
2015-06-28 06:21:38 +02:00
Matthew Honnibal
5af500909c
* Remove unused directive from parser.pyx
2015-06-28 06:20:21 +02:00
Matthew Honnibal
d5b4090705
* Add profile directive
2015-06-28 06:19:33 +02:00
Matthew Honnibal
2b5421e60c
* Add profile directive
2015-06-28 06:07:04 +02:00
Matthew Honnibal
8b5de4a411
* Add word / tag / label sets, for use in neural net
2015-06-28 05:46:53 +02:00
Matthew Honnibal
cfcbd8d256
* Fix punctuation eval in scorer.py
2015-06-28 01:31:39 +02:00
Matthew Honnibal
ed40a8380e
* Remove hard-coding of vector lengths
2015-06-27 04:18:47 +02:00
Matthew Honnibal
ebe630cc8d
* Enable more features for NN
2015-06-27 04:17:29 +02:00
Matthew Honnibal
f8bb43475e
* Bridge to Theano working. Very disorganised. Using thinc adb60aba966ed2
2015-06-27 02:39:18 +02:00
Matthew Honnibal
2fe98b8a9a
* Prepare for new models to be plugged in by using Example class
2015-06-26 13:51:39 +02:00
Matthew Honnibal
6896455884
* Rejig parser interface to use new thinc.api.Example class, in prep of theano model. Comment out beam search
2015-06-26 06:25:36 +02:00
Matthew Honnibal
b266a63f2c
* Inc version of downloadable data
2015-06-24 04:53:08 +02:00
Matthew Honnibal
02b171ee67
* Bug fixes to edge calculation
2015-06-24 04:28:02 +02:00
Matthew Honnibal
a4e9bdf4c1
* Work on a theano-driven model for the parser
2015-06-24 01:02:40 +02:00
Matthew Honnibal
7f9384f53c
* Remove deprecated _state module
2015-06-23 17:28:24 +02:00
Matthew Honnibal
6dbe182491
* Fix merge conflicts
2015-06-23 17:28:00 +02:00
Matthew Honnibal
579735a095
* Remove import of _state module
2015-06-23 17:25:08 +02:00
Matthew Honnibal
88f55d136b
* Remove deprecated _state module
2015-06-23 17:19:51 +02:00
Matthew Honnibal
9ab9dd2bf7
* Clean up unused orig_arc_eager and tree_arc_eager modules, which were only added for EMNLP experiments
2015-06-23 17:17:33 +02:00
Matthew Honnibal
7ebfe4b983
* Fixes to edge features
2015-06-23 16:32:54 +02:00
Matthew Honnibal
7b125f5a86
* Fixes to edge features
2015-06-23 16:31:01 +02:00
Matthew Honnibal
8d4bbacfc5
* Fix edge navigation in Token objects
2015-06-23 16:07:34 +02:00
Matthew Honnibal
35c290bee4
* Fix edge features
2015-06-23 15:50:56 +02:00
Matthew Honnibal
221e2e485f
* Assign 'ROOT' as label, not 'root'
2015-06-23 15:09:54 +02:00
Matthew Honnibal
a7bf7b0626
* Rename sent_start to sent_end, to reflect its new usage in the Break transition
2015-06-23 05:39:43 +02:00
Matthew Honnibal
ee3e56f27b
* Fix bounds checking on entities
2015-06-23 04:35:08 +02:00
Matthew Honnibal
43ef5ddea5
* Ensure root label is spelled ROOT, for backwards compatibility
2015-06-23 04:14:03 +02:00
Matthew Honnibal
065c2e1d2d
* Add some bounds checking around state arrays
2015-06-23 04:13:09 +02:00
Matthew Honnibal
89ae218b75
* Add import to tokens.pyx from weird Cython compiler issue with casting from memory views
2015-06-23 03:04:34 +02:00
Matthew Honnibal
f01b3d043e
* Add padding to arrays in stateclass. May be papering over a deeper bug.
2015-06-23 03:03:41 +02:00
Matthew Honnibal
5e94b5d581
* Have Tokens return proper numpy arrays, not Cython views.
2015-06-23 00:07:34 +02:00
Matthew Honnibal
69507bc729
* Re-enable Break transition in arc_eager.pyx
2015-06-23 00:03:30 +02:00
Matthew Honnibal
cc579ed429
* Add __len__ function to StringStore
2015-06-23 00:02:50 +02:00
Matthew Honnibal
46fb24e9fd
* Add cycle-checking code in gold.pyx
2015-06-23 00:02:22 +02:00
Matthew Honnibal
60d26243e3
* Fix head alignment in read_conll.parse, which was causing corrupt parses when strip_bad_periods=True. A similar problem may apply to other data readers.
2015-06-18 16:35:27 +02:00
Matthew Honnibal
f868175e43
* Whitespace
2015-06-16 23:37:46 +02:00
Matthew Honnibal
ab110be125
* Remove debugging in parser.pyx
2015-06-16 23:37:25 +02:00
Matthew Honnibal
9b13d11ab3
* Fix handling of entities in StateClass
2015-06-16 23:35:21 +02:00
Matthew Honnibal
c40a2c661c
* Add tree_arc_eager
2015-06-15 08:23:24 +02:00
Matthew Honnibal
5da5cf7084
* Add some more features for S1/S0
2015-06-15 04:07:13 +02:00
Matthew Honnibal
8156a01bca
* Fix root label for orig_arc_eager
2015-06-15 02:54:55 +02:00
Matthew Honnibal
21930ede15
* Switch toggle on USE_ROOT_ARC_SEGMENT
2015-06-15 02:54:32 +02:00
Matthew Honnibal
38a6afa484
* Make possibly dubious correction to the unshift oracle
2015-06-15 02:50:00 +02:00
Matthew Honnibal
f66228f253
* Add some more features, esp for labels
2015-06-14 21:18:02 +02:00
Matthew Honnibal
3da8e0f317
* Add orig_arc_eager
2015-06-14 20:31:44 +02:00
Matthew Honnibal
ea8a103007
* Fix import of TransitionSystem in parser.pyx
2015-06-14 19:01:26 +02:00
Matthew Honnibal
e0984ca139
* Fix valency features in StateClass
2015-06-14 17:50:26 +02:00
Matthew Honnibal
e50ac1a47f
* Add verbose printing to scorer
2015-06-14 17:45:50 +02:00
Matthew Honnibal
763cbd23d5
* Upd stateclass.print_state
2015-06-14 17:44:29 +02:00
Matthew Honnibal
bdd07bf000
* Fix Break oracle, but disable the Break transition for now, while we finalize the gold-standard experiments
2015-06-14 17:44:03 +02:00
Matthew Honnibal
399f15fbdf
* Add flag to toggle handling of multi-root inputs without the Break transition. Clear up now unused best_valid stuff.
2015-06-14 00:28:37 +02:00
Matthew Honnibal
75289b4761
* Don't refuse to parse single token sentences, in case some transition system needs them, e.g. single word entity. Instead fix error in _init_state.
2015-06-13 22:55:55 +02:00
Matthew Honnibal
77d7e79c7e
* Fix r/l and distance features.
2015-06-12 13:06:15 +02:00
Matthew Honnibal
b643cb3d5c
* Allow training documents to be filtered in gold.pyx
2015-06-12 02:42:08 +02:00
Matthew Honnibal
15e177d7a1
* Fixes to unshift/fast-forward strategy. Getting 91.55 greedy on NW dev, gold preproc
2015-06-12 01:50:23 +02:00
Matthew Honnibal
afd77a529b
* Prepare for break transition, with fast-forwarding. 86.5 on 1k nw gold preproc
2015-06-10 14:08:30 +02:00
Matthew Honnibal
495f528709
* Add support for sentence breaks in stateclass
2015-06-10 12:34:28 +02:00
Matthew Honnibal
b7b18c279d
* Fix Reduce oracle. Getting 86.35
2015-06-10 11:33:39 +02:00
Matthew Honnibal
bb09b5d91a
* Fix shifted bit vector in stateclass --- should reflect whether the word has been *unshifted*.
2015-06-10 11:33:09 +02:00
Matthew Honnibal
aa9625f688
* Do non-monotonic Unshift. Every word can be shifted at most 1 time. When the Reduce move is used, if S0 has no head, we put the word back on the buffer. Gets 86.4 on nw 1k with gold pre-proc. Break transition not yet implemented for this.
2015-06-10 10:15:56 +02:00
Matthew Honnibal
7bf6b7de3e
* Add unshift action to StateClass, and track which moves have been shifted
2015-06-10 10:13:03 +02:00
Matthew Honnibal
f7c8069e65
* Fix bug in distance feature
2015-06-10 10:12:17 +02:00
Matthew Honnibal
abd07c067a
* Inline B and S methods on stateclass
2015-06-10 07:22:33 +02:00
Matthew Honnibal
e2f9a80713
* Remove old _state imports
2015-06-10 07:09:17 +02:00
Matthew Honnibal
e9aaecc619
* Remove from_struct method from StateClass
2015-06-10 06:58:27 +02:00
Matthew Honnibal
18cc326dc0
* Bug fixes to ner.pyx
2015-06-10 06:57:41 +02:00
Matthew Honnibal
e5570c9700
* Set nogil for oracle functions
2015-06-10 06:56:56 +02:00
Matthew Honnibal
4575e7a60f
* Fix beam search with new StateClass
2015-06-10 06:33:39 +02:00
Matthew Honnibal
04b1cd9b8c
* Greedy parsing working with new StateClass. Beam parsing broken
2015-06-10 04:20:23 +02:00
Matthew Honnibal
6a94b64eca
* Remove State* from parser.pyx entirely, switching over to StateClass. Beam parsing still untested.
2015-06-10 02:03:38 +02:00
Matthew Honnibal
f14a1526aa
* Remove version of fill_context that takes State*
2015-06-10 01:39:07 +02:00
Matthew Honnibal
d68c686ec1
* Move StateClass into interface of transition functions
2015-06-10 01:35:28 +02:00
Matthew Honnibal
4b98b3e9c8
* Cost functions now take StateClass argument, instead of State*.
2015-06-10 00:40:43 +02:00
Matthew Honnibal
e0cf61f591
* Move StateClass into the interface for is_valid
2015-06-09 23:23:28 +02:00
Matthew Honnibal
0895d454fb
* Prepare to switch to using state class, instead of state struct
2015-06-09 21:20:14 +02:00
Matthew Honnibal
2b9629ed62
* Begin adding stateclass to ArcEager
2015-06-09 01:41:09 +02:00
Matthew Honnibal
ba10fd8af5
* Add StateClass, to replace/refactor the mess in _state
2015-06-09 01:39:54 +02:00
Matthew Honnibal
c7e3dfc1dc
* Don't automatically push words when stack is empty, as it messes up beam parsing. Add hash method to beam state.
2015-06-08 14:49:04 +02:00
Matthew Honnibal
00a0dfcb59
* Avoid shipping the spacy.munge package
2015-06-08 00:54:13 +02:00
Matthew Honnibal
7d265a9c62
* Revert to wget in spacy.en.download
2015-06-08 00:48:56 +02:00
Matthew Honnibal
a8fc5f1285
* Fix munge/read_ner
2015-06-08 00:35:04 +02:00
Matthew Honnibal
1515862861
* Fix download.py
2015-06-08 00:08:05 +02:00
Matthew Honnibal
7e9e8f654a
* Use urllib in spacy.en.download
2015-06-07 23:51:38 +02:00
Matthew Honnibal
80cff41a9c
* Upd download.py
2015-06-07 19:13:28 +02:00
Matthew Honnibal
6e2564239d
* Bug fixes to beam parser. Search still broken on non-gold sentences
2015-06-07 19:12:59 +02:00
Matthew Honnibal
1ec4e6fc95
* Don't score whitespace tokens
2015-06-07 19:10:32 +02:00
Matthew Honnibal
731e5f1e46
* Add get() function in spacy/syntax/Config
2015-06-07 19:09:15 +02:00
Matthew Honnibal
8f142c1838
* Refactor transition system oracles, to split out move and label cost. Preparing to add Unshift move. Will exclude non-monotonic.
2015-06-07 03:21:29 +02:00
Matthew Honnibal
89b8775887
* Fix output from _min_edit_path when inputs match.
2015-06-06 05:58:53 +02:00
Matthew Honnibal
98cfd84123
* Remove hyphenation from main tokenizer loop: do it in infix.txt instead. This lets emoticons work
2015-06-06 05:57:03 +02:00
Matthew Honnibal
1fee7ade61
* Tweak to ner
2015-06-05 23:48:43 +02:00
Matthew Honnibal
33e70b167f
* Remove dead code from ner.pyx
2015-06-05 17:12:47 +02:00
Matthew Honnibal
88ac5c6e98
* Send beam_width < 0 to greedy parser
2015-06-05 17:12:06 +02:00
Matthew Honnibal
0114e7600d
* Fix NER oracle
2015-06-05 17:11:26 +02:00
Matthew Honnibal
c04e6ebca6
* Allow user to load different sized vectors.
2015-06-05 16:26:39 +02:00
Matthew Honnibal
6bf35cecc3
* Refactor transition system to use classes with staticmethods.
2015-06-05 02:27:17 +02:00
Matthew Honnibal
36a34d544b
* Refactoring arc_eager, grouping oracle functions into transitions
2015-06-04 22:43:03 +02:00
Matthew Honnibal
4433396005
* Improve efficiency of dynamic oracle, making beam training faster
2015-06-04 21:15:14 +02:00
Matthew Honnibal
079dad28a7
* Update for faster beam training
2015-06-04 19:32:32 +02:00
Matthew Honnibal
f8843906ad
Merge branch 'constituency'
...
Add beam parsing and training from JSON files, with Levenshtein alignment.
2015-06-03 06:07:24 +02:00
Matthew Honnibal
ae653b850a
* Remove unused import from gold.pyx
2015-06-03 06:07:15 +02:00
Matthew Honnibal
a2627b6102
* Fix bug in refactored init_transition
2015-06-03 06:01:26 +02:00
Matthew Honnibal
dd0867645d
* Remove stray const from State header
2015-06-03 00:10:04 +02:00
Matthew Honnibal
6c47b10a6e
* Make optimization to children_in_buffer: stop searching when we would cross a bracket.
2015-06-02 21:05:24 +02:00
Matthew Honnibal
a513ec500f
* Have oracle functions take a struct instead of a Python object
2015-06-02 20:01:06 +02:00
Matthew Honnibal
d1b55310a1
* Refactor _advance_beam function
2015-06-02 18:38:41 +02:00
Matthew Honnibal
0786d9b3c7
* Refactor TransitionSystem, adding set_valid method
2015-06-02 18:38:07 +02:00
Matthew Honnibal
bd82a49994
* Add set_scores method to Model
2015-06-02 18:37:10 +02:00
Matthew Honnibal
a3964957f6
* Add profiling for _state.pyx
2015-06-02 18:36:27 +02:00
Matthew Honnibal
e822df0867
* Fix bugs in new greedy/beam parser
2015-06-02 02:01:33 +02:00
Matthew Honnibal
66dfa95847
* Revise greedy_parse/beam_parse ownership goof
2015-06-02 01:34:19 +02:00
Matthew Honnibal
75658b2ed3
* Remove use of new beam.loss property, to maintain compatibility with older versions of thinc for now.
2015-06-02 00:57:09 +02:00
Matthew Honnibal
7c29362d60
* Rename parser class in parser.pxd, now that beam parsing is supported
2015-06-02 00:53:49 +02:00
Matthew Honnibal
58d5ac0944
* Add beam search capabilities to Parser. Rename GreedyParser to Parser.
2015-06-02 00:28:02 +02:00
Matthew Honnibal
62424e6c76
* Remove unused regularize argument from _ml.Model
2015-06-02 00:27:07 +02:00
Matthew Honnibal
adeb57cb1e
* Fix long line
2015-06-01 23:07:00 +02:00
Matthew Honnibal
e09a08bd00
* Add copy_state function
2015-06-01 23:06:30 +02:00
Matthew Honnibal
c7876aa8b6
* Add get_valid method
2015-06-01 23:06:00 +02:00
Matthew Honnibal
d82f9d958d
* Remove regularization cruft from _ml, move score from .pxd file to .pyx
2015-05-31 18:48:05 +02:00
Matthew Honnibal
5e99ff94c8
* Edits to arc eager oracle. Couldn't figure out how the non-monotonic lines made sense. They seem covered by children_in_stack
2015-05-31 15:14:37 +02:00
Matthew Honnibal
6c5632b71c
* Roll back proposed change to Break transition while investigating effect
2015-05-31 06:49:52 +02:00
Matthew Honnibal
6bba793df3
* Disable the Zipf-reweighting thing while investigating effect
2015-05-31 06:48:43 +02:00
Matthew Honnibal
e77940565d
* Add length cap to distance feature
2015-05-31 05:25:30 +02:00
Matthew Honnibal
fd596351ba
* Fix valency features
2015-05-31 05:24:33 +02:00
Matthew Honnibal
87d6551d19
* Allow gold parse to cut non-projective arcs
2015-05-31 01:11:56 +02:00
Matthew Honnibal
c4f0914b4e
* Fix POS tag evaluation in scorer.py: do evaluate punctuation tags
2015-05-30 18:24:32 +02:00
Matthew Honnibal
9e39a206da
* Fix efficiency of JSON reading, by using ujson instead of stream
2015-05-30 17:54:52 +02:00
Matthew Honnibal
76300bbb1b
* Use updated JSON format, with sentences below paragraphs. Allows use of gold preprocessing flag.
2015-05-30 01:25:46 +02:00
Matthew Honnibal
b76bbbd12c
* Read json files recursively from a directory, instead of requiring a single .json file
2015-05-29 03:52:55 +02:00
Matthew Honnibal
8f31d3b864
* Relax constraint on Break transition for non-monotonic parsing.
2015-05-28 23:39:52 +02:00
Matthew Honnibal
6b2e5c4b8a
* Avoid NER scoring for sentences with some missing NER values.
2015-05-28 22:39:08 +02:00
Matthew Honnibal
d25d31442d
* Hackishly support broken NER annotations. Should fix this.
2015-05-27 19:14:31 +02:00
Matthew Honnibal
7a2725bca4
* Read input json in a streaming way
2015-05-27 19:13:11 +02:00
Matthew Honnibal
6a1c91675e
* Add file to read ENAMEX ner data
2015-05-27 17:36:23 +02:00
Matthew Honnibal
732fa7709a
* Edits to align_raw script, for use in prepare_treebank
2015-05-27 04:23:31 +02:00
Matthew Honnibal
4010b9b6d9
* Pass parameter for regularization in parser.pyx
2015-05-27 03:18:50 +02:00
Matthew Honnibal
4c6058baa7
* Fix evaluation of NER in scorer.py
2015-05-27 03:18:16 +02:00
Matthew Honnibal
6016ee83a6
* Fix reading of NER in gold.pyx
2015-05-27 03:17:50 +02:00
Matthew Honnibal
04bda8648d
* Pass parameter for regularization to model
2015-05-27 03:16:58 +02:00
Matthew Honnibal
f69fe6a635
* Fix heads problem in read_conll
2015-05-27 01:14:54 +02:00
Matthew Honnibal
0eec1d12af
* Add comment about zipf reweighting
2015-05-27 01:14:07 +02:00
Matthew Honnibal
4d37b66c55
* Make Zipf regularization a bit more efficient
2015-05-27 01:12:50 +02:00
Matthew Honnibal
7fc24821bc
* Experiment with Zipfian corruptions when calculating prediction
2015-05-26 22:17:15 +02:00
Matthew Honnibal
eba7b34f66
* Add flag to disable loading of word vectors
2015-05-25 01:02:42 +02:00
Matthew Honnibal
3593babd35
* Add functions for Levenshtein distance alignment
2015-05-24 21:50:48 +02:00
Matthew Honnibal
744f06abf5
* Add script to read OntoNotes source documents
2015-05-24 21:49:58 +02:00
Matthew Honnibal
fc75210941
* Move spacy.syntax.conll to spacy.gold
2015-05-24 21:35:02 +02:00
Matthew Honnibal
765b61cac4
* Update spacy.scorer, to use P/R/F to support tokenization errors
2015-05-24 20:07:18 +02:00
Matthew Honnibal
efe7a7d7d6
* Clean unused functions from spacy.syntax.conll
2015-05-24 20:06:46 +02:00
Matthew Honnibal
78487f3e66
* Update parser oracle for missing heads
2015-05-24 20:05:58 +02:00
Matthew Honnibal
1044a13413
* Begin refactoring scorer to use recall over gold dependencies
2015-05-24 17:40:15 +02:00
Matthew Honnibal
acd1245ad4
* Remove cruft from conll.pyx --- unused stuff about evaluation, which now lives in spacy.scorer
2015-05-24 17:35:49 +02:00
Matthew Honnibal
20f1d868a3
* Tmp commit. Working on whole document parsing
2015-05-24 02:49:56 +02:00
Matthew Honnibal
f2ee9c4feb
* Comment out constituency parsing stuff, so that code compiles
2015-05-20 16:55:05 +02:00
Matthew Honnibal
8ee7c541f1
* Update Constituent definition
2015-05-20 16:03:26 +02:00
Matthew Honnibal
9dfc9c039c
* Work on constituency parsing.
2015-05-20 16:02:51 +02:00
Matthew Honnibal
5a5710e711
* Fix Span.subtree property
2015-05-13 21:53:15 +02:00
Matthew Honnibal
badf030b6c
* Add parse navigation to Span objects
2015-05-13 21:45:19 +02:00
Matthew Honnibal
ca320afe86
* Add docstring for ents attribute
2015-05-13 21:20:47 +02:00
Matthew Honnibal
ba07b925a7
* Fix compile error in conll.pyx
2015-05-12 22:33:47 +02:00
Matthew Honnibal
f1e0272b18
* Disable c-parsing transitions
2015-05-12 22:33:25 +02:00
Matthew Honnibal
03a6626545
* Tmp commit
2015-05-12 20:27:56 +02:00
Matthew Honnibal
9568ebed08
* Fix off-by-one in head reading
2015-05-12 20:27:56 +02:00
Matthew Honnibal
69840d8cc3
* Tweak verbose output printing in scorer.py
2015-05-12 20:27:56 +02:00
Matthew Honnibal
0605af6838
* Fix head misalignment in read_conll, when periods are ignored
2015-05-12 20:27:56 +02:00
Matthew Honnibal
d2ac8d8007
* Add ctnt field to State, in preparation for constituency parsing
2015-05-12 20:27:56 +02:00
Matthew Honnibal
ab67693393
* Add read_json_file to conll.pyx
2015-05-12 20:27:55 +02:00
Matthew Honnibal
aff9359a8d
* Update ner.pyx to expect brackets from gold_tuples
2015-05-12 20:27:55 +02:00
Matthew Honnibal
0ad72a77ce
* Write JSON files, with both dependency and PSG parses
2015-05-12 20:27:55 +02:00
Matthew Honnibal
d48218f4b2
* Add left_edge and right_edge properties
2015-05-12 20:27:55 +02:00
Matthew Honnibal
53cf77e1c8
* Bug fix: when non-monotonically correcting a dependency, make sure to delete the old one from the child list
2015-05-12 20:26:41 +02:00
Matthew Honnibal
a4e2af54f9
* Add support for l/r edge to add_dep, and move inlined methods into _state.pyx where possible
2015-05-12 20:26:41 +02:00
Matthew Honnibal
d634038eb6
* Add l_edge and r_edge props in TokenC for tracking the parse-yield of the token
2015-05-12 20:26:41 +02:00
Matthew Honnibal
03ebf70a66
* Inc version to 0.84
2015-05-12 02:38:51 +02:00
Matthew Honnibal
e73eaf2d05
* Replace some assertions with proper errors
2015-05-08 16:52:17 +02:00
Matthew Honnibal
fb8d50b3d5
Merge branch 'master' of ssh://github.com/honnibal/spaCy
2015-04-30 12:45:15 +02:00
Matthew Honnibal
ed8e8c3bd0
* Whitespace
2015-04-29 14:22:47 +02:00
Matthew Honnibal
378c2a6435
* Fix POS model: make it use tag instead of pos in history features
2015-04-29 00:02:53 +02:00
Matthew Honnibal
763ef01575
* Fix two bugs in feature calculation
2015-04-28 23:25:09 +02:00
Matthew Honnibal
b3fd48c97b
* Fix missing root labels bug identified in Issue #57
2015-04-28 20:45:51 +02:00
Jordan Suchow
3a8d9b37a6
Remove trailing whitespace
2015-04-19 13:01:38 -07:00
Jordan Suchow
5f0f940a1f
Remove unused imports
2015-04-19 01:05:22 -07:00
Matthew Honnibal
cc4e395927
* Add some ad hoc regexes, for multi-word location prepositions
2015-04-17 04:44:24 +02:00
Matthew Honnibal
f7ffd94e6a
* Add Token.conjuncts property
2015-04-17 01:40:53 +02:00
Matthew Honnibal
684d0e5e85
* Download updated data
2015-04-16 04:29:15 +02:00
Matthew Honnibal
2ef170a991
* Fix Issue #54 : Error merging multi-word token when there's a mid-token match.
2015-04-16 04:28:06 +02:00
Matthew Honnibal
42617548af
* Disable merge_mwes by default
2015-04-16 04:20:31 +02:00
Matthew Honnibal
99dbf8a38c
* Fix error type in lookup_transition
2015-04-16 01:36:22 +02:00
Matthew Honnibal
77d0700caf
* Add on X way regexes
2015-04-16 01:35:46 +02:00
Matthew Honnibal
9f16848b60
* Add (N0w, N1w) unigram pair to NER features, prompted by failure to detect 'this weekend'
2015-04-15 06:01:18 +02:00
Matthew Honnibal
c6707778dd
* Fix Issue #51 : Handle non-ascii lemmas correctly
2015-04-13 22:28:59 +02:00
Matthew Honnibal
bf0aff5124
* Fix bug in Tokens.ents where entity wasn't being emitted if another started immediately after
2015-04-13 21:34:33 +02:00
Matthew Honnibal
2b84a90bbb
* Fix Issue #50 : Python 3 compatibility of v0.80
2015-04-13 05:59:43 +02:00
Matthew Honnibal
fbd48c571d
* Rearrange code in tokens.pyx
2015-04-13 05:41:25 +02:00
Matthew Honnibal
507048dc45
* Rename StandardError to Exception, for Python 3 compatibility
2015-04-12 07:28:34 +02:00
Matthew Honnibal
761a19113a
* Fix /tmp moving thing in download.py
2015-04-12 07:04:10 +02:00
Matthew Honnibal
248a2b4b0f
* Remove Spans class
2015-04-12 04:07:29 +02:00
Matthew Honnibal
1d05e6da00
* Add ne_iob and ne_type features to NER
2015-04-10 19:07:08 +02:00
Matthew Honnibal
4df8a3d90f
* Add ne_iob and ne_type attributes to context vector
2015-04-10 05:02:15 +02:00
Matthew Honnibal
8c354c432b
* Add ValueError condition to ner_tag reading
2015-04-10 04:59:59 +02:00
Matthew Honnibal
435cccf098
* Add read_conll03_file function to conll.pyx
2015-04-10 04:59:11 +02:00
Matthew Honnibal
99c9ecfc18
* Fix bug in prefix, suffix and word shape features in parser and NER
2015-04-10 03:53:33 +02:00
Matthew Honnibal
cff2b13fef
* Fix Issue #44 : Broken Token.string attribute when single word sentence
2015-04-07 06:08:25 +02:00
Matthew Honnibal
6640386b25
* Fix Issue #43 : TAG attr not supported. Also add DEP attr, while I'm at it. Need better way of ensuring future changes don't break in similar way.
2015-04-07 06:00:57 +02:00
Matthew Honnibal
b64b2bd910
* Fix Issue #43 : TAG attr not supported. Also add DEP attr, while I'm at it. Need better way of ensuring future changes don't break in similar way.
2015-04-07 06:00:30 +02:00
Matthew Honnibal
f9e510a893
* Whitespace
2015-04-07 04:53:59 +02:00
Matthew Honnibal
66c7ccf6cc
* Fix Spans.orth_
2015-04-07 04:53:40 +02:00
Matthew Honnibal
b8d34531c4
* Add support for units to English.__init__, by loading and applying regular expressions
2015-04-07 04:02:32 +02:00
Matthew Honnibal
0ea5af88b6
* Add multi-word expression RegexMatcher
2015-04-07 03:45:40 +02:00
Matthew Honnibal
2fee67cfa3
* Add regular expressions for English multi-word expressions
2015-04-07 03:45:18 +02:00
Matthew Honnibal
5a075ea3fc
* Ensure NER moves are available for single-word tokens
2015-04-05 22:30:58 +02:00
Matthew Honnibal
a60a366b2c
* Support 'punct' dep label in conll.pyx
2015-04-05 22:30:19 +02:00
Matthew Honnibal
021c972137
* Print parse if verbose in scorer
2015-04-05 22:29:30 +02:00
Matthew Honnibal
fbf19049cf
* Add ent_type_ property
2015-03-31 02:01:29 +02:00
Matthew Honnibal
e70b87efeb
* Add merge() method to Tokens, with fairly brittle/hacky implementation, but quite easy to test. Passing minimal tests. Still need to fix left/right deps in C data
2015-03-30 01:37:41 +02:00
Matthew Honnibal
557856e84c
* Allow regular expressions to specify labels for merged spans
2015-03-27 17:40:52 +01:00
Matthew Honnibal
a3af6b7c3d
* Left-Arc from Root, to allow non-monotonic reduce to compete with left-arc when the stack is not empty.
2015-03-27 17:39:16 +01:00
Matthew Honnibal
db5a43318c
* Improve print_state debug printer
2015-03-27 17:29:58 +01:00
Matthew Honnibal
1705eccbbe
* Remove whitespace
2015-03-27 15:22:39 +01:00
Matthew Honnibal
3feb52374c
* Break apart a condition, for ease of debug printing
2015-03-27 15:21:38 +01:00
Matthew Honnibal
b32f581acb
* Fix bug in ArcEager.get_labels
2015-03-27 15:21:06 +01:00
Matthew Honnibal
5f2a4ff36d
* Fix spans.lemma_
2015-03-26 16:45:38 +01:00
Matthew Honnibal
f4cc222ec3
* Fix NER scoring
2015-03-26 16:45:38 +01:00
Matthew Honnibal
1320bd19db
* Move Span class to own file
2015-03-26 16:45:38 +01:00
Matthew Honnibal
6f47a667cf
* Move Span class to own file
2015-03-26 16:45:38 +01:00
Matthew Honnibal
f02c39dfaf
* Compare to is not None, for more robustness
2015-03-26 16:44:48 +01:00
Matthew Honnibal
8f68b864c4
* Move Span/Spans to separate files. Currently duplicates lots of Tokens functionality. Should probably be integrated into Tokens
2015-03-26 16:44:48 +01:00
Matthew Honnibal
e854ba0a13
* Remove support for force_gold flag from GreedyParser, since it's not so useful, and it's clutter
2015-03-26 16:44:47 +01:00
Matthew Honnibal
6a6085f8b9
* Clean up GreedyParser.train function a bit
2015-03-26 16:44:47 +01:00
Matthew Honnibal
b3157927e6
* Clean up unused feature templates
2015-03-26 16:44:47 +01:00
Matthew Honnibal
411bf377d4
* Remove dependency on ner_util module
2015-03-26 16:44:47 +01:00
Matthew Honnibal
01c892f583
* Add comment to fill_context
2015-03-26 16:44:47 +01:00
Matthew Honnibal
2741179aff
* Important bug fix: Fill token N2w, which was being unfilled, after a bad edit while writing the NER features.
2015-03-26 16:44:47 +01:00
Matthew Honnibal
2b2dec95d3
* Add comment to set_parse
2015-03-26 16:44:47 +01:00
Matthew Honnibal
e770fade1e
* Don't set dependency labels in set_parse, as this may be used by the Entity recogniser instead. Need to clean this method up...
2015-03-26 16:44:47 +01:00
Matthew Honnibal
71648205d9
* Add support for debug feature set. Just use unigrams for this.
2015-03-26 16:44:47 +01:00
Matthew Honnibal
3b70b304b2
* Add words to gold_tuples from gold conll file
2015-03-26 16:44:47 +01:00
Matthew Honnibal
2e12dec76e
* Adjust scorer to account for tokenization mistakes
2015-03-26 16:44:47 +01:00
Matthew Honnibal
05d6065e2e
* Add assertion
2015-03-26 16:44:46 +01:00
Matthew Honnibal
377e9b29b1
* Whitespace
2015-03-26 16:44:46 +01:00
Matthew Honnibal
670959f40c
* Fix iteration order on Tokens.rights
2015-03-26 16:44:46 +01:00
Matthew Honnibal
231ce2dae5
* Assign ROOT label by default. May be papering over another bug.
2015-03-26 16:44:46 +01:00
Matthew Honnibal
9f4ad8fdfb
* Assign root words the ROOT label via the Break transition. Something is still wrong here...
2015-03-26 16:44:46 +01:00
Matthew Honnibal
f729164c01
* Fix bug in label assignment: ensure null-label transitions receive the label 0
2015-03-26 16:44:46 +01:00
Matthew Honnibal
7237c805c7
* Load tag for specials.json token
2015-03-26 16:44:46 +01:00
Matthew Honnibal
567388e38d
* Use values encoded by StringStore in POS tagging, rather than indices into a list of tags
2015-03-26 16:44:45 +01:00
Matthew Honnibal
3105c7f8ba
* Don't pass label_ids dict to Tokens, since we now use the StringStore to manage string-to-int mapping for labels
2015-03-26 16:44:45 +01:00
Matthew Honnibal
801bf14f4f
* Clean up handling of dep_strings and ent_strings, using StringStore to encode the label names.
2015-03-26 16:44:45 +01:00
Matthew Honnibal
31fad99518
* Use StringStore to encode label names, instead of label_ids
2015-03-26 16:44:45 +01:00
Matthew Honnibal
64db61bff1
* Add Span class to Python API
2015-03-26 16:44:45 +01:00
Matthew Honnibal
b9b695fb1b
* Remove debug word list
2015-03-26 16:44:45 +01:00
Matthew Honnibal
f21ab2d7fb
* Fix bug in ugly ent_strings hack on English class
2015-03-26 16:44:45 +01:00
Matthew Honnibal
1c843934be
* Fix oracle bug in NER. Now getting 77% F on ontonotes
2015-03-26 16:44:44 +01:00
Matthew Honnibal
903f196b3f
* Fix verbose printing for scorer
2015-03-26 16:44:44 +01:00
Matthew Honnibal
e181c051d5
* Improve features for NER
2015-03-26 16:44:44 +01:00
Matthew Honnibal
7ecb52c0ed
* Add scorer script
2015-03-26 16:44:44 +01:00
Matthew Honnibal
8057a95f20
* NER seems to be working, scoring 69 F. Need to add decision-history features --- currently only use current word, 2 words context. Need refactoring.
2015-03-26 16:44:44 +01:00
Matthew Honnibal
ae235e07b9
* Refactoring working for parser, but now need to rig up features for NER, and then debug oracle etc.
2015-03-26 16:44:44 +01:00
Matthew Honnibal
b3eda03c9c
* Tmp
2015-03-26 16:44:44 +01:00
Matthew Honnibal
220ce8bfed
* Prepare English class for NER
2015-03-26 16:44:44 +01:00
Matthew Honnibal
f5830dc1c1
* Remove _transitions.pyx
2015-03-26 16:44:44 +01:00
Matthew Honnibal
6865c2fb4d
* Fix assignment of dep strings in tokens.pyx
2015-03-26 16:44:43 +01:00
Matthew Honnibal
6b6bce9e7a
* Fix label loading for transition system
2015-03-26 16:44:43 +01:00
Matthew Honnibal
5278c7504b
* Hacks to conll.pyx. Should clean these up.
2015-03-26 16:44:43 +01:00
Matthew Honnibal
f321b2b2eb
* Remove TODO comment
2015-03-26 16:44:43 +01:00
Matthew Honnibal
fdabd93bfb
* Ensure high loss for invalid moves, and fix label reading for arc-eager
2015-03-26 16:44:43 +01:00
Matthew Honnibal
10ed738df2
* Tmp commit
2015-03-26 16:44:43 +01:00
Matthew Honnibal
4f83c9b3d5
* Make costs label-sensitive
2015-03-26 16:44:43 +01:00
Matthew Honnibal
179b7eb0a7
* Specify parser transition system in language
2015-03-26 16:44:43 +01:00
Matthew Honnibal
8c883cef58
* Refactored transition system code now compiling. Still need to hook up label oracle, and test
2015-03-26 16:44:43 +01:00
Matthew Honnibal
f0159ab4b6
* Add file to hold GoldParse class
2015-03-26 16:44:42 +01:00
Matthew Honnibal
8eadb984cb
* Refactor arc_eager to use new TransitionSystem base class. Need to fix oracle
2015-03-26 16:44:42 +01:00
Matthew Honnibal
b063001596
* Add base TransitionSystem class. Still need to rethink how non-monotonic labelling will work for best_valid
2015-03-26 16:44:42 +01:00
Matthew Honnibal
01bc4d6815
* Add set_parse method, to assign parse to tokens in a less hacky way.
2015-03-26 16:44:42 +01:00
Matthew Honnibal
dc986dbc0b
* Work on refactored parser, where TransitionSystem can be easily subclassed
2015-03-26 16:44:42 +01:00
Matthew Honnibal
1cc6329b18
* Add base class to do transitions
2015-03-26 16:44:42 +01:00
Matthew Honnibal
135756ac3d
* Tmp commit of NER refactoring
2015-03-26 16:44:42 +01:00
Matthew Honnibal
23c1f6fc04
* Merge changes from stash
2015-03-26 16:44:41 +01:00
Matthew Honnibal
0ff078876a
* Commit some work on ner.pyx done on the plane
2015-03-26 16:44:41 +01:00
Matthew Honnibal
d81b7be6a2
* Merge train.py
2015-03-26 16:44:41 +01:00
Matthew Honnibal
2e3dc3dfe2
* Merge changes in tokens.pyx
2015-03-26 16:44:41 +01:00
Matthew Honnibal
8cc3524dc9
* Ws
2015-03-26 16:44:41 +01:00
Matthew Honnibal
3d0570685c
* Add NER transition system
2015-03-26 16:44:41 +01:00
Matthew Honnibal
043b758cf4
* Resurrect old NER code. This version won't be the one that runs; we want to re-use the parser code. But for now this is a useful reference.
2015-03-26 16:44:41 +01:00
Matthew Honnibal
b139aa92ba
* Start setting out how NER will be implemented in the data model
2015-03-26 16:44:41 +01:00
Matthew Honnibal
0962ffc095
* Fix issue #37 : missing check_flag attribute from Token class
2015-03-26 15:06:26 +01:00
Matthew Honnibal
2e8d0e5d45
* Upd download script
2015-03-03 05:47:16 -05:00
Matthew Honnibal
dbe26f5793
* Add children and subtree methods to Token, which are generators to assist parse-tree navigation.
2015-03-03 04:18:41 -05:00
Matthew Honnibal
ea90d136e8
* Fix bug in labelled parsing, that caused an 8% drop in labelled accuracy.
2015-02-27 03:56:10 -05:00
Matthew Honnibal
caf046b220
* Hastily add method to apply tags from a list of strings, instead of predicting the tags.
2015-02-23 15:40:17 -05:00
Matthew Honnibal
cae077b583
* Work on fixing orphaned Token objects bug
2015-02-16 15:20:31 -05:00
Matthew Honnibal
7572e31f5e
* Pass ownership of C data to Token instances if Tokens object is being garbage-collected, but Token instances are staying alive.
2015-02-11 18:05:06 -05:00
Matthew Honnibal
64645a1c2f
* Improve docstring on English
2015-02-11 15:13:20 -05:00
Matthew Honnibal
594e50bd45
* Add option to download speech-parsing data set.
2015-02-11 14:20:29 -05:00
Matthew Honnibal
0b7e769211
* Add POS tags to support SWBD tag set
2015-02-11 14:08:28 -05:00
Matthew Honnibal
312b3a45f3
* Fix issue #19 : Allow parsing/pos tagging of empty strings
2015-02-10 10:15:58 -05:00
Matthew Honnibal
2a0615104b
* Upd download script
2015-02-09 10:22:59 -05:00
Matthew Honnibal
5c3513583d
* Clear buffered python tokens when modifying the Tokens object. Need to clean this up, and modify via a method on Tokens.
2015-02-09 03:57:10 -05:00
Matthew Honnibal
be5536d239
* Fix Issue #22 : PRP and PRP$ were mapped to NOUN. Should be PRON.
2015-02-08 18:36:18 -05:00
Matthew Honnibal
0492cee8b4
* Fix Issue #24 : Lemmas are empty when the L field is missing for special-cased tokens
2015-02-08 18:30:30 -05:00
Matthew Honnibal
d229fbd228
* Give better error on out-of-bounds array access
2015-02-07 12:59:12 -05:00
Matthew Honnibal
ab8bb047d0
* Fix negative index for __getitem__
2015-02-07 12:58:46 -05:00
Matthew Honnibal
44c7eafe44
* Fix download.py
2015-02-07 12:00:36 -05:00
Matthew Honnibal
6ca7f2eedc
* Upd download script
2015-02-07 11:32:33 -05:00
Matthew Honnibal
f0e0588833
* Fill L2 norm attribute on LexemeC struct
2015-02-07 08:44:42 -05:00
Matthew Honnibal
75f9b7d6bf
* Add L2 norm field to LexemeC struct
2015-02-07 08:43:17 -05:00
Matthew Honnibal
51b618d646
* Add a has_repvec property to Lexeme, and a check function to check flags
2015-02-07 08:42:44 -05:00
Matthew Honnibal
321b402739
* Store the l2 norm of the word's vector
2015-02-07 08:42:16 -05:00
Matthew Honnibal
c7d8644149
* Fix regression on 'prob' attr of Token.
2015-02-03 03:32:18 +11:00
Matthew Honnibal
c55a33d045
* Catch oracle errors
2015-02-02 23:02:04 +11:00
Matthew Honnibal
de772088e6
* Use parse tree for sbd in Tokens.sents
2015-02-02 12:17:32 +11:00
Matthew Honnibal
56c2ef2982
* Tweak POS features for web text
2015-02-02 11:59:36 +11:00
Matthew Honnibal
d68678a93e
* Add Exception class, OracleError
2015-02-02 11:57:32 +11:00
Matthew Honnibal
a20fdbd8ee
* Upd download script
2015-02-01 13:22:23 +11:00
Matthew Honnibal
76d9394cb4
* Fix vocab.pyx for Python3
2015-02-01 13:14:04 +11:00
Matthew Honnibal
63abdf154c
* Hastily hack download file
2015-01-31 22:48:32 +11:00
Matthew Honnibal
7de00c5a79
* Try not holding a reference to Pool, since that seems to confuse the GC
2015-01-31 22:10:22 +11:00
Matthew Honnibal
ce3ae8b5d9
* Fix platform-specific lexicon bug.
2015-01-31 16:38:58 +11:00
Matthew Honnibal
a1ed574b7b
* Fix default model path for English
2015-01-31 16:38:27 +11:00
Matthew Honnibal
018e0bfa24
* Bug fixes to parse navigation
2015-01-31 16:37:13 +11:00
Matthew Honnibal
e013555b25
* Add option to download script
2015-01-31 13:51:56 +11:00
Matthew Honnibal
08ca5c8970
* Add sent_end flag to TokenC struct
2015-01-31 13:44:16 +11:00
Matthew Honnibal
024cfd485c
* Pass tag_strings as a tuple, to support new Tokens API
2015-01-31 13:43:37 +11:00
Matthew Honnibal
77d62d0179
* Large refactor of Token objects, making them much thinner. This is to support fast parse-tree navigation.
2015-01-31 13:42:58 +11:00
Matthew Honnibal
88170e6295
* Supply dep_strings as a tuple, for the changed API on Tokens
2015-01-31 13:42:09 +11:00
Matthew Honnibal
0981d68022
* Set a sent_end flag during parsing, for later use
2015-01-31 13:41:46 +11:00
Matthew Honnibal
251dbf24d7
* Fix uninitialised variable error
2015-01-30 20:46:34 +11:00
Matthew Honnibal
83a4df5a1a
* Fix download script
2015-01-30 20:40:42 +11:00
Matthew Honnibal
6f9ebc2f34
* Fix download script
2015-01-30 20:33:19 +11:00
Matthew Honnibal
8b85d0bb8a
* Only download small data if no data dir exists
2015-01-30 20:27:14 +11:00
Matthew Honnibal
1a7a1c2771
* Fix Issue #16 : tokens recurse when printing
2015-01-30 19:47:50 +11:00
Matthew Honnibal
cb95ef6934
* Fix download script
2015-01-30 19:28:43 +11:00
Matthew Honnibal
e578bd37bd
* Fix download script
2015-01-30 18:59:31 +11:00
Matthew Honnibal
df52014d12
* Fix download script
2015-01-30 18:36:24 +11:00
Matthew Honnibal
0f95712189
* Improve accuracy reporting during training
2015-01-30 18:05:06 +11:00
Matthew Honnibal
b68f563c2f
* Fix Issue #14 : Improve parsing API
2015-01-30 18:04:41 +11:00
Matthew Honnibal
998b607f65
* Upd download script, having it download all data if there's no data/ directory, allowing easier compilation from source
2015-01-30 18:04:01 +11:00
Matthew Honnibal
67d6e53a69
* Ensure parser and tagger function correctly when training from missing values, indicated by -1
2015-01-30 14:08:56 +11:00
Matthew Honnibal
4ff180db74
* Fix off-by-one error in commit 0a7fceb
2015-01-30 12:49:33 +11:00
Matthew Honnibal
0a7fcebdf7
* Fix Issue #12 : Incorrect token.idx calculations for some punctuation, in the presence of token cache
2015-01-30 12:33:38 +11:00
Matthew Honnibal
ebf7d2fab1
* Use non-joint sbd, for more simplicity and fewer classes
2015-01-29 06:22:03 +11:00
Matthew Honnibal
d05c5bf141
* Remove comment
2015-01-29 05:19:27 +11:00
Matthew Honnibal
320b045daa
* Oracle now consistent over gold standard derivation
2015-01-29 03:41:58 +11:00
Matthew Honnibal
f590382134
* Work on sbd
2015-01-29 03:18:29 +11:00
Matthew Honnibal
1884a7a0be
* Attach comment with paper
2015-01-28 03:18:43 +11:00
Matthew Honnibal
a2d6b195db
* Add messy Break transitions, carefully following the scheme of Dd Zhang et al. (2013)
2015-01-28 03:09:45 +11:00
Matthew Honnibal
f9ee5d9934
* Build a python list of word strings, for debugging
2015-01-28 01:06:13 +11:00
Matthew Honnibal
d819101571
* Improve error message on oracle failure
2015-01-28 00:58:03 +11:00
Matthew Honnibal
e6c3d3471f
* Tweak documentation for Tokens, and hide constructor as __cinit__
2015-01-27 18:57:52 +11:00
Matthew Honnibal
c38c62d4a3
* Add docstring to English class
2015-01-27 02:45:21 +11:00
Matthew Honnibal
d4c99f7dec
* Add attrs.pxd
2015-01-26 22:22:09 +11:00
Matthew Honnibal
d4a493855e
* Fix error msg
2015-01-25 23:01:30 +11:00
Matthew Honnibal
7f87716cf7
* Fix download script
2015-01-25 23:01:10 +11:00
Matthew Honnibal
92fb9257dd
* Add parts-of-speech file
2015-01-25 22:00:39 +11:00
Matthew Honnibal
c1c3dba4cb
* Check whether vector files are present before trying to load them.
2015-01-25 18:16:48 +11:00
Matthew Honnibal
5049d4c2e6
* Add parts_of_speech.pyx
2015-01-25 16:32:26 +11:00
Matthew Honnibal
12b034e3ef
* Move POS tag definitions to parts_of_speech.pxd
2015-01-25 16:31:07 +11:00
Matthew Honnibal
7431c133d8
* Add error if trying to access head when not is_parsed
2015-01-25 15:33:54 +11:00
Matthew Honnibal
951d06c824
* Silently don't parse if data is not present
2015-01-25 14:47:38 +11:00
Matthew Honnibal
4e857ab7a6
* Fix bug in POS tagger feature
2015-01-25 02:20:15 +11:00
Matthew Honnibal
dd56e298e2
* Ensure tagging is applied if parse=True
2015-01-25 02:19:44 +11:00
Matthew Honnibal
94750819cd
* Set parse=True by default --- i.e. parse unless told not to.
2015-01-25 01:28:28 +11:00
Matthew Honnibal
71b95202eb
* Add docstring to StringStore
2015-01-24 20:49:15 +11:00
Matthew Honnibal
6d1c08dafd
* Add docstring to Lexeme
2015-01-24 20:48:34 +11:00
Matthew Honnibal
a97bed9359
* Fix POS and dependency label tag names. Add parse and string navigation functions.
2015-01-24 17:29:04 +11:00
Matthew Honnibal
76cd024095
* Add whitespace property to Token
2015-01-24 07:41:21 +11:00
Matthew Honnibal
5fd72bc220
* Have 'string' refer to the whitespace-padded string
2015-01-24 07:32:38 +11:00
Matthew Honnibal
fda94271af
* Rename NORM1 and NORM2 attrs to lower and norm
2015-01-24 06:17:03 +11:00
Matthew Honnibal
5ed8b2b98f
* Rename sic to orth
2015-01-23 02:08:25 +11:00
Matthew Honnibal
a27b23cc8f
* Have SBD return start/end indices
2015-01-22 22:24:44 +11:00
Matthew Honnibal
d460c28838
* Rename vec to repvec
2015-01-22 02:06:22 +11:00
Matthew Honnibal
8b9d913d97
* Rename vec to repvec
2015-01-22 02:05:58 +11:00
Matthew Honnibal
9cd0b6b3e9
* Various tweaks to Tokens class
2015-01-22 02:05:37 +11:00
Matthew Honnibal
5928d158ce
* Pass the string to Tokens
2015-01-22 02:04:58 +11:00
Matthew Honnibal
45264e356b
* Rename vec to repvec
2015-01-22 02:04:24 +11:00
Matthew Honnibal
5e63c606ad
* Rename vec to repvec
2015-01-22 02:03:54 +11:00
Matthew Honnibal
56e6cf0672
* Add _string attr to Tokens object
2015-01-21 18:57:09 +11:00
Matthew Honnibal
d6ac60e91c
* Bug fixes to sentences method, and improved vector transport for tokens
2015-01-21 18:56:32 +11:00
Matthew Honnibal
f2a229136c
* Fix data_dir=None argument to English class
2015-01-21 18:27:31 +11:00
Matthew Honnibal
ef49b8c179
* Add stop-word flag
2015-01-21 18:22:31 +11:00
Matthew Honnibal
6646bfc5df
* Add LOWER attr
2015-01-21 18:19:08 +11:00
Matthew Honnibal
f149259bf5
* Fix negative indices in tokens
2015-01-20 01:16:29 +11:00
Matthew Honnibal
b65b0c07bf
* Messily hook up vector in tokens
2015-01-19 19:59:55 +11:00
Matthew Honnibal
8ff5b8bd84
* Add attribute for POS scheme
2015-01-17 17:33:16 +11:00
Matthew Honnibal
6c7e44140b
* Work on word vectors, and other stuff
2015-01-17 16:21:17 +11:00
Matthew Honnibal
802867e96a
* Revise interface to Token. Strings now have attribute names like norm1_
2015-01-15 03:51:47 +11:00
Matthew Honnibal
7d3c40de7d
* Tests passing after refactor. API has obvious warts, particularly in Token and Lexeme
2015-01-15 00:33:16 +11:00
Matthew Honnibal
0930892fc1
* Tmp. Working on refactor. Compiles, must hook up lexical feats.
2015-01-14 00:03:48 +11:00
Matthew Honnibal
46da3d74d2
* Tmp. Refactoring, introducing a Lexeme PyObject.
2015-01-12 11:23:44 +11:00
Matthew Honnibal
ce2edd6312
* Tmp commit. Refactoring to create a Python Lexeme class.
2015-01-12 10:26:22 +11:00
Matthew Honnibal
aacaf1a0f0
* Fix parser
2015-01-08 01:19:23 +11:00
Matthew Honnibal
9a21127bf7
* Fix parser, which was importing the wrong model
2015-01-08 00:10:15 +11:00
Matthew Honnibal
6a3e39cdd1
* Add typedefs.pyx
2015-01-06 04:51:40 +11:00
Matthew Honnibal
a58920cc5e
* Import orth.word_shape as a C module
2015-01-06 03:18:22 +11:00
Matthew Honnibal
6b68f7ef75
* Finally get string types right for orth function
2015-01-06 03:17:39 +11:00
Matthew Honnibal
90c143bd85
* Fix orth import
2015-01-05 18:49:19 +11:00
Matthew Honnibal
7689dccd0f
* Remove unused import
2015-01-05 18:48:48 +11:00
Matthew Honnibal
3f1944d688
* Make PyPy work
2015-01-05 17:54:38 +11:00
Matthew Honnibal
a510d9f677
* Another assertion removed
2015-01-05 13:01:40 +11:00
Matthew Honnibal
2856946a66
* Remove assertion that doesn't work on Python 3
2015-01-05 12:51:16 +11:00
Matthew Honnibal
94034f1112
* Fix encoding in lemmatization
2015-01-05 11:54:29 +11:00
Matthew Honnibal
b132b3caa6
* Fix unicode error in lemmatizer
2015-01-05 11:53:54 +11:00
Matthew Honnibal
477e7fbffe
* Fix data reading for lemmatizer
2015-01-05 06:01:32 +11:00
Matthew Honnibal
58f75abaca
* Fix unicode error in orth
2015-01-05 05:53:08 +11:00
Matthew Honnibal
4e085d5166
* Fix lemmatizer for Python3
2015-01-05 05:51:26 +11:00
Matthew Honnibal
ae7c811fd1
* Use Exception instead of StandardError
2015-01-04 01:22:12 +11:00
Matthew Honnibal
0e4c2ba036
* Fix loading of special morph words
2015-01-03 23:13:00 +11:00
Matthew Honnibal
f5d41028b5
* Move around data files for test release
2015-01-03 01:59:22 +11:00
Matthew Honnibal
a24321b63a
* Add downloader
2015-01-02 21:44:41 +11:00
Matthew Honnibal
5d9a096e2f
* Some minor clean-up after HastyModel
2014-12-31 19:46:04 +11:00
Matthew Honnibal
aafaf58cbe
* Refactor _ml.Model, and finish implementing HastyModel. So far not worthwhile.
2014-12-31 19:40:59 +11:00
Matthew Honnibal
bcd038e7b6
* Implement HastyModel
2014-12-31 01:16:47 +11:00
Matthew Honnibal
1a075f77ff
* Don't over-ride pre-loaded POS tags, if set by special-cases
2014-12-30 23:26:32 +11:00
Matthew Honnibal
785c7ba76a
* Embed signature on attrs
2014-12-30 23:25:31 +11:00
Matthew Honnibal
30e5805656
* Lazy-load tagger and parser
2014-12-30 23:25:09 +11:00
Matthew Honnibal
9976aa976e
* Messily fix morphology and POS tags on special tokens.
2014-12-30 23:24:37 +11:00
Matthew Honnibal
c1ef3febee
* Embedsignature in tokens.pyx
2014-12-30 21:22:00 +11:00
Matthew Honnibal
aac5028b6e
* Move tagger to _ml
2014-12-30 21:21:38 +11:00
Matthew Honnibal
1ffb0229ed
* Import tokens in parser.pxd
2014-12-30 21:21:17 +11:00