Wolfgang Seeker
d99a9cbce9
different handling of space tokens
space tokens are now always attached to the previous non-space token
there are two exceptions:
leading space tokens are attached to the first following non-space token
in input that consists exclusively of space tokens, the last space token
is the head of all others.
2016-04-13 15:28:28 +02:00
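The attachment rule described in this commit can be sketched in plain Python. This is a minimal illustration, not the parser's actual code: the function name `space_token_heads` and the boolean-list input are hypothetical, and making the last space token its own head in the all-space case is an assumption (the message only says it heads the others).

```python
from typing import List, Optional


def space_token_heads(is_space: List[bool]) -> List[Optional[int]]:
    """Return the head index for every space token (None for non-space tokens).

    Hypothetical helper that mirrors the rule in the commit message above.
    """
    n = len(is_space)
    heads: List[Optional[int]] = [None] * n
    non_space = [i for i, flag in enumerate(is_space) if not flag]

    if not non_space:
        # Exception 2: only space tokens, so the last one heads all others.
        # Assumption: the last token is marked as its own head (root).
        return [n - 1] * n

    first_non_space = non_space[0]
    previous_non_space: Optional[int] = None
    for i in range(n):
        if not is_space[i]:
            previous_non_space = i
        elif previous_non_space is None:
            # Exception 1: leading space tokens attach to the first
            # following non-space token.
            heads[i] = first_non_space
        else:
            # Default: attach to the previous non-space token.
            heads[i] = previous_non_space
    return heads


# Tokens: [space, word, space, space, word]
print(space_token_heads([True, False, True, True, False]))
```

The two exceptions are handled before and inside the loop respectively, so the default case stays a single assignment.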
Wolfgang Seeker
80bea62842
bugfix in unit test
2016-04-08 16:46:44 +02:00
Wolfgang Seeker
5e2e8e951a
add baseclass DocIterator for iterators over documents
add classes for English and German noun chunks
the respective iterators are set for the document when created by the parser
as they depend on the annotation scheme of the parsing model
2016-03-16 15:53:35 +01:00
Wolfgang Seeker
690c5acabf
adjust train.py to train both english and german models
2016-03-03 15:21:00 +01:00
Wolfgang Seeker
3448cb40a4
integrated pseudo-projective parsing into parser
- nonproj.pyx holds a class PseudoProjectivity which currently holds
all functionality to implement Nivre & Nilsson 2005's pseudo-projective
parsing using the HEAD decoration scheme
- changed lefts/rights in Token to account for possible non-projective
structures
2016-03-01 10:09:08 +01:00
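For context on why lefts/rights need care with non-projective structures, the sketch below detects non-projective arcs in a head array. It is only an illustration of the projectivity condition, not the PseudoProjectivity class or the HEAD decoration scheme itself. The function names are hypothetical, and the convention that the root token is its own head is an assumption.

```python
from typing import List, Tuple


def is_ancestor(heads: List[int], ancestor: int, token: int) -> bool:
    """True if `ancestor` dominates `token` (or is `token` itself).

    Assumes a well-formed tree in which the root is its own head.
    """
    for _ in range(len(heads)):  # bounded walk guards against cycles
        if token == ancestor:
            return True
        if heads[token] == token:  # reached the root
            return False
        token = heads[token]
    return False


def nonprojective_arcs(heads: List[int]) -> List[Tuple[int, int]]:
    """Return (head, dependent) pairs whose arc is non-projective.

    An arc is non-projective if some token strictly between head and
    dependent is not dominated by the head.
    """
    arcs = []
    for dep, head in enumerate(heads):
        if head == dep:
            continue  # root has no incoming arc
        lo, hi = sorted((head, dep))
        if any(not is_ancestor(heads, head, k) for k in range(lo + 1, hi)):
            arcs.append((head, dep))
    return arcs


# Token 1 attaches to token 3 across the root (token 2).
print(nonprojective_arcs([2, 3, 2, 2]))
```

A pseudo-projective transformation would lift such arcs to a higher head and record the original attachment in the label; that step is not shown here.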
Wolfgang Seeker
56b7210e82
moved nonproj.py to syntax/nonproj.pyx
2016-02-25 15:08:49 +01:00
Matthew Honnibal
1b41f868d2
* Check for errors in parser, and parallelise the left-over batch
2016-02-06 10:06:30 +01:00
Matthew Honnibal
165ca28b80
* Set is_parsed flag in Parser.pipe
2016-02-05 19:51:44 +01:00
Matthew Honnibal
bdd579db0a
* Set is_parsed flag in Parser.pipe
2016-02-05 19:50:11 +01:00
Matthew Honnibal
b04c9aad71
* Fix off-by-one in Parser.pipe
2016-02-05 19:37:50 +01:00
Matthew Honnibal
048dfe35aa
* cimport cython.parallel
2016-02-05 12:20:42 +01:00
Matthew Honnibal
8a13cebdcc
* Update for modified thinc interface
2016-02-05 11:44:39 +01:00
Matthew Honnibal
84b247ef83
* Add a .pipe method, that takes a stream of input, operates on it, and streams the output. Internally, the stream may be buffered, to allow multi-threading.
2016-02-03 02:10:58 +01:00
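The buffered streaming described for .pipe can be sketched as a generator that collects fixed-size batches, processes each batch, and yields the results in order. This is a hedged illustration only: `pipe_batches`, `process_batch` and `batch_size` are hypothetical names, and the real method works on documents in Cython with the batch processed by multiple threads.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def pipe_batches(
    stream: Iterable[T],
    process_batch: Callable[[List[T]], List[U]],
    batch_size: int = 1000,
) -> Iterator[U]:
    """Buffer `stream` into batches, process each batch, stream the results."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(stream)
    while True:
        # Take up to batch_size items; the final batch may be shorter.
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        # A threaded implementation would hand this buffer to its workers.
        yield from process_batch(batch)


# Usage: double every number, three items per batch.
doubled = list(pipe_batches(range(7), lambda batch: [x * 2 for x in batch], batch_size=3))
print(doubled)
```

Taking the batch with `islice` means the left-over items at the end of the stream go through the same code path as full batches, which avoids a separate boundary case.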
Matthew Honnibal
b3802562d6
Merge branch 'rethinc2' of https://github.com/honnibal/spaCy into rethinc2
2016-02-01 08:59:24 +01:00
Matthew Honnibal
4b08a3fafd
* Fix merge conflict
2016-02-01 08:58:18 +01:00
Matthew Honnibal
5188f6d9d8
* Fix parseC function
2016-02-01 08:48:48 +01:00
Matthew Honnibal
bcf8f7ba40
* Add a parse_batch method to Parser, that releases the GIL around a batch of documents.
2016-02-01 08:34:55 +01:00
Matthew Honnibal
490ba65398
* Use openmp in parser
2016-02-01 03:08:42 +01:00
Matthew Honnibal
28e5ad62bc
* Pass a StateC pointer into the transition and validation methods in the parser, so that the GIL can be released over a batch of documents
2016-02-01 03:00:15 +01:00
Matthew Honnibal
a47f00901b
* Pass a StateC pointer into the transition and validation methods in the parser, so that the GIL can be released over a batch of documents
2016-02-01 02:58:14 +01:00
Matthew Honnibal
daaad66448
* Now fully proxied
2016-02-01 02:37:08 +01:00
Matthew Honnibal
7a0e3bb9c1
* Continue proxying. Some problem currently
2016-02-01 02:22:21 +01:00
Matthew Honnibal
9410e74c92
* Switch parser to use nogil functions
2016-01-30 20:27:07 +01:00
Matthew Honnibal
10877a7791
* Update for thinc 5.0, including changing cost from int to weight_t, and updating the tagger and parser
2016-01-30 14:31:36 +01:00
Matthew Honnibal
84c5dfbfc3
* Clean up debugging python list
2016-01-19 20:10:32 +01:00
Matthew Honnibal
65c5bc4988
* Add add_label method, to allow users to register new entity types and dependency labels.
2016-01-19 19:11:02 +01:00
Matthew Honnibal
3dc398b727
* Fix merge conflict in requirements.txt
2016-01-16 16:20:49 +01:00
Matthew Honnibal
c025a0c64b
* Check for KeyboardInterrupt in parser.__call__
2016-01-16 16:18:44 +01:00
Matthew Honnibal
aec130af56
Use util.Package class for io
Previous Sputnik integration caused API change: Vocab, Tagger, etc
were loaded via a from_package classmethod, that required a
sputnik.Package instance. This forced users to first create a
sputnik.Sputnik() instance, in order to acquire a Package via
sp.pool().
Instead I've created a small file-system shim, util.Package, which
allows classes to have a .load() classmethod, that accepts either
util.Package objects, or strings. We can later gut the internals
of this and make it a proxy for Sputnik if we need more functionality
that should live in the Sputnik library.
Sputnik is now only used to download and install the data, in
spacy.en.download
2015-12-29 18:00:48 +01:00
Matthew Honnibal
6f47074214
* Make constructor of ParserModel and TaggerModel the same as AveragedPerceptron, for easy pickling.
2015-11-07 18:25:17 +11:00
Matthew Honnibal
888c05a7fa
* Fix variable naming in StepwiseState, for thinc 4.0
2015-11-07 11:02:44 +11:00
Matthew Honnibal
fc2185bfe3
* Fix variable naming in StepwiseState, for thinc 4.0
2015-11-07 10:48:31 +11:00
Matthew Honnibal
954442a807
* Fix variable naming in StepwiseState, for thinc 4.0
2015-11-07 10:30:45 +11:00
Matthew Honnibal
19136b0e7d
* Add better debug message for illegal move
2015-11-07 05:34:37 +11:00
Matthew Honnibal
3c162dcac3
* Refactor away from the _ml module, to use thinc 4.0. Still some work needs to be done, e.g. to add __reduce__ to the models, more testing, etc.
2015-11-07 03:24:30 +11:00
Matthew Honnibal
b9991fbd20
* Update to use thinc 3.0
2015-11-06 00:25:59 +11:00
Matthew Honnibal
68f479e821
* Rename Doc.data to Doc.c
2015-11-04 00:15:14 +11:00
Matthew Honnibal
20fd36a0f7
* Very scrappy, likely buggy first-cut pickle implementation, to work on Issue #125: allow pickle for Apache Spark. The current implementation sends stuff to temp files, and does almost nothing to ensure all modifiable state is actually preserved. The Language() instance is a deep tree of extension objects, and if pickling during training, some of the C-data state is hard to preserve.
2015-10-13 13:44:41 +11:00
Matthew Honnibal
86c888667f
* Merge in changes from de branch
2015-09-06 19:49:28 +02:00
Matthew Honnibal
5edac11225
* Wrap self.parse in nogil, and break if an invalid move is predicted. The invalid break is a work-around that papers over likely bugs, but we can't easily break in the nogil block, and otherwise we'll get an infinite loop. Need to set this as an error flag.
2015-09-06 04:15:00 +02:00
Matthew Honnibal
a3d5e6c0dd
* Reform constructor and save/load workflow in parser model
2015-08-26 19:19:01 +02:00
Matthew Honnibal
bf38b3b883
* Hack on l/r reversal bug
2015-08-10 05:58:43 +02:00
Matthew Honnibal
6116413b47
* Fix label prediction in StepwiseState
2015-08-10 05:05:31 +02:00
Matthew Honnibal
9de98f5a6f
* Add Parser.stepthrough method, with context manager
2015-08-10 00:08:46 +02:00
Matthew Honnibal
9c090945e0
* Add Parser.predict method, and clean up Parser.get_state
2015-08-09 02:29:58 +02:00
Matthew Honnibal
04fccfb984
* Fix get_state for parser prediction
2015-08-09 02:11:22 +02:00
Matthew Honnibal
55fde0e240
* Fix get_state
2015-08-09 01:45:30 +02:00
Matthew Honnibal
f0f4fa9838
* Fix Parser.get_state
2015-08-09 01:40:13 +02:00
Matthew Honnibal
18331dca89
* Add continue_for argument to parser 'partial' function, which is now renamed to get_state
2015-08-09 01:31:54 +02:00
Matthew Honnibal
9de218b7ba
* Fix Parser.partial function
2015-08-08 23:45:18 +02:00
Matthew Honnibal
3af938365f
* Add function partial to Parser
2015-08-08 23:32:15 +02:00
Matthew Honnibal
823ef4a00b
* Remove profile declarations
2015-07-25 18:13:06 +02:00
Matthew Honnibal
aa28e2e01d
* Release the GIL around parse function
2015-07-24 04:53:27 +02:00
Matthew Honnibal
fb0a641a2d
* Don't release the gil around Parser.parse. Does this indicate thread problems?
2015-07-17 23:07:37 +02:00
Matthew Honnibal
e29daea85f
* Fix bint/int typing problem in TransitionSystem. In C++ bint* means bool*, but in C it means int*. So, type-casting to bint* is unsafe.
2015-07-17 22:37:24 +02:00
Matthew Honnibal
45ae1ce428
* Remove unused declaration in parser
2015-07-16 01:27:11 +02:00
Matthew Honnibal
9a8db9743c
* Remove gil from parser.call
2015-07-14 23:47:33 +02:00
Matthew Honnibal
38ca0c33f5
Merge branch 'neuralnet' into refactor
Mostly refactors parser, to use new thinc3.2 Example class.
Aim is to remove use of shared memory, so that we can parallelize
over documents easily.
Conflicts:
setup.py
spacy/syntax/parser.pxd
spacy/syntax/parser.pyx
spacy/syntax/stateclass.pyx
2015-07-14 14:13:47 +02:00
Matthew Honnibal
6eef0bf9ab
* Break up tokens.pyx into tokens/doc.pyx, tokens/token.pyx, tokens/spans.pyx
2015-07-13 20:20:58 +02:00
Matthew Honnibal
adb868bdad
* Add warning for models not found in parser
2015-07-08 20:04:55 +02:00
Matthew Honnibal
05b28ec9eb
* Add warning for models not found in parser
2015-07-08 20:02:13 +02:00
Matthew Honnibal
ef700401a6
* Add warning for models not found in parser
2015-07-08 20:00:46 +02:00
Matthew Honnibal
6218d8b389
* Add warning for models not found in parser
2015-07-08 19:59:16 +02:00
Matthew Honnibal
f6a6c39ce8
* Add warning for models not found in parser
2015-07-08 19:52:30 +02:00
Matthew Honnibal
bb522496dd
* Rename Tokens to Doc
2015-07-08 18:53:00 +02:00
Matthew Honnibal
ff885e8511
* Add ParserFactory convenience function
2015-07-08 12:35:46 +02:00
Matthew Honnibal
e20106fdff
* Begin reorganizing neuralnet work
2015-06-30 14:26:32 +02:00
Matthew Honnibal
f4986d5d3c
* Use new Example class
2015-06-28 22:36:03 +02:00
Matthew Honnibal
735f1af91f
* Fix neural net stuff
2015-06-28 11:44:58 +02:00
Matthew Honnibal
e7003f1cf3
* Remove hard-coding of vector lengths
2015-06-28 11:37:17 +02:00
Matthew Honnibal
897dd0dd0b
* Merge changes, and adjust Example to use memoryview
2015-06-28 11:36:11 +02:00
Matthew Honnibal
9282a8e72c
* Prepare for new models to be plugged in by using Example class
2015-06-28 11:02:35 +02:00
Matthew Honnibal
75aeccc064
* Rejig parser interface to use new thinc.api.Example class, in prep of theano model. Comment out beam search
2015-06-28 11:02:34 +02:00
Matthew Honnibal
5af500909c
* Remove unused directive from parser.pyx
2015-06-28 06:20:21 +02:00
Matthew Honnibal
ed40a8380e
* Remove hard-coding of vector lengths
2015-06-27 04:18:47 +02:00
Matthew Honnibal
f8bb43475e
* Bridge to Theano working. Very disorganised. Using thinc adb60aba966ed2
2015-06-27 02:39:18 +02:00
Matthew Honnibal
2fe98b8a9a
* Prepare for new models to be plugged in by using Example class
2015-06-26 13:51:39 +02:00
Matthew Honnibal
6896455884
* Rejig parser interface to use new thinc.api.Example class, in prep of theano model. Comment out beam search
2015-06-26 06:25:36 +02:00
Matthew Honnibal
ab110be125
* Remove debugging in parser.pyx
2015-06-16 23:37:25 +02:00
Matthew Honnibal
f66228f253
* Add some more features, esp for labels
2015-06-14 21:18:02 +02:00
Matthew Honnibal
ea8a103007
* Fix import of TransitionSystem in parser.pyx
2015-06-14 19:01:26 +02:00
Matthew Honnibal
75289b4761
* Don't refuse to parse single token sentences, in case some transition system needs them, e.g. single word entity. Instead fix error in _init_state.
2015-06-13 22:55:55 +02:00
Matthew Honnibal
15e177d7a1
* Fixes to unshift/fast-forward strategy. Getting 91.55 greedy on NW dev, gold preproc
2015-06-12 01:50:23 +02:00
Matthew Honnibal
4575e7a60f
* Fix beam search with new StateClass
2015-06-10 06:33:39 +02:00
Matthew Honnibal
04b1cd9b8c
* Greedy parsing working with new StateClass. Beam parsing broken
2015-06-10 04:20:23 +02:00
Matthew Honnibal
6a94b64eca
* Remove State* from parser.pyx entirely, switching over to StateClass. Beam parsing still untested.
2015-06-10 02:03:38 +02:00
Matthew Honnibal
f14a1526aa
* Remove version of fill_context that takes State*
2015-06-10 01:39:07 +02:00
Matthew Honnibal
d68c686ec1
* Move StateClass into interface of transition functions
2015-06-10 01:35:28 +02:00
Matthew Honnibal
4b98b3e9c8
* Cost functions now take StateClass argument, instead of State*.
2015-06-10 00:40:43 +02:00
Matthew Honnibal
e0cf61f591
* Move StateClass into the interface for is_valid
2015-06-09 23:23:28 +02:00
Matthew Honnibal
0895d454fb
* Prepare to switch to using state class, instead of state struct
2015-06-09 21:20:14 +02:00
Matthew Honnibal
c7e3dfc1dc
* Don't automatically push words when stack is empty, as it messes up beam parsing. Add hash method to beam state.
2015-06-08 14:49:04 +02:00
Matthew Honnibal
6e2564239d
* Bug fixes to beam parser. Search still broken on non-gold sentences
2015-06-07 19:12:59 +02:00
Matthew Honnibal
88ac5c6e98
* Send beam_width < 0 to greedy parser
2015-06-05 17:12:06 +02:00
Matthew Honnibal
6bf35cecc3
* Refactor transition system to use classes with staticmethods.
2015-06-05 02:27:17 +02:00
Matthew Honnibal
4433396005
* Improve efficiency of dynamic oracle, making beam training faster
2015-06-04 21:15:14 +02:00
Matthew Honnibal
a513ec500f
* Have oracle functions take a struct instead of a Python object
2015-06-02 20:01:06 +02:00
Matthew Honnibal
d1b55310a1
* Refactor _advance_beam function
2015-06-02 18:38:41 +02:00
Matthew Honnibal
e822df0867
* Fix bugs in new greedy/beam parser
2015-06-02 02:01:33 +02:00
Matthew Honnibal
66dfa95847
* Revise greedy_parse/beam_parse ownership goof
2015-06-02 01:34:19 +02:00